
Getting supposedly small things done by a monster of an LLM is still an expensive affair in terms of money. Getting the same damn small thing done by a quantized LLM is turning out to be expensive in terms of time.
Prompt Engineering, they say, is the new language. The reality is that LLMs still haven't matured enough to select the right tools with simple one-shot or few-shot prompting.
I didn't struggle as much teaching my pet dog to pick the right tool as I am struggling to teach the relatively small LLM running on my laptop to select the right tool from a given set and generate an appropriate answer.
I love and do bet on GenAI, but I am cognizant of the cost-vs-effort trade-off, as with anything else in software engineering; it is just more blatant in the Generative AI ecosystem.
Yes, it is relatively much easier to get better tool-calling out of a 70-billion-parameter LLM, but in my opinion that is a ridiculous waste of $$$ and would quickly become untenable for businesses. FinOps is serious business in the real world. I see big scope for optimization in this area: use the right-sized LLM and the right-sized infrastructure to host it, and get the best bang for the bucks invested in Agentic AI.
In my quest to get this done in my constrained environment, I decided to go beyond Prompt Engineering and take the next step: Model Fine-Tuning. I couldn't even get the training to complete because of persistent OOM (out-of-memory) errors, no matter how much I tinkered with the training configuration. There were a couple of issues beyond the OOM errors, and I want to document those as well in this post so it serves as a good reference point.
Even with aggressive optimizations, my constrained hardware couldn't handle fine-tuning, which highlights that fine-tuning an LLM is far more memory-intensive than inference because of the gradients and optimizer states that training has to keep around.
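Some back-of-the-envelope arithmetic (a sketch built on standard sizing rules of thumb, not measurements from my machine) shows why even a 0.6-billion-parameter model is painful to fine-tune on a 4GB GPU:

```python
# Rough memory needed for full fine-tuning with Adam in mixed precision,
# before counting a single activation. Rules of thumb, not measured numbers.
params = 0.6e9                     # qwen3:0.6b

weights_fp16   = params * 2        # model weights in FP16
gradients_fp16 = params * 2        # one FP16 gradient per parameter
adam_states    = params * 8        # two FP32 moment estimates kept by Adam
master_fp32    = params * 4        # FP32 master copy of weights for mixed precision

total_gb = (weights_fp16 + gradients_fp16 + adam_states + master_fp32) / 1e9
print(f"~{total_gb:.1f} GB before activations")   # ~9.6 GB vs 4 GB on a GTX 1650

# Inference only needs the ~1.2 GB of FP16 weights plus activations, which is
# why the same model serves comfortably under Ollama but refuses to train here.
```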
Hardware Spec
- 16GB RAM
- i7 Processor with 6 cores
- NVIDIA GeForce GTX 1650 (4GB VRAM)
- 100+GB storage availability
Tech Tools
- Python
- PyTorch for deep learning
- Ollama for serving the qwen3:0.6b LLM
- LangChain for LLM orchestration
Optimizations Exploited
- Batch Size: Set per_device_train_batch_size=1 with gradient_accumulation_steps=4, so gradients accumulate over four tiny batches before each optimizer step (all of these settings are pulled together in the sketch after this list).
- Quantization: Used 4-bit quantization (bitsandbytes, load_in_4bit=True). I started at 16-bit, then moved to 8-bit, before downgrading further to 4-bit, which is a little too crazy. Typically you wouldn't need to quantize this aggressively, but my hardware left me no choice.
- CPU Offloading: Enabled via accelerate config (cpu_offload=True). This moves layers that aren't currently in use to RAM, reducing load on the GPU.
- Double Quantization: further reduces memory usage; configured by setting bnb_4bit_use_double_quant=True.
- Mixed Precision: Used FP16 to balance computation speed and memory efficiency; it reduces the memory footprint and speeds up training too. (Apparently FP8 is not yet stable, so I didn't try it.)
- Gradient Checkpointing: enabled via model.gradient_checkpointing_enable() to reduce memory by recomputing activations instead of storing them.
- Tweaking PyTorch memory allocation: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce OOMs caused by memory fragmentation.
- Reduced Target Layers for Training: Training for tool-calling benefits from targeting more modules/layers. Setting target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"] improves tool-calling accuracy by fine-tuning more attention components, but it also increases memory slightly. Because of my hardware constraints, I reduced this to a single target layer: target_modules = ["q_proj"].
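To make these settings concrete, here is a minimal sketch of how they fit together, assuming the Hugging Face transformers, peft, bitsandbytes, and datasets stack. The Qwen/Qwen3-0.6B checkpoint (the Hub counterpart of the qwen3:0.6b model I serve through Ollama), the toy dataset, and hyperparameters such as r, lora_alpha, and num_train_epochs are illustrative assumptions, not my exact configuration:

```python
import os
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Expandable segments reduce OOMs caused by fragmentation; set before any CUDA work.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model_id = "Qwen/Qwen3-0.6B"  # assumed Hub id; swap in whatever checkpoint you fine-tune

# 4-bit quantization with double quantization and FP16 compute (QLoRA-style loading).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # lets accelerate spill layers to CPU RAM when the GPU fills up
)

# Enables gradient checkpointing (recompute activations instead of storing them)
# and other fixes needed to train on top of a k-bit quantized base model.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# LoRA restricted to a single attention projection to fit within 4 GB of VRAM;
# widen target_modules to ["q_proj", "k_proj", "v_proj", "o_proj"] if memory allows.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Placeholder corpus: in reality this is the tool-calling dataset discussed below.
texts = ["User: What's the weather in Chennai?\n"
         "Assistant: <tool_call>get_weather(city='Chennai')</tool_call>"]

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=256)
    out["labels"] = [ids.copy() for ids in out["input_ids"]]
    return out

train_dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"])

# Tiny per-device batch plus gradient accumulation gives an effective batch of 4 in FP16.
training_args = TrainingArguments(
    output_dir="qwen3-tool-calling-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    fp16=True,
    num_train_epochs=3,
    logging_steps=10,
)

Trainer(model=model, args=training_args, train_dataset=train_dataset).train()
```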
Takeaways
- As a baseline, start with 16-bit (FP16/BF16) precision instead of 32-bit. It should reduce the memory footprint by roughly 50%, which is why I chose it as the baseline for my local fine-tuning. Once training succeeds, check the LLM's tool-calling quality. If you hit resource constraints, downgrade to 8-bit to halve the resource consumption again; if resources are still a constraint, downgrade to 4-bit precision.
- When facing an OOM issue, configure PyTorch to use expandable segments, which prevents OOMs caused by memory fragmentation.
- Select the training batch size based on your hardware spec. In my case I set it to just 1.
- Gradient Checkpointing (is this a misnomer??) can be enabled during model training so that intermediate activations are not stored/checkpointed but recomputed on demand. Think of activations as the outputs of a function in a DL layer.
- Install the accelerate Python library and configure it to offload unused layers to the CPU, freeing your GPU of clutter.
- If LoRA (Low-Rank Adaptation) fails you, try QLoRA (Quantized LoRA).
- If all of these fail, you perhaps have to consider an LLM with fewer parameters than your current one.
- LLMs with a similar parameter count don't necessarily have comparable performance. In my tool-calling experiments, for instance, I found qwen3:0.6b way better than llama3.2:1b, and this sure surprised me. This calls for validating your assumptions against ground reality.
- Efficient tool-calling just doesn't happen magically out of the box, even with the mega-sized LLMs out there. Prompt Engineering plays a vital role in conditioning your LLM to act right (see the harness sketch at the end of this post).
- Over 80% of the time is spent on training data preparation. It is boring, mundane work, but it is important to get right - stay motivated in this phase. Preparing quality data at scale is a definite challenge in the real world, in terms of both time and quality. Good old MLOps practices remain a mainstay.
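For concreteness, one tool-calling training record might look roughly like this. The chat-style schema and the get_weather tool are illustrative assumptions, since the exact format depends on your fine-tuning framework and the model's chat template:

```python
# A single, hypothetical tool-calling training example. The schema is an
# assumption for illustration; adapt it to your framework's expected format.
sample = {
    "messages": [
        {"role": "system",
         "content": "You may call get_weather(city: str) to fetch the current weather."},
        {"role": "user",
         "content": "Is it raining in Chennai right now?"},
        {"role": "assistant",
         "content": "",
         "tool_calls": [{"name": "get_weather", "arguments": {"city": "Chennai"}}]},
        {"role": "tool",
         "name": "get_weather",
         "content": "Light rain, 28 degrees Celsius."},
        {"role": "assistant",
         "content": "Yes, it is lightly raining in Chennai right now, around 28 degrees."},
    ]
}
# Hundreds of such examples, covering both tool and no-tool cases, are rendered
# through the model's chat template and tokenized before training.
```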
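And this is roughly the kind of Ollama + LangChain harness I used to exercise and compare tool-calling between models such as qwen3:0.6b and llama3.2:1b. It is a sketch, not my exact code; the get_weather tool is hypothetical, and it assumes the langchain-ollama package plus a locally running Ollama with qwen3:0.6b pulled:

```python
from langchain_ollama import ChatOllama
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    return f"It is sunny in {city}."  # stub: a real tool would call a weather API

llm = ChatOllama(model="qwen3:0.6b", temperature=0)
llm_with_tools = llm.bind_tools([get_weather])

# A system prompt that spells out when to call the tool helps a small model
# pick the right tool far more reliably than a bare user question.
messages = [
    ("system", "You are a helpful assistant. Call get_weather whenever the "
               "user asks about weather; otherwise answer directly."),
    ("human", "What's the weather like in Chennai today?"),
]
response = llm_with_tools.invoke(messages)
print(response.tool_calls)  # expect something like [{'name': 'get_weather', 'args': {'city': 'Chennai'}, ...}]
```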