
Getting supposedly small things done by a monster of an LLM is still an expensive affair in terms of money. Getting the same damn small thing done by a quantized LLM is turning out to be expensive in terms of time.
Prompt Engineering, they say, is the new language. The reality is that LLMs still haven't matured enough to select the right tools with simple one-shot or few-shot prompting.
I didn't struggle to teach my pet dog to pick the right tool as much as I am struggling to teach the relatively small LLM running on my laptop to select the right tool from a given set of tools and generate appropriate answers.
I love and continue to bet on GenAI, but I am cognizant of the cost-vs-effort tradeoff, which exists everywhere in software engineering but is more blatant in the Generative AI ecosystem.
Yes, it is much easier to leverage a 70-billion-parameter LLM for better tool-calling capability, but in my opinion that is a ridiculous waste of $$$ and would quickly become untenable for businesses. FinOps is serious business in the real world. I see big scope for optimization here: leveraging the right-sized LLM and the right-sized infrastructure to host it, to get the best bang for the buck invested in Agentic AI.
In my quest to get this done in my constrained environment, I decided to go beyond Prompt Engineering and take the next step: Model Fine-Tuning. I couldn't even get the training to complete because of persistent OOM errors, no matter how much I tinkered with the training configuration. There were a couple of issues beyond the OOM error, and I want to document those as well in this post so it can serve as a good reference point.
Even with aggressive optimizations, my constrained hardware environment couldn’t handle fine-tuning, highlighting that fine-tuning LLMs is far more memory-intensive than inference due to gradients and optimizer states.
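To put the OOM errors in perspective, here is a rough back-of-envelope estimate, assuming full fine-tuning with Adam under fp16 mixed precision (the standard ~16 bytes per parameter rule of thumb); the numbers are approximations, not measurements from my runs:

```python
# Back-of-envelope memory estimate for FULL fine-tuning of a ~0.6B-parameter model
# with Adam in fp16 mixed precision. Approximations only, not measured values.
params = 0.6e9

weights_fp16 = params * 2      # fp16 model weights (~1.2 GB; inference needs little more than this)
grads_fp16   = params * 2      # fp16 gradients
master_fp32  = params * 4      # fp32 master weights kept by mixed-precision training
adam_states  = params * 4 * 2  # two fp32 moment tensors per parameter (Adam)

total_bytes = weights_fp16 + grads_fp16 + master_fp32 + adam_states
print(f"~{total_bytes / 1e9:.1f} GB before counting activations")  # ≈ 9.6 GB

# A GTX 1650 has 4 GB of VRAM, so full fine-tuning is out of reach even before
# activations -- hence the quantization, LoRA and offloading tricks below.
```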
Hardware Spec
- 16GB RAM
- i7 Processor with 6 cores
- NVIDIA GeForce GTX 1650 (4 GB VRAM)
- 100+GB storage availability
Tech Tools
- Python
- PyTorch for deep learning
- Ollama for serving the qwen3:0.6b LLM
- LangChain for LLM orchestration (a minimal tool-binding sketch follows below)
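For context, this is roughly the kind of tool-calling setup I mean. It's a minimal sketch assuming the langchain-ollama integration and a toy get_weather tool; neither is my exact project code:

```python
# Minimal tool-calling setup: a local qwen3:0.6b served by Ollama, orchestrated
# through LangChain. get_weather is a toy tool used purely for illustration.
from langchain_core.tools import tool
from langchain_ollama import ChatOllama

@tool
def get_weather(city: str) -> str:
    """Return a canned weather report for the given city."""
    return f"It is sunny in {city}."

llm = ChatOllama(model="qwen3:0.6b", temperature=0)
llm_with_tools = llm.bind_tools([get_weather])

response = llm_with_tools.invoke("What's the weather in Chennai?")
print(response.tool_calls)  # With a small model this is often empty or wrong --
                            # which is exactly the problem this post is about.
```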
Optimizations Exploited
- Batch Size: set per_device_train_batch_size=1 with gradient_accumulation_steps=4, so only one sample sits in GPU memory at a time while the effective batch size stays at 4 (all of these settings come together in the consolidated sketch after this list).
- Quantization: used 4-bit quantization (bitsandbytes, load_in_4bit=True). I started at 16-bit, moved to 8-bit, and then downgraded further to 4-bit, which is a little too crazy. Typically I wouldn't quantize this aggressively, since lower precision can cost model quality, but memory constraints forced it.
- CPU Offloading: enabled via the accelerate config (cpu_offload=True). This moves layers that aren't currently in use to system RAM, reducing the load on the GPU.
- Double Quantization: helps further reduce memory usage; configured by setting bnb_4bit_use_double_quant=True.
- Mixed Precision: used FP16 to balance computation speed and memory efficiency; it reduces the memory footprint and speeds up training. (Apparently FP8 is not yet stable, so I didn't try it.)
- Tweaking PyTorch memory allocation: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which reduces fragmentation in the CUDA caching allocator.
- Reduced Target Modules for Training: training for tool-calling benefits from targeting more modules/layers. Setting target_modules = [q_proj, v_proj, k_proj, o_proj] improves tool-calling accuracy by fine-tuning more attention components, but it also increases memory usage. Because of my hardware constraints, I cut this down to a single target module with target_modules = [q_proj].
- Prompt Engineering still plays a vital role, even alongside fine-tuning.
- Training data preparation is where over 80% of my time went. It is boring, mundane work, but it is important to get right. Preparing data at scale is a definitive real-world challenge in terms of time: the data pipeline is one part, but sanitizing the data for consistency and quality issues is another (a sample tool-calling record is sketched below, after the training-configuration sketch).
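Pulling the optimizations above together, this is roughly what my training configuration looked like. Treat it as a sketch under stated assumptions, a QLoRA-style setup with transformers, peft and bitsandbytes, where the model ID, LoRA hyperparameters and epoch count are illustrative placeholders rather than my exact script:

```python
# Consolidated sketch of the memory optimizations listed above (QLoRA-style:
# 4-bit quantization + double quant + LoRA on q_proj + fp16 + tiny batches).
import os
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Reduce fragmentation-related OOMs in the CUDA caching allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

MODEL_ID = "Qwen/Qwen3-0.6B"  # placeholder: the HF counterpart of Ollama's qwen3:0.6b

# 4-bit quantization with double quantization enabled.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate spill layers to CPU RAM, standing in for the offload above
)
model = prepare_model_for_kbit_training(model)

# LoRA restricted to a single attention projection to fit in 4 GB of VRAM;
# q/k/v/o_proj together would be preferable if memory allowed.
lora_config = LoraConfig(
    r=8,                          # illustrative rank
    lora_alpha=16,
    target_modules=["q_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Tiny per-device batch compensated with gradient accumulation; fp16 mixed precision.
training_args = TrainingArguments(
    output_dir="qwen3-tool-calling-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    fp16=True,
    logging_steps=10,
)
# ... then build a Dataset of tool-calling examples and hand model + training_args to a Trainer.
```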
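And since most of the effort went into data preparation, here is the rough shape of a single tool-calling training record I worked towards. The schema (role names, tool_calls layout, tool spec) is an assumption for illustration and has to be adapted to the chat template of the model being fine-tuned:

```python
# Illustrative shape of one tool-calling training example. The exact schema is an
# assumption; the real format must match the target model's chat template.
example = {
    "messages": [
        {"role": "system", "content": "You can call tools. Pick the right one."},
        {"role": "user", "content": "What's the weather in Chennai?"},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [
                {"name": "get_weather", "arguments": {"city": "Chennai"}}
            ],
        },
    ],
    "tools": [
        {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
}
# Hundreds of such records, sanitized for consistent tool names and argument
# schemas, are where the "over 80% of my time" above actually went.
```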