
Lessons From Failing To Fine-tune A Small LLM On My Laptop


I recently shared my frustration on my LinkedIn feed, which went something like this:
Getting supposedly small things done by a monster of an LLM is still an expensive affair in terms of money. Getting the same small thing done by a quantized LLM is proving to be expensive in terms of time.
Prompt Engineering, they say, is the new language. The reality is that LLMs still haven't matured enough to select the right tools with simple one-shot or few-shot prompting.
I didn't struggle as much teaching my pet dog to pick the right tool as I am struggling to teach the relatively small LLM running on my laptop to select the right tool from a given set and generate appropriate answers.
I love and do bet on GenAI, but I am cognizant of the cost-versus-effort tradeoff, as with anything else in software engineering; it is just more blatant in the Generative AI ecosystem.

Yes, it is relatively much easier to get better tool-calling capability from a 70-billion-parameter LLM, but in my opinion that is a ridiculous waste of money which would quickly become untenable for businesses. FinOps is serious business in the real world. I see a big scope for optimization in this area: choosing the right-sized LLM and the right-sized infrastructure to host it, to get the best bang for the bucks invested in Agentic AI.

In my quest to get this done in my constrained environment, I decided to go beyond Prompt Engineering and take the next step: Model Fine-Tuning. I couldn't even get the training to complete because of persistent OOM (out-of-memory) errors, no matter how much I tinkered with the training configuration. There were a couple of issues beyond the OOM errors, and I want to document those in this post as well, so that it can serve as a good reference point.

Even with aggressive optimizations, my constrained hardware environment couldn’t handle fine-tuning, highlighting that fine-tuning LLMs is far more memory-intensive than inference due to gradients and optimizer states.
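To put rough numbers on that (a back-of-envelope estimate of my own, not measured figures), here is why even a 0.6B-parameter model can overwhelm a 4 GB GPU once gradients and optimizer states are added:

    # Rough memory estimate for naive full fine-tuning of a 0.6B-parameter model
    # with Adam. Illustrative only; it ignores activations, CUDA context and
    # allocator fragmentation, which add even more.
    params = 0.6e9

    weights_fp16 = params * 2      # 2 bytes per parameter
    grads_fp16   = params * 2      # one gradient per parameter
    adam_states  = params * 4 * 2  # two fp32 moment tensors per parameter

    total_bytes = weights_fp16 + grads_fp16 + adam_states
    print(f"~{total_bytes / 1024**3:.1f} GiB before activations")  # roughly 6.7 GiB

Techniques like LoRA and 4-bit quantization cut this down substantially, which is exactly what the optimizations listed further below try to exploit, but the headroom on a 4 GB card is still very thin.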

Hardware Spec

  • 16GB RAM
  • i7 Processor with 6 cores
  • NVIDIA GeForce GTX 1650 (4 GB VRAM)
  • 100+GB storage availability

Tech Tools

  • Python
  • PyTorch for deep learning
  • Ollama for serving the qwen3:0.6b LLM
  • LangChain for LLM orchestration (a minimal tool-calling sketch follows below)
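For context, the inference-side setup I was testing tool selection against looked roughly like the sketch below. This is a minimal illustration, assuming the langchain-ollama package is installed, a local Ollama daemon is running with qwen3:0.6b pulled, and get_weather is just a placeholder tool:

    # Minimal tool-calling sketch: qwen3:0.6b served by a local Ollama instance,
    # orchestrated through LangChain.
    from langchain_core.tools import tool
    from langchain_ollama import ChatOllama

    @tool
    def get_weather(city: str) -> str:
        """Return the current weather for a city (placeholder implementation)."""
        return f"It is sunny in {city}."

    llm = ChatOllama(model="qwen3:0.6b", temperature=0)
    llm_with_tools = llm.bind_tools([get_weather])

    response = llm_with_tools.invoke("Will I need an umbrella in Chennai today?")
    print(response.tool_calls)  # small models often pick the wrong tool, or none at all

With a small model, this is exactly where the frustration starts: the tool_calls list frequently comes back empty or with the wrong arguments, which is what pushed me towards fine-tuning.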

Optimizations Exploited

  • Batch Size: Set per_device_train_batch_size=1 with gradient_accumulation_steps=4.
  • Quantization: Used 4-bit quantization (bitsandbytes, load_in_4bit=True). I started at 16-bit, moved to 8-bit, and then downgraded further to 4-bit, which is a little crazy. Typically you wouldn't need to quantize this aggressively, but my memory budget left little choice.
  • CPU Offloading: Enabled via the accelerate config (cpu_offload=True). This moves layers that aren't currently in use to system RAM, reducing the load on the GPU.
  • Double Quantization: Helps further reduce memory usage; configured by setting bnb_4bit_use_double_quant=True.
  • Mixed Precision: To balance computation speed and memory efficiency. This can reduce the memory footprint and speed up training. Used FP16 (FP8 is apparently not stable, so I didn't try it).
  • Tweaking PyTorch memory allocation: By setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
  • Reduced Target Layers for Training: Training for tool-calling needs more modules/layers to be targeted. Setting target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"] improves tool-calling accuracy by fine-tuning more attention components, but it also increases memory usage slightly. Because of my hardware constraints, I reduced this to a single target layer: target_modules = ["q_proj"]. The combined configuration is sketched after this list.
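Putting these knobs together, the training setup looked roughly like the sketch below. It is reconstructed from memory and kept deliberately minimal: the checkpoint name, LoRA rank and hyperparameters are illustrative, and it assumes the transformers, peft and bitsandbytes packages (the Trainer call and dataset are omitted):

    # Set in the shell before launching:
    #   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # 4-bit quantization with double quantization (bitsandbytes)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3-0.6B",              # illustrative checkpoint name
        quantization_config=bnb_config,
        device_map="auto",              # lets accelerate offload layers to CPU RAM
    )
    model = prepare_model_for_kbit_training(model)

    # LoRA restricted to a single attention projection to save memory;
    # widening target_modules helps tool-calling but costs more VRAM.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj"],      # was ["q_proj", "k_proj", "v_proj", "o_proj"]
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    # Tiny per-device batch with gradient accumulation, FP16 mixed precision
    training_args = TrainingArguments(
        output_dir="qwen3-toolcall-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        fp16=True,
        num_train_epochs=1,
        logging_steps=10,
    )

Even with all of this in place, the run still died with OOM errors on my 4 GB card.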
Tweaking each of these options is a lesson in itself about LLM fine-tuning. Beyond this, my lessons underscore the importance of two key things:
  • Prompt Engineering plays a vital role.
  • Training data preparation is where over 80% of my time went. It is boring and mundane work, but it is important to get it right. Preparing this data at scale is a definitive challenge in the real world in terms of time. Data pipelining is one part; sanitizing the data for consistency and issues is another. A sample record is sketched below.
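For a sense of what that preparation involves, a single tool-calling training record looked roughly like this. The schema is my own illustration rather than a standard format, and the tool and arguments are placeholders:

    # One illustrative tool-calling training example (hypothetical schema):
    # the model must learn to emit the right tool name and JSON arguments.
    sample = {
        "messages": [
            {"role": "system", "content": "You can call tools. Available: get_weather(city)."},
            {"role": "user", "content": "Will I need an umbrella in Chennai today?"},
            {
                "role": "assistant",
                "tool_calls": [
                    {"name": "get_weather", "arguments": {"city": "Chennai"}}
                ],
            },
        ]
    }
    # Hundreds of such records need consistent tool names, argument types and
    # phrasing variations; that consistency work is where most of the time goes.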
Full CPU training (using the 16 GB of RAM) is apparently viable but slow. I didn't try it on my laptop, because the last time I burned out a laptop's motherboard it was by running deep learning models for a hackathon. Once bitten, twice shy, as they say, and it is true in my case. On a serious note, CPU-based training is terribly slow and heats up your machine like hell. I wouldn't recommend going down this route.

Fine-tuning a 0.6B LLM for tool-calling on a 4 GB GPU proved impossible with my setup, despite extensive optimizations. The experience underscores the high memory demands of fine-tuning, even for small models. Either you need a better hardware spec for your laptop (my personal choice), or you explore the cloud and its various constrained-spec options. I personally believe in frugality and always focus on areas to optimize for $$$.