
Getting supposedly small things done by a monster of an LLM is still an expensive affair in terms of money. Getting the same damn small thing done by a quantized LLM is turning out to be expensive in terms of time.
Prompt Engineering, they say, is the new language. The reality is that LLMs still haven't matured enough to select the right tools with simple one-shot or few-shot prompting.
I didn't struggle as much teaching my pet dog to pick the right tool as I am struggling to teach the relatively small LLM running on my laptop to select the right tool from a given set and generate an appropriate answer.
I love and do bet on GenAI, but I am cognizant of the cost-vs-effort trade-off, as with anything else in software engineering; it is just more blatant in the Generative AI ecosystem.
Yes, it is relatively much easier to get better tool-calling out of a 70-billion-parameter LLM, but in my opinion that is a ridiculous waste of $$$ and would quickly become untenable for businesses. FinOps is serious business in the real world. I see big scope for optimization in this area: use the right-sized LLM and the right-sized infrastructure to host it, and get the best bang for the bucks invested in Agentic AI.
In my quest to get this done in my constrained environment, I decided to go beyond Prompt Engineering and take the next step: Model Fine-Tuning. I couldn't even get the training to complete because of persistent OOM (out-of-memory) errors, no matter how much I tinkered with the training configuration. There were a couple of issues beyond the OOM errors, and I want to document those as well in this post so it serves as a good reference point.
Even with aggressive optimizations, my constrained hardware couldn't handle fine-tuning, which highlights that fine-tuning an LLM is far more memory-intensive than inference because of the gradients and optimizer states that training has to keep around.
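Some back-of-the-envelope arithmetic (a sketch built on standard sizing rules of thumb, not measurements from my machine) shows why even a 0.6-billion-parameter model is painful to fine-tune on a 4GB GPU:

```python
# Rough memory needed for full fine-tuning with Adam in mixed precision,
# before counting a single activation. Rules of thumb, not measured numbers.
params = 0.6e9                     # qwen3:0.6b

weights_fp16   = params * 2        # model weights in FP16
gradients_fp16 = params * 2        # one FP16 gradient per parameter
adam_states    = params * 8        # two FP32 moment estimates kept by Adam
master_fp32    = params * 4        # FP32 master copy of weights for mixed precision

total_gb = (weights_fp16 + gradients_fp16 + adam_states + master_fp32) / 1e9
print(f"~{total_gb:.1f} GB before activations")   # ~9.6 GB vs 4 GB on a GTX 1650

# Inference only needs the ~1.2 GB of FP16 weights plus activations, which is
# why the same model serves comfortably under Ollama but refuses to train here.
```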
Hardware Spec
- 16GB RAM
- i7 Processor with 6 cores
- NVIDIA GeForce GTX 1650 (4GB VRAM)
- 100+GB storage availability
Tech Tools
- Python
- PyTorch for deep learning
- Ollama for serving the qwen3:0.6b LLM
- LangChain for LLM orchestration
Optimizations Exploited
- Batch Size: Set per_device_train_batch_size=1 with gradient_accumulation_steps=4, so gradients accumulate over four tiny batches before each optimizer step (all of these settings are pulled together in the sketch after this list).
- Quantization: Used 4-bit quantization (bitsandbytes, load_in_4bit=True). I started at 16-bit, then moved to 8-bit, before downgrading further to 4-bit, which is a little too crazy. Typically you wouldn't need to quantize this aggressively, but my hardware left me no choice.
- CPU Offloading: Enabled via accelerate config (cpu_offload=True). This moves layers that aren't currently in use to RAM, reducing load on the GPU.
- Double Quantization: further reduces memory usage; configured by setting bnb_4bit_use_double_quant=True.
- Mixed Precision: Used FP16 to balance computation speed and memory efficiency; it reduces the memory footprint and speeds up training too. (Apparently FP8 is not yet stable, so I didn't try it.)
- Gradient Checkpointing: enabled via model.gradient_checkpointing_enable() to reduce memory by recomputing activations instead of storing them.
- Tweaking PyTorch memory allocation: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce OOMs caused by memory fragmentation.
- Reduced Target Layers for Training: Training for tool-calling benefits from targeting more modules/layers. Setting target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"] improves tool-calling accuracy by fine-tuning more attention components, but it also increases memory slightly. Because of my hardware constraints, I reduced this to a single target layer: target_modules = ["q_proj"].
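To make these settings concrete, here is a minimal sketch of how they fit together, assuming the Hugging Face transformers, peft, bitsandbytes, and datasets stack. The Qwen/Qwen3-0.6B checkpoint (the Hub counterpart of the qwen3:0.6b model I serve through Ollama), the toy dataset, and hyperparameters such as r, lora_alpha, and num_train_epochs are illustrative assumptions, not my exact configuration:

```python
import os
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Expandable segments reduce OOMs caused by fragmentation; set before any CUDA work.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model_id = "Qwen/Qwen3-0.6B"  # assumed Hub id; swap in whatever checkpoint you fine-tune

# 4-bit quantization with double quantization and FP16 compute (QLoRA-style loading).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # lets accelerate spill layers to CPU RAM when the GPU fills up
)

# Enables gradient checkpointing (recompute activations instead of storing them)
# and other fixes needed to train on top of a k-bit quantized base model.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# LoRA restricted to a single attention projection to fit within 4 GB of VRAM;
# widen target_modules to ["q_proj", "k_proj", "v_proj", "o_proj"] if memory allows.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Placeholder corpus: in reality this is the tool-calling dataset discussed below.
texts = ["User: What's the weather in Chennai?\n"
         "Assistant: <tool_call>get_weather(city='Chennai')</tool_call>"]

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=256)
    out["labels"] = [ids.copy() for ids in out["input_ids"]]
    return out

train_dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"])

# Tiny per-device batch plus gradient accumulation gives an effective batch of 4 in FP16.
training_args = TrainingArguments(
    output_dir="qwen3-tool-calling-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    fp16=True,
    num_train_epochs=3,
    logging_steps=10,
)

Trainer(model=model, args=training_args, train_dataset=train_dataset).train()
```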
Takeaways
- As a baseline, start with 16-bit (FP16/BF16) precision instead of 32-bit. It should reduce the memory footprint by roughly 50%, which is why I chose it as the baseline for my local fine-tuning. Once training succeeds, check the LLM's tool-calling quality. If you hit resource constraints, downgrade to 8-bit to halve the resource consumption again; if resources are still a constraint, downgrade to 4-bit precision.
- When facing an OOM issue, configure PyTorch to use expandable segments, which prevents OOMs caused by memory fragmentation.
- Select the training batch size based on your hardware spec. In my case I set it to just 1.
- Gradient Checkpointing (is this a misnomer??) can be enabled during model training so that intermediate activations are not stored/checkpointed but recomputed on demand. Think of activations as the outputs of a function in a DL layer.
- Install the accelerate Python library and configure it to offload unused layers to the CPU, freeing your GPU of clutter.
- If LoRA (Low-Rank Adaptation) fails you, try QLoRA (Quantized LoRA).
- If all of these fail, you perhaps have to consider an LLM with fewer parameters than your current one.
- LLMs with a similar parameter count don't necessarily have comparable performance. In my tool-calling experiments, for instance, I found qwen3:0.6b way better than llama3.2:1b, and this sure surprised me. This calls for validating your assumptions against ground reality.
- Efficient tool-calling just doesn't happen magically out of the box, even with the mega-sized LLMs out there. Prompt Engineering plays a vital role in conditioning your LLM to act right (see the harness sketch at the end of this post).
- Over 80% of the time is spent on training data preparation. It is boring, mundane work, but it is important to get right - stay motivated in this phase. Preparing quality data at scale is a definite challenge in the real world, in terms of both time and quality. Good old MLOps practices remain a mainstay.
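For concreteness, one tool-calling training record might look roughly like this. The chat-style schema and the get_weather tool are illustrative assumptions, since the exact format depends on your fine-tuning framework and the model's chat template:

```python
# A single, hypothetical tool-calling training example. The schema is an
# assumption for illustration; adapt it to your framework's expected format.
sample = {
    "messages": [
        {"role": "system",
         "content": "You may call get_weather(city: str) to fetch the current weather."},
        {"role": "user",
         "content": "Is it raining in Chennai right now?"},
        {"role": "assistant",
         "content": "",
         "tool_calls": [{"name": "get_weather", "arguments": {"city": "Chennai"}}]},
        {"role": "tool",
         "name": "get_weather",
         "content": "Light rain, 28 degrees Celsius."},
        {"role": "assistant",
         "content": "Yes, it is lightly raining in Chennai right now, around 28 degrees."},
    ]
}
# Hundreds of such examples, covering both tool and no-tool cases, are rendered
# through the model's chat template and tokenized before training.
```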
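And this is roughly the kind of Ollama + LangChain harness I used to exercise and compare tool-calling between models such as qwen3:0.6b and llama3.2:1b. It is a sketch, not my exact code; the get_weather tool is hypothetical, and it assumes the langchain-ollama package plus a locally running Ollama with qwen3:0.6b pulled:

```python
from langchain_ollama import ChatOllama
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    return f"It is sunny in {city}."  # stub: a real tool would call a weather API

llm = ChatOllama(model="qwen3:0.6b", temperature=0)
llm_with_tools = llm.bind_tools([get_weather])

# A system prompt that spells out when to call the tool helps a small model
# pick the right tool far more reliably than a bare user question.
messages = [
    ("system", "You are a helpful assistant. Call get_weather whenever the "
               "user asks about weather; otherwise answer directly."),
    ("human", "What's the weather like in Chennai today?"),
]
response = llm_with_tools.invoke(messages)
print(response.tool_calls)  # expect something like [{'name': 'get_weather', 'args': {'city': 'Chennai'}, ...}]
```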