
What I learned trying to run Deep Learning models locally

Taming the beast of large models is no easy feat

Okay, this is not the first time that I'm doing this. The last time I did it was about 5-6 years ago on my older Dell laptop without a GPU (a time when RTX GPUs were new in the market, if my memory serves me right). I was running DL models locally as part of my participation in an online hackathon. Guess what happened? The DL models took a long time to train and run, heating up the laptop for long stretches (older laptop workstations weren't tuned for quick heat dissipation). The prolonged CPU heat burned up the motherboard, leaving my laptop fit only for scrap collection.

This time around, it all started with the announcement of InfiniteYou (InfU) by ByteDance. This new image generation model helps in the accurate generation of infinite versions of an individual (you) in different settings using text prompts. The model leverages Diffusion Transformers (DiTs) to generate images that not only maintain the identity of a person from a source photograph but also allow for flexible text-based editing. It became a rage with its announcement. I wanted to try it out locally, to see what I could learn from the experience.

For the specifics, I tried to run this model locally under WSL 2 on my Windows 11 Pro laptop with a legacy GPU (GTX 1650). Your mileage may vary according to your hardware rig.

What eats up your disk space when you use Deep Learning models locally?
  • Starting with basic hygiene: your Ubuntu `/tmp` directory might be bloated. The easiest way to clear it is to restart your OS.
  • If you are using project-specific virtual environments, you may want to clear the unused ones; they are easy opportunities for cleaning up your local workstation. I use venv to isolate my Python project's dependencies in a sub-directory of the project.
  • If you have used `pip install`, you may want to take another look at its cache (find it with the `pip cache dir` command) and clear it; the Hugging Face libraries, diffusers and transformers, take a lot of space. The downloaded model weights are also duplicated: once in the Hugging Face hub cache (by default `~/.cache/huggingface/hub`) and again in the project's sub-directory called `models/`. Deleting the duplicated models at least will restore some disk space (see the cache-scanning sketch after this list).
  • If your WSL instance has limited RAM (less than 16 GB; ideally it is better to have at least 32 GB), you might see a lot of model offloading exceed that RAM, so the swap file grows unbounded and drives up disk usage (see the memory-monitoring sketch after this list).
  • Pipelines like `FluxPipeline` can write a lot of intermediate state (latents, checkpoints, etc.) to disk. You may want to look into the code to estimate how much disk space it can occupy before you fire up your project. Ideally, such DL projects should document this in their README.md file to aid their consumers in infrastructure planning.
  • `enable_model_cpu_offload()` limits GPU use to the component that is actively running (e.g., 1-2 GB of your 4 GB VRAM), reducing utilization (e.g., 50-70% vs. 100%). Without it, the full model (~12-16 GB at runtime) crashed my poor GTX 1650 (see the offloading sketch after this list).
  • Instead of using a full-blown foundational model, check if there is a distilled version of it, such as `FLUX.1-schnell`, the distilled version of `FLUX.1-dev`. For your information, mine blew up even with this miniature version. That gives me a sense of how big these monsters are and how hard it is to tame them.
  • `FluxPipeline` in diffusers (used by InfiniteYou) combines components like the transformer (FLUX), VAE, and text encoders (e.g., CLIP, T5). If `app.py` creates separate pipeline objects, or reloads them, for different tasks (e.g., base generation, InfuseNet injection, LoRA enhancements), each might temporarily cache weights or intermediates on disk.
  • Diffusion models like FLUX generate images iteratively (e.g., 4 steps for schnell). If `app.py` saves intermediate tensors or images to disk (common in debugging or in multi-stage pipelines like InfiniteYou's identity preservation), these will accumulate.
  • The GPU isn't automatically leveraged by DL models. You have to check for GPU device presence in code and move the work to it with something like the snippet below:
    ```
    import torch

    # Prefer the GPU when CUDA is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # move the model's weights onto that device
    ```
  • As a quick way to know whether I can host a DL model on my laptop, I have now learned to check the number of parameters the model uses (for its weights and biases) and multiply it by the size of the float16 or float32 type used to represent each of those parameters. FLUX.1-dev, a transformer-based diffusion model for text-to-image generation, for instance, has 12B parameters, each stored in bfloat16, thus consuming ~24 GB of disk space (12 billion params × 16 bits = 12B × 2 bytes = 24 GB). See the sizing sketch after this list.
  • Oh, I also got to learn about the Hugging Face platform (I'm pretty late to this game): you have to accept a model's T&C on the website (FLUX.1-dev is not meant for commercial use, for instance) and use a generated access token (whose permissions can be set at custom granularity) to access the platform via its Python library `huggingface_hub` (see the login sketch after this list).
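
To see what the Hugging Face cache is actually holding before you start deleting things, a minimal sketch like the one below works. It assumes the `huggingface_hub` package is installed (it comes along with diffusers/transformers) and that the cache is in its default location.

```
from huggingface_hub import scan_cache_dir

# Scan the Hugging Face hub cache (~/.cache/huggingface/hub by default)
cache_info = scan_cache_dir()
print(f"Total cache size on disk: {cache_info.size_on_disk / 1e9:.1f} GB")

# List cached repos, biggest first, so you know what is worth deleting
for repo in sorted(cache_info.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB")
```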
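
To watch whether offloading is spilling past the RAM you gave WSL and into swap, a small monitor like this is enough. It assumes the third-party `psutil` package is installed, which is not part of the stack above.

```
import psutil

# Snapshot RAM and swap inside WSL; run this while the model loads/offloads.
# If swap keeps growing, the offloaded weights don't fit in the RAM you allotted.
ram = psutil.virtual_memory()
swap = psutil.swap_memory()
print(f"RAM  used: {ram.used / 1e9:5.1f} / {ram.total / 1e9:5.1f} GB")
print(f"Swap used: {swap.used / 1e9:5.1f} / {swap.total / 1e9:5.1f} GB")
```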
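
Here is a minimal sketch of what the CPU-offloading call looks like with the distilled `FLUX.1-schnell` model. It assumes diffusers and accelerate are installed and that your CPU RAM can hold the offloaded weights (mine could not, which is the point of this whole post); the prompt is just an example.

```
import torch
from diffusers import FluxPipeline

# Load the distilled schnell variant in bfloat16 to keep the footprint smaller
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)

# Keep only the active sub-module on the GPU; the rest waits in CPU RAM
pipe.enable_model_cpu_offload()

image = pipe(
    "a portrait photo of an astronaut on a beach",
    num_inference_steps=4,  # schnell is tuned for very few steps
    guidance_scale=0.0,     # schnell is guidance-distilled, so no CFG needed
).images[0]
image.save("astronaut.png")
```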
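
The back-of-the-envelope sizing rule from the list above translates to a one-liner once a model is loaded, and to plain arithmetic when it isn't; the `weight_size_gb` helper is my own illustration, and the 12B figure for FLUX.1-dev is its published parameter count.

```
import torch

def weight_size_gb(model: torch.nn.Module) -> float:
    """Approximate size of a model's weights: parameter count x bytes per parameter."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e9

# Without loading anything: FLUX.1-dev has ~12B parameters stored in bfloat16
params = 12e9        # 12 billion parameters
bytes_per_param = 2  # bfloat16 = 16 bits = 2 bytes
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")  # ~24 GB
```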
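
And a minimal sketch of the Hugging Face token flow for a gated model like FLUX.1-dev; the token string is a placeholder, and you must have accepted the model's license on its model page first.

```
import torch
from huggingface_hub import login
from diffusers import FluxPipeline

# Authenticate with the access token generated on huggingface.co (Settings > Access Tokens);
# alternatively, run `huggingface-cli login` once and skip this call.
login(token="hf_...")  # placeholder token

# FLUX.1-dev is a gated repo: the download only works after accepting its license
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
```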