FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).
https://github.com/FMInference/FlexGen
No additional installation steps are necessary. FlexGen is in the requirements.txt file for this project.
FlexGen only works with the OPT model, and it needs to be converted to numpy format before starting the web UI:
python convert-to-flexgen.py models/opt-1.3b/
The output will be saved to models/opt-1.3b-np/.
The basic command is the following:
python server.py --model opt-1.3b  --flexgen
For large models, the RAM usage may be too high and your computer may freeze. If that happens, you can try this:
python server.py --model opt-1.3b  --flexgen --compress-weight
With this second command, I was able to run both OPT-6.7b and OPT-13B with 2GB VRAM, and the speed was good in both cases.
You can also manually set the offload strategy with
python server.py --model opt-1.3b  --flexgen --percent 0 100 100 0 100 0
where the six numbers after --percent are:
the percentage of weight on GPU
the percentage of weight on CPU
the percentage of attention cache on GPU
the percentage of attention cache on CPU
the percentage of activations on GPU
the percentage of activations on CPU
You should typically only change the first two numbers. If their sum is less than 100, the remaining layers will be offloaded to the disk, by default into the text-generation-webui/cache folder.
In my experiments with OPT-30B using a RTX 3090 on Linux, I have obtained these results:
--flexgen --compress-weight --percent 0 100 100 0 100 0: 0.99 seconds per token.--flexgen --compress-weight --percent 100 0 100 0 100 0: 0.765 seconds per token.temperature and do_sample.