You have now entered a hidden corner of the internet.
A confusing yet intriguing realm of paradoxes and contradictions.
A place where you will find out that what you thought you knew, you in fact didn’t know, and what you didn’t know was in front of you all along.
In other words, here I will document little-known facts about this web UI that I could not find another place for in the wiki.
#### You can train LoRAs in CPU mode

Load the web UI with

```
python server.py --cpu
```

and start training the LoRA from the training tab as usual.
#### 8-bit mode works with CPU offloading

```
python server.py --load-in-8bit --gpu-memory 4000MiB
```
#### `--pre_layer`, and not `--gpu-memory`, is the right way to do CPU offloading with 4-bit models

The number passed to `--pre_layer` is the number of layers to allocate to the GPU; the remaining layers are offloaded to the CPU.

```
python server.py --wbits 4 --groupsize 128 --pre_layer 20
```
#### Models can be loaded in 32-bit, 16-bit, 8-bit, and 4-bit modes

```
python server.py --cpu
python server.py
python server.py --load-in-8bit
python server.py --wbits 4
```
#### The web UI works with any version of GPTQ-for-LLaMa

Including the up-to-date triton and cuda branches. But you have to delete the `repositories/GPTQ-for-LLaMa` folder and reinstall the new one every time:

```
cd text-generation-webui/repositories
rm -r GPTQ-for-LLaMa
pip uninstall quant-cuda
git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda  # or any other repository and branch
cd GPTQ-for-LLaMa
python setup_cuda.py install
```
#### Instruction-following templates are represented as chat characters

https://github.com/oobabooga/text-generation-webui/tree/main/characters/instruction-following

#### The right way to run Alpaca, Open Assistant, Vicuna etc. is Instruct mode, not normal chat mode

Otherwise the prompt will not be formatted correctly.
1. Start the web UI with

```
python server.py --chat
```

2. Click on the "instruct" option under "Chat modes".

3. Select the correct template in the hidden dropdown menu that will become visible.
#### Notebook mode is best mode

Ascended individuals have realized that notebook mode is a superset of chat mode and can do chats with ultimate flexibility, including group chats, editing replies, starting a new bot reply in a given way, and impersonating.
#### RWKV is an RNN

Most models are transformers, but not RWKV, which is an RNN. It's a great model.
#### `--gpu-memory` is not a hard limit on the GPU memory

It is simply a parameter that is passed to the `accelerate` library while loading the model. More memory will be allocated during generation. That's why this parameter has to be set to less than your total GPU memory.
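As a rough sketch of what happens under the hood: a value like `--gpu-memory 4000MiB` is turned into the `max_memory` dictionary that `transformers`/`accelerate` accept alongside `device_map="auto"`. The helper name and the CPU default below are illustrative, not the web UI's actual code.

```python
# Hypothetical helper: convert a per-GPU cap (e.g. "4000MiB") into the
# max_memory dict format that accelerate uses to place model weights.
def build_max_memory(gpu_memory: str, cpu_memory: str = "99GiB") -> dict:
    """Device 0 gets the stated cap; overflow weights go to the CPU."""
    return {0: gpu_memory, "cpu": cpu_memory}

# The dict would then be passed roughly like this (not executed here):
# model = AutoModelForCausalLM.from_pretrained(
#     "some-model", device_map="auto", max_memory=build_max_memory("4000MiB")
# )
```

Because this cap only constrains weight placement at load time, activations allocated during generation come on top of it, which is why the flag must be set below your real VRAM.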
#### Contrastive search is perhaps the best preset

But it uses a ton of VRAM.
#### You can check the sha256sum of downloaded models with the download script

```
python download-model.py facebook/galactica-125m --check
```
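The `--check` idea boils down to hashing each downloaded file and comparing it against a known digest. A minimal standard-library sketch of that step (the function is illustrative, not the script's actual code):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large model shards never need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The resulting hex digest is then compared against the expected checksum published for the model file.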
#### The download script continues interrupted downloads by default

It doesn't start over.
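Resuming an interrupted download is conventionally done with an HTTP Range request: check how many bytes are already on disk and ask the server only for the rest. A hedged sketch of that logic (not the script's actual implementation):

```python
import os

def resume_headers(path: str) -> dict:
    """Build HTTP headers that ask the server to continue a partial download."""
    existing = os.path.getsize(path) if os.path.exists(path) else 0
    if existing == 0:
        return {}  # nothing on disk yet: request the whole file
    return {"Range": f"bytes={existing}-"}  # continue from the first missing byte
```

The headers would be passed to a streaming HTTP GET, with the local file opened in append (`"ab"`) mode so existing bytes are kept.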
#### You can download models with multiple threads

```
python download-model.py facebook/galactica-125m --threads 8
```
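Parallel downloading amounts to mapping a download function over the file list with a thread pool; since the work is I/O-bound, threads are enough. A generic sketch with a stand-in fetch function (the real script's internals may differ):

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, threads: int = 8):
    """Run `fetch` (whatever performs one download) over all URLs concurrently."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(fetch, urls))
```

In the real script each worker would stream one model shard to disk; here `fetch` is left abstract so the concurrency pattern stands on its own.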
#### LoRAs work in 4-bit mode

You need to follow these instructions and then start the web UI with the `--monkey-patch` flag.