Want to fit the largest model you can into the VRAM you have, whether that's a little or a lot? Look no further.
satireplusplus: VRAM is the limiting factor for running these things, though, not tensor cores.
currentscurrents: Right. And even once you have enough VRAM, memory bandwidth limits generation speed more than tensor-core throughput does.
They could pack more tensor cores in there if they wanted to; they just wouldn't be able to feed them data fast enough.
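A rough sanity check on that claim (the function and all numbers below are illustrative assumptions, not measurements): during autoregressive generation every weight has to be read from VRAM once per token, so memory bandwidth alone puts a ceiling on tokens per second. A minimal Python sketch:

    # Upper bound on generation speed: each token reads every weight once,
    # so tokens/sec <= memory bandwidth / model size in bytes.
    def max_tokens_per_sec(n_params: float, bytes_per_param: float,
                           bandwidth_bytes_per_sec: float) -> float:
        model_bytes = n_params * bytes_per_param
        return bandwidth_bytes_per_sec / model_bytes

    # Assumed: a 30B-parameter model at 4-bit (0.5 bytes/param)
    # on a roughly 1 TB/s consumer GPU.
    print(max_tokens_per_sec(30e9, 0.5, 1e12))  # ~66 tokens/s, best case

Real throughput lands well below this ceiling, but it shows why adding compute without adding bandwidth wouldn't help.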
pointer_to_null: This is definitely true. In theory you can page data in and out of VRAM to run larger models, but with all that thrashing you won't get much benefit over CPU compute.
Enturbulated: You are absolutely correct. text-generation-webui offers "streaming" by paging models in and out of VRAM. With this, your CPU no longer gets bogged down running the model, but you don't see much improvement in generation speed, because the GPU is constantly churning, loading and unloading model data from main RAM. It can still be an improvement worth some effort, but it's far less drastic than when the entire model fits in VRAM.
https://github.com/oobabooga/text-generation-webui

shafall: To give some more specifics: on modern systems it's usually not the CPU that copies the data, it's the PCI DMA engine (which may be on the same die). The CPU just sends address ranges to the DMA engine.
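For concreteness, a minimal sketch of this kind of VRAM/RAM split, assuming the HuggingFace Transformers + Accelerate stack; the checkpoint name and memory budgets are illustrative assumptions, not a recommendation:

    # Layers that fit in the 10 GiB GPU budget stay in VRAM; the rest live
    # in CPU RAM and are streamed in on demand, which is exactly the
    # thrashing described above.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "decapoda-research/llama-13b-hf",        # illustrative model id
        device_map="auto",                       # let Accelerate place layers
        max_memory={0: "10GiB", "cpu": "30GiB"}, # assumed budgets
    )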
wojtek15: Hey, recently I was thinking that Apple Silicon Macs may be the best thing for AI in the future. The most powerful Mac Studio has 128 GB of unified RAM, which can be used by the CPU, GPU, or Neural Engine. If only memory size is considered, even an A100, let alone any consumer-oriented card, can't match that. With this amount of memory you could run a GPT-3 Davinci-size model in 4-bit mode.
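The back-of-envelope math is plausible, assuming 4-bit weights and ignoring activation and KV-cache overhead:

    # GPT-3 Davinci-scale model: 175B parameters at 0.5 bytes each (4-bit).
    params = 175e9
    weight_gb = params * 0.5 / 1e9
    print(weight_gb)  # ~87.5 GB of weights, under the 128 GB of unified RAM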
currentscurrents: I'm hoping that non-von-Neumann chips will scale up in the next few years. There are some you can buy today, but they're small:
https://www.syntiant.com/ndp200

The NDP200 is designed to natively run deep neural networks (DNNs) on a variety of architectures, such as CNNs, RNNs, and fully connected networks, and it performs vision processing with highly accurate inference at under 1 mW.
Up to 896k neural parameters in 8-bit mode, 1.6M parameters in 4-bit mode, and 7M+ in 1-bit mode.
An Arduino idles at about 10 mW, for comparison.
The idea is that if you're not shuffling the entire network's weights across the memory bus on every inference cycle, you save ludicrous amounts of time and energy. Someday we'll use this kind of tech to run LLMs on our phones.
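A quick consistency check on those quoted capacities (my arithmetic, not Syntiant's datasheet): all three figures describe roughly the same ~0.9 MB of on-chip weight memory at different precisions:

    # Same weight store, three precisions (figures quoted above).
    budget_bits = 896_000 * 8      # ~7.2 Mbit at 8 bits per parameter
    print(budget_bits / 4 / 1e6)   # ~1.8M params at 4-bit (quoted: 1.6M)
    print(budget_bits / 1 / 1e6)   # ~7.2M params at 1-bit (quoted: 7M+)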
C0demunkee: Maybe consider Tesla P40s. 24 GB of VRAM, lots of CUDA cores, $150 each.
Civil_Collection7267: Untuned 30B LLaMA, you're saying? It's excellent and adept at storywriting, chatting, and so on, and at 4-bit precision it can output faster than ChatGPT. While I'm not into this myself, I understand there's a very large RP community in subs like CharacterAI and Pygmalion, and the 30B model is genuinely great for feeling like you're talking to a real person. I'm using it with text-generation-webui and custom parameters, not the llama.cpp implementation.
For assistant tasks, I've been using either the ChatLLaMA 13B LoRA or the Alpaca 7B LoRA, both of which are very good as well. ChatLLaMA, for instance, was able to correctly answer a reasoning question that GPT-3.5 got wrong, but it has drawbacks in other areas.
The limitations so far are that none of these models can answer programming questions competently yet; a finetune will be needed for that. They also tend to hallucinate frequently unless the parameters are made more restrictive.
Civil_Collection7267: alpaca.cpp runs on the CPU. If you want to use LLaMA with a GPU, you'll need to set it up with something like text-generation-webui.
At 8-bit precision, 7B requires 10GB VRAM, 13B requires 20GB, 30B requires 40GB, and 65B requires 80GB.
At 4-bit precision, 7B requires 6GB VRAM, 13B requires 10GB, 30B requires 20GB, and 65B requires 40GB.
With some tweaks, it's possible to run 7B LLaMA with 4GB VRAM.
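Those figures are consistent with a simple rule of thumb: VRAM is roughly parameter count times bytes per parameter, plus headroom for activations and the KV cache. A sketch with a hypothetical helper and an assumed ~1.4x overhead factor (my fudge factor, not an official formula):

    # Estimate VRAM for an n-billion-parameter model at a given precision.
    def vram_gb(n_params_billion: float, bits: int,
                overhead: float = 1.4) -> float:
        return n_params_billion * (bits / 8) * overhead

    for n in (7, 13, 30, 65):
        print(f"{n}B: ~{vram_gb(n, 4):.0f} GB at 4-bit, "
              f"~{vram_gb(n, 8):.0f} GB at 8-bit")

The output tracks the quoted numbers within a few GB, which is about as precise as these estimates get anyway.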
Civil_Collection7267:
> how does it compare to chat?
13B and 30B LLaMA are both amazing for this, and you can even upload character presets to make it into whatever you want. Characters can remember things you say in a conversation and genuinely feel lifelike. I promise I'm not exaggerating when I say it's that good. I don't really use llama.cpp and alpaca.cpp so I can't comment much on the experience there.
I haven't tried anything like RPG character builds, but 30B LLaMA can write very good stories, so I imagine it could do that too depending on what you're looking for.
> Can it write code?
This is the one major area where the models currently fail. None of the untuned models can write code in any competent capacity. However, finetuning should be able to improve this significantly, and I don't think it'll be long before someone steps up to do it.
https://www.reddit.com/r/StableDiffusion/comments/11y6qs7/comment/jd841h9/?utm_source=share&utm_medium=web2x&context=3

nDeconstructed: IDK. I guess I'm just...
(•_•)
( •_•)>⌐■-■
(⌐■_■)
... that good.
This is a Dockerfile and docker-compose configuration to run the Alpaca language model in a container.