Apple Silicon Macs have proven to be a great value for running large language models (LLMs) like Llama 2, Mixtral, and others locally using software like Ollama. Their unified memory architecture gives both the GPU and CPU access to main memory over a relatively high-bandwidth connection (compared to most CPUs and integrated GPUs).
High-end consumer GPUs from NVIDIA cost $2,000+ and top out at 24GB of RAM. Using multiple cards requires a more expensive motherboard and careful case, cooling, and power supply selection. Apple Silicon Macs are available with up to 192GB of RAM, MacBook Pros are available with up to 128GB, and refurbished systems can be had at a deep discount.
By default, macOS allows two-thirds of this RAM to be used by the GPU on machines with up to 36GB of RAM, and up to three-quarters on machines with more than 36GB. This leaves plenty of RAM for the OS and other applications, but sometimes you want more of it for running the LLM. Fortunately, the VRAM split can be altered at runtime using a kernel tunable.
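To see where a particular machine lands, you can read the installed RAM with sysctl and do the arithmetic yourself. This is just a sketch that mirrors the split described above; hw.memsize is the standard macOS key reporting physical memory in bytes:

```sh
# Rough estimate of the default GPU allowance, using the 2/3 vs 3/4 split above.
# hw.memsize reports installed RAM in bytes.
total_gb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
if [ "$total_gb" -le 36 ]; then
  echo "Default GPU allowance: ~$(( total_gb * 2 / 3 ))GB of ${total_gb}GB"
else
  echo "Default GPU allowance: ~$(( total_gb * 3 / 4 ))GB of ${total_gb}GB"
fi
```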
You change this setting using a utility called sysctl, which has to be run with sudo. The key you use depends on whether you are running Ventura or Sonoma:
- On Ventura: debug.iogpu.wired_limit
- On Sonoma: iogpu.wired_limit_mb
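You can check the current value of either key without sudo by querying it. For example, on Sonoma:

```sh
# Prints the current GPU wired limit in megabytes
sysctl iogpu.wired_limit_mb
```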
By default both keys have a value of 0, which corresponds to the default split described above. To increase the allocation, set the key to the value you want in megabytes (1GB = 1024MB).
For example:
- On Sonoma, sudo sysctl iogpu.wired_limit_mb=26624 will set it to 26GB.
- On Ventura, the equivalent is sudo sysctl debug.iogpu.wired_limit=26624.
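The change takes effect immediately, and you can confirm it by reading the key back. As with other runtime sysctl changes, it won't survive a reboot, so you'll need to reapply it after restarting. On Sonoma:

```sh
# Should now report 26624 instead of 0
sysctl iogpu.wired_limit_mb
```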
I'd leave ~4-6GB free for the OS and other apps, but some people have reported success with as little as 2GB left for other uses. Before going that low, I'd quit non-essential applications and make sure files are saved, just in case there is a problem and the system panics.
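If you'd rather compute the value than pick it by hand, a quick sketch like this (assuming you want to leave 6GB of headroom) prints the number to pass to the sysctl above:

```sh
# Total RAM in megabytes, minus 6GB reserved for macOS and other apps
total_mb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 ))
echo $(( total_mb - 6 * 1024 ))
```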
Note: Ollama runs models on the CPU if it thinks they are too big for the GPU RAM allowance. Originally it based this decision on total system memory rather than the actual allowance. However, as of 0.1.28, Ollama honors this setting change, so you can run larger models on the GPU.
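As a quick sanity check (the model choice here is just illustrative), make sure you're on a recent enough Ollama and try a model that previously spilled to the CPU. Mixtral's default quantization is on the order of 26GB, so on a higher-memory machine it makes a reasonable test once the limit is raised:

```sh
# Verify the Ollama version (0.1.28 or newer), then run a model that needs the larger allowance
ollama --version
ollama run mixtral
```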