


This guide builds on the prior guide for setting up Ollama and OpenWebUI in an LXC container on Proxmox 9. It assumes you have followed that guide, which had you install the drivers on the host system. Conceptually, you will install some parts on the host system and some parts in the LXC, and some of these will overlap. Assuming you followed the prior guide, you have NVIDIA driver 580.76.05 installed and fully functioning for GPU passthrough from your Proxmox 9 system to LXC containers. Drivers, however, are only part of the NVIDIA software moat; the CUDA Toolkit is also a key component, and we will be installing and using it to build software. Keep CUDA compute capabilities in mind when selecting build levels. I am demonstrating this install using quad 3090s at compute capability 8.6. My 4090s are at 8.9, and my 5060 Ti cards are at 12.0.
IF YOU HAVE BLACKWELL 50×0-series NVIDIA GPUs, select the MIT and not the Proprietary version during install.
YOU HAVE TO FOLLOW THE GUIDES SEQUENTIALLY up to this step, Step 2, the Llama.cpp and Unsloth setup guide! We create a backup here that you will see me using in almost every future guide, so you need it too.
Step 1 Ollama and OpenWEBUI
Step 2 Llama.cpp and Unsloth: THIS GUIDE
Step 3 vLLM guide
Video that accompanies this guide here:
Article Updated
9/23/2025
Installing CUDA 12.8 on the Proxmox 9 Host
The first place we will install dependencies so we can build against CUDA is the root host of the system. You can access this by clicking your Proxmox node name on the far left, and then your shell. You can also SSH into your system if you prefer.
Install the needed libraries. Note that these should already be installed if you followed the prior guide into this one.
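A typical prerequisite set on a Debian-based Proxmox host looks like this (the exact package list is my assumption; match it to what the prior guide installed):

```shell
# Build prerequisites (assumed set; align with the prior guide)
apt update
apt install -y build-essential cmake git wget pve-headers-$(uname -r)
```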
You should get an output like this:
Install Cuda
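As an example, CUDA 12.8 can be installed from NVIDIA's runfile. The URL and bundled-driver version below are what NVIDIA published for 12.8.0, but verify them on the CUDA downloads page before running:

```shell
# Download the CUDA 12.8 runfile installer (verify the exact URL on NVIDIA's site)
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
# Install the toolkit only -- the 580 driver is already installed, so skip the bundled driver
sh cuda_12.8.0_570.86.10_linux.run --toolkit --silent
```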
Hit enter to edit the .bashrc file. You can paste this right in at the bottom or at the top.
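The lines to append are the standard CUDA path exports (assuming the default /usr/local/cuda-12.8 install location):

```shell
# Add CUDA 12.8 to PATH and the library search path
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
```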
It should look like this:
Now we need to source and reboot
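In practice that means re-reading .bashrc and then rebooting:

```shell
# Reload the shell config so the CUDA paths take effect, then reboot
source ~/.bashrc
reboot
```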
Once rebooted, we will verify that the NVIDIA toolkit is installed and usable with the following command.
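The standard check is nvcc, the CUDA compiler driver:

```shell
# Confirm the CUDA toolkit is on PATH and report its version
nvcc --version
```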
You should get output that resembles this. It may have some different numbers based on when you have installed.
There are fairly complex relationships in the driver/CUDA support matrix, but installing this CUDA toolkit version alongside the 12.8-compatible 580 driver gives us a fully functional setup. Over time these will bump up to new versions, but any given version has easily a few quarters of runway before everything else supports the newer one. You do not want to, nor have to, update to the latest and greatest. Always chasing the most recent releases the moment they drop will in fact usually lead to things not fully working. This is also a good time to mention you should check that you have the NVIDIA drivers installed. Run nvidia-smi and you should get output like this:
You are now done with the setup on the host machine. Next we will create an LXC, then we will install CUDA TOOLKIT and DRIVERS inside the LXC. Then we will build llama.cpp from source. I will then show you how to do some basic usage tasks with Huggingface which is how we will get access to models. Then we will test Llama.cpp with a downloaded model. Finally we will take a snapshot of the LXC container so we have a known functional base level to restore to if we tinker and break something later.
Install the Linux CUDA Toolkit and Drivers inside our LXC
First we will install the drivers inside our new LXC container. You should still be at the HOST-level terminal and have the prior guide's driver .run file downloaded. We will push this into our new container like before and install it. The driver and CUDA toolkit versions inside the LXC should match those on the host outside the LXC.
In case you forgot those steps, Visit the nvidia unix drivers archive page here: https://www.nvidia.com/en-us/drivers/unix/ and click the Latest Production Branch Version: XXXXXX link.
Then click the “latest production branch” which is 580.76.05 in this instance. Then grab the link to download from the next page.
Then copy that file address, type wget, and paste in the link. Hit enter. It will download the file to your current location on your Proxmox HOST machine.
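With the 580.76.05 link that looks roughly like this (the URL follows NVIDIA's usual download pattern; verify it against the link you actually copied):

```shell
# Download the 580.76.05 driver runfile to the Proxmox host
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/580.76.05/NVIDIA-Linux-x86_64-580.76.05.run
```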
Now type ls and view the name of the downloaded file. It should start with NVIDIA-blah-blah.run essentially.
Type chmod +x NV* and hit enter. You can now execute this installer.
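Then push the installer into the container and hop inside it; pct handles both (container ID 101 is an example, substitute your own CTID):

```shell
# Copy the driver installer from the Proxmox host into the container
pct push 101 NVIDIA-Linux-x86_64-580.76.05.run /root/NVIDIA-Linux-x86_64-580.76.05.run
# Enter the container to run it
pct enter 101
```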
To execute the installer type the following:
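Inside an LXC the kernel module already comes from the host, so the installer is normally run without building kernel modules (the flag name is the standard NVIDIA installer option; check `--advanced-options` if it differs on your version):

```shell
# Run the installer; skip the kernel module build since the host already loads it
./NVIDIA-Linux-x86_64-580.76.05.run --no-kernel-module
```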
It will install; select the appropriate driver branch during installation. IF YOU HAVE BLACKWELL 50×0-series NVIDIA GPUs, use the MIT drivers and not the Proprietary version during installation. The same package is downloaded; you just select a different option during the installation process. Once finished you can test with the nvidia-smi command. (The pictured container is the Ollama one, but the llama.cpp container looks exactly the same for these steps.)
Next we will install the CUDA Toolkit 12.8 prerequisite packages.
Next we will install the CUDA Toolkit 12.8.
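As on the host, this can use the same runfile with the toolkit-only option, since the driver inside the LXC is already installed via the .run installer (verify the URL on NVIDIA's downloads page):

```shell
# Inside the LXC: fetch and install the CUDA 12.8 toolkit, skipping the bundled driver
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
sh cuda_12.8.0_570.86.10_linux.run --toolkit --silent
```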
Hit enter to edit the .bashrc file. You can paste this right in at the bottom or at the top.
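The lines to append are the same CUDA path exports as on the host (assuming the default install prefix):

```shell
# Add CUDA 12.8 to PATH and the library search path
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
```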
It should again look like this:
Build llama.cpp from source inside our LXC
There are a lot of flags you can set at compile time when you are building llama.cpp in linux and you can find a list of many llama.cpp env variables here. Please do not just jump directly into using those first, use what is in this guide so you can at least have a working baseline and snapshot before you go and tinker/break things. I will have a much deeper llama.cpp dive when we look at performance optimization and ik_llama.cpp in the future.
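A baseline CUDA build, sketched for the quad-3090 setup described above (compute capability 8.6; change CMAKE_CUDA_ARCHITECTURES to 89 for 4090s or 120 for 5060 Ti cards — the flags here are a working baseline, not the only valid set):

```shell
# Clone and build llama.cpp with CUDA support for compute capability 8.6
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j $(nproc)
```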
You have now built the following binaries at the following paths in your home directory:
llama.cpp/build/bin/llama-cli
llama.cpp/build/bin/llama-bench
llama.cpp/build/bin/llama-server
llama.cpp/build/bin/llama-gguf-split
Each does a different task, as you may guess. When you want to run these binaries, you need to remember the path to them. You can also download prebuilt binaries, but those have various quirks. It is highly advised to use Linux and build from source when you want to run llama.cpp currently.
How to run Deepseek V3.1 GGUF Locally
Use the code with some modifications for the size and quant you can run, to download the desired size. You will need to manually configure the releaser/model/quant parts of this yourself, unless you want DeepSeek V3.1 at quant Q4_K_M, which is a good balance if you have 512GB of RAM plus some GPUs and are okay with speed sacrifices.
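One way to do this is llama.cpp's built-in Hugging Face downloader via the -hf flag; a sketch, where the Unsloth repo id and quant tag are my assumptions and should be adjusted to the release and size you want:

```shell
# Let llama-server fetch the model directly from Hugging Face (repo:quant format)
~/llama.cpp/build/bin/llama-server -hf unsloth/DeepSeek-V3.1-GGUF:Q4_K_M
```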
If this refuses to download and times out, you may have packet loss. There is an alternate way to download this that will not break the connection and reset downloads, but it is less clean. Here are those steps; no, it's not even close to simple.
How to Download Huggingface Models with Huggingface-cli tools
Install Python and an env manager
Make a directory and enter that directory
Activate our python env
Install some Hugging face software from pip
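Taken together, steps along these lines would do it (the directory name and exact package set are my assumptions):

```shell
# Python plus venv tooling
apt update && apt install -y python3 python3-venv python3-pip
# Working directory for the download
mkdir ~/models && cd ~/models
# Create and activate a virtual env
python3 -m venv venv
source venv/bin/activate
# Hugging Face download tooling
pip install huggingface_hub hf_transfer
```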
Make a downloader script file (this is where the Unsloth website picks up with things)
Then paste in the basic downloader script. *Note this is structured for the way Unsloth releases quants.* Then Ctrl-X and y to save the file.
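A minimal script along those lines, assuming the Unsloth repo id, directory, and quant pattern (all three are placeholders to adjust):

```python
#!/usr/bin/env python3
# Hypothetical downloader matching Unsloth's GGUF release layout
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-V3.1-GGUF",  # releaser/model -- adjust to taste
    local_dir="DeepSeek-V3.1-GGUF",        # where the shards land
    allow_patterns=["*Q4_K_M*"],           # only the quant you want
)
```

snapshot_download resumes interrupted transfers, which is what makes this route survive packet loss better than a single streamed download.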
Grant running permissions for the script
Run the downloader script
This will dump the files into this directory in your LXC. (The user is root in this instance; a user folder at wherever ~ points is also a place to look if you are using other users.) It will look like this while it runs for a while as it downloads.
Now just find the path to 0001 GGUF file:
When the download completes you will have a bunch of files in here. We need to grab the path and the first GGUF file for our cmd line argument to run. This is a pretty good place to start for the quad 3090 512GB EPYC system:
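A hedged starting command for that box. The shard count in the filename is a placeholder you must fill in from your own ls output, and the flags are a reasonable baseline (dense layers on GPU, MoE expert tensors kept in system RAM), not necessarily the author's exact ones:

```shell
# Serve DeepSeek V3.1 Q4_K_M: offload what fits to the 3090s, experts to CPU/RAM
~/llama.cpp/build/bin/llama-server \
  -m ~/models/DeepSeek-V3.1-GGUF/DeepSeek-V3.1-Q4_K_M-00001-of-000NN.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 8192 \
  --host 0.0.0.0 --port 8080
```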
Connecting OpenWEBUI and Llama.cpp
Start in the lower left of your OpenWebUI instance and go to the admin menu/settings. Fill in your llama.cpp server's IP:PORT/v1, username, and prefix. The prefix I use is llamacpp- for this server so I can quickly see which server is running which model. This helps me avoid overloading the VRAM on the same server. Here are the steps to follow, starting in the lower left.
Head over to your models and let's fix the model names that are sometimes very long and unreadable in the dropdown. You can edit the name; just click the pencil icon.
Then just type the updated name you want directly into the title, and save at the end. I do not think the settings you would apply in the Advanced Params work with llama.cpp, but that is not a recent memory, so maybe things have changed. If you find the Advanced Params work, let me know in a comment on this video please.
As always, you should snapshot this once you have your llama.cpp container up and running. It makes for an easy slide back. Snapshotting before upgrades is a good idea.