Ollama provides an interface to self-host and interact with open-source LLMs (Large Language Models) using its binary or container image. Managing LLMs with Ollama is a lot like managing the container lifecycle with a container engine such as `docker` or `podman`.
- Ollama’s `pull` and `run` commands are used to download and execute LLMs respectively, just like the ones used to manage containers with `podman` or `docker`.
- Tags like `13b-python` and `7b-code` are used to manage different variations of an LLM.
- A `Modelfile` (like a `Dockerfile`) is created to build a custom model using an existing LLM as its base. Additional instructions like `TEMPLATE` and `PARAMETER` can be used to define a prompt template or fine-tune model parameters respectively.
Deploying Ollama container with NVIDIA GPU
Deploying the Ollama container directly would allow it to utilize CPU resources for its LLM workloads, but with the parallel computation capabilities of a Graphics Processing Unit (GPU), we can improve the inference performance of all models.
In this article I’m using an NVIDIA GeForce RTX 3070 Ti GPU. If you want to use a GPU from AMD, Intel, or another manufacturer, steps like the driver and container toolkit installation and the GPU configuration for the container engine will differ.
GPU Passthrough to VM
I am deploying the Ollama container on a Fedora 38 virtual machine, so the first step is GPU passthrough from my hypervisor (Proxmox) to the VM. You can skip this step if you are deploying Ollama on a bare-metal machine.
In Proxmox’s Web UI, go to the VM’s Hardware section and Add your PCI Device, i.e. your GPU.
Proxmox VM's Hardware Section
Make sure to mark the All Functions
checkbox.
GPU Passthrough to a Proxmox VM
Once the VM is rebooted, we can verify the GPU passthrough with a command like the following:
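```bash
# list PCI devices and filter for NVIDIA entries
lspci | grep -i nvidia
```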
If the GPU name is present in the command’s output, something like the line below (the PCI address will vary), then the passthrough is successful and we can move to the next step.
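```
01:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070 Ti]
```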
CUDA Toolkit Installation
To utilize the parallel computation capabilities of the CUDA cores provided in NVIDIA GPUs, we have to install the CUDA Toolkit. You can follow NVIDIA’s documentation on CUDA Toolkit installation on Linux because the steps vary depending on the host’s configuration.
Here are the steps for Fedora 38:
- Downloading CUDA Toolkit Repo RPM.
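The exact file name depends on the CUDA version, so copy the URL from NVIDIA’s CUDA download page; it will look something like this:

```bash
# hypothetical URL: substitute the version and Fedora release shown on NVIDIA's download page
wget https://developer.download.nvidia.com/compute/cuda/<version>/local_installers/cuda-repo-fedora<release>-<version>.x86_64.rpm
```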
- Installing CUDA Toolkit Repo RPM.
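```bash
# install the local repo package downloaded above (use the actual RPM file name)
sudo rpm -i cuda-repo-fedora<release>-<version>.x86_64.rpm
```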
- Cleaning `dnf` Repository Metadata.
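```bash
# drop cached repository metadata so the newly added CUDA repo is picked up
sudo dnf clean all
```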
- Installing the `cuda-toolkit` package.
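```bash
# the repo also provides versioned meta-packages (e.g. cuda-toolkit-12-x) if you need to pin a release
sudo dnf -y install cuda-toolkit
```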
- Installing the `legacy` (proprietary) or `open` (open source) kernel module for `nvidia-driver`.
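The module stream names below come from the `nvidia-driver` module that ships in the CUDA repository:

```bash
# proprietary (legacy) kernel module
sudo dnf -y module install nvidia-driver:latest-dkms
```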
or
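```bash
# open-source kernel module variant
sudo dnf -y module install nvidia-driver:open-dkms
```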
NVIDIA Container Toolkit Installation
With the `nvidia-container-toolkit`, we can use our NVIDIA GPU in containerized applications. Here are the steps for installing the NVIDIA Container Toolkit on Fedora 38:
- Adding the `nvidia-container-toolkit` repository.
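```bash
# add the libnvidia-container repository used by the NVIDIA Container Toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo |
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
```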
- Installing the `nvidia-container-toolkit` package.
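```bash
sudo dnf install -y nvidia-container-toolkit
```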
- Once the container toolkit is installed, we have to add its runtime to our container engine.
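```bash
# register the NVIDIA runtime with Docker (updates /etc/docker/daemon.json)
sudo nvidia-ctk runtime configure --runtime=docker
```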
- Finally, we can start using our NVIDIA GPU with Docker containers after restarting the `docker` daemon.
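```bash
sudo systemctl restart docker
```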
Deploying Ollama as a Docker Container
- Create a directory on our host to store LLMs to avoid re-downloading models after reprovisioning or updating the container.
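Any path works as long as it matches the volume mount in `compose.yaml`; the examples below assume a directory named `ollama` next to the compose file.

```bash
# host directory that will hold the downloaded models (path is just an example)
mkdir -p ollama
```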
- A `compose.yaml` file along the following lines will deploy the `ollama` container with our NVIDIA GPU.
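```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"           # Ollama's REST API
    volumes:
      - ./ollama:/root/.ollama  # the model directory created earlier
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```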
If you want to provision the container without a GPU, you have to remove the `deploy` section.
- Deploy the `ollama` container using the following command.
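```bash
docker compose up -d
```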
If you want to deploy Ollama with a ChatGPT-style web UI, follow the deployment steps in the Ollama Web UI section below.
Managing LLMs using Ollama
Once the container is provisioned, we can start downloading and executing models.
To attach a terminal to the Ollama container, we can use a command like the following:
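```bash
# open a shell inside the container (the name "ollama" comes from container_name in compose.yaml)
docker exec -it ollama bash
```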
Downloading LLMs using the `pull` command
To download a model, use the `ollama pull` command with the name of the LLM and its tag (refer to the Ollama Library).
For example, to download the Code Llama model with 7 billion parameters, we have to pull the `codellama:7b` model.
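```bash
ollama pull codellama:7b
```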
Model sizes can range from 4 to 19 GB (or even more), so choosing the right model tag is crucial to reduce download time and resource utilization.
If we want to delete a downloaded model, we’ll use the `ollama rm` command followed by the name of the model, for example:
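```bash
ollama rm codellama:7b
```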
Executing LLMs using the `run` command
Before we can prompt the model, we have to run it first using the `ollama run` command followed by the name of the model.
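```bash
ollama run codellama:7b
```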
This command drops us directly into the model’s interactive prompt, which looks something like this:
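```
>>> Send a message (/? for help)
```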
Prompting LLMs from Command Line
Ollama exposes multiple REST API endpoints to manage and interact with the models:

- `/api/tags`: To list all the local models.
- `/api/generate`: To generate a response from an LLM with the prompt passed as input.
- `/api/chat`: To generate the next chat response from an LLM. The prior chat history can be passed as input.
We can perform these API requests using `curl` and format the response using `jq`, for example:
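```bash
# list the local models
curl -s http://localhost:11434/api/tags | jq .

# one-shot generation (the model and prompt values are just examples)
curl -s http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "prompt": "Write a Python function that reverses a string.",
  "stream": false
}' | jq .
```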
By setting `stream` to `false`, we receive the complete response as a single JSON object rather than a stream of multiple objects.
Ollama Web UI
With self-hosted applications, it always helps to have a web interface for management and access from any device. The Ollama Web UI provides an interface similar to ChatGPT to interact with LLMs present in Ollama.
Deploying Ollama Web UI
Similar to the `ollama` container deployment, we will create a data directory for `ollama-webui`:
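```bash
# again, the path is just an example; keep it next to the compose file
mkdir -p ollama-webui
```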
Modify our existing `compose.yaml` to add the web UI service. The image tag, data path, and `OLLAMA_API_BASE_URL` value below follow the Ollama Web UI README, so double-check them against the project’s current instructions.
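```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  ollama-webui:
    image: ghcr.io/ollama-webui/ollama-webui:main
    container_name: ollama-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3030:8080"                       # web UI served on localhost:3030
    volumes:
      - ./ollama-webui:/app/backend/data  # the data directory created above
    environment:
      - OLLAMA_API_BASE_URL=http://ollama:11434/api
```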
Deploy both containers using `docker compose`.
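```bash
docker compose up -d
```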
If the `ollama` container is deployed on a different host, then we have to rebuild the `ollama-webui` container image by following the instructions from here.
Managing LLMs from Ollama Web UI
Once the deployment is complete, we can visit the web UI at `localhost:3030`.
Ollama Web UI
Alongside prompting we can also use the Web UI to manage models.
Managing models using Ollama Web UI
Integrating Ollama with Neovim
If you are using Neovim (like me), you can integrate models into your development environment using `ollama.nvim`.
`ollama.nvim` supports the following features:
- Code generation from a text prompt
- Generating an explanation for a code snippet
- Code modification suggestions
Code explanation from Ollama using ollama.nvim
I am using LazyVim, so I’ve created `~/.config/nvim/lua/plugins/ollama.lua` with content along the following lines. The spec below is adapted from the `ollama.nvim` README: the keymap and model name are just examples, and `url` should point at wherever your Ollama container is listening.
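```lua
return {
  "nomnivore/ollama.nvim",
  dependencies = { "nvim-lua/plenary.nvim" },
  cmd = { "Ollama", "OllamaModel" },
  keys = {
    -- prompt menu for the current buffer or visual selection
    {
      "<leader>oo",
      ":<c-u>lua require('ollama').prompt()<cr>",
      desc = "ollama prompt",
      mode = { "n", "v" },
    },
  },
  opts = {
    model = "codellama:7b",          -- any model pulled earlier works here
    url = "http://127.0.0.1:11434",  -- the Ollama container's API endpoint
  },
}
```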
Integrating Ollama with VSCode
The Continue VSCode extension supports the integration of LLMs as coding assistants. To use it with Ollama, we have to change the Proxy Server Url in its settings to the one used by our Ollama container.
Continue Extension Settings
Watch Ollama in action inside VSCode
Optimizing code using Ollama in VSCode
Thank you for taking the time to read this blog post! If you found this content valuable and would like to stay updated with my latest posts consider subscribing to my RSS Feed.
Resources
Ollama
Ollama Docker Image
NVIDIA GeForce RTX 3070 Ti
NVIDIA CUDA Installation Guide for Linux
NVIDIA Container Toolkit
Ollama Library
Ollama Web UI
ollama.nvim
Continue