How to Host LLMs Easily with OpenLLM

As artificial intelligence continues to advance, hosting large language models (LLMs) has become a pivotal aspect of leveraging their power. OpenLLM simplifies this process, allowing developers to run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints. Whether you’re working locally or in the cloud, OpenLLM optimizes for serving throughput and production deployment. Let's explore how you can get started with OpenLLM and host your own LLMs effortlessly.

What is OpenLLM?

OpenLLM is a versatile tool designed to help developers run and manage open-source LLMs. It supports a wide range of models, including those fine-tuned with your own data, and provides seamless OpenAI-compatible API endpoints. This makes it easier to transition from proprietary LLM applications to open-source alternatives.

Key Features

Broad Model Support:** Run various open-source LLMs with ease.
- **OpenAI-Compatible Endpoints:** Easily transition from OpenAI to open-source models.
- **High Performance:** Optimized for serving and inference performance.
- **Simplified Cloud Deployment:** Easily deploy models to the cloud with BentoML.


Before you begin, ensure you have Python 3.9 or later installed. Using a virtual environment is recommended to prevent package conflicts.

Step 1: Install OpenLLM

First, install OpenLLM using pip:

pip install openllm

Verify the installation by running:

openllm -h

Step 2: Start an LLM Server

OpenLLM allows you to start an LLM server quickly. For instance, to start a Phi-3 server, use the following command:

openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code

You can interact with the server via the web UI at `` or send requests using curl. Additionally, you can use OpenLLM’s built-in Python client:

import openllm

client = openllm.HTTPClient('http://localhost:3000')
client.generate('Explain to me the difference between "further" and "farther"')

OpenLLM supports various models and their variants. To specify different models, use:

openllm start <model_id> --<options>

Supported Models

OpenLLM supports numerous models, including:

- Llama
- Mistral
- Falcon
- StableLM
- And many more

You can find the complete list of supported models and learn how to add new ones in the OpenLLM documentation.


Quantization reduces storage and computation requirements for models, making it feasible to deploy large models on resource-constrained devices. OpenLLM supports several quantization techniques, such as AWQ, GPTQ, and SqueezeLLM.


OpenLLM integrates seamlessly with other powerful tools, including:

- **OpenAI Compatible Endpoints:** Use OpenLLM as a drop-in replacement for OpenAI’s API.
- **LlamaIndex:** Interact with LLMs using LlamaIndex’s API.
- **LangChain:** Connect to an OpenLLM server using LangChain.

Deploying Models to Production

There are several ways to deploy your LLMs:

Docker Containers

Build a Bento for your model using OpenLLM and BentoML:

openllm build mistralai/Mistral-7B-Instruct-v0.1
bentoml containerize <name:version>

This creates a Docker image that can be deployed anywhere Docker runs.


Deploy your LLMs to BentoCloud for better scalability and reliability. Start by creating a BentoCloud account and logging in:

bentoml cloud login --api-token <your-api-token> --endpoint <bento-cloud-endpoint>

Build and push your Bento:

openllm build mistralai/Mistral-7B-Instruct-v0.1
bentoml push <name:version>

Finally, deploy your Bento:

bentoml deployment create <deployment_name>

By using OpenLLM, you can easily host and manage your LLMs, ensuring high performance and seamless integration with your existing systems. Whether you're deploying locally or in the cloud, OpenLLM simplifies the process, making advanced AI accessible to everyone.

Post a Comment

Previous Post Next Post