Working with Models
Using Models for Inference Tasks
The Llm class and the llms() helper function provide the programmatic interface for model inference in Tower. For an overview of these concepts, see the Models page. Under the hood, the Llm class uses Inference Routers and Inference Providers to perform inference.
Inference Routers and Inference Providers
Tower recognizes that users want to use different inference providers depending on offered model versions, inference performance, and cost. Tower also understands that users will sometimes want to do local inference when developing applications and switch to serverless, remote inference when deploying apps to production.
For this reason, Tower supports popular inference providers such as Together.ai and Ollama, and provides routing services that direct inference calls to these providers.
Local Inference (Ollama)
For local inference, we recommend Ollama. When using Ollama, it serves as both the router and the provider.
- Runs models locally on your machine
- Good for development and testing
- No API keys required
- Limited by local hardware capabilities
Remote Inference (Hugging Face Hub)
For remote, serverless inference, we recommend Hugging Face Hub. When using Hugging Face Hub, it serves as a router of inference requests to various inference providers on that platform, including Together, SambaNova, and Hugging Face's own provider, HF-Inference.
- Runs models on remote servers
- Requires API keys for authentication
- Better throughput for production workloads
- Access to a wide variety of models
Setting Up Ollama for Local Inference
We recommend using Ollama to serve as a local inference server during development. Download and install the Ollama application from ollama.com, then install the Ollama Python client:
pip install ollama
You will also need to download the models you want to use for inference. For example, to download and run DeepSeek R1 locally:
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b
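Optionally, you can smoke-test the local server from Python using the Ollama client installed above. This is a minimal sketch that assumes the deepseek-r1:14b model pulled above and an Ollama server running locally; Tower apps do not need this step.
import ollama  # Python client for the local Ollama server (pip install ollama)

# Ask the locally pulled model for a short reply to confirm it is serving.
response = ollama.chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Reply with one word: ready"}],
)
print(response["message"]["content"])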
Signing Up for Hugging Face for Remote Inference
In production, many users will want to use Hugging Face Hub to route inference calls to commercial inference providers.
You don't have to use the Hub; you can call inference providers directly. However, using the Hub is free and gives you the flexibility to switch to providers that offer better value, such as lower latency, higher request rates, lower costs, or better availability.
Sign up for Hugging Face and create a Hugging Face access token. Note this token down, as you will need it later.
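If you want to confirm the token is valid before wiring it into Tower, a quick check with the huggingface_hub client works; this step is entirely optional, and the token value below is a placeholder.
from huggingface_hub import HfApi  # pip install huggingface_hub

# whoami() raises an authentication error if the token is invalid.
api = HfApi(token="hf_xxx")  # placeholder; use your own access token
print(api.whoami()["name"])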
(Optional) Signing Up for Third-Party Inference Providers
Third-party inference providers like Together.ai can help you get started with remote inference while you decide on your long-term inference provider.
Together.ai is a popular serverless inference provider that offers many OSS models, such as DeepSeek R1, for inference. Sign up for Together.ai and note your access key.
Follow this quickstart to enable Together.ai in the Hugging Face Hub. You will enter your Together.ai access token in the Hugging Face Hub settings. Once you do that, you can use your Hugging Face access token to make inference calls from your Tower app.
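Tower performs this routing for you at run-time. Purely as an illustration of what the Hugging Face Hub router does underneath, the sketch below calls the Together provider directly through the huggingface_hub client; the token is a placeholder, and the model name is one example of a Together-servable model.
from huggingface_hub import InferenceClient

# Ask the Hugging Face Hub router to send a chat completion to the Together provider.
client = InferenceClient(provider="together", api_key="hf_xxx")  # placeholder HF token
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What does an inference router do?"}],
)
print(completion.choices[0].message.content)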
Choosing Inference Routers and Providers at Run-time
The choice of Inference Routers and Providers for a particular app run is controlled by specific secrets defined in the environment where the Tower app is running.
TOWER_INFERENCE_ROUTER
- Can be set to "ollama" or "hugging_face_hub" when running the app in local mode. Must be set to "hugging_face_hub" when doing serverless inference.
TOWER_INFERENCE_ROUTER_API_KEY
- Can be left unset when using the Ollama inference router. When using the "hugging_face_hub" inference router, should be set to your Hugging Face access token.
TOWER_INFERENCE_PROVIDER
- Should be set to a Hugging Face Hub inference provider, such as "together".
To create one of these secrets, use the Tower CLI or the Web UI:
To define the inference secrets for local execution in an environment called "dev-local":
tower secrets create --environment="dev-local" \
--name=TOWER_INFERENCE_ROUTER --value="ollama"
To define the inference secrets for remote execution in an environment called "prod":
tower secrets create --environment="prod" \
--name=TOWER_INFERENCE_ROUTER --value="hugging_face_hub"
tower secrets create --environment="prod" \
--name=TOWER_INFERENCE_ROUTER_API_KEY --value="hf_1234567"
tower secrets create --environment="prod" \
--name=TOWER_INFERENCE_PROVIDER --value="together"
Combining Local and Remote Inference
Tower's inference capabilities were designed to give you choices. You can use the same inference router and provider in both development and production. Alternatively, you could use local inference during development and remote, serverless inference in testing and production.
- Use Tower's --local mode to run the app on your dev machine during development
- Use Ollama to host a local inference server that the Tower app will use in local mode
- Ollama will use local GPUs (e.g., Apple Silicon) to save on inference costs during development and avoid inference rate throttling
Run in one terminal:
ollama run deepseek-r1:14b
Run in another terminal:
tower run --local --environment="dev-local" \
--parameter=xyz='123' \
--parameter=model_to_use='deepseek-r1:14b'
- Once you are done developing your app, you can deploy the app to the Tower cloud
- To maintain flexibility with inference providers, use Hugging Face Hub as the router of inference calls
- Use Together.AI as the serverless inference provider
Run in terminal:
tower run --environment="prod" \
--parameter=xyz='123' \
--parameter=model_to_use='deepseek-ai/DeepSeek-R1'
The Tower example DeepSeek-Summarize-Github demonstrates how this can be done.
Specifying Model Names
Inference providers can have different names for the same model. Model vendors usually release a family of model versions that vary in parameter count (smaller variants are often produced via distillation) and in quantization level. For example, the 3-billion-parameter, Instruct-fine-tuned model of the Llama 3.2 family is known as "llama3.2:3b" in Ollama and as "meta-llama/Llama-3.2-3B-Instruct" in Hugging Face Hub.
Tower's LLM inference was designed to let developers use LLMs across their development and production environments by specifying a short model family name, without writing conditional statements to handle differences between environments. For example, specifying llms("llama3.2") should resolve to either "llama3.2:3b" or "meta-llama/Llama-3.2-3B-Instruct", depending on the chosen inference provider.
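As a rough illustration only (not Tower's actual resolution logic), you can think of this as a lookup from a family name to a provider-specific identifier; the alias table below is a toy example:
# Illustrative only: a toy mapping from a model family name to provider-specific names.
MODEL_ALIASES = {
    "ollama": {
        "llama3.2": "llama3.2:3b",
        "deepseek-r1": "deepseek-r1:14b",
    },
    "hugging_face_hub": {
        "llama3.2": "meta-llama/Llama-3.2-3B-Instruct",
        "deepseek-r1": "deepseek-ai/DeepSeek-R1",
    },
}

def resolve(family: str, router: str) -> str:
    """Return a provider-specific model name for a model family."""
    return MODEL_ALIASES[router].get(family, family)

print(resolve("llama3.2", "ollama"))            # llama3.2:3b
print(resolve("llama3.2", "hugging_face_hub"))  # meta-llama/Llama-3.2-3B-Instruct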
Specifying Model Families
Users can specify a model family, e.g., llms("deepseek-r1"), in both local and Tower cloud environments, and Tower will resolve the model family to a particular model that is available for inference:
- In a local environment, Tower will find the installed models of that family and, if there is more than one, pick the one with the largest number of parameters
- In Tower cloud environments, Tower will take the first model returned by Hugging Face search, making sure that this model is servable by the Inference Provider
Tower currently recognizes ~170 names of model families.
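To illustrate the local resolution rule above, here is a simplified sketch (not Tower's actual implementation) that picks the largest installed variant by parsing the parameter count from Ollama-style tags:
import re

def parameter_count(tag: str) -> float:
    """Parse the parameter count in billions from an Ollama-style tag like 'deepseek-r1:14b'."""
    match = re.search(r":(\d+(?:\.\d+)?)b$", tag)
    return float(match.group(1)) if match else 0.0

# Installed variants of a family (as reported by, e.g., `ollama list`).
installed = ["deepseek-r1:1.5b", "deepseek-r1:7b", "deepseek-r1:14b"]
print(max(installed, key=parameter_count))  # deepseek-r1:14b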
Specifying Particular Models
In addition to using model family names, users can also specify a particular model in both local and Tower cloud environments:
- Locally: llms("deepseek-r1:14b") or llms("llama3.2:latest")
- Remotely: llms("deepseek-ai/DeepSeek-R1-0528") or llms("meta-llama/Llama-3.2-3B-Instruct")
Name Flexibility in Development, Precise Naming in Production
One recommended pattern is to specify a model family name in development and use a precise model name in production.
The code that supports this pattern is the same in both environments; you simply pass the model name as an app parameter or as an environment secret:
import os

model_name = os.getenv("MODEL_NAME")  # supplied as an app parameter or environment secret
llm = llms(model_name)
In the development environment, you would then set:
MODEL_NAME = "llama3.2"
# (any model of this family installed locally will do)
In the "prod" environment, you would set:
MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
# (use a particular model)
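For example, if you manage MODEL_NAME as an environment secret, it can be created with the same CLI pattern shown earlier:
tower secrets create --environment="dev-local" \
--name=MODEL_NAME --value="llama3.2"
tower secrets create --environment="prod" \
--name=MODEL_NAME --value="meta-llama/Llama-3.2-3B-Instruct"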
Learn More
For detailed information about the llms() function and Llm class methods, see the Tower SDK Reference.