Working with Models
Using Models for Inference Tasks
The Llm class and the llms() helper function provide the programmatic interface for model inference in Tower. For an overview of these concepts, see the Models page. Under the hood, the Llm class uses Inference Routers and Inference Providers to perform inference.
Inference Routers and Inference Providers
Tower recognizes that users want to use different inference providers depending on offered model versions, inference performance, and cost. Tower also understands that users will sometimes want to do local inference when developing applications and switch to serverless, remote inference when deploying apps to production.
For this reason, Tower supports popular inference providers such as Together.ai and Ollama, and provides routing services that direct inference calls to these providers.
Local Inference (Ollama)
For local inference, we recommend Ollama. When using Ollama, it serves as both the router and the provider.
- Runs models locally on your machine
- Good for development and testing
- No API keys required
- Limited by local hardware capabilities
Remote Inference (Hugging Face Hub)
For remote, serverless inference, we recommend Hugging Face Hub. When using Hugging Face Hub, it serves as a router of inference requests to various inference providers on that platform, including Together, SambaNova, and Hugging Face's own provider, HF-Inference.
- Runs models on remote servers
- Requires API keys for authentication
- Better throughput for production workloads
- Access to a wide variety of models
Setting Up Ollama for Local Inference
We recommend using Ollama to serve as a local inference server during development. Download and install the Ollama application from ollama.com, then install the Ollama Python client:
pip install ollama
You will also need to download the models you want to use for inference. For example, to download and run DeepSeek R1 locally:
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b
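Optionally, you can smoke-test the local server from Python using the Ollama client installed above. This is a minimal sketch that assumes the deepseek-r1:14b model pulled above and an Ollama server running locally; Tower apps do not need this step.
import ollama  # Python client for the local Ollama server (pip install ollama)

# Ask the locally pulled model for a short reply to confirm it is serving.
response = ollama.chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Reply with one word: ready"}],
)
print(response["message"]["content"])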
Signing Up for Hugging Face for Remote Inference
In production, many users will want to use Hugging Face Hub to route inference calls to commercial inference providers.
You don't have to use the Hub; you can call inference providers directly. However, using the Hub is free and gives you the flexibility to switch to providers that offer better value, such as lower latency, higher request rates, lower costs, or better availability.
Sign up for Hugging Face and create a Hugging Face access token. Note this token down, as you will need it later.
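If you want to confirm the token is valid before wiring it into Tower, a quick check with the huggingface_hub client works; this step is entirely optional, and the token value below is a placeholder.
from huggingface_hub import HfApi  # pip install huggingface_hub

# whoami() raises an authentication error if the token is invalid.
api = HfApi(token="hf_xxx")  # placeholder; use your own access token
print(api.whoami()["name"])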
(Optional) Signing Up for Third-Party Inference Providers
Third-party inference providers like Together.ai can help you get started with remote inference while you decide on your long-term inference provider.
Together.ai is a popular serverless inference provider that offers many OSS models, such as DeepSeek R1, for inference. Sign up for Together.ai and note your access key.
Follow this quickstart to enable Together.ai in the Hugging Face Hub. You will enter your Together.ai access token in the Hugging Face Hub settings. Once you do that, you can use your Hugging Face access token to make inference calls from your Tower app.
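Tower performs this routing for you at run-time. Purely as an illustration of what the Hugging Face Hub router does underneath, the sketch below calls the Together provider directly through the huggingface_hub client; the token is a placeholder, and the model name is one example of a Together-servable model.
from huggingface_hub import InferenceClient

# Ask the Hugging Face Hub router to send a chat completion to the Together provider.
client = InferenceClient(provider="together", api_key="hf_xxx")  # placeholder HF token
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What does an inference router do?"}],
)
print(completion.choices[0].message.content)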
Choosing Inference Routers and Providers at Run-time
The choice of Inference Routers and Providers for a particular app run is controlled by specific secrets defined in the environment where the Tower app is running.
TOWER_INFERENCE_ROUTER
- Can be set to "ollama" or "hugging_face_hub" when running the app in local mode. Must be set to "hugging_face_hub" when doing serverless inference.
TOWER_INFERENCE_ROUTER_API_KEY
- Can be left unset when using the Ollama inference router. When using the "hugging_face_hub" inference router, should be set to your Hugging Face access token.
TOWER_INFERENCE_PROVIDER
- Should be set to a Hugging Face Hub inference provider, such as "together".
To create one of these secrets, use the Tower CLI or the Web UI:
To define the inference secrets for local execution in an environment called "dev-local":
tower secrets create --environment="dev-local" \
--name=TOWER_INFERENCE_ROUTER --value="ollama"
To define the inference secrets for remote execution in an environment called "prod":
tower secrets create --environment="prod" \
--name=TOWER_INFERENCE_ROUTER --value="hugging_face_hub"
tower secrets create --environment="prod" \
--name=TOWER_INFERENCE_ROUTER_API_KEY --value="hf_1234567"
tower secrets create --environment="prod" \
--name=TOWER_INFERENCE_PROVIDER --value="together"
Combining Local and Remote Inference
Tower's inference capabilities were designed to give you choices. You can use the same inference router and provider in both development and production. Alternatively, you could use local inference during development and remote, serverless inference in testing and production.
- Use Tower's --local mode to run the app on your dev machine during development
- Use Ollama to host a local inference server that the Tower app will use in local mode
- Ollama will use local GPUs (e.g., Apple Silicon) to save on inference costs during development and avoid inference rate throttling
Run in one terminal:
ollama run deepseek-r1:14b
Run in another terminal:
tower run --local --environment="dev-local" \
--parameter=xyz='123' \
--parameter=model_to_use='deepseek-r1:14b'
- Once you are done developing your app, you can deploy the app to the Tower cloud
- To maintain flexibility with inference providers, use Hugging Face Hub as the router of inference calls
- Use Together.AI as the serverless inference provider
Run in terminal:
tower run --environment="prod" \
--parameter=xyz='123' \
--parameter=model_to_use='deepseek-ai/DeepSeek-R1'
The Tower example DeepSeek-Summarize-Github demonstrates how this can be done.
Specifying Model Names
Inference providers can have different names for the same model. Model vendors usually release a family of model versions that vary in parameter count (smaller variants are often produced via distillation) and in quantization level. For example, the 3-billion-parameter, Instruct-fine-tuned model of the Llama 3.2 family is known as "llama3.2:3b" in Ollama and as "meta-llama/Llama-3.2-3B-Instruct" in Hugging Face Hub.
Tower's LLM inference was designed to let developers use LLMs across their development and production environments by specifying a short model family name, without writing conditional statements to handle differences between environments. For example, specifying llms("llama3.2") should resolve to either "llama3.2:3b" or "meta-llama/Llama-3.2-3B-Instruct", depending on the chosen inference provider.
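As a rough illustration only (not Tower's actual resolution logic), you can think of this as a lookup from a family name to a provider-specific identifier; the alias table below is a toy example:
# Illustrative only: a toy mapping from a model family name to provider-specific names.
MODEL_ALIASES = {
    "ollama": {
        "llama3.2": "llama3.2:3b",
        "deepseek-r1": "deepseek-r1:14b",
    },
    "hugging_face_hub": {
        "llama3.2": "meta-llama/Llama-3.2-3B-Instruct",
        "deepseek-r1": "deepseek-ai/DeepSeek-R1",
    },
}

def resolve(family: str, router: str) -> str:
    """Return a provider-specific model name for a model family."""
    return MODEL_ALIASES[router].get(family, family)

print(resolve("llama3.2", "ollama"))            # llama3.2:3b
print(resolve("llama3.2", "hugging_face_hub"))  # meta-llama/Llama-3.2-3B-Instruct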
Specifying Model Families
Users can specify a model family, e.g., llms("deepseek-r1"), in both local and Tower cloud environments, and Tower will resolve the model family to a particular model that is available for inference:
- In a local environment, Tower will find the installed models of that family and, if there is more than one, pick the one with the largest number of parameters
- In Tower cloud environments, Tower will take the first model returned by Hugging Face search, making sure that this model is servable by the Inference Provider
Tower currently recognizes ~170 names of model families.
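To illustrate the local resolution rule above, here is a simplified sketch (not Tower's actual implementation) that picks the largest installed variant by parsing the parameter count from Ollama-style tags:
import re

def parameter_count(tag: str) -> float:
    """Parse the parameter count in billions from an Ollama-style tag like 'deepseek-r1:14b'."""
    match = re.search(r":(\d+(?:\.\d+)?)b$", tag)
    return float(match.group(1)) if match else 0.0

# Installed variants of a family (as reported by, e.g., `ollama list`).
installed = ["deepseek-r1:1.5b", "deepseek-r1:7b", "deepseek-r1:14b"]
print(max(installed, key=parameter_count))  # deepseek-r1:14b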
Specifying Particular Models
In addition to using model family names, users can also specify a particular model in both local and Tower cloud environments:
- Locally: llms("deepseek-r1:14b") or llms("llama3.2:latest")
- Remotely: llms("deepseek-ai/DeepSeek-R1-0528") or llms("meta-llama/Llama-3.2-3B-Instruct")
Name Flexibility in Development, Precise Naming in Production
One recommended pattern is to specify a model family name in development and use a precise model name in production.
The code that supports this pattern is the same in both environments; you simply pass the model name as an app parameter or as an environment secret:
import os

model_name = os.getenv("MODEL_NAME")  # supplied as an app parameter or environment secret
llm = llms(model_name)
In the development environment, you would then set:
MODEL_NAME = "llama3.2"
# (any model of this family installed locally will do)
In the "prod" environment, you would set:
MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
# (use a particular model)
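For example, if you manage MODEL_NAME as an environment secret, it can be created with the same CLI pattern shown earlier:
tower secrets create --environment="dev-local" \
--name=MODEL_NAME --value="llama3.2"
tower secrets create --environment="prod" \
--name=MODEL_NAME --value="meta-llama/Llama-3.2-3B-Instruct"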
Learn More
For detailed information about the llms() function and Llm class methods, see the Tower SDK Reference.