How to Use the Ollama API — Python, curl & JavaScript

The Ollama terminal chat is useful, but the real power of Ollama is its REST API. Every time you type a message in the terminal, Ollama is actually processing an HTTP request under the hood — and you can send those same requests directly from your own code, scripts, and applications.

This means you can build AI-powered tools, automate text generation tasks, integrate local AI into existing projects, and create applications that run entirely offline — all for free, with no rate limits and no API keys to manage.

In this guide, I’ll walk you through Ollama’s entire API: using it with curl, Python, JavaScript/Node.js, and more. All examples are hands-on and immediately runnable.


Ollama API Basics — What You Need to Know

Ollama’s API server runs automatically when Ollama is installed. It listens at:

http://localhost:11434

It offers two API flavors:

  1. Ollama native API — Ollama’s own format, with streaming support and full feature access
  2. OpenAI-compatible API — a drop-in replacement for OpenAI’s API at /v1/ endpoints, meaning any code written for OpenAI often works with zero changes

Verify the API is running:

curl http://localhost:11434
# Response: Ollama is running
The official Ollama API documentation on GitHub — covers every endpoint with request/response examples and all supported parameters.

API Endpoints Reference

| Endpoint | Method | Purpose |
|---|---|---|
| /api/generate | POST | Generate a completion (single turn) |
| /api/chat | POST | Multi-turn chat with message history |
| /api/embeddings | POST | Generate text embeddings |
| /api/pull | POST | Download a model |
| /api/push | POST | Upload a model to the Ollama registry |
| /api/create | POST | Create a model from a Modelfile |
| /api/tags | GET | List all downloaded models |
| /api/show | POST | Show model info and Modelfile |
| /api/delete | DELETE | Delete a model |
| /api/ps | GET | List currently running models |
| /v1/chat/completions | POST | OpenAI-compatible chat endpoint |
| /v1/models | GET | OpenAI-compatible model list |
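These native endpoints can also be called over raw HTTP without the ollama package. As a minimal sketch using only Python's standard library, here is one way to list downloaded models via /api/tags (the parse_tags and list_models helper names are my own, not part of any library):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address

def parse_tags(payload):
    """Extract (name, size-in-GB) pairs from an /api/tags response body."""
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in payload.get("models", [])]

def list_models():
    """GET /api/tags and return the locally downloaded models."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
        return parse_tags(json.load(resp))

# With Ollama running locally:
#   for name, size_gb in list_models():
#       print(f"{name}: {size_gb} GB")
```

Splitting the parsing out of the HTTP call keeps the response handling reusable across scripts.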

Using the Ollama API with curl

curl is the fastest way to test any API. All examples below work on Windows (PowerShell / WSL), Mac, and Linux.

Basic completion — /api/generate

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.1",
    "prompt": "What is the capital of France?",
    "stream": false
  }'

Response:

{
  "model": "llama3.1",
  "response": "The capital of France is Paris.",
  "done": true,
  "total_duration": 1234567890,
  "eval_count": 9,
  "eval_duration": 987654321
}

Chat completion — /api/chat (multi-turn)

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.1",
    "messages": [
      { "role": "system", "content": "You are a helpful coding assistant." },
      { "role": "user", "content": "Write a Python function to reverse a string." }
    ],
    "stream": false
  }'

Streaming responses (tokens as they generate)

Remove "stream": false or set it to true — Ollama streams newline-delimited JSON:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.1",
    "prompt": "Tell me a short story about a robot learning to paint."
  }'
# Each line of output is a JSON object with "response" and "done" fields
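Because each streamed line is a self-contained JSON object, reassembling the full completion takes only a few lines. A minimal Python sketch (assemble_stream is an illustrative name, not an Ollama API):

```python
import json

def assemble_stream(lines):
    """Join the "response" fields of newline-delimited JSON chunks
    from /api/generate into the full completion text."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk also carries timing stats
    return "".join(parts)

# Example with two chunks shaped like Ollama's stream output:
chunks = [
    '{"response": "Once upon", "done": false}',
    '{"response": " a time.", "done": true}',
]
print(assemble_stream(chunks))  # Once upon a time.
```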

List available models

curl http://localhost:11434/api/tags

Check running models

curl http://localhost:11434/api/ps

Using the Ollama API with Python

The official Ollama Python library on GitHub — the cleanest way to integrate Ollama into any Python project.

There are two approaches for Python: the official ollama library or the openai library pointed at Ollama’s compatible endpoint.


Method A — Official Ollama Python Library (Recommended)

pip install ollama

Basic Chat

import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'user', 'content': 'Explain what machine learning is in simple terms.'}
    ]
)
print(response['message']['content'])

Streaming Response

import ollama

stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Write a haiku about autumn.'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()  # newline after output

Multi-turn Conversation

import ollama

conversation_history = []

def chat(user_message):
    conversation_history.append({
        'role': 'user',
        'content': user_message
    })
    
    response = ollama.chat(
        model='llama3.1',
        messages=conversation_history
    )
    
    assistant_message = response['message']['content']
    conversation_history.append({
        'role': 'assistant',
        'content': assistant_message
    })
    
    return assistant_message

# Example conversation
print(chat("My name is Sarah and I'm a software developer."))
print(chat("What's a good first project to build with AI?"))
print(chat("Can you give me a Python code skeleton for that?"))

Generate Text Embeddings

import ollama

# Get embeddings for a piece of text
result = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Machine learning is the study of algorithms that improve through experience.'
)

embedding_vector = result['embedding']
print(f"Embedding dimensions: {len(embedding_vector)}")
# Output: Embedding dimensions: 768

List and Manage Models

import ollama

# List all downloaded models
models = ollama.list()
for model in models['models']:
    size_gb = round(model['size'] / 1e9, 1)
    print(f"{model['name']}: {size_gb} GB")

# Pull a new model
ollama.pull('phi3:mini')

# Delete a model
ollama.delete('phi3:mini')

Method B — Using the OpenAI Python Library with Ollama

If your project already uses the OpenAI Python library, switch to Ollama with just two changes:

pip install openai
from openai import OpenAI

# Point the OpenAI client at your local Ollama instance
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.1',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'What are the benefits of local AI?'}
    ]
)
print(response.choices[0].message.content)

This makes migrating existing OpenAI-powered code to local Ollama extremely fast — in most cases, it’s a two-line change.
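The compatibility layer is not tied to the openai package either: any HTTP client can POST an OpenAI-format body to /v1/chat/completions. A standard-library sketch, where extract_reply and chat_v1 are my own helper names:

```python
import json
import urllib.request

def extract_reply(body):
    """Pull the assistant text out of an OpenAI-format chat response."""
    return body["choices"][0]["message"]["content"]

def chat_v1(model, messages, host="http://localhost:11434"):
    """POST an OpenAI-style request to Ollama's /v1/chat/completions."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# With Ollama running locally:
#   print(chat_v1("llama3.1", [{"role": "user", "content": "Say hello."}]))
```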


Using the Ollama API with JavaScript / Node.js

Method A — Official Ollama JavaScript Library

npm install ollama
import ollama from 'ollama';

const response = await ollama.chat({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'What is the best programming language to learn in 2026?' }]
});

console.log(response.message.content);

Streaming in JavaScript

import ollama from 'ollama';

const stream = await ollama.chat({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Write a short blog post intro about local AI.' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
console.log();

Method B — Using fetch() in JavaScript (No dependencies)

// Works in Node.js 18+ and modern browsers (if CORS allowed)
const response = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.1',
    messages: [
      { role: 'user', content: 'Summarize the benefits of running AI locally.' }
    ],
    stream: false
  })
});

const data = await response.json();
console.log(data.message.content);

Streaming with fetch() in JavaScript

const response = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.1',
    messages: [{ role: 'user', content: 'Tell me about the future of AI.' }],
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { value, done } = await reader.read();
  if (done) break;

  // A network chunk can end mid-line, so buffer until a full line arrives
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep any incomplete trailing line for the next chunk

  for (const line of lines) {
    if (!line.trim()) continue;
    const data = JSON.parse(line);
    process.stdout.write(data.message?.content || '');
  }
}

Important API Parameters Explained

Every model in the Ollama library is accessible via the API by name — run ollama list to see your locally available models.
| Parameter | Type | Default | Effect |
|---|---|---|---|
| model | string | required | Which model to use (e.g. "llama3.1") |
| stream | boolean | true | Stream tokens as they generate vs. wait for the full response |
| temperature | float | 0.8 | Creativity (0 = deterministic, 2 = very random) |
| num_ctx | int | 2048 | Context window size in tokens |
| top_p | float | 0.9 | Nucleus sampling — lower = more focused |
| top_k | int | 40 | Limits token selection — lower = more predictable |
| repeat_penalty | float | 1.1 | Penalizes repeated tokens |
| seed | int | random | Set for reproducible outputs |
| stop | array | null | Stop generation at these tokens |
| format | string | null | Set to "json" to force JSON output |
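In the native API, the sampling parameters above nest inside an "options" object, while model, prompt/messages, stream, and format stay at the top level of the request body. A sketch of building and sending such a body, where build_generate_request and generate are illustrative helper names of my own:

```python
import json
import urllib.request

def build_generate_request(model, prompt, **options):
    """Build an /api/generate body; sampling parameters nest under "options"."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}

body = build_generate_request(
    "llama3.1",
    "List three prime numbers.",
    temperature=0.2,   # near-deterministic for factual tasks
    seed=42,           # reproducible output
    num_ctx=8192,      # larger context window (uses more VRAM)
)

def generate(body, host="http://localhost:11434"):
    """POST the body to /api/generate and return the completion text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# With Ollama running locally: print(generate(body))
```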

Force JSON Output

A tremendously useful feature — set format: "json" to guarantee the model outputs valid JSON:

import json

import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[{
        'role': 'user',
        'content': '''Extract the following from this text and return as JSON:
        
        "John Smith, 34, lives in Austin, Texas and works as a software engineer at TechCorp."
        
        Return: {"name": "...", "age": ..., "city": "...", "state": "...", "company": "..."}'''
    }],
    format='json'
)

data = json.loads(response['message']['content'])
print(data)
# {'name': 'John Smith', 'age': 34, 'city': 'Austin', 'state': 'Texas', 'company': 'TechCorp'}

Practical Use Cases — Ready-to-Run Code

1. Batch Text Summarizer

import ollama

articles = [
    "The Federal Reserve raised interest rates by 0.25% today...",
    "A new breakthrough in quantum computing was announced...",
    "Three major tech companies reported quarterly earnings..."
]

summaries = []
for i, article in enumerate(articles):
    response = ollama.chat(
        model='llama3.1',
        messages=[{
            'role': 'user',
            'content': f'Summarize this in one sentence:\n\n{article}'
        }]
    )
    summaries.append(response['message']['content'])
    print(f"Article {i+1}: {summaries[-1]}")

2. Document Q&A System

import ollama

def qa_from_document(document_text: str, question: str) -> str:
    """Ask a question about a document using local AI."""
    response = ollama.chat(
        model='llama3.1',
        messages=[
            {
                'role': 'system',
                'content': 'Answer questions based only on the provided document. '
                           'If the answer is not in the document, say so clearly.'
            },
            {
                'role': 'user',
                'content': f'Document:\n\n{document_text}\n\nQuestion: {question}'
            }
        ]
    )
    return response['message']['content']

# Example usage
with open('report.txt') as f:
    doc = f.read()
answer = qa_from_document(doc, "What were the total sales figures for Q4?")
print(answer)

3. Code Review Assistant

import ollama

def review_code(code: str, language: str = "Python") -> str:
    """Get an AI code review for the given code."""
    response = ollama.chat(
        model='codellama',  # Use CodeLlama for better code understanding
        messages=[{
            'role': 'user',
            'content': f'''Review this {language} code. Point out:
1. Bugs or potential errors
2. Performance issues  
3. Security concerns
4. Style improvements

Code:
```{language.lower()}
{code}
```'''
        }]
    )
    return response['message']['content']

# Example
code = """
def get_user(user_id):
    query = "SELECT * FROM users WHERE id = " + user_id
    return db.execute(query)
"""
print(review_code(code))

4. Simple Semantic Search with Embeddings

import ollama
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embedding(text):
    result = ollama.embeddings(model='nomic-embed-text', prompt=text)
    return result['embedding']

# Build a simple knowledge base
documents = [
    "Ollama runs AI models locally on your computer",
    "Python is a popular programming language for data science",
    "Machine learning requires large amounts of training data",
    "Open WebUI provides a chat interface for Ollama"
]

doc_embeddings = [get_embedding(doc) for doc in documents]

# Search
query = "How do I get a web interface for local AI?"
query_embedding = get_embedding(query)

similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
best_match_idx = np.argmax(similarities)

print(f"Best match: {documents[best_match_idx]}")
print(f"Similarity score: {similarities[best_match_idx]:.3f}")

Allow Ollama API Access from Other Machines

By default, Ollama only listens on localhost. To expose the API to other machines on your network (e.g., accessing from a laptop while Ollama runs on a server):


Windows

Set the environment variable before starting Ollama:

# In PowerShell (temporary)
$env:OLLAMA_HOST = "0.0.0.0:11434"
ollama serve

# Or permanently via Windows System Properties → Environment Variables

Linux (systemd service)

sudo systemctl edit ollama
# Add this inside [Service]:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama

Then call the API from another machine using the server’s IP:

curl http://192.168.1.42:11434/api/generate \
  -d '{"model":"llama3.1","prompt":"Hello!","stream":false}'

Frequently Asked Questions

Does the Ollama API require authentication?

No — by default, the Ollama API has no authentication. Any application on your local machine can call it freely. If you expose it on a network, consider placing a reverse proxy such as Nginx with basic authentication in front of it for security.

Can I use the Ollama API from a browser?

Yes, but browsers enforce CORS (Cross-Origin Resource Sharing) restrictions. If you’re building a browser-based app that calls Ollama directly, set the OLLAMA_ORIGINS environment variable to allow your app’s origin: OLLAMA_ORIGINS=http://localhost:3000. For production web apps, it’s better to have your backend call Ollama rather than calling it directly from the browser.

How do I handle rate limiting?

Ollama has no rate limits — it processes requests as fast as your hardware allows. If you send multiple simultaneous requests, Ollama queues them. The bottleneck is always hardware: GPU VRAM and RAM for model loading, GPU compute for generation. For production applications, consider implementing a request queue on your application side to prevent out-of-memory errors from parallel requests.
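One way to build that client-side queue is a semaphore capping how many requests are in flight at once. A sketch only, with MAX_CONCURRENT chosen for illustration and fn standing in for whatever function calls the Ollama API:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 2                       # tune to your VRAM headroom
_slots = threading.Semaphore(MAX_CONCURRENT)

def throttled(fn, *args, **kwargs):
    """Run fn, but never with more than MAX_CONCURRENT calls in flight."""
    with _slots:
        return fn(*args, **kwargs)

def run_batch(fn, prompts):
    """Fan prompts out over a thread pool while respecting the cap."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda p: throttled(fn, p), prompts))

# Example with a stand-in worker instead of a real API call:
print(run_batch(str.upper, ["first prompt", "second prompt"]))
# ['FIRST PROMPT', 'SECOND PROMPT']
```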

Can I use the Ollama API with LangChain?

Yes, LangChain has a built-in Ollama integration:
pip install langchain-ollama

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1")
response = llm.invoke("Explain what LangChain is.")
print(response.content)

What’s the maximum context length I can use?

It depends on the model. Llama 3.1 supports up to 128k tokens in its context window. By default, Ollama uses 2048 tokens of context. Increase it with the num_ctx parameter, but be aware that longer context windows require significantly more VRAM — setting num_ctx to 32768, for example, can exceed your GPU's memory and force the model to fall back partly to CPU.


Building something with the Ollama API? Share your project in the comments — I feature interesting reader projects in my monthly roundup.

About this guide: All code examples tested with Ollama 0.6.x, Python 3.12, and Node.js 22. Examples are immediately runnable — just make sure Ollama is running and you have at least one model downloaded.
