The Anthropic API

Everything you actually use when shipping an app on Claude — auth, the model picker, prompt caching, extended thinking, tool use, vision, PDFs, files, batch processing, citations, and the cost levers that decide your bill.

1. Setup & first call

Sign in at console.anthropic.com.
Settings → API Keys → Create Key. Copy it once.
Add billing + set a monthly usage limit.
Export ANTHROPIC_API_KEY in your shell.

# Python
pip install "anthropic>=0.40"
# Node
npm i @anthropic-ai/sdk

from anthropic import Anthropic
client = Anthropic()

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system="You are a concise senior engineer.",
    messages=[{"role": "user", "content": "Refactor this for clarity: ..."}],
)
print(resp.content[0].text)
print(resp.usage)  # input_tokens, output_tokens, cache_*

2. Model IDs & picking the right one

Model	ID	Best for
Opus 4.7	`claude-opus-4-7`	Hard reasoning, large refactors, multi-step planning.
Opus 4.7 (1M)	`claude-opus-4-7[1m]`	Same model, 1M-token context. Use for huge codebases or long docs.
Sonnet 4.6	`claude-sonnet-4-6`	Daily workhorse. Strong + fast + cheap. Default for most apps.
Haiku 4.5	`claude-haiku-4-5-20251001`	High-volume, latency-sensitive: classification, routing, autocomplete.

See the Models page for the full comparison, benchmarks, and a decision flowchart.

Default to Sonnet. Upgrade to Opus only when Sonnet visibly struggles. Downgrade to Haiku when the task is bounded and a million calls/day is the norm.
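The rule above can be written down as a small routing function. This is only a sketch: the `Task` fields are hypothetical flags invented for illustration, and only the model IDs come from the table.

```python
from dataclasses import dataclass

# Model IDs copied from the table above.
HAIKU = "claude-haiku-4-5-20251001"
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-7"


@dataclass
class Task:
    """Hypothetical description of a workload (fields are assumptions)."""
    bounded: bool = False           # small, well-defined output: label, route, short completion
    high_volume: bool = False       # very high call volume or a tight latency budget
    sonnet_struggled: bool = False  # you have seen Sonnet fail on this kind of task


def pick_model(task: Task) -> str:
    """Default to Sonnet; move up or down only when there is a reason."""
    if task.sonnet_struggled:
        return OPUS
    if task.bounded and task.high_volume:
        return HAIKU
    return SONNET
```

Keeping the decision in one function makes it easy to log which tier handled each request and to revisit the thresholds later.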

3. Prompt caching (always-on cost cut)

The single biggest cost lever. Mark any chunk of your prompt with cache_control: {"type": "ephemeral"} and Anthropic caches it for ~5 minutes. Cache hits are ~90% cheaper and faster.

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a senior engineer."},
        {
            "type": "text",
            "text": LONG_CODEBASE_CONTEXT,            # 200k tokens of docs / code
            "cache_control": {"type": "ephemeral"},  # <-- cache this
        },
    ],
    messages=[{"role": "user", "content": "What does foo() do?"}],
)

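The next call within 5 minutes that has the same prefix gets the cache hit. Each hit refreshes the 5-minute TTL, so a prefix that is read steadily stays warm for as long as the traffic continues. Once calls pause for longer than the TTL, the next request pays the cache-write price again.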

Cache stable prefixes. System prompt, tool definitions, big context docs. Don't try to cache the user's question.
Put dynamic content last. Caching is prefix-based — change a byte at the start and the whole cache misses.
Up to 4 cache breakpoints per request. Useful for layered prompts (always-cached system + sometimes-cached project context + per-call question).
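A layered prompt like the one described above could be assembled with a helper along these lines. It is a sketch: `build_system` and its `(text, cacheable)` input shape are made up for this example, and it only counts breakpoints in the system prompt, while the limit of 4 applies to the whole request.

```python
MAX_BREAKPOINTS = 4  # limit stated above; applies to the whole request


def build_system(layers):
    """Turn (text, cacheable) pairs into system blocks, most stable layer first.

    A breakpoint caches everything up to and including its block, so the
    order of layers matters: stable content goes first, volatile content last.
    """
    blocks = []
    for text, cacheable in layers:
        block = {"type": "text", "text": text}
        if cacheable:
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    used = sum("cache_control" in b for b in blocks)
    if used > MAX_BREAKPOINTS:
        raise ValueError(f"{used} cache breakpoints requested, at most {MAX_BREAKPOINTS} allowed")
    return blocks


system = build_system([
    ("You are a senior engineer.", True),      # identical on every call
    ("<project context placeholder>", True),   # changes only when the project changes
])
```

The per-call question then goes in `messages`, after both cached layers. Very short prefixes may fall below the minimum cacheable length, so real layers need to be substantially longer than these placeholders.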

Rule of thumb: if you're sending the same 10k+ tokens twice in five minutes, you're losing money by not caching.
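The arithmetic behind that rule of thumb, as a sketch. It assumes the multipliers from the pricing table in section 13 (cache write +25%, cache read −90%) and Sonnet's indicative ~$3 per million input tokens; `prefix_cost` is a hypothetical helper, not part of the SDK.

```python
def prefix_cost(tokens, calls, price_per_mtok, cached):
    """Input cost in dollars of sending the same prefix `calls` times within the TTL."""
    base = tokens / 1_000_000 * price_per_mtok
    if not cached:
        return base * calls
    write = base * 1.25                 # first call writes the cache (+25%)
    reads = base * 0.10 * (calls - 1)   # later calls read it (-90%)
    return write + reads


# 10k-token prefix sent twice at ~$3 per million input tokens
uncached = prefix_cost(10_000, 2, 3.0, cached=False)  # about $0.060
cached = prefix_cost(10_000, 2, 3.0, cached=True)     # about $0.0405
```

With a single call, caching costs slightly more than not caching because of the write premium. It pays for itself from the second call onward.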

4. Extended thinking

Give Claude scratchpad tokens before its final answer. Pays off on math, multi-step reasoning, and code planning. Opus, Sonnet, and Haiku 4.5 all support it; older Haiku models do not.

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,  # must be larger than budget_tokens; thinking counts against it
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": HARD_PROBLEM}],
)

for block in resp.content:
    if block.type == "thinking":
        print("Thinking:", block.thinking)
    elif block.type == "text":
        print("Answer:", block.text)

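budget_tokens caps how much it thinks and must be smaller than max_tokens. 4000–16000 is the sweet spot for most tasks.
Thinking tokens are billed as output tokens. Thinking blocks from earlier turns are typically stripped from context on later turns, so you are not billed for them again as input; check the extended thinking docs for your model.
For agentic workflows with tool use, pass the assistant turn's thinking blocks back unmodified alongside the tool_result. It keeps the reasoning continuous across the tool call.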
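Because thinking tokens count against the output limit, `budget_tokens` has to stay below `max_tokens`. A small helper can derive both from the budget and the room you want for the final answer. This is a sketch: `thinking_request_params` is a hypothetical name, and the 1024-token minimum is the documented floor at the time of writing, so treat it as an assumption.

```python
MIN_THINKING_BUDGET = 1024  # assumed minimum; confirm against the current docs


def thinking_request_params(budget_tokens, answer_tokens):
    """Return max_tokens and thinking settings that leave room for the final answer."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return {
        "max_tokens": budget_tokens + answer_tokens,  # always greater than the budget
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }


params = thinking_request_params(budget_tokens=10_000, answer_tokens=4_096)
# Usage, assuming `client` from section 1:
# client.messages.create(model="claude-opus-4-7", messages=[...], **params)
```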

5. Tool use

The mechanism that turns Claude into an agent. You define tools as JSON schemas; the model picks when and how to call them.

tools = [{
    "name": "get_weather",
    "description": "Current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's it like in Tokyo?"}],
)

# If resp.stop_reason == "tool_use", run the tool and send tool_result back

Schema is the docs. Tool selection quality is mostly a function of how clearly you describe each tool + its parameters.
Multi-tool loops — keep calling messages.create, appending the tool results, until stop_reason is end_turn.
Parallel tools — Claude can request multiple tool calls in one turn. Run them concurrently.
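Anthropic-defined tools available out of the box: web_search and code_execution run server-side on Anthropic's platform. computer_use is also predefined, but your application executes the actions (see section 11).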
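The multi-tool loop described above can be sketched as follows. `run_tool_loop` and the `handlers` mapping are hypothetical names, and the fake client at the bottom is a scripted stand-in so the loop can be exercised without network access. For simplicity the sketch runs tool calls one after another; a thread pool could run them concurrently.

```python
from types import SimpleNamespace as NS


def run_tool_loop(client, model, tools, handlers, messages, max_tokens=1024, max_turns=10):
    """Call messages.create until the model stops asking for tools.

    `handlers` maps a tool name to a Python callable that takes the tool
    input as keyword arguments.
    """
    messages = list(messages)
    for _ in range(max_turns):
        resp = client.messages.create(
            model=model, max_tokens=max_tokens, tools=tools, messages=messages
        )
        if resp.stop_reason != "tool_use":
            return resp, messages
        # Echo the assistant turn back, then answer every tool_use block in one user turn.
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            result = {"type": "tool_result", "tool_use_id": block.id}
            try:
                result["content"] = str(handlers[block.name](**block.input))
            except Exception as exc:  # report the failure to the model instead of crashing
                result["content"] = f"error: {exc}"
                result["is_error"] = True
            results.append(result)
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop did not finish within max_turns")


class FakeMessages:
    """Scripted stand-in for client.messages: one tool call, then a final answer."""

    def __init__(self):
        self.calls = 0

    def create(self, **kwargs):
        self.calls += 1
        if self.calls == 1:
            call = NS(type="tool_use", id="t1", name="get_weather", input={"city": "Tokyo"})
            return NS(stop_reason="tool_use", content=[call])
        return NS(stop_reason="end_turn", content=[NS(type="text", text="Mild and clear.")])


fake_client = NS(messages=FakeMessages())
final, history = run_tool_loop(
    fake_client,
    "claude-sonnet-4-6",
    tools=[],
    handlers={"get_weather": lambda city: f"{city}: 21C"},
    messages=[{"role": "user", "content": "What's it like in Tokyo?"}],
)
```

The `max_turns` cap is a design choice worth keeping in production: it stops a confused model from looping on tool calls forever.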

6. Vision (images)

import base64, httpx
img = base64.b64encode(httpx.get("https://.../diagram.png").content).decode()

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img}},
            {"type": "text", "text": "What does this architecture diagram describe?"},
        ],
    }],
)

Supports PNG, JPEG, GIF, WebP. Up to 100 images per request. Use URL sources ("type": "url") when the file is already public — saves you the base64 dance.
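A helper that picks between the URL and base64 forms could look like this sketch. `image_block` is a hypothetical name; the block shapes follow the example above and the URL source mentioned in the text.

```python
import base64
import os

# Extensions for the four supported formats listed above.
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_block(source, data=None):
    """Build an image content block.

    Pass a public URL on its own, or a file name plus its raw bytes
    to get a base64 block with the media type inferred from the extension.
    """
    if data is None:
        if not source.startswith(("http://", "https://")):
            raise ValueError("expected an http(s) URL when no bytes are given")
        return {"type": "image", "source": {"type": "url", "url": source}}
    ext = os.path.splitext(source)[1].lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"unsupported image extension: {ext!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": MEDIA_TYPES[ext], "data": encoded},
    }
```

Usage: `image_block("https://example.com/diagram.png")` for a public file, or `image_block("diagram.png", open("diagram.png", "rb").read())` for a local one.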

7. PDF input

Send PDFs directly. Claude sees text and visual layout — better than passing only OCR'd text.

import base64
from pathlib import Path

# read_bytes() opens and closes the file for us
pdf_b64 = base64.b64encode(Path("contract.pdf").read_bytes()).decode()

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
            {"type": "text", "text": "Summarize the indemnity clause and flag risks."},
        ],
    }],
)

8. Files API

Upload once, reference many times. Useful for skills, large context docs, and reusable PDFs.

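# The Files API is in beta; the flag below is the one documented at the time of writing.
with open("contract.pdf", "rb") as f:
    file = client.beta.files.upload(file=("contract.pdf", f, "application/pdf"))

resp = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required on every request
    betas=["files-api-2025-04-14"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "file", "file_id": file.id}},
            {"type": "text", "text": "Summarize the indemnity clause and flag risks."},
        ],
    }],
)
# Skills are created and attached through the separate Skills endpoints; see the Skills docs.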

9. Batch API (50% off)

Submit up to 100k requests, get results within 24 hours, at half the per-token cost. The right tool for evals, backfills, bulk classification.

batch = client.beta.messages.batches.create(
    requests=[
        {"custom_id": "row-1", "params": {"model": "claude-sonnet-4-6", "max_tokens": 1024, "messages": [...]}},
        {"custom_id": "row-2", "params": {...}},
        # ...up to 100,000
    ],
)

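# later: retrieve() returns the batch status, not the results
batch = client.beta.messages.batches.retrieve(batch.id)
if batch.processing_status == "ended":
    for entry in client.beta.messages.batches.results(batch.id):
        print(entry.custom_id, entry.result.type)  # succeeded, errored, canceled, expired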
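Batch results are not guaranteed to come back in submission order, so match them to your rows by `custom_id`. The sketch below builds requests from rows and splits results into successes and failures. `build_requests` and `collect_results` are hypothetical helpers, and the entries at the bottom are stand-in objects shaped like result entries, not real API output.

```python
from types import SimpleNamespace as NS


def build_requests(rows, model="claude-sonnet-4-6", max_tokens=1024):
    """Build one batch request per (custom_id, prompt) pair."""
    return [
        {
            "custom_id": custom_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for custom_id, prompt in rows
    ]


def collect_results(entries):
    """Split result entries into {custom_id: text} and {custom_id: failure type}."""
    ok, failed = {}, {}
    for entry in entries:
        if entry.result.type == "succeeded":
            blocks = entry.result.message.content
            ok[entry.custom_id] = "".join(b.text for b in blocks if b.type == "text")
        else:  # errored, canceled, or expired
            failed[entry.custom_id] = entry.result.type
    return ok, failed


requests = build_requests([("row-1", "Classify: great product"), ("row-2", "Classify: broke in a day")])

# Dry run with stand-in entries, deliberately out of order.
fake_entries = [
    NS(custom_id="row-2", result=NS(type="errored")),
    NS(custom_id="row-1", result=NS(type="succeeded",
                                    message=NS(content=[NS(type="text", text="positive")]))),
]
ok, failed = collect_results(fake_entries)
```

Keeping the failures in their own dict makes it straightforward to resubmit only those rows in a follow-up batch.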

10. Citations

Make Claude cite exact passages from documents you provided. Critical for RAG, compliance, and any UX that displays sources.

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "text", "media_type": "text/plain", "data": DOC_TEXT},
                "title": "Q3 financial report",
                "citations": {"enabled": True},
            },
            {"type": "text", "text": "What was net revenue?"},
        ],
    }],
)

The response includes citations with character ranges back into the source doc. Render them as footnotes.
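Rendering footnotes could look like the sketch below. It assumes each text block may carry a `citations` list whose items expose `document_title` and `cited_text`; confirm the field names against the citations docs. `render_with_footnotes` is a hypothetical helper, and the blocks at the bottom are made-up stand-ins, not real output.

```python
from types import SimpleNamespace as NS


def render_with_footnotes(content_blocks):
    """Return (answer_text_with_[n]_markers, list_of_footnote_strings)."""
    parts, notes = [], []
    for block in content_blocks:
        if block.type != "text":
            continue
        parts.append(block.text)
        for cit in getattr(block, "citations", None) or []:
            notes.append(f'[{len(notes) + 1}] {cit.document_title}: "{cit.cited_text}"')
            parts.append(f"[{len(notes)}]")  # marker goes right after the cited text
    return "".join(parts), notes


# Stand-in blocks with invented values, shaped like a cited response.
blocks = [
    NS(type="text", text="Net revenue was ", citations=None),
    NS(type="text", text="$4.2M", citations=[
        NS(document_title="Q3 financial report", cited_text="Net revenue: $4.2M",
           start_char_index=120, end_char_index=138),
    ]),
    NS(type="text", text=".", citations=None),
]
answer, footnotes = render_with_footnotes(blocks)
```

The character indexes are unused here, but a UI could use them to highlight the cited span in the source document.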

11. Computer use

Claude can drive a virtual desktop: it requests mouse moves, keystrokes, and screenshots, and your code executes them and sends back the result. Public beta on Opus and Sonnet.

resp = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    betas=["computer-use-2025-01-24"],  # beta flag must match the tool version below; check the docs for your model
    tools=[{"type": "computer_20250124", "name": "computer", "display_width_px": 1024, "display_height_px": 768}],
    messages=[{"role": "user", "content": "Open the Settings app and turn on dark mode."}],
)

Run this in a sandboxed VM. Computer use means Claude can click anything. Treat it like a junior intern with admin access — useful, scary, contained.

12. Streaming

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me about caching."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Use streaming for any UI surface: users see the first tokens almost immediately instead of waiting for the full response. The Node SDK has an equivalent async iterator.
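To see what streaming buys you, measure time to first chunk rather than total time. The helper below is a sketch that works on any iterator of text chunks, so it can wrap `stream.text_stream` from the example above. `consume_stream` is a hypothetical name.

```python
import time


def consume_stream(chunks, on_text=print):
    """Forward chunks as they arrive; return (full_text, seconds_to_first_chunk).

    seconds_to_first_chunk is None if the stream produced nothing.
    """
    start = time.monotonic()
    first_chunk_after = None
    parts = []
    for chunk in chunks:
        if first_chunk_after is None:
            first_chunk_after = time.monotonic() - start
        on_text(chunk)
        parts.append(chunk)
    return "".join(parts), first_chunk_after


# Dry run on a plain iterator; with the SDK you would pass stream.text_stream instead.
text, ttft = consume_stream(iter(["Caching ", "is ", "prefix-based."]), on_text=lambda s: None)
```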

13. Pricing & rate limits

Model	Input ($/M tok)	Output ($/M tok)	Cache write	Cache read
Opus 4.7	~$15	~$75	+25%	−90%
Sonnet 4.6	~$3	~$15	+25%	−90%
Haiku 4.5	~$1	~$5	+25%	−90%

Indicative pricing — confirm at anthropic.com/pricing.
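The table translates into a per-request cost estimate from the fields on `resp.usage`. This is a sketch using the indicative prices above, so the dollar figures are only as accurate as the table. `estimate_cost` is a hypothetical helper.

```python
# (input $/M tokens, output $/M tokens), indicative figures from the table above
PRICES = {
    "claude-opus-4-7": (15.0, 75.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-haiku-4-5-20251001": (1.0, 5.0),
}


def estimate_cost(model, input_tokens=0, output_tokens=0,
                  cache_creation_input_tokens=0, cache_read_input_tokens=0):
    """Estimate the dollar cost of one response from its usage counters."""
    input_price, output_price = PRICES[model]
    total = (
        input_tokens * input_price                            # uncached input
        + cache_creation_input_tokens * input_price * 1.25    # cache write, +25%
        + cache_read_input_tokens * input_price * 0.10        # cache read, -90%
        + output_tokens * output_price
    )
    return total / 1_000_000


# 1k fresh input tokens, 10k tokens read from cache, 500 output tokens on Sonnet
cost = estimate_cost("claude-sonnet-4-6", input_tokens=1_000,
                     cache_read_input_tokens=10_000, output_tokens=500)
```

Cached tokens are reported separately from `input_tokens`, which is why the three input terms are summed rather than nested.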

Rate limits

Tiered. You start at Tier 1 (low) and auto-promote as you spend. Check Settings → Limits in the console. If you're hitting limits hard, contact sales for Tier 4+.

14. Production patterns

Cache the system prompt aggressively. The single biggest win for any RAG-style app.
Use Haiku for routing, Sonnet for execution, Opus only for the hard cases. Three-tier setups are normal.
Set max_tokens deliberately. The API requires it, and a tight cap stops runaway generations from racking up bills.
Stream when you can. Even backend-to-backend, streaming lets you start downstream work earlier.
Use the Batch API for any non-interactive bulk workload — half the cost, same quality.
Tool definitions are prompts. Spend time on schemas; you'll get fewer wrong calls.
Log usage on every response. You need this data for cost reporting and to tune caching.
Handle overloaded_error (HTTP 529) and rate-limit errors (HTTP 429) with retry + jitter. Anthropic publishes a status page; subscribe.
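Retry with jitter, as a minimal sketch. The official SDKs already retry some failures on their own, so a wrapper like this is mainly for cases where you want explicit control. `with_retries` and `is_retryable` are hypothetical names, and the flaky function at the bottom simulates failures instead of calling the API.

```python
import random
import time


def with_retries(call, is_retryable, max_attempts=5, base_delay=0.5,
                 max_delay=30.0, sleep=time.sleep):
    """Run call(); on retryable errors, back off exponentially with full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter spreads out retry bursts


# Dry run: fail twice with a simulated overload, then succeed.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("overloaded_error")
    return "ok"


waits = []  # record delays instead of actually sleeping
result = with_retries(flaky, lambda exc: "overloaded_error" in str(exc), sleep=waits.append)
```

In real code, `is_retryable` would more likely inspect the SDK's exception type or HTTP status code than match on the message string.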
