The Anthropic API
Everything you actually use when shipping an app on Claude — auth, the model picker, prompt caching, extended thinking, tool use, vision, PDFs, files, batch processing, citations, and the cost levers that decide your bill.
1. Setup & first call
- Sign in at console.anthropic.com.
- Settings → API Keys → Create Key. Copy it once.
- Add billing + set a monthly usage limit.
- Export ANTHROPIC_API_KEY in your shell.
# Python
pip install "anthropic>=0.40"
# Node
npm i @anthropic-ai/sdk
from anthropic import Anthropic

client = Anthropic()
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system="You are a concise senior engineer.",
    messages=[{"role": "user", "content": "Refactor this for clarity: ..."}],
)
print(resp.content[0].text)
print(resp.usage)  # input_tokens, output_tokens, cache_*
2. Model IDs & picking the right one
| Model | ID | Best for |
|---|---|---|
| Opus 4.7 | claude-opus-4-7 | Hard reasoning, large refactors, multi-step planning. |
| Opus 4.7 (1M) | claude-opus-4-7[1m] | Same model, 1M-token context. Use for huge codebases or long docs. |
| Sonnet 4.6 | claude-sonnet-4-6 | Daily workhorse. Strong + fast + cheap. Default for most apps. |
| Haiku 4.5 | claude-haiku-4-5-20251001 | High-volume, latency-sensitive: classification, routing, autocomplete. |
See the Models page for the full comparison, benchmarks, and a decision flowchart.
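A minimal routing sketch built on the table above. The model IDs are copied from the table; the task categories and the `pick_model` helper are hypothetical names chosen for illustration.

```python
# Model IDs come from the table above; the task categories are hypothetical.
MODEL_BY_TASK = {
    "classification": "claude-haiku-4-5-20251001",
    "routing": "claude-haiku-4-5-20251001",
    "autocomplete": "claude-haiku-4-5-20251001",
    "planning": "claude-opus-4-7",
    "large_refactor": "claude-opus-4-7",
}
DEFAULT_MODEL = "claude-sonnet-4-6"  # the "daily workhorse" row


def pick_model(task_kind: str) -> str:
    """Return a model ID for a task category, falling back to Sonnet."""
    return MODEL_BY_TASK.get(task_kind, DEFAULT_MODEL)
```

Keeping the mapping in one place makes it cheap to move a task category up or down a tier once you have real quality and cost data.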
3. Prompt caching (always-on cost cut)
The single biggest cost lever. Mark any chunk of your prompt with cache_control: {"type": "ephemeral"} and Anthropic caches it for ~5 minutes. Cache hits are ~90% cheaper and faster.
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a senior engineer."},
        {
            "type": "text",
            "text": LONG_CODEBASE_CONTEXT,  # 200k tokens of docs / code
            "cache_control": {"type": "ephemeral"},  # <-- cache this
        },
    ],
    messages=[{"role": "user", "content": "What does foo() do?"}],
)
The next call within 5 minutes that has the same prefix gets the cache hit. Each hit refreshes the timer, so a prefix that is read at least once every 5 minutes can stay cached for hours.
- Cache stable prefixes. System prompt, tool definitions, big context docs. Don't try to cache the user's question.
- Put dynamic content last. Caching is prefix-based — change a byte at the start and the whole cache misses.
- Up to 4 cache breakpoints per request. Useful for layered prompts (always-cached system + sometimes-cached project context + per-call question).
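The layering advice above can be sketched as a small builder for the `system` list. `build_system` and its `(text, cache)` tuple format are hypothetical; the block shape and the limit of 4 breakpoints come from this section. This sketch only counts breakpoints in system blocks, so budget separately for any you place on tools or messages.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated above


def build_system(layers):
    """Build a `system` list from (text, cache) tuples, most stable layer first.

    Layers with cache=True get a cache breakpoint. Dynamic content should
    come last, because caching is prefix-based.
    """
    blocks = []
    for text, cache in layers:
        block = {"type": "text", "text": text}
        if cache:
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    used = sum("cache_control" in b for b in blocks)
    if used > MAX_BREAKPOINTS:
        raise ValueError(f"{used} cache breakpoints requested, limit is {MAX_BREAKPOINTS}")
    return blocks
```

Pass the result as `system=build_system([...])` and keep the per-call question in `messages`, outside any cached layer.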
4. Extended thinking
Give Claude scratchpad tokens before its final answer. Pays off on math, multi-step reasoning, and code planning. Opus, Sonnet, and Haiku 4.5 support it; older Haiku models do not.
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,  # must be larger than budget_tokens
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": HARD_PROBLEM}],
)
for block in resp.content:
    if block.type == "thinking":
        print("Thinking:", block.thinking)
    elif block.type == "text":
        print("Answer:", block.text)
- budget_tokens caps how much it thinks. 4000–16000 is the sweet spot for most tasks.
- Thinking blocks count toward your output tokens, but they're cached for free if you continue the conversation.
- For agentic workflows, pass the previous assistant turn's thinking blocks back in — it keeps continuity.
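Because thinking tokens count as output, `max_tokens` has to leave room for both the thinking budget and the answer. A small sketch of that arithmetic follows; `thinking_request` is a hypothetical helper, and the 1024-token minimum budget is my understanding of the documented floor, so verify it against the current docs.

```python
MIN_BUDGET = 1024  # assumed minimum thinking budget; confirm in the API docs


def thinking_request(budget_tokens: int, answer_tokens: int) -> dict:
    """Return max_tokens and thinking kwargs that are consistent with each other."""
    if budget_tokens < MIN_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_BUDGET}")
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return {
        # Thinking and the final answer share the same output allowance.
        "max_tokens": budget_tokens + answer_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }
```

Usage would look like `client.messages.create(model=..., messages=..., **thinking_request(10000, 4096))`.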
5. Tool use
The mechanism that turns Claude into an agent. You define tools as JSON schemas; the model picks when and how to call them.
tools = [{
    "name": "get_weather",
    "description": "Current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's it like in Tokyo?"}],
)
# If resp.stop_reason == "tool_use", run the tool and send tool_result back
- Schema is the docs. Tool selection quality is mostly a function of how clearly you describe each tool + its parameters.
- Multi-tool loops: keep calling messages.create, appending the tool results, until stop_reason is end_turn.
- Parallel tools: Claude can request multiple tool calls in one turn. Run them concurrently.
- Anthropic-defined tools available out of the box: web_search and code_execution run server-side; computer_use is defined by Anthropic, but your own code executes the actions.
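The loop described in the list above can be sketched as follows. `run_tool_loop`, `create`, and `handlers` are hypothetical names: `create` is any callable that takes the message list and returns a response (for example a lambda wrapping `client.messages.create` with your model and tools), and `handlers` maps tool names to plain Python functions. The sketch assumes response blocks expose `type`, `id`, `name`, and `input` attributes, as the SDK's tool_use blocks do.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tool_loop(create, handlers, messages, max_turns=10):
    """Call the model until it stops asking for tools, then return the last response."""
    for _ in range(max_turns):
        resp = create(messages)
        if resp.stop_reason != "tool_use":
            return resp  # end_turn, max_tokens, etc.
        calls = [b for b in resp.content if b.type == "tool_use"]
        # Claude may request several tools in one turn; run them concurrently.
        with ThreadPoolExecutor() as pool:
            outputs = list(pool.map(lambda c: handlers[c.name](**c.input), calls))
        # Echo the assistant turn back, then answer every tool_use by its ID.
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({
            "role": "user",
            "content": [
                {"type": "tool_result", "tool_use_id": c.id, "content": str(out)}
                for c, out in zip(calls, outputs)
            ],
        })
    raise RuntimeError("tool loop did not finish within max_turns")
```

The `max_turns` cap is a design choice rather than an API requirement: it keeps a confused agent from looping, and billing, indefinitely.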
6. Vision (images)
import base64
import httpx

img = base64.b64encode(httpx.get("https://.../diagram.png").content).decode()
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img}},
            {"type": "text", "text": "What does this architecture diagram describe?"},
        ],
    }],
)
Supports PNG, JPEG, GIF, WebP. Up to 100 images per request. Use URL sources ("type": "url") when the file is already public — saves you the base64 dance.
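A sketch of choosing between the two source types. `image_block` and `MEDIA_TYPES` are hypothetical names; the block shapes follow the base64 example above and the URL source mentioned in the text.

```python
import base64
from pathlib import Path

# Extension-to-media-type map for the four formats listed above.
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_block(src: str) -> dict:
    """Return an image content block: URL source for public links, base64 for local files."""
    if src.startswith(("http://", "https://")):
        return {"type": "image", "source": {"type": "url", "url": src}}
    path = Path(src)
    media_type = MEDIA_TYPES.get(path.suffix.lower())
    if media_type is None:
        raise ValueError(f"unsupported image extension: {path.suffix!r}")
    data = base64.b64encode(path.read_bytes()).decode()
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}
```

Drop the returned dict into the `content` list next to your text block.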
7. PDF input
Send PDFs directly. Claude sees text and visual layout — better than passing only OCR'd text.
import base64

with open("contract.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
            {"type": "text", "text": "Summarize the indemnity clause and flag risks."},
        ],
    }],
)
8. Files API
Upload once, reference many times. Useful for skills, large context docs, and reusable PDFs.
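file = client.beta.files.upload(
    file=("contract.pdf", open("contract.pdf", "rb"), "application/pdf"),
)
resp = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    betas=["files-api-2025-04-14"],  # Files API beta flag; check the docs for the current value
    messages=[{"role": "user", "content": [
        {"type": "document", "source": {"type": "file", "file_id": file.id}},
        {"type": "text", "text": "Summarize the indemnity clause and flag risks."},
    ]}],
)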
9. Batch API (50% off)
Submit up to 100k requests, get results within 24 hours, at half the per-token cost. The right tool for evals, backfills, bulk classification.
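batch = client.beta.messages.batches.create(
    requests=[
        {"custom_id": "row-1", "params": {"model": "claude-sonnet-4-6", "max_tokens": 1024, "messages": [...]}},
        {"custom_id": "row-2", "params": {...}},
        # ...up to 100,000
    ],
)
# Later: retrieve() returns the batch status, not the outputs.
batch = client.beta.messages.batches.retrieve(batch.id)
if batch.processing_status == "ended":
    for entry in client.beta.messages.batches.results(batch.id):
        print(entry.custom_id, entry.result.type)  # "succeeded", "errored", ...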
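For backfills it helps to generate the request list from your data rather than by hand. A minimal sketch, where `build_batches` and the `(row_id, prompt)` row format are hypothetical, and the 100,000 cap is the limit quoted above:

```python
MAX_BATCH = 100_000  # per-batch request limit quoted above


def build_batches(rows, model="claude-sonnet-4-6", max_tokens=1024, batch_size=MAX_BATCH):
    """Turn (row_id, prompt) pairs into lists of batch requests, split at batch_size.

    custom_id is how you match results back to rows, so row IDs must be unique.
    """
    requests = [
        {
            "custom_id": str(row_id),
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for row_id, prompt in rows
    ]
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]
```

Each inner list can then be passed as `requests=` to one batch-create call.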
10. Citations
Make Claude cite exact passages from documents you provided. Critical for RAG, compliance, and any UX that displays sources.
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "text", "media_type": "text/plain", "data": DOC_TEXT},
                "title": "Q3 financial report",
                "citations": {"enabled": True},
            },
            {"type": "text", "text": "What was net revenue?"},
        ],
    }],
)
The response includes citations with character ranges back into the source doc. Render them as footnotes.
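One way to render those citations as footnotes. This sketch assumes the response content has been converted to plain dicts (for example with `resp.model_dump()["content"]`) and that plain-text citations carry `cited_text`, `document_title`, `start_char_index`, and `end_char_index`; `render_with_footnotes` is a hypothetical helper.

```python
def render_with_footnotes(content_blocks):
    """Return (body_text, footnotes) with a [n] marker after each cited span."""
    body, notes = [], []
    for block in content_blocks:
        if block.get("type") != "text":
            continue
        body.append(block["text"])
        # A text block may carry zero or more citations.
        for cite in block.get("citations") or []:
            notes.append(cite)
            body.append(f"[{len(notes)}]")
    footnotes = [
        f'[{i}] {c.get("document_title")}: "{c["cited_text"]}" '
        f'(chars {c["start_char_index"]}-{c["end_char_index"]})'
        for i, c in enumerate(notes, 1)
    ]
    return "".join(body), footnotes
```

The character range lets a UI highlight the exact passage in the source document instead of only naming it.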
11. Computer use
Claude can drive a virtual desktop — move mouse, type, take screenshots. Public beta on Opus and Sonnet.
resp = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    betas=["computer-use-2025-01-24"],  # beta flag must match the tool version below
    tools=[{"type": "computer_20250124", "name": "computer", "display_width_px": 1024, "display_height_px": 768}],
    messages=[{"role": "user", "content": "Open the Settings app and turn on dark mode."}],
)
12. Streaming
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me about caching."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Use streaming for any UI surface: users see the first tokens almost immediately instead of waiting for the full response. The Node SDK has an equivalent async iterator.
13. Pricing & rate limits
| Model | Input ($/M tok) | Output ($/M tok) | Cache write | Cache read |
|---|---|---|---|---|
| Opus 4.7 | ~$15 | ~$75 | +25% | −90% |
| Sonnet 4.6 | ~$3 | ~$15 | +25% | −90% |
| Haiku 4.5 | ~$1 | ~$5 | +25% | −90% |
Indicative pricing — confirm at anthropic.com/pricing.
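To turn the table into a per-request number, a rough estimator like this can help. The prices are the indicative figures from the table above, not authoritative; `estimate_cost` is a hypothetical helper, and the usage keys are the field names I expect on `resp.usage`.

```python
# (input, output) in $ per million tokens, copied from the indicative table above.
PRICES = {
    "claude-opus-4-7": (15.0, 75.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-haiku-4-5-20251001": (1.0, 5.0),
}
CACHE_WRITE_MULT = 1.25  # +25% on the input price
CACHE_READ_MULT = 0.10   # 90% off the input price


def estimate_cost(model: str, usage: dict) -> float:
    """Estimate the dollar cost of one response from its usage counts."""
    inp, out = PRICES[model]
    total = (
        usage.get("input_tokens", 0) * inp
        + usage.get("cache_creation_input_tokens", 0) * inp * CACHE_WRITE_MULT
        + usage.get("cache_read_input_tokens", 0) * inp * CACHE_READ_MULT
        + usage.get("output_tokens", 0) * out
    )
    return total / 1_000_000
```

For example, a Sonnet call with 1,000 uncached input tokens, 200,000 cache-read tokens, and 500 output tokens comes to roughly $0.07 at these rates, against about $0.61 if the same 200,000 tokens were billed as regular input.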
Rate limits
Tiered. You start at Tier 1 (low) and auto-promote as you spend. Check Settings → Limits in the console. If you're hitting limits hard, contact sales for Tier 4+.
14. Production patterns
- Cache the system prompt aggressively. The single biggest win for any RAG-style app.
- Use Haiku for routing, Sonnet for execution, Opus only for the hard cases. Three-tier setups are normal.
- Set max_tokens deliberately. It is a required parameter, and a tight cap keeps runaway generations from racking up bills.
- Stream when you can. Even backend-to-backend, streaming lets you start downstream work earlier.
- Use the Batch API for any non-interactive bulk workload: half the cost, same quality.
- Tool definitions are prompts. Spend time on schemas; you'll get fewer wrong calls.
- Log usage on every response. You need this data for cost reporting and to tune caching.
- Handle overloaded_error with retry + jitter. Anthropic publishes a status page; subscribe.
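A generic sketch of retry with exponential backoff and full jitter. `call_with_retry` and `is_retryable` are hypothetical names; with the official SDK, `is_retryable` might check the exception for an overloaded (HTTP 529) or rate-limit status, and the client also exposes a `max_retries` option that may be enough by itself.

```python
import random
import time


def call_with_retry(fn, is_retryable, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(); on a retryable error, wait a random delay and try again.

    The delay ceiling doubles per attempt (base_delay, 2x, 4x, ...) up to
    max_delay, and the actual wait is drawn uniformly from [0, ceiling]
    so that many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so tests can inject a stub instead of actually waiting.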