Token counting for GPT, Claude, and Llama in one API
You send a prompt to GPT-4o and the response cuts off mid-sentence. You check your bill and find a batch job burned through $40 because the input was 3x larger than you expected. You paste a long document into Claude and get an error: context window exceeded. Every one of these problems traces back to the same root cause: you didn't know how many tokens your text contained before you sent it.
Token counting is the pre-flight check every LLM integration needs. Character count won't help you. Word count gets you in the ballpark, but tokenizers split text differently depending on the model. You need the exact count for the model you're calling.
Why character count is not token count
LLMs don't process raw characters. They break text into tokens using a tokenizer, which is a vocabulary of subword pieces trained on a large corpus. The mapping from text to tokens is non-obvious and model-specific.
Some examples that show why counting characters misleads you:
- "I can't" splits into 3 tokens in GPT-4:
I,can,'t. That's 7 characters but 3 tokens. - "antidisestablishmentarianism" is one word but 6-8 tokens depending on the model. The tokenizer breaks it into subword pieces it recognizes.
- "Hello" is 1 token. " Hello" (with leading spaces) might be 2 tokens because the whitespace gets its own token.
- Code snippets tokenize differently from prose. Curly braces, semicolons, and indentation each consume tokens. A 500-character function can easily cost 200+ tokens.
GPT models use BPE (byte-pair encoding) with the cl100k_base or o200k_base vocabulary. Claude uses a similar but distinct BPE tokenizer. Llama uses SentencePiece. The same paragraph produces different token counts across all three.
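If you only need a rough ballpark before reaching for an exact count, the common rule of thumb is about 4 characters per token, or roughly 1.3 tokens per English word. The sketch below is a minimal illustration of that heuristic, nothing more: it drifts badly for code, non-English text, and structured data, which is exactly why per-model counts matter.
// Rough heuristic only: ~4 characters per token, ~1.3 tokens per word for English prose.
// Use a real tokenizer or the count endpoint when you need the number the model will actually see.
function roughTokenEstimate(text) {
  const byChars = Math.ceil(text.length / 4);
  const byWords = Math.ceil(text.trim().split(/\s+/).length * 1.3);
  // Take the larger estimate so the guess errs on the safe side
  return Math.max(byChars, byWords);
}
console.log(roughTokenEstimate("The quick brown fox jumps over the lazy dog."));
// ~12 — a ballpark, not an exact count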
Count tokens with one API call
Send your text to the botoi /v1/token/count endpoint with the target model. The API
returns the estimated token count along with character and word counts.
curl -X POST https://api.botoi.com/v1/token/count \
-H "Content-Type: application/json" \
-d '{
"text": "The quick brown fox jumps over the lazy dog. This sentence is used to test tokenizers across different language models.",
"model": "gpt-4o"
}'
Response:
{
"success": true,
"data": {
"tokens": 24,
"model": "gpt-4o",
"method": "estimated",
"characters": 116,
"words": 20
}
}
The response tells you this 20-word sentence costs 24 tokens in GPT-4o. You also get
characters and words for quick reference. The method field
indicates the counting approach used.
Token counts by model
The same text produces different token counts depending on which model you target. The
model parameter accepts 15 models across the major families. Here's how they compare
for the same input:
| Model | Tokenizer | Context window | Tokens (same text) |
|---|---|---|---|
| gpt-4o | o200k_base (BPE) | 128K | 24 |
| gpt-3.5-turbo | cl100k_base (BPE) | 16K | 24 |
| claude-3.5-sonnet | Claude BPE | 200K | 25 |
| claude-4-opus | Claude BPE | 200K | 25 |
| llama-3.2 | SentencePiece | 128K | 24 |
| gemini-2.0-flash | SentencePiece | 1M | 24 |
| mistral | SentencePiece (BPE) | 32K | 24 |
The differences are small for short English sentences but grow as input length increases. Non-English text, code, and structured data (JSON, XML) can show larger variation. Always count tokens with the specific model you plan to call.
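To see this variation for your own input, you can loop the count endpoint over several models and collect the results. A minimal sketch, assuming you only care about a few of the supported model names; swap in whichever models you actually call:
// Count the same text against several models to compare tokenizers.
// The model list below is a sample; the endpoint accepts 15 models in total.
async function compareModels(text, models = ["gpt-4o", "claude-3.5-sonnet", "llama-3.2"]) {
  const results = {};
  for (const model of models) {
    const res = await fetch("https://api.botoi.com/v1/token/count", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text, model }),
    });
    const { data } = await res.json();
    results[model] = data.tokens;
  }
  return results;
}
// Example output for the sentence above: { "gpt-4o": 24, "claude-3.5-sonnet": 25, "llama-3.2": 24 }
console.log(await compareModels("The quick brown fox jumps over the lazy dog. This sentence is used to test tokenizers across different language models."));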
Truncate text to a token limit
When your prompt exceeds the context window, you need to trim it without breaking mid-word.
The /v1/token/truncate endpoint cuts text to a target token count at a word boundary.
curl -X POST https://api.botoi.com/v1/token/truncate \
-H "Content-Type: application/json" \
-d '{
"text": "You are a helpful assistant. Summarize the following document in three bullet points. The document discusses the impact of renewable energy adoption on global carbon emissions over the past decade, with specific focus on solar and wind installations in Europe and Southeast Asia.",
"max_tokens": 20,
"model": "claude-3.5-sonnet"
}'
Response:
{
"success": true,
"data": {
"truncated": "You are a helpful assistant. Summarize the following document in three bullet points. The",
"tokens": 18,
"was_truncated": true,
"model": "claude-3.5-sonnet",
"max_tokens": 20,
"original_tokens": 48
}
}
The original prompt was 48 tokens. The API truncated it to 18 tokens (within the 20-token budget)
at a clean word boundary. The was_truncated flag tells you whether the text was modified.
The original_tokens field shows how many tokens the full text contained.
This is useful for fitting system prompts into tight token budgets, trimming chat history to stay within the context window, and chunking documents before sending them to an embeddings API.
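For the chunking case, one possible approach (a sketch, not the only way to do it) is to call the truncate endpoint repeatedly: take the returned prefix as a chunk, slice it off the remaining text, and loop until nothing is left. This assumes the truncated text is always a leading prefix of the input, which matches the response shown above.
// Split a long document into chunks that each fit within a token budget.
// Built on /v1/token/truncate; assumes data.truncated is a prefix of the input text.
async function chunkByTokens(text, maxTokens, model = "gpt-4o") {
  const chunks = [];
  let remaining = text;
  while (remaining.length > 0) {
    const res = await fetch("https://api.botoi.com/v1/token/truncate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: remaining, max_tokens: maxTokens, model }),
    });
    const { data } = await res.json();
    if (!data.was_truncated) {
      // The rest of the text already fits in one chunk
      chunks.push(remaining);
      break;
    }
    chunks.push(data.truncated);
    remaining = remaining.slice(data.truncated.length).trimStart();
  }
  return chunks;
}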
Build a pre-flight check for LLM calls
The highest-value integration: a function that counts tokens, compares against the model's context window, and truncates if the prompt is too long. This prevents both silent truncation and API errors.
const MODEL_LIMITS = {
"gpt-4o": 128000,
"gpt-4o-mini": 128000,
"claude-3.5-sonnet": 200000,
"claude-4-sonnet": 200000,
"llama-3.2": 128000,
};
async function countTokens(text, model = "gpt-4o") {
const res = await fetch("https://api.botoi.com/v1/token/count", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ text, model }),
});
const { data } = await res.json();
return data.tokens;
}
async function truncateText(text, maxTokens, model = "gpt-4o") {
const res = await fetch("https://api.botoi.com/v1/token/truncate", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ text, max_tokens: maxTokens, model }),
});
const { data } = await res.json();
return data;
}
async function preflightCheck(prompt, model = "gpt-4o") {
const limit = MODEL_LIMITS[model];
if (!limit) throw new Error("Unknown model: " + model);
const tokens = await countTokens(prompt, model);
// Reserve 20% of the context window for the model's response
const inputBudget = Math.floor(limit * 0.8);
if (tokens <= inputBudget) {
return { safe: true, tokens, limit, model };
}
// Truncate to fit within the input budget
const result = await truncateText(prompt, inputBudget, model);
return {
safe: false,
original_tokens: tokens,
truncated_tokens: result.tokens,
truncated_text: result.truncated,
limit,
model,
};
}
// Usage
let prompt = buildPromptFromChatHistory(messages);
const check = await preflightCheck(prompt, "claude-3.5-sonnet");
if (!check.safe) {
console.log(
"Prompt truncated from " +
check.original_tokens + " to " +
check.truncated_tokens + " tokens"
);
prompt = check.truncated_text;
}
const response = await callLLM(prompt, "claude-3.5-sonnet");
The function reserves 20% of the context window for the model's response. If the input fits, it passes through unchanged. If it's too large, it gets truncated to the input budget. You always know exactly how many tokens you're sending.
Wrap this around every LLM call in your application. It adds one HTTP request (two if truncation is needed) and eliminates an entire class of production failures.
Real-world use cases
- Cost estimation before API calls. Count tokens in a batch of prompts, multiply by the model's per-token price, and know the total cost before you commit. This Node.js function does it in a few lines:
async function estimateCost(text, model = "gpt-4o") {
const tokens = await countTokens(text, model);
// Price per 1M input tokens (March 2026 pricing)
const rates = {
"gpt-4o": 2.50,
"gpt-4o-mini": 0.15,
"claude-3.5-sonnet": 3.00,
"claude-4-sonnet": 4.00,
"llama-3.2": 0.00, // self-hosted
};
const rate = rates[model] || 0;
const cost = (tokens / 1_000_000) * rate;
return {
tokens,
model,
estimated_cost_usd: cost.toFixed(6),
};
}
// Check cost before sending a large document
const estimate = await estimateCost(longDocument, "gpt-4o");
console.log(estimate);
// { tokens: 14320, model: "gpt-4o", estimated_cost_usd: "0.035800" }
- Prompt size validation. Reject or trim user-submitted prompts that exceed your application's token budget. Prevent a single long input from consuming your entire rate limit.
- Chunking documents for embeddings. Split long documents into chunks that fit within your embedding model's token limit (typically 512 or 8,192 tokens). Count tokens per chunk to ensure none exceed the limit.
- Chat history management. As conversations grow, older messages push the total token count past the context window. Count the cumulative token total after each message and drop the oldest messages when you approach the limit. A sketch of this pattern follows the list.
- CI/CD pipeline guards. Add a token count step to your deployment pipeline. If a prompt template exceeds a defined threshold, fail the build before it reaches production.
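The chat history case is worth a concrete example. A minimal sketch, assuming messages is an array of { role, content } objects with the system prompt first, reusing the countTokens helper from the pre-flight section, and using a simple "role: content" serialization (the serialization format is an assumption, not part of the API):
// Trim chat history to a token budget by dropping the oldest non-system messages first.
async function trimChatHistory(messages, maxTokens, model = "gpt-4o") {
  const serialize = (msgs) => msgs.map((m) => m.role + ": " + m.content).join("\n");
  const kept = [...messages];
  let total = await countTokens(serialize(kept), model);
  while (total > maxTokens && kept.length > 1) {
    // Keep the system prompt at index 0; drop the oldest user/assistant message
    kept.splice(1, 1);
    total = await countTokens(serialize(kept), model);
  }
  return { messages: kept, tokens: total };
}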
Key points
- Token count varies by model. GPT, Claude, and Llama tokenize the same text differently. Always specify the target model when counting.
- Two endpoints cover the full workflow. /v1/token/count tells you the size. /v1/token/truncate trims to fit. Both support 15 models.
- Pre-flight checks prevent production failures. Count tokens before every LLM call to avoid truncated responses, context window errors, and surprise costs.
- No account required. The free tier allows 5 requests per minute with no signup. Get an API key for higher volume at botoi.com/api.
The full API docs cover the complete list of supported models and additional developer utility endpoints.
Frequently asked questions
- How many tokens are in a word?
- On average, one English word equals about 1.3 tokens. Short common words like "the" or "is" are one token. Longer or uncommon words like "authentication" split into 2-4 subword tokens. The exact count depends on the model's tokenizer.
- What is a token in GPT?
- A token is a chunk of text that the model processes as a single unit. GPT models use a byte-pair encoding (BPE) tokenizer that splits text into subword pieces. Common words stay whole, while rare or long words split into smaller fragments. Punctuation and whitespace are also tokenized.
- How do I count tokens before an API call?
- Send your text to POST https://api.botoi.com/v1/token/count with an optional model parameter (gpt-4o, claude-3.5-sonnet, llama-3, etc.). The API returns the estimated token count, word count, and character count in a single response.
- Do different LLMs tokenize text the same way?
- No. GPT models use cl100k_base or o200k_base encoding. Claude uses a similar but distinct BPE tokenizer. Llama uses SentencePiece. The same sentence produces different token counts across models. Always count tokens with the specific model you plan to call.
- What happens when you exceed a model's context window?
- Most LLM APIs return an error when the input exceeds the context window. Some silently truncate the input, which can cut off critical instructions or context. Pre-checking token count and truncating to fit prevents both failure modes.
Try this API
Text Stats API — interactive playground and code examples
Start building with botoi
150+ API endpoints for lookup, text processing, image generation, and developer utilities. Free tier, no credit card.