跳转到内容
POST AI agent ready /v1/text/similarity

Text Similarity API - Cosine & Jaccard Scoring

Computes a numeric similarity score between two strings using either cosine similarity on term-frequency vectors or Jaccard similarity on token sets. Both methods tokenize input by matching word characters, lowercase everything, and ignore punctuation. The response rounds the score to four decimal places.

Parameters

stringrequired

The first text to compare.

stringrequired

The second text to compare.

string

Similarity algorithm to use.

Code examples

curl -X POST https://api.botoi.com/v1/text/similarity \
  -H "Content-Type: application/json" \
  -d '{"text_a":"The quick brown fox jumps over the lazy dog","text_b":"A quick brown fox leaps over a sleepy dog","method":"cosine"}'

When to use this API

Detect duplicate support tickets

Score incoming tickets against open issues so agents get a "likely duplicate" flag when similarity exceeds 0.8, cutting resolution time on repeat reports.

Deduplicate user-generated content

Compare new forum posts or product reviews against recent entries before publishing. Block near-identical spam while letting genuine paraphrases through.

Route FAQ answers in a chatbot

Match a user question against a library of canned answers and serve the top-scoring match instead of calling a more expensive LLM for every query.

Grade short-answer quiz responses

Score student submissions against a reference answer to auto-grade within a threshold and flag borderline cases for human review.

Frequently asked questions

What is the difference between cosine and Jaccard?
Cosine similarity compares term-frequency vectors, so it rewards texts that share words in similar proportions. Jaccard compares sets, so it only counts whether a word appears in both, ignoring how often. Use cosine for longer passages, Jaccard for short tags or keywords.
How is the score scaled?
Both methods return a value between 0 and 1. A score of 1 means the tokens are identical in distribution, and 0 means no words overlap. The response rounds to four decimal places.
Is this semantic similarity?
No. The scoring is lexical. "Dog" and "canine" score 0 even though they mean the same thing. For semantic matching, embed the texts with an LLM and compute cosine similarity on the vectors.
How is text tokenized?
The API lowercases input and splits on word boundaries using a Unicode-aware regex. Punctuation, emoji, and whitespace are dropped. Numbers and accented characters are preserved.
Does word order matter?
No. Both algorithms treat each text as a bag of words. Sentences with the same tokens in different orders receive the same score.

Get your API key

Free tier includes 5 requests per minute with no credit card required. Upgrade for higher limits.