
Text Similarity Explained: Cosine Similarity, Lexical Matching, and Embeddings

By Rui Barreira · Last updated: 13 June 2026

Text similarity is a measure of how much two pieces of text have in common. It underpins duplicate detection, plagiarism checking, search ranking, and recommendation systems. You do not need an API key or a GPU for a first-pass lexical comparison: the brevio Text Similarity Checker computes cosine similarity entirely in your browser, with no data leaving your device.

What Is Term Frequency?

Term frequency counts how often each word appears in a text. For a two-text comparison, each document becomes a vector: one dimension per unique word, with the value set to that word's count. Words that appear in both texts push the vectors closer together; words that appear in only one text push them apart.
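The vector construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the checker's actual code: the `tokenize` and `term_frequency_vectors` helpers are hypothetical names, and the tokenization rule (lowercase, split on runs of letters, digits, and apostrophes) is an assumption.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase the text and keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def term_frequency_vectors(text_a: str, text_b: str):
    """Return the shared vocabulary and one count vector per text."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # One dimension per unique word across both texts, in a fixed (sorted) order.
    vocabulary = sorted(set(counts_a) | set(counts_b))
    # A Counter returns 0 for words it has not seen, so missing words become 0.
    vector_a = [counts_a[word] for word in vocabulary]
    vector_b = [counts_b[word] for word in vocabulary]
    return vocabulary, vector_a, vector_b


vocabulary, vector_a, vector_b = term_frequency_vectors(
    "the cat sat on the mat", "the dog sat on the log"
)
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vector_a)    # [1, 0, 0, 1, 1, 1, 2]
print(vector_b)    # [0, 1, 1, 0, 1, 1, 2]
```

The two vectors agree on "on", "sat", and "the" and differ on the other four dimensions, which is exactly the overlap a lexical comparison measures.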

What Is Cosine Similarity?

Once you have a term-frequency vector for each text, you compare them using cosine similarity. Imagine each vector as an arrow pointing through a high-dimensional space: one dimension per unique word across both texts. The cosine similarity is the cosine of the angle between the two arrows:

  • Angle of 0 degrees: cosine = 1.0, the same words in the same proportions
  • Angle of 90 degrees: cosine = 0.0, no shared terms at all

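The formula is the dot product of the vectors divided by the product of their magnitudes. Because of that division, only the direction of each vector matters, not its length: repeating a text twice doubles every count but leaves the score unchanged. Word counts are never negative, so the score always falls between 0.0 and 1.0.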
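As a rough sketch of that formula in Python, assuming simple lowercase whitespace tokenization (the checker's own tokenizer may differ) and a hypothetical `cosine_similarity` helper:

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    # Assumption: lowercase whitespace tokenization.
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product: only words present in both texts contribute.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    # Magnitude: square root of the sum of squared counts.
    magnitude_a = math.sqrt(sum(c * c for c in counts_a.values()))
    magnitude_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0  # Convention chosen here: an empty text matches nothing.
    return dot / (magnitude_a * magnitude_b)


score = cosine_similarity("the cat sat on the mat", "the dog sat on the log")
print(round(score, 3))  # 0.75
```

In this example the shared words give a dot product of 1 + 1 + 4 = 6, and each vector has a magnitude of √8, so the score is 6 / 8 = 0.75.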

Lexical vs Semantic Similarity

Term-frequency cosine similarity is lexical: it compares word forms directly. If text A uses "car" and text B uses "automobile", that pair contributes nothing to the score, even though a human reader knows the two words mean nearly the same thing. This is the fundamental limitation of lexical methods.
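The "car" versus "automobile" case is easy to reproduce. The sketch below assumes lowercase whitespace tokenization and uses a hypothetical `cosine_similarity` helper; the example sentences are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    # Assumption: lowercase whitespace tokenization.
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# The synonyms are different strings, so they share no dimension at all.
synonyms_only = cosine_similarity("car", "automobile")
# Swapping one word for its synonym in a four-word sentence costs a quarter of the score.
one_word_swapped = cosine_similarity("i sold my car", "i sold my automobile")

print(synonyms_only)     # 0.0
print(one_word_swapped)  # 0.75
```

The second pair says the same thing, yet it scores lower than an exact copy would, purely because of one vocabulary difference.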

Semantic similarity methods address this with neural embeddings. Each text is converted to a dense numeric vector by a model that encodes meaning, not just exact words. That lets "cancel subscription" and "account termination" score as similar even though their vocabulary differs.

When Lexical Similarity Is Enough

Lexical methods are fast, free, deterministic, and run without any external service. They work well for:

  • Duplicate detection. Near-identical copies share most of the same words.
  • Near-duplicate filtering. Crawled web content often contains pages with only minor wording changes.
  • Plagiarism checking. Copied text with small edits still scores high lexically.
  • Search within a controlled corpus. Technical specs and legal contracts often use consistent terminology.
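Near-duplicate filtering, for instance, can be sketched as a threshold on the cosine score. Everything here is illustrative: the `filter_near_duplicates` helper, the 0.9 threshold, and the sample pages are assumptions, and a real pipeline would tune the threshold on its own data.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    # Assumption: lowercase whitespace tokenization.
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def filter_near_duplicates(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a text only if it scores below the threshold against every text kept so far."""
    kept: list[str] = []
    for text in texts:
        if all(cosine_similarity(text, other) < threshold for other in kept):
            kept.append(text)
    return kept


pages = [
    "shipping is free on orders over fifty dollars",
    "shipping is free on all orders over fifty dollars",  # one word added
    "returns are accepted within thirty days",
]
unique_pages = filter_near_duplicates(pages)
print(len(unique_pages))  # 2
```

The second page scores about 0.94 against the first and is dropped, while the third shares no words with it and is kept. Comparing every text against every kept text is quadratic in the worst case, which is acceptable for small batches but not for a large crawl.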

When You Need Embeddings

Embedding-based similarity is the better choice when you need to:

  • Match queries to answers that use different vocabulary.
  • Group semantically related content.
  • Compare short texts.
  • Support cross-language matching.

Comparison Table

Method | Type | Speed | Semantic? | Best For
Term-frequency cosine | Lexical | Instant in-browser | No | Duplicate detection, quick comparison
BM25 | Lexical | Fast server-side | No | Document retrieval, search ranking
Sentence Transformers | Semantic, open source | Depends on model and hardware | Yes | Production RAG, semantic search
Hosted embeddings API | Semantic, API | Network request per batch | Yes | High-accuracy production pipelines

DevTools Verification

Open DevTools, switch to the Network tab, filter for Fetch/XHR, and paste text into both boxes. No outbound requests fire during comparison. Your text stays in the browser and is not written to localStorage, cookies, or the URL.

Frequently Asked Questions

What is cosine similarity?
Cosine similarity is the cosine of the angle between two vectors in a high-dimensional space. For term-frequency vectors, a score of 1.0 means the vectors point in the same direction; a score of 0 means the texts share no terms.

What is the difference between lexical and embedding similarity?
Lexical similarity compares exact word overlap. Embedding similarity compares meaning, so synonyms and related phrases can score as similar even when they share few words.

When should I use lexical similarity instead of embeddings?
Use lexical similarity for duplicate detection, near-duplicate filtering, plagiarism checks, and controlled corpora where vocabulary is consistent.

Does brevio send text to an embeddings API?
No. The Text Similarity Checker uses in-browser term-frequency cosine similarity and makes no API calls.