Text Similarity Explained: Cosine Similarity, Lexical Matching, and Embeddings
By Rui Barreira · Last updated: 13 June 2026
Text similarity is a measure of how much two pieces of text have in common. It underpins duplicate detection, plagiarism checking, search ranking, and recommendation systems. You do not need an API key or a GPU for a first-pass lexical comparison: the brevio Text Similarity Checker runs cosine similarity entirely in your browser, with no data leaving your device.
What Is Term Frequency?
Term frequency counts how often each word appears in a text. For a two-text comparison, each document becomes a vector: one dimension per unique word, with the value set to that word's count. Words that appear in both texts push the vectors closer together; words that appear in only one text push them apart.
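As a rough illustration of the idea above, here is a minimal Python sketch that turns two texts into aligned count vectors. The tokenizer (lowercase, split on anything that is not a letter, digit, or apostrophe) and the function names are assumptions made for this example; the brevio tool may tokenize differently.

```python
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count how often each lowercase word appears in the text."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def to_vectors(text_a: str, text_b: str):
    """Build two aligned count vectors over the combined vocabulary."""
    tf_a, tf_b = term_frequencies(text_a), term_frequencies(text_b)
    # One dimension per unique word across both texts, in sorted order.
    vocabulary = sorted(set(tf_a) | set(tf_b))
    # A Counter returns 0 for words it has never seen.
    vec_a = [tf_a[word] for word in vocabulary]
    vec_b = [tf_b[word] for word in vocabulary]
    return vocabulary, vec_a, vec_b


vocabulary, vec_a, vec_b = to_vectors("the cat sat", "the cat ran")
print(vocabulary)  # ['cat', 'ran', 'sat', 'the']
print(vec_a)       # [1, 0, 1, 1]
print(vec_b)       # [1, 1, 0, 1]
```

The shared words ("cat", "the") have non-zero counts in both vectors, while "sat" and "ran" each appear in only one.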
What Is Cosine Similarity?
Once you have a term-frequency vector for each text, you compare them using cosine similarity. Imagine each vector as an arrow pointing through a high-dimensional space: one dimension per unique word across both texts. The cosine similarity is the cosine of the angle between the two arrows:
- Angle of 0 degrees: cosine = 1.0, the texts use the same words in the same proportions
- Angle of 90 degrees: cosine = 0.0, no shared terms at all
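The formula is the dot product of the two vectors divided by the product of their magnitudes. Dividing by the magnitudes cancels out overall length, so only direction matters: a text compared with the same text repeated twice still scores 1.0. Because word counts are never negative, term-frequency scores always fall between 0 and 1.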
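The formula can be sketched in a few lines of Python. This is an illustrative implementation, not the tool's actual code; the tokenizer and the choice to return 0.0 for an empty text are assumptions.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    # Assumption: lowercase words made of letters, digits, or apostrophes.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a, tf_b = term_frequencies(text_a), term_frequencies(text_b)
    # Dot product: only words present in both texts contribute.
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    # Magnitude (Euclidean length) of each vector.
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Assumption: an empty text is similar to nothing.
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("the cat sat", "the cat ran"), 3))              # 0.667
print(round(cosine_similarity("the cat sat", "the cat sat the cat sat"), 3))  # 1.0
print(round(cosine_similarity("apple pie", "blue sky"), 3))                   # 0.0
```

The second call shows the length normalization at work: repeating a text doubles every count but leaves the direction of the vector unchanged.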
Lexical vs Semantic Similarity
Term-frequency cosine similarity is lexical: it compares word forms directly. If text A uses "car" and text B uses "automobile", the two words land in different dimensions and contribute nothing to the dot product, so the score comes out lower than a human reader, who knows they mean nearly the same thing, would expect. This is the fundamental limitation of lexical methods.
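A small sketch makes the limitation concrete. The `lexical_score` helper below is a hypothetical name for a compact term-frequency cosine, using an assumed simple tokenizer. Phrases that mean similar things but share no words score exactly zero.

```python
import math
import re
from collections import Counter


def lexical_score(text_a: str, text_b: str) -> float:
    """Term-frequency cosine similarity (compact illustrative version)."""
    tf_a = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    tf_b = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm = math.sqrt(sum(c * c for c in tf_a.values())) * math.sqrt(
        sum(c * c for c in tf_b.values())
    )
    return dot / norm if norm else 0.0


# Synonyms occupy different dimensions, so nothing overlaps.
print(lexical_score("cheap car rental", "affordable automobile hire"))  # 0.0
print(lexical_score("cancel subscription", "account termination"))      # 0.0

# Shared word forms are the only thing that raises the score.
print(round(lexical_score("cancel subscription", "cancel my subscription"), 3))  # 0.816
```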
Semantic similarity methods address this using neural embeddings. Each text is converted to a dense numeric vector by a model that encodes meaning, not just exact words. That lets "cancel subscription" and "account termination" score as similar even though their vocabulary differs.
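The comparison step is the same cosine calculation; only the vectors change. The sketch below assumes you already have some embedding function and passes it in as the hypothetical `embed` parameter, so it is not tied to any particular model or API.

```python
import math
from typing import Callable, Sequence


def dense_cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Assumption: a zero vector is similar to nothing.
    return dot / (norm_u * norm_v)


def embedding_similarity(
    text_a: str,
    text_b: str,
    embed: Callable[[str], Sequence[float]],  # hypothetical: any text-to-vector function
) -> float:
    """Embed both texts with the supplied function, then compare the vectors."""
    return dense_cosine(embed(text_a), embed(text_b))
```

Unlike word counts, embedding components can be negative, so these scores can range from -1 to 1 rather than 0 to 1. The quality of the result depends entirely on the model behind `embed`.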
When Lexical Similarity Is Enough
Lexical methods are fast, free, deterministic, and run without any external service. They work well for:
- Duplicate detection. Near-identical copies share most of the same words.
- Near-duplicate filtering. Crawled web content often contains pages with only minor wording changes.
- Plagiarism checking. Copied text with small edits still scores high lexically.
- Search within a controlled corpus. Technical specs and legal contracts often use consistent terminology.
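As one way the duplicate and near-duplicate cases might look in practice, the sketch below keeps a document only if it scores under a threshold against everything already kept. The 0.9 threshold and the function names are assumptions for illustration; a suitable cutoff depends on your content.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Term-frequency cosine similarity with an assumed simple tokenizer."""
    tf_a = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    tf_b = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm = math.sqrt(sum(c * c for c in tf_a.values())) * math.sqrt(
        sum(c * c for c in tf_b.values())
    )
    return dot / norm if norm else 0.0


def filter_near_duplicates(documents, threshold=0.9):
    """Keep the first copy of each document; drop later near-duplicates."""
    kept = []
    for doc in documents:
        # Compare against every document kept so far (quadratic in the worst case).
        if all(cosine_similarity(doc, earlier) < threshold for earlier in kept):
            kept.append(doc)
    return kept


pages = [
    "Shipping takes three to five business days.",
    "Shipping takes three to five business days!",  # differs only in punctuation
    "Returns are accepted within thirty days.",
]
unique_pages = filter_near_duplicates(pages)
print(len(unique_pages))  # 2
```

The pairwise loop is fine for small batches; large crawls usually need an indexing or hashing step in front of it.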
When You Need Embeddings
Embedding-based similarity is better when you need to match queries to answers that use different vocabulary, group semantically related content, compare short texts (where there are too few words for overlap to be a reliable signal), or support cross-language matching.
Comparison Table
| Method | Type | Speed | Semantic? | Best For |
|---|---|---|---|---|
| Term-frequency cosine | Lexical | Instant in-browser | No | Duplicate detection, quick comparison |
| BM25 | Lexical | Fast server-side | No | Document retrieval, search ranking |
| Sentence Transformers | Semantic, open source | Depends on model and hardware | Yes | Production RAG, semantic search |
| Hosted embeddings API | Semantic, API | Network request per batch | Yes | High-accuracy production pipelines |
DevTools Verification
Open DevTools, switch to the Network tab, filter for Fetch/XHR, and paste text into both boxes. No outbound requests fire during comparison. Your text stays in the browser and is not written to localStorage, cookies, or the URL.
Frequently Asked Questions
- What is cosine similarity?
- Cosine similarity measures the angle between two vectors in a high-dimensional space. A score of 1.0 means the vectors point in the same direction; 0 means they are at right angles, which for term-frequency vectors means the texts share no terms.
- What is the difference between lexical and embedding similarity?
- Lexical similarity compares exact word overlap. Embedding similarity compares meaning, so synonyms and related phrases can score as similar even when they share few words.
- When should I use lexical similarity instead of embeddings?
- Use lexical similarity for duplicate detection, near-duplicate filtering, plagiarism checks, and controlled corpora where vocabulary is consistent.
- Does brevio send text to an embeddings API?
- No. The Text Similarity Checker uses in-browser term-frequency cosine similarity and makes no API calls.