Information Gain and Entity Salience: The On-Page Signals Search Engines Actually Read
On-page optimization stopped being a keyword-density game years ago. Modern retrieval systems read two harder signals: information gain, how much your page adds to the corpus already ranking, and entity salience, how clearly it commits to the named concepts that define its topic. Get both right and the page reads as a primary source rather than a derivative summary, which is what semantic search has been quietly rewarding since the BERT and MUM rollouts. This guide walks through what those signals measure, how to engineer them into a page, and which tools surface them in a form you can act on.
What Information Gain Means for Your Pages
Information gain measures how much new, non-redundant content your page contributes compared to what already ranks for the query. Instead of rehashing facts found on every competing result, pages with high information gain offer fresh data, novel angles, original research, or deeper detail that isn’t readily available elsewhere. The patent literature behind this (Google’s 2020 “Contextual estimation of link information gain” filing) frames it as a comparative measure: gain is always relative to the existing top results, not absolute.
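To make "relative to the existing top results" concrete, here is a minimal sketch. It is not how any search engine computes gain; it assumes a crude proxy where "content" means distinct words of four or more letters, and the function names `terms` and `novelty_share` are hypothetical.

```python
import re


def terms(text: str) -> set[str]:
    """Lowercase word tokens of length >= 4, a crude stand-in for 'facts'."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= 4}


def novelty_share(candidate: str, ranking_pages: list[str]) -> float:
    """Fraction of the candidate's terms that appear in none of the ranking pages."""
    candidate_terms = terms(candidate)
    if not candidate_terms:
        return 0.0
    seen: set[str] = set()
    for page in ranking_pages:
        seen |= terms(page)
    return len(candidate_terms - seen) / len(candidate_terms)


# Made-up snippets standing in for the pages that already rank.
top_results = [
    "Entity salience measures how prominently an entity appears.",
    "Information gain rewards pages that add new content.",
]
derivative = "Information gain rewards pages that add content."
original = "Our crawl of 400 pages found salience rose after schema changes."

print(novelty_share(derivative, top_results))
print(novelty_share(original, top_results))
```

With these toy inputs the derivative sentence scores 0.0 and the original one 0.75. The same page would score differently against a different set of ranking pages, which is the comparative point.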
Quick vocabulary
- Information gain: A comparative measure of how much novel content a page contributes versus the existing top-ranking results for the same query.
- Entity salience: How prominently and consistently a named entity (person, place, organization, concept) features in a page, used by retrieval models to determine the page’s primary subject.
- Topical coherence: The degree to which every section, paragraph, and heading on a page reinforces the same core topic rather than drifting across unrelated subtopics.
- Semantic distance: The vector-space distance between two concepts (or two documents) in an embedding model. A smaller distance means a closer semantic relationship.
- Canonical entity name: The official, widely recognized form of an entity name (e.g., “World Health Organization” on first mention, “WHO” thereafter), used to align content with knowledge-graph identifiers.
Search engines reward information gain because users abandon results that repeat what they’ve already read. Google’s algorithms now assess whether a page adds substantive value beyond the existing top-rankers. In most cases this means unique case studies, proprietary datasets, expert interviews, or granular how-to steps competitors omit. Moz’s coverage of the information-gain score walks through how the patent describes it, and the practical takeaway is the same: derivative content gets ranked lower not because it’s spammy but because the model already has that information from another source.
Derivative content isn’t penalized for being spammy. It’s penalized for being redundant.
This differs sharply from traditional keyword optimization, which focused on matching query terms and sprinkling them throughout the copy. Information gain prioritizes substance over signals: relevant keywords still matter for topical alignment, but the differentiator is whether the page has actually said something new. A page stuffed with target keywords but offering zero original insight will, in most cases, underperform a page with moderate keyword use and genuinely unique findings. (I’ve watched a 4,000-word “definitive guide” lose to a 1,200-word post with two original screenshots, more than once.)
Entity Salience: Teaching Search Engines What Your Page Is Really About
Entity salience measures how prominently and consistently specific named entities (people, places, organizations, concepts) appear in content. Search engines use this signal to determine what a page is genuinely about, not just which keywords it contains. When “Apple” appears alongside “orchard,” “harvest,” and “Honeycrisp,” the algorithm understands the page means the fruit, not the tech company. Disambiguation through co-occurrence is the entire point.
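The Apple example can be sketched as a toy overlap count. This is only an illustration of the co-occurrence idea, not a description of any production model; the `SENSES` vocabularies and the `disambiguate` helper are hypothetical.

```python
import re

# Hypothetical context vocabularies for two senses of "Apple".
SENSES = {
    "apple (fruit)": {"orchard", "harvest", "honeycrisp", "cider", "tree"},
    "Apple (company)": {"iphone", "mac", "ios", "cupertino", "app"},
}


def disambiguate(text: str, senses: dict[str, set[str]]) -> str:
    """Pick the sense whose context vocabulary overlaps most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    # On a tie, max() keeps the first sense listed in the dictionary.
    return max(senses, key=lambda sense: len(senses[sense] & words))


print(disambiguate("The apple harvest at the orchard favored Honeycrisp.", SENSES))
```

The word “apple” itself never decides the outcome here. Only the neighboring terms do, which is why a page with thin surrounding context stays ambiguous.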
Pro tip
Run a draft through the Google Cloud Natural Language API and look at the salience scores it returns. If the entity you want the page to rank for isn’t in the top three by salience, the page isn’t really about what you think it’s about, and no amount of keyword tweaking will fix that until the entity structure changes.
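A minimal sketch of that check, assuming the `google-cloud-language` Python package is installed and application default credentials are configured (neither is covered here). The helper names and the sample scores are made up for illustration; only the top-three rule comes from the tip above.

```python
def fetch_entity_salience(text: str) -> list[tuple[str, float]]:
    """Return (entity name, salience) pairs from the Natural Language API.

    Assumes google-cloud-language is installed and credentials are set up.
    """
    # Imported lazily so the pure helper below runs without the package.
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_entities(request={"document": document})
    return [(entity.name, entity.salience) for entity in response.entities]


def target_in_top_n(
    entities: list[tuple[str, float]], target: str, n: int = 3
) -> bool:
    """True if the target entity is among the n highest-salience entities."""
    ranked = sorted(entities, key=lambda pair: pair[1], reverse=True)
    top_names = [name.lower() for name, _ in ranked[:n]]
    return target.lower() in top_names


# Made-up scores, standing in for a real API response.
sample = [
    ("newsletter", 0.41),
    ("email", 0.22),
    ("entity salience", 0.09),
    ("Google", 0.05),
]
print(target_in_top_n(sample, "entity salience"))
print(target_in_top_n(sample, "Google"))
```

In practice you would pass `fetch_entity_salience(draft_text)` to `target_in_top_n` instead of the sample list. The match is an exact, case-insensitive name comparison, so an entity the API returns under a variant name would need normalizing first.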
Entity salience helps search engines disambiguate meaning and assess topical authority. A page that weaves core entities throughout headings, body text, and supporting examples signals coherent, substantive coverage. Thin content might mention an entity once; authoritative resources return to it, contextualize it, and connect it to related concepts. Ahrefs’s guide to semantic SEO describes this clustering effect well: the pages that rank for “semantic” queries aren’t the ones with the densest keyword usage, they’re the ones with the tightest entity neighborhoods.
The practical approach is to identify the primary entities central to a topic (specific people, products, methodologies, locations) and ensure they recur naturally across the page structure. Use full names on first mention, then consistent shorthand. Link entities to authoritative sources when appropriate. Avoid random keyword stuffing; instead, build a semantic network where core concepts reinforce one another through proximity and co-occurrence patterns the model can pick up.

Tools like natural-language processing APIs can surface entity recognition patterns, but editorial judgment remains essential. The honest test: if a reader scanned only the entities and concepts on a page, would they immediately grasp the subject? That clarity is what entity salience delivers to both readers and algorithms.
High Info-Gain vs Derivative Content, Signal by Signal
Two pages can cover the same query and still produce wildly different gain scores. The difference shows up across a handful of signals that retrieval models can measure directly, and that editors can spot in a draft.
| Signal | High info-gain page | Derivative page |
|---|---|---|
| Primary sources | Original data, named expert quotes, first-party screenshots | Citations to other secondary articles that themselves cite the original |
| Subtopic coverage | Depth on at least one angle no top-10 result covers | The same five H2s every competing page uses |
| Entity neighborhood | Tight cluster of related entities reinforcing one topic vector | Sparse, generic entities that could apply to a dozen adjacent topics |
| Embedding distance from SERP | Far enough from the existing top-10 vector to read as differentiated | Vector overlap so high it reads as a near-duplicate |
| Reader artifact | A diagram, dataset, calculator, or screenshot the reader can act on | Stock imagery, no downloadable or interactive element |
| Dwell behavior | Long reads, scroll depth past the fold, follow-up clicks to related guides | Quick bounces back to SERP, the “pogo-stick” pattern |
The last row is the consequence, not the cause. When the first five signals are right, dwell behavior follows. Backlinko’s primer on semantic search walks through the same reasoning from a slightly different angle, but the underlying mechanic is identical: retrieval models compare a candidate page against the corpus already ranking, and pages that look like near-duplicates get demoted regardless of how clean their on-page SEO is.
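The "embedding distance from SERP" row can be approximated in a draft review. The sketch below swaps real embeddings for plain word-count vectors, which is a much weaker signal, and compares a draft against each ranking page with cosine similarity. All names are hypothetical, and any cutoff for "near-duplicate" would be your own assumption.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Word-count vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_overlap(candidate: str, ranking_pages: list[str]) -> float:
    """Highest similarity between the candidate and any ranking page."""
    cand = vectorize(candidate)
    return max(
        (cosine_similarity(cand, vectorize(page)) for page in ranking_pages),
        default=0.0,
    )


serp = ["what is entity salience in seo", "information gain explained for seo"]
print(max_overlap("what is entity salience in seo", serp))
print(max_overlap("first party crawl data from four hundred product pages", serp))
```

A value near 1 means the draft reads like something already in the result set; a value near 0 means little shared vocabulary. Swapping `vectorize` for an embedding model would bring this closer to the semantic distance described in the vocabulary section.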
Practical Tactics to Increase Information Gain
Five moves consistently raise the gain score, and they map roughly to how much effort each demands. Truth is, most teams skip the first two and wonder why their content sits at position 14.
Original Data, Case Studies, and Expert Quotes
Publish original data from your own experiments (search-behavior studies, A/B test results, traffic analysis) that competing pages cite secondhand or ignore entirely. Unique datasets become link magnets and, more importantly, signal novelty to retrieval models that have already indexed the secondary sources. Commission or conduct case studies showing real implementations on live sites, complete with before-and-after metrics. Concrete examples cut through theory and prove ROI to skeptical stakeholders.
Interview subject-matter experts or practitioners who’ve deployed these techniques at scale, then quote them directly as primary sources. First-person insights carry more authority than rehashed blog summaries, and the quote itself becomes an artifact other pages will cite back, which is the cleanest backlink pattern you can engineer without doing outreach. In my experience, a single named quote with a verifying link does more for both salience and inbound links than a fresh round of cold pitches.
Note
Expert quotes only count as information gain if the expert is identifiable. An anonymous “industry insider” quote registers as filler to both readers and entity-recognition models. Use full name, role, and a link to a verifying source, or skip the quote.
Depth on Overlooked Subtopics and Replicable Artifacts
Identify subtopics competitors mention briefly (like schema markup interplay with entity salience, or multilingual entity disambiguation) and dedicate full subsections with step-by-step walkthroughs. Depth on overlooked angles satisfies searchers hunting niche answers, and it widens the embedding distance between your page and the SERP cluster.
Add annotated screenshots, code snippets, or tool output examples readers can replicate immediately. Actionable artifacts reduce friction between reading and doing. A page with three replicable screenshots and a working code block consistently outperforms a page with twice the word count and no artifacts, both for dwell time and for the secondary signal of being saved, shared, and re-cited.


How to Optimize for Entity Salience
Optimizing for entity salience means helping search engines confidently identify and understand the key people, places, organizations, and concepts on a page. The optimization cycle is short, four loops, and it pays off because every loop tightens the entity neighborhood without changing the underlying claims.
The info-gain optimization cycle
Start by using canonical entity names: the official, widely recognized form. This reduces ambiguity and aligns the content with knowledge graphs that Google and Bing already index. Add structured data markup using Schema.org vocabulary. Mark up entities like Person, Organization, Product, or Event so search engines can parse them directly from the HTML. Google’s Rich Results Test confirms whether the markup is valid.
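As an illustration of that markup, here is a minimal sketch that builds a JSON-LD `Organization` block with the canonical name, its shorthand, and reference URLs. The `@context`, `@type`, `name`, `alternateName`, `url`, and `sameAs` keys are Schema.org vocabulary; the helper name and the choice of the World Health Organization as the example are assumptions for demonstration.

```python
import json


def organization_jsonld(
    name: str, alternate_name: str, url: str, same_as: list[str]
) -> str:
    """Build a JSON-LD script tag describing an Organization entity."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,                     # canonical entity name
        "alternateName": alternate_name,  # shorthand used after first mention
        "url": url,
        "sameAs": same_as,                # reference pages for the same entity
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = organization_jsonld(
    name="World Health Organization",
    alternate_name="WHO",
    url="https://www.who.int",
    same_as=["https://en.wikipedia.org/wiki/World_Health_Organization"],
)
print(snippet)
```

The printed tag can be pasted into a page template and then checked with the Rich Results Test. The sketch only produces well-formed JSON, so validity for a given rich-result type still has to be confirmed there.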
Link to authoritative entity sources. When introducing an entity, hyperlink to its Wikipedia page, official website, or trusted reference. These outbound signals reinforce identity and context, showing search engines the content is grounded in recognized sources. (I’ve watched this single change move pages two or three spots, especially for topics where the SERP is full of pages that link out to nothing.)
Place key entities in strategic locations: page title, H1 and H2 headings, opening paragraph, and naturally throughout body text. Frontloading entities signals their centrality to the topic and gives entity-recognition models the strongest possible cue about what the page is for.
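A quick way to check those placements on a draft is to parse the HTML and look for the entity name in each location. This is a minimal sketch using Python's standard `html.parser`; the class and function names are hypothetical, and "opening paragraph" is assumed to mean the first `<p>` element in the document.

```python
from html.parser import HTMLParser


class PlacementParser(HTMLParser):
    """Collects the text of title, h1, h2, and p elements."""

    TRACKED = ("title", "h1", "h2", "p")

    def __init__(self) -> None:
        super().__init__()
        self.current = None
        self.text = {tag: [] for tag in self.TRACKED}

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.current = tag
            self.text[tag].append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Text inside nested inline tags (links, emphasis) is kept as well.
        if self.current:
            self.text[self.current][-1] += data


def entity_placement(html: str, entity: str) -> dict[str, bool]:
    """Report whether the entity name appears in each strategic location."""
    parser = PlacementParser()
    parser.feed(html)
    needle = entity.lower()

    def found(chunks: list[str]) -> bool:
        return any(needle in chunk.lower() for chunk in chunks)

    return {
        "title": found(parser.text["title"]),
        "h1": found(parser.text["h1"]),
        "h2": found(parser.text["h2"]),
        "opening_paragraph": found(parser.text["p"][:1]),
    }


page = (
    "<html><head><title>Entity Salience Guide</title></head><body>"
    "<h1>Entity salience explained</h1>"
    "<p>Entity salience measures prominence.</p>"
    "<h2>Tools</h2><p>More text.</p></body></html>"
)
print(entity_placement(page, "entity salience"))
```

For the sample page the report flags the H2 as the one location missing the entity. The match is a plain substring test, so shorthand or inflected forms would need their own checks.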
Tools and Resources Worth Bookmarking
Four tools cover most of the entity-salience and gap-analysis workflow without overlap, and they pair well with the existing audit tools most teams already pay for.
Google Cloud Natural Language API extracts entities, sentiment, and syntax from text using machine learning; it returns salience scores for each entity to show relative importance. Upload a paragraph or full page to see which topics Google considers central. This reveals how search engines may weight different concepts in content, and it’s the cheapest sanity check before publishing.
Bing Entity Search API queries Bing’s knowledge graph to understand how entities are classified and connected; it helps verify whether topics align with recognized entities. It is useful when an ambiguous term could map to two different knowledge-graph nodes: query the term and see which node Bing’s graph favors.
Similarweb and the Semrush Content Marketing Toolkit together cover the gap-analysis side. Similarweb shows which topics drive a competitor’s traffic, while Semrush’s keyword-gap and topic-research modules surface the specific semantic clusters competitors cover that the current draft doesn’t. For most teams, that pairing produces a punch-list of subtopics to add before publishing.
Screaming Frog SEO Spider handles the on-page verification layer: crawl the live page, confirm the canonical entity actually appears in the title, H1, meta description, and schema markup. This is the dull part of the workflow, and it’s the part most teams skip until a published page underperforms and they have to reverse-engineer why.
Putting It All Together
Information gain and entity salience are the same signal viewed from two angles. Gain asks “does this page add something the corpus didn’t already have?” Salience asks “is what it adds clearly tied to a recognizable topic?” A page can be high-gain and low-salience (novel but unfocused, the algorithm can’t tell what it’s for) or high-salience and low-gain (focused but redundant, the algorithm already has that information from a stronger source). Both fail. The pages that win are the ones where the answer to both questions is yes.
✓ Worth the effort for
- Cornerstone pages competing in saturated SERPs
- Topics where the top-10 reads as paraphrases of the same source
- Pages that have ranked at position 11–25 for months without movement
- Sites building topical authority in a defined niche
- Editorial teams with access to first-party data or expert sources
✗ Skip it for
- Transactional pages where intent is the primary ranking factor
- News commodity coverage where freshness beats novelty
- Programmatic pages built from templated data feeds
- Thin glossary entries serving navigational queries
- One-off posts where speed beats semantic engineering
The two signals don’t replace the fundamentals. Page speed, internal linking, schema markup, and authoritative backlinks still matter. What information gain and entity salience change is the ceiling: no amount of technical SEO will push a derivative page past the genuinely novel one above it in a mature SERP. For most teams, the cheapest way to raise that ceiling is to publish one piece of first-party data and rewrite one existing post around its canonical entity.
Try it this week
Audit one stuck page for entity salience. Fix the top issue.
1. Pick one page ranking 11–25 for its target query. Paste the body into the Google Cloud Natural Language API and capture the salience scores.
2. If the target entity isn’t top-three by salience, rewrite the title, H1, and first paragraph with the canonical entity name frontloaded.
3. Add one piece of original information (a screenshot, a small dataset, a named expert quote) the current top-10 doesn’t include. Republish and re-crawl.
One page, three changes, one week. That’s the smallest unit of work that produces both a salience lift and a gain lift, and it’s the cleanest way to feel the effect before scaling it across a content library.
Related guides
- E-E-A-T signals: How experience, expertise, authoritativeness, and trustworthiness layer on top of semantic relevance.
- Topical authority: How cluster structure and internal linking reinforce entity salience at the site level.