Why Your Internal Linking Test Might Be Wrong: Z-Test Assumptions Explained

Z-tests look reassuringly simple: plug in two means, get a p-value, ship the winner. The problem is that the test only earns its keep when four assumptions hold, and in my experience SEO data violates at least one of them most of the time. Normality, independence, known variance, and adequate sample size are the load-bearing pieces, and when any of them buckle, the confidence interval you’re staring at is closer to fiction than measurement. This guide walks through what z-tests actually require, the case where ignoring those requirements cost a real team four months of recovery, and the alternatives worth reaching for when the assumptions don’t hold.

What the Z-Test Actually Requires

Before any of the per-assumption checks, it helps to fix the vocabulary, because most of the field uses these terms loosely and the looseness is where bad tests come from.

Quick vocabulary

Normality assumption: The requirement that your data, or the sampling distribution of your test statistic, follows a bell curve. Z-tests assume this directly; the CLT lets large samples cheat.
Independence: The requirement that one observation’s value doesn’t influence another’s. Internally linked pages competing for the same keyword cluster violate this every time.
Sample size (n): The number of independent observations per variant. The conventional floor for z-tests is n > 30, though SEO usually needs much more.
p-value: The probability of observing a result at least as extreme as yours if the null hypothesis were true. It is not the probability the change worked.
Type I error: A false positive, declaring a winner when nothing actually changed. Alpha (typically 0.05) is the rate you accept up front.
Type II error: A false negative, missing a real effect because your sample was too small or too noisy. Beta is the rate; 1 minus beta is statistical power.

The whole point of going through these terms is that “the z-test said p < 0.05” carries a different weight depending on which of these assumptions you’ve actually verified. In most cases, SEO teams verify the sample size and skip the rest.
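The alpha, beta, and power definitions above combine into the textbook sample-size approximation for a two-sided, two-sample z-test. The sketch below is a minimal illustration, not a planning tool: the function name `pages_per_variant` and the standardized effect sizes are assumptions for the example, and the formula itself presumes independent observations.

```python
import math
from statistics import NormalDist

def pages_per_variant(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample z-test.

    effect_size is the standardized difference (mean difference divided by
    the standard deviation), a value you would estimate from historical data.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the Type I rate
    z_beta = NormalDist().inv_cdf(power)           # quantile matching power = 1 - beta
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {pages_per_variant(d)} pages per variant")
```

Small effects push the requirement far past 30 per group, which is the practical reason the n > 30 floor is rarely enough on its own.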

Sample Size: The 30-Page Minimum Rule

The 30-observation threshold is a rule of thumb drawn from the Central Limit Theorem: with enough data points, sampling distributions approach normal even when individual values don’t. Below 30 pages per group, your z-test p-values become unreliable; what looks like a 95% confidence interval might actually behave more like an 85% one, and the gap grows with the skew of the data.

Testing with 10-15 pages is common in SEO experiments targeting specific page types, but it inflates the false-positive rate. A 12-page test showing +40% traffic with p=0.03 may simply reflect natural variance, not your meta description changes. The smaller your sample, the more likely extreme values skew your mean, especially the top one or two pages.
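One way to see how far a nominal 95% interval can drift at small n is to simulate it. The sketch below assumes a lognormal distribution as a stand-in for skewed page traffic (an assumption, not a measured fit), builds the usual z-interval around each sample mean, and counts how often it contains the true mean.

```python
import math
import random
import statistics

random.seed(42)

TRUE_MEAN = math.exp(0.5)  # mean of a lognormal(0, 1) distribution
TRIALS = 2000

def z_interval_covers(n):
    """Draw n skewed values and check whether the nominal 95% z-interval
    around the sample mean contains the true mean."""
    sample = [random.lognormvariate(0, 1) for _ in range(n)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se

def coverage(n):
    """Share of simulated intervals that contain the true mean."""
    return sum(z_interval_covers(n) for _ in range(TRIALS)) / TRIALS

coverage_12 = coverage(12)
coverage_300 = coverage(300)
print(f"n=12:  {coverage_12:.1%} of nominal-95% intervals contain the true mean")
print(f"n=300: {coverage_300:.1%}")
```

The exact shortfall depends on how skewed your real data is, so treat the simulated figure as an illustration of the mechanism rather than a number to quote.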

Pro tip

When your group size is borderline (25-35 pages), run both a z-test and a t-test on the same data. If both produce similar p-values, your finding is robust. Divergent results mean you’re in the uncertain zone where sample-size choice changes the conclusion, and that’s a signal to extend the test, not to ship.
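A minimal sketch of that side-by-side check, assuming SciPy and NumPy are available. The data is randomly generated and purely hypothetical. The t-test here is Welch's version (no equal-variance assumption), and the z-test reuses the same statistic but reads it against the normal distribution, which is what treating the sample variance as known amounts to.

```python
import numpy as np
from scipy import stats

# Hypothetical per-page traffic changes (%), 28 pages per group.
rng = np.random.default_rng(7)
control = rng.normal(loc=0.0, scale=20.0, size=28)
treatment = rng.normal(loc=10.0, scale=20.0, size=28)

# Welch's t-test: variance is estimated from the sample.
t_stat, p_t = stats.ttest_ind(treatment, control, equal_var=False)

# z-test: same statistic, referred to the normal distribution instead of t.
p_z = 2 * stats.norm.sf(abs(t_stat))

print(f"statistic = {t_stat:.3f}")
print(f"z-test p  = {p_z:.4f}")
print(f"t-test p  = {p_t:.4f}")
```

The t-test p-value is always the larger of the two because the t distribution has heavier tails; what matters is whether the gap is wide enough to straddle your alpha.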

When you have fewer than 30 observations, use a t-test instead. It adjusts for small-sample uncertainty by widening confidence intervals and requiring stronger evidence before declaring significance. Many SEO platforms default to z-tests regardless of sample size, so verify before trusting automated results (I once spent an afternoon arguing with a vendor support rep who insisted their tool “auto-detects” the right test; it does not). The default isn’t always wrong, but it’s almost never re-checked.


Independence: Why Testing Product Pages Together Fails

Independence requires that one page’s performance doesn’t influence another’s. Product pages rarely meet this standard. Internal linking creates direct dependencies: anchor text from your “blue widgets” page can boost rankings for “blue widget accessories,” while both compete for similar query space. When you test pages in the same category, shared link equity flows between them through navigation menus, related product modules, and breadcrumbs.

In SEO, your test pages don’t behave like 12 coin flips. They behave like 12 players on the same team, where one’s performance directly changes the others’.

Keyword cannibalisation compounds the problem: Google may swap which page ranks for overlapping terms mid-test, creating false negatives or positives. Testing “red shoes” and “crimson sneakers” simultaneously means changes to one alter the other’s traffic through search result reshuffling. Your control group corrupts your treatment group, invalidating the z-test’s foundational math. Select test pages from unrelated categories with distinct keyword targets and minimal cross-linking to preserve independence. Honestly, on most ecommerce sites this is the assumption that’s hardest to engineer around: the whole internal-link graph exists to create dependencies, and a clean test demands the opposite.
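Checking for cross-linking between test and control pages can be sketched as a short breadth-first search over the internal-link graph. Everything here is hypothetical: the URLs, the `links` dictionary (which you would build from a crawl export), and the two-hop cutoff. It only covers link paths, not shared keyword clusters.

```python
from collections import deque

# Hypothetical internal-link graph: page -> pages it links to.
links = {
    "/widgets/blue": ["/widgets/accessories", "/category/widgets"],
    "/widgets/accessories": ["/widgets/blue"],
    "/category/widgets": ["/widgets/blue", "/widgets/red"],
    "/widgets/red": ["/category/widgets"],
    "/guides/returns": ["/help"],
    "/help": [],
}

def within_hops(graph, start, max_hops=2):
    """Return pages reachable from start in at most max_hops link clicks."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_hops:
            continue
        for target in graph.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return set(depth) - {start}

def dependent_pairs(graph, test_pages, control_pages, max_hops=2):
    """List (test, control) pairs connected by a short link path in either direction."""
    pairs = []
    for t in test_pages:
        for c in control_pages:
            if c in within_hops(graph, t, max_hops) or t in within_hops(graph, c, max_hops):
                pairs.append((t, c))
    return pairs

conflicts = dependent_pairs(links, ["/widgets/blue"], ["/widgets/red", "/guides/returns"])
print(conflicts)
```

Any pair this returns is a candidate for moving one page out of the experiment or for treating the pair as a single cluster.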

Normality: When Traffic Data Breaks the Bell Curve

Organic traffic rarely follows the bell curve. A handful of landing pages capture the majority of visits (the classic long-tail distribution), while most pages sit in near-obscurity. This right-skewed reality violates the normality assumption baked into z-tests. Moz’s analytics archive circles back to this point regularly: almost every distribution that matters in this field is heavy-tailed.

Seasonality compounds the problem: retail sites spike in November, tax software in March. When you slice data by week or day, these patterns create bumpy, non-normal distributions. Layer on algorithm updates, especially Google’s Core Updates, and traffic can lurch or flatline overnight, shredding any semblance of normalcy.

Note

The Central Limit Theorem gives you cover for non-normal individual values when the sample is large, but it does nothing for non-independent observations. A common mistake is invoking the CLT to justify a z-test on internally linked pages; the CLT was never going to fix that problem.

The good news: the Central Limit Theorem relaxes normality requirements when sample sizes are large (typically n > 30 per variant). Your test’s mean traffic becomes approximately normal even if individual page visits aren’t. This statistical cushion means you can often proceed with a z-test despite messy underlying data, but only if independence and sample size hold firm. When in doubt, visualise your distribution first.

When the Z-Test Fits, and When It Doesn’t

The cleaner way to think about this is by use case, not by ritual. A z-test isn’t “right” or “wrong” in the abstract; it fits some experimental setups and fails others, and the failure mode is usually quiet.

Test scenario	Z-test fits when	Z-test fails when
Title-tag CTR change	Hundreds of impressions per page across at least 30 pages, distinct keyword targets	Pages share intent clusters or you have under 30 pages with usable impression volume
Internal-link addition	Treated pages have no upstream or downstream link path to control pages	Test and control pages live in the same silo, so link equity bleeds across the boundary
Template / layout test	Pages selected from unrelated categories, large sample (n > 100 per variant)	All test pages target variations of the same head term (see the case study below)
Conversion-rate test	Thousands of sessions per variant, binary outcome, randomised user assignment	Long-tail revenue distribution where a single whale conversion dominates the mean
Page-speed rollout	User-cohort split, large session count, metric is median LCP, not mean LCP	Metric is mean LCP and a few slow sessions distort everything; use a non-parametric test

Five common SEO experiment scenarios mapped to when the z-test holds up and when it doesn’t.

The pattern across the “fails” column is the same in every row: independence is broken, the sample is thin, or the underlying distribution is too lopsided for the mean to be the right summary statistic. In each case, the fix has the same shape: switch the test or restructure the experiment; don’t massage the data until z-test math accepts it.

Real Case Study: When Bad Assumptions Cost Rankings

In Q2 2023, a SaaS company’s growth team tested a new template on 12 product pages, targeting mid-volume keywords. After two weeks, their z-test showed a 22% traffic increase with p < 0.05. They celebrated, rolled the template to 340 similar pages, and watched organic traffic drop 18% over the next month.

12: Pages in the original test, well below the 30-page floor for a z-test
+22%: Reported lift, driven almost entirely by 3 pages winning featured snippets
−18%: Traffic drop after the template scaled to 340 pages; recovery took 4 months

The violated assumption: independence. The 12 test pages all targeted variations of “project management software for [industry]” and ranked for overlapping keyword clusters. When Google’s algorithm adjusted rankings after the template change, the pages cannibalised each other’s visibility. The initial lift came from three pages that happened to gain featured snippets, temporarily masking drops across the other nine. I’ve seen versions of this exact pattern on at least three client audits (one of them a Series B fintech that had already presented the “+22% win” to their board, which made the rollback conversation a memorable one). The head-term hides the carnage further down.

Their z-test treated each page as an independent observation, but the pages competed in the same SERP ecosystem. Clicks on one page directly reduced impressions for others. The small sample size, 12 pages, made this dependence catastrophic. A proper test would’ve used page clusters as the unit of analysis or isolated pages targeting truly distinct keyword sets.

Watch for

When 3 of your 12 test pages account for most of the observed lift, that’s a heavy-tail signal, not a clean win. The conventional response is “we found the high-performers”; the rigorous response is to ask whether removing those three pages reverses the result. In this case study, it would have.
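The “remove the top three and re-check” question is a few lines of code. The numbers below are invented to mimic the pattern described here; they are not the case-study data, and the `lift` helper is a hypothetical name.

```python
# Hypothetical weekly organic sessions per page (before, after) for 12 test pages.
pages = [
    (400, 900), (350, 800), (300, 700),   # three pages that picked up snippets
    (500, 450), (480, 430), (450, 410),
    (420, 380), (400, 360), (380, 350),
    (360, 330), (340, 310), (320, 290),
]

def lift(rows):
    """Aggregate relative change in sessions across the given pages."""
    before = sum(b for b, _ in rows)
    after = sum(a for _, a in rows)
    return (after - before) / before

# Rank pages by absolute gain and drop the three largest contributors.
ranked = sorted(pages, key=lambda row: row[1] - row[0], reverse=True)
lift_all = lift(pages)
lift_without_top3 = lift(ranked[3:])

print(f"all 12 pages:      {lift_all:+.1%}")
print(f"without the top 3: {lift_without_top3:+.1%}")
```

If the sign flips once the top contributors are removed, the headline lift is describing a few pages, not the template.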

The recovery took four months. The team reverted templates on the worst performers, and traffic patterns stabilised only after they rebuilt topical authority through content consolidation. Well, that’s the polite version. The honest version is they rebuilt the silo from scratch.

Why it matters: when your test units share ranking signals, keyword overlap, or internal link equity, standard z-test math breaks down. The formula assumes your 12 pages behave like 12 coin flips, independent events. In SEO, they behave like 12 players on the same team, where one’s performance directly affects the others. Ahrefs has written about similar setups where the test design issues swamped the measured effect.


Quick Checks Before You Run the Test

A pre-flight check that takes five minutes catches most of the violations that make z-test results unreliable. The whole point is to run it before you commit to the test design, not after the p-value comes in looking attractive. Which, in most cases, it will.

The assumption-check cycle

Step 1: Histogram the metric. In Sheets, Insert > Chart > Histogram. Look for skew or multiple peaks.
Step 2: Q-Q plot in Colab. Use scipy.stats.probplot(). S-shapes signal a normality violation.
Step 3: Map dependencies. List internal links and shared keyword clusters between test and control pages.
Step 4: Pick the right test. If any check fails, switch to t-test, Mann-Whitney U, or bootstrap before running.

Visual Tests You Can Run in 2 Minutes

In Google Sheets, create a histogram by selecting your data, then Insert > Chart > Histogram. Look for strong skew (a long tail on one side) or multiple peaks; both signal non-normality. For tighter validation, generate a Q-Q plot in Python using scipy.stats.probplot(). If the points follow the diagonal line closely, the data is approximately normal; systematic curves or S-shapes indicate a violation. Sheets users can export to Colab for quick Q-Q checks. These visual tests catch obvious problems in under two minutes, helping you decide whether parametric tests are safe or whether you need alternatives like bootstrap methods.
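A minimal Colab-style sketch of the Q-Q check, assuming SciPy and NumPy. The traffic values are simulated from a lognormal distribution as a hypothetical stand-in for real page data. Besides the plot points, probplot also returns the correlation of the fitted line, which gives a rough single-number read on how straight the Q-Q plot is.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed page traffic; swap in your own per-page values.
rng = np.random.default_rng(0)
traffic = rng.lognormal(mean=5.0, sigma=1.2, size=200)

# With the default fit=True, probplot returns the Q-Q points plus a
# least-squares line: (slope, intercept, r).
(_, _), (_, _, r_raw) = stats.probplot(traffic, dist="norm")
(_, _), (_, _, r_log) = stats.probplot(np.log(traffic), dist="norm")

print(f"Q-Q correlation, raw traffic: {r_raw:.3f}")
print(f"Q-Q correlation, log traffic: {r_log:.3f}")

# To draw the plot itself, pass matplotlib as the plot target:
#   import matplotlib.pyplot as plt
#   stats.probplot(traffic, dist="norm", plot=plt); plt.show()
```

A correlation close to 1 means the points hug the diagonal. There is no universal cutoff, so use the number to compare variants of the same data (raw versus log, for example) rather than as a pass/fail gate.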

When to Use Non-Parametric Tests Instead

When your SEO experiment data violates z-test assumptions (non-normal distributions, samples under 30, or heavy skew from outliers), switch to the Mann-Whitney U test. This non-parametric alternative compares rank order instead of raw values, making it robust to skewed metrics like time-on-page or conversion rates with extreme values. It makes no normality assumption and remains usable with samples as small as 5-10 per variant, though at that size only large effects will reach significance. It still assumes independent observations, so it does not rescue a test on interlinked pages. The tradeoff: slightly less statistical power when data is actually normal, but far more trustworthy results when it’s not. For most SEO experiments with real user behaviour data, Mann-Whitney often proves the safer default choice.
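The rank-based behaviour is easy to demonstrate with scipy.stats.mannwhitneyu. The time-on-page values below are hypothetical. The second call replaces the largest treatment value with an absurd outlier to show that the result does not move, because the ranks are unchanged.

```python
from scipy import stats

# Hypothetical time-on-page values in seconds for two variants.
control = [31, 35, 38, 40, 42, 44, 47, 52]
treatment = [45, 49, 51, 55, 58, 61, 66, 70]

u_stat, p_value = stats.mannwhitneyu(treatment, control, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")

# Inflate the largest treatment value into an extreme outlier.
# It was already the top-ranked value, so the ranks stay the same.
treatment_outlier = treatment[:-1] + [7000]
u_out, p_out = stats.mannwhitneyu(treatment_outlier, control, alternative="two-sided")
print(f"with outlier: U = {u_out}, p = {p_out:.4f}")
```

A mean-based test on the same two datasets would give very different answers, which is the whole argument for ranks when a single page or session can dominate.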


Deep dive
Which alternative test belongs on which violation

When the z-test won’t hold, there’s no single replacement. The right alternative depends on which assumption broke and what the data looks like:

Sample size under 30, distribution roughly symmetric. Use a two-sample t-test. Widens the confidence interval to account for the unknown population variance you’re estimating from a thin sample.
Heavy skew or visible outliers, any sample size. Use the Mann-Whitney U test. Compares rank order, indifferent to the shape of the underlying distribution.
Binary outcome (clicked / didn’t click), large sample. Use a two-proportion z-test with continuity correction, or Fisher’s exact test if any cell count drops below 5.
Distribution is complex but you have enough data to resample. Use a bootstrap confidence interval. Resample your observed data 10,000 times, take the 2.5th and 97.5th percentiles of the differences. No distribution assumption required, but you need code, not a spreadsheet.
Independence violated by clustering (multiple pages per silo, multiple sessions per user). Use a cluster-robust variance estimator or aggregate up to the cluster level (one observation per silo) and re-run the t-test on the aggregated data.
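The last option in the list, aggregating to one observation per silo, can be sketched with the standard library alone. The silo names and values are hypothetical, and the sketch assumes each silo was assigned to a single variant as a whole, which is what makes the silo a valid unit of analysis.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows: (silo, variant, traffic change in %) for individual pages.
rows = [
    ("pm-software", "treatment", 22.0), ("pm-software", "treatment", -8.0),
    ("pm-software", "treatment", -5.0), ("invoicing", "treatment", 4.0),
    ("invoicing", "treatment", 6.0), ("crm", "control", 1.0),
    ("crm", "control", -2.0), ("crm", "control", 3.0),
    ("time-tracking", "control", 0.0), ("time-tracking", "control", 2.0),
]

def aggregate_by_silo(page_rows):
    """Collapse pages to one mean value per (variant, silo) cluster."""
    grouped = defaultdict(list)
    for silo, variant, change in page_rows:
        grouped[(variant, silo)].append(change)
    per_variant = defaultdict(list)
    for (variant, _silo), changes in grouped.items():
        per_variant[variant].append(fmean(changes))
    return dict(per_variant)

clusters = aggregate_by_silo(rows)
print(clusters)
# Ten page rows become four cluster-level observations; any downstream
# test should use these counts as n, not the page counts.
```

The drop in n is the honest part: ten interdependent pages were never ten independent observations, and with only two clusters per variant no test will have much power.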

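If you’re not sure which case you’re in, the bootstrap is the most forgiving default for the SEO context: it handles skewed data and unknown variance in one shot and copes with modest samples better than the z-test does, at the cost of needing Python or R rather than Sheets. It does not repair broken independence unless you resample whole clusters rather than individual pages.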

What to Do When Assumptions Don’t Hold

When z-test assumptions break down, you have four practical paths forward rather than abandoning statistical rigour entirely.


Bootstrapping methods resample your actual data thousands of times to build empirical confidence intervals without assuming normality. Use this when traffic distributions are heavily skewed or sample sizes remain stubbornly small despite extended test windows.

Extending test duration increases your sample size, which often resolves normality violations through the Central Limit Theorem and reduces the impact of temporal autocorrelation. Run tests for at least 2-4 full business cycles when initial sample sizes fall below 1,000 sessions per variant. In my experience, this is the option teams reach for last because it’s the least exciting, and yet it solves more problems than any of the others.

Pro tip

Bootstrapping with 10,000 resamples runs in under a second on a laptop for any SEO-sized dataset. The cost is writing the script, not the compute. If you’re going to run more than two non-trivial experiments a quarter, the one-time investment in a reusable bootstrap function pays back fast.
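A reusable function along those lines might look like the sketch below. It is a plain percentile bootstrap using only the standard library; the function name, the defaults, and the traffic figures are all hypothetical, and it resamples individual pages, so it still assumes those pages are independent.

```python
import random
from statistics import fmean

def bootstrap_diff_ci(treatment, control, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        t = rng.choices(treatment, k=len(treatment))  # resample with replacement
        c = rng.choices(control, k=len(control))
        diffs.append(fmean(t) - fmean(c))
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[round(tail * n_resamples)]
    upper = diffs[round((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical weekly sessions per page; note the single viral page in treatment.
control = [120, 95, 143, 110, 88, 132, 101, 97, 125, 114]
treatment = [131, 108, 150, 119, 2400, 140, 112, 104, 137, 122]

low, high = bootstrap_diff_ci(treatment, control)
observed = fmean(treatment) - fmean(control)
print(f"observed difference in means: {observed:.1f}")
print(f"95% CI: [{low:.1f}, {high:.1f}]")
```

With one viral page in the treatment group, the interval comes out very wide. That width is the useful result: it says the observed difference rests on a single page.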

Segmenting tests by page type or user cohort keeps each comparison like-for-like when different URL groups behave differently and so violate the identical-distribution assumption. Apply this when mixing transactional and informational pages in a single test, or when bot traffic contaminates specific segments. SimilarWeb’s segmentation work on traffic mix is a useful reference point for thinking about how thoroughly you need to slice before the segments behave consistently.

Log transformations compress right-skewed traffic distributions closer to normality by reducing the influence of extreme outliers. Transform your metrics when a handful of viral pages generate 10x typical traffic, then run the z-test on log-transformed values and back-transform results for interpretation.

Each workaround trades simplicity for validity. Bootstrapping requires coding skills but handles nearly any violation. Longer tests cost time but need no special analysis. Segmentation fragments your data but preserves accuracy. Transformations work fast but require careful back-conversion when communicating results to stakeholders, which is, frankly, where most log-transformed analyses go wrong: the analyst gets the math right and then the report describes a “geometric mean” as if it were the arithmetic mean readers expect.
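The back-conversion trap is easiest to see with numbers. In this sketch the session counts are hypothetical; log1p and expm1 are used instead of a plain log so that zero-traffic pages don’t break the transform, which makes the back-transformed value approximately (not exactly) the geometric mean.

```python
import math
from statistics import fmean

# Hypothetical daily sessions per page, with one viral page at roughly 10x typical.
sessions = [180, 220, 195, 240, 210, 2300, 205, 190]

arithmetic_mean = fmean(sessions)

# Average on the log scale, then undo the transform.
log_values = [math.log1p(x) for x in sessions]
back_transformed = math.expm1(fmean(log_values))

print(f"arithmetic mean:       {arithmetic_mean:.0f}")
print(f"back-transformed mean: {back_transformed:.0f}")
```

The two figures answer different questions: the arithmetic mean tracks total traffic, while the back-transformed value describes a typical page. A report should say which one it is quoting.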

Picking the Right Test for the Right Job

The full assumption-check workflow looks heavy on paper. In practice it’s a 10-minute decision once you’ve done it twice. The shorter version: most SEO experiments aren’t actually a fit for a textbook z-test, and the honest move is to acknowledge that and switch tools.

✓
Z-test fits when

›Sample size is comfortably above 30 per variant
›Test units are genuinely independent, no shared keyword clusters or internal-link bleed
›Population variance is known, or n > 100 makes the estimate safe
›Distribution is roughly symmetric or CLT covers the residual skew
›Binary outcomes (CTR, conversion) with proportions away from 0 and 1
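For the binary-outcome case in the last bullet, the test is short enough to write out by hand. This is a sketch of the pooled two-proportion z-test without continuity correction, using only the standard library; the click and impression counts are hypothetical, and it assumes impressions are independent of each other, which is optimistic for repeat searchers.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: clicks out of impressions for control (a) and variant (b).
z, p_value = two_proportion_z_test(clicks_a=420, n_a=12_000, clicks_b=495, n_b=12_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```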

✗
Bootstrap fits better when

›Sample size is small and you can’t extend the test window
›Traffic distribution is heavily right-skewed with viral outliers
›Population variance is unknown and the sample isn’t large enough to estimate
›You need confidence intervals on derived metrics (ratios, differences-of-differences)
›Independence is borderline and you can resample whole clusters rather than individual pages

Checking z-test assumptions takes five minutes. Skipping them can burn months chasing phantom wins or rolling out changes that hurt traffic. Understanding normality, independence, known variance, and sample size isn’t academic gatekeeping; it’s the difference between a reliable experiment and expensive noise. Before you shift strategy based on a p-value, verify your data meets the requirements. When assumptions break, switch methods rather than proceeding blind.

Try it this week

Audit the last SEO test you shipped. Re-check the four assumptions.

1. Pull the page list and per-page traffic for both variants. Histogram the treatment-group values and look for skew or multiple peaks.
2. Map the internal-link graph between test and control pages. Any direct or two-hop link is enough to violate independence.
3. Re-run the analysis with a bootstrap confidence interval. If the bootstrap and z-test agree, ship with confidence. If they diverge, the original p-value was misleading you.

The exercise takes an hour. Doing it once turns “we ran a test” into “we ran a test we can defend in a roadmap review.”


Madison Houlding

December 13, 2025, 13:22 (213 views)

Categories: Case Studies & Tests


Madison Houlding is Content Manager at Hetneo's Links. Madison runs editorial across the link-building space, auditing campaigns, writing the briefs that keep guest posts from sounding like ad copy, and turning analytics into next month's roadmap. Loves a clean brief, hates a buried lede.
