Why Your Internal Linking Test Might Be Wrong: Z-Test Assumptions Explained
Z-tests look reassuringly simple: plug in two means, get a p-value, ship the winner. The problem is that the test only earns its keep when four assumptions hold, and in my experience SEO data violates at least one of them most of the time. Normality, independence, known variance, and adequate sample size are the load-bearing pieces, and when any of them buckle, the confidence interval you’re staring at is closer to fiction than measurement. This guide walks through what z-tests actually require, the case where ignoring those requirements cost a real team four months of recovery, and the alternatives worth reaching for when the assumptions don’t hold.
What the Z-Test Actually Requires
Before any of the per-assumption checks, it helps to fix the vocabulary, because most of the field uses these terms loosely and the looseness is where bad tests come from.
Quick vocabulary
- Normality assumption: The requirement that your data, or the sampling distribution of your test statistic, follows a bell curve. Z-tests assume this directly; the CLT lets large samples cheat.
- Independence: The requirement that one observation’s value doesn’t influence another’s. Internally linked pages competing for the same keyword cluster violate this every time.
- Sample size (n): The number of independent observations per variant. The conventional floor for z-tests is n > 30, though SEO usually needs much more.
- p-value: The probability of observing a result at least as extreme as yours if the null hypothesis were true. It is not the probability the change worked.
- Type I error: A false positive, declaring a winner when nothing actually changed. Alpha (typically 0.05) is the rate you accept up front.
- Type II error: A false negative, missing a real effect because your sample was too small or too noisy. Beta is the rate; 1 minus beta is statistical power.
The whole point of going through these terms is that “the z-test said p < 0.05” carries a different weight depending on which of these assumptions you’ve actually verified. In most cases, SEO teams verify the sample size and skip the rest.
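To make the alpha, beta and power vocabulary concrete, here is a minimal sketch of the standard normal-approximation sample-size formula for comparing two means. The helper name `pages_per_variant` is hypothetical, the effect size and standard deviation are made-up numbers, and the formula assumes equal variance in both groups and independent pages.

```python
import math
from statistics import NormalDist

def pages_per_variant(effect, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample z-test.

    effect: smallest difference in means worth detecting
    sd:     assumed standard deviation of the metric (same in both groups)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # threshold set by the Type I error rate
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    return math.ceil(n)

# Illustrative numbers only: detect a 50-visit lift when the per-page SD is 200 visits.
print(pages_per_variant(effect=50, sd=200))
```

With noisy per-page traffic, the required n can land well above the conventional floor of 30, which is one way to see why SEO tests usually need much more.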
Sample Size: The 30-Page Minimum Rule
The 30-observation threshold comes from the Central Limit Theorem: with enough data points, sampling distributions approach normal even when individual values don’t. Below 30 pages per group, your z-test p-values become unreliable: what looks like a 95% confidence interval might actually be closer to 85%.
Testing with 10-15 pages is common in SEO experiments targeting specific page types, but it inflates the false-positive rate. A 12-page test showing +40% traffic with p=0.03 may simply reflect natural variance, not your meta description changes. The smaller your sample, the more likely it is that extreme values, especially the top one or two pages, skew your mean.
Pro tip
When your group size is borderline (25-35 pages), run both a z-test and a t-test on the same data. If both produce similar p-values, your finding is robust. Divergent results mean you’re in the uncertain zone where sample-size choice changes the conclusion, and that’s a signal to extend the test, not to ship.
When you have fewer than 30 observations, use a t-test instead. It adjusts for small-sample uncertainty by widening confidence intervals and requiring stronger evidence before declaring significance. Many SEO platforms default to z-tests regardless of sample size, so verify before trusting automated results (I once spent an afternoon arguing with a vendor support rep who insisted their tool “auto-detects” the right test; it does not). The default isn’t always wrong, but it’s almost never re-checked.
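The side-by-side comparison can be sketched as follows. This is a minimal example, assuming per-page weekly visits as the metric; the data is made up and `z_and_t_pvalues` is a hypothetical helper. It uses Welch’s t-test from SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) and a hand-rolled two-sample z-test that plugs in sample variances.

```python
import math
from statistics import NormalDist, mean, variance
from scipy import stats

def z_and_t_pvalues(control, treatment):
    """Two-sided p-values from a two-sample z-test and Welch's t-test."""
    diff = mean(treatment) - mean(control)
    # The z-test treats these sample variances as if they were known exactly,
    # which is the shortcut that makes it optimistic at small n.
    se = math.sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    z = diff / se
    p_z = 2 * (1 - NormalDist().cdf(abs(z)))
    p_t = stats.ttest_ind(treatment, control, equal_var=False).pvalue
    return p_z, p_t

# Made-up weekly visits for 12 pages per group.
control = [120, 95, 130, 110, 88, 140, 105, 99, 125, 115, 92, 108]
treatment = [135, 101, 150, 118, 97, 160, 112, 104, 139, 128, 95, 121]

p_z, p_t = z_and_t_pvalues(control, treatment)
print(f"z-test p = {p_z:.3f}, Welch t-test p = {p_t:.3f}")
```

Both tests share the same test statistic here; only the reference distribution differs, so the t-test p-value is never smaller than the z-test one. The gap between them shrinks as the group size grows.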

Independence: Why Testing Product Pages Together Fails
Independence requires that one page’s performance doesn’t influence another’s. Product pages rarely meet this standard. Internal linking creates direct dependencies: anchor text from your “blue widgets” page can boost rankings for “blue widget accessories,” while both compete for similar query space. When you test pages in the same category, shared link equity flows between them through navigation menus, related product modules, and breadcrumbs.
In SEO, your test pages don’t behave like 12 coin flips. They behave like 12 players on the same team, where one’s performance directly changes the others’.
Keyword cannibalisation compounds the problem: Google may swap which page ranks for overlapping terms mid-test, creating false negatives or positives. Testing “red shoes” and “crimson sneakers” simultaneously means changes to one alter the other’s traffic through search result reshuffling. Your control group corrupts your treatment group, invalidating the z-test’s foundational math. Select test pages from unrelated categories with distinct keyword targets and minimal cross-linking to preserve independence. Honestly, on most ecommerce sites this is the assumption that’s hardest to engineer around: the whole internal-link graph exists to create dependencies, and a clean test demands the opposite.
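One way to screen candidate pages for cross-linking is to walk the internal-link graph one and two hops out from each page. The sketch below assumes you already have a crawl export shaped as a dict of URL to outbound links; the function names and URLs are placeholders, and it checks links only, not keyword overlap.

```python
def linked_within_two_hops(links, source, targets):
    """Return the targets reachable from `source` in one or two link hops.

    links: dict mapping a URL to the set of URLs it links to (hypothetical crawl export).
    """
    one_hop = links.get(source, set())
    two_hop = set()
    for page in one_hop:
        two_hop |= links.get(page, set())
    return (one_hop | two_hop) & set(targets)

def independence_violations(links, test_pages, control_pages):
    """List (from, to) pairs where a test and a control page are linked in either direction."""
    violations = []
    for t in test_pages:
        for c in linked_within_two_hops(links, t, control_pages):
            violations.append((t, c))
    for c in control_pages:
        for t in linked_within_two_hops(links, c, test_pages):
            violations.append((c, t))
    return sorted(violations)

# Toy link graph; the URLs are placeholders.
links = {
    "/blue-widgets": {"/widgets", "/blue-widget-accessories"},
    "/widgets": {"/red-widgets"},
    "/garden-hoses": {"/garden"},
}
print(independence_violations(links, ["/blue-widgets"], ["/red-widgets", "/garden-hoses"]))
```

An empty result does not prove independence, since pages can still compete for the same queries, but a non-empty one is a cheap early warning.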
Normality: When Traffic Data Breaks the Bell Curve
Organic traffic rarely follows the bell curve. A handful of landing pages capture the majority of visits (the classic long-tail distribution) while most pages sit in near-obscurity. This right-skewed reality violates the normality assumption baked into z-tests. Moz’s analytics archive circles back to this point regularly: almost every distribution that matters in this field is heavy-tailed.
Seasonality compounds the problem: retail sites spike in November, tax software in March. When you slice data by week or day, these patterns create bumpy, non-normal distributions. Layer on algorithm updates, especially Google’s Core Updates, and traffic can lurch or flatline overnight, shredding any semblance of normalcy.
Note
The Central Limit Theorem gives you cover for non-normal individual values when the sample is large, but it does nothing for non-independent observations. A common mistake is invoking the CLT to justify a z-test on internally-linked pages; the CLT was never going to fix that problem.
The good news: the Central Limit Theorem relaxes normality requirements when sample sizes are large (typically n > 30 per variant). Your test’s mean traffic becomes approximately normal even if individual page visits aren’t. This statistical cushion means you can often proceed with a z-test despite messy underlying data, but only if independence and sample size hold firm. When in doubt, visualise your distribution first.
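A quick simulation shows how fast (or slowly) the CLT cushion kicks in. This is a sketch under a strong assumption: per-page traffic is drawn independently from a lognormal distribution, chosen only because it is right-skewed, not because it was fitted to real data. It measures how skewed the sample mean still is at different group sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def skew_of_sample_means(n_pages, n_simulations=5000):
    """Skewness of the mean of `n_pages` draws from a right-skewed distribution."""
    # Lognormal traffic is an illustrative assumption, not a fitted model.
    samples = rng.lognormal(mean=5.0, sigma=1.0, size=(n_simulations, n_pages))
    return float(stats.skew(samples.mean(axis=1)))

for n in (5, 30, 200):
    print(n, round(skew_of_sample_means(n), 2))
```

Skewness near zero means the sampling distribution of the mean is close to symmetric. With heavy-tailed inputs the mean can remain visibly skewed at n = 30, which is consistent with the point that SEO data often needs far more than the conventional floor.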
When the Z-Test Fits, and When It Doesn’t
The cleaner way to think about this is by use case, not by ritual. A z-test isn’t “right” or “wrong” in the abstract; it fits some experimental setups and fails others, and the failure mode is usually quiet.
| Test scenario | Z-test fits when | Z-test fails when |
|---|---|---|
| Title-tag CTR change | Hundreds of impressions per page across at least 30 pages, distinct keyword targets | Pages share intent clusters or you have under 30 pages with usable impression volume |
| Internal-link addition | Treated pages have no upstream or downstream link path to control pages | Test and control pages live in the same silo, so link equity bleeds across the boundary |
| Template / layout test | Pages selected from unrelated categories, large sample (n > 100 per variant) | All test pages target variations of the same head term (see the case study below) |
| Conversion-rate test | Thousands of sessions per variant, binary outcome, randomised user assignment | Long-tail revenue distribution where a single whale conversion dominates the mean |
| Page-speed rollout | User-cohort split, large session count, metric is median LCP, not mean LCP | Metric is mean LCP and a few slow sessions distort everything; use a non-parametric test |
The pattern across the “fails” column is the same in every row: independence is broken, the sample is thin, or the underlying distribution is too lopsided for the mean to be the right summary statistic. In each case, the fix has the same shape: switch the test or restructure the experiment; don’t massage the data until z-test math accepts it.
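For the conversion-rate row, where the z-test is on its firmest ground, the calculation is short enough to write by hand. The sketch below is a standard pooled two-proportion z-test; the function name is hypothetical and the counts are made up. It assumes randomised user assignment and independent sessions, as the table requires.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null of no difference
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up counts: 5,000 sessions per variant.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=245, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The same function gives misleading answers if sessions are not independent, for example when one user contributes many sessions to a single variant.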
Real Case Study: When Bad Assumptions Cost Rankings
In Q2 2023, a SaaS company’s growth team tested a new template on 12 product pages, targeting mid-volume keywords. After two weeks, their z-test showed a 22% traffic increase with p < 0.05. They celebrated, rolled the template to 340 similar pages, and watched organic traffic drop 18% over the next month.
The violated assumption: independence. The 12 test pages all targeted variations of “project management software for [industry]” and ranked for overlapping keyword clusters. When Google’s algorithm adjusted rankings after the template change, the pages cannibalised each other’s visibility. The initial lift came from three pages that happened to gain featured snippets, temporarily masking drops across the other nine. I’ve seen versions of this exact pattern on at least three client audits (one of them a Series B fintech that had already presented the “+22% win” to their board, which made the rollback conversation a memorable one). The head-term hides the carnage further down.
Their z-test treated each page as an independent observation, but the pages competed in the same SERP ecosystem. Clicks on one page directly reduced impressions for others. The small sample size, 12 pages, made this dependence catastrophic. A proper test would’ve used page clusters as the unit of analysis or isolated pages targeting truly distinct keyword sets.
Watch for
When 3 of your 12 test pages account for most of the observed lift, that’s a heavy-tail signal, not a clean win. The conventional response is “we found the high-performers”; the rigorous response is to ask whether removing those three pages reverses the result. In this case study, it would have.
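The “remove the top three and look again” check is a few lines of code. This is a minimal sketch: the per-page lifts are invented to mimic the pattern described, not taken from the case study, and `mean_lift_without_top` is a hypothetical helper. Averaging per-page percentages is itself a simplification.

```python
def mean_lift_without_top(lifts, k=3):
    """Mean per-page lift across all pages, and again after dropping the k largest."""
    ranked = sorted(lifts, reverse=True)
    trimmed = ranked[k:]
    return sum(ranked) / len(ranked), sum(trimmed) / len(trimmed)

# Made-up per-page traffic changes (%) for 12 test pages; not the case-study data.
lifts = [180, 140, 95, -10, -12, -15, -18, -20, -14, -22, -16, -24]

overall, without_top3 = mean_lift_without_top(lifts, k=3)
print(f"all pages: {overall:+.1f}%, without top 3: {without_top3:+.1f}%")
if overall > 0 and without_top3 < 0:
    print("Sign flips once the top 3 pages are removed: treat the win as unproven.")
```

A sign flip does not prove the change was harmful, but it does show that the headline number rests on a handful of pages.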
The recovery took four months. The team reverted templates on the worst performers, and traffic patterns stabilised only after they rebuilt topical authority through content consolidation. Well, that’s the polite version. The honest version is they rebuilt the silo from scratch.
Why it matters: when your test units share ranking signals, keyword overlap, or internal link equity, standard z-test math breaks down. The formula assumes your 12 pages behave like 12 coin flips, independent events. In SEO, they behave like 12 players on the same team, where one’s performance directly affects the others. Ahrefs has written about similar setups where the test design issues swamped the measured effect.
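Using the cluster as the unit of analysis can be sketched as a simple aggregation step before any test is run. The page URLs, cluster labels and lift values below are hypothetical, and how you assign pages to clusters is the hard part that this example takes as given.

```python
from collections import defaultdict
from statistics import mean

def cluster_means(page_lifts, page_cluster):
    """Collapse page-level lifts to one value per keyword cluster."""
    by_cluster = defaultdict(list)
    for page, lift in page_lifts.items():
        by_cluster[page_cluster[page]].append(lift)
    return {cluster: mean(values) for cluster, values in by_cluster.items()}

# Hypothetical pages, lifts (%) and cluster labels.
page_lifts = {"/pm-construction": 30.0, "/pm-agencies": -12.0, "/pm-healthcare": -9.0,
              "/invoicing-freelancers": 4.0, "/invoicing-agencies": 6.0}
page_cluster = {"/pm-construction": "project management", "/pm-agencies": "project management",
                "/pm-healthcare": "project management",
                "/invoicing-freelancers": "invoicing", "/invoicing-agencies": "invoicing"}

units = cluster_means(page_lifts, page_cluster)
print(units)
print("independent units:", len(units), "not", len(page_lifts))
```

The sobering part is the last line: once competing pages are collapsed, the effective sample size is the number of clusters, which is often far smaller than the page count.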

Quick Checks Before You Run the Test
A pre-flight check that takes five minutes catches most of the violations that make z-test results unreliable. The whole point is to run it before you commit to the test design, not after the p-value comes in looking attractive. Which, in most cases, it will.
The assumption-check cycle
Visual Tests You Can Run in 2 Minutes
In Google Sheets, create a histogram by selecting your data, then Insert > Chart > Histogram. Look for strong skew (long tail on one side) or multiple peaks; both signal non-normality. For tighter validation, generate a Q-Q plot in Python using scipy.stats.probplot(). If points follow the diagonal line closely, your data is approximately normal; systematic curves or S-shapes indicate violations. Sheets users can export to Colab for quick Q-Q checks. These visual tests catch obvious problems in under two minutes, helping you decide whether parametric tests are safe or whether you need alternatives like bootstrap methods.
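A numeric version of the Q-Q check might look like the sketch below. It relies on `scipy.stats.probplot`, which returns the plotted points along with a least-squares fit whose correlation `r` summarises how closely the points hug the line. The traffic values are simulated from a lognormal distribution purely for illustration, and `qq_summary` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

def qq_summary(values):
    """Return skewness and the Q-Q correlation r against a normal distribution."""
    # probplot returns ((theoretical, ordered), (slope, intercept, r)).
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return float(stats.skew(values)), float(r)

rng = np.random.default_rng(0)
# Simulated per-page visits; lognormal is an illustrative stand-in for skewed traffic.
visits = rng.lognormal(mean=5.0, sigma=1.0, size=200)

for label, data in (("raw", visits), ("log", np.log(visits))):
    skew, r = qq_summary(data)
    print(f"{label}: skew = {skew:.2f}, Q-Q r = {r:.3f}")
```

To see the actual plot, pass a matplotlib module or axes via the `plot` argument, for example `stats.probplot(visits, plot=plt)`. The number is a convenience; the picture is still the better diagnostic.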
When to Use Non-Parametric Tests Instead
When your SEO experiment data violates z-test assumptions (non-normal distributions, samples under 30, or heavy skew from outliers), switch to the Mann-Whitney U test. This non-parametric alternative compares rank order instead of raw values, making it robust to skewed metrics like time-on-page or conversion rates with extreme values. It makes no assumption about distribution shape, though it still requires independent observations, and it remains usable with samples as small as 5-10 per variant. The tradeoff: slightly less statistical power when data is actually normal, but far more trustworthy results when it’s not. For most SEO experiments with real user behaviour data, Mann-Whitney often proves the safer default choice.
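Here is a minimal sketch of the switch, using `scipy.stats.mannwhitneyu` next to Welch’s t-test on the same made-up time-on-page values. One extreme session is planted in the treatment group to show how differently the two tests react to it.

```python
from scipy import stats

# Made-up time-on-page values (seconds); one extreme session in the treatment group.
control = [42, 38, 51, 47, 40, 36, 44, 49]
treatment = [45, 50, 39, 48, 43, 52, 41, 900]

u_result = stats.mannwhitneyu(treatment, control, alternative="two-sided")
t_result = stats.ttest_ind(treatment, control, equal_var=False)

print(f"Mann-Whitney U = {u_result.statistic:.1f}, p = {u_result.pvalue:.3f}")
print(f"Welch t-test  t = {t_result.statistic:.2f}, p = {t_result.pvalue:.3f}")
```

Because the test works on ranks, replacing 900 with 9,000 leaves the U statistic unchanged, while the t statistic moves with it. Note that Mann-Whitney answers a slightly different question (whether one group tends to produce larger values), not whether the means differ.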
What to Do When Assumptions Don’t Hold
When z-test assumptions break down, you have four practical paths forward rather than abandoning statistical rigour entirely.

Bootstrapping methods resample your actual data thousands of times to build empirical confidence intervals without assuming normality. Use this when traffic distributions are heavily skewed or sample sizes remain stubbornly small despite extended test windows.
Extending test duration increases your sample size, which often resolves normality violations through the Central Limit Theorem and reduces the impact of temporal autocorrelation. Run tests for at least 2-4 full business cycles when initial sample sizes fall below 1,000 sessions per variant. In my experience, this is the option teams reach for last because it’s the least exciting, and yet it solves more problems than any of the others.
Pro tip
Bootstrapping with 10,000 resamples runs in under a second on a laptop for any SEO-sized dataset. The cost is writing the script, not the compute. If you’re going to run more than two non-trivial experiments a quarter, the one-time investment in a reusable bootstrap function pays back fast.
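A reusable bootstrap function can be as small as the sketch below. It builds a percentile confidence interval for the difference in means with NumPy; the function name, the data and the default of 10,000 resamples are illustrative choices. It resamples individual pages, so it assumes the pages are independent; for related pages you would resample whole clusters instead.

```python
import numpy as np

def bootstrap_diff_ci(control, treatment, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    # Resample each group with replacement, keeping its original size.
    c_idx = rng.integers(0, len(control), size=(n_resamples, len(control)))
    t_idx = rng.integers(0, len(treatment), size=(n_resamples, len(treatment)))
    diffs = treatment[t_idx].mean(axis=1) - control[c_idx].mean(axis=1)
    tail = (1 - level) / 2
    low, high = np.quantile(diffs, [tail, 1 - tail])
    return float(low), float(high)

# Made-up weekly visits with one viral page in the treatment group.
control = [120, 95, 130, 110, 88, 140, 105, 99, 125, 115, 92, 108]
treatment = [135, 101, 150, 118, 97, 160, 112, 104, 139, 128, 95, 1400]

low, high = bootstrap_diff_ci(control, treatment)
print(f"95% CI for the difference in means: [{low:.1f}, {high:.1f}]")
```

A very wide interval is the honest output when a single page dominates the mean. The percentile method is the simplest variant; with very small samples its coverage can still fall short of the nominal level.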
Segmenting tests by page type or user cohort keeps each comparison like-for-like when different URL groups behave differently enough to violate the identical-distribution assumption. Apply this when mixing transactional and informational pages in a single test, or when bot traffic contaminates specific segments. SimilarWeb’s segmentation work on traffic mix is a useful reference point for thinking about how thoroughly you need to slice before the segments behave consistently.
Log transformations compress right-skewed traffic distributions closer to normality by reducing the influence of extreme outliers. Transform your metrics when a handful of viral pages generate 10x typical traffic, then run the z-test on log-transformed values and back-transform results for interpretation.
Each workaround trades simplicity for validity. Bootstrapping requires coding skills but handles nearly any violation. Longer tests cost time but need no special analysis. Segmentation fragments your data but preserves accuracy. Transformations work fast but require careful back-conversion when communicating results to stakeholders, which is, frankly, where most log-transformed analyses go wrong: the analyst gets the math right and then the report describes a “geometric mean” as if it were the arithmetic mean readers expect.
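The back-conversion trap is easier to see with numbers. In the sketch below, a difference of means on the log scale is exponentiated to give a ratio of geometric means, then compared with the ratio of arithmetic means on the same made-up data. The function name is hypothetical, and the approach assumes strictly positive values, since the log of zero is undefined.

```python
import numpy as np

def geometric_mean_ratio(control, treatment):
    """Ratio of geometric means, computed as a difference of means on the log scale."""
    log_diff = np.mean(np.log(treatment)) - np.mean(np.log(control))
    return float(np.exp(log_diff))  # back-transform: a ratio, not a difference in visits

# Made-up weekly visits; values must be strictly positive for the log to be defined.
control = [120, 95, 130, 110, 88, 140, 105, 99, 125, 115, 92, 108]
treatment = [135, 101, 150, 118, 97, 160, 112, 104, 139, 128, 95, 1400]

ratio = geometric_mean_ratio(control, treatment)
arithmetic = float(np.mean(treatment) / np.mean(control))
print(f"geometric-mean ratio: {ratio:.2f}x, arithmetic-mean ratio: {arithmetic:.2f}x")
```

The two ratios answer different questions: the geometric one describes the typical page, the arithmetic one describes total traffic. Reporting the first as if it were the second is the mistake described above.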
Picking the Right Test for the Right Job
The full assumption-check workflow looks heavy on paper. In practice it’s a 10-minute decision once you’ve done it twice. The shorter version: most SEO experiments aren’t actually a fit for a textbook z-test, and the honest move is to acknowledge that and switch tools.
✓ Z-test fits when
- Sample size is comfortably above 30 per variant
- Test units are genuinely independent: no shared keyword clusters or internal-link bleed
- Population variance is known, or n > 100 makes the estimate safe
- Distribution is roughly symmetric or CLT covers the residual skew
- Binary outcomes (CTR, conversion) with proportions away from 0 and 1
✗ Bootstrap fits better when
- Sample size is small and you can’t extend the test window
- Traffic distribution is heavily right-skewed with viral outliers
- Population variance is unknown and the sample isn’t large enough to estimate
- You need confidence intervals on derived metrics (ratios, differences-of-differences)
- Independence is borderline and you can resample whole clusters of related pages rather than individual pages
Checking z-test assumptions takes five minutes. Skipping them can burn months chasing phantom wins or rolling out changes that hurt traffic. Understanding normality, independence, and known variance isn’t academic gatekeeping; it’s the difference between a reliable experiment and expensive noise. Before you shift strategy based on a p-value, verify your data meets the requirements. When assumptions break, switch methods rather than proceeding blind.
Try it this week
Audit the last SEO test you shipped. Re-check the four assumptions.
1. Pull the page list and per-page traffic for both variants. Histogram the treatment-group values and look for skew or multiple peaks.
2. Map the internal-link graph between test and control pages. Any direct or two-hop link is enough to violate independence.
3. Re-run the analysis with a bootstrap confidence interval. If the bootstrap and z-test agree, ship with confidence. If they diverge, the original p-value was misleading you.
The exercise takes an hour. Doing it once turns “we ran a test” into “we ran a test we can defend in a roadmap review.”