The Three Ways Backtests Lie: Survivorship, Look-Ahead, and Overfitting
Most public backtests look great and predict nothing. The three biases that make them dishonest — and how Tessera's engine avoids each.
Published: 2026-03-22 · Updated: 2026-03-22
TL;DR
- Three bugs make most public backtests lie: survivorship bias (the dead companies quietly vanish), look-ahead bias (you secretly used data that did not exist yet), and overfitting (you tuned the strategy to noise and called it skill).
- All three inflate returns in the same direction. A backtest with even one of them left unfixed can easily show double the live-trading return.
- Honest backtests are harder to build, take more data engineering, and look less impressive on the marketing page. That is the trade-off.
Why backtests lie by default
A clean backtest sounds simple: take historical data, run the strategy against it, see what happens. The reason it rarely works that way is that "historical data" is not a fixed object. It is a stack of files that someone has curated, cleaned, restated, and reindexed, usually with the benefit of knowing what happened next. Every one of those curation steps is an opportunity for today's information to leak backwards into yesterday's decision.
The three biases below are the large, well-studied leaks. Each one has a standard fix. The problem is that applying all three fixes at once is expensive, and most retail-facing "we beat the market" charts simply skip the step.
1. Survivorship bias
Definition. A backtest is survivorship-biased when the universe it tests against only contains companies that still exist at the end of the period.
The classic example: "The S&P 500 returned roughly 10% per year from 1990 through 2020." You download today's S&P 500 constituent list, pull 30 years of price history for each ticker, average the returns. The number you get is too high — sometimes by a percent or two per year — because every company that was in the index at some point over those three decades but then went bankrupt, got acquired at a discount, or was kicked out for underperformance does not appear in your dataset at all.
Think about what is missing. Enron. WorldCom. Lehman Brothers. Washington Mutual. GM before its 2009 bankruptcy reorganization. Kodak. Circuit City. Blockbuster. Countrywide. Bear Stearns. Radio Shack. These companies were real holdings in real portfolios. They were in the index. They are not in the current constituent list because they failed, and the backtest that starts from today's list pretends they never existed.
The same issue affects any universe definition that is built from a present-day list: "large-cap growth," "dividend aristocrats," "mid-cap tech." The set is defined by surviving to the definition date.
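The inflation is easy to reproduce with a toy simulation. The sketch below uses made-up return distributions (not real market data): a 500-name universe where roughly 15% of names delist at a loss, compared against the average computed only over the survivors.

```python
import random

random.seed(0)

# Hypothetical universe: 500 tickers, each with one total-period return.
# A fraction fail (delist at a large loss); survivors skew positive.
universe = []
for i in range(500):
    failed = random.random() < 0.15          # ~15% delist over the period
    ret = random.gauss(-0.60, 0.20) if failed else random.gauss(0.10, 0.30)
    universe.append({"ticker": f"T{i}", "ret": ret, "delisted": failed})

survivors = [s for s in universe if not s["delisted"]]

full_mean = sum(s["ret"] for s in universe) / len(universe)
surv_mean = sum(s["ret"] for s in survivors) / len(survivors)

print(f"full-universe mean return:  {full_mean:+.1%}")
print(f"survivors-only mean return: {surv_mean:+.1%}")  # biased upward
```

Filtering to survivors removes exactly the left tail, so the survivors-only average is always the higher number.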
The fix. Point-in-time membership snapshots and delisted-stock price history. On any given historical date, you need to know which tickers were actually in the index or universe on that date, and you need to have the price and fundamental data for the ones that later disappeared. Delisted-stock datasets are sold separately from live-ticker data because they are harder to maintain, and many cheap data vendors do not include them.
How Tessera handles it. Our universe is reconstructed as of each rebalance date, not filtered from a present-day list. When the strategy holds a position and that company is later acquired or delisted, the position is marked out at the actual delisting or acquisition price — it does not silently disappear from the ledger. Discovery at each rebalance pulls from the full investable set that existed on that date, including names that have since been merged away. We still have holes in microcap coverage and in the deepest corners of over-the-counter history, and we say so when we report.
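The point-in-time membership lookup described above reduces to interval queries over an add/remove table. The sketch below is illustrative only — the tickers, dates, and table layout are hypothetical, not Tessera's actual schema:

```python
from datetime import date

# Hypothetical membership intervals: (ticker, added, removed_or_None).
# A point-in-time query ignores today's constituent list and asks:
# "who was in the index on this date?"
MEMBERSHIP = [
    ("AAPL", date(1982, 11, 30), None),
    ("ENE",  date(1976, 6, 30),  date(2001, 11, 29)),  # Enron, removed at collapse
    ("LEH",  date(1994, 7, 1),   date(2008, 9, 16)),   # Lehman Brothers
    ("TSLA", date(2020, 12, 21), None),                # added December 2020
]

def members_as_of(d):
    """Tickers that were index members on date d (point-in-time)."""
    return {
        t for t, added, removed in MEMBERSHIP
        if added <= d and (removed is None or d < removed)
    }

print(members_as_of(date(2000, 1, 3)))   # includes ENE and LEH, not TSLA
print(members_as_of(date(2021, 1, 4)))   # includes TSLA; ENE and LEH are gone
```

The same interval table answers the delisting question: a position in a removed name is closed out at the removal date rather than silently dropped.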
2. Look-ahead bias
Look-ahead bias is the subtler killer. It happens whenever the backtest uses a piece of information that was not yet available on the decision date. The traps are numerous and easy to miss.
- Restated financials. Companies revise reported earnings — sometimes a quarter later, sometimes years later after a restatement or accounting change. If your 2010 backtest uses the revenue number that is now recorded for 2010 Q1, but that number was revised twice after the original release, the strategy is trading on information that did not exist at the time. The fix is to use as-reported values with their original release timestamps, not the clean revised series that most data vendors default to.
- Sector reclassification. The GICS framework reclassifies companies periodically. In September 2018, the Telecommunications sector was renamed Communication Services and broadened to include Alphabet, Netflix, Disney, and Facebook (now Meta), among others that had been classified under Technology or Consumer Discretionary. A sector-relative strategy that uses today's GICS mapping for historical dates is cheating — it is grouping those stocks under the new taxonomy before the taxonomy existed.
- Index inclusion timing. If your 2005 backtest uses the current S&P 500 membership list, you are including companies that had not yet been added. Tesla joined the S&P 500 in December 2020. Any pre-2020 backtest that holds Tesla as an S&P 500 member is using future information.
- Intraday look-ahead. Many backtests assume execution at the daily close, then measure performance from the same close. In practice, if your strategy generates a signal from the close, you cannot trade it until the next open at the earliest, and in most retail workflows you trade somewhere midday. Using the same bar for both the signal and the fill is a small but compounding leak.
- Benchmark timing and dividend adjustments. Total return series with back-adjusted dividends, reverse stock splits applied retroactively without matching the original trading context, corporate action reconstructions that "smooth" a real price gap — all of these can quietly inject future information into past bars.
The fix. Strict as-of timestamps on every data point. A financial datum is not just a value — it is a value with a release date and, ideally, a revision history. Using a datum before its release date is a bug, and catching the bug requires building the data layer around timestamps from the start, not bolting them on afterward.
How Tessera handles it. Our fundamentals store keeps the original release timestamp alongside each reported value, and strategy queries are filtered by as_of <= decision_date. Sector assignments are historical GICS mappings, not the current snapshot. Index membership queries return the constituents who were in the index on that date. Signals generated from the daily close are executed at the next bar's open, with a separate slippage model layered on top.
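The `as_of <= decision_date` rule mentioned above reduces to a simple predicate over release-stamped rows. A minimal sketch, with hypothetical revenue figures, release dates, and table layout:

```python
from datetime import date

# Hypothetical as-reported store: each datum keeps its original release
# date, and revisions are stored as separate rows rather than overwrites.
REVENUE = [  # (fiscal_period, value, released)
    ("2010Q1", 100.0, date(2010, 4, 28)),   # original release
    ("2010Q1", 104.0, date(2010, 7, 30)),   # first revision
    ("2010Q1",  97.0, date(2012, 2, 15)),   # restatement, years later
]

def as_reported(period, decision_date):
    """Latest value for `period` whose release satisfies as_of <= decision_date."""
    visible = [(rel, v) for p, v, rel in REVENUE
               if p == period and rel <= decision_date]
    return max(visible)[1] if visible else None   # most recent visible release

print(as_reported("2010Q1", date(2010, 5, 1)))   # 100.0 — only the original exists yet
print(as_reported("2010Q1", date(2013, 1, 1)))   # 97.0 — restated figure now visible
print(as_reported("2010Q1", date(2010, 4, 1)))   # None — not yet released
```

The key design choice is that revisions append rather than overwrite: the clean revised series that most vendors ship is exactly what this schema refuses to hand a 2010 decision.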
3. Overfitting
Overfitting is what happens when you search a large parameter space and report the best result. The search process itself is the leak.
The classic example is the amateur moving-average crossover study. You try 10 fast periods and 10 slow periods, which gives 100 combinations. You backtest all 100 on SPY, sort by Sharpe ratio, and report the top one. It shows 18% annualized and a Sharpe of 1.3. Published as a strategy, it is almost certain to underperform out of sample, because it is the best of 100 draws from a distribution whose mean is closer to zero. You have not discovered an edge — you have discovered a local maximum in a noisy surface.
The problem scales badly. A strategy with six parameters, each tested at ten values, is a one-million-cell grid. The best cell out of a million almost always looks extraordinary. The gap between the best cell and the median cell is a direct measure of how much the reported number is driven by search rather than by signal.
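The best-cell-versus-median-cell gap needs no market data to demonstrate. The sketch below scores 100 "strategies" whose daily returns are pure noise with mean zero, so no cell of the grid has any edge at all, and then compares the best Sharpe against the median:

```python
import random
import statistics

random.seed(42)

def sharpe(returns):
    """Annualized Sharpe ratio from daily returns (risk-free rate ignored)."""
    mu = statistics.mean(returns)
    sd = statistics.stdev(returns)
    return (mu / sd) * (252 ** 0.5)

# 100 parameter combinations, each producing 5 years of daily returns
# that are pure noise: there is no signal anywhere in this grid.
grid = [[random.gauss(0.0, 0.01) for _ in range(252 * 5)] for _ in range(100)]
sharpes = sorted(sharpe(r) for r in grid)

print(f"median Sharpe of the grid: {statistics.median(sharpes):+.2f}")  # near zero
print(f"best Sharpe of the grid:   {sharpes[-1]:+.2f}")  # looks like skill
```

Reporting only `sharpes[-1]` is exactly the overfit workflow: the peak of a noise distribution presented as a discovery.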
The fix. Several complementary tools:
- Reserve a true out-of-sample window that the parameter search never touches. Split history into an in-sample calibration period and an out-of-sample validation period. Tune on the first, measure on the second, and report both.
- Regularize the parameter choice. Prefer simpler configurations unless added complexity earns its keep. Penalize the number of tuned knobs explicitly.
- Report the distribution of outcomes across the grid, not only the peak. The median and the worst reasonable configuration tell you how robust the result is.
- Use walk-forward testing: roll the calibration window forward through time and re-fit, so the strategy is always evaluated on data it has not yet seen.
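The walk-forward scheme in the last bullet can be sketched as a window generator. The window lengths here (36 months of calibration, 12 of validation) are arbitrary choices for illustration:

```python
def walk_forward_windows(n_periods, train, test):
    """Yield (train_slice, test_slice) index pairs rolling through history.

    The strategy is re-fit on each train slice and evaluated only on the
    test slice that follows it, so evaluation data is always unseen.
    """
    start = 0
    while start + train + test <= n_periods:
        yield (slice(start, start + train),
               slice(start + train, start + train + test))
        start += test  # roll forward by one full test window

# 10 years of monthly data: fit on 36 months, validate on the next 12.
for tr, te in walk_forward_windows(120, train=36, test=12):
    print(f"fit on months {tr.start}-{tr.stop - 1}, "
          f"evaluate on months {te.start}-{te.stop - 1}")
```

Stitching the test slices together produces a single out-of-sample track record in which no return was ever seen during its own calibration.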
How Tessera handles it. Parameters for the quality gate, rotation thresholds, and regime cutoffs are calibrated on an early slice of history and then held fixed through a later out-of-sample slice for the reported results. When we describe a parameter choice, we also report the range of outcomes across neighboring values, so the reader can see whether the reported number is a solitary peak or a stable plateau. We count the number of tuned parameters explicitly and disclose it.
What a skeptical reader should ask about any backtest
A practical checklist when anyone waves a chart at you:
- Was the universe point-in-time, or filtered from a current list? (Survivorship.)
- Were fundamentals as-reported at the original release date, or the revised series? (Look-ahead.)
- How many parameters were tuned, and across how large a grid? (Overfitting.)
- Was there a true out-of-sample hold-out that the parameter search never touched?
- What is the Sharpe ratio, not only the total return? High total return with a Sharpe below 0.5 is mostly leverage or luck.
- What are the worst rolling three-year and five-year windows? A great CAGR with a 55% drawdown is not a strategy most people can actually hold.
- How did it do through 2008–2009, March 2020, and the 2022 bear market? Those are the periods that reveal whether risk management was real or cosmetic.
- Does the return include realistic transaction costs, bid-ask spreads, and slippage, or is it gross of all frictions?
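Two of the checklist numbers, maximum drawdown and the worst rolling window, are mechanical to compute from an equity curve. A minimal sketch using a toy monthly curve that crashes mid-sample:

```python
def max_drawdown(equity):
    """Worst peak-to-trough decline over the curve, as a negative fraction."""
    peak, worst = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        worst = min(worst, v / peak - 1.0)
    return worst

def worst_rolling_return(equity, window):
    """Worst total return over any `window`-period span."""
    return min(equity[i + window] / equity[i] - 1.0
               for i in range(len(equity) - window))

# Toy monthly equity curve: steady growth with a crash at month 30.
equity = [100 * 1.01 ** t for t in range(60)]           # 5 years at 1%/month
equity = equity[:30] + [v * 0.55 for v in equity[30:]]  # 45% one-month crash

print(f"max drawdown:            {max_drawdown(equity):.1%}")
print(f"worst rolling 3y return: {worst_rolling_return(equity, 36):+.1%}")
```

A chart that reports only CAGR hides both numbers; either one can disqualify a strategy that a reader cannot realistically hold through.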
If the chart does not come with answers to most of those questions, treat the chart as marketing, not evidence.
Tessera's honesty layer
Concretely, this is what we do and do not do:
- Point-in-time universe. Each rebalance rediscovers the investable set as of that date, including names that have since been acquired or delisted.
- As-reported financials. Fundamentals carry original release timestamps and are queried with an as-of filter, so strategies never see a revision before it was published.
- Declared parameter counts. Every reported backtest states how many parameters were tuned and across what range.
- Out-of-sample validation. Parameters are calibrated on earlier data and held fixed through a later window. The reported headline number is the out-of-sample number, not the in-sample number.
- Distribution, not just peak. We report the median and a pessimistic configuration alongside the best one, so the stability of the result is visible.
- Realistic slippage. A basis-point drag scaled to market cap is applied on every fill; commissions are modeled separately where relevant.
- Stress-test windows. Full 2008–2009, March 2020, and 2022 bear-market windows are included in any multi-year report, not excised.
Where we still fall short: corporate-event reconstruction is imperfect for microcaps and thinly traded names, pre-2000 fundamentals coverage is uneven, and our slippage model is an average, not a tick-level order-book simulation. We say so in the fine print because the fine print matters.
Let Tessera do this automatically
Tessera scores every US stock weekly on 24 quality factors and ranks them against their sector. Get the top picks in your inbox — no credit card.
Try the free screener →

Why it matters for you
A pretty backtest is not a future return. Ever. A screener or newsletter that claims "11x the S&P in backtests" is almost certainly running at least one of the three leaks above, and probably all three. That includes any product that shows you a chart without telling you how the universe, the data, and the parameters were constructed — including ours, if we ever stop showing our work.
What you should look for in any backtest, from Tessera or anyone else: honest treatment of bias, realistic costs, a real out-of-sample test, and a coherent theory of why the edge should exist at all. The last item is the most important. A strategy that worked in the backtest for reasons the author cannot articulate is a strategy that will stop working for reasons the author will also not be able to articulate.