Building an effective backtest is not significantly different from building any other kind of predictive model: the goal is out-of-sample behavior that resembles what you saw in sample. As such, methodologies developed in statistics and machine learning can be useful. So, a few general recommendations:
The output of your model will be a realization of your assumptions. Shane's given you a great answer. Besides doing out-of-sample testing (i.e., calibrating on period X, then testing on period Y using only information available at the time of each trade), I would add that you should test the strategy in sub-periods: if you have a big chunk of data, break it up and see how the strategy performs on each subset of the data.
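To make the sub-period idea concrete, here is a minimal sketch in Python. The moving-average rule, the simulated prices, and the `sharpe` helper are all illustrative stand-ins, not part of any answer above:

```python
import numpy as np
import pandas as pd

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series."""
    return np.sqrt(periods_per_year) * returns.mean() / returns.std()

def moving_average_signal(prices, window=50):
    """Toy rule: long when price is above its moving average, else flat.
    The shift(1) ensures the signal only uses information available
    before the trade, avoiding look-ahead bias."""
    return (prices > prices.rolling(window).mean()).astype(float).shift(1)

# Simulated random-walk prices stand in for real data.
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=2520, freq="B")
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))), index=idx)

strat_returns = (moving_average_signal(prices) * prices.pct_change()).dropna()

# Break the sample into yearly sub-periods: a strategy whose performance
# swings wildly across them probably rode one regime in the full sample.
by_year = strat_returns.groupby(strat_returns.index.year).apply(sharpe)
print(by_year)
```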
Thanks for the answer, as it tackles a lot of backtesting flaws: model parsimony, overfitting, survivorship bias, look-ahead... But one can actually look at thousands of technical trading rules and other more sophisticated strategies, and maybe find the few that answer all these problems. Nevertheless, we would still be left with data snooping, i.e., we have used our data set until we found a satisfactory result.
I have seen Hansen's SPA ('Superior Predictive Ability') test and its stepwise variants used for this purpose. Hansen's test is a Studentized version of White's Reality Check. The stepwise variants allow one to accept or reject the null of no predictive ability for a subset of the tested strategies while controlling the familywise error rate.
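For intuition, here is a stripped-down sketch of the Reality Check statistic. It uses a plain i.i.d. bootstrap for brevity; White's paper uses the stationary bootstrap to respect serial dependence, and Hansen's SPA additionally Studentizes the statistics:

```python
import numpy as np

def reality_check_pvalue(perf, n_boot=5000, seed=0):
    """perf: T x K array of each strategy's excess performance over the
    benchmark per period. Returns a bootstrap p-value for the null that
    no strategy beats the benchmark."""
    rng = np.random.default_rng(seed)
    T, K = perf.shape
    means = perf.mean(axis=0)
    observed = np.sqrt(T) * means.max()          # test statistic: best strategy
    boot_stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, T, size=T)         # resample time indices
        boot_means = perf[idx].mean(axis=0)
        # Recentre around the sample means to impose the null.
        boot_stats[b] = np.sqrt(T) * (boot_means - means).max()
    return (boot_stats >= observed).mean()

# 100 pure-noise "strategies": the best one looks good in isolation,
# but the Reality Check p-value should stay large.
rng = np.random.default_rng(1)
perf = rng.normal(0, 0.01, size=(1000, 100))
print(reality_check_pvalue(perf))
```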
In his book, 'Evidence-Based Technical Analysis,' David Aronson discusses the overfit bias very well, although I believe his techniques for minimizing the bias may only apply to technical strategies, because they rely on Monte Carlo simulations.
Strictly speaking, data snooping is not the same as in-sample vs. out-of-sample model selection and testing; rather, it concerns sequential or multiple tests of hypothesis based on the same data set. To quote Halbert White:
Data snooping occurs when a given set of data is used more than once for purposes of inference or model selection. When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than to any merit inherent in the method yielding the results.
Let me provide an example. Suppose you have a time series of returns for a single asset and a large number of candidate model families. You fit each of these models on a training data set and then check the performance of its predictions on a hold-out sample. If the number of models is high enough, there is a non-negligible probability that the predictions of at least one model will be considered good purely by chance. This has nothing to do with bias-variance trade-offs: each model may well have been fitted using cross-validation on the training set, or other in-sample criteria like AIC, BIC, or Mallows' Cp. For examples of a typical protocol and criteria, see Ch. 7 of Hastie, Tibshirani, and Friedman's "The Elements of Statistical Learning".

The problem, rather, is that multiple tests of hypothesis are implicitly being run at the same time. Intuitively, the criterion for evaluating multiple models should be more stringent, and a naive approach would be to apply a Bonferroni correction. It turns out that this criterion is too stringent. That's where Benjamini-Hochberg, White, and Romano-Wolf kick in: they provide efficient criteria for model selection. The papers are too involved to describe here, but to get a sense of the problem, I recommend Benjamini-Hochberg first, which is both easier to read and truly seminal.
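To see the multiple-testing effect numerically, here is a small sketch comparing no correction, Bonferroni, and Benjamini-Hochberg on simulated per-strategy p-values; the mixture of null and "real" strategies is made up purely for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = q * (np.arange(1, m + 1) / m)
    below = pvals[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= q*i/m
        rejected[order[: k + 1]] = True
    return rejected

rng = np.random.default_rng(2)
m = 1000
# 950 useless strategies (uniform p-values) plus 50 with genuine signal
# (very small p-values).
pvals = np.concatenate([rng.uniform(size=950), rng.beta(1, 1000, size=50)])

naive = (pvals <= 0.05).sum()              # no correction: many false hits
bonferroni = (pvals <= 0.05 / m).sum()     # familywise control: very few rejections
bh = benjamini_hochberg(pvals).sum()       # FDR control: in between
print(naive, bonferroni, bh)
```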
This blog post points to a presentation about backtesting and data snooping: http://www.portfolioprobe.com/2010/11/05/backtesting-almost-wordless/
I think the only truly snoop-free test is to trade live. But the problem of data snooping can be reduced by checking how significant the backtest result is compared to what would have happened if the trades had been placed at random. Using this technique also makes it clear that backtesting results can easily be deceiving.
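Here is a hypothetical sketch of that comparison: the strategy's trades are scored against many sets of random entry dates of the same size. The toy data, the holding period, and `pnl_for_entries` are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
returns = rng.normal(0.0002, 0.01, size=2000)   # stand-in for asset returns

def pnl_for_entries(returns, entries, hold=5):
    """Total return from holding `hold` periods after each entry index."""
    return sum(returns[e : e + hold].sum() for e in entries)

# Suppose the backtested strategy produced these 50 entry dates.
strategy_entries = rng.choice(len(returns) - 5, size=50, replace=False)
observed = pnl_for_entries(returns, strategy_entries)

# Null distribution: the same number of trades at random dates.
random_pnls = np.array([
    pnl_for_entries(returns, rng.choice(len(returns) - 5, size=50, replace=False))
    for _ in range(2000)
])
pct = (random_pnls >= observed).mean()
print(f"fraction of random-trade runs beating the backtest: {pct:.3f}")
```

If a large fraction of random-trade runs match or beat the backtest, the strategy's apparent edge is indistinguishable from luck.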