Data dredging

"Data dredging" (sometimes called "data fishing") is a real risk that can invalidate any conclusions you draw from your analysis.

Data dredging occurs when:
  • Exploratory analyses are used to find subsets of the data that confirm (or are more likely to confirm) an a priori hypothesis, even though the result may not generalise to the whole (statistical) population.
  • Exploratory analyses are used to generate a hypothesis from a given data set, and that hypothesis is then tested using the same data set.

How do you avoid data dredging?

  • Evaluate whether your data support a hypothesis that is based on previous knowledge and research. If you transform or discard data, ensure that there is a solid rationale for doing so. If not, you may simply be 'massaging' the data for a (probably false) signal.
  • If you use exploratory analyses to generate hypotheses, be sure to test those hypotheses on data sets other than the one used for exploratory analysis. If you have a very large data set (with hundreds or thousands of samples), it may be feasible to use a random subset of samples for exploratory analysis and to test any hypotheses derived from it on the remaining samples.
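
The random split of a large data set into exploratory and confirmatory samples could look something like the following. This is only a minimal sketch: the function name `split_exploratory_confirmatory`, the 50/50 split, the fixed seed, and the sample labels are all illustrative assumptions, not part of any particular package.

```python
import random

def split_exploratory_confirmatory(sample_ids, exploratory_fraction=0.5, seed=0):
    """Randomly partition sample IDs into two non-overlapping subsets.

    The function name, default fraction, and seed are illustrative choices.
    """
    rng = random.Random(seed)      # fixed seed so the split can be reproduced
    shuffled = list(sample_ids)    # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_explore = int(len(shuffled) * exploratory_fraction)
    return shuffled[:n_explore], shuffled[n_explore:]

# Hypothetical sample labels, standing in for the rows of a real data set
samples = [f"sample_{i}" for i in range(1000)]
explore, confirm = split_exploratory_confirmatory(samples)

# Generate hypotheses using only `explore`; test them using only `confirm`.
```

The split should be made before any exploration begins, and the confirmatory samples should not be inspected until the hypotheses are fixed.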

Large data sets...

If you are using data mining procedures to test large data sets for 'significant' associations, be sure to correct for multiple testing and for other purely statistical phenomena that might mislead interpretation. When many tests are run, some will appear 'significant' by chance alone.
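
As an illustration of correcting for multiple testing, the sketch below adjusts a list of p-values using two widely used procedures: the Bonferroni correction and the Benjamini-Hochberg false discovery rate procedure. The text above does not prescribe a particular method, so the choice of these two, the function names, and the example p-values are assumptions made for illustration.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from four separate tests on the same data set
p = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(p)
bh = benjamini_hochberg(p)
```

With these example values, three of the four raw p-values fall below 0.05, but after either adjustment only the smallest one does. Bonferroni is the more conservative of the two; which correction is appropriate depends on the analysis.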

If you are confident that you are not dredging data, click here to continue the exploration wizard...