Data Dredging

Data dredging (data fishing, data snooping, equation fitting, p-hacking) is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and reporting only those that come back with significant results...

Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of mistaken test results, called the significance level. When large numbers of tests are performed, some produce false results of this type: by chance alone, about 5% of randomly chosen true null hypotheses will appear significant at the 5% significance level, 1% at the 1% level, and so on. When enough hypotheses are tested, it is virtually certain that some will falsely appear statistically significant, since almost every data set with any degree of randomness is likely to contain some spurious correlations. Researchers using data mining techniques who are not cautious can easily be misled by these apparently significant results. http://en.wikipedia.org/wiki/Data_dredging
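A minimal simulation sketch of the multiple-testing effect described above (Python, assuming NumPy and SciPy are available; all names and parameters here are illustrative): draw many pairs of samples from the same distribution, so every null hypothesis is true by construction, and count how often a t-test nonetheless comes out "significant" at p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

n_tests = 1000  # number of independent hypotheses "tested"
n = 30          # sample size per group

false_positives = 0
for _ in range(n_tests):
    # Both groups come from the same normal distribution,
    # so any "significant" difference is a false positive.
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests 'significant' at p < 0.05")
# Expect roughly 50 (about 5%) by chance alone. The probability of at
# least one false positive across all 1000 tests is 1 - 0.95**1000,
# which is essentially 1.
```

Reporting only the "significant" runs from a loop like this, while discarding the rest, is exactly the dredging pattern the quote describes.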

http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation

http://en.wikipedia.org/wiki/Spurious_relationship

Tyler Vigen's Spurious Correlations site: http://www.tylervigen.com/
