P Values: Significant or Outdated?
Whether or not results are significant is often the most important detail of hypothesis testing. Insignificant results are unlikely to be reproduced or considered meaningful and so are dropped, while significant results are published and put into development. In the vast majority of cases, significance is based on the p value, a concept that has been around far longer than the field of data science. Recently, some researchers have been discussing the sometimes unethical ways that p values are used and have argued for alternatives to standard hypothesis testing. In this post I want to dive into the concept of the p value and discuss its limitations and potential replacements.
P Value Refresher:
The p value is vital to the way that most data scientists understand and use statistics. Formally, it is the probability of obtaining data at least as extreme as those observed, assuming the null hypothesis is true. This idea often gets simplified into "the effect of chance on a study" or "the likelihood that the null hypothesis is true", and while those readings don't actually match the definition, they do sum up how we use the p value. With a high p value, the patterns in the sample are treated as likely products of randomness in sampling, the data are taken to be unrepresentative of the larger population, the results are considered unlikely to be reproduced, and the study is labelled insignificant. With a low p value (typically < .05), the sample is said to be representative of the population at large: the results are attributed not to chance but to inherent patterns in the data, are considered likely to be reproduced if the study is conducted again, and are labelled significant.
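To make that concrete, here is a minimal sketch of how a p value typically gets computed in practice, using SciPy's one-sample t-test on made-up data:

```python
# A minimal sketch of how a p value is typically obtained in practice,
# here with SciPy's one-sample t-test on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.4, scale=2.0, size=30)  # hypothetical measurements

# Null hypothesis: the population mean is 10.
t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# p is the probability of a t statistic at least this extreme
# arising if the null hypothesis (mean = 10) were true.
```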
Problems with the P Value:
While understanding how well a study can be reproduced is important to generalizing the relationships present in the data, there are some well-known problems with typical hypothesis testing using p values. For one thing, a p value in isolation is utterly meaningless. One p value gives no indication of effect size or of the false positive rate inherent in a study. In many cases a small effect size or a high false positive rate makes the results of a study meaningless in the bigger picture, yet those results are still labelled statistically significant.
Another issue with the p value, and one that is particularly relevant to data science, is that it is very easy to drive the value down simply by using a large dataset. That is difficult to do in clinical settings, but datasets for machine learning projects tend to be extremely large to start with. This means that p values in data science projects are usually small from the outset and carry correspondingly little intrinsic meaning.
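Here is a quick simulation (with invented data and an arbitrarily tiny group difference) showing how sample size alone can drag the p value below .05 while the effect size stays negligible:

```python
# A hedged illustration (simulated data) of how sample size alone drives the
# p value down: the same tiny difference typically goes from "not significant"
# to "highly significant" as n grows, while the effect size stays negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_shift = 0.02  # a practically meaningless difference between groups

for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(true_shift, 1.0, size=n)
    t_stat, p_value = stats.ttest_ind(a, b)
    cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    print(f"n={n:>9,}  p={p_value:.4f}  Cohen's d={cohens_d:.3f}")
```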
Yet another problem with traditional hypothesis testing is that the p value gives researchers little incentive to test models for predictive validity. With the significance of a test determined by a p value, one does not necessarily need to test the predictive power of a model against new data. This is somewhat less of a concern in data science, where several kinds of validation testing are frequently used alongside traditional p values.
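As a rough sketch of what I mean, here is the kind of held-out predictive check that often sits alongside a significance test in a data science project; the features, labels, and model below are placeholders, not a real project:

```python
# A sketch of the kind of predictive check data scientists often run
# alongside (or instead of) a significance test: hold data out, then
# measure how well the model actually predicts it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                  # hypothetical features
y = ((X[:, 0] + 0.5 * rng.normal(size=500)) > 0).astype(int)   # outcome tied to one feature

# 5-fold cross-validation: every prediction is scored on data the model never saw.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(f"mean held-out accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```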
One final, commonly discussed shortfall of the p value is the way it can be abused by researchers. The p value frequently acts as a gatekeeper for publication, especially in medical and clinical environments where a non-significant study will not be published but a significant one will. This can lead to situations where researchers use unethical methods to achieve significant results: removing important outliers, tailoring data collection based on significance tests run before collection is complete, or rerunning the same test until a significant result is achieved. Much like any other statistic, a p value is really meant to be reported alongside several other measures (sample mean, effect size, standard deviation, etc.) to gauge the full effect of the statistical analysis, not used as the single all-important verdict on the validity of a study.
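To see how damaging one of these practices can be, here is a small simulation (my own assumed setup, not from any particular study) of peeking at the data repeatedly and stopping as soon as p < .05, even though both groups come from the same distribution:

```python
# A simulation of one common form of p-hacking: checking the test repeatedly
# during data collection and stopping as soon as p < .05. Even when the null
# hypothesis is true, the false positive rate climbs well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_experiments, checkpoints = 2_000, range(20, 201, 20)
false_positives = 0

for _ in range(n_experiments):
    a = rng.normal(size=200)   # both groups drawn from the same distribution,
    b = rng.normal(size=200)   # so any "significant" result is a false positive
    for n in checkpoints:      # peek after every 20 observations per group
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1
            break

print(f"false positive rate with optional stopping: {false_positives / n_experiments:.1%}")
```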
Alternatives to Using P Values:
I have come across two interesting methods of performing statistical analysis while avoiding traditional p values. The first involves fuzzy-set analysis. In fuzzy-set theory, a researcher defines the set they are testing for, and all variable values are reassigned as "membership" scores between 0 and 1 using a logistic (log-odds) function. A score of 1 indicates that a data point is definitely a member of the set, and a score of 0 indicates that it is definitely not a member. From there, the researcher picks a cutoff for full membership (usually 0.95), a cutoff for full non-membership (0.05), and a crossover point of maximum ambiguity (0.5). The researcher can then construct a model to predict membership in the outcome set on the same scale. Effectively, this approach turns statistical testing into a logistic classification model and eliminates the need for a p value to establish significance. Furthermore, it gives the model firmer grounds for predictive validity moving forward.
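To illustrate just the calibration step, here is a minimal sketch using a logistic (log-odds) transformation; the anchor values and the "high income" example are my own illustrative assumptions, not from a specific study:

```python
# A minimal sketch of fuzzy-set calibration: raw values are mapped to
# membership scores in [0, 1] so that the chosen anchors land near
# 0.05 (full non-membership), 0.5 (maximum ambiguity), and 0.95 (full membership).
import numpy as np

def calibrate(x, full_non_membership, crossover, full_membership):
    """Map raw values to fuzzy membership scores via a log-odds scale."""
    scale = np.log(0.95 / 0.05)          # log odds corresponding to the 0.95 anchor
    spread_hi = full_membership - crossover
    spread_lo = crossover - full_non_membership
    log_odds = np.where(
        x >= crossover,
        scale * (x - crossover) / spread_hi,
        scale * (x - crossover) / spread_lo,
    )
    return 1 / (1 + np.exp(-log_odds))

income = np.array([12_000, 30_000, 55_000, 90_000, 150_000])  # raw values (hypothetical)
membership = calibrate(income, 20_000, 50_000, 100_000)       # membership in a "high income" set
print(np.round(membership, 2))
```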
Another method that removes the need for statistical significance involves the construction of confidence intervals. For this method we simply construct confidence intervals for our groups before comparing them. For example, say we are looking at a record of the number of items per order sold by a particular employee. A new order comes in and we want to know how it compares with the employee's other orders. With traditional statistics we could run a t-test and use the significance level to tell us whether this is an unusual order. With confidence intervals, we instead construct an interval of our choosing (say 95%) around the mean number of items in the employee's orders. We then compare the new order to this interval and immediately know whether it is larger than a typical order (above the interval), smaller than a typical order (below the interval), or a typical order (within the interval). This method is closely related to a typical t-test, but it frames the question as a direct comparison against an interval rather than a significance threshold, the interval can be built nonparametrically (from percentiles or a bootstrap) rather than by assuming a normal distribution, and it maintains greater predictive validity moving forward.
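Here is a rough sketch of that comparison with invented order data; I use a t-based interval for brevity, though a percentile or bootstrap interval would slot in the same way:

```python
# A sketch of the confidence interval comparison described above, on made-up data:
# build an interval around the mean items per order, then see where a new order falls.
import numpy as np
from scipy import stats

items_per_order = np.array([3, 5, 4, 6, 2, 5, 7, 4, 3, 6, 5, 4])  # past orders (hypothetical)
new_order = 9

mean = items_per_order.mean()
sem = stats.sem(items_per_order)
low, high = stats.t.interval(0.95, df=len(items_per_order) - 1, loc=mean, scale=sem)

if new_order > high:
    verdict = "larger than typical"
elif new_order < low:
    verdict = "smaller than typical"
else:
    verdict = "typical"
print(f"95% CI for mean items/order: ({low:.2f}, {high:.2f}) -> new order is {verdict}")
```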
My Take:
While I find these approaches to be interesting ways of sidestepping the issues with statistical testing, I'm not sure I fully see the need. The confidence interval approach is not really all that different from a traditional t-test and largely comes down to semantics. And I don't see anything inherently wrong with the concept of statistical significance; the problems lie in how it is used. Significance does not necessarily mean importance, and that is a key idea to keep in mind when reading the results of studies. Significance is meaningless without taking effect sizes and false positive rates into account, and it rests on an arbitrary cutoff (there is no particular reason to use 0.05 other than tradition). I don't think significance should be the sole gatekeeper of publication, yet that mindset only seems to be growing, even though negative results can be as interesting as positive ones.
What do you think? Is it time for a change in regards to statistical testing, or is statistical significance important?
Resources:
For an in-depth look at reducing the importance of traditional hypothesis testing, check out this paper.
For more information on the confidence interval method, take a look at this blog post.
For more information about the issues with p value testing, check out this post.
For a full example of how to use fuzzy-set analysis, take a look at this paper.