By EAS Senior Consultant William R. Fairweather, PhD.
You have a question, so you design a study to explore it. You determine that A is better than B and that the average difference is 3 units. It should be obvious that this is not sufficient information, but why not? After all, it is exactly what you wanted to know.
First of all, is a 3-unit difference large or small? You would likely know this from your experience as a scientist or engineer, and it is certainly important. But it is equally important to know how precisely that difference has been estimated, and that is more of a statistical question.
The first issue is the variation encountered in the study. If two experimental units on different treatments are just as likely to show an 8-unit difference as a 2-unit difference in the opposite direction, the 3-unit average is no longer so impressive. So in designing the study, a statistician would want enough observations to ensure that, with high probability, the study recognizes an important difference as important (a statistically significant comparison) but does not declare a trivial difference to be significant.
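To make that concrete, here is a minimal simulation with invented numbers: a true 3-unit treatment effect and a unit-to-unit standard deviation of 5 units (both assumptions of mine, not figures from any actual study). With that much noise, a pair of units on different treatments differs by 8 or more units, or by 2 or more units in the wrong direction, about a quarter of the time each.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 100_000  # simulated pairs of units, one per treatment

# Invented numbers for illustration: true effect 3 units, unit SD 5 units.
a = rng.normal(3.0, 5.0, n_pairs)  # units on treatment A (mean shifted up 3)
b = rng.normal(0.0, 5.0, n_pairs)  # units on treatment B
diff = a - b

print(f"P(A unit beats B unit by 8+): {(diff >= 8).mean():.2f}")   # about 0.24
print(f"P(B unit beats A unit by 2+): {(diff <= -2).mean():.2f}")  # about 0.24
```

With a 5-unit standard deviation, both of those "surprising" outcomes are in fact equally common, which is exactly why the bare 3-unit estimate is not enough information on its own.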
The procedure for calculating sample sizes for a simple two-treatment comparison study is well known and widely available on the Internet; a minimal sketch appears after this paragraph. Unfortunately, sample size is not the only issue to consider. Recently the National Institutes of Health and the Food and Drug Administration have become concerned that a large number of preclinical studies are not reproducible (Landis et al., 2012). The stakes are very high: NIH spends about $30 billion per year to support scientific research, and it has been estimated that half of these studies cannot be reproduced. Follow-up studies and clinical trials built on those results could well be looking in the wrong direction.
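As the sketch promised above, here is one standard version of that calculation, using the normal approximation for a two-sided, two-sample comparison. The 3-unit difference and 5-unit standard deviation are the same illustrative assumptions used earlier.

```python
import math
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample test,
    via the normal approximation:
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Illustrative: detect a 3-unit difference, SD 5, 80% power, alpha 0.05.
print(n_per_group(delta=3.0, sigma=5.0))  # about 44 per group
```

The exact t-test calculation gives a slightly larger answer, but the approximation shows the structure: sample size grows with the square of the noise-to-signal ratio sigma/delta.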
The NIH has recently been emphasizing the need for reproducible and repeatable studies. First, some definitions. In this context, a repeatable study is one that an independent scientist or statistician can analyze and get the same answers the original scientists did; it is a matter of clearly reporting the (statistical) analysis methods and having the same data available. A reproducible study is one that independent scientists can begin again from scratch, following the study description, and produce results very similar to those in the original reports.
Repeatability is a rather minimal requirement, but one that is easily failed in a study of any complexity. Several scientists will likely be involved, each evaluating the data. If someone modifies the data after someone else has finished an analysis, it becomes impossible to get the same result the first analyst did. Early in my career with the FDA, I ran into this problem frequently, and when we did, we had to stop our review and send the application back to the sponsor. Sponsors soon learned to lock the database before anyone proceeded to an analysis and to make sure FDA statisticians received only that database.
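A simple modern version of that database lock is to record a cryptographic checksum of the frozen data file and have every analyst verify it before running anything. This is my own sketch of the idea, not an FDA procedure, and the file name is hypothetical.

```python
import hashlib

def file_digest(path: str) -> str:
    """SHA-256 digest of a file, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Record this value when the database is locked; every later analysis
# begins by checking that the file still matches it.
LOCKED = file_digest("study_data.csv")  # hypothetical file name
assert file_digest("study_data.csv") == LOCKED, "data changed since lock!"
```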
There is more flexibility in the requirement for reproducibility because some variation in the responses of a second study is anticipated; an adequate sample size should largely overcome it. Difficulties in reproducibility can be subtle and therefore harder to correct; scientists may not even recognize the cause of a problem. For example, it might not be known that there are important differences among the subspecies of animals used in the study. Or a reagent might have run out midway through the study and been replaced by a fresh batch, with the changeover coinciding exactly with the laboratory's switch from evaluating animals on Treatment A to those on Treatment B.
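A toy simulation shows how completely such a changeover can masquerade as a treatment effect. The numbers are invented: the treatments are truly identical, but the fresh reagent batch reads 3 units higher, and the batch switch coincides exactly with the switch from assaying Treatment A animals to Treatment B animals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30  # animals per treatment (invented)

true_effect = 0.0   # the treatments are actually equivalent
batch_shift = 3.0   # invented: the fresh reagent reads 3 units higher

a = rng.normal(10.0, 2.0, n)                              # old reagent batch
b = rng.normal(10.0 + true_effect + batch_shift, 2.0, n)  # new reagent batch

# The analysis sees a clean "treatment effect" of about 3 units,
# every bit of it due to the reagent changeover.
print(f"apparent difference B - A: {b.mean() - a.mean():.1f} units")
```

Because batch and treatment are perfectly confounded here, no amount of statistical analysis of these data can separate the two; only the design (or a repeat study) can.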
Scientists may well have a good idea about how a study “should” come out; it would be unreasonable to expect them to run randomly or haphazardly chosen studies. In some cases, though, animals that did not respond as expected were eliminated from the study and excluded from the analysis. There may have been a good reason, such as a determination that the animal failed to eat enough to receive a proper dose of its assigned treatment. The problem was that neither this evaluation nor the course of action was preplanned, and the evaluation was not conducted for every animal in the study. The resulting bias, illustrated in the sketch below, prevents reproduction of the study.
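That bias is easy to demonstrate with a small simulation (invented numbers, and no true treatment difference at all): excluding only the lowest responders on one arm manufactures a difference out of pure noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20  # animals per treatment (invented)

# No true difference: both arms drawn from the same distribution.
a = rng.normal(10.0, 3.0, n)
b = rng.normal(10.0, 3.0, n)

# Post hoc, the three lowest responders on A are dropped as "non-eaters";
# no comparable screen is applied to arm B.
a_kept = np.sort(a)[3:]

print(f"difference, all animals:      {a.mean() - b.mean():+.2f}")
print(f"difference, after exclusions: {a_kept.mean() - b.mean():+.2f}")
```

A second team that applies the exclusion rule evenhandedly, or not at all, will not see the inflated difference, which is precisely why the original result fails to reproduce.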
Neither of the above definitions of repeatable or reproducible insists that the statistical analyses be appropriate for the study, and in many cases in the literature they have not been. It should be taken for granted that the appropriateness of statistical methods is as important as the appropriateness of laboratory methods.
A good sampling plan is one that will reliably produce a repeatable, reproducible study. It should start with a protocol that describes both the laboratory and statistical procedures to be employed. It should cover all anticipated events of the study in detail, and it should make provision for unexpected events.
In this short article I have not covered many of the potential statistical issues that occur in a study. These include interim looks at the data, specification of primary and secondary hypotheses, treatment of missing data, etc., all of which can affect the analytical methods to be used and the interpretation of the results.