What are the results?
So far, we have talked about the design of a study. Clearly, this is important in deciding if the study is valid. What about the results? Research papers have two broad types of results: statistical and clinical. The statistical results are the p-values, confidence intervals, and related stuff. In one of your on-campus sessions, you will have (or already have had) a series of lectures on basic statistics (mean, median, confidence intervals, p-values, statistical tests, etc.).
In this module, we will focus primarily on the clinical results and will not discuss the statistical stuff in depth. There are two reasons for this decision. The first is pragmatic: this module could get very long if we went into descriptive statistics, comparative statistics, hypothesis testing, and so on. The second and more important reason is that when you read a paper, the statistical stuff is rarely critical; you base your decisions on the clinical stuff. We want you to emphasize clinical importance. A p-value only tells you how likely it is that a result at least this large would show up by chance alone if there were really no difference. If a study showed that drug A lowered blood pressure by an average of 1 mm Hg and that this was highly statistically significant, you'd still say "So what?" because the difference is not important clinically! The literature is full of statistically significant but clinically irrelevant results.
When we read a therapeutic paper, we are usually interested in seeing if one treatment group does better than another. We are also interested in which outcomes are improved and whether patients were harmed. To make the most sense of the outcomes, we need to know how often patients experience good outcomes and bad outcomes. When we compare their frequency, we are comparing rates or risks.
Example
A researcher develops a new drug to prevent the common cold in children over the age of two. In preliminary uncontrolled studies, when the drug was given as a single dose in September, it reduced the frequency of colds for six months. In the current study, 1000 children received placebo and 1000 received the drug. During the six-month follow-up, the researchers diagnosed colds in 650 of the kids on placebo and in 500 of the kids on active treatment. None of the children were hospitalized for complications of the cold. Otitis media was diagnosed in 300 of the children with colds on active treatment and in 299 of the children with colds on placebo. There were no deaths. Oh yeah, the mean number of days of school missed in the placebo group was 2.7 (95% confidence interval 0.8, 4.9) and in the treatment group was 2.2 (95% CI 0.3, 3.1). As you might have guessed, researchers have several ways of expressing the differences between groups. Before discussing them, do the following for each group:
Task: Calculate the rate of colds, and the rate of ear infections among the children who got colds.
Absolute risk reduction
Calculating absolute risk reduction (ARR) is simple. It is merely the difference in the proportion of subjects with the outcome of interest in each group. If the event rate in the placebo group is X and the event rate in the treatment group is Y, the ARR is X-Y.
In a study of the management of the third stage of labor (this is the time spent waiting for the placenta after the baby is delivered), 16.5% of women managed expectantly had postpartum hemorrhage compared to 6.8% in the actively managed group. This represents an ARR of (0.165-0.068) = 0.097 or 9.7%. In English, this means that active management of the third stage of labor reduced the rate of postpartum hemorrhage by 9.7 percentage points: about 10 fewer hemorrhages for every 100 women managed actively.
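The postpartum hemorrhage arithmetic above can be sketched in a few lines of Python. This is only an illustration; the function name is made up for this example, and the event rates are entered as proportions (0 to 1).

```python
def absolute_risk_reduction(control_rate, treatment_rate):
    """ARR = X - Y, with both event rates given as proportions (0 to 1).

    The function name is hypothetical, chosen for this illustration.
    """
    return control_rate - treatment_rate


# Postpartum hemorrhage example from the text:
# 16.5% with expectant management (X), 6.8% with active management (Y).
arr = absolute_risk_reduction(0.165, 0.068)
print(f"ARR = {arr:.3f} ({arr * 100:.1f} percentage points)")
```

Keeping the rates as proportions rather than percentages makes it easy to reuse the ARR later, for instance when taking its reciprocal.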
Task: Calculate the ARR for colds and for otitis media from the above example.
Relative risk reduction
The most commonly used measure of risk is relative risk reduction (RRR). It is expressed as a percent difference and is calculated as 100*(X-Y)/X, where X is the event rate in the control group and Y is the event rate in the treatment group. In the postpartum hemorrhage study the RRR is 100*(16.5-6.8)/16.5 = 59%. In English, this means that active management of the third stage of labor reduced the rate of postpartum hemorrhage by 59% relative to the rate with expectant management.
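As a rough sketch, the same calculation in Python looks like this. The function name is hypothetical, and the check for a zero control rate is an added safeguard rather than something the formula itself specifies.

```python
def relative_risk_reduction(control_rate, treatment_rate):
    """RRR as a percentage: 100 * (X - Y) / X.

    X is the control event rate and Y the treatment event rate.
    The function name is hypothetical, chosen for this illustration.
    """
    if control_rate <= 0:
        # Added safeguard: the formula divides by the control rate.
        raise ValueError("control event rate must be greater than zero")
    return 100 * (control_rate - treatment_rate) / control_rate


# Postpartum hemorrhage example from the text: X = 16.5%, Y = 6.8%.
rrr = relative_risk_reduction(0.165, 0.068)
print(f"RRR = {rrr:.0f}%")
```

Because the formula is a ratio, it gives the same answer whether the rates are entered as proportions (0.165, 0.068) or as percentages (16.5, 6.8).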
Task: Calculate the RRR for developing colds in the hypothetical study in our example.
RRR is the most commonly reported form of risk reduction. This is probably no accident. Why do you think this is the case?
Number needed to treat
The number needed to treat (NNT) is the number of patients you need to treat to prevent one additional bad outcome. It is calculated as the reciprocal of the ARR (1/ARR), with the ARR expressed as a proportion rather than a percent. If we have to treat 1000 patients to prevent a single bad outcome, we might be less enthusiastic about the treatment, especially if we only need to treat 15 with an alternative therapy to prevent the same bad outcome (assuming relatively equal adverse effects). The NNT is also useful because if we also know the rate of adverse events, we can balance risk and benefit. If, for example, the NNT for a drug to prevent cancer is 300, but the rate of fatal pulmonary embolism among treated patients is 2%, we know that for every cancer we prevent, we cause 6 fatal pulmonary emboli (300 x 0.02 = 6).
Unfortunately, the NNT is not reported as frequently as EBM proponents would like. This is slowly changing. In the meantime, NNTs are available at many EBM web sites, such as Bandolier.
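Here is a minimal Python sketch of both calculations: the NNT for the postpartum hemorrhage study (using the ARR of 0.097 worked out earlier) and the harm-versus-benefit arithmetic from the cancer drug example. The function names are hypothetical. Rounding the NNT up to the next whole patient is a common reporting convention, not something the formula requires.

```python
import math


def number_needed_to_treat(arr):
    """NNT = 1 / ARR, with the ARR given as a proportion (0 to 1)."""
    if arr <= 0:
        # Added safeguard: an NNT only makes sense when the treatment helps.
        raise ValueError("ARR must be greater than zero")
    return 1 / arr


def harms_per_outcome_prevented(nnt, harm_rate):
    """Expected adverse events among the patients treated to prevent one outcome."""
    return nnt * harm_rate


# Postpartum hemorrhage study: ARR = 0.097.
nnt = number_needed_to_treat(0.097)
print(f"NNT = {nnt:.1f}, conventionally rounded up to {math.ceil(nnt)}")

# Cancer drug example from the text: NNT of 300, 2% fatal pulmonary embolism.
harms = harms_per_outcome_prevented(300, 0.02)
print(f"Fatal pulmonary emboli per cancer prevented: {harms:.0f}")
```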
Task: Calculate the NNT for developing colds in our hypothetical study.
Confidence intervals
The results can tell you both the magnitude of effect and the precision. A confidence interval combines the statistical stuff and the clinical stuff to provide you with an idea of how precise the estimate of the treatment effect is. Data are usually presented as a 95% confidence interval, meaning that if the study were repeated many times and an interval calculated each time, about 95% of those intervals would contain the true value. Can you intuit why the confidence interval combines both the statistical and clinical results?
When you compare the confidence intervals of different treatment groups, look for overlap. If the confidence intervals don't overlap, the difference is statistically significant. (The reverse doesn't always hold: intervals can overlap a little and the difference can still be significant.) To see if the difference is clinically meaningful, compare the closest extremes. For instance, if the upper limit of the treatment effect for the control group is very close to the lower limit of the treatment group, it's possible the results aren't clinically important.
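The overlap check is easy to sketch in Python. The example below uses the school-days confidence intervals from the cold study above. The function name is hypothetical, and the check is only a rough screen: overlapping intervals do not prove the groups are the same.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) confidence intervals share any values."""
    (low_a, high_a), (low_b, high_b) = ci_a, ci_b
    return max(low_a, low_b) <= min(high_a, high_b)


# Mean days of school missed, 95% CIs from the cold study example.
placebo_ci = (0.8, 4.9)
treatment_ci = (0.3, 3.1)

print(intervals_overlap(placebo_ci, treatment_ci))  # True: the intervals overlap
```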
When you look at confidence intervals, see how wide they are. A narrow or tight confidence interval represents a precise estimate. These are usually found in studies with a large number of participants. If you want the full down and dirty detail on this, let's have a separate conversation. If not, read on!
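For anyone who wants to see the link between study size and interval width, here is a small Python sketch. It assumes a hypothetical event rate of 50% and uses the simple normal-approximation (Wald) interval for a proportion, which is only one of several ways to compute such an interval. The function name is made up for this illustration.

```python
import math


def wald_ci_95(events, n):
    """Approximate 95% CI for a proportion (normal approximation, an assumption)."""
    p = events / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Same hypothetical event rate (50%) in studies of increasing size.
for n in (100, 1000, 10000):
    low, high = wald_ci_95(events=n // 2, n=n)
    print(f"n = {n:>5}: 95% CI {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

The observed rate is identical in all three hypothetical studies; only the number of participants changes, and the interval tightens as it grows.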
No difference
Occasionally you'll read a study that shows no difference between treatment groups. It doesn't happen often, though. These are called negative studies. You don't see as many of these because researchers are less likely to submit them for publication. When they are submitted, they take longer to get published and are less likely to get published. When you get to the module on meta-analysis, keep this in mind!
A negative study shows up either as p-values that are not significant (over 0.05) or as confidence intervals that overlap. How might you interpret the results of a negative study? There are only two possible explanations! (Even I can remember two things!) The first explanation is that maybe there really is no difference. The second is that the study did not have enough power (not enough subjects) to find a difference if there really was one. This is called a beta error or a type II error, and the study is referred to as being underpowered. When you read a negative paper, you need to go back to the methods section. The authors should discuss how they calculated the size of the study. If they didn't, write a letter to the editor and lambaste them for not providing enough information to adequately interpret the paper. All isn't lost, though. You might be able to calculate a sample size using the data available in the paper. It's a royal pain in the butt and usually requires using a software program or going to tables in a textbook (like Hulley and Cummings).
If the authors reported a sample size estimation, it will include some assumptions about the data (remember that a sample size estimation is done before the study is started, so they had to make their best guess about the mean treatment effect, standard deviation, and magnitude of difference). Compare their assumptions with the actual results. How did they do? If they came pretty close, then the likelihood of a type II error is low (keep in mind that most studies use 80-90 percent power, meaning that 10-20 percent of the time an adequately powered study will still miss a difference that is really there). If not, it is still possible that the study lacked power.
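To give a feel for what a sample size estimation involves, here is a rough Python sketch for comparing two event rates. It uses a common textbook normal-approximation formula, which is an assumption of this example and not necessarily what any given paper or software package uses. The function name is hypothetical, and the default alpha of 0.05 (two-sided) and power of 80 percent are conventional choices. The postpartum hemorrhage rates are plugged in purely as illustrative planning guesses, not as that study's actual calculation.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect p_control vs. p_treatment.

    Normal-approximation formula for two proportions (an assumption here):
    n = (z_alpha + z_beta)^2 * [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)^2
    """
    if p_control == p_treatment:
        raise ValueError("the two assumed event rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return math.ceil(n)


# Illustrative planning guesses: 16.5% vs. 6.8% event rates.
for power in (0.80, 0.90):
    n = sample_size_per_group(0.165, 0.068, power=power)
    print(f"Power {power:.0%}: about {n} subjects per group")
```

Try shrinking the assumed difference between the two rates and watch the required sample size climb. That is why a study planned around an optimistic guess can end up underpowered.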