Time to ROC'n'roll... Receiver-operating characteristic curves, for obvious reasons called ROC (pronounced rock) curves, are an excellent way to compare diagnostic tests. Remember that when a test becomes more sensitive, it becomes less specific, and vice versa. Consider again the data for serum ferritin as a test for iron deficiency anemia:
| Serum ferritin (mmol/l) | # with IDA (% of total) | # without IDA (% of total) |
| --- | --- | --- |
| < 15 | 474 | 20 |
| 15-34 | 175 | 79 |
| 35-64 | 82 | 171 |
| 65-94 | 30 | 168 |
| > 94 | 48 | 1332 |
If we just want to calculate sensitivity and specificity for this test, we have to choose a "cutpoint" which separates 'normal' from 'abnormal'. If we choose <= 34 as an abnormal ferritin, we can "collapse" some rows and get the following table:
| Serum ferritin (mmol/l) | # with IDA (% of total) | # without IDA (% of total) |
| --- | --- | --- |
| <= 34 | 474 + 175 | 20 + 79 |
| > 34 | 82 + 30 + 48 | 171 + 168 + 1332 |
Doing the math, we now have a familiar 2 x 2 table:
| Serum ferritin (mmol/l) | # with IDA (% of total) | # without IDA (% of total) |
| --- | --- | --- |
| <= 34 | 649 | 99 |
| > 34 | 160 | 1671 |
Whew! Finally, we can calculate sensitivity and specificity for this cutpoint of 34:
Sensitivity = 649 / (649 + 160) = 649 / 809 = 80.2%
Specificity = 1671 / (1671 + 99) = 1671 / 1770 = 94.4%
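If you'd like to check the arithmetic yourself, here is a minimal Python sketch of the two formulas applied to the 2 x 2 table above. The function and variable names are my own choices for illustration, not part of the course material.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of patients WITH the disease who test abnormal."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Proportion of patients WITHOUT the disease who test normal."""
    return true_neg / (true_neg + false_pos)


# Counts from the 2 x 2 table for a cutpoint of <= 34
true_pos, false_neg = 649, 160   # with IDA: ferritin <= 34, ferritin > 34
false_pos, true_neg = 99, 1671   # without IDA: ferritin <= 34, ferritin > 34

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
print(f"Sensitivity = {sens:.1%}")
print(f"Specificity = {spec:.1%}")
```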
Remember, though, that the sensitivity and specificity depend on where we make the cutpoint. I have done the math, and calculated the sensitivity and specificity for each of 4 different cutpoints in the table below:
| Cutpoint which defines an abnormally low serum ferritin (mmol/l) | Sensitivity | Specificity |
| --- | --- | --- |
| < 15 | 58.6% | 98.9% |
| <= 34 | 80.2% | 94.4% |
| <= 64 | 90.4% | 84.7% |
| <= 94 | 94.1% | 75.3% |
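The same "collapsing" of rows can be repeated for every cutpoint with a running total. The sketch below is one way to do it in Python, using the counts from the first ferritin table; the names `ida`, `non_ida` and `sens_spec_by_cutpoint` are illustrative assumptions. The printed values should agree with the table to within rounding.

```python
from itertools import accumulate

# Rows of the original table, ordered from lowest to highest ferritin
labels = ["< 15", "<= 34", "<= 64", "<= 94"]
ida = [474, 175, 82, 30, 48]          # patients with IDA in each row
non_ida = [20, 79, 171, 168, 1332]    # patients without IDA in each row


def sens_spec_by_cutpoint(with_disease, without_disease):
    """Return (sensitivity, specificity) for a cutpoint placed after each row
    except the last, treating everything at or below the cutpoint as abnormal."""
    total_d = sum(with_disease)
    total_n = sum(without_disease)
    # Running totals give the "collapsed" abnormal counts at each cutpoint
    cum_d = list(accumulate(with_disease))[:-1]
    cum_n = list(accumulate(without_disease))[:-1]
    return [
        (true_pos / total_d, (total_n - false_pos) / total_n)
        for true_pos, false_pos in zip(cum_d, cum_n)
    ]


results = sens_spec_by_cutpoint(ida, non_ida)
for label, (sens, spec) in zip(labels, results):
    print(f"{label:>6}: sensitivity {sens:.1%}, specificity {spec:.1%}")
```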
This confirms that as the sensitivity increases, the specificity drops, and vice versa. Here is a graphical example, using creatine kinase (CK) and diagnosis of MI:

This diagram graphs the creatine kinase values for two groups of patients, those with MI and those without MI. As we know from our clinical experience, there is an overlap in the CK values between the two groups, shown in the middle of the diagram. Using "Cutpoint 1" means that almost all of the patients with MI will be considered 'abnormal', but so will many without MI. This is highly sensitive, but not very specific. Cutpoint 2 does not misclassify nearly as many of the patients without MI, but also misses more of those who actually had an MI. When setting our cutpoints, we have to keep in mind that we are making a trade-off, and have to think about which is worse: a false positive or a false negative. In this case, most would agree that a false negative (i.e. telling a patient with an MI that he or she doesn't have one) is the worse error, so we would choose Cutpoint 1.
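To get a feel for the trade-off, here is a rough Python sketch. Everything numeric in it is invented for illustration: it assumes CK values are normally distributed in each group, that a higher CK counts as abnormal, and it uses made-up means, standard deviations and cutpoints rather than any real patient data.

```python
from statistics import NormalDist

# HYPOTHETICAL distributions, chosen only so that the two groups overlap
ck_without_mi = NormalDist(mu=100, sigma=40)
ck_with_mi = NormalDist(mu=300, sigma=100)


def sens_spec(cutpoint):
    """CK above the cutpoint is called 'abnormal' (assumption)."""
    sens = 1 - ck_with_mi.cdf(cutpoint)   # MI patients correctly called abnormal
    spec = ck_without_mi.cdf(cutpoint)    # non-MI patients correctly called normal
    return sens, spec


cutpoint_1, cutpoint_2 = 150, 250         # hypothetical low and high cutpoints
sens_1, spec_1 = sens_spec(cutpoint_1)
sens_2, spec_2 = sens_spec(cutpoint_2)

print(f"Cutpoint 1: sensitivity {sens_1:.1%}, specificity {spec_1:.1%}")
print(f"Cutpoint 2: sensitivity {sens_2:.1%}, specificity {spec_2:.1%}")
```

With these made-up numbers, the lower cutpoint catches more of the MI patients but mislabels more of the patients without MI, which is the pattern described for Cutpoint 1.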
What about ROC curves? We're getting there, but the above concepts are important. Make sure you understand how you can derive multiple pairs of sensitivity and specificity for a diagnostic test, and why sensitivity and specificity are inversely related.
An ROC curve is simply a graph of sensitivity vs (1-specificity). Why not sensitivity vs specificity? Well, you could do that, but because the area under the curve for sensitivity vs (1-specificity) has special meaning, whereas it does not for sensitivity vs specificity, we choose the former. You'll see.
Below, we've graphed the values from the table of sensitivities and specificities for the diagnosis of iron deficiency anemia using serum ferritin:
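The points on that graph are just (1 - specificity, sensitivity) pairs, one per cutpoint, plus the two corners (0, 0) and (1, 1). Here is a small Python sketch that derives them from the original ferritin counts; the function name `roc_points` is my own, and plotting is left out so the example needs nothing beyond the standard library.

```python
from itertools import accumulate

ida = [474, 175, 82, 30, 48]          # with IDA, lowest to highest ferritin
non_ida = [20, 79, 171, 168, 1332]    # without IDA, same row order


def roc_points(with_disease, without_disease):
    """Return ROC points as (1 - specificity, sensitivity) pairs."""
    total_d = sum(with_disease)
    total_n = sum(without_disease)
    points = [(0.0, 0.0)]             # cutpoint so strict that nobody is abnormal
    for true_pos, false_pos in zip(accumulate(with_disease),
                                   accumulate(without_disease)):
        # false_pos / total_n is the false positive rate, i.e. 1 - specificity
        points.append((false_pos / total_n, true_pos / total_d))
    return points                     # last point is (1, 1): everybody abnormal


points = roc_points(ida, non_ida)
for x, y in points:
    print(f"1 - specificity = {x:.3f}, sensitivity = {y:.3f}")
```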

That's it! The area under the ROC curve (AUROCC) is a reflection of how good the test is at distinguishing (or "discriminating") between patients with and without iron deficiency anemia. The greater the area, the better the test. Let's look at another graph, which shows where a really good test (that has a high sensitivity and specificity) and a perfectly bad test (which classifies diseased patients as healthy, and vice versa) would fall on the ROC curve:

A worthless test, which does not discriminate between alcoholic and non-alcoholic patients, would have a curve shown by the diagonal red line. The best possible test (100% sensitive and 100% specific) would have an area under the curve of 1.0.
A worthless test would have an area under the curve of 0.5.
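Those two extremes are easy to check numerically. The sketch below uses the trapezoidal rule, which is one common way (not the only one) to estimate the area under a curve drawn through a handful of points; `auc_trapezoid` is a name I've made up for this example.

```python
def auc_trapezoid(points):
    """Area under a curve given as (1 - specificity, sensitivity) points,
    estimated by summing the trapezoids between neighbouring points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Best possible test: the curve climbs straight to the upper left corner
perfect_test = [(0, 0), (0, 1), (1, 1)]
# Worthless test: the diagonal line
worthless_test = [(0, 0), (1, 1)]

print(auc_trapezoid(perfect_test))    # 1.0
print(auc_trapezoid(worthless_test))  # 0.5
```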
In addition to being a good way to compare tests, the AUROCC has clinical meaning. Consider the example of trying to diagnose alcoholism using a screening test like the CAGE score. Imagine that you have a group of 5 patients with alcoholism and a second group of 15 who are not alcoholic, shown below by the letters A and N:
A A N N N N N
A A N N N N N
A N N N N N
Now, pick at random a single alcoholic patient from the "A" group and a single non-alcoholic patient from the "N" group:
A N
The probability that the CAGE score will correctly identify which of the two is alcoholic (that is, that "A" will have the more abnormal score than "N") is given by the area under the ROC curve. If that area is 0.81 (as for the CAGE score), then the probability that the pair will be correctly classified is 81%. This is called a "forced choice comparison", and helps you put the AUROCC into a clinical context.
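The forced choice idea can be tried out directly by counting pairs. Since the raw CAGE scores aren't given here, the sketch below uses the serum ferritin counts from earlier in this module instead: it looks at every possible (IDA patient, non-IDA patient) pair and asks how often the IDA patient falls in the lower ferritin row. Counting a tie (both patients in the same row) as half a correct answer is a common convention that I'm assuming here, and the function name is my own.

```python
ida = [474, 175, 82, 30, 48]          # with IDA, lowest to highest ferritin
non_ida = [20, 79, 171, 168, 1332]    # without IDA, same row order


def forced_choice_probability(with_disease, without_disease):
    """Probability that a random diseased patient has a more abnormal (lower)
    result than a random non-diseased patient, with ties counted as half."""
    correct_pairs = 0
    tied_pairs = 0
    for row, n_diseased in enumerate(with_disease):
        # Non-diseased patients in the same row: the test cannot tell them apart
        tied_pairs += n_diseased * without_disease[row]
        # Non-diseased patients in higher-ferritin rows: correctly ordered
        correct_pairs += n_diseased * sum(without_disease[row + 1:])
    total_pairs = sum(with_disease) * sum(without_disease)
    return (correct_pairs + 0.5 * tied_pairs) / total_pairs


probability = forced_choice_probability(ida, non_ida)
print(f"Forced choice probability: {probability:.3f}")
```

For grouped data like these, this pair-counting figure is the same number you would get by measuring the area under the ROC curve drawn with straight lines between the points.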
So what is a "good" AUROCC? There are no hard and fast rules, but in general an area of 0.5 to 0.7 is associated with a marginally useful test, an area of 0.7 to 0.9 with a good test, and an area greater than 0.9 with an excellent test.
By the way, some of you are probably trying to figure out how to calculate the area under the ROC curve. While that is beyond the scope of this course, you can learn more in the following articles:
This is the end of the Diagnosis Module.