Reliability and Validity of the Prone Leg Check
I recently published a study that investigated the inter-examiner reliability of the prone leg-length check.1 The study was performed with the assistance of two chiropractors who use this procedure daily in their practices, and 45 patients who all had a history of low back pain but were not necessarily symptomatic on the day of their examination.
The design of this reliability test was simple and straightforward: Patients came into the adjusting room and were placed face-down on the chiropractic table. The two doctors each entered the room separately, examined the patient's leg length (without knowing what the other had reported) and verbally reported the following test results: the side of the perceived short leg with the knees extended; an estimate of the magnitude of the difference in four categories (<¼ inch, ¼-½ inch, ½-¾ inch and >¾ inch); any change in leg length with head rotation to the left and right (Derifield test); and any change in leg length after flexing the knees to 90 degrees. Here is a brief summary of our findings:
- There is good reliability for finding the short-leg side (k=0.65).
- There is poor to fair reliability for finding the magnitude of difference (k=0.28).
- The head-rotation test (Derifield) is completely unreliable (k=0.04, -0.19).
- Flexing the knees to 90 degrees gives an indeterminate reliability value.
Testing Reliability and Validity
In order to interpret these results properly, it is important to review the issues of reliability and validity of diagnostic tests in the context of chiropractic practice. Reliability is the amount of consistency between measurements; validity is the accuracy of a test, or the degree to which it measures what it was intended to measure. The key point is that valid tests must first be shown to be reliable, but reliable tests are not necessarily valid. Our prone leg-length study only measured the reliability or consistency of this test across examiners, not the validity or clinical accuracy of the test as compared to some other gold-standard test.
There are many examples of a test being reliable but having little or no validity. If a patient were weighed on a poorly calibrated scale that gave readings of 10 pounds over the true weight, the scale would consistently give the same reading on repeated measures. In this sense, the scale meets the definition of being a reliable test because it gives consistent results across repeated measurements. Yet the validity of the scale would be considered terrible, because it was not accurately measuring what it was intended to measure.
We certainly could not base any clinical decisions about whether a patient had actually gained or lost weight upon these inaccurate measurements. Certain X-ray findings, such as seeing a normal disc space on a lateral film, may be highly reliable between two examiners, yet have very little validity when compared to the history of back pain as reported by the patient or to findings on MRI scanning.
Reliability is typically assessed in the clinical setting by having two or more clinicians score an observation and then determining the level of agreement/disagreement between them. This is known as the level of inter-examiner reliability and is reported with certain statistics such as the raw percentage of agreement, kappa (k) statistic or correlation coefficients. As will be discussed later in this article, percentage of agreement is a poor measure of reliability because it does not account for agreement by chance.
Validity is assessed by comparing the measurements we obtain from our new test or procedure with an established gold-standard test with known accuracy. The level of clinical accuracy or validity is typically reported with certain statistics such as sensitivity, specificity, predictive values and likelihood ratios. Research that attempts to determine the accuracy level of a new test by comparison with an established gold-standard test can be very difficult, expensive and time-consuming to perform.
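To make these statistics concrete, here is a minimal Python sketch that computes sensitivity, specificity, predictive values and likelihood ratios from a 2x2 table comparing a new test against a gold standard. The counts are purely hypothetical and are used only to illustrate the formulas; they do not come from any actual study.

```python
# Minimal sketch: validity statistics for a new test versus a gold-standard test.
# The 2x2 counts below are purely hypothetical, chosen only for illustration.

tp = 40  # new test positive, gold standard positive (true positives)
fp = 10  # new test positive, gold standard negative (false positives)
fn = 5   # new test negative, gold standard positive (false negatives)
tn = 45  # new test negative, gold standard negative (true negatives)

sensitivity = tp / (tp + fn)               # proportion of true positives detected
specificity = tn / (tn + fp)               # proportion of true negatives detected
ppv = tp / (tp + fp)                       # positive predictive value
npv = tn / (tn + fn)                       # negative predictive value
lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio

print(f"Sensitivity: {sensitivity:.2f}  Specificity: {specificity:.2f}")
print(f"PPV: {ppv:.2f}  NPV: {npv:.2f}")
print(f"LR+: {lr_pos:.2f}  LR-: {lr_neg:.2f}")
```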
Inter-examiner reliability is simply the amount of agreement between two or more examiners on positive and negative test results. Although this might seem a rather easy task, it is actually a bit more complicated to report the results of an inter-examiner reliability test in statistical terms. Instead of using percentages of agreement, most researchers use the Greek symbol kappa (k) to denote the level of inter-examiner reliability.2 Raw percentage of agreement between the two examiners might seem like the best way to report the reliability, but it turns out that is not an accurate statistic due to the issue of chance agreement.
Chance Agreement and Probability
Remember that game you played as a kid where you and a buddy would put closed fists behind your backs and say, "One, two, three ... shoot"? You would both stick out your open hands, displaying either one or two fingers, and determine if you were showing "odds or evens." What if I claimed to have a special way of predicting the outcome of this game in advance and we designed an experiment to test this question? You and your buddy would be asked to play the game 100 times while I recruited two body-language experts to write down their predictions on a piece of paper before you showed your hands, and then compared the level of agreement between them as a test of inter-examiner reliability.
Since there is a 50 percent chance of showing odds or evens, you would not likely be impressed if these experts agreed on their predictions only 50 out of 100 times. Most people would laugh and say, "Heck, that's no better than pure chance." However, at what level of raw percent agreement would you be convinced that the experts were reliably coming up with the same finding? Sixty percent, 70 percent or 95 percent agreement? Certainly if their predictions were the same 100 percent of the time, you would be convinced they had perfect reliability. And if they got it right only 40 percent of the time, you would say that their level of agreement was completely unreliable, because the prediction level was less than pure chance.
This example is very straightforward because we are using a 50 percent probability of chance agreement.
Things get more complicated if we change the probabilities of chance agreement. What if two sunrise experts were told to go outside every morning for a month and independently record the direction of the sunrise as occurring in the east or the west? We would be very surprised if the agreement between them was only 95 percent because the chance agreement is 100 percent that the sun would rise in the east. Reporting a finding of 95 percent raw agreement in this case represents terrible inter-examiner reliability and would be a misleading conclusion if published in a journal article.
In order to account for the factor of chance agreement in reliability studies, the kappa statistic (k) was created, which can be calculated using a very simple formula:
k = (observed agreement - chance agreement) / (100 - chance agreement)
Let's take some possible outcomes of the 100 "odds-evens" games above and plug them into the formula to calculate some kappa values (a brief code sketch after this list reproduces the arithmetic):
- Case 1: Observed agreement = 40 percent, chance agreement = 50 percent; k = (40 - 50) / (100 - 50) = -10/50 = -0.2.
- Case 2: Observed agreement = 50 percent, chance agreement = 50 percent; k = (50 - 50) / (100 - 50) = 0/50 = 0.
- Case 3: Observed agreement = 60 percent, chance agreement = 50 percent; k = (60 - 50) / (100 - 50) = 10/50 = 0.2.
- Case 4: Observed agreement = 80 percent, chance agreement = 50 percent; k = (80 - 50) / (100 - 50) = 30/50 = 0.6.
- Case 5: Observed agreement = 100 percent, chance agreement = 50 percent; k = (100 - 50) / (100 - 50) = 50/50 = 1.0.
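For readers who want to verify the arithmetic, here is a minimal Python sketch of the formula above, with agreement expressed in percent. The guard for a chance agreement of 100 percent anticipates the degenerate case discussed later in this article.

```python
def kappa(observed_pct, chance_pct):
    """Kappa from observed and chance agreement, both expressed in percent."""
    if chance_pct == 100:
        # Division by zero: kappa is undefined when chance agreement is 100 percent.
        return None
    return (observed_pct - chance_pct) / (100 - chance_pct)

# The five odds-evens cases above, all with 50 percent chance agreement:
for observed in (40, 50, 60, 80, 100):
    print(f"Observed agreement {observed}%  ->  kappa = {kappa(observed, 50):.1f}")
# Prints -0.2, 0.0, 0.2, 0.6 and 1.0, matching Cases 1 through 5.
```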
We can see from the calculations above that the range of kappa is from -1.0, which is perfect disagreement, to 0.0, which is perfect chance agreement, to +1.0, which is perfect agreement. Any kappa value that is close to 0.0 or a negative number indicates that the reliability is basically pure chance, or less than chance (negative values).
In our leg-length data set, we found the Derifield head-rotation test to have kappa values of 0.04 for head rotation to the right, and -0.2 for head rotation to the left. These extremely small kappa values indicate that this test is completely unreliable and no better than chance. The standard interpretation of kappa values is as follows:3
- k = 0.81-1.0 denotes almost perfect reliability.
- k = 0.61-0.80 denotes substantial reliability.
- k = 0.41-0.60 denotes moderate reliability.
- k = 0.21-0.40 denotes fair reliability.
- k = 0.01-0.20 denotes slight reliability.
- k ≤ 0.0 denotes poor reliability.
In our leg-length reliability study, we found the determination of the short-leg side to have a kappa value of 0.65, which indicated good reliability, and the determination of magnitude of that difference to have a kappa value of 0.28, indicating only fair reliability. The head-rotation portion of the test (Derifield) gave kappa values close to 0.0, which indicated poor reliability.
We ran into difficulty with interpretation of the kappa value for our last observation of what happens to the leg length upon flexing the knees to 90 degrees. Forty-two out of 45 times (93 percent agreement), the two examiners reported the finding of "the short leg gets longer." However, there were no times (0 percent agreement) that these two examiners reported a finding of "the short leg gets shorter." Because of the extremely high prevalence of only one finding (the short leg gets longer), the kappa value cannot be calculated. Therefore, the reliability of this portion of the leg length test is unknown.
Watch what happens to the calculation of the kappa value when we take the case of raw percentage of agreement at 93 percent and a "chance agreement" of 100 percent:
Observed agreement = 93 percent, chance agreement = 100 percent; k = (93 - 100) / (100 - 100) = -7/0 = undefined.
It is not possible to divide by zero, so we cannot calculate a kappa value for any observation where the chance agreement is 100 percent. What if we assume the chance agreement was 90 percent and the observed agreement was still found to be 93 percent?
Observed agreement = 93 percent, chance agreement = 90 percent; k = (93 - 90) / (100 - 90) = 3/10 = 0.3.
In this scenario, the 93 percent raw agreement has translated into a low kappa value of 0.3, reflecting only poor to fair inter-examiner reliability. This is the challenge of kappa values; they are so sensitive to the prevalence of the findings and the level of presumed chance agreement that they sometimes give misleading information.
It is tempting to assume from our leg-length study that flexing the knees and observing the short leg getting longer is a reliable procedure, because the raw percentage of agreement was 93 percent. However, we must assume that the chance agreement for this finding was at least 90 percent, because there were no findings of a short leg getting shorter. Perhaps something about a patient's physiology or biomechanics leads to the observation of a short leg getting longer in the majority of cases. Regardless, in this group of 45 patients the phenomenon occurred in just about every case, making it appear to be a nearly constant finding that confounds the assumption of chance agreement. Due to the extremely high prevalence of this finding, we cannot assume a 50-50 chance agreement for the flexed-knee position, and this violates a fundamental assumption of the kappa calculation.
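To see how an extremely skewed prevalence deflates kappa, here is a sketch that computes Cohen's kappa directly from a 2x2 agreement table loosely modeled on the flexed-knee observation: 42 agreements on "longer," no agreements on "shorter," and an assumed split of the three disagreements. These counts are illustrative only and are not the study's raw data.

```python
# Cohen's kappa from a 2x2 agreement table between two examiners.
# Counts are hypothetical: 42 agreements on "longer", 0 agreements on "shorter",
# and an assumed 2/1 split of the three disagreements (the study's raw table is
# not reproduced here).
a, b = 42, 2   # examiner A "longer": examiner B "longer" / "shorter"
c, d = 1, 0    # examiner A "shorter": examiner B "longer" / "shorter"

n = a + b + c + d
observed = (a + d) / n                                   # raw agreement, about 0.93
chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # agreement expected by chance
kappa = (observed - chance) / (1 - chance)

print(f"raw agreement = {observed:.2f}, chance agreement = {chance:.2f}, kappa = {kappa:.2f}")
# Despite 93 percent raw agreement, chance agreement is about 0.94, so kappa
# lands near zero (about -0.03) -- the prevalence problem described above.
```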
Clinical Relevance
What are we to make of all this statistical gibberish in clinical terms? The bottom line is that inter-examiner reliability is good for determining the short-leg side, the Derifield test is completely unreliable, and flexing the knees to 90 degrees seems to lead to the observation of the short leg getting longer in the vast majority of cases. The "eyeball approximation" method of estimating the degree of leg-length difference is not very reliable, at least as we applied it. Our study of the prone leg check was only designed to assess reliability, not validity. We did not attempt to correlate the findings of the leg-check analysis with some gold-standard test such as standing, weight-bearing plain-film X-ray analysis of leg length.
An important clinical test is one that is both reliable and valid. This would be a test that consistently gives the same results when performed by two or more different clinicians (reliability), as well as a test that gives meaningful and accurate clinical information that has been cross-referenced by comparison to some other gold-standard test (validity). In a chiropractic setting, most clinicians perform a prone leg-length check as a precursor to performing a spinal adjustment or manipulation. How does the short-leg side help the clinician decide when and where to treat the patient's spine? Has this determination been validated with some other gold-standard test? In this clinical context, we would need to know information about the validity of the prone-leg check, not merely its reliability.
For the prone-leg check to be considered a valid test for determining the presence of spinal subluxation or segmental dysfunction/restriction, it would need to be established that the finding of "short leg" is correlated highly with some other gold-standard test. For procedures that have been shown to be unreliable, such as the Derifield head-rotation test, there is no reason to even consider validity testing, because reliability is a prerequisite of validity. These same questions could be asked about motion or static palpation of the spine as an indicator of spinal subluxation or segmental dysfunction.
Herein lies a huge challenge for the chiropractic profession. Do we have some gold-standard test for determining the presence of spinal subluxation or segmental dysfunction? If not, how can we perform any meaningful validity testing without a standardized comparison test of known accuracy? The bottom line is that we may not be able to perform validity studies due to the lack of an acceptable gold-standard comparison test, but we can certainly perform reliability studies. Since we know that reliability is a prerequisite of validity, tests that do not show an adequate level of inter-examiner reliability should be discarded from our clinical practices. In my opinion, reliability and validity studies should be an important research agenda for the chiropractic profession as it moves forward into the 21st century of evidence-based practice.
References
- Schneider MJ, et al. Interexaminer reliability of the prone leg length analysis procedure. J Manipulative Physiol Ther 2007;30(7):514-21.
- Gordis L. Epidemiology. 2nd ed. Philadelphia: W.B. Saunders Company, 2000.
- Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-74.