Validity and Inter-Rater Reliability Testing of Quality Assessment Instruments
Published in: Validity and Inter-Rater Reliability Testing of Quality Assessment Instruments / Lisa Hartling et al. Methods Research Report (Prepared by the Southern California Evidence-based Practice Center under Contract No. 290-2007-10021-I). AHRQ Publication No. 12-EHC039-EF. Rockville, MD: Agency for Healthcare Research and Quality, March 2012. 106 pp.
Posted on RAND.org on March 01, 2012
BACKGROUND: Numerous tools exist to assess methodological quality, or risk of bias, in systematic reviews; however, few have undergone extensive reliability or validity testing.

OBJECTIVES: (1) Assess the reliability of the Cochrane Risk of Bias (ROB) tool for randomized controlled trials (RCTs) and the Newcastle-Ottawa Scale (NOS) for cohort studies between individual raters, and, for the ROB tool, between consensus agreements of individual raters; (2) assess the validity of the Cochrane ROB tool and the NOS by examining the association between study quality and treatment effect size (ES); (3) examine the impact of study-level factors on reliability and validity.

METHODS: Two reviewers independently assessed risk of bias for 154 RCTs. For a subset of 30 RCTs, two reviewers from each of four Evidence-based Practice Centers assessed risk of bias and reached consensus. Inter-rater agreement was assessed using kappa statistics. We assessed the association between ES and risk of bias using meta-regression, and we examined the impact of study-level factors on this association using subgroup analyses. Two reviewers also independently applied the NOS to 131 cohort studies from 8 meta-analyses; inter-rater agreement was again calculated using kappa statistics. Within each meta-analysis, we generated a ratio of pooled estimates for each quality domain; these ratios were then combined into an overall estimate of differences in effect estimates using inverse-variance weighting and a random-effects model.

RESULTS: Inter-rater reliability between two reviewers was fair for most domains (κ ranging from 0.24 to 0.37), except for sequence generation (κ=0.79, substantial). Inter-rater reliability of consensus assessments across the four reviewer pairs was moderate for sequence generation (κ=0.60), fair for allocation concealment and "other sources of bias" (κ=0.37 and 0.27, respectively), and slight for the remaining domains (κ ranging from 0.05 to 0.09).
Inter-rater variability was influenced by study-level factors, including the nature of the outcome, nature of the intervention, study design, trial hypothesis, and funding source. Variability resulted more often from differing interpretations of the tool than from different information identified in the study reports. No statistically significant differences in ES were found between studies categorized as high, unclear, or low risk of bias. Inter-rater reliability of the NOS varied from substantial for length of follow-up to poor for selection of the non-exposed cohort and for demonstration that the outcome was not present at the outset of the study. We found no association between individual NOS items, or the overall NOS score, and effect estimates.

CONCLUSION: More specific guidance is needed to apply risk of bias/quality tools. The study-level factors shown to influence agreement provide direction for such detailed guidance. Low agreement across pairs of reviewers has implications for incorporating risk of bias into results and for grading the strength of evidence. Variable agreement for the NOS, together with the lack of evidence that it discriminates studies that may provide biased results, underscores the need for more detailed guidance when applying the tool in systematic reviews.
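The inter-rater agreement figures above are Cohen's kappa values, which compare observed agreement between two raters against the agreement expected by chance. The sketch below is illustrative only (the ratings and trial count are hypothetical, not data from the report):

```python
# Minimal sketch of Cohen's kappa for two raters; ratings are hypothetical.
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # Observed agreement: proportion of items both raters judged identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: expected overlap if the raters judged independently,
    # using each rater's marginal category frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical risk-of-bias judgments for 10 trials.
r1 = ["low", "high", "unclear", "low", "low", "high", "unclear", "low", "high", "low"]
r2 = ["low", "high", "low", "low", "high", "high", "unclear", "low", "unclear", "low"]
print(round(cohens_kappa(r1, r2), 2))  # → 0.52 (moderate agreement)
```

The qualitative labels used in the abstract (slight, fair, moderate, substantial) correspond to the conventional Landis and Koch interpretation bands for kappa.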