Reliability of a rating scale to measure severity of adverse healthcare events
Methods: We conducted a reliability assessment using 50 adverse event case descriptions for each of 8 Common Format case types (medication, perinatal, blood product, device, fall, healthcare-associated infection [HAI], pressure ulcer, surgical) used by the federally mandated National Patient Safety Database. Nine clinicians representing 3 clinical specialties (physicians, nurses, and pharmacists) and 3 levels of adverse event evaluation experience (expert, moderate, novice) rated the same 400 cases after a standardized training session on applying the Harm Scale. IRR was evaluated with free-marginal multirater kappa. Generalizability analysis was used to identify sources of error.
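As an illustrative sketch (not the authors' analysis code), free-marginal multirater kappa can be computed from per-case category counts using Randolph's formulation, where chance agreement is fixed at 1/k for k rating categories; the function name and input layout here are assumptions.

```python
def free_marginal_kappa(counts, k):
    """Randolph's free-marginal multirater kappa.

    counts: one row per case; each row is a list of length k giving how
            many of the n raters assigned the case to each category
            (every row must sum to the same n).
    k: number of rating categories (e.g., harm severity levels).
    """
    N = len(counts)          # number of cases
    n = sum(counts[0])       # raters per case
    # Observed agreement: Fleiss-style pairwise agreement averaged over cases
    p_o = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / N
    p_e = 1.0 / k            # free-marginal chance agreement
    return (p_o - p_e) / (1 - p_e)
```

For example, two cases on which all three raters agree yield kappa = 1.0, while split ratings pull the statistic toward (or below) zero.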
Results: Overall, the IRR across all case types and raters was moderate (free-marginal kappa = 0.51), but differed by case type and rater specialty. For harm severity, performance ranged from fair (medication and blood product) to good (HAI); for duration of harm, performance ranged from moderate (medication) to good (HAI). Intra-disciplinary agreement ranged from fair to good across all 8 case types for both physicians and pharmacists, but was better (moderate to good) among nurses. A higher level of rating experience did not consistently produce higher reliability.
Generalizability analysis revealed that most of the variance in the mean case rating was due to 'true' case differences in harm severity rather than to measurement error. Much smaller sources of variance included rater stringency and the interaction between case severity and rater stringency, resulting in high reliability estimates. Pharmacists were slightly more consistent in their ratings than either physicians or nurses, and expert raters were slightly more consistent than less experienced raters. Raters agreed most with each other when harm severity was 'death' and least when harm severity was 'mild harm'. On average, projected reliability for the Harm Scale reached 0.85 when the number of raters exceeded two.
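The projection of reliability to larger rater panels follows the decision-study logic of generalizability theory: rater-linked error variance is divided by the number of raters averaged over. A minimal sketch, assuming a simple one-facet (case x rater) design with hypothetical variance components; the function name and example values are illustrative, not the study's estimates.

```python
def projected_reliability(var_case, var_error, m):
    """D-study projection for the mean of m raters (one-facet design).

    var_case:  variance due to 'true' case differences (universe score).
    var_error: rater-linked error variance for a single rater
               (stringency plus case-by-rater interaction/residual).
    m:         number of raters whose scores are averaged.
    """
    return var_case / (var_case + var_error / m)

# Illustrative variance components (not from the study):
# with one rater reliability is 0.60; averaging more raters raises it.
single = projected_reliability(0.6, 0.4, 1)
panel = projected_reliability(0.6, 0.4, 4)
```

This is the same structure as the Spearman-Brown prophecy formula: error shrinks as 1/m, so reliability rises with each added rater at a diminishing rate.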
Conclusions: The IRR for the AHRQ Harm Scale is moderate, and most of the variance is due to case differences, with much smaller contributions from rater stringency, specialty, or level of experience.
Learning Areas: Conduct evaluation related to programs, research, and other areas of practice
Public health or related laws, regulations, standards, or guidelines
Describe the interrater reliability of the Agency for Healthcare Research and Quality (AHRQ) Harm Scale for measuring severity and duration of adverse events. Discuss alternative approaches to measuring interrater reliability.
Qualified on the content I am responsible for because: As a trained psychologist/psychometrician, I am well versed in the development, evaluation, and application of patient- and clinician-reported measures.
Any relevant financial relationships? No
I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.