Nursing education has incorporated simulation into its teaching/learning curriculum and in many cases has expanded its use to include evaluation of student performance in summative simulations, including those with high stakes. The International Nursing Association for Clinical Simulation and Learning (INACSL) provides nurse educators with Standards of Best Practice: Simulation Participant Evaluation (2016). Required elements include “trained, nonbiased, objective raters or evaluators” (INACSL, 2016, p. S27).
Educators have identified that student performance evaluations are not always fair or consistent. This led the National League for Nursing (NLN) to carry out a four-year research study looking at the feasibility of using simulation for high stakes evaluation in pre-licensure RN programs. This study clarified the need for evaluator training and generated the question, “What are the best methods to train raters in high stakes simulation?” (Rizzolo, Kardong-Edgren, Oermann, & Jeffries, 2015, p. 302). In response to that question, we designed a training strategy and tested it in our research study, The Effect of Evaluator Training on Intra/Inter Rater Reliability in High Stakes Assessment in Simulation.
Methods: We conducted a nationwide study with an experimental, randomized, controlled design. Participants identified themselves as faculty who used simulation in a pre-licensure nursing program. Seventy-five participants completed the study, divided nearly evenly between the control (n = 37) and intervention (n = 38) groups. The two groups were similar in demographic characteristics such as age and years of teaching with simulation.
The researchers developed the training strategies by adapting methods from the previously mentioned NLN study and enhancing them with additional information regarding the evaluation tool, a model evaluation, additional practice evaluating in simulation, and several small group training webinars. All participants (control and intervention) watched a basic training video on the use of the chosen validated evaluation tool, the Creighton Competency Evaluation Instrument (CCEI) (Hayden, Keegan, Kardong-Edgren, & Smiley, 2014), and used the tool to practice evaluating one videotaped student performance in simulation. Only participants in the advanced-training intervention group went on to participate in an online small group webinar to discuss their evaluations and share their thinking regarding their evaluation decisions. In the webinar, they also viewed the video again, this time with a researcher’s voiceover as a model for how to view the performance. Participants then viewed and evaluated three more video performances of varying quality. After three weeks, they again viewed and evaluated the same three videos. They then participated in a second webinar to discuss their use of the evaluation tool. The webinars were conducted to facilitate the development of a shared mental model, so that participants’ ratings of performance would become more similar and interrater reliability would improve. A private remediation session was held when further coaching was needed. After the training, all participants (control and intervention) proceeded to the experimental stage, where they watched and evaluated three experimental videos. One month later, they again watched and scored the same three videos.
Results: The Intraclass Correlation Coefficient (ICC) statistics for the intervention group showed excellent reliability, and the kappa statistics for the intervention group fell in the moderate to substantial ranges. In contrast, the ICC statistics for the control group were lower, ranging from poor to excellent reliability across the experimental videos, and the kappa statistics for the control group showed poor to substantial reliability. Overall, the intervention group achieved higher interrater reliability, with less variability across the experimental videos. Reliability was not dependent on the quality of student performance (poor, mediocre, good).
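For readers less familiar with these reliability statistics, the minimal sketch below illustrates how an ICC(2,1) and pairwise Cohen's kappa could be computed for a set of raters scoring the same recorded performances. It is not drawn from the study; the ratings matrix and score values are hypothetical, and the ICC form shown (two-way random effects, absolute agreement, single rater) is one common choice rather than necessarily the one used in this analysis.

    import numpy as np
    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical data: rows are recorded student performances, columns are
    # raters; values are illustrative ordinal scores, not CCEI study data.
    ratings = np.array([
        [3, 3, 2, 3],
        [1, 2, 1, 1],
        [4, 4, 4, 3],
        [2, 2, 3, 2],
        [3, 4, 3, 3],
    ], dtype=float)

    def icc_2_1(x):
        """ICC(2,1): two-way random effects, absolute agreement, single rater
        (Shrout & Fleiss, 1979)."""
        n, k = x.shape
        grand = x.mean()
        msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)       # subjects
        msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)       # raters
        sse = ((x - grand) ** 2).sum() - (n - 1) * msr - (k - 1) * msc  # residual
        mse = sse / ((n - 1) * (k - 1))
        return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    print(f"ICC(2,1): {icc_2_1(ratings):.3f}")

    # Agreement between each pair of raters on the ordinal scores,
    # summarized as the mean pairwise Cohen's kappa.
    kappas = [cohen_kappa_score(ratings[:, i].astype(int), ratings[:, j].astype(int))
              for i, j in combinations(range(ratings.shape[1]), 2)]
    print(f"Mean pairwise Cohen's kappa: {np.mean(kappas):.3f}")

In practice, a statistical package would typically be used for these calculations; the sketch simply makes explicit that the ICC treats the scores as continuous while kappa treats them as categorical, which is why both are commonly reported for rater-agreement studies such as this one.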
Conclusion: The results of our research study indicate that a planned training method that includes a model evaluation, multiple practice assessments, and group discussions of evaluation decisions results in greater consistency in scores among evaluators. Providing evaluators with a video-recorded model evaluation helped initiate a shared mental model. Evaluating multiple student performances of differing quality was useful in identifying essential performance measures. A key element of the webinar discussions was the opportunity for participants to consider their evaluations alongside those of others and discuss the rationale for their decisions. Knowing they were attempting to create consistency, the participants were open to considering different perspectives as they worked to develop a shared mental model.
Training of faculty is necessary to achieve interrater reliability in the evaluation of student performance in simulation. Multimodal strategies contribute to the achievement of a shared mental model, which enables faculty to take a more consistent and standardized approach to student assessment (Boulet et al., 2011; Kardong-Edgren, Oermann, Rizzolo, & Odom-Maryon, 2017).