The Effect of Evaluator Training on Reliability in High-Stakes Assessment Simulations: Pilot Study Results

Saturday, 29 July 2017: 9:50 AM

Jone Tiffany, DNP, RN, CNE, CHSE, ANEF (1)
Linda Blazovich, DNP, RN, CNE (2)
Vicki Schug, PhD, RN, CNE (2)
(1) Nursing, Bethel University, St. Paul, MN, USA
(2) Department of Nursing, St. Catherine University, St. Paul, MN, USA

Purpose: Evaluating the clinical competency of nursing students is essential as we prepare them for the ever-changing, fast-paced healthcare environment. As greater emphasis is placed on high-stakes assessment of clinical performance, training evaluators to assure good intra- and inter-rater reliability in rating simulation performance is paramount. Supported by recent findings from the National Council of State Boards of Nursing (NCSBN) study, simulation is quickly developing into a core teaching strategy for nursing education. Evaluation of learning during simulation, an essential component of the NCSBN study, informed nurse educators about valid and reliable mechanisms to assess achievement and competency in practice. The growing interest in using simulation to evaluate student competency led the National League for Nursing (NLN) to conduct a four-year study of the process and feasibility of using mannequin-based, high-fidelity simulation for high-stakes assessment in pre-licensure RN programs. Achieving clarity about the specific behaviors students must exhibit to demonstrate competency is essential; equally important is training evaluators to assure satisfactory intra-/inter-rater reliability. This presentation describes the results of a pilot study conducted to test the effectiveness of a training intervention in producing intra- and inter-rater reliability among nursing faculty evaluating student performance in simulation. The study is an extension of the NLN Project to Explore the Use of Simulation for High-Stakes Assessment. The pilot study was guided by the question: What is the effect of (a) a training intervention and (b) faculty personality characteristics on faculty ability to achieve intra-/inter-rater reliability when evaluating student performance during high-stakes simulation?

Methods: A pilot study was designed to precede a national experimental study. Basic orientation and advanced evaluator training modules were developed; these included orientation documents, StrengthsFinder Inventory instructions, a training video for the Creighton Competency Evaluation Instrument (CCEI), a training webinar, and a coaching webinar. Study instruments included the CCEI, student performance videos created for the NLN Project to Explore the Use of Simulation for High-Stakes Assessment, a demographic survey, and the StrengthsFinder Inventory Survey. With NLN approval, the student performance videos and the performance assessment tool produced for and used in the NLN feasibility study were used in the pilot study. A training intervention for faculty evaluators was developed, and five simulation experts completed the training intervention and the performance evaluation procedure. Reliability and correlational analyses were performed to evaluate the impact of training and of faculty personality characteristics on intra-/inter-rater reliability. Feedback was collected from the participants to guide modifications to the content and process of the intervention in preparation for a regional, multi-site, experimental study, which began in the fall of 2016.
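The abstract does not name the specific reliability statistic used. The following is a minimal sketch, assuming an intraclass correlation coefficient (ICC) computed on long-format CCEI scores; the column names (video, rater, ccei_total) and the score values are hypothetical illustrations, not study data.

    # Minimal sketch of an inter-rater reliability check; the choice of ICC and
    # all column names/values are illustrative assumptions, not the study's procedure.
    import pandas as pd
    import pingouin as pg  # statistics library providing intraclass_corr

    # Long-format ratings: one row per (video, rater) pair of CCEI total scores.
    ratings = pd.DataFrame({
        "video": ["v1", "v1", "v1", "v2", "v2", "v2", "v3", "v3", "v3"],
        "rater": ["r1", "r2", "r3"] * 3,
        "ccei_total": [18, 17, 19, 22, 21, 22, 15, 14, 16],
    })

    # Intraclass correlation across raters; pingouin reports the single-rater and
    # average-rater ICC variants with 95% confidence intervals.
    icc = pg.intraclass_corr(data=ratings, targets="video",
                             raters="rater", ratings="ccei_total")
    print(icc[["Type", "ICC", "CI95%"]])

Comparing such coefficients computed separately for the training videos and the experimental videos is one way to gauge whether evaluator training narrowed rater disagreement.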

Results: All pilot participants were female. Three participants were ages 51 to 60, one was 61 to 70, and one was 31 to 40. Four participants held a master's degree and one held a doctoral degree as the highest academic credential. The participants taught in associate, baccalaureate, and entry-level master’s programs across three different states. Only one participant taught in a program currently conducting high-stakes assessment in simulation. Quantitative analysis was conducted on the CCEI video evaluations. When inter-rater reliability for the six experimental videos was compared with that for the three training videos, a large increase was noted for two subscales, Assessment and Clinical Judgment; two other subscales, Communication and Patient Safety, showed little difference, as did the two overall measures, Yes/No Competency and Overall Score. Because the results were reported in an aggregated format, differences between the separate evaluations of the training videos and the experimental videos were obscured; even so, the training intervention appears to have helped the participants develop a more shared mental model of evaluation. For the full study, these statistics will be analyzed and reported both individually and in aggregate.

Conclusion: Conducting a pilot study proved invaluable. When data collection instruments, study procedures, and data analysis are complex, difficulties that require problem solving can be expected. This study raised critical questions relative to high-stakes assessment, including (1) What is the “right” amount and format of evaluator training? and (2) How can teams of faculty be helped to develop a shared mental model? The pilot study provided the opportunity to implement the study procedures and to make changes where issues and problems were discovered.