Analytical Challenges in the Era of Big Data

Jeffery, Alvin D.

Purpose: The popularity of “big data,” along with an increasing capacity for real-time predictive analytics, holds significant promise for nurses and other clinicians to gain new insights and develop novel decision support tools from large clinical datasets. Unfortunately, these large datasets are not the panacea that some big data proponents tout. For nurses with extensive subject matter expertise in a clinical area who want to leverage big data to solve practical problems, roadblocks quickly surface in four areas: data acquisition and management, missing data, meeting the assumptions of statistical models, and evaluating models for statistical and clinical performance. This talk will engage the audience in addressing these issues using an exemplar: the development of a prediction model for in-hospital cardiopulmonary arrest.

Methods: The following 4 topics will be addressed:

Data Acquisition and Management: From ethics approval to ensuring individual patient privacy to preventing undesired user access, collecting and storing “big data” is no simple task. The presenter will provide: (a) an overview of key concepts, (b) an exemplar for constructing a data acquisition and management team, and (c) several resources for learning more independently.

Missing Data: Almost all large datasets contain some amount of missing data. Regardless of the amount, identifying why values are missing is of paramount importance, because the cause of missingness determines which handling methods are appropriate. Approaches to determining a cause will be introduced, and disadvantages of complete case analysis will be described. Advantages and disadvantages of median imputation, multiple imputation, and machine learning imputation will be compared.
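To make the contrast concrete, the following is a minimal sketch of complete case analysis versus median imputation, using only the Python standard library. The variable names (`heart_rate`, `resp_rate`, `lactate`) and all values are invented for illustration and are not taken from the presenter's dataset; multiple imputation and machine learning imputation are not shown.

```python
from statistics import median

def complete_cases(rows):
    """Keep only the rows in which every variable is observed."""
    return [r for r in rows if all(v is not None for v in r.values())]

def median_impute(rows):
    """Replace each missing value with the median of that variable's observed values."""
    columns = list(rows[0].keys())
    medians = {
        c: median(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else medians[c]) for c in columns}
        for r in rows
    ]

# Hypothetical patient records; None marks a missing measurement.
rows = [
    {"heart_rate": 88, "resp_rate": 18, "lactate": 1.1},
    {"heart_rate": 120, "resp_rate": None, "lactate": 3.9},
    {"heart_rate": None, "resp_rate": 24, "lactate": None},
    {"heart_rate": 95, "resp_rate": 20, "lactate": 1.6},
]

kept = complete_cases(rows)    # rows with any missing value are discarded
imputed = median_impute(rows)  # all rows are retained, gaps filled with medians
print(len(rows), len(kept), len(imputed))
```

In this toy example, complete case analysis discards half of the records, whereas median imputation keeps every record but fills each gap with a single value and so understates the variability in the data.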

Statistical Model Assumptions: A variety of statistical models are available, and recent advances in machine learning have put more approaches for extracting information from data within reach of a wide array of users. An overview of the purpose and requirements of traditional modeling (e.g., logistic and linear regression) and machine learning approaches (e.g., random forests and cluster analyses) will be provided.

Model Evaluation: Determining how well a model performs on the current data and how well it is expected to perform on future data is essential in deciding whether the model is helpful for clinical care. Internal (e.g., bootstrapping and cross-validation) versus external validation (e.g., split-sample and chronological validation) techniques will be presented along with their respective advantages and disadvantages.
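As an illustration of how two of these techniques partition the data, the sketch below generates k-fold cross-validation splits and a chronological split using only the Python standard library. The function names, the `admit_year` field, and the cutoff year are assumptions made for this example; model fitting and performance metrics are omitted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists so that each record is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # shuffle reproducibly
    folds = [idx[i::k] for i in range(k)]     # k roughly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def chronological_split(records, cutoff_year):
    """Train on records before the cutoff year and test on records from it onward."""
    train = [r for r in records if r["admit_year"] < cutoff_year]
    test = [r for r in records if r["admit_year"] >= cutoff_year]
    return train, test

# Hypothetical admissions; only the admission year is shown.
records = [{"admit_year": y} for y in (2012, 2013, 2013, 2014, 2015, 2015)]

for train_idx, test_idx in k_fold_indices(len(records), k=3):
    print(sorted(train_idx), sorted(test_idx))

earlier, later = chronological_split(records, cutoff_year=2015)
print(len(earlier), len(later))
```

Cross-validation reuses every record for both fitting and testing across folds, while the chronological split holds out the most recent records to mimic how a model would be applied to future patients.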

Results: Our in-hospital cardiopulmonary arrest prediction model required a team-based approach to solving the aforementioned challenges, and the audience will hear not only how we chose to solve the problems but also which other approaches we considered. For data acquisition and management, we found the best approach was to include database and informatics specialists, who used Structured Query Language (SQL) to extract the relevant data and then stored the data on a secure organizational server. Following a simulation study, we found the missing data problem was best resolved by creating a multiple imputation model that included the outcome variable. Statistical model assumptions were best met by relaxing the assumption of linearity with splines while limiting the number of spline knots. Model evaluation comprised internal bootstrap validation for the regression models and split-sample validation for the machine learning methods.

Conclusion: Arriving at clinically meaningful insights contained within large datasets requires multifaceted expertise and teamwork. Nurses and other clinicians are the best members of the team to identify a problem that “big data” can help solve. To ensure a clinically meaningful solution surfaces from big data efforts, nurses should be aware of common challenges in big data research. As nurses become more knowledgeable, they position themselves to be leaders in these research teams and advocates for implementation of novel findings.