Learning to Do Data Science Healthcare Research

Monday, 18 November 2019

Lana Pasek, MSN
SUNY@Buffalo School of Nursing, Buffalo, NY, USA

Purpose: Data science describes the process of asking questions of large data sets and analyzing and manipulating them in search of patterns and knowledge, using diverse mathematical models and methods (Sullivan, 2018). With the great advances in technology and computer science, scientists now have quick access to large amounts of data from a variety of sources. These sources are digitized and range from financial data, such as credit card and banking information, to the most personal data in Electronic Medical Records. Concurrently, the capacity to store and retrieve these huge amounts of data electronically has increased, and the data can be transmitted electronically at unprecedented speed. This speed, coupled with the organization and labeling of the data within software programs, makes it easy to use the data for different analyses.

Health care research focused on improving the management and delivery of care is evolving quickly as these large databases become accessible. This research, however, requires interdisciplinary teamwork among computer software engineers, mathematicians and statisticians, and those with substantive area expertise, such as nurses (Grus, 2015). Nurses are experts in clinical knowledge and wisdom. Clinical wisdom comes from the general truths and principles that emerge from the understanding and intimate knowledge of clinical practice (Benner, Hooper Kyriakidis, & Stannard, 2011). Nurses are therefore ideally placed to contribute to data science health research and to learn how to perform data science processes.

Methods: Data science research starts with a research question. The question comes with an understanding of its significance in solving a problem and of the background of the variables that interact as predictors of the outcome. The literature should provide data on the problem's incidence and prevalence in healthcare. The next step is to define the data the researcher needs to analyze. These definitions state explicitly which data warehouse or government or private database the data were obtained from and the dates explored. The defined data elements are listed in a data registry and can be drawn from clinical codes such as the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10). The data registry also specifies the data to be excluded, such as certain age ranges or conditions like pregnancy. The strategy is to request broadly at first, obtaining as much data as might be needed; data cleaning then eliminates duplicates and data unrelated to the research question. The data are then normalized for the research question, with certain data, such as deaths, flagged. Next, an analytic database is established, from which the data are described in terms of all the independent and dependent variables and to which the data science approaches are applied. There are numerous data science approaches, and the most important in healthcare data analytics is clinical prediction. Clinical prediction approaches try to uncover the underlying relationship between attributes and a dependent variable, and they include statistical methods such as regression models and survival analyses (Reddy & Yi, 2015). Data science approaches therefore help data scientists identify patterns in these massive data sets to answer questions such as, “What is odd about my data?” (anomaly detection) or “What will happen next?” (predictive analytics) (Sullivan, 2018).
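The cleaning steps described in the Methods (removing duplicates, applying inclusion and exclusion criteria from the data registry, and flagging deaths) can be illustrated with a minimal Python sketch. It is not the author's actual pipeline. The record layout (`Encounter`), the field names, the "expired" discharge value, the minimum age of 18, and the use of the ICD-10 prefix A41 as an inclusion criterion are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encounter:
    """Hypothetical raw record; real extracts will have different fields."""
    patient_id: str
    encounter_id: str
    age: int
    icd10_codes: tuple      # diagnosis codes recorded for the encounter
    pregnant: bool          # assumed to be available as a simple flag
    discharge_status: str   # assumed values such as "home" or "expired"


def build_analytic_cohort(raw, include_prefixes, min_age=18):
    """Deduplicate, apply inclusion/exclusion criteria, and flag deaths."""
    seen = set()
    cohort = []
    for enc in raw:
        # Data cleaning: drop repeated copies of the same encounter.
        key = (enc.patient_id, enc.encounter_id)
        if key in seen:
            continue
        seen.add(key)
        # Inclusion: keep encounters with at least one matching ICD-10 code.
        if not any(code.startswith(include_prefixes) for code in enc.icd10_codes):
            continue
        # Exclusion: age range and pregnancy, as named in the data registry.
        if enc.age < min_age or enc.pregnant:
            continue
        # Flag deaths so that they can serve as the outcome variable.
        cohort.append({
            "patient_id": enc.patient_id,
            "age": enc.age,
            "died": enc.discharge_status == "expired",
        })
    return cohort


# Small made-up extract (no real patient data).
raw = [
    Encounter("p1", "e1", 67, ("A41.9", "N17.9"), False, "expired"),
    Encounter("p1", "e1", 67, ("A41.9", "N17.9"), False, "expired"),  # duplicate
    Encounter("p2", "e2", 15, ("A41.9",), False, "home"),   # below minimum age
    Encounter("p3", "e3", 30, ("A41.9",), True, "home"),    # pregnancy exclusion
    Encounter("p4", "e4", 54, ("J18.9",), False, "home"),   # no matching code
    Encounter("p5", "e5", 72, ("A41.51",), False, "home"),
]

cohort = build_analytic_cohort(raw, include_prefixes=("A41",))
mortality = sum(r["died"] for r in cohort) / len(cohort)
print(len(cohort), mortality)
```

The resulting list plays the role of the analytic database: each row holds candidate predictors (here only age) and the flagged outcome, which could then be passed to a regression or survival model.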

Results: The results from the different approaches can be compared for hypothesis testing and to begin to understand what is happening in relation to the research question. Nurses use data science to uncover knowledge in large data sets (usually Electronic Health Record data). They determine how large data sets can define new meaning for patient care by discovering patterns, associations, or factors related to patient outcomes. Nurses have developed predictive models to identify risks for adverse health outcomes such as infection or mortality. Moreover, nurses use data science to develop, assess, and evaluate patient outcomes by way of clinical decision support tools, health portals, and care coordination activities (Westra et al., 2017; Delany & Weaver, 2018). Lessons learned from performing data science research include that some data may be missing, that the data science approach may need to be rethought (although there are many to choose from), that filtering a population can be difficult, and that a data registry may turn out to be incomplete.