An Application of Machine Learning for the Identification of Adolescent Smoking Risk Factors

Sunday, 30 July 2017

Sophia J. Chung, PhD, MSN
Department of Nursing, University of Ulsan, Ulsan, Korea, Republic of (South)
Youngji Lee, PhD, MS
School of Nursing, University of Pittsburgh, Pittsburgh, PA, USA

Purpose: Smoking is a modifiable risk behavior that causes various health problems, including cancer and respiratory disease. Moreover, the literature shows that adolescent smoking behaviors are likely to persist into adulthood in countries worldwide. In South Korea, despite many efforts to reduce smoking among adolescents, this modifiable risk behavior remains a significant social problem. An effective intervention to modify adolescents' smoking behavior must understand and address the factors that underlie and influence that behavior. These factors can be surfaced from data using an appropriate approach. Machine learning is well suited to revealing patterns of information in large, complex datasets that are useful in predicting outcomes (Chekround, 2016). For example, machine learning has been used to predict readmission of inpatients (Mortazavi, 2016; Frizzell, 2016). However, this approach had not yet been applied to adolescent risk behaviors such as smoking. Therefore, the goal of this study was to identify the predictors of adolescents' smoking behaviors in South Korea using a machine-learning approach.

Methods: The 2015 Korean Youth Risk Behaviors Web-based Survey (KYRBS) was the data source for this study. The KYRBS is an annual, nationwide survey conducted in South Korea to examine health behaviors, including cigarette smoking, personal hygiene, and alcohol consumption. The 2015 KYRBS data were collected via self-report questionnaires completed by 68,043 students in grades 7 through 12 at 800 randomly selected schools in South Korea. For this study, we used the 5,123 surveys in which the smoking items had been completed. This study utilized the machine-learning pipeline developed by Fayyad (1996) and Yoon (2015). To mitigate the "curse of dimensionality," in which a large number of interrelated variables in a large dataset degrades the accuracy of a machine-learning model, we selected clinically meaningful features based on the conceptual framework for adolescent risk behaviors (Jessor, 1991). Then, we applied three machine-learning algorithms implemented in Weka (i.e., J48, Naïve Bayes, and Logistic Regression) to build a predictive model of smoking behavior among the adolescents represented in the KYRBS dataset. The final model was selected based not only on its accuracy but also on its F-measure, calculated from precision and recall.
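
The selection criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Weka workflow: the confusion-matrix counts are invented solely to exercise the code and are not results from the KYRBS, the `Confusion` class and helper names are hypothetical, and the rule of ranking by accuracy first and F-measure second is an assumption, since the abstract does not state how the two metrics were combined.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Binary confusion-matrix counts, with 'smoker' as the positive class."""
    tp: int  # smokers predicted as smokers
    fp: int  # non-smokers predicted as smokers
    fn: int  # smokers predicted as non-smokers
    tn: int  # non-smokers predicted as non-smokers


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def precision(c: Confusion) -> float:
    predicted_positive = c.tp + c.fp
    return c.tp / predicted_positive if predicted_positive else 0.0


def recall(c: Confusion) -> float:
    actual_positive = c.tp + c.fn
    return c.tp / actual_positive if actual_positive else 0.0


def f_measure(c: Confusion) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# HYPOTHETICAL counts for illustration only; these are NOT study results.
candidates = {
    "J48": Confusion(tp=60, fp=25, fn=40, tn=375),
    "NaiveBayes": Confusion(tp=70, fp=60, fn=30, tn=340),
    "Logistic": Confusion(tp=65, fp=20, fn=35, tn=380),
}

# Assumed rule: highest accuracy wins, with F-measure as the tie-breaker.
best = max(
    candidates,
    key=lambda name: (accuracy(candidates[name]), f_measure(candidates[name])),
)

for name, c in candidates.items():
    print(f"{name}: accuracy={accuracy(c):.3f}, F-measure={f_measure(c):.3f}")
print("selected:", best)
```

Reporting the F-measure alongside accuracy matters when one class is much rarer than the other, because a model can reach high accuracy by mostly predicting the majority class while still having poor recall for the minority class.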

Results: Through the feature selection process, we classified 40 features into three predictive categories. Among the three machine-learning algorithms we applied, Logistic Regression demonstrated the highest accuracy (i.e., 84.0% of adolescent smokers were correctly classified; F-measure = 0.795). In this model, grade (-0.06) and alcohol consumption (-0.56) were the top two features with the highest coefficients. In other words, being a middle school student and never having drunk alcohol were strongly associated with smoking.
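
One way to read the two reported coefficients is as odds ratios. The sketch below assumes they are ordinary log-odds coefficients from a logistic regression; the abstract does not say how grade and alcohol consumption were coded or scaled, so the direction and size of the effect per unit depend on that coding. The helper name `odds_ratio` is hypothetical.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Convert a logistic-regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)


# Coefficients as reported in the Results; variable coding is not given there.
coefficients = {"grade": -0.06, "alcohol consumption": -0.56}

for name, beta in coefficients.items():
    print(f"{name}: coefficient={beta:+.2f}, odds ratio={odds_ratio(beta):.3f}")
```

Under this assumption, a negative coefficient gives an odds ratio below 1, meaning that a one-unit increase in the coded variable multiplies the odds of the modeled outcome by that factor.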

Conclusion: Our study demonstrates that a machine-learning approach is effective in identifying behavioral predictors from a large, complex dataset, in this case the behavioral predictors associated with smoking in the KYRBS. However, our results were inconsistent with those reported in the literature. Previous studies showed that higher grade and previous alcohol consumption were associated with adolescents' smoking behaviors (Mendol, 2013; Talip, 2015). Further study of the association between smoking behaviors and alcohol consumption among Korean adolescents is needed. Although this study had some limitations (e.g., the KYRBS data are cross-sectional), our machine-learning approach shows promise, and subsequent research using longitudinal data can take into account the trends of association implicit in creating a predictive model.