A machine-learning approach for nonalcoholic steatohepatitis susceptibility estimation
Abstract
Background Nonalcoholic steatohepatitis (NASH), a severe form of nonalcoholic fatty liver disease, can lead to advanced liver
damage and has become an increasingly prominent health problem worldwide. Predictive models for early identification of high-risk individuals could help guide preventive and interventional measures. Traditional epidemiological models, which are based on
statistical analysis, have limited predictive power. In the current study, a novel machine-learning approach was developed for
individual NASH susceptibility prediction using candidate single nucleotide polymorphisms (SNPs).
Methods A total of 245 NASH patients and 120 healthy individuals were included in the study. Single nucleotide polymorphism
genotypes of candidate genes including two SNPs in the cytochrome P450 family 2 subfamily E member 1 (CYP2E1) gene
(rs6413432, rs3813867), two SNPs in the glucokinase regulator (GCKR) gene (rs780094, rs1260326), the rs738409 SNP in patatin-like phospholipase domain-containing 3 (PNPLA3), and gender parameters were used to develop models for identifying at-risk
individuals. To predict an individual’s susceptibility to NASH, nine different machine-learning models were constructed. These
models combined two feature-selection methods, Chi-square and support vector machine recursive feature elimination
(SVM-RFE), or no feature selection (all features as input), with three classification algorithms: k-nearest neighbor (KNN), multi-layer perceptron (MLP), and random
forest (RF). All nine machine-learning models were trained on 80% of the data from both the NASH patients and the healthy controls.
The models were then tested on the remaining 20% of both groups. Model performance was compared in terms of
accuracy, precision, sensitivity, and F measure.
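The split and model grid described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the sample IDs are placeholders, the fixed random seed and the rounding rule are arbitrary choices, and treating "all features" as a third feature setting alongside Chi-square and SVM-RFE is inferred from the Results rather than stated in the Methods.

```python
import random
from itertools import product

# Labels taken from the abstract; arranging them as a 3 x 3 grid is an assumption.
FEATURE_SETTINGS = ["all features", "Chi-square", "SVM-RFE"]
CLASSIFIERS = ["KNN", "MLP", "RF"]


def stratified_split(samples_by_class, train_fraction=0.8, seed=0):
    """Split each class separately so both groups keep the same train/test ratio."""
    rng = random.Random(seed)  # fixed seed is an illustrative choice
    train, test = [], []
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train += [(sample, label) for sample in shuffled[:cut]]
        test += [(sample, label) for sample in shuffled[cut:]]
    return train, test


# Hypothetical placeholder IDs standing in for the genotype records.
cohort = {
    "NASH": [f"nash_{i}" for i in range(245)],
    "control": [f"ctrl_{i}" for i in range(120)],
}
train_set, test_set = stratified_split(cohort)

# Three feature settings x three classifiers = nine model configurations.
model_grid = list(product(FEATURE_SETTINGS, CLASSIFIERS))

print(len(train_set), len(test_set), len(model_grid))
```

Under these assumptions, the 80% portion holds 196 NASH patients and 96 controls, leaving 49 and 24 for testing.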
Results Among all nine machine-learning models, the KNN classifier with all features as input achieved the highest performance,
with an F measure of 86% and an accuracy of 79%.
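For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, sensitivity, and F measure follow from confusion-matrix counts. The counts used here are hypothetical and are not taken from the study. They were chosen only to show that the F measure can exceed accuracy when the positive class (here, NASH) is the majority.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute four standard metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators so degenerate inputs return 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


# Hypothetical counts for a test set of 49 cases and 24 controls.
metrics = classification_metrics(tp=45, fp=11, fn=4, tn=13)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the F measure ignores true negatives, a classifier that recovers most cases can score well on it even when a sizeable share of controls is misclassified, which is why accuracy is reported alongside it.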
Conclusions Machine learning based on genomic variation may be applicable for estimating an individual’s susceptibility to
developing NASH among high-risk groups with a high degree of accuracy, precision, and sensitivity.
Volume 41, Issue 5