Journal of Biomedical Informatics
journal homepage: www.elsevier.com/locate/yjbin
Cancer classification and pathway discovery using non-negative matrix factorization
Zexian Zeng (a), Andy H. Vo (b), Chengsheng Mao (a), Susan E. Clare (c), Seema A. Khan (c), Yuan Luo (a)
(a) Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL, USA
(b) Committee on Developmental Biology and Regenerative Medicine, The University of Chicago, Chicago, IL, USA
(c) Department of Surgery, Northwestern University, Feinberg School of Medicine, Chicago, IL, USA
Non-negative matrix factorization
Objectives: Extracting genetic information from a full range of sequencing data is important for understanding disease. We propose a novel method to effectively explore the landscape of genetic mutations and aggregate them to predict cancer type.
Design: We applied non-smooth non-negative matrix factorization (nsNMF) and support vector machine (SVM) to utilize the full range of sequencing data, aiming to better aggregate genetic mutations and improve their power to predict disease type. More specifically, we introduce a novel classifier to distinguish cancer types using somatic mutations obtained from whole-exome sequencing data. Mutations were identified from multiple cancers and scored using SIFT, PP2, and CADD, and collapsed at the individual gene level. nsNMF was then applied to reduce dimensionality and obtain coefficient and basis matrices. A feature matrix was derived from the obtained matrices to train a classifier for cancer type classification with the SVM model.
Results: We have demonstrated that the classifier was able to distinguish four cancer types with reasonable accuracy. In five-fold cross-validations using mutation counts as features, the average prediction accuracy was 80% (SEM = 0.1%), significantly outperforming baselines and outperforming models using mutation scores as features.
Conclusion: Using the factor matrices derived from the nsNMF, we identified multiple genes and pathways that are significantly associated with each cancer type. This study presents a generic and complete pipeline to study the associations between somatic mutations and cancers. The proposed method can be adapted to other studies for disease status classification and pathway discovery.
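The pipeline summarized in the abstract (gene-level mutation features, nsNMF for dimensionality reduction, an SVM classifier, five-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the common nsNMF formulation V ≈ W S H with smoothing matrix S = (1 − θ)I + (θ/k)11ᵀ, fitted here with Frobenius-loss multiplicative updates. The toy Poisson mutation counts, the choices θ = 0.5 and k = 4, the linear-kernel SVM, the projection of held-out samples onto the training basis, and all function names (`nsnmf`, `project`, `make_toy_counts`, `cross_validate`) are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC


def smoothing_matrix(k, theta):
    """S = (1 - theta) * I + (theta / k) * ones; theta = 0 gives plain NMF."""
    return (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))


def nsnmf(V, k, theta=0.5, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (genes x samples) as V ~ W @ S @ H.

    Uses Frobenius-loss multiplicative updates (an assumption; other
    objectives such as KL divergence are also used for nsNMF).
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = V.shape
    W = rng.random((n_genes, k)) + eps    # basis matrix
    H = rng.random((k, n_samples)) + eps  # coefficient matrix
    S = smoothing_matrix(k, theta)
    for _ in range(n_iter):
        WS = W @ S
        H *= (WS.T @ V) / (WS.T @ WS @ H + eps)
        SH = S @ H
        W *= (V @ SH.T) / (W @ SH @ SH.T + eps)
    return W, S, H


def project(V, W, S, n_iter=200, seed=0, eps=1e-9):
    """Estimate coefficients for new samples while keeping the basis W fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WS = W @ S
    for _ in range(n_iter):
        H *= (WS.T @ V) / (WS.T @ WS @ H + eps)
    return H


def make_toy_counts(n_per_class=30, n_genes=40, n_classes=4, seed=0):
    """Synthetic gene-level mutation counts: each class has its own hot genes."""
    rng = np.random.default_rng(seed)
    block = n_genes // n_classes
    X, y = [], []
    for c in range(n_classes):
        rate = np.full(n_genes, 0.2)
        rate[c * block:(c + 1) * block] = 2.0
        X.append(rng.poisson(rate, size=(n_per_class, n_genes)))
        y += [c] * n_per_class
    return np.vstack(X).astype(float), np.array(y)


def cross_validate(X, y, k=4, theta=0.5, n_splits=5, seed=0):
    """Stratified k-fold CV; returns mean accuracy and its standard error."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs = []
    for train, test in skf.split(X, y):
        # Learn the basis on training samples only, then project test samples.
        W, S, H_train = nsnmf(X[train].T, k, theta=theta, seed=seed)
        H_test = project(X[test].T, W, S, seed=seed)
        clf = SVC(kernel="linear").fit(H_train.T, y[train])
        accs.append(clf.score(H_test.T, y[test]))
    accs = np.array(accs)
    return accs.mean(), accs.std(ddof=1) / np.sqrt(len(accs))


if __name__ == "__main__":
    X, y = make_toy_counts()
    mean_acc, sem = cross_validate(X, y)
    print(f"mean accuracy = {mean_acc:.3f}, SEM = {sem:.3f}")
```

Fitting the factorization inside each training fold and only projecting the held-out samples keeps test data out of the learned basis. Whether the original study factorized per fold or over the full matrix is not stated in this section.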
1. Background and significance
Personalized medicine is becoming increasingly popular in cancer where genetic profiles of tumors can be used to guide clinical decisions such as treatment options and preventive measures. The development of massively parallel, high throughput DNA sequencing technology has enabled the cataloging of somatic mutations in cancer, making genomic data increasingly accessible. Understanding the association between genetics and disease is important for understanding the underlying pathophysiology. In cancer, many molecular and genomic studies have identified somatic mutations within genes associated with cancer initiation, progression, and treatment responses [2–4].
The majority of sequencing studies have focused on the identification of individual driver genes. However, driver mutations are often highly heterogeneous between cancer genomes, even within the same type of cancer. Furthermore, studies have observed cancer to be highly complex, often resulting from multiple interacting mutations and related pathways [7,8]. While many methods attempt to address the complex mutational heterogeneity in cancer, it still remains a challenge due to limited study power and lack of complete knowledge regarding gene and pathway interaction [9–13]. Despite the fact that mutations in many genes have been identified in cancer, it is not yet understood how these genes cumulatively interact in the development and progression of cancer. It has been a challenge to study these mutations and their interactions together due to large-scale complexity.
It is important to consider methods that can encompass the full
Corresponding authors at: Department of Surgery, Feinberg School of Medicine, Northwestern University, Robert H Lurie Medical Research Center, Room 4-113 250 E Superior, Chicago, IL 60611, USA (S.E. Clare). Department of Surgery, Feinberg School of Medicine, Northwestern University, NMH/Prentice Women's Hospital, Room 4-420 250 E Superior, Chicago, IL 60611, USA (S.A. Khan). Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive Room 11-189, Chicago, IL 60611, USA (Y. Luo).