GET THE APP

Gene Expression Data Multi classification with Some Complex Features
..

Journal of Biometrics & Biostatistics

ISSN: 2155-6180

Open Access

Commentary - (2021) Volume 12, Issue 10

Gene Expression Data Multi classification with Some Complex Features

Jessica Landis*
*Correspondence: Jessica Landis, Department of Molecular Biology, Princeton University, Princeton, NJ, USA, Tel: +18526457820, Email:
Department of Molecular Biology, Princeton University, Princeton, NJ, USA

Received: 05-Oct-2021 Published: 26-Oct-2021

Description

Classification problem is one of important topics in statistical analysis. The main purpose of classification is to classify subjects to their classes. A dataset whose response contains more than two classes, called multiclass response, is frequently considered in the past literature, and it is so-called multi classification. In the perspective of linear model, Agresti shows comprehensive discussions in the logistic regression with multiclass response. In the perspective of machine learning theory, there are several methods which have been proposed, including support vector machine (SVM), k-nearest neighbor (KNN), linear discriminant analysis (LDA), and so on. General discussions can be found in Hastie et al. Thanks to the modern technology, we can easily collect the high-dimensional data with complex features incorporated. Therefore, in the era of big data, the highdimensional data is inevitable to encounter and it always attracts our attention. In this paper, we mainly focus on the multi classification with the following features: I. Network structure. II. Ultrahigh-dimension. III. Measurement error. IV. Multi classification with ordinal response In fact, it is expected that the conventional classification methods may not either completely solve problems or ignore those complex features mentioned above in the developments of methods. A motivated example comes from cancer classification with tumor gene expression signatures. The dataset is collected by Ramaswamy et al. Basically, this dataset contains 14 common human cancer classes and 16,063 gene expression values. The sample size in this dataset is 218. The main target of this study is to correctly classify 218 patients to 14 cancer classes by treating gene expression values as predictors. To show the classification, Ramaswamy et al. presented SVM with One vs. All (OVA) approach. However, some important features listed above may not be fully considered. As a result, in the following presentation, we briefly outline key ideas to analyze this dataset. Opinion Network Structure In the gene expression data, it is expected that there exists the (pairwise) dependence structure within genes. In order to detect the pairwise dependence structure, incorporate the network and improve the prediction, we need to implement the technique in graphical model.

For example, Huttenhower developed nearest neighbor networks for gene expression data. Zhu et al. proposed SVM with network Cai et al. discussed network linear discriminate analysis. However, those methods basically assume a common network structure for predictors of all subjects without taking into account of possible heterogeneity for different classes. That is, it is expected the network structures in different classes should be different. In addition, the most existing methods with incorporation of network structure mainly focus on the binary response. Hence, it is important to invest the multi classification with incorporation of heterogeneous network structures. Ultrahigh-dimension in our motivated example, one of the challenges is the ultrahigh-dimension in predictors. It is also well-known problem, i.e., Even though Ramaswamy et al. has proposed valid procedure, there is also a large improvement. Specifically, since not all genes are relevant to the response, then a nature idea is to remove those predictors which are irrelevant to the response before developing methods or analyzing the data. The Sure Independent Screening (SIS) proposed by Fan and Lv is one of powerful methods to achieve this target. Measurement error the other important feature is the measurement error in predictors. Since the gene expression values are usually obtained by medical measurement, so it may be possible to produce the error.

That is, the observed gene expression values may not exactly equal to the true gene expression values of patients. In addition, Carroll et al. pointed out that the wrong estimation or conclusion may be produced if the error effect is ignored in the analysis. Therefore, to produce a precise classification and prediction, it is necessary to carefully analyze the measurement error in predictors. Multiclassification with Ordinal Response Finally, we notice that many existing works mainly focus on the multiclass response which is free of order (i.e., nominal). In the motivated dataset, it may be interesting to consider 14 cancers with ordering. The criterion of ordering can be the rank of severe cancer or proportion among patients (e.g., lung cancer may have higher proportion or may be more severe than breast cancer). To the best of our knowledge, there is no contribution in machine learning theory which considers such setting

 

Google Scholar citation report
Citations: 3254

Journal of Biometrics & Biostatistics received 3254 citations as per Google Scholar report

Journal of Biometrics & Biostatistics peer review process verified at publons

Indexed In

arrow_upward arrow_upward