### Introduction

### Materials and methods

### 2.1 Data

#### 2.1.1 Breast cancer data from van't Veer (NKI_97)

#### 2.1.2 Breast cancer data from van de Vijver (NKI_295)

#### 2.1.3 Breast cancer data from the Wang study (VDX_286)

### 2.2 Wavelet Transform

*t*) and ø(*t*) as follows: where *j*, respectively. The variable *k* is the translation coefficient for the localization of gene expression data. The scales denote the different (low to high) scale bands. The variable symbol

$d_1$, while the low-pass filter associated with the scaling function produces approximation coefficients (scaling coefficients) $c_1$. Subsequently, the approximation coefficients $c_1$ are split into two parts by using the same algorithm and are replaced by $c_2$ and $d_2$, and so on. This decomposition process is repeated until the required level is reached. The coefficient vectors are produced by downsampling and are only half the length of the signal or the coefficient vector at the previous level [12].
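The repeated splitting described above can be illustrated with a minimal sketch of the Haar decomposition. It assumes the orthonormal normalization (pairwise sums and differences divided by √2) and an input whose length is divisible by 2 at every level; the function names and the toy expression vector are hypothetical and not taken from the study.

```python
import math

def haar_step(signal):
    """One Haar DWT level: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("this sketch assumes an even-length input")
    s = math.sqrt(2.0)
    # Low-pass filter plus downsampling: scaled pairwise sums.
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # High-pass filter plus downsampling: scaled pairwise differences.
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_dwt(signal, levels):
    """Split the approximation repeatedly; returns ([c_1..c_L], [d_1..d_L])."""
    approximations, details = [], []
    current = list(signal)
    for _ in range(levels):
        current, d = haar_step(current)
        approximations.append(current)
        details.append(d)
    return approximations, details

# Toy expression vector (hypothetical values), decomposed to level 2.
expression = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approximations, details = haar_dwt(expression, levels=2)
c1, c2 = approximations
d1, d2 = details
```

Each level halves the vector length: here the 8 input values give 4 coefficients in `c1` and `d1`, and 2 in `c2` and `d2`.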

### 2.3 Q-value

*p* value. One main challenge in those studies is to find suitable multiple testing procedures that provide accurate control of the error rates. Whereas the *p* value measures significance in terms of the false positive rate, the *q* value measures statistical significance in terms of the false discovery rate. Like the *p* value, the *q* value gives each feature its own individual measure of significance [17].
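As a rough illustration of a per-feature significance measure based on the false discovery rate, the sketch below computes Benjamini-Hochberg adjusted *p* values. This is a simpler stand-in, not necessarily the procedure of [17]: *q* value methods typically also estimate the proportion of true null hypotheses, and the adjusted values here coincide with *q* values only when that proportion is taken to be 1. The function name and the example *p* values are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
q_like = bh_adjust(p_values)
```

Ranking genes by these adjusted values preserves the ordering of the raw *p* values, although distinct *p* values can receive the same adjusted value.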

### 2.4 Supervised WT

*t* test is taken as the measure to identify differentially expressed genes and a list of *q* values is derived. All the genes are ranked according to their corresponding *q* value and the required number of genes is selected from the list; and (2) in each step the top-ranked genes based on the *q* value are picked out. Then, this reduced set of genes is modeled by the one-dimensional DWT using the Haar mother wavelet and finally, the wavelet approximation coefficients at the first and second levels of decomposition are used in the SVM model, respectively.
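The two steps can be sketched end to end on a toy matrix. Several choices below are assumptions rather than details from the text: genes are ranked by the absolute pooled-variance *t* statistic, which gives the same order as ranking by *p* value (and hence by a monotone FDR adjustment) because every gene has the same degrees of freedom; the selected genes are passed to the Haar transform in ranked order; and label 1 stands for metastasis. All names and data values are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def rank_genes(expr, labels):
    """Gene indices ordered from largest to smallest |t|."""
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, y in zip(expr, labels) if y == 1]
        neg = [row[g] for row, y in zip(expr, labels) if y == 0]
        scores.append(abs(pooled_t(pos, neg)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)

def haar_approx(values):
    """Haar approximation coefficients; an unpaired trailing value is dropped."""
    s = math.sqrt(2.0)
    return [(values[i] + values[i + 1]) / s for i in range(0, len(values) - 1, 2)]

def supervised_haar_features(expr, labels, n_top):
    """Step 1: keep the n_top ranked genes. Step 2: Haar levels 1 and 2 per sample."""
    selected = rank_genes(expr, labels)[:n_top]
    level1 = [haar_approx([row[g] for g in selected]) for row in expr]
    level2 = [haar_approx(c1) for c1 in level1]
    return selected, level1, level2

# Rows are samples, columns are genes (hypothetical values).
expr = [
    [5.0, 1.0, 3.0, 2.0, 7.0, 4.0],
    [5.2, 2.0, 3.5, 2.1, 7.4, 4.5],
    [4.8, 3.0, 2.5, 1.9, 6.6, 3.5],
    [1.0, 1.5, 6.0, 2.0, 5.0, 3.0],
    [1.2, 2.5, 6.5, 2.2, 5.4, 3.5],
    [0.8, 2.0, 5.5, 1.8, 4.6, 2.5],
]
labels = [1, 1, 1, 0, 0, 0]
selected, level1, level2 = supervised_haar_features(expr, labels, n_top=4)
```

The rows of `level1` or `level2` would then serve as the input features of the classifier, one row per sample.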

### 2.5 Supervised PCA

*q* values. We then apply PCA to this subset of genes and, in each step, include the top principal components in an SVM model. The top principal components that together account for at least 75% of the total variance are included in the SVM model.
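The 75% rule amounts to a cumulative sum over the component variances. A minimal sketch, assuming the variances (the eigenvalues of the covariance matrix) have already been computed; the function name and the example eigenvalues are hypothetical:

```python
def components_for_variance(eigenvalues, threshold=0.75):
    """Smallest number of leading components whose variance share reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical component variances summing to 10.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]
n_components = components_for_variance(eigenvalues)
```

With these values the first two components explain 65% of the variance and the first three explain 80%, so three components are retained.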

### 2.6 SVM

*C* is a user-defined penalty parameter on the training error that controls the trade-off between classification errors and the complexity of the model. Solving optimization problem (1) for the parameters w and b on a given training set yields a decision hyperplane over an n-dimensional input space with the maximal margin. Thus, the decision function can be formulated as follows:

*q* value).
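For orientation, the standard textbook soft-margin SVM that matches the roles of *C*, w, and b described in this section can be written as below. This is an assumed reference form and may differ in notation from the equations used in the study; the slack variables $\xi_i$, the class labels $y_i \in \{-1, +1\}$, the number of training samples $N$, and the feature map $\phi$ are symbols introduced here for illustration.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N

f(x) = \operatorname{sign}\left( w^{\top} \phi(x) + b \right)
```

A larger $C$ penalizes training errors more heavily, whereas a smaller $C$ favors a wider margin and a simpler model.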

### 2.7 Cross data set comparison

### Results

*t* test statistics were used to identify discriminative genes in each data set. After selecting the top-ranked genes based on *q* values, one-dimensional WT at the first and second levels was applied to these preselected genes. SVMs with three types of kernels (linear, sigmoid, and radial) were used based on wavelet approximation coefficients at the first and second levels of decomposition. For further assessment of the reported subsets of 70 genes selected by van't Veer et al. [2] (for NKI_97 and NKI_295) and 76 signature genes selected by Wang et al. [16] (for VDX_286), the supervised wavelet method and supervised PCA were applied. The predictive performance of the SVM models was tested by 10 repetitions of 10-fold cross-validation. The results of supervised wavelet and supervised PCA for the three data sets are shown in Tables 1–3, respectively.
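The repeated cross-validation scheme can be sketched as an index generator. This is a minimal, unstratified version under assumed settings: the text does not say whether folds were stratified by class, and the function name, the seed, and the sample count of 50 are hypothetical.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield repeat, f, train_idx, test_idx

# 10 repeats of 10 folds give 100 train/test splits.
splits = list(repeated_kfold(n_samples=50))
```

A classifier would be fitted on `train_idx` and scored on `test_idx` for each split, with the performance averaged over all splits.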

*q* values was better than the 70-gene signature from the van't Veer study (Table 1).

*q* values was better than the 70-gene signature from the van't Veer study.

*t* statistics was better than the 76-gene signature identified in the Wang study.

### Discussion

*t* test statistics. If the WT is performed directly on all of the genes in a data set, there is no guarantee that the resulting wavelet coefficients will be related to metastasis. Thus, this study introduced a supervised form of the WT, referred to here as the supervised wavelet. The supervised wavelet approximation coefficients extracted with the discrete Haar WT had higher predictive performance than the first three principal components. Our results therefore suggest that wavelet coefficients are an efficient way to characterize the features of high-dimensional microarray data. Because the performance of the proposed supervised wavelet method can likely still be improved relative to some other studies, we conclude that this method is worth further investigation as a tool for cancer patient classification based on gene expression data. For example, to achieve optimal classification performance, a suitable combination of classifier and gene selection method needs to be selected specifically for a given data set.

*t* test scores, one could use a different metric to measure the association between a given gene and metastasis occurrence. Similarly, another mother wavelet and a different level of decomposition could be studied. In this study, gene expression data were employed as predictors. However, prediction performance may be improved by adding other covariates such as age, lymph node status, tumor size, and histological grade. Classification performance might also be improved by using other classifiers.