Diagnosis of Gastric Cancer via Classification of the Tongue Images using Deep Convolutional Networks

Gastric cancer is the second most common cancer worldwide and is responsible for a large number of deaths. A major difficulty with this disease is the lack of early and accurate detection. In clinical practice, gastric cancer is diagnosed through numerous tests and imaging procedures, which are costly and time-consuming, so doctors are seeking a cost-effective and time-efficient alternative. One candidate is traditional Chinese medicine, in which diagnosis based on the appearance of the tongue and the color of its various regions is a key component. In this study, we present a method that localizes the tongue surface regardless of the pose of the person in the image. If the face components, especially the mouth, are localized correctly, the components that discriminate best between the classes of the dataset can be used, which is favorable in terms of time and space complexity. Moreover, given a good estimate of these components, the features can be extracted relative to them, which yields the best accuracy achievable in this setting. The features are extracted using deep convolutional neural networks. Finally, the random forest algorithm is used to train the proposed model, which is then evaluated. Experimental results show that the average classification accuracy reaches approximately 73.78%, which demonstrates the superiority of the proposed method compared to other methods.


1-Introduction
Cancer is the second leading cause of death worldwide after cardiovascular diseases [1]. Gastric cancer has a high mortality rate [2]; it is the fourth most prevalent cancer type and the second most fatal [3,4]. The incidence of gastric cancer has decreased, especially in developed countries [5,6]. In Iran, unlike most developed countries, the incidence of gastric cancer is on the rise, particularly in the north and northwest of the country [8].
Traditional Chinese medicine (TCM) is a natural and comprehensive healthcare system dating back 3-5 thousand years [13]. TCM has historically been used for the treatment of various diseases in East Asia and is known as a complementary and alternative medical system in western countries [15]. In TCM, diseases are diagnosed based on the information obtained through observing, hearing, smelling, and touching. Most diagnoses are based on pulse measurement and tongue examination [18]. The tongue is examined primarily because it is located inside the mouth and is therefore not affected by external and environmental factors [19]. Recently, several studies have investigated gastric cancer diagnosis using images of the color and texture of the tongue [14,16,18].
Object, texture, and component recognition are among the most important tasks in image processing. The main challenge is the diversity of natural images, which results from differences in objects and cameras, illumination variations, movement, deformation, and background clutter. In this study, we used an approach based on recent developments in deep learning for the visual recognition of tongue images. We propose a new method based on deep convolutional networks and random forest for fine-grained image classification, which could also be applied in areas other than the recognition of tongue texture images.

2-Literature Review
Image classification has advanced significantly in recent years and, within the past decade, has moved from a pure research subject to a commercial and applied technology. In the present study, we first reviewed the basic classification methods and the progress made on them. Fine-grained classification has recently attracted the attention of researchers as well. For instance, a human recognizes a chair by recognizing its components, such as the legs and back. The ability to recognize objects at the part level is associated with the ability to discriminate between similar objects; this observation inspired our proposed method and other approaches.
Branson et al. assessed object classification using a semi-automatic method in which the user was asked questions about the object in a given image, and the bird species was recognized based on the user's responses. The accuracy of this method was reported to be 19% [20]. Welinder et al. performed automatic classification using color histogram features and the KNN classifier, reporting low accuracy [21]. To increase the accuracy, Moghimi extracted features from the area where the probability of the presence of an object was high. In that study, the area containing the object was selected manually, resulting in an accuracy of 18.9% [22]. Zhang et al. proposed a method that matches object parts and extracts features of a smaller dimension. Initially, the important areas were selected using the selective search algorithm. Following that, the SVM classifier was used to detect the areas with the maximum score, which resulted in an accuracy of 82.8% [24]. In another study, machine learning was applied to determine the components and extract features, and the best results were obtained on the CUB-200-2011 database.
The methods used in the aforementioned studies can be classified into three categories. The first category includes the early methods, with a classification accuracy of 10-30%, in which conventional classification methods are applied to fine-grained classification without remarkable results [17,23,25,26]. The second category includes the methods that address the fine-grained classification problem more directly, with a reported accuracy of 40-60%; the low accuracy of these methods might be due to their use of non-deep features. The third category contains the methods based on deep learning, which reach an accuracy of 80-90% [12,37,41]. The method proposed in our study falls into the third category.

3-The Proposed Method
We propose a method to locate the tongue surface independent of the pose of the person in the image. If the face components (especially the mouth) are located correctly, the components that discriminate best between the classes of the dataset can be used, which is efficient in terms of time and space complexity. Given a good estimate of the components, the features can be extracted relative to them, and the best achievable accuracy can be obtained. The raw images showed the tongue together with the face and head, and measured 2988×5312 pixels. Figure 1 shows an example of a raw (initial) image from the healthy category.
As mentioned earlier, fine-grained image classification in computer vision has been addressed successfully by deep convolutional networks, which come in many variants but share a similar structure. In the present study, we applied the deep network structure of AlexNet, which consists of eight main layers. Figure 2 depicts a schematic view of the selected network. Neural networks are often viewed as sequences of one-dimensional connected layers, whereas in convolutional neural networks each layer is treated as a three-dimensional volume of information.
Training is a major process in deep convolutional networks. Since these networks have more than 120 million parameters, their training is rather difficult due to over-fitting. In addition, the forward pass, which calculates the values of all the network nodes layer by layer from the input, and the backward pass, which calculates the errors used for learning, are time-consuming. Network fine-tuning (retuning) makes it possible to apply deep learning to small databases. It is a form of transfer learning in which the new database is used for further training.
In this method, a network pre-trained on a database with more images (e.g., ILSVRC) provides the initial values for the target network. The target network differs from the source network only in the number of outputs of the probability layer. Therefore, the weights of the probability layer must be estimated again. The layers before the probability layer in the target network, on the other hand, can be initialized with the learned weights of the corresponding layers in the source network. Learning is then carried out using the data of the target database. Our proposed method is divided into a training step and a test step, which are discussed below.
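The weight initialization described above can be illustrated with a minimal, framework-free sketch. It is not the implementation used in the study: the function name `init_target_network`, the dictionary representation of the weights, the layer name `fc8` for the probability layer, and the Gaussian initialization with standard deviation 0.01 are all assumptions made for illustration.

```python
import random


def init_target_network(source_weights, prob_layer, n_in, n_classes, seed=0):
    """Build initial weights for the target network (hypothetical helper).

    Every pre-trained layer is copied, except the probability layer, which is
    re-initialized because its shape depends on the number of classes.
    """
    rng = random.Random(seed)
    target = {}
    for name, weights in source_weights.items():
        if name == prob_layer:
            continue  # cannot be reused: the number of outputs differs
        target[name] = [row[:] for row in weights]  # copy the learned weights
    # Assumed initialization: small random values, n_classes rows of n_in weights.
    target[prob_layer] = [
        [rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_classes)
    ]
    return target


# Toy stand-in for a pre-trained network: layer name -> weight matrix.
source = {
    "conv1": [[0.1, 0.2], [0.3, 0.4]],
    "fc7": [[0.5, 0.6, 0.7]],
    "fc8": [[0.0] * 3 for _ in range(1000)],  # 1000 outputs, as in ILSVRC
}
# Two outputs for the target task: healthy and patient.
target = init_target_network(source, prob_layer="fc8", n_in=3, n_classes=2)
```

After this initialization, all layers would be trained further on the target database; that step depends on the deep learning framework and is omitted here.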

3-1-Training Step
The training step had three main stages, as follows:
1. Generating the training data for the random forest: The training data required by the random forest formed the set $A = \{(x_i, y_i)\}_{i=1}^{N}$, in which $x_i$ is the deep feature vector of the ith sampled pixel, $y_i = +1$ shows that the ith data belongs to a pixel of the component of interest in the images of the database, and $y_i = -1$ shows that the ith data does not belong to the component of interest. In Figure 3, the blue dots are the negative points ($y_i = -1$), and the red ones are the positive points ($y_i = +1$). The training data of set A were generated in several steps. For each training image, the bounding (peripheral) rectangle of each component was obtained, and its region mask (zoning) was also prepared. The deep features of each pixel were calculated as well. An arbitrary number of pixels (20 pixels) were generated inside the bounding rectangle using random or uniform sampling. Each of these pixels that lay inside the tongue region was added to set A, and its deep features were considered as positive data. Moreover, an arbitrary number of pixels (100 pixels) were generated over the entire image using random or uniform sampling. Each of these pixels that lay outside the bounding rectangle was added to set A, together with its deep features, as negative data. After completing the training dataset A, the model was trained using the random forest. Because only one forward pass through the neural network and one evaluation of the random forest (10 decision trees with a maximum depth of 10) are required, the likelihood of membership in the component can be estimated rapidly for all the pixels.
2. Fine-tuning the deep convolutional neural network (DCNN): Three DCNNs were used for piecewise feature extraction and fine-tuned for the entire image, the mouth image, and the tongue image. The features of the layers of these networks were used so that one feature vector could be generated for each pixel. To calculate the deep pixel features of an image or image section, the image was fed as input to the network, and the forward pass was calculated once.
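The pixel sampling used to build set A in the first training stage can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function `sample_training_pixels`, the inclusive rectangle format `(x0, y0, x1, y1)`, the elliptical toy mask, and the use of uniform random sampling are illustrative choices, and the deep feature attached to each pixel is omitted.

```python
import random


def sample_training_pixels(width, height, box, in_region,
                           n_inside=20, n_global=100, seed=0):
    """Return (positives, negatives) pixel coordinates for one training image.

    box       -- bounding rectangle (x0, y0, x1, y1), inclusive on both ends
    in_region -- function (x, y) -> bool, True if the pixel lies in the
                 component region (e.g. the tongue mask)
    """
    rng = random.Random(seed)
    x0, y0, x1, y1 = box
    positives, negatives = [], []
    # Pixels drawn inside the rectangle: keep those inside the component region.
    for _ in range(n_inside):
        x, y = rng.randint(x0, x1), rng.randint(y0, y1)
        if in_region(x, y):
            positives.append((x, y))
    # Pixels drawn over the whole image: keep those outside the rectangle.
    for _ in range(n_global):
        x, y = rng.randrange(width), rng.randrange(height)
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            negatives.append((x, y))
    return positives, negatives


# Toy example: an elliptical "tongue" region inside its bounding rectangle.
box = (80, 120, 150, 200)


def in_tongue(x, y):
    return ((x - 115) / 35) ** 2 + ((y - 160) / 40) ** 2 <= 1.0


pos, neg = sample_training_pixels(227, 227, box, in_tongue)
```

In the full procedure, each sampled coordinate would be paired with its deep pixel feature and its label before the random forest is trained.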
Following the forward pass, the feature values of the input image were available in all the layers; these are referred to as the feature channels (Figure 4). As is shown in Figure 4, the feature channels had a specific size, which was changed by up-sampling. Notably, one of the limitations of this method was the fixed input size of the network: to use the employed network, the input image must be 227×227 pixels. After changing the size of the feature channels to the input image size, all the feature channels were stacked so that each pixel was described by a column of feature values with a length of 1,376. Figure 4 summarizes the procedure: an arbitrary image was fed to the input; all the feature channels of the middle layers (the data values in the middle layers) were calculated by the forward pass; the size of the channels was changed to the size of the input (up-sampling); and 1,376 different channels were obtained, since the total depth of layers conv1 to conv5 in the AlexNet network is 1,376, each with the same dimension as the input image.
To fine-tune the networks for the mouth and tongue areas, the corresponding random forest was used to estimate the bounding rectangle in the training data. At the next stage, this image section was cropped, and fine-tuning was performed on it. To extract the piecewise features, the following steps were taken:
- We selected the image section from which the features had to be extracted.
- The selected image section was resized to the dimension of the input data (227×227), and the resized image was fed as input to the network.
- By calculating the forward pass, the values of the layers were calculated for the network input.
- The values of the data in the fc7 layer (4,096-dimensional) were returned as a feature.
The proposed method could thus generate a 4,096-dimensional feature vector for arbitrary segments of one or several images. The feature vector could be fed as input to classifiers such as the SVM.
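The construction of the 1,376-dimensional pixel feature can be illustrated with a small sketch. The channel counts per layer are those of the standard AlexNet architecture; the nearest-neighbor up-sampling, the nested-list representation of a channel, and the helper names `upsample_nearest` and `pixel_feature` are assumptions made for illustration, since the text does not specify the up-sampling method.

```python
# Number of feature channels in the five convolutional layers of AlexNet.
CONV_CHANNELS = {"conv1": 96, "conv2": 256, "conv3": 384, "conv4": 384, "conv5": 256}


def upsample_nearest(channel, out_h, out_w):
    """Resize one feature channel (list of rows) by nearest-neighbor lookup."""
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def pixel_feature(channels, row, col, out_h, out_w):
    """Collect the up-sampled value of every channel at one pixel.

    Equivalent to up-sampling every channel to (out_h, out_w) and reading the
    column of values at (row, col), without building the full-size channels.
    """
    feature = []
    for channel in channels:
        in_h, in_w = len(channel), len(channel[0])
        feature.append(channel[row * in_h // out_h][col * in_w // out_w])
    return feature


total_depth = sum(CONV_CHANNELS.values())  # length of each pixel feature

# Toy example with two tiny channels of different sizes, up-sampled to 4x4.
toy_channels = [[[1, 2], [3, 4]], [[5]]]
upsampled = upsample_nearest(toy_channels[0], 4, 4)
feature = pixel_feature(toy_channels, row=3, col=0, out_h=4, out_w=4)
```

With real network activations, `channels` would hold all 1,376 channels of conv1 to conv5 and the output size would be 227×227.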

3-2-Test Step
In the test step, the components were located and the features were extracted for each test image. Following that, the classifier was used to estimate the class of the test image. The bounding rectangles of the tongue, the mouth area, and the face were estimated for each test image using the random forest algorithm. The fine-tuned neural networks were then used to extract the deep features of the three pieces, as in the training step. The final feature vector was obtained by combining these feature vectors, and it was classified as healthy or unhealthy using the classifier.
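Two parts of the test step can be sketched in a few lines. The text does not state how the rectangle is derived from the per-pixel likelihoods of the random forest, so the thresholding rule in `bounding_box` (tightest rectangle around pixels with likelihood above 0.5) is purely an assumption; `combine_features` assumes that the three feature vectors are combined by simple concatenation. Both helper names are hypothetical.

```python
def bounding_box(prob_map, threshold=0.5):
    """Tightest rectangle (x0, y0, x1, y1) around the pixels whose likelihood
    exceeds the threshold; returns None if no pixel qualifies."""
    hits = [
        (x, y)
        for y, row in enumerate(prob_map)
        for x, p in enumerate(row)
        if p > threshold
    ]
    if not hits:
        return None
    xs, ys = zip(*hits)
    return min(xs), min(ys), max(xs), max(ys)


def combine_features(face_vec, mouth_vec, tongue_vec):
    """Concatenate the three piecewise feature vectors into one vector."""
    return list(face_vec) + list(mouth_vec) + list(tongue_vec)


# Toy likelihood map (rows of per-pixel membership likelihoods).
prob_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.6, 0.1],
]
box = bounding_box(prob_map)

# Three 4,096-dimensional placeholder vectors standing in for fc7 features.
final_vector = combine_features([0.0] * 4096, [0.0] * 4096, [0.0] * 4096)
```

Under the concatenation assumption, the final vector passed to the classifier has 3 × 4,096 = 12,288 entries.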

4-Results
The empirical results of the proposed method are discussed in this section. For the experiments, we used the database of tongue images, which contained 700 images of the patients and 800 images of the healthy subjects. For the training, 500 healthy images and 470 patient images (total: 970 images) were used. In addition, 300 healthy images and 330 patient images (total: 630 images) were used for the test. We first introduce the evaluation metric for classification on this database and then assess the proposed method based on it. Notably, the choices for evaluating general and fine-grained image classification are limited. Unlike computer vision tasks such as object detection, pose estimation, or segmentation, classification has only two outcomes: if the estimated class is equal to the real class, the classification is correct; otherwise, the classification is incorrect.
For this reason, the classification accuracy metric is described in detail.
We assumed that the test data were in the form of ordered pairs $\{(x_i, y_i)\}_{i=1}^{N}$, in which $x_i$ is the image, $y_i$ is its label, and $N$ is the number of the test images. The following formula was used to calculate the mean accuracy of the classifier function $f(x)$:
$$mA = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{|I(c)|} \sum_{i \in I(c)} \mathbb{I}\big(f(x_i) = c\big)$$
In the formula above, $C$ is the number of the classes, and $I(c)$ is the set of the indices of the test data belonging to class $c$, which is calculated as:
$$I(c) = \{\, i : y_i = c \,\}$$
Notably, $|I(c)|$ represents the number of the test samples belonging to class $c$, and $\mathbb{I}(\cdot)$ is the mathematical indicator function. The value of mA is the mean classification accuracy across the classes.
The proposed deep non-parametric transfer (DNPT) method was evaluated with various features in the neighbor detection stage (Table 1). The deep feature space used to detect the neighbors is given in parentheses; for example, DNPT(conv3) indicates that the conv3 features were used to detect the neighbors. DNPT(oracle) represents the tests in which the true locations of the main components were used instead of estimated locations. Its mean accuracy is the one that could be obtained if a method for the accurate estimation of the component locations were available, and it therefore provides an upper bound on the mean accuracy of the transfer method.
Two settings were considered in the comparison of the proposed method with other fine-grained classification results. The first is DeepRF, in which the bounding box of the face is assumed to be available for the test images, so that the proposed method only has to estimate the bounding rectangles of the individual's mouth and tongue. The second is DeepRF(All), which requires the estimation of the bounding rectangle of the face in addition to those of the mouth and tongue.
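The mean accuracy mA over classes can be computed with a short function. This is a minimal sketch: the function name `mean_class_accuracy`, the string labels, and the toy data are assumptions for illustration and do not come from the experiments.

```python
from collections import defaultdict


def mean_class_accuracy(labels, predictions):
    """Average of the per-class accuracies (mA).

    labels      -- true class of each test sample
    predictions -- class estimated by the classifier for each test sample
    """
    total = defaultdict(int)    # number of test samples per class
    correct = defaultdict(int)  # number of correctly classified samples per class
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(p == y)  # indicator function
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example: 4 healthy and 2 patient samples.
labels = ["healthy"] * 4 + ["patient"] * 2
predictions = ["healthy", "healthy", "healthy", "patient", "patient", "healthy"]
m_a = mean_class_accuracy(labels, predictions)
```

In this toy example the per-class accuracies are 3/4 and 1/2, so mA is 0.625, whereas the plain accuracy over all six samples would be 4/6. Averaging per class keeps the larger class from dominating the score.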
Since the random forest employed in this method is inherently random, the tests were performed in triplicate, and the mean accuracy was reported. According to the information in Table 2, the proposed method (DeepRF) achieved a mean accuracy of 73.78%, which is comparable to the state-of-the-art method with a mean accuracy of 76.37%. Furthermore, the proposed DeepRF(All) method achieved a mean accuracy of 72.02%, while the state-of-the-art method had a mean accuracy of 73.89% when the bounding rectangle was unavailable.

5-Conclusion
In this research, we studied the fine-grained classification problem and its importance, along with the methods available for the early detection of stomach cancer. To this end, deep convolutional neural networks were used for finding the neighbors and extracting features. Using these networks improved, to a certain degree, the average classification accuracy on the dataset of tongue images. In addition, to address the limited generalization of the deep convolutional network, we used a random forest with deep pixel features. This approach is fast and simple compared to other methods and yields acceptable results. The results show an average accuracy of 73.78% for the proposed method, which is comparable to the state-of-the-art average accuracy of 76.37%. The proposed method called DeepRF(All) reaches an average accuracy of 72.02%, while the state-of-the-art method reaches an average accuracy of 73.89% when bounding boxes are not available during testing.