CHAPTER 2: LITERATURE REVIEW
2.5 Dataset Reduction
2.5.2 Dimensionality Reduction
81 Cervantes, Li & Yu, 2008). These samples are usually far from the classes’ centers and near to classes’ boundaries. It is usual for a recognition system to classify support vector samples wrongly. However, one of the main criteria for evaluating system efficiency, and measuring the power of a PR system, is the correct recognition of these SVs and outlier samples. Also, support vector samples are necessary to adjust system parameters in training phase. Hence, it is not a good strategy to delete all these samples from the initial dataset, in order to achieve dataset size reduction.
The second group of algorithms removes the samples near the centers of the classes, from the initial training dataset, to create a short final dataset version (Javed, Ayyaz
& Mehmoud, 2007; Shayegan & Aghabozorgi, 2014b). However, these samples include highly valuable information about a specific class that is needed for making a system model in the training phase of a PR system.
Keeping both groups of samples, i.e. the samples near to classes’ centers and the outlier samples in final reduced dataset is important goal. This is another objective of this research.
82 complexity of the recognition process (Shayegan & Aghabozorgi, 2014a). Therefore, following the feature extraction process, another important process, i.e. Feature Selection (FS), is involved.
Different applications such as face recognition, license plate recognition, and image compression have used PCA technique for dimensionality reduction purpose. Also, different efforts using PCA have been made to recognize printed and handwritten characters in several languages.
Hyun-chul, Daijin and Sung Yang (2002) tried to recognize handwritten numeral from UCI dataset. They first modeled each digit class to several components, and then used PCA technique to move extracted components into a de-correlated features space. They also used the membership value method in the classification stage.
Gesualdi and Seixas (2002) employed PCA in a license plate recognition system. They used PCA for data compression in the feature extraction part, and neural networks in the recognition stage to recognize printed digits and letters that appear as strings in license plate images. They reduced the number of features from 30 to 4. They reported that the achieved accuracy for digits recognition was acceptable, but the accuracy for letters recognition was degraded significantly, when they applied PCA on data.
In 2004, Deepu, Sriganesh and Ramakrishnan employed PCA for online handwritten Tamil character recognition. The pre-processing step was carried out on a sequence of points from the digitizer and some features were extracted from each pattern. The PCA technique was then applied to reduce the dimensionality of each class. The novelty of this
83 work lies in that it is language independent and the proposed method can be used for other scripts.
In 2005, Mozaffari, Faez and Ziaratban used fractal code as a features vector and k-NN classifier for handwritten Farsi zip code recognition. They applied the pre-processing operations to their dataset, and in the first step, achieved up to 86.3% accuracy. Using PCA technique, they reduced the number of fractal features to 240 and achieved 90.6% accuracy, at the end. The main disadvantage of fractal features in OCR application is computational time complexity.
To recognize handwritten Arabic isolated letters, Abandah, Younis and Khedher (2008) extracted 95 features such as image area, image width and height, image center of mass, the numbers and locations of dots in image, and so on from main body, secondary components, skeleton and boundary of each character. After that, PCA was used for feature reduction and then only the first 40 features were selected from the PCA process result. Finally, five different classifiers were employed and 87% accuracy was achieved on average in the best case.
Ebrahimi and Kabir (2008) extracted a set of 256 features in a holistic word recognition system to create a pictorial dictionary of printed Farsi words. They used the PCA technique and succeeded in reducing the number of loci features from 223 to 27.
Zhang, Suen, and Bui (2004) introduced a multi-modal approach for reducing the features dimensions in an OCR system and tested their approach on handwritten English digits dataset MNIST. They employed PCA for feature compression and succeeded in reducing the CPU time for classification.
84 Ziaratban, Faez and Allahveiradi (2008) proposed a novel statistical description for structure of isolated Farsi handwritten letters. They thinned the character body, decomposed a skeleton into its primitives, a curved line between any two successive feature points, and then extracted a set of features points including terminal, two-way branch and three-way branches points from primitives. Since the number of primitives varies from one character to another, they used the PCA technique to reduce and equalize the length of the features vectors. A NN-MLP with Euclidian distance was employed as classifier. To determine the exact class, they applied a post-processing stage, and achieved to 93.15%
accuracy on a dataset with 11,471 samples for train and 7,647 samples for test.
Random Projection (RP) is another features selection method from group of statistical methods. RP technique is a powerful dimension reduction technique that uses random projection matrices to map the data from a huge dimensional space to a lower space. To achieve this aim, a mapping matrix R is used where the columns of matrix R are realizations of independent zero-mean normal variables, scaled to have unit length. A brief description of RP technique has been introduced in appendix V.
Shayegan, Aghabozorgi and Ram (2014) used one-dimensional and two-dimensional standard deviation and minimum to maximum spectrum diagrams to find a small subset features out of an initial features set in an OCR application. Their proposed method succeeded to decrease the dimension of features vectors to 43.6% (for Farsi digits), and to 59.4% (for English digits), meanwhile the systems accuracies were improved from 90.41%
to 95.12% (for Farsi digit recognition) and from 91.93% to 94.88% (for English digits recognition). They also showed the superiority of their proposed technique compared to
85 rival techniques PCA and RP. Table 2.13 shows related researches regarding features selection subject in OCR domain.
Table 2.13 : Summarization of researches in OCR domain, based on features selection operation
After Feature Selection No. of
Accuracy No. of Features
Azmi et al., 2010 G.A. 81 77% 55 80%
G.A. 48 75% 30 94% FOCR-Digits
Gesualdi & Seixas, 2002
PCA 30 --- 4 96%
Deepu et al., 2004 PCA 20 89.4% 13 89.6% Tamil
characters recognition Mozaffari et al.,
PCA 728-1752 86.3% 240 90.6% Farsi zip code recognition Abandah et al.,
PCA 95 84% 40 87% Handwritten
Arabic characters Ebrahimi & Kabir,
PCA 256 --- 27 99.01% Printed Farsi words
Zhang et al., 2004 PCA 132 99.3% 10 98.4% MNIST dataset
Bahmani et al., 2010
GA 400 86% 200 87.6% Holistic FOCR
For features selection operation, finding the correct sequence of deleting the features one by one is very important. It means that a system’s derived efficiency after deleting features A, B, and C is not the same as the same system’s derived efficiency after deleting the features in order A, C, and B or B, C, and A and so on
86 (El-Glaly & Quek, 2011). Due to their nature, some features are relevant to others from different view of points. In this case, SBS techniques do not help to find the best subset of features. For example, El-Glaly and Quek (2011) extracted four feature sets S1, S2, S3 and S4 (with some common features) to use in an Arabic OCR system. They trained the system with these four feature sets, separately. After that, these sets were delivered to a PCA algorithm, and PCA rearranged the features based on their importance in the recognition system. The results showed that feature X in rank 23 in set S3 took rank 7 in set S1 and so on. This experiment indicates that if feature X is deleted for the sake of feature reduction, it may cause a large error in final results.
The main problem concerning GA methods is that they always select chromosomes one by one with the best recognition percentage, and move this chromosome (feature) to the next stage. However, it is possible that when a good characteristic feature gets combined to another feature, the overall performance will not be as good as the individual performances. Also this technique is very time consuming.
Based on the FOCR literature, the PCA technique has produced better results in comparison with the other two categories of feature selection techniques. However, although PCA has been widely utilized in different PR applications, it suffers from high computational cost. Computation of PCA requires eigenvalue decomposition of the covariance matrix of the features vectors with around O(d3 + d2n) computations, where n is the number of samples and d is the dimensionality of the features space (Sharma & Palimal, 2008). Hence, this powerful technique is usually employed for features extraction/selection operation on small scale datasets. Also, choosing the
87 optimum number of features, based on Eigen vectors, from the new re-ordered features space, created by PCA, is not an easy task.