**CHAPTER 4: THE PROPOSED MODEL FOR DATA REDUCTION IN**

**4.6 Dimensionality Reduction**

**4.6.1 The New Proposed Two-Stage Spectrums Analysis (2S_SA) Method**

**4.6.1.1 Stage 1 : Using One-Dimensional Spectrum Analysis Tool**

In the 1D_SD spectrum, the mean and SD values of a specific feature are first computed over all samples of each class. Then, a spectrum line corresponding to that feature is drawn from mean-SD to mean+SD for each class. For example, in the digit recognition case there are 10 classes, corresponding to digits ‘0’ to ‘9’. To create the 1D_SD spectrum diagrams, the mean, standard deviation (SD), mean-SD, and mean+SD are computed. Table 4.4 shows these values for the ‘Normalized Vertical Transition’ feature of English digits (MNIST dataset). For simplicity, the values were rounded to the nearest integer.

Table 4.4 : Mean, SD, mean-SD, and mean+SD of ‘Normalized Vertical Transition’ feature for English digits from MNIST dataset

| Class (Digit) | mean | Standard Deviation (SD) | mean-SD | mean+SD |
|---|---|---|---|---|
| **0** | 111 | 25 | 86 | 136 |
| **1** | 32 | 25 | 7 | 57 |
| **2** | 133 | 32 | 101 | 165 |
| **3** | 127 | 24 | 103 | 151 |
| **4** | 94 | 24 | 70 | 118 |
| **5** | 116 | 28 | 88 | 144 |
| **6** | 97 | 18 | 79 | 115 |
| **7** | 95 | 21 | 74 | 116 |
| **8** | 110 | 19 | 91 | 129 |
| **9** | 96 | 20 | 76 | 116 |
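The per-class statistics behind a 1D_SD spectrum can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this study: the function name `sd_spectrum` and the toy feature values are hypothetical, and the population standard deviation is assumed because the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def sd_spectrum(values_by_class):
    """Return {class: (mean, sd, mean - sd, mean + sd)} for one feature."""
    spectrum = {}
    for label, values in values_by_class.items():
        m = mean(values)
        # Assumption: population SD; swap in statistics.stdev for the sample SD.
        s = pstdev(values)
        spectrum[label] = (m, s, m - s, m + s)
    return spectrum


# Hypothetical values of a single feature, grouped by class label.
toy_values = {"0": [100, 110, 120], "1": [30, 32, 34]}
spec = sd_spectrum(toy_values)

for label, (m, s, lo, hi) in spec.items():
    print(f"class {label}: mean={m:.1f} SD={s:.1f} line=[{lo:.1f}, {hi:.1f}]")
```

Each returned tuple corresponds to one row of a table such as Table 4.4, and the last two entries are the end points of the spectrum line for that class.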

The length of the 1D_SD spectrum line of a class is twice its SD for that feature. A shorter spectrum line means that the samples of that class are more similar to each other, with respect to that specific feature, than the samples of a class with a longer spectrum line. Two spectrum lines overlap if the mean+SD of line ‘1’ is greater than the mean-SD of line ‘2’ and the mean+SD of line ‘2’ is greater than the mean-SD of line ‘1’.
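The overlap condition stated above can be written as a short predicate. In this sketch each spectrum line is a (mean-SD, mean+SD) pair taken from Table 4.4. The helper `overlap_length` is an assumption: the text does not define how the amount of overlap is measured, so the length of the shared interval is used here for illustration.

```python
def lines_overlap(line_a, line_b):
    """True if two spectrum lines (low, high) overlap, as defined in the text."""
    low_a, high_a = line_a
    low_b, high_b = line_b
    return high_a > low_b and high_b > low_a


def overlap_length(line_a, line_b):
    """Assumed measure: length of the interval shared by both lines (0 if none)."""
    shared_low = max(line_a[0], line_b[0])
    shared_high = min(line_a[1], line_b[1])
    return max(0, shared_high - shared_low)


# (mean - SD, mean + SD) for digits '0', '1' and '2', from Table 4.4.
digit_0 = (86, 136)
digit_1 = (7, 57)
digit_2 = (101, 165)

print(lines_overlap(digit_0, digit_1))   # lines of '0' and '1' do not overlap
print(overlap_length(digit_0, digit_2))  # shared interval is [101, 136]
```

The line for digit ‘0’ has length 136 - 86 = 50, which is twice its SD of 25.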

Table 4.5 shows the convolution overlapping matrix of the ‘Normalized Vertical Transition’ feature for all pairs of classes of English digits (MNIST dataset). Each cell of this table indicates the amount of overlap between the spectrum lines of two classes with respect to the ‘Normalized Vertical Transition’ feature. A smaller value in a cell is better than a larger one, because it means those classes overlap less. The value ‘0’ indicates that there is no overlap between those classes and, as a result, the related feature can separate those classes from each other completely.

Table 4.5 : Convolution overlapping matrix of ‘Normalized Vertical Transition’ feature for all pair classes of English digits (MNIST dataset)

| | **0** | **1** | **2** | **3** | **4** | **5** | **6** | **7** | **8** | **9** |
|---|---|---|---|---|---|---|---|---|---|---|
| **0** | --- | 0 | 35 | 33 | 31 | 48 | 29 | 30 | 39 | 30 |
| **1** | 0 | --- | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **2** | 35 | 0 | --- | 48 | 17 | 43 | 14 | 15 | 28 | 15 |
| **3** | 33 | 0 | 48 | --- | 15 | 41 | 12 | 13 | 26 | 13 |
| **4** | 31 | 0 | 17 | 15 | --- | 30 | 36 | 42 | 27 | 40 |
| **5** | 48 | 0 | 43 | 41 | 30 | --- | 27 | 28 | 38 | 28 |
| **6** | 29 | 0 | 14 | 12 | 36 | 27 | --- | 36 | 24 | 36 |
| **7** | 30 | 0 | 15 | 13 | 42 | 28 | 36 | --- | 25 | 40 |
| **8** | 39 | 0 | 28 | 26 | 27 | 38 | 24 | 25 | --- | 25 |
| **9** | 30 | 0 | 15 | 13 | 40 | 28 | 36 | 40 | 25 | --- |
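A pairwise overlap matrix like Table 4.5 can be approximated from the rounded end points in Table 4.4, as in the sketch below. It assumes that the overlap value is the length of the interval shared by two spectrum lines, which is not stated explicitly in the text. Because Table 4.4 is rounded, a cell computed this way can differ slightly from the printed value. All function names are hypothetical.

```python
# (mean - SD, mean + SD) per digit, copied from Table 4.4.
SPECTRUM = {
    0: (86, 136), 1: (7, 57), 2: (101, 165), 3: (103, 151), 4: (70, 118),
    5: (88, 144), 6: (79, 115), 7: (74, 116), 8: (91, 129), 9: (76, 116),
}


def overlap_length(line_a, line_b):
    """Assumed measure: length of the interval shared by both lines."""
    return max(0, min(line_a[1], line_b[1]) - max(line_a[0], line_b[0]))


def overlap_matrix(spectrum):
    """Map each class to its overlap with every other class."""
    labels = sorted(spectrum)
    return {
        i: {j: overlap_length(spectrum[i], spectrum[j]) for j in labels if j != i}
        for i in labels
    }


def fully_separated(matrix):
    """Classes whose spectrum line overlaps with no other class."""
    return [i for i, row in matrix.items() if all(v == 0 for v in row.values())]


matrix = overlap_matrix(SPECTRUM)
separated = fully_separated(matrix)
print(separated)
```

A class returned by `fully_separated` corresponds to a row of zeros in the matrix, which is the situation of digit ‘1’ in Table 4.5.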

For the digit recognition case, in order to find the final reduced features vector, a 1D_SD diagram with 10 spectrum lines, corresponding to digits ‘0’ to ‘9’, is plotted for each available feature in the initial features vector. Figure 4.23.a and Figure 4.23.b show the 1D_SD distribution diagrams corresponding to the ‘X Coordinate Centre of Mass’ and ‘Normalized Vertical Transition’ features for English digits, respectively. In Figure 4.23.a, the majority of spectrum lines are in an overlapping range [20 , 25], meaning that the ‘X Coordinate Centre of Mass’ feature, alone, cannot discriminate the existing classes from each other in the features space. In Figure 4.23.b, the spectrum line corresponding to class (digit) ‘1’ is completely separated from the other spectrum lines, indicating that the ‘Normalized Vertical Transition’ feature can completely discriminate digit (class) ‘1’ from the other English digits (other classes). Therefore, this feature can be considered as a candidate feature in the final features vector.

(a) ‘X Coordinate Centre of Mass’ feature (b) ‘Normalized Vertical Transition’ feature

Figure 4.23 : 1D_SD spectrums diagram for the English digits set

Similar to Figure 4.23, Figures 4.24.a and 4.24.b show the 1D_SD distribution diagrams corresponding to the ‘Maximum Vertical Crossing Count’ and ‘Aspect Ratio’ features for the Farsi digits, respectively. In Figure 4.24.a, the majority of the spectrum lines are in an overlapping range [3.5 , 7], meaning that the ‘Maximum Vertical Crossing Count’ feature, alone, cannot discriminate the existing classes from each other in the features space. In Figure 4.24.b, the spectrum line corresponding to class (digit) ‘1’ is completely separated from the other spectrum lines, indicating that the ‘Aspect Ratio’ feature can completely discriminate class (digit) ‘1’ from the other Farsi digits (other classes). Therefore, it can be considered as a candidate feature in the final features vector.

(a) ‘Maximum Vertical Crossing Count’ feature (b) ‘Aspect Ratio’ feature

Figure 4.24 : 1D_SD spectrums diagram for the Farsi digits set

A shorter spectrum line for a specific feature indicates that the samples of a particular class are more similar (less diverse) with respect to that feature. In addition, a distribution diagram in which the class centers (the locations of the class means) are farther apart is better than one with closer class centers, because a classifier can then separate the existing clusters better.

**b) One-Dimensional Minimum to Maximum (1D_MM) Spectrum**

Finding a set of separated spectrum lines by using 1D_SD distribution diagrams is not enough to create an optimum features vector, because the outlier samples of each class fall outside the range of the 1D_SD spectrum lines. They do, however, fall within the 1D_MM range. In the 1D_MM plot, the minimum and maximum values of a specific feature are first computed over all samples of each class. Then, a spectrum line corresponding to that feature is drawn from the minimum to the maximum value of that feature for each class. The other conditions are the same as for the 1D_SD plots.
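The difference between the two spectrum lines can be illustrated with a small sketch. The sample values below are hypothetical, the helper names are not from the original work, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def sd_line(values):
    """1D_SD spectrum line: (mean - SD, mean + SD)."""
    m, s = mean(values), pstdev(values)  # assumption: population SD
    return (m - s, m + s)


def mm_line(values):
    """1D_MM spectrum line: (minimum, maximum)."""
    return (min(values), max(values))


def outside(values, line):
    """Samples that fall outside a spectrum line."""
    low, high = line
    return [v for v in values if v < low or v > high]


# Hypothetical feature values of one class, with a single outlier (90).
samples = [30, 31, 32, 33, 34, 90]

print(outside(samples, sd_line(samples)))  # the outlier is not covered by 1D_SD
print(outside(samples, mm_line(samples)))  # 1D_MM covers every sample
```

By construction, no sample can lie outside its own 1D_MM line, which is why the 1D_MM diagram exposes overlaps that the 1D_SD diagram hides.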

Figure 4.25.b displays the 1D_MM spectrum diagram of the ‘Normalized Vertical Transition’ feature for the English digits set, next to the corresponding 1D_SD diagram (Figure 4.25.a). In Figure 4.25.b, some samples of class ‘1’ overlap with some samples of all the other classes. This means that, in the recognition phase, these samples may be misclassified into other classes and vice versa, if only the ‘Normalized Vertical Transition’ feature is employed.

(a) 1D_SD Spectrums diagram for ‘Normalized Vertical Transition’ feature corresponding to English digits (b) 1D_MM Spectrums diagram for ‘Normalized Vertical Transition’ feature corresponding to English digits

Figure 4.25 : Comparing 1D_SD and 1D_MM spectrum distribution diagrams

Figure 4.26.b displays the 1D_MM spectrum lines of the ‘Aspect Ratio’ feature for the Farsi digits set, next to the corresponding 1D_SD diagram (Figure 4.26.a). In Figure 4.26.b, some samples of class ‘1’ overlap with some samples in classes ‘2’ or ‘9’. In other words, in the recognition phase, some samples of class ‘1’ may be misclassified into classes ‘2’ or ‘9’ and vice versa, if only the ‘Aspect Ratio’ feature is utilized.

(a) 1D_SD Spectrums diagram for ‘Aspect Ratio’ feature corresponding to Farsi digits (b) 1D_MM Spectrums diagram for ‘Aspect Ratio’ feature corresponding to Farsi digits

Figure 4.26 : Comparing 1D_SD and 1D_MM spectrum distribution diagrams

In the proposed dimensionality reduction method 2S_SA, 1D_MM is used to find the maximum allowable overlapping threshold T1, to create the first reduced features vector S1 from the initial features set Initial_S. By investigating the overlapping values of the spectrum lines in the 1D_MM diagram for each feature in Initial_S, the value of threshold T1 is selected. In this study, the T1 threshold was selected as 30%, experimentally.
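The thresholding step can be sketched as a simple filter. The sketch rests on several assumptions, since this section does not give the details: each feature is taken to have a single overlap percentage derived from its 1D_MM diagram, a feature is kept when that percentage does not exceed T1, and the feature names and percentages below are hypothetical placeholders.

```python
T1 = 30.0  # percent; the experimentally chosen value reported in the text


def first_reduced_vector(overlap_pct_by_feature, threshold=T1):
    """Keep the features whose 1D_MM overlap does not exceed the threshold.

    Assumption: one overlap percentage per feature is already available.
    How it is aggregated over class pairs is not specified in this section.
    """
    return [
        name
        for name, pct in overlap_pct_by_feature.items()
        if pct <= threshold
    ]


# Hypothetical overlap percentages for three features of Initial_S.
initial_s = {"feature_a": 12.5, "feature_b": 30.0, "feature_c": 64.0}
s1 = first_reduced_vector(initial_s)
print(s1)
```

Whether a feature exactly at the threshold is kept is a design choice. Here it is kept, on the reading that T1 is the maximum allowable overlap.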

**4.6.1.2 Stage 2 : Using Two-Dimensional Spectrum Analysis Tool**