**CHAPTER 4: THE PROPOSED MODEL FOR DATA REDUCTION IN**

**4.6 Dimensionality Reduction**

**4.6.1 The New Proposed Two-Stage Spectrums Analysis (2S_SA) Method**

**4.6.1.2 Stage 2 : Using Two-Dimensional Spectrum Analysis Tool**

143 (a) (b)

(a) 1D_SD Spectrums diagram for ‘Aspect Ratio’ feature corresponding to Farsi digits (b) 1D_MM Spectrums diagram for ‘Aspect ratio’ feature corresponding to Farsi digits

Figure 4.26 : Comparing 1D_SD and 1D_MM spectrum distribution diagrams

In the proposed dimensionality reduction method 2S_SA, 1D_MM is used to find the
maximum allowable overlapping threshold T* 1*, to create the first reduced features vector S1
from the initial features set

*spectrum lines in the 1D_MM diagram for each feature in Initial_S, the value of threshold*

**Initial_S. By investigating the overlapping values of the**

**T***is selected. In this study, the T*

_{1}*threshold was selected 30%, experimentally.*

_{1}**4.6.1.2 Stage 2 : Using Two-Dimensional Spectrum Analysis Tool **

144 for each couple of features. In 2D_SD, the main ellipse diagonals (or the length and the width in the rectangular case) are plotted from the mean-SD to the mean+SD for two features. In 2D_MM, the main ellipse diagonals (or the length and the width in the rectangular case) are plotted from the minimum value to the maximum value for those two features. As such, the [n×(n-1)/2] 2D_SD (or 2D_MM) distribution diagram can be generated for n independent features.

Figure 4.27 shows a 2D_SD distribution diagram for two features; namely, ‘X Coordinate
* Centre of Mass’ and ‘No. of Foreground Pixels in Upper Half of Image’ for the Farsi *
digits set. It is completely clear that the ellipse for class (digit) ‘0’ is completely distinct
from the other ellipses. Hence, the feature pair (‘X Coordinate Centre of Mass’, ‘No. of

*features vector (to distinguish class (digit) ’0’ from the other classes (digits)).*

**Foreground Pixels in Upper Half of Image’) is a good choice for membership in the final**Figure 4.27 : 2D_SD spectrums distribution diagram for Farsi digits

‘X Coordinate Center of Mass’ feature vs.

‘No. of Foreground Pixels in Upper Half of image’ feature

145
Figure 4.28 shows another example for 2D_SD distribution of two features: ‘Y Coordinate
* Centre of Mass’ and ‘No. of Foreground Pixels in Lower Half of Image’ of the Farsi *
digits set. It is completely clear that, in this case, the mentioned features are highly
correlated. Hence, they are not a suitable feature pair for membership in the final features
vector.

Figure 4.28 : 2D_SD spectrums distribution diagram for Farsi digits

‘Y Coordinate Centre of Mass’ feature vs.

‘No. of Foreground Pixels in Lower Half of image’ feature

**b) Two-Dimensional Minimum to Maximum (2D_MM) Spectrum **

In the proposed dimensionality reduction method, 2D_MM is utilized to find the maximum
allowable overlapping threshold T*_{2}* , to create the final reduced features vector S2 from the
first reduced features vector

*ellipses (rectangular) in the 2D_MM diagram for all features pair in*

**S1. By investigating the overlapping values of the spectrum***threshold T*

**S1, the value of***is selected. In this study, the T*

_{2}*threshold was selected 20%, experimentally.*

_{2}146
By using the literature and based on the introduced features in Section 4.5, an initial
features vector Initial_S with 133 members of most-used features, which are employed for
handwritten characters recognition, was identified and extracted from the training samples
(The second column of Table 1 of Appendix VII). For the dimensionality reduction
process, the value of a specific feature is defined as f_{k}(S_{i,j}), where f_{k} is the value of the k’th
feature from the initial features vector Initial_S, and Si,j represents the j’th sample of class
i. Subsequently, and by using all samples in the training part of each class, the values of the
minimum, maximum, mean, and standard deviation for all features in the initial features
vector are computed. The reduction operation is carried in two stages as follows:

**Stage 1: Using 1D_SD and 1D_MM **

To find the first reduced subset of features vector, the 1D_SD and 1D_MM distribution
spectrums are generated for all available features in initial features vector * Initial_S. The *
system selects every feature that its 1D_SD spectrum line had a maximum of 30%

overlapping (threshold **T*** 1*, that it was found experimentally) with the other 1D_SD
spectrum lines of the other classes. The output of this stage is the first reduced version of
the features vector

*features vector (the columns 4, 6, and 8 of Table 1 in Appendix VII).*

**S1, which satisfies the criteria necessary for membership in the final****Stage 2: Using 2D_SD and 2D_MM **

By plotting the 2D_SD and 2D_MM distribution diagrams for the first reduced version of
the features vector * S1, the final reduced version of features vectors S2 *is created (the
columns 5, 7, and 9 of Table 1 in Appendix VII). In this stage, a couple of features are
selected, if their 2D_SD has a maximum of 20% overlapping (threshold

**T***, that it was found experimentally) with the other 2D_SD distribution diagrams of the other classes. In*

**2**147 this stage, it is possible that a feature is added to S2 more than once. Hence, in the end, the repetitive features in S2 are removed to create the smallest size of S2.

In the proposed dimensionality reduction method 2S_SA, the 1D_SD and 2D_SD
spectrums are utilized to decide whether or not a feature is suitable for membership in the
final features vector. 1D_MM and 2D_MM are used to find the best value for thresholds T*_{1}*
and

**T***. It is obvious that these threshold values are dependent to the type and characteristics of training dataset samples. Algorithm 3 explains the first stage of applying 2S_SA method to generate the intermediate reduced features vector S1 from initial features vector Initial_S. The output of this stage is fed as input to next stage (Algorithm 4).*

**2****Algorithm 3. 2S_SA method for dimensionality reduction purpose – Stage 1. **

**Input: **

** Initial_S : Initial Features Vector /* Initial_S = {f**1, f2, … fn} */

n : Number of features in the initial features vector , i.e. Initial_S noc : Number of Classes in Pattern Space

** Output: **

** S1 : First Reduced Version of Features Vector /* S1 = {g**_{1}, g_{2}, … g_{k}} ; k ≤ n */

**Method: 2S_SA **
1. S1:= null ;
2. for k = 1 : n do
4. {

5. Compute the coordinate of all 1D_SD spectrum lines corresponding to
feature f**k** ;

6. for c = 1 : noc do 7. {

8. If ( overlapping of spectrum line of class c with all the rest 9. spectrum lines has the value less than threshold T1 ) then 10. {

11. Insert feature f**k** to S1;

12. goto L1;

13. } 14. }

15. L1: continue;

16. }

17. return (S1);

148 Algorithm 4 explains the second stage of applying 2S_SA method to generate the final reduced features vector S2, from the first reduced features vector S1. The input for this step is the output of stage 1, and the output of this stage is the final reduced features vector.

**Algorithm 4. 2S_SA method for dimensionality reduction purpose – Stage 2 **
**Input: **

** S1 : First Reduced version of Features Vector /* S1 = {g**1, g2, … gk} */

k : Number of features in the first reduced features vector, i.e. S1 noc : Number of Classes in Pattern Space

**Output: **

** S2 : Final Reduced Features Vector /* S2 = {e**1, e2, … ej} ; j ≤ k */

**Method: 2S_SA **

1. m := Number of features in the first reduced version of the features vector S1;

2. S2:= null ; 3. for k = 1 : m do 4. for h = 1 : m do 5. {

6. Compute the coordinate of all 2D_SD spectrum ellipses
7. corresponding to features pair (f**k **, f**h****); **

8. for c = 1 : number of classes do 9. {

10. if ( overlapping of spectrum ellipses of class c with all the rest 11. spectrum ellipses has the value less than threshold T2 ) then 12. {

13. Insert feature pair f** _{k}** and f

**to S2;**

_{h}14. goto L2;

15. } 16. } 17. }

18. L2: continue;

19. }

20. delete the repetitive features from S2;

21. return (S2) ;

The mentioned operations created the features vectors * Initial_S, F/D-S1, and F/D-S2 *for
the digits part of the Farsi Hoda dataset, features vectors Initial_S, F/C-S1, and F/C-S2 for

149
the characters part of the Farsi Hoda dataset, and the features vectors **Initial_S, E-S1, and *** E-S2 *for the English digits MNIST dataset. Table 4.6 shows the number of features in
different versions of features vectors in each stage.

Table 4.6 : Number of features in initial features vector, first reduced version of features vector, and final reduced version of features vector, created by 2S_SA method

Number of Features

Dataset

Initial
features
vector :
**Initial_S**

First reduced versions of features vector :

**S1 **

Final reduced versions of features vector :

**S2 **

Ratio of
**S2 **

to
* Initial_S *
Farsi dataset

Hoda Digits part

133 **F/D-S1 : 94 ****F/D-S2 : 58 ****0.44 **
Farsi dataset

Hoda Characters part

133 **F/C-S1 : 119 ****F/C-S2 : 93 ****0.70 **
English dataset

MNIST 133 **E-S1 : 103 ****E-S2 : 79 ****0.59 **