• Tiada Hasil Ditemukan

Stage 2 : Using Two-Dimensional Spectrum Analysis Tool

CHAPTER 4: THE PROPOSED MODEL FOR DATA REDUCTION IN

4.6 Dimensionality Reduction

4.6.1 The New Proposed Two-Stage Spectrums Analysis (2S_SA) Method

4.6.1.2 Stage 2 : Using Two-Dimensional Spectrum Analysis Tool

143 (a) (b)

(a) 1D_SD Spectrums diagram for ‘Aspect Ratio’ feature corresponding to Farsi digits (b) 1D_MM Spectrums diagram for ‘Aspect ratio’ feature corresponding to Farsi digits

Figure 4.26 : Comparing 1D_SD and 1D_MM spectrum distribution diagrams

In the proposed dimensionality reduction method 2S_SA, 1D_MM is used to find the maximum allowable overlapping threshold T1, to create the first reduced features vector S1 from the initial features set Initial_S. By investigating the overlapping values of the spectrum lines in the 1D_MM diagram for each feature in Initial_S, the value of threshold T1 is selected. In this study, the T1 threshold was selected 30%, experimentally.

4.6.1.2 Stage 2 : Using Two-Dimensional Spectrum Analysis Tool

144 for each couple of features. In 2D_SD, the main ellipse diagonals (or the length and the width in the rectangular case) are plotted from the mean-SD to the mean+SD for two features. In 2D_MM, the main ellipse diagonals (or the length and the width in the rectangular case) are plotted from the minimum value to the maximum value for those two features. As such, the [n×(n-1)/2] 2D_SD (or 2D_MM) distribution diagram can be generated for n independent features.

Figure 4.27 shows a 2D_SD distribution diagram for two features; namely, ‘X Coordinate Centre of Mass’ and ‘No. of Foreground Pixels in Upper Half of Image’ for the Farsi digits set. It is completely clear that the ellipse for class (digit) ‘0’ is completely distinct from the other ellipses. Hence, the feature pair (‘X Coordinate Centre of Mass’, ‘No. of Foreground Pixels in Upper Half of Image’) is a good choice for membership in the final features vector (to distinguish class (digit) ’0’ from the other classes (digits)).

Figure 4.27 : 2D_SD spectrums distribution diagram for Farsi digits

‘X Coordinate Center of Mass’ feature vs.

‘No. of Foreground Pixels in Upper Half of image’ feature

145 Figure 4.28 shows another example for 2D_SD distribution of two features: ‘Y Coordinate Centre of Mass’ and ‘No. of Foreground Pixels in Lower Half of Image’ of the Farsi digits set. It is completely clear that, in this case, the mentioned features are highly correlated. Hence, they are not a suitable feature pair for membership in the final features vector.

Figure 4.28 : 2D_SD spectrums distribution diagram for Farsi digits

‘Y Coordinate Centre of Mass’ feature vs.

‘No. of Foreground Pixels in Lower Half of image’ feature

b) Two-Dimensional Minimum to Maximum (2D_MM) Spectrum

In the proposed dimensionality reduction method, 2D_MM is utilized to find the maximum allowable overlapping threshold T2 , to create the final reduced features vector S2 from the first reduced features vector S1. By investigating the overlapping values of the spectrum ellipses (rectangular) in the 2D_MM diagram for all features pair in S1, the value of threshold T2 is selected. In this study, the T2 threshold was selected 20%, experimentally.

146 By using the literature and based on the introduced features in Section 4.5, an initial features vector Initial_S with 133 members of most-used features, which are employed for handwritten characters recognition, was identified and extracted from the training samples (The second column of Table 1 of Appendix VII). For the dimensionality reduction process, the value of a specific feature is defined as fk(Si,j), where fk is the value of the k’th feature from the initial features vector Initial_S, and Si,j represents the j’th sample of class i. Subsequently, and by using all samples in the training part of each class, the values of the minimum, maximum, mean, and standard deviation for all features in the initial features vector are computed. The reduction operation is carried in two stages as follows:

Stage 1: Using 1D_SD and 1D_MM

To find the first reduced subset of features vector, the 1D_SD and 1D_MM distribution spectrums are generated for all available features in initial features vector Initial_S. The system selects every feature that its 1D_SD spectrum line had a maximum of 30%

overlapping (threshold T1, that it was found experimentally) with the other 1D_SD spectrum lines of the other classes. The output of this stage is the first reduced version of the features vector S1, which satisfies the criteria necessary for membership in the final features vector (the columns 4, 6, and 8 of Table 1 in Appendix VII).

Stage 2: Using 2D_SD and 2D_MM

By plotting the 2D_SD and 2D_MM distribution diagrams for the first reduced version of the features vector S1, the final reduced version of features vectors S2 is created (the columns 5, 7, and 9 of Table 1 in Appendix VII). In this stage, a couple of features are selected, if their 2D_SD has a maximum of 20% overlapping (threshold T2, that it was found experimentally) with the other 2D_SD distribution diagrams of the other classes. In

147 this stage, it is possible that a feature is added to S2 more than once. Hence, in the end, the repetitive features in S2 are removed to create the smallest size of S2.

In the proposed dimensionality reduction method 2S_SA, the 1D_SD and 2D_SD spectrums are utilized to decide whether or not a feature is suitable for membership in the final features vector. 1D_MM and 2D_MM are used to find the best value for thresholds T1 and T2. It is obvious that these threshold values are dependent to the type and characteristics of training dataset samples. Algorithm 3 explains the first stage of applying 2S_SA method to generate the intermediate reduced features vector S1 from initial features vector Initial_S. The output of this stage is fed as input to next stage (Algorithm 4).

Algorithm 3. 2S_SA method for dimensionality reduction purpose – Stage 1.

Input:

Initial_S : Initial Features Vector /* Initial_S = {f1, f2, … fn} */

n : Number of features in the initial features vector , i.e. Initial_S noc : Number of Classes in Pattern Space

Output:

S1 : First Reduced Version of Features Vector /* S1 = {g1, g2, … gk} ; k ≤ n */

Method: 2S_SA 1. S1:= null ; 2. for k = 1 : n do 4. {

5. Compute the coordinate of all 1D_SD spectrum lines corresponding to feature fk ;

6. for c = 1 : noc do 7. {

8. If ( overlapping of spectrum line of class c with all the rest 9. spectrum lines has the value less than threshold T1 ) then 10. {

11. Insert feature fk to S1;

12. goto L1;

13. } 14. }

15. L1: continue;

16. }

17. return (S1);

148 Algorithm 4 explains the second stage of applying 2S_SA method to generate the final reduced features vector S2, from the first reduced features vector S1. The input for this step is the output of stage 1, and the output of this stage is the final reduced features vector.

Algorithm 4. 2S_SA method for dimensionality reduction purpose – Stage 2 Input:

S1 : First Reduced version of Features Vector /* S1 = {g1, g2, … gk} */

k : Number of features in the first reduced features vector, i.e. S1 noc : Number of Classes in Pattern Space

Output:

S2 : Final Reduced Features Vector /* S2 = {e1, e2, … ej} ; j ≤ k */

Method: 2S_SA

1. m := Number of features in the first reduced version of the features vector S1;

2. S2:= null ; 3. for k = 1 : m do 4. for h = 1 : m do 5. {

6. Compute the coordinate of all 2D_SD spectrum ellipses 7. corresponding to features pair (fk , fh);

8. for c = 1 : number of classes do 9. {

10. if ( overlapping of spectrum ellipses of class c with all the rest 11. spectrum ellipses has the value less than threshold T2 ) then 12. {

13. Insert feature pair fk and fh to S2;

14. goto L2;

15. } 16. } 17. }

18. L2: continue;

19. }

20. delete the repetitive features from S2;

21. return (S2) ;

The mentioned operations created the features vectors Initial_S, F/D-S1, and F/D-S2 for the digits part of the Farsi Hoda dataset, features vectors Initial_S, F/C-S1, and F/C-S2 for

149 the characters part of the Farsi Hoda dataset, and the features vectors Initial_S, E-S1, and E-S2 for the English digits MNIST dataset. Table 4.6 shows the number of features in different versions of features vectors in each stage.

Table 4.6 : Number of features in initial features vector, first reduced version of features vector, and final reduced version of features vector, created by 2S_SA method

Number of Features

Dataset

Initial features vector : Initial_S

First reduced versions of features vector :

S1

Final reduced versions of features vector :

S2

Ratio of S2

to Initial_S Farsi dataset

Hoda Digits part

133 F/D-S1 : 94 F/D-S2 : 58 0.44 Farsi dataset

Hoda Characters part

133 F/C-S1 : 119 F/C-S2 : 93 0.70 English dataset

MNIST 133 E-S1 : 103 E-S2 : 79 0.59