CHAPTER 2: LITERATURE REVIEW
2.2 Farsi Writing Characteristics
systems are easier to apply than online systems. Hence, most research has been carried out on offline systems, and this is also true for Farsi OCR (FOCR) systems (Abedi, Faez, & Mozaffari, 2009; Alaei, Nagabhushan & Pal, 2010a; Bahmani, Alamdar, Azmi & Haratizadeh, 2010; Enayatifar & Alirezanejad, 2011; Jenabzade, Azmi, Pishgoo & Shirazi, 2011; Pourasad, Hassibi & Banaeyan, 2011; Rajabi, Nematbakhsh & Monadjemi, 2012; Salehpor & Behrad, 2010; Ziaratban & Faez, 2012).
The useful information available in online recognition systems has led researchers to try to extract some of this information for offline systems, too. They have tried to develop approaches that recover the distribution of the image pixels (as in online methods) from the information available in offline handwritten text. For example, Elbaati, Kherallah, Ennaji and Alimi (2009) tried to recover the temporal order of strokes from scanned handwritten Arabic text for use in an offline Arabic OCR system. They extracted features such as stroke end points, branching points, and crossing points from the image skeleton.
After that, they tried to find the order of the strings in each stroke. They also used a genetic algorithm to find the best combination of stroke orders.
Farsi Alphabet: The Farsi alphabet has 32 basic letters. Figure 2.1 shows a handwritten sample of all Farsi letters in isolated mode.
Figure 2.1: A sample of Farsi letters in isolated mode
Writing Direction: Farsi text is written from right to left, unlike Latin, along one or more imaginary horizontal lines called baselines. Numeral strings, however, are written from left to right, as in Latin.
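The mixed directionality of letters and numerals is also reflected in the Unicode bidirectional classes, which can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch for illustration only; the helper name `direction` and its three coarse categories are assumptions, not part of any FOCR system discussed here.

```python
import unicodedata

def direction(ch: str) -> str:
    """Return a coarse writing direction for a single character."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):        # right-to-left letters (AL = Arabic letter)
        return "right-to-left"
    if bidi in ("L", "EN", "AN"):  # Latin letters and the numeric classes
        return "left-to-right"
    return "neutral"

print(direction("\u0628"))  # Farsi letter Be
print(direction("\u06F4"))  # Farsi digit four
print(direction("b"))       # Latin letter
```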
Cursive language: Farsi writing is cursive by nature, even in machine-printed form, which means that letters join their neighbors on one or both sides to form sub-words. However, some letters are written separately. The cursive nature of Farsi text is the main obstacle for any FOCR system. To handle it, FOCR systems sometimes need an explicit segmentation step to separate connected letters. However, segmentation is itself one of the bottlenecks of FOCR systems, which is one reason why their performance is lower than that of Latin OCR systems.
Sub-words: Seven of the 32 Farsi letters ( ا , د , ذ , ر , ز , ژ , و ) cannot be joined to the letter that follows them (on their left) in a word; they join only to the preceding letter.
Therefore, if one of these letters occurs in a word before its last letter, it divides the word into two or more sub-words.
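The sub-word rule can be sketched in a few lines of Python. This is an illustrative sketch that uses only the seven non-joining letters listed above; the function name `split_subwords` is hypothetical, and real text would also need handling of spaces, diacritics and other symbols.

```python
# The seven letters listed in the text that do not join to the following letter.
NON_JOINING = set("ادذرزژو")

def split_subwords(word: str) -> list[str]:
    """Split a word into sub-words after every non-joining letter."""
    subwords, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINING:
            # A non-joining letter always closes the current sub-word.
            subwords.append(current)
            current = ""
    if current:
        subwords.append(current)
    return subwords

print(split_subwords("ایران"))
print(split_subwords("کتاب"))
```

A word that contains none of the seven letters, or only as its last letter, comes back as a single sub-word.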
Different shapes for letters: The shape of a Farsi letter is context sensitive. It depends on the letter's position within a word, and each letter can take up to four different shapes, as shown in Table 2.1. These forms are Beginning (or Initial), Middle, End Sticky and Isolated (or Alone). As a result, although the Farsi alphabet has only 32 letters, they appear in more than 120 different shapes.
Table 2.1: Farsi letters and their different shapes
Farsi  Initial Mode (I)  Middle Mode (M)  End Sticky Mode (E)  Alone Mode (A)
ا ---- ---- ﺎ ا
ب ﺑ ﺒ ﺐ ب
پ ﭘ ﭙ ﭗ پ
ت ﺗ ﺘ ﺖ ت
ث ﺛ ﺜ ﺚ ث
ج ﺟ ﺠ ﺞ ج
چ ﭼ ﭽ ﭻ چ
ح ﺣ ﺤ ﺢ ح
خ ﺧ ﺨ ﺦ خ
د ---- ---- ﺪ د
ذ ---- ---- ﺬ ذ
ر ---- ---- ﺮ ر
ز ---- ---- ﺰ ز
ژ ---- ---- ﮋ ژ
س ﺳ ﺴ ﺲ س
ش ﺷ ﺸ ﺶ ش
ص ﺻ ﺼ ﺺ ص
ض ﺿ ﻀ ﺾ ض
ط ﻃ ﻄ ﻂ ط
ظ ﻇ ﻈ ﻆ ظ
ع ﻋ ﻌ ﻊ ع
غ ﻏ ﻐ ﻎ غ
ف ﻓ ﻔ ﻒ ف
ق ﻗ ﻘ ﻖ ق
ک ﮐ ﮑ ﮏ ک
گ ﮔ ﮕ ﮓ گ
ل ﻟ ﻠ ﻞ ل
م ﻣ ﻤ ﻢ م
ن ﻧ ﻨ ﻦ ن
و ---- ---- ﻮ و
ه ﻫ ﻬ ﻪ ه
ی ﻳ ﻴ ﻲ ی
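The positional shapes in Table 2.1 are also encoded in the Unicode Arabic Presentation Forms blocks, where each shape carries a compatibility decomposition tagged `<initial>`, `<medial>`, `<final>` or `<isolated>` (corresponding to the Initial, Middle, End Sticky and Alone modes). The sketch below, offered only as an illustration, looks these shapes up with Python's standard `unicodedata` module; the function name `positional_forms` is hypothetical.

```python
import unicodedata

# Arabic Presentation Forms-A and -B, where Unicode encodes positional shapes.
PRESENTATION_RANGES = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]

def positional_forms(letter: str) -> dict[str, str]:
    """Map form names ('isolated', 'final', 'initial', 'medial') to glyphs."""
    target = f"{ord(letter):04X}"
    forms = {}
    for start, end in PRESENTATION_RANGES:
        for cp in range(start, end + 1):
            parts = unicodedata.decomposition(chr(cp)).split()
            # Entries look like '<initial> 0628'; keep single-letter mappings only.
            if len(parts) == 2 and parts[1] == target:
                forms[parts[0].strip("<>")] = chr(cp)
    return forms

print(positional_forms("\u0628"))  # Be: four shapes
print(positional_forms("\u062F"))  # Dal: two shapes
```

A letter that joins on both sides returns four entries, while a non-joining letter such as Dal returns only its isolated and final shapes.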
Similar characters: More than half of the Farsi letters share the same main body. They are differentiated by the number and location of complementary (secondary) parts such as dots, zigzags and slanted bars. This similarity can cause further classification problems when noise is added to these similar characters. Table 2.2 shows the groups of Farsi characters that share similar bodies.
Table 2.2: Different groups of Farsi letters and digits with similar bodies
Group Similar characters
1 ث ، ت ، پ ، ب
2 خ ، ح ، چ ، ج
3 ذ ، د
4 ژ ، ز ، ر
5 ش ، س
6 ض ، ص
7 ظ ، ط
8 غ ، ع
9 ق ، ف
10 گ ، ک
11 ۴ ، ۳ ، ۲
12 ۹ ، ۶ ، ۱
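The letter groups of Table 2.2 can be held in a simple lookup structure, for example to flag pairs of classes that a classifier is likely to confuse. The following is a minimal sketch under that assumption; the names `BODY_GROUPS`, `GROUP_OF` and `share_body` are hypothetical, and the two digit groups are left out for brevity.

```python
# Letter groups 1-10 from Table 2.2 (the digit groups 11-12 are omitted here).
BODY_GROUPS = [
    "بپتث", "جچحخ", "دذ", "رزژ", "سش",
    "صض", "طظ", "عغ", "فق", "کگ",
]

# Map every grouped letter to the number of its body group.
GROUP_OF = {ch: i for i, group in enumerate(BODY_GROUPS, start=1) for ch in group}

def share_body(a: str, b: str) -> bool:
    """True when both letters belong to the same body group."""
    return a in GROUP_OF and GROUP_OF.get(a) == GROUP_OF.get(b)

print(share_body(BODY_GROUPS[0][0], BODY_GROUPS[0][1]))  # same group
print(share_body(BODY_GROUPS[0][0], BODY_GROUPS[1][0]))  # different groups
print(len(GROUP_OF), "of 32 letters fall into a shared-body group")
```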
Dots: Eighteen of the 32 Farsi letters (more than 56%) have one, two or three dots above, below or in the middle of the letter body. It is worth noting that any erosion or removal of these dots leads to a misrepresentation of the letter. Therefore, efficient pre-processing techniques must be used to preserve these dots and avoid such misrepresentation during steps such as noise removal for image enhancement (Al-Khateeb, Jiang, Ren, Khelifi & Ipson, 2009).
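The proportion of dotted letters can be checked with a short tally. The per-letter dot counts below are filled in from the standard letter shapes and are not taken from the cited sources; the count for the letter Ye refers to its initial and middle shapes, and the name `DOTS` is hypothetical.

```python
from collections import Counter

# Dot count per letter; letters absent from this mapping carry no dots.
# The entry for 'ی' refers to its initial and middle shapes.
DOTS = {
    "ب": 1, "پ": 3, "ت": 2, "ث": 3, "ج": 1, "چ": 3,
    "خ": 1, "ذ": 1, "ز": 1, "ژ": 3, "ش": 3, "ض": 1,
    "ظ": 1, "غ": 1, "ف": 1, "ق": 2, "ن": 1, "ی": 2,
}
ALPHABET_SIZE = 32

share = len(DOTS) / ALPHABET_SIZE
by_count = Counter(DOTS.values())  # how many letters have 1, 2 or 3 dots

print(f"{len(DOTS)} of {ALPHABET_SIZE} letters carry dots ({share:.2%})")
print(dict(by_count))
```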
Dots Shapes: The shape and size of dots vary from writer to writer in handwritten documents. Therefore, it is sometimes necessary to define extra classes for these variant dot shapes. Figure 2.2 shows some examples of how the three-dot pattern is written.
Figure 2.2: Some styles of writing three dots in Farsi letters and words
Secondary parts: Some Farsi letters have extra parts such as slanted bars and the "Hamzeh", "Tanvin", and "Tashdid" symbols. The majority of hasty writers draw these secondary components in the wrong position or even attach them to the main letter body, which can make the secondary parts difficult to locate and recognize.
Baseline: Farsi characters lie on imaginary horizontal lines (called baselines), where letter connections occur and from which ascending and descending letters extend.
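Because letter connections concentrate on the baseline, a common heuristic (an assumption here, not a method attributed to the cited works) is to estimate the baseline as the image row with the most ink pixels, that is, the peak of the horizontal projection profile. The sketch below applies this to a toy binary image; the function name `estimate_baseline` is hypothetical.

```python
def estimate_baseline(image: list[list[int]]) -> int:
    """Return the index of the row with the most ink pixels (1 = ink)."""
    row_sums = [sum(row) for row in image]
    # On ties, the first (topmost) row with the maximum count is returned.
    return max(range(len(row_sums)), key=row_sums.__getitem__)

# Toy 6x8 binary image: row 3 holds the horizontal connecting stroke.
image = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]
print(estimate_baseline(image))
```

This simple profile assumes a single, horizontal baseline per text line, so it would not be adequate on its own for sloping or multiple baselines.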
Jags: A considerable percentage of Farsi letters (especially sticky letters) have jags near the baseline. If the original document is of low quality, or the scanner has a low resolution, these jags appear very small and are not detected by the system. This produces errors in the segmentation and recognition phases.
Ligature: In both printed and handwritten text, two or occasionally three letters can be combined vertically, in an accepted manner, to form a new unit shape called a ‘ligature’ (Lorigo & Govindaraju, 2006). Ligatures are exceptions to the rules for joining letters into a sub-word. One example is the combination of the letters ‘ﻟ’ and ‘ا’, which produces the ligature ‘ﻻ’. Most vertical ligatures are not obligatory and appear for aesthetic reasons (Sari & Sellami, 2007).
Segmenting a ligature into its constituent letters is usually very difficult. Therefore, a ligature is considered a new pattern in a dataset, which increases the number of patterns in the pattern space.
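The Lam-Alef ligature is also encoded as a single Unicode character whose compatibility decomposition gives back its two component letters. The sketch below uses Python's standard `unicodedata` module to show this, and then illustrates, with a hypothetical label set, how treating the ligature as its own class enlarges the pattern space.

```python
import unicodedata

LAM_ALEF = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# Compatibility normalization splits the ligature into its component letters.
components = unicodedata.normalize("NFKD", LAM_ALEF)
print([unicodedata.name(ch) for ch in components])

# Hypothetical label sets: adding the ligature as a separate class
# increases the number of patterns by one.
letter_classes = {"\u0644", "\u0627"}  # Lam, Alef
classes_with_ligature = letter_classes | {LAM_ALEF}
print(len(letter_classes), len(classes_with_ligature))
```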
Vertical Overlapping: The majority of neighboring letters in handwritten Farsi words may overlap vertically without touching. Hence, these letters cannot be completely separated from each other by drawing a simple vertical line. In addition, extra parts of letters, such as a slanted bar, usually overlap the adjacent letters in a word.
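Whether two letters can be separated by a vertical line can be tested on the column ranges they occupy. The following is a minimal sketch, assuming each letter is summarized by the inclusive (left, right) column range of its bounding box; both function names are hypothetical.

```python
def columns_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two inclusive column ranges (left, right) share a column."""
    return a[0] <= b[1] and b[0] <= a[1]

def separable_by_vertical_line(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Two letters can be split by a vertical cut only if no column is shared."""
    return not columns_overlap(a, b)

print(separable_by_vertical_line((0, 4), (5, 9)))  # disjoint column ranges
print(separable_by_vertical_line((0, 6), (5, 9)))  # columns 5-6 are shared
```

In the second case the letters need not touch at all; they merely occupy some of the same columns at different heights, which is enough to defeat a straight vertical cut.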
Different Dimensions: The height and width of Farsi characters vary from character to character and across the different shapes of the same character in different positions in a word, even in printed form (Table 2.1). For example, the letters ‘ک’ and ‘ه’ do not have equal height and width.
Intra Space: In handwritten Farsi text, the space between two sub-words has no standard width.
Confusing Characters: Some Farsi letters and symbols closely resemble Farsi digits, such as the letter ‘ا’ and the digit ‘۱’ (1), the letter ‘ه’ and the digit ‘۵’ (5), and the dot ‘.’ and the digit ‘۰’ (0). This similarity leads to errors in the recognition module.
Extra forms for a character: Some Farsi digits have more than one written form, namely the digits 2, 4, 5 and 6. For example, 4 is written as ‘٤’ or ‘۴’, 5 as ‘٥’ or ‘۵’, and 6 as ‘٦’ or ‘۶’.
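Some of these alternative digit forms exist as separate Unicode code points (the Arabic-Indic and Extended Arabic-Indic digits), so a recognizer may need to map several class labels to one numeric value. The sketch below shows such a many-to-one mapping with Python's standard `unicodedata` module; the names `VARIANTS` and `digit_value` are hypothetical, and handwritten variants without their own code point would need extra labels defined by the dataset.

```python
import unicodedata

# Two code points per digit value: Arabic-Indic and Extended Arabic-Indic forms.
VARIANTS = {
    4: ("\u0664", "\u06F4"),
    5: ("\u0665", "\u06F5"),
    6: ("\u0666", "\u06F6"),
}

def digit_value(ch: str) -> int:
    """Map any decimal digit shape to its numeric value."""
    return unicodedata.digit(ch)

for value, shapes in VARIANTS.items():
    # Both shapes of each digit resolve to the same numeric value.
    print(value, [digit_value(s) for s in shapes])
```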
Sloping and multiple baselines: Some Farsi writing styles, such as Nasta’aligh, have more than one baseline in each line of text, and these baselines are not horizontal.
Many Farsi letters have ascending and descending parts, which are salient characteristics for recognition.
Table 2.3 and Figure 2.3 illustrate the characteristics described above.
Table 2.3: Some Farsi letters and their characteristics
Transcription  Farsi letter (I, M, E, A shapes)  Characteristic
Dal د ---- ---- ﺪ د A letter with 2 shapes
Be ب ﺑ ﺒ ﺐ ب A letter with 4 shapes
Jeem ج ﺟ ﺠ ﺞ ج 11: Different letters with similar bodies
Chaa چ ﭼ ﭽ ﭻ چ
Ghain غ ﻏ ﻐ ﻎ غ 12: Different shapes for a letter
Raa ر ---- ---- ﺮ ر 13: A letter without dots
Zal ذ ---- ---- ﺬ ذ 13: A letter with one dot
Qaaf ق ﻗ ﻘ ﻖ ق 13: A letter with two dots
Shin ش ﺷ ﺸ ﺶ ش 13: A letter with three dots
Gaaf گ ﮔ ﮕ ﮓ گ 14: Letters with
7: Slanted bar
Meem م ﻣ ﻤ ﻢ م
Figure 2.3: Various aspects of Farsi writing characteristics.
(1): Writing direction, (2): Creating a word using sticky letters, (3): Creating a sub-word using non-sticky letters in the middle of a word, (4): Jags, (5): Ligature, (6): Vertical overlapping, (7): Extra part of a letter, (8): Various types of spaces between sub-words, (9): Different shapes of a letter, (10): Dots, (11): Various heights and widths for different characters, (12): Various shapes and sizes of dots.
Although most of the techniques used in FOCR systems are not fundamentally different from those used in Latin OCR systems, some special linguistic rules associated with Farsi writing make Farsi character recognition more challenging than Latin character recognition. The aforementioned characteristics have prompted researchers to examine some of the problems encountered, which have only recently been addressed by researchers of other languages. These problems are the main obstacles to developing OCR systems for Farsi (and for languages with similar alphabets) (Khorsheed, 2002).
Fortunately, there is evidence of intense efforts being made to overcome the FOCR problems.