
STOCK MARKET CLASSIFICATION MODEL USING SENTIMENT ANALYSIS BASED ON HYBRID NAÏVE BAYES CLASSIFIERS


Academic year: 2022


The copyright © of this thesis belongs to its rightful author and/or other copyright owner. Copies can be accessed and downloaded for non-commercial or learning purposes without any charge and permission. The thesis cannot be reproduced or quoted as a whole without the permission from its rightful owner. No alteration or changes in format is allowed without permission from its rightful owner.


STOCK MARKET CLASSIFICATION MODEL USING SENTIMENT ANALYSIS BASED ON HYBRID NAÏVE BAYES CLASSIFIERS

GHAITH ABDULSATTAR A. JABBAR ALKUBAISI

DOCTOR OF PHILOSOPHY UNIVERSITI UTARA MALAYSIA

2019


Permission to Use

In presenting this thesis in fulfilment of the requirements for a postgraduate degree from Universiti Utara Malaysia, I agree that the Universiti Library may make it freely available for inspection. I further agree that permission for the copying of this thesis in any manner, in whole or in part, for scholarly purpose may be granted by my supervisor(s) or, in their absence, by the Dean of Awang Had Salleh Graduate School of Arts and Sciences. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to Universiti Utara Malaysia for any scholarly use which may be made of any material from my thesis.

Requests for permission to copy or to make other use of materials in this thesis, in whole or in part, should be addressed to:

Dean of Awang Had Salleh Graduate School of Arts and Sciences UUM College of Arts and Sciences

Universiti Utara Malaysia 06010 UUM Sintok


Abstrak

Sentiment analysis has become a common method for classifying stock market behaviour. Moreover, sentiment analysis has grown in importance over this decade, especially with the availability of data from social media such as Twitter. However, the accuracy of stock market classification models remains low, and this negatively affects stock market indicators. Furthermore, various factors that directly affect the accuracy of classification models were not taken into account in previous studies. One of these factors is the exclusion of spatial-temporal features. Another important factor is the automatic labelling technique, which leads to low classification accuracy because of the absence of a specific lexicon. The suitability of the classifier for the data features and the domain is a further factor that affects classification accuracy. In this study, a stock market classification model based on sentiment analysis was developed. The model is designed to improve classification accuracy by incorporating tweet timestamp and location features, a stock market domain expert labelling technique, and the development of hybrid Naïve Bayes classifiers to classify stock market sentiment. The methodology of this study consists of six phases. The first phase is data collection, and the second is an important phase involving labelling, in which the polarity of the data is specified as negative, positive or neutral. The third phase involves data pre-processing, in which only relevant features are retained. The fourth phase is classification, in which suitable stock market patterns are identified through the hybridization of Naïve Bayes classifiers. The fifth phase is performance and evaluation, and the final phase is the recognition of stock market behaviour. The model produced significant findings in classifying stock market behaviour, with accuracy exceeding 89%. The model is beneficial to investors and researchers. For investors, it enables them to formulate plans based on accurate indicators, which reduces risk in decision making. For researchers, it draws attention to the importance of feature engineering, labelling technique and classifier hybridization in improving classification accuracy.

Keywords: Stock market classification, Hybrid Naive Bayes classifiers, Sentiment analysis, Expert labelling, Spatial-temporal features.


Abstract

Sentiment analysis has become one of the most common methods for classifying stock market behaviour. It has gained considerable importance in the last decade, especially owing to the availability of data from social media such as Twitter. However, the accuracy of stock market classification models is still low, and this has negatively affected stock market indicators. Furthermore, many factors that directly affect the accuracy of classification models were not addressed by previous research. One factor is the exclusion of spatial-temporal features. Another important factor is the automatic labelling technique, which leads to low classification accuracy because of the absence of a specific lexicon. The appropriateness of the classifiers to the data features and domain is a further factor that affects classification accuracy. In this research, a model for stock market classification based on sentiment analysis is constructed. It is designed to enhance classification accuracy by incorporating tweet timestamp and location features, a stock market domain expert labelling technique, and the construction of hybrid Naïve Bayes classifiers to classify stock market sentiments. The methodology for this research consists of six phases. The first phase is data collection. The second phase, labelling, is the most important: the polarity of the data is specified as negative, positive or neutral. The third phase involves data pre-processing, which is conducted to retain only relevant features. The fourth phase is classification, in which suitable patterns of the stock market are identified by hybridizing different Naïve Bayes classifiers. The fifth phase is performance and evaluation, and the final phase is recognition of stock market behaviour. The model produced a significant result in classifying stock market behaviour, with accuracy of more than 89%. The model is beneficial for investors and researchers. For investors, it enables them to formulate their plans based on accurate indicators, which reduces the risk in decision making. For researchers, it draws attention to the importance of feature engineering, labelling technique and classifier hybridization in enhancing classification accuracy.

Keywords: Stock market classification, Hybrid Naive Bayes classifiers, Sentiment analysis, Expert labelling, Spatial-temporal features.


Acknowledgment

All thanks to almighty Allah

O Lord, to You is praise as befits the Glory of Your Face and the greatness of Your Might.

يَا رَبِّ لَكَ الْحَمْدُ كَمَا يَنْبَغِي لِجَلَالِ وَجْهِكَ وَلِعَظِيمِ سُلْطَانِكَ.

First and foremost, I am heartily thankful to my supervisors, Associate Prof. Dr. Siti Sakira Kamaruddin and Associate Prof. Dr. Husniza Husni, for their valued guidance and continuous support from the initial to the final stage of this research. The honest supervision and real encouragement that they gave truly helped the progress of this research. It is an honor for me to have both of you as my supervisors. May Allah reward you well (جزاكم الله خير).

My genuine and heartfelt thanks go to my dear parents and my big brother: my father, Prof. Dr. Abdulsattar Al-kubaisi, my mother, Siham Al-zubaidi, and my big brother, Laith Al-kubaisi. My heartfelt thanks and appreciation are also extended to my parents-in-law, my father-in-law Noori Alani and my mother-in-law Alya Alani. Thank you for your support, continual prayers, and patience during our time apart. Nothing in this world is equal to your boundless giving and support.

I wish to express my gratitude and thanks to the academic and supporting staff in AHSGS and SOC, especially to the dean of AHSGS Prof. Dr. Ku Ruhana Ku-Mahamud, Associate Prof. Dr. Yuhanis Yusof, Dr. Farzana Kabir Ahmad, Dr. Juhaida Abu Bakar, and Dr. Nor Hazlyna Harun.

Special thanks are due to my dear friends, Ghanim Shamas, Qais Alrubaiei, and Muthana Alani. Dear Ghanim and dear Qais, I will never forget our happy moments in UUM, thanks for everything. Dear Muthana, I will never forget your support.

Last but not least, to the person who made my life beautiful, my wife, thanks a lot for your love and support during my difficult times. My dear son, Ayham, my dear daughter, Elaf, thanks for your patience and continual prayers for your daddy.


Table of Contents

Permission to Use

Abstrak

Abstract

Acknowledgment

Table of Contents

List of Tables

List of Figures

List of Abbreviations

CHAPTER ONE INTRODUCTION

1.1 Overview

1.2 Problem Statement

1.3 Research Questions

1.4 Research Objectives

1.5 Research Motivation

1.6 Research Scope

1.7 Research Significance

1.8 Thesis Organization

CHAPTER TWO LITERATURE REVIEW

2.1 Overview

2.2 Stock Market Classification Model

2.2.1 Stock Market Classification Model using Sentiment Analysis on English Tweets

2.2.2 Stock Market Classification Model using Sentiment Analysis on Arabic Tweets

2.3 Data Source for Stock Market Classification Model

2.3.1 Twitter as A Data Source

2.3.2 Data Availability from Twitter using API

2.3.3 Twitter and Stock Market Classification Model

2.4 Labelling Techniques and Stock Market Classification Model

2.5 Data Pre-processing and Feature Engineering

2.5.1 Data Pre-processing for Stock Market Classification Model

2.5.2 Feature Engineering and Representation

2.6 Classification

2.6.1 Classification and Supervised ML Classifiers

2.6.2 NBCs and Classification Model

2.6.3 Hybridization and Ensemble Voting in Classification Model Improvements

2.7 Performance and Evaluation for Stock Market Classification Model

2.8 Chapter Summary

CHAPTER THREE RESEARCH METHODOLOGY

3.1 Overview

3.2 Research Design

3.3 Conceptual Framework

3.3.1 Phase 1: Data Collection using Twitter API

3.3.2 Phase 2: Labelling Techniques

3.3.3 Phase 3: Data Pre-processing

3.3.4 Phase 4: Classification

3.3.4.1 HNBCs1

3.3.4.2 HNBCs2

3.3.4.3 HNBCs3

3.3.5 Phase 5: Performance Evaluation

3.3.6 Phase 6: Recognize the Stock’s Behaviour

3.4 Chapter Summary

CHAPTER FOUR THE STOCK MARKET CLASSIFICATION MODEL

4.1 Overview

4.2 The Constructed Stock Market Classification Model

4.2.1 Data Collection using Tweets Collector

4.2.2 Labelling Techniques

4.2.2.1 Expert Labelling Technique based on Research Domain

4.2.2.2 Auto-Labelling Technique using General Lexicon

4.2.3 Tweets Pre-processing and Feature Representation

4.2.4 Classification Based on Hybrid Naïve Bayes Classifiers

4.2.5 Performance and Evaluation

4.2.6 Stock’s Behaviour

4.3 Chapter Summary

CHAPTER FIVE RESULTS AND DISCUSSION

5.1 Overview

5.2 Initial Testing

5.2.1 Data Collection (tweets collection)

5.2.2 Expert Labelling Technique (manual labelling by the expert)

5.2.3 Tweets Pre-processing

5.2.4 Initial Classification using Different Classifiers (MNB, BNB, and Hybrid Model)

5.2.5 Initial Results

5.2.6 Recognize the Stock’s Behaviour (initial testing)

5.3 HNBCs Experimental Results

5.3.1 HNBCs1 Performance and Evaluation

5.3.2 HNBCs2 Performance and Evaluation

5.3.3 HNBCs3 Performance and Evaluation

5.4 The Role of Expert Labelling in Classification Accuracy Enhancement

5.5 The Role of Feature Engineering in Classification Accuracy Enhancement

5.6 ML Hybridization and Classification Accuracy Enhancement

5.7 Selection of the ML Classifier to Improve Classification Accuracy

5.8 The Relationship between High Classification Accuracy and Stock Market Indicators

5.9 Benchmarking

5.10 Chapter Summary

CHAPTER SIX CONCLUSION

6.1 Overview

6.2 Research Contributions

6.3 Recommendations and Future Works

REFERENCES


List of Tables

Table 2.1 The General Advantages and Disadvantages for the Most Common ML Classifiers in the Domain of Stock Market Classification Model

Table 2.2 Confusion Metrics for a Two-Class Classifier

Table 2.3 Equations used for Evaluation the Classification Model

Table 2.4 The Facts Sheet

Table 2.5 The Main Characteristics of the Reviewed Stock Market Classification Models

Table 3.1 Example of Expert Labelling and Defining the Polarity

Table 3.2 Example about the Probabilities Averaging in Soft Voting Ensemble

Table 4.1 Sample of Almarai Tweets

Table 4.2 Sample of DM Tweets

Table 4.3 Sample of Almarai Tweets after Labelling

Table 4.4 Sample of DM Tweets after Labelling

Table 4.5 Sample of DM Tweets after Auto-Labelling

Table 4.6 Sample of Cleaned English Tweets

Table 4.7 Sample of Cleaned Arabic Tweets

Table 5.1 Sample from the Manually Collected Etisalat Tweets (initial test)

Table 5.2 Labelled Tweets (Original Arabic Tweets-initial test)

Table 5.3 Sample of Features after Pre-processing

Table 5.4 MNB Performance Evaluation (Initial Test)

Table 5.5 BNB Performance Evaluation (Initial Test)

Table 5.6 Hybrid Classifier Performance Evaluation (Initial Test)

Table 5.7 Classification Accuracy (Initial Test)

Table 5.8 HNBCs1 using Almarai Arabic Tweets (all classes: 1, 2, and 0)

Table 5.9 HNBCs1 using ASA Arabic Tweets (all classes: 1, 2, and 0)

Table 5.10 HNBCs1 using Almarai English Tweets (all classes: 1, 2, and 0)

Table 5.11 HNBCs1 using DM English Tweets (all classes: 1, 2, and 0)

Table 5.12 HNBCs1 using DMM English Tweets (all classes: 1, 2, and 0)

Table 5.13 HNBCs1 using Etisalat UAE English Tweets (all classes: 1, 2, and 0)

Table 5.14 HNBCs2 using Almarai English Tweets (all classes: 1, 2, and 0)

Table 5.15 HNBCs2 using DM English Tweets (all classes: 1, 2, and 0)

Table 5.16 HNBCs2 using DMM English Tweets (all classes: 1, 2, and 0)

Table 5.17 HNBCs2 using Etisalat UAE English Tweets (all classes: 1, 2, and 0)

Table 5.18 HNBCs3 using Almarai English Tweets (all classes: 1, 2, and 0)

Table 5.19 HNBCs3 using DM English Tweets (all classes: 1, 2, and 0)

Table 5.20 HNBCs3 using DMM English Tweets (all classes: 1, 2, and 0)

Table 5.21 HNBCs3 using Etisalat UAE English Tweets (all classes: 1, 2, and 0)

Table 5.22 Auto-Labelling vs Expert Labelling

Table 5.23 Expert vs Auto using HNBCs1

Table 5.24 Expert vs Auto using HNBCs2

Table 5.25 Expert vs Auto using HNBCs3

Table 5.26 HNBCs2 and HNBCs3 Performance and Evaluation with Fraction = 0.1

Table 5.27 HNBCs2 and HNBCs3 Performance and Evaluation with Fraction = 0.2

Table 5.28 HNBCs2 and HNBCs3 Performance and Evaluation with Fraction = 0.3

Table 5.29 HNBCs2 Performance and Evaluation with and without Temporal and Spatial Functions

Table 5.30 HNBCs3 Performance and Evaluation with and without Temporal and Spatial Functions

Table 5.31 HNBCs2 and HNBCs3 Performance and Evaluation with Optimization = 0.1

Table 5.32 HNBCs2 and HNBCs3 Performance and Evaluation with Optimization = 0.2

Table 5.33 HNBCs2 and HNBCs3 Performance and Evaluation with Optimization = 0.3

Table 5.34 HNBCs1 vs SVM using Almarai English Tweets

Table 5.35 HNBCs Classification Accuracy vs NBCs

Table 5.36 HNBCs Classification Accuracy vs the Reviewed Classification Models based on NB


List of Figures

Figure 2.1. Multiple Types of ML and Associated use-cases

Figure 2.2. The Proposed Model for Stock Price and Significant Keyword Correlation

Figure 2.3. The Proposed Model by Qasem et al. (2015)

Figure 2.4. MS. Azure ML

Figure 2.5. The Proposed Model by Cakra and Trisedya (2015)

Figure 2.6. The Proposed Model by Kordonis et al. (2016)

Figure 2.7. The Proposed Model by Hamed et al. (2015)

Figure 2.8. The Proposed Model by Hamed et al. (2016)

Figure 2.9. The Proposed Model by AL-Rubaiee et al. (2018)

Figure 2.10. Tweet's Attributes

Figure 2.11. Example about Feature Selection from Twitter using JSON by Tweepy Tool

Figure 2.12. Example about Tweet’s Auto-labelling

Figure 2.13. Pre-processing Steps for Arabic Tweets by AL-Rubaiee et al. (2018)

Figure 2.14. Feature Engineering Main Phases

Figure 2.15. Structure of Naïve Bayes Classifier

Figure 2.16. Structure of Ensemble ML Models

Figure 3.1. Conceptual Framework

Figure 3.2. Auto-Labelling Framework

Figure 3.3. Data Pre-processing Steps

Figure 3.4. General Structure for the Proposed HNBCs

Figure 3.5. Cross-Validation with 2-Fold

Figure 3.6. HNBCs1 Proposed Framework

Figure 3.7. HNBCs2 Proposed Framework

Figure 3.8. HNBCs3 Proposed Framework

Figure 4.1. The Implemented Stock Market Classification Model using Sentiment Analysis on English Tweets Based on HNBCs.

Figure 4.2. The Implemented Stock Market Classification Model using Sentiment Analysis on Arabic Tweets Based on HNBCs1.

Figure 4.3. Stock's Behaviours

Figure 5.1. Size of Polarities (initial testing)

Figure 5.2. Tweet with Timestamp, id, and Company Name

Figure 5.3. Abstract Tweet without Timestamp, id, and Company Name

Figure 5.4. Baseline NB vs HNBCs Classification Accuracy using Different Datasets

Figure 5.5. Example to Represents the Size of Reviews with Classification Accuracy


List of Abbreviations

ADX Abu Dhabi Securities Exchange

API Application Programming Interface

ASA AlSafi Arabia

B2B Business to Business

BN Bayesian Network

BNB Bernoulli Naïve Bayes

CSV Comma Separated Values

DFM Dubai Financial Market

DM Dubai Mall

DMM Dubai Marina Mall

EM Expectation Maximization

EMH Efficient Market Hypothesis

FN False Negative

FP False Positive

GCC Gulf Cooperation Council

GNB Gaussian Naïve Bayes

HNBCs Hybrid Naïve Bayes Classifiers

HTTP Hypertext Transfer Protocol

JSON JavaScript Object Notation

KNNs K-Nearest Neighbours

LR Linear Regression

ME Maximum Entropy

ML Machine Learning

MNB Multinomial Naïve Bayes

MS Microsoft

MSA Modern Standard Arabic

NB Naïve Bayes

NBC Naïve Bayes Classifier

NBCs Naïve Bayes Classifiers

NNs Neural Networks

OAUTH Open Authorization

REST Representational State Transfer

RF Random Forest

R-Reqs Research Requirements

S&P Standard and Poor

SA Saudi Arabia

SSNB Semi-Supervised Naïve Bayes

SVM Support Vector Machine

TF-IDF Term Frequency-Inverse Document Frequency

TN True Negative

TP True Positive

UAE United Arab Emirates

URL Uniform Resource Locator

UTF Unicode Transformation Format


CHAPTER ONE INTRODUCTION

This chapter presents an overview of stock market investment and stock market classification models, to introduce the research. It explains the problem statement and proposed solutions, discusses the research questions, and introduces the purpose of the study by presenting the research objectives, the motivation for the study, the research scope, the research significance, and finally the thesis organization.

1.1 Overview

Investors and business people need to decide on an effective approach to improve the outcomes of their investments and to avoid massive financial losses, particularly when investing in the stock market (Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2014; Ren, Wu, & Liu, 2018). The stock market is important because a company’s stock prices play a pertinent role in all economic sectors (Baker, Stein, & Wurgler, 2002; Pan & Mishra, 2018). The global growth of stock exchanges has raised the need for an in-depth decision-making tool based on a stock market classification model (Bartov, Faurel, & Mohanram, 2017; Ruan, Durresi, & Alfantoukh, 2018).

Accurate classification of the data sources in the stock market domain is necessary for investors to make suitable decisions, such as selling or buying stocks (Guresen, Kayakutlu, & Daim, 2011; Hsu, Lessmann, Sung, Ma, & Johnson, 2016; Zhong & Enke, 2017). These kinds of investments need a pattern (Smedt & Daelemans, 2012; Fortuny, Smedt, Martens, & Daelemans, 2014) to help decision makers in the stock market reach the right decision with minimal risk (Fortuny et al., 2014; Nguyen, Shirai, & Velcin, 2015). To determine a suitable pattern, trends must be followed by observing the reactions of consumers to everything related to the product, such as the quality and price reported in company financial reports.

This reaction pattern can be captured by following consumer reactions on social media, such as posts and tweets (Bollen & Mao, 2011; Vu, Chang, Ha, & Collier, 2012; Kumar, Choi, & Greene, 2017), as well as companies’ financial reports (Chanthinok, Ussahawanitichakit, & Jhundra-indra, 2015; Cao, 2017). For example, when a service company announces new services in a tweet and customers tweet positively in response, demand for that company’s stocks will increase. Negative tweets, on the other hand, will lead to a decrease in demand for the stocks. The last possibility is a neutral tweet, which does not affect the demand for stocks (Bollen, Mao, & Zeng, 2011; Zhang, Fuehres, & Gloor, 2011).

These reactions represent valuable data: tweets can be mined and analyzed to construct useful indicators that help decision makers in selling or buying stocks on the stock exchange (Kim, Jeong, & Ghani, 2014; Nassirtoussi et al., 2014).

Obtaining a pattern with a high degree of accuracy is still a major challenge for researchers in the domain of stock market classification using sentiment analysis on Twitter (Khaidem, Saha, & Dey, 2016; Navale, Dudhwala, Jadhav, Gabda, & Vihangam, 2016; Jeon, Hong, & Chang, 2017). Stock market classification models still suffer from low accuracy in classification (Khan, Baharudin, Khan, & Ullah, 2014; Arvanitis & Bassiliades, 2017; Yong, Tang, Cui, & Wen, 2018); this affects the reliability of the stock market indicators which are extracted by following and analyzing social media data and stock prices (Bollen, Mao, & Zeng, 2011; Ludwig et al., 2013; Lin & Ryaboy, 2013).

Various factors can affect the results of a stock market classification model, including features such as company name, location and volume (Bonde & Khaled, 2012; Prusa, Khoshgoftaar, & Dittman, 2015; Prusa, Khoshgoftaar, & Napolitano, 2015; Tang, He, Baggenstoss, & Kay, 2016; Iacomin, 2016); labelling technique (Donmez, Carbonell, & Schneider, 2009; Hasbullah, Maynard, Chik, Mohd, & Noor, 2016; Canuto, Gonçalves, & Benevenuto, 2016); and classification method (Prusa et al., 2015a; Prusa et al., 2015b; Prusa, Khoshgoftaar, & Seliya, 2016).

From the dataset, important features such as the number of keywords and the interval of data collection can affect the performance of the model (Prusa et al., 2015a; Iacomin, 2016; Tang et al., 2016). An instance of labelling technique affecting a stock market classification model is training using prior lexical knowledge; auto-labelling uses a lexical base to define the keywords’ polarity as positive, negative or neutral (He & Zhou, 2011; Hasbullah et al., 2016).

The choice of machine learning (ML) classifier depends on the research requirements (Prusa et al., 2015a; Prusa et al., 2016), for example dataset size (large or small), dataset type (numbers or text), features, and parameter sets. This research therefore focuses on these factors because they have a significant direct effect on the reliability and accuracy of the stock market classification model (Jiang, Wang, Cai, & Yan, 2007; Sathyadevan, Sarath, Athira, & Anjana, 2014; Yang, Zhang, Pan, & Xiang, 2015; Giatsoglou et al., 2017).


To predict stock market behaviour, different ML classifiers are used to perform sentiment analysis on Twitter, including the Support Vector Machine (SVM), Decision Trees and Naïve Bayes Classifiers (NBCs) (Marsland, 2015; Raschka & Mirjalili, 2017). These classifiers must be chosen according to the requirements of the research domain and the research data (Tanwani, Afridi, Shafiq, & Farooq, 2009; Ali, Lee, & Chung, 2017).

This research focuses on specific spatial and temporal features to distinguish the relationship of every feature to classification accuracy. It also assumes that expert labelling directly impacts classification accuracy. This assumption requires the parameters to be defined to show the effects of labelling technique individually. As the research domain involves the stock market, the classification findings have to be prompt and precise, with the ability to classify the different labelled datasets that represent real consumer reactions. The features, dataset and research domain that constitute the requirements of the research become the basis for the implementation of sentiment analysis on Twitter. This is done by hybridizing NBCs to produce a stock market classification model with increased accuracy of classification, and hence supporting the stock market decision-making process.

NBCs are selected because Naïve Bayes (NB) models with appropriate pre-processing are competitive with more advanced methods such as the SVM classifier (Rennie, Shih, Teevan, & Karger, 2003; Ting, Ip, & Tsang, 2011). NB models are also essential here for the following reasons. First, the role of each of the proposed features in classification can be demonstrated through the naïve assumption, that is, by assuming the independence of features; NB models simplify the calculation of probabilities by supposing that the probability of each attribute belonging to a given class value is independent of all other attributes (Marsland, 2015; Chandrasekar & Qian, 2016). Even though this assumption is strong, it provides fast and effective results (Ting et al., 2011; Kuhn & Johnson, 2013).
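Written out, the naïve assumption described above takes the standard textbook form below. The symbols are introduced here for illustration only and are not taken from the thesis: c denotes a polarity class and x_1, ..., x_n denote the features of a tweet.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Because each factor P(x_i | c) is estimated separately, adding or removing a single feature changes only one term of the product, which is what makes the contribution of an individual feature easy to isolate.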

Secondly, the role of expert labelling in classification accuracy can be proven by defining parameters that subsequently tune the ratio of the labelled and unlabelled tweets. By hybridizing Semi-Supervised Naïve Bayes (SSNB) (Han, Nan-feng, & Zhao, 2011; Bhattu & Somayajulu, 2012) with Multinomial Naïve Bayes (MNB) (Tan & Zhang, 2008; Witten, Frank, Hall, & Pal, 2016) and Bernoulli Naïve Bayes (BNB) (Di Nunzio & Sordoni, 2012a; Raschka, 2014), the performance of the classification model using an expert labelled dataset and a general auto-labelled dataset (Jain & Mandowara, 2016; Catal & Nangir, 2017) can be judged.
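As a rough illustration of what hybridizing NB variants can look like, the following minimal sketch trains a multinomial and a Bernoulli Naïve Bayes model on the same tokenised tweets and averages their class probabilities (soft voting). It is not the thesis's HNBCs implementation: the function names, the add-one smoothing, and the six toy tweets are all assumptions made for this example, and the semi-supervised component is omitted.

```python
import math
from collections import Counter

def train_nb(docs, labels, binary):
    """Fit a tiny Naive Bayes model on tokenised docs.

    binary=True counts word presence per tweet (Bernoulli variant);
    binary=False counts word occurrences (multinomial variant).
    """
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        counts[label].update(set(tokens) if binary else tokens)
    return {
        "vocab": sorted({w for tokens in docs for w in tokens}),
        "classes": classes,
        "n_docs": Counter(labels),
        "counts": counts,
        "binary": binary,
    }

def predict_proba(model, tokens):
    """Return normalised class probabilities using add-one smoothing."""
    total_docs = sum(model["n_docs"].values())
    present = set(tokens)
    log_post = {}
    for c in model["classes"]:
        cnt = model["counts"][c]
        lp = math.log(model["n_docs"][c] / total_docs)  # class prior
        if model["binary"]:
            # Bernoulli: every vocabulary word contributes, present or absent.
            for w in model["vocab"]:
                p = (cnt[w] + 1) / (model["n_docs"][c] + 2)
                lp += math.log(p if w in present else 1.0 - p)
        else:
            # Multinomial: only the words that occur in the tweet contribute.
            denom = sum(cnt.values()) + len(model["vocab"])
            for w in tokens:
                if w in model["vocab"]:  # ignore words never seen in training
                    lp += math.log((cnt[w] + 1) / denom)
        log_post[c] = lp
    top = max(log_post.values())
    unnorm = {c: math.exp(lp - top) for c, lp in log_post.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def soft_vote(models, tokens):
    """Average the class probabilities of several models and pick the best class."""
    probs = [predict_proba(m, tokens) for m in models]
    avg = {c: sum(p[c] for p in probs) / len(probs) for c in probs[0]}
    return max(avg, key=avg.get), avg

# Toy, invented training data (not from the thesis datasets).
DOCS = [
    ["great", "new", "service"], ["love", "new", "offer"],
    ["bad", "service", "delay"], ["poor", "network", "delay"],
    ["store", "opens", "today"], ["branch", "opens", "monday"],
]
LABELS = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

mnb = train_nb(DOCS, LABELS, binary=False)
bnb = train_nb(DOCS, LABELS, binary=True)
label, avg = soft_vote([mnb, bnb], ["great", "offer"])
print(label, avg)
```

Averaging probabilities rather than counting votes lets a confident model outweigh an uncertain one, which is the usual motivation for soft voting over hard voting.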

Thirdly, NBCs represent four model estimators (Marsland, 2015; Raschka & Mirjalili, 2017); the different models work to support the proposed model by endorsing the evaluation of the final results and the validity of each test using Bayes theorem and Naïve assumption (Sarkar & Sana, 2009; Jain & Mandowara, 2016; Catal & Nangir, 2017).

Ultimately, this research aims to classify consumer reactions in order to help investors in the stock exchange understand a company’s position in the market, based on the classification model’s results. All NB models are simple to build; users are easily trained, and the method provides quick results (Gong & Yu, 2010; Anjaria & Guddeti, 2014; Shubhrata, Kaveri, Pranit, Bhavana & Chate, 2016).


These attributes suit the requirements of the research domain, in which users need rapid training and results together with high accuracy and excellent performance (Bollen, Mao, & Zeng, 2011; Lin & Ryaboy, 2013).

1.2 Problem Statement

The current stock market classification models that utilize sentiment analysis on Twitter suffer from low accuracy in classification, not exceeding 89% when applied to datasets from different sources and in different languages (Zhang, 2013; Meesad & Li, 2014; Skuza & Romanowski, 2015; Navale et al., 2016; Arvanitis & Bassiliades, 2017). The poor accuracy of classification has a direct impact on the reliability of stock market indicators, such as the series of statistical figures and financial reports which explain a stock’s position in the existing stock market (Bratu, Muresan, & Potolea, 2008; Bollen et al., 2011; Ludwig et al., 2013; Lin & Ryaboy, 2013; Li et al., 2016).

Many factors have a direct effect on the accuracy of classification models, such as feature selection, sample size, period of data collection, labelling technique and the classification method (Jiang et al., 2007; Sathyadevan et al., 2014; Yang et al., 2015).

This research focuses on issues regarding feature extraction and selection, tweet labelling, and the tweet classification method.

There are two noticeable problems besides that of selecting the appropriate classification method. The first is the feature selection problem. In the stock market domain, valuable data must include several features such as time, location, targeted audience, brand and types of service. However, the most significant features for decision makers who are looking to invest in the stock market are brand, time and location (Janssen, van der Voort, & Wahyudi, 2017; Sohangir, Wang, Pomeranets, & Khoshgoftaar, 2018). Therefore, the approach can be more useful when extracting or simulating the location of the tweets posted on a specific page and defining it as a feature. Data lacking timestamp and location cannot support decision makers in stock market investment (Ruth & Hannon, 2012; Fernández-Avilés, Montero, & Orlov, 2012; Song & Xia, 2016). Sentiment analysis will not achieve a fine-grained level without timestamp and location (Bollen et al., 2011; Song & Xia, 2016; Luo et al., 2017). In this research, two substantial features have been selected to achieve more accurate classification results: temporal and spatial features, denoting the tweet’s timestamp and location respectively.
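To make the two features concrete, the sketch below pulls a timestamp and a location out of a tweet represented as a dictionary. It assumes the field names of the Twitter API v1.1 tweet object (`created_at`, `place.full_name`, `user.location`); the helper name `spatio_temporal_features`, the chosen output keys, and the sample tweet are hypothetical and are not taken from the thesis.

```python
from datetime import datetime

# Timestamp layout used by `created_at` in Twitter API v1.1 tweet objects,
# e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def spatio_temporal_features(tweet):
    """Return the tweet text together with simple temporal and spatial features."""
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    place = tweet.get("place") or {}   # `place` may be missing or null
    user = tweet.get("user") or {}
    return {
        "text": tweet["text"],
        "date": created.date().isoformat(),
        "hour": created.hour,
        "weekday": created.weekday(),  # 0 = Monday
        # Prefer the tagged place; fall back to the profile location.
        "location": place.get("full_name") or user.get("location") or "unknown",
    }

# Invented sample tweet for illustration only.
sample = {
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "text": "Great new offer at the mall today",
    "place": {"full_name": "Dubai, United Arab Emirates"},
}
features = spatio_temporal_features(sample)
print(features)
```

Falling back to "unknown" keeps tweets without any location in the dataset, so that their absence can itself be treated as a feature value rather than silently dropping rows.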

Secondly, the automatic technique used in the labelling phase affects the accuracy of the classification model in the absence of a specific lexicon. Automatic labelling recognizes sentiments expressed in a given tweet based on existing general lexicons not specifically concerning the research domain (He & Zhou, 2011; Makrehchi, Shah, & Liao, 2013). The weakness here is related to the automatic assigning of polarity (positive, negative or neutral) that affects classification accuracy because it does not carry the real weight of the sentiment for each tweet. Hence, the resulting labels need to be studied and tabulated to represent real users’ sentiments specifically. For a specific domain, it is essential to focus on data relevant to that domain (Tanwani et al., 2009; Jha & Mahmoud, 2017; Giatsoglou et al., 2017).
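The weakness can be seen in a small sketch of lexicon-based auto-labelling. The six-word lexicon, the scoring rule, and the two tweets with their "expert" labels are all invented for this example; they stand in for a general-purpose lexicon and are not the resources used in the thesis.

```python
# Hypothetical general-purpose lexicon: word -> polarity score.
GENERAL_LEXICON = {
    "good": 1, "great": 1, "love": 1,
    "bad": -1, "poor": -1, "hate": -1,
}

def auto_label(text, lexicon=GENERAL_LEXICON):
    """Label a tweet by summing the polarity of the words found in the lexicon."""
    score = sum(lexicon.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# (tweet, label a domain expert might assign) - illustrative pairs only.
examples = [
    ("great new service from the company", "positive"),
    ("dividend cut announced this quarter", "negative"),
]

auto_labels = [auto_label(text) for text, _ in examples]
for (text, expert), auto in zip(examples, auto_labels):
    print(f"{text!r}: auto={auto}, expert={expert}")
```

The second tweet contains no word from the general lexicon, so it is labelled neutral even though a reader familiar with the stock market would treat a dividend cut as negative news.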

The integrated dataset for the classification model must have relevant words to generate the required features, in addition to real polarity, to achieve high classification accuracy (Miranda & Abreu, 2015; Hasbullah et al., 2016). Therefore, the labelling phase must be precisely matched with the research domain and the aim of the classification model (Tanwani & Farooq, 2010; Jha & Mahmoud, 2017). To ensure complete consistency with the research domain, an expert analyst in stock market classification models is employed, because such an expert knows the real polarity of each tweet and can assign the appropriate polarity based on specific knowledge of the research domain (Bing Liu, 2012; Zhang, 2013). For this research, an expert labelling technique was performed manually to label the tweets before the pre-processing and classification phases. The expert’s knowledge is necessary because no specific lexicon exists for the research domain. If existing lexical knowledge does not match the domain of the classification model, the required classification accuracy will not be achieved because accurate polarity is not given to the tweet (Tanwani et al., 2009; Jha & Mahmoud, 2017; Giatsoglou et al., 2017).

Lastly, the supervised learning approach is recommended for developing and building a pattern recognition model (Kuhn & Johnson, 2013; Ali et al., 2017). SVM, Decision Trees, NBCs, K-Nearest Neighbours (KNN) and Neural Networks are widely used classifiers in this domain with various characteristics that could be advantageous in one research area but not in another (Marsland, 2015; Raschka & Mirjalili, 2017).

Therefore, before selecting the classifier it is important to identify its suitability for a given research area and its compatibility with the research domain requirements (Tanwani et al., 2009; Ali et al., 2017).

In this research, NBCs have been selected over the other classifiers listed above as their characteristics are more suitable for the requirements of a stock market classification model using sentiment analysis of Twitter, as follows. First, as mentioned in Section 1.1, this research focuses on a specific kind of spatial and temporal feature in addition to the tweet’s text, i.e. the timestamp and the location. NBCs represent a more useful version of classifier for this domain by supporting the concept of independence, which extends to dealing with collections of more than two events or random variables. This approach overcomes the weakness of the feature independence assumption by sorting each feature into a separate class, such as spatial or temporal. For example, when a tweet includes the text “#DubaiMall very interesting place for all family members”, after feature selection and extraction the document will read “DubaiMall very interesting place for all family members, @Dubai, last Friday morning”. Here @Dubai represents the spatial class (feature) and “last Friday morning” the temporal class (the timestamp feature after conversion from date format into text format).

The second point is that NBCs are highly scalable, requiring a number of parameters that is linear in the number of variables (features/predictors) in a learning problem (Liu et al., 2013; Shoeb & Ahmed, 2017). This research is based on a dataset that reflects consumer reactions and is highly scalable because it includes hashtags (#), short keywords, slang, etc. The third point is that this research focuses on fast training and fast results. This can be achieved by using maximum-likelihood training, which is done by evaluating a closed-form expression in linear time, rather than by the expensive iterative approximation used for many other types of classifier (Gong & Yu, 2010; Bollen et al., 2011; Anjaria & Guddeti, 2014). The fourth point is that NB comprises four models, each one simple to build and implement. In fact, the availability of different models within one model supports the validity of the proposed model’s results by comparing the results of each model using the same dataset at the same time (Sarkar & Sana, 2009; Jain & Mandowara, 2016; Catal & Nangir, 2017).
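The closed-form, linear-time training mentioned above can be illustrated with a small multinomial Naive Bayes written from scratch. This is a sketch under stated assumptions, not the HNBC proposed in this thesis: the function names, the toy tweets and labels are hypothetical, and add-one (Laplace) smoothing is used in place of the pure maximum-likelihood estimate so that unseen word and class pairs do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """One pass over the data: training is just counting (closed form)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)        # log prior
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:                      # ignore unseen words
                continue
            # Add-one smoothing (an assumption of this sketch).
            likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data for illustration only.
docs = ["great service very happy", "happy with fast delivery",
        "bad service very slow", "slow and bad support"]
labels = ["positive", "positive", "negative", "negative"]
model = train_nb(docs, labels)
print(predict_nb(model, "very happy service"))
```

Because the parameters are plain counts, training cost grows linearly with the number of tokens, and no iterative optimisation is involved.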

1.3 Research Questions

To achieve the required classification accuracy, the following questions are addressed:

1. How can the extraction of temporal and spatial features enhance the accuracy of the stock market classification model using sentiment analysis on Twitter?

2. How can the expert labelling technique achieve classification reliability and accuracy?

3. How can hybrid NBCs be constructed to classify datasets of different sizes and languages to enhance the accuracy of a stock market classification model using sentiment analysis on Twitter?

1.4 Research Objectives

The aim of this research is to construct a stock market classification model using sentiment analysis on Twitter based on HNBCs. This research aims to achieve the following objectives:

1. To construct a temporal and spatial function to extract the tweet’s timestamp and location and incorporate them into the processed text to increase the valuable feature set to improve classification accuracy.

2. To demonstrate the role of expert labelling in classification reliability and accuracy enhancement.

3. To construct hybrid NBCs with the ability to classify datasets of different sizes and languages to enhance the accuracy of a stock market classification model using sentiment analysis on Twitter.

1.5 Research Motivation

According to Navale et al. (2016), Khaidem et al. (2016) and Jeon et al. (2017), classification in the stock market remains a major challenge. At the same time, the rapid growth of online stock exchanges calls for more accurate information about companies’ stock behaviour to improve investors’ ability to take appropriate decisions (Guresen et al., 2011; Zhong & Enke, 2017). The available stock market classification models omit temporal and spatial features, which affects the accuracy of the classification (Makrehchi et al., 2013; Zhang, 2013).

From another perspective, research in the area of sentiment analysis and stock market classification uses an auto-labelling technique based on a general lexicon, negatively affecting the classification accuracy (Miranda & Abreu, 2015; Hasbullah et al., 2016).

Currently, many Middle East countries, including Iraq, Syria, Lebanon and Yemen, are facing political instability and financial crises, which affect investment and the countries’ security. This has a direct effect on the economic situation, prompting most local businessmen to transfer their capital to neighbouring countries with greater political and economic stability, like the Gulf Cooperation Council (GCC) countries (Yousif, 2014; Elyazji, 2015), specifically the United Arab Emirates (UAE) and Saudi Arabia (SA), because these two countries represent the business centre for both the GCC and the Middle East as a whole (Abdulqader, 2015; Santos, 2015). The majority of these foreign investors are investing in the stock market. Hence, there must be a way to guide and help them to choose the most heavily traded and most profitable shares in the UAE and SA stock markets (DFM, 2014; Gulf News, 2015). Designing and implementing a stock market classification model that fits the situation and requirements of the GCC stock markets will be helpful to both foreign and local investors.

1.6 Research Scope

The scope of this research is data pre-processing and sentiment analysis as a process to infer tweets’ polarity by implementing HNBCs as an ML classifier. Pre-processing prepares the data and extracts the necessary features before classification; sentiment analysis assigns the appropriate polarity to each tweet, based on domain knowledge; and classification is a data mining function that assigns clauses in a collection to target categories or classes (Bollen et al., 2011; Meesad & Li, 2014). The goal of classification is to accurately predict the target class for each clause in the dataset (Tan, Steinbach, & Kumar, 2006; Kesavaraj & Sukumaran, 2013).

This research concentrates on consumer reactions and stock market behaviour for the Almarai Company, the Dubai Mall (DM), Etisalat UAE, Dubai Marina Mall (DMM), and AlSafi Arabia (ASA), investigating these five companies’ tweets and stock separately. Almarai and ASA are represented on the SA stock market, DM and DMM on the Dubai Financial Market (DFM) under the title EMAAR Malls, and Etisalat UAE on the Abu Dhabi Securities Exchange (ADX). Tweets from all five companies have been collected from their official pages on Twitter.

The research also needs a suitable dataset to achieve its objectives, so the number of Twitter followers for each company is another important reason for selecting these five companies. Almarai (@almarai) has more than 447,000 followers on Twitter, DM (@TheDubaiMall) more than 800,000, Etisalat UAE (@etislalat) more than 2,000,000, DMM (@DXBMarinaMall) more than 9,000, and ASA (@alsafiarabia) more than 40,000.

Collecting and analyzing data from social networks such as Twitter is necessary to obtain useful information to serve and develop a particular research area. It means building a classification model using sentiment analysis based on ML methods like NBCs, which are probabilistic supervised ML classifiers that assume that each dataset feature is independent (Bing, Chan, & Ou, 2014; Abdelwahab et al., 2015). The characteristics of supervised ML classifiers make them the most widely used in data mining applications (Islam, Wu, Ahmadi, & Sid-Ahmed, 2007; Nagwani & Verma, 2014). One of the most popular social networks is Twitter, the fastest growing online social networking service, with more than 140 million users sharing around 400 million tweets a day. Tweets posted on Twitter share everything from daily life stories to the newest local or international events. These tweets enable users and organizations to collect valuable information in different domains (Atefeh & Khreich, 2015), and their scope means that different organizations regard them as datasets of different resources related to all kinds of events, news, economic data such as stock markets, financial data associated with market indicators, e-commerce and marketing statistics (Bollen et al., 2011; Geser, 2011; Thompson, 2008).

In recent years, one of the most popular research areas using sentiment analysis on Twitter is the stock market classification model (Bollen et al., 2011; Xu & Keelj, 2014). Investors are worried about stock behaviour; they need a classification model to help them reduce the decision risk (Bollen et al., 2011; Xu & Keelj, 2014). This classification model depends on mining consumer opinions on everything relating to different kinds of products, goods and services for a specific company. These consumers’ reactions are converted into a pattern that builds a stock market classification model with acceptable accuracy in the exchanges (Bollen et al., 2011; Meesad & Li, 2014).

The sentiment analysis process starts with feature extraction and ends with labelled words from the dataset that define the polarity of each word as positive, negative or neutral (Liu, 2010; Kuhn & Johnson, 2013). The process of defining and tagging the polarity of words is not easy because it depends on knowledge of the research domain, whether stock market, health or education (Tanwani et al., 2009; Padmaja & Fatima, 2013). Sentiment analysis also affects the training set before classification, through the preparation and detection of polarity for the dataset, and helps to improve the classification accuracy (Abdelwahab, Bahgat, Lowrance, & Elmaghraby, 2015; Canuto et al., 2016), which is the objective of this research.

1.7 Research Significance

This research aims to improve the accuracy of the stock market classification model using sentiment analysis on Twitter. Increasing the classification accuracy will improve the reliability of the generated indicators and reports based on the classification results, which together will reduce the risk in decision making. This study will also draw researchers’ attention to the role of feature engineering in the extraction and incorporation of temporal and spatial features in the text of tweets, labelling techniques to assign the appropriate polarities, the suitability of ML classifiers to the research requirements, and finally the hybridization of the ML classifiers in improving classification accuracy.

1.8 Thesis Organization

This thesis is structured as follows:

Chapter 2 reviews the literature on the stock market and classification, sentiment analysis, ML classifiers, the latest research in stock market classification using sentiment analysis on Twitter based on the ML classifier, social media data source, labelling techniques, data pre-processing, feature engineering, data representation, Naive Bayes classifiers, ML hybridization methods, and finally the classification model and performance evaluation.

In Chapter 3, the implemented methodology is described thoroughly, including data collection, expert labelling techniques, data pre-processing, classification using HNBCs, measurement equations, stock behaviour rules, and an experimental test which shows the phases of the proposed framework and initial results. Chapter 4 presents the implementation of the framework phases, from data collection to recognizing the stock’s behaviour.

Chapter 5 presents the results and discussion. Chapter 6 lists the research contributions and makes recommendations for future work.

CHAPTER TWO LITERATURE REVIEW

2.1 Overview

This chapter reviews the literature related to the research domain by surveying the theoretical and empirical studies on the stock market classification model, sentiment analysis, studies that employed supervised ML in sentiment analysis, data sources, labelling techniques, data pre-processing, feature engineering, spatial and temporal decision making, data representation, classification, Naïve Bayes classifiers, hybridization and ensemble learning, and performance and evaluation.

2.2 Stock Market Classification Model

This section reviews the literature related to stock market classification models, ML and sentiment analysis. The main goal of the research is to model stock market classification to serve decision makers and investors by increasing the accuracy of textual classification. Stock market classification and forecasting has become an especially popular field of research because of its business applications and the high stakes and benefits that it has to offer (Majhi, Panda, Sahoo, Dash, & Das, 2007; Oliveira, Cortez, & Areal, 2017). Making the right decision to buy or sell on the stock exchange is a challenging task for financial time series forecasting. It can be frustrating because of the multi-faceted nature of the stock market, with its noisy and hard-to-classify environment (Zhang & Wu, 2009; Ticknor, 2013). Online data are changeable, and it is important to assess whether there is a relationship between the general public’s reaction and an organization’s stock. One approach is to break down the public reaction to an organization in order to plan around the movement of the stock. Unfortunately, the stock market is essentially dynamic, non-linear, non-parametric, complicated and chaotic (Tan, Quek, & Ng, 2005; Li et al., 2016). Moreover, stock market movements are influenced by numerous macro-economic factors (Gay, 2016; Borio, Gambacorta, & Hofmann, 2017), for example, institutional investors’ choices, the psychology of investors, bank exchange rates, political events, movements of other stock markets, general economic conditions, firms’ policies, investors’ expectations, commodity price index, bank rate, newspapers and quarterly and annual reports. The rationale for a sound classification model based on online networking is that numerous organizations have official online accounts in order to stay in contact with their customers. This enables the customers to express their views or report on the organization’s products.

Zhang (2013), Qasem, Thulasiram, and Thulasiram (2015), Cakra and Trisedya (2015), and Kordonis, Symeonidis, and Arampatzis (2016) have developed stock market classification models using sentiment analysis of English-language tweets. Hamed et al. (2015), Hamed et al. (2016), and AL-Rubaiee, Qiu, Alomar, and Li (2018) have developed similar models using sentiment analysis of Arabic tweets. All these researchers have implemented baseline supervised ML to classify tweets after completing the sentiment analysis process. ML involves programming computers to optimize a performance criterion using sample data or experience (Alpaydin, 2009; Baştanlar & Özuysal, 2014; Kokalj-Filipovic, Greco, Poor, Stantchev, & Xiao, 2018). The goal of ML is to develop methods that can automatically detect patterns in data and then to use these patterns to predict future data or other outcomes of interest (Harrington, 2012; Marsland, 2015). Figure 2.1 shows fields which use ML.

Figure 2.1. Multiple Types of ML and Associated use-cases (Marsland, 2015)

Figure 2.1 shows that the machine-learning approach is divided into three parts: supervised learning, reinforcement learning, and unsupervised learning. Supervised learning is further divided into decision trees, linear and probabilistic classification, and so forth. Probabilistic classification is divided into Bayes types, and linear classification into neural networks and SVM. The overall process supports learning sentiment analysis (Pang & Lee, 2008; Schumaker et al., 2012). The supervised learning approach is recommended for pattern recognition modelling (Kuhn & Johnson, 2013; Ali et al., 2017), and supervised ML classifiers use different computational methods, for example, NNs, SVM, Decision Trees, Random Forest (RF), KNNs, Linear Regression (LR), and NBCs. Over the last ten years, SVM, NB, and NNs have become the most widely used in the domain of modelling stock market classification.

Sentiment analysis can be described as the study of people’s emotions, their attitude or behaviour towards specific news or events and their opinions of these events (Schumaker et al., 2012; Rosenthal, Farra, & Nakov, 2017). Sentiment analysis is normally considered to be the result of a procedure which starts with identification of a person’s sentiment or emotion towards an event, then selects features to classify that sentiment and finally polarizes the sentiment (Yu, Wu, Chang, & Chu, 2013; Rosenthal et al., 2017). Sentiment analysis is also a way of investigating product surveys on the Web, to reach a consensus about an item (Ghorpade & Ragha, 2012; Bhattacharjee, Das, Bhattacharya, Parui, & Roy, 2015). Reviews represent user-generated content, an increasingly important resource and a rich asset for advertising groups, sociologists, psychologists and others who may want to gauge the public mood or individual attitudes (Tang, Tan, & Cheng, 2009; Abdelwahab et al., 2015).

Sentiment analysis classification has three levels: the document level ascertains whether the sentiment or emotion expressed is positive or negative; the sentence level studies the sentiments and opinions in sentences, whether subjective or objective (Liu, 2010; Giatsoglou et al., 2017); and the aspect level examines different aspects of the opinions or emotions expressed by different persons. Some aspects may indicate positive opinions, some negative, and some both (Schumaker et al., 2012; Yu et al., 2013). Nowadays, sentiment analysis represents the core of research that identifies public opinions regarding electronic commerce, management, political figures, trade, etc. It can be applied in many fields and serves the decision makers in different research domains. However, it needs homogeneous data, such as consumer reactions to a specific kind of service or product, to give a valid result of positive, negative or neutral sentiment. One of the richest resources for opinion and data mining is social media. Sections 2.2.1 and 2.2.2 review the latest stock market classification models using sentiment analysis on Twitter.

2.2.1 Stock Market Classification Model using Sentiment Analysis on English Tweets

Zhang (2013) has developed a stock market classification model using sentiment analysis of English tweets. He first tested the effectiveness of different ML methods on identifying a positive or negative sentiment from the tweets, applying his findings to two tasks: first, examining the relationship between Twitter sentiment and the stock prices; and secondly, locating which words in the tweets correlate to the impact in stock prices through a post-analysis of price shift and the tweets. Figure 2.2 shows the stock market classification model developed by Zhang (2013).

Figure 2.2. The Proposed Model for Stock Price and Significant Keyword Correlation

As shown in Figure 2.2, the model mines tweets using the Twitter Application Programming Interface (API). Three baseline ML methods were applied: NB, Maximum Entropy (ME), and SVM.

Qasem, Thulasiram, and Thulasiram (2015) also proposed a stock market classification model using sentiment analysis on English tweets. The model has four steps: the first and second are data collection and pre-processing, both performed by Twitter API; the third phase is feature engineering, implemented using Term Frequency-Inverse Document Frequency (TF-IDF) for feature representation; the training phase is the last step, performed by dividing the dataset into 70% for training and 30% for testing. Figure 2.3 shows Qasem et al.’s (2015) model.

Figure 2.3. The Proposed Model by Qasem et al. (2015)
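The feature representation and data split described for this model can be sketched in a few lines. The code below is a minimal illustration, assuming one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); Qasem et al.'s exact weighting scheme and tooling are not given in the text, and the function names and toy tweets are hypothetical.

```python
import math
import random
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors


def split_70_30(items, seed=0):
    """Shuffle a copy of the items and split it 70% / 30%."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy tweets for illustration only.
docs = ["stock rises today", "stock falls today", "stock price up",
        "stock price down", "stock rally", "stock slump", "stock gains",
        "stock losses", "stock steady", "stock volatile"]
vectors = tf_idf(docs)
train, test = split_70_30(list(zip(vectors, docs)))
print(len(train), len(test))
```

A term that occurs in every document, such as "stock" here, receives a weight of zero, which is the property that makes TF-IDF useful for down-weighting uninformative words.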

Qasem et al. (2015) supported their model with Azure ML, as shown in Figure 2.4.

Figure 2.4. MS. Azure ML (Mund, 2015)

They applied the SVM classifier. MS Azure ML can enhance performance by defining, developing, and modifying functions inside the selected method(s).

Cakra and Trisedya (2015) proposed a model to classify the Indonesian stock market using simple sentiment analysis based on ML algorithms including SVM, NB, Decision Tree, RF, and Neural Network algorithms to classify tweets and then to analyze sentiment regarding a firm. They used two classification algorithms: RF and NB. Figure 2.5 shows the model proposed by Cakra and Trisedya (2015), based on the linear regression method.

Figure 2.5. The Proposed Model by Cakra and Trisedya (2015)

Finally, Kordonis, Symeonidis, and Arampatzis (2016) developed a model to collect and process past tweets and then test the effectiveness of different ML methods, such as BNB and SVM, for assigning a positive or negative sentiment to each tweet. They employed the same ML methods to analyze how the tweets are correlated with stock market price behaviour. Their main goal was to predict future stock behaviour from sentiment analysis of tweets collected over the previous few days, and to examine whether the proposed model was applicable in the domain of the stock market. Figure 2.6 shows the model proposed by Kordonis et al. (2016).

Figure 2.6. The Proposed Model by Kordonis et al. (2016)

As shown in Figure 2.6, the proposed model has four phases as follows: (a) data collection using Twitter API, (b) data pre-processing, (c) classification using BNB and SVM, and (d) defining the correlation between the tweets collected over the last few days and the stock market’s behaviour based on historical stock data using Yahoo finance API.
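Phase (d) relates tweet sentiment to stock behaviour. One common way to quantify such a relationship is the Pearson correlation coefficient; the sketch below uses it purely as an illustration and does not claim to reproduce the measure used by Kordonis et al. The daily sentiment scores and returns are invented values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical data: mean tweet polarity per day and the same day's
# percentage price change.
daily_sentiment = [0.2, -0.1, 0.4, 0.0, -0.3]
daily_return = [0.5, -0.2, 0.9, 0.1, -0.6]

r = pearson(daily_sentiment, daily_return)
print(f"correlation: {r:.3f}")
```

A value near 1 or -1 would indicate a strong linear relationship between the two series, while a value near 0 would indicate none. In practice the sentiment series is often shifted by one or more days before the correlation is computed.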

2.2.2 Stock Market Classification Model using Sentiment Analysis on Arabic Tweets

Hamed et al.’s (2015) study is one of the few to be conducted in Arabic sentiment analysis. It demonstrates the importance of the pre-processing step as a key factor in achieving a high level of accuracy of sentiment analysis. The research was introduced for Saudi stock market tweets. It aimed to illustrate the relationship between Saudi tweets and the Saudi market index, using different implementations of SVM, KNNs and NB algorithms. Figure 2.7 shows the model proposed by Hamed et al. (2015). The step of weighting comes after applying the most relevant of these data mining approaches in order to build a classification model based on the sentiment polarity of the tweet.

Figure 2.7. The Proposed Model by Hamed et al. (2015)

These authors have developed a small desktop application to collect the corpus of Arabic tweet data from Twitter; it was created under the C# environment supported by the official developer’s APIs of Twitter. The main function of the application is to collect, label and save the related tweets, as well as removing those tweets that are not relevant. They considered their Twitter micro-blog as a platform for trading opinion mining in the Saudi stock market.

Hamed et al. (2016) also proposed a sentiment analysis model for the SA stock market using sentiment analysis on Arabic tweets. It classifies the tweets into positive, negative or neutral, focusing on the role of the neutral class. The model was built based on the hybridization of ML classifiers such as SVM and NB, and on data pre-processing. It aims to show and define the role of the polarity class (label) by classifying the collected tweets in Arabic into three polarities: positive, negative, and neutral. Again, there are four stages: data collection, pre-processing, classification, and model evaluation. Figure 2.8 shows this model.

Figure 2.8. The Proposed Model by Hamed et al. (2016)

The last model in the domain of Arabic tweets was proposed by AL-Rubaiee, Qiu, Alomar, and Li (2018); they aimed to enhance the labelling process, which has a direct impact on the reliability of the classification. They proposed improvements to the expert knowledge via two approaches: first, by defining neutral to include tweets with both positive and negative polarity; and second by relabelling. 2000 tweets were collected using Twitter API (all in Arabic), and they classified the dataset using SVM.

The classification phase aims to show the differences in accuracy when using the original labelling process and the improved labelling process. Figure 2.9 shows their proposed model.

Figure 2.9. The Proposed Model by AL-Rubaiee et al. (2018)

In summary, the various classifiers used by researchers to perform sentiment analysis for stock market classification are SVM (Go, Bhayani, & Huang, 2009; Kolchyna, Souza, Treleaven, & Aste, 2015), KNNs (Barbosa & Feng, 2010), Expectation Maximization (EM) (Yengi, Karayel, & Omurca, 2015) and NB (Bingwei Liu, Blasch, Chen, Shen, & Chen, 2013; Narayanan, Arora, & Bhatia, 2013; Chandrasekar & Qian, 2016). The review indicates that the process of selecting the ML classifier or classifiers was based on the availability of the classifier rather than the research and model requirements.

2.3 Data Source for Stock Market Classification Model

This section presents facts about Twitter as the main data source for the stock market classification models that use sentiment analysis on Twitter, and it describes how the latest models reviewed in Section 2.2 collected the required dataset.

2.3.1 Twitter as a Data Source

Data has become the currency of this era, constantly growing in size and value. The data available online keeps multiplying in size (Gantz & Reinsel, 2011). The amount of information produced online in 2013 was 4.4 Zettabytes (ZB), and it is expected that the information created in 2020 will reach 44 ZB. Individual users are the primary source of these data, accounting for 75 percent of all data produced (Gantz & Reinsel, 2011; Perrin, 2015). In this respect, online social networking has become a data source and a key player in disseminating information to affected entities in a crisis. Smith’s (2010) findings suggest that around 33% of consumers, investors and young people who are active online communicators use platforms such as microblogging, online reports (organization financial reports), text messaging, tweets (posts) and online social networking. These online networking platforms make it easy to communicate ideas, assessments, views and considerations concerning critical organizational issues. In fact, 73 percent of users have been documented as engaging with social networking because of their interest, information dissemination and the ease of communication they had achieved (Perrin, 2015). The most popular site was Twitter, with 68 percent. Consumers used tweet data to compare and contrast with organizations’ financial reports in order to decide on the best products, organizations’ strategy and information about the stock price. Furthermore, in 2006 Twitter saw the potential in giving users or followers a chance to share their thoughts and sentiments and launched its platform with the mission: "To give everybody the ability to make and share thoughts and information immediately, without obstructions" (Arceneaux & Schmitz Weiss, 2010). In 2014, the demographic profile of Twitter was roughly equal with respect to gender. The 18-29 and 30-49 age ranges both had higher than average penetration, while the 50-64 and 65+ ranges stayed low at roughly 10%. With respect to location, penetration is higher for the urban and suburban population, while the rural population lags behind (Duggan, Ellison, Lampe, Lenhart, & Madden, 2015).

Currently, Twitter is the best-known microblogging tool among its existing counterparts and has been covered widely in the public media; for instance, it has been used for business communications, product information and political campaigns, and by news organizations. It is the microblogging service that has become one of the fastest growing trends on the Internet, with an exponentially increasing user base exceeding 190 million users in July 2010. On Twitter each user can post short messages with a maximum of 280 characters, so-called tweets, which are visible publicly or semi-publicly (e.g. limited to the user's designated contacts) on a message board of the site or through other applications.

Its founders’ original idea was to provide a service that enables individual or organization status updates. Due to its popularity, tweets cover every conceivable topic, product information among them. The public timeline conveying the tweets of all users worldwide is an extensive real-time information stream of 65 million messages per day (Java, Song, Finin, & Tseng, 2007).

Java et al. (2007) and Naaman, Boase, and Lai (2010) have demonstrated that individuals use Twitter for different purposes. The intentions of up to 94,000 users were studied using a data source of 1.3 million tweets between businesses and consumers in order to understand their reactions and opinions (Java et al., 2007). The intentions behind using Twitter have been classified into four forms: conversation, reporting news, sharing information/Uniform Resource Locators (URLs), and daily chatter. Java et al. (2007) reported that there was great value in Twitter content: individual whereabouts information, opinions on headline news, and links to articles and news. Twitter is used for a range of social purposes, including (a) releasing emotional stress; (b) raising the visibility of interesting things to one's social networks; (c) gathering useful information for one's profession or other personal interests; (d) seeking help and opinions; and (e) staying in touch with friends and colleagues. On Twitter, different kinds of tweets are distinguished, such as singleton, re-tweet, reply, direct message and mention.

Finally, Culnan (2010) reported that Twitter has emerged as a new channel that firms continually use to create and capture business value. Furthermore, Deans (2011) states, "Twitter tweets are changing the way firms can relate and connect with their consumers, and the way they can communicate and team up inside with their workers to provide customer’s needs. Twitter tweets have brought about the rebuilding of the advertising capacity, and in addition the way firms consider their associations with consumers, business associates and inside representatives".

2.3.2 Data Availability from Twitter using API

Twitter is a widespread microblogging service where users post status messages (called "tweets") (Arceneaux & Schmitz Weiss, 2010). These tweets sometimes express sentiments about various topics. Go et al. (2009) and Rout et al. (2018) identified the Twitter Streaming API and the Twitter Representational State Transfer (REST) API as the two types of API used to collect tweets. Twitter's data is exposed through APIs (Makice, 2009; Ifrim, Shi, & Brigadir, 2014; Trupthi, Pabboju, & Narasimha, 2017). The APIs provide the dataset encoded in JavaScript Object Notation (JSON). The JSON describes each tweet through a set of attributes (features) and the relationships between them (Russell, 2013; Ifrim et al., 2014). Each tweet has a group of attributes that starts with the tweet id, creation time (timestamp), geo-information, and user id, and ends with the Twitter name (Kumar, Morstatter, & Liu, 2014; Gentry, 2016). According to Russell (2013) and Wijeratne et al. (2017), a tweet includes these attributes alongside the tweet's text. The text itself may also contain other elements such as hashtags, names, numbers, and symbols. Figure 2.8 shows a tweet posted on the wall of Downtown Dubai at 4:54 AM-29 Dec 2017.

Figure 2.10. Tweet's Attributes
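The attribute set described above (tweet id, timestamp, geo-information, user id, Twitter name, and text with hashtags) can be illustrated with a short parsing sketch. The payload below is made up, and its field names (`id`, `created_at`, `coordinates`, `user`, `entities`) are assumed to follow the classic Twitter REST API JSON layout; the function name `extract_attributes` is hypothetical and is not the extraction procedure used in this study.

```python
import json
from datetime import datetime

# Made-up sample payload; field names are assumed to follow the
# classic Twitter REST API JSON layout.
raw = """
{
  "id": 1,
  "created_at": "Mon Jan 01 00:00:00 +0000 2018",
  "text": "Example status #stocks",
  "coordinates": {"type": "Point", "coordinates": [55.27, 25.2]},
  "user": {"id": 42, "screen_name": "example_user"},
  "entities": {"hashtags": [{"text": "stocks"}]}
}
"""


def extract_attributes(tweet: dict) -> dict:
    """Pull out the attributes discussed in the text from one tweet."""
    coords = tweet.get("coordinates") or {}  # may be null when not geo-tagged
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return {
        "tweet_id": tweet["id"],
        "timestamp": datetime.strptime(
            tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"
        ),
        "geo": coords.get("coordinates"),  # [longitude, latitude] or None
        "user_id": tweet["user"]["id"],
        "screen_name": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "hashtags": [h["text"] for h in hashtags],
    }


attrs = extract_attributes(json.loads(raw))
print(attrs)
```

Keeping the timestamp as a timezone-aware `datetime` and the geo-information as a coordinate pair leaves both available as features at later stages, in line with the emphasis on tweet timestamp and location.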
