
USER TRAVERSAL BEHAVIOUR MINING OF SERVER LOGS USING FUZZY FRS

BY

ROSLI OMAR

A thesis submitted in fulfilment of the requirement for the degree of Doctor of Philosophy in Information Technology

Kulliyyah of Information and Communication Technology International Islamic University Malaysia

APRIL 2018


ABSTRACT

Web Usage Mining (WUM) is the application of data mining methods to extract potentially useful information from web usage data. Its applications include improving website design, personalised services and target marketing. Although WUM has been studied extensively, the lack of related commercial products indicates that a number of research issues in this area remain outstanding. The challenges mentioned in the literature include inefficiency in mining typically large weblogs, extracted patterns that are not representative of actual user behaviour, and mining results that are too general, uninteresting and lacking in insight. This thesis attempts to address these problems in three parts. Firstly, based on the notion of regularity, a mining algorithm is introduced to efficiently extract usage patterns that reflect individual user behaviour from large weblogs. Secondly, a fuzzy method is incorporated into the algorithm to express pattern quality, thus reducing possible confusion caused by an extremely large number of patterns. Finally, to gain deeper insight into the extracted patterns, the algorithm is further extended using the framework of transitional patterns to capture possible variation in pattern behaviour, thus facilitating the subsequent pattern interpretation process. The promising results obtained from a series of experiments suggest that the new algorithm is faster and more scalable than an existing one, especially when mining large weblogs. Furthermore, the extracted patterns represent user traversal behaviour better, contain less ambiguity, and are more readily interpretable for subsequent analysis.


ABSTRACT IN ARABIC

Web Usage Mining (WUM) is the application of data mining methods to extract potentially useful information from web usage data. Its applications include improving website design, personalised services and target marketing. Although WUM has been studied extensively, the absence of a strategy for commercialising related products indicates that a number of research issues in this area remain open. Among the challenges mentioned in previous studies are inefficiency when mining web logs, which are usually large; extracted patterns that do not represent actual user behaviour; and mining results that are general, unexamined and lacking in insight. This study attempts to address these problems in three parts. First, based on the notion of regularity, a mining algorithm is introduced to extract, with high efficiency, usage patterns from large web logs that reflect individual user behaviour. Second, a fuzzy method is incorporated into the algorithm, which enables the quality of a pattern to be expressed and thereby reduces the confusion that may arise from the large number of patterns. Finally, to gain deeper insight into the extracted patterns, the algorithm is further extended using the transitional pattern framework to capture possible variation in pattern behaviour, which facilitates the subsequent interpretation of the patterns. The study obtained promising results from a series of experiments, among them that the new algorithm is faster and more scalable than previous algorithms, especially when mining large logs. Furthermore, the extracted patterns give a better representation of user traversal behaviour, contain less ambiguity, and are easier to interpret and analyse subsequently.


APPROVAL PAGE

The thesis of Rosli bin Omar has been approved by the following:

_____________________________

Zainatul Shima Abdullah Supervisor

_____________________________

Abu Osman Md Tap Co-Supervisor

_____________________________

Mira Kartiwi Internal Examiner

_____________________________

Mustafa Mat Deris External Examiner

_____________________________

Shoab Ahmed Khan External Examiner

_____________________________

Amir Akramin Shafie Chairman


DECLARATION

I hereby declare that this thesis is the result of my own investigations, except where otherwise stated. I also declare that it has not been previously or concurrently submitted as a whole for any other degrees at IIUM or other institutions.

Rosli bin Omar

Signature ... Date ...


COPYRIGHT

INTERNATIONAL ISLAMIC UNIVERSITY MALAYSIA

DECLARATION OF COPYRIGHT AND AFFIRMATION OF FAIR USE OF UNPUBLISHED RESEARCH

USER TRAVERSAL BEHAVIOUR MINING OF SERVER LOGS USING FUZZY FRS

I declare that the copyright of this thesis is jointly owned by the student and IIUM.

Copyright © 2018 by Rosli bin Omar and International Islamic University Malaysia. All rights reserved.

No part of this unpublished research may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without prior written permission of the copyright holder except as provided below:

1. Any material contained in or derived from this unpublished research may only be used by others in their writing with due acknowledgement.

2. IIUM or its library will have the right to make and transmit copies (print or electronic) for institutional and academic purposes.

3. The IIUM library will have the right to make, store in a retrieval system and supply copies of this unpublished research if requested by other universities and research libraries.

By signing this form, I acknowledge that I have read and understood the IIUM Intellectual Property Right and Commercialization policy.

Affirmed by Rosli bin Omar

……..……….. ………..

Signature Date


This thesis is dedicated to my beloved father, Omar, and my beloved daughter, Amalina, both of whom have peacefully returned to their Lord. May Allah shower His endless mercy and forgiveness upon their souls.


ACKNOWLEDGEMENTS

All glory is due to Allah the Almighty, whose grace and mercy have been with me throughout the duration of my studies. Although it has been a challenging experience, His blessings have kept me persevering until the end.

I am most indebted to my supervisor, Assistant Professor Dr Zainatul Shima Abdullah, whose in-depth knowledge of the subject matter, endearing disposition and kindness facilitated the successful completion of my work. I appreciate her detailed comments and useful suggestions, which have considerably improved this thesis. Her moral support and encouragement have indeed helped me through this process. I am also highly grateful to Prof Dr Abu Osman Md Tap and Dr Media Ayu for laying a strong foundation for my work, as well as for providing invaluable guidance, especially at the beginning of this journey.

This thesis would not have been possible without the sacrifice and understanding of my beloved wife Suriana, whose strength and support made it possible for me to continue with the uphill journey of completing my studies. Thanks also for checking the thesis. To my children, thank you for your prayers; I hope you will draw inspiration from this work. Last but not least, my deepest appreciation goes to my beloved Mother, whose sacrifice, inspiration and guidance helped me become who I am today.

Once again, glory be to Allah for His endless bounties on me, one of which is completing this thesis. Alhamdulillah.


TABLE OF CONTENTS

Abstract
Abstract in Arabic
Approval page
Declaration
Copyright
Dedication
Acknowledgements
List of Tables
List of Figures

CHAPTER ONE: INTRODUCTION
1.1 Introduction
1.2 Background of the Study
1.3 Statement of the problem
1.4 Research Questions
1.5 Research Objectives
1.6 Significance of study
1.7 Research Design
1.8 Organization of the thesis

CHAPTER TWO: REVIEW OF LITERATURE
2.1 Introduction
2.2 Data Mining
2.2.1 Pattern and Model
2.2.2 Pattern Discovery
2.3 Sequential Pattern Mining
2.3.1 The Apriori Approach
2.3.1.1 Generalised Sequential Patterns (GSP) algorithm
2.3.2 Apriori Vertical Layout Database
2.3.2.1 SPADE Algorithm
2.3.3 The Pattern Growth Approach
2.4 Constraint-Based Sequential Patterns
2.5 Web Usage Mining
2.5.1 Applications of WUM
2.5.2 The Main Processes of Web Usage Mining (WUM)
2.5.3 Existing Researches in WUM
2.5.3.1 Basic WUM Algorithm
2.5.3.2 Utility Patterns
2.5.4 Transitional Pattern
2.6 Fuzzy Theory for Sequential Data Mining

CHAPTER THREE: METHODOLOGY
3.1 Introduction
3.2 Theoretical Framework
3.2.1 Pre-Processing Step
3.2.2 Pattern Discovery
3.2.3 Pattern Analysis
3.3 Proposed Research Framework
3.3.1 Pre-processing of Web Log Data
3.3.2 Pattern Discovery
3.3.2.1 Frequent-Regular Sequential Patterns
3.3.2.2 Fuzzy FRS-pattern
3.3.2.3 Transitional FRS-pattern
3.3.3 The Proposed Algorithm
3.4 Experimental Design
3.4.1 Test Environment and Datasets
3.4.2 Pre-Linked Web Access Pattern (PLWAP) as a Benchmark Algorithm
3.4.3 Performance Comparison between FRS and PLWAP algorithms
3.4.4 Pattern’s Knowledge Representation
3.4.5 Scalability Test

CHAPTER FOUR: REGULAR SEQUENTIAL PATTERNS
4.1 Introduction
4.2 Overview of Frequent-Regular Sequential Patterns (FRS)
4.3 Regularity as a Measure of User Behaviour
4.3.1 Efficient Candidate Enumeration
4.3.2 Downward closure property
4.3.3 Reduction of Recursions
4.3.4 Vertical Database
4.4 Problem Definition
4.5 The Proposed FRS Algorithm
4.6 Mining Frequent Regular Sequential Patterns
4.7 Experiments
4.7.1 Description of Input Data
4.7.2 Experiment 1: Processing Time of FRS Components
4.7.3 Experiment 2: FRS Performance Using Various Set Of Thresholds
4.7.4 Experiment 3: Performance Comparison between FRS and PLWAP
4.7.5 Experiment 4: Scalability of FRS
4.7.6 Experiment 5: Quality of Sequential Patterns
4.8 Conclusion

CHAPTER FIVE: FUZZY FREQUENT-REGULAR SEQUENTIAL PATTERNS
5.1 Introduction
5.2 Problem Definition
5.3 Linguistic Variables
5.4 Mining Fuzzy Frequent Regular Sequential Patterns
5.5 Experiment
5.6 Conclusion

CHAPTER SIX: TRANSITIONAL FREQUENT-REGULAR SEQUENCES
6.1 Introduction
6.2 Transitional Patterns
6.3 Problem Definition
6.4 The Trans-Frs Algorithm
6.5 The Experiments
6.6 Conclusion

CHAPTER SEVEN: DISCUSSIONS AND CONCLUSION
7.1 Introduction
7.2 Regular Sequential Patterns
7.3 Fuzzy Regular Sequential Patterns
7.4 Transitional Frequent-Regular Pattern (Trans-Frs)
7.5 Contributions Of Research
7.5.1 Theoretical Contributions
7.5.2 Practical Contributions
7.5.3 Methodological Contribution
7.6 Limitations Of The Study
7.7 Future Research Areas

REFERENCES
APPENDIX A: FRS SOURCE CODES IN C
APPENDIX B: PLWAP SOURCES CODES


LIST OF TABLES

Table 1.1 Mapping of Research Questions and Objectives
Table 2.1 Sample entries of a web usage log
Table 3.1 CLF Format Description
Table 4.1 Example of Sequence Database
Table 4.2 Sample Entries of a pre-processed Web Log
Table 4.3 Sequence Db after Sessionisation
Table 4.4 Web Usage Sequence Db
Table 4.5 Sequence Db
Table 4.6 2-FRS Matrix
Table 4.7 News Categories of MSNBC Dataset
Table 4.8 Comparison of Number of Extracted Patterns and corresponding Processing Time
Table 4.9 Selected FRS Patterns Extracted from MSNBC Web Log
Table 5.1 Sequence Db
Table 5.2 Extracted Fuzzy FRS Patterns
Table 6.1 Comparison of Extracted Patterns of FRS and PLWAP
Table 6.2 Sequence Db
Table 6.3 Time Points in Sequence Db
Table 6.4 Sequence Db
Table 6.5 Description of Notations
Table 6.6 Extracted FRS Sequences with Fuzzy Quality Value
Table 7.1 Comparison of FRS and PLWAP
Table 7.2 Scalability: FRS vs PLWAP
Table 7.3 Analysis of Selected FRS Patterns


LIST OF FIGURES

Figure 2.1 Main Processes of WUM
Figure 3.1 Theoretical Framework of WUM
Figure 3.2 CLF Format
Figure 3.3 Research Framework
Figure 3.4 FRS Algorithm
Figure 3.5 Experimental Design
Figure 4.1 An Example of Sequence Enumeration
Figure 4.2 TIDlist for Item A, B And C
Figure 4.3 Construction of Reduced TIDlists
Figure 4.4 FRS Algorithm
Figure 4.5 TIDlists of FR Items
Figure 4.6 Constructing 2-FRS Sequence
Figure 4.7 Constructing 3-Sequence ABC
Figure 4.8 Constructing 4-FRS Sequence
Figure 4.9 Partial Enumeration of Items A, B, And C
Figure 4.10 Processing Time of FRS Components Under Different Support Thresholds
Figure 4.11 Processing Time of FRS Under Different Support And Regularity Thresholds
Figure 4.12 Performance Comparison Between FRS And PLWAP
Figure 4.13 Scalability of FRS Vs. PLWAP
Figure 5.1 Definition of Fuzzy Terms
Figure 6.1 Tran-FRS Algorithm
Figure 6.2 Transitional Behaviour of FRS Pattern 6-1-2
Figure 6.3 Transitional Behaviour of FRS Pattern 1-2
Figure 6.4 Transitional Behaviour of FRS Pattern 1-2-3
Figure 6.5 Transitional Behaviour of FRS Pattern 1-2-4


CHAPTER ONE INTRODUCTION

1.1 INTRODUCTION

One of the main challenges for a website owner is providing a website that offers an effective and efficient browsing experience for its users. As people become increasingly reliant on the internet for information and services, it is crucial that a website is designed and developed in accordance with the needs and interests of its users. One of the best and most reliable sources of clues about those needs and interests is the website itself. In general, web servers not only store the content of a website; they also keep a myriad of other data, including details of users' browsing activities. The growth of internet activity generates huge quantities of such data, which are recorded in files on the servers that are often referred to as web usage log files. As such, the application of data mining to web data is crucial to understanding users' needs and preferences. The information extracted by web data mining may offer useful knowledge that can help website owners to better understand the behavioural patterns of their clients, and to restructure their websites in a way that increases the quality of the services provided (Cooley, Mobasher, & Srivastava, 1997).

In this context, web mining is an emerging line of study that explores sophisticated methods and techniques for extracting dominant patterns from web data, and whose goals include finding ways to improve the quality of a website through various approaches and strategies. Popular applications of web mining include website structure improvement, user profiling, target marketing, service personalization and intrusion detection. Web mining lies within the broader framework of Knowledge Discovery from Databases (KDD), which was introduced by Fayyad (1996). The objective of KDD is to discover useful knowledge embedded in large databases through the application of data mining techniques. Essentially, web mining relies on data mining techniques to intelligently extract and analyse interesting patterns of user behaviour embedded in a number of web files, including web server logs and application logs. There are three categories of web mining: Web Content mining, Web Structure mining and Web Usage mining (Facca & Lanzi, 2005). Cooley et al. (1997) first introduced the term WUM, whose main goal is to extract interesting user behaviour from the usage activities recorded in web usage logs. Owing to the broad nature of web mining, this study focuses on WUM.

Although numerous studies have been conducted in this area, a number of open research issues still need to be addressed, because the heterogeneous nature of websites and the sheer size of the information they contain contribute to the complexity of the problem. Thus, the application of data mining to web data continues to gain much attention from the research community. This thesis concerns the mining of frequently occurring sequential patterns from web server logs in a way that takes user traversal behaviour into account. Sequential pattern mining is a field of study introduced by Srikant and Agrawal (1996a), and web usage behaviour is a concept proposed by Cao (2010). This chapter first presents the background of the study, which is then followed by the statement of the research problems to be addressed. Subsequently, the research questions that drive this study and the corresponding research objectives are stated. A summary of the proposed idea follows, and the chapter ends with the layout of the entire thesis.

1.2 BACKGROUND OF THE STUDY

Web usage mining (WUM) is an application of data mining techniques (Agrawal & Srikant, 1994) that seeks to find major trends in web usage data. These trends represent the underlying navigational patterns derived from clickstream activity frequently performed by the visitors of a website. Understanding web user behaviour provides important insights into user preferences and helps web designers to formulate effective strategies that can enhance the quality of a website. User behaviour information can be used, among other things, to redesign the website structure so as to decrease the overall propagation delay among frequently accessed pages, to improve the design or content of less frequently visited pages in order to attract visitors, and to provide valuable input for web caching policy. Despite a considerable number of previous studies in this field, most existing WUM algorithms do not consider individual user behaviour and interest when mining usage access sequences. Some researchers claim that this lack of consideration for activity data as a mining criterion explains the failure to translate many research findings into operational commercial products (Cao, Ou, & Yu, 2012; De Meo, Nocera, Terracina, & Ursino, 2011).

In data mining, there are a number of different mining methods based on the type of pattern sought, such as association rules, clustering and classification. Since web traversal patterns are formed by sequences of web page navigation activities, the frequent sequence pattern method, or sequential pattern method, is expected to be the more accurate and appropriate method for WUM applications. Thus, the fundamental problem of mining sequential patterns concerns the extraction of the set of frequent web traversal sequences from a web server log.


One of the challenges in WUM stems from the excessively large number of sequential patterns generated by WUM algorithms. This is especially true because web usage log files are typically large; for example, the MSNBC clickstream log file used as input for the experiments in this study contains almost one million records from a single day's transactions. A threshold indicates the minimum number of records that must contain a certain pattern. When mining such a large dataset, a very small threshold is usually used, since a large threshold will miss most patterns. However, setting the threshold too low inevitably results in an extremely large number of extracted patterns to consider and has a significant impact on processing time. Furthermore, as not all of the extracted frequent patterns are meaningful and interesting, the process of analysing them may become daunting and inefficient. Thus, a new method that addresses this issue is crucial.
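The effect of the threshold can be illustrated with a minimal sketch. The four toy sessions, the page labels and the helper names `is_subsequence` and `support` are assumptions made for illustration only; they are not taken from the thesis or from the MSNBC log.

```python
def is_subsequence(pattern, session):
    """Return True if the pages of `pattern` occur in `session` in order (gaps allowed)."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so order is respected.
    return all(page in remaining for page in pattern)


def support(pattern, sessions):
    """Count the sessions (records) that contain `pattern`."""
    return sum(1 for session in sessions if is_subsequence(pattern, session))


# Hypothetical sessions: each list is one visitor's page sequence.
sessions = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "A", "C"],
    ["A", "B"],
]
candidates = [("A",), ("A", "C"), ("A", "B"), ("B", "C")]


def frequent_patterns(min_support):
    """Keep only the candidates whose support reaches the threshold."""
    return [p for p in candidates if support(p, sessions) >= min_support]


high_threshold = frequent_patterns(3)  # keeps ("A",) and ("A", "C")
low_threshold = frequent_patterns(2)   # keeps all four candidates
```

Lowering the threshold from 3 to 2 doubles the number of reported patterns even on this tiny input, which is the behaviour described above for very low thresholds on large logs.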

In addition, most existing WUM algorithms only generate the set of frequent web pages, without providing details on how the degree of frequency fluctuates throughout the database. Since frequency is indicated by an aggregate support measure, existing algorithms do not highlight the variation in frequency that almost invariably occurs at different stages of the database. Therefore, the dynamic behaviour of a discovered pattern often goes undetected. As is often the case, a frequent pattern is frequent only at certain points of the database and is rarely consistently frequent throughout it. Additional information on the dynamic behaviour of a pattern provides a deeper understanding of the pattern and can help the analyst to formulate different action plans for different patterns, even though they might all be frequent. For example, if two patterns A and B are both found to be frequent, but pattern A is frequent mainly at the beginning of the database while pattern B shows the opposite trend, then it is possible that the trend represented by pattern A is already obsolete while the trend reflected by pattern B is only emerging. Therefore, there is a need to enhance existing WUM methods to provide insight into the dynamic behaviour of a frequent pattern, which will enable the analyst to derive more accurate conclusions from the investigation.
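The contrast between an aggregate count and a per-stage view can be sketched as follows. This is only an illustration under stated assumptions: the eight chronologically ordered sessions, the fixed window size and the helper names `occurs_in` and `windowed_support` are hypothetical, and the sketch is not the transitional-pattern method developed later in the thesis.

```python
def occurs_in(pattern, session):
    """Return True if `pattern` is an ordered subsequence of `session`."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)


def windowed_support(pattern, sessions, window_size):
    """Support of `pattern` in each consecutive block of `window_size` sessions."""
    return [
        sum(1 for s in sessions[start:start + window_size] if occurs_in(pattern, s))
        for start in range(0, len(sessions), window_size)
    ]


# Hypothetical sessions, listed in chronological order.
sessions = [
    ["A", "B"], ["A", "B"], ["A", "B", "C"], ["C", "D"],
    ["A", "B"], ["C", "D"], ["C", "D"], ["C", "D"],
]
pattern_a = ("A", "B")
pattern_b = ("C", "D")

total_a = sum(windowed_support(pattern_a, sessions, len(sessions)))
total_b = sum(windowed_support(pattern_b, sessions, len(sessions)))
trend_a = windowed_support(pattern_a, sessions, 4)  # [3, 1]: fading
trend_b = windowed_support(pattern_b, sessions, 4)  # [1, 3]: emerging
```

Both patterns have the same aggregate support of 4, yet the per-window counts move in opposite directions, which an aggregate measure alone cannot show.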

1.3 STATEMENT OF THE PROBLEM

The issues related to mining sequential patterns from web usage data mentioned in the previous section represent the problems that this thesis aims to address. First, the lack of consideration for activity data in existing methods undermines the quality of mining results. Activity data refers to the series of clickstreams that result from navigating web pages. Existing sequential pattern mining (SPM) algorithms rely solely on the frequency measure, which is generally determined by the number of users who form a certain pattern of web page navigation. While pattern frequency is crucial in determining major trends or behaviours, it disregards the information in activity data and thus leads to problems such as a very high number of extracted patterns. Tanbeer et al. (2009) addressed this issue for frequent pattern mining from transactional databases by proposing the notion of periodic-frequent patterns, but no existing sequential mining methods address it. This observation leads us to believe that a pattern-mining framework that considers the behaviour of individuals as well as that of the masses will further improve the quality of the extracted patterns and, hence, the quality of the knowledge derived from them.

Second, despite the many significant improvements made in WUM, the efficiency of mining sequential patterns is still an open research issue (Pei et al., 2007; Chand et al., 2012), particularly when the threshold value is set very low, the web usage log is very large and the records may contain long sequences. This is mainly due to the extremely large number of sub-patterns that algorithms need to consider, which contributes to high computing cost. Moreover, if the mining process produces a large number of resulting patterns, which invariably include both interesting and uninteresting ones, the task of analysing and interpreting the patterns can be overwhelming and potentially confusing. When constraints are applied in mining, the efficiency problem is further aggravated, since the constraints are applied towards the end of the mining process.
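The growth in sub-patterns with sequence length can be made concrete with a small sketch. It assumes the simplest case, a session treated as an ordered list of page labels, and the helper name `distinct_subsequences` is hypothetical. A session of n distinct pages has 2^n - 1 non-empty ordered subsequences, each of which is a potential sub-pattern.

```python
from itertools import combinations


def distinct_subsequences(session):
    """Return the set of non-empty ordered subsequences of `session`."""
    found = set()
    for length in range(1, len(session) + 1):
        # combinations() preserves the original order of the elements.
        found.update(combinations(session, length))
    return found


short_session = list("ABCD")        # 4 distinct pages
long_session = list("ABCDEFGHIJ")   # 10 distinct pages

short_count = len(distinct_subsequences(short_session))  # 2**4 - 1 = 15
long_count = len(distinct_subsequences(long_session))    # 2**10 - 1 = 1023
```

Brute-force enumeration like this is only feasible for short sessions, which is one reason mining algorithms prune candidates instead of listing every sub-pattern.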

Third, most existing WUM approaches involve sequences whose items have binary attributes only, which means that only the presence or absence of an item is considered. This limits the potential of WUM, since categorical data are abundant in real-life applications and useful knowledge in the form of sequential patterns may be hidden in them. Even though fuzzy concepts were introduced into sequence mining some time ago (Hong, Kuo, & Chi, 1999), fuzzy sequential pattern approaches suffer from efficiency issues, including the problem of computing the cardinality of membership degrees. In this study, fuzzy concepts are used to rank frequent usage sequences by taking activity data into consideration. The user behaviour expressed by a sequence of mouse clicks may be interpreted differently if the intervals between events in the sequence are considered. The length of the intervals can be a factor that characterises the strength of the behaviour. Therefore, for any proposed web usage mining method to be effective and accurate, due consideration of the intervals is important.
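How an interval length might be mapped to fuzzy terms can be sketched as follows. Everything specific here is an assumption for illustration: the linguistic terms "short", "medium" and "long", the breakpoints of 10, 30 and 60 seconds, and the function names are hypothetical and are not the definitions used in this thesis.

```python
def falling(x, a, b):
    """Membership that is 1 up to `a` and falls linearly to 0 at `b`."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def rising(x, a, b):
    """Membership that is 0 up to `a` and rises linearly to 1 at `b`."""
    return 1.0 - falling(x, a, b)


def triangular(x, a, b, c):
    """Membership that peaks at `b` and is 0 outside the open interval (a, c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify_interval(seconds):
    """Degrees to which an interval between two clicks is short, medium or long."""
    return {
        "short": falling(seconds, 10, 30),         # hypothetical breakpoints
        "medium": triangular(seconds, 10, 30, 60),
        "long": rising(seconds, 30, 60),
    }


quick_click = fuzzify_interval(5)   # fully "short"
borderline = fuzzify_interval(20)   # partly "short", partly "medium"
```

A 20-second interval belongs to "short" and to "medium" with degree 0.5 each, so the interval contributes a graded value to a pattern's ranking instead of a binary present-or-absent signal.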

Finally, existing methods implicitly assume that the frequency of a certain behaviour persists throughout the period of investigation. This assumption arises indirectly from the use of frequency-based methods. Relying on pattern frequency alone may not be sufficient to help the analyst make sound and effective decisions, because the frequency measure provides only an aggregated view of a pattern's frequency. In real life, however, most behaviours seldom persist for the entire period, so a measure that identifies the level of frequency variation is necessary for the result to be more accurate. If the variation in frequency level is not considered, the mining process cannot capture exactly where or when a certain behaviour changes. Since the behaviour of each pattern varies throughout the database and may differ among frequent patterns, the web analyst needs this additional information to reach a proper conclusion. Insufficient information will not only result in poor decision-making; it may also lead to an erroneous web design. A detailed understanding of the behaviour of a sequence throughout the database gives much deeper knowledge and insight for better interpretation.

1.4 RESEARCH QUESTIONS

The issues described in the previous section prompt the following research questions:

i. How may the issue of algorithm inefficiency be addressed when mining vast server logs?

ii. Can browsing activity be more accurately captured into WUM in order to extract more meaningful patterns?

iii. How do we address the confusion arising from excessively high number of extracted patterns?


iv. How do we capture possible changes in a pattern’s behaviour throughout the entire period of investigation in order to accurately understand the characteristic of the pattern?

1.5 RESEARCH OBJECTIVES

In order to address the above research questions, the following objectives are set:

i. To implement a vertical-database mining algorithm with the aim of better efficiency when mining large web logs using very low support threshold.

ii. To introduce a new constraint in the process of pattern mining that can capture user traversing activity.

iii. To propose a method of sorting extracted patterns based on proximity of elements in a pattern by employing fuzzy methods.

iv. To propose a method that captures the dynamic behaviour of web usage patterns, which can provide deeper insight for the analyst.

Table 1.1 shows the mapping of the research issues, research questions and research objectives of this study.

Table 1.1: Mapping of Research Questions and Objectives

Research Issue: Algorithm inefficiency for large dataset
Research Question: How may the issue of algorithm inefficiency be addressed when mining vast server logs?
Research Objective: To implement a vertical-database mining algorithm with better efficiency when mining large web logs using very low support threshold.

Research Issue: Quality of extracted patterns
Research Question: Can browsing activity be more accurately captured into WUM in order to extract more meaningful patterns?
Research Objective: To introduce a new constraint in the process of pattern mining that can capture user traversing activity.

Research Issue: Quantity of extracted patterns
Research Question: How do we address the confusion arising from excessively high number of extracted patterns?
Research Objective: To propose a method of sorting extracted patterns based on proximity of elements in a pattern by employing fuzzy methods.

Research Issue: Insufficient insight from extracted patterns
Research Question: How do we capture changes in a pattern’s behaviour throughout the entire period of investigation in order to accurately understand the characteristic of the pattern?
Research Objective: To propose a method that captures the dynamic behaviour of web usage patterns, which can provide deeper insight for the analyst.


The achievement of the above objectives will be measured by an experimental design approach. To evaluate the quality of patterns discovered and the performance of the algorithm, a series of experiments will be conducted using real web access data as input. In addition, the algorithm will be evaluated in several aspects including processing speed, memory usage, scalability and quality of the extracted patterns.

1.6 SIGNIFICANCE OF STUDY

This study is an important endeavour to promote web usage mining for the benefit of organisations, businesses and industries that rely on websites to deliver goods and services. It proposes a novel method that incorporates the element of user behaviour in mining web usage patterns and reveals detailed characteristics of each pattern in a way that facilitates the interpretation of data mining results. As such, the research may help web designers and web administrators to learn and understand the traversal behaviour and interests of their users, and to take the necessary actions to enhance their websites and enrich their customers' browsing experience.

The objective of this study is to address the above outstanding research issues in the area of web usage mining and to present answers, in terms of methodological and theoretical approaches, that overcome the obstacles preventing the web community from benefiting from WUM technology. The study contributes to knowledge in the field of web usage mining in a number of ways. Firstly, it seeks to propose a more effective method of mining web usage patterns, one that incorporates individual users' behaviour into the existing framework of web usage mining with the objective of improving the overall quality of the extracted patterns. Secondly, in the course of implementing the method, an enhanced algorithm is proposed that offers better performance, especially in the typical web usage mining situation where the threshold is very small.

This study also attempts to propose an approach that reveals more detailed characteristics of each extracted pattern, instead of only the simple conventional frequency indicator used by most existing approaches. This detailed information is crucial in helping the analyst to make more accurate interpretations, which makes the output more actionable and provides a solution that is more readily acted upon and requires less manual intervention.

In summary, the entire study seeks a better alternative for mining web usage patterns, one that is not only effective but also able to fulfil the needs of the web industry at large.

1.7 RESEARCH DESIGN

This research proposes a model for mining sequences of web usage patterns that considers the behavioural criterion of regularity in addition to frequency. The notion of regularity represents activity data that reflect a user’s navigational pattern when browsing a website (Cao, 2010). Thus, the goal of the proposed model is to find web usage patterns that are not only highly frequent but also repeatedly performed.
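To make the distinction between frequency and regularity concrete, the following minimal Python sketch scores a page sequence on both criteria. It is an illustration under stated assumptions, not the algorithm developed in this thesis: the function names are hypothetical, a user's sessions are assumed to be ordered in time, and regularity is approximated here by the largest gap (in sessions) between consecutive occurrences of the pattern, which may differ from the definition adopted later in the thesis.

```python
from typing import List, Sequence


def contains_subsequence(session: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if the pages of `pattern` appear in `session` in order
    (not necessarily contiguously)."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so order is preserved.
    return all(page in remaining for page in pattern)


def support(sessions: List[Sequence[str]], pattern: Sequence[str]) -> float:
    """Frequency criterion: fraction of sessions that contain the pattern."""
    if not sessions:
        return 0.0
    hits = sum(contains_subsequence(s, pattern) for s in sessions)
    return hits / len(sessions)


def max_gap(sessions: List[Sequence[str]], pattern: Sequence[str]) -> int:
    """Assumed regularity measure: the largest number of sessions between
    consecutive occurrences of the pattern, counting the gap from the start
    of the log to the first occurrence and from the last occurrence to the
    end. A smaller value means the pattern recurs more regularly."""
    positions = [
        i for i, s in enumerate(sessions, start=1)
        if contains_subsequence(s, pattern)
    ]
    bounds = [0] + positions + [len(sessions)]
    return max(b - a for a, b in zip(bounds, bounds[1:]))


# Hypothetical sessions of one user, in chronological order.
sessions = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "A", "C"],
    ["A", "B"],
    ["A", "D", "C"],
]

frequent_and_regular = (support(sessions, ["A", "C"]), max_gap(sessions, ["A", "C"]))
rare_and_irregular = (support(sessions, ["B", "A"]), max_gap(sessions, ["B", "A"]))
```

In this sketch a pattern would be reported only if its support is at least a chosen minimum and its largest gap is at most a chosen maximum; both thresholds are parameters that the analyst would have to set.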

Moreover, the model captures the dynamic behaviour of patterns to provide deeper insight into how their frequency level may vary throughout the database. Based on the framework of transitional pattern mining (Qian & Ann, 2006), the changes in the frequency of each frequent pattern throughout the database will be revealed to provide the analyst with a more accurate and detailed picture of the pattern. With this additional knowledge, the analyst is in a position to derive better conclusions from the analysis of frequent web sequences.
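The idea of tracking how a pattern's frequency changes across the database can be sketched as follows. This is a simplified illustration, not the formulation of the cited framework: the function name, the choice of a single split point, and the normalisation by the larger of the two supports are assumptions made for this example.

```python
from typing import List


def transitional_ratio(occurs: List[bool], split: int) -> float:
    """Compare a pattern's support before and after position `split`.

    `occurs[i]` is True if the pattern occurs in the i-th transaction,
    with transactions in chronological order. The result lies in [-1, 1]:
    positive when the pattern becomes more frequent after the split,
    negative when it becomes less frequent.
    """
    before, after = occurs[:split], occurs[split:]
    if not before or not after:
        raise ValueError("split must leave transactions on both sides")
    sup_before = sum(before) / len(before)
    sup_after = sum(after) / len(after)
    largest = max(sup_before, sup_after)
    if largest == 0:
        return 0.0  # the pattern never occurs, so there is no change
    return (sup_after - sup_before) / largest


# Hypothetical occurrence record: frequent early on, rare later.
occurs = [True, True, True, False, False, False, True, False]
ratio = transitional_ratio(occurs, split=4)
```

Over the whole log this pattern has a support of 0.5, which hides the fact that it drops from 0.75 in the first half to 0.25 in the second; the negative ratio makes that decline visible to the analyst.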
