USER TRAVERSAL BEHAVIOUR MINING OF SERVER LOGS USING FUZZY FRS
A thesis submitted in fulfilment of the requirement for the degree of Doctor of Philosophy in Information Technology
Kulliyyah of Information and Communication Technology International Islamic University Malaysia
Web Usage Mining (WUM) is the application of data mining methods to extract potentially useful information from web usage data. Its applications include improving website design, personalised services and target marketing. Although WUM has been studied extensively, the lack of related commercial products indicates that a number of research issues in this area remain outstanding. The challenges mentioned in the literature include inefficiency in mining typically large weblogs, extracted patterns that are not representative of actual user behaviour, and mining results that are too general, uninteresting and lacking in insight. This thesis attempts to address these problems in three parts. Firstly, based on the notion of regularity, a mining algorithm is introduced to efficiently extract usage patterns that reflect individual user behaviour from large weblogs. Secondly, a fuzzy method is incorporated into the algorithm to express pattern quality, thus reducing possible confusion caused by an extremely large number of patterns. Finally, in order to gain deeper insight into the extracted patterns, the algorithm is further extended using the transitional pattern framework to capture possible variation in pattern behaviour, thus facilitating the subsequent pattern interpretation process. The promising results obtained from a series of experiments suggest that the new algorithm is faster and more scalable than an existing one, especially when mining large weblogs. Furthermore, the extracted patterns represent user traversal behaviour better, contain less ambiguity, and are more readily interpretable in subsequent analysis.
Web usage mining (WUM) is the application of data mining methods to extract potentially useful information from web usage data. Its applications include improving website design, personalised service and target marketing. Although WUM has been studied extensively, the absence of a strategy for commercialising related products indicates that a number of research issues in this field remain. Among the challenges mentioned in previous studies are inefficiency in mining weblogs, which are typically large; extracted patterns that do not represent actual user behaviour; and mining results that are general, uninteresting and lacking in insight. This study attempts to address the above problems in three parts. First, based on the notion of regularity, a mining algorithm is introduced to extract, with high efficiency, usage patterns from large weblogs that reflect individual user behaviour. Second, a fuzzy method is incorporated into the algorithm to enable the expression of pattern quality, thereby reducing possible confusion due to the large number of patterns. Finally, in order to gain deeper insight into the extracted patterns, the algorithm is further extended using the transitional pattern framework to capture possible variation in pattern behaviour, which facilitates the subsequent interpretation of patterns. The study concludes with promising results obtained from a series of experiments, among them that the new algorithm is faster and more scalable than previous algorithms, especially when mining large weblogs. Furthermore, the extracted patterns represent user traversal behaviour better, contain less ambiguity and are easier to interpret.
The thesis of Rosli bin Omar has been approved by the following:
Zainatul Shima Abdullah Supervisor
Abu Osman Md Tap Co-Supervisor
Mira Kartiwi Internal Examiner
Mustafa Mat Deris External Examiner
Shoab Ahmed Khan External Examiner
Amir Akramin Shafie Chairman
I hereby declare that this thesis is the result of my own investigations, except where otherwise stated. I also declare that it has not been previously or concurrently submitted as a whole for any other degrees at IIUM or other institutions.
Rosli bin Omar
Signature ... Date ...
INTERNATIONAL ISLAMIC UNIVERSITY MALAYSIA
DECLARATION OF COPYRIGHT AND AFFIRMATION OF FAIR USE OF UNPUBLISHED RESEARCH
USER TRAVERSAL BEHAVIOUR MINING OF SERVER LOGS USING FUZZY FRS
I declare that the copyright of this thesis is jointly owned by the student and IIUM.
Copyright © 2018 by Rosli bin Omar and International Islamic University Malaysia. All rights reserved.
No part of this unpublished research may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without prior written permission of the copyright holder except as provided below:
1. Any material contained in or derived from this unpublished research may only be used by others in their writing with due acknowledgement.
2. IIUM or its library will have the right to make and transmit copies (print or electronic) for institutional and academic purposes.
3. The IIUM library will have the right to make, store in a retrieval system and supply copies of this unpublished research if requested by other universities and research libraries.
By signing this form, I acknowledge that I have read and understand the IIUM Intellectual Property Right and Commercialization policy.
Affirmed by Rosli bin Omar
This thesis is dedicated to my beloved father, Omar, and my beloved daughter, Amalina, both of whom have peacefully returned to their Lord. May Allah shower His endless mercy and forgiveness upon their souls.
All glory is due to Allah the Almighty, whose grace and mercy have been with me throughout the duration of my studies. Although it has been a challenging experience, His blessings have kept me persevering until the end.
I am most indebted to my supervisor, Assistant Professor Dr Zainatul Shima Abdullah, whose in-depth knowledge of the subject matter, endearing disposition and kindness facilitated the successful completion of my work. I appreciate her detailed comments and useful suggestions, which have considerably improved this thesis. Her moral support and encouragement have indeed helped me through this process. I am also highly grateful to Prof Dr Abu Osman Md Tap and Dr Media Ayu for laying a strong foundation for my work, as well as for providing invaluable guidance, especially at the beginning of this journey.
This thesis would not have been possible without the sacrifice and understanding of my beloved wife Suriana, whose strength and support made it possible for me to continue the uphill journey of completing my studies. Thanks also for checking the thesis. To my children, thank you for your prayers; I hope you will draw inspiration from this work. Last but not least, my deepest appreciation goes to my beloved Mother, whose sacrifice, inspiration and guidance helped me become who I am today.
Once again, glory be to Allah for His endless bounties on me, one of which is completing this thesis. Alhamdulillah.
TABLE OF CONTENTS
Abstract
Abstract in Arabic
Approval page
Declaration
Copyright
Dedication
Acknowledgements
List of Tables
List of Figures
CHAPTER ONE: INTRODUCTION
1.1 Introduction
1.2 Background of the Study
1.3 Statement of the problem
1.4 Research Questions
1.5 Research Objectives
1.6 Significance of study
1.7 Research Design
1.8 Organization of the thesis
CHAPTER TWO: REVIEW OF LITERATURE
2.1 Introduction
2.2 Data Mining
2.2.1 Pattern and Model
2.2.2 Pattern Discovery
2.3 Sequential Pattern Mining
2.3.1 The Apriori Approach
2.3.1.1 Generalised Sequential Patterns (GSP) algorithm
2.3.2 Apriori Vertical Layout Database
2.3.2.1 SPADE Algorithm
2.3.3 The Pattern Growth Approach
2.4 Constraint-Based Sequential Patterns
2.5 Web Usage Mining
2.5.1 Applications of WUM
2.5.2 The Main Processes of Web Usage Mining (WUM)
2.5.3 Existing Researches in WUM
2.5.3.1 Basic WUM Algorithm
2.5.3.2 Utility Patterns
2.5.4 Transitional Pattern
2.6 Fuzzy Theory for Sequential Data Mining
CHAPTER THREE: METHODOLOGY
3.1 Introduction
3.2 Theoretical Framework
3.2.1 Pre-Processing Step
3.2.2 Pattern Discovery
3.2.3 Pattern Analysis
3.3 Proposed Research Framework
3.3.1 Pre-processing of Web Log Data
3.3.2 Pattern Discovery
3.3.2.1 Frequent-Regular Sequential Patterns
3.3.2.2 Fuzzy FRS-pattern
3.3.2.3 Transitional FRS-pattern
3.3.3 The Proposed Algorithm
3.4 Experimental Design
3.4.1 Test Environment and Datasets
3.4.2 Pre-Linked Web Access Pattern (PLWAP) as a Benchmark Algorithm
3.4.3 Performance Comparison between FRS and PLWAP algorithms
3.4.4 Pattern’s Knowledge Representation
3.4.5 Scalability Test
CHAPTER FOUR: REGULAR SEQUENTIAL PATTERNS
4.1 Introduction
4.2 Overview of Frequent-Regular Sequential Patterns (FRS)
4.3 Regularity as a Measure of User Behaviour
4.3.1 Efficient Candidate Enumeration
4.3.2 Downward closure property
4.3.3 Reduction of Recursions
4.3.4 Vertical Database
4.4 Problem Definition
4.5 The Proposed FRS Algorithm
4.6 Mining Frequent Regular Sequential Patterns
4.7 Experiments
4.7.1 Description of Input Data
4.7.2 Experiment 1: Processing Time of FRS Components
4.7.3 Experiment 2: FRS Performance Using Various Set Of Thresholds
4.7.4 Experiment 3: Performance Comparison between FRS and PLWAP
4.7.5 Experiment 4: Scalability of FRS
4.7.6 Experiment 5: Quality of Sequential Patterns
4.8 Conclusion
CHAPTER FIVE: FUZZY FREQUENT-REGULAR SEQUENTIAL PATTERNS
5.1 Introduction
5.2 Problem Definition
5.3 Linguistic Variables
5.4 Mining Fuzzy Frequent Regular Sequential Patterns
5.5 Experiment
5.6 Conclusion
CHAPTER SIX: TRANSITIONAL FREQUENT-REGULAR SEQUENCES
6.1 Introduction
6.2 Transitional Patterns
6.3 Problem Definition
6.4 The Trans-Frs Algorithm
6.5 The Experiments
6.6 Conclusion
CHAPTER SEVEN: DISCUSSIONS AND CONCLUSION
7.1 Introduction
7.2 Regular Sequential Patterns
7.3 Fuzzy Regular Sequential Patterns
7.4 Transitional Frequent-Regular Pattern (Trans-Frs)
7.5 Contributions Of Research
7.5.1 Theoretical Contributions
7.5.2 Practical Contributions
7.5.3 Methodological Contribution
7.6 Limitations Of The Study
7.7 Future Research Areas
REFERENCES
APPENDIX A: FRS SOURCE CODES IN C
APPENDIX B: PLWAP SOURCE CODES
LIST OF TABLES
Table 1.1 Mapping of Research Questions and Objectives
Table 2.1 Sample entries of a web usage log
Table 3.1 CLF Format Description
Table 4.1 Example of Sequence Database
Table 4.2 Sample Entries of a pre-processed Web Log
Table 4.3 Sequence Db after Sessionisation
Table 4.4 Web Usage Sequence Db
Table 4.5 Sequence Db
Table 4.6 2-FRS Matrix
Table 4.7 News Categories of MSNBC Dataset
Table 4.8 Comparison of Number of Extracted Patterns and corresponding Processing Time
Table 4.9 Selected FRS Patterns Extracted from MSNBC Web Log
Table 5.1 Sequence Db
Table 5.2 Extracted Fuzzy FRS Patterns
Table 6.1 Comparison of Extracted Patterns of FRS and PLWAP
Table 6.2 Sequence Db
Table 6.3 Time Points in Sequence Db
Table 6.4 Sequence Db
Table 6.5 Description of Notations
Table 6.6 Extracted FRS Sequences with Fuzzy Quality Value
Table 7.1 Comparison of FRS and PLWAP
Table 7.2 Scalability: FRS vs PLWAP
Table 7.3 Analysis of Selected FRS Patterns
LIST OF FIGURES
Figure 2.1 Main Processes of WUM
Figure 3.1 Theoretical Framework of WUM
Figure 3.2 CLF Format
Figure 3.3 Research Framework
Figure 3.4 FRS Algorithm
Figure 3.5 Experimental Design
Figure 4.1 An Example of Sequence Enumeration
Figure 4.2 TIDlist for Item A, B And C
Figure 4.3 Construction of Reduced TIDlists
Figure 4.4 FRS Algorithm
Figure 4.5 TIDlists of FR Items
Figure 4.6 Constructing 2-FRS Sequence
Figure 4.7 Constructing 3-Sequence ABC
Figure 4.8 Constructing 4-FRS Sequence
Figure 4.9 Partial Enumeration of Items A, B, And C
Figure 4.10 Processing Time of FRS Components Under Different Support
Figure 4.11 Processing Time of FRS Under Different Support And Regularity Thresholds
Figure 4.12 Performance Comparison Between FRS And PLWAP
Figure 4.13 Scalability of FRS Vs. PLWAP
Figure 5.1 Definition of Fuzzy Terms
Figure 6.1 Trans-FRS Algorithm
Figure 6.2 Transitional Behaviour of FRS Pattern 6-1-2
Figure 6.3 Transitional Behaviour of FRS Pattern 1-2
Figure 6.4 Transitional Behaviour of FRS Pattern 1-2-3
Figure 6.5 Transitional Behaviour of FRS Pattern 1-2-4
CHAPTER ONE: INTRODUCTION
1.1 INTRODUCTION
One of the main challenges for a website owner is providing a website that offers users an effective and efficient browsing experience. As people become increasingly reliant on the internet for information and services, it is crucial that a website is designed and developed in accordance with the needs and interests of its users. One of the best and most reliable sources of clues about those needs and interests is the website itself. In general, web servers not only store the content of a website but also keep a myriad of other data, including details of users’ browsing activities. The growth of internet activity generates huge quantities of such data, which are recorded in files on the servers that are often referred to as web usage log files. As such, the application of data mining to web data is crucial in understanding users’ needs and preferences. The information extracted through web data mining may offer useful knowledge that can help website owners to better understand the behavioural patterns of their clients, and to restructure their websites in a way that improves the quality of the services provided (Cooley, Mobasher, & Srivastava, 1997).
In this context, web mining is an emerging line of study that explores sophisticated methods and techniques for extracting dominant patterns from web data, with the goal of improving the quality of a website through various approaches and strategies. Popular applications of web mining include website structure improvement, user profiling, target marketing, service personalisation and intrusion detection. Web mining lies within the broader framework of Knowledge Discovery from Databases (KDD), which was introduced by Fayyad (1996). The objective of KDD is to discover useful knowledge embedded in large databases through the application of data mining techniques. Essentially, web mining relies on data mining techniques to intelligently extract and analyse interesting patterns of user behaviour embedded in a number of web files, including web server logs and application logs. There are three categories of web mining: Web Content mining, Web Structure mining and Web Usage mining (Facca & Lanzi, 2005). Cooley et al. (1997) first introduced the term Web Usage Mining (WUM), whose main goal is to extract interesting user behaviour from the usage activities recorded in web usage logs. Owing to the broad nature of web mining, this study focuses on WUM.
Although numerous studies have been conducted in this area, a number of research issues remain open because the heterogeneous nature of websites and the sheer size of the information they contain add to the complexity of the problem. Thus, the application of data mining to web data continues to attract much attention from the research community. This thesis concerns the mining of frequently occurring sequential patterns from web server logs in a way that takes user traversal behaviour into account. Sequential pattern mining is a field of study introduced by Srikant and Agrawal (1996a), and web usage behaviour is a concept proposed by Cao (2010). This chapter first presents the background of the study, followed by the statement of the research problems to be addressed. Subsequently, the research questions that drive this study and the corresponding research objectives are stated. A summary of the proposed approach follows, and the chapter ends with the layout of the entire thesis.
1.2 BACKGROUND OF THE STUDY
Web usage mining (WUM) is an application of data mining techniques (Agrawal & Srikant, 1994) that seeks to find major trends in web usage data. These trends represent the underlying navigational patterns derived from clickstream activity that is frequently performed by the visitors of a website. Understanding web user behaviour provides important insights into user preferences and helps web designers to formulate effective strategies for enhancing the quality of a website. User behaviour information can be used, among other things, to redesign the website structure so as to decrease the overall propagation delay among frequently accessed pages, to improve the design or content of less frequently visited pages in order to attract visitors, and to provide valuable input for web caching policy. Despite a considerable number of previous studies in this field, most existing WUM algorithms do not consider individual user behaviour and interest when mining usage access sequences. Some researchers claim that the failure to use activity data as a mining criterion has prevented many research findings from being translated into operational commercial products (Cao, Ou, & Yu, 2012; De Meo, Nocera, Terracina, & Ursino, 2011).
In data mining, there are a number of different mining methods based on the type of pattern sought, such as association rules, clustering and classification. Since web traversal patterns are formed by sequences of web page navigation activities, the frequent sequence pattern method, or sequential pattern method, is expected to be a more accurate and appropriate method for WUM applications. Thus, the fundamental problem of mining sequential patterns concerns the extraction of the set of frequent web traversal sequences from the web server log.
One of the challenges in WUM stems from the excessively large number of sequential patterns generated by WUM algorithms. This is especially true because web usage log files are typically large; for example, the MSNBC clickstream log file used as input for the experiments in this study contains almost one million records from a single day of transactions. A threshold indicates the minimum number of records that must contain a certain pattern. When mining such a large dataset, a very small threshold is usually used, since a large threshold would miss most patterns. However, setting the threshold too low inevitably results in an extremely large number of extracted patterns to consider and has a significant impact on processing time. Furthermore, as not all of the extracted frequent patterns are meaningful and interesting, the process of analysing frequent patterns may become daunting and inefficient. Thus, a new method that addresses this issue is crucial.
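To make the role of the threshold concrete, the following minimal sketch counts how many sessions contain a candidate sequence and keeps only those candidates that reach a minimum support. It is an illustration only: the toy sessions, the page names, the candidate list and the function names `is_subsequence` and `support` are all assumptions introduced here, and the brute-force counting is not the algorithm proposed in this thesis.

```python
from typing import Iterable, Sequence


def is_subsequence(pattern: Sequence[str], session: Sequence[str]) -> bool:
    """Return True if `pattern` occurs in `session` in order (gaps allowed)."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so order is respected.
    return all(page in remaining for page in pattern)


def support(pattern: Sequence[str], sessions: Iterable[Sequence[str]]) -> int:
    """Count the sessions that contain `pattern` as a subsequence."""
    return sum(1 for s in sessions if is_subsequence(pattern, s))


# Hypothetical toy log: each inner list is one user session.
sessions = [
    ["home", "news", "sports"],
    ["home", "sports"],
    ["news", "home", "weather"],
    ["home", "news", "weather", "sports"],
]

min_support = 2  # minimum number of sessions that must contain a pattern
candidates = [("home", "sports"), ("news", "weather"), ("sports", "home")]
frequent = [p for p in candidates if support(p, sessions) >= min_support]
print(frequent)
```

Lowering `min_support` admits more candidates, which is the source of the pattern explosion described above when the threshold is set very low on a large log.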
In addition, most existing WUM algorithms only generate the set of frequent web pages without providing details of possible fluctuations in the degree of frequentness throughout the database. Since the measure used to indicate frequency is the aggregate support measure, existing algorithms do not highlight the variation in frequency that almost invariably occurs at different stages in the database. Therefore, the dynamic behaviour of a discovered pattern often goes undetected. As is often the case, a frequent pattern is frequent only at certain points of the database and is rarely consistently frequent throughout. Additional information on the dynamic behaviour of a pattern provides a deeper understanding of the pattern and can help the analyst to formulate different action plans for different patterns, even though they might all be frequent. For example, if two patterns A and B are found to be frequent, but pattern A is largely frequent only at the beginning of the database while pattern B shows the opposite trend, then it is possible that the trend represented by pattern A is already obsolete while the trend reflected by pattern B is only emerging. Therefore, there is a need to enhance existing WUM methods to provide insight into the dynamic behaviour of a frequent pattern, which will enable the analyst to draw more accurate conclusions from the investigation.
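The contrast between patterns A and B can be illustrated with a small sketch that splits a time-ordered log into consecutive segments and reports the relative support of a pattern in each one. The segmentation scheme, the toy data and the function names are assumptions made for illustration; this simple segment-wise count is not the transitional pattern framework adopted later in the thesis.

```python
from typing import List, Sequence


def contains(pattern: Sequence[str], session: Sequence[str]) -> bool:
    """True if the pages of `pattern` appear in `session` in order (gaps allowed)."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)


def support_per_segment(
    pattern: Sequence[str], sessions: List[Sequence[str]], n_segments: int
) -> List[float]:
    """Fraction of sessions containing `pattern` in each consecutive segment."""
    total = len(sessions)
    bounds = [i * total // n_segments for i in range(n_segments + 1)]
    result = []
    for start, end in zip(bounds, bounds[1:]):
        segment = sessions[start:end]
        hits = sum(1 for s in segment if contains(pattern, s))
        result.append(hits / len(segment) if segment else 0.0)
    return result


# Hypothetical sessions, listed in chronological order.
sessions = [
    ["a", "b"], ["a", "x", "b"], ["a", "b"], ["c", "d"],
    ["c", "d"], ["a", "b"], ["c", "x", "d"], ["c", "d"],
]

pattern_a = ("a", "b")  # aggregate support 4, concentrated early in the log
pattern_b = ("c", "d")  # aggregate support 4, concentrated late in the log
print(support_per_segment(pattern_a, sessions, 2))
print(support_per_segment(pattern_b, sessions, 2))
```

Both patterns have the same aggregate support, so a purely frequency-based method would treat them alike, whereas the per-segment view shows one fading and the other emerging.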
1.3 STATEMENT OF THE PROBLEM
The issues related to mining sequential patterns from web usage data mentioned in the previous section represent the problems that this thesis aims to address. First, the lack of consideration for activity data in existing methods undermines the quality of mining results. Activity data refers to the series of clickstreams resulting from navigating web pages. Existing sequential pattern mining (SPM) algorithms rely solely on the frequency measure, which is generally determined by the number of users forming a certain pattern of web page navigation. While pattern frequency is crucial in determining major trends or behaviours, it disregards activity data and thus leads to problems such as a very high number of extracted patterns. Tanbeer et al. (2009) addressed this issue for frequent pattern mining from transactional databases by proposing the notion of periodic-frequent patterns, but no existing sequential mining method addresses it. This observation leads us to believe that a pattern-mining framework that considers the behaviour of individuals as well as that of the masses will further improve the quality of extracted patterns and, hence, the quality of the knowledge derived from them.
Second, despite the many significant improvements made in WUM, the efficiency of mining sequential patterns is still an open research issue (Pei et al., 2007; Chand et al., 2012), particularly when the threshold value is set very low, the web usage log is very large and the records may contain long sequences. This is mainly due to the extremely large number of sub-patterns that algorithms need to consider, which contributes to a high computing cost. Moreover, if the mining process produces a large number of resulting patterns, which invariably include both interesting and uninteresting ones, the task of analysing and interpreting the patterns can be overwhelming and potentially confusing. When constraints are applied in mining, the efficiency issue is further aggravated, since the constraints are applied towards the end of the mining process.
Third, most existing WUM approaches involve sequences whose items have binary attributes only, which means that only the presence or absence of an item is considered. This limits the potential of WUM, since categorical data are abundant in real-life applications and useful knowledge in the form of sequential patterns may be hidden in them. Even though fuzzy concepts were introduced into sequence mining some time ago (Hong, Kuo, & Chi, 1999), fuzzy sequential pattern approaches suffer from efficiency issues, including the problem of computing the cardinality of membership degrees. In this study, fuzzy concepts are used to rank frequent usage sequences by taking activity data into consideration. The behaviour of a user, expressed by a sequence of mouse clicks, may be interpreted differently if the intervals between events in the sequence are considered. The length of the intervals can be a factor that characterises the strength of the behaviour. Therefore, for any proposed web usage mining method to be effective and accurate, due consideration of the intervals is important.
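One way to picture how interval length could feed a fuzzy ranking is sketched below. Everything in it is an assumption made for illustration: the single linguistic term "short", its breakpoints of 30 and 120 seconds, the use of the minimum as the combining operator, and the pattern labels and timestamps. The fuzzy terms actually used in this thesis are defined in Chapter Five and may differ.

```python
from typing import Dict, List


def short_interval(gap: float, full: float = 30.0, zero: float = 120.0) -> float:
    """Degree in [0, 1] to which a gap (seconds) counts as 'short'.

    Hypothetical shoulder-shaped membership: 1 up to `full`, falling
    linearly to 0 at `zero`.
    """
    if gap <= full:
        return 1.0
    if gap >= zero:
        return 0.0
    return (zero - gap) / (zero - full)


def pattern_quality(timestamps: List[float]) -> float:
    """Weakest 'short' membership over consecutive click intervals (fuzzy AND)."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return min((short_interval(g) for g in gaps), default=1.0)


# Hypothetical click times (seconds) for one occurrence of each pattern.
occurrences: Dict[str, List[float]] = {
    "A-D": [0, 75],
    "B-E": [0, 200],
    "A-B-C": [0, 20, 45],
}

ranked = sorted(occurrences, key=lambda p: pattern_quality(occurrences[p]), reverse=True)
print(ranked)
```

Under these assumptions, a pattern whose clicks follow one another closely receives a higher degree than one with long pauses, which gives the analyst an ordering among patterns that are all frequent.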
Finally, existing methods implicitly assume that the frequency of a certain behaviour persists throughout the period of investigation. This implicit assumption follows from the use of frequency-based methods. Relying on pattern frequency alone may not be sufficient to help the analyst make sound and effective decisions, because the frequency measure only provides an aggregated view of a pattern’s frequency. In real life, however, most behaviours seldom persist for the entire period, so a measure that identifies the level of frequency variation is necessary for the result to be more accurate. If the variation in frequency level is not considered, the mining process will not be able to capture exactly where or when a certain behaviour changes. Since the behaviour of each pattern varies throughout the database and may differ among frequent patterns, the web analyst needs this additional information to draw proper conclusions. Insufficient information will not only result in poor decision-making but may also lead to an erroneous web design. A detailed understanding of the behaviour of a sequence throughout the database will give much deeper knowledge and insight for better interpretation.
1.4 RESEARCH QUESTIONS
The issues described in the previous section prompt the following research questions:
i. How may the issue of algorithm inefficiency be addressed when mining vast server logs?
ii. Can browsing activity be more accurately captured into WUM in order to extract more meaningful patterns?
iii. How do we address the confusion arising from an excessively high number of extracted patterns?
iv. How do we capture possible changes in a pattern’s behaviour throughout the entire period of investigation in order to accurately understand the characteristics of the pattern?
1.5 RESEARCH OBJECTIVES
In order to address the above research questions, the following objectives are set:
i. To implement a vertical-database mining algorithm with the aim of better efficiency when mining large web logs using a very low support threshold.
ii. To introduce a new constraint in the process of pattern mining that can capture user traversing activity.
iii. To propose a method of sorting extracted patterns based on the proximity of elements in a pattern by employing fuzzy methods.
iv. To propose a method that captures the dynamic behaviour of web usage patterns, which can provide deeper insight for the analyst.
Table 1.1 shows the mapping of research issues, research questions and research objectives of this study.
Table 1.1: Mapping of Research Questions and Objectives
Research Issue | Research Question | Research Objective
Algorithm inefficiency for large dataset | How may the issue of algorithm inefficiency be addressed when mining vast server logs? | To implement a vertical-database mining algorithm with better efficiency when mining large web logs using very low support threshold
Quality of | Can browsing activity be more accurately captured into WUM in order to extract more meaningful patterns? | To introduce a new constraint in the process of pattern mining that can capture user traversing activity
Quantity of | How do we address the confusion arising from excessively high number of extracted patterns? | To propose a method of sorting extracted patterns based on proximity of elements in a pattern by employing fuzzy methods
Insufficient insight from extracted patterns | How do we capture changes in a pattern’s behaviour throughout the entire period of investigation in order to accurately understand the characteristic of the pattern? | To propose a method that captures the dynamic behaviour of web usage patterns, which can provide deeper insight for the analyst.
The achievement of the above objectives will be measured by an experimental design approach. To evaluate the quality of patterns discovered and the performance of the algorithm, a series of experiments will be conducted using real web access data as input. In addition, the algorithm will be evaluated in several aspects including processing speed, memory usage, scalability and quality of the extracted patterns.
1.6 SIGNIFICANCE OF STUDY
This study is an important endeavour to promote web usage mining for the benefit of organisations, businesses and industries that rely on websites to deliver goods and services. It proposes a novel method that incorporates the element of user behaviour in mining web usage patterns and reveals the detailed characteristics of each pattern in a way that facilitates the interpretation of data mining results. As such, the research may benefit web designers and web administrators in learning and understanding the traversal behaviour and interests of their users, and in taking the necessary actions to enhance their websites in order to enrich their customers’ browsing experience.
It is the objective of this study to address the above outstanding research issues in the area of web usage mining and to present answers, in terms of methodological and theoretical approaches, that overcome the obstacles preventing the web community from benefiting from WUM technology. The study contributes to knowledge in the field of web usage mining in a number of ways. Firstly, it seeks to propose a more effective method of mining web usage patterns, which incorporates individual users’ behaviour into the existing framework of web usage mining with the objective of improving the overall quality of extracted patterns. Secondly, in the course of implementing the method, an enhanced algorithm is proposed that offers better performance, especially in the typical web usage mining situation where the threshold is very small.
This study also attempts to propose an approach that reveals more detailed characteristics of each extracted pattern, instead of only the simple conventional frequency indicator used by most existing approaches. This detailed information is crucial in helping the analyst to make more accurate interpretations, which makes the output more actionable and requires less manual intervention.
In summary, the entire study seeks to discover a better alternative for mining web usage patterns that is not only effective but also able to fulfil the needs of the web industry at large.
1.7 RESEARCH DESIGN
This research proposes a model for mining sequences of web usage patterns that considers the behavioural criterion of regularity in addition to frequency. The notion of regularity represents activity data that reflect a user’s navigational pattern when browsing a website (Cao, 2010). Thus, the goal of the proposed model is to find web usage patterns that are not only highly frequent but also repeatedly performed.
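The difference between a pattern that is merely frequent and one that is also regular can be sketched as follows. The sketch assumes that regularity is measured as in periodic-frequent pattern mining (Tanbeer et al., 2009), namely by the largest gap between consecutive positions at which a pattern occurs, with the start and end of the database counted as boundaries. This is an assumption made for illustration; the definition used by the proposed model is given in Chapter Four and may differ. The function names, thresholds and occurrence positions are likewise hypothetical.

```python
from typing import List


def max_gap(occurrences: List[int], db_size: int) -> int:
    """Largest distance between consecutive occurrence positions (1-based).

    Position 0 and position `db_size` act as boundaries, so a pattern that
    appears late or disappears early is penalised as well.
    """
    points = [0] + sorted(occurrences) + [db_size]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def is_frequent_regular(
    occurrences: List[int], db_size: int, min_support: int, max_regularity: int
) -> bool:
    """Keep a pattern only if it is frequent enough and its gaps stay small."""
    return (
        len(occurrences) >= min_support
        and max_gap(occurrences, db_size) <= max_regularity
    )


# Hypothetical positions (record numbers) at which two patterns occur
# in a database of 10 records. Both have the same support of 5.
pattern_x = [1, 3, 5, 8, 10]  # spread across the whole database
pattern_y = [1, 2, 3, 4, 5]   # all occurrences bunched at the start

print(is_frequent_regular(pattern_x, 10, min_support=5, max_regularity=3))
print(is_frequent_regular(pattern_y, 10, min_support=5, max_regularity=3))
```

A support threshold alone cannot separate the two patterns, whereas the additional gap constraint retains only the one that recurs throughout the database.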
Moreover, the model captures the dynamic behaviour of patterns to provide deeper insight into the possible variation in frequency level throughout the database. Based on the framework of transitional pattern mining (Qian & Ann, 2006), the changes in the intensity of frequency of each frequent pattern throughout the database will be revealed to give the analyst a more accurate and detailed picture of the pattern. With this additional knowledge, the analyst is in a position to draw better conclusions from the analysis of frequent web sequences.