
STEEPEST DESCENT BASED ADAPTIVE STEP-SIZE REINFORCEMENT LEARNING AND APPLICATION OF REINFORCEMENT LEARNING IN BRAIN FIBER

TRACKING

By

KHOSROW AMIRI ZADEH

Thesis submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy

June 2015


ACKNOWLEDGMENTS

I would like to express my special thanks to my supervisor, Prof. Dr. Rajeswari Mandava, who helped and supported me a great deal. I am very thankful for all her insightful and technical comments.

Certainly, without her support, completing this study would have been very difficult, if not impossible. It has been a great honor to work under her supervision. She showed great patience with me, trusted me, and allowed me to go as far as I could. I hope she remains in good health; our prayers will always be with her.

I would like to extend great thanks to my wife, Sharareh, and my daughter, Negin, who endured so many difficulties and remained steadfast through many unexpected and trying situations.

Special thanks to Assoc. Prof. Dr Dhanesh Ramachandram for his comments and suggestions throughout the thesis, and also to Assoc. Prof. Dr Cheah and everyone at the School of Computer Sciences office who always provided me with administrative information and support during this time.

I gratefully acknowledge the funding received towards my PhD from the Ministry of Higher Education, Malaysia, as a PhD fellowship under grant 304/CNEURO/652203/K134.

Khosrow Amiri Zadeh, March, 2015.

School of Computer Sciences, USM Penang, Malaysia


TABLE OF CONTENTS

Page

Acknowledgments ii

Table of Contents iii

List of Tables viii

List of Figures x

List of Abbreviations xvii

Abstrak xx

Abstract xxii

CHAPTER 1- INTRODUCTION

1.1 Problem statement 4

1.2 Research questions 5

1.3 Research objectives 6

1.4 Significance of the study 6

1.5 Scope of the study 7

1.6 Research contributions 7

CHAPTER 2- LITERATURE REVIEW

2.1 Introduction 9

2.2 Classical Temporal Difference learning models 10

2.2.1 TD learning without prediction, Action-value model 12

2.2.2 TD learning with One-step prediction, TD(0) 13

2.2.3 TD learning with eligibility trace term, TD(λ) 14

2.2.4 TD learning with linear function approximation, TD(λ)-FA 15

2.2.5 Summary 17

2.3 Variants of Temporal Difference learning models 17

2.3.1 Variants in TD-learning without prediction 17

2.3.2 Variants in TD-learning with linear function approximation approach 22

2.4 Related work: Automatic step-size computation models 27

2.4.1 Polynomial Step-size model 27

2.4.2 GHS Step-size model 28

2.4.3 OSA Step-size model 28

2.4.4 HL Step-size model 29

2.5 Conclusion 31

CHAPTER 3- LINE SEARCH APPROACHES, STOCHASTIC APPROXIMATION THEORY AND CONVERGENCE ANALYSIS


3.1 Introduction 32

3.2 Line search approaches 33

3.3 Stochastic approximation theory 36

3.4 Ordinary Differential Equation (ODE) analysis 40

3.5 Convergence analysis of a second order objective function 42

3.6 Conclusion 45

CHAPTER 4- STEEPEST DESCENT TD LEARNING MODELS - STD

4.1 Introduction 46

4.2 Derivation of STD without prediction 46

4.2.1 Adaptive epsilon-greedy Model-ASM 49

4.2.2 Adaptive Upper Confidence Bound model –AUCB 50

4.3 Derivation of STD(0) and STD(λ) models 51

4.4 Derivation of STD(0)-FA and STD(λ)-FA models 55

4.5 Conclusion 60

CHAPTER 5- PERFORMANCE EVALUATION OF THE PROPOSED MODELS

5.1 Introduction 61

5.2 Experiments in STD without prediction, evaluation of proposed models ASM and AUCB 61


5.2.1 Percentage of optimal action selections and amount of cumulative reward under stationary observation in ASM 63

5.2.2 Percent of optimal action selection under non-stationary observation in ASM 65

5.2.3 Stability under increasing number of actions in ASM 65

5.2.4 Behavior of the optimum step size computation in ASM 66

5.2.5 Performance evaluation of existing UCB models with the proposed AUCB 68

5.3 Performance evaluation of STD(0) and STD(λ) 70

5.3.1 Experiments in Multi-State Boyan Chain problem 71

5.4 Experiments and evaluations with STD(0)-FA and STD(λ)-FA 75

5.4.1 Models comparisons of existing TD-FA models with STD-FA 76

5.4.2 Automatic step-size computation strategies comparisons 77

5.5 Conclusion 78

CHAPTER 6- ACTION-VALUE-BASED BRAIN FIBER TRACKING MODEL

6.1 Introduction 79

6.2 Brain white matter fibers 79

6.3 Diffusion phenomenon 81

6.4 DTI Processing 86


6.5 Fiber tracking approaches 90

6.5.1 Deterministic approach 90

6.5.2 Non-deterministic and stochastic approaches 91

6.6 Proposed Action-value brain fiber tracking model 93

6.7 Experimental results and discussion 101

6.8 Conclusion 111

CHAPTER 7- CONCLUSION AND FUTURE WORKS

7.1 Conclusion 113

7.2 Future works 115

PUBLICATIONS 116

REFERENCES 117

APPENDIX A EXAMPLES IN LINE SEARCH APPROACH 127

APPENDIX B EXPERIMENTS ON THE EPISODIC TASKS 132

APPENDIX C ACTION-VALUE-BASED FIBER TRACKING PACKAGE 139

APPENDIX D MAKING SYNTHETIC DW-MRI DATA SET 144


LIST OF TABLES

Page

Table 2.1 The incremental TD(0) 14

Table 2.2 The incremental TD(λ) 15

Table 2.3 The incremental TD(λ) – FA 16

Table 2.4 The incremental GTD – FA 23

Table 2.5 The RLSTD(λ) general model 24

Table 2.6 Main TD models and their limitations. The action-value approaches (1-2), classical temporal difference (3), variants of temporal difference with function approximation (4-9). 26

Table 2.7 Application of OSA to TD without prediction learning model. 29

Table 2.8 The summary of automatic step-size computation strategies 30

Table 3.1 An optimization problem based on steepest descent (SD) method 42

Table 4.1 Adaptive epsilon-greedy Model (ASM) 50

Table 4.2 Adaptive UCB model (AUCB) 50

Table 4.3 The STD(0) model 54

Table 4.4 The STD(λ) model 54

Table 4.5 The on-line STD(λ)-FA model 58

Table 5.1 Parameter settings in MAB problem 62


Table 5.2 Summarized evaluation results: ASM compared to relevant existing models 67

Table 5.3 Summarized evaluation results: AUCB compared to relevant existing models 70

Table 5.4 Summarized evaluation results: STD compared to relevant existing models 74

Table 5.5 The features relating to each state 75

Table 5.6 Summarized evaluation results: STD-FA compared to relevant existing TD-FA models 77

Table 6.1 The Action-value-based Fiber Tracking algorithm 100

Table 6.2 Summarized results of fiber tracking models on real data 111

Table B.1 Number of runs of the learning algorithm over the maximum trials 136

Table B.2 The features relating to each state in the Boyan chain problem 136


LIST OF FIGURES

Page

Figure 1.1 The Classical RL 2

Figure 1.2 An example of the state-action map that defines the policy changes after each trial. 2

Figure 1.3 The outline of this thesis. 8

Figure 2.1 An episode contains some transitions 10

Figure 2.2 The operation of two decreasing functions, POL (a) and GHS (b), based on different meta-step-size values. The results show that they are dependent on their meta-parameter values. 28

Figure 3.1 An example of stochastic approximation theory to estimate a parameter of interest, such as the mean of non-stationary observations. 38

Figure 3.2 Mean equation with different step sizes under non-stationary observations. 39

Figure 3.3 The time-domain and step-domain axes relationship 41

Figure 5.1 Average of cumulative rewards (a) and percentage of optimal selections (b). 64

Figure 5.2 Percentage of optimal selections, σ = 0.1 (a) and σ = 5 (b). 64

Figure 5.3 Average of cumulative rewards (a) and percentage of optimal selections (b) in the non-stationary case. 64

Figure 5.4 Percentage of optimal action selections in a larger set of actions for the stationary case (a) and the non-stationary case (b). 65

Figure 5.5 The step-size values form gradually decreasing curves (red curves). The temporal difference error curves (green curves) also decrease after the agent gains more knowledge, in the stationary (a) and non-stationary (b) conditions. 66

Figure 5.6 Percentage of optimal selections under low variance, 0.01 (a) and 0.1 (b). 69

Figure 5.7 Percentage of optimum selections under high reward variances, 1 (a) and 3 (b), in stationary observation cases. 69

Figure 5.8 Percentage of optimal selections with a large set of actions, N=20 (a) and N=40 (b). 69

Figure 5.9 Multi-state Boyan chain problem. 71

Figure 5.10 Empirical results of the classical TD models and STD on the multi-state Boyan chain, in the cases of λ=0 (a) and λ=1 (b). 72

Figure 5.11 Empirical results of the classical TD with different step-size strategies and STD on the multi-state Boyan chain, in the cases of λ=0.3 (a) and λ=0.7 (b). 72

Figure 5.12 Non-stationary observation cases. Some models lose their performance. 73

Figure 5.13 Empirical results of the classical TD models and STD on the multi-state Boyan chain with the number of states increased to 51, in the cases of λ=0 (a) and λ=1 (b). 73

Figure 5.14 13-state Boyan chain experimental results with Gradient TD-FA and STD-FA in the λ=1 (a) and λ=0 (b) conditions. 76

Figure 5.15 The performance of the classical TD-FA model with three automatic step-size strategies applied (a) and the step-size values generated by these models (b). 76

Figure 6.1 White matter and Gray matter areas of the brain. WM fibers connect one point of the brain to another. The base picture is taken from http://www.nlm.nih.gov/medlineplus/ency/images/ency. (Accessed March 2015) 80

Figure 6.2 Corpus Callosum connects the left and right cerebral hemispheres. Source: http://en.wikipedia.org/wiki/Corpus_callosum. 80

Figure 6.3 Fibers can limit molecular movements. Here, a bundle of fibers is constructed to simulate an anisotropic region. Consequently, each mass of molecules will be a linear ellipsoid. 82

Figure 6.4 A fiber crossing case is simulated to represent the diffusion model. Voxels in the crossing area are not linear ellipsoids. 83

Figure 6.5 A coronal view of the brain is overlaid with fiber tracking results in the CC and CST bundles. 83

Figure 6.6 The CST bundle includes some uncertainty areas. Linear ellipsoids indicate fibers in a similar direction, whereas non-linear ellipsoids represent mixed fiber directions, such as crossing or branching cases. 84

Figure 6.7 Seed points in "A" are defined to compute the probability of connection between region "A" and region "B". The uncertainty area results from double branching of bundles, which includes non-linear voxels. 85

Figure 6.8 An example of a linear-ellipsoid tensor model which represents diffusion in a voxel with tensor D whose eigenvalues are λ1 ≫ λ2 > λ3, respectively. The main direction is along the eigenvector e1. 86

Figure 6.9 Diffusion ellipsoid configurations. 87

Figure 6.10 A 2D frame includes two branches. This uncertainty situation contains 3D voxels (b). A voxel inside the uncertainty region is magnified (a); it represents planar diffusion. 87

Figure 6.11 3D representation of two voxels with different diffusion profiles. The illustration in (a) exhibits a linear-ellipsoid voxel and (b) shows a spherical voxel in a region with multiple fiber orientations. The brighter area in the right voxel indicates that the diffusion is not in one direction, as it is on the left side. 89

Figure 6.12 Step-by-step movement constructs a tract. Usually this movement is in two opposite directions. The process stops when a termination criterion is reached. 90

Figure 6.13 The flowchart of the proposed action-value-based fiber tracking model. 94

Figure 6.14 After each movement, the next point mostly falls among other voxels. For further computation, it is necessary to know which voxel will be the base. 94

Figure 6.15 Set of sampled unit vectors in the 2D case (a) and the set of voxels in this dynamic kernel (b) related to a vector with an angle of 12.6 degrees. 96

Figure 6.16 The reward value generator function generates a higher reward for an anisotropic region (upper curve) and a lower value for an uncertainty region (lower curve). 100

Figure 6.17 A double-branching synthetic fiber bundle (a) and sampled fiber tracts (b). The number of tracts is set to 30. The proposed model marks the winner with a dotted line. 102

Figure 6.18 A double-crossing synthetic fiber bundle (a) and sampled fiber tracts (b). The number of tracts is set to 30. For this point, the proposed model marks the winner with a dotted line. 103

Figure 6.19 A kissing-crossing artificial fiber bundle (a) and sampled fiber tracts (b). The number of tracts is set to 30. For this point, the proposed model marks the winner with a dotted curve. 104

Figure 6.20 Fiber paths are better reconstructed by the proposed model (a) than by the Friman model (b). The framework is the double crossing shown in Fig. 6.18(a). 105

Figure 6.21 Three sets of seed points are selected in the certainty areas. More fiber tracts are reconstructed by the proposed model (a) than by the Friman model (b). The framework is the kissing-crossing shown in Fig. 6.19(a), which poses some challenges to the Friman model. 105

Figure 6.22 Results of the proposed model (a) and the Friman model (b). The framework is the double branching shown in Fig. 6.18(a). These results indicate that the proposed model traces fibers more precisely than the Friman probabilistic method. 105

Figure 6.23 Human brain with labels. Source: http://www.tutorvista.com/content/biology/biology-ii/control-and-coordination/central-nervous-system.php. (Accessed March 2015) 107

Figure 6.24 The lobes of the cerebral hemispheres (a) and the fiber paths of CST and CC on a coronal view of the brain (b). Source: www.ablongman.com/carlson6e. (Accessed March 2015) 107

Figure 6.25 The fiber paths of CC on a coronal view. The proposed model reconstructs more fiber paths. 108

Figure 6.26 Tractography processing on the CST bundle. More fiber tracts are reconstructed by the proposed action-value-based fiber tracking process. 109

Figure 6.27 The ROIs for the Cingulate gyrus fibers are manually selected in two sections, from the inferior towards the posterior section of this bundle. Thus some fibers that pass through these two ROIs are drawn. 110

Figure A.1 Two local minimum (blue) and local maximum (red) points show two points where the objective function has minimum and maximum values, respectively. 127

Figure A.2 Code written in Matlab to solve the given ODE through the Euler technique. 130

Figure A.3 Some runs of the Euler technique to solve the given ODE. An exact step-size setting gives good convergence (blue curve), while outside this setting it may fluctuate or slow down. 131

Figure B.1 The schematic diagram of Windy Grid problem. 132

Figure B.2 One Episode/Trial in Windy Grid problem. 133

Figure B.3 The value of states after 4000 Episodes/Trials in the Windy Grid problem. 134

Figure B.4 The schematic diagram of the MAB problem. 134

Figure B.5 Multi-state Boyan chain problem. 136

Figure B.6 State-value estimation in the multi-state Boyan chain with a complete TD-FA run. 137

Figure B.7 The error curve converges to the minimum point after 80 trials. 138

Figure C.1 The GUI of the fiber tracking package written during this study. 139

Figure C.2 The section relating to fiber tracking processing on artificial data. 140

Figure C.3 The section relating to loading a Nifti file into the package. 140

Figure C.4 The section relating to converting a Nifti file into the .mab format. 141

Figure C.5 The section relating to loading a .mab format data set directly. 141

Figure C.6 The section relating to tractography on real data. 142

Figure C.7 The coronal and sagittal slices of real data. The slider under each button defines the slice. 143

Figure C.8 The voxel at the defined address shows the diffusion profile and FA value. 144


Figure C.9 The Fornix fibers under test. Source (left image): http://www.tutorvista.com/content/biology/biology-ii/control-and-coordination/central-nervous-system.php. (Accessed March 2015) 145

Figure D.1 The Artificial dMRI data set. 147

Figure D.2 The voxel representation of a point. 148


LIST OF ABBREVIATIONS

2D 2 Dimensional

ADC Apparent Diffusivity Coefficient

ASM Adaptive Step-size Model

AUCB Adaptive Upper Confidence Bound

BP Back Propagation

CC Corpus Callosum

CSF Cerebro Spinal Fluid

CST Cortico Spinal Tract

CG Cingulate gyrus

DTI Diffusion Tensor Imaging

DWI Diffusion Weighted Imaging

dMRI Diffusion Magnetic Resonance Imaging

FA Fractional Anisotropy

GD Gradient Descent

GHS Generalized Harmonic Step-sizes

GTD Gradient-descent Temporal Difference


GM Gray Matter

GTD-FA GTD with linear Function Approximation

i.i.d. independent and identically distributed

LFA Linear Function Approximation

LSTD Least Square Temporal Difference

MAB Multi-Armed Bandit

MC Monte Carlo

MDP Markov Decision Process

MRI Magnetic Resonance Imaging

ODF Orientation Distribution Function

OSA Optimum Step size Algorithm

PDF Probability Density Function

RL Reinforcement Learning

RLS Recursive Least Square

ROI Region Of Interest

SARSA State-Action-Reward-State-Action

STD Steepest-descent Temporal Difference

TAI Traumatic Axonal Injury


TBI Traumatic Brain Injury

TD Temporal Difference

TD-FA Temporal Difference with Function Approximation

TF Transition Function

UCB Upper Confidence Bound

VF Value Function

WM White Matter

WP Without Prediction


PEMBELAJARAN PENGUKUHAN SAIZ LANGKAH BOLEH SUAI BERASASKAN PENURUNAN TERCURAM

DAN PENGGUNAAN PEMBELAJARAN PENGUKUHAN DALAM PENJEJAKAN GENTIAN OTAK

ABSTRAK

Model beza masa (temporal difference, TD) tokokan pembelajaran pengukuhan (reinforcement learning, RL) menawarkan teknik berkesan bagi anggaran nilai dalam masalah melibatkan pengendalian pemilihan berturutan dalam bidang pembelajaran mesin, kawalan suai, sistem sokongan keputusan dan robotik industri/berautonomi. Memandangkan model ini biasanya beroperasi berdasarkan pembelajaran penurunan cerun, mereka mempunyai beberapa batasan terutamanya dalam penetapan saiz langkah. Model pembelajaran TD penurunan cerun (gradient descent temporal difference, GTD) adalah sensitif kepada jenis pemerhatian dan penetapan parameter. Batasan ini lebih nyata dalam aplikasi dalam talian yang mana model GTD dijangka menjadi mudah suai di bawah pemerhatian tak pegun. Isu ini menunjukkan satu jurang; model TD sedia ada adalah tidak mudah suai. Ini bermakna terdapat masalah kebergantungan parameter dalam algoritma GTD tokokan. Oleh itu penghasilan set model dipertingkat yang menghapuskan atau meminimumkan isu ini diperlukan. Dalam tesis ini, kelas baru bagi model TD dalam pembelajaran pengukuhan dibentangkan. Kelas yang dicadangkan termasuklah model pembelajaran TD berasingan yang ditentukan oleh pendekatan pengoptimuman penurunan tercuram (steepest descent, SD). Fokus utama ialah pada pengkomputeran optimal saiz langkah pembelajaran TD tokokan. Keputusan eksperimen menunjukkan bahawa model yang dicadangkan cepat tertumpu berbanding model yang sedia ada.

Istilah cepat merujuk kepada kadar tumpuan lekok ralat ke arah aras minimum yang dicapai melalui bilangan percubaan yang minimum. Peningkatan ini, bergantung kepada setiap model, ialah antara 40% hingga 70% manakala model yang dicadangkan mengekalkan kekompleksan linear sepertimana model TD yang standard. Selain itu, model baru tidak bergantung kepada penetapan saiz langkah. Peningkatan ini menunjukkan bahawa model TD yang dibentangkan adalah mudah suai dan dapat memenuhi jurang tersebut. Pendekatan yang dicadangkan diperluaskan kepada proses penjejakan gentian otak yang dikenali sebagai proses traktografi iaitu objektif kedua tesis ini. Proses traktografi secara asasnya berkelakuan seperti tugas pemilihan berturutan di bawah keadaan ketidakpastian yang menjadi sasaran sama untuk RL. Proses traktografi memainkan peranan penting dalam pengajian pra/pasca pembedahan otak dan penilaian kecederaan otak traumatik. Walau bagaimanapun, proses traktografi masih mempunyai masalah masa pengkomputeran panjang yang mana model traktografi garis arus dengan masa pengkomputeran yang pendek gagal menunjukkan profil gentian otak dengan tepat dalam kawasan yang mengandungi struktur gentian yang berbeza. Inilah sebab yang menggalakkan penggunaan pendekatan RL yang sesuai kepada proses penjejakan gentian otak. Hasil eksperimen pada kedua-dua set data tidak asli dan asli menunjukkan peningkatan dalam proses penjejakan gentian, terutamanya dalam keadaan tidak pasti yang mencabar untuk teknik penjejakan gentian yang lain.


STEEPEST DESCENT BASED ADAPTIVE STEP-SIZE REINFORCEMENT LEARNING AND APPLICATION OF REINFORCEMENT LEARNING IN BRAIN FIBER

TRACKING

ABSTRACT

Incremental temporal difference (TD) models of reinforcement learning (RL) offer powerful techniques for value estimation in problems involving sequential-selection operations in the areas of machine learning, adaptive control, decision support systems and industrial/autonomous robotics. Since these models mostly operate based on Gradient-descent learning, they come with certain limitations, especially in their step-size settings. Gradient-descent TD (GTD) learning models are sensitive to the type of observations and to parameter settings. These limitations are more pronounced in on-line applications, where GTD models are expected to be adaptive under non-stationary observations. These issues indicate a gap: the existing TD models are not adaptive, meaning that the incremental GTD algorithms suffer from a parameter dependency problem. Consequently, a set of enhanced models that eliminates or minimizes these issues is desirable. This thesis presents a new class of TD models in reinforcement learning.

The proposed class includes separate TD learning models governed by the steepest descent (SD) optimization approach. The major focus is on the optimal computation of the step-size in incremental TD learning. Experimental results indicate that the proposed models converge faster than similar existing models. The term faster refers to the error curves reaching their minimum level within a minimum number of trials. This improvement, depending on the model, is between 40% and 70%, while the proposed models maintain linear complexity, as in the standard TD models. Besides, the new models do not depend on step-size settings. These improvements indicate that the presented TD models are adaptive and may fill the gap. Extending a reinforcement learning approach to the brain fiber tracking process, known as tractography, is the second objective of this thesis, because the tractography process behaves essentially as a sequential-selection task under uncertainty, which is exactly the kind of problem RL targets. The tractography process plays an important role in pre/post brain surgery studies and traumatic brain injury assessments. However, the process still suffers from long computational times, while streamline tractography models, with short computational times, fail to precisely reveal the brain fiber profiles in areas containing different fiber structures. These are the reasons that motivated the application of a suitable RL approach to the brain fiber tracking process. Experimental results on both artificial and real datasets indicated considerable improvement in the fiber tracking process, especially in the conditions of uncertainty that are challenging for other fiber tracking techniques.


CHAPTER 1 INTRODUCTION

Reinforcement learning (RL) is a form of machine learning that aims to find an optimum policy in a sequential-selection task where, at each step, an agent selects an action with regard to maximizing the total reward received (Sutton and Barto, 1998). The RL approach has been studied in many domains, including machine learning, operational research, intelligent control engineering, industrial robotics, optimal resource sharing and autonomous robotics (Granmo and Glimsdal, 2013; Gai et al., 2010; Dinesh and Saranga, 2010). In addition, it has been applied to medical imaging applications for segmentation or clustering tasks (Zhao et al., 2009).

Optimum selection techniques have been an attractive subject in many disciplines. Such a technique may enhance process performance in terms of processing time, computational cost and algorithm convergence. The significance of optimum selection is even greater when decisions must be made under conditions of uncertainty (Sutton and Barto, 1998; Boyan, 2002). More precisely, in most sequential-selection problems a decision maker/RL-agent faces a set of options and, without any prior knowledge about the value of each option, selects one of them. This task is repeated several times; this is the common structure of these problems. Certainly, the agent makes every effort to ensure the best decision, while the only response it receives is a reward value given after each selection. If this action selection is repeated multiple times, how does the decision maker act to achieve the maximum cumulative reward? In other words, in a successive action-selection task, what order of these actions optimizes the objective function? Obviously, if the decision maker knew the value of each action, it could easily select the best one at each repetition.

The classical framework of reinforcement learning is illustrated in Fig. 1.1. It has an evolutionary loop. The decision maker or RL-agent selects an action and receives a reward. This can change the current state. Based on this new information, it updates its value and decides on the next action selection. This procedure is repeated until the agent gains sufficient knowledge about the environment. After this training phase, it can select actions optimally.

Figure 1.1: The Classical RL

A policy is the optimum selection criterion; it can optimize the objective function from a long-term perspective. Thus, it is necessary to know which policy can lead the agent to select actions intelligently. In this regard, the agent uses the value of each state to select the best action such that it maximizes long-term benefits. Consequently, finding the optimal policy becomes the estimation of the value of each state. This is called value function (VF) estimation. If we have a set of states $S$ and a set of actions $A$, a simple state-action map may be drawn as in Fig. 1.2. For example, after 10 trials, if the agent goes to state s(3), it selects action A5.

Figure 1.2: An example of the state-action map that defines the policy changes after each trial.
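As a minimal illustration of such a map (an assumption made only for this example, not part of the thesis code), the mapping of Fig. 1.2 can be stored as a simple lookup array in Matlab; the state and action indices below are hypothetical.

% A deterministic policy stored as a state-to-action lookup (illustrative only).
policy = [2 4 5 1];        % policy(i) = index of the action selected in state s(i)
s = 3;                     % the agent is currently in state s(3)
a = policy(s);             % after training, this returns action A5 for state s(3)
fprintf('In state s(%d) the policy selects action A%d\n', s, a);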

In the domain of RL, the state-action-reward formulation is mostly used. However, in some situations with a single state, this map simplifies to an action-reward pair. The value estimation is defined by an incremental structure. This incremental structure operates based on a temporal difference (TD) error, known as the update rule, which is defined for each state $s \in S$.

The TD error shows the difference between the current value and its target value. Based on different definitions of the target value, variants of TD learning have been presented. If the target value is only the reward received, it is called TD without prediction. If the target value is the reward received plus a fraction of the value of the next step, it is called TD with one-step prediction, represented by TD(0), and if multi-step prediction is desired, the eligibility trace term is used and the model is represented by TD(λ). The eligibility trace term is defined for each state and must be updated at each step. If the agent visits a state, its eligibility term is incremented by one, while the eligibility of the other states decays by a positive fraction less than one. This term can improve the learning time because it simulates multi-state prediction.

Finally, if the value function (VF) is approximated by a linear function, the model is called TD with linear function approximation and is named TD-FA (Sutton and Barto, 1998). TD models behave similarly to Gradient-descent-based models; therefore, they inherit some limitations of this approach, namely long learning times and dependency on parameter settings. The specifics of TD models are further elaborated in the next chapter.

One major component of this thesis is brain fiber tracking using diffusion-weighted magnetic resonance image data. Tracing or tracking the brain white matter fibers is important: for neurologists, to understand or predict neurological disorders; for neurosurgeons, to assist in surgical planning and post-surgical evaluation; and for many other purposes (Niogi and Mukherjee, 2010).

A brain fiber tracking algorithm starts from a point in the brain and follows the fiber direction to reconstruct a tract. Several algorithms have been presented for this job; however, they face various issues. The time needed to obtain an acceptable fiber tracking result, insufficient knowledge for fiber direction detection in regions where the fiber direction is not clear, difficulty in implementing the algorithms, and difficulty in quantitatively evaluating the revealed tracts are some of the problems seen in fiber tracking algorithms (Jbabdi et al., 2011; Fillard et al., 2011).

1.1 Problem Statement

This subsection introduces issues observed in TD learning models. Different types of TD models suffer from one or several of these issues. They are categorized as:

i) Long learning time: TD learning models operate based on an incremental structure that matures the VFs through several runs of their updating rules, called trials. The approach behind the updating rules is Gradient descent. The minimum number of trials necessary to achieve optimal convergence (i.e., a VF estimate close to the actual value) is called the learning time (Nocedal and Wright, 1999). The efficiency of incremental TD models may be compared based on this criterion; therefore, an algorithm is called faster when it converges to its optimum point in fewer trials. However, this convergence is achieved by an accurate definition of both the direction of the movement towards the optimum point and the step-size of the movement. These two factors have a strong impact on the convergence quality of an algorithm, including its learning time. Gradient-descent algorithms are also known as models with long learning times (George and Powell, 2006; Benveniste, 1990).

ii) Parameter dependency: Incremental TD models are usually presented together with a table of parameter settings that defines the step-size values. These algorithms operate well under specific settings, whereas outside these settings they often diverge (Hutter and Legg, 2008; Vermorel and Mohri, 2005; Gai et al., 2010). This indicates that TD models depend on the step-size value to achieve the best convergence quality. In other words, they must usually be re-tuned when different data are observed, which indicates instability under variable observations (Benveniste, 1990; Nocedal and Wright, 1999). This concern is even more pronounced when TD learning with a function approximation model is used.

iii) Less adaptability: As a result of the limitations stated above (i.e., parameter dependency and long learning time), it may be found that TD models are poor at adapting to different observations (Hutter and Legg, 2008; Dabney and Barto, 2012). Taking more time to tune the parameters for different observations indicates that they are not adaptive. In on-line situations, parameter tuning is not a simple task while the process is running, because of operational costs or learning time limitations.

iv) Inaccurate fiber tracking: Fiber tracking models try to find tracts that best explain the observed data and to exploit spatial interactions during tractography; however, poor fiber orientation detection remains the major concern in brain fiber tracking algorithms (Jbabdi et al., 2011). In areas with complex fiber structure (e.g., crossing fiber bundles), the dominant fiber direction is mostly not precisely defined, and this represents a big challenge to fiber tracking algorithms.

In summary, incremental TD models have several underlying limitations inherited from their Gradient-descent (GD) approach. Such issues indicate that TD learning models are sensitive to a problem's settings. Parameter dependency is the main issue in TD learning algorithms: the convergence of these algorithms is tied to the exact setting of their parameters, which reflects that Gradient TD learning models are parameter-dependent. On the other side, there is a strong motivation to apply an RL approach to the problem of brain fiber tracking, due to the similarity between the fiber tracking process and the problems targeted by RL, in order to minimize the fiber tracking concerns.

1.2 Research questions

This section identifies the following challenges in TD learning, which should be addressed:

i) Is there any relation between the accurate step-size value and the convergence rate quality in the incremental TD models? How to automatically compute the step-size in the incremental RL models to achieve the optimum convergence?

ii) How to enhance the performance of RL-based algorithms to improve their stability under non-stationary observations? How to minimize the impact of changing observations on the rate of convergence?


iii) What factors may affect the learning time in TD models? How to decrease the learning time with reference to the optimum step-size setting?

iv) Is it possible to enhance fiber tracking algorithms using a RL approach?

1.3 Research objectives

The following objectives are motivated by the research questions stated in the previous section. In this thesis, a new set of TD learning models is presented by the author, in which the optimal and automatic computation of the step-size is applied. In addition, an RL technique is adapted to the brain fiber tracking problem. The objectives are listed as follows:

i) To provide a new technique for adaptive computation of step size based on the steepest descent optimization approach for TD learning.

ii) To apply this new adaptive step-size computation to all classical TD learning models to introduce the Steepest-descent version of TD models.

iii) To apply the action-value approach of RL to brain fiber tracking and present the action-value-based brain fiber tracking model.

1.4 Significance of the study

Sequential selection under a lack of knowledge is a common task in most decision-making problems across disciplines. In this context, incremental TD learning models of the RL approach are widely used to address such problems. However, as stated in Section 1.1, they come with some limitations (i.e., parameter dependency and long learning time).

If the step-size value in an incremental structure is defined optimally and automatically, the result is an adaptive structure (George and Powell, 2006; Benveniste, 1990). Adaptive TD learning will be useful in on-line applications in the domains of machine learning, adaptive control, pattern recognition and image processing. Fixing these limitations may improve the learning time and yield a more robust learning model.


Developing an RL approach to the brain fiber tracking process opens a new window onto a different application domain for RL. The fiber tracking problem is consistent with the characteristics of the action-value model. This introduces an alternative model that may balance processing time against accurate reconstruction of fiber tracts, which is attractive to brain surgeons.

1.5 Scope of the study

This thesis reviews the classical and recent models of TD learning and presents the steepest descent version of TD learning, which is called STD. Besides, comprehensive comparisons between the proposed models and the current TD models are performed. Since linear computational complexity is always targeted in this thesis, other descent approaches, such as the Newton descent model elaborated in Chapter 3, are not applied in the proposed TD learning.

Due to the similarity between the action-value model and the brain fiber tracking process, this technique of the RL approach is applied to the fiber tracking problem, while other TD learning models such as TD(0) or TD(λ) are not a suitable choice due to the extra term "state", which may further increase the computational cost.

1.6 Research contributions

This thesis presents two major contributions. The first is the adaptive computation of the step-size and its application to all classical TD learning models, introducing a new set of TD learning models as follows:

i) Adaptive Stepsize model (ASM) & Adaptive Upper Confidence Bound (AUCB).

ii) Steepest descent based TD that are STD(0) and STD(λ).

iii) Steepest descent based TD-FA that are STD-FA(0) and STD-FA(λ).

The second contribution is, for the first time, to adapt the action-value method of RL to brain fiber tracking and present an action-value-based tractography model. Fig. 1.3 illustrates a summarized overview of all the tasks and operations conducted.


Figure 1.3: The outline of this thesis.

[Flowchart summary: the RL approach (Ch. 1) comprises the action-value model and the TD models (TD without prediction, TD with one-step prediction, TD with the eligibility trace feature, TD with function approximation). Their limitations (longer learning time, parameter dependency, weak sustainability, weak adaptability) and the classical TD models (Ch. 2) motivate the methodology (Ch. 3: steepest descent approach, stochastic approximation theory), which yields the proposed models (Ch. 4: ASM & AUCB, STD(0), STD(λ), STD(λ)-FA), the experimental results, evaluations and comparisons (Ch. 5), and the real-world application, the action-value-based brain fiber tracking model (Ch. 6).]


CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

In this chapter, the incremental TD models of the RL approach are reviewed and evaluated. The policy by which the agent selects an action is defined by the value of each state. TD models estimate the value of each state by a value function (VF) estimator, called the TD learning model, as defined in Ch. 1. To evaluate the operation of TD models, the following aspects are considered:

i) Convergence analysis: the quality of the update of the VFs from an initial point towards the optimum point, or true value, in terms of the minimum number of trials needed to reach this point. The metrics used include the speed of convergence, the fluctuation of the error curve and the ability to maintain convergence without subsequent divergence.

ii) Computational complexity: the amount of calculation performed to solve a problem, usually measured by its complexity. Complexity is expressed using order notation $O(\cdot)$. Linear complexity is represented by $O(n)$, where $n$ is the number of observations, second-order complexity is defined by $O(n^2)$, and so on.

iii) Stability under non-stationary observations: the ability of TD algorithms to maintain their operation under changes in the observations (e.g., a change in the mean value or variance), which is known as the non-stationary case.

iv) Parameter dependency: the dependency of the best operation of TD algorithms on the parameter settings or step-size setting. This means that the best convergence is achieved only under a specific setting of their parameters.


2.2 Classical Temporal Difference learning models

This section presents basic definitions, requirements, equations and derivations of the variants of the TD learning model. TD learning approaches in reinforcement learning (RL) try to approximate the expected/actual value of each state/action by repeating the learning task multiple times under a fixed policy $\pi$. At each step $k \in \{0, 1, \dots, T\}$, the decision maker is in a state $s_k \in S = \{s_1, s_2, \dots, s_n\}$ and selects an action $a_k \in A = \{a_1, a_2, \dots, a_m\}$ that starts a transition $(s_k, a_k, s_{k+1})$. After each transition, the decision maker receives a hidden reward $r_{k+1}$ that comes from a reward function $r(s, a)$. This process repeats until the agent reaches the terminal state (Sutton and Barto, 1998). At each trial, the agent moves from the starting state towards the terminal state under policy $\pi$; this is called one episode. Thus completing an episode is equivalent to a trial. Fig. 2.1 shows one episode, where $s_k$ represents the current state at step $k$ and $s_{k+1}$ represents the state at step $k+1$. An episodic task is a job that runs different episodes. At each trial, one episode is performed and the RL-agent gains more knowledge based on the value function (VF) update rule, which is called the TD learning model. This rule operates based on the reward received after each transition. In Appendix B, episodic tasks are described further.

Figure 2.1: An episode contains a limited number of states

The goal of RL is to maximize the total reward received in the long term. Thus the value of each state or action is the expected value of the cumulative future rewards. This can be formulated as $V^{\pi}(s) = E_{\pi}[R_k]$, which is named the actual return or target value given policy $\pi$. Because the policy $\pi$ is mostly fixed, it can be dropped from subsequent expressions. Since estimating all future rewards is not implementable, a discounted return, $R_k = \sum_{i=0}^{\infty}\gamma^{i} r_{k+i+1}$, is considered (Sutton and Barto, 1998), where $\gamma \in [0,1]$ is the discount factor and indicates how much future rewards contribute to this estimation. The estimate of $V(s_k)$ may be iteratively computed according to the following equation:

$$V(s_k) = E[R_k] = E\Big[\sum_{i=0}^{\infty}\gamma^{i} r_{k+i+1}\Big] = E\Big[r_{k+1} + \sum_{i=1}^{\infty}\gamma^{i} r_{k+i+1}\Big] = E\Big[r_{k+1} + \gamma\sum_{i=0}^{\infty}\gamma^{i} r_{k+i+2}\Big] = E\big[r_{k+1} + \gamma V(s_{k+1})\big] \qquad (2.1)$$

It can be seen that the value of state $s_k$ is the expected value of the current reward plus a fraction of the value of the next state. It is an estimation based on an estimation, and it performs one-step prediction; if $\gamma = 0$, there is no prediction. The following incremental model may estimate the value function of any state based on its return value (i.e., Eq. 2.1) (Sutton and Barto, 1998; Benveniste, 1990):

$$V_{k+1} = V_k + \alpha_k\,\delta_k \qquad (2.2)$$

where $\alpha_k \in (0,1]$ is the step size. The function $\delta_k$ shows how $V_k$ is updated at each iteration and is called the temporal difference error. This incremental value function, like most incremental structures, converges well if two essential factors, the step size $\alpha_k$ and the TD error $\delta_k$, are carefully defined (Benveniste, 1990; Nocedal and Wright, 1999). These two parameters have a direct effect on the convergence quality in terms of the speed of convergence and its smoothness, which are important in convergence analysis (Yu, 2010). The step size $\alpha_k \in (0,1]$ may be chosen as a fixed value or as a monotonically decreasing value that satisfies these conditions:

$$\sum_{k=1}^{\infty}\alpha_k = \infty, \qquad \sum_{k=1}^{\infty}\alpha_k^{p} < \infty, \qquad \lim_{k\to\infty}\alpha_k \to 0 \qquad (2.3)$$

where $p$ is a positive number greater than one. Incremental TD models under these conditions are guaranteed to converge (Sutton and Barto, 1998; Borkar and Meyn, 2000; Dayan and Sejnowski, 1994). However, Gradient-based TD models are very sensitive to the fine-tuning of their step sizes, such that a minor change in their settings can cause the models to diverge or lose their good convergence (Dabney and Barto, 2012; Hutter and Legg, 2008). In the following, different TD learning models are presented based on how the update rule is configured.
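As a quick check of Eq. (2.3) (a standard fact, not a result of this thesis), the classical decreasing schedule $\alpha_k = 1/k$ satisfies all three conditions with $p = 2$:

$$\sum_{k=1}^{\infty}\frac{1}{k} = \infty, \qquad \sum_{k=1}^{\infty}\frac{1}{k^{2}} = \frac{\pi^{2}}{6} < \infty, \qquad \lim_{k\to\infty}\frac{1}{k} = 0,$$

whereas a constant step size $\alpha_k = \alpha > 0$ violates the second condition; it therefore carries no such convergence guarantee, although it remains useful for tracking non-stationary observations, as noted in the next subsection.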

2.2.1 TD-learning without prediction (Action-value model)

The action-value model has only one state with several actions. At each step, the agent selects an action and receives a reward. This process is repeated to mature the initial estimate of the value of each action. According to Eq. (2.1), and considering $\gamma = 0$, an instantaneous estimate of the desired value function $V(a)$ comes from the sample-mean equation:

$$V_k(a) = \frac{1}{k}\sum_{i=1}^{k} r_i \qquad (2.4)$$

where $k$ is the number of times that action $a$ has been selected and $r_i$ is the reward the agent receives after each selection. Finding the best estimate of the actual value of each action is the main objective; we expect that $\lim_{k\to\infty} V_k(a) = V^{*}(a)$. The TD-without-prediction model, known as the action-value model, is applied to the well-known multi-armed bandit (MAB) problem explained in Appendix B. The following incremental model, derived from Eq. (2.4), presents the stochastic approximation of the expected value of an action $a$ (Sutton and Barto, 1998):

$$V_k(a) = \frac{1}{k}\sum_{i=1}^{k} r_i = \frac{1}{k}\Big(r_k + \sum_{i=1}^{k-1} r_i\Big) = \frac{1}{k}\big(r_k + k\,V_{k-1}(a) - V_{k-1}(a)\big) = V_{k-1}(a) + \frac{1}{k}\big[\,r_k - V_{k-1}(a)\,\big] \qquad (2.5)$$

where $\alpha_k = \tfrac{1}{k}$ is the general step size of this incremental model. In stationary and non-stationary conditions, it must be set to a suitable value to obtain the best convergence quality of the algorithm. The term $\delta_k = r_k - V_{k-1}(a)$ is the temporal difference error. This shows that the value function of each action is updated based on the current reward value. Although the sample-mean procedure stated in Eq. (2.4) is attractive and simple to implement in stationary cases, Eq. (2.5) is more suitable for studying value estimation under non-stationary situations. In reinforcement learning, we usually encounter problems that are effectively non-stationary (Sutton and Barto, 1998).
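To make the distinction concrete, the following Matlab sketch (an illustrative assumption, not the thesis implementation) applies the incremental update of Eq. (2.5) to a single action whose rewards are drawn from a hypothetical Gaussian with mean 1, contrasting the sample-mean step size 1/k with a constant step size, the latter being the usual choice for non-stationary rewards.

% Incremental action-value update of Eq. (2.5) for one action (illustrative).
sigma = 0.5;  K = 1000;           % hypothetical reward noise and number of selections
V_mean  = 0;                      % estimate using the sample-mean step size 1/k
V_const = 0;  alpha = 0.1;        % estimate using a constant step size
for k = 1:K
    r = 1 + sigma*randn;                        % reward received after the selection
    V_mean  = V_mean  + (1/k)*(r - V_mean);     % Eq. (2.5) with alpha_k = 1/k
    V_const = V_const + alpha*(r - V_const);    % constant alpha weights recent rewards more
end
fprintf('sample-mean estimate %.3f, constant-step estimate %.3f\n', V_mean, V_const);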

2.2.2 TD-learning with one-step prediction -TD(0)

The idea of TD with prediction is the weighted averaging of multi-step estimates while saving the rewards received. Based on the target value stated in Eq. (2.1), the value functions may be updated towards $r_{k+1} + \gamma V(s_{k+1})$. This is a one-step look-ahead scheme and is named TD(0). The general update rule for TD(0) is as follows:

$$V_{k+1}(s_k) = V_k(s_k) + \alpha_k\big[r_{k+1} + \gamma V_k(s_{k+1}) - V_k(s_k)\big] \qquad (2.6)$$

However, if we know that the actual target value is $R_k$, it can be rewritten as:

$$V_{k+1}(s_k) = V_k(s_k) + \alpha_k\big[R_k - V_k(s_k)\big] \qquad (2.7)$$

In Eq. (2.6) the value function uses an estimate of the value of the next state, whereas Eqs. (2.7) and (2.5) do not use any prediction. For simplicity, in the following equations the value of the next state at the current time, $V_k(s_{k+1})$, is written as $V'_k$. Similar to the definition in Eq. (2.5), the temporal difference error in this case is:

$$\delta_k = r_{k+1} + \gamma V'_k - V_k \qquad (2.8)$$

Finally, the simplified incremental TD(0) model to estimate the value function is:

$$V_{k+1} = V_k + \alpha_k\,\delta_k \qquad (2.9)$$


Based on this update rule, the incremental TD(0) algorithm may be stated as in Table 2.1.

Table 2.1: The incremental TD(0)

01- s = initial state;
02- While (s ≠ s_terminal)
03-   Take an action a based on policy π;
04-   Observe r and next state s′;
05-   δ = r + γ V(s′) − V(s);
06-   V(s) = V(s) + α δ;
07-   s = s′;
08- End
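The loop of Table 2.1 can be sketched in a few lines of Matlab. The five-state chain, the fixed move-right policy and the reward of −1 per transition below are illustrative assumptions only, not one of the benchmark problems used later in the thesis.

% Sketch of the incremental TD(0) of Table 2.1 on a toy five-state chain.
nStates = 5;  terminal = nStates;
V = zeros(1, nStates);            % value estimates; V(terminal) stays 0
alpha = 0.1;  gamma = 1.0;        % step size and discount factor
for episode = 1:500
    s = 1;                                    % start state of the episode
    while s ~= terminal
        sNext = s + 1;                        % fixed policy: always move right
        r = -1;                               % reward of each transition
        delta = r + gamma*V(sNext) - V(s);    % TD error, Eq. (2.8)
        V(s) = V(s) + alpha*delta;            % update rule, Eq. (2.9)
        s = sNext;
    end
end
disp(V)   % converges towards [-4 -3 -2 -1 0] for this toy chain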

2.2.3 TD-learning with eligibility trace term -TD(λ)

In Section 2.2 the discounted return is defined by $R_k = \sum_{i=0}^{\infty}\gamma^{i} r_{k+i+1}$. Now, if the future steps are truncated to $n$ steps, we have the $n$-step truncated return $R_k^{(n)} = r_{k+1} + \gamma r_{k+2} + \dots + \gamma^{n-1} r_{k+n} + \gamma^{n} V(s_{k+n})$ (Sutton and Barto, 1998). If the agent is at step $k$, the weighted average of these truncated returns is defined by $R_k^{\lambda} = (1-\lambda)\sum_{n=1}^{T-k-1}\lambda^{n-1} R_k^{(n)} + \lambda^{T-k-1} R_k$, where $\lambda \in [0,1]$ is a decay factor. This is $n$-step prediction in temporal difference learning. Thus the update function of TD becomes the expression $\delta_k = R_k^{\lambda} - V(s_k)$, which defines the TD(λ) learning model. In the case of $\lambda = 1$, TD(1), we have $R_k^{\lambda} = R_k$, the return value from step $k+1$ to the terminal point $T$. When $\lambda = 0$, namely TD(0), the target reduces to $r_{k+1} + \gamma V(s')$, which is the one-step prediction stated in Eq. (2.9). However, because $n$-step forward prediction is not implementable, a backward view of TD learning, called the eligibility trace term, is used to perform this weighted average of $n$-step predictions (Sutton and Barto, 1998). At each step $k$, the eligibility of the state the agent visits is incremented by one, $e(s) \leftarrow e(s) + 1$, while the traces of all states then decay by the factor $\gamma\lambda$. Finally, the incremental TD with the eligibility trace term is:

$$V_{k+1}(s) = V_k(s) + \alpha_k\,\delta_k\, e_k(s) \qquad (2.10)$$


The temporal difference error is still $\delta_k = r_{k+1} + \gamma V'_k - V_k$, as stated in Eq. (2.8). Based on this update rule, TD(λ) is drawn in Table 2.2.

Table 2.2: The incremental TD(λ)

01- s = initial state;
02- While (s ≠ s_terminal)
03-   Take an action a based on policy π;
04-   Observe r and next state s′;
05-   δ = r + γ V(s′) − V(s);
06-   e(s) = e(s) + 1;   {current state updated}
07-   V = V + α δ e;   {all states updated, vector operation}
08-   e = γ λ e;   {all states updated, vector operation}
09-   s = s′;
10- End

Here $V$ and $e$ denote the value and eligibility-trace vectors, respectively; all value functions and their eligibility traces are updated concurrently.
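For comparison with the TD(0) sketch above, the following Matlab sketch adds the accumulating eligibility trace of Table 2.2 to the same illustrative five-state chain (again an assumption made only for this example).

% Sketch of TD(lambda) with an accumulating eligibility trace (Table 2.2).
nStates = 5;  terminal = nStates;
V = zeros(1, nStates);  E = zeros(1, nStates);
alpha = 0.1;  gamma = 1.0;  lambda = 0.7;
for episode = 1:500
    E(:) = 0;  s = 1;
    while s ~= terminal
        sNext = s + 1;   r = -1;             % illustrative policy and reward
        delta = r + gamma*V(sNext) - V(s);   % TD error
        E(s) = E(s) + 1;                     % increment trace of the visited state
        V = V + alpha*delta*E;               % all states updated through their traces
        E = gamma*lambda*E;                  % all traces decay
        s = sNext;
    end
end
disp(V)   % again approaches [-4 -3 -2 -1 0]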

2.2.4 TD-learning with linear function approximation- TD-FA

In real-world applications, where the state space is often large or infinite, TD learning cannot perform well due to high computational time. For this reason, a linear function approximation technique is used to estimate the value functions. This estimation is performed iteratively to achieve a good estimate, one in which the difference between the expected value function $V(s)$ and its estimate $V_{\theta}(s)$ is minimal. In TD learning with linear function approximation (TD-FA), a linear combination of the parameter vector $\theta \in \mathbb{R}^{d}$ and the feature vector $\phi(s) \in \mathbb{R}^{d}$ is used to estimate the value function $V(s)$ by $V_{\theta}(s) = \theta^{\top}\phi(s)$. The feature vector corresponding to state $s$ is $\phi(s)$, which at step $k$ is written $\phi(s_k)$ and, in general, $\phi_k$. This vector should be defined before running the process. More detail on this approach is provided in Appendix B. The TD-FA learning rule is formulated as (Sutton and Barto, 1998):

$$V(s) \approx \theta^{\top}\phi(s) \approx E\Big[\sum_{i=0}^{\infty}\gamma^{i} r_{k+i+1} \,\Big|\, s_k = s\Big] \qquad (2.11)$$


Now, the goal is to learn the parameter vector $\theta$ so that the current estimate of $V(s)$ moves towards the expected value, i.e., $\theta_k \to \theta^{*}$ as $k \to \infty$. In this regard, the common approach is to learn $\theta$ based on an objective function, for example the mean square error (MSE) or the weighted norm of the error $J(\theta) = \|V - V_{\theta}\|^{2}$. Taking the gradient of the objective function with respect to $\theta$ and setting it to zero, we can get the incremental form of the update rule:

$$\theta_{k+1} = \theta_k + \alpha_k\,\delta_k\, e_k = \theta_k + \alpha_k\big(r_{k+1} + \gamma\,\theta_k^{\top}\phi_{k+1} - \theta_k^{\top}\phi_k\big)\, e_k \qquad (2.12)$$

The eligibility trace vector $e_k \in \mathbb{R}^{d}$, which takes into account the effect of the feature vector $\phi(s)$ in $n$-step prediction, is defined by $e_k = \gamma\lambda\, e_{k-1} + \phi_k$. The model is termed TD(λ)-FA. In one-step prediction, namely TD(0)-FA, this update rule changes to:

$$\theta_{k+1} = \theta_k + \alpha_k\,\delta_k\,\phi_k = \theta_k + \alpha_k\big(r_{k+1} + \gamma\,\theta_k^{\top}\phi_{k+1} - \theta_k^{\top}\phi_k\big)\,\phi_k \qquad (2.13)$$

Table 2.3: The incremental TD(λ)-FA

01- s = initial state;
02- While (s ≠ s_terminal)
03-   Take an action a based on policy π;
04-   Observe r and next state s′;
05-   δ = r + γ θᵀφ(s′) − θᵀφ(s);
06-   e = γ λ e + φ(s);
07-   θ = θ + α δ e;
08-   s = s′;
09- End

Tsitsiklis and Van Roy (2001) proved that Eq. (2.12) and Eq. (2.13) converge to the optimum point with probability one if the step size satisfies the general conditions in Eq. (2.3). Based on Eq. (2.12) or (2.13), the TD(λ)-FA algorithm is presented in Table 2.3. Considering $\lambda = 0$, the TD(0)-FA algorithm is easily obtained by replacing the term $e_k$ with $\phi_k$. Although this method seems interesting, the dependency between good convergence and the accurate step-size setting is still a concern.
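A minimal Matlab sketch of the on-line TD(λ)-FA loop of Table 2.3 is given below. The one-hot feature matrix and the toy chain are illustrative assumptions only; with one-hot features the method reduces to the tabular case, which makes the learned parameter vector easy to check.

% Sketch of on-line TD(lambda) with linear function approximation (Table 2.3).
nStates = 5;  terminal = nStates;  d = nStates;
Phi = eye(nStates, d);                       % assumed features: phi(s) = Phi(s,:)'
theta = zeros(d, 1);  alpha = 0.1;  gamma = 1.0;  lambda = 0.7;
for episode = 1:500
    e = zeros(d, 1);  s = 1;
    while s ~= terminal
        sNext = s + 1;  r = -1;                             % illustrative transition
        phi = Phi(s, :)';   phiNext = Phi(sNext, :)';
        delta = r + gamma*(theta'*phiNext) - theta'*phi;    % TD error with V = theta'*phi
        e = gamma*lambda*e + phi;                           % eligibility trace vector
        theta = theta + alpha*delta*e;                      % parameter update, Eq. (2.12)
        s = sNext;
    end
end
disp(theta')   % with one-hot features this recovers the tabular values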

2.2.5 Summary

In this section, the incremental TD learning models, including TD without prediction, TD with one-step prediction, TD with the eligibility trace term and TD with function approximation, have been presented. These incremental algorithms depend on exact step-size settings. As with most incremental models, their convergence is strongly tied to the step-size value (Hutter and Legg, 2008; Dabney and Barto, 2012; George and Powell, 2006; Mahmood et al., 2012). Some efforts have been made to present TD models with higher efficiency and fewer limitations. These are the variants of TD models elaborated in the next section.

2.3 Variants of TD Learning Model

Apart from the classical TD learning discussed in Sec. 2.2, several variants of TD models have been presented in the literature and have become hot topics in the RL domain. These models are discussed in the following.

2.3.1 Variants in TD learning without prediction: the MAB model

The multi-armed bandit (MAB) is one of the common classical case studies across disciplines and remains attractive for researchers to study or adapt to their problems. Research in several domains indicates that MAB is frequently applied to real-world applications; some of these applications are presented in (Bae et al., 2011; Scott, 2010; Fialho et al., 2009; Granmo and Glimsdal, 2013; Gai et al., 2010; Dinesh and Saranga, 2010; Liu and Zhao, 2012). More details about MAB are given in Appendix B.

In the MAB model, the agent faces a row of actions and decides which one to select such that the sum of the rewards received is maximized. Alternatively, maximizing this cumulative reward is equivalent to minimizing the regret, which is the difference between the total optimal reward and the cumulative reward received so far (Kuleshov and Precup, 2010; Audibert et al., 2010; Honda and Takemura, 2011; Bubeck and Cesa-Bianchi, 2012). In the stationary case, rewards follow a normal distribution with a given mean and variance; in the non-stationary case these terms may change over time.

In empirical studies, it is usual to apply different settings of these terms to a MAB algorithm and investigate the results. The settings of a MAB algorithm are: the number of actions, the variance of the observations, typically between 0.1 (low variance) and 2 or 5 (high variance), and the mean value, typically between 0 and 2. These terms must be set beforehand; in non-stationary simulations, however, the variance and mean value may change over time.

Several algorithms have been presented to solve the problem based on regret analysis or sample-mean analysis. These models estimate the actual or expected value of each action with different approaches. Unfortunately, most of these algorithms are evaluated only theoretically (Bubeck and Cesa-Bianchi, 2012; Even-Dar et al., 2006; Audibert et al., 2010). Vermorel and Mohri (2005) concluded that most theoretical proofs of MAB models are valuable, but that their application to real-world problems often faces difficulties (Kuleshov and Precup, 2010). As an example, in some MAB algorithms that require approximating the variance of the observations, there is a considerable drop in performance under high observation variance. This indicates that some MAB algorithms do not operate optimally under variable variances, even though all the settings and rules of the algorithms are theoretically correct.

An extensive empirical study comparing MAB models under different settings has been conducted by Kuleshov and Precup (2010). They also concluded that theoretical guarantees do not ensure good performance in real-world applications, because some issues appear only in experimental results. Besides, the performance of some MAB algorithms is limited to specific settings (i.e., particular combinations of variance, mean and number of actions). According to the results reported in (Xu et al., 2007), the performance of some MAB algorithms decreases dramatically when a different setting is applied. Kuleshov and Precup (2010) presented comprehensive experimental results, but their experiments did not include non-stationary observations. Besides, they did not evaluate the effect of high reward variance, which may cause algorithms to decline, even though in the domains of adaptive control and machine learning the observations are often non-stationary (Sutton and Barto, 1998).

These are major concerns relating to MAB algorithms that are usually not evident in theoretical assessments. One possible cause of these issues is parameter dependency in incremental MAB algorithms. As an example, in a theoretical study on MAB presented by Bubeck and Cesa-Bianchi (2012), the authors introduced two online mirror descent algorithms, OMD and OSMD, which operate based on the Gradient-descent (GD) approach. However, as with most GD algorithms, it is necessary to fine-tune the step size in these MAB models as well, because the convergence of the algorithm is tied to the exact step-size setting.

In order to introduce the steepest descent approach to the incremental MAB model and its application to the brain fiber tracking process, it is necessary to state the mathematical model of the MAB algorithm.

MAB mathematical model

Let $A = \{a_1, \dots, a_N\}$ be a set of usable actions with a set of reward distributions $D = \{D_1, \dots, D_N\}$ whose expected values are $\mu = \{\mu(a_1), \mu(a_2), \dots, \mu(a_N)\}$ and whose variances are $\sigma^2 = \{\sigma^2(a_1), \sigma^2(a_2), \dots, \sigma^2(a_N)\}$. The observations follow an independent and identical distribution (i.i.d.). Under non-stationary observations, these expected values may vary over time. Finding the best estimate of $\mu(a)$ is the main objective of a MAB algorithm. After choosing an action $a$ a total of $k$ times, an instantaneous estimate of the actual value $\mu(a)$ is obtained by the sample-mean Eq. (2.4); it means that $\lim_{k\to\infty} V_k(a) = \mu(a)$. Alternatively, the cumulative regret at step $n$ is defined through the expression $\rho_n = n\,\mu(a^{*}) - \sum_{k=1}^{n}\mu(a_k)$, where the best action is defined by $a^{*} = \arg\max_{a}\mu(a)$ and $\sum_{k=1}^{n}\mu(a_k)$ is the so-far cumulative expected reward of the selected actions. For selecting an action, the following policies may be used; each policy defines the prominent action at each step of action selection.
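The following Matlab sketch (with purely illustrative means and variance) sets up such a Gaussian bandit and accumulates the expected regret of a uniformly random baseline; the value-based policies of the next subsections replace the random choice.

% A stationary Gaussian multi-armed bandit and its cumulative expected regret.
N = 5;  mu = [0.2 1.1 1.9 0.7 1.4];  sigma = 0.5;   % illustrative parameters
nSteps = 1000;  muStar = max(mu);
regret = 0;
for k = 1:nSteps
    a = randi(N);                        % baseline: uniformly random selection
    r = mu(a) + sigma*randn;             % reward drawn from D_a (would feed Eq. (2.5))
    regret = regret + (muStar - mu(a));  % cumulative expected regret
end
fprintf('cumulative regret of the random policy after %d steps: %.1f\n', nSteps, regret);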

a) ε-greedy action selection policy

At step $t$, with probability $1-\varepsilon$ the agent greedily exploits the current information to select the action with the highest $Q_t(a)$, and with probability $\varepsilon$ the agent randomly explores an action (Sutton and Barto, 1998; Vermorel and Mohri, 2005). This policy is formulated as:

$$a_t = \begin{cases} \arg\max_a Q_t(a) & \text{with probability } 1-\varepsilon \\ \text{a random action} & \text{with probability } \varepsilon \end{cases} \qquad (2.14)$$

After each selection, the relevant value function is updated by Eq. (2.5). The value of the exploration parameter $\varepsilon$ balances the exploration and exploitation tasks. It may be a fixed value, e.g. $\varepsilon = 0.1$, or it may vary with time. If the effect of $\varepsilon$ in the exploration phase is reduced over time, the percentage of random selections decreases as the agent's knowledge evolves. This may be implemented by $\varepsilon_t = \log t / t$, which defines the exploration rate at each step.

However, this technique is useful only in stationary situations; the experimental results in Ch. 5 will show that in non-stationary situations a fixed value is the better choice.
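As an illustration, a minimal sketch of the ε-greedy rule of Eq. (2.14), together with the optional $\log t / t$ decay mentioned above, is given below; the function name, the default $\varepsilon_0 = 0.1$ and the tie-breaking by argmax are assumptions of this sketch rather than prescriptions of the thesis.

    import numpy as np

    def epsilon_greedy(Q, t, eps0=0.1, decay=False, rng=np.random.default_rng()):
        """Eq. (2.14): exploit with probability 1 - eps, explore with probability eps."""
        eps = np.log(t) / t if (decay and t > 1) else eps0   # decaying or fixed exploration rate
        if rng.random() < eps:
            return int(rng.integers(len(Q)))   # explore: a uniformly random action
        return int(np.argmax(Q))               # exploit: the greedy action arg max_a Q_t(a)

With decay=False the fixed value $\varepsilon_0$ is retained, which, as noted above, tends to be the safer choice under non-stationary rewards.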

b) Upper Confidence Bounds (UCB) action selection policy

Authors in (Kuleshov and Precup, 2010; Auer et al., 2002; Audibert et al., 2009; Audibert et al., 2010; Even-Dar et al., 2006) have presented models that select the valuable action at each step using an upper confidence bound (UCB) criterion. Some variants of this policy have recently been used in real-world applications, for instance distributed network operations (Liu and Zhao, 2012; May, 41), economic and industrial decision making (Scott, 2010; Dinesh and Saranga, 2010), software engineering (Fialho et al., 2009), the games industry (Granmo and Glimsdal, 2013) and robotics (Honda and Takemura, 2011). One of these variants is UCB-1, which uses a straightforward routine to select the winning action at every step. It considers the number of times $n_a$ that an arm has been selected after $t$ rounds, as well as its current value function. Then, the action which maximizes the following criterion is selected:

$$a_t = \arg\max_a \left( Q_t(a) + \sqrt{\frac{2 \ln t}{n_a}} \right) \qquad (2.15)$$
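A minimal sketch of the UCB-1 rule of Eq. (2.15) is given below; it assumes that every arm has already been pulled at least once (so every $n_a > 0$), a common initialisation step that is not spelled out in the text.

    import numpy as np

    def ucb1(Q, n, t):
        """Eq. (2.15): pick the action maximising Q_t(a) + sqrt(2 ln t / n_a)."""
        bonus = np.sqrt(2.0 * np.log(t) / n)   # exploration bonus; shrinks as n_a grows
        return int(np.argmax(Q + bonus))

The bonus term favours rarely tried actions, so exploration is driven by the counts $n_a$ rather than by explicit randomisation as in ε-greedy.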

Audibert et al. (2009) presented another criterion that considers an estimate of the reward variance, a function of the exploration rate $\mathcal{E}(t)$ and the current value function. The simplified criterion is:

$$a_t = \arg\max_a \left( Q_t(a) + \sqrt{\frac{2\,\sigma_t^2(a)\,\mathcal{E}(t)}{n_a}} + c_1\,\frac{3\,b\,\mathcal{E}(t)}{n_a} \right) \qquad (2.16)$$

where $\mathcal{E}(t) = c_2 \log t$ is the exploration rate function for each action and $c_1, c_2 \ge 0$, while the variance of the observations must lie in the domain $[0, b]$. This is called the UCB-V policy. The variance is computed empirically by $\sigma_t^2(a) = \frac{1}{n_a}\sum_{i=1}^{n_a}\bigl(r_i - Q_t(a)\bigr)^2$.

Another UCB action selection policy is called UCB-Tuned (Kuleshov and Precup, 2010); it uses the empirical variance of the reward, bounded by 0.25, at each step:

$$a_t = \arg\max_a \left( Q_t(a) + \sqrt{\frac{\ln t}{n_a}\,\min\bigl(0.25,\; V_t(a)\bigr)} \right) \qquad (2.17)$$

Here $V_t(a) = \sigma_t^2(a) + \sqrt{\frac{2 \ln t}{n_a}}$ is computed from the current estimate of the reward variance (Audibert et al., 2009). These criteria select the most valuable action according to the UCB-1, UCB-V and UCB-Tuned approaches. After each selection, the agent receives a reward and updates its value $Q_t(a)$ using the uniform incremental structure stated in Eq. (2.5). It may be concluded that, besides the two key parameters that affect the efficiency of MAB algorithms, namely the variance of the reward and the number of actions (Kuleshov and Precup, 2010), parameter dependency and the type of observations are two further important problems that influence the performance of MAB algorithms. These are the main concerns in MAB problems and, to the knowledge of the author, they are yet to be addressed by the research community.
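For completeness, the two variance-aware criteria of Eqs. (2.16) and (2.17), as reconstructed above, can be sketched as follows; the running variance estimates sigma2, the constants c1 and c2 and the reward bound b are inputs that an implementation would have to maintain, and their defaults here are purely illustrative.

    import numpy as np

    def ucb_v(Q, n, sigma2, t, c1=1.0, c2=1.0, b=1.0):
        """Eq. (2.16), UCB-V: a variance term and a bias term, both scaled by E(t) = c2 * log t."""
        E = c2 * np.log(t)
        return int(np.argmax(Q + np.sqrt(2.0 * sigma2 * E / n) + c1 * 3.0 * b * E / n))

    def ucb_tuned(Q, n, sigma2, t):
        """Eq. (2.17), UCB-Tuned: the empirical variance estimate V_t(a) is capped at 0.25."""
        V = sigma2 + np.sqrt(2.0 * np.log(t) / n)
        return int(np.argmax(Q + np.sqrt(np.log(t) / n * np.minimum(0.25, V))))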

2.3.2 Variants in TD-learning with linear function approximation approach

As mentioned in Sec. 2.2.4, in TD-FA a linear combination of two vectors, the feature vector and the parameter vector, defines the value function. So far, two major families of the TD-FA approach have been presented. The first group utilizes the Gradient-descent (GD) technique and is called GTD-FA. Sutton and colleagues (Sutton et al., 2009-b; Sutton et al., 2009-a; Maei and Sutton, 2010) have presented a series of valuable works on TD-FA learning.

The second group is constructed upon the theory of linear least-squares estimation and was introduced by Bradtke and Barto in 1996. This approach is known as least-squares temporal difference learning with function approximation and is called LSTD-FA. Authors in (Boyan, 1999; Boyan, 2002; Lagoudakis and Parr, 2003; Xu et al., 2005; Schembri et al., 2007; He et al., 2011) have presented several variants of LSTD-FA. Among these, He et al. (2011) presented a recursive LSTD model, called RLSTD-FA, with lower computational complexity than other LSTD models. For this reason, this model is selected in this thesis for implementation, comparison and evaluation against the proposed model.

These two groups of TD-FA techniques have some limitations. GTD-FA provides models with linear computational complexity, but their performance is very sensitive to exact step-size settings; that is, step-size setting and a slow rate of convergence are two major limitations, as in most GD-based algorithms (He et al., 2011; Benveniste, 1990; Boyan, 2002). An exact step-size setting often yields good convergence, whereas any other value may cause the process to become unstable. This reduces the universality of such algorithms and renders them parameter dependent. On the other hand, the LSTD-FA group has no step-size dependency and shows a good, stable convergence rate (Yu, 2010; He et al., 2011). However, these methods suffer from high computational cost and consequently may not be an appropriate choice for on-line applications (Hutter and Legg, 2008; Sutton et al., 2009-a; Geramifard et al., 2006). An iterative LSTD model (iLSTD) with linear time complexity has been reported by Geramifard et al. (2006); it is claimed that this model may be suitable for the large state-space case.

However, iLSTD still depends on two step-size parameters.

The update rules of the GTD-FA and LSTD-FA families

a) GTD Group: The most well-known GD-FA model with one-step prediction was presented by Sutton et al. (2009-a), who named it GTD. In this approach, the parameter vector $\theta$ is estimated according to the following equations:

$$\theta_{t+1} = \theta_t + \alpha_t\,(\phi_t - \gamma\,\phi_{t+1})\,(\phi_t^{\top} u_t) \qquad (2.18)$$

$$u_{t+1} = u_t + \beta_t\,(\delta_t\,\phi_t - u_t), \quad \text{where } u_0 = 0 \qquad (2.19)$$

This approach employs two step sizes $\alpha_t, \beta_t \in (0,1]$, which should satisfy the conditions in Eq. (2.3). However, its linear complexity is attractive. Table 2.4 shows the pseudo-code of GTD-FA.

Table 2.4: The incremental GTD-FA

Initialize $\theta = 0$, $u = 0$;
01- $s$ = the initial state; $\phi = \phi(s)$;
02- While ($s \neq s_{terminal}$)
03-   Take an action $a$ based on policy $\pi$;
04-   Observe $r$ and next state $s'$; $\phi' = \phi(s')$;
05-   $\delta = r + \gamma\,\theta^{\top}\phi' - \theta^{\top}\phi$;
06-   $\theta = \theta + \alpha\,(\phi - \gamma\phi')(\phi^{\top}u)$;
07-   $u = u + \beta\,(\delta\,\phi - u)$;
08-   $s = s'$; $\phi = \phi'$;
09- End
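A minimal Python sketch of one GTD update, mirroring Eqs. (2.18)–(2.19) and lines 05–07 of Table 2.4, is given below; the feature vectors, reward and step-size values are placeholders chosen only for illustration.

    import numpy as np

    def gtd_step(theta, u, phi, phi_next, r, gamma=0.9, alpha=0.05, beta=0.05):
        """One GTD update for a linear value function V(s) ~ theta . phi(s)."""
        delta = r + gamma * theta @ phi_next - theta @ phi            # TD error (line 05)
        theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)  # Eq. (2.18), line 06
        u = u + beta * (delta * phi - u)                              # Eq. (2.19), line 07
        return theta, u

    # Illustrative usage for a single transition with random features.
    d = 8
    rng = np.random.default_rng(1)
    theta, u = np.zeros(d), np.zeros(d)
    phi, phi_next, r = rng.random(d), rng.random(d), 1.0
    theta, u = gtd_step(theta, u, phi, phi_next, r)

In an episode this step would be called once per transition, exactly as lines 02–08 of Table 2.4 iterate over the visited states.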

Again, Sutton and colleagues (Sutton et al., 2009-b) reported two further GD models, called GTD2 (Gradient TD2) and TDC (TD with Gradient Correction). Both procedures perform one-step prediction learning. The update rules of the parameter vector are:
