
PHYLOGEOGRAPHY STUDY OF THE

INDIGENOUS PEOPLE OF SABAH: KADAZAN, DUSUN, RUNGUS AND BAJAU

GAN YEE MIN

UNIVERSITI SAINS MALAYSIA

2020

PHYLOGEOGRAPHY STUDY OF THE

INDIGENOUS PEOPLE OF SABAH: KADAZAN, DUSUN, RUNGUS AND BAJAU

by

GAN YEE MIN

Thesis submitted in fulfilment of the requirements for the degree of

Doctor of Philosophy

July 2020

ACKNOWLEDGEMENT

All praise and glory to God Almighty that this thesis is completed! My first and biggest thanks go to Him, who has been ever so faithful in being there with me every step of the way and in gifting me with the wisdom, perseverance and patience to see this PhD journey through from day one till the very end.

Next, I would like to express my utmost gratitude to my main supervisor, Dr. Eng Ken Khong, and my co-supervisor, Dr. Velat Bujeng, for their guidance and advice on this thesis. I am especially indebted to Dr. Eng, without whom I would not have had the opportunity to work on this project to begin with. It is his relentless encouragement, support and patience in teaching and guiding me throughout these 3-4 years that kept me motivated to strive to be better.

This research would also not have been possible without the financial support of several bodies. Firstly, I would like to thank Universiti Sains Malaysia (USM) for funding my studies through the USM Fellowship Scheme (2016-2019). The research work was funded by the Short Term Research Grant, USM (304/PARKEO/6313211), awarded to Dr. Eng, as well as the Long Term Research Grant Scheme, Ministry of Higher Education (304/PPSK/6150115/U132), awarded to Prof. Dr. Zafarina Zainuddin of the Analytical Biochemistry Research Centre (ABrC), USM.

I am also grateful to everyone from Lab 414, School of Biological Sciences, USM, especially Prof. Dr. Nazalan Najimudin and Prof. Dr. Razip Samian for allowing me to use the lab. A special mention goes to Mardani, Haida, Suhaimi and Faizal, who have all been so patient and understanding when guiding me through the various aspects of the lab work, as well as Prof. Dr. Phua Kia Kien and his student Carlos from the Institute for Research in Molecular Medicine (INFORMM), USM, who helped me with imaging of the gel electrophoresis. Further thanks go out to Prof. Mokhtar Saidin and everyone from the Centre for Global Archaeological Research (CGAR) for their financial and technical support.

I would also like to give my biggest appreciation to my family and friends, who have undoubtedly had to tolerate me during my most stressful moments. To my friends and colleagues, thank you for being part of my PhD journey in one way or another. A special mention goes to Deejay Daxter and Siaw Chan, who have been so helpful and kind in helping me with the formatting and submission process. To my family, my greatest supporters, I would not have been able to pursue my dreams in archaeology if it were not for your trust, understanding and unconditional love. Thank you all for believing in me, and I hope that I have made each and every one of you proud. Thank you also to my sister, Yee Wei, who has always responded to my requests for drawing images and maps with nothing but patience and helpfulness.

Finally, I would like to dedicate this thesis to all my grandparents, three of whom passed away during the course of my PhD research. This research has taught me even more about the importance of knowing one’s ancestry, and I will be sure to pass that down to the generations of my family to come.

TABLE OF CONTENTS

ACKNOWLEDGEMENT

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ABBREVIATIONS

LIST OF APPENDICES

ABSTRAK

ABSTRACT

CHAPTER 1 INTRODUCTION

1.1 Aims and motivation

1.2 Scope and limitations

1.3 Thesis outline

CHAPTER 2 LITERATURE REVIEW

2.1 An introduction to human deoxyribonucleic acid (DNA)

2.1.1 Human mitochondrial DNA (mtDNA)

2.2 Studying and analysing the human mtDNA

2.2.1 The human mtDNA reference sequence

2.2.2 Polymerase chain reaction (PCR)

2.2.3 Sanger sequencing

2.2.4 Phylogenetic trees and networks

2.2.4(a) Human mtDNA phylogenetic tree

2.2.5 Mutation rates

2.3 Modern human origins and dispersals

2.3.1 The origin of anatomically modern humans (AMH)

2.3.2 Out of Africa and into Asia

2.4 Southeast Asia (SEA)

2.4.1 Sundaland

2.4.2 Human migrations in ISEA during the Neolithic

2.4.2(a) The Austronesian diaspora

2.4.2(b) The “Out of Taiwan” hypothesis and the spread of agriculture in ISEA

2.4.2(c) Alternative hypotheses or propositions

2.4.2(d) The genetic evidence

2.5 Sabah (Borneo), Malaysia

2.5.1 Background

2.5.2 An overview on the prehistory of Sabah

2.5.3 The ethnic groups of Sabah

2.5.3(a) Kadazan and Dusun

2.5.3(b) Rungus

2.5.3(c) Bajau

2.5.4 Previous genetic studies in Sabah

2.6 Summary

CHAPTER 3 MATERIALS AND METHODOLOGY

3.1 Study samples

3.2 Comparative published mtDNA control region and complete mtDNA sequences

3.3 Laboratory work

3.3.1 Phenol-chloroform DNA extraction

3.3.2 Preparation of primers

3.3.3 Polymerase chain reaction (PCR) amplification

3.3.4 Gel electrophoresis

3.3.5 DNA purification and sequencing

3.4 Data analysis

3.4.1 Variants scoring

3.4.2 Haplogroup assignment

3.4.3 Network v.5.0.1.1

3.4.4 Phylogenetic trees (whole genome)

3.4.5 Coalescence time estimation (whole genome)

3.4.5(a) Rho (ρ) statistic

3.4.5(b) Maximum likelihood (ML)

3.5 Summary

CHAPTER 4 CONTROL REGION: HAPLOGROUP PROFILE

4.1 Results

4.2 Discussion

4.2.1 Bajau

4.2.2 Dusun

4.2.3 Kadazan

4.2.4 Rungus

4.3 Summary

CHAPTER 5 CONTROL REGION: MACROHAPLOGROUP M

5.1 Haplogroup M7

5.1.1 Haplogroup M7b1a1+16192

5.1.2 Haplogroup M7c1a+@16295

5.1.3 Haplogroup M7c1c3

5.2 Haplogroup E

5.2.1 Haplogroup E1

5.2.2 Haplogroup E1a1a1

5.2.3 Haplogroup E1a2+16261

5.3 Haplogroup D

5.3.1 Haplogroup D4a1d

5.3.2 Haplogroup D4a7

5.3.3 Haplogroup D4b2a2a

5.3.4 Haplogroup D4j2

5.3.5 Haplogroup D6

5.4 Haplogroup M17a

5.5 Haplogroup M57

5.6 Haplogroup M68a

5.7 Haplogroup M71

5.8 Haplogroup C

5.8.1 Haplogroup C7

5.9 Summary

CHAPTER 6 CONTROL REGION: MACROHAPLOGROUP N

6.1 Haplogroup N9

6.1.1 Haplogroup N9a6a

6.1.2 Haplogroup Y2

6.2 Summary

CHAPTER 7 CONTROL REGION: MACROHAPLOGROUP R

7.1 Haplogroup R9

7.1.1 Haplogroup R9b2

7.1.2 Haplogroup R9c1a

7.2 Haplogroup F

7.2.1 Haplogroup F1a

7.2.2 Haplogroup F1a3+16311

7.2.3 Haplogroup F3b

7.3 Haplogroup B

7.3.1 Haplogroup B4

7.3.2 Haplogroup B4a1a1

7.3.3 Haplogroup B4a1a3a

7.3.4 Haplogroup B4a2b

7.3.5 Haplogroup B4c1b+16335

7.3.6 Haplogroup B4c2

7.3.7 Haplogroup B5a

7.3.8 Haplogroup B5a1d

7.3.9 Haplogroup B5b

7.4 Haplogroup P

7.4.1 Haplogroup P4a

7.5 Summary

CHAPTER 8 WHOLE mtDNA GENOME

8.1 Haplogroup F3b

8.2 Haplogroup M68

8.3 Haplogroup R9c

8.4 Summary

CHAPTER 9 DISCUSSION AND CONCLUSION

9.1 Origin of the Sabah ethnic groups lineages

9.1.1 ISEA haplogroups

9.1.2 East Asian (EA) haplogroups

9.1.3 MSEA/Sunda haplogroups

9.1.4 South Asian (SA) haplogroup

9.2 Integrating the whole mtDNA genome data

9.3 The phylogeography of Sabah: an overview

9.4 Final remarks

9.5 Future research

REFERENCES

APPENDICES

LIST OF PUBLICATIONS

LIST OF CONFERENCE PRESENTATIONS

LIST OF TABLES

Table 2.1 Examples of mtDNA mutation rates that have been proposed by researchers based on the different regions of the human mtDNA. Mutation rates for the coding region are sometimes referred to as synonymous mutation rates.

Table 3.1 Examples of some comparative populations included in this study. Language family: AA - Austroasiatic, AN - Austronesian, ST - Sino-Tibetan, TK - Tai-Kadai.

Table 3.2 22 pairs of primers used for complete mtDNA genome amplification, including the primer pair that was already used for the control region amplification (pair number 1). The primer pairs were selected based on the primers used in Eng (2014).

Table 3.3 32 sets of alternative primer pairs from Maca-Meyer et al. (2001) used for complete mtDNA genome amplification.

Table 3.4 23 primers used for complete mtDNA genome sequencing, including the primer pair that was already used for the control region sequencing (pair number 1). The primers listed in this table correspond to the amplification primer pairs listed in Table 3.2.

Table 3.5 Colour codes for the study samples (according to ethnicities) and published sequences (according to regional locations).

Table 3.6 Three-letter country codes of the locations of published complete sequences included in the whole genome phylogenetic trees.

Table 4.1 Comprehensive list of the haplogroups identified for the samples in this study and their frequencies according to the ethnic groups of Sabah. Relative frequencies in percentages are listed in brackets.

Table 5.1 List of haplogroups nested under macrohaplogroup M identified according to the HVS-I polymorphisms of the ethnic groups. Relative frequencies in percentages are listed in brackets.

Table 5.2 Summary of the haplogroups identified for this study that are nested under macrohaplogroup M. Haplogroups are presented in order of discussion in this chapter. Specific comparative ethnic populations have been listed wherever possible. Haplogroup M68a (shaded) has been chosen for whole mtDNA genome sequencing.

Table 6.1 List of haplogroups nested under macrohaplogroup N identified according to the HVS-I polymorphisms of the ethnic groups. Relative frequencies in percentages are listed in brackets.

Table 6.2 Summary of the haplogroups identified for this study that are nested under macrohaplogroup N. Haplogroups are presented in order of discussion in this chapter. Specific comparative ethnic populations have been listed wherever possible. No haplogroups were subjected to whole mtDNA genome sequencing.

Table 7.1 List of haplogroups nested under macrohaplogroup R identified according to the HVS-I polymorphisms of the ethnic groups. Relative frequencies in percentages are listed in brackets.

Table 7.2 Summary of the haplogroups identified for this study that are nested under macrohaplogroup R. Haplogroups are presented in order of discussion in this chapter. Specific comparative ethnic populations have been listed wherever possible. Haplogroups R9c1a, F3b and F3b1 (shaded) have been chosen for whole mtDNA genome sequencing.

Table 8.1 List of Sabah samples assigned to haplogroups F3b/F3b1, M68a and R9c1a according to their mtDNA HVS-I polymorphisms. Shaded samples are those that were chosen for whole mtDNA genome amplification and sequencing. Insertions, deletions and fast mutations at np 16519, and at nps 16182 and 16183 when np 16189 has a transition, have been omitted from the list of HVS-I polymorphisms. A full list of the mtDNA HVS-I polymorphisms of the samples can be found in Appendix E.

Table 8.2 Summary of the haplogroups that were subjected to whole mtDNA genome analysis in this study.

Table 9.1 Assignment of HVS-I haplogroups of Sabah ethnic groups to putative source. Figures taken from Table 4.1. EA – East Asia, ISEA/NG – Island Southeast Asia/New Guinea, MSEA/SUN – Mainland Southeast Asia/Sunda, SA – South Asia. Figures listed in brackets are relative frequencies in percentages.

LIST OF FIGURES

Figure 1.1 Map showing the approximate route for the “Out of Taiwan” migration of Austronesian speakers, originating from Taiwan and spreading into ISEA via the Philippines, subsequently moving into Remote Oceania and Polynesia (Bellwood (1992, 2004, 2007, 2017), Bellwood & Dizon (2005, 2008) and Diamond (1988)). Note that a separate movement of the Austronesian speakers into MSEA/Peninsular Malaysia and further west to Madagascar is not shown here. Highly researched areas in ISEA (the Philippines and Indonesia) have been highlighted in green. By contrast, Sabah and Sarawak (highlighted in red) remain largely under-studied in spite of their strategic location within ISEA and their proximity to MSEA.

Figure 1.2 Flowchart summarising the materials and methods that are involved in this study.

Figure 2.1 Basic chemical structure of a DNA molecule (top) and the four different bases of a DNA nucleotide (bottom). (After Jobling et al., 2014).

Figure 2.2 Double helix structure of a DNA molecule. (After Brown & Brown, 2011).

Figure 2.3 Schematic diagram of the human mtDNA. The mtDNA encodes 37 genes, of which 22 are for tRNAs (indicated by single letter abbreviations e.g. F, V, L etc.), two are for rRNAs (12S and 16S) and the remaining 13 are for proteins (e.g. ND1, COX1 etc.). The D-loop (or control region) consists of HVS-I, II and III; nucleotide positions (nps) of the HVS are according to Andrews et al. (1999). Overlapping segments within the mtDNA (e.g. ATP6 and ATP8) are not depicted in this diagram. OL and OH indicate the origins of the replication of the light and heavy strands respectively, whereas PL and PH indicate the promoters for the transcription of these two strands. (After Jobling et al., 2014).

Figure 2.4 First three cycles of the polymerase chain reaction (PCR). At the end of the third cycle, more of the shorter strands of the targeted sequencing region would have been produced, resulting in an exponential increase of these DNA strands in the subsequent cycles. (After Brown & Brown, 2011).

Figure 2.5 Chemical structures of a dNTP and a ddNTP. The latter differs from the former in that it has a hydrogen atom (as opposed to a hydroxyl group) attached to the 3′ carbon. (After Jobling et al., 2014).

Figure 2.6 Simplified view of the global human mtDNA phylogenetic tree (Build 17, 18 February 2016). mt-MRCA (top left corner) represents the most recent common matrilineal ancestor of humans. Letters (e.g. M, N, R etc.) and alphanumerics (e.g. L3, M7, N1, etc.) represent haplogroups. Haplogroups followed by an asterisk (*) (e.g. M*, N* etc.) represent all other descendant lineages of a particular clade except for the ones shown. For example, N* consists of haplogroups N3, N5, N7, N8 etc. (After www.phylotree.org).

Figure 2.7 Map showing the two possible dispersal exits and entry routes that could have been used by AMH to enter Asia. Note that a common East African origin is suggested here and that upon entering Asia, AMH subsequently dispersed along the coastal zones of the Indian Ocean to Australasia. This southern coastal route is one that is usually suggested by genetic evidence. (After Forster & Matsumura, 2005:965).

Figure 2.8 Southeast Asia. This subregion of Asia can be broadly divided into MSEA (yellow) and ISEA (blue).

Figure 2.9 Map of modern-day SEA overlaid with a map of the ancient landmass known as Sundaland (light grey). Dotted lines delineate the extent of Sundaland based on studies conducted by Hughes et al. (2003) and Bird et al. (2005). (After Bird et al., 2005).

Figure 2.10 Subgrouping of the Austronesian (AN) languages. (After Blust, 1984-1985:46).

Figure 2.11 The “Out of Taiwan” hypothesis, as proposed by Bellwood & Dizon (2005), Bellwood (2007) and Bellwood & Dizon (2008).

Figure 2.12 Left: Location of Malaysia within SEA. Right: Location of Sabah on the island of Borneo.

Figure 2.13 Map showing the sampling sites in the northwest region of Sabah, Malaysia. From top to bottom: Kudat (Rungus); Kota Belud (Bajau); Tuaran (Dusun); and Kota Kinabalu (Kadazan).

Figure 2.14 Kadazan-Dusun ethnic individuals. (After Sabah Tourist Association, 2020b).

Figure 2.15 Rungus ethnic individuals. (After Sabah Tourist Association, 2020c).

Figure 2.16 Bajau ethnic individuals. (After Sabah Tourist Association, 2020a).

Figure 4.1 Frequency distribution of the haplogroups identified for the Bajau (n=45).

Figure 4.2 Frequency distribution of the haplogroups identified for the Dusun (n=90).

Figure 4.3 Frequency distribution of the haplogroups identified for the Kadazan (n=13).

Figure 4.4 Frequency distribution of the haplogroups identified for the Rungus individuals (n=29).

Figure 5.1 Simplified tree for the phylogeny of M7b1a1+16192. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Kutanan et al. (2018a).

Figure 5.2 Network of M7b1a1+16192 generated from the HVS-I data of two samples (one Bajau and one Rungus) and 288 published sequences. Arrow indicates the start of the M7b1a1+16192 phylogeny.

Figure 5.3 Map showing the distribution of haplogroup M7b1a1+16192 based on the HVS-I data of two samples and 288 published sequences. Location key: 1. Russia; 2. China (including Inner Mongolia); 3. Taiwan; 4. Myanmar; 5. Laos; 6. Thailand; 7. Vietnam; 8. Philippines; 9. Peninsular Malaysia; 10. Sabah (Borneo, Malaysia); 11. Sumatra (Indonesia).

Figure 5.4 Simplified tree for the phylogeny of M7c1a. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Kutanan et al. (2018a).

Figure 5.5 Network of M7c1a generated from the HVS-I data of two samples (one Dusun and one Kadazan) and 52 published sequences.

Figure 5.6 Map showing the distribution of M7c1a based on the HVS-I data of two samples and 52 published sequences. Location key: 1. South Korea; 2. Japan; 3. China (including Inner Mongolia); 4. Myanmar; 5. Laos; 6. Thailand; 7. Vietnam; 8. Sabah (Borneo, Malaysia).

Figure 5.7 Network of M7c1c3 generated from the HVS-I data of 12 samples (five Bajaus, three Dusuns, two Kadazans and two Rungus) and 232 published sequences. Arrow indicates the start of the M7c1c3 phylogeny.

Figure 5.8 Map showing the distribution of haplogroup M7c1c3 based on the HVS-I data of 12 samples and 232 published sequences. Location key: 1. Japan; 2. China; 3. Taiwan; 4. Philippines; 5. Thailand; 6. Vietnam; 7. Peninsular Malaysia; 8. Sabah (Borneo, Malaysia); 9. Brunei; 10. Sarawak (Borneo, Malaysia); 11. Sumatra (Indonesia); 12. Lesser Sunda Islands (Indonesia); 13. Timor-Leste; 14. Micronesia; 15. Solomon Islands; 16. Tuvalu. Note: Sumatra includes three samples labelled as “Indonesia” in Soares et al. (2016) with no further specification on the location, and could, therefore, have originally derived from other regions within Indonesia.

Figure 5.9 Network of E1 generated from the HVS-I data of six samples (all Dusun) and eight published sequences.

Figure 5.10 Map showing the distribution of haplogroup E1 based on the HVS-I data of six samples and eight published sequences. Location key: 1. China (Tibet); 2. Taiwan; 3. Peninsular Malaysia; 4. Sabah (Borneo, Malaysia); 5. Sumatra (Indonesia).

Figure 5.11 Network of E1a1a1 generated from the HVS-I data of 35 samples (15 Bajau, 13 Dusun, three Kadazan and four Rungus) and 60 published sequences.

Figure 5.12 Map showing the distribution of haplogroup E1a1a1 based on the HVS-I data of 35 samples and 60 published sequences. Location key: 1. Saudi Arabia; 2. Taiwan; 3. Philippines; 4. Thailand; 5. Sabah (Borneo, Malaysia); 6. Peninsular Malaysia; 7. Sumatra (Indonesia); 8. Sulawesi (Indonesia).

Figure 5.13 Network of E1a2+16261 generated from the HVS-I data of six samples (three Dusun and three Rungus) and 56 published sequences.

Figure 5.14 Map showing the distribution of E1a2 and E1a2+16261 based on the HVS-I data of six samples and 56 published sequences. Location key: 1. Philippines; 2. Peninsular Malaysia; 3. Sabah (Borneo, Malaysia); 4. Sarawak (Borneo, Malaysia); 5. Sumatra (Indonesia); 6. South Borneo (Indonesia); 7. Sulawesi (Indonesia); 8. Java (Indonesia); 9. Lesser Sunda Islands (Indonesia); 10. Timor-Leste; 11. Papua New Guinea; 12. Solomon Islands.

Figure 5.15 Simplified tree for the phylogeny of D4a1d. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Eng (2014).

Figure 5.16 Network of D4a1d generated from the HVS-I data of seven samples (five Bajaus, one Dusun and one Kadazan) and 104 published sequences.

Figure 5.17 Map showing the distribution of haplogroups D4a, D4a1 and D4a1d based on the HVS-I data of seven samples and 104 published sequences. Location key: 1. Russia; 2. Japan; 3. South Korea; 4. China; 5. Myanmar; 6. Laos; 7. Thailand; 8. Vietnam; 9. Sabah (Borneo, Malaysia); 10. Mauritius.

Figure 5.18 Network of D4a7 generated from the HVS-I data of two samples (both Bajau) and eight published sequences.

Figure 5.19 Map showing the distribution of haplogroup D4a7 based on the HVS-I data of two samples and eight published sequences. Location key: 1. China; 2. Taiwan; 3. Vietnam; 4. Sabah (Borneo, Malaysia).

Figure 5.20 Simplified tree for the phylogeny of D4b2a2a. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Eng (2014).

Figure 5.21 Network of D4b2a2a generated from the HVS-I data of seven samples (one Bajau, four Dusuns, one Kadazan and one Rungus) and 42 published sequences.

Figure 5.22 Map showing the distribution of haplogroups D4b, D4b2, D4b2a, D4b2a2 and D4b2a2a based on the HVS-I data of seven samples and 42 published sequences. Location key: 1. Japan; 2. Taiwan; 3. China (including Tibet); 4. Myanmar; 5. Peninsular Malaysia; 6. Sabah (Borneo, Malaysia); 7. Timor-Leste.

Figure 5.23 Network of D4j2 generated from the HVS-I data of one sample (Kadazan) and 22 published sequences.

Figure 5.24 Map showing the distribution of haplogroups D4j and D4j2 based on the HVS-I data of one sample and 22 published sequences. Location key: 1. Turkey; 2. Russia (including Siberia); 3. Japan; 4. China (including Tibet); 5. Nepal; 6. India; 7. Thailand; 8. Sabah (Borneo, Malaysia).

Figure 5.25 Network of D6 generated from the HVS-I data of three samples (one Bajau and two Dusuns) and seven published sequences.

Figure 5.26 Map showing the distribution of haplogroup D6 based on the HVS-I data of three samples and seven published sequences. Location key: 1. Japan; 2. China; 3. Philippines; 4. Vietnam; 5. Sabah (Borneo, Malaysia).

Figure 5.27 Network of M17a generated from the HVS-I data of one sample (Dusun) and 20 published sequences.

Figure 5.28 Map showing the distribution of haplogroup M17a based on the HVS-I data of one sample and 20 published sequences. Location key: 1. Myanmar; 2. Thailand; 3. Vietnam; 4. Peninsular Malaysia; 5. Sabah (Borneo, Malaysia); 6. Sumatra (Indonesia). Note: The Sumatran sample was labelled as “Indonesia” in Tabbada et al. (2010) with no further specification on the location, and could, therefore, have originally derived from other regions within Indonesia.

Figure 5.29 Simplified tree for the phylogeny of M57. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Chandrasekar et al. (2009).

Figure 5.30 Network of M57 generated from the HVS-I data of one sample (Bajau) and 10 published sequences.

Figure 5.31 Map showing the distribution of haplogroup M57 based on the HVS-I data of one sample and 10 published sequences. Location key: 1. Saudi Arabia; 2. Pakistan; 3. India; 4. Sabah (Borneo, Malaysia).

Figure 5.32 Simplified tree for the phylogeny of M68. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Zhang et al. (2013).

Figure 5.33 Network of M68a generated from the HVS-I data of one sample (Dusun) and 19 published sequences.

Figure 5.34 Map showing the distribution of haplogroup M68 based on the HVS-I data of one sample and 19 published sequences. Location key: 1. Myanmar; 2. Cambodia; 3. Vietnam; 4. Sabah (Borneo, Malaysia).

Figure 5.35 Simplified tree for the phylogeny of M71. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Eng (2014).

Figure 5.36 Network of M71 generated from the HVS-I data of two samples (both Dusun) and 114 published sequences.

Figure 5.37 Map showing the distribution of the M71 lineages based on the HVS-I data of two samples and 114 published sequences. Location key: 1. China; 2. Myanmar; 3. Laos; 4. Thailand; 5. Cambodia; 6. Vietnam; 7. Philippines; 8. Sabah (Borneo, Malaysia); 9. Peninsular Malaysia; 10. Sumatra (Indonesia); 11. Timor-Leste.

Figure 5.38 Simplified tree for the phylogeny of haplogroup C. HVS-I polymorphisms are in blue.

Figure 5.39 Network of C7 generated from the HVS-I data of two samples (both Dusun) and 68 published sequences.

Figure 5.40 Map showing the distribution of haplogroup C7 based on the HVS-I data of two samples and 68 published sequences. Location key: 1. South Korea; 2. China (including Inner Mongolia); 3. Myanmar; 4. Laos; 5. Thailand; 6. Vietnam; 7. Peninsular Malaysia; 8. Sabah (Borneo, Malaysia).

Figure 6.1 Schematic diagram of major subclades in macrohaplogroup N. Shaded subclades represent those that are more commonly present in East and Southeast Asia.

Figure 6.2 Simplified tree for the phylogeny of N9a6a. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Brandão et al. (2016), except for haplogroup N9a6a, the date of which (marked by an asterisk) is from Eng (2014).

Figure 6.3 Network of N9a6a generated from the HVS-I data of two samples (both Dusun) and 137 published sequences.

Figure 6.4 Map showing the distribution of haplogroup N9a6a and its precursors based on the HVS-I data of two samples and 137 published sequences. Location key: 1. Japan; 2. China (including Tibet); 3. Myanmar; 4. Laos; 5. Thailand; 6. Cambodia; 7. Vietnam; 8. Philippines; 9. Peninsular Malaysia; 10. Sumatra (Indonesia); 11. Sabah (Borneo, Malaysia); 12. Sarawak (Borneo, Malaysia); 13. South Borneo (Indonesia).

Figure 6.5 Network of Y2 generated from the HVS-I data of one sample (Dusun) and 72 published sequences.

Figure 6.6 Map showing the distribution of haplogroups Y and Y2 based on the HVS-I data of one sample and 72 published sequences. Location key: 1. Russia (Siberia); 2. Japan; 3. China; 4. Taiwan; 5. Myanmar; 6. Philippines; 7. Peninsular Malaysia; 8. Sumatra (Indonesia); 9. Sabah (Borneo, Malaysia); 10. Brunei; 11. Sulawesi (Indonesia). Note: Sumatra includes one sample labelled as “Indonesia” in Tabbada et al. (2010) with no further specification on the location, and could, therefore, have originally derived from other regions within Indonesia.

Figure 7.1 Schematic diagram of major subclades in macrohaplogroup R. Shaded subclades represent those that are present in East and Southeast Asia.

Figure 7.2 Network of R9b2 generated from the HVS-I data of one sample (Dusun) and 52 published sequences.

Figure 7.3 Map showing the distribution of haplogroup R9b2 based on the HVS-I data of one sample and 52 published sequences. Location key: 1. China; 2. Laos; 3. Thailand; 4. Cambodia; 5. Vietnam; 6. Peninsular Malaysia; 7. Sabah (Borneo, Malaysia).

Figure 7.4 Simplified tree for the phylogeny of R9c1a. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Eng (2014).

Figure 7.5 Network of R9c1a generated from the HVS-I data of 15 samples (five Bajaus, five Dusuns, one Kadazan and four Rungus) and 115 published sequences.

Figure 7.6 Map showing the distribution of haplogroups R9c1 and R9c1a based on the HVS-I data of 15 samples and 115 published sequences. Location key: 1. China (including Inner Mongolia); 2. Taiwan; 3. Laos; 4. Vietnam; 5. Philippines; 6. Sabah (Borneo, Malaysia); 7. South Borneo (Indonesia); 8. Lesser Sunda Islands (Indonesia); 9. Timor-Leste; 10. Micronesia. Note: South Borneo includes one sample labelled as “Indonesia” in Tabbada et al. (2010) with no further specification on the location, and could, therefore, have originally derived from other regions within Indonesia.

Figure 7.7 Network of F1a generated from the HVS-I data of 12 samples (11 Dusuns and one Rungus) and 223 published sequences.

Figure 7.8 Map showing the distribution of haplogroup F1a based on the HVS-I data of 12 samples and 223 published sequences. Location key: 1. India; 2. China (including Inner Mongolia and Tibet); 3. Myanmar; 4. Laos; 5. Thailand; 6. Cambodia; 7. Vietnam; 8. Peninsular Malaysia; 9. Sabah (Borneo, Malaysia).

Figure 7.9 Network of F1a3+16311 generated from the HVS-I data of two samples (both Dusun) and 81 published sequences.

Figure 7.10 Map showing the distribution of haplogroup F1a3+16311 based on the HVS-I data of two samples and 81 published sequences. Location key: 1. Japan; 2. Taiwan; 3. China; 4. Laos; 5. Thailand; 6. Vietnam; 7. Philippines; 8. Peninsular Malaysia; 9. Sumatra (Indonesia); 10. Sabah (Borneo, Malaysia); 11. Lesser Sunda Islands (Indonesia); 12. Timor-Leste.

Figure 7.11 Simplified tree for the phylogeny of F3b and F3b1. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Brandão et al. (2016).

Figure 7.12 Network of F3b and F3b1 generated from the HVS-I data of three samples (all Rungus) and 104 published sequences.

Figure 7.13 Map showing the distribution of haplogroup F3b and its subclades based on the HVS-I data of three samples and 104 published sequences. Location key: 1. Japan; 2. Taiwan; 3. China; 4. Philippines; 5. Thailand; 6. Sabah (Borneo, Malaysia); 7. Brunei; 8. South Borneo (Indonesia); 9. Peninsular Malaysia; 10. Sumatra (Indonesia); 11. Lesser Sunda Islands (Indonesia); 12. Madagascar. Note: Sumatra includes one sample labelled as “Indonesia” in Tabbada et al. (2010) with no further specification on the location, and could, therefore, have originally derived from other regions within Indonesia.

Figure 7.14 Network of B4 and B4+16261 generated from the HVS-I data of 13 samples (two Bajaus, eight Dusuns and three Rungus) and 138 published sequences.

Figure 7.15 Map showing the distribution of haplogroups B4 and B4+16261 based on the HVS-I data of 13 samples and 138 published sequences. Location key: 1. India; 2. Japan; 3. China (including Tibet and Inner Mongolia); 4. Myanmar; 5. Thailand; 6. Cambodia; 7. Vietnam; 8. Peninsular Malaysia; 9. Sabah (Borneo, Malaysia); 10. Sumatra (Indonesia); 11. Timor-Leste; 12. Papua New Guinea; 13. Solomon Islands.

Figure 7.16 Simplified tree for the phylogeny of haplogroup B4a1a1. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Eng (2014).

Figure 7.17 Network of B4a1a1 generated from the HVS-I data of one sample (Dusun) and 168 published sequences.

Figure 7.18 Map showing the distribution of haplogroup B4a1a1 based on the HVS-I data of one sample and 168 published sequences. Location key: 1. Madagascar; 2. China; 3. Peninsular Malaysia; 4. Sabah (Borneo, Malaysia); 5. Sarawak (Borneo, Malaysia); 6. South Borneo (Indonesia); 7. Sulawesi (Indonesia); 8. Moluccas (Indonesia); 9. Timor-Leste; 10. Papua New Guinea; 11. Solomon Islands; 12. Vanuatu; 13. Tonga; 14. Samoa; 15. Cook Islands; 16. French Polynesia.

Figure 7.19 Network of B4a1a3a generated from the HVS-I data of 18 samples (three Bajaus, ten Dusuns, one Kadazan and four Rungus) and 142 published sequences.

(22)

Figure 7.20 Map showing the distribution of haplogroup B4a1a3a based on the HVS-I data of 18 samples and 142 published sequences. Location key: 1. Myanmar; 2. Thailand; 3. Vietnam; 4. Taiwan; 5. Philippines; 6. Peninsular Malaysia; 7. Sabah (Borneo, Malaysia); 8. Sumatra (Indonesia); 9. South Borneo (Indonesia); 10. Java (Indonesia); 11. Lesser Sunda Islands (Indonesia); 12. Sulawesi (Indonesia); 13. Moluccas (Indonesia); 14. Timor-Leste; 15. Papua New Guinea; 16. Micronesia; 17. Solomon Islands; 18. Tuvalu; 19. Futuna and Wallis; 20. Fiji; 21. Tonga; 22. Samoa; 23. Cook Islands; 24. French Polynesia.

Figure 7.21 Network of B4a2 generated from the HVS-I data of three samples (two Dusuns and one Kadazan) and 34 published sequences.

Figure 7.22 Map showing the distribution of haplogroup B4a2 based on the HVS-I data of three samples and 34 published sequences. Location key: 1. Japan; 2. China; 3. Taiwan; 4. Philippines; 5. Sabah (Borneo, Malaysia); 6. South Borneo (Indonesia); 7. Sumatra (Indonesia).

Figure 7.23 Simplified tree for the phylogeny of B4c1b+16335. HVS-I polymorphisms are in blue. Estimated coalescence dates, where provided, are from a complete mtDNA genome study by Eng (2014).

Figure 7.24 Network of B4c1b+16335 generated from the HVS-I data of one sample (Dusun) and 124 published sequences. Note that the subclades of B4c1b+16335 (i.e. the B4c1b2 lineages) have also been included, as some cannot be distinguished from B4c1b+16335 in HVS-I terms.

Figure 7.25 Map showing the distribution of haplogroup B4c1b+16335 based on the HVS-I data of one sample and 124 published sequences. Location key: 1. Russia; 2. Japan; 3. China; 4. Taiwan; 5. Laos; 6. Thailand; 7. Vietnam; 8. Philippines; 9. Peninsular Malaysia; 10. Sabah (Borneo, Malaysia); 11. Sumatra (Indonesia); 12. Timor-Leste.

Figure 7.26 Network of B4c2 generated from the HVS-I data of one sample (Rungus) and 169 published sequences.

Figure 7.27 Map showing the distribution of haplogroup B4c2 based on the HVS-I data of one sample and 169 published sequences. Location key: 1. Uzbekistan; 2. China; 3. Myanmar; 4. Laos; 5. Thailand; 6. Cambodia; 7. Vietnam; 8. Sabah (Borneo, Malaysia); 9. Peninsular Malaysia; 10. Sumatra (Indonesia); 11. Timor-Leste.

Figure 7.28 Network of B5a generated from the HVS-I data of two samples (one Dusun and one Rungus) and 379 published sequences.

Figure 7.29 Map showing the distribution of haplogroup B5a based on the HVS-I data of two samples and 379 published sequences. Location key: 1. Japan; 2. China (including Tibet); 3. Myanmar; 4. Laos; 5. Thailand; 6. Cambodia; 7. Vietnam; 8. Nicobar Islands; 9. Peninsular Malaysia; 10. Sabah (Borneo, Malaysia); 11. Sumatra (Indonesia).

Figure 7.30 Network of B5a1d generated from the HVS-I data of seven samples (three Bajaus, two Dusuns, one Kadazan and one Rungus) and 259 published sequences.

Figure 7.31 Map showing the distribution of haplogroups B5a1 and B5a1d based on the HVS-I data of seven samples and 259 published sequences. Location key: 1. Nepal; 2. China; 3. Myanmar; 4. Laos; 5. Thailand; 6. Vietnam; 7. Cambodia; 8. Philippines; 9. Peninsular Malaysia; 10. Sabah (Borneo, Malaysia); 11. Sumatra (Indonesia).

Figure 7.32 Network of B5b generated from the HVS-I data of three samples (one Bajau and two Dusuns) and 133 published sequences.

Figure 7.33 Map showing the distribution of haplogroup B5b based on the HVS-I data of three samples and 133 published sequences. Location key: 1. Japan; 2. South Korea; 3. China (including Tibet and Inner Mongolia); 4. Laos; 5. Cambodia; 6. Vietnam; 7. Philippines; 8. Peninsular Malaysia; 9. Sabah (Borneo, Malaysia); 10. Sumatra (Indonesia); 11. Timor-Leste; 12. Solomon Islands.

Figure 7.34 Network of P4a generated from the HVS-I data of two samples (both Dusun) and seven published sequences.

Figure 7.35 Map showing the distribution of haplogroup P4a based on the HVS-I data of two samples and seven published sequences. Location key: 1. Sri Lanka; 2. Sabah (Borneo, Malaysia); 3. Papua New Guinea.

Figure 8.1 Network of F3b generated from the whole mtDNA genome data of two samples (both Rungus) and 68 published whole mtDNA genome sequences.

Figure 8.2 The most parsimonious tree of haplogroup F3b excluding F3b1a. HVS-I polymorphisms are in blue. Coalescence ages for the clades are estimated based on ML (top) and averaged distance (ρ; bottom, in green). Location codes: CHN – China, JPN – Japan, KUD – Kudat (Sabah, Malaysia), NBO – North Borneo (Malaysia), PHL – Philippines, SBO – South Borneo (Indonesia), THA – Thailand.

Figure 8.3 The most parsimonious tree of haplogroup F3b1a. HVS-I polymorphisms are in blue. Coalescence ages for the clades are estimated based on ML (top) and averaged distance (ρ; bottom, in green). Location codes: BRN – Brunei, IDN – Indonesia (unclassified), LSI – Lesser Sunda Islands (Indonesia), MYS – Peninsular Malaysia, PHL – Philippines, TWN – Taiwan.

Figure 8.4 Network of M68 generated from the whole mtDNA genome data of two samples (both Dusun) and 18 published whole mtDNA genome sequences.

Figure 8.5 The most parsimonious tree of haplogroup M68. HVS-I polymorphisms are in blue. Coalescence ages for the clades are estimated based on ML (top) and averaged distance (ρ; bottom, in green). Location codes: KHM – Cambodia, MMR – Myanmar, TUA – Tuaran (Sabah, Malaysia), VNM – Vietnam.

Figure 8.6 Network of R9c generated from the whole mtDNA genome data of six samples (two Bajaus, two Dusuns and two Rungus) and 43 published whole mtDNA genome sequences.

Figure 8.7 The most parsimonious tree of haplogroup R9c excluding R9c1a. HVS-I polymorphisms are in blue. Coalescence ages for the clades are estimated based on ML (top) and averaged distance (ρ; bottom, in green). Location codes: CHN – China, MMR – Myanmar, MNG – Inner Mongolia (China), PHL – Philippines, THA – Thailand, TLS – Timor-Leste, TWN – Taiwan, VNM – Vietnam.

Figure 8.8 The most parsimonious tree of haplogroup R9c1a. HVS-I polymorphisms are in blue. Coalescence ages for the clades are estimated based on ML (top) and averaged distance (ρ; bottom, in green). Location codes: GUM – Guam (Micronesia), KBD – Kota Belud (Sabah, Malaysia), KUD – Kudat (Sabah, Malaysia), LSI – Lesser Sunda Islands (Indonesia), PHL – Philippines, SBO – South Borneo (Indonesia), TUA – Tuaran (Sabah, Malaysia), TWN – Taiwan, VNM – Vietnam.

Figure C.1 Excel file for Network, using haplogroup P4a as an example, listing the Network IDs, polymorphisms and frequencies.

Figure C.2 Interface of fm2net_gui.

Figure D.3 Image showing the M68 sequences aligned in BioEdit and assigned a new sample ID.

Figure D.4 Image showing the information and part of the 2-partitions sequencing map that is added at the top of the .phy file in Word.

Figure D.5 (a) Example of an existing M68 tree. Sequences have been assigned a new ID (in green) for easy processing. Unnamed clades have been labelled uniquely (in red) for easy identification. (b) Newick tree built with samples nested using parentheses to be inserted into FigTree. (c) Newick tree of M68 generated on FigTree.

Figure D.6 Command Prompt interface.

Figure D.7 ML_M68.pml in Notepad.

Figure D.8 ML_M68.xlsx

Figure D.9 Coalescence age estimated for M68 clade using Soares calculator.

LIST OF ABBREVIATIONS

aDNA ancient DNA
AMH anatomically modern humans
ATP adenosine triphosphate
BC before Christ
bp base pair
CRS Cambridge Reference Sequence
dATP deoxyadenosine triphosphate
dCTP deoxycytidine triphosphate
ddATP dideoxyadenosine triphosphate
ddCTP dideoxycytidine triphosphate
ddGTP dideoxyguanosine triphosphate
ddNTP dideoxynucleoside triphosphate
ddTTP dideoxythymidine triphosphate
dGTP deoxyguanosine triphosphate
DNA deoxyribonucleic acid
dNTP deoxynucleoside triphosphate
dTTP deoxythymidine triphosphate
EA East Asia
EDTA ethylenediaminetetraacetic acid
EtBr ethidium bromide
HLA human leukocyte antigen
HVS-I hypervariable segment I
HVS-II hypervariable segment II
HVS-III hypervariable segment III
ISEA Island Southeast Asia
ka thousand years
kya thousand years ago
LGM Last Glacial Maximum
LGP Last Glacial Period
MgCl2 magnesium chloride
MgCl2·6H2O magnesium chloride hexahydrate
MJ median-joining
ML maximum likelihood
MP maximum parsimony
MRCA most recent common ancestor
MSEA Mainland Southeast Asia
mtDNA mitochondrial DNA
mya million years ago
NaCl sodium chloride
nDNA nuclear DNA
NG New Guinea
NMTCN Nusantao Maritime Trading and Communication Network
np nucleotide position
OOA Out of Africa
PAML Phylogenetic Analysis by Maximum Likelihood
PCR polymerase chain reaction
RBC red blood cell
rCRS revised Cambridge Reference Sequence
RFLP restriction fragment-length polymorphism
RM reduced median
rRNA ribosomal RNA
RSRS Reconstructed Sapiens Reference Sequence
SA South Asia
SDS sodium dodecyl sulphate
SEA Southeast Asia
SNP single nucleotide polymorphism
TAE Tris base, acetic acid and EDTA
Tris-HCl Tris hydrochloride
tRNA transfer RNA
UV ultraviolet
ya years ago
YTT Youngest Toba Tuff


LIST OF APPENDICES

Appendix A Consent form (in Malay) to be signed by participants prior to buccal swab sampling.

Appendix B List of publications and accession numbers of published sequences that have been included in this study. A total of 3,425 published sequences have been used to generate the phylogenetic networks and phylogenetic trees in this study. In cases where only polymorphisms were reported with no accession numbers, the number of samples used is indicated. n = total number of published samples used for the respective haplogroup.

Appendix C Workflow for Network 5.0.1.1.

Appendix D Workflow for Phylogenetic Analysis by Maximum Likelihood (PAML), using haplogroup M68 (containing 21 sequences, including the rCRS) as an example.

Appendix E Complete list of the mtDNA control region polymorphisms of the Sabah ethnic individuals sequenced in this study. Results are listed according to haplogroups.

Appendix F Complete list of the whole mtDNA genome polymorphisms of the Sabah ethnic individuals sequenced in this study. Results are listed according to haplogroups.


PHYLOGEOGRAPHIC STUDY OF THE ETHNIC GROUPS OF SABAH: KADAZAN, DUSUN, RUNGUS AND BAJAU

ABSTRAK

Phylogeography is a field that applies phylogenetic research within archaeology to study past human migration and settlement patterns. In Southeast Asia, phylogeographic studies have been used extensively to examine the Austronesian diaspora, the movement of a widely dispersed ethnolinguistic group. Archaeological and linguistic evidence has shown that proto-Austronesian-speaking rice farmers migrated from South China to Taiwan around 5,500 years ago. The Austronesian languages then developed in Taiwan and began to spread to Southeast Asia, Oceania and Polynesia around 4,000 years ago. This migration is known as the “Out of Taiwan” dispersal.

However, genetic studies across the region show that the actual situation is far more complex than expected. Indeed, some researchers suggest that migration occurred many times in the past rather than only once, including migrations towards Taiwan. In Sabah (Borneo, Malaysia), all ethnic groups speak Austronesian languages, for example the Dusunic and Sama-Bajau languages. This indicates that an incursion, acculturation or assimilation of Austronesian speakers took place in the region. Yet no comprehensive phylogenetic study has been carried out in Sabah, unlike in neighbouring regions such as the Philippines and Indonesia. This study therefore aims to fill this research gap through a phylogenetic analysis of four major ethnic groups in Sabah, namely the Kadazan, Dusun, Bajau and Rungus, using maternally inherited mitochondrial DNA (mtDNA). The study has two main objectives: (i) to characterise the variation in the mtDNA control region (or D-loop) of the ethnic groups in Sabah; and (ii) to identify the source populations of the indigenous people of Sabah, whether in South Asia, East Asia, Mainland Southeast Asia or Island Southeast Asia, and to study their settlement and migration patterns using phylogenetic analysis. A total of 177 ethnic individuals in Sabah were studied, and 10 samples were selected for whole mtDNA genome sequencing for further analyses such as the construction of phylogenetic trees and the estimation of migration and settlement times.

The results show that the mtDNA haplogroups of these 177 individuals can be grouped into five episodes of dispersal and settlement: (i) the first settlement by anatomically modern humans; (ii) early migrations of hunter-gatherer groups before the submergence of the Sunda shelf; (iii) postglacial human dispersals in the Early Holocene; (iv) human dispersal from Taiwan during the Mid-Holocene; and (v) human dispersals during the Mid-Holocene that did not come from Taiwan but from other regions such as Indonesia and Oceania. Overall, the results not only reject a model of replacement by Austronesian speakers but also indicate that the Austronesian effect may have been cultural and linguistic rather than genetic. Finally, this is the first study to focus on the ethnic groups of Sabah, and it is hoped that it will also serve as a reference for future phylogenetic studies in Southeast Asia.


PHYLOGEOGRAPHY STUDY OF THE INDIGENOUS PEOPLE OF SABAH: KADAZAN, DUSUN, RUNGUS AND BAJAU

ABSTRACT

Phylogeography applies phylogenetic research to archaeological questions in order to study past human migration and settlement patterns. In Southeast Asia (SEA), it has been used extensively to study the Austronesian diaspora, possibly the most widespread movement of a single ethnolinguistic group. Archaeological and linguistic evidence indicates that proto-Austronesian-speaking rice farmers migrated from South China into Taiwan c. 5,500 years ago. Austronesian languages then developed in Taiwan, and Austronesian speakers subsequently spread into SEA, Oceania and Polynesia c. 4,000 years ago. This movement has been termed the “Out of Taiwan” dispersal. However, recent genetic studies show that the situation is much more complex. Some scholars have suggested multiple dispersals rather than a one-off migration, as well as back-migration into Taiwan rather than a unidirectional movement. In Sabah (Borneo, Malaysia), all the indigenous ethnic groups speak Austronesian languages such as the Dusunic and Sama-Bajau languages; this marks an invasion, acculturation or assimilation of Austronesian speakers in the region. However, in spite of its strategic location, Sabah has received far less phylogenetic analysis than neighbouring regions such as the Philippines and Indonesia. Hence, this study aims to fill this research gap by conducting a phylogenetic analysis of the maternally inherited mitochondrial DNA (mtDNA) of four major Sabah ethnic groups, namely the Kadazan, Dusun, Bajau and Rungus. The main objectives of this study are two-fold: (i) to characterise the mtDNA control region (or D-loop) variations of the ethnic groups of Sabah; and (ii) to identify potential source populations of the indigenous people of Sabah in South Asia, East Asia, Mainland Southeast Asia (MSEA) and Island Southeast Asia (ISEA), as well as to study their settlement and migration patterns using phylogenetic analysis. This study involved 177 ethnic individuals from Sabah, of which 10 samples were selected for whole mtDNA genome phylogenetic analysis, including the construction of phylogenetic trees and coalescence time estimation.

The mtDNA haplogroups identified in this study indicate that the peopling of Sabah can be divided into five categories: (i) the first settlement of anatomically modern humans (AMH) in Sabah; (ii) settlements and/or migrations of early hunter-gatherer populations and the rise of ancient lineages prior to the flooding of the Sunda shelf; (iii) Early Holocene postglacial expansions; (iv) Mid-Holocene dispersal from Taiwan; and (v) Mid-Holocene dispersals that do not originate from Taiwan but are rooted in other regions such as Indonesia and Near Oceania. Overall, not only do the results from this study reject a total replacement model of Austronesian speakers, they also suggest that the “Austronesian effect” could have been cultural and linguistic rather than genetic. Finally, this is the first study focusing solely on the ethnic groups in Sabah, and it is hoped that this research will serve as a reference for future phylogenetic studies in SEA.


CHAPTER 1 INTRODUCTION

Past human settlement and migrations in Southeast Asia (SEA) have remained a highly debated subject in archaeology over the years. Newly unearthed archaeological evidence and material culture have constantly called for a revision of current knowledge, and the advent of scientific analysis in archaeology has also greatly refined (and sometimes disproved) prevailing theories and models.

In Island Southeast Asia (ISEA), a geographical region covering modern-day East Malaysia, Brunei, Indonesia, Timor-Leste, Singapore and the Philippines, the “Out of Taiwan” theory is the predominant model used to describe the spread of Austronesian speakers in the region. It argues for the rise of agriculture as a motivating factor behind the dispersal of Austronesian speakers, who subsequently replaced any pre-existing hunter-gatherer populations that they came into contact with. This demic diffusion model has been strongly advocated by scholars such as Bellwood (1992, 2004, 2007, 2017), Bellwood & Dizon (2005, 2008) and Diamond (1988), who based their arguments mostly on archaeological evidence, as well as linguistic analysis largely conducted by Blust (1976, 1984-1985, 1995, 2013). Alternatively, some scholars (e.g. Solheim 1975; Meacham 1984-1985; Solheim 1984-1985; Oppenheimer & Richards 2002) argue that Holocene human dispersal was a response to climate change and the rise in sea levels. More information on these opposing views is provided in the next chapter.

Within the past two to three decades, DNA studies, especially those pertaining to phylogeography and population genetic analysis, have greatly contributed to our understanding of human migrations and settlement during the Neolithic. Whilst a plethora of DNA studies have now been conducted to shed more light on this subject matter, the currently available literature is focused on only certain regions within ISEA. Notably, much research has been conducted on the Philippines (e.g. Hill et al. 2006, 2007; Tabbada et al. 2010; Scholes et al. 2011; Delfin et al. 2014) as an immediate “Out of Taiwan” recipient, as well as on Indonesia (e.g. Richards et al. 1998; Mona et al. 2009; Gunnarsdóttir et al. 2011b; Brandão et al. 2016; Kusuma et al. 2017), which, owing to its proximity to Near Oceania, acted as the final ISEA step for Austronesian speakers before they advanced further into Remote Oceania and Polynesia (Figure 1.1). In spite of its strategic location between both regions, as well as its close proximity to Peninsular Malaysia and Mainland Southeast Asia (MSEA), the northern part of Borneo, comprising Sabah and Sarawak of East Malaysia, is often overlooked in phylogenetic studies. Hence, this region presents a research gap in the phylogenetic studies of ISEA populations.


Figure 1.1 Map showing the approximate route for the “Out of Taiwan” migration of Austronesian speakers, originating from Taiwan and spreading into ISEA via the Philippines, subsequently moving into Remote Oceania and Polynesia (Bellwood 1992, 2004, 2007, 2017; Bellwood & Dizon 2005, 2008; Diamond 1988). Note that a separate movement of the Austronesian speakers into MSEA/Peninsular Malaysia and further west to Madagascar is not shown here. Highly researched areas in ISEA (the Philippines and Indonesia) have been highlighted in green. By contrast, Sabah and Sarawak (highlighted in red) remain largely under-studied in spite of their strategic location within ISEA and their proximity to MSEA.

1.1 Aims and motivation

For the reason mentioned above, this study hopes to fill in the research gap by subjecting the ethnic groups of Sabah to phylogenetic analysis. The main objectives of this study are two-fold:

1. To characterise the mitochondrial DNA (mtDNA) control region variations of the ethnic groups of Sabah; and

2. To identify potential source populations in South Asia, East Asia, MSEA and ISEA of the indigenous people of Sabah, as well as study their settlement and migration patterns using phylogenetic analysis.


Due to time and financial constraints, only the major ethnic groups in Sabah will be studied. Nevertheless, as the mtDNA control region variations of the Sabah ethnic groups were previously uncharacterised, this characterisation will greatly benefit researchers seeking to identify and quantify human settlement and migrations in the region using mtDNA. We would expect to observe some differences in the mtDNA control region among the different ethnic groups, and this would shed light on questions regarding their demography and migration history.

Furthermore, the results from this study will set the ground for further mtDNA characterisation of other ethnic groups, as well as future research utilising other types of DNA, such as the paternally inherited non-recombining Y-chromosome or autosomal DNA.

On a broader scale, the results from this study can be used to test whether the ideas put forward for other parts of ISEA hold true for this part of the region. Firstly, we would be able to observe how the genetic data from this study relate to the “Out of Taiwan” theory, which is primarily based on archaeological and linguistic evidence; estimated coalescence ages from this study and the published literature will allow us to deduce whether the genetic data corroborate or oppose the dates proposed by the “Out of Taiwan” theory. This will subsequently allow us to test population settlements and movements during the Holocene period, as well as study the effects of climate change on human settlements and dispersals since the Last Glacial Maximum (LGM).

1.2 Scope and limitations

As mentioned earlier, due to time and financial constraints, the mtDNA of the major ethnic groups in Sabah, namely the Kadazan, Dusun, Rungus and Bajau, will be the focus of this study. The materials and methodology used to carry out this study are summarised in Figure 1.2. Specific details of the materials and methodology will be provided in Chapter 3.

Figure 1.2 Flowchart summarising the materials and methods that are involved in this study.

The Kadazan, Dusun, Rungus and Bajau ethnic groups were chosen because they represent the largest ethnic groups in Sabah by population (Sabah State Government 2019). The mtDNA, in turn, was chosen because it does not reshuffle or recombine, is more easily detectable in the human cell, and has a higher mutation rate, which makes changes and diversity that accumulate in the mtDNA over a very long period of time easier to detect. The advantages of subjecting mtDNA to phylogenetic analysis are discussed in more detail in Section 2.1.1.

However, limitations arise from focusing only on the mtDNA of the major ethnic groups of Sabah. Firstly, Sabah comprises 33 indigenous groups (Sabah State Government 2019); thus, the findings and outcomes from this study might not reflect the overall phylogenetic signature of the entirety of Sabah. Nonetheless, the findings and outcomes would still provide a snapshot of the mtDNA phylogenetic signature of Sabah, which was previously uncharacterised. Secondly, the mtDNA is maternally inherited; thus, the findings and outcomes from this study would be biased towards the maternal lineage and might not reflect the scenario for the paternal lineage, which could be obtained by studying another type of DNA, the Y-chromosome.

Regardless, the findings and outcomes from this study would still benefit current phylogenetic studies in SEA that have focused on the mtDNA by filling in the research gap, as well as serving as a reference for future phylogenetic studies in Sabah, Malaysia or SEA involving other forms of DNA. It is also important to note that the assignment of the haplogroups in this study is limited to polymorphisms from the mtDNA control region, or more specifically the HVS-I. Further assignment into subhaplogroups for some samples will only be possible with variants that occur in other parts of the mtDNA, such as the coding region. This does not hinder our aim to provide a characterisation of the mtDNA control region and subsequently the genetic signature and diversity of the ethnic groups of Sabah.

Furthermore, the haplogroup diversity that we will observe in the ethnic groups in this study may be partial due to the limited sample size, especially for the Kadazan, who are one of the major ethnic groups in Sabah in terms of population size but are represented by only 13 individuals in this study (Figure 1.2). Nonetheless, these limitations will by no means prevent this study from shedding new light on the genetic signature and distribution of the ethnic groups from Sabah, a region which has previously been subjected to minimal phylogenetic analysis, as mentioned earlier. As a pilot study, this research will characterise the mtDNA control region haplogroups and genetic diversity of the major ethnic groups in Sabah, and this should act as a crucial stepping stone for future mtDNA studies in Sabah and the wider region of Borneo or ISEA.


1.3 Thesis outline

This thesis will have the following chapters:

i. Firstly, Chapter 2 will provide a literature review on topics and any background information that is relevant to the study. This includes, but is not limited to, an introduction to DNA, basic principles and prerequisite knowledge on phylogenetic analysis of human mtDNA, a discussion on modern human origins and dispersals and human migrations in ISEA during the Neolithic, as well as the wider context of SEA and Sabah.

ii. Chapter 3 will detail the materials and methodology used to conduct the research in this study.

iii. Chapter 4 will provide an overview on the mtDNA control region haplogroup profile of each Sabah ethnic group. This is followed by a preliminary discussion on the genetic signature and diversity of each ethnic group.

iv. Chapters 5, 6 and 7 will examine macrohaplogroups M, N and R respectively. These three chapters will present and discuss the haplogroups nested under these macrohaplogroups that have been identified in this study based on the mtDNA control region. These three macrohaplogroups are presented and discussed in separate chapters, as they have distinctive characteristics in terms of their defining mutations, coalescence dates, geographical region spanned and phylogenetic history that make them differ from one another.


v. Chapter 8 will present the results and discussion for the selected haplogroups that have been subjected to whole mtDNA genome analysis.

vi. Lastly, Chapter 9 will bring together all of the data obtained in this study and provide a more comprehensive discussion of the results and their implications. This is followed by a conclusion of the study, as well as some suggestions for further research.


CHAPTER 2 LITERATURE REVIEW

This literature review chapter provides the foundation for, and an in-depth understanding of, the subject matter relevant to this research study.

2.1 An introduction to human deoxyribonucleic acid (DNA)

Deoxyribonucleic acid (DNA) is the genetic material of all living organisms, except for a few kinds of viruses. DNA is a polymer, formed by a combination of monomeric subunits known as nucleotides. The basic structure of a nucleotide consists of a nitrogenous base, pentose (a type of sugar composed of five carbon atoms), and a triphosphate group (Figure 2.1). There are four nucleotides, which are distinguished by the chemically distinct structure of the bases. The four different bases are adenine (A), cytosine (C), guanine (G) and thymine (T), and they respectively form the four nucleotides known as deoxyadenosine triphosphate (dATP), deoxycytidine triphosphate (dCTP), deoxyguanosine triphosphate (dGTP), and deoxythymidine triphosphate (dTTP).


Figure 2.1 Basic chemical structure of a DNA molecule (top) and the four different bases of a DNA nucleotide (bottom). (After Jobling et al., 2014).

A DNA molecule consists of two polynucleotide chains that are held together by the hydrogen bonds between the bases, forming the well-known double helix structure of DNA (Watson & Crick 1953) (Figure 2.2). The bases of the two polynucleotide chains are always complementary to one another; in other words, an adenine on one polynucleotide chain is always paired with a thymine on the other polynucleotide chain and vice versa, whereas a cytosine is always paired with a guanine and vice versa. For this reason, the unit length of the DNA molecule is known as a base pair (bp). The two polynucleotide chains are anti-parallel, and the sequence of each polynucleotide chain is read in the 5′ to 3′ direction.
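The pairing rules described above can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and are not part of the workflow used in this study; the sketch only shows how the sequence of one chain determines the sequence of the other.

```python
# Watson-Crick pairing rules described in the text: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner chain as it would be read from its own 5' end.

    Because the two chains are anti-parallel, the complement is reversed.
    """
    return complement(strand)[::-1]


# Hypothetical four-base example.
example = "ATGC"
paired = complement(example)
partner = reverse_complement(example)
```

Each position in `example` and `paired` forms one base pair, which is why sequence length is counted in bp.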


Figure 2.2 Double helix structure of a DNA molecule. (After Brown & Brown, 2011).

As with all multicellular eukaryotic organisms, there are, generally speaking, two different types of genomes in humans – the nuclear genome and mitochondrial genome. The nuclear genome (or nuclear DNA, nDNA) is the type of DNA that is found within the nucleus of a cell. nDNA can be arranged into 46 (i.e. 23 pairs of) linear molecules known as chromosomes. The only exception for this is the reproductive cells (or gametes), which only contain 23 chromosomes. Each member of a chromosome pair is individually inherited from the father and the mother thus nDNA undergoes reshuffling and recombination when it is passed on from the parent to the offspring. In humans, nDNA is ~3,200,000,000 bp long per haploid genome (i.e. reproductive cells) and twice the length for somatic cells (i.e. any cell other than the reproductive cells).

There are also only two copies of nDNA per somatic cell.
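The figures quoted above can be checked with simple arithmetic. The following is a rough sketch only: the variable names are illustrative, the haploid length is the approximate value given in the text, and the per-chromosome average is an assumption-free division that ignores the fact that human chromosomes differ widely in size.

```python
# Approximate haploid nuclear genome length, as quoted in the text.
HAPLOID_NDNA_BP = 3_200_000_000
HAPLOID_CHROMOSOMES = 23  # chromosomes in a gamete

# A somatic cell carries two copies of the nuclear genome.
somatic_ndna_bp = 2 * HAPLOID_NDNA_BP
somatic_chromosomes = 2 * HAPLOID_CHROMOSOMES

# Crude average only; real chromosomes vary considerably in length.
mean_bp_per_chromosome = HAPLOID_NDNA_BP / HAPLOID_CHROMOSOMES

print(somatic_ndna_bp)      # 6400000000
print(somatic_chromosomes)  # 46
print(round(mean_bp_per_chromosome / 1e6), "million bp on average")
```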

2.1.1 Human mitochondrial DNA (mtDNA)

By contrast, mitochondrial DNA (mtDNA) is the type of DNA that is found within organelles known as mitochondria that are present in the cytoplasm of a cell.

Mitochondria are organelles that convert food into chemical energy in the form of adenosine triphosphate (ATP) to be used by cells. Hence, they are often referred to as the “powerhouses” of cells. It is now widely accepted that mitochondria originated as bacteria that were incorporated into eukaryotic cells ~1.5 billion years ago through a process known as symbiogenesis. This idea, known as the endosymbiotic theory, was controversial when first proposed by Lynn Sagan, later known as Lynn Margulis (Sagan 1967).

The human mtDNA is a circular molecule that is 16,568 bp long (Anderson et al. 1981; Andrews et al. 1999; Jobling et al. 2014) (Figure 2.3). It is double-stranded, with one strand being the heavy strand (H) and the other being the light strand (L). The heavy strand is characterised by a high content of guanine, and the light strand by a high content of cytosine. The mtDNA encodes 37 genes, of which 13 are for proteins, 22 are for transfer RNA (tRNA) and two are for ribosomal RNA (rRNA). The mtDNA also contains a non-coding region of approximately 1,100 bp in length known as the control region (or D-loop). This region can be separated into three segments, known as the hypervariable segments I, II and III (HVS-I, II and III), and is often targeted for mtDNA analysis. The control region is a good candidate for phylogeographic studies, as it is within this region of the mtDNA that most mutations occur.
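As a quick consistency check on the numbers above, the gene counts can be tallied and the share of the molecule taken up by the control region estimated. This is an illustrative sketch only; the variable names are hypothetical and the control region length is the approximate figure quoted in the text.

```python
# Values quoted in the text; the control region length is approximate.
MTDNA_LENGTH_BP = 16_568
CONTROL_REGION_BP = 1_100

# Gene categories encoded by the human mtDNA.
GENES = {"protein": 13, "tRNA": 22, "rRNA": 2}

total_genes = sum(GENES.values())
control_fraction = CONTROL_REGION_BP / MTDNA_LENGTH_BP

print(total_genes)                # 37
print(f"{control_fraction:.1%}")  # 6.6%
```

On these figures the control region makes up only a small part of the molecule, roughly one fifteenth of its length.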

Figure 2.3 Schematic diagram of the human mtDNA. The mtDNA encodes 37 genes, of which 22 are for tRNAs (indicated by single letter abbreviations e.g. F, V, L etc.), two are for rRNAs (12S and 16S) and the remaining 13 are for proteins (e.g. ND1, COX1 etc.). The D-loop (or control region) consists of HVS-I, II and III; nucleotide positions (nps) of the HVS are according to Andrews et al. (1999). Overlapping segments within the mtDNA (e.g. ATP6 and ATP8) are not depicted in this diagram. OL and OH indicate the origins of replication of the light and heavy strands respectively, whereas PL and PH indicate the promoters for the transcription of these two strands. (After Jobling et al., 2014).

Unlike nDNA, mtDNA does not recombine or reshuffle when it is passed on from the parent to the offspring. This is an advantage over nDNA, in which reshuffling and recombination alter the genome in every generation. Genomic variation is therefore easier to infer from mtDNA, as the mtDNA retains its genomic signature over generations and any change in the mitochondrial genome is purely mutational. In recent years, studies (e.g. Ruiz-Pesini et al. 2004; Pereira et
