potential dominant group because it showed a significant change at period 3: its composite index at period 3 was higher than at period 2 and period 4.
Figure 3.2.3.2: Composite index graph of a candidate group
[Figure: Composite Index (axis ticks 0 to 5) plotted for Period 1 to Period 5]
BCS (Hons) Computer Science
Faculty of Information and Communication Technology (Perak Campus), UTAR

3.2.4 Imputation
In this module, we integrated several algorithms to recover missing values.
Two abundance matrices were required as input: a case abundance matrix and a control abundance matrix, in exactly the same input format as module 3 (Inferring Biomarker). We adopted the steps of PEP, which was introduced by Goh et al. (2011). First, a network was constructed on the control abundance matrix. To construct this biological network, we integrated module 1 (CBDN) into this module.
Next, differentially expressed bacteria were detected and treated as seeds. Each seed was taken to indicate missing values in the samples. To find the differentially expressed bacteria, we reused the first step of module 3 (DNB): we looked for bacteria that changed significantly between the case and control groups by using Student's t-test at a significance level of 0.05.
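The seed detection step can be illustrated with a minimal sketch of the test statistic. It assumes the equal-variance (pooled) form of Student's t-test, which the text does not specify, and the function names are hypothetical. Turning the statistic into a p-value for comparison with 0.05 requires the t distribution with n1 + n2 - 2 degrees of freedom, which is not shown here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of a non-empty sample.
double sampleMean(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Unbiased sample variance (divides by n - 1); needs at least two values.
double sampleVariance(const std::vector<double>& x) {
    const double m = sampleMean(x);
    double ss = 0.0;
    for (double v : x) ss += (v - m) * (v - m);
    return ss / static_cast<double>(x.size() - 1);
}

// Pooled-variance two-sample t statistic for one bacterium.
// Assumes the pooled variance is non-zero.
double pooledTStatistic(const std::vector<double>& caseValues,
                        const std::vector<double>& controlValues) {
    const double n1 = static_cast<double>(caseValues.size());
    const double n2 = static_cast<double>(controlValues.size());
    const double pooled = ((n1 - 1.0) * sampleVariance(caseValues) +
                           (n2 - 1.0) * sampleVariance(controlValues)) /
                          (n1 + n2 - 2.0);
    return (sampleMean(caseValues) - sampleMean(controlValues)) /
           std::sqrt(pooled * (1.0 / n1 + 1.0 / n2));
}
```

A bacterium would be kept as a seed when the p-value derived from this statistic falls below the chosen significance level.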
Once the seeds were found, each seed and its neighbouring bacteria in the network were clustered together. After expansion to first-degree neighbours, the resulting clusters might overlap with one another. Therefore, overlapping clusters were identified by using the Clique Percolation Method (CPM), which was introduced by Palla et al. (2005).
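The expansion of a seed to its first-degree neighbours can be sketched as below. This is only an illustration: the name expandSeed is hypothetical, the network is assumed to be stored as a 0/1 adjacency matrix, and a link in either direction is assumed to count as a neighbour. The overlap detection by CPM is not shown.

```cpp
#include <cstddef>
#include <vector>

using AdjacencyMatrix = std::vector<std::vector<int>>;

// Returns the seed followed by every node linked to it in either direction.
std::vector<std::size_t> expandSeed(const AdjacencyMatrix& w, std::size_t seed) {
    std::vector<std::size_t> cluster{seed};
    for (std::size_t k = 0; k < w.size(); ++k) {
        if (k != seed && (w[seed][k] != 0 || w[k][seed] != 0)) {
            cluster.push_back(k);
        }
    }
    return cluster;
}
```

Each seed yields one such cluster, and clusters that share members are the candidates for overlap analysis.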
After obtaining the overlapping clusters, or communities, we calculated each seed's missing values based on its community. Before calculating the imputation value, we turned the constructed network into an $N \times N$ adjacency matrix $W$, where $w_{ki} = 1$ if there was a link between bacterium $k$ and bacterium $i$, and $w_{ki} = 0$ otherwise.

$$X_{ij} = \sum_{k} X_{kj} w_{ki} + c_{ij}$$

$X_{ij}$ was the expression value of seed $i$ in sample $j$, the sum ran over the bacteria $k$ in the community of seed $i$, and $w_{ki}$ was the interaction between bacterium $k$ and bacterium $i$. In the equation above, $c_{ij}$ was the constant value of bacterium $i$ in sample $j$. However, we had no prior information about this constant value. Therefore, we computed the constant value for every sample except those with a missing value, and then took the average, $\bar{c}$.

$$X_{ij} = \sum_{k} X_{kj} w_{ki} + \bar{c}$$
Finally, the imputation value could be calculated for the samples in which bacterium i had a missing value.
Figure 3.2.4.1: Imputation process
To explain this module, we reused the biological network illustrated in 3.2.1 Constructing Directed Network Module by Using CBDN.
The figure below shows the network constructed on the control abundance matrix.
Figure 3.2.4.1: Network constructed on control group
Suppose we found that bacterium C had missing values in the case abundance matrix. We then clustered it with its first-degree neighbours, which were bacteria B, D and E. As there was only one cluster, there was no overlapping cluster.
To find the average constant value for the samples with a missing value on bacterium C, we summed the constant values of all samples without a missing value and divided the sum by the number of those samples.

$$X_{Cj} = X_{Bj} w_{BC} + X_{Dj} w_{DC} + X_{Ej} w_{EC} + \bar{c}$$

By using the equation above, where $w_{BC}$, $w_{DC}$ and $w_{EC}$ were the adjacency entries linking bacteria B, D and E to bacterium C, the imputation value for bacterium C in sample $j$ could be calculated.
3.3 Implementation Issues and Challenges
First of all, we wanted to parallelize the analysis part of each module. That is, each module should be able to distribute its tasks across all processors. Our initial plan was to use the Parallel Patterns Library (PPL), provided by Microsoft, to parallelize the modules. As our server ran Linux, we had to use the GNU Compiler Collection (GCC) to compile our modules. Regrettably, we found that PPL was not available with GCC. Therefore, we had to switch to another library that provided similar features, Threading Building Blocks (TBB) by Intel. Our server used Intel processors, so compatibility with TBB was not a concern.
3.4 Timeline
Figure 3.4.1: Project timeline (Text)
Figure 3.4.2: Project timeline (Diagram)
4.1 Constructing Directed Network Module by Using CBDN
For this module, we used the human gut microbiome dataset from Li et al. (2014) to construct the network. The dataset was collected from Chinese, Danish, Spanish and American individuals. The original aim of the paper was to show the differences in gut microbiota between countries. However, we used the dataset to construct the regulatory network of all microbes. The dataset contained 337 microbes and 1266 samples. The module output 2 files: the interactions between microbes and the ranking of microbes. To visualize the network, we loaded the interaction results into Cytoscape; the figure below shows the resulting network diagram.
Figure 4.1.1: Human Gut Microbiome Network
In this network, each node represented one kind of microbe and each edge represented the interaction between two microbes. The edges were directed, which meant we could tell which microbe regulated which. In Figure 4.1.1, three separate networks were found: two of them were small networks of size 5 and 6, whereas the remaining microbes formed one large network. In the result, the microbe with the highest total influence value was Methylobacillus. This bacterium regulated Leptospira, Magnetococcus and Neisseria in the community. The table below lists the 10 bacteria with the highest total influence values.
Bacteria Name Total Influence Value
Methylobacillus 0.160151
Albidiferax 0.148100
Rubrivivax 0.134855
Nitrospira 0.130282
Listeria 0.123234
Desulfurivibrio 0.115652
Pediococcus 0.111823
Acetobacter 0.109131
Treponema 0.104006
Pelotomaculum 0.099158
Table 4.1.1: Top 10 important regulators
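The ranking behind a table of this kind can be sketched as below. This is an illustration under assumptions, not the module's code: the struct Regulator and the function topRegulators are hypothetical names, and the total influence values are assumed to have been computed already.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Regulator {
    std::string name;
    double totalInfluence;
};

// Returns the `count` regulators with the highest total influence, best first.
std::vector<Regulator> topRegulators(std::vector<Regulator> all, std::size_t count) {
    count = std::min(count, all.size());
    const auto middle = all.begin() + static_cast<std::ptrdiff_t>(count);
    // Only the first `count` positions need to be in sorted order.
    std::partial_sort(all.begin(), middle, all.end(),
                      [](const Regulator& a, const Regulator& b) {
                          return a.totalInfluence > b.totalInfluence;
                      });
    all.erase(middle, all.end());
    return all;
}
```

Calling topRegulators with a count of 10 on all 337 microbes would give a list in the same order as the table above.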
4.2 Biological Network Inference by Using TIGRESS
To test the module, we used the simulation data provided with TIGRESS. The dataset contained expression data of 20 genes and a list of transcription factors. Genes 1, 2, 3 and 4 were the transcription factors. To avoid outputting every interaction, we selected the 15 most important interactions to construct the graph below. What made TIGRESS different from CBDN was the directionality of the edges. In TIGRESS, we did not know the directionality of the edges: an edge indicated only that two genes interacted, but it did not tell us which gene was the parent node and which was the child node.
Figure 4.2.1 Biological network constructed by TIGRESS
We extended our testing to the in silico data from the DREAM5 challenge, as TIGRESS used it as a benchmark dataset. The dataset contained 1643 genes, and the data were collected from 804 experiments. There were 195 transcription factors, which were gene 1 to gene 195. Among 320,190 interactions, we selected the top 1000 and visualized them in the network below. Interestingly, there were many sub-networks in the figure, and some of them were isolated from each other, because we had selected only 0.3% of the edges. In addition, from the network below, we could determine which genes were important regulators.
Figure 4.2.2 in silico data network constructed by TIGRESS
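The counts quoted for this dataset can be cross-checked with a short calculation. It assumes, which the text does not state, that a candidate interaction pairs each of the 195 transcription factors with every other gene:

```latex
\begin{align*}
195 \times (1643 - 1) &= 320{,}190 \\
\frac{1000}{320{,}190} &\approx 0.0031 = 0.31\%
\end{align*}
```

Under that assumption, the candidate count matches the 320,190 interactions reported, and the top 1000 correspond to roughly 0.3% of the edges.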
4.3 Inferring Biomarker Module
The dataset we used was genomics data on lung injury caused by inhalation exposure to carbonyl chloride (phosgene), which was used by Liu et al. (2012). The dataset could be separated into two sets, a case group and a control group. It was collected by exposing mice to air (control) and phosgene (case). The dataset had 9 sampling points, 0hr, 0.5hr, 1hr, 4hr, 8hr, 12hr, 24hr, 48hr and 72hr, and each of them had 6 samples. The dataset contained 12871 genes. With this module, we wanted to detect early warning signs of acute lung injury.
After applying the FDR and 2-fold change filters, we obtained the number of differentially expressed genes at each sampling point across the timeline, [0, 29, 72, 195, 269, 173, 188, 176]. To find dominant groups, we clustered the differentially expressed genes into at most 40 clusters at each sampling point.
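The two filters named above can be sketched as follows. The text does not say which FDR procedure was used, so this sketch assumes the Benjamini-Hochberg procedure; the function names are hypothetical, and the fold change is assumed to be computed on positive, untransformed group means.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Benjamini-Hochberg: returns true for each p-value kept at false discovery rate q.
std::vector<bool> benjaminiHochberg(const std::vector<double>& pValues, double q) {
    const std::size_t m = pValues.size();
    std::vector<std::size_t> order(m);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return pValues[a] < pValues[b]; });

    // Largest 1-based rank r with p_(r) <= r * q / m; every smaller rank passes too.
    std::size_t passing = 0;
    for (std::size_t r = 1; r <= m; ++r) {
        if (pValues[order[r - 1]] <= static_cast<double>(r) * q / static_cast<double>(m)) {
            passing = r;
        }
    }
    std::vector<bool> keep(m, false);
    for (std::size_t r = 0; r < passing; ++r) keep[order[r]] = true;
    return keep;
}

// True when the case mean is at least twice, or at most half, the control mean.
bool passesTwoFoldChange(double caseMean, double controlMean) {
    const double ratio = caseMean / controlMean;
    return ratio >= 2.0 || ratio <= 0.5;
}
```

A gene would count as differentially expressed at a sampling point only when it passes both filters.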
After clustering, the indices of each cluster at each sampling point were calculated and checked against the criteria. This gave the number of potential dominant groups at each sampling point across the timeline, [0, 0, 0, 0, 1, 1, 1, 1, 0].
The first potential dominant group appeared in period 5 (8hr) and contained only 2 genes, Ensmusg00000058905 and Tlr9. The figure below shows the composite index of the first potential dominant group that fulfilled all criteria. In the figure, the composite index increased sharply after 4hr and peaked at 8hr. This suggested that the pre-disease state started at 4hr.
Figure 4.3.1 Composite Index of the first potential dominant group
[Figure: composite index plotted against sampling time in hours: 0, 0.5, 1, 4, 8, 12, 24, 48, 72]
Table 4.3.1 Composite Index of all potential dominant groups
4.4 Imputation
To check whether the module worked correctly, we used the dataset from 4.3 Inferring Biomarker Module, the data on lung injury caused by inhalation exposure to carbonyl chloride. However, due to the large volume of data, the first step, constructing the biological network, took very long to complete.
Therefore, we selected the first 50 genes to test the module and set 6 expression values of gene 15 to zero, which were then treated as missing values.
Figure 4.4.1 Biological network of the first 50 genes
First, a network was constructed from the control abundance data of the first 50 genes (Figure 4.4.1). Among these 50 genes, we detected 2 seeds, gene 15 and gene 33.
Figure 4.4.2 First degree neighbours of gene 15
Figure 4.4.3 First degree neighbours of gene 33
In fact, gene 33 had no missing values, but it showed a significant change between the case and control groups. Hence, it was not a target for imputation.
After identifying gene 15 as a seed, we expanded it to its first-degree neighbour genes to form a community, as shown in Figure 4.4.2. The imputation values are shown in the table below. The initial average expression value of gene 15 was 1889.95, whereas the average after imputation was 1889.81, a difference of 0.14.
Table 4.4.1 Imputation values before and after

Sample              Case (Initial)   Case (Set as missing value)   Case (After)
phosgene_0.5h_4A    1845.85          0                             1949.37
phosgene_0.5h_5B    1962.05          0                             1970.41
phosgene_0.5h_5A    2164.8           0                             1919.31
phosgene_0.5h_6B    1734.85          0                             1888.44
phosgene_0.5h_6A    1858.45          0                             1830.64
phosgene_1h_10B     1958.25          0                             1958.41
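The reported averages can be related to the six values in the table. As a rough check, and assuming (the text does not say so) that the averages of gene 15 were taken over all 54 case samples, that is 9 sampling points with 6 samples each, the change in the sum of the six values spreads out as follows:

```latex
\begin{align*}
\sum \text{initial} &= 1845.85 + 1962.05 + 2164.80 + 1734.85 + 1858.45 + 1958.25 = 11524.25 \\
\sum \text{imputed} &= 1949.37 + 1970.41 + 1919.31 + 1888.44 + 1830.64 + 1958.41 = 11516.58 \\
\frac{11524.25 - 11516.58}{54} &= \frac{7.67}{54} \approx 0.14
\end{align*}
```

Under that assumption, the result agrees with the reported difference of 0.14 between the two averages.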
Our ultimate aim was to develop a platform that provided a series of analysis services to researchers, an all-in-one venue for such analyses.
The platform was intended to allow researchers to perform all the standard analyses on the website in an easy and convenient fashion.
This project involved the development of four modules of the platform: constructing a directed network on metagenomics data by two different approaches, inferring biomarkers, and imputation. We parallelized all of the modules so that they would work effectively and efficiently. We used the Threading Building Blocks (TBB) library to enable parallelism in the modules. The program distributed the tasks to the processors and fully utilized them.
For the constructing directed network module, we used the CBDN method, which was introduced by Zhang, Ng & Li (2015). First, the influence value of each gene on the others was calculated, and then transitive relationships were removed. After that, the important regulators were identified by ranking the genes by total influence value.
TIGRESS, which was introduced by Haury et al. (2012), was another method that could be used to construct a biological network. This method used a feature selection approach to infer the relationship of each gene pair.
In addition, DNB was used to infer biomarkers in this project. The method was proposed by Liu et al. (2012). First, differentially expressed genes were selected and clustered at each sampling point, and then each cluster was checked against the criteria at each sampling point. When a cluster fulfilled the criteria, it was considered a biomarker.
The last module was imputation, which was used to recover missing values in the expression data. Several algorithms were integrated to develop this module.
To speed up the computation of these modules, we implemented parallelism in our software. We used the TBB library as it provided sufficient functions for the software. The work was distributed to all processors to fully utilize the available resources.
At the end of the project, we had completed 4 modules. However, the modules were not fully tested, as datasets and processing power were limited.
Therefore, we had to use genomics data for some test cases.
Andrea Califano 2016, Modeling Cell Regulatory Networks. Available from: <http://califano.c2b2.columbia.edu/modeling-cell-regulatory-networks>. [19 November 2016].
Efron, B, Hastie, T, Johnstone, I & Tibshirani, R 2004, ‘Least Angle Regression’, The Annals of Statistics, vol. 32, no. 2, pp. 407-499. [16 November 2016].
EMBL-EBI n.d., Project: BGI Type 2 Diabetes study. Available from: <https://www.ebi.ac.uk/metagenomics/projects/SRP008047>. [31 March 2017].
Genetic Science Learning Center 2014, What are Microbes?. Available from: <http://learn.genetics.utah.edu/content/microbiome/intro/>. [2 November 2016].
Goh, W, Lee, Y, Zubaidah, R, Jin, J, Dong, D, Lin, Q, Chung, M & Wong, L 2011, ‘Network-Based Pipeline for Analyzing MS Data: An Application toward Liver Cancer’, Journal of Proteome Research, vol. 10, no. 5, pp. 2261-2272. [20 March 2017].
Goh, W, Sergot, M, Sng, J & Wong, L 2013, ‘Comparative Network-Based Recovery Analysis and Proteomic Profiling of Neurological Changes in Valproic Acid-Treated Mice’, Journal of Proteome Research, vol. 12, no. 5, pp. 2116-2127. [1 March 2017].
Haury, A, Mordelet, F, Vera-Licona, P & Vert, J 2012, ‘TIGRESS: Trustful Inference of Gene REgulation using Stability Selection’, BMC Systems Biology, vol. 6, no. 1. [20 October 2016].
Li, J, Jia, H, Cai, X, Zhong, H, Feng, Q, Sunagawa, S, Arumugam, M, Kultima, J, Prifti, E, Nielsen, T, Juncker, A, Manichanh, C, Chen, B, Zhang, W, Levenez, F, Wang, J, Xu, X, Xiao, L, Liang, S, Zhang, D, Zhang, Z, Chen, W, Zhao, H, Al-Aama, J, Edris, S, Yang, H, Wang, J, Hansen, T, Nielsen, H, Brunak, S, Kristiansen, K, Guarner, F, Pedersen, O, Doré, J, Ehrlich, S, Pons, N, Le Chatelier, E, Batto, J, Kennedy, S, Haimet, F, Winogradski, Y, Pelletier, E, LePaslier, D, Artiguenave, F, Bruls, T, Weissenbach, J, Turner, K, Parkhill, J, Antolin, M, Casellas, F, Borruel, N, Varela, E, Torrejon, A, Denariaz, G, Derrien, M, van Hylckama Vlieg, J, Viega, P, Oozeer, R, Knoll, J, Rescigno, M, Brechot, C, M'Rini, C, Mérieux, A, Yamada, T, Tims, S, Zoetendal, E, Kleerebezem, M, de Vos, W, Cultrone, A, Leclerc, M, Juste, C, Guedon, E, Delorme, C, Layec, S, Khaci, G, van de Guchte, M, Vandemeulebrouck, G, Jamet, A, Dervyn, R, Sanchez, N, Blottière, H, Maguin, E, Renault, P, Tap, J, Mende, D, Bork, P & Wang, J 2014, ‘An integrated catalog of reference genes in the human gut microbiome’, Nature Biotechnology, vol. 32, no. 8, pp. 834-841. [8 March 2017].
Liu, R, Li, M, Liu, Z, Wu, J, Chen, L & Aihara, K 2012, ‘Identifying critical transitions and their leading biomolecular network in complex diseases’, Scientific Reports, vol. 2.
Margolin, A, Nemenman, I, Basso, K, Wiggins, C, Stolovitzky, G, Favera, R & Califano, A 2006, ‘ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context’, BMC Bioinformatics, vol. 7, no. 1, p. S7. [20 October 2016].
Markowetz, F & Spang, R 2007, ‘Inferring cellular networks – a review’, BMC Bioinformatics, vol. 8, no. 6, p. S5. [15 November 2016].
Mayeux, R 2004, ‘Biomarkers: Potential Uses and Limitations’, NeuroRx, vol. 1, no. 2, pp. 182-188. [3 November 2016].
Members.cbio.mines-paristech.fr n.d., The TIGRESS page. Available from: <http://members.cbio.mines-paristech.fr/~ahaury/svn/dream5/html/index.html>. [2 February 2017].
Meta.genomics.cn n.d., Integrated reference catalog of the human gut microbiome. Available from: <http://meta.genomics.cn/meta/home>. [1 April 2017].
Morrison, JL, Breitling, R, Higham, DJ & Gilbert, DR 2005, ‘GeneRank: Using search engine technology for the analysis of microarray experiments’, BMC Bioinformatics, vol. 6, no. 1, p. 233. [25 February 2017].
Palla, G, Derényi, I, Farkas, I & Vicsek, T 2005, ‘Uncovering the overlapping community structure of complex networks in nature and society’, Nature, vol. 435, no. 7043, pp. 814-818. [24 February 2017].
Parallel Patterns Library n.d. Available from: <https://msdn.microsoft.com/en-us/library/dd492418.aspx>. [23 October 2016].
Taylor, RC, Acquaah-Mensah, G, Singhal, M, Malhotra, D & Biswal, S 2008, ‘Network Inference Algorithms Elucidate Nrf2 Regulation of Mouse Lung Oxidative Stress’, PLoS Comput Biol, vol. 4, no. 8. [19 November 2016].
The Common Fund n.d., Human Microbiome Project. Available from: <https://commonfund.nih.gov/hmp/overview>. [3 November 2016].
Threading Building Blocks n.d. Available from: <https://www.threadingbuildingblocks.org>. [25 October 2016].
Strimbu, K & Tavel, J 2010, ‘What are biomarkers?’, Current Opinion in HIV and AIDS, vol. 5, no. 6, pp. 463-466. [3 November 2016].
Wiki.c2b2.columbia.edu 2015, ARACNe - Workbench. Available from: <http://wiki.c2b2.columbia.edu/workbench/index.php/ARACNe>. [19 November 2016].
Zhang, L, Ng, Y & Li, S 2015, ‘Reconstructing directed gene regulatory network by only gene expression data’, 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 163-170.
Zyga, L 2014, ‘Diseases, symptoms, genes, and proteins linked together in giant network’, Medical Xpress 18 July. Available from: <http://medicalxpress.com/news/2014-07-diseases-symptoms-genes-proteins-linked.html>. [3 November 2016].
Appendix A Online Platform Introduction
Our main objective was to collaborate with City University of Hong Kong to build a comprehensive online platform. However, developing the online platform itself was not within the scope of this project; we were in charge of four modules only.
The online platform is available at http://dl380a.cs.cityu.edu.hk/.
Figure A.1 Main page of the online platform
The online platform was a cloud-based system that provided free services for gene regulation analysis. It had its own gene database, which had been processed from raw data, and it offered over 30 modules for users to analyse datasets. In addition, users could visualize their results after analysis in over 70 ways, such as scatter plots, histograms and pie charts.
Figure A.2 Visualizations provided by the online platform
Figure A.3 Modules provided by the online platform
Figure A.4 Gene DB of the online platform
Figure A.4 Delta team of the online platform
Development of this online platform was led by Dr. Shuaicheng Li with a group of enthusiastic students. In this project, Ms Chen JiaXing was the mentor who provided guidance and information about the modules to me. At the end of the project, we delivered 4 modules, all of which would be integrated with other modules to form a pipeline and then published under the pipeline category of the online platform.
B-1 Constructing Directed Network Module by Using CBDN (Source Code)
B-1.1 main.cpp
//get input file from user
//get input file from user