(2022.9.22)
This is a tutorial of the gene network analysis using our Edge Contribution value (ECv) method. This reproduces results presented in Tanaka, Y. et al (2020). In this paper, we have used a 18 sample gene expression dataset (GSE49644) by Sun, Y. et al (2014), that measured two conditions of control and treated with TGFβ for 3 lung cancer cell lines with triplicate experiments. By using the ΔECv method, the paper succeeded in extracting a subnetwork consisting of differentially regulated edges (DREs) related to epithelial mesenchymal transition (EMT) in lung cancer cells.
To obtain the EMT related subnetwork, you first estimate a basal gene network consisting of the entire genes in the dataset using the SiGN-BN NNSR algorithm. Then, by using this estimated basal network, you calculate an ECv for every estimated edge. Finally, you can extract differentially regulated edges (DREs) by screening out edges whose ΔECvs (difference of ECvs between two conditions you want to compare) are greater than a certain threshold.
To estimate a basal gene network, you use SiGN-BN NNSR software available on the SiGN-BN website. Please, check if it works well with your computer system. We recommend to use a system equipped with at least 32 CPU cores. SiGN-BN NNSR is already installed in SHIROKANE supercomputer system provided by Human Genome Center, The Institute of Medical Science, The University of Tokyo. So if you do not have a computer enough for network estimation, it is an good option to use it.
The ECv calculation is performed by the separate software. To obtain it, please email Prof. Tamada who is a developer of the ECv method. It is free to academic users. If you want to use it for commercial purposes, please contact TLO Kyoto. It has the rights for commercial use of the method.
First, download "GSE49644_annotate_results.txt.gz" from Gene Expression Omnibus website. This is a pre-processed and annotated gene expression data matrix. There is a link to download a gzip-compressed text file of the data set. It is already preprocessed by annotating gene names, log-2 conversion and normalization. Decompress the gzipped file, for example, by executing the following command:
$ gzip -d GSE49644_annotate_results.txt.gz
Next, you need to modify it so that SiGN-BN can read it. Extract column 2 of gene names, and column 4 and everything after that consisting of expression data. Then replace "symbol" at the first line with "@name" to convert it to the EDF format. This can be done by the following command:
$ cut -f2,4- GSE49644_annotated_results.txt |sed -e '1 s/symbol/@name/' > GSE49644.txt
Note: Some gene names are wrongly converted by Excel in this file, e.g. 8-Sep, 5-Mar, etc. Replace them with correct gene names if you want to have the correct gene names ;-)
Estimate a basal gene network with the NNSR algorithm. In SHIROKANE, this can be done by executing the following command (using 64 cores, T=1,000,000 and output threshold 0.1):
If you use your own system with OpenMPI, the command may be as follows:
If the execution succeeds, this generates result1.sgn3 that is a text file containing the estimated network in SGN3 format. Execute the network estimation at least three times for evaluating whether the algorithm was converged or not with the input data. The Smaller number of samples you use, the larger -T value you require. Saves estimated networks as different files, e.g., result2.sgn3 and result3.sgn3. To do so, execute:
$ qsub -o so -e se -pe mpi-fillup 64 ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.0.16.7 -T 1000000 --oth 0.1 -o result3 GSE49644.txt
on SHIROKANE. Or,
$ mpiexec -np 64 signbnnnsr.0.16.7 -T 1000000 --oth 0.1 -o result3 GSE49644.txt
on your system.
To evaluate whether the network estimation was converged, compare structures of every two of the estimated networks by using SiGN-Proc software. To do so, use the --comp filter of SiGN-Proc.
This compares result1.sgn3 and result2.sgn3, and outputs a confusion matrix for result1.sgn3 by regarding result2 as the true network. It may output, for example, the following comparison result:
COML: TP RTP FP FN TN Sn Sp COMP: 125121 436 5690 5646 196824735 0.957 0.953
Sn here represents the ratio of whether edges of result2 were included in result1. An expample of criterion to consider that two networks are similar is 0.95, that is, 95% edges are completely identical in thses two networks. Thus we can consider these are almost the same and the algorithm was converged well. Do this comparison also for result1 and result3, and result2 and result3.
The estimated basal gene network contains only information of edge existence and does not have model parameters of B-splines between parents and children, that are required for ECv calculation. You need to re-fit (calculate) the regression model to estimate model parameters using the estimated structure and the input data. To do this, use SiGN-Proc --score filter, such as:
The output file result1_param.sgn3 has the estimated model parameters.
Next, we calculate ECvs of all the edges for particular samples as below:
The output file result1_ecv.txt is a tab separated text file where each row represents edge and each column corresponds to samples. The first three columns are parent node names, child node names, and parent-child node names concatenated with _ (underscore).
This calculates ECvs and outputs them as the ECv matrix,
Process the ECv matrix file using your favorite scripting language, such as R, or Python, etc. The following is the example using R to extract DREs.
Reading of the ECv matrix.
> dim(X)
[1] 154369 21
Extracting DREs whose ΔECvs are greater than 1.0, which corresponds to the 2-fold change, for all three experiments with three cell lines. The triplicate data were averaged before calculating Δ by rowMeans() function.
> dim(E)
[1] 120 21
This example shows that 120 edges are extracted as DREs.
To save them in a file, perform as follows:
> EE <- data.frame(E[,1:3],dre)
> write.table(EE, "emt.txt", quote=F, row.names=F, sep="\t")
This example added a column, named as dre, that is consisting of values of 1 to distinguish DREs from other basal edges after combining them in the later step.
To extract unique genes contained in the extracted subnetwork, perform as follows:
> length(G)
[1] 150
If you want to save them, use write.table():
This is an optional process. In our paper, we presented the network consisting of not only DREs but also basal edges as dashed line. See Fig. 4 of our paper. This is realized by extracting the induced subnetwork using the node (gene) list genes.txt generated in the previous step. An induced network (graph) is a subgraph consisting of edges connecting to given list of nodes. This can be done by SiGN-Proc as follows:
Import dist1.txt in Cytoscape. You can highlight DREs by additionally importing emt.txt (combining this file in the edge list table) and specify the different appearance for edges whose values of the dre property are 1.