SiGN-BN MANUAL

SiGN-BN HC+Bootstrap
SiGN-BN NNSR
SiGN-BN Para-OS
SiGN-Proc tool

SiGN-BN HC+Bootstrap

Name

signbn-hcbs.sh -- SiGN-BN: Shell script for executing SiGN-BN HC+Bootstrap on HGC SHIROKANE.

signbn.X.Y.Z -- SiGN-BN HC+BS executable binary.

Synopsis

Parallel execution via Grid Engine on HGC supercomputer system

qsub -t 1-N [ Grid Engine options ] ~tamada/sign/signbn-hcbs.sh [ options ] input_file

Note: N corresponds to the number of iterations of the bootstrap method.

Compiling the bootstrapped networks into a single gene network

qsub [ Grid Engine options ] ~tamada/sign/signbn-hcbs.sh --bin signproc --bs prefix=file_prefix[,other options] --output type=file_type,file=output_file

Direct execution of the executable binary for single network estimation

signbn.X.Y.Z [ --total-mem mega_byte ] [ options ] input_file

Note: X.Y.Z represents the SiGN-BN binary version.

Direct execution of the executable binary for the bootstrap method

signbn.X.Y.Z -N N [ --total-mem mega_byte ] [ options ] input_file

Note: N corresponds to the number of iterations of the bootstrap method in a single execution.

Parallel execution with MPI

OMP_NUM_THREADS=n { mpiexec | mpirun } [ MPI options ] signbn.X.Y.Z [ options ] input_file

Note: This is only for the binary compiled for MPI execution.
Note: Use the MPI launcher and the job script that are available and permitted on your system.

Description

SiGN-BN estimates a gene network from gene expression data. SiGN-BN was originally developed for SHIROKANE, a supercomputer system installed at the Human Genome Center, Institute of Medical Science, The University of Tokyo. The system runs Grid Engine, a job dispatch system for PC cluster systems. Therefore, SiGN-BN is also designed for systems running Grid Engine. The executable binaries should also run on other Linux systems with other job dispatch systems, although this is not guaranteed.

An EDF file can be specified as an input file of gene expression data. See File Formats for details of EDF.

signbn-hcbs.sh is a shell script that runs SiGN-BN (HC+Bootstrap) with the Grid Engine available on SHIROKANE. The program is executed as an array job on Grid Engine, so the number of array job tasks specified by the -t option of Grid Engine corresponds to the number of iterations of the bootstrap method. Each bootstrap execution estimates a gene network from a resampled data set and outputs the estimated gene network as a single file. The array job tasks can run in parallel via Grid Engine. We recommend at least 1000 iterations for the bootstrap method; that is, specify 1000 or more for N. To obtain a final gene network from these bootstrapped network files, use the SiGN-Proc tool, which can also be executed through the signbn-hcbs.sh script.
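The two-step workflow described above (an array job that writes one network per bootstrap task, followed by a SiGN-Proc run that compiles them) might be scripted as in the following dry-run sketch. The input file name, output prefix, and output file name are hypothetical, file_type is kept as the placeholder from the synopsis, and the commands are only printed, not submitted.

```shell
#!/bin/sh
# Dry-run sketch: prints the two qsub commands instead of submitting them.
# All file names below are hypothetical examples.
N=1000                        # number of bootstrap iterations (array job tasks)
INPUT=expression.edf.txt      # hypothetical EDF input file
PREFIX=result/bs              # hypothetical prefix for the bootstrapped networks
OUT_TYPE=file_type            # placeholder: see the SiGN-Proc manual for valid types
OUT_FILE=result/network.out   # hypothetical name of the final network file
SCRIPT='~tamada/sign/signbn-hcbs.sh'

# Step 1: one array job task per bootstrap iteration.
STEP1="qsub -t 1-$N $SCRIPT --bs -o $PREFIX $INPUT"

# Step 2: compile the bootstrapped networks into a single gene network.
STEP2="qsub $SCRIPT --bin signproc --bs prefix=$PREFIX --output type=$OUT_TYPE,file=$OUT_FILE"

echo "$STEP1"
echo "$STEP2"
```

Step 2 should be submitted only after every task of step 1 has finished. To submit for real, type the printed commands at the shell prompt so that ~tamada is expanded by the shell.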

For multi-node parallel execution, SiGN-BN supports MPI on several systems. Multi-threaded execution is also supported. To specify the number of threads per node, set the OMP_NUM_THREADS environment variable to the appropriate number. When running with MPI, the estimated networks are compiled automatically.
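As an illustration of combining MPI processes with threads, the following dry-run sketch assembles a launch line in the form given in the synopsis. The process and thread counts and the file names are example values, and "-n" is the process-count option of mpiexec in common MPI implementations; check the launcher on your system. The command is printed, not executed.

```shell
#!/bin/sh
# Dry-run sketch: prints an MPI launch line instead of running it.
THREADS=4                     # threads per MPI process (example value)
PROCS=8                       # number of MPI processes (example value)
ITER=1000                     # bootstrap iterations, passed with -N
BINARY=signbn.X.Y.Z           # replace X.Y.Z with the installed version
INPUT=expression.edf.txt      # hypothetical input file
OUTPUT=network.out            # hypothetical output file

CMD="OMP_NUM_THREADS=$THREADS mpiexec -n $PROCS $BINARY -N $ITER -o $OUTPUT $INPUT"
echo "$CMD"
```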

Direct execution of the binary signbn.X.Y.Z estimates a single gene network without bootstrap by default. It can also perform the bootstrap method with the -N option, but it takes a very long time to complete with a large number of iterations.

We strongly recommend always specifying the --total-mem option explicitly. The running time of the BNRC score function depends heavily on the memory size specified by this option. If you estimate a network from a large input data set, the program consumes a huge amount of memory and is likely to fail to allocate enough of it. In that case, specify the memory size available on your system with the --total-mem option. See below for details.
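One possible way to choose a value for --total-mem on a Linux machine is sketched below. It assumes that the option takes megabytes, as the mega_byte placeholder in the synopsis suggests, and the helper names avail_mb and with_headroom as well as the 80% margin are inventions of this example, not part of SiGN-BN.

```shell
#!/bin/sh
# Sketch: derive a --total-mem value from a Linux meminfo-style file.
# Assumes the option takes megabytes, as the synopsis placeholder suggests.

# Print the MemAvailable value in megabytes.
# $1: path to a meminfo-style file (defaults to /proc/meminfo).
avail_mb() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Print 80% of the given number of megabytes. The 80% margin is an
# arbitrary safety headroom chosen for this example, not a SiGN-BN rule.
with_headroom() {
    echo $(( $1 * 8 / 10 ))
}

RAW=""
if [ -r /proc/meminfo ]; then
    RAW=$(avail_mb)
fi
if [ -n "$RAW" ]; then
    # Printed only; replace X.Y.Z and input_file before running.
    echo "signbn.X.Y.Z --total-mem $(with_headroom "$RAW") input_file"
fi
```

On a shared cluster, the memory a job may use is usually set by the scheduler rather than by the free memory of the node, so the limit requested from Grid Engine is the better starting point there.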

SiGN-Proc is a program which can be used to compile output files generated by executing signbn-hcbs.sh into a single gene network file. It can generate a resultant network in CSML, plain text format, and so on. See SiGN-Proc Manual for the details.

Grid Engine options

In signbn-hcbs.sh for Shirokane 3, "-e se -o so -cwd" (and some other minor options) is set for the qsub command. Therefore, the standard error and the standard output are written to the files se and so, respectively, in the directory from which the user submits the Grid Engine job.

In Shirokane 3, the "-e" and "-o" options are not set in the script by default. Thus, Grid Engine generates many files for storing the messages written to the standard output and the standard error. The generated file names are signbn-hcbs.sh.X.<job_id>.<task_id>, where X is o for the standard output or e for the standard error. You can avoid this by changing these file names with the "-e" and "-o" options of the qsub command.

By specifying Grid Engine options before the job script, these default Grid Engine options can be overridden.
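For example, the message files could be collected in one directory by placing "-e" and "-o" before the job script, as in the dry-run sketch below. The directory name logs and the file names are hypothetical. Grid Engine normally places its default-named files inside a path that is a directory, but confirm this in the qsub manual page of your system.

```shell
#!/bin/sh
# Dry-run sketch: prints a qsub command that sends the per-task message
# files to one directory. The directory must exist before submission.
N=1000
LOGDIR=logs                   # hypothetical directory for Grid Engine messages
INPUT=expression.edf.txt      # hypothetical input file
PREFIX=result/bs              # hypothetical output prefix

# Grid Engine options come before the job script; script options come after.
CMD="qsub -t 1-$N -e $LOGDIR/ -o $LOGDIR/ ~tamada/sign/signbn-hcbs.sh --bs -o $PREFIX $INPUT"
echo "$CMD"
```

Note that the first "-o" belongs to qsub and the second one, after the script name, is the output prefix option of signbn-hcbs.sh.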

Special options available only for signbn-hcbs.sh

--bs

Perform the bootstrap method. If the dynamic model is specified (-y), the pseudo bootstrap method is performed. The number of bootstrap iterations is determined by the number of tasks of the array job submitted to Grid Engine. If this option is specified, the -B option is implied with the appropriate ID numbers.

-o file_prefix

The prefix of the output file names. It can contain directories. The actual output file will have the 6-digit ID number as its suffix. The number corresponds to a Grid Engine array job task ID. By default, "--log-mode 4 --log-file file_prefix.log" is implied for the first task.
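Before compiling the results, it can be useful to check that every array job task produced its output file. The sketch below assumes that the files are named prefix.NNNNNN (the prefix, a dot, and the 6-digit task ID), by analogy with the file.000000 pattern described for --local-output; verify this against the files your jobs actually produce. The function name missing_tasks is hypothetical.

```shell
#!/bin/sh
# Sketch: list bootstrap tasks whose output file is missing.
# Assumes outputs are named <prefix>.<6-digit task ID>, e.g. bs.000001.

# $1: output prefix, $2: number of array job tasks.
# Prints the ID of every task without an output file, one per line.
missing_tasks() {
    i=1
    while [ "$i" -le "$2" ]; do
        f=$(printf '%s.%06d' "$1" "$i")
        [ -e "$f" ] || echo "$i"
        i=$((i + 1))
    done
}

# Example call with a hypothetical prefix and 1000 tasks:
#   missing_tasks result/bs 1000
```

A missing task can then be resubmitted by passing only its ID range to the -t option, for example "-t 17-17".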

--rel release_number

Use the specified release number of SiGN-BN instead of the latest stable release.

--dir path

The path where the SiGN programs are installed. A trailing slash ("/") is not required. Use this option to run a beta or older version installed in another directory.

Options for SiGN-BN HC+Bootstrap

( [MPI] indicates the option for parallel execution with MPI. )
( [DE] indicates the option for direct execution of the executable binary. )

--total-mem n

Memory size in megabytes that the algorithm expects to use.

-y

Dynamic model. If the input file is not in the EDF file format, you need to specify the number of replicates for each time point with the --replicates option.

-m n

The maximum number of parents that each gene can have. By default n = 10. Specify a small value to avoid overfitting if you have relatively few data samples.

-p n

The number of parent candidates of the greedy algorithm.

-o file

Output file name. The file name can contain directories (relative path).

-O output_type

Output file format. See File Format for details. Do not specify this if you perform the bootstrap method.

--replicates v₁,v₂,...

The list of the numbers of replicates of time points. This is required if the input data file is not an EDF file.

--blocks n

The number of consecutive time point blocks of the pseudo bootstrap method. The final number of samples used becomes (# of time points) x n.

-B n

Use this only with direct execution of the binary. Performs bootstrap resampling before network estimation. n is the ID used to initialize the random number generator.

-s score

The name of the score function. By default, BNRC is used. The following scores are available:
"BNRC": Bayesian non-parametric regression based score.
"BNRCMV": BNRC that accepts missing (NaN) values.

-S key=value,...

The score-specific options in key=value format. See below for the available options.

--algo algorithm

The name of the structure learning algorithm. By default, "hc2" is used, that is the greedy hill-climbing algorithm implementation version 2.0. You can specify "tshc" for the Two Step HC algorithm. See Two Step HC for more details.

--algo-args key=value,...
-A key=value,...

The algorithm specific options.

-r seed

The integer random seed value.

--select-nodes file

The network estimation is performed for genes in file. The specified file file is a line-by-line tab separated text file. By default, the first column is read and used as gene names to be selected for the network estimation. To change the column position to read, use the --select-nodes-col option.

--select-nodes-col n

The 1-based (1-origin) column position for the --select-nodes option.
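A gene list for --select-nodes could be prepared and checked as in the following sketch. The gene names and the two-column layout are made up for illustration. Because the column position is 1-based, it matches the numbering used by "cut -f", which makes cut a convenient way to preview what would be read.

```shell
#!/bin/sh
# Sketch: prepare a gene list for --select-nodes and preview the column
# that --select-nodes-col would read. Gene names are made up.
LIST=$(mktemp)

# Tab-separated file: an ID in column 1, the gene name in column 2.
printf '1\tTP53\n2\tBRCA1\n3\tMYC\n' > "$LIST"

# Column positions are 1-based, as with cut -f.
COL=2
SELECTED=$(cut -f "$COL" "$LIST")
echo "$SELECTED"

# Printed only; replace X.Y.Z and input_file before running.
echo "signbn.X.Y.Z --select-nodes $LIST --select-nodes-col $COL input_file"
```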

--skel file

Skeleton file that specifies the possible edges to be estimated. Available for the "hc2" and "tshc" algorithms. Use "--skel-type" for input file type.

--skel-type { parents | parents_targets | CHLIST | no_parents | file_format }

Skeleton file type.
"parents": The file needs to contain a gene name list where each line represents a gene. Only genes in the file can be parents of a gene.
"parents_targets": Similar to "parents" above, it restricts the possible parents. With this type, however, the restriction is applied only to the genes given by "target=file2" in the "--skel-args" option. If "inv" is specified, the genes not listed in file2 are the targets of the restriction.
"CHLIST": The file is a tab-separated file where each line contains a gene in the first column followed by its possible children. Genes not listed in the first column of the file are not restricted.
"no_parents": The genes in the file are not allowed to have parents.
file_format : An arbitrary network file format listed in File Format.
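The layouts of the "parents" and "CHLIST" skeleton files described above can be illustrated with two tiny files. This is only a sketch with invented gene names; the commands at the end are printed rather than executed.

```shell
#!/bin/sh
# Sketch: write two small skeleton files. Gene names are made up.
DIR=$(mktemp -d)

# "parents" type: one gene per line; only these genes may be parents.
printf 'geneA\ngeneB\n' > "$DIR/parents.txt"

# "CHLIST" type: tab-separated, a gene followed by its possible children.
printf 'geneA\tgeneC\tgeneD\n' >  "$DIR/chlist.txt"
printf 'geneB\tgeneD\n'        >> "$DIR/chlist.txt"

# Printed only; replace X.Y.Z and input_file before running.
echo "signbn.X.Y.Z --skel $DIR/parents.txt --skel-type parents input_file"
echo "signbn.X.Y.Z --skel $DIR/chlist.txt --skel-type CHLIST input_file"
```

In the CHLIST example, geneA may only have geneC and geneD as children, geneB may only have geneD, and all other genes are unrestricted.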

--skel-args key=value,...

Skeleton-specific arguments. See the explanation of the "--skel-type" option for the available arguments.

-I key=value,...

Arguments for input data. The available keys depend on the input data file type. See File Format for details.

--log-mode n
-L n

Log mode. By default, only 1 file is generated by the first job (Grid Engine) or the root process (MPI).

--log-file file

If specified, the log message is written in file.

--cache n

Specifies the cache algorithm. By default, 3 is used.

-N n
--iteration n

[MPI][DE] The number of iterations of the estimation. Use this option to specify the number of bootstrap iterations. By default, 1 is assumed. If a value other than 1 is specified, the -B option is implied. This option is available for parallel execution with MPI and for direct execution of the binary.

--compile { on | off }

[MPI][DE] If on is given, compile the estimated networks into a single network. If a value other than 1 is specified for the -N or --iteration option, on is assumed by default.

--threshold th
-T th

[MPI][DE] Threshold used for compiling the estimated bootstrapped networks. By default, th = 0.05.

--local-output { on | off }

[MPI][DE] Save the estimated bootstrap networks in files. For MPI execution, each process (MPI rank) produces a single file named file.000000, where file is the file name specified by the -o option and 000000 is a six-digit number corresponding to the rank ID, and stores its networks in that file. For direct, single execution, the binary produces a single file whose name is given by the --local-output-file option. If that option is not specified, the final compiled network is stored in, and overwrites, the same file. All networks estimated by the same process are stored in the same file. The resulting files can be compiled into a single network by the signproc tool. By default, "off" is assumed.

--local-output-file file

The network file name for the --local-output option.

--local-output-type type

The network file format for files output via the --local-output option. By default, BSF is assumed for reducing the local output file size. See File Format for the available network file formats.
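The per-rank file names produced by an MPI run with --local-output on follow the file.000000 pattern described above. The sketch below lists the names to expect for a given number of processes, which can help when collecting the files for the signproc tool. It assumes that rank IDs start at 0, as is usual for MPI, and the function name rank_files is hypothetical.

```shell
#!/bin/sh
# Sketch: print the local output file names expected from an MPI run.
# Assumes rank IDs start at 0, as is usual for MPI.

# $1: file name given to -o, $2: number of MPI processes.
rank_files() {
    r=0
    while [ "$r" -lt "$2" ]; do
        printf '%s.%06d\n' "$1" "$r"
        r=$((r + 1))
    done
}

# Example: names for a hypothetical 4-process run writing to "network.out".
rank_files network.out 4
```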

--hybrid

[MPI] Enables the hybrid parallelization mode, which performs a single network estimation with multiple threads in an MPI process. This mode is generally less efficient, but it is useful when a single network estimation of the bootstrap method cannot finish within the elapsed-time limit set in your computation environment.

Options of the score functions

The following comma-concatenated key=value style extra arguments are available for the -S option. A white space can be inserted after the comma. The available options differ depending on the score function specified by the -s option.

The BNRC score function

hyper_num=n
hn=n

The number of hyperparameters to search.

hyper_bg=x
hb=x

The initial value of the sequence of hyperparameters to search.

hyper_inc=y
hi=y

linear

Linear mode. This is actually an alias for "hn=2,hb=2.0,hi=1.0".

level=n

Pre-calculation level.
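The comma-concatenated argument of the -S option can be assembled from separate key=value items, as in the sketch below. The helper join_opts is hypothetical, and the values are the ones this manual lists as equivalent to "linear". The commands are printed, not executed.

```shell
#!/bin/sh
# Sketch: assemble the comma-concatenated argument of the -S option.

# Join all arguments with commas, e.g. "a=1 b=2" becomes "a=1,b=2".
# The body runs in a subshell so that the IFS change stays local.
join_opts() (
    IFS=,
    echo "$*"
)

# The values listed in this manual as equivalent to "linear".
SCORE_OPTS=$(join_opts hn=2 hb=2.0 hi=1.0)

# Printed only; replace X.Y.Z and input_file before running.
echo "signbn.X.Y.Z -s BNRC -S $SCORE_OPTS input_file"
echo "signbn.X.Y.Z -s BNRC -S linear input_file"
```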

Options of the search algorithms

The following comma-concatenated key=value style extra arguments are available for the --algo-args or -A option. A white space can be inserted after the comma. The available options differ depending on the algorithm.

The HC algorithm

trials=n
t=n

The number of trials. The HC algorithm performs the greedy algorithm n times and returns the best scored network as the result of the algorithm. By default, n = 10.

max_loops=n

SiGN-BN NNSR

Name

signmpi.sh -- Shell script for Grid Engine on SHIROKANE for executing SiGN-BN NNSR algorithm.

signbnnnsr.X.Y.Z -- SiGN-BN NNSR executable binary.

Synopsis

In HGC Supercomputer System SHIROKANE

qsub -pe { mpi-fillup | mpi | mpi_8 | mpi_4 } N ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.X.Y.Z [ Options ] input_file

Note: N corresponds to the number of processes used simultaneously.

Note: Use signmpi.sh and signbnnnsr.X.Y.Z under ~tamada/sign for the latest release.

Direct execution of the executable binary

mpirun -np N signbnnnsr.X.Y.Z [ --total-mem mega_byte ] [ Options ] input_file

Description

signmpi.sh is a shell script that runs SiGN-BN NNSR on the Human Genome Center supercomputer system SHIROKANE. The program is parallelized with MPI (Message Passing Interface), a standard way of parallelizing programs. Therefore, you need to run it as an MPI job. The Grid Engine on SHIROKANE supports parallel execution of programs with MPI. Unlike an array job of Grid Engine, an MPI job requires the specified number of CPU cores simultaneously during its execution. The time required to finish the calculation depends on the number of CPU cores you specify. The more CPU cores you specify, the faster SiGN-BN NNSR runs for the same input data and parameters. However, if the supercomputer is very crowded, a job that requests many CPU cores has fewer chances of being executed.

You can execute the x86-64 binary signbnnnsr.X.Y.Z on your own Linux system with Open MPI. In that case, run the binary via the mpirun command.

Similar to other SiGN programs, SiGN-BN NNSR accepts an EDF format gene expression file as its input.

Grid Engine Options on SHIROKANE

In the signmpi.sh script, "-e se -o so -cwd" (and other minor options) is assumed by default. As noted above, SiGN-BN NNSR requires MPI. Therefore, you have to specify the "-pe" option to choose an MPI environment and the number of CPU cores (Grid Engine slots). The available MPI environments are mpi-fillup, mpi, mpi_8, and mpi_4. We recommend mpi-fillup, with which Grid Engine tries to execute as many processes as possible on the same computation node. In contrast, mpi tries to execute as few processes as possible on the same computation node. mpi_8 and mpi_4 guarantee that exactly 8 or 4 processes, respectively, are executed per computation node. Therefore, with these environments, N (the number of processes) has to be a multiple of 8 or 4. We recommend using N = 32 or 64.
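The constraint on N for the mpi_8 and mpi_4 environments can be checked before submission. The following sketch uses a hypothetical helper, valid_slots, and a hypothetical output file name; the qsub command is printed, not submitted.

```shell
#!/bin/sh
# Sketch: check the process count against the chosen parallel environment
# before submitting, then print (not run) the qsub command.

# $1: parallel environment name, $2: number of processes N.
# Succeeds if N is a positive number acceptable for that environment.
valid_slots() {
    [ "$2" -gt 0 ] || return 1
    case $1 in
        mpi_8) [ $(( $2 % 8 )) -eq 0 ] ;;
        mpi_4) [ $(( $2 % 4 )) -eq 0 ] ;;
        mpi-fillup|mpi) return 0 ;;
        *) return 1 ;;
    esac
}

PE=mpi-fillup
N=32
if valid_slots "$PE" "$N"; then
    echo "qsub -pe $PE $N ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.X.Y.Z -o network.out input_file"
else
    echo "N=$N is not valid for $PE" >&2
fi
```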

Options for SiGN-BN NNSR

-o, -O, -s, -S, --algo, -A, -y, --blocks, --skel, --skel-type, --skel-args, --total-mem

These options are the same as SiGN-BN HC+Bootstrap. See above for details.

-T n

The number of iterations of the subnetwork estimation by the neighbor node sampling and repeat algorithm. By default n = 100000.

-L n

Log output mode. Set n = 1 for outputting log messages for all the processes.

-t n

The number of iterations of the Random Sampling phase. By default, n = 0. Normally, you do not need to change this value. The option is provided to reproduce the Random Sampling phase that appears in our paper.

SiGN-BN Para-OS

Name

paraos.X.Y.Z -- SiGN-BN Para-OS algorithm for optimal gene network estimation.

Synopsis

Parallel execution via Grid Engine on HGC supercomputer system

qsub -pe MPI_environment N job_script ~tamada/sign/paraos.X.Y.Z [ Options ] input_file

Parameters

Set the OMP_NUM_THREADS environment variable in your job script to the number of threads you want to use, before executing the mpirun command. Copy ~tamada/sign/signmpi.sh into your working directory and edit it as your own job_script to apply your preferred settings.
Parameter N corresponds to the number of processes you use.
X.Y.Z represents the release number of the SiGN-BN Para-OS binary.

Description

SiGN-BN Para-OS calculates the optimal structure of a gene network from the data. Because calculating the optimal structure is computationally hard, you need a large amount of computational resources, i.e., many CPUs or computation nodes. The program is implemented with MPI, and the binary also supports multi-threaded execution. Therefore, the number of CPU cores used by the program is equal to the number of processes × the number of threads.
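The relation between processes, threads, and CPU cores can be kept explicit in a job script, as sketched below. The process and thread counts and the output file name are example values, and the launch line is only printed because the exact mpirun arguments depend on your copy of signmpi.sh.

```shell
#!/bin/sh
# Sketch of the thread and core bookkeeping for a Para-OS job script.
PROCS=16      # N given to "qsub -pe MPI_environment N" (example value)
THREADS=4     # threads per process (example value)

# OMP_NUM_THREADS must be set before mpirun is invoked.
export OMP_NUM_THREADS=$THREADS

# CPU cores used = number of processes x number of threads.
TOTAL_CORES=$((PROCS * THREADS))
echo "processes=$PROCS threads=$OMP_NUM_THREADS cores=$TOTAL_CORES"

# Printed only; replace X.Y.Z and input_file before running.
echo "mpirun -np $PROCS paraos.X.Y.Z -o network.out input_file"
```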

SiGN-BN Para-OS supports only the BNRC score function and the static Bayesian network model. It does not support the dynamic model, which uses time-series data.

Note that the computation time grows exponentially with the number of genes in the data set. That is, adding one gene to a data set roughly doubles the computation time. On Shirokane 3, the computation time for the sample data GN-16-50.edf.txt is about 8 minutes using 16 computation nodes (processes) with 1 thread per process.
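The doubling rule gives a quick way to judge whether a data set is feasible. The sketch below extrapolates from the 8-minute figure quoted above. It assumes that the 16 in GN-16-50.edf.txt is the number of genes and that the hardware setup is unchanged; both are assumptions of this example, and the result is only a rough order-of-magnitude estimate. The function name estimate_minutes is hypothetical.

```shell
#!/bin/sh
# Rough extrapolation of Para-OS run time under the doubling rule.
# Assumes the 16 in GN-16-50.edf.txt is the gene count (an assumption)
# and the same 16-process, 1-thread setup as the reported 8 minutes.
BASE_GENES=16
BASE_MINUTES=8

# $1: number of genes (BASE_GENES or more). Prints the estimated minutes.
estimate_minutes() {
    extra=$(( $1 - BASE_GENES ))
    minutes=$BASE_MINUTES
    # Double the estimate once for every gene beyond the baseline.
    while [ "$extra" -gt 0 ]; do
        minutes=$((minutes * 2))
        extra=$((extra - 1))
    done
    echo "$minutes"
}

estimate_minutes 20
estimate_minutes 24
```

Under these assumptions, 20 genes would take about 128 minutes and 24 genes about 2048 minutes (roughly 34 hours) on the same resources.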

Options

-o, -S, -L

These options are the same as SiGN-BN HC+Bootstrap. See above for details.

--log file

If specified, the log message is written in file.
