SiGN-BN MANUAL
SiGN-BN HC+Bootstrap
Name
signbn-hcbs.sh -- SiGN-BN: Shell script for executing SiGN-BN HC+Bootstrap on HGC SHIROKANE.
signbn.X.Y.Z -- SiGN-BN HC+BS executable binary.
Synopsis
Parallel execution via Grid Engine on HGC supercomputer system
qsub -t 1-N [ Grid Engine options ] ~tamada/sign/signbn-hcbs.sh [ options ] input_file
Note: N corresponds to the number of iterations of the bootstrap method.
Compiling the bootstrapped networks into a single gene network
qsub [ Grid Engine options ] ~tamada/sign/signbn-hcbs.sh --bin signproc --bs prefix=file_prefix[,other options] --output type=file_type,file=output_file
Direct execution of the executable binary for single network estimation
signbn.X.Y.Z [ --total-mem mega_byte ] [ options ] input_file
Note: X.Y.Z represents the SiGN-BN binary version.
Direct execution of the executable binary for the bootstrap method
signbn.X.Y.Z -N N [ --total-mem mega_byte ] [ options ] input_file
Note: N corresponds to the number of iterations of the bootstrap method in a single
execution.
Parallel execution with MPI
OMP_NUM_THREADS=n
{ mpiexec | mpirun } [ MPI options ] signbn.X.Y.Z [ options ] input_file
Note: This is only for the binary compiled for MPI execution.
Note: Use the MPI execution program and job script that are available and permitted
on your system.
Description
SiGN-BN estimates a gene network from gene expression data.
SiGN-BN was originally developed for SHIROKANE, a supercomputer system installed at the Human Genome
Center, Institute of Medical Science, The University of Tokyo. The system runs Grid
Engine, a job dispatch system for PC cluster systems. Therefore, SiGN-BN is also
designed for systems running Grid Engine. The executable binaries may also run on
various Linux systems with various job dispatch systems.
An EDF file can be specified as an input file of gene expression data.
See File Formats for details of EDF.
signbn-hcbs.sh is a shell script to run SiGN-BN (HC+Bootstrap) with Grid Engine
available on SHIROKANE.
The program is executed as an array job on Grid Engine.
Therefore, the number of tasks of the array job
specified by the -t option of Grid Engine corresponds to the
number of iterations of the bootstrap method. Each bootstrap execution estimates a
gene network from a resampled data set and outputs the estimated gene network as a single
file. The array job tasks can run in parallel via Grid Engine.
We recommend at least 1000 iterations for the bootstrap method. That is, specify
1000 or more for N.
To obtain a final gene network from these bootstrapped network files,
you need to use the SiGN-Proc tool,
which can also be executed through the signbn-hcbs.sh script.
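The two steps described above (the bootstrap array job, then compilation with SiGN-Proc) can be sketched as a dry run that only prints the commands from the Synopsis. The prefix result/bs, the input file input.edf.txt, the output file network.out, and the placeholder FILE_TYPE are hypothetical names chosen for illustration; substitute a file type listed in the SiGN-Proc Manual.

```shell
# Dry run: the commands are printed, not submitted.
# The script path is the one given in the Synopsis; it is quoted so that
# the tilde is printed literally.
SCRIPT='~tamada/sign/signbn-hcbs.sh'

# Step 1: array job with one task per bootstrap iteration.
# $1 = number of iterations, $2 = output file prefix, $3 = input file
bootstrap_cmd() {
    printf 'qsub -t 1-%d %s --bs -o %s %s\n' "$1" "$SCRIPT" "$2" "$3"
}

# Step 2: compile the bootstrapped networks with SiGN-Proc.
# $1 = file prefix, $2 = output file type, $3 = output file
compile_cmd() {
    printf 'qsub %s --bin signproc --bs prefix=%s --output type=%s,file=%s\n' \
        "$SCRIPT" "$1" "$2" "$3"
}

bootstrap_cmd 1000 result/bs input.edf.txt
compile_cmd result/bs FILE_TYPE network.out
```

The second command should be submitted only after all tasks of the first job have finished, because it reads the files those tasks write.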
For multi-node parallel execution, SiGN-BN supports MPI on several systems.
Multi-threaded execution is
also supported. To specify the number of threads per node, set the
OMP_NUM_THREADS environment variable to the appropriate number.
When running with MPI, the estimated networks are compiled automatically.
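As a rough illustration of the MPI form in the Synopsis, the following dry run prints a launch line that sets OMP_NUM_THREADS and passes the number of bootstrap iterations with -N. It assumes that your MPI launcher is mpirun and that it accepts -np for the number of processes; the input file name is hypothetical.

```shell
# Dry run: print an MPI launch line instead of executing it.
# $1 = number of MPI processes, $2 = threads for each process,
# $3 = number of bootstrap iterations, $4 = input file
mpi_cmd() {
    printf 'OMP_NUM_THREADS=%d mpirun -np %d signbn.X.Y.Z -N %d %s\n' \
        "$2" "$1" "$3" "$4"
}

mpi_cmd 16 4 1000 input.edf.txt
# prints: OMP_NUM_THREADS=4 mpirun -np 16 signbn.X.Y.Z -N 1000 input.edf.txt
```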
Direct execution of the binary signbn.X.Y.Z estimates a single gene network
without bootstrap by default. It can also perform the bootstrap method with the -N
option, but it takes a very long time to complete with a large number of iterations.
We strongly recommend always specifying the --total-mem option explicitly.
The running time of the BNRC score function heavily depends on the size of memory specified
by this option. If you estimate a network from a large input data set,
SiGN-BN consumes a huge amount of memory and is likely to fail to allocate enough memory. If so,
specify the memory size available on your system with the --total-mem option. See below for
details.
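The two direct-execution forms from the Synopsis can be sketched as follows. This is a dry run that only prints the command lines; MEM stands for the memory size you pass to --total-mem and input.edf.txt is a hypothetical input file.

```shell
# Dry run: print the direct-execution command line.
# $1 = value for --total-mem, $2 = input file,
# $3 = optional number of bootstrap iterations (-N)
direct_cmd() {
    if [ -n "${3:-}" ]; then
        printf 'signbn.X.Y.Z -N %d --total-mem %s %s\n' "$3" "$1" "$2"
    else
        printf 'signbn.X.Y.Z --total-mem %s %s\n' "$1" "$2"
    fi
}

direct_cmd MEM input.edf.txt        # single network estimation
direct_cmd MEM input.edf.txt 100    # bootstrap with 100 iterations
```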
SiGN-Proc is a program which can be used to compile
output files generated by executing signbn-hcbs.sh into a single gene network file.
It can generate the resulting network in CSML, plain text, and other formats.
See the SiGN-Proc Manual for details.
Grid Engine options
In signbn-hcbs.sh for Shirokane 3, "-e se -o so -cwd" (and some other minor options) are set for the qsub command.
Therefore, the standard error and the standard output are written to the files se and so,
respectively, in the current directory from which the user submits the Grid Engine job.
In Shirokane 3, the "-e" and "-o" options are not set in the script by default. Thus, Grid Engine generates many files for storing messages written to the standard output and the standard error. The generated file names are signbn-hcbs.sh.X.<job_id>.<task_id>, where X is o or e, representing the standard output or the standard error, respectively. You can avoid this by setting these file names with the "-e" and "-o" options of the qsub command.
By specifying Grid Engine options before the job script, these default Grid Engine
options can be overridden.
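The placement of Grid Engine options can be illustrated with a dry run that prints the qsub line. Options placed before the job script go to Grid Engine, and options placed after it go to signbn-hcbs.sh, so the two -o options below have different meanings. The file names se and so follow the text above; result/bs and input.edf.txt are hypothetical.

```shell
# Dry run: print a qsub line with explicit Grid Engine log files.
SCRIPT='~tamada/sign/signbn-hcbs.sh'

# $1 = number of iterations, $2 = standard error file, $3 = standard output file,
# $4 = output file prefix for signbn-hcbs.sh, $5 = input file
qsub_with_logs() {
    printf 'qsub -t 1-%d -e %s -o %s %s --bs -o %s %s\n' \
        "$1" "$2" "$3" "$SCRIPT" "$4" "$5"
}

qsub_with_logs 1000 se so result/bs input.edf.txt
```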
Special options available only for signbn-hcbs.sh
--bs
Perform the bootstrap method. If the dynamic model is specified (-y), the
pseudo bootstrap method is performed. The number of bootstrap iterations is determined
by the number of tasks of the array job when submitting the job to Grid Engine.
If this is specified,
the -B option is implied with the appropriate ID numbers.
-o file_prefix
The prefix of the output file names. It can contain directories. The actual output file
will have the 6-digit ID number as its suffix. The number corresponds to a Grid Engine array job task ID.
By default, "--log-mode 4 --log-file file_prefix.log" is implied for
the first task.
--rel release_number
Use the specified release number of SiGN-BN instead of the latest stable release.
--dir path
The path where SiGN programs are installed. A trailing slash ("/") is not required
at the end of the path. This is for using a beta or older version installed in another directory.
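Before compiling, it can be useful to check that every array task wrote its output file. The sketch below derives the expected file names from the -o prefix and the 6-digit task ID. It assumes that the ID is appended after a dot (prefix.000001), which is how the rank ID is written for --local-output; verify this against your own output files. The function names and the prefix result/bs are hypothetical.

```shell
# $1 = prefix given to -o, $2 = Grid Engine array task ID
# Assumption: the 6-digit ID follows the prefix after a dot.
output_name() {
    printf '%s.%06d\n' "$1" "$2"
}

# Print the expected output files that do not exist.
# $1 = prefix given to -o, $2 = number of array tasks (N)
missing_outputs() {
    i=1
    while [ "$i" -le "$2" ]; do
        f=$(output_name "$1" "$i")
        [ -e "$f" ] || printf '%s\n' "$f"
        i=$((i + 1))
    done
}

output_name result/bs 1      # prints: result/bs.000001
output_name result/bs 1000   # prints: result/bs.001000
```

Calling missing_outputs result/bs 1000 after the array job has finished lists the tasks that need to be resubmitted, if any.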
Options for SiGN-BN HC+Bootstrap
( [MPI] indicates the option for parallel execution with MPI. )
( [DE] indicates the option for direct execution of the executable binary. )
--total-mem n
Memory size in gigabytes that the algorithm expects to use.
-y
Dynamic model. If the input file is not in the EDF file format, you need to
specify the number of replicates for each time point with the --replicates
option.
-m n
The maximum number of parents that each gene can have. By default n = 10.
Specify a small value to avoid overfitting if you have relatively few data samples.
-p n
The number of parent candidates of the greedy algorithm.
-o file
Output file name. The file name can contain directories (relative path).
-O output_type
Output file format. See File Format for details.
Do not specify this if you perform the bootstrap method.
--replicates v1,v2,...
The list of the numbers of replicates of time points. This is required if the
input data file is not an EDF file.
--blocks n
The number of consecutive time point blocks of the pseudo bootstrap method.
The final number of samples used becomes (# of time points) x n.
-B n
Use this only with direct execution of the binary.
Performs bootstrap resampling before network estimation.
n is the ID used to initialize the random number generator.
-s score
The name of the score function. By default, BNRC is used.
The following scores are available:
BNRC: Bayesian Non-parametric regression based score.
BNRCMV: BNRC that accepts missing (NaN) values.
-S key=value,...
The score specific options in key=value style format. See below for the available
options.
--algo algorithm
The name of the structure learning algorithm. By default, "hc2"
is used, which is the greedy hill-climbing algorithm implementation version 2.0.
You can specify "tshc" for the Two Step HC algorithm.
See Two Step HC for more details.
--algo-args key=value,...
-A key=value,...
The algorithm specific options.
-r seed
The integer random seed value.
--select-nodes file
The network estimation is performed for genes in file.
file is a tab-separated text file with one gene per line.
By default, the first column is read and used as gene names to be selected for the
network estimation. To change the column position to read,
use the --select-nodes-col option.
--select-nodes-col n
The 1-based (1-origin) column position for the --select-nodes option.
--skel file
Skeleton file that specifies the possible edges to be estimated.
Available for the "hc2" and "tshc"
algorithms.
Use "--skel-type" to specify the input file type.
--skel-type { parents | parents_targets | CHLIST | no_parents | file_format }
Skeleton file type.
"parents": The file needs to contain a gene
name list where each line represents a gene. Only genes in the file can
be parents of a gene.
"parents_targets": Similar to "parents" above, it
restricts the possible parents, but with this type the restriction applies only to
genes given by "target=file2" specified to the
"--skel-args" option. If "inv" is
specified, genes not listed in
file2 are the targets of the restriction.
"CHLIST": The file is a tab-separated file where each line
contains a gene in the first column followed by its possible children.
Genes not listed in the first column of the file are not restricted.
"no_parents": The genes in the file are not allowed to have
parents.
file_format: An arbitrary network file format listed in File Format.
--skel-args key=value,...
Skeleton specific arguments. See the explanation of the "--skel-type"
option for available arguments.
-I key=value,...
Arguments for input data. The available keys depend on the input data
file type. See File Format for details.
--log-mode n
-L n
Log mode. By default, only 1 file is generated by the first job (Grid Engine)
or the root process (MPI).
--log-file file
If specified, the log message is written in file.
--cache n
Specifies the cache algorithm. By default, 3 is used.
-N n
--iteration n
[MPI][DE] The number of iterations of the estimation. Specify the number of bootstrap
iterations by this option. By default, 1 is assumed. If a value other than 1 is specified,
the -B option is implied. This is available for the
parallel execution with MPI and the direct execution of the binary.
--compile { on | off }
[MPI][DE] If on is given, compile the estimated networks into a single
network. If a value other than 1 is specified for the -N or --iteration
option, on is assumed by default.
--threshold th
-T th
[MPI][DE] Threshold used for compiling the estimated bootstrapped networks.
By default, th = 0.05.
--local-output { on | off }
[MPI][DE] Save the estimated bootstrap networks in files. For MPI execution, each process
(MPI rank) produces a single file named file.000000, where file is
the file name specified by
the -o option and 000000 is a six-digit number corresponding to the rank ID,
and stores its networks in it. For direct, single execution, the binary produces a
single file with the name specified by the --local-output-file option. If that option is not
specified, the final compiled network is stored in, and overwrites, the same file.
All the networks estimated by the same process are stored
in the same file. The resulting files can be processed (compiled) into a single
network by the signproc tool. By default, "off" is assumed.
--local-output-file file
The network file name for the --local-output option.
--local-output-type type
The network file format for files output via the
--local-output option.
By default, BSF is assumed for reducing the local output file size.
See File Format for the available network file formats.
--hybrid
[MPI] Enables the hybrid parallelization mode, which performs each single network estimation
with multiple threads in an MPI process. This is generally less efficient, but it is useful if
a single network estimation of the bootstrap method cannot finish within the
elapsed time limit that is set in your computation environment.
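As an illustration of the --select-nodes and --select-nodes-col options described above, the sketch below builds a small tab-separated file and prints the column that a 1-based position of 2 refers to. The cut command uses the same 1-based numbering, so it is a convenient way to preview what will be read. The table contents and the file names genes.txt, result.out, and input.edf.txt are hypothetical.

```shell
# Hypothetical gene table: column 1 = probe ID, column 2 = gene name.
tmp=$(mktemp)
printf 'p1\tGENE_A\np2\tGENE_B\n' > "$tmp"

# Print one column of a tab-separated file.
# $1 = file, $2 = 1-based column position
selected_column() {
    cut -f "$2" "$1"
}

# Preview the column that "--select-nodes-col 2" points at.
selected_column "$tmp" 2
rm -f "$tmp"

# Dry run: the corresponding direct-execution command line.
printf 'signbn.X.Y.Z --select-nodes %s --select-nodes-col %d -o %s %s\n' \
    genes.txt 2 result.out input.edf.txt
```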
Options of the score functions
The following comma-concatenated key=value style extra arguments are available for
the -S option.
A white space can be inserted after the comma.
The available options are different depending on the score function specified
by the -s option.
The BNRC score function
hyper_num=n
hn=n
The number of hyperparameters to search.
hyper_bg=x
hb=x
The initial value of the sequence of hyperparameters to search.
hyper_inc=y
hi=y
linear
Linear mode. This is actually an alias for "hn=2,hb=2.0,hi=1.0".
level=n
Pre-calculation level.
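The comma-concatenated argument for -S can be assembled from separate key=value pairs, as in the following sketch. It reproduces the string that "linear" is an alias for; join_args is a hypothetical helper name and input.edf.txt a hypothetical input file. The last line is a dry run that only prints the command.

```shell
# Join all arguments with commas (no space after the comma).
join_args() {
    ( IFS=,; printf '%s\n' "$*" )
}

score_args=$(join_args hn=2 hb=2.0 hi=1.0)
printf '%s\n' "$score_args"
# prints: hn=2,hb=2.0,hi=1.0

# Dry run: pass the joined string to -S together with the score name.
printf 'signbn.X.Y.Z -s BNRC -S %s input.edf.txt\n' "$score_args"
```

If you insert a white space after each comma, quote the whole argument so that the shell passes it to the program as a single word.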
Options of the search algorithms
The following comma-concatenated key=value style extra arguments are available for
the --algo-args or -A option.
A white space can be inserted after the comma.
The available options are different depending on the algorithm.
The HC algorithm
trials=n
t=n
The number of trials. The HC algorithm performs the greedy algorithm n times
and returns the best scored network as the result of the algorithm.
By default, n = 10.
max_loops=n
SiGN-BN NNSR
Name
signmpi.sh -- Shell script for Grid Engine on SHIROKANE for executing SiGN-BN NNSR algorithm.
signbnnnsr.X.Y.Z -- SiGN-BN NNSR executable binary.
Synopsis
In HGC Supercomputer System SHIROKANE
qsub -pe { mpi-fillup | mpi | mpi_8 | mpi_4} N ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.X.Y.Z [ Options ] input_file
Note: N corresponds to the number of processes used simultaneously.
Note: Use signmpi.sh and signbnnnsr.X.Y.Z under ~tamada/sign for the latest release.
Direct execution of the executable binary
mpirun -np N signbnnnsr.X.Y.Z [ --total-mem mega_byte ] [ Options ] input_file
Description
signmpi.sh is a shell script to run SiGN-BN NNSR on the Human Genome Center
supercomputer system SHIROKANE. The program is parallelized with
MPI (Message Passing Interface), which is a standard way of parallelizing
programs.
Therefore you need to run it as an MPI job. The Grid Engine on
SHIROKANE supports the parallel execution of the program with MPI.
Unlike an array job of the Grid Engine, an MPI job requires the specified
number of CPU cores simultaneously during its execution.
The time required
to finish the calculation depends on the number of CPU cores you specify.
The more CPU cores you specify, the faster SiGN-BN NNSR runs with the
same input data and parameters. However, if the supercomputer is
very crowded, a job requesting many CPU cores has fewer chances of being
executed.
You can execute the x86-64 binary signbnnnsr.X.Y.Z on your Linux system
with Open MPI. If so, execute the binary via the mpirun command.
Similar to other SiGN programs, SiGN-BN NNSR accepts an EDF format
gene expression file as its input.
Grid Engine Options on SHIROKANE
In the signmpi.sh script, "-e se -o so -cwd" (and other minor
options) are assumed by default. As noted above, SiGN-BN NNSR requires MPI.
Therefore, you have to specify the "-pe" option to choose
an MPI environment and the number of CPU cores (Grid Engine slots).
The available MPI environments are mpi-fillup, mpi, mpi_8,
and mpi_4. We recommend mpi-fillup, with which the Grid Engine
tries to execute as many processes as possible on the same computation node.
On the other hand, mpi tries to execute as few processes as possible
on the same computation node. mpi_8 and mpi_4 guarantee that
exactly 8 or 4 processes are executed per computation node. Therefore,
with these environments, N (the number of processes) has to be a
multiple of 8 or 4, respectively. We recommend using N = 32 or 64.
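The constraint on N can be checked before submission. The sketch below validates N against the chosen MPI environment and then prints the qsub line from the Synopsis as a dry run. The function names and the input file name are hypothetical.

```shell
# Return success if N is acceptable for the given MPI environment.
# $1 = MPI environment, $2 = N (number of processes)
check_slots() {
    case "$1" in
        mpi_8) per_node=8 ;;
        mpi_4) per_node=4 ;;
        mpi-fillup|mpi) per_node=1 ;;   # no multiple-of constraint
        *) echo "unknown MPI environment: $1" >&2; return 2 ;;
    esac
    [ $(( $2 % per_node )) -eq 0 ]
}

# Dry run: print the qsub line if N is valid.
# $1 = MPI environment, $2 = N, $3 = input file
nnsr_cmd() {
    if check_slots "$1" "$2"; then
        printf 'qsub -pe %s %d ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.X.Y.Z %s\n' \
            "$1" "$2" "$3"
    else
        echo "N=$2 is not valid for $1" >&2
        return 1
    fi
}

nnsr_cmd mpi_8 32 input.edf.txt
```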
Options for SiGN-BN NNSR
-o, -O, -s, -S, --algo,
-A, -y, --blocks, --skel, --skel-type,
--skel-args, --total-mem
These options are the same as for SiGN-BN HC+Bootstrap.
See above for details.
-T n
The number of iterations of the subnetwork estimation by
the neighbor node sampling and repeat algorithm. By default n = 100000.
-L n
Log output mode. Set n = 1 for outputting log messages for all the
processes.
-t n
The number of iterations of the Random Sampling phase.
By default, n = 0.
Normally, you do not need to change this value.
This option is provided to reproduce the Random Sampling phase
described in our paper.
SiGN-BN Para-OS
Name
paraos.X.Y.Z -- SiGN-BN Para-OS algorithm for optimal gene network estimation.
Synopsis
Parallel execution via Grid Engine on HGC supercomputer system
qsub -pe MPI_environment N job_script ~tamada/sign/paraos.X.Y.Z [ Options ] input_file
Parameters
- Set the OMP_NUM_THREADS environment variable in your job script to the number of
threads you want to use, before executing the mpirun command. Copy ~tamada/sign/signmpi.sh into your
working directory and edit it as your own job_script to apply your preferred settings.
- Parameter N corresponds to the number of processes you use.
- X.Y.Z represents the release number of the SiGN-BN Para-OS binary.
Description
SiGN-BN Para-OS calculates the optimal structure of a gene network from the data. Because the calculation of
the optimal structure is difficult, you need lots of computational resources, i.e., many CPUs or computation
nodes. It is implemented with MPI, and the binary also supports multi-threaded execution. Therefore, the number of
CPU cores used by the program is equal to the number of processes × the number of threads.
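The relation between processes, threads, and CPU cores can be worked through with shell arithmetic. In this sketch the first argument corresponds to N in the qsub command and the second to the value of OMP_NUM_THREADS; total_cores is a hypothetical helper name.

```shell
# $1 = number of MPI processes, $2 = number of threads for each process
total_cores() {
    echo $(( $1 * $2 ))
}

total_cores 16 1   # prints: 16
total_cores 16 4   # prints: 64
```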
SiGN-BN Para-OS supports only the BNRC score function and the static Bayesian network model.
It does not support the dynamic model, which uses time-series data.
Note that the computation time grows exponentially with the number of genes in the data set. That is,
adding one gene to a data set makes the computation take about twice as long.
In Shirokane 3, the computational time for the sample data GN-16-50.edf.txt is about
8 minutes using 16 computation nodes (processes) with 1 thread for each process.
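Under the doubling rule stated above, adding k genes multiplies the computation time by about 2^k when the computational resources stay the same. The following sketch computes that factor; time_factor is a hypothetical helper name, and the result is only a rough estimate.

```shell
# $1 = number of additional genes (k)
# Prints the approximate factor 2^k by which the computation time grows.
time_factor() {
    echo $(( 1 << $1 ))
}

time_factor 1   # prints: 2
time_factor 4   # prints: 16
```

For example, a data set with 4 more genes than one you have already timed can be expected to take roughly 16 times as long on the same number of nodes.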
Options
-o, -S, -L
These options are the same as for SiGN-BN HC+Bootstrap. See above for details.
--log file
If specified, the log message is written in file.