INGOR
Reads and writes ytData in GDF format.
#include <ytGDF.h>
Public Member Functions
ytData * ytGDF_read_fp (FILE *fp, ytKeyValues *args)
Reads a GDF file from a file stream.
void ytGDF_write_fp (FILE *fp, ytData *data, ytKeyValues *args)
Writes a ytData instance in GDF format to a file stream.
The GDF format is a tab-delimited text file format for various input data. It can represent both row=sample and column=sample style data. The row=sample style means that each row represents a sample or a case, and each column represents a variable or an attribute of a sample. The column=sample style is the opposite: each row represents a variable and each column a sample.
Here is an example of a GDF file.
There are four sections in a GDF file. The first section is the meta data section. In the meta data section, users can describe various attributes (or data) for the data themselves. Meta data are pairs of meta keys and their values. Each meta key begins with "$" and is followed by a tab and its value. If a row begins with "$", the line is a single meta data entry in the meta data section.
The second section is the column attribute section, which specifies various attributes for columns. The above example specifies two attribute keys, "type" and "name", for columns.
In addition to column attributes, GDF can define attributes for rows. The third section is the row attribute section. It is a single row consisting of attribute keys, each of which begins with "%". Here is an example.
If the row attribute section is omitted, the first column is assumed to contain the names of rows. If you specify multiple row attribute keys, empty cells are required in the column attribute section to align the attributes. In this case, a hyphen (minus) "-" is used to fill these empty cells.
For samples, two IDs are defined to distinguish them: the primary ID and the secondary ID. The primary ID identifies, for example, an individual or the same experiment under different conditions. The secondary ID is the time index for the dynamic model; that is, two consecutive indices are assumed to depend on each other.
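The relationship between the two IDs can be sketched in C as follows. This is only an illustration: the `SampleID` record and the `times_are_consecutive` function are hypothetical and not part of ytGDF.h, and the check assumes that samples sharing a primary ID are stored next to each other and that secondary IDs are 1-origin consecutive integers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for illustration; the real ytData layout is not shown here. */
typedef struct {
    const char *primary;  /* primary ID, e.g. an individual */
    int secondary;        /* secondary ID, a 1-origin time index */
} SampleID;

/* Returns 1 if, for every run of samples with the same primary ID,
 * the secondary IDs are 1, 2, 3, ... in order; returns 0 otherwise.
 * Assumes samples with the same primary ID are stored contiguously. */
int times_are_consecutive(const SampleID *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* A new run starts at index 0 or when the primary ID changes. */
        int first = (i == 0) || strcmp(s[i].primary, s[i - 1].primary) != 0;
        int expected = first ? 1 : s[i - 1].secondary + 1;
        if (s[i].secondary != expected)
            return 0;
    }
    return 1;
}
```

A caller could run such a check on the sample IDs before fitting a dynamic model, since that model relies on consecutive time indices.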
By default, the "name" attribute is used as the primary ID, and the "time" attribute, a 1-origin integer index, is used as the secondary ID.
The GDF format is a superset of the EDF format, which is used in the SiGN software. Therefore, an EDF file can be read and written by the GDF routines as a GDF format file. EDF and GDF differ in several points. Use the edf option (argument) to read an EDF format file. In INGOR, give an argument like "-I edf".
$GDF
The GDF format version. This meta key should be on the first line so that it can serve as a byte marker identifying the file type.
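As a minimal sketch of how a reader might recognize meta data lines and the "$GDF" marker, the following C code splits a "$"-prefixed line into its key and value. The function names `parse_meta_line` and `is_gdf_marker` are hypothetical and not part of ytGDF.h, and returning the key without its leading "$" is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of ytGDF.h.
 * Splits a line of the form "$Key<TAB>Value" in place.
 * Returns 1 and sets *key and *value if the line begins with '$';
 * returns 0 otherwise, leaving the line unmodified. */
int parse_meta_line(char *line, char **key, char **value)
{
    assert(line != NULL && key != NULL && value != NULL);

    if (line[0] != '$')
        return 0;                      /* not a meta data line */

    /* Cut the line at the first newline or carriage return, if any. */
    line[strcspn(line, "\r\n")] = '\0';

    char *tab = strchr(line, '\t');
    if (tab != NULL) {
        *tab = '\0';                   /* terminate the key */
        *value = tab + 1;
    } else {
        *value = line + strlen(line);  /* no value: point at the empty string */
    }
    *key = line + 1;                   /* skip the leading '$' */
    return 1;
}

/* Returns 1 if the given first line of a file is a "$GDF" meta data line.
 * Note that the line is split in place when it is a meta data line. */
int is_gdf_marker(char *first_line)
{
    char *key, *value;
    return parse_meta_line(first_line, &key, &value) && strcmp(key, "GDF") == 0;
}
```

Both functions modify the buffer they are given, so a caller that still needs the original line should pass a copy.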
$KeywordOfNA
$NA
$NAN
Keyword (string expression) for NaN (Not a Number) and missing values. The default is "NA".
$PrimaryKey
$PrimaryKeyGroupID
Attribute name used as the primary ID for samples. The primary ID is used to identify the same individual, gene, etc. The default is "name".
$SecondaryKey
$SecondaryKeyGroupID
Attribute name used as the secondary ID for samples. The secondary ID is used to distinguish samples observed at the same time, year, etc. The default is "time".
$PrimaryKeyType
Attribute value type for primary IDs. The default type is "string". (The primary ID key is "name".) The possible values are string, integer, or double.
string: character string.
integer: 1-origin integer value.
double: double precision floating-point real value.
$SecondaryKeyType
Attribute value type for secondary IDs. The default type is "integer". (The secondary ID key is "time".) See above for possible values and their meanings.
type, typeID
Specifies the type of values in columns or rows. This can appear either as a row attribute or a column attribute.
continuous, cont, real, c
Continuous, floating-point real values.
ordinal, integer, int
Ordinal integer numbers.
discrete, disc, d
Discrete values represented by 0-origin integers, i.e., 0, 1, ...
categorical, cat, nominal
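The type names and their synonyms listed above can be normalized with a small lookup table. The following C sketch is illustrative only: the `GdfType` enum and `gdf_type_from_name` function are hypothetical names that do not come from ytGDF.h, and exact, case-sensitive matching is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enum for the four value types described above. */
typedef enum {
    GDF_TYPE_UNKNOWN = 0,
    GDF_TYPE_CONTINUOUS,
    GDF_TYPE_ORDINAL,
    GDF_TYPE_DISCRETE,
    GDF_TYPE_CATEGORICAL
} GdfType;

/* Maps a type name or one of its synonyms to a GdfType.
 * Matching is exact and case sensitive (an assumption of this sketch).
 * Returns GDF_TYPE_UNKNOWN for names that are not in the table. */
GdfType gdf_type_from_name(const char *s)
{
    static const struct { const char *name; GdfType type; } table[] = {
        {"continuous",  GDF_TYPE_CONTINUOUS},
        {"cont",        GDF_TYPE_CONTINUOUS},
        {"real",        GDF_TYPE_CONTINUOUS},
        {"c",           GDF_TYPE_CONTINUOUS},
        {"ordinal",     GDF_TYPE_ORDINAL},
        {"integer",     GDF_TYPE_ORDINAL},
        {"int",         GDF_TYPE_ORDINAL},
        {"discrete",    GDF_TYPE_DISCRETE},
        {"disc",        GDF_TYPE_DISCRETE},
        {"d",           GDF_TYPE_DISCRETE},
        {"categorical", GDF_TYPE_CATEGORICAL},
        {"cat",         GDF_TYPE_CATEGORICAL},
        {"nominal",     GDF_TYPE_CATEGORICAL},
    };

    assert(s != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(s, table[i].name) == 0)
            return table[i].type;
    }
    return GDF_TYPE_UNKNOWN;
}
```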
name
Column/row names.
alias
Alias names. An alias is a second name for a variable or a sample.
time
This is available only for samples. By default, time attributes are 1-origin consecutive integers (values beginning from 1). The key name "time" can be changed by the "$SecondaryKey" meta key.
ytData * ytGDF_read_fp (FILE *fp, ytKeyValues *args)
Reads a GDF file from a file stream.
The following key-value arguments are accepted. They override the settings written in the file.
na=string, nan=string
The keyword representing a missing value. If a data value is identical to string, it is regarded as NaN (not a number). By default, "NA" is used. The keyword is case insensitive.
This overrides the keyword specified by the meta data key "$KeywordOfNA".
type=string
The default variable type. The value string can be any type name defined in GDF.
types=type1:type2 ...
Variable types.
label_cols=n, l=n
The number of columns that are not data values. For example, if the file does not have any label columns, specify n=0.
header_rows=n, h=n
The number of rows that are not data values. The first n lines are ignored when reading a file.
name_row=n
The line number that contains the names of columns. This is useful for reading a simple tab-delimited text file, which often has column names in the first row.
row_var
Each row represents a variable. Specify this for reading an EDF file.
col_var
Each column represents a variable. This is the default.
edf
EDF mode (SiGN compatible mode). This sets "PrimaryKeyGroupID" to the secondary key name, so that its values are handled as consecutive time points for the dynamic model, and "SecondaryKeyGroupID" to the primary key name. (Note: The meanings of the primary and secondary IDs are opposite between GDF and EDF.)
write_edf=file
Writes the read data in EDF format into the specified file.
empty
Ignores (allows) empty cells in the data section. By default, an empty cell in the data section of the input file causes an error.
assume_real
Assumes all the variables to be real values. This does not automatically convert discrete values into one-hot vectorized data. For categorical data, this uses the internal indices of categories as their real values.
split_xy
If this is specified, the first half of the samples are regarded as data for the explanatory variables and the second half as data for the objective variables in the regression model. This is mainly intended for the dynamic model when supplying manually converted data.
csv
Specifies to use a comma character as a field delimiter. By default, a TAB character is used.
rhl
This is an alias for row_var,name_row=1,label_col=1. That is, each row corresponds to a variable, the first row is a header row and represents the names of samples (columns), and the first column represents the names of variables.
chl
This is an alias for col_var,name_row=1,label_col=1. That is, each column corresponds to a variable, the first row is a header row and represents the names of variables, and the first column represents the names of samples.
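Several of the arguments above (csv, na=, and empty) concern how a single data row is split into cells and how each cell is interpreted. The following C sketch illustrates that idea under stated assumptions; `split_fields` and `is_na` are hypothetical helpers, not functions from ytGDF.h, and the library's actual parsing may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if cell equals keyword, ignoring case; 0 otherwise.
 * The casts to unsigned char keep tolower() well defined for all bytes. */
int is_na(const char *cell, const char *keyword)
{
    assert(cell != NULL && keyword != NULL);
    while (*cell != '\0' && *keyword != '\0') {
        if (tolower((unsigned char)*cell) != tolower((unsigned char)*keyword))
            return 0;
        cell++;
        keyword++;
    }
    return *cell == '\0' && *keyword == '\0';
}

/* Splits line in place on delim ('\t' by default, ',' in csv mode),
 * keeping empty cells so that they can be detected by the caller.
 * Stores at most max pointers in fields and returns the total number
 * of cells found, which may exceed max. */
size_t split_fields(char *line, char delim, char **fields, size_t max)
{
    assert(line != NULL && fields != NULL && delim != '\0');
    size_t n = 0;
    char *start = line;
    for (;;) {
        char *end = strchr(start, delim);
        if (end != NULL)
            *end = '\0';               /* terminate the current cell */
        if (n < max)
            fields[n] = start;
        n++;
        if (end == NULL)
            break;                     /* last cell reached */
        start = end + 1;
    }
    return n;
}
```

strtok is deliberately avoided here because it merges adjacent delimiters and would hide empty cells, which the text above says are an error unless the empty argument is given.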
void ytGDF_write_fp (FILE *fp, ytData *data, ytKeyValues *args)
Writes a ytData instance in GDF into a file stream.
na=string, nan=string
edf
tsv
v=n (... =0)
fp: file stream to which the given data is output.
data: data to output.
args: output arguments.