INGOR
GDF

About GDF input file format

The GDF format is a tab-delimited text file format for INGOR input datasets. It supports various input data types and can represent both row=sample and column=sample style data. In the row=sample style, each row represents a sample or a case, and each column represents a variable or an attribute of a sample. The column=sample style is the opposite: each row represents a variable and each column a sample.

Here is an example of a GDF file.

$GDF 1.0
# This is a comment line
$NA nan
@type cont cont disc
@name c1 c2 c3
r1 1.0 1.5 0
r2 2.2 nan 1

There are four sections in a GDF file: the meta data section, the column attribute section, the row attribute section, and the data section. The first section is the meta data section, in which users can describe various attributes of the dataset itself. Meta data are pairs of a meta key and its value. A meta key begins with "$" and is followed by a tab and its value. Any line that begins with "$" is a single meta data entry in the meta data section.

The second section is the column attribute section, which specifies various attributes for columns. The above example specifies two attribute keys, "type" and "name", for columns.

In addition to column attributes, GDF can define attributes for rows. The third section is the row attribute section. It is a single row consisting of attribute keys, each beginning with "%". Here is an example.

$GDF 1.0
- @type cont cont
- @name c1 c2
%name %alias
r1 R1 1.0 1.5
r2 R2 2.2 2.4

If the row attribute section is omitted, the first column is assumed to contain the row names. If you specify multiple row attribute keys, the column attribute section needs empty cells to keep the attributes aligned. In this case, a hyphen (minus) "-" is used to fill these empty cells.

Two IDs are defined to distinguish samples: the primary ID and the secondary ID. The primary ID identifies, for example, an individual or the same experiment under different conditions. The secondary ID is the time index for the dynamic model; that is, there are dependencies between two consecutive indices.

By default, the "name" attribute is used as the primary ID, and the "time" attribute, a 1-origin integer index, is used as the secondary ID.
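The section rules above can be sketched as a small line classifier. This is only an illustration, not INGOR's own reader: the function name `classify_gdf_line` and the returned labels are hypothetical, and treating lines that start with "#" as comments is inferred from the first example file.

```python
def classify_gdf_line(line: str) -> str:
    """Guess which GDF section a line belongs to from its leading marker.

    Illustrative sketch only; the labels returned here are not INGOR names.
    """
    if not line.strip():
        return "blank"
    cells = line.rstrip("\n").split("\t")
    if cells[0].startswith("#"):
        return "comment"
    if cells[0].startswith("$"):
        return "meta"
    # A column attribute row may start with "-" filler cells that align it
    # with the row attribute columns; the "@key" cell follows the fillers.
    for cell in cells:
        if cell == "-":
            continue
        if cell.startswith("@"):
            return "column_attribute"
        break
    if cells[0].startswith("%"):
        return "row_attribute"
    return "data"
```

For instance, `classify_gdf_line("-\t@type\tcont\tcont")` yields `"column_attribute"`, while a line such as `"r1\tR1\t1.0\t1.5"` is classified as `"data"`.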

$GDF 1.0
- @type cont cont
- @name c1 c2
%name %time
id1 1 1.0 5.1
id1 2 2.0 5.2
id1 3 3.0 5.3
id2 1 1.1 3.3
id2 2 2.1 3.4
id2 3 3.1 3.5
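To illustrate how the two IDs organize the data rows of this example, the following sketch groups rows by primary ID and orders each group by secondary ID. The tuple layout and the function name `group_time_series` are assumptions made for illustration; they do not describe INGOR's internal data structures.

```python
from collections import defaultdict

# Data rows from the example: (primary ID, secondary ID, values of c1 and c2).
rows = [
    ("id1", 1, [1.0, 5.1]),
    ("id1", 2, [2.0, 5.2]),
    ("id1", 3, [3.0, 5.3]),
    ("id2", 1, [1.1, 3.3]),
    ("id2", 2, [2.1, 3.4]),
    ("id2", 3, [3.1, 3.5]),
]


def group_time_series(rows):
    """Group rows by primary ID and sort each group by its time index."""
    series = defaultdict(list)
    for primary_id, time_index, values in rows:
        series[primary_id].append((time_index, values))
    return {
        primary_id: [values for _, values in sorted(items, key=lambda it: it[0])]
        for primary_id, items in series.items()
    }


grouped = group_time_series(rows)
```

Each entry of `grouped` is then one time series, so `grouped["id1"]` holds the three consecutive observations of the sample "id1".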

The GDF format is a superset of the EDF format, which is used in the SiGN software. Therefore, an EDF file can be read and written by the GDF routines as a GDF format file. There are several differences between EDF and GDF, so use the edf option (argument) when reading an EDF format file. In INGOR, give an argument like "-I edf".

Supported meta keys

$GDF

The GDF format version. This meta key should be on the first line so that it can act as a byte marker identifying the file type.

$KeywordOfNA
$NA
$NAN

Keyword (string expression) of NaN (Not a Number) and missing values. The default is "NA".

$PrimaryKey
$PrimaryKeyGroupID

Attribute name used as the primary ID for samples. The primary ID is used to identify the same individual, gene, etc. The default is "name".

$SecondaryKey
$SecondaryKeyGroupID

Attribute name used as the secondary ID for samples. The secondary ID is used to distinguish samples observed at the same time, year, etc. The default is "time".

$PrimaryKeyType

Attribute value type for primary IDs. The default type is "string". (The primary ID key is "name".) The possible values are: string, integer, or double.
string : character string.
integer : 1-origin integer value.
double : double precision floating-point real value.

$SecondaryKeyType

Attribute value type for secondary IDs. The default type is "integer". (The secondary ID key is "time".) See above for possible values and their meanings.
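The three key types can be illustrated with a small conversion helper. This is a sketch under stated assumptions: the function name `convert_key` is hypothetical, and rejecting integers below 1 is an interpretation of "1-origin", not documented INGOR behavior.

```python
def convert_key(raw: str, key_type: str = "string"):
    """Convert a raw ID cell according to a $PrimaryKeyType/$SecondaryKeyType value."""
    if key_type == "string":
        return raw
    if key_type == "integer":
        value = int(raw)
        # Assumption: "1-origin" is read as "values start at 1".
        if value < 1:
            raise ValueError(f"integer keys are 1-origin, got {value}")
        return value
    if key_type == "double":
        return float(raw)
    raise ValueError(f"unknown key type: {key_type!r}")
```

With the defaults described above, a primary ID cell would be converted with `convert_key(cell, "string")` and a secondary ID cell with `convert_key(cell, "integer")`.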

Supported attribute keys

type, typeID

Specifies the type of values in columns or rows. This can appear either as a row attribute or a column attribute.

continuous, cont, real, c

Continuous, floating-point real values.

ordinal, integer, int

Ordinal integer numbers.

discrete, disc, d

Discrete values represented by 0-origin integers, i.e., 0, 1, ...

categorical, cat, nominal

Discrete values represented by string keywords.
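Since each type accepts several spellings, a reader typically maps them to one canonical name first. The sketch below uses exactly the spellings listed above; the names `TYPE_ALIASES` and `canonical_type` are hypothetical, and matching is exact because the source does not say whether type names are case sensitive.

```python
# Canonical type name -> accepted spellings, as listed in this section.
TYPE_ALIASES = {
    "continuous": ("continuous", "cont", "real", "c"),
    "ordinal": ("ordinal", "integer", "int"),
    "discrete": ("discrete", "disc", "d"),
    "categorical": ("categorical", "cat", "nominal"),
}

# Reverse lookup: spelling -> canonical name.
_TYPE_LOOKUP = {
    alias: canonical
    for canonical, aliases in TYPE_ALIASES.items()
    for alias in aliases
}


def canonical_type(name: str) -> str:
    """Return the canonical type name for a spelling used in a type attribute."""
    try:
        return _TYPE_LOOKUP[name]
    except KeyError:
        raise ValueError(f"unknown GDF type name: {name!r}") from None
```

Applied to the first example file, the row `@type cont cont disc` would map to `["continuous", "continuous", "discrete"]`.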

name

Column/row names.

alias

Alias names. An alias is a second name for a variable or a sample.

time

This is available only for samples. By default, time attributes are 1-origin consecutive integers (values beginning from 1). The key name "time" can be changed by the "$SecondaryKey" meta key.


GDF Input Arguments

The following key-value arguments are accepted as GDF input arguments (-I argument). They override the settings written in the GDF file.

na=string
nan=string

The keyword representing a missing value. If a data value is identical to string, it is regarded as NaN (not a number). By default, "NA" is used. The keyword is case insensitive.
This overrides the keyword specified by the meta data key "$KeywordOfNA".
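The case-insensitive keyword matching can be sketched as follows. The function name `parse_value` is hypothetical, and representing a missing value as a floating-point NaN is an assumption about how a reader might store it.

```python
import math


def parse_value(cell: str, na_keyword: str = "NA") -> float:
    """Parse one continuous data cell, mapping the NA keyword to NaN."""
    # The keyword is compared case-insensitively, as described above.
    if cell.casefold() == na_keyword.casefold():
        return math.nan
    return float(cell)
```

With the default keyword, the cells "NA", "na", and "Na" are all read as missing, while "2.2" is read as the number 2.2.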

type=string

The default variable type. The value string can be any type name defined in GDF.

types=type1:type2 ...

Variable types.

label_cols=n
l=n

The number of columns that are not data values. If the file does not have any label columns, then specify n=0, for example.

header_rows=n
h=n

The number of rows that are not data values. The first n lines will be ignored when reading a file.
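The effect of header_rows and label_cols on a plain delimited file can be pictured with the sketch below. It is not INGOR's reader: the function name `read_plain_table` is hypothetical, all data cells are assumed to be numeric, and missing-value keywords are not handled.

```python
def read_plain_table(text: str, header_rows: int = 0, label_cols: int = 0,
                     delimiter: str = "\t"):
    """Split a delimited table into label cells and numeric data cells."""
    rows = [line.split(delimiter) for line in text.splitlines() if line]
    # The first header_rows lines are skipped entirely.
    body = rows[header_rows:]
    # The first label_cols cells of each remaining row are labels, not data.
    labels = [row[:label_cols] for row in body]
    data = [[float(cell) for cell in row[label_cols:]] for row in body]
    return labels, data
```

For a file whose first line holds column names and whose first column holds row names, calling it with `header_rows=1, label_cols=1` leaves only the numeric block in `data`.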

name_row=n

The line number of the row that contains the column names. This is useful when reading a simple tab-delimited text file, which often has column names in the first row.

row_var

Each row represents a variable. Specify this for reading an EDF file.

col_var

Each column represents a variable. This is the default.

edf

EDF mode (SiGN compatible mode). This sets "PrimaryKeyGroupID" to the secondary key name, so that its values are handled as consecutive time points for the dynamic model, and "SecondaryKeyGroupID" to the primary key name. (Note: The meanings of the primary and secondary IDs are opposite between GDF and EDF.)

write_edf=file

Writes the read data in EDF format into the specified file.

empty

Ignores (allows) empty cells in the data section. By default, an empty cell in the data section of the input file causes an error.

assume_real

Assumes all the variables to be real values. This does not automatically convert discrete values into one-hot vectorized data. For categorical data, this uses the internal indices of categories as their real values.

split_xy

If this is specified, the first half of the samples is regarded as data for the explanatory variables and the second half as data for the objective variables in the regression model. This is mainly intended for the dynamic model, when supplying manually converted data.
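The split described for split_xy can be pictured as follows. This is a sketch only: the function name `split_xy` mirrors the argument name for readability, and raising an error for an odd number of samples is an assumption, since the source does not say how that case is handled.

```python
def split_xy(samples):
    """Split samples into (explanatory half, objective half)."""
    # Assumption: an odd sample count is rejected; INGOR's behavior is not documented here.
    if len(samples) % 2 != 0:
        raise ValueError("split_xy expects an even number of samples")
    half = len(samples) // 2
    return samples[:half], samples[half:]
```

For a manually converted time series, the first half would hold the values at time t and the second half the corresponding values at time t+1, in the same order.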

csv

Uses a comma character as the field delimiter. By default, a TAB character is used.

rhl

This is an alias for row_var,name_row=1,label_cols=1. That is, each row corresponds to a variable, the first row is a header row and represents the names of samples (columns), and the first column represents the names of variables.

chl

This is an alias for col_var,name_row=1,label_cols=1. That is, each column corresponds to a variable, the first row is a header row and represents the names of variables, and the first column represents the names of samples.
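How the rhl and chl aliases expand can be sketched with a small option parser. Everything here is illustrative: the function name `parse_input_args` is hypothetical, the comma-separated syntax is assumed from the way the aliases are written above, and the `label_cols` spelling is taken from the argument list.

```python
# Alias -> expansion, following the descriptions of rhl and chl above.
ALIASES = {
    "rhl": "row_var,name_row=1,label_cols=1",
    "chl": "col_var,name_row=1,label_cols=1",
}


def parse_input_args(spec: str) -> dict:
    """Parse a comma-separated -I argument string into a dict of options."""
    options = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item in ALIASES:
            # Expand the alias in place; later items can still override it.
            options.update(parse_input_args(ALIASES[item]))
            continue
        key, sep, value = item.partition("=")
        # Flags without "=" (e.g. row_var) are stored as True.
        options[key] = value if sep else True
    return options
```

Under these assumptions, `parse_input_args("rhl,na=nan")` gives `{"row_var": True, "name_row": "1", "label_cols": "1", "na": "nan"}`.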

