Expression Data File Format

Format

Gene expression ratios or values are specified over one or more experiments using a text file. Ratios result from a comparison of two expression measurements (experiment vs. control). Some expression platforms, such as Affymetrix, directly measure expression values, without a comparison. The file consists of a header and a number of space- or tab-delimited fields, one line per gene, with the following format:

GeneName [CommonName] ratio1 ratio2 ... ratioN [pval1 pval2 ... pvalN]

Brackets [] indicate fields that are optional. The first two fields are the systematic gene name followed by an optional common name. Expression ratios/values are provided for each experiment, optionally followed by a p-value per experiment or other measure of the significance of each ratio/value, i.e. whether the ratio represents a true change in expression or whether the value is accurately measuring the presence of a gene (according to some statistical model).

Example:
GENE DESCRIPT gal1RG.sig gal2RG.sig gal3RG.sig gal1RG.sig gal2RG.sig gal3RG.sig
YHR051W COX6 -0.034 -0.052 0.152 1.177 0.102 0.857
YHR124W NDT80 -0.090 -0.000 0.041 0.130 0.341 0.061
YKL181W PRS1 -0.167 -0.063 -0.230 -0.233 0.143 0.089

The first line is a header line giving the names of the experimental conditions. Note that each condition is duplicated; the first set of columns gives expression ratios and the second set gives significance values. The significance columns can be omitted if your data doesn't include significance measures. Every remaining row specifies the values for a gene, starting with the formal name of the gene, then a common name, then the ratios, then the significance values.

Some variations on this basic format are recognized: see the formal file format specification below for more information. Expression data files commonly have the file extensions ".mrna" or ".pvals", and these file extensions are recognized by Cytoscape when browsing for data files.

Detailed file format (Advanced Users)

In all expression data files, any whitespace (spaces and/or tabs) is considered a delimiter between adjacent fields. Every line of text is either the header line or contains all the measurements for a particular gene. No name conversion is applied to expression data files.

The names given in the first column of the expression data file should match exactly the names used elsewhere (i.e. in SIF or GML files).

The first line is a header line with one of the following three formats:
<text> <text> cond1 cond2 ... cond1 cond2 ... [NumSigConds]
<text> <text> cond1 cond2 ...
<tab><tab>RATIOS<tab><tab>...LAMBDAS

The first format specifies that both expression ratios and significance values are included in the file. The first two text tokens contain names for each gene. The next token set specifies the names of the experimental conditions; these columns will contain ratio values. This list of condition names must then be duplicated exactly, each spelled the same way and in the same order. Optionally, a final column with the title NumSigConds may be present. If present, this column will contain integer values indicating the number of conditions in which each gene had a statistically significant change according to some threshold.

The second format is similar to the first except that the duplicate column names are omitted, and there is no NumSigConds fields. This format specifies data with ratios but no significance values.

The third format specifies an MTX header, which is a commonly used format. Two tab characters precede the RATIOS token. This token is followed by a number of tabs equal to the number of conditions, followed by the LAMBDAS token. This format specifies both ratios and significance values.

Each line after the first is a data line with the following format:
FormalGeneName CommonGeneName ratio1 ratio2 ... [lambda1 lambda2 ...] [numSigConds]

The first two tokens are gene names. The names in the first column are the keys used for node name lookup; these names should be the same as the names used elsewhere in Cytoscape (i.e. in the SIF or GML files). Traditionally in the gene expression microarray community, who defined these file formats, the first token is expected to be the formal name of the gene (in systems where there is a formal naming scheme for genes), while the second is expected to be a synonym for the gene commonly used by biologists, although Cytoscape does not make use of the common name column. The next columns contain floating point values for the ratios, followed by columns with the significance values if specified by the header line. The final column, if specified by the header line, should contain an integer giving the number of significant conditions for that gene. Missing values are not allowed and will confuse the parser. For example, using two consecutive tabs to indicate a missing value will not work; the parser will regard both tabs as a single delimiter and be unable to parse the line correctly.

Optionally, the last line of the file may be a special footer line with the following format:

NumSigGenes int1 int2 ...

This line specified the number of genes that were significantly differentially expressed in each condition. The first text token must be spelled exactly as shown; the rest of the line should contain one integer value for each experimental condition.