1- Introduction
2- Initialization
3- Upload of main data sources files
4- User list annotation
5- Computing number of genes by unit
6- Chance estimation
7- Global and local statistical tests

Introduction

GexMap successively calls four files of functions, one for each step of the algorithm. The first is gexload.R, which builds the data matrix holding all the data needed for the probes taken from the tested list. Annotations and expression data are compiled into a matrix that is then passed to the mapping() function.
mapping() formats this matrix chromosome by chromosome and produces a formatted genome matrix ready for the statistical tests. For each unit of each chromosome, it counts the ENSEMBL genes and the genes from the tested list. The gextest() function then applies two statistical tests to reveal significant genomic clusters. The global test compares the genomic density of the tested list to the gene presence probability (computed from the ENSEMBL genomic density) over entire chromosomes; the local test makes the comparison at unit level. The last function, gexgraph(), produces PDF files for easy visualization of the results.


Initialization

GexMap first expects three variables: the scale of the graphics, the path to the Rdata source files and the results path.

GexMap then checks whether the user has supplied these variables; if not, the following default values are used:

Scale (ech): 1 000 000 bp.
Source path (path): R working directory /library/gexmap/R-ex/
Results path (res): R working directory /gexmap.results

The folders are created if they do not exist; otherwise, previous result files are replaced.
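
The default handling described above could be sketched in R as follows. This is only an illustration: the function name init_gexmap() and its base argument are assumptions made for the example, not part of GexMap; only the three variables (ech, path, res) and their default values come from this manual.

```r
# Hypothetical helper (not a GexMap function): resolves the three start-up
# variables and applies the documented defaults when a value is missing.
init_gexmap <- function(ech = NULL, path = NULL, res = NULL, base = getwd()) {
  if (is.null(ech))  ech  <- 1e6  # default scale: 1 000 000 bp
  if (is.null(path)) path <- file.path(base, "library", "gexmap", "R-ex")
  if (is.null(res))  res  <- file.path(base, "gexmap.results")

  # Create missing folders. Existing folders are reused, so result files
  # written later replace the previous ones.
  for (d in c(path, res)) {
    if (!dir.exists(d)) dir.create(d, recursive = TRUE)
  }
  list(ech = ech, path = path, res = res)
}

# Example call; tempdir() stands in for the R working directory.
cfg <- init_gexmap(base = tempdir())
```

Passing NULL as the "not given" marker keeps the defaults in one place; a user-supplied value such as init_gexmap(ech = 5e5) simply bypasses the corresponding default.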


Upload of main data sources files

Two files are essential:

- The Data.Rdata file, which contains all the ENSEMBL genome information used as reference by GexMap.
- The user identifier list.

The data.Rdata file is loaded automatically if the source path is valid (load.data() function in the gexload.R file). The path to the user list has to be chosen manually.

The “m” column represents the physical center of the gene. The Data.Rdata file is the matrix of the ENSEMBL genome (34270 genes, of which 21272 are unknown and have only localization information).

The user list is analyzed and then interpreted using the information in its columns. For this purpose, the user list has to be formatted as shown in fig. 1.


1-A) The title of the first column, which contains the probe ids, is used to recognize the microarray technology and type. The microarray type is not required, but it allows probe annotations to be found more quickly. If the technology or database information is missing, or if the technology or microarray type is not referenced, the program stops with an error. The comma is essential to declare a type.


1-B) The second column, titled “expression”, gives supplementary graphic information about the direction of regulation (1 for up-regulation and -1 for down-regulation).
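
As an illustration of points 1-A and 1-B, the sketch below splits a first-column title at the comma and checks the “expression” column. The exact title layout ("technology,type"), the placeholder names "sometech" and "sometype", and both helper functions are assumptions made for this example; GexMap's own parsing may differ.

```r
# Hypothetical parser for the first column title, assumed to read either
# "technology" or "technology,type".
parse_id_header <- function(header) {
  parts <- trimws(strsplit(header, ",", fixed = TRUE)[[1]])
  if (length(parts) == 0 || !nzchar(parts[1])) {
    stop("the first column title must name a microarray technology")
  }
  has_type <- length(parts) >= 2 && nzchar(parts[2])
  list(technology = parts[1],
       type = if (has_type) parts[2] else NA_character_)
}

# Hypothetical check: the expression column may only hold 1 (up) or -1 (down).
check_expression <- function(x) {
  all(x %in% c(1, -1))
}

# Toy user list with placeholder probe ids.
user_list <- data.frame(id = c("probe_a", "probe_b", "probe_c"),
                        expression = c(1, -1, 1))
hdr <- parse_id_header("sometech,sometype")
```

Here hdr holds the technology and the type separately, and a title without a comma yields a missing type rather than an error, which matches the statement that the type is optional.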

The user has to place the appropriate Rdata file in the source folder. If the file is not in the source folder, the user is asked to locate it manually.


User list annotation

The load.corr() function loads the appropriate Rdata file, which contains the matrix corr. The matrix id_carte is created by extracting from corr the information about each probe of the user list. The resulting matrix is saved as liste.ENS in the main program. A first screening excludes the ENS000 probes, which have no ENSEMBL annotation: some microarray identifiers have no corresponding identifier in the ENSEMBL genome, so they are assigned the identifier ENS000 to make them easy to count. A second screening detects and eliminates duplicates: some probes are associated with the same gene, and these duplicates would distort the later computation of the genomic distribution.
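
The two screenings can be pictured with a small R sketch. The toy table below stands in for the corr matrix; its column names (probe, ensembl) and the placeholder identifiers are assumptions for illustration, and only the names corr, id_carte, liste.ENS and ENS000 come from this manual.

```r
# Toy stand-in for the corr matrix (assumed column names).
corr <- data.frame(
  probe   = c("p1", "p2", "p3", "p4", "p5"),
  ensembl = c("ENSG_A", "ENS000", "ENSG_B", "ENSG_A", "ENSG_C"),
  stringsAsFactors = FALSE
)
user_ids <- c("p1", "p2", "p3", "p4")  # probes of the user list

# Keep the annotation rows of the probes present in the user list.
id_carte <- corr[corr$probe %in% user_ids, ]

# First screening: drop probes tagged ENS000 (no ENSEMBL annotation).
annotated <- id_carte[id_carte$ensembl != "ENS000", ]

# Second screening: keep one row per gene so that several probes of the
# same gene are not counted more than once.
liste.ENS <- annotated[!duplicated(annotated$ensembl), ]
```

In this toy case p2 is removed by the first screening and p4 by the second, because it maps to the same gene as p1.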



Computing number of genes by unit

This step is performed by the mapping() function. A single-column matrix lev listing the chromosomes is created so that the genome can be traversed chromosome by chromosome. For each chromosome, the corresponding sub-matrix of liste.ENS is copied into liste.chr. The liste.chr matrix is then ordered by mil and aggregated by unit (round(mil/ech)). For each chromosome, the resulting numbers of ENSEMBL genes, user-list genes and up- or down-regulated genes are saved in the genome.temp matrix.
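
A minimal sketch of the per-unit counting for one chromosome is given below. The toy table, its in_list and expression columns, and the helper count_by_unit() are assumptions for illustration; only ech, mil, liste.chr, genome.temp and the expression round(mil/ech) come from this manual. Units that contain no gene are simply absent from this simplified result.

```r
ech <- 1e6  # unit size in bp (the default scale)

# Toy data for one chromosome: mil is the gene midpoint, in_list flags genes
# of the user list, expression is 1 (up), -1 (down) or NA (not in the list).
liste.chr <- data.frame(
  mil        = c(2.4e6, 0.2e6, 1.1e6, 0.4e6, 2.3e6),
  in_list    = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  expression = c(1, NA, -1, 1, NA)
)

# Order by position, then assign each gene to its unit.
liste.chr <- liste.chr[order(liste.chr$mil), ]
liste.chr$unit <- round(liste.chr$mil / ech)

# Count the TRUE values of a logical flag within each unit.
units <- sort(unique(liste.chr$unit))
count_by_unit <- function(flag) {
  as.vector(tapply(flag, factor(liste.chr$unit, levels = units), sum))
}

genome.temp <- data.frame(
  unit    = units,
  ensembl = count_by_unit(rep(TRUE, nrow(liste.chr))),
  user    = count_by_unit(liste.chr$in_list),
  up      = count_by_unit(liste.chr$expression %in% 1),
  down    = count_by_unit(liste.chr$expression %in% -1)
)
```

Each row of genome.temp then describes one unit of the chromosome, which is the shape the later chance estimation and tests rely on.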



Chance estimation

To detect regions of potential interest, GexMap estimates the number of genes that would be present by chance and compares this genomic distribution to that of the user gene list. Chance is estimated by adjusting the genomic distribution of the ENSEMBL genome to the scale of the user gene list. The ratio used to estimate the number of genes expected by chance in each unit is the total number of genes in the ENSEMBL genome divided by the total number of genes in the user list. The chance distribution column is appended to the genome matrix.


This chance distribution is later used by the gexgraph() function as the graphic representation of chance.
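
The scaling can be illustrated with a short R sketch. It assumes that the expected count of a unit is obtained by dividing its ENSEMBL gene count by the ratio described above, which is the natural reading of the text but is not spelled out in it. The table and its values are made up for the example.

```r
# Toy genome matrix with per-unit counts (made-up values).
genome <- data.frame(
  unit    = 0:3,
  ensembl = c(40, 10, 30, 20),  # ENSEMBL genes per unit
  user    = c(3, 0, 6, 1)       # user-list genes per unit
)

# Ratio: total ENSEMBL genes / total user-list genes.
ratio <- sum(genome$ensembl) / sum(genome$user)

# Assumed use of the ratio: scale the ENSEMBL density down to the size of
# the user list to get the number of genes expected by chance per unit.
genome$chance <- genome$ensembl / ratio
```

With this scaling the chance column sums to the size of the user list, so the observed and expected distributions are directly comparable unit by unit.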


Global and local statistical tests

The gextest() function statistically tests the differences between the user list distribution and the distribution expected by chance. Two tests are applied independently. The first is a binomial test, which compares the distributions at unit level. It is applied across the genome matrix unit by unit, but only to units of interest. A loop over the whole genome matrix finds all regions of interest, that is, units containing more genes than expected by chance.


Units of interest and units with a positive test result are plotted in the PDF files on the same graphs as the genomic density curves.
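
One possible form of the unit-level test is sketched below with R's binom.test(). The parameterization is an assumption, not taken from the GexMap source: each gene of the user list is taken to fall in a given unit with a probability equal to that unit's share of ENSEMBL genes, the test is one-sided, and the 0.05 threshold is borrowed from the one this manual states for the global test. The data are made up.

```r
# Toy genome matrix (made-up values).
genome <- data.frame(
  unit    = 0:3,
  ensembl = c(40, 10, 30, 20),
  user    = c(3, 0, 6, 1)
)
n_user    <- sum(genome$user)
n_ensembl <- sum(genome$ensembl)
genome$chance <- genome$ensembl * n_user / n_ensembl

# Units of interest: more user-list genes than expected by chance.
genome$interest <- genome$user > genome$chance

# One-sided binomial test, run only on the units of interest.
genome$p_value <- NA_real_
for (i in which(genome$interest)) {
  genome$p_value[i] <- binom.test(
    x = genome$user[i],                 # observed genes in the unit
    n = n_user,                         # size of the user list
    p = genome$ensembl[i] / n_ensembl,  # assumed presence probability
    alternative = "greater"
  )$p.value
}

# Assumed significance threshold of 0.05.
genome$significant <- !is.na(genome$p_value) & genome$p_value <= 0.05
```

Restricting the loop to which(genome$interest) mirrors the statement that the test is run only for units of interest; all other units keep a missing p-value.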

The second test is the chi-square test, which compares the user list gene distribution to the chance distribution chromosome by chromosome. This test can be applied at different unit sizes, so it is applied with progressively incremented unit sizes.



The scale is incremented progressively in a loop, and at each step the distribution is recomputed by aggregation before the chi-square test is applied. The test produces one result per chromosome and per scale, which is placed in the g_test matrix.



Only p-values <= 0.05 are reported in the g_test matrix. This matrix is then plotted and saved in a PDF file.
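
The multi-scale loop for a single chromosome might look like the sketch below, written with R's chisq.test() as a goodness-of-fit test against the ENSEMBL-derived proportions. The helper merge_units(), the range of scales and the toy counts are assumptions for illustration; only the g_test name, the incremented scale and the 0.05 cut-off come from this manual.

```r
# Toy per-unit counts for one chromosome (made-up values).
user    <- c(3, 0, 6, 1, 2, 0, 5, 3)
ensembl <- c(40, 10, 30, 20, 25, 15, 35, 25)

# Merge k adjacent units into one larger unit by summing their counts.
merge_units <- function(x, k) {
  as.vector(tapply(x, ceiling(seq_along(x) / k), sum))
}

scales <- 1:4  # multiples of the base unit size (assumed range)
g_test <- data.frame(scale = scales, p_value = NA_real_)

for (j in seq_along(scales)) {
  obs  <- merge_units(user, scales[j])
  dens <- merge_units(ensembl, scales[j])
  keep <- dens > 0         # units without ENSEMBL genes have no expectation
  if (sum(keep) < 2) next  # the test needs at least two categories
  # Small expected counts make chisq.test() warn about the approximation.
  p <- suppressWarnings(
    chisq.test(obs[keep], p = dens[keep] / sum(dens[keep]))$p.value
  )
  if (p <= 0.05) g_test$p_value[j] <- p  # report significant results only
}
```

In a full run the same loop would be repeated for every chromosome, giving one entry per chromosome and per scale.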