Abstract

The microarray data for gene expression profiling stored in the Gene Expression Omnibus (GEO) are massive and ever-increasing, and how to effectively mine valuable biological information from these data remains a concern. Gene Set Enrichment Analysis (GSEA), a powerful statistical method proposed by Subramanian et al., has been widely used to interpret high-throughput gene expression data. However, current studies usually analyze only a few data sets, resulting in low reproducibility. Moreover, the inconsistent format of annotation information in GEO impedes the application of GSEA by bench biologists. In response to these challenges, we developed an R package, rGEO, which universally maps probes in GEO microarray data to HUGO gene symbols, and built a user-friendly web application, qGSEA, which converts raw data in GEO into GSEA input files. Using the LEM4 gene as an example, we performed GSEA on 883 microarray data sets with these tools and, after filtering, obtained several gene sets highly correlated with the expression level of LEM4. Some of the information derived from them was consistent with our previous research or published work, while other findings might provide novel biological insights. We also found that the overall distribution of the significance level of the enrichment results across all data sets shows some interesting trends that are difficult to detect when analyzing merely a few data sets. In summary, we introduced a set of convenient tools that facilitate the mining of the abundant gene expression data in GEO and made an early attempt to analyze a large number of data sets simultaneously using GSEA. Both rGEO and qGSEA are released under the AGPL-3.0 license, along with the scripts used in this research, at https://github.com/dongzhuoer/thesis.

Keywords: LEM4 gene; Gene Set Enrichment Analysis; DNA microarray