Statistical identification of gene association by CID in application of constructing ER regulatory network

BMC Bioinformatics. 2009 Mar 17:10:85. doi: 10.1186/1471-2105-10-85.

Abstract

Background: A variety of high-throughput techniques are now available for constructing comprehensive gene regulatory networks in systems biology. In this study, we report a new statistical approach for facilitating in silico inference of regulatory network structure. The new measure of association, coefficient of intrinsic dependence (CID), is model-free and can be applied to both continuous and categorical distributions. Given two variables X and Y, CID assesses whether Y depends on X by examining the conditional distribution of Y given X. In this paper, we apply CID to analyze the regulatory relationships between transcription factors (TFs) (X) and their downstream genes (Y) based on clinical data. More specifically, we use estrogen receptor alpha (ERalpha) as the variable X, and the analyses are based on 48 clinical breast cancer gene expression arrays (48A).

Results: The analytical utility of CID was evaluated in comparison with four commonly used statistical methods: Galton-Pearson's correlation coefficient (GPCC), Student's t-test (STT), coefficient of determination (CoD), and mutual information (MI). Compared with GPCC, CoD, and MI, CID is better able to discover regulatory associations where the mRNA expression levels of X and Y do not fit linear models. On the other hand, when CID is used to measure the association of a continuous variable (Y) with a discrete variable (X), it performs similarly to STT and appears to outperform CoD and MI. In addition, this study established a two-layer transcriptional regulatory network to exemplify the use of CID, in combination with GPCC, in deciphering gene networks based on gene expression profiles from patient arrays.
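The contrast between a linear measure and one based on the conditional distribution of Y given X can be illustrated with a small sketch. The code below does not implement CID, whose definition is not given in this abstract. As a stand-in it uses a correlation ratio computed over quantile bins of X, and it runs on synthetic data rather than on the 48 arrays. The function names, the bin count, and the quadratic test relationship are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Galton-Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_ratio(x, y, n_bins=5):
    """Share of the variance of y explained by the quantile bin of x.

    Illustrative stand-in for a measure that inspects the conditional
    distribution of Y given X. This is NOT the CID statistic.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    # Split the indices, ordered by x, into n_bins groups of near-equal size.
    bins = [order[k * n // n_bins:(k + 1) * n // n_bins] for k in range(n_bins)]
    my = sum(y) / n
    total = sum((b - my) ** 2 for b in y)
    between = 0.0
    for idx in bins:
        if not idx:
            continue
        bin_mean = sum(y[i] for i in idx) / len(idx)
        between += len(idx) * (bin_mean - my) ** 2
    return between / total


# Synthetic, hypothetical data: y depends on x, but not linearly.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
y = [a * a + rng.gauss(0.0, 0.05) for a in x]

r = pearson(x, y)
eta2 = correlation_ratio(x, y)
print(f"Pearson r = {r:.3f}, correlation ratio = {eta2:.3f}")
```

On data like this, the Pearson coefficient stays close to zero because the relationship is symmetric about x = 0, while the binned measure is large because the mean of y shifts from bin to bin. This is the kind of non-linear dependence that a measure based on the conditional distribution can detect and a linear measure can miss.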

Conclusion: CID is shown to provide useful information for identifying associations between genes and transcription factors of interest in patient arrays. When coupled with the relationships detected by GPCC, the associations predicted by CID are applicable to the construction of transcriptional regulatory networks. This study shows how information from different data sources and learning algorithms can be integrated to investigate whether relevant regulatory mechanisms identified in cell models can also be partially re-identified in clinical samples of breast cancers.

Availability: The implementation of CID in R code can be freely downloaded from http://homepage.ntu.edu.tw/~lyliu/BC/.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Breast Neoplasms / genetics
  • Breast Neoplasms / metabolism
  • Computational Biology / methods*
  • Data Interpretation, Statistical
  • Estrogen Receptor alpha / genetics
  • Estrogen Receptor alpha / metabolism*
  • Female
  • Gene Expression Profiling / methods
  • Gene Regulatory Networks / genetics*
  • Humans
  • Oligonucleotide Array Sequence Analysis / methods
  • Systems Biology

Substances

  • Estrogen Receptor alpha