Statistical applications in genetics and molecular biology

A novel method to prioritize RNAseq data for post-hoc analysis based on absolute changes in transcript abundance.

PMID 25781714


The use of fold-change (FC) to prioritize differentially expressed genes (DEGs) for post-hoc characterization is a common technique in the analysis of RNA sequencing datasets. However, the use of FC can overlook certain population of DEGs, such as high copy number transcripts which undergo metabolically expensive changes in expression yet fail to exceed the ratiometric FC cut-off, thereby missing potential important biological information. Here we evaluate an alternative approach to prioritizing RNAseq data based on absolute changes in normalized transcript counts (ΔT) between control and treatment conditions. In five pairwise comparisons with a wide range of effect sizes, rank-ordering of DEGs based on the magnitude of ΔT produced a power curve-like distribution, in which 4.7-5.0% of transcripts were responsible for 36-50% of the cumulative change. Thus, differential gene expression is characterized by the high production-cost expression of a small number of genes (large ΔT genes), while the differential expression of the majority of genes involves a much smaller metabolic investment by the cell. To determine whether the large ΔT datasets are representative of coordinated changes in the transcriptional program, we evaluated large ΔT genes for enrichment of gene ontologies (GOs) and predicted protein interactions. In comparison to randomly selected DEGs, the large ΔT transcripts were significantly enriched for both GOs and predicted protein interactions. Furthermore, enrichments were were consistent with the biological context of each comparison yet distinct from those produced using equal-sized populations of large FC genes, indicating that the large ΔT genes represent an orthagonal transcriptional response. Finally, the composition of the large ΔT gene sets were unique to each pairwise comparison, indicating that they represent coherent and context-specific responses to biological conditions rather than the non-specific upregulation of a family of genes. These findings suggest that the large ΔT genes are not a product of random or stochastic phenomenon, but rather represent biologically meaningful changes in the transcriptional program. They furthermore imply that high abundance transcripts are associated with particularly cellular states, and as cells change in response to internal or external conditions, the relative distribution of the abundant transcripts changes accordingly. Thus, prioritization of DEGs based on the concept of metabolic cost is a simple yet powerful method to identify biologically important transcriptional changes and provide novel insights into cellular behaviors.