They display activity marks as well as the enhancer marks H3K4me1 and p These promoters lack the activity marks. The top of the figure shows density plots for the histone marks discussed. All data for GM cell line.

There has long been keen interest in determining the principles of human promoter architecture. Since sequence motifs alone have not sufficed to achieve this, we turned to analyzing transcription factor occupancy in promoters, as determined in the ENCODE project, in conjunction with sequence.
Our study fills a gap between sequence-based promoter studies e. Other authors have studied the global regulatory network [ 47 ] or focused on NF-Y and its co-factors [ 6 ]. Our goal was the delineation of subsets of stereotypical human promoter architectures. While Sp1 binding sites occur ubiquitously, we have shown that this combination is highly characteristic of a subgroup of promoters.
Another subgroup is characterized by the basic helix-loop-helix factor USF binding to its binding site, the E-box. The NF-Y cluster and the USF cluster appear almost mutually exclusive, reinforcing the view that these two groups constitute two characteristic promoter architectures. At the same time, there is a large group of promoters that apparently lack any characteristic TFs or TF-combinations, but clearly show the activating histone modifications.
While the groups discussed are the result of a clearly visible clustering pattern, there are many more promoters for which, at least with the available transcription factor ChIP-seq experiments, no cluster structure is visible. We conclude that certain characteristic promoter types exist, but only a part of all promoters falls into any of these subgroups. A further promoter cluster is characterized by binding of CTCF in the promoter region.
For this group, we presented evidence suggesting that CTCF and cohesin are responsible for the interaction between the bound promoter and one or more enhancers. It remains to be seen whether the two arrangements have different functional consequences. We had also expected to see a stronger role for the repressive mark H3K9me3.
However, the signal for this mark is low across all experiments (see Additional file 1 : Figure S1), and it is unclear whether its absence actually constitutes a biological signal or is due to technical issues. Other authors [ 6 , 16 ] investigated whether genes with promoters bound by NF-Y belong to particular functional classes of genes as given, e.
We also tested our derived promoter clusters for such target categories but the results remain rather generic and unconvincing. Although each study, including our own, finds some enriched GO term, those functions are either not consistent within a cluster or they are so generic that they are hardly informative. Thus, we suspect that TF combinatorics might actually lack a systematic link to functional categories.
In summary, basing this analysis on a combination of TF occupancy and motif analysis, we have defined stereotypic patterns delineating a novel grouping of promoters. This grouping opens up interesting new questions concerning the transcription factor complexes at the respective promoter groups, as well as questions on the evolutionary origin of these groups.
Cases where two or more TSSs lie within a 1 Kbp window are ignored altogether. This leaves annotated TSSs. TSSs are divided into active and inactive ones. We further discard inactive TSSs when located in an intergenic region. In the end, we are left with inactive genes in GM and inactive promoters in K We apply cufflinks [ 52 ] with default parameters using RefSeq annotation described before on all the replicates.
This data-set comprises ChIP-seq data for 11 histone modifications and 49 TFs or chromatin-associated proteins. A detailed list is given in Additional file 1 : Table S1 and S2.
This value we normalize within each ChIP-seq experiment by linearly mapping the peak heights to the interval [0,1]. We reduce the influence of outliers by holding out the top 0. If more than one loop is found interacting with a promoter region, the loop with the highest H3K27ac signal at the remote CTCF binding region is chosen.
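The normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the held-out top values are capped at the largest retained peak height (the text does not say how they are treated). The names `normalize_peaks` and `holdout_fraction` are hypothetical, and the fraction is left to the caller because its value is not recoverable from the text.

```python
def normalize_peaks(heights, holdout_fraction):
    """Linearly map the peak heights of one experiment to [0, 1].

    The largest `holdout_fraction` of the values are treated as outliers
    and capped (an assumption; the source does not specify their handling).
    """
    if not heights:
        return []
    ranked = sorted(heights)
    n_out = int(len(ranked) * holdout_fraction)  # number of held-out peaks
    n_out = min(n_out, len(ranked) - 1)          # always retain at least one value
    cap = ranked[len(ranked) - 1 - n_out]        # largest retained height
    low = ranked[0]
    span = cap - low
    if span == 0:
        # All retained values are identical; nothing to scale.
        return [0.0 for _ in heights]
    return [(min(h, cap) - low) / span for h in heights]
```

With ten peaks and a hold-out fraction of 0.1, the single largest peak is capped, so one extreme value no longer compresses the remaining scores toward zero.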
As a resource for TF binding motifs, we used the Jaspar database [ 26 ]. If there is more than one hit in a promoter region, we take the hit with the smallest p value for further analysis.
Our method for clustering the rows and columns of the ChIP-seq data matrix into a visually understandable heatmap is based on a robust version of the s4vd biclustering algorithm [ 56 ], which is available as an R package. The result of biclustering a matrix is a coupled set of column and row clusters, where these cluster pairs are visible as sub-rectangles of the matrix with rows and columns permuted correspondingly.
We use the default parameter settings of s4vd. Due to its use of a randomized selection step, s4vd produces somewhat different results in different runs of the program. We exploit this with the goal of obtaining robust biclustering results by extracting the common cluster assignments from many runs. To this end, we run s4vd many times and determine the solution with the largest number of column clusters (ties are broken randomly). We call this the target clustering.
Then, for each of the other solutions obtained from the other runs of s4vd, we determine an optimal assignment of its clusters to the target clustering. This is done by solving a linear assignment problem on the confusion matrix of the two cluster systems. The linear assignment problem is solved using lp. Once all the clusters from all solutions are assigned to the target clustering, we determine the frequency at which a column gets mapped into a target cluster.
Finally, we only keep columns for which there exists a target cluster to which it gets mapped with a frequency of more than 0. In a biclustering result, there exists a connection between a column and a row cluster. Thus, once the column clusters are fixed by the above procedure, one can also determine how often a particular row gets associated with a row cluster. Again, we keep those rows that are associated with one cluster in more than half of the solutions, and assign each such row accordingly.
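The consensus step for the column clusters can be sketched as below. This is an illustration under stated assumptions, not the published implementation: the linear assignment problem is solved by brute force over permutations (adequate only for a handful of clusters) rather than with a linear-programming solver, every run is assumed to use the same number of clusters `k`, and the function names and the `min_freq` threshold are hypothetical.

```python
from collections import Counter
from itertools import permutations


def confusion_matrix(target, other, k):
    """Count how often target label t co-occurs with label o of another run."""
    m = [[0] * k for _ in range(k)]
    for t, o in zip(target, other):
        m[t][o] += 1
    return m


def best_assignment(target, other, k):
    """Map each label of `other` to a target label so that total overlap is maximal."""
    m = confusion_matrix(target, other, k)
    best = max(permutations(range(k)),
               key=lambda perm: sum(m[perm[o]][o] for o in range(k)))
    return {o: best[o] for o in range(k)}


def consensus(target, runs, k, min_freq=0.5):
    """Keep a column only if it maps to one target cluster in more than
    `min_freq` of the runs (hypothetical threshold); otherwise return None.
    `runs` must contain at least one clustering."""
    votes = [Counter() for _ in target]
    for run in runs:
        mapping = best_assignment(target, run, k)
        for col, label in enumerate(run):
            votes[col][mapping[label]] += 1
    result = []
    for counter in votes:
        label, count = counter.most_common(1)[0]
        result.append(label if count / len(runs) > min_freq else None)
    return result
```

Because cluster labels are arbitrary in each run, the assignment step is what makes the votes comparable; without it, two runs that agree perfectly but number their clusters differently would appear to disagree.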
This also leads to the t test p values reported next to our heatmaps: For one row, the test compares the mean of the values among the entries in the associated column cluster with the mean of the values outside the column cluster. This visualization is a further precaution against over-interpretation of the biclustering results.
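The per-row comparison can be illustrated with a short sketch. The text does not say which variant of the t test is used; the code below assumes Welch's unequal-variance form and computes only the test statistic (a p value would follow from the corresponding t distribution). The function name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(row, cluster_columns):
    """Welch t statistic comparing row entries inside vs. outside a column cluster.

    `row` is one row of the data matrix; `cluster_columns` is the set of
    column indices belonging to the associated column cluster.
    Each group needs at least two values and non-zero pooled spread.
    """
    inside = [v for j, v in enumerate(row) if j in cluster_columns]
    outside = [v for j, v in enumerate(row) if j not in cluster_columns]
    # Standard error of the difference of means with unequal variances.
    se = sqrt(variance(inside) / len(inside) + variance(outside) / len(outside))
    return (mean(inside) - mean(outside)) / se
```

A large positive value indicates that the row is markedly higher inside the column cluster than outside it, which is the pattern a convincing bicluster should show.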
We uploaded our original data matrix and source code to GitHub [ 58 ]. To further ensure that the computational results which we interpret are not due to the specifics of our algorithm, we applied a second biclustering algorithm which is based on a very different computational principle. Additional file 1 : Figure S21 shows how the value of k was selected. To then determine the association of a row with a cluster, we test each row with a t test.
As in the visualization procedure described above, the t test measures how far a particular column cluster divides the row of active promoters into two different regimes of high and low values, respectively. For a given column cluster, the rows with a t test p value better than 0.
Both biclustering procedures are depicted graphically in Additional file 1 : Figure S Additional file 1 : Figure S23 shows the confusion matrices between the biclustering resulting from the two algorithms. It is apparent that the clusters that we have interpreted are stably reproduced also by the k -means t test-based algorithm. To not rely solely on the RefSeq promoter annotation, we also alternatively use the same CAGE tag data as above to define promoter location.
There are TSSs in this annotation. In particular, it includes all possible isoforms and a gene might have several possible TSSs in a very short region, see example Figure in Additional file 1 : Figure S In this example, we have 6 TSS for Nat10 within an interval of bases. Although the annotation might be very accurate i. If there is one or more robust peaks of a cell line in one TSS region, then the TSS cluster is deemed active in this cell line.
Note that this step is strand specific.

Werner T. Models for prediction and recognition of eukaryotic promoters. Mamm Genome. Kadonaga JT. Wiley Interdiscip Rev Dev Biol.
Ohler U. Identification of core promoter modules in Drosophila and their application in accurate transcription start site prediction. Nucleic Acids Res.
Bucher P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from unrelated promoter sequences. J Mol Biol. Clustering of DNA sequences in human promoters. Genome Res. Discovery of novel human gene regulatory modules from gene co-expression and promoter motif analysis. Sci Rep. Antequera F.
Structure, function and evolution of CpG island promoters. Cell Mol Life Sci. Computational identification of promoters and first exons in the human genome. Nat Genet. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci. Consortium EP, et al.
An integrated encyclopedia of DNA elements in the human genome. Discovery and validation of information theory-based transcription factor and cofactor binding site motifs. Hocomoco: expansion and enhancement of the collection of transcription factor binding sites models. Kheradpour P, Kellis M. Systematic discovery and characterization of regulatory motifs in encode tf binding experiments.
Sequence features and chromatin structure around the genomic regions bound by human transcription factors. A high definition look at the nf-y regulome reveals genome-wide associations with selected transcription factors. Giannopoulou EG, Elemento O. Inferring chromatin-bound protein complexes from genome-wide binding assays. Architecture of the human regulatory network derived from encode data.
NCBI reference sequences refseq : a curated non-redundant sequence database of genomes, transcripts and proteins. Consortium TF, et al. A promoter-level mammalian expression atlas. Biclustering algorithms for biological data analysis: a survey. Discovering statistically significant biclusters in gene expression data. Mantovani R. NF-Y coassociates with FOS at promoters, enhancers, repetitive elements, and inactive chromatin regions, and is stereo-positioned with growth-controlling transcription factors.
Interaction between the two ubiquitously expressed transcription factors NF-Y and Sp1. In: Encyclopedia of Cancer. Berlin: Springer. Hardin PE. Transcription regulation within the circadian clock: the E-box and beyond.
J Biol Rhythm. The SRF accessory protein Elk-1 contains a growth factor-regulated transcriptional activation domain. Sharrocks AD. The ETS-domain transcription factor family. Nat Rev Mol Cell Biol. ZNF provides sequence specificity to secure chromatin interactions at gene promoters.
Different studies investigated these two statistical features separately, reaching minimal consensus despite sustained efforts. Here we unravel previously unknown symmetries in genetic sequences, which are organized hierarchically through scales in which non-random structures are known to be present. These observations are confirmed through the statistical analysis of the human genome and explained through a simple domain model.
These results suggest that domain models which account for the cumulative action of mobile elements can explain simultaneously non-random structures and symmetries in genetic sequences. Compositional inhomogeneity at different scales has been observed in DNA since the early discoveries of long-range spatial correlations, pointing to a complex organisation of genome sequences 1 , 2 , 3.
While the mechanisms responsible for these observations have been intensively debated 4 , 5 , 6 , 7 , 8 , 9 , several investigations indicate the patchiness and mosaic-type domains of DNA as playing a key role in the existence of large-scale structures 4 , 10 , In its simplest form, it states that on a single strand the frequency of a nucleotide is approximately equal to the frequency of its complement 16 , 17 , 18 , 19 , While the first Chargaff parity rule 23 valid in the double strand was instrumental for the discovery of the double-helix structure of the DNA, of which it is now a trivial consequence, the second Chargaff parity rule remains of mysterious origin and of uncertain functional role.
Different mechanisms that attempt to explain its origin have been proposed during the last decades 19 , 24 , 25 , 26 , Among them, an elegant explanation 27 , 28 proposes that strand symmetry arises from the repetitive action of transposable elements. Therefore, the mechanism shaping the complex organization of genome sequences could be, in principle, different and independent from the mechanism enforcing symmetry.
However, the proposal of transposable elements 29 , 30 as being a key biological process in both cases suggests that these elements could be the vector of a deeper connection. In this paper we start with a review of known results on statistical symmetries of genetic sequences and proceed to a detailed analysis of the set of chromosomes of Homo sapiens. Our main empirical findings are: (i) the Chargaff parity rule extends beyond the frequencies of short oligonucleotides, remaining valid on scales where non-trivial structure is present; and (ii) Chargaff is not the only symmetry present in genetic sequences as a whole, and there exists a hierarchy of symmetries nested at different structural scales.
We then propose a model to explain these observations. The key ingredient of our model is the reverse-complement symmetry for domain types, a property that can be related to the action of transposable elements indiscriminately on both DNA strands. Domain models have been used to explain structures e. For instance, we may be interested in the frequency of the codon ACT in a given chromosome. The frequency of occurrence of an observable X in the sequence s is obtained by counting how often it appears as the starting point i varies along the sequence.
All major statistical quantities numerically investigated in literature can be expressed in this form, as we will recall momentarily. We start our exploration of different symmetries S with a natural extension to observables X of the reverse-complement symmetry considered by Chargaff.
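As a concrete illustration of counting an observable over all starting points, the sketch below computes the frequency of an oligonucleotide and of its reverse complement on a single strand. It assumes the frequency is normalized by the number of admissible starting positions (the exact normalization is not preserved in the text above); the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Read the word backwards and replace each base by its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def frequency(sequence, word):
    """Fraction of starting positions i at which `word` occurs in `sequence`."""
    n_positions = len(sequence) - len(word) + 1
    if n_positions <= 0:
        return 0.0
    hits = sum(1 for i in range(n_positions) if sequence.startswith(word, i))
    return hits / n_positions
```

For the codon ACT, the reverse-complement partner is AGT, so the second Chargaff parity rule would predict `frequency(s, "ACT")` to be approximately equal to `frequency(s, "AGT")` on a long single strand `s`.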
For our more general case it is thus natural to consider that the observable symmetric to. One of the goals of our manuscript is to investigate the validity of Eq. By combining P X of different observables X this extended Chargaff symmetry applies to the main statistical analyses already investigated in literature, unifying numerous previously unrelated observations of symmetries. As paradigmatic examples we have:.
In the specific case of dinucleotides, such a relation has been remarked in ref. This symmetry was observed for oligonucleotides in ref. We now investigate the existence of new symmetries in the human genome. In order to compare results for pairs X_A, X_B with different abundance, we normalise our observable by the expectation under independent appearance of X_A, X_B obtaining. Symmetrically related cross-correlations in Homo sapiens - Chromosome 1.
In order to understand the observations reported above it is necessary to formalize the symmetries that arise as composition of basic transformations.
R and C are involutions, and CRC is the symmetry equivalent to equation 3. A symmetry S is defined by a set of different compositions of C and R. The four symmetries we consider here are shown in Fig. We can now come back to Fig. Similar results are obtained for all choices of dinucleotides and for all chromosomes see SI: Supplementary data These results suggest that: i the extended Chargaff symmetry we conjectured in Eq.
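The involution property can be checked directly on short words. The sketch below assumes that R denotes reversal of the reading order and C denotes base-wise complementation, acting here on plain oligonucleotides rather than on the more general observables X; the full definitions are not preserved in the text above, so this is an illustration only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def R(word):
    """Reverse the reading order (assumed meaning of R)."""
    return word[::-1]


def C(word):
    """Complement every base (assumed meaning of C)."""
    return "".join(COMPLEMENT[base] for base in word)


def compose(*transforms):
    """Return the composition that applies the rightmost transform first."""
    def composed(word):
        for transform in reversed(transforms):
            word = transform(word)
        return word
    return composed
```

Under these assumptions `compose(R, R)` and `compose(C, C)` both act as the identity, and R and C commute, so `compose(R, C)` and `compose(C, R)` give the same reverse-complement word.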
Sets of symmetrically related observables. Figure 3 shows the results for chromosome 1 and confirms the existence of a hierarchy of symmetries at different structural scales. Note that L D and L M are compatible with the known average-size of transposable elements and isochores respectively 36 , Moreover, the results for all Homo-Sapiens chromosomes, summarised in Fig. This, and the scales of L D , L S , and L M , provide a hint on the origin of our observations, which we explore below through the proposal of a minimal model.
Hierarchy of symmetries in Homo Sapiens - Chromosome 1. Hierarchy of symmetries in Homo Sapiens - All chromosomes. We construct a minimal domain model for DNA sequences s that aims to explain the observations reported above. The key ingredient of our model is the reverse-complement symmetry of domain-types, suggested by the fact that transposable elements act on both strands.
Mobile elements are recognised to play a central role in shaping domains and other structures up to the scale of a full chromosome, as well as being considered responsible for the appearance of Chargaff symmetry Our model accounts for structures e. We do not impose a priori restrictions or symmetries on this process. We consider that one realization of this process builds a domain of type p. Structure and symmetry at different scales: domain model. The biological processes that shape domains impose that, in each macrostructure, the types of domains come in symmetric pairs.
We now show how the model proposed above accounts for our empirical observation of a nested hierarchy of four symmetries S 1 - S 4 at different scales.
We start by generating a synthetic sequence for a particular choice of parameters of the model described above (see section Methods for details). Figure 6 shows that such a synthetic sequence reproduces the same hierarchy of symmetries we detected in Homo sapiens. Hierarchy of symmetries in a synthetic sequence generated by the domain model. The analysis of a synthetic genetic sequence generated by our model reproduces the hierarchy of symmetries observed in the human genome (compare the two panels to Figs 1 and 3).
The synthetic sequence is obtained following steps 1 — 3 of the main text. We now argue analytically why these results are expected. This is compatible with the conjecture 4. Therefore, in addition to the previous symmetries, C is valid.
On the other hand, our conjectured Chargaff symmetry, Eq. The complement symmetry in double-strand genetic sequences, known as the First Chargaff Parity Rule, is nowadays a trivial consequence of the double-helix assembly of DNA. However, from a historical point of view, the symmetry was one of the key ingredients leading to the double-helix solution of the complicated genetic structure puzzle, demonstrating the fruitfulness of a unified study of symmetry and structure in genetic sequences.
In a similar fashion, here we show empirical evidence for the existence of new symmetries in the DNA Figs 1 — 4 and we explain these observations using a simple domain model whose key features are dictated by the role of transposable elements in shaping DNA.
In view of our model, our empirical results can be interpreted as a consequence of the action of transposable elements that generate a skeleton of symmetric domains in DNA sequences. Since domain models are known to explain also much of the structure observed in genetic sequences, our results show that structural complex organisation of single-strand genetic sequences and their nested hierarchy of symmetries are manifestations of the same biological processes.
We expect that future unified investigations of these two features will shed light on their evolutionary and functional roles, which remain incompletely understood. For this aim, it is crucial to extend the analyses presented here to organisms of different complexity. In parallel, we speculate that the unraveled hierarchy of symmetry at different scales could play a role in understanding how chromatin is spatially organised, related to the puzzling functional role of long-range correlations 41 , We create synthetic genetic sequences through the following implementation of the three steps of the model we proposed above:
We use the processes p to generate chunks of average size (the length of each chunk was drawn uniformly in the range [, ]). We concatenate two different macrostructures, obtained from steps 1 and 2 with two different matrices M I and M II :
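The generation procedure can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the authors' implementation: each domain is produced by a first-order Markov chain, a domain and its reverse complement are chosen with equal probability to mimic symmetric pairs of domain types, and the transition weights in `EXAMPLE_MATRIX`, the chunk-length bounds, and all function names are hypothetical (the actual parameters are not recoverable from the text).

```python
import random

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical transition weights: key = current base, entries = weights for A, C, G, T.
EXAMPLE_MATRIX = {
    "A": [0.4, 0.2, 0.2, 0.2],
    "C": [0.3, 0.3, 0.2, 0.2],
    "G": [0.2, 0.2, 0.3, 0.3],
    "T": [0.1, 0.2, 0.3, 0.4],
}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def markov_chunk(matrix, length, rng):
    """Generate one domain of `length` bases from a first-order Markov chain."""
    state = rng.choice(BASES)
    out = [state]
    for _ in range(length - 1):
        state = rng.choices(BASES, weights=matrix[state])[0]
        out.append(state)
    return "".join(out)


def macrostructure(matrix, n_chunks, min_len, max_len, rng):
    """Concatenate chunks whose lengths are drawn uniformly from [min_len, max_len].

    Each chunk is kept as generated or replaced by its reverse complement
    with probability 1/2, so domain types appear in symmetric pairs.
    """
    chunks = []
    for _ in range(n_chunks):
        chunk = markov_chunk(matrix, rng.randint(min_len, max_len), rng)
        if rng.random() < 0.5:
            chunk = reverse_complement(chunk)
        chunks.append(chunk)
    return "".join(chunks)
```

A full synthetic sequence in the spirit of the text would concatenate two such macrostructures built from two different matrices, for example `macrostructure(m1, ...) + macrostructure(m2, ...)` with a shared `random.Random` instance for reproducibility.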
We used reference assembly build

Researchers have also identified several less-easily explainable phenotypic associations with Neanderthal introgression. In their analysis, for example, Kelso and Dannemann found that Neanderthal variants were associated with chronotype (whether people identify as early birds or night owls) as well as with susceptibility to feelings of loneliness or isolation and low enthusiasm or interest.
Why these associations exist is still a mystery. Kelso suspects that light might be a unifying factor, with both changes in day-length patterns and UV exposure reductions as they moved to more-northern latitudes. Even with more straightforward associations, such as with skin traits or immune responses, conclusions thus far are drawn from correlations between genotypes and phenotypes.
Neanderthal variants tend to come in packages, and the linkage between the variants makes it difficult to identify the function of each one, he explains.
The early data suggest that the Neanderthal variants affect gene expression in the same way as documented by previous work, validating the model. But such research is still in the proof-of-principle stage, says Camp, who is continuing this work in his own lab in Switzerland. You need to do this for or individuals. There are other fundamental questions that are proving difficult to answer about Neanderthal introgression, says Akey, from the number of hybridization events to the timescale over which those events took place, and whether there was sex bias in patterns of gene flow.
A second high-quality Neanderthal genome was published in Science , —58 , and researchers now have the genome of a 40,year-old human who had a Neanderthal ancestor just a few generations back. Last year, researchers published the sequence of a first-generation hybrid of Denisovans and Neanderthals. Those data will likely yield some surprises.
Capra has found evidence, for example, that some of the Neanderthal segments that correlated with modern phenotypes may not affect those phenotypes directly.
His work has uncovered cases in which the correlation was driven by sequences close enough in the genome to Neanderthal variants that the two always appear together. These sequences were carried by the common ancestor of Neanderthals and modern humans but were missing from the group of humans who founded the modern Eurasian population.
These variants, which had been retained by Neanderthals, were then reintroduced to the ancestors of modern non-Africans during periods of interbreeding. Akey has come upon another interesting twist: Africans do have Neanderthal ancestry. Unpublished work from his group points to the possibility that some of the ancient modern humans that bred with Neanderthals migrated back to Africa, where they mixed with the modern humans there, sharing bits of Neanderthal DNA.
In their seminal studies, the groups of David Reich of Harvard Medical School and Joshua Akey, then at the University of Washington, noted that the Neanderthal variants that correlated with human phenotypes did not appear in coding regions. Two years later, a genome-wide analysis published by investigators in France found that Neanderthal ancestry was enriched in areas tied to gene regulation (Cell). To ask this question more directly, Akey turned to the Genotype-Tissue Expression (GTEx) Project, which has cataloged gene expression data from roughly 50 tissues for each of 10, individuals.
Comparing expression levels based on which allele was being expressed, the researchers found that a quarter of the stretches of Neanderthal DNA in human genomes affect the regulation of the genes in or near those stretches Cell , P— E12, Earlier this year, Rotival and two colleagues calculated ratios of Neanderthal to non-Neanderthal variants across the genome and compared those ratios for protein-coding regions and various regulatory sequences, specifically enhancers, promoters, and microRNA-binding sites.
Consistent with previous results, they found a strong depletion of Neanderthal variants in coding portions of genes, and a slight enrichment of the archaic sequences in regulatory regions Am J Hum Genet , doi Jef Akst is the managing editor of The Scientist.
Email her at jakst the-scientist. All modern humans have ancestry in Africa. The Scientist regrets any confusion.