Non-coding DNA

 

Selfish DNA?

The term non-coding DNA refers to the parts of the genome that do not code for proteins. The terminology is somewhat confusing: the fact that these sequences don’t code for proteins doesn’t mean that they don’t code for anything. Yet historically this portion of the human genome was labelled ‘junk DNA’ and believed to be completely useless. The term was widely popularized and still lingers in the public consciousness: in the survey on common misconceptions about genomics that we conducted at the beginning of the project, 19% of students dismissed non-coding DNA as garbage. In hindsight, it is baffling that the scientific community could so widely accept the hypothesis that 98% of the human genome (the share made up of non-coding sequences) is functionless yet was still evolutionarily favoured. After all, replicating and repairing such huge amounts of DNA consumes a significant portion of the cell’s resources. The view was probably a result of biologists’ earnest fascination with multi-functional, perfectly adapted proteins. While these elegant molecules fully deserve their impeccable reputation, they still need some help from non-coding DNA, which turned out not to be so selfish after all.

 

Recycling the junk

A paradigm shift occurred when a strange correlation was observed: the proportion of non-coding DNA in the genome tends to increase with the complexity of the organism. Even more interestingly, the same does not hold for the number of protein-coding sequences. In practice, this means that a larger number of genes does not make an organism more sophisticated. The Onion Test is a simple way to demonstrate the idea. The genome of Allium cepa (the onion) is about five times larger than that of Homo sapiens, yet a human is a more complicated organism than an onion. It seems reasonable to assume that a more complicated organism needs more genes, and hence more DNA, than a simpler one, but the onion does not conform to this assumption. Rather, it suggests that the protein-coding regions do not make up the majority of the genome. After puzzling over this conundrum for years, scientists eventually discovered that it is not the number of genes that makes an organism sophisticated but the extent to which it can splice those genes in different ways. Of course, complicated splicing patterns need to be carefully regulated. This is where our long underappreciated non-coding DNA gets its due: it could finally outshine proteins and become a star of research.

 

The ENCODE Project 

The Human Genome Project was completed in 2003, and the finding that less than 2% of the human genome codes for proteins prompted researchers to explore what functions the rest of the genome might have. This question gave rise to the Encyclopedia of DNA Elements project, abbreviated as ENCODE. The long-term aim of the project was to map the functional regions of the non-coding DNA and establish their roles.

The pilot phase took place between 2003 and 2007 and focused on a targeted 1% of the human genome, divided among 44 regions. It was an opportunity to try out newly emerging technologies, and it shed light on many previously poorly understood functions of the genome. The pilot phase used microarray-based assays to investigate transcribed regions of non-coding DNA, cis-regulatory elements, chromatin accessibility, and histone modifications. Briefly, a microarray assay indicates whether a gene is active or inactive. Because a single chip tests many sequences simultaneously, the method was considered very time-efficient in the era of the pilot phase.

The results of these four years suggested that the majority of our DNA is transcribed into RNA, although only a fraction of these transcripts is translated into protein. Many regions that were thought not to be transcribed turned out to serve as templates for transcripts.

Many cis-regulatory elements were found as well, and scientists realised that elements regulating transcription are equally likely to be located upstream or downstream of transcription start sites.

The second phase of the project, which ran from 2007 to 2012, extended the research to the whole human genome and examined more cell lines. This phase paid particular attention to the transcriptional regulatory network of the human system. It was observed that the combinations of transcription factors (TFs) differ between locations. For example, regulatory elements that are distal to genes bind different combinations of TFs than regulatory elements that are proximal to genes. These differences pointed to a hierarchical organization of TFs. In this system, the TFs on each level have certain properties that indicate their function and the regions they are likely to regulate.

In the third phase, the studies were taken one step closer to reality: experiments were carried out on cells taken directly from tissues.

The third phase also introduced new technologies that made it possible to explore the genome from other viewpoints. Methods such as paired-end tagging and Hi-C conformation capture gave a more detailed picture of the 3D structure of chromatin than had previously been available, which made it possible to better understand the interactions between cis-regulatory elements (CREs). Chromatin looping proved to be significant in gene regulation, as looping alters the physical distances between regions; it also gave clues about the relationships between certain enhancers and genes.

This phase put a lot of effort into explaining how TFs cooperate at CREs. For example, it brought the first evidence for the so-called HOT (high-occupancy target) region model. According to this model, there are HOT regions, mostly promoters and enhancers, that are bound by many TFs. The assembly of proteins is initiated by anchor DNA sequences that recruit TFs, which open the chromatin, and this complex then recruits more proteins. Mapping chromatin loops in many cell types showed that looping differs between cell types, and this is another factor that regulates gene expression.

The advanced methods also produced landscapes of RNA binding, which provided a new dataset for the project. Many RNAs are bound by proteins, the so-called RNA-binding proteins. These proteins are responsible for many steps in the post-transcriptional processing of mRNAs, including splicing, cleavage and polyadenylation.

The fourth phase of the project began in 2017 and continues today. Data are continuously produced and analyzed in the hope of one day achieving a near-complete understanding of the human genome.

 

Types of non-coding DNA

Introns are non-coding sequences within genes that were previously thought to be “junk DNA”. However, it is now known that introns are not “junk” and can play a role in gene expression and regulation. Splicing (intron removal) is an energetically expensive and time-consuming cellular process, so a lot of work has gone into finding functions of introns that would justify the cost. Introns contain regulatory sequences, so when scientists removed introns in the laboratory, the expression of one or more genes was affected. For example, after splicing, some introns can form microRNAs (miRNAs), which have a role in regulating gene expression: by interfering with an mRNA, a miRNA can make the cell stop producing a certain protein. Other introns contain genes for other types of non-coding RNA that have important roles in the cell. It is now known that introns are essential in species that have them, and because splicing is such an important process in humans, mutations that affect splicing can be pathogenic. It is thought that approximately 50% of disease-causing mutations affect splicing.

The majority (>95%) of human genes can be spliced in multiple ways to produce several different mature messenger RNA transcripts. This is called alternative splicing, and it allows the cell to make different proteins from the same gene.
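The combinatorial effect of alternative splicing can be illustrated with a small sketch. It assumes a deliberately simplified model in which some exons are always retained and the others may each be skipped independently (exon skipping is only one of several splicing modes); the toy exon sequences and the function name `splice_isoforms` are made up for illustration.

```python
from itertools import combinations


def splice_isoforms(exons, constitutive):
    """Enumerate mRNA isoforms produced by skipping optional exons.

    exons: list of exon sequences in genomic order.
    constitutive: set of indices of exons that are always retained.
    Simplifying assumption: every other exon can be skipped independently.
    """
    optional = [i for i in range(len(exons)) if i not in constitutive]
    isoforms = []
    # Try every subset of optional exons as the set of skipped exons.
    for k in range(len(optional) + 1):
        for skipped in combinations(optional, k):
            kept = [exons[i] for i in range(len(exons)) if i not in skipped]
            isoforms.append("".join(kept))
    return isoforms


# Hypothetical toy gene: four exons, the first and last always included.
exons = ["AUGGCU", "GAAUUC", "CCGGAU", "UGGUAA"]
isoforms = splice_isoforms(exons, constitutive={0, 3})
for mrna in isoforms:
    print(mrna)
```

With n optional exons this model yields 2^n possible transcripts, which hints at how a modest number of genes can give rise to a much larger set of proteins.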

Pseudogenes were once thought to be gene copies that had completely lost their biological function, but recent studies have shown that some pseudogenes possess regulatory functions and are transcribed into RNA, so it is difficult to give an exact definition of this type of non-coding DNA. One function of pseudogenes is to provide genetic diversity, which can help in generating antibodies and antigenic variation. Pseudogenes accumulate mutations over time, which helps scientists study mutation rates and patterns of neutral evolution. As pseudogenes are fossils of their parent genes, they can also be a source of information about ancient transcriptomes.

Transposable elements (TEs), or transposons, are DNA sequences that can change their location and move around in the genome. They are classified into two groups: retrotransposons, which require reverse transcription to transpose, and DNA transposons, which do not. These “jumping genes” can cause mutations in genes and lead to diseases such as haemophilia. If LINE-1, an active transposon in the human genome, lands in the APC gene, it can lead to different kinds of cancer. Fortunately, because their movement is dangerous for the organism, most transposable elements in the human genome are silent, meaning they have no effect on the phenotype. Some TEs are inactivated by mutations that stop them from moving from one chromosomal location to another; others are perfectly capable of changing their location but are kept inactive by epigenetic mechanisms such as methylation, miRNAs and chromatin remodelling. In the case of chromatin remodelling, heterochromatin is so tightly packed that the transcription enzymes simply cannot reach the transposable elements found there. Even the few active transposable elements that escape epigenetic silencing are usually stopped from jumping by mechanisms such as RNA interference (RNAi), in which small interfering RNAs control gene expression. However, transposable elements are not always destructive, and they play an important part in the evolution and gene regulation of an organism. They facilitate exon shuffling, DNA repair and the translocation of genetic sequences, thus contributing to the evolution of the genome.

Tandem arrays (highly repetitive DNA) are common at the centromeres and telomeres and include satellites, minisatellites and microsatellites. They were named satellites because they separate from the bulk of nuclear DNA during centrifugation. Highly repetitive DNA poses a significant technical challenge to next-generation sequencing, so it is difficult to estimate the number of repeats. The long-read sequencing technology of Oxford Nanopore is beginning to make it easier to sequence highly repetitive DNA. Recently, researchers managed to complete a gapless telomere-to-telomere sequence of the entire X chromosome, including the centromeric satellite DNA. Satellites have between 1,000 and 10 million repeated units and account for the DNA at the centromere. Minisatellites can have hundreds of units of 7 to 100 base pairs (bp) and are present everywhere, especially at the telomeres. Microsatellites span around 100 bp or more and consist of repeated units of one to six nucleotides. They include certain trinucleotide repeats that are associated with diseases such as Huntington disease. The tandem arrays at the telomeres have changed very little over evolutionary time, indicating that they play an important role in protecting the ends of eukaryotic chromosomes.
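To make the idea of a microsatellite concrete, the following sketch scans a DNA string for tandem runs of a short unit (one to six nucleotides, as described above). The repeat threshold, the function name `find_microsatellites` and the example sequence are illustrative assumptions; real repeat-finding tools also handle imperfect repeats, which this simple regular expression does not.

```python
import re


def find_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    A hit is a unit of min_unit..max_unit nucleotides repeated at least
    min_repeats times in a row. The lazy quantifier makes the regex try
    the shortest possible unit first.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        repeats = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeats))
    return hits


# Made-up sequence containing eight copies of the trinucleotide CAG,
# the unit whose expansion is associated with Huntington disease.
seq = "TTGA" + "CAG" * 8 + "TTCCGA"
print(find_microsatellites(seq))
```

The same low complexity that makes such runs easy to spot in a finished sequence is what makes them hard to assemble from short reads: many reads from inside the repeat look identical.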

Although their function is not fully understood yet, repetitive sequences are known to be important in gene evolution and disease-gene mapping.

The bigger picture

It turns out that even in biology, conclusions depend to some extent on one’s perspective. A significant part of the scientific community did not accept ENCODE’s claim that about 70% of non-coding DNA actually has a function. The disagreement lies in how to define functionality. According to ENCODE, DNA can be considered functional if it displays any kind of biochemical activity, for instance if it is copied into RNA. Many scientists believe that this is not enough to prove that a sequence has a meaningful use. Their counterargument is that DNA can only be classified as functional if it has evolved to do something useful enough that a mutation disrupting it would harm the organism.

Finally, it’s worth looking at non-coding DNA from an evolutionary point of view. An influential study from 2017, An Upper Limit on the Functional Fraction of the Human Genome, used a different definition of functional DNA: a sequence is functional if natural selection can act on it, either positively or negatively. Because most of the mutations that affect such sequences are harmful, they reduce the fitness of the population. Fertility must therefore compensate for this reduction in order to maintain a constant population size from generation to generation. The required increase in fertility depends on the percentage of functional sites in the genome, the mutation rate, and the proportion of deleterious mutations in functional regions. Mutations in non-functional, junk DNA regions would not harm the organism, so they could remain in the population without any such cost. Taking into account real-life fertility rates in humans, the study estimated that the upper limit of functional DNA in the human genome is 15%, much less than ENCODE predicted.
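The reasoning behind this kind of estimate can be sketched numerically. The snippet below is not the study’s actual calculation: it assumes the classical mutational-load approximation in which mean fitness is e^(-U), where U is the number of new deleterious mutations per individual per generation, and all default parameter values (mutation rate, genome size, deleterious fraction) are illustrative assumptions.

```python
import math


def required_fertility(functional_fraction,
                       mutation_rate=1.2e-8,      # per site per generation (assumed)
                       genome_size=3.2e9,         # haploid genome in bp (assumed)
                       deleterious_fraction=0.4): # share of harmful hits (assumed)
    """Factor by which fertility must rise to offset the mutational load.

    Simplified model: U = 2 * genome_size * mutation_rate
                          * functional_fraction * deleterious_fraction
    (the factor 2 accounts for the diploid genome), mean fitness = exp(-U),
    and fertility must increase by 1 / mean fitness to keep the
    population size constant.
    """
    u = (2 * genome_size * mutation_rate
         * functional_fraction * deleterious_fraction)
    mean_fitness = math.exp(-u)
    return 1.0 / mean_fitness


for fraction in (0.02, 0.15, 0.70):
    print(f"functional fraction {fraction:.2f}: "
          f"fertility factor {required_fertility(fraction):.3g}")
```

Because the required fertility grows exponentially with the functional fraction, even under these rough assumptions a genome that is mostly functional would demand implausibly many offspring per individual, which is the intuition behind placing an upper limit on the functional fraction.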

Ultimately, there is still no consensus in the scientific community about the exact percentage of the human genome that can really be considered functional. Nevertheless, there is no doubt that protein-coding sequences are not the only parts of the genome that matter. At least some of our ‘junk DNA’ turns out to be crucial, so it is definitely worth studying its function in more depth.

 

 

References:

Repetitive DNA Elements. Genetics. Encyclopedia.com. 9 Sep. 2021 <https://www.encyclopedia.com>.

Tomasi, F., 2018. Transposons: Your DNA that’s on the go – Science in the News. [online] Science in the News. Available at: <https://sitn.hms.harvard.edu/flash/2018/transposons-your-dna-thats-on-the-go/?web=1&wdLOR=c6328BF69-6B3D-45DB-B592-07848D43A862> [Accessed 9 September 2021].

Podlaha, O. and Zhang, J., 2010. Pseudogenes and Their Evolution. Encyclopedia of Life Sciences, [online] Available at: <https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470015902.a0005118.pub2> [Accessed 9 September 2021].

Markgraf, Bert. “Introns vs Exons: What are the Similarities & Differences?” sciencing.com, https://sciencing.com/introns-vs-exons-what-are-the-similarities-differences-13718414.html. 9 September 2021.

Tutar, Y., 2012. Pseudogenes. Comparative and Functional Genomics, [online] 2012, pp.1-4. Available at: <https://www.hindawi.com/journals/ijg/2012/424526/> [Accessed 9 September 2021].

Pray, L., 2008. Transposons: The jumping genes. Nature Education, 1(1):204.

Shapiro JA, von Sternberg R. Why repetitive DNA is essential to genome function. Biol Rev Camb Philos Soc. 2005 May;80(2):227-50. doi: 10.1017/s1464793104006657. PMID: 15921050.

 
