Artificial Intelligence in genomics

 

The need for AI in biosciences and medical applications is undeniable. The opportunities that arise from combining pure computational power with human ingenuity are simply breathtaking. Thanks to AI, problems long thought unsolvable, like the protein folding problem, have now been largely solved (AlphaFold 2). As analyzing huge amounts of data allows us to reach conclusions unavailable to humans, previously unimaginable advances like curing or even eradicating genetic diseases are now within our reach.

Before elaborating on the exact applications of AI techniques in genomics, let us introduce the basic concepts behind this sometimes mysterious artificial intelligence. Let’s focus on areas most commonly used in biosciences: machine learning and deep learning.

 

Machine learning is a subset of artificial intelligence. It is based on algorithms that learn from data to improve their performance in the future. This happens during a process called training, in which large amounts of labelled data are used to build a model that captures distinctive features consistent across the data. As a result, these algorithms are able to make predictions and decisions without explicit programming from humans.

 

Machine learning is most commonly divided into the following types:

 

  • Supervised learning, in which algorithms are trained using labelled examples, i.e. inputs for which the desired output is known. First, the data is split into a training set and a test set. Then, the algorithm is fit to the training dataset. The next step is to evaluate and adjust the model using the test data, aiming to avoid overfitting, where the model follows the training data too closely and generalizes poorly, and underfitting, where the model is too simple to capture the data properly. Supervised learning can be divided into classification problems, where a discrete label or category is assigned to the input, and regression problems, where the output variable is a real or continuous value, such as height. (A minimal code sketch of this train/fit/evaluate cycle follows this list.)

  • Unsupervised learning, where the data do not have a categorical output or label assigned to them, so the ‘right answer’ is not known and the model needs to work out what it is processing. The algorithm is trained to find trends, patterns or structures by studying the unlabelled data and creating its own groupings within them.

  • Reinforcement learning, which uses an algorithm that, through trial and error, identifies which actions produce the best numerical outcome. It continuously tries to improve the outcome by taking into account feedback from the environment, which can either be a reward, reinforcing that the model achieves what was planned, or a penalty, prompting the model to change strategy. In reinforcement learning terminology, the situation of the environment is called a state, in which different actions (responses) can be taken. The goal is to continuously improve the quality of the outcome, often denoted Q. The algorithm needs to determine which actions improve the state so that this quality goes up, and it continues simulating actions and states until it finds the best strategy.
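To make the supervised-learning workflow above concrete, here is a minimal Python sketch, assuming scikit-learn is available and using synthetic data in place of a real biological dataset; the variable names are illustrative, not taken from any study cited here.

```python
# Minimal supervised-learning sketch (assumes scikit-learn; synthetic data stands in
# for a real labelled biological dataset).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labelled examples: inputs X with known outputs y
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Split into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the model to the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on unseen test data to detect over- or underfitting
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```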

 

 

Deep learning is another subset of artificial intelligence. It is inspired by the structure and activity of neurons in the human brain, allowing computer models to cluster data and make predictions with impressive accuracy.

Many of today’s complex questions cannot be answered using single-layered machine learning algorithms. The recent exponential increase in computational power has allowed data to be processed in a more sophisticated way using deep learning.

Deep learning distinguishes itself from classical machine learning because it eliminates some of the data pre-processing that is typically involved. Deep learning algorithms can process unstructured data, like text and images, and automatically select which features are most crucial for identification, eliminating the need for human intervention: the hierarchy of features does not need to be crafted manually as in classical machine learning.

Deep neural networks are made of densely arranged layers of interconnected nodes, each building upon the previous layer to refine and optimize the outcome. This progression through the neural network is called forward propagation. 

The input and output layers of a deep neural network are called visible layers. All the layers in between are called hidden layers. The input layer is where the algorithm starts to process the data. The information is then transferred from one hidden layer to another over connecting channels. Each channel has a value assigned to it, hence it is called a weighted channel. Every neuron also has an associated number called a bias. This bias is added to the weighted sum of the inputs reaching the neuron, and the result is passed through an activation function. The output of the activation function determines whether the neuron is activated. Every activated neuron passes information on to the next layer. This continues until the computations reach the output layer, where the final prediction or classification is made.

The weights and biases are continuously adjusted to produce a well-trained network that gradually becomes more and more accurate by correcting its own errors. This happens through a process called backpropagation, which computes how much each weight contributed to the prediction error, so that optimization algorithms like gradient descent can adjust the weights accordingly.
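As a rough illustration of the forward propagation and weight-update ideas described above, here is a toy NumPy sketch of a single hidden layer; the layer sizes, activation choice and squared-error update are assumptions for illustration, not a description of any specific network mentioned in this text.

```python
# Toy forward pass through one hidden layer, plus a single gradient-descent update
# of the output weights (illustrative sizes and activation; NumPy only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input layer: 3 features
W1 = rng.normal(size=(4, 3))      # weighted channels into 4 hidden neurons
b1 = np.zeros(4)                  # bias of each hidden neuron
W2 = rng.normal(size=(1, 4))      # weighted channels into 1 output neuron
b2 = np.zeros(1)
y_true = np.array([1.0])

# Forward propagation: weighted sum + bias, passed through the activation function
hidden = sigmoid(W1 @ x + b1)
y_pred = sigmoid(W2 @ hidden + b2)

# Backpropagation of a squared error for the output layer only (sketch)
error = y_pred - y_true
grad_W2 = (error * y_pred * (1 - y_pred))[:, None] * hidden[None, :]
W2 -= 0.1 * grad_W2               # gradient-descent step nudges the weights
print("prediction before update:", y_pred)
```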

 

Now, moving to applications of AI in biosciences…

 

Genome annotation 

When a genome is sequenced, it is not yet informative on its own. The raw sequence does not tell us where the distinct elements of the genome are, which is essential for almost any investigation in molecular genetics. The identification and annotation of these elements addresses this task, and the process is called genome annotation.

The genome can be interpreted as a huge amount of data that needs to be analyzed and categorized into subgroups. Databases have been analyzed by computer algorithms for a long time and this is the case with genome annotation as well. Algorithms can be trained to recognize patterns. In this case, the patterns are the different elements of the genome.  

Transcription start sites, splice sites, promoters, enhancers and positioned nucleosomes are all examples of elements that have been successfully analyzed and annotated thanks to these algorithms. In the following section, we will focus only on splice sites in order to provide a deeper understanding of such a process. It assumes knowledge of basic statistical concepts such as conditional probabilities, Bayes’ theorem and the notions of sensitivity and specificity, so we strongly recommend that readers make sure they understand these concepts.

 

Splice site prediction  

Firstly, a tremendous amount of data was collected about the nucleotide sequences that occur upstream and downstream of splice sites, to serve as input for the machine learning algorithm.

These algorithms need features based on which they carry out their computations. In this case, the features were the presence or absence of the nucleotides very close to the splice sites. 

The study used genes that conform to the splice-site consensus, which states that introns begin with a GT dinucleotide at their 5’ end and end with an AG dinucleotide at their 3’ end.

Negative and positive instances were defined based on this consensus: a GT at the start of an intron and an AG at the end of an intron are positive instances, while a GT located within 100 nucleotides upstream of the first GT and an AG located within 100 nucleotides upstream of the first AG are negative instances.

Each instance is described by features as follows: one feature is the presence or absence of a particular nucleotide at a particular position. The study chose to observe 50 positions upstream and 50 positions downstream of the consensus dinucleotide; with four possible nucleotides at each of these 100 positions, this gives 400 features.

A data set was generated from many genes, using their information as input for the algorithm. The individual examples are called instances and are labelled with + if they are positive and with – if they are negative.
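To illustrate how such presence/absence features might be encoded, here is a hedged Python sketch assuming a 100-nucleotide window (50 upstream and 50 downstream of the candidate site) and the four standard nucleotides, giving the 400 binary features described above; the encoding details are our own illustration rather than the study’s exact representation.

```python
# Encode a 100-nucleotide window around a candidate GT/AG site into 400 binary
# presence/absence features (4 nucleotides x 100 positions); illustrative only.
NUCLEOTIDES = "ACGT"

def encode_window(window):
    """window: 100 nucleotides, 50 upstream and 50 downstream of the candidate site."""
    assert len(window) == 100
    features = []
    for base in window:
        features.extend(1 if base == nt else 0 for nt in NUCLEOTIDES)
    return features

example_instance = ("ACGT" * 25, "+")     # hypothetical window with its label
vector = encode_window(example_instance[0])
print(len(vector), example_instance[1])   # 400 features and the '+' label
```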

Our goal is to develop an algorithm that recognizes positive and negative instances that were not part of the training data set. 

It is important to point out that there is not just one method that works; here, however, we only highlight one specific way to solve this problem.

 

Classification method: the Naive Bayes method

This method classifies the instances based on the conditional probability that the instance is positive or negative given the features. If the conditional probability of being positive given a certain set of features is greater than the conditional probability of being negative given the same set of features, then the instance will be classified as positive and vice versa. 

The conditional probability is computed using Bayes’ theorem, hence the name of the method. With Bayes’ theorem, we can express the conditional probability of belonging to one class given the features as a function of the conditional probability of the features appearing in that class, the probability of the features, and the probability of the class. It is important to note that the features are assumed to be conditionally independent of each other given the class.
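A minimal sketch of this classification rule, using scikit-learn’s Bernoulli Naive Bayes on binary presence/absence features and random data as a stand-in for real labelled instances (the study’s own implementation will differ):

```python
# Bernoulli Naive Bayes over binary presence/absence features (assumes scikit-learn;
# random data stands in for real labelled splice-site instances).
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 400))   # 200 instances x 400 binary features
y = rng.integers(0, 2, size=200)          # 1 = positive (+), 0 = negative (-)

clf = BernoulliNB()   # assumes features are conditionally independent given the class
clf.fit(X, y)

# Bayes' theorem yields P(class | features); the larger posterior decides the label
print(clf.predict_proba(X[:1]), clf.predict(X[:1]))
```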

 

Feature subset selection:  

The Naive Bayes method is not sufficient on its own, however, because it builds on the relevant features, which the algorithm does not yet know at this point. In machine learning, feature selection is a key step. A dataset usually provides many features, not all of which are significant, and keeping insignificant features deteriorates the result. In this case, the algorithm must decide which nucleotide positions are important to take into consideration. Once the features are selected, the algorithm can carry out the actual computation to classify the instances.

There are many feature selection methods known and they are suitable for different problems. One should always analyze the problem and the potential feature selection models to achieve the best result. 

In our case, there are 400 features, which means that there are 2^400 possible subsets. Even a computer cannot simply try all these subsets to see which one works best, which is why we need so-called search algorithms. Performance is evaluated by computing the harmonic mean of sensitivity and specificity, where sensitivity is the proportion of true positive instances that are correctly identified and specificity is the proportion of true negative instances that are correctly identified. The algorithm used in the current study is called backward feature elimination. It starts with the subset that includes all features, then analyzes the features one by one and, at every step, eliminates the one that is least relevant for the classification. The least relevant feature is determined by comparing the performance of the algorithm when a particular feature is included with its performance when that feature is eliminated. Based on the effect that eliminating each feature has on performance, the features can be ordered by relevance.
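The following Python sketch illustrates, under the same assumptions as before (scikit-learn, synthetic data), how backward feature elimination scored by the harmonic mean of sensitivity and specificity could look; it is deliberately simplified and would be slow if run on all 400 features.

```python
# Backward feature elimination scored by the harmonic mean of sensitivity and
# specificity (assumes scikit-learn and NumPy; simplified, slow for 400 features).
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

def harmonic_mean_score(model, X, y):
    tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    return 2 * sensitivity * specificity / (sensitivity + specificity)

def backward_elimination(X, y, n_keep=10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        scores = {}
        for f in kept:                    # try removing each remaining feature
            trial = [k for k in kept if k != f]
            model = BernoulliNB().fit(X_tr[:, trial], y_tr)
            scores[f] = harmonic_mean_score(model, X_te[:, trial], y_te)
        kept.remove(max(scores, key=scores.get))   # drop the least relevant feature
    return kept

# Tiny synthetic run (20 features for speed); the label depends on features 3 and 7
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 20))
y = (X[:, 3] | X[:, 7]).astype(int)
print(backward_elimination(X, y, n_keep=2))
```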

By continuously eliminating the features according to this order, it turned out that only the ten most relevant features made a significant difference. These features describe nucleotides near the splice sites.

 

Finding disease-related genes

It is well known that social media platforms suggest people you may know so that you can add them as friends. These suggestions are based on contacts that you and your existing friends share. Scientists can use a similar approach to create maps of biological networks by analysing the interactions between different proteins or genes. Researchers from Linköping University have shown in a recent study that deep learning can be used to find disease-related genes. Artificial neural networks are trained to find patterns in experimental data; they are already widely used in image recognition and are increasingly applied in biological research.

A remarkable example of how UCL contributes to this exciting AI development is the successful implementation of Eye2Gene, a decision support system that accelerates and improves the genetic diagnosis of inherited retinal disease by applying AI to retinal scans. With over 300 possible genetic causes behind the disease, a quick and accurate diagnosis dramatically increases the chances of successful treatment. It was designed by the research group led by Dr Nikolas Pontikos from the UCL Institute of Ophthalmology. Such pioneering examples of personalized medicine in practice significantly contribute to speeding up the regulatory processes for various AI-driven technologies.

Beyond image recognition, in areas like gene expression analysis, the scientists from Linköping University used an enormous amount of experimental data: the expression patterns of 20,000 genes from both people with diseases and healthy people. The information was not sorted before it was fed into the artificial neural network; the researchers did not indicate which gene expression patterns came from healthy people and which from the diseased group. The deep learning model was then trained to categorize the gene expression patterns.

One of the unresolved challenges of machine learning is not being able to see how the artificial neural networks solve their tasks and find the patterns of gene expression. We would only be able to see our input (the experimental data that we provide) and the output showing the result. In the end, the scientists were interested in which of the gene expression patterns found with the help of AI are actually associated with disease and which are not. It was confirmed that the AI model found relevant patterns that concur well with biological reality. At the same time, the model revealed new patterns that are potentially very important for the biological world.

 

Challenges and limitations of AI 

The concept of artificial intelligence has attracted a great deal of interest in the last few years. AI has already deeply infiltrated our everyday lives, for example through our smartphones. By comparison, however, the use of AI in the healthcare system is not yet so advanced.

One of its limitations is that, even though people assume AI is objective and unprejudiced, biases can occur if they were already present in the dataset used as input. One study showed that when texts written by humans in English were used as the initial database for an AI model, the algorithm learned word associations that mirrored the societal associations found in those texts, for example linking European names with more positive associations than some African names. Such biases could raise serious issues in a healthcare setting for populations from different demographic backgrounds.

One challenge for the further implementation of AI in healthcare is the perceived threat it poses to human jobs. The Topol Review anticipates that robots will eventually be able to perform medical procedures without being controlled by humans, as some robots already have a low level of AI controlling their physical actions. However, a study from 2017 reported that only 23% of people in the UK would be comfortable with robots performing medical procedures on them. AI models should therefore be developed not to replace human employees but to supplement the work of medical staff.

From a legal perspective, one other possible challenge would be the patients who seek legal action when an artificially intelligent machine fails to appropriately utilize genetic testing on them. Should the doctor who referred the patients be blamed or the programmer of the software? As mentioned before with the artificial neural networks, in AI, you cannot determine how the output of the programme was decided so this would only further complicate the answer to the question above.

 

Sources:

High Accuracy Protein Structure Prediction Using Deep Learning, John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Kathryn Tunyasuvunakool, Olaf Ronneberger, Russ Bates, Augustin Žídek, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Anna Potapenko, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Martin Steinegger, Michalina Pacholska, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis, In Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book), 30 November – 4 December 2020.

https://datafloq.com/read/machine-learning-explained-understanding-learning/4478

Carreras, J., Hamoudi, R., Nakamura, N. (2020) Artificial Intelligence Analysis of Gene Expression Data Predicted the Prognosis of Patients with Diffuse Large B-Cell Lymphoma. The Tokai Journal of Experimental and Clinical Medicine, 45 (1), pp. 37-48.

https://www.ibm.com/cloud/learn/deep-learning

https://www.ucl.ac.uk/ioo/news/2021/jun/eye2gene-wins-artificial-intelligence-health-and-care-award

A perspective on the promise of personalized medicine: an interview with Prof William Newman

As the next part of our project, we conducted an interview with William Newman, Professor of Translational Genomic Medicine. We talked about what personalized medicine is, what benefits and problems it brings, how it is already being implemented and what its future holds. Prof Newman’s research focuses on pharmacogenetics (how patients respond to their medications depending on their genetic traits) and on rare inherited conditions. He successfully implemented the Pharmacogenetics to Avoid Loss of Hearing (PALoH) study, a topic on which he also elaborates in the interview. Due to the large size of the video file, here’s a YouTube link to the interview:

Case study: CRISPR-Cas9 usage in therapeutics

CRISPR in the clinic

Current clinical trials utilizing CRISPR are still in their early stages, meaning that even if the technology turns out to be safe and effective, CRISPR-based therapy is still far from reaching the general public.

CRISPR technology represents an important advancement in personalized medicine. The current clinical targets are blood disorders, cancers, eye disease, chronic infections, and protein-folding disorders. All current clinical trials focus on editing somatic cells or tissues without affecting the sperm or eggs, meaning genomic changes will not be passed down to the next generations.

Two genetic blood disorders are caused by mutations affecting the beta-globin (haemoglobin) gene: sickle cell disease (SCD) and beta-thalassemia. CRISPR technology does not directly correct the mutation but rather increases the levels of fetal haemoglobin, which can substitute for the defective adult haemoglobin in the treatment of SCD and beta-thalassemia. Stem cells are harvested from the patient’s blood and CRISPR is used to edit their genomes. Chemotherapy is used to eliminate the defective marrow stem cells, and the newly genome-edited stem cells are put back into the patient’s bloodstream, resulting in blood cells that produce fetal haemoglobin.

Encouraging results were obtained using ex vivo CRISPR-based therapy to treat beta-thalassemia in February 2019 and SCD in July 2019. The treated patients will be monitored for possible harmful effects. The main limiting factor until now has been the chemotherapy used to ablate the existing bone marrow as it can be risky and time-consuming.

CRISPR-based therapies are also used in the treatment of blood and lung cancers. In 2016, CRISPR therapy was first used to treat lung cancer; a total of 12 patients participated in the study. The patients were injected with edited T-cells in which the PD-1 gene was prevented from making PD-1 receptors. Inhibiting the programmed cell death 1 (PD-1) receptor is beneficial because this receptor, found at the surface of immune cells, mediates the immune escape of tumour cells by promoting apoptosis of antigen-specific T-cells and inhibiting apoptosis of regulatory T-cells. At the end of the study, the researchers concluded that the side effects were acceptable, and low levels of edited T-cells were still present in 11 of the 12 patients two months after the infusion.

The next study into CRISPR-based therapy ended in 2020 and focused on the safety of the treatment and the potential side effects in two volunteers, who suffered from white blood cell cancer and metastatic bone cancer respectively. The results were encouraging, with T-cells maintained at stable levels even after nine months and correctly identifying tumours. Clinical trials using CRISPR-based immunotherapies are still ongoing.

CRISPR trials are also being performed to edit a patient’s defective photoreceptor gene to treat Leber congenital amaurosis 10 (LCA10), a cause of childhood blindness. In LCA10, a mutation in a photoreceptor gene leads to the formation of a shortened, defective version of an important protein. With treatment, this gene mutation can be corrected, allowing cells to make full-length, functional proteins again. The first clinical study began in March 2020, when a patient was injected with a low-dose treatment, but no results have been published yet. This is the first in vivo study, as no CRISPR-based therapy had used direct injection (in this case into the eye) to edit a patient’s genes before this trial.

Besides all the points mentioned above, CRISPR technology is also used in clinical trials focusing on the treatment of urinary tract infections and hereditary transthyretin amyloidosis (hATTR).

 

In the following section, we will describe actual research that exploited the power of CRISPR-Cas9 genome editing to treat a disease. The study was carried out on mice, so the genetic engineering technique had two uses in this experiment: first, the mice had to be engineered to be suitable models of the disease as it occurs in humans, and then this induced mutation was corrected by the same tool. The following text aims to provide a good grasp of the concept and structure of such a study without claiming to be exhaustive.

All the images are from the original publication apart from the case where it is stated otherwise. 

 

A case study on CRISPR-cas9 genome engineering 

 

An extract from the study “In Vivo CRISPR/Cas9-Mediated Genome Editing Mitigates Photoreceptor Degeneration in a Mouse Model of X-Linked Retinitis Pigmentosa” by Shuang Hu; Juan Du; Ningning Chen; Ruixuan Jia; Jinlu Zhang; Xiaozhen Liu; Liping Yang 

 

Retinitis pigmentosa is an inherited disorder that causes the progressive loss of photoreceptors and can lead to blindness. X-linked retinitis pigmentosa (XLRP) is due to mutations in the retinitis pigmentosa GTPase regulator (RPGR) gene.

This study aimed to treat this disorder through genome engineering with the CRISPR-cas9 system.  

Mouse model:  

Cre-dependent Cas9 knock-in mice were generated by inserting a Cas9 transgene expression cassette into a locus. This transgene was controlled by the combination of a stop cassette (loxP-stop-loxP) that interrupted the transgene and the Cre recombinase, a combination commonly used to control gene expression. In this case, Cas9 cannot be expressed without the Cre recombinase.

These Cre-dependent Cas9 knock-in homozygous males were crossed with genetically engineered knockout (Rpgr KO) female mice that carried a five base pair deletion in exon eight. This crossing provided the mice to experiment on: the Rpgr−/y Cas9+/WT males.

Figure adapted from: Rocha-Martins, Maurício & Cavalheiro, Gabriel & Rodrigues, Gabriel & Martins, Rodrigo. (2015). From Gene Targeting to Genome Editing: Transgenic animals applications and beyond. Anais da Academia Brasileira de Ciencias. 87. 1323-1348. 10.1590/0001-3765201520140710. 

 

The wild type (WT) mice used in the study were C57BL/6J, a strain commonly used in research.

Generation of Rpgr KO Mouse Model: 

Two sgRNAs were designed to target exon eight of the Rpgr gene.

In vitro transcribed Cas9 mRNA and the sgRNAs were injected into zygotes of C57BL/6J mice.

Sanger sequencing analyses showed which offspring carried the desired mutation:

Mice with a 5-bp deletion in exon eight were selected to continue the project with.

 

Six or 12 months after treatment, the animals were euthanized, and the eyes were harvested.  

Immunofluorescence, morphometric studies, and fundus photography were carried out and the results were analyzed. 

Immunostaining of the gene in question was carried out, and the results can be seen in the following images. They show no staining in the engineered mice, which indicates that the Rpgr gene had been inactivated by the five bp deletion.

  

 

Photoreceptor Degeneration in Rpgr KO Mice 

Delayed and slow retinal degeneration was observed in the genetically engineered Rpgr KO mice, and histological studies confirmed the loss of photoreceptor cells in them as well.

The graphs plotted from the outer nuclear layer (ONL) thickness at different points along the optic nerve show significant differences between the control and the engineered mice.

 

Abnormal Photoreceptor Protein Expression in Rpgr KO Mice: 

Mouse cone photoreceptors express cone opsins, which can therefore be used as markers. Peanut agglutinin (PNA) is a molecule that binds to cone cells and can also be used as a marker. Staining for these markers showed the progressive change in these proteins alongside retinal degeneration.

In conclusion, we can state that this Rpgr KO mouse model provides an appropriate animal model system for studying gene editing therapy.

 

Photoreceptor Preservation Following Cas9-Mediated Gene Editing Therapy 

The aim was to correct the five bp deletion by cutting the deficient region and inserting the missing bases through homology-directed DNA repair. The genetic engineering machinery is transferred into non-dividing photoreceptor cells by adeno-associated virus (AAV) vectors, more precisely AAV2/8 vectors, which specifically target photoreceptors. The sgRNA targeted the 5-bp deletion in Rpgr KO mice. The 5′ and 3′ homology arms were amplified from the C57BL/6J mouse genome. The overlapping RPGR-5′HA and RPGR-3′HA fragments were subcloned into a plasmid. The Cre recombinase was added as well to activate the Cas9 gene.

Expression cassettes of the sgRNA targeting the mutant Rpgr locus and the donor template were delivered to 6-month-old Rpgr−/y Cas9+/WT mice by the AAV2/8 vectors.

 

The retinas of the gene therapy treated mice were analyzed and assessed six months later. As can be seen in the images below, up to nine layers of photoreceptors had been preserved in the treated parts of the retina, while only around four degenerated layers remained in the untreated parts of the retina of the same eye.

 

There were 1.5-fold more outer nuclear cells in the treated part of the retina than in the untreated part. 

Immunostaining was carried out again and indicated significant Rpgr expression in the treated areas and no expression in the untreated areas, as can be seen in the images below.

Retinal morphology of Rpgr KO mice was also assessed 12 months after treatment in order to investigate the long-term impact of the gene therapy. Again, significant Rpgr staining and PNA staining were observed.

The density of photoreceptors in the treated area was three-fold greater than that of the untreated area. 

Both PNA and M-cone expression in the treated area of the Rpgr KO mice were similar to that of 6-month-old untreated Rpgr KO mice, with rhodopsin expression similar to that of 3-month-old untreated Rpgr KO mice. 

These data suggested that CRISPR/Cas9-mediated Rpgr gene editing therapy successfully preserved photoreceptors, and this effect seemed to be persistent. 

 

 

Sources:

Hu S, Du J, Chen N, et al. In vivo CRISPR/Cas9-mediated genome editing mitigates photoreceptor degeneration in a mouse model of X-linked retinitis pigmentosa. Invest Ophthalmol Vis Sci. 2020;61(4):31. https://doi.org/10.1167/iovs.61.4.31

Henderson, H., 2021. CRISPR Clinical Trials: A 2021 Update. [online] Innovative Genomics Institute (IGI). Available at: <https://innovativegenomics.org/news/crispr-clinical-trials-2021/> [Accessed 30 August 2021].

Nast, C., 2021. This is the year that CRISPR moves from lab to clinic. [online] WIRED UK. Available at: <https://www.wired.co.uk/article/jennifer-doudna-crispr> [Accessed 30 August 2021].

Han, Y., Liu, D., & Li, L. (2020). PD-1/PD-L1 pathway: current researches in cancer. American Journal of Cancer Research, 10(3), 727–742. Available at: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7136921> [Accessed 16 September 2021].

Genetic engineering, CRISPR-Cas9 gene editing & bioinformatic tools

 

The beginning of genetic engineering 

In 1953 the structure of DNA was discovered, and this event initiated the era of molecular genetics. In 1967 the ligase enzyme and then in 1970 the first restriction enzyme were isolated. These tools made it possible to break DNA and glue the ends together, meaning that it was possible to make the first artificial recombinant DNA molecules. This was followed by the emergence of gene cloning, the process during which a fragment of DNA is inserted into a plasmid vector, and this plasmid is then introduced into a bacterial cell. These cells can be grown on agar plates, producing clones of the modified cell. This is how human insulin for people with diabetes was first produced. Soon afterwards, genetically modified crops appeared, a field considered worth the effort and research invested in it.

 

Since these first experiments, the technology has come a long way. Today we have various genetic engineering tools that make it possible to perform specific edits in the genome of virtually any organism.

 

The origins of the CRISPR-Cas9 technology

The CRISPR-Cas9 system is a now widely used, revolutionary genetic engineering tool and, as with many things in biology, it is inspired by nature itself.

CRISPR stands for clustered regularly interspaced short palindromic repeats. These repeats were first characterized in 1993 and have been increasingly studied since then.

It turned out that their function is to protect bacteria and archaea against viruses. The way it works is the following:

In the chromosomes of these organisms there are so-called CRISPR loci, which contain the CRISPRs. Between these repeats there are spacer regions which turn out to be identical to fragments of DNA from the viruses that attack the given species. It also turns out that these loci change dynamically, meaning that they accumulate new viral DNA fragments over time. This suggests that bacteria and archaea have a method for acquiring DNA fragments from viruses and inserting them into their CRISPR loci.

These fragments are then transcribed in such a way that the palindromic CRISPRs form hairpins. The long molecule that still contains all the CRISPRs and viral copies is then chopped up so that one piece from a virus and one hairpin together form a so-called CRISPR RNA (crRNA). Thanks to the hairpins, these molecules can be recognized by the Cas proteins, which assemble into the so-called effector complex. The complex is guided by its RNA part to the corresponding viral DNA, and the protein part of the complex, which is an endonuclease, cuts it. The damaged viral DNA cannot be repaired, so it degrades.

During the investigation of Cas proteins, these were classified into two major groups: class 1 systems, in which several Cas proteins assemble with the crRNA into a complex, and class 2 systems, in which a single Cas protein works with the RNA on its own.

The Cas9 protein belongs to the second group. Researchers found that the cas9 gene (which went by a different name at the time) is responsible for protecting certain bacteria from viral infections, and they started a collaboration to find out the mechanism behind it.

It was found that the Cas9 protein, which is encoded by a single gene also called cas9, is an endonuclease that works together with a duo of two RNAs: a spacer RNA, which is the CRISPR-derived part, and a tracer RNA. The spacer RNA matches the DNA sequence that is to be destroyed, while the tracer RNA is the one that binds to the Cas9 protein and thereby activates the cutting mechanism. The enzyme then unwinds the double helix and, with its two distinct active sites, cuts both strands of the DNA. The cut site is followed by a short DNA motif with the base sequence NGG (where N means any nucleotide and G means guanine), known as the protospacer adjacent motif (PAM).
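As a small illustration of how the NGG rule can be used computationally, here is a hedged Python sketch that scans one strand of a DNA sequence for NGG motifs and reports the 20-nucleotide stretch directly upstream of each one; the example sequence and the 20-nt length are illustrative assumptions, and real tools also scan the reverse strand and apply many more checks.

```python
# Scan one DNA strand for NGG motifs and report the 20 nucleotides directly upstream
# of each one (illustrative; real tools also check the reverse strand and much more).
import re

def find_protospacers(seq, spacer_len=20):
    seq = seq.upper()
    hits = []
    for match in re.finditer(r"(?=([ACGT]GG))", seq):   # overlapping NGG matches
        pam_start = match.start()
        if pam_start >= spacer_len:
            hits.append((pam_start, seq[pam_start - spacer_len:pam_start], match.group(1)))
    return hits

example = "ATGCTGACCTGAAGTCCGATAGGCTTACGGTACGATCGATCGTAGCTAGG"   # hypothetical sequence
for position, protospacer, pam in find_protospacers(example):
    print(position, protospacer, pam)
```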

 

 

Exploiting the invention of nature

Unlike in prokaryotes, in eukaryotes DNA damage does not always result in the death of the cell, because the break can be repaired. This repair often happens through the insertion of a few bases, which can, of course, disturb the expression of the affected gene in many ways. This gave rise to the idea of harnessing the CRISPR-Cas9 system for our own genetic modification purposes.

The general workflow of a genetic engineering project with CRISPR is the following: 

The first step is designing the RNA that will match the DNA we want to edit. To do this, the sequence that we wish to engineer must first be identified. Then, with the help of many existing resources, we can choose an RNA that suits our goals. The next step is to assemble a complex of the RNAs and the Cas protein, the so-called ribonucleoprotein (RNP). This simply means that we put together the trRNA and the crRNA in a 1:1 molar ratio; this duo is known as the guide RNA. The guide RNA is then added to the Cas9 protein, also in a 1:1 molar ratio. If the goal of our experiment is to knock out a certain gene, then we do not add anything else to the assembly. These steps are followed by the delivery of the RNP into the cell of interest, which can happen through various methods: lipofection, electroporation or microinjection, of which lipofection is the simplest. As a consequence of the double-stranded break (DSB), non-homologous end joining (NHEJ) will occur, meaning that a few bases will be inserted between the ends of the broken DNA strands. This way the gene will no longer code for a functional protein. If our aim is not knocking out a gene but implementing a modification so that the gene functions differently, then we need to add an extra piece of DNA to the cell of interest. This is called the homology-directed repair (HDR) template; it contains the sequence we want to insert, flanked by two arms that are complementary to the DNA that we wish to edit.

 

 

Ethical considerations

As CRISPR has become a popular technology worldwide due to its cost effectiveness, ease of use and lack of requirement for sophisticated technology, ethical concerns have arisen regarding its uses. Researchers have argued that CRISPR should be used in gene therapy in somatic cells, but not in germline editing as modifications would be passed to future generations.

When considering germline editing, the primary concern is safety. There is a high risk of off-target effects and mosaicism that any potential benefits cannot outweigh. However, it has been acknowledged that in some cases, such as both parents carrying the disease-causing variant, germline editing could be more useful than any other existing genome editing technology used for reproductive purposes.

As this is a new technology, it has also been argued that genome editing will widen the gap between wealthy and poor.

From a moral and religious standpoint, it is also argued that CRISPR should not be used in genome-editing research involving the creation or destruction of embryos. Some laboratories use non-viable embryos in their research to address questions about human biology, but these cannot be used under any circumstances for reproductive purposes.

Gene editing is also used on animals, but ethical concerns related to decimating an entire species, eliminating food sources for certain species, and promoting the proliferation of invasive pests are raised by opponents of CRISPR.

 

Bioinformatics

 

Bioinformatics plays an essential role in the detection and analysis of CRISPR systems. Thanks to bioinformatic analyses, matches of CRISPR spacers to bacteriophages were first detected, which led to the conclusion that CRISPR-Cas acts as an acquired immune system.

 

  • Prediction of CRISPR-Cas systems

Perhaps its most obvious use is in the prediction and characterization of CRISPR-Cas systems, which in practice means identifying cas genes and CRISPR arrays. While cas genes are easily predicted by classical databases like Pfam, CRISPR sequences can cause more problems due to their irregularity caused by spacer acquisition (short sequences from the phage genome inserted between the CRISPR repeats after an infection). Therefore, all identification methods focus on finding sequences that meet specific requirements of repeat length, spacing, similarity or number. The most popular tools to achieve this are CRISPRFinder and CRISPRCasFinder. The desired output for each studied CRISPR array covers the coordinates, length and sequence of each found spacer, crucial knowledge when designing CRISPR-based gene editing experiments. The process starts with a search for repetitive elements in the genome that could form a putative array, bearing in mind the high sequence similarity between the direct repeats. The most promising CRISPR candidates have a repeat length between 23-55 nt, repeat similarity of at least 80% and spacers whose length is 0.6-2.5 times the repeat size. In the last step, the algorithm evaluates the similarity of the predicted spacer sequences via multiple alignment with MUSCLE. If the pairwise similarity between spacers exceeds 60%, the sequence is ruled out; otherwise, a level of confidence is assigned based on the similarity level, where levels 3 and 4 mark highly promising candidates.
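To make these filtering criteria concrete, here is an illustrative Python sketch that checks a candidate array of repeats and spacers against the thresholds quoted above; it is a simplified stand-in, not the actual CRISPRFinder or CRISPRCasFinder logic, and it uses a generic string-similarity measure in place of a proper alignment.

```python
# Check a candidate repeat/spacer array against the thresholds quoted above
# (simplified stand-in for CRISPRFinder-style logic; difflib replaces real alignment).
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def plausible_array(repeats, spacers):
    repeat_len = len(repeats[0])
    if not 23 <= repeat_len <= 55:                      # repeat length 23-55 nt
        return False
    if any(similarity(repeats[0], r) < 0.8 for r in repeats[1:]):   # ~80% repeat similarity
        return False
    if any(not 0.6 * repeat_len <= len(s) <= 2.5 * repeat_len for s in spacers):
        return False                                    # spacers 0.6-2.5x the repeat size
    for i in range(len(spacers)):                       # spacers that are too similar
        for j in range(i + 1, len(spacers)):            # (>60%) suggest a false positive
            if similarity(spacers[i], spacers[j]) > 0.6:
                return False
    return True

repeats = ["GTTTTAGAGCTATGCTGTTTTGAATGGTCCC"] * 3        # hypothetical 31-nt direct repeats
spacers = ["ACGTTACCGGATTACGCTAAGCT", "TTGCAGGCATCAAGTTTACCGAT"]
print(plausible_array(repeats, spacers))                # True only if spacers are dissimilar enough
```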

 

  • Classification of CRISPR-Cas

Classification of CRISPR-Cas systems is essential to illustrate the origins and evolution of CRISPR loci in microbial genomes. Because of the high diversity and complexity of most cas protein sequences (as they have evolved much more quickly than other archaeal and bacterial genes), this classification task is as important as it is challenging. The first algorithm developed to crack this conundrum is called CRISPRmap. It uses the CRISPR sequence and the conservation of the RNA secondary structure of the direct repeats. These direct repeats are taken as input and grouped into clusters based on sequence and on the preservation of secondary structure. These clusters are then checked for motifs overlapping with child clusters, and those satisfying specific criteria are classified into families using Markov clustering. The algorithm was tested on a complex set of more than 3500 CRISPR sequences, successfully identifying 33 potential conserved structural motifs and 40 sequence families. Such information is absolutely crucial in studying the evolutionary relationships between distinct cas proteins and allows for their effective classification.

 

  • Target identification

The main advantage of CRISPR-Cas gene editing technology, its specific target identification, relies on two mechanisms. First, the spacers are almost exactly complementary to the sequence at the targeted site in the nucleic acid, and second, the target has to be accompanied by the Cas-specific PAM. The PAM, or protospacer adjacent motif, is a short DNA sequence required for a Cas nuclease to cut and is generally found 3-4 nucleotides downstream of the cleavage site. Popular software tools developed to study targeting efficiency and potential off-targets include CCTop, Cas-OFFinder and the newly established uCRISPR. To identify the targets of newly developed CRISPR-Cas systems with unknown PAMs, another program, CRISPRtarget, simply performs similarity searches based on spacer sequences. It is essentially based on the BLAST algorithm, comparing user-provided guide RNA sequences against selected databases of potential target sequences, for instance phage genomes.

 

  • Guide RNA design

Bioinformatic analysis also plays a significant role in the design of the synthetic guide RNA, a crucial component of this gene-editing technology. Target specificity is the most important criterion that these gRNAs have to meet. Some of the most widely used tools to achieve this are E-CRISP, CHOPCHOP and GuideScan, all based on similar workflow principles. After identifying the target gene to be edited, the key step is the selection of an appropriate, complementary region next to a PAM. The selected candidate is then assessed against two desired features: high on-target efficiency (checked using NGS methods) and low off-target activity. The algorithms checking the latter are based on a minimal biophysical model of the free energy needed for the transitions of the CRISPR-Cas effector complex, i.e. PAM binding and R-loop formation, using hybridization kinetics. The most developed ones give fairly reasonable predictions, with up to 98% accuracy, and are mainly evaluated by mismatch positions. Another approach used to find potential off-target binding regions in whole genomes is to search for sites that have 1-3 mismatches to the guide RNA. All of the known methods give only a relative measure of off-target activity, since they take into consideration only sequence similarities and not experimental factors like Cas protein concentration.
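As a toy illustration of the mismatch-based approach, the sketch below slides a guide sequence along a hypothetical genome fragment and reports windows with at most three mismatches; as noted above, this captures sequence similarity only and ignores the PAM requirement and experimental factors.

```python
# Naive off-target scan: slide a guide along a genome fragment and report windows
# with at most three mismatches (sequence similarity only; PAM and experimental
# factors such as Cas9 concentration are ignored, as noted above).
def count_mismatches(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

def off_target_sites(guide, genome, max_mismatches=3):
    guide, genome = guide.upper(), genome.upper()
    sites = []
    for i in range(len(genome) - len(guide) + 1):
        window = genome[i:i + len(guide)]
        mismatches = count_mismatches(guide, window)
        if mismatches <= max_mismatches:
            sites.append((i, window, mismatches))
    return sites

guide = "GACGTTACCGGATTACGCTA"                      # hypothetical 20-nt guide (DNA alphabet)
genome = "TTGACGTTACCGGATAACGCTAGGCATG"             # hypothetical genome fragment
print(off_target_sites(guide, genome))
```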

 

CRISPR applications

 

Though the CRISPR-Cas9 system is mainly famous for its significance in the field of genome engineering, there are many other applications that exploit the benefits of this system. 

 

CRISPR systems in transcriptional activation and repression

 

By introducing just a few mutations, the Cas9 protein can be rendered catalytically inactive, so that it is no longer able to cut the DNA strand while retaining its target-finding qualities. Such a modified Cas9 protein can then be fused to an accessory regulatory component. Once bound, it recruits transcription factors to the targeted gene, which can reversibly either silence or enhance its expression.

 

A good example of this application is the dCas9-SAM system used to amplify gene expression. A specific sgRNA guides dCas9, together with an array of transcriptional activators (such as VP64 and p65), to the promoter of the gene of interest. This powerful method multiplies gene expression up to three thousand times. Moreover, SAM systems are able to act on 10 genes simultaneously, making polygenic interaction studies possible. In addition to mRNA, SAM influences the activity of non-coding RNAs, crucial regulatory factors in many organisms. All of the above makes it an extremely attractive tool for changing the epigenetic landscape of an organism, including reprogramming cellular activity, which has multiple applications in regenerative medicine.

 

Using CRISPR Libraries for Screening

CRISPR screening is an experimental approach used to discover genes or genetic sequences that elicit a specific function or phenotype for a cell type. For instance, nowadays, CRISPR screening is used to identify genes or genetic sequences associated with drug resistance, drug sensitivity, susceptibility to environmental toxins or DNA sequences leading to a particular disease state.

When, for example, the resistance of a cell line to a drug treatment is tested, CRISPR screening is used to knock out one gene per cell, resulting in a population of cells with a different gene knocked out in each cell; this population of edited cells is then allowed to grow for a few days. The tested drug will kill some cells, but others will survive, and next-generation sequencing is then performed on the surviving edited cells to identify which DNA sequences are present and which are absent. This technique can identify which genes the cells require in order to survive the drug treatment. The methodology has been used, for example, to understand the genetic changes that make some cancer cell lines resistant to a particular drug treatment.

“CRISPR libraries” are not exactly CRISPR guide RNAs, but rather the batch of lentiviruses containing a pool of oligonucleotides (each virus will have a different oligonucleotide from the pool), each coding for a CRISPR guide and cloned into a lentiviral gene-containing plasmid. Lentiviruses are RNA viruses, so each lentivirus contains viral RNA. The viral RNA is too long to be used by the Cas enzyme; it is first reverse-transcribed into DNA which integrates into the genome of the infected cell. The aim is to infect the cells with one virion per cell, and since each lentivirus in the library includes one sequence from the original oligonucleotide pool, only one such sequence is integrated into the genome of each infected cell. After integration, the lentiviral sequences, including the cloned-in CRISPR sequences, are transcribed to RNA, producing CRISPR guide RNA. A Cas enzyme must also be expressed in the target cells. Following treatment of the cells with the lentiviral library and the Cas enzyme, cells are incubated to allow phenotypic CRISPR-mediated changes, following which a specific treatment may be performed if desired for a particular experiment. After this, DNA (or RNA) samples can be collected from the cells and subjected to next generation sequencing.

CRISPR screening has also been used in animals. Researchers used a cancer cell line from a mouse and infected it with a CRISPR library of over 67,000 lentiviruses. When the cells were transplanted into the mouse, tumors started to grow. After sequencing the DNA of the metastases, the researchers found several genes targeted by the CRISPR technology. This helped the scientists pinpoint the genes in which loss-of-function results in tumor formation and metastasis.

Imaging living cells

Imaging DNA and RNA in living cells is a challenge that has been addressed before; however, there are still regions which cannot be visualized with great precision. Fluorescent in situ hybridization (FISH) is a commonly known technique that researchers use to track DNA and RNA, but this method requires the fixation of the cell.

The target specificity of the CRISPR-Cas 9 system offers a great potential for achieving improvements in this field.  

A cas9 protein that is engineered to lack the endonuclease activity is fused with an enhanced green fluorescent protein (EGFP). This is then combined with a carefully designed small guide RNA (sgRNA).  

The analysis of the results then proceeds similarly to FISH but, in contrast to that method, this process requires neither the denaturation of nucleic acids nor cell fixation, which makes it less error prone. Thanks to the specificity of this complex, it offers a more efficient way of observing chromosome dynamics. The two principal areas in DNA imaging of special significance are chromosome remodeling and telomere dynamics.

 

Sources:

 

Chen, B., Gilbert, L. A., Cimini, B. A., Schnitzbauer, J., Zhang, W., Li, G. W., Park, J., Blackburn, E. H., Weissman, J. S., Qi, L. S., & Huang, B. (2013). Dynamic imaging of genomic loci in living human cells by an optimized CRISPR/Cas system. Cell, 155(7), 1479–1491. https://doi.org/10.1016/j.cell.2013.12.001

 

Omer S. Alkhnbashi, Tobias Meier, Alexander Mitrofanov, Rolf Backofen, Björn Voß,

CRISPR-Cas bioinformatics, Methods, Volume 172, 2020, Pages 3-11, ISSN 1046-2023,

https://doi.org/10.1016/j.ymeth.2019.07.013, https://www.sciencedirect.com/science/article/pii/S1046202318304717

 

Genscript.com. 2021. CRISPR for Transcriptional Activation and Repression-GenScript丨CRISPR/Cas9 Applications. [online] Available at: <https://www.genscript.com/crispr-for-transcriptional-activation-and-repression.html> [Accessed 25 August 2021].

 

Gavin J. Knott, Jennifer A. Doudna, CRISPR-Cas guides the future of genetic engineering, Science  31 Aug 2018: Vol. 361, Issue 6405, pp. 866-869, DOI: 10.1126/science.aat5011

Integrated DNA Technologies: Getting started with CRISPR: a review of gene knockout and homology-directed repair

Desmond S. T. Nicholl: An Introduction to Genetic Engineering Third Edition, 2008, Cambridge University Press

Genome.gov. 2017. What are the Ethical Concerns of Genome Editing?. [online] Available at: <https://www.genome.gov/about-genomics/policy-issues/Genome-Editing/ethical-concerns>

Caplan, A., Parent, B., Shen, M. and Plunkett, C., 2015. No time to waste—the ethical challenges created by CRISPR. EMBO reports, [online] 16(11), pp.1421-1426. Available at: <https://www.embopress.org/doi/full/10.15252/embr.201541337>

Spencer, N., 2019. Overview: What is CRISPR screening?. [online] IDT. Available at: <https://eu.idtdna.com/pages/education/decoded/article/overview-what-is-crispr-screening> [Accessed 30 August 2021].

Genscript.com. 2021. Applications of CRISPR. [online] Available at: <https://www.genscript.com/applications-of-crispr.html> [Accessed 30 August 2021].

 

Epigenetics

 

What is epigenetics?

Epi is a Greek word for ‘above’, which implies that epigenetics is something on top of our traditional understanding of inheritance.

In the past, scientists observed many events that could not be explained by genetic knowledge alone. Conrad Waddington, a pioneer in the field, defined epigenetics in the 1950s as changes to gene expression induced by the environment. Quickly, numerous unexplainable biological phenomena were labelled ‘epigenetic’ changes, while the molecular mechanisms behind them remained largely unknown.

Identical twins have exactly the same genetic code, as they originate from one zygote that then separates into two distinct embryos. What’s more, they are subjected to identical conditions during the key early stages of development in the mother’s womb and, unless they are separated after birth, they are also brought up in similar environments. Given all that, it is very surprising that the risk of developing highly heritable conditions like schizophrenia among identical twins (if one sibling already suffers from it) is only around 50%. If DNA sequence were all that mattered, then identical twins would always be identical in every possible aspect.

However, it turns out that not only the script is of importance, but also the instructions on how to properly interpret it.

 

Molecular mechanisms

These phenomena of genetically identical organisms appearing very different from each other need some molecular explanation. This led to seeing epigenetics as modifications to the genetic material that change which genes are expressed, in other words switched on or off, but which do not alter the genes themselves. To be fair, it is easy to say that all the unexplainable changes are due to the influence of the ‘environment’. What really matters is figuring out how the environment actually does that. There must be a way in which various stimuli physically affect our gene expression machinery, even though the genetic sequence itself remains intact.

Two of the most important mechanisms that regulate gene expression are DNA methylation and histone modifications. A necessary condition for expression is the accessibility of the DNA, which requires that it is not tightly coiled into compact chromatin. DNA methylation is a gene-silencing modification, as it promotes the formation of compact chromatin. It takes place when a methyl group is added to one of the nucleotides of DNA, cytosine, by a group of enzymes called DNA methyltransferases. Importantly, this process occurs almost exclusively at cytosines that are placed next to guanines, forming so-called CpG sites, which cluster into CpG islands. The methylation of DNA sometimes physically disrupts the binding of transcriptional proteins to the gene, but far more often methylated DNA is bound by proteins known as methyl-CpG-binding domain proteins (MBDs), which then recruit additional chromatin remodelling proteins that modify histones, thus forming inactive chromatin (heterochromatin).
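As a side note on how CpG islands are found computationally, here is a hedged Python sketch using one commonly quoted rule of thumb (a window of at least 200 nucleotides, GC content above 50% and an observed-to-expected CpG ratio above 0.6); the exact thresholds vary between tools and are given here only for illustration.

```python
# Flag a DNA window as a putative CpG island using one commonly quoted rule of thumb:
# length >= 200 nt, GC content > 50%, observed/expected CpG ratio > 0.6 (thresholds vary).
def is_cpg_island(seq):
    seq = seq.upper()
    length = len(seq)
    if length < 200:
        return False
    gc_content = (seq.count("G") + seq.count("C")) / length
    observed_cpg = seq.count("CG")
    expected_cpg = seq.count("C") * seq.count("G") / length
    if expected_cpg == 0:
        return False
    return gc_content > 0.5 and observed_cpg / expected_cpg > 0.6

print(is_cpg_island("CG" * 150))   # a toy, maximally CpG-rich 300-nt sequence -> True
```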

 

In almost every organism analyzed, it was found that if the methylation is located in a gene promoter, it acts to repress gene transcription. CpG-dense promoters of actively transcribed genes are never methylated, but, conversely, silent genes do not necessarily have to be methylated. In total, 60–70% of genes have a CpG island in their promoter region and most of these CpG islands remain unmethylated independently of the transcriptional activity of the gene.

The second major epigenetic mechanism is histone modification, specifically acetylation and deacetylation. The octamer of histone proteins with DNA tightly wrapped around it makes up a structure called a nucleosome; nucleosomes are then coiled up further into chromosomes. Looking more closely at histone structure, all histones consist of alpha-helical regions and an N-terminal tail, which varies among histone types. This tail is precisely where the epigenetic modifications occur. The most common one is acetylation, the addition of an acetyl group to lysine residues in the tail. These positively charged amino acids attract the negatively charged DNA, making the nucleosome coiling tight. Upon acetylation, the lysines become neutral, so there is no longer any attractive electrostatic interaction and the DNA becomes looser. This energetically demanding reaction requires an enzyme to proceed: a histone acetyltransferase. Acetylated histones contribute to the formation of euchromatin, easily accessible DNA that is ready to be transcribed and expressed. On the other hand, deacetylation, performed by histone deacetylases, favours tight coiling of nucleosomes, forming inaccessible heterochromatin.

Two other frequently observed types of modification are methylation (adding a methyl group to lysine or arginine) and phosphorylation (adding a phosphate to serine or threonine). With methylation, the effect of the epigenetic modification depends on which residue is methylated and what type of methylation it is, so it can have either an activating or a repressing effect. Phosphorylation is mostly considered to increase gene expression, as the negatively charged phosphates added to the histone tails repel the negatively charged DNA backbone, resulting in the DNA being packed more loosely.

 

Gene expression control

Gene expression describes the production of a functional product, such as proteins, from a series of nucleotides in DNA, by transcription of a gene into RNA. Every cell in the body, except for the gametes which contain one copy of each chromosome rather than two, contains all the genetic information of the organism. However, as the cells become specialized into different cell types (for example, a red blood cell, a neuron, a hepatocyte in the liver, a muscle cell), they express different genes.

But how can a cell know which genes to express and which genes to silence to serve its purpose?  This process is known as the regulation of gene expression. Imagine what would happen if there was no regulation of gene expression – your eye cells would express the same genes as your stomach cells and, therefore, start secreting hydrochloric acid!

However, genes cannot control an organism on their own, but they will rather interact with the organism’s environment. Some genes are always expressed regardless of the environmental condition. They are called constitutive genes and control the fundamental processes within an organism: DNA replication, transcription, repair, central metabolism and protein synthesis are some examples. The other type of genes are regulated genes whose expression gets turned “on” and “off”, or “up” and “down” (like a rheostat), depending on the environmental conditions.

In prokaryotes, genes are regulated in a comparatively simple manner by classical mechanisms involving activator and repressor proteins binding to DNA. Most of the regulatory proteins are negative, so they turn gene expression “off”. For example, the formation of tryptophan in E. coli requires three enzymes, which are synthesised from a total of five genes that lie very close to one another on the bacterial chromosome and share one promoter. In this case, there is a segment of DNA between the promoter and the first of the five genes, called an operator, which acts as an on/off switch: it controls whether RNA polymerase has access to transcribe the downstream genes. The whole system made up of the promoter, the operator and the genes is called an operon. Normally, the operon is “on”, but if a specific repressor binds to the operator, the promoter is blocked and RNA polymerase cannot start transcribing. Thus, the repressor inhibits gene expression and stops the production of tryptophan when, for example, there is enough tryptophan in the environment. E. coli only needs to make tryptophan when environmental levels of tryptophan are low, and the operon is switched “on” in this case. However, there are also genes that are typically “off” and need to be activated. In E. coli, there are genes that produce enzymes responsible for breaking down lactose into glucose and galactose. Normally, there is a repressor bound to the operator preceding these genes, but an isomer of lactose can deactivate the repressor, allowing transcription of the genes and higher levels of lactose metabolism. These genes are turned “on” when environmental levels of lactose are high. These mechanisms are examples of negative gene regulation. This is not an epigenetic mechanism, as it does not involve chemical changes to the DNA and bacteria do not have histone proteins, although it does allow the bacteria to respond to environmental changes.

Eukaryotes also regulate genes using transcriptional activators and repressors. However, histone modification and DNA methylation are additional methods of regulating gene expression in eukaryotes. There are some very well-known examples of epigenetics used to regulate the expression of genes.

Genomic imprinting is one example of gene regulation by epigenetic mechanisms. In this process, one of the two alleles of a gene is silenced for the entire life span of the cell, depending on the sex of the parent from whom the allele was inherited. Some genes are silenced when inherited from the mother, and others when inherited from the father. Genomic imprinting results from DNA methylation and histone modification, and when it is combined with a genetic mutation it can lead to disease. For example, when a particular DNA sequence is deleted from the paternal chromosome 15, the child develops Prader-Willi syndrome. When the same sequence is deleted from the maternal chromosome 15, the child will have Angelman syndrome.

Another mechanism involving epigenetic gene silencing for the entire lifespan of the cell is X chromosome inactivation. The X chromosome carries over 1,000 genes that ensure cell development and viability, but as females carry two copies of the X chromosome, there is a risk of a toxic double dose of X-linked genes. To avoid this risk, female mammals shut off one of the two copies of the X chromosome by transcriptional silencing and then compact it into a stable structure called a Barr body. This process involves transcription, the participation of two noncoding, complementary RNAs (XIST and TSIX) that initiate and control the process, and CTCF, a DNA-binding protein. XIST “coats” the X chromosome to be silenced, and epigenetic changes to that chromosome follow. One well-known example of X-chromosome inactivation is the colour pattern of calico cats. The genes coding for fur pigmentation are X-linked, and each X chromosome produces a different colour when left active (either orange or black). Most calico cats are female because the patchwork pattern requires two X chromosomes carrying different colour alleles, only one of which remains active in any given cell. This inactivation process is widely researched in the field of cancer biology, as the active state of both X chromosomes has been linked to human breast and ovarian tumour formation.
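
As a toy illustration (our own sketch, with made-up cell counts and nothing drawn from the cited sources), the patchwork coat of a calico cat can be mimicked by letting each simulated cell randomly silence one of its two X chromosomes:

```python
# A toy simulation (made-up numbers) of random X-inactivation producing a
# calico-style mosaic: each cell independently silences one of its two
# X chromosomes, so it expresses either the orange or the black allele.
import random

random.seed(1)
alleles = ("orange", "black")          # one fur-colour allele on each X chromosome

def cell_colour() -> str:
    active_x = random.randint(0, 1)    # which X escapes inactivation in this cell
    return alleles[active_x]

patch = [cell_colour() for _ in range(20)]
print(patch)                           # a patchwork of orange and black cells
```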

Epigenetic changes can also occur due to poor nutrition during pregnancy. Take the example of babies whose mothers were pregnant during the Dutch Hunger Winter famine of 1944-1945. When researchers looked at gene expression in these people 60 years after the famine, they discovered increased levels of methylation at some genes and decreased levels at others compared with their siblings born under normal conditions. These epigenetic changes can explain why humans born during harsh conditions (e.g. famine, war) are more likely to develop diseases such as type 2 diabetes, heart disease and schizophrenia in their adult life.

 

Applications of epigenetics

 

  • Medicine and therapies

Many illnesses, such as cancers, neurodegenerative disorders, cardiovascular diseases and other ageing-related conditions, are associated with environmentally influenced alterations, many of which are epigenetic. Scientists have long sought to understand why certain people don’t respond well to standard therapies and drugs. Only after the importance of epigenetics was recognised did the concept of personalized medicine emerge, promising a revolutionary approach that combines genetic and epigenetic diagnostic testing. This would allow an individual’s personal genomic profile to be created by identifying all the relevant molecular alterations in their cells, both genetic and epigenetic.

In human cancer development, it is very common for oncogenes, such as the MYC proto-oncogene, to become epigenetically activated at some point. Furthermore, cancers frequently use epigenetic mechanisms to deactivate cellular antitumour systems by methylating genes called tumour suppressors. Either of these two mechanisms (or both simultaneously) leads to a significant imbalance in the rates of cell division, proliferation and death, ultimately causing tumour progression and irreversible organ damage. With the development of various drugs targeting epigenetic regulators, epigenetic-targeted therapy has been applied in the treatment of haematological malignancies and has shown viable therapeutic potential for solid tumours in preclinical and clinical trials. Although epigenetic therapy has a rational basis in theory, some problems remain to be solved. The most important is selectivity, because epigenetic events occur in both normal and cancer cells. The priority is therefore to determine the most important epigenetic alterations in specific types of cancer, so as to avoid targeting healthy cells.

Furthermore, a fuller understanding of the specific mechanisms underlying those alterations in different cancers is necessary to design safe and accessible therapies. Personalized medicine seems a great fit for increasing the specificity and safety of therapies, taking into account each patient’s unique genomic and epigenomic data. It is also worth noting that one of the most significant advantages of epigenetic therapies is that, unlike gene therapy, they are reversible, which means they potentially carry less risk.

  • Drugs and ageing

Epigenetic modifications such as DNA methylation, histone modifications and alterations in microRNA expression are highly reversible in normal tissue but can become imbalanced and heritable in tumours or other abnormal cells. These epigenetic changes play an important role in controlling gene expression and genomic stability throughout the life of an organism. When epigenetic dysregulation occurs, it can contribute to age-associated diseases such as cancer, diabetes and the decline of the immune response. Additionally, there is extensive evidence that epigenetic mechanisms are involved in synaptic plasticity and play a key role in memory and learning; any dysregulation of an epigenetic mechanism in the brain can therefore lead to neurodegenerative or psychiatric diseases. Thus, one of the main scientific interests today is so-called “epigenetic drugs”, which act on the enzymes responsible for generating epigenetic modifications. One example is drugs that target the superfamily of histone deacetylases (HDACs), which includes HDAC1-11 and the sirtuins (SIRT1-7).

HDAC inhibitors are promising candidates in the treatment of cancer, as they have shown anti-tumour activity against haematological malignancies; in the treatment of neurodegenerative diseases such as Huntington’s disease, Parkinson’s disease, Alzheimer’s disease and Rubinstein–Taybi syndrome (by ameliorating deficits in synaptic plasticity, cognition and stress-related behaviours); in the treatment of anxiety and mood disorders; and in the regulation of the innate immune response against microbial pathogens.

Both sirtuin inhibitors and activators have gained much scientific interest as a therapeutic approach for treating metabolic, cardiovascular and neurodegenerative diseases, as well as cancer. Some sirtuin inhibitors have been reported to have anti-proliferative effects in cell cultures and mouse tumour models. Additionally, inhibitors of SIRT2, such as AGK2, were reported to be effective in models of Parkinson’s disease, while sirtuin activators can have protective effects against Alzheimer’s disease.

 

References:

Ahn, J. and Lee, J., 2008. X Chromosome Inactivation | Learn Science at Scitable. [online] Nature.com. Available at: <https://www.nature.com/scitable/topicpage/x-chromosome-x-inactivation-323/> [Accessed 9 September 2021].

Centers for Disease Control and Prevention. 2020. What is epigenetics?. [online] Available at: <https://www.cdc.gov/genomics/disease/epigenetics.htm> [Accessed 9 September 2021].

Genome.gov, 2021. Genetic Imprinting. [online] Available at: <https://www.genome.gov/genetics-glossary/Genetic-Imprinting> [Accessed 9 September 2021].

Hoopes, L., 2008. Gene Expression and Regulation | Learn Science at Scitable. [online] Nature.com. Available at: <https://www.nature.com/scitable/topic/gene-expression-and-regulation-15/> [Accessed 9 September 2021].

Phillips, T., 2008. Noncoding RNA and Gene Expression | Learn Science at Scitable. [online] Nature.com. Available at: <https://www.nature.com/scitable/topicpage/regulation-of-transcription-and-gene-expression-in-1086/> [Accessed 9 September 2021].

Vaiserman, A. and Pasyukova, E., 2012. Epigenetic drugs: a novel anti-aging strategy?. Frontiers in Genetics, [online] 3. Available at: <https://www.frontiersin.org/articles/10.3389/fgene.2012.00224/full> [Accessed 9 September 2021].

 

Non-coding DNA

 

Selfish DNA?

The term non-coding DNA refers to the fragments of the genome that do not code for proteins. The terminology in this case is somewhat confusing: the fact that these fragments don’t code for proteins doesn’t mean that they don’t code for anything. Yet, historically, this portion of the human genetic blueprint was deemed ‘junk DNA’ and believed to be completely useless. The term was widely popularized and still lingers in people’s consciousness: in the survey about common misconceptions about genomics that we conducted at the beginning of the project, 19% of students dismissed non-coding DNA as garbage. In hindsight, it is baffling that the scientific community could so widely accept the hypothesis that around 98% of the human genome (the approximate share of non-coding sequences in human DNA) is functionless, yet was still evolutionarily retained. After all, replicating and repairing such huge amounts of DNA consumes a significant portion of the cell’s resources. It was probably a result of biologists’ earnest fascination with multi-functional, perfectly adapted proteins. While these elegant molecules fully deserve their impeccable reputation, they still need some help from the non-coding DNA, which turned out not to be so selfish after all.

 

Recycling the junk

A paradigm shift occurred when a strange correlation was observed: the proportion of non-coding DNA in a genome correlates with the complexity of the organism. What is even more interesting, the same is not true of the number of protein-coding sequences. In practice, this means that a larger number of genes does not make an organism more sophisticated. The Onion Test is a simple concept that demonstrates this idea. The genome of Allium cepa (the onion plant) is about five times larger than that of Homo sapiens, yet the human organism is far more complicated than the onion. It seems rational that a more complicated organism would need more genes, and hence more DNA, than a less complicated one, but the onion test does not conform to this assumption. Rather, it suggests that the regions of the genome that code for proteins do not make up the majority of the genome. After puzzling over this conundrum for years, scientists eventually concluded that it is not the number of genes that makes an organism sophisticated, but the extent to which those genes can be spliced. Of course, complicated splicing patterns need to be carefully regulated. This is where our undeservedly underappreciated non-coding DNA comes in, finally able to outshine proteins and become a star of research.

 

The ENCODE Project 

The Human Genome Project was finished in 2003, and the fact that less than 2% of the human genome codes for protein gave researchers a reason to explore what functions the rest of the genome might have. This question gave rise to the Encyclopedia Of DNA Elements project, abbreviated as ENCODE. The long-term aim of this project was to map the functional regions of non-coding DNA and to establish their roles.

The pilot phase took place between 2003 and 2007, focusing on a targeted 1% of the human genome divided among 44 regions. It was an opportunity to try out newly emerging technologies, and it shed light on many previously poorly understood functions of the genome. The pilot phase used microarray-based assays to investigate transcribed regions of non-coding DNA, cis-regulatory elements (CREs), chromatin accessibility, and histone modifications. Briefly, microarray assays indicate whether a region is being transcribed or not. This happens on a chip with many thousands of probes measured simultaneously, so it was considered very time-efficient in the era of the pilot phase.
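
For readers curious what “indicating whether a region is activated” looks like in practice, the following hedged sketch (hypothetical probe names and intensities, not ENCODE data) shows the typical log2 fold-change comparison applied to microarray signals:

```python
# A hedged sketch (hypothetical probe names and intensities, not ENCODE data)
# of the typical microarray comparison: each probe's signal in two conditions
# is turned into a log2 fold change that says whether the corresponding
# region appears more or less transcribed.
import numpy as np

probes = ["region_A", "region_B", "region_C"]
condition_1 = np.array([120.0, 950.0, 40.0])   # fluorescence intensities
condition_2 = np.array([480.0, 900.0, 10.0])

log2_fc = np.log2(condition_2 / condition_1)
for name, lfc in zip(probes, log2_fc):
    status = "up" if lfc > 1 else "down" if lfc < -1 else "unchanged"
    print(f"{name}: log2 fold change = {lfc:+.2f} ({status})")
```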

The results of these four years suggested that the majority of our DNA is transcribed into RNA, though only a fraction of these transcripts codes for protein. Many regions that were thought not to be transcribed turned out to be templates for transcripts.

Many cis-regulatory elements were found as well, and scientists realised that the elements regulating transcription have an equal chance of being located upstream or downstream of transcription start sites.

The second phase of the project, which ran between 2007 and 2012, extended the research to the whole human genome, and more cell lines were examined. This phase paid particular attention to the transcriptional regulatory network of the human genome. It was observed that the combinations of transcription factors differ between locations. For example, regulatory elements that are distal to genes bind different combinations of TFs (transcription factors) than regulatory elements that are proximal to genes. These differences gave rise to a hierarchical model of TF organization, in which TFs on distinct levels have certain properties that indicate their function and the regions they are likely to regulate.

In the third phase, the studies were taken one step closer to reality, in the sense that experiments were carried out on cells taken directly from tissues.

The third phase also introduced new technologies that made it possible to explore the genome from other viewpoints. Methods like paired-end tagging and Hi-C chromosome conformation capture revealed a more sophisticated 3D structure of the chromatin than previous models, which provided an opportunity to understand the interactions between CREs better. Chromatin looping proved to be significant in gene regulation, as looping alters the physical distances between regions; it also gave clues about the relationships between certain enhancers and genes.
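
To give a feel for the kind of data Hi-C produces, here is a toy contact matrix (entirely made-up counts, purely our own illustration) in which a chromatin loop between two distant bins appears as an off-diagonal enrichment:

```python
# A toy Hi-C contact matrix (entirely made-up counts): rows and columns are
# genomic bins, and each entry counts how often two bins were captured in
# physical contact. Nearby bins touch often; a chromatin loop shows up as an
# unusually strong contact between two bins that are far apart linearly.
import numpy as np

n_bins = 10
# Background contacts decay with linear distance along the chromosome.
contacts = np.array([[max(0, 50 - 10 * abs(i - j)) for j in range(n_bins)]
                     for i in range(n_bins)], dtype=float)
# A loop brings bins 1 and 8 together despite their linear separation.
contacts[1, 8] = contacts[8, 1] = 45

# Flag the pair whose contacts most exceed the distance-based expectation.
expected = np.array([[max(1, 50 - 10 * abs(i - j)) for j in range(n_bins)]
                     for i in range(n_bins)], dtype=float)
enrichment = np.triu(contacts / expected, k=2)   # ignore the near-diagonal
i, j = np.unravel_index(np.argmax(enrichment), enrichment.shape)
print(f"Putative loop between bin {i} and bin {j} "
      f"({enrichment[i, j]:.0f}x above expectation)")
```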

This phase also put a lot of effort into rationalizing the cooperation of TFs at CREs. For example, it brought the first evidence for the existence of the so-called HOT region model. According to this model, there are HOT regions, mostly promoters and enhancers, that are bound by many TFs. The assembly of proteins is launched by anchor DNA sequences that recruit TFs, which open up the chromatin, and this complex then recruits more proteins. Mapping chromatin loops in many cell types showed that chromatin looping differs between cell types, and this is another factor that regulates gene expression.

The advanced methods also produced landscapes of RNA binding, which provided a new dataset for the project. Many RNAs are bound by proteins, the so-called RNA-binding proteins. These proteins are responsible for many steps in the post-transcriptional processing of mRNAs, including splicing, cleavage and polyadenylation.

The fourth phase of the project has been running from 2017 to the present day. Data is continuously produced and analyzed in the hope of one day achieving a near-complete understanding of the human genome.

 

Types of non-coding DNA

Introns are non-coding sequences within the DNA that were previously thought to be “junk DNA”. However, it is now known that introns are not “junk” and can have a function in gene expression and regulation. Splicing (intron removal) is an energetically expensive and time-consuming cellular process, so a lot of work has gone into identifying functions of introns that would justify this cost to the cell. Introns can contain regulatory sequences; when scientists removed introns in the laboratory, the expression of one or more genes was affected. For example, after splicing, some introns can form micro-RNAs (miRNAs), which have a role in regulating gene expression. By interfering with the mRNA, an miRNA can make the cell stop producing a certain protein. Other introns contain genes for other types of non-coding RNA that have important roles in the cell. It is now known that many introns are essential in the species that have them, and because splicing is such an important process in humans, mutations that affect splicing can be pathogenic. It is thought that up to roughly 50% of disease-causing mutations affect splicing.

The majority (>95%) of human genes can be spliced in multiple different ways to produce several different mature messenger RNA transcripts. This is called alternative splicing, and it allows different proteins to be made from the same original gene.
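
A small sketch (hypothetical gene structure, our own illustration) shows why alternative splicing multiplies the protein repertoire: with just two optional “cassette” exons, a single gene already yields four distinct mature mRNAs.

```python
# A minimal sketch (hypothetical gene structure): with two optional "cassette"
# exons that can each be included or skipped, one gene yields four distinct
# mature mRNAs, which is the combinatorial heart of alternative splicing.
from itertools import product

cassette_exons = ["exon2", "exon3"]          # optionally included
isoforms = set()
for choice in product([True, False], repeat=len(cassette_exons)):
    middle = [e for e, keep in zip(cassette_exons, choice) if keep]
    # exon1 and exon4 are constitutive, so they appear in every transcript.
    isoforms.add("-".join(["exon1"] + middle + ["exon4"]))

print(sorted(isoforms))   # four distinct transcripts from a single gene
```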

Pseudogenes were long thought to be gene copies that have completely lost their biological function, but recent studies have shown that some pseudogenes possess regulatory functions and are transcribed into RNA, so it is difficult to settle on an exact definition for this type of non-coding DNA. One of the functions of pseudogenes is providing genetic diversity, which can help when generating antibodies and antigen variation. Pseudogenes accumulate mutations over evolutionary time, and this helps scientists study mutation rates and neutral evolutionary patterns. As pseudogenes are fossils of their parent genes, they can also be a source of information about ancient transcriptomes.

Transposable elements (TEs) are DNA sequences that can change their location and move around in the genome. These TEs, or transposons, are classified into two divisions: retrotransposons, which require reverse transcription to transpose, and DNA transposons, which do not. These “jumping” genes can mutate other genes and cause diseases such as haemophilia. If LINE-1, an active transposon in our bodies, lands in the APC gene, it can lead to cancer; such insertions have been linked to colon cancer. Fortunately, most transposable elements seem to be silent, meaning they do not have any effect on the phenotype of the organism. Some TEs are inactivated by mutations that stop them from moving from one chromosomal location to another, and others are perfectly capable of changing their location but are held inactive by epigenetic mechanisms such as methylation, miRNAs and chromatin remodelling. In the case of chromatin remodelling, heterochromatin is so constricted that the transcription enzymes simply cannot reach the transposable elements found there. Because the movement of these elements is dangerous for the organism, most of the transposable sequences in the human genome are silent. Even the few active transposable elements that escape epigenetic silencing are usually stopped from jumping by mechanisms such as RNA interference (RNAi), mediated by small interfering RNAs, which control gene expression. However, transposable elements are not always destructive and play an important part in the evolution and gene regulation of an organism. They facilitate the shuffling of exons, the repair of DNA and the translocation of genetic sequences, thus contributing to the evolution of the genome.

Tandem arrays (highly repetitive DNA) are common at the centromeres and telomeres and include satellites, minisatellites and microsatellites. These elements were named as such because they separate from the bulk of nuclear DNA during centrifugation. Highly repetitive DNA poses a significant technical challenge to next-generation sequencing, and thus it is difficult to estimate the number of repeats. The new long-read sequencing technology of Oxford Nanopore is beginning to make it easier to sequence highly repetitive DNA: recently, researchers managed to complete a telomere-to-telomere sequence of the entire X chromosome, without any gaps, including the satellite centromeric DNA. Satellites have between 1,000 and 10 million repeated units and account for the DNA found at the centromere. Minisatellites can have hundreds of units of 7 to 100 base pairs (bp) and are present throughout the genome, especially at the telomeres. Microsatellites typically span around 100 bp or less and are formed of repeated units of one to six nucleotides; they include certain trinucleotide repeats associated with disease, such as Huntington’s disease. The tandem arrays at the telomeres have changed very little over evolutionary time, indicating that they play an important role in protecting the ends of eukaryotic chromosomes.
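
As an illustration of what microsatellites look like computationally, the following sketch (invented sequence, not real genomic data) scans a DNA string for short units of one to six bases repeated in tandem:

```python
# An illustrative sketch (made-up sequence, not real genomic data) that scans
# a DNA string for microsatellite-like runs: a unit of 1-6 bases repeated
# several times in tandem.
import re

def find_tandem_repeats(seq: str, min_copies: int = 4):
    """Return (start, unit, copies) for runs of a 1-6 bp unit repeated in tandem."""
    hits = []
    pattern = r"(.{1,6}?)\1{%d,}" % (min_copies - 1)
    for m in re.finditer(pattern, seq):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits

# The CAG run below mimics the kind of trinucleotide repeat expanded in
# Huntington's disease (the surrounding sequence is invented).
example = "GGATCCACAGCAGCAGCAGCAGCAGCAGTTGGA"
print(find_tandem_repeats(example))   # [(7, 'CAG', 7)]
```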

Although their function is not fully understood yet, repetitive sequences are known to be important in gene evolution and disease-gene mapping.

The bigger picture

It turns out that even in biology, beliefs depend to some extent on one’s perspective. In the case of non-coding DNA, a significant part of the scientific community did not accept the ENCODE conclusion that around 70% of non-coding DNA actually has a function. The argument lies in how functionality is defined. According to ENCODE, DNA can be considered functional if it displays any kind of biochemical activity, for instance if it is copied into RNA. Many scientists believe that this is not enough to prove that such a sequence has a meaningful use. The counterargument says that DNA can only be classified as functional if it has evolved to do something useful enough that a mutation disrupting it would have a harmful effect on the organism.

Eventually, it’s worth looking at the issue of non-coding DNA from the evolutionary point of view. An influential study from 2017, An Upper Limit on the Functional Fraction of the Human Genome, introduced a new definition of functional DNA- whether a sequence could be acted on by natural selection in either a positive or negative way. As the majority of the mutations that occur are harmful, they cause a reduction in the fitness of the population. Therefore it is the fertility that must compensate for that to maintain a constant population size from generation to generation. This required increase in fertility depends on the percentage of functional sites in the genome, the mutation rate, and the proportion of deleterious mutations in functional regions. Mutations in the non-functional, junk DNA regions wouldn’t have any bad influence on the organisms, so they would remain in the population. Taking into account real-life fertility rates in humans, the study estimated that the upper limit of the functional DNA in the human genome is 15%- much less than the ENCODE predicted.

Ultimately, there is still no consensus in the scientific community about exactly what percentage of the human genome can really be considered functional. Nevertheless, it is beyond doubt that protein-coding sequences are not the only ones of importance. At least some of our ‘junk DNA’ turns out to be crucial, so it is definitely worth studying its function in more depth.

 

 

References:

Encyclopedia.com, 2021. Repetitive DNA Elements. Genetics. [online] Available at: <https://www.encyclopedia.com> [Accessed 9 September 2021].

Tomasi, F., 2018. Transposons: Your DNA that’s on the go – Science in the News. [online] Science in the News. Available at: <https://sitn.hms.harvard.edu/flash/2018/transposons-your-dna-thats-on-the-go/?web=1&wdLOR=c6328BF69-6B3D-45DB-B592-07848D43A862> [Accessed 9 September 2021].

Podlaha, O. and Zhang, J., 2010. Pseudogenes and Their Evolution. Encyclopedia of Life Sciences, [online] Available at: <https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470015902.a0005118.pub2> [Accessed 9 September 2021].

Markgraf, B. Introns vs Exons: What are the Similarities & Differences? [online] sciencing.com. Available at: <https://sciencing.com/introns-vs-exons-what-are-the-similarities-differences-13718414.html> [Accessed 9 September 2021].

Tutar, Y., 2012. Pseudogenes. Comparative and Functional Genomics, [online] 2012, pp.1-4. Available at: <https://www.hindawi.com/journals/ijg/2012/424526/> [Accessed 9 September 2021].

Pray, L., 2008. Transposons: The jumping genes. Nature Education, 1(1):204.

Shapiro, J.A. and von Sternberg, R., 2005. Why repetitive DNA is essential to genome function. Biological Reviews of the Cambridge Philosophical Society, 80(2), pp.227-250. doi: 10.1017/s1464793104006657.

 

Introduction and welcome

Hi there!

It’s Anna, Andreea and Viki here. We’re all 2nd-year students of Natural Sciences at UCL. This summer, we got engaged in UCL’s Innovation Lab project. The topic we decided to explore is called: The Genomic Revolution- reaching for the opportunities written in our DNA.

In our project, we would like to explore the legacy of the Human Genome Project, which resulted in a completely new approach to medicine and therapeutics, popularly called the genomic revolution. Commenting on the Human Genome Project, Bill Clinton famously said: “Today we are learning the language in which God created life.” We would like to answer a question: to what extent did we actually learn this language of life, and where does that lead us? We begin with a survey to check students’ knowledge of this topic; using this information, we plan to adjust the content of our posts to address the students’ needs as precisely as we can. The topics we plan to cover are: what was achieved in the HGP, how advances in genome sequencing technology allow us to gather and analyze genomic data on a mass scale, how scientists were mistaken about non-coding ‘junk DNA’, and the discovery of epigenetics. We are also very keen on investigating the opportunities of the future, starting with groundbreaking discoveries in genetic engineering like CRISPR-Cas9 and finishing with the prospects of personalized medicine, wondering whether everyone’s genome will be sequenced and how these data could actually be useful in healthcare. Of course, it is also very important to consider the ethical and social implications of such a dynamic change.

Our objectives are:

  • To inform students already interested in science about relatively recent developments in genomics and the areas of future advancement;
  • To show interdisciplinary links between biology and data science, AI and informatics;
  • To tell an interesting and comprehensive story about the social, ethical and political impacts of the genomic revolution;
  • To make UCL students more aware of UCL’s contribution to global research in this area.

We really hope to get you more interested in genomics and the revolution in medicine that’s in front of us! Welcome to the journey.