My thesis project focuses on a specific type of variation in human DNA called "copy number variants" (CNVs). A CNV is a change in the number of copies of a DNA segment - for example, a deletion reduces the copy number, and a duplication (see Fig. 1) increases it. A person usually has two copies of each DNA segment - one from dad and one from mom. But just as two individuals can differ in the sequences of their DNA, the "copy number" of particular DNA segments can vary from person to person. In fact, one study showed about a 1.2% difference in CNVs and small insertions and deletions between two human genomes. This number may look small, but together with other structural variants (which cause a structural change in the DNA), the differences between the two individuals affected 4,867 genes!

Figure 1. A tandem duplication, which is one of many types of CNVs. Image Source

Although not all CNVs are harmful to human health (having more copies of the CCL3L1 gene, for example, can sometimes provide more resistance to HIV), CNVs can pose a great risk if the genomic regions they affect are involved in essential cellular processes. CNVs in the AUTS2 gene, for example, are thought to contribute to autism spectrum disorder, intellectual disability, seizures, and other neurodevelopmental disorders. The majority of pathogenic, or disease-causing, CNVs are non-recurrent. Each non-recurrent CNV seen in people has its own unique breakpoints (where the chromosome broke), unlike recurrent CNVs, which share common breakpoints at a particular genomic region.

What causes non-recurrent CNVs? We have some clues, but the exact mechanism is still unknown. My lab has developed a method of inducing CNVs that resemble non-recurrent CNVs in cell culture, which allows us to dissect the pathway that leads to the formation of these CNVs (imagine trying to do this sort of experiment in people!). Our most recent study highlighted transcription as a contributing factor to non-recurrent CNV formation. Transcription is the process that produces messenger RNAs (mRNAs), which then serve as templates for protein production. We saw an association between CNV-prone genomic regions (hotspots) and regions of active transcription, meaning that the locations of large, actively transcribed genes overlapped heavily with the hotspots. In particular, we saw a strong association between hotspots and actively transcribed regions that produce a transcript longer than 500,000 bp, which is huge! To put that in perspective, an average mRNA in mammals is just over 2,000 bp long.
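
To make "overlap" concrete, here is a minimal sketch of how one could count the hotspots that fall inside large, transcribed genes. This is only an illustration, not the analysis from our study: the function names, the tuple layout, and every coordinate below are made up for the example, and the 500,000 bp cutoff is simply the threshold mentioned above.

```python
# Hypothetical sketch: all coordinates and names are invented for illustration.

def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def fraction_of_hotspots_in_large_genes(hotspots, genes, min_length=500_000):
    """Fraction of hotspots overlapping a transcribed gene longer than min_length.

    hotspots: list of (chrom, start, end)
    genes:    list of (chrom, start, end, is_transcribed)
    """
    # Keep only genes that are both transcribed and longer than the cutoff.
    large_active = [
        (chrom, start, end)
        for chrom, start, end, is_transcribed in genes
        if is_transcribed and (end - start) > min_length
    ]
    if not hotspots:
        return 0.0
    hits = 0
    for h_chrom, h_start, h_end in hotspots:
        # A hotspot counts once, no matter how many genes it overlaps.
        if any(
            h_chrom == g_chrom and overlaps(h_start, h_end, g_start, g_end)
            for g_chrom, g_start, g_end in large_active
        ):
            hits += 1
    return hits / len(hotspots)


# Made-up example data (not real hotspot or gene coordinates).
example_hotspots = [
    ("chr1", 1_000_000, 1_200_000),
    ("chr1", 5_000_000, 5_100_000),
    ("chr2", 300_000, 400_000),
]
example_genes = [
    ("chr1", 900_000, 1_600_000, True),     # 700 kb, transcribed
    ("chr1", 4_900_000, 5_600_000, False),  # 700 kb, not transcribed
    ("chr2", 250_000, 450_000, True),       # 200 kb, transcribed but too short
]

example_fraction = fraction_of_hotspots_in_large_genes(example_hotspots, example_genes)
```

In this toy example only the first hotspot sits in a gene that is both transcribed and longer than the cutoff, so one of the three hotspots counts as overlapping.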

Figure 2. Half of CNV-prone regions (>=5 CNVs - blue triangles) do not replicate until G2 phase, a much higher share than the approx. 20% seen for the average genome (black circles). The Y axis shows the % of regions that replicated in each stage of the cell cycle. The X axis shows the stages of the cell cycle: G1, four stages of S, followed by G2. Image Source

In addition, we have observed that the hotspots replicate very late in the cell cycle. When a cell divides, it first needs to replicate its DNA so that the two daughter cells each receive an identical copy of the genome. This happens during a stage of the cell cycle called S phase. During S phase, different parts of the genome are replicated at different times - about 20% of the human genome actually does not replicate until G2 phase, which is the stage after S (Fig. 2). Using the same replication timing dataset, we found that almost half of the hotspots do not replicate until G2 phase, a much higher share than for the average genome. This suggests that the hotspots tend to replicate late in the cell cycle compared to the rest of the genome.

Figure 3. I investigated two different cell types, 090 and HF1, and two large genes, LSAMP and DAB1. LSAMP is transcribed in 090 but not in HF1. DAB1 is transcribed in HF1 but not in 090. I measured the frequency of CNVs in the two cell types at the two genes, where "intact" refers to not having a CNV and "CNV" refers to having a CNV. In LSAMP, 090 had 18 CNVs and HF1 had 0. In DAB1, 090 had 0 CNVs whereas HF1 had 2. This observation supports the idea that transcription of these large genes somehow contributes to CNV formation. The p-values were calculated using a chi-squared test. Image Source
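
For readers curious about what a chi-squared test on this kind of "CNV vs. intact" table looks like, here is a minimal sketch. It is not the exact analysis behind Fig. 3: I am assuming a plain Pearson chi-squared test on a 2x2 table without continuity correction, and the "intact" counts in the example are placeholders I made up, since the caption only gives the CNV counts (18 and 0 for LSAMP).

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the 2x2 table

        [[a, b],
         [c, d]]

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    # Shortcut formula for the 2x2 Pearson statistic.
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-squared upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# LSAMP example: the CNV counts (18 and 0) come from the Fig. 3 caption.
# The "intact" counts (82 and 100) are HYPOTHETICAL placeholders.
lsamp_stat, lsamp_p = chi_squared_2x2(
    18, 82,   # 090: CNV, intact (intact count is made up)
    0, 100,   # HF1: CNV, intact (intact count is made up)
)
```

With counts this small (a zero in one cell), a continuity correction or Fisher's exact test is often considered instead, so treat the sketch as an illustration of the idea rather than a recommendation.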

So between transcription of large genes and late DNA replication, which contributes more to non-recurrent CNV formation? The hotspots account for only 0.4% of the entire genome, yet 20% of the genome replicates in G2 phase, so most late-replicating DNA is not a hotspot. We therefore reasoned that late replication, by itself, is not sufficient for CNV formation. Transcription of a large gene, on the other hand, seems to play more of a determining role. We compared how many CNVs form in a large gene that is transcribed in one cell type but not in another, and the cell type that transcribed the gene had significantly more CNVs at that site (Fig. 3)!
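
The late-replication argument can be spelled out with a rough back-of-the-envelope calculation. This is only a sketch: it assumes that the "almost half" of hotspots replicating in G2 can be treated as half of the 0.4% of the genome that hotspots cover, even though Fig. 2 counts regions rather than base pairs.

```latex
\begin{align*}
\text{hotspots replicating in G2} &\approx 0.5 \times 0.4\% = 0.2\% \text{ of the genome} \\
\text{hotspot share of G2-replicating DNA} &\approx \frac{0.2\%}{20\%} = 1\%
\end{align*}
```

Under those assumptions, roughly 99% of the DNA that replicates in G2 is not a hotspot, which is why late replication alone does not look sufficient.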

Currently, I am investigating exactly how active transcription of large genes contributes to non-recurrent CNVs. The strong association between transcription and hotspots gives us reason to believe that transcription is doing something to promote the subsequent formation of CNVs at those sites, but we still don't know what. And considering how clinically important non-recurrent CNVs are, this is an important question to pursue.