In 1972, the geneticist Susumu Ohno coined the term “junk DNA” to explain why the genomes of closely related organisms vary so much in size:
The mammalian genome […] contains roughly […] 3.0 × 10⁹ base pairs. This is at least 750 times the genome size of E. coli. If we take the simplistic assumption that the number of genes contained is proportional to the genome size, we would have to conclude that 3 million genes or so are contained in our genome. The falseness of such an assumption becomes clear when we realize that the genome of the lowly lungfish and salamanders can be 36 times greater than our own.
Ohno, S. (1972) So much ‘junk’ DNA in our genome, In: Smith, H. H. (ed) Evolution of genetic systems, 23, 366-370.
Setting aside the unjustifiable jibe against “lowly” non-human animals, this observation is as true today as it was in 1972, but you certainly wouldn’t know that from the way the results of the ENCODE project were reported last week. The ENCODE project has claimed that 80% of the human genome has a “specific biological activity”, and journalists have widely reported this as a scientific earthquake that has destroyed the “myth” of junk DNA.
I think this is a headline-baiting and flawed analysis (and I’m certainly … not … alone), but the argument is much more interesting than what one trumped-up Teaching Fellow thinks.
Ohno’s 1972 observation is an aspect of a bigger picture in genomics called the C-value paradox. The C-value of a cell is the amount of DNA contained in its nucleus. In the case of a human egg cell, this is about 3 billion base-pairs, but for comparable cells in different organisms, the C-value varies enormously:
Note the logarithmic scale (powers of 10) on the y-axis. Blue boxes represent ranges for groups of organisms, with examples of those groups shown as black bars to the right. I’m sticking with tradition, and presenting these data as a Scala Naturae, with the pointless scum on the left and the important organisms on the right. As we all know, evolution produces a Great Chain of Being, not a messy tree*.
* May be complete bollocks.
Bacteria such as Escherichia coli have typical genome sizes of one to ten million base-pairs. The genome size of single-celled organisms that pack their DNA into a nucleus (“protists” in the graph above) varies a great deal more, from about ten million to a trillion base-pairs. Multicellular organisms such as insects and vertebrates have a slightly higher lower limit (about one hundred million base-pairs), but a similar upper limit to the “protists”.
Although the lower limit of genome size in a group fits in with our (arrogant and ill-justified) presumptions about the “complexity” of an organism, the upper limit varies enormously, with the genome size of quite closely related organisms sometimes differing by as much as 100-fold.
Pufferfish (Fugu in the graph above) have a genome size about one tenth the size of a human’s, and salamanders (Amphiuma) have a genome size at least ten times larger than ours. This is the C-value paradox: why would a salamander need ten times as many genes as a human? And – however much it deflates our egos – it’s very difficult to see why bald apes with delusions of grandeur would need ten times as many genes (or even regulatory DNA sequences) as a pufferfish.
Fugu! Poison, poison, poison, tasty fish
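To make the paradox concrete, here is a minimal Python sketch of the proportionality assumption that Ohno dismisses. It uses only the figures quoted above (3.0 × 10⁹ base-pairs, the 750-fold ratio to E. coli, the “3 million genes” estimate, and the relative genome sizes of pufferfish, salamander and lungfish). The function and variable names are purely illustrative, and the outputs are what the naive assumption would predict, not real gene counts.

```python
# Figures quoted in the text; all names here are illustrative, not standard.
HUMAN_GENOME_BP = 3.0e9        # mammalian genome size quoted by Ohno
HUMAN_TO_ECOLI_RATIO = 750     # "at least 750 times the genome size of E. coli"
NAIVE_HUMAN_GENES = 3_000_000  # Ohno's "3 million genes or so"

# Genome sizes relative to human (human = 1), as described in the text.
RELATIVE_GENOME_SIZE = {
    "pufferfish": 0.1,   # about one tenth of a human's
    "human": 1.0,
    "salamander": 10.0,  # at least ten times larger
    "lungfish": 36.0,    # Ohno's upper figure
}


def implied_ecoli_genome_bp() -> float:
    """E. coli genome size implied by the 750-fold ratio."""
    return HUMAN_GENOME_BP / HUMAN_TO_ECOLI_RATIO


def implied_ecoli_genes() -> float:
    """E. coli gene count that the '3 million genes' estimate scales up from."""
    return NAIVE_HUMAN_GENES / HUMAN_TO_ECOLI_RATIO


def naive_gene_count(relative_size: float) -> float:
    """Gene count IF genes scaled linearly with genome size (the rejected assumption)."""
    return NAIVE_HUMAN_GENES * relative_size


if __name__ == "__main__":
    print(f"Implied E. coli genome: {implied_ecoli_genome_bp():,.0f} bp")
    print(f"Implied E. coli genes:  {implied_ecoli_genes():,.0f}")
    for organism, size in RELATIVE_GENOME_SIZE.items():
        print(f"{organism:>10}: {naive_gene_count(size):,.0f} genes (naive)")
```

Under this assumption a lungfish would need over a hundred million genes, which is the absurdity Ohno was pointing at.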
Ohno used the term “junk” to describe the apparently superfluous DNA that makes up the difference between an observed genome size for a particular organism, and the lower limit for the group to which it belongs. Pufferfish – for reasons unknown – contain very little of this “junk”, whereas salamanders are apparently even less in control of the proliferation of their “junk” than we humans are. This superfluous “junk” is the DNA sequence equivalent of a commensal organism: a largely harmless hanger-on that you might be fractionally better off without, but which generally isn’t worth the effort of getting rid of.
Tillandsia usneoides (Spanish moss), a spindly upside-down pineapple which grows epiphytically on trees in the Americas, usually without ill-effect on the host tree
In 2001, a rough draft of most of the human genome was published by the Human Genome Consortium. A break-down of the data supports the suggestion that much of the genome of humans (and by implication, even more of the genome of salamanders) is “junk”:
Only one or two percent of the human genome codes for the proteins that construct your body. This DNA is transcribed to messenger RNA, which is then translated to protein, a process whose details have been used to torture undergraduate students for many decades.
Another two or three percent of your DNA codes for functional RNA molecules, which are vital components of the machinery that translates messenger RNA into protein. A few more percent are structural, forming the ends of your chromosomes, or providing the anchor points needed to pull chromosomes around when your cells divide. One percent or so is of obvious regulatory function, helping to bind proteins that turn your genes on and off as required.
So, all in all, about 10% of your genome (the reddish segments of the pie chart above) is of well-known, long-established, and unambiguous function. The other 90% seems to be superfluous “junk”. Despite all the furore generated by the ENCODE project, nothing written above is news. Only about 1% of your genome directly encodes proteins, but it’s been well-known for a long time that at least 10% of your genome is “functional” by a slightly broader but uncontroversial definition.
So what changed last week?
If you read the press releases, it seems to be “everything”. If you read many of the blogs about this, it seems to be “nothing at all”. You have stumbled upon a genuine scientific argument, one with real mudslinging about real data, rather than the invented controversy of creationists and climate change ‘sceptics’.
ENCODE claims to have found “specific biological activity” for 80% of the human genome, because 80% of the genome appears to have a reproducible interaction with one of the very many proteins that bind, modify or transcribe DNA. An important implication is that much more of your genome is involved in switching your genes on or off than was previously thought.
The reason for the scepticism over the 80% claim, which I share, is that if you look at the pie chart above, you’ll notice a big chunk of blue and purple segments labelled “transposons”, “viruses”, “LINEs” and “SINEs”. About 50% of your genome is made of the corpses of various kinds of virus; indeed, a full 8% is made of broken copies of retroviruses similar to HIV.
In the evolutionary past of the human species, very occasionally, retroviruses similar to (but – I stress – not actually) HIV have inserted their genetic information into the DNA in the nucleus of a germ-line cell, i.e. a cell that is fated to make eggs or sperm. These integrated viruses can then be passed directly from parent to offspring, alongside the cell’s “own” DNA, without going through the rigmarole of escaping from the cell and being coughed, vomited or ejaculated into their next victim.
Integrated viral sequences, and the similar “long interspersed nuclear element” sequences (LINEs) – another 20% of your genome – can continue to replicate indefinitely by copying-and-pasting themselves into new sites in your genome. However, over evolutionary time, most of these inserted viral sequences (and their copies, and copies of copies, and copies of copies of copies…) accumulate mutations and lose the ability to replicate independently. They become fossilised ghosts of misery past. There is no terribly strong pressure from natural selection to remove the broken sequences from the genome, as they’re not terribly harmful. So these corpses tend to accumulate over time, like moribund copies of old Word documents on a hard-drive. Some organisms (salamanders) have come to have more of this decomposing genetic muck than others (pufferfish).
LINEs replicate using the same enzymes as retroviruses: reverse transcriptase and integrase. The DNA sequence (green) of the LINE is transcribed by the host cell into messenger RNA (pink), which is then translated by the host cell to make reverse transcriptase (RT) and integrase. The reverse transcriptase makes a DNA copy of the messenger RNA, which is then pasted into a new site in the genome by the integrase. LINEs are parasites on the host genome. SINEs are a second class of parasitic element that don’t even encode their own enzymes: they are transcribed by the host cell, but then use the LINE’s enzymes to copy and paste themselves. They are parasites upon the backs of other parasites, yet they comprise another 13% of your genome!
50% of your genome is broken viruses. These make up a substantial fraction of the 80% figure that is causing the fuss, and this is where the controversy lies.
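A back-of-envelope check shows why the two figures collide. If 80% of the genome is “active” and 50% is virus-derived, the two sets cannot avoid each other: at least 30% of the genome must be both. The Python sketch below works this out, assuming both percentages are measured against the same whole genome. It gives only the logical minimum and maximum; the real overlap is not stated here, and the function names are illustrative.

```python
# Percentages quoted in the text, as whole numbers to avoid float noise.
ENCODE_ACTIVE_PERCENT = 80   # "specific biological activity"
VIRAL_DERIVED_PERCENT = 50   # "broken viruses"

# The three classes the text puts a number on; the rest of the ~50%
# is not broken down in the text.
NAMED_ELEMENTS_PERCENT = {
    "retrovirus-like": 8,
    "LINEs": 20,
    "SINEs": 13,
}


def min_overlap_percent(a: int, b: int) -> int:
    """Smallest possible overlap of two subsets covering a% and b% of a whole."""
    return max(0, a + b - 100)


def max_overlap_percent(a: int, b: int) -> int:
    """Largest possible overlap: the smaller set lies entirely inside the larger."""
    return min(a, b)


if __name__ == "__main__":
    low = min_overlap_percent(ENCODE_ACTIVE_PERCENT, VIRAL_DERIVED_PERCENT)
    high = max_overlap_percent(ENCODE_ACTIVE_PERCENT, VIRAL_DERIVED_PERCENT)
    print(f"Active AND viral-derived: between {low}% and {high}% of the genome")
    print(f"So at least {low / ENCODE_ACTIVE_PERCENT:.1%} of the 'active' DNA is viral-derived")
    print(f"Named element classes total {sum(NAMED_ELEMENTS_PERCENT.values())}% of the genome")
```

So even on the most generous reading, more than a third of the headline 80% has to come from the viral graveyard.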
Having “specific biological activity” doesn’t mean the same as having “important biological activity”. Although there is precedent for viral sequences being co-opted for interesting new roles in the cell, these poachers-turned-gamekeepers appear to be a vanishingly small fraction of your genome. But the raison d’être of a viral sequence is to get transcribed, and old habits die hard, or at least, tail off slowly as the viral sequence rots away. Many rotting viral sequences are likely to retain some residual function, up to and including being transcribed to messenger RNA. Whether occasional transcription of broken viruses has any biological importance is a completely different question from whether they bind the relevant proteins to be transcribed at all. The messenger RNA could simply be degraded without interesting consequences, either for good or for ill.
Similar counterarguments can be made for DNA sequences in the rotting viral sequences that bind regulatory proteins. DNA-binding proteins are not terribly specific, and even completely random DNA would be expected to have many binding sites for these proteins. Many of these binding sites may be completely irrelevant to gene regulation, in the same way that very few mentions of the phrase “the end” in a book will actually result in you closing the book because you have reached “The End” of it.
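The claim about random DNA can be roughly quantified. The Python sketch below estimates how many times a short motif would turn up by chance in a genome-sized string of random bases. It rests on loud simplifying assumptions: all four bases are equally likely, the protein needs an exact match to one fixed motif, and both strands are counted (which double-counts palindromic motifs). Real binding proteins tolerate mismatches, so if anything this underestimates chance sites. The motif lengths are hypothetical values chosen for illustration, not measurements.

```python
def expected_random_sites(genome_bp: int, motif_len: int, both_strands: bool = True) -> float:
    """Expected number of exact chance matches to one fixed motif in random DNA.

    Assumes each base is drawn independently and uniformly from A, C, G, T.
    """
    positions = genome_bp - motif_len + 1      # possible start positions
    if positions <= 0:
        return 0.0
    match_probability = 0.25 ** motif_len      # chance of a match at one position
    strands = 2 if both_strands else 1
    return strands * positions * match_probability


if __name__ == "__main__":
    HUMAN_GENOME_BP = 3_000_000_000            # about 3 billion base-pairs, as in the text
    for motif_len in (6, 8, 10, 12):           # hypothetical motif lengths
        sites = expected_random_sites(HUMAN_GENOME_BP, motif_len)
        print(f"{motif_len:>2}-base motif: ~{sites:,.0f} chance matches")
```

Even a ten-base motif is expected to appear thousands of times in three billion base-pairs of pure noise, which is why a binding event on its own says little about function.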
If I were feeling cynical (and you should note this is my ground state, so I barely know what the alternative is) I’d suggest the press-releases and headline figures of the ENCODE publications were deliberately chosen to court controversy of the sort I have now spent three hours adding to by writing this blog-post. When the dust settles, molecular biologists will have a very nice map of places to go looking for genuine co-option of “junk” to novel functions, but most of the “junk” DNA will still be just as much “junk” as it was a week ago.
Having said all of that, there is one way in which I hope that the end for “junk DNA” is nigh. The fossil viral sequences are indeed “junk” as far as the rest of the genome is concerned, but equally well, the 10% (maybe 20% by the time the ENCODE data has been properly considered) of the genome that is of “known function” could be equivalently described as “stupid” for replicating the parasites and commensals hitch-hiking on its back. Your genome is 90% “junk” DNA and 10% “stupid” DNA.
However, I wouldn’t describe the epiphytic orchids or white-rot fungi on a Brazil nut tree as being “junk” organisms, nor would I describe the tree as being “of known function” compared to them. Ecologists have a number of words like “parasite”, “commensal”, “mutualist” and “hyperparasite” for usefully describing the relationships between organisms in an ecosystem, and I wonder if using these might be a more illuminating way of describing the contents of your genome than dividing it into “important stuff” (important to whom?) and “junk”.