Viral Genome Sequencing

To uniquely identify a virus we need to isolate a sample of it, determine its genome sequence, and compare that sequence against existing samples to show that it either matches a known virus or is new. Ideally we would then expose another subject to the virus and show that they develop the same disease as the original patient.

None of these steps ever happens.

Andrew Kaufman gives a clear description of the process in this 20-minute video; watch for these main points:

  • The original sample is not purified; it contains foreign DNA
  • No complete sequence is used, only small fragments of RNA
  • A sequence is constructed in a computer only; an actual virus is never reconstituted
  • The result is never identical to the original sample
  • The result is not tested on a living being to see if it causes disease


In the sampling of the coronavirus:

  • Lung fluid was extracted from a single ‘suspected’ case, identified by symptoms alone
  • The extracted RNA may well be of viral origin, but it could also have come from the lung tissue itself or from any bacteria living in the lungs. This is in contrast to human genome sequencing, where all the DNA comes from human tissue.
  • The RNA is chopped into small pieces of only 150 base pairs each. Human genome sequencing, by contrast, starts from DNA rather than RNA.

Alignment or Read Mapping

Once we have fragments of RNA, we select short strands to join together into a complete genome. These fragments, or ‘reads’, are 150 base pairs long, and there are 56,500,000 of them.
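To give a sense of scale, the two figures quoted above can be multiplied out. The short sketch below uses only the read count and read length from this section; the genome length of about 30,000 bases is an assumption added for illustration and is not a figure taken from the text above.

```python
# Back-of-the-envelope arithmetic for the sequencing run described above.
READ_COUNT = 56_500_000          # number of reads, as quoted in the text
READ_LENGTH = 150                # base pairs per read, as quoted in the text
ASSUMED_GENOME_LENGTH = 30_000   # ASSUMPTION: approximate coronavirus genome size

# Total number of bases contained in all reads together.
total_bases = READ_COUNT * READ_LENGTH

# Fewest reads that could cover the assumed genome end to end
# with no overlap at all (ceiling division).
min_reads_to_span = -(-ASSUMED_GENOME_LENGTH // READ_LENGTH)

print(f"Total bases sequenced: {total_bases:,}")
print(f"Minimum reads to span the assumed genome: {min_reads_to_span:,}")
print(f"That is {min_reads_to_span / READ_COUNT:.2e} of all reads")
```

Under that assumed genome length, only a tiny fraction of the reads is needed to span one genome, which is why the selection of reads matters so much in the steps that follow.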

We now need to select a subset of these reads and join them together using computer software. The paper uses two different assembly programs to do this, resulting in more than a million candidate assembled sequences, or ‘contigs’.

We need a way of choosing the most likely sequence out of all these results, but the longest one is simply chosen!

This turns out to be too long, so some elements are removed until it is the right size.
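The join-and-pick-the-longest idea described above can be sketched in a few lines. This is a toy greedy overlap assembler written for illustration only: it is not the software used in the paper (real assemblers typically use graph-based methods), and the reads, the `overlap` and `greedy_assemble` helpers, and the `min_len` threshold are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up reads: four overlap each other, one ("CACACA") overlaps nothing.
reads = ["ATGGCC", "GGCCTA", "CCTAAG", "TAAGTC", "CACACA"]
contigs = greedy_assemble(reads)
longest = max(contigs, key=len)  # the "choose the longest one" step
print(contigs)
print(longest)
```

In this toy run the unrelated read stays behind as its own short contig, and the longest contig is the one built from the four overlapping reads. Whether the longest contig is the right choice in a real sample is exactly the question raised above.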


The final in silico genome was checked against a genome database and found to be 89% similar to a bat virus, and so SARS-CoV-2 was born.

Note that the human genome is 88% identical to that of a cat and 98% to that of a chimpanzee.
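A similarity percentage of this kind is, at its simplest, the share of aligned positions at which two sequences carry the same letter. The sketch below assumes the two sequences have already been aligned to equal length with `-` marking gaps, and it skips gap positions; real database tools may define identity differently. The function name `percent_identity` and the example sequences are made up for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of compared positions at which two aligned sequences match.

    Both sequences must already be aligned to the same length.
    Positions where either sequence has a gap ('-') are not counted.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # ASSUMPTION: gap positions are ignored
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Two made-up aligned sequences differing at 2 of 10 positions.
print(percent_identity("ATGGCCTAAG", "ATGACCTTAG"))  # 80.0
```

The figure depends on how the alignment was made and on how gaps are treated, so a single percentage is only meaningful alongside the method that produced it.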


The weaknesses in the procedure are readily apparent.

  • There is no guarantee that the RNA used comes from a single virus variant
  • The fact that the software procedures produce different results surely means that one of them is wrong? So maybe the other is also wrong?
  • The ‘reads’ to be joined are only 150 base pairs long, but DNA is highly structured and can contain repeats up to 60 base pairs long (Wikipedia)
  • The selection of short reads only means that there are more end pieces to be joined up and therefore more ways of constructing the final genome, making it much more likely that the procedure will produce the desired result.
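How repeats interact with read length can be illustrated with a toy example. The sketch below builds two different made-up sequences that contain the same repeat three times and compares the sets of reads each would produce; the sequences, the `read_counts` helper, and the read lengths are hypothetical and chosen only to keep the example small.

```python
from collections import Counter


def read_counts(genome: str, read_len: int) -> Counter:
    """Count every substring of length `read_len` (an idealised, error-free read set)."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


REPEAT = "ACGTA"  # a 5-base repeat occurring three times in each sequence

# Two different sequences: the segments between the repeats are swapped.
genome_a = "TT" + REPEAT + "CC" + REPEAT + "GG" + REPEAT + "AA"
genome_b = "TT" + REPEAT + "GG" + REPEAT + "CC" + REPEAT + "AA"

# Reads no longer than the repeat cannot tell the two apart.
same_with_short_reads = read_counts(genome_a, 5) == read_counts(genome_b, 5)

# Reads long enough to cover the repeat plus a base on each side can.
same_with_long_reads = read_counts(genome_a, 7) == read_counts(genome_b, 7)

print(genome_a != genome_b)      # True: the sequences differ
print(same_with_short_reads)     # True: identical read sets
print(same_with_long_reads)      # False: the read sets now differ
```

In this sketch the ambiguity appears when the repeat is at least as long as the read minus one base, and it disappears once a read can span the whole repeat with a base to spare on either side. The comparison between read length and repeat length is therefore the figure to check.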


A new coronavirus associated with human respiratory disease in China, Wu, Zhao et al. (the Nature article referred to by Kaufman)

Read Mapping

Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/TUR/IMU-SP-02/2020, complete genome