Feb 6, 2017

Oxford MinION 1D and 2D reads

We have been testing out the Oxford MinION DNA sequencing machine to see what it can contribute to larger de novo sequencing projects.  Most of the posted data so far come from small genomes where moderate coverage is more easily obtained.  Recent publications claim improved sequence quality for Oxfort MinION.

We are working on a new de novo sequence for the little skate shark ((Leucoraja erinacea), an interesting system to study developmental biology.  The skate has a BIG genome, estimated at 4 Gb (bigger than human), so this is going to be a difficult project. The existing skate genome in Genbank and the SkateBase website is not in very good shape (3 million contigs with  N50 size of 665 bp).

We got a couple of preliminary Oxford MinION reads from skate DNA - not nearly enough coverage to make a dent in this project, just having a look at the data. Oxford produces two kinds of data (in their own annoying FAST5 format, but I won't rant about that right now), single pass 1D and double pass 2D.  [My Bioinformatics programmer Yuhan Hao did this analysis.] Here is what our data looks like.

So the 1D reads are really long - some more than 50 kb. The 2D reads are mostly 2-10 kb.  The SkateBase has a complete contig of the mitochondrial genome, so we were able to align the Oxford sequences to this as a reference. Coverage was low, but we do have some regions where both 1D and 2D reads match the reference. What we can see is that the 1D reads have a lot of SNPs vs the reference, while the 2D reads have very few SNPs- so it is clear that the 2D reads have been successfully error corrected.  Strangely, both the 1D and 2D reads have a lot of insertion-deletion errors (several per hundred bases) compared to the reference, and in fact they do not match each other - so we consider these to all be novel, uncorrected errors.

We also ran a standard Illumina whole genome shotgun sequencing run for the skate genome, which we aligned to the mitochondrial reference. With this data, we can see a small number of Oxford 2D SNPs shared by hundreds of Illumina reads, others not. None of the indels are supported by our Illumina reads.

Other investigators have had poor quality Oxoford sequences. With more coverage, we may be able to use the Oxoford reads as scaffolds for our de novo assembly. It may be possible to use Illumina reads for error correction, and mark all uncorrected areas of the Oxford sequences as low quality, but that is not the usual method for submitting draft genomes to Genbank.

Jan 17, 2017

GenomeWeb reports on updated Coffee Beetle genome

A nice review of my 2015 Coffee Beetle paper in GenomeWeb today.
My genome had 163 million bases and 19,222 predicted protein-coding genes. I am very pleased to learn that a revised version of the draft genome sequence (from a group in Columbia) contains 160 million bases and 22,000 gene models. They also confirm the 12 horizontally transferred genes that I identified.

Coffee Pest, Plant Genomes Presented at PAG Conference

Researchers from the USDA's Agricultural Research Service, New York University, King Abdullah University of Science and Technology, and elsewhere published information on a 163 million base draft genome for the coffee berry borer in the journal Scientific Reports in 2015.
That genome assembly, produced with Illumina HiSeq 2000 reads, housed hundreds of small RNAs and an estimated 19,222 protein-coding genes, including enzymes, receptors, and transporters expected to contribute to coffee plant predation, pesticide response, and defense against potential pathogens. It also provided evidence of horizontal gene transfer involving not only mannanase, but several other bacterial genes as well.
At the annual Plant and Animal Genomes meeting here this week, National Center for Coffee Research (Cenicafe) scientist Lucio Navarro provided an update on efforts to sequence and interpret the coffee berry borer genome during a session on coffee genomics. For their own recent analyses, Navarro and his colleagues upgraded an earlier version of a coffee berry borer genome that had been generated by Roche 454 FLX sequencing, using Illumina short reads from male and female coffee berry borers to produce a consensus assembly spanning around 160 million bases. The assembly is believed to represent roughly 96 percent of the insect's genome.
In addition to producing a genome with improved contiguity levels, he reported, members of that team also combined 454 and Illumina reads to get consensus transcriptomes for the beetle. With these and other data, they identified almost 22,000 gene models, novel transposable element families, and their own evidence of horizontal gene transfer.

Dec 8, 2016

Finding differences between bacterial strains 100 bases at a time.

This work was conducted mostly by my research assistant Yuhan Hao.

We have recently been using Bowtie2 for sequence comparisons related to shotgun metagenomics, where we directly sequence samples that contain mixtures of DNA from different organisms, with no PCR. Bowtie2 alignments can be made very stringently (>90% identity), and computed very rapidly for large data files with hundreds of millions of sequence reads. This allows us to identify DNA fragments by species and by gene function, provided we have a well-annotated database that contains DNA sequences from similar microbes. I know that MetaPhlan and other tools already do this, but I want to focus on the difference between bacteria (and viruses) at the strain level.

I have been playing around with the idea of constructing a database that will give strain-specific identification for human-associated microbes using Bowtie2 alignments with shotgun metagenomic data. There are plenty of sources for bacterial genome sequences, but the PATRIC database has a very good collection of human pathogen genomes. However the database is very redundant - with dozens or even hundreds of different strains for the same species, and the whole thing is too large to build into a Bowtie database (>350 Gb). So I am looking at ways to eliminate redundancy from the database while retaining high sensitivity at the species level to identify any sequence fragment, not just specific marker genes,  and the ability to make strain-specific alignments. So I want to identify which genome fragments have shared sequences among a group of strains within a single species and which fragments are unique to each strain.

Since our sequence reads are usually ~100 bases, We are looking the similarity between bacterial strains when the genome is chopped into 100 bp pieces.

Bowtie2 can be limited to only perfect alignments (100 bases with zero mismatches) using the parameters  --end-to-end  and  --scoremin C, 0, -1  and to something similar to 99% identity with    --scoremin  ‘L,0, -0.06’ 

Between two strains of  Streptococcus agalactiae, if we limit the Bowtie2 alignments to perfect matches,  half of the fragments align. So at the 100 base level, half of the genome is identical, and the other half has at least one variant. At 99% similarity (one mismatch per read), about 2/3 of the fragments align, and the other 1/3 has more than one variant, or in some cases no-similarity at all to the other genome. Yuhan Hao extended this experiment to another species (Strep pyogenes), where we see almost no alignment to Strep ag.  of genome fragments at 100%, 99%, or ~97% identity (see the Table below).  I have previously used the 97% sequence identity threshold to separate bacterial species, but I thought it was only for 16S rDNA sequences - here it seems to apply to almost every 100 base fragment in the whole genome. 

So we can build a database with one reference genome per species and safely eliminate the 2/3 of fragments that are nearly identical between strains, and retain only those portions of the genome that are unique to each strain.  I'm going to think about how to apply this iteratively (without creating a huge computing task), so we add only the truly unique fragments for EACH strain, rather than just testing if a fragment differs from the Reference for that species. With such a database, stringent Bowtie2 alignments will identify each sequence read by species and by strain and have very low false matches across different species. 

We can visualize the 100 base alignments between two strains at the 100% identity level, 99% identity, and at the least stringent Bowtie2 setting (very fast local) to see where the strains differ. This makes a pretty picture (below), which reveals some fairly obvious biology: There are parts of the genome that are more conserved and parts that are more variable between strains. The top panel shows only the 100% perfect matches (grey bars), the second panel shows that we can add in some 99% matching reads (white bars with a single vertical color stripe to mark the mismatch base), and the lowest panel shows reads that are highly diverged (lots of colored mismatch bases and some reads that are clearly mis-aligned from elsewhere on the genome). So if we choose only the reads that differ by more than 99%, we can build a database that can use Bowtie2 to identify different strains and have few multiple matches or incorrectly matched reads. 

This chart lists the number of UNMATCHED fragments after alignment of all non-overlapping 100 bp fragments from the genomes of each strain to the Strep. agalactiae genome that we arbitrarily chose as the Reference.   C 0 -1 corresponds to perfect matches only, L 0 -0.06 is approximately 99% identity (by my calculation), and L 0, -0.2 is about 97% identity. The lower the stringency of alignment, the fewer fragments end up in the "unmatched" file. 

                                S.ag Ref  S.ag st2   S.ag st3   S.pyo st1 S.pyo st2
# 100b kmers          20615      22213      21660      17869      17094
C 0 -1       0              0              11243      10571      17718      16934  (perfect matches)
L 0 -0.06  0              0              6605        5918        17533      16762  (99% ident)
L 0 -0.1    0              0              6534        5839        17533      16762
L 0 -0.12  0              0              4812        4058        17371      16600
L 0 -0.15  0              0              4767        4006        17369      16600
L 0 -0.17  0              0              4756        3997        17369      16600

L 0 -0.2    0              0              4156        3419        17237      16441 (97% ident)

May 27, 2016

Functional Metagenomics from shotgun sequences

A number of groups at our research center have recently become interested in metagenomic shotgun sequencing (MGS), which is simply taking samples that are presumed to contain some microbes, extracting DNA and sequencing all of it, shotgun style. This is seen as an improvement over metagenomics methods that amplify 16S ribosomal RNA genes using bacteria specific PCR primers, and then sequence these PCR products. The 16S approach has had a lot of success, in that essentially all of the important "microbiome" publications over the past 5 years have been based on the 16S method. The 16S approach has a number of advantages – the amount of sequencing effort per sample is quite small (1,000 to 10,000 sequences is usually considered adequate) and fairly robust computational methods have been developed to process the sequence data into abundance counts for taxonomic groups of bacteria.

There are a number of drawbacks for the 16S method. The data is highly biased by DNA extraction methods, PCR primers and conditions, DNA sequencing technology, and computational methods used to clean, trim, cluster, and identify the sequences. What I mean by biased is that if you change any of these factors, then you get a different set of taxa and abundances from the samples. Even when these biases are carefully addressed, the accuracy of the taxonomic calls are not very good or reliable. It is simply not possible to identify with high precision and accuracy all bacterial  species (or strains) present in a DNA sample with just ~400 bp of 16S sequence data. Many 16S microbiome studies report differences in bacterial abundance at the genus or even the phylum level.

Two samples may be discovered to have reproducibledifferences in 16S sequence content, but the actual bacterial species or strains that differ are not confidently identified.  Even more important, the low resolution of 16S studies may be missing a lot of important biology. Huge changes in environment can perhaps favor anaerobes over aerobes, but smaller changes in pH, nutrient abundance, or immune call populations and function may cause a shift from one species in a genus to another. And this difference in species or strain may bring important changes in metabolite flux, immune system interaction, etc.

It has been proposed that bulk metagenomic shotgun sequencing(MGS) of all of the DNA in a biosample, rather than just PCR amplified 16S sequences, would allow for more precise species and strain identification, and quantification of the actual microbial genes present. Some MGS methods also attempt to count genes in specific functional groups such as within the Gene Ontology, or KEGG pathways. Other people would like to discover completely novel genes that innovative bacteria have developed to do interesting metabolic things. This leads to a computationally hard problem. MGS data tend to be very large (200 million to 1 billion reads per sample), and databases of bacterial genes are incomplete.  In fact, we probably have complete genome sequences for very much less than 0.1 percent of all bacteria in the world. We might also like to identify DNA from archea, viruses, and small eukaryotes in our samples.

Each fragment of DNA in an MGS data file has to be identified based on some type of inexact matching to some set of reference genes or genomes (a sequence alignment problem), which is computationally very demanding. In my experience, BLAST is the most sensitive tool to match diverged DNA sequences, but depending on the size of the database, it takes from 0.1 to 10 cpu seconds for each sequence to be aligned by BLAST. 100,000 sequences takes at least overnight (on 32 CPUs, 128 GB RAM), if not all week. I have never tried a billion.  Multiply that by a billion reads per sample and you can see that we have a serious compute challenge. We have at least 200 samples with FASTQ files queued up for analysis, more being sequenced, and more investigators are preparing to start new studies. So we need a scalable solution that cannot be solved just by brute force BLAST searching on ever bigger collections of computers.

As more investigators have become interested in MGS, computing methods for processing this data have popped up like weeds. It's really difficult to review/benchmark all of these tools, since there are no clear boundaries for what analysis results they should deliver, and what methods they should be using. The omictools.com website lists 12 tools for "metagenomics gene prediction" and 5 for "metagenomics functional annotation" however these categories might be defined.  The best benchmark paper I could find (Gardner et al 2015) looks at about 14.

Should the data be primer and quality trimmed (YES!), human and other contaminants removed (YES!), duplicates removed (maybe not???), clustered (????), assembled into contigs or complete genomes. Then what sort of database should the fragments be aligned to? The PFAM library of protein motifs, a set of complete bacterial genomes, some set of candidate genes?

So far, the most successful tools focus on the taxonomy/abundance problem – choosing some subset of the sequence data and comparing it to some set of reference sequences. I have chosen MetaPhalAn <http://www.ncbi.nlm.nih.gov/pubmed/22688413> to process our data because it does well in benchmark studies, runs quickly on our data, and Curtis Huttenhower has a superb track record of producing excellent bioinformatics software. In addition, we removed primers and quality trimmed using Trimmomatic, then removed human sequences by using Bowtie2 to align to the human reference genome (as much as 90% of the data in some samples). Using MetaPhalAn on the cleaned data files, we got species/abundance counts for our ~200 WGS samples in less than a week. [Note, when I say 'we', I actually mean that all the work was done by Hao Chen, my excellent Bioinformatics Programmer ]

  1. An intentionally fuzzy top species abundance heatmap of pre-publication data for some MGS samples processed with MetaPhalAn. If we added info about which samples came from which patients, this might be considerably more interesting. 

Moving on, our next objective is to identify microbial protein coding genes and metabolic pathways. Here, the methods become much more challenging, and difficult to benchmark. I basically don’t like the idea of assembling reads into contigs. This adds a lot of compute time, huge inconsistency among different samples (some will assemble better than others), and creates all kinds of bias for various species (high vs low GC genomes, repeats, etc.) and genes. Also, I'm having a hard time coming up with a solid data analysis plan that meets all of our objectives. Bacteria are diverse. A specific enzyme that fulfills a metabolic function (for example in a MetaCyc pathway), could differ by  80% of its DNA sequence from one type of bacteria to another, but still do the job. Alternately, there are plenty of multi-gene families of enzymes in bacteria with paralogs that differ in DNA sequence by 20% or less within the genome of a single organism, but perform different metabolic functions. And of course there are sequence variants in individual strains that inactivate an enzyme, or just modifiy one of its functions. There is no way we can compute such subtle stuff on billions of raw Illumina reads from a mixture of DNA fragments from unknown organisms (many of which have no reference data).  So how do we compute up a report that realistically describes the functional differences in gene/pathway metabolic capacity between different sets of MGS samples? 

If I have to design the gene content identifier myself, I will probably translate all reads in 6 frames, then use hmmscan vs. the PFAM library of known protein functional domains. The upside is that I know how to do this, and it is reasonably sensitive for most types of proteins. The downside is that many proteins will receive a very general function (ie: "kinase" or "7-TM domain"), which does not reveal exactly what metabolic functions they are involved in. And of course some 30-40% of proteins will come up with no known function – just like every new genome that we sequence. Another way to do this would be to use BLASTx against the set of bacterial proteins from UniProt. I can speed this up a bit by taking the UniRef 90% identity clusters (which reduces the database size by about 25%). An alternate proposal, which is much more clumsy, slower, and ad hoc, would be to BLASTn against a large set of bacterial genomes. Either all the complete microbial genomes in GenBank, or all the ones collected by the Human Microbiome Project, or some taxonomically filtered set (one genome per genus???). The point of searching against many complete genomes is that we have some chance of catching rare or unusual genes that are not well annotated as a COG, or KEGG pathway member. 

According to Gardner et al (Sci. Rep. 2015), the best tools for identifying gene content in MGS samples are the free public servers MG-RAST and the EBI Metagenomics portal. So we have submitted some test samples to these. However, the queue is at least 10 days for processing by MG-RAST, so this is maybe not going to satisfy my backlog of metagenomics investigators who are rapidly cranking out more MGS samples. We probably need a local tool that will be under our own control in terms of compute time and more amenable to tweaking to the goals of each project. 

At this point I'm asking blog readers for suggestions for a decent method or tool to identity gene content in big MGS FASTQ files which will be reasonably accurate in terms of protein/pathway function, and not crush my servers. Some support from a review or benchmark paper (other than by the tool's own authors...) would be nice also. 

Mar 2, 2016

Irreproducible results

I get frustrated by the lack of stability in genomics data collection and analysis, which of course leads to irreproducible results. I imagine (naively I'm sure) that in physics one measures a quant or a nanode of some particle and it stays measured that way for years and decades. I do accept the inevitable technology changes that lead us to measure similar things, such as gene expression, in different ways (Northern blots, microarrays, RNA-seq). However, my lab-based collaborators become very frustrated when the exact same data (such as RNA-seq FASTQ files) produce different results, such as changes in the list of top differentially expressed (DE) genes with different p-values, when analyzed with different software.  This frustration grows even more severe when the different results come from a different version of the same software!

I was working through my RNA-seq tutorial with a group of students this week and they pointed out that my tutorial worksheet was wrong. Cufflinks did not produce any significant DE genes with our test data comparing two small RNA-seq data files. This was surprising to me, since I did the exact same workflow with the same data files last year and it worked out fine. So golly and darn it, I got hit with the irreproducible results bug.  We keep past versions of software available on our computing server with an Environment Modules system, so I was able to quickly run a test of the exact same data files (aligned with the same Tophat version to the same reference genome) using different versions of Cufflinks. We have the following versions installed:
cufflinks/2.0.2   (July 2012)
cufflinks/2.1.1   (April 2013)
cufflinks/2.2.0   (March 2014)
cufflinks/2.2.1  (May 2014)

I just used the simple Cuffidiff workflow and looked at the gene_exp.diff output file for each software version. The results are quite different. Version 2.0.2 has 46 genes called "significant=yes" (multi-test adjusted q-value less than 0.05) with q-values running as low as 4.14E-10 (ok, one has a q-value of zero).  Now this is not a great result from a biostatistics standpoint, since how can you expect to get significant p-values from RNA expression levels with two samples an no replicates? But it did make for an expedient exercise, since we could take the DE genes into DAVID and look for enriched biological functions and pathways.

Then in version 2.1.1 we have two "significant=yes" genes. In version 2.2.0 we have zero significant genes, and in version 2.2.1 also zero.

The top 10 genes, ranked by q-value also differ. There are no genes in common between the top 10 list for version 2.0.2 and 2.1.1, and only the top 2 genes are shared by version 2.2.0. Thankfully, there are no differences in top genes or q-values between 2.2.0 and 2.2.1 (versions released only 2 months apart).  I'm sure that Cole Trapnell et al. are diligently improving the software, but the consequences for those of us trying to use the tools to make some sense out of biology experiments can be unsettling.