Introduction

Input

Raw fastq files from Illumina NGS platforms.

Output

Compiled OTU or ZOTU table (taxonomy not assigned) and final sequences in fasta format.

Assumptions

Working within a macOS environment. There are slight differences when working within a Linux environment; if you are using a virtual machine or cloud computing you will often need to work within a Linux environment. The documentation for USEARCH is the same for macOS and Linux, except that you will need to download the Linux versions of the software. However, take note that this pipeline is primarily written for macOS and some features may not be available. Please provide feedback if functions are not working within a Linux environment.

Downloaded USEARCH. This pipeline requires USEARCH v11. For smaller datasets (e.g. one MiSeq run) the free 32-bit binaries are okay; however, if undertaking analysis of large datasets you will need to purchase the 64-bit version.

Data type

This analysis is for paired-end reads generated on the Illumina MiSeq.

Using this pipeline

I strongly suggest you take a small subset of your data through the pipeline outlined below step by step first, to get an understanding of what each command is doing. Importantly, it will also allow you to explore the influence of the different variables on your dataset.

Once you are happy with the parameters and the script is working, you can use the full script to analyse all your samples. Open your favourite text editor and combine all the sections of your workflow into a document with the file extension .sh, for example usearchv11_analysis.sh.

Simply edit the script with your parameters, and save it in a directory with your raw fastq files.

All sub-scripts and a complete version of the pipeline can be found on my github repo siobhon-egan/usearch_analysis.

In a bash terminal, navigate to your directory and perform the following to execute the script, saving the terminal output to a text file. Download the full script here

cd /Users/name/Documents/NGS-Analysis
chmod +x usearchv11_analysis.sh
./usearchv11_analysis.sh > usearchv11_output.txt

*Note: the code below assumes you have installed USEARCH v11 with the file name usearch11 and that it is executable (if not, navigate to the file and type chmod +x usearch11 in the command line).


Fastq info

fastx_info command

This command gives a short summary report of the sequences in a fasta or fastq file. It is useful as a first check on what is in a new file. The report is written to the console and can be saved to a text file using the -output option.

#!/bin/bash

raw_data="raw_data"
read_summary="1.read_summary"

mkdir $read_summary

 for fq in $raw_data/*R1*.fastq
 do
  # one summary per input file - a single fixed output name would be overwritten on each pass
  usearch11 -fastx_info $fq -output ${read_summary}/1a_fwd_$(basename "$fq" .fastq)_info.txt
 done

 for fq in $raw_data/*R2*.fastq
 do
  usearch11 -fastx_info $fq -output ${read_summary}/1b_rev_$(basename "$fq" .fastq)_info.txt
 done

To look at just one fasta or fastq file, for example to inspect your concatenated sequences. The -secs 5 option is good when you just want a snapshot of a large sequence file; it can be deleted if you want usearch to scan the entire file.

usearch11 -fastx_info sequences.fastq -secs 5 -output reads_info.txt

IMPORTANT!

At this point it is important to look at the 'EE' value, which stands for expected errors.

Detailed information on this can be found here.

In short, you want your EE value to be under 2; usually it is lower for the forward reads than the reverse reads.
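To see where this number comes from: EE is the sum of the per-base error probabilities, p = 10^(-Q/10), over a read. A stand-alone sketch, assuming the standard Phred+33 quality encoding (Illumina 1.8+); the quality string itself is made up:

```shell
# Sum the per-base error probabilities for one quality string (Phred+33).
# Four Q40 bases ('I') contribute almost nothing; four Q2 bases ('#')
# dominate the expected-error total.
qual='IIIIHHHH####'
ee_line=$(printf '%s' "$qual" | od -An -tu1 | tr -s ' ' '\n' | awk '
  NF { ee += 10^(-($1 - 33) / 10) }
  END { printf "EE = %.4f", ee }')
echo "$ee_line"
```

Even a short run of very low-quality bases pushes a read over the EE threshold, which is why filtering on EE (below) is effective at removing poor reads.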


Merge pairs

fastq_mergepairs command

This command merges paired-end reads to create consensus sequences and, optionally, consensus quality scores. This command has many features and options so I recommend spending some time browsing the documentation to get familiar with the capabilities of fastq_mergepairs and issues that arise in read merging.

As is the convention for Illumina paired reads, the forward sequences are labelled samplename_SXX_L001_R1_001.fastq and the reverse sequences samplename_SXX_L001_R2_001.fastq.

There are many additional options for this command that are specified here, and a thorough outline of reviewing fastq_mergepairs reports here.
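The R1-to-R2 pairing used in the merge loop relies only on basename string manipulation following this convention. A stand-alone sketch, with a made-up sample name:

```shell
# Derive the reverse-read filename from the forward read, following the
# Illumina convention samplename_SXX_L001_R1_001.fastq (sampleA is invented).
file1="raw_data/sampleA_S01_L001_R1_001.fastq"
file2="$(dirname "$file1")/$(basename "$file1" R1_001.fastq)R2_001.fastq"
echo "$file2"   # raw_data/sampleA_S01_L001_R2_001.fastq
```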

#!/bin/bash

raw_data="raw_data"
merged_data="2.merged_reads"
# Maximum number of mismatches in the alignment - Default 5. Consider increasing if you have long overlaps.
maxdiffs="15"
# Discard pair if alignment is shorter than this value - Default 16 bp
overlap="50"

mkdir ${merged_data}
mkdir working1

#*****************************************************************************************
# Step 1: merge data with usearch11 -fastq_mergepairs

for file1 in ${raw_data}/*R1_001.fastq
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Merging paired reads
        echo forward reads are:
        echo $(basename ${file1})
        echo reverse reads are:
        echo $(basename ${file1} R1_001.fastq)R2_001.fastq

    usearch11 -fastq_mergepairs ${file1} -reverse "${raw_data}/$(basename -s R1_001.fastq ${file1})R2_001.fastq" -fastqout "working1/$(basename "$file1")" -fastq_maxdiffs ${maxdiffs} -fastq_minovlen ${overlap} -report "${merged_data}/2a_$(basename "$file1" .fastq)_report.txt" -tabbedout "${merged_data}/2b_$(basename "$file1" .fastq)_tabbedout.txt"
done

#*****************************************************************************************
# Step 2: Remove "_L001_R1_001" from filenames

for file2 in working1/*.fastq
    do

        rename="$(basename ${file2} _L001_R1_001.fastq).fastq"

        mv ${file2} ${merged_data}/${rename}
done

#*****************************************************************************************
# Removing working directory

        rm -r working1

Note: there are a number of other commands that you may find useful at this step, available on the USEARCH documentation page here. After trials with 16S v1-2 amplicon NGS data I have found that in most cases altering these parameters makes little to no difference to the outcome (i.e. % merged), and hence they are omitted from this pipeline; the two exceptions are fastq_minovlen and fastq_maxdiffs. Again, it is important to remember the purpose of this line of code: it is simply to merge the forward and reverse sequences, not to act as a quality control step.

Trim primers

search_pcr2 command

This command extracts segments between matches to a primer pair. Input can be fasta or fastq. Primer sequences must be given in the 5'-3' direction, and IUPAC wildcard codes are supported. The -strand option must be specified, where -strand plus searches for hits on the forward strand only and -strand both searches both the forward and reverse-complemented strands. The -maxdiffs option specifies the maximum allowed number of mismatches with a primer (default 2). See the USEARCH documentation for more information.

It is recommended that primers be trimmed before quality filtering of fastq files, as every base can cause an increase in the expected errors.

A common alternative to using the search_pcr2 command is the fastx_truncate command. Depending on your data you may like to make your own choice, however I feel searching for the primer set specifically yeilds a more reliable result.
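As a rough, usearch-independent sanity check you can count how many merged reads begin with the forward primer, expanding the IUPAC code Y to [CT]. This is illustrative only (search_pcr2 properly handles both strands and mismatches); the toy fastq written here is made up:

```shell
# Count reads whose sequence starts with the forward primer
# AGAGTTTGATCCTGGCTYAG, with Y expanded to the regex class [CT].
fwd_regex='^AGAGTTTGATCCTGGCT[CT]AG'
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@read1
AGAGTTTGATCCTGGCTCAGGATTA
+
IIIIIIIIIIIIIIIIIIIIIIIII
@read2
TTTTTTTTTTGATCCTGGCTTAGAA
+
IIIIIIIIIIIIIIIIIIIIIIIII
EOF
# every 4th line starting at line 2 of a fastq file is a sequence line
n=$(awk 'NR % 4 == 2' "$tmp" | grep -cE "$fwd_regex")
rm "$tmp"
echo "$n read(s) start with the forward primer"
```

If very few reads carry the primer at the expected position, check your primer sequences and orientation before running search_pcr2 on the full dataset.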

#!/bin/bash

merged_data="2.merged_reads"
primer_matched="3a.primer_matches"
primer_not_matched="3b.primer_mismatches"
# Enter primer sequences 5'-3' (degenerate bases OK)
fwd_primer="AGAGTTTGATCCTGGCTYAG" #16S v1-2 primer, ref Gofton et al. Parasites & Vectors (2015) 8:345
rev_primer="TGCTGCCTCCCGTAGGAGT" #16S v1-2 primer, ref Turner et al. J Eukaryot Microbiol (1999) 46(4):327-338

echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo Trimming primers and distal bases
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo ""


# Creating working directories

mkdir ${primer_matched}
mkdir ${primer_not_matched}

for file3 in ${merged_data}/*.fastq
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Trimming primers step 1: finding seqs with FWD primer
        echo input is:
        echo ${file3}

    usearch11 -search_pcr2 ${file3} -fwdprimer ${fwd_primer} \
    -revprimer ${rev_primer} \
    -strand both -fastqout "${primer_matched}/$(basename ${file3})" -notmatchedfq "${primer_not_matched}/$(basename ${file3})" -tabbedout "${primer_matched}/$(basename ${file3} .fastq)_pcr2_output.txt"
done

Quality control and removing dimer sequences

fastq_filter command

This command performs quality filtering and converts fastq files to fasta format. Again there are a number of filtering options available with this command that you can explore; however, below is what I have found most appropriate for the 16S amplicon NGS data tested so far.

#!/bin/bash

primer_matched="3a.primer_matches"
QF_reads="4.quality_filtered"
# Discard reads with > E total expected errors for all bases in the read after any truncation options have been applied. Usually set at 1, however may want to lower to 0.5 or 0.25 for more stringent filtering.
max_ee="1"
# Enter min length of sequence for trimming in bp (eg. to keep all seqs above 200 bp enter "200")
minlen="150"

mkdir ${QF_reads}

for file4 in ${primer_matched}/*.fastq
    do
        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Quality control and removing dimer seqs
        echo input is:
        echo ${file4}

    usearch11 -fastq_filter ${file4} -fastaout "${QF_reads}/$(basename "$file4" .fastq).fasta" -fastq_maxee ${max_ee} -fastq_minlen ${minlen}
done

Renaming sequences

There are no usearch commands or other parameters in this step; it is simply a series of bash commands to ensure the sequence labels are compatible with downstream OTU/ZOTU clustering.
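A worked example of the two sed substitutions used in the script, applied to a single made-up header line:

```shell
# Step 1 inserts "barcodelabel=;" after ">"; step 2 drops the sample ID in
# front of the ";". The header and sample ID here are invented.
sample_id="sampleA"
header='>M00000:1:000000000-AAAAA:1:1101:1:1'
step1=$(echo "$header" | sed -e 's/>/>barcodelabel=;/g')
step2=$(echo "$step1" | sed -e "s/;/${sample_id};/g")
echo "$step2"
```

Note the second sed replaces every ";" on the line, so this relabelling assumes the original Illumina read IDs contain no semicolons (they use colons).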

#!/bin/bash
QF_reads="4.quality_filtered"
labeled_data="5.labeled_data"

echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo Renaming sequences with ">barcodelabel=sample_id;sequence_id"
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

mkdir ${labeled_data}
mkdir working2

#*****************************************************************************************
# Step 1: Insert "barcodelabel=;" after ">" at the start of each sequence_ID

for file5 in ${QF_reads}/*.fasta
    do

    sed -e 's/>/>barcodelabel=;/g' ${file5} > working2/$(basename "$file5" .fasta).txt
done

#*****************************************************************************************
# Step 2: Add sample_ID (should be filename) to produce ">barcodelabel=sample_ID;sequence_ID"

for file6 in working2/*.txt
    do

    sample_id=$(basename ${file6} .txt)
    echo ${sample_id}

    sed -e "s/;/${sample_id};/g" ${file6} > "${labeled_data}/$(basename "$file6" .txt).fasta"
done

#*****************************************************************************************
# Remove working directories

rm -r working2

##########################################################################################

Removing low abundant sequences

fastx_uniques command - This command finds the set of unique sequences in an input file, also called dereplication. In this case the input is a fasta file. Sequences are compared letter-by-letter and must be identical over the full length of both sequences (substrings do not match).

sortbysize command - This command sorts sequences by decreasing size annotation, which usually refers to the size of a cluster. In this case we also add a -maxsize option to specify singletons i.e. 1.

search_exact command - This command searches for exact, full-length matches to a database sequence. Even for large datasets the 32-bit version is fine for this step, as it does not require much memory.
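The dereplication idea can be sketched without usearch: collapse identical sequences and annotate each unique sequence with its abundance, which is what fastx_uniques -sizeout reports. Toy sequences only:

```shell
# Collapse identical (toy) sequences and emit fasta headers with size
# annotations, mimicking the ">label;size=N" convention of -sizeout.
tmp=$(mktemp)
printf 'ACGTACGT\nACGTACGT\nTTTTAAAA\nACGTACGT\n' > "$tmp"
derep=$(sort "$tmp" | uniq -c | sort -rn |
  awk '{ printf ">Uniq%d;size=%d\n%s\n", NR, $1, $2 }')
rm "$tmp"
echo "$derep"
```

The singleton (size=1) sequence is exactly what the -maxsize 1 sortbysize step collects, so that reads matching only such low-abundance uniques can be set aside by search_exact -notmatched.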

#!/bin/bash

labeled_data="5.labeled_data"
derep_dir="6.derep_data"
low_abund_seqs="7.low_abund_sequences"
SF="8.singleton_filtered"
# Enter max replicate cluster size (eg. to remove singletons enter 1, for duplicates enter 2)
maxsize="1"

echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo Removing low abundant seqs singletons per sample
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo ""

# Creating directories

mkdir ${derep_dir}
mkdir ${low_abund_seqs}
mkdir ${SF}

#*****************************************************************************************
# Step 1: Dereplicating

for file7 in ${labeled_data}/*.fasta
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Removing singletons step 1: fastx_uniques
        echo input is:
        echo ${file7}

    usearch11 -fastx_uniques ${file7} -fastaout "${derep_dir}/$(basename "$file7" .fasta).fasta" -sizeout
done


#*****************************************************************************************
# Step 2: Filtering low abundant seqs (maxsize = ${maxsize})

for file8 in ${derep_dir}/*.fasta
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Removing singletons step 2: sorting uniques
        echo input is:
        echo ${file8}

    usearch11 -sortbysize ${file8} -fastaout "${low_abund_seqs}/$(basename "$file8" .fasta).fasta" -maxsize ${maxsize}
done

#*****************************************************************************************
# Step 3: Mapping reads

for file9 in ${labeled_data}/*.fasta
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Removing singletons step 3: mapping reads to low abundant uniques
        echo input is:
        echo ${file9}

    usearch11 -search_exact ${file9} -db "${low_abund_seqs}/$(basename "$file9" .fasta).fasta" -strand plus -notmatched "${SF}/$(basename "$file9" .fasta).fasta"
done


#*****************************************************************************************

Clustering

This step is where you may encounter memory limits. For one or two MiSeq runs the 32-bit version of usearch may be okay; any larger and you will need the paid 64-bit version. There is alternative freely available software for this step, such as vsearch, and it may be worth your time exploring these options if you hit such limits. Here we will assume you either have a small dataset or have access to the 64-bit version for larger datasets.

There are two main algorithms; we recommend you try both and see what works. In my opinion UPARSE OTUs are sufficient for 16S microbiome analysis in most cases. However, if you are after more specific information (i.e. genotypes), or perhaps you targeted a more specific gene region (i.e. your primers were genus/family specific), you may want to go with ZOTUs. In short, I think you should do both and decide which makes biological sense in your study.

UPARSE algorithm to generate OTUs

The UPARSE algorithm is available in USEARCH v11 (and previous versions). If you have a large dataset you will need the 64-bit version, as it holds the data in memory. Detailed documentation on this algorithm can be found here.

The UPARSE-OTU algorithm constructs a set of OTU representative sequences from NGS amplicon reads.

Clustering criteria

The goal of UPARSE-OTU is to identify a set of OTU representative sequences (a subset of the input sequences) satisfying the following criteria.

- All pairs of OTU sequences should have < 97% pair-wise sequence identity.
- An OTU sequence should be the most abundant within a 97% neighborhood.
- Chimeric sequences should be discarded.
- All non-chimeric input sequences should match at least one OTU with ≥ 97% identity.

UPARSE-OTU uses a greedy algorithm to find a biologically relevant solution, as follows. Since high-abundance reads are more likely to be correct amplicon sequences, and hence are more likely to be true biological sequences, UPARSE-OTU considers input sequences in order of decreasing abundance. This means that OTU centroids tend to be selected from the more abundant reads, and hence are more likely to be correct biological sequences.
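The greedy pass can be illustrated with a toy awk sketch. The assumptions here are loud: equal-length made-up reads, a naive Hamming identity in place of real alignment, and no chimera check, so this is conceptual only, not what cluster_otus actually computes:

```shell
# Visit reads in order of decreasing abundance; each read joins an existing
# centroid at >= 97% identity, otherwise it founds a new OTU.
otu_log=$(awk '
function identity(a, b,   i, n, m) {
  n = length(a); m = 0
  for (i = 1; i <= n; i++) if (substr(a, i, 1) == substr(b, i, 1)) m++
  return m / n
}
BEGIN {
  reads[1] = "ACGTACGTACGTACGTACGTACGTACGTACGTACGT"  # most abundant, 36 bp
  reads[2] = "ACGTACGTACGTACGTACGTACGTACGTACGTACGA"  # 1 mismatch (35/36 = 97.2%)
  reads[3] = "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"  # unrelated
  nc = 0
  for (r = 1; r <= 3; r++) {
    hit = 0
    for (c = 1; c <= nc; c++)
      if (identity(reads[r], cent[c]) >= 0.97) { hit = c; break }
    if (hit) print "read" r " -> OTU" hit
    else { cent[++nc] = reads[r]; print "read" r " -> new OTU" nc }
  }
}')
echo "$otu_log"
```

Because centroids are chosen first from the most abundant reads, the rarer near-identical variant (read 2) is absorbed into OTU1 rather than founding its own cluster.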

Reference Edgar, R.C. (2013) UPARSE: Highly accurate OTU sequences from microbial amplicon reads, Nature Methods. Pubmed:23955772, dx.doi.org/10.1038/nmeth.2604.

#!/bin/bash

SF="8.singleton_filtered"
cluster="9.cluster"
uparse_otus="9a.uparse_otus"

echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo UPARSE on all
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

mkdir ${cluster}

   cat ${SF}/*.fasta > ${cluster}/all_SF.fasta

cd ${cluster}

  usearch11 -fastx_uniques all_SF.fasta -fastaout all_SF_DR.fasta -sizeout

mkdir ${uparse_otus}
cd ${uparse_otus}

  usearch11 -cluster_otus ../all_SF_DR.fasta -otus uparse_otus.fasta -relabel OTU

  usearch11 -usearch_global ../all_SF.fasta -db uparse_otus.fasta -strand both -id 0.97 -otutabout uparse_otu_tab.txt -biomout uparse_otu_biom.biom

  usearch11 -calc_distmx uparse_otus.fasta -tabbedout uparse_otus_distmx.txt -maxdist 0.2 -termdist 0.3
  
  usearch11 -cluster_aggd uparse_otus_distmx.txt -treeout uparse_otus_clusters.tree -clusterout uparse_otus_clusters.txt \
      -id 0.80 -linkage min

cd ..
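Once you have an OTU table it is worth a quick sanity check; for example, per-sample column sums give the number of reads assigned to any OTU. The layout below mirrors tab-separated -otutabout output, but the file contents and sample names are made up:

```shell
# Sum each sample column of a toy OTU table
# (header row: "#OTU ID<TAB>sampleA<TAB>sampleB").
tmp=$(mktemp)
printf '#OTU ID\tsampleA\tsampleB\nOTU1\t10\t5\nOTU2\t3\t0\n' > "$tmp"
col_sums=$(awk -F'\t' '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
  { for (i = 2; i <= NF; i++) tot[i] += $i }
  END { for (i = 2; i <= NF; i++) print name[i], tot[i] }' "$tmp")
rm "$tmp"
echo "$col_sums"
```

Comparing these totals against the read counts going into clustering gives a quick feel for how many reads were lost at each step.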

UNOISE3 algorithm to generate ZOTUs

The UNOISE3 algorithm is only available in USEARCH v10 onwards. If you have a large dataset you will need the 64-bit version, as it holds the data in memory. Detailed documentation on this algorithm can be found here.

There is a UNOISE2 option in Usearch9.2 with documentation available here.

This algorithm is specifically designed for Illumina sequences.

The UNOISE3 algorithm performs error-correction (denoising) on amplicon reads. It is implemented in the unoise3 command.

Correct biological sequences are recovered from the reads, resolving distinct sequences down to a single difference (often) or two or more differences (almost always).

Errors are corrected as follows: 1. Reads with sequencing and PCR point errors are identified and removed. 2. Chimeras are removed.

Reference Edgar, R.C. (2016), UNOISE2: Improved error-correction for Illumina 16S and ITS amplicon reads. http://dx.doi.org/10.1101/081257

#!/bin/bash

cluster="9.cluster"
unoise_zotus="9b.unoise_zotus"

echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo UNOISE on all
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


mkdir ${unoise_zotus}
cd ${unoise_zotus}

  usearch11 -unoise3 ../all_SF_DR.fasta -zotus unoise_zotus.fasta -tabbedout unoise_tab.txt

  usearch11 -fastx_relabel unoise_zotus.fasta -prefix Otu -fastaout unoise_zotus_relabelled.fasta -keep_annots

  usearch11 -otutab ../all_SF.fasta -zotus unoise_zotus_relabelled.fasta -otutabout unoise_otu_tab.txt -biomout unoise_otu_biom.biom -mapout unoise_map.txt -notmatched unoise_notmatched.fasta -dbmatched dbmatches.fasta -sizeout

  usearch11 -calc_distmx unoise_zotus.fasta -tabbedout unoise_zotus_distmx.txt -maxdist 0.2 -termdist 0.3
  
  usearch11 -cluster_aggd unoise_zotus_distmx.txt -treeout unoise_zotus_clusters.tree -clusterout unoise_zotus_clusters.txt \
      -id 0.80 -linkage min

cd ..
cd ..

What next?

There is further downstream analysis that can be done with usearch, for example taxonomy assignment and diversity analysis.

However, at present there are more robust taxonomy assignment options available (such as those in QIIME2) that are easier to use. With respect to diversity analysis, QIIME2 and R packages such as Phyloseq and Microbiome offer much more flexibility and robustness, and are more aesthetically pleasing.



Vector and Waterborne Pathogens Research Group. 2018. S Egan.