WARNING:Outdated page
This pipeline has since been updated to be fully operational in USEARCH v11.
New link here.
Note: page will no longer be updated regularly.
Input
Raw fastq
files from illumina NGS platforms.
Output
Complied OTU table (taxonomy not assigned) and final OTU sequences in fasta
format.
Assumptions
Working within MacOS environment - there are slight differences when working within a linux environment. If you are using a Pawsey virtual machine (Nimbus) then you will need to work within a linux environment. The documentation for usearch is the same for MacOS and linux with the exception that you will need to download the linux versions, however take note it is primarily written for MacOS and some features may not be available. Please provide feedback if functions are not working within a linux environment. Downloaded USEARCH. This pipeline requires USEARCH 8, 9.2 and 10. In most cases the 34-bit files are okay, however if undertaking analysis of large datasets you will need to purchase the 64-bit. The Cryptick lab has the 64-bit versions of USEARCH8, 9.2 and 10 - you will need to use the iMac to access these. Unfortunately we only have the 64-bit version of USEARCH9.2 on a linux environment, however this is sufficent for almost all steps. You will have to weight up your options between running the analysis on Nimbus vs iMac. You might consider doing the first steps on the Nimbus cloud and the final steps (i.e. UNOISE clustering) on the iMac.
Data type
This analysis is for paired-end reads
Using this pipeline
I strongly suggest you take a small sub-set of your data though the pipeline outlined below step-by-step first to get an understanding of what each command is doing. Importantly it will also allow to explore the influence of the different variables on your dataset.
Once you are happy with the parameters and that the script is working you can use the full script to analyse all your samples - available at this github link
Simply edit the script with your parameters, and save it in a directory with your raw fastq files.
In a bash terminal navigate to your directory and perform the following to execute the script and save the terminal output to a text file.
cd /Users/name/Documents/NGS-Analysis
chmod +x this_script.sh
./this_script.sh > 16S-NGS-usearch-output.txt
This commands gives a short summary report of the sequences in a fasta
or fastq
file. It is useful for a first check on what is in a new file. The report is written to the console and can be saved to a text file using the -output
option
#!/bin/bash
raw_data="raw_data"
read_summary="1.read_summary"
usearch="usearch10"
mkdir $read_summary
for fq in $raw_data/*R1*.fastq
do
$usearch -fastx_info $fq -output $read_summary/1a_fwd_fastq_info.txt
done
for fq in $raw_data/*R2*.fastq
do
$usearch -fastx_info $fq -output $read_summary/1b_rev_fastq_info.txt
done
To look at just one fasta or fastq files, for example inspect your concatenate sequences. The -secs 5 options is good when just wanting a snap shot of a large sequences file, however it can be deleted if you can it to scan the entire file.
usearch10 -fastx_info sequences.fastq -secs 5 -output reads_info.txt
IMPORTANT!
At this point it is important to look at the ‘EE’ value which means expected error.
Detailed information on this can be found here.
In short you want your EE value to be under 2, usually it is lower for the forward reads than the reverse reads.
This command merges paired-end reads to create consensus sequences and, optionally, consensus quality scores. This command has many features and options so I recommend spending some time browsing the documentation to get familiar with the capabilities of fastq_mergepairs and issues that arise in read merging.
As is the convention for illumina paried reads the forward sequences are labeled samplename_SXX_L001_R1_001.fastq
and reverse sequences labeled samplename_SXX_L001_R2_001.fastq
.
There are many additional options for this command that are specified here, and a thorough outline of reviewing fastq_mergepairs reports here.
#!/bin/bash
raw_data="raw_data"
merged_data="2.merged_reads"
usearch9="usearch9.2"
usearch8="usearch8"
# Maximum no of mismatches in the alignment - Default 5. Consider increasing if you have long overlaps.
maxdiffs="15"
# Discard pair if alugnment is shorter than this value - Default 16 bp
overlap="50"
mkdir ${merged_reads}
mkdir working1
#*****************************************************************************************
# Step1: merge data with usearch9 -fastq-filter
for file1 in ${raw_data}/*R1_001.fastq
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Merging paired reads
echo forward reads are:
echo $(basename ${file1})
echo reverse reads are:
echo $(basename ${file1} R1_001.fastq)R2_001.fastq
$usearch -fastq_mergepairs ${file1} -reverse "${raw_data}/$(basename -s R1_001.fastq ${file1})R2_001.fastq" -fastqout "working1/$(basename "$file1")" -fastq_maxdiffs ${maxdiffs} -fastq_minovlen ${overlap} -report ${merged_data}/2a_merging_seqs_report.txt -tabbedout ${merged_data}/2b_tabbedout.txt
done
#*****************************************************************************************
# Step 2: Remove "_L001_R1_001" from filenames
for file2 in working1/*.fastq
do
rename="$(basename ${file2} _L001_R1_001.fastq).fastq"
mv ${file2} ${merged_data}/${rename}
done
#*****************************************************************************************
# Removing working directory
rm -r working1
fastq_minovlen
and fastq-maxdiffs
. Again it is important to remember the purpose of this line of code, it is simply to merge the forward and reverse sequences not to act as a quality control step.
This command performs quality filtering and converts fastq
files to fasta
format. Again there are a number of filtering options available with this command that you can explore, however below are the most approriate for 16S amplicon NGS data.
#!/bin/bash
# This script works with usearch9.2 and usearch10
# fastq_filter provides a quiality filter and max ee rate
merged_reads="2.merged_reads"
QF_reads="3.quality_filtered"
# Usearch executable
usearch="usearch10"
# Enter % expected error rate threchold for seq quality filtering (1% = 0.01)
error_rate="0.01"
# Enter min length of sequence for trimming in bp (eg. to keep all seqs above 200 bp enter "200")
minlen="150"
mkdir ${QF_reads}
for file3 in ${merged_reads}/*.fastq
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Quality control and removing dimer seqs
echo input is:
echo ${file3}
$usearch -fastq_filter ${file3} -fastaout "$QF_reads/$(basename "$file3" .fastq).fasta" -fastq_maxee_rate ${error_rate} -fastq_minlen ${minlen}
done
usearch 9.2 search_oligodb command
This command searches for matches of nucleotide sequences to a database containing short nucleotide sequences (oligonucleotides). In this context the command is used to search for matches to primers. By default, -maxdiffs
2 is assumed and other accept criteria are not used.
usearch8 search_pcr command
This command searches for predicted amplicons. The sequences specific for each primer are made into a fasta
files (databases). Each query sequence is searched for matches to a pair of primers that would generate a PCR amplicon. The -pcr_strip_primers
command is then used to strip the primers from each sequence.
usearch9="usearch9.2"
usearch8="usearch9"
QF="3.quality_filtered"
trimmed_data="4.trimmed_data"
seqs_w_fwd_primer="5.seqs_w_fwd_primer"
seqs_wo_fwd_primer="6.seqs_wo_fwd_primer"
seqs_w_fwd_and_rev_primer="7.seqs_w_fwd_and_rev_primer"
seqs_w_fwd_butnot_rev_primer="8.seqs_w_fwd_butnot_rev_primer"
# Enter FWD primer sequence 5'-3' (degenerate bases OK)
fwd_primer="AGAGTTTGATCCTGGCTYAG"
# Enter REV primer sequence 5'-3' (degenerate bases OK)
rev_primer="TGCTGCCTCCCGTAGGAGT"
# Enter number of primer mismatched allowed
pcr_missmatches="0"
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo Triming primers and distal bases
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo ""
# At the moment this usearch command can only take .fasta as input so can only be done
# after QF
# Creating working directories
mkdir ${trimmed_data}
mkdir ${seqs_w_fwd_primer}
mkdir ${seqs_wo_fwd_primer}
mkdir ${seqs_w_fwd_and_rev_primer}
mkdir ${seqs_w_fwd_butnot_rev_primer}
# Creating FWD primer db
echo ">fwd_primer" > fwd_primer_db.fasta
echo ${fwd_primer} >> fwd_primer_db.fasta
# Creating REV primer db
echo ">rev_primer" > rev_primer_db.fasta
echo ${rev_primer} >> rev_primer_db.fasta
# Creating FWD and REV primer db
echo ">fwd_primer" > both_primers_db.fasta
echo ${fwd_primer} >> both_primers_db.fasta
echo ">rev_primer" >> both_primers_db.fasta
echo ${rev_primer} >> both_primers_db.fasta
#*****************************************************************************************
# Step 1: Finding seqs with FWD primer
for file4 in ${QF}/*.fasta
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Trimming primers step 1: finding seqs with FWD primer
echo input is:
echo ${file4}
$usearch9 -search_oligodb ${file4} -db fwd_primer_db.fasta -strand both -matched "${seqs_w_fwd_primer}/$(basename ${file4})" -notmatched "${seqs_wo_fwd_primer}/$(basename ${file4})"
done
#*****************************************************************************************
# Step 2: Finding seqs with FWD and REV primers
for file5 in ${seqs_w_fwd_primer}/*.fasta
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Trimming primers step 2: finding seqs with FWD and REV primer
echo input is:
echo ${file5}
$usearch9 -search_oligodb ${file5} -db rev_primer_db.fasta -strand both -matched "${seqs_w_fwd_and_rev_primer}/$(basename ${file5})" -notmatched "${seqs_w_fwd_butnot_rev_primer}/$(basename ${file5})"
done
#*****************************************************************************************
# Step 3: Trimming FWD and REV primers
for file6 in ${seqs_w_fwd_and_rev_primer}/*.fasta
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Trimming primers step 3: removing FWD and REV primers
echo input is:
echo ${file6}
$usearch8 -search_pcr ${file6} -db both_primers_db.fasta -strand both -maxdiffs ${pcr_missmatches} -pcr_strip_primers -ampout "${trimmed_data}/$(basename ${file6} .fasta).fasta"
done
#*****************************************************************************************
# Removing working directories
# rm -r seqs_w_fwd_primer
# rm -r seqs_wo_fwd_primer
# rm -r seqs_w_fwd_and_rev_primer
# rm -r seqs_w_fwd_butnot_rev_primer
##########################################################################################
There are no usearch commands or other parameters in this step, it is simply a series of bash commands to ensure the sequence labels are compatable for downstream OTU/ZOTU clustering.
trimmed_data="4.trimmed_data"
labeled_data="9.labeled_data"
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo Renameing sequences with ">barcodelabel=sample_id;sequence_id"
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
mkdir ${labeled_data}
mkdir working2
#*****************************************************************************************
# Step 1: Remove ">" from start of sequence_ID
for file7 in ${trimmed_data}/*.fasta
do
sed -e 's/>/>barcodelabel=;/g' ${file7} > working2/$(basename "$file7" .fasta).txt
done
#*****************************************************************************************
# Step 2: Add sample_ID (should be filename) to produce ">barcodelabel=sample_ID;sequence_ID"
for file8 in working2/*.txt
do
sample_id=$(basename ${file8} .txt)
echo ${sample_id}
sed -e "s/;/${sample_id};/g" ${file8} > "${labeled_data}/$(basename "$file8" .txt).fasta"
done
#*****************************************************************************************
# Remove working directories
rm -r working2
##########################################################################################
fastx_uniques command - This command finds the set of unique sequences in an input file, also called dereplication. In this case the input is a fasta file. Sequences are compared letter-by-letter and must be identical over the full length of both sequences (substrings do not match).
sortbysize command - This command sorts sequences by decreasing size annotation, which usually refers to the size of a cluster. In this case we also add a -maxsize
option to specify singletons i.e. 1.
search_exact command - This command searches for exact, full-length matches to a database sequence. This alogorithm is only available with usearch8, however even for large datasets the 32-bit version is fine as it does not need to store alot of memory.
usearch="usearch9.2"
labeled_data="9.labeled_data"
derep_dir="10.derep_data"
low_abund_seqs="11.low_abund_sequences"
SF="12.singleton_filtered"
# Enter max replicate cluster size (eg. to remove singletons enter 1, for duplicates enter 2)
maxsize="1"
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo Removing low abundant seqs singletons per sample
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo ""
# Creating directories
mkdir ${derep_dir}
mkdir ${low_abund_seqs}
mkdir ${SF}
#*****************************************************************************************
# Step 1: Dereplicating
for file9 in ${labeled_data}/*.fasta
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Removing singletons step 1: derep_fulllength
echo input is:
echo ${file9}
$usearch -fastx_uniques ${file9} -fastaout "${derep_dir}/$(basename "$file9" .fasta).fasta" -sizeout
done
#*****************************************************************************************
# Step 2: Filtering low abundant seqs {maxsize}
for file10 in ${derep_dir}/*.fasta
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Removing singletons step 2: sorting uniques
echo input is:
echo ${file10}
$usearch -sortbysize ${file10} -fastaout "${low_abund_seqs}/$(basename "$file10" .fasta).fasta" -maxsize ${maxsize}
done
#*****************************************************************************************
# Step 3: Mapping reads
for file11 in ${labeled_data}/*.fasta
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Removing singletons step 3: mapping reads to low abundant uniques
echo input is:
echo ${file11}
$usearch -search_exact ${file11} -db "${low_abund_seqs}/$(basename "$file11" .fasta).fasta" -strand plus -notmatched "${SF}/$(basename "$file11" .fasta).fasta"
done
#*****************************************************************************************
Note that this leads to a dead end essentially. However is is useful if wanting to just analyse a single sample, without having to import the large concatenated sequences.
usearch="usearch10"
SF="12.singleton_filtered"
SF_derep="13.singleton_filtered_derep"
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo Dereplicating SF files
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
mkdir ${SF_derep}
#*****************************************************************************************
for file15 in ${SF}/*.fasta
do
echo ""
echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
echo Derep_fulllength SF data
echo input is:
echo ${file15}
$usearch -fastx_uniques ${file15} -fastaout "${SF_derep}/$(basename "$file15" .fasta).fasta" -sizeout
done
This step is where you may encounter issues with enough memory, for 1 or 2 MiSeq runs, the 32-bit versions of usearch may be okay, however any larger and you will need to use the 64-bit, paid versions. There are alternative freely available software for this step such as vsearch, and it may be worth your time exploring these options if you encounter such limits. For this case we will assume you either have a small data set or have access to the 64-bit versions for larger datasets.
There are two main alogorithms we recommend you do both and see what works. In my opinion I think that uparse OTUs are sufficent for 16S microbiome analysis in most cases. However if you are after more specific information (i.e. genotypes) or perhaps you targeted a more specific gene region (i.e. your primer were genus/family specific), you may want to go with ZOTUs. In short I think you should do both and make the descision of which makes biological sense in your study.
The UPARSE algorithm is available in usearch 9.2 and usearch 10. If you have a large dataset you will need the 64-bit version as it does hold memory. Detailed documentation on this algorithm can be found here.
The UPARSE-OTU algorithm constructs a set of OTU representative sequences from NGS amplicon reads. R
Clustering criteria The goal of UPARSE-OTU is to identify a set of OTU representative sequences (a subset of the input sequences) satisfying the following criteria.
All pairs of OTU sequences should have < 97% pair-wise sequence identity. An OTU sequence should be the most abundant within a 97% neighborhood. Chimeric sequences should be discarded. All non-chimeric input sequences should match at least one OTU with ≥ 97% identity. UPARSE-OTU uses a greedy algorithm to find a biologically relevant solution, as follows. Since high-abundance reads are more likely to be correct amplicon sequences, and hence are more likely to be true biological sequences, UPARSE-OTU considers input sequences in order of decreasing abundance. This means that OTU centroids tend to be selected from the more abundant reads, and hence are more likely to be correct biological sequences.
Reference Edgar, R.C. (2013) UPARSE: Highly accurate OTU sequences from microbial amplicon reads, Nature Methods. Pubmed:23955772, dx.doi.org/10.1038/nmeth.2604.
SF="12.singleton_filtered""
cluster="13.cluster"
uparse_otus="13a.otus"
usearch="usearch9.2"
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo UPARSE on all
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
mkdir ${cluster}
cat ${SF}/*.fasta > ${cluster}/all_samples_SF.fasta
cd ${cluster}
mkdir ${uparse_otus}
cd ${uparse_otus}
$usearch -fastx_uniques ../all_samples_SF.fasta -fastaout all_samples_SF_DR.fasta -sizeout
$usearch -cluster_otus all_samples_SF_DR.fasta -otus uparse_otus.fasta -relabel OTU
$usearch -usearch_global ../all_samples_SF.fasta -db uparse_otus.fasta -strand both -id 0.97 -otutabout uparse_otu_tab.txt -biomout uparse_otu_biom.biom
$usearch -cluster_agg uparse_otus.fasta -treeout uparse_otus.tree -clusterout clusters.txt
cd ..
The UNOISE3 algorithm is only available on usearch10. If you have a large data set you will need the 64-bit version as it does hold memory. Detailed documation on this algorithm can be found here.
There is a UNOISE2 option in Usearch9.2 with documentation available here.
This algorithm is specifically designed for Illumina sequences.
The UNOISE3 algorithm performs error-correction (denoising) on amplicon reads. It is implemented in the unoise3 command.
Correct biological sequences are recovered from the reads, resolving distinct sequences down to a single difference (often) or two or more differences (almost always).
Errors are corrected as follows: 1. Reads with sequencing and PCR point error are identified and removed. 2. Chimeras are removed.
Reference Edgar, R.C. (2016), UNOISE2: Improved error-correction for Illumina 16S and ITS amplicon reads.http://dx.doi.org/10.1101/081257
SF="12.singleton_filtered""
cluster="13.cluster"
unoise_zotus="13b.zotus"
# Remember is must be Usearch 10 for the unoise3 algorithm
usearch="usearch10"
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo UNOISE on all
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
cd ${cluster}
mkdir ${unoise_zotus}
cd ${unoise_zotus}
$usearch -fastx_uniques ../all_samples_SF.fasta -fastaout all_samples_SF_DR.fasta -sizeout
$usearch -unoise3 all_samples_SF_DR.fasta -zotus unoise_zotus.fasta -tabbedout unoise_tab.txt
$usearch -fastx_relabel unoise_zotus.fasta -prefix Otu -fastaout unoise_zotus_relabelled.fasta -keep_annots
$usearch -otutab all_samples_SF_DR.fasta -zotus unoise_zotus_relabelled.fasta -otutabout unoise_otu_tab.txt -biomout unoise_otu_biom.biom -mapout unoise_map.txt -notmatched unoise_notmatched.fasta -dbmatched dbmatches.fasta -sizeout
$usearch -cluster_agg unoise_zotus_relabelled.fasta -treeout unoise_zotus.tree -clusterout clusters.txt -id 0.80 -linkage min
cd ..
cd ..
This is downstream analysis that can done with usearch. For example, taxonomy assignment and diversity analysis.
However, at present there are more robust taxonomy assignment options available (such as that in QIIME2), that are easier to use. With respect to diversity analysis, QIIME2 and R packages such as Phyloseq and Microbiome offer much more flexibility, robustness and are more aesthectically pleasing.
Vector and Waterborne Pathogens Research Group. 2018. S Egan.