The internet has a seemingly endless number of resources available - if you can google, then you can code!

Don’t forget about YouTube - an excellent resource for visual learners.

Below are just some of the resources available, mainly tailored around genomics; however, the Data Carpentry and Software Carpentry resources are an excellent starting point for general information.



Where to start?

It can be a bit daunting knowing where to start. There are so many programs, packages, applications, tutorials - it’s important to remember you cannot do them all!

My personal advice (remember I am a Mac/Linux user) is to start by learning how to use the command line (i.e. the terminal) and get to know some basics of the bash language. Once you can perform some basic commands, move on to learning R/RStudio.

My reason for recommending RStudio is that it will save you lots of time in the long term. The RStudio community is huge, and R and the desktop version of RStudio are free, so you never have to worry about losing access. There is a wealth of packages for R, so you can do the whole workflow in RStudio - data entry, manipulating and summarising spreadsheets, statistical analysis, data visualisation and then writing your report/thesis/publication. You also won’t have to learn so many additional programs (e.g. mapping programs, Photoshop/figure editing software, statistics packages, the list goes on…). Most people pick and choose certain elements to do in RStudio and then use programs they are familiar with for the other bits. The good news is you can take it at your own pace.

Siobhon’s recommendations for beginners

Even if you think your field is not ecology, trust me, this is an excellent place to start learning RStudio.

While you may be tempted to look at the Genomics lessons (of course go ahead if you have a burning desire!), they are still very much a work in progress, and I would not recommend them at this stage.


R/RStudio

In addition to the Software Carpentry and Data Carpentry resources above, some additional places to start include:

  • Swirl is a great set of interactive tutorials that run directly in R. Useful for statistical analysis and basic functions.
  • R Tutorial offers a couple of introductory tutorials on basic R concepts.
  • Code School provides some more tutorials on basic R syntax.
  • Quick-R is great for tutorials on statistical analysis.
  • STAT 545 - course website for the 2019-2020 edition of STAT 545A and STAT 547M, colloquially known as just “STAT 545”, delivered at The University of British Columbia in Vancouver, BC.

R Cheat Sheets

There are a number of cheat sheets and other reference documentation available for R.

  • A link to the official R cheat sheets is here.
  • Other useful cheat sheets by DataCamp are available here.
  • Links to other useful resources are available here, here, here and here.

Sequence & Phylogenetics

More coming soon - see my personal page here for some inspiration in the meantime.


NGS Sequence analysis

This pipeline makes use of USEARCH for preprocessing and QIIME2 for taxonomy assignment and analysis - both of these come with a number of different tutorials with example data for you to work through. I strongly suggest you work through these critically. Don’t just copy, paste and press enter (even though it is tempting!) - make sure you understand what you are doing and why.

Other comparable alternatives

  • QIIME2 - although this webpage uses QIIME2 only from the taxonomy step onwards, pre-processing steps can also be done in QIIME2 by making use of various plugins
  • Mothur - a lot of information and resources available
  • VSEARCH - made as an alternative to USEARCH
  • FastQC - a common tool used for quality control checks on raw sequence data (fastq files)
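
As a rough illustration of the kind of per-read quality information these tools work with, the sketch below converts a fastq quality line into numeric Phred scores and takes their mean. It assumes the Phred+33 encoding used by most current sequencing platforms; the function names and the example quality string are made up for illustration and are not taken from any of the tools listed above.

```python
def phred33_scores(quality_string):
    """Convert a fastq quality string to integer Phred scores.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    return [ord(char) - 33 for char in quality_string]


def mean_quality(quality_string):
    """Return the mean Phred score of a read, or 0.0 for an empty string."""
    scores = phred33_scores(quality_string)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical quality line, made up for illustration only
example_quality = "IIIIHHGG##"
print(phred33_scores(example_quality))
print(round(mean_quality(example_quality), 1))
```

A read with a low mean score is a candidate for trimming or filtering, but the threshold to use depends on your data and the pipeline you follow.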

Some other useful references and analysis tools for genomics

Microbial analysis

Microbial analysis has received the most attention with respect to amplicon-based NGS approaches; however, with some refining these pipelines can also be used for other amplicon-based NGS studies (e.g. COI (metazoa), ITS (fungi), trnL (plants), 18S (protozoa), just to name a few). There is a lot of overlap between these tools, and with other genomics tools as well, so it’s up to you to see what works for you. Generally, the R packages allow more customisation but may be more difficult to grasp initially. GUIs are good for getting a quick handle on your data, but you are usually limited in how far you can customise their graphical outputs. Some of these are stand-alone packages; however, the majority use a common format (e.g. phyloseq objects are common throughout many of these).

  • phyloseq - Analyze microbiome census data using R
  • microbiomeSeq - An R package for microbial community analysis in an environmental context
  • metacoder - An R package for metabarcoding research planning and analysis
  • microbiome R - Microbiome R package (extending phyloseq)
  • microbiomeutilities - Extending and supporting package based on microbiome and phyloseq (currently in developmental stage)
  • R microbiota - Microbiota analysis in R
  • MicrobeMiseq - Analyses of microbial community composition and diversity in R using phyloseq
  • Bioconductor Microbiome - Microbiome data analysis: from raw reads to community analysis
  • Ampvis2 - Tools for visualising amplicon data
  • dada2 - Fast and accurate sample inference from amplicon data with single-nucleotide resolution
  • mare - Pipeline for microbiota analysis based on 16S-amplicon reads
  • metagenomeSeq - Statistical analysis for sparse high-throughput sequencing
  • Rhea - A set of R scripts for the analysis of microbial profiles
  • taxize - A taxonomic toolbelt for R
  • LabDSV - Ordination and multivariate analysis for ecology
  • phylogeo - An R package for geographic analysis and visualization of microbiome data
  • qiimer - R functions to read QIIME output files and create figures
  • RAM - R for amplicon-sequencing-based microbial-ecology

RNASeq

Databases

  • TriTrypDB - kinetoplastid genomics resource
  • VectorBase - bioinformatics resource for invertebrate vectors of human pathogens
  • CryptoDB - Cryptosporidium genomics resource
  • Silva - high quality ribosomal RNA databases. SILVA provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).
  • Greengenes - 16S rRNA gene database of Bacteria and Archaea.
  • EuPathDB - Eukaryotic pathogen database resource
  • UNITE - communication and identification of DNA-based fungal species (based on the internal transcribed spacer region - ITS)
  • BoldSystems - The Barcode of Life Data System is designed to support the generation and application of DNA barcode data. Includes the following: animal identification using mitochondrial cytochrome oxidase subunit 1 (COI); fungal identification using the internal transcribed spacer (ITS); and plant identification using chloroplast ribulose-bisphosphate carboxylase (rbcL) & plastid/nuclear maturase K (matK)
  • VEuPathDB - This NIH Bioinformatics Resource Center (BRC) will support the integration of parasite resources currently provided by EuPathDB.org, fungal resources provided by FungiDB.org, and vector resources provided by VectorBase.org.

Epidemiology and Statistics

Web/online resources

  • ClinEpiDB - Advancing global public health by facilitating the exploration and analysis of epidemiological studies. ClinEpiDB, launched in February 2018, is an open-access online resource enabling investigators to maximize the utility and reach of their data and to make optimal use of data released by others. More coming soon
  • Epitool - This site is developed and maintained by Ausvet. The site is intended for use by epidemiologists and researchers involved in estimating disease prevalence or demonstrating freedom from disease through structured surveys, or in other epidemiological applications.
  • VassarStats - A useful and user-friendly tool for performing statistical computation (probabilities, regression, t-test, ANOVA and more)

RMarkdown

More coming soon

Writing your thesis

Making a webpage


Mapping

More coming soon


Online modules and courses

There are a number of online modules and courses available too. Some of them you have to pay for, but there are many free resources as well.

Popular sites include:

  • Courses.edx
    • DNA sequencing - the lessons labelled ‘practical’ are particularly useful as they use a split screen and run through the code line by line. Although they do not use amplicon-based NGS examples, they do include good fundamental lessons (including how to read a fastq file).
    • Bioinformatics methods 1 - while it does not have recorded practicals, it does have a ‘post-lab lecture’ which is helpful. The handouts are very detailed; however, again they do not give examples specific to NGS but cover more general bioinformatics.
  • Coursera - a wealth of courses to make your way through
  • Datacamp - a fantastic resource for data science and anything coding related. The “on-screen” terminal has some pros and cons; give the free lessons a go (there is more than enough to keep you busy) and see how you like it - just make sure you take good notes!
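
If you are wondering what “reading a fastq file” actually involves, here is a minimal sketch in Python. It assumes the simple layout where every record is exactly four lines (an identifier line starting with @, the sequence, a separator line starting with +, and the quality string). The function name and the two example reads are made up for illustration; for real work an established parser is the safer choice.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from an open fastq handle.

    Assumes four lines per record; stops at end of file or the first blank line.
    """
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed fastq record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        # Drop the leading '@' from the identifier line
        yield header[1:], sequence, quality


# Two hypothetical reads, held in memory instead of a file on disk
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIHHHH\n"
    "@read2\nGGCCTTAA\n+\n########\n"
)
for identifier, sequence, quality in read_fastq(example):
    print(identifier, len(sequence), sequence)
```

To try it on a real file, replace the in-memory example with `open("your_file.fastq")`, where the file name is whatever your own data is called.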



Vector and Waterborne Pathogens Research Group. 2018. S Egan.