Sequences, Unix, Stacks, RADs, snails and outliers!

Welcome to the first day of the CeMEB Advanced Course in Marine Genomics. For the rest of the day, I’ll be giving you a brief introduction into the basics of Unix and bash before showing you how to perform a de novo assembly of RAD sequencing data. After that, I’ll show you how to detect RAD loci that might be under divergent selection between two populations.

Links to each of the tutorials in order are below:

There are also loads of useful resources online that can help you get started with population genomics and computational biology. There are almost too many to name but a good thing to start with is to read this recent Nature Biotechnology article. If you have problems with Unix, R or just about any code, a bit of creative googling can usually fix the problem. If not, Stack Overflow almost never fails to help me out. SeqAnswers is also a great website to visit if you have NGS specific issues. In particular, this thread helped me out as a total beginner.

Outlier analysis with Bayescan

Right, now that we’ve converted our data into the right formats, we’re ready to go with some outlier analysis!

Part 1: Running Bayescan

First things first, we should set up a new directory to perform it in.

cd ~
mkdir Bayescan
cd Bayescan

Now, there are two options. If you successfully converted your data, copy it from the file conversion folder. If not, don’t worry! You can download the files. Just follow the steps below.

# Option 1
cp ../file_conversion/salto.hap.* ./
# Option 2
cp /usr/local/files/outlier_practical/bayescan_input.tgz ./
tar -vxf bayescan_input.tgz

Once the files are copied, we are ready to run Bayescan. Like with Stacks, we’re going to run it on the cluster. Except this time I haven’t given you a bash file to submit. You’ll have to create one yourself! I would recommend that you copy one of the previous bash files to your current directory and then edit it with nano (NB. make sure the cd $HOME line is changed to cd $HOME/Bayescan. Remember to change the job name too. You’ll also obviously have to call Bayescan, however I’ve provided the command for this below:

bayescan_2.1 ./salto.hap.bayes.txt -od ./ -threads 4 \
-n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10

The first line consists of basic options to ensure the program runs properly – i.e. where the input file is, which directory to write the output to (./ means the current directory) and the number of threads or cores we can use (4 in this case).

The second line specifies the options for the analyses and the Monte Carlo Markov Chain and are as follows:

  • -n is the number of iterations we want to sample from the MCMC (5000 in this case)
  • -thin is the thinning interval, set to 10; this means we let the MCMC run for 10 steps in between each sample.
  • -nbp is the number of pilot runs Bayescan runs to tune the MCMC parameters.
  • -pilot is the length of each of the pilot runs, 5000 iterations here.
  • -burn is the length of the burnin (i.e. the part of the MCMC that is discarded), this is 50 000 here

With a total of 5000 sampled iterations and a thinning of 10 plus a burnin of 50000, we will be running the chain for 100 000 steps. Which could take some time! Once you’ve finished your script, call it something relevant (i.e. “bayescan_bash.sh”) and then submit it using qsub.

Part 2: Processing the output and getting results!

Unfortunately we don’t have time to wait around for Bayescan to complete it’s run on the cluster. So we’ll have to use some data I prepared for you earlier. This means though that we can use data based on a Stacks run on the full dataset – i.e. the full compliment of reads for each individual. What’s more we’re going to do this locally on your computer and not the Albiorix cluster.

Being able to upload/download files to and from a cluster is important and you will need to do it many many times when you are doing your own analysis. There are two sets of instructions here for Mac/Linux users and for Windows users.

For Mac/Linux users

Load a new terminal window (Ctrl or ⌘ + N) and type the following. Replace username with your username and input your password when prompted.

cd ~
mkdir outlier_analysis
cd outlier_analysis
scp username@albiorix.bioenv.gu.se:/usr/local/files/outlier_practical/bayescan_output.zip ./
unzip bayescan_output.zip

This will create a new folder in your home directory called outlier_analysis (make sure you don’t already have a folder with this name or it will be overwritten!). The scp command (secure copy) will download the Bayescan output from the cluster to your machine and place it in the outlier_analysis folder.

For Windows users

First create a directory called “outlier_analysis” somewhere on your computer where you can navigate to it easily (i.e. the desktop).

Open WinSCP (Control + R, type “WinSCP”). Once this is open, select the Albioirix session and load it. After a few seconds, you will see a graphical interface where your local directories are on the left and the cluster directories are on the right.

Look for the drop down menu above the file pane (it should say your username). Click the arrow and select “/ <root>”. This will take you to the root file. From here, navigate to usr/local/files/outlier_practical.

Once you’ve found this directory, navigate to the “outlier_analysis” directory on your machine using the left pane. Right click on the “bayescan_outlier.zip” and select copy. This should load a dialogue that will allow you to copy the file to your local directory. Select copy again and the file should transfer.

Finally, close WinSCP and navigate to the “outlier_analysis” directory. Double click the zip file to extract it.

Everyone together

In order to understand our results a little better, we need to do a bit of R analysis and plotting. Unfortunately we don’t have time to go into the fine detail of R, although I will give an informal introduction to it later this week.

For now, load R Studio and click on the small blue box symbol in the upper right hand corner. Then select “Existing Directory” and navigate to your outlier_analysis folder. R Studio has several advantages and one of these is this project management system. It makes associating work with directories much easier.

For instance, the outlier_analysis project is now associated with the directory you created. Click on the Files section of the lower right pane and you’ll see all the folders in your current directory. If you click on the file called “Outlier_analysis_R_script.R”, it will load an R script.

This R script will allow us to quickly run commands to process the results. To run a command, simply highlight it or place the cursor on the line and hit Ctrl (⌘ on a Mac) + Enter. Alternatively you can click the run button above the script. Like with bash, lines that start with “#” are ignored as comments (although they can still be run). R scripts are a really important tool and allow you to manage your analysis and rerun it easily.

So let’s break down the script I prepared for you. The first thing you’ll need to do is install some packages for your R environment. This basically lets you use specific functions within R that are not included in the base installation.

install.packages("ggplot2")
install.packages("diveRsity")

This will download and install the packages ggplot2 and diveRsity (the latter made by a good friend of mine, Kevin Keenan at Queen’s University Belfast). ggplot2 is a sophisticated plotting package and diveRsity is a package for calculating measures of genetic diversity and differentiation. While we won’t be doing this now, the package has a lot of useful functions I’ve used here to access our data. Now we’ve downloaded and installed them, we need to load them to the R environment.

library(ggplot2)
library(diveRsity)

Next we need to load a custom source script. This script contains several functions I have written for dealing with Bayescan data. Feel free to use it beyond the bounds of this practical but please keep in mind that I make no guarantee that it will alway work.

source("outlier_processing_functions_CeMEB.R")

Now we need to load the data. We need to load the main Bayescan output, “salto_fst.txt”. However because accessing the loci names from the Bayescan output is cumbersome, we also need to use the custom loci function to extract loci names from the “salto.hap.genepop” file.

salto_bayes salto_loci 

In order to calculate which loci are under selection from the Bayescan output, we also need to supply our functions with a false discovery rate (FDR). Although we could use a 10% FDR with such a large number of loci (9615), there are almost certainly false positives with FST outlier analyses. So we assign a value of 0.05 to an object called FDR like so:

FDR</pre>
Great! We're all loaded and ready to go. Next we process the data to add the loci names to our dataset and to detect outliers. To do this we will use the custom <code>bayes_process</code> function to reassign the modified dataset to the salto_bayes object.


salto_bayes</pre>
Then we can use the custom function <code>bayes_outliers </code>to create a list of outliers.


salto_outliers salto_outliers$outliers
salto_outliers$outlier_n

The list has two components, the first is a vector of outlier names, the second is the number of outliers. We have detected 137 RAD loci that are outliers at an FDR of 0.05. It’s often very useful to examine outliers graphically, so last of all we can plot the data. Usually this a bit more involved, but again I have written a custom function bayes_plot to do this automatically using ggplot2.

bayes_plot(salto_bayes, FDR)

Looks great! You can see that 136 of the outliers have high FST values, suggesting they are under positive selection between ecotypes. There is also one outlier with a very low FST, which suggests it could be under balancing selection.

Congratulations! You’ve gone from some raw llumina Hi-Seq data to an outlier analysis identifying loci putatively under positive selection. While we will end the practical here, what sort of things might you do next? Well here are some potential ideas:

  • Extract the consensus sequences for your outlier loci from the de novo RAD catalogue.
  • Align them to a reference genome (if you are lucky enough to have access to one)
  • Locate them on a linkage map