Adapter and quality trimming Illumina data with Fastp

I recently made a 150 bp PE Illumina library using a NEBNext Ultra DNA Library Prep Kit for sequencing on the HiSeq X Ten. When my data came back from sequencing, I had no idea how to prepare it for assembly! I had to learn how to trim adapters and low-quality sequences. After evaluating three trimming tools (Trim Galore!, Trimmomatic, and Fastp), I decided on Fastp, mainly because of its speed and ease of use.

First, install Fastp on your cluster or local system. Install with Conda or download a working binary (see the Fastp GitHub page for detailed directions):

conda install -c bioconda fastp
#or
wget http://opengene.org/fastp/fastp
chmod a+x ./fastp

Once installed, you should skim the Fastp GitHub page to learn how to use the program. Fastp can trim both adapters and low-quality sequence. Ideally you know which adapters you used so you can trim them. After emailing NEB customer support, I found that NEBNext library adapters resemble TruSeq adapters and can be trimmed similarly.

NEBNext Adapter Read1:   AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

NEBNext Adapter Read2:   AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Essentially, you feed Fastp your raw read file(s), the adapter sequences, and the output file name(s), then set the appropriate flags. Here is an example:

First I set my variables (input and output):

#!/bin/bash
DIR=/path_to_working_directory
IN1=/path_to_raw_pe_reads_1
IN2=/path_to_raw_pe_reads_2
# Name the cleaned outputs after the input files; they are written into the working directory
OUT1=Clean.$(basename ${IN1})
OUT2=Clean.$(basename ${IN2})

Then I call Fastp, pass in my variables and adapter sequences, and set the flags. Here I opted to filter reads at Q20 (-q 20), discard reads shorter than 80 bp (--length_required 80), and slide a window from front to tail and tail to front, trimming when the mean quality in the window drops below Q20. I chose rather stringent settings here because I have over 100X coverage; you should adjust your settings depending on your data and goals.

cd ${DIR}
fastp -i $IN1 -I $IN2 -o ${OUT1} -O ${OUT2} \
--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
--adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-q 20 --length_required 80 --cut_tail --cut_front \
--cut_mean_quality 20

That’s it! Hope this helped! Fastp is a very powerful program and can do a lot more than what I demonstrated here, so be sure to check out its other options.
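
For example, Fastp can detect paired-end adapters on its own (from read overlap) and write HTML/JSON QC reports. The sketch below assumes a reasonably recent Fastp version and reuses the variables from above; the report file names and thread count are just placeholders.

# Same trimming settings as above, but let Fastp auto-detect the PE adapters
# and write HTML/JSON reports (file names are placeholders)
fastp -i $IN1 -I $IN2 -o ${OUT1} -O ${OUT2} \
--detect_adapter_for_pe \
-q 20 --length_required 80 --cut_tail --cut_front \
--cut_mean_quality 20 \
--thread 8 \
--html fastp_report.html --json fastp_report.json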

How to update or install your local NCBI BLAST database in a Unix shell using update_blastdb.pl

I recently updated my local BLAST database and I thought I would revisit the process of installing/updating, but this time using the included update_blastdb.pl script.

First, make sure the BLAST+ programs are on your PATH. Because I work on a cluster, all I do is load the module.

module load bio/blast+/2.7.1
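
If you are not on a cluster with environment modules, you can instead add the BLAST+ bin directory to your PATH yourself; the install location below is just a placeholder.

# Make the BLAST+ binaries (including update_blastdb.pl) findable in this shell
export PATH=/path_to_blast_install/ncbi-blast-2.7.1+/bin:$PATH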

I then delete the old database folder and make a new folder with the exact same name (this keeps your old scripts working). If you are doing a fresh install, just create a new folder. Either way, move into the folder so the database files are downloaded into it.

module load bio/blast+/2.7.1
rm -r blastdb_folder_name
mkdir blastdb_folder_name
cd blastdb_folder_name

Now use the Perl script to download the database of your choice. The --decompress option automatically decompresses the tar.gz files. Depending on which database you choose to download and your internet speed, this could be a lengthy process.

perl update_blastdb.pl --decompress nt
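
If you are not sure of the exact database name, the same script can first list everything NCBI offers (to my knowledge this is the --showall option):

# Print the names of all databases available for download
perl update_blastdb.pl --showall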

I am also downloading the taxonomy database so I can get more information about my BLAST hits. You have to unpack this database manually.

module load bio/blast+/2.7.1
perl update_blastdb.pl taxdb
tar -xzf taxdb.tar.gz
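
Optionally, you can also point the BLASTDB environment variable at your database folder so the BLAST+ programs can find the database and taxonomy files without a full path; the path below is a placeholder.

# Let blastn, blastp, etc. resolve database names relative to this folder
export BLASTDB=/path_to/blastdb_folder_name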

That’s it! Now you should have an updated BLAST database.

Installing and querying a local NCBI nucleotide database (nt)

While the online version of the non-redundant nucleotide database (nr/nt) is useful for small-scale applications, checking for contamination in an assembly is best done with a local NCBI nt database. Read along for a guide on how I installed and then queried the NCBI nt database on a Unix cluster.

First, you need to make a folder where you will store the entire database, then enter that folder.

mkdir NCBI_nt_DB
cd NCBI_nt_DB

Next you need to download the entire nt database from the NCBI website. Note that this database is almost 50 GB in size so make sure that you have sufficient space. This download may take some time depending on the speed of your connection.

wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz"

Next, each tar.gz file has to be unpacked and can then be deleted. Doing this manually would take some time, so I used a for loop. After the loop completes, the database is ready; no formatting is needed because the files are already formatted.

#!/bin/bash
# Unpack each archive, then delete it to save space
for file in *.gz
do
tar -zxvpf "$file"
rm "$file"
done

To query the new database, it’s important to point BLAST to the database folder and to the nt index prefix, i.e. path_to_database_folder/nt. An example query looks like so:

blastn -db path_to_the_folder/nt -query fastafile
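
For contamination screening I find tabular output written to a file easier to work with. Here is a sketch; the query file name, E-value cutoff, and thread count are just illustrative choices.

# Tabular (outfmt 6) hits with taxonomy IDs, written to a file
blastn -db path_to_the_folder/nt \
-query assembly.fasta \
-outfmt "6 qseqid sseqid pident length evalue bitscore staxids" \
-evalue 1e-25 \
-num_threads 8 \
-out assembly_vs_nt.tsv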

That’s it! Everything should be working now. With an offline database it’s important to decide on a regular update schedule; I plan on updating my database every 6 months or so. Updates can be done with the update_blastdb.pl script that ships with the BLAST+ software, or by deleting the whole database and downloading it anew.
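
If you would rather not remember to update by hand, one option (just a sketch, assuming cron is available and the BLAST+ bin directory is on cron’s PATH) is a crontab entry that re-runs the download, for example at 2 a.m. on the first day of every sixth month:

# m h dom mon dow   command
0 2 1 */6 * cd /path_to/blastdb_folder_name && update_blastdb.pl --decompress nt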

A primer on PCR

The PCR gods are a fickle sort, and it’s an art to appease them. This strange intersection between science and the occult can be trying at the best of times, but fear not, for I have braved the trials of PCR and write to offer advice.

To start, I am sharing my standard PCR reaction. 10μL may seem like a tiny reaction volume, but it’s enough to allow for testing on an agarose gel, amplicon cleanup for sequencing, and some evaporation.

I use standard 10-μL PCR reactions, containing:

  • 5μL of PCR MasterMix (Promega, Madison, Wisconsin, USA)
  • 2.5μL of DNA-free H2O
  • 0.5μL MgCl2
  • 0.5μL forward primer
  • 0.5μL reverse primer
  • 1μL of template DNA


I then use touchdown PCR programs optimized for each primer pair to maximize primer specificity. The PCR product is then checked on agarose gels. If a reaction fails, I troubleshoot by trying each step below. Most of the time, diluting the template solves the problem.

  1. Dilute the template
  2. Decrease the specificity of the PCR program
  3. Increase the amount of MasterMix
  4. Increase the size of the reaction
  5. Re-extract template DNA