a. Bacterial Genome data

We will now run some sample code.

First, let’s check our tools:

which bwa

Output shows where bwa is installed.

which samtools

Output shows where samtools is installed.

Basic Bacterial Genome Sequence Analysis

  1. Get a reference sequence:
mkdir -p /tmp/outbreaks/SG-M1
cd /tmp/outbreaks/SG-M1
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/275/545/GCF_001275545.2_ASM127554v2/GCF_001275545.2_ASM127554v2_genomic.fna.gz
gunzip GCF_001275545.2_ASM127554v2_genomic.fna.gz
mv GCF_001275545.2_ASM127554v2_genomic.fna SG-M1.fna
  1. Map and call SNPs:
    Note: For an annotation of the programs used below and other bioinformatics tools, check out our course github page.
bwa index SG-M1.fna
bwa mem SG-M1.fna /tmp/fastq/SRR6327950/SRR6327950_1.fastq.gz /tmp/fastq/SRR6327950/SRR6327950_2.fastq.gz | samtools view -bS - > SRR6327950.bam
samtools sort SRR6327950.bam -o SRR6327950-sort.bam
samtools index SRR6327950-sort.bam
lofreq faidx SG-M1.fna
lofreq call -f SG-M1.fna -r NZ_CP012419.2:400000-500000 SRR6327950-sort.bam > SRR6327950-400k.vcf

Mapping takes ~5 min on a t2.medium. Sorting takes ~2 min. Running lofreq on this limited section of the genome takes ~1 min.

  1. Assembly (runs ~4 min then will run out of RAM if you’re on a t2.medium):
spades.py -t 2 -1 /tmp/fastq/SRR6327950/SRR6327950_1.fastq.gz -2 /tmp/fastq/SRR6327950/SRR6327950_2.fastq.gz -o SRR6327950_spades

NOTE: This assembly above will complete on a t3a.large and takes about 5 hours.

Excellent! This is a pretty routine task that can easily be run on an AWS EC2 instance. As experienced when conducting the assembly in Step 3, selecting the right machine for the job is incredibly important. If you run out of RAM or space on your disk, your job may quit. Luckily, these can be easily addressed by changing your instance type or by attaching another EBS volume to your machine.