We will now run some sample code.
First, let’s check our tools:
which bwa
Output shows where bwa is installed.
which samtools
Output shows where samtools is installed.
Basic Bacterial Genome Sequence Analysis
mkdir -p /tmp/outbreaks/SG-M1
cd /tmp/outbreaks/SG-M1
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/275/545/GCF_001275545.2_ASM127554v2/GCF_001275545.2_ASM127554v2_genomic.fna.gz
gunzip GCF_001275545.2_ASM127554v2_genomic.fna.gz
mv GCF_001275545.2_ASM127554v2_genomic.fna SG-M1.fna
bwa index SG-M1.fna
bwa mem SG-M1.fna /tmp/fastq/SRR6327950/SRR6327950_1.fastq.gz /tmp/fastq/SRR6327950/SRR6327950_2.fastq.gz | samtools view -bS - > SRR6327950.bam
samtools sort SRR6327950.bam -o SRR6327950-sort.bam
samtools index SRR6327950-sort.bam
lofreq faidx SG-M1.fna
lofreq call -f SG-M1.fna -r NZ_CP012419.2:400000-500000 SRR6327950-sort.bam > SRR6327950-400k.vcf
Mapping takes ~5 min on a t2.medium. Sorting takes ~2 min. Running lofreq on this limited section of the genome takes ~1 min.
spades.py -t 2 -1 /tmp/fastq/SRR6327950/SRR6327950_1.fastq.gz -2 /tmp/fastq/SRR6327950/SRR6327950_2.fastq.gz -o SRR6327950_spades
NOTE: This assembly above will complete on a t3a.large and takes about 5 hours.
Excellent! This is a pretty routine task that can easily be run on an AWS EC2 instance. As experienced when conducting the assembly in Step 3, selecting the right machine for the job is incredibly important. If you run out of RAM or space on your disk, your job may quit. Luckily, these can be easily addressed by changing your instance type or by attaching another EBS volume to your machine.