# Tut 12: Sequence Alignment with Geneious + Read mapping

## Files

{% file src="/files/MxmMDbXGBqpfWEh1ElpK" %}

{% file src="/files/SK1QdNh1mw6pKaJYan9P" %}

## Week 12 Walkthrough- Population Dynamics!

**09 Apr 2024**

Hello and welcome to week 12 of ESPM 112L- Metagenomic Data Analysis Lab!

* First step: Making your alignment
* Interlude: .sam files, .bam files, samtools, oh my!
* Geneious
* Interlude: Predicting ORFs (make sure to do this before continuing!)
* Back to mapping
* Turn-in assignment

First stage development of your group’s research project….

This week we’re going to generate an alignment and visualize that alignment in geneious so you can see two things:

1. What coverage actually looks like and what it means
2. SNP-level microdiversity in a population genome

Let’s get to it!

## First step: Making your alignment

We are going to use an aligner, called bbmap.sh. First, make a folder in your home directory (call it Lab12 or whatever you’d prefer) using filezilla. Then, copy in a bin of your choice (from your sample!) to this directory (details to follow).

Remember to make sure the genome you choose is of generally good quality- at least 1.5Mbp, and the fewer contigs the better!

We’re going to make symbolic links to your reads into this folder to make them easier to access. Trust me- this saves a lot of typing time and hard drive space.

First choose a bin from your sample, located in ggkbase. Download its contigs as you would normally, and use filezilla to transfer it over to your Lab12 directory. Next,&#x20;

{% code overflow="wrap" %}

```
mkdir ~/Lab12

#Go to that directory
cd ~/Lab12

#Make symbolic links to reads
#This takes the QC'd read files and makes links to them in your directory
#So you don't have to write out the full path when you make your command.

ln -s /class_data/2024_reads/temp/[SAMPLE NAME]*trim_clean.PE.*.fastq.gz ./
```

{% endcode %}

Awesome- now that you’ve done that, let’s go ahead and run your alignment. Importantly, make sure to specify threads=1 when you run this; otherwise bbmap.sh will take all the available threads and make it difficult for everyone to run their alignments! Make sure to edit the command on a text editor beforehand like I’ve always taught you. It will take approx 10 min to run each.

![](https://lh7-us.googleusercontent.com/YUhI_pxCUs9rhOinTarSk73mLHFMKSNHI5UnPXW6XnURRFsfpDriev_DBQh7tIekMfDmvpLAhPc5-gl-dGeXONLD0FEkdBSmiE66sygfrZQMykKMyrnUWcb4xkVZYEchFQLZStVk0d_96I8bXIudDaQ)

Remember you've got to be located in \~/Lab12 in the command line

Start a new tmux session

Now input the command (note youll have to edit this first)

{% code overflow="wrap" %}

```
bbmap.sh pigz=t unpigz=t ambiguous=random threads=1 ref=[bin contigs file].fa.gz in1=[FORWARD READS FILE].fastq.gz in2=[REVERSE READS FILE].fastq.gz out=stdout.sam | shrinksam | sambam > [genome name].shrink.sort.bam
```

{% endcode %}

Now you can close and reopen the tmux as you please.

Just put in values for \[bin contigs file], the forward and reverse reads files, and name the final output (\[genome name].shrink.sort.bam) what you’d like- I would name it based on the name of the genome you’re looking at. Say we’re using `SPRUCE_SRR5824232_maxbin.scaffolds2bin.tsv_bin_126.fa`; in that case I would name the final output `SPRUCE_SRR5824232_maxbin.scaffolds2bin.tsv_bin_126.shrink.sort.bam` or something similar.

This process should only take a few minutes. bbmap.sh is generally much faster than bowtie2, one of the most commonly used alternatives. Note that this is, of course, dependent on both the size of the genome you’ve chosen and the number of reads in your dataset that you’re aligning to the reference genome.

The additional stuff at the end of the command is various bits of processing; filtering, sorting, and shrinking your alignment file by removing uninformative bits. Let’s talk for a bit about what’s happening there.

### Interlude: .sam files, .bam files, samtools, oh my!

There are two kinds of read alignment files, which are generated by read mapping programs like bowtie2 and bbmap. The first is a .sam file, which stands for Sequence Alignment Map. They look like this:

![SAM file](https://lh7-us.googleusercontent.com/Qf1Fhz8SVB5picUK0oZMpRewD-CvXnqJ3qhH0smxMwAqMvebEI11lYZr1qPsiDRH4YYjLqHOs_9uB19VSqjoW4HXPX6WwZ387BzXGKe0w14oCO2ntn4xH2wRdM5DdI4Hmiyj3NMj4ZM3QraQ8ebc5ew)

The information here basically is:

* Where on the reference genome your reads align
* How well those reads match your reference

Which is fantastically useful information, as you’ve probably surmised due to our having generated such alignments several times so far in lab. However, there’s a caveat: it’s all in text format! This ends up being very costly in terms of hard drive space (they can each be many gigabytes in size) and computational power (if you have to load a 30GB file every time you do something, it gets old quick).

In order to reduce the amount of space these files take up, we first shrink the file by removing unmapped sequences from the .sam file (with shrinksam), then convert it to a much smaller binary file, called a .bam (with sambam).

A .bam file contains the same information as a .sam file, but in a much smaller package. This is more efficient for programs to load if you want to analyze your data, and more efficient in terms of how much space it’s taking up on your hard drive!

The command as I’ve written it above actually never even writes a .sam file to the hard drive, instead feeding the alignment information directly into programs which will process it and then turn it into a .bam file. If you’re interested in further particulars of how this command works, let me know, and I’ll be happy to explain it to you in greater detail.

NOTE: The example mapping files are attached above for your use.

## Geneious

Now that you’ve finished generating this alignment, let’s download both the alignment (the .bam file) and the bin contigs fasta file you used to generate the alignment to your local PC; we’re going to boot them up in geneious.

Geneious is a versatile platform for doing genomics analysis in a GUI (Graphical User Interface)-based environment. It’s especially great for viewing alignments between reads and reference fasta files to see coverage patterns and SNP microdiversity, both of which we’re going to do today.

If you haven’t already, please download geneious and activate the free trial, which will last 30 days (plenty of time for you to use it for your projects, if you wish). <https://www.geneious.com/free-trial/>

Before you do anything else, make sure your .bam and .fasta files are both downloaded from the cluster!

Once geneious has been activated, go ahead and boot it up; here’s how to visualize your alignment in geneious.

1. Import your .fasta file first!

Once geneious has started, go up to File -> Import -> From File and select your genome fasta. When you see the pop-up below, say ‘create sequence list’.

![](https://lh7-us.googleusercontent.com/laaCDizry-dtLOMK1QFBuTtZ9FzkEqjYRcdXwjPvOLQSCkjq44lUik8xwXGjI3eaWqPyFkD5W76ygOhKgAt52uvmz47-6bIJUWq4x8sA13oWiS-8XmuLEMjPDYJJfUs5K8TabPscdgV-WSnfozsTYS4)

1. Now import your .bam file.

Go again to File -> Import -> From File and select your .bam file. Make sure that you’ve clicked on your fasta file to select it before you do so!

Once you’ve got it imported, you’ll see a bunch of scaffold/contig names. Click on one, and let’s see the reads that align to that scaffold. It should look something like this, although all coverage patterns will be different:

![](https://lh7-us.googleusercontent.com/oIgwLUhW3IPijye3Um9tOCogPUlYo2NZ2iSme99fE3LiZtM4sMTxmLa8oVDoEuFxPXVf_vFi_etiY4jj4pEPjQieMAiH0-PRoP3vdw3j3ZyQM0SnjsSEGy7KwEIWSlXgvkt2RIcm4NTLwwJDjuHnBBg)

Let’s take a moment to see what’s going on here.

Zoom in (scroll as you would on a browser, or use ctrl + +) and you can see something very different:

![](https://lh7-us.googleusercontent.com/SeJwVAs_QGjaz7IeRNEcUSqRdwWW3koMSqlAVAv7YdDzdYG1EGtdCew3LA0DrS1H6-2A9QAtRzcZ-qZfYOYM7Lu7vrnsMvPcMboRAmcX3XktPqZE8oxXg16N7QzBg7j8ZBUDoPlxvah6JC2KZwYXlM4)

See now how there are highlighted nucleotides? These indicate positions in the reads which disagree with your reference genome. (e.g. Your reference genome says GATTACA and the read might say GATAACA; the fourth nucleotide is different between the two.) These are SNPs; Single Nucleotide Polymorphisms; positions where a single nucleotide mutation has occurred. These differences are best displayed by choosing the read coloring like this:

![](https://lh7-us.googleusercontent.com/nLNZW7iZbeqosb7WlL4Ddggr6EdKizKn6RmN7Qcrg053MPsiNkC3hn_Ym30TZJfpq_Zu1KE92DEH7SB6Lpmp9oFQbLmJbbN46pckM_sk8D656K2Ul0WPE5ftxyy02a5HfPPCvDZLg6G6pmLqRtOAi-U)

<br>

Also note that if you roll your cursor over a given read, that read’s mate pair will be highlighted. Mate pairs that are not overlapping should be displaying as linked by a blue/green line (unless something is anomalous, then yellow or red).

<br>

### Predicting ORFs (make sure to do this before continuing!)

It’s very useful to be able to see where the genes are along a contig when visualizing an alignment. In order to predict genes, we’re going to need to install a plugin for geneious (don’t worry, it’s easy). Go to the top menu, select ‘Tools’, then click ‘Plugins’. Scroll down in the resulting menu until you find ‘Glimmer’, then click ‘install’. Go to the bottom of the menu and press ‘OK’!

![](https://lh7-us.googleusercontent.com/aFWgR9665IW6Z-MvpVVWyxIN_y0HxYANvRj62KJBtDs7flpgQLjylDbHRgyAYpqokXgVYYfIOJsWUJx6Qi_v2si3AzkAwKLbu7tVBaWiyi3DcEtolSz3zLk8FmM5J3LUq3i5A3z1jz4_fB8VaJ_Ax5E)Now, go to the file for your contig, and select ‘Annotate and Predict’, just above where your sequence is displayed. Click the very bottom option- ‘Predict genes with Glimmer’- and press OK.

<br>

Make a more stringent mapping

Have you noticed that there’s a ton of mutations around? We generally don’t want to see all of the low-identity reads, since they may really belong to other assembled contigs rather than the one you’re interested in. So let’s go ahead and change that.

First, we want to make separate geneious files for A. the contig of interest, and B. the reads which align to that contig. We’re only re-aligning the reads which already map to this contig, so it’s much easier computationally. You can do it easily on your laptop in probably around a minute!

* Making separate geneious files

First, make sure you’re zoomed in to the point that you can see the names of individual reads (like the below example)

![](https://lh7-us.googleusercontent.com/UgaofzJYEFt1qwrMV5RXQ1EJg5cYrRvsBfxDQGSWjEKSmKL6LpNO6mAz_B4cn_ccQCq6-K9y4twzm7XCwn_pdnlhaAUYnPNfrEgYcG6MczZhZP2XCPyVNaOKuU8qeJUCCWB3Jgs8CR4iAux8rZNEnNs)

Click on the name of your contig - in my example, it says ‘scaffold\_17’ near the top left corner- and then navigate to the ‘extract’ button on the top right. Click on that and choose a name (the default is fine), then press ‘OK’. In your top menu, where you see a bunch of file names, you should now see a file for that scaffold.

\
Select all the reads (command A) and deselect the ref.  Extract the reads into a read file. Under “Align/ Assemble” chose map to reference. For non-stringent mapping use “medium sensitivity, fast”. For stringent, use custom and set the % SNPs to be allowed (e.g., 1%) in the “maximum mismatches” box.

<br>

![](https://lh7-us.googleusercontent.com/___jXw_EXwTVC2MEe2KB-h84u0TCA_Oy4Q6RJhUx0IQHRlx6jhSo8-37-uqwW3VBpLpq2TMQjgrwqsfuso3bIkjnEv_5v_RDLnqL9ilEAmh-CkbB2udMeXDskkbZtzOKQjVn7hWTBs99JZDV5nTJqSg)

## Turn-in assignment

1. On the scaffold you’ve re-made the alignment for, what is the maximum coverage value for A. the original mapping, and B. the more stringent mapping?
2. You can find this by looking at your coverage graph and zooming in to the peaks. The legend should also go from 0 to the maximum coverage value to make things easier.
3. Take a screenshot of at least three SNPs that A. occur in more than one read, and B. occur in the coding sequence of a gene.

<br>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://prestons-tutorials.gitbook.io/metagenome_assembled_genomics_tutorials/tut-12-sequence-alignment-with-geneious-+-read-mapping.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.