# Tut 7: dRep and Genome Comparison

## Powerpoint:

{% file src="/files/3gum8IjQSpbizeLgr0ll" %}

## Walkthrough: Dereplication and Genome Comparison

**Feb 27th 2024**

Hello and welcome to week 7 of ESPM 112L: Metagenomic Data Analysis Lab!

* How to run dRep (for your reference)
* Analyzing dRep output
  * Downloading and viewing dRep output
* Using ANI clustering to make a dereplicated bin set
* Comparing bins with at least 99% ANI
  * Important note about file locations and names
  * Downloading Mauve and running it locally
* Today’s turn-in

<br>

This week we’re going to be looking at methods of dereplication of metagenomic bins. We often sequence environments that contain lots of very similar microorganisms. Sometimes we want to find the best representative of a group of genomes, and other times we want to compare strain genomes to see how they differ. dRep is a program that uses genome-wide average nucleotide identity (ANI) to group bins into clusters based on how similar they are. It can also help you see which organisms are present across a series of samples, because the bins of the same organism from different samples will fall into the same ANI cluster.

We will start with the dereplication step, using dRep, a program created by Dr. Matt Olm, a former ESPM 112L GSI and Ph.D. student in the Banfield laboratory.

Today’s lab primarily involves analyzing the output of dRep and interpreting it; this can help give you an idea of which closely-related organisms are present in multiple samples across your dataset.

First, let’s go over how to run dRep. The documentation is here: <https://drep.readthedocs.io/en/latest/overview.html>

dRep works in two stages. First, it uses MASH for DNA-DNA comparisons. MASH is a fast, alignment-free algorithm that gives approximate (i.e. not entirely exact) comparisons between two genome-size chunks of DNA in a reasonable period of time; comparing every pair of genomes exactly would take a very long time. Clusters are formed based on the similarities between genomes, and these “primary” clusters include somewhat related organisms. Next, more stringent comparisons are performed within each of these smaller clusters using a different algorithm, gANI (<https://pubmed.ncbi.nlm.nih.gov/31937678/>). This gives more accurate values, but is only feasible when run on a small number of genomes.
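The two-stage idea can be sketched in a few lines of shell. This is only an illustration, not dRep's own code: `needs_secondary` is a hypothetical helper, the 0.90 cutoff assumes dRep's default primary threshold (the `-pa` flag, as far as I know), and it treats a MASH distance as roughly 1 minus ANI. dRep itself clusters all genomes together rather than deciding pair by pair.

```shell
#!/usr/bin/env bash
# Illustration only: decide whether a genome pair looks similar enough, by its
# approximate MASH-based ANI, to be worth the slower secondary comparison.
# needs_secondary is a hypothetical helper, not part of dRep.
needs_secondary() {
    local mash_distance=$1
    local primary_ani=${2:-0.90}   # assumed default, mirroring dRep's -pa flag
    awk -v d="$mash_distance" -v t="$primary_ani" 'BEGIN {
        approx_ani = 1 - d           # a MASH distance approximates 1 - ANI
        if (approx_ani >= t) print "yes"; else print "no"
    }'
}

needs_secondary 0.05   # roughly 95% ANI
needs_secondary 0.20   # roughly 80% ANI
```

The first call falls above the assumed primary cutoff and the second falls below it, which is the distinction the primary clustering step is making.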

dRep is run on genome FASTA files containing DNA (not protein) sequences. You call the program with `dRep dereplicate`, then give it an output directory name (I call it `dRep_output`) and a folder full of genomes (`-g SPRUCE_SRR5824232/*.fa`). In this case I put all bins for one sample in a single folder, unzipped any zipped bins with pigz (a program that lets you use multiple threads), and ran dRep on that folder. For your dRep run, we will be working with another team’s bins too, just to have more genomes in the pool to dereplicate.

The `-p` flag sets the number of threads to use, so be careful not to use too many. If you’re running it on the cluster, 24 is the maximum number available, and remember not to use them all at once if others are using the cluster too. Ideally use only 1-4 threads (1 if everyone is running).
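The unzipping step might look something like the sketch below. It assumes the zipped bins end in `.fa.gz`; `unzip_bins` is a hypothetical helper name, and it falls back to plain `gunzip` when pigz is not installed.

```shell
#!/usr/bin/env bash
# Decompress every gzipped bin (*.fa.gz) in a folder, preferring pigz.
# unzip_bins is a hypothetical helper name; the .fa.gz suffix is an assumption.
unzip_bins() {
    local bin_dir=$1 threads=${2:-1} f
    for f in "$bin_dir"/*.fa.gz; do
        [ -e "$f" ] || continue          # no gzipped bins: nothing to do
        if command -v pigz >/dev/null 2>&1; then
            pigz -d -p "$threads" "$f"   # -d decompress, -p number of threads
        else
            gunzip "$f"                  # single-threaded fallback
        fi
    done
}

# Example (folder name taken from the text above):
# unzip_bins SPRUCE_SRR5824232 2
```

Bins that are already unzipped are left untouched, so it is safe to run on a folder with a mix of `.fa` and `.fa.gz` files.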

### Here’s the final command:

First make your dRep output folder:

`mkdir ~/dRep_output`

Then run the following, replacing `YOUR_SAMPLE` according to your sample:

`dRep dereplicate ~/dRep_output -g /class_data/Drep/spruce_organisms_contigs/YOUR_SAMPLE/*.fa -p 1`
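If you want to guard against accidentally requesting too many threads, the guidance above can be folded into a small wrapper. This is a sketch under assumptions: `cap_threads` and `drep_command` are hypothetical helpers, the limit of 4 comes from the class guideline, and the command is only printed, not run.

```shell
#!/usr/bin/env bash
# Cap a requested thread count at a polite limit (4 here, per the class guideline).
cap_threads() {
    local requested=$1 limit=${2:-4}
    if [ "$requested" -gt "$limit" ]; then echo "$limit"; else echo "$requested"; fi
}

# Print (without running) the dRep command for one folder of bins.
drep_command() {
    local out_dir=$1 genome_dir=$2 threads
    threads=$(cap_threads "${3:-1}")
    echo "dRep dereplicate $out_dir -g $genome_dir/*.fa -p $threads"
}

# Asking for 8 threads prints a command that uses only 4.
drep_command "$HOME/dRep_output" /class_data/Drep/spruce_organisms_contigs/YOUR_SAMPLE 8
```

Printing the command first lets you check the paths and thread count before copying the line into your terminal.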

***

## Analyzing dRep output

Now you can look at what dRep has generated after comparing all of the bins from our samples. Navigate to your `~/dRep_output` folder and take a look. If you don’t end up with much dRep output, use the example output, which was run on all samples: `/class_data/Drep/dRep_output_all_contigs`. The `dereplicated_genomes/` folder contains the genomes dRep has chosen as representatives, i.e. the best genome for each group. The `figures/` folder has all the pictures you’ll need to look at in the following section. To view any of these pictures, download them to your local computer first.

### Downloading and viewing dRep output

Download the files Primary\_clustering\_dendrogram.pdf and Secondary\_clustering\_dendrograms.pdf from the figures/ folder (for example with FileZilla or Cyberduck) and take a look at them.

The primary clustering dendrogram is a clustering of the bins based on MASH distances. It should look something like the following and have every bin from every sample in a single dendrogram:

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXcmKL9ml3gLa12MlrnLhR8ev-txV9FNX1L9fflQbq3nt9AEoD3dHCXIxKcr0BKp0XVfjPDTET8QX-Vg62HOIvpggVN1KvixhZCYyRulPLJdQNaetfi1yu8zzKseu6tUR5bqpiaqBTmrG905_9z5R-3ok-U?key=ymkGgYu5t7AGje285y3S3g)

The secondary clustering dendrogram is the ANI clustering performed on each of the identified MASH clusters. This file should contain quite a few different dendrograms, each relating to a different MASH cluster and should look something like the following for a single cluster:

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXfMPTKRR9P1OT6J-_D7ySq94mwlBM0SEvSvnkUxsk2QL0feUnuHH8A2in9TRNT6TGmQtt1TRWS_OBPU73XcCOWR5iGl7NLbblQtzfiQO_SJBCDO_p7NWhx1IOxXmtfU5fxaMSz7XAA1IVeCXLdspyjUbnjK?key=ymkGgYu5t7AGje285y3S3g)

We generally consider bins that share 99% or greater ANI to be from very closely related organisms (i.e. same species). In the secondary clustering example above, all of those bins would be considered to be from the same set of closely related organisms. In the example below, there are bins from two different organisms present:

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXfQ3VrLzKMhbtbzI7rZwjN_IfZg3CI-Kvalp8b_KYpz8m46rqg7OIvdQ4dxpBbuxiHgOXnuPUzpbRUGIPqoKzlhHcGzWrI2RRZ8NQoJ6UH9iTavQMl6SWSUWWqKoZmZH7Xgh--SPN-QaLvsKcHSHEWJZsp_?key=ymkGgYu5t7AGje285y3S3g)

If you don’t understand why there are two groups in this clustering, please ask. The important part is being able to read a dendrogram: the vertical line at 99% ANI marks the dividing line between the two groups (on top, the Lentimicrobium group and on bottom, the Bacteroidetes group).
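The same 99% cutoff can also be applied to a table of pairwise ANI values instead of reading it off the dendrogram. The sketch below assumes a CSV with a header row containing columns named `querry` (the spelling I believe dRep uses), `reference`, and `ani`, with ANI stored as a fraction between 0 and 1; to my knowledge dRep writes such a table to `data_tables/Ndb.csv`, but check your own output before relying on it. `pairs_above_ani` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# List genome pairs whose ANI is at or above a threshold (default 0.99).
# Assumes a CSV with a header row containing "querry", "reference" and "ani"
# columns (ani as a fraction between 0 and 1). These column names are assumptions.
pairs_above_ani() {
    local table=$1 threshold=${2:-0.99}
    awk -F, -v t="$threshold" '
        NR == 1 {
            # Find the columns by name so their order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("querry" in col) || !("reference" in col) || !("ani" in col)) {
                print "expected columns not found" > "/dev/stderr"
                exit 1
            }
            q = col["querry"]; r = col["reference"]; a = col["ani"]
            next
        }
        # Skip self-comparisons and keep pairs at or above the threshold.
        $q != $r && $a + 0 >= t { print $q, $r, $a }
    ' "$table"
}

# Example (path is an assumption about where dRep puts the table):
# pairs_above_ani ~/dRep_output/data_tables/Ndb.csv 0.99
```

Each line of output is one pair of bins that would sit on the same side of the 99% line in the secondary dendrogram.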

## Using ANI clustering to make a dereplicated bin set

This secondary clustering file is what we would use to create a dereplicated bin set across samples. We will consider bins that share 99% or greater ANI to be from the same organism type. With that in mind, all we need to do to make our dereplicated bin set is pick one bin from each cluster of bins that share at least 99% ANI to be that cluster’s representative bin.

We want the representative bins to be high quality, so pick the best bin by looking at each in ggKbase and picking the bin with the best single copy gene profile. If there are ties, pick one arbitrarily.
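Once you have judged each bin, the selection itself is mechanical. The sketch below assumes you have typed your assessments into a small CSV with the header `bin,cluster,score`, where `score` is whatever number you used to summarize the single copy gene profile. Both the file format and the `pick_representatives` helper are hypothetical, not something dRep or ggKbase produces.

```shell
#!/usr/bin/env bash
# Pick one representative bin per cluster: the one with the highest score.
# Input: a hand-made CSV with header "bin,cluster,score" (hypothetical format).
# Ties keep the first bin listed, which amounts to an arbitrary choice.
pick_representatives() {
    awk -F, '
        NR == 1 { next }                      # skip the header row
        !($2 in best) || $3 + 0 > best[$2] {  # first bin seen, or a better score
            best[$2] = $3 + 0
            rep[$2] = $1
        }
        END { for (c in rep) print c "," rep[c] }
    ' "$1" | sort
}

# Example:
# pick_representatives my_bin_scores.csv
```

The output has one `cluster,bin` line per cluster, which is the dereplicated bin set described above.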

This dereplicated bin set will be useful for future analyses, but we will not be using it for the rest of this week’s assignment.

## Comparing synteny between bins

Synteny is the shared order of genes among two or more organisms. Essentially, a syntenic block is a group of genes in the same arrangement in multiple organisms. One possible analysis, which we won’t do today, is to use Orthologer, which takes two ordered lists of protein sequences, compares them to each other, and displays genes in the first organism that are reciprocal best BLAST hits in the other.

### Mauve:

Today, we will use Mauve, a program by A. Darling, to align pairs (generally) of reasonably similar genomes. This is a great way to find local differences between genomes (e.g., inserted blocks of genes or sequence differences). For genomes that are from closely related organisms, we expect most of the genomes to align. Mauve is also useful for detecting quite low levels of sequence similarity to establish some level of organism relatedness.

Mauve is a genome alignment tool that aligns two (or more) closely related sequences so you can see how similar they are. You will be using Mauve to determine the degree of similarity between two of your genome bins that fall in the same secondary cluster (Secondary\_clustering\_dendrograms.pdf).

1. Pick two bins within the same secondary cluster, look for their names, and download the corresponding fasta sequences to your local computer using FileZilla (or Cyberduck etc.).
2. Download the Mauve aligner tool: <https://darlinglab.org/mauve/download.html>
3. Once downloaded, open Mauve and click “File”, then “Align with progressiveMauve”.
4. Add your sequences, select them in the menu, and click “Align”.
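If you prefer the command line to the graphical steps above, Mauve also ships a `progressiveMauve` executable. The sketch below only prints the command rather than running it; it assumes `progressiveMauve` is on your PATH, and `mauve_command` and the output naming scheme are hypothetical.

```shell
#!/usr/bin/env bash
# Print (without running) a progressiveMauve command for two genome files.
# mauve_command is a hypothetical helper; the output name is derived from the inputs.
mauve_command() {
    local genome1=$1 genome2=$2 name1 name2
    name1=$(basename "${genome1%.*}")   # strip the extension, then the folder
    name2=$(basename "${genome2%.*}")
    echo "progressiveMauve --output=${name1}_vs_${name2}.xmfa $genome1 $genome2"
}

mauve_command bins/binA.fa bins/binB.fa
```

The resulting `.xmfa` alignment file should be openable in the Mauve viewer, so you can still inspect the aligned blocks visually.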

<br>

Example:

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXekSyTOk1bfO7qdJIcyqgw-GZxCCFhFstNV5srMm0NMnXsQ1XwKzoTZ7f9xP91CQ7YuKfXJPs43u1qbjsWiXjU_EFx6ydHZAJxTawt6KUGQMwjhCKZfnYELYziW9sRlylMFIKPO-i_9gtGgPVHoJdT0wg86?key=ymkGgYu5t7AGje285y3S3g)

## Today’s turn-in

1. Can you find any regions that these two genomes share? Do any of them line up? (Just a summary of what you found.)
2. What is the taxonomy of the two genomes you chose? (Look them up on ggKbase or make a tree using phylogenetically informative proteins.)
3. Can you find any particularly large clusters in the primary clustering dendrogram (Primary\_clustering\_dendrogram.pdf)? Give the name of at least one bin from this cluster and look up its taxonomy on ggKbase.
