# Tut 11: CRISPR and Phage mining

## Powerpoint:

{% file src="/files/jxWNwtYba7IfTY6KtnPt" %}

## Tutorial

Week 11 Walkthrough- CRISPR and Phage!

### 2 April 2024

Hello and welcome to week 11 of ESPM 112L- Metagenomic Data Analysis Lab!

* Finding Phages
* Jill’s tips for phage prospecting
* Searching for phage with lists
* Looking at the phages that you’ve found
* This week’s turn-in

This week we’re going to be looking into methods of identifying CRISPR systems and arrays, as well as identifying phages and other types of mobile genetic elements in your microbial genomes.

Today we’re going to be looking for CRISPR systems. One way is to create a custom list that searches gene-level annotations for the word ‘CRISPR’. To do this, open the Advanced Search (blue text, top right of the screen) and fill it in as follows:

![](https://lh7-us.googleusercontent.com/pYlmAE4dx1d5sInvNWR7NAPoKjynCz7Rx3t4e7vHUJOKmge1cLjpDiWtH4khYf4f7-WE-FrrJFtt_stEsamcFG7IsiC9bi2tLzaBfeQj9Z0d9NPenbxSwiHlOoEjG9rberSYTaATuKW2WipoBQjTXRw)

Save this as a named list. It will appear under your name in the Genome Summary list menu.

Genome summaries on ggKbase give us a quick way to search our metagenomes for certain genes (e.g., methanogenesis, ribosomal single-copy genes, or CRISPR-Cas systems). As a refresher, go to class.ggKbase.berkeley.edu and click on the “genome summary” tab at the top. Click “Create Genome Summary.”

![](https://lh7-us.googleusercontent.com/guiZ0FaJoklbGYHgyHBP6yrC6bMbTC-eegBzsOLDchYU0l-PrmQqnP338xl4oJKcL6SUeh3lf1z_vAI9jbyOT9EFaScjlsv7_6cSoEj1Qf7BwHd9rLLWjeCE3cv6Y5n3ZVufta1Puy1AWCtgIhUzkfo)![](https://lh7-us.googleusercontent.com/R72b2BBBsVnXhPEKDl0oL0-3VtE99IGMIDN_aBX5WmI45BeEdKi9HNMX-1y6tvMpWea3KPCvyPKRAozafMVVolyxtEY9h_ye7f5EOj-WmBY73lM3DiD5VA_qns48VeN1hu-FuuvG92DXCXcqqVFhIGE)

You are now at your new, blank genome summary. Click “Projects” and select your SPRUCE sample (note: if nothing appears at the end of this part of the tutorial, you might have to select all SPRUCE projects). Hit “Apply” and you will be returned to the main genome summary screen.

Next, hit “Organisms”. You can now choose the organisms that you would like to include in this figure. Again, for the purposes of this assignment, click “Select All Organisms” in the top left of the popup window (this includes UNK, the unbinned sequences), and then “Apply”.

There is a system that uses HMMs to detect genes, but it has not been run on this dataset, so we’ll need to stop it from displaying. Click “Select HMMs” (right-most green button), then at the top left click “Deselect all HMMs”, then “Apply”. Now we can start from scratch.

\
![](https://lh7-us.googleusercontent.com/m1CEW2YKYHpkfCiM4diWcA0x5QzViNUcuKmjGOz69PN58Q8ZWvaCfIhiK_euIuh4iR0X02KTeNPmBu63eLa1FmnOI0lgq0zZpNq1i3YqSd5z89WhLxe6mmWCMKm37lRKhodBuO3G4iAJDJH7ExJ11Tk)

In the same tab, click “Lists”. A screen now pops up showing you all of the “universal lists”, which are populated when the data are imported. They’re grouped first by category, and then by smaller groups within each category. The CRISPR list you created should appear under your name (as it does under Preston’s name in the example).

There is a list for Cas proteins (CRISPR-associated proteins), but I doubt it will work well, so it may be better to create your own custom Cas protein list (using, e.g., Cas9, Cas1, Cas2, Cas3, etc.). Either this or the CRISPR list should direct you to genomic regions in which there are CRISPR arrays.
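If you export or copy a set of gene annotations, the same keyword idea can be sketched offline. The snippet below is only an illustration: the pattern, the function name `find_cas_hits`, and the example annotations are made up for this sketch and are not part of ggKbase. Requiring a number after “cas” avoids matching unrelated words such as “casein”.

```python
import re

# Hypothetical pattern: matches names such as Cas1, Cas9, Cas12a, and the word CRISPR.
CAS_PATTERN = re.compile(r"\b(?:cas\d+[a-z]?|crispr)\b", re.IGNORECASE)


def find_cas_hits(annotations):
    """Return the (gene_id, annotation) pairs that mention a Cas protein or CRISPR."""
    return [(gene_id, text) for gene_id, text in annotations if CAS_PATTERN.search(text)]


# Made-up annotations, for illustration only.
example = [
    ("gene_1", "CRISPR-associated endonuclease Cas9"),
    ("gene_2", "hypothetical protein"),
    ("gene_3", "cas1 integrase"),
    ("gene_4", "casein kinase"),
]

for gene_id, text in find_cas_hits(example):
    print(gene_id, text)
```

A keyword hit is only a pointer; you still need to look at the contig to confirm that the genes sit next to a repeat array.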

Tick the option that displays the “cell count” in each box of the summary (\* if > 100) and find a genome with indications of a CRISPR-Cas system. Click on the box to open a page from which you can view the list contents. Make sure these are real CRISPR systems!

Click the box!

![](https://lh7-us.googleusercontent.com/UDv1fmGqZsF6hm6Sg6aKIH11BMpj0RH29VU_z2QibMWXlWGc78fnV_ea2jIz3J8TGuvrNRGMcQAxmtM1sQ2LHA17hSAu0iOKRU_vjHVwciasXFm6H7xTBL0ncwEFxMOMJO6fFom4mIQQLOs7xTWc_Bc)

For the contig that appears to have the CRISPR locus, click “download” at the top to download the contig sequence(s) to a single fasta file.

![](https://lh7-us.googleusercontent.com/O4EjXgtD-UMO2UMbrdsB9cR-8qaC0zT_zJpBXglM93jkxMtfo2iSl1nmDSzh87hEv3fKExlOGlZsQhsk74q6BV1nPBNrR3xh1FzegZBBQoWdgtDFDwTJEzBdSZKtK9TyoliYdMxWtoJHpvnCLMtzNYM)

<br>

Next, we’re going to use CRISPRCasFinder online to find CRISPR systems in your genome of choice.

[CRISPRCasFinder](https://crisprcas.i2bc.paris-saclay.fr/CrisprCasFinder/Index)

Go ahead and upload your file, as shown below.

![](https://lh7-us.googleusercontent.com/220MdPW7usaiE6KKVhaq78VJDyswNn2g5irS95JePrQGonG4kdtccZUMEoC0cQQZqjt4JGHlsmZgUJPggmolnZMefPVqZUsd9hacNXep30kgJi9xUodnKpkjrlzZlqWri8I7nBh-MNv0v1T4vPGonLo)

You should have a little green box for each potential CRISPR Cas system. Click on it to expand it, and click on the details tab (or just click on the name of the query sequence, which should be a hyperlink).

![](https://lh7-us.googleusercontent.com/_d1jHKHweQCXq1r13Uhh9glFGxRAyGoFTKXKs-XaWJ64fSkJSbNv_qax69SgiB2q_RGJdlxfPRS1QS8vBwhUvZDFd6yw7YDqFcjg66YXmyGi60ilfaaDZLMpHZblf1o8DNXI6JLzHlTIJomuv4Zj0kI)

You can now search these spacers in the CRISPRCasFinder database. Once you’re in the ‘Details’ tab, check all the boxes on the right-hand side (these are your spacers, which form the guide RNA that targets viral invaders) and select “search spacers in database”. Do you see any hits to related organisms? What about organisms that are totally unrelated? See anything unexpected?
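To make the repeat/spacer layout concrete, here is a minimal sketch of how spacers sit between copies of the repeat in an array. CRISPRCasFinder does this for you (and tolerates imperfect repeats); the function `extract_spacers` and the toy sequences below are hypothetical and assume exact repeat copies.

```python
def extract_spacers(array_seq, repeat):
    """Split a CRISPR array on exact copies of the repeat and return the spacers in between."""
    pieces = array_seq.upper().split(repeat.upper())
    # Text before the first repeat and after the last repeat is flanking sequence, not spacer.
    return [piece for piece in pieces[1:-1] if piece]


# Toy array: repeat, spacer 1, repeat, spacer 2, repeat (made-up sequences).
repeat = "GTTTTAGA"
array_seq = repeat + "ACGTACGTAA" + repeat + "TTGACCGGTA" + repeat

print(extract_spacers(array_seq, repeat))
```

Each spacer returned this way is a sequence you could search against phage or plasmid contigs from the same sample.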

***

## Finding Phages

Next, we’re going to look through your data for some bacteriophages! You can use a custom list (use search terms like “phage”, “terminase”, “capsid”, “integrase”, etc.), or you can use the universal Phage/Virus list in the list menu of the genome summary. Make sure to look only in the unknown → UNK bins, which hold the unbinned contigs. You’ll want to aim for contigs > 30 kbp; 50 kbp is standard, and > 100 kbp is also normal. If you find something larger than that (e.g., 300 kbp), tell us!
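If you download a set of unbinned contigs as a fasta file, the size guidance above can be applied with a short script. This is a sketch only: the function names and class labels are invented here, and the cutoffs simply restate the 30 kbp, 100 kbp, and 300 kbp figures from the paragraph above.

```python
def read_fasta_lengths(lines):
    """Return {contig_id: length_bp} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def size_class(length_bp):
    """Label a contig using the size guidance from this tutorial (labels are made up)."""
    if length_bp >= 300_000:
        return "very large (tell us)"
    if length_bp > 100_000:
        return "large"
    if length_bp > 30_000:
        return "candidate"
    return "short"


# Usage with an open file would look like:
#     with open("unbinned_contigs.fa") as handle:   # hypothetical filename
#         lengths = read_fasta_lengths(handle)
demo = read_fasta_lengths([">contig_1 example", "ACGT", "AC", ">contig_2", "GG"])
print(demo)
print(size_class(50_000))
```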

<br>

### Other tips for phage prospecting

Although there are exceptions (mostly for phages of very well studied human microbiome bacteria), most phages do not look like genome fragments from microbes. Some possible indicating features are:

* Dominance by hypothetical proteins (true of essentially all phages)
* The majority of proteins have no best hit recorded (“unknown” under gene name in contig view). Thus, no phylogenetic profile is given under “contig” in the UNK bin listing (i.e., “unknown” Domain in the binning tools).
* Best hits are to other phage proteins. Note that many have annotations that start with ‘GP’, such as Gp5, Gp7, etc. This naming convention exists because they were originally found and incorporated into a database which named them this way.
* Presence of genes encoding phage structural proteins (e.g. terminase, portal, capsid, tape measure protein)
* Smaller phages may have most of their ORFs coded on only one strand. Remember, you can check which strand an ORF is on by looking at the coordinates of the gene on ggKbase (ask Preston for help if you don’t know how to do this). By default a gene is on the forward strand (the contig sequence as written, 5’-3’); if the coordinates say comp(), it’s on the complementary (reverse) strand.
* Generally more ORFs per kbp of DNA sequence (due to viral ORFs being smaller)
* Annotated genes that are not phage/structural genes are often involved in nucleotide metabolism, including DNA polymerase, nuclease, helicase, integrase, etc.
* It is possible, but not necessarily true, that the contig/scaffold will circularize. You can test this by finding out if the same sequence appears at the start and end.
* It is quite possible that you will identify prophages (phage DNA integrated into the host genome). These can be identified if your genome fragment is long enough to transition into well annotated bacterial genome sequence; look for integrases as an indicator that this might be happening.
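Two of the features in this list, strand bias and ORF density, are easy to quantify once you have copied the gene coordinates for a contig. The sketch below assumes coordinates written as `start..end` or `comp(start..end)`; that exact text format is an assumption and may need adjusting to match what you copy from ggKbase. The function name `orf_stats` is also made up.

```python
import re

# Assumed coordinate format: "100..400" (forward) or "comp(500..900)" (complementary strand).
COORD = re.compile(r"^(comp\()?(\d+)\.\.(\d+)\)?$")


def orf_stats(coords, contig_length_bp):
    """Summarize strand bias and ORF density for one contig."""
    forward = reverse = 0
    for coord in coords:
        match = COORD.match(coord.strip())
        if match is None:
            continue  # skip anything not in the assumed format
        if match.group(1):
            reverse += 1
        else:
            forward += 1
    total = forward + reverse
    return {
        "orfs": total,
        "major_strand_fraction": max(forward, reverse) / total if total else 0.0,
        "orfs_per_kbp": total / (contig_length_bp / 1000) if contig_length_bp else 0.0,
    }


# Made-up coordinates on a 2,000 bp toy contig.
stats = orf_stats(["1..300", "350..700", "comp(800..1000)", "1100..1500"], 2000)
print(stats)
```

A `major_strand_fraction` close to 1.0 means nearly all ORFs are on one strand. There is no fixed cutoff for either number; compare against a few bacterial contigs from the same sample.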

To distinguish a plasmid from a phage, look for plasmid genes, e.g. conjugation related (Tra, Trb, Mob), RepA, partition (e.g. parC), and other replication proteins, as well as Type IV secretion systems (necessary for conjugation).
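As a rough first pass, you can tally phage-like and plasmid-like annotations on a contig. The keyword patterns below are assumptions built from the terms mentioned in this tutorial, and the function `marker_tally` is hypothetical. Gene names such as TraG are matched case-sensitively so that unrelated words like “TRAP transporter” are not counted.

```python
import re

# Phage structural terms from this tutorial (case-insensitive).
PHAGE_PATTERN = re.compile(r"phage|terminase|portal|capsid|tape measure", re.IGNORECASE)

# Plasmid terms from this tutorial; gene names (TraG, MobA, RepA, parC) are case-sensitive.
PLASMID_PATTERN = re.compile(
    r"(?i:conjug|type iv secretion)|\b(?:Tra|Trb|Mob)[A-Z]\b|\bRepA\b|\b[Pp]arC\b"
)


def marker_tally(annotations):
    """Count annotations that look phage-like or plasmid-like."""
    phage = sum(1 for text in annotations if PHAGE_PATTERN.search(text))
    plasmid = sum(1 for text in annotations if PLASMID_PATTERN.search(text))
    return {"phage": phage, "plasmid": plasmid}


# Made-up annotations, for illustration only.
tally = marker_tally([
    "terminase large subunit",
    "portal protein",
    "hypothetical protein",
    "conjugal transfer protein TraG",
    "TRAP transporter",
])
print(tally)
```

The counts are only a hint. A contig dominated by hypothetical proteins with one or two structural hits can still be a phage, so check the genes by eye.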

<br>

### Circularized:

If you think you have a contig that’s a phage, download the contig sequence and, in a text editor, check whether the first 12 or so nucleotides exactly match the last 12 or so nucleotides in the fasta file (hint: use the find tool in your text editor to search for those first few nucleotides).
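The same check can be scripted. This is a minimal sketch, assuming you have already read the contig into a single string with the header and line breaks removed. The function name `terminal_overlap`, the 12-nucleotide minimum, and the 500-nucleotide cap are choices made for this example, and only exact same-strand matches are detected.

```python
def terminal_overlap(seq, min_len=12, max_len=500):
    """Return the length of the longest start-of-contig sequence that also ends the contig.

    Returns 0 if no exact overlap of at least min_len nucleotides is found.
    """
    seq = seq.upper()
    # Never test overlaps longer than half the contig, or the two ends would overlap each other.
    longest = min(max_len, len(seq) // 2)
    for k in range(longest, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


# Toy contig whose first 12 nucleotides are repeated at the end (made-up sequence).
end_repeat = "ACGTACGTTGCA"
toy = end_repeat + "GGGCCCTTTAAAGGGCCC" + end_repeat
print(terminal_overlap(toy))
```

A non-zero result suggests the contig may circularize; a result of 0 does not rule out a phage, since many phage contigs are incomplete.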

***

Bin any confidently identified phages or plasmids that are not obviously part of a genome. To do this, click the Rebin button, and input the name of the bin!

![](https://lh7-us.googleusercontent.com/K_08Tfhy9_koq80jFvGC9_2a241ZxnueEuNtyBlUieIp1uOWAxef1eZDb1oj1urtXh6iB-C0x0bKulyRF2S-8le_TTZ0KPV4cFeqqiuCH1lny1st4j07v1g3_y3Ii-QPWU5cKLGEnXdp_jf_UFHVXiM)<br>

Replace GC and COV with the contig’s GC and Coverage values from the contigs screen information. Add your initials and organism group as phage.
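Use the GC and coverage values shown by ggKbase for the bin name. If you want to sanity-check the GC value against the fasta you downloaded, a minimal sketch is below. The function `gc_percent` is hypothetical, and it ignores ambiguous bases such as N, which may or may not match how ggKbase computes its value.

```python
def gc_percent(seq):
    """Return GC content as a percentage of unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


print(gc_percent("GGCCAATT"))
```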

![](https://lh7-us.googleusercontent.com/_BxeCLSvSVL2Wsbj9zqKa95GfhYyJmMOGl5jZB05MAFU_JmE0_fwLWqdQMJslhAIR1qRhxU_u539QlezPPNV1i3ihSCXFjYwYTbBAZkQ0vj5_WUNvQfo98BgPVG8Ky4aGEcOZU_QUzcfqpcNJq396e0)<br>

<br>

### This week’s turn-in

This week, please provide the following:

* The bin name, taxonomy, and scaffold ID where you found a CRISPR locus for at least one genome.
* At least three scaffolds you think are phages/viruses.
* Report a few clear phage genes (test for domains using HMMER or blastp domain reporting).
* Click on your UNK “organism” within your sample. Look at the largest unbinned contig in your sample (so the largest in the UNK bin) with taxonomy profiled as ‘Unknown’. Is it a virus or plasmid? If not, what is it?

<br>

E.g., look at the DNA sequence bp count. Use “sort by # features” if you need to re-sort.

![](https://lh7-us.googleusercontent.com/uotTmC3kHJwJFpLt3Scw4xG1Y_Jf9jtziM_gsAge03yaj67A0mlGqMj8sXw1gvuDCp7gvcczG0mQSAaGpc9s_4A6YL1yngKMABHJUHPYK8MhxiVsscGbMo21U0i7-8tu-0d-dgLQiksZu-jGZB-nmQ0)

<br>

