# Tut 8: ORF Prediction and Basic Annotation

## Powerpoint:

{% file src="/files/Iiam7Mg3WE70l3K5Y24K" %}

## Tutorial:

Week 8 Walkthrough: ORF Prediction and Basic Annotation

5 Mar 2024

Hello and welcome to week 8 of ESPM 112L!

Metagenomic Data Analysis Lab!

* Goals for today:
  * Tools to investigate proteins of interest:
* Running Prodigal
* Predicting ORFs with NCBI ORF Finder
  * Selecting a DNA sequence on which to predict proteins
  * Performing prediction on NCBI ORF Finder
  * Verifying proteins with BLASTp
    * Interpreting the graphic summary from Blastp
* Interpro
* Hmmer
* KEGG
* Turn-in for today:

This week’s lab is going to cover both how proteins are predicted in prokaryotes as well as how to go about learning more about interesting proteins you find in metagenomic data. Today we’re going to focus exclusively on proteins you can find in your bins, since those are more interesting (because you know a bit about which organism they came from).

In our lab, we use several popular tools to look at interesting proteins, which each have their own advantages and disadvantages. Let’s talk about them, and what they’re each good at.

<br>

### Goals for today:

* Learn how to use Prodigal
* Predict genes, ORFs using NCBI ORF Finder
* Learn how to use BLASTp to investigate protein sequences
* Learn how to use and interpret results from Interpro and HMMscan
* Start playing with KEGG and investigating the metabolic pathways your proteins are part of

***

### Tools to investigate proteins of interest:

* Interproscan (most thorough)

<https://www.ebi.ac.uk/interpro/search/sequence/>

This option is the best if you have a protein of interest and you want to find out exactly what it is. InterProScan uses a large suite of HMMs (hidden Markov models, probabilistic models that we won’t go over in detail today) to give you a wealth of information about the protein sequence you provide.

* Blastp (alignment-based)

<https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins>

BLASTp draws on the strength of the NCBI’s public sequence database, as well as a great list of structural models that help you see the domain-level features of your protein sequence, which can tell you a lot about its function.

* HMMscan (HMM-based, very fast)

<https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan>

HMMscan allows you to search against a suite of domain-level HMMs, which can tell you a lot about what your protein does and how it functions. Its companion program, phmmer, gives you similar results along with a list of similar sequences from EMBL-EBI’s public database. That approach yields far fewer hits than BLASTp, so I would recommend BLAST over phmmer unless you’re pressed for time. It’s really fast, though, and if you’re doing tons of these searches, as I often am in the course of my research, it can be a real time saver.

***

## Predicting Genes with Prodigal

One of the main goals in genomics, or metagenomics, is to figure out what an organism is doing in the ecosystem. In order to understand genetic functions, we must first figure out where the genes are in the genome! This is where Prodigal comes in: it predicts where the genes are so that we can later assign functions to them. Here’s the background: <https://github.com/hyattpd/Prodigal>

While we would normally run Prodigal on an entire sample, for ease we’ll just do one bin. Here’s the command you’ll run (verbatim)! Looking at the command, can you figure out what you’ll need to do before running it? Also, type prodigal -h to see what each of these flags means.

{% code overflow="wrap" %}

```
prodigal -i /class_data/Drep/dRep_output_all_contigs/dereplicated_genomes/SPRUCE_SRR7028244_metabat2_85.contigs.fa -o ~/prodigal_output/my.genes \
 -a  ~/prodigal_output/my_proteins.faa -d ~/prodigal_output/my_genes.fna -m -p meta
```

{% endcode %}

Prodigal will output just the genes in your genome, each as a separate entry in a FASTA file, in either amino acid format (faa) or nucleotide format (fna). There are uses for both, so you’ll likely want both files.

The -o and -a options in Prodigal output different types of information:

* -o (output file): Outputs a detailed record of each predicted gene, including its location on the genome (start and end positions), the strand it is on (forward + or reverse -), and scores indicating the confidence in the gene prediction. By default this file is written in a GenBank-like format; adding -f gff to the command switches it to the General Feature Format (GFF), a text-based format used for describing genes and other features of DNA, RNA, and protein sequences. The -o option provides a comprehensive overview of the gene predictions, including metadata about each gene.
* -a (protein translation file): Outputs a FASTA file containing the amino acid sequences of the proteins predicted by Prodigal. Each entry in this FASTA file is a protein sequence that corresponds to one of the predicted genes. The header of each entry typically includes an identifier and may include additional information such as the gene's location on the genome. This file is focused solely on the predicted protein sequences, without the detailed gene location or feature information found in the -o output.

In essence, the -o file provides a detailed annotation of the predicted genes, including their genomic context, while the -a file provides just the amino acid sequences of the proteins encoded by those predicted genes.

<br>

-d (nucleotide sequence file): This outputs a FASTA file containing the nucleotide sequences of the predicted genes. Similar to the -a option, each entry corresponds to a predicted gene, but here you get the DNA sequence instead of the amino acid sequence. The header of each entry usually contains an identifier and may include the gene's location on the genome.
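To make the header information concrete, here is a minimal Python sketch of reading the -a or -d FASTA output and pulling the coordinates out of each header. It assumes the header layout Prodigal normally writes (gene ID, start, end, strand, and a semicolon-separated attribute list, separated by " # "). The function names and the two example records are made up for illustration and are not part of Prodigal.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_prodigal_header(header: str) -> Dict[str, object]:
    """Split a Prodigal-style header into ID, coordinates, strand, and attributes."""
    # Assumed layout: <gene_id> # <start> # <end> # <strand> # key=value;key=value
    gene_id, start, end, strand, attr_text = [
        field.strip() for field in header.split(" # ", 4)
    ]
    attributes = dict(
        item.split("=", 1) for item in attr_text.split(";") if "=" in item
    )
    return {
        "gene_id": gene_id,
        "start": int(start),
        "end": int(end),
        "strand": "+" if strand == "1" else "-",
        "length_nt": int(end) - int(start) + 1,
        "attributes": attributes,
    }


# Hypothetical records in the style of my_proteins.faa (not real data).
example_faa = [
    ">contigA_1 # 1 # 33 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.545",
    "MKTAYIAKQR*",
    ">contigA_2 # 50 # 70 # -1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.476",
    "MSLNPE*",
]

for header, protein in read_fasta(example_faa):
    gene = parse_prodigal_header(header)
    # The trailing "*" marks the stop codon, so it is not counted as a residue.
    print(gene["gene_id"], gene["start"], gene["end"], gene["strand"],
          len(protein.rstrip("*")))
```

To run this on your own output, pass an open file handle for my_proteins.faa to read_fasta in place of example_faa.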

<br>

An important step when using ggkbase is to run Prodigal beforehand and upload the Prodigal files to the project before you bin. Remember how all your contigs on ggkbase have genes assigned to them? Well, that’s how!

## Predicting ORFs with NCBI ORF Finder

### Selecting a DNA sequence on which to predict proteins

Now we’ll predict ORFs manually ourselves with ORF Finder. Go over to [class.ggkbase.berkeley.edu](https://jwestrob.github.io/Week_8_Walkthrough/class.ggkbase.berkeley.edu) and log in. Select one of your organisms, and click on it to get a list of the scaffolds in that bin. Select a relatively large scaffold (\~15 kbp) and click on it. A good way to do this is to sort the sequences by ‘# features’ and find a scaffold with more than 10 genes.

![](https://lh7-us.googleusercontent.com/7XGgoyeZc_0TWGRqLl31igSVg2TQabC6vUm0rCW1WKdJXdyK4Bvnei7U0tPtvDxE73-ahZYqG4_5vOsQ5TFsWQTfy5pYppG6_PSAaLGAzueBFMJu_nZ_r4FSGhrKxpfjxT3pLGLwQrnPbqjJs85g96M)

### Performing prediction on NCBI ORF Finder

Click on the link to a contig and download the DNA sequence for that contig. Open the fasta file in a plain text editor (I’d recommend downloading Sublime Text 3 to do this); select all (cmd+a on Mac or ctrl+a on Windows/Linux), and copy the sequence. Go to NCBI ORF finder (<https://www.ncbi.nlm.nih.gov/orffinder>) and paste the sequence into the Query box.

![](https://lh7-us.googleusercontent.com/ZyalbVlPTW28cH8IKNaPsV00OEO1z-LjTOBb21k7h8PxKulUVfaA7VeajFJWAWoW9-oz1_2Z4J0-Qh2BZ6ukXG3WrbQu5sRWgyba8p1wYiEwwUbFYYowgSxec7XChnTGxwA0GO4cdVx31to0AndrfeA)

You will want to use the standard bacterial genetic code, referred to here as “Bacterial, Archaeal and Plant Plastid (11)”. A reasonable minimum ORF length is 150 (note that ORF Finder specifies this cutoff in nucleotides, so 150 nt corresponds to about 50 amino acids), but feel free to try other cutoffs. Hit the submit button to see your potential ORFs.

The results show all of the possible genes in all six reading frames. You can click on a gene in the viewer or in the list to get its sequence. Note: this is ALL of the possibilities across every reading frame, so some of the resulting proteins are likely not real proteins.
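If you are curious what a six-frame ORF scan looks like under the hood, here is a simplified Python sketch. It is not ORF Finder's actual algorithm: it only accepts ATG as a start codon (translation table 11 also permits alternative starts such as GTG and TTG), it skips ORFs that run off the end of the sequence without a stop codon, and all function names are made up for this example.

```python
from typing import List, Tuple

# Amino acid assignments for translation table 11 (identical to the standard
# code), with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
START_CODONS = {"ATG"}  # simplification: alternative starts are ignored
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate DNA codon by codon; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(seq: str, min_nt: int = 150) -> List[Tuple[str, int, int, int, str]]:
    """Return (strand, frame, start, end, protein) for each ATG-to-stop ORF.

    start and end are 0-based, end-exclusive positions on the strand that was
    scanned, and the length cutoff (in nucleotides) includes the stop codon.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # first start codon after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_nt:
                        orfs.append(
                            (strand, frame, start, i + 3, translate(s[start:i]))
                        )
                    start = None
    return orfs


# Tiny made-up sequence: one ORF (ATG AAA AAA AAA TAA) in forward frame 2.
toy_sequence = "CCATGAAAAAAAAATAAGG"
print(find_orfs(toy_sequence, min_nt=12))
```

Because every frame on both strands is scanned independently, overlapping ORFs are all reported, which is why some predictions will not be real genes.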

![](https://lh7-us.googleusercontent.com/ytEaiyu955o6i-KlT5b0M-MmF3vqY35H_GJl5y6UrXAVD9N1-OJhIzCy2UVOYrD1aPcOmR2jZno4AWsBn5nFg3lQchmrqmsNivgRUlj1WJVtFEZunyMoO2LJ2UwW47A4UDqu5hShT_ZBo1gxiLb95Cw)

## Verifying proteins with BLASTp

BLASTp is a piece of software that will take a protein sequence and search a database for close hits. Think of it as a search engine, but for biological sequence data.

Verify that your selected protein is real by clicking on it, like in the image below, scrolling down to the bottom left of the page and selecting “BLAST”. If your results show a bunch of other proteins with high sequence identity and defined function, congratulations! You got a nice protein. Keep working with it. Otherwise, find another one, rinse and repeat. The best candidates will have relatively little overlap with other predicted ORFs. All the standard parameters are just fine, so don’t worry about changing anything once you see the page shown in the image below; just scroll down and click BLAST.

![](https://lh7-us.googleusercontent.com/CvTIUCgdMxJcAJ0Kkgl1nAn2KNK0XvFxIqiNDzwGZp9-03JpdcGNDC5LVZpIy4FryWXS1S8gXC2U0ROuykHLfbSvQs33K-D_gmlUhYnG5gtDNcFfOxBhXbkEu8gwLPPR2qabZUd-mr31rPf0DQTmKNk)

How well do the results cover your query? Look at the colored bars in the top box to visualize this. Do you get results in the description box that agree on what this protein might be?
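As a rough illustration of what "covering your query" means, the Python sketch below computes the percentage of query residues covered by one or more aligned segments, plus a simple percent identity. This is my own simplified reading of those two statistics rather than NCBI's code, and the function names and numbers are made up for the example.

```python
from typing import List, Tuple


def query_coverage(query_length: int, segments: List[Tuple[int, int]]) -> float:
    """Percent of query residues covered by at least one aligned segment.

    segments holds (start, end) query coordinates, 1-based and inclusive.
    Overlapping segments are only counted once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(segments):
        start = max(start, current_end + 1)  # skip residues already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / query_length


def percent_identity(identical_positions: int, alignment_length: int) -> float:
    """Percent of aligned positions where query and hit have the same residue."""
    return 100.0 * identical_positions / alignment_length


# Hypothetical hit: two overlapping aligned segments on a 200-residue query.
print(query_coverage(200, [(1, 100), (51, 150)]))
print(percent_identity(90, 120))
```

A hit with high identity but low coverage may only match one domain of your protein, so it is worth looking at both numbers together.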

### Interpreting the graphic summary from Blastp

One of the major strengths of BLAST is its integration with NCBI’s conserved domain models (the Conserved Domain Database). They’re very detailed and you can get a lot of unique info from them. Click the “Graphic summary” tab on the blastp output to see it, and it should display something like the following:

![](https://lh7-us.googleusercontent.com/ChUFL2tvID8f0aN2x9Kp35IoHljEhT0NAHW9G2DKZPXeiRHnB5KIEg85rs3-mZLp8vQa_eOwhSh30hIZOZSs7lpdgOm8THc71eiMChG2dyAvrItSxehl4vCpoHptG7zOSVSrfd1_xmLBqC4dqORbjGg)

Where it shows the green bars that say, in this case, “RMtype1\_S\_TRD-CR-like\_Superfamily”, you can click to get more detailed information. The protein I was looking at here was part of a type I restriction modification (RMtype1) system; you will see something similar but with different annotations.

***

## Interpro

Now that you have a good ORF that you can trust is real, go ahead and navigate over to InterProScan (<https://www.ebi.ac.uk/interpro/search/sequence/>). Paste its amino acid sequence in as your query and wait for a little while. InterPro takes a bit of time, but the results are really good and trustworthy.

If you have a real protein, you’ll get some cool results from InterPro, which are really interactive and highly detailed. If you have a protein with unknown function or that doesn’t look like any well-characterized proteins, you might not. In that case, just go back to NCBI ORF finder, pick another protein, and repeat this whole process. (If you’ve closed the window with NCBI ORF finder or just don’t like it, you can always get these proteins from class.ggkbase pretty easily too.)

Below is a run down of the kinds of information interpro will display for you:

***

![](https://lh7-us.googleusercontent.com/H7sH5lnxYH7Ark5o8I5R156YHrZjz5F6C7nfVo7X689vjThHyrAZEYaWZfE5U-Esg7dZfMlmsUZ-2Es7h_vCO2z3uZIDcOe0l6Blegn7LRXqpq7tvYMP3m7oJ7NKkhgvRJ8gZ6tJo5fg1sqRgAsB9xA)

Protein family: in InterPro, a protein family is a group of proteins that share a common evolutionary origin, reflected by their related functions and similarities in sequence or structure. (The inclusion of protein structure is one of the differences between the general NCBI search, which only considers sequence homology, and this search against InterPro.)

![](https://lh7-us.googleusercontent.com/pU7P8QnM7vPgnF3d6pfpQ0H3ycnnIw-4pOU6VlDSAd1nyTe-76IMSaB3-921TIrLOLsSppFSJY1j8P_7TSKszRqId7FshpHCUzLteroSNDDK7gaoeIi2XTmhZXEphTX1xGfA9zdO_u0Y-BXCB9yGS1U)

Protein domain: distinct functional and/or structural units in a protein. Usually they are responsible for particular functions or interactions, contributing to the overall role of a protein. Domains may exist in a variety of biological contexts, where similar domains can be found in proteins with different functions.

![](https://lh7-us.googleusercontent.com/2G9ONfvIKMF7GeWTfSLb_uQxgedDOD-HrZ5kh4DYwNvebI6CGfVzwW4u88lrHHgJBXDTq8kTUOtRNfum8YT3pt278EVPAgVy06THsJcS32qADA9csbQQ0mENm5MeINVqD19WJh3DVVGWcBxt-FP4p5Q)

Repeats are typically short amino acid sequences that are repeated within a protein, and may confer binding or structural properties upon it.

![](https://lh7-us.googleusercontent.com/fzUZPOHbxZQVrKqKsbajfX_tZ0IxWsOVlPo6vQuhT0sJnPMOvvRVwy1obrqf-tDour4xObKOe9arkKueEmX3KxQr9aep4-ondDWWPe7c9KhFdQSvZuqglAr6f8TspA3mD8s5gSGq_yawF4TqrN-SsZo)

Sites: groups of amino acids that confer certain characteristics upon a protein, and may be important for its overall function. Sites are usually rather small (only a few amino acids long). Some types of sites in InterPro are active sites (involved in catalytic activity), binding sites (bind molecules or ions), post-translational modification sites (chemically modified after the protein is translated), and conserved sites (found in specific types of proteins, but whose function is unknown).

## HMMer

Let’s continue with this fun! HMMer is a useful tool for predicting gene function. Once you have the genes, you’ll want to know what they do, right?

hmmscan, a part of HMMer, takes a protein sequence (or sequences) as input and compares it against a database of HMM profiles representing protein domains or families. Each HMM profile captures the statistical properties of a sequence alignment for a protein family, allowing hmmscan to identify which part of your query sequence matches a known family/domain and to infer the potential function or structural characteristics of the query sequence based on this match.
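To give a feel for what "capturing the statistical properties of a sequence alignment" means, here is a deliberately simplified Python sketch. It builds per-column log-odds scores from a tiny ungapped alignment and slides them along a query to find the best-matching region. This is a position-specific scoring matrix, not a real profile HMM (there are no insert or delete states, and the background is assumed uniform), and the alignment, query, and function names are all made up.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log2-odds scores against a uniform background."""
    background = 1.0 / len(AMINO)
    total = len(alignment) + pseudocount * len(AMINO)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO
        })
    return profile


def best_match(profile: List[Dict[str, float]], query: str) -> Tuple[float, int]:
    """Return (score, 0-based start) of the best ungapped window in the query."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        # Residues outside the 20 standard amino acids contribute a score of 0.
        score = sum(profile[i].get(query[start + i], 0.0) for i in range(width))
        best = max(best, (score, start))
    return best


# Made-up four-column alignment of a short motif, and a made-up query.
toy_alignment = ["GDSL", "GDSL", "GDSV", "GNSL"]
toy_profile = build_profile(toy_alignment)
score, start = best_match(toy_profile, "MKAGDSLTT")
print(round(score, 2), start)
```

Real profile HMMs additionally model insertions and deletions and report an E-value for each match, which is what lets hmmscan tell you how significant a hit is.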

HMMer also provides a command-line version of hmmscan that can quickly search all of the predicted proteins from your genome (the amino acid FASTA file) and assign functions. But today we will be using the online version on a single gene.

Here’s how:

1. Go to <https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan>
2. Input any amino acid sequence (try the same one you used before), or use the example sequence from the link right above the click box.
3. Select all protein families and keep the other settings as default (see picture).
4. Click Submit. Note: if it takes more than 5 minutes, the website is probably not running right; skip this step and let Preston know.

![](https://lh7-us.googleusercontent.com/e_mKKRf64redDUbo1tYQ02G6gepzK_c3s_SUe23h9ipohyqzLLp-LOdY4WAT9ECzYAlHLczNZOSB-jgON6Rl3ZWmI9gl2g9WuAmRVRZpTtYcerQTq3P-qxpnaMGDVe05E6Hzqo0bje1DEvqVcOOTtEo)

<br>

Review the output, which will indicate the matches found between your sequence(s) and known protein families or domains in the database, including details like E-values (indicating the significance of the match), match coordinates, and potentially the inferred function or structure based on the matched profiles.

<br>

## KEGG

Once you have a relatively interesting protein, the best way to dig further into it and find out what its role is in the cell is to look it up on KEGG. Just google “kegg” + the name of your gene, and you’ll generally see a few results that line up with what you’re looking for. (This will be part of the demo at the beginning of class.)

One of the greatest things about KEGG is the fact that it provides information on publications related to that gene where you can read about the gene’s function, and also (not always, but often) links to “pathways” and “modules” containing that gene. Pathways and modules are groups of genes that work in sequence to perform a given task - fixation of nitrogen, for instance, is a good example of a module; photosynthesis, being much broader and involving many more genes, is classified as a pathway.

Here’s an example of an important gene, nitrogenase, involved in the fixation of atmospheric nitrogen by bacteria: <https://www.genome.jp/dbget-bin/www_bget?K22898>

Notice near the top of the page the KO number (K22898), which is a good way of keeping track of individual genes; the pathway number (ko00910), which contains many other nitrogen metabolism genes; and the module number (M00175), which indicates that this gene is involved in nitrogen fixation, the process of converting atmospheric N2 to ammonia, NH3.
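If you end up collecting a lot of these identifiers, a small Python helper can keep the three kinds apart. The sketch below tells KO, pathway, and module identifiers apart by their prefixes and builds a lookup link using the same URL pattern as the nitrogenase link above. The function names are made up, and applying that URL pattern to pathway and module identifiers is an assumption on my part rather than something confirmed here.

```python
import re

# Identifier shapes as they appear on the KEGG entry page discussed above.
KEGG_PATTERNS = {
    "orthology": re.compile(r"K\d{5}"),         # e.g. K22898
    "pathway": re.compile(r"(?:ko|map)\d{5}"),  # e.g. ko00910
    "module": re.compile(r"M\d{5}"),            # e.g. M00175
}


def classify_kegg_id(identifier: str) -> str:
    """Return 'orthology', 'pathway', 'module', or 'unknown' for an identifier."""
    for kind, pattern in KEGG_PATTERNS.items():
        if pattern.fullmatch(identifier):
            return kind
    return "unknown"


def dbget_url(identifier: str) -> str:
    """Build a lookup link following the URL pattern of the nitrogenase example."""
    return "https://www.genome.jp/dbget-bin/www_bget?" + identifier


for kegg_id in ["K22898", "ko00910", "M00175"]:
    print(kegg_id, classify_kegg_id(kegg_id), dbget_url(kegg_id))
```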

## Turn-in for today:

1. Find a protein that has an informative annotation. (This is subjective: you decide whether it’s informative or not. Do the annotations give you any helpful clues?)
2. Are there highly related blastp hits? If so, do they come from the same type of organism? (The organism taxonomy is in the sequence names.)
3. What is the suggested function for that protein from interproscan? (You can find this at the bottom of the interpro results page.)
4. Which, if any, HMM models hit this protein on HMMscan?
5. Can you find the pathway for that protein on KEGG? If not, choose a different protein.
6. Find the next gene in the pathway, and try to find this on the scaffold where you originally obtained this sequence.

