# Tut 10: Metabolism

## Powerpoint

{% file src="/files/EkADls2NtOy95xD0YPrJ" %}

## Files

You’ll need this example METABOLIC output:

{% file src="/files/BnqCzkJlpYav5wrIOO4u" %}

## Tutorial

## Week 10 Walkthrough- Metabolism + ggkbase annotations and lists!

**19 March 2024**

* What lists are for
* Using Custom ggKbase Lists
* Optional: Download sequences for analysis
* Investigating Metabolic Pathways
* Today’s turn-in assignment
  * Picking a pathway/list

This week we’re going to be investigating the metabolic capacities of the organisms in your samples. This is a great chance to get to know your microbes (and viruses) a little better!

The main way we’ll look at this information this week is with ‘genome summaries’, a tool included in the ggKbase platform. Genome summaries let users interrogate the metabolic potential of the bins they’ve created and see what remains in the unbinned fraction.

Go to class.ggKbase.berkeley.edu and click the “Genome Summary” tab at the top. Then click “Create Genome Summary.”


![](https://lh7-us.googleusercontent.com/docsz/AD_4nXeso64ULoc6ikbvUXRrLQPfYkgXsMFochkBlHYiwu5w9JEEAB3SSDid3hb7jRz10_Ec4MHrkTwdAcpFXUoDLVBlwEAz_CSAqP5V-Qob57Fiorl6ppwyom_QAl0xbBvXsVPOOcHhSKQxpyWTO2xG7bC9WN0?key=TgARk9LAk6qu60Hky6KwJQ)![](https://lh7-us.googleusercontent.com/docsz/AD_4nXdIC9O49rpEx1SDjQyg3rO-nynrdesDfd54P1ekpv_KWiMUOwT_PRLhSXwPINa8-DpqsQD4HGEffteZgjE8zDqmCmknttykbwLYeZ3Odd418Xa3JfHLibVpkEFjErj2hroemhYe7y48P5otHGpL7TxlT-ES?key=TgARk9LAk6qu60Hky6KwJQ)

* You are now at your new, blank, genome summary. At this point you are able to restrict genome summaries to certain projects. So, click “Projects” and select your SPRUCE sample (note, if nothing appears at the end of this part of the tutorial, you might have to select for all SPRUCE projects). Hit “Apply” and you will be returned to the main genome summary screen.
* Next, hit “Organisms”. You can now choose the organisms that you would like to include in this figure. Again, for the purposes of this assignment, click Select All Organisms in the top left of the popup window, and then Apply.
* We’ve recently added a new system that uses HMMs to detect genes, but it’s not been run on this dataset, so we’ll need to stop it from displaying. Click “Select HMMs” (right-most green button), then at the top left click “Deselect all HMMs”, then “Apply”. Now we can start from scratch.
* In the same tab, click “Lists”. A screen now pops up showing you all of the “universal lists” which are populated when the data are imported. They’re grouped first by category, and then by smaller groups within each category.

Pick some lists that interest you. Some especially interesting ones to look at are the electron transport chain (complexes I-IV), fermentative metabolism, and important biogeochemical cycles of hydrogen, sulfur, and nitrogen. Some (not all!) of the genes involved in these pathways are shown below:

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXew7LkG_tf30q8BYv-lWIf9CD6WTUIRvYG6u3-GZVbOs34PlhU2dDH0SFA6ylYf19O11M0FnZvv4g7KCED9MqNFh8aBXMGGwyUIG-U97AmQv9X6_sKjsz1GdiAg2mwLM-57W37tAWkoXCIs53XEpeMXM6VV?key=TgARk9LAk6qu60Hky6KwJQ)

This is your genome summary! Each organism you selected is shown as a row, along with the number of proteins in that organism that match the search terms in each list. In the screenshot above, you can see that many organisms in this sample have fermentative genes (pink) and lack many electron transport chain complexes (brown), suggesting that they are probably anaerobes. That’s what we’d expect, since they live inside a cow gut! Let’s hope they look different in your soil sample!

To save this summary you made, simply give it a name by clicking on “Choose name” (by “Untitled”). Now you can leave the page, and get back to this summary by clicking Genome Summary at the top of ggkbase again.

To investigate the proteins in each category, you can click on the numbers within the boxes. This will show the list of features that hit this list. From this list, you can find out which scaffold they’re on, their location, the sequence (for BLAST searches), the annotations, and more!

### What are “lists” for?

Genome summaries and lists are intended to make it easier to visualize metabolic capacities across a large number of organisms and to share these visualizations with other researchers. You can make custom lists, as you’ll learn about shortly, and can create genome summaries for various groups of genomes, making this a wonderful tool not only for extracting insights from your data but also to convey these insights to others.

Once you’ve given the genome summary a name and saved it, you can share the link with other people who have access to ggKbase, and they can then view your summary. Very useful for group projects!

### Using Custom ggKbase Lists

We just used universal lists in our genome summary. Lists work by using key terms. ggKbase searches these terms against all of the annotations of all called proteins in all selected projects. As shown in the List Keywords section, there is some ability to refine your lists to your liking using these keywords.

There is a fantastic help page set up to introduce you to lists on ggKbase and give you some tips and tricks: <http://ggkbase-help.berkeley.edu/analysis/lists/>

To make your own list, click on Lists at the top of the page. Next, click Create a new list on the top right of the page. From here you can fill in all of the details to make your own list. You can give it a name, color, and description. The most important part of the list is the set of terms you choose to include and exclude.

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXd4u3l8q5Tl8pW2kT-wJ9d3vfSvNxpqlcXEZ1VLeIoJ68PhAYlCbAsH-GOV16v6nuH3XjLBi5XXWHFWEG_hJh5lEv3z_YDQilFbkzGAX9SEw2EJE8_lUAgdjdNy-rGC28FMz3vtiIYLXNT-BE8V9iB18fTj?key=TgARk9LAk6qu60Hky6KwJQ)

The three search boxes use boolean logic. The first box returns genes whose annotations match one or more of the keywords (boolean OR). The second search box requires the gene’s annotation to match all of the keywords (boolean AND). The last search box lets you enter undesired keywords, excluding genes that have these keywords in their annotations (boolean NOT).
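To make the three boxes concrete, here is a minimal Python sketch of the OR / AND / NOT logic described above. It is not ggKbase’s actual implementation: the function name, the case-insensitive substring matching, and the example annotations are all assumptions made for illustration.

```python
def matches_list(annotation, any_terms=(), all_terms=(), none_terms=()):
    """Return True if an annotation passes the OR / AND / NOT keyword boxes."""
    text = annotation.lower()
    # Box 1 (OR): if keywords are given, at least one must appear.
    if any_terms and not any(t.lower() in text for t in any_terms):
        return False
    # Box 2 (AND): every keyword must appear.
    if not all(t.lower() in text for t in all_terms):
        return False
    # Box 3 (NOT): no excluded keyword may appear.
    if any(t.lower() in text for t in none_terms):
        return False
    return True


# Made-up annotations, only for illustration.
annotations = [
    "respiratory nitrate reductase alpha subunit",
    "nitrite reductase (NADH) large subunit",
    "periplasmic nitrate reductase NapA",
    "hypothetical protein",
]

hits = [
    a for a in annotations
    if matches_list(a, any_terms=["nitrate reductase"], none_terms=["periplasmic"])
]
print(hits)
```

With these toy annotations, only the respiratory nitrate reductase passes: the nitrite reductase never matches the OR keyword, and the periplasmic enzyme is dropped by the NOT keyword.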

Let’s try, as an example, making a list that just looks at nitrate reductase, a family of related genes that all reduce nitrate (NO₃⁻) to nitrite (NO₂⁻).

* Go to “Lists” at the top of the page, then give your new list a title, a description, and a color.
* Scroll to the bottom, and you’ll see the “List Keywords” menu. Type “Nitrate reductase” and you’ll see a bunch of options pop up; select “nitrate reductase” and then press the “Save list” button.

This list, when used, will search a set of gene annotations for everything containing the words “nitrate reductase”.

Now you’ll see a bunch of information on the next page. On the left-hand side of the page is a box that says “Projects”… next to that, click “Select all”, then on the right of the page click the big blue button that says “Update”. Now you’re looking at all the projects you have access to, and you should see all the nitrate reductase proteins in that set!

### Optional: Download sequences for analysis

This isn’t necessary for today’s lab, but will be useful if you want to analyze groups of sequences for your project at the end of the course.

You can download a FASTA file (either DNA or protein) with all of these results by clicking “Download list” (near the top of the page) and selecting the type of FASTA you’d like to download. Feel free to do that with the nitrate reductase results to test it out.
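If you do download a FASTA file, a quick sanity check is to count the sequences and look at their lengths. The sketch below uses only the Python standard library. The headers and sequences are invented, and the FASTA you download may format its headers differently.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented protein records standing in for a downloaded list.
example = """>scaffold_1_12 nitrate reductase alpha subunit
MKTLLV
AAGH
>scaffold_7_3 nitrate reductase beta subunit
MSQ
""".splitlines()

records = list(read_fasta(example))
lengths = {header.split()[0]: len(seq) for header, seq in records}
print(len(records), lengths)
```

To use it on a real download, pass an open file handle instead of `example`, for instance `read_fasta(open("my_list.faa"))`, where `my_list.faa` is a placeholder file name.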

### Investigating Metabolic Pathways

Let’s now return to your genome summary.

You can pick any metabolism you are interested in for this lab. As an example, click on “Select lists”, then navigate to “Nitrogen cycle” (if you can’t find it, type it into the search bar on the left). In the middle bar, right under the words “Nitrogen cycle”, there’s a blue button that says “select all”. Click that, then click “Apply”.

Now you can see the genes predicted to be involved in nitrogen transformation pathways across all your selected genomes! This includes the nitrate reductase genes we were investigating earlier.

### METABOLIC: Command line tools

At this point we have done a lot of metabolic pathway analysis in ggKbase, but what about other publicly available programs? METABOLIC is a decent program for this purpose. Jill thinks its output is underwhelming, but I think it’s very accessible for those who don’t have the time or technical skills to manually look up each gene in KEGG. Essentially, it’s a push-button way to get metabolic information from metagenomes or genomes. Here is the link to the wiki:

<https://github.com/AnantharamanLab/METABOLIC/wiki>

And the Github:

<https://github.com/AnantharamanLab/METABOLIC>

We won’t run METABOLIC today, but we’ll go through how to run it and work with example output. METABOLIC has two modes: METABOLIC-G for genome-level metabolic analysis, and METABOLIC-C for community-level (metagenome) metabolic analysis. METABOLIC-G outputs some really useful information about your bins, so that is the mode we’ll use. Let’s get to it:

1. You would git clone the METABOLIC program and install it to your server
2. You would run this command, with your own parameters:

FOR YOUR REFERENCE, DON’T RUN!

{% code overflow="wrap" %}

```
perl /shared/software/metabolic/v4/bin/METABOLIC-G.pl -in /groups/banfield/projects/environmental/EastRiver/Vegtype/Preston/fastas/proteins_checkm_filtered -o /groups/banfield/projects/environmental/EastRiver/Vegtype/Preston/METABOLIC/MBG_output -t 16
```

{% endcode %}

* The input can be either a directory of nucleotide FASTA files (fna) for all your bins or a directory of predicted protein files (faa) for all your bins. The command above passes protein files with `-in`; check the METABOLIC wiki for the flag and file extension expected for nucleotide input. It’s best to use the latter (protein files) if you have them, since it saves the time of running Prodigal again. Running Prodigal multiple times is also best avoided because each run may call your genes a bit differently, which makes your results inconsistent from one analysis to the next. The output (`-o`) is a directory you specify for your results, and `-t` is the number of threads/CPUs to use.

3. After some time, you would get an output folder. In this folder, you will see an Excel file similar to the one in the bCourses folder for this week’s lab. Go there and download the example METABOLIC output.
4. Open it in Excel and click on the fourth sheet, called “KEGGModuleStepHit”. You’ll see something like this:

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXe3WW2GZw-qMdOqX_esLacMg8usuKzSCVwxSxmjY7MIy1WmPu3aPFnZU7-_Zr-wSx8PApf27yU8QLMatY8opA6c6NS9bxCPa4R5sr-kygsSHaKgftC384Pksgtx38t2iyRDpkHwCmG0G9wWvsPSggZbXC8i?key=TgARk9LAk6qu60Hky6KwJQ)

Using METABOLIC output with KEGG Mapper:

Each of those KO.id values is a KEGG Orthology (KO) identifier, a gene ID in the KEGG database. We’ve used KEGG quite a bit already, so you should be familiar with this. The Module in the first column refers to KEGG modules, which are essentially groups of genes that work together in a pathway. For example, the genes in the screenshot are all involved in the Glycolysis module in KEGG. On the right, you’ll see a presence/absence table: is each gene present in or absent from that genome? (You can’t see it in the screenshot, but the column names in columns E through G are individual bin names.)
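As a rough illustration of how to read a presence/absence table like this, the Python sketch below lists, for one bin, which KO ids in each module are not marked as present. The column names, bin names, and values are hypothetical toy data, not copied from the example file, so adjust them to whatever your sheet actually contains.

```python
# Toy rows shaped like a presence/absence sheet (hypothetical names and values).
rows = [
    {"Module": "Glycolysis", "KO.id": "K00844", "bin_A": "Present", "bin_B": "Present"},
    {"Module": "Glycolysis", "KO.id": "K01810", "bin_A": "Present", "bin_B": "Absent"},
    {"Module": "Glycolysis", "KO.id": "K00850", "bin_A": "Present", "bin_B": "Absent"},
]


def missing_by_module(rows, bin_name):
    """Map each module to the KO ids that are not marked Present in one bin."""
    missing = {}
    for row in rows:
        # setdefault keeps an (empty) entry even for fully present modules.
        kos = missing.setdefault(row["Module"], [])
        if row[bin_name] != "Present":
            kos.append(row["KO.id"])
    return missing


for bin_name in ("bin_A", "bin_B"):
    for module, kos in missing_by_module(rows, bin_name).items():
        status = "all listed KOs present" if not kos else "missing " + ", ".join(kos)
        print(bin_name, module, status)
```

Keep in mind that KEGG modules often allow alternative KOs for the same step, so requiring every listed KO is stricter than the module definition itself. Treat "missing" here as a prompt to look at the pathway map, not as a final answer.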

### Kegg Mapper

What can we do with this output? Well, to determine whether a genome really does encode a pathway, we need to check that it has all the genes in that pathway. To check for this, open KEGG Mapper: <https://www.kegg.jp/kegg/mapper/color.html>

This program allows us to visualize the KO.id’s from the METABOLIC output on the KEGG pathway map. Column I (bin H1a2\_concoct\_100\_sub) appears to have all the pentose phosphate pathway genes. Let’s see it to believe it.

Take all the KO.id’s in the pentose phosphate pathway that are present in this bin (it should be all of them on the list) and paste them into the KEGG Mapper entry box, one KEGG ID per line, like this:

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXdWWVk-tyF8UavD6zhax7ZffrKIvLS2xvQOjAbCIYUqQEVUqkcDZ-T7pb0jebE0eWTScnqukHX20-q9zPMalMjGugiIy3Ddh-UKzdh8bvew91V_yFALlKEdFSmmtUMr4yb8RWavPNegOxUfkQ5BQyQjG80U?key=TgARk9LAk6qu60Hky6KwJQ)
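If you would rather not copy the IDs by hand, the small Python sketch below builds the one-ID-per-line text for the entry box. The KO ids and present/absent flags are toy values chosen for illustration, and the function name is made up.

```python
def kegg_mapper_lines(ko_hits):
    """Return the KO ids marked present, de-duplicated and sorted, one per line."""
    present = sorted({ko for ko, is_present in ko_hits if is_present})
    return "\n".join(present)


# Toy (KO id, present?) pairs for one bin; a KO can appear in more than one step.
ko_hits = [("K00615", True), ("K01783", True), ("K00616", False), ("K00615", True)]

text = kegg_mapper_lines(ko_hits)
print(text)
```

The printed text can be pasted straight into the entry box. Absent KOs are left out, so they stay uncolored on the map.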

Let’s use pink as the default color for now. Hit EXEC, and we get to see a map that looks a little like this!

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXfrevF8KIWyBEO6M99VdclMMOCBmdlXvuM1bS78X9pkT1QqkdwayPJMw37KmFm8TewYNvEazAdOGzgmXEox4KFPBiRhLTEyD7ixxq3lG1AOPCbBYGa4TxJIRxRB4n4aIqH64XqB7s0FNjonDongTHUn5dW4?key=TgARk9LAk6qu60Hky6KwJQ)

Pink means the gene is present in the organism. This is not quite a full pathway: the enzymes labeled 5.3.1.27 and 4.1.2.43 (EC numbers) in the middle of the map aren’t colored. Maybe the pathway really is incomplete in this bin, or maybe the organism does have it and some genes just didn’t meet METABOLIC’s criteria or didn’t assemble well. Still, it’s cool that we can visualize this!


## Today’s turn-in assignment:

This week each group member should select a genome from your team’s sample and identify at least one pathway or module of interest using ggKbase lists.

What makes a good genome?

* Should be at least 1.5 Mb
* Fewer contigs is better
* Ideally the genomes selected by your group should have different taxonomic classifications from one another (but don’t worry too much about this)
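If your sample has many bins, the first two criteria can be applied as a quick filter. This is a minimal Python sketch with made-up bin names, sizes, and contig counts. Read the real values off ggKbase for your own sample.

```python
# Made-up bins, only to illustrate the size and contig criteria.
genomes = [
    {"name": "bin_1", "size_bp": 2_300_000, "contigs": 45},
    {"name": "bin_2", "size_bp": 900_000, "contigs": 12},
    {"name": "bin_3", "size_bp": 1_700_000, "contigs": 210},
]

MIN_SIZE_BP = 1_500_000  # the 1.5 Mb minimum from the list above

# Keep bins that are large enough, then put the least fragmented first.
candidates = sorted(
    (g for g in genomes if g["size_bp"] >= MIN_SIZE_BP),
    key=lambda g: g["contigs"],
)
print([g["name"] for g in candidates])
```

Here `bin_2` is dropped for being under 1.5 Mb even though it has the fewest contigs, and `bin_1` ranks ahead of `bin_3` because it is less fragmented.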

You don’t each have to pick a different pathway to study, but each group member must pick a different genome of interest.

#### Picking a pathway/list

Examples of pathways to consider looking into:

* Methane metabolism
  * Look for mcrA
* Cellulose degradation
* Aerobic or anaerobic metabolism markers
  * Can they utilize oxygen? How do you know?
* Hydrogen utilization
  * Look for the hydrogenase large subunit

Once you’ve done this, give me the following information for each genome:

![](https://lh7-us.googleusercontent.com/docsz/AD_4nXdqxFeFFCqi4VAOjUaa5EncxPRR-_enZw8xCYPqo_yS9UalahU85O5tckfBClvLWVi8sIRNQsJsFeZpwQROREPTKxDJlQdVFPyJc9ZbG_20ryGa4FF1F1YK841e7BQfzs6iVK5psKste2QqtQLZ635AQ5-2?key=TgARk9LAk6qu60Hky6KwJQ)

Hint: the screenshot above shows what the following look like.

1. What is the ggKbase taxonomic classification for this organism?
2. How many scaffolds does the genome contain, and what is the total size of the genome bin?
3. What pathway (which list) did you investigate? Are there multiple counts, are there single counts, or are there no counts!?

For the example METABOLIC output (in the week 10 files on bCourses):

1. Pick one cycle or pathway (e.g., TCA). Using the example METABOLIC output, go through the same KEGG Mapper steps and tell me whether the pathway is complete in any of the genomes in columns F-H. If not, which genes are missing? (Hint: look at the KEGG pathway map, and consider sending a screenshot of your KEGG map in with the assignment.)

