# Tut 2: Read Quality and Trimming

## Powerpoint presentation:

{% file src="/files/woEAfCKTtNZq28cCGo0q" %}

## Tutorial:

Week 2 Walkthrough- Read Quality and Trimming

23 Jan 2024

Hello and welcome to week 2 of ESPM 112L-

Metagenomic Data Analysis Lab!

* Tasks for Today
* 1\. Connecting to the class server
  * SSH
  * Changing the password
  * Samples
* 2\. Getting your reads
  * Running FastQC
  * How to Download Files From The Cluster (IMPORTANT)
  * Trimming with Sickle
* Final step: FastQC on the trimmed reads
* Questions

Tasks for Today

Lab today will hopefully be easy on the technical front, and more about interpreting raw sequence data (what you receive back from a sequencing facility after sending them your extracted DNA).

To get your points for the day, please provide the following in the submission assignment on bCourses:

* The HTML output of FastQC run before and after trimming
* Answers to questions (located at the end of the walkthrough)
* Names of your group members

1\. Connecting to the class server

The class server is a computer we’re using to host your data and perform computational tasks this semester. Each of you will register for an account today, and learn how to access it. Physically, it exists just off campus in the UC Berkeley data center on Hearst; luckily for you, we can access it from anywhere. Let’s go over how.

&#x20;

SSH

SSH stands for ‘Secure Shell’. Remember how the terminal is also called a shell? We’re just going to be connecting to a new terminal session on the class server over the internet. This is one of two most important commands to remember for this class, so write it down somewhere and remember you can always come back here to see it again.

ssh <username@class.ggkbase.berkeley.edu>

Pretty simple! I will be giving out usernames in class, as well as telling you the default password. (Information security 101: Don’t give away passwords willy-nilly.) Once you’ve connected successfully, you’re going to need to change that default password.

Changing the password

The command to change your password is passwd. Type this, and you will see a prompt asking for your current password. Enter it, then enter the new password- then you’re done!

Samples

Each group will be assigned a sample today, and starting this week you’ll work with that sample. These samples are big, though, and so we’re going to have to take a couple precautions to avoid copying them and using up a ton of disk space!

2\. Getting your reads

Make a folder in your home directory (use the mkdir command) called raw\.d. This is how we designate the folders containing our raw sequencing data in the Banfield lab standard workflow. Then cd into that directory.

The read paths are located at \
`/class_data/reads/Cow_8_05.1.fastq.gz and /class_data/reads/Cow_8_05.2.fastq.gz.` These are called the “Forward” and “Reverse” reads and should (almost always) be used together.

Now that you know where the data is located, let’s run our QC program, fastqc! I highly recommend one person per group do this, since it can take quite a while; this way you can get started on the rest of the lab while it’s running.

Running FastQC

FastQC is a program to assess the quality of raw sequencing reads. We’re going to use this before and after trimming your reads to highlight the difference trimming makes.

FastQC is loaded already onto the cluster, so you all have access. Try running the help menu with fastqc -h. The output should look like this:

FastQC - A high throughput sequence QC analysis tool

&#x20;

SYNOPSIS

&#x20;

`fastqc seqfile1 seqfile2 .. seqfileN`

`fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]`

`[-c contaminant file] seqfile1 .. seqfileN`

&#x20;

DESCRIPTION

&#x20;

FastQC reads a set of sequence files and produces from each one a quality

control report consisting of a number of different modules, each one of

which will help to identify a different potential type of problem in your

data.

&#x20;

If no files to process are specified on the command line then the program

will start as an interactive graphical application.  If files are provided

on the command line then the program will run with no user interaction

required.  In this mode it is suitable for inclusion into a standardised

analysis pipeline.

&#x20;

The options for the program as as follows:

&#x20;

```
-h --help       Print this help file and exit
 
-v --version    Print the version of the program and exit
 
-o --outdir     Create all output files in the specified output directory.
```

&#x20;       Please note that this directory must exist as the program

&#x20;       will not create it.  If this option is not set then the

&#x20;       output file for each sequence file is created in the same

&#x20;       directory as the sequence file which was processed.

Read the help menu carefully. FastQC will analyze both your forward and reverse reads at once, so make sure to provide them both!

You have the option of putting the output in a directory (by providing a value for the -o flag), which is tidy, or just having FastQC output everything in your current directory. I recommend making a directory to put the output in, like so:

`mkdir ~/raw.d/fastQC_output`

`fastqc -o ~/raw.d/fastQC_output -t=1 [forward_reads] [reverse_reads]`

When you run fastqc, fastqc -o \~/raw\.d/fastQC\_output -t=1 \[forward\_reads] \[reverse\_reads], you replace \[forward reads] with your path to the forward read, and same for reverse. You do not include the brackets. This took about 10 minutes to run, and you should expect it to take about that long. PLEASE do not use more than one thread (the -t option in fastqc) because we do not have the resources to support everyone doing this simultaneously.

Once this is done, you’re going to have an output .html file (either in your current directory or in the output directory you specified with -o). We’re going to want to download it! This file is an .html file, so you want to open it in a web browser- Firefox, Chrome, Safari, Internet Explorer, Netscape, whatever you want.

&#x20;

### How to Download Files From The Cluster

This is a very important skill, and one that will come up many times throughout the semester. Historically, it’s been a bit tricky, so I want to devote some time to it.

You have two options. The easiest of these by far if you’re on Mac or Windows is to use [Filezilla](https://cyberduck.io/):

GUI-based SCP applications are pretty straightforward. For using filezilla: open the application, under host at the top type in “class.ggkbase.berkeley.edu” without quotations, your username and password where it specifies, and for port type “22” without quotes. When you get into your account, open your raw\.d folder, and then open the fastqc folder. Download your file by double clicking it. Then go into the left pane (local computer) and right click the file and press open. It should open up on your browser.

Another option, that to be frank I don’t really use since filezilla is great, is SCP.

* Download with SCP. Here’s an example of how to do that (I also will give a demonstration in class):
* Get the full path to the file you want to download and copy it! (note, don’t copy the ptasoff line, use realpath to actually get your own path)

&#x20;  realpath example\_fastqc\_output.html

&#x20;  `# OUTPUT: /home/ptasoff/fastQC_output/example_fastqc_output.html`

&#x20;  \# Copy that! ^

* Open a terminal on your computer (NOT the class server- you can leave that one open or close it) and navigate to the folder you want to download to.

```
  cd ~/Downloads/
  scp ptasoff@class.ggkbase.berkeley.edu: /home/ptasoff/fastQC_output/example_fastqc_output.html .
  # In Bash, "." just means "here", i.e. your current location. 
```

### Trimming with bbDuk

Next, we’re going to use a popular trimming program called bbDuk to trim the reads (remember, trimming is important to remove the sequencing adaptors). Try running bbduk.sh -h on the command line for help. Here is also the website to browse for creating your code: <https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/>

Now, we want to trim our adaptors right? Let’s do it! But first, lets go over what a k-mer is briefly: In bioinformatics, a sliding window approach is often used in sequence analysis, where a window of a fixed size 'k' (k-mer) slides along the sequence, one base at a time, to analyze or process the sequence in chunks. This technique is commonly applied in quality control of sequencing data, where the quality of bases within each window is assessed, and if the average quality drops below a certain threshold, the trailing bases are trimmed off. In the example figure, there is a ‘k-mer’ = 4. And you can see how we slide and evaluate each 4 bases (and its quality statistics, not included here).

<img src="file:////Users/ptasoff/Library/Group%20Containers/UBF8T346G9.Office/TemporaryItems/msohtmlclip/clip_image001.png" alt="Bioinformatics 1: K-mer Counting. A challenging yet intriguing… | by  Gunavaran Brihadiswaran | The Startup | Medium" data-size="original">

Image by: Brihadiswaran, 2020 *The Startup*

If you have separate files for forward and reverse reads:

&#x20;

{% code overflow="wrap" %}

```
bbduk.sh in1=/class_data/reads/Cow_8_05.1.fastq.gz in2=/class_data/reads/Cow_8_05.2.fastq.gz out1=~/raw.d/Cow_8_05_trim_clean.PE.1.fastq.gz out2=~/raw.d/Cow_8_05_trim_clean.PE.2.fastq.gz ref=/class_data/reads/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=r trimq=20 threads=1
```

{% endcode %}

Here's a quick explanation of what the command does.

* ktrim=r: Trims adapters from the right (3') end of the reads.
* k=23, mink=11: Uses k-mers of length 23 for matching, with a minimum k-mer length of 11.
* ref=adapters.fa: Specifies the adapter sequence reference file. bbduk comes with a default adapter sequence file that we will use.
* hdist=1: Allows for a maximum Hamming distance of 1 for k-mer matches (i.e., allows for a small amount of mismatch).
* tpe: Trims both reads to the same length if one is shorter post-trimming (for paired-end reads).
* tbo: Trims adapters based on overlap detection between paired-end reads.
* qtrim=r: Specifies the type of quality trimming. r trims the right side (3' end), l would trim the left side (5' end), and rl would trim both sides.
* trimq=20: Sets the quality threshold for trimming. In this example, bases with a Phred quality score below 20 will be trimmed off. You can adjust this value based on your specific quality requirements.
* threads=1: keep this so it uses one computer to run this command. We don’t want to overload the system.

So type this command starting from “bbduk.sh” to “threads=1” into your terminal, press enter. It will look like it froze, but don’t worry, its just running in the background. This step should take about 5 minutes or so. Don’t worry if it goes longer. (Every sample is a different size.) If it gives you an error about “tpe” you can just drop this argument from the line (delete tpe in your code, make sure you fix the spaces though).

Final step: FastQC on the trimmed reads

So now that you’ve got trimmed reads, we want to get FastQC results that show how much our quality scores have improved. Run FastQC again, this time using only the trimmed reads.

I recommend making another directory for the trimmed FastQC output, like so:

```
mkdir ~/raw.d/fastQC_trimmed_output
 
#Get the path to your trimmed reads by typing:
cd /path/to/your/reads
realpath [trimmed_forward_reads]
realpath [trimmed_reverse_reads]
```

Next run:

`fastqc -o fastQC_trimmed_output [trimmed_forward_reads] [trimmed_reverse_reads]`

Remember to replace trimmed forward and reverse reads with the paths to your newly trimed reads. Now, as you did before, download the FastQC output and open it on your computer.

Questions

1. Do you see anything worrying in terms of quality scores in your untrimmed reads?
2. What are the noticeable differences in the FastQC output between your trimmed and untrimmed reads?
3. Briefly describe the pros and cons of short read vs. long read sequencing, and a project that would be appropriate (or inappropriate) to use them for.
4. Based on the average read length and number of reads for one of your samples, answer the following question: What % of the community does a microbe with a 3,000,000 bp genome need to be at in order to be recovered at 10x coverage? You may want to look this one up, about what is coverage and such.

&#x20;

***

Note: chatgpt 4 was used to generate/validate some of the code, and used in conjunction with my own explanations for some explanations.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://prestons-tutorials.gitbook.io/metagenome_assembled_genomics_tutorials/tut-2-read-quality-and-trimming.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.