Galaxy Bowtie ChIP-seq exercises

Bowtie and sequence file QC exercise

This exercise is taken from "Galaxy 101" on the public server. For a very useful basic tutorial for Galaxy, see https://usegalaxy.org/u/aun1/p/galaxy101.

We have modified the tutorial instructions to reduce redundancy and to be compatible with the command-line exercise later, so please follow these instructions, not the Galaxy tutorial instructions.

For this exercise we will use a sample ChIP-seq dataset from the murine G1E cell line, generated using an antibody to the transcription factor CTCF. Reads have been reduced to those mapping to chr19 for demonstration use.

To get started, get the data files G1E_CTCF.fastqsanger and G1E_input.fastqsanger from the shared folder. Copy them into a folder in your home directory named chip-data.

On a Mac, open the terminal under Go >> Utilities >> Terminal.

At the prompt, type in:

# Change to your home directory
$ cd
# Make a new directory in your home directory
$ mkdir chip-data
# Copy the original chip data into the new chip-data directory
$ cp /ecg/data/2014/chip/*.fastqsanger ~/chip-data/
Download the files, then upload them from your computer into the ecg2014 Galaxy instance.
Select the CTCF file and (1) set the file format to "fastqsanger", then (2) set the genome to Mouse July 2007 (NCBI37/mm9)(mm9). Select "execute".

When the file has finished uploading, click on the eye icon on the right panel to check the file contents. You should see reads in the "fastq" format.
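
The same check can be made from the terminal on the copy in ~/chip-data. The following is a minimal sketch, assuming the simple four-lines-per-read FASTQ layout; the helper name fastq_summary is made up for illustration. It also reports the read length, which is needed later as the MACS tag size.

```shell
# fastq_summary FILE
# Prints the number of reads and the shortest/longest read length.
# Assumes four lines per record (no wrapped sequence lines).
fastq_summary() {
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             if (n == 1 || len < min) min = len
             if (n == 1 || len > max) max = len
         }
         END { printf "reads=%d min_len=%d max_len=%d\n", n, min, max }' "$1"
}

# Run on the course file if it is present (path from the copy step above).
fastq="$HOME/chip-data/G1E_CTCF.fastqsanger"
if [ -f "$fastq" ]; then
    head -n 4 "$fastq"        # show the first record
    fastq_summary "$fastq"
fi
```

If min_len and max_len agree, every read has the same length.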


Mapping reads to the genome with Bowtie
The Galaxy 101 exercise starts with FastQC to examine quality, but these reads are fine and you've done that already, so we'll go straight to mapping the reads.

Step 1: Map these reads to a reference genome.

Use the "NGS: Mapping > bowtie" tool. You will need to change the reference genome build you are mapping against to "Mus musculus (mm9, (UCSC, full))" and be sure the original input file appears in the fastq file toggle. Otherwise, for this first try, you can leave the default mapping options.

However, you should take a look at the potential parameter settings you can use. Toggle "full parameter list" to have a look. Scroll down below the window for running Bowtie to find a description of these parameters and the output.

Also, click on the "Bowtie on data 1 aligned reads" label in the right side panel, to open up a window with descriptive information.

  1. How many reads did not map to the genome?
  2. What percentage mapped to 1 location?
  3. What percentage mapped to multiple locations?
  4. What was the overall alignment rate?
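
Each of these answers is a read count divided by the total number of reads processed. As a rough sketch of the arithmetic, the helper below (percent, a name made up here) turns two counts into a percentage. The counts shown are invented for illustration only; substitute the numbers reported for your own run.

```shell
# percent PART TOTAL
# Prints PART as a percentage of TOTAL with two decimal places.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "NA"; exit }
        printf "%.2f%%\n", 100 * part / total
    }'
}

# Made-up counts, NOT results from this dataset.
total=1000
unmapped=50
percent "$unmapped" "$total"                 # share of reads that did not map
percent "$((total - unmapped))" "$total"     # overall alignment rate
```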

You cannot view the Bowtie output by clicking the "eye" icon; doing so prompts you to download the file instead, because the output is in the BAM format.

It is not really necessary (or even possible) to read the BAM format directly; it is a binary encoding designed to be fed into other programs, like peak callers.
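
If you are curious about what the file holds, a tool such as samtools can decode BAM into text. The sketch below assumes samtools is installed, which this exercise does not require, and uses a hypothetical name for the downloaded file. It prints a short message and does nothing else if either is missing.

```shell
# peek_bam FILE
# Prints a few header lines, the first alignments, and mapping counts.
# Assumes samtools is on the PATH; it is not part of this exercise.
peek_bam() {
    if [ ! -f "$1" ]; then
        echo "skipping: $1 not found"
        return 0
    fi
    if ! command -v samtools >/dev/null 2>&1; then
        echo "skipping: samtools not found on PATH"
        return 0
    fi
    samtools view -H "$1" | head -n 5    # header (reference names, program)
    samtools view "$1" | head -n 5       # first alignments, decoded to SAM text
    samtools flagstat "$1"               # counts of total and mapped reads
}

# Hypothetical name for the BAM file downloaded from Galaxy.
peek_bam "$HOME/chip-data/G1E_CTCF.bam"
```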


Step 2: Repeat the Bowtie mapping process with the control

Now repeat the Bowtie mapping process with the input chromatin control for this sample, G1E_input.fastqsanger. You will need this for the peak finding exercise.

Follow the same steps as Step 1 on this sample.

** Note: the mapping program BWA is also available in Galaxy and is also very easy to run. It is an alternative to the original Bowtie that is a little less sensitive to mismatches, but otherwise it is not much different from Bowtie. BWA output can be fed directly into MACS, exactly as Bowtie output can.


Calling ChIP-seq peaks with MACS
Step 3: Once our reads are mapped, we will call peaks with the program MACS.
  1. Use the "NGS: Peak Calling > MACS" tool
    (MACS14 is actually a better version to use but the local Galaxy version does not seem to work).
    1. Be sure to select the "CTCF" bowtie mapping file as the ChIP-seq file, and use the "input" bowtie mapping file as your control.
    2. You should change the tag size to the read length you observed in Step 1 (36 bp).
    3. Select "parse .xls files into distinct interval files".
    4. Choose NOT to "save" the wiggle file for now; the wiggle file is a nice display that can be uploaded into the genome browser to check the quality of your data.

    We will repeat this exercise on the command line later and save the wiggle file for upload into UCSC. However, it takes time to run MACS with the wiggle file option, so we won't do it now.

    Otherwise the default values should be reasonable. We will discuss some of the MACS parameters in a future class.

  2. You should retrieve 4 output files for each analysis. Look at the output file "peaks:bed"; this will give you your peak locations (chr, start, stop), a peak number for each, and a score in column 5, which gives you the "Q value" (=-10*LOG10(pvalue)) for each peak. This .bed file can be uploaded and viewed in UCSC, along with the .wig files for your CTCF ChIP and input control files. Download these three files and save them on your computer for later display.

  3. The file "Html report" gives you additional links, including an xls file with peaks that you can download. It also gives you a report on the run with a useful record of the parameters used, how many reads there were in total for each sample, and other information. Click on the "eye" icon to see this file.

  4. Now, click on the MACS_in_Galaxy_peaks.xls link to open the xls file. This report is the most useful one. The file will give you the predicted "summit" of the peak (a prediction of where your protein was bound precisely within the peak, measured from the start of the peak; to get this as a location you need to add the value in column E to the value in column B). You also get the number of sequence "tags" mapped to each peak, a Q-value score, a fold enrichment score, and an FDR value calculated by MACS.

  5. Repeat MACS, this time using the experimental file (CTCF) without a control.

Download the peaks.xls output from this file and compare it to the one you got with the genomic input control. What is different?
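
The two calculations described above (summit location = column B + column E, and score = -10*LOG10(pvalue)) can also be done from the terminal once the files are downloaded. This is a minimal sketch, assuming the peaks.xls report is tab-delimited text with the column order given in the text. The function names and the two file names are hypothetical.

```shell
# score_to_pvalue SCORE
# Inverts score = -10*log10(pvalue), i.e. pvalue = 10^(-score/10).
score_to_pvalue() {
    awk -v s="$1" 'BEGIN { printf "%.3g\n", 10 ^ (-s / 10) }'
}

# summit_positions FILE
# Prints chromosome and absolute summit coordinate for each peak row:
# column 2 (peak start) + column 5 (summit offset from the start).
# Skips comment lines, blank lines, and the header row.
summit_positions() {
    awk -F '\t' '/^#/ || NF < 5 || $2 !~ /^[0-9]+$/ { next }
                 { print $1 "\t" ($2 + $5) }' "$1"
}

# count_peaks FILE
# Number of peak rows in a report.
count_peaks() {
    summit_positions "$1" | wc -l | tr -d ' '
}

# Hypothetical names for the reports from the runs with and without control.
for f in with_control_peaks.xls no_control_peaks.xls; do
    if [ -f "$f" ]; then
        echo "$f: $(count_peaks "$f") peaks"
    fi
done
```

For example, score_to_pvalue 50 prints 1e-05, since 10^(-50/10) = 10^-5. Comparing the peak counts is only a starting point; the two reports should still be opened side by side to answer the question above.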