In this exercise, we will use the command line to complete a bowtie2 -> macs analysis of the same files you analyzed earlier in Galaxy.
To get started, get the data files, G1E_CTCF.fastqsanger and G1E_input.fastqsanger from the shared folder. Copy them into a folder in your home directory named "chip-data"
On the MAC simply open the terminal under Go >> Utilities >> Terminal
# Change to your home directory $ cd # Make a new directory in your home directory $ mkdir chip-data # Copy the original chip data into your home directory $ cp /ecg/data/2015/*.fastqsanger ~/chip-data/Running bowtie
# please read the bowtie manual for more details. You can access it by typing: $ bowtie --help# The generic bowtie command is as follows:
$ bowtie <index> <inputfile> <options> <outputfile># You would replace <index> with the name of your genome index, <inputfile> with the name of your fastq file, <outputfile> with whatever name you wish to give your output file.
# <options> can be replaced by one or more things you want to change about bowtie from its default value. For example, in the case below, we will specific option -N, which tells bowtie to allow 1 mismatch instead of 0 for the alignments. The bowtie help (bowtie --help) gives an exhaustive list of options you can use.
# Bowtie needs a genome index to map the reads to. These are not stored in your directory, but in the /ecg shared directory. You need to tell bowtie where to find the mm9 mouse genome index so it can align your reads. It could be stored in your own directory, but the indices are large and generally, you will not want to store it on your computer. The bowtie website has premade indices for most genomes that you can download and use if you want to have them locally.
# The "path" to find the mouse mm9 genome index is: /ecg/data/2015/genome-mm9
# If you were mapping to human or another species those indices would probably also be in the same location but you'd need to type in the species and the exact name of the index you want to align to.
# You DO NOT do this, but as an FYI the index was built using the following command. Note that we're only using the fasta file for mouse chromosome 19 in order to speed the exercise up.
$ bowtie-build chr19.fa chr19
$ cd ~/chip-data $ bowtie /ecg/data/2015/genome-mm9/chr19 G1E_CTCF.fastqsanger -n 1 ctcf.bowtie_n1# Note that the spaces between the entries are important! This line of code instructs bowtie to map the reads to the mm9 index allowing 1 mismatch per read.
# When you hit "return" it will take a few seconds to get the output—your prompt should return:
# reads processed: 270968 # reads with at least one reported alignment: 266672 (98.41%) # reads that failed to align: 4296 (1.59%) Reported 266672 alignments to 1 output stream(s)# This is important to pay attention to. It is a report that gives you important information about your data. i
# What if only 50% of your reads failed to align? (This happened to a student in the 2014 class...what do you think the problem with her data was?)
# You can regenerate the command you just used by hitting the UP arrow one time. Now, you need to scroll back using the <- arrow key to delete the "-n 1 " before hitting "return". You should also give your output file a different name; call it ctcf.bowtie_default
$ bowtie /ecg/data/2015/genome-mm9/chr19 G1E_CTCF.fastqsanger ctcf.bowtie_defaultThis will default so that 2 errors are tolerated. What is the difference in reads mapped?
# Now, run the same command (without the -n 1 option) on the G1E_input file. Hit the "UP arrow" to recall your last command, scroll back with the <- arrow, delete the input file name from the previous one, and enter G1E_input.fastasanger . Be sure to change the ouput file name (or it will over-write the first file!) Call this one input.bowtie.
$ bowtie /ecg/data/2015/genome-mm9/chr19 G1E_input.fastqsanger input.bowtie
# tip: to fill in the input file name, just type G1E_in and hit the <tab> key. This will fill in the full name of the file with those first letters in your directory.
How many reads were processed in this file? Which what % did not align?
The three bowtie files you just generated (ctcf.bowtie_n1, ctcf.bowtie_default, and input.bowtie) are ready to be run on MACS.
# First identify the command-line options. We will use MACS version 1.4.3. $ macs --help
The basic MACS command is
$ macs -t <input-filename> -c <control-filename> -f <fileformat> -g <genomesize> -n <outputfilename> -w# Look at the MACS help file. What does the term -w instruct macs to do?
For this exercise, you would use the following script (default parameters):
$ macs -t ctcf.bowtie_n1 -c input.bowtie -f BOWTIE -g mm -n ctcf.macs -w# What is -g mm telling MACS?
# After hitting return macs will begin and provide a running log of its progress. Copy the text that is produced by MACS to a text file. It is a good idea to keep a record of these data for future reference if you are running the programs on your own.
What do "band width" and "model fold" mean? How many peaks are detected? How many negative peaks?
# You should get back, essentially, the same files you got from galaxy. The one you want to look at is ctcf.macs_peaks.xls.
You can upload the "treat" and "control" tracks as a custom track in UCSC browser. You can upload directly as .gz, no need to unpack them. The files are found in: ~/chip-data/ ctcf.macs_MACS_wiggle. There's a control/ and treat/ directory. Within each you'll find a wiggle file that has been compressed (ends in .gz).
To load a custom track, go to the mm9 genome page in UCSC, and select "add custom tracks". This will open a new page. In the top panel, upload your .gz files one at a time, selecting "choose file" to get the files. When you've selected the right file just click "submit". It will take a few minutes to upload the file.
When the file is uploaded a new page will open. You can change the name of your track and description by clicking on the link at left. Now, repeat with the control file.
[You can repeat and select as many files as you want as long as they each have unique names]. Finally, select "go to genome browser". [Note that if you wanted to, you could go directly to the Table browser instead and download sequence and other features related to your peaks. Ask me if you are interested].
These reads are only from chr19. To view the track, type "chr19" into the browser window; this will load the entire chromosome. The new track appears as a "dense" band along the top of the browser by default.
To see the tracks you need to click on the grey bar at the side of the track and configure the track to "full", and also set the "Data view scaling" to "use vertical range" setting. Set both tracks to max 30. Then press "submit".
Take a look at the region, chr19:45,550,000-45,750,000. What do you see? What would you conjecture about the peak located in the fourth intron of Dpcd?