Summary
In this lab, we will be following the protocols outlined in BEDTools: The Swiss-Army Tool for Genome Feature Analysis". The steps labeled below come from this protocol.
The data files required for this project are available on interactive.hpc.virginia.edu
The goal of this lab is to use BEDTools to identify a set of genome coordinates, and then to download the sequences associated with those coordinates for later use in motif finding. The bedtools program should be in your path. Double check by typing:
which bedtools
This exercise will follow the examples provided in Quinlan's Current Protocols in Bioinformatics (CBPI) Unit on BEDTools, available here (collab). We will be starting with Basic Protocol 2.
bedtools_file_setup.shTo create a ~/biol4230/hwk9 directory and transfer the necessary files to it. After running this script, you should not need to copy or link the files listed below.
How many CPG islands are there (wc cpg.bed). How many exons?
bedtools intersect -a cpg.bed -b exons.bed | head -n 5
Would you expect exons and CPG islands to intersect? Should some types of exons intersect more than others (what kinds of exons are there)? Which ones?
bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -n 5You can also write out the overlap amount with the '-wo' option.
bedtools intersect -a cpg.bed -b exons.bed -wo | head -n 5Look at the overlap amount for the first intersection of exons and CpG islands. Where is the overlap length coming from, the CpG island, the exon, or some combination?
Save the intersection of cpg.bed and exons.bed to a file, and then count the total number of cpg/exon intersections usig wc on the file. Count the number of intersections with exons0,1,2,3 using grep -c exon_X_ (where X=0,1,2,3) on the file. Are cpg islands enriched in early exons? What assumption does this simple calculation make about the number of exons per gene? How could you normalize for the number of exon_0_'s, exon_1_'s, etc.?
bedtools slop -b 1000 -i tss.bed -g hg19.chromsizes > tss.plusminus.1kb.bedLook at the first lines of the new file to see what happened.
(The sort in this pipeline takes a very long time, so do not run this command. Instead, make a link from the sorted $HPC_SLIB/data/bed-tools/tss.plusminus.1kb.5bp.windows.bed to your hwk9 directory.
ln -s $HPC_SLIB/data/bed-tools/tss.plusminus.1kb.5bp.windows.bed . # don't forget the '.'Take a look at the beginning of the new tss.plusminus.1kb.5bp.windows.bed file. How many fields/columns are present? What are the offsets on the chromosome intervals?
bedtools map -a tss.plusminus.1kb.5bp.windows.bed -b wgEncode_ChIP_Sp1.bedg \ -c 4 -o mean -null 0 > sp1.tss.window.coverage.bedgDo the same map of the tss file against wgEncode_ChIP_rev.bedg producing rev.tas.window.coverage.bedg.
The map function calculates summary statistics (mean, max) on the *.bedg intervals that intersect with the .tss file. Take a look at the resulting *.bedg files, and identify each of the fields/columns.
Sum the Sp1 ChIP-seq signal observed for each 5-bp window at each TSS start site:
sort -t$'\t' -k5,5n sp1.tss.window.coverage.bedg \ | bedtools groupby -i - -g 5 -c 6 -o sum > sp1.tss.window.counts.txt
(Step 15) Then do the same thing for the rev control file:
sort -t$'\t' -k5,5n rev.tss.window.coverage.bedg \ | bedtools groupby -i - -g 5 -c 6 -o sum > rev.tss.window.counts.txtAt this point, the resulting files will contain the 401 5-bp windows describing summary of all 2000-bp intervals surrounding (and including) every TSS. Since the TSS themselves will always be window number 200, we can compare the ChIP-seq signal at the TSS from both the Sp1 and negative control experiments.
grep -C 10 '^200' *.tss.window.counts.txt
(Step 18) Transfer the *.tss.window.counts.txt files to your laptop, and print out a graph of binding vs TSS coordinates using 'R' (see the protocol).
Due Monday, 5:00 PM.
Finish mapping Sp1 binding sites to transcription start sites. Produce the following files:
(1,2) The sp1.tss.window.counts.txt, rev.tss.window.counts.txt (in biol4230/hwk9)
(3) A shell script that produces files (1) and (2). (in biol4230/hwk9)
(4) submit on collab, an 'R' plot of the distribution of Sp1 and control binding 1000 nt before and after transcription start sites.
(5) The file of 'R' commands used to produce (in biol4230/hwk9)