In this lab, we will be following the protocols outlined in BEDTools: The Swiss-Army Tool for Genome Feature Analysis". The steps labeled below come from this protocol.

The data files required for this project are available on

The goal of this lab is to use BEDTools to identify a set of genome coordinates, and then to download the sequences associated with those coordinates for later use in motif finding. The bedtools program should be in your path. Double check by typing:

which bedtools
If its not there, you need to add /nv/md_rdlib/seqprg/bin to your $PATH, or run bedtools as /nv/md_rdlib/seqprg/bin/bedtools.

This exercise will follow the examples provided in Quinlan's Current Protocols in Bioinformatics (CBPI) Unit on BEDTools, available here (collab). We will be starting with Basic Protocol 2.

Getting started; Intersecting CpG islands and Exons

  1. Download BED files - The BED files used in the CPBI protocols are available at /nv/md_rdlib2/biol4230/bed-tools. Run the script:
    sh ~wrp/biol4230/hwk9/
    To create a ~/biol4230/hwk9 directory and transfer the necessary files to it. After running this script, you should not need to copy or link the files listed below.

  2. Take a look at each of the files with the head command. Are all of the files .bed files? (Basic Protocol 2, steps 1 and 2). How many of the BED fields do they have?

    How many CPG islands are there (wc cpg.bed). How many exons?

  3. (Step 3a): Identify the CpG (file 'A') coordinates that overlap exons (file 'B'):
    bedtools intersect -a cpg.bed -b exons.bed | head -n 5

    Would you expect exons and CPG islands to intersect? Should some types of exons intersect more than others (what kinds of exons are there)? Which ones?

  4. (Step 3b) Ask bedtools to show the labels on the cpg islands and exons using the -wa (write 'a') and -wb (write 'b') options:
    bedtools intersect -a cpg.bed -b exon.bed -wa -wb | head -n 5
    You can also write out the overlap amount with the '-wo' option.
    bedtools intersect -a cpg.bed -b exon.bed -wo | head -n 5
    Look at the overlap amount for the first intersection of exons and CpG islands. Where is the overlap length coming from, the CpG island, the exon, or some combination?

    Save the intersection of cpg.bed and exon.bed to a file, and then count the total number of cpg/exon intersections usig wc on the file. Count the number of intersections with exons0,1,2,3 using grep -c exon_X_ (where X=0,1,2,3) on the file. Are cpg islands enriched in early exons? What assumption does this simple calculation make about the number of exons per gene? How could you normalize for the number of exon_0_'s, exon_1_'s, etc.?

Arithmetic on ChIP-Seq data, transcription start sites
  1. Copy /nv/md_rdlib2/biol4230/bed-tools/wgEncode_ChIP_rev.bedg and wgEncode_ChIP_Sp1.bedg (converted from BigWig according to Basic Protocol 5, steps 1 - 3) to your hwk9 directory. Also copy tss.bed.

  2. Look at the first lines of the tss.bed file. How long is each Transcription Start Site? Look at *_ChIP_*.bedg files. What are the numbers in the last column?

  3. Since we are interested in transcription factor occupancy both upstream and downstream of the TSS, we must extend the single base pair of each TSS by, for example, 1000 base pairs upstream and downstream of the TSS. To do this, we will use the BEDTools slop command to add 1000 base pairs to both (-b) the upstream and downstream regions:
    bedtools slop -b 1000 -i tss.bed -g hg19.chromsizes > tss.plusminus.1kb.bed
    Look at the first lines of the new file to see what happened.

  4. (Step 8) To provide greater resolution to the plot we will produce, we will break up each 2000-bp (TSS ± 1000-bp) interval flanking each TSS into 400, 5-bp “sub-windows.” One can easily do this with the BEDTools makewindows command. Create 5-bp BED “sub-records” across each 2-kb TSS and its flanks.

    (The sort in this pipeline takes a very long time, so do not run this command. Instead, make a link from the sorted /nv/md_rdlib2/biol4230/bed-tools/ to your hwk9 directory.

    ln -s /nv/md_rdlib2/biol4230/bed-tools/ .    # don't forget the '.'
    Take a look at the beginning of the new file. How many fields/columns are present? What are the offsets on the chromosome intervals?

  5. (Steps 9-11): I have already converted the Sp1 bigWig file, and the reversed crosslink control, into two .bedg files in /nv/md_rdlib2/biol4230/bed-tools/wgEncode_ChIP_Sp1.bedg and wgEncode_ChIP_rev.bedg. Copy these files to your directory, and take a look at the beginning of one of them. What is the resolution of the transcription factor binding signal?

  6. (Steps 12, 13): Next, we will use the map tool to summarize both the Sp1 and negative control ChIP-seq intensities. Note that the BigWig file is converted to BEDGRAPH “on the fly” via a Unix FIFO, so that we don’t have to store the data redundantly as both a BigWig and a BEDGRAPH file. Secondly, if there are no overlaps between a given 5-bp window and the BigWig file, we default that window’s value to 0 using the -null option. 12. Map the Sp1 transcription factor to the 5-bp windows flanking TSS loci (these commands differ slightly from the Protocol, because the .bedGraph file has already been created).
    bedtools map -a -b wgEncode_ChIP_Sp1.bedg \
      -c 4 -o mean  -null 0  > sp1.tss.window.coverage.bedg
    Do the same map of the tss file against wgEncode_ChIP_rev.bedg producing rev.tas.window.coverage.bedg.

    The map function calculates summary statistics (mean, max) on the *.bedg intervals that intersect with the .tss file. Take a look at the resulting *.bedg files, and identify each of the fields/columns.

  7. (Step 14). At this point, we have summarized the ChIP-seq signal for both Sp1 and the negative control for each 5-bp interval flanking each transcript’s TSS. Our goal, however, is to summarize the transcription factor occupancy across all TSS in the entire genome. Recall that when creating the 5-bp windows for each TSS, we added a 5th column reflecting which of the 400 distinct 5-bp windows each interval represents. We can now take advantage of this by using the BEDTools groupby tool, which will group input data by a particular column (or columns), and for each distinct group, it will summarize the values observed in other columns for each group. In this case, we want to calculate the sum of all mean ChiP-seq intensity values observed for each of the 400 5-bp windows measured for each TSS. The window number is the 5th column and represents our “grouping” column, and the mean ChiP-seq intensity values are in the 6th column and reflect the column that we want to summarize for each group (i.e., 5-bp window number). In order to group the data efficiently, the groupby tool requires that the input data be pre-sorted by the grouping column.

    Sum the Sp1 ChIP-seq signal observed for each 5-bp window at each TSS start site:

    sort -t$'\t' -k5,5n sp1.tss.window.coverage.bedg \
      | bedtools groupby -i - -g 5 -c 6  -o sum > sp1.tss.window.counts.txt

    (Step 15) Then do the same thing for the rev control file:

    sort -t$'\t' -k5,5n rev.tss.window.coverage.bedg \
      | bedtools groupby -i - -g 5 -c 6  -o sum > rev.tss.window.counts.txt
    At this point, the resulting files will contain the 401 5-bp windows describing summary of all 2000-bp intervals surrounding (and including) every TSS. Since the TSS themselves will always be window number 200, we can compare the ChIP-seq signal at the TSS from both the Sp1 and negative control experiments.

  8. (Step 16, 17) Use the grep -C command to look at the density of binding in the Sp1 and control (rev) files.
    grep -C 10 '^200' *.tss.window.counts.txt

    (Step 18) Transfer the *.tss.window.counts.txt files to your laptop, and print out a graph of binding vs TSS coordinates using 'R' (see the protocol).

Homework - hwk9

Due Monday, 5:00 PM.

Finish mapping Sp1 binding sites to transcription start sites. Produce the following files:

(1,2) The sp1.tss.window.counts.txt, rev.tss.window.counts.txt (in biol4230/hwk9)

(3) A shell script that produces files (1) and (2). (in biol4230/hwk9)

(4) submit on collab, an 'R' plot of the distribution of Sp1 and control binding 1000 nt before and after transcription start sites.

(5) The file of 'R' commands used to produce (in biol4230/hwk9)

Course home page