$HPC_SLIB/biol4230/data/bed-tools/wgEncodeRegTfbsClusteredV3.bed
do the following:
cut -f 1-5 your_grepped_factor.bed | awk '{print $0 "\t+"}' > your_factor_fixed.bed
For each intersection, look at the beginning of your tss_m??p??.bed file and your resulting intersection file to make sure that you are getting the binding sites you expect.
At the end of this process, you should have three separate intersection files, one each for binding sites that are (1) very close to, (2) close to, and (3) far from transcription start sites (TSSs).
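As a quick check on each intersection file, the sketch below prints the first few lines and counts the intervals with distinct coordinates. The three file names in the loop are hypothetical placeholders, not names given in the assignment; substitute your own intersection files. The helper `count_unique` is also just an illustrative name.

```shell
# Count distinct chrom/start/end triples in a bed file.
count_unique() {
    cut -f 1-3 "$1" | sort | uniq | wc -l | tr -d ' '
}

# Hypothetical file names: replace with your three intersection files.
for bed in veryclose_hits.bed close_hits.bed far_hits.bed; do
    [ -f "$bed" ] || continue          # skip names that do not exist
    head -n 5 "$bed"                   # eyeball the first few intervals
    echo "$bed: $(count_unique "$bed") unique intervals"
done
```

Comparing this count with `wc -l` on the same file shows whether duplicate coordinates are present.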
How many of the ChIP-seq sites are within the TSS ranges you used? How many are not? (Be certain to use the uniq command to ensure that each intersected .bed file has unique coordinates.)
Select the 600 intervals with the highest scores:
sort -k 5 -n -r your_bed_file | head -n 600 > your_top600.bed
Also select the 600 intervals with the lowest scores:
sort -k 5 -n -r your_bed_file | tail -n 600 > your_low600.bed
If you have fewer than 1200 bed intervals for your transcription factor, then divide your number of bed intervals in two, and separate the top half and bottom half (don't worry if the number in the top and bottom halves is not identical).
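For the case with fewer than 1200 intervals, one possible way to split a file into a top and bottom half is sketched here. The function name `split_halves` and the output file names in the usage comment are assumptions made for illustration; the sort options are the same ones used in the commands above.

```shell
# Split a bed file into top-scoring and bottom-scoring halves by column 5.
# $1 = input bed file, $2 = output for top half, $3 = output for bottom half
split_halves() {
    n=$(wc -l < "$1")                  # total number of intervals
    half=$(( n / 2 ))                  # size of the top half (rounded down)
    sort -k 5 -n -r "$1" | head -n "$half" > "$2"
    sort -k 5 -n -r "$1" | tail -n "$(( n - half ))" > "$3"
}

# Usage (hypothetical output names):
#   split_halves your_bed_file your_tophalf.bed your_lowhalf.bed
```

With an odd number of intervals the bottom half gets the extra interval, which is fine since the two halves need not be identical in size.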
Run an awk script, as you did in BEDTools2/homework 10, to make certain that you have fewer than 60,000 nt in each of your sample sets. If you do not, then either:
fastaFromBed -fi $HPC_SLIB/data/hg19/hg19.fa -bed your_chip_tf.bed -fo your_chip_tf.fasta
You will need to do this six times, for each of your ChIP/TSS/top-half/bottom-half bed files. This output file does not need to have the fasta headers edited.
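The six runs can be wrapped in a loop. The sketch below reports the total number of nucleotides in each bed file (end minus start, summed over intervals) and then extracts the sequences. The six file names and the helper names `total_nt` and `fasta_name` are hypothetical; the awk one-liner is only one way to do the length check and may differ from the script you wrote for homework 10.

```shell
# Total nucleotides covered by a bed file: sum of (end - start).
total_nt() {
    awk -F'\t' '{ total += $3 - $2 } END { print total + 0 }' "$1"
}

# Output name: replace the .bed suffix with .fasta.
fasta_name() {
    printf '%s\n' "${1%.bed}.fasta"
}

genome="$HPC_SLIB/data/hg19/hg19.fa"

# Hypothetical file names: replace with your own six bed files.
for bed in veryclose_top.bed veryclose_low.bed \
           close_top.bed close_low.bed \
           far_top.bed far_low.bed; do
    [ -f "$bed" ] || { echo "missing: $bed" >&2; continue; }
    echo "$bed: $(total_nt "$bed") nt"
    fastaFromBed -fi "$genome" -bed "$bed" -fo "$(fasta_name "$bed")"
done
```

Printing the nucleotide total just before extraction makes it easy to spot a sample set that is over the 60,000 nt limit.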