From Coordinates to Genes


Creating a gene list starting from peak locations

James Taylor has created a workflow on Galaxy that you can import and share:
http://main.g2.bx.psu.edu/u/james/w/workflow-from-ucsc-genes-and-symbols

This workflow takes a set of genome coordinates in BED format and retrieves a list of gene names for all of the nearby and flanking genes.
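To make the idea concrete, here is a minimal Python sketch of what "nearby and flanking genes" can mean for a single peak. It is not the workflow's actual implementation: the `flank` window of 10,000 bp, the gene IDs, the symbols, and the coordinates are all made up for illustration. A gene is reported if it overlaps the peak after the peak has been extended by `flank` bases on each side, and its ID is then translated to a gene symbol.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int
    end: int
    kg_id: str


def nearby_genes(peak, genes, flank=10_000):
    """Return IDs of genes that overlap the peak extended by `flank` bp per side."""
    chrom, start, end = peak
    lo, hi = max(0, start - flank), end + flank
    return [
        g.kg_id
        for g in genes
        if g.chrom == chrom and g.start < hi and g.end > lo
    ]


# Hypothetical gene table (stands in for the UCSC genes BED file).
genes = [
    Gene("chr1", 11_000, 12_000, "uc_hypothetical_A"),
    Gene("chr1", 30_000, 40_000, "uc_hypothetical_B"),
    Gene("chr2", 12_000, 13_000, "uc_hypothetical_C"),
]

# Hypothetical ID-to-symbol table (stands in for the kgID/geneSymbol file).
symbols = {
    "uc_hypothetical_A": "GENE_A",
    "uc_hypothetical_B": "GENE_B",
    "uc_hypothetical_C": "GENE_C",
}

peak = ("chr1", 15_000, 15_200)  # made-up peak
ids = nearby_genes(peak, genes)
names = sorted({symbols[i] for i in ids if i in symbols})
print(names)
```

With these made-up values only the first gene falls inside the extended window. Widening `flank` pulls in genes that lie further from the peak.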

What you need to upload to use this workflow:

  1. A set of background UCSC genes: upload these directly into Galaxy using the Get Data link to UCSC Main. This will take you to the UCSC Table Browser.
    1. In the Table Browser, select your species and the genome build that matches the BED coordinates of your peaks.
    2. Select
      group: "Genes and Gene Prediction Tracks";
      track: UCSC Genes;
      table: knownGene
    3. Make sure genome is checked under region (should be by default).
    4. Set output format to "BED - browser extensible data" and check "Send output to Galaxy" (this is checked by default if you reached UCSC from Galaxy).
    5. Then click "get output" to send the table to Galaxy.

  2. A file that translates UCSC gene names to standard gene symbols (gene symbols are the input required by most functional analysis programs).
    To do this:
    1. In Galaxy, use Get Data again to go to UCSC Main.
    2. In the Table Browser, repeat sub-steps 2 and 3 of step 1 above, except that for the table you select kgXref, at the bottom of the drop-down menu.
    3. For output format, select "selected fields from primary and related tables" and click "get output".
    4. This time a new page will open; scroll down to the hg19.kgXref fields and select both (1) kgID and (2) geneSymbol, then scroll to the bottom of the page and click Allow selection from ... .
    5. Then scroll back up to the top section and click "done with selections".
    6. A new page will come up; click "Send query to Galaxy".

  3. In Galaxy, import James's workflow and select run to start it.
    1. In the first field, select the table you created in step 2 above.
    2. In the second field, select the table you created in step 1 above.
    3. In the third field, select the BED file containing your data set.
      1. You need to upload this file to Galaxy first.
      2. Remember that a BED file is a tab-separated list of chromosome, start, and stop positions for your peaks, saved as plain text:
        chr1	12344	12544
    4. Click "Run workflow"; if all goes well, you should end up with a simple list of gene names.
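Because a malformed BED file is a common reason for a run to fail, it can help to check your peak file before uploading it. The following is a minimal Python sketch under stated assumptions, not part of the workflow: the function names are hypothetical, only the first three columns are checked, and lines starting with `#`, `track`, or `browser` are skipped as headers.

```python
def check_bed_line(line):
    """Parse one BED line into (chrom, start, end); raise ValueError if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(
            f"expected at least 3 tab-separated fields, got {len(fields)}"
        )
    # int() raises ValueError itself if a coordinate is not a whole number.
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"bad interval: start={start}, end={end}")
    return chrom, start, end


def check_bed_file(path):
    """Return all peaks in the file, or raise ValueError naming the bad line."""
    peaks = []
    with open(path) as handle:
        for number, line in enumerate(handle, 1):
            # Skip blank lines and header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            try:
                peaks.append(check_bed_line(line))
            except ValueError as err:
                raise ValueError(f"{path}, line {number}: {err}") from None
    return peaks


# A correctly tab-separated line parses cleanly (made-up coordinates).
print(check_bed_line("chr1\t100\t200\n"))
```

To check a whole file, call `check_bed_file` with the path to your own BED file. A line separated by spaces instead of tabs is rejected, which is the most frequent problem when a BED file has been edited in a word processor or spreadsheet.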

Exercise:
You can try this using the human hg19 genome build and the TF1-top50.bed file provided for the MEME exercise.