SUPPLEMENTARY DATA for
"Genome-wide detection and analysis of hippocampus core promoters using Deep CAGE"

CAGE tag sequences

Tag sequences have been deposited in the DNA Data Bank of Japan (DDBJ). Accession numbers for individual tags are AGAAA0000001-AGAAA0552486.

CAGE tag tracks for use with the UCSC browser

All of these data are presented in the paper "Genome-wide detection and analysis of hippocampus core promoters using Deep CAGE" by E Valen et al., a joint collaboration between the Omics Center (Riken Yokohama Institute, Japan), the Bioinformatics Centre (Copenhagen University, Denmark), SISSA (Italy) and Griffith University (Australia). All track coordinates refer to the mm8 assembly. All tracks below can also be downloaded from this directory.

How to use Wig and Bed files

The CAGE tracks can be directly used in the UCSC browser at http://genome.ucsc.edu/, either by

A tutorial on how to use the genome browser can be found on the UCSC browser help page.

Clicking on the links below will load the corresponding track onto the mm8 assembly in the UCSC browser.

WIG tracks (single nucleotide resolution barplots).

CAGE tags from:

BED tracks (blocks, corresponding to clusters of tags)

Summary tracks:

Preferentially expressed promoters (PEPs): Subsets of the tag clusters above that have >30 TPM (tags per million) and where >50% of tags come from a particular tissue (normalized for sample size)
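The PEP criterion above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the dictionary inputs, and the choice to apply the 30 TPM threshold to the summed, normalized expression across tissues are all assumptions made for illustration.

```python
from typing import Dict, Optional


def tags_per_million(count: int, library_size: int) -> float:
    """Normalize a raw tag count by the total number of tags in the library."""
    return count * 1_000_000 / library_size


def preferential_tissue(
    counts: Dict[str, int],
    library_sizes: Dict[str, int],
    min_tpm: float = 30.0,
    min_fraction: float = 0.5,
) -> Optional[str]:
    """Return the tissue a tag cluster is preferentially expressed in, or None.

    Assumption: the TPM threshold applies to the sum of per-tissue TPM values,
    and the tissue fraction is computed on TPM values (that is, after
    normalizing for sample size).
    """
    tpms = {t: tags_per_million(c, library_sizes[t]) for t, c in counts.items()}
    total = sum(tpms.values())
    if total <= min_tpm:
        return None
    best = max(tpms, key=tpms.get)
    if tpms[best] / total > min_fraction:
        return best
    return None


# Hypothetical counts and library sizes, for illustration only.
example_counts = {"hippocampus": 80, "liver": 10}
example_sizes = {"hippocampus": 1_000_000, "liver": 1_000_000}
example_result = preferential_tissue(example_counts, example_sizes)
```

With these hypothetical numbers the cluster has 90 TPM in total, of which about 89% come from hippocampus, so it would count as a hippocampus PEP under the assumptions stated.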

FASTA files for all PEPs and corresponding motif over-representation files

This tar-zipped data directory (which can be opened with any extractor, such as WinZip) contains the following subdirectories:

Here we describe in detail the contents of the directories and files, from least post-processed to most post-processed.

We start with the preferentially expressed promoters (PEPs), also referred to as tag clusters below, for the different tissues (also available as BED files above). The data directory contains the raw tag-cluster positions of these in a tab-delimited format. Column one contains a unique ID; column two is F for the forward strand or R for the reverse strand; column three is the chromosome name; columns four and five contain the start and end of the tag cluster, respectively.
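As a sketch of how the five-column layout just described might be read, the following parser uses only the Python standard library. The class and function names, the example rows, and the handling of blank or "#" lines are illustrative assumptions; the real files may differ in details such as the presence of a header row.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class TagCluster(NamedTuple):
    cluster_id: str  # column 1: unique ID
    strand: str      # column 2: "F" (forward) or "R" (reverse)
    chrom: str       # column 3: chromosome name
    start: int       # column 4: start of the tag cluster
    end: int         # column 5: end of the tag cluster


def parse_tag_clusters(lines: Iterable[str]) -> Iterator[TagCluster]:
    """Yield one TagCluster per tab-delimited row.

    Assumption: blank lines and lines starting with "#" can be skipped.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        cluster_id, strand, chrom, start, end = row[:5]
        if strand not in ("F", "R"):
            raise ValueError(f"unexpected strand {strand!r} for {cluster_id}")
        yield TagCluster(cluster_id, strand, chrom, int(start), int(end))


# Hypothetical rows, for illustration only.
example_text = "TC1\tF\tchr1\t1000\t1050\nTC2\tR\tchr2\t5000\t5020\n"
clusters = list(parse_tag_clusters(io.StringIO(example_text)))
```

For a file on disk, passing an open file handle, for example `parse_tag_clusters(open(path, newline=""))`, should work the same way.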

The seq/tagSequence directory contains FASTA sequences corresponding to the tag clusters described in the raw files. All FASTA headers correspond to the raw unique ID. The seq/fullPromoter directory contains expanded sequences relative to the positions given in the raw data files. Each region is expanded -1000 to +200 relative to the start and end of the tag-cluster sequence. FASTA headers are ID, strand, and the new start and end positions after the expansion. Note that the FASTA files are plain text files, even though they do not have the suffix .txt; any text editor can be made to open them.
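The expansion described above can be sketched as follows. This is one plausible reading, not a confirmed description of how the files were produced: it assumes that "-1000" means 1000 bases upstream and "+200" means 200 bases downstream in the direction of transcription, so the two offsets swap sides on the reverse strand, and that coordinates are clamped at 0. The function name and defaults are hypothetical.

```python
from typing import Tuple


def expand_cluster(
    strand: str,
    start: int,
    end: int,
    upstream: int = 1000,
    downstream: int = 200,
) -> Tuple[int, int]:
    """Return (new_start, new_end) for a tag cluster after expansion.

    Assumptions: start <= end in genome coordinates; on the reverse strand
    ("R") upstream lies at higher coordinates; the result is clamped at 0.
    """
    if strand == "F":
        new_start, new_end = start - upstream, end + downstream
    elif strand == "R":
        new_start, new_end = start - downstream, end + upstream
    else:
        raise ValueError(f"unexpected strand {strand!r}")
    return max(new_start, 0), new_end


# Hypothetical coordinates, for illustration only.
forward_region = expand_cluster("F", 5000, 5050)
reverse_region = expand_cluster("R", 5000, 5050)
```

Comparing the output with the start and end positions in the seq/fullPromoter FASTA headers is a quick way to check whether this strand convention matches the distributed files.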

Finally, the result directory contains all p-values for all JASPAR matrices across all full promoter FASTA files. Each p-value is calculated by a one-tailed binomial test for the greater tail. The first column of each file contains the p-values, sorted in ascending order. The second column contains the JASPAR matrix ID.
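For readers unfamiliar with the test, a one-tailed binomial test for the greater tail computes the probability of seeing at least k successes in n trials when each trial succeeds with probability p. The sketch below implements that tail sum with the standard library. What counts as a trial, a success, and the background probability in the original analysis is not stated here, so the example values are purely hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the binomial probability mass from k up to n.
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: 12 promoters with a motif hit out of 50,
# against an assumed background hit rate of 10%.
example_p_value = binomial_upper_tail(12, 50, 0.10)
```

The direct sum is fine for small n. For large n, a library routine such as `scipy.stats.binomtest(k, n, p, alternative="greater")` is likely to be more robust numerically.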

Images of genes where hippocampus alternative promoter usage is predicted to give differential protein domain content.

The following tar-zipped file contains plots of the 50 genes where hippocampus has a PEP downstream of known protein domain(s). The CTSS track (black) shows the tags on each strand, the PEP track (pink) shows the location of the hippocampus PEPs, and the protein domains are shown in dark green. There is also a track showing all mRNAs from the UCSC browser.