Initiation code datasets
This directory contains information related to the manuscript "A code for transcriptional initiation in mammalian genomes" by Frith MC et al.
Changes
2008-02-05: Added more explanatory comments in paraclu.pl.
2008-02-05: Web address changed to http://people.binf.ku.dk/albin/supplementary_data/tss_code/
Perl scripts
The scripts begin with comments explaining how to use them.
- paraclu.pl: Parametric clustering of data attached to
sequences.
- make_psmm.pl: Make a position-specific Markov model (PSMM) from
sequence data.
- scan_psmm.pl: Scan a position-specific Markov model across
sequences.
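For readers unfamiliar with position-specific Markov models, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a port of make_psmm.pl: the function name train_psmm, the pseudocount of 1, the in-memory model layout and the toy sequences are all hypothetical, and the .psmm file format is not reproduced. A first-order model stores, for every position in a set of aligned sequences, the probability of each base given the base immediately before it.

```python
ALPHABET = "ACGT"

def train_psmm(seqs, pseudocount=1.0):
    """Train a first-order position-specific Markov model (illustrative only).

    seqs: aligned DNA sequences of equal length.
    Returns one dict per position; model[i][prev][base] is the estimated
    P(base at position i | prev at position i-1). Position 0 has no
    predecessor, so its only context key is the empty string.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all training sequences must have the same length")
    seqs = [s.upper() for s in seqs]
    model = []
    for i in range(length):
        contexts = [""] if i == 0 else list(ALPHABET)
        # Start every count at the pseudocount so no probability is zero.
        counts = {c: {b: pseudocount for b in ALPHABET} for c in contexts}
        for s in seqs:
            prev = "" if i == 0 else s[i - 1]
            base = s[i]
            # Skip ambiguous bases such as N (an assumption of this sketch).
            if prev in counts and base in ALPHABET:
                counts[prev][base] += 1
        probs = {}
        for context, row in counts.items():
            total = sum(row.values())
            probs[context] = {b: n / total for b, n in row.items()}
        model.append(probs)
    return model

# Toy training set (made up for illustration, not taken from the .fa files).
model = train_psmm(["ACG", "ACG", "ATG", "CCG"])
print(round(model[1]["A"]["C"], 3))  # P(C at position 1 | A at position 0)
```

Higher-order models condition on the two or three preceding bases instead of one; the layout above would need longer context keys for that.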
TSS clusters
- all.pclu: Clusters in CAGE data from all human samples in FANTOM 3.
- HBY.pclu: Clusters in CAGE data from Hep G2 cells.
- HBM.pclu: Clusters in CAGE data from human fibroblasts.
- HAM.pclu: Clusters in CAGE data from human occipital cortex.
Sequences used to find overrepresented DNA motifs and train PSMMs
PSMMs were derived from the central regions of these sequences,
excluding sequences from chromosome 1.
- peak100_5_100.fa: Sequences around dominant transcription start
sites in all human samples in FANTOM 3.
- HBY_peak100_5_100.fa: Sequences around dominant transcription
start sites in Hep G2 cells (Fig. 2).
- HBM_peak100_5_100.fa: Sequences around dominant transcription
start sites in human fibroblasts (Fig. S1).
- HAM_peak100_5_100.fa: Sequences around dominant transcription
start sites in human occipital cortex (Fig. S2).
- IN_peak100_5_100.fa: Sequences around dominant transcription start
sites in mouse 17.5 day embryo (Fig. S3).
- CBR_peak100_5_100.fa: Sequences around dominant transcription
start sites in mouse liver (Fig. S4).
- BC_peak100_5_100.fa: Sequences around dominant transcription start
sites in mouse cerebellum (Fig. S5).
Test sequences for PSMMs
The PSMMs were tested on the DNA sequences of TSS clusters in hg17
chromosome 1 that are <= 100 bp long and have stability >= 2. Just enough
flanking sequence was added to each cluster to scan it with PSMMs of +-50
regions.
- all_stab2_100_chr1_50.fa: Sequences of TSS clusters from all.pclu.
- HBY_stab2_100_chr1_50.fa: Sequences of TSS clusters from HBY.pclu.
- HBM_stab2_100_chr1_50.fa: Sequences of TSS clusters from HBM.pclu.
- HAM_stab2_100_chr1_50.fa: Sequences of TSS clusters from HAM.pclu.
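The selection criteria above can be sketched as a small filter. This is a rough illustration, not the procedure actually used to build these files. It assumes that stability is the ratio of a cluster's maximum to minimum density, that coordinates are 1-based and inclusive, that the chromosome is named "chr1", and that 50 bp of flank on each side is what "just enough" means for a +-50 model. The Cluster type and both function names are hypothetical, and the .pclu file format is not parsed here.

```python
from typing import NamedTuple

class Cluster(NamedTuple):
    chrom: str
    strand: str
    start: int          # 1-based, inclusive (assumed)
    end: int            # 1-based, inclusive (assumed)
    min_density: float
    max_density: float

def select_test_clusters(clusters, chrom="chr1", max_len=100, min_stability=2.0):
    """Keep clusters on one chromosome that are short and stable enough."""
    selected = []
    for c in clusters:
        if c.min_density <= 0:
            continue  # stability is undefined; skipped in this sketch
        length = c.end - c.start + 1
        stability = c.max_density / c.min_density  # assumed definition
        if c.chrom == chrom and length <= max_len and stability >= min_stability:
            selected.append(c)
    return selected

def with_flanks(cluster, flank=50):
    """Return (start, end) widened by `flank` bp on each side."""
    return max(1, cluster.start - flank), cluster.end + flank

# Made-up clusters for illustration only.
clusters = [
    Cluster("chr1", "+", 1000, 1049, 0.5, 2.0),   # kept: 50 bp, stability 4
    Cluster("chr1", "-", 2000, 2200, 0.5, 2.0),   # dropped: 201 bp long
    Cluster("chr2", "+", 500, 520, 0.5, 2.0),     # dropped: not chr1
    Cluster("chr1", "+", 3000, 3010, 1.0, 1.5),   # dropped: stability 1.5
]
for c in select_test_clusters(clusters):
    print(c.chrom, c.strand, with_flanks(c))
```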
Observed TSS usage in test sequences
Tag counts cover the cluster positions only (excluding flanking sequence).
- all_stab2_100_chr1_cage: Tag count at each bp in
all_stab2_100_chr1_50.fa.
- HBY_stab2_100_chr1_HBYcage: Tag count at each bp in
HBY_stab2_100_chr1_50.fa.
- HBM_stab2_100_chr1_HBMcage: Tag count at each bp in
HBM_stab2_100_chr1_50.fa.
- HAM_stab2_100_chr1_HAMcage: Tag count at each bp in
HAM_stab2_100_chr1_50.fa.
Position-specific Markov models
- peak100_5_1_train1.psmm: 1st order PSMM of +-1 region from
peak100_5_100.fa.
- peak100_5_50_train1.psmm: 1st order PSMM of +-50 region from
peak100_5_100.fa.
- peak100_5_50_train2.psmm: 2nd order PSMM of +-50 region from
peak100_5_100.fa.
- peak100_5_50_train3.psmm: 3rd order PSMM of +-50 region from
peak100_5_100.fa.
- HBY_peak100_5_50_train1.psmm: 1st order PSMM of +-50 region from
HBY_peak100_5_100.fa.
- HBM_peak100_5_50_train1.psmm: 1st order PSMM of +-50 region from
HBM_peak100_5_100.fa.
- HAM_peak100_5_50_train1.psmm: 1st order PSMM of +-50 region from
HAM_peak100_5_100.fa.
- HBY_peak100_5_50_train2.psmm: 2nd order PSMM of +-50 region from
HBY_peak100_5_100.fa.
- HBM_peak100_5_50_train2.psmm: 2nd order PSMM of +-50 region from
HBM_peak100_5_100.fa.
- HAM_peak100_5_50_train2.psmm: 2nd order PSMM of +-50 region from
HAM_peak100_5_100.fa.
PSMM predictions
- peak100_5_1_train1b: PSMM score at each bp in
all_stab2_100_chr1_50.fa, using peak100_5_1_train1.psmm with a uniform
null model.
- peak100_5_50_train1: PSMM score at each bp in
all_stab2_100_chr1_50.fa, using peak100_5_50_train1.psmm.
- peak100_5_50_train2: PSMM score at each bp in
all_stab2_100_chr1_50.fa, using peak100_5_50_train2.psmm.
- peak100_5_50_train3: PSMM score at each bp in
all_stab2_100_chr1_50.fa, using peak100_5_50_train3.psmm.
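As a rough illustration of what a per-bp PSMM score against a uniform null model could look like, the sketch below slides a first-order model along a sequence and sums log2 odds, with each base given probability 0.25 under the null. It is not a port of scan_psmm.pl. The function name scan_psmm, the tss_offset parameter, the in-memory model layout and the toy model are hypothetical, and the null model behind the other prediction files is not stated here, so the uniform choice applies only to this example.

```python
import math

def scan_psmm(model, seq, tss_offset=0, null_prob=0.25):
    """Score every window of `seq` with a first-order PSMM (illustrative only).

    model[i][prev][base] is P(base at model position i | prev at i-1);
    position 0 uses the empty string as its context.
    Returns (position, log2-odds score) pairs, where position is the
    0-based index in `seq` aligned to model position `tss_offset`.
    Assumes `seq` contains only A, C, G and T.
    """
    seq = seq.upper()
    width = len(model)
    scores = []
    for start in range(len(seq) - width + 1):
        total = 0.0
        for i in range(width):
            context = "" if i == 0 else seq[start + i - 1]
            base = seq[start + i]
            total += math.log2(model[i][context][base] / null_prob)
        scores.append((start + tss_offset, total))
    return scores

# Toy two-position model (made up): position 0 is uninformative, and
# position 1 prefers A regardless of the preceding base.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
prefers_a = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
toy_model = [{"": uniform}, {prev: prefers_a for prev in "ACGT"}]

for position, score in scan_psmm(toy_model, "CAGC", tss_offset=1):
    print(position, score)
```

A positive score means the window fits the model better than the uniform null, a negative score means it fits worse.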