Before using this protocol you need to customize paths and urls.
screen
screen -r
exit
cd /usr/seqdata/hrfseq/original_data
get http://example.com/data/reads.tar.bz2
md5sum reads.tar.bz2
tar -jxvf reads.tar.bz2
For further step we have indexes from 1 to 12, FASTQ are distributed between directories named between 01 to 12 and files are gzipped and have various names.
/usr/seqdata/hrfseq/original_data/01/reads-01_S1_L001_R1_001.fastq.gz
mkdir /usr/seqdata/hrfseq/trimmed_adapters
ADAPTER_SEQUENCE=AGATCGGAAGAGCACACGTCT
QUALITY_CUTOFF=17
MINIMAL_LENGTH=40
cd /usr/seqdata/hrfseq/trimmed_adapters
cutadapt --version > cutadapt.version
wait
after the loop.
for i in {01..12}
do
mkdir /usr/seqdata/hrfseq/trimmed_adapters/$i
cd /usr/seqdata/hrfseq/trimmed_adapters/$i
zcat /usr/seqdata/hrfseq/original_data/$i/*.fastq.gz |
nice cutadapt --adapter=$ADAPTER_SEQUENCE --quality-cutoff=$QUALITY_CUTOFF --minimum-length=$MINIMAL_LENGTH - 2> cutadapt.error |
gzip > reads_trimmed.fastq.gz &
done
wait
mkdir /usr/scripts/mie_scripts
cd /usr/scripts/mie_scripts
wget http://people.binf.ku.dk/~jvinther/data/rna_probing/RNAprobBash.tar.gz
tar -zxf RNAprobBash.tar.gz
PATH=$PATH:/usr/scripts/mie_scripts
Check the meaning of different parameters by typing:
preprocessing.sh -h
It is not safe to run multiple preprocessing.sh instances simultaneously, hence do not include '&'. After preprocessing, remove input files and compress output fastq to save disk space.
for i in {01..12}
do
cd /usr/seqdata/hrfseq/trimmed_adapters/$i
preprocessing.sh -b NWTRYSNNNN -t 9 -1 reads_trimmed.fastq.gz
rm reads_trimmed.fastq.gz
gzip ./output_dir/read1.fastq
done
Current version of bash scripts + RNAprobR works only in transcript coordinates, so mapping to genome is not supported. Before mapping one has to select the sequences of interest, to which reads will be mapped. This can be done in multiple ways, e.g.:
/usr/FASTA/top_transcripts.fa
cd /usr/FASTA/
bowtie2-build top_transcripts.fa top_transcripts
Mapping is performed with bowtie2 program, without sending the command to the background (no "&").
Parameters used below mean: -p32 (use 32 cores for mapping), --quiet (print only serious errors to stderr - which is later saved in bowtie2.error file), --norc (do not map to reverse strand)
for i in {01..12}
do
cd /usr/seqdata/hrfseq/trimmed_adapters/$i
nice bowtie2 -p32 --quiet --norc -x /usr/FASTA/top_transcripts -U ./output_dir/read1.fastq.gz 2>bowtie2.error | gzip > tx_mapped.sam.gz
done
bowtie2 --version > bowtie2.version
Count number of unique barcodes connected with reads terminating at each of the transcript nucleotides with the summarize_unique_barcodes.sh script. Check the meaning of different parameters by typing:
summarize_unique_barcodes.sh -h
It SHOULD (no guarantee!) be safe to run 12 jobs in parallel on our servers, although very large files may be an issue*. To call the -k option (produce k2n file) you need RNAprobR to be installed in your R - this calculation can take very long time.
for i in {01..12}
do
cd /usr/seqdata/hrfseq/trimmed_adapters/$i
nice summarize_unique_barcodes.sh -f tx_mapped.sam.gz -b ./output_dir/barcodes.txt -t -k -o summarized &
done
wait
If you intend to work on all 12 probing samples simultaneously it is convenient to read in the data into a list:
cd /usr/seqdata/hrfseq/trimmed_adapters/
R
library(RNAprobR)
dir_names <- formatC(1:12, width=2, flag=0)
#Read in the count information, and convert observed unique barcodes to estimated unique counts:
euc_GRs <- list()
for(index in dir_names){
euc_GRs[[index]] <- readsamples(samples=paste("./",index,"/summarized/unique_barcodes.txt", sep=""), euc="HRF-Seq", k2n_files = paste("./",index,"/summarized/k2n.txt", sep=""))
}
#Compile positional information:
comp_GRs <- list()
for(index in dir_names){
comp_GRs[[index]] <- comp(euc_GR=euc_GRs[[index]], fasta_file="/usr/FASTA/top_transcripts.fa")
}
#Save R objects for future fast loading with the load():
save(euc_GRs, file="euc_GRs.Rsave")
save(comp_GRs, file="comp_GRs.Rsave")
* It can be a good idea to include summarize_unique_barcodes.sh script in the same loop as bowtie2 mapping. In this case, instead of running code in the points 5.3 and 6 run:
for i in {01..12}
do
cd /usr/seqdata/hrfseq/trimmed_adapters/$i
nice bowtie2 -p32 --quiet --norc -x /usr/FASTA/top_transcripts -U ./output_dir/read1.fastq.gz 2>bowtie2.error | gzip > tx_mapped.sam.gz
nice summarize_unique_barcodes.sh -f tx_mapped.sam.gz -b ./output_dir/barcodes.txt -t -k -o summarized &
done
wait