Guide for using RNA probing software from Vinther Lab in a command-line environment, prior to RNAprobR

Date of creation:
April 15th 2015
Last modification:
Author:
Lukasz Jan Kielpinski
Decription:
This guide describes how to turn the link provided by sequencing center into RNAprobR compatible dataset in the command line Linux environment, with the focus on the shared computing facilities. It is based on "Reproducible Analysis of Sequencing-Based RNA Structure Probing Data with User-Friendly Tools". Protocol for single read sequencing.

Before using this protocol you need to customize paths and urls.

Requirements:

Protocol:

  1. If you use one of the remote servers it is the safest to work in the "screen" mode. This mode prevents termination of processes and loss of data if your terminal connection fails.
  2. Download data
    1. You should have received the url for the sequencing data. Go to the directory of choice and download the data with "wget".
      • Example:

      cd /usr/seqdata/hrfseq/original_data
      get http://example.com/data/reads.tar.bz2

    2. You will usually also receive md5sum (either present in the downloaded data or in the email). Compare md5sum of your downloaded data to provided sum.
      • Example:

      md5sum reads.tar.bz2

    3. Extract the data if necessary (Aarhus seq center - it is necessary; KU seq center do not bzip the files):
      • Example:

      tar -jxvf reads.tar.bz2

    4. Make sure that your data is distributed in a logical way, e.g. data is split in directories named by index number.

      For further step we have indexes from 1 to 12, FASTQ are distributed between directories named between 01 to 12 and files are gzipped and have various names.

      • Example path:

      /usr/seqdata/hrfseq/original_data/01/reads-01_S1_L001_R1_001.fastq.gz

  3. Trim the adapters
    1. Start by creating parallel directories for processed data:
      • Example:

      mkdir /usr/seqdata/hrfseq/trimmed_adapters

    2. Set cutadapt parameters to bash variables:
      • Example:

      ADAPTER_SEQUENCE=AGATCGGAAGAGCACACGTCT
      QUALITY_CUTOFF=17
      MINIMAL_LENGTH=40

      Note: by using minimal length 40 at the cutadapt step, we ensure minimal length of 21 at the mapping step (10 nt trimmed from reads left side - barcode, 9 nt trimmed from the right side - possible random part of the RT primer)
    3. Save the cutadapt version (for future reference):
      • Example:

      cd /usr/seqdata/hrfseq/trimmed_adapters
      cutadapt --version > cutadapt.version

    4. For running cutadapt on multiple indexes simultaneously use bash 'for' loop (which will be used multiple times in the guide). Inside a loop: create and enter subdirectory for each index (mkdir; cd), decompress and pipe fastq to cutadapt (zcat). After cutadapt, reads are piped on gzip and saved. It is safe to run 12 jobs in parallel on our servers (48 or 64 cores). We do that by placing '&' after the command which sends the process to the background and allows a new one to start. Since we do not want the next step (preprocessing) to start before all reads are processed with cutadapt, include wait after the loop.
      • Example:

      for i in {01..12}
      do

      mkdir /usr/seqdata/hrfseq/trimmed_adapters/$i
      cd /usr/seqdata/hrfseq/trimmed_adapters/$i
      zcat /usr/seqdata/hrfseq/original_data/$i/*.fastq.gz |
      nice cutadapt --adapter=$ADAPTER_SEQUENCE --quality-cutoff=$QUALITY_CUTOFF --minimum-length=$MINIMAL_LENGTH - 2> cutadapt.error |
      gzip > reads_trimmed.fastq.gz &

      done
      wait

  4. Run a preprocessing tool
    1. Download scripts and decompress them in your scripts directory
      • Example:

      mkdir /usr/scripts/mie_scripts
      cd /usr/scripts/mie_scripts
      wget http://people.binf.ku.dk/~jvinther/data/rna_probing/RNAprobBash.tar.gz
      tar -zxf RNAprobBash.tar.gz

    2. Add the scripts location to the path
      • Example:

      PATH=$PATH:/usr/scripts/mie_scripts

    3. Run the preprocessing script.

      Check the meaning of different parameters by typing:

      preprocessing.sh -h

      It is not safe to run multiple preprocessing.sh instances simultaneously, hence do not include '&'. After preprocessing, remove input files and compress output fastq to save disk space.

      • Example:

      for i in {01..12}
      do

      cd /usr/seqdata/hrfseq/trimmed_adapters/$i
      preprocessing.sh -b NWTRYSNNNN -t 9 -1 reads_trimmed.fastq.gz
      rm reads_trimmed.fastq.gz
      gzip ./output_dir/read1.fastq

      done

  5. Mapping
    1. Prepare fasta index

      Current version of bash scripts + RNAprobR works only in transcript coordinates, so mapping to genome is not supported. Before mapping one has to select the sequences of interest, to which reads will be mapped. This can be done in multiple ways, e.g.:

      1. Sequences of a few transcripts of interest, e.g. fasta file with ribosomal sequences or with sequences of in vitro transcribed RNA molecules
      2. Transcript collection, e.g. Rouskin et al. 2014 (DMS-Seq) used the longest isoform of all RefSeq protein coding genes, Spitale et al. 2015 (icSHAPE) mapped to ENSEMBL transcriptome.
      3. Pre-selected transcript collection. Example: use tophat2 to map to genome using ENSEMBL gtf as a guide, do the cufflinks analysis, choose 1000 top expressed, choose the longest isoform of them
      • Example file name:

      /usr/FASTA/top_transcripts.fa

    2. Prepare bowtie2 index.
      • Example:

      cd /usr/FASTA/
      bowtie2-build top_transcripts.fa top_transcripts

    3. Map to transcripts with bowtie2

      Mapping is performed with bowtie2 program, without sending the command to the background (no "&").

      Parameters used below mean: -p32 (use 32 cores for mapping), --quiet (print only serious errors to stderr - which is later saved in bowtie2.error file), --norc (do not map to reverse strand)

      • Example:

      for i in {01..12}
      do

      cd /usr/seqdata/hrfseq/trimmed_adapters/$i
      nice bowtie2 -p32 --quiet --norc -x /usr/FASTA/top_transcripts -U ./output_dir/read1.fastq.gz 2>bowtie2.error | gzip > tx_mapped.sam.gz

      done

    4. Save the bowtie2 version for future reference
      • Example:

      bowtie2 --version > bowtie2.version

  6. Summarize unique barcodes

    Count number of unique barcodes connected with reads terminating at each of the transcript nucleotides with the summarize_unique_barcodes.sh script. Check the meaning of different parameters by typing:

    summarize_unique_barcodes.sh -h

    It SHOULD (no guarantee!) be safe to run 12 jobs in parallel on our servers, although very large files may be an issue*. To call the -k option (produce k2n file) you need RNAprobR to be installed in your R - this calculation can take very long time.

    for i in {01..12}
    do

    cd /usr/seqdata/hrfseq/trimmed_adapters/$i
    nice summarize_unique_barcodes.sh -f tx_mapped.sam.gz -b ./output_dir/barcodes.txt -t -k -o summarized &

    done
    wait

  7. Read in the data into R session using RNAprobR package.

    If you intend to work on all 12 probing samples simultaneously it is convenient to read in the data into a list:

    cd /usr/seqdata/hrfseq/trimmed_adapters/
    R
    library(RNAprobR)

    dir_names <- formatC(1:12, width=2, flag=0)

    #Read in the count information, and convert observed unique barcodes to estimated unique counts:
    euc_GRs <- list()
    for(index in dir_names){
    euc_GRs[[index]] <- readsamples(samples=paste("./",index,"/summarized/unique_barcodes.txt", sep=""), euc="HRF-Seq", k2n_files = paste("./",index,"/summarized/k2n.txt", sep=""))
    }

    #Compile positional information:
    comp_GRs <- list()
    for(index in dir_names){
    comp_GRs[[index]] <- comp(euc_GR=euc_GRs[[index]], fasta_file="/usr/FASTA/top_transcripts.fa")
    }

    #Save R objects for future fast loading with the load():
    save(euc_GRs, file="euc_GRs.Rsave")
    save(comp_GRs, file="comp_GRs.Rsave")


########################################################################################################

* It can be a good idea to include summarize_unique_barcodes.sh script in the same loop as bowtie2 mapping. In this case, instead of running code in the points 5.3 and 6 run:

for i in {01..12}
do

cd /usr/seqdata/hrfseq/trimmed_adapters/$i
nice bowtie2 -p32 --quiet --norc -x /usr/FASTA/top_transcripts -U ./output_dir/read1.fastq.gz 2>bowtie2.error | gzip > tx_mapped.sam.gz
nice summarize_unique_barcodes.sh -f tx_mapped.sam.gz -b ./output_dir/barcodes.txt -t -k -o summarized &

done
wait