3  Data Retrieval with NCBI SRA Toolkit

NCBI (Natianal Center for Biotechnology Information) is a major source of biological databases related to life and health sciences research, as well as a major source of bioinformatics tools and services. NCBI hosts various types of biological data submitted by researchers from around the world, such as GenBank for nucleotide sequence submissions, Sequence Read Archive (SRA) for raw sequence data, Genome for submitting full or draft genomes, Gene Expression Omnibus (GEO) for quantitative gene expression data sets, and many more.

NCBI SRA toolkit is a set of utilities for downloading, viewing, and searching large amounts of high-throughput sequencing data from the NCBI SRA database.

SRA toolkit can

Screenshots of NCBI Sequence Read Archives

3.1 What is Sequence Read Archives (SRA)

The Sequence Read Archive (SRA) is the largest publicly accessible repository for high-throughput sequencing data. SRA accepts data from all areas of sequencing projects as well as metagenomics and environmental studies. Sequencing data may be isolated from a single species or from multiple species as in metagnomics studies.

SRA also refers in the file description to the format defined by NCBI for NGS data in the SRA database. All data submitted to NCBI must be stored in SRA format and can be converted back to a FASTQ, FASTA, or BAM file depending on the original submission by the researchers. Here, the SRA Toolkit provides tools for downloading data, converting various data formats to SRA format and vice versa, and extracting SRA data to other formats.

Researchers often use SRA data to make discoveries and conduct reproducible research. Data sets can be compared using the SRA web interface. However, if you want to download files for local use on your computer, you should use a command line interface, and the SRA Toolkit is highly recommended by NCBI.

3.2 Searching RNA-Sequencing datasets in NCBI

The databases in NCBI are linked by some common features. This means that you can start wherever you have your research problems in NCBI. In this workshop, we will investigate transcriptional changes during light exposure of the alga Cyanophora paradoxa, a representative species of Glaucophytes. For more information about this alga, please see this article in Science.

In this workshop we’ll retrieve transcriptome sequencing libraries of C. paradoxa under the normal light and dark conditions. This dataset is generated by (Knopp et al. 2020).

Activity

You can easily search the SRA database for any keywords of interest related to your research. In this context, we search for all SRA studies related to C. paradoxa and see what SRA provides us. Note that the SRA database contains not only transcriptome studies, but also genomes and metagenomes.

  1. Go to SRA database: https://www.ncbi.nlm.nih.gov/sra, and search for ‘cyanophora paradoxa’.

Screenshot of the search result of C. paradoxa in the database NCBI SRA. Here you can see all sequencing libraries of this species that have been submitted to NCBI. You can specify which items are of interest or customize the search using the filter box on the left and right sides of the screen.

  1. We’ll adjust our selection using the tool in the SRA database, the SRA Run Selector, as follows.

Browsing sequencing data in NCBI SRA database

  1. In the SRA Run Selector, you can customize the filters based on the metadata columns of all runs. In this case, we filter the SRA runs based on the assay type as RNA-Seq and select only paired-end sequencing data as follows.

Customizing filters in SRA Run Selector

  1. Then you can select which runs you want to download and perform analysis. In this workshop we’ll select C. paradoxa RNA-Seq reads from SRR8306028, SRR8306029, SRR8306032, SRR8306033, SRR8306034 and SRR8306035.

Exporting the selected metadata in SRA Run Selector.

  1. The downloaded metadata is in comma-separated file format. So you can open them with spreadsheet programs like Microsoft Excel on your local laptop. The metadata looks like this.
Run Assay.Type AvgSpotLen Bases BioSample Bytes Center.Name Consent DATASTORE.filetype DATASTORE.provider DATASTORE.region Experiment Instrument Library.Name LibraryLayout LibrarySelection LibrarySource Organism Platform ReleaseDate create_date version Sample.Name SRA.Study BioProject ENA.FIRST.PUBLIC..run. ENA_first_public ENA.LAST.UPDATE..run. ENA.LAST.UPDATE External_Id INSDC_center_alias INSDC_center_name INSDC_first_public INSDC_last_update INSDC_status Sample_Name Submitter_Id Broker_name Developmental_stage Experimental_Factor._stimulus..exp. Genotype growth_condition Organism_part stimulus Experimental_Factor._organism..exp. Experimental_Factor._time..exp. Experimental_Factor._growth_condition..exp. common_name BioSampleModel geo_loc_name_country geo_loc_name_country_continent geo_loc_name Isolate TISSUE Collection_Date Isolation_source BIOMATERIAL_PROVIDER dev_stage Temp TREATMENT BREED culture.collection sequencer source_mat_id tissue_type Host Lat_Lon Sample_Type Strain
SRR8306028 RNA-Seq 300 3373326900 SAMN10588740 1950953927 HHU DUESSELDORF public fastq,run.zq,sra s3,ncbi,gs gs.US,s3.us-east-1,ncbi.public SRX5120515 Illumina HiSeq 2000 Cyanophora dark III PAIRED cDNA TRANSCRIPTOMIC Cyanophora paradoxa ILLUMINA 2019-08-05T00:00:00Z 2018-12-17T16:48:00Z 1 Cyanophora_dark SRP173157 PRJNA509798 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Plant Denmark Europe Denmark: obtained from the Scandinavian culture collection of Algae & Protozoa (SCCAP) SCCAP K-0262 unicells NA NA SCCAP K-0262 exponential 15 °C 0 µmol quanta m-2s-1 NA NA NA NA NA NA NA NA NA
SRR8306029 RNA-Seq 300 3361510200 SAMN10588740 1882426113 HHU DUESSELDORF public run.zq,fastq,sra s3,ncbi,gs gs.US,s3.us-east-1,ncbi.public SRX5120514 Illumina HiSeq 2000 Cyanophora dark II PAIRED cDNA TRANSCRIPTOMIC Cyanophora paradoxa ILLUMINA 2019-08-05T00:00:00Z 2018-12-17T16:44:00Z 1 Cyanophora_dark SRP173157 PRJNA509798 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Plant Denmark Europe Denmark: obtained from the Scandinavian culture collection of Algae & Protozoa (SCCAP) SCCAP K-0262 unicells NA NA SCCAP K-0262 exponential 15 °C 0 µmol quanta m-2s-1 NA NA NA NA NA NA NA NA NA
SRR8306032 RNA-Seq 300 3378668400 SAMN10588739 1903910409 HHU DUESSELDORF public sra,run.zq,fastq gs,s3,ncbi ncbi.public,gs.US,s3.us-east-1 SRX5120511 Illumina HiSeq 2000 Cyanophora light II PAIRED cDNA TRANSCRIPTOMIC Cyanophora paradoxa ILLUMINA 2019-08-05T00:00:00Z 2018-12-17T16:44:00Z 1 Cyanophora_normal light SRP173157 PRJNA509798 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Plant Denmark Europe Denmark: obtained from the Scandinavian culture collection of Algae & Protozoa (SCCAP) SCCAP K-0262 unicells NA NA SCCAP K-0262 exponential 15 °C ~ 50 µmol quanta m-2s-1 NA NA NA NA NA NA NA NA NA
SRR8306033 RNA-Seq 300 3386020200 SAMN10588739 1892499340 HHU DUESSELDORF public fastq,run.zq,sra ncbi,s3,gs gs.US,ncbi.public,s3.us-east-1 SRX5120510 Illumina HiSeq 2000 Cyanophora light I PAIRED cDNA TRANSCRIPTOMIC Cyanophora paradoxa ILLUMINA 2019-08-05T00:00:00Z 2018-12-17T16:35:00Z 1 Cyanophora_normal light SRP173157 PRJNA509798 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Plant Denmark Europe Denmark: obtained from the Scandinavian culture collection of Algae & Protozoa (SCCAP) SCCAP K-0262 unicells NA NA SCCAP K-0262 exponential 15 °C ~ 50 µmol quanta m-2s-1 NA NA NA NA NA NA NA NA NA
SRR8306034 RNA-Seq 300 3436771500 SAMN10588740 1902846665 HHU DUESSELDORF public fastq,run.zq,sra ncbi,s3,gs gs.US,ncbi.public,s3.us-east-1 SRX5120509 Illumina HiSeq 2000 Cyanophora dark I PAIRED cDNA TRANSCRIPTOMIC Cyanophora paradoxa ILLUMINA 2019-08-05T00:00:00Z 2018-12-17T16:38:00Z 1 Cyanophora_dark SRP173157 PRJNA509798 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Plant Denmark Europe Denmark: obtained from the Scandinavian culture collection of Algae & Protozoa (SCCAP) SCCAP K-0262 unicells NA NA SCCAP K-0262 exponential 15 °C 0 µmol quanta m-2s-1 NA NA NA NA NA NA NA NA NA
SRR8306035 RNA-Seq 300 3436308000 SAMN10588739 1909811743 HHU DUESSELDORF public run.zq,fastq,sra s3,gs,ncbi gs.US,s3.us-east-1,ncbi.public SRX5120508 Illumina HiSeq 2000 Cyanophora light III PAIRED cDNA TRANSCRIPTOMIC Cyanophora paradoxa ILLUMINA 2019-08-05T00:00:00Z 2018-12-17T15:00:00Z 1 Cyanophora_normal light SRP173157 PRJNA509798 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Plant Denmark Europe Denmark: obtained from the Scandinavian culture collection of Algae & Protozoa (SCCAP) SCCAP K-0262 unicells NA NA SCCAP K-0262 exponential 15 °C ~ 50 µmol quanta m-2s-1 NA NA NA NA NA NA NA NA NA

3.3 Downloading SRA runs using fasterq-dump

fasterq-dump are tools in the SRA toolkit used to connect from our remote server to the NCBI server and download sequencing data from SRA.

According to the NCBI sra-tools’ guideline, using fasterq-dump in combination with another tool, prefetch, is the better way to download data because prefetch can be invoked at any time if the previous download accidentally failed. So it is not necessary to start the download from the beginning.

However, prefetch can sometimes be skipped if you want to download a small amount of data. In this workshop we’ll use only fasterq-dump to download and process SRA file format to FASTQ file for further analysis.

Activity

We’ll do the following command in Terminal or MobaXterm, by access to the username and password that we’ve provided.

  • For mobaXterm, enter to your session.

  • For terminal, type

ssh <username>@<server IP address>

Now we download RNA-Seq libraries from C. paradoxa using the SRA accessions listed in the first column of the metadata above using the following command.

Activate analysis environment

conda activate ncbi

Go to working directory

cd ~/Cpa_RNASeq/01_Rawdata

Run fasterq-dump

fasterq-dump --threads 2 --progress \
SRR8306028 SRR8306029 SRR8306032 \
SRR8306033 SRR8306034 SRR8306035

From the fasterq-dump command,

--threads refer to how many threads to use (default = 6).

--progress force the terminal to print the progress of downloading and processing file to the screen.

Expected output files.

01_Rawdata
├── SRR8306028_1.fastq
├── SRR8306028_2.fastq
├── SRR8306029_1.fastq
├── SRR8306029_2.fastq
├── SRR8306032_1.fastq
├── SRR8306032_2.fastq
├── SRR8306033_1.fastq
├── SRR8306033_2.fastq
├── SRR8306034_1.fastq
├── SRR8306034_2.fastq
├── SRR8306035_1.fastq
└── SRR8306035_2.fastq

By default, fasterq-dump processes a single SRA file format of paired-end reads by splitting reads into forward (*_1.fastq) and reverse (*_2.fastq), if singletons (unpaired between forward and reverse reads) present, it will be written to another fastq file as described in this figure.

Sequence read processing by fasterq-dump using default parameters. Figure adopted from https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump.

3.4 Reference Sources