Reference genome for RNA-Seq reads decontamination (2024)

4.3 years ago

gnmcsbnfrmtcsclb ▴ 70

We are a small group of undergrads, mostly sophomores, from a small HBCU. We are learning bioinformatics and genomics, which are not part of our regular syllabus, by trying to teach one another :) And because of SARS-CoV-2, we have a little more time on our hands.

Currently we are trying to learn the theory and practice of decontaminating RNA-Seq reads, so that we can map the cleaned reads to the genome of our plant species of interest. We tried FastQ_Screen and, for one test case, the majority of reads mapped to unknown, but quite a few mapped to mouse and rat.

The command was:

fastq_screen --conf fastq_screen.conf --force --quiet --subset 0 $FASTQ_Input

The config file pointed to the following reference sequences, indexed for use by the underlying Bowtie2:

## Adapters - sequence derived from the FastQC contaminants file found at: www.bioinformatics.babraham.ac.uk/projects/fastqc
## Ecoli - sequence available from EMBL accession U00096.2
## Vectors - Sequence taken from the UniVec database
## Lambda
## Mitochondrion
## PhiX - sequence available from Refseq accession NC_001422.1
## rRNA
## Human - sequences available from
## ftp://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/
## Mouse - sequence available from
## ftp://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/
## Rat

Our questions are these:

Question 1. When decontaminating, is it essential to include the genome of our species of interest in addition to the references being checked against (adapters, PhiX, rat, mouse, human, bacterial, etc.)? If a read maps BEST to our genome of interest, then it shouldn't matter if it also maps to those other references; it would be considered a true read and not contamination, right? Or is the answer to that "complicated" or "it depends"? :)

Question 2. I want to decontaminate my RNA-Seq reads and ultimately map them to plant genome P. I suspect contamination from the mouse and rat genomes, M and R respectively. Since these are all eukaryotes, a small but non-zero fraction of all 3 genomes would be shared, right? So do I need to run FastQ_Screen on this modified genome collection instead:

a. M - P (mouse, but without genomic regions also found in my plant species)
b. R - P (rat, but without genomic regions also found in my plant species)
c. P (full genome of plant species of interest)

If that is indeed the case, what is the bioinformatic protocol to generate the M - P subtracted genome sequences, given the M and P genome assemblies?

Thanks in advance, and we hope you all stay safe from SARS-CoV-2!!

FastQ_Screen RNA-Seq • 1.9k views

updated 10 months ago by Ram 44k • written 4.3 years ago by gnmcsbnfrmtcsclb ▴ 70

You should definitely keep your genome of interest. Otherwise, you can't tell whether the reads flagged as contaminants are just homologous reads from your own species (which basically leads to your Question 2).
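To make that concrete, here is a toy Python sketch of the "best hit wins" idea from Question 1. It is not what FastQ_Screen does internally, and the function name `assign_read`, the reference labels, and the scores are all made up for illustration. It only assumes that each read has one alignment score per reference, where higher is better and `None` means the read did not align.

```python
from typing import Dict, Optional


def assign_read(scores: Dict[str, Optional[int]]) -> str:
    """Assign one read to the reference with the best alignment score.

    `scores` maps a reference name to the read's best alignment score
    against that reference (higher is better), or None if unaligned.
    Hypothetical helper for illustration only.
    """
    # Keep only the references the read actually aligned to.
    aligned = {ref: s for ref, s in scores.items() if s is not None}
    if not aligned:
        return "no_hit"
    best = max(aligned.values())
    winners = sorted(ref for ref, s in aligned.items() if s == best)
    # A tie between references cannot be resolved by score alone.
    if len(winners) > 1:
        return "ambiguous:" + ",".join(winners)
    return winners[0]


# Made-up scores for a read from a conserved gene: it aligns well to the
# plant genome and somewhat worse to mouse.
read_scores = {"Plant": -2, "Mouse": -14, "Rat": None}

with_plant = assign_read(read_scores)
without_plant = assign_read(
    {ref: s for ref, s in read_scores.items() if ref != "Plant"}
)

print(with_plant)     # Plant
print(without_plant)  # Mouse
```

With the plant genome in the screen, the read is kept as a plant read. Without it, the same read looks like mouse contamination, simply because mouse is the best hit that is left.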

Are you sure you want to decontaminate? If you are working with a plant species, you probably wouldn't have mouse or rat contamination. Most people never decontaminate. If you are just learning RNA-seq, this may not be something that you should worry about yet.

4.3 years ago by igor 13k