DESeq2
This guide pales in comparison to the mighty DESeq2 vignette, but may be more approachable due to its modest size. However, if you have any questions, they are likely aready answered in that vignette.
→ Why use DESeq2?
After gathering RNAseq data, you often want to find which genes change their expression between conditions. You may want to either to use these genes by themselves (perhaps you have individual genes of interest, markers, etc) or to use in downstream gene set enrichment (GSE) analyses like GSEA or GSVA.
To that end, there are many R packages available - limma, edgeR, and DESeq2 are ones that I've used previously. It also depends if you're working with RNAseq data (likely, for new experiments) or array data (usually older data). If you're using array data, you're likely limited to limma, while RNAseq you have your choice of several packages, including edgeR and DESeq2.
I recommend DESeq2 simply because it's what I'm comfortable with and seems to be an industry standard. It typically performs better (more sensitive and precise) than other methods when you have smaller sample sizes (See Figures 6, 8, 9: link).
→ How to use it
The overall workflow is:
-
Coerce your data into a
DESeqDataSet(hardest step) - Do differential expression
- Shake out your results
→ Import
The main battle is getting your data into DESeq2, which varies depending on the input that you have. Fortunately, DESeq2 comes with many helpful import functions which will mush them into the shape required, a DESeqDataSet. You can find documentation on them by running ?DESeqDataSet in the console. I'll talk about a few that I typically use, though.
-
DESeqDataSet: If you're coming directly from aSummarizedExperiment, an object we'll talk about shortly -
DESeqDataSetFromMatrix: If you have your counts come from some kind of delimited file (.tsv,.csv,xls(x)). -
DESeqDataSetFromTximport:tximportreads in data from a variety of sources (including Salmon) and turns it into a unified format that can be consumed byDESeq2.
A brief sojourn: the `SummarizedExperiment`
Note: The SummarizedExperiment vignette is quite good, though like the DESeq2 vignette may feel overwhelming. I'd recommend taking a peek - it's not too bad - but hopefully I can demonstrate the good bits here.
Now is a useful time to talk about the SummarizedExperiment object. A SummarizedExperiment is essentially a little caddy that is perfect for holding experimental data, particularly (but not necessarily) sequencing data. A lot of other packages create their own data structures derived from SummarizedExperiments, so a lot of the same functions will work on them too. That is to say, getting comfortable with SummarizedExperiments pays back in spades.
First, a motivating example.
Imagine you've sequenced twelve samples. Two groups - one you exposed to, I don't know, saracatinib, the other you only exposed to vehicle (DMSO). You also have two cell lines for each group, for a total of four conditions:
| sample | line | drug | replicate |
|---|---|---|---|
| rt112_dmso_1 | RT112 | DMSO | 1 |
| rt112_dmso_2 | RT112 | DMSO | 2 |
| rt112_dmso_3 | RT112 | DMSO | 3 |
| rt112_saracatinib_1 | RT112 | saracatinib | 1 |
| rt112_saracatinib_2 | RT112 | saracatinib | 2 |
| rt112_saracatinib_3 | RT112 | saracatinib | 3 |
| rt112_dmso_1 | UM-UC14 | DMSO | 1 |
| rt112_dmso_2 | UM-UC14 | DMSO | 2 |
| rt112_dmso_3 | UM-UC14 | DMSO | 3 |
| rt112_saracatinib_1 | UM-UC14 | saracatinib | 1 |
| rt112_saracatinib_2 | UM-UC14 | saracatinib | 2 |
| rt112_saracatinib_4 | UM-UC14 | saracatinib | 3 |
You also have some sequencing data:
| gene | rt112_dmso_1 | rt112_dmso_3 | rt112_dmso_4 | rt112_saracatinib_1 | ... |
|---|---|---|---|---|---|
| GENE1 | 114 | 200 | 170 | 400 | ... |
| GENE2 | 30 | 25 | 22 | 0 | ... |
| ... | ... | ... | ... | ... | ... |
You also would like to lug around some gene name information - sometimes different analyses require different names, so it would be nice to always have them on hand:
| hgnc_symbol | entrez | ensemble |
|---|---|---|
| GENE1 | 10482 | ENSG00190177120 |
| GENE2 | 22479 | ENSG01242155979 |
| ... | ... | ... |
You could just keep track of three separate R objects, but that is a difficult and dangerous road to walk. For instance, filtering in the sequencing data will leave your gene data 'out of sync'. Careless rejoining of data can also cause the wrong samples to be associated with the wrong expression data, leading to difficult-to-detect and extremely consequential errors.
The SummarizedExperiment manages this by having several 'slots' in which to click in these dataframes, and providing functions to update the whole system so all parts stay in sync.
These slots are:
-
rowData, which would hold our gene ID dataframe -
colData, which would hold our dataframe with our sample information -
assays, which holds the actual expression data
Beyond that, there's also a metadata slot that can hold information about the whole experiment not exclusive to any individual sample or gene - like the time and date you sequenced the samples, for instance - but I typically don't use this slot. In addition, assays can hold many expression matrices (as the name implies), so long as they have the same rows and columns. This is useful if, say, you want to store a numerical transformation of your expression, but would like to tote the original data along as well.
→ From a Matrix
Let's spin up some data to work with. In reality, if you're using this method it's because you've received a .csv or something of that nature containing expression data that looks a bit like this:
| gene | rt112_dmso_1 | rt112_dmso_3 | rt112_dmso_4 | rt112_saracatinib_1 | ... |
|---|---|---|---|---|---|
| GENE1 | 114 | 200 | 170 | 400 | ... |
| GENE2 | 30 | 25 | 22 | 0 | ... |
| ... | ... | ... | ... | ... | ... |
So you'll have to read it in with, say, read.csv. Once you have that dataframe, your workflow will be similar to mine.
# Making some dummy data using DESeq2 - you won't need to do this since you have your own data.
library(DESeq2)
expression <- makeExampleDESeqDataSet(n = 10, m = 6, betaSD = 3) |>
assay() |>
as.data.frame()
expression
# sample1 sample2 sample3 sample4 sample5 sample6
# gene1 3 10 8 9 50 43
# gene2 92 84 112 19 4 9
# gene3 35 46 2 9 5 2
# gene4 19 22 19 3 0 0
# gene5 14 2 18 29 261 288
# gene6 13 14 2 122 51 52
# gene7 1 10 3 0 1 1
# gene8 31 36 13 2 3 9
# gene9 25 19 30 44 68 29
# gene10 11 4 25 0 4 6
Caveat 1: rownames, and why might want to use `read.csv` instead of `read_csv`
The easiest input format is like that listed above: a dataset with JUST numbers, and with gene IDs for rownames. If you read in your data with read.csv, it's possible it will be read in just as shown. However, if you read in with readr's read_csv (or read_tsv or whatever), it will instead give the rownames a column of its own.
You should be able to deal with gene names being the first column (rather than rownames) by setting tidy = TRUE in the DESeqDataSetFromMatrix function. However, as of 2025-12-18 there seems to be a bug in that function where it will not behave with the output format of readr's functions (a tibble). This means you would have to convert it to a data.frame, a step I wanted to omit for simplicity's sake. Yet here we are, you and I, in this little aside box together.
If you read in your data with read.csv and the gene names ended up in their own column (let's call it gene), so long as gene is the first column and the rest of the data are only expression data, you can use the tidy = TRUE argument I was talking about, since read.csv returns a data.frame, not a tibble.
After you've read in your data into a data.frame, you'll need to create the colData data.frame. It'll look something like this:
# There are more succinct ways of doing this, but this is the most straightforward
my_col_data <- data.frame(
drug = c("dmso", "dmso", "dmso", "erdafitinib", "erdafitinib", "erdafitinib"),
replicate = c(1, 2, 3, 1, 2, 3)
)
rownames(my_col_data) <- c(
"sample1", "sample2", "sample3", "sample4", "sample5", "sample6"
)
my_col_data
# drug replicate
# sample1 dmso 1
# sample2 dmso 2
# sample3 dmso 3
# sample4 erdafitinib 1
# sample5 erdafitinib 2
# sample6 erdafitinib 3
Your rownames must match the column names of your expression data. You can omit rownames in your colData, but I strongly advise against it: if you do not supply rownames, it is assumed that the first row of colData corresponds to the first column of assay. Misplaced assumptions will go unwarned, resulting in insidious bugs that can, have, and will sink projects and retract papers. Woe to thee.
Caveat 2: `read.csv` may bork your column names
Hi, remember when I told you to use read.csv instead of readr::read_csv? It has a bit of a downside if your sample names (for us, colnames of expression) have spaces or start with numbers.
One bad thing about the base R read functions compared to the readr read functions is that base R read functions mess with your column names. If you column names start with a number or has spaces in it, the read function will silently change it, so a column named 3 little cats would become X3.little.cats.
This can be confusing if your sample names have spaces in them or numbers at the front, because when you go to set your sample names in your column data, you might assume R hasn't messed with your sample names. Then, when trying to make a SummarizedExperiment (or DESeqDataSet), you'll get an error that your rownames and column names don't match. And indeed they don't, but it's because the column names were changed behind your back. You'll need to match you rownames of colData to whatever R mangled the column names of the assay to be.
Believe it or not, we're ready to make a DESeqDataSet:
dds <- DESeqDataSetFromMatrix(
countData = expression,
colData = my_col_data,
design = ~ drug # See section `The design parameter`
)
dds
# class: DESeqDataSet
# dim: 10 6
# metadata(1): version
# assays(1): counts
# rownames(10): gene1 gene2 ... gene9 gene10
# rowData names(0):
# colnames(6): sample1 sample2 ... sample5 sample6
# colData names(2): drug replicate
→ From tximport
If your pipeline spits out salmon files (usually denoted by the .sf extension), you'll need to import them using tximport first, then into a DESeqDataSet.
Step 1 is to make a list of paths to your .sfs:
salmon_file_paths <- c(
"sample1/quant.sf", "sample2/quant.sf", "sample3/quant.sf", "sample4/quant.sf", "sample5/quant.sf", "sample6/quant.sf"
)
# Shorter way:
# salmon_file_paths <- paste0("sample", 1:6, "/quant.sf")
Step 2 is to supply a data.frame that maps from transcript to gene.
The output from salmon is transcript level, rather than gene level. Since there are many types of transcripts per gene, you need a way to combine all transcripts from a single gene into one 'summary' number. This requires an additional file that maps transcripts to their corresponding gene. For the output of nf-core/rnaseq, it will be included as a file called tx2gene.tsv, and have the following form:
| transcript_id | gene_id | gene_name |
|---|---|---|
| FBti0060911-RA | FBti0060911 | gene1 |
| FBtr0082878 | FBgn0038189 | gene2 |
| FBtr0081853 | FBgn0037583 | gene3 |
| ... | ... | ... |
tximport will use the first column as the transcript ID and the second column as the gene ID. If you want to use gene names instead, you'll need to move gene_name to the second column:
| transcript_id | gene_name | gene_id |
|---|---|---|
| FBti0060911-RA | gene1 | FBti0060911 |
| FBtr0082878 | gene2 | FBgn0038189 |
| FBtr0081853 | gene3 | FBgn0037583 |
| ... | ... | ... |
The code would look something like:
library(tidyverse)
tx2gene_df <- read_tsv("tx2gene.tsv")
# If you want to use gene_name instead of gene_id:
tx2gene_df <- select(tx2gene_df, -gene_id)
Aside: `nf-core/rnaseq` doesn't work with `tximeta`
If you haven't heard of tximeta, you can skip this aside.
In theory, you should be able to use tximeta with your salmon files to automagically associate transcripts with the correct gene_id, obviating the need for a tx2gene file. However, the way that nf-core/rnaseq uses salmon prevents this magic from happening (for a technical definition, see the note here)
Putting it together, you can use tximport via:
gene_level_expression <- tximport(
files = salmon_file_paths,
type = "salmon",
tx2gene = tx2gene_df
)
To put this into a DESeqDataSet, you'll need to create colData (just as you would for a matrix-based import with DESeqDataSetFromMatrix)
my_col_data <- data.frame(
drug = c("dmso", "dmso", "dmso", "erdafitinib", "erdafitinib", "erdafitinib"),
replicate = c(1, 2, 3, 1, 2, 3)
)
rownames(my_col_data) <- salmon_file_paths
my_col_data
# drug replicate
# sample1/quant.sf dmso 1
# sample2/quant.sf dmso 2
# sample3/quant.sf dmso 3
# sample4/quant.sf erdafitinib 1
# sample5/quant.sf erdafitinib 2
# sample6/quant.sf erdafitinib 3
And at long last, to get a DESeqDataSet:
dds <- DESeqDataSetFromTximport(gene_level_expression, my_col_data, design = ~ drug)
→ The design parameter
Let's talk about design. This is where you specify your experimental conditions and/or things you'd like to cancel out, like batch effects. It's supplied with a very strange R object called a formula. In this case, it will start with a tilde (~) and contain column names from colData after it, separated by + if there are more than one. For instance, if you had colData for a drug combination experiment you sequenced in multiple batches, your design might look like:
~ drug1 + drug2 + batch
→ Performing differential expression
For most cases, this is trivial:
deseqd <- DESeq(dds)
deseqd
# class: DESeqDataSet
# dim: 10 6
# metadata(1): version
# assays(4): counts mu H cooks
# rownames(10): gene1 gene2 ... gene9 gene10
# rowData names(22): baseMean baseVar ... deviance maxCooks
# colnames(6): sample1 sample2 ... sample5 sample6
# colData names(3): drug replicate sizeFactor
If you have suspicions that you are not 'most cases', see if the DEseq2 vignette can help.
→ Getting the results out
You'll notice you aren't being provided with a list of differentially expressed genes. This is actually a good thing, since it gives us some control as to what groups should compared and how the log fold changes should be presented.
If you have multiple conditions in your design - again, like ~ drug1 + drug2 - you'll want to see the genes that are differentially expressed when one variable is changing, but the others are held constant. This is particularly true for batch effects. When you ask for results of one variable, you are 'correcting for' the changes in all other variables in the design
Once you know what you want, getting the results is quite simple though:
my_results <- results(deseqd, contrast = c("drug", "erdafitinib", "dmso"))
my_results
# log2 fold change (MLE): drug erdafitinib vs dmso
# Wald test p-value: drug erdafitinib vs dmso
# DataFrame with 10 rows and 6 columns
# baseMean log2FoldChange lfcSE stat pvalue padj
# <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# gene1 18.28270 1.4302894 0.923024 1.549570 1.21245e-01 0.134716513
# gene2 73.74474 -3.6954872 0.867105 -4.261866 2.02727e-05 0.000202727
# gene3 20.50386 -2.9483710 1.062492 -2.774960 5.52086e-03 0.011041715
# gene4 14.55253 -4.9317878 1.227223 -4.018656 5.85311e-05 0.000292656
# gene5 85.85078 3.2271886 0.969800 3.327685 8.75708e-04 0.002878590
# gene6 42.25506 2.4698715 0.912466 2.706810 6.79331e-03 0.011322191
# gene7 3.43657 -3.6273445 1.639724 -2.212168 2.69551e-02 0.033693813
# gene8 20.22925 -3.0938104 0.951753 -3.250645 1.15144e-03 0.002878590
# gene9 36.36285 0.0897384 0.740996 0.121105 9.03608e-01 0.903607736
# gene10 11.37902 -2.8640108 1.151573 -2.487041 1.28810e-02 0.018401489
For the contrast argument:
-
The first item in the vector is the variable in your design you want to get the differences of (again, correcting for other variable in the design) - here
"drug". -
The second and third items are the levels of
drugyou want to be the numerator and denominator for log fold change, respectively. If you have two levels (as I do -erdafitinibanddrug), typically you want your experimental condition as the numerator ("erdafitinib") and your control as the denominator ("dmso").
To actually do stuff with the results, I prefer to convert it to a tibble first (a regular old data.frame works fine if you prefer). Here's a common motif for me:
my_results |>
# Saves rownames, puts them in column named 'gene'
as_tibble(rownames = "gene") |>
# Filter for only the statistically significant diff. exp. genes
filter(padj < 0.05) |>
# Order rows by largest magnitude changes, descending
arrange(desc(abs(log2FoldChange)))
# # A tibble: 8 × 7
# gene baseMean log2FoldChange lfcSE stat pvalue padj
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 gene4 14.6 -4.93 1.23 -4.02 0.0000585 0.000293
# 2 gene2 73.7 -3.70 0.867 -4.26 0.0000203 0.000203
# 3 gene7 3.44 -3.63 1.64 -2.21 0.0270 0.0337
# 4 gene5 85.9 3.23 0.970 3.33 0.000876 0.00288
# 5 gene8 20.2 -3.09 0.952 -3.25 0.00115 0.00288
# 6 gene3 20.5 -2.95 1.06 -2.77 0.00552 0.0110
# 7 gene10 11.4 -2.86 1.15 -2.49 0.0129 0.0184
# 8 gene6 42.3 2.47 0.912 2.71 0.00679 0.0113
→ References
The SummarizedExperiment
vignette
has a great illustration of the anatomy of a SummarizedExperiment and how to
access them.
The DESeq2 vignette has answers to most of your practical questions.
The DESeq2 paper has answers to most of your theoretical questions.