Why use DESeq2?
How to use it

Import

From a Matrix
From tximport
The design parameter

Performing differential expression
Getting the results out

References

DESeq2

This guide pales in comparison to the mighty DESeq2 vignette, but may be more approachable due to its modest size. However, if you have any questions, they are likely aready answered in that vignette.

→ Why use DESeq2?

After gathering RNAseq data, you often want to find which genes change their expression between conditions. You may want to either to use these genes by themselves (perhaps you have individual genes of interest, markers, etc) or to use in downstream gene set enrichment (GSE) analyses like GSEA or GSVA.

To that end, there are many R packages available - limma, edgeR, and DESeq2 are ones that I've used previously. It also depends if you're working with RNAseq data (likely, for new experiments) or array data (usually older data). If you're using array data, you're likely limited to limma, while RNAseq you have your choice of several packages, including edgeR and DESeq2.

I recommend DESeq2 simply because it's what I'm comfortable with and seems to be an industry standard. It typically performs better (more sensitive and precise) than other methods when you have smaller sample sizes (See Figures 6, 8, 9: link).

→ How to use it

The overall workflow is:

Coerce your data into a DESeqDataSet (hardest step)
Do differential expression
Shake out your results

→ Import

The main battle is getting your data into DESeq2, which varies depending on the input that you have. Fortunately, DESeq2 comes with many helpful import functions which will mush them into the shape required, a DESeqDataSet. You can find documentation on them by running ?DESeqDataSet in the console. I'll talk about a few that I typically use, though.

DESeqDataSet: If you're coming directly from a SummarizedExperiment, an object we'll talk about shortly
DESeqDataSetFromMatrix: If you have your counts come from some kind of delimited file (.tsv, .csv, xls(x)).
DESeqDataSetFromTximport: tximport reads in data from a variety of sources (including Salmon) and turns it into a unified format that can be consumed by DESeq2.

A brief sojourn: the `SummarizedExperiment`

Note: The SummarizedExperiment vignette is quite good, though like the DESeq2 vignette may feel overwhelming. I'd recommend taking a peek - it's not too bad - but hopefully I can demonstrate the good bits here.

Now is a useful time to talk about the SummarizedExperiment object. A SummarizedExperiment is essentially a little caddy that is perfect for holding experimental data, particularly (but not necessarily) sequencing data. A lot of other packages create their own data structures derived from SummarizedExperiments, so a lot of the same functions will work on them too. That is to say, getting comfortable with SummarizedExperiments pays back in spades.

First, a motivating example.

Imagine you've sequenced twelve samples. Two groups - one you exposed to, I don't know, saracatinib, the other you only exposed to vehicle (DMSO). You also have two cell lines for each group, for a total of four conditions:

sample	line	drug	replicate
rt112_dmso_1	RT112	DMSO	1
rt112_dmso_2	RT112	DMSO	2
rt112_dmso_3	RT112	DMSO	3
rt112_saracatinib_1	RT112	saracatinib	1
rt112_saracatinib_2	RT112	saracatinib	2
rt112_saracatinib_3	RT112	saracatinib	3
rt112_dmso_1	UM-UC14	DMSO	1
rt112_dmso_2	UM-UC14	DMSO	2
rt112_dmso_3	UM-UC14	DMSO	3
rt112_saracatinib_1	UM-UC14	saracatinib	1
rt112_saracatinib_2	UM-UC14	saracatinib	2
rt112_saracatinib_4	UM-UC14	saracatinib	3

You also have some sequencing data:

gene	rt112_dmso_1	rt112_dmso_3	rt112_dmso_4	rt112_saracatinib_1	...
GENE1	114	200	170	400	...
GENE2	30	25	22	0	...
...	...	...	...	...	...

You also would like to lug around some gene name information - sometimes different analyses require different names, so it would be nice to always have them on hand:

hgnc_symbol	entrez	ensemble
GENE1	10482	ENSG00190177120
GENE2	22479	ENSG01242155979
...	...	...

You could just keep track of three separate R objects, but that is a difficult and dangerous road to walk. For instance, filtering in the sequencing data will leave your gene data 'out of sync'. Careless rejoining of data can also cause the wrong samples to be associated with the wrong expression data, leading to difficult-to-detect and extremely consequential errors.

The SummarizedExperiment manages this by having several 'slots' in which to click in these dataframes, and providing functions to update the whole system so all parts stay in sync.

These slots are:

rowData, which would hold our gene ID dataframe
colData, which would hold our dataframe with our sample information
assays, which holds the actual expression data

Beyond that, there's also a metadata slot that can hold information about the whole experiment not exclusive to any individual sample or gene - like the time and date you sequenced the samples, for instance - but I typically don't use this slot. In addition, assays can hold many expression matrices (as the name implies), so long as they have the same rows and columns. This is useful if, say, you want to store a numerical transformation of your expression, but would like to tote the original data along as well.

→ From a Matrix

Let's spin up some data to work with. In reality, if you're using this method it's because you've received a .csv or something of that nature containing expression data that looks a bit like this:

gene	rt112_dmso_1	rt112_dmso_3	rt112_dmso_4	rt112_saracatinib_1	...
GENE1	114	200	170	400	...
GENE2	30	25	22	0	...
...	...	...	...	...	...

So you'll have to read it in with, say, read.csv. Once you have that dataframe, your workflow will be similar to mine.

# Making some dummy data using DESeq2 - you won't need to do this since you have your own data.
library(DESeq2)
expression <- makeExampleDESeqDataSet(n = 10, m = 6, betaSD = 3) |>
  assay() |>
  as.data.frame()

expression

#        sample1 sample2 sample3 sample4 sample5 sample6
# gene1        3      10       8       9      50      43
# gene2       92      84     112      19       4       9
# gene3       35      46       2       9       5       2
# gene4       19      22      19       3       0       0
# gene5       14       2      18      29     261     288
# gene6       13      14       2     122      51      52
# gene7        1      10       3       0       1       1
# gene8       31      36      13       2       3       9
# gene9       25      19      30      44      68      29
# gene10      11       4      25       0       4       6

Caveat 1: rownames, and why might want to use `read.csv` instead of `read_csv`

The easiest input format is like that listed above: a dataset with JUST numbers, and with gene IDs for rownames. If you read in your data with read.csv, it's possible it will be read in just as shown. However, if you read in with readr's read_csv (or read_tsv or whatever), it will instead give the rownames a column of its own.

You should be able to deal with gene names being the first column (rather than rownames) by setting tidy = TRUE in the DESeqDataSetFromMatrix function. However, as of 2025-12-18 there seems to be a bug in that function where it will not behave with the output format of readr's functions (a tibble). This means you would have to convert it to a data.frame, a step I wanted to omit for simplicity's sake. Yet here we are, you and I, in this little aside box together.

If you read in your data with read.csv and the gene names ended up in their own column (let's call it gene), so long as gene is the first column and the rest of the data are only expression data, you can use the tidy = TRUE argument I was talking about, since read.csv returns a data.frame, not a tibble.

After you've read in your data into a data.frame, you'll need to create the colData data.frame. It'll look something like this:

# There are more succinct ways of doing this, but this is the most straightforward
my_col_data <- data.frame(
  drug = c("dmso", "dmso", "dmso", "erdafitinib", "erdafitinib", "erdafitinib"),
  replicate = c(1, 2, 3, 1, 2, 3)
)

rownames(my_col_data) <- c(
  "sample1", "sample2", "sample3", "sample4", "sample5", "sample6"
)

my_col_data

#                drug replicate
# sample1        dmso         1
# sample2        dmso         2
# sample3        dmso         3
# sample4 erdafitinib         1
# sample5 erdafitinib         2
# sample6 erdafitinib         3

Your rownames must match the column names of your expression data. You can omit rownames in your colData, but I strongly advise against it: if you do not supply rownames, it is assumed that the first row of colData corresponds to the first column of assay. Misplaced assumptions will go unwarned, resulting in insidious bugs that can, have, and will sink projects and retract papers. Woe to thee.

Caveat 2: `read.csv` may bork your column names

Hi, remember when I told you to use read.csv instead of readr::read_csv? It has a bit of a downside if your sample names (for us, colnames of expression) have spaces or start with numbers.

One bad thing about the base R read functions compared to the readr read functions is that base R read functions mess with your column names. If you column names start with a number or has spaces in it, the read function will silently change it, so a column named 3 little cats would become X3.little.cats.

This can be confusing if your sample names have spaces in them or numbers at the front, because when you go to set your sample names in your column data, you might assume R hasn't messed with your sample names. Then, when trying to make a SummarizedExperiment (or DESeqDataSet), you'll get an error that your rownames and column names don't match. And indeed they don't, but it's because the column names were changed behind your back. You'll need to match you rownames of colData to whatever R mangled the column names of the assay to be.

Believe it or not, we're ready to make a DESeqDataSet:


dds <- DESeqDataSetFromMatrix(
  countData = expression,
  colData = my_col_data,
  design = ~ drug # See section `The design parameter`
)

dds

# class: DESeqDataSet 
# dim: 10 6 
# metadata(1): version
# assays(1): counts
# rownames(10): gene1 gene2 ... gene9 gene10
# rowData names(0):
# colnames(6): sample1 sample2 ... sample5 sample6
# colData names(2): drug replicate

→ From `tximport`

If your pipeline spits out salmon files (usually denoted by the .sf extension), you'll need to import them using tximport first, then into a DESeqDataSet.

Step 1 is to make a list of paths to your .sfs:

salmon_file_paths <- c(
  "sample1/quant.sf", "sample2/quant.sf", "sample3/quant.sf", "sample4/quant.sf", "sample5/quant.sf", "sample6/quant.sf"
)

# Shorter way:
# salmon_file_paths <- paste0("sample", 1:6, "/quant.sf")

Step 2 is to supply a data.frame that maps from transcript to gene.

The output from salmon is transcript level, rather than gene level. Since there are many types of transcripts per gene, you need a way to combine all transcripts from a single gene into one 'summary' number. This requires an additional file that maps transcripts to their corresponding gene. For the output of nf-core/rnaseq, it will be included as a file called tx2gene.tsv, and have the following form:

transcript_id	gene_id	gene_name
FBti0060911-RA	FBti0060911	gene1
FBtr0082878	FBgn0038189	gene2
FBtr0081853	FBgn0037583	gene3
...	...	...

tximport will use the first column as the transcript ID and the second column as the gene ID. If you want to use gene names instead, you'll need to move gene_name to the second column:

transcript_id	gene_name	gene_id
FBti0060911-RA	gene1	FBti0060911
FBtr0082878	gene2	FBgn0038189
FBtr0081853	gene3	FBgn0037583
...	...	...

The code would look something like:

library(tidyverse)

tx2gene_df <- read_tsv("tx2gene.tsv")

# If you want to use gene_name instead of gene_id:
tx2gene_df <- select(tx2gene_df, -gene_id)

Aside: `nf-core/rnaseq` doesn't work with `tximeta`

If you haven't heard of tximeta, you can skip this aside.

In theory, you should be able to use tximeta with your salmon files to automagically associate transcripts with the correct gene_id, obviating the need for a tx2gene file. However, the way that nf-core/rnaseq uses salmon prevents this magic from happening (for a technical definition, see the note here)

Putting it together, you can use tximport via:

gene_level_expression <- tximport(
  files = salmon_file_paths,
  type = "salmon",
  tx2gene = tx2gene_df
)

To put this into a DESeqDataSet, you'll need to create colData (just as you would for a matrix-based import with DESeqDataSetFromMatrix)

my_col_data <- data.frame(
  drug = c("dmso", "dmso", "dmso", "erdafitinib", "erdafitinib", "erdafitinib"),
  replicate = c(1, 2, 3, 1, 2, 3)
)

rownames(my_col_data) <- salmon_file_paths

my_col_data

#                         drug replicate
# sample1/quant.sf        dmso         1
# sample2/quant.sf        dmso         2
# sample3/quant.sf        dmso         3
# sample4/quant.sf erdafitinib         1
# sample5/quant.sf erdafitinib         2
# sample6/quant.sf erdafitinib         3

And at long last, to get a DESeqDataSet:

dds <- DESeqDataSetFromTximport(gene_level_expression, my_col_data, design = ~ drug)

→ The `design` parameter

Let's talk about design. This is where you specify your experimental conditions and/or things you'd like to cancel out, like batch effects. It's supplied with a very strange R object called a formula. In this case, it will start with a tilde (~) and contain column names from colData after it, separated by + if there are more than one. For instance, if you had colData for a drug combination experiment you sequenced in multiple batches, your design might look like:

~ drug1 + drug2 + batch

→ Performing differential expression

For most cases, this is trivial:

deseqd <- DESeq(dds)

deseqd

# class: DESeqDataSet 
# dim: 10 6 
# metadata(1): version
# assays(4): counts mu H cooks
# rownames(10): gene1 gene2 ... gene9 gene10
# rowData names(22): baseMean baseVar ... deviance maxCooks
# colnames(6): sample1 sample2 ... sample5 sample6
# colData names(3): drug replicate sizeFactor

If you have suspicions that you are not 'most cases', see if the DEseq2 vignette can help.

→ Getting the results out

You'll notice you aren't being provided with a list of differentially expressed genes. This is actually a good thing, since it gives us some control as to what groups should compared and how the log fold changes should be presented.

If you have multiple conditions in your design - again, like ~ drug1 + drug2 - you'll want to see the genes that are differentially expressed when one variable is changing, but the others are held constant. This is particularly true for batch effects. When you ask for results of one variable, you are 'correcting for' the changes in all other variables in the design

Once you know what you want, getting the results is quite simple though:

my_results <- results(deseqd, contrast = c("drug", "erdafitinib", "dmso"))

my_results

# log2 fold change (MLE): drug erdafitinib vs dmso 
# Wald test p-value: drug erdafitinib vs dmso 
# DataFrame with 10 rows and 6 columns
#         baseMean log2FoldChange     lfcSE      stat      pvalue        padj
#        <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
# gene1   18.28270      1.4302894  0.923024  1.549570 1.21245e-01 0.134716513
# gene2   73.74474     -3.6954872  0.867105 -4.261866 2.02727e-05 0.000202727
# gene3   20.50386     -2.9483710  1.062492 -2.774960 5.52086e-03 0.011041715
# gene4   14.55253     -4.9317878  1.227223 -4.018656 5.85311e-05 0.000292656
# gene5   85.85078      3.2271886  0.969800  3.327685 8.75708e-04 0.002878590
# gene6   42.25506      2.4698715  0.912466  2.706810 6.79331e-03 0.011322191
# gene7    3.43657     -3.6273445  1.639724 -2.212168 2.69551e-02 0.033693813
# gene8   20.22925     -3.0938104  0.951753 -3.250645 1.15144e-03 0.002878590
# gene9   36.36285      0.0897384  0.740996  0.121105 9.03608e-01 0.903607736
# gene10  11.37902     -2.8640108  1.151573 -2.487041 1.28810e-02 0.018401489

For the contrast argument:

The first item in the vector is the variable in your design you want to get the differences of (again, correcting for other variable in the design) - here "drug".
The second and third items are the levels of drug you want to be the numerator and denominator for log fold change, respectively. If you have two levels (as I do - erdafitinib and drug), typically you want your experimental condition as the numerator ("erdafitinib") and your control as the denominator ("dmso").

To actually do stuff with the results, I prefer to convert it to a tibble first (a regular old data.frame works fine if you prefer). Here's a common motif for me:

my_results |>
  # Saves rownames, puts them in column named 'gene'
  as_tibble(rownames = "gene") |>
  # Filter for only the statistically significant diff. exp. genes
  filter(padj < 0.05) |>
  # Order rows by largest magnitude changes, descending
  arrange(desc(abs(log2FoldChange)))

# # A tibble: 8 × 7
#   gene   baseMean log2FoldChange lfcSE  stat    pvalue     padj
#   <chr>     <dbl>          <dbl> <dbl> <dbl>     <dbl>    <dbl>
# 1 gene4     14.6           -4.93 1.23  -4.02 0.0000585 0.000293
# 2 gene2     73.7           -3.70 0.867 -4.26 0.0000203 0.000203
# 3 gene7      3.44          -3.63 1.64  -2.21 0.0270    0.0337  
# 4 gene5     85.9            3.23 0.970  3.33 0.000876  0.00288 
# 5 gene8     20.2           -3.09 0.952 -3.25 0.00115   0.00288 
# 6 gene3     20.5           -2.95 1.06  -2.77 0.00552   0.0110  
# 7 gene10    11.4           -2.86 1.15  -2.49 0.0129    0.0184  
# 8 gene6     42.3            2.47 0.912  2.71 0.00679   0.0113

→ References

The SummarizedExperiment vignette has a great illustration of the anatomy of a SummarizedExperiment and how to access them.

The DESeq2 vignette has answers to most of your practical questions.

The DESeq2 paper has answers to most of your theoretical questions.