RQdeltaCT - Relative Quantification of Gene Expression using Delta Ct Methods

Daniel Zalewski [email protected]

5 April 2024

Table of contents

Introduction

RQdeltaCT is an R package developed to perform relative quantification of gene expression using delta Ct methods, proposed by Kenneth J. Livak and Thomas D. Schmittgen in Article1 and Article2.

These methods were designed to analyse gene expression data (Ct values) obtained from real-time PCR experiments. The main idea is to normalise gene expression values using endogenous control gene, present gene expression levels in linear form by using the 2-(value) transformation, and calculate differences in gene expression levels between groups of samples (or technical replicates of a single sample).

There are two main delta Ct methods used for relative quantification. The choice of the best method depends on the study design. A short description of these methods is provided below; for more details, refer to articles in links provided above.

  1. 2-dCt method. In this method, Ct values are normalised by the endogenous control gene (often GAPDH, beta-actin, or other) by subtracting the Ct value of the endogenous control in each sample from the Ct value of the gene of interest in the same samples, obtaining delta Ct (dCt) values. Subsequently, the dCt values are transformed using the 2-dCt formula, summarised by means in the compared study groups, and a ratio of means (fold change) is calculated for a study group. This method is useful in scenarios where samples should be analysed as individual data points, e.g., in comparison between patients and healthy subjects. See example no 5. in Article2.

  2. 2-ddCt method. Similarly to the 2-dCt method, Ct values are normalised by endogenous control gene, but the obtained delta Ct (dCt) values are not exponentially transformed, but summarised by means in the compared study groups, and the mean dCt in a control group is subtracted from the mean dCt in a study group, giving the delta delta Ct (ddCt) value. Subsequently, the ddCt values are transformed using the 2-ddCt formula to obtain the fold change value (also called the RQ value). This method is useful where a compared groups contain rather technical than biological replicates, e.g. where samples of cell line before adding stimulant are compared to samples of the same cell line after stimulation. See examples no. 1 and 2 in Article2.

Presented RQdeltaCT package includes functions that encompass both of these methods either for comparison of independent groups of samples or groups with a paired samples (pairwise analysis). The selection of a suitable method for analysis is up to the user.


To install and load RQdeltaCT package simply run:

remotes::install_github("Donadelnal/RQdeltaCT")
library(RQdeltaCT)

The functions developed within the RQdeltaCT package are designed to be maximally easy to use, even for the users who are beginners to R. The parameters of functions were prepared to sufficiently range all essential tasks and options, and no additional, extensive coding steps are necessary in standard workflow. The package was developed with the intention to be user-friendly and provide an opportunity to perform relative quantification analysis of gene expression using RQdeltaCT package by non-experts in R programming (only a very basic programming skills are required).

The summary of standard workflow

The entire standard workflow of analysis performed using the RQdeltaCT package requires the following external packages:

All plots created by functions of the RQdeltaCT package can be saved as a .tiff files in the working directory (if save.to.tiff parameter is set to TRUE). Image resolution, dimensions, and name can be specified by the user. Also, all generated tables can be saved as .txt files (if save.to.txt parameter is set to TRUE) with name specified by the user.

This vignette includes instructions and examples of usage of main parameters of functions from RQdeltaCT package. For all list of parameters available and their usage, refer to documentation of a particular function.


Data import

Data analysed with the RQdeltaCT package should be in tabular form and contain the following information: group names, sample names, gene names, and Ct values. Flag information can also be included for data filtering purposes. Any other information could exist in the data (user do not have to remove them), but will not be used for analysis.

Files with such tables typically can be exported from software coupled with PCR devices and used to analyse raw data files generated during real-time PCR experiments. Such files are also returned by external software, such as SDS, or even R packages, e.g. qpcR

For user convenience, the RQdeltaCT package provides two functions useful to import tables in .txt or csv. format: * read_Ct_long() - to import tables with long-format structure (each information in columns), * read_Ct_wide() - to import tables with wide-format structure (samples by columns, genes by rows).

NOTE: Imported tables must be free of empty lines.

Reading long-format data using the read_Ct_long() function

Example of a long-format table with a structure suitable for the read_Ct_long() function:

Group Sample Gene Ct Flag
Disease Disease1 Gene1 25.6 OK
Disease Disease2 Gene2 32.9 Undetermined
Control Control1 Gene1 Undetermined OK
Control Control2 Gene2 27.5 OK

For the purpose of presentation of the read_Ct_long() function, this function will be used to import long-format .txt table (data_Ct_long.txt) located in RQdeltaCt package directory:

# Set path to file:
path <- system.file("extdata",
                    "data_Ct_long.txt",
                    package = "RQdeltaCT")

# Import file using path; remember to specify proper separator, decimal character, and numbers of necessary columns:
library(RQdeltaCT)
library(tidyverse)
data.Ct <- read_Ct_long(path = path,
                        sep = "\t",
                        dec = ".",
                        skip = 0,
                        add.column.Flag = TRUE,
                        column.Sample = 1,
                        column.Gene = 2,
                        column.Ct = 5,
                        column.Group = 9,
                        column.Flag = 4)

Let’s look at the data structure:

str(data.Ct)
#> 'data.frame':    1288 obs. of  5 variables:
#>  $ Sample: chr  "Disease1" "Disease10" "Disease12" "Disease13" ...
#>  $ Gene  : chr  "Gene1" "Gene1" "Gene1" "Gene1" ...
#>  $ Ct    : chr  "32.563" "34.648" "35.059" "37.135" ...
#>  $ Group : chr  "Disease" "Disease" "Disease" "Disease" ...
#>  $ Flag  : num  1.38 1.35 1.34 1.2 1.39 ...

The data were imported properly; however, the Flag variable is numeric, but character or factor is required. The Flag column contains a numeric AmpScore parameter, which is often used to evaluate the quality of the amplification curve - curves with AmpScore below 1 are typically considered as low quality and removed from data during analysis. The Flag variable can be changed into character by transforming the numeric AmpScore parameter into a binary variable that contains values “OK” and “Undetermined” according to the applied AmpScore criterion:

library(tidyverse)
data.Ct <- mutate(data.Ct,
                  Flag = ifelse(Flag < 1, "Undetermined", "OK"))
str(data.Ct)
#> 'data.frame':    1288 obs. of  5 variables:
#>  $ Sample: chr  "Disease1" "Disease10" "Disease12" "Disease13" ...
#>  $ Gene  : chr  "Gene1" "Gene1" "Gene1" "Gene1" ...
#>  $ Ct    : chr  "32.563" "34.648" "35.059" "37.135" ...
#>  $ Group : chr  "Disease" "Disease" "Disease" "Disease" ...
#>  $ Flag  : chr  "OK" "OK" "OK" "OK" ...

In this transformation, all AmpScore values in the Flag column that are below 1 were changed to “Undetermined”, otherwise to “OK”. The Flag variable now is character, as required, and the data are ready for further steps of analysis.

Reading wide-format data using the read_Ct_wide() function

The read_Ct_wide() function was designed to import data with gene expression results in the form of a wide-format table (with sample names in the first row and gene names in the first column). Because such a structure does not include names of groups, an additional file is required, containing group names and assigned samples. This second file must contain two columns: column named “Sample” with the names of the samples and column named “Group” with the names of the groups assigned to the samples. The names of the samples in this file must be the same as the names of the columns in the file with Ct values (the order does not have to be kept).

Example structure of a wide-format table suitable for the read_Ct_wide() function:

Gene Sample1 Sample2 Sample3
Gene1 25.4 24.9 25.6
Gene2 21.6 22.5 20.8
Gene3 33.7 Undetermined Undetermined
Gene4 15.8 16.2 17.5

Example structure of an additional file suitable for the read_Ct_wide() function:

Sample Group
Sample1 Control
Sample2 Control
Sample3 Disease
Sample4 Disease

In the following example, the read_Ct_wide() function was used to import and merge a wide-format file with Ct values (data_Ct_wide.txt) and a file with names of groups (data_design.txt) located in RQdeltaCt package directory:

# Set paths to required files: 
path.Ct.file <- system.file("extdata",
                            "data_Ct_wide.txt",
                            package = "RQdeltaCT")
path.design.file <- system.file("extdata",
                                "data_design.txt",
                                package = "RQdeltaCT")

# Import files:
library(tidyverse)
data.Ct <- read_Ct_wide(path.Ct.file = path.Ct.file,
                        path.design.file = path.design.file,
                        sep ="\t",
                        dec = ".")

# Look at the structure:
str(data.Ct)
#> tibble [1,216 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ Gene  : chr [1:1216] "Gene1" "Gene1" "Gene1" "Gene1" ...
#>  $ Sample: chr [1:1216] "Disease1" "Disease10" "Disease12" "Disease13" ...
#>  $ Ct    : chr [1:1216] "32.563" "34.648" "35.059" "37.135" ...
#>  $ Group : chr [1:1216] "Disease" "Disease" "Disease" "Disease" ...

The table imported from the package data object can be directly subjected to further steps of analysis.

Other methods

It also can be a situation, in which an imported wide-format table has samples by rows and genes by columns. There is no need to develop separate function to import file with such a table, we can do:

# Import file, be aware to specify parameters that fit to imported data:
data.Ct.wide <- read.csv(file = "data/data.Ct.wide.vign.txt",
                         header = TRUE,
                         sep = ",")
str(data.Ct.wide)
#> 'data.frame':    64 obs. of  22 variables:
#>  $ X     : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Group : chr  "Disease" "Disease" "Disease" "Disease" ...
#>  $ Sample: chr  "Disease1" "Disease10" "Disease12" "Disease13" ...
#>  $ Gene1 : num  32.6 34.6 35.1 37.1 34 ...
#>  $ Gene2 : chr  "36.554" "37.262" "36.977" "38.295" ...
#>  $ Gene3 : chr  "31.334" "31.161" "32.077" "33.982" ...
#>  $ Gene4 : num  23.4 22.4 24.3 24.9 24.1 ...
#>  $ Gene5 : chr  "35.608" "33.385" "36.374" "36.997" ...
#>  $ Gene7 : num  32.8 33.3 31.3 35.5 31.9 ...
#>  $ Gene8 : num  21.5 22.9 24.7 24 21.6 ...
#>  $ Gene9 : chr  "35.037" "36.36" "35.946" "36.885" ...
#>  $ Gene10: num  26.6 27.1 26.7 27.7 26.8 ...
#>  $ Gene11: num  36.7 34.1 35.4 37.1 35.2 ...
#>  $ Gene12: num  28.2 30.2 29.4 32.5 29.6 ...
#>  $ Gene13: num  28.2 30.5 29.9 33 29.8 ...
#>  $ Gene14: num  27.9 28.5 28.1 30.8 30.9 ...
#>  $ Gene15: num  30 31.2 32 32.1 29.4 ...
#>  $ Gene16: num  21.7 22.5 23.6 24.5 22.7 ...
#>  $ Gene17: num  26.3 27.1 27.6 28.5 27.6 ...
#>  $ Gene18: num  28.2 28.5 29.6 30.2 26.6 ...
#>  $ Gene19: chr  "28.263" "27.358" "28.867" "29.951" ...
#>  $ Gene20: num  32.7 32.9 31 35.9 32 ...

# The imported table is now transformed to a long-format structure. The "X" column is unnecessary and is removed. All variables also are converted to a character to unify the class of variables.
library(tidyverse)
data.Ct <- data.Ct.wide %>%
             select(-X) %>%
             mutate(across(everything(), as.character)) %>%
             pivot_longer(cols = -c(Group, Sample), names_to = "Gene", values_to = "Ct")
str(data.Ct)
#> tibble [1,216 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ Group : chr [1:1216] "Disease" "Disease" "Disease" "Disease" ...
#>  $ Sample: chr [1:1216] "Disease1" "Disease1" "Disease1" "Disease1" ...
#>  $ Gene  : chr [1:1216] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ Ct    : chr [1:1216] "32.563" "36.554" "31.334" "23.415" ...

NOTE: At this stage, the Ct values do not have to be numeric.

NOTE: Data can also be imported to R using user’s own code, but the final object must be a data frame and contain a table with column named “Sample” with sample names, column named “Gene” with gene names, column named “Ct” with raw Ct values, column named “Group” with group names, and optionally column named “Flag” containing flag information (this column should be a class of character or factor).


For package testing purposes, there is also a convenient possibility to use the data objects included in the RQdeltaCT package, named data.Ct (the data with independent groups of samples) and data.Ct.pairwise (with dependent groups of samples, can be used for a pairwise analysis):

data(data.Ct)
str(data.Ct)
#> 'data.frame':    1288 obs. of  5 variables:
#>  $ Sample: chr  "Disease1" "Disease10" "Disease12" "Disease13" ...
#>  $ Gene  : chr  "Gene1" "Gene1" "Gene1" "Gene1" ...
#>  $ Ct    : chr  "32.563" "34.648" "35.059" "37.135" ...
#>  $ Group : chr  "Disease" "Disease" "Disease" "Disease" ...
#>  $ Flag  : chr  "OK" "OK" "OK" "OK" ...

data(data.Ct.pairwise)
str(data.Ct.pairwise)
#> tibble [756 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ Sample: chr [1:756] "Sample01" "Sample01" "Sample01" "Sample01" ...
#>  $ Group : chr [1:756] "After" "After" "After" "After" ...
#>  $ Gene  : chr [1:756] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ Ct    : chr [1:756] "32.563" "36.554" "31.334" "23.415" ...

The data.Ct.pairwise data are a part of the data.Ct data, with a structure transformed to fit a pairwise analysis: the number of samples in the study group (named “Before”) has been equated with the number of the reference group (named “After”). This data are used in this vignette to demonstrate a pairwise approach to the relative quantification of gene expression.


Part A: The workflow for analysis of independent groups of samples

The RQdeltaCT package allows to perform analysis in two variants: a comparison of independent groups of samples and a pairwise comparison of dependent groups of samples. Independent groups contain different samples without inter-group relations. The example of such analysis is the comparison of patients with disease vs. healthy controls. In the pairwise approach, the groups contain either the same set of subjects or different, but paired subjects. The example of pairwise analysis is the comparison between the same patients before and after medical intervention. This part of the vignette contains instruction for analysis of independent groups, for the pairwise variant of the workflow refer to [Part B: A pairwise analysis].

Quality control of raw Ct data

The crucial step of each data analysis is an assessment of the quality and usefulness of the data used to investigate the studied problem. The RQdeltaCT package offers two functions for quality control of raw Ct data: control_Ct_barplot_sample() (for quality control of samples) and control_Ct_barplot_gene() (for quality control of genes). Both functions require specifying quality control criteria to be applied to Ct values, in order to label each Ct value as reliable or not. These functions return numbers of reliable and unreliable Ct values in each sample or each gene, as well as total number of Ct values obtained from each sample and each gene. These results are presented graphically on barplots. The obtained results are useful for inspecting the analysed data in order to identify samples and genes that should be considered to be removed from the data (based on applied reliability criteria).

Three selection criteria can be set for these functions:

NOTE: This function does not perform data filtering, but only report numbers of Ct values labelled as reliable or not and presents them graphically.

An example of using these functions is provided below:

sample.Ct.control <- control_Ct_barplot_sample(data = data.Ct,
                                               flag.Ct = "Undetermined",
                                               maxCt = 35,
                                               flag = c("Undetermined"),
                                               axis.title.size = 9,
                                               axis.text.size = 7,
                                               plot.title.size = 9,
                                               legend.title.size = 9,
                                               legend.text.size = 9)

gene.Ct.control <- control_Ct_barplot_gene(data = data.Ct,
                                               flag.Ct = "Undetermined",
                                               maxCt = 35,
                                               flag = c("Undetermined"),
                                               axis.title.size = 9,
                                               axis.text.size = 9,
                                               plot.title.size = 9,
                                               legend.title.size = 9,
                                               legend.text.size = 9)

Created plots are displayed on the graphic device, and short information about the returned tables appears. Returned objects are lists that contain two elements: an object with plot and a table with numbers of Ct values labelled as reliable (Yes) and unreliable (No), as well as fraction of unreliable Ct values in each gene. To easily identify samples or genes with high number of unreliable values, tables are sorted to show them at the top. To access returned tables, the second element of returned objects should be called:

head(sample.Ct.control[[2]])
#> # A tibble: 6 × 4
#>   Sample    Not.reliable Reliable Not.reliable.fraction
#>   <fct>            <int>    <int>                 <dbl>
#> 1 Control08            9       13                 0.409
#> 2 Control16            9       13                 0.409
#> 3 Control22            9       13                 0.409
#> 4 Control13            8       14                 0.364
#> 5 Control05            7       15                 0.318
#> 6 Control20            7       15                 0.318
head(gene.Ct.control[[2]])
#> # A tibble: 6 × 5
#>   Gene  Group   Not.reliable Reliable Not.reliable.fraction
#>   <fct> <fct>          <int>    <int>                 <dbl>
#> 1 Gene6 Control           48        0                 1    
#> 2 Gene2 Disease           35        5                 0.875
#> 3 Gene9 Disease           34        6                 0.85 
#> 4 Gene2 Control           23        1                 0.958
#> 5 Gene5 Disease           21       19                 0.525
#> 6 Gene9 Control           19        5                 0.792

Visual inspection of returned plots and obtained tables gives a clear image of data quality. The results obtained in the examples show that the Disease group contains more samples than the Control group. Some of the samples have more Ct values (more technical replicates) than other samples. Furthermore, in all samples, the majority of Ct values are reliable.

Regarding genes, Gene6 and Gene8 were investigated in duplicates, other genes have single Ct values. Furthermore, Gene6 was analysed only in the Control group and has all values labeled as unreliable; therefore, it is obvious that this gene should be excluded from the analysis. Some other genes also have many unreliable Ct values (e.g. Gene2, Gene9) and maybe should be considered to be removed from the data.

In some situations, a unified fraction of unreliable data need to be established and used to make decision which samples or genes should be excluded from the analysis. It can be done using the following code, in which the second element of object returned by the control_Ct_barplot_sample() and control_Ct_barplot_gene() functions can be used directly, and a vector with samples or genes for which the fraction of unreliable Ct values is higher than a specified threshold is received:

# Finding samples with more than half of the unreliable Ct values.
low.quality.samples <- filter(sample.Ct.control[[2]], Not.reliable.fraction > 0.5)$Sample
low.quality.samples <- as.vector(low.quality.samples)                        
low.quality.samples
#> character(0)
# Finding genes with more than half of the unreliable Ct values in given group.
low.quality.genes <- filter(gene.Ct.control[[2]], Not.reliable.fraction > 0.5)$Gene
low.quality.genes <- unique(as.vector(low.quality.genes))                        
low.quality.genes
#> [1] "Gene6"  "Gene2"  "Gene9"  "Gene5"  "Gene11"

In the above examples, there is no sample with more than half of the unreliable data. Furthermore, this criterion was met by 5 genes (Gene2, Gene5, Gene6, Gene9, and Gene11); therefore, these genes will be removed from the data in the next step of analysis.


Filtering of raw Ct data

When reliability criteria are finally established for Ct values, and some samples or genes are decided to be excluded from the analysis after quality control of the data, the data with raw Ct values can be filtered using the filter_Ct() function.

As a filtering criteria, a flag used for undetermined Ct values, a maximum of Ct threshold, and a flag used in Flag column can be applied. Furthermore, vectors with samples, genes, and groups to be removed can also be specified:

# Objects returned from the `low_quality_samples()` and `low_quality_genes()`functions can be used directly:
data.CtF <- filter_Ct(data = data.Ct,
                      flag.Ct = "Undetermined",
                      maxCt = 35,
                      flag = c("Undetermined"),
                      remove.Gene = low.quality.genes,
                      remove.Sample = low.quality.samples)

# Check dimensions of data before and after filtering:
dim(data.Ct)
#> [1] 1288    5
dim(data.CtF)
#> [1] 946   5

NOTE: If data contain more than two groups, it is good practice to remove groups that are out of comparison; however, a majority of other functions can deal with more groups unless only two groups are indicated to be given in function’s parameters.


Collapsing technical replicates and imputation of missing data - make_Ct_ready() function

In the next step, filtered Ct data can be subjected to collapsing of technical replicates and (optional) data imputation by means within groups using the make_Ct_ready() function. The term ‘technical replicates’ means observations with the same group name, gene name, and sample name. In the scenario when data contain technical replicates but they should not be collapsed, these technical replicates must be distinguished by different sample names, e.g. Sample1_1, Sample1_2, Sample1_3, etc.

The parameter imput.by.mean.within.groups can be used to control data imputation. If it is set to TRUE, imputation will be done, otherwise missing values will be left in the data. For a better view of the amount of missing values in the data, the information about the number and percentage of missing values is displayed automatically:

# Without imputation:
data.CtF.ready <- make_Ct_ready(data = data.CtF,
                                imput.by.mean.within.groups = FALSE)
# A part of the data with missing values:
as.data.frame(data.CtF.ready)[19:25,]
#>      Group    Sample  Gene1 Gene10 Gene12 Gene13 Gene14 Gene15 Gene16 Gene17
#> 19 Control Control26 34.506 26.266 31.305 30.384 28.956 31.174 22.844 28.401
#> 20 Control Control05     NA 28.339 31.200 30.498 28.562 31.068 21.863 26.798
#> 21 Control Control08     NA 30.892 32.381 33.224 31.643 33.485 25.132 30.133
#> 22 Control Control13     NA 28.976 32.976 33.340 29.684 32.600 24.812 28.904
#> 23 Control Control16     NA 30.689 33.656 32.489 30.335 32.432 24.516 29.772
#> 24 Control Control22     NA 26.988 29.846 31.564 30.587 29.176 22.972 27.208
#> 25 Disease  Disease1 32.563 26.552 28.179 28.227 27.905 30.048 21.691 26.272
#>    Gene18 Gene19 Gene20  Gene3  Gene4  Gene7   Gene8
#> 19 29.012 26.911 33.386 34.256 23.801 31.987 22.5725
#> 20 29.582 28.194 33.680 33.657 24.123 31.019 23.3080
#> 21 31.399 29.908     NA 34.853 26.228     NA 26.4465
#> 22 30.251 28.085     NA 32.600 25.186 34.338 24.9500
#> 23 30.991 28.964     NA 33.432 25.435     NA 25.4190
#> 24 28.941 28.143     NA 34.886 25.056     NA 23.0315
#> 25 28.165 28.263 32.725 31.334 23.415 32.789 21.4550

# With imputation:
data.CtF.ready <- make_Ct_ready(data = data.CtF,
                                imput.by.mean.within.groups = TRUE)
# Missing values were imputed:
as.data.frame(data.CtF.ready)[19:25,]
#>      Group    Sample    Gene1 Gene10 Gene12 Gene13 Gene14 Gene15 Gene16 Gene17
#> 19 Control Control26 34.50600 26.266 31.305 30.384 28.956 31.174 22.844 28.401
#> 20 Control Control05 33.01921 28.339 31.200 30.498 28.562 31.068 21.863 26.798
#> 21 Control Control08 33.01921 30.892 32.381 33.224 31.643 33.485 25.132 30.133
#> 22 Control Control13 33.01921 28.976 32.976 33.340 29.684 32.600 24.812 28.904
#> 23 Control Control16 33.01921 30.689 33.656 32.489 30.335 32.432 24.516 29.772
#> 24 Control Control22 33.01921 26.988 29.846 31.564 30.587 29.176 22.972 27.208
#> 25 Disease  Disease1 32.56300 26.552 28.179 28.227 27.905 30.048 21.691 26.272
#>    Gene18 Gene19   Gene20  Gene3  Gene4    Gene7   Gene8
#> 19 29.012 26.911 33.38600 34.256 23.801 31.98700 22.5725
#> 20 29.582 28.194 33.68000 33.657 24.123 31.01900 23.3080
#> 21 31.399 29.908 32.88615 34.853 26.228 31.83495 26.4465
#> 22 30.251 28.085 32.88615 32.600 25.186 34.33800 24.9500
#> 23 30.991 28.964 32.88615 33.432 25.435 31.83495 25.4190
#> 24 28.941 28.143 32.88615 34.886 25.056 31.83495 23.0315
#> 25 28.165 28.263 32.72500 31.334 23.415 32.78900 21.4550

NOTE: The data imputation process can significantly influence data; therefore, no default value was set to the imput.by.mean.within.groups parameter to force the specification by the user. If there are missing data for a certain gene in the entire group, they will not be imputed and will remain NA.

In general, a majority of functions in RQdeltaCT package can deal with missing data; however, some used methods (e.g. PCA) are sensitive to missing data (see PCA analysis section).

NOTE: The make_Ct_ready() function should be used even if the collapsing of technical replicates and data imputation is not required, because this function also prepares the data structure to fit to further functions.


Reference gene selection

Ideally, the reference gene should have an identical expression level in all samples, but in many situations it is not possible to achieve this, especially when biological replicates are analysed. Therefore, differences between samples are allowed, but the variance should be as low as possible, and it is also recommended that Ct values would not be very low (below 15) or very high (above 30) Kozera and Rapacz 2013.

The RQdeltaCT package includes find_ref_gene() function that can be used to select the best reference gene for normalisation. This function calculates descriptive statistics such as minimum, maximum, standard deviation, and variance, as well as stability scores calculated using the NormFinder (Article) and geNorm (Article) algorithms. Ct values are also presented on a line plot.

NormFinder scores are computed using internal RQdeltaCT::norm_finder() function working on the code adapted from the original NormFinder code. To calculate NormFinder scores, at least two samples must be present in each group. For the geNorm score, the geNorm() function of the ctrlGene package is used. The find_ref_gene() function allows one to choose which of these algorithms should be done by setting the logical parameters: norm.finder.score and genorm.score.

The created plot is displayed on the graphic device. The returned object is a list that contains two elements: an object with plot and a table with results. In the example below, six genes are tested for suitability to be a reference gene:

library(ctrlGene)
# Remember that the number of colors in col parameter should be equal to the number of tested genes:
ref <- find_ref_gene(data = data.CtF.ready,
                     groups = c("Disease","Control"),
                     candidates = c("Gene4", "Gene8","Gene10","Gene16","Gene17", "Gene18"),
                     col = c("#66c2a5", "#fc8d62","#6A6599", "#D62728", "#1F77B4", "black"),
                     angle = 60,
                     axis.text.size = 7,
                     norm.finder.score = TRUE,
                     genorm.score = TRUE)

ref[[2]]
#>            Gene    min     max       sd      var NormFinder_score geNorm_score
#> 1        Gene10 22.043 30.8920 1.894611 3.589551             0.31    1.0511715
#> 2        Gene16 19.801 25.7370 1.359134 1.847245             0.34           NA
#> 3        Gene17 24.436 30.1330 1.370355 1.877874             0.19           NA
#> 4        Gene18 25.323 31.3990 1.329008 1.766262             0.14    0.8127723
#> 5         Gene4 19.295 31.5080 1.998908 3.995635             0.33    1.2335325
#> 6         Gene8 18.314 26.4465 1.724890 2.975244             0.22    0.9001852
#> 7 Gene16-Gene17     NA      NA       NA       NA               NA    0.6930721

NA values are presented because geNorm method returns a pair of genes with the highest stability.
Among tested genes, Gene8, Gene16, Gene17, and Gene18 seem to have the best characteristics to be a reference gene (they have low variance, low NormFinder and low geNorm scores).


Data normalization using reference gene

Data normalization can be performed using delta_Ct() function that calculates delta Ct (dCt) values by subtracting Ct values of reference gene (or mean of the Ct values of reference genes, if more than one reference gene is used) from Ct values of gene of interest across all samples. If 2-dCt method is used, transform should be set to TRUE to tansform dCt values using 2-dCt formula:

# For 2^-dCt^ method:
data.dCt.exp <- delta_Ct(data = data.CtF.ready,
                     normalise = TRUE,
                     ref = "Gene8",
                     transform = TRUE)

If 2-ddCt method is used, transform should be set to FALSE to avoid dCt values transformation that is performed by the consecutive RQ_ddCt() function:

# For 2^-ddCt^ method:
data.dCt <- delta_Ct(data = data.CtF.ready,
                     normalise = TRUE,
                     ref = "Gene8",
                     transform = FALSE)

NOTE: In the scenario where unnormalised data should be analysed, the normalise parameter should be set to FALSE.

Before further processing, non-transformed or transformed dCt data should be subjected to quality control using functions and methods described in Quality control and filtering of transformed Ct data section (see below).


Quality control and filtering of transformed Ct data

Data intended for relative quantification should be subjected to quality assessment. The main purpose is to identify outlier samples that could introduce bias into the results. The RQdeltaCT package offers several functions which facilitate finding outlier samples in data by implementing distribution analysis (control_boxplot_sample() function), hierarchical clustering (control_cluster_sample() function) and principal component analysis PCA (control_pca_sample() function). However, corresponding functions for genes are also provided to evaluate similarities and differences between expression of analysed genes (control_boxplot_gene(), control_cluster_gene() and control_pca_gene() functions).

The abovementioned data quality control functions are designed to be directly applied to data objects returned from the make_Ct_ready() and delta_Ct() functions (see The summary of standard workflow).

Analysis of data distribution

One of the ways to gain insight into data distribution is drawing boxplot that show several statistics, such as median (labelled inside box), first and third quartiles (ranged by box), extreme point in interquartile range (ranged by whiskers), and more distant data as separated points. In the RQdeltaCT package, the control_boxplot_sample() and control_boxplot_gene() functions are included to draw boxplots that present the distribution of samples and genes, respectively.

control_boxplot_sample <- control_boxplot_sample(data = data.dCt,
                                                 axis.text.size = 7)

control_boxplot_gene <- control_boxplot_gene(data = data.dCt,
                                                 by.group = TRUE,
                                                 axis.text.size = 10)

These functions return objects with a plot. The created plots are also displayed on the graphic device.

NOTE: If missing values are present in the data, they will be automatically removed with warning.

Hierarchical clustering

Hierarchical clustering is a convenient method to investigate similarities between variables. Hierarchical clustering of samples and genes can be done using the control_cluster_sample() and control_cluster_gene() functions, respectively. These functions allow drawing dendrograms using various methods of distance calculation (e.g. euclidean, canberra) and agglomeration (e.g. complete, average, single). For more details, refer to the functions documentation.

control_cluster_sample(data = data.dCt,
                       method.dist = "euclidean",
                       method.clust = "average",
                       label.size = 0.6)

control_cluster_gene(data = data.dCt,
                     method.dist = "euclidean",
                     method.clust = "average",
                     label.size = 0.8)

The created plots are displayed on the graphic device.

NOTE: Minimum three samples or genes in data is required for clustering analysis.

PCA analysis

Principal component analysis (PCA) is a data exploratory method, that is commonly used to investigate similarities between variables based on the first principal components that contain the most information about variance. In the RQdeltaCT package, the control_pca_sample() and control_pca_gene() functions are developed to perform PCA analysis for samples and genes, respectively. These functions return objects with a plot. Created plots are also displayed on the graphic device.

control.pca.sample <- control_pca_sample(data = data.dCt,
                                         point.size = 3,
                                         label.size = 2.5,
                                         legend.position = "top")

control.pca.gene <- control_pca_gene(data = data.dCt)

NOTE: PCA algorithm can not deal with missing data (NAs); therefore, variables with NA values are removed before analysis. If at least one NA value occurs in all variables in at least one of the compared group, the analysis can not be done. Imputation of missing data will avoid this issue. Also, a minimum of three samples or genes in the data are required for analysis.


Data filtering after quality control

If any sample or gene was decided to be removed from the data, the filter_transformed_data() function can be used for filtering. Similarly to quality control functions, this function can be directly applied to data objects returned from the make_Ct_ready() and delta_Ct() functions (see The summary of standard workflow).

data.dCtF <- filter_transformed_data(data = data.dCt,
                                     remove.Sample = c("Control11"))

Relative quantification: 2-dCt method

This method is used in studies where samples should be analysed as individual data points, e.g. in analysis of biological replicates. In this method, Ct values are normalised by the endogenous control gene by subtracting the Ct value of the endogenous control from the Ct value of the gene of interest, in the same sample. Obtained delta Ct (dCt) values are subsequently transformed using the 2-dCt formula, summarised by means in the compared study groups, and a ratio of means (fold change) is calculated for the study group (see [Introduction] section).

The whole process can be done using RQ_dCt() function, which performs:
+ calculation of means (returned in columns with the “_mean” pattern) and standard deviations (returned in columns with the “_sd” pattern) of transformed dCt values of genes analysed in the compared groups. + normality testing (Shapiro_Wilk test) of transformed dCt values of analysed genes in compared groups and returned p values are stored in columns with the “_norm_p” pattern. + calculation of fold change values for each gene by dividing the mean of transformed dCt values in the study group by the mean of transformed dCt values in the reference group. Fold change values are returned in the “FCh” column. + statistical testing of differences in transformed Ct values between the study group and the reference group. Student’s t test and Mann-Whitney U test are implemented and the resulting statistics (in column with the “_test_stat” pattern), p values (in column with the “_test_p” pattern), and adjusted p values (in column with the “_test_p_adj” pattern) are returned. The Benjamini-Hochberg method for adjustment of p values is used in default. If compared groups contain less than three samples, normality and statistical tests are not possible to perform (the do.test parameter should be set to FALSE to avoid error).

NOTE: For this method, delta Ct (dCt) values transformed by the 2-dCt formula using delta_Ct() function (called with transform = TRUE) should be used.

data.dCt.exp <- delta_Ct(data = data.CtF.ready,
                         ref = "Gene8",
                         transform = TRUE)
library(coin)
results.dCt <- RQ_dCt(data = data.dCt.exp,
                            do.tests = TRUE,
                            group.study = "Disease",
                            group.ref = "Control")

# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
head(as.data.frame(arrange(results.dCt, MW_test_p)))
#>     Gene Control_mean Disease_mean  Control_sd  Disease_sd Control_norm_p
#> 1  Gene1  0.001779634  0.007572040 0.002442710 0.007962192   3.435812e-06
#> 2 Gene19  0.058796934  0.024802219 0.060456391 0.027097216   3.400562e-07
#> 3 Gene13  0.013651779  0.013343277 0.028406905 0.008790133   5.964156e-09
#> 4 Gene16  1.509777018  0.967490766 1.244618540 0.609112278   1.086815e-06
#> 5 Gene12  0.016030664  0.340252219 0.018392782 1.013461602   1.391742e-05
#> 6  Gene7  0.003424749  0.001911581 0.004925526 0.001866461   1.322123e-07
#>   Disease_norm_p        FCh     log10FCh     t_test_p t_test_stat    MW_test_p
#> 1   1.175080e-05  4.2548290  0.628882107 8.483229e-05 -4.27774491 5.136855e-05
#> 2   4.020200e-08  0.4218284 -0.374864146 1.450713e-02  2.60232741 5.449915e-05
#> 3   2.002333e-03  0.9774021 -0.009926733 9.591380e-01  0.05173793 1.009811e-02
#> 4   4.020298e-03  0.6408170 -0.193265981 5.517804e-02  1.99590948 1.255492e-02
#> 5   6.309057e-12 21.2250864  1.326849466 4.998438e-02 -2.02276492 1.410617e-02
#> 6   3.737597e-07  0.5581668 -0.253236035 1.602119e-01  1.44408971 6.717362e-02
#>   MW_test_stat t_test_p_adj MW_test_p_adj
#> 1    -4.049311  0.001187652  0.0003814941
#> 2     4.035444  0.101549928  0.0003814941
#> 3    -2.572452  0.974195838  0.0394972722
#> 4     2.496151  0.193123123  0.0394972722
#> 5    -2.454548  0.193123123  0.0394972722
#> 6     1.830511  0.448593384  0.1567384376

Relative quantification: 2-ddCt method

Similarly to the 2-dCt method, in the 2-ddCt method Ct values are normalised by the endogenous control gene, obtaining dCt values. Subsequently, dCt values are summarised by means in the compared groups, and for each gene, the obtained mean in the control group is subtracted from the mean in the study group, giving the delta delta Ct (ddCt) value. Finally, the ddCt values are transformed using the 2-ddCt formula to obtain the fold change values. This method is recommended for analysis of technical replicates (see the [Introduction] section).

The whole process can be done using RQ_ddCt() function, which performs:
+ calculation of means (returned in columns with the “_mean” pattern) and standard deviations (returned in columns with the “_sd” pattern) of delta Ct values of the analyzed genes in the compared groups. + normality testing (Shapiro_Wilk test) of delta Ct values of the analyzed genes in the compared groups and returned p values are stored in columns with the “_norm_p” pattern. + calculation of differences in the mean delta Ct values of genes between compared groups, returned in “ddCt” column, + calculation of fold change values (returned in “FCh” column) for each gene by transforming the ddCt values using the 2-ddCt formula. + statistical testing of differences between the compared groups. Student’s t test and Mann-Whitney U test are implemented and the resulted statistics (in column with the “_test_stat” pattern), p values (in column with the “_test_p” pattern) and adjusted p values (in column with the “_test_p_adj” pattern) are returned. The Benjamini-Hochberg method for adjustment of p values is used in default. If compared groups contain less than three samples, normality and statistical tests are not possible to perform and do.test parameter should be set to FALSE to avoid error.

NOTE: For this method, not transformed delta Ct (dCt) values (obtained from delta_Ct() function with transform = FALSE) should be used.

data.dCt <- delta_Ct(data = data.CtF.ready,
                     ref = "Gene8",
                     transform = FALSE)
library(coin)
results.ddCt <- RQ_ddCt(data = data.dCt,
                   group.study = "Disease",
                   group.ref = "Control",
                   do.tests = TRUE)

# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
head(as.data.frame(arrange(results.ddCt, MW_test_p)))
#>     Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1  Gene1    10.117002     7.959426  1.6881153  1.8871311    0.426642224
#> 2 Gene19     4.522833     5.938756  1.1662848  1.3467583    0.007240964
#> 3 Gene13     7.216583     6.565200  1.4577193  1.0627016    0.038235161
#> 4 Gene16    -0.329125     0.338025  0.8145069  0.9570775    0.213259767
#> 5 Gene12     6.675667     5.058075  1.4070518  2.9513180    0.172997230
#> 6  Gene7     8.932742     9.536064  1.4167152  1.2052427    0.614472418
#>   Disease_norm_p       ddCt       FCh   log10FCh     t_test_p t_test_stat
#> 1     0.17446884 -2.1575763 4.4616467  0.6494952 1.690067e-05    4.733411
#> 2     0.56682258  1.4159231 0.3747699 -0.4262353 4.581993e-05   -4.432997
#> 3     0.17567779 -0.6513833 1.5706735  0.1960859 6.426170e-02    1.906189
#> 4     0.57399623  0.6671500 0.6297495 -0.2008322 4.445141e-03   -2.967530
#> 5     0.04414803 -1.6175917 3.0686235  0.4869436 4.508310e-03    2.952083
#> 6     0.61294050  0.6033224 0.6582363 -0.1816182 8.872007e-02   -1.742051
#>      MW_test_p MW_test_stat t_test_p_adj MW_test_p_adj
#> 1 5.136855e-05     4.049311 0.0002366094  0.0003814941
#> 2 5.449915e-05    -4.035444 0.0003207395  0.0003814941
#> 3 1.009811e-02     2.572452 0.1799327522  0.0394972722
#> 4 1.255492e-02    -2.496151 0.0157790854  0.0394972722
#> 5 1.410617e-02     2.454548 0.0157790854  0.0394972722
#> 6 6.717362e-02    -1.830511 0.2070134948  0.1567384376

Final visualisations

For visualisation of final results, these five functions can be used:

All these functions can be run on all data or on finally selected genes (see sel.Gene parameter). These functions have a large number of parameters, and the user should familiarise with all of them to properly adjust created plots to the user needs.

NOTE: The functions FCh_plot(), results_barplot(), and results_boxplot() also allow to add customised statistical significance labels to plots using ggsignif package. If statistical significance labels should be added to the plot, a vector with labels (e.g., “ns”, “*“,”p = 0.03”) should be provided in the signif.labels parameter. There are two important points that must be taking into account when preparing this vector:

The FCh_plot() function

This function creates a barplot that illustrates fold change values obtained from the analysis, together with an indication of statistical significance. Data returned from the RQ_dCt() and RQ_ddCt() functions can be directly applied to this function (see The summary of standard workflow).

On the barplot, bars of significant genes are distinguished by colors and/or significance labels. The significance of genes can be established by two criteria: p values and (optionally) fold change values. Thresholds for both criteria can be specified. The FCh_plot() function offers various options of which p values are used on the plot:

  • p values from the Student’s t test (if mode = “t”).
  • p values from the Mann-Whitney U test (if mode = “mw”).
  • p values depend on the normality of data (if mode = “depends”). If the data in both compared groups were considered derived from normal distribution (p value of Shapiro_Wilk test > 0.05) - p values of Student’s t test will be used, otherwise p values of Mann-Whitney U test will be used.
  • external p values provided by the user (if mode = “user”). If the user intends to use the p values obtained from the other statistical test, the mode parameter should be set to “user”. In this scenario, before running the FCh_plot() function, the user should prepare a data.frame object named “user” containing two columns: the first column with gene names and the second column with p values (see example below).

The created plot is displayed on the graphic device. The returned object is a lists that contain two elements: an object with plot and a table with results.

# Variant with p values depending on the normality of the data:
library(ggsignif)
# Specifying vector with significance labels: 
signif.labels <- c("****","**","ns."," ns. ","  ns.  ","   ns.   ","    ns.    ","     ns.     ","      ns.      ","       ns.       ","        ns.        ","         ns.         ","          ns.          ","***")
# Genes with p < 0.05 and 2-fold changed expression between compared groups are considered significant: 
FCh.plot <- FCh_plot(data = results.ddCt,
                   use.p = TRUE,
                   mode = "depends",
                   p.threshold = 0.05,
                   use.FCh = TRUE,
                   FCh.threshold = 2,
                   signif.show = TRUE,
                   signif.labels = signif.labels,
                   angle = 20)

# Access the table with results:
head(as.data.frame(FCh.plot[[2]]))
#>     Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1  Gene1    10.117002     7.959426   1.688115   1.887131     0.42664222
#> 2 Gene12     6.675667     5.058075   1.407052   2.951318     0.17299723
#> 3 Gene13     7.216583     6.565200   1.457719   1.062702     0.03823516
#> 4 Gene20     9.983942     9.481083   1.571200   1.143782     0.68169170
#> 5  Gene4     0.584125     0.360150   1.358816   1.670593     0.03149060
#> 6 Gene10     4.101125     3.922625   1.336964   1.381818     0.47231242
#>   Disease_norm_p       ddCt      FCh   log10FCh     t_test_p t_test_stat
#> 1   1.744688e-01 -2.1575763 4.461647 0.64949517 1.690067e-05   4.7334112
#> 2   4.414803e-02 -1.6175917 3.068624 0.48694361 4.508310e-03   2.9520830
#> 3   1.756778e-01 -0.6513833 1.570674 0.19608592 6.426170e-02   1.9061886
#> 4   2.041532e-02 -0.5028583 1.417018 0.15137544 1.801127e-01   1.3657411
#> 5   2.060155e-07 -0.2239750 1.167947 0.06742319 5.610445e-01   0.5847601
#> 6   8.555529e-02 -0.1785000 1.131707 0.05373385 6.118866e-01   0.5105974
#>      MW_test_p MW_test_stat t_test_p_adj MW_test_p_adj test.for.comparison
#> 1 5.136855e-05    4.0493114 0.0002366094  0.0003814941    t.student's.test
#> 2 1.410617e-02    2.4545484 0.0157790854  0.0394972722   Mann-Whitney.test
#> 3 1.009811e-02    2.5724516 0.1799327522  0.0394972722   Mann-Whitney.test
#> 4 1.271530e-01    1.5254255 0.3527853502  0.2225177371   Mann-Whitney.test
#> 5 8.551052e-02    1.7195706 0.7138676822  0.1710210453   Mann-Whitney.test
#> 6 6.572160e-01    0.4437602 0.7138676822  0.7902902022    t.student's.test
#>         p.used Selected as significant?
#> 1 1.690067e-05                      Yes
#> 2 1.410617e-02                      Yes
#> 3 1.009811e-02                       No
#> 4 1.271530e-01                       No
#> 5 8.551052e-02                       No
#> 6 6.118866e-01                       No

A variant with user p values - the used p values are calculated using the stats::wilcox.test() function:

user <- data.dCt %>%
          pivot_longer(cols = -c(Group, Sample),
                       names_to = "Gene",
                       values_to = "dCt") %>%
          group_by(Gene) %>%
          summarise(MW_test_p = wilcox.test(dCt ~ Group)$p.value,
                    .groups = "keep")
# The stats::wilcox.test() functions is limited to cases without ties; therefore, a warning "cannot compute exact p-value with ties" will appear when ties occur.

FCh.plot <- FCh_plot(data = results.ddCt,
                   use.p = TRUE,
                   mode = "user",
                   p.threshold = 0.05,
                   use.FCh = TRUE,
                   FCh.threshold = 2,
                   signif.show = TRUE,
                   signif.labels = signif.labels,
                   angle = 30)

# Access the table with results:
head(as.data.frame(FCh.plot[[2]]))
#>     Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1  Gene1    10.117002     7.959426   1.688115   1.887131     0.42664222
#> 2 Gene12     6.675667     5.058075   1.407052   2.951318     0.17299723
#> 3 Gene13     7.216583     6.565200   1.457719   1.062702     0.03823516
#> 4 Gene20     9.983942     9.481083   1.571200   1.143782     0.68169170
#> 5  Gene4     0.584125     0.360150   1.358816   1.670593     0.03149060
#> 6 Gene10     4.101125     3.922625   1.336964   1.381818     0.47231242
#>   Disease_norm_p       ddCt      FCh   log10FCh     t_test_p t_test_stat
#> 1   1.744688e-01 -2.1575763 4.461647 0.64949517 1.690067e-05   4.7334112
#> 2   4.414803e-02 -1.6175917 3.068624 0.48694361 4.508310e-03   2.9520830
#> 3   1.756778e-01 -0.6513833 1.570674 0.19608592 6.426170e-02   1.9061886
#> 4   2.041532e-02 -0.5028583 1.417018 0.15137544 1.801127e-01   1.3657411
#> 5   2.060155e-07 -0.2239750 1.167947 0.06742319 5.610445e-01   0.5847601
#> 6   8.555529e-02 -0.1785000 1.131707 0.05373385 6.118866e-01   0.5105974
#>      MW_test_p MW_test_stat t_test_p_adj MW_test_p_adj       p.used
#> 1 5.136855e-05    4.0493114 0.0002366094  0.0003814941 2.570256e-05
#> 2 1.410617e-02    2.4545484 0.0157790854  0.0394972722 1.360267e-02
#> 3 1.009811e-02    2.5724516 0.1799327522  0.0394972722 1.030219e-02
#> 4 1.271530e-01    1.5254255 0.3527853502  0.2225177371 1.295905e-01
#> 5 8.551052e-02    1.7195706 0.7138676822  0.1710210453 8.683564e-02
#> 6 6.572160e-01    0.4437602 0.7138676822  0.7902902022 6.645315e-01
#>   Selected as significant?
#> 1                      Yes
#> 2                      Yes
#> 3                       No
#> 4                       No
#> 5                       No
#> 6                       No

Three genes (Gene1, Gene12, and Gene19) are shown to meet the used significance criteria.

NOTE: If p values were not calculated due to the low number of samples, or they are not intended to be used to create a plot, the use.p parameter should be set to FALSE.

The results_volcano() function

This function creates a volcano plot that illustrates the arrangement of genes based on fold change values and p values. Significant genes can be pointed out using specified p value and fold change thresholds, and highlighted on the plot by color and (optionally) isolated by thresholds lines. Similarly to FCh_plot() function, data returned from the RQ_dCt() and RQ_ddCt() functions can be directly applied (see The summary of standard workflow) and various sources of used p values are available (from the Student’s t test, the Mann-Whitney U test, depended on the normality of data, or provided by the user).

The created plot is displayed on the graphic device. The returned object is a lists that contain two elements: an object with plot and a table with results.

# Genes with p < 0.05 and 2-fold changed expression between compared groups are considered significant: 
volcano <- results_volcano(data = results.ddCt,
                         mode = "depends",
                         p.threshold = 0.05,
                         FCh.threshold = 2)

# Access the table with results:
head(as.data.frame(volcano[[2]]))
#>     Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1  Gene1    10.117002     7.959426  1.6881153   1.887131    0.426642224
#> 2 Gene10     4.101125     3.922625  1.3369635   1.381818    0.472312423
#> 3 Gene12     6.675667     5.058075  1.4070518   2.951318    0.172997230
#> 4 Gene13     7.216583     6.565200  1.4577193   1.062702    0.038235161
#> 5 Gene14     6.092583     6.530625  1.3929523   1.364024    0.315029911
#> 6 Gene15     7.448292     7.563225  0.9437689   1.234245    0.005740816
#>   Disease_norm_p       ddCt       FCh    log10FCh     t_test_p t_test_stat
#> 1     0.17446884 -2.1575763 4.4616467  0.64949517 1.690067e-05   4.7334112
#> 2     0.08555529 -0.1785000 1.1317066  0.05373385 6.118866e-01   0.5105974
#> 3     0.04414803 -1.6175917 3.0686235  0.48694361 4.508310e-03   2.9520830
#> 4     0.17567779 -0.6513833 1.5706735  0.19608592 6.426170e-02   1.9061886
#> 5     0.91864416  0.4380417 0.7381359 -0.13186368 2.256759e-01  -1.2274336
#> 6     0.22923507  0.1149333 0.9234250 -0.03459838 6.766639e-01  -0.4191284
#>      MW_test_p MW_test_stat t_test_p_adj MW_test_p_adj test.for.comparison
#> 1 5.136855e-05    4.0493114 0.0002366094  0.0003814941    t.student's.test
#> 2 6.572160e-01    0.4437602 0.7138676822  0.7902902022    t.student's.test
#> 3 1.410617e-02    2.4545484 0.0157790854  0.0394972722   Mann-Whitney.test
#> 4 1.009811e-02    2.5724516 0.1799327522  0.0394972722   Mann-Whitney.test
#> 5 2.330239e-01   -1.1926054 0.3527853502  0.3262335180    t.student's.test
#> 6 8.190116e-01   -0.2288165 0.7287149620  0.8820124714   Mann-Whitney.test
#>         p.used Selected as significant?
#> 1 1.690067e-05                      Yes
#> 2 6.118866e-01                       No
#> 3 1.410617e-02                      Yes
#> 4 1.009811e-02                       No
#> 5 2.256759e-01                       No
#> 6 8.190116e-01                       No

The results_boxplot() function

This function creates a boxplot that illustrates the distribution of the data for the genes. It is similar to control_boxplot_gene() function; however, some new options are added, including gene selection, faceting, addition of mean points to boxes, and statistical significance labels.

Data objects returned from the make_Ct_ready() and delta_Ct() functions can be directly applied to this function (see The summary of standard workflow).

final_boxplot <- results_boxplot(data = data.dCtF,
                                 sel.Gene = c("Gene1","Gene12", "Gene16","Gene19"),
                                 by.group = TRUE,
                                 signif.show = TRUE,
                                 signif.labels = c("****","*","***"," * "),
                                 signif.dist = 1.05,
                                 faceting = TRUE,
                                 facet.row = 1,
                                 facet.col = 4,
                                 y.exp.up = 0.1,
                                 angle = 20,
                                 y.axis.title = "dCt")

NOTE: If missing values are present in the data, they will be automatically removed, and a warning will appear.

The results_barplot() function

This function creates a barplot that illustrates mean and standard deviation values of the data for all or selected genes.

Data objects returned from the make_Ct_ready() and delta_Ct() functions can be directly applied to this function (see The summary of standard workflow).

final_barplot <- results_barplot(data = data.dCtF,
                                 sel.Gene = c("Gene1","Gene12","Gene19","Gene20"),
                                 signif.show = TRUE,
                                 signif.labels = c("****","*","***"," * "),
                                 angle = 30,
                                 signif.dist = 1.05,
                                 faceting = TRUE,
                                 facet.row = 1,
                                 facet.col = 4,
                                 y.exp.up = 0.1,
                                 y.axis.title = "dCt")

NOTE: At least two samples in each group are required to calculate the standard deviation and properly generate the plot.

The results_heatmap() function

This function allows to draw heatmap with hierarchical clustering. Various methods of distance calculation (e.g. euclidean, canberra) and agglomeration (e.g. complete, average, single) can be used.

Data objects returned from the make_Ct_ready() and delta_Ct() functions can be directly applied to this function (see The summary of standard workflow).

NOTE: Remember to create named list with colors for groups annotation and pass it in col.groups parameter.

# Create named list with colors for groups annotation:
colors.for.groups = list("Group" = c("Disease"="firebrick1","Control"="green3"))
# Vector of colors for heatmap can be also specified to fit the user needings:
colors <- c("navy","navy","#313695","#4575B4","#74ADD1","#ABD9E9",
            "#E0F3F8","#FFFFBF","#FEE090","#FDAE61","#F46D43",
            "#D73027","#C32B23","#A50026","#8B0000", 
            "#7E0202","#000000")
results_heatmap(data.dCt,
                sel.Gene = "all",
                col.groups = colors.for.groups,
                colors = colors,
                show.colnames = FALSE,
                show.rownames = TRUE,
                fontsize = 11,
                fontsize.row = 11,
                cellwidth = 4)

# Cellwidth parameter was set to 4 to avoid cropping the image on the right side.

The created plot is displayed on the graphic device(if save.to.tiff = FALSE) or saved to .tiff file (if save.to.tiff = TRUE).


Further analyses

Gene expression levels and differences between groups can be further analysed using the following methods and corresponding functions of the RQdeltaCT package:

Data objects returned from the make_Ct_ready() and delta_Ct() functions can be directly applied to all of these functions (see The summary of standard workflow). Moreover, all functions can be run on the entire data or only on selected samples/genes.

PCA and k means clustering

This function allows to simultaneously perform principal component analysis (PCA) and (if do.k.means = TRUE) samples classification using k means method. Number of clusters can be set using k.clust parameter. Results obtained from both methods can be presented on one plot. Obtained clusters are distinguishable by point shapes. Confusion matrix of sample classification is returned as the second element of the returned object and can be used for calculation further parameters of classification performance like precision, accuracy, recall, and others.

pca.kmeans <- pca_kmeans(data.dCt, 
                           sel.Gene = c("Gene1","Gene16","Gene19","Gene20"), 
                           legend.position = "top")

# Access to the confusion matrix:
pca.kmeans[[2]]
#>          
#>           Cluster1 Cluster2
#>   Control       14       10
#>   Disease       19       21

If the legend is to wide and is cropped, setting a vertical position of legend should solve this problem:

pca.kmeans[[1]] + theme(legend.box = "vertical")

Correlation analysis

Correlation analysis is a very useful method to explore linear relationships between variables. The RQdeltaCT package offers corr_sample() and corr_gene() functions to generate and plot correlation matrices of samples and genes, respectively. The correlation coefficients can be calculated using either the Pearson or Spearman algorithm. To facilitate plot interpretation, these functions also have possibilities to order samples or genes according to several methods, e.g. hierarchical clustering or PCA first component (see order parameter).

library(Hmisc)
library(corrplot)
# To make the plot more readable, only part of the data was used:
corr.samples <- corr_sample(data = data.dCt[15:30, ],
                            method = "pearson",
                            order = "hclust",
                            size = 0.7,
                            p.adjust.method = "BH")

library(Hmisc)
library(corrplot)
corr.genes <- corr_gene(data = data.dCt,
                            method = "spearman",
                            order = "FPC",
                            size = 0.7,
                            p.adjust.method = "BH")

The created plots are displayed on the graphic device. The returned objects are tables with computed correlation coefficients, p values, and p values adjusted by Benjamini-Hochberg correction (by default). Tables are sorted by the absolute values of correlation coefficients in descending order.

NOTE: A minimum of 5 samples/target are required for correlation analysis.

Simple linear regression analysis

Linear relationships between pairs of samples or genes can be further analysed using simple linear regression models using the single_pair_sample() function (for analysis of samples) and the single_pair_gene() function for analysis of genes. These functions draw a scatter plot with a simple linear regression line. Regression results such as regression equation, coefficient of determination, F value, or p value can be optionally added to the plot.

library(ggpmisc)
Disease6_Control17 <- single_pair_sample(data = data.dCt,
                                         x = "Disease6",
                                         y = "Control17",
                                         point.size = 3,
                                         labels = TRUE,
                                         label = c("eq", "R2", "p"),
                                         label.position.x = 0.05)

library(ggpmisc)
Gene16_Gene17 <- single_pair_gene(data.dCt,
                                    x = "Gene16",
                                    y = "Gene17",
                                    by.group = TRUE,
                                    point.size = 3,
                                    labels = TRUE,
                                    label = c("eq", "R2", "p"),
                                    label.position.x = c(0.05),
                                    label.position.y = c(1,0.95))

Receiver Operating Characteristic (ROC) analysis

The Receiver Operating Characteristic (ROC) analysis is useful for assessing the performance of sample classification to the particular group, based on gene expression data. In this analysis, ROC curves together with parameters such as the area under curve (AUC), specificity, sensitivity, accuracy, positive and negative predictive value are received. The ROCh() function was designed to perform all of these tasks. This function returns a table with calculated parameters and a plot with multiple panels, each with ROC curve for one gene.

NOTE: The created plot is not displayed on the graphic device, but should be saved as .tiff image (save.to.txt = TRUE) and can be opened directly from the file in the working directory.

library(pROC)
# Remember to specify the numbers of rows (panels.row parameter) and columns (panels.col parameter) to be sufficient to arrange panels: 
roc_parameters <- ROCh(data = data.dCt,
                       sel.Gene = c("Gene1","Gene12","Gene16","Gene19"),
                       groups = c("Disease","Control"),
                       panels.row = 2,
                       panels.col = 2)
roc_parameters
#>     Gene Threshold Specificity Sensitivity Accuracy       ppv       npv
#> 1  Gene1   9.40925       0.775   0.7500000 0.765625 0.6666667 0.8378378
#> 2 Gene12   6.40600       0.700   0.6666667 0.687500 0.5714286 0.7777778
#> 3 Gene16   0.48200       0.475   0.9166667 0.640625 0.5116279 0.9047619
#> 4 Gene19   5.11675       0.725   0.9166667 0.796875 0.6666667 0.9354839
#>     youden       AUC
#> 1 1.525000 0.8041667
#> 2 1.366667 0.6843750
#> 3 1.391667 0.6875000
#> 4 1.641667 0.8031250

Simple logistic regression

Logistic regression is a useful method to investigate the impact of the analysed variable on the odds of the occurrence of the studied experimental condition. In the RQdeltaCT package, log_reg() function allows to calculate for each gene a chances (odds ratio, OR) of being included in the study group when gene expression level increases by one unit (suitable for non-transformed data) or by mean of expression levels (more suitable for transformed data). This function returns a plot and table with the calculated parameters (OR, confidence interval, intercept, coefficient, and p values).

library(oddsratio)

# Remember to set the increment parameter.
log.reg.results <- log_reg(data = data.dCt,
                           increment = 1,
                           sel.Gene = c("Gene1","Gene12","Gene16","Gene19"),
                           group.study = "Disease",
                           group.ref = "Control")

log.reg.results[[2]]
#>     Gene oddsratio CI_low CI_high Increment  Intercept coeficient  p_intercept
#> 1  Gene1     0.540  0.374   0.734         1  6.0820841 -0.6159817 0.0001466894
#> 2 Gene12     0.727  0.534   0.930         1  2.4160726 -0.3189009 0.0085244809
#> 3 Gene16     2.364  1.281   4.908         1  0.5101473  0.8603227 0.0645722078
#> 4 Gene19     2.603  1.572   5.031         1 -4.4442069  0.9567034 0.0027620282
#>         p_coef
#> 1 0.0002879457
#> 2 0.0229318870
#> 3 0.0109936805
#> 4 0.0010298471

Part B: A pairwise analysis

A pairwise analysis regards situations when samples are grouped in pairs. In general, a pairwise analysis is quite similar to the workflow for comparison of independent groups; however, some crucial points are different. Therefore, for the users convenience, a separate part of this vignette was prepared to demonstrate in detail a pairwise variant of analysis using RQdeltaCT workflow.

For a demonstration purposes, the data object named data.Ct.pairwise from RQdeltaCT package is used:

data(data.Ct.pairwise)
str(data.Ct.pairwise)
#> tibble [756 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ Sample: chr [1:756] "Sample01" "Sample01" "Sample01" "Sample01" ...
#>  $ Group : chr [1:756] "After" "After" "After" "After" ...
#>  $ Gene  : chr [1:756] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ Ct    : chr [1:756] "32.563" "36.554" "31.334" "23.415" ...

This dataset contains 21 paired samples analyzed before (“Before” group) and after (“After” group) after the effect of the experimental factor. Total 18 genes were analyzed in these samples.

Quality control of raw Ct data (a pairwise approach)

The crucial step of each data analysis is an assessment of the quality and usefulness of the data used to investigate the studied problem. The RQdeltaCT package offers two functions for quality control of raw Ct data: control_Ct_barplot_sample() (for quality control of samples) and control_Ct_barplot_gene() (for quality control of genes). Both functions require specifying quality control criteria to be applied to Ct values, in order to label each Ct value as reliable or not. These functions return numbers of reliable and unreliable Ct values in each sample or each gene, as well as total number of Ct values obtained from each sample and each gene. These results are presented graphically on barplots. The obtained results are useful for inspecting the analysed data in order to identify samples and genes that should be considered to be removed from the data (based on applied reliability criteria).

Three selection criteria can be set for these functions:

NOTE: This function does not perform data filtering, but only report numbers of Ct values labelled as reliable or not and presents them graphically.

An example of using these functions is provided below:

library(tidyverse)
sample.Ct.control.pairwise <- control_Ct_barplot_sample(data = data.Ct.pairwise,
                                               flag.Ct = "Undetermined",
                                               maxCt = 35,
                                               flag = c("Undetermined"),
                                               axis.title.size = 9,
                                               axis.text.size = 9,
                                               plot.title.size = 9,
                                               legend.title.size = 9,
                                               legend.text.size = 9)

gene.Ct.control.pairwise <- control_Ct_barplot_gene(data = data.Ct.pairwise,
                                               flag.Ct = "Undetermined",
                                               maxCt = 35,
                                               flag = c("Undetermined"),
                                               axis.title.size = 9,
                                               axis.text.size = 9,
                                               plot.title.size = 9,
                                               legend.title.size = 9,
                                               legend.text.size = 9)

Created plots are displayed on the graphic device, and short information about the returned tables appears. Returned objects are lists that contain two elements: an object with plot and a table with numbers of Ct values labelled as reliable (in column “Reliable”) and unreliable (in column “Not.reliable”), as well as fraction of unreliable Ct values in each gene. To easily identify samples or genes with high number of unreliable values, tables are sorted to show them at the top. To access returned tables, the second element of returned objects should be called:

head(sample.Ct.control.pairwise[[2]])
#> # A tibble: 6 × 4
#>   Sample   Not.reliable Reliable Not.reliable.fraction
#>   <fct>           <int>    <int>                 <dbl>
#> 1 Sample13           13       23                 0.361
#> 2 Sample16           13       23                 0.361
#> 3 Sample08           11       25                 0.306
#> 4 Sample05           10       26                 0.278
#> 5 Sample22           10       26                 0.278
#> 6 Sample24            9       27                 0.25
head(gene.Ct.control.pairwise[[2]], 10)
#> # A tibble: 10 × 5
#>    Gene   Group  Not.reliable Reliable Not.reliable.fraction
#>    <fct>  <fct>         <int>    <int>                 <dbl>
#>  1 Gene9  After            21        0                 1    
#>  2 Gene2  After            20        1                 0.952
#>  3 Gene2  Before           20        1                 0.952
#>  4 Gene5  After            16        5                 0.762
#>  5 Gene9  Before           16        5                 0.762
#>  6 Gene11 After            15        6                 0.714
#>  7 Gene1  After            11       10                 0.524
#>  8 Gene11 Before           11       10                 0.524
#>  9 Gene5  Before           11       10                 0.524
#> 10 Gene1  Before            5       16                 0.238

Visual inspection of returned plots and obtained tables gives a clear, preliminary image of data quality. The results obtained in the examples show that the Before group contains the same number of samples as the After group. In all samples, the majority of Ct values are reliable.

Regarding genes, Gene9 has all unreliable Ct values in the After group. Other genes that can be considered to remove from the data are Gene2, Gene5, Gene11, and Gene1.

In some situations, a unified fraction of unreliable data need to be established and used to make decision which samples or genes should be excluded from the analysis. It can be done using the following code, in which the table returned by the control_Ct_barplot_sample() and control_Ct_barplot_gene() functions can be used to identify samples or genes for which the fraction of unreliable Ct values is higher than a specified threshold:

# Finding samples with more than half of the unreliable Ct values.
low.quality.samples.pairwise <- filter(sample.Ct.control.pairwise[[2]],
                                       Not.reliable.fraction > 0.5)$Sample
low.quality.samples.pairwise <- as.vector(low.quality.samples.pairwise)                        
low.quality.samples.pairwise
#> character(0)

In the above example, there is no sample with more than half of the unreliable data.

# Finding genes with more than half of the unreliable Ct values in at least one group.
low.quality.genes.pairwise <- filter(gene.Ct.control.pairwise[[2]],
                                     Not.reliable.fraction > 0.5)$Gene
low.quality.genes.pairwise <- unique(as.vector(low.quality.genes.pairwise))                        
low.quality.genes.pairwise
#> [1] "Gene9"  "Gene2"  "Gene5"  "Gene11" "Gene1"

Five genes (Gene1, Gene2, Gene5, Gene9, and Gene11) have more than half of the unreliable Ct values in at least one group. In this example, these genes will be removed from the data in the next step of analysis.


Filtering of raw Ct data (a pairwise approach)

When reliability criteria are finally established for Ct values, and some samples or genes are decided to be excluded from the analysis after quality control of the data, the data with raw Ct values can be filtered using the filter_Ct() function.

As a filtering criteria, a flag used for undetermined Ct values, a maximum of Ct threshold, and a flag used in Flag column can be applied. Furthermore, vectors with samples, genes, and groups to be removed can also be specified:

# Objects returned from the `low_quality_samples()` and 
# `low_quality_genes()`functions can be used directly:
data.Ct.pairwiseF <- filter_Ct(data = data.Ct.pairwise,
                      flag.Ct = "Undetermined",
                      maxCt = 35,
                      flag = c("Undetermined"),
                      remove.Gene = low.quality.genes.pairwise,
                      remove.Sample = low.quality.samples.pairwise)

# Check dimensions of data before and after filtering:
dim(data.Ct.pairwise)
#> [1] 756   4
dim(data.Ct.pairwiseF)
#> [1] 530   4

NOTE: If data contain more than two groups, it is good practice to remove groups that are out of comparison; however, a majority of other functions can deal with more groups unless only two groups are indicated to be given in function’s parameters.


Collapsing technical replicates and imputation of missing data - make_Ct_ready() function (a pairwise approach)

In the next step, filtered Ct data can be subjected to collapsing of technical replicates and (optional) data imputation by means within groups using the make_Ct_ready() function. The term ‘technical replicates’ means observations with the same group name, gene name, and sample name. In the scenario when data contain technical replicates but they should not be collapsed, these technical replicates must be distinguished by different sample names, e.g. SampleA_1, SampleA_2, SampleA_3, etc.

The parameter imput.by.mean.within.groups can be used to control data imputation. If it is set to TRUE, imputation will be done, otherwise missing values will be left in the data. For a better view of the amount of missing values in the data, the information about the number and percentage of missing values is displayed automatically:

# Without imputation:
data.Ct.pairwiseF.ready <- make_Ct_ready(data = data.Ct.pairwiseF,
                                         imput.by.mean.within.groups = FALSE)
# A part of the data with missing values:
as.data.frame(data.Ct.pairwiseF.ready)[9:19,10:15]
#>    Gene19 Gene20  Gene3  Gene4  Gene7  Gene8
#> 9  28.246 33.375 31.086 22.900 32.242 23.070
#> 10 28.867 33.011 32.077 24.332 32.318 24.677
#> 11 29.951     NA 33.982 24.895     NA 23.973
#> 12 29.459 32.032 31.916 24.139 31.937 22.586
#> 13 29.018 34.362 33.497 24.937 32.363 23.714
#> 14 28.959     NA     NA 23.072 32.809 23.197
#> 15 28.763     NA 32.078 24.051 34.122 24.710
#> 16 27.325 29.509 32.185 21.328 28.986 19.841
#> 17 29.077 32.482 31.011 23.016 32.927 22.667
#> 18 30.415 31.895     NA 23.886 32.974 22.041
#> 19 31.039 34.386 33.739 24.505 33.868 24.492

# With imputation:
data.Ct.pairwiseF.ready <- make_Ct_ready(data = data.Ct.pairwiseF,
                                         imput.by.mean.within.groups = TRUE)
# Missing values were imputed:
as.data.frame(data.Ct.pairwiseF.ready)[9:19,10:15]
#>    Gene19   Gene20    Gene3  Gene4    Gene7  Gene8
#> 9  28.246 33.37500 31.08600 22.900 32.24200 23.070
#> 10 28.867 33.01100 32.07700 24.332 32.31800 24.677
#> 11 29.951 32.91194 33.98200 24.895 32.71465 23.973
#> 12 29.459 32.03200 31.91600 24.139 31.93700 22.586
#> 13 29.018 34.36200 33.49700 24.937 32.36300 23.714
#> 14 28.959 32.91194 32.19953 23.072 32.80900 23.197
#> 15 28.763 32.91194 32.07800 24.051 34.12200 24.710
#> 16 27.325 29.50900 32.18500 21.328 28.98600 19.841
#> 17 29.077 32.48200 31.01100 23.016 32.92700 22.667
#> 18 30.415 31.89500 32.19953 23.886 32.97400 22.041
#> 19 31.039 34.38600 33.73900 24.505 33.86800 24.492

NOTE: The data imputation process can significantly influence data; therefore, no default value was set to the imput.by.mean.within.groups parameter to force the specification by the user. If there are missing data for a certain gene in the entire group, they will not be imputed and will remain NA.

In general, a majority of functions in RQdeltaCT package can deal with missing data; however, some methods (e.g. PCA) are sensitive to missing data (see PCA analysis section).

NOTE: The make_Ct_ready() function should be used even if the collapsing of technical replicates and data imputation is not required, because this function also prepares the data structure to fit to further functions.


Reference gene selection (a pairwise approach)

Ideally, the reference gene should have an identical expression level in all samples, but in many situations it is not possible to achieve this, especially when biological replicates are analysed. Therefore, differences between samples are allowed, but the variance should be as low as possible, and it is also recommended that Ct values would not be very low (below 15) or very high (above 30) Kozera and Rapacz 2013.

The RQdeltaCT package includes find_ref_gene() function that can be used to select the best reference gene for normalisation. This function calculates descriptive statistics such as minimum, maximum, standard deviation, and variance, as well as stability scores calculated using the NormFinder (Article) and geNorm (Article) algorithms. Ct values are also presented on a line plot.

NormFinder scores are computed using internal RQdeltaCT::norm_finder() function working on the code adapted from the original NormFinder code. To calculate NormFinder scores, at least two samples must be present in each group. For the geNorm score, the geNorm() function of the ctrlGene package is used. The find_ref_gene() function allows one to choose which of these algorithms should be done by setting the logical parameters: norm.finder.score and genorm.score.

The created plot is displayed on the graphic device. The returned object is a list that contains two elements: an object with plot and a table with results. In the example below, six genes are tested for suitability to be a reference gene:

library(ctrlGene)
# Remember that the number of colors in col parameter should be equal to the number of tested genes:
ref.pairwise <- find_ref_gene(data = data.Ct.pairwiseF.ready,
                     groups = c("After","Before"),
                     candidates = c("Gene4","Gene13","Gene20"),
                     col = c("#66c2a5", "#fc8d62","#6A6599"),
                     angle = 90,
                     axis.text.size = 7,
                     norm.finder.score = TRUE,
                     genorm.score = TRUE)

ref.pairwise[[2]]
#>           Gene    min    max       sd      var NormFinder_score geNorm_score
#> 1       Gene13 27.184 33.224 1.449397 2.100751             0.17           NA
#> 2       Gene20 29.509 34.871 1.223547 1.497068             0.26    0.9961593
#> 3        Gene4 20.499 26.228 1.418726 2.012784             0.15           NA
#> 4 Gene4-Gene13     NA     NA       NA       NA               NA    0.6912531

NA values are caused because geNorm method returns a pair of genes with the highest stability.
Among tested genes, Gene4 seems to have the best characteristics to be a reference gene (it has Ct values below 30, relatively low standard deviation, variance, NormFinder score, and geNorm score).

Data normalization using reference gene (a pairwise approach)

Data normalization can be performed using delta_Ct() function that calculates delta Ct (dCt) values by subtracting Ct values of reference gene (or mean of the Ct values of reference genes, if more than one reference gene is used) from Ct values of gene of interest across all samples.

data.dCt.pairwise <- delta_Ct(data = data.Ct.pairwiseF.ready, ref = "Gene4", transform = FALSE)

If required, this function also performs a transformation of dCt data using the 2-dCt formula. Setting transform = TRUE enable this transformation:

data.dCt.exp.pairwise <- delta_Ct(data = data.Ct.pairwiseF.ready,
                     ref = "Gene4",
                     transform = TRUE)

Before further processing, non-transformed or transformed dCt data should be subjected to quality control using functions and methods described in Quality control and filtering of transformed Ct data section (see below).


Relative quantification: 2-dCt method (a pairwise approach)

This method is used in studies where samples should be analysed as individual data points, e.g. in analysis of biological replicates (see [Introduction] section). In the pairwise variant of this method, Ct values are normalised by the endogenous control gene by subtracting the Ct value of the endogenous control from the Ct value of the gene of interest, in the same sample. Obtained delta Ct (dCt) values are subsequently transformed using the 2-dCt formula, and a ratio (fold change) of transformed Ct values is calculated individually for each pair of samples, and summarised by mean for each gene.

The whole process can be done using RQ_dCt() function, which performs:
+ calculation of means (returned in columns with the “_mean” pattern) and standard deviations (returned in columns with the “_sd” pattern) of transformed dCt values of genes analysed in the compared groups. + normality testing (Shapiro_Wilk test) of transformed dCt values of analysed genes in compared groups and returned p values are stored in columns with the “_norm_p” pattern. + calculation of the fold change values. In this pairwise approach (when pairwise = TRUE), individual fold change values are firstly calculated for each sample by dividing transformed Ct value obtained for particular gene in the study group by transformed Ct value obtained for the same gene in the reference group. Subsequently, mean and standard deviation values of individual fold changes are calculated within each group (returned in FCh_mean and FCh_sd columns, respectively). + statistical testing of differences in transformed Ct values between the study group and the reference group. Student’s t test and Mann-Whitney U test are implemented and the resulting statistics (in column with the “_test_stat” pattern), p values (in column with the “_test_p” pattern), and adjusted p values (in column with the “_test_p_adj” pattern) are returned. The Benjamini-Hochberg method for adjustment of p values is used in default. If compared groups contain less than three samples, normality and statistical tests are not possible to perform (the do.test parameter should be set to FALSE to avoid error).

NOTE: For this method, delta Ct (dCt) values transformed by the 2-dCt formula using delta_Ct() function (called with transform = TRUE) should be used.

data.dCt.pairwise <- delta_Ct(data = data.Ct.pairwiseF.ready, ref = "Gene4", transform = FALSE)
library(coin)
results.dCt.pairwise <- RQ_dCt(data = data.dCt.pairwise,
                               do.tests = TRUE,
                               pairwise = TRUE,
                               group.study = "After",
                               group.ref = "Before")

# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
results <- as.data.frame(arrange(results.dCt.pairwise[[1]], MW_test_p))
head(results)
#>     Gene After_mean Before_mean  After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene19  5.2109333   3.7824762 0.8428066 0.9893666   0.43685704    0.03510469
#> 2  Gene8 -0.3372381  -1.0184286 0.8749796 1.0425276   0.04748179    0.17942875
#> 3 Gene16 -0.2142381  -0.7020476 0.7958921 0.6781031   0.19024540    0.07832005
#> 4 Gene14  6.0329048   5.7692381 0.5984831 0.6675626   0.58378047    0.03738612
#> 5  Gene7  9.0139833   8.4662381 0.8176409 1.3667726   0.01484502    0.69000384
#> 6 Gene20  9.2112745   9.6542381 0.9144846 1.2836473   0.17140590    0.21430743
#>         FCh    log10FCh    FCh_sd     t_test_p t_test_stat  MW_test_p
#> 1 1.4700335  0.16732722 0.4499545 0.0001845345    4.573104 0.00102128
#> 2 0.2043090 -0.68971252 2.7242366 0.0537860813    2.049238 0.04565545
#> 3 1.9663253  0.29365536 6.4052867 0.0715023737    1.903242 0.05372471
#> 4 1.0595285  0.02511265 0.1643101 0.2140118947    1.283435 0.24426953
#> 5 1.0967073  0.04009074 0.2375549 0.1444388848    1.518898 0.24426953
#> 6 0.9765855 -0.01028974 0.2018320 0.2733067651   -1.126458 0.41404000
#>   MW_test_stat t_test_p_adj MW_test_p_adj
#> 1    3.2845979  0.002214414    0.01225536
#> 2    1.9985649  0.286009495    0.21489883
#> 3    1.9290496  0.286009495    0.21489883
#> 4    1.1643813  0.513628547    0.58624688
#> 5    1.1643813  0.433316654    0.58624688
#> 6   -0.8168048  0.546613530    0.82808001
# Access to the table with fold change values calculated individually for each pair of sampleS:
FCh <- results.dCt.pairwise[[2]]
head(FCh)
#> # A tibble: 6 × 5
#>   Sample   Gene    After Before   FCh
#>   <chr>    <chr>   <dbl>  <dbl> <dbl>
#> 1 Sample01 Gene10  3.14   2.78  1.13 
#> 2 Sample01 Gene13  7.81   5.67  1.38 
#> 3 Sample01 Gene14  6.49   5.86  1.11 
#> 4 Sample01 Gene15  6.63   6.18  1.07 
#> 5 Sample01 Gene16 -0.724 -0.882 0.821
#> 6 Sample01 Gene17  2.86   3.47  0.824

NOTE: In a pairwise approach (if pairwise = TRUE), the abovementioned results can be found in the first element of a list object returned by RQ_dCt() function. For the results transparency, fold change values calculated individually for each sample pair are also returned as the second element of this object, and can be used to assess the homogeneity of individual fold change values within each analysed gene using methods described in [Quality control and filtering of transformed Ct data - a pairwise approach] section.


Relative quantification: 2-ddCt method (a pairwise approach)

Similarly to the 2-dCt method, in the 2-ddCt method Ct values are normalised by the endogenous control gene, obtaining dCt values. Subsequently, individually for each pair of samples, dCt values obtained in the control group are subtracted from the dCt values in the study group, giving individual delta delta Ct (ddCt) values. Finally, the ddCt values are transformed using the 2-ddCt formula to obtain the fold change values, which then are summarised by mean across analysed genes.

The whole process can be done using RQ_ddCt() function, which performs:
+ calculation of means (returned in columns with the “_mean” pattern) and standard deviations (returned in columns with the “_sd” pattern) of delta Ct values of the analyzed genes in the compared groups. + normality testing (Shapiro_Wilk test) of delta Ct values of the analyzed genes in the compared groups and returned p values are stored in columns with the “_norm_p” pattern. + calculation of differences (delta delta Ct, ddCt values) in the delta Ct values between paired samples by subtracting dCt values obtained in the control group from the dCt values in the study group. + calculation of fold change values using 2-ddCt formula. Fold change values are returned in a separated table (see NOTE below). Subsequently, means (returned in “FCh” column) and standard deviations (returned in “FCh_sd” column) of fold change values are calculated for each gene. + a pairwise statistical testing of differences between the compared groups. A pairwise Student’s t test and Mann-Whitney U test are implemented and the resulted statistics (in column with the “_test_stat” pattern), p values (in column with the “_test_p” pattern) and adjusted p values (in column with the “_test_p_adj” pattern) are returned. The Benjamini-Hochberg method for adjustment of p values is used in default. If compared groups contain less than three samples, normality and statistical tests are not possible to perform and do.test parameter should be set to FALSE to avoid error.

NOTE: For this method, not transformed delta Ct (dCt) values (obtained from delta_Ct() function with transform = FALSE) should be used.

data.dCt.pairwise <- delta_Ct(data = data.Ct.pairwiseF.ready, ref = "Gene4", transform = FALSE)
library(coin)
# Remember to set pairwise = TRUE:
results.ddCt.pairwise <- RQ_ddCt(data = data.dCt.pairwise,
                   group.study = "After",
                   group.ref = "Before",
                   pairwise = TRUE,
                   do.tests = TRUE)

# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
results <- as.data.frame(arrange(results.ddCt.pairwise[[1]], MW_test_p))
head(results)
#>     Gene After_mean Before_mean  After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene19  5.2109333   3.7824762 0.8428066 0.9893666   0.43685704    0.03510469
#> 2  Gene8 -0.3372381  -1.0184286 0.8749796 1.0425276   0.04748179    0.17942875
#> 3 Gene16 -0.2142381  -0.7020476 0.7958921 0.6781031   0.19024540    0.07832005
#> 4 Gene14  6.0329048   5.7692381 0.5984831 0.6675626   0.58378047    0.03738612
#> 5  Gene7  9.0139833   8.4662381 0.8176409 1.3667726   0.01484502    0.69000384
#> 6 Gene20  9.2112745   9.6542381 0.9144846 1.2836473   0.17140590    0.21430743
#>         FCh     log10FCh    FCh_sd     t_test_p t_test_stat  MW_test_p
#> 1 0.6485663 -0.188045654 0.9753027 0.0001845345    4.573104 0.00102128
#> 2 1.1687779  0.067731984 1.7391643 0.0537860813    2.049238 0.04565545
#> 3 0.9811133 -0.008280834 0.9145980 0.0715023737    1.903242 0.05372471
#> 4 1.0206707  0.008885637 0.6804825 0.2140118947    1.283435 0.24426953
#> 5 1.1322906  0.053957891 1.0765954 0.1444388848    1.518898 0.24426953
#> 6 2.5222676  0.401791166 2.6165869 0.2733067651   -1.126458 0.41404000
#>   MW_test_stat t_test_p_adj MW_test_p_adj
#> 1    3.2845979  0.002214414    0.01225536
#> 2    1.9985649  0.286009495    0.21489883
#> 3    1.9290496  0.286009495    0.21489883
#> 4    1.1643813  0.513628547    0.58624688
#> 5    1.1643813  0.433316654    0.58624688
#> 6   -0.8168048  0.546613530    0.82808001
# Access to the table with fold change values calculated individually for each pair of samples:
FCh <- results.ddCt.pairwise[[2]]
head(FCh)
#> # A tibble: 6 × 5
#>   Sample   Gene    After Before   FCh
#>   <chr>    <chr>   <dbl>  <dbl> <dbl>
#> 1 Sample01 Gene10  3.14   2.78  0.783
#> 2 Sample01 Gene13  7.81   5.67  0.227
#> 3 Sample01 Gene14  6.49   5.86  0.647
#> 4 Sample01 Gene15  6.63   6.18  0.732
#> 5 Sample01 Gene16 -0.724 -0.882 0.896
#> 6 Sample01 Gene17  2.86   3.47  1.53

NOTE: In a pairwise approach (if pairwise = TRUE), the results can be found in the first element of a list object returned by RQ_ddCt() function. For the results transparency, fold change values calculated individually for each sample pair can be found in the second element of this object, and can be assessed using methods described in the [Quality control and filtering of transformed Ct data - a pairwise approach] section.


Quality control and filtering of transformed Ct data (a pairwise approach)

Data intended for relative quantification should be subjected to quality assessment. The main purpose is to identify outlier samples that could introduce bias into the results. The RQdeltaCT package offers several functions which facilitate finding outlier samples in data by implementing distribution analysis (control_boxplot_sample() function), hierarchical clustering (control_cluster_sample() function) and principal component analysis PCA (control_pca_sample() function). However, corresponding functions for genes are also provided to evaluate similarities and differences between expression of analysed genes (control_boxplot_gene(), control_cluster_gene() and control_pca_gene() functions).

The abovementioned data quality control functions are designed to be directly applied to data objects returned from the make_Ct_ready() and delta_Ct() functions (see The summary of standard workflow). Moreover, in a pairwise approach, a tables with fold change values obtained from individual pairs of samples (the second element of list returned from RQ_dCt() and RQ_ddCt() functions) can be also passed directly, but the pairwise.FCh parameter of control functions must be set to TRUE.

Analysis of data distribution (a pairwise approach)

One of the ways to gain insight into data distribution is drawing boxplot that show several statistics, such as median (labelled inside box), first and third quartiles (ranged by box), extreme point in interquartile range (ranged by whiskers), and more distant data as separated points. In the RQdeltaCT package, the control_boxplot_sample() and control_boxplot_gene() functions are included to draw boxplots that present the distribution of samples and genes, respectively.

These functions return objects with a plot. The created plots are also displayed on the graphic device.

Boxplots for samples:

control_boxplot_sample <- control_boxplot_sample(data = data.dCt.pairwise,
                                                 axis.text.size = 9,
                                                 y.axis.title = "dCt")

And boxplots for genes:

control_boxplot_gene <- control_boxplot_gene(data = data.dCt.pairwise,
                                             by.group = TRUE,
                                             axis.text.size = 10,
                                             y.axis.title = "dCt")

Similar plots can be created for fold change values data returned from RQ_dCt() and RQ_ddCt() functions:

# Remember to set pairwise.FCh to TRUE:
FCh <- results.dCt.pairwise[[2]]
control.boxplot.sample.pairwise <- control_boxplot_sample(data = FCh,
                                                          pairwise.FCh = TRUE,
                                                          axis.text.size = 9,
                                                          y.axis.title = "Fold change")

# There are some very high values, we can identify them using:
head(arrange(FCh, -FCh))
#> # A tibble: 6 × 5
#>   Sample   Gene    After  Before   FCh
#>   <chr>    <chr>   <dbl>   <dbl> <dbl>
#> 1 Sample15 Gene16 -0.966 -0.0370 26.1 
#> 2 Sample14 Gene16 -1.40  -0.0970 14.4 
#> 3 Sample08 Gene8   0.554  0.0610  9.08
#> 4 Sample19 Gene10  3.46   0.969   3.57
#> 5 Sample09 Gene10  5.08   2.03    2.50
#> 6 Sample14 Gene8  -1.55  -0.634   2.45

Similar visualisation can be done for genes:

control.boxplot.gene.pairwise <- control_boxplot_gene(data = FCh,
                                                 by.group = FALSE,
                                                 pairwise.FCh = TRUE,
                                                 axis.text.size = 10,
                                                 y.axis.title = "Fold change")

NOTE: If missing values are present in the data, they will be automatically removed with warning.

Hierarchical clustering (a pairwise approach)

Hierarchical clustering is a convenient method to investigate similarities between variables. Hierarchical clustering of samples and genes can be done using the control_cluster_sample() and control_cluster_gene() functions, respectively. These functions allow to draw dendrograms using various methods of distance calculation (e.g. euclidean, canberra) and agglomeration (e.g. complete, average, single). For more details, refer to the functions documentation.

# Hierarchical clustering of samples:
control_cluster_sample(data = data.dCt.pairwise,
                       method.dist = "euclidean",
                       method.clust = "average",
                       label.size = 0.6)

# Hierarchical clustering of genes:
control_cluster_gene(data = data.dCt.pairwise,
                     method.dist = "euclidean",
                     method.clust = "average",
                     label.size = 0.8)

Similar plots can be created for fold change values data returned from RQ_dCt() and RQ_ddCt() functions:

# Remember to set pairwise.FCh = TRUE:
control_cluster_sample(data = FCh,
                       pairwise.FCh = TRUE,
                       method.dist = "euclidean",
                       method.clust = "average",
                       label.size = 0.7)

control_cluster_gene(data = FCh,
                     pairwise.FCh = TRUE,
                     method.dist = "euclidean",
                     method.clust = "average",
                     label.size = 0.8)

The created plots are displayed on the graphic device.

NOTE: Minimum three samples or genes in data is required for clustering analysis.

PCA analysis (a pairwise approach)

Principal component analysis (PCA) is a data exploratory method, that is commonly used to investigate similarities between variables based on the first principal components that contain the most information about variance. In the RQdeltaCT package, the control_pca_sample() and control_pca_gene() functions are developed to perform PCA analysis for samples and genes, respectively. These functions return objects with a plot. Created plots are also displayed on the graphic device.

control.pca.sample.pairwise <- control_pca_sample(data = data.dCt.pairwise,
                                         point.size = 3,
                                         label.size = 2.5,
                                         hjust = 0.5,
                                         legend.position = "top")

control.pca.gene.pairwise <- control_pca_gene(data = data.dCt.pairwise,
                                              hjust = 0.5)

Similar plots can be created for fold change values data returned from RQ_dCt() and RQ_ddCt() functions:

control.pca.sample.pairwise <- control_pca_sample(data = FCh,
                                         pairwise.FCh = TRUE,
                                         colors = "black",
                                         point.size = 3,
                                         label.size = 2.5,
                                         hjust = 0.5)

control.pca.gene.pairwise <- control_pca_gene(data = FCh,
                                     pairwise.FCh = TRUE,
                                     color = "black",
                                     hjust = 0.5)

NOTE: PCA algorithm can not deal with missing data (NAs); therefore, variables with NA values are removed before analysis. If at least one NA value occurs in all variables in at least one of the compared group, the analysis can not be done. Imputation of missing data will avoid this issue. Also, a minimum of three samples or genes in the data are required for analysis.


Data filtering after quality control (a pairwise approach)

If any sample or gene was decided to be removed from the data, the filter_transformed_data() function can be used for filtering. Similarly to quality control functions, this function can be directly applied to data objects returned from the make_Ct_ready() and delta_Ct() functions (see The summary of standard workflow).

data.dCt.pairwise.F <- filter_transformed_data(data = data.dCt.pairwise,
                                     remove.Sample = c("Sample22", "Sample23", "Sample15","Sample03"))
dim(data.dCt.pairwise)
#> [1] 42 14
dim(data.dCt.pairwise.F)
#> [1] 34 14

NOTE: Remember, there must be a good objective reason to remove specific samples. It must be done with caution to avoid tuning the data to better match the user expectations.


Final visualisations (a pairwise approach)

For visualisation of final results, these five functions can be used:

All these functions can be run on all data or on finally selected genes (see sel.Gene parameter). These functions have a large number of parameters, and the user should familiarise with all of them to properly adjust created plots to the user needs.

NOTE: The functions FCh_plot(), results_barplot(), and results_boxplot() also allow to add customised statistical significance labels to plots using ggsignif package. If statistical significance labels should be added to the plot, a vector with labels (e.g., “ns”, “*“,”p = 0.03”) should be provided in the signif.labels parameter. There are two important points that must be taking into account when preparing this vector:

Before

# Remember to set pairwise = TRUE:
results.ddCt.pairwise <- RQ_ddCt(data = data.dCt.pairwise.F,
                   group.study = "After",
                   group.ref = "Before",
                   pairwise = TRUE,
                   do.tests = TRUE)

# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
results <- as.data.frame(arrange(results.ddCt.pairwise[[1]], MW_test_p))
head(results)
#>     Gene After_mean Before_mean  After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene19  5.2123294   3.5095882 0.7641196 0.6450896   0.76631561    0.37520243
#> 2  Gene8 -0.1840000  -1.2615882 0.8593104 0.7335902   0.01848096    0.57804549
#> 3 Gene16 -0.1601765  -0.8746471 0.8306935 0.5154298   0.23806535    0.28735006
#> 4 Gene15  7.2110588   6.5512941 0.9833445 0.9034758   0.10927151    0.05594945
#> 5  Gene7  9.0409206   8.3574706 0.7906209 1.3700506   0.02375954    0.98526475
#> 6 Gene17  4.0798824   3.7351176 0.7764462 0.6373618   0.72685827    0.90708293
#>         FCh     log10FCh    FCh_sd     t_test_p t_test_stat    MW_test_p
#> 1 0.4078506 -0.389498890 0.3354031 1.134508e-05    6.261821 0.0005026301
#> 2 0.6323613 -0.199034707 0.5379766 9.009319e-04    4.064665 0.0035993566
#> 3 0.7577723 -0.120461285 0.5356882 1.041055e-02    2.901408 0.0056180461
#> 4 1.1067408  0.044045931 1.4292426 9.330078e-02    1.784581 0.0928595314
#> 5 0.9901632 -0.004293203 0.9777963 8.966683e-02    1.806604 0.1127786546
#> 6 0.9001501 -0.045685066 0.4176108 1.174977e-01    1.654544 0.1239292250
#>   MW_test_stat t_test_p_adj MW_test_p_adj
#> 1     3.479351  0.000136141   0.006031561
#> 2     2.911294  0.005405591   0.021596140
#> 3     2.769279  0.041642189   0.022472184
#> 4     1.680503  0.223921872   0.247858450
#> 5     1.585827  0.223921872   0.247858450
#> 6     1.538488  0.224668939   0.247858450

Three genes (Gene8, Gene16, and Gene19) seem to have statistically significant differences in expression between Before and After groups

The FCh_plot() function (a pairwise approach)

This function creates a barplot that illustrates fold change values obtained from the analysis, together with an indication of statistical significance. The first element of the object returned from the RQ_dCt() and RQ_ddCt() functions worked in a pairwise mode (when pairwise = TRUE,) can be directly applied to this function (see The summary of standard workflow).

On the barplot, bars of significant genes are distinguished by colors and/or significance labels. The significance of genes can be established by two criteria: p values and (optionally) fold change values. Thresholds for both criteria can be specified. The FCh_plot() function offers various options of which p values are used on the plot:

  • p values from the Student’s t test (if mode = “t”) or adjusted p values from the Student’s t test (if mode = “t.adj”).
  • p values from the Mann-Whitney U test (if mode = “mw”) or adjusted p values from the Mann-Whitney U test (if mode = “mw.adj”)
  • p values depend on the normality of data (if mode = “depends”). If the data in both compared groups were considered derived from normal distribution (p value of Shapiro_Wilk test > 0.05) - p values of Student’s t test will be used, otherwise p values of Mann-Whitney U test will be used. For adjusted p values, use mode = “depends.adj”.
  • external p values provided by the user (if mode = “user”). If the user intends to use the p values obtained from the other statistical test, the mode parameter should be set to “user”. In this scenario, before running the FCh_plot() function, the user should prepare a data.frame object named “user” containing two columns: the first column with gene names and the second column with p values (see example below).

The created plot is displayed on the graphic device. The returned object is a lists that contain two elements: an object with plot and a table with results.

Below, a variant with p values depending on the normality of the data is presented. furthermore, genes with p < 0.05 and with at least 1.5-fold changed expression between groups are considered significant.

library(ggsignif)
#  Remember to use the first element of list object returned by `RQ_dCt()` or RQ_ddCt() function:
FCh.plot.pairwise <- FCh_plot(data = results.ddCt.pairwise[[1]],
                   use.p = TRUE,
                   mode = "depends.adj",
                   p.threshold = 0.05,
                   use.FCh = TRUE,
                   FCh.threshold = 1.5,
                   angle = 20)

# Access the table with results:
head(as.data.frame(FCh.plot.pairwise[[2]]))
#>     Gene After_mean Before_mean  After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene10  3.4842353   3.2941765 0.9051358 1.3854897    0.1317249    0.79709435
#> 2 Gene13  6.6754706   6.5255882 0.8060583 0.5893336    0.7451515    0.41911404
#> 3 Gene14  6.0054118   5.6788824 0.4614858 0.6721121    0.5598084    0.01920284
#> 4 Gene15  7.2110588   6.5512941 0.9833445 0.9034758    0.1092715    0.05594945
#> 5 Gene16 -0.1601765  -0.8746471 0.8306935 0.5154298    0.2380653    0.28735006
#> 6 Gene17  4.0798824   3.7351176 0.7764462 0.6373618    0.7268583    0.90708293
#>         FCh    log10FCh    FCh_sd   t_test_p t_test_stat   MW_test_p
#> 1 1.5731857  0.19678000 1.5533045 0.66262724   0.4445108 0.554033858
#> 2 1.1437228  0.05832077 0.7683948 0.57420244   0.5736167 0.758312374
#> 3 0.9376122 -0.02797677 0.5691316 0.13105688   1.5915084 0.162571759
#> 4 1.1067408  0.04404593 1.4292426 0.09330078   1.7845808 0.092859531
#> 5 0.7577723 -0.12046129 0.5356882 0.01041055   2.9014075 0.005618046
#> 6 0.9001501 -0.04568507 0.4176108 0.11749771   1.6545444 0.123929225
#>   MW_test_stat t_test_p_adj MW_test_p_adj test.for.comparison     p.used
#> 1    0.5917263   0.66262724    0.66484063    t.student's.test 0.66262724
#> 2    0.3076977   0.66262724    0.82724986    t.student's.test 0.66262724
#> 3    1.3964742   0.22466894    0.27869444   Mann-Whitney.test 0.27869444
#> 4    1.6805028   0.22392187    0.24785845    t.student's.test 0.22392187
#> 5    2.7692792   0.04164219    0.02247218    t.student's.test 0.04164219
#> 6    1.5384885   0.22466894    0.24785845    t.student's.test 0.22466894
#>   Selected as significant?
#> 1                       No
#> 2                       No
#> 3                       No
#> 4                       No
#> 5                       No
#> 6                       No

Two genes (Gene8 and Gene19) met the significance criteria. Standard deviation bars and statistical significance annotations can be added to the plot:

# Firstly prepare a vector with significance labels specified according to the user needings:

signif.labels <- c("ns.",
                   " ns. ",
                   "  ns.  ",
                   "   ns.   ",
                   "    ns.    ",
                   "     ns.     ",
                   "      ns.      ",
                   "       ns.       ",
                   "        ns.        ",
                   "         ns.         ",
                   "**",
                   "***")
# Remember to set signif.show = TRUE:
FCh.plot.pairwise <- FCh_plot(data = results.ddCt.pairwise[[1]],
                   use.p = TRUE,
                   mode = "depends.adj",
                   p.threshold = 0.05,
                   use.FCh = TRUE,
                   FCh.threshold = 1.5,
                   use.sd = TRUE,
                   signif.show = TRUE,
                   signif.labels = signif.labels,
                   signif.dist = 0.2,
                   angle = 20)

# Variant with user p values - the used p values are calculated using the stats::wilcox.test() function:
user <- data.dCt.pairwise %>%
          pivot_longer(cols = -c(Group, Sample),
                       names_to = "Gene",
                       values_to = "dCt") %>%
          group_by(Gene) %>%
          summarise(MW_test_p = wilcox.test(dCt ~ Group)$p.value,
                    .groups = "keep")
# The stats::wilcox.test() functions is limited to cases without ties; therefore, 
# a warning "cannot compute exact p-value with ties" will appear when ties occur.
signif.labels <- c("ns.",
                   " ns. ",
                   "  ns.  ",
                   "   ns.   ",
                   "    ns.    ",
                   "     ns.     ",
                   "      ns.      ",
                   "       ns.       ",
                   "        ns.        ",
                   "         ns.         ",
                   " * ",
                   "***")
FCh.plot <- FCh_plot(data = results.ddCt.pairwise[[1]],
                   use.p = TRUE,
                   mode = "user",
                   p.threshold = 0.05,
                   use.FCh = TRUE,
                   FCh.threshold = 1.5,
                   use.sd = TRUE,
                   signif.show = TRUE,
                   signif.labels = signif.labels,
                   signif.dist = 0.4,
                   angle = 30)

NOTE: If p values were not calculated due to the low number of samples, or they are not intended to be used to create a plot, the use.p parameter should be set to FALSE.

The results_volcano() function (a pairwise approach)

This function creates a volcano plot that illustrates the arrangement of genes based on fold change values and p values. Significant genes can be pointed out using specified p value and fold change thresholds, and highlighted on the plot by color and (optionally) isolated by thresholds lines. Similarly to FCh_plot() function, data returned from the RQ_dCt() and RQ_ddCt() functions can be directly applied (see The summary of standard workflow) and various sources of used p values are available (from the Student’s t test, the Mann-Whitney U test, depended on the normality of data, or provided by the user).

The created plot is displayed on the graphic device. The returned object is a lists that contain two elements: an object with plot and a table with results.

# Genes with p < 0.05 and 2-fold changed expression between compared groups 
# are considered significant. Remember to use the first element of list object 
# returned by RQ_ddCt() function: 
RQ.volcano.pairwise <- results_volcano(data = results.ddCt.pairwise[[1]],
                         mode = "depends.adj",
                         p.threshold = 0.05,
                         FCh.threshold = 1.5)

# Access the table with results:
head(as.data.frame(RQ.volcano.pairwise[[2]]))
#>     Gene After_mean Before_mean  After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene10  3.4842353   3.2941765 0.9051358 1.3854897    0.1317249    0.79709435
#> 2 Gene13  6.6754706   6.5255882 0.8060583 0.5893336    0.7451515    0.41911404
#> 3 Gene14  6.0054118   5.6788824 0.4614858 0.6721121    0.5598084    0.01920284
#> 4 Gene15  7.2110588   6.5512941 0.9833445 0.9034758    0.1092715    0.05594945
#> 5 Gene16 -0.1601765  -0.8746471 0.8306935 0.5154298    0.2380653    0.28735006
#> 6 Gene17  4.0798824   3.7351176 0.7764462 0.6373618    0.7268583    0.90708293
#>         FCh    log10FCh    FCh_sd   t_test_p t_test_stat   MW_test_p
#> 1 1.5731857  0.19678000 1.5533045 0.66262724   0.4445108 0.554033858
#> 2 1.1437228  0.05832077 0.7683948 0.57420244   0.5736167 0.758312374
#> 3 0.9376122 -0.02797677 0.5691316 0.13105688   1.5915084 0.162571759
#> 4 1.1067408  0.04404593 1.4292426 0.09330078   1.7845808 0.092859531
#> 5 0.7577723 -0.12046129 0.5356882 0.01041055   2.9014075 0.005618046
#> 6 0.9001501 -0.04568507 0.4176108 0.11749771   1.6545444 0.123929225
#>   MW_test_stat t_test_p_adj MW_test_p_adj test.for.comparison     p.used
#> 1    0.5917263   0.66262724    0.66484063    t.student's.test 0.66262724
#> 2    0.3076977   0.66262724    0.82724986    t.student's.test 0.66262724
#> 3    1.3964742   0.22466894    0.27869444   Mann-Whitney.test 0.27869444
#> 4    1.6805028   0.22392187    0.24785845    t.student's.test 0.22392187
#> 5    2.7692792   0.04164219    0.02247218    t.student's.test 0.04164219
#> 6    1.5384885   0.22466894    0.24785845    t.student's.test 0.22466894
#>   Selected as significant?
#> 1                       No
#> 2                       No
#> 3                       No
#> 4                       No
#> 5                       No
#> 6                       No

The results_boxplot() function (a pairwise approach)

This function creates a boxplot that illustrates the distribution of the data for the genes. It is similar to control_boxplot_gene() function; however, some new options are added, including gene selection, faceting, addition of mean points to boxes, and statistical significance labels.

Data objects returned from the make_Ct_ready() and delta_Ct() functions can be directly applied to this function (see The summary of standard workflow).

final.boxplot.pairwise <- results_boxplot(data = data.dCt.pairwise.F,
                                 sel.Gene = c("Gene8","Gene19"),
                                 by.group = TRUE,
                                 signif.show = TRUE,
                                 signif.labels = c("***","*"),
                                 signif.dist = 1.2,
                                 faceting = TRUE,
                                 facet.row = 1,
                                 facet.col = 2,
                                 y.exp.up = 0.1,
                                 y.axis.title = "dCt")

If the order of groups in the legend should be changes (to put Before group as first), the levels of the variable with group names should be reordered:

data.dCt.pairwise.F$Group <- factor(data.dCt.pairwise.F$Group, levels = c("Before", "After"))

final.boxplot.pairwise <- results_boxplot(data = data.dCt.pairwise.F,
                                 sel.Gene = c("Gene8","Gene19"),
                                 by.group = TRUE,
                                 signif.show = TRUE,
                                 signif.labels = c("***","*"),
                                 signif.dist = 1.2,
                                 faceting = TRUE,
                                 facet.row = 1,
                                 facet.col = 2,
                                 y.exp.up = 0.1,
                                 y.axis.title = "dCt")

NOTE: If missing values are present in the data, they will be automatically removed, and a warning will appear.

The results_barplot() function (a pairwise approach)

This function creates a barplot that illustrates mean and standard deviation values of the data for all or selected genes.

Data objects returned from the make_Ct_ready() and delta_Ct() functions can be directly applied to this function (see The summary of standard workflow).

final.barplot.pairwise <- results_barplot(data = data.dCt.pairwise.F,
                                 sel.Gene = c("Gene8","Gene19"),
                                 signif.show = TRUE,
                                 signif.labels = c("***","*"),
                                 angle = 30,
                                 signif.dist = 1.05,
                                 faceting = TRUE,
                                 facet.row = 1,
                                 facet.col = 4,
                                 y.exp.up = 0.1,
                                 y.axis.title = "dCt")

NOTE: At least two samples in each group are required to calculate the standard deviation and properly generate the plot.

The results_heatmap() function (a pairwise approach)

This function allows to draw heatmap with hierarchical clustering. Various methods of distance calculation (e.g. euclidean, canberra) and agglomeration (e.g. complete, average, single) can be used.

NOTE: Remember to create named list with colors for groups annotation and pass it in col.groups parameter.

# Create named list with colors for groups annotation:
colors.for.groups = list("Group" = c("After"="firebrick1", "Before"="green3"))
# Vector of colors for heatmap can be also specified to fit the user needings:
colors <- c("navy","navy","#313695","#313695","#4575B4","#4575B4","#74ADD1","#74ADD1",
            "#ABD9E9","#E0F3F8","#FFFFBF","#FEE090","#FDAE61","#F46D43",
            "#D73027","#C32B23","#A50026","#8B0000", 
            "#7E0202","#000000")
library(pheatmap)
results_heatmap(data.dCt.pairwise.F,
                sel.Gene = "all",
                col.groups = colors.for.groups,
                colors = colors,
                show.colnames = TRUE,
                show.rownames = TRUE,
                fontsize = 11,
                fontsize.row = 10,
                cellwidth = 8,
                angle.col = 90)

# Cellwidth parameter was set to 4 to avoid cropping the image on the right side.

The created plot is displayed on the graphic device(if save.to.tiff = FALSE) or saved to .tiff file (if save.to.tiff = TRUE).

The parallel_plot() function (a pairwise approach)

This function can be used to illustrate a pairwise changes in genes expression.

parallel.plot <- parallel_plot(data = data.dCt.pairwise.F,
                                sel.Gene = c("Gene19","Gene8"),
                               order = c(4,3))


Further analyses (a pairwise approach)

Gene expression levels and differences between groups can be further analysed using the following methods and corresponding functions of the RQdeltaCT package:

Data objects returned from the make_Ct_ready() and delta_Ct() functions can be directly applied to all of these functions (see The summary of standard workflow). Moreover, all functions can be run on the entire data or only on selected samples/genes.

PCA and k means clustering (a pairwise approach)

This function allows to simultaneously perform principal component analysis (PCA) and (if do.k.means = TRUE) samples classification using k means method. Number of clusters can be set using k.clust parameter. Results obtained from both methods can be presented on one plot. Obtained clusters are distinguishable by point shapes. Confusion matrix of sample classification is returned as the second element of the returned object and can be used for calculation further parameters of classification performance like precision, accuracy, recall, and others.

pca.kmeans <- pca_kmeans(data.dCt.pairwise.F, 
                           sel.Gene = c("Gene8","Gene19"), 
                           legend.position = "top")

If the legend is cropped, using a vertical position of legend should solve this problem:

pca.kmeans[[1]] + theme(legend.box = "vertical")

Access to the confusion matrix:

pca.kmeans[[2]]
#>         
#>          Cluster1 Cluster2
#>   Before        1       16
#>   After        15        2

Correlation analysis (a pairwise approach)

Correlation analysis is a very useful method to explore linear relationships between variables. The RQdeltaCT package offers corr_sample() and corr_gene() functions to generate and plot correlation matrices of samples and genes, respectively. The correlation coefficients can be calculated using either the Pearson or Spearman algorithm. To facilitate plot interpretation, these functions also have possibilities to order samples or genes according to several methods, e.g. hierarchical clustering or PCA first component (see order parameter).

library(Hmisc)
library(corrplot)
# To make the plot more readable, only part of the data was used:
corr.samples <- corr_sample(data = data.dCt.pairwise.F[1:10, ],
                            method = "pearson",
                            order = "hclust",
                            size = 0.7,
                            p.adjust.method = "BH",
                            add.coef = "white")

library(Hmisc)
library(corrplot)
corr.genes <- corr_gene(data = data.dCt.pairwise.F,
                            method = "pearson",
                            order = "FPC",
                            size = 0.7,
                            p.adjust.method = "BH")

The created plots are displayed on the graphic device. The returned objects are tables with computed correlation coefficients, p values, and p values adjusted by Benjamini-Hochberg correction (by default). Tables are sorted by the absolute values of correlation coefficients in descending order.

NOTE: A minimum of 5 samples/target are required for correlation analysis.

Simple linear regression analysis (a pairwise approach)

Linear relationships between pairs of samples or genes can be further analysed using simple linear regression models using the single_pair_sample() function (for analysis of samples) and the single_pair_gene() function for analysis of genes. These functions draw a scatter plot with a simple linear regression line. Regression results such as regression equation, coefficient of determination, F value, or p value can be optionally added to the plot.

library(ggpmisc)
Sample08_Sample26 <- single_pair_sample(data = data.dCt.pairwise,
                                        pairwise.data = TRUE,
                                        by.group = TRUE,
                                         x = "Sample08",
                                         y = "Sample26",
                                         point.size = 3,
                                         labels = TRUE,
                                         label = c("eq", "R2", "p"),
                                         label.position.x = 0.05,
                                        label.position.y = c(1, 0.95))

library(ggpmisc)
Gene16_Gene17 <- single_pair_gene(data.dCt.pairwise.F,
                                    x = "Gene16",
                                    y = "Gene17",
                                    by.group = TRUE,
                                    point.size = 3,
                                    labels = TRUE,
                                    label = c("eq", "R2", "p"),
                                    label.position.x = c(0.05),
                                    label.position.y = c(1,0.95))

Receiver Operating Characteristic (ROC) analysis (a pairwise approach)

The Receiver Operating Characteristic (ROC) analysis is useful for assessing the performance of sample classification to the particular group, based on gene expression data. In this analysis, ROC curves together with parameters such as the area under curve (AUC), specificity, sensitivity, accuracy, positive and negative predictive value are received. The ROCh() function was designed to perform all of these tasks. This function returns a table with calculated parameters and a plot with multiple panels, each with ROC curve for one gene.

NOTE: The created plot is not displayed on the graphic device, but should be saved as .tiff image (save.to.txt = TRUE) and can be opened directly from that file in the working directory.

library(pROC)
# Remember to specify the numbers of rows (panels.row parameter) and columns (panels.col parameter) to provide sufficient place to arrange panels: 
roc_parameters <- ROCh(data = data.dCt.pairwise.F,
                       sel.Gene = c("Gene8","Gene19"),
                       groups = c("After","Before"),
                       panels.row = 1,
                       panels.col = 2)
# Access to calculated parameters:
roc_parameters
#>     Gene Threshold Specificity Sensitivity  Accuracy       ppv npv   youden
#> 1 Gene19    4.5975   0.8235294           1 0.9117647 0.8500000   1 1.823529
#> 2  Gene8    0.0615   0.5882353           1 0.7941176 0.7083333   1 1.588235
#>         AUC
#> 1 0.9584775
#> 2 0.8269896

Obtained warning informed us that analyzed genes have more than 1 threshold value for calculated Youden J statistic. To get all thresholds, ROC analysis should be performed separately for each gene:

# Filter data:
data <- data.dCt.pairwise.F[, colnames(data.dCt.pairwise.F) %in% c("Group", "Sample", "Gene19")]
# Perform analysis:
data_roc <- roc(response = data$Group,
            predictor = as.data.frame(data)$Gene19,
            levels = c("Before","After"),
            smooth = FALSE,
            auc = TRUE,
            plot = FALSE,
            ci = TRUE,
            of = "auc",
            quiet = TRUE)
# Gain parameters:
parameters <- coords(data_roc,
                    "best",
                    ret = c("threshold",
                            "specificity",
                            "sensitivity",
                            "accuracy",
                            "ppv",
                            "npv",
                            "youden"))
parameters
#>   threshold specificity sensitivity  accuracy    ppv       npv   youden
#> 1    4.5130   0.9411765   0.8823529 0.9117647 0.9375 0.8888889 1.823529
#> 2    4.5975   1.0000000   0.8235294 0.9117647 1.0000 0.8500000 1.823529
# Gain AUC
data_roc$auc
#> Area under the curve: 0.9585

Simple logistic regression (a pairwise approach)

Logistic regression is a useful method to investigate the impact of the analysed variable on the odds of the occurrence of the studied experimental condition. In the RQdeltaCT package, log_reg() function allows to calculate for each gene a chances (odds ratio, OR) of being included in the study group when gene expression level increases by one unit (suitable for non-transformed data) or by mean of expression levels (more suitable for transformed data). This function returns a plot and table with the calculated parameters (OR, confidence interval, intercept, coefficient, and p values).

library(oddsratio)

# Remember to set the increment parameter.
log.reg.results <- log_reg(data = data.dCt.pairwise.F,
                           increment = 1,
                           sel.Gene = c("Gene8","Gene19"),
                           group.study = "After",
                           group.ref = "Before",
                           log.axis = TRUE)

log.reg.results[[2]]
#>     Gene oddsratio CI_low  CI_high Increment  Intercept coeficient p_intercept
#> 1 Gene19    37.769  5.552 1304.777         1 -15.769827   3.631485 0.006397974
#> 2  Gene8     4.604  1.858   14.908         1   1.095075   1.526832 0.049469132
#>        p_coef
#> 1 0.005683362
#> 2 0.003213778

Increase in dCt values by 1 (two-fold decrease in expression) gives the higher chance to be in the “After” group.


Session info

sessionInfo()
#> R version 4.3.2 (2023-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                   LC_CTYPE=Polish_Poland.utf8   
#> [3] LC_MONETARY=Polish_Poland.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Polish_Poland.utf8    
#> 
#> time zone: Europe/Warsaw
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] pheatmap_1.0.12 oddsratio_2.0.1 pROC_1.18.5     ggpmisc_0.5.5  
#>  [5] ggpp_0.5.6      corrplot_0.92   Hmisc_5.1-1     ggsignif_0.6.4 
#>  [9] coin_1.4-3      survival_3.5-7  ctrlGene_1.0.1  lubridate_1.9.3
#> [13] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3     purrr_1.0.2    
#> [17] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.5.0  
#> [21] tidyverse_2.0.0 RQdeltaCT_1.3.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.0   viridisLite_0.4.2  farver_2.1.1       libcoin_1.0-10    
#>  [5] fastmap_1.1.1      TH.data_1.1-2      GGally_2.2.1       digest_0.6.34     
#>  [9] rpart_4.1.23       timechange_0.2.0   lifecycle_1.0.4    cluster_2.1.6     
#> [13] magrittr_2.0.3     compiler_4.3.2     rlang_1.1.1        sass_0.4.8        
#> [17] tools_4.3.2        utf8_1.2.3         yaml_2.3.8         data.table_1.15.2 
#> [21] knitr_1.45         labeling_0.4.3     htmlwidgets_1.6.4  plyr_1.8.9        
#> [25] RColorBrewer_1.1-3 multcomp_1.4-25    withr_3.0.0        foreign_0.8-86    
#> [29] nnet_7.3-19        grid_4.3.2         stats4_4.3.2       fansi_1.0.5       
#> [33] colorspace_2.1-0   scales_1.3.0       MASS_7.3-60        cli_3.6.1         
#> [37] mvtnorm_1.2-4      rmarkdown_2.26     generics_0.1.3     rstudioapi_0.15.0 
#> [41] tzdb_0.4.0         polynom_1.4-1      cachem_1.0.8       modeltools_0.2-23 
#> [45] splines_4.3.2      parallel_4.3.2     matrixStats_1.0.0  base64enc_0.1-3   
#> [49] vctrs_0.6.3        Matrix_1.6-4       sandwich_3.1-0     jsonlite_1.8.8    
#> [53] SparseM_1.81       confintr_1.0.2     hms_1.1.3          Formula_1.2-5     
#> [57] htmlTable_2.4.2    jquerylib_0.1.4    glue_1.6.2         ggstats_0.5.1     
#> [61] codetools_0.2-19   stringi_1.7.12     gtable_0.3.4       munsell_0.5.0     
#> [65] pillar_1.9.0       htmltools_0.5.7    quantreg_5.97      R6_2.5.1          
#> [69] evaluate_0.23      lattice_0.21-9     highr_0.10         backports_1.4.1   
#> [73] bslib_0.6.1        MatrixModels_0.5-3 Rcpp_1.0.12        gridExtra_2.3     
#> [77] nlme_3.1-164       checkmate_2.3.1    mgcv_1.9-1         xfun_0.42         
#> [81] zoo_1.8-12         pkgconfig_2.0.3