::p_load(
pacman"tidyverse",
"kableExtra",
"openxlsx",
"DESeq2",
"pcaExplorer",
"factoextra",
"ggsci",
"ComplexHeatmap",
"RColorBrewer",
"EnhancedVolcano",
"ggvenn",
"ggpubr",
"kableExtra"
)
Plotting Omics Data
1 Load libraries
2 Plotting Abundance of Representative Terms
# Load COG dictionary
<- read_delim("https://github.com/JirathNuan/r-handviz-workshop/raw/main/datasets/cog_category.txt")
COG_dict
# Load eggNOG-mapper result
<- read.xlsx("https://github.com/JirathNuan/r-handviz-workshop/raw/main/datasets/Dme_chr4.emapper.annotations.xlsx",
emapper_dt startRow = 3,
cols = 7) %>%
filter(COG_category != "-") %>%
group_by(COG_category) %>%
summarize(n = n()) %>%
left_join(COG_dict, by = c("COG_category" = "category")) %>%
mutate(label = paste0(COG_category, ": ", category_name))
# Show first 10 lines of data frame
head(emapper_dt, 10) %>% kbl() %>% kable_styling(full_width = FALSE)
# Plot
ggplot(emapper_dt, aes(x = COG_category, y = n, fill = label)) +
geom_bar(stat = "identity") +
scale_y_continuous(breaks = seq(0, 100, 10)) +
labs(x = "COG category",
y = "Number of sequences",
fill = "COG category") +
theme(legend.key.size = unit(0.5, 'cm'))
- 1
-
cog_category.txt
is a cluster of orthologous groups (COG) dictionary. Use for look up the category names into the plot. - 2
-
Read eggNOG-mapper result from excel file into
emapper_dt
, usingread.xlsx()
from openxlsx library. The excel is read by skipping the first 3 rows and select only 7th column. - 3
-
Then, go to the next step by pipe
%>%
. This step it to filter unclassified COGs-
from theCOG_category
using dplyrfilter()
. - 4
-
Then group the data frame by
COG_category
. - 5
- Count number of COGs presented in this eggNOG-mapper result.
- 6
-
Merge COG dictionary into the result using dplyr
left_join()
. Two data frames are merged by matching the columnCOG_category
fromemapper_dt
with the columncategory
ofCOG_dict
. - 7
-
Then, add new column
label
for the plot legend, by append COG category together with the category name. - 8
-
Plot the result from by showing
COG_category
in x-axis, number of COGsn
in y-axis, fill and add legend by columnlabel
. - 9
-
Plot bar plot using
geom_bar()
andstat = "identity"
- 10
- Set y-axis breaks
- 11
-
Customize label of x- and y-axis, and legend name in
fill
. - 12
-
Adjust size of legend using
theme(legend.key.size)
.
COG_category | n | category_name | label |
---|---|---|---|
A | 5 | RNA processing and modification | A: RNA processing and modification |
B | 1 | Chromatin Structure and dynamics | B: Chromatin Structure and dynamics |
C | 5 | Energy production and conversion | C: Energy production and conversion |
D | 5 | Cell cycle control and mitosis | D: Cell cycle control and mitosis |
G | 1 | Carbohydrate metabolism and transport | G: Carbohydrate metabolism and transport |
I | 4 | Lipid metabolism | I: Lipid metabolism |
J | 6 | Tranlsation | J: Tranlsation |
K | 55 | Transcription | K: Transcription |
L | 6 | Replication and repair | L: Replication and repair |
M | 4 | Cell wall/membrane/envelop biogenesis | M: Cell wall/membrane/envelop biogenesis |
3 Principal Component Analysis
Typically, PCA is used for dimensionality reduction to get lower-dimensional data while keeping the most variation possible by taking only the first few principal components. In principle, SVD (singlular value decomposition) or eigenvalues of the data covariance matrix can be used to evaluate the principal components.
From datacamp, The PCA can be more easily computed by following these five steps:
Normalize data: As these data have different scales, performing PCA on them will result in a biased result. In order to ensure that each attribute contributes equally and to prevent one variable from dominating others, the data set needs to be normalized.
Computing the covariable matrix from the normalized data. This is a symmetric matrix, and each element (i, j) corresponds to the covariance between variables i and j.
Get Eigenvectors and eigenvalues: The term eigenvector refers to a direction in mathematics, such as “vertical” or “90 degrees”. By contrast, an eigenvalue represents how much variance there is in a given direction in the data. Therefore, each eigenvector has a corresponding eigenvalue.
Select the principal components: It’s about choosing the eigenvector with the highest eigenvalue that corresponds to the first principal component. The second principal component is the eigenvector with the second highest eigenvalue, etc.
Transforming data into a new form: A new subspace is defined by the principal components, so the original data is re-oriented by multiplying it by the previously computed eigenvectors. As a result of this transformation, the original data doesn’t change, but instead provides a new perspective to better represent it.
We will demonstrate PCA on a pasilla data set (Brooks et al. 2010). This data set was obtained from an experiment on Drosophila melanogaster cell cultures that investigated the effects of knocking down the splicing factor pasilla using RNAi.
# Load data set
<- read_delim("https://raw.githubusercontent.com/JirathNuan/r-handviz-workshop/main/datasets/cts.tsv")
cts
# Prepare DESeq input, which is expecting a matrix of integers.
<- as.matrix(cts[,-1])
de_input row.names(de_input)<- cts$transcript_name
# Remove NAs
<- de_input[complete.cases(de_input), ]
de_input
# Show first 10 rows of the matrix
kbl(head(de_input, 10))
# Create sample metadata
<- data.frame(sample = colnames(de_input),
coldata sample_group = gsub("[0-9]", "", colnames(de_input)))
# Show how experimental data looks like
kbl(coldata)
- 1
-
Load data set into
cts
data frame, and convert to matrix. - 2
- Removing NA. The whole row will be deleted if NA is observed.
- 3
- Sample metadata should show a relationship between the sample name (from the matrix column), the sample group, and other experimental design.
treated1 | treated2 | treated3 | untreated1 | untreated2 | untreated3 | untreated4 | |
---|---|---|---|---|---|---|---|
FBgn0000003 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
FBgn0000008 | 140 | 88 | 70 | 92 | 161 | 76 | 70 |
FBgn0000014 | 4 | 0 | 0 | 5 | 1 | 0 | 0 |
FBgn0000015 | 1 | 0 | 0 | 0 | 2 | 1 | 2 |
FBgn0000017 | 6205 | 3072 | 3334 | 4664 | 8714 | 3564 | 3150 |
FBgn0000018 | 722 | 299 | 308 | 583 | 761 | 245 | 310 |
FBgn0000022 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
FBgn0000024 | 10 | 7 | 5 | 10 | 11 | 3 | 3 |
FBgn0000028 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
FBgn0000032 | 1698 | 696 | 757 | 1446 | 1713 | 615 | 672 |
sample | sample_group |
---|---|
treated1 | treated |
treated2 | treated |
treated3 | treated |
untreated1 | untreated |
untreated2 | untreated |
untreated3 | untreated |
untreated4 | untreated |
This workshop will demonstrate data normalization using DESeq2
library.
# Create DESeq object by load matrix and experimental design
<- DESeqDataSetFromMatrix(countData = de_input,
dds colData = coldata,
design= ~ sample_group)
# Perform differential expression analysis
<- DESeq(dds)
dds
# Create a normalized matrix of cts data set
<- counts(dds, normalized = TRUE)
cts_norm
# Show how the data looks like
head(cts_norm) %>% kbl()
- 1
- Create DESeq object using count matrix and sample metadata.
- 2
- This function is from DESeq2, to perform differential expression analysis based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution. The analysis started with estimating size factors, estimating dispersion, and fitting the Negative Binomial GLM model and calculate Wald statistics.
- 3
- Create a matrix of normalized count from DESeq2 object after DE analysis.
treated1 | treated2 | treated3 | untreated1 | untreated2 | untreated3 | untreated4 | |
---|---|---|---|---|---|---|---|
FBgn0000003 | 0.0000000 | 0.0000 | 1.200981 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 |
FBgn0000008 | 85.5968034 | 115.5963 | 84.068670 | 80.824908 | 89.7936241 | 117.004615 | 93.123591 |
FBgn0000014 | 2.4456230 | 0.0000 | 0.000000 | 4.392658 | 0.5577244 | 0.000000 | 0.000000 |
FBgn0000015 | 0.6114057 | 0.0000 | 0.000000 | 0.000000 | 1.1154487 | 1.539534 | 2.660674 |
FBgn0000017 | 3793.7726082 | 4035.3632 | 4004.070676 | 4097.471407 | 4860.0101913 | 5486.900612 | 4190.561607 |
FBgn0000018 | 441.4349433 | 392.7648 | 369.902150 | 512.183926 | 424.4282483 | 377.185929 | 412.404476 |
This step is to transform the count matrix through Rlog (regularized log) transforms. Which transform the original count data into log2 scale by fitting a model with a term for each sample and a prior distribution on the coefficients. Use before plotting PCA.
# Plot PCA With pcaExplorer
<- rlogTransformation(dds)
rld_dds <- pcaplot(
pca_dds
rld_dds,intgroup = "sample_group",
ntop = Inf,
pcX = 1,
pcY = 2,
title = "",
text_labels = TRUE
+
) scale_fill_npg() +
scale_color_npg() +
theme(legend.position = "bottom") +
coord_flip()
pca_dds
- 1
- Rlog (regularized log) transforms the original count data into log2 scale by fitting a model with a term for each sample and a prior distribution on the coefficients. Use before plotting PCA.
- 2
- Plot PCA
This step is to examine the principal components and visualizing eigenvalues as a scree plot. A scree plot is used to determine the number of significant principal components in a set of data. It is a line graph that shows the eigenvalues of each component in descending order. This helps to identify which components are most important and which can be discarded.
We’ll demonstrate a scree plot using fviz_eig()
function from factoextra
library.
# Scree plot
<- prcomp(t(assay(rld_dds)))
pcaobj_dds
# Visualizing scree plot
fviz_eig(pcaobj_dds,
addlabels = TRUE,
barfill = pal_npg()(1),
barcolor = "black",
title = "Proportion of explained proportion of variance - cts data set",
ggtheme = theme_gray())
Then, extract genes that are mostly contributed in the principal components. By extracting genes that are primarily responsible for the principal components, it becomes easier to understand the underlying structure of the data and identify potential observations that are associated with certain variables.
# extract the table of the genes with high loadings
<- hi_loadings(pcaobj_dds,
top100_pc topN = 100,
exprTable = counts(dds))
# Show how experimental data looks like
head(top100_pc) %>% kbl() %>% kable_styling(full_width = FALSE)
treated1 | treated2 | treated3 | untreated1 | untreated2 | untreated3 | untreated4 | |
---|---|---|---|---|---|---|---|
FBgn0044047 | 348 | 118 | 151 | 428 | 667 | 200 | 235 |
FBgn0037683 | 1496 | 710 | 918 | 1868 | 2848 | 1090 | 1641 |
FBgn0038381 | 150 | 74 | 79 | 260 | 309 | 121 | 178 |
FBgn0040752 | 5117 | 2778 | 2786 | 5293 | 10550 | 4998 | 4235 |
FBgn0029801 | 1628 | 788 | 917 | 1802 | 3677 | 1256 | 1313 |
FBgn0037635 | 167 | 71 | 92 | 229 | 378 | 139 | 181 |
Then, plot PCA biplotgenespca()
by compute the principal components of the genes, eventually displaying the samples as in a typical biplot visualization.
<- colData(dds)$sample_group
groups_cts
<- scales::hue_pal()(2)[groups_cts]
cols_cts
# with many genes, do not plot the labels of the genes
genespca(
rld_dds,ntop = 100,
choices = c(1, 2),
arrowColors = cols_cts,
groupNames = groups_cts,
useRownamesAsLabels = FALSE,
varname.size = 5,
biplot = TRUE,
alpha = 0.5,
point_size = 2.5
)
Plots the distribution of expression values, either with density lines, boxplots or violin plots.
distro_expr(rld_dds, plot_type = "violin") +
scale_fill_npg() +
scale_color_npg()
distro_expr(rld_dds, plot_type = "boxplot") +
scale_fill_npg() +
scale_color_npg()
See more usage of pcaExplorer: https://federicomarini.github.io/pcaExplorer/articles/pcaExplorer.html#functions-exported-by-the-package-for-standalone-usage
4 Hierarchical Clustering Analysis (Heat maps)
Hierarchical clustering is an unsupervised machine learning algorithm. It is used to group data points into clusters based on their similarity. The algorithm gradually merges or splits clusters after creating a hierarchy. As a result, a dendrogram shows which points are in each cluster and how similar they are.
We’ll demonstrate the hierarchical clustering and plot heat map using pheatmap()
.
# Create annotation column
<- data.frame(sample = colnames(top100_pc),
annot_column group = gsub("[0-9]", "", colnames(top100_pc))) %>%
column_to_rownames(var = "sample")
%>% kbl() annot_column
group | |
---|---|
treated1 | treated |
treated2 | treated |
treated3 | treated |
untreated1 | untreated |
untreated2 | untreated |
untreated3 | untreated |
untreated4 | untreated |
# Create a list of annotation color
<- pal_npg()(2)
sample_pal <- list(group = c(treated = sample_pal[1],
annot_colors untreated = sample_pal[2]))
- Heat map of Top 100 genes that are mostly contributed in the principal components above.
# Heatmap of the top 100 principal components
<- pheatmap(
hm_top100pc
top100_pc,scale = "row",
color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(100),
show_rownames = FALSE,
name = "Normalized count",
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
clustering_method = "complete",
annotation_col = annot_column,
annotation_names_col = FALSE,
annotation_colors = annot_colors,
cutree_rows = 3,
annotation_legend = FALSE
) hm_top100pc
- Heat map of differentially expressed genes.
# Identify differentially expressed genes
<- results(object = dds,
dds_result contrast = c("sample_group", "untreated", "treated"),
tidy = TRUE,
pAdjustMethod = "fdr")
Filter DEG, we’ll use absolute log2FC > 1 and FDR < 0.05
<- dds_result %>%
dds_result_deg filter(padj < 0.05 & abs(log2FoldChange) > 1)
# Extract count matrix of DEGenes
<- cts_norm[rownames(cts_norm) %in% dds_result_deg$row, ]
deg_norm_count
# Plot heatmap of DEGs
# Heatmap of the top 100 principal components
<- pheatmap(
hm_deg
deg_norm_count,scale = "row",
color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(100),
show_rownames = FALSE,
name = "Normalized count",
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
clustering_method = "complete",
annotation_col = annot_column,
annotation_names_col = FALSE,
annotation_colors = annot_colors,
cutree_rows = 2,
annotation_legend = FALSE)
hm_deg
See more
pheatmap()
usage: https://www.reneshbedre.com/blog/heatmap-with-pheatmap-package-r.html
5 Volcano Plot
<- dds_result %>%
volcano_deg mutate(Expression = case_when(
>= 1 & padj <= 0.05 ~ "Upregulated",
log2FoldChange <= -1 & padj <= 0.05 ~ "Downregulated",
log2FoldChange TRUE ~ "Unchanged"))
<- EnhancedVolcano(
p_volcano
volcano_deg,lab = volcano_deg$row,
x = 'log2FoldChange',
y = 'padj',
xlim = c(-5,5),
ylab = bquote( ~ -Log[10] ~ italic(FDR)),
title = NULL,
caption = NULL,
subtitle = NULL,
pCutoff = 0.05,
pointSize = 3,
labSize = 3,
col = c("grey50", "dodgerblue3", "dodgerblue3", "firebrick2"),
legendLabels = c("NS",
expression(abs(log[2] ~ FC) > 1),
"FDR < 0.05",
expression(Pvalue < 0.05 ~ and ~ abs(log[2] ~ FC) > 1))) +
theme_minimal() +
theme(legend.position = "bottom",
panel.border = element_rect(fill = NA, color = "black"),
panel.grid = element_blank())
p_volcano
See more
EnhancedVolcano()
usage: https://bioconductor.org/packages/devel/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html
6 Venn Diagram and Upset Plot
6.1 Plotting Venn diagram with ggvenn
# Load data
<- read_delim("https://github.com/JirathNuan/r-handviz-workshop/raw/main/datasets/mouse_msigdb_6hallmarks.txt")
mouse_hallmarks
# Split into list
<- as.list(mouse_hallmarks)
lst_mouse_hallmark # remove NAs
<- lapply(lst_mouse_hallmark, na.omit)
lst_mouse_hallmark
# Chart
ggvenn(
lst_mouse_hallmark, fill_color = pal_npg()(6),
stroke_size = 0.5,
set_name_size = 3
)
Venn diagrams are simple and easy to interpret, however, ggvenn only allows the visualization of four features at a time. As a result, an alternative of Venn diagram has been developed.
6.2 Venn diagram alternative: Upset Plot
UpSet plots are more efficient than Venn Diagrams at visualizing intersections of multiple sets. They are particularly useful when analyzing large datasets with many variables. They help to identify patterns and relationships between different sets more quickly and easily than other visualizations.
There are 3 different modes of upset plot, distinct (default), intersect, and union. Here we’ll only demonstrate the default mode.
set.seed(123)
# Make present-absent matrix of data list
<- list_to_matrix(lst_mouse_hallmark)
mouse_hm_mat01 # Show how the combination matrix looks like
head(mouse_hm_mat01) %>% kbl()
G2M checkpoint | Apoptosis | DNA Repair | WNT/B-Catenin Signaling | Hedgehog Signaling | TGF_BETA_SIGNALING | |
---|---|---|---|---|---|---|
Aaas | 0 | 0 | 1 | 0 | 0 | 0 |
Abl1 | 1 | 0 | 0 | 0 | 0 | 0 |
Ache | 0 | 0 | 0 | 0 | 1 | 0 |
Acvr1 | 0 | 0 | 0 | 0 | 0 | 1 |
Ada | 0 | 0 | 1 | 0 | 0 | 0 |
Adam17 | 0 | 0 | 0 | 1 | 0 | 0 |
# Make the combination matrix
<- make_comb_mat(lst_mouse_hallmark)
comb_mat comb_mat
A combination matrix with 6 sets and 18 combinations.
ranges of combination set size: c(1, 183).
mode for the combination size: distinct.
sets are on rows.
Top 8 combination sets are:
G2M checkpoint Apoptosis DNA Repair WNT/B-Catenin Signaling Hedgehog Signaling TGF_BETA_SIGNALING code size
x 100000 183
x 010000 147
x 001000 143
x 000001 45
x 000100 31
x 000010 31
x x 110000 4
x x 010001 4
Sets are:
set size
G2M checkpoint 195
Apoptosis 161
DNA Repair 148
WNT/B-Catenin Signaling 42
Hedgehog Signaling 36
TGF_BETA_SIGNALING 53
Referred from a full guide of complexHeatmap: The UpSet plot visualizes the size of each combination set. With the binary code of each combination set, we can calculate the size. There are three modes.
Plot upset plot
<- pal_npg()(4)
pal_upset # Plot upset following the default upset mode.
<-
upset_mouse_hm UpSet(
comb_mat,comb_col = pal_upset[1],
row_names_gp = gpar(fontsize = 10),
right_annotation = upset_right_annotation(
comb_mat,gp = gpar(fill = pal_upset[2]),
add_numbers = TRUE
),top_annotation = upset_top_annotation(
comb_mat,gp = gpar(fill = pal_upset[1]),
add_numbers = TRUE
)
) upset_mouse_hm
See more
UpSet()
usage from a complete reference of complexHeatmap: https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html#upset-plot
7 ggplot2 Based Publication Ready Plots with ggpubr
# Load dataset
<- read_csv("https://github.com/JirathNuan/r-handviz-workshop/raw/main/datasets/mouseLiver_data_ClinicalTraits.csv")
mouseLiver_ctg
# Inspect the data
head(mouseLiver_ctg) %>% kbl()
Mice | Strain | sex | weight_g | length_cm | ab_fat | other_fat | total_fat | 100xfat_weight | Trigly | Total_Chol | HDL_Chol | UC | FFA | Glucose | LDL_plus_VLDL | MCP_1_phys | Insulin_ug_l | Glucose_Insulin | Leptin_pg_ml | Adiponectin | Aortic lesions |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
F2_290 | BxH ApoE-/-, F2 | Female | 36.9 | 9.9 | 2.53 | 2.26 | 4.79 | 12.981030 | 53 | 1167 | 50 | 484 | 121 | 437 | 1117 | 175.850 | 924 | 0.4729437 | 245462.00 | 11.274 | 496250 |
F2_291 | BxH ApoE-/-, F2 | Female | 48.5 | 10.7 | 2.90 | 2.97 | 5.87 | 12.103093 | 61 | 1230 | 32 | 592 | 173 | 572 | 1198 | 92.430 | 5781 | 0.0989448 | 84420.88 | 7.099 | NA |
F2_292 | BxH ApoE-/-, F2 | Male | 45.7 | 10.4 | 1.04 | 2.31 | 3.35 | 7.330416 | 41 | 1285 | 81 | 460 | 96 | 497 | 1204 | 196.398 | 2074 | 0.2396336 | 105889.76 | 5.795 | 218500 |
F2_293 | BxH ApoE-/-, F2 | Male | 50.3 | 10.9 | 0.91 | 1.89 | 2.80 | 5.566600 | 271 | 1299 | 64 | 476 | 122 | 553 | 1235 | 97.466 | 11874 | 0.0465723 | 100398.68 | 5.495 | 61250 |
F2_294 | BxH ApoE-/-, F2 | Male | 44.8 | 9.8 | 1.22 | 2.47 | 3.69 | 8.236607 | 114 | 1410 | 50 | 516 | 118 | 535 | 1360 | 95.452 | 9181 | 0.0582725 | 130846.30 | 6.868 | 243750 |
F2_295 | BxH ApoE-/-, F2 | Male | 39.2 | 10.2 | 3.06 | 2.49 | 5.55 | 14.158163 | 72 | 1533 | 18 | 620 | 106 | 382 | 1515 | 144.270 | 485 | 0.7876289 | 75166.22 | 17.328 | 104250 |
7.1 Histogram
gghistogram(
mouseLiver_ctg,x = "weight_g",
add = "mean",
rug = TRUE,
color = "sex",
fill = "sex",
bins = 30,
palette = "npg",
xlab = "Weight (g)",
ylab = "Frequency"
)
7.2 Density plot
ggdensity(
mouseLiver_ctg,x = "weight_g",
add = "mean",
rug = TRUE,
color = "sex",
fill = "sex",
palette = "npg",
xlab = "Weight (g)",
ylab = "Frequency"
)
7.3 Simple box plot
<- ggboxplot(ToothGrowth, x = "supp", y = "len",
p_boxplot color = "supp", palette = "npg",
add = "jitter",
xlab = "Supplement type",
ylab = "Tooth length")
p_boxplot
Adding statistical significance to the plot
- Pairwise comparison
# Try to calculate stat first
compare_means(len ~ supp, data = ToothGrowth, method = "anova")
compare_means(len ~ supp, data = ToothGrowth, method = "wilcox.test")
# Plot box plot with significance
+ stat_compare_means(method = "anova")
p_boxplot + stat_compare_means(method = "wilcox.test")
p_boxplot # Change style of singnificance notation
+ stat_compare_means(method = "anova",
p_boxplot aes(label = ..p.signif..),
label.x = 1.5,
label.y = 40)
+ stat_compare_means(method = "wilcox.test",
p_boxplot aes(label = "p.signif"),
label.x = 1.5,
label.y = 40)
# A tibble: 1 × 6
.y. p p.adj p.format p.signif method
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 len 0.0604 0.06 0.06 ns Anova
# A tibble: 1 × 8
.y. group1 group2 p p.adj p.format p.signif method
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 len OJ VC 0.0645 0.064 0.064 ns Wilcoxon
- Compare more than two groups
# Global test
compare_means(len ~ dose, data = ToothGrowth, method = "anova")
# Default method = "kruskal.test" for multiple groups
ggboxplot(ToothGrowth, x = "dose", y = "len",
fill = "dose", palette = "npg")+
stat_compare_means()
# Change method to anova
ggboxplot(ToothGrowth, x = "dose", y = "len",
fill = "dose", palette = "npg")+
stat_compare_means(method = "anova")
# A tibble: 1 × 6
.y. p p.adj p.format p.signif method
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 len 9.53e-16 9.5e-16 9.5e-16 **** Anova
# Perorm pairwise comparisons
compare_means(len ~ dose, data = ToothGrowth)
# A tibble: 3 × 8
.y. group1 group2 p p.adj p.format p.signif method
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 len 0.5 1 0.00000702 0.000014 7.0e-06 **** Wilcoxon
2 len 0.5 2 0.0000000841 0.00000025 8.4e-08 **** Wilcoxon
3 len 1 2 0.000177 0.00018 0.00018 *** Wilcoxon
# Visualize: Specify the comparisons you want
<- list(c("0.5", "1"), c("1", "2"), c("0.5", "2"))
my_comparisons
ggboxplot(
ToothGrowth,x = "dose",
y = "len",
fill = "dose",
palette = "npg") +
stat_compare_means(comparisons = my_comparisons) + # Add pairwise comparisons p-value
stat_compare_means(label.y = 50) # Add global p-value
- Multiple pairwise tests against a reference group
# Pairwise comparison against reference
compare_means(len ~ dose,
data = ToothGrowth,
ref.group = "0.5",
method = "t.test")
# A tibble: 2 × 8
.y. group1 group2 p p.adj p.format p.signif method
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 len 0.5 1 1.27e- 7 1.3 e- 7 1.3e-07 **** T-test
2 len 0.5 2 4.40e-14 8.80e-14 4.4e-14 **** T-test
# Visualize
ggboxplot(
ToothGrowth,x = "dose",
y = "len",
fill = "dose",
palette = "npg") +
stat_compare_means(method = "anova", label.y = 40) + # Add global p-value
stat_compare_means(label = "p.signif",
method = "t.test",
ref.group = "0.5") # Pairwise comparison against reference
Multiple grouping variables
compare_means(len ~ supp, data = ToothGrowth, group.by = "dose")
# A tibble: 3 × 9
dose .y. group1 group2 p p.adj p.format p.signif method
<dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 0.5 len OJ VC 0.0232 0.046 0.023 * Wilcoxon
2 1 len OJ VC 0.00403 0.012 0.004 ** Wilcoxon
3 2 len OJ VC 1 1 1.000 ns Wilcoxon
# Box plot facetted by "dose"
<- ggboxplot(
p_box
ToothGrowth,x = "supp",
y = "len",
color = "supp",
palette = "jco",
add = "jitter",
facet.by = "dose",
short.panel.labs = FALSE
)# Use only p.format as label. Remove method name.
+ stat_compare_means(label = "p.format") p_box
7.4 Bar and line plots (one grouping variable):
ggbarplot(
ToothGrowth,x = "dose",
y = "len",
add = "mean_se",
fill = "supp",
palette = "npg",
color = "black",
position = position_dodge(0.8)) +
stat_compare_means(aes(group = supp), label = "p.signif", label.y = 29)
ggline(
ToothGrowth,x = "dose",
y = "len",
add = "mean_se",
color = "supp",
palette = "npg") +
stat_compare_means(aes(group = supp),
label = "p.signif",
label.y = c(16, 25, 29))
7.5 Scatter plot
<- ggscatter(
p4
mtcars,x = "wt",
y = "mpg",
fill = "mpg",
size = 4,
shape = 21) +
gradient_fill(c("blue", "white", "red"))
p4
More usage of ggpubr: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/.
8 Session info
sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Asia/Bangkok
tzcode source: internal
attached base packages:
[1] grid stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ggpubr_0.6.0 ggvenn_0.1.10
[3] EnhancedVolcano_1.18.0 ggrepel_0.9.3
[5] RColorBrewer_1.1-3 ComplexHeatmap_2.16.0
[7] ggsci_3.0.0 factoextra_1.0.7
[9] pcaExplorer_2.26.1 DESeq2_1.40.1
[11] SummarizedExperiment_1.30.1 Biobase_2.60.0
[13] MatrixGenerics_1.12.0 matrixStats_0.63.0
[15] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0
[17] IRanges_2.34.0 S4Vectors_0.38.1
[19] BiocGenerics_0.46.0 openxlsx_4.2.5.2
[21] kableExtra_1.3.4 lubridate_1.9.2
[23] forcats_1.0.0 stringr_1.5.0
[25] dplyr_1.1.2 purrr_1.0.1
[27] readr_2.1.4 tidyr_1.3.0
[29] tibble_3.2.1 ggplot2_3.4.2
[31] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] splines_4.3.0 later_1.3.1 bitops_1.0-7
[4] filelock_1.0.2 graph_1.78.0 XML_3.99-0.14
[7] lifecycle_1.0.3 rstatix_0.7.2 topGO_2.52.0
[10] doParallel_1.0.17 vroom_1.6.3 lattice_0.21-8
[13] crosstalk_1.2.0 backports_1.4.1 dendextend_1.17.1
[16] magrittr_2.0.3 limma_3.56.1 plotly_4.10.1
[19] rmarkdown_2.21 yaml_2.3.7 shinyBS_0.61.1
[22] httpuv_1.6.10 NMF_0.26 zip_2.3.0
[25] DBI_1.1.3 abind_1.4-5 zlibbioc_1.46.0
[28] rvest_1.0.3 RCurl_1.98-1.12 rappdirs_0.3.3
[31] circlize_0.4.15 seriation_1.4.2 GenomeInfoDbData_1.2.10
[34] AnnotationForge_1.42.0 genefilter_1.82.1 pheatmap_1.0.12
[37] annotate_1.78.0 svglite_2.1.1 codetools_0.2-19
[40] DelayedArray_0.26.2 DT_0.27 xml2_1.3.4
[43] shape_1.4.6 tidyselect_1.2.0 GOstats_2.66.0
[46] farver_2.1.1 viridis_0.6.3 TSP_1.2-4
[49] BiocFileCache_2.8.0 base64enc_0.1-3 webshot_0.5.4
[52] jsonlite_1.8.4 GetoptLong_1.0.5 ellipsis_0.3.2
[55] survival_3.5-5 iterators_1.0.14 systemfonts_1.0.4
[58] foreach_1.5.2 tools_4.3.0 progress_1.2.2
[61] Rcpp_1.0.10 glue_1.6.2 gridExtra_2.3
[64] xfun_0.39 ca_0.71.1 shinydashboard_0.7.2
[67] withr_2.5.0 BiocManager_1.30.20 Category_2.66.0
[70] fastmap_1.1.1 fansi_1.0.4 SparseM_1.81
[73] digest_0.6.31 timechange_0.2.0 R6_2.5.1
[76] mime_0.12 colorspace_2.1-0 GO.db_3.17.0
[79] biomaRt_2.56.0 RSQLite_2.3.1 threejs_0.3.3
[82] utf8_1.2.3 generics_0.1.3 data.table_1.14.8
[85] prettyunits_1.1.1 httr_1.4.6 htmlwidgets_1.6.2
[88] S4Arrays_1.0.1 pkgconfig_2.0.3 gtable_0.3.3
[91] blob_1.2.4 registry_0.5-1 XVector_0.40.0
[94] htmltools_0.5.5 carData_3.0-5 RBGL_1.76.0
[97] clue_0.3-64 GSEABase_1.62.0 scales_1.2.1
[100] png_0.1-8 knitr_1.42 rstudioapi_0.14
[103] rjson_0.2.21 tzdb_0.3.0 reshape2_1.4.4
[106] curl_5.0.0 shinyAce_0.4.2 GlobalOptions_0.1.2
[109] cachem_1.0.8 parallel_4.3.0 AnnotationDbi_1.62.1
[112] pillar_1.9.0 vctrs_0.6.2 promises_1.2.0.1
[115] car_3.1-2 dbplyr_2.3.2 xtable_1.8-4
[118] cluster_2.1.4 Rgraphviz_2.44.0 evaluate_0.21
[121] cli_3.6.1 locfit_1.5-9.7 compiler_4.3.0
[124] rlang_1.1.1 crayon_1.5.2 rngtools_1.5.2
[127] ggsignif_0.6.4 heatmaply_1.4.2 labeling_0.4.2
[130] plyr_1.8.8 stringi_1.7.12 viridisLite_0.4.2
[133] gridBase_0.4-7 BiocParallel_1.34.1 assertthat_0.2.1
[136] munsell_0.5.0 Biostrings_2.68.0 lazyeval_0.2.2
[139] pacman_0.5.1 Matrix_1.5-4 hms_1.1.3
[142] bit64_4.0.5 KEGGREST_1.40.0 shiny_1.7.4
[145] highr_0.10 broom_1.0.4 igraph_1.4.2
[148] memoise_2.0.1 bit_4.0.5