Phyloseq Diversity Metrics: eDNA metabarcoding base script
This script analyzes your relative abundance matrix to assess alpha and beta diversity. Figures produced are potentially part of the main figures of your manuscript/report.
Load libraries
library(ggplot2) ## for plotting
library(tidyverse) ## for data manipulation
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(phyloseq)
library(knitr)
library(readxl)
library(writexl)
library(cowplot)
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
## for stats
library(pairwiseAdonis)
## Loading required package: vegan
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-8
## Loading required package: cluster
library(lme4) ## for stats
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(car) ## for stats
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(stats) ## for stats
library(vegan)
library("microbiome") ## for alpha diversity functions
##
## microbiome R package (microbiome.github.com)
##
##
##
## Copyright (C) 2011-2022 Leo Lahti,
## Sudarshan Shetty et al. <microbiome.github.io>
##
##
## Attaching package: 'microbiome'
##
## The following object is masked from 'package:vegan':
##
## diversity
##
## The following object is masked from 'package:ggplot2':
##
## alpha
##
## The following object is masked from 'package:base':
##
## transform
## set seed
set.seed(1234)
Load data
Before continuing with analyses, decide on what data you’re going to use an input. Relative abundance, raw reads, rarefied counts (can be done in phyloseq), variance stabilizing transformation (vst) from DESeq2? The input will impact your interpretation and thus is important to decide before conducting any stats to avoid bias towards a particular result.
On the Fisheries team, we have traditionally used relative abundance and the following code uses that dataset.
Relative Abundance data
# Option 1: Change global options (affects all subsequent operations)
## Convert scientific notation to regular numbers
options(scipen = 999)
## Relative abundance matrix
df <- read_xlsx("example_output/Results_2_relative_abundance_matrix.xlsx") %>%
## removing common_name and category for now
dplyr::select(-Common_name, -Category) %>%
## Remove columns with NA values
dplyr::select(where(~!any(is.na(.)))) %>%
## making species_name rownames instead of column
column_to_rownames(var = "Species_name") %>%
# remove columns that sum to 0
select(where(~sum(., na.rm = TRUE) != 0))
Metadata
meta <- read_xlsx("example_input/metadata.xlsx") %>%
## rownames are also needed in phyloseq meta table
mutate(sampleID2=sampleID) %>% column_to_rownames(var = "sampleID2")
Create phyloseq object
## Create ASV (OTU) table and meta table
otu <- otu_table(df, taxa_are_rows = T)
meta_phyloseq <- sample_data(meta)
## Merge metadata and OTU table into one phyloseq "object"
phylo_obj <- merge_phyloseq(otu, meta_phyloseq)
## view phyloseq obj
## expected output = otu_table() with taxa and sample numbers and sample_data() with the sample and column numbers
print(phylo_obj)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 107 taxa and 328 samples ]
## sample_data() Sample Data: [ 328 samples by 10 sample variables ]
# Ensure that your OTU table doesn't contain any NA or negative values (output should be FALSE)
any(is.na(otu_table(phylo_obj)))
## [1] FALSE
any(otu_table(phylo_obj) < 0)
## [1] FALSE
Subsetting phyloseq object
Optional if you’d like to break your data up by a certain variable (surface water, bottom water). If desired, uncomment the code chunk below.
# surface <- subset_samples(phylo_obj, SampleType == "Surface Water")
# bottom <- subset_samples(phylo_obj, SampleType == "Bottom Water")
Alpha Diversity
Comparing the species diversity (shannon index or species richness) at each site.
Calculating Shannon Index and Species Richness
The alpha function in the microbiome package calculates several alpha diversity indices. The most relevant are likely observed (species richness) and diversity_shannon (shannon index).
https://microbiome.github.io/tutorials/Alphadiversity.html
Or use plot_richness function from:
https://rstudio-pubs-static.s3.amazonaws.com/1071936_6115f873acbc4dc4a30b1380cc3885fb.html
fill_col = c("cyan4", "deeppink4", "#ed6a5a")
## Calculate
alpha_div <- estimate_richness(phylo_obj, measures = c("Shannon", "Simpson")) %>%
rownames_to_column(var = "sampleID") %>% left_join(., meta, by = "sampleID")
alpha_div %>%
gather("measurement", "value", Shannon:Simpson) %>%
ggplot(., aes(x=SampleType, y=value)) +
geom_boxplot(aes(color = SampleType), fill=NA, outlier.shape=NA, show.legend = FALSE) +
geom_jitter(width=0.2, shape=21, aes(fill = SampleType), color = 'black', alpha=0.5) +
facet_wrap(~measurement, scales = "free_y", strip.position = "left") +
## labels
labs(x="Sample Type", y="", fill = "Sample Type") +
scale_fill_manual(values = fill_col) +
scale_color_manual(values = fill_col) +
## theme options
theme_bw() +
theme(panel.background=element_blank(),
strip.background=element_blank(),
strip.text = element_text(size = 10, face="bold"),
legend.position = "none",
strip.clip = 'off',
strip.placement = "outside",
axis.text.y = element_text(size=8, color="grey30"),
axis.text.x = element_text(size=8, color="grey30"),
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0), size=10, face="bold"),
axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0), size=10, face="bold"))
ggsave("example_output/Figures/alpha_diversity.png", width = 6, height = 4)
Richness vs. Shannon per sample used in contract report
## Species Richness
biodiv_df <- df %>%
rownames_to_column(var = "Species_name") %>%
gather("sampleID", "relab", 2:last_col()) %>%
group_by(sampleID) %>%
summarize(richness = sum(relab > 0)) %>%
## Species Evenness ~ Shannon
left_join(., alpha_div %>% dplyr::select(sampleID, Shannon), by = "sampleID")
## Exporting data
biodiv_df %>% dplyr::rename(Richness = richness) %>% left_join(., meta, by = "sampleID") %>%
write_xlsx("example_output/Biodiversity.xlsx")
## plotting
biodiv_df %>%
ggplot(., aes(x=richness, y=Shannon)) +
annotate("rect", xmin=quantile(biodiv_df$richness, 0.75), xmax=Inf,
ymin=quantile(biodiv_df$Shannon, 0.75, na.rm = TRUE), ymax=Inf,
alpha=0.2, fill="#058C42") +
geom_point(fill = "#97C1DE", color='black', shape=21, alpha=0.5, size=2) +
labs(
x = "Sample Richness",
y = "Sample Evenness (Shannon Index)"
) +
theme_bw() +
theme(
axis.text.y = element_text(size=8, color="grey30"),
axis.text.x = element_text(size=8, color="grey30"),
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0), size=10, face="bold"),
axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0), size=10, face="bold")
)
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggsave("example_output/Figures/biodiversity.png", width = 5.5, height = 5)
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
Statistics
Test: T-test or ANOVA - Type I, II, and III:
- * indicates an interaction (SampleType*Month). Usually we are
interested in the interaction of our factors.
- + indicates an additive effect (SampleType + Month)
T-test to be used when only two groups to compare and ANOVA to be used with 3+ groups. Al
non-parametric Kolmogorov-Smirnov test for two-group comparisons when there are no relevant covariates
## Create model
aov <- aov(Shannon ~ SampleType, data = alpha_div)
## ANOVA test on above model
Anova(aov, type = "III")
## Anova Table (Type III tests)
##
## Response: Shannon
## Sum Sq Df F value Pr(>F)
## (Intercept) 10.880 1 28.9137 0.0000001449 ***
## SampleType 2.302 2 3.0595 0.04827 *
## Residuals 122.294 325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Test: If 3+ groups, Tukey Post Hoc Comparisons
TukeyHSD(aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Shannon ~ SampleType, data = alpha_div)
##
## $SampleType
## diff lwr upr p adj
## Control-Blank 0.46237370 0.01372728 0.91102012 0.0416634
## Inside WEA-Blank 0.36259559 -0.06463695 0.78982812 0.1142372
## Inside WEA-Control -0.09977811 -0.28988218 0.09032596 0.4329459
Beta Diversity
Comparing the community assemblages between sites/groups.
Resources (Read before continuing):
- https://ourcodingclub.github.io/tutorials/ordination/ -
https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/comparison-of-ordination-techniques/
Common options:
- Principal Components Analysis (PCA): Euclidean distance measure
- Principal Coordinates Analysis (PCoA): Dissimilarity distance based
- Non-metric MultiDimensional Scaling (NMDS): Dissimilarity distance
based
PCoA and NMDS handle zero’s in community matrices much better than PCA. For metabarcoding data, usually PCoA and NMDS are more appropriate. The differences between PCoA and NMDS are minor compared to difference between PCA. NMDS is iterative and used a different ordering method (see resource links above for more info).
NMDS requires evaluation of the output ‘stress value’: This value tells you how well the model fit your data. This is helpful to include on your NMDS plot in reports/manuscripts/presentations.
Stress (0-1 scale) Interpretation < 0.05
= Excellent representation
with no prospect of misinterpretation < 0.10
= Good ordination with no
real disk of drawing false inferences < 0.20
= Can be useful but has
potential to mislead. In particular, shouldn’t place too much reliance
on the details > 0.20
= Could be dangerous to interpret > 0.35
=
Samples placed essentially at random; little relation to original ranked
distances
If your stress value is >0.2, do not include in analyses and try PCoA instead.
Calculating dissimilarity matrix
## Bray Curtis Dissimilarity Matrix (used in statistics)
bray_df <- phyloseq::distance(phylo_obj, method = "bray")
## Sample information
sample_df <- data.frame(sample_data(phylo_obj))
Plotting
PCoA
https://www.rdocumentation.org/packages/phyloseq/versions/1.16.2/topics/ordinate
## Conduct PCoA
pcoa <- ordinate(physeq = phylo_obj, method = "PCoA", distance = "bray")
## Plotting
plot_ordination(phylo_obj, pcoa,
## USER EDITS shape, color, alpha, fill, etc. as desired based on project metadata
color = "Depth") +
## Point and point aesthetics
geom_point(aes(color = Depth), alpha = .5, size = 5) +
scale_color_manual(values = c("green4", "gold3")) +
## Labels: USER EDITS as desired
labs(color = "Sample Type") +
ggtitle("PCoA example") +
## Theme: USER EDITS as desired
theme_bw() +
theme(
legend.title = element_text(face = "bold", size=12),
legend.position = "right",
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0), size=10, face="bold"),
axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0), size=10, face="bold"),
)
## USER EDITS WIDTH AND HEIGHT TO DESIRED
ggsave("example_output/Figures/PCoA_phyloseq.png", width = 8, height = 6)
NMDS
## Calculating NMDS
## filtering for the sake of example visual
phylo_obj_filtered <- prune_samples(sample_data(phylo_obj)$SampleType != "Blank", phylo_obj)
NMDS <- ordinate(physeq = phylo_obj_filtered, method = "NMDS", distance = "bray")
## Run 0 stress 0.2324997
## Run 1 stress 0.2399498
## Run 2 stress 0.2363517
## Run 3 stress 0.2410485
## Run 4 stress 0.2466303
## Run 5 stress 0.2422236
## Run 6 stress 0.2435588
## Run 7 stress 0.2445893
## Run 8 stress 0.2359513
## Run 9 stress 0.2366502
## Run 10 stress 0.2433675
## Run 11 stress 0.2533385
## Run 12 stress 0.2439828
## Run 13 stress 0.2438916
## Run 14 stress 0.2402618
## Run 15 stress 0.246462
## Run 16 stress 0.2315938
## ... New best solution
## ... Procrustes: rmse 0.02792967 max resid 0.1792405
## Run 17 stress 0.2359704
## Run 18 stress 0.2448772
## Run 19 stress 0.2472408
## Run 20 stress 0.2375073
## *** Best solution was not repeated -- monoMDS stopping criteria:
## 1: no. of iterations >= maxit
## 19: stress ratio > sratmax
## Plotting
plot_ordination(phylo_obj_filtered, NMDS,
## USER EDITS shape, color, alpha, fill, etc. as desired based on project metadata
color = "SampleType") +
## Point and point aesthetics
geom_point(aes(color = SampleType), alpha = .5, size = 5) +
scale_color_manual(values = fill_col) +
## Labels: USER EDITS as desired
labs(color = "Sample Type") +
ggtitle("NMDS example") +
## adding stress value to plot (user edits x and y to desired location)
annotate(geom = "label", x = 1.2, y = 1.8,
label = sprintf("Stress: %.6f", NMDS$stress), hjust = 0, vjust = 1,
label.size = NA, fontface = "italic", color = "grey30", size = 2.75, fill="white") +
## Theme: USER EDITS as desired
theme_bw() +
theme(
legend.title = element_text(face = "bold", size=12),
legend.position = "right",
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0), size=10, face="bold"),
axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0), size=10, face="bold"),
)
## USER EDITS WIDTH AND HEIGHT TO DESIRED
ggsave("example_output/Figures/NMDS_phyloseq.png", width = 8, height = 6)
Statistics
Test: PERMANOVA - * indicates an interaction (SampleType*Month).
Usually we are interested in the interaction of our factors.
- + indicates an additive effect (SampleType + Month)
The output will tell you which factors significantly impact the community assemblage (matrix).
adonis2(bray_df ~ SampleType, data = sample_df, permutations = 99)
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 99
##
## adonis2(formula = bray_df ~ SampleType, data = sample_df, permutations = 99)
## Df SumOfSqs R2 F Pr(>F)
## Model 2 1.973 0.01778 2.941 0.01 **
## Residual 325 109.017 0.98222
## Total 327 110.990 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Additional test: Pairwise PERMANOVA
The output will tell you which specific variable within each factor is driving the significant effects.
pairwise.adonis2(bray_df ~ SampleType, data = sample_df)
## $parent_call
## [1] "bray_df ~ SampleType , strata = Null , permutations 999"
##
## $`Inside WEA_vs_Blank`
## Df SumOfSqs R2 F Pr(>F)
## Model 1 1.408 0.01639 4.1647 0.001 ***
## Residual 250 84.541 0.98361
## Total 251 85.949 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $`Inside WEA_vs_Control`
## Df SumOfSqs R2 F Pr(>F)
## Model 1 0.606 0.00568 1.7942 0.069 .
## Residual 314 106.037 0.99432
## Total 315 106.643 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $Blank_vs_Control
## Df SumOfSqs R2 F Pr(>F)
## Model 1 1.1454 0.04005 3.5878 0.002 **
## Residual 86 27.4558 0.95995
## Total 87 28.6012 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## attr(,"class")
## [1] "pwadstrata" "list"