Skip to content

Phyloseq Diversity Metrics: eDNA metabarcoding base script

This script analyzes your relative abundance matrix to assess alpha and beta diversity. Figures produced are potentially part of the main figures of your manuscript/report.

Load libraries

library(ggplot2) ## for plotting
library(tidyverse) ## for data manipulation 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.2.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(phyloseq)
library(knitr)
library(readxl)
library(writexl)
library(cowplot)
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
## for stats
library(pairwiseAdonis)
## Loading required package: vegan
## Loading required package: permute
## Loading required package: cluster
library(lme4) ## for stats
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(car) ## for stats
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(stats) ## for stats
library(vegan)
library("microbiome") ## for alpha diversity functions
## 
## microbiome R package (microbiome.github.com)
##     
## 
## 
##  Copyright (C) 2011-2022 Leo Lahti, 
##     Sudarshan Shetty et al. <microbiome.github.io>
## 
## 
## Attaching package: 'microbiome'
## 
## The following object is masked from 'package:vegan':
## 
##     diversity
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## 
## The following object is masked from 'package:base':
## 
##     transform
## set seed
set.seed(1234)

Load data

Before continuing with analyses, decide on what data you’re going to use an input. Relative abundance, raw reads, rarefied counts (can be done in phyloseq), variance stabilizing transformation (vst) from DESeq2? The input will impact your interpretation and thus is important to decide before conducting any stats to avoid bias towards a particular result.

On the Fisheries team, we have traditionally used relative abundance and the following code uses that dataset.

Relative Abundance data

# Option 1: Change global options (affects all subsequent operations)
## Convert scientific notation to regular numbers
options(scipen = 999)

## Relative abundance matrix 
df <- read_xlsx("docs/eDNA 12S metab/example_output/Results_matrix_relative.xlsx") %>%
  ## removing common_name and category for now 
  dplyr::select(-Common_name, -Category) %>%

  ## Remove columns with NA values
  dplyr::select(where(~!any(is.na(.)))) %>%

  ## making species_name rownames instead of column 
  column_to_rownames(var = "Species_name") %>%

  # remove columns that sum to 0
  select(where(~sum(., na.rm = TRUE) != 0))

Metadata

meta <- read_xlsx("docs/eDNA 12S metab/example_input/metadata_tidal_cycle.xlsx") %>%

    ## match sample IDs to matrix if needed 
  mutate(`GMGI Sample ID` = gsub("-", "_", `GMGI Sample ID`),
         `GMGI Sample ID` = paste0(`GMGI Sample ID`, "_12S")) %>%

  ## rownames are also needed in phyloseq meta table
  mutate(sampleID2=`GMGI Sample ID`) %>% column_to_rownames(var = "sampleID2") 

Create phyloseq object

## Create ASV (OTU) table and meta table 
otu <- otu_table(df, taxa_are_rows = T)
meta_phyloseq <- sample_data(meta)

## Merge metadata and OTU table into one phyloseq "object"
phylo_obj <- merge_phyloseq(otu, meta_phyloseq)

## view phyloseq obj 
## expected output = otu_table() with taxa and sample numbers and sample_data() with the sample and column numbers
print(phylo_obj)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 107 taxa and 47 samples ]
## sample_data() Sample Data:       [ 47 samples by 8 sample variables ]
# Ensure that your OTU table doesn't contain any NA or negative values (output should be FALSE)
any(is.na(otu_table(phylo_obj)))
## [1] FALSE
any(otu_table(phylo_obj) < 0)
## [1] FALSE

Subsetting phyloseq object

Optional if you’d like to break your data up by a certain variable (surface water, bottom water). If desired, uncomment the code chunk below.

# surface <- subset_samples(phylo_obj, SampleType == "Surface Water")
# bottom <- subset_samples(phylo_obj, SampleType == "Bottom Water") 

Alpha Diversity

Comparing the species diversity (shannon index or species richness) at each site.

Calculating Shannon Index and Species Richness

The alpha function in the microbiome package calculates several alpha diversity indices. The most relevant are likely observed (species richness) and diversity_shannon (shannon index).

https://microbiome.github.io/tutorials/Alphadiversity.html

Or use plot_richness function from:
https://rstudio-pubs-static.s3.amazonaws.com/1071936_6115f873acbc4dc4a30b1380cc3885fb.html

### USER: CHANGE COLORS AS NEEDED
fill_col = c("cyan4", "deeppink4", "#ed6a5a", "yellow2")

## Calculate
alpha_div <- estimate_richness(phylo_obj, measures = c("Shannon", "Simpson")) %>%
  rownames_to_column(var = "GMGI Sample ID") %>% left_join(., meta, by = "GMGI Sample ID") 
## Warning in estimate_richness(phylo_obj, measures = c("Shannon", "Simpson")): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.
### USER: EDIT THE X AXIS AND THE FILL=, SHAPE= AESTHETICS 
alpha_div %>%
  gather("measurement", "value", Shannon:Simpson) %>%
  ggplot(., aes(x=Area, y=value)) +
  geom_boxplot(aes(color = Area), fill=NA, outlier.shape=NA, show.legend = FALSE) +
  geom_jitter(width=0.2, shape=21, aes(fill = Area), color = 'black', alpha=0.5) +
  facet_wrap(~measurement, scales = "free_y", strip.position = "left") +
  ## labels
  labs(x="Area", y="", fill = "Area") +

  scale_fill_manual(values = fill_col) +
  scale_color_manual(values = fill_col) +

  ## theme options
  theme_bw() + 
  theme(panel.background=element_blank(),
        strip.background=element_blank(),
        strip.text = element_text(size = 10, face="bold"),
        legend.position = "none",
        strip.clip = 'off',
        strip.placement = "outside",
        axis.text.y = element_text(size=8, color="grey30"),
        axis.text.x = element_text(size=8, color="grey30"),
        axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0), size=10, face="bold"),
        axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0), size=10, face="bold"))

ggsave("docs/eDNA 12S metab/example_output/figures/alpha_diversity.png", width = 6, height = 4)

Statistics

Test: T-test or ANOVA - Type I, II, and III:
- * indicates an interaction (SampleType*Month). Usually we are interested in the interaction of our factors.
- + indicates an additive effect (SampleType + Month)

T-test to be used when only two groups to compare and ANOVA to be used with 3+ groups. Al

non-parametric Kolmogorov-Smirnov test for two-group comparisons when there are no relevant covariates

## Create model 
aov <- aov(Shannon ~ Area, data = alpha_div)

## ANOVA test on above model
Anova(aov, type = "III")
## Anova Table (Type III tests)
## 
## Response: Shannon
##              Sum Sq Df F value        Pr(>F)    
## (Intercept) 10.7970  1 45.9769 0.00000002676 ***
## Area         6.1493  3  8.7284     0.0001229 ***
## Residuals   10.0980 43                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test: If 3+ groups, Tukey Post Hoc Comparisons

TukeyHSD(aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Shannon ~ Area, data = alpha_div)
## 
## $Area
##                                    diff          lwr       upr     p adj
## Essex Bay-Danvers River       0.4544881 -0.086097904 0.9950741 0.1270188
## Wareham-Danvers River         0.6553259  0.126622344 1.1840294 0.0097952
## Westport Slocum-Danvers River 0.9889786  0.460275104 1.5176822 0.0000585
## Wareham-Essex Bay             0.2008378 -0.339748200 0.7414238 0.7542458
## Westport Slocum-Essex Bay     0.5344905 -0.006095441 1.0750765 0.0536475
## Westport Slocum-Wareham       0.3336528 -0.195050766 0.8623563 0.3431292

Beta Diversity

Comparing the community assemblages between sites/groups.

Resources (Read before continuing):
- https://ourcodingclub.github.io/tutorials/ordination/ - https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/comparison-of-ordination-techniques/

Common options:
- Principal Components Analysis (PCA): Euclidean distance measure
- Principal Coordinates Analysis (PCoA): Dissimilarity distance based
- Non-metric MultiDimensional Scaling (NMDS): Dissimilarity distance based

PCoA and NMDS handle zero’s in community matrices much better than PCA. For metabarcoding data, usually PCoA and NMDS are more appropriate. The differences between PCoA and NMDS are minor compared to difference between PCA. NMDS is iterative and used a different ordering method (see resource links above for more info).

NMDS requires evaluation of the output ‘stress value’: This value tells you how well the model fit your data. This is helpful to include on your NMDS plot in reports/manuscripts/presentations.

Stress (0-1 scale) Interpretation < 0.05 = Excellent representation with no prospect of misinterpretation < 0.10 = Good ordination with no real disk of drawing false inferences < 0.20 = Can be useful but has potential to mislead. In particular, shouldn’t place too much reliance on the details > 0.20 = Could be dangerous to interpret > 0.35 = Samples placed essentially at random; little relation to original ranked distances

If your stress value is >0.2, do not include in analyses and try PCoA instead.

Calculating dissimilarity matrix

## Bray Curtis Dissimilarity Matrix (used in statistics)
bray_df <- phyloseq::distance(phylo_obj, method = "bray")

## Sample information
sample_df <- data.frame(sample_data(phylo_obj))

Plotting

PCoA

https://www.rdocumentation.org/packages/phyloseq/versions/1.16.2/topics/ordinate

## Conduct PCoA 
pcoa <- ordinate(physeq = phylo_obj, method = "PCoA", distance = "bray")

## Plotting
plot_ordination(phylo_obj, pcoa, 

                ## USER EDITS shape, color, alpha, fill, etc. as desired based on project metadata
                color = "Area") +

  ## Point and point aesthetics
  geom_point(aes(color = Area), alpha = .5, size = 5) +
  scale_color_manual(values = c("green4", "gold3", "pink2", "purple")) +

  ## Labels: USER EDITS as desired
  labs(color = "Area") +
  ggtitle("PCoA example") +

  ## Theme: USER EDITS as desired
  theme_bw() +
  theme(
    legend.title = element_text(face = "bold", size=12),
    legend.position = "right",
    axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0), size=10, face="bold"),
    axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0), size=10, face="bold"),
  ) 
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the phyloseq package.
##   Please report the issue at <https://github.com/joey711/phyloseq/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## USER EDITS WIDTH AND HEIGHT TO DESIRED   
ggsave("docs/eDNA 12S metab/example_output/figures/PCoA_phyloseq.png", width = 8, height = 6)

NMDS

## Calculating NMDS 
## filtering example if needed
# phylo_obj_filtered <- prune_samples(sample_data(phylo_obj)$SampleType != "Blank", phylo_obj)

NMDS <- ordinate(physeq = phylo_obj, method = "NMDS", distance = "bray")
## Run 0 stress 0.1376309 
## Run 1 stress 0.1370744 
## ... New best solution
## ... Procrustes: rmse 0.02221491  max resid 0.143476 
## Run 2 stress 0.1470867 
## Run 3 stress 0.1923516 
## Run 4 stress 0.1376309 
## Run 5 stress 0.1482419 
## Run 6 stress 0.220068 
## Run 7 stress 0.1370774 
## ... Procrustes: rmse 0.001530697  max resid 0.007618404 
## ... Similar to previous best
## Run 8 stress 0.1476366 
## Run 9 stress 0.1370775 
## ... Procrustes: rmse 0.001634314  max resid 0.007749731 
## ... Similar to previous best
## Run 10 stress 0.1878113 
## Run 11 stress 0.1376309 
## Run 12 stress 0.1376309 
## Run 13 stress 0.1823485 
## Run 14 stress 0.1370773 
## ... Procrustes: rmse 0.001557422  max resid 0.00766044 
## ... Similar to previous best
## Run 15 stress 0.1482753 
## Run 16 stress 0.1799031 
## Run 17 stress 0.1841584 
## Run 18 stress 0.1983297 
## Run 19 stress 0.184445 
## Run 20 stress 0.1376224 
## *** Best solution repeated 3 times
## Plotting
plot_ordination(phylo_obj, NMDS, 

                ## USER EDITS shape, color, alpha, fill, etc. as desired based on project metadata
                color = "Area") +

  ## Point and point aesthetics
  geom_point(aes(color = Area), alpha = .5, size = 5) +
  scale_color_manual(values = fill_col) +

  ## Labels: USER EDITS as desired
  labs(color = "Sample Type") +
  ggtitle("NMDS example") +

  ## adding stress value to plot (user edits x and y to desired location)
  annotate(geom = "label", x = 0.7, y = 1.8, 
           label = sprintf("Stress: %.6f", NMDS$stress), hjust = 0, vjust = 1, 
           label.size = NA, fontface = "italic", color = "grey30", size = 2.75, fill="white") +

  ## Theme: USER EDITS as desired
  theme_bw() +
  theme(
    legend.title = element_text(face = "bold", size=12),
    legend.position = "right",
    axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0), size=10, face="bold"),
    axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0), size=10, face="bold"),
  ) 
## Warning in annotate(geom = "label", x = 0.7, y = 1.8, label = sprintf("Stress:
## %.6f", : Ignoring unknown parameters: `label.size`

## USER EDITS WIDTH AND HEIGHT TO DESIRED   
ggsave("docs/eDNA 12S metab/example_output/figures/NMDS_phyloseq.png", width = 8, height = 6)

Statistics

Test: PERMANOVA - * indicates an interaction (SampleType*Month). Usually we are interested in the interaction of our factors.
- + indicates an additive effect (SampleType + Month)

The output will tell you which factors significantly impact the community assemblage (matrix).

adonis2(bray_df ~ Area, data = sample_df, permutations = 99)
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 99
## 
## adonis2(formula = bray_df ~ Area, data = sample_df, permutations = 99)
##          Df SumOfSqs      R2      F Pr(>F)   
## Model     3   2.0084 0.24389 4.6233   0.01 **
## Residual 43   6.2264 0.75611                 
## Total    46   8.2348 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Additional test: Pairwise PERMANOVA

The output will tell you which specific variable within each factor is driving the significant effects.

pairwise.adonis2(bray_df ~ Area, data = sample_df)
## $parent_call
## [1] "bray_df ~ Area , strata = Null , permutations 999"
## 
## $`Danvers River_vs_Essex Bay`
##          Df SumOfSqs      R2      F Pr(>F)  
## Model     1   0.4460 0.13329 3.2297  0.027 *
## Residual 21   2.8999 0.86671                
## Total    22   3.3459 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $`Danvers River_vs_Wareham`
##          Df SumOfSqs      R2    F Pr(>F)    
## Model     1   1.2939 0.31678 10.2  0.001 ***
## Residual 22   2.7907 0.68322                
## Total    23   4.0847 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $`Danvers River_vs_Westport Slocum`
##          Df SumOfSqs      R2      F Pr(>F)   
## Model     1   0.5645 0.13984 3.5766  0.007 **
## Residual 22   3.4723 0.86016                 
## Total    23   4.0368 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $`Essex Bay_vs_Wareham`
##          Df SumOfSqs      R2      F Pr(>F)  
## Model     1   0.7194 0.20711 5.4853  0.011 *
## Residual 21   2.7541 0.79289                
## Total    22   3.4735 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $`Essex Bay_vs_Westport Slocum`
##          Df SumOfSqs      R2      F Pr(>F)  
## Model     1   0.4763 0.12176 2.9116  0.022 *
## Residual 21   3.4357 0.87824                
## Total    22   3.9120 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $`Wareham_vs_Westport Slocum`
##          Df SumOfSqs      R2      F Pr(>F)  
## Model     1   0.5013 0.13096 3.3153  0.028 *
## Residual 22   3.3265 0.86904                
## Total    23   3.8278 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## attr(,"class")
## [1] "pwadstrata" "list"