Modeling Alpha diversity of Macroinvertebrates Assamplages

Principal component analysis

Author

Bongani Ncube

Published

27 January 2024

Biodiversity and environmental characteristics

Aquatic biodiversity suffers from major threats that can be grouped into the following categories :

Climate change
Water pollution
Overexploitation
Invasion of exotic species
Habitat degradation
Flow modification

Climate change is known as the alterations in atmospheric ,biogeochemical and hydrological cycles. Aquatic macroinvertebrates have the ability to live under different levels of physico-chemical water conditions and also occupy different substrata. Structure and function of aquatic organisms reflect physical/chemical conditions and change with increased human influence . Main physico-chemical features that affect aquatic environment are temperature discharge , pH ,nutrients and specific conductivity .Temperature is the greatest significant parameter known to affect aquatic life or influence the structure of aquatic life (Giller and Malnisquist 1998). Many studies have seen that natural differences in pH also tend to influence the structure of macroinvertebrates .Highly acidic water results in hypernorishment of fauna. Water acidification can alter structure by chronically damaging tissues of invertebrates particularly species that easily lose sodium ions when pH is reduced .Acidity also destroys some important algal that macroinvertebrates highly depend on in the food chain and for shelter , by so doing this disturbs the diversity of these species

Elements of biodiversity

diversity of species in a landscape is a function of relative contributors of alpha (local), beta(species turnover) to gamma(regional/landscape) diversity. The focus of this research is on macroinvertebrates (aerial active dispersers consisting predominantly on the class insecta) Biodiversity encompasses variability at three different levels: «between ecosystems»; between the taxonomic units (hereon in species)[1] inhabiting the ecosystems («interspecific»); and among each

Alpha diversity

the diversity at the sampling point is called alpha diversity (α),this is usually used to refer the diversity within a particular area or ecosystem .it is usually expressed as the number of species or species richness in that ecosystem. Therefore in order to compare or monitor the effect of some species diversity on a particular farming practise we would want to compare species diversity within different ecosystems. Alpha diversity metrics summarise the structure of an ecological community with respect to its richness . Analyzing alpha diversity is crucial as it assesses the differences between ecosystems

Beta diversity

A comparison of diversity between ecosystems ,usually measured as the amount of species change between the ecosystems. Biodiversity can be measured by considering different scales. Beta diversity 1 (β1) is the dissimilarity between sampling points within the same community and beta diversity 2 (β2) is the dissimilarity between the communities in an ecosystem.

Gamma diversity

gamma diversity is the diversity of species in the ecosystem (γ) or community $(γ A, γ B, γ C) (γ_{A}, γ_{B}, γ_{C})$ . It is a measure of the overall diversity within a large region or Geographic-scale species diversity according to Hunter.

The dynamic nature of biodiversity patterns may be well captured by an integration of both alpha and beta components of biodiversity ,other than the simplistic measurement of alpha diversity alone. More information about gamma, alpha and beta diversity at different scales would be important since understanding regional biodiversity dynamics is among the first step to effectively conserve and restore ecosystems .

Key objectives

Provide information on measures which can be implemented to reduce impacts of biodiversity
Determine any changes in species county in response to changes in water quality in catchment area like Sanyati.
Increase awareness on the factors that affect biodiversity in catchment areas
Provide the information to the link between water quality and species diversity
Understanding biodiversity environment relationships that link theory to management and develop actions for catchments in the face of environmental changes .

Principal Component Analysis

According to the “spectral decomposition theorem”, if $Σ p \times p Σ_{p \times p}$ i s a positive semi-definite, symmetric, real matrix, then there exists an orthogonal matrix $A A$ such that $A' Σ A = v$ where $v$ is a diagonal matrix containing the eigenvalues $Σ$

$v = (v 1 0 \dots 0 0 v 2 \dots 0 ⋮ ⋮ ⋱ ⋮ 00 \dots v p)$

$A = (a 1 a 2 \dots a p)$

the i-th column of $A$ , $a i$ , is the i-th $p \times 1$ eigenvector of $Σ$ that corresponds to the eigenvalue, $v i$ , where $v 1 \geq v 2 \geq \dots \geq v p$ . Alternatively, express in matrix decomposition:

$Σ = A v A'$

$Σ = A (v 1 0 \dots 0 0 v 2 \dots 0 ⋮ ⋮ ⋱ ⋮ 00 \dots v p) A' = p \sum i = 1 v i a i a' i$

where the outer product $a i a' i$ is a $p \times p$ matrix of rank 1.

For example,

$x \sim N 2 (μ, Σ)$

$μ = (512); Σ = (4112)$

Here,

$A = (0.9239 - 0.3827 0.3827 0.9239)$

Columns of $A$ are the eigenvectors for the decomposition

Under matrix multiplication ( $A' Σ A$ or $A' A$ ), the off-diagonal elements equal to 0

Multiplying data by this matrix (i.e., projecting the data onto the orthogonal axes); the distriubiton of the resulting data (i.e., “scores”) is

$N 2 (A' μ, A' Σ A) = N 2 (A' μ, v)$

Notes:

The i-th eigenvalue is the variance of a linear combination of the elements of $x$ ; $v a r (y i) = v a r (a' i x) = v i$
The values on the transformed set of axes (i.e., the $y i$ ’s) are called the scores. These are the orthogonal projections of the data onto the “new principal component axes
Variances of $y 1$ are greater than those for any other possible projection

Covariance matrix decomposition and projection onto orthogonal axes = PCA

Population Principal Components

$p \times 1$ vectors $x 1, \dots, x n$ which are iid with $v a r (x i) = Σ$

The first PC is the linear combination $y 1 = a' 1 x = a 11 x 1 + \dots + a 1 p x p$ with $a' 1 a 1 = 1$ such that $v a r (y 1)$ is the maximum of all linear combinations of $x$ which have unit length
The second PC is the linear combination $y 1 = a' 2 x = a 21 x 1 + \dots + a 2 p x p$ with $a' 2 a 2 = 1$ such that $v a r (y 1)$ is the maximum of all linear combinations of $x$ which have unit length and uncorrelated with $y 1$ (i.e., $c o v (a' 1 x, a' 2 x) = 0$
continues for all $y i$ to $y p$

$a i$ ’s are those that make up the matrix $A$ in the symmetric decomposition $A' Σ A = v$ , where $v a r (y 1) = v 1, \dots, v a r (y p) = v p$ And the total variance of $x$ is

$v a r (x 1) + \dots + v a r (x p) = t r (Σ) = v 1 + \dots + v p = v a r (y 1) + \dots + v a r (y p)$

Data Reduction

To reduce the dimension of data from p (original) to k dimensions without much “loss of information”, we can use properties of the population principal components

Suppose $Σ \approx \sum k i = 1 v i a i a' i$ . Even thought the true variance-covariance matrix has rank $p$ , it can be be well approximate by a matrix of rank k (k <p)
New “traits” are linear combinations of the measured traits. We can attempt to make meaningful interpretation fo the combinations (with orthogonality constraints).
The proportion of the total variance accounted for by the j-th principal component is

$v a r (y j) \sum p i = 1 v a r (y i) = v j \sum p i = 1 v i$

The proportion of the total variation accounted for by the first k principal components is $\sum k i = 1 v i \sum p i = 1 v i$
Above example , we have $4.4144 / (4 + 2) = .735$ of the total variability can be explained by the first principal component

Sample Principal Components

Since $Σ$ is unknown, we use

$S = 1 n - 1 n \sum i = 1 (x i - ˉ x) (x i - ˉ x)'$

Let $ˆ v 1 \geq ˆ v 2 \geq \dots \geq ˆ v p \geq 0$ be the eigenvalues of $S$ and $ˆ a 1, ˆ a 2, \dots, ˆ a p$ denote the eigenvectors of $S$

Then, the i-th sample principal component score (or principal component or score) is

$ˆ y i j = p \sum k = 1 ˆ a i k x k j = ˆ a' i x j$

Properties of Sample Principal Components

The estimated variance of $y i = ˆ a' i x j$ is $ˆ v i$
The sample covariance between $ˆ y i$ and $ˆ y i'$ is 0 when $i \neq i'$
The proportion of the total sample variance accounted for by the i-th sample principal component is $ˆ v i \sum p k = 1 ˆ v k$
The estimated correlation between the i-th principal component score and the l-th attribute of $x$ is

$r x l, ˆ y i = ˆ a i l \sqrt v i \sqrt s l l$

The correlation coefficient is typically used to interpret the components (i.e., if this correlation is high then it suggests that the l-th original trait is important in the i-th principle component). According to [@Johnson_1988, pp.433-434], $r x l, ˆ y i$ only measures the univariate contribution of an individual X to a component Y without taking into account the presence of the other X’s. Hence, some prefer $ˆ a i l$ coefficient to interpret the principal component.
$r x l, ˆ y i; ˆ a i l$ are referred to as “loadings”

To use k principal components, we must calculate the scores for each data vector in the sample

$y j = (y 1 j y 2 j ⋮ y k j) = (ˆ a' 1 x j ˆ a' 2 x j ⋮ ˆ a' k x j) = (ˆ a' 1 ˆ a' 2 ⋮ ˆ a' k) x j$

Large sample theory exists for eigenvalues and eigenvectors of sample covariance matrices if inference is necessary. But we do not do inference with PCA, we only use it as exploratory or descriptive analysis.
PC is not invariant to changes in scale (Exception: if all trait are resecaled by multiplying by the same constant, such as feet to inches).
PCA based on the correlation matrix $R$ is different than that based on the covariance matrix $Σ$
PCA for the correlation matrix is just rescaling each trait to have unit variance
Transform $x$ to $z$ where $z i j = (x i j - ˉ x i) / \sqrt s i i$ where the denominator affects the PCA
After transformation, $c o v (z) = R$
PCA on $R$ is calculated in the same way as that on $S$ (where $ˆ v 1 + \dots + ˆ v p = p$ )
The use of $R, S$ depends on the purpose of PCA.
If the scale of the observations if different, covariance matrix is more preferable. but if they are dramatically different, analysis can still be dominated by the large variance traits.

How many PCs to use can be guided by

Scree Graphs: plot the eigenvalues against their indices. Look for the “elbow” where the steep decline in the graph suddenly flattens out; or big gaps.
minimum Percent of total variation (e.g., choose enough components to have 50% or 90%). can be used for interpretations.
Kaiser’s rule: use only those PC with eigenvalues larger than 1 (applied to PCA on the correlation matrix) - ad hoc
Compare to the eigenvalue scree plot of data to the scree plot when the data are randomized.

Reading in data

data1<-read.csv("waterQuality.csv",header=T) 
data2<-read.csv("Macroinvertebrates.csv",header=T)

RIVER<-data1$River

calculation of alpha diversity

### first we exclude the sengwa river and the corresponding site SE1 then we drop the Alpha column
data2<-cbind(RIVER,data2) |> 
  as.data.frame()

data1<-data1|>
  filter(River %!in% c("Sengwa"))|>
  select(-Alpha)

data2<-data2|> 
  filter(RIVER %!in% c("Sengwa")) 

my_al<-data2|> 
  select(-Alpha,-genus,-Total,-RIVER)

## we recalculate the Alpha diversity using the vegan package
## vegan diversity is used in calculating biodiversity indices
Alpha <- specnumber(my_al)

## column binding Alpha column to dataset1
data1 <- cbind(Alpha,data1) |> 
  as.data.frame()

Check class of variables

sapply(data1[1,],class)
#>        Alpha         site        River           DO       DO.sat         Cond 
#>    "integer"  "character"  "character"    "numeric"    "numeric"    "numeric" 
#>     salinity          TDS      Clarity         Temp  macrophytes      Ammonia 
#>    "numeric"    "numeric"    "numeric"    "numeric"    "integer"    "numeric" 
#>   Phosphates    Elevation Stream.order        Total    substrate            S 
#>    "numeric"    "integer"    "integer"    "integer"  "character"    "numeric" 
#>            E 
#>    "numeric"

Data cleaning and exploration

## data cleaning and class checking
data1<-data1|> 
  mutate_if(is.character,as.factor) |> 
  mutate(substrate=factor(substrate,levels=c("mixed","rocky","clay","sandy"))) |>
  mutate(River = factor(River,levels=c("Kwekwe","Munyati","Muzvezve","Svisvi","Sebakwe"))) |> 
  mutate(Stream.order=as.factor(Stream.order))

head(data1) |> 
  gt::gt()

Alpha	site	River	DO	DO.sat	Cond	salinity	TDS	Clarity	Temp	macrophytes	Ammonia	Phosphates	Elevation	Stream.order	Total	substrate	S	E
16	KR1	Kwekwe	7.0	73.7	367.0	0.16	184.0	64.00	23.7	0	3.1	1.184	1455	2	36	mixed	19.36817	30.00882
14	KR2	Kwekwe	8.7	101.2	274.0	0.11	136.0	100.00	24.9	90	1.0	0.344	1344	2	42	rocky	19.32541	29.99269
8	KR3	Kwekwe	5.2	70.9	73.2	0.04	36.7	16.50	32.7	10	2.0	0.744	1270	3	102	sandy	19.26582	29.92818
6	KR4	Kwekwe	6.4	87.0	91.8	0.05	46.2	21.25	31.2	0	3.4	1.304	1201	3	17	mixed	19.06011	29.81300
3	KR5	Kwekwe	7.8	103.8	370.0	0.16	185.0	100.00	30.1	50	2.0	0.744	1169	3	5	rocky	19.01677	29.74631
18	MB1	Kwekwe	6.2	73.1	170.4	0.09	85.0	26.25	23.9	80	0.0	0.000	1176	2	140	rocky	18.93007	29.92322

Data description

Variable	Description
DO	Dissolved oxygen
DO.sat	Dissolved Oxygen saturation
phosphate	phosphate concentration $(m g L - 1)$
salinity	dissolved salt `g/L`
clarity	How far down light can penetrate
Elevation	Elevation
substrate	materials in water (rocks/substrate)
temp	Temperature
Alpha	Alpha diversity
stream.order	Order of streams
Ammonia	Ammonia concentration( $(m g L - 1)$ )
Cond	Electrical conductivity
site	sampled site
River	name of river
TDS	Total dissolved salts
Macrophytes	macrophytes

Number of sites per River

vad<-data1|> 
  group_by(River)|> 
  dplyr::summarize(counts=n())|> 
  mutate(val=counts/sum(counts),
         perc=scales::percent(val))

labs<-paste0(vad$River,"(",vad$counts," sites)")

ggdonutchart(vad,"val",label=labs,fill="River",color="white",
             palette=c("#00AFBB","#E7B800","#FC4E07","blue","green"))

data visualisations of water parameters


(temps+amo) +
  plot_annotation(title = "Average temperature and Ammonia",
                  subtitle = "averaged per River",
                  theme = theme(text = element_text(family = "Asap Condensed"),
                                plot.title = element_text(face = "bold",
                                                          size = rel(1.5))))

Note

on average , Kwekwe has the highest level of ammonia as compared to other rivers
mean temperature is not significantly different between the rivers

(cond+tds) +
  plot_annotation(title = "Average conductivity and total dissolved salts",
                  subtitle = "averaged per River",
                  theme = theme(text = element_text(family = "Asap Condensed"),
                                plot.title = element_text(face = "bold",
                                                          size = rel(1.5))))

Note

On average ,Munyati has the greatest amount of dissolved salts
on the other hand Svisvi has greatest amount of conductivity

(phos+cla) +
  plot_annotation(title = "Average phosphates and clarity",
                  subtitle = "averaged per River",
                  theme = theme(text = element_text(family = "Asap Condensed"),
                                plot.title = element_text(face = "bold",
                                                          size = rel(1.5))))

substrates in river

## changing data frame from wide to long
dt2_long<-gather(data2,macroSpecies,number_of_species,Naucoris:Trichoptera)|> 
  select(genus,macroSpecies,number_of_species)

dtsum<-dt2_long|> 
  group_by(genus)|> 
  summarise(total_species=sum(number_of_species))|> 
  arrange(desc(total_species))

dtsum<- data1%>% 
  mutate(substrate=as.factor(substrate),Stream.order=as.factor(Stream.order))|> 
  group_by(River,substrate)|> 
  summarise(n=n())|>
  arrange(n)|>  
  dplyr::mutate(perc = paste0(sprintf("%4.1f", n/ sum(n) * 100), "%"))
    
## river and substrate type ###  
dt<- data1%>% 
  mutate(substrate=as.factor(substrate),
         Stream.order=as.factor(Stream.order))|> 
 group_by(River,substrate)|> 
    summarise(n=n())|>
  mutate(val=n/sum(n),
         perc=scales::percent(val))
 
dt |>  ggplot(
  aes(x=River,y=val,fill=substrate))+
  geom_bar(stat="identity",
          position="fill")+
  geom_text(aes(label=perc),
            size=3,
            position=position_stack(vjust=0.5))+
  scale_fill_brewer(palette="set1")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        legend.position = "top")+
  labs(y="percent substrate",
       fill="substrate type",
       x="River",title="Substrate type by River")

Stream order and macroinvertebrates


dtsum<-dt2_long|> 
  group_by(genus,macroSpecies)|> 
  summarise(total_species=sum(number_of_species))|> 
  arrange(desc(total_species))

dtsum|> 
  ggplot(aes(x=reorder(genus,total_species),y=total_species))+
  geom_col() + 
  coord_flip()+
  labs(x="Site",y="Total number of macroinvertebrates",
       title="Distribution of macroinvertebrates \n per site")

  
## relative frequencies and abundance
dtsum<- data1%>% 
  mutate(substrate=as.factor(substrate),
         Stream.order=as.factor(Stream.order))|> 
 group_by(River,Stream.order)|> 
    summarise(n=n())|> arrange(n)|> 
  dplyr::mutate(perc = paste0(sprintf("%4.1f", n/ sum(n) * 100), "%"))

ggplot(dtsum, aes(x =n, y =River,fill=Stream.order)) +
  geom_col() +
  theme_classic() +
  coord_flip()+
  theme(
    axis.text.y = element_text(size = 14, hjust = 1),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    plot.margin = margin(rep(15, 4))) +
  labs(title="Distribution of River Samples and their Stream order")

Descriptive statistical Analysis by River

Characteristic	Kwekwe, N = 6	Munyati, N = 8	Muzvezve, N = 4	Svisvi, N = 6	Sebakwe, N = 6	Test Statistic	p-value¹
Alpha, Mean (SD)	11 (6)	14 (5)	18 (7)	8 (2)	13 (6)	2.8	0.046
DO, Mean (SD)	6.88 (1.24)	6.06 (1.48)	5.70 (1.45)	7.78 (0.69)	5.13 (2.47)	2.5	0.065
DO.sat, Mean (SD)	85 (15)	76 (18)	70 (19)	98 (5)	68 (23)	2.9	0.042
Clarity, Mean (SD)	55 (39)	34 (31)	43 (42)	3 (1)	57 (24)	3.3	0.028
salinity, Mean (SD)	0.10 (0.05)	0.06 (0.02)	0.08 (0.04)	0.10 (0.08)	0.03 (0.02)	2.7	0.051
Temperature, Mean (SD)	27.75 (4.03)	27.36 (1.01)	26.43 (0.82)	29.28 (1.50)	25.63 (2.43)	2.1	0.11
Ammonia, Mean (SD)	1.92 (1.28)	0.11 (0.14)	0.10 (0.07)	0.09 (0.05)	0.13 (0.03)	12	<0.001
Phosphates, Mean (SD)	0.72 (0.49)	0.46 (0.46)	1.44 (1.71)	0.89 (0.72)	0.09 (0.05)	2.3	0.091
Cond, Mean (SD)	224 (132)	106 (41)	180 (100)	259 (148)	97 (37)	3.3	0.026
TDS, Mean (SD)	112 (66)	131 (217)	67 (67)	128 (74)	49 (19)	0.53	0.71
Elevation, Mean (SD)	1,269 (113)	1,105 (288)	1,145 (183)	807 (107)	1,296 (118)	6.6	<0.001
substrate, n / N (%)							0.063
mixed	2 / 6 (33%)	0 / 8 (0%)	1 / 4 (25%)	0 / 6 (0%)	2 / 6 (33%)
rocky	3 / 6 (50%)	4 / 8 (50%)	1 / 4 (25%)	0 / 6 (0%)	2 / 6 (33%)
clay	0 / 6 (0%)	2 / 8 (25%)	1 / 4 (25%)	0 / 6 (0%)	1 / 6 (17%)
sandy	1 / 6 (17%)	2 / 8 (25%)	1 / 4 (25%)	6 / 6 (100%)	1 / 6 (17%)
¹ One-way ANOVA; Fisher’s exact test

Note

there is significant differences in mean alpha diversity between the rivers (p<0.05)
there is also significant difference in mean ammonia,conductivity, clarity and elevation
though the p-value is small , we would assume there is significant association between substrate and River

But which rivers actually have different Alpha diversity?

we perform a POST HOC analysis

#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = Alpha ~ River, data = data1)
#> 
#> $River
#>                        diff        lwr        upr     p adj
#> Munyati-Kwekwe     3.541667  -4.881312 11.9646451 0.7315913
#> Muzvezve-Kwekwe    7.166667  -2.900718 17.2340515 0.2552279
#> Svisvi-Kwekwe     -3.333333 -12.337876  5.6712094 0.8113581
#> Sebakwe-Kwekwe     2.333333  -6.671209 11.3378760 0.9393348
#> Muzvezve-Munyati   3.625000  -5.925760 13.1757598 0.7973212
#> Svisvi-Munyati    -6.875000 -15.297978  1.5479784 0.1491607
#> Sebakwe-Munyati   -1.208333  -9.631312  7.2146451 0.9929891
#> Svisvi-Muzvezve  -10.500000 -20.567385 -0.4326152 0.0379058
#> Sebakwe-Muzvezve  -4.833333 -14.900718  5.2340515 0.6271945
#> Sebakwe-Svisvi     5.666667  -3.337876 14.6712094 0.3699209

Note

only Svisvi and Muzvezve have significant mean differences in Alpha diversity

Descriptive statistical analysis by Substrate type

Characteristic	mixed, N = 5¹	rocky, N = 10¹	clay, N = 4¹	sandy, N = 11¹	Test Statistic	p-value²
Alpha	13 (4)	11 (5)	16 (7)	12 (7)	0.69	0.57
DO	6.20 (2.76)	6.60 (1.26)	5.45 (1.80)	6.48 (1.70)	0.43	0.73
DO.sat	83 (20)	81 (16)	68 (24)	82 (21)	0.63	0.60
Clarity	44 (19)	54 (33)	26 (50)	24 (32)	1.6	0.21
salinity	0.07 (0.05)	0.08 (0.05)	0.05 (0.03)	0.08 (0.06)	0.36	0.78
Temperature	27.50 (3.23)	26.09 (1.79)	26.45 (2.30)	28.76 (2.20)	2.7	0.069
Ammonia	1.39 (1.70)	0.40 (0.64)	0.05 (0.04)	0.27 (0.58)	2.5	0.078
Phosphates	0.59 (0.61)	0.42 (0.45)	0.41 (0.51)	0.99 (1.14)	1.0	0.39
Cond	145 (125)	181 (107)	105 (58)	191 (132)	0.65	0.59
TDS	72 (63)	93 (51)	29 (20)	149 (183)	1.2	0.34
Elevation	1,252 (119)	1,190 (180)	1,044 (393)	1,029 (269)	1.4	0.26
River						0.063
Kwekwe	2 / 5 (40%)	3 / 10 (30%)	0 / 4 (0%)	1 / 11 (9.1%)
Munyati	0 / 5 (0%)	4 / 10 (40%)	2 / 4 (50%)	2 / 11 (18%)
Muzvezve	1 / 5 (20%)	1 / 10 (10%)	1 / 4 (25%)	1 / 11 (9.1%)
Svisvi	0 / 5 (0%)	0 / 10 (0%)	0 / 4 (0%)	6 / 11 (55%)
Sebakwe	2 / 5 (40%)	2 / 10 (20%)	1 / 4 (25%)	1 / 11 (9.1%)
Stream.order						0.24
1	0 / 5 (0%)	0 / 10 (0%)	1 / 4 (25%)	2 / 11 (18%)
2	2 / 5 (40%)	4 / 10 (40%)	0 / 4 (0%)	3 / 11 (27%)
3	3 / 5 (60%)	5 / 10 (50%)	1 / 4 (25%)	6 / 11 (55%)
4	0 / 5 (0%)	1 / 10 (10%)	2 / 4 (50%)	0 / 11 (0%)
¹ Mean (SD); n / N (%)
² One-way ANOVA; Fisher’s exact test

Note

none of the test outputs is significant enough at 5% level of significance

Correlations between alpha diversity and different water parameters

data1 |> 
  select_if(is.numeric) |> 
  gather("features","values",-Alpha) |> 
  group_by(features) |> 
  mutate(coef_dat=round(cor(values,Alpha),2)) |> 
  mutate(features=paste(features,' vs Alpha , r= ' ,coef_dat ,sep="")) |> 
  ggplot(aes(x=values,y=Alpha,color=features))+
  geom_point(show.legend = F)+
  geom_smooth()+
  facet_wrap(~features,scales="free")+
  ggthemes::scale_color_stata()

Note

there happen to be a high degree of negative correlation between alpha diversity and c(temperature,Dissolved oxygen saturation) whilst there is some significant correlation between Alpha diversity and Elevation

Principal component analysis

prepare the data for PCA

#subsetting only the numerical variables
pc_data <-data1|> 
  select(Alpha,macrophytes,DO.sat,Clarity,salinity,Temp,Ammonia,Phosphates,Cond,TDS,Elevation,DO,site) 

pc_data<-column_to_rownames(pc_data,var="site")
#making the PCA dataframe
# NB : can also use prcomp(p, center = TRUE, scale = TRUE)
## Principal component analysis

#same as scale(pc_data) |> prcomp
cor_pca <- prcomp(pc_data, scale = T)

manually calculate eigen values,propotion and cumulative propotion

# eigen values
cor_results <- data.frame(eigen_values = cor_pca$sdev ^ 2)
cor_results |>
    mutate(proportion = eigen_values / sum(eigen_values),
           cumulative = cumsum(proportion))
#>    eigen_values  proportion cumulative
#> 1    3.61943351 0.301619459  0.3016195
#> 2    2.34486631 0.195405525  0.4970250
#> 3    1.70074657 0.141728881  0.6387539
#> 4    1.25065342 0.104221118  0.7429750
#> 5    1.11059667 0.092549722  0.8355247
#> 6    0.75108740 0.062590616  0.8981153
#> 7    0.48820937 0.040684114  0.9387994
#> 8    0.32120028 0.026766690  0.9655661
#> 9    0.25538431 0.021282026  0.9868482
#> 10   0.09492865 0.007910720  0.9947589
#> 11   0.04668138 0.003890115  0.9986490
#> 12   0.01621214 0.001351012  1.0000000

summary(cor_pca)
#> Importance of components:
#>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
#> Standard deviation     1.9025 1.5313 1.3041 1.1183 1.05385 0.86665 0.69872
#> Proportion of Variance 0.3016 0.1954 0.1417 0.1042 0.09255 0.06259 0.04068
#> Cumulative Proportion  0.3016 0.4970 0.6388 0.7430 0.83552 0.89812 0.93880
#>                            PC8     PC9    PC10    PC11    PC12
#> Standard deviation     0.56675 0.50536 0.30810 0.21606 0.12733
#> Proportion of Variance 0.02677 0.02128 0.00791 0.00389 0.00135
#> Cumulative Proportion  0.96557 0.98685 0.99476 0.99865 1.00000

using factoextra

library(factoextra)

# estimate pca scores by covariate

components <- prcomp(pc_data, scale = TRUE)

# apply rotation

components$rotation <- components$rotation

# show components

components$rotation|>
  as.data.frame()
#>                     PC1          PC2          PC3         PC4         PC5
#> Alpha       -0.35449121 -0.099861865  0.034611711  0.18790555 -0.46343472
#> macrophytes -0.23230742 -0.092419235  0.548808883  0.05226851  0.12074187
#> DO.sat       0.44589849 -0.009195734  0.224887088  0.30018957  0.12908776
#> Clarity     -0.28029176 -0.381657502  0.010384165  0.31668735  0.24755452
#> salinity     0.24173971 -0.514308694 -0.008787703 -0.12921516 -0.17433411
#> Temp         0.29431009  0.264654513 -0.380144332 -0.17037232 -0.03403939
#> Ammonia      0.10928400 -0.218758665 -0.523617522  0.21447741  0.41251633
#> Phosphates   0.01194742 -0.102994483 -0.294317484  0.47150593 -0.60047136
#> Cond         0.28872248 -0.489789925  0.039847584 -0.14116307 -0.10158596
#> TDS          0.01386244 -0.359919212  0.016981856 -0.54622162 -0.12690411
#> Elevation   -0.37989159 -0.253279618 -0.250867727  0.03512616  0.30801738
#> DO           0.39745139 -0.095572209  0.277007128  0.37147576  0.09548333
#>                     PC6         PC7         PC8         PC9         PC10
#> Alpha        0.28166620  0.42228029 -0.39808272 -0.37170706 -0.175865531
#> macrophytes -0.48446951 -0.14559438  0.03116949 -0.46254752  0.317589568
#> DO.sat      -0.04479548  0.21536370 -0.28856902  0.04201020  0.193054265
#> Clarity     -0.30526222 -0.21353368 -0.35832964  0.30406258 -0.487119843
#> salinity     0.18413223 -0.26871710 -0.19197187 -0.27528921  0.221645383
#> Temp        -0.37035133 -0.07870618 -0.62879524 -0.09712516  0.101071591
#> Ammonia     -0.09713152  0.21785145  0.28089682 -0.53855533 -0.142456583
#> Phosphates  -0.37482391 -0.10109928  0.27969256  0.17060108  0.244652854
#> Cond         0.19578007 -0.29303947  0.05169217  0.04862992 -0.116988890
#> TDS         -0.42867332  0.56184694  0.08984046  0.18520043 -0.055715660
#> Elevation    0.20587477  0.13821484 -0.16681026  0.28456362  0.661267540
#> DO           0.07276285  0.38957946  0.03081807  0.17793314  0.009800876
#>                    PC11         PC12
#> Alpha        0.16632006 -0.014207618
#> macrophytes  0.16749334 -0.138718540
#> DO.sat       0.11295599  0.678442518
#> Clarity     -0.12866827  0.024799710
#> salinity    -0.60166250 -0.002640343
#> Temp         0.14943825 -0.296522618
#> Ammonia      0.05009729  0.034839451
#> Phosphates   0.02297821  0.036238312
#> Cond         0.70318702 -0.086679360
#> TDS         -0.04938556  0.090379332
#> Elevation    0.13897252 -0.074904239
#> DO          -0.10666523 -0.638673640

fviz_screeplot(components, addlabels = TRUE, ylim = c(0, 36))+
  labs(title='Figure 11. Proportion of Variance Explained by Individual PCA Components (Dimensions)')

Note

from the screeplot , the first 4 principal components contain about 84.3% of the variability in the data
on the other hand the first five components contribute about 83.6% of the variability in the data and small for the subsequent PCs
thus the first four components are an effective summary of all the variables thus instead of working with all the variables ,we could use the 4 or 5 components for future modeling.

assess variable importance to each if the first 4 components (dim)

v1<-fviz_contrib(components, choice = "var", axes = 1, top = 10, title='Contribution of Variables \n to PCA1')
v2<-fviz_contrib(components, choice = "var", axes = 2, top = 10, title='Contribution of Variables \n to PCA2')
v3<-fviz_contrib(components, choice = "var", axes = 3, top = 10, title='Contribution of Variables \n to PCA3')
v4<-fviz_contrib(components, choice = "var", axes = 4, top = 10, title='Contribution of Variables \n to PCA4')

v1|v2

Note

the red dashed line on the graphs above indicate the expected average contribution . for a certain component , a variable with a contribution exceeding this benchmark is considered as important in contributing to the component .
from the output Dissolved oxygen saturation, Dissolved oxygen ,Elevetation and temperature contribute the most to first component.
On the other hand salinity,conductivity ,clarity and Total dissolved salts contribute more to second component.

v3|v4

Note

macrophytes , Ammonia and Temperature contribute significantly on PCA3

Visualizing PCA

The PCA can be visualized by using a variable factor map and an individual factor map or a combination of both maps, called the biplot. The variable plot shows the variables as vectors (arrows). The vectors start at the origin [0,0] and extend to coordinates given by the loading vector. This plot illustrates the contribution of each variable to components and the correlation of each variable.

fviz_pca_var(components, alpha.var = "contrib", col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title = 'Influence of Variables on PCA1 and PCA2', repel = TRUE )

Insight from variables factor map :

The more parallel the arrows (variables) to PC, the more those variables contribute to that PC. thus we note that Dissolved oxygen saturation, Dissolved oxygen and temperature contribute positively to the first PC , Alpha diversity and macrophytes contribute negatively to first component
The longer the arrow, the more variability of this variable is represented by two PC (red arrows).
Correlation between each variable can be defined by the angle between arrows and positively correlated variables are grouped together. A high positive correlation is represented by small-angle differences thus we note that Alpha diversity ,macrophytes ,Elevation and Clarity are in the same direction hence they are correlated in some way ,

it really make sense since macroinvertebrates feed on macrophytes and they need clear water to survive.


#Biplot of individuals and variables
fviz_pca_biplot(components, repel = TRUE,
                title='Distribution of numeric variables: from PCA1 and PCA2',
                col.var = "#2E9FDF",col.ind = "#696969",)

Note

the plot shows which which sites actually have the highest of the variables pointed to them
site MP2 ,MN1 and MB1 are high in alpha diversity

Principal component ellipses

Statistical modeling

we begin off with a linear regression model

mod_lm<-lm(Alpha~River+DO+Cond+TDS+Clarity+Temp+macrophytes+Ammonia+Phosphates+Elevation+Stream.order+substrate,data=data1)
summary(mod_lm)
#> 
#> Call:
#> lm(formula = Alpha ~ River + DO + Cond + TDS + Clarity + Temp + 
#>     macrophytes + Ammonia + Phosphates + Elevation + Stream.order + 
#>     substrate, data = data1)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.8528 -1.2873 -0.0888  1.3829  5.1517 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)     72.160087  29.454947   2.450  0.03427 * 
#> RiverMunyati    -7.984290   4.318036  -1.849  0.09419 . 
#> RiverMuzvezve   -7.949404   4.959435  -1.603  0.14004   
#> RiverSvisvi    -22.088267   8.649163  -2.554  0.02868 * 
#> RiverSebakwe    -8.303602   3.913732  -2.122  0.05986 . 
#> DO               0.476949   0.808486   0.590  0.56833   
#> Cond            -0.005017   0.010937  -0.459  0.65625   
#> TDS             -0.002222   0.010932  -0.203  0.84302   
#> Clarity          0.040850   0.045816   0.892  0.39355   
#> Temp            -0.983857   0.589829  -1.668  0.12627   
#> macrophytes     -0.114893   0.056560  -2.031  0.06965 . 
#> Ammonia         -7.878375   2.419128  -3.257  0.00862 **
#> Phosphates       3.367858   1.730696   1.946  0.08028 . 
#> Elevation       -0.005399   0.011221  -0.481  0.64076   
#> Stream.order2   -9.346333   5.250047  -1.780  0.10539   
#> Stream.order3   -9.358622   6.309718  -1.483  0.16884   
#> Stream.order4   -7.370862   9.946471  -0.741  0.47570   
#> substraterocky  -8.612929   2.763446  -3.117  0.01093 * 
#> substrateclay   -6.753207   3.861545  -1.749  0.11089   
#> substratesandy  -3.449021   3.380512  -1.020  0.33165   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.596 on 10 degrees of freedom
#> Multiple R-squared:  0.8736, Adjusted R-squared:  0.6335 
#> F-statistic: 3.639 on 19 and 10 DF,  p-value: 0.02046

Note

most variables have negative effects on Alpha diversity based on the coefficient of the variables
substraterocky and ammonia have coefficients that are statistically significant
the overal model is better than a null model (p=0.02046)
the amount of variability explained is $R 2 = 0.8736 = 87 %$ implying a linear model captures the data well

performance::check_model(mod_lm)

however from the smoothened scatter plots plotted earlier , it can be noted that some variables are not linear to Alpha diversity hence the need for generalised Additive Models

Generalised Additive models

Briefly, GAMs are effectively a nonparametric form of regression where the $β x i$ of a linear regression is replaced by a smooth function of the explanatory variables, $f (x i)$ , and the model becomes:

$y i = f (x i) + ϵ i$ where $y i$ is the response variable, $x i$ is the predictor variable, and $f$ is the smooth function.

Importantly, given that the smooth function $f (x i)$ is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response.

That is, as opposed to one fixed coefficient $β$ , the function $f$ can continually change over the range of $x i$ . The degree of smoothness (or wiggliness) of $f$ is controlled using penalized regression determined automatically in mgcv using a generalized cross-validation (GCV) routine

We can try to build a more appropriate model by fitting the data with a smoothed (non-linear) term. In mgcv::gam(), smooth terms are specified by expressions of the form s(x), where $x$ is the non-linear predictor variable we want to smooth.

mod_gam1<-mgcv::gam(Alpha~River+s(DO)+s(Cond)+s(TDS)+s(Clarity)+s(Temp)+s(macrophytes)+s(Ammonia)+s(Phosphates)+s(Elevation)+Stream.order+substrate,data=data1,method="REML")
summary(mod_gam1)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> Alpha ~ River + s(DO) + s(Cond) + s(TDS) + s(Clarity) + s(Temp) + 
#>     s(macrophytes) + s(Ammonia) + s(Phosphates) + s(Elevation) + 
#>     Stream.order + substrate
#> 
#> Parametric coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)      37.661      7.344   5.128 0.000634 ***
#> RiverMunyati    -10.095      3.766  -2.681 0.025310 *  
#> RiverMuzvezve    -8.814      4.176  -2.111 0.064183 .  
#> RiverSvisvi     -24.878      7.329  -3.395 0.008014 ** 
#> RiverSebakwe     -9.678      3.351  -2.888 0.018059 *  
#> Stream.order2   -10.335      4.381  -2.359 0.042854 *  
#> Stream.order3   -10.342      5.326  -1.942 0.084304 .  
#> Stream.order4    -5.466      8.726  -0.626 0.546690    
#> substraterocky   -9.764      2.358  -4.141 0.002554 ** 
#> substrateclay    -9.465      3.452  -2.742 0.022901 *  
#> substratesandy   -2.768      2.864  -0.966 0.359326    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                  edf Ref.df      F p-value   
#> s(DO)          1.000   1.00  0.003 0.96083   
#> s(Cond)        1.000   1.00  0.081 0.78288   
#> s(TDS)         1.000   1.00  0.434 0.52632   
#> s(Clarity)     1.000   1.00  2.381 0.15724   
#> s(Temp)        1.000   1.00  5.585 0.04237 * 
#> s(macrophytes) 1.000   1.00  6.419 0.03204 * 
#> s(Ammonia)     1.000   1.00 19.388 0.00171 **
#> s(Phosphates)  2.057   2.46  1.718 0.18961   
#> s(Elevation)   1.000   1.00  0.614 0.45344   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.749   Deviance explained = 92.3%
#> -REML =   42.6  Scale est. = 8.8445    n = 30

Caution

We’ll start with the effective degrees of freedom[^lme4], or edf. In typical OLS regression the model degrees of freedom is equivalent to the number of features/terms in the model. This is not so straightforward with a GAM due to the smoothing process and the penalized regression estimation procedure, something that will be discussed more later[^edfinit]. In this situation, we are still trying to minimize the residual sums of squares, but we also have a built-in penalty for ‘wiggliness’ of the fit, where in general we try to strike a balance between an undersmoothed fit and an oversmoothed fit. The default p-value for the test is based on the effective degrees of freedom and the rank $r$ of the covariance matrix for the coefficients for a particular smooth, so here, conceptually, it is the p-value associated with the $F (r, n - e d f)$ .

Note

looking at the output ,variables that have effective degrees of freedom(edf=1) generally have been reduced to linear effect
we can remove the smooth functions s() for variables that have edf=1

mod_gam1<-mgcv::gam(Alpha~River+DO+Cond+TDS+Clarity+Temp+macrophytes+Ammonia+s(Phosphates)+Elevation+Stream.order+substrate,data=data1,method="REML")
summary(mod_gam1)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> Alpha ~ River + DO + Cond + TDS + Clarity + Temp + macrophytes + 
#>     Ammonia + s(Phosphates) + Elevation + Stream.order + substrate
#> 
#> Parametric coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)     87.446880  25.231425   3.466  0.00716 **
#> RiverMunyati   -10.095091   3.765690  -2.681  0.02531 * 
#> RiverMuzvezve   -8.814419   4.175921  -2.111  0.06418 . 
#> RiverSvisvi    -24.878492   7.328794  -3.395  0.00801 **
#> RiverSebakwe    -9.677819   3.351070  -2.888  0.01806 * 
#> DO               0.035880   0.710315   0.051  0.96082   
#> Cond            -0.002635   0.009282  -0.284  0.78291   
#> TDS             -0.006156   0.009340  -0.659  0.52643   
#> Clarity          0.065323   0.042337   1.543  0.15745   
#> Temp            -1.185668   0.501690  -2.363  0.04254 * 
#> macrophytes     -0.119475   0.047155  -2.534  0.03219 * 
#> Ammonia         -9.246293   2.099816  -4.403  0.00174 **
#> Elevation       -0.007359   0.009393  -0.784  0.45356   
#> Stream.order2  -10.334564   4.381227  -2.359  0.04285 * 
#> Stream.order3  -10.341613   5.326404  -1.942  0.08430 . 
#> Stream.order4   -5.466459   8.726297  -0.626  0.54668   
#> substraterocky  -9.764217   2.358169  -4.141  0.00255 **
#> substrateclay   -9.464739   3.451659  -2.742  0.02290 * 
#> substratesandy  -2.767558   2.864216  -0.966  0.35932   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                 edf Ref.df     F p-value  
#> s(Phosphates) 2.057   2.46 3.512  0.0531 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.749   Deviance explained = 92.3%
#> -REML =  66.01  Scale est. = 8.8445    n = 30

First ,from a statistical standpoint we come with almost the same results as the linear model
now ammonia , macrophytes and Temperature have statistically significant effect on Alpha diversity
moreover , our deviance explained value has increased to 92.3\% showing a better improvement and convincing us to take nonlinear model more seriously
most variables generally have negative impact on Alpha diversity , for instance decreasing ammonia increases alpha diversity and the result is statistically significant. Ammonia is bad for species diversity.
we would also expect an increase in water clarity to increase species diversity as well

Test for a better model

anova(mod_lm,mod_gam1,test="Chisq")
#> Analysis of Variance Table
#> 
#> Model 1: Alpha ~ River + DO + Cond + TDS + Clarity + Temp + macrophytes + 
#>     Ammonia + Phosphates + Elevation + Stream.order + substrate
#> Model 2: Alpha ~ River + DO + Cond + TDS + Clarity + Temp + macrophytes + 
#>     Ammonia + s(Phosphates) + Elevation + Stream.order + substrate
#>    Res.Df     RSS     Df Sum of Sq Pr(>Chi)  
#> 1 10.0000 129.337                            
#> 2  8.9431  79.097 1.0569     50.24  0.01879 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note

interesting , not that we couldn`t have assumed as such already but now we have additional statistical evidence to suggest that incorporating nonlinear relationships of the features improves the model.
the Pr(>Chi) value is less than 0.05 indicating that incorporating nonlinear aspect of the variables is important