Some useful formatting and visualization packages for this R Markdown:
library(ggplot2) #plotting
library(tidyverse) #data tidying power
library(gt) #tables
library(here)
Know that different kinds of ecological data require different GLM distributions (with a nifty table as a resource)
Know how to fit a GLM in R, which includes three steps:
fit a full model based on an ecological question
choose the best-fitting model between all possible models using AIC
run model diagnostics to determine that your model meets the assumptions of the distribution you’ve chosen
There are lots of different data distributions for LMs and GLMs for ecological data. The Zuur book has a great introduction to many of these and the Kyle Edwards notes (Lectures 6-13) have great succinct explanations and worked examples. Many of these I don’t actually use but there are a few that are really common in ecology which I use all the time.
You can find the model families for a given GLM-calling function in R by assessing the family
object for that function. I usually get there by typing ?glm
or ?glmmTMB
and then selecting either the family
or the family_glmmTMB
links from those help pages when they pop up.
I (with Tatum) compiled a list of common GLM families, what kinds of data you might fit with them, packages that have that model family, and where you might find more information on each of them in Kyle Edwards’ notes, or on the interwebs. You can find complementary information on page 205 of the Zuur book of in Kyle Edwards’ Lecture 11.
GLM Distributions | |||
---|---|---|---|
Find your data type for model families, R packages, resources, and math! | |||
Distribution | Common Ecological Data | Packages | Resources |
Data type: Binary or proportional | |||
Binomial | Binary or proportional (e.g. presence-absence, percents) | glm, glmmTMB | Kyle Edwards lectures 8, 9, 11 |
Quasi-binomial | Binary or proportional | glmmTMB | Kyle Edwards lectures 8, 9, 11 |
Beta or Beta-family | Proportions or percents that are not ones or zeros | betareg, glmmTMB | Kyle Edwards lecture 11 |
Data type: Count or abundance | |||
Poisson | Count or abundance data | stats, glmmTMB | Kyle Edwards lecture 6, 8, 10, 11 |
Negative Binomial | Count or abundance data with overdispersion | MASS, glmmTMB | Kyle Edwards lecture 6, 8, 10, 11 |
Quasi-Poisson | Count or abundance data with overdispersion | glmmTMB | Kyle Edwards lecture 6, 8, 10, 11 |
Zero-Inflated Poisson | Count or abundance data with too many zeros | glmmTMB | Kyle Edwards lecture 6, 8, 10, 11 |
Zero-Inflated Negative Binomial | Count or abundance data with too many zeros and overdispersion | glmmTMB | Kyle Edwards lecture 6, 8, 10, 11 |
Truncated Poisson | Count or abundance data that won't have zeros (e.g. only counted when they were present) | glmmTMB | Kyle Edwards lecture 6, 8, 10, 11 |
Truncated Negative Binomial | Count or abundance data that won't have zeros and that is overdispersed | glmmTMB | Kyle Edwards lecture 6, 8, 10, 11 |
Poisson or negative binomial families with an offset term | Density data, Count or abundance data with unequal sample sizes | stats, MASS, glmmTMB | Kyle Edwards lecture 9, 11 |
Data type: Continuous | |||
Gaussian | Normal continuous data (e.g. body measurements) | stats, glmmTMB | Kyle Edwards lecture 1, 11 |
Gamma | Continuous positive data | stats, glmmTMB | Kyle Edwards lecture 11 |
Tweedie | Continuous data derived from count data (e.g. tree biomasses) that will have too many zeros | glmmTMB | Kyle Edwards lecture 11, https://doi.org/10.1186/s13021-016-0051-z |
Weibull | survival/failure analysis, wind speed distributions | dweibull | https://www.r-bloggers.com/survival-analysis-fitting-weibull-models-for-improving-device-reliability-in-r/ |
You can find all the distributions in a package by exploring the family object in any ?glm* call. |
From the survey results, we have a mix of datasets that could be fit with a basic GLM (or extensions into mixed models), including
presence of many species across an environmental gradient
species across an environmental gradient - classic ecology, haha
CO2 emissions through the year from soil that received different treatments
abundance of fish exhibiting different social behaviors
Herbarium specimen records (collection dates, locations) ~ local climate
physiological response of individuals across size and thermal gradient
abundance/percent cover
species abundance and diversity across various gradients
Pathogen presence/absence, species composition and diversity
Phenological data from herbarium specimens
Condition of a species across environmental gradient and time
I’m going to go through a really common GLM in ecology (Poisson). I also have one extension (categorical predictor variables). Keep in mind that any type of data can follow a variation of the same process, including data that might be best fit for a basic linear model, or a more complicated mixed model! The other tutorial (that I co-led with Tatum Katz this summer) has a few more GLM types (binomial, negative binomial, zero-inflation) that I don’t go through here for simplicity. This tutorial is really meant to walk through the basic steps in fitting a GLM in R.
A lot of ecological data do not fit the basic assumptions of a linear model (specifically, normal data distributions).
Like a basic LM, GLMs do still have assumptions (heteroskedasticity, dispersion, and no zero-inflation).
We’ll be using a dataset on salamander abundances in streams that are either below mountaintop removal mining sites or in control (non-mined sites) to explore some GLM approaches. Let’s see why these data might be better suited for a GLM than a regular LM by looking at the frequency distribution (histogram) of the response variable (count
) values. For an LM, the assumption is that these are normally distributed (e.g. have a nice bell curve shape).
library(glmmTMB)
data(Salamanders)
hist(Salamanders$count)
The above histogram demonstrates a distribution I see in just about any of my ecological datasets that involve counts. Specifically, the data are not normally distributed (and there are a lot of zeros).
Conversely, the Sepal.Width
measurements in the iris
dataset from last week demonstrate a beautiful normally-distributed dataset that could be perfect for a basic linear model.
data(iris)
hist(iris$Sepal.Width)
library(MASS) #glm.nb for negative binomial models
library(glmmTMB) #Salamanders dataset and lots of families
library(lme4) #pseudo-R2 and and mixed model functionality
library(MuMIn) #dredge function for comparing models, AIC, marginal R^2 for mixed models
library(sjmisc) #pseudo R^2 - but I think gets loaded as a dependency
library(DHARMa) #model diagnostics
library(effects) #what do my marginal effects look like?
library(performance) #binomial model diagnostics
library(emmeans) #post hoc for categorical predictors
LMs, GLMs, and their extensions (e.g. mixed models) are used a TON in ecology and other fields so there are lots of R packages for regression, including some base packages and some more fancy ones.
Model-making packages include:
stats
(pre-loaded in R), MASS
for the glm.nb
function for negative binomial data, and glmmTMB
for most GLM models and mixed models
Model-comparing packages include:
MuMIn
is the primary package I use here, but I’m sure there are others!
Model-diagnostic packages include: DHARMa
for all the “do my data fit my model” questions, performance
for “do my data fit my model” questions for binomial models, effects
for looking at what the marginal effects are of the model (e.g. what is the relationship in my data?), and emmeans
for models with categorical predictors (covered later)
The process of fitting a GLM includes an iterative process of steps:
Determine a model structure based on experimental design (what is the effect of x (and potentially z, a, and b) on y (and potentially c and d)?)
What is the effect of water temperature and elevation on the abundance of caddisfly larvae?
What is the effect of habitat patch size and productivity on the presence or absence of spiders?
Do some model selection process to figure out what is a best model, which just means something like:
Do my data meet model assumptions, and if not, how do I fix this? (most of my time in statistics is spent on this step)
Words of encouragement: I follow an approach I learned from the Zuur book when developing and fitting models to my data. Start a priori with a full model I want to test, some knowledge of the type of family it should fit (for counts, likely Poisson, negative binomial, or some derivative), and try to remember the ecological reasons for why I’m doing my statistics. Not only does this avoid finding significance just because I’ve thrown in everything and the kitchen sink, but it also helps me explain my statistics to myself and to others who might not be statisticians first. Once I dove in to model selecting, I realized that everyone on Stack Overflow has their own strong and differing opinions about the “right” way, which made me think there isn’t a “right” way and it’s more important to be able to justify what I did by being consistent and transparent with my process.
Now the part we’ve all been waiting for - let’s fit one of these GLMs!
We’ll be using a dataset of salamander abundances from a paper on the effects of mountaintop removal on salamanders in streams (Price et al. 2015). The dataset is in the glmmTMB package.
data(Salamanders) #from glmmTMB
This dataset has 644 observations of 9 variables:
str(Salamanders)
## 'data.frame': 644 obs. of 9 variables:
## $ site : Ord.factor w/ 23 levels "R-1"<"R-2"<"R-3"<..: 13 14 15 1 2 3 4 5 6 7 ...
## $ mined : Factor w/ 2 levels "yes","no": 1 1 1 2 2 2 2 2 2 2 ...
## $ cover : num -1.442 0.298 0.398 -0.448 0.597 ...
## $ sample: int 1 1 1 1 1 1 1 1 1 1 ...
## $ DOP : num -0.596 -0.596 -1.191 0 0.596 ...
## $ Wtemp : num -1.2294 0.0848 1.0142 -3.0234 -0.1443 ...
## $ DOY : num -1.497 -1.497 -1.294 -2.712 -0.687 ...
## $ spp : Factor w/ 7 levels "GP","PR","DM",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ count : int 0 0 0 2 2 1 1 2 4 1 ...
site: name of a location where repeated samples were taken
mined: factor indicating whether the site was affected by mountain top removal coal mining
cover: amount of cover objects in the stream (scaled)
sample: repeated sample
DOP: Days since precipitation (scaled)
Wtemp: water temperature (scaled)
DOY: day of year (scaled)
spp: abbreviated species name, possibly also life stage
count: number of salamanders observed
What is our question (e.g. what is the effect of x on y?)
What is the best model? (e.g. which variables explain patterns without going too far?)
Do my data meet model assumptions, and how do I fix this if not?
Let’s say that this study was part of a project looking at how stream variables influenced salamander presence or abundance (we address categorical predictors in the Extensions and will revisit the mined vs. not mined question there and how to approach that with a GLM). Our question with a continuous predictor might be:
Salamanders %>%
gather(-site, -mined, -sample, -DOP, -DOY, -spp, -count, key = "var", value = "value") %>%
ggplot(aes(x = value, y = count)) +
geom_point() +
geom_smooth(method = "lm", se =F) +
facet_wrap(~ var, scales = "free") +
theme_bw()
Based on this graph, maybe there is some positive influence of cover on salamander abundance and some negative influence of water temperature.
The same way we fit an LM in the last session, we can call a GLM, but now using a GLM formula. Again, from above, we have abundance (count) data with a left-skewed distribution (not normal), so we will be starting with a Poisson distribution and tinkering with fixes on that model family (e.g. negative binomial).
model <- glm(count ~ cover + Wtemp,
data = Salamanders,
family = "poisson")
We can look at the model summary, which looks really similar to the basic regression summary.
summary(model)
##
## Call:
## glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1270 -1.5988 -1.3516 0.3424 13.7366
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.24068 0.03558 6.764 1.34e-11 ***
## cover 0.27085 0.03491 7.758 8.64e-15 ***
## Wtemp -0.05746 0.03692 -1.557 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2120.7 on 643 degrees of freedom
## Residual deviance: 2053.8 on 641 degrees of freedom
## AIC: 2783.1
##
## Number of Fisher Scoring iterations: 6
Based on this model, maybe there is an effect of cover on salamander abundance.
Kyle Edwards Lectures 15 & 16 are on model selection. I’m really not going to focus on the theory all that much here, and rather I’ll focus on the practical application of model selection for linear models.
The goal of model selection is this: there is a perfect model that explains all the pattern and variation in your data, however, you will never know this perfect model and so, instead, you are aiming to find a model that best approximates this magical unicorn model. (Kyle Edwards goes into more detail).
There are many ways to select models, but one that is really common in ecology is using AIC values (or corrected AICc for small sample sizes - which applies to most/all ecological datasets!) The AIC, or Akaike Information Criterion, AIC is a fancy equation
\[AIC = -2 * log(L(\hat\theta|Y)) + 2K\]
But is really quite simple to use in practice once you know what some of the parts of that equation means. The \(log(L(\hat\theta|Y))\) section is the log likelihood of your parameters (\(\hat\theta\)) of your model given your data (\(Y\)). This value will decrease with a better-fitting model (based on log likelihoods which is the basis of linear models and which I forget the theory of ALWAYS). The \(2K\) part is a penalty for adding more parameters, where \(K\) is the number of parameters in your model and will increase as you add more predictors. All this fancy math just to say:
I’m trying to find the model with the lowest AIC value of all models, which means I’ve found that happy medium between fitting my data well (reward: \(log(L(\hat\theta|Y))\)) and not using too many parameters (penalty: \(K\)).
You can use AIC values with non-nested models, meaning I could compare a model of count ~ Wtemp
with a model of count ~ cover
. However, one very important thing to keep in mind about AIC-based model comparisons is that you can only compare models fit to the same dataset. (In other words, I could not use AIC to compare a model of count ~ Wtemp
from one stream with a model of count ~ Wtemp
from another stream).
So, let’s fit an AIC to our model and then extend that to compare it to other models!
The basic summary()
function for models prints an AIC value for that model, however, this value is not necessarily the value we should be using here (as I mentioned before for small datasets). (Furthermore, and not to go into too much detail here, but for mixed models, you’ll want to do all the model selection steps with a different maximum likelihood estimator, set by putting a REML=FALSE
argument into your model. info here).
summary(model)
##
## Call:
## glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1270 -1.5988 -1.3516 0.3424 13.7366
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.24068 0.03558 6.764 1.34e-11 ***
## cover 0.27085 0.03491 7.758 8.64e-15 ***
## Wtemp -0.05746 0.03692 -1.557 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2120.7 on 643 degrees of freedom
## Residual deviance: 2053.8 on 641 degrees of freedom
## AIC: 2783.1
##
## Number of Fisher Scoring iterations: 6
AICc(model)
## [1] 2783.142
For this model, the AIC value from the summary()
function and the AIC value from the AICc()
function look fairly similar - however, AICc is a better approach and will often give different values for small datasets.
Now lets compare this model to all nested models, which is really saying: While I expect that x and z could predict y, it could be that one explains more of the variation in the data and that the other adds so little to that variation that the parameter number “penalty” part of the AIC equation (\(2K\)) is outweighing the parameter number “reward” (\(log(L(\hat\theta|Y))\)).
The AICc()
function comes from the MuMIn
package, as does the next function: dredge()
, which we will be using. However, to use dredge()
without possibly getting an error for the model producing different NA values for each model (dredge()
won’t work to compare different datasets), we have to first add an NA action to the model, and then run dredge()
.
model <- glm(count ~ cover + Wtemp,
data = Salamanders,
family = "poisson",
na.action = "na.fail")
dredge(model)
## Global model call: glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders,
## na.action = "na.fail")
## ---
## Model selection table
## (Intrc) cover Wtemp df logLik AICc delta weight
## 4 0.2407 0.2708 -0.05746 3 -1388.552 2783.1 0.00 0.549
## 2 0.2424 0.2780 2 -1389.758 2783.5 0.39 0.451
## 3 0.2758 -0.09193 2 -1418.500 2841.0 57.88 0.000
## 1 0.2799 1 -1421.967 2845.9 62.80 0.000
## Models ranked by AICc(x)
Now we have a bunch of AICc values to compare, along with a “delta” value for each, which is just the difference between that model and the “best” model of the list. I can’t remember off the top of my head, but a delta = 2 is a good rule of thumb for choosing between two models. In this case, our top two models are within 2 AICc values of each other - which happens quite often. In this case, I usually go for the more simple model of the two, which in this case is the model with just cover
as a predictor of count
. (I got this approach from Zuur, but could be good to read up on it on your own. You can also report both, and use emmeans()
or other functions to determine significance/p-values down the line). In general, choosing the top model from all of these models is also a totally appropriate approach, even if the next model is “close” in AIC values (see what I mean about the “lots of approaches to stats”?)
Now that we have our model that we want to report in our paper, we need some way to say whether our data actually meet the assumptions of this model. You can approach these in a similar way to basic LM model tests (looking at heteroskedasticity of residuals using residual or Q-Q plots), but there are several extensions here (dispersion and zero-inflation) AND some reasons as you get to more complicated models (e.g. mixed models) as to why you might want something that gives you some confidence interval (e.g. p-value) to whether your visual assessments of your data warrant revisiting your model.
I use the DHARMa
package for this in all my linear models (mixed and non-mixed). DHARMa
simulates a bunch of sets of residuals from each observations based on the fitted model, including the probability of those values given the data. The outputs are AWESOME diagnostics that look really similar to the qqplots and residual plots from basic model diagnostics, except now with some confidence that the data fit model assumptions tacked on (and the ability to determine overdispersion and zero-inflation for count data).
This is a very simplified explanation of the DHARMa
package, but they have a great tutorial with more in-depth explanations and lots more functions from the package.
First, we’ll simulate our batches of residuals based on our best-fitting model, which you’ll recall had just cover
as a predictor:
best_model <- glm(count ~ cover,
data = Salamanders,
family = "poisson")
simulationOutput <- simulateResiduals(fittedModel = best_model, plot = T)
YIKES. We’ve got some issues with model assumptions. This shape for a qqplot is definitely evidence of some potential overdispersion, but we can test that explicitly in the DHARMa
package using the testDispersion()
function.
testDispersion(simulationOutput)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 2.2729, p-value < 2.2e-16
## alternative hypothesis: two.sided
The testDispersion()
function gives us a histogram of simulated expected values, and then a red line indicating our observed dispersion in our actual data. Additionally, this function performs a statistical test of our dispersion with the null hypothesis being “We do not have overdispersion”. Our very low p-value indicates that our dispersion value is not very likely to come from the distribution of expected dispersion values for our model. Simply: we have overdispersion.
Overdispersion is a really common problem for count data and the good news is that there are some pretty easy fixes. (A place where opinions on the order in which you apply fixes deviates, but I’ll show you mine. Kyle Edwards and the Zuur book both have great workflows for this, which I based mine off as well. You can also find my flowchart for model selection in the Extensions tab).
The fixes used can depend on your data and what your model looks like in the first place. There are some simple math-related reasons based on model distributions for why overdispersion could be occurring and there are also some more practical reasons based on the variables we put into (or didn’t put into) the model.
Did I include all the relevant variables?
Are there other fixed effects that may be needed in my model?
(Not covered today, and maybe a problem with this dataset: is there a need for grouping variables (random effects and mixed models)?)
Do my data actually fit a Poisson distribution? (lots of opinions and approaches to address this point)
Let’s say that best_model
is actually the best model for these data and meets assumptions (which it doesn’t in this case…). I will want to report some statistics on this, right? For basic GLMs you can report the \(\beta\) for the model as well as the \(p-value\) for the predictors, which basically gives you
\(\beta\): the slope and direction (+ or -) of the relationship between the x and y components
\(p-value\): relative importance of the predictor
You can get both of these from the summary()
function, where \(\beta\) comes from the value under Estimate
for each predictor and the \(p-value\) comes from the Pr(>|z|)
for each predictor.
summary(best_model)
##
## Call:
## glm(formula = count ~ cover, family = "poisson", data = Salamanders)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0761 -1.5854 -1.3064 0.3306 13.7635
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.24241 0.03552 6.825 8.77e-12 ***
## cover 0.27800 0.03458 8.040 9.01e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2120.7 on 643 degrees of freedom
## Residual deviance: 2056.2 on 642 degrees of freedom
## AIC: 2783.5
##
## Number of Fisher Scoring iterations: 6
For a linear relationship like this one, you might also want to know the \(R^2\) value for your relationship, which is a measure of how close your points would be to a linear line (ranges 0 - 1, with a better relationship closer to 1). For GLM’s this is called a pseudo-\(R^2\) and can be run in the sjmisc
package using the r2()
function:
r2(best_model)
## $R2_Nagelkerke
## Nagelkerke's R2
## 0.0988587
In your final paper, your model selection methods could sound like:
Methods: We wanted to know which stream abiotic conditions influenced salamander abundance. To do this, we fit generalized linear models with a poisson distribution (for count data) with two important stream attributes, water temperature and woody debris cover as predictors and salamander count as a dependent variable. We performed model selection using AICc and ensured that models fit assumptions using model diagnostics. We ran all statistical analysis in R (Cite) using the [package for glm]. We selected models using AICc with the MuMIn package and performed all model diagnostics in DHARMa (CITE packages).
Results: The best fitting model for salamander abundance included woody debris cover as a predictor variable (\(\beta\) = 0.278, p-value < 0.001, \(R^2\) = 0.1). There were on average x%/Xx/[some human-friendly value] more salamanders in a stream with 50% woody debris cover as one with no debris cover.
GLMs can sometimes look suspiciously like an ANOVA, which isn’t surprising and makes GLMs really versatile!
Let’s revisit the example of the salamanders in streams that either were or weren’t below mountaintop removal sites. In this example, if we want to know whether mining has an effect on salamander abundance, we are dealing with a categorical fixed effect (mined vs. unmined). We can approach this really similarly to any other GLM and can also get diagnostics that look suspiciously similar to ANOVA diagnostics. Let’s do it:
Again, we have count data, so we’re going to be playing with Poisson or negative binomial models with this one. Our question is now What is the effect of presence of mountaintop mining on the abundance of salamanders in streams?"
In this situation, it might be more intuitive to look at our data as a boxplot:
ggplot(Salamanders, aes(x = mined, y = count)) +
geom_boxplot() +
theme_bw()
But we could also coerce our yes/no predictor into a continuous-type variable and visualize it as a line graph, setting the “no” level to “1”, and the “yes” level to “2”.
Salamanders %>%
mutate(mine_line = ifelse(mined == "no", 1, 2)) %>%
ggplot(aes(x = mine_line, y = count)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_bw()
In both of these graphs, it looks like mining may have a positive effect on salamander abundance.
In our model, the fixed effect will be mined
(a binary yes/no) and our response will be count
from the Salamanders
dataset:
cat_poisson <- glm(count ~ mined,
data = Salamanders,
family = "poisson")
We can look at our model summary again to see that the effect of mining does seem to be important for salamander abundance.
summary(cat_poisson)
##
## Call:
## glm(formula = count ~ mined, family = "poisson", data = Salamanders)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1283 -0.9459 -0.7687 -0.1796 11.4753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2192 0.1048 -11.63 <2e-16 ***
## minedno 2.0368 0.1109 18.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2120.7 on 643 degrees of freedom
## Residual deviance: 1575.2 on 642 degrees of freedom
## AIC: 2302.5
##
## Number of Fisher Scoring iterations: 6
Skipping (but maybe try it on your own?)
This probably seems repetitive by now - but we can also use DHARMa
for this kind of model.
simulationOutput <- simulateResiduals(fittedModel = cat_poisson, plot = T)
This model also has so many of the beautiful model assumption failures.
testDispersion(simulationOutput)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 2.1261, p-value < 2.2e-16
## alternative hypothesis: two.sided
A new one (meaning: we have too many zeros to fit this distribution with our data):
testZeroInflation(simulationOutput)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.4637, p-value < 2.2e-16
## alternative hypothesis: two.sided
So we could approach it like we’ve been approaching any other model when it comes to fixing this!
On your own: How would you attempt to fix some of these model assumption failures?
In this example, our predictor had two categories, and so we can report model results (including the p-value) for the between-group differences (ala ANOVA) from the summary()
for the model. However, if you had a categorical variables with more than two levels, there are some cool ways to do pair-wise comparisons between those levels (for example, if I had “full mountaintop removal”, “partial mountaintop removal”, and “no mountaintop removal” mining categories for the Salamander dataset).
Again, at this point I always like to do a quick visual of my data, which I can still do with the allEffects
call:
plot(allEffects(cat_poisson))
If I had more than two levels for my predictor, I could use the emmeans
package and function to determine the pair-wise differences between groups.
em <- emmeans(cat_poisson, "mined")
The emmeans()
function calculates the marginal means (mean of a group given all the other groupings in the model) based on a specified predictor (here, “mined”):
em
## mined emmean SE df asymp.LCL asymp.UCL
## yes -1.219 0.1048 Inf -1.425 -1.014
## no 0.818 0.0362 Inf 0.746 0.889
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
and then you can compare these means to each other:
pairs(em)
## contrast estimate SE df z.ratio p.value
## yes - no -2.04 0.111 Inf -18.363 <.0001
##
## Results are given on the log (not the response) scale.
I’ve also played around a bit with the ggeffects
package, which produces similar values in a ggplot-friendly format. I don’t cover it here, but you can always use that to graph marginal means from models. ggeffects
is also useful for plotting marginal means of different levels of a random effect for a mixed model!
Here’s a flow chart of my complete model selection process and troubleshooting for count data (for GLMM, but can easily be used for GLM as well)