Some useful formatting and visualization packages for this R Markdown:

library(ggplot2) #plotting
library(tidyverse) #data tidying power
library(gt) #tables
library(here) #file path management

Learning goals

  1. Know that different kinds of ecological data require different GLM distributions (with a nifty table as a resource)

  2. Know how to fit a GLM in R, which includes three steps:

  • fit a full model based on an ecological question

  • choose the best-fitting model from among all possible models using AIC

  • run model diagnostics to determine that your model meets the assumptions of the distribution you’ve chosen

  3. Feel less intimidated by statistics! Specifically, remember: there are multiple ways to approach statistics and all of them are a little wrong; statistics is a tool to bolster ecological questions; and as long as you are consistent and transparent about your process (accessible data and code), you’ll be fine

A. Common Distributions

There are lots of different data distributions for fitting LMs and GLMs to ecological data. The Zuur book has a great introduction to many of these, and the Kyle Edwards notes (Lectures 6-13) have great, succinct explanations and worked examples. I don’t actually use many of these, but there are a few that are really common in ecology, which I use all the time.

You can find the model families for a given GLM-calling function in R by accessing the family object for that function. I usually get there by typing ?glm or ?glmmTMB and then selecting either the family or the family_glmmTMB links from those help pages when they pop up.
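
For example, straight from the console (family_glmmTMB is the help topic listing glmmTMB’s extra families; this assumes glmmTMB is installed):

?glm #base R GLM fitting; follow the 'family' link from this page
?family #the families in stats: gaussian, binomial, poisson, Gamma, ...
?glmmTMB::family_glmmTMB #glmmTMB extras: nbinom2, beta_family, tweedie, truncated families, ...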

I (with Tatum) compiled a list of common GLM families, what kinds of data you might fit with them, packages that have that model family, and where you might find more information on each of them in Kyle Edwards’ notes or on the interwebs. You can find complementary information on page 205 of the Zuur book or in Kyle Edwards’ Lecture 11.

GLM Distributions
Find your data type for model families, R packages, resources, and math!

| Distribution | Common Ecological Data | Packages | Resources |
|---|---|---|---|
| **Data type: Binary or proportional** | | | |
| Binomial | Binary or proportional (e.g. presence-absence, percents) | glm, glmmTMB | Kyle Edwards lectures 8, 9, 11 |
| Quasi-binomial | Binary or proportional | glmmTMB | Kyle Edwards lectures 8, 9, 11 |
| Beta or beta-family | Proportions or percents that are never exactly zero or one | betareg, glmmTMB | Kyle Edwards lecture 11 |
| **Data type: Count or abundance** | | | |
| Poisson | Count or abundance data | stats, glmmTMB | Kyle Edwards lectures 6, 8, 10, 11 |
| Negative binomial | Count or abundance data with overdispersion | MASS, glmmTMB | Kyle Edwards lectures 6, 8, 10, 11 |
| Quasi-Poisson | Count or abundance data with overdispersion | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11 |
| Zero-inflated Poisson | Count or abundance data with too many zeros | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11 |
| Zero-inflated negative binomial | Count or abundance data with too many zeros and overdispersion | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11 |
| Truncated Poisson | Count or abundance data that can’t contain zeros (e.g. individuals only counted when present) | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11 |
| Truncated negative binomial | Zero-truncated count or abundance data with overdispersion | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11 |
| Poisson or negative binomial with an offset term | Density data; count or abundance data with unequal sampling effort | stats, MASS, glmmTMB | Kyle Edwards lectures 9, 11 |
| **Data type: Continuous** | | | |
| Gaussian | Normal continuous data (e.g. body measurements) | stats, glmmTMB | Kyle Edwards lectures 1, 11 |
| Gamma | Continuous positive data | stats, glmmTMB | Kyle Edwards lecture 11 |
| Tweedie | Continuous data derived from counts (e.g. tree biomass) with many zeros | glmmTMB | Kyle Edwards lecture 11; https://doi.org/10.1186/s13021-016-0051-z |
| Weibull | Survival/failure analysis, wind speed distributions | stats (dweibull) | https://www.r-bloggers.com/survival-analysis-fitting-weibull-models-for-improving-device-reliability-in-r/ |
You can find all the distributions in a package by exploring the family object in any ?glm* call.

From the survey results, we have a mix of datasets that could be fit with a basic GLM (or extensions into mixed models), including

  • presence of many species across an environmental gradient

  • species across an environmental gradient - classic ecology, haha

  • CO2 emissions through the year from soil that received different treatments

  • abundance of fish exhibiting different social behaviors

  • Herbarium specimen records (collection dates, locations) ~ local climate

  • physiological response of individuals across size and thermal gradient

  • abundance/percent cover

  • species abundance and diversity across various gradients

  • Pathogen presence/absence, species composition and diversity

  • Phenological data from herbarium specimens

  • Condition of a species across environmental gradient and time

B. GLMs in R

I’m going to go through a really common GLM in ecology (Poisson). I also have one extension (categorical predictor variables). Keep in mind that any type of data can follow a variation of the same process, including data that might be best fit by a basic linear model or by a more complicated mixed model! The other tutorial (that I co-led with Tatum Katz this summer) has a few more GLM types (binomial, negative binomial, zero-inflation) that I don’t go through here for simplicity. This tutorial is really meant to walk through the basic steps of fitting a GLM in R.

Why GLM and not a basic LM?

A lot of ecological data do not fit the basic assumptions of a linear model (specifically, normal data distributions).

Like a basic LM, GLMs do still have assumptions (no heteroskedasticity in the residuals, appropriate dispersion, and no zero-inflation).

We’ll be using a dataset on salamander abundances in streams that are either below mountaintop removal mining sites or in control (non-mined sites) to explore some GLM approaches. Let’s see why these data might be better suited for a GLM than a regular LM by looking at the frequency distribution (histogram) of the response variable (count) values. For an LM, the assumption is that these are normally distributed (e.g. have a nice bell curve shape).

library(glmmTMB)
data(Salamanders) #salamander counts in streams below mined vs. control sites
hist(Salamanders$count)

The above histogram demonstrates a distribution I see in just about any of my ecological datasets that involve counts. Specifically, the data are not normally distributed (and there are a lot of zeros).

Conversely, the Sepal.Width measurements in the iris dataset from last week demonstrate a beautiful normally-distributed dataset that could be perfect for a basic linear model.

data(iris)
hist(iris$Sepal.Width)

R Packages for GLMs

library(MASS) #glm.nb for negative binomial models
library(glmmTMB) #Salamanders dataset and lots of families
library(lme4) #pseudo-R2 and mixed model functionality
library(MuMIn) #dredge function for comparing models, AIC, marginal R^2 for mixed models
library(sjmisc) #pseudo R^2 - but I think gets loaded as a dependency
library(DHARMa) #model diagnostics
library(effects) #what do my marginal effects look like?
library(performance) #binomial model diagnostics
library(emmeans) #post hoc for categorical predictors

LMs, GLMs, and their extensions (e.g. mixed models) are used a TON in ecology and other fields so there are lots of R packages for regression, including some base packages and some more fancy ones.

Model-making packages include:

stats (pre-loaded in R), MASS for the glm.nb function for negative binomial data, glmmTMB for most GLM families and mixed models, and boot for bootstrap-based helpers (e.g. cv.glm for cross-validation and glm.diag.plots for diagnostics)

Model-comparing packages include:

MuMIn is the primary package I use here, but I’m sure there are others!

Model-diagnostic packages include:

DHARMa for all the “do my data fit my model?” questions, performance for the same questions for binomial models, effects for looking at the marginal effects of the model (e.g. what is the relationship in my data?), and emmeans for post-hoc comparisons of categorical predictors (covered later)

The GLM process

Fitting a GLM is an iterative process with three main steps:

  1. Determine a model structure based on experimental design (what is the effect of x (and potentially z, a, and b) on y (and potentially c and d)?)

    • What is the effect of water temperature and elevation on the abundance of caddisfly larvae?

    • What is the effect of habitat patch size and productivity on the presence or absence of spiders?

  2. Do some model selection to figure out which model is “best”, which just means something like:

    • I have 1+ predictors (the x, z, a, and b above) and I want to know what combination of those predictors actually explains the patterns in my data without overfitting

  3. Do my data meet model assumptions, and if not, how do I fix this? (Most of my time in statistics is spent on this step; see the sketch below.)
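
Sketched in code, that workflow looks something like this (y, x, z, and mydata are hypothetical placeholders; we flesh each step out with real data below):

full_model <- glm(y ~ x + z, data = mydata, family = "poisson", na.action = "na.fail") #1. full model from the ecological question
dredge(full_model) #2. compare all nested models by AICc (MuMIn)
simulateResiduals(fittedModel = full_model, plot = T) #3. residual diagnostics (DHARMa)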

Words of encouragement: I follow an approach I learned from the Zuur book when developing and fitting models to my data. Start a priori with a full model I want to test, some knowledge of the type of family it should fit (for counts, likely Poisson, negative binomial, or some derivative), and try to remember the ecological reasons for why I’m doing my statistics. Not only does this avoid finding significance just because I’ve thrown in everything and the kitchen sink, but it also helps me explain my statistics to myself and to others who might not be statisticians first. Once I dove into model selection, I realized that everyone on Stack Overflow has their own strong and differing opinions about the “right” way, which made me think there isn’t a “right” way and it’s more important to be able to justify what I did by being consistent and transparent with my process.

Poisson GLM: Count data

Now the part we’ve all been waiting for - let’s fit one of these GLMs!

We’ll be using a dataset of salamander abundances from a paper on the effects of mountaintop removal on salamanders in streams (Price et al. 2015). The dataset is in the glmmTMB package.

data(Salamanders) #from glmmTMB

This dataset has 644 observations of 9 variables:

str(Salamanders)
## 'data.frame':    644 obs. of  9 variables:
##  $ site  : Ord.factor w/ 23 levels "R-1"<"R-2"<"R-3"<..: 13 14 15 1 2 3 4 5 6 7 ...
##  $ mined : Factor w/ 2 levels "yes","no": 1 1 1 2 2 2 2 2 2 2 ...
##  $ cover : num  -1.442 0.298 0.398 -0.448 0.597 ...
##  $ sample: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DOP   : num  -0.596 -0.596 -1.191 0 0.596 ...
##  $ Wtemp : num  -1.2294 0.0848 1.0142 -3.0234 -0.1443 ...
##  $ DOY   : num  -1.497 -1.497 -1.294 -2.712 -0.687 ...
##  $ spp   : Factor w/ 7 levels "GP","PR","DM",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ count : int  0 0 0 2 2 1 1 2 4 1 ...

site: name of a location where repeated samples were taken

mined: factor indicating whether the site was affected by mountain top removal coal mining

cover: amount of cover objects in the stream (scaled)

sample: repeated sample

DOP: Days since precipitation (scaled)

Wtemp: water temperature (scaled)

DOY: day of year (scaled)

spp: abbreviated species name, possibly also life stage

count: number of salamanders observed

Going back to the three steps we need to fit a GLM to our data:

  1. What is our question (e.g. what is the effect of x on y?)

  2. What is the best model? (e.g. which variables explain patterns without going too far?)

  3. Do my data meet model assumptions, and how do I fix this if not?

1. Poisson: What is the question?

Let’s say that this study was part of a project looking at how stream variables influence salamander presence or abundance (we address categorical predictors in the Extensions section, where we revisit the mined vs. not-mined question and how to approach it with a GLM). Our question with a continuous predictor might be:

  • What are the effects of cover and water temperature on salamander abundance?

Salamanders %>%
  gather(-site, -mined, -sample, -DOP, -DOY, -spp, -count, key = "var", value = "value") %>% #long format, keeping cover and Wtemp as the gathered variables
  ggplot(aes(x = value, y = count)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Based on this graph, maybe there is some positive influence of cover on salamander abundance and some negative influence of water temperature.

The same way we fit an LM in the last session, we can call a GLM, but now specifying a distribution family. Again, from above, we have abundance (count) data with a right-skewed distribution (not normal, with a long right tail of large counts), so we will be starting with a Poisson distribution and tinkering with fixes on that model family (e.g. negative binomial).

model <- glm(count ~ cover + Wtemp,
    data = Salamanders,
    family = "poisson")

We can look at the model summary, which looks really similar to the basic regression summary.

summary(model)
## 
## Call:
## glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1270  -1.5988  -1.3516   0.3424  13.7366  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.24068    0.03558   6.764 1.34e-11 ***
## cover        0.27085    0.03491   7.758 8.64e-15 ***
## Wtemp       -0.05746    0.03692  -1.557     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 2053.8  on 641  degrees of freedom
## AIC: 2783.1
## 
## Number of Fisher Scoring iterations: 6

Based on this model, maybe there is an effect of cover on salamander abundance.

2. Poisson: What is the “best” model?

Kyle Edwards Lectures 15 & 16 are on model selection. I’m really not going to focus on the theory all that much here, and rather I’ll focus on the practical application of model selection for linear models.

The goal of model selection is this: there is a perfect model that explains all the pattern and variation in your data, however, you will never know this perfect model and so, instead, you are aiming to find a model that best approximates this magical unicorn model. (Kyle Edwards goes into more detail).

There are many ways to select models, but one that is really common in ecology is using AIC values (or AICc, corrected for small sample sizes - which applies to most/all ecological datasets!). The AIC, or Akaike Information Criterion, is a fancy equation

\[AIC = -2 * log(L(\hat\theta|Y)) + 2K\]

But it is really quite simple to use in practice once you know what the parts of that equation mean. The \(log(L(\hat\theta|Y))\) section is the log likelihood of your parameters (\(\hat\theta\)) of your model given your data (\(Y\)). The log likelihood increases for a better-fitting model, so the \(-2 * log(L(\hat\theta|Y))\) term decreases as fit improves (log likelihoods are the basis of fitting linear models, and I forget the theory ALWAYS). The \(2K\) part is a penalty for adding more parameters, where \(K\) is the number of parameters in your model, and it increases as you add more predictors. All this fancy math just to say:

I’m trying to find the model with the lowest AIC value of all models, which means I’ve found that happy medium between fitting my data well (reward: \(log(L(\hat\theta|Y))\)) and not using too many parameters (penalty: \(K\)).

You can use AIC values with non-nested models, meaning I could compare a model of count ~ Wtemp with a model of count ~ cover. However, one very important thing to keep in mind about AIC-based model comparisons is that you can only compare models fit to the same dataset. (In other words, I could not use AIC to compare a model of count ~ Wtemp from one stream with a model of count ~ Wtemp from another stream).
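
For instance, a minimal sketch of comparing two non-nested, single-predictor models fit to the same Salamanders data (AICc() comes from the MuMIn package loaded above):

m_temp <- glm(count ~ Wtemp, data = Salamanders, family = "poisson") #water temperature only
m_cover <- glm(count ~ cover, data = Salamanders, family = "poisson") #cover only
AICc(m_temp) #the two models are non-nested but fit to the same data,
AICc(m_cover) #so the lower AICc identifies the better-approximating model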

So, let’s get an AICc value for our full model and then extend that to compare it to other models!

The basic summary() function for models prints an AIC value for that model; however, this value is not necessarily the one we should be using here (as I mentioned above, AICc is better for small datasets). (Furthermore, and not to go into too much detail here, but for mixed models you’ll want to do all the model selection steps with a different maximum likelihood estimator, set by putting a REML = FALSE argument into your model.)

summary(model)
## 
## Call:
## glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1270  -1.5988  -1.3516   0.3424  13.7366  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.24068    0.03558   6.764 1.34e-11 ***
## cover        0.27085    0.03491   7.758 8.64e-15 ***
## Wtemp       -0.05746    0.03692  -1.557     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 2053.8  on 641  degrees of freedom
## AIC: 2783.1
## 
## Number of Fisher Scoring iterations: 6
AICc(model)
## [1] 2783.142

For this model, the AIC value from the summary() function and the AIC value from the AICc() function look fairly similar - however, AICc is a better approach and will often give different values for small datasets.
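
For reference, AICc is just AIC with an extra small-sample penalty (a standard formula, for \(n\) observations and \(K\) parameters):

\[AICc = AIC + \frac{2K(K+1)}{n - K - 1}\]

Plugging in this model’s values (\(AIC = 2783.1\), \(K = 3\), \(n = 644\)) gives \(2783.104 + 24/640 \approx 2783.14\), matching the AICc() output above - the correction is tiny here because the dataset is large relative to the number of parameters.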

Now let’s compare this model to all nested models, which is really saying: while I expect that x and z could predict y, it could be that one explains most of the variation in the data and the other adds so little that the parameter-number “penalty” part of the AIC equation (\(2K\)) outweighs the improvement in fit (the \(log(L(\hat\theta|Y))\) “reward”).

The AICc() function comes from the MuMIn package, as does the next function we’ll use: dredge(). However, dredge() refuses to run unless the model has an explicit NA action, because sub-models silently fit to different subsets of rows (different NAs dropped) would not be comparable (dredge() won’t compare models fit to different datasets). So we first refit the model with na.action = "na.fail", and then run dredge().

model <- glm(count ~ cover + Wtemp,
    data = Salamanders,
    family = "poisson",
    na.action = "na.fail")
dredge(model)
## Global model call: glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders, 
##     na.action = "na.fail")
## ---
## Model selection table 
##   (Intrc)  cover    Wtemp df    logLik   AICc delta weight
## 4  0.2407 0.2708 -0.05746  3 -1388.552 2783.1  0.00  0.549
## 2  0.2424 0.2780           2 -1389.758 2783.5  0.39  0.451
## 3  0.2758        -0.09193  2 -1418.500 2841.0 57.88  0.000
## 1  0.2799                  1 -1421.967 2845.9 62.80  0.000
## Models ranked by AICc(x)

Now we have a bunch of AICc values to compare, along with a “delta” value for each, which is just the difference between that model and the “best” model of the list. A delta of about 2 is a common rule of thumb for deciding whether two models are distinguishable. In this case, our top two models are within 2 AICc units of each other - which happens quite often. When that happens, I usually go for the simpler of the two, which here is the model with just cover as a predictor of count. (I got this approach from Zuur, but it could be good to read up on it on your own. You can also report both, and use emmeans() or other functions to determine significance/p-values down the line.) In general, choosing the top model from all of these models is also a totally appropriate approach, even if the next model is “close” in AICc values (see what I mean about the “lots of approaches to stats”?)
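
If you want to pull that candidate set out programmatically rather than eyeballing the table, a small sketch (saving the dredge results to an object first):

dredge_results <- dredge(model)
subset(dredge_results, delta < 2) #all models within 2 AICc units of the best model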

3. Poisson: Do my data fit the model assumptions?

Now that we have the model we want to report in our paper, we need some way to say whether our data actually meet the assumptions of this model. You can approach this in a similar way to basic LM diagnostics (looking at heteroskedasticity of residuals using residual or Q-Q plots), but there are several extensions here (dispersion and zero-inflation), AND, as you get to more complicated models (e.g. mixed models), there are reasons to want something that attaches a confidence level (e.g. a p-value) to whether your visual assessment of the data warrants revisiting your model.

I use the DHARMa package for this in all my linear models (mixed and non-mixed). DHARMa simulates many sets of residuals for each observation based on the fitted model, including the probability of those values given the data. The outputs are AWESOME diagnostics that look really similar to the qqplots and residual plots from basic model diagnostics, except now with some confidence that the data fit model assumptions tacked on (and the ability to test overdispersion and zero-inflation for count data).

This is a very simplified explanation of the DHARMa package, but they have a great tutorial with more in-depth explanations and lots more functions from the package.

First, we’ll simulate our batches of residuals based on our best-fitting model, which you’ll recall had just cover as a predictor:

best_model <- glm(count ~ cover,
    data = Salamanders,
    family = "poisson")

simulationOutput <- simulateResiduals(fittedModel = best_model, plot = T)

YIKES. We’ve got some issues with model assumptions. This shape for a qqplot is definitely evidence of some potential overdispersion, but we can test that explicitly in the DHARMa package using the testDispersion() function.

testDispersion(simulationOutput)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.2729, p-value < 2.2e-16
## alternative hypothesis: two.sided

The testDispersion() function gives us a histogram of simulated expected values, and then a red line indicating our observed dispersion in our actual data. Additionally, this function performs a statistical test of our dispersion with the null hypothesis being “We do not have overdispersion”. Our very low p-value indicates that our dispersion value is not very likely to come from the distribution of expected dispersion values for our model. Simply: we have overdispersion.

Poisson: Common fixes for models that don’t meet assumptions

Overdispersion is a really common problem for count data, and the good news is that there are some pretty easy fixes. (This is a place where opinions deviate on the order in which you apply fixes, but I’ll show you mine. Kyle Edwards and the Zuur book both have great workflows for this, which I based mine on as well. You can also find my flowchart for model selection in the Extensions tab.)

The fixes used can depend on your data and what your model looks like in the first place. There are some simple math-related reasons based on model distributions for why overdispersion could be occurring and there are also some more practical reasons based on the variables we put into (or didn’t put into) the model.

  1. Did I include all the relevant variables?

    • Are there other fixed effects that may be needed in my model?

    • (Not covered today, and maybe a problem with this dataset: is there a need for grouping variables (random effects and mixed models)?)

  2. Do my data actually fit a Poisson distribution? (lots of opinions and approaches to address this point)

    • My approach: 1) refit with a negative binomial model, 2) add a zero-inflation term, 3) think about data transformations - sketched in code below. (Some other useful overdispersion tips, specifically for GLMMs but useful for GLMs too, are on Ben Bolker’s GLMM FAQ page.)
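
In code, the first two of those fixes might look like this (a sketch using glm.nb() from MASS and glmmTMB’s ziformula argument, both loaded above; the formulas follow our best model):

nb_model <- glm.nb(count ~ cover, data = Salamanders) #1) negative binomial
zip_model <- glmmTMB(count ~ cover, ziformula = ~1, family = poisson, data = Salamanders) #2) zero-inflated Poisson
zinb_model <- glmmTMB(count ~ cover, ziformula = ~1, family = nbinom2, data = Salamanders) #or both fixes at once
#then re-run the DHARMa diagnostics on each candidate, e.g.:
#simulateResiduals(fittedModel = nb_model, plot = T)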

What to report in a paper?

Let’s say that best_model is actually the best model for these data and meets assumptions (which it doesn’t in this case…). I will want to report some statistics on this, right? For basic GLMs you can report the \(\beta\) for the model as well as the \(p\)-value for the predictors, which basically gives you

\(\beta\): the slope and direction (+ or -) of the relationship between the x and y components

\(p\)-value: relative importance of the predictor

You can get both of these from the summary() function, where \(\beta\) comes from the value under Estimate for each predictor and the \(p\)-value comes from the Pr(>|z|) column for each predictor.

summary(best_model)
## 
## Call:
## glm(formula = count ~ cover, family = "poisson", data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0761  -1.5854  -1.3064   0.3306  13.7635  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.24241    0.03552   6.825 8.77e-12 ***
## cover        0.27800    0.03458   8.040 9.01e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 2056.2  on 642  degrees of freedom
## AIC: 2783.5
## 
## Number of Fisher Scoring iterations: 6
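
If you’d rather grab those numbers programmatically than read them off the printout, a minimal base-R sketch:

coef(summary(best_model))["cover", ] #Estimate, Std. Error, z value, and Pr(>|z|) for cover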

For a linear relationship like this one, you might also want to know the \(R^2\) value for your relationship, which is a measure of how close your points fall to a straight line (it ranges from 0 to 1, with values closer to 1 indicating a stronger relationship). For GLMs this is called a pseudo-\(R^2\) and can be calculated with the r2() function (from the performance package loaded above):

r2(best_model)
## $R2_Nagelkerke
## Nagelkerke's R2 
##       0.0988587

In your final paper, your model selection methods could sound like:

Methods: We wanted to know which stream abiotic conditions influenced salamander abundance. To do this, we fit generalized linear models with a Poisson distribution (for count data), with two important stream attributes, water temperature and woody debris cover, as predictors and salamander count as the dependent variable. We performed model selection using AICc and ensured that models fit assumptions using model diagnostics. We ran all statistical analyses in R (Cite) using the [package for glm]. We selected models using AICc with the MuMIn package and performed all model diagnostics in DHARMa (CITE packages).

Results: The best-fitting model for salamander abundance included woody debris cover as a predictor variable (\(\beta\) = 0.278, \(p\)-value < 0.001, \(R^2\) = 0.1). There were on average x%/Xx/[some human-friendly value] more salamanders in a stream with 50% woody debris cover than in one with no debris cover.
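
One way to get that human-friendly value: because the Poisson GLM uses a log link, exponentiating a coefficient turns it into a multiplicative effect on counts (a small sketch; note that cover is scaled, so one unit is roughly one standard deviation of cover):

exp(coef(best_model)["cover"]) #exp(0.278) ~ 1.32: each one-unit increase in scaled cover multiplies the expected salamander count by ~1.32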

Extension: Categorical fixed effects

GLMs can sometimes look suspiciously like an ANOVA, which isn’t surprising and makes GLMs really versatile!

Let’s revisit the example of the salamanders in streams that either were or weren’t below mountaintop removal sites. In this example, if we want to know whether mining has an effect on salamander abundance, we are dealing with a categorical fixed effect (mined vs. unmined). We can approach this really similarly to any other GLM and can also get diagnostics that look suspiciously similar to ANOVA diagnostics. Let’s do it:

1. Categorical fixed effect, count response: what is my question?

Again, we have count data, so we’re going to be playing with Poisson or negative binomial models with this one. Our question is now: What is the effect of the presence of mountaintop mining on the abundance of salamanders in streams?

In this situation, it might be more intuitive to look at our data as a boxplot:

ggplot(Salamanders, aes(x = mined, y = count)) +
  geom_boxplot() +
  theme_bw()

But we could also coerce our yes/no predictor into a continuous-type variable and visualize it as a line graph, setting the “no” level to “1”, and the “yes” level to “2”.

Salamanders %>%
  mutate(mine_line = ifelse(mined == "no", 1, 2)) %>% #numeric stand-in for the yes/no factor
  ggplot(aes(x = mine_line, y = count)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()

In both of these graphs, it looks like mining may have a negative effect on salamander abundance (counts are higher at unmined sites).

In our model, the fixed effect will be mined (a binary yes/no) and our response will be count from the Salamanders dataset:

cat_poisson <- glm(count ~ mined, 
                   data = Salamanders,
                   family = "poisson")

We can look at our model summary again to see that the effect of mining does seem to be important for salamander abundance.

summary(cat_poisson)
## 
## Call:
## glm(formula = count ~ mined, family = "poisson", data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1283  -0.9459  -0.7687  -0.1796  11.4753  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2192     0.1048  -11.63   <2e-16 ***
## minedno       2.0368     0.1109   18.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 1575.2  on 642  degrees of freedom
## AIC: 2302.5
## 
## Number of Fisher Scoring iterations: 6

2. Categorical fixed effect, count response: what is the “best” model?

Skipping (but maybe try it on your own?)

3. Categorical fixed effect, count response: do data meet assumptions?

This probably seems repetitive by now - but we can also use DHARMa for this kind of model.

simulationOutput <- simulateResiduals(fittedModel = cat_poisson, plot = T)

This model also has so many of the beautiful model assumption failures.

testDispersion(simulationOutput)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.1261, p-value < 2.2e-16
## alternative hypothesis: two.sided

And a new test, for zero-inflation (a significant result means we have too many zeros for this distribution to fit our data):

testZeroInflation(simulationOutput)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.4637, p-value < 2.2e-16
## alternative hypothesis: two.sided

So we could approach fixing this the same way we’ve approached any other model!

On your own: How would you attempt to fix some of these model assumption failures?

Categorical fixed effect wrap-up: post-hoc and visualizations

In this example, our predictor had two categories, so we can report model results (including the p-value) for the between-group differences (à la ANOVA) straight from the summary() of the model. However, if you had a categorical variable with more than two levels, there are some cool ways to do pair-wise comparisons between those levels (for example, if I had “full mountaintop removal”, “partial mountaintop removal”, and “no mountaintop removal” mining categories in the Salamanders dataset).

Again, at this point I always like to do a quick visual of my data, which I can still do with the allEffects call:

plot(allEffects(cat_poisson))

If I had more than two levels for my predictor, I could use the emmeans package and function to determine the pair-wise differences between groups.

em <- emmeans(cat_poisson, "mined")

The emmeans() function calculates the marginal means (mean of a group given all the other groupings in the model) based on a specified predictor (here, “mined”):

em
##  mined emmean     SE  df asymp.LCL asymp.UCL
##  yes   -1.219 0.1048 Inf    -1.425    -1.014
##  no     0.818 0.0362 Inf     0.746     0.889
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95

and then you can compare these means to each other:

pairs(em)
##  contrast estimate    SE  df z.ratio p.value
##  yes - no    -2.04 0.111 Inf -18.363 <.0001 
## 
## Results are given on the log (not the response) scale.
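
Since these results are on the log scale, you can also ask emmeans to back-transform them for you (a quick sketch using the type argument):

emmeans(cat_poisson, "mined", type = "response") #marginal means as expected counts rather than log-counts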

I’ve also played around a bit with the ggeffects package, which produces similar values in a ggplot-friendly format. I don’t cover it here, but you can always use that to graph marginal means from models. ggeffects is also useful for plotting marginal means of different levels of a random effect for a mixed model!

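A minimal sketch of ggeffects in action (assuming the package is installed; ggpredict() returns response-scale predictions with confidence intervals in a ggplot-friendly data frame):

library(ggeffects)
ggpredict(cat_poisson, terms = "mined") #predicted counts for each level of mined
plot(ggpredict(cat_poisson, terms = "mined")) #quick plot of those predictions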

Flowchart for GLM and GLMM model selection for count data

Here’s a flow chart of my complete model selection and troubleshooting process for count data (built for GLMMs, but it can easily be used for GLMs as well).