Packages

Some useful formatting and visualization packages for this R Markdown:

library(ggplot2) #plotting
library(tidyverse) #data tidying power
library(gt) #tables
library(here)

The statistics packages you’ll need for this tutorial:

library(MASS) #glm.nb for negative binomial models
library(glmmTMB) #Salamanders dataset and lots of families
library(lme4) #pseudo-R2 and mixed model functionality
library(MuMIn) #dredge function for comparing models, AIC, marginal R^2 for mixed models
library(sjmisc) #pseudo R^2 - but I think gets loaded as a dependency
library(DHARMa) #model diagnostics
library(effects) #what do my marginal effects look like?
library(performance) #binomial model diagnostics
library(emmeans) #post hoc for categorical predictors

You can install and load packages reproducibly across computers:

package.list <- c("here", "tidyverse", 
                  "ggplot2", "gt",
                  "MASS", "glmmTMB",
                  "lme4", "MuMIn", 
                  "sjmisc", "DHARMa", 
                  "effects", "performance",
                  "emmeans", "ggeffects")

## Installing them if they aren't already on the computer
new.packages <- package.list[!(package.list %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

## And loading them
for(i in package.list){library(i, character.only = T)}

Learning goals

  1. Know that different kinds of ecological data require different GLM distributions (with a nifty table as a resource)

  2. Know how to fit a GLM in R, which includes three steps:

  • fit a full model based on an ecological question

  • choose the best-fitting model between all possible models using AIC

  • run model diagnostics to determine that your model meets the assumptions of the distribution you’ve chosen

  3. Feel less intimidated by statistics in R! Specifically, remember: there are multiple ways to approach statistics and all of them are a little wrong, statistics is a tool to bolster ecological questions, and as long as you are consistent and transparent about your process (accessible data and code), you’ll be fine.

A. Common Distributions

There are lots of different data distributions for LMs and GLMs for ecological data. The Zuur book has a great introduction to many of these, and the Kyle Edwards notes (Lectures 6-13) have great, succinct explanations and worked examples. I don’t actually use many of these myself, but a few are really common in ecology and I use them all the time.

You can find the model families for a given GLM-calling function in R by accessing the family object for that function. I usually get there by typing ?glm or ?glmmTMB and then selecting either the family or the family_glmmTMB links from those help pages when they pop up.
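
For example, you can pull those help pages up directly; these are just the help topics mentioned above (family for base R, family_glmmTMB for glmmTMB):

?family                    #families available to stats::glm
?glmmTMB::family_glmmTMB   #additional families available in glmmTMB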

A collaborator (Tatum Katz, PhD student at UCSB) and I compiled a list of common GLM families, what kinds of data you might fit with them, packages that have that model family, and where you might find more information on each of them in Kyle Edwards’ notes or on the interwebs. You can find complementary information on page 205 of the Zuur book or in Kyle Edwards’ Lecture 11.

GLM Distributions
Find your data type for model families, R packages, and resources!

Distribution | Common Ecological Data | Packages | Resources

Data type: Binary or proportional
  Binomial | Binary or proportional (e.g. presence-absence, percents) | glm, glmmTMB | Kyle Edwards lectures 8, 9, 11
  Quasi-binomial | Binary or proportional | glmmTMB | Kyle Edwards lectures 8, 9, 11
  Beta or Beta-family | Proportions or percents that are not ones or zeros | betareg, glmmTMB | Kyle Edwards lecture 11

Data type: Count or abundance
  Poisson | Count or abundance data | stats, glmmTMB | Kyle Edwards lectures 6, 8, 10, 11
  Negative Binomial | Count or abundance data with overdispersion | MASS, glmmTMB | Kyle Edwards lectures 6, 8, 10, 11
  Quasi-Poisson | Count or abundance data with overdispersion | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11
  Zero-Inflated Poisson | Count or abundance data with too many zeros | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11
  Zero-Inflated Negative Binomial | Count or abundance data with too many zeros and overdispersion | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11
  Truncated Poisson | Count or abundance data that won't have zeros (e.g. only counted when present) | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11
  Truncated Negative Binomial | Count or abundance data that won't have zeros and that is overdispersed | glmmTMB | Kyle Edwards lectures 6, 8, 10, 11
  Poisson or negative binomial with an offset term | Density data, or count/abundance data with unequal sample sizes | stats, MASS, glmmTMB | Kyle Edwards lectures 9, 11

Data type: Continuous
  Gaussian | Normal continuous data (e.g. body measurements) | stats, glmmTMB | Kyle Edwards lectures 1, 11
  Gamma | Continuous positive data | stats, glmmTMB | Kyle Edwards lecture 11
  Tweedie | Continuous data derived from count data (e.g. tree biomasses) with too many zeros | glmmTMB | Kyle Edwards lecture 11; https://doi.org/10.1186/s13021-016-0051-z
  Weibull | Survival/failure analysis, wind speed distributions | dweibull | https://www.r-bloggers.com/survival-analysis-fitting-weibull-models-for-improving-device-reliability-in-r/

You can find all the distributions in a package by exploring the family object in any ?glm* call.

From the survey results, we have a mix of datasets that could be fit with a basic GLM (or extensions into mixed models), including:

  • species occurrence and density across environmental gradients

  • comparing targeted chemical concentration data to cell response

  • survey data for economic impacts to water systems and a lot of additional metadata

  • Plastic count concentration on the surface of the ocean

  • Various water quality data

  • microplastics toxicity data - what microplastics characteristics contribute to toxicity and at what doses?

  • I have data looking at responses to microbes in relation to environmental parameters, both from environmental surveys and in experimental environments

  • abundance of species in multiple traps across seasonal and environmental gradients

B. GLMs in R

I’m going to go through a really common GLM in ecology (Poisson for count data) using continuous predictor variables. Keep in mind that any type of data can follow a variation of the same process, including data that might be best fit by a basic linear model or a more complicated mixed model! This tutorial is really meant to walk through the basic steps of fitting a GLM in R. I provide a tabbed extension at the end for different data types (categorical predictor variables (think: fancy ANOVA) and binomial distributions for presence-absence data) and solutions to common problems (negative binomial distributions for weird count data, and zero-inflated models for even weirder count data).

Why GLM and not a basic LM?

Allison Horst is an R Goddess

A lot of ecological data do not fit the basic assumptions of a linear model (specifically, normal data distributions).

Like a basic LM, GLMs still have assumptions (e.g. no weird patterns in the residuals, i.e. no “heteroskedasticity”, and residuals and data that don’t have a weird skew or too many zeros, i.e. no “overdispersion” or “zero-inflation”).

We’ll be using a dataset on salamander abundances in streams that are either below mountaintop removal mining sites or in control (non-mined) sites to explore some GLM approaches. Let’s see why these data might be better suited for a GLM than a regular LM by looking at the frequency distribution (histogram) of the response variable (count) values. For an LM, the assumption is that these (more precisely, the model residuals) are normally distributed (e.g. have a nice bell curve shape).

library(glmmTMB)
data(Salamanders)
hist(Salamanders$count)

The above histogram demonstrates a distribution I see in just about any of my ecological datasets that involve counts. Specifically, the data are not normally distributed (and there are a lot of zeros).

Conversely, the Sepal.Width measurements in the iris dataset demonstrate a beautiful normally-distributed dataset that could be perfect for a basic linear model.

data(iris)
hist(iris$Sepal.Width)

R Packages for GLMs

library(MASS) #glm.nb for negative binomial models
library(glmmTMB) #Salamanders dataset and lots of families
library(lme4) #pseudo-R2 and mixed model functionality
library(MuMIn) #dredge function for comparing models, AIC, marginal R^2 for mixed models
library(sjmisc) #pseudo R^2 - but I think gets loaded as a dependency
library(DHARMa) #model diagnostics
library(effects) #what do my marginal effects look like?
library(performance) #binomial model diagnostics
library(emmeans) #post hoc for categorical predictors

LMs, GLMs, and their extensions (e.g. mixed models) are used a TON in ecology and other fields so there are lots of R packages for regression, including some base packages and some more fancy ones.

Model-making packages include:

stats (pre-loaded in R), MASS for the glm.nb function for negative binomial data, and glmmTMB or lme4 for most GLMs and mixed models

Model-comparing packages include:

MuMIn is the primary package I use here, but I’m sure there are others!

Model-diagnostic packages include: DHARMa for all the “do my data fit my model” questions, performance for “do my data fit my model” questions for binomial models, effects for looking at what the marginal effects are of the model (e.g. what is the relationship in my data?), and emmeans for models with categorical predictors (covered later)

Model-fitting packages: stats, MASS, glmmTMB, boot, lme4

Model-comparing packages: MuMIn

Model-diagnostic packages: DHARMa, performance, effects, emmeans, boot

The GLM process

Fitting a GLM is an iterative process with a few main steps:

  1. Determine a model structure based on experimental design (what is the effect of x (and potentially z, a, and b) on y (and potentially c and d)?)

    • What is the effect of water temperature and elevation on the abundance of caddisfly larvae?

    • What is the effect of habitat patch size and productivity on the presence or absence of spiders?

  2. Do some model selection to figure out which model is “best”, which just means something like:

    • I have 1+ predictors (the x, z, a, and b above) and I want to know what combination of those predictors actually explains the patterns in my data without overfitting
  3. Do my data meet model assumptions, and if not, how do I fix this? (most of my time in statistics is spent on this step)

Words of encouragement: I follow an approach I learned from the Zuur book when developing and fitting models to my data. I start a priori with a full model I want to test, some knowledge of the type of family it should fit (for counts, likely Poisson, negative binomial, or some derivative), and try to remember the ecological reasons for why I’m doing my statistics. Not only does this avoid finding significance just because I’ve thrown in everything and the kitchen sink, but it also helps me explain my statistics to myself and to others who might not be statisticians first. Once I dove into model selection, I realized that everyone on Stack Overflow has their own strong and differing opinions about the “right” way, which made me think there isn’t a “right” way and it’s more important to be able to justify what I did by being consistent and transparent with my process.

Poisson GLM: Count data

Now the part we’ve all been waiting for - let’s fit one of these GLMs!

We’ll be using a dataset of salamander abundances from a paper on the effects of mountaintop removal on salamanders in streams (Price et al. 2015). The dataset is in the glmmTMB package.

data(Salamanders) #from glmmTMB

This dataset has 644 observations of 9 variables:

str(Salamanders)
## 'data.frame':    644 obs. of  9 variables:
##  $ site  : Ord.factor w/ 23 levels "R-1"<"R-2"<"R-3"<..: 13 14 15 1 2 3 4 5 6 7 ...
##  $ mined : Factor w/ 2 levels "yes","no": 1 1 1 2 2 2 2 2 2 2 ...
##  $ cover : num  -1.442 0.298 0.398 -0.448 0.597 ...
##  $ sample: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DOP   : num  -0.596 -0.596 -1.191 0 0.596 ...
##  $ Wtemp : num  -1.2294 0.0848 1.0142 -3.0234 -0.1443 ...
##  $ DOY   : num  -1.497 -1.497 -1.294 -2.712 -0.687 ...
##  $ spp   : Factor w/ 7 levels "GP","PR","DM",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ count : int  0 0 0 2 2 1 1 2 4 1 ...

site: name of a location where repeated samples were taken

mined: factor indicating whether the site was affected by mountain top removal coal mining

cover: amount of cover objects in the stream (scaled)

sample: repeated sample

DOP: Days since precipitation (scaled)

Wtemp: water temperature (scaled)

DOY: day of year (scaled)

spp: abbreviated species name, possibly also life stage

count: number of salamanders observed

Going back to the three steps we need to fit a GLM to our data:

  1. What is our question (e.g. what is the effect of x on y?)

  2. What is the best model? (e.g. which variables explain patterns without going too far?)

  3. Do my data meet model assumptions, and how do I fix this if not?

1. Poisson: What is the question?

Let’s say that this study was part of a project looking at how stream variables influenced salamander presence or abundance (we address categorical predictors at the end and will revisit the mined vs. not mined question there and how to approach that with a GLM). Our question with a continuous predictor might be:

  • What are the effects of cover and water temperature on salamander abundance?
Salamanders %>%
  gather(-site, -mined, -sample, -DOP, -DOY, -spp, -count, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = count)) +
  geom_point() +
  geom_smooth(method = "lm", se =F) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Based on this graph, maybe there is some positive influence of cover on salamander abundance and some negative influence of water temperature.

In the same way we fit an LM in the last session, we can fit a GLM, now calling the glm() function with a family argument. Again, from above, we have abundance (count) data with a right-skewed distribution (not normal, and with a lot of zeros), so we will be starting with a Poisson distribution and tinkering with fixes on that model family (e.g. negative binomial).

model <- glm(count ~ cover + Wtemp,
    data = Salamanders,
    family = "poisson")

We can look at the model summary, which looks really similar to the basic regression summary.

summary(model)
## 
## Call:
## glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1270  -1.5988  -1.3516   0.3424  13.7366  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.24068    0.03558   6.764 1.34e-11 ***
## cover        0.27085    0.03491   7.758 8.64e-15 ***
## Wtemp       -0.05746    0.03692  -1.557     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 2053.8  on 641  degrees of freedom
## AIC: 2783.1
## 
## Number of Fisher Scoring iterations: 6

In this model summary, the things I usually look at are the Estimate for each predictor, which gives the direction and slope (or intercept) for that predictor, and the Pr(>|z|) value, which gives a p-value for that predictor in the model. The Estimate for cover, for example, says that there is a positive relationship between salamander abundance and cover, with a slope of 0.27085 (on the log link scale). Because this variable has a significant Pr(>|z|), we can also gather that it is an important variable in this model. Ecologically, based on this model, maybe there is an effect of cover on salamander abundance.
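
Because a Poisson GLM uses a log link, that slope of 0.27085 is on the log scale; exponentiating it gives a multiplicative effect that can be easier to talk about. A quick sketch using the model above:

exp(coef(model)) #exp(0.27085) is about 1.31: each 1-unit increase in (scaled) cover
                 #corresponds to roughly a 31% increase in expected salamander count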

2. Poisson: What is the “best” model

Note: what is the aim of your statistical approach?

There’s a great recent (2021) paper on being very specific about what the aim of your statistical approach is, and thus, how this model selection process should look (e.g. am I exploring all possible predictors, am I testing specific alternative hypotheses, am I trying to predict the magnitude of effect of a particular predictor on my response variable?). This is a lot to break down here, but you can find the paper HERE. Whichever aim best fits your data and question will determine the motivation for your approach and how you perform the steps below.

The following process is in the “exploratory” category in that paper:

Kyle Edwards Lectures 15 & 16 are on model selection. I’m really not going to focus on the theory all that much here, and rather I’ll focus on the practical application of model selection for linear models.

The goal of model selection is this: there is a perfect model that explains all the pattern and variation in your data, however, you will never know this perfect model and so, instead, you are aiming to find a model that best approximates this magical unicorn model. (Kyle Edwards goes into more detail).

There are many ways to select models, but one that is really common in ecology is using AIC values (or AICc, corrected for small sample sizes, which applies to most/all ecological datasets!). The AIC, or Akaike Information Criterion, is a fancy-looking equation:

\[AIC = -2 * log(L(\hat\theta|Y)) + 2K\]

But it is really quite simple to use in practice once you know what the parts of that equation mean. The \(log(L(\hat\theta|Y))\) piece is the log likelihood of your model parameters (\(\hat\theta\)) given your data (\(Y\)). This value increases the better your model fits (log likelihood is the basis of fitting these models, and I forget the theory of it ALWAYS). The \(2K\) part is a penalty for adding more parameters, where \(K\) is the number of parameters in your model, so it increases as you add more predictors. (For AICc for small sample sizes, there is an additional penalty for having many predictors relative to the sample size.) All this fancy math just to say:

I’m trying to find the model with the lowest AIC value of all models, which means I’ve found that happy medium between fitting my data well (reward: \(log(L(\hat\theta|Y))\)) and not using too many parameters (penalty: \(K\)).
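
(For reference, the small-sample correction mentioned above just adds one more penalty term, which grows as the number of parameters \(K\) approaches the sample size \(n\); this is the standard AICc formula:)

\[AICc = AIC + \frac{2K(K+1)}{n - K - 1}\]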

You can use AIC values with non-nested models, meaning I could compare a model of count ~ Wtemp with a model of count ~ cover. However, one very important thing to keep in mind about AIC-based model comparisons is that you can only compare models fit to the same dataset. (In other words, I could not use AIC to compare a model of count ~ Wtemp from one stream with a model of count ~ Wtemp from another stream).
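
As a concrete sketch of that non-nested comparison with the salamander data (both models are fit to the full Salamanders dataset, so the comparison is fair):

m_temp  <- glm(count ~ Wtemp, data = Salamanders, family = "poisson")
m_cover <- glm(count ~ cover, data = Salamanders, family = "poisson")
AICc(m_temp)  #the water temperature model
AICc(m_cover) #the cover model; lower AICc = better-supported of the two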

So, let’s get an AICc for our model and then extend that to compare it to other models!

The summary() function for models prints an AIC value for that model, however, this value is not necessarily the value we should be using here (we should use AICc most of the time, as I mentioned before for small datasets). (Furthermore, and not to go into too much detail here, but for mixed models, you’ll want to do all the model selection steps with a different maximum likelihood estimator, set by putting a REML=FALSE argument into your model. info here).

summary(model)
## 
## Call:
## glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1270  -1.5988  -1.3516   0.3424  13.7366  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.24068    0.03558   6.764 1.34e-11 ***
## cover        0.27085    0.03491   7.758 8.64e-15 ***
## Wtemp       -0.05746    0.03692  -1.557     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 2053.8  on 641  degrees of freedom
## AIC: 2783.1
## 
## Number of Fisher Scoring iterations: 6
AICc(model)
## [1] 2783.142

For this model, the AIC value from the summary() function and the AIC value from the AICc() function look fairly similar - however, AICc is a better approach and will often give different values for small datasets, especially as you increase the number of predictors (AICc adds an additional penalty based on the number of predictor variables).

Now let’s compare this model to all models with any combination of the predictors in this model, which is really saying: while I expect that x and z could predict y, it could be that one explains most of the variation in the data and that the other adds so little that the parameter “penalty” part of the AIC equation (\(2K\)) outweighs the fit “reward” (\(log(L(\hat\theta|Y))\)).

Note: again, this is in the “exploratory” category of statistical approaches. If I wanted a more inference-based approach, I might a priori ask “Is water temp, cover, or water temp and cover important for salamander count?” and the statistical method here would be to just test those models against each other.

The AICc() function comes from the MuMIn package, as does the next function we’ll use: dredge(). However, dredge() requires that every candidate model be fit to exactly the same rows of data (it won’t compare models fit to different datasets, which can happen silently if predictors contain NAs), so we first refit the model with na.action = "na.fail" and then run dredge().

model <- glm(count ~ cover + Wtemp,
    data = Salamanders,
    family = "poisson",
    na.action = "na.fail")
dredge(model)
## Global model call: glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders, 
##     na.action = "na.fail")
## ---
## Model selection table 
##   (Intrc)  cover    Wtemp df    logLik   AICc delta weight
## 4  0.2407 0.2708 -0.05746  3 -1388.552 2783.1  0.00  0.549
## 2  0.2424 0.2780           2 -1389.758 2783.5  0.39  0.451
## 3  0.2758        -0.09193  2 -1418.500 2841.0 57.88  0.000
## 1  0.2799                  1 -1421.967 2845.9 62.80  0.000
## Models ranked by AICc(x)

Warning: I would not recommend using the dredge() function in a paper most of the time. Especially if you have lots of predictors, this function will test all possible combinations of predictors and many of those models may not be ecologically motivated. Instead, you might want to take the inference approach I mentioned above and test a specific subset of models that represent alternative hypotheses for how environmental variables are shaping salamander abundance.

Now we have a bunch of AICc values to compare, along with a “delta” value for each, which is just the difference between that model’s AICc and the lowest (“best”) AICc in the list. A rule of thumb: a delta of 2 is a good cutoff for choosing between two models (AIC is on a log-likelihood scale, so a model with a delta of 2 is about exp(-2/2) ≈ 0.37 times as well supported as the top model). In this case, our top two models are within 2 AICc units of each other - which happens quite often. When that happens, I usually go for the simpler of the two models, which in this case is the model with just cover as a predictor of count. (I got this approach from Zuur, but it could be good to read up on it on your own. You can also report both, and use emmeans() or other functions to determine marginal significance/p-values down the line). In general, choosing the top model from all of these models is also a totally appropriate approach, even if the next model is “close” in AICc values (see what I mean about the “lots of approaches to stats”?)
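
If you do decide to go with the top-ranked model, you can pull it straight back out of the dredge() table rather than retyping it. A small sketch using MuMIn’s get.models(); dredge_out is just a name I’m giving the saved table:

dredge_out <- dredge(model)
top_model <- get.models(dredge_out, subset = 1)[[1]] #refit of the best-ranked model
summary(top_model)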

3. Poisson: Do my data fit the model assumptions?

Now that we have the model we want to report in our paper, we need some way to say whether our data actually meet the assumptions of that model. You can approach this the same way as basic LM diagnostics (looking at heteroskedasticity in residual or Q-Q plots), but there are extensions for GLMs (dispersion and zero-inflation), AND, as you get to more complicated models (e.g. mixed models), there are reasons to want something that attaches a p-value to whether your visual assessment of the residuals warrants revisiting your model.

I use the DHARMa package for this in all my linear models (mixed and non-mixed). DHARMa simulates many new datasets from the fitted model and uses them to calculate a scaled (quantile) residual for each observation. The outputs are AWESOME diagnostics that look really similar to the Q-Q plots and residual plots from basic model diagnostics, except now with some confidence attached to whether the data meet model assumptions (and the ability to test for overdispersion and zero-inflation in count data).

This is a very simplified explanation of the DHARMa package, but they have a great tutorial with more in-depth explanations and lots more functions from the package.

First, we’ll simulate our batches of residuals based on our best-fitting model, which you’ll recall had just cover as a predictor:

best_model <- glm(count ~ cover,
    data = Salamanders,
    family = "poisson")

simulationOutput <- simulateResiduals(fittedModel = best_model, plot = T)

YIKES. We’ve got some issues with model assumptions. This shape for a qqplot is definitely evidence of some potential overdispersion, but we can test that explicitly in the DHARMa package using the testDispersion() function.

testDispersion(simulationOutput)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.2729, p-value < 2.2e-16
## alternative hypothesis: two.sided

The testDispersion() function gives us a histogram of simulated expected values, and then a red line indicating our observed dispersion in our actual data. Additionally, this function performs a statistical test of our dispersion with the null hypothesis being “We do not have overdispersion”. Our very low p-value indicates that our dispersion value is not very likely to come from the distribution of expected dispersion values for our model. Simply: we have overdispersion.
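
A quick back-of-the-envelope check that doesn’t need DHARMa: for a Poisson GLM the residual deviance should be roughly equal to the residual degrees of freedom, so their ratio is a crude dispersion estimate (values well above 1 suggest overdispersion):

deviance(best_model) / df.residual(best_model) #2056.2 / 642, roughly 3.2: strongly overdispersed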

Poisson: Common fixes for models that don’t meet assumptions

Overdispersion is a really common problem for count data and the good news is that there are some pretty easy fixes. (This is a place where opinions on the order in which you apply fixes deviate, but I’ll show you mine. Kyle Edwards and the Zuur book both have great workflows for this, which mine is based on as well. You can also find my flowchart for model selection in the Extensions tab).

The fixes used can depend on your data and what your model looks like in the first place. There are some simple math-related reasons based on model distributions for why overdispersion could be occurring and there are also some more practical reasons based on the variables we put into (or didn’t put into) the model.

  1. Did I include all the relevant variables?

    • Are there other fixed effects that may be needed in my model?

    • (Not covered today, and maybe a problem with this dataset: is there a need for grouping variables to account for non-independent samples (random effects and mixed models)?)

  2. Do my data actually fit a Poisson distribution? (lots of opinions and approaches to address this point)

    • My approach: 1) refit with a negative binomial model, 2) add a zero-inflation term, 3) think about data transformations. (some other useful overdispersion tips, specifically for GLMMs, but useful for GLMs too, on Ben Bolker’s GLMM FAQ page)
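
A minimal sketch of the first two fixes in that list (both are worked through in more detail in the Extensions below; keep whatever predictors your best model had - I’m using cover here just to match the example, and nb_fix/zi_fix are placeholder names):

nb_fix <- glm.nb(count ~ cover, data = Salamanders) #1) negative binomial refit (MASS)
zi_fix <- glmmTMB(count ~ cover, ziformula = ~1,    #2) zero-inflation term (glmmTMB)
                  family = "poisson", data = Salamanders)
simulateResiduals(fittedModel = nb_fix, plot = TRUE) #re-run diagnostics after each fix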

What to report in a paper?

Let’s say that best_model is actually the best model for these data and meets assumptions (which it doesn’t in this case…). I will want to report some statistics on this, right? For basic GLMs you can report the \(\beta\) (Estimate) for the model as well as the \(p-value\) (Pr(>|z|)) for the predictors, which basically gives you

\(\beta\): the slope and direction (+ or -) of the relationship between the x and y components (or, for categorical variables, the by-category intercept)

\(p-value\): relative importance of the predictor

You can get both of these from the summary() function, where \(\beta\) comes from the value under Estimate for each predictor and the \(p-value\) comes from the Pr(>|z|) for each predictor.

summary(best_model)
## 
## Call:
## glm(formula = count ~ cover, family = "poisson", data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0761  -1.5854  -1.3064   0.3306  13.7635  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.24241    0.03552   6.825 8.77e-12 ***
## cover        0.27800    0.03458   8.040 9.01e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 2056.2  on 642  degrees of freedom
## AIC: 2783.5
## 
## Number of Fisher Scoring iterations: 6

For a linear relationship like this one, you might also want to know the \(R^2\) value for your relationship, which is a measure of how tightly your points cluster around the fitted line (it ranges 0 - 1, with a better relationship closer to 1). For GLMs this is called a pseudo-\(R^2\) and can be calculated with the r2() function (from the performance package):

r2(best_model)
## $R2_Nagelkerke
## Nagelkerke's R2 
##       0.0988587

An \(R^2\) value tells you how much of the variation in your data is explained by the model.

You could also report confidence intervals for the model coefficients as an additional measure of how precisely the model estimates those effects. I’ll use the confint function in the MASS package.

confint(best_model)
##                 2.5 %    97.5 %
## (Intercept) 0.1719303 0.3111744
## cover       0.2102292 0.3457966

In your final paper, your model selection methods could sound like:

Methods: We wanted to know which stream abiotic conditions influenced salamander abundance. To do this, we fit generalized linear models with a poisson distribution (for count data) with two important stream attributes, water temperature and woody debris cover as predictors and salamander count as a dependent variable. We performed model selection using AICc by comparing [exploratory (again, advisable with a few variables): “all possible models”; or inference: “a subset of models a, b, and c based on alternative hypotheses”] and ensured that models fit assumptions using model diagnostics. We ran all statistical analysis in R (Cite with version) using the [package for glm]. We selected models using AICc with the MuMIn package and performed all model diagnostics in DHARMa (CITE packages).

Results: The best fitting model for salamander abundance included woody debris cover as a predictor variable (\(\beta\) = 0.278, \(R^2\) = 0.1, CI = __). There were on average x%/Xx/[some human-friendly value] more salamanders in a stream with 50% woody debris cover than in one with no debris cover.
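
To get that last, human-friendly number, I back-transform model predictions onto the count scale with predict(). A small sketch (cover is scaled in this dataset, so the two values below are hypothetical “low” and “high” cover levels rather than literal percents):

newdat <- data.frame(cover = c(-1, 1)) #hypothetical low vs. high (scaled) cover values
predict(best_model, newdata = newdat, type = "response") #expected salamander counts at each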

Extensions

I don’t have time to go over these, but these are some common data types and issues I have and from survey responses, maybe you will have these types of data and challenges too. Specifically:

  1. Categorical predictor variables (an extension of ANOVA)

  2. Presence-absence data (binomial distributions)

  3. Overdispersed count data (negative binomial distribution)

  4. Count data with too many zeros (Zero-inflated model)

Categorical predictor

Allison Horst is a Stats Wizard

GLMs can sometimes look suspiciously like an ANOVA, which isn’t surprising and makes GLMs really versatile!

Let’s revisit the example of the salamanders in streams that either were or weren’t below mountaintop removal sites. In this example, if we want to know whether mining has an effect on salamander abundance, we are dealing with a categorical fixed effect (mined vs. unmined). We can approach this really similarly to any other GLM and can also get diagnostics that look suspiciously similar to ANOVA diagnostics. Let’s do it:

Categorical fixed effect, count response: what is the question?

Again, we have count data, so we’re going to be playing with Poisson or negative binomial models with this one. Our question is now: “What is the effect of the presence of mountaintop mining on the abundance of salamanders in streams?”

In this situation, it might be more intuitive to look at our data as a boxplot:

ggplot(Salamanders, aes(x = mined, y = count)) +
  geom_boxplot() +
  theme_bw()

But we could also coerce our yes/no predictor into a continuous-type variable and visualize it as a line graph, setting the “no” level to “1”, and the “yes” level to “2”.

Salamanders %>%
  mutate(mine_line = ifelse(mined == "no", 1, 2)) %>%
           ggplot(aes(x = mine_line, y = count)) +
           geom_point() +
           geom_smooth(method = "lm", se = FALSE) +
           theme_bw()

In both of these graphs, it looks like mining has a negative effect on salamander abundance (streams below mined sites have fewer salamanders).

In our model, the fixed effect will be mined (a binary yes/no) and our response will be count from the Salamanders dataset:

cat_poisson <- glm(count ~ mined, 
                   data = Salamanders,
                   family = "poisson")

We can look at our model summary again to see that the effect of mining does seem to be important for salamander abundance.

summary(cat_poisson)
## 
## Call:
## glm(formula = count ~ mined, family = "poisson", data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1283  -0.9459  -0.7687  -0.1796  11.4753  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2192     0.1048  -11.63   <2e-16 ***
## minedno       2.0368     0.1109   18.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 1575.2  on 642  degrees of freedom
## AIC: 2302.5
## 
## Number of Fisher Scoring iterations: 6

Categorical fixed effect, count response: what is the “best” model?

Skipping (but maybe try it on your own?)

Categorical fixed effect, count response: do data meet assumptions?

This probably seems repetitive by now - but we can also use DHARMa for this kind of model.

simulationOutput <- simulateResiduals(fittedModel = cat_poisson, plot = T)

This model also has so many of the beautiful model assumption failures.

testDispersion(simulationOutput)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.1261, p-value < 2.2e-16
## alternative hypothesis: two.sided

A new one (zero-inflation, meaning: we have more zeros than this distribution expects given our data):

testZeroInflation(simulationOutput)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.4637, p-value < 2.2e-16
## alternative hypothesis: two.sided

So we could approach it like we’ve been approaching any other model when it comes to fixing this!

On your own: How would you attempt to fix some of these model assumption failures?

Categorical fixed effect wrap-up: post-hoc and visualizations

In this example, our predictor had two categories, so we can report model results (including the p-value) for the between-group differences (a la ANOVA) from the summary() for the model. However, if you had a categorical variable with more than two levels, there are some cool ways to do pair-wise comparisons between those levels (for example, if I had “full mountaintop removal”, “partial mountaintop removal”, and “no mountaintop removal” mining categories for the Salamander dataset).

Again, at this point I always like to do a quick visual of my data, which I can still do with the allEffects call:

plot(allEffects(cat_poisson))

If I had more than two levels for my predictor, I could use the emmeans package and function to determine the pair-wise differences between groups.

em <- emmeans(cat_poisson, "mined")

The emmeans() function calculates the marginal means (mean of a group given all the other groupings in the model) based on a specified predictor (here, “mined”):

em
##  mined emmean     SE  df asymp.LCL asymp.UCL
##  yes   -1.219 0.1048 Inf    -1.425    -1.014
##  no     0.818 0.0362 Inf     0.746     0.889
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95

and then you can compare these means to each other:

pairs(em)
##  contrast estimate    SE  df z.ratio p.value
##  yes - no    -2.04 0.111 Inf -18.363 <.0001 
## 
## Results are given on the log (not the response) scale.
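
If you want these estimates and the contrast back on the count (response) scale rather than the log scale, emmeans will back-transform for you (the contrast then becomes a ratio of expected counts):

emmeans(cat_poisson, "mined", type = "response") #marginal means as expected counts
pairs(em, type = "response")                     #contrast as a count ratio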

I’ve also played around a bit with the ggeffects package, which produces similar values in a ggplot-friendly format. I don’t cover it here, but you can always use that to graph marginal means from models. ggeffects is also useful for plotting marginal means of different levels of a random effect for a mixed model!

ggeffects

Back to Extensions

Presence-absence data (binomial)

Allison Horst

Let’s say we were actually interested in just whether we find salamanders or not, not their actual abundances in each stream. This means we have a binary response of presence-absence and want to know whether presence is predicted by cover or water temperature. We can convert the count data to a binary variable of presence:

Salamanders <- Salamanders %>%
  mutate(presence = ifelse(count > 0, 1, 0))

And then look at whether our two predictors (stream cover and water temperature) predict presence.

Salamanders %>%
  gather(-site, -mined, -sample, -DOP, -DOY, -spp, -count, -presence, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = presence)) +
  geom_point() +
  geom_smooth(method = "lm", se =F) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

This is kind of a hard graph to visualize because of the binary response variable, but visually it does look like there might be more 1’s as cover increases and maybe more zeros as water temperature increases.

1. Binomial: What is the question

Our question here is really similar to the question above, except now we want to know whether stream cover and water temperature predict the presence of salamanders, rather than their abundance. Now that we have a binary presence-absence response variable, we will be using the binomial distribution to fit our data. You can run a binomial model by just changing the family = argument in the glm() function.

bi.model <- glm(presence ~ cover + Wtemp,
                data = Salamanders,
                family = binomial)

Our model summary is really similar to before and suggests some evidence that stream cover is, indeed, predictive of whether we see salamanders in streams or not.

summary(bi.model)
## 
## Call:
## glm(formula = presence ~ cover + Wtemp, family = binomial, data = Salamanders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4782  -0.9944  -0.7274   1.2272   1.8225  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.43787    0.08380  -5.225 1.74e-07 ***
## cover        0.56261    0.08914   6.311 2.76e-10 ***
## Wtemp       -0.05290    0.08758  -0.604    0.546    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 866.35  on 643  degrees of freedom
## Residual deviance: 821.17  on 641  degrees of freedom
## AIC: 827.17
## 
## Number of Fisher Scoring iterations: 4

2. Binomial: What is the “best” model (Next week!)

Again skipping this step for now!

3. Binomial: Do my data fit the model assumptions?

With this model, we can again use DHARMa to test model assumptions and we can also use the binned_residuals() function in the performance package, which is specifically designed for binomial distributions.

simulationOutput <- simulateResiduals(fittedModel = bi.model, plot = T)

According to this, this model actually performs pretty well.

The residuals for a binomial model always look pretty funky:

plot(residuals(bi.model))

So we can use the binned_residuals function to tell how well our model is doing, which is… ok, but not great.

From the vignette for the performance package: “Binned residual plots are achieved by ‘dividing the data into categories (bins) based on their fitted values, and then plotting the average residual versus the average fitted value for each bin’ (Gelman, Hill 2007: 97). If the model were true, one would expect about 95% of the residuals to fall inside the error bounds.”

binned_residuals(bi.model)
## Warning: About 80% of the residuals are inside the error bounds (~95% or higher would be good).

Based on these residuals, our data are ok, but not great, at meeting model assumptions (only about 80% of residuals fall inside the bounds, rather than the ~95% we’d want). Again, this could be because of the practical or math reasons outlined for the Poisson and negative binomial models:

  1. Did I include the right variables in my model?

    • Are there other fixed effects that may be needed in my model?

    • (Not covered today, and maybe a problem with this dataset: is there a need for grouping variables (random effects and mixed models)?)

  2. Do my data actually fit a binomial distribution?

    • variable transformations or quasi-binomial distribution
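
A minimal sketch of the quasi-binomial option (same formula, different family; note that quasi families don’t have a true likelihood, so AIC-based model selection no longer applies):

bi.model.q <- glm(presence ~ cover + Wtemp,
                  data = Salamanders,
                  family = quasibinomial)
summary(bi.model.q) #check the estimated dispersion parameter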

Presence data wrap-up: a quick visualization

We can do another quick diagnostic visualization to ground ourselves here as well!

plot(allEffects(bi.model))

Back to Extensions

Negative binomial: count data with overdispersion

Our model of salamander abundance with cover and water temperature did not meet model assumptions, possibly because, like many count datasets, the data are overdispersed and need a distribution that can account for that. Negative binomial models are one type of count model that accounts for overdispersion. Let’s start by refitting our data to a negative binomial distribution, which we can do with the glm.nb function from the MASS package.

This means we’re returning to Step 1 via Step 4A in our GLM process flowchart:

1. NB: What is the question?

In this case, the ecological question stays the same, we’re just using a different model family to better account for model assumptions.

nb.model <- glm.nb(count ~ cover + Wtemp,
                   data = Salamanders)

Looking at the summary of this model, it looks like cover is still important, but the model coefficients have changed.

summary(nb.model)
## 
## Call:
## glm.nb(formula = count ~ cover + Wtemp, data = Salamanders, init.theta = 0.3343960957, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2138  -1.0258  -0.9076   0.1246   4.4182  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22911    0.07721   2.967    0.003 ** 
## cover        0.34692    0.07955   4.361 1.29e-05 ***
## Wtemp       -0.09539    0.08050  -1.185    0.236    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.3344) family taken to be 1)
## 
##     Null deviance: 546.2  on 643  degrees of freedom
## Residual deviance: 528.8  on 641  degrees of freedom
## AIC: 1904
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.3344 
##           Std. Err.:  0.0328 
## 
##  2 x log-likelihood:  -1896.0100

(Just as a reminder, the summary for the Poisson model):

summary(model)
## 
## Call:
## glm(formula = count ~ cover + Wtemp, family = "poisson", data = Salamanders, 
##     na.action = "na.fail")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1270  -1.5988  -1.3516   0.3424  13.7366  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.24068    0.03558   6.764 1.34e-11 ***
## cover        0.27085    0.03491   7.758 8.64e-15 ***
## Wtemp       -0.05746    0.03692  -1.557     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2120.7  on 643  degrees of freedom
## Residual deviance: 2053.8  on 641  degrees of freedom
## AIC: 2783.1
## 
## Number of Fisher Scoring iterations: 6

2. NB: What is the “best” model (Next week!)

Again, let’s say we went through the step 2 model selection process and this came out as the best model.

3. NB: Do my data fit the model assumptions?

Let’s see how the model and data do in terms of model assumptions:

simulationOutput <- simulateResiduals(fittedModel = nb.model, plot = T)

It looks like we’ve fixed the overdispersion problem, but may still have some issues with this dataset based on the residuals plot. Let’s look at the dispersion problem specifically:

testDispersion(simulationOutput)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 0.95545, p-value = 0.768
## alternative hypothesis: two.sided

We have fixed the overdispersion! However, based on the output from this model, it still looks like we have some patterns in our residuals. I’m not surprised about this, given that it’s an example dataset in a mixed model package and so likely we are having more of a practical problem with missing effects in this model. (Maybe once we learn mixed models we can revisit?)

Count data wrap-up: a quick visualization

I’m a visual person, so at this point I always want to know what my model is actually telling me ecologically (e.g. what is the effect of cover??). I do quick diagnostics using the effects package and the allEffects function. This is just a line graph showing the marginal change in count for this model given a change in each of the predictors, while accounting for everything else in the model. You can get similar and more complex marginal effects plots with the ggeffects package, but I haven’t played with it much. I wouldn’t put this into my final paper, but it helps me to stay grounded in the ecology when I get lost in the R and statistics world.

plot(allEffects(nb.model))

Back to Extensions

Zero-inflated models for count data

As you may know from your own datasets, ecological count data often have a lot of zeros. Sometimes, these datasets have so many zeros that neither Poisson nor negative binomial models will work. There is a set of models called “zero-inflated” models (either Poisson or negative binomial), which are basically that: models that can fit data with an “inflated” number of zeros. There are tons of resources on this, and I won’t cover it all here, just provide a practical worked example. (There are also two types of zero-inflated models, differing in where the zeros are assumed to come from; I won’t try to be exhaustive here, but look into them if this is a problem for you!)

Zuur chapter 11 is a go-to resource to get started!

Another set of worked examples

Books

More books

GOOGLE

Let’s play with a different dataset from the lme4 package on the number of ticks observed on grouse chicks in broods across different years, elevations, and locations.

data(grouseticks)
str(grouseticks)
## 'data.frame':    403 obs. of  7 variables:
##  $ INDEX   : Factor w/ 403 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ TICKS   : num  0 0 0 0 0 3 2 0 0 2 ...
##  $ BROOD   : Factor w/ 118 levels "501","502","503",..: 1 1 2 3 3 3 3 4 4 4 ...
##  $ HEIGHT  : num  465 465 472 475 475 475 475 488 488 488 ...
##  $ YEAR    : Factor w/ 3 levels "95","96","97": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LOCATION: Factor w/ 63 levels "1","2","3","4",..: 32 32 36 37 37 37 37 44 44 44 ...
##  $ cHEIGHT : num  2.76 2.76 9.76 12.76 12.76 ...

INDEX: (factor) chick number (observation level)

TICKS: number of ticks sampled

BROOD: (factor) brood number

HEIGHT: height above sea level (meters)

YEAR: year (-1900)

LOCATION: (factor) geographic location code

cHEIGHT: centered height, derived from HEIGHT

1. Zero-inflated count data: What is the question?

Let’s say our question was whether elevation has an effect on tick abundance in grouse chicks, maybe because lower elevation sites have a larger temperature window of tick activity.

ggplot(grouseticks, aes(x = HEIGHT, y = TICKS)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_bw()

It does look like there is some effect of elevation on tick abundance, and it does seem to be negative (e.g. higher elevation = lower tick abundance).

Like the previous dataset on salamanders, there are a lot of zeros in this dataset

hist(grouseticks$TICKS)

grouseticks %>%
  filter(TICKS == 0) %>%
  summarise(count = n(), total = 403, proportion = count/403)
##   count total proportion
## 1   126   403  0.3126551

Out of 403 observations, 126 observations are zeros (31%). That’s a lot!

Again, we will start by fitting a regular Poisson model:

ticks_poisson <- glm(TICKS ~ HEIGHT,
                   data = grouseticks,
                   family = "poisson")

2. Zero-inflated count data: What is the “best” model? (Skipping again)

If this was our best model (skipping step 2 again), we can assess the residuals:

3. Zero-inflated count data: Do my data fit the model assumptions?

simulationOutput <- simulateResiduals(fittedModel = ticks_poisson, plot = T)

And test for overdispersion again:

testDispersion(ticks_poisson)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 4.8843, p-value < 2.2e-16
## alternative hypothesis: two.sided

Again, we have overdispersion. If it were me, I would start by trying to fit a negative binomial model (as before); even with this many zeros, re-fitting with a negative binomial will often fix a lot of issues with zeros. However, for the sake of learning, let’s say that we can’t fit a negative binomial model to these data, or that it doesn’t fix our zero-inflation problems.
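
(For completeness, that negative binomial refit would look like this; a sketch only, with ticks_nb as a placeholder name, since for this example we’re pretending it doesn’t solve the problem:)

ticks_nb <- glm.nb(TICKS ~ HEIGHT, data = grouseticks)  #negative binomial refit with MASS
simulateResiduals(fittedModel = ticks_nb, plot = TRUE)  #re-check diagnostics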

How do we determine we have zero inflation issues, you ask? The DHARMa package has another useful function called testZeroInflation() that I always throw into my model diagnostics for count data.

testZeroInflation(ticks_poisson)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 4.7057, p-value < 2.2e-16
## alternative hypothesis: two.sided

This output operates a lot like the testDispersion() function output. Again, DHARMa has created some simulated expected values for our model, and we see that our distribution has way more zeros than would be predicted by a simulated data distribution based on our model.

1.a. Zero-inflated count data: What is the question?

Again, we have the same question, we just need to fit a different model family that incorporates zero inflation. There are several packages that will let you fit a zero-inflated model (pscl, glmmTMB, ZIM). You can find a tutorial based on pscl at this website. I use glmmTMB because it has a lot of flexibility and lets me extend to mixed models really easily. It is an actively-developing package that many of the pre-eminent mixed model people in ecology swear by.

In glmmTMB, the function to call a GLM is glmmTMB() and we can add a zero-inflation term with an added argument (ziformula = ~1). There are more complicated ways to specify a zero-inflation formula, but this is the basic one. (more here)

Referring again to our flowchart for GLMs, we’re doing Step 1 via Step 4A again, except this time we’re going to fit a zero-inflated model rather than a negative binomial model to see if we can fix our model assumption problems:

ticks_zi <- glmmTMB(TICKS ~ HEIGHT,
                   data = grouseticks,
                   ziformula = ~1,
                   family = "poisson")

2.a. Zero-inflated count data: What is the “best” model?

Skip!

3.a. Zero-inflated count data: Do my data meet model assumptions?

Repeating our diagnostics:

simulationOutput <- simulateResiduals(fittedModel = ticks_zi, plot = T)

This is NOT a good model fit! But, let’s see if we fixed the zero-inflation by including a zero inflation term in our model:

testZeroInflation(ticks_zi)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.0178, p-value = 0.8
## alternative hypothesis: two.sided

Yes we have! However, we still need to consider all the math and practical reasons why this model isn’t good (do we need to add variables, do we need a different distribution, do we need to transform?).
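
One natural next thing to try (a sketch only, not run here, with ticks_zinb as a placeholder name) is a zero-inflated negative binomial, which tackles the extra zeros and the overdispersion at the same time; nbinom2 is glmmTMB’s standard negative binomial family:

ticks_zinb <- glmmTMB(TICKS ~ HEIGHT,
                      data = grouseticks,
                      ziformula = ~1,
                      family = nbinom2())
simulateResiduals(fittedModel = ticks_zinb, plot = TRUE)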

Flowchart for GLM and GLMM model selection for count data

Here’s a flow chart of my complete model selection process and troubleshooting for count data (for GLMM, but can easily be used for GLM as well)

Back to Extensions