Libraries
library(tidyverse)
library(data.table)
library(ggcorrplot)
library(GGally)
library(knitr)
library(kableExtra)
library(ROCR)
library(jtools)
library(ggstance)
library(broom)
library(interactions)
Generalized Linear Model
The Generalized Linear Model (GLiM, or GLM) is an advanced statistical modelling technique formulated by John Nelder and Robert Wedderburn in 1972. It is an umbrella term that encompasses many models and allows the response variable y to have an error distribution other than the normal distribution. These models include Linear Regression, Logistic Regression, and Poisson Regression.
In a Linear Regression Model, the response variable ‘y’ is expressed as a linear combination of the predictors ‘X’. The underlying relationship between the response and the predictors is linear, i.e. we can visualize the relationship as a straight line. In addition, the errors of the response variable are assumed to be normally distributed. That is why we are building a linear model.
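In symbols, the linear regression model is \[y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)\]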
GLM models allow us to build a linear relationship between the response and the predictors even though their underlying relationship is not linear. This is made possible by using a link function, which links the response variable to a linear model. Unlike Linear Regression models, the error distribution of the response variable need not be normal. The errors in the response variable are assumed to follow an exponential family of distributions (e.g. normal, binomial, Poisson, or gamma). Since we are trying to generalize the linear regression model so that it can also be applied in these cases, we get the name Generalized Linear Models.
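In general form, a GLM connects the expected value of the response to the linear predictor through a link function \(g\): \[g(\mathbb{E}[y]) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\]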
When to use GLM?
A Linear Regression model is not suitable if:
- The relationship between X and y is not linear; there exists some non-linear relationship between them. For example, y increases exponentially as X increases.
- The variance of the errors in y is not constant and varies with X (i.e. the homoscedasticity assumption of Linear Regression is violated).
- The response variable is not continuous but categorical. Linear Regression assumes a normal distribution of the response variable, which can only apply to continuous data. If we build a linear regression model on a binary y variable, the model can predict values below 0 or above 1 for the response, which doesn’t make any sense!
In these scenarios, GLMs come into play. Commonly used models in the GLM family include:
- Linear Regression, for continuous outcomes with a normal distribution
- Poisson Regression, for count-based outcomes with a Poisson distribution
- Binary Logistic Regression, for dichotomous or binary outcomes with a binomial distribution
Generalized linear models are just as easy to fit in R as an ordinary linear model. In fact, they require only an additional parameter to specify the variance and link functions. The basic tool for fitting generalized linear models is the glm() function, which has the following general structure:
glm(formula, family, data, weights, subset, ...)
where … stands for more esoteric options. The only parameter that we have not encountered before is family, which is a simple way of specifying a choice of variance and link functions. There are six choices of family:
Family | Variance | Link |
---|---|---|
gaussian | gaussian | identity |
binomial | binomial | logit |
poisson | poisson | log |
Gamma | Gamma | inverse |
inverse.gaussian | inverse.gaussian | \(1/\mu^2\) |
quasi | user-defined | user-defined |
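As a minimal sketch (using a hypothetical data frame df with response y and predictor x, not part of this analysis), the family argument is the only thing that changes between these fits:
# hypothetical data frame df; only the family/link choice differs between the calls
fit_gaussian <- glm(y ~ x, family = gaussian(link = "identity"), data = df)
fit_binomial <- glm(y ~ x, family = binomial(link = "logit"), data = df)
fit_poisson <- glm(y ~ x, family = poisson(link = "log"), data = df)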
What is Logistic regression?
Logistic regression is used to predict a class by modelling a probability; it is well suited to binary outcomes.
Imagine you want to predict whether a loan is denied or accepted based on many attributes. The outcome is of the form 0/1: y = 0 if the loan is rejected, y = 1 if it is accepted.
A logistic regression model differs from a linear regression model in two ways.
- First, logistic regression accepts only a dichotomous (binary) dependent variable (i.e., a vector of 0s and 1s).
- Second, the outcome is measured through a probabilistic link function called the sigmoid, due to its S-shape.
The sigmoid function returns values from 0 to 1. For the classification task, we need a discrete output of 0 or 1.
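Concretely, the sigmoid maps the linear predictor \(z = \beta_0 + \beta_1 X_1 + \dots\) to a probability: \[\sigma(z)=\frac{1}{1+e^{-z}}\] The discrete class is then obtained by thresholding this probability, typically at 0.5.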
Data
Let’s use the Recipes data set to illustrate logistic regression. This dataset has about 20,000 recipes from several different categories. The dataset provides information on the nutritional value of each recipe, the ingredients (in the form of binary variables), and, in some cases, the country the recipe originated from. All in all there are over 650 features. It’s a playground for exploring recipes and predicting what makes a recipe belong to a certain category (dessert, breakfast, brunch, etc.).
Recipe information lifted from: http://www.epicurious.com/recipes-menus
setwd("~/Nawrin_tania")
recipes_data <- read.csv("epi_r.csv")
I love desserts! Let’s predict: is a recipe a dessert or not?
I’ll use logistic regression to predict this! Here, ‘dessert or not’ is going to be my response variable, while nutritional factors like protein, fat, and calorie content will be set as the predictors. A binary response with quantitative explanatory variables fits a logistic model smoothly.
Let’s take a glimpse of our data with glimpse(recipes_data).
How many recipes are desserts? Count the 1’s in the dessert column. Only about 18% of the 20,000 recipes are dessert recipes.
prop.table(table(recipes_data$dessert))
##
## 0 1
## 0.8218133 0.1781867
I’m including columns 1:7 on nutrition to find out the probability of a recipe being a dessert.
dessert_recipes <- recipes_data %>% select(1:7, "dessert")
colnames(dessert_recipes)[7] <- "cakeweek"
kable(data.table(head(dessert_recipes)))
title | rating | calories | protein | fat | sodium | cakeweek | dessert |
---|---|---|---|---|---|---|---|
Lentil, Apple, and Turkey Wrap | 2.500 | 426 | 30 | 7 | 559 | 0 | 0 |
Boudin Blanc Terrine with Red Onion Confit | 4.375 | 403 | 18 | 23 | 1439 | 0 | 0 |
Potato and Fennel Soup Hodge | 3.750 | 165 | 6 | 7 | 165 | 0 | 0 |
Mahi-Mahi in Tomato Olive Sauce | 5.000 | NA | NA | NA | NA | 0 | 0 |
Spinach Noodle Casserole | 3.125 | 547 | 20 | 32 | 452 | 0 | 0 |
The Best Blts | 4.375 | 948 | 19 | 79 | 1042 | 0 | 0 |
Let’s look at descriptive statistics to check for outliers and NAs, and convert the cakeweek and dessert columns to factors given they hold binary values.
# change as factor
dessert_recipes <- dessert_recipes %>% mutate(
  cakeweek = as.factor(cakeweek),
  dessert = as.factor(dessert)
)
There’s definitely an outlier situation going on here, given that the maximum calorie, protein, fat, and sodium content is off the chart or recorded in a different unit. For this analysis, I’ll filter these values out. Plus, there are NA values that could either be removed or imputed based on the average of the relevant column.
Removing Outliers
<- dessert_recipes %>% filter(calories <= 20000 & sodium <=20000) dessert_recipes
On filtering, we lost over 4,000 rows, and the NA’s have been reduced. To fill out the remaining NA values, we have some options:
- Use the mean value of the column
- Apply an ML approach to impute the values
Both methods have their pros and cons. For this project, I’ll use the mean values of the column.
dessert_recipes <- dessert_recipes %>% mutate(
  fat = replace(fat, is.na(fat), mean(fat, na.rm = T)),
  protein = replace(protein, is.na(protein), mean(protein, na.rm = T))
)
Let’s look at the summary.
summary(dessert_recipes)
## title rating calories protein
## Length:15885 Min. :0.000 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.:3.750 1st Qu.: 197.0 1st Qu.: 3.00
## Mode :character Median :4.375 Median : 330.0 Median : 8.00
## Mean :3.752 Mean : 491.1 Mean : 21.07
## 3rd Qu.:4.375 3rd Qu.: 585.0 3rd Qu.: 27.00
## Max. :5.000 Max. :19576.0 Max. :1164.00
## fat sodium cakeweek dessert
## Min. : 0.00 Min. : 0.0 0:15879 0:12931
## 1st Qu.: 8.00 1st Qu.: 79.0 1: 6 1: 2954
## Median : 18.00 Median : 293.0
## Mean : 28.28 Mean : 593.5
## 3rd Qu.: 33.00 3rd Qu.: 703.0
## Max. :1818.00 Max. :19986.0
All NAs have been removed and the outliers have been filtered. The general notion is that a dessert is not only tasty but also has more fat, higher calories, and less protein and sodium. Let’s see:
dessert_recipes %>% ggplot(aes(
  x = calories,
  y = fat,
  color = dessert
)) +
  geom_point() +
  theme_bw() +
  labs(
    color = "Dessert"
  )
It is surprising to see that dessert recipes have lower fat and calorie content. But it’s not very clear, given there are still a few values that are much larger than the average fat and calorie values in the data set.
dessert_recipes %>% ggplot(aes(
  x = calories,
  y = protein,
  color = dessert
)) +
  geom_point() +
  theme_bw() +
  labs(
    color = "Dessert"
  )
This makes more sense, given nobody is expecting to draw proteins from desserts! Let’s check with boxplots.
dessert_box <- dessert_recipes %>% filter(fat < 100)

dessert_box %>% ggplot(
  aes(
    x = dessert, y = fat, color = dessert
  )
) +
  geom_boxplot(varwidth = T, size = 1, width = 0.5, show.legend = F, outlier.color = "red") +
  theme_minimal() +
  labs(
    title = "Dessert recipes (1) - other recipes (0)"
  )
The median fat content is quite similar in both kinds of recipes. For the bulk of the dessert recipes, the fat content is considerably lower than in the rest of the recipes. However, there are several outlier values, which push the standard deviation up, meaning the ‘fat’ values are quite spread out in the data.
And protein is lower in dessert recipes than in other recipes.
dessert_box <- dessert_recipes %>% filter(protein < 100)

dessert_box %>% ggplot(
  aes(
    x = dessert, y = protein, color = dessert
  )
) +
  geom_boxplot(varwidth = T, size = 1, width = 0.5, show.legend = F, outlier.color = "red") +
  theme_minimal() +
  labs(
    title = "Dessert recipes (1) - other recipes (0)"
  )
Is that a Dessert? Looks delicious!
Before we dive into the model, one requirement of logistic regression is the absence of multicollinearity. If the explanatory variables are highly correlated with each other, then it makes sense to keep just one of them in the model. I’ll visualize the correlation between the variables by plotting a heat map containing the correlation coefficients computed with the Spearman method.
ggcorr(
  dessert_recipes, method = c("pairwise", "spearman"),
nbreaks = 6,
hjust = 0.8,
label = TRUE,
label_size = 3,
color = "grey50",
name = "Scale",
palette = "Set3"
)
Calories and fat are highly correlated.
Train/Test Dataset
We will divide the dataset into two subsets, train and test, to perform the train-test split:
- Train subset – we will use this subset to fit/train the model
- Test subset – we will use this subset to evaluate our model
I’ll write a little function for this task.
set.seed(6474)
create_train_test <- function(data, size = 0.8, train = TRUE) {
  n_row = nrow(data)
  total_row = size * n_row
  train_sample <- 1:total_row
  if (train == TRUE) {
    return(data[train_sample, ])
  } else {
    return(data[-train_sample, ])
  }
}
Splitting the Data.
train <- create_train_test(data = dessert_recipes, size = 0.8, train = T)
test <- create_train_test(data = dessert_recipes, size = 0.8, train = F)
Building the Model
We are ready to estimate the logistic model to identify dessert recipes. Logistic regression is probably the simplest, and a highly efficient, algorithm for working out a classification problem, in this case ‘dessert or not’.
dessert_model <- glm(dessert ~ rating + calories + protein + fat + cakeweek, data = train, family = binomial(link = "logit"))
Summary
summary(dessert_model)
##
## Call:
## glm(formula = dessert ~ rating + calories + protein + fat + cakeweek,
## family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.5211 -0.6575 -0.2722 -0.0057 3.5396
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.991535 0.083991 -23.711 < 2e-16 ***
## rating 0.157596 0.019818 7.952 1.83e-15 ***
## calories 0.006922 0.000229 30.226 < 2e-16 ***
## protein -0.212298 0.006740 -31.498 < 2e-16 ***
## fat -0.041422 0.002431 -17.038 < 2e-16 ***
## cakeweek1 14.225325 172.570303 0.082 0.934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12224.8 on 12707 degrees of freedom
## Residual deviance: 8949.8 on 12702 degrees of freedom
## AIC: 8961.8
##
## Number of Fisher Scoring iterations: 12
The summary of our model reveals interesting information. The performance of a logistic regression is evaluated with specific key metrics.
- AIC (Akaike Information Criteria): This plays a role loosely analogous to R2 in logistic regression. It measures the fit while applying a penalty to the number of parameters. Smaller AIC values indicate the model is closer to the truth.
- Null deviance: Fits the model only with the intercept. The degree of freedom is n-1. We can interpret it as a Chi-square value (fitted value different from the actual value hypothesis testing).
- Residual Deviance: Model with all the variables. It is also interpreted as a Chi-square hypothesis testing.
- Number of Fisher Scoring iterations: Number of iterations before converging.
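These quantities can also be pulled directly out of the fitted dessert_model object; a quick sketch:
# extract the key fit statistics reported in the summary
AIC(dessert_model) # Akaike Information Criterion
dessert_model$null.deviance # deviance of the intercept-only model
dessert_model$deviance # residual deviance of the full model
dessert_model$iter # number of Fisher scoring iterations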
To understand how each predictor changes the likelihood of a recipe being a dessert, the coefficients of the explanatory variables are exponentiated, since this is a logit model. The exponentiated coefficient is called the odds ratio. It measures, for a one-unit change in a predictor, by how much the odds of the response variable being 1 versus 0 (yes or no) are multiplied.
exp(coef(dessert_model))
## (Intercept) rating calories protein fat cakeweek1
## 1.364858e-01 1.170693e+00 1.006946e+00 8.087239e-01 9.594246e-01 1.506538e+06
ANOVA Test
ANOVA shows how the full model (with all predictors) performs against the null model (with intercept only). The large drop from the null deviance to the residual deviance shows that the full model is doing better, and most terms produce a statistically significant reduction in deviance when added.
anova(dessert_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: dessert
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 12707 12224.8
## rating 1 6.20 12706 12218.6 0.012748 *
## calories 1 2.06 12705 12216.5 0.151545
## protein 1 2958.34 12704 9258.2 < 2.2e-16 ***
## fat 1 298.51 12703 8959.6 < 2.2e-16 ***
## cakeweek 1 9.84 12702 8949.8 0.001707 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model Interpretation
All the features in the model to predict the likelihood of a dessert recipe, except cakeweek, are shown as highly significant at p < 0.001.
From the model, keeping the other predictors constant, it can be interpreted that:
- A unit increase in rating increases the odds of a dessert recipe by about 17% (odds ratio 1.17).
- A unit increase in calories increases these odds by about 0.7%.
- A unit increase in protein decreases these odds by about 19% (the coefficient of protein is negative; odds ratio 0.81).
- A unit increase in fat decreases these odds by about 4% (the coefficient of fat is also negative; odds ratio 0.96).
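These percentages come straight from the exponentiated coefficients shown above; a quick way to compute them all at once:
# percent change in the odds for a one-unit increase in each predictor
round((exp(coef(dessert_model)) - 1) * 100, 1)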
Performance of the model
One way to measure the accuracy of the model’s predictions is the confusion matrix. Another, measuring the Area Under the Curve (AUC) of the ROC, is the most widely accepted method to check the quality of predictions.
Confusion Matrix
The confusion matrix is a better choice to evaluate classification performance compared with the metrics you saw before. The general idea is to count the number of times True instances are classified as False.
First, predict with the model on the test data.
predict <- predict(dessert_model, test, type = 'response')
Then calculate the matrix:
conf_matrix <- table(test$dessert, predict > 0.5)
conf_matrix
##
## FALSE TRUE
## 0 2506 86
## 1 449 136
You can calculate the model accuracy by summing the true positive + true negative over the total observation. \[accuracy=\frac{TP+TN}{TP+TN+FP+FN}\]
accuracy_Test <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy_Test
## [1] 0.8316021
The model appears to suffer from one problem: it produces a large number of false negatives. This is related to the accuracy paradox. We stated that accuracy is the ratio of correct predictions to the total number of cases. We can have relatively high accuracy but a useless model. It happens when there is a dominant class. If you look back at the confusion matrix, you can see most of the cases are classified as true negative. Imagine now that the model classified every case as negative.
You would still have an accuracy of about 82%. Our model performs only slightly better, yet struggles to distinguish the true positives from the true negatives.
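We can verify this baseline directly from the confusion matrix; a quick sketch:
# accuracy of a trivial model that labels every recipe as "not dessert"
baseline_accuracy <- sum(conf_matrix[1, ]) / sum(conf_matrix)
baseline_accuracy # roughly 0.82 for this test split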
In such a situation, it is preferable to have more specific metrics. We can look at:
- \(Precision=\frac{TP}{TP+FP}\)
- \(Recall=\frac{TP}{TP+FN}\)
Precision and Recall
Precision looks at the accuracy of the positive prediction. Recall is the ratio of positive instances that are correctly detected by the classifier.
Let me write functions to calculate precision and recall.
precision <- function(matrix) {
  # true positive
  tp <- matrix[2, 2]
  # false positive
  fp <- matrix[1, 2]
  return(tp / (tp + fp))
}

recall <- function(matrix) {
  # true positive
  tp <- matrix[2, 2]
  # false negative
  fn <- matrix[2, 1]
  return(tp / (tp + fn))
}
Calculate,
prec <- precision(conf_matrix)
rec <- recall(conf_matrix)
prec
## [1] 0.6126126
rec
## [1] 0.2324786
When the model says a recipe is a dessert, it is correct in about 61 percent of cases (precision), but it only detects about 23 percent of the actual dessert recipes (recall).
You can create the \(F_1\) score based on the precision and recall. The \(F_1\) is a harmonic mean of these two metrics, meaning it gives more weight to the lower values. \[F_1=2\frac{precision*recall}{precision+recall}\]
f1 <- 2 * ((prec * rec) / (prec + rec))
f1
## [1] 0.3370508
It is difficult to have both a high precision and a high recall.
If we increase the precision, the positive predictions become more reliable, but we would miss many positive cases (lower recall). In some situations, we prefer higher precision than recall; there is a trade-off (a concave relationship) between precision and recall.
- Imagine, you need to predict if a patient has a disease. You want to be as precise as possible.
- If you need to detect potentially fraudulent people in the street through facial recognition, it would be better to flag many people as fraudulent (high recall) even though the precision is low. The police can then release the non-fraudulent individuals.
Area Under Curve
A good model should have a high prediction rate for true positives (sensitivity) and a small rate of false positives (1 - specificity). The ROC graph plots sensitivity against the false positive rate “as the value of the cut-point is increased from 0 through to 1”. “AUC is the probability that if you were to take a random pair of observations, one with Y=1 and one with Y=0, the observation with Y=1 has a higher predicted probability than the other. The AUC thus gives the probability that the model correctly ranks such pairs of observations”.
ROCRpred <- prediction(predict, test$dessert)
ROCRperf <- performance(ROCRpred, 'tpr', 'fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2, 1.7))
AUC, the area under the ROC curve, indicates better predictions the closer it is to 1.
auc <- performance(ROCRpred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8813819
Conclusion
Though it can be treated as a good model, we can improve it by:
- adding non-linearity to the model with interaction terms
- reducing the class imbalance
What is Imbalanced Classification ?
Imbalanced classification is a supervised learning problem where one class outnumbers the other class by a large proportion. This problem is faced more frequently in binary classification problems than in multi-class classification problems.
The term imbalanced refers to the disparity encountered in the dependent (response) variable. Therefore, an imbalanced classification problem is one in which the dependent variable has an imbalanced proportion of classes. In other words, a data set that exhibits an unequal distribution between its classes is considered imbalanced.
Example:
prop.table(table(test$dessert))
##
## 0 1
## 0.815864 0.184136
82% of the recipes are not dessert recipes; only 18% are dessert recipes. This is an imbalanced scenario.
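One simple way to reduce the imbalance before refitting is to down-sample the majority class; a minimal base-R sketch (not part of the original analysis):
# keep all dessert rows and an equally sized random sample of non-dessert rows
set.seed(1)
dessert_rows <- which(train$dessert == "1")
other_rows <- sample(which(train$dessert == "0"), length(dessert_rows))
train_balanced <- train[c(dessert_rows, other_rows), ]
prop.table(table(train_balanced$dessert)) # now roughly 50/50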
Poisson Regression Model
Poisson Regression models are best used for modeling events where the outcomes are counts. Or, more specifically, count data: discrete data with non-negative integer values that count something, like the number of times an event occurs during a given time frame or the number of people in line at the grocery store.
Count data can also be expressed as rate data, since the number of times an event occurs within a time frame can be expressed as a raw count (i.e. “In a day, we eat three meals”) or as a rate (“We eat at a rate of 0.125 meals per hour”).
Poisson Regression helps us analyze both count data and rate data by allowing us to determine which explanatory variables (X values) have an effect on a given response variable (Y value, the count or a rate). For example, Poisson regression could be applied by a grocery store to better understand and predict the number of people in a line.
What is Poisson Distribution?
The Poisson distribution is a probability distribution named after the French mathematician Siméon Denis Poisson. It models the probability of an event or events y occurring within a specific timeframe, assuming that occurrences of y are not affected by the timing of previous occurrences. This can be expressed mathematically using the following formula: \[P(x)=\frac{e^{-\lambda} \lambda^x}{x!};\ \ where\ x=0,1,2...\]
Here, \(\lambda\) is the average number of times an event may occur per unit of exposure. It is also called the parameter of Poisson distribution. The exposure may be time, space, population size, distance, or area, but it is often time, denoted with \(t\). If exposure value is not given it is assumed to be equal to 1.
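When the exposure varies between observations, rate data are usually modelled by including the log of the exposure as an offset in the linear predictor; a minimal sketch with hypothetical variables count, x, and exposure t in a data frame df:
# rate model: log(E[count]) = log(t) + b0 + b1*x  (count, x, t, and df are hypothetical)
rate_model <- glm(count ~ x + offset(log(t)), family = poisson(link = "log"), data = df)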
Data
We are going to build a model to predict the number of meadowfoam flowers under given light conditions. The data were collected from an experiment to study how to maximize Mermaid meadowfoam production. (Meadowfoam is a small plant from which a vegetable oil can be extracted.) The dataset is found in a package called GLMsData.
library(GLMsData)
data("flowers")
What’s in the Data?
glimpse(flowers)
## Rows: 24
## Columns: 3
## $ Flowers <dbl> 62.4, 77.1, 77.7, 75.4, 55.7, 54.2, 68.9, 78.2, 49.5, 62.0, 57~
## $ Light <int> 150, 150, 150, 150, 300, 300, 300, 300, 450, 450, 450, 450, 60~
## $ Timing <fct> PFI, PFI, Before, Before, PFI, PFI, Before, Before, PFI, PFI, ~
A data frame with 24 observations on the following 3 variables:
- Flowers: the mean number of flowers per meadowfoam plant, averaged over ten seedlings; a numeric vector
- Light: the light intensity in \(\mu\)mol per square metre per second; a numeric vector
- Timing: when the light treatment was applied; a factor with levels PFI (photo-periodic floral induction) or Before (24 days before PFI)
Let’s round the Flowers variable to make it proper count data, as a count can’t be a fraction!
flowers <- flowers %>% mutate(Flowers = round(Flowers))
Visualizing the histogram:
flowers %>% ggplot(
  aes(Flowers)
) +
  geom_histogram(position = "dodge", bins = 6, fill = "Tomato") +
  theme_bw()
The data is not in the form of a bell curve like a normal distribution. Let’s check the mean and variance:
c(Mean = mean(flowers$Flowers), Variance = var(flowers$Flowers))
## Mean Variance
## 56.04167 188.30254
The variance is much greater than the mean, which suggests that we may have over-dispersion in the model.
Let’s fit the Poisson model using the glm() command.
flower_model <- glm(Flowers ~ Light + Timing, family = poisson(link = "log"), data = flowers)
Interpreting the model
summary(flower_model)
##
## Call:
## glm(formula = Flowers ~ Light + Timing, family = poisson(link = "log"),
## data = flowers)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7385 -0.5343 -0.2473 0.7184 1.5002
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.4935465 0.0631135 71.198 < 2e-16 ***
## Light -0.0007284 0.0001076 -6.769 1.30e-11 ***
## TimingPFI -0.2164546 0.0548539 -3.946 7.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 78.057 on 23 degrees of freedom
## Residual deviance: 16.074 on 21 degrees of freedom
## AIC: 162.17
##
## Number of Fisher Scoring iterations: 4
Anova
anova(flower_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Flowers
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 23 78.057
## Light 1 46.321 22 31.736 1.004e-11 ***
## Timing 1 15.662 21 16.074 7.571e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output begins by echoing the function call. The information on deviance residuals is displayed next. Deviance residuals are approximately normally distributed if the model is specified correctly. In our example, they show a little bit of skewness since the median is not quite zero.
Next come the Poisson regression coefficients for each of the variables, along with their standard errors, z-scores, and p-values. The coefficient for Light is -0.0007284: the expected log count changes by -0.0007284 for a one-unit increase in Light. The coefficient for the indicator variable TimingPFI is the expected difference in log count (approximately -0.2165) between PFI and the reference group ‘Before’.
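Because the model uses a log link, exponentiating the coefficients turns them into multiplicative (rate ratio) effects on the expected count; a quick check:
# multiplicative effect on the expected flower count per unit change in each predictor
exp(coef(flower_model))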
The information on deviance is also provided. We can use the residual deviance to perform a goodness-of-fit test for the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. Therefore, if the residual deviance is small enough, the goodness-of-fit test will not be significant, indicating that the model fits the data. We conclude that the model fits reasonably well because the goodness-of-fit chi-squared test is not statistically significant. If the test had been statistically significant, it would indicate that the data do not fit the model well.
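One way to run this check in R is to compare the residual deviance to a chi-squared distribution with the residual degrees of freedom; a short sketch:
# p-value of the deviance goodness-of-fit test; a large value means no evidence of lack of fit
pchisq(deviance(flower_model), df.residual(flower_model), lower.tail = FALSE)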
If the Residual Deviance is greater than the degrees of freedom, then over-dispersion exists. This means that the estimates are correct, but the standard errors (standard deviation) are wrong and unaccounted for by the model.
The null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean), whereas the residual deviance reflects the model with the independent variables included. Above, we can see that the addition of the 2 independent variables decreased the deviance from 78.057 to 16.074; this large drop indicates that the predictors substantially improve the fit.
So, to obtain more reliable standard errors, we can use a quasi-Poisson model:
flower_model_2 <- glm(Flowers ~ Light + Timing, family = quasipoisson(link = "log"), data = flowers)
summary(flower_model_2)
##
## Call:
## glm(formula = Flowers ~ Light + Timing, family = quasipoisson(link = "log"),
## data = flowers)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7385 -0.5343 -0.2473 0.7184 1.5002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.494e+00 5.518e-02 81.435 < 2e-16 ***
## Light -7.284e-04 9.409e-05 -7.742 1.39e-07 ***
## TimingPFI -2.165e-01 4.796e-02 -4.513 0.00019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.7643845)
##
## Null deviance: 78.057 on 23 degrees of freedom
## Residual deviance: 16.074 on 21 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Comparing The Models:
Now that we’ve got two different models, let’s compare them to see which is better. First, we’ll load the arm library because we’ll use the coef() and se.coef() functions to extract the coefficients and their standard errors from each model, and then use cbind() to combine those extracted values into a single data frame so we can compare them.
library(arm)
coef1 <- coef(flower_model)
coef2 <- coef(flower_model_2)
se_coef1 <- se.coef(flower_model)
se_coef2 <- se.coef(flower_model_2)
exponent <- exp(coef1)
both <- cbind(coef1, se_coef1, coef2, se_coef2, exponent)
both
## coef1 se_coef1 coef2 se_coef2 exponent
## (Intercept) 4.4935465419 0.0631135097 4.4935465419 5.517957e-02 89.4380800
## Light -0.0007284095 0.0001076146 -0.0007284095 9.408648e-05 0.9992719
## TimingPFI -0.2164545632 0.0548538836 -0.2164545632 4.795825e-02 0.8053691
In the above output, we can see that the coefficients are the same, but the standard errors are different.
Predicting From The Model
Once the model is made, we can use predict(model, data, type) to predict outcomes using new dataframes containing data other than the training data. Let’s look at an example.
newdata <- data.frame(Light = c(550, 800), Timing = c("PFI", "Before"))
predict(flower_model, newdata, type = "response")
## 1 2
## 48.25342 49.93983
Our model predicts there will be roughly 48 and 50 flowers, respectively, with this new data.
Regression Co-efficients
Plot regression coefficients for flower_model:
plot_summs(flower_model, scale = T, exp = T)
Plot regression coefficients for flower_model and flower_model_2:
plot_summs(flower_model, flower_model_2, scale = T, exp = T)