Libraries
library(tidyverse)
library(data.table)
library(ggcorrplot)
library(GGally)
library(knitr)
library(kableExtra)
library(ROCR)
library(jtools)
library(ggstance)
library(broom)
library(interactions)
Generalized Linear Model
The Generalized Linear Model (GLiM, or GLM) is an advanced statistical modelling technique formulated by John Nelder and Robert Wedderburn in 1972. It is an umbrella term that encompasses many models and allows the response variable y to have an error distribution other than the normal distribution. These models include Linear Regression, Logistic Regression, and Poisson Regression.
In a Linear Regression Model, the response variable ‘y’ is expressed as a linear combination of the predictors ‘X’. The underlying relationship between the response and the predictors is linear, i.e. we can visualize the relationship as a straight line. In addition, the errors of the response variable are assumed to be normally distributed. That is why we are building a linear model.
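In symbols, the linear regression model is \[y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)\]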
GLM models allow us to build a linear relationship between the response and the predictors even though their underlying relationship is not linear. This is made possible by using a link function, which links the response variable to a linear model. Unlike Linear Regression models, the error distribution of the response variable need not be normal. The errors in the response variable are assumed to follow an exponential family of distributions (e.g. normal, binomial, Poisson, or gamma). Since we are trying to generalize the linear regression model so that it can also be applied in these cases, we get the name Generalized Linear Models.
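In general form, a GLM connects the expected value of the response to the linear predictor through a link function \(g\): \[g(\mathbb{E}[y]) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\]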
When to use GLM?
A Linear Regression model is not suitable if:
- The relationship between X and y is not linear; there exists some non-linear relationship between them. For example, y increases exponentially as X increases.
- The variance of the errors in y is not constant and varies with X (i.e. the homoscedasticity assumption of Linear Regression is violated).
- The response variable is not continuous but categorical. Linear Regression assumes a normal distribution of the response variable, which can only apply to continuous data. If we build a linear regression model on a binary y variable, the model can predict values below 0 or above 1 for the response, which doesn’t make any sense!
In these scenarios, GLMs come into play. Commonly used models in the GLM family include:
- Linear Regression, for continuous outcomes with a normal distribution
- Poisson Regression, for count-based outcomes with a Poisson distribution
- Binary Logistic Regression, for dichotomous or binary outcomes with a binomial distribution
Generalized linear models are just as easy to fit in R as an ordinary linear model. In fact, they require only an additional parameter to specify the variance and link functions. The basic tool for fitting generalized linear models is the glm() function, which has the following general structure:
glm(formula, family, data, weights, subset, ...)
where … stands for more esoteric options. The only parameter that we have not encountered before is family, which is a simple way of specifying a choice of variance and link functions. There are six choices of family:
Family | Variance | Link |
---|---|---|
gaussian | gaussian | identity |
binomial | binomial | logit |
poisson | poisson | log |
Gamma | Gamma | inverse |
inverse.gaussian | inverse.gaussian | \(1/\mu^2\) |
quasi | user-defined | user-defined |
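As a minimal sketch (using a hypothetical data frame df with response y and predictor x, not part of this analysis), the family argument is the only thing that changes between these fits:
# hypothetical data frame df; only the family/link choice differs between the calls
fit_gaussian <- glm(y ~ x, family = gaussian(link = "identity"), data = df)
fit_binomial <- glm(y ~ x, family = binomial(link = "logit"), data = df)
fit_poisson <- glm(y ~ x, family = poisson(link = "log"), data = df)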
What is Logistic regression?
Logistic regression is used to predict a class by modelling a probability; it is well suited to binary outcomes.
Imagine you want to predict whether a loan is denied or accepted based on many attributes. The outcome is of the form 0/1: y = 0 if the loan is rejected, y = 1 if it is accepted.
A logistic regression model differs from a linear regression model in two ways.
- First, logistic regression accepts only a dichotomous (binary) dependent variable (i.e., a vector of 0s and 1s).
- Second, the outcome is measured through a probabilistic link function called the sigmoid, due to its S-shape.
The sigmoid function returns values from 0 to 1. For the classification task, we need a discrete output of 0 or 1.
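Concretely, the sigmoid maps the linear predictor \(z = \beta_0 + \beta_1 X_1 + \dots\) to a probability: \[\sigma(z)=\frac{1}{1+e^{-z}}\] The discrete class is then obtained by thresholding this probability, typically at 0.5.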
Data
Let’s use the Recipes data set to illustrate logistic regression. This dataset has about 20,000 recipes from several different categories. The dataset provides information on the nutritional value of each recipe, the ingredients (in the form of binary variables), and, in some cases, the country the recipe originated from. All in all there are over 650 features. It’s a playground for exploring recipes and predicting what makes a recipe belong to a certain category (dessert, breakfast, brunch, etc.).
Recipe information lifted from: http://www.epicurious.com/recipes-menus
setwd("~/Nawrin_tania")
recipes_data <- read.csv("epi_r.csv")
I love desserts! Let’s predict: is a recipe a dessert or not?
I’ll use logistic regression to predict this! Here, ‘dessert or not’ is going to be my response variable, while nutritional factors like protein, fat, and calorie content will be set as the predictors. A binary response with quantitative explanatory variables fits a logistic model smoothly.
Let’s take a glimpse of our data with glimpse(recipes_data).
How many recipes are desserts? Count the 1’s in the dessert column. Only about 18% of the 20,000 recipes are dessert recipes.
prop.table(table(recipes_data$dessert))
##
## 0 1
## 0.8218133 0.1781867
I’m including columns 1:7 on nutrition to find out the probability of a recipe being a dessert.
dessert_recipes <- recipes_data %>% select(1:7, "dessert")
colnames(dessert_recipes)[7] <- "cakeweek"
kable(data.table(head(dessert_recipes)))
title | rating | calories | protein | fat | sodium | cakeweek | dessert |
---|---|---|---|---|---|---|---|
Lentil, Apple, and Turkey Wrap | 2.500 | 426 | 30 | 7 | 559 | 0 | 0 |
Boudin Blanc Terrine with Red Onion Confit | 4.375 | 403 | 18 | 23 | 1439 | 0 | 0 |
Potato and Fennel Soup Hodge | 3.750 | 165 | 6 | 7 | 165 | 0 | 0 |
Mahi-Mahi in Tomato Olive Sauce | 5.000 | NA | NA | NA | NA | 0 | 0 |
Spinach Noodle Casserole | 3.125 | 547 | 20 | 32 | 452 | 0 | 0 |
The Best Blts | 4.375 | 948 | 19 | 79 | 1042 | 0 | 0 |
Let’s look at descriptive statistics to check for outliers and NAs, and convert the cakeweek and dessert columns to factors given they hold binary values.
# change as factor
dessert_recipes <- dessert_recipes %>% mutate(
  cakeweek = as.factor(cakeweek),
  dessert = as.factor(dessert)
)
There’s definitely an outlier situation going on here, given that the maximum calorie, protein, fat, and sodium content is off the chart or recorded in a different unit. For this analysis, I’ll filter these values out. Plus, there are NA values that could either be removed or imputed based on the average of the relevant column.
Removing Outliers
<- dessert_recipes %>% filter(calories <= 20000 & sodium <=20000) dessert_recipes
On filtering, we lost over 4,000 rows, and the NA’s have been reduced. To fill out the remaining NA values, we have some options:
- Use the mean value of the column
- Apply an ML approach to impute the values
Both methods have their pros and cons. For this project, I’ll use the mean values of the column.
dessert_recipes <- dessert_recipes %>% mutate(
  fat = replace(fat, is.na(fat), mean(fat, na.rm = T)),
  protein = replace(protein, is.na(protein), mean(protein, na.rm = T))
)
Let’s look at the summary.
summary(dessert_recipes)
## title rating calories protein
## Length:15885 Min. :0.000 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.:3.750 1st Qu.: 197.0 1st Qu.: 3.00
## Mode :character Median :4.375 Median : 330.0 Median : 8.00
## Mean :3.752 Mean : 491.1 Mean : 21.07
## 3rd Qu.:4.375 3rd Qu.: 585.0 3rd Qu.: 27.00
## Max. :5.000 Max. :19576.0 Max. :1164.00
## fat sodium cakeweek dessert
## Min. : 0.00 Min. : 0.0 0:15879 0:12931
## 1st Qu.: 8.00 1st Qu.: 79.0 1: 6 1: 2954
## Median : 18.00 Median : 293.0
## Mean : 28.28 Mean : 593.5
## 3rd Qu.: 33.00 3rd Qu.: 703.0
## Max. :1818.00 Max. :19986.0
All NAs have been removed and the outliers have been filtered. The general notion is that a dessert is not only tasty but also has more fat, higher calories, and less protein and sodium. Let’s see:
dessert_recipes %>% ggplot(aes(
  x = calories,
  y = fat,
  color = dessert
)) +
  geom_point() +
  theme_bw() +
  labs(
    color = "Dessert"
  )
It is surprising to see that dessert recipes have lower fat and calorie content. But it’s not very clear, given there are still a few values that are much larger than the average fat and calorie values in the data set.
dessert_recipes %>% ggplot(aes(
  x = calories,
  y = protein,
  color = dessert
)) +
  geom_point() +
  theme_bw() +
  labs(
    color = "Dessert"
  )
This makes more sense, given nobody is expecting to draw proteins from desserts! Let’s check with boxplots.
dessert_box <- dessert_recipes %>% filter(fat < 100)

dessert_box %>% ggplot(
  aes(
    x = dessert, y = fat, color = dessert
  )
) +
  geom_boxplot(varwidth = T, size = 1, width = 0.5, show.legend = F, outlier.color = "red") +
  theme_minimal() +
  labs(
    title = "Dessert recipes (1) - other recipes (0)"
  )
The median fat content is quite similar in both kinds of recipes. For the bulk of the dessert recipes, the fat content is considerably lower than in the rest of the recipes. However, there are several outlier values, which push the standard deviation up, meaning the ‘fat’ values are quite spread out in the data.
And protein is lower in dessert recipes than in other recipes.
dessert_box <- dessert_recipes %>% filter(protein < 100)

dessert_box %>% ggplot(
  aes(
    x = dessert, y = protein, color = dessert
  )
) +
  geom_boxplot(varwidth = T, size = 1, width = 0.5, show.legend = F, outlier.color = "red") +
  theme_minimal() +
  labs(
    title = "Dessert recipes (1) - other recipes (0)"
  )
Is that a Dessert? Looks delicious!
Before we dive into the model, one requirement of logistic regression is the absence of multicollinearity. If the explanatory variables are highly correlated with each other, then it makes sense to keep just one of them in the model. I’ll visualize the correlation between the variables by plotting a heat map containing the correlation coefficients computed with the Spearman method.
ggcorr(
  dessert_recipes, method = c("pairwise", "spearman"),
nbreaks = 6,
hjust = 0.8,
label = TRUE,
label_size = 3,
color = "grey50",
name = "Scale",
palette = "Set3"
)
Calories and fat are highly correlated.
Train/Test Dataset
We will divide the dataset into two subsets, train and test, to perform the train-test split:
- Train subset – we will use this subset to fit/train the model
- Test subset – we will use this subset to evaluate our model
I’ll write a little function for this task.
set.seed(6474)
create_train_test <- function(data, size = 0.8, train = TRUE) {
  n_row = nrow(data)
  total_row = size * n_row
  train_sample <- 1:total_row
  if (train == TRUE) {
    return(data[train_sample, ])
  } else {
    return(data[-train_sample, ])
  }
}
Splitting the Data.
train <- create_train_test(data = dessert_recipes, size = 0.8, train = T)
test <- create_train_test(data = dessert_recipes, size = 0.8, train = F)
Building the Model
We are ready to estimate the logistic model to identify dessert recipes. Logistic regression is probably the simplest, and a highly efficient, algorithm for working out a classification problem, in this case ‘dessert or not’.
dessert_model <- glm(dessert ~ rating + calories + protein + fat + cakeweek, data = train, family = binomial(link = "logit"))
Summary
summary(dessert_model)
##
## Call:
## glm(formula = dessert ~ rating + calories + protein + fat + cakeweek,
## family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.5211 -0.6575 -0.2722 -0.0057 3.5396
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.991535 0.083991 -23.711 < 2e-16 ***
## rating 0.157596 0.019818 7.952 1.83e-15 ***
## calories 0.006922 0.000229 30.226 < 2e-16 ***
## protein -0.212298 0.006740 -31.498 < 2e-16 ***
## fat -0.041422 0.002431 -17.038 < 2e-16 ***
## cakeweek1 14.225325 172.570303 0.082 0.934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12224.8 on 12707 degrees of freedom
## Residual deviance: 8949.8 on 12702 degrees of freedom
## AIC: 8961.8
##
## Number of Fisher Scoring iterations: 12
The summary of our model reveals interesting information. The performance of a logistic regression is evaluated with specific key metrics.
- AIC (Akaike Information Criteria): This plays a role loosely analogous to R2 in logistic regression. It measures the fit while applying a penalty to the number of parameters. Smaller AIC values indicate the model is closer to the truth.
- Null deviance: Fits the model only with the intercept. The degree of freedom is n-1. We can interpret it as a Chi-square value (fitted value different from the actual value hypothesis testing).
- Residual Deviance: Model with all the variables. It is also interpreted as a Chi-square hypothesis testing.
- Number of Fisher Scoring iterations: Number of iterations before converging.
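These quantities can also be pulled directly out of the fitted dessert_model object; a quick sketch:
# extract the key fit statistics reported in the summary
AIC(dessert_model) # Akaike Information Criterion
dessert_model$null.deviance # deviance of the intercept-only model
dessert_model$deviance # residual deviance of the full model
dessert_model$iter # number of Fisher scoring iterations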
To understand how each predictor changes the likelihood of a recipe being a dessert, the coefficients of the explanatory variables are exponentiated, since this is a logit model. The exponentiated coefficient is called the odds ratio. It measures, for a one-unit change in a predictor, by how much the odds of the response variable being 1 versus 0 (yes or no) are multiplied.
exp(coef(dessert_model))
## (Intercept) rating calories protein fat cakeweek1
## 1.364858e-01 1.170693e+00 1.006946e+00 8.087239e-01 9.594246e-01 1.506538e+06
ANOVA Test
ANOVA shows how the full model (with all predictors) performs against the null model (with intercept only). The large drop from the null deviance to the residual deviance shows that the full model is doing better, and most terms produce a statistically significant reduction in deviance when added.
anova(dessert_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: dessert
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 12707 12224.8
## rating 1 6.20 12706 12218.6 0.012748 *
## calories 1 2.06 12705 12216.5 0.151545
## protein 1 2958.34 12704 9258.2 < 2.2e-16 ***
## fat 1 298.51 12703 8959.6 < 2.2e-16 ***
## cakeweek 1 9.84 12702 8949.8 0.001707 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model Interpretation
All the features in the model to predict the likelihood of a dessert recipe, except cakeweek, are shown as highly significant at p < 0.001.
From the model, keeping the other predictors constant, it can be interpreted that:
- A unit increase in rating increases the odds of a dessert recipe by about 17% (odds ratio 1.17).
- A unit increase in calories increases these odds by about 0.7%.
- A unit increase in protein decreases these odds by about 19% (the coefficient of protein is negative; odds ratio 0.81).
- A unit increase in fat decreases these odds by about 4% (the coefficient of fat is also negative; odds ratio 0.96).
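These percentages come straight from the exponentiated coefficients shown above; a quick way to compute them all at once:
# percent change in the odds for a one-unit increase in each predictor
round((exp(coef(dessert_model)) - 1) * 100, 1)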
Performance of the model
One way to measure the accuracy of the model’s predictions is the confusion matrix. Another, measuring the Area Under the Curve (AUC) of the ROC, is the most widely accepted method to check the quality of predictions.
Confusion Matrix
The confusion matrix is a better choice to evaluate classification performance compared with the metrics you saw before. The general idea is to count the number of times True instances are classified as False.
First, predict with the model on the test data.
predict <- predict(dessert_model, test, type = 'response')
Then calculate the matrix:
conf_matrix <- table(test$dessert, predict > 0.5)
conf_matrix
##
## FALSE TRUE
## 0 2506 86
## 1 449 136
You can calculate the model accuracy by summing the true positive + true negative over the total observation. \[accuracy=\frac{TP+TN}{TP+TN+FP+FN}\]
accuracy_Test <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy_Test
## [1] 0.8316021
The model appears to suffer from one problem: it produces a large number of false negatives. This is related to the accuracy paradox. We stated that accuracy is the ratio of correct predictions to the total number of cases. We can have relatively high accuracy but a useless model. It happens when there is a dominant class. If you look back at the confusion matrix, you can see most of the cases are classified as true negative. Imagine now that the model classified every case as negative.
You would still have an accuracy of about 82%. Our model performs only slightly better, yet struggles to distinguish the true positives from the true negatives.
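We can verify this baseline directly from the confusion matrix; a quick sketch:
# accuracy of a trivial model that labels every recipe as "not dessert"
baseline_accuracy <- sum(conf_matrix[1, ]) / sum(conf_matrix)
baseline_accuracy # roughly 0.82 for this test split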
In such a situation, it is preferable to have more specific metrics. We can look at:
- \(Precision=\frac{TP}{TP+FP}\)
- \(Recall=\frac{TP}{TP+FN}\)
Precision and Recall
Precision looks at the accuracy of the positive prediction. Recall is the ratio of positive instances that are correctly detected by the classifier.
Let me write functions to calculate precision and recall.
precision <- function(matrix) {
  # true positive
  tp <- matrix[2, 2]
  # false positive
  fp <- matrix[1, 2]
  return(tp / (tp + fp))
}

recall <- function(matrix) {
  # true positive
  tp <- matrix[2, 2]
  # false negative
  fn <- matrix[2, 1]
  return(tp / (tp + fn))
}
Calculate,
prec <- precision(conf_matrix)
rec <- recall(conf_matrix)
prec
## [1] 0.6126126
rec
## [1] 0.2324786
When the model says a recipe is a dessert, it is correct in about 61 percent of cases (precision), but it only detects about 23 percent of the actual dessert recipes (recall).
You can create the \(F_1\) score based on the precision and recall. The \(F_1\) is a harmonic mean of these two metrics, meaning it gives more weight to the lower values. \[F_1=2\frac{precision*recall}{precision+recall}\]
f1 <- 2 * ((prec * rec) / (prec + rec))
f1
## [1] 0.3370508
It is difficult to have both a high precision and a high recall.
If we increase the precision, the positive predictions become more reliable, but we would miss many positive cases (lower recall). In some situations, we prefer higher precision than recall; there is a trade-off (a concave relationship) between precision and recall.
- Imagine, you need to predict if a patient has a disease. You want to be as precise as possible.
- If you need to detect potentially fraudulent people in the street through facial recognition, it would be better to flag many people as fraudulent (high recall) even though the precision is low. The police can then release the non-fraudulent individuals.
Area Under Curve
A good model should have a high prediction rate for true positives (sensitivity) and a small rate of false positives (1 - specificity). The ROC graph plots sensitivity against the false positive rate “as the value of the cut-point is increased from 0 through to 1”. “AUC is the probability that if you were to take a random pair of observations, one with Y=1 and one with Y=0, the observation with Y=1 has a higher predicted probability than the other. The AUC thus gives the probability that the model correctly ranks such pairs of observations”.
ROCRpred <- prediction(predict, test$dessert)
ROCRperf <- performance(ROCRpred, 'tpr', 'fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2, 1.7))
AUC, the area under the ROC curve, indicates better predictions the closer it is to 1.
auc <- performance(ROCRpred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8813819
Conclusion
Though it can be treated as a good model, we can improve it by:
- adding non-linearity to the model with interaction terms
- reducing the class imbalance
What is Imbalanced Classification ?
Imbalanced classification is a supervised learning problem where one class outnumbers the other class by a large proportion. This problem is faced more frequently in binary classification problems than in multi-class classification problems.
The term imbalanced refers to the disparity encountered in the dependent (response) variable. Therefore, an imbalanced classification problem is one in which the dependent variable has an imbalanced proportion of classes. In other words, a data set that exhibits an unequal distribution between its classes is considered imbalanced.
Example:
prop.table(table(test$dessert))
##
## 0 1
## 0.815864 0.184136
82% of the recipes are not dessert recipes; only 18% are dessert recipes. This is an imbalanced scenario.
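One simple way to reduce the imbalance before refitting is to down-sample the majority class; a minimal base-R sketch (not part of the original analysis):
# keep all dessert rows and an equally sized random sample of non-dessert rows
set.seed(1)
dessert_rows <- which(train$dessert == "1")
other_rows <- sample(which(train$dessert == "0"), length(dessert_rows))
train_balanced <- train[c(dessert_rows, other_rows), ]
prop.table(table(train_balanced$dessert)) # now roughly 50/50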
Poisson Regression Model
Poisson Regression models are best used for modeling events where the outcomes are counts. Or, more specifically, count data: discrete data with non-negative integer values that count something, like the number of times an event occurs during a given time frame or the number of people in line at the grocery store.
Count data can also be expressed as rate data, since the number of times an event occurs within a time frame can be expressed as a raw count (i.e. “In a day, we eat three meals”) or as a rate (“We eat at a rate of 0.125 meals per hour”).
Poisson Regression helps us analyze both count data and rate data by allowing us to determine which explanatory variables (X values) have an effect on a given response variable (Y value, the count or a rate). For example, Poisson regression could be applied by a grocery store to better understand and predict the number of people in a line.
What is Poisson Distribution?
The Poisson distribution is a probability distribution named after the French mathematician Siméon Denis Poisson. It models the probability of an event or events y occurring within a specific timeframe, assuming that occurrences of y are not affected by the timing of previous occurrences. This can be expressed mathematically using the following formula: \[P(x)=\frac{e^{-\lambda} \lambda^x}{x!};\ \ where\ x=0,1,2...\]
Here, \(\lambda\) is the average number of times an event may occur per unit of exposure. It is also called the parameter of Poisson distribution. The exposure may be time, space, population size, distance, or area, but it is often time, denoted with \(t\). If exposure value is not given it is assumed to be equal to 1.
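When the exposure varies between observations, rate data are usually modelled by including the log of the exposure as an offset in the linear predictor; a minimal sketch with hypothetical variables count, x, and exposure t in a data frame df:
# rate model: log(E[count]) = log(t) + b0 + b1*x  (count, x, t, and df are hypothetical)
rate_model <- glm(count ~ x + offset(log(t)), family = poisson(link = "log"), data = df)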
Data
We are going to build a model to predict the number of meadowfoam flowers under given light conditions. The data were collected from an experiment to study how to maximize Mermaid meadowfoam production. (Meadowfoam is a small plant from which a vegetable oil can be extracted.) The dataset is found in a package called GLMsData.
library(GLMsData)
data("flowers")
What’s in the Data?
glimpse(flowers)
## Rows: 24
## Columns: 3
## $ Flowers <dbl> 62.4, 77.1, 77.7, 75.4, 55.7, 54.2, 68.9, 78.2, 49.5, 62.0, 57~
## $ Light <int> 150, 150, 150, 150, 300, 300, 300, 300, 450, 450, 450, 450, 60~
## $ Timing <fct> PFI, PFI, Before, Before, PFI, PFI, Before, Before, PFI, PFI, ~
A data frame with 24 observations on the following 3 variables:
- Flowers: the mean number of flowers per meadowfoam plant, averaged over ten seedlings; a numeric vector
- Light: the light intensity in \(\mu\)mol per square metre per second; a numeric vector
- Timing: when the light treatment was applied; a factor with levels PFI (photo-periodic floral induction) or Before (24 days before PFI)
Let’s round the Flowers variable to make it proper count data, as a count can’t be a fraction!
flowers <- flowers %>% mutate(Flowers = round(Flowers))
Visualizing the histogram:
flowers %>% ggplot(
  aes(Flowers)
) +
  geom_histogram(position = "dodge", bins = 6, fill = "Tomato") +
  theme_bw()
The data is not in the form of a bell curve like a normal distribution. Let’s check the mean and variance:
c(Mean = mean(flowers$Flowers), Variance = var(flowers$Flowers))
## Mean Variance
## 56.04167 188.30254
The variance is much greater than the mean, which suggests that we may have over-dispersion in the model.
Let’s fit the Poisson model using the glm() command.
flower_model <- glm(Flowers ~ Light + Timing, family = poisson(link = "log"), data = flowers)
Interpreting the model
summary(flower_model)
##
## Call:
## glm(formula = Flowers ~ Light + Timing, family = poisson(link = "log"),
## data = flowers)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7385 -0.5343 -0.2473 0.7184 1.5002
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.4935465 0.0631135 71.198 < 2e-16 ***
## Light -0.0007284 0.0001076 -6.769 1.30e-11 ***
## TimingPFI -0.2164546 0.0548539 -3.946 7.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 78.057 on 23 degrees of freedom
## Residual deviance: 16.074 on 21 degrees of freedom
## AIC: 162.17
##
## Number of Fisher Scoring iterations: 4
Anova
anova(flower_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Flowers
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 23 78.057
## Light 1 46.321 22 31.736 1.004e-11 ***
## Timing 1 15.662 21 16.074 7.571e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output begins by echoing the function call. The information on deviance residuals is displayed next. Deviance residuals are approximately normally distributed if the model is specified correctly. In our example, they show a little bit of skewness since the median is not quite zero.
Next come the Poisson regression coefficients for each of the variables, along with their standard errors, z-scores, and p-values. The coefficient for Light is -0.0007284: the expected log count changes by -0.0007284 for a one-unit increase in Light. The coefficient for the indicator variable TimingPFI is the expected difference in log count (approximately -0.2165) between PFI and the reference group ‘Before’.
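Because the model uses a log link, exponentiating the coefficients turns them into multiplicative (rate ratio) effects on the expected count; a quick check:
# multiplicative effect on the expected flower count per unit change in each predictor
exp(coef(flower_model))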
The information on deviance is also provided. We can use the residual deviance to perform a goodness-of-fit test for the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. Therefore, if the residual deviance is small enough, the goodness-of-fit test will not be significant, indicating that the model fits the data. We conclude that the model fits reasonably well because the goodness-of-fit chi-squared test is not statistically significant. If the test had been statistically significant, it would indicate that the data do not fit the model well.
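One way to run this check in R is to compare the residual deviance to a chi-squared distribution with the residual degrees of freedom; a short sketch:
# p-value of the deviance goodness-of-fit test; a large value means no evidence of lack of fit
pchisq(deviance(flower_model), df.residual(flower_model), lower.tail = FALSE)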
If the Residual Deviance is greater than the degrees of freedom, then over-dispersion exists. This means that the estimates are correct, but the standard errors (standard deviation) are wrong and unaccounted for by the model.
The null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean), whereas the residual deviance reflects the model with the independent variables included. Above, we can see that the addition of the 2 independent variables decreased the deviance from 78.057 to 16.074; this large drop indicates that the predictors substantially improve the fit.
So, to obtain more reliable standard errors, we can use a quasi-Poisson model:
flower_model_2 <- glm(Flowers ~ Light + Timing, family = quasipoisson(link = "log"), data = flowers)
summary(flower_model_2)
##
## Call:
## glm(formula = Flowers ~ Light + Timing, family = quasipoisson(link = "log"),
## data = flowers)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7385 -0.5343 -0.2473 0.7184 1.5002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.494e+00 5.518e-02 81.435 < 2e-16 ***
## Light -7.284e-04 9.409e-05 -7.742 1.39e-07 ***
## TimingPFI -2.165e-01 4.796e-02 -4.513 0.00019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.7643845)
##
## Null deviance: 78.057 on 23 degrees of freedom
## Residual deviance: 16.074 on 21 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Comparing The Models:
Now that we’ve got two different models, let’s compare them to see which is better. First, we’ll load the arm library because we’ll use the coef() and se.coef() functions to extract the coefficients and their standard errors from each model, and then use cbind() to combine those extracted values into a single data frame so we can compare them.
library(arm)
coef1 <- coef(flower_model)
coef2 <- coef(flower_model_2)
se_coef1 <- se.coef(flower_model)
se_coef2 <- se.coef(flower_model_2)
exponent <- exp(coef1)
both <- cbind(coef1, se_coef1, coef2, se_coef2, exponent)
both
## coef1 se_coef1 coef2 se_coef2 exponent
## (Intercept) 4.4935465419 0.0631135097 4.4935465419 5.517957e-02 89.4380800
## Light -0.0007284095 0.0001076146 -0.0007284095 9.408648e-05 0.9992719
## TimingPFI -0.2164545632 0.0548538836 -0.2164545632 4.795825e-02 0.8053691
In the above output, we can see that the coefficients are the same, but the standard errors are different.
Predicting From The Model
Once the model is made, we can use predict(model, data, type) to predict outcomes using new dataframes containing data other than the training data. Let’s look at an example.
newdata <- data.frame(Light = c(550, 800), Timing = c("PFI", "Before"))
predict(flower_model, newdata, type = "response")
## 1 2
## 48.25342 49.93983
Our model predicts there will be roughly 48 and 50 flowers, respectively, with this new data.
Regression Co-efficients
Plot regression coefficients for flower_model:
plot_summs(flower_model, scale = T, exp = T)
Plot regression coefficients for flower_model and flower_model_2:
plot_summs(flower_model, flower_model_2, scale = T, exp = T)