For the salmonella dataset, fit a linear model with colonies as the response and log(dose+1) as the predictor. Check for lack of fit.

Loading the salmonella dataset:

## Rows: 18
## Columns: 2
## $ colonies <int> 15, 21, 29, 16, 18, 21, 16, 26, 33, 27, 41, 60, 33, 38, 41...
## $ dose     <int> 0, 0, 0, 10, 10, 10, 33, 33, 33, 100, 100, 100, 333, 333, ...
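
A listing like the one above can be reproduced with a few lines of R. This is only a sketch: it assumes the data set is the `salmonella` data frame shipped with the `faraway` package, which the write-up does not state explicitly, and it uses base R's `str()` where the original appears to have used `dplyr::glimpse()`.

```r
# Assumption: salmonella comes from the faraway package (not stated in the text).
data(salmonella, package = "faraway")

# Compact overview of the columns and their types.
str(salmonella)

# Basic size check: number of records and number of variables.
dim(salmonella)
```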

We have 2 variables and 18 records, so this is a small dataset. Although dose is stored as an integer, it takes only a few repeated values, so it behaves more like a categorical variable than a continuous one.

Visualizing colonies = f(dose) (colonies vs. dose) confirms our previous assumption: the independent variable, dose, falls into about 4 visible groups of recurring values.
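
The scatter plot described here could be drawn roughly as follows. The sketch again assumes the `faraway` copy of the data; the tabulation of `dose` is an addition that makes the repeated values explicit.

```r
# Assumption: salmonella comes from the faraway package.
data(salmonella, package = "faraway")

# Raw-scale scatter plot of the response against the predictor.
plot(colonies ~ dose, data = salmonella,
     main = "colonies vs. dose", xlab = "dose", ylab = "colonies")

# How many observations share each dose value.
dose_counts <- table(salmonella$dose)
print(dose_counts)
```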

Applying the log transform to the predictor:
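
A minimal sketch of the transformed plot, assuming the `faraway` copy of the data. The column name `log_dose` is introduced here only for illustration, and the dashed line is the least-squares fit requested in the exercise statement.

```r
# Assumption: salmonella comes from the faraway package.
data(salmonella, package = "faraway")

# log(dose + 1) keeps the zero-dose observations (log(0) would be -Inf).
salmonella$log_dose <- log(salmonella$dose + 1)

plot(colonies ~ log_dose, data = salmonella,
     xlab = "log(dose + 1)", ylab = "colonies")

# Overlay the simple linear regression line on the transformed scale.
abline(lm(colonies ~ log_dose, data = salmonella), lty = 2)
```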

The log transform appears to improve how well dose explains colonies. Let’s build a model using glm, starting with the untransformed dose. Recall that the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows the response to follow an error distribution other than the normal.

## 
## Call:
## glm(formula = colonies ~ dose, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6482  -1.8225  -0.2993   1.2917   5.1861  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.3219950  0.0540292  61.485   <2e-16 ***
## dose        0.0001901  0.0001172   1.622    0.105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 78.358  on 17  degrees of freedom
## Residual deviance: 75.806  on 16  degrees of freedom
## AIC: 172.34
## 
## Number of Fisher Scoring iterations: 4
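
The summary above corresponds to a call along these lines. The `Call:` line in the output shows no `data` argument, so the original code may have attached the data frame; the sketch passes `data` explicitly instead and assumes the `faraway` copy of the data. The object name `fit_raw` is hypothetical.

```r
# Assumption: salmonella comes from the faraway package.
data(salmonella, package = "faraway")

# Poisson GLM (log link by default) on the untransformed dose.
fit_raw <- glm(colonies ~ dose, family = poisson, data = salmonella)
summary(fit_raw)

# Residual deviance and its degrees of freedom, as reported in the summary.
c(deviance = deviance(fit_raw), df = df.residual(fit_raw))
```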

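The glm summary shows a high p-value (0.105) for dose, so on its raw scale dose is not a significant predictor. The residual deviance (75.806 on 16 degrees of freedom) is also far larger than its degrees of freedom, which is a red flag and an indication of lack of fit. Let’s apply the log to dose and run the glm again.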

## 
## Call:
## glm(formula = colonies ~ log(dose + 1), family = poisson, data = salmonella)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0764  -1.4488  -0.2306   0.9259   4.7212  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.01989    0.09712  31.095  < 2e-16 ***
## log(dose + 1)  0.08585    0.02018   4.255 2.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 78.358  on 17  degrees of freedom
## Residual deviance: 59.629  on 16  degrees of freedom
## AIC: 156.17
## 
## Number of Fisher Scoring iterations: 4
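
One way to turn the residual deviance into a rough lack-of-fit check is to compare it with a chi-square distribution on the residual degrees of freedom, and to estimate the dispersion from the Pearson residuals. This is a sketch under the usual large-sample approximation, not necessarily the check the original analysis ran; the names `fit_log`, `p_gof` and `phi_hat` are hypothetical, and the data are assumed to come from `faraway`.

```r
# Assumption: salmonella comes from the faraway package.
data(salmonella, package = "faraway")

fit_log <- glm(colonies ~ log(dose + 1), family = poisson, data = salmonella)

dev <- deviance(fit_log)      # residual deviance
df  <- df.residual(fit_log)   # residual degrees of freedom

# Deviance goodness-of-fit test: a small p-value suggests lack of fit.
p_gof <- pchisq(dev, df, lower.tail = FALSE)

# Pearson estimate of the dispersion; values well above 1 suggest overdispersion.
phi_hat <- sum(residuals(fit_log, type = "pearson")^2) / df

c(deviance = dev, df = df, p_value = p_gof, dispersion = phi_hat)
```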

Even though we see a much better p-value (2.09e-05) with the transformed predictor, the residuals are not evenly spread around the line y = 0, which suggests unequal (nonconstant) error variance. The residual deviance (59.629 on 16 degrees of freedom) is still well above its degrees of freedom. We also found that when we applied the lm function to the salmonella dataset, the R-squared was weak, meaning the linear model does not fit the data well. There is still lack of fit, and its sources need to be explored further.
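
Because each dose is repeated several times, the linear model from the exercise statement can also be checked with the classical replicate-based lack-of-fit F test: compare the straight-line fit with a model that gives every dose its own mean. The sketch below assumes the `faraway` copy of the data; the names `fit_lm`, `fit_sat` and `lof` are hypothetical.

```r
# Assumption: salmonella comes from the faraway package.
data(salmonella, package = "faraway")

# Linear model requested in the exercise.
fit_lm <- lm(colonies ~ log(dose + 1), data = salmonella)

# Saturated-in-dose model: one mean per distinct dose (pure-error model).
fit_sat <- lm(colonies ~ factor(dose), data = salmonella)

# F test of the straight line against the per-dose means;
# a small p-value indicates lack of fit.
lof <- anova(fit_lm, fit_sat)
print(lof)

# Residuals vs. fitted values, with the y = 0 reference line.
plot(fitted(fit_lm), residuals(fit_lm),
     xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)

# R-squared of the linear model.
summary(fit_lm)$r.squared
```

The comparison is valid because log(dose + 1) is a function of dose, so the straight-line model is nested inside the per-dose model.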