# Set the working directory
setwd("C:/Users/mws16/OneDrive/Desktop/OPIM-5603-Statistics in Business Analytics/Homework 4")
Part I
Q1: Categorical Predictor Variables in Regression
Use the BirthSmokers.csv
The BirthSmokers dataset is a random sample of 32 births from a study conducted on pregnant women. It has three variables: Wgt (birth weight of the baby in grams), Gest (length of gestation in weeks) and Smoke (smoking status of the mother, yes/no).
# read in the file
birthsmokers <- read.csv("BirthSmokers.csv")
# Build a linear regression
fit1a <- lm(Wgt ~ .,
data = birthsmokers)
summary(fit1a)
##
## Call:
## lm(formula = Wgt ~ ., data = birthsmokers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.693 -92.063 -9.365 79.663 197.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2389.573 349.206 -6.843 1.63e-07 ***
## Gest 143.100 9.128 15.677 1.07e-15 ***
## Smokeyes -244.544 41.982 -5.825 2.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
## F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15
# Save predicted values to data table
birthsmokers$Preds <- fit1a$fitted.values
birthsmokers
# Plot the residuals vs the fitted values
plot(y = fit1a$residuals,
x = fit1a$fitted.values,
main = "Residuals vs Fitted values",
xlab = "Fitted values",
ylab = "Residuals")
abline(h = 0) # horizontal reference line: points on this line are predicted exactly
# add the regression line
abline(lm(fit1a$residuals ~ fit1a$fitted.values),
col = "red")
Because Smoke is a categorical predictor with two levels, R creates a single dummy variable for the non-baseline level and reports one coefficient for it. Here the estimate is Smokeyes = -244.544, the estimated change in birth weight (in grams) when the mother smokes, holding gestation length fixed.
Regression equation for smoking mothers (Smoke = yes, so Smokeyes = 1): Wgt = -2389.573 + 143.100 * Gest - 244.544 = -2634.117 + 143.100 * Gest
Regression equation for non-smoking mothers (Smoke = no, so Smokeyes = 0): Wgt = -2389.573 + 143.100 * Gest
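As a quick check of these two equations, we can predict the birth weight for a smoking and a non-smoking mother at the same gestation length (a minimal sketch; it assumes Smoke is stored with the values "yes"/"no", as the dummy name Smokeyes suggests):
# hypothetical new observations: 38 weeks of gestation, one smoking and one non-smoking mother
newmoms <- data.frame(Gest = 38, Smoke = c("yes", "no"))
predict(fit1a, newdata = newmoms)
# the two predictions differ by the Smokeyes coefficient, about -244.5 g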
Q2: Interaction Terms
Use the depression.csv
We want to study the effectiveness of three treatments A, B, and C for severe depression. The researchers collected depression data on a random sample of 36 severely depressed individuals. The response variable is effect ("effectiveness"): a higher number means the treatment was more effective, a lower number a less effective treatment. We also have the age of each patient and the treatment (TRT) that they received.
# read in the data
depression <- read.csv("depression.csv")
# Build a linear regression
fit2a <- lm(effect ~ .,
data = depression)
summary(fit2a)
##
## Call:
## lm(formula = effect ~ ., data = depression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5732 -3.3922 0.9829 3.9613 9.5062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.54335 3.58105 9.088 2.23e-10 ***
## age 0.66446 0.06978 9.522 7.42e-11 ***
## TRTB -9.80758 2.46471 -3.979 0.000371 ***
## TRTC -10.25276 2.46542 -4.159 0.000224 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.035 on 32 degrees of freedom
## Multiple R-squared: 0.784, Adjusted R-squared: 0.7637
## F-statistic: 38.71 on 3 and 32 DF, p-value: 9.287e-11
# Save the predictions to data table
depression$preds <- fit2a$fitted.values
depression
Regression Equation: effect = 32.54335 + 0.66446 * age - 9.80758 * TRTB - 10.25276 * TRTC
When a predictor variable is a factor (categorical) with n possible values, R creates n - 1 dummy variables and reports estimates only for those n - 1 levels. The omitted level is the baseline and is absorbed into the intercept: when an observation takes the omitted value, all of its dummy variables are 0, so its prediction comes from the intercept and the remaining predictors alone. Here treatment A is the baseline, so TRTB and TRTC measure how much lower the effectiveness is for treatments B and C than for A at a given age.
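To see exactly which dummy variables R creates, we can inspect the design matrix directly (a quick sketch, assuming the treatment column is named TRT with levels A, B and C, as the dummy names TRTB and TRTC suggest):
# inspect the indicator columns R builds for the factor TRT; the first level is the omitted baseline
head(model.matrix(~ TRT, data = depression))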
# Build regression using interaction term
fit2c <- lm(effect ~ age * TRT,
data = depression)
summary(fit2c)
##
## Call:
## lm(formula = effect ~ age * TRT, data = depression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4366 -2.7637 0.1887 2.9075 6.5634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.51559 3.82523 12.422 2.34e-13 ***
## age 0.33051 0.08149 4.056 0.000328 ***
## TRTB -18.59739 5.41573 -3.434 0.001759 **
## TRTC -41.30421 5.08453 -8.124 4.56e-09 ***
## age:TRTB 0.19318 0.11660 1.657 0.108001
## age:TRTC 0.70288 0.10896 6.451 3.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared: 0.9143, Adjusted R-squared: 0.9001
## F-statistic: 64.04 on 5 and 30 DF, p-value: 4.264e-15
The interaction term age * TRT lets each treatment have its own slope: the interaction coefficients give the deviation of each treatment's slope from the baseline slope, just as the main-effect dummies give the deviation of each treatment's intercept from the baseline intercept.
In simple words, when a categorical and a continuous variable interact, the slope of the continuous variable depends on the level of the categorical variable. In the model above, the slope of effect with respect to age is 0.33051 for treatment A (the baseline), 0.33051 + 0.19318 = 0.52369 for treatment B, and 0.33051 + 0.70288 = 1.03339 for treatment C.
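These treatment-specific slopes can be computed directly from the fitted coefficients; a small sketch:
# slope of effect vs age for each treatment: baseline slope plus the interaction deviation
b <- coef(fit2c)
c(A = unname(b["age"]),
  B = unname(b["age"] + b["age:TRTB"]),
  C = unname(b["age"] + b["age:TRTC"]))
# roughly 0.33, 0.52 and 1.03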
# evaluate the AIC for both models
AIC(fit2a)
## [1] 237.3518
AIC(fit2c)
## [1] 208.0487
# evaluate regression metrics for both models
mymetrics2a <- hydroGOF::gof(sim = fit2a$fitted.values,
obs = depression$effect)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
# evaluate regression metrics for both models
mymetrics2c <- hydroGOF::gof(sim = fit2c$fitted.values,
obs = depression$effect)
mymetrics2a
## [,1]
## ME 0.00
## MAE 4.72
## MSE 32.38
## RMSE 5.69
## NRMSE % 45.80
## PBIAS % 0.00
## RSR 0.46
## rSD 0.89
## NSE 0.78
## mNSE 0.53
## rNSE 0.65
## d 0.94
## md 0.75
## rd 0.90
## cp 0.89
## r 0.89
## R2 0.78
## bR2 0.78
## KGE 0.84
## VE 0.91
mymetrics2c
## [,1]
## ME 0.00
## MAE 2.98
## MSE 12.84
## RMSE 3.58
## NRMSE % 28.90
## PBIAS % 0.00
## RSR 0.29
## rSD 0.96
## NSE 0.91
## mNSE 0.70
## rNSE 0.90
## d 0.98
## md 0.85
## rd 0.97
## cp 0.95
## r 0.96
## R2 0.91
## bR2 0.91
## KGE 0.94
## VE 0.95
The model with the interaction term (fit2c) has a higher R-squared (0.9143 vs 0.784), a lower AIC (208.05 vs 237.35), and lower MAE, MSE and RMSE, so it clearly performs better.
The interaction term does a better job of explaining the data because it allows the effect of age on effectiveness to differ across treatments, instead of forcing a single common age slope for all three treatments.
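Because fit2a is nested inside fit2c (it is the same model without the interaction terms), a partial F-test offers one more way to confirm that the interaction adds explanatory power; a brief sketch:
# nested model comparison: does adding age:TRT significantly reduce the residual sum of squares?
anova(fit2a, fit2c)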
Q3: Detecting Multicollinearity
Use the Cement.csv
The cement dataset gives the heat evolved, in calories per gram, during the hardening of cement. The four predictors are the percentages of four ingredients: tricalcium aluminate, tricalcium silicate, tetracalcium alumino ferrite, and dicalcium silicate.
cement <- read.csv("Cement.csv")
# scatterplot matrix
library(psych)
# pearson correlation
pairs.panels(cement,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
In the Pearson correlation matrix it can be seen that all the predictor variables have a meaningful correlation with the target variable Heat. However, Trical_sil and Dical_Sil have a strong negative correlation with each other, so keeping both would introduce multicollinearity into the regression model; they largely repeat the same information. Preferably, one of these variables should be dropped when building the regression.
Similarly, Trical_Alum and Tetracal_AlumFer also appear to be correlated, although not as strongly as the other pair. Whether dropping one of these variables improves the regression still needs to be checked.
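The same correlations can also be read off numerically, which is sometimes easier than the panel plot; a quick sketch:
# numeric Pearson correlation matrix for the cement data
round(cor(cement), 2)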
# Build a linear regression
fit3b <- lm(Heat ~ .,
data = cement)
summary(fit3b)
##
## Call:
## lm(formula = Heat ~ ., data = cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1750 -1.6709 0.2508 1.3783 3.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.4054 70.0710 0.891 0.3991
## Trical_Alum 1.5511 0.7448 2.083 0.0708 .
## Trical_sil 0.5102 0.7238 0.705 0.5009
## Tetracal_AlumFer 0.1019 0.7547 0.135 0.8959
## Dical_Sil -0.1441 0.7091 -0.203 0.8441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
## F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
# import library car
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
# Find the Variance Inflation Factor
vif(fit3b)
## Trical_Alum Trical_sil Tetracal_AlumFer Dical_Sil
## 38.49621 254.42317 46.86839 282.51286
All of the variables have high VIF values. For a variable to be retained in the model, its VIF should be as low as possible (ideally sqrt(VIF) <= 3, and at most 5). Let's try building a model without Dical_Sil, the variable with the highest VIF.
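As a reminder of what VIF measures, the value for a single predictor can also be computed by hand by regressing that predictor on the other predictors; a sketch for Dical_Sil:
# VIF for Dical_Sil = 1 / (1 - R^2) from regressing it on the remaining predictors
r2 <- summary(lm(Dical_Sil ~ Trical_Alum + Trical_sil + Tetracal_AlumFer,
                 data = cement))$r.squared
1 / (1 - r2) # should match vif(fit3b)["Dical_Sil"], roughly 282.5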
# Fit another regression dropping the variable Dical_Sil
fit3c <- lm(Heat ~ . - Dical_Sil,
data = cement)
summary(fit3c)
##
## Call:
## lm(formula = Heat ~ . - Dical_Sil, data = cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2543 -1.4726 0.1755 1.5409 3.9711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.19363 3.91330 12.315 6.17e-07 ***
## Trical_Alum 1.69589 0.20458 8.290 1.66e-05 ***
## Trical_sil 0.65691 0.04423 14.851 1.23e-07 ***
## Tetracal_AlumFer 0.25002 0.18471 1.354 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.312 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.3 on 3 and 9 DF, p-value: 3.367e-08
# Check the VIF values
vif(fit3c)
## Trical_Alum Trical_sil Tetracal_AlumFer
## 3.251068 1.063575 3.142125
Now the VIF values are low, and according to the p-values the variables are also significant in the regression. The exception is Tetracal_AlumFer, which is not significant but has a reasonably low VIF. Let's build a model without this variable as well and then compare the performance of the two models.
# Fit another regression dropping both Tetracal_AlumFer and Dical_Sil
fit3c_new <- lm(Heat ~ . - Dical_Sil - Tetracal_AlumFer,
data = cement)
summary(fit3c_new)
##
## Call:
## lm(formula = Heat ~ . - Dical_Sil - Tetracal_AlumFer, data = cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.893 -1.574 -1.302 1.363 4.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
## Trical_Alum 1.46831 0.12130 12.11 2.69e-07 ***
## Trical_sil 0.66225 0.04585 14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09
# Find the AIC for models with and without the Tetracal_AlumFer
AIC(fit3c)
## [1] 63.9036
AIC(fit3c_new)
## [1] 64.31239
After removing the variable Tetracal_AlumFer, the R-squared decreases and the residual standard error increases slightly, and the second model also has a higher AIC. Hence only the variable Dical_Sil should be removed when building a regression model for Heat.
Part II
Q4: Logistic Regression
Use Smarket: S&P Stock Market Data. https://rdrr.io/cran/ISLR/man/Smarket.html
These are the daily percentage returns for the S&P 500 stock index between 2001 and 2005. You can download this from within R in the ISLR package.
# install.packages("ISLR")
# import library ISLR
library(ISLR)
# read in the Smarket data
smarket <- Smarket
smarket
str(smarket$Direction)
## Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
# Recode the target variable
smarket$Direction <- as.character(smarket$Direction)
smarket$Direction[smarket$Direction == "Up"] <- 1
smarket$Direction[smarket$Direction == "Down"] <- 0
str(smarket$Direction)
## chr [1:1250] "1" "1" "0" "1" "1" "1" "0" "1" "1" "1" "0" "0" "1" "1" ...
smarket$Direction <- as.factor(smarket$Direction)
fit4a <- glm(Direction ~ . - Year - Today,
data = smarket,
family = binomial(link = "logit"))
summary(fit4a)
##
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial(link = "logit"),
## data = smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
# predict the probability of the market moving Up (Direction = 1) for each observation
mypreds4a <- predict(fit4a, # my model
newdata = smarket, # dataset
type = "response") # to get probabilities
summary(mypreds4a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4084 0.5020 0.5180 0.5184 0.5338 0.6486
smarket$preds <- round(mypreds4a)
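To see how these rounded probabilities behave as class predictions, we can cross-tabulate them against the recoded Direction and compute the in-sample accuracy (a minimal sketch using the columns created above):
# confusion matrix: predicted class (rounded probability) vs actual direction
table(predicted = smarket$preds, actual = smarket$Direction)
# overall in-sample accuracy
mean(smarket$preds == as.numeric(as.character(smarket$Direction)))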
# get the coefficients
coef(fit4a)
## (Intercept) Lag1 Lag2 Lag3 Lag4
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938
## Lag5 Volume
## 0.010313068 0.135440659
In a logit model the coefficients are on the log-odds scale: each coefficient is the change in the log odds of the market moving Up (Direction = 1) for a one-unit increase in the corresponding predictor, holding the other predictors fixed.
To interpret the coefficients more directly, let's exponentiate them so that they become odds ratios.
# exponentiate the coefficients to correctly interpret them as there is log in the output (because we are using logit)
exp(coef(fit4a))
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.8816146 0.9295323 0.9585809 1.0111468 1.0094029 1.0103664
## Volume
## 1.1450412
An exponentiated coefficient greater than 1 indicates a direct relationship with the odds of the market moving Up, and a value less than 1 indicates an inverse relationship. For example, a one-unit increase in Volume multiplies the odds of an Up day by about 1.145, while a one-unit increase in Lag1 multiplies them by about 0.930 (i.e., lowers the odds). Note, however, that none of these predictors is statistically significant in this model.
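The probability for any individual day can also be recovered from the linear predictor by applying the inverse logit; a small sketch for the first observation:
# linear predictor (log odds) for the first observation ...
eta <- predict(fit4a, newdata = smarket[1, ], type = "link")
# ... converted back to a probability; matches predict(..., type = "response")
1 / (1 + exp(-eta))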
Q5: Poisson Regression
Use the AmphibianRoadKills.csv
You are given amphibian road-kill data, which gives the count of the number of animals killed at different points along a 25000 mile road.
options(scipen=999) # to avoid scientific notation
# read in the data
arkill <- read.csv("AmphibianRoadKills.csv")
arkill
# build the regression model with family = poisson
fit5a <- glm(death ~ distance,
data = arkill,
family=poisson())
# check the model built
summary(fit5a)
##
## Call:
## glm(formula = death ~ distance, family = poisson(), data = arkill)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.1100 -1.6950 -0.4708 1.4206 7.3337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.316484980 0.043219983 99.87 <0.0000000000000002 ***
## distance -0.000105851 0.000004387 -24.13 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1071.4 on 51 degrees of freedom
## Residual deviance: 390.9 on 50 degrees of freedom
## AIC: 634.29
##
## Number of Fisher Scoring iterations: 4
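Before switching to a quasi-Poisson model, it is worth checking how much overdispersion there is; a rough check based on the residual deviance:
# residual deviance relative to residual degrees of freedom
# (values well above 1 indicate overdispersion; here roughly 390.9 / 50, i.e. about 7.8)
deviance(fit5a) / df.residual(fit5a)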
#install.packages("qcc")
# import library qcc (handy for overdispersion checks; the quasi-Poisson model itself only needs base R's glm)
library(qcc)
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
# build regression using family = quasipoisson
fit5b <- glm(death ~ distance,
data = arkill,
family = quasipoisson())
# check the model built
summary(fit5b)
##
## Call:
## glm(formula = death ~ distance, family = quasipoisson(), data = arkill)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.1100 -1.6950 -0.4708 1.4206 7.3337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.31648498 0.11938536 36.156 < 0.0000000000000002 ***
## distance -0.00010585 0.00001212 -8.735 0.0000000000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 7.630148)
##
## Null deviance: 1071.4 on 51 degrees of freedom
## Residual deviance: 390.9 on 50 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
metrics5a <- hydroGOF::gof(sim = fit5a$fitted.values,
obs = fit5a$y)
metrics5b <- hydroGOF::gof(sim = fit5b$fitted.values,
obs = fit5b$y)
metrics5a
## [,1]
## ME 0.00
## MAE 10.40
## MSE 282.67
## RMSE 16.81
## NRMSE % 69.30
## PBIAS % 0.00
## RSR 0.69
## rSD 0.79
## NSE 0.51
## mNSE 0.46
## rNSE -1.50
## d 0.83
## md 0.70
## rd 0.13
## cp 0.16
## r 0.72
## R2 0.52
## bR2 0.41
## KGE 0.65
## VE 0.60
metrics5b
## [,1]
## ME 0.00
## MAE 10.40
## MSE 282.67
## RMSE 16.81
## NRMSE % 69.30
## PBIAS % 0.00
## RSR 0.69
## rSD 0.79
## NSE 0.51
## mNSE 0.46
## rNSE -1.50
## d 0.83
## md 0.70
## rd 0.13
## cp 0.16
## r 0.72
## R2 0.52
## bR2 0.41
## KGE 0.65
## VE 0.60
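The two sets of metrics are identical because the quasi-Poisson model keeps the same mean structure and fitted values as the Poisson model; it only rescales the standard errors by the estimated dispersion (about 7.63 here). A quick sketch to confirm:
# identical point estimates for the two fits ...
cbind(poisson = coef(fit5a), quasipoisson = coef(fit5b))
# ... but the quasi-Poisson standard errors are inflated by sqrt of the dispersion
sqrt(summary(fit5b)$dispersion)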
# plot the residuals vs fitted values
plot(y = fit5a$residuals,
x = fit5a$fitted.values,
main = "Residuals vs Fitted values",
xlab = "Fitted values",
ylab = "Residuals")
abline(0, 0) # horizontal reference line at y = 0
# add the regression line
abline(lm(fit5a$residuals ~ fit5a$fitted.values),
col = "red")
Looking at the residuals vs fitted values scatter plot, the residuals lie on both sides of the zero line, but the range of the underpredictions (positive residuals) is much larger than the range of the overpredictions. This indicates there might be some outliers in the data.
# plot residuals vs the actual distances
plot(y = fit5a$residuals,
x = arkill$distance,
main = "Residuals vs Distance",
xlab = "Distance",
ylab = "Residuals")
abline(0, 0) # horizontal reference line at y = 0
In the Residuals vs. Distance scatterplot above, on the whole the model overpredicts more frequently than it underpredicts.
hist(fit5a$residuals,
main = "Residuals frequency",
xlab = "Residual values",
ylab = "Frequency")
The histogram above reinforces the observation that the model overpredicts more often than it underpredicts.