predictive-anal-2-.knit

Predictive Analytics Assignment 2

Question 1

a) Fit a Poisson GLM, test for the significance of the covariates, and interpret them.

library(COMPoissonReg)

## Loading required package: Rcpp

## Loading required package: numDeriv

data(couple)

str(couple)

## 'data.frame':    387 obs. of  3 variables:
##  $ UPB      : int  15 0 0 3 12 0 7 0 0 0 ...
##  $ EDUCATION: int  0 0 1 1 0 1 0 1 1 0 ...
##  $ ANXIETY  : num  1.005 -0.703 -0.703 0.611 0.217 ...

summary(couple)

##       UPB           EDUCATION         ANXIETY          
##  Min.   : 0.000   Min.   :0.0000   Min.   :-1.3606000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:-0.8348000  
##  Median : 0.000   Median :0.0000   Median :-0.0462000  
##  Mean   : 2.284   Mean   :0.3928   Mean   :-0.0000016  
##  3rd Qu.: 2.000   3rd Qu.:1.0000   3rd Qu.: 0.7425000  
##  Max.   :34.000   Max.   :1.0000   Max.   : 2.5826000

So i fitted a Poisson generalised linear model with log link to model the number of unwanted pursuit behavior perpetrations as a function of education and anxiety:

\(log(μ_i) = β_0 + β_1EDUCATION_i + β_2ANXIETY_i\)

fit.pois <- glm(UPB ~ EDUCATION + ANXIETY, 
                family = poisson(link = "log"), 
                data = couple)
summary(fit.pois)

## 
## Call:
## glm(formula = UPB ~ EDUCATION + ANXIETY, family = poisson(link = "log"), 
##     data = couple)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.81695    0.04386  18.628   <2e-16 ***
## EDUCATION   -0.21579    0.07047  -3.062   0.0022 ** 
## ANXIETY      0.42169    0.03333  12.651   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2478.3  on 386  degrees of freedom
## Residual deviance: 2310.8  on 384  degrees of freedom
## AIC: 2782.4
## 
## Number of Fisher Scoring iterations: 6

The significance of the covariates was determined and from the output we can see that the estimate co-efficient for EDUCATION was (-0.2158) which would mean that people with a degree have \(e^{-(0.2158)}\) approximately 0.81 times the expected number of UPB incidents compared to those without.This works out to about a 19% decrease in expected UPB for people with a degree.

The coefficient for ANXIETY was (0.4217),this would mean that a One-unit increase in Anxiety increases the number of UPB incidents by \(e^{(0.4217)}\) approximately 1.52. This would mean there is an approx 52% increase in expected UPB for each unit of anxiety.

The p-value of EDUCATION is (0.0022) and for ANXIETY is (,2e-16). Both covariates are statistically significant at the lvel of 5% with Anxiety being highly significant.Since the Poisson model uses a log link function, the effects of the covariates are multiplicative on the expected count rather than additive.

finally, the residual deviance (2310.8) is substantially lower than the null deviance at (2478.3), indicating that the covariates actually improve the model fit.

(b) Test for the presence of overdispersion. What is the conclusion? Based on that, would you recommend the Poisson model for analysing this dataset? Justify your answer. [10 marks]

The Poisson model assumes equidispersion, so that the varience of response variable is equal to its mean, overdispersion is when the varience exceeds the mean.

library(AER)

## Warning: package 'AER' was built under R version 4.4.3

## Loading required package: car

## Loading required package: carData

## Loading required package: lmtest

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Loading required package: survival

dispersiontest(fit.pois, trafo = function(x) x^2)

## 
##  Overdispersion test
## 
## data:  fit.pois
## z = 3.0033, p-value = 0.001335
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 3.186317

The test results are z = 3.0033 p-value = 0.001335 α = 3.1863

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that overdispersion is there.

In addition, if we look at the summary statistics from part (a) we see that there is a mean of approx (2.28) and a max value of (34) which would indicate a large spread in the data.With a large spread in data that would suggest the varience is much larger than the mean which also helps us conclude that there is overdispersion.

In conclusion we can say that the data is overdispersed.

As it is overdispersed, the Poisson model is not recommended for this dataset, because its assumption that \(Var(Y)=E(Y)\) is violated. With the presence of overdispersion, the Poisson based inference is more likely to be unreliable and with the underestimated standard errors that will result in inflated significnace of covarience, So i would not recommend the Poisson model for analysising this data set.

(c) Propose another regression model (rather than Poisson) for analysing this data. Report the fitted model, and construct approximate 95% confidence intervals for the parameters. Also, check for the significance of the covariates included in the model. Any inferential change com- pared to the Poisson modelling? [10 marks]

Since overdispersion was is present, i will try a Negative Binomial regression model instead of the Poisson model. This model will allow for the variance to exceed the mean and it should be appropriate for overdispersed data.

library(MASS)

fit.nb <- glm.nb(UPB ~ EDUCATION + ANXIETY, 
                 data = couple, 
                 link = log)

summary(fit.nb)

## 
## Call:
## glm.nb(formula = UPB ~ EDUCATION + ANXIETY, data = couple, link = log, 
##     init.theta = 0.1937364127)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.8553     0.1549   5.521 3.36e-08 ***
## EDUCATION    -0.3532     0.2498  -1.414    0.157    
## ANXIETY       0.4856     0.1217   3.991 6.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1937) family taken to be 1)
## 
##     Null deviance: 303.28  on 386  degrees of freedom
## Residual deviance: 288.22  on 384  degrees of freedom
## AIC: 1285.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.1937 
##           Std. Err.:  0.0220 
## 
##  2 x log-likelihood:  -1277.9190

confint(fit.nb)

## Waiting for profiling to be done...

##                  2.5 %    97.5 %
## (Intercept)  0.5612145 1.1760641
## EDUCATION   -0.8510763 0.1547152
## ANXIETY      0.2347677 0.7428310

The fitted model is \(log(μ_i) = β_0 + β_1EDUCATIONi + β_2ANXIETYi\)

The estimated co-efficents are EDUCATION = (-0.3532) and ANXIETY = (0.4856) The p-value of EDUCATION = (0.157) which is not significant and the p-va;ue of ANXIETY = (6.58e-05) which is significant.

From the 95% confidence interval we can see that EDUCATION has 2.5% at (-0.852076) and 97.5% at (0.154715) which on the interval (-0.852,0.155) it includes 0. We can also see that ANXIETY has a 2.5% at (0.2347677) and a 97.5% at (0.7428310) so on the interval (0.235,0.743) excludes 0.

ANXIETY remains statistically significant of UPB. So a one -unit increase in Anxiety increases the expected number of UPB incidents multiplicatively, however, Education is no longer significant under this testing of negative binomial as both its p-value exceeds 0.05 and confidence interval includes zero.

Compared to the Poisson model, there is a clear inferential change. Under the Poisson, Education was always statistically significant but under the Negative Binomial model education is not significant.This change obviously happened as the Poisson model was underestimating the standard errors due to overdispersion and since the Negative Binomial model accounts for overdispersion it provides a more reliable standard error and infernece.

(d) Randomly split the data set into training and test sets. Fit a Poisson model and your model proposed in (c) to the training set. Now, compute the mean square prediction error using the test set for both models. Repeat the above procedure 100 times. After that, you will end up with 100 values for the test error for each method. Provide boxplots of these mean square prediction errors. What method is performing better in terms of prediction? [10 marks]

set.seed(123)

library(MASS)

n <- nrow(couple)

mse_pois <- numeric(100)
mse_nb <- numeric(100)

for(i in 1:100){
  train_index <- sample(1:n, size = 0.7*n)
  
  train <- couple[train_index, ]
  test <- couple[-train_index, ]
  
  fit_pois <- glm(UPB ~ EDUCATION + ANXIETY,
                  family = poisson(link = "log"),
                  data = train)
  summary(fit_pois)

  fit_nb <- glm.nb(UPB ~ EDUCATION + ANXIETY,
                   data = train)
  summary(fit_nb)


  pred_pois <- predict(fit_pois, newdata = test, type = "response")
  pred_nb <- predict(fit_nb, newdata = test, type = "response")
  

  mse_pois[i] <- mean((test$UPB - pred_pois)^2)
  mse_nb[i] <- mean((test$UPB - pred_nb)^2)
}

boxplot(mse_pois, mse_nb,
        names = c("Poisson", "NegBin"),
        col = c("seagreen", "pink"),
        main = "MSPE Comparison",
        ylab = "Mean Squared Prediction Error")
points(1:2, c(mean(mse_pois), mean(mse_nb)),
       col = "black", pch = 19)
med_pois <- median(mse_pois)
med_nb <- median(mse_nb)

text(1, med_pois + 1, round(med_pois, 2))
text(2, med_nb + 1, round(med_nb, 2))

The boxplots show very similar distributions of mean squared prediction error for both the Poisson and Negative Binomial models.

Both the mean and median MSPE values are slightly lower for the Poisson model compared to the Negative Binomial model, indicating marginally better predictive performance. This is supported by the median MSPE values 22.82 for the Poisson model and 23.01 for the Negative Binomial model, which is only a small difference between the two methods.

Although the Negative Binomial model is more appropriate for inference due to the presence of overdispersion, the Poisson model appears to perform slightly better in terms of prediction for this dataset.

##Question 2. Consider the mathach dataset explored and discussed in our lectures about the math achievement of students from 160 high schools in the USA. Randomly select a sample of 5 schools to be used to answer (a), (b), (c), and (d) in what follows.

###(a) Fit a normal regression model for the math achievement with the five sampled schools as a fixed factor. Report the results and check for the significance of the covariates.

data <- read.csv("~/Library/CloudStorage/OneDrive-UniversityCollegeDublin/HSAB.csv")
names(data)

## [1] "school"       "minority"     "sex"          "ses"          "math.achieve"
## [6] "sector"       "size"         "percent"

head(data)

##   school minority    sex    ses math.achieve sector size percent
## 1   1224       No Female -1.528        5.876 Public  842      35
## 2   1224       No Female -0.588       19.708 Public  842      35
## 3   1224       No   Male -0.528       20.349 Public  842      35
## 4   1224       No   Male -0.668        8.781 Public  842      35
## 5   1224       No   Male -0.158       17.898 Public  842      35
## 6   1224       No   Male  0.022        4.583 Public  842      35

set.seed(123)

schools <- sample(unique(data$school), 5)
data_sub <- subset(data, school %in% schools)

fit_a <- lm(math.achieve ~ ses + factor(school), data = data_sub)

summary(fit_a)

## 
## Call:
## lm(formula = math.achieve ~ ses + factor(school), data = data_sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7262  -4.7474  -0.5169   4.7279  14.8858 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          8.5267     1.3097   6.510 6.85e-10 ***
## ses                  2.1301     0.6712   3.173  0.00176 ** 
## factor(school)3088   1.6259     1.5962   1.019  0.30973    
## factor(school)3499   3.7915     1.7754   2.136  0.03402 *  
## factor(school)7342   3.5936     1.4883   2.415  0.01673 *  
## factor(school)9550   2.4495     1.7716   1.383  0.16844    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.345 on 185 degrees of freedom
## Multiple R-squared:  0.1318, Adjusted R-squared:  0.1083 
## F-statistic: 5.618 on 5 and 185 DF,  p-value: 7.604e-05

Above it shows that i fitted a normal linear regression model to the data and i used a randomly selected subset of 5 schools, school is a fixed effect.

The estimated co-efficient for SES is 2.1301 with a p-value of 0.00176 this would mean that it is a statistically significant predictor of achievement.

School 3499 and 7342 are both significant as their p-values are 0.03402 (3499) and 0.01673(7342) this would show that their mean achievement differs significantly from the reference.

School 3088 and 9550 are not significant, which would suggest no evidence of difference from the schools.

From the above we can see that SES is significant as it affects achievements and that some schools are also significant which means that achievement varies across schools.

###(b) Fit a mixed-effects model with the covariate school as a random effect. Report the results and check for the significance of the covariate. Is there any inferential change compared to the model considered in (a)?

library(lme4)

## Loading required package: Matrix

fit_b <- lmer(math.achieve ~ ses + (1 | school), data = data_sub)
summary(fit_b)

## Linear mixed model fit by REML ['lmerMod']
## Formula: math.achieve ~ ses + (1 | school)
##    Data: data_sub
## 
## REML criterion at convergence: 1247.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0324 -0.7319 -0.1230  0.7486  2.1691 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  school   (Intercept)  0.9641  0.9819  
##  Residual             40.2974  6.3480  
## Number of obs: 191, groups:  school, 5
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  11.0279     0.6565  16.797
## ses           2.4627     0.6166   3.994
## 
## Correlation of Fixed Effects:
##     (Intr)
## ses 0.212

nullfit <- lm(math.achieve ~ ses, data = data_sub)
fit_ml  <- lmer(math.achieve ~ ses + (1 | school), data = data_sub, REML = FALSE)

LR <- as.numeric(2 * (logLik(fit_ml) - logLik(nullfit)))
LR

## [1] 0.2931146

I fitted a mixed-effect model to the data with SES being a fixed effect and school as a random effect. It has allowed each school to have its own intercept whilst assuming that the schoolks are drawn at random.

The estimated co-effecient for SES is 2.4627 with a t value of 3.994 which is very large indicating that SES is aa significant predictor. This would mean that witha one unit increase in ses there is an approximately an increase of 2.46 in maths achievement.

The estimated varience between schools is 0.9641 with a standard deviation of 0.9819, the residual varience within teh schools is 40.2974. All of this would suggest that there is some varience in achievement between schools however most of the variation comes from within the schools not in between schools.

When we compare this with part (a) we can see that there is a clear inferential change. In part (a), school was treated as a fixed effect and seperate coefficients were estimated, however in (b) we are treating school as a random effect, and the variability across schools is shown by one varinece rather than individual ones.

The results show that SES remains a significant predictor of math achievement. There is also evidence of variation between schools. The mixed-effects model provides a more appropriate framework when schools are viewed as a sample from a larger population, as it captures school level variability without estimating a separate parameter for each school.

###(c) Compute the intraclass correlation coefficient ρ and interpret it. Additionally, provide 95% confidence intervals for the model parameters.

var_school <- as.numeric(VarCorr(fit_b)$school)
var_resid  <- attr(VarCorr(fit_b), "sc")^2

icc <- var_school / (var_school + var_resid)
icc

## [1] 0.02336602

The ICC is approximately 0.0234, meaning that about 2.34% of the total variation in math achievement is from the differences between schools.

The majority of variation in math achievement happens within schools between the students.

Only a small proportion is explained by differences between schools, While there is some clustering by school, the overall effect is relatively weak.

confint(fit_b, method = "boot")

## Computing bootstrap confidence intervals ...

## 
## 139 message(s): boundary (singular) fit: see help('isSingular')

##                2.5 %    97.5 %
## .sig01      0.000000  2.165562
## .sigma      5.652866  7.006612
## (Intercept) 9.809492 12.291635
## ses         1.212448  3.701831

I used the bootstrap method to get the 95% confidence intervals. The confidence interval for SES is 1.212, 3.702 , which does not include zero, which shows the SES is a statistically significant predictor of math achievement.

The confidence interval for the intercept is (9.809, 12.292), representing the range of the baseline level of achievement. The confidence interval for the school level variance includes zero 0.0000,2.16556 which indicates an uncertainty in the between-school variability, and the effect of it is actually quiet small.

The confidence interval for the residual standard deviation shows that most variability in the data occurs at the actual student level rather than between schools.

###(d) Predict the random effects and present the results. Also, provide a plot with 95% confidence intervals for these predictions

ranef(fit_b, condVar = TRUE)

## $school
##      (Intercept)
## 1637 -0.88955325
## 3088 -0.34665696
## 3499  0.54319013
## 7342  0.72144405
## 9550 -0.02842397
## 
## with conditional variances for "school"

Schools 3499 and 7342 have positive random effects, indicating that their average math achievement is above the overall mean.

Schools 1637 and 3088 have negative random effects, indicating that their average achievement is below the overall mean.

School 9550 has a value close to zero, suggesting its mean achievement is very similar to the overall average.

library(lattice)
dotplot(ranef(fit_b, condVar = TRUE))

## $school

The plot shows the estimated random effects for each school along with their 95% confidence intervals.

The points show the estimated deviation of each school’s mean from the overall mean. The horizontal lines represent the their 95% confidence intervals.

From the plot, we can see that all the confidence intervals include zero, showing that there is no strong evidence that any school differs significantly from the overall mean. These results are consistent with the low ICC value of 0.023, which indicates that only a small part of the total variation in math achievemnt comes from the differences between schools.

Therefore, most of the variation occurs between students level than between schools. This further supports the conclusion that the school-level random effect is not statistically significant.

Q3. Consider the Wage dataset from the textbook “An Introduction to Statistical Learning” containing wage and other data for a group of 3000 male workers in the Mid-Atlantic region; see https://cran.rstudio.com/web/packages/ISLR2/ISLR2.pdf. The aim is to predict the wage (response variable) using the variables year, maritl, race, education, region, jobclass, health, and health ins as covariates. See the above link for details on the covariates

####(a) Randomly split the data set into a training set and a test set. Fit a linear model using least squares, ridge regression, and LASSO on the training set and report the test error obtained according to these three methods. For ridge and LASSO, use cross-validation to choose λ. Discuss the results obtained including relevant figures.

set.seed(123)

n <- nrow(data)
train_index <- sample(1:n, size = 0.7 * n)

train <- data[train_index, ]
test  <- data[-train_index, ]

x_train <- model.matrix(math.achieve ~ ., train)[,-1]
y_train <- train$math.achieve

x_test  <- model.matrix(math.achieve ~ ., test)[,-1]
y_test  <- test$math.achieve

fit_lm <- lm(math.achieve ~ ., data = train)

pred_lm <- predict(fit_lm, newdata = test)

mse_lm <- mean((y_test - pred_lm)^2)
mse_lm

## [1] 39.16752

library(glmnet)

## Loaded glmnet 4.1-10

cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)

lambda_ridge <- cv_ridge$lambda.min
lambda_ridge

## [1] 0.2535681

pred_ridge <- predict(cv_ridge, s = lambda_ridge, newx = x_test)

mse_ridge <- mean((y_test - pred_ridge)^2)
mse_ridge

## [1] 39.1539

cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)

lambda_lasso <- cv_lasso$lambda.min
lambda_lasso

## [1] 0.01520101

pred_lasso <- predict(cv_lasso, s = lambda_lasso, newx = x_test)

mse_lasso <- mean((y_test - pred_lasso)^2)
mse_lasso

## [1] 39.15849

mse_lm

## [1] 39.16752

mse_ridge

## [1] 39.1539

mse_lasso

## [1] 39.15849

The dataset was randomly split into a training set of 70% and a test set of 30%.

Three regression models were fitted using the training data: Ordinary Least Squares (OLS) Ridge regression LASSO regression

For Ridge and LASSO, the tuning parameter is λ ansd was selected using cross-validation, as this minimises the estimated prediction error. The test mean squared errors (MSE) obtained for each model were: OLS: 39.1675 Ridge: 39.1539 LASSO: 39.1585

The Ridge regression model was the lowest test MSE eventhough the differences between the three methods are very small. This would mean that regularisation gives a slight improvement over OLS by reducing the variance. The small improvement indicates that the OLS model is already performing reasonably well and is not severely overfitting.

The fact that the Ridge regression performs best would suggest that the predictors have many small effects. Shrinking coefficients like the ridge is more appropriate than setting some to zero like we did in LASSO.

Although LASSO also works well, its tiny bit higher error suggests that variable selection is not very beneficial in this dataset.

plot(log(cv_ridge$lambda), cv_ridge$cvm, type = "l",
     col = "pink", lwd = 2,
     xlab = "log(Lambda)", ylab = "CV Error",
     main = "Ridge CV Error")

lines(log(cv_ridge$lambda),
      cv_ridge$cvm + cv_ridge$cvsd,
      col = "lightblue", lty = 2)

lines(log(cv_ridge$lambda),
      cv_ridge$cvm - cv_ridge$cvsd,
      col = "lightblue", lty = 2)

abline(v = log(cv_ridge$lambda.min), col = "darkblue", lwd = 2)

plot(log(cv_lasso$lambda), cv_lasso$cvm, type = "l",
     col = "pink", lwd = 2,
     xlab = "log(Lambda)", ylab = "CV Error",
     main = "LASSO CV Error")

lines(log(cv_lasso$lambda),
      cv_lasso$cvm + cv_lasso$cvsd,
      col = "lightgreen", lty = 2)

lines(log(cv_lasso$lambda),
      cv_lasso$cvm - cv_lasso$cvsd,
      col = "lightgreen", lty = 2)

abline(v = log(cv_lasso$lambda.min), col = "darkgreen", lwd = 2)

Cross-validation was used to select the optimal value of λ for both Ridge and LASSO, to ensure that the level of regularisation balances bias and variance effectively, leading to improved predictive performance.

The cross-validation curves for Ridge and LASSO are shown in plots above.

The solid curve represents the mean cross-validation error The dashed lines represent +/- 1 of the standard error The vertical line indicates the optimal value of λ The CV error is fairly flat near the minimum, indicating that a range of λ values give similar performance. This shows that the model is stable and not highly sensitive to the exact choice of λ.

The Ridge model performs slightly better than both OLS and LASSO. This would suggets that regularisation provides a small reduction in variance, improving predictive performance The fact that Ridge outperforms LASSO suggests that the underlying relationship is not sparse and that many predictors likely contribute small effects, rather than a few dominant variables. LASSO performs similarly but a little bit worse, indicating that variable selection is not as beneficial in this setting.