library(COMPoissonReg)
## Loading required package: Rcpp
## Loading required package: numDeriv
data(couple)
str(couple)
## 'data.frame': 387 obs. of 3 variables:
## $ UPB : int 15 0 0 3 12 0 7 0 0 0 ...
## $ EDUCATION: int 0 0 1 1 0 1 0 1 1 0 ...
## $ ANXIETY : num 1.005 -0.703 -0.703 0.611 0.217 ...
summary(couple)
## UPB EDUCATION ANXIETY
## Min. : 0.000 Min. :0.0000 Min. :-1.3606000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:-0.8348000
## Median : 0.000 Median :0.0000 Median :-0.0462000
## Mean : 2.284 Mean :0.3928 Mean :-0.0000016
## 3rd Qu.: 2.000 3rd Qu.:1.0000 3rd Qu.: 0.7425000
## Max. :34.000 Max. :1.0000 Max. : 2.5826000
So i fitted a Poisson generalised linear model with log link to model the number of unwanted pursuit behavior perpetrations as a function of education and anxiety:
\(log(μ_i) = β_0 + β_1EDUCATION_i + β_2ANXIETY_i\)
fit.pois <- glm(UPB ~ EDUCATION + ANXIETY,
family = poisson(link = "log"),
data = couple)
summary(fit.pois)
##
## Call:
## glm(formula = UPB ~ EDUCATION + ANXIETY, family = poisson(link = "log"),
## data = couple)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.81695 0.04386 18.628 <2e-16 ***
## EDUCATION -0.21579 0.07047 -3.062 0.0022 **
## ANXIETY 0.42169 0.03333 12.651 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2478.3 on 386 degrees of freedom
## Residual deviance: 2310.8 on 384 degrees of freedom
## AIC: 2782.4
##
## Number of Fisher Scoring iterations: 6
The significance of the covariates was determined and from the output we can see that the estimate co-efficient for EDUCATION was (-0.2158) which would mean that people with a degree have \(e^{-(0.2158)}\) approximately 0.81 times the expected number of UPB incidents compared to those without.This works out to about a 19% decrease in expected UPB for people with a degree.
The coefficient for ANXIETY was (0.4217),this would mean that a One-unit increase in Anxiety increases the number of UPB incidents by \(e^{(0.4217)}\) approximately 1.52. This would mean there is an approx 52% increase in expected UPB for each unit of anxiety.
The p-value of EDUCATION is (0.0022) and for ANXIETY is (,2e-16). Both covariates are statistically significant at the lvel of 5% with Anxiety being highly significant.Since the Poisson model uses a log link function, the effects of the covariates are multiplicative on the expected count rather than additive.
finally, the residual deviance (2310.8) is substantially lower than the null deviance at (2478.3), indicating that the covariates actually improve the model fit.
The Poisson model assumes equidispersion, so that the varience of response variable is equal to its mean, overdispersion is when the varience exceeds the mean.
library(AER)
## Warning: package 'AER' was built under R version 4.4.3
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
dispersiontest(fit.pois, trafo = function(x) x^2)
##
## Overdispersion test
##
## data: fit.pois
## z = 3.0033, p-value = 0.001335
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 3.186317
The test results are z = 3.0033 p-value = 0.001335 α = 3.1863
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that overdispersion is there.
In addition, if we look at the summary statistics from part (a) we see that there is a mean of approx (2.28) and a max value of (34) which would indicate a large spread in the data.With a large spread in data that would suggest the varience is much larger than the mean which also helps us conclude that there is overdispersion.
In conclusion we can say that the data is overdispersed.
As it is overdispersed, the Poisson model is not recommended for this dataset, because its assumption that \(Var(Y)=E(Y)\) is violated. With the presence of overdispersion, the Poisson based inference is more likely to be unreliable and with the underestimated standard errors that will result in inflated significnace of covarience, So i would not recommend the Poisson model for analysising this data set.
Since overdispersion was is present, i will try a Negative Binomial regression model instead of the Poisson model. This model will allow for the variance to exceed the mean and it should be appropriate for overdispersed data.
library(MASS)
fit.nb <- glm.nb(UPB ~ EDUCATION + ANXIETY,
data = couple,
link = log)
summary(fit.nb)
##
## Call:
## glm.nb(formula = UPB ~ EDUCATION + ANXIETY, data = couple, link = log,
## init.theta = 0.1937364127)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8553 0.1549 5.521 3.36e-08 ***
## EDUCATION -0.3532 0.2498 -1.414 0.157
## ANXIETY 0.4856 0.1217 3.991 6.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.1937) family taken to be 1)
##
## Null deviance: 303.28 on 386 degrees of freedom
## Residual deviance: 288.22 on 384 degrees of freedom
## AIC: 1285.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.1937
## Std. Err.: 0.0220
##
## 2 x log-likelihood: -1277.9190
confint(fit.nb)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.5612145 1.1760641
## EDUCATION -0.8510763 0.1547152
## ANXIETY 0.2347677 0.7428310
The fitted model is \(log(μ_i) = β_0 + β_1EDUCATIONi + β_2ANXIETYi\)
The estimated co-efficents are EDUCATION = (-0.3532) and ANXIETY = (0.4856) The p-value of EDUCATION = (0.157) which is not significant and the p-va;ue of ANXIETY = (6.58e-05) which is significant.
From the 95% confidence interval we can see that EDUCATION has 2.5% at (-0.852076) and 97.5% at (0.154715) which on the interval (-0.852,0.155) it includes 0. We can also see that ANXIETY has a 2.5% at (0.2347677) and a 97.5% at (0.7428310) so on the interval (0.235,0.743) excludes 0.
ANXIETY remains statistically significant of UPB. So a one -unit increase in Anxiety increases the expected number of UPB incidents multiplicatively, however, Education is no longer significant under this testing of negative binomial as both its p-value exceeds 0.05 and confidence interval includes zero.
Compared to the Poisson model, there is a clear inferential change. Under the Poisson, Education was always statistically significant but under the Negative Binomial model education is not significant.This change obviously happened as the Poisson model was underestimating the standard errors due to overdispersion and since the Negative Binomial model accounts for overdispersion it provides a more reliable standard error and infernece.
set.seed(123)
library(MASS)
n <- nrow(couple)
mse_pois <- numeric(100)
mse_nb <- numeric(100)
for(i in 1:100){
train_index <- sample(1:n, size = 0.7*n)
train <- couple[train_index, ]
test <- couple[-train_index, ]
fit_pois <- glm(UPB ~ EDUCATION + ANXIETY,
family = poisson(link = "log"),
data = train)
summary(fit_pois)
fit_nb <- glm.nb(UPB ~ EDUCATION + ANXIETY,
data = train)
summary(fit_nb)
pred_pois <- predict(fit_pois, newdata = test, type = "response")
pred_nb <- predict(fit_nb, newdata = test, type = "response")
mse_pois[i] <- mean((test$UPB - pred_pois)^2)
mse_nb[i] <- mean((test$UPB - pred_nb)^2)
}
boxplot(mse_pois, mse_nb,
names = c("Poisson", "NegBin"),
col = c("seagreen", "pink"),
main = "MSPE Comparison",
ylab = "Mean Squared Prediction Error")
points(1:2, c(mean(mse_pois), mean(mse_nb)),
col = "black", pch = 19)
med_pois <- median(mse_pois)
med_nb <- median(mse_nb)
text(1, med_pois + 1, round(med_pois, 2))
text(2, med_nb + 1, round(med_nb, 2))
The boxplots show very similar distributions of mean squared prediction error for both the Poisson and Negative Binomial models.
Both the mean and median MSPE values are slightly lower for the Poisson model compared to the Negative Binomial model, indicating marginally better predictive performance. This is supported by the median MSPE values 22.82 for the Poisson model and 23.01 for the Negative Binomial model, which is only a small difference between the two methods.
Although the Negative Binomial model is more appropriate for inference due to the presence of overdispersion, the Poisson model appears to perform slightly better in terms of prediction for this dataset.
##Question 2. Consider the mathach dataset explored and discussed in our lectures about the math achievement of students from 160 high schools in the USA. Randomly select a sample of 5 schools to be used to answer (a), (b), (c), and (d) in what follows.
###(a) Fit a normal regression model for the math achievement with the five sampled schools as a fixed factor. Report the results and check for the significance of the covariates.
data <- read.csv("~/Library/CloudStorage/OneDrive-UniversityCollegeDublin/HSAB.csv")
names(data)
## [1] "school" "minority" "sex" "ses" "math.achieve"
## [6] "sector" "size" "percent"
head(data)
## school minority sex ses math.achieve sector size percent
## 1 1224 No Female -1.528 5.876 Public 842 35
## 2 1224 No Female -0.588 19.708 Public 842 35
## 3 1224 No Male -0.528 20.349 Public 842 35
## 4 1224 No Male -0.668 8.781 Public 842 35
## 5 1224 No Male -0.158 17.898 Public 842 35
## 6 1224 No Male 0.022 4.583 Public 842 35
set.seed(123)
schools <- sample(unique(data$school), 5)
data_sub <- subset(data, school %in% schools)
fit_a <- lm(math.achieve ~ ses + factor(school), data = data_sub)
summary(fit_a)
##
## Call:
## lm(formula = math.achieve ~ ses + factor(school), data = data_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7262 -4.7474 -0.5169 4.7279 14.8858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5267 1.3097 6.510 6.85e-10 ***
## ses 2.1301 0.6712 3.173 0.00176 **
## factor(school)3088 1.6259 1.5962 1.019 0.30973
## factor(school)3499 3.7915 1.7754 2.136 0.03402 *
## factor(school)7342 3.5936 1.4883 2.415 0.01673 *
## factor(school)9550 2.4495 1.7716 1.383 0.16844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.345 on 185 degrees of freedom
## Multiple R-squared: 0.1318, Adjusted R-squared: 0.1083
## F-statistic: 5.618 on 5 and 185 DF, p-value: 7.604e-05
Above it shows that i fitted a normal linear regression model to the data and i used a randomly selected subset of 5 schools, school is a fixed effect.
The estimated co-efficient for SES is 2.1301 with a p-value of 0.00176 this would mean that it is a statistically significant predictor of achievement.
School 3499 and 7342 are both significant as their p-values are 0.03402 (3499) and 0.01673(7342) this would show that their mean achievement differs significantly from the reference.
School 3088 and 9550 are not significant, which would suggest no evidence of difference from the schools.
From the above we can see that SES is significant as it affects achievements and that some schools are also significant which means that achievement varies across schools.
###(b) Fit a mixed-effects model with the covariate school as a random effect. Report the results and check for the significance of the covariate. Is there any inferential change compared to the model considered in (a)?
library(lme4)
## Loading required package: Matrix
fit_b <- lmer(math.achieve ~ ses + (1 | school), data = data_sub)
summary(fit_b)
## Linear mixed model fit by REML ['lmerMod']
## Formula: math.achieve ~ ses + (1 | school)
## Data: data_sub
##
## REML criterion at convergence: 1247.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0324 -0.7319 -0.1230 0.7486 2.1691
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 0.9641 0.9819
## Residual 40.2974 6.3480
## Number of obs: 191, groups: school, 5
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 11.0279 0.6565 16.797
## ses 2.4627 0.6166 3.994
##
## Correlation of Fixed Effects:
## (Intr)
## ses 0.212
nullfit <- lm(math.achieve ~ ses, data = data_sub)
fit_ml <- lmer(math.achieve ~ ses + (1 | school), data = data_sub, REML = FALSE)
LR <- as.numeric(2 * (logLik(fit_ml) - logLik(nullfit)))
LR
## [1] 0.2931146
I fitted a mixed-effect model to the data with SES being a fixed effect and school as a random effect. It has allowed each school to have its own intercept whilst assuming that the schoolks are drawn at random.
The estimated co-effecient for SES is 2.4627 with a t value of 3.994 which is very large indicating that SES is aa significant predictor. This would mean that witha one unit increase in ses there is an approximately an increase of 2.46 in maths achievement.
The estimated varience between schools is 0.9641 with a standard deviation of 0.9819, the residual varience within teh schools is 40.2974. All of this would suggest that there is some varience in achievement between schools however most of the variation comes from within the schools not in between schools.
When we compare this with part (a) we can see that there is a clear inferential change. In part (a), school was treated as a fixed effect and seperate coefficients were estimated, however in (b) we are treating school as a random effect, and the variability across schools is shown by one varinece rather than individual ones.
The results show that SES remains a significant predictor of math achievement. There is also evidence of variation between schools. The mixed-effects model provides a more appropriate framework when schools are viewed as a sample from a larger population, as it captures school level variability without estimating a separate parameter for each school.
###(c) Compute the intraclass correlation coefficient ρ and interpret it. Additionally, provide 95% confidence intervals for the model parameters.
var_school <- as.numeric(VarCorr(fit_b)$school)
var_resid <- attr(VarCorr(fit_b), "sc")^2
icc <- var_school / (var_school + var_resid)
icc
## [1] 0.02336602
The ICC is approximately 0.0234, meaning that about 2.34% of the total variation in math achievement is from the differences between schools.
The majority of variation in math achievement happens within schools between the students.
Only a small proportion is explained by differences between schools, While there is some clustering by school, the overall effect is relatively weak.
confint(fit_b, method = "boot")
## Computing bootstrap confidence intervals ...
##
## 139 message(s): boundary (singular) fit: see help('isSingular')
## 2.5 % 97.5 %
## .sig01 0.000000 2.165562
## .sigma 5.652866 7.006612
## (Intercept) 9.809492 12.291635
## ses 1.212448 3.701831
I used the bootstrap method to get the 95% confidence intervals. The confidence interval for SES is 1.212, 3.702 , which does not include zero, which shows the SES is a statistically significant predictor of math achievement.
The confidence interval for the intercept is (9.809, 12.292), representing the range of the baseline level of achievement. The confidence interval for the school level variance includes zero 0.0000,2.16556 which indicates an uncertainty in the between-school variability, and the effect of it is actually quiet small.
The confidence interval for the residual standard deviation shows that most variability in the data occurs at the actual student level rather than between schools.
###(d) Predict the random effects and present the results. Also, provide a plot with 95% confidence intervals for these predictions
ranef(fit_b, condVar = TRUE)
## $school
## (Intercept)
## 1637 -0.88955325
## 3088 -0.34665696
## 3499 0.54319013
## 7342 0.72144405
## 9550 -0.02842397
##
## with conditional variances for "school"
Schools 3499 and 7342 have positive random effects, indicating that their average math achievement is above the overall mean.
Schools 1637 and 3088 have negative random effects, indicating that their average achievement is below the overall mean.
School 9550 has a value close to zero, suggesting its mean achievement is very similar to the overall average.
library(lattice)
dotplot(ranef(fit_b, condVar = TRUE))
## $school
The plot shows the estimated random effects for each school along with
their 95% confidence intervals.
The points show the estimated deviation of each school’s mean from the overall mean. The horizontal lines represent the their 95% confidence intervals.
From the plot, we can see that all the confidence intervals include zero, showing that there is no strong evidence that any school differs significantly from the overall mean. These results are consistent with the low ICC value of 0.023, which indicates that only a small part of the total variation in math achievemnt comes from the differences between schools.
Therefore, most of the variation occurs between students level than between schools. This further supports the conclusion that the school-level random effect is not statistically significant.
####(a) Randomly split the data set into a training set and a test set. Fit a linear model using least squares, ridge regression, and LASSO on the training set and report the test error obtained according to these three methods. For ridge and LASSO, use cross-validation to choose λ. Discuss the results obtained including relevant figures.
set.seed(123)
n <- nrow(data)
train_index <- sample(1:n, size = 0.7 * n)
train <- data[train_index, ]
test <- data[-train_index, ]
x_train <- model.matrix(math.achieve ~ ., train)[,-1]
y_train <- train$math.achieve
x_test <- model.matrix(math.achieve ~ ., test)[,-1]
y_test <- test$math.achieve
fit_lm <- lm(math.achieve ~ ., data = train)
pred_lm <- predict(fit_lm, newdata = test)
mse_lm <- mean((y_test - pred_lm)^2)
mse_lm
## [1] 39.16752
library(glmnet)
## Loaded glmnet 4.1-10
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
lambda_ridge <- cv_ridge$lambda.min
lambda_ridge
## [1] 0.2535681
pred_ridge <- predict(cv_ridge, s = lambda_ridge, newx = x_test)
mse_ridge <- mean((y_test - pred_ridge)^2)
mse_ridge
## [1] 39.1539
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lambda_lasso <- cv_lasso$lambda.min
lambda_lasso
## [1] 0.01520101
pred_lasso <- predict(cv_lasso, s = lambda_lasso, newx = x_test)
mse_lasso <- mean((y_test - pred_lasso)^2)
mse_lasso
## [1] 39.15849
mse_lm
## [1] 39.16752
mse_ridge
## [1] 39.1539
mse_lasso
## [1] 39.15849
The dataset was randomly split into a training set of 70% and a test set of 30%.
Three regression models were fitted using the training data: Ordinary Least Squares (OLS) Ridge regression LASSO regression
For Ridge and LASSO, the tuning parameter is λ ansd was selected using cross-validation, as this minimises the estimated prediction error. The test mean squared errors (MSE) obtained for each model were: OLS: 39.1675 Ridge: 39.1539 LASSO: 39.1585
The Ridge regression model was the lowest test MSE eventhough the differences between the three methods are very small. This would mean that regularisation gives a slight improvement over OLS by reducing the variance. The small improvement indicates that the OLS model is already performing reasonably well and is not severely overfitting.
The fact that the Ridge regression performs best would suggest that the predictors have many small effects. Shrinking coefficients like the ridge is more appropriate than setting some to zero like we did in LASSO.
Although LASSO also works well, its tiny bit higher error suggests that variable selection is not very beneficial in this dataset.
plot(log(cv_ridge$lambda), cv_ridge$cvm, type = "l",
col = "pink", lwd = 2,
xlab = "log(Lambda)", ylab = "CV Error",
main = "Ridge CV Error")
lines(log(cv_ridge$lambda),
cv_ridge$cvm + cv_ridge$cvsd,
col = "lightblue", lty = 2)
lines(log(cv_ridge$lambda),
cv_ridge$cvm - cv_ridge$cvsd,
col = "lightblue", lty = 2)
abline(v = log(cv_ridge$lambda.min), col = "darkblue", lwd = 2)
plot(log(cv_lasso$lambda), cv_lasso$cvm, type = "l",
col = "pink", lwd = 2,
xlab = "log(Lambda)", ylab = "CV Error",
main = "LASSO CV Error")
lines(log(cv_lasso$lambda),
cv_lasso$cvm + cv_lasso$cvsd,
col = "lightgreen", lty = 2)
lines(log(cv_lasso$lambda),
cv_lasso$cvm - cv_lasso$cvsd,
col = "lightgreen", lty = 2)
abline(v = log(cv_lasso$lambda.min), col = "darkgreen", lwd = 2)
Cross-validation was used to select the optimal value of λ for both Ridge and LASSO, to ensure that the level of regularisation balances bias and variance effectively, leading to improved predictive performance.
The cross-validation curves for Ridge and LASSO are shown in plots above.
The solid curve represents the mean cross-validation error The dashed lines represent +/- 1 of the standard error The vertical line indicates the optimal value of λ The CV error is fairly flat near the minimum, indicating that a range of λ values give similar performance. This shows that the model is stable and not highly sensitive to the exact choice of λ.
The Ridge model performs slightly better than both OLS and LASSO. This would suggets that regularisation provides a small reduction in variance, improving predictive performance The fact that Ridge outperforms LASSO suggests that the underlying relationship is not sparse and that many predictors likely contribute small effects, rather than a few dominant variables. LASSO performs similarly but a little bit worse, indicating that variable selection is not as beneficial in this setting.