# Question 1 – Introduction to Statistical Modelling
# Load the iris dataset
data(iris)
# b. Fit the model and interpret slope
model1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
summary(model1)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5561 -0.6333 -0.1120 0.5579 2.2226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5262 0.4789 13.63 <2e-16 ***
## Sepal.Width -0.2234 0.1551 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
Interpretation: The slope estimate of -0.2234 suggests that each 1 cm increase in Sepal Width is associated with a decrease of about 0.22 cm in Sepal Length. However, the slope is not statistically significant (p = 0.152) and the model explains only about 1.4% of the variance, so there is little evidence of a linear relationship between the two variables.
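Because the p-value is well above 0.05, a confidence interval for the slope is a useful companion check; a minimal sketch (not part of the required output):
# 95% CI for the Sepal.Width slope; it should span 0, consistent with p = 0.152
confint(model1, "Sepal.Width", level = 0.95)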
# Question 2 – Simple Linear Regression
# Load cars dataset
data(cars)
# a. Fit simple linear regression
model2 <- lm(dist ~ speed, data = cars)
summary(model2)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
# b. Residual plot
plot(model2, which = 1)
Interpretation: The slope of 3.932 indicates that for each 1 mph increase in speed, stopping distance increases by approximately 3.93 feet. The residual plot shows a slight funnel shape, suggesting some heteroscedasticity, but the linear trend appears reasonable overall.
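To show how the fitted slope translates into predictions, here is a minimal sketch (speed = 20 mph is an arbitrary illustrative value, not part of the assignment):
# Predicted stopping distance at 20 mph with a 95% prediction interval;
# the point estimate is roughly -17.58 + 3.93 * 20, i.e. about 61 ft
predict(model2, newdata = data.frame(speed = 20), interval = "prediction")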
# Question 3 – Multiple Linear Regression
# Using mtcars as "own" dataset
data(mtcars)
# a. Fit multiple linear regression with 3 predictors
model3 <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model3)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
# b. Diagnostic plots
par(mfrow = c(1, 2))
plot(model3, which = 1) # Residuals vs Fitted
plot(model3, which = 2) # Normal Q-Q plot
# c. VIF for multicollinearity
library(car)
## Loading required package: carData
vif(model3)
## wt hp cyl
## 2.580486 3.258481 4.757456
Interpretation:
Significant predictors: Only weight (wt) is statistically significant at the 5% level (p < 0.001); horsepower (hp, p = 0.140) and cylinders (cyl, p = 0.098) are not
Diagnostic plots: Residuals vs Fitted shows relatively random scatter; Q-Q plot shows approximate normality with slight deviations at tails
Multicollinearity: All VIF values are below 5, indicating no severe multicollinearity concerns
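As a sanity check on the vif() output, each VIF can be reproduced by regressing one predictor on the others; a minimal sketch for wt (not part of the required answer):
# VIF(wt) = 1 / (1 - R^2) from regressing wt on the remaining predictors;
# this should reproduce the value reported above (about 2.58)
r2_wt <- summary(lm(wt ~ hp + cyl, data = mtcars))$r.squared
1 / (1 - r2_wt)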
# Question 4 – Logistic Regression
# Using mtcars dataset
data(mtcars)
# a. Fit logistic regression
model4 <- glm(vs ~ wt + disp, data = mtcars, family = binomial)
summary(model4)
##
## Call:
## glm(formula = vs ~ wt + disp, family = binomial, data = mtcars)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.60859 2.43903 0.660 0.510
## wt 1.62635 1.49068 1.091 0.275
## disp -0.03443 0.01536 -2.241 0.025 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.86 on 31 degrees of freedom
## Residual deviance: 21.40 on 29 degrees of freedom
## AIC: 27.4
##
## Number of Fisher Scoring iterations: 6
# b. Odds ratios
exp(coef(model4))
## (Intercept) wt disp
## 4.9957752 5.0852960 0.9661524
# c. Predict probability for specific values
new_data <- data.frame(wt = 3.0, disp = 200)
predict(model4, new_data, type = "response")
## 1
## 0.4015303
Interpretation:
Odds ratio for disp: 0.966, meaning for each 1 cubic inch increase in displacement, the odds of having a straight engine (vs = 1) decrease by about 3.4%, holding weight constant.
Predicted probability: A car with weight=3.0 and displacement=200 has approximately 40.2% probability of having a straight engine.
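The predicted probability can also be reproduced by hand from the fitted coefficients via the inverse logit link; a minimal sketch:
# Linear predictor at wt = 3.0, disp = 200, then apply the inverse logit;
# this should match the predict() result of about 0.40
eta <- sum(coef(model4) * c(1, 3.0, 200))
plogis(eta)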
# Question 5 – Assumptions of Linear Regression
Three Key Assumptions:
Linearity: The relationship between the predictors and the response variable should be linear. This matters because violations can lead to biased estimates and poor predictions.
Independence of errors: Residuals should be independent of one another. Violations such as autocorrelation distort standard errors, inflating Type I error rates, and make the estimates less efficient.
Homoscedasticity: Constant variance of residuals across all predicted values. Heteroscedasticity can lead to inefficient estimates and incorrect standard errors.
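These assumptions can be checked empirically; a minimal sketch applied to model2 from Question 2 (the particular tests are illustrative choices, not required by the question):
library(lmtest)
dwtest(model2)           # Durbin-Watson test for autocorrelation of residuals
bptest(model2)           # Breusch-Pagan test for non-constant variance
plot(model2, which = 1)  # Residuals vs Fitted to eyeball linearity and spread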
# Question 6 – Models for Count Data
# Load quakes dataset
data(quakes)
# a. Poisson regression
model6_pois <- glm(stations ~ mag, data = quakes, family = poisson)
summary(model6_pois)
##
## Call:
## glm(formula = stations ~ mag, family = poisson, data = quakes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.96624 0.05584 -35.22 <2e-16 ***
## mag 1.15849 0.01147 101.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12198 on 999 degrees of freedom
## Residual deviance: 3018 on 998 degrees of freedom
## AIC: 8198.1
##
## Number of Fisher Scoring iterations: 4
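Before the formal test in part (b), a quick rule of thumb flags overdispersion when the residual deviance is much larger than its degrees of freedom; a minimal sketch:
# Ratio of residual deviance to residual df: about 3018 / 998, i.e. roughly 3,
# well above the value of 1 expected under a well-specified Poisson model
deviance(model6_pois) / df.residual(model6_pois)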
# b. Overdispersion test
library(AER)
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
dispersiontest(model6_pois)
##
## Overdispersion test
##
## data: model6_pois
## z = 12.194, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 3.004911
# c. Negative Binomial if overdispersed
library(MASS)
model6_nb <- glm.nb(stations ~ mag, data = quakes)
summary(model6_nb)
##
## Call:
## glm.nb(formula = stations ~ mag, data = quakes, init.theta = 16.46797085,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17447 0.11057 -19.67 <2e-16 ***
## mag 1.20230 0.02351 51.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(16.468) family taken to be 1)
##
## Null deviance: 3844.4 on 999 degrees of freedom
## Residual deviance: 994.9 on 998 degrees of freedom
## AIC: 7213.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 16.47
## Std. Err.: 1.14
##
## 2 x log-likelihood: -7207.871
# Compare AIC
AIC(model6_pois, model6_nb)
## df AIC
## model6_pois 2 8198.106
## model6_nb 3 7213.871
Interpretation:
Coefficient sign: Positive (1.202), indicating that higher-magnitude earthquakes are reported by more stations (see the rate-ratio sketch after this list)
Overdispersion: The dispersion test estimates a dispersion of roughly 3.0 (p < 2.2e-16), so the Poisson assumption that the variance equals the mean does not hold; the Negative Binomial model absorbs this extra variation through its estimated theta of 16.47
Model comparison: Negative Binomial has lower AIC (7213.9 vs 8198.1), confirming it’s a better fit
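To read the Negative Binomial slope on the original count scale, exponentiating the coefficients gives rate ratios; a minimal sketch (not part of the required output):
# exp(1.20230) is roughly 3.3: each 1-unit increase in magnitude multiplies the
# expected number of reporting stations by about 3.3
exp(coef(model6_nb))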
# Question 7 – Non-Linear & Inverse Gaussian Models
# Load CO2 dataset and filter for Plant Qn1
data(CO2)
plant_Qn1 <- CO2[CO2$Plant == "Qn1", ]
# i) Fit quadratic model
plant_Qn1$conc_sq <- plant_Qn1$conc^2
model7 <- lm(uptake ~ conc + conc_sq, data = plant_Qn1)
summary(model7)
##
## Call:
## lm(formula = uptake ~ conc + conc_sq, data = plant_Qn1)
##
## Residuals:
## 1 2 3 4 5 6 7
## -6.051 3.751 4.396 2.626 -3.735 -2.321 1.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.580e+01 5.723e+00 2.760 0.0509 .
## conc 7.039e-02 2.645e-02 2.661 0.0563 .
## conc_sq -4.782e-05 2.365e-05 -2.022 0.1133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.95 on 4 degrees of freedom
## Multiple R-squared: 0.7579, Adjusted R-squared: 0.6369
## F-statistic: 6.261 on 2 and 4 DF, p-value: 0.05861
# Plot the relationship
plot(plant_Qn1$conc, plant_Qn1$uptake, main = "Quadratic Model Fit",
xlab = "CO2 Concentration", ylab = "Uptake")
curve(coef(model7)[1] + coef(model7)[2]*x + coef(model7)[3]*x^2,
add = TRUE, col = "red")
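An equivalent and slightly more idiomatic way to fit the same quadratic is poly() with raw = TRUE, which avoids creating the squared column by hand; a minimal sketch (coefficients should match model7 up to naming):
model7_poly <- lm(uptake ~ poly(conc, 2, raw = TRUE), data = plant_Qn1)
coef(model7_poly)  # same intercept, linear and quadratic terms as model7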