# Question 1 – Introduction to Statistical Modelling
# Load the iris dataset
data(iris)
# b. Fit the model and interpret slope
model1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
summary(model1)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5561 -0.6333 -0.1120 0.5579 2.2226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5262 0.4789 13.63 <2e-16 ***
## Sepal.Width -0.2234 0.1551 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
Interpretation: The slope estimate of -0.2234 suggests that each 1 cm increase in Sepal Width is associated with a decrease of about 0.22 cm in Sepal Length. However, the slope is not statistically significant (p = 0.152) and the model explains only about 1.4% of the variance, so there is little evidence of a linear relationship between the two variables.
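Because the p-value is well above 0.05, a confidence interval for the slope is a useful companion check; a minimal sketch (not part of the required output):
# 95% CI for the Sepal.Width slope; it should span 0, consistent with p = 0.152
confint(model1, "Sepal.Width", level = 0.95)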
# Question 2 – Simple Linear Regression
# Load cars dataset
data(cars)
# a. Fit simple linear regression
model2 <- lm(dist ~ speed, data = cars)
summary(model2)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
# b. Residual plot
plot(model2, which = 1)
Interpretation: The slope of 3.932 indicates that for each 1 mph increase in speed, stopping distance increases by approximately 3.93 feet. The residual plot shows a slight funnel shape, suggesting some heteroscedasticity, but the linear trend appears reasonable overall.
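To show how the fitted slope translates into predictions, here is a minimal sketch (speed = 20 mph is an arbitrary illustrative value, not part of the assignment):
# Predicted stopping distance at 20 mph with a 95% prediction interval;
# the point estimate is roughly -17.58 + 3.93 * 20, i.e. about 61 ft
predict(model2, newdata = data.frame(speed = 20), interval = "prediction")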
# Question 3 – Multiple Linear Regression
# Using mtcars as "own" dataset
data(mtcars)
# a. Fit multiple linear regression with 3 predictors
model3 <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model3)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
# b. Diagnostic plots
par(mfrow = c(1, 2))
plot(model3, which = 1) # Residuals vs Fitted
plot(model3, which = 2) # Normal Q-Q plot
# c. VIF for multicollinearity
library(car)
## Loading required package: carData
vif(model3)
## wt hp cyl
## 2.580486 3.258481 4.757456
Interpretation:
Significant predictors: Only weight (wt) is statistically significant at the 5% level (p < 0.001); horsepower (hp, p = 0.140) and cylinders (cyl, p = 0.098) are not
Diagnostic plots: Residuals vs Fitted shows relatively random scatter; Q-Q plot shows approximate normality with slight deviations at tails
Multicollinearity: All VIF values are below 5, indicating no severe multicollinearity concerns
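As a sanity check on the vif() output, each VIF can be reproduced by regressing one predictor on the others; a minimal sketch for wt (not part of the required answer):
# VIF(wt) = 1 / (1 - R^2) from regressing wt on the remaining predictors;
# this should reproduce the value reported above (about 2.58)
r2_wt <- summary(lm(wt ~ hp + cyl, data = mtcars))$r.squared
1 / (1 - r2_wt)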
# Question 4 – Logistic Regression
# Using mtcars dataset
data(mtcars)
# a. Fit logistic regression
model4 <- glm(vs ~ wt + disp, data = mtcars, family = binomial)
summary(model4)
##
## Call:
## glm(formula = vs ~ wt + disp, family = binomial, data = mtcars)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.60859 2.43903 0.660 0.510
## wt 1.62635 1.49068 1.091 0.275
## disp -0.03443 0.01536 -2.241 0.025 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.86 on 31 degrees of freedom
## Residual deviance: 21.40 on 29 degrees of freedom
## AIC: 27.4
##
## Number of Fisher Scoring iterations: 6
# b. Odds ratios
exp(coef(model4))
## (Intercept) wt disp
## 4.9957752 5.0852960 0.9661524
# c. Predict probability for specific values
new_data <- data.frame(wt = 3.0, disp = 200)
predict(model4, new_data, type = "response")
## 1
## 0.4015303
Interpretation:
Odds ratio for disp: 0.966, meaning for each 1 cubic inch increase in displacement, the odds of having a straight engine (vs = 1) decrease by about 3.4%, holding weight constant.
Predicted probability: A car with weight=3.0 and displacement=200 has approximately 40.2% probability of having a straight engine.
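The predicted probability can also be reproduced by hand from the fitted coefficients via the inverse logit link; a minimal sketch:
# Linear predictor at wt = 3.0, disp = 200, then apply the inverse logit;
# this should match the predict() result of about 0.40
eta <- sum(coef(model4) * c(1, 3.0, 200))
plogis(eta)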
# Question 5 – Assumptions of Linear Regression
Three Key Assumptions:
Linearity: The relationship between the predictors and the response variable should be linear. This matters because violations can lead to biased estimates and poor predictions.
Independence of errors: Residuals should be independent of one another. Violations such as autocorrelation distort standard errors, inflating Type I error rates, and make the estimates less efficient.
Homoscedasticity: Constant variance of residuals across all predicted values. Heteroscedasticity can lead to inefficient estimates and incorrect standard errors.
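These assumptions can be checked empirically; a minimal sketch applied to model2 from Question 2 (the particular tests are illustrative choices, not required by the question):
library(lmtest)
dwtest(model2)           # Durbin-Watson test for autocorrelation of residuals
bptest(model2)           # Breusch-Pagan test for non-constant variance
plot(model2, which = 1)  # Residuals vs Fitted to eyeball linearity and spread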
# Question 6 – Models for Count Data
# Load quakes dataset
data(quakes)
# a. Poisson regression
model6_pois <- glm(stations ~ mag, data = quakes, family = poisson)
summary(model6_pois)
##
## Call:
## glm(formula = stations ~ mag, family = poisson, data = quakes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.96624 0.05584 -35.22 <2e-16 ***
## mag 1.15849 0.01147 101.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12198 on 999 degrees of freedom
## Residual deviance: 3018 on 998 degrees of freedom
## AIC: 8198.1
##
## Number of Fisher Scoring iterations: 4
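Before the formal test in part (b), a quick rule of thumb flags overdispersion when the residual deviance is much larger than its degrees of freedom; a minimal sketch:
# Ratio of residual deviance to residual df: about 3018 / 998, i.e. roughly 3,
# well above the value of 1 expected under a well-specified Poisson model
deviance(model6_pois) / df.residual(model6_pois)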
# b. Overdispersion test
library(AER)
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
dispersiontest(model6_pois)
##
## Overdispersion test
##
## data: model6_pois
## z = 12.194, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 3.004911
# c. Negative Binomial if overdispersed
library(MASS)
model6_nb <- glm.nb(stations ~ mag, data = quakes)
summary(model6_nb)
##
## Call:
## glm.nb(formula = stations ~ mag, data = quakes, init.theta = 16.46797085,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17447 0.11057 -19.67 <2e-16 ***
## mag 1.20230 0.02351 51.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(16.468) family taken to be 1)
##
## Null deviance: 3844.4 on 999 degrees of freedom
## Residual deviance: 994.9 on 998 degrees of freedom
## AIC: 7213.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 16.47
## Std. Err.: 1.14
##
## 2 x log-likelihood: -7207.871
# Compare AIC
AIC(model6_pois, model6_nb)
## df AIC
## model6_pois 2 8198.106
## model6_nb 3 7213.871
Interpretation:
Coefficient sign: Positive (1.202), indicating that higher-magnitude earthquakes are reported by more stations (see the rate-ratio sketch after this list)
Overdispersion: The dispersion test estimates a dispersion of roughly 3.0 (p < 2.2e-16), so the Poisson assumption that the variance equals the mean does not hold; the Negative Binomial model absorbs this extra variation through its estimated theta of 16.47
Model comparison: Negative Binomial has lower AIC (7213.9 vs 8198.1), confirming it’s a better fit
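To read the Negative Binomial slope on the original count scale, exponentiating the coefficients gives rate ratios; a minimal sketch (not part of the required output):
# exp(1.20230) is roughly 3.3: each 1-unit increase in magnitude multiplies the
# expected number of reporting stations by about 3.3
exp(coef(model6_nb))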
# Question 7 – Non-Linear & Inverse Gaussian Models
# Load CO2 dataset and filter for Plant Qn1
data(CO2)
plant_Qn1 <- CO2[CO2$Plant == "Qn1", ]
# i) Fit quadratic model
plant_Qn1$conc_sq <- plant_Qn1$conc^2
model7 <- lm(uptake ~ conc + conc_sq, data = plant_Qn1)
summary(model7)
##
## Call:
## lm(formula = uptake ~ conc + conc_sq, data = plant_Qn1)
##
## Residuals:
## 1 2 3 4 5 6 7
## -6.051 3.751 4.396 2.626 -3.735 -2.321 1.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.580e+01 5.723e+00 2.760 0.0509 .
## conc 7.039e-02 2.645e-02 2.661 0.0563 .
## conc_sq -4.782e-05 2.365e-05 -2.022 0.1133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.95 on 4 degrees of freedom
## Multiple R-squared: 0.7579, Adjusted R-squared: 0.6369
## F-statistic: 6.261 on 2 and 4 DF, p-value: 0.05861
# Plot the relationship
plot(plant_Qn1$conc, plant_Qn1$uptake, main = "Quadratic Model Fit",
xlab = "CO2 Concentration", ylab = "Uptake")
curve(coef(model7)[1] + coef(model7)[2]*x + coef(model7)[3]*x^2,
add = TRUE, col = "red")
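An equivalent and slightly more idiomatic way to fit the same quadratic is poly() with raw = TRUE, which avoids creating the squared column by hand; a minimal sketch (coefficients should match model7 up to naming):
model7_poly <- lm(uptake ~ poly(conc, 2, raw = TRUE), data = plant_Qn1)
coef(model7_poly)  # same intercept, linear and quadratic terms as model7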