All the code provided inside of the file (see show/hide toggle in the top right corner). Answers are below the code and are in bold.
The identifying assumptions of unbiasedness of OLS is (1) linearity, (2) full-rank, while key assumption is (3) exogeneity (\(\mathbb{E}[\varepsilon \mid X] = 0\)), meaning that the error term is uncorrelated with the \(X\) variables, which insures that coefficients are unbiased and efficient.
\(\hat\beta_1\) is the expected change in \(y\) for a one-unit change in \(x_1\), holding all other variables constant.
1.c.i Heteroskedasticity affects OLS efficiency and standard errors, but not unbiasedness, because: (1) OLS assumes constant variance of the error term in order to provide BLUE coefficients and accurate standard errors. When this assumption is violated, OLS remains unbiased (under the other assumptions) but is no longer efficient, and the conventional standard errors are incorrect, leading to unreliable inference. When heterodeskedasticity is present, or, simply put, some observations are noiser than others, OLS still treats them as equally informative, which leads to inefficient estimates and incorrect standard errors; (2) OLS remains unbiased because the expected value of the error term is still zero (assumption required for unbiased coefficients as discussed above), regardless of the variance.
1.c.ii If nothing is done to address heteroskedasticity, the OLS estimates will still be unbiased, but the standard errors and confidence intervals of the OLS estimates will be incorrect, which can lead to unreliable hypothesis testing and confidence intervals. So in practice we might find that some coefficients are statistically significant, while they are actually not, due to smaller standard errors, or vice versa.
data(mtcars)
print(mtcars$mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
hist(mtcars$mpg, main="Histogram of Miles Per Gallon", xlab="Miles Per Gallon")
m1 <- lm(mpg ~ am + carb + cyl + disp + drat + gear + hp + qsec + vs + wt, data=mtcars)
summary(m1)
##
## Call:
## lm(formula = mpg ~ am + carb + cyl + disp + drat + gear + hp +
## qsec + vs + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## am 2.52023 2.05665 1.225 0.2340
## carb -0.19942 0.82875 -0.241 0.8122
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## drat 0.78711 1.63537 0.481 0.6353
## gear 0.65541 1.49326 0.439 0.6652
## hp -0.02148 0.02177 -0.987 0.3350
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## wt -3.71530 1.89441 -1.961 0.0633 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
m2 <- lm(mpg ~ wt + hp + cyl, data=mtcars)
summary(m2)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
Answer: No, adding other variables does not improve the fit of the mode: adjusted \(R^2\), which penalized model for including additional regressors, is higher in the second model, and the added variables are not statistically significant.
library(quantreg)
taus <- c(0.25, 0.50, 0.75)
q_reg <- rq(mpg ~ wt + hp + cyl, data = mtcars, tau = taus)
summary(q_reg)
##
## Call: rq(formula = mpg ~ wt + hp + cyl, tau = taus, data = mtcars)
##
## tau: [1] 0.25
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 34.05625 32.57528 40.14781
## wt -2.58973 -5.20642 -2.26947
## hp -0.01127 -0.02644 -0.00195
## cyl -0.96877 -1.61836 0.00169
##
## Call: rq(formula = mpg ~ wt + hp + cyl, tau = taus, data = mtcars)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 38.33674 33.01808 41.45609
## wt -3.01449 -4.71262 -1.76240
## hp -0.01475 -0.05426 -0.00249
## cyl -1.13592 -1.94275 0.48042
##
## Call: rq(formula = mpg ~ wt + hp + cyl, tau = taus, data = mtcars)
##
## tau: [1] 0.75
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 40.18791 36.72762 45.02152
## wt -3.77914 -4.19886 -0.95680
## hp -0.02005 -0.04158 0.08590
## cyl -0.62227 -3.40813 -0.27690
t_ols <- tbl_regression(m2)
t_qreg <- tbl_regression(q_reg)
tbl_merge(
list(t_ols, t_qreg),
tab_spanner = c("**OLS**", "**Quantile Regressions**"))
| Characteristic |
OLS
|
Quantile Regressions
|
||||
|---|---|---|---|---|---|---|
| Beta | 95% CI | p-value | Group | Beta | 95% CI | |
| wt | -3.2 | -4.7, -1.6 | <0.001 | 0.25 | -2.6 | -8.7, -2.1 |
| wt | -3.2 | -4.7, -1.6 | <0.001 | 0.5 | -3.0 | -6.8, -1.4 |
| wt | -3.2 | -4.7, -1.6 | <0.001 | 0.75 | -3.8 | -4.4, -0.82 |
| hp | -0.02 | -0.04, 0.01 | 0.14 | 0.25 | -0.01 | -0.08, 0.00 |
| hp | -0.02 | -0.04, 0.01 | 0.14 | 0.5 | -0.01 | -0.06, 0.00 |
| hp | -0.02 | -0.04, 0.01 | 0.14 | 0.75 | -0.02 | -0.05, 0.11 |
| cyl | -0.94 | -2.1, 0.19 | 0.10 | 0.25 | -0.97 | -1.9, 0.55 |
| cyl | -0.94 | -2.1, 0.19 | 0.10 | 0.5 | -1.1 | -2.1, 0.82 |
| cyl | -0.94 | -2.1, 0.19 | 0.10 | 0.75 | -0.62 | -3.5, 0.13 |
| Abbreviation: CI = Confidence Interval | ||||||
Answer: Some of the coefficients are different. For \(wt\) variable, the coefficient is slightly different for the all 25th, 50th and 75th quantile regressions, which suggests that the effect of weight on miles per gallon is different for different classes of cars based on miler per gallon. For \(hp\) variable, the coefficients are almost identical. For \(cyl\) variable, the coefficient is slightly more negative for the 25th quantile regression, which suggests that the effect of number of cylinders on miles per gallon is stronger for cars with lower miles per gallon, and is even more negative for the 20th quantile regression, while for the 75th quantile regression the coefficient is less negative, which suggests that the effect of number of cylinders on miles per gallon is weaker for cars with higher miles per gallon.
mtcars$hp2wt <- mtcars$hp / mtcars$wt
q_reg2 <- rq(mpg ~ hp2wt, data = mtcars, tau=c(0.1,0.5,0.9))
summary(q_reg2)
##
## Call: rq(formula = mpg ~ hp2wt, tau = c(0.1, 0.5, 0.9), data = mtcars)
##
## tau: [1] 0.1
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 17.60032 6.47805 29.85810
## hp2wt -0.06740 -0.75642 0.07483
##
## Call: rq(formula = mpg ~ hp2wt, tau = c(0.1, 0.5, 0.9), data = mtcars)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 25.30497 21.98778 29.40190
## hp2wt -0.11413 -0.22116 -0.08743
##
## Call: rq(formula = mpg ~ hp2wt, tau = c(0.1, 0.5, 0.9), data = mtcars)
##
## tau: [1] 0.9
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 30.40000 24.88711 41.59796
## hp2wt 0.00000 -0.26425 0.26736
plot(summary(q_reg2))
Answer: As questions asks about whether effects is different for more or less fuel-efficient cars, I ran quantile regression, whuch shows that effect of the \(hp2wt\) variable is negative for the 10th quantile regression, and even more negative for the 50th quantile regression, while for the 90th quantile regression the coefficient is zero, which suggests that the effect of \(hp2wt\) variable is different for more or less fuel-efficient cars.
url <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/MASS/Insurance.csv"
file_name <- "insurance.csv"
download.file(url, paste(file_name, sep = ""), mode = "wb")
insurance <- read.csv(file_name)
url <- "https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip"
file_name <- "temp_data.zip"
download.file(url, paste(file_name, sep = ""), mode = "wb")
unzip("temp_data.zip", exdir = "extracted_data")
bikeshare <- read.csv("extracted_data/hour.csv")
bikeshare$season <- relevel(as.factor(bikeshare$season), ref = "4")
bikeshare$hr <- as.factor(bikeshare$hr)
bikeshare$mnth <- as.factor(bikeshare$mnth)
bikeshare$weekday <- as.factor(bikeshare$weekday)
bikeshare$weathersit <- as.factor(bikeshare$weathersit)
p1_bike <- glm(cnt ~ atemp + holiday + hr + hum + mnth + season + temp + weathersit + weekday + windspeed + yr,
family=poisson(link="log"),
data=bikeshare)
summary(p1_bike)
##
## Call:
## glm(formula = cnt ~ atemp + holiday + hr + hum + mnth + season +
## temp + weathersit + weekday + windspeed + yr, family = poisson(link = "log"),
## data = bikeshare)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.375328 0.007992 422.327 < 2e-16 ***
## atemp 0.946850 0.020326 46.584 < 2e-16 ***
## holiday -0.160979 0.003797 -42.398 < 2e-16 ***
## hr1 -0.466697 0.008182 -57.038 < 2e-16 ***
## hr2 -0.839682 0.009313 -90.160 < 2e-16 ***
## hr3 -1.507858 0.012163 -123.967 < 2e-16 ***
## hr4 -2.110448 0.015858 -133.084 < 2e-16 ***
## hr5 -0.956562 0.009787 -97.738 < 2e-16 ***
## hr6 0.400501 0.006619 60.509 < 2e-16 ***
## hr7 1.422874 0.005666 251.117 < 2e-16 ***
## hr8 1.916567 0.005423 353.411 < 2e-16 ***
## hr9 1.391883 0.005648 246.429 < 2e-16 ***
## hr10 1.123194 0.005806 193.438 < 2e-16 ***
## hr11 1.269596 0.005717 222.071 < 2e-16 ***
## hr12 1.447484 0.005642 256.544 < 2e-16 ***
## hr13 1.427091 0.005663 251.991 < 2e-16 ***
## hr14 1.364773 0.005707 239.121 < 2e-16 ***
## hr15 1.405023 0.005693 246.794 < 2e-16 ***
## hr16 1.628120 0.005593 291.123 < 2e-16 ***
## hr17 2.036233 0.005445 373.972 < 2e-16 ***
## hr18 1.970299 0.005443 362.018 < 2e-16 ***
## hr19 1.674864 0.005518 303.540 < 2e-16 ***
## hr20 1.377556 0.005648 243.882 < 2e-16 ***
## hr21 1.121995 0.005800 193.434 < 2e-16 ***
## hr22 0.864956 0.006005 144.039 < 2e-16 ***
## hr23 0.483910 0.006419 75.382 < 2e-16 ***
## hum -0.205716 0.004129 -49.824 < 2e-16 ***
## mnth2 0.113516 0.003776 30.060 < 2e-16 ***
## mnth3 0.223665 0.003937 56.817 < 2e-16 ***
## mnth4 0.181289 0.005235 34.629 < 2e-16 ***
## mnth5 0.244655 0.005477 44.668 < 2e-16 ***
## mnth6 0.196362 0.005585 35.158 < 2e-16 ***
## mnth7 0.098808 0.006064 16.294 < 2e-16 ***
## mnth8 0.195102 0.005898 33.078 < 2e-16 ***
## mnth9 0.270869 0.005427 49.914 < 2e-16 ***
## mnth10 0.187711 0.005395 34.794 < 2e-16 ***
## mnth11 0.061117 0.005303 11.524 < 2e-16 ***
## mnth12 0.045360 0.004676 9.700 < 2e-16 ***
## season1 -0.457991 0.004081 -112.212 < 2e-16 ***
## season2 -0.183915 0.004085 -45.020 < 2e-16 ***
## season3 -0.190763 0.003348 -56.982 < 2e-16 ***
## temp 0.164396 0.019469 8.444 < 2e-16 ***
## weathersit2 -0.064256 0.001422 -45.175 < 2e-16 ***
## weathersit3 -0.492963 0.002864 -172.124 < 2e-16 ***
## weathersit4 -0.469206 0.067067 -6.996 2.63e-12 ***
## weekday1 0.051206 0.002167 23.626 < 2e-16 ***
## weekday2 0.060927 0.002103 28.971 < 2e-16 ***
## weekday3 0.066408 0.002103 31.582 < 2e-16 ***
## weekday4 0.067338 0.002089 32.228 < 2e-16 ***
## weekday5 0.093581 0.002089 44.798 < 2e-16 ***
## weekday6 0.079608 0.002091 38.063 < 2e-16 ***
## windspeed -0.109958 0.004869 -22.581 < 2e-16 ***
## yr 0.468566 0.001151 407.080 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2891591 on 17378 degrees of freedom
## Residual deviance: 572011 on 17326 degrees of freedom
## AIC: 683018
##
## Number of Fisher Scoring iterations: 5
insurance$District <- as.factor(insurance$District)
insurance$log_holder <- log(insurance$Holders)
p1_insur <- glm(Claims ~ Age + District + Group,
family=poisson(link="log"),
data=insurance,
offset = log_holder)
summary(p1_insur)
##
## Call:
## glm(formula = Claims ~ Age + District + Group, family = poisson(link = "log"),
## data = insurance, offset = log_holder)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.82174 0.07679 -23.724 < 2e-16 ***
## Age>35 -0.53667 0.06996 -7.672 1.70e-14 ***
## Age25-29 -0.19101 0.08286 -2.305 0.021149 *
## Age30-35 -0.34495 0.08137 -4.239 2.24e-05 ***
## District2 0.02587 0.04302 0.601 0.547597
## District3 0.03852 0.05051 0.763 0.445657
## District4 0.23421 0.06167 3.798 0.000146 ***
## Group>2l 0.56341 0.07232 7.791 6.65e-15 ***
## Group1-1.5l 0.16134 0.05053 3.193 0.001409 **
## Group1.5-2l 0.39281 0.05500 7.142 9.18e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 236.26 on 63 degrees of freedom
## Residual deviance: 51.42 on 54 degrees of freedom
## AIC: 388.74
##
## Number of Fisher Scoring iterations: 4
Answer: The model B (insurance claims) will require offset term defined as \(log(Holders)\), because we are modelling rate of claims, so we need to account for the number of policyholders, which is the exposure for the claims. The model A does not need offset as it is already presented as a number of bikes per hours, so it is already a rate.
# FULL MODEL
cf <- exp(coef(p1_bike))
season_type <- cf[grep("season", names(cf))]
print(season_type) # Baseline season 4 (winter)
## season1 season2 season3
## 0.6325530 0.8320064 0.8263283
# ONLY SEASON AND YEAR MODEL
p2_bike <- glm(cnt ~ season + yr, family=poisson(link="log"),, data = bikeshare)
summary(p2_bike)
##
## Call:
## glm(formula = cnt ~ season + yr, family = poisson(link = "log"),
## data = bikeshare)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.017141 0.001296 3870.23 <2e-16 ***
## season1 -0.590186 0.001819 -324.39 <2e-16 ***
## season2 0.044316 0.001509 29.37 <2e-16 ***
## season3 0.168329 0.001460 115.32 <2e-16 ***
## yr 0.494594 0.001137 435.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2891591 on 17378 degrees of freedom
## Residual deviance: 2477112 on 17374 degrees of freedom
## AIC: 2588022
##
## Number of Fisher Scoring iterations: 5
cf <- exp(coef(p2_bike))
season_type <- cf[grep("season", names(cf))]
print(season_type) # Baseline season 4 (winter)
## season1 season2 season3
## 0.5542242 1.0453131 1.1833257
Answer: In the full model, the rate ratio for spring (season1), summer (season2), and fall(season3) are lower by 37%, 17% and 17% respectively compared to the baseline season, which is winter, which contradicts conventional judgment, a we would expect bike rentals to be higher in warmer seasons. It is important to note that it does not mean that they are lower in absolute terms, but as an count rate per hour.
However, if we run the poisson regression with only season and year, the results are different: rate ratio for spring is lower by 45% compared to winter, while ratio for summer and fall is higher by 4$ and 18% respectively. This is explained by the fact that in the full model we control for other variables, such as weather conditions and air temperature, which are likely to be correlated with season, and which have a strong effect on bike rentals. So when we control for weather related variables, they absorb most of the effect of season, which leads to the counter intuitive results.
cf <- exp(coef(p1_insur))
print(unique(insurance$Age))
## [1] "<25" "25-29" "30-35" ">35"
group_type <- cf[grep("Age", names(cf))]
print(group_type) # Baseline is <25
## Age>35 Age25-29 Age30-35
## 0.5846916 0.8261242 0.7082553
Answer: Based on the model results, the rate ratio for claims is higher for the age group <25 (baseline) is highest, meaning that this group is the riskiest.
Answer: Offset coefficient is fixed to 1 as we already now the exposure (policyholders), so the model needs to adjust count of claims per policyholders There is not need to estimate the coefficient for the exposure as the exposure is given.
residuals.from.Poisson <- residuals(p1_bike)
predicted.from.Poisson <- predict(p1_bike)
plot(predicted.from.Poisson,residuals.from.Poisson,col="red",pch=20)
performance::check_overdispersion(p1_bike)
## # Overdispersion test
##
## dispersion ratio = 32.256
## Pearson's Chi-Squared = 558864.469
## p-value = < 0.001
residuals.from.Poisson <- residuals(p1_insur)
predicted.from.Poisson <- predict(p1_insur)
plot(predicted.from.Poisson,residuals.from.Poisson,col="blue",pch=20)
performance::check_overdispersion(p1_insur)
## # Overdispersion test
##
## dispersion ratio = 0.901
## Pearson's Chi-Squared = 48.629
## p-value = 0.681
Answer: For the bike share model (A), the Poisson assumption is not valid, which is tested through overdispersion test. Thus, Negative Binomial model would be more appropriate. For the insurance model (B), the Poisson assumption is valid based on the results of overdispersion test, therefore there is no need to use Negative Binomial model.
Overdispersion can lead to misleading inference under Poison MLE as it underestimates standard errors, which can lead to finding significant effects when they are actually not significant.
Answer: The main difference between ZIP model and Hurdle model in terms of data generation is that for the ZIP model, data is assumed to come from a mixture of two processes: processes that generates structural zeros, when the even cannot happen, and another that generates sampling zeroes (event can happen but did not) and count observations according to a certain distribution. Thus zeroes in data comes from two different processes and are included at both steps of the model. In contrast, the Hurdle model assumed that both zeroes and count data is generated through one processes, even though model has two steps. Frst step of the model only treats zero vs positive observation, and once hurdle is passed, truncated (only positive) count distribution is used to treat only positive count data. Thus, in Hurdle model, zeroes are only generated at the first step.
load("DebTrivedi2.rda")
zip_model <- zeroinfl(ofp ~ hosp + health + numchron + gender + school + privins, data = DebTrivedi, dist = "poisson")
summary(zip_model)
##
## Call:
## zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school +
## privins, data = DebTrivedi, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -5.4092 -1.1579 -0.4769 0.5435 25.0380
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.405812 0.024175 58.152 < 2e-16 ***
## hosp 0.159011 0.006060 26.239 < 2e-16 ***
## healthexcellent -0.304134 0.031151 -9.763 < 2e-16 ***
## healthpoor 0.253454 0.017705 14.315 < 2e-16 ***
## numchron 0.101836 0.004721 21.571 < 2e-16 ***
## gendermale -0.062332 0.013054 -4.775 1.80e-06 ***
## school 0.019144 0.001873 10.221 < 2e-16 ***
## privinsyes 0.080557 0.017145 4.699 2.62e-06 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.08102 0.14233 -0.569 0.569219
## hosp -0.30330 0.09158 -3.312 0.000927 ***
## healthexcellent 0.23786 0.14990 1.587 0.112550
## healthpoor 0.02166 0.16170 0.134 0.893431
## numchron -0.53117 0.04601 -11.545 < 2e-16 ***
## gendermale 0.41527 0.08919 4.656 3.22e-06 ***
## school -0.05677 0.01223 -4.640 3.49e-06 ***
## privinsyes -0.75294 0.10257 -7.341 2.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 24
## Log-likelihood: -1.613e+04 on 16 Df
hurdle_model <- hurdle(ofp ~ hosp + health + numchron + gender + school + privins, data = DebTrivedi)
summary(hurdle_model)
##
## Call:
## hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins,
## data = DebTrivedi)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -5.4144 -1.1565 -0.4770 0.5432 25.0228
##
## Count model coefficients (truncated poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.406459 0.024180 58.167 < 2e-16 ***
## hosp 0.158967 0.006061 26.228 < 2e-16 ***
## healthexcellent -0.303677 0.031150 -9.749 < 2e-16 ***
## healthpoor 0.253521 0.017708 14.317 < 2e-16 ***
## numchron 0.101720 0.004719 21.557 < 2e-16 ***
## gendermale -0.062247 0.013055 -4.768 1.86e-06 ***
## school 0.019078 0.001872 10.194 < 2e-16 ***
## privinsyes 0.080879 0.017139 4.719 2.37e-06 ***
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.043147 0.139852 0.309 0.757688
## hosp 0.312449 0.091437 3.417 0.000633 ***
## healthexcellent -0.289570 0.142682 -2.029 0.042409 *
## healthpoor -0.008716 0.161024 -0.054 0.956833
## numchron 0.535213 0.045378 11.794 < 2e-16 ***
## gendermale -0.415658 0.087608 -4.745 2.09e-06 ***
## school 0.058541 0.011989 4.883 1.05e-06 ***
## privinsyes 0.747120 0.100880 7.406 1.30e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 14
## Log-likelihood: -1.613e+04 on 16 Df
AIC(zip_model, hurdle_model)
## df AIC
## zip_model 16 32300.06
## hurdle_model 16 32300.90
BIC(zip_model, hurdle_model)
## df BIC
## zip_model 16 32402.31
## hurdle_model 16 32403.15
vuong(zip_model, hurdle_model)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw 2.190241 model1 > model2 0.014253
## AIC-corrected 2.190241 model1 > model2 0.014253
## BIC-corrected 2.190241 model1 > model2 0.014253
Answer: There is very slight difference between coefficients and their significance for count models. Similarly, the results of zero models could be also interpreted as similar. They oftentimes have different direction but similar strength of coefficients. Similarly, same variables have significant effect on dependent variable: \(hosp, numchron, gendermale, school, privinsyes\). The only major difference is that zero hurdle model estimates variable \(healthexcellent\) has statistically significant effect of reducing odds of visiting physician office by \(exp(-0.2514146) = 25\%\), while ZIP zero model does not. However, it is important to note that for ZIP model, these coefficients should be interpreted as probability of being zero (never visiting the hospital), while for Hurdle model, they should be interpreted as probability of being positive (visit the hospital).
Based on the goodness of fit measures, such as AIC and BIC, ZIP model has a marginal preference. Vuong test also suggests that ZIP model is preferred over Hurdle model, as the test statistic is positive and significant. Overall, based on the results of the coefficient interpretation, goodness of fit measures and Vuong test, ZIP model is preferred over Hurdle model for this data. However, this preference is marginal, as both models fit data really well. Therefore, the choice between the models should be driven by conceptual considerations.
temp <- read.csv("/Users/artyom/Documents/study/rutgers/S26-AdvancedQuantMethods/HW_1/energytemp.csv")
summary(temp)
## day temperature energy_kwh month
## Min. : 1.00 Min. :20.00 Min. :11.05 Length:60
## 1st Qu.:15.75 1st Qu.:38.75 1st Qu.:18.96 Class :character
## Median :30.50 Median :57.50 Median :30.72 Mode :character
## Mean :30.50 Mean :57.50 Mean :35.49
## 3rd Qu.:45.25 3rd Qu.:76.25 3rd Qu.:52.37
## Max. :60.00 Max. :95.00 Max. :65.37
## is_extreme temp_squared temp_cubed
## Length:60 Min. : 400 Min. : 8000
## Class :character 1st Qu.:1502 1st Qu.: 58221
## Mode :character Median :3307 Median :190179
## Mean :3791 Mean :273710
## 3rd Qu.:5814 3rd Qu.:443392
## Max. :9025 Max. :857375
ggplot(data=temp, aes(x=temperature, y=energy_kwh)) +
geom_line(color="red")+
geom_point()
Answer: Based on the plot above, there is a clear relationship between energy consumption and temperature, which is non-linear. Initially, as temperature increases, energy consumption also increases. Then, after reaching a certain point (40-60 kWh), temperature suddenly decreases. It continues to decrease as energy consumption rises with minor fluctuations. Based on the visual inception, the energy consumption is on it’s minimum when temperature is around 50 degrees F. There are a few zones of relationship between temperature and energy: (1) 20 - 50F when the energy consumption slowly decreases with the major breakpoint at 50; (2) 50-100F, when energy consumption is slowly increasing with minor fluctuations.
temp_1 <- lm(energy_kwh ~ temperature, data = temp)
summary(temp_1)
##
## Call:
## lm(formula = energy_kwh ~ temperature, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.419 -12.265 4.660 7.855 21.346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.45611 4.51371 14.502 < 2e-16 ***
## temperature -0.52118 0.07331 -7.109 1.92e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.5 on 58 degrees of freedom
## Multiple R-squared: 0.4656, Adjusted R-squared: 0.4564
## F-statistic: 50.54 on 1 and 58 DF, p-value: 1.921e-09
plot(temp_1)
5b.i: The linear model is not appropriate for this data, as the relationship between energy consumption and temperature is non-linear, which was evident based on the plot. However, model is statistically significant, as is evident from F-statistic. As temperature is the only variable, it is also significant. Lastly, \(R^2\) is 0.46, which means that 46% of variation in energy can be explained by the model. Diaognostic plots show that there is a clear pattern in residuals, which confirms that linear model is not appropriate for this data.
5b.ii: The model would probably underpredict energy consumption in the moderate temperatures, as is evident from the Residuals vs Fitted values plot.
temp_2 <- lm(energy_kwh ~ temperature + temp_squared, data = temp)
summary(temp_2)
##
## Call:
## lm(formula = energy_kwh ~ temperature + temp_squared, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2996 -4.6314 -0.0937 5.4153 14.1765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.640136 7.082539 18.163 < 2e-16 ***
## temperature -3.096364 0.269091 -11.507 < 2e-16 ***
## temp_squared 0.022393 0.002306 9.709 1.1e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.741 on 57 degrees of freedom
## Multiple R-squared: 0.7986, Adjusted R-squared: 0.7916
## F-statistic: 113 on 2 and 57 DF, p-value: < 2.2e-16
plot(temp_2)
5c.i: We see improvement of the model fit, as \(R^2\) increased to 0.77, F-statistic is significant, and both temperature and its square are significant. Diagnostics plot show that there is still non-linearity that is not captured in the model.
5c.ii: The estimated coefficient for \(temperature^2\) suggests that there is a certain point at which the relationship between energy consumption and temperature changes direction from negative to positive, implying convex curve pattern.
5c.iii: This value can be calculates as:
\[ -b / (2a) = -(-3.096364) / (2 * 0.022393) = 69.13687313 \] So the point at which energy is minimized is 69 degrees F.
temp$bp1 <- pmax(temp$temperature - 49, 0)
temp$bp2 <- pmax(temp$temperature - 52, 0)
temp_pw <- lm(energy_kwh ~ temperature + bp1 + bp2, data = temp)
summary(temp_pw)
##
## Call:
## lm(formula = energy_kwh ~ temperature + bp1 + bp2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7865 -2.0305 0.1335 2.3909 5.8748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.83128 2.76477 28.513 < 2e-16 ***
## temperature -0.71710 0.07731 -9.275 6.49e-13 ***
## bp1 -10.15021 0.63410 -16.007 < 2e-16 ***
## bp2 11.42505 0.60622 18.846 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.342 on 56 degrees of freedom
## Multiple R-squared: 0.9631, Adjusted R-squared: 0.9611
## F-statistic: 487.5 on 3 and 56 DF, p-value: < 2.2e-16
segmented.mod <- segmented(temp_1, seg.Z = ~temperature, npsi = 2)
summary(segmented.mod)
##
## ***Regression Model with Segmented Relationship(s)***
##
## Call:
## segmented.lm(obj = temp_1, seg.Z = ~temperature, npsi = 2)
##
## Estimated Break-Point(s):
## Est. St.Err
## psi1.temperature 49.205 0.133
## psi2.temperature 50.600 0.128
##
## Coefficients of the linear terms:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.56579 2.38663 32.500 < 2e-16 ***
## temperature -0.67384 0.06816 -9.886 1.03e-13 ***
## U1.temperature -23.02192 3.06733 -7.506 NA
## U2.temperature 24.21985 3.06679 7.897 NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.756 on 54 degrees of freedom
## Multiple R-Squared: 0.9758, Adjusted R-squared: 0.9736
##
## Boot restarting based on 6 samples. Last fit:
## Convergence attained in 10 iterations (rel. change 1.9184e-10)
Answer: breakpoints are relatively close.
Breakpoints represent temperatures at which energy consumption pattern on our data (regression line) radically changes: for breakpoints 1 in segmented relationship (49.205F), the relationship between energy and temperature changes from slightly negative to drastically different, so coefficients changes the slope of the line by -23.02192. In other words, breakpoint essentially means that after this point, one unit increase in temperature will lead to \(-0.67384 + -23.02192 = -23.69576\) change in energy consumption. Similarly, second breakpoint (50.600F) then shift slope by 24.21985, which means that one unit increase in temperature will lead to \(-0.67384 + -23.02192 + 24.21985 = 0.52401\) change in energy consumption.
Heating effect (when people heat homes, so lower temperatures) is slightly stronger that cooling effect (when people cool homes, so lower temperatures), as slope for it is -0.67384, while slope for cooling is 0.52401, which means that in absolute terms heating effect is bigger.
loess_1 <- loess(energy_kwh ~ temperature, data = temp, span = 0.15)
loess_2 <- loess(energy_kwh ~ temperature, data = temp, span = 0.4)
loess_3 <- loess(energy_kwh ~ temperature, data = temp, span = 0.75)
summary(loess_1)
## Call:
## loess(formula = energy_kwh ~ temperature, data = temp, span = 0.15)
##
## Number of Observations: 60
## Equivalent Number of Parameters: 21.77
## Residual Standard Error: 3.274
## Trace of smoother matrix: 24.05 (exact)
##
## Control settings:
## span : 0.15
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
summary(loess_2)
## Call:
## loess(formula = energy_kwh ~ temperature, data = temp, span = 0.4)
##
## Number of Observations: 60
## Equivalent Number of Parameters: 7.6
## Residual Standard Error: 4.19
## Trace of smoother matrix: 8.39 (exact)
##
## Control settings:
## span : 0.4
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
summary(loess_3)
## Call:
## loess(formula = energy_kwh ~ temperature, data = temp, span = 0.75)
##
## Number of Observations: 60
## Equivalent Number of Parameters: 4.37
## Residual Standard Error: 5.288
## Trace of smoother matrix: 4.77 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
ggplot(temp, aes(x = temperature, y = energy_kwh)) +
geom_point(alpha = 0.4) +
geom_smooth(aes(color = "0.15"), method = "loess", span = 0.15, se = FALSE) +
geom_smooth(aes(color = "0.4"), method = "loess", span = 0.4, se = FALSE) +
geom_smooth(aes(color = "0.75"), method = "loess", span = 0.75, se = FALSE) +
scale_color_manual(name = "Span", values = c("red", "blue", "green"))
Answer: The LOESS model with span 0.15 essentially overfits the data, capturing any minor fluctuation in the data and therefore producing a very wiggly curve. The LOESS model with span 0.4 fits data way better, providing balance between capturing the general pattern and not overfitting for fluctuations. The model with span 0.75 is too smooth, and it underfits the data, as it generalizes the pattern between variables too much.
LOESS fitted the curves by essentially performing weighted regression on the localized subset of data (the size of it is defined by span, which means it controls how smooth the line is going to be), fitting many regression lines (linear or polynomial) to produce a smooth curve.
loess_4 <- loess(energy_kwh ~ temperature, data = temp, span = 0.5)
new_data <- data.frame(temperature = 55)
predicted_energy_l <- predict(loess_4, newdata = new_data)
cat("LOESS prediction:", predicted_energy_l, "\n")
## LOESS prediction: 19.04937
new_data$bp1 <- pmax(new_data$temperature - 49, 0)
new_data$bp2 <- pmax(new_data$temperature - 52, 0)
predicted_energy_p <- predict(temp_pw, newdata = new_data)
cat("Piecewise prediction:", predicted_energy_p, "\n")
## Piecewise prediction: 12.76444
predicted_energy_s <- predict(segmented.mod, newdata = new_data)
cat("Segmented prediction:", predicted_energy_s, "\n")
## Segmented prediction: 13.64222
Answer: We see that the predictions from all three model are relatively close, piecewise and segmented being very close to each other, while LOESS prediction is slightly higher than the other two.
Piecewise or segmented would be much safer approaches for extrapolation. 10 degrees F is outside of the range of our data, and as LOESS is not designed for extrapolation but for fitting curve between existing data points, it is likely to produce unreliable predictions as there are no points close to this number to base the prediction on. As we see above, at 55 degrees F, LOESS already overpredicts the energy consumption due to smoother line that is achieved with span = 0.5. Contrary to that, piecewise or segmented would extrapolate based on the line after established breakpoints and therefore would be more reliable.
x0 <- 65
xi <- 60
x <- temp$temperature
span <- 0.3
loess_5 <- loess(energy_kwh ~ temperature, data = temp, span = span)
new_data <- data.frame(temperature = 65)
predicted_energy_l <- predict(loess_4, newdata = new_data)
cat("LOESS prediction:", predicted_energy_l, "\n")
## LOESS prediction: 17.13865
# how many neighbors are used
n_neighbors <- ceiling(span * length(x))
distances <- abs(x - x0)
h <- sort(distances)[n_neighbors]
# calculate distance between points x_0 and x_1
d <- abs(xi - x0)
# scale distance
u <- d / h
# find weight
if (u < 1) {weight <- (1 - u^3)^3} else {weight <- 0}
cat("h:", h, "\n")
## h: 10.9322
cat("Scaled distance:", u, "\n")
## Scaled distance: 0.4573643
cat("Weight:", weight, "\n")
## Weight: 0.7395666
Answer: Using given dataset, the weight that observation at temperature = 60F receives is 0.7395666. Calculations are provided in the code above.