HW_1

All the code provided inside of the file (see show/hide toggle in the top right corner). Answers are below the code and are in bold.

Assigment 1

1.a

The identifying assumptions of unbiasedness of OLS is (1) linearity, (2) full-rank, while key assumption is (3) exogeneity ($\mathbb{E}[\varepsilon \mid X] = 0$), meaning that the error term is uncorrelated with the $X$ variables, which insures that coefficients are unbiased and efficient.

1.b

$\hat\beta_1$ is the expected change in $y$ for a one-unit change in $x_1$, holding all other variables constant.

1.c

1.c.i Heteroskedasticity affects OLS efficiency and standard errors, but not unbiasedness, because: (1) OLS assumes constant variance of the error term in order to provide BLUE coefficients and accurate standard errors. When this assumption is violated, OLS remains unbiased (under the other assumptions) but is no longer efficient, and the conventional standard errors are incorrect, leading to unreliable inference. When heterodeskedasticity is present, or, simply put, some observations are noiser than others, OLS still treats them as equally informative, which leads to inefficient estimates and incorrect standard errors; (2) OLS remains unbiased because the expected value of the error term is still zero (assumption required for unbiased coefficients as discussed above), regardless of the variance.

1.c.ii If nothing is done to address heteroskedasticity, the OLS estimates will still be unbiased, but the standard errors and confidence intervals of the OLS estimates will be incorrect, which can lead to unreliable hypothesis testing and confidence intervals. So in practice we might find that some coefficients are statistically significant, while they are actually not, due to smaller standard errors, or vice versa.

Assigment 2

2a

data(mtcars)
print(mtcars$mpg)

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

hist(mtcars$mpg, main="Histogram of Miles Per Gallon", xlab="Miles Per Gallon")

2b

m1 <- lm(mpg ~ am + carb + cyl + disp + drat + gear + hp + qsec + vs + wt, data=mtcars)
summary(m1)

## 
## Call:
## lm(formula = mpg ~ am + carb + cyl + disp + drat + gear + hp + 
##     qsec + vs + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## am           2.52023    2.05665   1.225   0.2340  
## carb        -0.19942    0.82875  -0.241   0.8122  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## drat         0.78711    1.63537   0.481   0.6353  
## gear         0.65541    1.49326   0.439   0.6652  
## hp          -0.02148    0.02177  -0.987   0.3350  
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

m2 <- lm(mpg ~ wt + hp + cyl, data=mtcars)
summary(m2)

## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## hp          -0.01804    0.01188  -1.519 0.140015    
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

Answer: No, adding other variables does not improve the fit of the mode: adjusted $R^2$, which penalized model for including additional regressors, is higher in the second model, and the added variables are not statistically significant.

2c

library(quantreg)

taus <- c(0.25, 0.50, 0.75)
q_reg <- rq(mpg ~ wt + hp + cyl, data = mtcars, tau = taus)

summary(q_reg)

## 
## Call: rq(formula = mpg ~ wt + hp + cyl, tau = taus, data = mtcars)
## 
## tau: [1] 0.25
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 34.05625     32.57528 40.14781
## wt          -2.58973     -5.20642 -2.26947
## hp          -0.01127     -0.02644 -0.00195
## cyl         -0.96877     -1.61836  0.00169
## 
## Call: rq(formula = mpg ~ wt + hp + cyl, tau = taus, data = mtcars)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 38.33674     33.01808 41.45609
## wt          -3.01449     -4.71262 -1.76240
## hp          -0.01475     -0.05426 -0.00249
## cyl         -1.13592     -1.94275  0.48042
## 
## Call: rq(formula = mpg ~ wt + hp + cyl, tau = taus, data = mtcars)
## 
## tau: [1] 0.75
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 40.18791     36.72762 45.02152
## wt          -3.77914     -4.19886 -0.95680
## hp          -0.02005     -0.04158  0.08590
## cyl         -0.62227     -3.40813 -0.27690

t_ols <- tbl_regression(m2)
t_qreg <- tbl_regression(q_reg)

tbl_merge(
  list(t_ols, t_qreg),
  tab_spanner = c("**OLS**", "**Quantile Regressions**"))

Characteristic	OLS			Quantile Regressions
Characteristic	Beta	95% CI	p-value	Group	Beta	95% CI
wt	-3.2	-4.7, -1.6	<0.001	0.25	-2.6	-8.7, -2.1
wt	-3.2	-4.7, -1.6	<0.001	0.5	-3.0	-6.8, -1.4
wt	-3.2	-4.7, -1.6	<0.001	0.75	-3.8	-4.4, -0.82
hp	-0.02	-0.04, 0.01	0.14	0.25	-0.01	-0.08, 0.00
hp	-0.02	-0.04, 0.01	0.14	0.5	-0.01	-0.06, 0.00
hp	-0.02	-0.04, 0.01	0.14	0.75	-0.02	-0.05, 0.11
cyl	-0.94	-2.1, 0.19	0.10	0.25	-0.97	-1.9, 0.55
cyl	-0.94	-2.1, 0.19	0.10	0.5	-1.1	-2.1, 0.82
cyl	-0.94	-2.1, 0.19	0.10	0.75	-0.62	-3.5, 0.13
Abbreviation: CI = Confidence Interval

Answer: Some of the coefficients are different. For $wt$ variable, the coefficient is slightly different for the all 25th, 50th and 75th quantile regressions, which suggests that the effect of weight on miles per gallon is different for different classes of cars based on miler per gallon. For $hp$ variable, the coefficients are almost identical. For $cyl$ variable, the coefficient is slightly more negative for the 25th quantile regression, which suggests that the effect of number of cylinders on miles per gallon is stronger for cars with lower miles per gallon, and is even more negative for the 20th quantile regression, while for the 75th quantile regression the coefficient is less negative, which suggests that the effect of number of cylinders on miles per gallon is weaker for cars with higher miles per gallon.

2d

mtcars$hp2wt <- mtcars$hp / mtcars$wt
q_reg2 <- rq(mpg ~ hp2wt, data = mtcars, tau=c(0.1,0.5,0.9))

summary(q_reg2)

## 
## Call: rq(formula = mpg ~ hp2wt, tau = c(0.1, 0.5, 0.9), data = mtcars)
## 
## tau: [1] 0.1
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 17.60032      6.47805 29.85810
## hp2wt       -0.06740     -0.75642  0.07483
## 
## Call: rq(formula = mpg ~ hp2wt, tau = c(0.1, 0.5, 0.9), data = mtcars)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 25.30497     21.98778 29.40190
## hp2wt       -0.11413     -0.22116 -0.08743
## 
## Call: rq(formula = mpg ~ hp2wt, tau = c(0.1, 0.5, 0.9), data = mtcars)
## 
## tau: [1] 0.9
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 30.40000     24.88711 41.59796
## hp2wt        0.00000     -0.26425  0.26736

plot(summary(q_reg2))

Answer: As questions asks about whether effects is different for more or less fuel-efficient cars, I ran quantile regression, whuch shows that effect of the $hp2wt$ variable is negative for the 10th quantile regression, and even more negative for the 50th quantile regression, while for the 90th quantile regression the coefficient is zero, which suggests that the effect of $hp2wt$ variable is different for more or less fuel-efficient cars.

Assigment 3

url <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/MASS/Insurance.csv"
file_name <- "insurance.csv"

download.file(url, paste(file_name, sep = ""), mode = "wb")
insurance <- read.csv(file_name)

url <- "https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip"
file_name <- "temp_data.zip"
download.file(url, paste(file_name, sep = ""), mode = "wb")
unzip("temp_data.zip", exdir = "extracted_data")

bikeshare <- read.csv("extracted_data/hour.csv")

3a

bikeshare$season <- relevel(as.factor(bikeshare$season), ref = "4")
bikeshare$hr <- as.factor(bikeshare$hr)
bikeshare$mnth <- as.factor(bikeshare$mnth)
bikeshare$weekday <- as.factor(bikeshare$weekday)
bikeshare$weathersit <- as.factor(bikeshare$weathersit)

p1_bike <- glm(cnt ~ atemp + holiday + hr + hum + mnth + season + temp + weathersit + weekday + windspeed + yr, 
               family=poisson(link="log"),
               data=bikeshare)

summary(p1_bike)

## 
## Call:
## glm(formula = cnt ~ atemp + holiday + hr + hum + mnth + season + 
##     temp + weathersit + weekday + windspeed + yr, family = poisson(link = "log"), 
##     data = bikeshare)
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  3.375328   0.007992  422.327  < 2e-16 ***
## atemp        0.946850   0.020326   46.584  < 2e-16 ***
## holiday     -0.160979   0.003797  -42.398  < 2e-16 ***
## hr1         -0.466697   0.008182  -57.038  < 2e-16 ***
## hr2         -0.839682   0.009313  -90.160  < 2e-16 ***
## hr3         -1.507858   0.012163 -123.967  < 2e-16 ***
## hr4         -2.110448   0.015858 -133.084  < 2e-16 ***
## hr5         -0.956562   0.009787  -97.738  < 2e-16 ***
## hr6          0.400501   0.006619   60.509  < 2e-16 ***
## hr7          1.422874   0.005666  251.117  < 2e-16 ***
## hr8          1.916567   0.005423  353.411  < 2e-16 ***
## hr9          1.391883   0.005648  246.429  < 2e-16 ***
## hr10         1.123194   0.005806  193.438  < 2e-16 ***
## hr11         1.269596   0.005717  222.071  < 2e-16 ***
## hr12         1.447484   0.005642  256.544  < 2e-16 ***
## hr13         1.427091   0.005663  251.991  < 2e-16 ***
## hr14         1.364773   0.005707  239.121  < 2e-16 ***
## hr15         1.405023   0.005693  246.794  < 2e-16 ***
## hr16         1.628120   0.005593  291.123  < 2e-16 ***
## hr17         2.036233   0.005445  373.972  < 2e-16 ***
## hr18         1.970299   0.005443  362.018  < 2e-16 ***
## hr19         1.674864   0.005518  303.540  < 2e-16 ***
## hr20         1.377556   0.005648  243.882  < 2e-16 ***
## hr21         1.121995   0.005800  193.434  < 2e-16 ***
## hr22         0.864956   0.006005  144.039  < 2e-16 ***
## hr23         0.483910   0.006419   75.382  < 2e-16 ***
## hum         -0.205716   0.004129  -49.824  < 2e-16 ***
## mnth2        0.113516   0.003776   30.060  < 2e-16 ***
## mnth3        0.223665   0.003937   56.817  < 2e-16 ***
## mnth4        0.181289   0.005235   34.629  < 2e-16 ***
## mnth5        0.244655   0.005477   44.668  < 2e-16 ***
## mnth6        0.196362   0.005585   35.158  < 2e-16 ***
## mnth7        0.098808   0.006064   16.294  < 2e-16 ***
## mnth8        0.195102   0.005898   33.078  < 2e-16 ***
## mnth9        0.270869   0.005427   49.914  < 2e-16 ***
## mnth10       0.187711   0.005395   34.794  < 2e-16 ***
## mnth11       0.061117   0.005303   11.524  < 2e-16 ***
## mnth12       0.045360   0.004676    9.700  < 2e-16 ***
## season1     -0.457991   0.004081 -112.212  < 2e-16 ***
## season2     -0.183915   0.004085  -45.020  < 2e-16 ***
## season3     -0.190763   0.003348  -56.982  < 2e-16 ***
## temp         0.164396   0.019469    8.444  < 2e-16 ***
## weathersit2 -0.064256   0.001422  -45.175  < 2e-16 ***
## weathersit3 -0.492963   0.002864 -172.124  < 2e-16 ***
## weathersit4 -0.469206   0.067067   -6.996 2.63e-12 ***
## weekday1     0.051206   0.002167   23.626  < 2e-16 ***
## weekday2     0.060927   0.002103   28.971  < 2e-16 ***
## weekday3     0.066408   0.002103   31.582  < 2e-16 ***
## weekday4     0.067338   0.002089   32.228  < 2e-16 ***
## weekday5     0.093581   0.002089   44.798  < 2e-16 ***
## weekday6     0.079608   0.002091   38.063  < 2e-16 ***
## windspeed   -0.109958   0.004869  -22.581  < 2e-16 ***
## yr           0.468566   0.001151  407.080  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2891591  on 17378  degrees of freedom
## Residual deviance:  572011  on 17326  degrees of freedom
## AIC: 683018
## 
## Number of Fisher Scoring iterations: 5

insurance$District <- as.factor(insurance$District)
insurance$log_holder <- log(insurance$Holders)

p1_insur <- glm(Claims ~ Age + District + Group, 
                family=poisson(link="log"),
                data=insurance,
                offset = log_holder)
summary(p1_insur)

## 
## Call:
## glm(formula = Claims ~ Age + District + Group, family = poisson(link = "log"), 
##     data = insurance, offset = log_holder)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.82174    0.07679 -23.724  < 2e-16 ***
## Age>35      -0.53667    0.06996  -7.672 1.70e-14 ***
## Age25-29    -0.19101    0.08286  -2.305 0.021149 *  
## Age30-35    -0.34495    0.08137  -4.239 2.24e-05 ***
## District2    0.02587    0.04302   0.601 0.547597    
## District3    0.03852    0.05051   0.763 0.445657    
## District4    0.23421    0.06167   3.798 0.000146 ***
## Group>2l     0.56341    0.07232   7.791 6.65e-15 ***
## Group1-1.5l  0.16134    0.05053   3.193 0.001409 ** 
## Group1.5-2l  0.39281    0.05500   7.142 9.18e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 236.26  on 63  degrees of freedom
## Residual deviance:  51.42  on 54  degrees of freedom
## AIC: 388.74
## 
## Number of Fisher Scoring iterations: 4

3a.i

Answer: The model B (insurance claims) will require offset term defined as $log(Holders)$, because we are modelling rate of claims, so we need to account for the number of policyholders, which is the exposure for the claims. The model A does not need offset as it is already presented as a number of bikes per hours, so it is already a rate.

3a.ii

# FULL MODEL
cf <- exp(coef(p1_bike))

season_type <- cf[grep("season", names(cf))]
print(season_type)  # Baseline season 4 (winter)

##   season1   season2   season3 
## 0.6325530 0.8320064 0.8263283

# ONLY SEASON AND YEAR MODEL
p2_bike <- glm(cnt ~ season + yr, family=poisson(link="log"),, data = bikeshare)
summary(p2_bike)

## 
## Call:
## glm(formula = cnt ~ season + yr, family = poisson(link = "log"), 
##     data = bikeshare)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.017141   0.001296 3870.23   <2e-16 ***
## season1     -0.590186   0.001819 -324.39   <2e-16 ***
## season2      0.044316   0.001509   29.37   <2e-16 ***
## season3      0.168329   0.001460  115.32   <2e-16 ***
## yr           0.494594   0.001137  435.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2891591  on 17378  degrees of freedom
## Residual deviance: 2477112  on 17374  degrees of freedom
## AIC: 2588022
## 
## Number of Fisher Scoring iterations: 5

cf <- exp(coef(p2_bike))
season_type <- cf[grep("season", names(cf))]
print(season_type)  # Baseline season 4 (winter)

##   season1   season2   season3 
## 0.5542242 1.0453131 1.1833257

Answer: In the full model, the rate ratio for spring (season1), summer (season2), and fall(season3) are lower by 37%, 17% and 17% respectively compared to the baseline season, which is winter, which contradicts conventional judgment, a we would expect bike rentals to be higher in warmer seasons. It is important to note that it does not mean that they are lower in absolute terms, but as an count rate per hour.

However, if we run the poisson regression with only season and year, the results are different: rate ratio for spring is lower by 45% compared to winter, while ratio for summer and fall is higher by 4$ and 18% respectively. This is explained by the fact that in the full model we control for other variables, such as weather conditions and air temperature, which are likely to be correlated with season, and which have a strong effect on bike rentals. So when we control for weather related variables, they absorb most of the effect of season, which leads to the counter intuitive results.

3a.iii

cf <- exp(coef(p1_insur))
print(unique(insurance$Age))

## [1] "<25"   "25-29" "30-35" ">35"

group_type <- cf[grep("Age", names(cf))]
print(group_type)  # Baseline is <25

##    Age>35  Age25-29  Age30-35 
## 0.5846916 0.8261242 0.7082553

Answer: Based on the model results, the rate ratio for claims is higher for the age group <25 (baseline) is highest, meaning that this group is the riskiest.

3a.iv

Answer: Offset coefficient is fixed to 1 as we already now the exposure (policyholders), so the model needs to adjust count of claims per policyholders There is not need to estimate the coefficient for the exposure as the exposure is given.

3a.v

residuals.from.Poisson <- residuals(p1_bike)
predicted.from.Poisson <- predict(p1_bike)
plot(predicted.from.Poisson,residuals.from.Poisson,col="red",pch=20)

performance::check_overdispersion(p1_bike)

## # Overdispersion test
## 
##        dispersion ratio =     32.256
##   Pearson's Chi-Squared = 558864.469
##                 p-value =    < 0.001

residuals.from.Poisson <- residuals(p1_insur)
predicted.from.Poisson <- predict(p1_insur)
plot(predicted.from.Poisson,residuals.from.Poisson,col="blue",pch=20)

performance::check_overdispersion(p1_insur)

## # Overdispersion test
## 
##        dispersion ratio =  0.901
##   Pearson's Chi-Squared = 48.629
##                 p-value =  0.681

Answer: For the bike share model (A), the Poisson assumption is not valid, which is tested through overdispersion test. Thus, Negative Binomial model would be more appropriate. For the insurance model (B), the Poisson assumption is valid based on the results of overdispersion test, therefore there is no need to use Negative Binomial model.

Overdispersion can lead to misleading inference under Poison MLE as it underestimates standard errors, which can lead to finding significant effects when they are actually not significant.

Assigment 4

4a

Answer: The main difference between ZIP model and Hurdle model in terms of data generation is that for the ZIP model, data is assumed to come from a mixture of two processes: processes that generates structural zeros, when the even cannot happen, and another that generates sampling zeroes (event can happen but did not) and count observations according to a certain distribution. Thus zeroes in data comes from two different processes and are included at both steps of the model. In contrast, the Hurdle model assumed that both zeroes and count data is generated through one processes, even though model has two steps. Frst step of the model only treats zero vs positive observation, and once hurdle is passed, truncated (only positive) count distribution is used to treat only positive count data. Thus, in Hurdle model, zeroes are only generated at the first step.

4b

load("DebTrivedi2.rda")

zip_model <- zeroinfl(ofp ~ hosp + health + numchron + gender + school + privins, data = DebTrivedi, dist = "poisson")
summary(zip_model)

## 
## Call:
## zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + 
##     privins, data = DebTrivedi, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4092 -1.1579 -0.4769  0.5435 25.0380 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.405812   0.024175  58.152  < 2e-16 ***
## hosp             0.159011   0.006060  26.239  < 2e-16 ***
## healthexcellent -0.304134   0.031151  -9.763  < 2e-16 ***
## healthpoor       0.253454   0.017705  14.315  < 2e-16 ***
## numchron         0.101836   0.004721  21.571  < 2e-16 ***
## gendermale      -0.062332   0.013054  -4.775 1.80e-06 ***
## school           0.019144   0.001873  10.221  < 2e-16 ***
## privinsyes       0.080557   0.017145   4.699 2.62e-06 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.08102    0.14233  -0.569 0.569219    
## hosp            -0.30330    0.09158  -3.312 0.000927 ***
## healthexcellent  0.23786    0.14990   1.587 0.112550    
## healthpoor       0.02166    0.16170   0.134 0.893431    
## numchron        -0.53117    0.04601 -11.545  < 2e-16 ***
## gendermale       0.41527    0.08919   4.656 3.22e-06 ***
## school          -0.05677    0.01223  -4.640 3.49e-06 ***
## privinsyes      -0.75294    0.10257  -7.341 2.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 24 
## Log-likelihood: -1.613e+04 on 16 Df

hurdle_model <- hurdle(ofp ~ hosp + health + numchron + gender + school + privins, data = DebTrivedi)
summary(hurdle_model)

## 
## Call:
## hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins, 
##     data = DebTrivedi)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4144 -1.1565 -0.4770  0.5432 25.0228 
## 
## Count model coefficients (truncated poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.406459   0.024180  58.167  < 2e-16 ***
## hosp             0.158967   0.006061  26.228  < 2e-16 ***
## healthexcellent -0.303677   0.031150  -9.749  < 2e-16 ***
## healthpoor       0.253521   0.017708  14.317  < 2e-16 ***
## numchron         0.101720   0.004719  21.557  < 2e-16 ***
## gendermale      -0.062247   0.013055  -4.768 1.86e-06 ***
## school           0.019078   0.001872  10.194  < 2e-16 ***
## privinsyes       0.080879   0.017139   4.719 2.37e-06 ***
## Zero hurdle model coefficients (binomial with logit link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.043147   0.139852   0.309 0.757688    
## hosp             0.312449   0.091437   3.417 0.000633 ***
## healthexcellent -0.289570   0.142682  -2.029 0.042409 *  
## healthpoor      -0.008716   0.161024  -0.054 0.956833    
## numchron         0.535213   0.045378  11.794  < 2e-16 ***
## gendermale      -0.415658   0.087608  -4.745 2.09e-06 ***
## school           0.058541   0.011989   4.883 1.05e-06 ***
## privinsyes       0.747120   0.100880   7.406 1.30e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 14 
## Log-likelihood: -1.613e+04 on 16 Df

AIC(zip_model, hurdle_model)

##              df      AIC
## zip_model    16 32300.06
## hurdle_model 16 32300.90

BIC(zip_model, hurdle_model)

##              df      BIC
## zip_model    16 32402.31
## hurdle_model 16 32403.15

vuong(zip_model, hurdle_model)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A  p-value
## Raw                    2.190241 model1 > model2 0.014253
## AIC-corrected          2.190241 model1 > model2 0.014253
## BIC-corrected          2.190241 model1 > model2 0.014253

Answer: There is very slight difference between coefficients and their significance for count models. Similarly, the results of zero models could be also interpreted as similar. They oftentimes have different direction but similar strength of coefficients. Similarly, same variables have significant effect on dependent variable: $hosp, numchron, gendermale, school, privinsyes$. The only major difference is that zero hurdle model estimates variable $healthexcellent$ has statistically significant effect of reducing odds of visiting physician office by $exp(-0.2514146) = 25\%$, while ZIP zero model does not. However, it is important to note that for ZIP model, these coefficients should be interpreted as probability of being zero (never visiting the hospital), while for Hurdle model, they should be interpreted as probability of being positive (visit the hospital).

Based on the goodness of fit measures, such as AIC and BIC, ZIP model has a marginal preference. Vuong test also suggests that ZIP model is preferred over Hurdle model, as the test statistic is positive and significant. Overall, based on the results of the coefficient interpretation, goodness of fit measures and Vuong test, ZIP model is preferred over Hurdle model for this data. However, this preference is marginal, as both models fit data really well. Therefore, the choice between the models should be driven by conceptual considerations.

Assigment 5

5a

temp <- read.csv("/Users/artyom/Documents/study/rutgers/S26-AdvancedQuantMethods/HW_1/energytemp.csv")
summary(temp)

##       day         temperature      energy_kwh       month          
##  Min.   : 1.00   Min.   :20.00   Min.   :11.05   Length:60         
##  1st Qu.:15.75   1st Qu.:38.75   1st Qu.:18.96   Class :character  
##  Median :30.50   Median :57.50   Median :30.72   Mode  :character  
##  Mean   :30.50   Mean   :57.50   Mean   :35.49                     
##  3rd Qu.:45.25   3rd Qu.:76.25   3rd Qu.:52.37                     
##  Max.   :60.00   Max.   :95.00   Max.   :65.37                     
##   is_extreme         temp_squared    temp_cubed    
##  Length:60          Min.   : 400   Min.   :  8000  
##  Class :character   1st Qu.:1502   1st Qu.: 58221  
##  Mode  :character   Median :3307   Median :190179  
##                     Mean   :3791   Mean   :273710  
##                     3rd Qu.:5814   3rd Qu.:443392  
##                     Max.   :9025   Max.   :857375

ggplot(data=temp, aes(x=temperature, y=energy_kwh)) +
  geom_line(color="red")+
  geom_point()

Answer: Based on the plot above, there is a clear relationship between energy consumption and temperature, which is non-linear. Initially, as temperature increases, energy consumption also increases. Then, after reaching a certain point (40-60 kWh), temperature suddenly decreases. It continues to decrease as energy consumption rises with minor fluctuations. Based on the visual inception, the energy consumption is on it’s minimum when temperature is around 50 degrees F. There are a few zones of relationship between temperature and energy: (1) 20 - 50F when the energy consumption slowly decreases with the major breakpoint at 50; (2) 50-100F, when energy consumption is slowly increasing with minor fluctuations.

5b

temp_1 <- lm(energy_kwh ~ temperature, data = temp)
summary(temp_1)

## 
## Call:
## lm(formula = energy_kwh ~ temperature, data = temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.419 -12.265   4.660   7.855  21.346 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 65.45611    4.51371  14.502  < 2e-16 ***
## temperature -0.52118    0.07331  -7.109 1.92e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.5 on 58 degrees of freedom
## Multiple R-squared:  0.4656, Adjusted R-squared:  0.4564 
## F-statistic: 50.54 on 1 and 58 DF,  p-value: 1.921e-09

plot(temp_1)

5b.i: The linear model is not appropriate for this data, as the relationship between energy consumption and temperature is non-linear, which was evident based on the plot. However, model is statistically significant, as is evident from F-statistic. As temperature is the only variable, it is also significant. Lastly, $R^2$ is 0.46, which means that 46% of variation in energy can be explained by the model. Diaognostic plots show that there is a clear pattern in residuals, which confirms that linear model is not appropriate for this data.

5b.ii: The model would probably underpredict energy consumption in the moderate temperatures, as is evident from the Residuals vs Fitted values plot.

5c

temp_2 <- lm(energy_kwh ~ temperature + temp_squared, data = temp)
summary(temp_2)

## 
## Call:
## lm(formula = energy_kwh ~ temperature + temp_squared, data = temp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2996  -4.6314  -0.0937   5.4153  14.1765 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  128.640136   7.082539  18.163  < 2e-16 ***
## temperature   -3.096364   0.269091 -11.507  < 2e-16 ***
## temp_squared   0.022393   0.002306   9.709  1.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.741 on 57 degrees of freedom
## Multiple R-squared:  0.7986, Adjusted R-squared:  0.7916 
## F-statistic:   113 on 2 and 57 DF,  p-value: < 2.2e-16

plot(temp_2)

5c.i: We see improvement of the model fit, as $R^2$ increased to 0.77, F-statistic is significant, and both temperature and its square are significant. Diagnostics plot show that there is still non-linearity that is not captured in the model.

5c.ii: The estimated coefficient for $temperature^2$ suggests that there is a certain point at which the relationship between energy consumption and temperature changes direction from negative to positive, implying convex curve pattern.

5c.iii: This value can be calculates as:

\[ -b / (2a) = -(-3.096364) / (2 * 0.022393) = 69.13687313 \] So the point at which energy is minimized is 69 degrees F.

5.d

temp$bp1 <- pmax(temp$temperature - 49, 0)
temp$bp2 <- pmax(temp$temperature - 52, 0)

temp_pw <- lm(energy_kwh ~ temperature + bp1 + bp2, data = temp)
summary(temp_pw)

## 
## Call:
## lm(formula = energy_kwh ~ temperature + bp1 + bp2, data = temp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7865  -2.0305   0.1335   2.3909   5.8748 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  78.83128    2.76477  28.513  < 2e-16 ***
## temperature  -0.71710    0.07731  -9.275 6.49e-13 ***
## bp1         -10.15021    0.63410 -16.007  < 2e-16 ***
## bp2          11.42505    0.60622  18.846  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.342 on 56 degrees of freedom
## Multiple R-squared:  0.9631, Adjusted R-squared:  0.9611 
## F-statistic: 487.5 on 3 and 56 DF,  p-value: < 2.2e-16

segmented.mod <- segmented(temp_1, seg.Z = ~temperature, npsi = 2)
summary(segmented.mod)

## 
##  ***Regression Model with Segmented Relationship(s)***
## 
## Call: 
## segmented.lm(obj = temp_1, seg.Z = ~temperature, npsi = 2)
## 
## Estimated Break-Point(s):
##                     Est. St.Err
## psi1.temperature 49.205  0.133
## psi2.temperature 50.600  0.128
## 
## Coefficients of the linear terms:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     77.56579    2.38663  32.500  < 2e-16 ***
## temperature     -0.67384    0.06816  -9.886 1.03e-13 ***
## U1.temperature -23.02192    3.06733  -7.506       NA    
## U2.temperature  24.21985    3.06679   7.897       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.756 on 54 degrees of freedom
## Multiple R-Squared: 0.9758,  Adjusted R-squared: 0.9736 
## 
## Boot restarting based on 6 samples. Last fit:
## Convergence attained in 10 iterations (rel. change 1.9184e-10)

Answer: breakpoints are relatively close.

5.d.i

Breakpoints represent temperatures at which energy consumption pattern on our data (regression line) radically changes: for breakpoints 1 in segmented relationship (49.205F), the relationship between energy and temperature changes from slightly negative to drastically different, so coefficients changes the slope of the line by -23.02192. In other words, breakpoint essentially means that after this point, one unit increase in temperature will lead to $-0.67384 + -23.02192 = -23.69576$ change in energy consumption. Similarly, second breakpoint (50.600F) then shift slope by 24.21985, which means that one unit increase in temperature will lead to $-0.67384 + -23.02192 + 24.21985 = 0.52401$ change in energy consumption.

5.d.ii

Heating effect (when people heat homes, so lower temperatures) is slightly stronger that cooling effect (when people cool homes, so lower temperatures), as slope for it is -0.67384, while slope for cooling is 0.52401, which means that in absolute terms heating effect is bigger.

5.e.

5.d

loess_1 <- loess(energy_kwh ~ temperature, data = temp, span = 0.15)
loess_2 <- loess(energy_kwh ~ temperature, data = temp, span = 0.4)
loess_3 <- loess(energy_kwh ~ temperature, data = temp, span = 0.75)

summary(loess_1)

## Call:
## loess(formula = energy_kwh ~ temperature, data = temp, span = 0.15)
## 
## Number of Observations: 60 
## Equivalent Number of Parameters: 21.77 
## Residual Standard Error: 3.274 
## Trace of smoother matrix: 24.05  (exact)
## 
## Control settings:
##   span     :  0.15 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

summary(loess_2)

## Call:
## loess(formula = energy_kwh ~ temperature, data = temp, span = 0.4)
## 
## Number of Observations: 60 
## Equivalent Number of Parameters: 7.6 
## Residual Standard Error: 4.19 
## Trace of smoother matrix: 8.39  (exact)
## 
## Control settings:
##   span     :  0.4 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

summary(loess_3)

## Call:
## loess(formula = energy_kwh ~ temperature, data = temp, span = 0.75)
## 
## Number of Observations: 60 
## Equivalent Number of Parameters: 4.37 
## Residual Standard Error: 5.288 
## Trace of smoother matrix: 4.77  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

ggplot(temp, aes(x = temperature, y = energy_kwh)) +
  geom_point(alpha = 0.4) +
  geom_smooth(aes(color = "0.15"), method = "loess", span = 0.15, se = FALSE) +
  geom_smooth(aes(color = "0.4"), method = "loess", span = 0.4, se = FALSE) +
  geom_smooth(aes(color = "0.75"), method = "loess", span = 0.75, se = FALSE) +
  scale_color_manual(name = "Span", values = c("red", "blue", "green"))

Answer: The LOESS model with span 0.15 essentially overfits the data, capturing any minor fluctuation in the data and therefore producing a very wiggly curve. The LOESS model with span 0.4 fits data way better, providing balance between capturing the general pattern and not overfitting for fluctuations. The model with span 0.75 is too smooth, and it underfits the data, as it generalizes the pattern between variables too much.

LOESS fitted the curves by essentially performing weighted regression on the localized subset of data (the size of it is defined by span, which means it controls how smooth the line is going to be), fitting many regression lines (linear or polynomial) to produce a smooth curve.

5.f

loess_4 <- loess(energy_kwh ~ temperature, data = temp, span = 0.5)

new_data <- data.frame(temperature = 55)
predicted_energy_l <- predict(loess_4, newdata = new_data)
cat("LOESS prediction:", predicted_energy_l, "\n")

## LOESS prediction: 19.04937

new_data$bp1 <- pmax(new_data$temperature - 49, 0)
new_data$bp2 <- pmax(new_data$temperature - 52, 0)

predicted_energy_p <- predict(temp_pw, newdata = new_data)
cat("Piecewise prediction:", predicted_energy_p, "\n")

## Piecewise prediction: 12.76444

predicted_energy_s <- predict(segmented.mod, newdata = new_data)
cat("Segmented prediction:", predicted_energy_s, "\n")

## Segmented prediction: 13.64222

Answer: We see that the predictions from all three model are relatively close, piecewise and segmented being very close to each other, while LOESS prediction is slightly higher than the other two.

Piecewise or segmented would be much safer approaches for extrapolation. 10 degrees F is outside of the range of our data, and as LOESS is not designed for extrapolation but for fitting curve between existing data points, it is likely to produce unreliable predictions as there are no points close to this number to base the prediction on. As we see above, at 55 degrees F, LOESS already overpredicts the energy consumption due to smoother line that is achieved with span = 0.5. Contrary to that, piecewise or segmented would extrapolate based on the line after established breakpoints and therefore would be more reliable.

5.g

x0 <- 65
xi <- 60
x <- temp$temperature
span <- 0.3

loess_5 <- loess(energy_kwh ~ temperature, data = temp, span = span)

new_data <- data.frame(temperature = 65)
predicted_energy_l <- predict(loess_4, newdata = new_data)
cat("LOESS prediction:", predicted_energy_l, "\n")

## LOESS prediction: 17.13865

# how many neighbors are used 
n_neighbors <- ceiling(span * length(x))
distances <- abs(x - x0) 
h <- sort(distances)[n_neighbors]

# calculate distance between points x_0 and x_1
d <- abs(xi - x0)

# scale distance 
u <- d / h

# find weight 
if (u < 1) {weight <- (1 - u^3)^3} else {weight <- 0}

cat("h:", h, "\n")

## h: 10.9322

cat("Scaled distance:", u, "\n")

## Scaled distance: 0.4573643

cat("Weight:", weight, "\n")

## Weight: 0.7395666

Answer: Using given dataset, the weight that observation at temperature = 60F receives is 0.7395666. Calculations are provided in the code above.