Introduction
Following packages and functions are used in this project.
## basic packages
library(knitr)
library(kableExtra)
library(tidyverse)
library(conflicted)
library(magrittr)
library(broom)
# library(car)
## paticular packages for this project
library(lmtest)
library(corrr)
library(tseries)
library(corrplot)
source("./funcs.R")
The data set is defined as follows based on file recs.csv:
set.seed(6)
dat <-
read_csv("../../data/recs.csv") %>%
dplyr::slice(sample(nrow(.), 300)) %>%
mutate(y = log(KWH / NHSLDMEM)) %>%
mutate(x8 = TOTROOMS + NCOMBATH + NHAFBATH) %>%
dplyr::select(y, x2 = NHSLDMEM, x3 = EDUCATION, x4 = MONEYPY, x5 = HHSEX,
x6 = HHAGE, x7 = ATHOME, x8) %>%
mutate_at(seq(2, 8), as.integer) %>% # make continuous variables discrete
mutate(x5 = - x5 + 2)
Different variables are summarized in the following table. y, the logarithm of averaged electricity consumption, is the variable that we are interested in. Specifically, The electricity consumption refers to the electricity usage of the house/studio where the respondent lives in 2015, measured in kilowatthours. The quantity is average by the number of household members in the house/studio. That way, it roughly represent the level of electricity consumption of the respondent. Other variables are discussion in the following table.
| z |
KWH |
electricity consumption |
| y |
LKWH.pers |
logarithm of KWH/NHSLDMEM |
| x2 |
NHSLDMEM |
number of household members |
| x3 |
EDUCATION |
highest education completed |
| x4 |
MONEYPY |
annual gross household income last year |
| x5 |
HHSEX |
gender |
| x6 |
HHAGE |
age |
| x7 |
ATHOME |
number of weekdays someone is at home |
| x8 |
TOTROOMS + |
number of rooms (including bathrooms) |
Note that x8 is a variable indicating the number of rooms of the house/studio of the respondent. It equals the summation of TOTROOMS, NCOMBATH and NHAFBATH in the original data set. x8 is not included in the initial analysis (sections 1 to 7).
x3, the education level of the respondent, is considered as a continuous variable in this project for simplicity. The detailed definition of different levels is shown in the following table.
| 1 |
Less than high school diploma or GED |
| 2 |
High school diploma or GED |
| 3 |
Some college or Associate’s degree |
| 4 |
Bachelor’s degree (for example: BA, BS) |
| 5 |
Master’s, Professional, or Doctorate degree |
The first 5 rows of the data set used can be visualized:
| y |
x2 |
x3 |
x4 |
x5 |
x6 |
x7 |
x8 |
| 7.540 |
5 |
3 |
8 |
1 |
39 |
5 |
15 |
| 8.193 |
1 |
2 |
2 |
0 |
85 |
5 |
14 |
| 8.678 |
3 |
1 |
1 |
0 |
71 |
5 |
8 |
| 7.846 |
4 |
3 |
5 |
1 |
39 |
5 |
8 |
| 9.755 |
1 |
3 |
3 |
0 |
57 |
0 |
10 |
In this project, we want to develop a model associating the continuous variable, average electricity consumption, with other variables. We will start by visualizing correlations between variables. In particular, the two variables x2 and x6, which highly correlate with y will be explored. The model regressing y on x2 will be discussed in section 2. Four assumptions will be made, three of which will be tested in section 3-5 by Jarque-Bera test (normality), White’s test (homoskedasticity) and RESET test (functional form) respectively. According the test results and discussion in section 6, data point 36 is excluded. Then, in section 7, more regressors are introduced and three new models are established. The detail regarding how regressors interact with each other, namely causality, is discussed in section 8. Based on the understanding of the underlying mechanism, a nonlinear term and x8 are included as new regressors. Two new models are estimated, based on the only model passing all tests in section 7. Finally, the model called mods[[7]] is chosen as the final for presentation and it is interpreted in section 10.
1. Data Visualization
It can be seen that y is highly correlated to x2 and x6 according to the following table.
| x |
y |
r |
| x2 |
y |
-0.51098 |
| x3 |
y |
0.03410 |
| x4 |
y |
0.04098 |
| x5 |
y |
-0.04260 |
| x6 |
y |
0.35346 |
| x7 |
y |
0.03966 |
| x8 |
y |
0.21329 |
It can be seen from the following covariance matrix that y is highly correlated to x2, x6 and x8. Besides, x3-x4, x2-x6, x4-x8 are high correlated, which will be discussed in section 9.

For each level of x2 a box indicating three quantiles (25%, 50%, 75%) of y is given. It shows that there is a tendency for y to decrease with x2 by looking at the median. The sizes of different boxes seem to vary with different values of x2. Besides, there are many observations when x2 is small. But it is assumed for now that the conditional variance is constant, which will be tested section 4. Three data points with extreme values 36, 241 and 163 is discussed in sections 3 and 5.

The box plot of y by x6 is given. It can be seen that the tendency is not strictly linear and the condition variance is not stable. So we will regress y on x2 first and use x6 as the second regressor in section 6.

2. Regress y on x2, Assumptions and Orthogonalization
mods[[1]] is obtained by regressing y on x2.
lm(formula = y ~ x2, data = dat)
| term |
estimate |
std.error |
statistic |
p.value |
| (Intercept) |
9.036 |
0.07526 |
120.06 |
2.214e-254 |
| x2 |
-0.270 |
0.02631 |
-10.26 |
2.352e-21 |
| r.squared |
adj.r.squared |
sigma |
statistic |
p.value |
df |
logLik |
AIC |
BIC |
deviance |
df.residual |
| 0.2611 |
0.2586 |
0.6309 |
105.3 |
2.352e-21 |
2 |
-286.5 |
579 |
590.1 |
118.6 |
298 |
By orthogonalizing x2 with respect to constant 1. the following reparameterized model can be obtained.
mods[[2]] <-
dat %>%
mutate(x1 = 1, x21 = x2 - mean(.$x2)) %>%
dplyr::select(y, x1, x21) %>%
{lm(y ~ x1 + x21, data = .)}
lm(formula = y ~ x1 + x21, data = .)
| term |
estimate |
std.error |
statistic |
p.value |
| (Intercept) |
8.36 |
0.03642 |
229.52 |
0.000e+00 |
| x21 |
-0.27 |
0.02631 |
-10.26 |
2.352e-21 |
| r.squared |
adj.r.squared |
sigma |
statistic |
p.value |
df |
logLik |
AIC |
BIC |
deviance |
df.residual |
| 0.2611 |
0.2586 |
0.6309 |
105.3 |
2.352e-21 |
2 |
-286.5 |
579 |
590.1 |
118.6 |
298 |
The estimated regression coefficient for x2 in mods[[1]] equals that for x21 in mods[[2]]. That is, slopes in these two models are the same. The standard error of the intercept is reduced by 51.60 %.
3. Normality and Jarque-Bera Test of mods[[1]]
The following four plots can be used to check the plausibility of normality assumptions:
- The upper left plot shows residuals against fitted values of
mods[[1]]. It is hard to trust indication the flat trending line because there are few data points with low fitted values. The variance seems to be stable when fitted values are high. The assumption of homoskedasticity is tested formally in section 4.
- Data points
36, 241 and 163 are mentioned in all but the lower right plots. They are examined in section 6.
- The assumption of conditional normality looks reasonable according to the upper right Q-Q plot. A formal Jarque-Bera test is performed later this section to examine this assumption in a quantitative manner.

The assumption of conditional normality is justified by JB test.
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| Jarque-Bera |
4.326 |
2 |
298 |
0.115 |
0.05 |
FALSE |
4. Homoskedasticity and White’s Test of mods[[1]]
mods[[1]] cannot pass the White’s test, which means the variances of residuals do vary with different values of y.
mods[[1]] %>% test_white(dat, resi2 ~ x2 + I(x2^2), 2) %>% tab_ti(F)
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| White |
6.028 |
2 |
298 |
0.04909 |
0.05 |
TRUE |
6. Regress y on x2 with 36 Data Point Excluded
36, 241 and 163 data points are mentioned in three plots regrading the analysis of residuals of mods[[1]] in section 3. According to the scatter plot in section 1, their values of y are too small compared to those with same values of x2. They seems to be well defined.
| index |
y |
x2 |
| 36 |
5.312 |
6 |
| 163 |
7.054 |
1 |
| 241 |
6.680 |
3 |
However, with just 8 other points when x2 equals 6, data point 36 will have a huge impact on the model, so it is excluded in the following model mods[[2]]. That is, a new model is estimated with the same formula as mods[[1]] but the data set excluding data point 36.
lm(formula = y ~ x2, data = .)
| term |
estimate |
std.error |
statistic |
p.value |
| (Intercept) |
9.0101 |
0.07431 |
121.254 |
5.324e-255 |
| x2 |
-0.2569 |
0.02612 |
-9.832 |
6.208e-20 |
| r.squared |
adj.r.squared |
sigma |
statistic |
p.value |
df |
logLik |
AIC |
BIC |
deviance |
df.residual |
| 0.2456 |
0.243 |
0.6197 |
96.67 |
6.208e-20 |
2 |
-280.2 |
566.4 |
577.5 |
114.1 |
297 |
Compared with mods[[1]], mods[[2]] has more accurate estimation. Besides, mods[[2]] passes all of the hypothesis tests. So data point 36 is exluded in the following models, and the corresponding new data set dat_2 is used.
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| Jarque-Bera |
4.383 |
2 |
297 |
0.1118 |
0.05 |
FALSE |
| White |
1.686 |
2 |
297 |
0.4304 |
0.05 |
FALSE |
| RESET |
1.986 |
1 |
298 |
0.1588 |
0.05 |
FALSE |
7. Models with More Regressors
7-1. Benchmark Model
mods[[3]] with x2 and x6 being regressors is a good model already and pass every test. It is chosen as the benchmark model after the discussion in subsection 7-4.
lm(formula = y ~ x2 + x6, data = dat_2)
| term |
estimate |
std.error |
statistic |
p.value |
| (Intercept) |
8.522218 |
0.167008 |
51.029 |
1.015e-148 |
| x2 |
-0.220207 |
0.028079 |
-7.842 |
8.080e-14 |
| x6 |
0.007576 |
0.002331 |
3.249 |
1.290e-03 |
| r.squared |
adj.r.squared |
sigma |
statistic |
p.value |
df |
logLik |
AIC |
BIC |
deviance |
df.residual |
| 0.2715 |
0.2666 |
0.61 |
55.17 |
4.32e-21 |
3 |
-275 |
557.9 |
572.7 |
110.1 |
296 |
results_test <- mods[[3]] %>% test_jb(dat_2)
results_test %<>%
bind_rows(test_white(mods[[3]], dat_2, resi2 ~ x2 + x6 + I(x2^2) + I(x6^2),
3)) %>%
bind_rows(test_white(mods[[3]], dat_2, resi2 ~ x2 + x6 + I(x2^2) + I(x6^2) +
I(x2 * x6), 6)) %>%
bind_rows(test_reset(mods[[3]], dat_2))
results_test %>% tab_ti(F)
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| Jarque-Bera |
3.46586 |
2 |
297 |
0.1768 |
0.05 |
FALSE |
| White |
1.32474 |
3 |
296 |
0.7233 |
0.05 |
FALSE |
| White |
3.24389 |
6 |
293 |
0.7777 |
0.05 |
FALSE |
| RESET |
0.04314 |
1 |
298 |
0.8355 |
0.05 |
FALSE |
7-2. Model with All Regressors
To construct mods[[4]], all of the variables (excluding x8) are included. mods[[4]] can pass Jarque-Bera test and two White’s tests, but cannot pass RESET test.
results_test <- mods[[4]] %>% test_jb(dat_2)
results_test %<>%
bind_rows(test_white(mods[[4]], dat_2, resi2 ~ x2 + x3 + x4 + x5 + x6 + x7 +
I(x2^2) + I(x3^2) + I(x4^2) + I(x5^2) + I(x6^2) + I(x7^2), 7)) %>%
bind_rows(test_white(mods[[4]], dat_2, resi2 ~ x2 + x3 + x4 + x5 + x6 + x7 +
I(x2^2) + I(x3^2) + I(x4^2) + I(x5^2) + I(x6^2) + I(x7^2) + I(x2 * x3) +
I(x2 * x4) + I(x2 * x5) + I(x2 * x6) + I(x2 * x7) + I(x4 * x3) +
I(x5 * x3) + I(x6 * x3) + I(x3 * x7) + I(x4 * x5) + I(x4 * x6) +
I(x4 * x7) + I(x5 * x6) + I(x5 * x7) + I(x6 * x7), 28)) %>%
bind_rows(test_reset(mods[[4]], dat_2))
results_test %>% tab_ti(F)
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| Jarque-Bera |
1.801 |
2 |
297 |
0.40632 |
0.05 |
FALSE |
| White |
12.334 |
7 |
292 |
0.09011 |
0.05 |
FALSE |
| White |
31.025 |
28 |
271 |
0.31598 |
0.05 |
FALSE |
| RESET |
4.103 |
1 |
298 |
0.04280 |
0.05 |
TRUE |
7-3. Likelihood Ratio Test
Likelihood ratio tests for restricting one parameter can be performed by using partial correlations. For example, to test the hypothesis that coefficient for x5 is 0 in mods[[4]], following calculation can be conducted. With p_value being 0.7581638, we cannot reject the hypothesis.
lr_x5 <- - 299 * log(1 - 0.0003170)
(1 - pchisq(lr_x5, 1))
[1] 0.7581638
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| logLik |
0.09479 |
1 |
299 |
0.7582 |
0.05 |
FALSE |
Likelihood tests for restricting more than one parameter can be only performed by using values of log likelihood in the original and restricted models. For example, to test the hypothesis that coefficients for x5 and x7 are both 0 in mods[[4]], following calculation can be conducted. We cannot reject the hypothesis according the function output.
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| logLik |
0.9188 |
2 |
298 |
0.3378 |
0.05 |
FALSE |
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| logLik |
0.824 |
1 |
299 |
0.364 |
0.05 |
FALSE |
The above three test statistics are related in an additive manner, so models with multiple regressors can be reduced in a step-wise procedure. During every step, partial correlations for regressors can be used as the indication of the next term to be reduced.
[1] TRUE
7-4. Automated Model Selection
Thus, mods[[5]] is determined by automated model selection using mods[[4]] with stats::step function. mods[[5]] can pass Jarque-Bera test and two White’s tests, but cannot pass RESET test.
lm(formula = y ~ x2 + x3 + x4 + x6, data = dat_2)
| term |
estimate |
std.error |
statistic |
p.value |
| (Intercept) |
8.602282 |
0.191431 |
44.937 |
1.003e-133 |
| x2 |
-0.244215 |
0.028598 |
-8.540 |
7.356e-16 |
| x3 |
-0.079886 |
0.036673 |
-2.178 |
3.018e-02 |
| x4 |
0.060276 |
0.018291 |
3.295 |
1.103e-03 |
| x6 |
0.007589 |
0.002297 |
3.304 |
1.070e-03 |
| r.squared |
adj.r.squared |
sigma |
statistic |
p.value |
df |
logLik |
AIC |
BIC |
deviance |
df.residual |
| 0.298 |
0.2885 |
0.6008 |
31.2 |
1.155e-21 |
5 |
-269.4 |
550.8 |
573 |
106.1 |
294 |
results_test <- mods[[5]] %>% test_jb(dat_2)
results_test %<>%
bind_rows(test_white(mods[[5]], dat_2, resi2 ~ x2 + x3 + x4 + x6 + I(x2^2) +
I(x3^2) + I(x4^2) + I(x6^2), 5)) %>%
bind_rows(test_white(mods[[5]], dat_2, resi2 ~ x2 + x3 + x4 + x6 +
I(x2^2) + I(x3^2) + I(x4^2) + I(x6^2) + I(x2 * x3) + I(x2 * x4) +
I(x2 * x6) + I(x4 * x3) + I(x6 * x3) +I(x4 * x6), 15)) %>%
bind_rows(test_reset(mods[[5]], dat_2))
results_test %>% tab_ti(F)
| whi |
stat |
df1 |
df2 |
p_value |
prob |
if_reject |
| Jarque-Bera |
1.985 |
2 |
297 |
0.37063 |
0.05 |
FALSE |
| White |
10.051 |
5 |
294 |
0.07380 |
0.05 |
FALSE |
| White |
19.371 |
15 |
284 |
0.19742 |
0.05 |
FALSE |
| RESET |
3.898 |
1 |
298 |
0.04836 |
0.05 |
TRUE |
According to results of five models, though mods[[4]] and mods[[5]] have lower AIC, mods[[3]] is the one with all tests passed and lowst AIC. It is discussed intensively in subsection 8-1, and acts as the benchmark model in section 9. Besides, mods[[3]] has the lowest BIC.
| index |
r.squared |
adj.r.squared |
sigma |
statistic |
p.value |
df |
logLik |
AIC |
BIC |
deviance |
df.residual |
| 1 |
0.2611 |
0.2586 |
0.6309 |
105.30 |
2.352e-21 |
2 |
-286.5 |
579.0 |
590.1 |
118.6 |
298 |
| 2 |
0.2611 |
0.2586 |
0.6309 |
105.30 |
2.352e-21 |
2 |
-286.5 |
579.0 |
590.1 |
118.6 |
298 |
| 3 |
0.2715 |
0.2666 |
0.6100 |
55.17 |
4.320e-21 |
3 |
-275.0 |
557.9 |
572.7 |
110.1 |
296 |
| 4 |
0.3002 |
0.2858 |
0.6020 |
20.87 |
2.372e-20 |
7 |
-269.0 |
553.9 |
583.5 |
105.8 |
292 |
| 5 |
0.2980 |
0.2885 |
0.6008 |
31.20 |
1.155e-21 |
5 |
-269.4 |
550.8 |
573.0 |
106.1 |
294 |
Additionally, mods[[4]] has the highest r.squared for including more regressors, but its adj.r.squared is penalized for that. The AIC and BIC are not low as well compared to those of mods[[3]] and mods[[5]].
10. Interpretation of mods[[7]]
10-1. Model Structure
The overall structure can be illustrated in the following figure. The income is correlated to the number of household memebers, which is probably caused by the correlation between numbers of people and rooms. Because those relations don’t contribute to the interpretation of consumptions directly, they are omitted.

10-2. Regressors
Electricity consumption is highly correlated to the number of household members of the respondent. When there are not too many household members, with more members, the quantity of consumption decreases, probably due to the fact that people tend to share the kitchen. However, the effect diminishes gradually when there are many household members (around 5-8).

Because y is the log of average electricity consumptions, the final relationship between electricity consumptions and numbers of household members and numbers of rooms are:
As for the effect of number of rooms on electricity consumptions,

10-3. Orthogonalization of Multiple Regressors
mods[[8]] <-
dat_2 %>%
mutate(z1 = lm(x2 ~ 1)$residuals) %>%
mutate(z2 = lm(x8 ~ x2 + 1)$residuals) %>%
mutate(x2.2 = x2^2) %>%
mutate(z3 = lm(x2.2 ~ x2 + x8 + 1)$residuals) %>%
{lm(y ~ z1 + z2 + z3, data = .)}
lm(formula = y ~ z1 + z2 + z3, data = .)
| term |
estimate |
std.error |
statistic |
p.value |
| (Intercept) |
8.37009 |
0.03349 |
249.911 |
0.000e+00 |
| z1 |
-0.25686 |
0.02441 |
-10.521 |
3.420e-22 |
| z2 |
0.06625 |
0.01093 |
6.060 |
4.145e-09 |
| z3 |
0.03563 |
0.01232 |
2.893 |
4.104e-03 |
| r.squared |
adj.r.squared |
sigma |
statistic |
p.value |
df |
logLik |
AIC |
BIC |
deviance |
df.residual |
| 0.3456 |
0.3389 |
0.5791 |
51.93 |
5.594e-27 |
4 |
-258.9 |
527.9 |
546.4 |
98.94 |
295 |
The intercept in mods[[8]] (8.37009) can be interpreted as the exptected value for an individual with average values of x2, x8 and I(x2^2), which can be verified by the prediction using mods[[7]].
( 8.77946 - 0.52071 * mean(dat_2$x2) + 0.07329 * mean(dat_2$x8) +
0.03563 * mean(dat_2$x2^2) ) - 8.37009 <= 1e-4
[1] TRUE
Since the standard errors in mods[[8]] are smaller than those in mods[[7]], mods[[8]] is used to conduct inference. Particularly, se for Intercept is reduced by 75.48%, and ses for first two regressors are reduced by 71.75% and 2.41%. The estimations for the last term are exactly the same, which is expected.
2.5 % 97.5 %
(Intercept) 8.30417224 8.43600027
z1 -0.30490653 -0.20881311
z2 0.04473184 0.08775875
z3 0.01138835 0.05986556
To reduce average electricity consumptions, people are encouraged to live together in houses with fewer rooms.
---
title: "Linear Regression Analysis of Electricity Consumption"
author: "Edward J. Xu (<edxu96@outlook.com>)"
date: "`r Sys.Date()`"
documentclass: article
classoption: a4paper
output:
  html_notebook:
    code_folding: hide
    df_print: paged
    fig_caption: no
    fig_height: 7.5
    fig_width: 15
    smooth_scroll: yes
    theme: default
    toc: yes
    toc_float: no
  #   toc: yes
  #   keep_tex: yes
  # pdf_document:
editor_options:
  chunk_output_type: inline
---

## Introduction

```{r}
rm(list = ls())
```

Following packages and functions are used in this project.

```{r, message = FALSE, echo=T}
## basic packages
library(knitr)
library(kableExtra)
library(tidyverse)
library(conflicted)
library(magrittr)
library(broom)
# library(car)
## paticular packages for this project
library(lmtest)
library(corrr)
library(tseries)
library(corrplot)
source("./funcs.R")
```

```{r, include=FALSE}
setwd("~/GitHub/TidySimStat/examples/regress")
options(width = 80, pillar.sigfig = 5)
knitr::opts_chunk$set(
  comment = "#>",
  echo = FALSE,
  fig.align="center"
)
```

The data set is defined as follows based on file `recs.csv`:

```{r, message = FALSE, echo=T}
set.seed(6)
dat <-
  read_csv("../../data/recs.csv") %>%
  dplyr::slice(sample(nrow(.), 300)) %>%
  mutate(y = log(KWH / NHSLDMEM)) %>%
  mutate(x8 = TOTROOMS + NCOMBATH + NHAFBATH) %>%
  dplyr::select(y, x2 = NHSLDMEM, x3 = EDUCATION, x4 = MONEYPY, x5 = HHSEX,
    x6 = HHAGE, x7 = ATHOME, x8) %>%
  mutate_at(seq(2, 8), as.integer) %>%  # make continuous variables discrete
  mutate(x5 = - x5 + 2)
```

Different variables are summarized in the following table. `y`, the logarithm of averaged electricity consumption, is the variable that we are interested in. Specifically, The electricity consumption refers to the electricity usage of the house/studio where the respondent lives in 2015, measured in kilowatthours. The quantity is average by the number of household members in the house/studio. That way, it roughly represent the level of electricity consumption of the respondent. Other variables are discussion in the following table.

| Sym | Abbr       | Definition                              |
| --- | ---------- | --------------------------------------- |
| z   | KWH        | electricity consumption                 |
| y   | LKWH.pers  | logarithm of KWH/NHSLDMEM               |
| x2  | NHSLDMEM   | number of household members             |
| x3  | EDUCATION  | highest education completed             |
| x4  | MONEYPY    | annual gross household income last year |
| x5  | HHSEX      | gender                                  |
| x6  | HHAGE      | age                                     |
| x7  | ATHOME     | number of weekdays someone is at home   |
| x8  | TOTROOMS + | number of rooms (including bathrooms)   |

Note that `x8` is a variable indicating the number of rooms of the house/studio of the respondent. It equals the summation of `TOTROOMS`, `NCOMBATH` and `NHAFBATH` in the original data set. `x8` is not included in the initial analysis (sections 1 to 7). 

`x3`, the education level of the respondent, is considered as a continuous variable in this project for simplicity. The detailed definition of different levels is shown in the following table.

| Level | Definition                                  |
| ----- | ------------------------------------------- |
| 1     | Less than high school diploma or GED        |
| 2     | High school diploma or GED                  |
| 3     | Some college or Associate’s degree          |
| 4     | Bachelor’s degree (for example: BA, BS)     |
| 5     | Master’s, Professional, or Doctorate degree |

The first 5 rows of the data set used can be visualized:

```{r}
head(dat, 5) %>% tab_ti(F)
```

In this project, we want to develop a model associating the continuous variable, average electricity consumption, with other variables. We will start by visualizing correlations between variables. In particular, the two variables `x2` and `x6`, which highly correlate with `y` will be explored. The model regressing `y` on `x2` will be discussed in section 2. Four assumptions will be made, three of which will be tested in section 3-5 by Jarque-Bera test (normality), White’s test (homoskedasticity) and RESET test (functional form) respectively. According the test results and discussion in section 6, data point `36` is excluded. Then, in section 7, more regressors are introduced and three new models are established. The detail regarding how regressors interact with each other, namely causality, is discussed in section 8. Based on the understanding of the underlying mechanism, a nonlinear term and `x8` are included as new regressors. Two new models are estimated, based on the only model passing all tests in section 7. Finally, the model called `mods[[7]]` is chosen as the final for presentation and it is interpreted in section 10.

## 1. Data Visualization

It can be seen that `y` is highly correlated to `x2` and `x6` according to the following table.

```{r}
dat %>%
  cor() %>%
  as_cordf() %>%
  stretch() %>%
  dplyr::filter(y == "y" & x != "y") %>%
  tab_ti(F)
```

It can be seen from the following covariance matrix that `y` is highly correlated to `x2`, `x6` and `x8`. Besides, `x3`-`x4`, `x2`-`x6`, `x4`-`x8` are high correlated, which will be discussed in section 9.

```{r}
dat %>%
  cor() %>%
  corrplot(type = "upper", tl.col = "black", tl.srt = 45)
```

For each level of `x2` a box indicating three quantiles (25%, 50%, 75%) of `y` is given. It shows that there is a tendency for `y` to decrease with `x2` by looking at the median. The sizes of different boxes seem to vary with different values of `x2`. Besides, there are many observations when `x2` is small. But it is assumed for now that the conditional variance is constant, which will be tested section 4. Three data points with extreme values `36`, `241` and `163` is discussed in sections 3 and 5. 

```{r}
dat %>%
  mutate(index = row_number()) %>%
  ggplot() +
    geom_boxplot(aes(x2, y, group = cut_width(x2, 1)), outlier.alpha = 0) +
    geom_point(aes(x2, y), shape = 1) +
    geom_text(aes(x2, y, label = ifelse(index == 36 | index == 163 |
      index == 241, as.character(index), "")), hjust=1.5, vjust=0)
```

The box plot of `y` by `x6` is given. It can be seen that the tendency is not strictly linear and the condition variance is not stable. So we will regress `y` on `x2` first and use `x6` as the second regressor in section 6.

```{r}
dat %>%
  ggplot() +
  geom_boxplot(aes(x6, y, group = cut_width(x6, 10)), outlier.alpha = 0)+
    geom_point(aes(x6, y), shape = 1)
```

## 2. Regress `y` on `x2`, Assumptions and Orthogonalization

`mods[[1]]` is obtained by regressing `y` on `x2`.

```{r}
mods <- list()
mods[[1]] <- lm(y ~ x2, data = dat)

mods[[1]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)

results <- new_results(mods[[1]], 1)
```

By orthogonalizing `x2` with respect to constant 1. the following reparameterized model can be obtained.

```{r, echo=T}
mods[[2]] <- 
  dat %>%
  mutate(x1 = 1, x21 = x2 - mean(.$x2)) %>%
  dplyr::select(y, x1, x21) %>%
  {lm(y ~ x1 + x21, data = .)}
```

```{r}
results %<>%
  collect_glance(mods[[2]], 2)

mods[[2]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

The estimated regression coefficient for `x2` in `mods[[1]]` equals that for `x21` in `mods[[2]]`. That is, slopes in these two models are the same. The standard error of the intercept is reduced by 51.60 %.

## 3. Normality and Jarque-Bera Test of `mods[[1]]`

The following four plots can be used to check the plausibility of normality assumptions:

- The upper left plot shows residuals against fitted values of `mods[[1]]`. It is hard to trust indication the flat trending line because there are few data points with low fitted values. The variance seems to be stable when fitted values are high. The assumption of homoskedasticity is tested formally in section 4.
- Data points `36`, `241` and `163` are mentioned in all but the lower right plots. They are examined in section 6.
- The assumption of conditional normality looks reasonable according to the upper right Q-Q plot. A formal Jarque-Bera test is performed later this section to examine this assumption in a quantitative manner.

```{r, warning=FALSE}
par_orginal <- par()
par(mfrow = c(2, 2), mai = c(0.3, 0.3, 0.3, 0.3))
plot(mods[[1]])
par(par_orginal)
```

The assumption of conditional normality is justified by JB test.

```{r}
mods[[1]] %>% test_jb(dat) %>% tab_ti(F)
```

## 4. Homoskedasticity and White's Test of `mods[[1]]`

`mods[[1]]` cannot pass the White's test, which means the variances of residuals do vary with different values of `y`.

```{r, echo=T}
mods[[1]] %>% test_white(dat, resi2 ~ x2 + I(x2^2), 2) %>% tab_ti(F)
```

## 5. Functional Form and RESET Test of `mods[[1]]`

`mods[[1]]` can pass RESET test.

```{r}
mods[[1]] %>% test_reset(dat) %>% tab_ti(F)
```

## 6. Regress `y` on `x2` with `36` Data Point Excluded

`36`, `241` and `163` data points are mentioned in three plots regrading the analysis of residuals of `mods[[1]]` in section 3. According to the scatter plot in section 1, their values of `y` are too small compared to those with same values of `x2`. They seems to be well defined. 

```{r}
dat %>%
  mutate(index = row_number()) %>%
  dplyr::filter(row_number() == 241 | row_number() == 163 | 
    row_number() == 36) %>%
  dplyr::select(index, y, x2) %>%
  tab_ti(F)
```

However, with just 8 other points when `x2` equals 6, data point `36` will have a huge impact on the model, so it is excluded in the following model `mods[[2]]`. That is, a new model is estimated with the same formula as `mods[[1]]` but the data set excluding data point `36`.

```{r}
mods[[2]] <-
  dat %>%
  dplyr::filter(row_number() != 36) %>%
  {lm(y ~ x2, data = .)}

mods[[2]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

Compared with `mods[[1]]`, `mods[[2]]` has more accurate estimation. Besides, `mods[[2]]` passes all of the hypothesis tests. So data point `36` is exluded in the following models, and the corresponding new data set `dat_2` is used.

```{r}
dat_2 <-
  dat %>%
  dplyr::filter(row_number() != 36)

results_test <-
  mods[[2]] %>%
  test_jb(dat_2)

results_test %<>%
  bind_rows(test_white(mods[[2]], dat_2, resi2 ~ x2 + I(x2^2), 2)) %>%
  bind_rows(test_reset(mods[[2]], dat_2))

results_test %>% tab_ti(F)
```

## 7. Models with More Regressors

### 7-1. Benchmark Model

`mods[[3]]` with `x2` and `x6` being regressors is a good model already and pass every test. It is chosen as the benchmark model after the discussion in subsection 7-4.

```{r}
mods[[3]] <- lm(y ~ x2 + x6, data = dat_2)
results %<>% collect_glance(mods[[3]], 3)

## Print model summary
model <- mods[[3]]
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

```{r, echo=T}
results_test <- mods[[3]] %>% test_jb(dat_2)

results_test %<>%
  bind_rows(test_white(mods[[3]], dat_2, resi2 ~ x2 + x6 + I(x2^2) + I(x6^2),
    3)) %>%
  bind_rows(test_white(mods[[3]], dat_2, resi2 ~ x2 + x6 + I(x2^2) + I(x6^2) +
    I(x2 * x6), 6)) %>%
  bind_rows(test_reset(mods[[3]], dat_2))

results_test %>% tab_ti(F)
```

### 7-2. Model with All Regressors

To construct `mods[[4]]`, all of the variables (excluding `x8`) are included. `mods[[4]]` can pass Jarque-Bera test and two White's tests, but cannot pass  RESET test.

```{r}
mods[[4]] <- lm(y ~ x2 + x3 + x4 + x5 + x6 + x7, data = dat_2)
results %<>% collect_glance(mods[[4]], 4)
mods[[4]] %>% tab_tidy(T)
```

```{r, echo=T}
results_test <- mods[[4]] %>% test_jb(dat_2)

results_test %<>%
  bind_rows(test_white(mods[[4]], dat_2, resi2 ~ x2 + x3 + x4 + x5 + x6 + x7 + 
    I(x2^2) + I(x3^2) + I(x4^2) + I(x5^2) + I(x6^2) + I(x7^2), 7)) %>%
  bind_rows(test_white(mods[[4]], dat_2, resi2 ~ x2 + x3 + x4 + x5 + x6 + x7 + 
    I(x2^2) + I(x3^2) + I(x4^2) + I(x5^2) + I(x6^2) + I(x7^2) + I(x2 * x3) +
    I(x2 * x4) + I(x2 * x5) + I(x2 * x6) + I(x2 * x7) + I(x4 * x3) + 
    I(x5 * x3) + I(x6 * x3) + I(x3 * x7) + I(x4 * x5) + I(x4 * x6) + 
    I(x4 * x7) + I(x5 * x6) + I(x5 * x7) + I(x6 * x7), 28)) %>%
  bind_rows(test_reset(mods[[4]], dat_2))

results_test %>% tab_ti(F)
```

### 7-3. Likelihood Ratio Test

Likelihood ratio tests for restricting one parameter can be performed by using partial correlations. For example, to test the hypothesis that coefficient for `x5` is 0 in `mods[[4]]`, following calculation can be conducted. With `p_value` being 0.7581638, we cannot reject the hypothesis.

```{r, echo=T}
lr_x5 <- - 299 * log(1 - 0.0003170) 
(1 - pchisq(lr_x5, 1))
```

```{r}
mods_lik <- list()
mods_lik[[1]] <- lm(y ~ x2 + x3 + x4 + x6 + x7, data = dat_2)
test_lik(mods[[4]], mods_lik[[1]]) %>% tab_ti(F)
```

Likelihood tests for restricting more than one parameter can be only performed by using values of log likelihood in the original and restricted models. For example, to test the hypothesis that coefficients for `x5` and `x7` are both 0 in `mods[[4]]`, following calculation can be conducted. We cannot reject the hypothesis according the function output.

```{r}
mods_lik[[2]] <- lm(y ~ x2 + x3 + x4 + x6, data = dat_2)
test_lik(mods[[4]], mods_lik[[2]]) %>% tab_ti(F)
```

```{r}
test_lik(mods_lik[[1]], mods_lik[[2]]) %>% tab_ti(F)
```

The above three test statistics are related in an additive manner, so models with multiple regressors can be reduced in a step-wise procedure. During every step, partial correlations for regressors can be used as the indication of the next term to be reduced.

```{r}
test_lik(mods[[4]], mods_lik[[1]])$stat + 
  test_lik(mods_lik[[1]], mods_lik[[2]])$stat -
  test_lik(mods[[4]], mods_lik[[2]])$stat <= 1e-5
```

### 7-4. Automated Model Selection

Thus, `mods[[5]]` is determined by automated model selection using `mods[[4]]` with `stats::step` function. `mods[[5]]` can pass Jarque-Bera test and two White's tests, but cannot pass RESET test.

```{r}
mods[[5]] <- quiet(step(mods[[4]]))
results %<>% collect_glance(mods[[5]], 5)

mods[[5]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

```{r, echo=T}
results_test <- mods[[5]] %>% test_jb(dat_2)

results_test %<>%
  bind_rows(test_white(mods[[5]], dat_2, resi2 ~ x2 + x3 + x4 + x6 + I(x2^2) +
    I(x3^2) + I(x4^2) + I(x6^2), 5)) %>%
  bind_rows(test_white(mods[[5]], dat_2, resi2 ~ x2 + x3 + x4 + x6 + 
    I(x2^2) + I(x3^2) + I(x4^2) + I(x6^2) + I(x2 * x3) + I(x2 * x4) + 
    I(x2 * x6) + I(x4 * x3) + I(x6 * x3) +I(x4 * x6), 15)) %>%
  bind_rows(test_reset(mods[[5]], dat_2))

results_test %>% tab_ti(F)
```

According to results of five models, though `mods[[4]]` and `mods[[5]]` have lower AIC, `mods[[3]]` is the one with all tests passed and lowst AIC. It is discussed intensively in subsection 8-1, and acts as the benchmark model in section 9. Besides, `mods[[3]]` has the lowest BIC.

```{r}
results %>% tab_ti(F)
```

Additionally, `mods[[4]]` has the highest `r.squared` for including more regressors, but its `adj.r.squared` is penalized for that. The AIC and BIC are not low as well compared to those of `mods[[3]]` and `mods[[5]]`. 

## 8. Causality and Mediation

### 8-1. Causality of `x2`-`x6`

It can be seen from the following two models that `y` is highly correlated to `x2` and `x6` separately. Besides, according to `mods[[3]]`, `y` is highly correlated to `x2` and `x6` at the same time. 

```{r}
mods_62 <- list()
mods_62[[1]] <- lm(y ~ x2, data = dat_2)

mods_62[[1]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

```{r}
mods_62[[2]] <- lm(y ~ x6, data = dat_2)

mods_62[[2]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

It is reasonable to assume that people tend to have more accompanies as age increases after 18. Also, the following model proves the relationship between `x2` and `x6`. We can say that the effect of `x6` on `y` is mediated by `x2`. With `x6` affecting `y` as well, `x2` does not mediate `x6` completely. So `x2` and `x6` are both supposed to appear in the model for `y`. The mediation factor `x2` may affect `y` though other ways, which will be explored in section 10. Besides, we find that `x6` does not affect `y` in other ways in section 10.

```{r}
mods_62[[3]] <- lm(x2 ~ x6, data = dat_2)

mods_62[[3]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

### 8-2. Causality of `x3`-`x4`

When taking `x3` and `x4` into consideration, the models assocating `y` with `x3` or `x4` both show no significance, though `x3` and `x4` are highly correlated. So neither `x3` nor `x4` should be included in the model.

```{r}
mods_34 <- list()
mods_34[[1]] <- lm(x4 ~ x3, data = dat_2)

mods_34[[1]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

```{r}
mods_34[[2]] <- lm(formula = y ~ x3 + x4, data = dat_2)

mods_34[[2]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```


### 8-3. Causality of `x4`-`x8`

It can be seen that `y` is highly correlated to `x8`.

```{r}
mods_48 <- list()
mods_48[[1]] <-  lm(y ~ x8, data = dat_2)

mods_48[[1]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

Also, `x8` is highly related to `x4`, which makes sense, because people with more income tend to buy houses with more rooms.

```{r}
mods_48[[2]] <-  lm(x8 ~ x4, data = dat_2)

mods_48[[2]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

However, from subsection 9.2 we already know that `y` is not highly correlated with `x4`, which is illustrated again by the following model. So we say that the effect of `x4` on `y` is completely mediated by `x8`. Though it makes sense that people with more income tend to consume more energy on average, the direct effect is completely mediated through `x8` and possibly other factors.

```{r}
mods_48[[3]] <- lm(y ~ x8 + x4, data = dat_2)

mods_48[[3]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

## 9. `mods[[3]]` with extra terms

In `mods[[6]]`, `x2`, `x6` and `x8` are kept at the same time.

```{r}
mods[[6]] <- lm(y ~ x2 + x6 + x8, data = dat_2)

mods[[6]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

```{r}
results_test <- mods[[6]] %>% test_jb(dat_2)

results_test %<>%
  bind_rows(test_white(mods[[6]], dat_2, resi2 ~ x2 + x6 + x8 + I(x2^2) +
    I(x6^2) + I(x8^2), 4)) %>%
  bind_rows(test_white(mods[[6]], dat_2, resi2 ~ x2 + x6 + x8 + I(x2^2) +
    I(x6^2) + I(x8^2) + I(x2 * x6) + I(x2 * x8) + I(x6 * x8), 10)) %>%
  bind_rows(test_reset(mods[[6]], dat_2))

results_test %>% tab_ti(F)
```

With the idea of diminishing effects of the number of household members in mind, it is natural to include the square of the household numbers. In addition, `x8` is included according to the discussion in subsection 8-3. Thus, `mods[[7]]` is estimated. Being insignificant, `x6` is excluded, beccause the effect of `x6` on `y` seems to be mediated compeletely by `x2` and `x2^2`.

```{r}
mods[[7]] <- lm(y ~ x2 + x8 + I(x2^2), data = dat_2)
mods[[7]] %>% tab_tidy(T)
```

With number of observations being 299, the model is sensitive to exlusion of some term only if the partial correlation of that term is smaller than 0.01276551. That is, if `p.r.squared` of some term displayed above is smaller than 0.01276551, the hypothesis regaring the coefficient for that term in the likelihood ratio test would not be rejected. The value `0.01276551` is obtained according to the discussion in subsection 7-3. Besides, small `p.value`s indicate that terms cannot be excluded as well. 

```{r, include=F}
lr_x5 <- - 299 * log(1 - 0.01276551)
(1 - pchisq(lr_x5, 1))
```

```{r}
results_test <- mods[[7]] %>% test_jb(dat_2)

results_test %<>%
  bind_rows(test_white(mods[[7]], dat_2, resi2 ~ x2 + x8 + I(x2^2) + I(x8^2),
    3)) %>%
  bind_rows(test_white(mods[[7]], dat_2, resi2 ~ x2 + x8 + I(x2^2) +
    I(x8^2) + I(x2 * x8), 6)) %>%
  bind_rows(test_reset(mods[[7]], dat_2))

results_test %>% tab_ti(F)
```

`mods[[7]]` has a lower AIC than `mods[[6]]` and `mods[[3]]`, so it is choosen as the final model for presentation and conclusion.

```{r}
AIC(mods[[7]], mods[[6]], mods[[3]]) %>% tab_ti(F)
```

## 10. Interpretation of `mods[[7]]`

### 10-1. Model Structure

The overall structure can be illustrated in the following figure. The income is correlated to the number of household memebers, which is probably caused by the correlation between numbers of people and rooms. Because those relations don't contribute to the interpretation of consumptions directly, they are omitted.

![Illustration of the model structure.](./images/1.png)

### 10-2. Regressors

Electricity consumption is highly correlated to the number of household members of the respondent. When there are not too many household members, with more members, the quantity of consumption decreases, probably due to the fact that people tend to share the kitchen. However, the effect diminishes gradually when there are many household members (around 5-8).

```{r}
ti <- tibble(x = seq(1:8), y = 8.77946 + 0.03563 * seq(1:8)^2 - 
  0.52071 * seq(1:8) + 0.07329 * mean(dat$x8))

dat %>%
  mutate(index = row_number()) %>%
  ggplot() +
    geom_boxplot(aes(x2, y, group = cut_width(x2, 1)), outlier.alpha = 0) +
    geom_point(aes(x2, y), shape = 1) +
    geom_text(aes(x2, y, label = ifelse(index == 36 | index == 163 |
      index == 241, as.character(index), "")), hjust=1.5, vjust=0) +
    geom_line(data = ti, mapping = aes(x, y), color = "#F8766D", size = 2)
```

Because `y` is the log of average electricity consumptions, the final relationship between electricity consumptions and numbers of household members and numbers of rooms are:

As for the effect of number of rooms on electricity consumptions, 

```{r}
ti <- tibble(x = seq(2, 22), y = 8.77946 + 0.07329 * seq(2, 22) - 
  0.52071 * mean(dat$x2) + 0.03563 * mean(dat$x2^2))

dat %>%
  mutate(index = row_number()) %>%
  ggplot() +
    geom_boxplot(aes(x8, y, group = cut_width(x8, 1)), outlier.alpha = 0) +
    geom_point(aes(x8, y), shape = 1) +
    geom_line(data = ti, mapping = aes(x, y), color = "#F8766D", size = 2)
```

### 10-3. Orthogonalization of Multiple Regressors

```{r, echo=T}
mods[[8]] <-
  dat_2 %>%
  mutate(z1 = lm(x2 ~ 1)$residuals) %>%
  mutate(z2 = lm(x8 ~ x2 + 1)$residuals) %>%
  mutate(x2.2 = x2^2) %>%
  mutate(z3 = lm(x2.2 ~ x2 + x8 + 1)$residuals) %>%
  {lm(y ~ z1 + z2 + z3, data = .)}
```

```{r}
mods[[8]] -> model  # Print model summary
model$call
model %>% tidy() %>% tab_ti(F)
model %>% glance() %>% tab_ti(F)
rm(model)
```

The intercept in `mods[[8]]` (8.37009) can be interpreted as the exptected value for an individual with average values of `x2`, `x8` and `I(x2^2)`, which can be verified by the prediction using `mods[[7]]`.

```{r, echo=T}
( 8.77946 - 0.52071 * mean(dat_2$x2) + 0.07329 * mean(dat_2$x8) + 
  0.03563 * mean(dat_2$x2^2) ) - 8.37009 <= 1e-4
```

Since the standard errors in `mods[[8]]` are smaller than those in `mods[[7]]`, `mods[[8]]` is used to conduct inference. Particularly, `se` for `Intercept` is reduced by 75.48%, and `se`s for first two regressors are reduced by 71.75% and 2.41%. The estimations for the last term are exactly the same, which is expected.

```{r}
mods[[8]] %>% confint()
```

To reduce average electricity consumptions, people are encouraged to live together in houses with fewer rooms.

<!-- ## 11. One-Sided t-Test of `mods[[1]]` -->

<!-- ```{r, echo=T} -->
<!-- mods[[1]] %>% -->
<!--   summary() %>% -->
<!--   {pt(coef(.)[2, 3], mods[[1]]$df, lower = FALSE)} %>% -->
<!--   {. <= qchisq(0.95, 1, lower.tail = TRUE, log.p = FALSE)} -->
<!-- ``` -->
