House_electricity_consumption

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(googlesheets4)
library(lubridate)
library(car)
library(jtools)
library(sjPlot)
library(equatiomatic)
library(readxl)
library(lmtest)
library(huxtable)
library(ciTools)

url <- "https://docs.google.com/spreadsheets/d/1xE7waEQnEXAEd-oyOIULnov9-D8liaSF_SecP2s3mDM/edit?usp=sharing"

gs4_deauth()    

Grandma_el_cons <- read_sheet(url, sheet = "Sheet1")
Grandma_el_cons <- Grandma_el_cons %>%
  mutate(date = as.Date(date, format = "%d.%m.%Y"))

The Puzzle

The owner of a house in central Czechia decided to install an electric heat pump (EHP) in 2025, replacing lignite and wood heating. The heat pump was installed on November 20, 2025. The owner began worrying about electricity consumption almost immediately.

I want to examine how well we can predict the electricity consumption of the EHP using data from only the first ~30 days after installation, with very limited background knowledge about the house and the system. Put differently: how far can we get with limited data and almost no technical information about the building, insulation, occupancy, or heat-pump settings?

Fortunately, the owner meticulously records electricity consumption (kWh) every day at around 8:00 AM, so we have handwritten daily records. A reasonable assumption is that the key driver of consumption is the outdoor temperature on a given day. However, it is not clear to me whether daily minimum or daily maximum temperature has a stronger effect. Therefore, we use either minimum or maximum temperature as the main independent variable, and electricity consumption as the dependent variable.

Unfortunately, we do not have weather measurements directly at the house. However, there is a nearby official weather station with accessible daily minimum and maximum temperatures. Since it is about 10 km away and has similar altitude and geography, we can use its data as a reasonable proxy (it would be even better if we also had cloud cover).

We might also expect higher electricity consumption on weekends—specifically Friday, Saturday, and Sunday—because family members visit the owner then. The problem is that these visits are irregular, and sometimes they last only a few hours, while other times they extend over a full weekend.

Data

We created a dataset that combines daily electricity consumption and daily minimum and maximum temperatures for each day (see the table below).

Key variables
Variable	Abbreviation	Units
Electricity consumption	kWh	kWh
Minimum temperature	min_temp_c	Celsius
Maximum temperature	max_temp_c	Celsius
Presence of Electric Heat Pump	EHP	(0, 1) dummy
Friday, Saturday, Sunday	Week_ends	(0, 1) dummy

A small caveat: the owner does not measure consumption at midnight but at 8:00 AM the following day. In other words, each “daily” observation covers the period from 8:00 AM on day t to 8:00 AM on day t+1. For that reason, we may want to align consumption with the minimum temperature of the following day, since night/morning temperatures may be especially important for heating demand.

The dataset spans November 1 to December 24. The chart below illustrates temperature variation and the sharp increase in electricity consumption after the EHP was installed.

Grandma_el_cons<-Grandma_el_cons%>%mutate(week_days= weekdays(date))

Grandma_el_cons<-Grandma_el_cons%>%mutate(week_ends= if_else(week_days=="Friday" | week_days=="Saturday" | week_days== "Sunday", 1, 0))

Grandma_el_cons<-Grandma_el_cons%>%mutate(min_temp_c_lead = dplyr::lead(min_temp_c, n=1))

Figure

Grandma_el_cons%>%
  ggplot(aes(date, min_temp_c_lead))+
  geom_line(aes(col="min"))+
  geom_line(aes(date, max_temp_c, col="max"))+
  geom_hline(yintercept = 0, linetype=3)+
  geom_line(aes(date, (kWh), col="kWh"))+
  geom_rect(xmin = as.Date("2025-11-01"), xmax=as.Date("2025-11-19"), ymin=-10, ymax=40, fill="red", alpha=0.002)+
  geom_vline(xintercept = as.Date("2025-11-20"), col="darkgreen", size=1, linetype=2)+
  geom_text(aes(label="Without EHP", x= as.Date("2025-11-10"), y=35), col="black")+
  geom_text(aes(label="With EHP", x= as.Date("2025-12-10"), y=35), col="darkgreen")+
  scale_color_manual(values=c("min"="blue", "max"="red", "kWh"="black"))+
  theme_bw(base_size = 13)+
  labs(color="Data lines")

Table

show(Grandma_el_cons)

## # A tibble: 55 × 8
##    date       min_temp_c max_temp_c   EHP   kWh week_days week_ends
##    <date>          <dbl>      <dbl> <dbl> <dbl> <chr>         <dbl>
##  1 2025-11-01          5         13     0     7 Saturday          1
##  2 2025-11-02          8         12     0     8 Sunday            1
##  3 2025-11-03          6         10     0     5 Monday            0
##  4 2025-11-04          5         13     0     5 Tuesday           0
##  5 2025-11-05          4         14     0     4 Wednesday         0
##  6 2025-11-06          4         13     0     7 Thursday          0
##  7 2025-11-07          5         12     0     7 Friday            1
##  8 2025-11-08          4         13     0     8 Saturday          1
##  9 2025-11-09          5          7     0    10 Sunday            1
## 10 2025-11-10          4         10     0     7 Monday            0
## # ℹ 45 more rows
## # ℹ 1 more variable: min_temp_c_lead <dbl>

Are there any patterns?

We need to determine first whether there is any relationship between electricity consumption (kWh) and temperature, and whether this relationship applies only to the period after the EHP was installed. The chart below shows the basic relationship between electricity consumption and the daily minimum temperature.

Grandma_el_cons%>%ggplot(aes(min_temp_c_lead, kWh, group=as.factor(EHP)))+
  geom_point(aes(col=as.factor(EHP)))+
  geom_smooth(method="lm", aes(col=as.factor(EHP)))+
  theme_bw(base_size = 13)+
   labs(color='EHP presence')

It seems there is a clear interaction effect. Before the EHP was installed, there is no relationship between minimum temperature and electricity consumption. After the EHP installation, the relationship is clearly negative: the lower the temperature, the higher the consumption. We double-check this pattern using an OLS regression.

Grm_m0<-lm(data=Grandma_el_cons, kWh~min_temp_c_lead+EHP+EHP*min_temp_c_lead)

equatiomatic::extract_eq(Grm_m0)

\[ \operatorname{kWh} = \alpha + \beta_{1}(\operatorname{min\_temp\_c\_lead}) + \beta_{2}(\operatorname{EHP}) + \beta_{3}(\operatorname{min\_temp\_c\_lead} \times \operatorname{EHP}) + \epsilon \]

export_summs(Grm_m0, robust="HC3")

	Model 1
(Intercept)	6.79 ***
	(0.85)
min_temp_c_lead	0.03
	(0.17)
EHP	19.21 ***
	(1.11)
min_temp_c_lead:EHP	-1.78 ***
	(0.28)
N	54
R2	0.90
Standard errors are heteroskedasticity robust. * p < 0.001; p < 0.01; * p < 0.05.

The results clearly support the visual finding from the chart above, so we will not use the interaction term in the remainder of the analysis. Since we now know that electricity consumption before the EHP installation was not affected by temperature, we can use the average daily consumption from the pre-installation period as a baseline (i.e., non-heating) level of electricity use. For the rest of the analysis, we work with heat_kw, defined as total electricity consumption minus this baseline consumption.

Before we proceed, one further adjustment is needed. A linear model was a reasonable starting point, but it is clear that EHP-related consumption (heat_kw) cannot be negative. We would also expect the relationship to be nonlinear: the function should be convex, never drop below zero, and show more pronounced increases as temperatures fall below 0 °C. A simple approach is to model the dependent variable in log form, which ensures predictions remain positive (though exact zero is never reached). Alternative option is to use gamma regression. Given the small size of our dataset it is hard to tell which alternative is better.

Grandma_el_cons<-Grandma_el_cons%>%mutate(Heat_kWh=kWh-mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T))

In the next step, we model the relationship between the log-transformed heating electricity consumption (log(heat_kw)) and temperature, and we also test whether weekends matter. Simoultaneously we fit analoguous gamma regression model. We estimate four models: (1) daily minimum temperature, (2)daily maximum temperature, (3) both minimum and maximum temperatures, and (4) minimum and maximum temperatures plus the week_ends variable.

Table lm log model

Grm_m1<-lm(data=Grandma_el_cons%>%filter(EHP==1), log(Heat_kWh)~min_temp_c_lead)

Grm_m2<-lm(data=Grandma_el_cons%>%filter(EHP==1), log(Heat_kWh)~max_temp_c)

Grm_m3<-lm(data=Grandma_el_cons%>%filter(EHP==1), log(Heat_kWh)~min_temp_c_lead+max_temp_c)

Grm_m4<-lm(data=Grandma_el_cons%>%filter(EHP==1), log(Heat_kWh)~min_temp_c_lead+max_temp_c+week_ends)

export_summs(Grm_m1, Grm_m2, Grm_m3, Grm_m4, robust="HC3")

	Model 1	Model 2	Model 3	Model 4
(Intercept)	2.87 ***	3.23 ***	3.04 ***	2.98 ***
	(0.05)	(0.06)	(0.09)	(0.11)
min_temp_c_lead	-0.10 ***		-0.07 **	-0.07 **
	(0.01)		(0.02)	(0.02)
max_temp_c		-0.08 ***	-0.04 **	-0.04 *
		(0.01)	(0.01)	(0.01)
week_ends				0.12
				(0.10)
N	35	35	35	35
R2	0.52	0.48	0.57	0.59
Standard errors are heteroskedasticity robust. * p < 0.001; p < 0.01; * p < 0.05.

equatiomatic::extract_eq(Grm_m3)

\[ \operatorname{log(Heat\_kWh)} = \alpha + \beta_{1}(\operatorname{min\_temp\_c\_lead}) + \beta_{2}(\operatorname{max\_temp\_c}) + \epsilon \]

equatiomatic::extract_eq(Grm_m3, use_coefs = TRUE)

\[ \operatorname{\widehat{log(Heat\_kWh)}} = 3.04 - 0.07(\operatorname{min\_temp\_c\_lead}) - 0.04(\operatorname{max\_temp\_c}) \]

Table glm gamma

Grama_m1<-glm(Heat_kWh ~ min_temp_c_lead,
             data = Grandma_el_cons%>%filter(EHP==1),
             family = Gamma(link="log"))

Grama_m2<-glm(Heat_kWh ~ max_temp_c,
             data = Grandma_el_cons%>%filter(EHP==1),
             family = Gamma(link="log"))

Grama_m3<-glm(Heat_kWh ~ min_temp_c_lead + max_temp_c,
             data = Grandma_el_cons%>%filter(EHP==1),
             family = Gamma(link="log"))

Grama_m4<-glm(Heat_kWh ~ min_temp_c_lead + max_temp_c + week_ends,
             data = Grandma_el_cons%>%filter(EHP==1),
             family = Gamma(link="log"))

export_summs(Grama_m1, Grama_m2, Grama_m3, Grama_m4, robust="HC3", digits=3)

	Model 1	Model 2	Model 3	Model 4
(Intercept)	2.907 ***	3.271 ***	3.085 ***	3.033 ***
	(0.041)	(0.053)	(0.076)	(0.086)
min_temp_c_lead	-0.099 ***		-0.062 **	-0.064 **
	(0.011)		(0.020)	(0.021)
max_temp_c		-0.079 ***	-0.040 **	-0.038 **
		(0.007)	(0.014)	(0.014)
week_ends				0.091
				(0.075)
N	35	35	35	35
AIC	215.371	218.534	212.354	213.219
BIC	220.037	223.200	218.575	220.996
Pseudo R2	0.549	0.506	0.609	0.622
Standard errors are heteroskedasticity robust. * p < 0.001; p < 0.01; * p < 0.05.

equatiomatic::extract_eq(Grama_m3)

\[ \log ({ E( \operatorname{Heat_kWh} ) }) = \alpha + \beta_{1}(\operatorname{min\_temp\_c\_lead}) + \beta_{2}(\operatorname{max\_temp\_c}) \]

equatiomatic::extract_eq(Grama_m3, use_coefs = TRUE)

\[ \log ({ \widehat{E( \operatorname{Heat_kWh} )} }) = 3.08 - 0.06(\operatorname{min\_temp\_c\_lead}) - 0.04(\operatorname{max\_temp\_c}) \]

Prediction lm log

plot_model(Grm_m3, type="pred", terms = c("min_temp_c_lead", "max_temp_c [10,0]"), colors = c("blue", "red"))

Prediction glm gamma

plot_model(Grama_m3, type="pred", terms = c("min_temp_c_lead", "max_temp_c [10,0]"), colors = c("blue", "red"))

The results suggest several points. First, daily minimum temperature appears to be slightly more important than daily maximum temperature (though the two are, of course, correlated). Second, weekends do not have a statistically significant effect—although this may simply reflect the short time span of the data. Finally, the model explains a relatively large share of the variation in consumption (R² above 50%).

Next, we double-check the model that uses minimum temperature. Because we are working with time-series data, a key concern is serial autocorrelation. A visual inspection of the residuals (see the first chart below) does not indicate a problem. The Durbin–Watson test leads to the same conclusion.

Overall, the model performs reasonably well. One caveat is that during the first days after installation, the contractors and owners adjusted the target indoor temperature, which likely introduced some instability. There is also one clear outlier (possibly a recording error).

Most importantly, the predicted relationship seems realistic: it never drops below 0, and it increases sharply—almost exponentially—once the outside temperature falls below 0 °C.

Chart below shows the key finding in the graphical form. As the dependend variable is logarithmised, the interpretation is pretty straigforward: one Celsius degree drop translates into approximately 7% increase in the energy consumption of the EHP.

OLS diagnostic

acf(residuals(Grm_m3))
dwtest(Grm_m3)

## 
##  Durbin-Watson test
## 
## data:  Grm_m3
## DW = 1.8221, p-value = 0.2245
## alternative hypothesis: true autocorrelation is greater than 0

plot(Grm_m3)

Gama diagnostic

acf(residuals(Grama_m3))
dwtest(Grama_m3)

## 
##  Durbin-Watson test
## 
## data:  Grama_m3
## DW = 1.6726, p-value = 0.1138
## alternative hypothesis: true autocorrelation is greater than 0

plot(Grama_m3)

Given the small number of data points, we cannot yet reliably test our predictions on out-of-sample data. However, we plan to do so around January 10th, when we should have roughly 15 additional observations. Hopefully, these will allow us to evaluate the predictive accuracy of the model.

Predicting electricity consumption

At the beginning, we noted that we ultimately want to estimate annual electricity consumption. For a rough approximation, we can take the daily minimum and maximum temperatures from the past year and use Model 3 to predict expected EHP-related consumption. After adding the baseline (non-heating) consumption, we can estimate total expected electricity use.

Again, we can rely on the nearby meteorological station and use the temperature data recorded over the last 12 months. This gives us a new dataset with the date and daily minimum and maximum temperatures. We then compute the corresponding heat_kw using Model 3. Because the model is estimated in logs, we transform predictions back to the original scale, add the baseline consumption, and finally sum the daily values to obtain the predicted annual total (analogical steps apply to the gamma regression).

Figure

El_cons_pred <- read_sheet(url, sheet = "Sheet2")
El_cons_pred <- El_cons_pred%>%mutate(date = as.Date(date, format = "%d.%m.%Y"), min_temp_c=as.numeric(min_temp_c), max_temp_c=as.numeric(max_temp_c))


El_cons_pred<-El_cons_pred%>%mutate(min_temp_c_lead = min_temp_c)

El_cons_pred2 <- read_sheet(url, sheet = "Sheet4")
El_cons_pred2 <- El_cons_pred2 %>%
  mutate(date = as.Date(date, format = "%m-%d"))

El_cons_pred2<-El_cons_pred2%>%mutate(min_temp_c_lead = min_temp_c)

El_cons_pred%>%
  ggplot(aes(date, min_temp_c_lead))+
  geom_line(aes(col="min"))+
  geom_line(aes(date, max_temp_c, col="max"))+
  geom_hline(yintercept = 0, linetype=3)+
  scale_color_manual(values=c("min"="blue", "max"="red"))+
  theme_bw(base_size = 13)+
  labs(color='temperature')

El_cons_pred2%>%
  ggplot(aes(date, min_temp_c_lead))+
  geom_line(aes(col="min"))+
  geom_line(aes(date, max_temp_c, col="max"))+
  geom_hline(yintercept = 0, linetype=3)+
  scale_color_manual(values=c("min"="blue", "max"="red"))+
  theme_bw(base_size = 13)+
  labs(color='temperature')

Table

show(El_cons_pred)

## # A tibble: 364 × 5
##    date       min_temp_c EHP   max_temp_c min_temp_c_lead
##    <date>          <dbl> <chr>      <dbl>           <dbl>
##  1 2024-12-24        0.3 1.00         2.8             0.3
##  2 2024-12-25        0.8 1.00         5               0.8
##  3 2024-12-26       -1.5 1.00         3.6            -1.5
##  4 2024-12-27       -3.5 1.00         4.3            -3.5
##  5 2024-12-28       -3.3 1.00         5              -3.3
##  6 2024-12-29       -2.6 1.00         5.9            -2.6
##  7 2024-12-30       -4.3 1.00         0.1            -4.3
##  8 2024-12-31       -4.1 1.00         4.2            -4.1
##  9 2025-01-01       -0.4 1.00         8.1            -0.4
## 10 2025-01-02       -0.2 1.00         8.9            -0.2
## # ℹ 354 more rows

show(El_cons_pred2)

## # A tibble: 365 × 4
##    date       max_temp_c min_temp_c min_temp_c_lead
##    <date>          <dbl>      <dbl>           <dbl>
##  1 2013-01-01        2.5       -1.3            -1.3
##  2 2013-01-02        5.1        1.2             1.2
##  3 2013-01-03        6.7        0.3             0.3
##  4 2013-01-04        9.7        5.3             5.3
##  5 2013-01-05        9.2        3.6             3.6
##  6 2013-01-06        6.2        2.8             2.8
##  7 2013-01-07        5.3        1.3             1.3
##  8 2013-01-08        5.7        2.1             2.1
##  9 2013-01-09        6          2.4             2.4
## 10 2013-01-10        4.7        0.3             0.3
## # ℹ 355 more rows

The results suggest that total expected electricity consumption is likely to be around 6,400 kWh. This is not a small amount: depending on the tariff, it could amount to roughly 30,000 CZK. At the same time, the owner will no longer need to buy lignite, which normally costs around 15,000 CZK for the entire heating season.

smear <- mean(exp(residuals(Grm_m3)))


El_cons_pred[c("fit","lwr","upr")]<-predict(Grm_m3, newdata=El_cons_pred, interval="confidence", level = 0.90)


El_cons_pred <- El_cons_pred %>%
  mutate(
    Pred_kW_median   = exp(fit),
    Pred_kW_median_L = exp(lwr),
    Pred_kW_median_U = exp(upr),

    Pred_kW_mean     = smear * exp(fit),
    Pred_kW_mean_L   = smear * exp(lwr),
    Pred_kW_mean_U   = smear * exp(upr)
  )


El_pr<-sum(El_cons_pred$Pred_kW_median)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prL<-sum(El_cons_pred$Pred_kW_median_L)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prU<-sum(El_cons_pred$Pred_kW_median_U)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prMean<-sum(El_cons_pred$Pred_kW_mean)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prMeanL<-sum(El_cons_pred$Pred_kW_mean_L)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prMeanU<-sum(El_cons_pred$Pred_kW_mean_U)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

#gamma reg

El_cons_pred$fitG<-predict(Grama_m3, newdata=El_cons_pred, type = "response")

El_cons_pred<- transform(El_cons_pred, Heat_kWh = NA_real_)

El_cons_pred<-add_ci(El_cons_pred,
  Grama_m3,
  alpha = 0.1,
  names = c("lwrG", "uprG"),
  type="boot",
  response = TRUE,
  nSims = 5000
)

El_prG<-sum(El_cons_pred$fitG)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prLG<-sum(El_cons_pred$lwrG)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prUG<-sum(El_cons_pred$uprG)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

#Cold year
El_cons_pred2[c("fit","lwr","upr")]<-predict(Grm_m3, newdata=El_cons_pred2, interval="confidence", level = 0.90)


El_cons_pred2 <- El_cons_pred2 %>%
  mutate(
    Pred_kW_median   = exp(fit),
    Pred_kW_median_L = exp(lwr),
    Pred_kW_median_U = exp(upr),

    Pred_kW_mean     = smear * exp(fit),
    Pred_kW_mean_L   = smear * exp(lwr),
    Pred_kW_mean_U   = smear * exp(upr)
  )


El_prC<-sum(El_cons_pred2$Pred_kW_median)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prLC<-sum(El_cons_pred2$Pred_kW_median_L)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prUC<-sum(El_cons_pred2$Pred_kW_median_U)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prMeanC<-sum(El_cons_pred2$Pred_kW_mean)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prMeanLC<-sum(El_cons_pred2$Pred_kW_mean_L)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prMeanUC<-sum(El_cons_pred2$Pred_kW_mean_U)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

#Gamma cold

El_cons_pred2$fitG<-predict(Grama_m3, newdata=El_cons_pred2, type = "response")

El_cons_pred2<- transform(El_cons_pred2, Heat_kWh = NA_real_)

El_cons_pred2<-add_ci(El_cons_pred2,
  Grama_m3,
  alpha = 0.1,
  names = c("lwrG", "uprG"),
  type="boot",
  response = TRUE,
  nSims = 5000
)

El_prGC<-sum(El_cons_pred2$fitG)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prLGC<-sum(El_cons_pred2$lwrG)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

El_prUGC<-sum(El_cons_pred2$uprG)+mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0], na.rm=T)*365

Expected yearly consumption
	Normal-to-Warm Y	Normal-to-Warm Y	Cold Y	Cold Y
	OLS_log smearing adj.	glm_gamma	OLS_log smearing adj.	glm_gamma
Prediction_upper	7371.5	6911	8456.6	7858
Prediction	6387.2	6378.6	7259.9	7233.7
Prediction_lower	5645.1	5894.9	6355.9	6634.5

We also calculated lower and upper bounds for the predicted annual electricity consumption (approximately 5,600 and 7,300 kWh for OLS log model, and 5,800 and 6,900 kWh for gamma model). Here we did model the 95% confidence interval for the regression line (as we are interested in overall average consumption). We could refine this further by running the same prediction for the past 10–20 years to see how much year-to-year temperatures—and therefore expected consumption—vary. For our current purposes, however, we will use just the data for one particularly cold year (2013) to see how much the electricity consumption might increase due to unexpectedly cold winter (2013 is considered to be the coldest year since 2010). It is noticeable that the consumption is in this cold winter scenario higher about 1000 kWh (difference is more pronounced for the upper estimate).

Because the OLS log model is fit on log(kW_heat), exponentiating predictions yields the median kW_heat on the original scale. To estimate the mean kW_heat, we apply Duan’s smearing correction. In this sample, alpha = 1.032, so mean predictions are about 3.2% higher than the naive back-transform. A reader will likely notice that gamma model offers a bit narrower intervals for the electricity consumption.

Consumption OLS log regression

Consumption, GAMA regression

El_cons_pred%>%ggplot(aes(date, fitG))+
  geom_line(aes(col="kWh"))+
  geom_line(aes(date, lwrG), col="gray")+
  geom_line(aes(date, uprG), col= "gray")+
  geom_line(aes(date, min_temp_c, col="min"), alpha=0.6)+
  geom_line(aes(date, max_temp_c, col="max"), alpha=0.6)+
  geom_hline(yintercept = 0, linetype=3)+
  scale_color_manual(values=c("min"="blue", "max"="red", "Kw"="black"))+
  theme_bw(base_size = 14)+
  ggtitle("Expected Daily Electricity Consumption of EHP")+
  labs(color='Data lines')+
  ylab("°C & El. consumption total kWh")

El_cons_pred2%>%ggplot(aes(date, fitG))+
  geom_line(aes(col="kWh"))+
  geom_line(aes(date, lwrG), col="gray")+
  geom_line(aes(date, uprG), col= "gray")+
  geom_line(aes(date, min_temp_c, col="min"), alpha=0.6)+
  geom_line(aes(date, max_temp_c, col="max"), alpha=0.6)+
  geom_hline(yintercept = 0, linetype=3)+
  scale_color_manual(values=c("min"="blue", "max"="red", "Kw"="black"))+
  theme_bw(base_size = 14)+
  ggtitle("Expected Daily Electricity Consumption of EHP - Cold Year")+
  labs(color='Data lines')+
  ylab("°C & El. consumption total kWh")

El_cons_pred%>%mutate(Month=month(date))%>%group_by(Month)%>%mutate(mean_temp_min=mean(min_temp_c))%>%
  ggplot(aes(as.factor(Month), fitG, group = Month))+
  geom_col(aes(fill=mean_temp_min))+
  scale_fill_continuous(low="blue", high="red3")+
  theme_bw(base_size=14)+
  ggtitle("Expected Monthly Electricity consumption of EHP")+
  labs(color='kWh consumption')+ 
  ylab("EHP consumption (kWh)")


El_cons_pred2%>%mutate(Month=month(date))%>%group_by(Month)%>%mutate(mean_temp_min=mean(min_temp_c))%>%
  ggplot(aes(as.factor(Month), fitG, group = Month))+
  geom_col(aes(fill=mean_temp_min))+
  scale_fill_continuous(low="blue", high="red3")+
  theme_bw(base_size=14)+
  ggtitle("Expected Monthly Electricity consumption of EHP - Cold Year")+
  labs(color='kWh consumption')+ 
  ylab("EHP consumption (kWh)")

Finally, we should note a few caveats. The prediction is based on a small sample—only 35 days with the EHP in operation. Our training data also tell us little about consumption during summer months, since we have no observations with temperatures above 12 °C. This means that predictions for warmer conditions—such as daily minimums of 15–20 °C and daily maximums of 25–30 °C—are driven less by data and more by the assumed shape of the regression curve.

In addition it is possible to switch the EHP off during several truly hot summer weeks, which could reduce overall costs.

Evaluating the model - out of sample data

When trying to build predictive models, the ideal process would involve validating the model on out-of sample data. Our original training dataset was relatively small (just 35 observations) making it impossible to leave out some data for ex-post validation of the model. With progressing time however new data (22 observations) were gathered on electricity consumption and the corresponding outside temperatures. We will measure the Mean Absolute Error (MAE) for the new (testing) dataset. Here we need to say that the testing sample has limited temperature variability (even though on some days temperatures dropped significantly below 0).

El_cons_test <- read_sheet(url, sheet = "Test")
El_cons_test <- El_cons_test%>%mutate(date = as.Date(date, format = "%d.%m.%Y"), min_temp_c=as.numeric(min_temp_c), max_temp_c=as.numeric(max_temp_c))

El_cons_test<-El_cons_test%>%mutate(min_temp_c_lead = dplyr::lead(min_temp_c, n=1))


El_cons_test[c("fit","lwr","upr")]<-predict(Grm_m3, newdata=El_cons_test, interval="prediction", level = 0.90)

El_cons_test$fitG<-predict(Grama_m3, newdata=El_cons_test, type = "response")


El_cons_test <- El_cons_test %>%
  mutate(
    Pred_kW_mean     = smear * exp(fit) + mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0]),
    Pred_kW_mean_L   = smear * exp(lwr) + mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0]),
    Pred_kW_mean_U   = smear * exp(upr) + mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0]),
    Pred_kW_meanG = fitG+
mean(Grandma_el_cons$kWh[Grandma_el_cons$EHP==0])
     )

show(El_cons_test)

## # A tibble: 23 × 14
##    date       min_temp_c max_temp_c   EHP   kWh min_temp_c_lead   fit   lwr
##    <date>          <dbl>      <dbl> <dbl> <dbl>           <dbl> <dbl> <dbl>
##  1 2025-12-25       -5          0       1    37            -3.9  3.30  2.78
##  2 2025-12-26       -3.9        2.7     1    35            -4.2  3.22  2.69
##  3 2025-12-27       -4.2        4.7     1    31            -3.9  3.12  2.58
##  4 2025-12-28       -3.9        1.4     1    33            -2.9  3.18  2.67
##  5 2025-12-29       -2.9        1.5     1    31            -4.9  3.31  2.78
##  6 2025-12-30       -4.9        1.3     1    34            -4.3  3.28  2.76
##  7 2025-12-31       -4.3        0.4     1    32             0.1  3.02  2.49
##  8 2026-01-01        0.1        2       1    30             0.4  2.94  2.42
##  9 2026-01-02        0.4        2.1     1    29            -2.2  3.11  2.60
## 10 2026-01-03       -2.2        1.5     1    33            -3.3  3.20  2.69
## # ℹ 13 more rows
## # ℹ 6 more variables: upr <dbl>, fitG <dbl>, Pred_kW_mean <dbl>,
## #   Pred_kW_mean_L <dbl>, Pred_kW_mean_U <dbl>, Pred_kW_meanG <dbl>

MAE_OLS<-mean(abs(El_cons_test$kWh-El_cons_test$Pred_kW_mean), na.rm=T)
MAE_GAMMA<-mean(abs(El_cons_test$kWh-El_cons_test$Pred_kW_meanG), na.rm=T)

El_cons_test%>%
  ggplot(aes(Pred_kW_mean, kWh))+
  geom_point()+
  geom_smooth(method="lm")+
  geom_abline(intercept=0, slope=1, linetype=2, col="red")+
  theme_bw()+
  ggtitle("OLS log model predictions vs. reality")

El_cons_test%>%
  ggplot(aes(Pred_kW_meanG, kWh))+
  geom_point()+
  geom_smooth(method="lm")+
  geom_abline(intercept=0, slope=1, linetype=2, col="red")+
  theme_bw()+
  ggtitle("OLS log model predictions vs. reality")

The MAE’s for the OLS log and Gamma models are 3.58 and 3.42 respectively (the error is ~ 3.5 kWh on average). It seems Gamma model is slightly better, even though the difference is rather small. Based on this evidence one could argue that the annual consumptions might be eventually closer to the estimates stemming from the Gamma model (those are a bit lower than the OLS log estimates).

Beyond the MAE, we provide (see above) also two charts comparing fit between the predicted and observed consumption for the OLS log and the Gamma regression. Clearly, both regressions are similar - at least when it comes to the predictions and corresponding errors. It is easy to see that both models exaggerate consumption and the higher the prediction, bigger the error. In plain english, the prediction is systematically much higher than the realized consumption, for very cold conditions. One can ask, how this is possible?

First, the owner decided to use small wood stove in the kitchen when the temperatures sunk below ~ -6°C. This obviously reduced electricity consumption a bit. Second, for most of those (very cold) days there was a snow cover over the rooftop - acting as an isolation preventing heat losses. One can imagine both models could perform even better if we had data capturing days with snow-cover.

Overall, however, both models seem to do quite a good job given the initial data limitations.

House_electricity_consumption

JK

2026-01-16

The Puzzle

Data

Figure

Table

Are there any patterns?

Table lm log model

Table glm gamma

Prediction lm log

Prediction glm gamma

OLS diagnostic

Gama diagnostic

Predicting electricity consumption

Figure

Table

Consumption OLS log regression

Consumption, GAMA regression

Evaluating the model - out of sample data