Note: This report is a work in progress

We will be analyzing a dataset obtained from the data.gov website. The information contained within provides a very comprehensive picture of energy consumption in the US.

The data is accompanied by a codebook where the variables and their encodings are explained.

library(readr)
library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(dplyr)
library(ggvis)
library(ggthemr)

First we load the dataset and the codebook.

df <- read_csv('data/recs2009_public.csv')
df_codebook <- read_excel('data/recs2009_public_codebook.xlsx')

# sanity checks
dim(df)
## [1] 12083   931

Before doing any further analysis, let’s try to find the answer to a test hypothesis: is the electricity bill higher for homes with more bedrooms? This might seem farily straightforward at first, but the results will be surprising.

ggthemr('fresh')
grouped_df <- df %>% group_by(BEDROOMS) %>% summarise_each(funs(mean))

ggplot(grouped_df, aes(x = BEDROOMS, y = DOLLAREL)) +
    geom_line(colour = "red") +
    geom_point() +
    ggtitle("Electricity Bill vs. Number of Bedrooms") +
    xlab("Number of Bedrooms") +
    ylab("Electricity Bill")

You can see that in general there is an increase in electricity consumption for households with more bedrooms. Interestingly, there is a decrease in cost for appartments with more than 10 bedrooms. One possible explanation is that they are just not used that often, hence there would be no increase in electricity bills.

Our next question that we will aim to answer is whether there is a corelaltion between the age of a house and its electricity bill.

First we will create a dataframe with the age in one column and the bill in another. For this we will need to do some data wrangling, and a custom function. After that we calculate the mean bill for a given year.

age_date <- df$YEARMADE
bill <- df$DOLLAREL

# convert age from date into variable
getYears <- function(year) {
    current_year <- as.integer(format(Sys.time(), "%Y"))
    return(current_year - year)
}

age <- unlist(lapply(age_date, getYears))

years_price_df <- data.frame(age, bill)
years_price_df$age <- factor(years_price_df$age)
years_price_df_means <- aggregate(years_price_df$bill ~ years_price_df$age, FUN = mean) # get mean bill per year

colnames(years_price_df_means) <- c("age", "mean_price")

And finally let’s plot our results, showing a linear regression fit.

# doing this with ggplot
ggplot(years_price_df_means, aes(x = as.numeric(age), y = mean_price, group = 1)) +
    geom_point() +
    geom_smooth(method = lm) +
    ggtitle("Housing Age vs. Electricity Bill") + 
    xlab("Age [years]") +
    theme(text = element_text(size = 10)) +
    scale_x_continuous(breaks = seq(5, 150, 10)) +
    ylab("Bill [$USD]")

We can also run a correlation test.

cor.test(as.numeric(years_price_df_means$age), years_price_df_means$mean_price)
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(years_price_df_means$age) and years_price_df_means$mean_price
## t = -7.5412, df = 88, p-value = 3.967e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7378941 -0.4819952
## sample estimates:
##        cor 
## -0.6265464

There actually seems to be a negative correlation between housing age and electricity bill!

Our next question is whether people who live in warmer states have to pay more for electricity. First we will get the information we need from the codebook and store it in a dataframe.

states_df <- as.data.frame(df_codebook[6,])
states_df <- states_df[3:4] 
names(states_df) <- c("codes", "names")

state_codes_vector <- as.numeric(unlist(strsplit(states_df$codes, "\r")))
state_names_vector <- unlist(strsplit(states_df$names, "\r"))
state_names_vector <- gsub(x = state_names_vector, "\n", "") # some cleaning

states_df_clean <- data.frame(state_codes_vector,
                           state_names_vector[2:length(state_names_vector)])
names(states_df_clean) <- c("Code", "State")

Then we will calculate the average bill per state.

state_price_df <- data.frame(df$REPORTABLE_DOMAIN, df$DOLLAREL)
names(state_price_df) <- c("state", "bill")
state_price_df_means <- aggregate(state_price_df$bill ~
                                  state_price_df$state,
                                  FUN = mean)

names(state_price_df_means) <- c("state", "avg_bill")
state_price_df_means$state <- as.factor(state_price_df_means$state)

And finally plot our results.

ggplot(state_price_df_means, aes(x = state, y = avg_bill)) +
    geom_point(size = 4, colour = "purple") +
    geom_segment(aes(xend=state), yend=0) + 
    ggtitle("Electricity expenditure among states") +
    xlab("State") + 
    ylab("Average bill [$USD]")

We would like to compare a warm state (for example 26, California), to a generally colder one, such as NY (1). An interesting observation here is that Florida homes have the highest bills. Maybe all that air conditioning can be the culprit?

And our final question would be if we can correlate appartment area with electricity bill. Our hypothesis would be that the larger the appartment, the higher the bill would be. In our dataset there are two variables that can be of interest in this case, besides the bill. These variables are total area and heated area. We will try to correlate both of them.

We will be needing to make two plots side by side, so we we will be using a handy function called multiplot.

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
    library(grid)
    
    # Make a list from the ... arguments and plotlist
    plots <- c(list(...), plotlist)
    
    numPlots = length(plots)
    
    # If layout is NULL, then use 'cols' to determine layout
    if (is.null(layout)) {
        # Make the panel
        # ncol: Number of columns of plots
        # nrow: Number of rows needed, calculated from # of cols
        layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                         ncol = cols, nrow = ceiling(numPlots/cols))
    }
    
    if (numPlots==1) {
        print(plots[[1]])
        
    } else {
        # Set up the page
        grid.newpage()
        pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
        
        # Make each plot, in the correct location
        for (i in 1:numPlots) {
            # Get the i,j matrix positions of the regions that contain this subplot
            matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
            
            print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                            layout.pos.col = matchidx$col))
        }
    }
}

Then we will proceed to subset our variables into a new dataframe.

square_area <- df$TOTSQFT
bill <- df$DOLLAREL
heated_area <- df$TOTHSQFT

# some cleaning
area_bill_df <- data.frame(square_area, heated_area, bill)
row_sub = apply(area_bill_df, 1, function(row) all(row !=0 ))
area_bill_df <- area_bill_df[row_sub,]
    
names(area_bill_df) <- c("area", "heated", "bill")

And finally let’s create our plots.

plot1 <- ggplot(area_bill_df, aes(area, bill)) +
    geom_point(alpha = 0.3) +
    ggtitle("Total Area vs Electricity Bill") + 
    xlab("Area [feet ^2]") +
    ylab("Electricity bill [$USD]")

plot2 <- ggplot(area_bill_df, aes(heated, bill)) +
    geom_point(alpha = 0.3) +
    ggtitle("Heated Area vs Electricity Bill") +
    xlab("Area [feet ^2]") +
    ylab("Electricity bill [$USD]")

multiplot(plot1, plot2, cols = 2)

It turns out there might be a weak positive correlation between area and bill. There doesn’t seem to be a large difference between total area and heated area, so we will disregard this for now.

We answered our questions for now, but of course there is still so much more to explore in this dataset. Some more advanced machine learning techniques like clustering can be implemented to see if we can find some hidden structures in the data and make predictions.