This project explores the following question: are retail sales affected by outside temperature and fuel prices?


Loading libraries

library(dplyr)
library(RMySQL)
library(dbplyr)
library(ggplot2)
library(hexbin)
library(broom)
library(knitr)


Acquiring the data

The original data were obtained from https://www.kaggle.com/manjeetsingh/retaildataset#stores%20data-set.csv and hosted on GitHub. Because the files are large and slow to pull directly into R, a remote MySQL database was created using the following code (commented out because it does not need to be re-run):

# drv <- dbDriver("MySQL")
# con <- dbConnect(drv, host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
# query <- "DROP DATABASE IF EXISTS retail_sales;"
# dbExecute(con, query)
# query <- "CREATE DATABASE retail_sales DEFAULT CHARACTER SET utf8;"
# dbExecute(con, query)
# 
# con <- dbConnect(drv, dbname="retail_sales", host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
# sales_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/sales.csv', stringsAsFactors = F)
# dbWriteTable(con, "sales", sales_df, row.names=F)
# features_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/features.csv', stringsAsFactors = F)
# dbWriteTable(con, "features", features_df, row.names=F)
# stores_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/stores.csv', stringsAsFactors = F)
# dbWriteTable(con, "stores", stores_df, row.names=F)
# dbDisconnect(con)

Now that the database and tables are created, we can load them into R:

retail_sales_db <- src_mysql(dbname="retail_sales", host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
sales_df <- tbl(retail_sales_db, "sales") %>% collect(n=Inf)
features_df <- tbl(retail_sales_db, "features") %>% collect(n=Inf)
stores_df <- tbl(retail_sales_db, "stores") %>% collect(n=Inf)


Exploring the data

We start by examining the tables.

sales_df %>% head() %>% kable()
Store  Dept  Date        Weekly_Sales  IsHoliday
    1     1  05/02/2010      24924.50          0
    1     1  12/02/2010      46039.49          0
    1     1  19/02/2010      41595.55          0
    1     1  26/02/2010      19403.54          0
    1     1  05/03/2010      21827.90          0
    1     1  12/03/2010      21043.39          0
features_df %>% head() %>% kable()
Store  Date        Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5       CPI  Unemployment  IsHoliday
    1  05/02/2010        42.31       2.572         NA         NA         NA         NA         NA  211.0964         8.106          0
    1  12/02/2010        38.51       2.548         NA         NA         NA         NA         NA  211.2422         8.106          0
    1  19/02/2010        39.93       2.514         NA         NA         NA         NA         NA  211.2891         8.106          0
    1  26/02/2010        46.63       2.561         NA         NA         NA         NA         NA  211.3196         8.106          0
    1  05/03/2010        46.50       2.625         NA         NA         NA         NA         NA  211.3501         8.106          0
    1  12/03/2010        57.79       2.667         NA         NA         NA         NA         NA  211.3806         8.106          0
stores_df %>% head() %>% kable()
Store  Type    Size
    1  A     151315
    2  A     202307
    3  B      37392
    4  A     205863
    5  B      34875
    6  A     202505

We would like to have our independent and dependent variables in the same data frame, so we’ll join the appropriate columns from features_df to sales_df.

df <- sales_df %>% 
  left_join(features_df %>% select(Store, Date, Temperature, Fuel_Price), by=c("Store", "Date"))
df %>% head() %>% kable()
Store  Dept  Date        Weekly_Sales  IsHoliday  Temperature  Fuel_Price
    1     1  05/02/2010      24924.50          0        42.31       2.572
    1     1  12/02/2010      46039.49          0        38.51       2.548
    1     1  19/02/2010      41595.55          0        39.93       2.514
    1     1  26/02/2010      19403.54          0        46.63       2.561
    1     1  05/03/2010      21827.90          0        46.50       2.625
    1     1  12/03/2010      21043.39          0        57.79       2.667

Next we would like to turn Fuel_Price into a simple qualitative variable indicating whether it is “high” or “low”. We do that by creating a dummy variable split at the median.

df %>% select(Fuel_Price) %>% summary()
##    Fuel_Price   
##  Min.   :2.472  
##  1st Qu.:2.933  
##  Median :3.452  
##  Mean   :3.361  
##  3rd Qu.:3.738  
##  Max.   :4.468
df <- df %>% mutate(fp_high = ifelse(Fuel_Price>=median(Fuel_Price), 1, 0))
df %>% head() %>% kable()
Store  Dept  Date        Weekly_Sales  IsHoliday  Temperature  Fuel_Price  fp_high
    1     1  05/02/2010      24924.50          0        42.31       2.572        0
    1     1  12/02/2010      46039.49          0        38.51       2.548        0
    1     1  19/02/2010      41595.55          0        39.93       2.514        0
    1     1  26/02/2010      19403.54          0        46.63       2.561        0
    1     1  05/03/2010      21827.90          0        46.50       2.625        0
    1     1  12/03/2010      21043.39          0        57.79       2.667        0
df %>% nrow()
## [1] 421570

Each case (row) is a weekly data point containing the date, the independent variables (Temperature and fp_high), and the dependent variable (Weekly_Sales). There are over 400 thousand cases!
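
A quick structural check of the joined data frame confirms the variable types and that the join did not introduce missing values. This is just a sanity-check sketch using functions from the already-loaded dplyr package:

### sketch: sanity checks on the joined data
glimpse(df)
df %>% summarise(n = n(), 
                 missing_temp = sum(is.na(Temperature)), 
                 missing_fuel = sum(is.na(Fuel_Price)))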


Visualizing the data

Now that our data are clean and tidy, we explore a few plots of the variables.

A histogram of Weekly Sales shows an extremely non-normal distribution (very close to a Gamma distribution with shape = 0.5 and rate = 1/32000). After removing the small number of non-positive sales values, we transform the variable by raising it to the power of 1/6. The new histogram looks more Gaussian, though not perfect.

df %>% ggplot(aes(Weekly_Sales)) + geom_histogram(bins=50)

df <- df %>% filter(Weekly_Sales>0) %>% mutate(Weekly_Sales = Weekly_Sales^(1/6))
df %>% ggplot(aes(Weekly_Sales)) + geom_histogram(bins=50)
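
As an optional check on the Gamma comparison mentioned above, the quoted density could be overlaid on the untransformed histogram. This is only a sketch (the shape and rate values are the ones quoted above, not fitted here); since df has already been transformed, we raise Weekly_Sales back to the 6th power:

### sketch: compare raw Weekly_Sales to a Gamma(shape=0.5, rate=1/32000) density
df %>% 
  ggplot(aes(Weekly_Sales^6)) + 
  geom_histogram(aes(y = ..density..), bins=50) + 
  stat_function(fun = dgamma, args = list(shape = 0.5, rate = 1/32000), colour = "red")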

We also examine a histogram of Temperature, as well as a scatterplot of Temperature against Weekly Sales. (The latter is an expensive and somewhat fruitless computation with ~400 thousand rows, so we instead use a small random sample for the scatterplot and a hexagonal bin plot for the whole dataset.)

df %>% ggplot(aes(Temperature)) + geom_histogram(bins=20)

df %>% sample_frac(0.01) %>% 
  ggplot(aes(x=Temperature, y=Weekly_Sales)) + geom_jitter() + geom_smooth(method = "loess")

df %>% ggplot(aes(x=Temperature, y=Weekly_Sales)) + stat_binhex()

There does not appear to be much of a relationship between Temperature and Weekly Sales.
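
As a quick numerical companion to the plots, the simple correlation between the two variables could also be computed — a minimal sketch:

### sketch: Pearson correlation between Temperature and transformed Weekly_Sales
cor(df$Temperature, df$Weekly_Sales)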


Fitting a model

Now we fit a multiple linear regression model to formally examine the relationship between the transformed Weekly Sales and both Temperature and the Fuel Price category. We also examine the residual plots.

lm_fit <- lm(Weekly_Sales ~ Temperature + fp_high, data=df)
summary(lm_fit)
## 
## Call:
## lm(formula = Weekly_Sales ~ Temperature + fp_high, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9517 -0.7746  0.0872  0.8689  5.0458 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.4738063  0.0065693 681.012   <2e-16 ***
## Temperature -0.0020299  0.0001074 -18.909   <2e-16 ***
## fp_high      0.0028467  0.0039609   0.719    0.472    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.247 on 420209 degrees of freedom
## Multiple R-squared:  0.0008863,  Adjusted R-squared:  0.0008815 
## F-statistic: 186.4 on 2 and 420209 DF,  p-value: < 2.2e-16
lm_df <- augment(lm_fit)

### residual plot
lm_df %>% 
  ggplot(aes(x = .fitted, y = .resid)) + 
  stat_binhex() + 
  geom_hline(yintercept = 0)

### residual histogram
lm_df %>% 
  ggplot(aes(.resid)) + 
  geom_histogram(bins = 50)

### residual qqplot
lm_df %>% sample_frac(0.01) %>% 
  ggplot(aes(sample = .resid)) + 
  stat_qq() + 
  stat_qq_line()

The residuals look okay (remember, we transformed Weekly Sales for a reason), but the overall fit is terrible. Looking at the summary of the linear model, we see the p-value for the qualitative Fuel Price variable is high, and the adjusted R-squared is very low, indicating that almost none of the variance in Weekly Sales (to the power of 1/6) is explained by Temperature and the Fuel Price category.
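
For a more compact view of these fit statistics, the broom package (loaded above) could be used — a minimal sketch:

### sketch: tidy model summaries via broom
tidy(lm_fit)     # coefficient estimates, standard errors, and p-values
glance(lm_fit)   # adjusted R-squared, F-statistic, and other fit statistics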


Summary

We acquired a large set of retail sales data, created a remote MySQL database, and loaded the data into R. We started with the theory that retail sales might be correlated with fuel prices and outside temperature. After visualizing the data, transforming the response variable, and building a multiple linear regression model, we appear to be wrong! This is not necessarily a bad thing.

Theories are proposed and fail all the time. It is just as important to report poor models as good ones. Otherwise the statistical community would (and very much does) suffer from “publication bias”, in which people report only successful findings, thereby incorrectly skewing our priors.

Future studies of retail sales might examine predictors such as time of year, holiday schedule, and economic indicators, as sketched below.
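
As one possible direction, the holiday flag already in df and the economic indicators available in features_df (CPI and Unemployment) could be added to the model. This is only a sketch of how such a follow-up might look, not a fitted result:

### sketch: a follow-up model with additional predictors (not run here)
df2 <- df %>% 
  left_join(features_df %>% select(Store, Date, CPI, Unemployment), by=c("Store", "Date"))
lm_fit2 <- lm(Weekly_Sales ~ Temperature + fp_high + IsHoliday + CPI + Unemployment, data=df2)
summary(lm_fit2)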