CUNY 606

This project will explore the following question: Are retail sales affected by the outside temperature and fuel prices?

Load libraries

library(dplyr)
library(RMySQL)
library(dbplyr)
library(ggplot2)
library(hexbin)
library(broom)
library(knitr)

Acquire the data

The original data was procured from https://www.kaggle.com/manjeetsingh/retaildataset#stores%20data-set.csv and hosted on Github. But because the files are so large and time-consuming to pull directly into R, a remote MySQL database was created using the following code (commented out because it does not need to be re-run):

# drv <- dbDriver("MySQL")
# con <- dbConnect(drv, host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
# query <- "DROP DATABASE IF EXISTS retail_sales;"
# dbExecute(con, query)
# query <- "CREATE DATABASE retail_sales DEFAULT CHARACTER SET utf8;"
# dbExecute(con, query)
# 
# con <- dbConnect(drv, dbname="retail_sales", host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
# sales_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/sales.csv', stringsAsFactors = F)
# dbWriteTable(con, "sales", sales_df, row.names=F)
# features_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/features.csv', stringsAsFactors = F)
# dbWriteTable(con, "features", features_df, row.names=F)
# stores_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/stores.csv', stringsAsFactors = F)
# dbWriteTable(con, "stores", stores_df, row.names=F)
# dbDisconnect(con)

Now that the database and tables are created, we can load them into R:

retail_sales_db <- src_mysql(dbname="retail_sales", host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
sales_df <- tbl(retail_sales_db, "sales") %>% collect(n=Inf)
features_df <- tbl(retail_sales_db, "features") %>% collect(n=Inf)
stores_df <- tbl(retail_sales_db, "stores") %>% collect(n=Inf)

Exploring the data

We start by examining the tables.

sales_df %>% head() %>% kable()

Store	Dept	Date	Weekly_Sales
1	1	05/02/2010	24924.50
1	1	12/02/2010	46039.49
1	1	19/02/2010	41595.55
1	1	26/02/2010	19403.54
1	1	05/03/2010	21827.90
1	1	12/03/2010	21043.39

features_df %>% head() %>% kable()

Store	Date	Temperature	Fuel_Price	MarkDown1	MarkDown2	MarkDown3	MarkDown4	MarkDown5	CPI	Unemployment
1	05/02/2010	42.31	2.572	NA	NA	NA	NA	NA	211.0964	8.106
1	12/02/2010	38.51	2.548	NA	NA	NA	NA	NA	211.2422	8.106
1	19/02/2010	39.93	2.514	NA	NA	NA	NA	NA	211.2891	8.106
1	26/02/2010	46.63	2.561	NA	NA	NA	NA	NA	211.3196	8.106
1	05/03/2010	46.50	2.625	NA	NA	NA	NA	NA	211.3501	8.106
1	12/03/2010	57.79	2.667	NA	NA	NA	NA	NA	211.3806	8.106

stores_df %>% head() %>% kable()

Store	Type	Size
1	A	151315
2	A	202307
3	B	37392
4	A	205863
5	B	34875
6	A	202505

We would like to have our indepedent and dependent variables in the same data frame, so we’ll join the appropriate columns from features_df to sales_df.

df <- sales_df %>% 
  left_join(features_df %>% select(Store, Date, Temperature, Fuel_Price), by=c("Store", "Date"))
df %>% head() %>% kable()

Store	Dept	Date	Weekly_Sales	Temperature	Fuel_Price
1	1	05/02/2010	24924.50	42.31	2.572
1	1	12/02/2010	46039.49	38.51	2.548
1	1	19/02/2010	41595.55	39.93	2.514
1	1	26/02/2010	19403.54	46.63	2.561
1	1	05/03/2010	21827.90	46.50	2.625
1	1	12/03/2010	21043.39	57.79	2.667

Next we would like to turn Fuel Price into a simple qualitative variable based on whether it is “high” or “low”. We do that by creating a dummy variable, split by the median.

df %>% select(Fuel_Price) %>% summary()

##    Fuel_Price   
##  Min.   :2.472  
##  1st Qu.:2.933  
##  Median :3.452  
##  Mean   :3.361  
##  3rd Qu.:3.738  
##  Max.   :4.468

df <- df %>% mutate(fp_high = ifelse(Fuel_Price>=median(Fuel_Price), 1, 0))
df %>% head() %>% kable()

Store	Dept	Date	Weekly_Sales	Temperature	Fuel_Price
1	1	05/02/2010	24924.50	42.31	2.572
1	1	12/02/2010	46039.49	38.51	2.548
1	1	19/02/2010	41595.55	39.93	2.514
1	1	26/02/2010	19403.54	46.63	2.561
1	1	05/03/2010	21827.90	46.50	2.625
1	1	12/03/2010	21043.39	57.79	2.667

df %>% nrow()

## [1] 421570

Each case (row) is a weekly data point indicating the date, as well as the dependent (Temperature and fp_high) and independent (Weekly_Sales) variables. There are over 400 thousand cases!

Visualizing the data

Now that our data are clean and tidy, we explore a few plots of the variables.

A histogram of Weekly Sales indicates an extremely non-normal distribution (very close to a Gamma distribution with shape=0.5 and rate=1/32000). We transform the variable by taking it to the power of 1/6. The new histogram looks more Gaussian, though not perfect.

df %>% ggplot(aes(Weekly_Sales)) + geom_histogram(bins=50)

df <- df %>% filter(Weekly_Sales>0) %>% mutate(Weekly_Sales = Weekly_Sales^(1/6))
df %>% ggplot(aes(Weekly_Sales)) + geom_histogram(bins=50)

We also examine a histogram of Temperature, as well as a scatterplot of Temperature and Weekly Sales. (The latter is an expensive and somewhat fruitless computation with ~400 thousand rows, so we instead use a a small random sample and then a hexagonal bin plot on the whole dataset.)

df %>% ggplot(aes(Temperature)) + geom_histogram(bins=20)

df %>% sample_frac(0.01) %>% 
  ggplot(aes(x=Temperature, y=Weekly_Sales)) + geom_point() + geom_jitter() + geom_smooth(method = "loess")

df %>% ggplot(aes(x=Temperature, y=Weekly_Sales)) + stat_binhex()

There does not appear to be much of a relationship between Temperature and Weekly Sales.

Fitting a model

Now we fit a multiple linear regression model to formally examine the relationship between Weekly Sales (transformed) and Temperature + Fuel Prices. We also examine the residual plots.

lm_fit <- lm(Weekly_Sales ~ Temperature + fp_high, data=df)
summary(lm_fit)

## 
## Call:
## lm(formula = Weekly_Sales ~ Temperature + fp_high, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9517 -0.7746  0.0872  0.8689  5.0458 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.4738063  0.0065693 681.012   <2e-16 ***
## Temperature -0.0020299  0.0001074 -18.909   <2e-16 ***
## fp_high      0.0028467  0.0039609   0.719    0.472    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.247 on 420209 degrees of freedom
## Multiple R-squared:  0.0008863,  Adjusted R-squared:  0.0008815 
## F-statistic: 186.4 on 2 and 420209 DF,  p-value: < 2.2e-16

lm_df <- augment(lm_fit)

### residual plot
lm_df %>% 
  ggplot(aes(x = .fitted, y = .resid)) + 
  stat_binhex() + 
  geom_hline(yintercept = 0)

### residual histogram
lm_df %>% 
  ggplot(aes(.resid)) + 
  geom_histogram(bins = 50)

### residual qqplot
lm_df %>% sample_frac(0.01) %>% 
  ggplot(aes(sample = .resid)) + 
  stat_qq() + 
  stat_qq_line()

The residuals look okay (remember, we transformed Weekly Sales for a reason) but the overall fit is terrible. Looking at the summary of the linear model, we see the p-value for the qualitative Fuel Price variable is high. And the adjusted R-squared is very low, indicating that almost none of the variance in Weekly Sales (to the power of 1/6) is described by the variance in Temperature plus Fuel Price category.

Summary

We acquired a large set of retail sales data, created a remote MySQL database, and loaded the data into R. We started with a theory that retail sales might be correlated with fuel prices and outside temperature. After visualizing the data, transforming the response variable, and then building a multiple linear regression model, it turns out we appear to be wrong! This is not necessarily a bad thing.

Theories try and fail all the time. It is important to report poor models just as it is to report good models. Otherwise the statistical community would (and very much does) suffer from “publishing bias” where people report findings only when they are successful, thereby incorrectly skewing our priors.

Future studies of retail sales might examine predictors such as time of year, holiday schedule, and economic indicators.