This project will explore the following question: Are retail sales affected by the outside temperature and fuel prices?
library(dplyr)
library(RMySQL)
library(dbplyr)
library(ggplot2)
library(hexbin)
library(broom)
library(knitr)
The original data was procured from https://www.kaggle.com/manjeetsingh/retaildataset#stores%20data-set.csv and hosted on Github. But because the files are so large and time-consuming to pull directly into R, a remote MySQL database was created using the following code (commented out because it does not need to be re-run):
# drv <- dbDriver("MySQL")
# con <- dbConnect(drv, host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
# query <- "DROP DATABASE IF EXISTS retail_sales;"
# dbExecute(con, query)
# query <- "CREATE DATABASE retail_sales DEFAULT CHARACTER SET utf8;"
# dbExecute(con, query)
#
# con <- dbConnect(drv, dbname="retail_sales", host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
# sales_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/sales.csv', stringsAsFactors = F)
# dbWriteTable(con, "sales", sales_df, row.names=F)
# features_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/features.csv', stringsAsFactors = F)
# dbWriteTable(con, "features", features_df, row.names=F)
# stores_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_606/master/final_project/stores.csv', stringsAsFactors = F)
# dbWriteTable(con, "stores", stores_df, row.names=F)
# dbDisconnect(con)
Now that the database and tables are created, we can load them into R:
retail_sales_db <- src_mysql(dbname="retail_sales", host = "mytestdb.c4udqcubz5hi.us-east-2.rds.amazonaws.com", port = 3306, user = "user", password = "password")
sales_df <- tbl(retail_sales_db, "sales") %>% collect(n=Inf)
features_df <- tbl(retail_sales_db, "features") %>% collect(n=Inf)
stores_df <- tbl(retail_sales_db, "stores") %>% collect(n=Inf)
We start by examining the tables.
sales_df %>% head() %>% kable()
| Store | Dept | Date | Weekly_Sales | IsHoliday |
|---|---|---|---|---|
| 1 | 1 | 05/02/2010 | 24924.50 | 0 |
| 1 | 1 | 12/02/2010 | 46039.49 | 0 |
| 1 | 1 | 19/02/2010 | 41595.55 | 0 |
| 1 | 1 | 26/02/2010 | 19403.54 | 0 |
| 1 | 1 | 05/03/2010 | 21827.90 | 0 |
| 1 | 1 | 12/03/2010 | 21043.39 | 0 |
features_df %>% head() %>% kable()
| Store | Date | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 05/02/2010 | 42.31 | 2.572 | NA | NA | NA | NA | NA | 211.0964 | 8.106 | 0 |
| 1 | 12/02/2010 | 38.51 | 2.548 | NA | NA | NA | NA | NA | 211.2422 | 8.106 | 0 |
| 1 | 19/02/2010 | 39.93 | 2.514 | NA | NA | NA | NA | NA | 211.2891 | 8.106 | 0 |
| 1 | 26/02/2010 | 46.63 | 2.561 | NA | NA | NA | NA | NA | 211.3196 | 8.106 | 0 |
| 1 | 05/03/2010 | 46.50 | 2.625 | NA | NA | NA | NA | NA | 211.3501 | 8.106 | 0 |
| 1 | 12/03/2010 | 57.79 | 2.667 | NA | NA | NA | NA | NA | 211.3806 | 8.106 | 0 |
stores_df %>% head() %>% kable()
| Store | Type | Size |
|---|---|---|
| 1 | A | 151315 |
| 2 | A | 202307 |
| 3 | B | 37392 |
| 4 | A | 205863 |
| 5 | B | 34875 |
| 6 | A | 202505 |
We would like to have our indepedent and dependent variables in the same data frame, so we’ll join the appropriate columns from features_df to sales_df.
df <- sales_df %>%
left_join(features_df %>% select(Store, Date, Temperature, Fuel_Price), by=c("Store", "Date"))
df %>% head() %>% kable()
| Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price |
|---|---|---|---|---|---|---|
| 1 | 1 | 05/02/2010 | 24924.50 | 0 | 42.31 | 2.572 |
| 1 | 1 | 12/02/2010 | 46039.49 | 0 | 38.51 | 2.548 |
| 1 | 1 | 19/02/2010 | 41595.55 | 0 | 39.93 | 2.514 |
| 1 | 1 | 26/02/2010 | 19403.54 | 0 | 46.63 | 2.561 |
| 1 | 1 | 05/03/2010 | 21827.90 | 0 | 46.50 | 2.625 |
| 1 | 1 | 12/03/2010 | 21043.39 | 0 | 57.79 | 2.667 |
Next we would like to turn Fuel Price into a simple qualitative variable based on whether it is “high” or “low”. We do that by creating a dummy variable, split by the median.
df %>% select(Fuel_Price) %>% summary()
## Fuel_Price
## Min. :2.472
## 1st Qu.:2.933
## Median :3.452
## Mean :3.361
## 3rd Qu.:3.738
## Max. :4.468
df <- df %>% mutate(fp_high = ifelse(Fuel_Price>=median(Fuel_Price), 1, 0))
df %>% head() %>% kable()
| Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | fp_high |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 05/02/2010 | 24924.50 | 0 | 42.31 | 2.572 | 0 |
| 1 | 1 | 12/02/2010 | 46039.49 | 0 | 38.51 | 2.548 | 0 |
| 1 | 1 | 19/02/2010 | 41595.55 | 0 | 39.93 | 2.514 | 0 |
| 1 | 1 | 26/02/2010 | 19403.54 | 0 | 46.63 | 2.561 | 0 |
| 1 | 1 | 05/03/2010 | 21827.90 | 0 | 46.50 | 2.625 | 0 |
| 1 | 1 | 12/03/2010 | 21043.39 | 0 | 57.79 | 2.667 | 0 |
df %>% nrow()
## [1] 421570
Each case (row) is a weekly data point indicating the date, as well as the dependent (Temperature and fp_high) and independent (Weekly_Sales) variables. There are over 400 thousand cases!
Now that our data are clean and tidy, we explore a few plots of the variables.
A histogram of Weekly Sales indicates an extremely non-normal distribution (very close to a Gamma distribution with shape=0.5 and rate=1/32000). We transform the variable by taking it to the power of 1/6. The new histogram looks more Gaussian, though not perfect.
df %>% ggplot(aes(Weekly_Sales)) + geom_histogram(bins=50)
df <- df %>% filter(Weekly_Sales>0) %>% mutate(Weekly_Sales = Weekly_Sales^(1/6))
df %>% ggplot(aes(Weekly_Sales)) + geom_histogram(bins=50)
We also examine a histogram of Temperature, as well as a scatterplot of Temperature and Weekly Sales. (The latter is an expensive and somewhat fruitless computation with ~400 thousand rows, so we instead use a a small random sample and then a hexagonal bin plot on the whole dataset.)
df %>% ggplot(aes(Temperature)) + geom_histogram(bins=20)
df %>% sample_frac(0.01) %>%
ggplot(aes(x=Temperature, y=Weekly_Sales)) + geom_point() + geom_jitter() + geom_smooth(method = "loess")
df %>% ggplot(aes(x=Temperature, y=Weekly_Sales)) + stat_binhex()
There does not appear to be much of a relationship between Temperature and Weekly Sales.
Now we fit a multiple linear regression model to formally examine the relationship between Weekly Sales (transformed) and Temperature + Fuel Prices. We also examine the residual plots.
lm_fit <- lm(Weekly_Sales ~ Temperature + fp_high, data=df)
summary(lm_fit)
##
## Call:
## lm(formula = Weekly_Sales ~ Temperature + fp_high, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9517 -0.7746 0.0872 0.8689 5.0458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4738063 0.0065693 681.012 <2e-16 ***
## Temperature -0.0020299 0.0001074 -18.909 <2e-16 ***
## fp_high 0.0028467 0.0039609 0.719 0.472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.247 on 420209 degrees of freedom
## Multiple R-squared: 0.0008863, Adjusted R-squared: 0.0008815
## F-statistic: 186.4 on 2 and 420209 DF, p-value: < 2.2e-16
lm_df <- augment(lm_fit)
### residual plot
lm_df %>%
ggplot(aes(x = .fitted, y = .resid)) +
stat_binhex() +
geom_hline(yintercept = 0)
### residual histogram
lm_df %>%
ggplot(aes(.resid)) +
geom_histogram(bins = 50)
### residual qqplot
lm_df %>% sample_frac(0.01) %>%
ggplot(aes(sample = .resid)) +
stat_qq() +
stat_qq_line()
The residuals look okay (remember, we transformed Weekly Sales for a reason) but the overall fit is terrible. Looking at the summary of the linear model, we see the p-value for the qualitative Fuel Price variable is high. And the adjusted R-squared is very low, indicating that almost none of the variance in Weekly Sales (to the power of 1/6) is described by the variance in Temperature plus Fuel Price category.
We acquired a large set of retail sales data, created a remote MySQL database, and loaded the data into R. We started with a theory that retail sales might be correlated with fuel prices and outside temperature. After visualizing the data, transforming the response variable, and then building a multiple linear regression model, it turns out we appear to be wrong! This is not necessarily a bad thing.
Theories try and fail all the time. It is important to report poor models just as it is to report good models. Otherwise the statistical community would (and very much does) suffer from “publishing bias” where people report findings only when they are successful, thereby incorrectly skewing our priors.
Future studies of retail sales might examine predictors such as time of year, holiday schedule, and economic indicators.