library("dplyr")
## Warning: package 'dplyr' was built under R version 3.6.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.6.2
library("tidyr")
## Warning: package 'tidyr' was built under R version 3.6.2
#Reading Data and Merging
features <- read.csv('features.csv')
head(features)
## Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3
## 1 1 2010-02-05 42.31 2.572 NA NA NA
## 2 1 2010-02-12 38.51 2.548 NA NA NA
## 3 1 2010-02-19 39.93 2.514 NA NA NA
## 4 1 2010-02-26 46.63 2.561 NA NA NA
## 5 1 2010-03-05 46.50 2.625 NA NA NA
## 6 1 2010-03-12 57.79 2.667 NA NA NA
## MarkDown4 MarkDown5 CPI Unemployment IsHoliday
## 1 NA NA 211.0964 8.106 FALSE
## 2 NA NA 211.2422 8.106 TRUE
## 3 NA NA 211.2891 8.106 FALSE
## 4 NA NA 211.3196 8.106 FALSE
## 5 NA NA 211.3501 8.106 FALSE
## 6 NA NA 211.3806 8.106 FALSE
features$Date <- as.Date(features$Date, format = "%Y-%m-%d")
features <- separate(features, "Date", c("Year", "Month", "Day"), sep = "-")
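# Monthly averages of the feature columns per store, restricted to 2010-2012
features_by_YMS <- features %>%
  filter(Year %in% c("2010", "2011", "2012")) %>%  # Year is character after separate()
  group_by(Year, Month, Store) %>%
  arrange(Store, Year, Month) %>%
  summarize(MeanFuel = round(mean(Fuel_Price), 2),
            # digits was left empty in the original call; 2 matches the other columns
            MeanTemp = round(mean(Temperature), 2),
            CPIM = round(mean(CPI), 2),
            UnemploymentM = round(mean(Unemployment), 2))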
train <- read.csv("train.csv")
train$Date <- as.Date(train$Date, format = "%Y-%m-%d")
train <- separate(train, "Date", c("Year", "Month", "Day"), sep = '-')
train_by_YMS <- train %>%
  group_by(Year, Month, Store) %>%
  arrange(Year, Month, Store) %>%
  summarize(AvgSales = mean(Weekly_Sales))
trainM <- merge(train_by_YMS, features_by_YMS)
stores <- read.csv("stores.csv")
trainM <- merge(trainM, stores, by = "Store")
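Because `merge()` performs an inner join on every shared column by default, rows without a match are dropped silently. The sketch below shows one way to make the keys explicit and list the rows that would be lost. The data frames `sales` and `feats` are made-up stand-ins for `train_by_YMS` and `features_by_YMS`, not the real data.

```r
library(dplyr)

# Toy stand-ins for train_by_YMS and features_by_YMS (values are invented)
sales <- data.frame(Year = c("2010", "2010", "2010"),
                    Month = c("02", "03", "04"),
                    Store = c(1, 1, 1),
                    AvgSales = c(100, 120, 90),
                    stringsAsFactors = FALSE)
feats <- data.frame(Year = c("2010", "2010"),
                    Month = c("02", "03"),
                    Store = c(1, 1),
                    MeanFuel = c(2.55, 2.65),
                    stringsAsFactors = FALSE)

keys <- c("Year", "Month", "Store")

# Inner join on explicitly named keys
merged <- merge(sales, feats, by = keys)

# Rows of `sales` that have no partner in `feats` and are dropped by the merge
dropped <- anti_join(sales, feats, by = keys)

nrow(merged)
dropped
```

Running the same `anti_join()` check on the real summaries would show whether any Year/Month/Store combinations disappear in the merge.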
par(mfrow=c(3,2))
hist(trainM$AvgSales)
hist(trainM$MeanFuel)
hist(trainM$MeanTemp)
hist(trainM$CPIM)
hist(trainM$UnemploymentM)
ggplot(trainM, aes(x = Month, y = AvgSales)) +
  geom_col() +
  facet_wrap(~Year)
1. Training data is missing for January 2010 (the data starts on 2010-02-05) and for November and December 2012.
2. Sales clearly peak in November and December. This is likely due to the holiday weeks in these two months.
3. There also seems to be a dip in September and October, leading up to the holiday weeks.
We need to investigate these patterns further.
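The gaps in coverage can also be listed programmatically rather than read off the plot. The following is a minimal base R sketch on invented data: `observed` stands in for the Year and Month columns of `trainM`, and the three-month calendar is shortened for illustration.

```r
# Toy stand-in for the Year/Month columns of trainM (coverage is invented)
observed <- data.frame(Year = c("2010", "2010", "2011"),
                       Month = c("02", "03", "01"),
                       stringsAsFactors = FALSE)

# Every Year/Month pair expected if coverage were complete
expected <- expand.grid(Year = c("2010", "2011"),
                        Month = sprintf("%02d", 1:3),
                        stringsAsFactors = FALSE)

# Keep the expected pairs that never occur in the observed data
seen <- paste(observed$Year, observed$Month)
gaps <- expected[!(paste(expected$Year, expected$Month) %in% seen), ]
gaps <- gaps[order(gaps$Year, gaps$Month), ]
gaps
```

On the real data, `expected` would span 2010 to 2012 and all twelve months.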
ggplot(trainM, aes(x = Size, y = AvgSales)) +
  geom_point() +
  ggtitle("Store Size vs Sales")
Plotting store size against sales, we can observe a roughly linear relationship, although it is not very clear. There could be strata or confounding here. This needs to be investigated further.
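One way to probe for confounding is to compare the Size slope from a pooled regression with the slope after adjusting for a candidate confounder such as store type. The sketch below uses simulated data in which Type affects both Size and sales by construction. The numbers are assumptions for illustration and say nothing about the real dataset.

```r
set.seed(1)

# Simulated data: Type A stores are larger AND get a sales bonus of 4
toy <- data.frame(Type = rep(c("A", "B"), each = 20),
                  Size = c(runif(20, 150, 220), runif(20, 60, 130)),
                  stringsAsFactors = FALSE)
toy$AvgSales <- 5 + 0.1 * toy$Size + ifelse(toy$Type == "A", 4, 0) + rnorm(40)

# Pooled fit ignores Type; adjusted fit includes it as a covariate
pooled   <- lm(AvgSales ~ Size, data = toy)
adjusted <- lm(AvgSales ~ Size + Type, data = toy)

# A noticeable change in the Size slope between the fits hints at confounding
coef(pooled)["Size"]
coef(adjusted)["Size"]
```

If the two slopes were close on the real `trainM`, store type would be an unlikely explanation for the scatter. A large gap would suggest modelling the strata separately.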
ggplot(trainM, aes(x = Type, y = AvgSales, size = Size)) +
  geom_point() +
  ggtitle("Store Type vs Sales")
ggplot(trainM, aes(x = Type, y = AvgSales)) +
  geom_boxplot()
In the first plot, we have sales vs store type, with the size of each dot indicating the store size (the Size column). We can see that Type appears to be a classification of stores based on size and sales. However, there is some overlap between types, which may be due to errors in the dataset. This needs further investigation.
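The overlap between types can be quantified by comparing the size range of each type. The sketch below uses an invented `stores_toy` table with the same column names as `stores.csv`. The sizes are assumptions, chosen so that one Type A store is unusually small.

```r
# Invented store table (same column names as stores.csv, made-up sizes)
stores_toy <- data.frame(Store = 1:6,
                         Type = c("A", "A", "B", "B", "C", "C"),
                         Size = c(200, 40, 120, 100, 45, 40),
                         stringsAsFactors = FALSE)

# Smallest and largest store size within each type
size_min <- tapply(stores_toy$Size, stores_toy$Type, min)
size_max <- tapply(stores_toy$Size, stores_toy$Type, max)

# Types A and B overlap if the smallest A store is no larger than the largest B store
overlap_AB <- size_min[["A"]] <= size_max[["B"]]

# Stores of type A that fall inside the B size range are candidates for review
suspects <- stores_toy[stores_toy$Type == "A" & stores_toy$Size <= size_max[["B"]], ]
suspects
```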
ggplot(trainM, aes(x = MeanTemp, y = AvgSales)) +
  geom_point() +
  facet_wrap(~Month)
ggplot(trainM, aes(x = MeanFuel, y = AvgSales)) +
  geom_point() +
  facet_wrap(~Year)
With these plots, we can see that fuel price and temperature appear to have some influence on sales. But as we saw earlier, these ups and downs might be caused by holidays or some other factor. We need to check for multicollinearity here.
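A common first check for multicollinearity is a correlation matrix of the predictors, followed by variance inflation factors (VIF). The sketch below computes VIF by hand in base R as 1 / (1 - R^2), where R^2 comes from regressing each predictor on the others. The data are simulated, with MeanFuel deliberately tied to MeanTemp, so the values illustrate the method only.

```r
set.seed(42)
n <- 100

# Simulated predictors: MeanFuel is constructed to depend on MeanTemp
toy <- data.frame(MeanTemp = rnorm(n, mean = 60, sd = 15))
toy$MeanFuel <- 3 + 0.01 * toy$MeanTemp + rnorm(n, sd = 0.2)
toy$CPIM <- rnorm(n, mean = 170, sd = 30)

predictors <- c("MeanTemp", "MeanFuel", "CPIM")

# Pairwise correlations between predictors
round(cor(toy[, predictors]), 2)

# VIF for each predictor: regress it on the remaining predictors
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  fit <- lm(reformulate(others, response = p), data = toy)
  1 / (1 - summary(fit)$r.squared)
})
vif
```

A VIF near 1 indicates little linear dependence on the other predictors. Values well above that (thresholds of 5 or 10 are often quoted as rules of thumb) would suggest dropping or combining variables.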
ggplot(trainM, aes(x = CPIM, y = AvgSales, color = Type)) +
  geom_point()
With increasing CPI, sales appear to drop for Type B stores, but not for the other types. This needs further investigation.
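A type-specific pattern like this can be checked by fitting a separate slope of sales on CPI for each store type. The sketch below does so on a small noise-free toy table in which only Type B has a negative slope by construction. The values are assumptions, not results from the real data.

```r
# Noise-free toy data: only Type B sales decline with CPI (by construction)
toy <- data.frame(Type = rep(c("A", "B", "C"), each = 4),
                  CPIM = rep(c(130, 160, 190, 220), times = 3),
                  stringsAsFactors = FALSE)
toy$AvgSales <- ifelse(toy$Type == "B", 30 - 0.05 * toy$CPIM, 20)

# Fit AvgSales ~ CPIM within each type and keep the CPI slope
slopes <- sapply(split(toy, toy$Type), function(d) {
  unname(coef(lm(AvgSales ~ CPIM, data = d))["CPIM"])
})
slopes
```

On the real `trainM`, an alternative would be a single model with an interaction term, `lm(AvgSales ~ CPIM * Type)`, which also provides standard errors for the differences in slope.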
ggplot(trainM, aes(x = UnemploymentM, y = AvgSales)) +
  geom_point() +
  facet_wrap(~Year)
Here we can see sales dipping when unemployment rises above 12.
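To put a number on this dip, unemployment can be binned and the average sales compared across bins. The sketch below uses base R `cut()` and `tapply()` on invented values. The break points, including the cut at 12, are illustrative choices.

```r
# Invented values for illustration only
toy <- data.frame(UnemploymentM = c(6.5, 7.8, 9.1, 11.4, 12.6, 13.9),
                  AvgSales = c(18, 17, 16, 15, 9, 8))

# Bin unemployment into bands; intervals are right-closed, e.g. (10, 12]
toy$Band <- cut(toy$UnemploymentM,
                breaks = c(-Inf, 8, 10, 12, Inf),
                labels = c("<=8", "8-10", "10-12", ">12"))

# Mean sales within each unemployment band
band_means <- tapply(toy$AvgSales, toy$Band, mean)
band_means
```

Applying the same binning to `trainM` would show whether the drop above 12 is a sharp threshold or part of a gradual decline.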