library("dplyr")
## Warning: package 'dplyr' was built under R version 3.6.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.6.2
library("tidyr")
## Warning: package 'tidyr' was built under R version 3.6.2
# Reading Data and Merging
features <- read.csv('features.csv')
head(features)
##   Store       Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3
## 1     1 2010-02-05       42.31      2.572        NA        NA        NA
## 2     1 2010-02-12       38.51      2.548        NA        NA        NA
## 3     1 2010-02-19       39.93      2.514        NA        NA        NA
## 4     1 2010-02-26       46.63      2.561        NA        NA        NA
## 5     1 2010-03-05       46.50      2.625        NA        NA        NA
## 6     1 2010-03-12       57.79      2.667        NA        NA        NA
##   MarkDown4 MarkDown5      CPI Unemployment IsHoliday
## 1        NA        NA 211.0964        8.106     FALSE
## 2        NA        NA 211.2422        8.106      TRUE
## 3        NA        NA 211.2891        8.106     FALSE
## 4        NA        NA 211.3196        8.106     FALSE
## 5        NA        NA 211.3501        8.106     FALSE
## 6        NA        NA 211.3806        8.106     FALSE
features$Date <- as.Date(features$Date, format = "%Y-%m-%d")
features <- separate(features, "Date", c("Year", "Month", "Day"), sep = "-")

features_by_YMS <- features %>% 
  filter(Year == 2010 | Year == 2011 | Year == 2012) %>% 
  group_by(Year, Month, Store) %>% 
  arrange(Store, Year, Month) %>% 
  summarize(MeanFuel = round(mean(Fuel_Price),2), 
            MeanTemp = round(mean(Temperature),2),
            CPIM = round(mean(CPI),2),
            UnemploymentM = round(mean(Unemployment),2))

train <- read.csv("train.csv")
train$Date <- as.Date(train$Date, format = "%Y-%m-%d")
train <- separate(train, "Date", c("Year", "Month", "Day"), sep = '-')

train_by_YMS <- train %>% 
  group_by(Year, Month, Store) %>% 
  arrange(Year, Month, Store) %>% 
  summarize(AvgSales = mean(Weekly_Sales)) 

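# Join on the shared keys (the same columns merge() would pick by default)
trainM <- merge(train_by_YMS, features_by_YMS, by = c("Year", "Month", "Store"))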

stores <- read.csv("stores.csv")
trainM <- merge(trainM, stores, by = "Store")
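
Because `merge()` performs an inner join by default, any Store/Year/Month combination present in one table but not the other is silently dropped. The sketch below shows one way to check for that on two small made-up tables (`sales` and `feats` are hypothetical stand-ins for `train_by_YMS` and `features_by_YMS`). It assumes the feature column has no missing values of its own, so an `NA` after a left join marks an unmatched row.

```r
# Toy stand-ins for the two monthly summaries (values are made up).
sales <- data.frame(
  Store = c(1, 1, 2),
  Month = c("01", "02", "01"),
  AvgSales = c(10, 12, 20),
  stringsAsFactors = FALSE
)
feats <- data.frame(
  Store = c(1, 2),
  Month = c("01", "01"),
  MeanFuel = c(2.5, 2.6),
  stringsAsFactors = FALSE
)

keys <- c("Store", "Month")

# Default merge: inner join, unmatched rows disappear.
inner <- merge(sales, feats, by = keys)

# Left join keeps every sales row; unmatched rows get NA in MeanFuel.
left <- merge(sales, feats, by = keys, all.x = TRUE)

# Rows of `sales` that have no partner in `feats`.
dropped <- left[is.na(left$MeanFuel), keys]

cat("rows in sales:", nrow(sales), " rows after inner join:", nrow(inner), "\n")
print(dropped)
```

If `dropped` is empty on the real tables, the inner join loses nothing and the default `merge()` is safe to use.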

par(mfrow=c(3,2))
hist(trainM$AvgSales)
hist(trainM$MeanFuel)
hist(trainM$MeanTemp)
hist(trainM$CPIM)
hist(trainM$UnemploymentM)

ggplot(trainM, aes(x = Month,y = AvgSales)) + 
  geom_col() +
  facet_wrap(~Year)

1. Training data is missing for January 2010 and for November and December 2012.
2. Sales clearly peak in November and December, most likely because of the holiday weeks in those two months.
3. There also appears to be a dip in September and October, in the run-up to the holiday weeks.

These patterns need further investigation.
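
The gaps in coverage noted above can be confirmed without reading them off the chart, by comparing the year-month pairs present in the data against the full 2010 to 2012 grid. This is a minimal sketch on a made-up data frame (`toy` is hypothetical; with the real data, `trainM` would take its place). It assumes `Year` and `Month` are character columns with zero-padded months.

```r
# Toy stand-in for trainM (values are made up).
toy <- data.frame(
  Year  = c("2010", "2010", "2011", "2012"),
  Month = c("02", "03", "01", "10"),
  stringsAsFactors = FALSE
)

# Every year-month pair that could appear in the study period.
full_grid <- expand.grid(
  Year  = c("2010", "2011", "2012"),
  Month = sprintf("%02d", 1:12),
  stringsAsFactors = FALSE
)

# Build comparable "YYYY-MM" keys for the data and for the grid.
present  <- unique(paste(toy$Year, toy$Month, sep = "-"))
all_keys <- paste(full_grid$Year, full_grid$Month, sep = "-")

# Year-month pairs with no rows at all.
missing_months <- sort(setdiff(all_keys, present))
print(missing_months)
```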

ggplot(trainM, aes(x = Size, y = AvgSales))+ 
  geom_point() + 
  ggtitle("Store Size vs Sales")

Plotting store size against sales suggests a positive, roughly linear relationship, although it is not very clear. There could be strata or confounding here (store Type, for example). This needs to be investigated further.
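
One way to probe the possible stratification is to compare the overall Size/Sales correlation with the correlation inside each stratum, and then fit a linear model that adjusts for the stratum. The sketch below uses simulated data (the `toy` data frame, its size ranges and its slope are all made up) and takes store `Type` as the candidate stratifying variable, which is an assumption and not something the plot establishes.

```r
set.seed(1)

# Simulated stores: three types with different size ranges (made-up numbers).
toy <- data.frame(
  Type = rep(c("A", "B", "C"), each = 20),
  Size = c(runif(20, 150000, 220000),
           runif(20, 80000, 140000),
           runif(20, 30000, 45000)),
  stringsAsFactors = FALSE
)
# Sales grow linearly with size, plus noise.
toy$AvgSales <- 0.1 * toy$Size + rnorm(60, sd = 2000)

# Correlation over all stores pooled together.
overall_cor <- cor(toy$Size, toy$AvgSales)

# Correlation within each store type.
by_type_cor <- sapply(split(toy, toy$Type),
                      function(d) cor(d$Size, d$AvgSales))

# Linear model of sales on size, adjusting for type.
fit <- lm(AvgSales ~ Size + Type, data = toy)

print(overall_cor)
print(by_type_cor)
print(coef(summary(fit)))
```

If the within-type correlations are much weaker than the pooled one, most of the apparent relationship comes from differences between types and not from size within a type.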

ggplot(trainM, aes(x = Type, y = AvgSales, size = Size))+ 
  geom_point() + 
  ggtitle("Store Type vs Sales")

ggplot(trainM, aes(x = Type, y = AvgSales)) + 
  geom_boxplot()

In the first plot, we have sales against store Type, with the size of each dot indicating the store Size; the second plot shows the same sales as a boxplot per Type. We can see that Type looks like a classification of stores based on size and sales. However, there seem to be some overlaps between types, which may point to errors in the dataset. This needs further investigation.
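
The overlap between types can be made concrete by tabulating the size range of each Type and flagging stores that fall inside another type's range. This is a rough sketch on a made-up table (`toy_stores` is hypothetical; the real `stores` data frame has the same `Store`, `Type` and `Size` columns). The rule used here, a Type A store smaller than the largest Type B store, is only one illustrative definition of an overlap.

```r
# Toy stand-in for the stores table (values are made up).
toy_stores <- data.frame(
  Store = 1:6,
  Type  = c("A", "A", "B", "B", "C", "C"),
  Size  = c(200000, 40000, 120000, 90000, 42000, 35000),
  stringsAsFactors = FALSE
)

# Smallest and largest store size within each type.
min_size <- tapply(toy_stores$Size, toy_stores$Type, min)
max_size <- tapply(toy_stores$Size, toy_stores$Type, max)
size_range <- data.frame(
  Type    = names(min_size),
  MinSize = as.vector(min_size),
  MaxSize = as.vector(max_size)
)
print(size_range)

# Type A stores that are smaller than the largest Type B store.
suspect <- toy_stores[toy_stores$Type == "A" &
                        toy_stores$Size < max_size[["B"]], ]
print(suspect)
```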

ggplot(trainM, aes(x = MeanTemp, y = AvgSales)) +
  geom_point() + 
  facet_wrap(~Month)

ggplot(trainM, aes(x = MeanFuel, y = AvgSales)) +
  geom_point() + 
  facet_wrap(~Year)

These plots suggest that fuel price and temperature have some influence on sales. But, as we saw earlier, these ups and downs might be caused by holidays or some other factor. We need to check for multicollinearity among the predictors here.
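
A simple first check for multicollinearity is the correlation matrix of the predictors, followed by variance inflation factors (VIFs). The sketch below computes each VIF by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others, so it needs only base R. The data are simulated (`toy` is made up, with fuel price deliberately tied to temperature); the column names match the monthly summaries used above.

```r
set.seed(42)
n <- 100

# Simulated predictors; MeanFuel is built to depend partly on MeanTemp.
toy <- data.frame(MeanTemp = rnorm(n, mean = 60, sd = 15))
toy$MeanFuel      <- 3 + 0.01 * toy$MeanTemp + rnorm(n, sd = 0.2)
toy$CPIM          <- rnorm(n, mean = 170, sd = 30)
toy$UnemploymentM <- rnorm(n, mean = 8, sd = 1.5)

predictors <- c("MeanFuel", "MeanTemp", "CPIM", "UnemploymentM")

# Pairwise correlations between predictors.
cor_mat <- cor(toy[, predictors])

# VIF for each predictor: regress it on the remaining predictors.
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = toy))$r.squared
  1 / (1 - r2)
})

print(round(cor_mat, 2))
print(round(vif, 2))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the others; larger values mean its coefficient in a regression would be estimated less precisely.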

ggplot(trainM, aes(x = CPIM, y = AvgSales, color = Type)) +
  geom_point() 

As CPI increases, sales appear to drop for Type B stores, but this does not hold for the other types. This needs further investigation.
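
Whether the CPI effect really differs by store type can be examined by fitting a separate regression line per Type and comparing the slopes. The sketch below does this on simulated data (`toy` and the `true_slope` values are made up, with a negative slope built in for Type B only to mimic the pattern described above).

```r
set.seed(7)

# Simulated monthly data: 30 observations per store type.
toy <- data.frame(
  Type = rep(c("A", "B", "C"), each = 30),
  CPIM = runif(90, 130, 220),
  stringsAsFactors = FALSE
)

# Made-up slopes: only Type B sales fall as CPI rises.
true_slope <- c(A = 0, B = -50, C = 0)
toy$AvgSales <- 20000 +
  unname(true_slope[as.character(toy$Type)]) * toy$CPIM +
  rnorm(90, sd = 500)

# Estimated slope of AvgSales on CPIM within each type.
slopes <- sapply(split(toy, toy$Type), function(d) {
  coef(lm(AvgSales ~ CPIM, data = d))[["CPIM"]]
})
print(round(slopes, 1))

# A single model with an interaction term tests the same idea formally.
fit <- lm(AvgSales ~ CPIM * Type, data = toy)
print(coef(summary(fit)))
```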

ggplot(trainM, aes(x = UnemploymentM, y = AvgSales)) +
  geom_point() +
  facet_wrap(~Year)

Here we can see sales dipping when unemployment rises above 12.
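
The dip can be quantified by splitting the observations at the threshold of 12 and comparing mean sales on each side, along with how many observations sit above it. This is a minimal sketch on made-up numbers (`toy` is hypothetical; the threshold is the one read off the plot above).

```r
# Toy stand-in for trainM (values are made up).
toy <- data.frame(
  UnemploymentM = c(6.5, 7.8, 9.1, 12.4, 13.2, 14.0),
  AvgSales      = c(18000, 17500, 16800, 9000, 8500, 8800)
)

# Flag observations above the threshold seen in the plot.
toy$HighUnemp <- toy$UnemploymentM > 12

# Mean sales on each side of the threshold.
group_summary <- aggregate(AvgSales ~ HighUnemp, data = toy, FUN = mean)
print(group_summary)

# Number of observations above the threshold.
n_high <- sum(toy$HighUnemp)
print(n_high)
```

On the real data it is worth checking `n_high` first: if only a handful of stores ever exceed 12, the dip may reflect those particular stores and not unemployment itself.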