Data Analysis using R Programming for Big-Mart Supermart

In comparison to other tools, R has shown itself to be particularly suited to analyzing big data because of its built-in statistical formulas and third-party algorithms that are useful for analyzing data, predicting data, and visualizing data.

Install Packages

install.packages('rmarkdown')
install.packages('tinytex')
install.packages("flexdashboard")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("naniar")
install.packages("tidyverse")
install.packages("readr")
install.packages("plyr")
library("plyr")
library("rmarkdown")
library("tinytex")
library("flexdashboard")
library("dplyr")
library("ggplot2")
library("naniar")
library("tidyverse")
library("readr")

Research questions below.

To what extent does pricing affect sales of each product at a particular store?

How does Item visibility affect sales of each product at a particular store?

Relationship between price and visibility for each category?

What is Most common outlet type for each tier of city?

Fast selling and slow selling products?

Data exploration

Summary Statistics and Data structure

summary(df)

 Item_Identifier     Item_Weight     Item_Fat_Content   Item_Visibility  
 Length:8523        Min.   : 4.555   Length:8523        Min.   :0.00000  
 Class :character   1st Qu.: 8.774   Class :character   1st Qu.:0.02699  
 Mode  :character   Median :12.600   Mode  :character   Median :0.05393  
                    Mean   :12.858                      Mean   :0.06613  
                    3rd Qu.:16.850                      3rd Qu.:0.09459  
                    Max.   :21.350                      Max.   :0.32839  
                    NA's   :1463                                         
  Item_Type            Item_MRP      Outlet_Identifier 
 Length:8523        Min.   : 31.29   Length:8523       
 Class :character   1st Qu.: 93.83   Class :character  
 Mode  :character   Median :143.01   Mode  :character  
                    Mean   :140.99                     
                    3rd Qu.:185.64                     
                    Max.   :266.89                     
                                                       
 Outlet_Establishment_Year Outlet_Size        Outlet_Location_Type
 Min.   :1985              Length:8523        Length:8523         
 1st Qu.:1987              Class :character   Class :character    
 Median :1999              Mode  :character   Mode  :character    
 Mean   :1998                                                     
 3rd Qu.:2004                                                     
 Max.   :2009                                                     
                                                                  
 Outlet_Type        Item_Outlet_Sales 
 Length:8523        Min.   :   33.29  
 Class :character   1st Qu.:  834.25  
 Mode  :character   Median : 1794.33  
                    Mean   : 2181.29  
                    3rd Qu.: 3101.30  
                    Max.   :13086.97

str(df)

'data.frame':   8523 obs. of  12 variables:
 $ Item_Identifier          : chr  "FDA15" "DRC01" "FDN15" "FDX07" ...
 $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
 $ Item_Fat_Content         : chr  "Low Fat" "Regular" "Low Fat" "Regular" ...
 $ Item_Visibility          : num  0.016 0.0193 0.0168 0 0 ...
 $ Item_Type                : chr  "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
 $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
 $ Outlet_Identifier        : chr  "OUT049" "OUT018" "OUT049" "OUT010" ...
 $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
 $ Outlet_Size              : chr  "Medium" "Medium" "Medium" "" ...
 $ Outlet_Location_Type     : chr  "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
 $ Outlet_Type              : chr  "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store" ...
 $ Item_Outlet_Sales        : num  3735 443 2097 732 995 ...

Check for missing values

gg_miss_var(df)

There are missing values for Item_Weight

Check number of missing values in Item_Weight column

table(is.na(df$Item_Weight))


FALSE  TRUE 
 7060  1463

Item_Weight has 1463 empty cells|

Group boxplot of item identifier vs Item weight

qplot(Outlet_Identifier,Item_Weight, data=df) + geom_boxplot()

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

## Warning: Removed 1463 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 1463 rows containing missing values (`geom_point()`).

From the plots it can be seen that OUT 19 and OUT27 have missing item weight values

Check distribution of figures

summary(df$Item_Weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  4.555   8.774  12.600  12.858  16.850  21.350    1463

Replace null values for Item_Weight with Mean values

df$Item_Weight[is.na(df$Item_Weight)]<- mean(df$Item_Weight[!is.na(df$Item_Weight)])

Check distribution again

summary(df$Item_Weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.555   9.310  12.858  12.858  16.000  21.350

View each column variable for discrepancies

table(df$Item_Fat_Content)


     LF low fat Low Fat     reg Regular 
    316     112    5089     117    2889

table(df$Outlet_Type)


    Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3 
             1083              5577               928               935

table(df$Outlet_Location_Type)


Tier 1 Tier 2 Tier 3 
  2388   2785   3350

table(df$Outlet_Size)


         High Medium  Small 
  2410    932   2793   2388

Reorder values for Item_Fat_Content

df$Item_Fat_Content <- revalue(df$Item_Fat_Content,c("LF"="Low Fat" ,"low fat"="Low Fat","reg"="Regular"))

table(df$Item_Fat_Content)


Low Fat Regular 
   5517    3006

Visualizing relationship between variables

Plot of item visibility vs price

ggplot(data = df) +
  geom_point(mapping = aes(x = Item_Visibility, y = Item_MRP, )) + 
  labs(
    title = "Plot of Item Visibility vs Item MRP",
    x = "Item_ Visibility",
    y = "Item MRP")

Plot of item visibility vs item sales

ggplot(data = df) +
  geom_point(mapping = aes(x = Item_Visibility, y = Item_Outlet_Sales, color = 'Green'))+ 
  labs(
    title = "Plot of Item Item Visibility vs Item Outlet Sales",
    x = "Item Visibility",
    y = "Item Outlet Sales")

Plot of item Mrp vs item sales

ggplot(data = df) +
  geom_point(mapping = aes(x = Item_MRP, y = Item_Outlet_Sales,color ='dark green')) +
  labs(
    title = "Plot of Item MRP vs Item Outlet Sales",
    x = "Item MRP",
    y = "Item Outlet Sales")

Plot of outlet identifier vs item sales

ggplot(data = df) +
  geom_point(mapping = aes(x = Outlet_Identifier, y = Item_Outlet_Sales)) +
  labs(
    title = "Plot of Outlet_Identifier vs Item Outlet Sales",
    x = "Outlet Identifier",
    y = "Item Outlet Sales")

Plot of outlet type vs item sales

ggplot(data = df) +
  geom_point(mapping = aes(x = Outlet_Type, y = Item_Outlet_Sales)) +
  labs(
    title = "Plot of Outlet Type vs Item Outlet Sales",
    x = "Outlet  Type",
    y = "Item Outlet Sales")

Plot of Item visibility vs Item MRP

ggplot(df, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type))

BAR CHARTS

Item fat content

ggplot(data = df) +
  geom_bar(mapping = aes(x = Item_Fat_Content, fill =Item_Fat_Content ))

Item type

ggplot(data = df) +
  geom_bar(mapping = aes(x = Item_Type, fill = Item_Type, ))+
   theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

Outlet type

ggplot(data = df) +
  geom_bar(mapping = aes(x = Outlet_Type, fill = Outlet_Type))

####Establishment Year

ggplot(df, aes(Outlet_Establishment_Year)) + geom_bar(fill = "red")+theme_bw()+
  scale_x_continuous("Establishment Year", breaks = seq(1985,2010)) + 
  scale_y_continuous("Count", breaks = seq(0,1500,100)) +
  coord_flip()+ labs(title = "Establishment Year")

STACKED BAR CHARTS

ggplot(df, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar()+
  labs(title = "Outlet type vs Outlet Location type", x = "Outlet Location Type", y = "Count of Outlets")

OUTLET TYPE VS SALES

ggplot(df, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red")+
  scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000, by=1000))+
  labs(title = "OUTLET TYPE VS SALES", x = "Outlet Identifier")

Interpretation of Data Visualizationsand Insights

From the plots above, Fruits and Vegetables and Snack foods, Household items, frozen foods and baking goods are the most sold items, while seafood, breakfast, bread, hard drinks and starchy foods the least sold.

For the Outlet type, Supermarket Type 1 also generating the most revenue and it is more prevalent in Tier 1 and Tier 2 cities.

The best performing store is the store established in 1985, which is the oldest store and further analysis showed Outlet 19 and Outlet 27 as the only two outlets established in 1985.

The box plot also helps us see the distribution of values and check for outliers.

In statistics, correlation means how closely two variables move together.If two variables move in the same direction, they have a positive correlation, On the other hand, a negative correlation occurs when the variables move in opposite directions.

From the scatter plots above, it describes a relationship where the Maximum retail price is positively correlated to the Item outlet sales. Also, an increase in visibility would result in reduction in sales as sales values are clustered around visibility less than 0.2.

HEATMAP SHOWING COST OF EACH ITEM ON EVVRY OUTLET/item MRP, Outlet Identifier & Item Type

ggplot(df, aes(Outlet_Identifier, Item_Type))+
  geom_raster(aes(fill = Item_MRP))+
  labs(title ="Heat Map", x = "Outlet Identifier", y = "Item Type")+
  scale_fill_continuous(name = "Item MRP")

Correlation test

df_num <- df[c('Item_Weight','Item_MRP','Item_Visibility','Item_Outlet_Sales','Outlet_Establishment_Year')]

cor(df_num)

                           Item_Weight     Item_MRP Item_Visibility
Item_Weight                1.000000000  0.024756101    -0.012048528
Item_MRP                   0.024756101  1.000000000    -0.001314848
Item_Visibility           -0.012048528 -0.001314848     1.000000000
Item_Outlet_Sales          0.011550001  0.567574447    -0.128624612
Outlet_Establishment_Year -0.008300836  0.005019916    -0.074833504
                          Item_Outlet_Sales Outlet_Establishment_Year
Item_Weight                      0.01155000              -0.008300836
Item_MRP                         0.56757445               0.005019916
Item_Visibility                 -0.12862461              -0.074833504
Item_Outlet_Sales                1.00000000              -0.049134970
Outlet_Establishment_Year       -0.04913497               1.000000000

Correlation plot

install.packages('corrplot')

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(corrplot)

## corrplot 0.92 loaded

corrplot(cor(df_num),method ="number", tl.cex =0.6 ,main = 'correlogram of numeric variables')

In statistics, a correlation coefficient is a number that indicates how linearly related two variables are.

This is usually expressed as the Pearson correlation coefficient and It ranges between -1 and 1.

Low correlation close to 0 means no linear relationship exists.

A correlation matrix summarizes how different variables are related. The figures above show a correlation of 0.57 between Item Maximum Retail price and Item Outlet Sales

Linear regression model for Item mrp vs outlet sales

cor.test(df$Item_Outlet_Sales,df$Item_MRP)


    Pearson's product-moment correlation

data:  df$Item_Outlet_Sales and df$Item_MRP
t = 63.635, df = 8521, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5530075 0.5817945
sample estimates:
      cor 
0.5675744

price_model <- lm(df$Item_Outlet_Sales~df$Item_MRP)
summary(price_model)


Call:
lm(formula = df$Item_Outlet_Sales ~ df$Item_MRP)

Residuals:
    Min      1Q  Median      3Q     Max 
-3871.2  -770.1   -64.0   696.4  9443.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.5751    37.6712  -0.307    0.759    
df$Item_MRP  15.5530     0.2444  63.635   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1405 on 8521 degrees of freedom
Multiple R-squared:  0.3221,    Adjusted R-squared:  0.3221 
F-statistic:  4049 on 1 and 8521 DF,  p-value: < 2.2e-16

This linear regression technique can be used to test an assumption or predict an output so we would test the hypothesis inferred from this correlation and determine if we can use the model for prediction. For this report the null hypothesis H₀ = The Item Maximum Retail price has a relationship with Item Outlet Sales.Alternative Hypothesis H₁ = The Item Maximum Retail price does not have a relationship with Item Outlet Sales.

From the earlier equation y = Slope(x) + Intercept, y = 15.55x - 11.58. where x =Item MRP, slope =15.55, intercept = -11.58, y = sales. The null hypothesis should be rejected if the p-value is less than.05.A p-value larger than.05 indicates that the deviation from the null hypothesis is not statistically significant, and so the null hypothesis is not rejected.

Large f statistic means the result is did not happen by chance.We should therefore reject the null hypothesis and conclude that an association does exist between item maximum retail price and item outlet sales. Based on our model, we can see that our F-statistic is large, and our p-value is basically zero. R squared value We use R squared to measure how well our model explains the dependent variable’s values.The closer the value is to 1, the more efficient the model is at explaining data. R squared of 0.32 means the model explains 32% of the variation within the Outlet sales

Residual vs fitted values plot (check homoscedasticity)

install.packages("olsrr")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
(as 'lib' is unspecified)

library(olsrr)


Attaching package: 'olsrr'

The following object is masked from 'package:datasets':

    rivers

ols_plot_resid_fit(price_model)

ols_plot_resid_hist(price_model)

df_num$pred <- predict(price_model,df_num)

cor(df_num$pred, df_num$Item_Outlet_Sales)

[1] 0.5675744

Residuals mean the difference between the actual and predicted values. From the residual values we can tell our distribution is symmetrical, which tells us the model is predicting well for both the high and low ranges.

CONCLUSION

One of the great benefits of Machine Learning is the ability to gain insights from extensive data analysis.This can help companies understand how different aspects of the business are performing, allowing them to make accurate forecasts.

Having analysed the data, we were able to identify which product categories performed well, as well as the relationships between the variables. According to this report, data quality is a critical factor in machine learning.For example, visualizations enabled us to spot outrageous values and drill down to the outlet type.

By knowing this information, recommendations can be made that will improve business results. As illustrated in the box plot showing the outliers, outlet 27 has very high values, which should be investigated as the mean value for item outlet sales is 2191.

In addition, it would help to identify if there is a glitch in the computer system that captures these records as it would not give a true representation of the sales figures. Also, the company can concentrate more product campaigns on Tier 1 and Tier 2 cities since Supermarket Type 1 with the highest sales figures in are concentrated in these cities. Slow-moving categories should also be given special attention as there may be a case of expiration since they are not fast-moving merchandise.