In comparison to other tools, R has shown itself to be particularly suited to analyzing big data because of its built-in statistical formulas and third-party algorithms that are useful for analyzing data, predicting data, and visualizing data.
Install Packages
install.packages('rmarkdown')
install.packages('tinytex')
install.packages("flexdashboard")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("naniar")
install.packages("tidyverse")
install.packages("readr")
install.packages("plyr")
library("plyr")
library("rmarkdown")
library("tinytex")
library("flexdashboard")
library("dplyr")
library("ggplot2")
library("naniar")
library("tidyverse")
library("readr")
Research questions below.
To what extent does pricing affect sales of each product at a particular store?
How does Item visibility affect sales of each product at a particular store?
Relationship between price and visibility for each category?
What is Most common outlet type for each tier of city?
Fast selling and slow selling products?
summary(df)
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
Length:8523 Min. : 4.555 Length:8523 Min. :0.00000
Class :character 1st Qu.: 8.774 Class :character 1st Qu.:0.02699
Mode :character Median :12.600 Mode :character Median :0.05393
Mean :12.858 Mean :0.06613
3rd Qu.:16.850 3rd Qu.:0.09459
Max. :21.350 Max. :0.32839
NA's :1463
Item_Type Item_MRP Outlet_Identifier
Length:8523 Min. : 31.29 Length:8523
Class :character 1st Qu.: 93.83 Class :character
Mode :character Median :143.01 Mode :character
Mean :140.99
3rd Qu.:185.64
Max. :266.89
Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
Min. :1985 Length:8523 Length:8523
1st Qu.:1987 Class :character Class :character
Median :1999 Mode :character Mode :character
Mean :1998
3rd Qu.:2004
Max. :2009
Outlet_Type Item_Outlet_Sales
Length:8523 Min. : 33.29
Class :character 1st Qu.: 834.25
Mode :character Median : 1794.33
Mean : 2181.29
3rd Qu.: 3101.30
Max. :13086.97
str(df)
'data.frame': 8523 obs. of 12 variables:
$ Item_Identifier : chr "FDA15" "DRC01" "FDN15" "FDX07" ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : chr "Low Fat" "Regular" "Low Fat" "Regular" ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : chr "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : chr "OUT049" "OUT018" "OUT049" "OUT010" ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : chr "Medium" "Medium" "Medium" "" ...
$ Outlet_Location_Type : chr "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
$ Outlet_Type : chr "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store" ...
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
gg_miss_var(df)
There are missing values for Item_Weight
table(is.na(df$Item_Weight))
FALSE TRUE
7060 1463
Item_Weight has 1463 empty cells|
qplot(Outlet_Identifier,Item_Weight, data=df) + geom_boxplot()
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## Warning: Removed 1463 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 1463 rows containing missing values (`geom_point()`).
From the plots it can be seen that OUT 19 and OUT27 have missing item weight values
summary(df$Item_Weight)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4.555 8.774 12.600 12.858 16.850 21.350 1463
df$Item_Weight[is.na(df$Item_Weight)]<- mean(df$Item_Weight[!is.na(df$Item_Weight)])
summary(df$Item_Weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.555 9.310 12.858 12.858 16.000 21.350
table(df$Item_Fat_Content)
LF low fat Low Fat reg Regular
316 112 5089 117 2889
table(df$Outlet_Type)
Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3
1083 5577 928 935
table(df$Outlet_Location_Type)
Tier 1 Tier 2 Tier 3
2388 2785 3350
table(df$Outlet_Size)
High Medium Small
2410 932 2793 2388
df$Item_Fat_Content <- revalue(df$Item_Fat_Content,c("LF"="Low Fat" ,"low fat"="Low Fat","reg"="Regular"))
table(df$Item_Fat_Content)
Low Fat Regular
5517 3006
ggplot(data = df) +
geom_point(mapping = aes(x = Item_Visibility, y = Item_MRP, )) +
labs(
title = "Plot of Item Visibility vs Item MRP",
x = "Item_ Visibility",
y = "Item MRP")
ggplot(data = df) +
geom_point(mapping = aes(x = Item_Visibility, y = Item_Outlet_Sales, color = 'Green'))+
labs(
title = "Plot of Item Item Visibility vs Item Outlet Sales",
x = "Item Visibility",
y = "Item Outlet Sales")
ggplot(data = df) +
geom_point(mapping = aes(x = Item_MRP, y = Item_Outlet_Sales,color ='dark green')) +
labs(
title = "Plot of Item MRP vs Item Outlet Sales",
x = "Item MRP",
y = "Item Outlet Sales")
ggplot(data = df) +
geom_point(mapping = aes(x = Outlet_Identifier, y = Item_Outlet_Sales)) +
labs(
title = "Plot of Outlet_Identifier vs Item Outlet Sales",
x = "Outlet Identifier",
y = "Item Outlet Sales")
ggplot(data = df) +
geom_point(mapping = aes(x = Outlet_Type, y = Item_Outlet_Sales)) +
labs(
title = "Plot of Outlet Type vs Item Outlet Sales",
x = "Outlet Type",
y = "Item Outlet Sales")
ggplot(df, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type))
ggplot(data = df) +
geom_bar(mapping = aes(x = Item_Fat_Content, fill =Item_Fat_Content ))
ggplot(data = df) +
geom_bar(mapping = aes(x = Item_Type, fill = Item_Type, ))+
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
ggplot(data = df) +
geom_bar(mapping = aes(x = Outlet_Type, fill = Outlet_Type))
####Establishment Year
ggplot(df, aes(Outlet_Establishment_Year)) + geom_bar(fill = "red")+theme_bw()+
scale_x_continuous("Establishment Year", breaks = seq(1985,2010)) +
scale_y_continuous("Count", breaks = seq(0,1500,100)) +
coord_flip()+ labs(title = "Establishment Year")
ggplot(df, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar()+
labs(title = "Outlet type vs Outlet Location type", x = "Outlet Location Type", y = "Count of Outlets")
ggplot(df, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red")+
scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000, by=1000))+
labs(title = "OUTLET TYPE VS SALES", x = "Outlet Identifier")
From the plots above, Fruits and Vegetables and Snack foods, Household items, frozen foods and baking goods are the most sold items, while seafood, breakfast, bread, hard drinks and starchy foods the least sold.
For the Outlet type, Supermarket Type 1 also generating the most revenue and it is more prevalent in Tier 1 and Tier 2 cities.
The best performing store is the store established in 1985, which is the oldest store and further analysis showed Outlet 19 and Outlet 27 as the only two outlets established in 1985.
The box plot also helps us see the distribution of values and check for outliers.
In statistics, correlation means how closely two variables move together.If two variables move in the same direction, they have a positive correlation, On the other hand, a negative correlation occurs when the variables move in opposite directions.
From the scatter plots above, it describes a relationship where the Maximum retail price is positively correlated to the Item outlet sales. Also, an increase in visibility would result in reduction in sales as sales values are clustered around visibility less than 0.2.
ggplot(df, aes(Outlet_Identifier, Item_Type))+
geom_raster(aes(fill = Item_MRP))+
labs(title ="Heat Map", x = "Outlet Identifier", y = "Item Type")+
scale_fill_continuous(name = "Item MRP")
df_num <- df[c('Item_Weight','Item_MRP','Item_Visibility','Item_Outlet_Sales','Outlet_Establishment_Year')]
cor(df_num)
Item_Weight Item_MRP Item_Visibility
Item_Weight 1.000000000 0.024756101 -0.012048528
Item_MRP 0.024756101 1.000000000 -0.001314848
Item_Visibility -0.012048528 -0.001314848 1.000000000
Item_Outlet_Sales 0.011550001 0.567574447 -0.128624612
Outlet_Establishment_Year -0.008300836 0.005019916 -0.074833504
Item_Outlet_Sales Outlet_Establishment_Year
Item_Weight 0.01155000 -0.008300836
Item_MRP 0.56757445 0.005019916
Item_Visibility -0.12862461 -0.074833504
Item_Outlet_Sales 1.00000000 -0.049134970
Outlet_Establishment_Year -0.04913497 1.000000000
install.packages('corrplot')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(df_num),method ="number", tl.cex =0.6 ,main = 'correlogram of numeric variables')
In statistics, a correlation coefficient is a number that indicates how linearly related two variables are.
This is usually expressed as the Pearson correlation coefficient and It ranges between -1 and 1.
Low correlation close to 0 means no linear relationship exists.
A correlation matrix summarizes how different variables are related. The figures above show a correlation of 0.57 between Item Maximum Retail price and Item Outlet Sales
cor.test(df$Item_Outlet_Sales,df$Item_MRP)
Pearson's product-moment correlation
data: df$Item_Outlet_Sales and df$Item_MRP
t = 63.635, df = 8521, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5530075 0.5817945
sample estimates:
cor
0.5675744
price_model <- lm(df$Item_Outlet_Sales~df$Item_MRP)
summary(price_model)
Call:
lm(formula = df$Item_Outlet_Sales ~ df$Item_MRP)
Residuals:
Min 1Q Median 3Q Max
-3871.2 -770.1 -64.0 696.4 9443.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.5751 37.6712 -0.307 0.759
df$Item_MRP 15.5530 0.2444 63.635 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1405 on 8521 degrees of freedom
Multiple R-squared: 0.3221, Adjusted R-squared: 0.3221
F-statistic: 4049 on 1 and 8521 DF, p-value: < 2.2e-16
This linear regression technique can be used to test an assumption or predict an output so we would test the hypothesis inferred from this correlation and determine if we can use the model for prediction. For this report the null hypothesis H0 = The Item Maximum Retail price has a relationship with Item Outlet Sales.Alternative Hypothesis H1 = The Item Maximum Retail price does not have a relationship with Item Outlet Sales.
From the earlier equation y = Slope(x) + Intercept, y = 15.55x - 11.58. where x =Item MRP, slope =15.55, intercept = -11.58, y = sales. The null hypothesis should be rejected if the p-value is less than.05.A p-value larger than.05 indicates that the deviation from the null hypothesis is not statistically significant, and so the null hypothesis is not rejected.
Large f statistic means the result is did not happen by chance.We should therefore reject the null hypothesis and conclude that an association does exist between item maximum retail price and item outlet sales. Based on our model, we can see that our F-statistic is large, and our p-value is basically zero. R squared value We use R squared to measure how well our model explains the dependent variable’s values.The closer the value is to 1, the more efficient the model is at explaining data. R squared of 0.32 means the model explains 32% of the variation within the Outlet sales
install.packages("olsrr")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
(as 'lib' is unspecified)
library(olsrr)
Attaching package: 'olsrr'
The following object is masked from 'package:datasets':
rivers
ols_plot_resid_fit(price_model)
ols_plot_resid_hist(price_model)
df_num$pred <- predict(price_model,df_num)
cor(df_num$pred, df_num$Item_Outlet_Sales)
[1] 0.5675744
Residuals mean the difference between the actual and predicted values. From the residual values we can tell our distribution is symmetrical, which tells us the model is predicting well for both the high and low ranges.
One of the great benefits of Machine Learning is the ability to gain insights from extensive data analysis.This can help companies understand how different aspects of the business are performing, allowing them to make accurate forecasts.
Having analysed the data, we were able to identify which product categories performed well, as well as the relationships between the variables. According to this report, data quality is a critical factor in machine learning.For example, visualizations enabled us to spot outrageous values and drill down to the outlet type.
By knowing this information, recommendations can be made that will improve business results. As illustrated in the box plot showing the outliers, outlet 27 has very high values, which should be investigated as the mean value for item outlet sales is 2191.
In addition, it would help to identify if there is a glitch in the computer system that captures these records as it would not give a true representation of the sales figures. Also, the company can concentrate more product campaigns on Tier 1 and Tier 2 cities since Supermarket Type 1 with the highest sales figures in are concentrated in these cities. Slow-moving categories should also be given special attention as there may be a case of expiration since they are not fast-moving merchandise.