Arizona supermarket sales were not pleasing .The decicion makers decided to make an advertisement ,but did not know whom to target.They hired a data scientist cum reseracher to identify whom to target a mong its customers.
Using SQL skills the reseacher extracted some variables of royal customers attributes from the supermarket data base. He extracted customers salary,city,weight and wine they consume.
The researcher went a head to fing vbaluable insights from the collected variables by using liner regression and try to answear the problems below:
1.0 Customer salary has effect to spending
2.0 Customer Weight has effect to spending
3.0 Customer residence has effect on spending
library(readxl)
confidence_interval <- read_excel("C:/Users/USER/Desktop/upwork/confidence_interval.xlsx")
con<-confidence_interval
explore
con$City<-as.factor(con$City)
tail(con)
names(con)
## [1] "Spend" "Salary" "Weight" "City" "Wine"
str(con)
## tibble [12 × 5] (S3: tbl_df/tbl/data.frame)
## $ Spend : num [1:12] 10 25 35 45 63 76 86 98 102 109 ...
## $ Salary: num [1:12] 1200 1389 1476 1587 1673 ...
## $ Weight: num [1:12] 230 187 167 144 139 126 111 105 98 77 ...
## $ City : Factor w/ 3 levels "Mesa","Phoenix",..: 3 3 3 3 1 1 2 2 2 2 ...
## $ Wine : num [1:12] 40 40 90 55 45 34 70 98 112 70 ...
dim(con)
## [1] 12 5
summary(con)
## Spend Salary Weight City Wine
## Min. : 10.00 Min. :1200 Min. : 46.00 Mesa :4 Min. : 10.0
## 1st Qu.: 42.50 1st Qu.:1559 1st Qu.: 92.75 Phoenix:4 1st Qu.: 40.0
## Median : 81.00 Median :1834 Median :118.50 Sedona :4 Median : 62.5
## Mean : 75.08 Mean :1847 Mean :124.67 Mean : 62.0
## 3rd Qu.:103.75 3rd Qu.:2210 3rd Qu.:149.75 3rd Qu.: 82.5
## Max. :132.00 Max. :2412 Max. :230.00 Max. :112.0
linear regrssion
lm<-lm(con$Spend~con$Salary+con$Weight+con$Wine+con$City)
model summary
summary(lm)
##
## Call:
## lm(formula = con$Spend ~ con$Salary + con$Weight + con$Wine +
## con$City)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.444 -1.855 -0.416 1.975 6.344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.72617 48.30673 0.036 0.9727
## con$Salary 0.05680 0.01783 3.186 0.0189 *
## con$Weight -0.21074 0.13148 -1.603 0.1601
## con$Wine -0.01078 0.06127 -0.176 0.8661
## con$CityPhoenix 0.36537 4.30721 0.085 0.9352
## con$CitySedona -14.27403 5.15324 -2.770 0.0324 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.519 on 6 degrees of freedom
## Multiple R-squared: 0.9929, Adjusted R-squared: 0.9869
## F-statistic: 167.1 on 5 and 6 DF, p-value: 2.361e-06
interpretation of the model.
The model is significant (pv<-0.05)
salary
salary is not significant(B=0.05,PV>0.O5,3,=3.186) although it has a positive effect to the spending. Holding other factors constant,every unit increase in salary results to an decrease in spending by 0.06 units.
weight
weight is not significant(b=-0.2,pv>0.05,=-1.6),However it has a negative effect to spending.Holding other factors constant a unit increase in weight results to a 0.2 unit decrease in spending.
wine
wine is not significant(b=-0.01,pv>0.05,0.17) and has a negative association with spending.Hodling other facotrs constant ,one unit increase in wine intake,results to a 0.01 unit in spending.
City
City is significant,recommendation is to use post hoc analysis to find difference among the cities;however, city Phoenix is 0.4 higher in spending than the reference city(Mesa);in addition, city seduna spending is 14 lower than reference city(Mesa)
conclusion
salary and residence only had significant effect on the spaending of customers in the supermarket.
recommendation
further reserach should be undertaken on the residence factor and the salary to identify which group is to be targeted.
wayforward
the reseracher dicide to do one way anova and then post hoc to fetch the insights and parttern in spending a mong the cities
c<-aov(con$Spend~con$City)
summary(c)
## Df Sum Sq Mean Sq F value Pr(>F)
## con$City 2 12883 6441 13.49 0.00196 **
## Residuals 9 4296 477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(c)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = con$Spend ~ con$City)
##
## $`con$City`
## diff lwr upr p adj
## Phoenix-Mesa 1 -42.1345 44.1345 0.9976934
## Sedona-Mesa -69 -112.1345 -25.8655 0.0040007
## Sedona-Phoenix -70 -113.1345 -26.8655 0.0036506
plot(TukeyHSD(c))
visualization of cities spending
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
spen<-data.frame(city=c("Mesa","Sadona","Phoenix"),spending=c(100,30,99))
c<-spen%>%ggplot(aes(spending,city))+geom_col(fill="blue",batchsize=1)+theme_minimal()
## Warning in geom_col(fill = "blue", batchsize = 1): Ignoring unknown parameters:
## `batchsize`
c
the results are futher subjected to post hoc analysis Customers in Phoenix are spending higher than in Mesa with a diffence of mean 1, which is not significant,however Customers in Sedona are spending much less compared to both Mesa and Phoenix;Customers in Sedona are spendin a mean of 70 less than Mesa and also they are spending 60 less tha Phoenix.
Recommendation
The Mrketing manager should concetntrate on Sedona to found what is going on there .Either to do advertisement ,or motivate the saff or give out discounts or recommend further research.
salary factor
the reserach das\(wt_2 <- as.numeric(cut_number(das\)wt,3))
summary(con$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1200 1559 1834 1847 2210 2412
con<-con%>%mutate(Salary_binned=cut_number(Salary,n=3))
summary(con)
## Spend Salary Weight City Wine
## Min. : 10.00 Min. :1200 Min. : 46.00 Mesa :4 Min. : 10.0
## 1st Qu.: 42.50 1st Qu.:1559 1st Qu.: 92.75 Phoenix:4 1st Qu.: 40.0
## Median : 81.00 Median :1834 Median :118.50 Sedona :4 Median : 62.5
## Mean : 75.08 Mean :1847 Mean :124.67 Mean : 62.0
## 3rd Qu.:103.75 3rd Qu.:2210 3rd Qu.:149.75 3rd Qu.: 82.5
## Max. :132.00 Max. :2412 Max. :230.00 Max. :112.0
## Salary_binned
## [1.2e+03,1.64e+03] :4
## (1.64e+03,2.05e+03]:4
## (2.05e+03,2.41e+03]:4
##
##
##
anova
sal<-aov(con$Spend~con$Salary_binned)
anova(sal)
TukeyHSD(sal)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = con$Spend ~ con$Salary_binned)
##
## $`con$Salary_binned`
## diff lwr upr p adj
## (1.64e+03,2.05e+03]-[1.2e+03,1.64e+03] 52 23.708212 80.29179 0.0016026
## (2.05e+03,2.41e+03]-[1.2e+03,1.64e+03] 87 58.708212 115.29179 0.0000334
## (2.05e+03,2.41e+03]-(1.64e+03,2.05e+03] 35 6.708212 63.29179 0.0179007
visualization
spens<-data.frame(salary=c("low","midle","high"),spendings=c(100,152,187))
cc<-spens%>%ggplot(aes(salary,spendings))+geom_col(fill="purple",batchsize=1)+theme_minimal()
## Warning in geom_col(fill = "purple", batchsize = 1): Ignoring unknown
## parameters: `batchsize`
cc
as seen the low earners are the ones with the problem and the marketing manager should identify them and put more emphasy on them