Arizona supermarket sales were not pleasing .The decicion makers decided to make an advertisement ,but did not know whom to target.They hired a data scientist cum reseracher to identify whom to target a mong its customers.

Using SQL skills the reseacher extracted some variables of royal customers attributes from the supermarket data base. He extracted customers salary,city,weight and wine they consume.

The researcher went a head to fing vbaluable insights from the collected variables by using liner regression and try to answear the problems below:

1.0 Customer salary has effect to spending

2.0 Customer Weight has effect to spending

3.0 Customer residence has effect on spending

library(readxl)
confidence_interval <- read_excel("C:/Users/USER/Desktop/upwork/confidence_interval.xlsx")
con<-confidence_interval

explore

con$City<-as.factor(con$City)
tail(con)
names(con)
## [1] "Spend"  "Salary" "Weight" "City"   "Wine"
str(con)
## tibble [12 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Spend : num [1:12] 10 25 35 45 63 76 86 98 102 109 ...
##  $ Salary: num [1:12] 1200 1389 1476 1587 1673 ...
##  $ Weight: num [1:12] 230 187 167 144 139 126 111 105 98 77 ...
##  $ City  : Factor w/ 3 levels "Mesa","Phoenix",..: 3 3 3 3 1 1 2 2 2 2 ...
##  $ Wine  : num [1:12] 40 40 90 55 45 34 70 98 112 70 ...
dim(con)
## [1] 12  5
summary(con)
##      Spend            Salary         Weight            City        Wine      
##  Min.   : 10.00   Min.   :1200   Min.   : 46.00   Mesa   :4   Min.   : 10.0  
##  1st Qu.: 42.50   1st Qu.:1559   1st Qu.: 92.75   Phoenix:4   1st Qu.: 40.0  
##  Median : 81.00   Median :1834   Median :118.50   Sedona :4   Median : 62.5  
##  Mean   : 75.08   Mean   :1847   Mean   :124.67               Mean   : 62.0  
##  3rd Qu.:103.75   3rd Qu.:2210   3rd Qu.:149.75               3rd Qu.: 82.5  
##  Max.   :132.00   Max.   :2412   Max.   :230.00               Max.   :112.0

linear regrssion

lm<-lm(con$Spend~con$Salary+con$Weight+con$Wine+con$City)

model summary

summary(lm)
## 
## Call:
## lm(formula = con$Spend ~ con$Salary + con$Weight + con$Wine + 
##     con$City)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.444 -1.855 -0.416  1.975  6.344 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       1.72617   48.30673   0.036   0.9727  
## con$Salary        0.05680    0.01783   3.186   0.0189 *
## con$Weight       -0.21074    0.13148  -1.603   0.1601  
## con$Wine         -0.01078    0.06127  -0.176   0.8661  
## con$CityPhoenix   0.36537    4.30721   0.085   0.9352  
## con$CitySedona  -14.27403    5.15324  -2.770   0.0324 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.519 on 6 degrees of freedom
## Multiple R-squared:  0.9929, Adjusted R-squared:  0.9869 
## F-statistic: 167.1 on 5 and 6 DF,  p-value: 2.361e-06

interpretation of the model.

The model is significant (pv<-0.05)

salary

salary is not significant(B=0.05,PV>0.O5,3,=3.186) although it has a positive effect to the spending. Holding other factors constant,every unit increase in salary results to an decrease in spending by 0.06 units.

weight

weight is not significant(b=-0.2,pv>0.05,=-1.6),However it has a negative effect to spending.Holding other factors constant a unit increase in weight results to a 0.2 unit decrease in spending.

wine

wine is not significant(b=-0.01,pv>0.05,0.17) and has a negative association with spending.Hodling other facotrs constant ,one unit increase in wine intake,results to a 0.01 unit in spending.

City

City is significant,recommendation is to use post hoc analysis to find difference among the cities;however, city Phoenix is 0.4 higher in spending than the reference city(Mesa);in addition, city seduna spending is 14 lower than reference city(Mesa)

conclusion

salary and residence only had significant effect on the spaending of customers in the supermarket.

recommendation

further reserach should be undertaken on the residence factor and the salary to identify which group is to be targeted.

wayforward

the reseracher dicide to do one way anova and then post hoc to fetch the insights and parttern in spending a mong the cities

c<-aov(con$Spend~con$City)
summary(c)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## con$City     2  12883    6441   13.49 0.00196 **
## Residuals    9   4296     477                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(c)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = con$Spend ~ con$City)
## 
## $`con$City`
##                diff       lwr      upr     p adj
## Phoenix-Mesa      1  -42.1345  44.1345 0.9976934
## Sedona-Mesa     -69 -112.1345 -25.8655 0.0040007
## Sedona-Phoenix  -70 -113.1345 -26.8655 0.0036506
plot(TukeyHSD(c))

visualization of cities spending

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
spen<-data.frame(city=c("Mesa","Sadona","Phoenix"),spending=c(100,30,99))
c<-spen%>%ggplot(aes(spending,city))+geom_col(fill="blue",batchsize=1)+theme_minimal()
## Warning in geom_col(fill = "blue", batchsize = 1): Ignoring unknown parameters:
## `batchsize`
c

the results are futher subjected to post hoc analysis Customers in Phoenix are spending higher than in Mesa with a diffence of mean 1, which is not significant,however Customers in Sedona are spending much less compared to both Mesa and Phoenix;Customers in Sedona are spendin a mean of 70 less than Mesa and also they are spending 60 less tha Phoenix.

Recommendation

The Mrketing manager should concetntrate on Sedona to found what is going on there .Either to do advertisement ,or motivate the saff or give out discounts or recommend further research.

salary factor

the reserach das\(wt_2 <- as.numeric(cut_number(das\)wt,3))

summary(con$Salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1200    1559    1834    1847    2210    2412
con<-con%>%mutate(Salary_binned=cut_number(Salary,n=3))
summary(con)
##      Spend            Salary         Weight            City        Wine      
##  Min.   : 10.00   Min.   :1200   Min.   : 46.00   Mesa   :4   Min.   : 10.0  
##  1st Qu.: 42.50   1st Qu.:1559   1st Qu.: 92.75   Phoenix:4   1st Qu.: 40.0  
##  Median : 81.00   Median :1834   Median :118.50   Sedona :4   Median : 62.5  
##  Mean   : 75.08   Mean   :1847   Mean   :124.67               Mean   : 62.0  
##  3rd Qu.:103.75   3rd Qu.:2210   3rd Qu.:149.75               3rd Qu.: 82.5  
##  Max.   :132.00   Max.   :2412   Max.   :230.00               Max.   :112.0  
##              Salary_binned
##  [1.2e+03,1.64e+03] :4    
##  (1.64e+03,2.05e+03]:4    
##  (2.05e+03,2.41e+03]:4    
##                           
##                           
## 

anova

sal<-aov(con$Spend~con$Salary_binned)

anova(sal)
TukeyHSD(sal)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = con$Spend ~ con$Salary_binned)
## 
## $`con$Salary_binned`
##                                         diff       lwr       upr     p adj
## (1.64e+03,2.05e+03]-[1.2e+03,1.64e+03]    52 23.708212  80.29179 0.0016026
## (2.05e+03,2.41e+03]-[1.2e+03,1.64e+03]    87 58.708212 115.29179 0.0000334
## (2.05e+03,2.41e+03]-(1.64e+03,2.05e+03]   35  6.708212  63.29179 0.0179007

visualization

spens<-data.frame(salary=c("low","midle","high"),spendings=c(100,152,187))
 cc<-spens%>%ggplot(aes(salary,spendings))+geom_col(fill="purple",batchsize=1)+theme_minimal()
## Warning in geom_col(fill = "purple", batchsize = 1): Ignoring unknown
## parameters: `batchsize`
 cc

as seen the low earners are the ones with the problem and the marketing manager should identify them and put more emphasy on them