Lab Test 2

Erin Dane

2022-03-21

Normality and Correlation

Determining the Distribution of Amount Spent

In the QQ plot the points curve away from the straight reference line instead of following it, which indicates that Amount Spent is not normally distributed. Because the curve opens upward, the distribution is right skewed. The histogram confirms a strong right skew: most of the frequency is on the left, with a long tail to the right. Lastly, the Shapiro-Wilk normality test gives a p-value below 2.2e-16, which is significant, so we reject the null hypothesis of normality. Overall, Amount Spent is not normally distributed.

> with(Marketing, qqPlot(AmountSpent, dist="norm", id=list(method="y", n=2, labels=rownames(Marketing))))

[1] 988 497
> with(Marketing, Hist(AmountSpent, scale="frequency", breaks="Sturges", col="darkgray"))

> normalityTest(~AmountSpent, test="shapiro.test", data=Marketing)

    Shapiro-Wilk normality test

data:  AmountSpent
W = 0.8784, p-value < 2.2e-16

Determining the Distribution of Salary

In the QQ plot the points still curve away from the reference line, but much more of the data lies within the 95% confidence envelope. The plot also has a long tail. The histogram shows a slight right skew. Lastly, the Shapiro-Wilk normality test gives a p-value of 3.763e-15, which is significant, so we reject the null hypothesis of normality: Salary is not normally distributed.

> with(Marketing, qqPlot(Salary, dist="norm", id=list(method="y", n=2, labels=rownames(Marketing))))

[1] 929 535
> with(Marketing, Hist(Salary, scale="frequency", breaks="Sturges", col="darkgray"))

> normalityTest(~Salary, test="shapiro.test", data=Marketing)

    Shapiro-Wilk normality test

data:  Salary
W = 0.96338, p-value = 3.763e-15
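
The decision rule behind both normality tests can be sketched in a few lines. This is a minimal illustration that assumes the conventional 0.05 significance level (the report does not state one) and uses only the p-values printed in the two test outputs above; the variable names are hypothetical.

```r
# Assumed significance level (not stated in the original analysis)
alpha <- 0.05

# p-values copied from the Shapiro-Wilk outputs above.
# AmountSpent is reported as "< 2.2e-16", so 2.2e-16 is an upper bound.
p_values <- c(AmountSpent = 2.2e-16, Salary = 3.763e-15)

# TRUE means the p-value is significant and normality is rejected
reject_normality <- p_values < alpha
print(reject_normality)
```

Both entries come out TRUE, so normality is rejected for both variables at this level.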

Correlation Matrix

The conclusions we can draw from this correlation matrix are:

  • There is a strong positive correlation (0.70) between Amount Spent and Salary. This is understandable, as people with larger incomes have more disposable income.
  • There is a moderate positive correlation (0.47) between Amount Spent and Catalogs. This could be because people who spend more shop more often and are more interested in receiving a catalog.
  • There is a weak negative correlation (-0.22) between Amount Spent and Children. This could be because the more children you have, the more money you need to set aside for their education, sports, etc.
  • All other correlations are below 0.2 in absolute value, meaning they are too weak to interpret.
> cor(Marketing[,c("AmountSpent","Catalogs","Children","Salary")], use="complete")
            AmountSpent   Catalogs    Children     Salary
AmountSpent   1.0000000  0.4726499 -0.22230817 0.69959571
Catalogs      0.4726499  1.0000000 -0.11345543 0.18355086
Children     -0.2223082 -0.1134554  1.00000000 0.04966316
Salary        0.6995957  0.1835509  0.04966316 1.00000000
> scatterplotMatrix(~AmountSpent+Catalogs+Children+Salary, regLine=FALSE, smooth=FALSE, diagonal=list(method="density"), data=Marketing)
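
The 0.2 cutoff used in the conclusions can be applied mechanically. The sketch below re-enters the printed correlation matrix by hand (so it runs without the Marketing data) and lists each pair of variables whose correlation is at least 0.2 in absolute value. The object names `r`, `idx`, and `flagged` are hypothetical and not part of the original analysis.

```r
vars <- c("AmountSpent", "Catalogs", "Children", "Salary")

# Values copied from the cor() output above
r <- matrix(c(
   1.0000000,  0.4726499, -0.22230817, 0.69959571,
   0.4726499,  1.0000000, -0.11345543, 0.18355086,
  -0.2223082, -0.1134554,  1.00000000, 0.04966316,
   0.6995957,  0.1835509,  0.04966316, 1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Look at each pair once (upper triangle) and keep those with |r| >= 0.2
idx <- which(upper.tri(r) & abs(r) >= 0.2, arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(flagged)
```

Only the three pairs involving AmountSpent pass the cutoff, which matches the three correlations interpreted in the list above.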

Regression Models

Linear Regression Salary and Amount Spent

From the output we can determine:

  • The maximum residual (3752.9) is larger in magnitude than the minimum residual (-2179.7), so the residuals are slightly right skewed. This suggests the model does not predict as well at the high end, where some customers spend far more than their salary alone would predict.
  • From the coefficients we can determine the formula for Amount Spent: \[ y = 0.02196x - 15.31783 \] \[ \text{where } x = \text{salary} \] \[ \text{where } y = \text{amount spent} \] For example: \[ x = 80{,}000 \] \[ y = 0.02196(80{,}000) - 15.31783 \] \[ y = 1741.48217 \]
> RegModel.1 <- lm(AmountSpent~Salary, data=Marketing)
> summary(RegModel.1)

Call:
lm(formula = AmountSpent ~ Salary, data = Marketing)

Residuals:
    Min      1Q  Median      3Q     Max 
-2179.7  -315.2   -53.5   279.7  3752.9 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15.31783   45.37416  -0.338    0.736    
Salary        0.02196    0.00071  30.930   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 687.1 on 998 degrees of freedom
Multiple R-squared:  0.4894,    Adjusted R-squared:  0.4889 
F-statistic: 956.7 on 1 and 998 DF,  p-value: < 2.2e-16
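
The worked prediction can be checked with a short sketch. It uses the rounded coefficients printed in the summary above, so it runs without the Marketing data; `predict_amount_spent` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients copied from the summary(RegModel.1) output above
intercept <- -15.31783
slope     <- 0.02196

# Hypothetical helper: predicted Amount Spent for a given Salary
predict_amount_spent <- function(salary) {
  intercept + slope * salary
}

# Predicted Amount Spent for a salary of 80,000 (about 1741.48)
predict_amount_spent(80000)
```

With the fitted model object available, `predict(RegModel.1, newdata = data.frame(Salary = 80000))` should give a similar value. It may differ slightly because the model stores the coefficients at full precision.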

Linear Regression Amount Spent and Catalogs

From the output we can determine:

  • The minimum residual (-14.6592) is larger in magnitude than the maximum residual (12.6460), so the residuals are slightly left skewed. This suggests the model does not predict as well at the low end of Catalogs, where it over-predicts for some customers.
  • From the coefficients we can determine the formula for Catalogs: \[ y = 0.0032571x + 10.7188406 \] \[ \text{where } x = \text{amount spent} \] \[ \text{where } y = \text{catalogs} \] For example: \[ x = 900 \] \[ y = 0.0032571(900) + 10.7188406 \] \[ y = 13.6502306 \]
> RegModel.5 <- lm(Catalogs~AmountSpent, data=Marketing)
> summary(RegModel.5)

Call:
lm(formula = Catalogs ~ AmountSpent, data = Marketing)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.6592  -5.3263  -0.3718   4.5121  12.6460 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.072e+01  2.980e-01   35.97   <2e-16 ***
AmountSpent 3.257e-03  1.922e-04   16.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.839 on 998 degrees of freedom
Multiple R-squared:  0.2234,    Adjusted R-squared:  0.2226 
F-statistic: 287.1 on 1 and 998 DF,  p-value: < 2.2e-16
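
The same check works for this model. The sketch below uses the coefficient values quoted in the formula above, which carry more digits than the summary table prints; `predict_catalogs` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients as quoted in the formula above
intercept <- 10.7188406
slope     <- 0.0032571

# Hypothetical helper: predicted Catalogs for a given Amount Spent
predict_catalogs <- function(amount_spent) {
  intercept + slope * amount_spent
}

# Predicted number of catalogs for an Amount Spent of 900 (about 13.65)
predict_catalogs(900)
```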

Recommendation

I would recommend using the linear regression model of Amount Spent on Salary. This model can help the marketer estimate how much each person would spend based on their salary, which in turn helps them target the people who will spend the most. It also allows the marketer to advertise the right products to each demographic: someone with a higher salary is more likely to be able to purchase a $500 gaming system than someone with a lower salary.
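
One numeric way to compare the two models, offered as a sketch rather than as part of the original analysis: in a simple linear regression, R-squared equals the squared correlation between the two variables. Squaring the correlations from the correlation matrix should therefore reproduce the Multiple R-squared values reported in the two summaries.

```r
# Correlations copied from the correlation matrix above
r_salary   <- 0.69959571  # AmountSpent vs Salary
r_catalogs <- 0.4726499   # AmountSpent vs Catalogs

# Squared correlations, rounded to the four decimals used in the summaries
r_squared <- round(c(Salary = r_salary^2, Catalogs = r_catalogs^2), 4)
print(r_squared)
```

These match the reported values of 0.4894 and 0.2234, so the Salary model explains roughly 49% of the variance in its response, compared with about 22% for the Catalogs model.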