Direct Marketing Lab Test 2

Filip Dragicevic

Student ID: 104607907

2022-03-21

## Loading required package: splines
## Loading required package: RcmdrMisc
## Loading required package: car
## Loading required package: carData
## Loading required package: sandwich
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
## Loading required package: effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
## The Commander GUI is launched only in interactive sessions
## 
## Attaching package: 'Rcmdr'
## The following object is masked from 'package:base':
## 
##     errorCondition
> setwd("C:/Users/filip/OneDrive/Desktop/MSCI 3230")
> direct <- 
+   read.table("C:/Users/filip/OneDrive/Desktop/MSCI 3230/DirectMarketingg.csv",
+    header=TRUE, stringsAsFactors=TRUE, sep=",", na.strings="NA", dec=".", 
+   strip.white=TRUE)

Visualization & Exploratory Analysis

> with(direct, Hist(AmountSpent, scale="frequency", breaks="Sturges", 
+   col="darkgray"))

- To start, we can conduct exploratory analysis on Amount Spent, the main variable of interest for the marketing agent.

- From the histogram, we can see that it is strongly right skewed: the peak is left of the center and a long tail extends to the right of the main set of data.

> Boxplot( ~ AmountSpent, data=direct, id=list(method="y"))

 [1] "181" "256" "386" "7"   "133" "278" "240" "291" "202" "140"

- This is further confirmed by the boxplot, as the top whisker is much longer than the bottom.

- The boxplot also flags a large number of high outliers in the Amount Spent variable; the row labels printed above identify the flagged observations.

> normalityTest(~AmountSpent, test="shapiro.test", data=direct)

    Shapiro-Wilk normality test

data:  AmountSpent
W = 0.8784, p-value < 2.2e-16

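- Since the p-value is less than 0.05, we reject the null hypothesis that the distribution is normal, confirming that Amount Spent is not normally distributed.

- Because Amount Spent is skewed, it has to be used with caution in regression models. Skewness alone does not rule out linear regression, but the residuals of any model that uses it should be checked.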
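
- As a rough sketch of how the skew could be quantified, the base R code below computes sample skewness (the third standardized moment). The `skewness` helper and the simulated `spend_sim` vector are assumptions for illustration; they are not part of the lab data or of Rcmdr.

```r
# Sample skewness: third central moment divided by the cubed
# (population-form) standard deviation. Positive means right skew.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Simulated right-skewed values, NOT the DirectMarketing data.
set.seed(1)
spend_sim <- rlnorm(1000, meanlog = 6.5, sdlog = 0.8)

skewness(spend_sim)       # positive for a right-skewed sample
skewness(log(spend_sim))  # much closer to 0 after a log transform
```

- With the lab data loaded, the same helper could be called as `skewness(direct$AmountSpent)`; a clearly positive value would agree with the histogram and boxplot.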

> with(direct, Hist(Salary, scale="frequency", breaks="Sturges", 
+   col="darkgray"))

- We can conduct a similar analysis for Salary.

- From the Salary histogram, we can see that Salary is slightly right (positively) skewed, since the tail to the right of the main portion of the data is longer than the tail to the left.

> Boxplot( ~ Salary, data=direct, id=list(method="y"))

[1] "177"

- From the Salary boxplot, we can see that the upper whisker is much longer than the bottom, further confirming that it is right skewed.

- Only one outlier is flagged, the observation in row 177.

> normalityTest(~Salary, test="shapiro.test", data=direct)

    Shapiro-Wilk normality test

data:  Salary
W = 0.96338, p-value = 3.763e-15

- We can test for normality using a Shapiro-Wilk test.

- Since the p-value is less than 0.05, we reject the null hypothesis that the distribution is normal, confirming that Salary is not normally distributed.

- Because Salary is skewed, it has to be used with caution in regression models. A skewed predictor does not rule out linear regression, but the residuals of any model that uses it should be checked.

> scatterplot(Salary~AmountSpent, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+    data=direct)

- By creating a scatterplot of Salary against Amount Spent, we can see that the higher the salary, the higher the amount spent.

- The points trend upward, so a line of best fit (not drawn here because regLine=FALSE) would have a positive slope.

> cor(direct[,c("AmountSpent","Catalogs","Salary")], use="complete")
            AmountSpent  Catalogs    Salary
AmountSpent   1.0000000 0.4726499 0.6995957
Catalogs      0.4726499 1.0000000 0.1835509
Salary        0.6995957 0.1835509 1.0000000

- A correlation matrix between Amount Spent, Catalogs, and Salary shows us which variables are positively correlated with Amount Spent.

- Although both correlations are positive, the one for Salary is noticeably higher (about 0.70 compared to 0.47), meaning it is a better indicator of the potential amount an individual is willing to spend.

> cor(direct[,c("Age","AmountSpent","Children","Gender","Married","OwnHome")],
+    use="complete")
                     Age AmountSpent     Children       Gender      Married
Age          1.000000000   0.3482505 -0.271118420 -0.001458581  0.255993305
AmountSpent  0.348250488   1.0000000 -0.222308170 -0.201690213  0.475879979
Children    -0.271118420  -0.2223082  1.000000000  0.105469083  0.009770249
Gender      -0.001458581  -0.2016902  0.105469083  1.000000000 -0.116057285
Married      0.255993305   0.4758800  0.009770249 -0.116057285  1.000000000
OwnHome      0.428896769   0.3508080 -0.032274083 -0.084433317  0.264009318
                OwnHome
Age          0.42889677
AmountSpent  0.35080800
Children    -0.03227408
Gender      -0.08443332
Married      0.26400932
OwnHome      1.00000000

- By computing another matrix between Amount Spent and the rest of the variables, we can see how things like age, gender, number of children, etc. relate to amount spent.

- The highest correlation with Amount Spent is for the Married variable (0.476), followed by Own Home (0.351) and Age (0.348).

- The positive correlation for Married tells us that married customers tend to spend more.

- Number of children and gender have negative correlations with Amount Spent (-0.222 and -0.202). More children are associated with lower spending, and the gender group coded with the higher value tends to spend less.

> RegModel.1 <- lm(AmountSpent~Salary, data=direct)
> summary(RegModel.1)

Call:
lm(formula = AmountSpent ~ Salary, data = direct)

Residuals:
    Min      1Q  Median      3Q     Max 
-2179.7  -315.2   -53.5   279.7  3752.9 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15.31783   45.37416  -0.338    0.736    
Salary        0.02196    0.00071  30.930   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 687.1 on 998 degrees of freedom
Multiple R-squared:  0.4894,    Adjusted R-squared:  0.4889 
F-statistic: 956.7 on 1 and 998 DF,  p-value: < 2.2e-16

- We can also fit several regression models to analyze the linear relationship between the input and output variables.

- For the marketing agent, Amount Spent is always going to be the desired output/response variable.

- From the Amount Spent and Salary linear regression model, we can see that R-squared is 0.489, so Salary explains about 49% of the variation in Amount Spent. This is moderate and slightly lower than desired.

- The Salary coefficient is positive (0.02196) and highly significant (p < 2e-16), which tells us that there is a positive relationship between the two.
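
- To make the fitted Salary model concrete, the sketch below plugs the reported coefficients into the regression equation. The `predict_spend` helper and the example salary of 50000 are hypothetical and chosen only for illustration.

```r
# Coefficients copied from summary(RegModel.1) above.
b0 <- -15.31783   # intercept
b1 <- 0.02196     # slope for Salary

# Predicted Amount Spent for a given salary under the fitted line.
predict_spend <- function(salary) b0 + b1 * salary

salary_example <- 50000          # hypothetical salary, not from the data
predict_spend(salary_example)    # about 1082.68

# Expected change in Amount Spent for 10000 more in salary.
b1 * 10000                       # about 219.6
```

- With the model object available, `predict(RegModel.1, newdata = data.frame(Salary = 50000))` should give the same prediction up to rounding of the printed coefficients.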

> RegModel.2 <- lm(AmountSpent~Catalogs, data=direct)
> summary(RegModel.2)

Call:
lm(formula = AmountSpent ~ Catalogs, data = direct)

Residuals:
    Min      1Q  Median      3Q     Max 
-1660.9  -536.2  -135.1   399.8  4361.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  209.766     65.194   3.218  0.00133 ** 
Catalogs      68.588      4.048  16.944  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 847.4 on 998 degrees of freedom
Multiple R-squared:  0.2234,    Adjusted R-squared:  0.2226 
F-statistic: 287.1 on 1 and 998 DF,  p-value: < 2.2e-16

- The same model can be run with Catalogs.

- This has a lower R-squared than the Salary model, only 0.223, meaning the relationship is not as strong.

> RegModel.3 <- lm(AmountSpent~Married, data=direct)
> summary(RegModel.3)

Call:
lm(formula = AmountSpent ~ Married, data = direct)

Residuals:
    Min      1Q  Median      3Q     Max 
-1579.1  -533.1  -171.9   420.9  4544.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   757.81      37.90   20.00   <2e-16 ***
Married       914.26      53.49   17.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 845.7 on 998 degrees of freedom
Multiple R-squared:  0.2265,    Adjusted R-squared:  0.2257 
F-statistic: 292.2 on 1 and 998 DF,  p-value: < 2.2e-16

- The same model can be run with Married.

- This also has a lower R-squared than the Salary model, only 0.226, meaning the relationship is not as strong.
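
- The Married coefficients can be read as group averages if Married is coded 0/1, which is an assumption here since the coding is not shown in the output. Under that assumption, the intercept is the mean amount spent by unmarried customers and the intercept plus the slope is the mean for married customers. The sketch below does the arithmetic and checks the general property on simulated data (the variables `g` and `y` are made up).

```r
# Coefficients copied from summary(RegModel.3) above.
b0 <- 757.81   # intercept
b1 <- 914.26   # slope for Married

mean_unmarried <- b0        # 757.81, assuming Married = 0
mean_married   <- b0 + b1   # 1672.07, assuming Married = 1

# Check on simulated data: with a 0/1 predictor, the fitted
# coefficients reproduce the two group means exactly.
set.seed(42)
g <- rep(c(0, 1), each = 50)
y <- 100 + 30 * g + rnorm(100)
fit <- lm(y ~ g)

all.equal(unname(coef(fit)[1]), mean(y[g == 0]))
all.equal(unname(coef(fit)[1] + coef(fit)[2]), mean(y[g == 1]))
```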

> RegModel.4 <- lm(AmountSpent~OwnHome, data=direct)
> summary(RegModel.4)

Call:
lm(formula = AmountSpent ~ OwnHome, data = direct)

Residuals:
    Min      1Q  Median      3Q     Max 
-1478.1  -623.8  -226.3   425.7  4961.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   868.83      40.93   21.23   <2e-16 ***
OwnHome       674.31      56.98   11.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 900.4 on 998 degrees of freedom
Multiple R-squared:  0.1231,    Adjusted R-squared:  0.1222 
F-statistic: 140.1 on 1 and 998 DF,  p-value: < 2.2e-16
> RegModel.5 <- lm(AmountSpent~Age, data=direct)
> summary(RegModel.5)

Call:
lm(formula = AmountSpent ~ Age, data = direct)

Residuals:
    Min      1Q  Median      3Q     Max 
-1671.4  -573.9  -219.8   437.6  4621.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   295.72      83.49   3.542 0.000416 ***
Age           480.21      40.92  11.736  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 901.4 on 998 degrees of freedom
Multiple R-squared:  0.1213,    Adjusted R-squared:  0.1204 
F-statistic: 137.7 on 1 and 998 DF,  p-value: < 2.2e-16

- The same models can be run with Own Home and Age.

- However, from our correlation matrix, we know these variables are not going to be as strong, as shown by the low R-squared values (0.123 and 0.121).

- Therefore, the model to recommend to the marketing agent is Amount Spent and Salary, which has by far the highest R-squared (0.489) and the strongest relationship with the amount an individual spends.

- Amount Spent and Catalogs (0.223) is a possible second choice, although the Married model has a marginally higher R-squared (0.226).
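
- The R-squared values reported above can be cross-checked against the correlation matrices: in a simple regression with one predictor, R-squared equals the squared correlation between the predictor and the response. The sketch below uses only the correlations printed earlier; the vector name `r` is arbitrary.

```r
# Correlations with AmountSpent, copied from the cor() output above.
r <- c(Salary   = 0.6995957,
       Catalogs = 0.4726499,
       Married  = 0.4758800,
       OwnHome  = 0.3508080,
       Age      = 0.3482505)

# Squared correlation = R-squared of the one-predictor model.
r_squared <- round(r^2, 4)

# Rank the five single-predictor models from strongest to weakest.
sort(r_squared, decreasing = TRUE)
```

- The squared values match the Multiple R-squared lines of RegModel.1 through RegModel.5 (0.4894, 0.2234, 0.2265, 0.1231, 0.1213), with Salary ranked first.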