Direct Marketing Lab Test 2
Filip Dragicevic
Student ID: 104607907
2022-03-21
> setwd("C:/Users/filip/OneDrive/Desktop/MSCI 3230")
> direct <-
+ read.table("C:/Users/filip/OneDrive/Desktop/MSCI 3230/DirectMarketingg.csv",
+ header=TRUE, stringsAsFactors=TRUE, sep=",", na.strings="NA", dec=".",
+ strip.white=TRUE)
Visualization & Exploratory Analysis
> with(direct, Hist(AmountSpent, scale="frequency", breaks="Sturges",
+ col="darkgray"))

- To start, we explore Amount Spent, the main variable of interest for the marketing agent.
- The histogram shows that it is strongly right-skewed: most observations are concentrated at the low end, the peak lies left of center, and a long tail extends toward high values.
> Boxplot( ~ AmountSpent, data=direct, id=list(method="y"))

[1] "181" "256" "386" "7" "133" "278" "240" "291" "202" "140"
- The boxplot confirms the skew, as the upper whisker is much longer than the lower one.
- Amount Spent also has many outliers; the numbers printed above are the row labels of the points identified on the plot.
> normalityTest(~AmountSpent, test="shapiro.test", data=direct)
Shapiro-Wilk normality test
data: AmountSpent
W = 0.8784, p-value < 2.2e-16
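- Since the p-value is less than 0.05, we reject the null hypothesis that the distribution is normal, confirming that Amount Spent is not normally distributed.
- Because Amount Spent is skewed, regression models that use it as the response should be interpreted with caution. Linear regression assumes normally distributed residuals rather than a normally distributed response, so the skew is a warning sign rather than a barrier, and a transformation such as a logarithm may help.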
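A log transform is one common response to a right-skewed variable such as Amount Spent. The following is a minimal sketch on simulated data, not the DirectMarketing file; the variable `spend`, the helper `skewness`, and the simulation parameters are assumptions made for illustration only.

```r
# Simulated right-skewed data (an assumption; not the DirectMarketing file)
set.seed(1)
spend <- rlnorm(500, meanlog = 6.5, sdlog = 0.8)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(spend)        # skewness on the original scale
skew_log <- skewness(log(spend))   # skewness after the log transform

# Shapiro-Wilk test on the raw and the log-transformed values
p_raw <- shapiro.test(spend)$p.value
p_log <- shapiro.test(log(spend))$p.value

print(c(skew_raw = skew_raw, skew_log = skew_log))
print(c(p_raw = p_raw, p_log = p_log))
```

On the real data, the same idea would be to compare `Hist(AmountSpent)` with `Hist(log(AmountSpent))` and to inspect the residuals of any fitted model, rather than relying on the distribution of the response alone.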
> with(direct, Hist(Salary, scale="frequency", breaks="Sturges",
+ col="darkgray"))

- We can conduct a similar analysis for Salary.
- The histogram shows that Salary is slightly right (positively) skewed, since the tail extends further to the right of the main body of the data.
> Boxplot( ~ Salary, data=direct, id=list(method="y"))

[1] "177"
- In the Salary boxplot, the upper whisker is much longer than the lower one, further confirming the right skew.
- There is only one outlier, identified as row 177.
> normalityTest(~Salary, test="shapiro.test", data=direct)
Shapiro-Wilk normality test
data: Salary
W = 0.96338, p-value = 3.763e-15
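- We can test normality formally with a Shapiro-Wilk test.
- Since the p-value is less than 0.05, we reject the null hypothesis that the distribution is normal, confirming that Salary is not normally distributed.
- Although Salary is skewed, this does not rule it out as a predictor. Linear regression makes no normality assumption about predictors, only about residuals, so the skew mainly calls for checking the residuals of any model that uses it.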
> scatterplot(Salary~AmountSpent, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=direct)

- The scatterplot of Salary and Amount Spent shows that higher salaries tend to go with higher amounts spent. Note that the formula Salary~AmountSpent places Salary on the vertical axis.
- The points trend upward, so a line of best fit (not drawn here, since regLine=FALSE) would have a positive slope.
> cor(direct[,c("AmountSpent","Catalogs","Salary")], use="complete")
            AmountSpent  Catalogs    Salary
AmountSpent   1.0000000 0.4726499 0.6995957
Catalogs      0.4726499 1.0000000 0.1835509
Salary        0.6995957 0.1835509 1.0000000
- A correlation matrix of Amount Spent, Catalogs, and Salary shows how strongly each variable is correlated with Amount Spent.
- Both correlations are positive, but Salary's is considerably stronger (about 0.70 compared to 0.47), meaning it is the better linear indicator of how much an individual spends.
> cor(direct[,c("Age","AmountSpent","Children","Gender","Married","OwnHome")],
+ use="complete")
                     Age AmountSpent     Children       Gender      Married     OwnHome
Age          1.000000000   0.3482505 -0.271118420 -0.001458581  0.255993305  0.42889677
AmountSpent  0.348250488   1.0000000 -0.222308170 -0.201690213  0.475879979  0.35080800
Children    -0.271118420  -0.2223082  1.000000000  0.105469083  0.009770249 -0.03227408
Gender      -0.001458581  -0.2016902  0.105469083  1.000000000 -0.116057285 -0.08443332
Married      0.255993305   0.4758800  0.009770249 -0.116057285  1.000000000  0.26400932
OwnHome      0.428896769   0.3508080 -0.032274083 -0.084433317  0.264009318  1.00000000
- A second correlation matrix relates Amount Spent to the remaining variables (Age, Children, Gender, Married, and OwnHome), all of which are numerically coded in this dataset.
- The strongest correlation with Amount Spent is Married (0.476), followed by OwnHome (0.351) and Age (0.348).
- The positive correlation with Married tells us that married customers tend to spend more.
- Children and Gender are negatively correlated with Amount Spent (about -0.22 and -0.20). Spending tends to fall as the number of children rises, while for Gender the sign only reflects which category is coded with the higher value.
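For two-category variables such as Gender or Married, the sign of a correlation depends entirely on which category receives the higher code. The sketch below illustrates this on simulated data; the variables `group` and `spend` and all numeric settings are hypothetical and are not taken from the DirectMarketing file.

```r
# Simulated data (an assumption; not the DirectMarketing file)
set.seed(2)
group <- rbinom(200, 1, 0.5)                       # hypothetical 0/1 indicator
spend <- 1000 + 300 * group + rnorm(200, sd = 400) # group 1 spends more on average

r_original <- cor(spend, group)      # correlation with the original coding
r_flipped  <- cor(spend, 1 - group)  # same data, category labels swapped

# The two correlations have the same magnitude and opposite signs
print(c(r_original = r_original, r_flipped = r_flipped))
```

This is why a negative correlation with a binary variable is best read as "the category coded with the higher value tends to spend less", not as spending falling while the variable increases.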
> RegModel.1 <- lm(AmountSpent~Salary, data=direct)
> summary(RegModel.1)
Call:
lm(formula = AmountSpent ~ Salary, data = direct)
Residuals:
    Min      1Q  Median      3Q     Max
-2179.7  -315.2   -53.5   279.7  3752.9
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.31783   45.37416  -0.338    0.736
Salary        0.02196    0.00071  30.930   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 687.1 on 998 degrees of freedom
Multiple R-squared: 0.4894, Adjusted R-squared: 0.4889
F-statistic: 956.7 on 1 and 998 DF, p-value: < 2.2e-16
- We can also fit several simple linear regression models to analyze the linear relationship between each input variable and the output.
- For the marketing agent, Amount Spent is always going to be the desired output/response variable.
- In the regression of Amount Spent on Salary, R-squared is 0.489, so Salary explains about 49% of the variation in Amount Spent, which is moderate but lower than desired.
- The Salary coefficient is positive (0.02196) and highly significant (p < 2e-16), which tells us that there is a positive relationship between the two.
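The fitted Salary model can be turned into a prediction by plugging a salary into the estimated equation. The sketch below copies the two coefficients from the summary(RegModel.1) output above; the function name `predict_spend` and the example salary of 50000 are hypothetical choices made for illustration.

```r
# Coefficients copied from the summary(RegModel.1) output
intercept <- -15.31783
slope     <- 0.02196

# Predicted Amount Spent for a given Salary under the fitted line
predict_spend <- function(salary) intercept + slope * salary

salary_example <- 50000                      # hypothetical salary
predicted <- predict_spend(salary_example)   # -15.31783 + 0.02196 * 50000
print(predicted)

# Change in predicted spending for a salary increase of 10000
increase <- predict_spend(60000) - predict_spend(50000)
print(increase)
```

With the data loaded, the same prediction should be available from `predict(RegModel.1, newdata = data.frame(Salary = 50000))`, up to rounding of the printed coefficients.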
> RegModel.2 <- lm(AmountSpent~Catalogs, data=direct)
> summary(RegModel.2)
Call:
lm(formula = AmountSpent ~ Catalogs, data = direct)
Residuals:
    Min      1Q  Median      3Q     Max
-1660.9  -536.2  -135.1   399.8  4361.1
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.766     65.194   3.218  0.00133 **
Catalogs      68.588      4.048  16.944  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 847.4 on 998 degrees of freedom
Multiple R-squared: 0.2234, Adjusted R-squared: 0.2226
F-statistic: 287.1 on 1 and 998 DF, p-value: < 2.2e-16
- The same model can be run with Catalogs.
- This has a lower R-squared of only 0.223, meaning the relationship is weaker than with Salary, although the Catalogs coefficient (68.59) is still positive and highly significant.
> RegModel.3 <- lm(AmountSpent~Married, data=direct)
> summary(RegModel.3)
Call:
lm(formula = AmountSpent ~ Married, data = direct)
Residuals:
    Min      1Q  Median      3Q     Max
-1579.1  -533.1  -171.9   420.9  4544.9
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   757.81      37.90   20.00   <2e-16 ***
Married       914.26      53.49   17.09   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 845.7 on 998 degrees of freedom
Multiple R-squared: 0.2265, Adjusted R-squared: 0.2257
F-statistic: 292.2 on 1 and 998 DF, p-value: < 2.2e-16
- The same model can be run with Married.
- This has an R-squared of only 0.226, much lower than for Salary and about the same as for Catalogs, meaning the relationship is not as strong as with Salary.
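When the predictor is a two-category indicator, the regression coefficients have a direct reading as group means. The sketch below shows this on simulated data, assuming a 0/1 coding; the coding of Married is not shown in this report, so that is an assumption, and the variables `married` and `spend` here are hypothetical stand-ins.

```r
# Simulated data with a 0/1 indicator (an assumption; not the DirectMarketing file)
set.seed(3)
married <- rep(c(0, 1), each = 100)
spend   <- 750 + 900 * married + rnorm(200, sd = 500)

fit <- lm(spend ~ married)

# Group means computed directly
mean0 <- mean(spend[married == 0])
mean1 <- mean(spend[married == 1])

# The intercept equals the mean of group 0;
# the slope equals the difference between the two group means
print(coef(fit))
print(c(mean0 = mean0, difference = mean1 - mean0))
```

Under the same 0/1 assumption, the RegModel.3 output would read as an average Amount Spent of about 757.81 for unmarried customers and about 757.81 + 914.26 = 1672.07 for married customers.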
> RegModel.4 <- lm(AmountSpent~OwnHome, data=direct)
> summary(RegModel.4)
Call:
lm(formula = AmountSpent ~ OwnHome, data = direct)
Residuals:
    Min      1Q  Median      3Q     Max
-1478.1  -623.8  -226.3   425.7  4961.2
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   868.83      40.93   21.23   <2e-16 ***
OwnHome       674.31      56.98   11.84   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 900.4 on 998 degrees of freedom
Multiple R-squared: 0.1231, Adjusted R-squared: 0.1222
F-statistic: 140.1 on 1 and 998 DF, p-value: < 2.2e-16
> RegModel.5 <- lm(AmountSpent~Age, data=direct)
> summary(RegModel.5)
Call:
lm(formula = AmountSpent ~ Age, data = direct)
Residuals:
    Min      1Q  Median      3Q     Max
-1671.4  -573.9  -219.8   437.6  4621.9
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   295.72      83.49   3.542 0.000416 ***
Age           480.21      40.92  11.736  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 901.4 on 998 degrees of freedom
Multiple R-squared: 0.1213, Adjusted R-squared: 0.1204
F-statistic: 137.7 on 1 and 998 DF, p-value: < 2.2e-16
- The same models can be run with OwnHome and Age.
- As the correlation matrix suggested, these variables are weaker predictors, with R-squared values of only 0.123 and 0.121.
- Therefore, the model to recommend to the marketing agent is Amount Spent on Salary, with Amount Spent on Catalogs as a secondary option.
- Salary has by far the highest R-squared (0.489) and so the strongest relationship with the amount an individual spends; Catalogs (0.223) is essentially tied with Married (0.226) for second.
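In a regression with a single predictor, R-squared equals the squared correlation between the predictor and the response, so the five models can be cross-checked against the correlation matrices. The sketch below uses only numbers printed in this report; the vector names `r`, `r_squared`, and `reported` are chosen here for illustration.

```r
# Correlations with AmountSpent, copied from the two correlation matrices
r <- c(Salary = 0.6995957, Catalogs = 0.4726499, Married = 0.4758800,
       OwnHome = 0.3508080, Age = 0.3482505)

# Squared correlations, rounded to the precision shown by summary()
r_squared <- round(r^2, 4)

# Multiple R-squared values copied from the five model summaries
reported <- c(Salary = 0.4894, Catalogs = 0.2234, Married = 0.2265,
              OwnHome = 0.1231, Age = 0.1213)

# Rank the predictors from strongest to weakest
ranking <- sort(r_squared, decreasing = TRUE)
print(ranking)
print(isTRUE(all.equal(r_squared, reported)))
```

The squared correlations match the reported R-squared values, which also means the ranking of these single-predictor models could have been read from the correlation matrices alone.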