DirectMarketing Dummy Variables & Linear Regression
Filip Dragicevic
2022-03-30
> setwd("C:/Users/filip/OneDrive/Desktop/MSCI 3230")
> dm <-
+ read.table("C:/Users/filip/OneDrive/Desktop/MSCI 3230/DirectMarketing.csv",
+ header=TRUE, stringsAsFactors=TRUE, sep=",", na.strings="NA", dec=".",
+ strip.white=TRUE)
> dm$IsFemale <- with(dm, ifelse(Gender=="Female",1,0))
> dm$IsOld <- with(dm, ifelse(Age=="Old",1,0))
> dm$IsMiddle <- with(dm, ifelse(Age=="Middle",1,0))
> dm$IsYoung <- with(dm, ifelse(Age=="Young",1,0))
> RegModel.1 <-
+ lm(AmountSpent~Catalogs+Children+IsFemale+IsMiddle+IsOld+Salary, data=dm)
> summary(RegModel.1)
Call:
lm(formula = AmountSpent ~ Catalogs + Children + IsFemale + IsMiddle +
IsOld + Salary, data = dm)
Residuals:
    Min      1Q  Median      3Q     Max
-1872.5  -355.9   -36.6   256.9  3155.5
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.720e+02  6.130e+01  -7.700 3.27e-14 ***
Catalogs     4.804e+01  2.759e+00  17.408  < 2e-16 ***
Children    -1.947e+02  1.884e+01 -10.331  < 2e-16 ***
IsFemale     3.642e+01  3.786e+01   0.962    0.336
IsMiddle    -8.139e+01  5.304e+01  -1.535    0.125
IsOld       -1.798e+01  5.904e+01  -0.305    0.761
Salary       2.125e-02  7.636e-04  27.821  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 562.2 on 993 degrees of freedom
Multiple R-squared: 0.6598, Adjusted R-squared: 0.6578
F-statistic: 321 on 6 and 993 DF, p-value: < 2.2e-16
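When a categorical variable with n levels is coded as dummy variables, include only n - 1 of them in the model; the omitted level serves as the baseline (including all n alongside the intercept would make the predictors perfectly collinear). Here Male and Young are the baselines, and each coefficient is the estimated average difference from the baseline, holding the other predictors constant
- A female customer is estimated to spend about $36 more than a male customer
- A middle-aged customer is estimated to spend about $81 less than a young customer
- An old customer is estimated to spend about $18 less than a young customer
None of these three differences is statistically significant at the 5% level (p-values 0.336, 0.125, and 0.761)
Adjusted R-squared is 0.658, so the model explains about 66% of the variance in AmountSpent, which is a reasonably good fit. The overall F-test (p-value < 2.2e-16) shows that the model as a whole is significant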
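The n - 1 rule can also be left to R: when a factor appears in an lm() formula, R builds the dummy columns itself and drops the reference level. The sketch below assumes a small invented data frame (toy values, not the DirectMarketing data) whose column names mirror the ones used in this report. It checks that hand-built dummies and a factor releveled to "Young" give the same coefficients.

```r
# Toy data for illustration only; these values are invented,
# not taken from DirectMarketing.csv.
set.seed(1)
toy <- data.frame(
  Age    = factor(rep(c("Young", "Middle", "Old"), times = 10)),
  Salary = round(runif(30, 20000, 90000))
)
toy$AmountSpent <- 0.02 * toy$Salary + rnorm(30, sd = 100)

# Manual dummies: 3 levels, so 2 dummies; "Young" is the omitted baseline.
toy$IsMiddle <- ifelse(toy$Age == "Middle", 1, 0)
toy$IsOld    <- ifelse(toy$Age == "Old", 1, 0)
manual <- lm(AmountSpent ~ IsMiddle + IsOld + Salary, data = toy)

# Same model with a factor; relevel() makes "Young" the reference level.
toy$Age <- relevel(toy$Age, ref = "Young")
auto <- lm(AmountSpent ~ Age + Salary, data = toy)

# The coefficients agree; only the names differ
# (AgeMiddle vs IsMiddle, AgeOld vs IsOld).
print(unname(coef(manual)))
print(unname(coef(auto)))
```

The factor version makes it harder to include all n levels by accident, while the manual dummies make the chosen baseline explicit in the formula.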
> confint(RegModel.1)
                  2.5 %        97.5 %
(Intercept) -592.324215 -351.74133135
Catalogs      42.621965   53.45203511
Children    -231.654305 -157.70016410
IsFemale     -37.876730  110.71612780
IsMiddle    -185.478985   22.69029729
IsOld       -133.833738   97.87583627
Salary         0.019747    0.02274408
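The 95% confidence intervals for all three dummy variables (IsFemale, IsMiddle, IsOld) include 0, so for each of them we cannot reject the null hypothesis that its coefficient is zero at the 5% level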
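The link between a confidence interval and the t-test can be made concrete by rebuilding one interval by hand. This is a minimal sketch using the rounded IsFemale estimate and standard error printed by summary(RegModel.1); the variable names are illustrative, and because the inputs are rounded the result matches confint() only to about two decimals.

```r
# Estimate and standard error for IsFemale, copied (rounded) from
# summary(RegModel.1); 993 is the residual degrees of freedom.
est <- 36.42
se  <- 37.86
df  <- 993

t_crit <- qt(0.975, df)              # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se   # estimate +/- t_crit * SE
print(round(ci, 1))

# The interval contains 0 exactly when |t value| is below the critical value.
t_value <- est / se
contains_zero <- ci[1] < 0 && ci[2] > 0
print(c(t_value = t_value, t_crit = t_crit))
print(contains_zero)
```

Checking whether 0 lies inside the 95% interval is therefore the same decision as comparing the coefficient's p-value with 0.05.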
> dm$IsOwn <- with(dm, ifelse(OwnHome=="Own",1,0))
> dm$IsRent <- with(dm, ifelse(OwnHome=="Rent",1,0))
> dm$IsSingle <- with(dm, ifelse(Married=="Single",1,0))
> dm$IsMarried <- with(dm, ifelse(Married=="Married",1,0))
> dm$IsFar <- with(dm, ifelse(Location=="Far",1,0))
> dm$IsClose <- with(dm, ifelse(Location=="Close",1,0))
> RegModel.2 <-
+ lm(AmountSpent~Catalogs+Children+IsClose+IsFemale+IsMiddle+IsOld+IsOwn+IsSingle+Salary,
+ data=dm)
> summary(RegModel.2)
Call:
lm(formula = AmountSpent ~ Catalogs + Children + IsClose + IsFemale +
IsMiddle + IsOld + IsOwn + IsSingle + Salary, data = dm)
Residuals:
     Min       1Q   Median       3Q      Max
-1858.23  -328.75   -38.61   227.90  2803.61
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.353e+02  8.350e+01  -1.620    0.106
Catalogs     4.314e+01  2.548e+00  16.929   <2e-16 ***
Children    -2.014e+02  1.724e+01 -11.678   <2e-16 ***
IsClose     -5.079e+02  3.621e+01 -14.025   <2e-16 ***
IsFemale     4.154e+01  3.466e+01   1.198    0.231
IsMiddle    -9.328e+01  5.142e+01  -1.814    0.070 .
IsOld       -4.115e+01  5.657e+01  -0.727    0.467
IsOwn        4.750e+01  3.863e+01   1.230    0.219
IsSingle     6.390e+01  4.686e+01   1.364    0.173
Salary       2.220e-02  9.824e-04  22.603   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 513.7 on 990 degrees of freedom
Multiple R-squared: 0.7168, Adjusted R-squared: 0.7143
F-statistic: 278.5 on 9 and 990 DF, p-value: < 2.2e-16
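This model adds dummy variables for OwnHome, Married, and Location (IsOwn, IsSingle, and IsClose), with Rent, Married, and Far as the baselines. Holding the other predictors constant:
- A female customer is estimated to spend about $42 more than a male customer
- A middle-aged customer is estimated to spend about $93 less than a young customer
- An old customer is estimated to spend about $41 less than a young customer
- A customer who lives close is estimated to spend about $508 less than one who lives far
- A homeowner is estimated to spend about $47.50 more than a renter
- A single customer is estimated to spend about $64 more than a married customer
Adjusted R-squared rises from 0.658 in the previous model to 0.714, so this model explains more of the variance in AmountSpent even after the penalty for the extra predictors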
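The adjusted R-squared values can be reproduced from the Multiple R-squared figures with the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of predictors. A short sketch, assuming n = 1000 observations (implied by 993 residual degrees of freedom with 6 predictors plus an intercept); the helper name adj_r2 is illustrative:

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 1000                          # 993 residual df + 6 predictors + 1 intercept

m1 <- adj_r2(0.6598, n, p = 6)     # RegModel.1
m2 <- adj_r2(0.7168, n, p = 9)     # RegModel.2
print(round(c(model1 = m1, model2 = m2), 3))
```

Because the formula penalizes each added predictor, the rise from about 0.658 to about 0.714 is not just a side effect of the second model having more variables.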
From the subsets regression summary, we can conclude that IsMiddle is the best of the demographic dummy variables, as it comes first after Salary, Catalogs, and Children
The individual p-values agree: at 0.070, IsMiddle has the lowest p-value of the dummy variables other than IsClose, which is highly significant (p-value < 2e-16)
> confint(RegModel.2)
                    2.5 %        97.5 %
(Intercept) -299.15272467   28.57220267
Catalogs      38.14094535   48.14280435
Children    -235.20639656 -167.52966426
IsClose     -578.92772364 -436.80312207
IsFemale     -26.47914738  109.56472991
IsMiddle    -194.17850128    7.62485357
IsOld       -152.15608915   69.85627745
IsOwn        -28.30717790  123.31106053
IsSingle     -28.04907925  155.85408398
Salary         0.02027706    0.02413264
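As in the previous model, the 95% confidence intervals for IsFemale, IsMiddle, IsOld, IsOwn, and IsSingle include 0, so we cannot reject the null hypothesis that each of these coefficients is zero. The exception is IsClose, whose interval (-578.9, -436.8) lies entirely below 0, so its effect is significant at the 5% level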
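Reading a long confint() table by eye is error-prone, so the check can be automated. The sketch below copies the intervals (rounded) from confint(RegModel.2) into a matrix and flags the predictors whose interval excludes 0; the intercept is left out, and the object names are illustrative.

```r
# 95% intervals copied (rounded) from confint(RegModel.2).
ci <- rbind(
  Catalogs = c(38.141, 48.143),
  Children = c(-235.206, -167.530),
  IsClose  = c(-578.928, -436.803),
  IsFemale = c(-26.479, 109.565),
  IsMiddle = c(-194.179, 7.625),
  IsOld    = c(-152.156, 69.856),
  IsOwn    = c(-28.307, 123.311),
  IsSingle = c(-28.049, 155.854),
  Salary   = c(0.02028, 0.02413)
)
colnames(ci) <- c("lower", "upper")

# An interval excludes 0 when both ends lie on the same side of 0.
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
print(names(which(excludes_zero)))
```

With the fitted model in the session, the same test can be applied directly to the matrix returned by confint(RegModel.2), using its columns 1 and 2 as the lower and upper bounds.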