DirectMarketing Dummy Variables & Linear Regression

Filip Dragicevic

2022-03-30

> setwd("C:/Users/filip/OneDrive/Desktop/MSCI 3230")

> dm <- 
+   read.table("C:/Users/filip/OneDrive/Desktop/MSCI 3230/DirectMarketing.csv", 
+   header=TRUE, stringsAsFactors=TRUE, sep=",", na.strings="NA", dec=".", 
+   strip.white=TRUE)

> dm$IsFemale <- with(dm, ifelse(Gender=="Female",1,0))

> dm$IsOld <- with(dm, ifelse(Age=="Old",1,0))

> dm$IsMiddle <- with(dm, ifelse(Age=="Middle",1,0))

> dm$IsYoung <- with(dm, ifelse(Age=="Young",1,0))

> RegModel.1 <- 
+   lm(AmountSpent~Catalogs+Children+IsFemale+IsMiddle+IsOld+Salary, data=dm)
> summary(RegModel.1)


Call:
lm(formula = AmountSpent ~ Catalogs + Children + IsFemale + IsMiddle + 
    IsOld + Salary, data = dm)

Residuals:
    Min      1Q  Median      3Q     Max 
-1872.5  -355.9   -36.6   256.9  3155.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.720e+02  6.130e+01  -7.700 3.27e-14 ***
Catalogs     4.804e+01  2.759e+00  17.408  < 2e-16 ***
Children    -1.947e+02  1.884e+01 -10.331  < 2e-16 ***
IsFemale     3.642e+01  3.786e+01   0.962    0.336    
IsMiddle    -8.139e+01  5.304e+01  -1.535    0.125    
IsOld       -1.798e+01  5.904e+01  -0.305    0.761    
Salary       2.125e-02  7.636e-04  27.821  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 562.2 on 993 degrees of freedom
Multiple R-squared:  0.6598,    Adjusted R-squared:  0.6578 
F-statistic:   321 on 6 and 993 DF,  p-value: < 2.2e-16

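When using dummy variables in linear regression, include one fewer dummy than the number of categories (n - 1 dummies for a variable with n categories) and use the omitted category as the benchmark. Here the benchmarks are Male for Gender and Young for Age.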

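- Female customers are estimated to spend about $36 more than male customers, holding the other variables constant

- Middle-aged customers are estimated to spend about $81 less than young customers

- Old customers are estimated to spend about $18 less than young customers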
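The n - 1 rule can be checked on a small made-up data set. This is only a sketch: the `toy` data frame and its values are hypothetical and are not taken from DirectMarketing.csv. It shows that a three-level factor yields two dummy columns, and that the manual `ifelse()` coding gives the same estimates as letting `lm()` code the factor itself.

```r
# Toy data (hypothetical values, not from DirectMarketing.csv)
toy <- data.frame(
  Age = factor(c("Young", "Middle", "Old", "Young", "Old", "Middle")),
  AmountSpent = c(400, 900, 1100, 350, 1200, 800)
)

# Make "Young" the reference level, i.e. the omitted benchmark category
toy$Age <- relevel(toy$Age, ref = "Young")

# model.matrix() shows the columns lm() would build: an intercept plus
# one dummy per non-reference level (3 levels -> 2 dummies)
X <- model.matrix(~ Age, data = toy)
print(colnames(X))

# Manual dummies, coded the same way as in the session above
toy$IsMiddle <- ifelse(toy$Age == "Middle", 1, 0)
toy$IsOld    <- ifelse(toy$Age == "Old", 1, 0)

# Both codings should give the same coefficient estimates
fit_factor <- lm(AmountSpent ~ Age, data = toy)
fit_manual <- lm(AmountSpent ~ IsMiddle + IsOld, data = toy)
print(unname(coef(fit_factor)))
print(unname(coef(fit_manual)))
```

Adding a third dummy (IsYoung) alongside the intercept would make the columns perfectly collinear, which is why one category is left out.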

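Adjusted R-Squared value is 0.658, meaning the model explains roughly 66% of the variation in AmountSpent, which is a reasonably good fit. The overall F-test (p-value < 2.2e-16) shows that the model as a whole is statistically significant.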
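As a rough check, the adjusted R-squared can be recomputed from the multiple R-squared printed in the summary. The helper `adj_r2` below is a hypothetical name introduced for illustration, and the sample size is inferred from the residual degrees of freedom (993) plus the 7 estimated coefficients. Because the printed R-squared is rounded, the result only matches the reported value approximately.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n inferred from the summary: residual df + number of estimated coefficients
n <- 993 + 7

# RegModel.1 has 6 predictors and a printed multiple R-squared of 0.6598
adj1 <- adj_r2(0.6598, n, 6)
print(adj1)
```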

> confint(RegModel.1)

                  2.5 %        97.5 %
(Intercept) -592.324215 -351.74133135
Catalogs      42.621965   53.45203511
Children    -231.654305 -157.70016410
IsFemale     -37.876730  110.71612780
IsMiddle    -185.478985   22.69029729
IsOld       -133.833738   97.87583627
Salary         0.019747    0.02274408

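The 95% confidence intervals of all three dummy variables (IsFemale, IsMiddle, IsOld) include 0, therefore we cannot reject the null hypothesis that each of these coefficients is zero at the 5% level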
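The intervals reported by `confint()` can be approximately reproduced from the estimates and standard errors in the summary table as estimate ± t-critical × standard error. The sketch below uses the rounded values printed above, so the last digit may differ slightly from the `confint()` output.

```r
# Estimates and standard errors copied from summary(RegModel.1), as printed
est <- c(IsFemale = 36.42, IsMiddle = -81.39, IsOld = -17.98)
se  <- c(IsFemale = 37.86, IsMiddle = 53.04, IsOld = 59.04)

# Residual degrees of freedom from the summary
df_resid <- 993

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df_resid)

# Confidence limits: estimate -/+ t_crit * standard error
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 2))

# TRUE means the interval covers 0, so the coefficient is not
# significantly different from zero at the 5% level
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0
print(contains_zero)
```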

> dm$IsOwn <- with(dm, ifelse(OwnHome=="Own",1,0))

> dm$IsRent <- with(dm, ifelse(OwnHome=="Rent",1,0))

> dm$IsSingle <- with(dm, ifelse(Married=="Single",1,0))

> dm$IsMarried <- with(dm, ifelse(Married=="Married",1,0))

> dm$IsFar <- with(dm, ifelse(Location=="Far",1,0))

> dm$IsClose <- with(dm, ifelse(Location=="Close",1,0))

> RegModel.2 <- 
+   lm(AmountSpent~Catalogs+Children+IsClose+IsFemale+IsMiddle+IsOld+IsOwn+IsSingle+Salary,
+    data=dm)
> summary(RegModel.2)


Call:
lm(formula = AmountSpent ~ Catalogs + Children + IsClose + IsFemale + 
    IsMiddle + IsOld + IsOwn + IsSingle + Salary, data = dm)

Residuals:
     Min       1Q   Median       3Q      Max 
-1858.23  -328.75   -38.61   227.90  2803.61 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.353e+02  8.350e+01  -1.620    0.106    
Catalogs     4.314e+01  2.548e+00  16.929   <2e-16 ***
Children    -2.014e+02  1.724e+01 -11.678   <2e-16 ***
IsClose     -5.079e+02  3.621e+01 -14.025   <2e-16 ***
IsFemale     4.154e+01  3.466e+01   1.198    0.231    
IsMiddle    -9.328e+01  5.142e+01  -1.814    0.070 .  
IsOld       -4.115e+01  5.657e+01  -0.727    0.467    
IsOwn        4.750e+01  3.863e+01   1.230    0.219    
IsSingle     6.390e+01  4.686e+01   1.364    0.173    
Salary       2.220e-02  9.824e-04  22.603   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 513.7 on 990 degrees of freedom
Multiple R-squared:  0.7168,    Adjusted R-squared:  0.7143 
F-statistic: 278.5 on 9 and 990 DF,  p-value: < 2.2e-16

Create more dummy variables (for OwnHome, Married, and Location) and add them to the model

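- Female customers are estimated to spend about $42 more than male customers, holding the other variables constant

- Middle-aged customers are estimated to spend about $93 less than young customers

- Old customers are estimated to spend about $41 less than young customers

- Homeowners are estimated to spend about $48 more than renters

- Single customers are estimated to spend about $64 more than married customers

- Customers located close are estimated to spend about $508 less than customers located far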

Adjusted R-Squared value goes from 0.658 in the previous model to 0.714, meaning this model fits the data better
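Because RegModel.1 is nested in RegModel.2 (the second model adds IsClose, IsOwn, and IsSingle), the improvement can also be checked with a partial F-test. The sketch below rebuilds the test from the rounded residual standard errors printed in the two summaries, so the statistic is approximate; in the original session, `anova(RegModel.1, RegModel.2)` would give the exact version.

```r
# Values read from the two summary() outputs above (rounded as printed)
rse1 <- 562.2; df1 <- 993   # RegModel.1
rse2 <- 513.7; df2 <- 990   # RegModel.2

# Residual sum of squares = (residual standard error)^2 * residual df
rss1 <- rse1^2 * df1
rss2 <- rse2^2 * df2

# Number of terms added in the larger model (IsClose, IsOwn, IsSingle)
q <- df1 - df2

# Partial F statistic and its upper-tail p-value
f_stat <- ((rss1 - rss2) / q) / (rss2 / df2)
p_val <- pf(f_stat, q, df2, lower.tail = FALSE)
print(f_stat)
print(p_val)
```

A large F statistic with a very small p-value would indicate that the three added dummies jointly improve the model, which is consistent with the rise in adjusted R-squared.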

From the subsets regression summary, IsMiddle appears to be the best of the dummy variables, as it is the first one selected after Salary, Catalogs, and Children

The individual p-values agree with this for the dummy variables other than IsClose: IsMiddle has the lowest p-value among them at 0.07. IsClose, however, is far more significant than any other dummy variable (p-value < 2e-16)
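The p-values in the Model 2 summary can be recomputed from the printed t values and the residual degrees of freedom, which makes it easy to rank the dummy variables. This is a sketch using the rounded t values shown above. Note that the ranking puts IsClose first, with IsMiddle the lowest of the remaining dummies.

```r
# t values of the dummy variables, copied from summary(RegModel.2)
t_val <- c(IsClose = -14.025, IsFemale = 1.198, IsMiddle = -1.814,
           IsOld = -0.727, IsOwn = 1.230, IsSingle = 1.364)

# Residual degrees of freedom from the summary
df_resid <- 990

# Two-sided p-value: twice the lower-tail probability of -|t|
p_val <- 2 * pt(-abs(t_val), df_resid)

# Rank the dummies from most to least significant
ranked <- sort(p_val)
print(signif(ranked, 3))
```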

> confint(RegModel.2)

                    2.5 %        97.5 %
(Intercept) -299.15272467   28.57220267
Catalogs      38.14094535   48.14280435
Children    -235.20639656 -167.52966426
IsClose     -578.92772364 -436.80312207
IsFemale     -26.47914738  109.56472991
IsMiddle    -194.17850128    7.62485357
IsOld       -152.15608915   69.85627745
IsOwn        -28.30717790  123.31106053
IsSingle     -28.04907925  155.85408398
Salary         0.02027706    0.02413264

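Similar to the previous model, the 95% confidence intervals of most dummy variables (IsFemale, IsMiddle, IsOld, IsOwn, IsSingle) include 0, therefore we cannot reject the null hypothesis for them. The exception is IsClose, whose interval (-578.93 to -436.80) excludes 0, so its effect is statistically significant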