Introduction

This analysis examines parsimony and false discovery rate (FDR) correction using the DirectMarketing.csv data set, and it draws on the topics covered in Chapter 5.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts --------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(pander)
DirectMarketing <- read_csv("DirectMarketing.csv")
## Parsed with column specification:
## cols(
##   Age = col_character(),
##   Gender = col_character(),
##   OwnHome = col_character(),
##   Married = col_character(),
##   Location = col_character(),
##   Salary = col_double(),
##   Children = col_double(),
##   History = col_character(),
##   Catalogs = col_double(),
##   AmountSpent = col_double()
## )
head(DirectMarketing)
## # A tibble: 6 x 10
##   Age   Gender OwnHome Married Location Salary Children History Catalogs
##   <chr> <chr>  <chr>   <chr>   <chr>     <dbl>    <dbl> <chr>      <dbl>
## 1 Old   Female Own     Single  Far       47500        0 High           6
## 2 Midd~ Male   Rent    Single  Close     63600        0 High           6
## 3 Young Female Rent    Single  Close     13500        0 Low           18
## 4 Midd~ Male   Own     Married Close     85600        1 High          18
## 5 Midd~ Female Own     Single  Close     68400        0 High          12
## 6 Young Male   Own     Married Close     30400        0 Low            6
## # ... with 1 more variable: AmountSpent <dbl>
summary(DirectMarketing)
##      Age               Gender            OwnHome            Married         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Location             Salary          Children       History         
##  Length:1000        Min.   : 10100   Min.   :0.000   Length:1000       
##  Class :character   1st Qu.: 29975   1st Qu.:0.000   Class :character  
##  Mode  :character   Median : 53700   Median :1.000   Mode  :character  
##                     Mean   : 56104   Mean   :0.934                     
##                     3rd Qu.: 77025   3rd Qu.:2.000                     
##                     Max.   :168800   Max.   :3.000                     
##     Catalogs      AmountSpent    
##  Min.   : 6.00   Min.   :  38.0  
##  1st Qu.: 6.00   1st Qu.: 488.2  
##  Median :12.00   Median : 962.0  
##  Mean   :14.68   Mean   :1216.8  
##  3rd Qu.:18.00   3rd Qu.:1688.5  
##  Max.   :24.00   Max.   :6217.0
m1 <- lm(formula = AmountSpent ~ Catalogs + Salary, data = DirectMarketing)
summary(m1)
## 
## Call:
## lm(formula = AmountSpent ~ Catalogs + Salary, data = DirectMarketing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1761.3  -327.9    14.6   270.6  3387.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.591e+02  5.368e+01  -12.28   <2e-16 ***
## Catalogs     5.170e+01  2.912e+00   17.75   <2e-16 ***
## Salary       1.991e-02  6.299e-04   31.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 599.2 on 997 degrees of freedom
## Multiple R-squared:  0.6121, Adjusted R-squared:  0.6113 
## F-statistic: 786.5 on 2 and 997 DF,  p-value: < 2.2e-16
summary(m1)$coefficients
##                  Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept) -659.14813353 5.367602e+01 -12.28012  2.219934e-32
## Catalogs      51.69516193 2.911923e+00  17.75293  1.749434e-61
## Salary         0.01990824 6.299047e-04  31.60516 1.924160e-152
m1.pvals <- summary(m1)$coefficients[, 4]  # column 4 holds Pr(>|t|)
p1.examples <-
  tibble(p1 = m1.pvals) %>%
  mutate(p1.fdr = p.adjust(p1, method = "fdr"),
         p1.sig = ifelse(p1 < .05, "*", ""),
         p1.fdr.sig = ifelse(p1.fdr < .05, "*", ""))
p1.examples %>%
  arrange(p1) %>%
  pander(caption="Generated 'p values', with and without FDR correction applied.")
Generated ‘p values’, with and without FDR correction applied.

p1           p1.fdr       p1.sig   p1.fdr.sig
1.924e-152   5.772e-152   *        *
1.749e-61    2.624e-61    *        *
2.22e-32     2.22e-32     *        *
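
The rows of this table carry no coefficient names, so it is not obvious which p-value belongs to which term. The following is a minimal sketch, in base R, of one way to keep the names attached. It assumes the p-values printed by summary(m1)$coefficients and types them in by hand so that it runs without the CSV file; the object names pvals and labelled are illustrative only. With the fitted model available, the names could instead be taken from rownames(summary(m1)$coefficients).

```r
# p-values copied from the summary(m1)$coefficients output
pvals <- c("(Intercept)" = 2.219934e-32,
           Catalogs      = 1.749434e-61,
           Salary        = 1.924160e-152)

# One row per coefficient, with the term name kept as its own column
labelled <- data.frame(
  term   = names(pvals),
  p1     = unname(pvals),
  p1.fdr = unname(p.adjust(pvals, method = "fdr")),
  stringsAsFactors = FALSE
)

# Flag significance at the 0.05 level, before and after the correction
labelled$p1.sig     <- ifelse(labelled$p1 < .05, "*", "")
labelled$p1.fdr.sig <- ifelse(labelled$p1.fdr < .05, "*", "")

# Sort by raw p-value, smallest first, to match the table
labelled <- labelled[order(labelled$p1), ]
print(labelled, row.names = FALSE)
```

Sorted this way, the rows correspond to Salary, Catalogs, and the intercept, in that order.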

Conclusion

Salary and Catalogs were chosen as predictors because they are numeric rather than categorical. Before correction, both were highly significant: the p-value for Catalogs was 1.749434e-61 and the p-value for Salary was 1.924160e-152, each far below 0.001. Testing several coefficients at once raises the chance that at least one of them appears significant by chance alone (a type I error), so the p-values were adjusted with the FDR method. The adjustment raised the p-value for Catalogs from 1.749e-61 to 2.624e-61 and the p-value for Salary from 1.924e-152 to 5.772e-152, while the p-value for the intercept stayed at 2.22e-32. All adjusted values remain far below 0.05, so both predictors are still significant after the correction. Because the correction accounts for the false discovery rate and neither predictor drops out, the two-predictor model can be kept as a parsimonious model.
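
The adjusted values can be reproduced by hand. In R, p.adjust() with method = "fdr" is the Benjamini-Hochberg procedure: the i-th smallest of m p-values is multiplied by m / i, and the results are then forced to be non-decreasing in rank and capped at 1. The sketch below assumes the three p-values printed by summary(m1)$coefficients, typed in by hand; the names scaled, adjusted_sorted, bh_manual, and bh_builtin are illustrative only.

```r
# Raw p-values copied from the summary(m1)$coefficients output
p <- c("(Intercept)" = 2.219934e-32,
       Catalogs      = 1.749434e-61,
       Salary        = 1.924160e-152)

m   <- length(p)   # number of tests
ord <- order(p)    # positions of the p-values from smallest to largest

# Benjamini-Hochberg: scale the i-th smallest p-value by m / i
scaled <- p[ord] * m / seq_len(m)

# Keep adjusted values non-decreasing in rank (running minimum taken from
# the largest p-value downward) and cap them at 1
adjusted_sorted <- pmin(1, rev(cummin(rev(scaled))))

# Put the adjusted values back in the original order
bh_manual <- p
bh_manual[ord] <- adjusted_sorted

bh_builtin <- p.adjust(p, method = "fdr")

# Ratios are used because the values are too small for an absolute comparison
print(bh_manual / p)            # multiplier applied to each raw p-value
print(bh_manual / bh_builtin)   # all 1 if the manual steps match p.adjust()
```

With m = 3, the smallest p-value (Salary) is multiplied by 3, the second smallest (Catalogs) by 1.5, and the largest (the intercept) by 1, which is why the intercept is the only value that does not change.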