This analysis examines parsimony and false discovery rates (FDR) using the DirectMarketing.csv data set, drawing on topics covered in Chapter 5.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(pander)
DirectMarketing <- read_csv("DirectMarketing.csv")
## Parsed with column specification:
## cols(
## Age = col_character(),
## Gender = col_character(),
## OwnHome = col_character(),
## Married = col_character(),
## Location = col_character(),
## Salary = col_double(),
## Children = col_double(),
## History = col_character(),
## Catalogs = col_double(),
## AmountSpent = col_double()
## )
head(DirectMarketing)
## # A tibble: 6 x 10
## Age Gender OwnHome Married Location Salary Children History Catalogs
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 Old Female Own Single Far 47500 0 High 6
## 2 Midd~ Male Rent Single Close 63600 0 High 6
## 3 Young Female Rent Single Close 13500 0 Low 18
## 4 Midd~ Male Own Married Close 85600 1 High 18
## 5 Midd~ Female Own Single Close 68400 0 High 12
## 6 Young Male Own Married Close 30400 0 Low 6
## # ... with 1 more variable: AmountSpent <dbl>
summary(DirectMarketing)
## Age Gender OwnHome Married
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Location Salary Children History
## Length:1000 Min. : 10100 Min. :0.000 Length:1000
## Class :character 1st Qu.: 29975 1st Qu.:0.000 Class :character
## Mode :character Median : 53700 Median :1.000 Mode :character
## Mean : 56104 Mean :0.934
## 3rd Qu.: 77025 3rd Qu.:2.000
## Max. :168800 Max. :3.000
## Catalogs AmountSpent
## Min. : 6.00 Min. : 38.0
## 1st Qu.: 6.00 1st Qu.: 488.2
## Median :12.00 Median : 962.0
## Mean :14.68 Mean :1216.8
## 3rd Qu.:18.00 3rd Qu.:1688.5
## Max. :24.00 Max. :6217.0
m1 <- lm(formula = AmountSpent ~ Catalogs + Salary, data = DirectMarketing)
summary(m1)
##
## Call:
## lm(formula = AmountSpent ~ Catalogs + Salary, data = DirectMarketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1761.3 -327.9 14.6 270.6 3387.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.591e+02 5.368e+01 -12.28 <2e-16 ***
## Catalogs 5.170e+01 2.912e+00 17.75 <2e-16 ***
## Salary 1.991e-02 6.299e-04 31.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 599.2 on 997 degrees of freedom
## Multiple R-squared: 0.6121, Adjusted R-squared: 0.6113
## F-statistic: 786.5 on 2 and 997 DF, p-value: < 2.2e-16
summary(m1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -659.14813353 5.367602e+01 -12.28012 2.219934e-32
## Catalogs 51.69516193 2.911923e+00 17.75293 1.749434e-61
## Salary 0.01990824 6.299047e-04 31.60516 1.924160e-152
m1.pvals <- summary(m1)$coefficients[, 4]
# tibble() replaces the deprecated data_frame()
p1.examples <-
  tibble(p1 = summary(m1)$coefficients[, 4]) %>%
  mutate(p1.fdr = p.adjust(p1, method = "fdr"),
         p1.sig = ifelse(p1 < .05, "*", ""),
         p1.fdr.sig = ifelse(p1.fdr < .05, "*", ""))
p1.examples %>%
arrange(p1) %>%
pander(caption="Coefficient p-values from m1, with and without FDR correction applied.")
| p1 | p1.fdr | p1.sig | p1.fdr.sig |
|---|---|---|---|
| 1.924e-152 | 5.772e-152 | * | * |
| 1.749e-61 | 2.624e-61 | * | * |
| 2.22e-32 | 2.22e-32 | * | * |
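
The table above drops the coefficient names, so it is hard to tell which row belongs to which term. Below is a minimal sketch of one way to keep them, using base R only. It fits a stand-in model on the built-in `mtcars` data so that it runs without DirectMarketing.csv; the names `fit`, `coefs`, and `labelled` are illustrative assumptions and are not part of the original analysis.

```r
# Stand-in model on a built-in data set (assumption: not the DirectMarketing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

# Keep the term names next to the raw and FDR-adjusted p-values
labelled <- data.frame(
  term  = rownames(coefs),
  p     = coefs[, "Pr(>|t|)"],
  p.fdr = p.adjust(coefs[, "Pr(>|t|)"], method = "fdr"),
  row.names = NULL
)
labelled$p.sig     <- ifelse(labelled$p < .05, "*", "")
labelled$p.fdr.sig <- ifelse(labelled$p.fdr < .05, "*", "")

# Sort by raw p-value, as in the table above
labelled[order(labelled$p), ]
```

Applied to `m1`, the same pattern would presumably label the three rows as Salary, Catalogs, and (Intercept).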
Salary and Catalogs were chosen as predictors because they are numeric rather than categorical. Both were highly significant before correction: the raw p-value was 1.749434e-61 for Catalogs and 1.924160e-152 for Salary, each far below 0.001. Because a Type I error (a false positive) becomes more likely as more coefficients are tested, the p-values were adjusted with the FDR method. The adjustment raised the p-value for Catalogs from 1.749e-61 to 2.624e-61 and for Salary from 1.924e-152 to 5.772e-152, while the intercept's p-value stayed at 2.22e-32. All three remain far below 0.05, so every term is significant both before and after correction. Since neither predictor loses significance once false discoveries are accounted for, both can be kept, and the two-predictor model remains parsimonious.
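
To see where the adjusted values come from, the sketch below reproduces the `"fdr"` (Benjamini-Hochberg) adjustment by hand and compares it with `p.adjust()`. The raw p-values are copied from the `summary(m1)$coefficients` output above; the names `p_raw`, `p_bh`, and `p_builtin` are illustrative assumptions. Each p-value is multiplied by n / rank, where rank 1 is the smallest p-value, and a running minimum keeps the adjusted values ordered.

```r
# Raw p-values copied from the summary(m1)$coefficients output
p_raw <- c(Intercept = 2.219934e-32,
           Catalogs  = 1.749434e-61,
           Salary    = 1.924160e-152)

n     <- length(p_raw)
ord   <- order(p_raw, decreasing = TRUE)  # positions, largest p-value first
ranks <- n:1                              # rank n = largest, rank 1 = smallest

# Scale by n / rank, take a running minimum from the largest down, cap at 1
p_bh_sorted <- pmin(cummin(p_raw[ord] * n / ranks), 1)

# Restore the original order and names
p_bh <- p_bh_sorted[order(ord)]
names(p_bh) <- names(p_raw)

# Built-in adjustment for comparison
p_builtin <- p.adjust(p_raw, method = "fdr")

# Ratios close to 1 mean the two agree; ratios are used because
# all.equal() on values this small falls back to absolute differences
p_bh / p_builtin

# Multiplier applied to each raw p-value (n / rank)
p_bh / p_raw
```

With three tests, the smallest p-value (Salary) is multiplied by 3/1, the next (Catalogs) by 3/2, and the largest (the intercept) by 3/3, which is why only the intercept's value is unchanged in the table.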