Data analysis should be reproducible: every step taken to manipulate, clean, transform, summarize, visualize or model data should be documented exactly so that results can be replicated. RMarkdown is a tool (more specifically, a document type) for doing reproducible data science by keeping the code for a project together with the written analysis and interpretation.
This is an RMarkdown template that you can use for calculating answers to the project quiz questions for this module. You will also knit this document to HTML (or Word) and submit it for the File Upload assignment.
RMarkdown uses a very simple markup language. For example, rather than interacting with a menu to format the text, as in MS Word, you type simple markup in the text outside of the code chunks, such as `*italics*`, `**bold**`, or a leading `#` for a heading.
In the code chunk above (entitled “setup”) echo is set to TRUE. This means that the code in your chunks will be displayed, along with the results, in your compiled document.
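The setup chunk itself is not reproduced in this document. As a rough sketch, a setup chunk for this template might contain something like the following. The package list is an assumption, inferred from the functions used further down (`read_csv`, `%>%`, `ggplot`, `rpart`, `rpart.plot`), and may differ from the actual chunk.

```r
# Assumed contents of the "setup" chunk; the real chunk is not shown here.
knitr::opts_chunk$set(echo = TRUE)  # display chunk code alongside its results

# Packages assumed from the functions used later in this template
library(tidyverse)   # read_csv(), %>%, select(), filter(), mutate(), ggplot()
library(rpart)       # rpart() for fitting classification trees
library(rpart.plot)  # rpart.plot() for drawing fitted trees
```

Setting `echo = FALSE` instead would hide the code and show only the results in the knitted document.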
Below is code to clean and prepare the dataset for modeling. Before running that code, follow these preparatory steps:
Once the files are in the right location on your computer, run this code to clean and format the data:
# You must run this code to format the dataset properly!
advise_invest <- read_csv("adviseinvest.csv") %>%        # Read the CSV file and store it (via assignment operator)
  select(-product) %>%                                   # Remove the product column
  filter(income > 0,                                     # Filter out mistaken data
         num_accts < 5) %>%
  mutate(answered = ifelse(answered == 0, "no", "yes"),  # Turn answered into yes/no
         answered = factor(answered,                     # Turn answered into a factor
                           levels = c("no", "yes")),     # Specify factor levels
         female = factor(female),                        # Make the other binary and categorical
         job = factor(job),                              # variables into factors
         rent = factor(rent),
         own_res = factor(own_res),
         new_car = factor(new_car),
         mobile = factor(mobile),
         chk_acct = factor(chk_acct),
         sav_acct = factor(sav_acct))
## Rows: 29502 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): answered, income, female, age, job, num_dependents, rent, own_res,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
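The recoding of `answered` in the pipeline above can be tried on a small vector first. This is a minimal sketch in base R; the vector `answered_raw` is made up for illustration and is not part of the dataset.

```r
# Hypothetical 0/1 values standing in for the raw `answered` column
answered_raw <- c(0, 1, 1, 0, 1)

# Recode 0/1 to "no"/"yes", then set the level order explicitly
answered <- factor(ifelse(answered_raw == 0, "no", "yes"),
                   levels = c("no", "yes"))

levels(answered)  # "no" is the first level, "yes" the second
table(answered)   # counts per level
```

Naming the levels explicitly keeps the order of "no" and "yes" fixed, which matters later because the tree output reports class probabilities in level order.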
Use the code chunks below to write code that will enable you to answer the questions in the project quiz.
Some of the questions do not require writing code and have been omitted from this template.
mean(advise_invest$answered == "yes") %>%
  round(3)
## [1] 0.547
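The calculation above relies on R treating `TRUE` as 1 and `FALSE` as 0, so the mean of a logical vector is the proportion of `TRUE` values. A small sketch with made-up responses (not taken from the dataset):

```r
# Hypothetical responses, for illustration only
answered <- factor(c("yes", "no", "yes", "yes"), levels = c("no", "yes"))

is_yes <- answered == "yes"   # logical vector: TRUE FALSE TRUE TRUE
sum(is_yes) / length(is_yes)  # proportion computed by hand
mean(is_yes)                  # same value, since TRUE counts as 1 and FALSE as 0
```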
ggplot(data = advise_invest,
       mapping = aes(x = answered, y = income)) +
  geom_boxplot() +
  labs(title = "answered ~ income")
# Fit tree
income_model <- rpart(formula = answered ~ income,
                      data = advise_invest)
income_model
## n= 29499
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 29499 13375 yes (0.4534052 0.5465948)
## 2) income>=39135 8063 2944 no (0.6348754 0.3651246) *
## 3) income< 39135 21436 8256 yes (0.3851465 0.6148535)
## 6) income< 35625 20412 8128 yes (0.3981971 0.6018029)
## 12) income>=4295 20156 8128 yes (0.4032546 0.5967454)
## 24) income< 9595 2944 1408 no (0.5217391 0.4782609)
## 48) income>=7890 1152 256 no (0.7777778 0.2222222) *
## 49) income< 7890 1792 640 yes (0.3571429 0.6428571) *
## 25) income>=9595 17212 6592 yes (0.3829886 0.6170114) *
## 13) income< 4295 256 0 yes (0.0000000 1.0000000) *
## 7) income>=35625 1024 128 yes (0.1250000 0.8750000) *
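Each row of the printed tree follows the legend `node), split, n, loss, yval, (yprob)`: the number of observations in the node, the number that would be misclassified if the node predicted `yval`, and the class proportions in level order (no, yes). As a quick check of that reading, the sketch below recomputes two of the printed proportions from `n` and `loss`, using values copied from the output above.

```r
# Values copied from the printed tree above
root_n    <- 29499
root_loss <- 13375  # observations that are not "yes", the root's predicted class

root_loss / root_n      # share of "no" at the root: the first yprob value
1 - root_loss / root_n  # share of "yes": the second yprob value

# Node 2 (income >= 39135) predicts "no", so its loss counts the "yes" cases
node2_n    <- 8063
node2_loss <- 2944
node2_loss / node2_n    # share of "yes" in node 2: its second yprob value
```

The share of "yes" at the root agrees with the proportion of answered calls computed earlier in this template.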
# Plot tree object
rpart.plot(x = income_model)
predict(object = income_model,
        type = "class") %>%
  head()
## 1 2 3 4 5 6
## yes yes yes yes yes yes
## Levels: no yes
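With no `newdata` argument, `predict()` on an rpart model returns predictions for the rows used to fit it. The sketch below contrasts `type = "class"` with `type = "prob"`. It uses the `kyphosis` data that ships with rpart as a stand-in, since `advise_invest` is only available after running the cleaning code; the model formula here is an illustration, not part of the project.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for advise_invest in this sketch
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

class_pred <- predict(fit, type = "class")  # factor of predicted labels
prob_pred  <- predict(fit, type = "prob")   # matrix, one probability column per level

head(class_pred)
head(prob_pred)
```

The class labels are what the accuracy calculations in this template compare against the observed outcome; the probability matrix shows how confident each terminal node is.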
(predict(object = income_model, type = "class") == advise_invest$answered) %>%
  mean() %>%
  round(3)
## [1] 0.642
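This accuracy can be cross-checked against the printed tree for `income_model`: the `loss` column of each terminal node (marked `*`) counts the observations that node misclassifies, so the losses sum to the total number of errors. The sketch below uses values copied from that output.

```r
# Total n and terminal-node losses for income_model, copied from the printed tree
n_total     <- 29499
leaf_losses <- c(2944, 256, 640, 6592, 0, 128)  # nodes 2, 48, 49, 25, 13, 7

error_rate <- sum(leaf_losses) / n_total  # share of misclassified observations
round(1 - error_rate, 3)                  # accuracy, to compare with the value above
```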
# Fit tree
tree_model <- rpart(formula = answered ~ .,
                    data = advise_invest)
tree_model
## n= 29499
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 29499 13375 yes (0.4534052 0.5465948)
## 2) chk_acct=0,1,2 19199 8000 no (0.5833116 0.4166884)
## 4) income>=79840 1728 192 no (0.8888889 0.1111111) *
## 5) income< 79840 17471 7808 no (0.5530880 0.4469120)
## 10) mobile=0 16383 6848 no (0.5820057 0.4179943)
## 20) sav_acct=0,1 13311 4928 no (0.6297799 0.3702201)
## 40) income>=5900 13055 4672 no (0.6421295 0.3578705)
## 80) age< 25.5 3071 640 no (0.7915988 0.2084012)
## 160) chk_acct=0,1 2879 448 no (0.8443904 0.1556096) *
## 161) chk_acct=2 192 0 yes (0.0000000 1.0000000) *
## 81) age>=25.5 9984 4032 no (0.5961538 0.4038462)
## 162) income< 10310 1280 256 no (0.8000000 0.2000000) *
## 163) income>=10310 8704 3776 no (0.5661765 0.4338235)
## 326) income>=11000 8448 3520 no (0.5833333 0.4166667)
## 652) age>=27.5 7424 2880 no (0.6120690 0.3879310)
## 1304) income< 13875 1152 192 no (0.8333333 0.1666667) *
## 1305) income>=13875 6272 2688 no (0.5714286 0.4285714)
## 2610) income>=17960 5568 2176 no (0.6091954 0.3908046)
## 5220) female=1 768 128 no (0.8333333 0.1666667) *
## 5221) female=0 4800 2048 no (0.5733333 0.4266667)
## 10442) job=2 2944 1024 no (0.6521739 0.3478261)
## 20884) income>=26300 1984 448 no (0.7741935 0.2258065) *
## 20885) income< 26300 960 384 yes (0.4000000 0.6000000)
## 41770) new_car=1 384 64 no (0.8333333 0.1666667) *
## 41771) new_car=0 576 64 yes (0.1111111 0.8888889) *
## 10443) job=0,1,3 1856 832 yes (0.4482759 0.5517241)
## 20886) income< 21525 384 64 no (0.8333333 0.1666667) *
## 20887) income>=21525 1472 512 yes (0.3478261 0.6521739) *
## 2611) income< 17960 704 192 yes (0.2727273 0.7272727) *
## 653) age< 27.5 1024 384 yes (0.3750000 0.6250000) *
## 327) income< 11000 256 0 yes (0.0000000 1.0000000) *
## 41) income< 5900 256 0 yes (0.0000000 1.0000000) *
## 21) sav_acct=2,3,4 3072 1152 yes (0.3750000 0.6250000)
## 42) age< 34 1856 896 yes (0.4827586 0.5172414)
## 84) num_accts>=1.5 1216 448 no (0.6315789 0.3684211) *
## 85) num_accts< 1.5 640 128 yes (0.2000000 0.8000000) *
## 43) age>=34 1216 256 yes (0.2105263 0.7894737) *
## 11) mobile=1 1088 128 yes (0.1176471 0.8823529) *
## 3) chk_acct=3 10300 2176 yes (0.2112621 0.7887379)
## 6) income>=38910 2240 1088 no (0.5142857 0.4857143)
## 12) num_accts< 1.5 448 0 no (1.0000000 0.0000000) *
## 13) num_accts>=1.5 1792 704 yes (0.3928571 0.6071429)
## 26) new_car=1 448 64 no (0.8571429 0.1428571) *
## 27) new_car=0 1344 320 yes (0.2380952 0.7619048) *
## 7) income< 38910 8060 1024 yes (0.1270471 0.8729529) *
# Visualize
rpart.plot(x = tree_model)
rpart.plot(tree_model, tweak = 2, roundint = TRUE)
# Evaluate accuracy
(predict(tree_model, type = "class") == advise_invest$answered) %>%
  mean()
## [1] 0.8199261
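To put the two accuracies in context, they can be compared with the accuracy of always predicting the majority class ("yes"), which is the proportion of "yes" at the root node. The sketch below uses figures copied from the output in this document. All three are in-sample values, computed on the same rows used to fit the trees.

```r
# Accuracy figures copied from the output in this document
baseline_acc <- 0.5465948  # always predict "yes" (root node proportion)
income_acc   <- 0.642      # tree using income only (rounded in the output above)
full_acc     <- 0.8199261  # tree using all predictors

# Improvement of each tree over the majority-class baseline, in percentage points
gain <- round(100 * (c(income = income_acc, full = full_acc) - baseline_acc), 1)
gain
```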