###Question 1: To describe a categorical variable’s distribution, we use contingency tables (frequency tables) and bar plots. Box plots describe quantitative variables, not categorical ones.
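As a minimal sketch of the table side of this answer, the base R functions `table()` and `prop.table()` give counts and proportions for a categorical variable. The vector `status` is made-up illustration data, not part of the bank file.

```r
# Hypothetical categorical data, for illustration only.
status <- c("married", "single", "married", "divorced", "married")

# Frequency table: how many observations fall in each category.
counts <- table(status)

# Relative frequencies: the same table expressed as proportions.
props <- prop.table(counts)

print(counts)
print(props)
```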
###Question 2: The three major items we describe about a quantitative variable’s distribution are shape, center, and spread of the data.
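A short sketch of how center, spread, and shape can each be checked numerically in base R. The vector `x` is made up for illustration. Comparing the mean to the median is only a rough indicator of skew, not a formal test.

```r
# Hypothetical right-skewed sample, for illustration only.
x <- c(1, 2, 3, 4, 100)

# Center: the mean is pulled toward the large value, the median is not.
center <- c(mean = mean(x), median = median(x))

# Spread: standard deviation and interquartile range.
spread <- c(sd = sd(x), IQR = IQR(x))

# Shape: a mean well above the median suggests right skew.
right_skewed <- mean(x) > median(x)

print(center)
print(spread)
print(right_skewed)
```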
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
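# tidyverse already attaches dplyr, readr, and ggplot2, so the three calls below are redundant but harmless.
library("dplyr")
library("readr")
library("ggplot2")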
# Read the semicolon-delimited bank data; the first row holds the column names.
bank <- read_csv2("~/Desktop/bankData.txt", col_names = TRUE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 41188 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (12): job, marital, education, default, housing, loan, contact, month, d...
## dbl (5): age, duration, campaign, pdays, previous
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
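The message above notes that `read_csv2()` assumes a comma as the decimal mark and a period as the grouping mark. If the file uses semicolons as delimiters but periods as decimal marks (an assumption about bankData.txt that the output does not confirm), `read_delim()` with an explicit delimiter keeps decimals intact. A minimal sketch on inline sample text, with made-up column names and values:

```r
library(readr)

# Hypothetical two-row sample: semicolon-delimited, period as decimal mark.
sample_text <- "age;rate\n30;1.1\n45;-0.2\n"

# I() tells readr (version 2.0 or later) that the string is literal data,
# not a file path.
demo <- read_delim(I(sample_text), delim = ";", show_col_types = FALSE)

print(demo)
```

The columns used in this analysis (age, duration, marital, loan, y) are whole numbers or text, so they would not be affected by the decimal-mark setting either way.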
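# Rename the response column y to term.deposit, then keep clients under 75 with a known marital status.
Bank <- bank %>%
  rename(term.deposit = y) %>%
  filter(age < 75, marital != "unknown")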
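# Three-way contingency table: marital (rows) by loan (columns), one layer per term.deposit value.
table(Bank$marital, Bank$loan, Bank$term.deposit)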
## , ,  = no
##
##
##               no unknown   yes
##   divorced  3382     112   596
##   married  18405     526  3385
##   single    8152     241  1552
##
## , ,  = yes
##
##
##               no unknown   yes
##   divorced   354       7    63
##   married   2026      59   362
##   single    1342      39   236
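Raw counts are hard to compare because the marital groups differ so much in size. As a sketch, the counts printed above can be re-entered by hand and converted with `prop.table()` into the share of each marital/loan group that said yes to a term deposit. The object names `counts` and `yes_rate` are illustrative. With the real data, `prop.table(table(...), margin = c(1, 2))` would do the same without retyping.

```r
# Counts copied from the three-way table above.
# array() fills column by column: loan = no, unknown, yes within each layer.
counts <- array(
  c(3382, 18405, 8152, 112, 526, 241, 596, 3385, 1552,  # term.deposit = no
    354, 2026, 1342, 7, 59, 39, 63, 362, 236),          # term.deposit = yes
  dim = c(3, 3, 2),
  dimnames = list(
    marital = c("divorced", "married", "single"),
    loan = c("no", "unknown", "yes"),
    term.deposit = c("no", "yes")
  )
)

# Proportions within each marital/loan cell, then keep the "yes" layer.
yes_rate <- prop.table(counts, margin = c(1, 2))[, , "yes"]

print(round(yes_rate, 3))
```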
# Summary statistics of call duration for each marital/loan combination.
Bank %>%
  group_by(marital, loan) %>%
  summarise(minimum = min(duration, na.rm = TRUE),
            Q1 = quantile(duration, 0.25, na.rm = TRUE),
            mean = mean(duration, na.rm = TRUE),
            median = median(duration, na.rm = TRUE),
            Q3 = quantile(duration, 0.75, na.rm = TRUE),
            maximum = max(duration, na.rm = TRUE))
## `summarise()` has grouped output by 'marital'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 8
## # Groups:   marital [3]
##   marital  loan    minimum    Q1  mean median    Q3 maximum
##   <chr>    <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>
## 1 divorced no            0 102    254.   180   315     3253
## 2 divorced unknown       8  94.5  210.   164   266.    1120
## 3 divorced yes           7  98.5  258.   176   320     2139
## 4 married  no            0 102    257.   179   318     4199
## 5 married  unknown       8 102    247.   167   308     2926
## 6 married  yes           3  99.5  258.   177   310     3322
## 7 single   no            1 104    262.   184   328.    4918
## 8 single   unknown       7 104    274.   178   317     1580
## 9 single   yes           3  98    259.   174.  311.    3076
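The `summarise()` message above only reports that the result is still grouped by marital. If an ungrouped result is wanted, one option is the `.groups = "drop"` argument, which also silences the message. A minimal sketch on a made-up toy tibble (`toy` and `by_group` are illustrative names):

```r
library(dplyr)

# Hypothetical data, for illustration only.
toy <- tibble(
  marital  = c("single", "single", "married", "married"),
  loan     = c("no", "yes", "no", "no"),
  duration = c(100, 200, 300, 500)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message.
by_group <- toy %>%
  group_by(marital, loan) %>%
  summarise(mean = mean(duration),
            median = median(duration),
            .groups = "drop")

print(by_group)
```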
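# Side-by-side bar plot of marital status, filled by loan, with one panel per term.deposit value.
g <- ggplot(data = Bank, aes(x = marital, fill = loan))
g + geom_bar(position = "dodge") +
  facet_wrap(~term.deposit)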
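Because the marital groups differ greatly in size, dodged counts can make the loan mix hard to compare across groups. One possible variant, offered as a sketch and not the plot used in this assignment, is `position = "fill"`, which scales every bar to proportions. The data frame `toy` is made up for illustration.

```r
library(ggplot2)

# Hypothetical data, for illustration only.
toy <- data.frame(
  marital = c("married", "married", "married", "single", "single"),
  loan = c("no", "no", "yes", "no", "yes"),
  term.deposit = c("no", "yes", "no", "no", "yes")
)

# position = "fill" stacks the loan categories and rescales each bar to 1.
p <- ggplot(toy, aes(x = marital, fill = loan)) +
  geom_bar(position = "fill") +
  facet_wrap(~term.deposit) +
  labs(y = "proportion")
```

Calling `print(p)` draws the plot.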
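# Scatter plot of duration against age with a least-squares line per term.deposit value, one panel per marital status.
gg <- ggplot(Bank, aes(x = age, y = duration, color = term.deposit))
gg + geom_jitter(shape = 3) +
  geom_smooth(method = lm) +
  facet_wrap(~marital)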
## `geom_smooth()` using formula 'y ~ x'
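`geom_smooth(method = lm)` draws a least-squares line for each color group within each panel. To read off the slope and intercept behind such a line, the same kind of fit can be run directly with `lm()`. A minimal sketch on made-up data that lie exactly on a line, so the coefficients are known in advance:

```r
# Hypothetical data: duration is exactly 1 + 2 * age.
toy <- data.frame(age = c(20, 30, 40, 50))
toy$duration <- 1 + 2 * toy$age

# Least-squares fit of duration on age, the same model geom_smooth() uses.
fit <- lm(duration ~ age, data = toy)
coefs <- coef(fit)

print(coefs)
```

On the bank data, the corresponding call would be `lm(duration ~ age, data = ...)` applied to one marital/term.deposit subset at a time.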
Comments
###First, I loaded the packages I needed with library().
###Next, I read the file in and named the data “bank”.
###Then I created a new data set, “Bank”, by renaming the column y to term.deposit and keeping only clients under age 75 whose marital status is known.
###I then created a three-way contingency table of marital, loan, and my renamed column, term.deposit.
###Following that, I summarized duration (minimum, Q1, mean, median, Q3, and maximum) for each combination of marital and loan.
###Then I created side-by-side bar plots of marital, filled by loan and faceted by term.deposit, using the ggplot2 package, and named the base plot “g”.
###Finally, I created scatter plots of duration against age, colored by term.deposit and faceted by marital, using ggplot2 again, and named the base plot “gg”.