Conceptual Questions

###Question 1: To describe a categorical variable’s distribution we use contingency (frequency) tables and bar plots; box plots are for quantitative variables.
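
As a minimal sketch of these summaries, the following builds a one-way frequency table, converts it to proportions, and draws a bar plot in base R. The vector `marital_toy` is made up for illustration and is not taken from the bank data.

```r
# Hypothetical toy data, not taken from the bank data set
marital_toy <- c("married", "single", "married", "divorced",
                 "single", "married", "married", "single")

# One-way frequency table of counts per category
counts <- table(marital_toy)

# The same table expressed as proportions of the total
props <- prop.table(counts)

print(counts)
print(round(props, 3))

# Bar plot: one bar per category, bar height = count
barplot(counts, xlab = "marital", ylab = "count")
```

The same counts could also be plotted with `geom_bar()` from ggplot2, as in the programming section.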

###Question 2: The three major items we describe about a quantitative variable’s distribution are shape, center, and spread of the data.
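
A small sketch of how center, spread, and shape might be examined in base R. The vector `duration_toy` is hypothetical and only stands in for a quantitative variable such as call duration.

```r
# Hypothetical toy data: made-up call durations in seconds
duration_toy <- c(30, 45, 60, 75, 90, 120, 150, 200, 400, 900)

# Center: mean and median
center <- c(mean = mean(duration_toy), median = median(duration_toy))

# Spread: standard deviation, interquartile range, and overall range
spread <- c(sd = sd(duration_toy),
            IQR = IQR(duration_toy),
            range = diff(range(duration_toy)))

print(center)
print(spread)

# Shape: a histogram shows skewness and modes; in this toy vector the
# mean exceeds the median, which is consistent with a right-skewed shape
hist(duration_toy, main = "Toy durations", xlab = "duration (s)")
```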

Programming Questions

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
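# dplyr, readr, and ggplot2 are already attached by tidyverse, so these calls are redundant but harmless
library("dplyr")
library("readr")
library("ggplot2")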
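# read_csv2() reads ';'-delimited files and treats ',' as the decimal mark,
# so columns that use '.' for decimals may be misparsed; the columns used below are unaffected
bank <- read_csv2("~/Desktop/bankData.txt", col_names = TRUE)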
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 41188 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (12): job, marital, education, default, housing, loan, contact, month, d...
## dbl  (5): age, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
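# Rename the response column y and keep clients under 75 with a known marital status
Bank <- bank %>%
  rename(term.deposit = y) %>%
  filter(age < 75 & marital != "unknown")
# Three-way contingency table: marital (rows) by loan (columns), one slice per term.deposit value
table(Bank$marital, Bank$loan, Bank$term.deposit)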
## , ,  = no
## 
##           
##               no unknown   yes
##   divorced  3382     112   596
##   married  18405     526  3385
##   single    8152     241  1552
## 
## , ,  = yes
## 
##           
##               no unknown   yes
##   divorced   354       7    63
##   married   2026      59   362
##   single    1342      39   236
# Six-number summary of call duration for each marital/loan combination
Bank %>%
  group_by(marital, loan) %>%
  summarise(minimum = min(duration, na.rm = TRUE),
            Q1 = quantile(duration, 0.25, na.rm = TRUE),
            mean = mean(duration, na.rm = TRUE),
            median = median(duration, na.rm = TRUE),
            Q3 = quantile(duration, 0.75, na.rm = TRUE),
            maximum = max(duration, na.rm = TRUE))
## `summarise()` has grouped output by 'marital'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 8
## # Groups:   marital [3]
##   marital  loan    minimum    Q1  mean median    Q3 maximum
##   <chr>    <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>
## 1 divorced no            0 102    254.   180   315     3253
## 2 divorced unknown       8  94.5  210.   164   266.    1120
## 3 divorced yes           7  98.5  258.   176   320     2139
## 4 married  no            0 102    257.   179   318     4199
## 5 married  unknown       8 102    247.   167   308     2926
## 6 married  yes           3  99.5  258.   177   310     3322
## 7 single   no            1 104    262.   184   328.    4918
## 8 single   unknown       7 104    274.   178   317     1580
## 9 single   yes           3  98    259.   174.  311.    3076
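# Side-by-side bar plot of marital status, filled by loan, with one panel per term.deposit value
g <- ggplot(data = Bank, aes(x = marital, fill = loan))
g + geom_bar(position = "dodge") +
  facet_wrap(~term.deposit)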

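# Jittered scatter plot of duration against age, colored by term.deposit,
# with a linear fit per color group and one panel per marital status
gg <- ggplot(Bank, aes(x = age, y = duration, color = term.deposit))
gg + geom_jitter(shape = 3) +
  geom_smooth(method = "lm") +
  facet_wrap(~marital)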
## `geom_smooth()` using formula 'y ~ x'

Comments

###First, I loaded the packages I needed with library().

###Next, I read the file in and named the data “bank”.

###I then saved a modified copy as “Bank”, renaming the column y to term.deposit and keeping only rows with age under 75 and a known marital status.

###I then created a three-way contingency table with marital, loan, and my renamed column, term.deposit.

###Following that, I summarized duration (minimum, quartiles, mean, median, and maximum) for each combination of marital and loan.

###Then I created side-by-side bar plots with the ggplot2 package and stored the base plot as “g”.

###Finally, I created scatter plots with ggplot2 and stored the base plot as “gg”.
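
When `table()` is called on bare vectors, as in the three-way table step, the printed slices carry no dimension labels. A minimal sketch of one way to label them uses the `dnn` argument of `table()`. The data frame `toy` is made up; it only reuses the column names from Bank.

```r
# Hypothetical toy data frame that reuses the column names from Bank
toy <- data.frame(
  marital      = c("married", "single", "married", "divorced", "single", "married"),
  loan         = c("no", "yes", "no", "no", "no", "yes"),
  term.deposit = c("no", "no", "yes", "no", "yes", "no")
)

# dnn names the table's dimensions so each printed slice is labeled
tab <- table(toy$marital, toy$loan, toy$term.deposit,
             dnn = c("marital", "loan", "term.deposit"))
print(tab)

# ftable() flattens the three-way table into a single two-dimensional layout
print(ftable(tab))
```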