Task 1: Conceptual questions (5 pts)

Answer the following questions:

  1. What plots and numeric summaries do we use to describe a categorical variable’s distribution? (2 pts)

    Numeric summaries: frequency (contingency) tables of counts or proportions. Plots: bar plots.

  2. What three major items do we try to describe about a quantitative variable’s distribution? (3 pts)

    Shape, center, and spread.
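
As a small illustration of both answers, the sketch below tabulates a categorical variable and computes measures of center and spread for a quantitative one. The `toy` data frame and its values are made up for this example and are not part of the assignment data.

```r
# Hypothetical toy data, not the bank data set
toy <- data.frame(
  marital = c("single", "married", "married", "divorced", "married", "single"),
  age     = c(25, 40, 38, 52, 47, 31)
)

# Categorical variable: counts and proportions (numeric summaries)
counts <- table(toy$marital)
props  <- prop.table(counts)

# Bar plot of the counts (graphical summary)
barplot(counts, main = "Marital status")

# Quantitative variable: center and spread
center <- c(mean = mean(toy$age), median = median(toy$age))
spread <- c(sd = sd(toy$age), iqr = IQR(toy$age))

# Shape is usually judged from a plot such as a histogram
hist(toy$age, main = "Age")
```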

Task 2: Programming questions (17 pts)

Write a brief narrative and code to answer the questions below.

  1. Create code to import the bankData.txt file using functions from the tidyverse and save it as an R object. Note: The delimiter is a semicolon and the column names are included in the raw data file. (2 pts)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
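# read_delim() is the tidyverse reader; delim = ";" matches the file, and the header row is read by default
bank <- read_delim("D:\\NCSU\\Spring2022_sophomore\\ST308\\HW3\\bankData.txt", delim = ";")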
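
Base R's `read.csv2()` assumes a comma as the decimal mark, so for a semicolon-delimited file whose numbers use a period, `readr::read_delim()` with `delim = ";"` is the safer choice. A minimal sketch of the difference, using a made-up three-column file (the file contents and the `rate` column are hypothetical, not taken from bankData.txt):

```r
library(readr)

# Hypothetical semicolon-delimited file with a header row and period decimals
path <- tempfile(fileext = ".txt")
writeLines(c("age;rate;marital", "30;1.1;single", "45;-0.4;married"), path)

# read_delim() lets us state the delimiter explicitly;
# col_types = cols() only suppresses the column-specification message
tidy <- read_delim(path, delim = ";", col_types = cols())

# read.csv2() expects comma decimals, so "1.1" is not parsed as a number
base <- read.csv2(path)

# Compare how the same column was parsed by each reader
is.numeric(tidy$rate)
is.numeric(base$rate)
```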
  2. Create a new R object that takes the dataset from above and does the following:
    1. Renames the y variable to be more meaningful (1 pt)
    2. Subsets the observations to only include rows where age is less than 75 and marital is not “unknown.” (1 pt)
newbank <- bank %>%
  rename("subscribed" = y) %>%
  filter((age<75) & (marital != "unknown"))

From this point forward, use your newly created dataset.

  3. Create a three-way contingency table as seen below. (2 pts)

table(newbank$marital, newbank$loan, newbank$subscribed)
## , ,  = no
## 
##           
##               no unknown   yes
##   divorced  3382     112   596
##   married  18405     526  3385
##   single    8152     241  1552
## 
## , ,  = yes
## 
##           
##               no unknown   yes
##   divorced   354       7    63
##   married   2026      59   362
##   single    1342      39   236
# Dimensions: marital (rows), loan (columns), subscribed (layers; the renamed y variable).
# loan was identified by trying each variable whose levels are no/unknown/yes.
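
The blank headers in the printed table (`, ,  = no`) appear because `table()` was given unnamed vectors. If labelled dimensions are wanted, one option is the `dnn` argument, and `dplyr::count()` gives a long-format version of the same counts. This is optional and would change the printed output relative to the target table. A sketch on made-up data (the `toy` tibble is hypothetical):

```r
library(dplyr)

# Hypothetical toy data reusing the three column names for illustration
toy <- tibble(
  marital    = c("single", "married", "married", "single"),
  loan       = c("no", "yes", "no", "no"),
  subscribed = c("no", "no", "yes", "no")
)

# Same counts as an unlabelled table(), but with each dimension named
tab <- table(toy$marital, toy$loan, toy$subscribed,
             dnn = c("marital", "loan", "subscribed"))

# Long-format equivalent: one row per observed combination
long <- toy %>% count(marital, loan, subscribed)
```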
  4. Use functions from the tidyverse to replicate the finding of the minimum, 1st quartile, sample mean, median, 3rd quartile, and maximum for the duration variable. This was done for every combination of the marital status and loan variables (i.e. for all subgroups created by these two variables). (4 pts)
# Reference: ST308 notes, p. 256.
# across() applies a function to several columns at once. It is not needed here
# because only the duration column is summarized:
# newbank %>%
#   group_by(marital, loan) %>%
#   summarise(across(.cols = c(duration), .fns = summary))

newbank %>%
  group_by(marital, loan) %>%
  summarize(minimum = min(duration),
            first_quartile = quantile(duration, 0.25),
            sample_mean = mean(duration),
            median = median(duration),
            third_quartile = quantile(duration, 0.75),
            maximum = max(duration))
## `summarise()` regrouping output by 'marital' (override with `.groups` argument)
## # A tibble: 9 x 8
## # Groups:   marital [3]
##   marital  loan  minimum first_quartile sample_mean median third_quartile maximum
##   <chr>    <chr>   <int>          <dbl>       <dbl>  <dbl>          <dbl>   <int>
## 1 divorced no          0          102          254.   180            315     3253
## 2 divorced unkn~       8           94.5        210.   164            266.    1120
## 3 divorced yes         7           98.5        258.   176            320     2139
## 4 married  no          0          102          257.   179            318     4199
## 5 married  unkn~       8          102          247.   167            308     2926
## 6 married  yes         3           99.5        258.   177            310     3322
## 7 single   no          1          104          262.   184            328.    4918
## 8 single   unkn~       7          104          274.   178            317     1580
## 9 single   yes         3           98          259.   174.           311.    3076
# summary() is not used here; each of the six statistics is computed individually.
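
The `summarise()` regrouping message above is informational. Passing `.groups = "drop"` silences it and returns an ungrouped tibble. The `across()` approach starts to pay off once more than one column needs the same set of summaries. A sketch on made-up data (the `toy` tibble and its values are hypothetical; the column names are reused only for illustration):

```r
library(dplyr)

# Hypothetical toy data with two numeric columns
toy <- tibble(
  marital  = c("single", "single", "married", "married"),
  duration = c(100, 300, 50, 150),
  age      = c(25, 35, 40, 60)
)

out <- toy %>%
  group_by(marital) %>%
  summarize(
    # Apply the same named functions to both columns;
    # results are named <column>_<function>, e.g. duration_mean
    across(c(duration, age),
           list(min = min, mean = mean, max = max)),
    .groups = "drop"  # return an ungrouped tibble, no regrouping message
  )
```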

  5. Create multiple side-by-side bar plots of the marital status variable that you see below. (3 pts)

g <- ggplot(newbank, aes(x=marital, fill=loan))
g + geom_bar(position = "dodge") +
  facet_wrap(~subscribed)

  6. Create scatter plots of the age (x) and duration (y) with regression lines overlaid. (4 pts) Notes:
    • The color of the points and lines shown should differ based on the (new) ‘y’ variable.
    • The plots should be created for each subgroup of the Marital status variable.
    • The points are changed via the shape aesthetic (value of 3).
    • The points are ‘jittered’ using the appropriate geom.
e <- ggplot(newbank, aes(x = age, y = duration, color = subscribed))
e + 
  geom_point(shape=3, position = "jitter") +
  geom_smooth(method="lm") +
  facet_wrap(~marital)
## `geom_smooth()` using formula 'y ~ x'
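
`geom_jitter()` is shorthand for `geom_point(position = "jitter")`, so either form satisfies the jitter note, and passing `formula = y ~ x` to `geom_smooth()` suppresses the formula message. A sketch on made-up data (the `toy` data frame is hypothetical):

```r
library(ggplot2)

# Hypothetical toy data reusing the column names for illustration
toy <- data.frame(
  age        = c(25, 30, 35, 40, 45, 50),
  duration   = c(120, 200, 150, 310, 260, 400),
  subscribed = rep(c("no", "yes"), 3)
)

p <- ggplot(toy, aes(x = age, y = duration, color = subscribed)) +
  geom_jitter(shape = 3) +                     # jittered, plus-shaped points
  geom_smooth(method = "lm", formula = y ~ x)  # one fitted line per color group
```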