Answer the following questions:
What plots and numeric summaries do we use to describe a categorical variable’s distribution? (2 pts)
Contingency tables (counts or proportions) for the numeric summary, and bar plots for the graph.
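As a small illustration of the numeric summary, here is a minimal sketch using a made-up `marital` vector (hypothetical toy values, not the bank data). A bar plot would simply draw one bar per category with height equal to these counts.

```r
# Hypothetical toy data; NOT taken from the bank data set.
marital <- c("single", "married", "married", "divorced", "married", "single")

# One-way contingency table: frequency of each category.
counts <- table(marital)

# The same table expressed as proportions that sum to 1.
props <- prop.table(counts)

print(counts)
print(round(props, 2))
```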
What three major items do we try to describe about a quantitative variable’s distribution? (3 pts)
Shape, center, and spread.
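A minimal sketch of how these three items could be measured in R, assuming a made-up numeric vector called `duration` (hypothetical values, not the bank data):

```r
# Hypothetical right-skewed sample; NOT taken from the bank data set.
duration <- c(5, 40, 90, 120, 180, 200, 260, 320, 700, 1500)

# Center: mean and median.
center <- c(mean = mean(duration), median = median(duration))

# Spread: standard deviation, interquartile range, and range width.
spread <- c(sd = sd(duration), iqr = IQR(duration), range = diff(range(duration)))

# Shape: a mean well above the median suggests right skew.
# A histogram or density plot is the usual way to look at shape directly.
print(center)
print(spread)
```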
Write a brief narrative and code to answer the questions below.
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
bank <- read.csv2("D:\\NCSU\\Spring2022_sophomore\\ST308\\HW3\\bankData.txt")
newbank <- bank %>%
  rename("subscribed" = y) %>%
  filter((age < 75) & (marital != "unknown"))
From this point forward, use your newly created dataset.
3. Create a three-way contingency table as seen below. (2 pts)
table(newbank$marital, newbank$loan, newbank$subscribed)
## , , = no
##
##
## no unknown yes
## divorced 3382 112 596
## married 18405 526 3385
## single 8152 241 1552
##
## , , = yes
##
##
## no unknown yes
## divorced 354 7 63
## married 2026 59 362
## single 1342 39 236
# Dimensions: marital, loan, and subscribed (the original y variable). I tried every variable with yes/no/unknown levels and found that the columns are loan.
#st308notes P256
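The layer headers in the table output are unlabeled (`, , = no`) because `table()` was given bare vectors. One possible way to label the dimensions is the `dnn` argument of `table()`, and `ftable()` can flatten the result into a single display. The sketch below uses a made-up data frame `toy` as a stand-in for `newbank`; its values are hypothetical.

```r
# Hypothetical toy data frame standing in for newbank; values are made up.
toy <- data.frame(
  marital    = c("single", "married", "married", "divorced", "single", "married"),
  loan       = c("no", "yes", "no", "no", "unknown", "no"),
  subscribed = c("no", "no", "yes", "no", "yes", "no")
)

# dnn = names each dimension, so layers print with a "subscribed = ..." header.
tab3 <- table(toy$marital, toy$loan, toy$subscribed,
              dnn = c("marital", "loan", "subscribed"))
print(tab3)

# ftable() flattens the three-way table into one two-dimensional display.
print(ftable(tab3))
```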
#newbank %>%
# group_by(marital, loan) %>%
# summarise(across(.fns = summary, .cols= c(duration))) #apply summary function multiple columns NOT HELPFUL HERE B/C JUST APPLY TO DURATION COLUMN
newbank %>%
  group_by(marital, loan) %>%
  summarize(minimum = min(duration),
            first_quartile = quantile(duration, 0.25),
            sample_mean = mean(duration),
            median = median(duration),
            third_quartile = quantile(duration, 0.75),
            maximum = max(duration))
## `summarise()` regrouping output by 'marital' (override with `.groups` argument)
## # A tibble: 9 x 8
## # Groups: marital [3]
## marital loan minimum first_quartile sample_mean median third_quartile maximum
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 divorced no 0 102 254. 180 315 3253
## 2 divorced unkn~ 8 94.5 210. 164 266. 1120
## 3 divorced yes 7 98.5 258. 176 320 2139
## 4 married no 0 102 257. 179 318 4199
## 5 married unkn~ 8 102 247. 167 308 2926
## 6 married yes 3 99.5 258. 177 310 3322
## 7 single no 1 104 262. 184 328. 4918
## 8 single unkn~ 7 104 274. 178 317 1580
## 9 single yes 3 98 259. 174. 311. 3076
# summary() could not be used here to create the six-number summary, so each statistic is computed individually.
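For comparison, one possible base R alternative is `aggregate()` with `FUN = summary`, which returns the six summary values for each group as a matrix column. This is only a sketch on a made-up data frame `toy` (hypothetical values), not the approach used in this assignment.

```r
# Hypothetical toy data; NOT taken from the bank data set.
toy <- data.frame(
  marital  = c("single", "single", "single", "married", "married", "married"),
  loan     = c("no", "no", "no", "yes", "yes", "yes"),
  duration = c(10, 20, 60, 100, 200, 300)
)

# One row per marital/loan combination present in the data.
# The duration column is a matrix holding Min., 1st Qu., Median, Mean, 3rd Qu., Max.
six <- aggregate(duration ~ marital + loan, data = toy, FUN = summary)
print(six)
```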
5. Create multiple side-by-side bar plots of the marital status variable, as shown below. (3 pts)
g <- ggplot(newbank, aes(x = marital, fill = loan))
g + geom_bar(position = "dodge") +
  facet_wrap(~subscribed)
e <- ggplot(newbank, aes(x = age, y = duration, color = subscribed))
e +
  geom_point(shape = 3, position = "jitter") +
  geom_smooth(method = "lm") +
  facet_wrap(~marital)
## `geom_smooth()` using formula 'y ~ x'