We’re using R Markdown to gather together into a single document the code we build, text commenting on and reacting to that code, and the output of the analyses we build.
library(janitor)
library(knitr)
library(magrittr)
library(tidyverse)
theme_set(theme_bw())
Loading packages in R is like opening up apps on your phone. We need to tell R that, in addition to the base functions available in the software, we also have other functions we want to use.
tidyverse
, which we’ll use in every R Markdown file we create this semester. The tidyverse
includes several packages, all developed (in part) by Hadley Wickham, Chief Scientist at RStudio.
dplyr
is the primary package we’ll use for data wrangling, cleaning and transformationggplot2
is the primary package we’ll use for visualizationtidyverse
include packages for importing data, working with factors, and many other common activities.janitor
package has some tools for examining, cleaning and tabulating data (including tabyl()
and clean_names()
) that we’ll use regularly.knitr
package has a function called kable()
that we’ll sometimes use to neaten up results.magrittr
package includes a pipe function denoted %$%
which enables us to use “pipes” to pull specific variables out of a data frame (tibble) even when the function we want to use isn’t part of the tidyverse
.tidyverse
package last.quicksur_raw
The data we’ll use is from the Quick Survey we did in Class 02 (pdf is available from the Class 02 README) and which I’ve also done in 2014-2020, in various forms.
A. What is the distribution of pulse rates among students in 431 since 2014?
B. Does the distribution of student heights change materially over time?
C. Do taller people appear to have paid less for their most recent haircut?
D. Does the relationship between height and haircut price look the same within each sex?
E. For the 2021 students specifically, do students have a more substantial tobacco history if they prefer to speak English or a language other than English?
.csv
filequicksur_raw <- read_csv("data/quick_survey_2021.csv",
show_col_types = FALSE) %>%
clean_names()
quicksur_raw
glimpse(quicksur_raw)
Rows: 440
Columns: 23
$ student <dbl> 202158, 202157, 202156, 202155, 202154, 202153, 202152, 20~
$ glasses <chr> "n", "n", "y", "y", "y", "y", "y", "y", "y", "n", "y", "n"~
$ pulse <dbl> 82, 72, 76, 66, 72, 75, 96, 92, 64, 84, 60, 44, 65, 84, 80~
$ english <chr> "n", "y", "y", "y", "n", "y", "y", "y", "y", "y", "y", "y"~
$ smoke <dbl> 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1~
$ statsofar <dbl> 5, 4, 5, 5, 5, 6, 3, 5, 7, 3, 5, 5, 5, 5, 4, 5, 4, 5, 5, 4~
$ h_left <dbl> 0, 0, 0, 3, 1, 3, 4, 0, 0, 0, 1, 7, 1, 13, 0, 2, 0, 16, 16~
$ h_right <dbl> 10, 10, 20, 7, 15, 17, 16, 17, 19, 10, 19, 5, 12, 0, 15, 1~
$ handedness <chr> "1", "1", "1", "0.4", "0.88", "0.7", "0.6", "1", "1", "1",~
$ love_htcm <dbl> 183, 183, 178, 178, 175, 188, 191, 188, 188, 180, 180, 180~
$ statfuture <dbl> 6, NA, 7, 7, 7, 7, 5, 5, 7, 6, 7, 7, 5, 5, 6, 6, 6, 5, 7, ~
$ haircut <dbl> 35, 22, 60, 30, 40, 2, 50, 0, 40, 18, 0, 0, 70, 35, 20, 25~
$ lecture <dbl> 1, 1, 4, 5, 5, 3, 3, 3, 4, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3~
$ alone <dbl> 1, 3, 4, 4, 4, 4, 2, 3, 3, 4, 4, 4, 4, 2, 3, 4, 2, 4, 4, 2~
$ height_in <dbl> 67.0, 71.0, 68.0, 72.0, 70.0, 60.0, 67.0, 71.0, 65.0, 62.0~
$ hand_span <dbl> 20.00, 20.00, 17.50, 23.50, 22.00, 16.50, 18.00, 22.50, 21~
$ favcolor <chr> "blue", "green", "blue", "blue", "light blue", "blue", "gr~
$ lastsleep <dbl> 4.0, 6.5, 8.0, 7.0, 6.0, 9.0, 8.0, 7.0, 6.0, 5.0, 7.5, 8.0~
$ sex <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ ageguess <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ year <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021~
$ lovetrueage <dbl> 54.5, 54.5, 54.5, 54.5, 54.5, 54.5, 54.5, 54.5, 54.5, 54.5~
$ lovetrueht <dbl> 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190~
quicksur_raw %>% count(glasses)
quicksur_raw %>%
filter(year == "2021") %>%
tabyl(favcolor)
summary(quicksur_raw$pulse)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
30.00 64.00 72.00 73.31 80.00 110.00 73
qsdat
Based on Questions A-E, we need to include the following variables in our analytic data frame.
student
: student identification (numerical code)year
: indicates year when survey was taken (August)english
: y = prefers to speak English, else nsmoke
: 1 = never smoker, 2 = quit, 3 = currentpulse
: pulse rate (beats per minute)height_in
: student’s height (in inches)haircut
: price of student’s last haircut (in $)sex
: f (for female) or m (for male)quicksur_raw
qsdat <- quicksur_raw %>%
select(student, year, english, smoke,
pulse, height_in, haircut, sex)
qsdat
summary(qsdat)
student year english smoke
Min. :201401 Min. :2014 Length:440 Min. :1.000
1st Qu.:201620 1st Qu.:2016 Class :character 1st Qu.:1.000
Median :201818 Median :2018 Mode :character Median :1.000
Mean :201801 Mean :2018 Mean :1.082
3rd Qu.:202015 3rd Qu.:2020 3rd Qu.:1.000
Max. :202158 Max. :2021 Max. :3.000
NA's :2
pulse height_in haircut sex
Min. : 30.00 Min. :57.0 Min. : 0.00 Length:440
1st Qu.: 64.00 1st Qu.:64.0 1st Qu.: 13.50 Class :character
Median : 72.00 Median :67.0 Median : 20.00 Mode :character
Mean : 73.31 Mean :67.2 Mean : 29.41
3rd Qu.: 80.00 3rd Qu.:70.0 3rd Qu.: 39.00
Max. :110.00 Max. :77.5 Max. :250.00
NA's :73 NA's :7 NA's :9
factor
, rather than as a character
or numeric
variable.We want the year
and smoke
information treated as categorical, rather than as quantitative.
qsdat <- qsdat %>%
mutate(year = as_factor(year),
smoke = as_factor(smoke))
We want to be able to summarize english
and sex
. We can use a little coding trick to change all of the character variables to factors.
qsdat <- qsdat %>%
mutate(across(where(is_character), as_factor))
across(where())
syntax tells R to change everything that gives a TRUE response to “is this a character variable?” into a factor variable.summary(qsdat)
student year english smoke pulse
Min. :201401 2020 :67 n : 85 1 :409 Min. : 30.00
1st Qu.:201620 2016 :64 y :352 2 : 22 1st Qu.: 64.00
Median :201818 2019 :61 NA's: 3 3 : 7 Median : 72.00
Mean :201801 2021 :58 NA's: 2 Mean : 73.31
3rd Qu.:202015 2018 :51 3rd Qu.: 80.00
Max. :202158 2015 :49 Max. :110.00
(Other):90 NA's :73
height_in haircut sex
Min. :57.0 Min. : 0.00 f :125
1st Qu.:64.0 1st Qu.: 13.50 m :129
Median :67.0 Median : 20.00 NA's:186
Mean :67.2 Mean : 29.41
3rd Qu.:70.0 3rd Qu.: 39.00
Max. :77.5 Max. :250.00
NA's :7 NA's :9
NA's
?qsdat %>% count(year)
pulse
rates?ggplot(data = qsdat, aes(x = pulse)) +
geom_histogram(fill = "green", col = "blue")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 73 rows containing non-finite values (stat_bin).
How might we describe this distribution?
summary(qsdat$pulse)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
30.00 64.00 72.00 73.31 80.00 110.00 73
qsdat %$% mosaic::favstats(pulse)
qsdat %>%
filter(complete.cases(pulse)) %>%
ggplot(data = ., aes(x = pulse)) +
geom_histogram(fill = "seagreen", col = "white",
bins = 20) +
labs(title = "Pulse Rates of Dr. Love's students",
subtitle = "2014 - 2021",
y = "Number of Students",
x = "Pulse Rate (beats per minute)")
fill
and col
in the histogram?qsdat %>%
filter(complete.cases(height_in)) %>%
group_by(year) %>%
summarize(n = n(),
min = min(height_in),
q25 = quantile(height_in, 0.25),
median = median(height_in),
q75 = quantile(height_in, 0.75),
max = max(height_in))
qsdat %>%
filter(complete.cases(height_in)) %>%
ggplot(data = ., aes(x = year, y = height_in)) +
geom_boxplot() +
labs(title = "Heights of Dr. Love's students, by year",
subtitle = "2014 - 2021",
x = "Year",
y = "Height (in inches)")
year
to a factor instead of a numeric variable earlier?qsdat %>%
filter(complete.cases(height_in)) %>%
ggplot(data = ., aes(x = year, y = height_in)) +
geom_violin() +
geom_boxplot(aes(fill = year), width = 0.3) +
guides(fill = "none") +
scale_fill_viridis_d() +
labs(title = "Heights of Dr. Love's students, by year",
subtitle = "2014 - 2021",
x = "Year",
y = "Height (in inches)")
guides(fill = "none")
do?scale_fill_viridis_d()
do?qsdat %>%
filter(complete.cases(height_in)) %>%
group_by(year) %>%
summarize(n = n(),
mean = mean(height_in),
sd = sd(height_in))
In 2021, we had 55 students whose height_in
was available.
qsdat %>%
filter(complete.cases(height_in)) %>%
filter(year == "2021") %>%
ggplot(data = ., aes(x = height_in)) +
geom_histogram(fill = "salmon", col = "white",
binwidth = 1) +
labs(title = "Heights of Dr. Love's students",
subtitle = "2021 (n = 55 students with height data)",
y = "Number of Students",
x = "Height (inches)")
qsdat %>% filter(complete.cases(height_in)) %>%
filter(year == "2021") %>%
count(height_in >= 63.62 & height_in <= 71.88)
34/(34+21)
[1] 0.6181818
height_in
values gathered in 2021 were between 67.75 - (2)4.13 = 59.49 and 67.75 + 2(4.13) = 76.01?qsdat %>% filter(complete.cases(height_in)) %>%
filter(year == "2021") %>%
count(height_in >= 59.49 & height_in <= 76.01)
54/(54+1)
[1] 0.9818182
qsdat <- qsdat %>%
mutate(height_cm = height_in * 2.54)
qsdat %>%
filter(complete.cases(haircut, height_cm)) %>%
summarize(n = n(), median(haircut),
median(height_cm), median(height_in))
qsdat %>% filter(complete.cases(height_cm, haircut)) %>%
ggplot(., aes(x = height_cm, y = haircut)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = TRUE) +
labs(x = "Height (in cm)",
y = "Price of last haircut (in $)",
title = "Do taller people pay less for haircuts?")
qsdat %>%
select(height_in, height_cm, haircut) %>%
filter(complete.cases(.)) %>%
cor(.) %>%
kable(digits = 2)
height_in | height_cm | haircut | |
---|---|---|---|
height_in | 1.00 | 1.00 | -0.19 |
height_cm | 1.00 | 1.00 | -0.19 |
haircut | -0.19 | -0.19 | 1.00 |
m1 <- qsdat %>%
filter(complete.cases(haircut, height_cm)) %$%
lm(haircut ~ height_cm)
m1
Call:
lm(formula = haircut ~ height_cm)
Coefficients:
(Intercept) height_cm
130.4566 -0.5916
summary(m1)
Call:
lm(formula = haircut ~ height_cm)
Residuals:
Min 1Q Median 3Q Max
-38.796 -17.780 -5.272 7.966 218.718
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 130.4566 25.3486 5.147 4.06e-07 ***
height_cm -0.5916 0.1482 -3.991 7.73e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.38 on 427 degrees of freedom
Multiple R-squared: 0.03597, Adjusted R-squared: 0.03371
F-statistic: 15.93 on 1 and 427 DF, p-value: 7.734e-05
lm
fit to a loess
smooth curve?qsdat %>% filter(complete.cases(height_cm, haircut)) %>%
ggplot(., aes(x = height_cm, y = haircut)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = FALSE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
labs(x = "Height (in cm)",
y = "Price of last haircut (in $)",
title = "Do taller people pay less for haircuts?")
qsdat %>% tabyl(sex)
sex
data?qsdat %>% tabyl(sex, year)
qsdat %>%
filter(complete.cases(sex, haircut, height_cm)) %>%
group_by(sex) %>%
summarize(n = n(), median(haircut), median(height_cm))
qsdat %>% filter(complete.cases(height_cm, haircut, sex)) %>%
ggplot(., aes(x = height_cm, y = haircut,
group = sex, col = sex)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
labs(x = "Height (in cm)",
y = "Price of last haircut (in $)",
title = "Do taller people pay less for haircuts?",
subtitle = "accounting for sex")
sex
, with facet_wrap
qsdat %>% filter(complete.cases(height_cm, haircut, sex)) %>%
ggplot(., aes(x = height_cm, y = haircut, col = sex)) +
geom_point() +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = FALSE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
facet_wrap(~ sex) +
labs(x = "Height (in cm)",
y = "Price of last haircut (in $)",
title = "Do taller people pay less for haircuts?",
subtitle = "accounting for sex")
qsdat %>% filter(complete.cases(height_cm, haircut, sex)) %>%
ggplot(., aes(x = height_cm, y = haircut, col = sex)) +
geom_point() +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = FALSE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
facet_grid(sex ~ .) +
guides(col = "none") +
labs(x = "Height (in cm)",
y = "Price of last haircut (in $)",
title = "Do taller people pay less for haircuts?",
subtitle = "accounting for sex")
qsdat2021 <- qsdat %>%
filter(year == "2021") %>%
select(student, year, english, smoke)
summary(qsdat2021)
student year english smoke
Min. :202101 2021 :58 n:12 1:51
1st Qu.:202115 2014 : 0 y:46 2: 4
Median :202130 2015 : 0 3: 3
Mean :202130 2016 : 0
3rd Qu.:202144 2017 : 0
Max. :202158 2018 : 0
(Other): 0
No missing data.
qsdat2021 %>% tabyl(english)
qsdat2021 %>% tabyl(smoke) %>% adorn_pct_formatting()
adorn_pct_formatting()
do?qsdat2021 %>% tabyl(english, smoke)
smoke
levels to more meaningful names in tobacco
qsdat2021 <- qsdat2021 %>%
mutate(tobacco = fct_recode(smoke,
"Never" = "1", "Quit" = "2", "Current" = "3"))
Check our work?
qsdat2021 %>% count(smoke, tobacco)
smoke
= 1 has tobacco
as Never, etc.Now we’ll use this new variable, and this time, add row and column totals.
qsdat2021 %>% tabyl(english, tobacco) %>%
adorn_totals(where = c("row", "col"))
Don’t forget to close your file with the session information.
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggridges_0.5.3 mosaicData_0.20.2 ggformula_0.10.1 ggstance_0.3.5
[5] Matrix_1.3-4 lattice_0.20-44 forcats_0.5.1 stringr_1.4.0
[9] dplyr_1.0.7 purrr_0.3.4 readr_2.0.1 tidyr_1.1.3
[13] tibble_3.1.3 ggplot2_3.3.5 tidyverse_1.3.1 magrittr_2.0.1
[17] knitr_1.33 janitor_2.1.0
loaded via a namespace (and not attached):
[1] nlme_3.1-152 fs_1.5.0 lubridate_1.7.10 bit64_4.0.5
[5] httr_1.4.2 tools_4.1.1 backports_1.2.1 bslib_0.2.5.1
[9] utf8_1.2.2 R6_2.5.1 mgcv_1.8-36 DBI_1.1.1
[13] colorspace_2.0-2 withr_2.4.2 tidyselect_1.1.1 gridExtra_2.3
[17] leaflet_2.0.4.1 bit_4.0.4 compiler_4.1.1 cli_3.0.1
[21] rvest_1.0.1 xml2_1.3.2 ggdendro_0.1.22 labeling_0.4.2
[25] sass_0.4.0 mosaicCore_0.9.0 scales_1.1.1 digest_0.6.27
[29] rmarkdown_2.10 pkgconfig_2.0.3 htmltools_0.5.1.1 labelled_2.8.0
[33] dbplyr_2.1.1 highr_0.9 htmlwidgets_1.5.3 rlang_0.4.11
[37] readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.4 farver_2.1.0
[41] generics_0.1.0 jsonlite_1.7.2 crosstalk_1.1.1 vroom_1.5.4
[45] Rcpp_1.0.7 munsell_0.5.0 fansi_0.5.0 lifecycle_1.0.0
[49] stringi_1.7.3 yaml_2.2.1 snakecase_0.11.0 MASS_7.3-54
[53] plyr_1.8.6 grid_4.1.1 parallel_4.1.1 ggrepel_0.9.1
[57] crayon_1.4.1 haven_2.4.3 splines_4.1.1 hms_1.1.0
[61] pillar_1.6.2 reprex_2.0.1 glue_1.4.2 evaluate_0.14
[65] modelr_0.1.8 vctrs_0.3.8 tzdb_0.1.2 tweenr_1.0.2
[69] cellranger_1.1.0 gtable_0.3.0 polyclip_1.10-0 assertthat_0.2.1
[73] xfun_0.25 ggforce_0.3.3 broom_0.7.9 viridisLite_0.4.0
[77] mosaic_1.8.3 ellipsis_0.3.2