What is statistics?
- The field of statistics: the practice and study of collecting and analyzing data.
- A summary statistic: a fact about or summary of some data.
Types of Statistics
- Descriptive statistics: describe and summarize data
- Inferential statistics: use a sample of data to make inferences about a larger population
Types of Data
- Numeric (quantitative)
- Categorical (qualitative)
- There are more types of data, but for now we’ll focus on these two.
Categorical data can be represented using numbers, like the gender variable in our data set (sx, coded 1 to 3).
Identifying data types is an important skill to master, because the type of data you’re working with dictates which summary statistics and visualizations make sense. For numeric data, we can use summary statistics like the mean and plots like scatterplots, but these don’t make much sense for categorical data.
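To make this concrete, here is a minimal sketch using a small made-up data frame (the `toy` object and its values are hypothetical, not the COVID1 data). It shows a mean for the numeric column and counts for the categorical one, after turning the number codes into labelled categories.

```r
# Hypothetical toy data: one numeric and one categorical variable
toy <- data.frame(
  age = c(18, 22, 29, 51, 83),
  sx  = c(1, 2, 2, 1, 2)   # categorical, but stored as number codes
)

# Numeric variable: the mean is a meaningful summary
mean_age <- mean(toy$age)

# Categorical variable: convert the codes to labelled categories first
toy$sx <- factor(toy$sx, levels = 1:2, labels = c("Male", "Female"))

# Counts per category are the sensible summary here
sx_counts <- table(toy$sx)

is.numeric(toy$age)  # check that age is numeric
is.factor(toy$sx)    # check that sx is now a factor
```

Taking the mean of the raw 1/2 codes would run without error, which is exactly why it pays to check the type before summarizing.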
Let’s begin by looking at descriptive statistics for our sample
But first, let’s set up our environment and set our working directory
# Load the psych, ggplot2, and dplyr libraries
library(psych)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Read in your data set again
# Reading in data
COVID1 <- read.csv("COVID1.csv", header = TRUE, sep = ",")
#View(COVID1)
Descriptive statistics
We’re going to use the summary() function to get a breakdown of our data. What are the variables you would normally report?
# Summary stats for age
summary(COVID1$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 22.00 29.50 36.68 51.00 83.00 101
Let’s talk through the code. What does each part represent? Also, what is something that this function doesn’t give us? We need the standard deviation! Luckily for us, we can simply use the sd() function to get it. Because age has 101 NAs, we also need to add na.rm = TRUE; otherwise sd() returns NA.
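Here is a minimal sketch of getting a standard deviation when some values are missing. The `age_example` vector is hypothetical, made up for illustration; it is not the COVID1 data.

```r
# Hypothetical ages with one missing value (not the COVID1 data)
age_example <- c(18, 22, 29, 51, 83, NA)

# Without na.rm, a single missing value makes the result NA
sd_with_na <- sd(age_example)

# na.rm = TRUE drops the missing values before computing the SD
age_sd <- sd(age_example, na.rm = TRUE)
```

With the workshop data, the equivalent call would be sd(COVID1$age, na.rm = TRUE).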
Descriptive stats for categorical variables
# Turn gender, ethnicity, and income into factors (we did this in the previous
# session) and then use the summary() function to get the descriptives
# Turning gender into factor
Gender <- factor(COVID1$sx, levels = 1:3, labels = c("Male", "Female", "Prefer not to answer"))
summary(Gender)
## Male Female Prefer not to answer
## 125 440 5
## NA's
## 101
# Turning ethnicity into factor
Ethnicity <- factor(COVID1$eth, levels = 1:7, labels = c("Black", "South Asian",
"East/Southeast Asian",
"Hispanic or Latino",
"White",
"Pacific Islander",
"Prefer not to respond"
))
summary(Ethnicity)
## Black South Asian East/Southeast Asian
## 10 20 70
## Hispanic or Latino White Pacific Islander
## 158 274 11
## Prefer not to respond NA's
## 27 101
# Turning income into factor
Income <- factor(COVID1$fam_income, levels = 1:9, labels = c("Less than 20k",
"$20-29k",
"$30-39k",
"$40-49k",
"$50-74k",
"$75-99k",
"$100-149k",
"$150-249k",
"More than $250k"))
summary(Income)
## Less than 20k $20-29k $30-39k $40-49k $50-74k
## 44 45 25 46 77
## $75-99k $100-149k $150-249k More than $250k NA's
## 92 125 84 32 101
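summary() gives us counts for each category, but write-ups often report percentages as well. One way to get them is sketched below on a made-up factor (`gender_example` is hypothetical, not the COVID1 data), using base R’s table() and prop.table().

```r
# Hypothetical factor with one missing value (not the COVID1 data)
gender_example <- factor(c(1, 2, 2, 2, NA), levels = 1:2,
                         labels = c("Male", "Female"))

# Counts per category, keeping the NAs as their own group
counts <- table(gender_example, useNA = "ifany")

# Proportions (they sum to 1), then percentages rounded to one decimal
props <- prop.table(counts)
percentages <- round(100 * props, 1)
```

Dropping useNA = "ifany" would compute the percentages among respondents only, so decide which denominator you want to report.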
Let’s now talk about NHST, effect sizes, and confidence intervals
What do we mean when we talk about NHST? What is an effect size? What’s an example of an effect size? What are confidence intervals and what do they tell us?
Null Hypothesis Significance Testing
- NHST: a process that produces probabilities that are accurate when the null hypothesis is true
- In other words: if it is true that exercising has no effect on academic success, then there is an xx% chance we would get an average academic success score at least as extreme as the one we observed.
- High probability = near the mean of the null distribution = common (a result like ours would be unsurprising if chance alone were at work)
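One way to see what “the probability when the null hypothesis is true” means is to simulate the null ourselves. The sketch below is a simple permutation approach on made-up data (the `exercise` and `success` variables and all numbers are hypothetical). It is meant as an illustration of the logic, not as the test we run later.

```r
set.seed(123)  # make the simulation reproducible

# Hypothetical data: exercise and an academic success score
n <- 50
exercise <- rnorm(n)
success  <- 0.3 * exercise + rnorm(n)

# The effect we actually observed
observed_r <- cor(exercise, success)

# Shuffling one variable breaks any real link between the two,
# so each shuffled correlation is a draw from a world where the null is true
null_r <- replicate(2000, cor(exercise, sample(success)))

# Two-sided p-value: share of null results at least as extreme as ours
p_value <- mean(abs(null_r) >= abs(observed_r))
p_value
```

If most shuffled correlations are as large as the observed one, the p-value is high and our result is common under the null. If almost none are, the p-value is low.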
To further understand this concept, let’s work with a relatively common effect size: correlations.
Correlation
- Quantifies the linear relationship between two variables
- Number between -1 and 1
- Magnitude corresponds to the strength of the relationship
- Sign (+/-) corresponds to the direction of the relationship
- We can use scatterplots to help us visualize any given correlation
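To connect these points to a calculation, here is a minimal sketch on made-up numbers (`x` and `y` are hypothetical). It computes Pearson’s r by hand as the covariance divided by the product of the standard deviations, checks it against cor(), and shows that flipping one variable changes the sign but not the magnitude.

```r
# Hypothetical paired measurements (not the COVID1 data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

# Pearson's r: covariance divided by the product of the standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# Flipping the sign of one variable flips the direction, not the strength
r_flipped <- cor(x, -y)
```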
Let’s look at the correlation between days spent in quarantine (quar_days) and a measure of worry (worry1). What’s the first thing we want to do? DATAVIZ. We want to visualize our data. Create a scatterplot for these two variables
ggplot(COVID1, aes(quar_days, worry1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 101 rows containing non-finite values (stat_smooth).
## Warning: Removed 101 rows containing missing values (geom_point).
The last line of code tells R to add a linear trend line: that blue
line we got. Try removing that line of code and see what happens. What
do we see? A weird plot? Yep, there’s a reason we couldn’t use these data,
lol
Calculating correlation between quar_days and worry1
We’re going to use the cor() function
cor(COVID1$quar_days, COVID1$worry1)
## [1] NA
What happened? Turns out we have a ton of NAs, and by default cor() returns NA if any value is missing. We can easily fix it by adding an argument to our command that tells R which data points to use. The argument is literally use =
Calculating correlation between quar_days and worry1, using “pairwise.complete.obs”
cor(COVID1$quar_days, COVID1$worry1, use = "pairwise.complete.obs")
## [1] 0.09149006
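If you want to see exactly what the use = argument is doing, here is a small sketch on made-up vectors (`days` and `worry` are hypothetical, not the COVID1 columns). With only two variables, "pairwise.complete.obs" amounts to keeping the rows where both values are present.

```r
# Hypothetical vectors with missing values (not the COVID1 data)
days  <- c(0, 3, 7, NA, 14, 21)
worry <- c(2, NA, 3, 4, 4, 5)

# Default behaviour: any missing value makes the result NA
r_default <- cor(days, worry)

# Use only the pairs where both values are present
r_pairwise <- cor(days, worry, use = "pairwise.complete.obs")

# The same thing done by hand, to show what the argument does
keep <- complete.cases(days, worry)
r_manual <- cor(days[keep], worry[keep])
```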
We fixed it, yay! But, what are we missing? (hint: think about NHST)
The cor() function does not give us a p-value. So, we’ll need to use a different function, corr.test(), which comes from the psych package we loaded earlier
Correlation test between quar_days and worry1
corr.test(COVID1$quar_days, COVID1$worry1, use = "pairwise.complete.obs",
alpha = .05)
## Call:corr.test(x = COVID1$quar_days, y = COVID1$worry1, use = "pairwise.complete.obs",
## alpha = 0.05)
## Correlation matrix
## [1] 0.09
## Sample Size
## [1] 570
## These are the unadjusted probability values.
## The probability values adjusted for multiple tests are in the p.adj object.
## [1] 0.03
##
## To see confidence intervals of the correlations, print with the short=FALSE option
What do we notice? What conclusions can we draw?
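To see where a p-value like this comes from, the sketch below applies the standard t test for a Pearson correlation to the r and sample size printed above. That corr.test() uses this exact calculation internally is an assumption here; the point is only that the result lines up with the reported 0.03.

```r
r <- 0.09149006   # correlation reported by cor() earlier
n <- 570          # sample size reported by corr.test()

# t statistic for testing H0: the true correlation is 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
round(p_value, 2)
```

Notice how a small correlation can still come out below .05 when the sample is this large, which is why we report the effect size alongside the p-value.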
We still need to add confidence intervals. What are CIs?
Getting a CI around the effect size. You’ll need to install the “confintr” package and load the library. Since we only need this package for this one part, this is an example of loading a package within a chunk as opposed to including it in our environment setup at the beginning
library(confintr)
ci_cor(x = COVID1$quar_days, y = COVID1$worry1, method = "pearson")
##
## Two-sided 95% normal confidence interval for the true Pearson
## correlation coefficient
##
## Sample estimate: 0.09149006
## Confidence interval:
## 2.5% 97.5%
## 0.009435594 0.172320673
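The “normal” interval above can be reproduced by hand with the Fisher z-transformation, using the r and sample size already reported. That ci_cor() computes it exactly this way is an assumption, but the numbers agree closely with the output above.

```r
r <- 0.09149006   # sample correlation from the output above
n <- 570          # number of complete pairs

# Fisher z-transformation makes the sampling distribution roughly normal
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# 95% interval on the z scale, then transformed back to the r scale
crit <- qnorm(0.975)
ci <- tanh(c(z - crit * se, z + crit * se))
ci
```

Because the interval is built on the z scale and transformed back, it is not perfectly symmetric around r.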
What can we say about all this?
The last thing I want to cover before we end this session is citing R. Remember that R is a free resource, and package developers deserve recognition for all their hard work. When we work with a lot of packages, it can be easy to forget all the ones we used. Luckily, there’s a function to help us remember.
Retrieving session information
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] confintr_0.1.2 dplyr_1.0.10 ggplot2_3.3.6 psych_2.2.5
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.2 xfun_0.33 bslib_0.4.0 purrr_0.3.4
## [5] splines_4.2.1 lattice_0.20-45 colorspace_2.0-3 vctrs_0.4.1
## [9] generics_0.1.3 htmltools_0.5.3 yaml_2.3.5 mgcv_1.8-40
## [13] utf8_1.2.2 rlang_1.0.5 jquerylib_0.1.4 pillar_1.8.1
## [17] glue_1.6.2 withr_2.5.0 DBI_1.1.3 lifecycle_1.0.2
## [21] stringr_1.4.1 munsell_0.5.0 gtable_0.3.1 evaluate_0.16
## [25] labeling_0.4.2 knitr_1.40 fastmap_1.1.0 parallel_4.2.1
## [29] fansi_1.0.3 highr_0.9 scales_1.2.1 cachem_1.0.6
## [33] jsonlite_1.8.0 farver_2.1.1 mnormt_2.1.0 digest_0.6.29
## [37] stringi_1.7.8 grid_4.2.1 cli_3.4.0 tools_4.2.1
## [41] magrittr_2.0.3 sass_0.4.2 tibble_3.1.8 pkgconfig_2.0.3
## [45] Matrix_1.4-1 assertthat_0.2.1 rmarkdown_2.16 rstudioapi_0.14
## [49] boot_1.3-28 R6_2.5.1 nlme_3.1-157 compiler_4.2.1
Once you know which packages you used, you can easily retrieve their citations. You want to cite R itself, as well as every package you attached.
Retrieving citations for packages
citation("tidyverse")
##
## To cite package 'tidyverse' in publications use:
##
## Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
## E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
## Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
## the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
## doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Welcome to the {tidyverse}},
## author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
## year = {2019},
## journal = {Journal of Open Source Software},
## volume = {4},
## number = {43},
## pages = {1686},
## doi = {10.21105/joss.01686},
## }
citation("ggplot2")
##
## To cite ggplot2 in publications, please use:
##
## H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
## Springer-Verlag New York, 2016.
##
## A BibTeX entry for LaTeX users is
##
## @Book{,
## author = {Hadley Wickham},
## title = {ggplot2: Elegant Graphics for Data Analysis},
## publisher = {Springer-Verlag New York},
## year = {2016},
## isbn = {978-3-319-24277-4},
## url = {https://ggplot2.tidyverse.org},
## }
Finally, this is how you would cite them:
…analyses were conducted using R (v4.1.2; R Core Team, 2021), the tidyverse package (v1.3.1; Wickham et al., 2019), and the ggplot2 package (v3.3.5; Wickham, 2016).
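To fill in the citation for R itself and the version numbers, something like the following sketch should work. It uses the base functions citation(), R.version.string, and packageVersion(). The "stats" package is only a stand-in here because it ships with R; swap in whichever packages you attached.

```r
# Citation for R itself (call citation() with no package name)
r_citation <- citation()

# Version strings to report alongside the citations
r_version     <- R.version.string
stats_version <- as.character(packageVersion("stats"))  # stand-in package

print(r_citation)
r_version
stats_version
```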
We’re done! The last thing we’ll do is an open Q&A panel with other grad students