Session3_Stats_in_R

What is statistics?
- The field of statistics: the practice and study of collecting and analyzing data.
- A summary statistic: a fact about or summary of some data.

Types of Statistics
- Descriptive statistics: describe and summarize data
- Inferential statistics: use a sample of data to make inferences about a larger population

Types of Data
- Numeric (quantitative)
- Categorical (qualitative)
- There are more types of data, but for now we’ll focus on these two types

Categorical data can also be stored as numbers, like the gender variable in our data set, where each number is just a code for a category.
Being able to identify data types is an important skill to master, because the type of data you’re working with dictates which summary statistics and visualizations make sense. For numeric data, we can use summary statistics like the mean and plots like scatterplots, but these don’t make a ton of sense for categorical data.
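Here is a small sketch of why the data type matters. The `toy` data frame and its values are made up for illustration (they are not the workshop data); only the idea of number-coded categories comes from the text above.

```r
# Toy data (hypothetical values, not the workshop data set)
toy <- data.frame(
  age = c(21, 34, 29, 45),  # numeric variable
  sx  = c(1, 2, 2, 1)       # categorical variable stored as number codes
)

# R reads both columns as numeric, so it will happily average the codes
class(toy$sx)
mean(toy$sx)  # runs, but an "average gender code" is not meaningful

# Declaring the variable as a factor tells R to treat it as categorical
toy$sx <- factor(toy$sx, levels = 1:2, labels = c("Male", "Female"))
class(toy$sx)

# Counts per category are the summary that makes sense here
sx_counts <- table(toy$sx)
sx_counts
```

Once a variable is a factor, functions like summary() switch from means and quartiles to counts, which is the behavior we rely on later in this session.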

Let’s begin by looking at descriptive statistics for our sample

But first, let’s set up our environment and set our working directory

# Load the psych, ggplot2, and dplyr libraries

library(psych)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read in your data set again

# Reading in data

COVID1 <- read.csv("COVID1.csv", header = TRUE, sep = ",")
#View(COVID1)

Descriptive statistics

We’re going to use the summary() function to get a breakdown of our data. What are the variables you would normally report?

# Summary stats for age
summary(COVID1$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   22.00   29.50   36.68   51.00   83.00     101

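Let’s talk through the code. What does each part represent? Also, what is something that this function doesn’t give us? We need the standard deviation! Luckily for us, we can simply use the sd() function to get it. Because age has missing values (the NA’s column above), we need to add na.rm = TRUE, otherwise sd() returns NA.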
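A minimal sketch of the sd() call, using a short made-up `age` vector as a stand-in for COVID1$age (the values are hypothetical). The summary() output above reports NA's for age, so the example shows what happens with and without `na.rm = TRUE`.

```r
# Hypothetical ages with missing values, standing in for COVID1$age
age <- c(18, 22, 29, 51, NA, 83, NA)

# With missing values present, sd() returns NA by default
sd(age)

# na.rm = TRUE drops the NAs before computing the standard deviation
age_sd <- sd(age, na.rm = TRUE)
age_sd

# The same argument works for mean()
age_mean <- mean(age, na.rm = TRUE)
age_mean
```

With the real data, the equivalent call would be sd(COVID1$age, na.rm = TRUE).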

Descriptive stats for categorical variables

# Turn gender, ethnicity, and income into factors (we did this in the previous
# session) and then use the summary() function to get the descriptives

# Turning gender into factor
Gender <- factor(COVID1$sx, levels = 1:3, labels = c("Male", "Female", "Prefer not to answer"))
summary(Gender)

##                 Male               Female Prefer not to answer 
##                  125                  440                    5 
##                 NA's 
##                  101

# Turning ethnicity into factor
Ethnicity <- factor(COVID1$eth, levels = 1:7, labels = c("Black", "South Asian", 
                                                         "East/Southeast Asian",
                                                         "Hispanic or Latino", 
                                                         "White", 
                                                         "Pacific Islander",
                                                         "Prefer not to respond"
                                                         ))
summary(Ethnicity)

##                 Black           South Asian  East/Southeast Asian 
##                    10                    20                    70 
##    Hispanic or Latino                 White      Pacific Islander 
##                   158                   274                    11 
## Prefer not to respond                  NA's 
##                    27                   101

# Turning income into factor
Income <- factor(COVID1$fam_income, levels = 1:9, labels = c("Less than 20k", 
                                                             "$20-29k", 
                                                             "$30-39k",
                                                             "$40-49k", 
                                                             "$50-74k", 
                                                             "$75-99k", 
                                                             "$100-149k",
                                                             "$150-249k", 
                                                             "More than $250k"))

summary(Income)

##   Less than 20k         $20-29k         $30-39k         $40-49k         $50-74k 
##              44              45              25              46              77 
##         $75-99k       $100-149k       $150-249k More than $250k            NA's 
##              92             125              84              32             101
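For categorical variables, people often report percentages alongside counts. The sketch below is one way to get them with base R; the `sx` codes are made up for illustration, and only the factor levels and labels mirror the gender factor above.

```r
# Hypothetical gender codes with one missing value, standing in for COVID1$sx
sx <- c(1, 2, 2, 2, 3, NA, 2, 1)
gender <- factor(sx, levels = 1:3,
                 labels = c("Male", "Female", "Prefer not to answer"))

# Counts per category, with missing values shown as their own cell
counts <- table(gender, useNA = "ifany")
counts

# Proportions among the non-missing responses
props <- prop.table(table(gender))

# Expressed as percentages, rounded to one decimal place
round(100 * props, 1)
```

Whether to compute percentages out of all rows or only out of non-missing responses is a reporting choice; this sketch uses non-missing responses.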

Let’s now talk about NHST, effect sizes, and confidence intervals

What do we mean when we talk about NHST? What is an effect size? What’s an example of an effect size? What are confidence intervals and what do they tell us?

Null Hypothesis Significance Testing
- NHST: a process that produces probabilities (p-values) that are accurate when the null hypothesis is true
- In other words: if it is true that exercising has no effect on academic success, then there is an xx% chance we would get an average academic success score at least as extreme as the one we observed
- High probability = near the mean of the null distribution = common (a result like ours would be unsurprising if there were no effect). Low probability = far from the mean = rare if the null hypothesis is true
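One way to make the "if the null is true" logic concrete is to simulate it. The sketch below uses a permutation test, which is a swapped-in illustration rather than a method used elsewhere in this session, and the exercise scores are invented for the example.

```r
set.seed(1)  # make the shuffles reproducible

# Hypothetical academic success scores for two groups of 8 students
exercise    <- c(78, 85, 90, 72, 88, 95, 81, 79)
no_exercise <- c(70, 75, 82, 68, 80, 77, 74, 85)

# The effect we observed: difference between the group means
observed <- mean(exercise) - mean(no_exercise)

# If exercise has no effect, the group labels are arbitrary. So we shuffle
# the scores many times and recompute the difference each time. This builds
# the distribution of differences we would expect under the null hypothesis.
scores <- c(exercise, no_exercise)
null_diffs <- replicate(5000, {
  shuffled <- sample(scores)
  mean(shuffled[1:8]) - mean(shuffled[9:16])
})

# Two-sided p-value: the share of shuffled differences that are at least
# as extreme as the one we observed
p_value <- mean(abs(null_diffs) >= abs(observed))
p_value
```

A small p-value here means that differences as large as ours rarely show up when the labels are shuffled, that is, when the null hypothesis is true by construction.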

To further understand this concept, let’s work with a relatively common effect size: correlations.

Correlation
- Quantifies the linear relationship between two variables
- A number between -1 and 1
- Magnitude corresponds to the strength of the relationship
- Sign (+/-) corresponds to the direction of the relationship
- We can use scatterplots to help us visualize any given correlation
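To see where the number comes from, here is a short sketch that computes a Pearson correlation by hand and compares it with cor(). The `x` and `y` values are made up for illustration.

```r
# Hypothetical paired measurements
x <- c(1, 3, 4, 6, 8, 9)
y <- c(2, 3, 5, 5, 9, 8)

# Pearson r is the covariance of x and y divided by the product of their SDs
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# The built-in function gives the same value
r_builtin <- cor(x, y)

# Flipping the sign of one variable flips the direction of the
# relationship, but not its strength
r_flipped <- cor(x, -y)

c(by_hand = r_by_hand, builtin = r_builtin, flipped = r_flipped)
```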

Let’s look at the correlation between days spent in quarantine (quar_days) and a measure of worry (worry1). What’s the first thing we want to do? DATAVIZ. We want to visualize our data. Create a scatterplot for these two variables

ggplot(COVID1, aes(quar_days, worry1)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 101 rows containing non-finite values (stat_smooth).

## Warning: Removed 101 rows containing missing values (geom_point).

The last line of code, geom_smooth(method = "lm", se = FALSE), tells R to give us a linear trend line: that’s the blue line we got. Try removing that line of code and see what happens. What do we see? Weird plot? Yep, there’s a reason we couldn’t use these data, lol.

Calculating correlation between quar_days and worry1

We’re going to use the cor() function.

cor(COVID1$quar_days, COVID1$worry1)

## [1] NA

What happened? Turns out we have a ton of NAs. We can easily fix it by adding an argument in our command. We can tell R which data points to use, the argument is literally use =

Calculating correlation between quar_days and worry1, using “pairwise.complete.obs”

cor(COVID1$quar_days, COVID1$worry1, use = "pairwise.complete.obs")

## [1] 0.09149006
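With only two variables, "complete.obs" and "pairwise.complete.obs" give the same answer. The difference shows up once you correlate three or more variables at a time. The sketch below uses a made-up `toy` data frame (hypothetical values) with NAs in different rows of each column.

```r
# Hypothetical data with missing values in a different row of each column
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 1, 4, 3, NA, 6),
  c = c(5, 3, NA, 2, 1, 4)
)

# Default behavior: any pair involving missing values comes back as NA
r_default <- cor(toy)

# complete.obs: keep only the rows with no NA in ANY column (3 rows here)
r_complete <- cor(toy, use = "complete.obs")

# pairwise.complete.obs: for each pair of columns, keep the rows that are
# complete for that pair (4 rows for each pair here)
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

r_complete
r_pairwise
```

Pairwise deletion keeps more data, but each correlation in the matrix can then be based on a different set of rows, which is worth keeping in mind when reporting sample sizes.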

We fixed it, yay! But, what are we missing? (hint: think about NHST)

The cor() function does not give us a p-value. So, we’ll need to use a different function, corr.test()

Correlation test between quar_days and worry1

corr.test(COVID1$quar_days, COVID1$worry1, use = "pairwise.complete.obs", 
          alpha = .05)

## Call:corr.test(x = COVID1$quar_days, y = COVID1$worry1, use = "pairwise.complete.obs", 
##     alpha = 0.05)
## Correlation matrix 
## [1] 0.09
## Sample Size 
## [1] 570
## These are the unadjusted probability values.
##   The probability values  adjusted for multiple tests are in the p.adj object. 
## [1] 0.03
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
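If psych is not available, base R's cor.test() is an alternative way to get a p-value for a correlation. This is a sketch with simulated stand-ins for quar_days and worry1 (the numbers are generated here, not taken from the workshop data), so the results will not match the output above.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-ins for the two variables, with a few missing values
quar_days <- c(rpois(60, lambda = 20), NA, NA)
worry1    <- c(round(3 + 0.03 * quar_days[1:60] + rnorm(60)), 4, NA)

# cor.test() uses only the pairs where both values are present
result <- cor.test(quar_days, worry1, method = "pearson", conf.level = 0.95)

result$estimate   # the correlation coefficient
result$p.value    # the p-value for the test that the true correlation is 0
result$parameter  # degrees of freedom (number of complete pairs minus 2)
result$conf.int   # 95% confidence interval for the correlation
```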

What do we notice? What conclusions can we draw?

We still need to add confidence intervals. What are CIs?

Getting a CI around the effect size

You’ll need to install the “confintr” package and load the library. Since we only need this package for this one part, this is an example of loading a package within a chunk as opposed to including it in our environment setup at the beginning.

library(confintr)

ci_cor(x = COVID1$quar_days, y = COVID1$worry1, method = "pearson")

## 
##  Two-sided 95% normal confidence interval for the true Pearson
##  correlation coefficient
## 
## Sample estimate: 0.09149006 
## Confidence interval:
##        2.5%       97.5% 
## 0.009435594 0.172320673
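To demystify that interval a little, here is a sketch of the standard Fisher z approach to a confidence interval for a correlation. I am assuming this is what the "normal" interval above is based on; the values of `r` and `n` are copied from the cor() and corr.test() output earlier in this session.

```r
r <- 0.09149006  # sample correlation from the cor() output
n <- 570         # sample size from the corr.test() output

# Step 1: transform r to the z scale, where its sampling distribution
# is approximately normal
z <- atanh(r)

# Step 2: approximate standard error on the z scale
se <- 1 / sqrt(n - 3)

# Step 3: build a 95% interval on the z scale
crit <- qnorm(0.975)
z_interval <- z + c(-1, 1) * crit * se

# Step 4: transform the interval back to the correlation scale
ci <- tanh(z_interval)
ci
```

The result should land very close to the interval printed above. Note that the interval does not include 0, which lines up with the p-value below .05 from corr.test().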

What can we say about all this?

The last thing I want to cover before we end this session is citing R. Remember that R is a free resource, and package developers need recognition for all their hard work. When we work with a lot of packages, it can be easy to forget all the ones we use. Luckily, there’s a function to help us remember.

Retrieving session information

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] confintr_0.1.2 dplyr_1.0.10   ggplot2_3.3.6  psych_2.2.5   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2 xfun_0.33        bslib_0.4.0      purrr_0.3.4     
##  [5] splines_4.2.1    lattice_0.20-45  colorspace_2.0-3 vctrs_0.4.1     
##  [9] generics_0.1.3   htmltools_0.5.3  yaml_2.3.5       mgcv_1.8-40     
## [13] utf8_1.2.2       rlang_1.0.5      jquerylib_0.1.4  pillar_1.8.1    
## [17] glue_1.6.2       withr_2.5.0      DBI_1.1.3        lifecycle_1.0.2 
## [21] stringr_1.4.1    munsell_0.5.0    gtable_0.3.1     evaluate_0.16   
## [25] labeling_0.4.2   knitr_1.40       fastmap_1.1.0    parallel_4.2.1  
## [29] fansi_1.0.3      highr_0.9        scales_1.2.1     cachem_1.0.6    
## [33] jsonlite_1.8.0   farver_2.1.1     mnormt_2.1.0     digest_0.6.29   
## [37] stringi_1.7.8    grid_4.2.1       cli_3.4.0        tools_4.2.1     
## [41] magrittr_2.0.3   sass_0.4.2       tibble_3.1.8     pkgconfig_2.0.3 
## [45] Matrix_1.4-1     assertthat_0.2.1 rmarkdown_2.16   rstudioapi_0.14 
## [49] boot_1.3-28      R6_2.5.1         nlme_3.1-157     compiler_4.2.1

Once you know that, you can easily retrieve the citation for your packages. You want to cite R itself, as well as every package you attached.

Retrieving citations for packages

citation("tidyverse")

## 
## To cite package 'tidyverse' in publications use:
## 
##   Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
##   Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
##   E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
##   Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
##   the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
##   doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }

citation("ggplot2")

## 
## To cite ggplot2 in publications, please use:
## 
##   H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
##   Springer-Verlag New York, 2016.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     author = {Hadley Wickham},
##     title = {ggplot2: Elegant Graphics for Data Analysis},
##     publisher = {Springer-Verlag New York},
##     year = {2016},
##     isbn = {978-3-319-24277-4},
##     url = {https://ggplot2.tidyverse.org},
##   }

Finally, this is how you would cite them:

…analyses were conducted using R (v4.1.2; R Core Team, 2021), the tidyverse package (v1.3.1; Wickham et al., 2019), and the ggplot2 package (v3.3.5; Wickham, 2016).
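The version numbers in a sentence like that should come from your own session. Here is a small sketch of how you might pull them out programmatically. The `pkgs` vector uses base packages only so that the example runs anywhere; in practice you would swap in the packages you attached (for example "psych" or "ggplot2").

```r
# Citation for R itself: call citation() with no package name
r_cite <- citation()
r_cite

# The R version to report alongside it
r_version <- as.character(getRversion())
r_version

# Version numbers for a set of packages
# (placeholder names; replace with the packages you actually attached)
pkgs <- c("stats", "utils")
versions <- vapply(pkgs,
                   function(p) as.character(packageVersion(p)),
                   character(1))
versions
```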

We’re done! The last thing we’ll do is an open Q&A panel with other grad students

Session3_Stats_in_R

Hailey_Rousey

2022-09-18