class: center, middle

# R for Economics and Social Science Research

#### Norberto E. Milla, Jr.

---

class: center, middle

# Day 1: Introduction to R

---

# Overview of R and RStudio

<img src="pic3.png" width="15%" style="display: block; margin: auto auto auto 0;" />

* is open source and freely available
* is a cross-platform language
* has an extensive and coherent set of tools for statistical analysis
* has an extensive and highly flexible graphical facility for producing publication-ready graphics
* has an expanding set of freely available ‘packages’ to extend R’s capabilities
* has an extensive support network with numerous online and freely available documents

---

# Overview of R and RStudio

* RStudio is a user-friendly integrated development environment (IDE) for R
* It incorporates the R Console, a script editor, and other useful functionality

<img src="pic1.png" width="100%" />

---

# Installing and loading packages

* R packages are collections of functions, data sets, and documentation that enhance R's functionality (e.g. <tt>tidyverse</tt>, <tt>ggplot2</tt>, <tt>readxl</tt>)
* They help users perform specific tasks more efficiently
* Many R packages can be downloaded from the Comprehensive R Archive Network (CRAN): *>22,000 packages*
* Type in the Console <tt>install.packages("*packagename*")</tt>
* Alternative: Click **Tools** in the Menu bar and select *Install Packages...*
* Load an installed package in each new session with <tt>library(*packagename*)</tt>

<img src="pic2.png" width="45%" style="display: block; margin: auto;" />

---

# Data types and basic operations

* R **objects** are fundamental data containers: <tt>vectors</tt>, <tt>matrices</tt>, <tt>data frames</tt>, <tt>lists</tt>, and <tt>functions</tt>
* Objects are created using the assignment operator: <tt>*<-*</tt>

``` r
a <- 10
b <- 5
c <- sqrt(a^2 + b^2)
print(c)
```

```
## [1] 11.18034
```

* A few rules for naming objects in R:
  - R is case-sensitive: <tt>Weight</tt> is different from <tt>weight</tt>
  - Object names should be explicit and not too long
  - Do not start a name with a number, such as <tt>2cm</tt>

---

# Data types and basic operations: vector

* one-dimensional arrays that hold elements of the same type, such as numbers, characters, or logical values

``` r
numeric_vector <- c(1, 2, 3, 4)            # Numeric vector
character_vector <- c("apple", "banana")   # Character vector
logical_vector <- c(TRUE, FALSE, TRUE)     # Logical vector
```

``` r
numeric_vector
```

```
## [1] 1 2 3 4
```

``` r
character_vector
```

```
## [1] "apple"  "banana"
```

``` r
logical_vector
```

```
## [1]  TRUE FALSE  TRUE
```

---

# Data types and basic operations: vector

``` r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y # Adds the elements of x and y, element by element
```

```
## [1] 5 7 9
```

``` r
y^x # Raises each element of y to the power of the matching element of x
```

```
## [1]   4  25 216
```

``` r
x * y # Multiplies the elements of x and y, element by element
```

```
## [1]  4 10 18
```

---

# Data types and basic operations: vector

* Elements of a vector can be accessed as follows:

``` r
set.seed(1234) # Fixes the seed so the same random numbers are generated on every run
z <- rnorm(n = 6, mean = 0, sd = 1)
z # Prints all elements of z
```

```
## [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247  0.5060559
```

``` r
z[4] # Extracts the 4th element of z
```

```
## [1] -2.345698
```

``` r
z[4:6] # Extracts the 4th through the 6th elements of z
```

```
## [1] -2.3456977  0.4291247  0.5060559
```

``` r
z[c(1,3,5)] # Extracts the 1st, 3rd, and 5th elements of z
```

```
## [1] -1.2070657  1.0844412  0.4291247
```

---

# Data types and basic operations: vector

* Basic functions for working with vectors: <tt>length()</tt>, <tt>sum()</tt>, <tt>mean()</tt>, <tt>sd()</tt>

``` r
length(z) # Determines the number of elements of z
```

```
## [1] 6
```

``` r
sum(z) # Determines the sum of the elements of z
```

```
## [1] -1.255712
```

``` r
mean(z) # Calculates the mean/average of the elements of z
```

```
## [1] -0.2092854
```

``` r
sd(z) # Calculates the standard deviation of the elements of z
```

```
## [1] 1.295355
```

---

# Data types and basic operations: matrix

* a two-dimensional arrangement of data (of the same type) in rows and columns
* can be viewed as a collection of equal-length vectors organized into rows and columns

``` r
A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE) # Create a 3x3 numeric matrix, filled row by row
print(A)
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
```

``` r
B <- matrix(1:9, nrow = 3, ncol = 3, byrow = FALSE) # Create a 3x3 numeric matrix, filled column by column
print(B)
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
```

---

# Data types and basic operations: matrix

* Addition, subtraction, and multiplication of matrices are shown below:

``` r
A + B # Addition
```

```
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    6   10   14
## [3,]   10   14   18
```

``` r
A - B # Subtraction
```

```
##      [,1] [,2] [,3]
## [1,]    0   -2   -4
## [2,]    2    0   -2
## [3,]    4    2    0
```

``` r
A %*% B # Matrix multiplication
```

```
##      [,1] [,2] [,3]
## [1,]   14   32   50
## [2,]   32   77  122
## [3,]   50  122  194
```

---

# Data types and basic operations: list

* Used to store mixtures of data types

``` r
mylist <- list(char_vector = c("black", "yellow", "orange"),
               logic_vector = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
               num_mat = matrix(1:6, nrow = 3))
print(mylist)
```

```
## $char_vector
## [1] "black"  "yellow" "orange"
## 
## $logic_vector
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
## 
## $num_mat
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
```

---

# Data types and basic operations: list

* We can use indexing to extract one or more elements of a list, just as with a vector

``` r
print(mylist[2]) # Prints the logical vector (returned as a one-element list)
```

```
## $logic_vector
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
```

``` r
print(mylist[3]) # Prints the numeric matrix (returned as a one-element list)
```

```
## $num_mat
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
```

---

# Data types and basic operations: data frame

.pull-left[
* a two-dimensional, tabular data structure where each column can contain elements of a different data type (numeric, character, logical, etc.)
* every row corresponds to an observation or case (e.g. student, firm, university); a column corresponds to a variable (e.g. age, sex, marital status)
* the most commonly used data structure for statistical analysis and data manipulation
]

.pull-right[
<img src="pic4.png" width="100%" style="display: block; margin: auto;" />
]

---

# Data types and basic operations: data frame

``` r
stud_height <- c(180, 155, 160, 167, 181, 165)
stud_weight <- c(65, 50, 52, 58, 70, 60)
stud_names <- c("Theo", "Anthony", "Vincent", "Angelo", "Lee", "Antonette")
stud_record <- data.frame(Names = stud_names, Height = stud_height, Weight = stud_weight)
print(stud_record)
```

```
##       Names Height Weight
## 1      Theo    180     65
## 2   Anthony    155     50
## 3   Vincent    160     52
## 4    Angelo    167     58
## 5       Lee    181     70
## 6 Antonette    165     60
```

---

# R Markdown

- a simple, easy-to-use plain-text format in which one can write R code and see its results (e.g. plots and tables) in the same document
- useful for generating a single, nicely formatted, reproducible document (such as a report, publication, thesis chapter, web page, or slide deck)
- the document can be rendered in HTML, PDF, or Word format

<img src="pic8.png" width="65%" style="display: block; margin: auto;" />

---

# Data management: importing data

* We can import data sets using various functions in R
  - <tt>read.table()</tt> for text (*.txt*) files
  - <tt>read.csv()</tt> for comma-separated values (*.csv*) files
  - <tt>read_dta()</tt> or <tt>read_stata()</tt> for Stata (*.dta*) files [**haven** package]
  - <tt>read_excel()</tt> for Excel (*.xls* or *.xlsx*) files [**readxl** package]

---

# Data management: importing data

**Setting up the working directory**

- Type in the Console <tt>setwd("path")</tt>
- Click **Session** in the menu bar, select **Set Working Directory**, then **Choose Directory...**
- Click **Files** in the <tt>Files, Plots, Packages, Help</tt> pane in RStudio. Then click the icon (encircled) shown below. Browse to the desired folder.
<img src="pic5.png" width="35%" style="display: block; margin: auto;" />

- Then click the gear icon (encircled) shown below and select *Set As Working Directory*

<img src="pic6.png" width="35%" style="display: block; margin: auto;" />

---

# Data management: importing data

``` r
library(haven) # provides read_stata()
ndhs <- read_stata("PHBR82FL.DTA")
head(ndhs)
```

```
## # A tibble: 6 × 1,252
##   caseid     bidx v000   v001  v002  v003  v004   v005  v006  v007  v008 v008a
##   <chr>     <dbl> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "      1 …    1 PH8       1     4     2     1 116381     5  2022  1469 44691
## 2 "      1 …    2 PH8       1     4     2     1 116381     5  2022  1469 44691
## 3 "      1 …    1 PH8       1     6     2     1 116381     5  2022  1469 44686
## 4 "      1 …    2 PH8       1     6     2     1 116381     5  2022  1469 44686
## 5 "      1 …    1 PH8       1     7     6     1 116381     5  2022  1469 44693
## 6 "      1 …    2 PH8       1     7     6     1 116381     5  2022  1469 44693
## # ℹ 1,240 more variables: v009 <dbl>, v010 <dbl>, v011 <dbl>, v012 <dbl>,
## #   v013 <dbl+lbl>, v014 <dbl+lbl>, v015 <dbl+lbl>, v016 <dbl>, v017 <dbl>,
## #   v018 <dbl+lbl>, v019 <dbl+lbl>, v019a <dbl+lbl>, v020 <dbl+lbl>,
## #   v021 <dbl>, v022 <dbl>, v023 <dbl>, v024 <dbl+lbl>, v025 <dbl+lbl>,
## #   v026 <dbl+lbl>, v027 <dbl>, v028 <dbl>, v029 <dbl>, v030 <dbl>, v031 <dbl>,
## #   v032 <dbl>, v034 <dbl+lbl>, v040 <dbl>, v042 <dbl+lbl>, v044 <dbl+lbl>,
## #   v045a <dbl+lbl>, v045b <dbl+lbl>, v045c <dbl+lbl>, v046 <dbl+lbl>, …
```

---

# Data management: importing data

``` r
library(readxl) # provides read_excel()
ets1 <- read_excel("Profile of on-going students (Region 1).xlsx")
head(ets1)
```

```
## # A tibble: 6 × 19
##    ...1 Region   SUC Target Student_Cat_Final   Age   Sex Civil_stats HHsize
##   <dbl>  <dbl> <dbl>  <dbl>             <dbl> <dbl> <dbl>       <dbl>  <dbl>
## 1     1      1     2      1                 2    18     1           1      2
## 2     2      1     1      0                 2    24     1           1      2
## 3     3      1     2      0                 2    19     1           1      2
## 4     4      1     1      1                 2    21     0           1      1
## 5     5      1     4      1                 2    20     0           1      2
## 6     6      1     5      0                 2    19     0           1      1
## # ℹ 10 more variables: Religion <dbl>, `Monthly Income` <chr>,
## #   Birth_Order <chr>, Living_arrange <dbl>, Listahanan <dbl>,
## #   `4P's beneficiary` <dbl>, `Pre-college Ed` <dbl>, HS_type <dbl>,
## #   Strand <dbl>, GWA <dbl>
```

---

# Data management: data wrangling

Common verbs in the **dplyr** package:

- <tt>select()</tt>: picks variables based on their names
- <tt>filter()</tt>: picks cases based on their values
- <tt>mutate()</tt>: adds new variables that are functions of existing variables
- <tt>summarize()</tt>: generates summary statistics such as mean, median, and SD
- <tt>group_by()</tt>: groups the rows so that later operations are carried out by group
- <tt>arrange()</tt>: changes the ordering of the rows
- <tt>rename()</tt>: changes the name of a variable

---

# Data management: data wrangling

The **tidyverse** package: a collection of R packages designed for data science

<img src="pic7.png" width="70%" style="display: block; margin: auto;" />

---

# Data management: data wrangling

``` r
library(tidyverse) # attaches dplyr, tidyr, and the %>% pipe
ets1 %>% # pipe operator (Ctrl + Shift + M)
  select(Target, Age, Sex, HHsize, GWA, HS_type, Strand) %>%
  head()
```

```
## # A tibble: 6 × 7
##   Target   Age   Sex HHsize   GWA HS_type Strand
##    <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>  <dbl>
## 1      1    18     1      2  85         0      0
## 2      0    24     1      2  85         1      1
## 3      0    19     1      2  87         0      0
## 4      1    21     0      1  90         0      0
## 5      1    20     0      2  NA         0      0
## 6      0    19     0      1  92.0       0      0
```

---

# Data management: data wrangling

``` r
ets1 %>%
  select(Target, Age, Sex, HHsize, GWA, HS_type, Strand) %>%
  mutate(Sex_recode = recode(Sex, "0" = "Male", "1" = "Female")) %>%
  head(n = 10)
```

```
## # A tibble: 10 × 8
##    Target   Age   Sex HHsize   GWA HS_type Strand Sex_recode
##     <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>  <dbl> <chr>
##  1      1    18     1      2  85         0      0 Female
##  2      0    24     1      2  85         1      1 Female
##  3      0    19     1      2  87         0      0 Female
##  4      1    21     0      1  90         0      0 Male
##  5      1    20     0      2  NA         0      0 Male
##  6      0    19     0      1  92.0       0      0 Male
##  7      1    20     1      3  96         0      1 Female
##  8      1    23     0      3  NA         1      1 Male
##  9      1    21     1      2  93         1      0 Female
## 10      1    19     1      3  80         0      0 Female
```

---

# Data management: data wrangling

``` r
ets1 %>%
  select(Target, Age, Sex, HHsize, GWA, HS_type, Strand) %>%
  mutate(Sex_recode = recode(Sex, "0" = "Male", "1" = "Female")) %>%
  filter(GWA > 90 & Sex_recode == "Male") # keep males with GWA above 90
```

```
## # A tibble: 218 × 8
##    Target   Age   Sex HHsize   GWA HS_type Strand Sex_recode
##     <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>  <dbl> <chr>
##  1      0    19     0      1  92.0       0      0 Male
##  2      0    19     0      2  94         0      0 Male
##  3      0    21     0      2  91         0      1 Male
##  4      0    20     0      2  94.4       0      0 Male
##  5      0    20     0      3  92         0      0 Male
##  6      1    21     0      2  95         0      0 Male
##  7      1    25     0      2  92         0      0 Male
##  8      0    19     0      2  91         1      1 Male
##  9      0    21     0      2  91.5       1      1 Male
## 10      0    21     0      1  91         0      0 Male
## # ℹ 208 more rows
```

---

# Data management: data wrangling

``` r
ets1 %>%
  select(Target, Age, Sex, HHsize, GWA, HS_type, Strand) %>%
  mutate(Sex_recode = recode(Sex, "0" = "Male", "1" = "Female")) %>%
  drop_na(Age) %>%
  # Age <= 20 includes age 20, so the first label is "20 & below"
  mutate(Age_Cat = if_else(Age <= 20, "20 & below",
                           if_else(Age <= 30, "21-30", "31 & up")))
```

```
## # A tibble: 1,749 × 9
##    Target   Age   Sex HHsize   GWA HS_type Strand Sex_recode Age_Cat
##     <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>  <dbl> <chr>      <chr>
##  1      1    18     1      2  85         0      0 Female     20 & below
##  2      0    24     1      2  85         1      1 Female     21-30
##  3      0    19     1      2  87         0      0 Female     20 & below
##  4      1    21     0      1  90         0      0 Male       21-30
##  5      1    20     0      2  NA         0      0 Male       20 & below
##  6      0    19     0      1  92.0       0      0 Male       20 & below
##  7      1    20     1      3  96         0      1 Female     20 & below
##  8      1    23     0      3  NA         1      1 Male       21-30
##  9      1    21     1      2  93         1      0 Female     21-30
## 10      1    19     1      3  80         0      0 Female     20 & below
## # ℹ 1,739 more rows
```

---

# Data management: data wrangling

``` r
ets1 %>%
  select(Target, Age, Sex, HHsize, GWA, HS_type, Strand) %>%
  mutate(Sex_recode = recode(Sex, "0" = "Male", "1" = "Female")) %>%
  drop_na(Age) %>%
  arrange(Age) %>%
  # breaks = 3 splits the observed Age range into three equal-width intervals
  mutate(Age_Cat = cut(Age, breaks = 3, labels = c("AgeGrp1", "AgeGrp2", "AgeGrp3"))) %>%
  group_by(Age_Cat) %>%
  count()
```

```
## # A tibble: 3 × 2
## # Groups:   Age_Cat [3]
##   Age_Cat     n
##   <fct>   <int>
## 1 AgeGrp1  1705
## 2 AgeGrp2    41
## 3 AgeGrp3     3
```

---

class: center, middle

## LUNCH BREAK

---

# Descriptive statistics: frequency tables

``` r
etsdata <- ets1 %>%
  select(Target, Age, Sex, HHsize, GWA, HS_type, Strand) %>%
  mutate(Target = recode(Target, "0" = "General Students", "1" = "Equity Target Students"),
         Sex = recode(Sex, "0" = "Male", "1" = "Female"),
         HHsize = recode(HHsize, "1" = "Small", "2" = "Medium", "3" = "Large"),
         HS_type = recode(HS_type, "0" = "Public", "1" = "Private"),
         Strand = recode(Strand, "0" = "Non-STEM", "1" = "STEM"))
```

---

# Descriptive statistics: frequency tables

``` r
library(gtsummary) # provides tbl_summary(), tbl_cross(), and add_p()
etsdata %>%
  select(Sex, Strand) %>%
  tbl_summary()
```
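
| Characteristic | N = 1,750<sup>1</sup> |
|:---------------|----------------------:|
| **Sex**        |                       |
|     Female     | 1,105 (63%)           |
|     Male       | 645 (37%)             |
| **Strand**     |                       |
|     Non-STEM   | 1,385 (79%)           |
|     STEM       | 365 (21%)             |

<sup>1</sup> n (%)
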
---

# Descriptive statistics: frequency tables

``` r
etsdata %>%
  select(Sex, Strand) %>%
  tbl_summary(by = Strand)
```
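
| Characteristic | Non-STEM, N = 1,385<sup>1</sup> | STEM, N = 365<sup>1</sup> |
|:---------------|--------------------------------:|--------------------------:|
| **Sex**        |                                 |                           |
|     Female     | 882 (64%)                       | 223 (61%)                 |
|     Male       | 503 (36%)                       | 142 (39%)                 |

<sup>1</sup> n (%)
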
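---

# Descriptive statistics: frequency tables

The column percentages in the Sex-by-Strand table can be checked by hand. The sketch below re-enters the printed counts as a matrix and applies base R's <tt>prop.table()</tt>. The object names `sex_by_strand` and `col_pct` are introduced here only for illustration, and this is a rough check, not the code that produced the table.

``` r
# Counts copied from the Sex-by-Strand table (rows: Sex, columns: Strand).
# matrix() fills column by column, so the first two values are the Non-STEM column.
sex_by_strand <- matrix(c(882, 503, 223, 142), nrow = 2,
                        dimnames = list(Sex = c("Female", "Male"),
                                        Strand = c("Non-STEM", "STEM")))

# margin = 2 divides every cell by its column total
col_pct <- round(100 * prop.table(sex_by_strand, margin = 2))
print(col_pct)
```

Setting `margin = 1` instead divides by the row totals, which corresponds to the `percent = "row"` option of <tt>tbl_cross()</tt>.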
---

# Descriptive statistics: frequency tables

``` r
etsdata %>%
  select(Sex, Strand) %>%
  tbl_cross(percent = "row") %>%
  bold_labels() %>%
  add_p(source_note = TRUE)
```
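
| **Sex**    | Strand: Non-STEM | Strand: STEM | Total        |
|:-----------|-----------------:|-------------:|-------------:|
|     Female | 882 (80%)        | 223 (20%)    | 1,105 (100%) |
|     Male   | 503 (78%)        | 142 (22%)    | 645 (100%)   |
| **Total**  | 1,385 (79%)      | 365 (21%)    | 1,750 (100%) |

Pearson’s Chi-squared test, p=0.4
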
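---

# Descriptive statistics: frequency tables

The chi-squared test reported under the cross-tabulation can be approximated from the printed counts alone. This is a minimal sketch: the object names are hypothetical, and `correct = FALSE` (no continuity correction) is an assumption about how the reported p-value was computed.

``` r
# Counts copied from the cross-tabulation (rows: Sex, columns: Strand)
sex_by_strand <- matrix(c(882, 503, 223, 142), nrow = 2,
                        dimnames = list(Sex = c("Female", "Male"),
                                        Strand = c("Non-STEM", "STEM")))

# Pearson's chi-squared test of independence between Sex and Strand
# (assumption: no Yates continuity correction)
chi <- chisq.test(sex_by_strand, correct = FALSE)
print(chi)

# Expected counts under independence, useful for checking the
# rule of thumb that every expected count should be at least 5
print(chi$expected)
```

The p-value should round to about 0.4, in line with the table, so there is no evidence that the choice of strand depends on sex.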
---

# Descriptive statistics: numerical summaries

``` r
etsdata %>%
  select(Age, Sex) %>%
  group_by(Sex) %>%
  drop_na(Age) %>%
  summarize(N = length(Age), Mean = mean(Age), Median = median(Age),
            SD = sd(Age), Min = min(Age), Max = max(Age))
```

```
## # A tibble: 2 × 7
##   Sex        N  Mean Median    SD   Min   Max
##   <chr>  <int> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Female  1104  20.6     20  1.75    18    40
## 2 Male     645  20.8     21  1.81    17    31
```

---

# Descriptive statistics: numerical summaries

``` r
etsdata %>%
  select(HHsize, Age, GWA) %>%
  drop_na(Age, GWA) %>%
  tbl_summary(by = HHsize, include = c(Age, GWA),
              statistic = list(all_continuous() ~ "{mean} ({sd})"))
```
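
| Characteristic | Large, N = 335<sup>1</sup> | Medium, N = 1,062<sup>1</sup> | Small, N = 174<sup>1</sup> |
|:---------------|---------------------------:|------------------------------:|---------------------------:|
| Age            | 20.83 (1.71)               | 20.52 (1.77)                  | 20.62 (1.55)               |
| GWA            | 89.6 (4.2)                 | 90.2 (4.1)                    | 90.0 (4.2)                 |

<sup>1</sup> Mean (SD)
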
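---

# Descriptive statistics: numerical summaries

The `group_by()` plus `summarize()` pattern used for the numerical summaries can be tried without the workshop data. The sketch below uses R's built-in `mtcars` data set as a stand-in (an assumption made purely so the example is self-contained); `n()` is dplyr's helper for counting rows per group and is an alternative to `length()`.

``` r
library(dplyr)

# Summary statistics of miles per gallon (mpg) by number of cylinders (cyl)
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarize(N = n(),              # number of rows in each group
            Mean = mean(mpg),
            Median = median(mpg),
            SD = sd(mpg),
            Min = min(mpg),
            Max = max(mpg))

print(mpg_summary)
```

If a variable contains missing values, either drop them first with `drop_na()` or pass `na.rm = TRUE` to `mean()`, `median()`, and `sd()`; otherwise the summaries are returned as `NA`.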
---

# Quick review of hypothesis testing

- **Testing hypotheses**: a procedure used to decide which of two competing hypotheses is better supported by the data observed in a random sample
- **Null hypothesis** (`\(H_0\)`): hypothesis indicating no "effect" (no change, no improvement, no correlation)
- **Alternative hypothesis** (`\(H_1\)`): researcher's hypothesis that indicates an "effect"
- **Test statistic**: a summary of the observed data in the random sample that is used as evidence *for* or *against* the null hypothesis
- **Level of significance**: the probability of wrongly rejecting a true null hypothesis (`\(\alpha = 0.01, \mathbf{0.05}\)`)
- **p-value**: the probability that the observed results (or more extreme results) would occur **IF** the null hypothesis were true
- Smaller p-values indicate *disagreement* between the observed data and the null hypothesis: **Reject `\(H_0\)` if p-value `\(\leq \alpha\)`**

---

# Statistical tests on means

- Tests on means of two independent groups
  - <tt>Student's t test</tt>: *normal distributions with equal variances*
  - <tt>Welch's t test</tt>: *normal distributions with unequal variances*
  - <tt>Mann-Whitney U test</tt>: *non-normal distributions*
- Tests on means of two matched/paired groups
  - <tt>Paired t test</tt>: *normally distributed pairwise differences*
  - <tt>Signed rank test</tt>: *non-normal pairwise differences*

---

# Statistical tests on means

``` r
etsdata %>%
  select(Strand, GWA) %>%
  drop_na(GWA) %>%
  tbl_summary(by = Strand, include = GWA,
              statistic = list(all_continuous() ~ "{mean} ({sd})"))
```
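
| Characteristic | Non-STEM, N = 1,229<sup>1</sup> | STEM, N = 342<sup>1</sup> |
|:---------------|--------------------------------:|--------------------------:|
| GWA            | 89.5 (4.1)                      | 92.1 (3.2)                |

<sup>1</sup> Mean (SD)
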
- `\(H_0: \mu_S = \mu_N\)`, where S = STEM and N = Non-STEM
- `\(H_1: \mu_S > \mu_N\)`

---

# Statistical tests on means

``` r
library(rstatix) # provides shapiro_test() and levene_test()

# Test of Normality
etsdata %>%
  select(Strand, GWA) %>%
  group_by(Strand) %>%
  shapiro_test(GWA)
```

```
## # A tibble: 2 × 4
##   Strand   variable statistic        p
##   <chr>    <chr>        <dbl>    <dbl>
## 1 Non-STEM GWA          0.965 1.06e-16
## 2 STEM     GWA          0.962 8.25e- 8
```

``` r
# Test of Equal Variance
etsdata %>%
  select(Strand, GWA) %>%
  levene_test(GWA ~ Strand)
```

```
## # A tibble: 1 × 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1     1  1569      12.8 0.000352
```

---

# Statistical tests on means

``` r
# alternative = "less": the first group (Non-STEM) tends to have lower GWA
# than the second group (STEM), which matches H1: mu_S > mu_N
etsdata %>%
  select(Strand, GWA) %>%
  drop_na(GWA) %>%
  wilcox.test(GWA ~ Strand, data = ., alternative = "less")
```

```
## 
## 	Wilcoxon rank sum test with continuity correction
## 
## data:  GWA by Strand
## W = 129227, p-value < 2.2e-16
## alternative hypothesis: true location shift is less than 0
```

---

# Statistical tests on means

``` r
# var.equal is not an argument of wilcox.test(), so no test.args are passed
etsdata %>%
  select(Strand, Age, GWA) %>%
  drop_na(Age, GWA) %>%
  tbl_summary(by = Strand, include = c(Age, GWA),
              statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
  add_p(test = list(all_continuous() ~ "wilcox.test"))
```
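
| Characteristic | Non-STEM, N = 1,229<sup>1</sup> | STEM, N = 342<sup>1</sup> | p-value<sup>2</sup> |
|:---------------|--------------------------------:|--------------------------:|--------------------:|
| Age            | 20.65 (1.85)                    | 20.41 (1.27)              | 0.11                |
| GWA            | 89.5 (4.1)                      | 92.1 (3.2)                | <0.001              |

<sup>1</sup> Mean (SD)

<sup>2</sup> Wilcoxon rank sum test
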
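---

# Statistical tests on means

The direction of a one-sided rank sum test depends on the order of the group levels, which is easy to get backwards. The sketch below illustrates this with simulated data only: the data frame `sim` is hypothetical, and its means and SDs are borrowed from the summary table merely to make the simulation look plausible.

``` r
set.seed(1234) # Fixes the seed so the simulated data are reproducible

# Simulated GWA scores; "Non-STEM" is deliberately the first factor level
sim <- data.frame(
  Strand = factor(rep(c("Non-STEM", "STEM"), each = 200),
                  levels = c("Non-STEM", "STEM")),
  GWA = c(rnorm(200, mean = 89.5, sd = 4.1),
          rnorm(200, mean = 92.1, sd = 3.2))
)

# "less" asks whether the FIRST level (Non-STEM) is shifted below the second (STEM)
res_less <- wilcox.test(GWA ~ Strand, data = sim, alternative = "less")

# "greater" asks the opposite question, so it should find no evidence here
res_greater <- wilcox.test(GWA ~ Strand, data = sim, alternative = "greater")

print(c(less = res_less$p.value, greater = res_greater$p.value))
```

With character group labels, R orders the levels alphabetically, so checking `levels(factor(x))` before choosing `alternative` is a cheap safeguard.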
---

# Statistical tests on means

``` r
library(ggstatsplot) # provides ggbetweenstats() and ggscatterstats()
etsdata %>%
  select(Strand, GWA) %>%
  ggbetweenstats(x = Strand, y = GWA, violin.args = list(width = 0),
                 type = "nonparametric", var.equal = TRUE,
                 bayes.args = list(width=0))
```

---

# Statistical tests on means

<img src="Day-1--Slide-presentation_files/figure-html/unnamed-chunk-39-1.png" width="65%" style="display: block; margin: auto;" />

---

# Statistical tests on means

.pull-left[

``` r
etsdata %>%
  select(HHsize, GWA) %>%
  ggbetweenstats(x = HHsize, y = GWA, violin.args = list(width = 0),
                 type = "nonparametric", var.equal = TRUE,
                 bayes.args = list(width=0))
```

]

.pull-right[
<!-- -->
]

---

# Correlation analysis: basic ideas

- Correlation analysis examines the linear relationship between two or more variables
- It is used to determine the strength and direction, as well as the statistical significance, of the correlation between variables
- The correlation between two variables could be positive or negative
  - Positive correlation: `\(X\uparrow\)` and `\(Y\uparrow\)` or `\(X\downarrow\)` and `\(Y\downarrow\)`
  - Negative correlation: `\(X\uparrow\)` and `\(Y\downarrow\)` or `\(X\downarrow\)` and `\(Y\uparrow\)`

---

# Correlation analysis: scatter plot

.pull-left[
- It is a chart of the x-values (X-axis) and y-values (Y-axis)
- It is a visual representation of the relationship between X and Y
- Also known as a *scatter diagram*
]

.pull-right[
<img src="Day-1--Slide-presentation_files/figure-html/unnamed-chunk-42-1.png" width="100%" />
]

---

# Correlation analysis: correlation coefficient

- measures the strength or magnitude of the correlation between the variables
  - **Pearson r**: both variables are measured on at least an interval scale; bivariate normal distribution
  - **Spearman rho**: both variables are measured on at least an ordinal scale; non-normal data
  - **Point-biserial**: one variable is binary and the other is interval or ratio
  - **Rank-biserial**: one variable is binary and the other is ordinal
- the value of a correlation coefficient ranges from -1 to +1

---

# Correlation analysis: correlation coefficient

- a zero correlation coefficient indicates that the variables are NOT LINEARLY related (a non-linear relationship may still exist)

<img src="pic9.png" width="55%" style="display: block; margin: auto;" />

---

# Correlation analysis: test of significance

- `\(H_0\)`: Correlation coefficient is equal to zero. (There is no linear relationship between the variables.) `\(\Longrightarrow H_0: \rho = 0\)`
- `\(H_1\)`: Correlation coefficient is not equal to zero. (There is a linear relationship between the variables.) `\(\Longrightarrow H_1: \rho \neq 0\)`
- Test statistic (for Pearson r, with `\(n-2\)` degrees of freedom):

$$
t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}
$$

- Reject `\(H_0\)` if the p-value associated with `\(t\)` is less than or equal to the significance level (`\(\alpha\)`)

---

# Correlation analysis

``` r
shapiro.test(College$Top10perc)
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  College$Top10perc
## W = 0.88742, p-value < 2.2e-16
```

``` r
shapiro.test(College$Grad.Rate)
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  College$Grad.Rate
## W = 0.9948, p-value = 0.009424
```

---

# Correlation analysis

``` r
# Both variables depart from normality, so Spearman's rho is used
cor.test(x = College$Top10perc, y = College$Grad.Rate, method = "spearman")
```

```
## 
## 	Spearman's rank correlation rho
## 
## data:  College$Top10perc and College$Grad.Rate
## S = 40500256, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4819798
```

---

# Correlation analysis

``` r
College %>%
  select(Top10perc, Grad.Rate) %>%
  ggscatterstats(x = Top10perc, y = Grad.Rate, type = "nonparametric",
                 bf.message = FALSE)
```

<img src="Day-1--Slide-presentation_files/figure-html/unnamed-chunk-46-1.png" width="55%" style="display: block; margin: auto;" />

---

# Correlation analysis: visualization

- <tt>corrplot()</tt> from the **corrplot** package
- <tt>ggpairs()</tt> from the **GGally** package
- <tt>ggcorr()</tt> also from the **GGally** package
- <tt>pairs.panels()</tt> from the **psych** package

---

# Correlation analysis

``` r
library(corrplot) # provides corrplot()
College %>%
  select(Top10perc, PhD, S.F.Ratio, Expend, Grad.Rate) %>%
  cor() %>%
  corrplot(type = "lower", tl.cex = .75, tl.col = "black")
```

<img src="Day-1--Slide-presentation_files/figure-html/unnamed-chunk-47-1.png" width="55%" style="display: block; margin: auto;" />

---

# Correlation analysis

``` r
library(GGally) # provides ggpairs()
College %>%
  select(Private, Top10perc, PhD, S.F.Ratio, Expend, Grad.Rate) %>%
  mutate(Private = recode(Private, "No" = "Public", "Yes" = "Private")) %>%
  ggpairs(columns = 2:6, aes(colour = Private))
```

<img src="Day-1--Slide-presentation_files/figure-html/unnamed-chunk-48-1.png" width="55%" style="display: block; margin: auto;" />
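
---

# Correlation analysis: test of significance

The test statistic `\(t = r\sqrt{n-2}/\sqrt{1-r^2}\)` can be computed by hand and compared with the output of <tt>cor.test()</tt>. The sketch below uses the built-in `mtcars` data set as a stand-in (an assumption made so the example runs without the workshop files) and applies to Pearson's r only; Spearman's rho, used in the College example, relies on a different statistic.

``` r
# Pearson correlation between fuel economy (mpg) and weight (wt)
r <- cor(mtcars$mpg, mtcars$wt)
n <- nrow(mtcars)

# Test statistic from the formula, with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

# Built-in test (method = "pearson" is the default)
ct <- cor.test(mtcars$mpg, mtcars$wt)

print(c(t_manual = t_manual, t_cor.test = unname(ct$statistic)))
print(c(p_manual = p_manual, p_cor.test = ct$p.value))
```

The two pairs of values should agree, which confirms that <tt>cor.test()</tt> carries out the t test described on the test of significance slide.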