suppressPackageStartupMessages(library('BBmisc'))

pkgs <- c('knitr', 'kableExtra', 'devtools', 'lubridate', 'data.table', 'tidyquant', 'stringr', 'magrittr', 'tidyverse', 'plyr', 'dplyr', 'broom', 'highcharter', 'formattable', 'DT', 'httr', 'openxlsx')
suppressAll(lib(pkgs))  # load all listed packages quietly


options(warn=-1)
rm(pkgs)

1 Introduction

1.1 About this Course

This course provides a rigorous introduction to the R programming language, with a particular focus on using R for software development in a data science setting. Whether you are part of a data science team or working individually within a community of developers, this course will give you the knowledge of R needed to make useful contributions in those settings. As the first course in the Specialization, the course provides the essential foundation of R needed for the following courses. We cover basic R concepts and language fundamentals, key concepts like tidy data and related “tidyverse” tools, processing and manipulation of complex and large datasets, handling textual data, and basic data science tasks. Upon completing this course, learners will have fluency at the R console and will be able to create tidy datasets from a wide range of possible data sources.

Kindly refer to the manual Mastering Software Development in R (web-based), or Mastering Software Development in R.pdf.

1.2 Syllabus

  • Week 1 : Basic R Language
      • Welcome
      • Crash Course on R Syntax
      • The Importance of Tidy Data
      • Reading Tabular Data with the readr Package
      • Reading Web-Based Data
      • Assignment :
          1. Swirl: R Basics - Automatic Submission
          2. Swirl: R Basics - Manual Submission
  • Week 2 : Data Manipulation
      • Basic Data Manipulation
      • Working with Dates, Times, Time Zones
      • Assignment :
          1. Swirl: Data Manipulation - Automatic Submission
          2. Swirl: Data Manipulation - Manual Submission
  • Week 3 : Text Processing, Regular Expressions, & Physical Memory
      • Text Processing and Regular Expressions
      • The Role of Physical Memory
      • Assignment :
          1. Swirl: Text and Regular Expressions - Automatic Submission
          2. Swirl: Text and Regular Expressions - Manual Submission
  • Week 4 : Large Datasets
      • Working with Large Datasets
      • Diagnosing Problems
      • Data Manipulation and Summary

Dynamic Documents for R using R Markdown introduces some useful functions and packages for R users.

Kindly refer to the Mastering Software Development in R Specialization for an overview of all the courses in the R specialization.

2 The Swirl Course

2.1 The Swirl Course Network

swirl::install_course("The R Programming Environment")

# Tokens recorded for each lesson in the course:
# 1: Setting Up Swirl <- 'uYaZJxzcqIQ1y6cKCB4e'
# 2: Basic Building Blocks <- 'htv8enScWY5qMnuWwNXV'
# 3: Sequences of Numbers <- 'H8yEkfT49PAQ1qroHCRn'
# 4: Vectors <- 'nwXVEoJ1smaLlEHl0DIm'
# 5: Missing Values <- 'nMVLxE9f6vDEIjixH7ZQ'
# 6: Subsetting Vectors <- 'fBPFjy5E34DWbK51f0qh'
# 7: Matrices and Data Frames <- 'qGozNYaCJ5f3TdYEKVbn'
# 8: Logic <- 'r4PQMCG9k2a3RPsIJfHy'
# 9: Workspace and Files <- 'F1BHKCEnAeirZMc0YCSq'
#10: Reading Tabular Data <- 'B8Y4eKmHny5uivnpYJ05'
#11: Looking at Data <- 'katPyvJxaPMD1q8Ug6EF'
#12: Data Manipulation <- 'NopXxlF40CchYURJEbVg'
#13: Text Manipulation Functions <- 'uv2hFE4U3wjdMRfHVGY0'
#14: Regular Expressions <- 'Djrf113sbiWxZTfg7Vek'
#15: The stringr Package <- 'mixvNacXgbIdBnzauoNt'
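
Once the course is installed, start swirl from the R console and pick the course and lesson interactively; a minimal sketch:

library(swirl)
swirl()  # follow the prompts and select "The R Programming Environment"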

2.2 Lesson

Some of the datasets described in the readings are not easily accessible, so we have made them available for download in this reading. Attached to this reading is a zip file ("datasets.zip") that contains some of the data referred to in the readings of this course.

# ---------------- eval=FALSE ------------------------
lnk <- 'https://d3c33hcgiwev3.cloudfront.net/_c51e36d1897263b6cfadc7de150f05a1_datasets.zip?Expires=1538697600&Signature=R63JeOjOgKMZuQZh6aXKWTLIHXPNJoE--Zxem6jL6XoNjWGGhXgdX9HDPzRUUNbnlrDTxnrzHehRZZEzv2B5hP2ZlwJljDocWHIlu7mCr09VtF~2Yz4Lyvl7mr9DaFyVdKTBmyxN6KXYtAj9MGZTyNPvixx41XgWc-s2Y-CLEcU_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A'

download.file(lnk, destfile = 'data/datasets.zip')
unzip(zipfile = 'data/datasets.zip', exdir = './data')

lnk2 <- 'https://d3c33hcgiwev3.cloudfront.net/_f018d9fe5547b1a722ce260af0fa71af_quiz_data.zip?Expires=1538784000&Signature=LPxLR-yO3YS-caxtkS7lE5nOM4QQiRWMTiVVvaJZ8jhIY6ydzqHe7UmUcgCj-GiYcxCMEVBHdjRHX4eaa-96fbJSLB0AFYPUZxbRJyBS7SN7c3CHLnEvGyvmvj5qayqb8cib~yvxUhYlj1qdp5Xo874U-u6J0VdqZFfC69xSayo_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A'
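
# Presumably downloaded and extracted the same way (the destination path here is hypothetical):
download.file(lnk2, destfile = 'data/quiz_data.zip')
unzip(zipfile = 'data/quiz_data.zip', exdir = './data')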

ext_tracks_file <- paste0('http://rammb.cira.colostate.edu/research/', 
                          'tropical_cyclones/tc_extended_best_track_dataset/', 
                          'data/ebtrk_atlc_1988_2015.txt')

zika_file <- paste0('https://raw.githubusercontent.com/cdcepi/zika/master/', 
                    'Brazil/COES_Microcephaly/data/COES_Microcephaly-2016-06-25.csv')

library(httr)
library(readr)  # read_csv()
library(dplyr)  # %>% and slice()
meso_url <- 'https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py/'
denver <- GET(url = meso_url, 
              query = list(station = 'DEN', 
                           data = 'sped', 
                           year1 = '2016', 
                           month1 = '6', 
                           day1 = '1', 
                           year2 = '2016', 
                           month2 = '6', 
                           day2 = '30', 
                           tz = 'America/Denver', 
                           format = 'comma')) %>% 
  content() %>% 
  read_csv(skip = 5, na = 'M')

denver %>% slice(1:3)
# A tibble: 3 × 3
#      station               valid  sped
#      <chr>                <dttm> <dbl>
#1     DEN     2016-06-01 00:00:00   9.2
#2     DEN     2016-06-01 00:05:00   9.2
#3     DEN     2016-06-01 00:10:00   6.9
readr function   Use
read_csv         Reads comma-separated file
read_csv2        Reads semicolon-separated file
read_tsv         Reads tab-separated file
read_delim       General function for reading delimited files
read_fwf         Reads fixed width files
read_log         Reads log files
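
Each of these is a convenience wrapper around the general reader; for example, read_csv2() behaves roughly like read_delim() with a semicolon delimiter (a sketch; the file path is hypothetical):

library(readr)
read_delim('data/example.csv', delim = ';')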
Operator   Meaning                    Example
==         Equals                     storm_name == KATRINA
!=         Does not equal             min_pressure != 0
>          Greater than               latitude > 25
>=         Greater than or equal to   max_wind >= 160
<          Less than                  min_pressure < 900
<=         Less than or equal to      distance_to_land <= 0
%in%       Included in                storm_name %in% c("KATRINA", "ANDREW")
is.na()    Is a missing value         is.na(radius_34_ne)
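
These operators are typically used inside dplyr::filter() to subset rows; a minimal sketch, assuming the extended best track data from ext_tracks_file above has been read into ext_tracks with storm_name and max_wind columns:

library(dplyr)
ext_tracks %>% 
  filter(storm_name == "KATRINA" & max_wind >= 160)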
# The mutate function in dplyr can be used to add new columns to a data frame or
# change existing columns in the data frame. As an example, I'll use the worldcup
# dataset from the faraway package, which contains statistics from the 2010 World
# Cup. To load this example data frame, you can run:
library(faraway)
data(worldcup)
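
For instance, mutate() can add a derived column while leaving the rest of the data untouched; a small sketch (the new column name is arbitrary):

library(dplyr)
worldcup %>% 
  mutate(shots_per_minute = Shots / Time) %>% 
  head(3)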

library(tidyr)
library(dplyr)    # select()
library(ggplot2)
worldcup %>%
  select(Position, Time, Shots, Tackles, Saves) %>% 
  gather(Type, Number, -Position, -Time) %>%
  ggplot(aes(x = Time, y = Number)) + 
  geom_point() + 
  facet_grid(Type ~ Position)

library(knitr)

# Summarize the data to create the summary statistics you want
wc_table <- worldcup %>% 
  filter(Team %in% c("Spain", "Netherlands", "Uruguay", "Germany")) %>%
  select(Team, Position, Passes) %>%
  group_by(Team, Position) %>%
  summarize(ave_passes = mean(Passes),
            min_passes = min(Passes),
            max_passes = max(Passes),
            pass_summary = paste0(round(ave_passes), " (", 
                                  min_passes, ", ",
                                  max_passes, ")")) %>%
  select(Team, Position, pass_summary)
# What the data looks like before using `spread`
wc_table

Source: local data frame [16 x 3]
Groups: Team [4]

          Team   Position   pass_summary
        <fctr>     <fctr>          <chr>
1      Germany   Defender  190 (44, 360)
2      Germany    Forward    90 (5, 217)
3      Germany Goalkeeper    99 (99, 99)
4      Germany Midfielder   177 (6, 423)
5  Netherlands   Defender  182 (30, 271)
6  Netherlands    Forward   97 (12, 248)
7  Netherlands Goalkeeper 149 (149, 149)
8  Netherlands Midfielder  170 (22, 307)
9        Spain   Defender   213 (1, 402)
10       Spain    Forward   77 (12, 169)
11       Spain Goalkeeper    67 (67, 67)
12       Spain Midfielder  212 (16, 563)
13     Uruguay   Defender   83 (22, 141)
14     Uruguay    Forward   100 (5, 202)
15     Uruguay Goalkeeper    75 (75, 75)
16     Uruguay Midfielder   100 (1, 252)

# Use spread to create a prettier format for a table
wc_table %>%
  spread(Position, pass_summary) %>%
  kable()
Team          Defender        Forward        Goalkeeper       Midfielder
Germany       190 (44, 360)   90 (5, 217)    99 (99, 99)      177 (6, 423)
Netherlands   182 (30, 271)   97 (12, 248)   149 (149, 149)   170 (22, 307)
Spain         213 (1, 402)    77 (12, 169)   67 (67, 67)      212 (16, 563)
Uruguay       83 (22, 141)    100 (5, 202)   75 (75, 75)      100 (1, 252)
team_standings <- read_csv('data/team_standings.csv')
left_join(worldcup, team_standings, by = 'Team')

worldcup %>% 
  mutate(Name = rownames(worldcup),
         Team = as.character(Team)) %>%
  select(Name, Position, Shots, Team) %>%
  arrange(desc(Shots)) %>%
  slice(1:5) %>%
  left_join(team_standings, by = "Team") %>% # Merge in team standings
  rename("Team Standing" = Standing) %>%
  kable()
Name     Position   Shots   Team        Team Standing
Gyan     Forward    27      Ghana       7
Villa    Forward    22      Spain       1
Messi    Forward    21      Argentina   5
Suarez   Forward    19      Uruguay     4
Forlan   Forward    18      Uruguay     4

2.3 Quiz

## https://jeevanyue.github.io/post/2018-01-08-read_data_in_r/
if(!file.exists('data/daily_SPEC_2014.csv')) unzip(zipfile = 'data/daily_SPEC_2014.zip', exdir = './data')

pth <- ifelse(dir.exists('The R Programming Environment'), 
              file.path('The R Programming Environment', 'data'), 
              file.path('data'))
lf <- list.files(pth, pattern = '\\.xlsx$|\\.csv$')

# Read each file with the appropriate reader: fread() for csv, read.xlsx() for xlsx
Qz <- llply(lf, function(x) {
    if(grepl('\\.csv$', x)) {
        fread(paste0(pth, '/', x)) %>% tbl_df
    } else {
        read.xlsx(paste0(pth, '/', x)) %>% tbl_df
    }
  })
names(Qz) <- lf %>% str_replace_all('\\.[a-z]{3,4}$', '')

Qzdf <- Qz$`daily_SPEC_2014` %>% 
  ddply(.(`State Name`, `Parameter Name`), summarize, 
        AM = mean(`Arithmetic Mean`, na.rm = TRUE)) %>% 
  tbl_df

file.remove('data/daily_SPEC_2014.csv')
## [1] TRUE

2.3.1 Q1

## Question 1
Q1 <- Qzdf %>% 
  dplyr::filter(`State Name` == 'Wisconsin' & 
                `Parameter Name` == 'Bromine PM2.5 LC')
Q1
## # A tibble: 1 x 3
##   `State Name` `Parameter Name`      AM
##   <chr>        <chr>              <dbl>
## 1 Wisconsin    Bromine PM2.5 LC 0.00396

2.3.2 Q2

## Question 2
Q2 <- Qzdf %>% 
  dplyr::filter(`Parameter Name` == 'EC2 PM2.5 LC'|
                `Parameter Name` == 'Sodium PM2.5 LC'|
                `Parameter Name` == 'Sulfur PM2.5 LC'|
                `Parameter Name` == 'OC CSN Unadjusted PM2.5 LC TOT') %>% 
  tbl_df %>% 
  arrange(desc(AM))
Q2
## # A tibble: 157 x 3
##    `State Name`         `Parameter Name`                    AM
##    <chr>                <chr>                            <dbl>
##  1 District Of Columbia OC CSN Unadjusted PM2.5 LC TOT 411.   
##  2 Michigan             OC CSN Unadjusted PM2.5 LC TOT   3.44 
##  3 Illinois             OC CSN Unadjusted PM2.5 LC TOT   2.46 
##  4 Missouri             OC CSN Unadjusted PM2.5 LC TOT   2.41 
##  5 Maine                OC CSN Unadjusted PM2.5 LC TOT   2.27 
##  6 Texas                OC CSN Unadjusted PM2.5 LC TOT   2.18 
##  7 Nevada               OC CSN Unadjusted PM2.5 LC TOT   1.31 
##  8 Ohio                 Sulfur PM2.5 LC                  0.811
##  9 Kentucky             Sulfur PM2.5 LC                  0.799
## 10 Indiana              Sulfur PM2.5 LC                  0.790
## # ... with 147 more rows

2.3.3 Q3

## Question 3
Q3 <- Qz$`daily_SPEC_2014` %>% 
    ddply(.(`State Code`, `County Code`, `Site Num`, `Parameter Name`), 
          summarize, AM = mean(`Arithmetic Mean`, na.rm = TRUE)) %>% 
    tbl_df
Q3 %<>% dplyr::filter(`Parameter Name` == 'Sulfate PM2.5 LC') %>% 
  tbl_df %>% 
  arrange(desc(AM))
Q3
## # A tibble: 358 x 5
##    `State Code` `County Code` `Site Num` `Parameter Name`    AM
##           <int>         <int>      <int> <chr>            <dbl>
##  1           39            81         17 Sulfate PM2.5 LC  3.18
##  2           42             3         64 Sulfate PM2.5 LC  3.06
##  3           54            39       1005 Sulfate PM2.5 LC  2.94
##  4           18            19          6 Sulfate PM2.5 LC  2.74
##  5           39           153         23 Sulfate PM2.5 LC  2.71
##  6           39            35         60 Sulfate PM2.5 LC  2.64
##  7           39            87         12 Sulfate PM2.5 LC  2.64
##  8           54            51       1002 Sulfate PM2.5 LC  2.62
##  9           21           111         67 Sulfate PM2.5 LC  2.55
## 10           18            37       2001 Sulfate PM2.5 LC  2.52
## # ... with 348 more rows

2.3.4 Q4

## Question 4
Q4 <- Qz$`daily_SPEC_2014` %>% 
    ddply(.(`State Name`, `Parameter Name`), 
          summarize, AM = mean(`Arithmetic Mean`, na.rm = TRUE)) %>% 
    tbl_df
Q4 %<>% dplyr::filter(
  (`State Name` == 'California' | `State Name` == 'Arizona') & 
  `Parameter Name` == 'EC PM2.5 LC TOR') %>% 
  tbl_df %>% 
  arrange(desc(AM))
Q4
## # A tibble: 2 x 3
##   `State Name` `Parameter Name`    AM
##   <chr>        <chr>            <dbl>
## 1 California   EC PM2.5 LC TOR  0.198
## 2 Arizona      EC PM2.5 LC TOR  0.179
## difference between the highest and the second-highest
Q4$AM[1] - Q4$AM[2]
## [1] 0.01856696

2.3.5 Q5

## Question 5
Q5 <- Qz$`daily_SPEC_2014` %>% dplyr::filter(
    Longitude < -100 & `Parameter Name` == 'OC PM2.5 LC TOR') %>% 
    tbl_df

median(Q5$`Arithmetic Mean`, na.rm = TRUE)
## [1] 0.43

2.3.6 Q6

## Question 6
Q6 <- Qz$`aqs_sites`
Q6 %<>% dplyr::count(Land.Use, Location.Setting) %>% 
  dplyr::filter(Land.Use == 'RESIDENTIAL' & Location.Setting == 'SUBURBAN')
Q6
## # A tibble: 1 x 3
##   Land.Use    Location.Setting     n
##   <chr>       <chr>            <int>
## 1 RESIDENTIAL SUBURBAN          3527

2.3.7 Q7

## Question 7
Qz7 <- left_join(Qz$`aqs_sites`, Qz$`daily_SPEC_2014`)
## Joining, by = c("Latitude", "Longitude", "Datum", "Address")
Q7 <- Qz7 %>% 
    dplyr::filter(Longitude >= -100 & `Parameter Name` == 'EC PM2.5 LC TOR' & 
                  Land.Use == 'RESIDENTIAL' & Location.Setting == 'SUBURBAN') %>% 
    .$`Arithmetic Mean` %>% median(na.rm = TRUE)
Q7
## [1] 0.61

2.3.8 Q8

## Question 8
Q8 <- Qz7 %>% 
    dplyr::filter(`Parameter Name` == 'Sulfate PM2.5 LC' & 
                  Land.Use == 'COMMERCIAL')
Q8 %<>% mutate(month = lubridate::month(ymd(`Date Local`))) %>% 
    select(month, `Arithmetic Mean`) %>% 
    ddply(.(month), summarize, `Arithmetic Mean` = mean(`Arithmetic Mean`, na.rm=TRUE)) %>% 
    arrange(desc(`Arithmetic Mean`))
Q8
##    month Arithmetic Mean
## 1      2        2.021325
## 2      3        1.805260
## 3      7        1.777605
## 4      8        1.761226
## 5      6        1.750571
## 6      9        1.645010
## 7      4        1.567614
## 8      5        1.558096
## 9     12        1.537649
## 10     1        1.316738
## 11    10        1.313770
## 12    11        1.295837

2.3.9 Q9

## Question 9
Q9 <- Qz7 %>% 
    dplyr::filter((`Parameter Name` == 'Sulfate PM2.5 LC' | 
                   `Parameter Name` == 'Total Nitrate PM2.5 LC') & 
                   State.Code == 6 & County.Code == 65 & 
                   Site.Number == 8001)

Q9 %<>% mutate(`Date Local` = ymd(`Date Local`)) %>% 
    select(`Date Local`, `Arithmetic Mean`) %>% 
    ddply(.(`Date Local`), summarize, 
          `Arithmetic Mean` = sum(`Arithmetic Mean`, na.rm=TRUE)) %>% 
    dplyr::filter(`Arithmetic Mean` > 10) %>% tbl_df
Q9 #nrow(Q9) = 37
## # A tibble: 37 x 2
##    `Date Local` `Arithmetic Mean`
##    <date>                   <dbl>
##  1 2014-01-11                40.8
##  2 2014-01-29                51.5
##  3 2014-02-10                23.6
##  4 2014-02-16                19.8
##  5 2014-02-19                17.7
##  6 2014-02-22                12.4
##  7 2014-02-25                16.9
##  8 2014-03-06                34.1
##  9 2014-03-24                30.2
## 10 2014-04-05                11.6
## # ... with 27 more rows

2.3.10 Q10

## Question 10
Q10 <- Qz7 %>% 
    dplyr::filter(`Parameter Name` == 'Sulfate PM2.5 LC' | 
                   `Parameter Name` == 'Total Nitrate PM2.5 LC')

Q10 %<>% ddply(.(State.Code, County.Code, Site.Number, `Parameter Name`), 
               summarize, 
               `Arithmetic Mean` = mean(`Arithmetic Mean`, na.rm=TRUE)) %>% 
    tbl_df
Q10 %<>% spread(`Parameter Name`, `Arithmetic Mean`)

## To compare all available correlation methods and 'use' options, I compute every combination.
use <- c('everything', 'all.obs', 'complete.obs', 'na.or.complete', 'pairwise.complete.obs')
method <- c('pearson', 'kendall', 'spearman')

corr <- llply(use, function(x) {
  llply(method, function(y) {
    Q10 %>% mutate(
      use = x, method = y, 
      corr = tryCatch(cor(`Sulfate PM2.5 LC`, `Total Nitrate PM2.5 LC`, 
                          use = x, method = y), error=function(e) NA))
    }) %>% bind_rows
  }) %>% bind_rows
corr
## # A tibble: 5,355 x 8
##    State.Code County.Code Site.Number `Sulfate PM2.5 ~ `Total Nitrate ~
##    <chr>            <dbl>       <dbl>            <dbl>            <dbl>
##  1 1                   73          23             2.06            0.644
##  2 1                   73        2003             2.36            0.607
##  3 1                   79           2             1.87            0.557
##  4 1                   89          14             1.98            0.911
##  5 1                  101        1002             1.94            0.484
##  6 1                  113           1             1.87            0.543
##  7 10                   1           3             2.03            2.78 
##  8 10                   3        2004             1.96            1.56 
##  9 11                   1          42             1.85            1.05 
## 10 11                   1          43             1.94            1.34 
## # ... with 5,345 more rows, and 3 more variables: use <chr>, method <chr>,
## #   corr <dbl>
## Collect the unique correlation results across the options.
sumr <- corr %>% na.omit %>% 
    dplyr::select(use, method, corr) %>% unique
sumr
## # A tibble: 9 x 3
##   use                   method    corr
##   <chr>                 <chr>    <dbl>
## 1 complete.obs          pearson  0.526
## 2 complete.obs          kendall  0.506
## 3 complete.obs          spearman 0.685
## 4 na.or.complete        pearson  0.526
## 5 na.or.complete        kendall  0.506
## 6 na.or.complete        spearman 0.685
## 7 pairwise.complete.obs pearson  0.526
## 8 pairwise.complete.obs kendall  0.506
## 9 pairwise.complete.obs spearman 0.685
## Here I compare the correlation methods.
sapply(method, function(x) {
    xx <- na.omit(corr)
    suppressWarnings(cor.test(
        xx$`Sulfate PM2.5 LC`, 
        xx$`Total Nitrate PM2.5 LC`, 
        method = x))
})
## $pearson
## 
##  Pearson's product-moment correlation
## 
## data:  xx$`Sulfate PM2.5 LC` and xx$`Total Nitrate PM2.5 LC`
## t = 34.717, df = 3148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5004545 0.5509808
## sample estimates:
##       cor 
## 0.5261819 
## 
## 
## $kendall
## 
##  Kendall's rank correlation tau
## 
## data:  xx$`Sulfate PM2.5 LC` and xx$`Total Nitrate PM2.5 LC`
## z = 42.48, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.5061482 
## 
## 
## $spearman
## 
##  Spearman's rank correlation rho
## 
## data:  xx$`Sulfate PM2.5 LC` and xx$`Total Nitrate PM2.5 LC`
## S = 1642600000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6846733
## Tidy the test results with broom
llply(method, function(x) {
    xx <- na.omit(corr)
    suppressWarnings(cor.test(
        xx$`Sulfate PM2.5 LC`, 
        xx$`Total Nitrate PM2.5 LC`, 
        method = x)) %>% broom::tidy()
}) %>% bind_rows
## # A tibble: 3 x 8
##   estimate statistic   p.value parameter conf.low conf.high method
##      <dbl>     <dbl>     <dbl>     <int>    <dbl>     <dbl> <chr> 
## 1    0.526    3.47e1 6.93e-224      3148    0.500     0.551 Pears~
## 2    0.506    4.25e1 0.               NA   NA        NA     Kenda~
## 3    0.685    1.64e9 0.               NA   NA        NA     Spear~
## # ... with 1 more variable: alternative <chr>
Q10B <- corr %>% na.omit %>% 
  dplyr::select(-`Sulfate PM2.5 LC`, -`Total Nitrate PM2.5 LC`) %>% 
  tbl_df %>% unique %>% 
  dplyr::filter(State.Code %in% c(2,5,16,42) & 
                County.Code %in% c(37,45,90,113) & 
                Site.Number %in% c(2,3,35)) %>% 
  arrange(desc(corr))
Q10B %>% ddply(.(method), head, 1)  # top correlation per method
##   State.Code County.Code Site.Number                   use   method
## 1         16          37           2          complete.obs  kendall
## 2         16          37           2 pairwise.complete.obs  pearson
## 3         16          37           2          complete.obs spearman
##        corr
## 1 0.5061482
## 2 0.5261819
## 3 0.6846733

Pearson r correlation: Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two stocks are related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The point-biserial correlation is conducted with the Pearson correlation formula except that one of the variables is dichotomous. The following formula is used to calculate the Pearson r correlation:

r = \frac{N \sum xy - \sum x \sum y}{\sqrt{\left[ N \sum x^2 - (\sum x)^2 \right] \left[ N \sum y^2 - (\sum y)^2 \right]}}

where:
  • r = Pearson r correlation coefficient
  • N = number of observations
  • ∑xy = sum of the products of paired scores
  • ∑x = sum of x scores
  • ∑y = sum of y scores
  • ∑x² = sum of squared x scores
  • ∑y² = sum of squared y scores
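
The formula can be checked directly against R's built-in cor(); a small sketch with made-up numbers:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
N <- length(x)
r_manual <- (N * sum(x * y) - sum(x) * sum(y)) /
  sqrt((N * sum(x^2) - sum(x)^2) * (N * sum(y^2) - sum(y)^2))
all.equal(r_manual, cor(x, y))  # TRUE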

Types of research questions a Pearson correlation can examine:

  • Is there a statistically significant relationship between age, as measured in years, and height, measured in inches?
  • Is there a relationship between temperature, measured in degrees Fahrenheit, and ice cream sales, measured by income?
  • Is there a relationship between job satisfaction, as measured by the JSS, and income, measured in dollars?

Kendall rank correlation: Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. If we consider two samples, a and b, where each sample size is n, we know that the total number of pairings of a with b is n(n-1)/2. The following formula is used to calculate the value of Kendall rank correlation:

\tau = \frac{N_c - N_d}{\frac{1}{2} n (n - 1)}

where:
  • Nc = number of concordant pairs
  • Nd = number of discordant pairs
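
This too can be verified against cor() by counting concordant and discordant pairs; a sketch for tie-free data:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
idx <- combn(length(x), 2)  # all pairs of observations
s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])
tau_manual <- sum(s) / choose(length(x), 2)  # (Nc - Nd) / (n(n-1)/2)
all.equal(tau_manual, cor(x, y, method = 'kendall'))  # TRUE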

Spearman rank correlation: Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.

The following formula is used to calculate the Spearman rank correlation:

\rho = 1 - \frac{6 \sum d_i^2}{n (n^2 - 1)}

where:
  • ρ = Spearman rank correlation coefficient
  • dᵢ = the difference between the ranks of corresponding variables
  • n = number of observations
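
For tie-free data the rank-difference formula agrees with cor(); a small sketch:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
d <- rank(x) - rank(y)
n <- length(x)
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
all.equal(rho_manual, cor(x, y, method = 'spearman'))  # TRUE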

Types of research questions a Spearman Correlation can examine:

  • Is there a statistically significant relationship between participants’ level of education (high school, bachelor’s, or graduate degree) and their starting salary?
  • Is there a statistically significant relationship between a horse’s finishing position in a race and the horse’s age?

  • Which correlation coefficient is better to use, Spearman or Pearson? The Pearson correlation coefficient is the most widely used. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between the variables is not linear, it may be more appropriate to use the Spearman rank correlation method.
  • A comparison of the Pearson and Spearman correlation methods:

Pearson product moment correlation

The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable.

For example, you might use a Pearson correlation to evaluate whether increases in temperature at your production facility are associated with decreasing thickness of your chocolate coating.

Spearman rank-order correlation

The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.

Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed.

It is always a good idea to examine the relationship between variables with a scatterplot. Correlation coefficients only measure linear (Pearson) or monotonic (Spearman) relationships. Other relationships are possible.
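
The difference is easy to see with simulated data where the relationship is monotonic but not linear; a minimal sketch:

set.seed(1)
x <- rexp(100)
y <- exp(x) + rnorm(100, sd = 0.5)  # monotonic, but non-linear in x
cor(x, y, method = 'pearson')   # linear association
cor(x, y, method = 'spearman')  # monotonic association (typically higher here)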



3 Conclusion

Final Score : 95/100

4 Appendix

4.1 Documenting File Creation

It’s useful to record some information about how your file was created.

  • File creation date: 2018-10-03
  • File latest updated date: 2018-10-19
  • R version 3.5.1 (2018-07-02)
  • R version (short form): 3.5.1
  • rmarkdown package version: 1.10
  • File version: 1.0.1
  • Author Profile: ®γσ, Eng Lian Hu
  • GitHub: Source Code
  • Additional session information:

Category       session_info                     Category          Sys.info
version        R version 3.5.1 (2018-07-02)     sysname           Windows
system         x86_64, mingw32                  release           10 x64
ui             RTerm                            version           build 17134
language       en                               nodename          RSTUDIO-SCIBROK
collate        Japanese_Japan.932               machine           x86-64
tz             Asia/Tokyo                       login             scibr
date           2018-10-19                       user              scibr
Current time   2018-10-19 20:28:45 JST          effective_user    scibr