October 8, 2020
install.packages('devtools')
install.packages('tidyverse')
install.packages(c('ggthemes', 'officer'))
library(tidyverse)
foo <- c(1,2,4)
foo %>% min()
## [1] 1
foo %>% mean()
## [1] 2.333333
foo %>% max()
## [1] 4
foo %>% sd()
## [1] 1.527525
Look up the documentation for the glm and min functions:
help(glm)
help(min)
cat_function <- function(love = TRUE){
  if(love == TRUE){
    print('I love cats!')
  } else {
    print('I am not a cool person.')
  }
}
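A short usage sketch for the function above, calling it once with its default argument and once with love = FALSE. The definition is repeated so the block runs on its own; the names default_result and other_result are illustrative only.

```r
# Definition repeated from above so this block is self-contained
cat_function <- function(love = TRUE) {
  if (love == TRUE) {
    print('I love cats!')
  } else {
    print('I am not a cool person.')
  }
}

# No argument supplied, so love takes its default value TRUE
default_result <- cat_function()

# Override the default by naming the argument
other_result <- cat_function(love = FALSE)
```

print() returns its argument invisibly, which is why the printed message can also be stored in an object.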
foo <- c(1,2,NA, 4)
foo %>% min()
## [1] NA
foo %>% mean()
## [1] NA
foo %>% max()
## [1] NA
foo %>% sd()
## [1] NA
foo <- c(1,2,NA, 4)
foo %>% min(na.rm = TRUE)
## [1] 1
foo %>% mean(na.rm = TRUE)
## [1] 2.333333
foo %>% max(na.rm = TRUE)
## [1] 4
foo %>% sd(na.rm = TRUE)
## [1] 1.527525
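Before reaching for na.rm = TRUE, it can help to know how many values are missing. The sketch below uses base R's is.na() on the same vector; the object names missing_flags, n_missing, and mean_without_na are made up for illustration.

```r
foo <- c(1, 2, NA, 4)

# is.na() returns TRUE for each missing element
missing_flags <- is.na(foo)

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(missing_flags)

# Dropping the NA by hand gives the same mean as na.rm = TRUE
mean_without_na <- mean(foo[!missing_flags])
```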
c('Washington', 'Oregon', 'Idaho') %>% class()
## [1] "character"
c('Washington', 'Oregon', 'Idaho') %>% is.character()
## [1] TRUE
c('Washington', 'Oregon', 'Idaho') %>% is.factor()
## [1] FALSE
c('Washington', 'Oregon', 'Idaho') %>% as.factor()
## [1] Washington Oregon Idaho
## Levels: Idaho Oregon Washington
c('Washington', 'Oregon', 'Idaho') %>% as.factor() %>% class()
## [1] "factor"
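The Levels line in the output above shows that as.factor() sorts the levels alphabetically rather than keeping the order of appearance. A minimal sketch of what that means for the stored values, using the base R functions levels() and as.integer(); the object names are illustrative.

```r
states <- c('Washington', 'Oregon', 'Idaho')
states_factor <- as.factor(states)

# Levels are sorted alphabetically, not in order of appearance
state_levels <- levels(states_factor)

# Each element is stored as an integer position in the levels
state_codes <- as.integer(states_factor)
```

Here 'Washington' is the third level, so its code is 3 even though it appears first in the vector.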
c('Washington', 'Oregon', 'Idaho')
## [1] "Washington" "Oregon" "Idaho"
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
rep(1:2, times = 2)
## [1] 1 2 1 2
seq(from = 0, to = 100, by = 10)
## [1] 0 10 20 30 40 50 60 70 80 90 100
c('Washington', 'Oregon', 'Idaho')[2]
## [1] "Oregon"
seq(from = 0, to = 100, by = 10)[6]
## [1] 50
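Square brackets accept more than a single position. The sketch below shows a few common variations in base R, assuming the same three-state vector as above; the object names are illustrative.

```r
states <- c('Washington', 'Oregon', 'Idaho')

# A range of positions returns several elements
first_two <- states[1:2]

# A negative position drops that element
without_second <- states[-2]

# length() gives the position of the last element
last_state <- states[length(states)]
```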
tibble(x = c(1:3), y = c(4:6), z = c('Washington', 'Oregon', 'Idaho'))
## # A tibble: 3 x 3
##       x     y z
##   <int> <int> <chr>
## 1     1     4 Washington
## 2     2     5 Oregon
## 3     3     6 Idaho
Use $ between the tibble name and the variable name to extract a variable as a vector
pull() is the function form of $
cars$speed %>% head(7)
## [1] 4 4 7 7 8 9 10
cars %>% pull(speed) %>% head(7)
## [1] 4 4 7 7 8 9 10
cars[,1] %>% head(7)
## [1] 4 4 7 7 8 9 10
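The three extraction forms above return the same vector for the built-in cars data frame. The sketch below checks that with identical(), and also shows one difference worth knowing: on a tibble, [, 1] keeps a one-column tibble instead of dropping to a vector. It assumes the tidyverse is installed; the object names are illustrative.

```r
library(tidyverse)

speed_dollar <- cars$speed
speed_pull   <- cars %>% pull(speed)
speed_index  <- cars[, 1]

# All three forms give the same numeric vector for a data.frame
all_same <- identical(speed_dollar, speed_pull) &&
  identical(speed_dollar, speed_index)

# On a tibble, [, 1] returns a one-column tibble rather than a vector
tibble_index <- as_tibble(cars)[, 1]
still_a_table <- is.data.frame(tibble_index)
```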
data.frame(x = c(1:3), y = c(4:6), z = c('Washington', 'Oregon', 'Idaho'))
##   x y          z
## 1 1 4 Washington
## 2 2 5     Oregon
## 3 3 6      Idaho
matrix(data = 1:6, nrow = 3, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
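The output above shows that matrix() fills values down the columns by default. A small sketch of the alternative, using the byrow argument of base R's matrix(); the object names are illustrative.

```r
# Default: values fill down each column in turn
by_column <- matrix(data = 1:6, nrow = 3, ncol = 2)

# byrow = TRUE fills across each row instead
by_row <- matrix(data = 1:6, nrow = 3, ncol = 2, byrow = TRUE)

# Index with [row, column]
second_row_first_col_default <- by_column[2, 1]
second_row_first_col_byrow   <- by_row[2, 1]
```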
You are an analyst at a large global health organization. Organizational leadership needs to report publicly on the Covid-19 data that you’ve collected. Over the next month, you will be asked to prepare findings that you observe in the data.
Exercise 1 - 5 minutes
- What type of data object is covid?
- What are the min() and mean() for est_infections_p100k?
- What is the range for date?
To answer the questions, import the Covid-19 dataset.
covid <- read_csv('https://rb.gy/lzlylj')
covid %>% class()
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
covid %>% pull(est_infections_p100k) %>% min(na.rm = TRUE)
## [1] 0.00000003176912
covid %>% pull(est_infections_p100k) %>% mean(na.rm = TRUE)
## [1] 51.07284
covid %>% pull(date) %>% min(na.rm = TRUE)
## [1] "2020-02-04"
covid %>% pull(date) %>% max(na.rm = TRUE)
## [1] "2021-01-01"
head() shows you the top subset of data in a tibble
The n argument of head() defaults to 6 rows
tail() shows the bottom subset of data in a tibble
summary() shows summary statistics on all variables in a dataset
ls() shows all variables in a tibble
Use ls() without an object between the parentheses to see all objects in your workspace
str() tells the variable type and selected variable values in a tibble for all variables
dim() tells you the dimensions of your dataset
nrow() reports the number of rows only
ncol() reports the number of columns only
Exercise 2 - 5 minutes
- How many observations (or rows) are in covid?
- What is the median() value for mobility_composite?
- How many variables are in covid?
Hint: There are multiple ways to answer these questions with the functions you know
covid %>% dim()
## [1] 165210 13
covid %>% pull(mobility_composite) %>% median(na.rm = TRUE)
## [1] -19.54311
covid %>% ncol()
## [1] 13
Other methods to answer questions
covid %>% nrow()
covid %>% summary()
covid %>% ls()
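The inspection functions above can also be tried without downloading anything. The sketch below runs them on the built-in cars data frame (50 rows, 2 columns) and uses base R only; the object names are illustrative.

```r
# dim() returns rows and columns together
dims <- dim(cars)

# nrow() and ncol() return each dimension on its own
n_rows <- nrow(cars)
n_cols <- ncol(cars)

# head() shows 6 rows unless n is changed
top <- head(cars)

# tail() takes the same n argument
bottom3 <- tail(cars, n = 3)
```

The first element of dim() matches nrow() and the second matches ncol(), which is why either approach answers the row and column questions.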
table() shows you the distribution of values in a vector
length() tells you the number of elements in a vector
unique() shows you the unique values in a vector
sort() orders values in a vector
summary() shows you descriptive statistics for a vector
summary() can run on a vector or tibble
You can chain together multiple functions
covid %>% pull(location) %>% unique() %>% length()
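To see what each step of a chain does, it can help to run it on a small vector first. The sketch below uses a made-up vector of locations (hypothetical data, not taken from the covid dataset) and assumes the tidyverse is installed; the object names are illustrative.

```r
library(tidyverse)

# Hypothetical toy data for illustration only
locations <- c('Yemen', 'Zambia', 'Yemen', 'Wyoming', 'Zambia', 'Yemen')

# Chained form: read left to right
n_unique <- locations %>% unique() %>% length()

# Equivalent nested form: read inside out
n_unique_nested <- length(unique(locations))

# Count each value, sort the counts ascending, keep the largest
counts <- locations %>% table() %>% sort()
most_common <- counts %>% tail(1) %>% names()
```

Both forms give the same result; the chained version simply lists the steps in the order they happen.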
Exercise 3 - 5 minutes
- How many ‘projected’ values are there in mobility_data_type?
- Which is the third to last value for location when values are listed alphabetically?
- Is ‘projected’ a value from the total_tests_data_type variable?
covid %>% pull(mobility_data_type) %>% table()
## .
##  observed projected
##    123887     39992
covid %>% pull(location) %>% table() %>% tail()
## .
##   Wyoming     Yemen   Yucatán Zacatecas    Zambia  Zimbabwe
##       540       538       538       538       538       538
covid %>% pull(total_tests_data_type) %>% unique()
## [1] "observed" NA
Make sure…
min() on a character string
<-
%>%
Exercise 4 - 5 minutes
This code contains 6 mistakes. The code is supposed to create three new data objects called covid_inf, mean_inf, and median_inf and report the mean and median confirmed infections.
covid_inf <- covid %.% pull(confirmed_infection)
mean_inf < mean(covid_inf na.rm = TRUE)
median_inf <- covid_inf %>% median(na.rm = TRUE
mean_inf
medean_inf
covid_inf <- covid %>% pull(confirmed_infections)
mean_inf <- mean(covid_inf, na.rm = TRUE)
median_inf <- covid_inf %>% median(na.rm = TRUE)
mean_inf
## [1] 1335.746
median_inf
## [1] 54
You wear many hats and are also a crime analyst at a think tank. For a report being prepared by the think tank, you need to analyze crime data.
Exercise 5 - 7 minutes
- What is the most recent occurred_date?
- Which neighborhood sees the most incident activity?
- What are the earliest and latest reported_year values?
Begin the exercise by importing the crime dataset in RStudio
crime <- read_csv('https://rb.gy/5zuayh')
crime %>% pull(occurred_date) %>% max(na.rm = TRUE)
## [1] "2020-09-25"
crime %>% pull(neighborhood) %>% table() %>% sort() %>% tail()
## .
##          UNIVERSITY         SLU/CASCADE          QUEEN ANNE           NORTHGATE
##                5458                5970                6792                7614
##        CAPITOL HILL DOWNTOWN COMMERCIAL
##                8144               11692
crime %>% pull(reported_year) %>% min(na.rm = TRUE)
## [1] 2008
crime %>% pull(reported_year) %>% max(na.rm = TRUE)
## [1] 2020