October 6, 2018
install.packages('devtools')
install.packages('tidyverse')
install.packages(c('ggthemes', 'rmarkdown'))
library(tidyverse)
foo <- c(1, 2, 4)
foo %>% min()
## [1] 1
foo %>% mean()
## [1] 2.333333
foo %>% max()
## [1] 4
foo %>% sd()
## [1] 1.527525
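The pipe passes the value on its left as the first argument of the function on its right, so each piped call has an equivalent nested form. A minimal sketch, assuming the magrittr package (installed along with the tidyverse) is available; the names piped, nested and chained are only illustrative:

```r
library(magrittr)  # provides %>%; library(tidyverse) attaches it as well

foo <- c(1, 2, 4)

# A piped call and a nested call return the same value
piped  <- foo %>% mean()
nested <- mean(foo)
identical(piped, nested)

# Pipes can be chained: take square roots first, then sum them
chained <- foo %>% sqrt() %>% sum()
identical(chained, sum(sqrt(foo)))
```

Reading left to right, a chained pipe lists the steps in the order they run, which is the main reason to prefer it over deeply nested calls.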
Look up the help pages for the lm and min functions
help(lm)
help(min)
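When only the argument names are needed, base R can show them without opening a help page. A short sketch using args() and formals(); the variable names min_args and lm_args are only illustrative:

```r
# In an interactive session, ?lm is shorthand for help(lm)

# args() prints a function's argument list
args(min)

# The argument names can also be collected as character vectors
min_args <- names(formals(args(min)))  # min() is a primitive, so wrap it in args()
lm_args  <- names(formals(lm))

min_args
head(lm_args, 3)
```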
cat_function <- function(love = TRUE) {
  # love defaults to TRUE when no argument is supplied
  if (love == TRUE) {
    print('I love cats!')
  } else {
    print('I am not a cool person.')
  }
}
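Calling the function shows how the default argument behaves. The sketch below repeats the definition so that it runs on its own:

```r
cat_function <- function(love = TRUE) {
  if (love == TRUE) {
    print('I love cats!')
  } else {
    print('I am not a cool person.')
  }
}

# No argument supplied, so the default love = TRUE is used
with_default <- cat_function()

# Supplying the argument by name overrides the default
overridden <- cat_function(love = FALSE)
```

Because print() returns its argument invisibly, the function also returns the printed string, which is why the results can be stored in with_default and overridden.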
foo <- c(1, 2, NA, 4)
foo %>% min()
## [1] NA
foo %>% mean()
## [1] NA
foo %>% max()
## [1] NA
foo %>% sd()
## [1] NA
foo <- c(1, 2, NA, 4)
min(foo, na.rm = TRUE)
## [1] 1
mean(foo, na.rm = TRUE)
## [1] 2.333333
max(foo, na.rm = TRUE)
## [1] 4
sd(foo, na.rm = TRUE)
## [1] 1.527525
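A single NA makes these summaries return NA unless it is removed. The sketch below, in base R only, shows one way to find and count missing values and confirms that dropping them by hand matches na.rm = TRUE; n_missing and complete are illustrative names:

```r
foo <- c(1, 2, NA, 4)

# is.na() marks each missing element with TRUE
is.na(foo)

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(is.na(foo))

# Keep only the non-missing elements
complete <- foo[!is.na(foo)]

# Same result as mean(foo, na.rm = TRUE)
mean(complete)
```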
c('foo', 'moo', 'boo') %>% class()
## [1] "character"
c('foo', 'moo', 'boo') %>% is.character()
## [1] TRUE
c('foo', 'moo', 'boo') %>% is.factor()
## [1] FALSE
c('foo', 'moo', 'boo') %>% as.factor()
## [1] foo moo boo
## Levels: boo foo moo
c('foo', 'moo', 'boo') %>% as.factor() %>% class()
## [1] "factor"
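A factor stores each value as an integer code that points into a sorted set of levels, which is why the printed levels above are in alphabetical order rather than input order. A small base R sketch; chr and fct are illustrative names:

```r
chr <- c('foo', 'moo', 'boo')
fct <- as.factor(chr)

# Levels are the distinct values, sorted: "boo" "foo" "moo"
levels(fct)

# Integer codes give each value's position among the levels: 2 3 1
as.integer(fct)

# as.character() recovers the original strings
as.character(fct)
```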
tibble(x = 1:3, y = 4:6, z = c('foo', 'boo', 'moo'))
## # A tibble: 3 x 3
##       x     y z
##   <int> <int> <chr>
## 1     1     4 foo
## 2     2     5 boo
## 3     3     6 moo
Use $ between the data frame name and the variable name
cars$speed %>% head(7)
## [1]  4  4  7  7  8  9 10
cars[, 1] %>% head(7)
## [1]  4  4  7  7  8  9 10
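The same column can be reached in several equivalent ways. The sketch below uses the cars data frame that ships with R (columns speed and dist); the by_* names are illustrative:

```r
by_dollar <- cars$speed        # by name, with $
by_index  <- cars[, 1]         # by column position
by_name   <- cars[, 'speed']   # by column name inside single brackets
by_double <- cars[['speed']]   # by column name inside double brackets

# All four return the same vector
identical(by_dollar, by_index)

# Rows and columns can be selected together: first three values of dist
first_dist <- cars[1:3, 'dist']
first_dist
```

Inside the brackets, the part before the comma selects rows and the part after it selects columns; leaving either side empty selects everything.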
matrix(data = 1:6, nrow = 3, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
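Matrix elements are addressed with [row, column], and matrix() fills column by column unless told otherwise. A short base R sketch; m and m_byrow are illustrative names:

```r
m <- matrix(data = 1:6, nrow = 3, ncol = 2)

dim(m)    # 3 rows and 2 columns
m[2, 2]   # row 2, column 2: 5
m[, 1]    # the whole first column: 1 2 3
m[3, ]    # the whole third row: 3 6

# byrow = TRUE fills across rows instead of down columns
m_byrow <- matrix(data = 1:6, nrow = 3, ncol = 2, byrow = TRUE)
m_byrow[2, ]  # 3 4
```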
c('foo', 'moo', 'boo')
## [1] "foo" "moo" "boo"
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
rep(1:2, times = 2)
## [1] 1 2 1 2
rep(c(1,2), times = 2)
## [1] 1 2 1 2
seq(from = 0, to = 100, by = 10)
## [1] 0 10 20 30 40 50 60 70 80 90 100
seq(0, 100, 10)
## [1] 0 10 20 30 40 50 60 70 80 90 100
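rep() and seq() take a few other arguments that are handy for building vectors. A brief sketch of each, using base R only; the variable names are illustrative:

```r
# each repeats every element before moving on to the next: 1 1 2 2
repeated_each <- rep(1:2, each = 2)

# length.out asks for a number of evenly spaced values
# instead of a step size: 0.00 0.25 0.50 0.75 1.00
five_steps <- seq(from = 0, to = 1, length.out = 5)

# seq_len(n) is a compact way to write 1, 2, ..., n
one_to_five <- seq_len(5)
```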
cars$speed
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25
c('foo', 'moo', 'boo')[2]
## [1] "moo"
seq(from = 0, to = 100, by = 10)[6]
## [1] 50
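Square brackets accept more than a single position. The sketch below shows a few common forms in base R; words and tens are illustrative names for the two vectors used above:

```r
words <- c('foo', 'moo', 'boo')
tens  <- seq(from = 0, to = 100, by = 10)

words[c(1, 3)]      # several positions at once: "foo" "boo"
words[-2]           # a negative index drops that position: "foo" "boo"
tens[tens > 70]     # a logical test keeps the matching values: 80 90 100
tens[length(tens)]  # the last element: 100
```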
Exercise - 5 minutes
- What type of data object is candi?
- What are the min() and mean() for amount?
- What is the max() for election_year?
To answer the questions, import the political contributions dataset
candi <- read_csv('https://goo.gl/GTRqZs') %>% as.data.frame()
candi %>% class()
## [1] "data.frame"
candi$amount %>% min(na.rm = TRUE)
## [1] -225000
candi$amount %>% mean(na.rm = TRUE)
## [1] 358.1753
candi$election_year %>% max(na.rm = TRUE)
## [1] 2023
- head() shows the top rows of a data frame
- n is the head() argument that sets the number of rows and defaults to 6
- tail() shows the bottom rows of a data frame
- summary() shows summary statistics on all variables in a dataset
- ls() shows all variables in a data frame
- Run ls() without calling an object between the parentheses to see all objects in your workspace
- str() tells the variable type and selected variable values for all variables in a data frame
- dim() tells you the dimensions of your dataset
- nrow() reports the number of rows only
- ncol() reports the number of columns only
Exercise - 5 minutes
- How many observations (or rows) are in candi?
- What is the median value for amount?
- How many variables are in candi?
Hint: There are multiple ways to answer these questions with the functions you know
candi %>% dim()
## [1] 97756    22
candi$amount %>% median(na.rm = TRUE)
## [1] 35
candi %>% ncol()
## [1] 22
Other methods to answer questions
candi %>% summary()
candi %>% ls()
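The data frame functions from this section can be tried without downloading anything by running them on cars, a small data frame that ships with R. This is only a sketch on a stand-in dataset, not the candi data:

```r
head(cars, 3)   # first three rows
tail(cars, 3)   # last three rows

dim(cars)       # rows, then columns: 50 2
nrow(cars)      # 50
ncol(cars)      # 2

ls(cars)        # variable names, sorted alphabetically: "dist" "speed"
str(cars)       # type and first few values of every variable
summary(cars)   # summary statistics for every variable

median(cars$speed)  # 15
```

dim(), nrow() and ncol() overlap on purpose: dim() answers both size questions at once, while the other two return a single number that is easier to reuse in later code.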
- table() shows you the distribution of values in a vector
- length() tells you the number of elements in a vector
- unique() shows you the unique values in a vector
- summary() shows you descriptive statistics for a vector
- summary() can run on a vector or data frame
Exercise - 7 minutes
- How many ‘DEMOCRAT’ values are there in party?
- Which value in contributor_state is most frequent?
- Is ‘Mayoral Race’ a value of the type variable?
- How many distinct contributor_zip values are there?
Hint: There are multiple ways to answer these questions with the functions you know
table(candi$party)[1]
## DEMOCRAT
##    13961
table(candi$contributor_state) %>% tail()
##
##    VT    WA    WI    WV    WY    ZZ
##    18 84691    43    54     9     3
candi$type %>% unique()
## [1] "Candidate"           "Political Committee" NA
candi$contributor_zip %>% unique() %>% length()
## [1] 3450
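Picking a count by position, or spotting the largest count by eye, works here but depends on where the value happens to fall in the table. The sketch below shows name-based alternatives on a small made-up vector; the party values are hypothetical and are not taken from candi:

```r
# Hypothetical values for illustration only
party <- c('DEMOCRAT', 'REPUBLICAN', 'DEMOCRAT', NA, 'INDEPENDENT', 'DEMOCRAT')

counts <- table(party)            # NA is left out of the counts by default

counts['DEMOCRAT']                # look up one value by name: 3
sort(counts, decreasing = TRUE)   # most frequent value first
names(which.max(counts))          # name of the most frequent value: "DEMOCRAT"

'Mayoral Race' %in% party         # FALSE: not a value in this vector
length(unique(party))             # 4, because unique() keeps NA as a value
```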
Make sure…
- min() on a character string
- <-
- %>%
Exercise - 10 minutes
- How many rows and columns are there in crime?
- What is the most recent occurred_date?
- Which neighborhood sees the most incident activity?
- How many ‘THEFT-BICYCLE’ incidents are there in crime_subcategory?
- What are the earliest and latest reported_time values?
Begin the exercise by importing the crime dataset in RStudio
crime <- read_csv('https://goo.gl/FHW2Ni') %>% as.data.frame()
crime %>% dim()
## [1] 23459    13
crime$occurred_date %>% max(na.rm = TRUE)
## [1] "2018-09-28"
table(crime$neighborhood)[13:15]
##
##      COMMERCIAL DUWAMISH COMMERCIAL HARBOR ISLAND       DOWNTOWN COMMERCIAL
##                       15                        9                      1938
crime$crime_subcategory %>% table() %>% tail(5)
## .
##  THEFT-BICYCLE THEFT-BUILDING THEFT-SHOPLIFT       TRESPASS         WEAPON
##            470            907           2037            680            190
crime$reported_time %>% min(na.rm = TRUE)
## [1] 0
crime$reported_time %>% max(na.rm = TRUE)
## [1] 2359
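min() and max() also work on dates, which is what the occurred_date answer relies on. A small sketch with made-up dates; the values are hypothetical and not taken from crime, and it assumes the column holds Date values:

```r
# Hypothetical dates for illustration only
occurred <- as.Date(c('2018-09-28', '2017-01-15', NA, '2018-03-02'))

most_recent <- max(occurred, na.rm = TRUE)  # latest date
earliest    <- min(occurred, na.rm = TRUE)  # first date

# range() returns the earliest and latest dates together
range(occurred, na.rm = TRUE)

# Subtracting two dates gives the time between them in days
most_recent - earliest
```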