This is the sixth course in the HarvardX Professional Certificate in Data Science, a series of courses that prepare you to do data analysis in R, from simple computations to machine learning.
The textbook for the Data Science course series is freely available online.
Section 1: Data Import
You will learn how to import data from different sources.
Section 2: Tidy Data
You will learn the first pieces of converting data into a tidy format.
Section 3: String Processing
You will learn how to process strings using regular expressions (regex).
Section 4: Dates, Times, and Text Mining
You will learn how to work with dates and times in R and how to mine text.
In the Data Import section, you will learn how to import data into R.
After completing this section, you will be able to:
The textbook for this section is available here
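As a quick orientation before the questions, here is a minimal sketch of the kind of import covered in this section; it assumes the readr and dslabs packages are installed and reuses the "murders.csv" example file from the dslabs package that appears in the exercises below.
library(readr)
path <- system.file("extdata", package = "dslabs")   # folder of example data shipped with dslabs
filename <- file.path(path, "murders.csv")           # full path to the example csv file
dat <- read_csv(filename)                            # import the file as a tibble
head(dat)                                            # inspect the first few rows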
A. Importing data into R
B. Formatting dates/times
C. Checking correlations between your variables
D. Tidying data
A. data.txt
B. data.csv
C. data.xlsx
D. data.tsv
initials,state,age,time
vib,MA,61,6:01
adc,TX,45,5:45
kme,CT,50,4:19
What type of file is this?
A. A comma-delimited file without a header
B. A tab-delimited file with a header
C. A white space-delimited file without a header
D. A comma-delimited file with a header
Which of the following lines of code CANNOT set the working directory to the desired “projects” directory?
A. setwd("~/Documents/projects/")
B. setwd("/Users/student/Documents/projects/")
C. setwd(/Users/student/Documents/projects/)
D. dir <- "/Users/student/Documents/projects" setwd(dir)
> getwd()
[1] "C:/Users/UNIVERSITY/Documents/Analyses/HarvardX-Wrangling"
> filename <- "murders.csv"
> path <- system.file("extdata", package = "dslabs")
Which of the following commands would NOT successfully copy “murders.csv” into the folder “data”?
A. file.copy(file.path(path, "murders.csv"), getwd())
B. setwd("data") file.copy(file.path(path, filename), getwd())
C. file.copy(file.path(path, "murders.csv"), file.path(getwd(), "data"))
D. file.location <- file.path(system.file("extdata", package = "dslabs"), "murders.csv")
file.destination <- file.path(getwd(), "data")
file.copy(file.location, file.destination)
A. Open the file in a basic text editor.
B. In the RStudio “Files” pane, click on your file, then select “View File”.
C. Use the command read_lines (remembering to specify the number of rows with the n_max argument).
What is the difference between the functions read_excel and read_xlsx?
A. read_excel also reads meta-data from the Excel file, such as sheet names, while read_xlsx only reads the first sheet in a file.
B. read_excel reads both .xls and .xlsx files by detecting the file format from its extension, while read_xlsx only reads .xlsx files.
C. read_excel is part of the readr package, while read_xlsx is part of the readxl package and has more options.
D. read_xlsx has been replaced by read_excel in a recent readxl package update.
initials,state,age,time
vib,MA,61,6:01
adc,TX,45,5:45
kme,CT,50,4:19
Which line of code will NOT produce a tibble with column names “initials”, “state”, “age”, and “time”?
A. race_times <- read_csv("times.txt")
B. race_times <- read.csv("times.txt")
C. race_times <- read_csv("times.txt", col_names = TRUE)
D. race_times <- read_delim("times.txt", delim = ",")
Which line of code will NOT import the data contained in the “2016” tab of this Excel sheet?
A. times_2016 <- read_excel("times.xlsx", sheet = 2)
B. times_2016 <- read_xlsx("times.xlsx", sheet = "2")
C. times_2016 <- read_excel("times.xlsx", sheet = "2016")
D. times_2016 <- read_xlsx("times.xlsx", sheet = 2)
You read in the file using the following code.
race_times <- read.csv("times.csv")
What is the data type of the initials in the object race_times?
A. integers
B. characters
C. factors
D. logical
Note: If you don't supply the argument stringsAsFactors = FALSE, read.csv will automatically convert character columns to factors.
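For reference, a minimal sketch of this behavior, assuming the "times.csv" file shown above and an R version before 4.0 (where read.csv defaulted to stringsAsFactors = TRUE):
race_times <- read.csv("times.csv")                            # assumes R < 4.0 defaults
class(race_times$initials)                                     # "factor": character columns converted automatically
race_times <- read.csv("times.csv", stringsAsFactors = FALSE)
class(race_times$initials)                                     # "character": conversion turned off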
A. The import functions in the readr package all start with read_, while the import functions for base R all start with read.
B. Base R import functions automatically convert character columns to factors.
C. The base R import functions can read .csv files, but cannot read files with other delimiters, such as .tsv files, or fixed-width files.
D. Base R functions import data as a data frame, while readr functions import data as a tibble.
race_times <- read.csv("times.csv", stringsAsFactors = F)
What is the class of the object race_times?
A. data frame
B. tibble
C. matrix
D. vector
url <- "https://raw.githubusercontent.com/MyUserName/MyProject/master/MyData.csv"
dat <- read_csv(url)
download.file(url, "MyData.csv")
A. Create a tibble in R called dat that contains the information contained in the csv file stored on Github and save that tibble to the working directory.
B. Create a matrix in R called dat that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and name the downloaded file “MyData.csv”.
C. Create a tibble in R called dat that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and randomly assign it a temporary name that is very likely to be unique.
D. Create a tibble in R called dat that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and name the downloaded file “MyData.csv”.
In the Tidy Data section, you will learn how to convert data from a raw to a tidy format.
This section is divided into three parts: Reshaping Data, Combining Tables, and Web Scraping.
After completing the Tidy Data section, you will be able to:
The textbook for this section is available here and here
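As a quick orientation, here is a minimal sketch of reshaping with gather() and spread(); the tiny wide table is a made-up example that mirrors the "times.csv" data used below, and it assumes the tidyverse package is installed.
library(tidyverse)
wide <- tibble(age_group = c(20, 30), `2015` = c("3:46", "3:50"), `2016` = c("3:22", "3:43"))
tidy_data <- wide %>% gather(year, time, `2015`:`2016`)   # long format: one row per age_group-year pair
wide_again <- tidy_data %>% spread(year, time)            # back to wide format: one column per year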
age_group,2015,2016,2017
20,3:46,3:22,3:50
30,3:50,3:43,4:43
40,4:39,3:49,4:51
50,4:48,4:59,5:01
Are these data considered “tidy” in R? Why or why not?
A. Yes. These data are considered “tidy” because each row contains unique observations.
B. Yes. These data are considered “tidy” because there are no missing data in the data frame.
C. No. These data are not considered “tidy” because the variable “year” is stored in the header.
D. No. These data are not considered “tidy” because there are not an equal number of columns and rows.
A.
state abb region population total
Alabama AL South 4779736 135
Alaska AK West 710231 19
Arizona AZ West 6392017 232
Arkansas AR South 2915918 93
California CA West 37253956 1257
Colorado CO West 5029196 65
B.
state abb region var people
Alabama AL South population 4779736
Alabama AL South total 135
Alaska AK West population 710231
Alaska AK West total 19
Arizona AZ West population 6392017
Arizona AZ West total 232
C.
state abb Northeast South North Central West
Alabama AL NA 4779736 NA NA
Alaska AK NA NA NA 710231
Arizona AZ NA NA NA 6392017
Arkansas AR NA 2915918 NA NA
California CA NA NA NA 37253956
Colorado CO NA NA NA 5029196
D.
state abb region rate
Alabama AL South 2.82e-05
Alaska AK West 2.68e-05
Arizona AZ West 3.63e-05
Arkansas AR South 3.19e-05
California CA West 3.37e-05
Colorado CO West 1.29e-05
Your file called "times.csv" has age groups and average race finish times for three years of marathons.
age_group,2015,2016,2017
20,3:46,3:22,3:50
30,3:50,3:43,4:43
40,4:39,3:49,4:51
50,4:48,4:59,5:01
You read in the data file using the following command.
d <- read_csv("times.csv")
Which commands will help you “tidy” the data?
A.
tidy_data <- d %>%
gather(year, time, `2015`:`2017`)
B.
tidy_data <- d %>%
spread(year, time, `2015`:`2017`)
C.
tidy_data <- d %>%
gather(age_group, year, time, `2015`:`2017`)
D.
tidy_data <- d %>%
gather(time, `2015`:`2017`)
> head(dat_wide)
state year population Hepatitis A Mumps Polio Rubella
Alabama 1990 4040587 86 19 76 1
Alabama 1991 4066003 39 14 65 0
Alabama 1992 4097169 35 12 24 0
Alabama 1993 4133242 40 22 67 0
Alabama 1994 4173361 72 12 39 0
Alabama 1995 4216645 75 2 38 0
Which of the following would transform this into a tidy dataset, with each row representing an observation of the incidence of each specific disease (as shown below)?
> head(dat_tidy)
state year population disease count
Alabama 1990 4040587 Hepatitis A 86
Alabama 1991 4066003 Hepatitis A 39
Alabama 1992 4097169 Hepatitis A 35
Alabama 1993 4133242 Hepatitis A 40
Alabama 1994 4173361 Hepatitis A 72
Alabama 1995 4216645 Hepatitis A 75
A.
dat_tidy <- dat_wide %>%
gather(key = count, value = disease, `Hepatitis A`, `Rubella`)
B.
dat_tidy <- dat_wide %>%
gather(key - count, value = disease, -state, -year, -population)
C.
dat_tidy <- dat_wide %>%
gather(key = disease, value = count, -state)
D.
dat_tidy <- dat_wide %>%
gather(key = disease, value = count, "Hepatitis A": "Rubella")
age_group,year,time
20,2015,03:46
30,2015,03:50
40,2015,04:39
50,2015,04:48
20,2016,03:22
Select the code that converts these data back to the wide format, where each year has a separate column.
A. tidy_data %>% spread(time, year)
B. tidy_data %>% spread(year, time)
C. tidy_data %>% spread(year, age_group)
D. tidy_data %>% spread(time, year, `2015`:`2017`)
> head(dat)
state abb region var people
Alabama AL South population 4779736
Alabama AL South total 135
Alaska AK West population 710231
Alaska AK West total 19
Arizona AZ West population 6392017
Arizona AZ West total 232
You would like to transform it into a dataset where population and total are each their own column (shown below). Which code would best accomplish this?
state abb region population total
Alabama AL South 4779736 135
Alaska AK West 710231 19
Arizona AZ West 6392017 232
Arkansas AR South 2915918 93
California CA West 37253956 1257
Colorado CO West 5029196 65
A. dat_tidy <- dat %>% spread(key = var, value = people)
B. dat_tidy <- dat %>% spread(key = state:region, value = people)
C. dat_tidy <- dat %>% spread(key = people, value = var)
D. dat_tidy <- dat %>% spread(key = region, value = people)
age_group,2015_time,2015_participants,2016_time,2016_participants
20,3:46,54,3:22,62
30,3:50,60,3:43,58
40,4:39,29,3:49,33
50,4:48,10,4:59,14
You read in the data file
d <- read_csv("times.csv")
Which of the answers below best tidies the data?
A.
tidy_data <- d %>%
gather(key = "key", value = "value", -age_group) %>%
separate(col = key, into = c("year", "variable_name"), sep = ".") %>%
spread(key = variable_name, value = value)
B.
tidy_data <- d %>%
gather(key = "key", value = "value", -age_group) %>%
separate(col = key, into = c("year", "variable_name"), sep = "_") %>%
spread(key = variable_name, value = value)
C.
tidy_data <- d %>%
gather(key = "key", value = "value") %>%
separate(col = key, into = c("year", "variable_name"), sep = "_") %>%
spread(key = variable_name, value = value)
D.
tidy_data <- d %>%
gather(key = "key", value = "value", -age_group) %>%
separate(col = key, into = "year", sep = "_") %>%
spread(key = year, value = value)
> head(stats)
key value
allen_height 75
allen_hand_length 8.25
allen_wingspan 79.25
bamba_height 83.25
bamba_hand_length 9.75
bamba_wingspan 94
Select all of the correct commands below that would turn this data into a “tidy” format.
A.
tidy_data <- stats %>%
separate(col = key, into = c("player", "variable_name"), sep = "_", extra = "merge") %>%
spread(key = variable_name, value = value)
B.
tidy_data <- stats %>%
separate(col = key, into = c("player", "variable_name1", "variable_name2"), sep = "_", fill = "right") %>%
unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>%
spread(key = variable_name, value = value)
C.
tidy_data <- stats %>%
separate(col = key, into = c("player", "variable_name"), sep = "_") %>%
spread(key = variable_name, value = value)
> tab1
state population
Alabama 4779736
Alaska 710231
Arizona 6392017
Delaware 897934
District of Columbia 601723
> tab2
state electoral_votes
Alabama 9
Alaska 3
Arizona 11
California 55
Colorado 9
Connecticut 7
> dim(tab1)
[1] 5 2
> dim(tab2)
[1] 6 2
What are the dimensions of the table dat, created by the following command?
dat <- left_join(tab1, tab2, by = "state")
A. 3 rows by 3 columns
B. 5 rows by 2 columns
C. 5 rows by 3 columns
D. 6 rows by 3 columns
A. dat <- right_join(tab1, tab2, by = "state")
B. dat <- full_join(tab1, tab2, by = "state")
C. dat <- inner_join(tab1, tab2, by = "state")
D. dat <- semi_join(tab1, tab2, by = "state")
A. Binding functions combine by position, while join functions match by variables.
B. Joining functions can join datasets of different dimensions, but the bind functions must match on the appropriate dimension (either same row or column numbers).
C. Bind functions can combine both vectors and data frames, while join functions work only for data frames.
> df1
x y
a a
b a
> df2
x y
a a
a b
Which command would result in the following table?
> final
x y
b a
A. final <- union(df1, df2)
B. final <- setdiff(df1, df2)
C. final <- setdiff(df2, df1)
D. final <- intersect(df1, df2)
A. Html is easily converted to xml, which can then be used for extracting tables.
B. All elements in an html page are specified as “nodes”; we can use the node “tables” to identify and extract the specific table we are interested in before we do additional data cleaning.
C. All tables in html documents are stored in separate files that you can download via the html code.
D. Tables in html are formatted as csv tables, which we can easily copy and process in R.
tab <- h %>% html_nodes("table")
tab <- tab[[2]] %>%
html_table
Why did we use the html_nodes() command instead of the html_node() command?
A. The html_node command only selects the first node of a specified type. In this example the first “table” node is a legend table and not the actual data we are interested in.
B. The html_nodes command allows us to specify what type of node we want to extract, while the html_node command does not.
C. It does not matter; the two commands are interchangeable.
D. We used html_nodes so that we could specify the second “table” element using the tab[[2]] command.
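For reference, a minimal sketch of the scraping workflow these questions refer to, assuming the rvest package is installed; the URL is a stand-in for any page containing html tables, and which table index holds the data of interest depends on the page.
library(rvest)
url <- "https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state"   # assumed example page
h <- read_html(url)                # download and parse the page
tab <- h %>% html_nodes("table")   # extract all "table" nodes
html_table(tab[[2]])               # convert one table node to a data frame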
In the String Processing section, we use case studies that help demonstrate how string processing is a powerful tool useful for overcoming many data wrangling challenges. You will see how the original raw data was processed to create the data frames we have used in courses throughout this series.
This section is divided into three parts.
After completing the String Processing section, you will be able to:
The textbook for this section is available here
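As a quick orientation, here is a minimal sketch of the kind of string processing covered in this section, assuming the stringr package is installed; the example strings echo the heights and schedule data used in the questions below.
library(stringr)
s <- c("5'10", "6 ft", "5 feet 7inches")
str_detect(s, "\\d'\\d+")                        # TRUE FALSE FALSE: which strings already match a feet'inches pattern
str_replace(s, "feet|foot|ft", "'")              # replace the first "feet", "foot" or "ft" with an apostrophe
str_split("Mandy, Chris and Laura", ", | and ")  # split one string into individual names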
A. Removing unwanted characters from text.
B. Extracting numeric values from text.
C. Formatting numbers and characters so they can easily be displayed in deliverables like papers and presentations.
D. Splitting strings into multiple values.
A. cat(" LeBron James is 6’8\" ")
B. cat(' LeBron James is 6'8" ')
C. cat(` LeBron James is 6'8" `)
D. cat(" LeBron James is 6\’8" ")
A. Base R functions are rarely used for string processing by data scientists so it’s not worth learning them.
B. Functions in stringr all start with “str_”, which makes them easy to look up using autocomplete.
C. Stringr functions work better with pipes.
D. The order of arguments is more consistent in stringr functions than in base R.
> head(dat)
# A tibble: 5 x 3
Month Sales Profit
<chr> <chr> <chr>
January $128,568 $16,234
February $109,523 $12,876
March $115,468 $17,920
April $122,274 $15,825
May $117,921 $15,437
Which of the following commands could convert the sales and profits columns to numeric? Select all that apply.
A. dat %>% mutate_at(2:3, parse_number)
B. dat %>% mutate_at(2:3, as.numeric)
C. dat %>% mutate_all(parse_number)
D. dat %>% mutate_at(2:3, funs(str_replace_all(., c("\\$|,"), ""))) %>% mutate_at(2:3, as.numeric)
not_inches <- function(x, smallest = 50, tallest = 84) {
inches <- suppressWarnings(as.numeric(x))
ind <- is.na(inches) | inches < smallest | inches > tallest
ind
}
In this function, what TWO types of values are identified as not being correctly formatted in inches?
A. Values that specifically contain apostrophes ('), periods (.) or quotations (").
B. Values that result in NA’s when converted to numeric
C. Values less than 50 inches or greater than 84 inches
D. Values that are stored as a character class, because most are already classed as numeric.
Which of the following arguments, if passed to the function not_inches, would return the vector c(FALSE)?
A. c(175)
B. c("5'8\"")
C. c(70)
D. c(85) (the height of Shaquille O'Neal in inches)
The function not_inches returns the object ind. Which answer correctly describes ind?
A. ind is a logical vector of TRUE and FALSE, equal in length to the vector x (in the arguments list). TRUE indicates that a height entry is incorrectly formatted.
B. ind is a logical vector of TRUE and FALSE, equal in length to the vector x (in the arguments list). TRUE indicates that a height entry is correctly formatted.
C. ind is a data frame like our reported_heights table but with an extra column of TRUE or FALSE. TRUE indicates that a height entry is incorrectly formatted.
D. ind is a numeric vector equal to reported_heights$heights but with incorrectly formatted heights replaced with NAs.
> s
[1] "70" "5 ft" "4'11" "" "." "Six feet"
What pattern vector yields the following result?
str_view_all(s, pattern)
70
5 ft
4'11
.
Six feet
A. pattern <- "\\d|ft"
B. pattern <- "\d|ft"
C. pattern <- "\\d\\d|ft"
D. pattern <- "\\d|feet"
> animals <- c("cat", "puppy", "Moose", "MONKEY")
> pattern <- "[a-z]"
> str_detect(animals, pattern)
A. TRUE
B. TRUE TRUE TRUE TRUE
C. TRUE TRUE TRUE FALSE
D. TRUE TRUE FALSE FALSE
> animals <- c("cat", "puppy", "Moose", "MONKEY")
> pattern <- "[A-Z]$"
> str_detect(animals, pattern)
A. FALSE FALSE FALSE FALSE
B. FALSE FALSE TRUE TRUE
C. FALSE FALSE FALSE TRUE
D. TRUE TRUE TRUE FALSE
> animals <- c("cat", "puppy", "Moose", "MONKEY")
> pattern <- "[a-z]{4,5}"
> str_detect(animals, pattern)
A. FALSE TRUE TRUE FALSE
B. TRUE TRUE FALSE FALSE
C. FALSE FALSE FALSE TRUE
D. TRUE TRUE TRUE FALSE
animals <- c("moose", "monkey", "meerkat", "mountain lion")
Which TWO "pattern" vectors would yield the following result?
str_detect(animals, pattern)
[1] TRUE TRUE TRUE TRUE
A. pattern <- "mo*"
B. pattern <- "mo?"
C. pattern <- "mo+"
D. pattern <- "moo*"
> schools
[1] "U. Kentucky" "Univ New Hampshire" "Univ. of Massachusetts" "University Georgia"
[5] "U California" "California State University"
You want to clean these data to match the full names of each university:
> final
[1] "University of Kentucky" "University of New Hampshire" "University of Massachusetts" "University of Georgia"
[5] "University of California" "California State University"
Which of the following commands could accomplish this?
A.
schools %>%
str_replace("Univ\\.?|U\\.?", "University ") %>%
str_replace("^University of |^University ", "University of ")
B.
schools %>%
str_replace("^Univ\\.?\\s|^U\\.?\\s", "University ") %>%
str_replace("^University of |^University ", "University of ")
C.
schools %>%
str_replace("^Univ\\.\\s|^U\\.\\s", "University") %>%
str_replace("^University of |^University ", "University of ")
D.
schools %>%
str_replace("^Univ\\.?\\s|^U\\.?\\s", "University") %>%
str_replace("University ", "University of ")
problems <- c("5.3", "5,5", "6 1", "5 .11", "5, 12")
pattern_with_groups <- "^([4-7])[,\\.](\\d*)$"
str_replace(problems, pattern_with_groups, "\\1'\\2")
What is your result?
A. [1] "5'3" "5'5" "6 1" "5 .11" "5, 12"
B. [1] "5.3" "5,5" "6 1" "5 .11" "5, 12"
C. [1] "5'3" "5'5" "6'1" "5 .11" "5, 12"
D. [1] "5'3" "5'5" "6'1" "5'11" "5'12"
problems <- c("5.3", "5,5", "6 1", "5 .11", "5, 12")
pattern_with_groups <- "^([4-7])[,\\.\\s](\\d*)$"
str_replace(problems, pattern_with_groups, "\\1'\\2")
What is your result?
A. [1] "5'3" "5'5" "6 1" "5 .11" "5, 12"
B. [1] "5.3" "5,5" "6 1" "5 .11" "5, 12"
C. [1] "5'3" "5'5" "6'1" "5 .11" "5, 12"
D. [1] "5'3" "5'5" "6'1" "5'11" "5'12"
converted <- problems %>%
str_replace("feet|foot|ft", "'") %>%
str_replace("inches|in|''|\"", "") %>%
str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")
pattern <- "^[4-7]\\s*'\\s*\\d{1,2}$"
index <- str_detect(converted, pattern)
converted[!index]
Which answer best describes the differences between the regex string we use as an argument in str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")
And the regex string in pattern <- "^[4-7]\\s*'\\s*\\d{1,2}$"?
A. The regex used in str_replace looks for either a comma, period or space between the feet and inches digits, while the pattern regex just looks for an apostrophe; the regex in str_replace allows for one or more digits to be entered as inches, while the pattern regex only allows for one or two digits.
B. The regex used in str_replace allows for additional spaces between the feet and inches digits, but the pattern regex does not.
C. The regex used in str_replace looks for either a comma, period or space between the feet and inches digits, while the pattern regex just looks for an apostrophe; the regex in str_replace allows none or more digits to be entered as inches, while the pattern regex only allows for the number 1 or 2 to be used.
D. The regex used in str_replace looks for either a comma, period or space between the feet and inches digits, while the pattern regex just looks for an apostrophe; the regex in str_replace allows for none or more digits to be entered as inches, while the pattern regex only allows for one or two digits.
yes <- c("5 feet 7inches", "5 7")
no <- c("5ft 9 inches", "5 ft 9 inches")
s <- c(yes, no)
converted <- s %>%
str_replace("feet|foot|ft", "'") %>%
str_replace("inches|in|''|\"", "") %>%
str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")
pattern <- "^[4-7]\\s*'\\s*\\d{1,2}$"
str_detect(converted, pattern)
[1] TRUE FALSE FALSE
It seems like the problem may be due to spaces around the words feet|foot|ft and inches|in. What is another way you could fix this problem?
A.
converted <- s %>%
str_replace("\\s*(feet|foot|ft)\\s*", "'") %>%
str_replace("\\s*(inches|in|''|\")\\s*", "") %>%
str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")
B.
converted <- s %>%
str_replace("\\s+feet|foot|ft\\s+", "'") %>%
str_replace("\\s+inches|in|''|\"\\s+", "") %>%
str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")
C.
converted <- s %>%
str_replace("\\s*|feet|foot|ft", "'") %>%
str_replace("\\s*|inches|in|''|\"", "") %>%
str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")
D.
converted <- s %>%
str_replace_all("\\s", "") %>%
str_replace("\\s|feet|foot|ft", "'") %>%
str_replace("\\s|inches|in|''|\"", "") %>%
str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2")
s <- c("5'10", "6'1\"", "5'8inches", "5'7.5")
tab <- data.frame(x = s)
If you use the extract code from our video, the decimal point is dropped. What modification of the code would allow you to put the decimals in a third column called “decimal”?
A.
extract(data = tab, col = x, into = c("feet", "inches", "decimal"), regex = "(\\d)'(\\d{1,2})(\\.)?")
B.
extract(data = tab, col = x, into = c("feet", "inches", "decimal"), regex = "(\\d)'(\\d{1,2})(\\.\\d+)")
C.
extract(data = tab, col = x, into = c("feet", "inches", "decimal"), regex = "(\\d)'(\\d{1,2})\\.\\d+?")
D.
extract(data = tab, col = x, into = c("feet", "inches", "decimal"), regex = "(\\d)'(\\d{1,2})(\\.\\d+)?")
> schedule
day staff
Monday Mandy, Chris and Laura
Tuesday Steve, Ruth and Frank
You want to turn this into a more useful data frame.
Which two commands would properly split the text in the “staff” column into each individual name? Select ALL that apply.
A. str_split(schedule$staff, ",|and")
B. str_split(schedule$staff, ", | and ")
C. str_split(schedule$staff, ",\\s|\\sand\\s")
D. str_split(schedule$staff, "\\s?(,|and)\\s?")
> schedule
day staff
Monday Mandy, Chris and Laura
Tuesday Steve, Ruth and Frank
What code would successfully turn your schedule table into the following tidy table?
> tidy
day staff
<chr> <chr>
Monday Mandy
Monday Chris
Monday Laura
Tuesday Steve
Tuesday Ruth
Tuesday Frank
A.
tidy <- schedule %>%
mutate(staff = str_split(staff, ", | and ")) %>%
unnest()
B.
tidy <- separate(schedule, staff, into = c("s1","s2","s3"), sep = ",") %>%
gather(key = s, value = staff, s1:s3)
C.
tidy <- schedule %>%
mutate(staff = str_split(staff, ", | and ", simplify = TRUE)) %>% unnest()
A.
dat <- gapminder %>% filter(region == "Middle Africa") %>%
mutate(recode(country,
"Central African Republic" = "CAR",
"Congo, Dem. Rep." = "DRC",
"Equatorial Guinea" = "Eq. Guinea"))
B.
dat <- gapminder %>% filter(region == "Middle Africa") %>%
mutate(country_short = recode(country,
c("Central African Republic", "Congo, Dem. Rep.", "Equatorial Guinea"),
c("CAR", "DRC", "Eq. Guinea")))
C.
dat <- gapminder %>% filter(region == "Middle Africa") %>%
mutate(country = recode(country,
"Central African Republic" = "CAR",
"Congo, Dem. Rep." = "DRC",
"Equatorial Guinea" = "Eq. Guinea"))
D.
dat <- gapminder %>% filter(region == "Middle Africa") %>%
mutate(country_short = recode(country,
"Central African Republic" = "CAR",
"Congo, Dem. Rep." = "DRC",
"Equatorial Guinea" = "Eq. Guinea"))
In the Dates, Times, and Text Mining section, you will learn how to deal with dates and times in R and also how to generate numerical summaries from text data.
After completing this section, you will be able to:
The textbook for this section is available here
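As a quick orientation, here is a minimal sketch of parsing dates and times with the lubridate package (assumed to be installed); the date strings are made-up examples.
library(lubridate)
ymd("2016-09-13")   # parse a year-month-day string into a Date
mdy("09/13/2016")   # parse a month-day-year string into the same Date
dmy("13-09-2016")   # parse a day-month-year string into the same Date
hour(now())         # extract the hour from the current date-time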
A. MM-DD-YY
B. YYYY-MM-DD
C. YYYYMMDD
D. YY-MM-DD
dates <- c("09-01-02", "01-12-07", "02-03-04")
A. ymd(dates)
B. mdy(dates)
C. dmy(dates)
D. It is impossible to know which format is correct without additional information.