We’ll be working with San Francisco bike share ride data called bike_share_rides. It contains information on start and end stations of each trip, the trip duration, and some user information.
Before beginning to analyze any dataset, it’s important to take a look at the different types of columns you’ll be working with, which you can do using glimpse().
library(dplyr)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(assertive)
package 'assertive' was built under R version 3.6.3
library(stringr)
library(ggplot2)
package 'ggplot2' was built under R version 3.6.3
bike_share_rides = readRDS("bike_share_rides_ch1_1.rds")
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
Observations: 35,229
Variables: 10
$ ride_id         <int> 52797, 54540, 87695, 45619, 70832, 96135, 29928, 83331, 72424, 25910, 89090, 2443...
$ date            <chr> "2017-04-15", "2017-04-19", "2017-04-14", "2017-04-03", "2017-04-10", "2017-04-18...
$ duration        <chr> "1316.15 minutes", "8.13 minutes", "24.85 minutes", "6.35 minutes", "9.8 minutes"...
$ station_A_id    <dbl> 67, 21, 16, 58, 16, 6, 5, 16, 5, 81, 30, 16, 16, 67, 21, 16, 5, 21, 67, 5, 21, 15...
$ station_A_name  <chr> "San Francisco Caltrain Station 2 (Townsend St at 4th St)", "Montgomery St BART ...
$ station_B_id    <dbl> 89, 64, 355, 368, 81, 66, 350, 91, 62, 81, 109, 10, 80, 90, 27, 50, 323, 3, 321, ...
$ station_B_name  <chr> "Division St at Potrero Ave", "5th St at Brannan St", "23rd St at Tennessee St", ...
$ bike_id         <dbl> 1974, 860, 2263, 1417, 507, 75, 388, 239, 1449, 3289, 2084, 1526, 3473, 192, 1425...
$ user_gender     <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "...
$ user_birth_year <dbl> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996, 1993, 1996, 1974, 1995, 1993, 199...
# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1900 1979 1986 1984 1991 2001
# Convert user_birth_year to factor: user_birth_year_fct
bike_share_rides <- bike_share_rides %>%
mutate(user_birth_year_fct = as.factor(user_birth_year))
# Assert user_birth_year_fct is a factor
assert_is_factor(bike_share_rides$user_birth_year_fct)
# Summary of user_birth_year_fct
summary(bike_share_rides$user_birth_year_fct)
1900 1902 1923 1931 1938 1939 1941 1942 1943 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957
1 7 2 23 2 1 3 10 4 16 5 24 9 30 37 25 70 49 65 66 112 62
1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
156 99 196 161 256 237 245 349 225 363 365 331 370 548 529 527 563 601 481 541 775 876
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
825 1016 1056 1262 1157 1318 1606 1672 2135 1872 2062 1582 1703 1498 1476 1185 813 358 365 348 473 30
Looking at the new summary statistics, we can see that more riders were born in 1988 than in any other year.
Another common dirty data problem is having extra bits like percent signs or periods in numbers, causing them to be read in as characters. In order to be able to crunch these numbers, the extra bits need to be removed and the numbers need to be converted from character to numeric.
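For instance, here's how the same idea applies to a percent-suffixed string (a minimal sketch using a made-up vector, not a column of bike_share_rides):
# Hypothetical example: strip the "%" and convert to numeric
pct <- c("12.5%", "30%", "7.25%")
as.numeric(str_remove(pct, fixed("%")))
# expected: 12.50 30.00 7.25
The duration column below works the same way, except the extra bit is the word "minutes".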
bike_share_rides <- bike_share_rides %>%
# Remove 'minutes' from duration: duration_trimmed
mutate(duration_trimmed = str_remove(duration, "minutes"),
# Convert duration_trimmed to numeric: duration_mins
duration_mins = as.numeric(duration_trimmed))
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
Observations: 35,229
Variables: 13
$ ride_id             <int> 52797, 54540, 87695, 45619, 70832, 96135, 29928, 83331, 72424, 25910, 89090, ...
$ date                <chr> "2017-04-15", "2017-04-19", "2017-04-14", "2017-04-03", "2017-04-10", "2017-0...
$ duration            <chr> "1316.15 minutes", "8.13 minutes", "24.85 minutes", "6.35 minutes", "9.8 minu...
$ station_A_id        <dbl> 67, 21, 16, 58, 16, 6, 5, 16, 5, 81, 30, 16, 16, 67, 21, 16, 5, 21, 67, 5, 21...
$ station_A_name      <chr> "San Francisco Caltrain Station 2 (Townsend St at 4th St)", "Montgomery St B...
$ station_B_id        <dbl> 89, 64, 355, 368, 81, 66, 350, 91, 62, 81, 109, 10, 80, 90, 27, 50, 323, 3, 3...
$ station_B_name      <chr> "Division St at Potrero Ave", "5th St at Brannan St", "23rd St at Tennessee S...
$ bike_id             <dbl> 1974, 860, 2263, 1417, 507, 75, 388, 239, 1449, 3289, 2084, 1526, 3473, 192, ...
$ user_gender         <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male...
$ user_birth_year     <dbl> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996, 1993, 1996, 1974, 1995, 1993,...
$ user_birth_year_fct <fct> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996, 1993, 1996, 1974, 1995, 1993,...
$ duration_trimmed    <chr> "1316.15 ", "8.13 ", "24.85 ", "6.35 ", "9.8 ", "17.47 ", "16.52 ", "14.72 ",...
$ duration_mins       <dbl> 1316.15, 8.13, 24.85, 6.35, 9.80, 17.47, 16.52, 14.72, 4.12, 25.77, 17.73, 15...
# Assert duration_mins is numeric
assert_is_numeric(bike_share_rides$duration_mins)
# Calculate mean duration
mean(bike_share_rides$duration_mins)
[1] 13.06214
By removing characters and converting to a numeric type, you were able to figure out that the average ride duration is about 13 minutes - not bad for a city like San Francisco!
Values that are out of range can throw off an analysis, so it’s important to catch them early on.
Bikes are not allowed to be kept out for more than 24 hours, or 1440 minutes at a time, but issues with some of the bikes caused inaccurate recording of the time they were returned.
# Create breaks
breaks <- c(min(bike_share_rides$duration_mins), 0, 1440, max(bike_share_rides$duration_mins))
# Create a histogram of duration_min
ggplot(bike_share_rides, aes(duration_mins)) +
geom_histogram(breaks = breaks)
# duration_min_const: replace vals of duration_min > 1440 with 1440
bike_share_rides <- bike_share_rides %>%
mutate(duration_min_const = replace(duration_mins, duration_mins > 1440, 1440))
# Make sure all values of duration_min_const are between 0 and 1440
assert_all_are_in_closed_range(bike_share_rides$duration_min_const, lower = 0, upper = 1440)
The method of replacing erroneous data with the range limit works well, but you could just as easily replace these values with NAs or something else instead.
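For reference, here is a minimal sketch of the NA approach, reusing the same replace() pattern (duration_min_na is just an illustrative column name):
bike_share_rides %>%
  # Out-of-range durations become missing instead of being capped at 1440
  mutate(duration_min_na = replace(duration_mins, duration_mins > 1440, NA_real_)) %>%
  # Downstream summaries then need na.rm = TRUE
  summarize(mean_duration = mean(duration_min_na, na.rm = TRUE))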
Something has gone wrong and it looks like you have data with dates from the future, which is way outside of the date range you expected to be working with. To fix this, we’ll need to remove any rides from the dataset that have a date in the future. Before you can do this, the date column needs to be converted from a character to a Date. Having these as Date objects will make it much easier to figure out which rides are from the future, since R makes it easy to check if one Date object is before (<) or after (>) another.
library(lubridate)
Attaching package: 'lubridate'
The following object is masked from 'package:base':
date
# Convert date to Date type
bike_share_rides <- bike_share_rides %>%
mutate(date = as.Date(date))
# Make sure all dates are in the past
assert_all_are_in_past(bike_share_rides$date)
# Filter for rides that occurred before or on today's date
bike_share_rides_past <- bike_share_rides %>%
filter(date <= today())
# Make sure all dates from bike_share_rides_past are in the past
assert_all_are_in_past(bike_share_rides_past$date)
Handling data from the future like this is much easier than trying to verify the data’s correctness by time traveling.
When multiple rows of a data frame share the same values for all columns, they’re full duplicates of each other. Removing duplicates like this is important, since having the same value repeated multiple times can alter summary statistics like the mean and median.
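As a quick illustration with toy numbers (not taken from the dataset), a single duplicated value is enough to shift both statistics:
# Toy example: duplicating the value 15 pulls the mean and median upward
mean(c(5, 10, 15))        # expected: 10
mean(c(5, 10, 15, 15))    # expected: 11.25
median(c(5, 10, 15, 15))  # expected: 12.5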
Each ride, including its ride_id, should be unique.
# Count the number of full duplicates
sum(duplicated(bike_share_rides))
[1] 0
# Remove duplicates
bike_share_rides_unique <- distinct(bike_share_rides)
# Count the full duplicates in bike_share_rides_unique
sum(duplicated(bike_share_rides_unique))
[1] 0
Removing full duplicates will ensure that summary statistics aren’t altered by repeated data points.
Partial duplicates are a bit trickier to deal with than full duplicates. We'll first identify any partial duplicates and then practice the most common way to deal with them: dropping all partial duplicates and keeping only the first occurrence.
# Find duplicated ride_ids
bike_share_rides %>%
# Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n>1)
# Remove full and partial duplicates
bike_share_rides_unique <- bike_share_rides %>%
# Only based on ride_id instead of all cols
distinct(ride_id, .keep_all = TRUE)
# Find duplicated ride_ids in bike_share_rides_unique
bike_share_rides_unique %>%
# Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n > 1)
It’s important to consider the data you’re working with before removing partial duplicates, since sometimes it’s expected that there will be partial duplicates in a dataset, such as if the same customer makes multiple purchases.
Another way of handling partial duplicates is to compute a summary statistic of the values that differ between partial duplicates, such as the mean, median, maximum, or minimum. This can come in handy when you're not sure how your data was collected and want an average, or when, based on domain knowledge, you'd rather have too high an estimate than too low (or vice versa).
bike_share_rides %>%
# Group by ride_id and date
group_by(ride_id, date) %>%
# Add duration_min_avg column
mutate(duration_min_avg = mean(duration_mins)) %>%
# Remove duplicates based on ride_id and date, keep all cols
distinct(ride_id, date, .keep_all = TRUE) %>%
# Remove the duration_mins column
select(-duration_mins)
Aggregation of partial duplicates allows you to keep some information about all data points instead of keeping information about just one data point.
We’ll be working with a dataset called sfo_survey, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.
There were a few issues during data collection that resulted in some inconsistencies in the dataset. We’ll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to.
sfo_survey = readRDS("sfo_survey_ch2_1.rds")
dest_sizes = data.frame(dest_size = c("Small", "Medium", "Large", "Hub"),
passengers_per_day = c("0-20K", "20K-70K", "70K-100K", "100K+"))
head(sfo_survey)
head(dest_sizes)
# Count the number of occurrences of dest_size
sfo_survey %>%
count(dest_size)
# Find bad dest_size rows
sfo_survey %>%
# Join with dest_sizes data frame to get bad dest_size rows
anti_join(dest_sizes, by = "dest_size") %>%
# Select id, airline, destination, and dest_size cols
select(id, airline, destination, dest_size)
Column `dest_size` joining character vector and factor, coercing into character vector
# Remove bad dest_size rows
sfo_survey %>%
# Join with dest_sizes
semi_join(dest_sizes, by = "dest_size") %>%
# Count the number of each dest_size
count(dest_size)
Column `dest_size` joining character vector and factor, coercing into character vector
Anti-joins can help you identify the rows that are causing issues, and semi-joins can remove the issue-causing rows. In the next lesson, you’ll learn about other ways to deal with bad values so that you don’t have to lose rows of data.
We’ll continue working with the sfo_survey dataset. We’ll examine the dest_size column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face.
# Count dest_size
sfo_survey %>%
count(dest_size)
# Count cleanliness
sfo_survey %>%
count(cleanliness)
Now that we've identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, we'll use our new tools to fix the inconsistent values in sfo_survey rather than removing the data points entirely, which could bias the dataset if more than 5% of the rows had to be dropped.
# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
# dest_size_trimmed: dest_size without whitespace
mutate(dest_size_trimmed = str_trim(dest_size),
# cleanliness_lower: cleanliness converted to lowercase
cleanliness_lower = str_to_lower(cleanliness))
# Count values of dest_size_trimmed
sfo_survey %>%
count(dest_size_trimmed)
# Count values of cleanliness_lower
sfo_survey %>%
count(cleanliness_lower)
We were able to convert seven-category data into four-category data, which will make the analysis go more smoothly.
One of the tablets that participants filled out the sfo_survey on was not properly configured, allowing the response for dest_region to be free text instead of a dropdown menu. This resulted in some inconsistencies in the dest_region variable that we’ll need to correct.
library(forcats)
# Count categories of dest_region
sfo_survey %>%
count(dest_region)
# Categories to map to Europe
europe_categories <- c("Europ", "EU", "eur")
# Add a new col dest_region_collapsed
sfo_survey %>%
# Map all categories in europe_categories to Europe
mutate(dest_region_collapsed = fct_collapse(dest_region,
Europe = europe_categories)) %>%
# Count categories of dest_region_collapsed
count(dest_region_collapsed)
Unknown levels in `f`: Europ, EU, eur
You’ve reduced the number of categories from 12 to 9, and you can now be confident that 401 of the survey participants were heading to Europe.
You've recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer the call center uses can't parse all of the phone numbers, since they're in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses. We'll figure out which phone numbers have these issues so you know which ones need fixing.
customer_accounts <- readRDS("fodors.rds")
head(customer_accounts)
# Filter for rows with "-" in the phone column
customer_accounts %>%
filter(str_detect(phone, "-"))
# Filter for rows with "(" or ")" in the phone column
customer_accounts %>%
filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")")))
Now that you’ve identified the inconsistencies in the phone column, it’s time to remove unnecessary characters to make the follow-up survey go as smoothly as possible.
The customer support team has requested that all phone numbers be in the format “123 456 7890”. In this exercise, you’ll use your new stringr skills to fulfill this request.
# Remove parentheses from phone column
phone_no_parens <- customer_accounts$phone %>%
# Remove "("s
str_remove_all(fixed("(")) %>%
# Remove ")"s
str_remove_all(fixed(")"))
# Add phone_no_parens as column
customer_accounts %>%
mutate(phone_no_parens = phone_no_parens)
# Add phone_no_parens as column
customer_accounts %>%
mutate(phone_no_parens = phone_no_parens,
# Replace all hyphens in phone_no_parens with spaces
phone_clean = str_replace_all(phone_no_parens, "-", " "))
We’ll remove any rows with invalid phone numbers.
# Check out the invalid numbers
customer_accounts %>%
filter(str_length(phone) != 12)
# Remove rows with invalid numbers
customer_accounts %>%
filter(str_length(phone) == 12)
You work at an asset management company and will be working with the accounts dataset, which contains information about each customer, the amount in their account, and the date their account was opened. Your boss has asked you to calculate some summary statistics about the average value of each account and whether the age of the account is associated with a higher or lower account value. Before you can do this, you need to make sure that the accounts dataset you've been given doesn't contain any uniformity problems.
We’ll investigate the date_opened column and clean it up so that all the dates are in the same format.
accounts <- readRDS("ch3_1_accounts.rds")
head(accounts)
as.Date(accounts$date_opened)
[1] "2003-10-19" NA "2008-07-29" "2005-06-09" "2012-03-31" "2007-06-20" NA "2019-06-03"
[9] "2011-05-07" "2018-04-07" "2018-11-16" "2001-04-16" "2005-04-21" "2006-06-13" "2009-01-07" "2012-07-07"
[17] NA NA "2004-05-21" "2001-09-06" "2005-04-09" "2009-10-20" "2003-05-16" "2015-10-25"
[25] NA NA NA "2008-12-27" "2015-11-11" "2009-02-26" "2008-12-26" NA
[33] NA "2005-12-13" NA "2004-12-03" "2016-10-19" NA "2009-10-05" "2013-07-11"
[41] "2002-03-24" "2015-10-17" NA NA "2019-11-12" NA NA "2019-10-01"
[49] "2000-08-17" "2001-04-11" NA "2016-06-30" NA NA "2013-05-23" "2017-02-24"
[57] NA "2004-11-02" "2019-03-06" "2018-09-01" NA "2002-12-31" "2013-07-27" "2014-01-10"
[65] "2011-12-14" NA "2008-03-01" "2018-05-07" "2017-11-23" NA "2008-09-27" NA
[73] "2008-01-07" NA "2005-05-11" "2003-08-12" NA NA NA "2014-11-25"
[81] NA NA NA "2008-04-01" NA "2002-10-01" "2011-03-25" "2000-07-11"
[89] "2014-10-19" NA "2013-06-20" "2008-01-16" "2016-06-24" NA NA "2007-04-29"
[97] NA NA
Notice we have a lot of NAs. By default, as.Date() can’t convert “Month DD, YYYY” formats.
We can use parse_date_time() instead:
# Define the date formats
formats <- c("%Y-%m-%d", "%B %d, %Y")
# Convert dates to the same format
accounts %>%
mutate(date_opened_clean = parse_date_time(date_opened, formats))
We'll need to correct any unit differences. When we first plot the data, we'll notice that there's a group of very high values and a group of relatively lower values. The company has two different offices, one in New York and one in Tokyo, so we suspect that the accounts managed by the Tokyo office are recorded in Japanese yen instead of U.S. dollars. Luckily, we have a data frame called account_offices that indicates which office manages each customer's account, so we can use this information to figure out which totals need to be converted from yen to dollars.
# Scatter plot of opening date vs total amount
accounts %>%
ggplot(aes(x = date_opened, y = total)) +
geom_point()
The formula to convert yen to dollars is USD = JPY / 104.
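As a quick sanity check of the formula, with made-up yen amounts (104 being the exchange rate assumed in this exercise):
c(10400, 520000) / 104
# expected: 100 5000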
account_offices <- read.csv("account_offices.csv",stringsAsFactors=FALSE)
account_offices <- account_offices %>%
mutate(office = str_trim(office))
head(account_offices)
# Left join accounts and account_offices by id
accounts %>%
left_join(account_offices, by = "id")
Column `id` joining factor and character vector, coercing into character vector
# Left join accounts to account_offices by id
accounts %>%
left_join(account_offices, by = "id") %>%
# Convert totals from the Tokyo office from JPY to USD
mutate(total_usd = ifelse(office == "Tokyo", total/104, total))
Column `id` joining factor and character vector, coercing into character vector
# Left join accounts to account_offices by id
accounts %>%
left_join(account_offices, by = "id") %>%
# Convert totals from the Tokyo office from JPY to USD
mutate(total_usd = ifelse(office == "Tokyo", total / 104, total)) %>%
# Scatter plot of opening date vs total_usd
ggplot(aes(x = date_opened, y = total_usd)) +
geom_point()
Column `id` joining factor and character vector, coercing into character vector
The points in your last scatter plot all fall within a much smaller range now and you’ll be able to accurately assess the differences between accounts from different countries.
If we have three fund columns, their sum should match the total column:
# Find invalid totals
accounts %>%
# theoretical_total: sum of the three funds
mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
# Find accounts where total doesn't match theoretical_total
filter(total != theoretical_total)
By using cross field validation, you’ve been able to detect values that don’t make sense. How you choose to handle these values will depend on the dataset.
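For example, if domain knowledge suggests the individual fund columns are more trustworthy than the stored total, one option is to overwrite mismatched totals with the fund sum (a sketch; total_fixed is an illustrative column name, not part of the exercise):
accounts %>%
  mutate(theoretical_total = fund_A + fund_B + fund_C,
         # Keep the stored total where it matches, otherwise use the fund sum
         total_fixed = ifelse(total == theoretical_total, total, theoretical_total))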
We'll need to validate the age of each account and see whether the rows with inconsistent acct_age values are the same ones that had inconsistent totals.
# Find invalid acct_age
accounts %>%
# theoretical_age: age of acct based on date_opened
mutate(theoretical_age = floor(as.numeric(date_opened %--% today(),
"years"))) %>%
# Filter for rows where acct_age is different from theoretical_age
filter(theoretical_age != acct_age)
There are three accounts that all have ages off by one year, but none of them are the same as the accounts that had total inconsistencies, so it looks like these two bookkeeping errors may not be related.
There are three flavors of missing data: missing completely at random (MCAR), where the missingness is unrelated to any values in the data; missing at random (MAR), where the missingness is related to other observed values; and missing not at random (MNAR), where the missingness is related to the missing values themselves.
Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.
library(visdat)
package 'visdat' was built under R version 3.6.3
vis_miss(accounts)
accounts %>%
# missing_inv: Is inv_amount missing?
mutate(missing_inv = is.na(inv_amount)) %>%
# Group by missing_inv
group_by(missing_inv) %>%
# Calculate mean age for each missing_inv group
summarize(avg_age = mean(age))
# Sort by age and visualize missing vals
accounts %>%
arrange(age) %>%
vis_miss()
Investigating summary statistics based on missingness is a great way to determine if data is missing completely at random or missing at random.
Removing missing values:
# Create accounts_clean
accounts_clean <- accounts %>%
# Filter to remove rows with missing cust_id
filter(!is.na(cust_id))
Replacing missing values:
# Create accounts_clean
accounts_clean <- accounts %>%
# Filter to remove rows with missing cust_id
filter(!is.na(cust_id)) %>%
# Add new col acct_amount_filled with replaced NAs
mutate(acct_amount_filled = ifelse(is.na(acct_amount), 5 * inv_amount, acct_amount))
Assert that there are no missing values:
# Assert that cust_id has no missing vals
assert_all_are_not_na(accounts_clean$cust_id)
# Assert that acct_amount_filled has no missing vals
assert_all_are_not_na(accounts_clean$acct_amount_filled)
Damerau-Levenshtein distance is the minimum number of steps needed to get from String A to String B, using these operations: inserting a character, deleting a character, substituting one character for another, and transposing two adjacent characters.
There are multiple ways to calculate how similar or different two strings are. Now we’ll practice using the stringdist package to compute string distances using various methods. It’s important to be familiar with different methods, as some methods work better on certain datasets, while others work better on other datasets.
library(stringdist)
package 'stringdist' was built under R version 3.6.3
Damerau-Levenshtein distance:
# Calculate Damerau-Levenshtein distance
stringdist("las angelos", "los angeles", method = "dl")
[1] 2
Longest Common Substring (LCS):
# Calculate LCS distance
stringdist("las angelos", "los angeles", method = "lcs")
[1] 4
LCS distance only uses insertion and deletion, so it takes more operations to change one string into another.
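To see the difference, compare the two methods on a simple adjacent transposition (an illustrative pair of strings, not from the restaurant data):
# "abc" -> "acb" is a single transposition for Damerau-Levenshtein...
stringdist("abc", "acb", method = "dl")
# expected: 1
# ...while LCS must delete a character and re-insert it
stringdist("abc", "acb", method = "lcs")
# expected: 2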
Jaccard distance:
# Calculate Jaccard distance
stringdist("las angelos", "los angeles", method = "jaccard")
[1] 0
The Jaccard method (with the default q = 1) compares the sets of characters in each string; since "las angelos" and "los angeles" contain exactly the same set of characters, the Jaccard distance is 0.
One of the datasets you’ll be working with, zagat, is a set of restaurants in New York, Los Angeles, Atlanta, San Francisco, and Las Vegas. The data is from Zagat, a company that collects restaurant reviews, and includes the restaurant names, addresses, phone numbers, as well as other restaurant information.
The city column contains the name of the city that the restaurant is located in. However, there are a number of typos throughout the column. Your task is to map each city to one of the five correctly-spelled cities contained in the cities data frame.
zagat <- readRDS("zagat.rds")
cities <- data.frame(city_actual = c("new york", "los angeles", "atlanta", "san francisco", "las vegas"))
# Count the number of each city variation
zagat %>%
count(city)
library(fuzzyjoin)
package 'fuzzyjoin' was built under R version 3.6.3
# Join zagat and cities and look at results
zagat %>%
# Left join based on stringdist using city and city_actual cols
stringdist_left_join(cities, by = c("city" = "city_actual")) %>%
# Select the name, city, and city_actual cols
select(name, city, city_actual)
Now that you’ve created consistent spelling for each city, it will be much easier to compute summary statistics by city.
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when there is no common unique key, such as an identifier, that you can rely on to join the data sources.
Zagat and Fodor’s are both companies that gather restaurant reviews. The zagat and fodors datasets both contain information about various restaurants, including addresses, phone numbers, and cuisine types. Some restaurants appear in both datasets, but don’t necessarily have the same exact name or phone number written down. We’ll work towards figuring out which restaurants appear in both datasets.
The first step towards this goal is to generate pairs of records so that you can compare them. We’ll first generate all possible pairs, and then use your newly-cleaned city column as a blocking variable.
fodors <- readRDS("fodors.rds")
# Load reclin
library(reclin)
package 'reclin' was built under R version 3.6.3
Loading required package: lvec
package 'lvec' was built under R version 3.6.3
Attaching package: 'lvec'
The following object is masked from 'package:base':
order
Loading required package: ldat
package 'ldat' was built under R version 3.6.3
Loading required package: Rcpp
Attaching package: 'ldat'
The following objects are masked from 'package:base':
append, match, table, which
Attaching package: 'reclin'
The following object is masked from 'package:base':
identical
# Generate all possible pairs
pair_blocking(zagat, fodors)
Simple blocking
No blocking used.
First data set: 310 records
Second data set: 533 records
Total number of pairs: 165 230 pairs
ldat with 165 230 rows and 2 columns
# Generate all possible pairs
pair_blocking(zagat, fodors, blocking_var = "city")
Column `city` joining factors with different levels, coercing to character vector
Simple blocking
Blocking variable(s): city
First data set: 310 records
Second data set: 533 records
Total number of pairs: 40 532 pairs
ldat with 40 532 rows and 2 columns
By using city as a blocking variable, you were able to reduce the number of pairs you’ll need to compare from 165,230 pairs to 40,532.
Now that we've generated the pairs of restaurants, it's time to compare them. We can easily customize how we perform the comparisons using the by and default_comparator arguments. There's no single right answer as to what each should be set to, so we'll try a couple of options.
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs by name using lcs()
compare_pairs(by = "name",
default_comparator = lcs())
Column `city` joining factors with different levels, coercing to character vector
Compare
By: name
Simple blocking
Blocking variable(s): city
First data set: 310 records
Second data set: 533 records
Total number of pairs: 40 532 pairs
ldat with 40 532 rows and 3 columns
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs by name, phone, addr
compare_pairs(by = c("name", "phone", "addr"),
default_comparator = jaro_winkler())
Column `city` joining factors with different levels, coercing to character vector
Compare
By: name, phone, addr
Simple blocking
Blocking variable(s): city
First data set: 310 records
Second data set: 533 records
Total number of pairs: 40 532 pairs
ldat with 40 532 rows and 5 columns
Choosing a comparator and the columns to compare is highly dataset-dependent, so it’s best to try out different combinations to see which works best on the dataset you’re working with.
Record linkage requires a number of steps that can be difficult to keep straight.
We've cleaned up the city column of zagat using string similarity, as well as generated and compared pairs of restaurants from zagat and fodors. The end is near: all that's left to do is score and select pairs and link the data together, and you'll be able to begin your analysis in no time!
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs
compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
# Score pairs
score_problink()
Column `city` joining factors with different levels, coercing to character vector
Compare
By: name
Simple blocking
Blocking variable(s): city
First data set: 310 records
Second data set: 533 records
Total number of pairs: 40 532 pairs
ldat with 40 532 rows and 4 columns
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs
compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
# Score pairs
score_problink() %>%
# Select pairs
select_n_to_m()
Column `city` joining factors with different levels, coercing to character vector
Compare
By: name
Simple blocking
Blocking variable(s): city
First data set: 310 records
Second data set: 533 records
Total number of pairs: 40 532 pairs
ldat with 40 532 rows and 5 columns
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs
compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
# Score pairs
score_problink() %>%
# Select pairs
select_n_to_m() %>%
# Link data
link()
Column `city` joining factors with different levels, coercing to character vector
Now that your two datasets are merged, you can use the data to figure out if there are certain characteristics that make a restaurant more likely to be reviewed by Zagat or Fodor’s.