For this problem, I created several functions to calculate whether there was a match for each of the defined terms. Unverified information (coded as NA) are treated as ‘nomatch’ as the identifying information has not been confirmed and is therefore not a match.
results <- read_excel('Data_Analyst_Interview_Exercise_Workbook_1.xlsx', sheet=2)
transactions <- read_excel('Data_Analyst_Interview_Exercise_Workbook_1.xlsx', sheet=1)
This generic function is to ensure that the following functions are all DRY.
content.match <- function(row, content){
row %>%
select(content) %>%
match("match", nomatch = FALSE) %>%
return()
}
The following three functions check against the corresponding match rule and return TRUE or FALSE.
match.fullname <- function(row){
row %>%
content.match(c('FirstName', 'LastName')) %>%
all() %>%
return()
}
match.address <- function(row){
first <- row %>%
content.match(c('Address1')) %>%
all()
second <- row %>%
content.match(c('HouseNumber', 'StreetName')) %>%
all()
third <- row %>%
content.match(c('PostalCode', 'City')) %>%
any()
return((first || second) && third)
}
match.DOB <- function(row){
row %>%
content.match(c('DayOfBirth', 'MonthOfBirth', 'YearOfBirth')) %>%
all() %>%
return()
}
Finally, the above functions are utilized to calculate the match/nomatch signals. They are all listed below.
terms <- results %>%
merge(transactions) %>%
by_row(..f = match.fullname, .to = 'FullName', .collate='rows') %>%
by_row(..f = match.address, .to = 'Address', .collate='rows') %>%
by_row(..f = match.DOB, .to = 'DateOfBirth', .collate='rows') %>%
mutate(`Name And Address` = FullName & Address,
`Name And DateOfBirth` = FullName & DateOfBirth,
`Name And (Address Or DateOfBirth)` = FullName & (Address | DateOfBirth),
`Name And Address And DateOfBirth` = FullName & Address & DateOfBirth
) %>%
select(c(15, 2, 18:24))
terms %>%
datatable()
A decision needed to be made whether or not to remove NA from considering the match rate. There is, after all, a difference between “the information could not be verified” and “the information is wrong”. Ultimately, I decided to leave the NA terms in as a company is hoping to match information and any response other than a “match” could be viewed as a failure. This does lead to some results that may be considered misleading. For example, the “Consumer” data source has NA for all birth date information. This results in a 0% match rate even though none of their responses were ‘nomatch’.
temp <- terms %>%
select(-TransactionID) %>%
group_by(DatasourceName) %>%
summarise_all(mean) %>%
mutate_if(is.numeric, ~round(., 2))
temp %>%
datatable()
temp %>%
gather(key='key', value='value', -DatasourceName) %>%
ggplot() +
geom_bar(aes(DatasourceName, value, fill=DatasourceName), stat= 'identity') +
facet_wrap(~key, ncol=7, labeller = label_wrap_gen(width=20)) +
labs(x = '', y='') +
scale_y_continuous(expand=c(0, 0, 0.1, 0.1)) +
scale_fill_brewer(palette='Set1') +
theme_bw() +
theme(legend.position = 'none',
axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
axis.text.y = element_text(size='13'),
strip.text = element_text(size='13'),
plot.margin = unit(c(0,1.5,0,0), "cm"),
panel.grid.major.x = element_blank()
)
I will reuse calculated values from the previous question. I can group by the transaction ID and then see if any of the columns are TRUE. If any one is, then a match as been found as per the definition given.
terms %>%
select(-DatasourceName) %>%
group_by(TransactionID) %>%
summarize_all(any) %>%
datatable()
terms %>%
select(-DatasourceName) %>%
group_by(TransactionID) %>%
summarise_all(mean) %>%
mutate_if(is.numeric, ~round(., 2)) %>%
datatable()
For this problem I begin by counting the raw total of matches for the given match rule and the total cost of making all of the requests. This allows me to calculate the amount paid per match found.
terms %>%
select(DatasourceName, `Name And (Address Or DateOfBirth)`) %>%
group_by(DatasourceName) %>%
summarize(matches = sum(`Name And (Address Or DateOfBirth)`)) %>%
mutate(cost.per.request = c(0.64, 0.88, 0.52, 0.52),
requests = rep(nrow(transactions), 4),
total.cost = cost.per.request * requests,
cost.per.match = total.cost / matches) %>%
mutate_if(is.numeric, ~round(., 2)) %>%
arrange(cost.per.match) %>%
datatable()
I believe there may be benefit to the addition of a 3rd category of results beyond ‘match’ and ‘nomatch’ called ‘unknown’. I assumed that all of the NA in the provided data set meant that the information was unavailable (as opposed to not requested). If that is the case, there are two reasonable options to address the NA values – dropping them from consideration or including them as ‘nomatch’. I decided to include the NA as ‘nomatch’ but an argument could just as reasonably be made to drop them from consideration. I think the addition of a third category of ‘unknown’ would help provide additional context.
I was also uncertain for the last question if data sources charge for NA responses. I coded it as if they were paid no matter what (they are still doing the look up after all) but this is an easy change to make since it is factually true or false. Whether or not a company is charged for each request can also play a part on which data source is the best value.
Write a query to display the FirstName and LastName as 1 column using an alias name “Full Name”.
SELECT CONCAT(FirstName, " ", LastName) AS 'Full Name' FROM Customer;
Write a query to get all the unique values for City.
SELECT DISTINCT City FROM Customer;
Write a query to show all Customers with an email address that contains their FirstName.
SELECT *
FROM Customer;
WHERE EmailAddress LIKE CONCAT('%', FirstName, '%');
Write a query to count the number of Customers with the LastName “Smith”.
SELECT COUNT(LastName) AS 'Count' FROM Customer; WHERE LastName = 'Smith';
Write a query to count the number of Customers with an age between 18 and 25 years old as of 2018-07-01.
SELECT COUNT(DateOfBirth) AS 'AgeRangeCount'
FROM Customer
WHERE TIMESTAMPDIFF(YEAR, DateOfBirth, '2018-07-01')
BETWEEN 18
AND 25;
Write a query to show the average age of Customers with the LastName “Smith” between the age of 18 and 25 years old as of 2018-07-01.
SELECT AVG(TIMESTAMPDIFF(YEAR, DateOfBirth, CURDATE())) AS AverageAge
FROM Customer
WHERE TIMESTAMPDIFF(YEAR, DateOfBirth, '2018-07-01')
BETWEEN 18
AND 25
AND LastName = 'Smith';
Record <- read_csv('sqlassignment/RecordTable.csv')
Field <- read_csv('sqlassignment/FieldTable.csv')
Duplicate <- read_csv('sqlassignment/DuplicateTable.csv')
Calculate the overall match rate on all transactions for each country using the following match rule: At least one Datasource match on Name and (Address or DateOfBirth) where:
Rate <- sqldf("
SELECT
Country,
SUM(Valid) Correct,
COUNT(Valid) Total,
CAST(SUM(Valid) AS DOUBLE)/ Count(Valid) Rate
FROM (
SELECT
Country,
MAX(CASE WHEN FullName = 1 AND (Address = 1 OR DateOfBirth = 1) THEN 1 ELSE 0 END) Valid
FROM(
SELECT
Country, RecordID, Datasource,
CASE WHEN FirstName = 2 AND LastName = 2 THEN 1 ELSE 0 END FullName,
CASE WHEN DayOfBirth = 2 AND MonthOfBirth = 2 AND YearOfBirth = 2 THEN 1 ELSE 0 END DateOfBirth,
CASE WHEN StreetName = 2 AND HouseNumber = 2 AND (PostalCode = 2 OR City = 2) AND UnitNumber <> 3 THEN 1 ELSE 0 END Address
FROM (
SELECT Country, RecordID, Datasource,
MAX(CASE WHEN Field = 'FirstName' THEN MatchStatusID ELSE 0 END) FirstName,
MAX(CASE WHEN Field = 'LastName' THEN MatchStatusID ELSE 0 END) LastName,
MAX(CASE WHEN Field = 'DayOfBirth' THEN MatchStatusID ELSE 0 END) DayOfBirth,
MAX(CASE WHEN Field = 'MonthOfBirth' THEN MatchStatusID ELSE 0 END) MonthOfBirth,
MAX(CASE WHEN Field = 'YearOfBirth' THEN MatchStatusID ELSE 0 END) YearOfBirth,
MAX(CASE WHEN Field = 'StreetName' THEN MatchStatusID ELSE 0 END) StreetName,
MAX(CASE WHEN Field = 'HouseNumber' THEN MatchStatusID ELSE 0 END) HouseNumber,
MAX(CASE WHEN Field = 'Postal Code' THEN MatchStatusID ELSE 0 END) PostalCode,
MAX(CASE WHEN Field = 'City' THEN MatchStatusID ELSE 0 END) City,
MAX(CASE WHEN Field = 'UnitNumber' THEN MatchStatusID ELSE 0 END) UnitNumber
FROM Field F
GROUP BY Country, RecordID, Datasource
) AS T1
) AS T2
GROUP BY Country, RecordID
) AS T3
GROUP BY Country
ORDER BY Rate DESC;
")
Rate %>%
datatable()
Example display deliverable. Simple csv with data would also be provided.
Rate %>%
mutate(label = paste0(Correct, '/', Total)) %>%
ggplot() +
geom_bar(aes(Country, Rate), fill='darkred', stat='identity') +
geom_label(aes(Country, Rate, label=label), nudge_y = 0.05) +
scale_y_continuous(expand=c(0, 0, 0.17, 0.17)) +
scale_fill_brewer(palette='Set1') +
labs(x='Match Rate') +
theme_bw() +
theme(legend.position = 'none',
axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
axis.text.y = element_text(size='13'),
panel.grid.major.x = element_blank(),
axis.title = element_text(size='13')
)
Calculate the overall match rate on unique transactions (i.e. is not a duplicate) for each country using the same match rule defined in question 1. Include the change in match rate compared to question 1.
If this were pure SQL, I would use temporary tables instead of the previously calculated Rate.
Unique <- sqldf("
SELECT
T3.Country,
Correct,
Total,
Rate,
SUM(CASE WHEN D.isDuplicate = 0 THEN Valid ELSE 0 END) UniqueCorrect,
Count(CASE WHEN D.isDuplicate = 0 THEN 1 ELSE NULL END) UniqueTotal,
CAST(SUM(CASE WHEN D.isDuplicate = 0 THEN Valid ELSE 0 END) AS DOUBLE) / COUNT(CASE WHEN D.isDuplicate = 0 THEN 1 ELSE NULL END) UniqueRate,
CAST(SUM(CASE WHEN D.isDuplicate = 0 THEN Valid ELSE 0 END) AS DOUBLE) / COUNT(CASE WHEN D.isDuplicate = 0 THEN 1 ELSE NULL END) - Rate Change
FROM (
SELECT
Country,
RecordID,
MAX(CASE WHEN FullName = 1 AND (Address = 1 OR DateOfBirth = 1) THEN 1 ELSE 0 END) Valid
FROM(
SELECT
Country, RecordID, Datasource,
CASE WHEN FirstName = 2 AND LastName = 2 THEN 1 ELSE 0 END FullName,
CASE WHEN DayOfBirth = 2 AND MonthOfBirth = 2 AND YearOfBirth =2 THEN 1 ELSE 0 END DateOfBirth,
CASE WHEN StreetName = 2 AND HouseNumber = 2 AND (PostalCode = 2 OR City = 2) AND UnitNumber <> 3 THEN 1 ELSE 0 END Address
FROM (
SELECT Country, RecordID, Datasource,
MAX(CASE WHEN Field = 'FirstName' THEN MatchStatusID ELSE 0 END) FirstName,
MAX(CASE WHEN Field = 'LastName' THEN MatchStatusID ELSE 0 END) LastName,
MAX(CASE WHEN Field = 'DayOfBirth' THEN MatchStatusID ELSE 0 END) DayOfBirth,
MAX(CASE WHEN Field = 'MonthOfBirth' THEN MatchStatusID ELSE 0 END) MonthOfBirth,
MAX(CASE WHEN Field = 'YearOfBirth' THEN MatchStatusID ELSE 0 END) YearOfBirth,
MAX(CASE WHEN Field = 'StreetName' THEN MatchStatusID ELSE 0 END) StreetName,
MAX(CASE WHEN Field = 'HouseNumber' THEN MatchStatusID ELSE 0 END) HouseNumber,
MAX(CASE WHEN Field = 'Postal Code' THEN MatchStatusID ELSE 0 END) PostalCode,
MAX(CASE WHEN Field = 'City' THEN MatchStatusID ELSE 0 END) City,
MAX(CASE WHEN Field = 'UnitNumber' THEN MatchStatusID ELSE 0 END) UnitNumber
FROM Field F
GROUP BY Country, RecordID, Datasource
) AS T1
) AS T2
GROUP BY Country, RecordID
) AS T3
JOIN Duplicate D
ON T3.RecordID = D.RecordID
JOIN Rate
ON Rate.Country = T3.Country
GROUP BY T3.Country
ORDER BY UniqueRate DESC;
")
Unique %>%
datatable()
Example display deliverable. Simple csv with data would also be provided.
Unique %>%
mutate(label = paste0(Correct, '/', Total),
label2 = paste0(UniqueCorrect, '/', UniqueTotal)) %>%
select(Country, Rate, label, UniqueRate, label2) %>%
gather(key='key', value='Rate', -Country, -label, -label2) %>%
mutate(label3 = ifelse(key == 'Rate', label, label2)) %>%
ggplot() +
geom_bar(aes(Country, Rate, fill=key), stat='identity', position='dodge') +
geom_label(aes(Country, Rate + ifelse(key == 'Rate', -0.05, 0.05), label=label3, color=key), position = position_dodge(width = 1)) +
scale_y_continuous(expand=c(0, 0, 0.1, 0.1)) +
scale_fill_brewer(palette='Set1') +
labs(x='Match Rate') +
theme_bw() +
theme(legend.position = 'none',
axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
axis.text.y = element_text(size='13'),
panel.grid.major.x = element_blank(),
axis.title = element_text(size='13')
)
Calculate the new match rate for each country by changing the definition of the ‘Name’ term to: Match on (FirstInitial and LastName). Include the change in match rate compared to question 1.
FirstInitial <- sqldf("
SELECT
T3.Country,
Correct,
Total,
Rate,
SUM(Valid) FirstInitialCorrect,
COUNT(Valid) FirstInitialTotal,
CAST(SUM(Valid) AS DOUBLE) / Count(Valid) FirstInitialRate,
CAST(SUM(Valid) AS DOUBLE) / Count(Valid) - Rate Change
FROM (
SELECT
Country,
MAX(CASE WHEN FullName = 1 AND (Address = 1 OR DateOfBirth = 1) THEN 1 ELSE 0 END) Valid
FROM(
SELECT
Country, RecordID, Datasource,
CASE WHEN FirstInitial = 2 AND LastName = 2 THEN 1 ELSE 0 END FullName,
CASE WHEN DayOfBirth = 2 AND MonthOfBirth = 2 AND YearOfBirth =2 THEN 1 ELSE 0 END DateOfBirth,
CASE WHEN StreetName = 2 AND HouseNumber = 2 AND (PostalCode = 2 OR City = 2) AND UnitNumber <> 3 THEN 1 ELSE 0 END Address
FROM (
SELECT Country, RecordID, Datasource,
MAX(CASE WHEN Field = 'FirstInitial' THEN MatchStatusID ELSE 0 END) FirstInitial,
MAX(CASE WHEN Field = 'LastName' THEN MatchStatusID ELSE 0 END) LastName,
MAX(CASE WHEN Field = 'DayOfBirth' THEN MatchStatusID ELSE 0 END) DayOfBirth,
MAX(CASE WHEN Field = 'MonthOfBirth' THEN MatchStatusID ELSE 0 END) MonthOfBirth,
MAX(CASE WHEN Field = 'YearOfBirth' THEN MatchStatusID ELSE 0 END) YearOfBirth,
MAX(CASE WHEN Field = 'StreetName' THEN MatchStatusID ELSE 0 END) StreetName,
MAX(CASE WHEN Field = 'HouseNumber' THEN MatchStatusID ELSE 0 END) HouseNumber,
MAX(CASE WHEN Field = 'Postal Code' THEN MatchStatusID ELSE 0 END) PostalCode,
MAX(CASE WHEN Field = 'City' THEN MatchStatusID ELSE 0 END) City,
MAX(CASE WHEN Field = 'UnitNumber' THEN MatchStatusID ELSE 0 END) UnitNumber
FROM Field F
GROUP BY Country, RecordID, Datasource
) AS T1
) AS T2
GROUP BY Country, RecordID
) AS T3
JOIN Rate
ON Rate.Country = T3.Country
GROUP BY T3.Country
ORDER BY FirstInitialRate DESC;
")
FirstInitial %>%
datatable()
Example display deliverable. Simple csv with data would also be provided.
There is also an odd result that warrents further investigation. As a sanity check, all the ‘First Initial Rate’ values should be higher than the ‘First Name Rate’ values. France’s is not.
FirstInitial %>%
mutate(label = paste0(Correct, '/', Total),
label2 = paste0(FirstInitialCorrect, '/', FirstInitialTotal)) %>%
select(Country, Rate, label, FirstInitialRate, label2) %>%
gather(key='key', value='Rate', -Country, -label, -label2) %>%
mutate(label3 = ifelse(key == 'Rate', label, label2),
key = factor(key, levels=c('Rate', 'FirstInitialRate'))) %>%
ggplot() +
geom_bar(aes(Country, Rate, fill=key), stat='identity', position='dodge') +
geom_label(aes(Country, Rate + ifelse(key == 'Rate', -0.1, 0.1), label=label3, color=key), position = position_dodge(width = 1)) +
scale_y_continuous(expand=c(0, 0, 0.12, 0.12)) +
scale_fill_brewer(palette='Set1') +
labs(x='Match Rate') +
theme_bw() +
theme(legend.position = 'none',
axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
axis.text.y = element_text(size='13'),
panel.grid.major.x = element_blank(),
axis.title = element_text(size='13')
)
Calculate the new match rate for each country by remove the NotNoMatch on UnitNumber condition from the ‘Address’ term. Include the change in match rate compared to question 1.
UnitNumber <- sqldf("
SELECT
T3.Country,
Correct,
Total,
Rate,
SUM(Valid) EasyAddressCorrect,
COUNT(Valid) EasyAddressTotal,
CAST(SUM(Valid) AS DOUBLE) / Count(Valid) EasyAddressRate,
CAST(SUM(Valid) AS DOUBLE) / Count(Valid) - Rate Change
FROM (
SELECT
Country,
MAX(CASE WHEN FullName = 1 AND (Address = 1 OR DateOfBirth = 1) THEN 1 ELSE 0 END) Valid
FROM(
SELECT
Country, RecordID, Datasource,
CASE WHEN FirstName = 2 AND LastName = 2 THEN 1 ELSE 0 END FullName,
CASE WHEN DayOfBirth = 2 AND MonthOfBirth = 2 AND YearOfBirth =2 THEN 1 ELSE 0 END DateOfBirth,
CASE WHEN StreetName = 2 AND HouseNumber = 2 AND (PostalCode = 2 OR City = 2) THEN 1 ELSE 0 END Address
FROM (
SELECT Country, RecordID, Datasource,
MAX(CASE WHEN Field = 'FirstName' THEN MatchStatusID ELSE 0 END) FirstName,
MAX(CASE WHEN Field = 'LastName' THEN MatchStatusID ELSE 0 END) LastName,
MAX(CASE WHEN Field = 'DayOfBirth' THEN MatchStatusID ELSE 0 END) DayOfBirth,
MAX(CASE WHEN Field = 'MonthOfBirth' THEN MatchStatusID ELSE 0 END) MonthOfBirth,
MAX(CASE WHEN Field = 'YearOfBirth' THEN MatchStatusID ELSE 0 END) YearOfBirth,
MAX(CASE WHEN Field = 'StreetName' THEN MatchStatusID ELSE 0 END) StreetName,
MAX(CASE WHEN Field = 'HouseNumber' THEN MatchStatusID ELSE 0 END) HouseNumber,
MAX(CASE WHEN Field = 'Postal Code' THEN MatchStatusID ELSE 0 END) PostalCode,
MAX(CASE WHEN Field = 'City' THEN MatchStatusID ELSE 0 END) City
FROM Field F
GROUP BY Country, RecordID, Datasource
) AS T1
) AS T2
GROUP BY Country, RecordID
) AS T3
JOIN Rate
ON Rate.Country = T3.Country
GROUP BY T3.Country
ORDER BY EasyAddressRate DESC;
")
UnitNumber %>%
datatable()
Example display deliverable. Simple csv with data would also be provided.
UnitNumber %>%
mutate(label = paste0(Correct, '/', Total),
label2 = paste0(EasyAddressCorrect, '/', EasyAddressTotal)) %>%
select(Country, Rate, label, EasyAddressRate, label2) %>%
gather(key='key', value='Rate', -Country, -label, -label2) %>%
mutate(key = factor(key, levels=c('Rate', 'EasyAddressRate'))) %>%
ggplot() +
geom_bar(aes(Country, Rate, fill=key), stat='identity', position='dodge') +
scale_y_continuous(expand=c(0, 0, 0.18, 0.18)) +
scale_fill_brewer(palette='Set1') +
labs(x='Match Rate') +
theme_bw() +
theme(legend.position = 'none',
axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
axis.text.y = element_text(size='13'),
panel.grid.major.x = element_blank(),
axis.title = element_text(size='13')
)