Exercise 1

  1. Calculate Match/NoMatch Signals for each data source, for each Term defined in ‘Match Rules’ tab.

For this problem, I created several functions to calculate whether there was a match for each of the defined terms. Unverified information (coded as NA) are treated as ‘nomatch’ as the identifying information has not been confirmed and is therefore not a match.

results <- read_excel('Data_Analyst_Interview_Exercise_Workbook_1.xlsx', sheet=2)
transactions <- read_excel('Data_Analyst_Interview_Exercise_Workbook_1.xlsx', sheet=1)

This generic function is to ensure that the following functions are all DRY.

content.match <- function(row, content){
  row %>%
    select(content) %>%
    match("match", nomatch = FALSE) %>%
    return()
}

The following three functions check against the corresponding match rule and return TRUE or FALSE.

match.fullname <- function(row){
  row %>%
    content.match(c('FirstName', 'LastName')) %>%
    all() %>%
    return() 
}
match.address <- function(row){
  first <- row %>%
    content.match(c('Address1')) %>%
    all()
  
  second <- row %>%
    content.match(c('HouseNumber', 'StreetName')) %>%
    all()
  
  third <- row %>%
    content.match(c('PostalCode', 'City')) %>%
    any()
  
  return((first || second) && third)
}
match.DOB <- function(row){
  row %>%
    content.match(c('DayOfBirth', 'MonthOfBirth', 'YearOfBirth')) %>%
    all() %>%
    return()
}

Finally, the above functions are utilized to calculate the match/nomatch signals. They are all listed below.

terms <- results %>%
  merge(transactions) %>%
  by_row(..f = match.fullname, .to = 'FullName', .collate='rows') %>%
  by_row(..f = match.address, .to = 'Address', .collate='rows') %>%
  by_row(..f = match.DOB, .to = 'DateOfBirth', .collate='rows') %>%
  mutate(`Name And Address` = FullName & Address,
         `Name And DateOfBirth` = FullName & DateOfBirth,
         `Name And (Address Or DateOfBirth)` = FullName & (Address | DateOfBirth),
         `Name And Address And DateOfBirth` = FullName & Address & DateOfBirth
        ) %>%
  select(c(15, 2, 18:24))
terms %>%
  datatable()
  1. Calculate the Match Rate for each data source, for each Term.

A decision needed to be made whether or not to remove NA from considering the match rate. There is, after all, a difference between “the information could not be verified” and “the information is wrong”. Ultimately, I decided to leave the NA terms in as a company is hoping to match information and any response other than a “match” could be viewed as a failure. This does lead to some results that may be considered misleading. For example, the “Consumer” data source has NA for all birth date information. This results in a 0% match rate even though none of their responses were ‘nomatch’.

temp <- terms %>%
  select(-TransactionID) %>%
  group_by(DatasourceName) %>%
  summarise_all(mean) %>%
  mutate_if(is.numeric, ~round(., 2))
temp %>%
  datatable()
temp %>%
  gather(key='key', value='value', -DatasourceName) %>%
  ggplot() +
  geom_bar(aes(DatasourceName, value, fill=DatasourceName), stat= 'identity') +
  facet_wrap(~key, ncol=7, labeller = label_wrap_gen(width=20)) + 
  labs(x = '', y='') +
  scale_y_continuous(expand=c(0, 0, 0.1, 0.1)) +
  scale_fill_brewer(palette='Set1') + 
  theme_bw() +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
        axis.text.y = element_text(size='13'),
        strip.text = element_text(size='13'),
        plot.margin = unit(c(0,1.5,0,0), "cm"),
        panel.grid.major.x = element_blank()
        )

  1. Calculate Match/NoMatch Signals for each term for each transaction. A transaction is a “Match” for a given Term if 1 or more data sources returned a positive Match on that Term.

I will reuse calculated values from the previous question. I can group by the transaction ID and then see if any of the columns are TRUE. If any one is, then a match as been found as per the definition given.

terms %>%
  select(-DatasourceName) %>%
  group_by(TransactionID) %>%
  summarize_all(any) %>%
  datatable()
  1. Calculate the Match Rate for each transaction, for each Term.
terms %>%
  select(-DatasourceName) %>%
  group_by(TransactionID) %>%
  summarise_all(mean) %>%
  mutate_if(is.numeric, ~round(., 2)) %>%
  datatable()
  1. Rank the 4 data sources for optimization based on Maximizing Verification Rate and Minimizing Cost.

For this problem I begin by counting the raw total of matches for the given match rule and the total cost of making all of the requests. This allows me to calculate the amount paid per match found.

terms %>%
  select(DatasourceName, `Name And (Address Or DateOfBirth)`) %>%
  group_by(DatasourceName) %>%
  summarize(matches = sum(`Name And (Address Or DateOfBirth)`)) %>%
  mutate(cost.per.request = c(0.64, 0.88, 0.52, 0.52),
         requests = rep(nrow(transactions), 4),
         total.cost = cost.per.request * requests,
         cost.per.match = total.cost / matches) %>%
  mutate_if(is.numeric, ~round(., 2)) %>%
  arrange(cost.per.match) %>%
  datatable()
  1. Additional insights, findings and/or recommendations.

I believe there may be benefit to the addition of a 3rd category of results beyond ‘match’ and ‘nomatch’ called ‘unknown’. I assumed that all of the NA in the provided data set meant that the information was unavailable (as opposed to not requested). If that is the case, there are two reasonable options to address the NA values – dropping them from consideration or including them as ‘nomatch’. I decided to include the NA as ‘nomatch’ but an argument could just as reasonably be made to drop them from consideration. I think the addition of a third category of ‘unknown’ would help provide additional context.

I was also uncertain for the last question if data sources charge for NA responses. I coded it as if they were paid no matter what (they are still doing the look up after all) but this is an easy change to make since it is factually true or false. Whether or not a company is charged for each request can also play a part on which data source is the best value.

Exercise 2

Write a query to display the FirstName and LastName as 1 column using an alias name “Full Name”.

SELECT CONCAT(FirstName, " ", LastName) AS 'Full Name'
FROM Customer;

Write a query to get all the unique values for City.

SELECT DISTINCT City
FROM Customer;

Write a query to show all Customers with an email address that contains their FirstName.

SELECT *
FROM Customer;
WHERE EmailAddress LIKE CONCAT('%', FirstName, '%');

Write a query to count the number of Customers with the LastName “Smith”.

SELECT COUNT(LastName) AS 'Count'
FROM Customer;
WHERE LastName = 'Smith';

Write a query to count the number of Customers with an age between 18 and 25 years old as of 2018-07-01.

SELECT COUNT(DateOfBirth) AS 'AgeRangeCount'
FROM Customer
WHERE TIMESTAMPDIFF(YEAR, DateOfBirth, '2018-07-01')
    BETWEEN 18 
    AND 25;

Write a query to show the average age of Customers with the LastName “Smith” between the age of 18 and 25 years old as of 2018-07-01.

SELECT AVG(TIMESTAMPDIFF(YEAR, DateOfBirth, CURDATE())) AS AverageAge
FROM Customer
WHERE TIMESTAMPDIFF(YEAR, DateOfBirth, '2018-07-01')
    BETWEEN 18 
    AND 25
AND LastName = 'Smith';

SQL Exercise

Record <- read_csv('sqlassignment/RecordTable.csv')
Field <- read_csv('sqlassignment/FieldTable.csv')
Duplicate <- read_csv('sqlassignment/DuplicateTable.csv')

Calculate the overall match rate on all transactions for each country using the following match rule: At least one Datasource match on Name and (Address or DateOfBirth) where:

Rate <- sqldf("
SELECT 
  Country,
  SUM(Valid) Correct,
  COUNT(Valid) Total,
  CAST(SUM(Valid) AS DOUBLE)/ Count(Valid) Rate
FROM (
  SELECT 
    Country,
    MAX(CASE WHEN FullName = 1 AND (Address = 1 OR DateOfBirth = 1) THEN 1 ELSE 0 END) Valid
    FROM(
      SELECT 
        Country, RecordID, Datasource,
        CASE WHEN FirstName = 2 AND LastName = 2 THEN 1 ELSE 0 END FullName,
        CASE WHEN DayOfBirth = 2 AND MonthOfBirth = 2 AND YearOfBirth = 2 THEN 1 ELSE 0 END DateOfBirth,
        CASE WHEN StreetName = 2 AND HouseNumber = 2 AND (PostalCode = 2 OR City = 2) AND UnitNumber <> 3 THEN 1 ELSE 0 END Address
      FROM (
        SELECT Country, RecordID, Datasource,
          MAX(CASE WHEN Field = 'FirstName' THEN MatchStatusID ELSE 0 END) FirstName,
          MAX(CASE WHEN Field = 'LastName' THEN MatchStatusID ELSE 0 END) LastName,
          MAX(CASE WHEN Field = 'DayOfBirth' THEN MatchStatusID ELSE 0 END) DayOfBirth,
          MAX(CASE WHEN Field = 'MonthOfBirth' THEN MatchStatusID ELSE 0 END) MonthOfBirth,
          MAX(CASE WHEN Field = 'YearOfBirth' THEN MatchStatusID ELSE 0 END) YearOfBirth,
          MAX(CASE WHEN Field = 'StreetName' THEN MatchStatusID ELSE 0 END) StreetName,
          MAX(CASE WHEN Field = 'HouseNumber' THEN MatchStatusID ELSE 0 END) HouseNumber,
          MAX(CASE WHEN Field = 'Postal Code' THEN MatchStatusID ELSE 0 END) PostalCode,
          MAX(CASE WHEN Field = 'City' THEN MatchStatusID ELSE 0 END) City,
          MAX(CASE WHEN Field = 'UnitNumber' THEN MatchStatusID ELSE 0 END) UnitNumber
        FROM Field F
        GROUP BY Country, RecordID, Datasource
      ) AS T1
  ) AS T2
  GROUP BY Country, RecordID
) AS T3
GROUP BY Country
ORDER BY Rate DESC;
")
Rate %>%
  datatable()

Example display deliverable. Simple csv with data would also be provided.

Rate %>%
  mutate(label = paste0(Correct, '/', Total)) %>%
  ggplot() +
  geom_bar(aes(Country, Rate), fill='darkred', stat='identity') +
  geom_label(aes(Country, Rate, label=label), nudge_y = 0.05) +
  scale_y_continuous(expand=c(0, 0, 0.17, 0.17)) +
  scale_fill_brewer(palette='Set1') + 
  labs(x='Match Rate') +
  theme_bw() +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
        axis.text.y = element_text(size='13'),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(size='13')
        )

Calculate the overall match rate on unique transactions (i.e. is not a duplicate) for each country using the same match rule defined in question 1. Include the change in match rate compared to question 1.

If this were pure SQL, I would use temporary tables instead of the previously calculated Rate.

Unique <- sqldf("
SELECT 
  T3.Country,
  Correct,
  Total,
  Rate,
  SUM(CASE WHEN D.isDuplicate = 0 THEN Valid ELSE 0 END) UniqueCorrect,
  Count(CASE WHEN D.isDuplicate = 0 THEN 1 ELSE NULL END) UniqueTotal,
  CAST(SUM(CASE WHEN D.isDuplicate = 0 THEN Valid ELSE 0 END) AS DOUBLE) / COUNT(CASE WHEN D.isDuplicate = 0 THEN 1 ELSE NULL END) UniqueRate,
  CAST(SUM(CASE WHEN D.isDuplicate = 0 THEN Valid ELSE 0 END) AS DOUBLE) / COUNT(CASE WHEN D.isDuplicate = 0 THEN 1 ELSE NULL END) - Rate Change
FROM (
  SELECT 
    Country,
    RecordID,
    MAX(CASE WHEN FullName = 1 AND (Address = 1 OR DateOfBirth = 1) THEN 1 ELSE 0 END) Valid
    FROM(
      SELECT 
        Country, RecordID, Datasource,
        CASE WHEN FirstName = 2 AND LastName = 2 THEN 1 ELSE 0 END FullName,
        CASE WHEN DayOfBirth = 2 AND MonthOfBirth = 2 AND YearOfBirth =2 THEN 1 ELSE 0 END DateOfBirth,
        CASE WHEN StreetName = 2 AND HouseNumber = 2 AND (PostalCode = 2 OR City = 2) AND UnitNumber <> 3 THEN 1 ELSE 0 END Address
      FROM (
        SELECT Country, RecordID, Datasource,
          MAX(CASE WHEN Field = 'FirstName' THEN MatchStatusID ELSE 0 END) FirstName,
          MAX(CASE WHEN Field = 'LastName' THEN MatchStatusID ELSE 0 END) LastName,
          MAX(CASE WHEN Field = 'DayOfBirth' THEN MatchStatusID ELSE 0 END) DayOfBirth,
          MAX(CASE WHEN Field = 'MonthOfBirth' THEN MatchStatusID ELSE 0 END) MonthOfBirth,
          MAX(CASE WHEN Field = 'YearOfBirth' THEN MatchStatusID ELSE 0 END) YearOfBirth,
          MAX(CASE WHEN Field = 'StreetName' THEN MatchStatusID ELSE 0 END) StreetName,
          MAX(CASE WHEN Field = 'HouseNumber' THEN MatchStatusID ELSE 0 END) HouseNumber,
          MAX(CASE WHEN Field = 'Postal Code' THEN MatchStatusID ELSE 0 END) PostalCode,
          MAX(CASE WHEN Field = 'City' THEN MatchStatusID ELSE 0 END) City,
          MAX(CASE WHEN Field = 'UnitNumber' THEN MatchStatusID ELSE 0 END) UnitNumber
        FROM Field F
        GROUP BY Country, RecordID, Datasource
      ) AS T1
  ) AS T2
  GROUP BY Country, RecordID
) AS T3
JOIN Duplicate D
ON T3.RecordID = D.RecordID
JOIN Rate
ON Rate.Country = T3.Country
GROUP BY T3.Country
ORDER BY UniqueRate DESC;
")
Unique %>%
  datatable()

Example display deliverable. Simple csv with data would also be provided.

Unique %>%
  mutate(label = paste0(Correct, '/', Total),
         label2 = paste0(UniqueCorrect, '/', UniqueTotal)) %>%
  select(Country, Rate, label, UniqueRate, label2) %>%
  gather(key='key', value='Rate', -Country, -label, -label2) %>%
  mutate(label3 = ifelse(key == 'Rate', label, label2)) %>%
  ggplot() +
  geom_bar(aes(Country, Rate, fill=key), stat='identity', position='dodge') +
  geom_label(aes(Country, Rate + ifelse(key == 'Rate', -0.05, 0.05), label=label3, color=key), position = position_dodge(width = 1)) +
  scale_y_continuous(expand=c(0, 0, 0.1, 0.1)) +
  scale_fill_brewer(palette='Set1') + 
  labs(x='Match Rate') +
  theme_bw() +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
        axis.text.y = element_text(size='13'),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(size='13')
        )

Calculate the new match rate for each country by changing the definition of the ‘Name’ term to: Match on (FirstInitial and LastName). Include the change in match rate compared to question 1.

FirstInitial <- sqldf("
SELECT 
  T3.Country,
  Correct,
  Total,
  Rate,
  SUM(Valid) FirstInitialCorrect,
  COUNT(Valid) FirstInitialTotal,
  CAST(SUM(Valid) AS DOUBLE) / Count(Valid) FirstInitialRate,
  CAST(SUM(Valid) AS DOUBLE) / Count(Valid) - Rate Change
FROM (
  SELECT 
    Country,
    MAX(CASE WHEN FullName = 1 AND (Address = 1 OR DateOfBirth = 1) THEN 1 ELSE 0 END) Valid
    FROM(
      SELECT 
        Country, RecordID, Datasource,
        CASE WHEN FirstInitial = 2 AND LastName = 2 THEN 1 ELSE 0 END FullName,
        CASE WHEN DayOfBirth = 2 AND MonthOfBirth = 2 AND YearOfBirth =2 THEN 1 ELSE 0 END DateOfBirth,
        CASE WHEN StreetName = 2 AND HouseNumber = 2 AND (PostalCode = 2 OR City = 2) AND UnitNumber <> 3 THEN 1 ELSE 0 END Address
      FROM (
        SELECT Country, RecordID, Datasource,
          MAX(CASE WHEN Field = 'FirstInitial' THEN MatchStatusID ELSE 0 END) FirstInitial,
          MAX(CASE WHEN Field = 'LastName' THEN MatchStatusID ELSE 0 END) LastName,
          MAX(CASE WHEN Field = 'DayOfBirth' THEN MatchStatusID ELSE 0 END) DayOfBirth,
          MAX(CASE WHEN Field = 'MonthOfBirth' THEN MatchStatusID ELSE 0 END) MonthOfBirth,
          MAX(CASE WHEN Field = 'YearOfBirth' THEN MatchStatusID ELSE 0 END) YearOfBirth,
          MAX(CASE WHEN Field = 'StreetName' THEN MatchStatusID ELSE 0 END) StreetName,
          MAX(CASE WHEN Field = 'HouseNumber' THEN MatchStatusID ELSE 0 END) HouseNumber,
          MAX(CASE WHEN Field = 'Postal Code' THEN MatchStatusID ELSE 0 END) PostalCode,
          MAX(CASE WHEN Field = 'City' THEN MatchStatusID ELSE 0 END) City,
          MAX(CASE WHEN Field = 'UnitNumber' THEN MatchStatusID ELSE 0 END) UnitNumber
        FROM Field F
        GROUP BY Country, RecordID, Datasource
      ) AS T1
  ) AS T2
  GROUP BY Country, RecordID
) AS T3
JOIN Rate
ON Rate.Country = T3.Country
GROUP BY T3.Country
ORDER BY FirstInitialRate DESC;
")
FirstInitial %>%
  datatable()

Example display deliverable. Simple csv with data would also be provided.

There is also an odd result that warrents further investigation. As a sanity check, all the ‘First Initial Rate’ values should be higher than the ‘First Name Rate’ values. France’s is not.

FirstInitial %>%
  mutate(label = paste0(Correct, '/', Total),
         label2 = paste0(FirstInitialCorrect, '/', FirstInitialTotal)) %>%
  select(Country, Rate, label, FirstInitialRate, label2) %>%
  gather(key='key', value='Rate', -Country, -label, -label2) %>%
  mutate(label3 = ifelse(key == 'Rate', label, label2),
         key = factor(key, levels=c('Rate', 'FirstInitialRate'))) %>%
  ggplot() +
  geom_bar(aes(Country, Rate, fill=key), stat='identity', position='dodge') +
  geom_label(aes(Country, Rate + ifelse(key == 'Rate', -0.1, 0.1), label=label3, color=key), position = position_dodge(width = 1)) +
  scale_y_continuous(expand=c(0, 0, 0.12, 0.12)) +
  scale_fill_brewer(palette='Set1') + 
  labs(x='Match Rate') +
  theme_bw() +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
        axis.text.y = element_text(size='13'),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(size='13')
        )

Calculate the new match rate for each country by remove the NotNoMatch on UnitNumber condition from the ‘Address’ term. Include the change in match rate compared to question 1.

UnitNumber <- sqldf("
SELECT 
  T3.Country,
  Correct,
  Total,
  Rate,
  SUM(Valid) EasyAddressCorrect,
  COUNT(Valid) EasyAddressTotal,
  CAST(SUM(Valid) AS DOUBLE) / Count(Valid) EasyAddressRate,
  CAST(SUM(Valid) AS DOUBLE) / Count(Valid) - Rate Change
FROM (
  SELECT 
    Country,
    MAX(CASE WHEN FullName = 1 AND (Address = 1 OR DateOfBirth = 1) THEN 1 ELSE 0 END) Valid
    FROM(
      SELECT 
        Country, RecordID, Datasource,
        CASE WHEN FirstName = 2 AND LastName = 2 THEN 1 ELSE 0 END FullName,
        CASE WHEN DayOfBirth = 2 AND MonthOfBirth = 2 AND YearOfBirth =2 THEN 1 ELSE 0 END DateOfBirth,
        CASE WHEN StreetName = 2 AND HouseNumber = 2 AND (PostalCode = 2 OR City = 2) THEN 1 ELSE 0 END Address
      FROM (
        SELECT Country, RecordID, Datasource,
          MAX(CASE WHEN Field = 'FirstName' THEN MatchStatusID ELSE 0 END) FirstName,
          MAX(CASE WHEN Field = 'LastName' THEN MatchStatusID ELSE 0 END) LastName,
          MAX(CASE WHEN Field = 'DayOfBirth' THEN MatchStatusID ELSE 0 END) DayOfBirth,
          MAX(CASE WHEN Field = 'MonthOfBirth' THEN MatchStatusID ELSE 0 END) MonthOfBirth,
          MAX(CASE WHEN Field = 'YearOfBirth' THEN MatchStatusID ELSE 0 END) YearOfBirth,
          MAX(CASE WHEN Field = 'StreetName' THEN MatchStatusID ELSE 0 END) StreetName,
          MAX(CASE WHEN Field = 'HouseNumber' THEN MatchStatusID ELSE 0 END) HouseNumber,
          MAX(CASE WHEN Field = 'Postal Code' THEN MatchStatusID ELSE 0 END) PostalCode,
          MAX(CASE WHEN Field = 'City' THEN MatchStatusID ELSE 0 END) City
        FROM Field F
        GROUP BY Country, RecordID, Datasource
      ) AS T1
  ) AS T2
  GROUP BY Country, RecordID
) AS T3
JOIN Rate
ON Rate.Country = T3.Country
GROUP BY T3.Country
ORDER BY EasyAddressRate DESC;
      ")
UnitNumber %>%
  datatable()

Example display deliverable. Simple csv with data would also be provided.

UnitNumber %>%
  mutate(label = paste0(Correct, '/', Total),
         label2 = paste0(EasyAddressCorrect, '/', EasyAddressTotal)) %>%
  select(Country, Rate, label, EasyAddressRate, label2) %>%
  gather(key='key', value='Rate', -Country, -label, -label2) %>%
  mutate(key = factor(key, levels=c('Rate', 'EasyAddressRate'))) %>%
  ggplot() +
  geom_bar(aes(Country, Rate, fill=key), stat='identity', position='dodge') +
  scale_y_continuous(expand=c(0, 0, 0.18, 0.18)) +
  scale_fill_brewer(palette='Set1') + 
  labs(x='Match Rate') +
  theme_bw() +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = -60, hjust = 0, size='13'),
        axis.text.y = element_text(size='13'),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(size='13')
        )