Introduction
A demonstration to use text analytics for structured data.
Organizations are increasing their data needs for analytics. Often times it calls for data sets collected outside of the organization. The policies used to maintain data integrity within the organization fall apart when it starts to use outside data sources.
This demo takes related data sets from two different goverment agencies and uses fuzzy matching to link records.
library(tidyverse)
package ‘tidyverse’ was built under R version 3.2.5Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
package ‘tibble’ was built under R version 3.2.5package ‘tidyr’ was built under R version 3.2.5package ‘purrr’ was built under R version 3.2.5package ‘dplyr’ was built under R version 3.2.5Conflicts with tidy packages --------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
library(stringdist)
package ‘stringdist’ was built under R version 3.2.5
library(tm)
Loading required package: NLP
Attaching package: ‘NLP’
The following object is masked from ‘package:ggplot2’:
annotate
library(stringr)
library(DT)
Read in external data files. One data set is from the Texas Education Agency and the other is from an education think tank research project.
setwd("D:\\\\challenges\\rusersgroup")
The working directory was changed to D:/challenges/rusersgroup inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the the working directory for notebook chunks.
tea <- read_csv("SAT_ACT_District_Data_Class_2012.csv")
1 parsing failure.
row col expected actual
12997 -- 11 columns 1 columns
acgr <- read_csv("acgr.csv")
Let’s get down to the dirty work. Two data sets from data sources. Each organization has its own data standards.
For our analysis, we are going to use the data at the district level and match the data on the most common variable – District Name.
Let do some basic cleanup to standardize the common variable such as case conversion and remove punctuation.
#--------------- Case conversion to upper case --------------------#
acgr$districtname <- toupper(acgr$leanm10)
tea$districtname <- toupper(tea$distname)
#---------------- Remove punctuation from district name --------------#
acgr$districtname <- removePunctuation(acgr$districtname)
tea$districtname <- removePunctuation(tea$districtname)
Let’s go to the tidyverse and clean up our two data sets.
The acgr data set contains national graduation rates by school district. Since we are only interested in Texas students, we need to filter by state.
The tea data set is the Texas Education Agency’s data for percentages of students in the district who took SAT/ACT exams. It also has rows by ethnicity. For this analysis, we just want all students.
acgr_analysis <- acgr %>%
filter(stnam == 'TEXAS') %>%
rename(grad_rate = ALL_RATE_1011) %>%
select(districtname,grad_rate)
tea_analysis <- tea %>%
filter(Group == 'All Students') %>%
select(districtname, Part_Rate)
We have two separate data sets, each with different number of rows. This is typical of data sets from different sources. We are going to find our matches by district name and build the matched data frame. Then we will build a non-match data frame
matched_df <- inner_join(acgr_analysis,tea_analysis,by='districtname')
non_matched_acgr <- anti_join(acgr_analysis,tea_analysis,by='districtname')
non_matched_tea <- anti_join(tea_analysis,acgr_analysis,by='districtname')
#--------------------------------------------------------------------------#
# Display tables with data tables format #
#--------------------------------------------------------------------------#
datatable(matched_df, rownames=FALSE, options = list(
pageLength = 25, autoWidth = TRUE, searching=FALSE))
Let’s review the non-matches in each of the data sets.
datatable(non_matched_acgr, rownames=FALSE, options = list(
pageLength = 25, autoWidth = TRUE, searching=FALSE))
datatable(non_matched_tea, rownames=FALSE, options = list(
pageLength = 25, autoWidth = TRUE, searching=FALSE))
Now it is time to do fuzzy matching on the district names for the two data sets. We perform a one to many match scores on district name.
We will compute the similary score using the Levenshtein distance algorithm from the stringdist package.
#------------ Data frame to hold score in loop ----------------------#
score_df <- data_frame(x=numeric(0),y=numeric(0),score=numeric(0))
for (x in 1:nrow(non_matched_acgr)) {
for (y in 1: nrow(non_matched_tea)) {
#-----------------------------------------------------------------#
# Similarity score using Levenshtein distance #
#-----------------------------------------------------------------#
score <- stringsim(non_matched_acgr$districtname[x], non_matched_tea$districtname[y], method = 'lv')
#----------- Current data values ---------------------------#
term <- data_frame(x=x,y=y,score=score)
#------------- Build output data frame ------------------#
score_df <- bind_rows(score_df, term)
}
}
Look at the non matches data frame and join the data by row number.
non_matched_acgr$x <- row(non_matched_acgr)
non_matched_tea$y <- row(non_matched_tea)
lv_matches <- left_join(non_matched_acgr,score_df,by='x')
lv_tea <- left_join(non_matched_tea,lv_matches, by='y')
Select the matches by filtering on similarity score. Take all score above .5. Review manually to determine the cutoff filter
tea_matches <- filter(lv_tea, score > .5)
#-------------- Order Columns -----------------------------------#
tea_matches <- tea_matches %>%
select(districtname.x,districtname.y,score,Part_Rate,grad_rate)
datatable(tea_matches, rownames=FALSE, options = list(
pageLength = 25, autoWidth = TRUE, searching=FALSE))
Reviewed manually and decided a decent cutoff filter would be .75
tea_cutoff <- filter(tea_matches, score > .75)
datatable(tea_cutoff, rownames=FALSE, options = list(
pageLength = 25, autoWidth = TRUE, searching=FALSE))
Manually select matches
fuzzy_matches <- tea_cutoff %>%
slice(c(3:6,8,10:11))
datatable(fuzzy_matches, rownames=FALSE, options = list(
pageLength = 25, autoWidth = TRUE, searching=FALSE))
LS0tDQp0aXRsZTogIlRleHQgQW5hbHl0aWNzIGZvciBTdHJ1Y3R1cmVkIERhdGEiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCmF1dGhvcjogUm9iIENhbXBhbmVsbA0KDQotLS0NCg0KKipJbnRyb2R1Y3Rpb24qKg0KDQpBIGRlbW9uc3RyYXRpb24gdG8gdXNlIHRleHQgYW5hbHl0aWNzIGZvciBzdHJ1Y3R1cmVkIGRhdGEuICAgDQoNCk9yZ2FuaXphdGlvbnMgYXJlIGluY3JlYXNpbmcgdGhlaXIgZGF0YSBuZWVkcyBmb3IgYW5hbHl0aWNzLiAgIE9mdGVuIHRpbWVzIGl0IGNhbGxzIGZvciBkYXRhIHNldHMgY29sbGVjdGVkIG91dHNpZGUgb2YgdGhlIG9yZ2FuaXphdGlvbi4gICBUaGUgcG9saWNpZXMgdXNlZCB0byBtYWludGFpbiBkYXRhIGludGVncml0eSB3aXRoaW4gdGhlIG9yZ2FuaXphdGlvbiBmYWxsIGFwYXJ0IHdoZW4gaXQgc3RhcnRzIHRvIHVzZSBvdXRzaWRlIGRhdGEgc291cmNlcy4NCg0KVGhpcyBkZW1vIHRha2VzIHJlbGF0ZWQgZGF0YSBzZXRzIGZyb20gdHdvIGRpZmZlcmVudCBnb3Zlcm1lbnQgYWdlbmNpZXMgYW5kIHVzZXMgZnV6enkgbWF0Y2hpbmcgdG8gbGluayByZWNvcmRzLg0KDQoNCmBgYHtyfQ0KDQpsaWJyYXJ5KHRpZHl2ZXJzZSkNCmxpYnJhcnkoc3RyaW5nZGlzdCkNCmxpYnJhcnkodG0pDQpsaWJyYXJ5KHN0cmluZ3IpDQpsaWJyYXJ5KERUKQ0KDQpgYGANCg0KUmVhZCBpbiBleHRlcm5hbCBkYXRhIGZpbGVzLiAgIE9uZSBkYXRhIHNldCBpcyBmcm9tIHRoZSBUZXhhcyBFZHVjYXRpb24gQWdlbmN5IGFuZCB0aGUgb3RoZXIgaXMgZnJvbSBhbiBlZHVjYXRpb24gdGhpbmsgdGFuayByZXNlYXJjaCBwcm9qZWN0Lg0KDQpgYGB7cn0NCg0Kc2V0d2QoIkQ6XFxcXGNoYWxsZW5nZXNcXHJ1c2Vyc2dyb3VwIikNCg0KdGVhIDwtIHJlYWRfY3N2KCJTQVRfQUNUX0Rpc3RyaWN0X0RhdGFfQ2xhc3NfMjAxMi5jc3YiKQ0KYWNnciA8LSByZWFkX2NzdigiYWNnci5jc3YiKQ0KDQoNCmBgYA0KTGV0J3MgZ2V0IGRvd24gdG8gdGhlIGRpcnR5IHdvcmsuICAgVHdvIGRhdGEgc2V0cyBmcm9tIGRhdGEgc291cmNlcy4gICBFYWNoIG9yZ2FuaXphdGlvbiBoYXMgaXRzIG93biBkYXRhIHN0YW5kYXJkcy4gICANCg0KRm9yIG91ciBhbmFseXNpcywgd2UgYXJlIGdvaW5nIHRvIHVzZSB0aGUgZGF0YSBhdCB0aGUgZGlzdHJpY3QgbGV2ZWwgYW5kIG1hdGNoIHRoZSBkYXRhIG9uIHRoZSBtb3N0IGNvbW1vbiB2YXJpYWJsZSAtLSBEaXN0cmljdCBOYW1lLg0KDQpMZXQgZG8gc29tZSBiYXNpYyBjbGVhbnVwIHRvIHN0YW5kYXJkaXplIHRoZSBjb21tb24gdmFyaWFibGUgc3VjaCBhcyBjYXNlIGNvbnZlcnNpb24gYW5kIHJlbW92ZSBwdW5jdHVhdGlvbi4NCg0KYGBge3J9DQoNCiMtLS0tLS0tLS0tLS0tLS0gICBDYXNlIGNvbnZlcnNpb24gdG8gdXBwZXIgY2FzZSAgIC0tLS0tLS0tLS0tLS0tLS0tLS0tIw0KYWNnciRkaXN0cmljdG5hbWUgPC0gdG91cHBlcihhY2dyJGxlYW5tMTApDQp0ZWEkZGlzdHJpY3RuYW1lIDwtIHRvdXBwZXIodGVhJGRpc3RuYW1lKQ0KDQojLS0tLS0tLS0tLS0tLS0tLSAgIFJlbW92ZSBwdW5jdHVhdGlvbiBmcm9tIGRpc3RyaWN0IG5hbWUgICAtLS0tLS0tLS0tLS0tLSMNCmFjZ3IkZGlzdHJpY3RuYW1lIDwtIHJlbW92ZVB1bmN0dWF0aW9uKGFjZ3IkZGlzdHJpY3RuYW1lKQ0KdGVhJGRpc3RyaWN0bmFtZSA8LSByZW1vdmVQdW5jdHVhdGlvbih0ZWEkZGlzdHJpY3RuYW1lKQ0KDQpgYGANCg0KDQpMZXQncyBnbyB0byB0aGUgdGlkeXZlcnNlIGFuZCBjbGVhbiB1cCBvdXIgdHdvIGRhdGEgc2V0cy4gIA0KDQpUaGUgYWNnciBkYXRhIHNldCBjb250YWlucyBuYXRpb25hbCBncmFkdWF0aW9uIHJhdGVzIGJ5IHNjaG9vbCBkaXN0cmljdC4gICBTaW5jZSB3ZSBhcmUgb25seSBpbnRlcmVzdGVkIGluIFRleGFzIHN0dWRlbnRzLCB3ZSBuZWVkIHRvIGZpbHRlciBieSBzdGF0ZS4gICANCg0KVGhlIHRlYSBkYXRhIHNldCBpcyB0aGUgVGV4YXMgRWR1Y2F0aW9uIEFnZW5jeSdzIGRhdGEgZm9yIHBlcmNlbnRhZ2VzIG9mIHN0dWRlbnRzIGluIHRoZSBkaXN0cmljdCB3aG8gdG9vayBTQVQvQUNUIGV4YW1zLiAgIEl0IGFsc28gaGFzIHJvd3MgYnkgZXRobmljaXR5LiAgIEZvciB0aGlzIGFuYWx5c2lzLCB3ZSBqdXN0IHdhbnQgYWxsIHN0dWRlbnRzLg0KDQpgYGB7cn0NCg0KYWNncl9hbmFseXNpcyA8LSAgYWNnciAlPiUNCiAgICAgICAgICAgICAgICAgIGZpbHRlcihzdG5hbSA9PSAnVEVYQVMnKSAlPiUNCiAgICAgICAgICAgICAgICAgIHJlbmFtZShncmFkX3JhdGUgPSBBTExfUkFURV8xMDExKSAlPiUNCiAgICAgICAgICAgICAgICAgIHNlbGVjdChkaXN0cmljdG5hbWUsZ3JhZF9yYXRlKQ0KDQp0ZWFfYW5hbHlzaXMgPC0gdGVhICU+JQ0KICAgICAgICAgICAgICAgIGZpbHRlcihHcm91cCA9PSAnQWxsIFN0dWRlbnRzJykgJT4lDQogICAgICAgICAgICAgICAgc2VsZWN0KGRpc3RyaWN0bmFtZSwgUGFydF9SYXRlKQ0KICAgICAgICAgICAgICAgIA0KDQpgYGANCg0KV2UgaGF2ZSB0d28gc2VwYXJhdGUgZGF0YSBzZXRzLCBlYWNoIHdpdGggZGlmZmVyZW50IG51bWJlciBvZiByb3dzLiAgVGhpcyBpcyB0eXBpY2FsIG9mIGRhdGEgc2V0cyBmcm9tIGRpZmZlcmVudCBzb3VyY2VzLiAgICBXZSBhcmUgZ29pbmcgdG8gZmluZCBvdXIgbWF0Y2hlcyBieSBkaXN0cmljdCBuYW1lIGFuZCBidWlsZCB0aGUgbWF0Y2hlZCBkYXRhIGZyYW1lLiAgIFRoZW4gd2Ugd2lsbCBidWlsZCBhIG5vbi1tYXRjaCBkYXRhIGZyYW1lDQoNCmBgYHtyfQ0KDQptYXRjaGVkX2RmIDwtIGlubmVyX2pvaW4oYWNncl9hbmFseXNpcyx0ZWFfYW5hbHlzaXMsYnk9J2Rpc3RyaWN0bmFtZScpDQoNCm5vbl9tYXRjaGVkX2FjZ3IgPC0gYW50aV9qb2luKGFjZ3JfYW5hbHlzaXMsdGVhX2FuYWx5c2lzLGJ5PSdkaXN0cmljdG5hbWUnKQ0Kbm9uX21hdGNoZWRfdGVhIDwtIGFudGlfam9pbih0ZWFfYW5hbHlzaXMsYWNncl9hbmFseXNpcyxieT0nZGlzdHJpY3RuYW1lJykNCg0KIy0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tIw0KIyAgRGlzcGxheSB0YWJsZXMgd2l0aCBkYXRhIHRhYmxlcyBmb3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIw0KIy0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tIw0KZGF0YXRhYmxlKG1hdGNoZWRfZGYsIHJvd25hbWVzPUZBTFNFLCBvcHRpb25zID0gbGlzdCgNCiAgcGFnZUxlbmd0aCA9IDI1LCBhdXRvV2lkdGggPSBUUlVFLCBzZWFyY2hpbmc9RkFMU0UpKQ0KDQoNCmBgYA0KDQpMZXQncyByZXZpZXcgdGhlIG5vbi1tYXRjaGVzIGluIGVhY2ggb2YgdGhlIGRhdGEgc2V0cy4NCg0KYGBge3J9DQoNCmRhdGF0YWJsZShub25fbWF0Y2hlZF9hY2dyLCByb3duYW1lcz1GQUxTRSwgb3B0aW9ucyA9IGxpc3QoDQogIHBhZ2VMZW5ndGggPSAyNSwgYXV0b1dpZHRoID0gVFJVRSwgc2VhcmNoaW5nPUZBTFNFKSkNCg0KYGBgDQoNCg0KYGBge3J9DQoNCmRhdGF0YWJsZShub25fbWF0Y2hlZF90ZWEsIHJvd25hbWVzPUZBTFNFLCBvcHRpb25zID0gbGlzdCgNCiAgcGFnZUxlbmd0aCA9IDI1LCBhdXRvV2lkdGggPSBUUlVFLCBzZWFyY2hpbmc9RkFMU0UpKQ0KDQpgYGANCg0KDQoNCk5vdyBpdCBpcyB0aW1lIHRvIGRvIGZ1enp5IG1hdGNoaW5nIG9uIHRoZSBkaXN0cmljdCBuYW1lcyBmb3IgdGhlIHR3byBkYXRhIHNldHMuICAgV2UgcGVyZm9ybSBhIG9uZSB0byBtYW55IG1hdGNoIHNjb3JlcyBvbiBkaXN0cmljdCBuYW1lLiAgIA0KDQpXZSB3aWxsIGNvbXB1dGUgdGhlIHNpbWlsYXJ5IHNjb3JlIHVzaW5nIHRoZSBMZXZlbnNodGVpbiBkaXN0YW5jZSBhbGdvcml0aG0gZnJvbSB0aGUgc3RyaW5nZGlzdCBwYWNrYWdlLg0KDQpgYGB7cn0NCg0KIy0tLS0tLS0tLS0tLSAgIERhdGEgZnJhbWUgdG8gaG9sZCBzY29yZSBpbiBsb29wICAgLS0tLS0tLS0tLS0tLS0tLS0tLS0tLSMNCnNjb3JlX2RmIDwtIGRhdGFfZnJhbWUoeD1udW1lcmljKDApLHk9bnVtZXJpYygwKSxzY29yZT1udW1lcmljKDApKQ0KDQpmb3IgKHggaW4gMTpucm93KG5vbl9tYXRjaGVkX2FjZ3IpKSB7DQogIA0KICBmb3IgKHkgaW4gMTogbnJvdyhub25fbWF0Y2hlZF90ZWEpKSAgew0KICAgIA0KICAgICAjLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0jDQogICAgICMgIFNpbWlsYXJpdHkgc2NvcmUgdXNpbmcgTGV2ZW5zaHRlaW4gZGlzdGFuY2UgICAgICAgICAgICAgICAgICAgICMNCiAgICAgIy0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tIw0KICAgICBzY29yZSA8LSBzdHJpbmdzaW0obm9uX21hdGNoZWRfYWNnciRkaXN0cmljdG5hbWVbeF0sICAgbm9uX21hdGNoZWRfdGVhJGRpc3RyaWN0bmFtZVt5XSwgbWV0aG9kID0gJ2x2JykNCiAgICAgDQogICAgICMtLS0tLS0tLS0tLSAgIEN1cnJlbnQgZGF0YSB2YWx1ZXMgICAtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0jDQogICAgIHRlcm0gPC0gZGF0YV9mcmFtZSh4PXgseT15LHNjb3JlPXNjb3JlKQ0KICAgICANCiAgICAgIy0tLS0tLS0tLS0tLS0gIEJ1aWxkIG91dHB1dCBkYXRhIGZyYW1lICAgLS0tLS0tLS0tLS0tLS0tLS0tIw0KICAgICBzY29yZV9kZiA8LSBiaW5kX3Jvd3Moc2NvcmVfZGYsIHRlcm0pDQogICAgDQogIH0NCiAgDQp9DQoNCg0KYGBgDQoNCkxvb2sgYXQgdGhlIG5vbiBtYXRjaGVzIGRhdGEgZnJhbWUgYW5kIGpvaW4gdGhlIGRhdGEgYnkgcm93IG51bWJlci4gICANCg0KYGBge3J9DQoNCm5vbl9tYXRjaGVkX2FjZ3IkeCA8LSByb3cobm9uX21hdGNoZWRfYWNncikNCm5vbl9tYXRjaGVkX3RlYSR5IDwtIHJvdyhub25fbWF0Y2hlZF90ZWEpDQoNCmx2X21hdGNoZXMgPC0gbGVmdF9qb2luKG5vbl9tYXRjaGVkX2FjZ3Isc2NvcmVfZGYsYnk9J3gnKQ0KbHZfdGVhIDwtIGxlZnRfam9pbihub25fbWF0Y2hlZF90ZWEsbHZfbWF0Y2hlcywgYnk9J3knKQ0KDQpgYGANCg0KU2VsZWN0IHRoZSBtYXRjaGVzIGJ5IGZpbHRlcmluZyBvbiBzaW1pbGFyaXR5IHNjb3JlLiAgIFRha2UgYWxsIHNjb3JlIGFib3ZlIC41LiAgIFJldmlldyBtYW51YWxseSB0byBkZXRlcm1pbmUgdGhlIGN1dG9mZiBmaWx0ZXINCg0KYGBge3J9DQoNCnRlYV9tYXRjaGVzIDwtIGZpbHRlcihsdl90ZWEsIHNjb3JlID4gLjUpDQoNCiMtLS0tLS0tLS0tLS0tLSAgT3JkZXIgQ29sdW1ucyAgLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0jDQp0ZWFfbWF0Y2hlcyA8LSB0ZWFfbWF0Y2hlcyAlPiUNCiAgICAgICAgICAgICAgIHNlbGVjdChkaXN0cmljdG5hbWUueCxkaXN0cmljdG5hbWUueSxzY29yZSxQYXJ0X1JhdGUsZ3JhZF9yYXRlKSANCiANCmRhdGF0YWJsZSh0ZWFfbWF0Y2hlcywgcm93bmFtZXM9RkFMU0UsIG9wdGlvbnMgPSBsaXN0KA0KICBwYWdlTGVuZ3RoID0gMjUsIGF1dG9XaWR0aCA9IFRSVUUsIHNlYXJjaGluZz1GQUxTRSkpDQoNCmBgYA0KDQpSZXZpZXdlZCBtYW51YWxseSBhbmQgZGVjaWRlZCBhIGRlY2VudCBjdXRvZmYgZmlsdGVyIHdvdWxkIGJlIC43NQ0KDQpgYGB7cn0NCg0KdGVhX2N1dG9mZiA8LSBmaWx0ZXIodGVhX21hdGNoZXMsIHNjb3JlID4gLjc1KQ0KDQoNCmRhdGF0YWJsZSh0ZWFfY3V0b2ZmLCByb3duYW1lcz1GQUxTRSwgb3B0aW9ucyA9IGxpc3QoDQogIHBhZ2VMZW5ndGggPSAyNSwgYXV0b1dpZHRoID0gVFJVRSwgc2VhcmNoaW5nPUZBTFNFKSkNCg0KYGBgDQoNCk1hbnVhbGx5IHNlbGVjdCBtYXRjaGVzICANCg0KYGBge3J9DQoNCmZ1enp5X21hdGNoZXMgPC0gdGVhX2N1dG9mZiAlPiUNCiAgICAgICAgICAgICAgICAgc2xpY2UoYygzOjYsOCwxMDoxMSkpDQoNCmRhdGF0YWJsZShmdXp6eV9tYXRjaGVzLCByb3duYW1lcz1GQUxTRSwgb3B0aW9ucyA9IGxpc3QoDQogIHBhZ2VMZW5ndGggPSAyNSwgYXV0b1dpZHRoID0gVFJVRSwgc2VhcmNoaW5nPUZBTFNFKSkNCg0KDQoNCmBgYA0KDQo=