For the purposes of test delivery the data set here is relatively small (data from Denver’s open data catalog). You should approach this test as though the data were 1000x the actual size and is frequently updated by real time events.
There are multiple ways to approach this challenge, 1. Using a Data Science/Data Wrangling Approach 2. Using Unix/Linux Shell (combo of Sed/Awk/Shell/Perl for really large files) 3. Using a 3rd party ETL/ELT (Talend, Informatica, or DataStage) And there are mulitple platforms for this challenge, A. Sandbox (local machine non-production) B. Cloud (using AWS and S3 Storage, EC2 or EMR nodes for Processing)
There’s really no right way or wrong way, just always seems to be a “better way” and that just depends upon your perspective. If I’m a Data Scientist, I really don’t want to “wait around for IT”, and if I have the ability, I’ll go grab what I need to understand its value. But before I begin, I find that following the “CRISP-DM” or Cross-Industry Standard Process for Data Mining methodology. 1. Business Understanding: Determining the business objectives & success criteria, and compiling background on the Business concepts 2. Data Understanding: Basically taking an inventory of the Data. What’s it going to take to get the data, understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives. 3. Data Prep: Finally we get into data wrangling and data discovery, the first element suggests that it will take some “cow wrestling” aka data wrangling to get our data. We’ll need to identify permissions, paths, methods for capture & extraction, staging requirements, change data requirements and of course data quality & governance requirements. Did I mention, quickly? Wrangling is all about moving quicking, efficiently, without the need for tons of ETL/EAI programming. (Yes there maybe a need for this later, but its all about speed right now). One of the more famous quotes from an Analyst was, they spend 70% of their time just “getting data”, speed that up and look at all the time you have left to truly analyze it. 4. Modeling: Nope not fashion or data modeling, but in the area of Data Science Models. There are various techniques, some look at clustering, others look for outlayers. Some models forecast or predict, while other models focus on regressions and paths. The key here is that most modeling works on very flat, and sometimes large datasets. Repeative queries or optimization for some models means that storing data into a ‘cache’ system improves search & calculations. But cache is system dependent on the amount of physical memory (RAM) for its speed. Yes, cache can be virtualized, thus expandable until all resources are consumed, but once cache exceeds the amount of RAM, there is a tremendous slow down in processing as I/O begins to thrash about reading & writing data to disk. The important issue to remember here, is that some modeling consumes large amounts of cache creating ‘datasets’ in memory. 5. Evaluation: Well here it is, is this a Eureka momemnt or back to the drawing board? At this stage in the project, we’ve created models (most likely using subsets of the data), but before deploying our models we need to evaluate them… not just from a performance perspective but from a “Truth or Consequences” approach. Were our messurements and analysis statistically sufficient (not just accurate)? 6. Deployment: So this is when we turn to IT and go “Here”, and they look at you and go “HA!” Data Wrangling is a far cry from Operational ETL, but if done correctly, methodically, repeatable then moving from Concept to Production is quite achievable if not down right easy. Data Mapping, Data Validation, Data Governance has all been documented (right) and moving from Wrangling to ETL can be as easy as adding on utilities which support operations (scheduling, maintenance, security, deployment).
The objectives of Ibotta are to test me to see if I have the skills (what it takes) to get the job done. Looking at the brief requirements, I’ve been asked to grab a couple of files (which can be large & change over time), load & transform the data into a resilient data store, then explore the data looking for patterns. What Ibotta is truly looking for is, my approach - as the data has very little value to the Ibotta busines objectives, which would be:
*** Ibotta is a Denver based mobile technology company that enables users with its smartphone app to earn cash back on in-store and mobile purchases with receipt and/or purchase verification.
The Business Objective, is to find a person who understands, big data, data integration, and perhaps some of the analysis used in big data. Cool!
*** Note: In this document I will show you THREE APPROACH methods (mentioned above), some will be thru executable code, images, or a combination there of.
With Big Data business like Ibotta are faced with big issues, big data issues; like dealing with Web data (311 and Traffic Accidents files from the web are sufficient to mimic “big data”). While big data issues surround: 1. Volume - data volumes exceeding Terabytes 2. Velocity - inbound data streams 3. Variety - unstructured (semi or other) and strucutred 4. Voracity (yes my own term) - the business appetite for more data & now! Think Ferocious but with a V.
The data for this test is from the web, providers identified as “www.denvergov.org/opendata”.
So how does 311 and Police data apply here? Lets discover this by looking at the data.
Download data about Denver 311 service requests and traffic accidents. This data set, which is not perfect, contains events in the City and County of Denver for the previous 12 months. Even as the data grows at a high rate, it should be reasonably queryable for the tasks below. ### First get the Denver 311 file, set the download, and unzip file name
download311File <- "data/311_service_requests.csv.zip"
download311URL <- "https://s3.amazonaws.com/ibotta-data-engineer-test/311_service_requests.csv.zip"
if(!file.exists("./data")) { dir.create("./data")}
if (!file.exists(download311File)) {
download.file(download311URL, download311File, method = "curl");
# unzip(downloadFile, overwrite = T, exdir = ".")
}
We first read in the Denver 311 Service Request data from the raw text file included in the zip archive. The data is a comma delimited file were fields are delimited with the , character and missing values are coded as “” fields. The first row of the file contains the header data.
df0 <- read.table("data/311_service_requests.csv", header = TRUE, sep = ",", na.strings = "")
After reading in the 311 data we check the first few rows (there are 477,502) rows in this dataset.
dim(df0)
## [1] 477502 24
head(df0[, 1:13])
## Case.Summary Case.Status
## 1 NIS - Sidewalk Closed - No Violation Found
## 2 NIS - Snow Removal Complaint/Sidewalk Closed - Snow Advisory
## 3 NIS - Snow Removal Complaint/Sidewalk Closed - Not Enough Info
## 4 NIS - Snow Removal Complaint/Sidewalk Closed - Snow Advisory
## 5 NIS - Parking on Lawn Routed to Agency
## 6 NIS - Sign Violation Routed to Agency
## Case.Source Case.Created.Date Case.Created.dttm Case.Closed.Date
## 1 PocketGov 12/31/2016 12/31/2016 01:51:04 PM <NA>
## 2 PocketGov 12/31/2016 12/31/2016 10:32:01 AM <NA>
## 3 PocketGov 12/31/2016 12/31/2016 03:25:16 PM 01/03/2017
## 4 PocketGov 12/31/2016 12/31/2016 03:45:07 PM <NA>
## 5 PocketGov 12/31/2016 12/31/2016 09:29:07 AM <NA>
## 6 PocketGov 12/31/2016 12/31/2016 09:06:18 PM <NA>
## Case.Closed.dttm First.Call.Resolution Customer.Zip.Code
## 1 <NA> N <NA>
## 2 <NA> N 80211
## 3 01/03/2017 12:16:35 PM Y 11111
## 4 <NA> N 80218
## 5 <NA> N 80220
## 6 <NA> N 80246
## Incident.Address.1 Incident.Address.2 Incident.Intersection.1
## 1 3001 S Newport St <NA> <NA>
## 2 4251 N Clay St <NA> <NA>
## 3 <NA> <NA> N Jackson St
## 4 1301 S Beach Ct <NA> <NA>
## 5 1140 N Roslyn St <NA> <NA>
## 6 1295 S Colorado Blvd <NA> <NA>
## Incident.Intersection.2
## 1 <NA>
## 2 <NA>
## 3 E 17th Ave
## 4 <NA>
## 5 <NA>
## 6 <NA>
Next we will load the Denver County Traffic Accident report. This dataset includes accidents in the City and County of Denver for the previous five calendar years plus the current year to date. The data is based on the National Incident Based Reporting System (NIBRS). The data is dynamic, which allows for additions, deletions and/or modifications at any time, resulting in more accurate information in the database. Due to continuous data entry, the number of records in subsequent extractions are subject to change. Accident data is updated Monday through Friday.
downloadAccidentFile <- "data/traffic_accidents.csv.zip"
downloadAccidentURL <- "https://s3.amazonaws.com/ibotta-data-engineer-test/traffic_accidents.csv.zip"
if(!file.exists("./data")) { dir.create("./data")}
if (!file.exists(downloadAccidentFile)) {
download.file(downloadAccidentURL, downloadAccidentFile, method = "curl");
# unzip(downloadAccidentFile, overwrite = T, exdir = ".")
}
df1 <- read.table("data/traffic_accidents1.csv", header = TRUE, sep = ",", na.strings = "")
After reading in the accident data we check the first few rows (there are 123,664) rows in this dataset.
dim(df1)
## [1] 123664 19
head(df1[, 1:13])
## INCIDENT_ID OFFENSE_ID OFFENSE_CODE OFFENSE_CODE_EXTENSION
## 1 2013331514 2.01333e+15 5441 0
## 2 2013227315 2.01323e+15 5441 0
## 3 2013227351 2.01323e+15 5441 0
## 4 2013227410 2.01323e+15 5401 0
## 5 2013227430 2.01323e+15 5441 0
## 6 2013227451 2.01323e+15 5441 0
## OFFENSE_TYPE_ID OFFENSE_CATEGORY_ID FIRST_OCCURRENCE_DATE
## 1 traffic-accident traffic-accident 7/17/13 16:48
## 2 traffic-accident traffic-accident 5/21/13 9:04
## 3 traffic-accident traffic-accident 5/21/13 15:40
## 4 traffic-accident-hit-and-run traffic-accident 5/21/13 16:09
## 5 traffic-accident traffic-accident 5/21/13 16:21
## 6 traffic-accident traffic-accident 5/21/13 16:41
## LAST_OCCURRENCE_DATE REPORTED_DATE INCIDENT_ADDRESS
## 1 <NA> 7/17/13 16:54 <NA>
## 2 <NA> 5/21/13 15:21 1900 E BUCHTEL BLVD
## 3 <NA> 5/21/13 15:40 <NA>
## 4 <NA> 6/7/13 16:40 S LOWELL BLVD / W GREENWOOD PL
## 5 <NA> 5/21/13 18:08 I25 HWYNB / 20TH ST
## 6 <NA> 5/21/13 16:48 <NA>
## GEO_X GEO_Y GEO_LON
## 1 3157251 1709511 -104.9405
## 2 3150708 1674532 -104.9645
## 3 3167776 1692387 -104.9035
## 4 3130885 1664949 -105.0351
## 5 3139350 1702436 -105.0043
## 6 3148211 1672456 -104.9734
From the Traffic Accident reports we are interested GEO columns, Offense Type and Date Reported. Here we extract that column and print a brief summary.
x1 <- data.frame(df1$OFFENSE_TYPE_ID, df1$REPORTED_DATE, df1$GEO_LAT,df1$GEO_LON)
summary(x1)
## df1.OFFENSE_TYPE_ID df1.REPORTED_DATE
## traf-vehicular-assault : 192 12/3/15 17:29: 6
## traf-vehicular-homicide : 23 12/22/13 1:34: 5
## traffic-accident :85921 1/11/12 12:08: 4
## traffic-accident-dui-duid : 4131 1/19/11 17:19: 4
## traffic-accident-hit-and-run:33397 10/10/11 8:32: 4
## 10/11/11 0:40: 4
## (Other) :123637
## df1.GEO_LAT df1.GEO_LON
## Min. : 0.00 Min. :-105.2
## 1st Qu.:39.70 1st Qu.:-105.0
## Median :39.74 Median :-105.0
## Mean :39.72 Mean :-104.9
## 3rd Qu.:39.76 3rd Qu.:-104.9
## Max. :40.00 Max. : 0.0
##
Missing values are a common problem with environmental data and so we check to se what proportion of the observations are missing (i.e. coded as NA).
mean(is.na(x1)) ## Are missing values important here?
## [1] 0
There were some obvious problems pulling from Ibotta’s AWS url over this weekend as the website reported the file as as unavailable using curl or simplying visiting the webpage. (errors 500, etc.) Similar issues were observed when pulling the zip file from www.data.denvergov.org as it reported they were undergoing maintenance, but luck have it, after 8 hours I was able to pull from their site. ### Size of Data Now then, the files are all that huge, 5+ MB for the Traffic file (zipped) and 15+ MB for the 311 file (zipped) where compression varied on the file, 25 MB and 115 MB uncompressed CSV respectively. Time to uncompress was less than 4 secs (clock), .4 secs (user sandbox cpu). I didn’t measure IO/Stats but I would of assumed it was minor.
print("Data Frame for 311 size is:")
## [1] "Data Frame for 311 size is:"
format(object.size(df0), units="auto")
## [1] "128.9 Mb"
print("Data Frame for Accident Reports size is:")
## [1] "Data Frame for Accident Reports size is:"
format(object.size(df1), units="auto")
## [1] "27.1 Mb"
For large file transfers optimization could be taken by 1. Improving CACHE (when reading from the internet & writing to a persistent storage), but a simple test is to either run ‘top’ or ‘vm_stat’ from a shell. Redirect the output to a file or pipe to tee while observing, either way persist the output.
vm_stat -c 4 1
## Mach Virtual Memory Statistics: (page size of 4096 bytes)
## free active specul inactive throttle wired prgable faults copy 0fill reactive purged file-backed anonymous cmprssed cmprssor dcomprs comprs pageins pageout swapins swapouts
## 5029768 1792158 255068 565151 0 617928 139063 1416183K 1427965 1391727K 9106425 340243 843059 1769318 641347 126044 1263647 74174817 2620185 11939 1191840 6266077
## 5029513 1792037 255068 565151 0 617928 139063 33 0 32 0 0 843059 1769197 641347 126044 0 0 0 0 0 0
## 5029700 1791982 255068 565152 0 617928 139063 22 0 21 0 0 843059 1769143 641347 126044 0 0 0 0 0 0
## 5047275 1774708 255068 565151 0 617766 139063 47 0 42 0 0 843059 1751868 641347 126044 0 0 0 0 0 0
traceroute -w 3 -m 9 amazonaws.com
## traceroute: Warning: amazonaws.com has multiple addresses; using 72.21.210.29
## traceroute to amazonaws.com (72.21.210.29), 9 hops max, 52 byte packets
## 1 10.0.0.1 (10.0.0.1) 3.601 ms 2.219 ms 2.273 ms
## 2 96.120.12.41 (96.120.12.41) 20.217 ms 158.565 ms 21.401 ms
## 3 ae-202-sur01.aurora.co.denver.comcast.net (68.85.221.193) 32.897 ms 11.276 ms 14.905 ms
## 4 ae-24-ar01.denver.co.denver.comcast.net (68.86.103.13) 15.110 ms 14.475 ms 13.211 ms
## 5 be-33652-cr01.1601milehigh.co.ibone.comcast.net (68.86.92.121) 16.864 ms 14.674 ms 15.040 ms
## 6 be-11719-cr02.denver.co.ibone.comcast.net (68.86.86.77) 15.213 ms 14.411 ms 19.957 ms
## 7 be-11724-cr02.dallas.tx.ibone.comcast.net (68.86.84.230) 30.038 ms 29.683 ms 30.577 ms
## 8 be-12441-pe01.1950stemmons.tx.ibone.comcast.net (68.86.89.206) 29.368 ms 29.813 ms 29.752 ms
## 9 50.242.148.102 (50.242.148.102) 30.026 ms 29.607 ms 39.852 ms
When it comes to improving your Network throughput at your data center, there are a few options: a. Dedicated communications line to your data center (this goes the same for going to/from AWS) b. “intelligent” routers - routers w/in your firewall can be optimized to go directly to a specific IP, however AWS uses multiple centers in its virtualization schema, and you may have to add daily (if not sub-hour) path optimization.
x0<-data.frame(df0$Case.Created.Date, df0$Type,df0$Latitude,df0$Longitude)
naX0<-x0[!complete.cases(x0),]
summary(naX0)
## df0.Case.Created.Date df0.Type df0.Latitude
## 01/03/2017: 2135 Inquiry :44367 Min. :19.31
## 12/19/2016: 1782 Request : 3210 1st Qu.:39.74
## 12/27/2016: 1736 Complaint : 633 Median :39.74
## 12/28/2016: 1695 Compliment : 1 Mean :39.73
## 12/29/2016: 1687 Legislation: 1 3rd Qu.:39.74
## 01/04/2017: 1651 (Other) : 0 Max. :43.48
## (Other) :71649 NA's :34123 NA's :53418
## df0.Longitude
## Min. :-112.24
## 1st Qu.:-104.98
## Median :-104.98
## Mean :-104.97
## 3rd Qu.:-104.98
## Max. : 23.83
## NA's :53418
You can see that the “Type” column in the 311 file was frequently empty (null). NA or null strings analysis on Traffic Accident file (again on fields I thought were of interest)
x1<-data.frame(df1$OFFENSE_CODE, df1$REPORTED_DATE, df1$GEO_LAT, df1$GEO_LON)
x1[!complete.cases(x1),]
## [1] df1.OFFENSE_CODE df1.REPORTED_DATE df1.GEO_LAT df1.GEO_LON
## <0 rows> (or 0-length row.names)
As you can see, all values were populated.
As far as distributions of data we can use the summary function to look at our data frame:
print("Summary of 311 Data Frame")
## [1] "Summary of 311 Data Frame"
summary(df0)
## Case.Summary Case.Status
## Graffiti : 5476 Closed - Answer Provided :301455
## dmv : 3652 Closed - Transfered to External Agency: 46926
## Pothole : 3585 Closed - Service Completed : 31003
## registration: 3401 Routed to Agency : 29172
## DMV : 3361 In-Progress : 14549
## (Other) :457988 Closed - Other : 10221
## NA's : 39 (Other) : 44176
## Case.Source Case.Created.Date
## Phone :432859 10/31/2016: 2545
## PocketGov : 34648 10/24/2016: 2491
## Email - 311 : 6218 11/07/2016: 2456
## Email - City Clerk: 2397 11/08/2016: 2406
## Walk-In : 634 10/26/2016: 2403
## (Other) : 741 10/17/2016: 2319
## NA's : 5 (Other) :462882
## Case.Created.dttm Case.Closed.Date
## 07/20/2016 07:48:59 PM: 9 10/31/2016: 2478
## 03/17/2016 09:50:29 PM: 4 06/13/2016: 2476
## 06/28/2016 05:20:14 PM: 4 11/07/2016: 2434
## 01/03/2017 04:57:34 PM: 3 10/24/2016: 2306
## 01/03/2017 11:21:37 AM: 3 10/26/2016: 2179
## 01/03/2017 11:33:41 AM: 3 (Other) :402015
## (Other) :477476 NA's : 63614
## Case.Closed.dttm First.Call.Resolution Customer.Zip.Code
## 11/16/2016 07:00:18 AM: 5 N :172596 1 : 65228
## 02/11/2016 07:43:34 PM: 4 Y :258914 80219 : 16085
## 06/13/2016 07:07:49 PM: 4 NA's: 45992 80205 : 11706
## 07/19/2016 07:49:56 PM: 4 80204 : 10262
## 08/27/2016 06:00:28 AM: 4 80211 : 10088
## (Other) :413867 (Other):198698
## NA's : 63614 NA's :165435
## Incident.Address.1 Incident.Address.2 Incident.Intersection.1
## n/a : 609 Apt A : 96 Alameda: 269
## N/A : 580 Apt 2 : 95 17th : 231
## 201 W Colfax Ave: 445 Apt 1 : 82 6th : 202
## na : 249 Apt B : 82 14th : 193
## 1111 : 170 Unit 101: 80 Colfax : 182
## (Other) :127019 (Other) : 4509 (Other): 22623
## NA's :348430 NA's :472558 NA's :453802
## Incident.Intersection.2 Incident.Zip.Code Longitude
## Broadway: 174 80219 : 11840 Min. :-147.6
## Monaco : 169 80211 : 7961 1st Qu.:-105.0
## Quebec : 164 80205 : 7828 Median :-105.0
## Sheridan: 164 80204 : 7552 Mean :-105.0
## Federal : 158 80210 : 7407 3rd Qu.:-105.0
## (Other) : 22871 (Other): 74446 Max. : 177.3
## NA's :453802 NA's :360468 NA's :53418
## Latitude Agency Division
## Min. :-45.03 Finance :114505 Mode:logical
## 1st Qu.: 39.74 Public Works :111347 NA's:477502
## Median : 39.74 External Agency : 69230
## Mean : 39.73 Environmental Health: 49119
## 3rd Qu.: 39.74 311 : 42608
## Max. : 64.89 (Other) : 90668
## NA's :53418 NA's : 25
## Major.Area Type Topic Council.District
## Mode:logical Inquiry :300798 Mode:logical Min. : 1
## NA's:477502 Request :112052 NA's:477502 1st Qu.: 3
## Complaint : 30372 Median : 7
## Compliment: 154 Mean : 6
## DIA Case : 1 3rd Qu.: 9
## (Other) : 2 Max. :11
## NA's : 34123 NA's :363076
## Police.District Neighborhood
## Min. :1 Mode:logical
## 1st Qu.:2 NA's:477502
## Median :3
## Mean :3
## 3rd Qu.:4
## Max. :7
## NA's :363079
print("Summary of Traffic Accidents Data Frame")
## [1] "Summary of Traffic Accidents Data Frame"
summary(df0)
## Case.Summary Case.Status
## Graffiti : 5476 Closed - Answer Provided :301455
## dmv : 3652 Closed - Transfered to External Agency: 46926
## Pothole : 3585 Closed - Service Completed : 31003
## registration: 3401 Routed to Agency : 29172
## DMV : 3361 In-Progress : 14549
## (Other) :457988 Closed - Other : 10221
## NA's : 39 (Other) : 44176
## Case.Source Case.Created.Date
## Phone :432859 10/31/2016: 2545
## PocketGov : 34648 10/24/2016: 2491
## Email - 311 : 6218 11/07/2016: 2456
## Email - City Clerk: 2397 11/08/2016: 2406
## Walk-In : 634 10/26/2016: 2403
## (Other) : 741 10/17/2016: 2319
## NA's : 5 (Other) :462882
## Case.Created.dttm Case.Closed.Date
## 07/20/2016 07:48:59 PM: 9 10/31/2016: 2478
## 03/17/2016 09:50:29 PM: 4 06/13/2016: 2476
## 06/28/2016 05:20:14 PM: 4 11/07/2016: 2434
## 01/03/2017 04:57:34 PM: 3 10/24/2016: 2306
## 01/03/2017 11:21:37 AM: 3 10/26/2016: 2179
## 01/03/2017 11:33:41 AM: 3 (Other) :402015
## (Other) :477476 NA's : 63614
## Case.Closed.dttm First.Call.Resolution Customer.Zip.Code
## 11/16/2016 07:00:18 AM: 5 N :172596 1 : 65228
## 02/11/2016 07:43:34 PM: 4 Y :258914 80219 : 16085
## 06/13/2016 07:07:49 PM: 4 NA's: 45992 80205 : 11706
## 07/19/2016 07:49:56 PM: 4 80204 : 10262
## 08/27/2016 06:00:28 AM: 4 80211 : 10088
## (Other) :413867 (Other):198698
## NA's : 63614 NA's :165435
## Incident.Address.1 Incident.Address.2 Incident.Intersection.1
## n/a : 609 Apt A : 96 Alameda: 269
## N/A : 580 Apt 2 : 95 17th : 231
## 201 W Colfax Ave: 445 Apt 1 : 82 6th : 202
## na : 249 Apt B : 82 14th : 193
## 1111 : 170 Unit 101: 80 Colfax : 182
## (Other) :127019 (Other) : 4509 (Other): 22623
## NA's :348430 NA's :472558 NA's :453802
## Incident.Intersection.2 Incident.Zip.Code Longitude
## Broadway: 174 80219 : 11840 Min. :-147.6
## Monaco : 169 80211 : 7961 1st Qu.:-105.0
## Quebec : 164 80205 : 7828 Median :-105.0
## Sheridan: 164 80204 : 7552 Mean :-105.0
## Federal : 158 80210 : 7407 3rd Qu.:-105.0
## (Other) : 22871 (Other): 74446 Max. : 177.3
## NA's :453802 NA's :360468 NA's :53418
## Latitude Agency Division
## Min. :-45.03 Finance :114505 Mode:logical
## 1st Qu.: 39.74 Public Works :111347 NA's:477502
## Median : 39.74 External Agency : 69230
## Mean : 39.73 Environmental Health: 49119
## 3rd Qu.: 39.74 311 : 42608
## Max. : 64.89 (Other) : 90668
## NA's :53418 NA's : 25
## Major.Area Type Topic Council.District
## Mode:logical Inquiry :300798 Mode:logical Min. : 1
## NA's:477502 Request :112052 NA's:477502 1st Qu.: 3
## Complaint : 30372 Median : 7
## Compliment: 154 Mean : 6
## DIA Case : 1 3rd Qu.: 9
## (Other) : 2 Max. :11
## NA's : 34123 NA's :363076
## Police.District Neighborhood
## Min. :1 Mode:logical
## 1st Qu.:2 NA's:477502
## Median :3
## Mean :3
## 3rd Qu.:4
## Max. :7
## NA's :363079
On a large scale, data issues need to be tracked by frequency and #of occurances, code can then be developed as an inline process prior to usage by production processes. For example, a feed could be streamed to a pipe that strips out known “bad characters” or purposely strips out columns that would never be used, this is pre-processing as a part of data staging.
In a Data Lake built on AWS S3, we need to insure a few consistent elements. 1. Each Amazon S3 object has data, a key, and metadata. Object key (or key name) uniquely identifies the object in a bucket. Object metadata is a set of name-value pairs. You can set object metadata at the time you upload it. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata. 2. Workloads that include a mix of request types of GET, PUT, DELETE, or GET Bucket (list objects), choosing appropriate key names for your objects will ensure better performance by providing low-latency access to the Amazon S3 index. It will also ensure scalability regardless of the number of requests you send per second. Whereas, workloads that are GET-intensive should be using the Amazon CloudFront content delivery service. 3. One way to introduce randomness to key names is to add a hash string as prefix to the key name. Reversing the key name string lays the groundwork for Amazon S3 to start with most frequently used partitions, one for each distinct first character in the key name.
***Note: I’ve cloned a github repository that demonstrates how to load to S3, see my java program in my github source.
Whilst the arguement will rage well into the century, ETL/ELT or whatever as a Application for Data Integration will continue to be one of the prominent solutions for developing work flows for data processing. The reason why you use a Data Integration program like Talend, Informatica or DataStage is that it provides the developer with a common IDE. You can tie the IDE to source control, to metadata exports, to a number of business use cases, all from a supported Bridge or Common API. Personally, I like Talend; its an open-source product, with tons of input from the user community, which help to keep it out front, plus its cheap! Informatica and DataStage work on a SPOKE & HUB basis, where their products require licensing where ever they run, thus running them on HUGE distributed environments has a HUGE cost. Its one of driving reasons why Map/Reduce grew prominent, it was too expensive to build out an ETL server that scaled with Cost.
Talend also generates JAVA code that can be redistributed w/out Talend attached. For example, we could use Talend IDE to develop the AWS S3 load process, add in whatever meta fields & parameters we need, test it within the Talend Framework then deploy it to our EC2 servers (no licensing required), we can even tell Talend to generate Map/Reduce code (Java wrapper to our program). Of course if you wanted, you could deploy and manage the same program from Talend which will collect & return metadata to its operational repository, but then you’re paying for the license.
Talend Flow to AWS looks like this:
Explore the data to find interesting patterns or trends and then write queries for the resulting reports. The types of questions you might ask include the following:
Looking at the data, the obvious that jumped out at me was lookign at corresponding GEO inforamtion. Forming the “assumption”: ### The most common element shared between the two datasets is their GPS Coordinates.
Here’s the kicker, what does this tell me that is useful to the business (Ibotta)?
Lets looks at the Analysis: We need to load some GIS maps, and project the values as a Plot.
##
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
##
## plot
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
Now that we’ve loaded the Denver County map, lets add some color and see how our Accident’s occur.
Statement FAILS denver_tracts_proj <- spTransform(denver_tracts, CRS(“+init=epsg:26954”)) NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files Error in CRS(“+init=epsg:26954”) : no system list, errno: 2 <– the error is specific to the EPSG stirng “+init=epsg:4326”
Well there’s really no summation data, either file, other than number of occurances at the same GEO locations. But perhaps, there are some correlations in clustering in Zip codes. To do this we would use a dynamic azimuthal equidistant projection (AEQ). An AEQ projection centred on each point will project equal distances in all directions, such as a buffered circle.
library("dbscan")
require(geosphere)
## Loading required package: geosphere
require(fpc)
## Loading required package: fpc
##
## Attaching package: 'fpc'
## The following object is masked from 'package:dbscan':
##
## dbscan
geo.dist = function(df) {
d <- function(i,z){ # z[1:2] contain long, lat
dist <- rep(0,nrow(z))
dist[i:nrow(z)] <- distHaversine(z[i:nrow(z),1:2],z[i,1:2])
return(dist)
}
dm <- do.call(cbind,lapply(1:nrow(df),d,df))
return(as.dist(dm))
}
df<- data.frame(long=df1$GEO_LON, lat=df1$GEO_LAT)
set.seed(123)
smp_size <- floor(0.13 * nrow(df))
train_ind <- sample(seq_len(nrow(df)), size = smp_size)
train <- df[train_ind, ]
test <- df[-train_ind, ]
d<- geo.dist(train) # distance matrix # removed because this is a huge memory hog, trial & error was better method
hc<- hclust(d) # hierarchical clustering
summary(hc)
## Length Class Mode
## merge 32150 -none- numeric
## height 16075 -none- numeric
## order 16076 -none- numeric
## labels 0 -none- NULL
## method 1 -none- character
## call 2 -none- call
## dist.method 0 -none- NULL
# plot(hc, main="Geo Clustering", xlab="Cluster Index", ylab="Distance Between in Miles", col="dark blue")
We overlayed the maps with the incident locations, and assigned a clustering distance on our training data set. Note, that the entire data frame consumes a lot of CPU and memory while looking for Clusters, using trial and error we determined that a 13% sample set gave us a reasonable time to compute (about 4-5 minutes). Something to think about when putting together a larger cluster set. There are several other methods for clustering, dbscan is one, but it too is a memory hog.
Well I hope I’ve demonstrated my abilities to handle large datasets, using sampling, distributions on hash keys, etc. will all improve performance and maintainablity of the data. Hopefully, you’re not having to debug R like I had to on my MAC, but just in case I’ve taken notes on how to rebuild libraries.
Conclusions from the Charts, bad things happen in bad neighboorhoods. ;-)
Thanks!
Al Shain