Ibotta Data Engineer Take Home Test

Test Synopsis

For the purposes of test delivery the data set here is relatively small (data from Denver’s open data catalog). You should approach this test as though the data were 1000x the actual size and is frequently updated by real time events.

Approach

There are multiple ways to approach this challenge, 1. Using a Data Science/Data Wrangling Approach 2. Using Unix/Linux Shell (combo of Sed/Awk/Shell/Perl for really large files) 3. Using a 3rd party ETL/ELT (Talend, Informatica, or DataStage) And there are mulitple platforms for this challenge, A. Sandbox (local machine non-production) B. Cloud (using AWS and S3 Storage, EC2 or EMR nodes for Processing)

Approach Discussion

There’s really no right way or wrong way, just always seems to be a “better way” and that just depends upon your perspective. If I’m a Data Scientist, I really don’t want to “wait around for IT”, and if I have the ability, I’ll go grab what I need to understand its value. But before I begin, I find that following the “CRISP-DM” or Cross-Industry Standard Process for Data Mining methodology. 1. Business Understanding: Determining the business objectives & success criteria, and compiling background on the Business concepts 2. Data Understanding: Basically taking an inventory of the Data. What’s it going to take to get the data, understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives. 3. Data Prep: Finally we get into data wrangling and data discovery, the first element suggests that it will take some “cow wrestling” aka data wrangling to get our data. We’ll need to identify permissions, paths, methods for capture & extraction, staging requirements, change data requirements and of course data quality & governance requirements. Did I mention, quickly? Wrangling is all about moving quicking, efficiently, without the need for tons of ETL/EAI programming. (Yes there maybe a need for this later, but its all about speed right now). One of the more famous quotes from an Analyst was, they spend 70% of their time just “getting data”, speed that up and look at all the time you have left to truly analyze it. 4. Modeling: Nope not fashion or data modeling, but in the area of Data Science Models. There are various techniques, some look at clustering, others look for outlayers. Some models forecast or predict, while other models focus on regressions and paths. The key here is that most modeling works on very flat, and sometimes large datasets. Repeative queries or optimization for some models means that storing data into a ‘cache’ system improves search & calculations. But cache is system dependent on the amount of physical memory (RAM) for its speed. Yes, cache can be virtualized, thus expandable until all resources are consumed, but once cache exceeds the amount of RAM, there is a tremendous slow down in processing as I/O begins to thrash about reading & writing data to disk. The important issue to remember here, is that some modeling consumes large amounts of cache creating ‘datasets’ in memory. 5. Evaluation: Well here it is, is this a Eureka momemnt or back to the drawing board? At this stage in the project, we’ve created models (most likely using subsets of the data), but before deploying our models we need to evaluate them… not just from a performance perspective but from a “Truth or Consequences” approach. Were our messurements and analysis statistically sufficient (not just accurate)? 6. Deployment: So this is when we turn to IT and go “Here”, and they look at you and go “HA!” Data Wrangling is a far cry from Operational ETL, but if done correctly, methodically, repeatable then moving from Concept to Production is quite achievable if not down right easy. Data Mapping, Data Validation, Data Governance has all been documented (right) and moving from Wrangling to ETL can be as easy as adding on utilities which support operations (scheduling, maintenance, security, deployment).

My Approach

The objectives of Ibotta are to test me to see if I have the skills (what it takes) to get the job done. Looking at the brief requirements, I’ve been asked to grab a couple of files (which can be large & change over time), load & transform the data into a resilient data store, then explore the data looking for patterns. What Ibotta is truly looking for is, my approach - as the data has very little value to the Ibotta busines objectives, which would be:

*** Ibotta is a Denver based mobile technology company that enables users with its smartphone app to earn cash back on in-store and mobile purchases with receipt and/or purchase verification.

The Business Objective, is to find a person who understands, big data, data integration, and perhaps some of the analysis used in big data. Cool!

*** Note: In this document I will show you THREE APPROACH methods (mentioned above), some will be thru executable code, images, or a combination there of.

Requirement #1 - Big Data

With Big Data business like Ibotta are faced with big issues, big data issues; like dealing with Web data (311 and Traffic Accidents files from the web are sufficient to mimic “big data”). While big data issues surround: 1. Volume - data volumes exceeding Terabytes 2. Velocity - inbound data streams 3. Variety - unstructured (semi or other) and strucutred 4. Voracity (yes my own term) - the business appetite for more data & now! Think Ferocious but with a V.

The data for this test is from the web, providers identified as “www.denvergov.org/opendata”.

So how does 311 and Police data apply here? Lets discover this by looking at the data.

Identifying and Downloaing Data

Download data about Denver 311 service requests and traffic accidents. This data set, which is not perfect, contains events in the City and County of Denver for the previous 12 months. Even as the data grows at a high rate, it should be reasonably queryable for the tasks below. ### First get the Denver 311 file, set the download, and unzip file name

Approach 1. Data Wrangling using R

download311File <- "data/311_service_requests.csv.zip"
download311URL <- "https://s3.amazonaws.com/ibotta-data-engineer-test/311_service_requests.csv.zip"

test for data foloder and zip file, if NOT found create

if(!file.exists("./data")) { dir.create("./data")}
if (!file.exists(download311File)) {
  download.file(download311URL, download311File, method = "curl");
#  unzip(downloadFile, overwrite = T, exdir = ".")
}

Reading in the Denver 311 Service Request Data

We first read in the Denver 311 Service Request data from the raw text file included in the zip archive. The data is a comma delimited file were fields are delimited with the , character and missing values are coded as “” fields. The first row of the file contains the header data.

df0 <- read.table("data/311_service_requests.csv", header = TRUE, sep = ",", na.strings = "")

After reading in the 311 data we check the first few rows (there are 477,502) rows in this dataset.

dim(df0)
## [1] 477502     24
head(df0[, 1:13])
##                            Case.Summary                 Case.Status
## 1                        NIS - Sidewalk Closed - No Violation Found
## 2 NIS - Snow Removal Complaint/Sidewalk      Closed - Snow Advisory
## 3 NIS - Snow Removal Complaint/Sidewalk    Closed - Not Enough Info
## 4 NIS - Snow Removal Complaint/Sidewalk      Closed - Snow Advisory
## 5                 NIS - Parking on Lawn            Routed to Agency
## 6                  NIS - Sign Violation            Routed to Agency
##   Case.Source Case.Created.Date      Case.Created.dttm Case.Closed.Date
## 1   PocketGov        12/31/2016 12/31/2016 01:51:04 PM             <NA>
## 2   PocketGov        12/31/2016 12/31/2016 10:32:01 AM             <NA>
## 3   PocketGov        12/31/2016 12/31/2016 03:25:16 PM       01/03/2017
## 4   PocketGov        12/31/2016 12/31/2016 03:45:07 PM             <NA>
## 5   PocketGov        12/31/2016 12/31/2016 09:29:07 AM             <NA>
## 6   PocketGov        12/31/2016 12/31/2016 09:06:18 PM             <NA>
##         Case.Closed.dttm First.Call.Resolution Customer.Zip.Code
## 1                   <NA>                     N              <NA>
## 2                   <NA>                     N             80211
## 3 01/03/2017 12:16:35 PM                     Y             11111
## 4                   <NA>                     N             80218
## 5                   <NA>                     N             80220
## 6                   <NA>                     N             80246
##     Incident.Address.1 Incident.Address.2 Incident.Intersection.1
## 1    3001 S Newport St               <NA>                    <NA>
## 2       4251 N Clay St               <NA>                    <NA>
## 3                 <NA>               <NA>            N Jackson St
## 4      1301 S Beach Ct               <NA>                    <NA>
## 5     1140 N Roslyn St               <NA>                    <NA>
## 6 1295 S Colorado Blvd               <NA>                    <NA>
##   Incident.Intersection.2
## 1                    <NA>
## 2                    <NA>
## 3              E 17th Ave
## 4                    <NA>
## 5                    <NA>
## 6                    <NA>

Next we will load the Denver County Traffic Accident report. This dataset includes accidents in the City and County of Denver for the previous five calendar years plus the current year to date. The data is based on the National Incident Based Reporting System (NIBRS). The data is dynamic, which allows for additions, deletions and/or modifications at any time, resulting in more accurate information in the database. Due to continuous data entry, the number of records in subsequent extractions are subject to change. Accident data is updated Monday through Friday.

Now get the Denver County Traffice Accident file, set the download, and unzip file name

downloadAccidentFile <- "data/traffic_accidents.csv.zip"
downloadAccidentURL <- "https://s3.amazonaws.com/ibotta-data-engineer-test/traffic_accidents.csv.zip"

test for data foloder and zip file, if NOT found create

if(!file.exists("./data")) { dir.create("./data")}
if (!file.exists(downloadAccidentFile)) {
  download.file(downloadAccidentURL, downloadAccidentFile, method = "curl");
#  unzip(downloadAccidentFile, overwrite = T, exdir = ".")
}

NOTE: In test runs, we found multiple issues with the download site (grabbing the ZIP file proved impossible). Downloading the “csv” did however work.

df1 <- read.table("data/traffic_accidents1.csv", header = TRUE, sep = ",", na.strings = "")

After reading in the accident data we check the first few rows (there are 123,664) rows in this dataset.

dim(df1)
## [1] 123664     19
head(df1[, 1:13])
##   INCIDENT_ID  OFFENSE_ID OFFENSE_CODE OFFENSE_CODE_EXTENSION
## 1  2013331514 2.01333e+15         5441                      0
## 2  2013227315 2.01323e+15         5441                      0
## 3  2013227351 2.01323e+15         5441                      0
## 4  2013227410 2.01323e+15         5401                      0
## 5  2013227430 2.01323e+15         5441                      0
## 6  2013227451 2.01323e+15         5441                      0
##                OFFENSE_TYPE_ID OFFENSE_CATEGORY_ID FIRST_OCCURRENCE_DATE
## 1             traffic-accident    traffic-accident         7/17/13 16:48
## 2             traffic-accident    traffic-accident          5/21/13 9:04
## 3             traffic-accident    traffic-accident         5/21/13 15:40
## 4 traffic-accident-hit-and-run    traffic-accident         5/21/13 16:09
## 5             traffic-accident    traffic-accident         5/21/13 16:21
## 6             traffic-accident    traffic-accident         5/21/13 16:41
##   LAST_OCCURRENCE_DATE REPORTED_DATE               INCIDENT_ADDRESS
## 1                 <NA> 7/17/13 16:54                           <NA>
## 2                 <NA> 5/21/13 15:21            1900 E BUCHTEL BLVD
## 3                 <NA> 5/21/13 15:40                           <NA>
## 4                 <NA>  6/7/13 16:40 S LOWELL BLVD / W GREENWOOD PL
## 5                 <NA> 5/21/13 18:08            I25 HWYNB / 20TH ST
## 6                 <NA> 5/21/13 16:48                           <NA>
##     GEO_X   GEO_Y   GEO_LON
## 1 3157251 1709511 -104.9405
## 2 3150708 1674532 -104.9645
## 3 3167776 1692387 -104.9035
## 4 3130885 1664949 -105.0351
## 5 3139350 1702436 -105.0043
## 6 3148211 1672456 -104.9734

From the Traffic Accident reports we are interested GEO columns, Offense Type and Date Reported. Here we extract that column and print a brief summary.

x1 <- data.frame(df1$OFFENSE_TYPE_ID, df1$REPORTED_DATE, df1$GEO_LAT,df1$GEO_LON)
summary(x1)
##                    df1.OFFENSE_TYPE_ID     df1.REPORTED_DATE 
##  traf-vehicular-assault      :  192    12/3/15 17:29:     6  
##  traf-vehicular-homicide     :   23    12/22/13 1:34:     5  
##  traffic-accident            :85921    1/11/12 12:08:     4  
##  traffic-accident-dui-duid   : 4131    1/19/11 17:19:     4  
##  traffic-accident-hit-and-run:33397    10/10/11 8:32:     4  
##                                        10/11/11 0:40:     4  
##                                        (Other)      :123637  
##   df1.GEO_LAT     df1.GEO_LON    
##  Min.   : 0.00   Min.   :-105.2  
##  1st Qu.:39.70   1st Qu.:-105.0  
##  Median :39.74   Median :-105.0  
##  Mean   :39.72   Mean   :-104.9  
##  3rd Qu.:39.76   3rd Qu.:-104.9  
##  Max.   :40.00   Max.   :   0.0  
## 

Missing values are a common problem with environmental data and so we check to se what proportion of the observations are missing (i.e. coded as NA).

mean(is.na(x1))  ## Are missing values important here?
## [1] 0

Discussion of Findings

There were some obvious problems pulling from Ibotta’s AWS url over this weekend as the website reported the file as as unavailable using curl or simplying visiting the webpage. (errors 500, etc.) Similar issues were observed when pulling the zip file from www.data.denvergov.org as it reported they were undergoing maintenance, but luck have it, after 8 hours I was able to pull from their site. ### Size of Data Now then, the files are all that huge, 5+ MB for the Traffic file (zipped) and 15+ MB for the 311 file (zipped) where compression varied on the file, 25 MB and 115 MB uncompressed CSV respectively. Time to uncompress was less than 4 secs (clock), .4 secs (user sandbox cpu). I didn’t measure IO/Stats but I would of assumed it was minor.

print("Data Frame for 311 size is:")
## [1] "Data Frame for 311 size is:"
format(object.size(df0), units="auto")
## [1] "128.9 Mb"
print("Data Frame for Accident Reports size is:")
## [1] "Data Frame for Accident Reports size is:"
format(object.size(df1), units="auto")
## [1] "27.1 Mb"

For large file transfers optimization could be taken by 1. Improving CACHE (when reading from the internet & writing to a persistent storage), but a simple test is to either run ‘top’ or ‘vm_stat’ from a shell. Redirect the output to a file or pipe to tee while observing, either way persist the output.

vm_stat -c 4 1
## Mach Virtual Memory Statistics: (page size of 4096 bytes)
##     free   active   specul inactive throttle    wired  prgable   faults     copy    0fill reactive   purged file-backed anonymous cmprssed cmprssor  dcomprs   comprs  pageins  pageout  swapins swapouts
##  5029768  1792158   255068   565151        0   617928   139063 1416183K  1427965 1391727K  9106425   340243      843059   1769318   641347   126044  1263647 74174817  2620185    11939  1191840  6266077 
##  5029513  1792037   255068   565151        0   617928   139063       33        0       32        0        0      843059   1769197   641347   126044        0        0        0        0        0        0 
##  5029700  1791982   255068   565152        0   617928   139063       22        0       21        0        0      843059   1769143   641347   126044        0        0        0        0        0        0 
##  5047275  1774708   255068   565151        0   617766   139063       47        0       42        0        0      843059   1751868   641347   126044        0        0        0        0        0        0
  1. Improving persistent storage (my sandbox uses a SSD drive), but there are various options when deploying to either attached disk, SAN storage, or Virtual Storage Services (AWS S3). First crack would be to identify the type, size, and specs on your storage, then measure the IO of the devices under a benchmark test. Data distribution within your own storage is easy to identify and optimize (striping, redundancy, etc.) but SAN and VSS are more difficult. If I had storage that I could fully control, I would specify the striping, the redundancy (mirror or RAID level), and could even dedicate specific device types (SSD versus SATA and which gets used first). But there’s one drive, so lets look at AWS. AWS does provide for methods to optimize storage types, see chart below. The key is to optimize value over costs by selecting the right type of storage based on your business needs. In the case of replicated data from the web, I’d recommend either S3 Reduced Redundancy or Glacier, but that’s without knowing who’s using the data. (see figure below)
  2. Improving Network throughput and removing latency. From a bash shell (or thru an APP) I can look at the path from destination to my sandbox, using a command like “traceroute”. It’s a tad futile to optimize my route path from my house as I’m completely dependent upon my Internet Provider (Comcast) however you can pay for “faster speeds”. Visit the site www.speedtest.net Here’s the output:
traceroute -w 3 -m 9 amazonaws.com 
## traceroute: Warning: amazonaws.com has multiple addresses; using 72.21.210.29
## traceroute to amazonaws.com (72.21.210.29), 9 hops max, 52 byte packets
##  1  10.0.0.1 (10.0.0.1)  3.601 ms  2.219 ms  2.273 ms
##  2  96.120.12.41 (96.120.12.41)  20.217 ms  158.565 ms  21.401 ms
##  3  ae-202-sur01.aurora.co.denver.comcast.net (68.85.221.193)  32.897 ms  11.276 ms  14.905 ms
##  4  ae-24-ar01.denver.co.denver.comcast.net (68.86.103.13)  15.110 ms  14.475 ms  13.211 ms
##  5  be-33652-cr01.1601milehigh.co.ibone.comcast.net (68.86.92.121)  16.864 ms  14.674 ms  15.040 ms
##  6  be-11719-cr02.denver.co.ibone.comcast.net (68.86.86.77)  15.213 ms  14.411 ms  19.957 ms
##  7  be-11724-cr02.dallas.tx.ibone.comcast.net (68.86.84.230)  30.038 ms  29.683 ms  30.577 ms
##  8  be-12441-pe01.1950stemmons.tx.ibone.comcast.net (68.86.89.206)  29.368 ms  29.813 ms  29.752 ms
##  9  50.242.148.102 (50.242.148.102)  30.026 ms  29.607 ms  39.852 ms

When it comes to improving your Network throughput at your data center, there are a few options: a. Dedicated communications line to your data center (this goes the same for going to/from AWS) b. “intelligent” routers - routers w/in your firewall can be optimized to go directly to a specific IP, however AWS uses multiple centers in its virtualization schema, and you may have to add daily (if not sub-hour) path optimization.

Data Quality Findings:

  1. When first analyzing the files, using R, I quickly discovered that the Traffic Accident’s file contained characters that made my read abort. Two lines contained the hash-key “#” which initially stopped my read, but adding comment.char = “#” to the command allowed me to proceed.
  2. NA or null strings analysis on 311 (using fields that were interesting to me)
x0<-data.frame(df0$Case.Created.Date, df0$Type,df0$Latitude,df0$Longitude)
naX0<-x0[!complete.cases(x0),]
summary(naX0)
##  df0.Case.Created.Date        df0.Type      df0.Latitude  
##  01/03/2017: 2135      Inquiry    :44367   Min.   :19.31  
##  12/19/2016: 1782      Request    : 3210   1st Qu.:39.74  
##  12/27/2016: 1736      Complaint  :  633   Median :39.74  
##  12/28/2016: 1695      Compliment :    1   Mean   :39.73  
##  12/29/2016: 1687      Legislation:    1   3rd Qu.:39.74  
##  01/04/2017: 1651      (Other)    :    0   Max.   :43.48  
##  (Other)   :71649      NA's       :34123   NA's   :53418  
##  df0.Longitude    
##  Min.   :-112.24  
##  1st Qu.:-104.98  
##  Median :-104.98  
##  Mean   :-104.97  
##  3rd Qu.:-104.98  
##  Max.   :  23.83  
##  NA's   :53418

You can see that the “Type” column in the 311 file was frequently empty (null). NA or null strings analysis on Traffic Accident file (again on fields I thought were of interest)

x1<-data.frame(df1$OFFENSE_CODE, df1$REPORTED_DATE, df1$GEO_LAT, df1$GEO_LON)
x1[!complete.cases(x1),]
## [1] df1.OFFENSE_CODE  df1.REPORTED_DATE df1.GEO_LAT       df1.GEO_LON      
## <0 rows> (or 0-length row.names)

As you can see, all values were populated.

As far as distributions of data we can use the summary function to look at our data frame:

print("Summary of 311 Data Frame")
## [1] "Summary of 311 Data Frame"
summary(df0)
##        Case.Summary                                    Case.Status    
##  Graffiti    :  5476   Closed - Answer Provided              :301455  
##  dmv         :  3652   Closed - Transfered to External Agency: 46926  
##  Pothole     :  3585   Closed - Service Completed            : 31003  
##  registration:  3401   Routed to Agency                      : 29172  
##  DMV         :  3361   In-Progress                           : 14549  
##  (Other)     :457988   Closed - Other                        : 10221  
##  NA's        :    39   (Other)                               : 44176  
##              Case.Source      Case.Created.Date 
##  Phone             :432859   10/31/2016:  2545  
##  PocketGov         : 34648   10/24/2016:  2491  
##  Email - 311       :  6218   11/07/2016:  2456  
##  Email - City Clerk:  2397   11/08/2016:  2406  
##  Walk-In           :   634   10/26/2016:  2403  
##  (Other)           :   741   10/17/2016:  2319  
##  NA's              :     5   (Other)   :462882  
##               Case.Created.dttm    Case.Closed.Date 
##  07/20/2016 07:48:59 PM:     9   10/31/2016:  2478  
##  03/17/2016 09:50:29 PM:     4   06/13/2016:  2476  
##  06/28/2016 05:20:14 PM:     4   11/07/2016:  2434  
##  01/03/2017 04:57:34 PM:     3   10/24/2016:  2306  
##  01/03/2017 11:21:37 AM:     3   10/26/2016:  2179  
##  01/03/2017 11:33:41 AM:     3   (Other)   :402015  
##  (Other)               :477476   NA's      : 63614  
##                Case.Closed.dttm  First.Call.Resolution Customer.Zip.Code
##  11/16/2016 07:00:18 AM:     5   N   :172596           1      : 65228   
##  02/11/2016 07:43:34 PM:     4   Y   :258914           80219  : 16085   
##  06/13/2016 07:07:49 PM:     4   NA's: 45992           80205  : 11706   
##  07/19/2016 07:49:56 PM:     4                         80204  : 10262   
##  08/27/2016 06:00:28 AM:     4                         80211  : 10088   
##  (Other)               :413867                         (Other):198698   
##  NA's                  : 63614                         NA's   :165435   
##         Incident.Address.1 Incident.Address.2 Incident.Intersection.1
##  n/a             :   609   Apt A   :    96    Alameda:   269         
##  N/A             :   580   Apt 2   :    95    17th   :   231         
##  201 W Colfax Ave:   445   Apt 1   :    82    6th    :   202         
##  na              :   249   Apt B   :    82    14th   :   193         
##  1111            :   170   Unit 101:    80    Colfax :   182         
##  (Other)         :127019   (Other) :  4509    (Other): 22623         
##  NA's            :348430   NA's    :472558    NA's   :453802         
##  Incident.Intersection.2 Incident.Zip.Code   Longitude     
##  Broadway:   174         80219  : 11840    Min.   :-147.6  
##  Monaco  :   169         80211  :  7961    1st Qu.:-105.0  
##  Quebec  :   164         80205  :  7828    Median :-105.0  
##  Sheridan:   164         80204  :  7552    Mean   :-105.0  
##  Federal :   158         80210  :  7407    3rd Qu.:-105.0  
##  (Other) : 22871         (Other): 74446    Max.   : 177.3  
##  NA's    :453802         NA's   :360468    NA's   :53418   
##     Latitude                       Agency       Division      
##  Min.   :-45.03   Finance             :114505   Mode:logical  
##  1st Qu.: 39.74   Public Works        :111347   NA's:477502   
##  Median : 39.74   External Agency     : 69230                 
##  Mean   : 39.73   Environmental Health: 49119                 
##  3rd Qu.: 39.74   311                 : 42608                 
##  Max.   : 64.89   (Other)             : 90668                 
##  NA's   :53418    NA's                :    25                 
##  Major.Area             Type         Topic         Council.District
##  Mode:logical   Inquiry   :300798   Mode:logical   Min.   : 1      
##  NA's:477502    Request   :112052   NA's:477502    1st Qu.: 3      
##                 Complaint : 30372                  Median : 7      
##                 Compliment:   154                  Mean   : 6      
##                 DIA Case  :     1                  3rd Qu.: 9      
##                 (Other)   :     2                  Max.   :11      
##                 NA's      : 34123                  NA's   :363076  
##  Police.District  Neighborhood  
##  Min.   :1        Mode:logical  
##  1st Qu.:2        NA's:477502   
##  Median :3                      
##  Mean   :3                      
##  3rd Qu.:4                      
##  Max.   :7                      
##  NA's   :363079
print("Summary of Traffic Accidents Data Frame")
## [1] "Summary of Traffic Accidents Data Frame"
summary(df0)
##        Case.Summary                                    Case.Status    
##  Graffiti    :  5476   Closed - Answer Provided              :301455  
##  dmv         :  3652   Closed - Transfered to External Agency: 46926  
##  Pothole     :  3585   Closed - Service Completed            : 31003  
##  registration:  3401   Routed to Agency                      : 29172  
##  DMV         :  3361   In-Progress                           : 14549  
##  (Other)     :457988   Closed - Other                        : 10221  
##  NA's        :    39   (Other)                               : 44176  
##              Case.Source      Case.Created.Date 
##  Phone             :432859   10/31/2016:  2545  
##  PocketGov         : 34648   10/24/2016:  2491  
##  Email - 311       :  6218   11/07/2016:  2456  
##  Email - City Clerk:  2397   11/08/2016:  2406  
##  Walk-In           :   634   10/26/2016:  2403  
##  (Other)           :   741   10/17/2016:  2319  
##  NA's              :     5   (Other)   :462882  
##               Case.Created.dttm    Case.Closed.Date 
##  07/20/2016 07:48:59 PM:     9   10/31/2016:  2478  
##  03/17/2016 09:50:29 PM:     4   06/13/2016:  2476  
##  06/28/2016 05:20:14 PM:     4   11/07/2016:  2434  
##  01/03/2017 04:57:34 PM:     3   10/24/2016:  2306  
##  01/03/2017 11:21:37 AM:     3   10/26/2016:  2179  
##  01/03/2017 11:33:41 AM:     3   (Other)   :402015  
##  (Other)               :477476   NA's      : 63614  
##                Case.Closed.dttm  First.Call.Resolution Customer.Zip.Code
##  11/16/2016 07:00:18 AM:     5   N   :172596           1      : 65228   
##  02/11/2016 07:43:34 PM:     4   Y   :258914           80219  : 16085   
##  06/13/2016 07:07:49 PM:     4   NA's: 45992           80205  : 11706   
##  07/19/2016 07:49:56 PM:     4                         80204  : 10262   
##  08/27/2016 06:00:28 AM:     4                         80211  : 10088   
##  (Other)               :413867                         (Other):198698   
##  NA's                  : 63614                         NA's   :165435   
##         Incident.Address.1 Incident.Address.2 Incident.Intersection.1
##  n/a             :   609   Apt A   :    96    Alameda:   269         
##  N/A             :   580   Apt 2   :    95    17th   :   231         
##  201 W Colfax Ave:   445   Apt 1   :    82    6th    :   202         
##  na              :   249   Apt B   :    82    14th   :   193         
##  1111            :   170   Unit 101:    80    Colfax :   182         
##  (Other)         :127019   (Other) :  4509    (Other): 22623         
##  NA's            :348430   NA's    :472558    NA's   :453802         
##  Incident.Intersection.2 Incident.Zip.Code   Longitude     
##  Broadway:   174         80219  : 11840    Min.   :-147.6  
##  Monaco  :   169         80211  :  7961    1st Qu.:-105.0  
##  Quebec  :   164         80205  :  7828    Median :-105.0  
##  Sheridan:   164         80204  :  7552    Mean   :-105.0  
##  Federal :   158         80210  :  7407    3rd Qu.:-105.0  
##  (Other) : 22871         (Other): 74446    Max.   : 177.3  
##  NA's    :453802         NA's   :360468    NA's   :53418   
##     Latitude                       Agency       Division      
##  Min.   :-45.03   Finance             :114505   Mode:logical  
##  1st Qu.: 39.74   Public Works        :111347   NA's:477502   
##  Median : 39.74   External Agency     : 69230                 
##  Mean   : 39.73   Environmental Health: 49119                 
##  3rd Qu.: 39.74   311                 : 42608                 
##  Max.   : 64.89   (Other)             : 90668                 
##  NA's   :53418    NA's                :    25                 
##  Major.Area             Type         Topic         Council.District
##  Mode:logical   Inquiry   :300798   Mode:logical   Min.   : 1      
##  NA's:477502    Request   :112052   NA's:477502    1st Qu.: 3      
##                 Complaint : 30372                  Median : 7      
##                 Compliment:   154                  Mean   : 6      
##                 DIA Case  :     1                  3rd Qu.: 9      
##                 (Other)   :     2                  Max.   :11      
##                 NA's      : 34123                  NA's   :363076  
##  Police.District  Neighborhood  
##  Min.   :1        Mode:logical  
##  1st Qu.:2        NA's:477502   
##  Median :3                      
##  Mean   :3                      
##  3rd Qu.:4                      
##  Max.   :7                      
##  NA's   :363079

On a large scale, data issues need to be tracked by frequency and #of occurances, code can then be developed as an inline process prior to usage by production processes. For example, a feed could be streamed to a pipe that strips out known “bad characters” or purposely strips out columns that would never be used, this is pre-processing as a part of data staging.

In a Data Lake built on AWS S3, we need to insure a few consistent elements. 1. Each Amazon S3 object has data, a key, and metadata. Object key (or key name) uniquely identifies the object in a bucket. Object metadata is a set of name-value pairs. You can set object metadata at the time you upload it. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata. 2. Workloads that include a mix of request types of GET, PUT, DELETE, or GET Bucket (list objects), choosing appropriate key names for your objects will ensure better performance by providing low-latency access to the Amazon S3 index. It will also ensure scalability regardless of the number of requests you send per second. Whereas, workloads that are GET-intensive should be using the Amazon CloudFront content delivery service. 3. One way to introduce randomness to key names is to add a hash string as prefix to the key name. Reversing the key name string lays the groundwork for Amazon S3 to start with most frequently used partitions, one for each distinct first character in the key name.

General Processing of Data to AWS S3 Using Java

  1. Create an AWS credentials profile file as described in Set Up your AWS Credentials for Use with the AWS SDK for Java in the AWS SDK for Java Developer Guide.
  2. Create a new AWS Java project in Eclipse. The project is pre-configured with the AWS SDK for Java.
  3. Copy the code from the section you are reading to your project.
  4. Update the code by providing any required data. For example, if uploading a file, provide the file path and the bucket name.
  5. Run the code. Verify that the object is created by using the AWS Management Console. https://aws.amazon.com/console/

***Note: I’ve cloned a github repository that demonstrates how to load to S3, see my java program in my github source.

Using ETL/ELT for Data Wrangling

Whilst the arguement will rage well into the century, ETL/ELT or whatever as a Application for Data Integration will continue to be one of the prominent solutions for developing work flows for data processing. The reason why you use a Data Integration program like Talend, Informatica or DataStage is that it provides the developer with a common IDE. You can tie the IDE to source control, to metadata exports, to a number of business use cases, all from a supported Bridge or Common API. Personally, I like Talend; its an open-source product, with tons of input from the user community, which help to keep it out front, plus its cheap! Informatica and DataStage work on a SPOKE & HUB basis, where their products require licensing where ever they run, thus running them on HUGE distributed environments has a HUGE cost. Its one of driving reasons why Map/Reduce grew prominent, it was too expensive to build out an ETL server that scaled with Cost.

Talend also generates JAVA code that can be redistributed w/out Talend attached. For example, we could use Talend IDE to develop the AWS S3 load process, add in whatever meta fields & parameters we need, test it within the Talend Framework then deploy it to our EC2 servers (no licensing required), we can even tell Talend to generate Map/Reduce code (Java wrapper to our program). Of course if you wanted, you could deploy and manage the same program from Talend which will collect & return metadata to its operational repository, but then you’re paying for the license.

Talend Flow to AWS looks like this:

Task 3 Analysis

Explore the data to find interesting patterns or trends and then write queries for the resulting reports. The types of questions you might ask include the following:

  • Are there seasonal trends to the different types of events?
  • Which types of events are more common in which geographic areas (defined by coordinates or neighborhood)?
  • How long are typical response/resolution times? Do these differ by type of event, geography, or other factors?
  • What correlations are there between the two data sets?

Looking at the data, the obvious that jumped out at me was lookign at corresponding GEO inforamtion. Forming the “assumption”: ### The most common element shared between the two datasets is their GPS Coordinates.

Here’s the kicker, what does this tell me that is useful to the business (Ibotta)?

Lets looks at the Analysis: We need to load some GIS maps, and project the values as a Plot.

## 
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
## 
##     plot
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

Now that we’ve loaded the Denver County map, lets add some color and see how our Accident’s occur.

Hmm, we’ve hit a snag, seems there is a bug w/our RStudio on MAC which would require a rebuild of the PROJ framework.

Statement FAILS denver_tracts_proj <- spTransform(denver_tracts, CRS(“+init=epsg:26954”)) NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files Error in CRS(“+init=epsg:26954”) : no system list, errno: 2 <– the error is specific to the EPSG stirng “+init=epsg:4326”

Well there’s really no summation data, either file, other than number of occurances at the same GEO locations. But perhaps, there are some correlations in clustering in Zip codes. To do this we would use a dynamic azimuthal equidistant projection (AEQ). An AEQ projection centred on each point will project equal distances in all directions, such as a buffered circle.

library("dbscan")
require(geosphere)
## Loading required package: geosphere
require(fpc)
## Loading required package: fpc
## 
## Attaching package: 'fpc'
## The following object is masked from 'package:dbscan':
## 
##     dbscan
geo.dist = function(df) {
  d <- function(i,z){         # z[1:2] contain long, lat
    dist <- rep(0,nrow(z))
    dist[i:nrow(z)] <- distHaversine(z[i:nrow(z),1:2],z[i,1:2])
    return(dist)
  }
  dm <- do.call(cbind,lapply(1:nrow(df),d,df))
  return(as.dist(dm))
}

df<- data.frame(long=df1$GEO_LON, lat=df1$GEO_LAT)
set.seed(123)
smp_size <- floor(0.13 * nrow(df))
train_ind <- sample(seq_len(nrow(df)), size = smp_size)

train <- df[train_ind, ]
test <- df[-train_ind, ]
d<- geo.dist(train)   # distance matrix # removed because this is a huge memory hog, trial & error was better method
hc<- hclust(d)      # hierarchical clustering
summary(hc)
##             Length Class  Mode     
## merge       32150  -none- numeric  
## height      16075  -none- numeric  
## order       16076  -none- numeric  
## labels          0  -none- NULL     
## method          1  -none- character
## call            2  -none- call     
## dist.method     0  -none- NULL
# plot(hc, main="Geo Clustering", xlab="Cluster Index", ylab="Distance Between in Miles", col="dark blue")

Final plot

We overlayed the maps with the incident locations, and assigned a clustering distance on our training data set. Note, that the entire data frame consumes a lot of CPU and memory while looking for Clusters, using trial and error we determined that a 13% sample set gave us a reasonable time to compute (about 4-5 minutes). Something to think about when putting together a larger cluster set. There are several other methods for clustering, dbscan is one, but it too is a memory hog.

Conclusions

Well I hope I’ve demonstrated my abilities to handle large datasets, using sampling, distributions on hash keys, etc. will all improve performance and maintainablity of the data. Hopefully, you’re not having to debug R like I had to on my MAC, but just in case I’ve taken notes on how to rebuild libraries.

Conclusions from the Charts, bad things happen in bad neighboorhoods. ;-)

Thanks!

Al Shain