Buzz words are one of my least favorite things, but as buzz words go, I can appreciate the term “Data Lake.” It is one of the few buzz words that communicates a meaning very close to its intended definition. As you might imagine, with the advent of large scale data processing, there would be a need to name the location where lots of data resides, ergo, data lake. I personally prefer to call it a series of redundant commodity servers with Direct-Attached Storage, or hyperscale computing with large storage capacity, but I see the need for a shorter name. Also, the person who first coined the phrase could just as easily have called it Data Ocean, but they were modest and settled for Data Lake. I like that.
Not to be outdone, technology vendors, being who they are, came up with their own “original” buzz word, “Data Reservoir”, implying a place where clean, processed data resides. Whatever you want to call it is fine with me, as long as you understand that unless you follow a solid data management process, even your data reservoir could end up being a data landfill (also a popular buzz word). The point of this post is to demonstrate what a real data management process might look like by providing real examples, with real code, and, most importantly, to emphasize the importance of having a good process in place. Once implemented, you will have good data, from which solid analysis can be performed, and from which good decisions can be made.
Data quality suffers because everyone is in a rush, because doing it right is costly, and because it can be genuinely difficult and complex, as you will see when you work through the whole exercise below. It is much easier to take the easy way out. After all, executives don’t realize how bad their data is until they’ve been bitten, and even then it is difficult to trace the problem back to its root cause.
Have you ever wondered how many events were caused by poor data management practices that were subsequently blamed on the executive’s decision making process? Do you think the executives at AIG or Lehman Brothers knew the full scope of their exposure to subprime mortgage loans? Sure, they were aware of problems, and they may even have been given reports that were “directionally correct”, but those reports were still wrong. Maybe the slope of that line should have been much steeper. Of course, greed can easily change the slope of a line. There is no way of telling, but when critical decisions are being made, being “directionally correct” is not good enough.
The flow of data into an organization involves many stages and processes along the way. Each phase is critical to achieving the ultimate goal: providing reliable information that enables accurate conclusions and decisions. This discussion will focus on knowing the data and on cleaning, or tidying, the data. This does not mean that other phases are less critical. Even with tidy data that is well understood, improper data collection procedures, faulty analysis, and misleading reports can be just as devastating as bad data. Every phase is important, and errors at any phase can cause a break in the information flow.
Whether it is a Big Data scenario, or not, data quality must be managed to the maximum extent possible. I am not stating that all Big Data, or any other data, must be managed as closely as what I demonstrate below. However, I am stating that any data that is to be used for analysis upon which decisions will be made could, and, in most cases, should be managed with rigorous data integrity rules enforced.
A quick side note here: these principles apply to more than just data flowing in and out of a database for analysis. The same rigor must be applied in managing data of all kinds. All successful Service Oriented Architecture (SOA) and Enterprise Service Bus implementations have a mature governance process in place. Service contracts are managed as rigorously as a normalized database. Authoritative data sources are defined, security for sensitive data is managed closely, well defined data structures and schemas (canonical models and XML Schemas) are established to enable seamless information exchange, and so on. Managing data at all levels is a complicated matter and requires a great deal of hard work and diligence to ensure its security and trustworthiness.
With the terms Velocity, Volume, and Variety used to define Big Data, images are conjured up of data coming in fast and furious, with little, or no time to plan for this flood of streaming data. However, it does not work that way, or at least it shouldn’t. Yes, a company should have an infrastructure and processes in place that allow for rapid and accurate responses to a fast paced economy. However, this does not mean that data should be integrated without proper due diligence.
Being flexible, agile, and responsive should not translate to being reactionary, or careless. Before Big Data, or any other data, is incorporated into a company’s knowledge stores, planning must occur. This planning involves knowing the data, understanding its value, and ensuring its proper management, and security. This discussion will hopefully highlight the importance of managing data to ensure its integrity.
The use case for this exercise is that a company will routinely download data from a 3rd party vendor. For the purposes of this demonstration, the 3rd party data will come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This data was selected because it is relatively simple, yet complicated enough to demonstrate more advanced topics, and relatively small in size. It consists of three tables: 1) Details, 2) Fatalities, and 3) Locations. Each of these three tables is broken up into 65 files (one for each year starting in 1950 and ending in 2014). Across the three table types, there are a total of 73 columns and approximately 2.3M records. There is also a fourth table called UGC that provides State and County codes. However, there are numerous problems with how this reference data has been used. It will be loaded and discussed briefly.
Obviously, the scope of this effort would be quite different at the enterprise level. Multiply what is done below by an appropriate factor and you will get an idea of the scope of work involved when you are dealing with several hundred distinct tables (not three) with several thousand columns. Getting to know and understand data is not a trivial task.
For this exercise the NOAA storm database was selected. Currently, this data set covers storm activity across the United States starting in January 1950 through December 2014. According to NOAA, “NCDC has performed data reformatting and standardization of event types but has not changed any data values for locations, fatalities, injuries, damage, narratives and any other event specific information. Please refer to the Database Details page for more information: http://www.ncdc.noaa.gov/stormevents/details.jsp”
Given this disclosure, the expectation was that there would be one or two types of errors discovered, bringing me to a rapid conclusion: even with data that has undergone a thorough review, checks should always be performed. Unfortunately, in very little time significant problems were discovered, thus the length of this post. For the sake of brevity, some solutions are only mentioned, and the code does not cover a production level solution.
This is not intended to be critical of NOAA. It simply points out that all data can be corrupt, and enforcement of data integrity is a constant battle if it is not managed properly. NOAA is now on version 3.0 as of May 12, 2014 (http://www.ncdc.noaa.gov/stormevents/versions.jsp), and it appears there is still room for improvement.
Use the following links to download the data used in this post:
NOAA Storm Events Repository - http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
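If you would rather script the retrieval than click through the page, something along these lines works. This is a minimal sketch: it assumes the directory index is plain HTML and that the file names follow the StormEvents_*-ftp_v1.0_dYYYY_cYYYYMMDD.csv.gz pattern shown later in this post (read.csv reads the .gz files directly, so no decompression is needed).

```r
library(stringr)

base_url <- "http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

# Scrape the directory listing for the gzipped CSV file names.
index <- readLines(base_url, warn = FALSE)
files <- unique(unlist(str_extract_all(index,
    "StormEvents_(details|fatalities|locations)-ftp_v1\\.0_d\\d{4}_c\\d{8}\\.csv\\.gz")))

# Download anything not already present into the working directory.
for (f in files) {
    if (!file.exists(f)) {
        download.file(paste0(base_url, f), destfile = f, mode = "wb")
    }
}
```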
The data was downloaded, and loaded using the R Programming language. The data store for this data is an Oracle 12c XE RDBMS. However, this could just as easily be a Hadoop Data Lake, Teradata, DB2, MySQL, or PostgreSQL database (part of the reason the RJDBC package was chosen instead of ROracle was to accommodate various databases - hopefully, this will at least minimize the required modifications).
Note: With data, there are many terms used to define data elements. For instance, in logical modeling the terms ENTITY and ATTRIBUTE are used, whereas in physical modeling TABLE and COLUMNS are used. To avoid confusion, we will use the physical modeling terminology for the remainder of the discussion.
The NOAA data will serve as a good example of how detailed the data study must be. The data could just as easily be banking, lending, financial markets and investment, commodities, vendor inventory lists, marketing data, . . . When data is being incorporated into a company infrastructure, whether it is a data lake or just a couple of million records going into an RDBMS, the concepts proposed could, and should, be implemented in most scenarios.
The first task is to develop a data dictionary (or acquire one from the 3rd Party Source if it exists). NOAA provides a good start to the data dictionary, but it is not complete.
A data dictionary typically consists of a document in spreadsheet, or table format with the following minimum information (a sketch for generating a skeleton dictionary in R follows the list):
- All table and column names.
- Definitions for all tables and columns.
- Each column is further defined by:
  - Data type (character, number, date, timestamp, CLOB, . . . )
  - The data size (how many bytes of information can the column hold?)
  - Domain value constraints (limited list of acceptable values: 0/1, T/F, blue/brown/black, . . .)
  - Whether the column can contain a null or not
  - Whether the column is a primary, foreign, or alternate key
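Once the tables are loaded (the R code appears later in this post), much of this skeleton can be generated programmatically; the definitions, domain values, and key designations still have to be filled in by hand. Here is a minimal sketch of such a generator (the function name and output columns are my own):

```r
# Build a skeleton data dictionary row for every column of a data frame.
dictionary_skeleton <- function(df, table_name) {
    data.frame(
        TABLE_NAME  = table_name,
        COLUMN_NAME = names(df),
        DATA_TYPE   = vapply(df, function(x) class(x)[1], character(1)),
        MAX_LENGTH  = vapply(df, function(x) max(nchar(as.character(x)), 0, na.rm = TRUE), numeric(1)),
        NULLABLE    = vapply(df, function(x) any(is.na(x)), logical(1)),
        DEFINITION  = NA_character_,   # filled in manually from the NOAA documentation
        row.names = NULL, stringsAsFactors = FALSE)
}

# Example: dictionary_skeleton(details, "NOAA_DETAILS")
```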
Here is a sample of the dictionary created for this project. This particular format does not show the column size, but that information is displayed in the next figure, the Physical Data Model:
Depending on the project, it is usually very helpful to develop a data model that depicts the table relationships. I prefer to use Computer Associates’ Erwin Data Modeler. It is an industry standard for enterprise modeling, although there are many tools that would be suitable for smaller projects like this one, and XMLSPY is an excellent tool for depicting XML schemas. The model depicted below reflects how the data is modeled after having run through the exercise below. Previously, all character columns were varchar2(255), and there were no primary or foreign keys assigned.
To develop the data dictionary, I could have used the documents provided by NOAA exclusively and entered each table and column name. However, I tend to be lazy and find it much easier to load the data, tidy it as much as possible, which includes assigning primary, foreign, and alternate keys as necessary and modifying the data types through the use of R (detailed in the code below), and then reverse engineer the database using Erwin. This provides most of what is needed for the data dictionary. If a logical model were required, the names would need to be changed to a more human readable format. Then all I have to do is copy over the definitions and I’m done. So far, so good: this did not require a lot of work, and it exposed a lot of problems with the data, as we will discuss now. Also, please keep in mind, this is four tables total, not a hundred.
Note: The model depicted is not a fully normalized model. The lack of normalization causes a few issues that are discussed below, but this model will suffice for our purposes.
Before heading into any data project, it is best to have some idea of what you are trying to accomplish. What are the requirements? What questions are you trying to answer? Defining the use case is an important step and no data project should be undertaken without clear objectives in mind. Believe it or not, the lack of well defined requirements is a common problem in data management. Agile has come to mean no documentation, and I don’t believe that was the intent. So, for no other reason than to remind everyone of its importance (including myself), we will at least use the following as our high level requirements for this project.
The objective of this analysis is to:
- Download, and automate the loading and storage of all weather data as frequently as made available by NOAA.
- Compare weather related events that resulted in personal injuries or damages.
- Check for changes in patterns by season and location through in-depth data analysis.
library(knitr)
library(dplyr)
library(lubridate)
library(RJDBC)
library(stringr)
options(knitr.table.format = "html")
opts_chunk$set(echo = TRUE, results = "asis", fig.height = 6, fig.width = 8)
drv <- JDBC("oracle.jdbc.OracleDriver",
classPath="~/Documents/Oracle/ojdbc6.jar", " ")
con <- dbConnect(drv, "jdbc:oracle:thin:@shaner-PC:1521/orcl", "shaner", "passwd")
# Clear any leftover objects from a previous run.
rm(list = intersect(c("temp", "details", "fatalities", "locations"), ls()))
# Read every yearly "details" file and append it into one data frame, tagging
# each row with its source file name and that file's row count.
File <- list.files(pattern = "details", full.names = FALSE)
for (i in 1:length(File)) {
    if (exists("details")) {
        temp <- tbl_df(read.csv(File[i], na.strings = "NA",
                                stringsAsFactors = FALSE))
        temp <- mutate(temp, SOURCE_FILE = File[i], NROWS = nrow(temp))
        names(details) <- names(temp)
        details <- rbind(details, temp)
    } else {
        details <- tbl_df(read.csv(File[i], na.strings = "NA",
                                   stringsAsFactors = FALSE))
        details <- mutate(details, SOURCE_FILE = File[i], NROWS = nrow(details))
    }
}
# Same approach for the yearly "fatalities" files.
File <- list.files(pattern = "fatalities", full.names = FALSE)
for (i in 1:length(File)) {
    if (exists("fatalities")) {
        temp <- tbl_df(read.csv(File[i], na.strings = "NA",
                                stringsAsFactors = FALSE))
        temp <- mutate(temp, SOURCE_FILE = File[i], NROWS = nrow(temp))
        names(fatalities) <- names(temp)
        fatalities <- rbind(fatalities, temp)
    } else {
        fatalities <- tbl_df(read.csv(File[i], na.strings = "NA",
                                      stringsAsFactors = FALSE))
        fatalities <- mutate(fatalities, SOURCE_FILE = File[i], NROWS = nrow(fatalities))
    }
}
# And for the yearly "locations" files.
File <- list.files(pattern = "locations", full.names = FALSE)
for (i in 1:length(File)) {
    if (exists("locations")) {
        temp <- tbl_df(read.csv(File[i], na.strings = "NA",
                                stringsAsFactors = FALSE))
        temp <- mutate(temp, SOURCE_FILE = File[i], NROWS = nrow(temp))
        names(locations) <- names(temp)
        locations <- rbind(locations, temp)
    } else {
        locations <- tbl_df(read.csv(File[i], na.strings = "NA",
                                     stringsAsFactors = FALSE))
        locations <- mutate(locations, SOURCE_FILE = File[i], NROWS = nrow(locations))
    }
}
rm(temp)
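As an aside, the three loops above repeat the same logic. A more compact equivalent, sketched here with dplyr’s bind_rows and assuming every yearly file for a given table shares the same column layout, would be:

```r
# Sketch: one helper in place of the three nearly identical loops above.
load_noaa <- function(pattern) {
    files <- list.files(pattern = pattern, full.names = FALSE)
    bind_rows(lapply(files, function(f) {
        d <- tbl_df(read.csv(f, na.strings = "NA", stringsAsFactors = FALSE))
        mutate(d, SOURCE_FILE = f, NROWS = nrow(d))
    }))
}

details    <- load_noaa("details")
fatalities <- load_noaa("fatalities")
locations  <- load_noaa("locations")
```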
# Convert character Dates to POSIXct DateTime fields using Lubridate
DD1 <- select(details, -BEGIN_YEARMONTH, -BEGIN_DAY, -BEGIN_TIME, -END_YEARMONTH,
-END_DAY, -END_TIME, -YEAR, -MONTH_NAME, -EPISODE_NARRATIVE,
-EVENT_NARRATIVE)
DD1$BEGIN_DATE_TIME <- parse_date_time(DD1$BEGIN_DATE_TIME,
c("%d%m%y %H%M%S", "%y%m%d %I%M%S %p", "%y%m%d %H%M", "%y%m%d %H%M%S"))
DD1$END_DATE_TIME <- parse_date_time(DD1$END_DATE_TIME,
c( "%d%m%y %H%M%S", "%y%m%d %I%M%S %p", "%y%m%d %H%M", "%y%m%d %H%M%S"))
kable(head(DD1[,c(2,10:12)], 10), align = "c")
| EVENT_ID | BEGIN_DATE_TIME | CZ_TIMEZONE | END_DATE_TIME |
|---|---|---|---|
| 10096222 | 2050-04-28 14:45:00 | CST | 2050-04-28 14:45:00 |
| 10120412 | 2050-04-29 15:30:00 | CST | 2050-04-29 15:30:00 |
| 10073785 | 2050-06-22 21:00:00 | CST | 2050-06-22 21:00:00 |
| 10099490 | 2050-09-15 17:45:00 | CST | 2050-09-15 17:45:00 |
| 10099491 | 2050-09-16 01:30:00 | CST | 2050-09-16 01:30:00 |
| 10099492 | 2050-10-01 21:00:00 | CST | 2050-10-01 21:00:00 |
| 10104927 | 2050-07-05 18:00:00 | CST | 2050-07-05 18:00:00 |
| 10104928 | 2050-07-05 18:30:00 | CST | 2050-07-05 18:30:00 |
| 10104929 | 2050-07-24 14:40:00 | CST | 2050-07-24 14:40:00 |
| 10104930 | 2050-08-29 16:00:00 | CST | 2050-08-29 16:00:00 |
kable(tail(DD1[,c(2,10:12)], 10), align = "c")
| EVENT_ID | BEGIN_DATE_TIME | CZ_TIMEZONE | END_DATE_TIME |
|---|---|---|---|
| 552886 | 2014-12-24 15:15:00 | EST-5 | 2014-12-24 15:15:00 |
| 552887 | 2014-12-24 15:20:00 | EST-5 | 2014-12-24 15:20:00 |
| 554953 | 2014-12-28 01:56:00 | AKST-9 | 2014-12-30 00:56:00 |
| 546339 | 2014-12-09 16:00:00 | EST-5 | 2014-12-11 02:00:00 |
| 546446 | 2014-12-09 07:00:00 | EST-5 | 2014-12-10 19:00:00 |
| 546504 | 2014-12-24 18:00:00 | EST-5 | 2014-12-25 06:00:00 |
| 546505 | 2014-12-24 18:00:00 | EST-5 | 2014-12-25 06:00:00 |
| 545932 | 2014-12-21 12:30:00 | PST-8 | 2014-12-26 14:00:00 |
| 547176 | 2014-10-14 04:33:00 | EST-5 | 2014-10-14 04:38:00 |
| 547186 | 2014-10-14 04:43:00 | EST-5 | 2014-10-14 04:48:00 |
# The Date columns are not in the same format between the Details and Fatalities
# tables. The Details table uses two digits for the year and Lubridate defaults
# to any number from 00-68 being the 21st Century, and 69-99 being the 20th
# century. You would think someone is still around from the Year 2000
# that remembers the debacle caused by using two digits for the year,
# but apparently not.
# Function to set any two-digit year greater than 40 to the 20th century.
CorrectCentury <- function(x, year = 1940) {
    m <- year(x) %% 100
    year(x) <- ifelse(m > year %% 100, 1900 + m, 2000 + m)
    x
}
DD1 <- mutate(DD1, BEGIN_DATE_TIME = CorrectCentury(BEGIN_DATE_TIME),
END_DATE_TIME = CorrectCentury(END_DATE_TIME))
kable(head(DD1[,c(2,10:12)], 10), align = "c")
| EVENT_ID | BEGIN_DATE_TIME | CZ_TIMEZONE | END_DATE_TIME |
|---|---|---|---|
| 10096222 | 1950-04-28 14:45:00 | CST | 1950-04-28 14:45:00 |
| 10120412 | 1950-04-29 15:30:00 | CST | 1950-04-29 15:30:00 |
| 10073785 | 1950-06-22 21:00:00 | CST | 1950-06-22 21:00:00 |
| 10099490 | 1950-09-15 17:45:00 | CST | 1950-09-15 17:45:00 |
| 10099491 | 1950-09-16 01:30:00 | CST | 1950-09-16 01:30:00 |
| 10099492 | 1950-10-01 21:00:00 | CST | 1950-10-01 21:00:00 |
| 10104927 | 1950-07-05 18:00:00 | CST | 1950-07-05 18:00:00 |
| 10104928 | 1950-07-05 18:30:00 | CST | 1950-07-05 18:30:00 |
| 10104929 | 1950-07-24 14:40:00 | CST | 1950-07-24 14:40:00 |
| 10104930 | 1950-08-29 16:00:00 | CST | 1950-08-29 16:00:00 |
kable(tail(DD1[,c(2,10:12)], 10), align = "c")
| EVENT_ID | BEGIN_DATE_TIME | CZ_TIMEZONE | END_DATE_TIME |
|---|---|---|---|
| 552886 | 2014-12-24 15:15:00 | EST-5 | 2014-12-24 15:15:00 |
| 552887 | 2014-12-24 15:20:00 | EST-5 | 2014-12-24 15:20:00 |
| 554953 | 2014-12-28 01:56:00 | AKST-9 | 2014-12-30 00:56:00 |
| 546339 | 2014-12-09 16:00:00 | EST-5 | 2014-12-11 02:00:00 |
| 546446 | 2014-12-09 07:00:00 | EST-5 | 2014-12-10 19:00:00 |
| 546504 | 2014-12-24 18:00:00 | EST-5 | 2014-12-25 06:00:00 |
| 546505 | 2014-12-24 18:00:00 | EST-5 | 2014-12-25 06:00:00 |
| 545932 | 2014-12-21 12:30:00 | PST-8 | 2014-12-26 14:00:00 |
| 547176 | 2014-10-14 04:33:00 | EST-5 | 2014-10-14 04:38:00 |
| 547186 | 2014-10-14 04:43:00 | EST-5 | 2014-10-14 04:48:00 |
# Unnecessary date fields removed from the data.
DF1 <- select(fatalities, -FAT_YEARMONTH, -FAT_DAY, -FAT_TIME, -EVENT_YEARMONTH)
DF1$FATALITY_DATE <- parse_date_time(DF1$FATALITY_DATE,
c("%y%m%d %I%M%S %p", "%y%m%d %H%M", "%y%m%d %H%M%S", "%d%m%y %H%M%S"))
kable(head(DF1[,1:4], 10), align = "c")
| FATALITY_ID | EVENT_ID | FATALITY_TYPE | FATALITY_DATE |
|---|---|---|---|
| 1005198 | 9981922 | D | 1951-01-01 05:25:00 |
| 1005199 | 10049525 | D | 1950-12-02 12:00:00 |
| 1005200 | 10120403 | D | 1950-11-02 13:50:00 |
| 1005201 | 10120406 | D | 1950-12-02 00:30:00 |
| 1005202 | 10120410 | D | 1950-12-02 12:00:00 |
| 1005203 | 10032625 | D | 1950-12-02 14:00:00 |
| 1005204 | 10032626 | D | 1950-12-02 14:00:00 |
| 1005205 | 10032628 | D | 1950-12-02 14:20:00 |
| 1005206 | 10126027 | D | 1951-01-02 02:00:00 |
| 1005207 | 10032627 | D | 1950-12-02 14:20:00 |
kable(tail(DF1[,1:4], 10), align = "c")
| FATALITY_ID | EVENT_ID | FATALITY_TYPE | FATALITY_DATE |
|---|---|---|---|
| 25236 | 552229 | D | 2073-03-10 03:49:55 |
| 26265 | 553167 | I | 2014-11-12 00:00:00 |
| 26262 | 553043 | D | 2014-11-12 00:00:00 |
| 27256 | 553293 | D | 2066-03-14 07:55:31 |
| 27257 | 553293 | I | 2066-03-14 07:55:31 |
| 27263 | 553363 | D | 2065-10-31 19:54:40 |
| 27264 | 553363 | D | 2065-10-31 19:54:40 |
| 27293 | 553951 | D | 2014-01-12 00:00:00 |
| 27314 | 553953 | D | 2014-05-12 00:00:00 |
| 27296 | 553954 | D | 2067-12-21 20:32:50 |
DF1 <- mutate(DF1, FATALITY_DATE = CorrectCentury(FATALITY_DATE))
kable(head(DF1[,1:4], 10), align = "c")
| FATALITY_ID | EVENT_ID | FATALITY_TYPE | FATALITY_DATE |
|---|---|---|---|
| 1005198 | 9981922 | D | 1951-01-01 05:25:00 |
| 1005199 | 10049525 | D | 1950-12-02 12:00:00 |
| 1005200 | 10120403 | D | 1950-11-02 13:50:00 |
| 1005201 | 10120406 | D | 1950-12-02 00:30:00 |
| 1005202 | 10120410 | D | 1950-12-02 12:00:00 |
| 1005203 | 10032625 | D | 1950-12-02 14:00:00 |
| 1005204 | 10032626 | D | 1950-12-02 14:00:00 |
| 1005205 | 10032628 | D | 1950-12-02 14:20:00 |
| 1005206 | 10126027 | D | 1951-01-02 02:00:00 |
| 1005207 | 10032627 | D | 1950-12-02 14:20:00 |
kable(tail(DF1[,1:4], 10), align = "c")
| FATALITY_ID | EVENT_ID | FATALITY_TYPE | FATALITY_DATE |
|---|---|---|---|
| 25236 | 552229 | D | 1973-03-10 03:49:55 |
| 26265 | 553167 | I | 2014-11-12 00:00:00 |
| 26262 | 553043 | D | 2014-11-12 00:00:00 |
| 27256 | 553293 | D | 1966-03-14 07:55:31 |
| 27257 | 553293 | I | 1966-03-14 07:55:31 |
| 27263 | 553363 | D | 1965-10-31 19:54:40 |
| 27264 | 553363 | D | 1965-10-31 19:54:40 |
| 27293 | 553951 | D | 2014-01-12 00:00:00 |
| 27314 | 553953 | D | 2014-05-12 00:00:00 |
| 27296 | 553954 | D | 1967-12-21 20:32:50 |
DL1 <- select(locations, -YEARMONTH)
Notice that two fields were added to the data to indicate which file the data were retrieved from and how many rows were in the file. The source of data can often shed light on issues as is demonstrated below, and this became necessary to identify where duplicates were coming from.
AlterSession <- "ALTER SESSION SET NLS_DATE_FORMAT=\'YYYY-MON-DD HH24:MI:SS\'"
dbSendUpdate(con, AlterSession)
# Due to scheduling constraints, I took a shortcut here and eliminated the
# NARRATIVE columns.
DD2<- DD1[complete.cases(DD1$EVENT_ID),]
# The EVENT_ID, as defined in the NOAA "Storm Data Export Format, Field names"
# is the primary key for the Details table. The document also states that
# the EVENT_ID connects the Fatalities and Locations tables - which makes
# sense -- a foreign key relationship to the Fatalities and Locations table.
# For each EVENT, there is a location, and each Fatality is associated with
# an EVENT.
# Enforcing this revealed several problems with the data and duplicates
# were discovered. All duplicates must be removed.
# The details table only has two bad records which are removed here.
DD2 <- DD2[!duplicated(DD2$EVENT_ID) ,]
# Changed EPISODE_ID NA values to a 0. JDBC
# was reading the column as NUMERIC but trying to insert character values.
DD2<- mutate(DD2, EPISODE_ID = ifelse(is.na(EPISODE_ID), 0, EPISODE_ID))
# Load the Details data into Oracle, the count will be 1,301,013 records.
dbWriteTable(con, "NOAA_DETAILS", DD2)
[1] TRUE
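A quick sanity check is worth running at this point; a minimal sketch that confirms the row count in Oracle matches the cleaned data frame:

```r
# Sanity check: the table row count should match the data frame just loaded.
loaded <- dbGetQuery(con, "SELECT COUNT(*) AS N FROM NOAA_DETAILS")
stopifnot(loaded$N == nrow(DD2))
```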
############### LOAD Fatalities (DF1) Data into Oracle Database ###############
# There were considerable duplicates in the Fatalities table (2013 of them),
# plus orphaned records with no matching EVENT_ID in the Details table.
DF2 <- DF1[complete.cases(DF1$FATALITY_DATE),]              # Drop rows missing a fatality date
DF2 <- DF2[!duplicated(DF2$FATALITY_ID),]                   # Primary key identified: FATALITY_ID
DF2 <- semi_join(DF2, DD2, by = c("EVENT_ID" = "EVENT_ID")) # Drop orphaned fatality records
dbWriteTable(con, "NOAA_FATALITIES", DF2) # Load the data into Oracle
[1] TRUE
############### LOAD Locations (DL1) Data into Oracle Database ###############
# There are 134 records dropped from the Locations data because there is no
# matching EVENT_ID in the Details table - in other words there is a location
# for an EVENT that does not exist.
#Composite key: EPISODE_ID, EVENT_ID, LOCATION_INDEX
DL2 <- DL1[!duplicated(DL1[,1:3]),]
DL2 <- semi_join( DL2, DD2, by = c("EVENT_ID" = "EVENT_ID"))
# Load the Data for Locations
dbWriteTable(con, "NOAA_LOCATIONS", DL2)
[1] TRUE
################ LOAD UGC Data into Oracle Database ################
# UGC Table provides the State and County FIPS information
UGC <- tbl_df(read.csv("ugc_areas.csv", na.strings="NA", stringsAsFactors = FALSE))
UGC2 <- UGC[!duplicated(UGC[, c("COMBINED_ID", "WFO")]), ]  # De-duplicate on the primary key columns
dbWriteTable(con, "UGC", UGC2)
[1] TRUE
dbSendUpdate(con, "ALTER TABLE UGC
ADD PRIMARY KEY (COMBINED_ID,WFO)")
With the tables created and the data purged of duplicates, we can now assign primary keys and referential integrity constraints (foreign keys).
# Create Primary Key for NOAA_DETAILS (Details table)
dbSendUpdate(con, "ALTER TABLE NOAA_DETAILS
ADD PRIMARY KEY (EVENT_ID)")
#Create a Primary Key unique index on the NOAA_FATALITIES table
dbSendUpdate(con, "ALTER TABLE NOAA_FATALITIES
ADD PRIMARY KEY (FATALITY_ID)")
#Add foreign key
dbSendUpdate(con, "ALTER TABLE NOAA_FATALITIES
ADD (FOREIGN KEY (EVENT_ID) REFERENCES NOAA_DETAILS (EVENT_ID) ON DELETE SET NULL)")
#Create Composite Primary Key and Foreign Key for NOAA_LOCATIONS
dbSendUpdate(con, "ALTER TABLE NOAA_LOCATIONS
ADD PRIMARY KEY (EVENT_ID,EPISODE_ID,LOCATION_INDEX)")
#Add foreign key
dbSendUpdate(con, "ALTER TABLE NOAA_LOCATIONS
ADD (FOREIGN KEY (EVENT_ID) REFERENCES NOAA_DETAILS (EVENT_ID))")
Now that the primary and foreign key constraints are in place, we will go through and determine the appropriate column data type lengths. There is no technical reason a column cannot be sized to hold data much larger than intended; constraining the size is about keeping the data consistent and surfacing anomalies. As is discussed below, there are several errors that would have been very difficult to find had I not looked at the maximum length of each column.
# Define Data Types - and modify as needed
TableName <- list(NOAA_DETAILS = DD2, NOAA_FATALITIES = DF2, NOAA_LOCATIONS = DL2)
for (i in 1:length(TableName)) {
    ColumnName <- names(TableName[[i]])
    DataType <- sapply(TableName[[i]], typeof)
    DataType <- data.frame(cbind(names(TableName[i]), ColumnName, DataType),
                           stringsAsFactors = FALSE, row.names = NULL)
    names(DataType) <- c("TABLE_NAME", "COLUMN_NAME", "RDType")
    for (x in 1:length(ColumnName)) {
        DataType$MAX_LENGTH[x] <- as.numeric(max(str_length(TableName[[i]][[x]]),
                                                 na.rm = TRUE))
    }
    if (exists("Rdt")) {
        Rdt <- rbind(Rdt, DataType)
    } else {
        Rdt <- DataType
    }
}
Rdt <- tbl_df(Rdt)
# After the tables are created, check the data types assigned by JDBC.
# Compare field lengths to ensure that the assigned data types will work.
DD1DataTypes <- tbl_df (data.frame(dbGetFields(conn = con, name = "NOAA_DETAILS"),
stringsAsFactors = FALSE))
DF1DataTypes <- tbl_df (data.frame(dbGetFields(conn = con, name = "NOAA_FATALITIES"),
stringsAsFactors = FALSE))
DL1DataTypes <- tbl_df (data.frame(dbGetFields(conn = con, name = "NOAA_LOCATIONS"),
stringsAsFactors = FALSE))
DTypes <- rbind(DD1DataTypes, DF1DataTypes)
DTypes <- rbind(DTypes, DL1DataTypes)
Rdt <- inner_join(Rdt, DTypes)
## Joining by: c("TABLE_NAME", "COLUMN_NAME")
Rdt <- select(Rdt, TABLE_NAME, COLUMN_NAME, TYPE_NAME,
RDType, COLUMN_SIZE, MAX_LENGTH, IS_NULLABLE, COLUMN_DEF)
ALTER_STATEMENT <- Rdt[Rdt$MAX_LENGTH < 255 & Rdt$TYPE_NAME == "VARCHAR2", ]
if (nrow(ALTER_STATEMENT) != 0) {
    for (i in 1:nrow(ALTER_STATEMENT)) {
        dbSendUpdate(con, paste("alter table ", ALTER_STATEMENT$TABLE_NAME[i],
                                " modify ", ALTER_STATEMENT$COLUMN_NAME[i],
                                " varchar2(", ALTER_STATEMENT$MAX_LENGTH[i], ")", sep = ""))
    }
}
dbDisconnect(con)
[1] TRUE
The above code would need work to accommodate the real problem of adjusting column data types. It is not a great deal of work, but more than I thought was necessary to demonstrate data management concepts. The code above simply checks the maximum length, and shortens the data type if it is less than 255. I already performed a manual check and know there are no columns with more than 255 characters. A bit lazy, but time was running short and this data was in much worse shape than I anticipated.
- First Problem: Transitive Dependency. BEGIN_DAY in table NOAA_DETAILS has a maximum length of 186 characters. In other words, BEGIN_DAY is a non-key attribute that depends on other non-key attributes (month and year). For this to be allowed, there would have to be constraints in place to enforce several date rules (e.g., the number cannot be greater than 31 in Jan, Mar, May, Jul, . . . , or greater than 30 in Apr, Jun, . . . , or greater than 28 in Feb except for leap years, where it cannot be greater than 29, . . . ) – basically what we call a DATE field.
A design like this is just asking for trouble and is really unnecessary when you have a date data type that enforces all of those rules for you. Why did Hadley Wickham develop lubridate if we were going to keep our date fields so neatly parsed? :) For this reason, the attributes pertaining to Dates (YEARMONTH, YEAR, MONTH, HOUR, . . . ) were removed from the data. This type of data is readily available from the DATE and TIME fields provided.
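To make the point concrete, every removed date attribute can be derived on demand from the retained DATE_TIME columns; a quick sketch with lubridate:

```r
# The removed date parts are recoverable from BEGIN_DATE_TIME whenever needed.
DD1 %>%
    mutate(YEAR  = year(BEGIN_DATE_TIME),
           MONTH = month(BEGIN_DATE_TIME, label = TRUE),
           DAY   = day(BEGIN_DATE_TIME),
           HOUR  = hour(BEGIN_DATE_TIME)) %>%
    select(EVENT_ID, BEGIN_DATE_TIME, YEAR, MONTH, DAY, HOUR) %>%
    head()
```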
Here is a table of all columns, and data types that have been retained in the database:
| TABLE_NAME | COLUMN_NAME | TYPE_NAME | RDType | COLUMN_SIZE | MAX_LENGTH |
|---|---|---|---|---|---|
| NOAA_DETAILS | EPISODE_ID | FLOAT | double | 126 | 7 |
| NOAA_DETAILS | EVENT_ID | NUMBER | integer | 38 | 8 |
| NOAA_DETAILS | STATE | VARCHAR2 | character | 20 | 20 |
| NOAA_DETAILS | STATE_FIPS | NUMBER | integer | 38 | 2 |
| NOAA_DETAILS | EVENT_TYPE | VARCHAR2 | character | 30 | 30 |
| NOAA_DETAILS | CZ_TYPE | VARCHAR2 | character | 1 | 1 |
| NOAA_DETAILS | CZ_FIPS | NUMBER | integer | 38 | 3 |
| NOAA_DETAILS | CZ_NAME | VARCHAR2 | character | 200 | 200 |
| NOAA_DETAILS | WFO | VARCHAR2 | character | 3 | 3 |
| NOAA_DETAILS | BEGIN_DATE_TIME | VARCHAR2 | double | 19 | 19 |
| NOAA_DETAILS | CZ_TIMEZONE | VARCHAR2 | character | 6 | 6 |
| NOAA_DETAILS | END_DATE_TIME | VARCHAR2 | double | 19 | 19 |
| NOAA_DETAILS | INJURIES_DIRECT | NUMBER | integer | 38 | 4 |
| NOAA_DETAILS | INJURIES_INDIRECT | NUMBER | integer | 38 | 4 |
| NOAA_DETAILS | DEATHS_DIRECT | NUMBER | integer | 38 | 3 |
| NOAA_DETAILS | DEATHS_INDIRECT | NUMBER | integer | 38 | 2 |
| NOAA_DETAILS | DAMAGE_PROPERTY | VARCHAR2 | character | 18 | 18 |
| NOAA_DETAILS | DAMAGE_CROPS | VARCHAR2 | character | 19 | 19 |
| NOAA_DETAILS | SOURCE | VARCHAR2 | character | 25 | 25 |
| NOAA_DETAILS | MAGNITUDE | FLOAT | double | 126 | 5 |
| NOAA_DETAILS | MAGNITUDE_TYPE | VARCHAR2 | character | 2 | 2 |
| NOAA_DETAILS | FLOOD_CAUSE | VARCHAR2 | character | 28 | 28 |
| NOAA_DETAILS | CATEGORY | NUMBER | integer | 38 | 1 |
| NOAA_DETAILS | TOR_F_SCALE | VARCHAR2 | character | 3 | 3 |
| NOAA_DETAILS | TOR_LENGTH | FLOAT | double | 126 | 5 |
| NOAA_DETAILS | TOR_WIDTH | FLOAT | double | 126 | 6 |
| NOAA_DETAILS | TOR_OTHER_WFO | VARCHAR2 | character | 3 | 3 |
| NOAA_DETAILS | TOR_OTHER_CZ_STATE | VARCHAR2 | character | 2 | 2 |
| NOAA_DETAILS | TOR_OTHER_CZ_FIPS | NUMBER | integer | 38 | 3 |
| NOAA_DETAILS | TOR_OTHER_CZ_NAME | VARCHAR2 | character | 16 | 16 |
| NOAA_DETAILS | BEGIN_RANGE | NUMBER | integer | 38 | 4 |
| NOAA_DETAILS | BEGIN_AZIMUTH | VARCHAR2 | character | 5 | 5 |
| NOAA_DETAILS | BEGIN_LOCATION | VARCHAR2 | character | 48 | 48 |
| NOAA_DETAILS | END_RANGE | VARCHAR2 | character | 7 | 7 |
| NOAA_DETAILS | END_AZIMUTH | VARCHAR2 | character | 5 | 5 |
| NOAA_DETAILS | END_LOCATION | VARCHAR2 | character | 45 | 45 |
| NOAA_DETAILS | BEGIN_LAT | VARCHAR2 | character | 8 | 8 |
| NOAA_DETAILS | BEGIN_LON | VARCHAR2 | character | 9 | 9 |
| NOAA_DETAILS | END_LAT | VARCHAR2 | character | 7 | 7 |
| NOAA_DETAILS | END_LON | VARCHAR2 | character | 9 | 9 |
| NOAA_DETAILS | DATA_SOURCE | VARCHAR2 | character | 50 | 50 |
| NOAA_DETAILS | SOURCE_FILE | VARCHAR2 | character | 51 | 51 |
| NOAA_DETAILS | NROWS | NUMBER | integer | 38 | 5 |
| NOAA_FATALITIES | FATALITY_ID | NUMBER | integer | 38 | 7 |
| NOAA_FATALITIES | EVENT_ID | NUMBER | integer | 38 | 8 |
| NOAA_FATALITIES | FATALITY_TYPE | VARCHAR2 | character | 1 | 1 |
| NOAA_FATALITIES | FATALITY_DATE | VARCHAR2 | double | 19 | 19 |
| NOAA_FATALITIES | FATALITY_AGE | NUMBER | integer | 38 | 3 |
| NOAA_FATALITIES | FATALITY_SEX | VARCHAR2 | character | 1 | 1 |
| NOAA_FATALITIES | FATALITY_LOCATION | VARCHAR2 | character | 28 | 28 |
| NOAA_FATALITIES | SOURCE_FILE | VARCHAR2 | character | 54 | 54 |
| NOAA_FATALITIES | NROWS | NUMBER | integer | 38 | 4 |
| NOAA_LOCATIONS | EPISODE_ID | NUMBER | integer | 38 | 7 |
| NOAA_LOCATIONS | EVENT_ID | NUMBER | integer | 38 | 7 |
| NOAA_LOCATIONS | LOCATION_INDEX | NUMBER | integer | 38 | 1 |
| NOAA_LOCATIONS | RANGE | FLOAT | double | 126 | 6 |
| NOAA_LOCATIONS | AZIMUTH | VARCHAR2 | character | 3 | 3 |
| NOAA_LOCATIONS | LOCATION | VARCHAR2 | character | 48 | 48 |
| NOAA_LOCATIONS | LATITUDE | FLOAT | double | 126 | 7 |
| NOAA_LOCATIONS | LONGITUDE | FLOAT | double | 126 | 9 |
| NOAA_LOCATIONS | LAT2 | NUMBER | integer | 38 | 7 |
| NOAA_LOCATIONS | LON2 | NUMBER | integer | 38 | 8 |
| NOAA_LOCATIONS | SOURCE_FILE | VARCHAR2 | character | 53 | 53 |
| NOAA_LOCATIONS | NROWS | NUMBER | integer | 38 | 5 |
Notice the BEGIN_DATE size, but also CZ_NAME. These are indicative of poor design and a lack of controls. The CZ_NAME is a reference value from the UGC table, yet the longest CZ_NAME in that reference table is 125 characters. There are numerous problems with not having the UGC table referenced by its key.
- Second Problem: Lack of unique and primary key constraints.
- There are two duplicates in the Details table, which indicates that the key is not currently enforced. In the Oracle database we will set the primary key to be the EVENT_ID and remove the two duplicates. As we will see below, there are many more problems with duplications, and they become much more significant. The point here is that any duplication indicates a lack of constraints to enforce integrity, and the error is repeated throughout the data.
- There are 2013 duplicates in the Fatalities table. These duplicates indicate that there is no primary key constraint being enforced. Per the definition in the “Storm Data Export, Field Names” file: “fatality_id Ex: 17582, 17590, 17597, 18222 (ID assigned by NWS to denote the individual fatality that occurred within a storm event)” Individual implies uniqueness, and a primary key is a unique key that identifies a specific row, or observation.
Note: Two duplicates out of 1301016 rows do not in themselves seem troubling. However, keep in mind that every duplicate essentially causes a Cartesian join. Without constraints and checks, errors are magnified. In the Fatalities table there are 2013 duplicates out of 13925 rows, which is significant, and as we will see there are many more problems introduced by the lack of integrity constraints (a quick check for these duplicates is sketched below).
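A quick check against the raw data frames confirms these counts (a sketch):

```r
# Count duplicated primary key values (these should line up with the
# duplicate counts discussed above).
sum(duplicated(DD1$EVENT_ID))
sum(duplicated(DF1$FATALITY_ID))
```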
- Third Problem: Lack of Referential Integrity Constraints. Constraints (e.g., data types, uniqueness, primary keys) and referential integrity constraints enforce rules. Without these rules being enforced, the data loses its meaning and value. The implied rules from reviewing the definitions and studying the data so far are:
- There is one EPISODE, and this EPISODE can consist of multiple EVENTS
- Not every EVENT is associated with an EPISODE
- Each EVENT can appear multiple times within an EPISODE with a unique INDEX providing an azimuth, range, latitude, and longitude reading (the definition provides little information, but I assume the azimuth is to allow for more precise location reading through intersection and resection)
- Each EPISODE should have a minimum of one EVENT
- Each EVENT is associated with a LOCATION (some are missing, which I assume is the result of missing historical data).
- Each FATALITY is associated with one EVENT. There should be one record in the Details table for every record in the Fatalities table.
I am sure this list is not exhaustive, but you get the idea. The Fatalities table, if you look at the data, only provides a few details about the fatality, like date, age, and sex. The rest of the information about the type of weather event is contained in the Details table. There are 28 records in the Fatalities table with no matching record in the Details table. In other words, a fatality without a storm event.
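dplyr’s anti_join makes these orphan checks one-liners; a sketch, run against the de-duplicated data before the semi_join steps above remove the orphans:

```r
# Fatality records whose EVENT_ID has no parent row in the Details data.
nrow(anti_join(DF1[!duplicated(DF1$FATALITY_ID), ], DD2, by = "EVENT_ID"))

# Location records whose EVENT_ID has no parent row in the Details data.
nrow(anti_join(DL1[!duplicated(DL1[, 1:3]), ], DD2, by = "EVENT_ID"))
```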
The following Venn diagram gives you a visual of the differences in the data before and after removing duplicates and orphans, and enforcing referential integrity. Notice that there is some intersection between Fatalities and Locations. However, this just means that the EVENT that occurred at a specific location, caused fatalities. The data does not provide the location of the fatality. Both direct and indirect deaths could have occurred miles away in a hospital a month later.
- Fourth Problem: Denormalized Data. This is a vague problem statement that implies a multitude of problems. There are multiple reasons for enforcing what is called Third Normal Form (3NF) in data, and the subject is far too broad to cover here. However, here are a couple of examples of how the lack of normalization causes problems in the NOAA data.
- The Fatalities table provides a column called FATALITY_TYPE. The acceptable values are “D” for DEATHS_DIRECT and “I” for DEATHS_INDIRECT. However, there is nothing enforcing these values. The values in the Details table for DEATHS_DIRECT and DEATHS_INDIRECT are cumulative – the total number of deaths from the storm event. This implies that there should be the same number of individual records in the Fatalities table as the value provided in the DEATHS_DIRECT column in the Details table. There are 8942 events in the Details table where a fatality occurred either directly or indirectly. The total number of fatalities reported in the Details table for deaths caused directly by a weather event comes to 14939, and the total from indirect causes is 1579, for a total of 16518. Yet there are only 11884 observations in the Fatalities table after the duplicates are removed. Of these, 10584 are deaths caused directly by a storm, and 1300 are deaths caused indirectly by the storm.
So, how many of these match? You can do the checks (a sketch of the comparison follows below), but hopefully you get the point. This is denormalized, and unenforceable. Not good.
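Here is a sketch of that comparison, matching the cumulative counts in the Details table against the individual rows in the Fatalities table per event:

```r
# Per-event death counts from the Details table.
details_deaths <- DD2 %>%
    filter(DEATHS_DIRECT > 0 | DEATHS_INDIRECT > 0) %>%
    select(EVENT_ID, DEATHS_DIRECT, DEATHS_INDIRECT)

# Individual fatality rows rolled up by event.
fatality_rows <- DF2 %>%
    group_by(EVENT_ID) %>%
    summarise(N_DIRECT   = sum(FATALITY_TYPE == "D", na.rm = TRUE),
              N_INDIRECT = sum(FATALITY_TYPE == "I", na.rm = TRUE))

# Events where the two sources disagree, or where one side is missing entirely.
mismatches <- full_join(details_deaths, fatality_rows, by = "EVENT_ID") %>%
    filter(is.na(N_DIRECT) | is.na(DEATHS_DIRECT) |
           DEATHS_DIRECT != N_DIRECT | DEATHS_INDIRECT != N_INDIRECT)
nrow(mismatches)
```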
- There are 464047 records in the NOAA_DETAILS table without a matching record in NOAA_LOCATIONS. This could be acceptable, since the data probably was lost or is not available for some reason. However, there are 134 records in the NOAA_LOCATIONS table with no matching EVENT_ID in the NOAA_DETAILS table (a location without an event). This is not acceptable. Again, this is in the context of storm data, but it could just as easily be account, balance, or transaction data.
- There are numerous other examples, some more severe than others. The UGC data that I alluded to earlier is not referenced by its primary key, and I am not even addressing the fact that columns that have nothing to do with the EVENT type are in each table. A tornado does not have a FLOOD_CAUSE, and a flood does not have a Fujita Scale. Tidy data rule number 3: a separate table for each type of entity. There are other problems caused by this issue which I will not address, but the best rule of thumb is to enforce as many integrity constraints as you can.
- Fifth Problem: Sloppy Data Management Practices. Data collection, management, and analysis are difficult, and mistakes are easy to make. This is why processes, testing, Quality Assurance, Data Quality programs, and the like are established. An example of a poor data management process is when errors are introduced during processing. The duplicates in the NOAA_FATALITIES table would never have occurred had there been referential integrity in place, but barring that, when you know there is a place where errors can be introduced, develop checks to validate the data. In the Fatalities csv files, only a handful of files contain duplicate records, and it appears that during the processing on August 24, September 15, and September 30, 2014, and February 24 and March 26, 2015, some errors were introduced. The following files contain duplicates:
| SOURCE_FILE | NROWS |
|---|---|
| StormEvents_fatalities-ftp_v1.0_d1999_c20140915.csv.gz | 907 |
| StormEvents_fatalities-ftp_v1.0_d2001_c20140824.csv.gz | 465 |
| StormEvents_fatalities-ftp_v1.0_d2002_c20140824.csv.gz | 466 |
| StormEvents_fatalities-ftp_v1.0_d2003_c20140824.csv.gz | 443 |
| StormEvents_fatalities-ftp_v1.0_d2004_c20140824.csv.gz | 369 |
| StormEvents_fatalities-ftp_v1.0_d2005_c20140824.csv.gz | 472 |
| StormEvents_fatalities-ftp_v1.0_d2007_c20140824.csv.gz | 647 |
| StormEvents_fatalities-ftp_v1.0_d2012_c20140930.csv.gz | 640 |
| StormEvents_fatalities-ftp_v1.0_d2013_c20150224.csv.gz | 374 |
| StormEvents_fatalities-ftp_v1.0_d2014_c20150326.csv.gz | 482 |
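A grouping along these lines identifies those files (a sketch; it flags every row whose FATALITY_ID appears more than once and tallies the flagged rows by source file):

```r
# Which source files contributed duplicated FATALITY_IDs, and how large each file is.
DF1 %>%
    filter(duplicated(FATALITY_ID) | duplicated(FATALITY_ID, fromLast = TRUE)) %>%
    group_by(SOURCE_FILE, NROWS) %>%
    summarise(DUPLICATE_ROWS = n())
```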
There would still be some work to do on this data: the property and crop damage figures are not in a good format for analysis, the latitude and longitude values in the Locations table are inconsistent and need standardization, and NOAA did not categorize the storms narrowly enough to be useful. However, I will stop here. If you have made it this far, you are probably a data geek like me, and you get my point(s). Hopefully, you have also developed a better understanding of what’s involved in maintaining data integrity. If you have something to add where I have overlooked or misstated a point, or made an error, I would love to hear from you. Data geeks love data, and for the rare “non-data geek” who made it this far, this was probably a lot like watching paint dry. I appreciate you hanging in there. Until next time!