Project Description

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

The data show that storms and severe weather events are costly not only to property and crops, but also in human lives. The deadliest events are tornadoes, while the costliest events are floods (for property) and droughts (for crops). States on the Gulf Coast and in the Midwest record the most weather-related fatalities and injuries. Wind-related events also rank among the most damaging throughout the U.S. This preliminary analysis points to many opportunities for mitigation and data improvement: the NOAA data set needs considerable tidying, and future data collection should be made as clear as possible.

Data Processing

Loading and Processing the Raw Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

  • Storm Data [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

  • National Weather Service Storm Data Documentation
  • National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

The variables most relevant to this analysis are the following:

  • STATE__ - a numeric variable giving the state FIPS code for the location
  • EVTYPE - a character variable giving the event type (e.g. tornado, flood, etc.)
  • FATALITIES - a numerical variable giving the number of fatalities
  • INJURIES - a numerical variable giving the number of injuries
  • PROPDMG - a numerical variable giving the mantissa for the value of property damage in USD
  • PROPDMGEXP - a character variable giving the exponent for the value of property damage in USD
  • CROPDMG - a numerical variable giving the mantissa for the value of crop damage in USD
  • CROPDMGEXP - a character variable giving the exponent for the value of crop damage in USD

## Download the data file if necessary
if (!file.exists("./repdata_data_StormData.csv.bz2")) {
    fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    ## Download the bz2 archive to destfile
    download.file(fileURL, destfile = "./repdata_data_StormData.csv.bz2", method = "curl")
    remove(fileURL)
}
## Read the CSV once and cache it in a local file.
## read.csv decompresses the bz2 format directly; record the processing time.
if (!file.exists("./NOAA.rdata")) {
    P1 <- Sys.time()
    NOAARawData <- read.csv("repdata_data_StormData.csv.bz2")
    P2 <- Sys.time()
    Processing_Time <- P2 - P1
    print(Processing_Time)
    save(NOAARawData, file = "NOAA.rdata")
}
## Load Data
load(file = "NOAA.rdata")
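
As a quick check that read.csv can read a bzip2-compressed file directly, the following minimal sketch writes a tiny made-up table to a temporary .csv.bz2 file and reads it back. The file name and the toy values are assumptions for illustration only.

```r
## Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

## Write it as a bzip2-compressed CSV in a temporary location
toy_path <- file.path(tempdir(), "toy_storm.csv.bz2")
con <- bzfile(toy_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

## read.csv() detects the compression and decompresses on the fly
toy_back <- read.csv(toy_path)
toy_back
```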

Add FIPS Data to review and analyze

library(readxl)
## Load FIPS data for codes
if (!file.exists("./fips_codes_website.xls")){
        FileURL2 <- "http://www.census.gov/2010census/xls/fips_codes_website.xls"
        ## mode = "wb" keeps the binary .xls file intact on Windows
        download.file(FileURL2, destfile = "./fips_codes_website.xls", mode = "wb")
}
fips_codes_website <- read_excel("./fips_codes_website.xls", col_names = TRUE)
## Fix the spaces in the column names
colnames(fips_codes_website) <- gsub(" ", "_", colnames(fips_codes_website))
## Check out the data
head(fips_codes_website, 2)
## # A tibble: 2 x 7
##   State_Abbreviat… State_FIPS_Code County_FIPS_Code FIPS_Entity_Code ANSI_Code
##   <chr>            <chr>           <chr>            <chr>            <chr>    
## 1 AL               01              067              00124            02403054 
## 2 AL               01              073              00460            02403063 
## # … with 2 more variables: GU_Name <chr>, Entity_Description <chr>

Let’s look at the data first to see what kinds of analysis we can do with it.

colnames(NOAARawData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
head(NOAARawData, 2)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                        14   100 3   0          0       15    25.0
## 2         0                         2   150 2   0          0        0     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2

Create a smaller data frame to work with.

NOAAtidy  <- NOAARawData[ , c("STATE__" , "BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
"CROPDMG", "CROPDMGEXP")]
colnames(NOAAtidy) <- c("State", "Begin_Date", "Event_Type", "Fatalities", "Injuries", "Prop_Damage", "Prop_Damage_exp", "Crop_Damage", "Crop_Damage_exp") 

The date column needs to be cleaned up first. It holds both a date and a time, so I first split the column, then delete the Time column and convert the Begin_Date column to the Date class.

class(NOAAtidy$Begin_Date)
## [1] "character"
## Separate the beginning date column (separate() comes from the tidyr package)
library(tidyr)
NOAAtidy <- separate(NOAAtidy, col = Begin_Date, into = c("Begin_Date", "Time"), sep = " ")
NOAAtidy$Time <- NULL
## Convert the date column from character to Date
NOAAtidy$Begin_Date <- as.Date(NOAAtidy$Begin_Date, "%m/%d/%Y")
## Check to make sure the class is changed
class(NOAAtidy$Begin_Date)
## [1] "Date"
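
The same conversion can be sketched in base R without splitting the column. This is an illustrative alternative, not the approach used above; the two timestamps are examples in the same layout as BGN_DATE.

```r
## Example timestamps in the same "m/d/Y H:M:S" layout as BGN_DATE
raw_dates <- c("4/18/1950 0:00:00", "11/30/2011 0:00:00")

## Drop everything from the first space onward, then parse the date part
date_only <- sub(" .*$", "", raw_dates)
parsed <- as.Date(date_only, format = "%m/%d/%Y")
class(parsed)
```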

The next problem with the data is that the states are just numbers and not the actual state abbreviations. I will use the FIPS codes data to add the abbreviation that matches each code in the State column.

## Make a data frame of the FIPS abbreviation and state code columns
Fips_codes <- as.data.frame(fips_codes_website[, 1:2])
## Get unique codes from FIPS
Fips_codes <- unique(Fips_codes)
## Convert the State FIPS Code column to a number
Fips_codes$State_FIPS_Code <- as.numeric(Fips_codes$State_FIPS_Code)
## Add the state abbreviation to the NOAA data by matching on the FIPS code
NOAAtidy <- merge(NOAAtidy, Fips_codes, by.x = "State", by.y ="State_FIPS_Code",  all.x=TRUE)
## Reorder the columns
NOAAtidy <- NOAAtidy[ , c(1,10, 2:9)]
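
To see what the left join does to codes that have no match in the lookup table, here is a small sketch on made-up data. The code 99 and the fatality counts are assumptions chosen only to show the behavior of all.x = TRUE.

```r
## Toy event table with numeric state codes (99 is a made-up unmatched code)
events <- data.frame(State = c(1, 48, 99), Fatalities = c(2, 5, 1))

## Toy lookup table; codes arrive as zero-padded text, as in the FIPS file
lookup <- data.frame(State_Abbreviation = c("AL", "TX"),
                     State_FIPS_Code = c("01", "48"),
                     stringsAsFactors = FALSE)
lookup$State_FIPS_Code <- as.numeric(lookup$State_FIPS_Code)

## all.x = TRUE keeps every event; unmatched codes get NA as the abbreviation
merged <- merge(events, lookup, by.x = "State", by.y = "State_FIPS_Code", all.x = TRUE)
merged
```

Rows with an NA abbreviation are worth counting before any per-state aggregation, since they will not be attributed to a state.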

We need to clean up the damage variables so that they include the exponent. We will do this by writing a function and applying it to our data frame.

## check to see what values are in exponents
unique(NOAAtidy$Prop_Damage_exp)
##  [1] ""  "K" "M" "B" "m" "+" "0" "5" "?" "6" "4" "3" "2" "h" "7" "H" "-" "1" "8"
unique(NOAAtidy$Crop_Damage_exp)
## [1] ""  "K" "M" "B" "m" "?" "0" "k" "2"
## Function to transform the characters into values
exp_transform <- function(e) {
    # h -> hundred, k -> thousand, m -> million, b -> billion
    if (e %in% c('h', 'H'))
        return(2)
    else if (e %in% c('k', 'K'))
        return(3)
    else if (e %in% c('m', 'M'))
        return(6)
    else if (e %in% c('b', 'B'))
        return(9)
    else if (grepl("^[0-9]+$", e)) # if a digit (pattern test avoids coercion warnings)
        return(as.numeric(e))
    else if (e %in% c('', '-', '?', '+'))
        return(0)
    else {
        stop("Invalid exponent value.")
    }
}
## Apply function to calculate values for property and crop damage 
prop_exp <- sapply(NOAAtidy$Prop_Damage_exp, FUN=exp_transform)
NOAAtidy$Prop_Damage <- NOAAtidy$Prop_Damage * (10 ** prop_exp)
crop_exp <- sapply(NOAAtidy$Crop_Damage_exp, FUN=exp_transform)
NOAAtidy$Crop_Damage <- NOAAtidy$Crop_Damage * (10 ** crop_exp)
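
The same mapping can also be expressed as a vectorized lookup, which avoids calling a function once per row. This is a sketch of an alternative, not the method used above: the name exp_lookup is hypothetical, and unlike exp_transform it falls back to 0 for unrecognized codes instead of stopping with an error.

```r
## Hypothetical vectorized version of the exponent mapping
exp_lookup <- function(e) {
    e <- tolower(e)
    ## Default to 0 for "", "-", "?", "+" and anything unrecognized
    out <- rep(0, length(e))
    ## Single digits are used as the exponent directly
    is_digit <- grepl("^[0-9]$", e)
    out[is_digit] <- as.numeric(e[is_digit])
    ## Letter codes: h -> hundred, k -> thousand, m -> million, b -> billion
    codes <- c(h = 2, k = 3, m = 6, b = 9)
    is_code <- e %in% names(codes)
    out[is_code] <- codes[e[is_code]]
    out
}

## Mantissa times 10 to the power of the exponent gives the value in USD
toy_values <- c(25.0, 2.5, 7) * 10^exp_lookup(c("K", "M", ""))
toy_values
```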

There is a very high number of unique event types. To try to lower this number, I will clean up the event data.

## number of unique event types
length(unique(NOAAtidy$Event_Type))
## [1] 985
## translate all letters to lowercase
NOAAtidy$Event_Type <- tolower(NOAAtidy$Event_Type)
## remove leading and trailing whitespace
NOAAtidy$Event_Type <- trimws(NOAAtidy$Event_Type)
## replace every blank and punctuation character with an underscore
NOAAtidy$Event_Type <- gsub("[[:blank:][:punct:]+]", "_", NOAAtidy$Event_Type)
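
A small sketch of the effect of these three steps, using made-up labels that differ only in case, spacing, and punctuation. The pattern below omits the literal + from the character class, which matches the same characters because + is already a punctuation character.

```r
## Made-up raw labels showing case, spacing, and punctuation variants
raw_types <- c(" TSTM WIND", "Flash Flood", "FLOOD/FLASH FLOOD", "tstm wind")

clean_types <- tolower(raw_types)
clean_types <- trimws(clean_types)
## Each blank or punctuation character becomes one underscore
clean_types <- gsub("[[:blank:][:punct:]]", "_", clean_types)

## Four raw labels collapse to three distinct cleaned labels
unique(clean_types)
```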

Cleaning the data introduced some NAs. Let’s replace those NAs with 0s: because we only compute total sums, not averages, this will not affect our final analysis.

NOAAtidy[is.na(NOAAtidy)] <- 0
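
The reasoning about sums versus averages can be checked on a toy vector (the values are made up):

```r
## Toy fatality counts with one missing value
fatalities <- c(3, NA, 1)

## Replacing NA with 0 gives the same total as ignoring the NA
filled <- fatalities
filled[is.na(filled)] <- 0
sum(filled) == sum(fatalities, na.rm = TRUE)

## The mean, however, changes, which is why this only suits totals
mean(filled)
mean(fatalities, na.rm = TRUE)
```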

After reviewing the data and the codes, I want to combine event types that are essentially identical to make the data easier to understand.

It is important to analyze the names ahead of time so that you do not mislabel the events. For example, if you relabel flood before landslide, then a combined flood/landslide tag ends up looking like only a flood event.

NOAAtidynames <- NOAAtidy
## how many unique event types do we have at first?
length(unique(NOAAtidynames$Event_Type))
## [1] 866
## change tstm_wind --> thunderstorm_wind
NOAAtidynames$Event_Type[grepl("tstm", NOAAtidynames$Event_Type)] <- "thunderstorm_wind"
## change anything with thunderstorm_wind in it to just "thunderstorm_wind"
NOAAtidynames$Event_Type[grepl("thunderstorm", NOAAtidynames$Event_Type)] <- "thunderstorm_wind"
NOAAtidynames$Event_Type[grepl("thun", NOAAtidynames$Event_Type)] <- "thunderstorm_wind"
## clean up hail tags
NOAAtidynames$Event_Type[grepl("hail", NOAAtidynames$Event_Type)] <- "hail"
## clean up the tornado tags
NOAAtidynames$Event_Type[grepl("tornado", NOAAtidynames$Event_Type)] <- "tornado"
NOAAtidynames$Event_Type[grepl("torn", NOAAtidynames$Event_Type)] <- "tornado"
## clean up the landslide tags
NOAAtidynames$Event_Type[grepl("landslide", NOAAtidynames$Event_Type)] <- "landslide"
## clean up the flood tags
NOAAtidynames$Event_Type[grepl("flood", NOAAtidynames$Event_Type)] <- "flood"
## clean up the lightning tags
NOAAtidynames$Event_Type[grepl("lightning", NOAAtidynames$Event_Type)] <- "lightning"
## clean up the snow tags
NOAAtidynames$Event_Type[grepl("snow", NOAAtidynames$Event_Type)] <- "snow"
## clean up the hurricane tags
NOAAtidynames$Event_Type[grepl("hurricane", NOAAtidynames$Event_Type)] <- "hurricane"
## clean up the cold weather tags
NOAAtidynames$Event_Type[grepl("cold", NOAAtidynames$Event_Type)] <- "cold_events"
NOAAtidynames$Event_Type[grepl("hypothermia", NOAAtidynames$Event_Type)] <- "cold_events"
NOAAtidynames$Event_Type[grepl("low_temperature", NOAAtidynames$Event_Type)] <- "cold_events"
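## label ice storms with a capital "I" so that the case-sensitive "ice"
## pattern on the following line leaves them as a separate category
NOAAtidynames$Event_Type[grepl("ice_storm", NOAAtidynames$Event_Type)] <- "Ice_storm"
NOAAtidynames$Event_Type[grepl("ice", NOAAtidynames$Event_Type)] <- "ice_events"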
## clean up the avalanche tags
NOAAtidynames$Event_Type[grepl("avala", NOAAtidynames$Event_Type)] <- "avalanche"
## how many unique event types do we now have?
length(unique(NOAAtidynames$Event_Type))
## [1] 446
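
The effect of relabeling order can be sketched on a few made-up labels. The combined label flood_landslide is an assumption for illustration; the point is only that the first pattern applied decides where a combined label ends up.

```r
## Made-up labels, one of which mentions two event types
labels <- c("flood_landslide", "landslide", "flood")

## Order A: landslide first, then flood
a <- labels
a[grepl("landslide", a)] <- "landslide"
a[grepl("flood", a)] <- "flood"

## Order B: flood first, then landslide
b <- labels
b[grepl("flood", b)] <- "flood"
b[grepl("landslide", b)] <- "landslide"

a  # the combined label ends up as "landslide"
b  # the combined label ends up as "flood"
```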

There are also a lot of events with no damage, or with no fatalities or injuries. Since we are really interested in the events that had an impact on the community, we want to filter these out.

## Remove data without damage and create new data frames to analyze
NOAAtidyDamage <- NOAAtidynames[(NOAAtidynames$Prop_Damage > 0 | NOAAtidynames$Crop_Damage > 0), ]
length(unique(NOAAtidyDamage$Event_Type))
## [1] 166
## Remove data without fatalities or injuries and create new data frames to analyze
NOAAtidyFatalities <- NOAAtidynames[(NOAAtidynames$Fatalities > 0 | NOAAtidynames$Injuries > 0), ]
length(unique(NOAAtidyFatalities$Event_Type))
## [1] 109

In order to look at property and crop damage more clearly, let’s make a smaller data frame with the important information for this analysis.

NOAADamageSum <- NOAAtidyDamage[ , c("State_Abbreviation", "Event_Type", "Prop_Damage", "Crop_Damage")]
## convert Event Type to factor
NOAADamageSum$Event_Type <- as.factor(NOAADamageSum$Event_Type)
## Aggregate damage by event type
DamageEventAgg <- aggregate(Prop_Damage ~ Event_Type, data = NOAADamageSum, FUN=sum)
Prop_Damage_Top <- head(DamageEventAgg[order(DamageEventAgg$Prop_Damage, decreasing = T), ], 20)

Let’s also process the crop damage data for later analysis

## Aggregate Crop Damage
CropEventAgg <- aggregate(Crop_Damage ~ Event_Type, data = NOAADamageSum, FUN=sum)
Crop_Damage_Top <- head(CropEventAgg[order(CropEventAgg$Crop_Damage, decreasing = T), ], 20)
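
The aggregate, sort, and keep-the-top pattern used here can be sketched on a toy data frame. All values below are made up.

```r
## Toy damage records (values are made up)
toy_damage <- data.frame(
    Event_Type = c("flood", "hail", "flood", "tornado", "hail"),
    Prop_Damage = c(100, 20, 50, 80, 5)
)

## Sum damage within each event type
toy_agg <- aggregate(Prop_Damage ~ Event_Type, data = toy_damage, FUN = sum)

## Sort from largest to smallest and keep the top two
toy_top <- head(toy_agg[order(toy_agg$Prop_Damage, decreasing = TRUE), ], 2)
toy_top
```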

Now we need to process the fatalities by the type of event.

NOAAfatalitiesSum  <- NOAAtidyFatalities[ , c("State_Abbreviation", "Event_Type", "Fatalities", "Injuries")]
##convert Event Type to factor
NOAAfatalitiesSum$Event_Type <- as.factor(NOAAfatalitiesSum$Event_Type)
## Aggregate fatalities 
FatalitiesAgg <- aggregate(Fatalities ~ Event_Type, data = NOAAfatalitiesSum, FUN=sum)
Fatalities_Top <- head(FatalitiesAgg[order(FatalitiesAgg$Fatalities, decreasing = T), ], 20)
## Aggregate injuries 
InjuriesAgg <- aggregate(Injuries ~ Event_Type, data = NOAAfatalitiesSum, FUN=sum)
Injuries_Top <- head(InjuriesAgg[order(InjuriesAgg$Injuries, decreasing = T), ], 20)

When looking at the data, I was surprised that the Ice_storm injuries were so high, and I was curious to see why. I selected only the rows containing the Ice_storm results. It turns out that Ohio had one ice storm with 1568 injuries. I decided that this should be noted, but not excluded at this point in the analysis without more information.

## Create a data frame with only the Ice_storm data (filter() comes from the dplyr package)
library(dplyr)
IceStorm <- filter(NOAAfatalitiesSum, Event_Type == "Ice_storm")
IceStorm[65,]
##    State_Abbreviation Event_Type Fatalities Injuries
## 65                 OH  Ice_storm          1     1568
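
One way to spot a dominant record like this without knowing its row number is to look up the maximum and its share of the total. This is a minimal sketch: the 1568 mirrors the Ohio row above, while the other rows are made up.

```r
## Toy injury records; only the OH value comes from the output above
toy_ice <- data.frame(
    State_Abbreviation = c("NY", "OH", "PA"),
    Injuries = c(12, 1568, 3)
)

## Find the single largest record and its share of all injuries
largest <- toy_ice[which.max(toy_ice$Injuries), ]
largest_share <- largest$Injuries / sum(toy_ice$Injuries)
largest
largest_share
```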

I was also interested in which states were the deadliest to live in.

## change the state to a factor
NOAAfatalitiesSum$State_Abbreviation <- as.factor(NOAAfatalitiesSum$State_Abbreviation)
## Aggregate fatalities by state
Fat_stateAgg <- aggregate(Fatalities ~ State_Abbreviation, data = NOAAfatalitiesSum, FUN=sum)
DeathlyStates_Top <- head(Fat_stateAgg[order(Fat_stateAgg$Fatalities, decreasing = T), ], 20)

Lastly, I wanted to find out in what states you were most likely to get injured.

## Aggregate injuries by state
Inj_stateAgg <- aggregate(Injuries ~ State_Abbreviation, data = NOAAfatalitiesSum, FUN=sum)
InjuriesStates_Top <- head(Inj_stateAgg[order(Inj_stateAgg$Injuries, decreasing = T), ], 20)
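
The two per-state aggregations can also be done in a single call by putting both columns on the left-hand side of the formula. This is a sketch of an alternative on made-up data, not the code used above.

```r
## Toy casualty records (values are made up)
toy_cas <- data.frame(
    State_Abbreviation = c("TX", "IL", "TX", "IL"),
    Fatalities = c(1, 4, 2, 0),
    Injuries = c(10, 3, 7, 1)
)

## cbind() on the left-hand side sums both columns in a single call
toy_state <- aggregate(cbind(Fatalities, Injuries) ~ State_Abbreviation,
                       data = toy_cas, FUN = sum)
toy_state
```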

Results

I completed three analyses of the data:

  • What events are the most harmful to a person’s health?
  • What are the economic impacts on property and crops?
  • How does a person’s risk from weather and storm events change between states?

Health Impacts by Event

Let’s look at the top 10 events that cause fatalities.

head(Fatalities_Top[, c("Event_Type", "Fatalities")],10)
##           Event_Type Fatalities
## 86           tornado       5636
## 15    excessive_heat       1903
## 19             flood       1525
## 32              heat        937
## 59         lightning        817
## 85 thunderstorm_wind        756
## 6        cold_events        451
## 73       rip_current        368
## 48         high_wind        248
## 1          avalanche        225

Let’s look at the top 10 events that cause injuries.

head(Injuries_Top[, c("Event_Type", "Injuries")],10)
##            Event_Type Injuries
## 86            tornado    91407
## 85  thunderstorm_wind     9545
## 19              flood     8604
## 15     excessive_heat     6525
## 59          lightning     5231
## 32               heat     2100
## 56          Ice_storm     1990
## 30               hail     1371
## 53          hurricane     1328
## 104      winter_storm     1321

Let’s graph both of these to show the information visually.

## I am going to create these graphs using the base plot system
par(mfrow = c(1, 2), las = 3, mar = c(10, 4, 2, 2), cex = .7)
## Use the full column name Event_Type rather than relying on partial matching
barplot(Fatalities_Top$Fatalities, names.arg = Fatalities_Top$Event_Type, col = "orange",
        main = 'Top 20 Events for Fatalities', ylab = 'Number of Fatalities')
barplot(Injuries_Top$Injuries, names.arg = Injuries_Top$Event_Type, col = 'blue',
        main = 'Top 20 Events for Injuries', ylab = 'Number of Injuries')

According to the data, tornadoes cause the most injuries and deaths. Wind-related events also appear throughout both lists. After tornadoes, heat and floods are among the next most dangerous events. A follow-up question is how exactly people are being harmed in these events, to find out whether there are ways to mitigate the harm before a disaster strikes.

Economic Impact by Event

Let’s look at the top 10 events that cause the most property damage.

head(Prop_Damage_Top[, c("Event_Type", "Prop_Damage")],10)
##            Event_Type  Prop_Damage
## 35              flood 168211639835
## 89          hurricane  84756180010
## 130           tornado  57003318426
## 124       storm_surge  43323536000
## 55               hail  15977564513
## 129 thunderstorm_wind  12785421700
## 132    tropical_storm   7703890550
## 161      winter_storm   6688497251
## 78          high_wind   5270046295
## 154          wildfire   4765114000

Let’s look at the top 10 events that cause the most crop damage.

head(Crop_Damage_Top[, c("Event_Type", "Crop_Damage")], 10)
##            Event_Type Crop_Damage
## 22            drought 13972566000
## 35              flood 12380079100
## 89          hurricane  5515292800
## 92          Ice_storm  5022113500
## 55               hail  3046887623
## 15        cold_events  1416765500
## 129 thunderstorm_wind  1274208988
## 44       frost_freeze  1094186000
## 62         heavy_rain   733399800
## 132    tropical_storm   678346000

Let’s graph the property and crop damage vs. the event.

library(ggplot2)
## I am going to plot these using the ggplot2 package
## Plot the Property Damage
ggplot(Prop_Damage_Top, aes(x=reorder(Event_Type, -Prop_Damage), Prop_Damage)) +
    geom_bar(stat="identity", fill =  "cadetblue4") +
    xlab("") +
    ylab("Property Damage") +
    labs(title = "Property Damage by Event Type from 1950 to 2011") +
    theme(axis.text.x = element_text(angle = 90), axis.text.y = element_text(angle = 90, size =10))

## Plot the Crop Damage
ggplot(Crop_Damage_Top, aes(x=reorder(Event_Type, -Crop_Damage), Crop_Damage)) +
    geom_bar(stat="identity", fill = "darkorchid3") +
    xlab("") +
    ylab("Crop Damage") +
    labs(title = "Crop Damage by Event Type from 1950 to 2011") +
    theme(axis.text.x = element_text(angle = 90), axis.text.y = element_text(angle = 90, size =10))

According to the data, floods are the costliest events for property, and drought has the highest economic impact on crops. Wind and water events like hurricanes and storms are also very costly. Five event types (flood, hurricane, hail, thunderstorm wind, and tropical storm) appear in both top-10 lists, only with different rankings, so an event that is costly to property is often costly to crops as well.

Risk vs. State

Lastly, let’s look at how different parts of the United States have historically been affected by weather and storm events.

Let’s look at the top 10 states by fatalities.

## Most deadly states
head(DeathlyStates_Top, 10)
##    State_Abbreviation Fatalities
## 16                 IL       1421
## 46                 TX       1366
## 40                 PA        846
## 3                  AL        784
## 26                 MO        754
## 11                 FL        746
## 27                 MS        555
## 6                  CA        550
## 4                  AR        530
## 45                 TN        521

Let’s look at the top 10 states by injuries.

## States with the most injuries
head(InjuriesStates_Top, 10)
##    State_Abbreviation Injuries
## 46                 TX    17667
## 26                 MO     8998
## 3                  AL     8742
## 37                 OH     7112
## 27                 MS     6675
## 11                 FL     5918
## 38                 OK     5710
## 16                 IL     5563
## 4                  AR     5550
## 45                 TN     5202

Let’s graph both of these to show the information visually.

## I am going to create these graphs using the base plot system
par(mfrow = c(1, 2), las = 3, mar = c(4, 5, 2, 2), cex = .7)
barplot(DeathlyStates_Top$Fatalities, names.arg = DeathlyStates_Top$State_Abbreviation, col = "brown",
        main ="Highest Fatalities by State", ylab = 'Number of Fatalities')
barplot(InjuriesStates_Top$Injuries, names.arg = InjuriesStates_Top$State_Abbreviation, col = "cyan",
        main = "Highest Injuries by State", ylab = 'Number of Injuries')

According to the data, Illinois has the highest number of weather-related fatalities in the United States. However, Texas is second in fatalities and first in injuries, so a reasonable conclusion is that Texas is the state with the highest combined toll of fatalities and injuries from weather and storms. These are raw totals and are not adjusted for population. The top 20 states are located primarily on the Gulf Coast and in the Midwest. These areas appear to be affected more by extreme weather events than other areas of the country.