This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data show that storms and severe weather events are costly not only to property and crops, but also in human lives. Tornadoes are the deadliest events, while floods and droughts are the most costly. Fatalities and injuries are concentrated in states on the Gulf Coast and in the Midwest. High-wind events account for many of the most damaging events throughout the U.S. This preliminary analysis points to many opportunities for mitigation and data improvement: the NOAA data set needs considerable tidying, and future data collection should be made as clear as possible.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
* Storm Data [47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
* National Weather Service Storm Data Documentation
* National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The data that are the most relevant to study for this analysis are the following:
## Download the data file if necessary
if (!file.exists("./repdata_data_StormData.csv.bz2")) {
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## Download the bzip2 archive to destfile
download.file(fileURL, destfile = "./repdata_data_StormData.csv.bz2", method = "curl", cacheOK = TRUE)
remove(fileURL)
}
## Read the raw data once and create a local copy.
## This is checked separately from the download so that NOAARawData
## is still created when the archive already exists.
if (!file.exists("./NOAA.rdata")) {
P1 <- Sys.time()
# read.csv reads the bz2 file directly; record the processing time
NOAARawData <- read.csv("repdata_data_StormData.csv.bz2")
P2 <- Sys.time()
Processing_Time <- P2 - P1
print(Processing_Time)
save(NOAARawData, file = "NOAA.rdata")
}
## Load Data
load(file = "NOAA.rdata")
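As a small, self-contained sketch of why no separate unzip step is needed: base R can write and read bzip2-compressed text through connections, and `read.csv()` detects the compression when given the file name. The temporary file and the toy data frame below are made up for illustration and are not part of the NOAA data.

```r
# Write a tiny CSV through a bzip2 connection, then read it back.
tmp <- tempfile(fileext = ".csv.bz2")   # hypothetical temporary file
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the file and decompresses it transparently
toy_back <- read.csv(tmp)
toy_back
unlink(tmp)
```

The same call pattern is what the download step above relies on for the much larger storm file.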
Add the FIPS data so the state codes can be reviewed and analyzed.
library(readxl)
## Load FIPS Data for codes
if (!file.exists("./fips_codes_website.xls")){
FileURL2 <- "http://www.census.gov/2010census/xls/fips_codes_website.xls"
## mode = "wb" keeps the binary .xls file intact on Windows
download.file(FileURL2, destfile = "./fips_codes_website.xls", mode = "wb")
}
fips_codes_website <- read_excel("./fips_codes_website.xls", col_names = TRUE)
## Fix the spaces in the column names
colnames(fips_codes_website) <- gsub(" ", "_", colnames(fips_codes_website))
## Check out the data
head(fips_codes_website, 2)
## # A tibble: 2 x 7
## State_Abbreviat… State_FIPS_Code County_FIPS_Code FIPS_Entity_Code ANSI_Code
## <chr> <chr> <chr> <chr> <chr>
## 1 AL 01 067 00124 02403054
## 2 AL 01 073 00460 02403063
## # … with 2 more variables: GU_Name <chr>, Entity_Description <chr>
Let’s look at the data first to see what kinds of analysis we can do with it.
colnames(NOAARawData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
head(NOAARawData, 2)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14 100 3 0 0 15 25.0
## 2 0 2 150 2 0 0 0 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
Create a smaller data frame to work with.
NOAAtidy <- NOAARawData[ , c("STATE__" , "BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
"CROPDMG", "CROPDMGEXP")]
colnames(NOAAtidy) <- c("State", "Begin_Date", "Event_Type", "Fatalities", "Injuries", "Prop_Damage", "Prop_Damage_exp", "Crop_Damage", "Crop_Damage_exp")
It looks like the date column needs to be cleaned up first. The column holds both a date and a time, so I first need to split it, then delete the Time column and convert the Begin_Date column to the Date class.
class(NOAAtidy$Begin_Date)
## [1] "character"
## Separate the beginning date column (separate() comes from the tidyr package)
library(tidyr)
NOAAtidy <- separate(NOAAtidy, col = Begin_Date, into = c("Begin_Date", "Time"), sep = " ")
NOAAtidy$Time <- NULL
## Convert the date column from character to Date
NOAAtidy$Begin_Date <- as.Date(NOAAtidy$Begin_Date, "%m/%d/%Y")
## Check to make sure the class is changed
class(NOAAtidy$Begin_Date)
## [1] "Date"
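For comparison, here is a minimal base R sketch of the same conversion. It relies on the documented behavior that `as.Date()` only reads as much of the string as the format needs and ignores trailing characters. The first string matches the format shown in the data preview; the second is an invented example.

```r
raw_dates <- c("4/18/1950 0:00:00", "11/30/2011 0:00:00")

# Option 1: drop everything after the first space, then parse
date_part <- sub(" .*$", "", raw_dates)
parsed <- as.Date(date_part, format = "%m/%d/%Y")

# Option 2: parse directly; the trailing time is ignored
parsed_direct <- as.Date(raw_dates, format = "%m/%d/%Y")

class(parsed)
format(parsed)
identical(parsed, parsed_direct)
```

Either option avoids creating and then deleting a temporary Time column.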
The next problem with the data is that the states are just numbers and not the actual state abbreviations. I will take the FIPS codes data and substitute the information in the State column.
## Make a data frame of the state abbreviations and state FIPS codes
Fips_codes <- as.data.frame(fips_codes_website[, 1:2])
## Get unique codes from FIPS
Fips_codes <- unique(Fips_codes)
## Convert State Fips Code column to a number
Fips_codes$State_FIPS_Code <- as.numeric(Fips_codes$State_FIPS_Code)
## Substitute values in NOAA State column using FIPS codes
NOAAtidy <- merge(NOAAtidy, Fips_codes, by.x = "State", by.y ="State_FIPS_Code", all.x=TRUE)
## Reorder the columns
NOAAtidy <- NOAAtidy[ , c(1,10, 2:9)]
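To see what the left join does, here is a toy sketch of `merge()` with `all.x = TRUE`. The two small data frames are invented for illustration. Codes 1 (AL) and 48 (TX) are real state FIPS codes, while 99 is a deliberately unmatched value.

```r
events <- data.frame(
  State = c(1, 48, 99),
  Event_Type = c("TORNADO", "FLOOD", "HAIL")
)
codes <- data.frame(
  State_Abbreviation = c("AL", "TX"),
  State_FIPS_Code = c(1, 48)
)

# all.x = TRUE keeps every event row, even when no FIPS code matches
merged <- merge(events, codes, by.x = "State", by.y = "State_FIPS_Code", all.x = TRUE)
merged

# Unmatched codes show up as NA in the abbreviation column
sum(is.na(merged$State_Abbreviation))
```

Counting the NA values after the merge is a quick way to check how many events have a state code that is missing from the lookup table.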
We need to clean up the damage variables to include the exponent. We will do this by making a function and applying it to our data frame.
## check to see what values are in exponents
unique(NOAAtidy$Prop_Damage_exp)
## [1] "" "K" "M" "B" "m" "+" "0" "5" "?" "6" "4" "3" "2" "h" "7" "H" "-" "1" "8"
unique(NOAAtidy$Crop_Damage_exp)
## [1] "" "K" "M" "B" "m" "?" "0" "k" "2"
## Function to transform the characters into values
exp_transform <- function(e) {
# h -> hundred, k -> thousand, m -> million, b -> billion
if (e %in% c('h', 'H'))
return(2)
else if (e %in% c('k', 'K'))
return(3)
else if (e %in% c('m', 'M'))
return(6)
else if (e %in% c('b', 'B'))
return(9)
else if (e %in% c('', '-', '?', '+')) # blank or symbol: treat as 10^0
return(0)
else if (grepl("^[0-9]+$", e)) # digits are used directly as the exponent
return(as.numeric(e))
else {
stop("Invalid exponent value.")
}
}
## Apply function to calculate values for property and crop damage
prop_exp <- sapply(NOAAtidy$Prop_Damage_exp, FUN=exp_transform)
NOAAtidy$Prop_Damage <- NOAAtidy$Prop_Damage * (10 ** prop_exp)
crop_exp <- sapply(NOAAtidy$Crop_Damage_exp, FUN=exp_transform)
NOAAtidy$Crop_Damage <- NOAAtidy$Crop_Damage * (10 ** crop_exp)
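Calling a function once per row with `sapply()` can be slow on a data set this size. As a hedged alternative, the sketch below does the same mapping with a named lookup vector. `exp_lookup` and `exp_vectorized` are hypothetical names, and unlike `exp_transform` this version maps any unrecognized value to 0 instead of stopping with an error.

```r
# Letter codes and the power of ten they stand for
exp_lookup <- c(h = 2, k = 3, m = 6, b = 9)

exp_vectorized <- function(e) {
  e <- tolower(e)
  out <- rep(0, length(e))                  # blanks and symbols stay at 0
  is_digit <- grepl("^[0-9]$", e)
  out[is_digit] <- as.numeric(e[is_digit])  # a digit is its own exponent
  is_letter <- e %in% names(exp_lookup)
  out[is_letter] <- unname(exp_lookup[e[is_letter]])
  out
}

codes <- c("K", "m", "B", "", "5", "?", "h")
exp_vectorized(codes)

# A damage value of 25 with exponent "K" becomes 25,000
25 * 10^exp_vectorized("K")
```

Because the whole vector is handled at once, the result can be multiplied straight into the damage column without an intermediate `sapply()` call.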
There is a very high number of unique event types. To try to lower this number I will clean up the event data.
## number of unique event types
length(unique(NOAAtidy$Event_Type))
## [1] 985
## translate all letters to lowercase
NOAAtidy$Event_Type <- tolower(NOAAtidy$Event_Type)
## remove leading and trailing whitespace
NOAAtidy$Event_Type <- trimws(NOAAtidy$Event_Type)
## replace every blank or punctuation character with an underscore
NOAAtidy$Event_Type <- gsub("[[:blank:][:punct:]+]", "_", NOAAtidy$Event_Type)
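Here is a quick sketch of what these three steps do to a few labels. The strings are invented examples in the style of the raw event types, not values taken from the data. The pattern is written without the `+` inside the brackets, which matches the same characters because `+` is already a punctuation character.

```r
raw <- c(" TSTM WIND", "Flash Flood/Landslide", "HEAVY RAIN.", "Thunderstorm Winds")

# lowercase, trim, then replace blanks and punctuation with underscores
clean <- gsub("[[:blank:][:punct:]]", "_", trimws(tolower(raw)))
clean
```

Note that a trailing period becomes a trailing underscore, so some near-duplicate labels survive this step and still need the manual grouping done later.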
Cleaning the data introduced NAs during processing. Let’s replace those NAs with 0s: we are only computing total sums, not averages, so this will not affect our final analysis.
NOAAtidy[is.na(NOAAtidy)] <- 0
After reviewing the data and the codes, I want to combine events that are effectively identical to make the data easier to understand.
Note: the order of these replacements matters, so it is important to review the names ahead of time to avoid mislabeling events. For example, if flood is replaced before landslide, an event tagged flood/landslide is relabeled as only a flood event.
NOAAtidynames <- NOAAtidy
## how many uniquely named variables do we have at first?
length(unique(NOAAtidynames$Event_Type))
## [1] 866
## change tstm_wind --> thunderstorm_wind
NOAAtidynames$Event_Type[grepl("tstm", NOAAtidynames$Event_Type)] <- "thunderstorm_wind"
## change anything with thunderstorm_wind in it to just "thunderstorm_wind"
NOAAtidynames$Event_Type[grepl("thunderstorm", NOAAtidynames$Event_Type)] <- "thunderstorm_wind"
NOAAtidynames$Event_Type[grepl("thun", NOAAtidynames$Event_Type)] <- "thunderstorm_wind"
## clean up hail tags
NOAAtidynames$Event_Type[grepl("hail", NOAAtidynames$Event_Type)] <- "hail"
## clean up the tornado tags
NOAAtidynames$Event_Type[grepl("tornado", NOAAtidynames$Event_Type)] <- "tornado"
NOAAtidynames$Event_Type[grepl("torn", NOAAtidynames$Event_Type)] <- "tornado"
## clean up the landslide tags
NOAAtidynames$Event_Type[grepl("landslide", NOAAtidynames$Event_Type)] <- "landslide"
## clean up the flood tags
NOAAtidynames$Event_Type[grepl("flood", NOAAtidynames$Event_Type)] <- "flood"
## clean up the lightning tags
NOAAtidynames$Event_Type[grepl("lightning", NOAAtidynames$Event_Type)] <- "lightning"
## clean up the snow tags
NOAAtidynames$Event_Type[grepl("snow", NOAAtidynames$Event_Type)] <- "snow"
## clean up the hurricane tags
NOAAtidynames$Event_Type[grepl("hurricane", NOAAtidynames$Event_Type)] <- "hurricane"
## clean up the cold weather tags
NOAAtidynames$Event_Type[grepl("cold", NOAAtidynames$Event_Type)] <- "cold_events"
NOAAtidynames$Event_Type[grepl("hypothermia", NOAAtidynames$Event_Type)] <- "cold_events"
NOAAtidynames$Event_Type[grepl("low_temperature", NOAAtidynames$Event_Type)] <- "cold_events"
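## The capitalized "Ice_storm" label is not matched by the case-sensitive "ice"
## pattern on the next line, which keeps ice storms separate from other ice events
NOAAtidynames$Event_Type[grepl("ice_storm", NOAAtidynames$Event_Type)] <- "Ice_storm"
NOAAtidynames$Event_Type[grepl("ice", NOAAtidynames$Event_Type)] <- "ice_events"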
## clean up the avalanche tags
NOAAtidynames$Event_Type[grepl("avala", NOAAtidynames$Event_Type)] <- "avalanche"
## how many uniquely named variables do we now have?
length(unique(NOAAtidynames$Event_Type))
## [1] 446
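The effect of replacement order can be checked on a toy vector. `relabel` is a hypothetical helper that applies the same `grepl()` assignment pattern used above, one pattern at a time, and the three tags are invented examples.

```r
tags <- c("flood_landslide", "flash_flood", "landslide")

# Apply each pattern in turn: any tag containing the pattern is renamed to it
relabel <- function(x, patterns) {
  for (p in patterns) {
    x[grepl(p, x, fixed = TRUE)] <- p
  }
  x
}

# landslide first: the combined tag is kept as a landslide
relabel(tags, c("landslide", "flood"))

# flood first: the combined tag is absorbed into flood
relabel(tags, c("flood", "landslide"))
```

Whichever pattern runs first claims the combined tag, so the order of the cleanup lines is itself an analysis decision.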
There are also a lot of events with no damage, or no fatalities/injuries. Since we are really interested in the events that had an impact on the community, we want to filter these out.
## Remove data without damage and create new data frames to analyze
NOAAtidyDamage <- NOAAtidynames[(NOAAtidynames$Prop_Damage > 0 | NOAAtidynames$Crop_Damage > 0), ]
length(unique(NOAAtidyDamage$Event_Type))
## [1] 166
## Remove data without fatalities or injuries and create new data frames to analyze
NOAAtidyFatalities <- NOAAtidynames[(NOAAtidynames$Fatalities > 0 | NOAAtidynames$Injuries > 0), ]
length(unique(NOAAtidyFatalities$Event_Type))
## [1] 109
In order to look at property and crop damage more clearly, let’s make a smaller data frame with the important information for this analysis.
NOAADamageSum <- NOAAtidyDamage[ , c("State_Abbreviation", "Event_Type", "Prop_Damage", "Crop_Damage")]
## convert Event Type to factor
NOAADamageSum$Event_Type <- as.factor(NOAADamageSum$Event_Type)
## Aggregate damage by event type
DamageEventAgg <- aggregate(Prop_Damage ~ Event_Type, data = NOAADamageSum, FUN=sum)
Prop_Damage_Top <- head(DamageEventAgg[order(DamageEventAgg$Prop_Damage, decreasing = T), ], 20)
Let’s also process the crop damage data for later analysis
## Aggregate Crop Damage
CropEventAgg <- aggregate(Crop_Damage ~ Event_Type, data = NOAADamageSum, FUN=sum)
Crop_Damage_Top <- head(CropEventAgg[order(CropEventAgg$Crop_Damage, decreasing = T), ], 20)
Now we need to process the fatalities by the type of event.
NOAAfatalitiesSum <- NOAAtidyFatalities[ , c("State_Abbreviation", "Event_Type", "Fatalities", "Injuries")]
##convert Event Type to factor
NOAAfatalitiesSum$Event_Type <- as.factor(NOAAfatalitiesSum$Event_Type)
## Aggregate fatalities
FatalitiesAgg <- aggregate(Fatalities ~ Event_Type, data = NOAAfatalitiesSum, FUN=sum)
Fatalities_Top <- head(FatalitiesAgg[order(FatalitiesAgg$Fatalities, decreasing = T), ], 20)
## Aggregate injuries
InjuriesAgg <- aggregate(Injuries ~ Event_Type, data = NOAAfatalitiesSum, FUN=sum)
Injuries_Top <- head(InjuriesAgg[order(InjuriesAgg$Injuries, decreasing = T), ], 20)
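The aggregate, order, and head pattern used above can be seen on a toy data frame. The values are invented for illustration.

```r
toy <- data.frame(
  Event_Type = c("tornado", "flood", "tornado", "heat", "flood"),
  Fatalities = c(5, 2, 3, 4, 1)
)

# Sum fatalities within each event type
agg <- aggregate(Fatalities ~ Event_Type, data = toy, FUN = sum)

# Sort from largest to smallest and keep the top two
top <- head(agg[order(agg$Fatalities, decreasing = TRUE), ], 2)
top
```

The row names in the result still refer to positions in the alphabetically sorted aggregate, which is why the leftmost column of the tables in the results is not a ranking.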
When looking at the data I was surprised the Ice_storm injuries were really high. I was curious to see why. I selected the rows only containing the Ice_storm results. It turns out that Ohio had one ice storm with 1568 injuries. I decided that this should be noted, but not excluded at this point in analysis without more information.
## Create a data frame with only the Ice_storm data (filter() comes from the dplyr package)
library(dplyr)
IceStorm <- filter(NOAAfatalitiesSum, Event_Type == "Ice_storm")
IceStorm[65,]
## State_Abbreviation Event_Type Fatalities Injuries
## 65 OH Ice_storm 1 1568
I was also interested in which states were the deadliest to live in.
## change the state to a factor
NOAAfatalitiesSum$State_Abbreviation <- as.factor(NOAAfatalitiesSum$State_Abbreviation)
## Aggregate fatalities by state
Fat_stateAgg <- aggregate(Fatalities ~ State_Abbreviation, data = NOAAfatalitiesSum, FUN=sum)
DeathlyStates_Top <- head(Fat_stateAgg[order(Fat_stateAgg$Fatalities, decreasing = T), ], 20)
Lastly, I wanted to find out in what states you were most likely to get injured.
## Aggregate injuries by state
Inj_stateAgg <- aggregate(Injuries ~ State_Abbreviation, data = NOAAfatalitiesSum, FUN=sum)
InjuriesStates_Top <- head(Inj_stateAgg[order(Inj_stateAgg$Injuries, decreasing = T), ], 20)
I completed three analyses of the data:
* Which events are the most harmful to a person’s health?
* What are the economic impacts on property and crops?
* How does a person’s risk from weather and storm events change in different states?
Let’s look at the top 10 events that cause fatalities.
head(Fatalities_Top[, c("Event_Type", "Fatalities")],10)
## Event_Type Fatalities
## 86 tornado 5636
## 15 excessive_heat 1903
## 19 flood 1525
## 32 heat 937
## 59 lightning 817
## 85 thunderstorm_wind 756
## 6 cold_events 451
## 73 rip_current 368
## 48 high_wind 248
## 1 avalanche 225
Let’s look at the top 10 events that cause injuries.
head(Injuries_Top[, c("Event_Type", "Injuries")],10)
## Event_Type Injuries
## 86 tornado 91407
## 85 thunderstorm_wind 9545
## 19 flood 8604
## 15 excessive_heat 6525
## 59 lightning 5231
## 32 heat 2100
## 56 Ice_storm 1990
## 30 hail 1371
## 53 hurricane 1328
## 104 winter_storm 1321
Let’s graph both of these to show the information visually.
## I am going to create these graphs using the base plot system
par(mfrow = c(1, 2), las = 3, mar = c(10, 4, 2, 2), cex = .7)
barplot(Fatalities_Top$Fatalities, names.arg = Fatalities_Top$Event_Type, col = "orange",
main = 'Top 20 Events for Fatalities', ylab = 'Number of Fatalities')
barplot(Injuries_Top$Injuries, names.arg = Injuries_Top$Event_Type, col = 'blue',
main = 'Top 20 Events for Injuries', ylab = 'Number of Injuries')
According to the data, tornadoes cause the most injuries and deaths. Wind-related deaths and injuries also appear throughout the results. The next most dangerous events are heat and floods. A follow-up question is how people are being harmed in these events, to find out whether there are ways to mitigate the harm before a disaster.
Let’s look at the top 10 events that cause most property damage.
head(Prop_Damage_Top[, c("Event_Type", "Prop_Damage")],10)
## Event_Type Prop_Damage
## 35 flood 168211639835
## 89 hurricane 84756180010
## 130 tornado 57003318426
## 124 storm_surge 43323536000
## 55 hail 15977564513
## 129 thunderstorm_wind 12785421700
## 132 tropical_storm 7703890550
## 161 winter_storm 6688497251
## 78 high_wind 5270046295
## 154 wildfire 4765114000
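Totals this large are easier to compare in billions of dollars. The sketch below rescales the top three property damage totals from the table above. The `Billions` column is a name introduced here for illustration.

```r
prop_top <- data.frame(
  Event_Type = c("flood", "hurricane", "tornado"),
  Prop_Damage = c(168211639835, 84756180010, 57003318426)
)

# Express damage in billions of dollars, rounded to one decimal place
prop_top$Billions <- round(prop_top$Prop_Damage / 1e9, 1)
prop_top[, c("Event_Type", "Billions")]
```

On this scale, flood damage (about 168 billion) is roughly twice the hurricane total (about 85 billion).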
Let’s look at the top 10 events that cause most crop damage.
head(Crop_Damage_Top[, c("Event_Type", "Crop_Damage")], 10)
## Event_Type Crop_Damage
## 22 drought 13972566000
## 35 flood 12380079100
## 89 hurricane 5515292800
## 92 Ice_storm 5022113500
## 55 hail 3046887623
## 15 cold_events 1416765500
## 129 thunderstorm_wind 1274208988
## 44 frost_freeze 1094186000
## 62 heavy_rain 733399800
## 132 tropical_storm 678346000
Let’s graph the property and crop damage vs. the event.
library(ggplot2)
## I am going to plot these using the ggplot2 package
## Plot the Property Damage
ggplot(Prop_Damage_Top, aes(x=reorder(Event_Type, -Prop_Damage), Prop_Damage)) +
geom_bar(stat="identity", fill = "cadetblue4") +
xlab("") +
ylab("Property Damage") +
labs(title = "Property Damage by Event Type from 1950 to 2011") +
theme(axis.text.x = element_text(angle = 90), axis.text.y = element_text(angle = 90, size =10))
## Plot the Crop Damage
ggplot(Crop_Damage_Top, aes(x=reorder(Event_Type, -Crop_Damage), Crop_Damage)) +
geom_bar(stat="identity", fill = "darkorchid3") +
xlab("") +
ylab("Crop Damage") +
labs(title = "Crop Damage by Event Type from 1950 to 2011") +
theme(axis.text.x = element_text(angle = 90), axis.text.y = element_text(angle = 90, size =10))
According to the data, floods are the costliest events for property, and drought has the highest economic impact on crops. Wind and water events like hurricanes and storms are also very costly. Many of the same events appear in both top 10 tables (flood, hurricane, hail, thunderstorm wind, and tropical storm), just with different rankings. So, an event that is costly to property is often also costly to crops.
Lastly, let’s look at how different parts of the United States have historically been affected by weather and storm events.
Let’s look at the top 10 states by fatalities.
## Most deadly states
head(DeathlyStates_Top, 10)
## State_Abbreviation Fatalities
## 16 IL 1421
## 46 TX 1366
## 40 PA 846
## 3 AL 784
## 26 MO 754
## 11 FL 746
## 27 MS 555
## 6 CA 550
## 4 AR 530
## 45 TN 521
Let’s look at the top 10 states by injuries.
## States with the most injuries
head(InjuriesStates_Top, 10)
## State_Abbreviation Injuries
## 46 TX 17667
## 26 MO 8998
## 3 AL 8742
## 37 OH 7112
## 27 MS 6675
## 11 FL 5918
## 38 OK 5710
## 16 IL 5563
## 4 AR 5550
## 45 TN 5202
Let’s graph both of these to show the information visually.
## I am going to create these graphs using the base plot system
par(mfrow = c(1, 2), las = 3, mar = c(4, 5, 2, 2), cex = .7)
barplot(DeathlyStates_Top$Fatalities, names.arg = DeathlyStates_Top$State_Abbreviation, col = "brown",
main ="Highest Fatalities by State", ylab = 'Number of Fatalities')
barplot(InjuriesStates_Top$Injuries, names.arg = InjuriesStates_Top$State_Abbreviation, col = "cyan",
main = "Highest Injuries by State", ylab = 'Number of Injuries')
According to the data, Illinois has the most weather-related fatalities in the United States. However, Texas is second in fatalities and first in injuries, so a logical conclusion would be that Texas is the state with the highest overall risk of fatalities and/or injuries from weather and storms in the United States. The top 20 states are located primarily on the Gulf Coast and in the Midwest. These areas appear to be affected more by extreme weather events than other areas of the country.
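As a rough check on the Texas conclusion, the two top 10 tables above can be combined into a single casualty count. This is only a sketch: the numbers are copied from the printed tables, `Casualties` is a column introduced here, and states that appear in only one of the two lists (PA, CA, OH, OK) drop out of the inner merge.

```r
fatal <- data.frame(
  State = c("IL", "TX", "PA", "AL", "MO", "FL", "MS", "CA", "AR", "TN"),
  Fatalities = c(1421, 1366, 846, 784, 754, 746, 555, 550, 530, 521)
)
inj <- data.frame(
  State = c("TX", "MO", "AL", "OH", "MS", "FL", "OK", "IL", "AR", "TN"),
  Injuries = c(17667, 8998, 8742, 7112, 6675, 5918, 5710, 5563, 5550, 5202)
)

# Keep states present in both lists and add the two counts
combined <- merge(fatal, inj, by = "State")
combined$Casualties <- combined$Fatalities + combined$Injuries
combined <- combined[order(combined$Casualties, decreasing = TRUE), ]
head(combined, 3)
```

With fatalities and injuries added together, Texas comes out well ahead of Missouri and Alabama. These are raw counts, so they do not account for differences in state population.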