This purpose of this report is to load and analyze the National Weather Service Storm Database, and answer some basic questions regarding trends across the United States. It will be a combination of text, graphics, and code (r) in order to enable the reader to reproduce any of the analysis steps (if desired). The main two questions that this report will answer are:
This section of the report shows all steps of the data processing necessary in the creation of the NWS Storm Database Analysis.
1. The first step of this analysis is to load the NWS database file “repdata-data-StormData.csv.bz2”. This data file can be downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
This file will be read into memory, and stored into a dataframe called “weather_data”.
#1. Load the data - "repdata-data-StormData.csv.bz2"
weather_data<-read.csv(bzfile("repdata-data-StormData.csv.bz2", "r"), header=TRUE)
2. The next step is to verify that the file has been read in correctly and check the file layout. I’ll do this using the head() function for the first 5 rows. I’ll also store the column names into a character vector called “names” so that I can use it for displaying or selecting columns of interest during subsequent processing steps.
options(width = 125) #sets correct width for viewing
head(weather_data,5)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO 0
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO 0
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO 0
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO 0
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO 0
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 0 NA 0 14.0 100 3 0 0 15 25.0 K
## 2 0 NA 0 2.0 150 2 0 0 0 2.5 K
## 3 0 NA 0 0.1 123 2 0 0 2 25.0 K
## 4 0 NA 0 0.0 100 2 0 0 2 2.5 K
## 5 0 NA 0 0.0 150 2 0 0 2 2.5 K
## CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 0 3040 8812 3051 8806 1
## 2 0 3042 8755 0 0 2
## 3 0 3340 8742 0 0 3
## 4 0 3458 8626 0 0 4
## 5 0 3412 8642 0 0 5
names<-names(weather_data)
names
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY" "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE"
## [10] "BGN_AZI" "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN" "END_RANGE" "END_AZI" "END_LOCATI"
## [19] "LENGTH" "WIDTH" "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG" "PROPDMGEXP" "CROPDMG"
## [28] "CROPDMGEXP" "WFO" "STATEOFFIC" "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_" "REMARKS"
## [37] "REFNUM"
3. I’ll be using the “sqldf” library in order to further process the data, so I load it at the start of this next data processing step.
Next, I create 3 new dataframes that are each unique “Top 10” aggregated queries by event type (EVTYPE).
The first is called “agg_fatalities”, and is a “Top 10” aggregated query of fatalities by event type (EVTYPE).
The second is called “agg_injuries”, and is a “Top 10” aggregated query of injuries by event type (EVTYPE).
The third is called “agg_injuries_or_fatalities”, and is a “Top 10” aggregated query of injuries and fatalities by event type (EVTYPE).
After creating these three dataframes I re-factor the event type (just using the list of the top 10), and also re-order the event type factors by the relevent sums. This is done so that when later plotted in the “Results” section the event type plots will be in decending order from largest to smallest.
library(sqldf)
agg_fatalities<-sqldf("select EVTYPE, sum(FATALITIES) as total_fatalities,
count(*) as count from weather_data group by EVTYPE order by total_fatalities desc limit 10")
agg_fatalities$EVTYPE<-factor(agg_fatalities$EVTYPE)
agg_fatalities$EVTYPE <- reorder(agg_fatalities$EVTYPE, -agg_fatalities$total_fatalities)
agg_fatalities
## EVTYPE total_fatalities count
## 1 TORNADO 5633 60652
## 2 EXCESSIVE HEAT 1903 1678
## 3 FLASH FLOOD 978 54277
## 4 HEAT 937 767
## 5 LIGHTNING 816 15754
## 6 TSTM WIND 504 219940
## 7 FLOOD 470 25326
## 8 RIP CURRENT 368 470
## 9 HIGH WIND 248 20212
## 10 AVALANCHE 224 386
agg_injuries<-sqldf("select EVTYPE, sum(INJURIES) AS total_injuries,
count(*) as count from weather_data group by EVTYPE order by total_injuries desc limit 10")
agg_injuries$EVTYPE<-factor(agg_injuries$EVTYPE)
agg_injuries$EVTYPE <- reorder(agg_injuries$EVTYPE, -agg_injuries$total_injuries)
agg_injuries
## EVTYPE total_injuries count
## 1 TORNADO 91346 60652
## 2 TSTM WIND 6957 219940
## 3 FLOOD 6789 25326
## 4 EXCESSIVE HEAT 6525 1678
## 5 LIGHTNING 5230 15754
## 6 HEAT 2100 767
## 7 ICE STORM 1975 2006
## 8 FLASH FLOOD 1777 54277
## 9 THUNDERSTORM WIND 1488 82563
## 10 HAIL 1361 288661
agg_injuries_or_fatalities<-sqldf("select EVTYPE, sum(FATALITIES) as total_fatalities, sum(INJURIES) AS total_injuries,
sum(FATALITIES) + sum(INJURIES) as total_injuries_or_fatalities, count(*) as count from weather_data
group by EVTYPE order by total_injuries_or_fatalities desc limit 10")
agg_injuries_or_fatalities$EVTYPE<-factor(agg_injuries_or_fatalities$EVTYPE)
agg_injuries_or_fatalities$EVTYPE<- reorder(agg_injuries_or_fatalities$EVTYPE, -agg_injuries_or_fatalities$total_injuries_or_fatalities)
agg_injuries_or_fatalities
## EVTYPE total_fatalities total_injuries total_injuries_or_fatalities count
## 1 TORNADO 5633 91346 96979 60652
## 2 EXCESSIVE HEAT 1903 6525 8428 1678
## 3 TSTM WIND 504 6957 7461 219940
## 4 FLOOD 470 6789 7259 25326
## 5 LIGHTNING 816 5230 6046 15754
## 6 HEAT 937 2100 3037 767
## 7 FLASH FLOOD 978 1777 2755 54277
## 8 ICE STORM 89 1975 2064 2006
## 9 THUNDERSTORM WIND 133 1488 1621 82563
## 10 WINTER STORM 206 1321 1527 11433
4. At this point, I’ve created the data in order to answer the first question for the analysis (and create a plot). However, I still don’t have a dataset that is required to answer the second question (or create a plot). That is going to be the focus of this last data preparation step.
I start by selecting 5 fields from “weather_data” that pertain to event type and damage values (EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP). Using these fields (and understanding the exponent postfix columns) I then create two additional fields (PROPDMG_VALUE, CROPDMG_VALUE). These are calculations for the total property & total crop damages in dollars. The result is stored into an intermediate dataframe that I’m calling temp_a.
The next step is to aggregate all of the property and crop damage from “temp_a” by event type, and then store it into a dataframe for future use. I do this with another sql query and store the result into a dataframe called “agg_dmg_values”, again only showing the “Top 10” event types.
After creating the “agg_dmg_values” dataframe, I then re-factor the event type (just using the list of the “Top 10”), and also re-order the event type factors by the relevent total damage values. This is done so that when later plotted in the “Results” section the event type plots will be in decending order from largest to smallest.
temp_a<-sqldf("select
EVTYPE,
PROPDMG,
PROPDMGEXP,
case PROPDMGEXP
when '' then PROPDMG
when 'k' then 1000*PROPDMG
when 'm' then 1000000*PROPDMG
when 'b' then 1000000000*PROPDMG
when 'K' then 1000*PROPDMG
when 'M' then 1000000*PROPDMG
when 'B' then 1000000000*PROPDMG
else null end as PROPDMG_VALUE,
CROPDMG,
CROPDMGEXP,
case CROPDMGEXP
when '' then CROPDMG
when 'k' then 1000*CROPDMG
when 'm' then 1000000*CROPDMG
when 'b' then 1000000000*CROPDMG
when 'K' then 1000*CROPDMG
when 'M' then 1000000*CROPDMG
when 'B' then 1000000000*CROPDMG
else null end as CROPDMG_VALUE
from weather_data")
agg_dmg_values<-sqldf("select EVTYPE, sum(PROPDMG_VALUE + CROPDMG_VALUE) as total_dmg_value
from temp_a
group by EVTYPE
order by total_dmg_value desc limit 10")
agg_dmg_values$EVTYPE<-factor(agg_dmg_values$EVTYPE)
agg_dmg_values$EVTYPE <- reorder(agg_dmg_values$EVTYPE, -agg_dmg_values$total_dmg_value)
This section of the report displays the results of the anlysis, and provides answers for the initial two questions. This is going to be fairly easy, now that all of the data is processed!!
library(ggplot2)
ggplot(agg_injuries_or_fatalities, aes(x=EVTYPE, y=total_injuries_or_fatalities/1000)) +
geom_bar(stat="identity", color="black", fill="orange") + theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(title = "Sum of Injuries and Fatalities [Thousands] by Event Type (EVTYPE)") +
labs(x="Event Type (EVTYPE)") + labs(y="Sum of Injuries and Fatalities [Thousands]") +
geom_text(aes(label=round(total_injuries_or_fatalities/1000,1),vjust=-0.5, size=12)) + scale_size(guide="none")
ggplot(agg_dmg_values, aes(x=EVTYPE, y=total_dmg_value/1000000000)) +
geom_bar(stat="identity", color="black", fill="green") + theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(title = "Total Damage Value (Property + Crop) [$ Billions] by Event Type (EVTYPE)") +
labs(x="Event Type (EVTYPE)") + labs(y="Total Damage Value (Property + Crop) [$ Billions]") +
geom_text(aes(label=round(total_dmg_value/1000000000,1),vjust=-0.5, size=12)) + scale_size(guide="none")