The aim of this report is to identify the consequences of storm and weather events in the United States. For this research, the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database is used. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011.
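The two main questions answered in this report are which types of events are most harmful to population health, and which types of events have the greatest economic consequences.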
We first read the data from the CSV file in the bz2 archive. read.csv decompresses the file on the fly and uses a comma as the separator by default, so no extra arguments are needed.
library(dplyr)  # provides arrange(), desc(), mutate() and select(), used below
data <- read.csv("repdata_data_StormData.csv.bz2")
To explore the dataset we print the column names to see which variables are included in the dataset.
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Next we check whether the data contain NA values by counting them per column.
na_count <- sapply(data, function(y) sum(is.na(y)))
na_count
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 0 0 0 0 0 0
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 0 0 0 0 0 0
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 0 0 902297 0 0 0
## LENGTH WIDTH F MAG FATALITIES INJURIES
## 0 0 843563 0 0 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 0 0 0 0 0 0
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## 0 47 0 40 0 0
## REFNUM
## 0
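The same per-column count can be written more compactly with colSums(is.na(...)). A minimal sketch on a small made-up data frame; the name `toy` and its values are illustrative assumptions and are not taken from the storm data:

```r
# Toy data frame standing in for the storm data (values are made up).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(1, 0, NA),
  F          = c(2, NA, NA)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column.
na_count <- colSums(is.na(toy))
na_count

# Show only the columns that actually contain missing values.
na_count[na_count > 0]
```

Filtering on `na_count > 0` makes it easier to spot the few affected columns in a wide table.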
For this analysis we need the variables EVTYPE, FATALITIES, INJURIES, PROPDMG and CROPDMG. As can be seen above, these variables do not contain any missing values, so no handling of missing values is needed.
To identify which types of events are most harmful with respect to population health, we look at the number of injuries and fatalities. To get a sense of the number of fatalities and injuries, we summarise those variables.
summary(data$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
summary(data$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
Hence, the maximum number of deaths per event is 583 and the maximum number of injuries per event is 1700.
First we sum the number of fatalities per type of event, and then arrange in descending order based on the total number of fatalities per event type.
fatalities <- aggregate(list(fatalities = data$FATALITIES), by=list(evtype = data$EVTYPE), FUN=sum)
fatalities <- arrange(fatalities, desc(fatalities))
fatalities[1:10,]
## evtype fatalities
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
Then we sum the number of injuries per type of event.
injuries <- aggregate(list(injuries = data$INJURIES), by=list(evtype = data$EVTYPE), FUN=sum)
injuries <- arrange(injuries, desc(injuries))
injuries[1:10,]
## evtype injuries
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
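In the table above, TSTM WIND and THUNDERSTORM WIND appear as separate rows, so their counts are summed separately. One possible way to collapse such spelling variants before aggregating is sketched below. The helper `normalize_evtype` and its rewrite rules are hypothetical and only illustrative; they are not part of the analysis in this report and do not form a complete cleaning scheme.

```r
# Hypothetical helper: collapse a few spelling variants of an event label.
# Assumption: these rewrites are illustrative, not a complete cleaning scheme.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))               # unify case, strip surrounding spaces
  x <- sub("^TSTM", "THUNDERSTORM", x)  # expand the abbreviation
  x <- sub("WINDS$", "WIND", x)         # drop the plural
  x
}

labels <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", " Tornado")
normalize_evtype(labels)
```

Applying such a helper to EVTYPE before calling aggregate would merge the variants into a single row, which could change the ranking.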
The next step is to combine the injuries and fatalities to identify which events are most harmful to population health.
total_harm <- merge(fatalities, injuries)
total_harm <- mutate(total_harm, total = fatalities + injuries)
total_harm <- arrange(total_harm, desc(total))
barplot(total_harm$total[1:10], names=total_harm$evtype[1:10], main='Total number of injuries and fatalities per event type')
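The aggregate, merge and sort steps above can also be done in one pass with base R. The following is a minimal sketch on made-up records; the data frame `toy` and its values are assumptions for illustration, and only the column names mirror the storm data:

```r
# Made-up records; only the column names mirror the storm data.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 2, 1, 0, 4),
  INJURIES   = c(10, 0, 5, 2, 1)
)

# Sum both harm measures per event type in a single aggregate() call.
harm <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Total harm per event type, sorted from most to least harmful.
harm$total <- harm$FATALITIES + harm$INJURIES
harm <- harm[order(harm$total, decreasing = TRUE), ]
harm
```

Aggregating both columns at once avoids the intermediate merge, at the cost of not having separate ranked tables for fatalities and injuries.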
So the most harmful event types for population health are tornadoes, excessive heat, thunderstorm wind (TSTM WIND), floods and lightning.
For the economic consequences of the events we look at property damage and crop damage, since damage carries associated monetary costs.
damage <- select(data, EVTYPE, PROPDMG, CROPDMG)
damage <- aggregate(list(propdmg = damage$PROPDMG, cropdmg = damage$CROPDMG), by=list(evtype=damage$EVTYPE), FUN=sum)
damage <- mutate(damage, total = propdmg+cropdmg)
damage <- arrange(damage, desc(total))
damage[1:10,]
## evtype propdmg cropdmg total
## 1 TORNADO 3212258.2 100018.52 3312276.7
## 2 FLASH FLOOD 1420124.6 179200.46 1599325.1
## 3 TSTM WIND 1335965.6 109202.60 1445168.2
## 4 HAIL 688693.4 579596.28 1268289.7
## 5 FLOOD 899938.5 168037.88 1067976.4
## 6 THUNDERSTORM WIND 876844.2 66791.45 943635.6
## 7 LIGHTNING 603351.8 3580.61 606932.4
## 8 THUNDERSTORM WINDS 446293.2 18684.93 464978.1
## 9 HIGH WIND 324731.6 17283.21 342014.8
## 10 WINTER STORM 132720.6 1978.99 134699.6
Hence, based on these totals, the events that have the greatest economic consequences are tornadoes, flash floods, thunderstorm wind (TSTM WIND), hail and floods.
barplot(damage$total[1:5], names=damage$evtype[1:5], main="Total damage per event type", ylab = 'Reported damage (exponents not applied)')
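The dataset also contains the columns PROPDMGEXP and CROPDMGEXP, which the totals above do not use. These columns hold a magnitude code for the damage figures, so converting to dollars means multiplying each value by the multiplier for its code. A minimal sketch of that conversion follows, on made-up values. The helper `exp_to_multiplier` is hypothetical, and treating every code other than K, M and B as a multiplier of 1 is an assumption made here for simplicity.

```r
# Hypothetical helper: translate an exponent code into a numeric multiplier.
# Assumption: K = thousands, M = millions, B = billions; any other code
# (blank, digits, "?", "+", ...) is treated as a multiplier of 1.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1   # unknown or blank codes fall back to 1
  unname(mult)
}

# Made-up records; only the column names mirror the storm data.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", "")
)

# Scale each damage figure by its multiplier, then sum per event type.
toy$prop_dollars <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
aggregate(prop_dollars ~ EVTYPE, data = toy, FUN = sum)
```

The same helper could be applied to CROPDMG and CROPDMGEXP. Because a single event coded in billions outweighs many events coded in thousands, a ranking based on scaled values may differ from the one shown above.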