This project is part of the Reproducible Research Course by Coursera and explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage. The goal of this analysis is to find which events are most harmful with respect to population health and which events have the greatest economic consequences.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The events in the database start in the year 1950 and end in November 2011. The file is downloaded from the URL given in the code below.
First we need to load the following packages. They are going to be used to handle the data and plot the results.
library(dplyr)
library(ggplot2)
library(gridExtra)
Then we need to download the file and read it as comma separated values.
url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(url,'storm.csv.bz2')
storm_data <- read.csv('storm.csv.bz2')
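The download step above runs every time the script is executed. A minimal sketch of one way to skip it when a local copy already exists is shown here; `fetch_once` is a hypothetical helper name, not part of the original analysis, and the round-trip at the end uses a tiny made-up table only to illustrate that `read.csv` can read a bzip2-compressed file directly.

```r
# Hypothetical helper: download only if the destination file is missing.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # binary mode keeps the archive intact
  }
  invisible(dest)
}

# Round-trip demo with a tiny made-up table: write through a bzip2
# connection, then read the compressed file back with read.csv.
demo_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(demo_path, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# The file already exists, so no download is attempted here.
fetch_once("https://example.invalid/unused.csv.bz2", demo_path)
demo <- read.csv(demo_path, stringsAsFactors = FALSE)
```

Under these assumptions, the original two steps could be written as `storm_data <- read.csv(fetch_once(url, 'storm.csv.bz2'))`.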
Most of the information contained in the original data is not necessary for our analysis. For this reason, we create a subset containing only the columns that we need.
storm <- select(storm_data,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
storm$PROPDMGEXP <- as.character(storm$PROPDMGEXP)
storm$CROPDMGEXP <- as.character(storm$CROPDMGEXP)
head(storm)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
The Storm Data Documentation states that damage values are rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number. For example, alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
These exponents, in the columns PROPDMGEXP and CROPDMGEXP, are converted into numbers. For example M is converted into 6. These numbers are used to find the damage value, which is stored in the new columns PROP_VALUE and CROP_VALUE.
unique(storm$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
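The codes are replaced with numeric exponents: “-”, “+”, “?” and the empty string become 0, “h”/“H” becomes 2, “k”/“K” becomes 3, “m”/“M” becomes 6, and “b”/“B” becomes 9. Digit codes are left unchanged and used directly as exponents. The property damage value is then calculated as PROPDMG × 10^PROPDMGEXP and stored in the new column PROP_VALUE.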
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('-','+','?','')] <- 0
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('h','H')] <- 2
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('m','M')] <- 6
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('k','K')] <- 3
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('b','B')] <- 9
storm$PROP_VALUE <- storm$PROPDMG * (10 ** as.numeric(storm$PROPDMGEXP))
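The same mapping can also be expressed as a lookup, which keeps the rules in one place. The following is only a sketch under the assumptions stated in the text (symbols and empty codes count as exponent 0, digits are used as exponents as they are); `exp_to_power` is a hypothetical helper name and the four sample rows are made up for illustration.

```r
# Hypothetical helper: turn an exponent code into a power of ten.
exp_to_power <- function(code) {
  lookup <- c(h = 2, k = 3, m = 6, b = 9)
  code <- tolower(code)
  power <- unname(lookup[code])               # NA for codes not in the lookup
  is_digit <- grepl("^[0-9]$", code)
  power[is_digit] <- as.numeric(code[is_digit])  # digit codes are used as-is
  power[is.na(power)] <- 0                    # '', '-', '+', '?' -> 10^0
  power
}

# Made-up sample rows: damage figure and its exponent code.
damage <- c(25, 2.5, 1.2, 3)
code   <- c("K", "M", "", "5")
values <- damage * 10^exp_to_power(code)
print(values)
```

For instance, a reported damage of 25.0 with code “K” corresponds to 25.0 × 10^3 = 25,000 dollars.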
The same operation is conducted for Crop Damage.
unique(storm$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
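The crop codes are mapped in the same way (“?” and the empty string become 0, “k”/“K” becomes 3, “m”/“M” becomes 6, “b”/“B” becomes 9, and digit codes are left unchanged). The crop damage value is then calculated as CROPDMG × 10^CROPDMGEXP and stored in the new column CROP_VALUE.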
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('?','')] <- 0
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('m','M')] <- 6
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('k','K')] <- 3
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('b','B')] <- 9
storm$CROP_VALUE <- storm$CROPDMG * (10 ** as.numeric(storm$CROPDMGEXP))
Once the data is processed and clean, we can proceed and analyse the weather events.
First we analyse which events are most harmful with respect to population health. This is done by examining the number of fatalities and injuries attributed to each type of event.
The data is grouped by type of event and the fatalities are summed within each type. The resulting table is sorted in descending order, so the deadliest types of events appear first. From the output we can see that tornadoes cause the most fatalities (5,633), followed by excessive heat (1,903).
fatalities <- storm %>% group_by(EVTYPE) %>%
summarise(fat=sum(FATALITIES)) %>%
arrange(desc(fat))
head(fatalities)
## # A tibble: 6 × 2
## EVTYPE fat
## <fctr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
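To make the group, sum and sort steps concrete, here is a small sketch of the same idea in base R. It is an illustration only: the `toy` data frame holds made-up values, not records from the storm database, and `aggregate` is swapped in for the dplyr pipeline used above.

```r
# Made-up toy records (not taken from the storm database).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities within each event type, then sort in descending order.
fat <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
fat <- fat[order(-fat$FATALITIES), ]
print(fat)
```

In the toy data TORNADO sums to 5, HEAT to 4 and FLOOD to 1, so TORNADO ends up in the first row, mirroring what the real table shows.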
The same process is repeated for injuries: the data is grouped by type of event, the injuries are summed within each type, and the table is sorted in descending order. From the output we can see that tornadoes also cause the most injuries (91,346), about thirteen times as many as thunderstorm wind (TSTM WIND, 6,957), the next type of event.
injuries <- storm %>% group_by(EVTYPE) %>%
summarise(inj=sum(INJURIES)) %>%
arrange(desc(inj))
head(injuries)
## # A tibble: 6 × 2
## EVTYPE inj
## <fctr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
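The gap between tornadoes and the other events is easier to judge as a share. The sketch below uses only the six totals printed above, so the percentages are shares of those six rows, not of all injuries in the database; the vector name `injuries_top` is introduced here for illustration.

```r
# Injury totals copied from the table above (top six event types only).
injuries_top <- c(
  "TORNADO"        = 91346,
  "TSTM WIND"      = 6957,
  "FLOOD"          = 6789,
  "EXCESSIVE HEAT" = 6525,
  "LIGHTNING"      = 5230,
  "HEAT"           = 2100
)

# Share of each event type within these six rows, in percent.
share <- injuries_top / sum(injuries_top)
print(round(100 * share, 1))
```

Within these six rows, tornadoes account for roughly three quarters of the injuries.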
These results are plotted to make the comparison easier. We plot the five types of events with the most fatalities and the five with the most injuries. The bar lengths show how far tornadoes exceed the other events in both cases.
g1 <- ggplot(data = fatalities[1:5, ], aes(reorder(EVTYPE, fat), fat)) +
  geom_bar(stat = 'identity', fill = 'blue') + coord_flip() +
  ggtitle('Events with high Fatalities') + ylab('Fatalities') + xlab('Events') +
  geom_text(aes(label = fat), hjust = 1.2, col = 'white')
g2 <- ggplot(data = injuries[1:5, ], aes(reorder(EVTYPE, inj), inj)) +
  geom_bar(stat = 'identity', fill = 'red') + coord_flip() +
  ggtitle('Events with high Injuries') + ylab('Injuries') + xlab('Events') +
  geom_text(aes(label = inj), hjust = 1, col = 'white')
grid.arrange(g1, g2, ncol = 1)
The second part of the analysis concerns the economic consequences of the weather events. To find which events have the greatest economic consequences, we consider two variables from our data set: the value of damaged property and the value of damaged crops.
Starting with property, our data set storm is grouped by type of event and the property damage values are summed within each type. The resulting table is sorted in descending order, so the events causing the most property damage appear first. The results show that floods cause the greatest property damage, about 144.7 billion dollars.
properties <- storm %>% group_by(EVTYPE) %>%
summarise(prop=sum(PROP_VALUE)) %>%
arrange(desc(prop))
head(properties)
## # A tibble: 6 × 2
## EVTYPE prop
## <fctr> <dbl>
## 1 FLOOD 144657709807
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56947380677
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16822673979
## 6 HAIL 15735267513
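Totals printed in raw dollars are hard to read at this scale. A minimal sketch of rescaling them to billions follows; it uses the six values printed above, and the names `prop_top` and `prop_billions` are introduced here for illustration only.

```r
# Property damage totals copied from the table above, in dollars.
prop_top <- c(
  "FLOOD"             = 144657709807,
  "HURRICANE/TYPHOON" = 69305840000,
  "TORNADO"           = 56947380677,
  "STORM SURGE"       = 43323536000,
  "FLASH FLOOD"       = 16822673979,
  "HAIL"              = 15735267513
)

# Rescale to billions of dollars, rounded to one decimal place.
prop_billions <- round(prop_top / 1e9, 1)
print(prop_billions)
```

The same rescaling could be applied to the `prop` and `crop` columns before plotting, so that the axis is labelled in billions.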
The same process is repeated for crops. The data is grouped by type of event and the crop damage values are summed within each type. The resulting table is sorted in descending order, so the events causing the most crop damage appear first. The results show that drought causes the greatest crop damage, about 14 billion dollars.
crops <- storm %>% group_by(EVTYPE) %>%
summarise(crop=sum(CROP_VALUE)) %>%
arrange(desc(crop))
head(crops)
## # A tibble: 6 × 2
## EVTYPE crop
## <fctr> <dbl>
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025954473
## 6 HURRICANE 2741910000
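The analysis ranks property and crop damage separately. One possible extension, sketched here and not part of the original analysis, is to rank events by the sum of the two. The `toy` data frame holds made-up values in the same column layout as storm, and base R `aggregate` is used in place of the dplyr pipeline.

```r
# Made-up toy records (not taken from the storm database).
toy <- data.frame(
  EVTYPE     = c("FLOOD", "DROUGHT", "FLOOD", "HAIL", "DROUGHT"),
  PROP_VALUE = c(5e6, 1e5, 2e6, 1e6, 0),
  CROP_VALUE = c(1e6, 4e6, 0, 5e5, 3e6),
  stringsAsFactors = FALSE
)

# Sum both damage columns within each event type.
totals <- aggregate(cbind(PROP_VALUE, CROP_VALUE) ~ EVTYPE, data = toy, FUN = sum)

# Combined damage per event type, sorted in descending order.
totals$TOTAL <- totals$PROP_VALUE + totals$CROP_VALUE
totals <- totals[order(-totals$TOTAL), ]
print(totals)
```

In the toy data FLOOD ranks first on the combined total even though DROUGHT has the larger crop figure, which shows how a combined ranking can differ from the two separate ones.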
The results are plotted to compare the five costliest types of events for property and for crops.
p1 <- ggplot(data = properties[1:5, ], aes(reorder(EVTYPE, prop), prop)) +
  geom_bar(stat = 'identity', fill = 'blue') +
  coord_flip() + ggtitle('Events with high Properties Damages') +
  xlab('Events') + ylab('Property Damages')
p2 <- ggplot(data = crops[1:5, ], aes(reorder(EVTYPE, crop), crop)) +
  geom_bar(stat = 'identity', fill = 'red') +
  coord_flip() + ggtitle('Events with high Crops Damages') +
  xlab('Events') + ylab('Crops Damages')
grid.arrange(p1, p2, ncol = 1)
The analysis conducted can be summarised as follows:
With regard to population health, Tornado is the event causing the highest number of both fatalities and injuries. It is followed by Excessive Heat for fatalities and Thunderstorm Wind for injuries.
With regard to economic consequences, Flood is the event causing the highest property damage and Drought is the event causing the highest crop damage.