This report presents an exploratory analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, in support of two research questions (described below). The original data were first processed to convert the variable BGN_DATE to a date format, so that the evolution of events over time could be examined. The data were then subset to the variables that measure the outcomes of interest: population health is represented by FATALITIES and INJURIES, and economic consequences by PROPDMG and CROPDMG. The latter two variables were summed into a single variable, Total_DMG. The results summarise the total number of injuries, the total number of fatalities and the total economic damage (property and crop damage). As there are 985 different event types, only the top 15 (those with the highest injuries, fatalities and economic damage) were selected. The plots show injuries, fatalities and economic damage per type of event.
Data on severe weather events in the USA were downloaded from the URL below directly into the working directory.
bz2file <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
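destfile <- file.path(getwd(), "StormData.csv.bz2")
# Save the compressed file in the working directory (binary mode keeps the archive intact)
download.file(bz2file, destfile, mode = "wb")
# read.csv() can read the .bz2 file directly
StormData <- read.csv(destfile)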
library(dplyr)
library(lubridate)
library(ggplot2)
library(gridExtra)
dim(StormData)
## [1] 902297 37
There are 902297 observations and 37 variables.
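With more than 900,000 rows, the file is worth downloading only once. A minimal sketch of a cached download follows, assuming a hypothetical helper named `fetch_once` that is not part of the original analysis. The demonstration uses a small temporary file that already exists, so no network access takes place.

```r
# Hypothetical helper: download `url` to `destfile` only if the file is missing.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile, mode = "wb")  # binary mode for compressed files
  }
  destfile
}

# Demonstration with a local file that already exists (made-up content).
local_copy <- tempfile(fileext = ".csv")
write.csv(data.frame(EVTYPE = "TORNADO", INJURIES = 3), local_copy, row.names = FALSE)

# The URL is a placeholder; it is never contacted because the file is present.
path <- fetch_once("https://example.invalid/StormData.csv.bz2", local_copy)
demo <- read.csv(path)
print(demo)
```

In the report itself, the call would take the real URL and a destination inside the working directory.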
tbl_df(StormData)
## Source: local data frame [902,297 x 37]
##
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## <dbl> <fctr> <fctr> <fctr> <dbl> <fctr> <fctr>
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## .. ... ... ... ... ... ... ...
## Variables not shown: EVTYPE <fctr>, BGN_RANGE <dbl>, BGN_AZI <fctr>,
## BGN_LOCATI <fctr>, END_DATE <fctr>, END_TIME <fctr>, COUNTY_END <dbl>,
## COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <fctr>, END_LOCATI <fctr>,
## LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
## INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <fctr>, CROPDMG <dbl>,
## CROPDMGEXP <fctr>, WFO <fctr>, STATEOFFIC <fctr>, ZONENAMES <fctr>,
## LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## REMARKS <fctr>, REFNUM <dbl>.
Clean the data to prepare it for analysis
# Reduce BGN_DATE to the numeric year of each event
StormData$BGN_DATE <- year(mdy_hms(as.character(StormData$BGN_DATE)))
The date variable was reduced to the year so that occurrences can be totalled by year. The per-year subset of the dataset, DataByYear, is built further below.
StormData$EVTYPE <- as.character(StormData$EVTYPE)
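The year extraction in the cleaning step can be checked on a few values before running it on the full dataset. The sketch below uses base R only, so it runs without lubridate; the sample strings are copied from the BGN_DATE column in the listing above.

```r
# Sample values in the same "m/d/Y H:M:S" layout as BGN_DATE.
bgn <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "11/15/1951 0:00:00")

# as.Date() parses the leading date part; the trailing time text is ignored.
bgn_year <- as.numeric(format(as.Date(bgn, format = "%m/%d/%Y"), "%Y"))
print(bgn_year)
```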
Subset data for total Injuries, total Fatalities and total economic damage, per type of event
EventsData <- StormData %>%
select(BGN_DATE, INJURIES, FATALITIES, EVTYPE, PROPDMG, CROPDMG) %>%
group_by(EVTYPE) %>%
summarise(
Total_INJURIES = sum(INJURIES),
Total_FATALITIES = sum(FATALITIES),
Total_DMG = sum(PROPDMG, CROPDMG)
)
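Total_DMG as computed here adds the raw PROPDMG and CROPDMG figures. The dataset also carries PROPDMGEXP and CROPDMGEXP columns. If those are taken as magnitude codes, a scaled total could be sketched as follows. The code table (K, M and B for thousands, millions and billions), the helper `exp_multiplier` and the toy values are assumptions for illustration, not part of the original analysis.

```r
# Toy rows: column names match the dataset, values are made up.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL"),
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG    = c(0, 1, 5),
  CROPDMGEXP = c("", "K", "K"),
  stringsAsFactors = FALSE
)

# Assumed code table: K = 1e3, M = 1e6, B = 1e9; any other code is left unscaled.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Scale each damage figure by its code, then total per event type.
toy$DMG_SCALED <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
scaled <- aggregate(DMG_SCALED ~ EVTYPE, data = toy, FUN = sum)
print(scaled)
```

Codes outside the assumed table are treated as a multiplier of 1 here. How to handle them is a separate decision that would need the database documentation.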
Summarise injuries, fatalities and economic damage per year
DataByYear <- StormData %>%
select(BGN_DATE, INJURIES, FATALITIES, EVTYPE, PROPDMG, CROPDMG) %>%
group_by(BGN_DATE) %>%
summarise(
Total_INJURIES = sum(INJURIES),
Total_FATALITIES = sum(FATALITIES),
Total_DMG = sum(PROPDMG, CROPDMG)
) %>% arrange(desc(Total_INJURIES))
Plot the yearly totals to see how they evolve over time
p1 <- ggplot(DataByYear, aes(BGN_DATE, Total_INJURIES)) + geom_line(color = "blue") + xlab("") + ylab("Total Injuries") + ggtitle("Population health indicators and economy over the years")
p2 <- ggplot(DataByYear, aes(BGN_DATE, Total_FATALITIES)) + geom_line(color = "darkblue")
p3 <- ggplot(DataByYear, aes(BGN_DATE, Total_DMG)) + geom_line(color = "darkorange")
grid.arrange(p1, p2, p3, nrow = 3)
The higher totals in later years might simply reflect the fact that more events were recorded in recent years.
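One way to probe this is to count the records per year and divide each yearly total by that count. The sketch below uses made-up toy values and base R. The column names N_EVENTS and INJURIES_PER_EVENT are hypothetical and do not appear in the original analysis.

```r
# Toy data: one row per recorded event, values made up for illustration.
toy <- data.frame(
  BGN_DATE = c(1950, 1950, 2010, 2010, 2010, 2010),
  INJURIES = c(10, 20, 5, 5, 10, 20)
)

# Yearly totals, as in DataByYear.
per_year <- aggregate(INJURIES ~ BGN_DATE, data = toy, FUN = sum)
names(per_year)[2] <- "Total_INJURIES"

# Number of records per year, matched to the rows of per_year.
counts <- table(toy$BGN_DATE)
per_year$N_EVENTS <- as.vector(counts[as.character(per_year$BGN_DATE)])

# Average injuries per recorded event.
per_year$INJURIES_PER_EVENT <- per_year$Total_INJURIES / per_year$N_EVENTS
print(per_year)
```

In this toy case the later year has the larger total but the smaller average per event, which is the pattern that more complete recording would produce.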
Show the top 15 event types with the highest impact on population health (injuries and fatalities)
TopINJ <- tbl_df(EventsData) %>% arrange(desc(Total_INJURIES)) %>% print(n = 15)
## Source: local data frame [985 x 4]
##
## EVTYPE Total_INJURIES Total_FATALITIES Total_DMG
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 91346 5633 3312276.68
## 2 TSTM WIND 6957 504 1445168.21
## 3 FLOOD 6789 470 1067976.36
## 4 EXCESSIVE HEAT 6525 1903 1954.40
## 5 LIGHTNING 5230 816 606932.39
## 6 HEAT 2100 937 961.20
## 7 ICE STORM 1975 89 67689.62
## 8 FLASH FLOOD 1777 978 1599325.05
## 9 THUNDERSTORM WIND 1488 133 943635.62
## 10 HAIL 1361 15 1268289.66
## 11 WINTER STORM 1321 206 134699.58
## 12 HURRICANE/TYPHOON 1275 64 10637.85
## 13 HIGH WIND 1137 248 342014.77
## 14 HEAVY SNOW 1021 127 124417.71
## 15 WILDFIRE 911 75 88823.54
## .. ... ... ... ...
TopFATAL <- tbl_df(EventsData) %>% arrange(desc(Total_FATALITIES)) %>% print(n = 15)
## Source: local data frame [985 x 4]
##
## EVTYPE Total_INJURIES Total_FATALITIES Total_DMG
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 91346 5633 3312276.68
## 2 EXCESSIVE HEAT 6525 1903 1954.40
## 3 FLASH FLOOD 1777 978 1599325.05
## 4 HEAT 2100 937 961.20
## 5 LIGHTNING 5230 816 606932.39
## 6 TSTM WIND 6957 504 1445168.21
## 7 FLOOD 6789 470 1067976.36
## 8 RIP CURRENT 232 368 1.00
## 9 HIGH WIND 1137 248 342014.77
## 10 AVALANCHE 170 224 1623.90
## 11 WINTER STORM 1321 206 134699.58
## 12 RIP CURRENTS 297 204 162.00
## 13 HEAT WAVE 309 172 1524.55
## 14 EXTREME COLD 231 160 13778.68
## 15 THUNDERSTORM WIND 1488 133 943635.62
## .. ... ... ... ...
Plot the top weather events by their impact on population health
plot1 <- ggplot(head(TopFATAL, 15), aes(x = reorder(EVTYPE, Total_FATALITIES), y = Total_FATALITIES)) + geom_bar(fill = "darkblue", stat = "identity") + xlab("Type of Event") + ylab("Total number of Fatalities") + coord_flip() + ggtitle("Weather events and health effects in the US")
plot2 <- ggplot(head(TopINJ, 15), aes(x = reorder(EVTYPE, Total_INJURIES), y = Total_INJURIES)) + geom_bar(fill = "darkred", stat = "identity") + xlab("Type of Event") + ylab("Total number of Injuries") + coord_flip()
grid.arrange(plot1, plot2, nrow = 2)
The plots show the event types with the highest impact on population health. Tornadoes have the highest impact on both injuries and fatalities in the U.S. The ranking of the other event types differs between fatalities and injuries, which suggests that the way these occurrences are classified might influence the analysis.
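The tables above list RIP CURRENT and RIP CURRENTS as separate event types, and likewise TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS. A minimal sketch of merging such labels before summarising follows, using the fatality totals shown above. The recode table and the helper `normalise_evtype` are hypothetical, and treating TSTM WIND as THUNDERSTORM WIND is an assumption.

```r
# Labels and fatality totals copied from the summary tables above.
toy <- data.frame(
  EVTYPE = c("RIP CURRENT", "RIP CURRENTS", "TSTM WIND",
             "THUNDERSTORM WIND", "THUNDERSTORM WINDS"),
  Total_FATALITIES = c(368, 204, 504, 133, 64),
  stringsAsFactors = FALSE
)

# Hypothetical recode table: variant label -> label to merge into.
recode <- c(
  "RIP CURRENTS"       = "RIP CURRENT",
  "TSTM WIND"          = "THUNDERSTORM WIND",
  "THUNDERSTORM WINDS" = "THUNDERSTORM WIND"
)

# Trim and upper-case the labels, then apply the recode table where it matches.
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))
  hit <- x %in% names(recode)
  x[hit] <- unname(recode[x[hit]])
  x
}

toy$EVTYPE <- normalise_evtype(toy$EVTYPE)
merged <- aggregate(Total_FATALITIES ~ EVTYPE, data = toy, FUN = sum)
merged <- merged[order(-merged$Total_FATALITIES), ]
print(merged)
```

An explicit table is used instead of pattern rules so that each merge is a visible decision. With 985 labels, a fuller cleaning pass would need many more entries.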
Top 15 events that impact the economy (property and crop damage)
TopDAMAGE <- tbl_df(EventsData) %>% arrange(desc(Total_DMG)) %>% print(n = 15)
## Source: local data frame [985 x 4]
##
## EVTYPE Total_INJURIES Total_FATALITIES Total_DMG
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 91346 5633 3312276.68
## 2 FLASH FLOOD 1777 978 1599325.05
## 3 TSTM WIND 6957 504 1445168.21
## 4 HAIL 1361 15 1268289.66
## 5 FLOOD 6789 470 1067976.36
## 6 THUNDERSTORM WIND 1488 133 943635.62
## 7 LIGHTNING 5230 816 606932.39
## 8 THUNDERSTORM WINDS 908 64 464978.11
## 9 HIGH WIND 1137 248 342014.77
## 10 WINTER STORM 1321 206 134699.58
## 11 HEAVY SNOW 1021 127 124417.71
## 12 WILDFIRE 911 75 88823.54
## 13 ICE STORM 1975 89 67689.62
## 14 STRONG WIND 280 103 64610.71
## 15 HEAVY RAIN 251 98 61964.94
## .. ... ... ... ...
Plot the top weather events by economic damage
ggplot(head(TopDAMAGE, 15), aes(x = reorder(EVTYPE, Total_DMG), y = Total_DMG, fill = Total_DMG)) + geom_bar(stat = "identity") + guides(fill = "none") + xlab("Type of Event") + ylab("Total Property and Crop Damage") + ggtitle("Weather events and economic impact in the US") + theme(text = element_text(size = 11), axis.text.x = element_text(angle = 90, vjust = 1))
The plot shows the 15 event types with the highest levels of economic damage. Tornadoes cause the highest economic damage, which is consistent with the analysis of event type and population health.