In this report we describe the harm caused by storms and other severe weather events. The analysis identifies which type of weather event is most harmful to the population and which type has the greatest economic consequences.
The analysis is based on data from the NOAA Storm Database and covers weather events that occurred in the US between 1990 and 2011.
In the next section we load and process the data. After that we present the results of the analysis.
The raw data can be downloaded from the course website as a bzip2 file. We download the data and unzip the file into our local repository.
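As an alternative to unzipping by hand, base R can read a bzip2-compressed CSV through a `bzfile()` connection. The sketch below is self-contained: it writes a tiny compressed file to a temporary location and reads it back. The sample rows and the temporary file are made up for illustration and are not the course data.

```r
# Hypothetical sample rows standing in for the storm data
sampleRows <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/1/1995 0:00:00"),
  EVTYPE = c("TORNADO", "HAIL"),
  FATALITIES = c(0, 1),
  stringsAsFactors = FALSE
)

# Write a bzip2-compressed CSV to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(sampleRows, con, row.names = FALSE)
close(con)

# read.csv accepts the connection and decompresses while reading
demo <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
str(demo)
```

Reading from the compressed file avoids keeping a second, much larger uncompressed copy in the repository.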
# read data
data <- read.csv(file = "repdata_data_StormData.csv", sep = ",")
# display structure, data types and dimension of data frame, variables
# str(data)
# eliminate inconsistencies in labels and convert EVTYPE to lower case
data$EVTYPE <- gsub("[ /\t.]+", " ", as.character(data$EVTYPE))
data$EVTYPE <- tolower(as.character(data$EVTYPE))
# convert date into date format
data$date <- as.Date(as.character(data$BGN_DATE), format = "%m/%d/%Y")
# subset data to date range 1990 - 2011
dataNew <- subset(data, date >= as.Date("1990-01-01"))
# aggregate fatalities and property damage by event type
dataAgg <- aggregate(cbind(FATALITIES, PROPDMG) ~ EVTYPE, dataNew, FUN = sum)
The raw data consists of 37 variables and 902297 observations. We subset the data to the time range 1990-2011, which reduces the data frame to 751740 observations. The variable EVTYPE describes the type of weather event and is the explanatory variable in the following analysis. The variable FATALITIES records the deaths caused by the weather event and serves as the indicator for the threat to the population. The variable PROPDMG records the property damage caused by the event and serves as the indicator for economic damage.
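The aggregation in this report sums PROPDMG as recorded. In the NOAA storm data the damage figure is normally read together with a magnitude code in the companion column PROPDMGEXP (for example K for thousands, M for millions, B for billions). The following is a minimal sketch of how that scaling could be applied before aggregating. The helper name `scaleDamage`, the sample rows, and the choice to treat every other code as a multiplier of 1 are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: convert PROPDMG and its magnitude code to dollars.
# Assumption: codes other than K, M, B (in either case) mean "no scaling".
scaleDamage <- function(dmg, exp) {
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[toupper(as.character(exp))]
  mult[is.na(mult)] <- 1
  dmg * unname(mult)
}

# Made-up rows for illustration only
demo <- data.frame(
  EVTYPE = c("tornado", "tornado", "flood", "hail"),
  PROPDMG = c(2.5, 10, 1.2, 5),
  PROPDMGEXP = c("M", "K", "B", ""),
  stringsAsFactors = FALSE
)
demo$damageUSD <- scaleDamage(demo$PROPDMG, demo$PROPDMGEXP)

# Aggregate the scaled damage by event type
aggregate(damageUSD ~ EVTYPE, demo, FUN = sum)
```

Whether scaling changes the ranking of event types would have to be checked against the full data set.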
library(ggplot2)
summary(dataAgg)
## EVTYPE FATALITIES PROPDMG
## Length:868 Min. : 0.0 Min. : 0
## Class :character 1st Qu.: 0.0 1st Qu.: 0
## Mode :character Median : 0.0 Median : 0
## Mean : 12.8 Mean : 10669
## 3rd Qu.: 0.0 3rd Qu.: 50
## Max. :1903.0 Max. :1588733
# Top 10 types for fatalities
top10Fat <- head(dataAgg[order(dataAgg$FATALITIES, decreasing = TRUE), ], 10)
top10Fat
## EVTYPE FATALITIES PROPDMG
## 115 excessive heat 1903 1460.0
## 736 tornado 1752 1588733.0
## 137 flash flood 978 1420124.6
## 239 heat 937 298.5
## 408 lightning 816 603351.8
## 152 flood 470 899938.5
## 513 rip current 368 1.0
## 757 tstm wind 327 1335995.6
## 311 high wind 248 324731.6
## 19 avalanche 224 1623.9
# plot top 10 fatalities
ggplot(data = top10Fat, aes(EVTYPE, FATALITIES)) + geom_bar(stat = "identity") +
    labs(x = "Weather") + theme(axis.text.x = element_text(angle = -90))
# Top 10 types for property damage
top10Dmg <- head(dataAgg[order(dataAgg$PROPDMG, decreasing = TRUE), ], 10)
top10Dmg
## EVTYPE FATALITIES PROPDMG
## 736 tornado 1752 1588733
## 137 flash flood 978 1420125
## 757 tstm wind 327 1335996
## 152 flood 470 899938
## 668 thunderstorm wind 133 876844
## 208 hail 15 688693
## 408 lightning 816 603352
## 693 thunderstorm winds 64 446318
## 311 high wind 248 324732
## 860 winter storm 206 132721
# plot top 10 damage
ggplot(data = top10Dmg, aes(EVTYPE, PROPDMG)) + geom_bar(stat = "identity") +
    labs(x = "Weather", y = "Damage") + theme(axis.text.x = element_text(angle = -90))
# combine top 10 of both categories
top10 <- rbind(top10Dmg, top10Fat)
top10 <- unique(top10)
ggplot(data = top10, aes(FATALITIES, PROPDMG)) + geom_point(aes(group = EVTYPE,
    color = EVTYPE)) + labs(x = "Fatalities", y = "Damage")
The biggest threat to the population comes from excessive heat (1903 fatalities), whereas tornadoes cause the most economic damage (PROPDMG total of 1588733).
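The damage table lists "tstm wind", "thunderstorm wind" and "thunderstorm winds" as separate event types, so the ranking depends on how labels are grouped. A minimal sketch of merging such variants before aggregating is shown below. The function name `normalizeEvtype`, the two substitution rules, and the sample rows are illustrative assumptions; a full clean-up of EVTYPE would need more rules than these.

```r
# Hypothetical helper: map common spelling variants onto one label.
normalizeEvtype <- function(x) {
  x <- tolower(trimws(x))
  # expand the abbreviation "tstm" to "thunderstorm"
  x <- gsub("\\btstm\\b", "thunderstorm", x, perl = TRUE)
  # fold the plural form into the singular
  x <- sub("^thunderstorm winds$", "thunderstorm wind", x)
  x
}

# Made-up rows for illustration only
demo <- data.frame(
  EVTYPE = c("TSTM WIND", "thunderstorm wind", "Thunderstorm Winds", "hail"),
  FATALITIES = c(3, 1, 2, 0),
  stringsAsFactors = FALSE
)
demo$EVTYPE <- normalizeEvtype(demo$EVTYPE)

# The three wind variants now aggregate into a single row
merged <- aggregate(FATALITIES ~ EVTYPE, demo, FUN = sum)
merged
```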