Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This analysis uses the storm data file (47 MB) from the NOAA Storm Database. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
We use this data to answer the following two questions, with the aim of helping to anticipate and reduce the human and economic cost of severe weather events.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Based on the reproducible analysis below, we found that tornadoes cause the most fatalities and injuries, and that floods cause the greatest economic damage.
Download the file
We start by downloading the NOAA file and reading it into an R object named "NOAA".
## Download the compressed file only if a local copy does not exist yet
if (!file.exists("stormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "stormData.csv.bz2", method = "curl")
}
## bzfile() decompresses the file on the fly while read.csv() parses it
NOAA <- read.csv(bzfile("stormData.csv.bz2"), sep = ",", header = TRUE)
## View Dimensions of the data
dim(NOAA)
## [1] 902297 37
## View header
head(NOAA)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
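The download-and-read step above can be wrapped in a small function so that repeated runs reuse the local file. This is a minimal sketch, not the method used for the results in this report: the helper name `read_storm_data` and the tiny stand-in file are assumptions made for illustration, and the demo data are invented rather than taken from NOAA.

```r
# Build a tiny stand-in for the storm file (made-up rows, not NOAA records)
demo_path <- file.path(tempdir(), "demo.csv.bz2")
demo <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))
con <- bzfile(demo_path, open = "w")
write.csv(demo, con, row.names = FALSE)
close(con)

# Hypothetical helper: download only when the local copy is missing, then read it
read_storm_data <- function(path, url = NULL) {
  if (!file.exists(path)) {
    if (is.null(url)) stop("file not found and no URL supplied: ", path)
    download.file(url, destfile = path, mode = "wb")
  }
  # bzfile() decompresses while read.csv() parses
  read.csv(bzfile(path), header = TRUE, stringsAsFactors = FALSE)
}

demo_read <- read_storm_data(demo_path)
dim(demo_read)
```

For the real data, the same helper could be called with "stormData.csv.bz2" as `path` and the download URL from this report as `url`.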
For this analysis of damage we only need a few columns: EVTYPE (event type), FATALITIES, INJURIES, PROPDMG (property damage), PROPDMGEXP (property damage exponent, the magnitude code for PROPDMG), CROPDMG (crop damage), and CROPDMGEXP (crop damage exponent, the magnitude code for CROPDMG). We isolate these columns and store them in an R object called "NOAA_1".
NOAA_1 <- NOAA[,c('EVTYPE','FATALITIES','INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')]
## View headers in revised data set
head(NOAA_1)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
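With only the relevant columns kept, we now convert PROPDMG and CROPDMG into dollar amounts. The PROPDMGEXP and CROPDMGEXP fields hold a magnitude code: H (hundreds = 10^2), K (thousands = 10^3), M (millions = 10^6), and B (billions = 10^9), following the Wikipedia table of powers of 10. Rows whose code is anything else (for example blank) keep the default dollar value of 0.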
## Convert Property Damage
NOAA_1$PROPDMGDOLLARS = 0
NOAA_1[NOAA_1$PROPDMGEXP == "H", ]$PROPDMGDOLLARS = NOAA_1[NOAA_1$PROPDMGEXP == "H", ]$PROPDMG * 10^2
NOAA_1[NOAA_1$PROPDMGEXP == "K", ]$PROPDMGDOLLARS = NOAA_1[NOAA_1$PROPDMGEXP == "K", ]$PROPDMG * 10^3
NOAA_1[NOAA_1$PROPDMGEXP == "M", ]$PROPDMGDOLLARS = NOAA_1[NOAA_1$PROPDMGEXP == "M", ]$PROPDMG * 10^6
NOAA_1[NOAA_1$PROPDMGEXP == "B", ]$PROPDMGDOLLARS = NOAA_1[NOAA_1$PROPDMGEXP == "B", ]$PROPDMG * 10^9
## Convert Crop Damage
NOAA_1$CROPDMGDOLLARS = 0
NOAA_1[NOAA_1$CROPDMGEXP == "H", ]$CROPDMGDOLLARS = NOAA_1[NOAA_1$CROPDMGEXP == "H", ]$CROPDMG * 10^2
NOAA_1[NOAA_1$CROPDMGEXP == "K", ]$CROPDMGDOLLARS = NOAA_1[NOAA_1$CROPDMGEXP == "K", ]$CROPDMG * 10^3
NOAA_1[NOAA_1$CROPDMGEXP == "M", ]$CROPDMGDOLLARS = NOAA_1[NOAA_1$CROPDMGEXP == "M", ]$CROPDMG * 10^6
NOAA_1[NOAA_1$CROPDMGEXP == "B", ]$CROPDMGDOLLARS = NOAA_1[NOAA_1$CROPDMGEXP == "B", ]$CROPDMG * 10^9
## View revised headers of NOAA_1
head(NOAA_1)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
## PROPDMGDOLLARS CROPDMGDOLLARS
## 1 25000 0
## 2 2500 0
## 3 25000 0
## 4 2500 0
## 5 2500 0
## 6 2500 0
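The eight per-code assignments above can also be expressed as a single lookup. The sketch below is an alternative, not the code that produced the results in this report. The helper name `to_dollars` and the demo values are assumptions for illustration. Unlike the code above, it also folds lowercase codes into their uppercase equivalents, which would change totals if such codes occur in the data.

```r
# Named lookup vector: exponent code -> multiplier
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert an amount and its exponent code to dollars
to_dollars <- function(amount, exp_code) {
  # toupper() maps lowercase codes such as "k" onto "K" (an assumption, see text)
  mult <- exp_multiplier[toupper(as.character(exp_code))]
  # Codes missing from the table (blank, symbols, digits) yield NA; count them as 0
  mult[is.na(mult)] <- 0
  unname(amount * mult)
}

# Made-up demo rows, not NOAA records
demo <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 7),
  PROPDMGEXP = c("K", "M", "b", "", "?")
)
demo$PROPDMGDOLLARS <- to_dollars(demo$PROPDMG, demo$PROPDMGEXP)
demo
```

Applied to the real table, the same call would be `to_dollars(NOAA_1$PROPDMG, NOAA_1$PROPDMGEXP)`, and likewise for the crop columns.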
With the damage figures expressed in dollars, we can now use the data to answer the two questions.
## Load the appropriate libraries in R
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Organize type of event ~ fatalities and store in object "fatalities" & same for "injuries"
fatalities <- aggregate(FATALITIES ~ EVTYPE, data=NOAA_1, sum)
injuries <- aggregate(INJURIES ~ EVTYPE, data = NOAA_1, sum)
## Sort fatalities
fatalities <- fatalities[order(-fatalities$FATALITIES), ][1:20, ]
fatalities$EVTYPE <- factor(fatalities$EVTYPE, levels = fatalities$EVTYPE)
head(fatalities)
## EVTYPE FATALITIES
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
## 856 TSTM WIND 504
## Sort Injuries
injuries <- injuries[order(-injuries$INJURIES), ][1:20, ]
injuries$EVTYPE <- factor(injuries$EVTYPE, levels = injuries$EVTYPE)
head(injuries)
## EVTYPE INJURIES
## 834 TORNADO 91346
## 856 TSTM WIND 6957
## 170 FLOOD 6789
## 130 EXCESSIVE HEAT 6525
## 464 LIGHTNING 5230
## 275 HEAT 2100
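The aggregate-then-sort steps above use base R. Since dplyr is already loaded, the same summary could be written as one pipeline. This is a sketch under stated assumptions: the `demo` data frame is a made-up miniature of NOAA_1, and the object names `health` and `top_fatalities` are introduced here for illustration only.

```r
library(dplyr)

# Made-up miniature of NOAA_1, for illustration only
demo <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  INJURIES   = c(10, 2, 5, 1, 3)
)

# Sum both health measures per event type, then sort by fatalities (largest first)
health <- demo %>%
  group_by(EVTYPE) %>%
  summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES)) %>%
  arrange(desc(FATALITIES))

# Keep at most the top 20 event types, as in the base R version
top_fatalities <- head(health, 20)
top_fatalities
```

Sorting by `desc(INJURIES)` instead gives the injury ranking from the same summary table.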
## Plot using ggplot2
## theme_bw() is added as a plot layer; it does not belong inside aes()
p1 <- ggplot(fatalities, aes(x = EVTYPE, y = FATALITIES)) +
geom_bar(stat = "identity", fill = "orange") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6)) +
xlab("Event Type") + ylab("Fatalities") + ggtitle("Fatalities by top 20 Weather Event Types") +
theme(plot.title = element_text(size = 10))
p2 <- ggplot(injuries, aes(x = EVTYPE, y = INJURIES)) +
geom_bar(stat = "identity", fill = "pink") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6)) +
xlab("Event Type") + ylab("Injuries") + ggtitle("Injuries by top 20 Weather Event Types") +
theme(plot.title = element_text(size = 10))
## Plot both side by side using gridExtra package
grid.arrange(p1, p2, ncol = 2, top = "Most Harmful Events with Respect to Population Health")
From this chart we can conclude that tornadoes are the most harmful event type for population health, with the highest totals of both fatalities (5,633) and injuries (91,346).
## Organize Property & Crop to Event Type and store in object "damage"
damage <- aggregate(PROPDMGDOLLARS + CROPDMGDOLLARS ~ EVTYPE, data=NOAA_1, sum)
names(damage) = c("EVENT_TYPE", "TOTAL_DAMAGE")
## Sort
damage <- damage[order(-damage$TOTAL_DAMAGE), ][1:20, ]
damage$EVENT_TYPE <- factor(damage$EVENT_TYPE, levels = damage$EVENT_TYPE)
## Check headers
head(damage)
## EVENT_TYPE TOTAL_DAMAGE
## 170 FLOOD 150319678250
## 411 HURRICANE/TYPHOON 71913712800
## 834 TORNADO 57340613590
## 670 STORM SURGE 43323541000
## 244 HAIL 18752904670
## 153 FLASH FLOOD 17562128610
## Plot using ggplot2
## theme_bw() is added as a plot layer; it does not belong inside aes()
ggplot(damage, aes(x = EVENT_TYPE, y = TOTAL_DAMAGE)) +
geom_bar(stat = "identity", fill = "green") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Event Type") + ylab("Total Damage in $USD") + ggtitle("Total Property & Crop Damage by top 20 Weather Events")
Based on this chart we can conclude that the costliest event type is flood, with about $150 billion in combined property and crop damage. Floods therefore have the greatest (adverse) economic impact.
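The raw dollar totals are hard to read at this magnitude. As a small illustration, the six totals printed above can be rescaled to billions of dollars. The column name `BILLIONS_USD` is an assumption introduced here, and the values are copied from the head(damage) output rather than recomputed.

```r
# Top six totals copied from the head(damage) output above
top_damage <- data.frame(
  EVENT_TYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO",
                   "STORM SURGE", "HAIL", "FLASH FLOOD"),
  TOTAL_DAMAGE = c(150319678250, 71913712800, 57340613590,
                   43323541000, 18752904670, 17562128610)
)

# Rescale to billions of dollars, rounded to one decimal place
top_damage$BILLIONS_USD <- round(top_damage$TOTAL_DAMAGE / 1e9, 1)
top_damage[, c("EVENT_TYPE", "BILLIONS_USD")]
```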
END
Assignment
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Requirements
For this assignment you will need some specific tools
Document Layout
Publishing Your Analysis
For this assignment you will need to publish your analysis on RPubs.com. If you do not already have an account, then you will have to create a new account. After you have completed writing your analysis in RStudio, you can publish it to RPubs by doing the following:
In RStudio, make sure your R Markdown (.Rmd) document is loaded in the editor
Click the Knit HTML button in the doc toolbar to preview your document.
In the preview window, click the Publish button.
Once your document is published to RPubs, you should get a unique URL to that document. Make a note of this URL as you will need it to submit your assignment.
NOTE: If you are having trouble connecting with RPubs due to proxy-related or other issues, you can upload your final analysis document file as a PDF to Coursera instead.
Submitting Your Assignment
In order to submit this assignment, you must copy the RPubs URL for your completed data analysis document in to the peer assessment question.
If you choose to submit as a PDF, please insert an obvious placeholder URL (e.g. https://google.com) in order to allow submission.
Review criteria
Has either a (1) valid RPubs URL pointing to a data analysis document for this assignment been submitted; or (2) a complete PDF file presenting the data analysis been uploaded?
Is the document written in English?
Does the analysis include description and justification for any data transformations?
Does the document have a title that briefly summarizes the data analysis?
Does the document have a synopsis that describes and summarizes the data analysis in less than 10 sentences?
Is there a section titled "Data Processing" that describes how the data were loaded into R and processed for analysis?
Is there a section titled "Results" where the main results are presented?
Is there at least one figure in the document that contains a plot?
Are there at most 3 figures in this document?
Does the analysis start from the raw data file (i.e. the original .csv.bz2 file)?
Does the analysis address the question of which types of events are most harmful to population health?
Does the analysis address the question of which types of events have the greatest economic consequences?
Do all the results of the analysis (i.e. figures, tables, numerical summaries) appear to be reproducible?
Do the figure(s) have descriptive captions (i.e. there is a description near the figure of what is happening in the figure)?
As far as you can determine, does it appear that the work submitted for this project is the work of the student who submitted it?