Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
We first load the 2 function libraries (dplyr and ggplot2). Then unzip the compressed file and store the .csv file in data raw.
This study is to investigate the impacts of natural disaster to Public Safety and Economics. So we create 2 subsets of raw, named health and eco
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
raw = read.csv("repdata%2Fdata%2FStormData.csv.bz2")
health = subset(raw, select=c(EVTYPE,FATALITIES, INJURIES))
eco = subset(raw, select = c(EVTYPE, PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
First, I group the data according to EVTYPE, then calculate the sum number of fatalities and sum of injuries for each type of natual disaster. After that, we add 2 more columes: FnI, denoting the sum of fatalities and injuries; Score, denoting the severity of the natural disaster (Here, we define score = number of fatalities + 0.5 * number of injuries)
health_byEvent = group_by(health, EVTYPE)
health_stats = summarise(health_byEvent, SumFata = sum(FATALITIES), SumInju = sum(INJURIES))
health_stats2 = mutate(health_stats, FnI = SumFata+SumInju, Score=SumFata+0.5*SumInju)
Then we rank the most catastrophic disasters from 4 perspectives: most fatalities, most injuries, most fatalities and injuries, highest severity score.
Lastly, we compile 4 rankings together and we find out that TSTM WIND, TORANDO, LIGHTNING, HEAT, EXCESSIVE HEAT,FlOOD and FLASH FLOOD appear in all 4 rankings. So they are most dangerous to public health.
Sum_Fata = health_stats2[order(health_stats2$SumFata,decreasing = TRUE),][1:10,]
Sum_Inju = health_stats2[order(health_stats2$SumInju,decreasing = TRUE),][1:10,]
Fn_I = health_stats2[order(health_stats2$FnI,decreasing = TRUE),][1:10,]
Scor_e = health_stats2[order(health_stats2$Score,decreasing = TRUE),][1:10,]
combined = rbind(Sum_Fata,Sum_Inju,Fn_I,Scor_e)
ggplot(combined, aes(x=EVTYPE)) + geom_histogram(stat = "count") + coord_flip()
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Sum_Fata
## # A tibble: 10 x 5
## EVTYPE SumFata SumInju FnI Score
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979 51306
## 2 EXCESSIVE HEAT 1903 6525 8428 5166.
## 3 FLASH FLOOD 978 1777 2755 1866.
## 4 HEAT 937 2100 3037 1987
## 5 LIGHTNING 816 5230 6046 3431
## 6 TSTM WIND 504 6957 7461 3982.
## 7 FLOOD 470 6789 7259 3864.
## 8 RIP CURRENT 368 232 600 484
## 9 HIGH WIND 248 1137 1385 816.
## 10 AVALANCHE 224 170 394 309
Sum_Inju
## # A tibble: 10 x 5
## EVTYPE SumFata SumInju FnI Score
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979 51306
## 2 TSTM WIND 504 6957 7461 3982.
## 3 FLOOD 470 6789 7259 3864.
## 4 EXCESSIVE HEAT 1903 6525 8428 5166.
## 5 LIGHTNING 816 5230 6046 3431
## 6 HEAT 937 2100 3037 1987
## 7 ICE STORM 89 1975 2064 1076.
## 8 FLASH FLOOD 978 1777 2755 1866.
## 9 THUNDERSTORM WIND 133 1488 1621 877
## 10 HAIL 15 1361 1376 696.
Fn_I
## # A tibble: 10 x 5
## EVTYPE SumFata SumInju FnI Score
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979 51306
## 2 EXCESSIVE HEAT 1903 6525 8428 5166.
## 3 TSTM WIND 504 6957 7461 3982.
## 4 FLOOD 470 6789 7259 3864.
## 5 LIGHTNING 816 5230 6046 3431
## 6 HEAT 937 2100 3037 1987
## 7 FLASH FLOOD 978 1777 2755 1866.
## 8 ICE STORM 89 1975 2064 1076.
## 9 THUNDERSTORM WIND 133 1488 1621 877
## 10 WINTER STORM 206 1321 1527 866.
Scor_e
## # A tibble: 10 x 5
## EVTYPE SumFata SumInju FnI Score
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979 51306
## 2 EXCESSIVE HEAT 1903 6525 8428 5166.
## 3 TSTM WIND 504 6957 7461 3982.
## 4 FLOOD 470 6789 7259 3864.
## 5 LIGHTNING 816 5230 6046 3431
## 6 HEAT 937 2100 3037 1987
## 7 FLASH FLOOD 978 1777 2755 1866.
## 8 ICE STORM 89 1975 2064 1076.
## 9 THUNDERSTORM WIND 133 1488 1621 877
## 10 WINTER STORM 206 1321 1527 866.
We first summarise the contents of PROPDMGEXP and CROPDMGEXP and we find 2 problems. 1. Unknown symbols such as “-”,“?”,“+”,“0”,… Then we use subset to find the full row containing the unknown symbols. We find that most of them are meaningless, perhaps wrong input, which should be removed from the database. Since they only take up a small portion of the total 902297 observations. It does not matter that we remove them.
table(eco$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(eco$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Then we create a new dataframe eco_2 to translate the unit from natural language to number. We design a for loop to achieve that. [This is very time-consuming so I did not get a full result here]
eco_2 = mutate(eco, propDmgUnit = 0, cropDmgUnit =0) i=1 for (i in 1:902297) { if (eco_2\(PROPDMGEXP[i] %in% c("h", "H")) {eco_2[i,6] = 100} else if (eco_2\)PROPDMGEXP[i] %in% c(“k”, “K”)) {eco_2[i,6] = 1000} else if (eco_2\(PROPDMGEXP[i] %in% c("m", "M")) {eco_2[i,6] = 1000000} else if (eco_2\)PROPDMGEXP[i] %in% c(“B”)) {eco_2[i,6] = 1000000000} else {eco_2[i,6] = 0}
if (eco_2$CROPDMGEXP[i] %in% c("k", "K")) {eco_2[i,7] = 1000}
else if (eco_2$CROPDMGEXP[i] %in% c("m", "M")) {eco_2[i,7] = 1000000}
else if (eco_2$CROPDMGEXP[i] %in% c("B")) {eco_2[i,7] = 1000000000}
else {eco_2[i,7] = 0}
}
Lastly, we calculate the property pecuniary damage, crop pecuniary damage and total damage.
eco_3 = mutate(eco_2, proplost = PROPDMG * propDmgUnit, croplost = CROPDMG * cropDmgUnit)
eco_3 = group_by(eco_3, eco_3$EVTYPE) eco_4 = summarise(eco_3, SumPropLost = sum(proplost), SumCropLost = sum(croplost)) eco_4 = mutate(eco_4, totalLost = SumPropLost + SumCropLost)
Draw graph here…
Code Book This is to provide some idea of the purpose of each of the 37 columns, along with url links for further information.
https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf http://www.ncdc.noaa.gov/stormevents/details.jsp http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf
“STATE__” State number (1 = Alabama) “BGN_DATE” Begin date “BGN_TIME” Begin time “TIME_ZONE” Time zone “COUNTY” County number “COUNTYNAME” County name “STATE” State, 2 letter abbreviation “EVTYPE” Type of event (storm, flood, etc) “BGN_RANGE” Beginning range “BGN_AZI” Beginning azimuth “BGN_LOCATI” Beginning location “END_DATE” End date “END_TIME” End time “COUNTY_END” County where event ended? (name or number?) “COUNTYENDN” County where event ended? (name or number?) “END_RANGE” Ending range “END_AZI” Ending azimuth “END_LOCATI” Ending location “LENGTH” Length of tornado path (in yards?) “WIDTH” Maximum width of tornado’s path in yards “F” Fujita tornado intensity scale “MAG” Hail in inches (implied hundreths) “FATALITIES” Number of fatalities “INJURIES” Number of injuries “PROPDMG” Property damage in dollars “PROPDMGEXP” K=thousands, M=millions, B=billions “CROPDMG” Crop damage in dollars “CROPDMGEXP” H=hundreds, K=thousands, M=millions, B=billions “WFO” Weather Forecast Office “STATEOFFIC” State office? “ZONENAMES” Zone names? “LATITUDE” Latitude “LONGITUDE” Longitude “LATITUDE_E” ? “LONGITUDE_” ? “REMARKS” Remarks “REFNUM” Reference number?