Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

We first load the 2 function libraries (dplyr and ggplot2). Then unzip the compressed file and store the .csv file in data raw.

This study is to investigate the impacts of natural disaster to Public Safety and Economics. So we create 2 subsets of raw, named health and eco

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
raw = read.csv("repdata%2Fdata%2FStormData.csv.bz2")
health = subset(raw, select=c(EVTYPE,FATALITIES, INJURIES))
eco = subset(raw, select = c(EVTYPE, PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))

Results

Impacts to Public Safety

First, I group the data according to EVTYPE, then calculate the sum number of fatalities and sum of injuries for each type of natual disaster. After that, we add 2 more columes: FnI, denoting the sum of fatalities and injuries; Score, denoting the severity of the natural disaster (Here, we define score = number of fatalities + 0.5 * number of injuries)

health_byEvent = group_by(health, EVTYPE)
health_stats = summarise(health_byEvent, SumFata = sum(FATALITIES), SumInju = sum(INJURIES))
health_stats2 = mutate(health_stats, FnI = SumFata+SumInju, Score=SumFata+0.5*SumInju)

Then we rank the most catastrophic disasters from 4 perspectives: most fatalities, most injuries, most fatalities and injuries, highest severity score.

Lastly, we compile 4 rankings together and we find out that TSTM WIND, TORANDO, LIGHTNING, HEAT, EXCESSIVE HEAT,FlOOD and FLASH FLOOD appear in all 4 rankings. So they are most dangerous to public health.

Sum_Fata = health_stats2[order(health_stats2$SumFata,decreasing = TRUE),][1:10,]
Sum_Inju = health_stats2[order(health_stats2$SumInju,decreasing = TRUE),][1:10,]
Fn_I = health_stats2[order(health_stats2$FnI,decreasing = TRUE),][1:10,]
Scor_e = health_stats2[order(health_stats2$Score,decreasing = TRUE),][1:10,]

combined = rbind(Sum_Fata,Sum_Inju,Fn_I,Scor_e)

ggplot(combined, aes(x=EVTYPE)) + geom_histogram(stat = "count") + coord_flip()
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Sum_Fata
## # A tibble: 10 x 5
##    EVTYPE         SumFata SumInju   FnI  Score
##    <fct>            <dbl>   <dbl> <dbl>  <dbl>
##  1 TORNADO           5633   91346 96979 51306 
##  2 EXCESSIVE HEAT    1903    6525  8428  5166.
##  3 FLASH FLOOD        978    1777  2755  1866.
##  4 HEAT               937    2100  3037  1987 
##  5 LIGHTNING          816    5230  6046  3431 
##  6 TSTM WIND          504    6957  7461  3982.
##  7 FLOOD              470    6789  7259  3864.
##  8 RIP CURRENT        368     232   600   484 
##  9 HIGH WIND          248    1137  1385   816.
## 10 AVALANCHE          224     170   394   309
Sum_Inju
## # A tibble: 10 x 5
##    EVTYPE            SumFata SumInju   FnI  Score
##    <fct>               <dbl>   <dbl> <dbl>  <dbl>
##  1 TORNADO              5633   91346 96979 51306 
##  2 TSTM WIND             504    6957  7461  3982.
##  3 FLOOD                 470    6789  7259  3864.
##  4 EXCESSIVE HEAT       1903    6525  8428  5166.
##  5 LIGHTNING             816    5230  6046  3431 
##  6 HEAT                  937    2100  3037  1987 
##  7 ICE STORM              89    1975  2064  1076.
##  8 FLASH FLOOD           978    1777  2755  1866.
##  9 THUNDERSTORM WIND     133    1488  1621   877 
## 10 HAIL                   15    1361  1376   696.
Fn_I
## # A tibble: 10 x 5
##    EVTYPE            SumFata SumInju   FnI  Score
##    <fct>               <dbl>   <dbl> <dbl>  <dbl>
##  1 TORNADO              5633   91346 96979 51306 
##  2 EXCESSIVE HEAT       1903    6525  8428  5166.
##  3 TSTM WIND             504    6957  7461  3982.
##  4 FLOOD                 470    6789  7259  3864.
##  5 LIGHTNING             816    5230  6046  3431 
##  6 HEAT                  937    2100  3037  1987 
##  7 FLASH FLOOD           978    1777  2755  1866.
##  8 ICE STORM              89    1975  2064  1076.
##  9 THUNDERSTORM WIND     133    1488  1621   877 
## 10 WINTER STORM          206    1321  1527   866.
Scor_e
## # A tibble: 10 x 5
##    EVTYPE            SumFata SumInju   FnI  Score
##    <fct>               <dbl>   <dbl> <dbl>  <dbl>
##  1 TORNADO              5633   91346 96979 51306 
##  2 EXCESSIVE HEAT       1903    6525  8428  5166.
##  3 TSTM WIND             504    6957  7461  3982.
##  4 FLOOD                 470    6789  7259  3864.
##  5 LIGHTNING             816    5230  6046  3431 
##  6 HEAT                  937    2100  3037  1987 
##  7 FLASH FLOOD           978    1777  2755  1866.
##  8 ICE STORM              89    1975  2064  1076.
##  9 THUNDERSTORM WIND     133    1488  1621   877 
## 10 WINTER STORM          206    1321  1527   866.

Impacts to Economics

We first summarise the contents of PROPDMGEXP and CROPDMGEXP and we find 2 problems. 1. Unknown symbols such as “-”,“?”,“+”,“0”,… Then we use subset to find the full row containing the unknown symbols. We find that most of them are meaningless, perhaps wrong input, which should be removed from the database. Since they only take up a small portion of the total 902297 observations. It does not matter that we remove them.

  1. Inconsistency “k” should be corrected to capital “K”; “h” should be corrected to capital “H”…
table(eco$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(eco$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Then we create a new dataframe eco_2 to translate the unit from natural language to number. We design a for loop to achieve that. [This is very time-consuming so I did not get a full result here]

eco_2 = mutate(eco, propDmgUnit = 0, cropDmgUnit =0) i=1 for (i in 1:902297) { if (eco_2\(PROPDMGEXP[i] %in% c("h", "H")) {eco_2[i,6] = 100} else if (eco_2\)PROPDMGEXP[i] %in% c(“k”, “K”)) {eco_2[i,6] = 1000} else if (eco_2\(PROPDMGEXP[i] %in% c("m", "M")) {eco_2[i,6] = 1000000} else if (eco_2\)PROPDMGEXP[i] %in% c(“B”)) {eco_2[i,6] = 1000000000} else {eco_2[i,6] = 0}

    if (eco_2$CROPDMGEXP[i] %in% c("k", "K")) {eco_2[i,7] = 1000}
    else if (eco_2$CROPDMGEXP[i] %in% c("m", "M")) {eco_2[i,7] = 1000000}
    else if (eco_2$CROPDMGEXP[i] %in% c("B")) {eco_2[i,7] = 1000000000}
    else {eco_2[i,7] = 0}

}

Lastly, we calculate the property pecuniary damage, crop pecuniary damage and total damage.

eco_3 = mutate(eco_2, proplost = PROPDMG * propDmgUnit, croplost = CROPDMG * cropDmgUnit)
eco_3 = group_by(eco_3, eco_3$EVTYPE) eco_4 = summarise(eco_3, SumPropLost = sum(proplost), SumCropLost = sum(croplost)) eco_4 = mutate(eco_4, totalLost = SumPropLost + SumCropLost)

Draw graph here…

Code Book This is to provide some idea of the purpose of each of the 37 columns, along with url links for further information.

https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf http://www.ncdc.noaa.gov/stormevents/details.jsp http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf

“STATE__” State number (1 = Alabama) “BGN_DATE” Begin date “BGN_TIME” Begin time “TIME_ZONE” Time zone “COUNTY” County number “COUNTYNAME” County name “STATE” State, 2 letter abbreviation “EVTYPE” Type of event (storm, flood, etc) “BGN_RANGE” Beginning range “BGN_AZI” Beginning azimuth “BGN_LOCATI” Beginning location “END_DATE” End date “END_TIME” End time “COUNTY_END” County where event ended? (name or number?) “COUNTYENDN” County where event ended? (name or number?) “END_RANGE” Ending range “END_AZI” Ending azimuth “END_LOCATI” Ending location “LENGTH” Length of tornado path (in yards?) “WIDTH” Maximum width of tornado’s path in yards “F” Fujita tornado intensity scale “MAG” Hail in inches (implied hundreths) “FATALITIES” Number of fatalities “INJURIES” Number of injuries “PROPDMG” Property damage in dollars “PROPDMGEXP” K=thousands, M=millions, B=billions “CROPDMG” Crop damage in dollars “CROPDMGEXP” H=hundreds, K=thousands, M=millions, B=billions “WFO” Weather Forecast Office “STATEOFFIC” State office? “ZONENAMES” Zone names? “LATITUDE” Latitude “LONGITUDE” Longitude “LATITUDE_E” ? “LONGITUDE_” ? “REMARKS” Remarks “REFNUM” Reference number?