Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The purpose of the analysis is to identify severe weather events that are most harmful to population health (measured by fatalities and injuries) and those with the greatest economic consequences (measured by property and crop damage). The results of this analysis can be used by government officials to prepare for severe weather events and to prioritize resources for different types of events.
setwd("C:/Users/yxj4/OneDrive - CDC/+My_Documents/CDC/1 DDT/Data Modernization/Coursera/Hopkins Data Science Specialization/5 Reproducible Research/4 Course Project 2/final_project")
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
noaa <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE, sep = ",", na.strings = "")
str(noaa)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
describe(noaa$EVTYPE)
## noaa$EVTYPE
## n missing distinct
## 902297 0 985
##
## lowest : HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD LIGHTNING TSTM WIND
## highest: WINTERY MIX Wintry mix Wintry Mix WINTRY MIX WND
describe(noaa$FATALITIES)
## noaa$FATALITIES
## n missing distinct Info Mean Gmd .05 .10
## 902297 0 52 0.023 0.01678 0.03344 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 0
##
## lowest : 0 1 2 3 4, highest: 99 114 116 158 583
noaa2 <- noaa[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "BGN_DATE")]
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. According to this website, NOAA started recording all severe weather events beginning in 1996: https://www.ncdc.noaa.gov/stormevents/details.jsp. Therefore, in order to provide a fair comparison between all weather event types in terms of the consequences for public health and the economy, we should only evaluate events that happened in 1996 and after.
range(noaa2$BGN_DATE)
## [1] "1/1/1966 0:00:00" "9/9/2011 0:00:00"
noaa2$year <- format(as.Date(noaa2$BGN_DATE, format="%m/%d/%Y %H:%M:%S"),"%Y")
range(noaa2$year)
## [1] "1950" "2011"
table(noaa2$year)
##
## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962
## 223 269 272 492 609 1413 1703 2184 2213 1813 1945 2246 2389
## 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975
## 1968 2348 2855 2388 2688 3312 2926 3215 3471 2168 4463 5386 4975
## 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988
## 3768 3728 3657 4279 6146 4517 7132 8322 7335 7979 8726 7367 7257
## 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## 10410 10946 12522 13534 12607 20631 27970 32270 28680 38128 31289 34471 34962
## 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## 36293 39752 39363 39184 44034 43289 55663 45817 48161 62174
noaa3 <- subset(noaa2, year>=1996)
range(noaa3$year)
## [1] "1996" "2011"
table(noaa3$year)
##
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## 32270 28680 38128 31289 34471 34962 36293 39752 39363 39184 44034 43289 55663
## 2009 2010 2011
## 45817 48161 62174
There are four variables used to calculate property damage and crop
damage:
- PROPDMG: the magnitude of property damage in U.S. Dollars
- PROPDMGEXP: the exponent values for property damage (meaning
10^PROPDMGEXP)
- CROPDMG: the magnitude of crop damage in U.S. Dollars
- CROPDMGEXP: the exponent values for crop damage (meaning
10^CROPDMGEXP)
describe(noaa3$PROPDMG)
## noaa3$PROPDMG
## n missing distinct Info Mean Gmd .05 .10
## 653530 0 1381 0.641 11.69 21.92 0 0
## .25 .50 .75 .90 .95
## 0 0 1 15 50
##
## lowest : 0.00 0.01 0.02 0.03 0.04
## highest: 3200.00 3500.00 4410.00 4800.00 5000.00
unique(noaa3$PROPDMGEXP)
## [1] "K" NA "M" "B" "0"
sum(is.na(noaa3$PROPDMGEXP))
## [1] 276185
sum(!is.na(noaa3$PROPDMGEXP))
## [1] 377345
table(noaa3$PROPDMGEXP)
##
## 0 B K M
## 1 32 369938 7374
noaa3$PROPDMGEXP <- gsub("K", "3", noaa3$PROPDMGEXP) #Create an exponent value for thousand
noaa3$PROPDMGEXP <- gsub("M", "6", noaa3$PROPDMGEXP) #Create an exponent value for million
noaa3$PROPDMGEXP <- gsub("B", "9", noaa3$PROPDMGEXP) #Create an exponent value for billion
noaa3$PROPDMGEXP <- gsub("NA", "NA", noaa3$PROPDMGEXP)
noaa3$PROPDMGEXP <- gsub("0", "0", noaa3$PROPDMGEXP)
noaa3$PROPDMGEXP <- as.numeric(noaa3$PROPDMGEXP)
sum(is.na(noaa3$PROPDMGEXP))
## [1] 276185
sum(!is.na(noaa3$PROPDMGEXP))
## [1] 377345
unique(noaa3$PROPDMGEXP)
## [1] 3 NA 6 9 0
table(noaa3$PROPDMGEXP)
##
## 0 3 6 9
## 1 369938 7374 32
noaa3$propertydamage <- (noaa3$PROPDMG * (10 ^ noaa3$PROPDMGEXP))/1000000000
describe(noaa3$CROPDMG)
## noaa3$CROPDMG
## n missing distinct Info Mean Gmd .05 .10
## 653530 0 415 0.083 1.839 3.652 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 0
##
## lowest : 0.00 0.01 0.10 0.15 0.20, highest: 950.00 975.00 978.00 985.00 990.00
unique(noaa3$CROPDMGEXP)
## [1] "K" NA "M" "B"
sum(is.na(noaa3$CROPDMGEXP))
## [1] 373069
sum(!is.na(noaa3$CROPDMGEXP))
## [1] 280461
table(noaa3$CROPDMGEXP)
##
## B K M
## 4 278686 1771
noaa3$CROPDMGEXP <- gsub("K", "3", noaa3$CROPDMGEXP) #Create an exponent value for thousand
noaa3$CROPDMGEXP <- gsub("M", "6", noaa3$CROPDMGEXP) #Create an exponent value for million
noaa3$CROPDMGEXP <- gsub("B", "9", noaa3$CROPDMGEXP) #Create an exponent value for billion
noaa3$CROPDMGEXP <- gsub("NA", "NA", noaa3$CROPDMGEXP)
noaa3$CROPDMGEXP <- as.numeric(noaa3$CROPDMGEXP)
sum(is.na(noaa3$CROPDMGEXP))
## [1] 373069
sum(!is.na(noaa3$CROPDMGEXP))
## [1] 280461
unique(noaa3$CROPDMGEXP)
## [1] 3 NA 6 9
table(noaa3$CROPDMGEXP)
##
## 3 6 9
## 278686 1771 4
noaa3$cropdamage <- (noaa3$CROPDMG * (10 ^ noaa3$CROPDMGEXP))/1000000000
totalfatalities <- aggregate(FATALITIES ~ EVTYPE, noaa3, sum)
totalfatalities[which.max(totalfatalities$FATALITIES),]
## EVTYPE FATALITIES
## 81 EXCESSIVE HEAT 1797
#Create a smaller dataset with the top 10 events by total number of fatalities
toptenFatalities <- totalfatalities[order(-totalfatalities$FATALITIES), ][1:10, ]
toptenFatalities
## EVTYPE FATALITIES
## 81 EXCESSIVE HEAT 1797
## 426 TORNADO 1511
## 98 FLASH FLOOD 887
## 224 LIGHTNING 651
## 102 FLOOD 414
## 300 RIP CURRENT 340
## 434 TSTM WIND 241
## 147 HEAT 237
## 177 HIGH WIND 235
## 16 AVALANCHE 223
#Plot the total number of fatalities by event type for the top 10 events
ggplot(data=toptenFatalities, aes(x=reorder(EVTYPE, FATALITIES), y=FATALITIES, fill=EVTYPE)) + geom_bar(stat="identity") + xlab("Weather Event") + ylab("Total number of fatalities") + ggtitle("Top 10 Events with Highest Fatalities") + coord_flip()
totalinjuries <- aggregate(INJURIES ~ EVTYPE, noaa3, sum)
totalinjuries[which.max(totalinjuries$INJURIES),]
## EVTYPE INJURIES
## 426 TORNADO 20667
#Create a smaller dataset with the top 10 events by total number of injuries
toptenInjuries <- totalinjuries[order(-totalinjuries$INJURIES), ][1:10, ]
toptenInjuries
## EVTYPE INJURIES
## 426 TORNADO 20667
## 102 FLOOD 6758
## 81 EXCESSIVE HEAT 6391
## 224 LIGHTNING 4141
## 434 TSTM WIND 3629
## 98 FLASH FLOOD 1674
## 421 THUNDERSTORM WIND 1400
## 507 WINTER STORM 1292
## 185 HURRICANE/TYPHOON 1275
## 147 HEAT 1222
#Plot the total number of injuries by event type for the top 10 events
ggplot(data=toptenInjuries, aes(x=reorder(EVTYPE, INJURIES), y=INJURIES, fill=EVTYPE)) + geom_bar(stat="identity") + xlab("Weather Event") + ylab("Total number of injuries") + ggtitle("Top 10 Events with Highest Injuries") + coord_flip()
totalpropertydamage <- aggregate(propertydamage ~ EVTYPE, noaa3, sum)
totalpropertydamage[which.max(totalpropertydamage$propertydamage),]
## EVTYPE propertydamage
## 42 FLOOD 143.9448
#Create a smaller dataset with the top 10 events by total number of fatalities
toptenpropdamag <- totalpropertydamage[order(-totalpropertydamage$propertydamage), ][1:10, ]
toptenpropdamag
## EVTYPE propertydamage
## 42 FLOOD 143.944834
## 84 HURRICANE/TYPHOON 69.305840
## 134 STORM SURGE 43.193536
## 143 TORNADO 24.616946
## 40 FLASH FLOOD 15.222204
## 67 HAIL 14.595143
## 83 HURRICANE 11.812819
## 145 TROPICAL STORM 7.642476
## 80 HIGH WIND 5.247860
## 168 WILDFIRE 4.758667
#Plot the total amount of property damage by event type for the top 10 events
ggplot(data=toptenpropdamag, aes(x=reorder(EVTYPE, propertydamage), y=propertydamage, fill=EVTYPE)) + geom_bar(stat="identity") + xlab("Weather Event") + ylab("Total Property Damange in Billions of Dollars") + ggtitle("Events with Highest Property Damange (Billions of Dollars)") + coord_flip()
totalcropdamage <- aggregate(cropdamage ~ EVTYPE, noaa3, sum)
totalcropdamage[which.max(totalcropdamage$cropdamage),]
## EVTYPE cropdamage
## 11 DROUGHT 13.36757
#Create a smaller dataset with the top 10 events by total number of fatalities
toptencropdamag <- totalcropdamage[order(-totalcropdamage$cropdamage), ][1:10, ]
toptencropdamag
## EVTYPE cropdamage
## 11 DROUGHT 13.3675660
## 22 FLOOD 4.9747784
## 41 HURRICANE 2.7414100
## 42 HURRICANE/TYPHOON 2.6078728
## 32 HAIL 2.4760294
## 21 FLASH FLOOD 1.3349017
## 18 EXTREME COLD 1.2889730
## 28 FROST/FREEZE 1.0940860
## 35 HEAVY RAIN 0.7281698
## 67 TROPICAL STORM 0.6777110