This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Preventing such outcomes to the extent possible is a key concern. This report addresses two questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
The data for this analysis come as a comma-separated-value (CSV) file compressed with the bzip2 algorithm to reduce its size. The source data file is downloaded from this link.
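# Load the packages used below (dplyr for %>%, select, filter, mutate; ggplot2 for the plots)
library(dplyr)
library(ggplot2)
# Load data file into R (read.csv reads the bzip2-compressed file directly)
StormData <- read.csv("C:/Users/Admin/Desktop/RStudio_et_Github/Reproducible Research/Final assigment/repdata_data_StormData.csv.bz2")
# Summarize the loaded data frame
str(StormData)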
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
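The loading step works without manual decompression because `read.csv()` opens the path through a file connection that detects bzip2 compression. A minimal sketch of that behaviour on a small temporary file; the file name and the toy columns are illustrative and not part of the NOAA data:

```r
# Write a tiny CSV through a bzip2 connection, then read it back by path.
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() is given only the path; the compression is detected on open.
back <- read.csv(tmp)
str(back)
unlink(tmp)
```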
An explanation of the column names exists in this link. However, for the scope of this analysis, only the variables related to public health and the economy are kept:
BGN_DATE: beginning date of the event
EVTYPE: weather event type
FATALITIES and INJURIES: number of people killed or injured by the event (impact on public health)
PROPDMG and CROPDMG: property damage and crop damage (impact on the economy)
PROPDMGEXP and CROPDMGEXP: unit (magnitude in USD) of the property and crop damage figures
According to NOAA (https://www.ncdc.noaa.gov/stormevents/details.jsp), all event types have been recorded only since 1996. To compare event types fairly, data from before 1996 are removed.
Observations with NA values are also removed.
# Subset needed data
data <- StormData %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# Parse the date (the trailing time part is ignored by as.Date)
data$BGN_DATE <- as.Date(data$BGN_DATE, "%m/%d/%Y")
# Filter data
data <- data %>% filter(BGN_DATE >= as.Date("1996-01-01")) # Data since 1996
data <- data[complete.cases(data), ] # Drop rows containing NA values
data <- data %>% select(-BGN_DATE) # Drop the date column, no longer needed
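The date handling above can be checked on a few values. This is a small sketch on made-up strings in the same layout as BGN_DATE; the names `raw`, `dates`, `cutoff` and `keep` are illustrative:

```r
# Toy BGN_DATE values in the same "month/day/year time" layout as the raw file.
raw <- c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "12/31/2011 0:00:00")

# as.Date() stops parsing once the format is consumed, so the trailing
# " 0:00:00" is ignored.
dates <- as.Date(raw, format = "%m/%d/%Y")

# Compare against an explicit Date rather than a bare string.
cutoff <- as.Date("1996-01-01")
keep <- dates >= cutoff
raw[keep]
```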
# For PROPDMGEXP
## Check which codes are used
table(data$PROPDMGEXP)
##
##             0      B      K      M
## 276185      1     32 369938   7374
## => The codes in this column are blank, 0, B, K and M
## Replace 0, B, K, M with an exponent of ten (a blank code becomes NA)
data$PROPDMGEXP <- gsub("0","1",data$PROPDMGEXP) # "0" is treated as exponent 1 (x10)
data$PROPDMGEXP <- gsub("B","9",data$PROPDMGEXP) # billions
data$PROPDMGEXP <- gsub("K","3",data$PROPDMGEXP) # thousands
data$PROPDMGEXP <- gsub("M","6",data$PROPDMGEXP) # millions
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
# For CROPDMGEXP
## Check which codes are used
table(data$CROPDMGEXP)
##
##             B      K      M
## 373069      4 278686   1771
## => The codes in this column are blank, B, K and M
## Replace B, K, M with an exponent of ten (a blank code becomes NA)
data$CROPDMGEXP <- gsub("B","9",data$CROPDMGEXP) # billions
data$CROPDMGEXP <- gsub("K","3",data$CROPDMGEXP) # thousands
data$CROPDMGEXP <- gsub("M","6",data$CROPDMGEXP) # millions
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
# Blank codes become NA; these are values recorded without a unit, so their damage is NA and is ignored by the sums below
data[,"PROPERTY"] <- with(data,PROPDMG*10^PROPDMGEXP)
data[,"CROP"] <- with(data,CROPDMG*10^CROPDMGEXP)
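The same recoding can be written as a lookup table instead of repeated `gsub()` calls. The sketch below is an illustrative variant, not the code used for the results in this report: the names `exp_lookup` and `to_dollars` are hypothetical, and it assumes that a blank code means no multiplier (10^0), whereas the code above turns blank codes into NA.

```r
# Exponent of ten for each code; "0" maps to 1, mirroring the recoding above.
exp_lookup <- c("0" = 1, "K" = 3, "M" = 6, "B" = 9)

to_dollars <- function(amount, code) {
  exponent <- unname(exp_lookup[code])  # blank or unknown codes give NA
  exponent[is.na(exponent)] <- 0        # assumption: no unit means 10^0
  amount * 10^exponent
}

to_dollars(c(25, 2.5, 1, 4), c("K", "M", "B", ""))
```

Keeping the mapping in one named vector makes it easy to see, and to change, how each code is interpreted.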
Deaths and injuries are given the same weight when measuring the impact on public health. There are many event types, so we only look at the events with the highest impact (those above the 98th percentile of total impact).
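# PH is a new variable representing public health impact
Health <- data %>% mutate(PH = FATALITIES + INJURIES)
# Calculate health impact by events
Health <- aggregate(Health$PH,by=list(Health$EVTYPE),sum,na.rm=TRUE)
Health <- subset(Health,x>quantile(x,probs=0.98)) # Keep events above the 98th percentile
Health <- Health[order(-Health$x),]
colnames(Health) <- c("Event","Impact")
Health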
##                 Event Impact
## 426           TORNADO  22178
## 81     EXCESSIVE HEAT   8188
## 102             FLOOD   7172
## 224         LIGHTNING   4792
## 434         TSTM WIND   3870
## 98        FLASH FLOOD   2561
## 421 THUNDERSTORM WIND   1530
## 507      WINTER STORM   1483
## 147              HEAT   1459
## 185 HURRICANE/TYPHOON   1339
## 177         HIGH WIND   1318
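The aggregate-then-threshold pattern used for this table can be traced on a toy data frame. The values below are made up for illustration, and a 0.5 quantile stands in for the 0.98 used in the report so that the small example keeps more than one row:

```r
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD", "HEAT"),
  PH     = c(10, 3, 5, 1, 2, 4)
)

# Sum PH per event type; aggregate() names the result columns Group.1 and x.
totals <- aggregate(toy$PH, by = list(toy$EVTYPE), FUN = sum, na.rm = TRUE)

# Keep only totals strictly above the chosen quantile, largest first.
top <- subset(totals, x > quantile(x, probs = 0.5))
top <- top[order(-top$x), ]
colnames(top) <- c("Event", "Impact")
top
```

Because the comparison is strict, an event whose total equals the quantile value is dropped.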
Property damage and crop damage are given the same weight when measuring the impact on the economy.
# EC is a new variable representing economic impact
Economy <- data %>% mutate(EC = PROPERTY + CROP)
# Calculate economic impact by events
Economy <- aggregate(Economy$EC,by=list(Economy$EVTYPE),sum,na.rm=TRUE)
Economy <- subset(Economy,x>quantile(x,probs=0.98)) # Keep events above the 98th percentile
Economy <- Economy[order(-Economy$x),]
colnames(Economy) <- c("Event","Impact")
Economy
##                 Event       Impact
## 102             FLOOD 137278823900
## 185 HURRICANE/TYPHOON  29348167800
## 426           TORNADO  16308770350
## 183         HURRICANE  12404268000
## 142              HAIL   9331288590
## 98        FLASH FLOOD   8402099530
## 343  STORM SURGE/TIDE   4641493000
## 421 THUNDERSTORM WIND   3780985440
## 496          WILDFIRE   3684468370
## 177         HIGH WIND   3057106640
## 63            DROUGHT   1868412000
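Since blank unit codes are left as NA, `PROPERTY + CROP` is NA whenever either component is NA, and `sum(..., na.rm = TRUE)` then skips that row entirely. The sketch below shows the effect on made-up values and contrasts it with a `rowSums()` variant that treats a missing component as zero; that variant is an assumption for illustration and is not what produced the table above.

```r
toy <- data.frame(PROPERTY = c(25000, 1e6, NA), CROP = c(NA, 5000, 200))

# Plain addition: NA in either column makes the row total NA.
ec_plus <- toy$PROPERTY + toy$CROP

# Alternative (assumption): treat a missing component as zero.
ec_rowsum <- rowSums(toy[, c("PROPERTY", "CROP")], na.rm = TRUE)

sum(ec_plus, na.rm = TRUE)  # only the fully observed row contributes
sum(ec_rowsum)              # every row contributes what is known
```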
# Keep the 10 events with the largest impact for plotting
Health <- head(Health,10)
# Plot impact by event type
health.plot <- ggplot(Health, aes(x = Event, y = Impact, fill = Event)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylab("Total fatalities and injuries") +
  ggtitle("Impact of weather event types on public health") +
  theme(plot.title = element_text(hjust = 0.5))
print(health.plot)
According to the graph, tornadoes have by far the largest impact on public health since 1996, with 22,178 fatalities and injuries combined.
# Keep the 10 events with the largest impact for plotting
Economy <- head(Economy,10)
# Plot impact by event type
economy.plot <- ggplot(Economy, aes(x = Event, y = Impact, fill = Event)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylab("Total economic impact (USD)") +
  ggtitle("Impact of weather event types on the economy") +
  theme(plot.title = element_text(hjust = 0.5))
print(economy.plot)
According to the graph, floods have by far the largest economic impact since 1996, with about 137 billion USD in combined property and crop damage.
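In both charts the bars appear in alphabetical order of event name, because ggplot2 orders a character axis that way. One possible refinement, sketched here on made-up values, is to turn Event into a factor ordered by Impact with `reorder()` so the bars are sorted by size. The `toy` data frame is illustrative, and the plotting part runs only if ggplot2 is installed:

```r
toy <- data.frame(
  Event  = c("FLOOD", "TORNADO", "HAIL"),
  Impact = c(3, 10, 1)
)

# reorder() returns a factor whose levels follow ascending Impact.
toy$Event <- reorder(toy$Event, toy$Impact)
levels(toy$Event)

# Optional: draw the chart when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = Event, y = Impact)) +
    geom_col() +
    coord_flip()
  print(p)
}
```

With `coord_flip()`, the last factor level is drawn at the top, so the event with the largest impact ends up first.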