Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to analyze the impact of storms and other severe weather events on population health and the economy in the USA. We compared the total number of injuries and fatalities caused by each type of weather event and found that tornadoes had the biggest impact on population health. We also calculated the total property and crop damage caused by each type of weather event and found that floods caused the biggest economic loss.

Loading and Processing the Raw Data

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

Download bz2 file and read the csv file.

We first download the data in the bz2 format.

# Download the bzip2-compressed file unless it is already present
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("repdata_data_StormData.csv.bz2")) {
    download.file(url, destfile = "repdata_data_StormData.csv.bz2", method = "curl")
}

We then read in the storm data from the csv file included in the bz2 file.

storm <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE)
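
As a minimal sketch of why no explicit decompression step is needed, the toy example below writes a small CSV through a bzip2 connection and reads it back by file name alone. The data frame `toy` and the temporary file are made up for illustration and are not part of the storm data.

```r
# Toy example: write a small CSV through a bzip2 connection, then read it back.
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

con <- bzfile(tmp, open = "w")   # bzip2-compressed text connection
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() receives only the file name; the compressed file is
# detected and decompressed on the fly.
toy_back <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)
toy_back
```

The same mechanism is what lets `read.csv()` open `repdata_data_StormData.csv.bz2` directly in the step above.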

After reading the data, we check the dimensions, structure, and first few rows of the dataset.

dim(storm)
## [1] 902297     37

Check the basic structure of the dataset.

str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

Data processing

Create a dataset to analyze the effects of storm events on population health and economy

We select EVTYPE together with two groups of variables. For population health we keep FATALITIES and INJURIES, the approximate numbers of deaths and injuries. For the economy we keep PROPDMG and CROPDMG, the property and crop damage, along with PROPDMGEXP and CROPDMGEXP, which give the units for PROPDMG and CROPDMG.

We load the required libraries and select these columns from the original dataset.

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
storm_damage<-select(storm, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

We then check whether there are missing values for FATALITIES, INJURIES, PROPDMG, and CROPDMG.

sum(is.na(storm_damage$FATALITIES))  # check NA for fatalities
## [1] 0
sum(is.na(storm_damage$INJURIES))  # check NA for injuries
## [1] 0
sum(is.na(storm_damage$PROPDMG))  # check NA for property damage
## [1] 0
sum(is.na(storm_damage$CROPDMG))  # check NA for crop damage
## [1] 0

There are no missing values in the data.
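
The four separate checks can also be collapsed into one call. The sketch below uses a made-up data frame `toy` (with one deliberately missing value) rather than the storm data, and counts the NA values per column with `colSums()`.

```r
# Toy data with one missing value, for illustration only.
toy <- data.frame(
  FATALITIES = c(0, 1, NA),
  INJURIES   = c(15, 0, 2),
  PROPDMG    = c(25, 2.5, 25),
  CROPDMG    = c(0, 0, 0)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts
```

Applied to the four numeric columns of `storm_damage`, the same call would be expected to return zero for every column, matching the checks above.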

Calculate the total injuries and fatalities per event type

We group the dataset by EVTYPE so that FATALITIES, INJURIES, PROPDMG, and CROPDMG can be summed per event type. We first calculate the total fatalities and injuries for each EVTYPE and save the result in a new tibble, health_damage.

storm_damage<-group_by(storm_damage, EVTYPE)
health_damage<-storm_damage%>%summarise(fatality_sum=sum(FATALITIES), injury_sum=sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
head(health_damage)
## # A tibble: 6 x 3
##   EVTYPE                  fatality_sum injury_sum
##   <fct>                          <dbl>      <dbl>
## 1 "   HIGH SURF ADVISORY"            0          0
## 2 " COASTAL FLOOD"                   0          0
## 3 " FLASH FLOOD"                     0          0
## 4 " LIGHTNING"                       0          0
## 5 " TSTM WIND"                       0          0
## 6 " TSTM WIND (G45)"                 0          0
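
To make the per-group sum concrete, here is a small base R sketch of the same idea on made-up data. It uses `aggregate()` instead of the `group_by()` and `summarise()` pair from the analysis, purely so the example runs without extra packages; the data frame `toy` is hypothetical.

```r
# Made-up records: several rows per event type.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT", "FLOOD"),
  FATALITIES = c(1, 0, 2, 3, 0),
  INJURIES   = c(15, 2, 0, 1, 4),
  stringsAsFactors = FALSE
)

# Sum both health columns within each event type.
toy_sums <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
toy_sums
```

Each event type ends up on a single row holding its totals, which is the shape of `health_damage` shown above.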

Calculate the total economic damage per event type

We first check the units included in PROPDMGEXP and CROPDMGEXP.

table(storm_damage$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(storm_damage$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

We convert each damage value to a dollar amount by multiplying it by the factor that its unit code stands for: h for 10^2, k for 10^3, m for 10^6, and b for 10^9, in either upper or lower case. A digit n is treated as 10^n. The remaining codes (“-”, “?”, “+”, and blanks) are treated as 10^0, so those values are left unchanged.

test<- storm_damage
test[grepl("B", test$PROPDMGEXP),]$PROPDMG<- test[grepl("B", test$PROPDMGEXP),]$PROPDMG*(10^9)
test[grepl("[Hh]", test$PROPDMGEXP),]$PROPDMG<- test[grepl("[Hh]", test$PROPDMGEXP),]$PROPDMG*(100)
test[grepl("[Mm]", test$PROPDMGEXP),]$PROPDMG<- test[grepl("[Mm]", test$PROPDMGEXP),]$PROPDMG*(10^6)
test[grepl("[Kk]", test$PROPDMGEXP),]$PROPDMG<- test[grepl("[Kk]", test$PROPDMGEXP),]$PROPDMG*(1000)
for (i in c("1", "2", "3","4", "5","6",  "7", "8")) {
        test[grepl(i, test$PROPDMGEXP),]$PROPDMG<- test[grepl(i, test$PROPDMGEXP),]$PROPDMG*(10^(as.numeric(i)))
}

test[grepl("B", test$CROPDMGEXP),]$CROPDMG<- test[grepl("B", test$CROPDMGEXP),]$CROPDMG*(10^9)
test[grepl("[Mm]", test$CROPDMGEXP),]$CROPDMG<- test[grepl("[Mm]", test$CROPDMGEXP),]$CROPDMG*(10^6)
test[grepl("[Kk]", test$CROPDMGEXP),]$CROPDMG<- test[grepl("[Kk]", test$CROPDMGEXP),]$CROPDMG*(1000)
test[grepl("2", test$CROPDMGEXP),]$CROPDMG<- test[grepl("2", test$CROPDMGEXP),]$CROPDMG*(100)

storm_damage<-test
rm(test)
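
The same rule can be sketched as a single lookup function. The helper `exp_to_multiplier()` below is hypothetical and not part of the original analysis; it is shown on a made-up data frame and assumes, like the code above, that every unrecognised code counts as 10^0.

```r
# Hypothetical helper: map a unit code to its numeric multiplier.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  letters_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

  mult <- rep(1, length(code))            # "", "-", "?", "+" stay at 10^0
  is_letter <- code %in% names(letters_map)
  mult[is_letter] <- unname(letters_map[code[is_letter]])

  is_digit <- grepl("^[0-9]$", code)      # a digit n means 10^n
  mult[is_digit] <- 10^as.numeric(code[is_digit])
  mult
}

# Made-up values, for illustration only.
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "m", "B", "2", "?"))
toy$prop_dollars <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy
```

For example, a PROPDMG of 25 with unit K becomes 25,000, and a value of 3 with unit 2 becomes 300. A lookup like this applies every code in one vectorised pass instead of one subsetting assignment per code.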

We create a new variable totalDM to be the sum of property damage and crop damage, and save the sum of total damage for each event type in ecomony_damage.

storm_damage$totalDM<-storm_damage$PROPDMG+storm_damage$CROPDMG
ecomony_damage<-storm_damage%>%summarise(damage_sum=sum(totalDM))
## `summarise()` ungrouping output (override with `.groups` argument)
head(ecomony_damage)
## # A tibble: 6 x 2
##   EVTYPE                  damage_sum
##   <fct>                        <dbl>
## 1 "   HIGH SURF ADVISORY"     200000
## 2 " COASTAL FLOOD"                 0
## 3 " FLASH FLOOD"               50000
## 4 " LIGHTNING"                     0
## 5 " TSTM WIND"               8100000
## 6 " TSTM WIND (G45)"            8000

Identify the top 5 events for population health and economic damage

EVTYPE is a factor with 985 levels. We sort the summaries in descending order and take the top 5 events for fatalities and for injuries. The top 5 events for fatalities are:

health_damage[order(health_damage$fatality_sum, decreasing = TRUE), ]$EVTYPE[1:5]
## [1] TORNADO        EXCESSIVE HEAT FLASH FLOOD    HEAT           LIGHTNING     
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD  FLASH FLOOD ... WND
top_f<- as.character(health_damage[order(health_damage$fatality_sum, decreasing = TRUE), ]$EVTYPE[1:5])

The top 5 events for injuries are:

health_damage[order(health_damage$injury_sum, decreasing = TRUE), ]$EVTYPE[1:5]
## [1] TORNADO        TSTM WIND      FLOOD          EXCESSIVE HEAT LIGHTNING     
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD  FLASH FLOOD ... WND
top_i<-as.character(health_damage[order(health_damage$injury_sum, decreasing = TRUE), ]$EVTYPE[1:5])

The top 5 events for economic damage are:

ecomony_damage[order(ecomony_damage$damage_sum, decreasing = TRUE), ]$EVTYPE[1:5]
## [1] FLOOD             HURRICANE/TYPHOON TORNADO           STORM SURGE      
## [5] HAIL             
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD  FLASH FLOOD ... WND
top_d<- as.character(ecomony_damage[order(ecomony_damage$damage_sum, decreasing = TRUE), ]$EVTYPE[1:5])
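
The ranking step used three times above follows one pattern: sort by the total in descending order, then keep the first five names. A minimal sketch on made-up totals (the data frame `toy` and its numbers are illustrative only):

```r
# Made-up per-event totals.
toy <- data.frame(
  EVTYPE       = c("HAIL", "TORNADO", "FLOOD", "HEAT", "LIGHTNING", "WIND", "FOG"),
  fatality_sum = c(2, 50, 20, 30, 10, 5, 0),
  stringsAsFactors = FALSE
)

# Sort once, then reuse the sorted rows for both the names and the plot data.
ranked <- toy[order(toy$fatality_sum, decreasing = TRUE), ]
top5 <- head(ranked$EVTYPE, 5)
top5
```

Sorting once and storing the result avoids repeating the `order()` call for the printed names, the character vector, and the rows passed to the plot.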

Results

Plot the top 5 events for fatalities and injuries

We draw a bar plot of the top 5 events for fatalities and another for injuries, with the events in descending order.

top_fatality<-health_damage[order(health_damage$fatality_sum, decreasing = TRUE), ][1:5,]
fatality_p <- ggplot(top_fatality, aes(x = EVTYPE, y = fatality_sum))+geom_col(position="dodge")+theme(axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Top 5 events for fatalities") + ylab("No. of fatalities")+scale_x_discrete (limits=top_f )
fatality_p

top_injury<-health_damage[order(health_damage$injury_sum, decreasing = TRUE), ][1:5,]
injury_p <- ggplot(top_injury, aes(x = EVTYPE, y = injury_sum))+geom_col(position="dodge")+ theme(axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Top 5 events for injuries") + ylab("No. of injuries")+scale_x_discrete (limits=top_i )
injury_p

Tornadoes are thus the most harmful event type with respect to population health, ranking first for both fatalities and injuries.

Plot the top 5 events for total economic damage

We draw a bar plot of the top 5 events for total economic damage, with the events in descending order.

top_damage<-ecomony_damage[order(ecomony_damage$damage_sum, decreasing = TRUE), ][1:5,]
damage_p <- ggplot(top_damage, aes(x = EVTYPE, y = damage_sum))+geom_col(position="dodge")+ theme(axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Top 5 events for economic damage") + ylab("$ damage")+scale_x_discrete (limits=top_d )
damage_p
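
The plots above fix the bar order with `scale_x_discrete(limits = ...)`. An alternative, sketched here on made-up values, is to set the factor levels of EVTYPE in descending order of the total, since ggplot2 draws a discrete axis in factor-level order. The data frame `toy` is hypothetical.

```r
# Made-up totals, deliberately not sorted.
toy <- data.frame(
  EVTYPE     = c("HAIL", "FLOOD", "TORNADO"),
  damage_sum = c(15, 150, 57),
  stringsAsFactors = FALSE
)

# Set the levels in descending order of damage_sum.
desc_levels <- toy$EVTYPE[order(toy$damage_sum, decreasing = TRUE)]
toy$EVTYPE <- factor(toy$EVTYPE, levels = desc_levels)
levels(toy$EVTYPE)
```

With the levels set this way, the ordering travels with the data frame, so a separate vector of limits is not needed for each plot.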

We thus conclude that floods have the greatest economic consequences in the USA, measured as total property and crop damage.