This report presents an exploratory data analysis of United States’ NOAA Storm data to answer two questions about severe weather events: 1) Across the United States, which types of events are most harmful with respect to population health? and 2) Across the United States, which types of events have the greatest economic consequences? The report includes data importing and processing, summary statistics and exploratory plots to address the questions. The results presented can help decision makers in policy making, the general public gaining knowledge of severe weather impacts and scientists furthering research and data analysis.

1. Synopsis

The NOAA Store database contains records on weather events and health and ecomony losses. The types of weather events are stored as EVTYPE variable. Additionally, population health impact is measured by FATALITIES and INJURIES in the database and economic consequences by PROPDMG (property damage) and CROPDMG (crop damage) in 1,000 dollars in the database. We hypothesized that there are certain types of weather events are most harmful relating to public health and economy. The sections below described how the raw data were processed and explored to identify those top weather hazards. We found that Heat/Excessie Heat is the single hazardous weather event that caused most fatality and injury. Heat is followed by Tornados and Floods as high public health dangers. Tornado and Floods are also among top hazards to economy loss followed by winds and hail. It’s worth noting that certain less occurring but extreme event such as drought can be most harmful to crop damage although not so harmful to property damage.

2. Data Processing

This section describes how the data were loaded into R and processed for analysis.

2.1 Raw data import

dat.loc <- "D:\\Coursera\\ReproducibleResearch_RPeng\\Projects\\proj2"
dat <- read.csv(paste(dat.loc, "repdata_data_StormData.csv.bz2", sep = "\\"), sep = ",")
str(dat)
#> 'data.frame':    902297 obs. of  37 variables:
#>  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
#>  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
#>  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
#>  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
#>  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
#>  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
#>  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
#>  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ BGN_AZI   : chr  "" "" "" "" ...
#>  $ BGN_LOCATI: chr  "" "" "" "" ...
#>  $ END_DATE  : chr  "" "" "" "" ...
#>  $ END_TIME  : chr  "" "" "" "" ...
#>  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
#>  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ END_AZI   : chr  "" "" "" "" ...
#>  $ END_LOCATI: chr  "" "" "" "" ...
#>  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
#>  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
#>  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
#>  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
#>  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
#>  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
#>  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
#>  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ CROPDMGEXP: chr  "" "" "" "" ...
#>  $ WFO       : chr  "" "" "" "" ...
#>  $ STATEOFFIC: chr  "" "" "" "" ...
#>  $ ZONENAMES : chr  "" "" "" "" ...
#>  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
#>  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
#>  $ LATITUDE_E: num  3051 0 0 0 0 ...
#>  $ LONGITUDE_: num  8806 0 0 0 0 ...
#>  $ REMARKS   : chr  "" "" "" "" ...
#>  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(dat[, c(2,3,12,13)]); tail(dat[, c(2,3,12,13)])
BGN_DATE BGN_TIME END_DATE END_TIME
4/18/1950 0:00:00 0130
4/18/1950 0:00:00 0145
2/20/1951 0:00:00 1600
6/8/1951 0:00:00 0900
11/15/1951 0:00:00 1500
11/15/1951 0:00:00 2000
BGN_DATE BGN_TIME END_DATE END_TIME
902292 11/28/2011 0:00:00 03:00:00 PM 11/29/2011 0:00:00 12:00:00 PM
902293 11/30/2011 0:00:00 10:30:00 PM 11/30/2011 0:00:00 10:30:00 PM
902294 11/10/2011 0:00:00 02:48:00 PM 11/10/2011 0:00:00 02:48:00 PM
902295 11/8/2011 0:00:00 02:58:00 PM 11/9/2011 0:00:00 01:15:00 PM
902296 11/9/2011 0:00:00 10:21:00 AM 11/9/2011 0:00:00 05:00:00 PM
902297 11/28/2011 0:00:00 08:00:00 PM 11/29/2011 0:00:00 04:00:00 AM

The raw data contains 902297 records and 37 variables from 4/18/1950 to 11/29/2011. There are 985 weather events recorded. The structure of the variables and values reveal some irregular formats need to be transformed as described below.

2.2 Raw data transformations

Raw data recorded the dates (BGN_DATE and END_DATE) and time (BGN_TIME and END_TIME) as character type and they need to be converted to date and time format. Raw values also recorded times in different formats before and after 1996. Times were recorded by hour and minute in 24 hours clock time before 1996 and by hour, minute and second in 12 hours with AM/PM since 1996. A ‘year’ variable was created so that BGN_TIME and END_TIME could be converted with matching time formats. The converted dates and times were saved into four new variables namely BgnDate, EndDate, BgnTime and EndTime.

library(tidyverse, warn.conflicts = FALSE)
library(lubridate)
library(usdata)
dat$BgnDate <- strptime(dat$BGN_DATE, "%m/%d/%Y")
dat$EndDate <- strptime(dat$END_DATE, "%m/%d/%Y")
dat <- dat %>%  mutate(year = year(BgnDate)) %>% arrange(year)
dat$BgnTime <- ifelse(dat$year < 1996, format(strptime(dat$BGN_TIME,"%H%M"), "%H:%M:%S"),
                    format(strptime(dat$BGN_TIME, "%I:%M:%S %p"), "%H:%M:%S"))
dat$EndTime <- ifelse(dat$year < 1996, format(strptime(dat$END_TIME,"%H%M"), "%H:%M:%S"),
                    format(strptime(dat$END_TIME, "%I:%M:%S %p"), "%H:%M:%S"))
# names(dat)
head(dat[, c(2,3,12,13, 38:42)]); tail(dat[, c(2,3,12,13, 38:42)])
BGN_DATE BGN_TIME END_DATE END_TIME BgnDate EndDate year BgnTime EndTime
4/18/1950 0:00:00 0130 1950-04-18 NA 1950 01:30:00 NA
4/18/1950 0:00:00 0145 1950-04-18 NA 1950 01:45:00 NA
1/13/1950 0:00:00 0525 1950-01-13 NA 1950 05:25:00 NA
2/12/1950 0:00:00 1230 1950-02-12 NA 1950 12:30:00 NA
2/12/1950 0:00:00 1500 1950-02-12 NA 1950 15:00:00 NA
3/26/1950 0:00:00 1930 1950-03-26 NA 1950 19:30:00 NA
BGN_DATE BGN_TIME END_DATE END_TIME BgnDate EndDate year BgnTime EndTime
902292 11/28/2011 0:00:00 03:00:00 PM 11/29/2011 0:00:00 12:00:00 PM 2011-11-28 2011-11-29 2011 15:00:00 12:00:00
902293 11/30/2011 0:00:00 10:30:00 PM 11/30/2011 0:00:00 10:30:00 PM 2011-11-30 2011-11-30 2011 22:30:00 22:30:00
902294 11/10/2011 0:00:00 02:48:00 PM 11/10/2011 0:00:00 02:48:00 PM 2011-11-10 2011-11-10 2011 14:48:00 14:48:00
902295 11/8/2011 0:00:00 02:58:00 PM 11/9/2011 0:00:00 01:15:00 PM 2011-11-08 2011-11-09 2011 14:58:00 13:15:00
902296 11/9/2011 0:00:00 10:21:00 AM 11/9/2011 0:00:00 05:00:00 PM 2011-11-09 2011-11-09 2011 10:21:00 17:00:00
902297 11/28/2011 0:00:00 08:00:00 PM 11/29/2011 0:00:00 04:00:00 AM 2011-11-28 2011-11-29 2011 20:00:00 04:00:00
dat2 <- dat[,c(4, 7, 8, 38, 39, 41, 42, 23:25, 27)]
names(dat2)
#>  [1] "TIME_ZONE"  "STATE"      "EVTYPE"     "BgnDate"    "EndDate"   
#>  [6] "BgnTime"    "EndTime"    "FATALITIES" "INJURIES"   "PROPDMG"   
#> [11] "CROPDMG"

The processed data were check again and dates and times are now in corrected formats. A new data set dat2 was created which contains only variables of interest for further exploration.

2.3 Data exploration focused on the two questions

Quantiles of the four responses of interest (FATALITIES, INJURIES, PROPDMG and CROPDMG) were calculated as shown below. The numbers revealed that all four responses have heavily left skewed distributions with majority of low values and some extremely high values. In particularly, there are many zero values. The proportions of non-zero records are 0.0077, 0.02, 0.27 and 0.024 for fatality, injuries, property damage and crop damage respectively. Since the objectives are health and economy impact, zeros are meaningless and removed from further examination.

# names(dat2)
lapply(as.list(dat2[, c(8:11)]), quantile, probs = seq(0.90,1.0, 0.01))
#> $FATALITIES
#>  90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
#>    0    0    0    0    0    0    0    0    0    0  583 
#> 
#> $INJURIES
#>  90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
#>    0    0    0    0    0    0    0    0    0    2 1700 
#> 
#> $PROPDMG
#>  90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
#>   15   20   25   25   32   50   50  100  200  250 5000 
#> 
#> $CROPDMG
#>  90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
#>    0    0    0    0    0    0    0    0    3   20  990

Raw data contained 985 types of severe weather events. To identify the events that caused most health and economic concerns, i created a subset dataset for each response with nonzero values followed by events with reported responses greater than 50% (median) of the non-zero values (shown as responsename50).

cbind(dat2 %>% filter(FATALITIES >0) %>% summarize(fatal50 = median(FATALITIES)),
      dat2 %>% filter(INJURIES >0) %>% summarize(injury50 = median(INJURIES)),
      dat2 %>% filter(PROPDMG >0) %>% summarize(propdmg50 = median(PROPDMG)),
      dat2 %>% filter(CROPDMG >0) %>% summarize(cropdmg50 = median(CROPDMG))
)
fatal50 injury50 propdmg50 cropdmg50
1 2 10 10

event.severeH1 <- dat2 %>% filter(FATALITIES >0) %>% filter(FATALITIES > median(FATALITIES))
event.severeH2 <- dat2 %>% filter(INJURIES >0) %>% filter(INJURIES > median(INJURIES))
event.severeE1 <- dat2 %>% filter(PROPDMG >0) %>% filter(PROPDMG > median(PROPDMG))
event.severeE2 <- dat2 %>% filter(CROPDMG >0) %>% filter(CROPDMG > median(CROPDMG))

#dim(event.severeH1)[1]; dim(event.severeH2)[1]; dim(event.severeE1)[1]; dim(event.severeE2)[1]

Using the code below, i listed the events associated with greater than medians of the four responses, ranked by their occurrence frequency.

tmp1 <- as.data.frame(table(event.severeH1$EVTYPE)) %>% arrange(desc(Freq))
tmp2 <- as.data.frame(table(event.severeH2$EVTYPE)) %>% arrange(desc(Freq))
tmp3 <- as.data.frame(table(event.severeE1$EVTYPE)) %>% arrange(desc(Freq))
tmp4 <- as.data.frame(table(event.severeE2$EVTYPE)) %>% arrange(desc(Freq))

"evens caused >50% fatalities: "; head(tmp1, 10);
"evens caused >50%  injuries: ";  head(tmp2, 10)
"evens caused >50%  property damage: "; head(tmp3, 10)
"evens caused >50%  crop damage: "; head(tmp4, 10)
#> [1] "evens caused >50% fatalities: "
Var1 Freq
TORNADO 768
EXCESSIVE HEAT 223
FLASH FLOOD 169
FLOOD 91
TSTM WIND 59
HEAT 51
WINTER STORM 50
LIGHTNING 49
AVALANCHE 37
HIGH WIND 35
#> [1] "evens caused >50%  injuries: "
Var1 Freq
TORNADO 4022
TSTM WIND 680
LIGHTNING 420
EXCESSIVE HEAT 129
THUNDERSTORM WIND 120
FLASH FLOOD 112
HIGH WIND 103
HAIL 93
WINTER STORM 87
THUNDERSTORM WINDS 84
#> [1] "evens caused >50%  property damage: "
Var1 Freq
TORNADO 24036
TSTM WIND 18420
THUNDERSTORM WIND 12395
FLASH FLOOD 10919
HAIL 6960
LIGHTNING 5719
FLOOD 5227
THUNDERSTORM WINDS 3557
HIGH WIND 2893
WINTER STORM 933
#> [1] "evens caused >50%  crop damage: "
Var1 Freq
HAIL 4226
TSTM WIND 1267
FLASH FLOOD 1179
FLOOD 1090
TORNADO 688
THUNDERSTORM WIND 470
THUNDERSTORM WINDS 182
DROUGHT 181
HIGH WIND 122
HEAVY RAIN 69

Above exploration revealed several common events that highly impacted both US public heath and economy although their occurring ranks differ in the two areas. These events are tornado, flood (/flash flood) and big wind events. Also, heats caused remarkable death and lightening caused remarkable injury. For economic loss concern, hail caused considerable crop damage. A new factor variable EVTYPE2 was created to group the EVTYPE in raw data into categories of interest.

dat2$EVTYPE2 <- as.factor(
    ifelse(grepl("tornado", dat$EVTYPE, ignore.case = TRUE), "Tornados", ifelse(
        grepl("flood", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
        "Floods", ifelse(
            grepl("wind", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
            "Winds", ifelse(
                grepl("heat", dat$EVTYPE, ignore.case = TRUE) == TRUE,
                "Heats", ifelse(
                    grepl("hail", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
                    "Hails", ifelse(
                        grepl("lightening", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
                        "Lightening", "Others"
                    )
                )
            )
        )
    ))
)

levels(dat2$EVTYPE2)
#> [1] "Floods"   "Hails"    "Heats"    "Others"   "Tornados" "Winds"

The 985 weather event types in raw data is now grouped into 6 categories as shown above.

3. Results

This section presents main results to answer the two questions of interest.

3.1 Across the United States, which types of events are most harmful with respect to population health?

Figure 1 presents the boxplots of Fatality and Injury grouped by types of weather events. The thick black line in each box is median, the red diamonds are means.

dat2.f <- dat2 %>% filter(FATALITIES > 0)
dat2.i <- dat2 %>% filter(INJURIES > 0)

par(mfrow=c(1,2), oma = c(0, 0, 3, 0), las = 3)
boxplot(dat2.f$FATALITIES ~ dat2.f$EVTYPE2, log = "y", xlab = "", ylab = "Fatalities")
points(tapply(dat2.f$FATALITIES, dat2.f$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
boxplot(dat2.i$INJURIES ~ dat2.i$EVTYPE2, log = "y", xlab = "", ylab = "Injuries")
points(tapply(dat2.i$INJURIES, dat2.i$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
mtext("Figure 1. Harmful Events With Respect To Population Health", outer = T, font = 2, cex = 1.5, las = 1)

Figure 1 showed that median of the non-zero fatality is 1 for all types of event. Heats, however, is most harmful to cause death and injury. Average heat fatality is 4 and average heat injuries is 40. The most fatal event ocurred during the 5-day HEAT event from 1995-07-12 to 1995-07-16 in Illinois, with 583 reported death, the highest in this database.

Other highly dangerous events are tornados and floods, whose average fatality and injury are 4 and 12 for tornados and 2 and 15 for floods.

3.2 Across the United States, which types of events have the greatest economic consequences?

Simiarly, Figure 2 presents the boxplots of property damage and crop damage grouped by types of weather events. These two losses estimated the economic consequences by the weather conditions. The thick black lines within boxes are medians and red diamonds are means.

dat2.p <- dat2 %>% filter(PROPDMG > 0)
dat2.c <- dat2 %>% filter(CROPDMG > 0)

par(mfrow=c(1,2), oma = c(0, 0, 3, 0), las = 3)
boxplot(dat2.p$PROPDMG ~ dat2.p$EVTYPE2, log = "y", xlab = "", ylab = "Property Damages")
points(tapply(dat2.p$PROPDMG, dat2.p$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
boxplot(dat2.c$CROPDMG ~ dat2.c$EVTYPE2, log = "y", xlab = "", ylab = "Crop Damages")
points(tapply(dat2.c$CROPDMG, dat2.c$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
mtext("Figure 2. Harmful Events With Respect To Economic Loss (in K)", outer = T, font = 2, cex = 1.5, las = 1)

Figure 2 showed that economic loss occurred during all severe weather events. Tornados, Floods and Heats have the greatest property damages and Floods and Tornados are among top hazards to crop damage as well. The summary statistics of the economic consequences by event types are also listed in tables below.

as.data.frame(sapply(split(dat2.p$PROPDMG, dat2.p$EVTYPE2), summary, digits = 2))[c(3,4,6),] %>%
  rownames_to_column(var = "Property Damage (K)")
as.data.frame(sapply(split(dat2.c$CROPDMG, dat2.c$EVTYPE2), summary, digits = 2))[c(3,4,6),] %>%
  rownames_to_column(var = "Crop Damage (K)")
Property Damage (K) Floods Hails Heats Others Tornados Winds
Median 15 5 25 20 25 5
Mean 77 30 73 72 82 25
Max. 5000 970 600 5000 4400 5000
Crop Damage (K) Floods Hails Heats Others Tornados Winds
Median 20 10 1.8 16 10 7
Mean 88 62 53.0 85 67 39
Max. 980 900 490.0 990 900 900

It’s worth noting that events in “Others” category could cause remarkable crop damage as shown in the crop damage summary table above. I listed records with more than 900 K crop damage due to severe weather events.

dat2[dat2$CROPDMG > 900,]
TIME_ZONE STATE EVTYPE BgnDate EndDate BgnTime EndTime FATALITIES INJURIES PROPDMG CROPDMG EVTYPE2
279930 EST WV River Flooding 1996-09-06 1996-09-08 10:15:00 07:30:00 0 0 16.50 950 Floods
322172 CST KS FLOOD 1998-09-13 1998-09-15 16:00:00 16:00:00 0 0 1.00 978 Floods
387863 CST IA FLOOD 2000-06-24 2000-06-27 03:00:00 15:00:00 0 0 650.00 975 Floods
544253 MST MT DROUGHT 2004-05-01 2004-05-31 01:01:00 23:59:00 0 0 0.00 990 Others
548558 EST NC FLOOD 2004-09-16 2004-09-17 23:00:00 19:00:00 0 0 8.00 920 Floods
631126 EST NC TROPICAL STORM 2006-09-01 2006-09-01 05:00:00 07:00:00 0 0 0.00 985 Others
743347 CST WI FLASH FLOOD 2008-06-07 2008-06-08 NA 07:30:00 0 0 2.25 950 Floods

The table showed that certain less occurring but extremely agricultural hazardous event can result in excessive crop loss. In this database, for example, the most crop damage was caused by the DROUGHT in May, 2004 in Montana. The total crop loss is 990 K.