This report presents an exploratory data analysis of United States’ NOAA Storm data to answer two questions about severe weather events: 1) Across the United States, which types of events are most harmful with respect to population health? and 2) Across the United States, which types of events have the greatest economic consequences? The report includes data importing and processing, summary statistics and exploratory plots to address the questions. The results presented can help decision makers in policy making, the general public gaining knowledge of severe weather impacts and scientists furthering research and data analysis.
The NOAA Store database contains records on weather events and health and ecomony losses. The types of weather events are stored as EVTYPE variable. Additionally, population health impact is measured by FATALITIES and INJURIES in the database and economic consequences by PROPDMG (property damage) and CROPDMG (crop damage) in 1,000 dollars in the database. We hypothesized that there are certain types of weather events are most harmful relating to public health and economy. The sections below described how the raw data were processed and explored to identify those top weather hazards. We found that Heat/Excessie Heat is the single hazardous weather event that caused most fatality and injury. Heat is followed by Tornados and Floods as high public health dangers. Tornado and Floods are also among top hazards to economy loss followed by winds and hail. It’s worth noting that certain less occurring but extreme event such as drought can be most harmful to crop damage although not so harmful to property damage.
This section describes how the data were loaded into R and processed for analysis.
dat.loc <- "D:\\Coursera\\ReproducibleResearch_RPeng\\Projects\\proj2"
dat <- read.csv(paste(dat.loc, "repdata_data_StormData.csv.bz2", sep = "\\"), sep = ",")
str(dat)
#> 'data.frame': 902297 obs. of 37 variables:
#> $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
#> $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
#> $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
#> $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
#> $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
#> $ STATE : chr "AL" "AL" "AL" "AL" ...
#> $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
#> $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ BGN_AZI : chr "" "" "" "" ...
#> $ BGN_LOCATI: chr "" "" "" "" ...
#> $ END_DATE : chr "" "" "" "" ...
#> $ END_TIME : chr "" "" "" "" ...
#> $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ COUNTYENDN: logi NA NA NA NA NA NA ...
#> $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ END_AZI : chr "" "" "" "" ...
#> $ END_LOCATI: chr "" "" "" "" ...
#> $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
#> $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
#> $ F : int 3 2 2 2 2 2 2 1 3 3 ...
#> $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
#> $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
#> $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
#> $ PROPDMGEXP: chr "K" "K" "K" "K" ...
#> $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ CROPDMGEXP: chr "" "" "" "" ...
#> $ WFO : chr "" "" "" "" ...
#> $ STATEOFFIC: chr "" "" "" "" ...
#> $ ZONENAMES : chr "" "" "" "" ...
#> $ LATITUDE : num 3040 3042 3340 3458 3412 ...
#> $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
#> $ LATITUDE_E: num 3051 0 0 0 0 ...
#> $ LONGITUDE_: num 8806 0 0 0 0 ...
#> $ REMARKS : chr "" "" "" "" ...
#> $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(dat[, c(2,3,12,13)]); tail(dat[, c(2,3,12,13)])
| BGN_DATE | BGN_TIME | END_DATE | END_TIME |
|---|---|---|---|
| 4/18/1950 0:00:00 | 0130 | ||
| 4/18/1950 0:00:00 | 0145 | ||
| 2/20/1951 0:00:00 | 1600 | ||
| 6/8/1951 0:00:00 | 0900 | ||
| 11/15/1951 0:00:00 | 1500 | ||
| 11/15/1951 0:00:00 | 2000 |
| BGN_DATE | BGN_TIME | END_DATE | END_TIME | |
|---|---|---|---|---|
| 902292 | 11/28/2011 0:00:00 | 03:00:00 PM | 11/29/2011 0:00:00 | 12:00:00 PM |
| 902293 | 11/30/2011 0:00:00 | 10:30:00 PM | 11/30/2011 0:00:00 | 10:30:00 PM |
| 902294 | 11/10/2011 0:00:00 | 02:48:00 PM | 11/10/2011 0:00:00 | 02:48:00 PM |
| 902295 | 11/8/2011 0:00:00 | 02:58:00 PM | 11/9/2011 0:00:00 | 01:15:00 PM |
| 902296 | 11/9/2011 0:00:00 | 10:21:00 AM | 11/9/2011 0:00:00 | 05:00:00 PM |
| 902297 | 11/28/2011 0:00:00 | 08:00:00 PM | 11/29/2011 0:00:00 | 04:00:00 AM |
The raw data contains 902297 records and 37 variables from 4/18/1950 to 11/29/2011. There are 985 weather events recorded. The structure of the variables and values reveal some irregular formats need to be transformed as described below.
Raw data recorded the dates (BGN_DATE and END_DATE) and time (BGN_TIME and END_TIME) as character type and they need to be converted to date and time format. Raw values also recorded times in different formats before and after 1996. Times were recorded by hour and minute in 24 hours clock time before 1996 and by hour, minute and second in 12 hours with AM/PM since 1996. A ‘year’ variable was created so that BGN_TIME and END_TIME could be converted with matching time formats. The converted dates and times were saved into four new variables namely BgnDate, EndDate, BgnTime and EndTime.
library(tidyverse, warn.conflicts = FALSE)
library(lubridate)
library(usdata)
dat$BgnDate <- strptime(dat$BGN_DATE, "%m/%d/%Y")
dat$EndDate <- strptime(dat$END_DATE, "%m/%d/%Y")
dat <- dat %>% mutate(year = year(BgnDate)) %>% arrange(year)
dat$BgnTime <- ifelse(dat$year < 1996, format(strptime(dat$BGN_TIME,"%H%M"), "%H:%M:%S"),
format(strptime(dat$BGN_TIME, "%I:%M:%S %p"), "%H:%M:%S"))
dat$EndTime <- ifelse(dat$year < 1996, format(strptime(dat$END_TIME,"%H%M"), "%H:%M:%S"),
format(strptime(dat$END_TIME, "%I:%M:%S %p"), "%H:%M:%S"))
# names(dat)
head(dat[, c(2,3,12,13, 38:42)]); tail(dat[, c(2,3,12,13, 38:42)])
| BGN_DATE | BGN_TIME | END_DATE | END_TIME | BgnDate | EndDate | year | BgnTime | EndTime |
|---|---|---|---|---|---|---|---|---|
| 4/18/1950 0:00:00 | 0130 | 1950-04-18 | NA | 1950 | 01:30:00 | NA | ||
| 4/18/1950 0:00:00 | 0145 | 1950-04-18 | NA | 1950 | 01:45:00 | NA | ||
| 1/13/1950 0:00:00 | 0525 | 1950-01-13 | NA | 1950 | 05:25:00 | NA | ||
| 2/12/1950 0:00:00 | 1230 | 1950-02-12 | NA | 1950 | 12:30:00 | NA | ||
| 2/12/1950 0:00:00 | 1500 | 1950-02-12 | NA | 1950 | 15:00:00 | NA | ||
| 3/26/1950 0:00:00 | 1930 | 1950-03-26 | NA | 1950 | 19:30:00 | NA |
| BGN_DATE | BGN_TIME | END_DATE | END_TIME | BgnDate | EndDate | year | BgnTime | EndTime | |
|---|---|---|---|---|---|---|---|---|---|
| 902292 | 11/28/2011 0:00:00 | 03:00:00 PM | 11/29/2011 0:00:00 | 12:00:00 PM | 2011-11-28 | 2011-11-29 | 2011 | 15:00:00 | 12:00:00 |
| 902293 | 11/30/2011 0:00:00 | 10:30:00 PM | 11/30/2011 0:00:00 | 10:30:00 PM | 2011-11-30 | 2011-11-30 | 2011 | 22:30:00 | 22:30:00 |
| 902294 | 11/10/2011 0:00:00 | 02:48:00 PM | 11/10/2011 0:00:00 | 02:48:00 PM | 2011-11-10 | 2011-11-10 | 2011 | 14:48:00 | 14:48:00 |
| 902295 | 11/8/2011 0:00:00 | 02:58:00 PM | 11/9/2011 0:00:00 | 01:15:00 PM | 2011-11-08 | 2011-11-09 | 2011 | 14:58:00 | 13:15:00 |
| 902296 | 11/9/2011 0:00:00 | 10:21:00 AM | 11/9/2011 0:00:00 | 05:00:00 PM | 2011-11-09 | 2011-11-09 | 2011 | 10:21:00 | 17:00:00 |
| 902297 | 11/28/2011 0:00:00 | 08:00:00 PM | 11/29/2011 0:00:00 | 04:00:00 AM | 2011-11-28 | 2011-11-29 | 2011 | 20:00:00 | 04:00:00 |
dat2 <- dat[,c(4, 7, 8, 38, 39, 41, 42, 23:25, 27)]
names(dat2)
#> [1] "TIME_ZONE" "STATE" "EVTYPE" "BgnDate" "EndDate"
#> [6] "BgnTime" "EndTime" "FATALITIES" "INJURIES" "PROPDMG"
#> [11] "CROPDMG"
The processed data were check again and dates and times are now in corrected formats. A new data set dat2 was created which contains only variables of interest for further exploration.
Quantiles of the four responses of interest (FATALITIES, INJURIES, PROPDMG and CROPDMG) were calculated as shown below. The numbers revealed that all four responses have heavily left skewed distributions with majority of low values and some extremely high values. In particularly, there are many zero values. The proportions of non-zero records are 0.0077, 0.02, 0.27 and 0.024 for fatality, injuries, property damage and crop damage respectively. Since the objectives are health and economy impact, zeros are meaningless and removed from further examination.
# names(dat2)
lapply(as.list(dat2[, c(8:11)]), quantile, probs = seq(0.90,1.0, 0.01))
#> $FATALITIES
#> 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 100%
#> 0 0 0 0 0 0 0 0 0 0 583
#>
#> $INJURIES
#> 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 100%
#> 0 0 0 0 0 0 0 0 0 2 1700
#>
#> $PROPDMG
#> 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 100%
#> 15 20 25 25 32 50 50 100 200 250 5000
#>
#> $CROPDMG
#> 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 100%
#> 0 0 0 0 0 0 0 0 3 20 990
Raw data contained 985 types of severe weather events. To identify the events that caused most health and economic concerns, i created a subset dataset for each response with nonzero values followed by events with reported responses greater than 50% (median) of the non-zero values (shown as responsename50).
cbind(dat2 %>% filter(FATALITIES >0) %>% summarize(fatal50 = median(FATALITIES)),
dat2 %>% filter(INJURIES >0) %>% summarize(injury50 = median(INJURIES)),
dat2 %>% filter(PROPDMG >0) %>% summarize(propdmg50 = median(PROPDMG)),
dat2 %>% filter(CROPDMG >0) %>% summarize(cropdmg50 = median(CROPDMG))
)
| fatal50 | injury50 | propdmg50 | cropdmg50 |
|---|---|---|---|
| 1 | 2 | 10 | 10 |
event.severeH1 <- dat2 %>% filter(FATALITIES >0) %>% filter(FATALITIES > median(FATALITIES))
event.severeH2 <- dat2 %>% filter(INJURIES >0) %>% filter(INJURIES > median(INJURIES))
event.severeE1 <- dat2 %>% filter(PROPDMG >0) %>% filter(PROPDMG > median(PROPDMG))
event.severeE2 <- dat2 %>% filter(CROPDMG >0) %>% filter(CROPDMG > median(CROPDMG))
#dim(event.severeH1)[1]; dim(event.severeH2)[1]; dim(event.severeE1)[1]; dim(event.severeE2)[1]
Using the code below, i listed the events associated with greater than medians of the four responses, ranked by their occurrence frequency.
tmp1 <- as.data.frame(table(event.severeH1$EVTYPE)) %>% arrange(desc(Freq))
tmp2 <- as.data.frame(table(event.severeH2$EVTYPE)) %>% arrange(desc(Freq))
tmp3 <- as.data.frame(table(event.severeE1$EVTYPE)) %>% arrange(desc(Freq))
tmp4 <- as.data.frame(table(event.severeE2$EVTYPE)) %>% arrange(desc(Freq))
"evens caused >50% fatalities: "; head(tmp1, 10);
"evens caused >50% injuries: "; head(tmp2, 10)
"evens caused >50% property damage: "; head(tmp3, 10)
"evens caused >50% crop damage: "; head(tmp4, 10)
#> [1] "evens caused >50% fatalities: "
| Var1 | Freq |
|---|---|
| TORNADO | 768 |
| EXCESSIVE HEAT | 223 |
| FLASH FLOOD | 169 |
| FLOOD | 91 |
| TSTM WIND | 59 |
| HEAT | 51 |
| WINTER STORM | 50 |
| LIGHTNING | 49 |
| AVALANCHE | 37 |
| HIGH WIND | 35 |
#> [1] "evens caused >50% injuries: "
| Var1 | Freq |
|---|---|
| TORNADO | 4022 |
| TSTM WIND | 680 |
| LIGHTNING | 420 |
| EXCESSIVE HEAT | 129 |
| THUNDERSTORM WIND | 120 |
| FLASH FLOOD | 112 |
| HIGH WIND | 103 |
| HAIL | 93 |
| WINTER STORM | 87 |
| THUNDERSTORM WINDS | 84 |
#> [1] "evens caused >50% property damage: "
| Var1 | Freq |
|---|---|
| TORNADO | 24036 |
| TSTM WIND | 18420 |
| THUNDERSTORM WIND | 12395 |
| FLASH FLOOD | 10919 |
| HAIL | 6960 |
| LIGHTNING | 5719 |
| FLOOD | 5227 |
| THUNDERSTORM WINDS | 3557 |
| HIGH WIND | 2893 |
| WINTER STORM | 933 |
#> [1] "evens caused >50% crop damage: "
| Var1 | Freq |
|---|---|
| HAIL | 4226 |
| TSTM WIND | 1267 |
| FLASH FLOOD | 1179 |
| FLOOD | 1090 |
| TORNADO | 688 |
| THUNDERSTORM WIND | 470 |
| THUNDERSTORM WINDS | 182 |
| DROUGHT | 181 |
| HIGH WIND | 122 |
| HEAVY RAIN | 69 |
Above exploration revealed several common events that highly impacted both US public heath and economy although their occurring ranks differ in the two areas. These events are tornado, flood (/flash flood) and big wind events. Also, heats caused remarkable death and lightening caused remarkable injury. For economic loss concern, hail caused considerable crop damage. A new factor variable EVTYPE2 was created to group the EVTYPE in raw data into categories of interest.
dat2$EVTYPE2 <- as.factor(
ifelse(grepl("tornado", dat$EVTYPE, ignore.case = TRUE), "Tornados", ifelse(
grepl("flood", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
"Floods", ifelse(
grepl("wind", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
"Winds", ifelse(
grepl("heat", dat$EVTYPE, ignore.case = TRUE) == TRUE,
"Heats", ifelse(
grepl("hail", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
"Hails", ifelse(
grepl("lightening", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
"Lightening", "Others"
)
)
)
)
))
)
levels(dat2$EVTYPE2)
#> [1] "Floods" "Hails" "Heats" "Others" "Tornados" "Winds"
The 985 weather event types in raw data is now grouped into 6 categories as shown above.
This section presents main results to answer the two questions of interest.
Figure 1 presents the boxplots of Fatality and Injury grouped by types of weather events. The thick black line in each box is median, the red diamonds are means.
dat2.f <- dat2 %>% filter(FATALITIES > 0)
dat2.i <- dat2 %>% filter(INJURIES > 0)
par(mfrow=c(1,2), oma = c(0, 0, 3, 0), las = 3)
boxplot(dat2.f$FATALITIES ~ dat2.f$EVTYPE2, log = "y", xlab = "", ylab = "Fatalities")
points(tapply(dat2.f$FATALITIES, dat2.f$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
boxplot(dat2.i$INJURIES ~ dat2.i$EVTYPE2, log = "y", xlab = "", ylab = "Injuries")
points(tapply(dat2.i$INJURIES, dat2.i$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
mtext("Figure 1. Harmful Events With Respect To Population Health", outer = T, font = 2, cex = 1.5, las = 1)
Figure 1 showed that median of the non-zero fatality is 1 for all types of event. Heats, however, is most harmful to cause death and injury. Average heat fatality is 4 and average heat injuries is 40. The most fatal event ocurred during the 5-day HEAT event from 1995-07-12 to 1995-07-16 in Illinois, with 583 reported death, the highest in this database.
Other highly dangerous events are tornados and floods, whose average fatality and injury are 4 and 12 for tornados and 2 and 15 for floods.
Simiarly, Figure 2 presents the boxplots of property damage and crop damage grouped by types of weather events. These two losses estimated the economic consequences by the weather conditions. The thick black lines within boxes are medians and red diamonds are means.
dat2.p <- dat2 %>% filter(PROPDMG > 0)
dat2.c <- dat2 %>% filter(CROPDMG > 0)
par(mfrow=c(1,2), oma = c(0, 0, 3, 0), las = 3)
boxplot(dat2.p$PROPDMG ~ dat2.p$EVTYPE2, log = "y", xlab = "", ylab = "Property Damages")
points(tapply(dat2.p$PROPDMG, dat2.p$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
boxplot(dat2.c$CROPDMG ~ dat2.c$EVTYPE2, log = "y", xlab = "", ylab = "Crop Damages")
points(tapply(dat2.c$CROPDMG, dat2.c$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
mtext("Figure 2. Harmful Events With Respect To Economic Loss (in K)", outer = T, font = 2, cex = 1.5, las = 1)
Figure 2 showed that economic loss occurred during all severe weather events. Tornados, Floods and Heats have the greatest property damages and Floods and Tornados are among top hazards to crop damage as well. The summary statistics of the economic consequences by event types are also listed in tables below.
as.data.frame(sapply(split(dat2.p$PROPDMG, dat2.p$EVTYPE2), summary, digits = 2))[c(3,4,6),] %>%
rownames_to_column(var = "Property Damage (K)")
as.data.frame(sapply(split(dat2.c$CROPDMG, dat2.c$EVTYPE2), summary, digits = 2))[c(3,4,6),] %>%
rownames_to_column(var = "Crop Damage (K)")
| Property Damage (K) | Floods | Hails | Heats | Others | Tornados | Winds |
|---|---|---|---|---|---|---|
| Median | 15 | 5 | 25 | 20 | 25 | 5 |
| Mean | 77 | 30 | 73 | 72 | 82 | 25 |
| Max. | 5000 | 970 | 600 | 5000 | 4400 | 5000 |
| Crop Damage (K) | Floods | Hails | Heats | Others | Tornados | Winds |
|---|---|---|---|---|---|---|
| Median | 20 | 10 | 1.8 | 16 | 10 | 7 |
| Mean | 88 | 62 | 53.0 | 85 | 67 | 39 |
| Max. | 980 | 900 | 490.0 | 990 | 900 | 900 |
It’s worth noting that events in “Others” category could cause remarkable crop damage as shown in the crop damage summary table above. I listed records with more than 900 K crop damage due to severe weather events.
dat2[dat2$CROPDMG > 900,]
| TIME_ZONE | STATE | EVTYPE | BgnDate | EndDate | BgnTime | EndTime | FATALITIES | INJURIES | PROPDMG | CROPDMG | EVTYPE2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 279930 | EST | WV | River Flooding | 1996-09-06 | 1996-09-08 | 10:15:00 | 07:30:00 | 0 | 0 | 16.50 | 950 | Floods |
| 322172 | CST | KS | FLOOD | 1998-09-13 | 1998-09-15 | 16:00:00 | 16:00:00 | 0 | 0 | 1.00 | 978 | Floods |
| 387863 | CST | IA | FLOOD | 2000-06-24 | 2000-06-27 | 03:00:00 | 15:00:00 | 0 | 0 | 650.00 | 975 | Floods |
| 544253 | MST | MT | DROUGHT | 2004-05-01 | 2004-05-31 | 01:01:00 | 23:59:00 | 0 | 0 | 0.00 | 990 | Others |
| 548558 | EST | NC | FLOOD | 2004-09-16 | 2004-09-17 | 23:00:00 | 19:00:00 | 0 | 0 | 8.00 | 920 | Floods |
| 631126 | EST | NC | TROPICAL STORM | 2006-09-01 | 2006-09-01 | 05:00:00 | 07:00:00 | 0 | 0 | 0.00 | 985 | Others |
| 743347 | CST | WI | FLASH FLOOD | 2008-06-07 | 2008-06-08 | NA | 07:30:00 | 0 | 0 | 2.25 | 950 | Floods |
The table showed that certain less occurring but extremely agricultural hazardous event can result in excessive crop loss. In this database, for example, the most crop damage was caused by the DROUGHT in May, 2004 in Montana. The total crop loss is 990 K.