Proj2Rmarkdown.utf8

This report presents an exploratory data analysis of United States’ NOAA Storm data to answer two questions about severe weather events: 1) Across the United States, which types of events are most harmful with respect to population health? and 2) Across the United States, which types of events have the greatest economic consequences? The report includes data importing and processing, summary statistics and exploratory plots to address the questions. The results presented can help decision makers in policy making, the general public gaining knowledge of severe weather impacts and scientists furthering research and data analysis.

1. Synopsis

The NOAA Store database contains records on weather events and health and ecomony losses. The types of weather events are stored as EVTYPE variable. Additionally, population health impact is measured by FATALITIES and INJURIES in the database and economic consequences by PROPDMG (property damage) and CROPDMG (crop damage) in 1,000 dollars in the database. We hypothesized that there are certain types of weather events are most harmful relating to public health and economy. The sections below described how the raw data were processed and explored to identify those top weather hazards. We found that Heat/Excessie Heat is the single hazardous weather event that caused most fatality and injury. Heat is followed by Tornados and Floods as high public health dangers. Tornado and Floods are also among top hazards to economy loss followed by winds and hail. It’s worth noting that certain less occurring but extreme event such as drought can be most harmful to crop damage although not so harmful to property damage.

2. Data Processing

This section describes how the data were loaded into R and processed for analysis.

2.1 Raw data import

dat.loc <- "D:\\Coursera\\ReproducibleResearch_RPeng\\Projects\\proj2"
dat <- read.csv(paste(dat.loc, "repdata_data_StormData.csv.bz2", sep = "\\"), sep = ",")
str(dat)
#> 'data.frame':    902297 obs. of  37 variables:
#>  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
#>  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
#>  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
#>  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
#>  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
#>  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
#>  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
#>  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ BGN_AZI   : chr  "" "" "" "" ...
#>  $ BGN_LOCATI: chr  "" "" "" "" ...
#>  $ END_DATE  : chr  "" "" "" "" ...
#>  $ END_TIME  : chr  "" "" "" "" ...
#>  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
#>  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ END_AZI   : chr  "" "" "" "" ...
#>  $ END_LOCATI: chr  "" "" "" "" ...
#>  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
#>  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
#>  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
#>  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
#>  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
#>  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
#>  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
#>  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ CROPDMGEXP: chr  "" "" "" "" ...
#>  $ WFO       : chr  "" "" "" "" ...
#>  $ STATEOFFIC: chr  "" "" "" "" ...
#>  $ ZONENAMES : chr  "" "" "" "" ...
#>  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
#>  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
#>  $ LATITUDE_E: num  3051 0 0 0 0 ...
#>  $ LONGITUDE_: num  8806 0 0 0 0 ...
#>  $ REMARKS   : chr  "" "" "" "" ...
#>  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

head(dat[, c(2,3,12,13)]); tail(dat[, c(2,3,12,13)])

BGN_DATE	BGN_TIME	END_DATE	END_TIME
4/18/1950 0:00:00	0130
4/18/1950 0:00:00	0145
2/20/1951 0:00:00	1600
6/8/1951 0:00:00	0900
11/15/1951 0:00:00	1500
11/15/1951 0:00:00	2000

	BGN_DATE	BGN_TIME	END_DATE	END_TIME
902292	11/28/2011 0:00:00	03:00:00 PM	11/29/2011 0:00:00	12:00:00 PM
902293	11/30/2011 0:00:00	10:30:00 PM	11/30/2011 0:00:00	10:30:00 PM
902294	11/10/2011 0:00:00	02:48:00 PM	11/10/2011 0:00:00	02:48:00 PM
902295	11/8/2011 0:00:00	02:58:00 PM	11/9/2011 0:00:00	01:15:00 PM
902296	11/9/2011 0:00:00	10:21:00 AM	11/9/2011 0:00:00	05:00:00 PM
902297	11/28/2011 0:00:00	08:00:00 PM	11/29/2011 0:00:00	04:00:00 AM

The raw data contains 902297 records and 37 variables from 4/18/1950 to 11/29/2011. There are 985 weather events recorded. The structure of the variables and values reveal some irregular formats need to be transformed as described below.

2.2 Raw data transformations

Raw data recorded the dates (BGN_DATE and END_DATE) and time (BGN_TIME and END_TIME) as character type and they need to be converted to date and time format. Raw values also recorded times in different formats before and after 1996. Times were recorded by hour and minute in 24 hours clock time before 1996 and by hour, minute and second in 12 hours with AM/PM since 1996. A ‘year’ variable was created so that BGN_TIME and END_TIME could be converted with matching time formats. The converted dates and times were saved into four new variables namely BgnDate, EndDate, BgnTime and EndTime.

library(tidyverse, warn.conflicts = FALSE)
library(lubridate)
library(usdata)

dat$BgnDate <- strptime(dat$BGN_DATE, "%m/%d/%Y")
dat$EndDate <- strptime(dat$END_DATE, "%m/%d/%Y")
dat <- dat %>%  mutate(year = year(BgnDate)) %>% arrange(year)
dat$BgnTime <- ifelse(dat$year < 1996, format(strptime(dat$BGN_TIME,"%H%M"), "%H:%M:%S"),
                    format(strptime(dat$BGN_TIME, "%I:%M:%S %p"), "%H:%M:%S"))
dat$EndTime <- ifelse(dat$year < 1996, format(strptime(dat$END_TIME,"%H%M"), "%H:%M:%S"),
                    format(strptime(dat$END_TIME, "%I:%M:%S %p"), "%H:%M:%S"))
# names(dat)
head(dat[, c(2,3,12,13, 38:42)]); tail(dat[, c(2,3,12,13, 38:42)])

BGN_DATE	BGN_TIME	BgnDate	EndDate	year	BgnTime	EndTime
4/18/1950 0:00:00	0130	1950-04-18	NA	1950	01:30:00	NA
4/18/1950 0:00:00	0145	1950-04-18	NA	1950	01:45:00	NA
1/13/1950 0:00:00	0525	1950-01-13	NA	1950	05:25:00	NA
2/12/1950 0:00:00	1230	1950-02-12	NA	1950	12:30:00	NA
2/12/1950 0:00:00	1500	1950-02-12	NA	1950	15:00:00	NA
3/26/1950 0:00:00	1930	1950-03-26	NA	1950	19:30:00	NA

	BGN_DATE	BGN_TIME	END_DATE	END_TIME	BgnDate	EndDate	year	BgnTime	EndTime
902292	11/28/2011 0:00:00	03:00:00 PM	11/29/2011 0:00:00	12:00:00 PM	2011-11-28	2011-11-29	2011	15:00:00	12:00:00
902293	11/30/2011 0:00:00	10:30:00 PM	11/30/2011 0:00:00	10:30:00 PM	2011-11-30	2011-11-30	2011	22:30:00	22:30:00
902294	11/10/2011 0:00:00	02:48:00 PM	11/10/2011 0:00:00	02:48:00 PM	2011-11-10	2011-11-10	2011	14:48:00	14:48:00
902295	11/8/2011 0:00:00	02:58:00 PM	11/9/2011 0:00:00	01:15:00 PM	2011-11-08	2011-11-09	2011	14:58:00	13:15:00
902296	11/9/2011 0:00:00	10:21:00 AM	11/9/2011 0:00:00	05:00:00 PM	2011-11-09	2011-11-09	2011	10:21:00	17:00:00
902297	11/28/2011 0:00:00	08:00:00 PM	11/29/2011 0:00:00	04:00:00 AM	2011-11-28	2011-11-29	2011	20:00:00	04:00:00

dat2 <- dat[,c(4, 7, 8, 38, 39, 41, 42, 23:25, 27)]
names(dat2)

#>  [1] "TIME_ZONE"  "STATE"      "EVTYPE"     "BgnDate"    "EndDate"   
#>  [6] "BgnTime"    "EndTime"    "FATALITIES" "INJURIES"   "PROPDMG"   
#> [11] "CROPDMG"

The processed data were check again and dates and times are now in corrected formats. A new data set dat2 was created which contains only variables of interest for further exploration.

2.3 Data exploration focused on the two questions

Quantiles of the four responses of interest (FATALITIES, INJURIES, PROPDMG and CROPDMG) were calculated as shown below. The numbers revealed that all four responses have heavily left skewed distributions with majority of low values and some extremely high values. In particularly, there are many zero values. The proportions of non-zero records are 0.0077, 0.02, 0.27 and 0.024 for fatality, injuries, property damage and crop damage respectively. Since the objectives are health and economy impact, zeros are meaningless and removed from further examination.

# names(dat2)
lapply(as.list(dat2[, c(8:11)]), quantile, probs = seq(0.90,1.0, 0.01))

#> $FATALITIES
#>  90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
#>    0    0    0    0    0    0    0    0    0    0  583 
#> 
#> $INJURIES
#>  90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
#>    0    0    0    0    0    0    0    0    0    2 1700 
#> 
#> $PROPDMG
#>  90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
#>   15   20   25   25   32   50   50  100  200  250 5000 
#> 
#> $CROPDMG
#>  90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
#>    0    0    0    0    0    0    0    0    3   20  990

Raw data contained 985 types of severe weather events. To identify the events that caused most health and economic concerns, i created a subset dataset for each response with nonzero values followed by events with reported responses greater than 50% (median) of the non-zero values (shown as responsename50).

cbind(dat2 %>% filter(FATALITIES >0) %>% summarize(fatal50 = median(FATALITIES)),
      dat2 %>% filter(INJURIES >0) %>% summarize(injury50 = median(INJURIES)),
      dat2 %>% filter(PROPDMG >0) %>% summarize(propdmg50 = median(PROPDMG)),
      dat2 %>% filter(CROPDMG >0) %>% summarize(cropdmg50 = median(CROPDMG))
)

fatal50	injury50	propdmg50	cropdmg50
1	2	10	10


event.severeH1 <- dat2 %>% filter(FATALITIES >0) %>% filter(FATALITIES > median(FATALITIES))
event.severeH2 <- dat2 %>% filter(INJURIES >0) %>% filter(INJURIES > median(INJURIES))
event.severeE1 <- dat2 %>% filter(PROPDMG >0) %>% filter(PROPDMG > median(PROPDMG))
event.severeE2 <- dat2 %>% filter(CROPDMG >0) %>% filter(CROPDMG > median(CROPDMG))

#dim(event.severeH1)[1]; dim(event.severeH2)[1]; dim(event.severeE1)[1]; dim(event.severeE2)[1]

Using the code below, i listed the events associated with greater than medians of the four responses, ranked by their occurrence frequency.

tmp1 <- as.data.frame(table(event.severeH1$EVTYPE)) %>% arrange(desc(Freq))
tmp2 <- as.data.frame(table(event.severeH2$EVTYPE)) %>% arrange(desc(Freq))
tmp3 <- as.data.frame(table(event.severeE1$EVTYPE)) %>% arrange(desc(Freq))
tmp4 <- as.data.frame(table(event.severeE2$EVTYPE)) %>% arrange(desc(Freq))

"evens caused >50% fatalities: "; head(tmp1, 10);
"evens caused >50%  injuries: ";  head(tmp2, 10)
"evens caused >50%  property damage: "; head(tmp3, 10)
"evens caused >50%  crop damage: "; head(tmp4, 10)

#> [1] "evens caused >50% fatalities: "

Var1	Freq
TORNADO	768
EXCESSIVE HEAT	223
FLASH FLOOD	169
FLOOD	91
TSTM WIND	59
HEAT	51
WINTER STORM	50
LIGHTNING	49
AVALANCHE	37
HIGH WIND	35

#> [1] "evens caused >50%  injuries: "

Var1	Freq
TORNADO	4022
TSTM WIND	680
LIGHTNING	420
EXCESSIVE HEAT	129
THUNDERSTORM WIND	120
FLASH FLOOD	112
HIGH WIND	103
HAIL	93
WINTER STORM	87
THUNDERSTORM WINDS	84

#> [1] "evens caused >50%  property damage: "

Var1	Freq
TORNADO	24036
TSTM WIND	18420
THUNDERSTORM WIND	12395
FLASH FLOOD	10919
HAIL	6960
LIGHTNING	5719
FLOOD	5227
THUNDERSTORM WINDS	3557
HIGH WIND	2893
WINTER STORM	933

#> [1] "evens caused >50%  crop damage: "

Var1	Freq
HAIL	4226
TSTM WIND	1267
FLASH FLOOD	1179
FLOOD	1090
TORNADO	688
THUNDERSTORM WIND	470
THUNDERSTORM WINDS	182
DROUGHT	181
HIGH WIND	122
HEAVY RAIN	69

Above exploration revealed several common events that highly impacted both US public heath and economy although their occurring ranks differ in the two areas. These events are tornado, flood (/flash flood) and big wind events. Also, heats caused remarkable death and lightening caused remarkable injury. For economic loss concern, hail caused considerable crop damage. A new factor variable EVTYPE2 was created to group the EVTYPE in raw data into categories of interest.

dat2$EVTYPE2 <- as.factor(
    ifelse(grepl("tornado", dat$EVTYPE, ignore.case = TRUE), "Tornados", ifelse(
        grepl("flood", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
        "Floods", ifelse(
            grepl("wind", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
            "Winds", ifelse(
                grepl("heat", dat$EVTYPE, ignore.case = TRUE) == TRUE,
                "Heats", ifelse(
                    grepl("hail", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
                    "Hails", ifelse(
                        grepl("lightening", dat2$EVTYPE, ignore.case = TRUE) == TRUE,
                        "Lightening", "Others"
                    )
                )
            )
        )
    ))
)

levels(dat2$EVTYPE2)

#> [1] "Floods"   "Hails"    "Heats"    "Others"   "Tornados" "Winds"

The 985 weather event types in raw data is now grouped into 6 categories as shown above.

3. Results

This section presents main results to answer the two questions of interest.

3.1 Across the United States, which types of events are most harmful with respect to population health?

Figure 1 presents the boxplots of Fatality and Injury grouped by types of weather events. The thick black line in each box is median, the red diamonds are means.

dat2.f <- dat2 %>% filter(FATALITIES > 0)
dat2.i <- dat2 %>% filter(INJURIES > 0)

par(mfrow=c(1,2), oma = c(0, 0, 3, 0), las = 3)
boxplot(dat2.f$FATALITIES ~ dat2.f$EVTYPE2, log = "y", xlab = "", ylab = "Fatalities")
points(tapply(dat2.f$FATALITIES, dat2.f$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
boxplot(dat2.i$INJURIES ~ dat2.i$EVTYPE2, log = "y", xlab = "", ylab = "Injuries")
points(tapply(dat2.i$INJURIES, dat2.i$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
mtext("Figure 1. Harmful Events With Respect To Population Health", outer = T, font = 2, cex = 1.5, las = 1)

Figure 1 showed that median of the non-zero fatality is 1 for all types of event. Heats, however, is most harmful to cause death and injury. Average heat fatality is 4 and average heat injuries is 40. The most fatal event ocurred during the 5-day HEAT event from 1995-07-12 to 1995-07-16 in Illinois, with 583 reported death, the highest in this database.

Other highly dangerous events are tornados and floods, whose average fatality and injury are 4 and 12 for tornados and 2 and 15 for floods.

3.2 Across the United States, which types of events have the greatest economic consequences?

Simiarly, Figure 2 presents the boxplots of property damage and crop damage grouped by types of weather events. These two losses estimated the economic consequences by the weather conditions. The thick black lines within boxes are medians and red diamonds are means.

dat2.p <- dat2 %>% filter(PROPDMG > 0)
dat2.c <- dat2 %>% filter(CROPDMG > 0)

par(mfrow=c(1,2), oma = c(0, 0, 3, 0), las = 3)
boxplot(dat2.p$PROPDMG ~ dat2.p$EVTYPE2, log = "y", xlab = "", ylab = "Property Damages")
points(tapply(dat2.p$PROPDMG, dat2.p$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
boxplot(dat2.c$CROPDMG ~ dat2.c$EVTYPE2, log = "y", xlab = "", ylab = "Crop Damages")
points(tapply(dat2.c$CROPDMG, dat2.c$EVTYPE2, mean), col = "red", pch = 18, cex = 1.25)
mtext("Figure 2. Harmful Events With Respect To Economic Loss (in K)", outer = T, font = 2, cex = 1.5, las = 1)

Figure 2 showed that economic loss occurred during all severe weather events. Tornados, Floods and Heats have the greatest property damages and Floods and Tornados are among top hazards to crop damage as well. The summary statistics of the economic consequences by event types are also listed in tables below.

as.data.frame(sapply(split(dat2.p$PROPDMG, dat2.p$EVTYPE2), summary, digits = 2))[c(3,4,6),] %>%
  rownames_to_column(var = "Property Damage (K)")
as.data.frame(sapply(split(dat2.c$CROPDMG, dat2.c$EVTYPE2), summary, digits = 2))[c(3,4,6),] %>%
  rownames_to_column(var = "Crop Damage (K)")

Property Damage (K)	Floods	Hails	Heats	Others	Tornados	Winds
Median	15	5	25	20	25	5
Mean	77	30	73	72	82	25
Max.	5000	970	600	5000	4400	5000

Crop Damage (K)	Floods	Hails	Heats	Others	Tornados	Winds
Median	20	10	1.8	16	10	7
Mean	88	62	53.0	85	67	39
Max.	980	900	490.0	990	900	900

It’s worth noting that events in “Others” category could cause remarkable crop damage as shown in the crop damage summary table above. I listed records with more than 900 K crop damage due to severe weather events.

dat2[dat2$CROPDMG > 900,]

	TIME_ZONE	STATE	EVTYPE	BgnDate	EndDate	BgnTime	EndTime	PROPDMG	CROPDMG	EVTYPE2
279930	EST	WV	River Flooding	1996-09-06	1996-09-08	10:15:00	07:30:00	16.50	950	Floods
322172	CST	KS	FLOOD	1998-09-13	1998-09-15	16:00:00	16:00:00	1.00	978	Floods
387863	CST	IA	FLOOD	2000-06-24	2000-06-27	03:00:00	15:00:00	650.00	975	Floods
544253	MST	MT	DROUGHT	2004-05-01	2004-05-31	01:01:00	23:59:00	0.00	990	Others
548558	EST	NC	FLOOD	2004-09-16	2004-09-17	23:00:00	19:00:00	8.00	920	Floods
631126	EST	NC	TROPICAL STORM	2006-09-01	2006-09-01	05:00:00	07:00:00	0.00	985	Others
743347	CST	WI	FLASH FLOOD	2008-06-07	2008-06-08	NA	07:30:00	2.25	950	Floods

The table showed that certain less occurring but extremely agricultural hazardous event can result in excessive crop loss. In this database, for example, the most crop damage was caused by the DROUGHT in May, 2004 in Montana. The total crop loss is 990 K.

Heat, Tornado and Flood Are Top Weather Hazards On US Public Health and Economy

ReproducibleResearch Project 2

Xin Dai

2021-03-19