Exploring Weather Events in the United States

Synopsis

In this report we identify the weather events that cause the most harm to human health and the events with the greatest economic consequences. We used data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database for the years 1950 through 2011. In these data, the single event with the most injuries was a tornado and the single event with the most fatalities was a heat wave. Likewise, the largest recorded property damage value belongs to a thunderstorm and the largest recorded crop damage value to a drought. We therefore focused the analysis on four event types: tornados, heat waves, thunderstorms, and drought.

Loading and Processing the Raw Data

From the NOAA storm database we obtained data on weather events that are monitored across the U.S. We obtained the files for the years 1950 through 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Reading in the data

We used the bunzip2() function from the R.utils package to unzip the data file, then the read.csv() function to read in the data.

# setwd('ReproducibleAnalysis/Projects/Project2')
# Unzip the data set. First load the R.utils library
library(R.utils)
# bunzip2('repdata-data-StormData.csv.bz2', 'repdata-data-StormData.csv',
#     remove = FALSE)
# Read the data
storms <- read.csv("repdata-data-StormData.csv")
# stms <- subset(storms, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 |
#     CROPDMG > 0)
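
As an aside, base R's read.csv() can usually read a bzip2-compressed file directly, because file connections detect the compression when a file is opened for reading. The sketch below illustrates this on a small made-up file (the columns and values are illustrative, not taken from the NOAA data), which suggests the separate bunzip2() step may be optional.

```r
# Build a tiny illustrative data set (hypothetical values, not NOAA data)
toy <- data.frame(EVTYPE = c("TORNADO", "HEAT", "DROUGHT"),
                  FATALITIES = c(0, 2, 0))

# Write it to a temporary bzip2-compressed CSV file
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, file = con, row.names = FALSE)
close(con)

# read.csv() opens the path through file(), which detects the compression
toy_read <- read.csv(path, stringsAsFactors = FALSE)
print(toy_read)
```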

After reading in the data, we check its dimensions and the first few rows. There are 902,297 observations and 37 variables in this data set.

dim(storms)
## [1] 902297     37
head(storms[, 1:8])
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE
## 1 TORNADO
## 2 TORNADO
## 3 TORNADO
## 4 TORNADO
## 5 TORNADO
## 6 TORNADO
str(storms)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels ""," ","  ","   ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

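The single events with the maximum number of fatalities and injuries are a heat wave and a tornado, respectively. Note that which.max() returns the one record with the largest value, not the event type with the largest total.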

levels(factor(storms$EVTYPE[which.max(storms$FATALITIES)]))
## [1] "HEAT"
levels(factor(storms$EVTYPE[which.max(storms$INJURIES)]))
## [1] "TORNADO"

By the same measure, the events with the greatest economic consequences are thunderstorms and drought. A thunderstorm wind event has the largest recorded property damage value (PROPDMG), and a drought has the largest recorded crop damage value (CROPDMG).

levels(factor(storms$EVTYPE[which.max(storms$PROPDMG)]))
## [1] "THUNDERSTORM WIND"
levels(factor(storms$EVTYPE[which.max(storms$CROPDMG)]))
## [1] "DROUGHT"
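
Because which.max() picks out a single record, the results above identify the most severe individual events. To rank event types by their total impact instead, one option is to sum each measure within EVTYPE first. The sketch below contrasts the two approaches on made-up records; the data frame toy and its values are hypothetical.

```r
# Hypothetical records for illustration only (not NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "HEAT", "FLOOD"),
  FATALITIES = c(3, 4, 6, 0, 1),
  INJURIES   = c(50, 40, 10, 5, 2),
  stringsAsFactors = FALSE
)

# Event type of the single worst record (the approach used in this report)
worst_single <- toy$EVTYPE[which.max(toy$FATALITIES)]

# Event type with the largest total across all of its records
totals <- tapply(toy$FATALITIES, toy$EVTYPE, sum)
worst_total <- names(which.max(totals))

print(worst_single)  # "HEAT": one record with 6 fatalities
print(worst_total)   # "TORNADO": 3 + 4 = 7 fatalities in total
```

On real data the two rankings can disagree in the same way, so it is worth stating which one a conclusion rests on.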

Results

Tornados

In the data set, the event type (EVTYPE) has 985 levels. A tornado record could carry any label that includes the string “TORN”, so we converted the labels to upper case and used the grep() function to extract the matching ones. We used the lubridate package to parse the dates in the data set. We plotted the number of tornados per year from 1950 to 2011. The number of recorded tornados clearly increases over the years, from about 250 in 1950 to about 2200 in 2011. Despite this increasing trend in the number of tornados, there is no clear trend in the number of injuries they cause. The largest number of injuries from a single tornado occurred in Wichita County, Texas, on April 10, 1979, when an estimated 1700 people were injured.

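# Upper-case copy of the event labels, used for pattern matching
tUpper <- toupper(storms$EVTYPE)
t <- grep("TORN+", tUpper, value = TRUE, perl = TRUE)
# t holds upper-case labels, so %in% keeps only EVTYPE values already in upper case
TornadoStms <- subset(storms, storms$EVTYPE %in% t)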
date <- TornadoStms$BGN_DATE
library(lubridate)
date <- mdy_hms(date)
TornadoStms$Year <- year(date)
# Count the number of tornados each year
TornadosCnt <- table(TornadoStms$Year)

TornadoStms$STATE[which.max(TornadoStms$INJURIES)]
## [1] TX
## 72 Levels: AK AL AM AN AR AS AZ CA CO CT DC DE FL GA GM GU HI IA ID ... XX
TornadoStms$COUNTYNAME[which.max(TornadoStms$INJURIES)]
## [1] WICHITA
## 29601 Levels:  5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI ... ZIEBACH AND HAAKON
TornadoStms$BGN_DATE[which.max(TornadoStms$INJURIES)]
## [1] 4/10/1979 0:00:00
## 16335 Levels: 10/10/1954 0:00:00 10/10/1958 0:00:00 ... 9/9/2011 0:00:00
TornadoStms$INJURIES[which.max(TornadoStms$INJURIES)]
## [1] 1700

# Or
max(TornadoStms$INJURIES)
## [1] 1700
# Alternative using ggplot2: first convert the table into a data frame
# df <- as.data.frame(TornadosCnt)
# Rename the first column
# names(df)[1] <- 'Year'
# library(ggplot2)
# p <- ggplot(df, aes(as.numeric(as.character(Year)), Freq))
# p + geom_line(color = 'blue') + ylab('Number of Tornados') + xlab('') +
#     ggtitle('Number of Tornados over the Years')
par(mfrow = c(2, 1))
# Plot Number of tornados each year
plot(TornadosCnt, type = "l", col = "blue", ylab = "Number of Tornados", xlab = "", 
    main = "Number of Tornados over the Years", tck = 1)
# Number of Injuries over the years caused by tornados
injuries = tapply(TornadoStms$INJURIES, TornadoStms$Year, sum)
plot(as.numeric(levels(as.factor(TornadoStms$Year))), injuries, type = "l", 
    tck = 1, xlab = "", ylab = "Number of Injuries", main = "Number of Injuries caused by Tornados each year", 
    col = "blue")

[Figure: number of tornados per year (top) and number of injuries caused by tornados per year (bottom)]
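
The subsetting code above compares the original EVTYPE values against upper-cased labels, so mixed-case entries such as "Tornado F1" would appear to be left out. A minimal sketch of a case-insensitive alternative is shown below, together with a base R way of extracting the year (lubridate is not required for this step). The records are made up for illustration.

```r
# Hypothetical records for illustration only (not NOAA data)
toy <- data.frame(
  EVTYPE   = c("TORNADO", "Tornado F1", "HAIL", "TORNADOES, TSTM WIND"),
  BGN_DATE = c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
               "6/9/1951 0:00:00", "11/15/1951 0:00:00"),
  stringsAsFactors = FALSE
)

# Match on the upper-cased label so mixed-case entries are kept as well
is_tornado <- grepl("TORN", toupper(toy$EVTYPE), fixed = TRUE)
tornado <- toy[is_tornado, ]

# Parse the date part of the string and extract the year
parsed <- as.Date(tornado$BGN_DATE, format = "%m/%d/%Y")
tornado$Year <- as.integer(format(parsed, "%Y"))

# Number of matching records per year
counts <- table(tornado$Year)
print(counts)
```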

Heat waves

A heat wave record could carry any label that includes the string “HEAT”, so we again matched the upper-cased labels with the grep() function. We plotted the number of heat waves per year from 1993 to 2011. The number of recorded heat waves clearly increases over the years, from about 10 in 1993 to about 410 in 2011. On the other hand, the yearly number of fatalities decreases on average. The largest number of fatalities was caused by a heat wave in Illinois in the second week of July 1995. An estimated 583 people died during that week.

h <- grep("HEAT+", tUpper, value = TRUE, perl = TRUE)
heatSevere <- subset(storms, storms$EVTYPE %in% h)
HDate <- heatSevere$BGN_DATE
HDate <- mdy_hms(HDate)
heatSevere$Year <- year(HDate)
HeatWaveCnt <- table(heatSevere$Year)

heatSevere$STATE[which.max(heatSevere$FATALITIES)]
## [1] IL
## 72 Levels: AK AL AM AN AR AS AZ CA CO CT DC DE FL GA GM GU HI IA ID ... XX
heatSevere$COUNTYNAME[which.max(heatSevere$FATALITIES)]
## [1] ILZ003>006 - 008 - 010>014 - 019>023 - 032 - 033 - 039
## 29601 Levels:  5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI ... ZIEBACH AND HAAKON
heatSevere$BGN_DATE[which.max(heatSevere$FATALITIES)]
## [1] 7/12/1995 0:00:00
## 16335 Levels: 10/10/1954 0:00:00 10/10/1958 0:00:00 ... 9/9/2011 0:00:00
heatSevere$FATALITIES[which.max(heatSevere$FATALITIES)]
## [1] 583
# Or
max(heatSevere$FATALITIES)
## [1] 583
par(mfrow = c(2, 1))
plot(HeatWaveCnt, type = "l", col = "red", ylab = "Number of Heat Waves", xlab = "", 
    main = "Number of Heat Waves Over the Years", tck = 1)
# Number of fatalities caused by Heat
fatalities <- tapply(heatSevere$FATALITIES, heatSevere$Year, sum)
plot(as.numeric(levels(as.factor(heatSevere$Year))), fatalities, type = "l", 
    tck = 1, xlab = "", ylab = "Number of Fatalities", main = "Number of Fatalities caused by Heat each year", 
    col = "red")

[Figure: number of heat waves per year (top) and number of fatalities caused by heat per year (bottom)]
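
A statement such as "the yearly number of fatalities decreases on average" can be checked with a simple least-squares line through the yearly totals. The sketch below uses made-up totals (the data frame yearly is hypothetical); with the real data, the fatalities vector computed above could be used in its place.

```r
# Hypothetical yearly totals for illustration only (not NOAA data)
yearly <- data.frame(
  Year       = 2001:2006,
  Fatalities = c(60, 52, 55, 41, 38, 30)
)

# Least-squares line: Fatalities = intercept + slope * Year
fit <- lm(Fatalities ~ Year, data = yearly)
slope <- unname(coef(fit)["Year"])

# A negative slope indicates a decreasing average trend
print(slope)
```

A single extreme year, such as 1995 in the heat data, can dominate such a fit, so the slope is best read alongside the plot rather than on its own.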

Property Damages Caused by Thunderstorms

A thunderstorm record could carry any label that includes the string “THUN”, so we again matched the upper-cased labels with the grep() function. We plotted the yearly cost of property damages caused by thunderstorms from 1993 to 2005. The cost varies considerably between years. The thunderstorm record with the largest property damage value is from Franklin County, North Carolina, on July 26, 2009. Its recorded PROPDMG value is 5000; because this analysis does not apply the PROPDMGEXP magnitude code, that figure is an unscaled value rather than a dollar amount.

tH <- grep("THUN+", tUpper, value = TRUE, perl = TRUE)
thunder <- subset(storms, storms$EVTYPE %in% tH)
tHDate <- thunder$BGN_DATE
tHDate <- mdy_hms(tHDate)
thunder$Year <- year(tHDate)
propDmgCost <- tapply(thunder$PROPDMG, thunder$Year, sum)

thunder$STATE[which.max(thunder$PROPDMG)]
## [1] NC
## 72 Levels: AK AL AM AN AR AS AZ CA CO CT DC DE FL GA GM GU HI IA ID ... XX
thunder$COUNTYNAME[which.max(thunder$PROPDMG)]
## [1] FRANKLIN
## 29601 Levels:  5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI ... ZIEBACH AND HAAKON
thunder$BGN_DATE[which.max(thunder$PROPDMG)]
## [1] 7/26/2009 0:00:00
## 16335 Levels: 10/10/1954 0:00:00 10/10/1958 0:00:00 ... 9/9/2011 0:00:00
thunder$PROPDMG[which.max(thunder$PROPDMG)]
## [1] 5000
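
In the NOAA storm data, PROPDMG and CROPDMG hold the numeric part of a damage estimate, while PROPDMGEXP and CROPDMGEXP hold a magnitude code. The sketch below shows one way to combine them into dollar amounts. The helper damage_in_dollars() is hypothetical, the mapping K = thousands, M = millions, B = billions follows the commonly documented codes, and leaving every other code unscaled is an assumption made here for simplicity.

```r
# Hypothetical helper: scale a damage value by its magnitude code.
# Assumes K = 1e3, M = 1e6, B = 1e9; any other code is left unscaled.
damage_in_dollars <- function(value, exp_code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(exp_code))
  mult <- unname(multipliers[code])
  mult[is.na(mult)] <- 1
  value * mult
}

# Illustrative values only (not NOAA data)
dollars <- damage_in_dollars(c(25, 2.5, 5, 10), c("K", "m", "B", ""))
print(dollars)
```

With a column computed this way, yearly sums and per-state averages would be expressed in dollars rather than in unscaled units.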

Crop Damages Caused by Drought

A drought record could carry any label that includes the string “DROUGHT”, so we again matched the upper-cased labels with the grep() function. We plotted the yearly cost of crop damages caused by drought from 1993 to 2005. There is a somewhat increasing trend in the cost of crop damages over the years. The drought record with the largest crop damage value is from Montana on May 1, 2004. The value of 775 printed below is read from the PROPDMG column rather than CROPDMG and is not scaled by the magnitude code, so it should not be read as a dollar amount.

# Drought
d <- grep("DROUGHT+", tUpper, value = TRUE, perl = TRUE)
drought <- subset(storms, storms$EVTYPE %in% d)
dDate <- drought$BGN_DATE
dDate <- mdy_hms(dDate)
drought$Year <- year(dDate)
cropDmgCost <- tapply(drought$CROPDMG, drought$Year, sum)

drought$STATE[which.max(drought$CROPDMG)]
## [1] MT
## 72 Levels: AK AL AM AN AR AS AZ CA CO CT DC DE FL GA GM GU HI IA ID ... XX
drought$COUNTYNAME[which.max(drought$CROPDMG)]
## [1] MTZ024>025 - 062
## 29601 Levels:  5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI ... ZIEBACH AND HAAKON
drought$BGN_DATE[which.max(drought$CROPDMG)]
## [1] 5/1/2004 0:00:00
## 16335 Levels: 10/10/1954 0:00:00 10/10/1958 0:00:00 ... 9/9/2011 0:00:00
# Note: this line reads PROPDMG; use CROPDMG to report the crop damage value
drought$PROPDMG[which.max(drought$PROPDMG)]
## [1] 775
par(mfrow = c(2, 1))
plot(as.numeric(levels(factor(thunder$Year))), propDmgCost, type = "l", col = "orange", 
    ylab = "Cost of Property Damage, $millions", xlab = "", main = "Cost of Property Damages Over the Years", 
    tck = 1)

plot(as.numeric(levels(factor(drought$Year))), cropDmgCost, type = "l", col = "brown", 
    ylab = "Cost of Crop Damages, $millions", xlab = "", main = "Cost of Crop Damages Over the Years", 
    tck = 1)

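[Figure: yearly cost of property damages caused by thunderstorms (top) and yearly cost of crop damages caused by drought (bottom)]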

Table of the Maximum Number of Fatalities and Injuries Caused by Heat and Tornados summarized by State

The two tables below summarize, for each state, the maximum number of fatalities and injuries caused by a single heat or tornado event, and the average cost per record of property and crop damages caused by thunderstorms and drought. The built-in R data set state.abb was used to keep only the 50 states. We used the melt() function in the reshape2 package to convert the arrays returned by tapply() into data frames.

# Maximum number of fatalities and injuries caused by heat and tornados in each state
fH <- tapply(heatSevere$FATALITIES, heatSevere$STATE, max)
fT <- tapply(TornadoStms$FATALITIES, TornadoStms$STATE, max)
iH <- tapply(heatSevere$INJURIES, heatSevere$STATE, max)
iT <- tapply(TornadoStms$INJURIES, TornadoStms$STATE, max)

# Reshape
library(reshape2)
fH1 <- melt(fH)
fT1 <- melt(fT)
iH1 <- melt(iH)
iT1 <- melt(iT)

# Data frame
df <- data.frame(fH1, fT1[2], iH1[2], iT1[2])
names(df) <- c("State", "Fatalities By Heat", "Fatalities by Tornados", "Injuries By Heat", 
    "Injuries By Tornados")
df1 <- subset(df, df$State %in% state.abb)
   State  Fatalities By Heat  Fatalities by Tornados  Injuries By Heat  Injuries By Tornados
 1 AK                   1.00                    0.00              0.00                  0.00
 2 AL                   2.00                   44.00             24.00                800.00
 5 AR                   3.00                   50.00              4.00                350.00
 7 AZ                  30.00                    2.00              0.00                 41.00
 8 CA                  46.00                    0.00            102.00                 30.00
 9 CO                   0.00                    2.00              0.00                 78.00
10 CT                   1.00                    3.00              0.00                500.00
12 DE                   4.00                    2.00             10.00                 30.00
13 FL                   1.00                   25.00              8.00                450.00
14 GA                   2.00                   18.00              3.00                300.00
17 HI                   0.00                    0.00              0.00                  4.00
18 IA                   3.00                   13.00             12.00                450.00
19 ID                   0.00                    0.00              0.00                  3.00
20 IL                 583.00                   33.00            122.00                500.00
21 IN                  14.00                   31.00             80.00                560.00
22 KS                   5.00                   75.00             14.00                450.00
23 KY                   2.00                   31.00             69.00                350.00
24 LA                   9.00                   22.00              2.00                266.00
31 MA                   2.00                   90.00              0.00               1228.00
32 MD                  21.00                    2.00            241.00                122.00
33 ME                                           1.00                                    3.00
35 MI                  17.00                  116.00            215.00                785.00
36 MN                   5.00                   12.00              0.00                175.00
37 MO                  42.00                  158.00            519.00               1150.00
38 MS                   2.00                   57.00              5.00                504.00
39 MT                   0.00                    2.00              0.00                  5.00
40 NC                   2.00                   12.00             15.00                280.00
41 ND                   0.00                   10.00              0.00                103.00
42 NE                   3.00                   11.00              0.00                118.00
43 NH                   0.00                    1.00              0.00                  7.00
44 NJ                  17.00                    1.00            160.00                 12.00
45 NM                   1.00                    2.00              0.00                 34.00
46 NV                   6.00                    0.00              0.00                  1.00
47 NY                  42.00                    9.00             50.00                 68.00
48 OH                  13.00                   36.00             52.00               1150.00
49 OK                  10.00                   20.00            100.00                293.00
50 OR                   1.00                    0.00              0.00                  2.00
51 PA                  74.00                   12.00            135.00                120.00
57 RI                   2.00                    0.00              0.00                 20.00
58 SC                   9.00                    7.00             15.00                115.00
59 SD                   1.00                    6.00              1.00                150.00
62 TN                   7.00                   23.00              1.00                200.00
63 TX                  49.00                  114.00            223.00               1700.00
64 UT                   1.00                    1.00              0.00                 80.00
65 VA                   2.00                   11.00            100.00                246.00
67 VT                   0.00                    0.00              0.00                  7.00
68 WA                   1.00                    6.00              0.00                300.00
69 WI                  57.00                   20.00             40.00                200.00
70 WV                   1.00                    1.00              3.00                 15.00
71 WY                                           2.00                                   40.00
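
The data.frame() call above binds the four summaries column by column, which relies on every summary listing the same states in the same order. A minimal sketch of an alternative that joins the summaries on the state code with base R's aggregate() and merge() is shown below. The records and the column names FatalitiesByHeat and FatalitiesByTornados are hypothetical.

```r
# Hypothetical records for illustration only (not NOAA data)
heat <- data.frame(STATE = c("IL", "IL", "TX", "PR"),
                   FATALITIES = c(583, 2, 49, 1),
                   stringsAsFactors = FALSE)
tornado <- data.frame(STATE = c("TX", "IL", "KS"),
                      FATALITIES = c(114, 33, 75),
                      stringsAsFactors = FALSE)

# Per-state maxima for each event type
heat_max <- aggregate(FATALITIES ~ STATE, data = heat, FUN = max)
torn_max <- aggregate(FATALITIES ~ STATE, data = tornado, FUN = max)
names(heat_max)[2] <- "FatalitiesByHeat"
names(torn_max)[2] <- "FatalitiesByTornados"

# Join on STATE; all = TRUE keeps states present in only one summary (as NA)
by_state <- merge(heat_max, torn_max, by = "STATE", all = TRUE)

# Keep only the 50 states listed in the built-in state.abb vector
by_state <- by_state[by_state$STATE %in% state.abb, ]
print(by_state)
```

Joining by key makes the pairing of values explicit, at the cost of a few more lines than positional binding.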

Average Cost per Record of Property and Crop Damages Caused by Thunderstorms and Droughts

# Mean property damage (thunderstorms) and crop damage (drought) per record in each state
cP <- tapply(thunder$PROPDMG, thunder$STATE, mean)
cC <- tapply(drought$CROPDMG, drought$STATE, mean)

# Reshape
library(reshape2)
cP1 <- melt(cP)
cC1 <- melt(cC)

names(cP1) = c("State", "Property Damages")
names(cC1) = c("State", "Crop Damages")
cP1 <- subset(cP1, cP1$State %in% state.abb)
cC1 <- subset(cC1, cC1$State %in% state.abb)

# Data frame
dataFrame <- data.frame(cP1, cC1[2])
names(dataFrame) = c("State", "Property Damage, $millions", "Crop Damages, $millions")
   State  Property Damage, $millions  Crop Damages, $millions
 1 AK                          73.57                     0.00
 2 AL                          10.97                     0.71
 5 AR                          21.27                    22.01
 7 AZ                          37.93                     0.00
 8 CA                          28.44                     0.00
 9 CO                           6.25                     0.00
10 CT                          12.84                     0.00
12 DE                          16.90                     0.88
13 FL                          11.39                     1.41
14 GA                          19.73                     7.65
17 HI                           3.75                     0.93
18 IA                          22.82                    65.66
19 ID                          21.71                     0.00
20 IL                          15.81                     5.93
21 IN                          11.65                     3.65
22 KS                          10.80                     7.02
23 KY                          14.00                     6.11
24 LA                          13.39                    83.81
31 MA                           9.33                     0.00
32 MD                           5.96                     1.41
33 ME                           0.36                     0.00
35 MI                          11.72                    16.67
36 MN                           8.75                     0.00
37 MO                           9.84                    11.78
38 MS                          17.97                   100.00
39 MT                          10.76                    99.00
40 NC                           6.18                     5.78
41 ND                          19.69                     0.00
42 NE                          10.40                    48.00
43 NH                           4.47                     0.00
44 NJ                           8.55                     1.95
45 NM                           9.18                     0.19
46 NV                           6.82
47 NY                          11.05                     9.68
48 OH                          18.66                     8.00
49 OK                          10.21                    87.07
50 OR                          11.87                     7.53
51 PA                          10.96                     8.42
57 RI                          15.62
58 SC                           6.15                     0.31
59 SD                           5.43                     0.00
62 TN                          11.31                     0.00
63 TX                          19.54                    21.70
64 UT                          10.71                     0.00
65 VA                           4.35                    17.26
67 VT                          18.78                     0.00
68 WA                           3.40                     0.00
69 WI                          11.72                     0.95
70 WV                          10.10                     8.12
71 WY                           5.88
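
The blank cells in the tables are consistent with how tapply() treats factor levels that have no observations: it returns NA for them, and NA values are typically printed as blanks in formatted tables. The sketch below reproduces this behavior on made-up values; the vectors states and crop are hypothetical.

```r
# Hypothetical records for illustration only (not NOAA data)
states <- factor(c("MT", "MT", "IA"), levels = c("IA", "MT", "NV"))
crop   <- c(99, 101, 65)

# NV is a valid level but has no records, so its mean is NA
avg <- tapply(crop, states, mean)
print(avg)

# One option: report only states with at least one record
avg_observed <- avg[!is.na(avg)]
print(avg_observed)
```

Whether to drop such states or show them explicitly as missing is a presentation choice; either way, a blank should not be read as zero damage.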