This research involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
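Across the United States, the types of storm events that are most harmful with respect to population health, measured by combined fatalities and injuries, are tornadoes, heat, and floods.
Across the United States, the types of storm events that have the greatest economic consequences, measured by combined property and crop damage, are tornadoes, hurricanes, and floods.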
The R statistical computing and graphics environment was used to analyze the data and publish this report. The analysis requires the following R packages.
library("xtable");
library("plyr");
library("ggplot2");
The following R code was used to read the bzip2-compressed data archive and load the storm data into R.
tmp <-"data/repdata-data-StormData.csv.bz2";
storm <- read.csv(
bzfile(tmp), # the second argument of bzfile() is the open mode, so only the path is passed
stringsAsFactors=FALSE,header=TRUE
);
str(storm[,c('BGN_DATE','FATALITIES','INJURIES','PROPDMG','CROPDMG')]);
## 'data.frame': 902297 obs. of 5 variables:
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
summary(storm[,c('BGN_DATE','FATALITIES','INJURIES','PROPDMG','CROPDMG')]);
## Warning: closing unused connection 5 (data/repdata-data-StormData.csv.bz2)
## BGN_DATE FATALITIES INJURIES PROPDMG
## Length:902297 Min. : 0 Min. : 0.0 Min. : 0
## Class :character 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## Mode :character Median : 0 Median : 0.0 Median : 0
## Mean : 0 Mean : 0.2 Mean : 12
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.: 0
## Max. :583 Max. :1700.0 Max. :5000
## CROPDMG
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 0.0
## Mean : 1.5
## 3rd Qu.: 0.0
## Max. :990.0
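The code above reads an archive that is already on disk. A minimal sketch of a fetch step is shown below, assuming the archive is published at some URL; the helper name `fetch_if_missing` and the placeholder URL are hypothetical and do not come from the report. The demonstration writes a tiny bzip2 file locally so that no network access is needed.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is absent.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")  # binary mode for .bz2
  }
  invisible(dest)
}

# Build a small compressed file so the example runs offline.
demo_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(demo_path, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "FLOOD,0"), con)
close(con)

# The file exists, so the placeholder URL (an assumption) is never contacted.
fetch_if_missing("https://example.org/placeholder.csv.bz2", demo_path)

# read.csv() opens and closes the bzfile() connection itself.
demo <- read.csv(bzfile(demo_path), stringsAsFactors = FALSE)
print(demo)
```

Skipping the download when the file already exists keeps repeated runs of the report from fetching the archive again.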
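The following R code was used to take a subset of the data. The subset keeps only events with at least one fatality or injury and contains the variables of interest, which relate to population health (fatalities and injuries) and economic consequences (property damage and crop damage). Damage amounts are scaled to dollar values using their exponent codes (H, K, M, B).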
varNames <- c('BGN_DATE', 'EVTYPE', 'FATALITIES','INJURIES',
'PROPDMGEXP','PROPDMG','CROPDMGEXP','CROPDMG');
df <- storm[storm$FATALITIES>0 | storm$INJURIES>0,varNames];
df$BGN_DATE <- as.Date(df$BGN_DATE,"%m/%d/%Y");
df$EVTYPE <- toupper(df$EVTYPE);
df$PROPDMGEXP <- factor(toupper(df$PROPDMGEXP));
df$PROPDMG[df$PROPDMGEXP=='H'] <- df$PROPDMG[df$PROPDMGEXP=='H'] * 100;
df$PROPDMG[df$PROPDMGEXP=='K'] <- df$PROPDMG[df$PROPDMGEXP=='K'] * 1000;
df$PROPDMG[df$PROPDMGEXP=='M'] <- df$PROPDMG[df$PROPDMGEXP=='M'] * 1000000;
df$PROPDMG[df$PROPDMGEXP=='B'] <- df$PROPDMG[df$PROPDMGEXP=='B'] * 1000000000;
df$CROPDMGEXP <- factor(toupper(df$CROPDMGEXP));
df$CROPDMG[df$CROPDMGEXP=='H'] <- df$CROPDMG[df$CROPDMGEXP=='H'] * 100;
df$CROPDMG[df$CROPDMGEXP=='K'] <- df$CROPDMG[df$CROPDMGEXP=='K'] * 1000;
df$CROPDMG[df$CROPDMGEXP=='M'] <- df$CROPDMG[df$CROPDMGEXP=='M'] * 1000000;
df$CROPDMG[df$CROPDMGEXP=='B'] <- df$CROPDMG[df$CROPDMGEXP=='B'] * 1000000000;
df$ECONOMIC <- df$PROPDMG + df$CROPDMG;
df$HEALTH <- df$FATALITIES + df$INJURIES;
# Merge related event types (vectorized; same result as a row-by-row loop).
# Note: in the NOAA data 'TSTM WIND' abbreviates thunderstorm wind; the mapping is kept as in the original analysis.
df$EVTYPE[df$EVTYPE=='FLASH FLOOD'] <- 'FLOOD';
df$EVTYPE[df$EVTYPE=='TSTM WIND'] <- 'TROPICAL STORM';
df$EVTYPE[df$EVTYPE=='EXCESSIVE HEAT'] <- 'HEAT';
df$EVTYPE[df$EVTYPE=='HURRICANE/TYPHOON'] <- 'HURRICANE';
varNames <- c('BGN_DATE','EVTYPE', 'FATALITIES','INJURIES',
'HEALTH', 'PROPDMG','CROPDMG', 'ECONOMIC');
df <- df[,varNames]
str(df); summary(df);
## 'data.frame': 21929 obs. of 8 variables:
## $ BGN_DATE : Date, format: "1950-04-18" "1951-02-20" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 1 0 0 1 ...
## $ INJURIES : num 15 2 2 2 6 1 14 3 3 26 ...
## $ HEALTH : num 15 2 2 2 6 1 15 3 3 27 ...
## $ PROPDMG : num 25000 25000 2500 2500 2500 2500 25000 2500000 2500000 250000 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ECONOMIC : num 25000 25000 2500 2500 2500 2500 25000 2500000 2500000 250000 ...
## BGN_DATE EVTYPE FATALITIES INJURIES
## Min. :1950-01-03 Length:21929 Min. : 0.0 Min. : 0.0
## 1st Qu.:1987-07-06 Class :character 1st Qu.: 0.0 1st Qu.: 1.0
## Median :1997-12-08 Mode :character Median : 0.0 Median : 1.0
## Mean :1993-07-24 Mean : 0.7 Mean : 6.4
## 3rd Qu.:2004-05-02 3rd Qu.: 1.0 3rd Qu.: 3.0
## Max. :2011-11-30 Max. :583.0 Max. :1700.0
## HEALTH PROPDMG CROPDMG ECONOMIC
## Min. : 1.0 Min. :0.00e+00 Min. :0.00e+00 Min. :0.00e+00
## 1st Qu.: 1.0 1st Qu.:0.00e+00 1st Qu.:0.00e+00 1st Qu.:0.00e+00
## Median : 2.0 Median :1.00e+04 Median :0.00e+00 Median :1.00e+04
## Mean : 7.1 Mean :5.68e+06 Mean :2.76e+05 Mean :5.96e+06
## 3rd Qu.: 4.0 3rd Qu.:2.50e+05 3rd Qu.:0.00e+00 3rd Qu.:2.50e+05
## Max. :1742.0 Max. :1.00e+10 Max. :1.51e+09 Max. :1.00e+10
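The exponent scaling above repeats one assignment per code and per column. As an illustration only, the same rule can be written as a lookup table; the names `exp_multiplier` and `scale_damage` and the toy data are hypothetical and not part of the original analysis.

```r
# Multipliers for the exponent codes handled in the report (H, K, M, B).
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale `amount` by the multiplier for `exp_code`.
# Codes outside the table (blank, digits, symbols) leave the amount unchanged,
# which mirrors the behaviour of the per-code assignments.
scale_damage <- function(amount, exp_code) {
  m <- unname(exp_multiplier[toupper(exp_code)])
  m[is.na(m)] <- 1
  amount * m
}

# Made-up rows covering each code, a lower-case code, and a blank code.
toy <- data.frame(PROPDMG    = c(25, 2.5, 1.2, 3, 7),
                  PROPDMGEXP = c("K", "k", "M", "B", ""),
                  stringsAsFactors = FALSE)
toy$PROPDMG_USD <- scale_damage(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

For example, a `PROPDMG` of 25 with exponent `K` becomes 25000, which matches the first tornado record in the output above.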
The following R code ranks storm events by the number of fatalities and injuries that they caused.
head(arrange(as.data.frame(xtabs(FATALITIES~EVTYPE,df)),desc(Freq)));
## EVTYPE Freq
## 1 TORNADO 5633
## 2 HEAT 2840
## 3 FLOOD 1448
## 4 LIGHTNING 816
## 5 TROPICAL STORM 562
## 6 RIP CURRENT 368
head(arrange(as.data.frame(xtabs(INJURIES~EVTYPE,df)),desc(Freq)));
## EVTYPE Freq
## 1 TORNADO 91346
## 2 HEAT 8625
## 3 FLOOD 8566
## 4 TROPICAL STORM 7297
## 5 LIGHTNING 5230
## 6 ICE STORM 1975
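The two rankings above are produced by separate `xtabs` calls. A sketch of how both measures could be totaled in one table with base R's `aggregate` follows; the toy data frame is made up for illustration, and this is an alternative formulation, not the method used in the report.

```r
# Made-up event records (not taken from the NOAA data).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "LIGHTNING"),
  FATALITIES = c(3, 5, 1, 2, 0, 1),
  INJURIES   = c(40, 2, 10, 6, 8, 0),
  stringsAsFactors = FALSE
)

# Sum both measures per event type in one pass.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank by fatalities, breaking ties by injuries (largest first).
ranked <- totals[order(-totals$FATALITIES, -totals$INJURIES), ]
rownames(ranked) <- NULL
print(ranked)
```

Keeping both totals in one table makes it easier to see when an event type ranks differently on the two measures.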
Now that we know the top 6 storm events in terms of harm to the population, a bar chart can be generated.
tmp <- head(arrange(as.data.frame(xtabs(HEALTH~EVTYPE,df)),desc(Freq)));
# keep only rows whose event type is among the top 6
ck <- df$EVTYPE %in% tmp$EVTYPE;
# weight=HEALTH makes each bar the total of fatalities and injuries rather than a count of rows
qplot(EVTYPE,data=df[ck,c('EVTYPE','HEALTH')],
weight=HEALTH,geom='bar',
main ='Top Harmful Storm Events in the U.S. (1950-2011)',
xlab ='Storm Event Type',
ylab ='Number of Fatalities or Injuries');
The following R code ranks storm events by the value of the crop damage and property damage caused by each event type.
head(arrange(as.data.frame(xtabs(CROPDMG~EVTYPE,df)),desc(Freq)));
## EVTYPE Freq
## 1 HURRICANE 3.680e+09
## 2 HEAT 4.937e+08
## 3 HIGH WIND 3.520e+08
## 4 TROPICAL STORM 2.225e+08
## 5 WILDFIRE 1.821e+08
## 6 FLOOD 1.756e+08
head(arrange(as.data.frame(xtabs(PROPDMG~EVTYPE,df)),desc(Freq)));
## EVTYPE Freq
## 1 TORNADO 4.189e+10
## 2 HURRICANE 3.556e+10
## 3 FLOOD 1.053e+10
## 4 TROPICAL STORM 7.954e+09
## 5 WINTER STORM 5.221e+09
## 6 STORM SURGE/TIDE 4.004e+09
Now that we know the top 6 storm events in terms of economic impact, a bar chart can be generated.
tmp <- head(arrange(as.data.frame(xtabs(ECONOMIC~EVTYPE,df)),desc(Freq)));
# keep only rows whose event type is among the top 6
ck <- df$EVTYPE %in% tmp$EVTYPE;
# plot the ECONOMIC column; weight=ECONOMIC makes each bar the total dollar value rather than a count of rows
qplot(EVTYPE,data=df[ck,c('EVTYPE','ECONOMIC')],
weight=ECONOMIC,geom='bar',
main ='Top Economically Impactful Storm Events in the U.S. (1950-2011)',
xlab ='Storm Event Type',
ylab ='Dollar Value');
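A bar chart of event types can show either how many records each type has or the total of a value column, and the two can rank the types differently. The sketch below illustrates the difference on made-up data using base R only; the toy values and variable names are hypothetical.

```r
# Made-up records: FLOOD is the most frequent, HURRICANE the most costly.
toy <- data.frame(
  EVTYPE   = c("TORNADO", "TORNADO", "HURRICANE", "FLOOD", "FLOOD", "FLOOD", "HAIL"),
  ECONOMIC = c(2e6, 3e6, 9e6, 1e5, 2e5, 1e5, 5e4),
  stringsAsFactors = FALSE
)

# Total value per event type, largest first.
totals <- sort(vapply(split(toy$ECONOMIC, toy$EVTYPE), sum, numeric(1)),
               decreasing = TRUE)
top <- names(head(totals, 3))

# Keep only rows whose event type is in the top set.
subset_rows <- toy[toy$EVTYPE %in% top, ]

# Bar height when counting rows versus when summing the value column.
counts <- table(subset_rows$EVTYPE)[top]
sums   <- totals[top]
print(data.frame(EVTYPE = top,
                 events  = as.vector(counts),
                 dollars = as.vector(sums)))
```

In this toy example the row counts would put FLOOD first, while the dollar totals put HURRICANE first, so the quantity behind the bar heights should match the axis label.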