Synopsis: This report explores the NOAA Storm Database to answer some basic questions about severe weather events. The aim is to identify which types of events are most harmful to population health (measured by the number of fatalities and injuries) and which have the greatest economic consequences (measured by property and crop damage). Based on this information, better forecasting and evacuation systems can be designed to reduce damage in the future.
First I downloaded the file from the link given on the assignment page. Because the file is in bz2 format, I read it through a bzfile() connection, which decompresses it on the fly. Reading the file takes a long time, so I set cache = TRUE on the chunk to save time during further debugging. I then used the str() and summary() functions to get an overview of the data.
if (!file.exists("StormData.csv.bz2")) {
  download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
}
data <- read.csv(bzfile("StormData.csv.bz2"),stringsAsFactors = FALSE)
# it takes a while to read
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
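Because reading the file is slow, one alternative to chunk caching is to keep a parsed copy on disk. The sketch below is only an illustration: `read_cached()` and the temporary file names are hypothetical and not part of the original analysis, and the demonstration uses a tiny made-up CSV rather than the storm data.

```r
# Hypothetical helper: parse a CSV once, then reuse a cached .rds copy.
read_cached <- function(csv_path, rds_path) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))   # fast path: cached copy already exists
  }
  df <- read.csv(csv_path, stringsAsFactors = FALSE)
  saveRDS(df, rds_path)         # store the parsed data frame for next time
  df
}

# Small self-contained demonstration with temporary files (made-up rows)
csv_demo <- tempfile(fileext = ".csv")
rds_demo <- tempfile(fileext = ".rds")
write.csv(data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0)),
          csv_demo, row.names = FALSE)

first  <- read_cached(csv_demo, rds_demo)   # parses the CSV and writes the cache
second <- read_cached(csv_demo, rds_demo)   # served from the cache
all.equal(first, second)
```

The second call never touches the CSV, so the cost of parsing is paid only once per cache file.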
summary(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31 Class :character Class :character Class :character
## Median : 75 Mode :character Mode :character Mode :character
## Mean :101
## 3rd Qu.:131
## Max. :873
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0 Class :character Class :character Class :character
## Median : 0 Mode :character Mode :character Mode :character
## Mean : 1
## 3rd Qu.: 1
## Max. :3749
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0
## Mode :character Median :0 Median : 0
## Mean :0 Mean : 1
## 3rd Qu.:0 3rd Qu.: 0
## Max. :0 Max. :925
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0 Min. : 0
## Class :character Class :character 1st Qu.: 0.0 1st Qu.: 0
## Mode :character Mode :character Median : 0.0 Median : 0
## Mean : 0.2 Mean : 8
## 3rd Qu.: 0.0 3rd Qu.: 0
## Max. :2315.0 Max. :4400
##
## F MAG FATALITIES INJURIES
## Min. :0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.:0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0
## Median :1 Median : 50 Median : 0 Median : 0.0
## Mean :1 Mean : 47 Mean : 0 Mean : 0.2
## 3rd Qu.:1 3rd Qu.: 75 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :5 Max. :22000 Max. :583 Max. :1700.0
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0 Length:902297 Min. : 0.0 Length:902297
## 1st Qu.: 0 Class :character 1st Qu.: 0.0 Class :character
## Median : 0 Mode :character Median : 0.0 Mode :character
## Mean : 12 Mean : 1.5
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :5000 Max. :990.0
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
# Check how many distinct types of weather events are recorded
eventType <- unique(data$EVTYPE)
length(eventType)
## [1] 985
# A total of 985 types - WoW
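A count this high may partly reflect inconsistent spelling rather than genuinely different events. The sketch below shows one way to check that, assuming the differences are only in letter case and surrounding whitespace; the vector `evtype_demo` is made up for illustration and is not taken from the dataset.

```r
# Toy values standing in for data$EVTYPE (illustrative only)
evtype_demo <- c("TORNADO", "Tornado", " TORNADO", "HAIL", "hail", "FLOOD ")

# Number of distinct raw spellings
n_raw <- length(unique(evtype_demo))

# Normalize: strip surrounding whitespace, then convert to upper case
evtype_clean <- toupper(trimws(evtype_demo))
n_clean <- length(unique(evtype_clean))

c(raw = n_raw, clean = n_clean)
```

Applying the same two calls to data$EVTYPE before aggregating would merge such variants; whether that changes the ranking in this report has not been checked here.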
The impact on population health can be inferred from the FATALITIES and INJURIES columns, which hold the number of fatalities and the number of injuries recorded for each event. Both are numeric, as the str() output above shows.
# Use tapply to calculate the total number of fatalities and injuries per event type
populationDamage <- with(data,tapply(FATALITIES+INJURIES, EVTYPE,sum))
# Sort the totals in descending order
populationDamage <- populationDamage[order(populationDamage,decreasing = TRUE)]
# Make the bar plot of the five most harmful event types
barplot(head(populationDamage, n= 5),main="Population Damage caused by weather type", xlab="Type of weather", ylab="Sum of Fatalities and injuries number",col="gold",cex.names=0.7)
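The total above adds fatalities and injuries together, which hides how each contributes. As a minimal sketch, the two columns could also be summed separately with aggregate(); the data frame `storms_demo` below is made up for illustration and only mimics the column names of the real data.

```r
# Toy data with the same column names as the storm data (illustrative only)
storms_demo <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 5, 2, 3),
  INJURIES   = c(15, 2, 0, 1, 4)
)

# Sum fatalities and injuries separately for each event type
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                    data = storms_demo, FUN = sum)

# Add the combined total and sort by it in descending order
health$TOTAL <- health$FATALITIES + health$INJURIES
health <- health[order(health$TOTAL, decreasing = TRUE), ]
health
```

In this toy example the tornado rows rank first because of injuries while the heat rows have the most fatalities, which is the kind of difference a combined total cannot show.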
The economic damage can be inferred from PROPDMG and CROPDMG, which stand for property damage and crop damage. The unit of each amount is given in the columns PROPDMGEXP and CROPDMGEXP.
# First check the unique values of PROPDMGEXP and CROPDMGEXP
unique(data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
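unique() shows which codes occur but not how often. A frequency table helps judge how much the unusual codes matter; the sketch below uses a made-up vector `exp_demo`, not the real column.

```r
# Toy unit codes standing in for data$PROPDMGEXP (illustrative only)
exp_demo <- c("K", "K", "M", "", "K", "B", "m", "", "5")

# Count how often each code occurs, most frequent first
exp_counts <- sort(table(exp_demo), decreasing = TRUE)
exp_counts
```

Running the same call on data$PROPDMGEXP and data$CROPDMGEXP would show whether codes such as "+", "?" or the digits are frequent enough to affect the totals.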
Data transformation is necessary to bring all amounts to the same unit.
B is for billion -> 10^9
M is for million -> 10^6
K is for thousand -> 10^3
H is for hundred -> 10^2
For easier reading, we convert all damage costs to million USD and save them in the new variables PROPDMG2 and CROPDMG2.
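The arithmetic of this conversion can be illustrated with a small base-R lookup. This is a sketch under stated assumptions, not the method used in this report: `unit_multiplier` and `to_million_usd()` are hypothetical names, and codes outside the B/M/K/H table are assumed to mean "no scaling", which differs from the convention in the recode() call used in this report.

```r
# Multipliers for the letter codes B, M, K, H (upper and lower case)
unit_multiplier <- c(H = 1e2, h = 1e2, K = 1e3, k = 1e3,
                     M = 1e6, m = 1e6, B = 1e9, b = 1e9)

# Hypothetical helper: convert an amount plus unit code to million USD.
# Assumption: any code not in the table (blank, "+", "?", digits) is
# treated as a multiplier of 1, i.e. the amount is already in USD.
to_million_usd <- function(amount, code) {
  mult <- unname(unit_multiplier[code])  # NA for codes not in the table
  mult[is.na(mult)] <- 1
  amount * mult / 1e6
}

# 25 K, 2.5 M, 1.2 B and an amount with a blank code
converted <- to_million_usd(c(25, 2.5, 1.2, 40), c("K", "M", "B", ""))
converted
```

For example, 25 with code "K" is 25 * 10^3 USD, which is 0.025 million USD, and 1.2 with code "B" is 1200 million USD.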
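# Use recode() from the car package to map the unit codes B, M, K, H to the
# exponents 9, 6, 3, 2, then convert the result to numeric.
# Note: "+", "-", "?" and blank are mapped to exponent 1 here, i.e. a multiplier of 10;
# digit codes are not recoded and are used directly as exponents.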
library(car)
data$PROPDMGEXP2 <- as.numeric(with(data, recode(PROPDMGEXP,
"'B'=9;
'b'=9;
'M'=6;
'm'=6;
'K'=3;
'k'=3;
'H'=2;
'h'=2;
'+'=1;
'-'=1;
'?'=1;
''=1")))
data$PROPDMG2 <- data$PROPDMG*(10^data$PROPDMGEXP2)/(10^6)
data$CROPDMGEXP2 <- as.numeric(with(data, recode(CROPDMGEXP,
"'B'=9;
'b'=9;
'M'=6;
'm'=6;
'K'=3;
'k'=3;
'H'=2;
'h'=2;
'+'=1;
'-'=1;
'?'=1;
''=1")))
data$CROPDMG2 <- data$CROPDMG*(10^data$CROPDMGEXP2)/(10^6)
economicDamage <- with(data,tapply(CROPDMG2+PROPDMG2, EVTYPE,sum))
economicDamage <- economicDamage[order(economicDamage, decreasing=TRUE)]
head(economicDamage,n=5)
## FLOOD HURRICANE/TYPHOON TORNADO STORM SURGE
## 150320 71914 57362 43324
## HAIL
## 18761
barplot(head(economicDamage, n= 5),main="Total Economic Damage caused by weather type in million USD", xlab="Type of weather", ylab="Sum of Crop damage and Property damage in million USD",col="gold",cex.names=0.7)