Synopsis

In this data analysis, I summarize the consequences of storms and other severe weather events for population health and the economy across the United States. The project explores the ‘Storm Data Preparation’ raw data set, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, along with estimates of any fatalities, injuries, and property damage. The data set can be downloaded in csv format from “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2”, and documentation on the data set is available in pdf format from “https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf”. The first part of the analysis summarises the consequences of disastrous events for population health. The second part accounts for the economic losses from property and crop damage that occurred as a result of these events.

Data Processing

First, the data is downloaded from the internet using download.file(). It is then read into a variable named ‘storm’ using read.csv().
Packages used:
-‘dplyr’: for manipulating data.
-‘ggplot2’: for creating graphs.
-‘R.utils’: for decompressing the downloaded file using bunzip2().

library(dplyr)
library(ggplot2)
library(R.utils)
# Download the compressed file only if it is not already present
if (!file.exists("repdata_data_StormData.csv.bz2")) {
    fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(fileURL, destfile = "repdata_data_StormData.csv.bz2", mode = "wb")
}
# Decompress only if the extracted file is missing.
# remove = FALSE keeps the archive, so a re-run neither downloads it again
# nor fails because the destination file already exists.
if (!file.exists("repdata_data_StormData")) {
    bunzip2("repdata_data_StormData.csv.bz2", destname = "repdata_data_StormData", remove = FALSE)
}
# Read the extracted csv file into 'storm'
storm <- read.csv("repdata_data_StormData")
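read.csv() can also read a bzip2-compressed file directly, which makes the decompression step optional. The following is a minimal sketch that uses a small temporary file instead of the storm data; the toy columns and values are invented for illustration.

```r
# Toy data standing in for the storm file (invented values, illustration only)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write it as a bzip2-compressed csv file in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and reads the file without bunzip2()
toy_read <- read.csv(tmp)
print(toy_read)
```

Reading the archive directly avoids keeping a second, much larger copy of the data on disk, at the cost of decompressing it on every read.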

Second, after loading the data, summarise it quickly. The str() function is useful for checking the variable types and the number of observations. It is worth reading the documentation provided to get to know the data set in detail and to identify the columns required for the analysis. The names() function lists the column names.

str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
names(storm)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Processing Data for consequences on Population Health by Disastrous Events:

The documentation shows that event names are given by the ‘EVTYPE’ field and that the fields related to population health are ‘FATALITIES’ and ‘INJURIES’. Only these columns are extracted and stored in a variable named pophealth.

pophealth<-storm[,c("EVTYPE","FATALITIES","INJURIES")]

Next, I’ve aggregated this data set by summing the fatalities and injuries per event. This is done using the group_by and summarise_each functions.

pophealth<-group_by(pophealth,EVTYPE)
summarise_each(pophealth,funs(sum))->pophealth
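In recent dplyr releases, summarise_each() and funs() are deprecated in favour of across(). A rough equivalent is sketched here on a toy data frame, assuming dplyr 1.0.0 or later; the values are invented and do not come from the storm data.

```r
library(dplyr)

# Toy stand-in for the EVTYPE/FATALITIES/INJURIES subset (invented values)
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
    FATALITIES = c(1, 0, 2, 3),
    INJURIES   = c(15, 2, 6, 0)
)

# Sum every non-grouping column within each event type
toy_sum <- toy %>%
    group_by(EVTYPE) %>%
    summarise(across(everything(), sum), .groups = "drop")

print(toy_sum)
```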

Then I’ve created two separate data sets, popfatality and popinjury, to store the data related to fatalities and injuries respectively, and arranged each of them in descending order using the arrange function. The 20 most disastrous events are then stored back in these variables.

popfatality<-pophealth[,c(1,2)]
popinjury<-pophealth[,c(1,3)]
popfatality<-arrange(popfatality,desc(FATALITIES))
popinjury<-arrange(popinjury,desc(INJURIES))
popfatality<-head(popfatality,20)
popinjury<-head(popinjury,20)
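The arrange-then-head pattern can also be written as a single call to slice_max(), which keeps the rows with the largest values of a column and returns them in descending order. A small sketch, assuming dplyr 1.0.0 or later and using invented totals:

```r
library(dplyr)

# Invented per-event totals, for illustration only
toy_sum <- data.frame(
    EVTYPE     = c("FLOOD", "HEAT", "TORNADO", "HAIL"),
    FATALITIES = c(4, 9, 20, 1)
)

# Keep the 2 events with the most fatalities, largest first
top2 <- slice_max(toy_sum, FATALITIES, n = 2, with_ties = FALSE)
print(top2)
```

Setting with_ties = FALSE guarantees exactly n rows, which matches the behaviour of head().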

Processing Data for Economic Losses occurred as consequence of Disastrous Events:

In this part, only the fields required to find the greatest economic losses caused by disastrous events in the United States are considered.
The documentation shows that the fields related to economic losses are ‘PROPDMG’ and ‘CROPDMG’, meaning property and crop damage. The other necessary fields are PROPDMGEXP and CROPDMGEXP, which store alphabetical characters signifying the magnitude of the losses: “K” for thousands, “M” for millions, and “B” for billions. These columns, together with the event type, are extracted and stored in a variable named ecoconsi.

ecoconsi<-storm[,c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]

Exploring this data shows that there are blank and unknown magnitude values too. These are excluded from the data set. First I’ve converted the magnitude columns to uppercase using toupper(), and then kept only the observations with valid magnitudes using filter().

ecoconsi[,3]<-toupper(ecoconsi[,3])
ecoconsi[,5]<-toupper(ecoconsi[,5])
ecoconsi<-filter(ecoconsi,PROPDMGEXP==c("K","B","M"),CROPDMGEXP==c("K","B","M"))
## Warning in c("K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M", :
## longer object length is not a multiple of shorter object length
## Warning in c("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", :
## longer object length is not a multiple of shorter object length
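The warnings above arise because == compares element by element and recycles the three-letter vector along the column, so a row is kept only when its value happens to match the letter at its position in the cycle. A membership test is usually written with %in% instead. The sketch below uses invented codes to show the difference. Applying %in% to the storm data would change which rows are kept, so the totals reported in this document would have to be recomputed.

```r
# Invented exponent codes, for illustration only
exp_codes <- c("K", "K", "M", "B", "", "K")

# `==` recycles c("K", "B", "M") as K, B, M, K, B, M and compares by position
by_equal <- exp_codes == c("K", "B", "M")

# `%in%` tests each element for membership in the set
by_in <- exp_codes %in% c("K", "B", "M")

print(by_equal)
print(by_in)
```

Here the == comparison keeps only the first and third elements, while %in% keeps every element except the blank one.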

I’ve then converted the magnitudes stored in the fields PROPDMGEXP and CROPDMGEXP to numeric using the following for loop.

for(i in 1:length(ecoconsi[,1]))
{
    if(ecoconsi[i,3]=="K")
    {ecoconsi[i,3]=1000}
    if(ecoconsi[i,5]=="K")
    {ecoconsi[i,5]=1000}
    if(ecoconsi[i,3]=="M")
    {ecoconsi[i,3]=1000000}
    if(ecoconsi[i,5]=="M")
    {ecoconsi[i,5]=1000000}
    if(ecoconsi[i,3]=="B")
    {ecoconsi[i,3]=1000000000}
    if(ecoconsi[i,5]=="B")
    {ecoconsi[i,5]=1000000000}
}
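A row-by-row loop can be slow on a large data frame. The same conversion can be sketched as a vectorised lookup with a named vector; the column names PROPMULT and PROPDMG_USD and the values are hypothetical, introduced only for this example.

```r
# Named lookup vector mapping exponent codes to numeric multipliers
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Invented damage figures, for illustration only
toy <- data.frame(
    PROPDMG    = c(25, 2.5, 1),
    PROPDMGEXP = c("K", "M", "B"),
    stringsAsFactors = FALSE
)

# Index the lookup by code; unname() drops the names carried over from it.
# Codes missing from the lookup would produce NA here.
toy$PROPMULT <- unname(multiplier[toy$PROPDMGEXP])

# Full damage value = reported figure times its multiplier
toy$PROPDMG_USD <- toy$PROPDMG * toy$PROPMULT
print(toy)
```

Keeping the multiplier in a separate numeric column also avoids storing numbers as text in the exponent column.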

Next, I’ve multiplied the property damage column by its magnitude and the crop damage column by its magnitude to get the full, unrounded numeric values of the losses.
Then I’ve added the losses due to property and crop damage to get the total economic losses, stored in a column named ECODMG.

ecoconsi[,2]<-(ecoconsi[,2]*as.numeric(ecoconsi[,3]))
ecoconsi[,4]<-(ecoconsi[,4]*as.numeric(ecoconsi[,5]))
ecoconsi$ECODMG<-ecoconsi[,2]+ecoconsi[,4]

I’ve created a variable containing only the event type and the total economic losses. Then I’ve aggregated this data set by summing the total economic losses per event. This is done using the group_by and summarise_each functions. The result is arranged in descending order of total losses, and the 20 most damaging events are stored back in the ecoconsi variable.

ecoconsi<-ecoconsi[,c(1,6)]
ecoconsi<-group_by(ecoconsi,EVTYPE)
ecoconsi<-summarise_each(ecoconsi,funs(sum))
ecoconsi<-arrange(ecoconsi,desc(ECODMG))
ecoconsi<-head(ecoconsi,20)

RESULTS

Plot the results using the processed data.

I’ve used the ggplot2 package for constructing the graph.

First, plot the graph for the 20 most disastrous events resulting in fatalities.

g1<-ggplot(popfatality)
g1+geom_point(aes(y=EVTYPE,x=FATALITIES,fill=FATALITIES),pch=21,alpha=0.5,size=5)+theme(plot.title = element_text(face="bold"),axis.title = element_text(size=14, face="bold"),axis.text.x=element_text(size=9,face="bold",angle=45),axis.text.y=element_text(face="bold"),legend.position="none")+labs(x="Number Of Fatalities",y="Event",title="Number of Fatalities per Most Threatening Events in United States")+scale_x_continuous(limits=c(0,5700),breaks=seq(0,6000,by=500))

Second, plot the graph for the 20 most disastrous events resulting in injuries.

g2<-ggplot(popinjury)
g2+geom_point(aes(y=EVTYPE,x=INJURIES,fill=INJURIES),pch=21,alpha=0.5,size=5)+theme(plot.title = element_text(face="bold"),axis.title = element_text(size=14, face="bold"),axis.text.x=element_text(size=9,face="bold",angle=45),axis.text.y=element_text(face="bold"),legend.position="none")+labs(x="Number Of Injuries",y="Event",title="Number of Injuries per Most Threatening Events in United States")+scale_x_continuous(limits=c(0,92000),breaks=seq(0,92000,by=7000))
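ggplot2 orders a discrete axis by factor level (alphabetically for plain text), so the sorting done earlier with arrange() is not reflected in these plots. One common way to order the events by value is reorder() inside aes(). A minimal sketch with invented totals:

```r
library(ggplot2)

# Invented per-event totals, for illustration only
toy <- data.frame(
    EVTYPE     = c("FLOOD", "HEAT", "TORNADO"),
    FATALITIES = c(9, 20, 4)
)

# reorder() sorts the event levels by increasing FATALITIES,
# so the largest value ends up at the top of the y axis
p <- ggplot(toy) +
    geom_point(aes(x = FATALITIES, y = reorder(EVTYPE, FATALITIES)), size = 5) +
    labs(x = "Number of Fatalities", y = "Event")
print(p)
```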

Tornadoes cause the most fatalities (5,633), and the same events account for 91,346 injuries.

filter(pophealth,FATALITIES==max(FATALITIES))
## Source: local data frame [1 x 3]
## 
##    EVTYPE FATALITIES INJURIES
## 1 TORNADO       5633    91346

Last, plot the graph for the 20 most disastrous events resulting in economic losses.

g3<-ggplot(ecoconsi)
g3+geom_point(aes(y=EVTYPE,x=ECODMG/1000000,fill=ECODMG),pch=21,alpha=0.5,size=4)+theme(plot.title = element_text(face="bold"),axis.title = element_text(size=14, face="bold"),axis.text.x=element_text(size=9,face="bold",angle=90),axis.text.y=element_text(face="bold"),legend.position="none")+labs(x="Economic Loss in Millions",y="Event",title="Economic Losses per Event in United States")+scale_x_continuous(breaks=c(0,min(ecoconsi$ECODMG)/1000000,seq(625,10000,by=625)))

River floods cause the greatest economic losses in this analysis, about 10 billion dollars.

filter(ecoconsi,ECODMG==max(ECODMG))
## Source: local data frame [1 x 2]
## 
##        EVTYPE      ECODMG
## 1 RIVER FLOOD 10005694000

Additional Details

Software Environment

Computer architecture: CPU AMD A-8
Operating system: Windows 8.1 PRO
Software toolchain: RStudio
Supporting software / infrastructure: R packages: dplyr, R.utils, ggplot2
Data Source: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Documentation Source: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf