In this data analysis, I summarize the consequences of storms and other severe weather events for population health and the economy across the United States. The project explores the ‘Storm Data Preparation’ raw data set, which tracks the characteristics of major storms and weather events in the United States, including when and where they occur, along with estimates of any fatalities, injuries, and property damage. The data set can be downloaded as a bzip2-compressed CSV file from “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2”, and documentation on the data set is available as a PDF from “https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf”. The first part of the analysis summarizes the consequences of these events for population health. The second part accounts for the economic losses from property and crop damage caused by these events.
First, the data is downloaded from the internet using download.file(). It is then read into a variable named ‘storm’ using read.csv().
Packages used:
-‘dplyr’: for manipulating data
-‘ggplot2’: for creating visual graphs.
-‘R.utils’: for decompressing the downloaded .bz2 file using bunzip2().
library(dplyr)
library(ggplot2)
library(R.utils)
bz2file <- "repdata_data_StormData.csv.bz2"
csvfile <- "repdata_data_StormData"
# Decompress (and, if necessary, download) only when the CSV is not already present
if (!file.exists(csvfile)) {
  if (!file.exists(bz2file)) {
    fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    # mode = "wb" keeps the binary archive intact on Windows
    download.file(fileURL, destfile = bz2file, mode = "wb")
  }
  # remove = FALSE keeps the archive so it is not downloaded again on the next run
  bunzip2(bz2file, destname = csvfile, remove = FALSE)
}
storm <- read.csv(csvfile)
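As an aside, base R can usually read a bzip2-compressed CSV without decompressing it first, because read.csv() opens a file path through a connection that detects the compression. A minimal sketch on a small made-up file (the toy data frame and the temporary path are illustrative assumptions, not the storm data):

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write it as a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file directly from its path
toy_back <- read.csv(tmp)
print(toy_back)
```

If this works for the full storm file as well, the bunzip2() step becomes optional, at the cost of decompressing on every read.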
Second, after loading the data, take a quick look at it. The str() function is useful for checking the variable types and the number of observations. It is worth reading the provided documentation to understand the data set in detail, since it helps identify which columns the analysis needs. The column names can be listed with the names() function.
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
names(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The documentation shows that event names are given by the ‘EVTYPE’ field. The fields related to population health are ‘FATALITIES’ and ‘INJURIES’. I extract only these columns and store them in a variable named pophealth.
pophealth<-storm[,c("EVTYPE","FATALITIES","INJURIES")]
Next, I’ve aggregated this data set by summing the fatalities and injuries per event type. This is done using the group_by and summarise_each functions.
pophealth<-group_by(pophealth,EVTYPE)
summarise_each(pophealth,funs(sum))->pophealth
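In recent dplyr releases, summarise_each() and funs() are deprecated in favour of summarise() combined with across(). The following is a sketch of the same per-event aggregation in that style, assuming dplyr 1.0 or later; the toy data frame and its values are made up for illustration:

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the storm data set)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 3, 4),
  INJURIES   = c(10, 0, 5, 1)
)

# Sum both health columns within each event type
toy_totals <- toy %>%
  group_by(EVTYPE) %>%
  summarise(across(c(FATALITIES, INJURIES), sum), .groups = "drop")

print(toy_totals)
```

Each event type ends up on one row with its summed fatalities and injuries, which is the shape the pophealth variable has after the aggregation step.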
Then I’ve created two separate data sets, popfatality and popinjury, to store the data on fatalities and injuries respectively, and sorted each in descending order using the arrange function. The 20 most harmful events are then stored back in these variables.
popfatality<-pophealth[,c(1,2)]
popinjury<-pophealth[,c(1,3)]
popfatality<-arrange(popfatality,desc(FATALITIES))
popinjury<-arrange(popinjury,desc(INJURIES))
popfatality<-head(popfatality,20)
popinjury<-head(popinjury,20)
This part considers only the fields needed to identify which events caused the greatest economic losses in the United States.
The documentation shows that the fields related to economic losses are ‘PROPDMG’ and ‘CROPDMG’, meaning property damage and crop damage. The other necessary fields are ‘PROPDMGEXP’ and ‘CROPDMGEXP’, which store alphabetical characters signifying the magnitude of the loss: “K” for thousands, “M” for millions, and “B” for billions. I extract these columns, together with ‘EVTYPE’, and store them in a variable named ecoconsi.
ecoconsi<-storm[,c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
Exploring this data shows that the magnitude fields also contain blank and unknown values, which should be excluded from the data set. First I’ve converted the magnitude columns to uppercase using toupper() and then kept only those observations with valid magnitudes using filter().
ecoconsi[,3]<-toupper(ecoconsi[,3])
ecoconsi[,5]<-toupper(ecoconsi[,5])
ecoconsi<-filter(ecoconsi,PROPDMGEXP==c("K","B","M"),CROPDMGEXP==c("K","B","M"))
## Warning in c("K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M", :
## longer object length is not a multiple of shorter object length
## Warning in c("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", :
## longer object length is not a multiple of shorter object length
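The warnings above suggest that == recycled the three-element vector c("K","B","M") along the column instead of testing membership, so rows with a valid code may have been dropped whenever the code did not match the recycled position. A minimal sketch of the difference between == and %in%, using made-up codes (the vector names are illustrative assumptions):

```r
# Made-up magnitude codes, including one blank entry
exp_codes <- c("K", "M", "B", "", "K", "M", "B")
valid <- c("K", "B", "M")

# == recycles `valid` along the vector, so element i is compared with a
# single recycled code only; R also warns about the mismatched lengths,
# as in the output above (suppressed here to keep the sketch quiet)
by_equals <- suppressWarnings(exp_codes == valid)

# %in% tests each element for membership in the whole set
by_in <- exp_codes %in% valid

print(by_equals)
print(by_in)
```

With %in%, only the blank entry is rejected, whereas the recycled comparison rejects most of the valid codes too. Rerunning the analysis with PROPDMGEXP %in% c("K","B","M") would therefore be expected to retain more rows and could change the totals reported later.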
I’ve then converted the magnitudes stored in the fields PROPDMGEXP and CROPDMGEXP to numeric using the following for loop.
for(i in 1:length(ecoconsi[,1]))
{
if(ecoconsi[i,3]=="K")
{ecoconsi[i,3]=1000}
if(ecoconsi[i,5]=="K")
{ecoconsi[i,5]=1000}
if(ecoconsi[i,3]=="M")
{ecoconsi[i,3]=1000000}
if(ecoconsi[i,5]=="M")
{ecoconsi[i,5]=1000000}
if(ecoconsi[i,3]=="B")
{ecoconsi[i,3]=1000000000}
if(ecoconsi[i,5]=="B")
{ecoconsi[i,5]=1000000000}
}
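A row-by-row loop over a data frame can be slow on a data set of this size. The same mapping can be sketched as a vectorized lookup in a named vector; the names multipliers, codes, and values below are illustrative and not part of the original analysis:

```r
# Lookup table from magnitude code to numeric multiplier
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Made-up codes standing in for a PROPDMGEXP or CROPDMGEXP column
codes <- c("K", "M", "B", "K")

# Indexing by name converts every code in one step
values <- unname(multipliers[codes])
print(values)
```

Because the result is already numeric, this approach would also make the later as.numeric() conversion unnecessary.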
Next, the property damage column is multiplied by its magnitude, and the crop damage column by its magnitude, to get the full numeric value of each loss.
Then I’ve added the property and crop damage losses to get the total economic loss, stored in a column named ECODMG.
ecoconsi[,2]<-(ecoconsi[,2]*as.numeric(ecoconsi[,3]))
ecoconsi[,4]<-(ecoconsi[,4]*as.numeric(ecoconsi[,5]))
ecoconsi$ECODMG<-ecoconsi[,2]+ecoconsi[,4]
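As a small worked example of this arithmetic, the sketch below uses column names rather than positions, which tends to be easier to read and less fragile if the column order changes. The values and the PROPMULT and CROPMULT column names are made up for illustration:

```r
# Two made-up events with damage figures and numeric multipliers
toy <- data.frame(
  PROPDMG  = c(25, 2.5),
  PROPMULT = c(1e3, 1e6),
  CROPDMG  = c(0, 1),
  CROPMULT = c(1e3, 1e3)
)

# Total loss = property damage in dollars + crop damage in dollars
toy$ECODMG <- toy$PROPDMG * toy$PROPMULT + toy$CROPDMG * toy$CROPMULT
print(toy$ECODMG)
```

The first event contributes 25 × 1,000 = 25,000 dollars, and the second contributes 2.5 × 1,000,000 + 1 × 1,000 = 2,501,000 dollars.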
I’ve then kept only the event type and the total economic loss. This data set is aggregated by summing the total economic losses per event type, using the group_by and summarise_each functions. It is then arranged in descending order of total loss, and the 20 most damaging events are stored back in the ecoconsi variable.
ecoconsi<-ecoconsi[,c(1,6)]
ecoconsi<-group_by(ecoconsi,EVTYPE)
ecoconsi<-summarise_each(ecoconsi,funs(sum))
ecoconsi<-arrange(ecoconsi,desc(ECODMG))
ecoconsi<-head(ecoconsi,20)
g1<-ggplot(popfatality)
g1+geom_point(aes(y=EVTYPE,x=FATALITIES,fill=FATALITIES),pch=21,alpha=0.5,size=5)+
  theme(plot.title=element_text(face="bold"),
        axis.title=element_text(size=14,face="bold"),
        axis.text.x=element_text(size=9,face="bold",angle=45),
        axis.text.y=element_text(face="bold"),
        legend.position="none")+
  labs(x="Number of Fatalities",y="Event",title="Number of Fatalities per Most Threatening Events in United States")+
  scale_x_continuous(limits=c(0,5700),breaks=seq(0,6000,by=500))
g2<-ggplot(popinjury)
g2+geom_point(aes(y=EVTYPE,x=INJURIES,fill=INJURIES),pch=21,alpha=0.5,size=5)+
  theme(plot.title=element_text(face="bold"),
        axis.title=element_text(size=14,face="bold"),
        axis.text.x=element_text(size=9,face="bold",angle=45),
        axis.text.y=element_text(face="bold"),
        legend.position="none")+
  labs(x="Number Of Injuries",y="Event",title="Number of Injuries per Most Threatening Events in United States")+
  scale_x_continuous(limits=c(0,92000),breaks=seq(0,92000,by=7000))
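Because EVTYPE is mapped to the y-axis directly, ggplot2 orders the events by factor level (alphabetically by default) rather than by count. If a ranked display is preferred, one option is to wrap the event column in reorder(). A minimal sketch with made-up counts (the toy data frame and its values are assumptions for illustration):

```r
library(ggplot2)

# Made-up counts for three event types
toy <- data.frame(
  EVTYPE     = c("FLOOD", "HEAT", "TORNADO"),
  FATALITIES = c(10, 40, 25)
)

# reorder() sorts the event labels by their fatality count
p <- ggplot(toy, aes(x = FATALITIES, y = reorder(EVTYPE, FATALITIES))) +
  geom_point(shape = 21, fill = "steelblue", alpha = 0.5, size = 5) +
  labs(x = "Number of Fatalities", y = "Event")
p
```

With this ordering, the event with the largest count appears at the top of the axis, which makes the ranking readable at a glance.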
filter(pophealth,FATALITIES==max(FATALITIES))
## Source: local data frame [1 x 3]
##
## EVTYPE FATALITIES INJURIES
## 1 TORNADO 5633 91346
g3<-ggplot(ecoconsi)
# The x-axis is in millions, so the minimum used as a break is scaled to millions as well
g3+geom_point(aes(y=EVTYPE,x=ECODMG/1000000,fill=ECODMG),pch=21,alpha=0.5,size=4)+
  theme(plot.title=element_text(face="bold"),
        axis.title=element_text(size=14,face="bold"),
        axis.text.x=element_text(size=9,face="bold",angle=90),
        axis.text.y=element_text(face="bold"),
        legend.position="none")+
  labs(x="Economic Loss in Millions",y="Event",title="Economic Losses per Event in United States")+
  scale_x_continuous(breaks=c(0,min(ecoconsi$ECODMG)/1000000,seq(625,10000,by=625)))
filter(ecoconsi,ECODMG==max(ECODMG))
## Source: local data frame [1 x 2]
##
## EVTYPE ECODMG
## 1 RIVER FLOOD 10005694000
Computer architecture: CPU AMD A-8
Operating system: Windows 8.1 PRO
Software toolchain: RStudio
Supporting software / infrastructure: R packages: dplyr, R.utils, ggplot2
Data Source: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Documentation Source: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf