The project aims to identify the natural events with the greatest impact on health and the economy in the US. The first aim was operationalized as the top 10 events in the dataset causing the most injuries and fatalities, while the second was operationalized as the top 10 costliest events in terms of property and crop damage.
I already have the .csv file required for my analysis in my working directory, so I load the data from the csv file into R. I had to unzip the file manually before loading it, because the bz2 file took a very long time to load.
data <- read.csv("repdata_data_StormData.csv")
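For readers who would rather skip the manual unzip step, here is a minimal sketch of reading a bzip2-compressed CSV directly. It assumes nothing about the real file: the toy data frame and the temporary file are made up for illustration, and loading the full storm file this way may still be slow.

```r
# Sketch: round-trip a tiny data frame through a .csv.bz2 file.
# The toy values and the temporary file name are illustrative only.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0))

tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")          # write through a bzip2 connection
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the compression,
# so no explicit decompression step is needed.
roundtrip <- read.csv(tmp, stringsAsFactors = FALSE)
print(roundtrip)
```

The same call pattern would apply to the compressed storm file if it were kept in the working directory.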
Next, we subset the content needed to answer both questions, which also makes processing faster.
#make a new dataset with a column that contains only the unique event types
smallerdata <- data.frame(EVS=unique(data$EVTYPE))
#add a column that is the sum of all fatalities caused by the event
smallerdata$FATALITIES <- tapply(data$FATALITIES, data$EVTYPE , sum)[smallerdata$EVS]
#add a column that is the sum of all injuries caused by the event
smallerdata$INJ <- tapply(data$INJURIES, data$EVTYPE , sum)[smallerdata$EVS]
#add a column that is the sum of injuries and fatalities caused by the event
smallerdata$totalhealthissues <- smallerdata$INJ + smallerdata$FATALITIES
#add a column that is the sum of all property damage values caused by the event, divided by 1000 (the EXP multiplier column is ignored)
smallerdata$PROPDMG <- (tapply(data$PROPDMG, data$EVTYPE , sum)[smallerdata$EVS])/1000
#add a column that is the sum of all crop damage values caused by the event, divided by 1000 (the EXP multiplier column is ignored)
smallerdata$CROPDMG <- (tapply(data$CROPDMG, data$EVTYPE , sum)[smallerdata$EVS])/1000
#add a column that is the sum of property damage costs and crop damage costs caused by the event (in Thousand $)
smallerdata$totalcosts <- smallerdata$PROPDMG + smallerdata$CROPDMG
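The per-event sums built above with `tapply` can also be produced in one call with base R's `aggregate`. The sketch below is an alternative on made-up toy data, not the code used for the results in this report; the data frame `toy` and the name `per_event` are hypothetical.

```r
# Toy stand-in for the storm data; the values are made up.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(2, 0, 1, 4, 0),
  INJURIES   = c(10, 1, 5, 3, 0),
  stringsAsFactors = FALSE
)

# One row per event type, with both columns summed in a single call.
per_event <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                       data = toy, FUN = sum)

# Combined health impact, mirroring the totalhealthissues column.
per_event$totalhealthissues <- per_event$FATALITIES + per_event$INJURIES
print(per_event)
```

The result has one row per event type and keeps the event name as an ordinary column, so no indexing by name is needed afterwards.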
Now that we have extracted our dataset, let's start analyzing it. A boxplot of the number of fatalities and injuries per event shows some very influential outliers, without which the dataset would have relatively low dispersion.
boxplot (smallerdata$FATALITIES, smallerdata$INJ)
Since the outliers do not differ much, I ignore them and move forward with my analysis.
Let's extract the top 10 most fatal events along with the top 10 most injurious events.
#loading the required libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.1
library(reshape)
## Warning: package 'reshape' was built under R version 3.6.1
#returns the top 10 most fatal events
mostfatal <- head(smallerdata[order(smallerdata$FATALITIES, decreasing= TRUE), ], 10)
#returns the top 10 most injurious events
mostinj <- head(smallerdata[order(smallerdata$INJ, decreasing= TRUE), ], 10)
#returns the top 10 most impactful events on health (i.e. fatalities + injuries)
mosttotalhealthissue <- head(smallerdata[order(smallerdata$totalhealthissues, decreasing= TRUE), ], 10)
#ordering the events according to the total health impact (i.e. injuries + fatalities) which is more intuitive
mosttotalhealthissue$EVS <- factor(mosttotalhealthissue$EVS, levels = mosttotalhealthissue$EVS[order(mosttotalhealthissue$totalhealthissues, decreasing= TRUE)])
To answer the health question, we take the intersection of the most fatal and the most injurious events in the USA.
intrsctmostinjandfatal <- intersect(mostfatal[,1], mostinj[,1])
intrsctmostinjandfatal
## [1] "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD" "HEAT"
## [5] "LIGHTNING" "TSTM WIND" "FLOOD"
meltedmostfatalinj <- melt(mosttotalhealthissue[mosttotalhealthissue$EVS %in% intrsctmostinjandfatal,], id.vars =c("EVS", "PROPDMG", "CROPDMG", "totalcosts"))
ggplot(meltedmostfatalinj, aes(EVS, value)) + geom_bar(aes(fill = variable), position = "dodge", stat="identity")+
ggtitle(label = "Top Weather Events Impacting Human Health in the USA") +
xlab("Event") + ylab ("Affected # of People") +
scale_fill_discrete(name = "Variable", labels = c("Fatalities", "Injuries", "Total(Fatalities + Injuries)"))
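The selection logic behind this plot (rank by each measure, keep the top entries, then intersect the two lists) can be checked on a small example. The following is a sketch on made-up data; `toy`, `top_n`, and the event names A to D are hypothetical.

```r
# Toy per-event table; the values are made up.
toy <- data.frame(
  EVS        = c("A", "B", "C", "D"),
  FATALITIES = c(50, 5, 30, 1),
  INJ        = c(10, 90, 60, 2),
  stringsAsFactors = FALSE
)

top_n <- 2  # the report uses 10; 2 keeps the toy example readable

# Top events by fatalities: A, then C.
most_fatal <- head(toy[order(toy$FATALITIES, decreasing = TRUE), ], top_n)
# Top events by injuries: B, then C.
most_inj <- head(toy[order(toy$INJ, decreasing = TRUE), ], top_n)

# Only events that appear on both lists survive.
both <- intersect(most_fatal$EVS, most_inj$EVS)
print(both)
```

An event that ranks highly on only one measure, such as A or B here, is dropped by the intersection, which is why the plot shows fewer than 10 events.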
To answer the economic question, we extract the top 10 most economically costly events, i.e. events with high crop damage and property damage costs.
#returns top 10 most costly events
mosteconomicalcost <- head(smallerdata[order(smallerdata$totalcosts, decreasing= TRUE), ], 10)
mosteconomicalcost
## EVS FATALITIES INJ totalhealthissues PROPDMG
## 1 TORNADO 5633 91346 96979 3212.2582
## 20 FLASH FLOOD 978 1777 2755 1420.1246
## 2 TSTM WIND 504 6957 7461 1335.9656
## 3 HAIL 15 1361 1376 688.6934
## 36 FLOOD 470 6789 7259 899.9385
## 16 THUNDERSTORM WIND 133 1488 1621 876.8442
## 15 LIGHTNING 816 5230 6046 603.3518
## 10 THUNDERSTORM WINDS 64 908 972 446.2932
## 46 HIGH WIND 248 1137 1385 324.7316
## 8 WINTER STORM 206 1321 1527 132.7206
## CROPDMG totalcosts
## 1 100.01852 3312.2767
## 20 179.20046 1599.3251
## 2 109.20260 1445.1682
## 3 579.59628 1268.2897
## 36 168.03788 1067.9764
## 16 66.79145 943.6356
## 15 3.58061 606.9324
## 10 18.68493 464.9781
## 46 17.28321 342.0148
## 8 1.97899 134.6996
#make sure of proper ordering
mosteconomicalcost$EVS <- factor(mosteconomicalcost$EVS, levels =
mosteconomicalcost$EVS[order(mosteconomicalcost$totalcosts, decreasing= TRUE)])
#prepares the latter dataset for ggplotting
meltedmostecocost <- melt(mosteconomicalcost, id.vars =c("EVS", "FATALITIES", "INJ",
"totalhealthissues"))
#again, make sure of tidy ordering
meltedmostecocost$variable <- factor(meltedmostecocost$variable, levels = c("CROPDMG", "PROPDMG", "totalcosts"))
#gg barplotting
ggplot(meltedmostecocost, aes(EVS, value)) +
geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
ggtitle(label = "Top 10 Weather Events Impacting Economy in the USA") +
xlab("Event") + ylab ("Thousand US $ (Ignoring the EXPonential variables)") +
scale_fill_discrete(name = "Variable", labels = c("Crop Losses", "Property Losses", "Total Losses(Crop + Property)"))
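The axis label notes that the exponent variables were ignored. As a hedged sketch of how they could be taken into account, the example below assumes the damage figure is stored as a number (`PROPDMG`) plus a magnitude code in a column named `PROPDMGEXP`, with K, M, and B standing for thousands, millions, and billions. That mapping, the handling of other codes, and the toy values are all assumptions, and this is not the method used for the figures above.

```r
# Toy rows; the values are made up.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL", "FLOOD"),
  PROPDMG    = c(2.5, 1.0, 500, 25),
  PROPDMGEXP = c("M", "B", "K", ""),
  stringsAsFactors = FALSE
)

# Assumed mapping from magnitude code to multiplier.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each row's multiplier; unknown or empty codes give NA.
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])
# Simplification: treat any unrecognized code as a multiplier of 1.
mult[is.na(mult)] <- 1

# Damage in dollars per row, then summed per event type.
toy$prop_dollars <- toy$PROPDMG * mult
prop_by_event <- tapply(toy$prop_dollars, toy$EVTYPE, sum)
print(prop_by_event)
```

Applying multipliers like this before summing could change the ranking of the costliest events, since a single billion-scale record can outweigh many thousand-scale ones.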