The purpose of this report is to analyze storm data in order to quantify the impact of storm events on population health and on the economy. The data come from the US National Oceanic and Atmospheric Administration’s (NOAA) storm database. The effect on population health is measured as the total number of fatalities and injuries caused by each type of storm event. Similarly, the effect on the economy is measured as the damage to property and crops.
To quantify these effects, we address the following two questions.
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The approach for each question is outlined below. A step common to both questions, and to any data analysis report, is obtaining the data. To ensure reproducibility, that step is included in this report. The analysis begins by downloading, extracting and loading the data from the official URL given on the Coursera website. Once the data is loaded, it can be cleaned up and the analyses can be performed.
The approach for question 1 has been detailed below:
Step 1: Calculate the sum of fatalities grouped by event type
Step 2: Calculate the sum of injuries grouped by event type
Step 3: Subset both the data sets mentioned above to get the TOP 15 ABOVE AVERAGE values
Step 4: Stack the data sets
Step 5: Plot the data and draw appropriate inferences
Similarly, the approach for question 2 has been outlined below:
Step 1: Calculate the sum of property damage grouped by event type
Step 2: Calculate the sum of crop damage grouped by event type
Step 3: Subset both the data sets mentioned above to get the TOP 15 ABOVE AVERAGE values
Step 4: Stack the data sets
Step 5: Plot the data and draw appropriate inferences
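The first four steps listed above can be tried on a small made-up data frame before running them on the full data set. This is only a sketch: the helper name `top.above.average`, its arguments and the toy values are assumptions introduced for illustration, and it uses base R `aggregate` where the report itself uses `ddply` from `plyr`.

```r
# Hypothetical helper: total one numeric column per event type, keep the
# totals above the mean, then return at most `n` of them, largest first.
top.above.average <- function(data, column, label, n = 15) {
  # Steps 1 and 2: sum the column for each event type
  totals <- aggregate(data[[column]], by = list(EVTYPE = data$EVTYPE),
                      FUN = sum, na.rm = TRUE)
  names(totals)[2] <- "count"
  totals$type <- label
  # Step 3: keep above-average totals, then the n largest of those
  above <- totals[totals$count > mean(totals$count), ]
  head(above[order(-above$count), ], n)
}

# Toy data, invented for this example
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "HAIL", "FLOOD"),
  FATALITIES = c(10, 5, 8, 0, 1),
  INJURIES   = c(100, 50, 20, 2, 4)
)

# Step 4: stack the two summaries so they can be plotted in facets
toy.impact <- rbind(top.above.average(toy, "FATALITIES", "Fatality"),
                    top.above.average(toy, "INJURIES", "Injury"))
print(toy.impact)
```

Note that the above-average filter runs before the top-n cut, so when the totals are heavily skewed fewer than `n` rows may survive for a given measure.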
Let us load the data from the official URL mentioned on the Coursera site. The data is downloaded and stored in a variable named storm.data.
##### Clear the workspace
rm(list = ls())
##### URL of the bz2 file
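storm.data.url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm.data.file <- tempfile()
##### Download in binary mode so the compressed file is transferred byte for byte
##### (setInternet2(TRUE) was only relevant to old Windows builds of R and is omitted)
download.file(storm.data.url, storm.data.file, mode = "wb")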
## Warning: downloaded length 34107392 != reported length 49177144
##### Read the file into a data frame
storm.data <- read.csv ( storm.data.file )
## Warning: EOF within quoted string
##### Load the libraries needed for the analysis
library(ggplot2)
library(plyr)
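After this code chunk runs, the data is available in the storm.data variable. This variable is not modified and serves as the data source for the analysis below. The two warnings shown above suggest that the download may have been cut short, so if they appear it is worth re-running the chunk before continuing.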
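Because `tempfile()` returns a path without a `.bz2` extension, the loading code relies on `read.csv` recognising the bzip2 compression from the file contents. The sketch below reproduces that behaviour on a tiny made-up file and adds a basic check that the columns used later are present. The toy values and the `required.columns` vector are assumptions for illustration, not part of the original analysis.

```r
# Toy data standing in for the storm data (values invented)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL"),
  FATALITIES = c(1, 0),
  INJURIES   = c(3, 0),
  PROPDMG    = c(25, 5),
  CROPDMG    = c(0, 10)
)

# Write a bzip2-compressed CSV to a temp file that has no .bz2 extension
toy.file <- tempfile()
con <- bzfile(toy.file, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the path with file(), which detects the compression
toy.read <- read.csv(toy.file, stringsAsFactors = FALSE)

# Columns the analysis depends on (hypothetical sanity check)
required.columns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")
missing.columns <- setdiff(required.columns, names(toy.read))
stopifnot(length(missing.columns) == 0, nrow(toy.read) == nrow(toy))
```

On the real download, the same column check could be followed by a look at `nrow(storm.data)` to confirm that the whole file was read.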
The results for each question and the corresponding analysis are detailed below.
The approach followed for question 1 has been described above. Let us work through each step in turn.
Step 1: Calculate the sum of fatalities grouped by event type
fatality.impact <- ddply(storm.data, c("EVTYPE"),
                         summarise, count = sum(FATALITIES),
                         type = "Fatality")
Step 2: Calculate the sum of injuries grouped by event type
injury.impact <- ddply(storm.data, c("EVTYPE"),
                       summarise, count = sum(INJURIES),
                       type = "Injury")
Step 3: Subset both the data sets mentioned above to get the TOP 15 ABOVE AVERAGE values
##### Calculate the mean values
fatality.mean <- mean(fatality.impact$count, na.rm = TRUE)
injury.mean <- mean(injury.impact$count, na.rm = TRUE)
##### Get the above average values
fatality.impact.subset <-
    fatality.impact[fatality.impact$count > fatality.mean, ]
##### Get the top 15 values
fatality.impact.subset <-
    head(fatality.impact.subset[order(-fatality.impact.subset$count), ], 15)
##### Get the above average values
injury.impact.subset <-
    injury.impact[injury.impact$count > injury.mean, ]
##### Get the top 15 values
injury.impact.subset <-
    head(injury.impact.subset[order(-injury.impact.subset$count), ], 15)
Step 4: Stack the data sets
pop.health.impact <- rbind(fatality.impact,
                           injury.impact)
pop.health.impact.subset <- rbind(fatality.impact.subset,
                                  injury.impact.subset)
Step 5: Plot the data and draw appropriate inferences
ggplot(data = pop.health.impact.subset,
       aes(x = EVTYPE, y = count)) +
    geom_bar(stat = "identity", width = 1,
             fill = "red3", color = "white") +
    facet_grid(type ~ ., scales = "free_y") +
    labs(x = "Types of Events",
         y = "People Affected",
         title = "Impact of Events on Population Health") +
    theme(axis.text.x  = element_text(angle = -90, vjust = 0.5),
          axis.title.x = element_text(face = "bold", size = 18),
          axis.title.y = element_text(face = "bold", size = 18, vjust = 1),
          plot.title   = element_text(face = "bold", size = 22, vjust = 2),
          strip.text.y = element_text(face = "bold", size = 10))
As observed in the plot, we now have the top 15 event types that caused above-average fatalities and injuries. The plot leads to the following answer for question 1.
Answer: The event type that causes the most harm to population health is the tornado. It causes more fatalities (approximately 5633) and more injuries (approximately 91346) than any other event type.
After the tornado, no single event type ranks second for both fatalities and injuries.
Hence, as far as fatalities are concerned, the most harmful events after the tornado are Excessive Heat, Flash Flood, Heat and Lightning, in that order.
On the other hand, the events that cause the most injuries after the tornado are Excessive Heat, Flood, Lightning and TSTM Wind.
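Reading a ranking off a bar chart whose x-axis follows the order of `EVTYPE` rather than the counts is error-prone, so the order can also be checked in tabular form. The sketch below assumes a stacked data frame shaped like `pop.health.impact.subset` (columns `EVTYPE`, `count` and `type`); the values are invented and the helper name `rank.within.type` is hypothetical.

```r
# Toy stand-in for pop.health.impact.subset (values invented)
toy.subset <- data.frame(
  EVTYPE = c("HEAT", "TORNADO", "FLOOD", "FLOOD", "TORNADO", "LIGHTNING"),
  count  = c(8, 15, 3, 40, 150, 25),
  type   = c("Fatality", "Fatality", "Fatality", "Injury", "Injury", "Injury"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each impact type, list the event types
# from the largest count to the smallest, keeping at most n of them
rank.within.type <- function(impact, n = 5) {
  lapply(split(impact, impact$type), function(part) {
    head(as.character(part$EVTYPE)[order(-part$count)], n)
  })
}

toy.ranking <- rank.within.type(toy.subset)
print(toy.ranking)
```

Applied to the real subset, a listing like this would give the order of the events for each measure directly, without reading bar heights.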
The approach followed for question 2 has been described above. Let us work through each step in turn.
Step 1: Calculate the sum of property damage grouped by event type
property.damage <- ddply(storm.data, c("EVTYPE"),
                         summarise, count = sum(PROPDMG),
                         type = "Property")
Step 2: Calculate the sum of crop damage grouped by event type
crop.damage <- ddply(storm.data, c("EVTYPE"),
                     summarise, count = sum(CROPDMG),
                     type = "Crop")
Step 3: Subset both the data sets mentioned above to get the TOP 15 ABOVE AVERAGE values
##### Calculate the mean values
property.damage.mean <- mean(property.damage$count, na.rm = TRUE)
crop.damage.mean <- mean(crop.damage$count, na.rm = TRUE)
##### Get the above average values
property.damage.subset <-
    property.damage[property.damage$count > property.damage.mean, ]
##### Get the top 15 values
property.damage.subset <-
    head(property.damage.subset[order(-property.damage.subset$count), ], 15)
##### Get the above average values
crop.damage.subset <-
    crop.damage[crop.damage$count > crop.damage.mean, ]
##### Get the top 15 values
crop.damage.subset <-
    head(crop.damage.subset[order(-crop.damage.subset$count), ], 15)
Step 4: Stack the data sets
economic.impact <- rbind(property.damage,
                         crop.damage)
economic.impact.subset <- rbind(property.damage.subset,
                                crop.damage.subset)
Step 5: Plot the data and draw appropriate inferences
ggplot(data = economic.impact.subset,
       aes(x = EVTYPE, y = count)) +
    geom_bar(stat = "identity", width = 1,
             fill = "red3", color = "white") +
    facet_grid(type ~ ., scales = "free_y") +
    labs(x = "Types of Events",
         y = "Damage",
         title = "Impact of Events on Economy") +
    theme(axis.text.x  = element_text(angle = -90, vjust = 0.5),
          axis.title.x = element_text(face = "bold", size = 18),
          axis.title.y = element_text(face = "bold", size = 18, vjust = 1),
          plot.title   = element_text(face = "bold", size = 22, vjust = 2),
          strip.text.y = element_text(face = "bold", size = 10))
As observed in the plot, we now have the top 15 event types that caused above-average property and crop damage. The plot leads to the following answer for question 2.
Answer:
No single event type causes both the most crop damage and the most property damage.
The event that causes the most property damage is the Tornado (approximately 3212258 units, summed from the PROPDMG column). It is followed by Flash Flood, TSTM Wind, Flood and Thunderstorm Wind.
The event that causes the most crop damage is Hail (approximately 579596 units, summed from the CROPDMG column). It is followed by Flash Flood, Flood and TSTM Wind.
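The totals above are sums of the raw `PROPDMG` and `CROPDMG` fields, so they are not necessarily dollar amounts. The sketch below shows one way such values could be scaled, under the assumption that the data set's `PROPDMGEXP` and `CROPDMGEXP` columns hold magnitude codes in which K, M and B stand for thousands, millions and billions. Treating every other code as a multiplier of 1 is a simplification made here, and the helper `damage.in.dollars` and the toy rows are hypothetical.

```r
# Hypothetical helper: scale a damage value by its magnitude code.
# Assumption: K = 1e3, M = 1e6, B = 1e9; any other code leaves the value as-is.
damage.in.dollars <- function(value, exponent.code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  scale <- multipliers[toupper(as.character(exponent.code))]
  scale[is.na(scale)] <- 1   # unknown or empty codes: multiplier of 1
  value * unname(scale)
}

# Toy rows, invented for this example
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "FLOOD"),
  PROPDMG    = c(25, 500, 1.5, 10),
  PROPDMGEXP = c("K", "K", "M", ""),
  stringsAsFactors = FALSE
)

toy$property.dollars <- damage.in.dollars(toy$PROPDMG, toy$PROPDMGEXP)
toy.totals <- aggregate(property.dollars ~ EVTYPE, data = toy, FUN = sum)
print(toy.totals[order(-toy.totals$property.dollars), ])
```

In this toy example the raw `PROPDMG` sums would rank HAIL first (500), whereas the scaled sums rank FLOOD first, which illustrates why the choice of unit can matter when comparing event types.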