Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data processing section reads the data and describes it.
I then address each question by subsetting the data and doing exploratory analysis appropriate to that question.
The full dataset contains missing values, but I did not clean them because none occur in the columns needed to answer the questions.
I have tried to answer both questions fully and to explain the steps where needed.
The libraries used are ggplot2 and dplyr.
Limitations are listed at the end.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The code below reads the data.
# Read the data
setwd('D:/Rishikesh_Data_Science/R_proggrame_coursera/Reproducible Research/Assignment_2_week_4')
data <- read.csv('repdata_data_StormData.csv.bz2')
head(data,3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
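As a side note on why the compressed file can be passed straight to read.csv(): R detects bzip2 compression when it opens a file for reading. The sketch below shows this on a small temporary file; the toy data frame and its values are made up for illustration and are not taken from the storm database.

```r
# Toy stand-in for the storm data (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 0),
                  stringsAsFactors = FALSE)

# Write it as a bzip2-compressed CSV in a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and reads the file directly
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy_back)
```

No manual decompression step is needed, which is why the read above works on the .bz2 file as downloaded.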
NA_percent <- mean(is.na(data))*100
About 5.23% of the values in the full dataset are missing, which is fairly low, but I will check how much of the missing data falls in the columns relevant to each question.
For this question I need the variables EVTYPE, FATALITIES, and INJURIES.
Missing values in these columns: 0
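The per-column check behind that count is not shown, so here is a minimal sketch of one way to do it, assuming the columns are inspected with is.na(). The toy data frame is hypothetical; only the column names come from the storm data.

```r
# Hypothetical toy data: gaps only in a column outside the subset
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 2),
  INJURIES   = c(5, 0, 1),
  COUNTY_END = c(NA, NA, 0),
  stringsAsFactors = FALSE
)

# Share of missing cells in the whole table, in percent
overall_na <- mean(is.na(toy)) * 100

# Missing values per column, restricted to the columns the question needs
subset_na <- colSums(is.na(toy[, c("EVTYPE", "FATALITIES", "INJURIES")]))

print(overall_na)
print(subset_na)
```

In this toy case the table as a whole has missing cells, yet the three columns of interest have none, which is the same situation described for the storm data.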
I use the dplyr package to prepare the data for this question. I group the data by event type and sum fatalities and injuries across all counties, which shows which events cause the most harm to human life and health compared with the others.
library(dplyr)
# subsetting the data for our use
total_health <- (sum(data$FATALITIES)+sum(data$INJURIES))
health <- data %>% select(EVTYPE, FATALITIES, INJURIES) %>%
group_by(EVTYPE) %>%
summarise(Tot_fatalities = sum(FATALITIES),
Tot_injuries = sum(INJURIES),
Total = Tot_fatalities + Tot_injuries ,
percentage = Total/total_health) %>%
arrange(desc(percentage))
head(health)
## # A tibble: 6 x 5
## EVTYPE Tot_fatalities Tot_injuries Total percentage
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979 0.623
## 2 EXCESSIVE HEAT 1903 6525 8428 0.0541
## 3 TSTM WIND 504 6957 7461 0.0479
## 4 FLOOD 470 6789 7259 0.0466
## 5 LIGHTNING 816 5230 6046 0.0388
## 6 HEAT 937 2100 3037 0.0195
# check how the totals are distributed across quantiles
quantiles <- quantile(health$Total, c(0 ,0.75, 0.80, 0.85, 0.90, 0.95, 1))
quantiles
## 0% 75% 80% 85% 90% 95% 100%
## 0.0 0.0 1.0 2.0 7.6 64.8 96979.0
From these results, tornadoes cause by far the most harm to human health, accounting for about 62.3% of all recorded fatalities and injuries. There are 985 distinct event types. The data are also heavily skewed, meaning a few event types cause far more harm than the rest, so I checked how the totals are distributed.
Note that at least 75% of event types have a total of zero, and only the top 5% have totals above roughly 65.
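Another way to look at this skew, as a rough sketch, is the cumulative share of the total covered by the largest event types. The totals below are hypothetical and only illustrate the calculation; they are not figures from the storm data.

```r
# Hypothetical totals per event type (not real storm figures)
totals <- c(TORNADO = 900, HEAT = 60, FLOOD = 30, HAIL = 10, FOG = 0, DUST = 0)

# Sort from largest to smallest and accumulate the share of the grand total
sorted    <- sort(totals, decreasing = TRUE)
cum_share <- cumsum(sorted) / sum(sorted)
print(round(cum_share, 2))

# Number of event types needed to cover at least 95% of the total
n_needed <- which(cum_share >= 0.95)[1]
print(n_needed)
```

With strongly skewed totals, a handful of event types cover nearly everything, which is what motivates grouping the remainder into a single category.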
Next I keep the largest event types from this skewed distribution and group the rest, to look for further insight.
# plot-ready data: event types below 0.9% of the total are grouped as "Other"
health <- health %>%
  mutate(Storm = if_else(percentage < 0.009, "Other", EVTYPE)) %>%
  group_by(Storm) %>%
  summarise(percentage = sum(percentage)) %>%
  arrange(percentage)
library(ggplot2)
# Pie chart: one stacked column drawn in polar coordinates
pie <- ggplot(health, aes(x = '', y = percentage, fill = Storm))
pie +
  coord_polar('y', start = 0) + geom_col(color = 'black') +
  geom_text(aes(label = paste0(round(percentage * 100), "%")),
            position = position_stack(vjust = 0.5)) +
  theme(panel.background = element_blank(),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 15)) +
  ggtitle("Share of Fatalities and Injuries by Storm Type (1950-2011)") +
  guides(fill = guide_legend(reverse = TRUE))
To answer this question I use the columns EVTYPE, PROPDMG, and CROPDMG; the latter two appear to be the cost of damage to property and crops in an event. Two further columns, PROPDMGEXP and CROPDMGEXP, give the exponent, or scale, of the damage cost in the previous columns. (Online research gave me this insight.)
First I check whether these columns contain any missing values.
Missing values in PROPDMG, CROPDMG: 0, so there are no missing values.
Next I subset the data.
# filtering the data
economics <- data %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG,
CROPDMGEXP)
# finding Exponential factor for cost
union(unique(economics$PROPDMGEXP),
unique(economics$CROPDMGEXP))
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
## [20] "k"
The analysis above identifies 20 distinct exponent codes. (Note that upper and lower case have the same value, e.g. "h" and "H" both mean 100.)
Note: EXP = exponent. Possible values of CROPDMGEXP and PROPDMGEXP are:
H, h, K, k, M, m, B, b, +, -, ?, 0, 1, 2, 3, 4, 5, 6, 7, 8, and the blank string (lower-case b does not appear in the output above)
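The mapping from exponent code to multiplier can also be written as a small lookup function. This is only a sketch under the same assumptions this analysis makes (letters as hundred, thousand, million, and billion; digits as 10; everything else as 1); it is not an official NOAA definition, and exp_multiplier is a hypothetical helper name.

```r
# Multipliers as assumed in this analysis (not an official NOAA definition)
exp_multiplier <- function(exp_code) {
  code   <- toupper(exp_code)                       # "k" and "K" are treated alike
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  out    <- unname(lookup[code])                    # NA for anything not H/K/M/B
  out[is.na(out) & grepl("^[0-9]$", code)] <- 10    # digit codes
  out[is.na(out)] <- 1                              # "+", "-", "?" and blank
  out
}

print(exp_multiplier(c("K", "m", "5", "+", "")))
```

A damage value would then be converted with something like PROPDMG * exp_multiplier(PROPDMGEXP), which keeps the mapping in one place for both the property and the crop columns.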
# first convert PROPDMG using its exponent code
economics <- economics %>%
  mutate(PROPDMG = case_when(
    grepl("[Hh]", PROPDMGEXP)  ~ PROPDMG * 100,  # multiply by hundred
    grepl("[Kk]", PROPDMGEXP)  ~ PROPDMG * 1e3,  # multiply by thousand
    grepl("[Mm]", PROPDMGEXP)  ~ PROPDMG * 1e6,  # multiply by million
    grepl("[Bb]", PROPDMGEXP)  ~ PROPDMG * 1e9,  # multiply by billion
    grepl("[0-9]", PROPDMGEXP) ~ PROPDMG * 10,   # digit codes: multiply by ten
    TRUE                       ~ PROPDMG         # "+", "-", "?" and blank: unchanged
  ))
# now convert CROPDMG the same way
economics <- economics %>%
  mutate(CROPDMG = case_when(
    grepl("[Hh]", CROPDMGEXP)  ~ CROPDMG * 100,  # multiply by hundred
    grepl("[Kk]", CROPDMGEXP)  ~ CROPDMG * 1e3,  # multiply by thousand
    grepl("[Mm]", CROPDMGEXP)  ~ CROPDMG * 1e6,  # multiply by million
    grepl("[Bb]", CROPDMGEXP)  ~ CROPDMG * 1e9,  # multiply by billion
    grepl("[0-9]", CROPDMGEXP) ~ CROPDMG * 10,   # digit codes: multiply by ten
    TRUE                       ~ CROPDMG         # "+", "-", "?" and blank: unchanged
  ))
head(economics)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 25000 K 0
## 2 TORNADO 2500 K 0
## 3 TORNADO 25000 K 0
## 4 TORNADO 2500 K 0
## 5 TORNADO 2500 K 0
## 6 TORNADO 2500 K 0
The data are now cleaned. Digit codes (0 to 8) are treated as a multiplier of 10. The codes "+", "?", "-", and the blank string are treated as a multiplier of 1 (an exponent of 0), so those values are left unchanged; this is the final branch of the conversion.
Now I transform the data for plotting and for answering the question.
# plot ready
economics <- economics %>%
group_by(EVTYPE) %>%
summarise(Total_propdmg = sum(PROPDMG),
Total_cropdmg = sum(CROPDMG),
Total = Total_cropdmg + Total_propdmg) %>%
arrange(desc(Total))
#exploring data
summary(economics$Total_propdmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 4.338e+08 5.105e+04 1.447e+11
summary(economics$Total_cropdmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 4.985e+07 0.000e+00 1.397e+10
summary(economics$Total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 4.837e+08 8.500e+04 1.503e+11
From the summaries above, most of the values are zero, since the median is zero, and the largest values lie in the top quartile.
I will plot the 10 event types with the highest cost.
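# economics is already sorted by Total (descending), so row 11 is the cutoff:
# the top 10 event types lie above it, everything else becomes 'Other'
top <- economics$Total[11]
economics <- economics %>%
  mutate(plot = if_else(Total > top, EVTYPE, 'Other')) %>%
  group_by(plot) %>%
  summarise(Total = sum(Total)) %>%
  arrange(desc(Total))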
event <- economics$plot
total<- economics$Total
total <- total/1e6 # In millions
# bar plot
p <- ggplot() + geom_bar(aes(
x = reorder(event, total),
y = total,
fill = total
),
stat = "identity",
show.legend = FALSE)
p + ggtitle("Storm impact on the US economy (1950-2011)") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(
plot.title = element_text(size = 15)
) +
xlab("") + ylab("Cost of Damage ( in million US $)") +
coord_flip() +
scale_fill_gradient(low = "blue", high = "red")
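The "top 10 plus Other" grouping used for this plot relies on a cutoff taken from the 11th row. A rank-based variant is sketched below in base R as one possible alternative; the event totals and the choice of n_top = 3 are hypothetical and only keep the example short.

```r
# Hypothetical totals per event type (not real storm figures)
totals <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE", "TORNADO", "HAIL", "FOG"),
  Total  = c(50, 30, 10, 4, 1),
  stringsAsFactors = FALSE
)

n_top <- 3  # number of event types to keep by name

# Rank from largest to smallest; ties are broken by order of appearance
totals$rank  <- rank(-totals$Total, ties.method = "first")
totals$group <- ifelse(totals$rank <= n_top, totals$EVTYPE, "Other")

# Sum within each group and sort from largest to smallest
lumped <- aggregate(Total ~ group, data = totals, FUN = sum)
lumped <- lumped[order(-lumped$Total), ]
print(lumped)
```

Ranking selects exactly n_top named groups even when two totals are equal, whereas a strict "greater than the next value" cutoff can drop tied entries into "Other".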
I answered the questions above and conclude:
- Human health (deaths and injuries) is most affected by tornadoes.
- Between 1950 and 2011, tornadoes caused 96,979 fatalities and injuries, about 62% of the total across all event types.
- The other top causes were excessive heat, flood, and TSTM (thunderstorm) wind.
- Floods caused the greatest economic damage (property and crops).
- Flood damage cost the US economy around 150.3 billion dollars.
In the first analysis I did not merge the multiple names used for the same type of event (storm); since tornadoes account for about 62%, more than half of the total, merging would not have changed the result.
For the meaning of the exponent codes I relied on an online article, which I assume is correct.
For the second question I plotted only the top 10 event types, because all 985 would be hard to show on one graph. Note that "Other" contains every event type outside the top ten.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. (Coursera question note)
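On the first limitation, merging variant names for the same event type could start with simple text normalisation. The sketch below is illustrative only: the spellings in ev are hypothetical examples, and the single substitution rule (TSTM to THUNDERSTORM) is an assumption, not a complete cleaning scheme.

```r
# Hypothetical variant spellings of the same event type
ev <- c("TSTM WIND", "Thunderstorm Wind", " THUNDERSTORM WIND", "FLOOD")

# Normalise case and surrounding whitespace, then expand one known abbreviation
clean <- toupper(trimws(ev))
clean <- sub("^TSTM", "THUNDERSTORM", clean)

print(table(clean))
```

Applying a step like this before grouping by event type would reduce the number of distinct types and combine totals that are currently split across spellings.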