Huiting Ma
This homework will focus on the following parts:
This homework is to analyze the Natural Disasters around the world from the year 2000 to 2013. The natural disasters include earthquake, volcano and mass movement. My main focus is to identify which continents or countries have more natural disasters and whether the number of death, injuries, homeless and the population affected are correlated with the damaged amount. The dataset I am going to use is from EM-DAT, which is the International Disaster Database here.
The definition of all variables can be found in the above website, which are:
NumDisaster A unique disaster number for each event Country Country (ies) in which the disaster has occurredYear When the disaster occurred. NumKilled Persons confirmed as dead and persons missing and presumed dead (official figures when available)NumInjured People suffering from physical injuries, trauma or an illness requiring medical treatment as a direct result of a disasterNumHomeless People needing immediate assistance for shelterNumAffected People requiring immediate assistance during a period of emergency; it can also include displaced or evacuated peopleTotalDamUSD Several institutions have developed methodologies to quantify these losses in their specific domain. However, there is no standard procedure to determine a global figure for economic impact. Estimated damages are given (000') US$library(plyr)
library(xtable)
library(ggplot2)
library(reshape2)
Supercheck whether data has imported correctly
NaturalDisaster <- read.csv("Disaster.csv")
str(NaturalDisaster)
## 'data.frame': 1648 obs. of 9 variables:
## $ Country : Factor w/ 188 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
## $ NumDisaster: int 6 6 16 9 3 14 13 7 4 5 ...
## $ NumKilled : int 594 485 4083 137 18 582 382 296 1334 101 ...
## $ NumInjured : int 0 20 1391 4 40 44 185 20 182 86 ...
## $ NumAffected: int 2582228 204425 302279 500 2800 37901 2225515 26755 452602 62521 ...
## $ NumHomeless: int 0 250 10000 4250 2700 6775 8210 3480 180 3250 ...
## $ TotalDamUSD: int 50 10 0 0 0 5050 0 0 0 20000 ...
## $ Continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
Learn from Jennie, to define a function for converting and printing to HTML table.
htmlPrint <- function(x, ...,
digits = 0, include.rownames = FALSE) {
print(xtable(x, digits = digits, ...), type = 'html',
include.rownames = include.rownames, ...)
}
Try to count the number of natural disaster over time on different continents
numCountByYear <- daply(NaturalDisaster,~Year +
Continent, summarize,
TotalCount = sum(NumDisaster))
numCountByYear <- as.data.frame(numCountByYear)
htmlPrint(numCountByYear)
| Africa | Americas | Asia | Europe | Oceania |
|---|---|---|---|---|
| 122 | 101 | 195 | 95 | 11 |
| 114 | 97 | 164 | 53 | 19 |
| 111 | 116 | 187 | 90 | 20 |
| 83 | 89 | 146 | 51 | 17 |
| 85 | 89 | 169 | 39 | 19 |
| 100 | 95 | 186 | 99 | 14 |
| 118 | 73 | 183 | 51 | 18 |
| 108 | 103 | 159 | 66 | 9 |
| 104 | 96 | 147 | 33 | 12 |
| 93 | 84 | 142 | 42 | 19 |
| 100 | 109 | 142 | 70 | 15 |
| 76 | 97 | 150 | 18 | 11 |
| 68 | 85 | 147 | 64 | 11 |
| 19 | 16 | 48 | 11 | 5 |
Seems 2013 does not have a lot of observations, Let's drop 2013.
NDisaster <- droplevels(subset(NaturalDisaster,Year != "2013"))
table(NDisaster$Year) #Check whether 2013 has dropped
##
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 128 120 128 118 122 139 106 134 121 114 135 105 120
Try to count the number of countries in each continents
numCountries <- ddply(NDisaster, ~Continent, summarize, numCoutries = length(unique(Country)))
htmlPrint(numCountries)
| Continent | numCoutries |
|---|---|
| Africa | 50 |
| Americas | 37 |
| Asia | 45 |
| Europe | 40 |
| Oceania | 16 |
Now, let us try to plot a graph and visualize the results
ggplot(NaturalDisaster, aes(x = Year, y = NumDisaster, color = Year)) +
geom_jitter() + facet_wrap(~ Continent) +
ggtitle("How is Number Disasters Changing over Time on Diffferent Continents")
Based on the graph, it is found that “Oceania” does not have a lot of data, drop it!
NDisaster <- droplevels(subset(NaturalDisaster,Continent != "Oceania"))
table(NDisaster$Continent) #Check whether "Oceania" has dropped
##
## Africa Americas Asia Europe
## 501 339 399 316
Also, I found there are some missing data for Number of affected. Let's drop these data!
NDisaster <- droplevels(subset(NDisaster,NumAffected != "0"))
str(NDisaster)
## 'data.frame': 1280 obs. of 9 variables:
## $ Country : Factor w/ 165 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
## $ NumDisaster: int 6 6 16 9 3 14 13 7 4 5 ...
## $ NumKilled : int 594 485 4083 137 18 582 382 296 1334 101 ...
## $ NumInjured : int 0 20 1391 4 40 44 185 20 182 86 ...
## $ NumAffected: int 2582228 204425 302279 500 2800 37901 2225515 26755 452602 62521 ...
## $ NumHomeless: int 0 250 10000 4250 2700 6775 8210 3480 180 3250 ...
## $ TotalDamUSD: int 50 10 0 0 0 5050 0 0 0 20000 ...
## $ Continent : Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
Since there also a lot of missing data for Total damage, let's drop these values.
Disaster <- droplevels(subset(NDisaster,TotalDamUSD != "0"))
str(Disaster)
## 'data.frame': 561 obs. of 9 variables:
## $ Country : Factor w/ 121 levels "Afghanistan",..: 1 1 1 1 1 2 2 3 3 3 ...
## $ Year : int 2000 2001 2005 2009 2011 2002 2004 2001 2002 2005 ...
## $ NumDisaster: int 6 6 14 5 4 3 2 1 3 4 ...
## $ NumKilled : int 594 485 582 101 83 7 3 921 48 34 ...
## $ NumInjured : int 0 20 44 86 115 0 0 423 20 5 ...
## $ NumAffected: int 2582228 204425 37901 62521 1753000 192110 2500 45000 2285 70 ...
## $ NumHomeless: int 0 250 6775 3250 9700 0 0 0 0 1750 ...
## $ TotalDamUSD: int 50 10 5050 20000 142000 17500 173 300000 1500 7256 ...
## $ Continent : Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 4 4 1 1 1 ...
Since I want to fit linear regression model in the second script, it is necessary to make sure all countries should have multiple observations. I will delete the countries with only 2 or less observations.
NumObsCountry <- ddply(Disaster, ~Country, summarize, numobs = length(Year))
CountryDrop <- droplevels(subset(NumObsCountry,!(numobs > 2)))
Disaster <- droplevels(subset(Disaster,
!(Country %in% CountryDrop$Country)))
DropNumObsCountry <- ddply(Disaster, ~Country, summarize, numobs = length(Year))
htmlPrint(DropNumObsCountry)
| Country | numobs |
|---|---|
| Afghanistan | 5 |
| Algeria | 7 |
| Argentina | 7 |
| Bangladesh | 4 |
| Belize | 4 |
| Bolivia | 7 |
| Bosnia-Hercegovenia | 3 |
| Brazil | 11 |
| Bulgaria | 3 |
| Cambodia | 5 |
| Canada | 7 |
| Chile | 9 |
| China P Rep | 14 |
| Colombia | 5 |
| Costa Rica | 6 |
| Cuba | 6 |
| Czech Rep | 3 |
| Dominican Rep | 6 |
| Ecuador | 5 |
| El Salvador | 5 |
| France | 8 |
| Georgia | 3 |
| Germany | 3 |
| Greece | 4 |
| Guatemala | 6 |
| Haiti | 6 |
| Honduras | 8 |
| Hungary | 5 |
| India | 13 |
| Indonesia | 14 |
| Iran Islam Rep | 9 |
| Italy | 6 |
| Jamaica | 6 |
| Japan | 12 |
| Kazakhstan | 5 |
| Kenya | 5 |
| Korea Dem P Rep | 6 |
| Korea Rep | 9 |
| Madagascar | 8 |
| Malaysia | 4 |
| Mexico | 13 |
| Moldova Rep | 4 |
| Mongolia | 3 |
| Morocco | 4 |
| Mozambique | 6 |
| Myanmar | 5 |
| Nepal | 5 |
| Nicaragua | 3 |
| Nigeria | 7 |
| Pakistan | 8 |
| Panama | 3 |
| Paraguay | 3 |
| Peru | 3 |
| Philippines | 14 |
| Poland | 5 |
| Puerto Rico | 3 |
| Romania | 4 |
| Russia | 11 |
| South Africa | 7 |
| Spain | 9 |
| Sri Lanka | 8 |
| Sudan | 3 |
| Switzerland | 4 |
| Taiwan (China) | 9 |
| Tajikistan | 11 |
| Thailand | 11 |
| Turkey | 7 |
| Ukraine | 7 |
| United Kingdom | 8 |
| United States | 14 |
| Venezuela | 5 |
| Viet Nam | 13 |
| Zimbabwe | 4 |
Yes, countries with only few observations have been dropped. The last thing I want to do is to reorder continents based on the total nunmber of disasters happened.
Disaster <- within(Disaster, Continent <- reorder(Continent, NumDisaster, sum))
Disaster <- arrange(Disaster,Continent)
In this stage, dataset has been sucessfully cleaned to the format that I want. Let us get into Exploratory Data Analysis stage.
First, Learn for JB, for each country, write stripplots to file to compare the number of disasters vs. year
Based on the graphs above, I did not find there is an obvious relationship between the number of disasters and year in different countries.
Second, try to get the spread of death within the continents.
spreaddeath <- ddply(Disaster, ~ Continent, summarize,
sdNumKilled = sd(NumKilled),
iqrNumKilled = IQR(NumKilled))
htmlPrint(arrange(spreaddeath, sdNumKilled))
| Continent | sdNumKilled | iqrNumKilled |
|---|---|---|
| Africa | 264 | 147 |
| Europe | 6802 | 42 |
| Asia | 17676 | 634 |
| Americas | 18674 | 126 |
It can be seen that there are a lot of variation for the number of killed people in natural disasters.
newspread <- melt(spreaddeath, id="Continent")
ggplot(newspread, aes(x = Continent, y = value, colour = variable)) +
geom_point() + geom_line(aes(x=as.numeric(Continent)))+ ylab("spread") +
ggtitle("Measure of Spread")
Then,try to produce the maximum and minimum statistics for different variables in all continents for all continents
maxmean<- ddply(Disaster, ~ Continent, summarize,
meanDamage = mean(TotalDamUSD),
maxDamage= max(TotalDamUSD),
meanAffPop = mean(NumAffected),
maxAffPop = max(NumAffected),
meanInjured = mean(NumInjured),
maxInjured = max(NumInjured),
meanDeath = mean(NumKilled),
maxDeath = max(NumKilled),
meanHomeless = mean(NumHomeless),
maxHomeless = max(NumHomeless)
)
htmlPrint(arrange(maxmean,maxDamage))
| Continent | meanDamage | maxDamage | meanAffPop | maxAffPop | meanInjured | maxInjured | meanDeath | maxDeath | meanHomeless | maxHomeless |
|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 118076 | 779000 | 824687 | 7015029 | 128 | 1200 | 203 | 921 | 17465 | 258450 |
| Europe | 1316447 | 17137601 | 87556 | 2600000 | 183 | 5850 | 1328 | 55844 | 1364 | 30000 |
| Americas | 4122132 | 159060330 | 537394 | 13390150 | 4103 | 577520 | 1682 | 229549 | 14865 | 800000 |
| Asia | 3576366 | 212520000 | 12551989 | 342023353 | 8179 | 368719 | 3563 | 166691 | 137750 | 5003500 |
We have summarized some detailed information about different variables in the dataset. Now, let's discouver more details. First,let's identify which varibales play an important role to predict the damaged amount from natural disasters
To start with, Let us to fit the full model.
FullModel <- lm(TotalDamUSD ~ NumDisaster + NumKilled + NumInjured +
NumAffected +NumHomeless, data = Disaster)
summary(FullModel)
##
## Call:
## lm(formula = TotalDamUSD ~ NumDisaster + NumKilled + NumInjured +
## NumAffected + NumHomeless, data = Disaster)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38666229 -2518125 -747136 166490 206946995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.14e+06 9.13e+05 -1.25 0.21
## NumDisaster 6.34e+05 1.14e+05 5.58 4.1e-08 ***
## NumKilled 1.62e+01 5.72e+01 0.28 0.78
## NumInjured 6.87e+01 2.67e+01 2.57 0.01 *
## NumAffected -1.12e-02 2.68e-02 -0.42 0.68
## NumHomeless -2.98e+00 1.94e+00 -1.54 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14100000 on 480 degrees of freedom
## Multiple R-squared: 0.111, Adjusted R-squared: 0.102
## F-statistic: 12 on 5 and 480 DF, p-value: 5.73e-11
Based on the summmary, it can be shown that number of disasters played a significant role to determine the damaged amount. Therefore, I will pay more attention on whether number of disasters also play an essential role in country level.
mFun <- function(x) {
model <- lm(TotalDamUSD ~ NumDisaster, x)
estCoefs <- c(coef(model))
# estSE <- c(se.coef(model))
## 2 means 2nd row (we do not want to test intercept),
## 4 means the 4th column, which corresponding to p-value
p_value <- summary(model)$coefficients[2, 4]
names(estCoefs) <- c("intercept", "NumDisaster")
#names(estSE) <- c("SE(intercept)", "SE(NumDisaster)")
names(p_value) <- "p-value (NumDisaster)"
return (c(estCoefs,p_value))
}
mCoefs <- ddply(Disaster, ~Country, mFun)
print(mCoefs)
## Country intercept NumDisaster p-value (NumDisaster)
## 1 Afghanistan 8.053e+04 -6.730e+03 0.4589563
## 2 Algeria 4.192e+05 -7.655e+04 0.1692596
## 3 Argentina 3.517e+05 9.989e+03 0.9027709
## 4 Bangladesh 2.029e+06 -5.878e+04 0.8119140
## 5 Belize 1.625e+05 -1.631e+04 0.9353574
## 6 Bolivia -3.106e+04 5.384e+04 0.4841940
## 7 Bosnia-Hercegovenia 1.750e+05 -1.750e+04 0.8234795
## 8 Brazil 1.189e+06 -8.193e+04 0.4775022
## 9 Bulgaria -5.774e+05 1.292e+05 0.4567365
## 10 Cambodia 6.630e+05 -3.225e+05 0.0972609
## 11 Canada 2.467e+06 -3.347e+05 0.1638766
## 12 Chile 2.343e+07 -5.622e+06 0.1073906
## 13 China P Rep -3.802e+06 8.040e+05 0.5965933
## 14 Colombia 2.341e+06 -2.951e+05 0.2403580
## 15 Costa Rica 2.170e+05 -7.750e+04 0.1518295
## 16 Cuba -7.023e+05 6.903e+05 0.1475787
## 17 Czech Rep 6.762e+05 9.878e+04 0.9499379
## 18 Dominican Rep -5.039e+04 4.236e+04 0.2893147
## 19 Ecuador 6.549e+05 -1.396e+05 0.2795469
## 20 El Salvador 1.035e+06 -5.659e+04 0.8160431
## 21 France 8.089e+05 2.077e+05 0.7096798
## 22 Georgia 4.560e+05 -1.810e+05 0.3963003
## 23 Germany -6.612e+06 2.648e+06 0.4851640
## 24 Greece 1.117e+06 -1.241e+05 0.8102967
## 25 Guatemala 2.901e+05 2.358e+04 0.9005996
## 26 Haiti -8.410e+05 4.622e+05 0.4348501
## 27 Honduras -2.172e+04 2.292e+04 0.0363515
## 28 Hungary 4.467e+05 -2.117e+05 0.2633967
## 29 India -7.494e+05 1.670e+05 0.0618948
## 30 Indonesia -3.486e+02 7.951e+04 0.5044540
## 31 Iran Islam Rep 5.403e+04 2.293e+04 0.4979471
## 32 Italy 5.122e+06 1.184e+05 0.9373062
## 33 Jamaica 5.036e+05 -1.196e+05 0.6064703
## 34 Japan -1.466e+07 5.848e+06 0.4642779
## 35 Kazakhstan 1.184e+05 -5.018e+04 0.3660053
## 36 Kenya 7.958e+04 -8.492e+03 0.5442917
## 37 Korea Dem P Rep 7.045e+06 -2.994e+06 0.0666164
## 38 Korea Rep 2.211e+06 -2.341e+05 0.5290848
## 39 Madagascar 2.744e+05 -4.604e+04 0.2454068
## 40 Malaysia 6.645e+05 -6.805e+04 0.8137809
## 41 Mexico 5.382e+04 2.409e+05 0.5019379
## 42 Moldova Rep 3.872e+05 -1.837e+05 0.4608119
## 43 Mongolia -3.228e+04 3.987e+04 0.1960214
## 44 Morocco 1.492e+05 -5.214e+04 0.4775352
## 45 Mozambique -1.892e+05 6.052e+04 0.1149816
## 46 Myanmar 7.859e+06 -3.859e+06 0.0007387
## 47 Nepal -3.327e+03 4.499e+03 0.4487414
## 48 Nicaragua 1.029e+03 -3.571e+00 0.8789623
## 49 Nigeria 1.686e+05 -1.621e+04 0.4834093
## 50 Pakistan 1.565e+06 1.748e+05 0.6953090
## 51 Panama 1.233e+04 -3.250e+03 0.5332411
## 52 Paraguay 3.908e+04 -1.005e+04 0.1058932
## 53 Peru 2.143e+05 2.142e+04 0.8789929
## 54 Philippines -2.169e+05 3.705e+04 0.0136360
## 55 Poland 7.441e+05 1.539e+04 0.9875127
## 56 Puerto Rico 2.947e+05 -2.300e+04 0.9329778
## 57 Romania -6.334e+05 2.083e+05 0.1153793
## 58 Russia 8.353e+05 -6.734e+03 0.9125249
## 59 South Africa 2.898e+05 -2.456e+04 0.2677210
## 60 Spain 4.345e+05 -6.019e+04 0.5198923
## 61 Sri Lanka -2.065e+04 9.909e+04 0.4732060
## 62 Sudan -5.600e+04 9.343e+04 0.2020026
## 63 Switzerland 1.290e+06 -2.100e+05 0.6977024
## 64 Taiwan (China) -1.030e+05 1.062e+05 0.1380400
## 65 Tajikistan 2.199e+05 -3.533e+04 0.5734894
## 66 Thailand 2.047e+04 6.859e+05 0.7949296
## 67 Turkey 1.282e+06 -1.839e+05 0.3572760
## 68 Ukraine -8.176e+04 2.405e+05 0.3854931
## 69 United Kingdom -1.747e+06 1.117e+06 0.2348985
## 70 United States 5.639e+07 -8.762e+05 0.6313770
## 71 Venezuela 4.700e+04 8.167e+03 0.7943047
## 72 Viet Nam 2.013e+05 3.811e+04 0.4116813
## 73 Zimbabwe 1.164e+05 -1.343e+04 0.8139364
Question: For this part, I did not use htmlPrint() since I do not know how to control
decimals in r markdown. R markdown automatically round numbers.
I tried my best, but I am not sure whether this is the correct way. Based on above table, we found that even though the number of disasters play an critical role in the world level, it does not significant in country level for most countries. It might because the individual country has limited sample size. Thus, the standard error is huge.
The next topic that I would like to focus one is how is the number of disasters changing over time on different continents.To begin with, let us plot the number of disasters in different continents.
To begin with, let us plot the number of disasters in different continents.
ggplot(Disaster,aes(x=NumDisaster, fill= Continent)) + facet_wrap(~Continent)+
geom_bar(binwidth = 2, color = "black") +
ggtitle("Number of Disasters in Diffferent Continents")
After having some basic ideas about the frequency of the number of disasters in different continents,I will look at the number of disasters changing over time on diffferent continents
p <- ggplot(Disaster, aes(x = Year, y = NumDisaster, color = Year))+
geom_jitter() + facet_wrap(~ Continent) +
geom_line(stat = "summary", fun.y = "mean", col = "red", lwd = 1) +
ggtitle("How is Number of Disasters Changing over Time on Diffferent Continents") +
scale_x_continuous(name = "Year", breaks = seq(min(Disaster$Year),
max(Disaster$Year), by = 2)) +
xlab("Year") + ylab("Number of Disasters")
print(p)
Based on the plot, there is no significan relationship beween the number of disasters and time. However, it can be found that there is some difference for the number of disasters across continents.
Try boxplot for the year 2000, 2005 and 2010
ggplot(subset(Disaster, Year %in% c(2000,2005,2010)), aes(x = factor(Year),
y = NumDisaster,
fill = Continent),
groups = Continent) + geom_boxplot(alpha = 0.2, outlier.colour= "red")
Obviously, Asia has more variation for the number of disasters.
Finally, let us try another plot about the number of disasters changing over time on diffferent continents
low_number= 10
ggplot(Disaster, aes(x = Year, y = NumDisaster,
colour = NumDisaster <= low_number)) +
geom_jitter(position = position_jitter(width = .2)) +
facet_wrap(~ Continent) +
ggtitle(paste("NumDisaster <= ", low_number)) +
theme(plot.title = element_text(face="bold")) +
scale_colour_discrete(name="",breaks=c("FALSE", "TRUE"),
labels=c("Number > 5", "Number <= 5"))+
ggtitle("Show Number of Disasters by Year across Continents")
Now,Let us look at a special plot: dots scatterplot of number of natural disaster over year for China. Why?
ggplot(subset(Disaster, Country == "China P Rep"), aes(x = Year, y = NumKilled)) +
geom_line() + xlab("Year") + ylab("Number killed") +
ggtitle("How is Number of People Killed over time in People's Repubic of China") +
scale_x_continuous(name = "Year", breaks = seq(min(NDisaster$Year),
max(NDisaster$Year), by = 2))
The graph shows that more than 80,000 were killed by natural disaster in the year of 2008 in China. That was Sichuan Earthquake. At that time, I was also in Sichuan and experienced this catastrophe. I feel sorry for those people who lost their lives during this disaster.Since I experienced this natural disaster, I would think this is an important plot for me!!
Next, I would like to evaluate whether there is a relationship between affected population and death across continents.
The question is that whether more people got affected by natural disasters will lead more people be killed? Since some natural disasters caused a lot of people lost lives, I used log transformation (Question Log transformation may not make sense?)
ggplot(Disaster, aes(x = NumAffected, y = NumKilled, color = Continent)) +
geom_point() + scale_x_log10() + scale_y_log10()+
ggtitle("How NumKilled related to NumAffected across Continents")
Based on the graph, it can be seen that with the increase of number of people got affected, there is a slightly increasing trend for the number of people got killed.
In conclusion, the number of disasters played a significant role to determine the total damaged amount in population level but not in country level. In addition, with the increase of number of people got affected, there is a slightly increasing trend for the number of people got killed.