I have always been interested in learning about poverty in our society. Many countries, especially in the African and Middle Eastern regions, face serious challenges. Typically these countries have very high birth and death rates as well as a low GNP. Broadly, the lower the GNP, the worse off a country is in terms of meeting its population's basic needs. These poverty-stricken countries have a hard time keeping up with demand for food and healthcare as the population grows rapidly. For this project, I will analyze trends in poverty by region. The dataset was obtained from https://www2.stetson.edu/~jrasp/data.htm and includes data for 97 countries.
I downloaded the dataset (link above) and imported it into R.
library(readr)
df<-read_csv("C:/Documents/My Excel/Poverty.csv")
## Parsed with column specification:
## cols(
## BirthRt = col_double(),
## DeathRt = col_double(),
## InfMort = col_double(),
## LExpM = col_double(),
## LExpF = col_double(),
## GNP = col_character(),
## Region = col_integer(),
## Country = col_character()
## )
head(df)
## # A tibble: 6 x 8
## BirthRt DeathRt InfMort LExpM LExpF GNP Region Country
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <chr>
## 1 24.7 5.7 30.8 69.6 75.5 600 1 Albania
## 2 12.5 11.9 14.4 68.3 74.7 2250 1 Bulgaria
## 3 13.4 11.7 11.3 71.8 77.7 2980 1 Czechoslovakia
## 4 12 12.4 7.6 69.8 75.9 * 1 Former_E._Germany
## 5 11.6 13.4 14.8 65.4 73.8 2780 1 Hungary
## 6 14.3 10.2 16 67.2 75.7 1690 1 Poland
library(ggplot2)
Notice that the 4th row in the head of the raw data frame has an asterisk for GNP. This indicates that the GNP value for that country is missing. Unfortunately, it is impossible for me to look up the GNP for the countries with missing values because I don’t know what year this data is from. Replacing the missing values with the mean or median would also give GNP figures that may be far from the true ones. Therefore, the simplest way for me to clean the data is to delete the rows with missing values.
poverty<-df[-c(4,8,50,56,61,70),] # drop the six rows whose GNP is missing ("*")
#poverty<-sapply(df1[1:91,1:7], as.numeric)
poverty$GNP<-as.numeric(poverty$GNP)
head(poverty)
## # A tibble: 6 x 8
## BirthRt DeathRt InfMort LExpM LExpF GNP Region Country
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr>
## 1 24.7 5.7 30.8 69.6 75.5 600 1 Albania
## 2 12.5 11.9 14.4 68.3 74.7 2250 1 Bulgaria
## 3 13.4 11.7 11.3 71.8 77.7 2980 1 Czechoslovakia
## 4 11.6 13.4 14.8 65.4 73.8 2780 1 Hungary
## 5 14.3 10.2 16 67.2 75.7 1690 1 Poland
## 6 13.6 10.7 26.9 66.5 72.4 1640 1 Romania
The country with the missing value in the 4th row has now been removed.
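Hard-coded row numbers break if the file is ever reordered, so the same cleaning step can be sketched by filtering on the "*" marker instead. This is only a minimal illustration: `df_demo` and `poverty_demo` are hypothetical names, and the data frame holds just the six rows printed by head(df) above (with a subset of the columns), not the full dataset.

```r
# Illustrative data frame: the first six rows shown by head(df) above.
# GNP is stored as text because missing values are recorded as "*".
df_demo <- data.frame(
  BirthRt = c(24.7, 12.5, 13.4, 12.0, 11.6, 14.3),
  DeathRt = c(5.7, 11.9, 11.7, 12.4, 13.4, 10.2),
  GNP     = c("600", "2250", "2980", "*", "2780", "1690"),
  Country = c("Albania", "Bulgaria", "Czechoslovakia",
              "Former_E._Germany", "Hungary", "Poland"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose GNP is not the missing-value marker.
poverty_demo <- df_demo[df_demo$GNP != "*", ]

# Convert the remaining GNP strings to numbers.
poverty_demo$GNP <- as.numeric(poverty_demo$GNP)

print(poverty_demo$Country)
print(nrow(poverty_demo))
```

Another option, assuming the asterisk is the only missing-value marker in the file, is to pass `na = "*"` to `read_csv()` so the marker is read as `NA` at import time.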
birthdeath=lm(poverty$BirthRt~poverty$DeathRt)
summary(birthdeath)
##
## Call:
## lm(formula = poverty$BirthRt ~ poverty$DeathRt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.803 -12.190 3.191 9.569 20.479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.5864 3.1298 4.341 3.74e-05 ***
## poverty$DeathRt 1.4788 0.2675 5.529 3.19e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.89 on 89 degrees of freedom
## Multiple R-squared: 0.2557, Adjusted R-squared: 0.2473
## F-statistic: 30.57 on 1 and 89 DF, p-value: 3.185e-07
deathbirth=lm(poverty$DeathRt~poverty$BirthRt)
summary(deathbirth)
##
## Call:
## lm(formula = poverty$DeathRt ~ poverty$BirthRt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0741 -3.0553 -0.6525 2.5061 12.5455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.64105 1.01489 5.558 2.81e-07 ***
## poverty$BirthRt 0.17288 0.03127 5.529 3.19e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.064 on 89 degrees of freedom
## Multiple R-squared: 0.2557, Adjusted R-squared: 0.2473
## F-statistic: 30.57 on 1 and 89 DF, p-value: 3.185e-07
summary(poverty$BirthRt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.70 14.70 29.00 29.46 42.55 52.20
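Looking at the regressions I calculated, the multiple R^2 is 0.2557 (adjusted R^2 0.2473), which corresponds to a correlation of about 0.51. The slope is positive and statistically significant in both models, so the relationship between birth and death rates is positive rather than inverse: when the birth rate increases, the death rate tends to increase as well. Since only about a quarter of the variance is explained, the linear relationship is fairly weak.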
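As a quick consistency check on the two regressions, in simple linear regression the product of the slope of y on x and the slope of x on y equals R^2. The sketch below just redoes that arithmetic with the coefficients copied from the two summary() outputs above; the variable names are my own.

```r
# Slopes copied from the two summary() outputs above.
slope_birth_on_death <- 1.4788   # from lm(BirthRt ~ DeathRt)
slope_death_on_birth <- 0.17288  # from lm(DeathRt ~ BirthRt)

# In simple linear regression the product of the two slopes equals R^2.
r_squared <- slope_birth_on_death * slope_death_on_birth

# Both slopes are positive, so the correlation is the positive square root.
r <- sqrt(r_squared)

print(round(r_squared, 4))
print(round(r, 2))
```

The product agrees with the reported multiple R^2 of 0.2557 up to rounding, which is why both models share the same R^2, F-statistic and p-value.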
plot(birthdeath)
plot(deathbirth)
plot(poverty$BirthRt,poverty$DeathRt)
Looking at the death vs birth rate scatter plot above, the points look roughly like the curve \(y=x^2\). The plot starts with a death rate of around 15 at a birth rate of around 10. The death rate slightly decreases as the birth rate increases toward its mean, and then, as the birth rate increases further, the death rate rises again. So there is something of a linear trend between birth and death rates at higher birth rates; however, the lowest birth rates start off with death rates at or above the median.
summary(poverty$InfMort) # Min is Japan, max is Afghanistan
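One way to check a curved, parabola-like pattern more formally is to add a squared term to the regression and compare it with the straight-line model. The sketch below uses SIMULATED data, not the poverty dataset, purely to show the mechanics; on the real data the same two `lm()` calls could be run with `data = poverty`.

```r
set.seed(1)  # make the simulated example reproducible

# SIMULATED data (not the poverty dataset): a U-shaped relationship plus noise.
sim <- data.frame(BirthRt = seq(10, 50, length.out = 91))
sim$DeathRt <- 8 + 0.02 * (sim$BirthRt - 25)^2 + rnorm(91, sd = 1)

# Straight-line model and quadratic model on the same data.
linear_fit    <- lm(DeathRt ~ BirthRt, data = sim)
quadratic_fit <- lm(DeathRt ~ BirthRt + I(BirthRt^2), data = sim)

# A positive coefficient on the squared term indicates an upward-opening curve.
print(coef(quadratic_fit))

# F-test comparing the straight line with the quadratic model.
print(anova(linear_fit, quadratic_fit))
```

If the squared term is significant and the quadratic model has a clearly higher R^2, that supports the curved reading of the scatter plot.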
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.50 13.05 43.00 55.28 86.50 181.60
infMortabove90=subset(poverty,InfMort > 90)
unique(infMortabove90$Country)
## [1] "Bolivia" "Peru" "Afghanistan" "Iran"
## [5] "Bangladesh" "India" "Nepal" "Pakistan"
## [9] "Angola" "Ethiopia" "Gabon" "Gambia"
## [13] "Malawi" "Mozambique" "Namibia" "Nigeria"
## [17] "Sierra_Leone" "Somalia" "Sudan" "Swaziland"
## [21] "Uganda" "Tanzania"
Deathmort=lm(poverty$DeathRt~poverty$InfMort)
summary(Deathmort) # Note: the relationship looks stronger for some countries than for others
##
## Call:
## lm(formula = poverty$DeathRt ~ poverty$InfMort)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2278 -2.8306 -0.1377 2.0089 13.3079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.944072 0.567255 12.242 < 2e-16 ***
## poverty$InfMort 0.068558 0.007884 8.695 1.6e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.463 on 89 degrees of freedom
## Multiple R-squared: 0.4593, Adjusted R-squared: 0.4533
## F-statistic: 75.61 on 1 and 89 DF, p-value: 1.603e-13
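Looking at the regression for all countries, infant mortality explains a bit less than half of the variation in death rate (multiple R^2 = 0.4593), so the overall relationship is only moderate; however, for certain groups of countries the relationship between infant mortality and death rate appears stronger.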
GNPMF=lm(poverty$GNP~poverty$LExpM+poverty$LExpF)
summary(GNPMF)
##
## Call:
## lm(formula = poverty$GNP ~ poverty$LExpM + poverty$LExpF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9627 -4990 -522 3948 21662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25979.8 4284.5 -6.064 3.25e-08 ***
## poverty$LExpM 108.1 356.2 0.303 0.762
## poverty$LExpF 379.9 311.3 1.220 0.226
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6217 on 88 degrees of freedom
## Multiple R-squared: 0.4232, Adjusted R-squared: 0.4101
## F-statistic: 32.28 on 2 and 88 DF, p-value: 3.064e-11
Since the adjusted R^2 is around 41%, life expectancy explains less than half of the variation in GNP. Neither life expectancy coefficient is individually significant, even though the overall F-test is; this is probably because male and female life expectancy are strongly correlated with each other. In most countries, life expectancy is fairly high regardless of GNP, so in my opinion it is a weak guide to GNP. There are also certain countries, such as Egypt, where life expectancy is in the mid 60s but the GNP is 600, which is very low.
Looking at the plot below, there appears to be a trend between GNP and death rate. Typically, a higher GNP implies that the country is more economically stable and better able to support its population, and therefore has a lower death rate. In the plot, the death rate is around 5-10 for countries with high GNP. There are also some countries with GNP near the median that have very low death rates.
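When two predictors move together, a regression has trouble separating their individual effects, which can leave both coefficients non-significant even if the model as a whole is. A rough check is the correlation between male and female life expectancy. The sketch below uses only the six rows printed by head(poverty) above, so it is an illustration rather than a result for the full dataset; the vector names are my own.

```r
# Life expectancy values from the six rows shown by head(poverty) above
# (Albania, Bulgaria, Czechoslovakia, Hungary, Poland, Romania).
lexp_m <- c(69.6, 68.3, 71.8, 65.4, 67.2, 66.5)  # LExpM
lexp_f <- c(75.5, 74.7, 77.7, 73.8, 75.7, 72.4)  # LExpF

# Pearson correlation between the two predictors.
r_mf <- cor(lexp_m, lexp_f)
print(round(r_mf, 2))
```

On the full data the equivalent call would be `cor(poverty$LExpM, poverty$LExpF)`; a value close to 1 would suggest keeping only one of the two columns, or their average, in the model.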
a<-ggplot(poverty,aes(GNP,DeathRt))
a+geom_jitter()+geom_smooth(method="lm")
GNPpredictor<-lm(poverty$GNP~poverty$BirthRt+poverty$DeathRt+poverty$InfMort+poverty$LExpM+poverty$LExpF)
summary(GNPpredictor)
##
## Call:
## lm(formula = poverty$GNP ~ poverty$BirthRt + poverty$DeathRt +
## poverty$InfMort + poverty$LExpM + poverty$LExpF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11945.3 -3177.0 -131.2 2715.5 19040.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60755.00 21690.72 -2.801 0.00631 **
## poverty$BirthRt -8.16 112.56 -0.072 0.94238
## poverty$DeathRt 700.47 226.74 3.089 0.00271 **
## poverty$InfMort 37.44 45.02 0.832 0.40800
## poverty$LExpM 595.14 368.96 1.613 0.11044
## poverty$LExpF 312.25 381.94 0.818 0.41591
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5912 on 85 degrees of freedom
## Multiple R-squared: 0.4962, Adjusted R-squared: 0.4665
## F-statistic: 16.74 on 5 and 85 DF, p-value: 1.754e-11
Unfortunately, I wasn’t able to accurately predict the GNP for the countries with missing GNP. I believe the regression model is not accurate enough, since its R^2 is below 50% (multiple R^2 = 0.4962, adjusted R^2 = 0.4665). I tried predicting the GNP for former East Germany and got a prediction of 13358. That value seems too high compared with the other region 1 countries shown above, whose GNP ranges from 600 to 2980. I think the reason the model fails is that I included life expectancy in the regression equation; it is likely redundant with the other predictors, which makes the individual coefficients unstable and the predictions inaccurate.
It appears that region 6, the African region, has the highest birth rates and the highest death rates.
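The prediction step itself is not shown above, so here is a hedged sketch of one way to do it with `predict()`. One pitfall worth knowing: a model written as `lm(poverty$GNP ~ poverty$LExpM)` does not use `newdata` as intended, whereas a formula with bare column names plus `data =` does. The example uses a deliberately simplified one-predictor model (not the five-predictor model above) fitted to the six rows printed by head(poverty); `train` and `east_germany` are hypothetical names.

```r
# Six rows from head(poverty) above: male life expectancy and GNP.
train <- data.frame(
  LExpM = c(69.6, 68.3, 71.8, 65.4, 67.2, 66.5),
  GNP   = c(600, 2250, 2980, 2780, 1690, 1640)
)

# Former East Germany's LExpM from the raw head(df) output.
east_germany <- data.frame(LExpM = 69.8)

# Formula with bare column names and data=: predict() uses newdata.
fit_data_arg <- lm(GNP ~ LExpM, data = train)
pred <- predict(fit_data_arg, newdata = east_germany)

# Formula with train$...: predict() cannot match newdata to the model terms
# and returns one value per training row instead (with a warning).
fit_dollar  <- lm(train$GNP ~ train$LExpM)
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = east_germany))

print(length(pred))         # one prediction, for the new row
print(length(pred_dollar))  # one value per training row
```

With the full dataset, the same pattern would be `lm(GNP ~ BirthRt + DeathRt + InfMort + LExpM + LExpF, data = poverty)` followed by `predict()` on a one-row data frame holding the country's values.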
a<-ggplot(poverty,aes(x=BirthRt,y=DeathRt,color=factor(Region)))
a+geom_jitter()
a<-ggplot(poverty,aes(x=Region,y=GNP,color=factor(Region)))
a+geom_jitter()