Analyzing poverty according to region and attempting to predict GNP

I have always been interested in learning about poverty in our society. Many countries, especially in the African and Middle Eastern regions, face serious hardship. Typically these countries have very high birth and death rates as well as a low GNP. Broadly, the lower the GNP, the worse off a country is in terms of being able to function and provide for its people. These poverty-stricken countries have a hard time keeping up with demand for food and healthcare as the population grows rapidly. So for this project, I will analyze trends in poverty according to region. The dataset was obtained from https://www2.stetson.edu/~jrasp/data.htm and includes data for 97 countries.

Defining some terms.

The dataset uses the following terms:

Birthrate (BirthRt) = The number of live births per thousand of population per year
Deathrate (DeathRt) = The ratio of deaths to the population of a particular area during a particular period of time, usually calculated as the number of deaths per thousand people per year
Infant mortality rate (InfMort) = Deaths of children who are less than 1 year old, usually expressed per thousand live births
Life expectancy (LExpM, LExpF) = The average period that someone is expected to live, given separately for males and females
GNP = The total value of goods produced and services provided by a country during one year. It is equal to GDP + net income from foreign investments

The regions in this dataset are coded 1 to 6 in the Region column:

1. Russian region
2. South America
3. Europe, America, and Japan (not communist)
4. Middle East
5. Asian region
6. African region

Working with the Data

I downloaded the dataset (link above) and imported it into R.

library(readr)
df<-read_csv("C:/Documents/My Excel/Poverty.csv")

## Parsed with column specification:
## cols(
##   BirthRt = col_double(),
##   DeathRt = col_double(),
##   InfMort = col_double(),
##   LExpM = col_double(),
##   LExpF = col_double(),
##   GNP = col_character(),
##   Region = col_integer(),
##   Country = col_character()
## )

head(df)

## # A tibble: 6 x 8
##   BirthRt DeathRt InfMort LExpM LExpF GNP   Region Country          
##     <dbl>   <dbl>   <dbl> <dbl> <dbl> <chr>  <int> <chr>            
## 1    24.7     5.7    30.8  69.6  75.5 600        1 Albania          
## 2    12.5    11.9    14.4  68.3  74.7 2250       1 Bulgaria         
## 3    13.4    11.7    11.3  71.8  77.7 2980       1 Czechoslovakia   
## 4    12      12.4     7.6  69.8  75.9 *          1 Former_E._Germany
## 5    11.6    13.4    14.8  65.4  73.8 2780       1 Hungary          
## 6    14.3    10.2    16    67.2  75.7 1690       1 Poland

library(ggplot2)

Cleaning the data

Notice that the 4th row in the head of the raw data frame has an asterisk for GNP. The asterisk indicates that the GNP value for that country is missing, which is also why GNP was read in as a character column. Unfortunately, I cannot look up the GNP for the countries with missing values, because I don’t know what year this data is from. Filling the gaps with mean or median imputation would also give those countries a GNP that says little about them. Therefore, the cleanest option is to delete the rows with missing values.

# Drop the six rows whose GNP is missing ("*"); those values cannot be recovered
poverty<-df[-c(4,8,50,56,61,70),]
# With the asterisks gone, GNP converts cleanly from character to numeric
poverty$GNP<-as.numeric(poverty$GNP)
head(poverty)

## # A tibble: 6 x 8
##   BirthRt DeathRt InfMort LExpM LExpF   GNP Region Country       
##     <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>  <int> <chr>         
## 1    24.7     5.7    30.8  69.6  75.5   600      1 Albania       
## 2    12.5    11.9    14.4  68.3  74.7  2250      1 Bulgaria      
## 3    13.4    11.7    11.3  71.8  77.7  2980      1 Czechoslovakia
## 4    11.6    13.4    14.8  65.4  73.8  2780      1 Hungary       
## 5    14.3    10.2    16    67.2  75.7  1690      1 Poland        
## 6    13.6    10.7    26.9  66.5  72.4  1640      1 Romania

The row with the missing value (Former_E._Germany) no longer appears in the output.
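Hard-coded row numbers work for this file, but they are easy to get wrong if the file changes. A minimal sketch of an alternative, assuming the asterisk is the only missing-value marker: tell the reader function to treat "*" as NA and then drop the NA rows. The four-row `raw` text and the names `toy` and `toy_clean` are stand-ins for illustration, and base R's `read.csv` is used here instead of `readr` so the sketch has no package dependency.

```r
# Toy stand-in for the raw file: "*" marks a missing GNP, as in the real data.
raw <- "Country,GNP
Albania,600
Bulgaria,2250
Former_E._Germany,*
Hungary,2780"

# na.strings = "*" turns the marker into NA, so GNP is read as a number directly.
toy <- read.csv(text = raw, na.strings = "*", stringsAsFactors = FALSE)

# Keep only the rows where GNP is present.
toy_clean <- toy[!is.na(toy$GNP), ]
print(toy_clean)
```

With this approach there is no separate `as.numeric()` step, and the rows to drop are found from the data rather than typed in by hand.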

Identifying a relationship between birth and death rates using regression

birthdeath=lm(poverty$BirthRt~poverty$DeathRt)
summary(birthdeath)

## 
## Call:
## lm(formula = poverty$BirthRt ~ poverty$DeathRt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.803 -12.190   3.191   9.569  20.479 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      13.5864     3.1298   4.341 3.74e-05 ***
## poverty$DeathRt   1.4788     0.2675   5.529 3.19e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.89 on 89 degrees of freedom
## Multiple R-squared:  0.2557, Adjusted R-squared:  0.2473 
## F-statistic: 30.57 on 1 and 89 DF,  p-value: 3.185e-07

deathbirth=lm(poverty$DeathRt~poverty$BirthRt)
summary(deathbirth)

## 
## Call:
## lm(formula = poverty$DeathRt ~ poverty$BirthRt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0741 -3.0553 -0.6525  2.5061 12.5455 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.64105    1.01489   5.558 2.81e-07 ***
## poverty$BirthRt  0.17288    0.03127   5.529 3.19e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.064 on 89 degrees of freedom
## Multiple R-squared:  0.2557, Adjusted R-squared:  0.2473 
## F-statistic: 30.57 on 1 and 89 DF,  p-value: 3.185e-07

summary(poverty$BirthRt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.70   14.70   29.00   29.46   42.55   52.20

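Looking at the first regression, the multiple R^2 is 0.2557 (adjusted 0.2473), so death rate explains only about a quarter of the variation in birth rate. The slope is positive (1.4788) and highly significant, which means the relationship is not an inverse one: on average, when the death rate increases, the birth rate tends to increase as well, although the fit is loose.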
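The two summaries above report the same R^2 (0.2557) and the same t value (5.529) even though the response and predictor are swapped. With a single predictor, R^2 is simply the squared correlation, which does not depend on direction. A small sketch with made-up numbers (`x` and `y` are placeholders, not the poverty data):

```r
# Illustrative values only (not taken from the poverty data).
x <- c(5, 7, 9, 12, 15, 20)
y <- c(14, 20, 19, 30, 28, 45)

fit_yx <- lm(y ~ x)   # y explained by x
fit_xy <- lm(x ~ y)   # x explained by y

r2_yx <- summary(fit_yx)$r.squared
r2_xy <- summary(fit_xy)$r.squared
r     <- cor(x, y)

# The two R^2 values and the squared correlation all agree.
print(c(r2_yx = r2_yx, r2_xy = r2_xy, r_squared = r^2))

# The sign of the correlation matches the sign of the slope.
print(sign(coef(fit_yx)[["x"]]) == sign(r))
```

So fitting the regression both ways does not add information about strength; the sign of the slope is what tells you whether the relationship is direct or inverse.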

plot(birthdeath)

plot(deathbirth)

plot(poverty$BirthRt,poverty$DeathRt)

In the plot of death rate against birth rate above, the points look less like a straight line and more like the curve \(y=x^2\). At the lowest birth rates (about 10), the death rate starts off at about 15, near what looks like the median death rate. The death rate then decreases slightly as the birth rate rises toward its mean, and as the birth rate increases further, the death rate climbs again. So there is a roughly linear trend between birth and death rates at higher birth rates, but it starts from a moderate death rate rather than from the lowest one.
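One way to check the curved shape described above is to add a squared term to the regression and compare the fit with the straight-line model. This is only a sketch on invented U-shaped numbers; `x` and `y` stand in for birth rate and death rate and are not values from the dataset.

```r
# Made-up U-shaped data standing in for death rate (y) against birth rate (x).
x <- c(10, 15, 20, 25, 30, 35, 40, 45, 50)
y <- c(12, 10, 8, 7, 7, 8, 11, 15, 20)

linear    <- lm(y ~ x)
quadratic <- lm(y ~ x + I(x^2))   # I() keeps ^ as arithmetic inside a formula

# Adjusted R^2 penalizes the extra term, so a higher value is a fair comparison.
print(summary(linear)$adj.r.squared)
print(summary(quadratic)$adj.r.squared)
```

On the real data the same comparison would be `lm(DeathRt ~ BirthRt + I(BirthRt^2), data = poverty)` against the linear model.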

summary(poverty$InfMort) # the minimum belongs to Japan, the maximum to Afghanistan

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.50   13.05   43.00   55.28   86.50  181.60
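The countries behind the minimum and maximum can be looked up with `which.min()` and `which.max()` rather than by eye. A small sketch on a three-row stand-in data frame (`toy` is a hypothetical name; the Japan and Afghanistan values are the minimum and maximum from the summary above, and Poland's value comes from the head of the data):

```r
# Three-row stand-in for the poverty data frame.
toy <- data.frame(
  Country = c("Japan", "Poland", "Afghanistan"),
  InfMort = c(4.5, 16, 181.6),
  stringsAsFactors = FALSE
)

# which.min()/which.max() return the row position of the extreme value.
lowest  <- toy$Country[which.min(toy$InfMort)]
highest <- toy$Country[which.max(toy$InfMort)]
print(c(lowest = lowest, highest = highest))
```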

Countries where infant mortality is greater than 90.

An infant mortality rate above 90 is a serious problem, and the countries listed below are the ones most at risk.

infMortabove90=subset(poverty,InfMort > 90)
unique(infMortabove90$Country)

##  [1] "Bolivia"      "Peru"         "Afghanistan"  "Iran"        
##  [5] "Bangladesh"   "India"        "Nepal"        "Pakistan"    
##  [9] "Angola"       "Ethiopia"     "Gabon"        "Gambia"      
## [13] "Malawi"       "Mozambique"   "Namibia"      "Nigeria"     
## [17] "Sierra_Leone" "Somalia"      "Sudan"        "Swaziland"   
## [21] "Uganda"       "Tanzania"

Does death rate impact infant mortality as well as life expectancy?

Deathmort=lm(poverty$DeathRt~poverty$InfMort)
summary(Deathmort) # the relationship is not equally strong in every country

## 
## Call:
## lm(formula = poverty$DeathRt ~ poverty$InfMort)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2278 -2.8306 -0.1377  2.0089 13.3079 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.944072   0.567255  12.242  < 2e-16 ***
## poverty$InfMort 0.068558   0.007884   8.695  1.6e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.463 on 89 degrees of freedom
## Multiple R-squared:  0.4593, Adjusted R-squared:  0.4533 
## F-statistic: 75.61 on 1 and 89 DF,  p-value: 1.603e-13

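Across all countries the relationship is only moderate: infant mortality explains about 46% of the variation in death rate (R^2 = 0.4593), with a positive and highly significant slope. The remaining variation suggests that the link between infant mortality and death rate is stronger in some countries than in others.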
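To see whether the relationship really differs between groups of countries, the correlation can be computed separately for each group. A sketch on invented numbers, where `group`, `x`, and `y` are placeholders (on the real data they would be something like Region, InfMort, and DeathRt):

```r
# Invented data: the relationship is perfect in group A and absent in group B.
d <- data.frame(
  group = rep(c("A", "B"), each = 4),
  x     = c(1, 2, 3, 4, 1, 2, 3, 4),
  y     = c(2, 4, 6, 8, 3, 1, 4, 2)
)

# split() makes one data frame per group; sapply() computes cor() inside each.
r_by_group <- sapply(split(d, d$group), function(g) cor(g$x, g$y))
print(r_by_group)
```

A pooled regression over both groups would report something in between, which hides the difference.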

Is there a relationship between GNP and life expectancy for males and females?

GNPMF=lm(poverty$GNP~poverty$LExpM+poverty$LExpF)
summary(GNPMF)

## 
## Call:
## lm(formula = poverty$GNP ~ poverty$LExpM + poverty$LExpF)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9627  -4990   -522   3948  21662 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -25979.8     4284.5  -6.064 3.25e-08 ***
## poverty$LExpM    108.1      356.2   0.303    0.762    
## poverty$LExpF    379.9      311.3   1.220    0.226    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6217 on 88 degrees of freedom
## Multiple R-squared:  0.4232, Adjusted R-squared:  0.4101 
## F-statistic: 32.28 on 2 and 88 DF,  p-value: 3.064e-11

The adjusted R^2 is about 41%, so life expectancy accounts for less than half of the variation in GNP. Neither coefficient is individually significant (p = 0.762 and p = 0.226), even though the model as a whole is (p = 3.064e-11). That pattern usually appears when the predictors are strongly correlated with each other, which is likely for male and female life expectancy. In my opinion, life expectancy on its own is a weak guide to GNP, since it is fairly high in most countries. There are also countries, such as Egypt, where life expectancy is in the mid 60s but GNP is only 600, which is very low.
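In the output above, neither life expectancy coefficient is significant on its own while the overall F-test is, which is a common sign that the two predictors overlap heavily. A quick check is the correlation between them. The sketch below uses invented values for six hypothetical countries; `lexp_m` and `lexp_f` are placeholder names.

```r
# Invented life expectancies for six hypothetical countries.
lexp_m <- c(50, 55, 60, 65, 70, 72)
lexp_f <- c(53, 59, 64, 70, 76, 78)

# A correlation close to 1 means the two predictors carry nearly the same information.
r <- cor(lexp_m, lexp_f)
print(r)

# On the real data the equivalent check would be cor(poverty$LExpM, poverty$LExpF).
```

If the correlation on the real data is similarly high, keeping only one of the two columns (or their average) would be a reasonable simplification.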

Plotting death rate against GNP with a regression line

Looking at the plot produced by the code below, there appears to be a trend between GNP and death rate. Typically, a higher GNP implies that the country is more economically stable and better able to support its people, and therefore has a lower death rate. In the graph, the death rate is around 5-10 for countries with high GNP. There are also some countries that fall near the median GNP but have very low death rates.

a<-ggplot(poverty,aes(GNP,DeathRt))
a+geom_jitter()+geom_smooth(method="lm")

My next goal is to predict the GNP for the rows I previously removed

GNPpredictor<-lm(poverty$GNP~poverty$BirthRt+poverty$DeathRt+poverty$InfMort+poverty$LExpM+poverty$LExpF)
summary(GNPpredictor)

## 
## Call:
## lm(formula = poverty$GNP ~ poverty$BirthRt + poverty$DeathRt + 
##     poverty$InfMort + poverty$LExpM + poverty$LExpF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11945.3  -3177.0   -131.2   2715.5  19040.3 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     -60755.00   21690.72  -2.801  0.00631 **
## poverty$BirthRt     -8.16     112.56  -0.072  0.94238   
## poverty$DeathRt    700.47     226.74   3.089  0.00271 **
## poverty$InfMort     37.44      45.02   0.832  0.40800   
## poverty$LExpM      595.14     368.96   1.613  0.11044   
## poverty$LExpF      312.25     381.94   0.818  0.41591   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5912 on 85 degrees of freedom
## Multiple R-squared:  0.4962, Adjusted R-squared:  0.4665 
## F-statistic: 16.74 on 5 and 85 DF,  p-value: 1.754e-11
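One practical note on the model above: because its formula is written with `poverty$` prefixes, `predict()` cannot apply it to new rows and falls back to the original data. A minimal sketch of the alternative, passing bare column names together with `data=`, is below. The data frames `train` and `new_row` and their values are invented for illustration.

```r
# Invented toy data; the column names x and y are placeholders.
train <- data.frame(x = c(1, 2, 3, 4, 5),
                    y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Bare column names plus data= let predict() find the same columns in newdata.
fit <- lm(y ~ x, data = train)

# A new observation with the same predictor column as the training data.
new_row <- data.frame(x = 6)
pred <- predict(fit, newdata = new_row)
print(pred)
```

For the poverty data this would mean fitting `lm(GNP ~ BirthRt + DeathRt + InfMort + LExpM + LExpF, data = poverty)` and passing the removed rows as `newdata`.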

Unfortunately, I wasn’t able to accurately predict the GNP for the countries with missing GNP. I believe the regression model is not accurate enough, since the adjusted R^2 is only about 47%. I tried predicting the GNP for former East Germany and got a prediction of 13358. That value seems too high to be plausible. Only death rate is individually significant in the model. I think part of the reason the model fails is that I included both life expectancy variables, which are probably redundant with each other and with the mortality variables, and that redundancy makes the predictions unreliable.
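The 13358 figure can be reproduced by hand from the printed coefficients and the Former_E._Germany row shown in the head of the raw data. This sketch only redoes that arithmetic; the names `coefs`, `east_germany`, and `predicted_gnp` are introduced here for illustration.

```r
# Coefficients copied from the summary(GNPpredictor) output above.
coefs <- c(Intercept = -60755.00, BirthRt = -8.16, DeathRt = 700.47,
           InfMort = 37.44, LExpM = 595.14, LExpF = 312.25)

# Former_E._Germany row from head(df); the leading 1 multiplies the intercept.
east_germany <- c(1, BirthRt = 12, DeathRt = 12.4, InfMort = 7.6,
                  LExpM = 69.8, LExpF = 75.9)

# A linear prediction is the sum of coefficient * value over all terms.
predicted_gnp <- sum(coefs * east_germany)
print(round(predicted_gnp))   # 13358
```

Most of the total comes from the two life expectancy terms (about 41541 and 23700), which are only partly cancelled by the large negative intercept, so small errors in those coefficients move the prediction a lot.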

Plotting Birth rate vs death rate by region

It appears that region 6, the African region, has the highest birth rates and the highest death rates.

a<-ggplot(poverty,aes(x=BirthRt,y=DeathRt,color=factor(Region)))
a+geom_jitter()

Which region had the greatest GNP?

a<-ggplot(poverty,aes(x=Region,y=GNP,color=factor(Region)))
a+geom_jitter()
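The jitter plot shows individual countries; a per-region average gives a more direct answer to the question in the heading. A sketch with invented GNP figures (the data frame `toy` and its values are made up, and only the column names match the dataset):

```r
# Invented GNP values for three of the six region codes.
toy <- data.frame(
  Region = c(1, 1, 3, 3, 6, 6),
  GNP    = c(600, 2250, 20000, 22000, 300, 500)
)

# tapply() applies mean() to GNP within each Region code.
mean_gnp <- tapply(toy$GNP, toy$Region, mean)
print(mean_gnp)

# Region code with the highest average GNP in this toy data.
top_region <- names(mean_gnp)[which.max(mean_gnp)]
print(top_region)
```

On the real data, `tapply(poverty$GNP, poverty$Region, mean)` would give the same kind of table; the median may be a better summary if a few countries dominate a region.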