The aim of this report is to investigate whether there is a relationship between self-rated poverty and GDP, using economic data obtained from the World Bank and SWS. The two research questions are the following:

  1. How does GDP vary over the years?

  2. Does GDP have a direct effect on self-rated poverty? If not, what other factors in the dataset have a relationship with self-rated poverty?

Initial Data Analysis

The data were collected from the World Bank and SWS. The dataset has 59 annual observations, from 1961 to 2019, on 10 variables. The variables are listed in the output below.

# Load package
library(tidyverse)
library(readxl)
library(psych)
# Import data
data <- read_excel("DATAx.xlsx")

# Quick look at the first 6 rows of data
head(data)
## # A tibble: 6 x 10
##    YEAR `GDP GROWTH (%)` `GDP per capita… `Inflation (%)` `Unemployment (…
##   <dbl>            <dbl>            <dbl>           <dbl>            <dbl>
## 1  2019             6.04            3485.           2.48               5.1
## 2  2018             6.34            3252.           5.21               5.3
## 3  2017             6.93            3123.           2.85               5.7
## 4  2016             7.15            3074.           1.25               5.5
## 5  2015             6.35            3001.           0.674              6.3
## 6  2014             6.35            2960.           3.60               6.8
## # … with 5 more variables: `SELF RATED POVERTY (%) SWS` <dbl>, `POVERTY
## #   INCIDENCE AMONG POPULATION (%)` <dbl>, `POVERTY INCIDENCE AMONG FAMILIES
## #   (%)` <dbl>, `Total No. of Families (in thousands)` <dbl>, `No. of Families
## #   (in lowest 30%; in thousands)` <dbl>
# Size of data
dim(data)
## [1] 59 10
# Classification of variables
str(data)
## tibble [59 × 10] (S3: tbl_df/tbl/data.frame)
##  $ YEAR                                         : num [1:59] 2019 2018 2017 2016 2015 ...
##  $ GDP GROWTH (%)                               : num [1:59] 6.04 6.34 6.93 7.15 6.35 ...
##  $ GDP per capita ($)                           : num [1:59] 3485 3252 3123 3074 3001 ...
##  $ Inflation (%)                                : num [1:59] 2.48 5.212 2.853 1.254 0.674 ...
##  $ Unemployment (% of total Labor Force)        : num [1:59] 5.1 5.3 5.7 5.5 6.3 6.8 7.1 7 7 7.3 ...
##  $ SELF RATED POVERTY (%) SWS                   : num [1:59] 54 48 46 44 50 54 52 52 49 48 ...
##  $ POVERTY INCIDENCE AMONG POPULATION (%)       : num [1:59] NA 21.9 23.1 24.5 21.6 25.8 24.9 25.2 NA NA ...
##  $ POVERTY INCIDENCE AMONG FAMILIES (%)         : num [1:59] NA 12.3 NA NA 16.5 NA 19.1 19.7 NA NA ...
##  $ Total No. of Families (in thousands)         : num [1:59] NA NA 24354 23771 NA ...
##  $ No. of Families (in lowest 30%; in thousands): num [1:59] NA NA 7307 7132 NA ...
# Look for any notable trends between all pairs of variables
plot(data)

# Some Stats
summary(data)
##       YEAR      GDP GROWTH (%)   GDP per capita ($) Inflation (%)   
##  Min.   :1961   Min.   :-7.324   Min.   : 156.7     Min.   : 0.674  
##  1st Qu.:1976   1st Qu.: 3.588   1st Qu.: 381.8     1st Qu.: 3.694  
##  Median :1990   Median : 5.087   Median : 732.4     Median : 6.254  
##  Mean   :1990   Mean   : 4.415   Mean   :1083.4     Mean   : 8.659  
##  3rd Qu.:2004   3rd Qu.: 6.123   3rd Qu.:1202.3     3rd Qu.:10.126  
##  Max.   :2019   Max.   : 8.921   Max.   :3485.1     Max.   :50.339  
##                                                                     
##  Unemployment (% of total Labor Force) SELF RATED POVERTY (%) SWS
##  Min.   : 5.1                          Min.   :44.00             
##  1st Qu.: 7.2                          1st Qu.:50.75             
##  Median : 9.1                          Median :56.00             
##  Mean   : 8.8                          Mean   :57.14             
##  3rd Qu.:10.8                          3rd Qu.:63.00             
##  Max.   :11.8                          Max.   :74.00             
##  NA's   :24                            NA's   :23                
##  POVERTY INCIDENCE AMONG POPULATION (%) POVERTY INCIDENCE AMONG FAMILIES (%)
##  Min.   :21.60                          Min.   :12.30                       
##  1st Qu.:24.70                          1st Qu.:19.70                       
##  Median :26.30                          Median :24.70                       
##  Mean   :30.34                          Mean   :27.15                       
##  3rd Qu.:34.90                          3rd Qu.:35.50                       
##  Max.   :49.50                          Max.   :44.20                       
##  NA's   :44                             NA's   :46                          
##  Total No. of Families (in thousands)
##  Min.   :16873                       
##  1st Qu.:18067                       
##  Median :19128                       
##  Mean   :20345                       
##  3rd Qu.:22713                       
##  Max.   :24354                       
##  NA's   :50                          
##  No. of Families (in lowest 30%; in thousands)
##  Min.   :5062                                 
##  1st Qu.:5420                                 
##  Median :6569                                 
##  Mean   :6215                                 
##  3rd Qu.:6814                                 
##  Max.   :7307                                 
##  NA's   :50

SELF-RATED POVERTY

A brief summary of this variable:

summary(data$`SELF RATED POVERTY (%) SWS`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   44.00   50.75   56.00   57.14   63.00   74.00      23

There are 23 missing values for this variable. We omit them and use the 36 available observations, which cover 1984 to 2019.

srp_data <- data %>% filter(YEAR >= 1984 & YEAR <= 2019) %>% 
        rename(srp = "SELF RATED POVERTY (%) SWS")
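
The filter above hard-codes the years in which SRP happens to be available. As a hedged alternative, the sketch below keeps whichever rows have a non-missing value, so it would still work if the coverage changed. It uses base R and a small made-up data frame (`toy` and its values are illustrative assumptions, not the report's data).

```r
# Toy stand-in for the report's data (values are made up for illustration).
toy <- data.frame(
  YEAR = 1982:1987,
  srp  = c(NA, NA, 60, 74, 66, 51)
)

# Keep only the rows where srp was actually observed.
srp_toy <- toy[!is.na(toy$srp), ]

print(nrow(srp_toy))        # number of usable observations
print(range(srp_toy$YEAR))  # first and last year with data
```

With dplyr loaded, the same idea would be written as `filter(!is.na(srp))` in the pipeline.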

Getting to know more about SELF RATED POVERTY

Years with lowest and highest SRP:

# Highest
srp_data %>% filter(srp == max(srp)) %>% select(YEAR, srp)
## # A tibble: 1 x 2
##    YEAR   srp
##   <dbl> <dbl>
## 1  1985    74
# Lowest
srp_data %>% filter(srp == min(srp))  %>% select(YEAR, srp)
## # A tibble: 1 x 2
##    YEAR   srp
##   <dbl> <dbl>
## 1  2016    44

Plot of SRP over time:

ggplot(data = srp_data, mapping = aes(x = YEAR, y = srp)) +
        geom_point(color = "red") +
        geom_smooth() + 
        labs(title = "SRP (1984 - 2019)")

SRP was highest in 1985 (74%) and lowest in 2016 (44%).

Sneak peek at GDP GROWTH (%)

data <- data %>% rename(GDP_GROWTH = "GDP GROWTH (%)")
ggplot(data = data, mapping = aes(x = YEAR, y = GDP_GROWTH)) +
        geom_point(color = "violet") +
        geom_smooth() +
        labs(title = "GDP GROWTH OVER THE YEAR")

The GDP growth rate seems to decline between 1980 and 2000. Half of the observations lie between about 3.6% and 6.1%, with a mean of 4.415%. Not bad.

Sneak peek at Inflation (%)

data <- data %>% rename(Inflation = "Inflation (%)")
ggplot(data = data, mapping = aes(x = YEAR, y = Inflation)) +
        geom_point(color = "brown") +
        geom_smooth() + 
        labs(title = "INFLATION RATE OVER THE YEAR")

The inflation rate is high in the early 1980s. Let’s take a look at that unusual value.

# Unusual
data %>% filter(Inflation == max(Inflation)) %>% select(YEAR, Inflation)
## # A tibble: 1 x 2
##    YEAR Inflation
##   <dbl>     <dbl>
## 1  1984      50.3

The highest inflation rate, 50.3%, occurred in 1984. What happened that year?

Sneak peek at Unemployment (% of total Labor Force)

data <- data %>% 
        rename(Unemployment = "Unemployment (% of total Labor Force)")

data %>% filter(YEAR >= 1985 & YEAR <= 2019) %>% 
        ggplot(mapping = aes(x = YEAR, y = Unemployment)) +
        geom_point(color = "green") +
        geom_smooth() + 
        labs(title = "UNEMPLOYMENT RATE (1985 - 2019)")

This variable has 24 missing values, which we omit for now. Observations start in 1985 and run up to 2019. The overall trend is downward, with peak values around 2002 to 2005.

Research Question 1

How does GDP vary over the years?

Sneak peek at GDP

data <- data %>% rename(GDP = "GDP per capita ($)")
ggplot(data = data, mapping = aes(x = YEAR, y = GDP)) +
        geom_point(color = "red") +
        geom_smooth() +
        labs(title = "GDP OVER THE YEAR")

GDP per capita is doing well, with an increasing trend over the years. Let’s take a look at the years with the highest and lowest GDP.

# Highest
data %>% filter(GDP == max(GDP)) %>% select(YEAR, GDP)
## # A tibble: 1 x 2
##    YEAR   GDP
##   <dbl> <dbl>
## 1  2019 3485.
# Lowest
data %>% filter(GDP == min(GDP))  %>% select(YEAR, GDP)
## # A tibble: 1 x 2
##    YEAR   GDP
##   <dbl> <dbl>
## 1  1962  157.

As the plot above suggests, 2019 has the highest GDP per capita while 1962 has the lowest. What do other variables say about this?
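
As a rough, hedged summary of how much GDP per capita changed between those two years, the lowest and highest values reported above (156.7 in 1962 and 3485.1 in 2019) can be turned into an average annual growth rate. The symbol \(g\) is introduced here for illustration, and the calculation assumes the two dollar figures are directly comparable:

\[g = \left(\frac{3485.1}{156.7}\right)^{1/57} - 1 \approx 22.24^{1/57} - 1 \approx 0.056\]

That is, roughly 5.6% per year on average over the 57 years, in whatever dollar terms the dataset uses. This describes only the two endpoints, not the path in between.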

Research Question 2

Does GDP have a direct effect on self-rated poverty? If not, what other factors in the dataset have a relationship with self-rated poverty?

In this analysis we use simple linear regression to check whether there is a relationship between self-rated poverty and GDP.

The model

\[Y \approx \beta_0 + \beta_1X\]

Here, our outcome variable \(Y\) is srp and our predictor \(X\) is GDP.

Goal

  1. Estimate \(\beta_1\).

  2. Test whether \(\beta_1 = 0\). If we cannot reject this, we have no evidence of a relationship between the two variables. This is done by checking the p-value: a small p-value is evidence against \(\beta_1 = 0\).
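
The second goal can be written as a standard hypothesis test. In the sketch below, \(\operatorname{SE}(\hat{\beta}_1)\) (the standard error of the estimate) and \(n\) (the number of observations) are symbols introduced here, not taken from the original text:

\[H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0, \qquad t = \frac{\hat{\beta}_1}{\operatorname{SE}(\hat{\beta}_1)} \sim t_{n-2} \ \text{under } H_0\]

The p-value reported by `summary()` in the `Pr(>|t|)` column is the two-sided tail probability of this \(t\) statistic.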

GDP vs SRP

First, let’s check if the assumptions are met using diagnostic plots.

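# Data setup: give the GDP per capita column a short name
srp_data <- srp_data %>% rename(GDP = "GDP per capita ($)")
# Scatter plot of SRP against GDP per capita
plot(srp_data$GDP, srp_data$srp, xlab = "GDP", ylab = "SRP")

# Fit the simple linear model and draw its diagnostic plots
srp_vs_gdp <- lm(srp ~ GDP, data = srp_data)
plot(srp_vs_gdp)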

The residual plot shows that the points are not scattered randomly above and below the zero line; the visible trend in the red line tells us that the linearity assumption is not met. A non-linear model is therefore a better choice for these data.

Log-transformation:

model = lm(log(srp_data$srp) ~ log(srp_data$GDP))
plot(model)

The red line in the residual plot is fairly straight, so the linearity assumption does not appear to be violated. The normal Q-Q plot shows that the residuals are approximately normal, and the scale-location plot shows points spread randomly, consistent with the assumption of equal variance.
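
The log-log fit can also be written with the `data =` argument, which keeps the formula short and gives readable coefficient names. The following is a minimal sketch on made-up data (the data frame `toy`, the true slope of -0.17 and the alternating disturbance are all assumptions chosen for illustration, not the report's data):

```r
# Made-up data for illustration only (not the report's dataset).
# srp follows a power law in GDP with a true log-log slope of -0.17,
# plus a small alternating disturbance standing in for noise.
toy <- data.frame(GDP = seq(500, 3500, length.out = 36))
noise <- rep(c(0.03, -0.03), times = 18)
toy$srp <- exp(5.2 - 0.17 * log(toy$GDP) + noise)

# Fit log(srp) on log(GDP); the coefficient is then named "log(GDP)".
fit <- lm(log(srp) ~ log(GDP), data = toy)

# Extract the estimated slope (close to the true value of -0.17).
slope <- unname(coef(fit)["log(GDP)"])
print(slope)
```

Calling `plot(fit, which = 1)` on such a model draws only the residuals-versus-fitted plot discussed above.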

Results

summary(model)
## 
## Call:
## lm(formula = log(srp_data$srp) ~ log(srp_data$GDP))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32238 -0.04483  0.01998  0.06129  0.12794 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.23845    0.18822  27.831  < 2e-16 ***
## log(srp_data$GDP) -0.16753    0.02615  -6.406 2.57e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09156 on 34 degrees of freedom
## Multiple R-squared:  0.5469, Adjusted R-squared:  0.5335 
## F-statistic: 41.03 on 1 and 34 DF,  p-value: 2.574e-07

The small p-value (2.57e-07) for log(GDP) indicates an association with srp, i.e., we declare a relationship to exist between srp and GDP. Because both variables are log-transformed, the coefficient estimate \(\hat{\beta}_1 = -0.168\) is an elasticity: a 1% increase in GDP is associated with a decrease of about 0.17% in srp. Since the data are observational, this is an association and does not by itself establish a direct causal effect.
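
To see how a log-log slope translates into percentage changes, the fitted coefficients from the summary above can be plugged in directly. This is a worked sketch using the rounded estimates 5.238 and -0.168:

\[\log(\text{srp}) \approx 5.238 - 0.168\,\log(\text{GDP}) \quad\Longrightarrow\quad \text{srp} \approx e^{5.238}\,\text{GDP}^{-0.168}\]

\[\frac{\text{srp}_{\text{new}}}{\text{srp}_{\text{old}}} = 1.01^{-0.168} \approx 0.9983 \quad \text{(GDP up 1\%)}, \qquad \frac{\text{srp}_{\text{new}}}{\text{srp}_{\text{old}}} = 1.10^{-0.168} \approx 0.984 \quad \text{(GDP up 10\%)}\]

So a 1% rise in GDP goes with roughly a 0.17% fall in srp, and a 10% rise with roughly a 1.6% fall. These are relative changes in the srp percentage, not percentage-point changes.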

What other factors in the dataset have a relationship with the self-rated poverty?

Method 1: Pearson Correlation Coefficient

pairs.panels(srp_data[,c(1, 2, 4, 5, 6)], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

As we can see, the correlation coefficients between srp and the variables inflation and unemployment are positive, which indicates a positive relationship: as inflation and unemployment increase, srp tends to increase as well. There seems to be a negative relationship between growth rate and srp: as growth rate increases, srp tends to decrease.
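
A single Pearson coefficient from the panel can be checked with base R's `cor()`. The sketch below uses made-up values (`toy` is an illustrative assumption, not the report's data) and shows `use = "complete.obs"`, which matters here because several variables have missing years:

```r
# Made-up values for illustration only; one inflation value is missing.
toy <- data.frame(
  inflation = c(2, 4, 6, 8, NA, 12),
  srp       = c(46, 50, 52, 57, 60, 66)
)

# Pearson correlation using only the rows where both values are present.
r <- cor(toy$inflation, toy$srp, method = "pearson", use = "complete.obs")
print(r)

# Without use = "complete.obs", a single NA makes the result NA.
r_naive <- cor(toy$inflation, toy$srp)
print(r_naive)
```

`cor.test()` takes the same two vectors and additionally reports a p-value and confidence interval for the coefficient.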

Method 2: Multiple Linear Regression

Outcome Variable: srp

Predictors: GDP, growth rate, inflation, unemployment

The model:

\[Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4\]

# Data setup
srp_data <- srp_data %>% rename(gdp_growth = "GDP GROWTH (%)",
                                inflation = "Inflation (%)",
                                unemployment = "Unemployment (% of total Labor Force)")
srp_vs_all = lm((srp_data$srp) ~ (srp_data$GDP + srp_data$gdp_growth +
                        srp_data$inflation + srp_data$unemployment))
plot(srp_vs_all)

The linearity assumption is not met. Let’s log-transform the data.

model2 = lm(log(srp_data$srp) ~ (log(srp_data$GDP) + 
                                         log(srp_data$gdp_growth) +
                                         log(srp_data$inflation) +
                                         log(srp_data$unemployment)))
## Warning in log(srp_data$gdp_growth): NaNs produced
plot(model2)

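The red line in the residual plot is fairly straight (points fairly random), so the linearity assumption does not appear to be violated. The normal Q-Q plot shows that the residuals are approximately normal, and the scale-location plot shows points spread randomly, consistent with the assumption of equal variance. The NaN warning arises because log() is undefined for negative growth rates; lm() drops rows with NaN or missing values, which is why the summary below reports 4 observations deleted due to missingness.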
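
The effect of taking logs of a variable that can be negative is easy to reproduce. A minimal sketch with hypothetical numbers (the data frame `toy` and its values are made up for illustration):

```r
# Hypothetical values; two of the growth rates are negative.
toy <- data.frame(
  srp    = c(50, 74, 60, 58, 48),
  growth = c(6.0, -7.3, 3.4, -0.6, 5.1)
)

# log() of a negative number is NaN (R also emits a warning).
logged <- suppressWarnings(log(toy$growth))
n_nan  <- sum(is.nan(logged))

# lm() treats NaN like NA and drops those rows before fitting.
fit    <- suppressWarnings(lm(log(srp) ~ log(growth), data = toy))
n_used <- nobs(fit)

print(c(nan_rows = n_nan, rows_used = n_used))
```

If the years with negative growth should stay in the model, one option is to leave the growth rate untransformed while still logging the strictly positive variables.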

Results

summary(model2)
## 
## Call:
## lm(formula = log(srp_data$srp) ~ (log(srp_data$GDP) + log(srp_data$gdp_growth) + 
##     log(srp_data$inflation) + log(srp_data$unemployment)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29538 -0.05896  0.02121  0.05736  0.11840 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 5.21543    0.70613   7.386 6.05e-08 ***
## log(srp_data$GDP)          -0.15263    0.06151  -2.481   0.0196 *  
## log(srp_data$gdp_growth)   -0.02890    0.02920  -0.990   0.3312    
## log(srp_data$inflation)     0.02549    0.02738   0.931   0.3600    
## log(srp_data$unemployment) -0.03836    0.12928  -0.297   0.7690    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08995 on 27 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.5765, Adjusted R-squared:  0.5138 
## F-statistic:  9.19 on 4 and 27 DF,  p-value: 8.05e-05
model2$coefficients
##                (Intercept)          log(srp_data$GDP) 
##                 5.21542572                -0.15263225 
##   log(srp_data$gdp_growth)    log(srp_data$inflation) 
##                -0.02889541                 0.02549092 
## log(srp_data$unemployment) 
##                -0.03835641

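Only GDP (p = 0.0196) has a statistically significant relationship with the outcome variable srp at the 5% level; growth rate, inflation and unemployment do not. The model has a residual standard error of 0.08995 and an \(R^2\) of 0.5765, but its adjusted \(R^2\) (0.5138) is lower than that of the simple model (0.5335), so the extra predictors add little.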
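
The adjusted \(R^2\) in the summary above can be reproduced by hand, which shows how it penalises the extra predictors. Here \(n = 32\) is the number of observations used (36 minus the 4 deleted) and \(p = 4\) is the number of predictors; both symbols are introduced for this sketch:

\[\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.5765)\,\frac{31}{27} \approx 0.5138\]

The denominator \(n - p - 1 = 27\) matches the residual degrees of freedom reported by `summary(model2)`.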

Summary:

  1. GDP per capita is doing well, with an increasing trend over the years.

  2. We declare a relationship to exist between srp and GDP. The coefficient estimate \(\hat{\beta}_1 = -0.168\) from the log-log model implies that a 1% increase in GDP is associated with a decrease of about 0.17% in srp. (Simple Linear Regression result)

  3. The correlation coefficients between srp and the variables inflation and unemployment are positive, which indicates a positive relationship: as inflation and unemployment increase, srp tends to increase as well. There seems to be a negative relationship between growth rate and srp: as growth rate increases, srp tends to decrease. (Pearson Correlation Coefficient result)

  4. Only GDP (p-value < 0.05) has a statistically significant relationship with the outcome variable srp. The model has a residual standard error of 0.08995 and an \(R^2\) of 0.5765. (Multiple Linear Regression result)

Source Code: https://github.com/EarlMacalam/Data-Analysis-with-R/blob/master/SRP_GDP.Rmd