Introduction

For this discussion post I decided to analyze data from the World Bank’s DataBank. I uploaded the downloaded data to my GitHub repository.

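library(tidyverse)  # %>%, pivot_wider(), ggplot()
library(gvlma)      # gvlma() for the assumption checks

GDP <- read.csv('https://raw.githubusercontent.com/jnaval88/DATA605/main/Week11/Discussion11-GDP_Birth_Rate.csv')

# Convert the 2019 values to numeric; non-numeric entries become NA
GDP$X2019..YR2019. = as.numeric(GDP$X2019..YR2019.)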
## Warning: NAs introduced by coercion
GDP = GDP %>% 
  pivot_wider(names_from = "Series.Name" , values_from = "X2019..YR2019." )
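To make the reshaping step concrete, here is a minimal sketch on a made-up long-format table. The country names and values are hypothetical, not taken from the World Bank file, and the sketch assumes missing values arrive as the string "..", which would explain the coercion warning above.

```r
library(tidyr)

# Hypothetical toy data in the same long layout: one row per country and series
toy <- data.frame(
  Country.Name = c("A", "A", "B", "B"),
  Series.Name  = c("GDP per capita (current US$)",
                   "Birth rate, crude (per 1,000 people)",
                   "GDP per capita (current US$)",
                   "Birth rate, crude (per 1,000 people)"),
  X2019..YR2019. = c("1500.5", "30.1", "..", "11.2")
)

# ".." is not a number, so it becomes NA (assumed source of the coercion warning)
toy$X2019..YR2019. <- suppressWarnings(as.numeric(toy$X2019..YR2019.))

# One row per country, one column per series
wide <- pivot_wider(toy, names_from = "Series.Name", values_from = "X2019..YR2019.")
print(wide)
```

Each series name becomes a column, which is why the later plots and the model refer to the variables with backticks.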

Plot the Variables

In this section I will plot the crude birth rate against GDP per capita.

ggplot(data = GDP, aes(x =`GDP per capita (current US$)` , y = `Birth rate, crude (per 1,000 people)`)) +
  geom_point()
## Warning: Removed 28 rows containing missing values (`geom_point()`).

Because GDP per capita spans several orders of magnitude, the raw scatterplot is hard to read. To get a clearer view of the relationship, I will put both axes on a log scale.

ggplot(data = GDP, aes(x =`GDP per capita (current US$)` , y = `Birth rate, crude (per 1,000 people)`)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()
## Warning: Removed 28 rows containing missing values (`geom_point()`).
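The effect of the log transformation can be illustrated with simulated data. The sketch below assumes a power-law relationship and compares the linear correlation before and after taking logs. The constants 200 and -0.3 and the noise level are arbitrary choices for illustration, not estimates from the World Bank data.

```r
set.seed(1)

# Simulated predictor spanning three orders of magnitude (100 to 100,000)
x <- 10^runif(300, min = 2, max = 5)

# Assumed power-law response with multiplicative noise
y <- 200 * x^(-0.3) * exp(rnorm(300, sd = 0.2))

cor_raw <- cor(x, y)                 # linear association on the original scale
cor_log <- cor(log10(x), log10(y))   # linear association on the log-log scale

cat("raw scale:", round(cor_raw, 2), "\n")
cat("log scale:", round(cor_log, 2), "\n")
```

When the underlying relationship is close to a power law, the log-log scale turns a curve into something much closer to a straight line, which is what a linear model needs.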

Build a Linear Regression Model

Now I will fit a linear regression model to the GDP data, regressing log1p (that is, log(1 + x)) of the birth rate on log1p of GDP per capita.

GDP_LM = lm( log1p(`Birth rate, crude (per 1,000 people)`) ~ log1p(`GDP per capita (current US$)`), data = GDP)
summary(GDP_LM)
## 
## Call:
## lm(formula = log1p(`Birth rate, crude (per 1,000 people)`) ~ 
##     log1p(`GDP per capita (current US$)`), data = GDP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87441 -0.17155  0.02437  0.18867  0.67440 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            5.42074    0.10510   51.58   <2e-16 ***
## log1p(`GDP per capita (current US$)`) -0.28491    0.01185  -24.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2548 on 236 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.7102, Adjusted R-squared:  0.709 
## F-statistic: 578.4 on 1 and 236 DF,  p-value: < 2.2e-16
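Since both variables enter the model through log1p(), the slope behaves approximately like an elasticity. The sketch below plugs the coefficients reported in the summary above into a small helper to back-transform predictions to births per 1,000 people. The name predict_birth_rate is introduced here for illustration, and the GDP values 1,000 and 50,000 are arbitrary examples.

```r
# Coefficients copied from the summary output above
b0 <- 5.42074
b1 <- -0.28491

# Hypothetical helper: predicted crude birth rate for a given GDP per capita.
# expm1() inverts the log1p() applied to the response.
predict_birth_rate <- function(gdp_per_capita) {
  expm1(b0 + b1 * log1p(gdp_per_capita))
}

low  <- predict_birth_rate(1000)
high <- predict_birth_rate(50000)
cat("GDP 1,000: ", round(low, 1), "\n")
cat("GDP 50,000:", round(high, 1), "\n")

# For large GDP values log1p(x) is close to log(x), so doubling GDP per capita
# multiplies (1 + birth rate) by roughly 2^b1
doubling_factor <- 2^b1
cat("doubling factor:", round(doubling_factor, 2), "\n")
```

Under this fit, doubling GDP per capita goes with a factor of about 0.82 on (1 + birth rate), that is, a birth rate roughly 18% lower. This describes an association across countries, not a causal effect.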

Linear Regression Model

Now I will draw the diagnostic plots for the fitted linear regression model.

plot(GDP_LM)

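Visually, the residuals vs fitted plot appears to have roughly constant variability, and the QQ plot indicates that the residuals are approximately normally distributed. As a formal check, I run gvlma on the model. It accepts the skewness and kurtosis assumptions but rejects the link function and heteroscedasticity assumptions at the 0.05 level, so the visual impression of constant variance should be treated with caution.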

gvlma(GDP_LM)
## 
## Call:
## lm(formula = log1p(`Birth rate, crude (per 1,000 people)`) ~ 
##     log1p(`GDP per capita (current US$)`), data = GDP)
## 
## Coefficients:
##                           (Intercept)  log1p(`GDP per capita (current US$)`)  
##                                5.4207                                -0.2849  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = GDP_LM) 
## 
##                      Value  p-value                   Decision
## Global Stat        15.1020 0.004494 Assumptions NOT satisfied!
## Skewness            3.5186 0.060682    Assumptions acceptable.
## Kurtosis            0.5489 0.458779    Assumptions acceptable.
## Link Function       7.0409 0.007967 Assumptions NOT satisfied!
## Heteroscedasticity  3.9936 0.045673 Assumptions NOT satisfied!
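As an independent cross-check on the heteroscedasticity decision, one option is a Breusch-Pagan style test. This is a different test from the one gvlma uses. The sketch below implements the studentized (Koenker) variant by hand in base R and runs it on simulated data. The helper name bp_test and the simulated data are assumptions for illustration; to apply it to this analysis one would pass GDP_LM instead of fit.

```r
# Hypothetical helper: studentized Breusch-Pagan test for a fitted lm object.
# Squared residuals are regressed on the model's predictors; under constant
# variance, n * R^2 of that auxiliary regression is approximately chi-squared.
bp_test <- function(model) {
  X    <- model.matrix(model)[, -1, drop = FALSE]  # predictors without the intercept column
  e2   <- residuals(model)^2
  aux  <- lm(e2 ~ X)                               # auxiliary regression (adds its own intercept)
  stat <- length(e2) * summary(aux)$r.squared
  df   <- ncol(X)
  c(statistic = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data whose error spread grows with x (heteroscedastic by construction)
set.seed(42)
x <- runif(500, 1, 10)
y <- 2 + 3 * x + rnorm(500, sd = x)

fit <- lm(y ~ x)
result <- bp_test(fit)
print(result)
```

A small p-value points to non-constant residual variance. If that also holds for GDP_LM, the coefficient estimates are still usable, but the reported standard errors may be unreliable.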