MATH1324 Assignment 2

Linear Relationship between a Country’s GDP and Population

Yaqi Zhang s3872859, Ziman Zeng s3874080, Ting Zhang s3873126

Last updated: 25 October 2020

Introduction

Problem Statement

Data

library(readxl)  # provides read_excel()
World <- read_excel("2015 World Bank data by nation and region.xls")

Data Cont.

library(dplyr)  # provides %>% and summarise()
colnames(World)[4] <- "GDP_PPP"
colnames(World)[5] <- "Total_population"
# Keep only rows with a recorded GDP (PPP) value
Worldnew <- World[!is.na(World$GDP_PPP), ]
Worldnew$GDP_PPP %>% hist(main = "GDP-PPP")

Data Cont.

plot(GDP_PPP ~ Total_population, data = Worldnew)

Data Transformation

Worldnew$GDP_PPP <- Worldnew$GDP_PPP %>% log()
Worldnew$Total_population <- Worldnew$Total_population %>% log()
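As a minimal sketch of why a log transformation can help with right-skewed variables, the example below uses simulated values (the vector `x` and the helper `skewness()` are hypothetical, not part of the World Bank data or the original analysis) and compares a simple moment-based skewness measure before and after taking logs.

```r
# Simulated right-skewed data standing in for raw GDP values
# (assumption: lognormal; this is NOT the World Bank data set).
set.seed(1)
x <- rlnorm(200, meanlog = 25, sdlog = 2.5)

# Moment-based sample skewness: roughly 0 for a symmetric distribution.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)       # strongly positive for right-skewed data
skew_log <- skewness(log(x))  # much closer to 0 after the log transform
c(raw = skew_raw, log = skew_log)
```

A skewness much nearer zero on the log scale is one informal indication that the transformed variable is better suited to a linear model.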

Descriptive Statistics and Visualisation

plot(GDP_PPP ~ Total_population, data = Worldnew,
     xlab = "Total Population (log scaled)",
     ylab = "GDP_PPP (log scaled)")
abline(lm(GDP_PPP ~ Total_population, data = Worldnew), col = "red")

Descriptive Statistics Cont.

table1 <- Worldnew %>%
  summarise(Min = min(GDP_PPP, na.rm = TRUE),
            Q1 = quantile(GDP_PPP, probs = .25, na.rm = TRUE),
            Median = median(GDP_PPP, na.rm = TRUE),
            Q3 = quantile(GDP_PPP, probs = .75, na.rm = TRUE),
            Max = max(GDP_PPP, na.rm = TRUE),
            Mean = mean(GDP_PPP, na.rm = TRUE),
            SD = sd(GDP_PPP, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(GDP_PPP)))
knitr::kable(table1)
Min       Q1        Median    Q3        Max       Mean      SD        n    Missing
19.15763  24.12947  25.84698  28.09133  32.36418  26.09003  2.788551  214  0

Hypothesis Testing

\[H_0: \text{The data do not fit the linear regression model}\]
\[H_A: \text{The data fit the linear regression model}\]
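For a simple linear regression, these hypotheses are commonly restated in terms of the slope. As a sketch, assuming the model is fitted to the log-transformed variables, with intercept \(\alpha\), slope \(\beta\) and error term \(\varepsilon_i\) (symbols introduced here for illustration):

\[\log(\text{GDP}_i) = \alpha + \beta \log(\text{Population}_i) + \varepsilon_i\]
\[H_0: \beta = 0 \qquad H_A: \beta \neq 0\]

With a single predictor, the overall F-test of the model is equivalent to the t-test of the slope, since \(F = t^2\).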

gdpmodel <- lm(GDP_PPP ~ Total_population, data = Worldnew)
gdpmodel %>% summary()
## 
## Call:
## lm(formula = GDP_PPP ~ Total_population, data = Worldnew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9777 -0.7941  0.1847  0.8259  2.4725 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.02404    0.49925   20.08   <2e-16 ***
## Total_population  0.95756    0.02939   32.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.14 on 212 degrees of freedom
## Multiple R-squared:  0.8335, Adjusted R-squared:  0.8327 
## F-statistic:  1061 on 1 and 212 DF,  p-value: < 2.2e-16
gdpmodel %>% confint()
##                      2.5 %    97.5 %
## (Intercept)      9.0399088 11.008177
## Total_population 0.8996238  1.015497
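The reported figures can be cross-checked by hand. The sketch below copies the slope estimate, its standard error and the residual degrees of freedom from the output above (the variable names are introduced here for illustration) and recomputes the t value, the F statistic, the 95% confidence interval and, assuming the log-log model, the implied effect of doubling the population.

```r
# Values copied from the summary() output above.
slope    <- 0.95756
se_slope <- 0.02939
df_resid <- 212

# t statistic for H0: slope = 0; its square equals F in simple regression.
t_value <- slope / se_slope
f_value <- t_value^2

# 95% confidence interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- slope + c(-1, 1) * t_crit * se_slope

# On the log-log scale, doubling population multiplies predicted GDP by 2^slope.
doubling_factor <- 2^slope

round(c(t = t_value, F = f_value, lower = ci[1], upper = ci[2],
        doubling = doubling_factor), 3)
```

The recomputed values should agree with the `summary()` and `confint()` output up to the rounding of the copied inputs. Because the confidence interval for the slope includes 1, the output is consistent with GDP growing roughly in proportion to population.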

Hypothesis Testing Cont.

  1. Independence
    • Assumed, as each observation comes from a single country
  2. Linearity
    • The scatter plot of GDP against population suggests a linear relationship
    • The Residuals vs Fitted plot also shows a roughly flat trend line
  3. Normality of residuals
    • The Normal Q-Q plot shows no substantial deviation from a normal distribution
  4. Homoscedasticity
    • In the Scale-Location plot, the red line appears roughly flat
    • Residuals are randomly spread, with relatively consistent variance across fitted values
  5. Influential cases
    • The Residuals vs Leverage plot shows no influential cases in the data set
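The visual checks above can be complemented by numeric ones. A minimal sketch, using simulated data in place of `Worldnew` (the data frame `sim` and the model `fit` are hypothetical) and the common rule of thumb that a Cook's distance above 1 flags an influential case:

```r
# Simulated log-log data standing in for Worldnew
# (assumption for illustration, not the real data).
set.seed(42)
sim <- data.frame(Total_population = rnorm(200, mean = 15, sd = 2))
sim$GDP_PPP <- 10 + 0.95 * sim$Total_population + rnorm(200, sd = 1.1)

fit <- lm(GDP_PPP ~ Total_population, data = sim)

# Normality of residuals: Shapiro-Wilk test
# (a large p-value means no evidence against normality).
normality <- shapiro.test(residuals(fit))

# Influential cases: Cook's distance with the rule-of-thumb cut-off of 1.
cooks <- cooks.distance(fit)
n_influential <- sum(cooks > 1)

normality$p.value
n_influential
```

Replacing `fit` with `gdpmodel` would apply the same checks to the fitted model. The cut-off of 1 is a convention, not a strict threshold.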

Hypothesis Testing Cont.

plot(gdpmodel)

Discussion

Limitations

Conclusion

References