MATH1324 Assignment 2

Linear Relationship between a Country’s GDP and Population

Yaqi Zhang s3872859, Ziman Zeng s3874080, Ting Zhang s3873126

Last updated: 25 October 2020

RPubs link information

The RPubs link is given below:

Introduction

In this project, we aim to test whether there is a linear relationship between the GDP and the total population of a country.
Our working assumption is that GDP has a linear relationship with total population.

Problem Statement

It is reasonable to assume a positive relationship between GDP and population.
We use statistical methods to gather evidence on whether this relationship is linear.
A scatter plot and a linear regression model in R are the main methods used in this project.

Data

Data for 264 countries were extracted from data.world (https://data.world) for this investigation.
GDP (Gross Domestic Product) - the total value of all goods and services produced in the country
GDP PPP (GDP at purchasing power parity) - GDP valued at prices prevailing in the United States
Total population - the country's population in 2015

library(dplyr)  # provides %>% and summarise(), used on later slides
World <- readxl::read_excel("2015 World Bank data by nation and region.xls")

Data Cont.

The data are subset to records with a non-missing GDP PPP value.
Total population is highly right-skewed; the histogram produced here shows the distribution of GDP PPP.

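# Give the 4th and 5th columns (GDP PPP and population) syntax-friendly names
colnames(World)[4] <- "GDP_PPP"
colnames(World)[5] <- "Total_population"
# Keep only rows with a GDP PPP value
Worldnew <- World[!is.na(World$GDP_PPP), ]
# Histogram of GDP PPP on the original scale
Worldnew$GDP_PPP %>% hist(main = "GDP-PPP")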
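
To make the idea of "complete records" concrete, the sketch below counts missing values per column and compares filtering on GDP_PPP alone with filtering on every column. The toy data frame and its values are invented for illustration; only the column names follow the ones used on this slide.

```r
# Toy stand-in for the World Bank table (values are invented for illustration)
toy <- data.frame(
  Country          = c("A", "B", "C", "D"),
  GDP_PPP          = c(1.2e10, NA, 3.4e11, 5.6e9),
  Total_population = c(2.0e6, 4.1e5, NA, 9.8e5)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Filter used in this project: keep rows that have a GDP_PPP value
gdp_present <- toy[!is.na(toy$GDP_PPP), ]

# Stricter alternative: keep rows with no missing value in any column
fully_complete <- toy[complete.cases(toy), ]

print(missing_per_column)
print(nrow(gdp_present))
print(nrow(fully_complete))
```

In the real data the two filters appear to agree for the modelled variables: the regression reported later has 212 residual degrees of freedom, which matches all 214 retained rows being used.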

Data Cont.

Scatter Plot

plot(GDP_PPP ~ Total_population, data = Worldnew)

Data Transformation

A log transformation is applied to both variables to make their highly right-skewed distributions less skewed.

Worldnew$GDP_PPP <- Worldnew$GDP_PPP %>% log()
Worldnew$Total_population <- Worldnew$Total_population %>% log()
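
The effect of the transformation can be illustrated with a small sketch on simulated data. The log-normal sample and the hand-written `skewness()` helper are assumptions made for illustration and are not part of the original analysis.

```r
set.seed(1)  # reproducible simulated data (not the World Bank data)

# Sample skewness: mean of the cubed standardised values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Log-normal values are strongly right-skewed, similar in shape to population counts
x <- rlnorm(500, meanlog = 15, sdlog = 2)

skew_raw <- skewness(x)       # skewness on the original scale
skew_log <- skewness(log(x))  # skewness after the natural-log transformation

print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero after the transformation indicates a roughly symmetric distribution, which is the motivation for modelling both variables on the log scale.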

Descriptive Statistics and Visualisation

There are 214 observations. On the log scale, population ranges from about 9 to 23 and GDP PPP from about 19 to 33.
The scatter plot indicates a possible linear relationship between the two log-transformed variables.

plot(GDP_PPP ~ Total_population, data = Worldnew,xlab = "Total Population(Log scaled)",ylab = "GDP_PPP(Log scaled)")
abline(lm(GDP_PPP ~ Total_population, data = Worldnew), col = "RED")

Descriptive Statistics Cont.

Summary statistics for log GDP PPP are shown below:

Worldnew %>% summarise(Min = min(GDP_PPP, na.rm = TRUE),
                       Q1 = quantile(GDP_PPP, probs = .25, na.rm = TRUE),
                       Median = median(GDP_PPP, na.rm = TRUE),
                       Q3 = quantile(GDP_PPP, probs = .75, na.rm = TRUE),
                       Max = max(GDP_PPP, na.rm = TRUE),
                       Mean = mean(GDP_PPP, na.rm = TRUE),
                       SD = sd(GDP_PPP, na.rm = TRUE),
                       n = n(),
                       Missing = sum(is.na(GDP_PPP))) -> table1
knitr::kable(table1)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
19.15763	24.12947	25.84698	28.09133	32.36418	26.09003	2.788551	214	0

Hypothesis Testing

The statistical hypotheses for the linear regression are as follows:

\[H_0: \text{the data do not fit the linear regression model}\] \[H_A: \text{the data fit the linear regression model}\]

gdpmodel <- lm(GDP_PPP ~ Total_population, data = Worldnew)
gdpmodel %>%  summary()

## 
## Call:
## lm(formula = GDP_PPP ~ Total_population, data = Worldnew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9777 -0.7941  0.1847  0.8259  2.4725 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.02404    0.49925   20.08   <2e-16 ***
## Total_population  0.95756    0.02939   32.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.14 on 212 degrees of freedom
## Multiple R-squared:  0.8335, Adjusted R-squared:  0.8327 
## F-statistic:  1061 on 1 and 212 DF,  p-value: < 2.2e-16

gdpmodel %>% confint()

##                      2.5 %    97.5 %
## (Intercept)      9.0399088 11.008177
## Total_population 0.8996238  1.015497
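
As a consistency check on the output above, the sketch below recomputes the F statistic and the slope's confidence interval from the printed values. The numbers are copied by hand from the `summary()` and `confint()` output, so small rounding differences are expected.

```r
# Values copied from the summary() output above (rounded as printed)
slope  <- 0.95756
se     <- 0.02939
t_val  <- 32.58
r2     <- 0.8335
df_res <- 212

# With a single predictor, the F statistic equals the squared t statistic of the slope
f_from_t <- t_val^2

# F can also be recovered from R-squared: F = R^2 / ((1 - R^2) / df_res)
f_from_r2 <- r2 / ((1 - r2) / df_res)

# 95% confidence interval for the slope: estimate +/- t quantile * standard error
ci_slope <- slope + c(-1, 1) * qt(0.975, df_res) * se

print(round(c(f_from_t, f_from_r2), 1))
print(round(ci_slope, 3))
```

Both versions of F should land close to the reported value of 1061, and the interval should agree with the `confint()` row for Total_population up to rounding.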

Hypothesis Testing Cont.

Assumptions (see the next slide for the diagnostic plots)

Independence
- Assumed, as each observation comes from a different country
Linearity
- The scatter plot of GDP against population suggests linearity
- The Residuals vs Fitted plot also shows a roughly flat trend line
Normality of residuals
- The Q-Q plot shows no significant deviation from a normal distribution
Homoscedasticity
- In the Scale-Location plot, the red line appears roughly flat
- Residuals are randomly spread, with variance relatively consistent across fitted values
Influential cases
- According to the Residuals vs Leverage plot, no influential cases are present in the data set

Hypothesis Testing Cont.

plot(gdpmodel)
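
The visual checks could be complemented by simple numeric ones. The sketch below is one possible approach, run on simulated data rather than on the World Bank data; the simulated coefficients, the Shapiro-Wilk test, and the Cook's distance cut-off of 1 are illustrative choices that do not appear in the original analysis.

```r
set.seed(42)  # simulated data, not the World Bank data

# Simulate a linear relationship with constant error variance
n <- 200
x <- rnorm(n, mean = 15, sd = 2)
y <- 10 + 0.95 * x + rnorm(n, sd = 1)
sim_model <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk test
# (a large p-value means no evidence against normality)
normality <- shapiro.test(residuals(sim_model))

# Influential cases: Cook's distance for every observation
cooks <- cooks.distance(sim_model)
n_influential <- sum(cooks > 1)  # 1 is a common rule-of-thumb cut-off

print(normality$p.value)
print(max(cooks))
print(n_influential)
```

Replacing `sim_model` with the fitted `gdpmodel` would apply the same checks to the project's regression.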

Discussion

The linear model was statistically significant, F(1, 212) = 1061, p < .001, so H0 is rejected.
Population explained 83.35% of the variability in GDP PPP (both on the log scale).
The estimated average log GDP PPP when log population = 0 was 10.02, 95% CI (9.04, 11.01); the intercept of the regression was statistically significant, p < .001.
For every one-unit increase in log population, log GDP PPP increased on average by 0.96, 95% CI (0.90, 1.02).
The slope of the regression for population was statistically significant, b = 0.96, p < .001.
The estimated regression equation was: \[y = 10.02 + 0.96x\] where y is log GDP PPP and x is log total population.
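
Because both variables were transformed with the natural logarithm before fitting, the fitted line can be rewritten on the original scale. This is a worked interpretation of the rounded coefficients reported in the model summary (intercept 10.02, slope 0.96), not an additional result:

\[\log(\text{GDP PPP}) = 10.02 + 0.96\,\log(\text{Population}) \quad\Longrightarrow\quad \text{GDP PPP} = e^{10.02} \times \text{Population}^{0.96}\]

Read this way, the slope acts as an elasticity: a 1% increase in population is associated with an increase of roughly 0.96% in GDP PPP, and doubling the population multiplies the predicted GDP PPP by about \(2^{0.96} \approx 1.9\).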

Limitations

The results are potentially biased by the relatively large number of missing values: only 214 of the 264 records had a GDP PPP value.
The data represent only the year 2015.

Conclusion

There is a statistically significant positive linear relationship between GDP PPP and total population (both log-transformed).
The findings agree with our initial assumption.
In future investigations, the above equation can be tested on data covering a longer time span.

References

Gary, H. (2016). 2015 World Bank data by nation and region. Retrieved from https://data.world/garyhoov/world-bank-data/workspace/file?filename=2015+World+Bank+data+by+nation+and+region.xls
CIA. (2018). The World Factbook. Retrieved from https://www.cia.gov/library/publications/the-world-factbook/fields/208rank.html#:~:text=A%20nation’s%20GDP%20at%20purchasing,prevailing%20in%20the%20United%20States.