Summary

This R Markdown document contains the linear models used to investigate the factors that influence GDP in 193 countries. Linear regression and exploratory data analysis are carried out to select the influential covariates.

Keywords: Exploratory data analysis (EDA), Analysis of Variance (ANOVA), selection of variables.

1. Data Preprocessing

Load the data and make sure R treats the following variables as numeric: fertility, infant mortality, female life expectancy, percentage of the population living in urban areas, and GDP per capita.

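# Read the data; stringsAsFactors = TRUE only converts character columns to factors
GDP <- read.csv("~/Desktop/GDP.csv", stringsAsFactors = TRUE)
View(GDP)  # opens the data viewer in an interactive session
# Coerce the response and the covariates to numeric
GDP$fertility <- as.numeric(GDP$fertility)
GDP$ppgdp <- as.numeric(GDP$ppgdp)
GDP$lifeExpF <- as.numeric(GDP$lifeExpF)
GDP$pctUrban <- as.numeric(GDP$pctUrban)
GDP$infantMortality <- as.numeric(GDP$infantMortality)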
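
One caveat with the conversions above: as.numeric() leaves a column unchanged if read.csv() already parsed it as numeric, but if a column was read as a factor (for instance because of a stray text entry), as.numeric() returns the internal level codes rather than the values. A minimal sketch with a made-up vector `x` and a made-up data frame `toy` (neither is taken from GDP.csv):

```r
# Hypothetical example: a numeric column that was read as a factor
x <- factor(c("10.5", "3.2", "10.5"))

codes  <- as.numeric(x)                # internal level codes, not the data
values <- as.numeric(as.character(x))  # the values as written in the file

print(codes)
print(values)

# Quick check of the column classes, shown here on a toy data frame
toy <- data.frame(fertility = c(1.8, 4.2), region = factor(c("A", "B")))
sapply(toy, class)
```

Running the same sapply() check on GDP before converting would show whether any of the five columns actually needs the as.character() step.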

2. Data Analysis

To analyse the data, first attach the data frame so that its columns can be referred to by name.

attach(GDP)

Figure 1: Scatter plots of each factor
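
The chunk that produced Figure 1 is not shown. A scatter-plot matrix of this kind can be drawn with pairs(); the sketch below uses a simulated data frame `toy` with the same column names as a stand-in for GDP, so the values are illustrative only.

```r
# Simulated stand-in for the GDP data (hypothetical values)
set.seed(1)
n <- 50
toy <- data.frame(
  fertility = runif(n, 1, 7),
  pctUrban  = runif(n, 10, 100),
  lifeExpF  = runif(n, 50, 85)
)
toy$ppgdp <- exp(7 - 0.4 * toy$fertility + 0.03 * toy$pctUrban + rnorm(n, sd = 0.7))
toy$logppgdp <- log(toy$ppgdp)

vars <- c("logppgdp", "fertility", "pctUrban", "lifeExpF")

# Scatter plot of every pair of variables
pairs(toy[, vars])

# The matching correlation matrix
cor_mat <- cor(toy[, vars])
round(cor_mat, 2)
```

The correlation matrix gives all pairwise coefficients in one call, which complements the individual cor() calls used in this section.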

The scatter plots suggest a relationship between fertility, pctUrban and log(ppgdp). This impression is examined quantitatively by calculating the correlation coefficients:

cor(ppgdp,fertility)
## [1] -0.4423632
cor(log(ppgdp),fertility)
## [1] -0.7210742
cor(pctUrban,fertility)
## [1] -0.5391271
cor(pctUrban,ppgdp)
## [1] 0.5855575
par(mfrow=c(1,2),  pty="m",mai=c(0.9,0.8,0.3,0.35), cex.lab=1.2)
hist(ppgdp, nclass=20, col="oldlace", border="deepskyblue2", freq=F, main="", xlab="PPGDP")
hist(log(ppgdp), nclass=20, col="oldlace", border="deepskyblue2", freq=F, main="", xlab="logPPGDP")

Figure 2: Histograms of GDP (Left Panel) and log GDP (Right Panel)

The distribution of GDP is positively skewed, as can be seen in the left histogram in Figure 2. Taking the logarithm reduces the skewness: the right panel of Figure 2, which shows the logarithm of GDP, looks more symmetric than the left panel. This means the typical value is less ambiguous than for the untransformed variable. In support of this, Figure 3 shows that the relationship between fertility and log GDP is closer to linear than the relationship between fertility and GDP.
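
The effect of the log transformation on skewness can also be checked numerically. The sketch below defines a small helper `skewness()` (a hypothetical function written for this example, not part of base R) and applies it to simulated log-normal values standing in for ppgdp.

```r
# Sample skewness: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(2)
x <- rlnorm(193, meanlog = 8.5, sdlog = 1.5)  # simulated, not the real GDP values

skew_raw <- skewness(x)       # strongly positive for right-skewed data
skew_log <- skewness(log(x))  # near zero when log(x) is roughly symmetric

c(raw = skew_raw, log = skew_log)
```

Applied to the real data, the call would be skewness(ppgdp) and skewness(log(ppgdp)).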

par(mfrow=c(1,2),mai=c(0.9,1.2,0.2,0.5), cex.lab=1.2)
plot(fertility,ppgdp, ylab="GDP", xlab = "Fertility")
plot(fertility,log(ppgdp), ylab="log(GDP)", xlab = "Fertility")

Figure 3: Scatter plots of Fertility against GDP (Left Panel) and log GDP (Right Panel)

3. Exploratory Data Analysis

3.1 Linear models

Candidate linear models, one for every subset of the four covariates, are fitted in order to find the model that best describes the data.

lm0<-lm(log(ppgdp)~1)
lm1<-lm(log(ppgdp)~fertility)
lm2<-lm(log(ppgdp)~pctUrban)
lm3<-lm(log(ppgdp)~lifeExpF)
lm4<-lm(log(ppgdp)~infantMortality)
lm5<-lm(log(ppgdp)~fertility+pctUrban)
lm6<-lm(log(ppgdp)~fertility+pctUrban+lifeExpF)
lm7<-lm(log(ppgdp)~pctUrban+lifeExpF)
lm8<-lm(log(ppgdp)~pctUrban+lifeExpF+infantMortality)
lm9<-lm(log(ppgdp)~fertility+lifeExpF)
lm10<-lm(log(ppgdp)~fertility+lifeExpF+infantMortality)
lm11<-lm(log(ppgdp)~fertility+pctUrban+infantMortality)
lm12<-lm(log(ppgdp)~fertility+infantMortality)
lm13<-lm(log(ppgdp)~pctUrban+infantMortality)
lm14<-lm(log(ppgdp)~lifeExpF+infantMortality)
lm15<-lm(log(ppgdp)~fertility+pctUrban+lifeExpF+infantMortality)
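
The sixteen models above are all subsets of the four covariates. As an alternative to typing each formula, the same family can be generated in a loop. This is a sketch on a simulated data frame `toy` (hypothetical values); it passes `data = toy` to lm() rather than relying on attach(), and ranks the fits by AIC, a criterion that is not used in the original analysis.

```r
# Simulated stand-in for the GDP data (hypothetical values)
set.seed(42)
n <- 100
toy <- data.frame(
  fertility       = runif(n, 1, 7),
  pctUrban        = runif(n, 10, 100),
  lifeExpF        = runif(n, 50, 85),
  infantMortality = runif(n, 2, 120)
)
toy$ppgdp <- exp(6 - 0.3 * toy$fertility + 0.03 * toy$pctUrban + rnorm(n, sd = 0.8))

covs <- c("fertility", "pctUrban", "lifeExpF", "infantMortality")

# All non-empty subsets of the covariates (4 + 6 + 4 + 1 = 15)
subsets <- unlist(
  lapply(seq_along(covs), function(k) combn(covs, k, simplify = FALSE)),
  recursive = FALSE
)

# Right-hand sides, with "1" for the intercept-only model
rhs <- c("1", sapply(subsets, paste, collapse = " + "))

# Fit every model on the log scale
fits <- lapply(rhs, function(r) lm(as.formula(paste("log(ppgdp) ~", r)), data = toy))
names(fits) <- rhs

# Rank by AIC (smaller is better)
aic <- sort(sapply(fits, AIC))
head(aic, 3)
```

The list `fits` plays the role of lm0 to lm15; any pair of nested entries can be passed to anova() in the same way as the named models.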

The most complex model is fitted to the data and its standardised residuals are obtained.

lm15<-lm(log(ppgdp)~fertility+pctUrban+lifeExpF+infantMortality)
par(mfrow=c(2,2),  pty="m",mai=c(0.9,0.8,0.3,0.35), cex.lab=1.2, cex.axis=0.8)
plot(lm15)

Figure 4: Residual plots
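
The standardised residuals behind the diagnostic plots can also be extracted directly with rstandard(). The sketch below fits a model to simulated data (the data frame `toy` and its values are hypothetical) and lists observations whose standardised residual exceeds 2 in absolute value, which is a common rule of thumb rather than a threshold taken from this analysis.

```r
# Simulated stand-in for the GDP data (hypothetical values)
set.seed(3)
n <- 80
toy <- data.frame(fertility = runif(n, 1, 7), pctUrban = runif(n, 10, 100))
toy$ppgdp <- exp(7 - 0.4 * toy$fertility + 0.03 * toy$pctUrban + rnorm(n, sd = 0.8))

fit <- lm(log(ppgdp) ~ fertility + pctUrban, data = toy)

# Standardised residuals: residuals scaled by their estimated standard deviation
res_std <- rstandard(fit)

# Observations worth a closer look
flagged <- which(abs(res_std) > 2)
flagged
```

For the model in this section the equivalent call would be rstandard(lm15).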

3.2 ANOVA

ANOVA can help to choose the most suitable model by comparing nested linear models in R.

anova(lm6,lm15)
## Analysis of Variance Table
## 
## Model 1: log(ppgdp) ~ fertility + pctUrban + lifeExpF
## Model 2: log(ppgdp) ~ fertility + pctUrban + lifeExpF + infantMortality
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    189 123.58                              
## 2    188 119.94  1    3.6407 5.7066 0.01789 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm5,lm6)
## Analysis of Variance Table
## 
## Model 1: log(ppgdp) ~ fertility + pctUrban
## Model 2: log(ppgdp) ~ fertility + pctUrban + lifeExpF
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    190 140.70                                  
## 2    189 123.58  1    17.117 26.177 7.617e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm1,lm5)
## Analysis of Variance Table
## 
## Model 1: log(ppgdp) ~ fertility
## Model 2: log(ppgdp) ~ fertility + pctUrban
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    191 220.27                                  
## 2    190 140.70  1    79.575 107.46 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm2,lm5)
## Analysis of Variance Table
## 
## Model 1: log(ppgdp) ~ pctUrban
## Model 2: log(ppgdp) ~ fertility + pctUrban
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    191 207.94                                  
## 2    190 140.70  1    67.236 90.796 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm5)
## 
## Call:
## lm(formula = log(ppgdp) ~ fertility + pctUrban)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6975 -0.6625 -0.0372  0.6741  3.0270 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.966908   0.299596  26.592   <2e-16 ***
## fertility   -0.519084   0.054476  -9.529   <2e-16 ***
## pctUrban     0.033085   0.003192  10.366   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8605 on 190 degrees of freedom
## Multiple R-squared:  0.6934, Adjusted R-squared:  0.6901 
## F-statistic: 214.8 on 2 and 190 DF,  p-value: < 2.2e-16
summary(lm6)
## 
## Call:
## lm(formula = log(ppgdp) ~ fertility + pctUrban + lifeExpF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4544 -0.5968 -0.0579  0.5475  3.3452 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.519681   0.913664   3.852  0.00016 ***
## fertility   -0.228293   0.076489  -2.985  0.00321 ** 
## pctUrban     0.027947   0.003163   8.836 6.80e-16 ***
## lifeExpF     0.054553   0.010662   5.116 7.62e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8086 on 189 degrees of freedom
## Multiple R-squared:  0.7307, Adjusted R-squared:  0.7264 
## F-statistic: 170.9 on 3 and 189 DF,  p-value: < 2.2e-16
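
Each F statistic in the ANOVA tables above can be reproduced from the residual sums of squares. For two nested models with residual sums of squares RSS1 and RSS2 and residual degrees of freedom df1 and df2, F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2). The sketch below plugs in the rounded values printed for anova(lm5, lm6), so the result agrees with the reported 26.177 only up to rounding.

```r
# Values copied from the anova(lm5, lm6) table
rss1 <- 140.70; df1 <- 190   # log(ppgdp) ~ fertility + pctUrban
rss2 <- 123.58; df2 <- 189   # log(ppgdp) ~ fertility + pctUrban + lifeExpF

# F statistic for the extra term and its upper-tail p-value
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_val  <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

f_stat  # approximately 26.18
p_val
```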

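Based on the ANOVA comparisons, lm5 is chosen as the model for the analysis. Note, however, that the F-tests above also favour adding lifeExpF (lm6 over lm5, p = 7.617e-07) and, at the 5% level, infantMortality (lm15 over lm6, p = 0.01789), so the choice of lm5 trades some goodness of fit for simplicity.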
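
Because the response is log(ppgdp), the coefficients of lm5 act multiplicatively on GDP per capita. As a rough worked example using the estimates printed in summary(lm5), exponentiating a coefficient gives the factor by which predicted GDP per capita changes for a one-unit increase in that covariate, with the other covariate held fixed. This describes an association in the fitted model, not a causal effect.

```r
# Estimates copied from summary(lm5)
b_fertility <- -0.519084
b_pctUrban  <-  0.033085

mult_fertility <- exp(b_fertility)  # factor per additional unit of fertility
mult_pctUrban  <- exp(b_pctUrban)   # factor per additional percentage point urban

# Percentage change in predicted GDP per capita
round(100 * (c(fertility = mult_fertility, pctUrban = mult_pctUrban) - 1), 1)
```

Under lm5, one additional unit of fertility is associated with roughly 40% lower predicted GDP per capita, and one additional percentage point of urban population with roughly 3% higher.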
