library(readr)
## Warning: package 'readr' was built under R version 3.4.3
GDP_Dataset <- read_csv("C:/Users/Cong/Desktop/GDP Dataset.csv")
## Parsed with column specification:
## cols(
## Quarter = col_character(),
## GDP = col_double(),
## CPI = col_double(),
## UnemploymentRate = col_double(),
## WorkPopulation = col_double(),
## BondYields = col_double(),
## CarRegistration = col_double(),
## NetExport = col_double(),
## DispInc = col_double(),
## PrConEx = col_double(),
## PrDoInv = col_double(),
## DebtToGDP = col_double()
## )
View(GDP_Dataset)
summary(GDP_Dataset)
## Quarter GDP CPI UnemploymentRate
## Length:193 Min. : 1054 Min. : 17.47 Min. : 3.950
## Class :character 1st Qu.: 3284 1st Qu.: 43.38 1st Qu.: 5.270
## Mode :character Median : 7136 Median : 67.28 Median : 5.950
## Mean : 8354 Mean : 65.89 Mean : 6.377
## 3rd Qu.:13649 3rd Qu.: 91.48 3rd Qu.: 7.410
## Max. :19965 Max. :114.39 Max. :10.910
## WorkPopulation BondYields CarRegistration NetExport
## Min. :117082333 Min. : 1.560 Min. : 81.87 Min. :-805.59
## 1st Qu.:146310667 1st Qu.: 4.230 1st Qu.:135.17 1st Qu.:-501.59
## Median :165234000 Median : 6.340 Median :148.38 Median :-108.84
## Mean :166690396 Mean : 6.463 Mean :152.84 Mean :-246.72
## 3rd Qu.:192357667 3rd Qu.: 8.060 3rd Qu.:175.22 3rd Qu.: -23.19
## Max. :206284333 Max. :14.850 Max. :224.32 Max. : 21.58
## DispInc PrConEx PrDoInv DebtToGDP
## Min. : 3359 Min. : 2882 Min. : 166.8 Min. : 30.60
## 1st Qu.: 4765 1st Qu.: 4066 1st Qu.: 593.6 1st Qu.: 35.13
## Median : 6911 Median : 6260 Median :1202.1 Median : 57.56
## Mean : 7579 Mean : 6819 Mean :1433.1 Mean : 57.76
## 3rd Qu.:10534 3rd Qu.: 9729 3rd Qu.:2156.5 3rd Qu.: 64.08
## Max. :12931 Max. :12067 Max. :3377.2 Max. :105.67
cpi.c <- scale(GDP_Dataset$CPI, center=TRUE, scale=FALSE)
umploy.c <- scale(GDP_Dataset$UnemploymentRate, center=TRUE, scale=FALSE)
workp.c <- scale(GDP_Dataset$WorkPopulation, center=TRUE, scale=FALSE)
bondy.c <- scale(GDP_Dataset$BondYields, center=TRUE, scale=FALSE)
carr.c <- scale(GDP_Dataset$CarRegistration, center=TRUE, scale=FALSE)
export.c <- scale(GDP_Dataset$NetExport, center=TRUE, scale=FALSE)
new.c.vars <- cbind(cpi.c, umploy.c, workp.c,bondy.c,carr.c,export.c)
newGDPdata <- cbind(GDP_Dataset, new.c.vars)
names(newGDPdata)[13:18] <- c("cpi.c", "umploy.c", "workp.c" ,"bondy.c","carr.c","export.c")
lmGDPdata <- newGDPdata[,c(2,13:18)]
summary(lmGDPdata)
## GDP cpi.c umploy.c workp.c
## Min. : 1054 Min. :-48.421 Min. :-2.4269 Min. :-49608062
## 1st Qu.: 3284 1st Qu.:-22.511 1st Qu.:-1.1069 1st Qu.:-20379729
## Median : 7136 Median : 1.389 Median :-0.4269 Median : -1456396
## Mean : 8354 Mean : 0.000 Mean : 0.0000 Mean : 0
## 3rd Qu.:13649 3rd Qu.: 25.589 3rd Qu.: 1.0331 3rd Qu.: 25667271
## Max. :19965 Max. : 48.499 Max. : 4.5331 Max. : 39593938
## bondy.c carr.c export.c
## Min. :-4.9033 Min. :-70.972 Min. :-558.9
## 1st Qu.:-2.2333 1st Qu.:-17.663 1st Qu.:-254.9
## Median :-0.1233 Median : -4.461 Median : 137.9
## Mean : 0.0000 Mean : 0.000 Mean : 0.0
## 3rd Qu.: 1.5967 3rd Qu.: 22.386 3rd Qu.: 223.5
## Max. : 8.3867 Max. : 71.484 Max. : 268.3
plot(lmGDPdata, pch=16, col="blue", main="Matrix Scatterplot for Predictors")
The matrix plot above allows us to visualize the relationships among all variables in a single image. For example, we can see how GDP and cpi.c are related. Another interesting example is the relationship between GDP and export.c: as net exports increase, GDP declines.
Next, we fit a linear regression on this dataset and see how well it models the observed data. We include all the predictors and give each of them a separate slope coefficient. For our multiple linear regression example, we want to estimate the following equation:
GDP = B0 + B1 * cpi.c + B2 * umploy.c + B3 * workp.c + B4 * bondy.c + B5 * carr.c + B6 * export.c
The model estimates the intercept (B0) and one slope per predictor (B1 through B6). When the predictors are centered, the intercept is the expected GDP at the average value of every predictor. Each slope estimate is the average change in GDP associated with a one-unit increase in that predictor, holding the others constant. Least squares chooses the line or plane that lies as close as possible to all data points, in the sense of minimizing the sum of squared residuals. Note that lm1 below is fit on the original, uncentered columns: its slopes are the same as they would be with centered predictors, but its intercept is the expected GDP when every predictor equals zero.
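The effect of centering on the intercept can be checked on a small example. The sketch below uses simulated data (the `toy` data frame and its columns are hypothetical and are not part of the GDP dataset) to show that centering a predictor leaves the slope unchanged and moves the intercept to the mean of the response.

```r
# Simulated data for illustration only; not the GDP dataset.
set.seed(1)
toy <- data.frame(x = rnorm(50, mean = 60, sd = 20))
toy$y <- 100 + 3 * toy$x + rnorm(50, sd = 5)

# Center the predictor by subtracting its mean.
toy$x.c <- toy$x - mean(toy$x)

fit.raw <- lm(y ~ x, data = toy)    # uncentered predictor
fit.c   <- lm(y ~ x.c, data = toy)  # centered predictor

# The slopes agree; the centered intercept equals mean(y).
print(coef(fit.raw))
print(coef(fit.c))
print(mean(toy$y))
```

The same logic carries over to several predictors: centering all of them changes only the intercept.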
lm1<-lm(formula = GDP~ CPI + UnemploymentRate + WorkPopulation + BondYields +
CarRegistration + NetExport, data = GDP_Dataset)
summary(lm1)
##
## Call:
## lm(formula = GDP ~ CPI + UnemploymentRate + WorkPopulation +
## BondYields + CarRegistration + NetExport, data = GDP_Dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1217.2 -598.6 -137.9 429.4 2245.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.985e+03 3.308e+03 1.809 0.0720 .
## CPI 1.727e+02 2.634e+01 6.558 5.25e-10 ***
## UnemploymentRate -8.529e+00 4.182e+01 -0.204 0.8386
## WorkPopulation -4.266e-05 3.130e-05 -1.363 0.1745
## BondYields -2.623e+02 3.001e+01 -8.743 1.33e-15 ***
## CarRegistration -7.015e+00 3.316e+00 -2.116 0.0357 *
## NetExport -3.731e+00 5.821e-01 -6.410 1.17e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 769.4 on 186 degrees of freedom
## Multiple R-squared: 0.9819, Adjusted R-squared: 0.9813
## F-statistic: 1679 on 6 and 186 DF, p-value: < 2.2e-16
The result of the model is shown above.
From the model output and the scatterplot we can make some interesting observations:
CPI, BondYields and NetExport are significantly associated with GDP at the 0.001 level, and CarRegistration at the 0.05 level.
Surprisingly, however, holding the other predictors constant, a one-unit increase in NetExport is associated with an average decline in GDP of about 3.73 (the coefficient is -3.731, in the units of the GDP column). Likewise, holding the other predictors constant, a one-point increase in BondYields is associated with an average decline in GDP of about 262.3.
The F-statistic can also help answer whether there is any relationship between the response and the predictors. It tests the null hypothesis that all slope coefficients are equal to zero, that is, that none of the predictors is associated with GDP. The F-statistic from our model is 1679 on 6 and 186 degrees of freedom, with a p-value below 2.2e-16. So, assuming the number of data points is adequate, we have strong evidence that at least one of the predictors is associated with GDP.
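As a rough check, the p-value of the overall F-test can be recomputed from the numbers printed in the lm1 summary, and the F-statistic itself can be recovered from R-squared. The variable names below are chosen only for this illustration.

```r
# Values copied from the summary(lm1) output above.
f.value <- 1679    # F-statistic
df.num  <- 6       # numerator degrees of freedom (number of predictors)
df.den  <- 186     # denominator (residual) degrees of freedom
r2      <- 0.9819  # multiple R-squared

# Upper-tail probability of the F distribution = p-value of the overall test.
p.value <- pf(f.value, df.num, df.den, lower.tail = FALSE)

# The F-statistic can also be recovered from R-squared.
f.from.r2 <- (r2 / df.num) / ((1 - r2) / df.den)

print(p.value)
print(f.from.r2)  # close to 1679; differs slightly because r2 is rounded
```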
# Plot a correlation graph
GDPdatacor <- cor(GDP_Dataset[2:8])
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
corrplot(GDPdatacor, method = "number")
Notice that the correlation between CPI and WorkPopulation is very high, at approximately 1. This means CPI moves almost in lockstep with WorkPopulation. Because the two predictors are so strongly correlated, we face a collinearity problem: when both are in the model, WorkPopulation is no longer significant after adjusting for CPI.
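One common way to quantify collinearity is the variance inflation factor (VIF), defined for each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The following is a minimal base-R sketch on simulated data; the `toy` data frame and the `vif.by.hand` name are hypothetical. A common rule of thumb treats values well above 5 to 10 as a warning sign.

```r
# Simulated predictors for illustration only; not the GDP dataset.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly a copy of x1, so highly collinear
x3 <- rnorm(n)                  # unrelated to x1 and x2
toy <- data.frame(x1, x2, x3)

# For each predictor, regress it on the others and compute 1 / (1 - R^2).
vif.by.hand <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

print(vif.by.hand)  # x1 and x2 are large, x3 is close to 1
```

Applied to the predictors of lm1, the same calculation would be expected to flag CPI and WorkPopulation.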
Given that we have indications that at least one of the predictors is associated with GDP, and that UnemploymentRate and WorkPopulation have high p-values while CarRegistration is only weakly significant (p = 0.036), we can consider removing these three variables from the model and see how the model fit changes.
lm2<-lm(formula = GDP~ CPI + BondYields + NetExport, data = GDP_Dataset)
summary(lm2)
##
## Call:
## lm(formula = GDP ~ CPI + BondYields + NetExport, data = GDP_Dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1248.0 -599.9 -117.2 413.8 2413.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.9864 346.7802 -0.089 0.929
## CPI 142.1929 3.9585 35.921 < 2e-16 ***
## BondYields -276.9001 29.0424 -9.534 < 2e-16 ***
## NetExport -3.2647 0.4559 -7.161 1.73e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 779.3 on 189 degrees of freedom
## Multiple R-squared: 0.9811, Adjusted R-squared: 0.9808
## F-statistic: 3271 on 3 and 189 DF, p-value: < 2.2e-16
The model excluding UnemploymentRate, WorkPopulation and CarRegistration raised the F-statistic from 1679 to 3271, largely because almost the same explained variance is now attributed to half as many predictors. The fit itself did not improve: the residual standard error rose slightly from 769.4 to 779.3, and adjusted R-squared slipped from 0.9813 to 0.9808. The remaining misfit is possibly due to outlier points in the data.
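A common way to compare two nested models is a partial F-test on the dropped predictors. As a sketch, the test can be approximated from the residual standard errors and degrees of freedom printed in the two summaries; the variable names are illustrative, and the result is only approximate because the printed values are rounded.

```r
# Values copied from the summary(lm1) and summary(lm2) output above.
rse.full <- 769.4; df.full <- 186  # lm1: six predictors
rse.red  <- 779.3; df.red  <- 189  # lm2: three predictors

# Residual sum of squares = RSE^2 * residual degrees of freedom.
rss.full <- rse.full^2 * df.full
rss.red  <- rse.red^2  * df.red

# Partial F-statistic for the three dropped predictors.
q <- df.red - df.full
f.partial <- ((rss.red - rss.full) / q) / (rss.full / df.full)
p.partial <- pf(f.partial, q, df.full, lower.tail = FALSE)

print(c(F = f.partial, p = p.partial))
```

With the fitted objects available, `anova(lm2, lm1)` performs the same test without the rounding error.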
plot(lm2, pch=16, which=1)
Note how the residuals plot of this last model shows some important points still lying far away from the middle area of the graph.
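Points that sit far from the center of a residual plot can also be listed numerically. The sketch below, on simulated data with one injected outlier, flags observations whose standardized residual exceeds 3 in absolute value; the threshold is a rule of thumb, and the data and names are hypothetical.

```r
# Simulated data for illustration only; not the GDP dataset.
set.seed(7)
x <- 1:50
y <- 2 * x + rnorm(50)
y[25] <- y[25] + 30  # inject one clear outlier at observation 25

fit <- lm(y ~ x)

# Standardized residuals; |value| > 3 is a common rule-of-thumb cutoff.
std.res <- rstandard(fit)
flagged <- which(abs(std.res) > 3)

print(flagged)
```

The same two lines could be applied to lm2 to see which quarters drive the large residuals.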
At this stage we could try a few different transformations on both the predictors and the response variable to see how this would improve the model fit. For now, let’s apply a logarithmic transformation with the log function on the GDP variable. Also, we could try to square predictors. Let’s apply these suggested transformations directly into the model function and see what happens with both the model fit and the model accuracy.
lm3 = lm(log(GDP) ~ cpi.c + I(cpi.c^2) + bondy.c + I(bondy.c^2) + export.c + I(export.c^2) , data=newGDPdata)
summary(lm3)
##
## Call:
## lm(formula = log(GDP) ~ cpi.c + I(cpi.c^2) + bondy.c + I(bondy.c^2) +
## export.c + I(export.c^2), data = newGDPdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.162833 -0.019783 -0.003236 0.021791 0.127302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.898e+00 8.161e-03 1090.268 < 2e-16 ***
## cpi.c 2.811e-02 4.550e-04 61.790 < 2e-16 ***
## I(cpi.c^2) -1.475e-04 9.385e-06 -15.722 < 2e-16 ***
## bondy.c 1.568e-02 4.575e-03 3.427 0.000752 ***
## I(bondy.c^2) -1.925e-03 5.813e-04 -3.312 0.001114 **
## export.c -2.311e-04 4.615e-05 -5.007 1.28e-06 ***
## I(export.c^2) -3.686e-07 1.020e-07 -3.614 0.000387 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0456 on 186 degrees of freedom
## Multiple R-squared: 0.9972, Adjusted R-squared: 0.9971
## F-statistic: 1.116e+04 on 6 and 186 DF, p-value: < 2.2e-16
# Plot model residuals.
plot(lm3, pch=16, which=1)
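By log-transforming the response and adding squared terms for the predictors, we achieve an improved model fit. Adjusted R-squared has risen to 0.9971, all predictors' p-values are significant, and the F-statistic has increased to 11160. Keep in mind that the residual standard error (0.0456) and R-squared are now measured on the log(GDP) scale, so they are not directly comparable with those of lm1 and lm2.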
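Because the response is now log(GDP), the coefficients describe relative rather than absolute changes. The sketch below back-transforms the fitted curve for CPI using the coefficients printed in the lm3 summary, with the other centered predictors held at their means (zero). The helper `log.gdp` and the variable names are hypothetical, and exponentiating a prediction on the log scale ignores retransformation bias.

```r
# Coefficients copied from the summary(lm3) output above.
b0     <- 8.898       # intercept
b.cpi  <- 2.811e-02   # cpi.c
b.cpi2 <- -1.475e-04  # I(cpi.c^2)

# Predicted log(GDP) as a function of cpi.c, other predictors at their means.
log.gdp <- function(cpi.c) b0 + b.cpi * cpi.c + b.cpi2 * cpi.c^2

# Back-transform to the GDP scale.
gdp.at.mean  <- exp(log.gdp(0))
gdp.plus.one <- exp(log.gdp(1))

# Relative change in predicted GDP for a one-unit rise in CPI from its mean.
pct.change <- 100 * (gdp.plus.one / gdp.at.mean - 1)

print(gdp.at.mean)
print(pct.change)
```

Because of the squared term, the percentage change depends on the starting value of CPI, so the figure applies only near the mean.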
In summary, we've seen a few different multiple linear regression models applied to the GDP dataset. We first tried a linear approach, then created a correlation matrix to understand how the variables were correlated with each other. Subsequently, we transformed the variables to see the effect on the model.
CPI, BondYields and NetExport are significantly associated with GDP.
In the first model, a one-unit increase in NetExport is associated with an average GDP decline of about 3.73, holding the other predictors constant.
A one-point increase in BondYields is associated with an average GDP decline of about 262.3.
A one-point increase in CPI is associated with an average GDP increase of about 172.7.