Load Data In

library(readr)

## Warning: package 'readr' was built under R version 3.4.3

GDP_Dataset <- read_csv("C:/Users/Cong/Desktop/GDP Dataset.csv")

## Parsed with column specification:
## cols(
##   Quarter = col_character(),
##   GDP = col_double(),
##   CPI = col_double(),
##   UnemploymentRate = col_double(),
##   WorkPopulation = col_double(),
##   BondYields = col_double(),
##   CarRegistration = col_double(),
##   NetExport = col_double(),
##   DispInc = col_double(),
##   PrConEx = col_double(),
##   PrDoInv = col_double(),
##   DebtToGDP = col_double()
## )

View(GDP_Dataset)

summary(GDP_Dataset)

##    Quarter               GDP             CPI         UnemploymentRate
##  Length:193         Min.   : 1054   Min.   : 17.47   Min.   : 3.950  
##  Class :character   1st Qu.: 3284   1st Qu.: 43.38   1st Qu.: 5.270  
##  Mode  :character   Median : 7136   Median : 67.28   Median : 5.950  
##                     Mean   : 8354   Mean   : 65.89   Mean   : 6.377  
##                     3rd Qu.:13649   3rd Qu.: 91.48   3rd Qu.: 7.410  
##                     Max.   :19965   Max.   :114.39   Max.   :10.910  
##  WorkPopulation        BondYields     CarRegistration    NetExport      
##  Min.   :117082333   Min.   : 1.560   Min.   : 81.87   Min.   :-805.59  
##  1st Qu.:146310667   1st Qu.: 4.230   1st Qu.:135.17   1st Qu.:-501.59  
##  Median :165234000   Median : 6.340   Median :148.38   Median :-108.84  
##  Mean   :166690396   Mean   : 6.463   Mean   :152.84   Mean   :-246.72  
##  3rd Qu.:192357667   3rd Qu.: 8.060   3rd Qu.:175.22   3rd Qu.: -23.19  
##  Max.   :206284333   Max.   :14.850   Max.   :224.32   Max.   :  21.58  
##     DispInc         PrConEx         PrDoInv         DebtToGDP     
##  Min.   : 3359   Min.   : 2882   Min.   : 166.8   Min.   : 30.60  
##  1st Qu.: 4765   1st Qu.: 4066   1st Qu.: 593.6   1st Qu.: 35.13  
##  Median : 6911   Median : 6260   Median :1202.1   Median : 57.56  
##  Mean   : 7579   Mean   : 6819   Mean   :1433.1   Mean   : 57.76  
##  3rd Qu.:10534   3rd Qu.: 9729   3rd Qu.:2156.5   3rd Qu.: 64.08  
##  Max.   :12931   Max.   :12067   Max.   :3377.2   Max.   :105.67

Center Predictors

cpi.c <- scale(GDP_Dataset$CPI, center=TRUE, scale=FALSE)
umploy.c <- scale(GDP_Dataset$UnemploymentRate, center=TRUE, scale=FALSE)
workp.c <- scale(GDP_Dataset$WorkPopulation, center=TRUE, scale=FALSE)
bondy.c <- scale(GDP_Dataset$BondYields, center=TRUE, scale=FALSE)
carr.c <- scale(GDP_Dataset$CarRegistration, center=TRUE, scale=FALSE)
export.c <- scale(GDP_Dataset$NetExport, center=TRUE, scale=FALSE)
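As a quick illustration of what the centering calls above do, here is a minimal sketch on a made-up vector (the values of `x` are hypothetical, not taken from the GDP data). It suggests that `scale()` with `scale=FALSE` simply subtracts the mean, and that the result comes back as a one-column matrix rather than a plain vector.

```r
# Toy vector standing in for one predictor column (hypothetical values).
x <- c(2, 4, 6, 8, 10)

# With scale = FALSE, scale() only subtracts the column mean.
x.c <- scale(x, center = TRUE, scale = FALSE)

# The result is a one-column matrix, not a plain numeric vector.
is.matrix(x.c)

# Dropping the matrix structure gives the same values as x - mean(x).
x.c.vec <- as.numeric(x.c)
all.equal(x.c.vec, x - mean(x))

# A centered variable has mean zero (up to floating-point error).
mean(x.c.vec)
```

Because the centered objects are matrices, wrapping them in `as.numeric()` is one way to get ordinary vectors if a later step requires them.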

Bind these new variables to GDP_Dataset in a new data frame.

new.c.vars <- cbind(cpi.c, umploy.c, workp.c, bondy.c, carr.c, export.c)
newGDPdata <- cbind(GDP_Dataset, new.c.vars)
names(newGDPdata)[13:18] <- c("cpi.c", "umploy.c", "workp.c", "bondy.c", "carr.c", "export.c")

Subset those centered variables

lmGDPdata <- newGDPdata[,c(2,13:18)]
summary(lmGDPdata)

##       GDP            cpi.c            umploy.c          workp.c         
##  Min.   : 1054   Min.   :-48.421   Min.   :-2.4269   Min.   :-49608062  
##  1st Qu.: 3284   1st Qu.:-22.511   1st Qu.:-1.1069   1st Qu.:-20379729  
##  Median : 7136   Median :  1.389   Median :-0.4269   Median : -1456396  
##  Mean   : 8354   Mean   :  0.000   Mean   : 0.0000   Mean   :        0  
##  3rd Qu.:13649   3rd Qu.: 25.589   3rd Qu.: 1.0331   3rd Qu.: 25667271  
##  Max.   :19965   Max.   : 48.499   Max.   : 4.5331   Max.   : 39593938  
##     bondy.c            carr.c           export.c     
##  Min.   :-4.9033   Min.   :-70.972   Min.   :-558.9  
##  1st Qu.:-2.2333   1st Qu.:-17.663   1st Qu.:-254.9  
##  Median :-0.1233   Median : -4.461   Median : 137.9  
##  Mean   : 0.0000   Mean   :  0.000   Mean   :   0.0  
##  3rd Qu.: 1.5967   3rd Qu.: 22.386   3rd Qu.: 223.5  
##  Max.   : 8.3867   Max.   : 71.484   Max.   : 268.3

Plot matrix of all variables.

plot(lmGDPdata, pch=16, col="blue", main="Matrix Scatterplot for Predictors")

The matrix plot above lets us visualize the pairwise relationships among all variables in a single image. For example, we can see how GDP and cpi.c are related. Another interesting example is the relationship between GDP and export.c: as NetExport increases, GDP declines.

Fitting a Multiple linear regression

We now fit a linear regression on this dataset and see how well it models the observed data. We include all six predictors and give each of them a separate slope coefficient. For our multiple linear regression example, we want to estimate the following equation:

GDP = B0 + B1 * cpi.c + B2 * umploy.c + B3 * workp.c + B4 * bondy.c + B5 * carr.c + B6 * export.c

The model will estimate the value of the intercept (B0) and one slope for each predictor (B1 through B6). With centered predictors, the intercept is the expected GDP when every predictor is at its mean. Note that the call below uses the original, uncentered columns, so its slopes are identical but its intercept is instead the expected GDP when every predictor equals zero. Each slope estimate is the average change in GDP associated with a one-unit increase in that predictor, holding the others constant. We want our model to fit a line or plane through the observed data so that it is as close as possible to all data points.
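One way to check how centering affects the intercept and the slopes is to fit the same model twice on simulated data. The sketch below uses made-up variables (`x1`, `x2`, `y` and their generating values are assumptions for illustration only, not the GDP data).

```r
# Simulated data; all names and numbers here are hypothetical.
set.seed(1)
n  <- 50
x1 <- rnorm(n, mean = 60, sd = 20)
x2 <- rnorm(n, mean = 6, sd = 2)
y  <- 100 + 3 * x1 - 5 * x2 + rnorm(n, sd = 4)

# Fit once on the raw predictors and once on mean-centered copies.
fit.raw <- lm(y ~ x1 + x2)
x1.c <- x1 - mean(x1)
x2.c <- x2 - mean(x2)
fit.centered <- lm(y ~ x1.c + x2.c)

# The slopes are unchanged by centering.
slopes.raw      <- unname(coef(fit.raw)[-1])
slopes.centered <- unname(coef(fit.centered)[-1])
all.equal(slopes.raw, slopes.centered)

# With centered predictors, the intercept equals the mean of the response.
all.equal(unname(coef(fit.centered)[1]), mean(y))
```

Only the intercept and its standard error change, which is why centering is mostly a matter of interpretation here.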

lm1 <- lm(formula = GDP ~ CPI + UnemploymentRate + WorkPopulation + BondYields +
            CarRegistration + NetExport, data = GDP_Dataset)
summary(lm1)

## 
## Call:
## lm(formula = GDP ~ CPI + UnemploymentRate + WorkPopulation + 
##     BondYields + CarRegistration + NetExport, data = GDP_Dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1217.2  -598.6  -137.9   429.4  2245.6 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.985e+03  3.308e+03   1.809   0.0720 .  
## CPI               1.727e+02  2.634e+01   6.558 5.25e-10 ***
## UnemploymentRate -8.529e+00  4.182e+01  -0.204   0.8386    
## WorkPopulation   -4.266e-05  3.130e-05  -1.363   0.1745    
## BondYields       -2.623e+02  3.001e+01  -8.743 1.33e-15 ***
## CarRegistration  -7.015e+00  3.316e+00  -2.116   0.0357 *  
## NetExport        -3.731e+00  5.821e-01  -6.410 1.17e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 769.4 on 186 degrees of freedom
## Multiple R-squared:  0.9819, Adjusted R-squared:  0.9813 
## F-statistic:  1679 on 6 and 186 DF,  p-value: < 2.2e-16

The result of the model is shown above.

From the model output and the scatterplot we can make some interesting observations:

CPI, BondYields and NetExport are significantly associated with GDP at the 0.001 level; CarRegistration is significant only at the 0.05 level.

Somewhat surprisingly, holding the other predictors constant, a one-unit increase in NetExport is associated with an average GDP decline of about 3.73 units (coefficient -3.731), measured in the units of the GDP column. Likewise, holding the other predictors constant, a one-point increase in bond yields is associated with an average GDP decline of about 262.3 units.

The F-statistic helps answer whether there is any relationship between the response and the predictors. It tests the null hypothesis that all slope coefficients are equal to zero. The F-statistic from our model is 1679 on 6 and 186 degrees of freedom, with a p-value below 2.2e-16. So, assuming the number of data points is appropriate and the model assumptions are reasonable, we have strong evidence that at least one of the predictors is associated with GDP.
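For reference, the overall F-statistic can be written in terms of R-squared. The notation below is an assumption added for illustration ($p$ predictors, $n$ observations); plugging in the rounded values from the summary output gives a figure close to the reported 1679, with the small gap coming from rounding R-squared to four digits.

```latex
F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}
  = \frac{0.9819 / 6}{(1 - 0.9819) / 186}
  \approx 1682
```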

Let’s look into this further with a correlation plot:

# Plot a correlation graph
GDPdatacor <- cor(GDP_Dataset[2:8])
library(corrplot)

## Warning: package 'corrplot' was built under R version 3.4.4

## corrplot 0.84 loaded

corrplot(GDPdatacor, method = "number")

Notice that the correlation between CPI and WorkPopulation is very high: it rounds to 1 in the plot. Because these two predictors are so strongly correlated, we face a problem of collinearity. In essence, when both are put together in the model, WorkPopulation is no longer significant after adjusting for CPI.
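Collinearity can also be quantified with a variance inflation factor (VIF), a diagnostic that is not part of the original analysis. The sketch below computes it in base R on simulated predictors (`a`, `b`, `d` are hypothetical names); each VIF is 1 / (1 - R²) from regressing one predictor on the others, and values far above 10 are commonly read as a sign of strong collinearity.

```r
# Simulated predictors; names and values are hypothetical.
set.seed(2)
n <- 100
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.05)  # nearly a copy of a, so a and b are collinear
d <- rnorm(n)                 # unrelated to a and b
predictors <- data.frame(a = a, b = b, d = d)

# VIF for each predictor: regress it on all the others and use 1 / (1 - R^2).
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

round(vif, 1)
```

Applied to the GDP predictors, the same loop would be expected to flag CPI and WorkPopulation, though that has not been run here.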

Fit a linear model excluding the variables with high p-values

Given that we have indications that at least one of the predictors is associated with GDP, and since UnemploymentRate and WorkPopulation have high p-values while CarRegistration is only weakly significant (p = 0.0357), we can consider removing these three variables from the model and see how the model fit changes.

lm2<-lm(formula = GDP~ CPI + BondYields + NetExport, data = GDP_Dataset)
summary(lm2)

## 
## Call:
## lm(formula = GDP ~ CPI + BondYields + NetExport, data = GDP_Dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1248.0  -599.9  -117.2   413.8  2413.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -30.9864   346.7802  -0.089    0.929    
## CPI          142.1929     3.9585  35.921  < 2e-16 ***
## BondYields  -276.9001    29.0424  -9.534  < 2e-16 ***
## NetExport     -3.2647     0.4559  -7.161 1.73e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 779.3 on 189 degrees of freedom
## Multiple R-squared:  0.9811, Adjusted R-squared:  0.9808 
## F-statistic:  3271 on 3 and 189 DF,  p-value: < 2.2e-16

The model excluding UnemploymentRate, WorkPopulation and CarRegistration raises the F-statistic from 1679 to 3271, largely because nearly the same fit is now achieved with three predictors instead of six. However, the residual standard error (769.4 to 779.3) and the adjusted R-squared (0.9813 to 0.9808) are essentially unchanged, and in fact marginally worse. This is possibly due to the presence of outlying points in the data.
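A more direct way to compare a reduced model with the full one is a nested F-test via `anova()`, optionally alongside `AIC()`. This is a sketch on simulated data rather than the author's procedure; `sim`, `noise1` and `noise2` are hypothetical names.

```r
# Simulated data: y depends on x1 and x2 only; noise1 and noise2 are irrelevant.
set.seed(3)
n <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
noise1 <- rnorm(n)
noise2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2, noise1, noise2)

full    <- lm(y ~ x1 + x2 + noise1 + noise2, data = sim)
reduced <- lm(y ~ x1 + x2, data = sim)

# Nested F-test: a large p-value suggests the dropped terms add little.
cmp <- anova(reduced, full)
cmp

# AIC gives a complementary comparison; lower is better.
AIC(reduced, full)
```

For the GDP models, the analogous call would be `anova(lm2, lm1)`, since `lm2` is nested in `lm1`.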

Let’s plot this last model’s residuals:

plot(lm2, pch=16, which=1)

Note how the residuals plot of this last model still shows some points lying far from the zero line, along with a visible pattern rather than a random scatter.

Improve Model Fit

At this stage we could try a few different transformations on both the predictors and the response variable to see how this would improve the model fit. For now, let’s apply a logarithmic transformation with the log function on the GDP variable. Also, we could try to square predictors. Let’s apply these suggested transformations directly into the model function and see what happens with both the model fit and the model accuracy.
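To make the interpretation of such a model concrete, consider a single centered predictor $x$ with a squared term (the symbols $x$, $\beta_0$, $\beta_1$, $\beta_2$ are notation assumed here for illustration). The effect of $x$ on the log response is no longer constant but depends on where $x$ is evaluated:

```latex
\log(\mathrm{GDP}) = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots
\qquad\Longrightarrow\qquad
\frac{\partial \log(\mathrm{GDP})}{\partial x} = \beta_1 + 2\,\beta_2\, x
```

At $x = 0$, which for a centered predictor is its mean, the slope is $\beta_1$, and for small $\beta_1$ it can be read as an approximate proportional change in GDP per unit of $x$.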

lm3 <- lm(log(GDP) ~ cpi.c + I(cpi.c^2) + bondy.c + I(bondy.c^2) +
            export.c + I(export.c^2), data = newGDPdata)
summary(lm3)

## 
## Call:
## lm(formula = log(GDP) ~ cpi.c + I(cpi.c^2) + bondy.c + I(bondy.c^2) + 
##     export.c + I(export.c^2), data = newGDPdata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.162833 -0.019783 -0.003236  0.021791  0.127302 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    8.898e+00  8.161e-03 1090.268  < 2e-16 ***
## cpi.c          2.811e-02  4.550e-04   61.790  < 2e-16 ***
## I(cpi.c^2)    -1.475e-04  9.385e-06  -15.722  < 2e-16 ***
## bondy.c        1.568e-02  4.575e-03    3.427 0.000752 ***
## I(bondy.c^2)  -1.925e-03  5.813e-04   -3.312 0.001114 ** 
## export.c      -2.311e-04  4.615e-05   -5.007 1.28e-06 ***
## I(export.c^2) -3.686e-07  1.020e-07   -3.614 0.000387 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0456 on 186 degrees of freedom
## Multiple R-squared:  0.9972, Adjusted R-squared:  0.9971 
## F-statistic: 1.116e+04 on 6 and 186 DF,  p-value: < 2.2e-16

# Plot model residuals.
plot(lm3, pch=16, which=1)

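With the log-transformed response and the added squared terms, we achieve an improved model fit. The adjusted R-squared rises to 0.9971, every term is significant at the 0.01 level, and the F-statistic increases to 11160. Note that this R-squared is measured on the log(GDP) scale, so it is not directly comparable with the R-squared values of the earlier models.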
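Because the log model is fitted on a different scale, one hedged way to compare it with a linear model is to back-transform its fitted values and compute the error on the original scale. The sketch below does this on simulated data; `x`, `y` and the helper `rmse` are hypothetical and not part of the original analysis.

```r
# Simulated positive response with multiplicative noise (hypothetical values).
set.seed(4)
n <- 80
x <- runif(n, 0, 10)
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.2))

fit.lin <- lm(y ~ x)        # model on the original scale
fit.log <- lm(log(y) ~ x)   # model on the log scale

pred.lin <- fitted(fit.lin)
# Naive back-transform: exp() of the fitted log values. This targets the
# conditional median rather than the mean, so treat it as a rough comparison.
pred.log <- exp(fitted(fit.log))

# Root mean squared error on the original scale of y.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

c(linear = rmse(y, pred.lin), log.model = rmse(y, pred.log))
```

For the GDP models, the equivalent comparison would use `exp(fitted(lm3))` against `fitted(lm2)`, both measured against `GDP_Dataset$GDP`.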

Data Analysis Approach Summary

In summary, we’ve seen a few different multiple linear regression models applied to the GDP dataset. We first tried a purely linear approach. We then created a correlation matrix to understand how the variables are correlated with one another. Finally, we transformed the variables to see the effect on the model.

Conclusion

CPI, BondYields and NetExport are significantly associated with GDP.

Holding the other predictors constant, a one-unit increase in NetExport is associated with an average GDP decline of about 3.73 units.

Holding the other predictors constant, a one-point increase in bond yields is associated with an average GDP decline of about 262.3 units.

Holding the other predictors constant, a one-point increase in CPI is associated with an average GDP increase of about 172.7 units.

ANLY 510-51 Final Project

Sihui Zhang

June 7, 2018