Introduction to R

Loading data

Step 1: Install required packages using the install.packages() function.

Step 2: Load required packages using library() function.

To install and load several packages at once, create a list of packages you will use (named pkgs in the code below) and then use the lapply() function.

Uncomment and run the code.

pkgs = c("readxl", "dplyr", "stats","forecast","zoo","corrplot","MASS","vars","ggplot2","ggfortify","car")

#lapply(pkgs, install.packages, character.only = T)

suppressPackageStartupMessages(lapply(pkgs,library, character.only = T))
## [[1]]
## [1] "readxl"    "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
## [1] "dplyr"     "readxl"    "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"     
## 
## [[3]]
## [1] "dplyr"     "readxl"    "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "forecast"  "dplyr"     "readxl"    "stats"     "graphics" 
##  [6] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[5]]
##  [1] "zoo"       "forecast"  "dplyr"     "readxl"    "stats"    
##  [6] "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [11] "base"     
## 
## [[6]]
##  [1] "corrplot"  "zoo"       "forecast"  "dplyr"     "readxl"   
##  [6] "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [11] "methods"   "base"     
## 
## [[7]]
##  [1] "MASS"      "corrplot"  "zoo"       "forecast"  "dplyr"    
##  [6] "readxl"    "stats"     "graphics"  "grDevices" "utils"    
## [11] "datasets"  "methods"   "base"     
## 
## [[8]]
##  [1] "vars"        "lmtest"      "urca"        "strucchange" "sandwich"   
##  [6] "MASS"        "corrplot"    "zoo"         "forecast"    "dplyr"      
## [11] "readxl"      "stats"       "graphics"    "grDevices"   "utils"      
## [16] "datasets"    "methods"     "base"       
## 
## [[9]]
##  [1] "ggplot2"     "vars"        "lmtest"      "urca"        "strucchange"
##  [6] "sandwich"    "MASS"        "corrplot"    "zoo"         "forecast"   
## [11] "dplyr"       "readxl"      "stats"       "graphics"    "grDevices"  
## [16] "utils"       "datasets"    "methods"     "base"       
## 
## [[10]]
##  [1] "ggfortify"   "ggplot2"     "vars"        "lmtest"      "urca"       
##  [6] "strucchange" "sandwich"    "MASS"        "corrplot"    "zoo"        
## [11] "forecast"    "dplyr"       "readxl"      "stats"       "graphics"   
## [16] "grDevices"   "utils"       "datasets"    "methods"     "base"       
## 
## [[11]]
##  [1] "car"         "carData"     "ggfortify"   "ggplot2"     "vars"       
##  [6] "lmtest"      "urca"        "strucchange" "sandwich"    "MASS"       
## [11] "corrplot"    "zoo"         "forecast"    "dplyr"       "readxl"     
## [16] "stats"       "graphics"    "grDevices"   "utils"       "datasets"   
## [21] "methods"     "base"

Step 3: Upload the file to your working directory and then load the file using the read_excel() function

We’ll use the tutorial_01.xlsx in the shared folder link for this tutorial. We name the dataframe gdpdata.

#gdpdata <- read_excel("tutorial_01.xlsx")
gdpdata_gr <- read_excel("tutorial_01.xlsx", sheet = "Sheet2")

Tidying data

Step 3.1: Use the head() and tail() functions to view the first and last six rows of the dataframe respectively.

head(gdpdata_gr)
## # A tibble: 6 x 11
##   Date                   Agri   Manufc Utilities Constrn   Trade
##   <dttm>                <dbl>    <dbl>     <dbl>   <dbl>   <dbl>
## 1 2007-06-01 00:00:00  0.0364  0.383     -0.0918  0.0776  0.303 
## 2 2007-09-01 00:00:00  0.258   0.0503    -0.186  -0.0871  0.0413
## 3 2007-12-01 00:00:00  0.309   0.0925     0.285   0.0570  0.136 
## 4 2008-03-01 00:00:00 -0.383  -0.371      0.130   0.109  -0.240 
## 5 2008-06-01 00:00:00  0.0179  0.441     -0.0969  0.0543  0.0484
## 6 2008-09-01 00:00:00  0.242  -0.00823   -0.162  -0.0911  0.0967
## # … with 5 more variables: WS_Ret_trade <dbl>, Hotel <dbl>, `Financial
## #   services` <dbl>, `Real estate` <dbl>, RGDP <dbl>
tail(gdpdata_gr)
## # A tibble: 6 x 11
##   Date                  Agri  Manufc Utilities Constrn   Trade WS_Ret_trade
##   <dttm>               <dbl>   <dbl>     <dbl>   <dbl>   <dbl>        <dbl>
## 1 2017-09-01 00:00:00  0.347  0.0487    -0.172  0.227  -0.0186     -0.0128 
## 2 2017-12-01 00:00:00  0.147  0.167      0.291 -0.0460  0.306       0.292  
## 3 2018-03-01 00:00:00 -0.520 -0.399      0.132 -0.0754 -0.279      -0.263  
## 4 2018-06-01 00:00:00  0.279  0.362     -0.141  0.0254  0.118       0.0989 
## 5 2018-09-01 00:00:00  0.447  0.0345    -0.192  0.181  -0.0161     -0.00993
## 6 2018-12-01 00:00:00  0.146  0.208      0.288 -0.0162  0.306       0.292  
## # … with 4 more variables: Hotel <dbl>, `Financial services` <dbl>, `Real
## #   estate` <dbl>, RGDP <dbl>

Step 3.2: Use the colnames() or names() functions to view the column names of the dataframe.

colnames(gdpdata_gr)
##  [1] "Date"               "Agri"               "Manufc"            
##  [4] "Utilities"          "Constrn"            "Trade"             
##  [7] "WS_Ret_trade"       "Hotel"              "Financial services"
## [10] "Real estate"        "RGDP"

Step 3.3: To rename a column use the colnames(data)[i] function where data is the dataframe and i is the column index whose name you want to change.

In this dataframe, we want to change the names of columns 9 and 10 from “Financial services”and “Real estate” to “Fin_serv” and “Real_est” respectively.

colnames(gdpdata_gr)[9]<- "Fin_serv"
colnames(gdpdata_gr)[10]<- "Real_est"
colnames(gdpdata_gr)
##  [1] "Date"         "Agri"         "Manufc"       "Utilities"   
##  [5] "Constrn"      "Trade"        "WS_Ret_trade" "Hotel"       
##  [9] "Fin_serv"     "Real_est"     "RGDP"

The linear regression model

Consider a multiple linear regression model with dependent variable \(Y_i\) and \(p\) explanatory variables \(X_{1,i}, X_{2,i},..., X_{p,i}\) of the form:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2X_2 + \beta_3 X_3 + ... + \beta_p X_p+\epsilon \] where \(\epsilon_i\) is identically, independently and normally distributed, i.e., \(\epsilon_i\) \(i.i.d N(0,\sigma^2)\)

The regression coefficients \(\beta_0,\beta_1,...,\beta_p\) are unknown and they are estimated using the least squares approach . We choose \(\beta_0,\beta_1,...,\beta_p\) to minimize the sum of squared residuals. \[RSS = {\sum\limits_{i=1}^{n}(y_i – \hat{y})^2} \]

Under the null hypothesis, we assume that there is no linear relationship between \(Y\) and \(X_i\) for \(i = 1,...,p\).

\[H_0 : \beta_i = 0 \] versus \[H_a : \beta_i \neq 0 \] To test the null hypothesis, we need to determine whether \(\hat{\beta}_i\), our estimate for \(\beta_i\), is sufficiently far from zero that we can be confident that \(\beta_i\) is non-zero. For this, we compute the t-statistic,given by

\[ t = \frac{\hat{\beta}_i - 0}{SE(\hat{\beta}_i)},\] which measures the number of standard deviations that \(\hat{\beta}_i\) is away from 0. If there really is no relationship between \(X_1,...,X_p\) and \(Y\) , then we expect that the t-statistic will have a t-distribution with n−p-1 degrees of freedom. For large \(n\) (\(n>30\)), the t-distribution is quite similar to the normal distribution. Consequently, it is a simple matter to compute the probability of observing any value equal to \(|t|\) or larger, assuming \(\beta_i\) = 0. We call this probability the p-value. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small p-value (generally, <5%), we can infer that there is an association between the predictor and the response at 5% significance level.

Fitting the linear regression model

To fit the linear regression model, first, we format the data to a dataframe, and then we use the lm() function to fit the linear regression model. The syntax lm(y∼x1+x2+x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.

For example, we can to fit a linear regression model with the dependent variable: RGDP and ‘independent’ or more appropriately explanatory variables: Agri, Manfc, Utilities, Constrn, Trade, WS_Ret_Trade, Hotel, Fin_serv, and Real_est.

gdpdata_gr<- gdpdata_gr %>% as.data.frame() 


model00 <- lm(RGDP ~ Agri+ Manufc+ Utilities+ Constrn+ Trade+ WS_Ret_trade+ Hotel+Fin_serv+ Real_est, data =gdpdata_gr)

summary(model00)
## 
## Call:
## lm(formula = RGDP ~ Agri + Manufc + Utilities + Constrn + Trade + 
##     WS_Ret_trade + Hotel + Fin_serv + Real_est, data = gdpdata_gr)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.027122 -0.005391  0.001581  0.004913  0.023608 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.008155   0.001851  -4.406 8.67e-05 ***
## Agri          0.050215   0.027997   1.794 0.081057 .  
## Manufc        0.173408   0.038983   4.448 7.64e-05 ***
## Utilities     0.067501   0.044497   1.517 0.137769    
## Constrn       0.046504   0.019489   2.386 0.022254 *  
## Trade        -0.481205   0.482937  -0.996 0.325522    
## WS_Ret_trade  0.542363   0.469618   1.155 0.255535    
## Hotel         0.072920   0.020843   3.499 0.001236 ** 
## Fin_serv      0.227130   0.054899   4.137 0.000194 ***
## Real_est      0.034409   0.023993   1.434 0.159940    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01036 on 37 degrees of freedom
## Multiple R-squared:  0.9928, Adjusted R-squared:  0.9911 
## F-statistic: 570.1 on 9 and 37 DF,  p-value: < 2.2e-16

The gdpdata_gr data set contains 10 variables, and so it is cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand

gdpdata_gr0<- gdpdata_gr %>% dplyr::select(-Date)
# we exclude the date column 

model01 <- lm(RGDP ~. ,data =gdpdata_gr0)

summary(model01)
## 
## Call:
## lm(formula = RGDP ~ ., data = gdpdata_gr0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.027122 -0.005391  0.001581  0.004913  0.023608 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.008155   0.001851  -4.406 8.67e-05 ***
## Agri          0.050215   0.027997   1.794 0.081057 .  
## Manufc        0.173408   0.038983   4.448 7.64e-05 ***
## Utilities     0.067501   0.044497   1.517 0.137769    
## Constrn       0.046504   0.019489   2.386 0.022254 *  
## Trade        -0.481205   0.482937  -0.996 0.325522    
## WS_Ret_trade  0.542363   0.469618   1.155 0.255535    
## Hotel         0.072920   0.020843   3.499 0.001236 ** 
## Fin_serv      0.227130   0.054899   4.137 0.000194 ***
## Real_est      0.034409   0.023993   1.434 0.159940    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01036 on 37 degrees of freedom
## Multiple R-squared:  0.9928, Adjusted R-squared:  0.9911 
## F-statistic: 570.1 on 9 and 37 DF,  p-value: < 2.2e-16

In this case, explanatory variables: Manufc, Fin_serv, Hotel,Constrn and Agri are significant at the 5% significance level.

What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, Trade has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except Trade.

model02 <- lm(RGDP ~. -Trade ,data =gdpdata_gr0)

summary(model02)
## 
## Call:
## lm(formula = RGDP ~ . - Trade, data = gdpdata_gr0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0260446 -0.0066672  0.0008535  0.0070744  0.0219046 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.007575   0.001757  -4.312 0.000111 ***
## Agri          0.039399   0.025805   1.527 0.135087    
## Manufc        0.143819   0.025253   5.695 1.49e-06 ***
## Utilities     0.037467   0.032729   1.145 0.259473    
## Constrn       0.050684   0.019030   2.663 0.011283 *  
## WS_Ret_trade  0.076151   0.040238   1.893 0.066057 .  
## Hotel         0.078088   0.020186   3.868 0.000416 ***
## Fin_serv      0.242415   0.052707   4.599 4.60e-05 ***
## Real_est      0.031522   0.023815   1.324 0.193539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01036 on 38 degrees of freedom
## Multiple R-squared:  0.9926, Adjusted R-squared:  0.9911 
## F-statistic: 641.4 on 8 and 38 DF,  p-value: < 2.2e-16

Alternatively, the update() function can be used.

model03 = update (model00, ~.-Trade)
summary(model03)
## 
## Call:
## lm(formula = RGDP ~ Agri + Manufc + Utilities + Constrn + WS_Ret_trade + 
##     Hotel + Fin_serv + Real_est, data = gdpdata_gr)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0260446 -0.0066672  0.0008535  0.0070744  0.0219046 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.007575   0.001757  -4.312 0.000111 ***
## Agri          0.039399   0.025805   1.527 0.135087    
## Manufc        0.143819   0.025253   5.695 1.49e-06 ***
## Utilities     0.037467   0.032729   1.145 0.259473    
## Constrn       0.050684   0.019030   2.663 0.011283 *  
## WS_Ret_trade  0.076151   0.040238   1.893 0.066057 .  
## Hotel         0.078088   0.020186   3.868 0.000416 ***
## Fin_serv      0.242415   0.052707   4.599 4.60e-05 ***
## Real_est      0.031522   0.023815   1.324 0.193539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01036 on 38 degrees of freedom
## Multiple R-squared:  0.9926, Adjusted R-squared:  0.9911 
## F-statistic: 641.4 on 8 and 38 DF,  p-value: < 2.2e-16

Assessing the Accuracy of the Model

The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the \(R^2\).

Residual Standard Error

The RSE is an estimate of the standard deviation of \(\epsilon\). Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is calculated as follows: \[RSE =\sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}{\sum\limits_{i=1}^{n}(y_i – \hat{y})^2}}\] If the predictions obtained using the model are very close to the true outcome values—that is, if \(\hat{y_i} \approx y_i\) for \(i = 1,...,n\), the RSE will be small, and we can conclude that the model fits the data very well. On the other hand, if \(\hat{y_i}\) is very far from \(y_i\) for one or more observations, then the RSE may be quite large, indicating that the model doesn’t fit the data well.

\(R^2\) statistic

\(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\). An \(R^2\) statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. A number near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error \(\sigma^2\) is high, or both.

\[R^2 = \frac{TSS-RSS}{TSS} = 1- \frac{RSS}{TSS} = 1- \frac{\sum\limits_{i=1}^{n}(y_i – \hat{y})^2}{\sum\limits_{i=1}^{n}(y_i – \bar{y})^2}\] where $RSS = {{i=1}^{n}(y_i – )^2} $ and $ TSS = {{i=1}^{n}(y_i – {y})^2} $.

Since RSS always decreases as more variables are added to the model, the \(R^2\) always increases as more variables are added. For a least squares model with p variables, the adjusted \(R^2\) statistic is calculated as

\[Adjusted\, R^2 = 1- \frac{RSS/(n-p-1)}{TSS/(n-1)} \] Unlike the \(R^2\) statistic, the adjusted \(R^2\) statistic pays a price for the inclusion of unnecessary variables in the model.

Linear Regression assumptions

Linearity of the data: The relationship between the predictor (x) and the outcome (y) is assumed to be linear.

Normality of residuals: The residual errors are assumed to be normally distributed.

Homoskedasticity of residuals:The residuals are assumed to have a constant variance

Independence of residuals error terms.

Potential issues when fitting a linear regression model include:

  1. Non-linearity of the response-predictor relationships.
  2. Correlation of error terms.
  3. Heteroscedasticity: non-constant variance of error terms.
  4. Outliers: extreme values in the predictors (y) variable
  5. High-leverage points: extreme values in the predictors (x) variable
  6. Influential values
  7. Collinearity

All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing the residual errors.

Regression diagnostics plots can be created using the R base function plot() or the autoplot() function [ggfortify package], which creates a ggplot2-based graphics.

par(mfrow=c(2,2))  # set 2 rows and 2 column plot layout

plot(model00)

Homoskedasticity: Constant Variance of Error Terms

Under the linear regression model, we assume that \(Var(\epsilon_i) = \sigma^2\) for all \(i\). From the first plot (top-left), as the fitted values along x increase, the residuals should be approximately flat if the disturbances are homoskedastic. The plot on the bottom left also checks this, and is more convenient as the disturbance term in Y axis is standardized. In this case, there is no definite pattern noticed. However, observations 6, 16 and 18 are potential outliers.

par(mfrow=c(1,2))  # set 2 rows and 2 column plot layout

plot(model00,1)
plot(model00,3)

Normality of residuals

The QQ plot of residuals (top-left) can be used to visually check the normality assumption. The standardised residuals should coincide with the theoretical quantiles (z values) should approximately follow a straight line. Observations whose standardized residuals are greater than 3 in absolute value are possible outliers.

In our example, the points fall approximately along this reference line, so we can assume normality. However, observations 6 and 18 can be identified as outliers.

No autocorrelation of residuals

Under the linear regression model, we assume that the residuals \(\epsilon_i\) are independently distributed. A correlogram (also called Auto Correlation Function ACF Plot or Autocorrelation plot) is a visual way to show serial correlation in data that changes over time (i.e. time series data). To plot the ACF, use the acf() function.

The horizontal dashed blue lines mark the two standard error limits of the sample ACFs. Since the spikes lie within the confidence intervals (dashed blue lines), we can say that the residuals are not serially correlated.

acf(model00$residuals,type = "correlation",main = "Residuals ACF plot" )

To formally test for serial correlation, we can use the Durbin Watson test. The Durbin-Watson test tests the null hypothesis that linear regression residuals are uncorrelated, against the alternative hypothesis that autocorrelation exists. We use the dwtest() function from the lmtest package to perform the test.

lmtest::dwtest(model00)
## 
##  Durbin-Watson test
## 
## data:  model00
## DW = 1.8562, p-value = 0.4815
## alternative hypothesis: true autocorrelation is greater than 0

Since the p-value is far greater than the 0.05 cut-off level, we accept the null hypothesis and conclude that at 5% sig. level, there is sufficient evidence that the residuals are uncorrelated.

High leverage points

Observations with high leverage have an unusual value for \(X_i\).

In order to quantify an observation’s leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage leverage. For a simple linear regression, the leverage statistic is calculated as follows:

\[h_i = \frac{1}{n} + \frac{(x_i-\bar{x})}{\sum\limits_{i'=1}^{n}(x_i'– \bar{x})^2}\]

Leverage statistics can be computed for any number of predictors using the hatvalues() function.

hatvalues(model00)
##          1          2          3          4          5          6 
## 0.56277572 0.20059111 0.59725952 0.37052043 0.40544712 0.40955319 
##          7          8          9         10         11         12 
## 0.40834024 0.28901068 0.24189393 0.39831111 0.19007874 0.15581298 
##         13         14         15         16         17         18 
## 0.23487628 0.14984981 0.33089896 0.28936657 0.10031265 0.18911528 
##         19         20         21         22         23         24 
## 0.13449730 0.16972021 0.08808154 0.11863030 0.21845666 0.15904688 
##         25         26         27         28         29         30 
## 0.11570542 0.22044907 0.17032150 0.14617907 0.13167628 0.12809160 
##         31         32         33         34         35         36 
## 0.22511176 0.16915487 0.20484482 0.09826884 0.17136769 0.12703338 
##         37         38         39         40         41         42 
## 0.18676235 0.17555549 0.13708234 0.12876599 0.10596054 0.16343921 
##         43         44         45         46         47 
## 0.15736128 0.14333617 0.13389504 0.20415507 0.14303501

To plot the leverage values, use the plot() function.

par(mfrow=c(1,2))
plot(hatvalues(model00), ylab = "Leverage", xlab = "Observation_i") 
plot(model00,5)

The plot of the studentized residuals versus leverage shows that observation 1 has a high leverage whilst observation 6 has a high standardised residual.

Influential values

Outliers and high leverage points might distort our least-squares regression model. Cook’s distance, \(D_i\), is used to find influential outliers in a set of predictor variables.

Cook’s distance reflects how much the fitted values would change if a point was deleted. The measurement is a combination of each observation’s leverage and residual values; the higher the leverage and residuals, the higher the Cook’s distance. An observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier.

plot(model00,4)

From the plot above, we can see that the Cook’s distance for observation 1 and 16 are significantly different from those of the other observations. We might consider removing these influential values to improve the fit of our linear regression model.

Collinearity

Collinearity refers to the situation in which two or more explanatory variables are highly correlated with each other. The presence of collinearity can pose problems in the regression context, since it can be difficult to disentangle the individual effects of collinear variables on the response.

Collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for \(\hatβ_j\) to increase. Recall that the t-statistic for each predictor is calculated by dividing \(\hatβ_j\) by its standard error. Consequently, collinearity results in a decline in the t-statistic. As a result, in the presence of collinearity, we may fail to reject \(H_0 : β_j = 0\). This means that the power of the hypothesis test— the probability of correctly detecting a non-zero coefficient— is reduced by collinearity.

Detecting collinearity

A simple way to detect collinearity is to look at the correlation matrix of the predictors. An element of this matrix that is large in absolute value indicates a pair of highly correlated variables, and therefore a collinearity problem in the data.

To compute the correlation matrix in R we use the cor() function from the stats package.The syntax is as follows cor(x, y, method = c("pearson", "kendall", "spearman"))for a pair of numeric vectors x and y. If we have a dataframe consisting of several numerical variables, we use cor(data, method = c("pearson", "kendall", "spearman")). For more details, see the R documentation here. I use the round( ,2) function to round up the correlation coefficients to 2 decimal places.

#cor(gdpdata_gr[,-1], method = "pearson")
round(cor(gdpdata_gr[,-1], method = "pearson"),2)
##               Agri Manufc Utilities Constrn Trade WS_Ret_trade Hotel
## Agri          1.00   0.77     -0.29    0.64  0.75         0.76  0.25
## Manufc        0.77   1.00     -0.17    0.53  0.86         0.85 -0.06
## Utilities    -0.29  -0.17      1.00   -0.26  0.22         0.22  0.68
## Constrn       0.64   0.53     -0.26    1.00  0.44         0.44  0.07
## Trade         0.75   0.86      0.22    0.44  1.00         1.00  0.37
## WS_Ret_trade  0.76   0.85      0.22    0.44  1.00         1.00  0.38
## Hotel         0.25  -0.06      0.68    0.07  0.37         0.38  1.00
## Fin_serv      0.93   0.77     -0.07    0.59  0.87         0.88  0.39
## Real_est      0.34  -0.10     -0.25    0.29 -0.01         0.01  0.21
## RGDP          0.89   0.84      0.08    0.59  0.94         0.95  0.43
##              Fin_serv Real_est RGDP
## Agri             0.93     0.34 0.89
## Manufc           0.77    -0.10 0.84
## Utilities       -0.07    -0.25 0.08
## Constrn          0.59     0.29 0.59
## Trade            0.87    -0.01 0.94
## WS_Ret_trade     0.88     0.01 0.95
## Hotel            0.39     0.21 0.43
## Fin_serv         1.00     0.32 0.96
## Real_est         0.32     1.00 0.18
## RGDP             0.96     0.18 1.00

However, there might be cases where collinearity exists between three or more variables even if no pair of variables has a particularly high correlation; this is called multicollinearity.

Variance Inflation Factor (VIF)

A better way to assess multi-collinearity is to compute the variance inflation factor (VIF). The VIF is the ratio of the variance of \(\hatβ_j\) when fitting the full model divided by the variance of \(\hatβ_j\) if fit on its own. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. The VIF for each variable can be computed using the formula

\[ VIF = \frac{1}{1-R_{X_ j|X_-j}^2} \]

where \(R_{X_ j|X_-j}^2\) is the \(R^2\) from a regression of \(X_j\) onto all of the other predictors. If \(R_{X_ j|X_-j}^2\) is close to one, then collinearity is present, and so the VIF will be large.

The vif() function, part of the car package, can be used to compute variance inflation factors.

vif(model00)
##         Agri       Manufc    Utilities      Constrn        Trade 
##    32.418244    54.943399    24.055959     2.031277  4407.866303 
## WS_Ret_trade        Hotel     Fin_serv     Real_est 
##  3741.616855    14.560791    28.845620     2.166113

Since the predictor variables are highly correlated (see correlation matrix above), the vif are very high.

Dealing with multicollinearity
Method 1: Removing predictor variables with high vif

We can use an iterative process to remove predictor variable with the highest vif and then calculate the vif of the new model without th We calculate the vif of the full model, i.e., including all predictor variables. If the vif > 10 for any predictor, we remove the one with the highest vif.

vif(update (model00, ~.-Trade))
##         Agri       Manufc    Utilities      Constrn WS_Ret_trade 
##    27.545468    23.061041    13.017426     1.937132    27.474218 
##        Hotel     Fin_serv     Real_est 
##    13.659400    26.593267     2.134540
vif(update (model00, ~.-Trade-Agri))
##       Manufc    Utilities      Constrn WS_Ret_trade        Hotel 
##    15.254825     7.571058     1.926718    26.616626     7.368885 
##     Fin_serv     Real_est 
##    21.276543     2.093035
vif(update (model00, ~.-Trade-Agri-WS_Ret_trade))
##    Manufc Utilities   Constrn     Hotel  Fin_serv  Real_est 
##  8.719110  4.179298  1.756301  7.295053 12.036058  2.092964
vif(update (model00, ~.-Trade-Agri-WS_Ret_trade-Fin_serv))
##    Manufc Utilities   Constrn     Hotel  Real_est 
##  1.558692  3.078561  1.754747  2.871730  1.677775
Method 2: Principal Component Analysis (PCA)

PCA is dimension reduction technique used to deal with mutlicollinearity when there are many predictor variables. It is, however, beyond the scope of this tutorial.

Endogeneity

The endogeneity problem arises in the linear regression when there is a correlation between the explanatory variables (X) and the error term due to omitted variables or measurement errors.

A correlation plot might help to visualise the presence, in any, of correlation. To calculate the correlation coefficients we use the cor() function with the following syntax cor(x, y, method = c("pearson", "kendall", "spearman")). For more details, see the R documentation here.

library(corrplot)

xx = cbind(gdpdata_gr[,-1],Residuals = model00$residuals)

corrplot(cor(xx, method = "pearson"),tl.col = "black", type = "upper")

From the correlation plot, we can see that there is no to negligible correlation between the residuals and the explanatory variables.

To perform a correlation test using Pearson’s product-moment correlation, use the cor.test(x, y) function.

cor.test(gdpdata_gr$Hotel, model00$residuals)
## 
##  Pearson's product-moment correlation
## 
## data:  gdpdata_gr$Hotel and model00$residuals
## t = -3.7001e-16, df = 45, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2871668  0.2871668
## sample estimates:
##           cor 
## -5.515734e-17

References:

  1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.

  2. Assumptions of Linear Regression by Selva Prabhakaran link