Regression Analysis

0.1 Introduction

Regression analysis is a method to create a model for predicting future outcome (e.g., future GDP growth, inflation rate, etc.), for examining the hypotheses or for evaluating impact of policy implementation. Regression analysis allows us to examine the relationship between random variables, for example, to what extent can the variable Y be explained by a variable X.

Regression analysis would be useful, in the case that you want to examine the relationship between the following variables.

How much will the demand for electricity increase when temperature changes by 1 C°?
Do people work harder when wages increase?
Increases in interest rate will attract people to have KHR Deposits?

Prediction and statistical inference

Purpose of regression analsysis can be categorized into two:

Prediction
Statistical inference (hypothesis testing)

In the purpose of prediction, the regression analysis is conducted to estimate the model which predict the future value of outcome variables in higher accuracy. In the purpose of statistical inference, the regression analysis is conducted to estimate the parameter of the function related to a policy variable.

Let’s say, there is a true function to determine GDP growth as GDP = f(X,\beta). X is dependent variable which determines value of GDP. \beta is a parameter which determine the relationship between X and GDP.

Roughly speaking, in prediction, the goal is more like to estimate the function f(X, \beta). On the other hand, the goal is more like to estimate parameter \beta accrately in statistical inference.

0.2 Concept of Regression Analysis

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

library(ggplot2)

# Create a sample dataset
set.seed(123)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 10)
data <- data.frame(x, y)

# Perform linear regression
model <- lm(y ~ x, data = data)

# Predict values based on the model
data$y_pred <- predict(model)

# Calculate residuals
data$residuals <- data$y - data$y_pred

# Plotting
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "red", size = 3) +  # Data points
  geom_abline(slope = coef(model)[2], intercept = coef(model)[1], color = "blue", size = 1.2) +  # Regression line
  geom_segment(aes(xend = x, yend = y_pred), linetype = "dotted", size = 1) + 
  annotate("text", x = 2, y = data$y[2] + 0.5, label = "paste(y, x)", parse = TRUE,hjust = -0.2) +  annotate("text", x = 25, y = 4, label = "italic(u)", parse = TRUE, hjust = 1.2) +   labs(x = "x", y = "y") +
  theme_minimal() +
  ggtitle("Linear Regression with Residuals") +
  theme(plot.title = element_text(hjust = 0.5))

1 Regression Analysis

1.1 OLS Estimation

Making regression in R is very simple. We can do that by using lm() function.

We can type the codes as follow.

y_{i} = \alpha + \beta x_i + \epsilon_i

library(readxl)


file_path <- "Data/NBC_MFI.xlsx"
data <- read_excel(file_path, sheet = "Sheet1", range = "A3:BY842")

New names:
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...70`

reg <- lm( m_asset1 ~ staff, data= data)
summary(reg)


Call:
lm(formula = m_asset1 ~ staff, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2994730    28711   112352   146917  8182090 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -157225.58   21592.98  -7.281 7.69e-13 ***
staff          1222.26      19.46  62.808  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 583800 on 828 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.8265,    Adjusted R-squared:  0.8263 
F-statistic:  3945 on 1 and 828 DF,  p-value: < 2.2e-16

The results of estimated regression coefficients can be plotted by abline() function. After plotting the data of two variables in the graph, we can add the regression line using the results of regression as follow.

data_clean <- na.omit(data[, c("staff", "m_asset1")])  # There are missing values. Thus, we have to clean the data for plotting

data_clean$predicted <- predict(reg, newdata = data_clean)

ggplot(data_clean, aes(x = staff, y = m_asset1))  + 
  geom_point() + geom_line(aes(x = staff, y = predicted), color="red")

Coding the regression analysis in R is quite flexible. We can run the regression with log transformation of independent variables If we transform y and x variables in the log form, the regression model is expressed as below.

\log(y_i) = \alpha + \beta + \log(x_i) + \epsilon_i

The regression of the above mentioned model can be performed using the log function as follow.

log_reg <- lm( log(m_asset1 +1) ~ log(staff +1), data=data)  

summary(log_reg)


Call:
lm(formula = log(m_asset1 + 1) ~ log(staff + 1), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4350 -0.5725 -0.0261  0.5351  3.2044 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6.30249    0.09163   68.78   <2e-16 ***
log(staff + 1)  0.94567    0.01972   47.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9323 on 828 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.7352,    Adjusted R-squared:  0.7349 
F-statistic:  2299 on 1 and 828 DF,  p-value: < 2.2e-16

Since the “asset2” include zero values and negative values, log transformation make “NA” as missing values. If the transformed variable contains “NA” or “inf”, the regression will return errors. In that case, we should transform variables in advance and should code the variables not including NA and inf before the analysis.

log transformation can be also used in plot function.

data_clean <-  na.omit(data[, c("staff", "m_asset1")])  # There are missing values. Thus, we have to clean the data for plotting


data_clean$predicted <- predict(log_reg, newdata = data_clean)

ggplot(data_clean, aes(x = log(staff+1), y = log(m_asset1+1)))  + 
  geom_point() + geom_line(aes(x = log(staff+1), y = predicted), color="red")

data_clean$predicted_original <- exp(data_clean$predicted)

ggplot(data_clean, aes(x = staff, y = m_asset1))  + 
  geom_point() + 
  geom_line(aes(x = staff, y = predicted_original), color = "red") +
  labs(title = "Actual vs. Predicted Values",
       x = "Staff", 
       y = "M_Asset1") +
  theme_minimal()

For the comparison of performance of prediction, AIC is calculated as follow:

AIC(log_reg)

[1] 2243.002

1.2 Example: Relationship between temprature and electricity demand

Let’s consider the determinants of electricity demand for forecasting purpose. As electricity is not stored, the prediction of demand is crucial to determine the production capacity and pricing of electricity.

Suppose that the temprature is one of the crucial determinants of demand of electricity. Tokyo Electric Power Company regularly publishes the data of actual use of electricity on daily basis. Japan Meteorological Agency publishes data of temperature on daily basis.

library(readxl)

data <- read_excel("Data/Temprature and electricity.xls")

summary(data)

 temprature_tokyo      date                  kw        
 Min.   : 2.30    Min.   :2021-01-01   Min.   : 54744  
 1st Qu.:10.60    1st Qu.:2021-04-02   1st Qu.: 69678  
 Median :16.70    Median :2021-07-02   Median : 74432  
 Mean   :16.67    Mean   :2021-07-02   Mean   : 76738  
 3rd Qu.:22.70    3rd Qu.:2021-10-01   3rd Qu.: 83847  
 Max.   :31.00    Max.   :2021-12-31   Max.   :103834

library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(scales)

scaling_factor <- max(data$temprature_tokyo) / max(data$kw) 

 ggplot(data) +
  geom_line(aes(x = date, y = temprature_tokyo, color = "Temperature (°C)")) +
  geom_line(aes(x = date, y = kw * scaling_factor, color = "KW (scaled)")) +
  scale_y_continuous(
    name = "Temperature (°C)", 
    sec.axis = sec_axis(~ . / scaling_factor, name = "KW")
  ) +
  theme_minimal()

Firstly, let’s look at the data in the summer season . The figure shows a scatter plot of average temperature and electricity consumption in the summer season ((June, July, August, September))

data$date <- as.Date(data$date)
ds1 <- subset(data, date>=as.Date("2021-06-01") & date<as.Date("2021-10-01"))

ggplot(ds1) + geom_point(aes(x = temprature_tokyo, y =kw) ) +
  theme_minimal()

For this data, now we fit the following regression model

KW_{i} = \alpha + \beta Temprature_i + \epsilon_i

reg <- lm( kw ~ temprature_tokyo, data= ds1)
summary(reg)


Call:
lm(formula = kw ~ temprature_tokyo, data = ds1)

Residuals:
     Min       1Q   Median       3Q      Max 
-12992.2  -4770.5    756.1   3985.3   9994.8 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        9490.9     4078.9   2.327   0.0216 *  
temprature_tokyo   2805.8      164.4  17.071   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5639 on 120 degrees of freedom
Multiple R-squared:  0.7083,    Adjusted R-squared:  0.7059 
F-statistic: 291.4 on 1 and 120 DF,  p-value: < 2.2e-16

Let’s plot the estimated line in figure.

ds1$predicted <- predict(reg)

ggplot(ds1) + geom_point(aes(x = temprature_tokyo, y =kw) ) +  
  geom_line(aes(x = temprature_tokyo, y =predicted), color ="red" ) +
  theme_minimal()

ggplot(data) + geom_point(aes(x = temprature_tokyo, y =kw) ) +
  theme_minimal()

reg <- lm( kw ~ temprature_tokyo + I(temprature_tokyo^2)  , data= data)
summary(reg)


Call:
lm(formula = kw ~ temprature_tokyo + I(temprature_tokyo^2), data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-20919  -4387   1134   4084  11573 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           113770.778   1519.812   74.86   <2e-16 ***
temprature_tokyo       -5305.021    202.752  -26.16   <2e-16 ***
I(temprature_tokyo^2)    154.469      6.027   25.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6171 on 362 degrees of freedom
Multiple R-squared:  0.6542,    Adjusted R-squared:  0.6523 
F-statistic: 342.4 on 2 and 362 DF,  p-value: < 2.2e-16

# Or we can use  "poly(temprature_tokyo, 2, raw = TRUE)" , instead of "I(temprature_tokyo^2)"

And the regression line is plotted in the figure.

data$predicted <- predict(reg)

ggplot(data) + geom_point(aes(x = temprature_tokyo, y =kw) ) +
  geom_line(aes(x = temprature_tokyo, y =predicted), color="red" ) +
  theme_minimal()