Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Regression Analysis
0.1 Introduction
Regression analysis is a method to create a model for predicting future outcome (e.g., future GDP growth, inflation rate, etc.), for examining the hypotheses or for evaluating impact of policy implementation. Regression analysis allows us to examine the relationship between random variables, for example, to what extent can the variable Y be explained by a variable X.
Regression analysis would be useful, in the case that you want to examine the relationship between the following variables.
How much will the demand for electricity increase when temperature changes by 1 C°?
Do people work harder when wages increase?
Increases in interest rate will attract people to have KHR Deposits?
Purpose of regression analsysis can be categorized into two:
Prediction
Statistical inference (hypothesis testing)
In the purpose of prediction, the regression analysis is conducted to estimate the model which predict the future value of outcome variables in higher accuracy. In the purpose of statistical inference, the regression analysis is conducted to estimate the parameter of the function related to a policy variable.
Let’s say, there is a true function to determine GDP growth as GDP = f(X,\beta). X is dependent variable which determines value of GDP. \beta is a parameter which determine the relationship between X and GDP.
Roughly speaking, in prediction, the goal is more like to estimate the function f(X, \beta). On the other hand, the goal is more like to estimate parameter \beta accrately in statistical inference.
0.2 Concept of Regression Analysis
library(ggplot2)
# Create a sample dataset
set.seed(123)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 10)
data <- data.frame(x, y)
# Perform linear regression
model <- lm(y ~ x, data = data)
# Predict values based on the model
data$y_pred <- predict(model)
# Calculate residuals
data$residuals <- data$y - data$y_pred
# Plotting
ggplot(data, aes(x = x, y = y)) +
geom_point(color = "red", size = 3) + # Data points
geom_abline(slope = coef(model)[2], intercept = coef(model)[1], color = "blue", size = 1.2) + # Regression line
geom_segment(aes(xend = x, yend = y_pred), linetype = "dotted", size = 1) +
annotate("text", x = 2, y = data$y[2] + 0.5, label = "paste(y, x)", parse = TRUE,hjust = -0.2) + annotate("text", x = 25, y = 4, label = "italic(u)", parse = TRUE, hjust = 1.2) + labs(x = "x", y = "y") +
theme_minimal() +
ggtitle("Linear Regression with Residuals") +
theme(plot.title = element_text(hjust = 0.5))
1 Regression Analysis
1.1 OLS Estimation
Making regression in R is very simple. We can do that by using lm() function.
We can type the codes as follow.
y_{i} = \alpha + \beta x_i + \epsilon_i
library(readxl)
file_path <- "Data/NBC_MFI.xlsx"
data <- read_excel(file_path, sheet = "Sheet1", range = "A3:BY842")New names:
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...70`
reg <- lm( m_asset1 ~ staff, data= data)
summary(reg)
Call:
lm(formula = m_asset1 ~ staff, data = data)
Residuals:
Min 1Q Median 3Q Max
-2994730 28711 112352 146917 8182090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -157225.58 21592.98 -7.281 7.69e-13 ***
staff 1222.26 19.46 62.808 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 583800 on 828 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.8265, Adjusted R-squared: 0.8263
F-statistic: 3945 on 1 and 828 DF, p-value: < 2.2e-16
The results of estimated regression coefficients can be plotted by abline() function. After plotting the data of two variables in the graph, we can add the regression line using the results of regression as follow.
data_clean <- na.omit(data[, c("staff", "m_asset1")]) # There are missing values. Thus, we have to clean the data for plotting
data_clean$predicted <- predict(reg, newdata = data_clean)
ggplot(data_clean, aes(x = staff, y = m_asset1)) +
geom_point() + geom_line(aes(x = staff, y = predicted), color="red") Coding the regression analysis in R is quite flexible. We can run the regression with log transformation of independent variables If we transform y and x variables in the log form, the regression model is expressed as below.
\log(y_i) = \alpha + \beta + \log(x_i) + \epsilon_i
The regression of the above mentioned model can be performed using the log function as follow.
log_reg <- lm( log(m_asset1 +1) ~ log(staff +1), data=data)
summary(log_reg)
Call:
lm(formula = log(m_asset1 + 1) ~ log(staff + 1), data = data)
Residuals:
Min 1Q Median 3Q Max
-4.4350 -0.5725 -0.0261 0.5351 3.2044
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.30249 0.09163 68.78 <2e-16 ***
log(staff + 1) 0.94567 0.01972 47.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9323 on 828 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.7352, Adjusted R-squared: 0.7349
F-statistic: 2299 on 1 and 828 DF, p-value: < 2.2e-16
Since the “asset2” include zero values and negative values, log transformation make “NA” as missing values. If the transformed variable contains “NA” or “inf”, the regression will return errors. In that case, we should transform variables in advance and should code the variables not including NA and inf before the analysis.
log transformation can be also used in plot function.
data_clean <- na.omit(data[, c("staff", "m_asset1")]) # There are missing values. Thus, we have to clean the data for plotting
data_clean$predicted <- predict(log_reg, newdata = data_clean)
ggplot(data_clean, aes(x = log(staff+1), y = log(m_asset1+1))) +
geom_point() + geom_line(aes(x = log(staff+1), y = predicted), color="red") data_clean$predicted_original <- exp(data_clean$predicted)
ggplot(data_clean, aes(x = staff, y = m_asset1)) +
geom_point() +
geom_line(aes(x = staff, y = predicted_original), color = "red") +
labs(title = "Actual vs. Predicted Values",
x = "Staff",
y = "M_Asset1") +
theme_minimal()For the comparison of performance of prediction, AIC is calculated as follow:
AIC(log_reg)[1] 2243.002
1.2 Example: Relationship between temprature and electricity demand
Let’s consider the determinants of electricity demand for forecasting purpose. As electricity is not stored, the prediction of demand is crucial to determine the production capacity and pricing of electricity.
Suppose that the temprature is one of the crucial determinants of demand of electricity. Tokyo Electric Power Company regularly publishes the data of actual use of electricity on daily basis. Japan Meteorological Agency publishes data of temperature on daily basis.
library(readxl)
data <- read_excel("Data/Temprature and electricity.xls")
summary(data) temprature_tokyo date kw
Min. : 2.30 Min. :2021-01-01 Min. : 54744
1st Qu.:10.60 1st Qu.:2021-04-02 1st Qu.: 69678
Median :16.70 Median :2021-07-02 Median : 74432
Mean :16.67 Mean :2021-07-02 Mean : 76738
3rd Qu.:22.70 3rd Qu.:2021-10-01 3rd Qu.: 83847
Max. :31.00 Max. :2021-12-31 Max. :103834
library(ggplot2)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(scales)
scaling_factor <- max(data$temprature_tokyo) / max(data$kw)
ggplot(data) +
geom_line(aes(x = date, y = temprature_tokyo, color = "Temperature (°C)")) +
geom_line(aes(x = date, y = kw * scaling_factor, color = "KW (scaled)")) +
scale_y_continuous(
name = "Temperature (°C)",
sec.axis = sec_axis(~ . / scaling_factor, name = "KW")
) +
theme_minimal() Firstly, let’s look at the data in the summer season . The figure shows a scatter plot of average temperature and electricity consumption in the summer season ((June, July, August, September))
data$date <- as.Date(data$date)
ds1 <- subset(data, date>=as.Date("2021-06-01") & date<as.Date("2021-10-01"))
ggplot(ds1) + geom_point(aes(x = temprature_tokyo, y =kw) ) +
theme_minimal() For this data, now we fit the following regression model
KW_{i} = \alpha + \beta Temprature_i + \epsilon_i
reg <- lm( kw ~ temprature_tokyo, data= ds1)
summary(reg)
Call:
lm(formula = kw ~ temprature_tokyo, data = ds1)
Residuals:
Min 1Q Median 3Q Max
-12992.2 -4770.5 756.1 3985.3 9994.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9490.9 4078.9 2.327 0.0216 *
temprature_tokyo 2805.8 164.4 17.071 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5639 on 120 degrees of freedom
Multiple R-squared: 0.7083, Adjusted R-squared: 0.7059
F-statistic: 291.4 on 1 and 120 DF, p-value: < 2.2e-16
Let’s plot the estimated line in figure.
ds1$predicted <- predict(reg)
ggplot(ds1) + geom_point(aes(x = temprature_tokyo, y =kw) ) +
geom_line(aes(x = temprature_tokyo, y =predicted), color ="red" ) +
theme_minimal() ggplot(data) + geom_point(aes(x = temprature_tokyo, y =kw) ) +
theme_minimal() reg <- lm( kw ~ temprature_tokyo + I(temprature_tokyo^2) , data= data)
summary(reg)
Call:
lm(formula = kw ~ temprature_tokyo + I(temprature_tokyo^2), data = data)
Residuals:
Min 1Q Median 3Q Max
-20919 -4387 1134 4084 11573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 113770.778 1519.812 74.86 <2e-16 ***
temprature_tokyo -5305.021 202.752 -26.16 <2e-16 ***
I(temprature_tokyo^2) 154.469 6.027 25.63 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6171 on 362 degrees of freedom
Multiple R-squared: 0.6542, Adjusted R-squared: 0.6523
F-statistic: 342.4 on 2 and 362 DF, p-value: < 2.2e-16
# Or we can use "poly(temprature_tokyo, 2, raw = TRUE)" , instead of "I(temprature_tokyo^2)"And the regression line is plotted in the figure.
data$predicted <- predict(reg)
ggplot(data) + geom_point(aes(x = temprature_tokyo, y =kw) ) +
geom_line(aes(x = temprature_tokyo, y =predicted), color="red" ) +
theme_minimal()