Backgrounds

Today we will fit a simple linear regression on simulated data. Afterward, we will explore how changes in the random noise affect the parameter. Using the rnorm() function, create a vector, \(x\), containing 100 observations drawn from a \(N(0,1)\) distribution. This represents a feature, \(X\). Similarly we will create I(esp) of 100 observations drawn from a \(N(0,0.25)\) distribution,a normal distribution with mean zero and variance \(0.25\). Finally, using \(x\) and eps, we will generate a vector y according to the model \[ Y=-1+0.5 X+\epsilon \]

Then we will fit our models and do the exploring. This problem was initially introduced in “An Introduction to Statistical Learning- Gareth James.”

Library and others

set.seed(1)

library(ggplot2)

The data

x <- rnorm(100, mean = 0, sd = sqrt(1))  # Create X
esp <- rnorm(100, mean = 0, sd = sqrt(0.25))  # Create noise
Y_noerror <- -1 + 0.5 * x
Y <- Y_noerror + esp  # Generate response Y

length(Y)
[1] 100

Comments:

The length of Y is 100 and our true population parameters are \(\beta_{0}= -1\) and \(\beta_{1}= 0.5\)

datPop <- data.frame(Y, x, esp)
head(datPop)
           Y          x         esp
1 -1.6234102 -0.6264538 -0.31018334
2 -0.8871204  0.1836433  0.02105794
3 -1.8732751 -0.8356286 -0.45546082
4 -0.1233452  1.5952808  0.07901439
5 -1.1625384  0.3295078 -0.32729232
6 -0.5265906 -0.8204684  0.88364363
ggplot(data = datPop, aes(x = x, y = Y)) + geom_point() + ggtitle("Scatter plot of x vs Y")

Comment:

There is a positive correlation between x and Y, and the data looks moderately scattered.

Fitting model

modLS <- lm(Y ~ x, data = datPop)
summary(modLS)

Call:
lm(formula = Y ~ x, data = datPop)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.93842 -0.30688 -0.06975  0.26970  1.17309 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
x            0.49947    0.05386   9.273 4.58e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4814 on 98 degrees of freedom
Multiple R-squared:  0.4674,    Adjusted R-squared:  0.4619 
F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15
modLS$coefficients
(Intercept)           x 
 -1.0188463   0.4994698 

Comment:

The fitted model coefficients are \(\hat{\beta_{0}}= -1.018\) and \(\hat{\beta_{1}}= 0.499\). Its appears to be a good fit for the data. However, both \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\) is slightly smaller than population \(\beta_{0}\) and \(\beta_{1}\).

Visuals of the population and the fitted line

plot(x, Y, cex = 0.9, col = "black")
abline(modLS, col = "red", lwd = 2)
abline(-1, 0.5, col = "green", lwd = 2)
legend("bottomright", legend = c("Fitted line", "Population line"), col = c("red",
    "green"), lty = 1, lwd = 2, cex = 0.8)

Comment:

The graph above shows the filled line (red) and the population line (green), which supports the findings in the “modLS” model.

Does including quadratic terms will do any good?

modPOLY <- lm(Y ~ x + I(x^2))
summary(modPOLY)

Call:
lm(formula = Y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.98252 -0.31270 -0.06441  0.29014  1.13500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
x            0.50858    0.05399   9.420  2.4e-15 ***
I(x^2)      -0.05946    0.04238  -1.403    0.164    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.479 on 97 degrees of freedom
Multiple R-squared:  0.4779,    Adjusted R-squared:  0.4672 
F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14
summary(modLS)

Call:
lm(formula = Y ~ x, data = datPop)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.93842 -0.30688 -0.06975  0.26970  1.17309 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
x            0.49947    0.05386   9.273 4.58e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4814 on 98 degrees of freedom
Multiple R-squared:  0.4674,    Adjusted R-squared:  0.4619 
F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15
mean((Y - modPOLY$fitted.values)^2)
[1] 0.2225728
mean((Y - modLS$fitted.values)^2)
[1] 0.227089

Comment:

By looking at the adjusted \(R^2\) and MSE of these two models (modLS:0.462, 0.227 and modPOLY:0.467, 0.222), we can say that the quadratic term seems to improve the fit a very little.

Let’s reduce the noise

esp_red <- rnorm(100, mean = 0, sd = sqrt(0.1))  # we can reduce noise by reducing its variance

Y_red <- Y_noerror + esp_red

datRed <- data.frame(x, Y_red)


ggplot(data = datRed, aes(x = x, y = Y_red)) + geom_point() + ggtitle("Scatter plot of x vs Y_red")

Comment:

There is a positive correlation between x and Y. However, this time due to the reduced error variance, the data is more compact and linear.

Fitting the model on data with reduced noise

mod_red <- lm(Y_red ~ x, data = datRed)
summary(mod_red)

Call:
lm(formula = Y_red ~ x, data = datRed)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.92152 -0.15252 -0.01433  0.20531  0.83534 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.99135    0.03311  -29.94   <2e-16 ***
x            0.50669    0.03678   13.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3287 on 98 degrees of freedom
Multiple R-squared:  0.6595,    Adjusted R-squared:  0.656 
F-statistic: 189.8 on 1 and 98 DF,  p-value: < 2.2e-16

Comment:

Due to less variability, the fitted model shows improved fit. This model can explain almost \(66\%\) variability in Y.

Visuals of the population and the fitted line

plot(x, Y_red, cex = 0.9, col = "black")
abline(mod_red, col = "red", lwd = 2)
abline(-1, 0.5, col = "green", lwd = 2)
legend("bottomright", legend = c("Fitted line", "Population line"), col = c("red",
    "green"), lty = 1, lwd = 2, cex = 0.8)

Comment:

The findings are also reflected in the graph above. The fitted line and population line almost became indistinguishable.

Let’s increase the noise

esp_increased <- rnorm(100, mean = 0, sd = sqrt(0.9))

Y_increased <- Y_noerror + esp_increased

datIncreased <- data.frame(x, Y_increased)


ggplot(data = datIncreased, aes(x = x, y = Y_increased)) + geom_point() + ggtitle("Scatter plot of x vs Y_increased")

Comment:

There is a positive correlation between x and Y. However, due to the larger variance in error, data is more scattered.

Fitting the model on data with increased noise

mod_increased <- lm(Y_increased ~ x, data = datIncreased)
summary(mod_increased)

Call:
lm(formula = Y_increased ~ x, data = datIncreased)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.38713 -0.51727 -0.03582  0.63835  1.78246 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.94529    0.09514  -9.936  < 2e-16 ***
x            0.44717    0.10567   4.232 5.23e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9444 on 98 degrees of freedom
Multiple R-squared:  0.1545,    Adjusted R-squared:  0.1459 
F-statistic: 17.91 on 1 and 98 DF,  p-value: 5.225e-05

Comments:

The increase in error instantly reflects in the fit. Estimates are far away from the original and adjusted \(R^2\) dropped to 0.15. Also the standard error of the estimates has increased.

Visuals of the population and the fitted line

plot(x, Y_increased, cex = 0.9, col = "black")
abline(mod_increased, col = "red", lwd = 2)
abline(-1, 0.5, col = "green", lwd = 2)
legend("bottomright", legend = c("Fitted line", "Population line"), col = c("red",
    "green"), lty = 1, lwd = 2, cex = 0.8)

Comment:

The effect is clearly visible to the eyes via the above graph. As the \(\beta_{1}\) that is the slop was underestimated, the fitted line falls towards the horizontal axis.

Confidence interval of parameter estimates

CI_original <- confint(modLS, level = 0.95)
CI_red <- confint(mod_red, level = 0.95)
CI_increased <- confint(mod_increased, level = 0.95)
CI_Original \(\beta_{0}\) -1.12 -0.92
\(\beta_{1}\) 0.39 0.61
CI_red \(\beta_{0}\) -1.06 -0.93
\(\beta_{1}\) 0.43 0.58
CI_increased \(\beta_{0}\) -1.13 -0.76
\(\beta_{1}\) 0.24 0.66

\[~\]

Comment:

The width of the interval for each parameter increase with the amount of error variability. Because larger error variance leads to poor fit, which causes the higher standard error for the parameter, the CI becomes wider. \(CI_{red}< CI_{Original}< CI_{increased}\) is the general CI width order.