Introduction to Regression Assumptions

Overview of Regression Analysis

Regression analysis is a statistical method used to examine the relationship between a dependent variable and one or more independent variables. It helps identify the strength and nature of these relationships, and can be used for prediction and forecasting. By fitting a model to the data, regression analysis provides insights into how changes in the independent variables impact the dependent variable.
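
As an illustrative sketch (using R's built-in mtcars data, which is not part of the examples below), a simple linear regression can be fit and summarized in a few lines:

# Model fuel efficiency (mpg) as a linear function of vehicle weight (wt)
model_mtcars <- lm(mpg ~ wt, data=mtcars)
summary(model_mtcars)  # coefficient estimates, standard errors, and R-squared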

Importance of Assumptions in Regression

Assumptions in regression analysis are crucial because they ensure the validity and reliability of the model’s results. Key assumptions, such as linearity, independence, and homoscedasticity, help in making accurate inferences and predictions. Violating these assumptions can lead to misleading conclusions and unreliable estimates, undermining the effectiveness of the regression model.

List of Key Assumptions:

  • Linearity
  • Independence
  • Homoscedasticity
  • Normality
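
Each of these can be screened graphically before trusting a fitted model. As a quick sketch (reusing a model fit to R's built-in mtcars data purely for illustration), the default plot method for an lm object produces four standard diagnostic plots; independence is usually assessed separately, for example with a Durbin-Watson test.

model_mtcars <- lm(mpg ~ wt, data=mtcars)  # any fitted lm object will do
par(mfrow=c(2, 2))    # arrange the four diagnostic plots in a 2-by-2 grid
plot(model_mtcars)    # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow=c(1, 1))    # restore the default plotting layout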

Violation 1: Non-Linearity

Explanation of Linearity Assumption

The linearity assumption in regression analysis posits that the relationship between the independent and dependent variables is linear, meaning that changes in the independent variables are associated with proportional changes in the dependent variable. This assumption allows for a straightforward interpretation of the model coefficients, which represent constant effects of the predictors. If the true relationship is non-linear, a linear model may produce inaccurate predictions and misleading insights, necessitating alternative modeling approaches.

Identifying Non-Linearity in Data (Scatterplot, Residual Plot)

To identify non-linearity in data, a scatterplot of the independent variable(s) against the dependent variable can reveal if the relationship deviates from a straight line. Additionally, a residual plot, which displays residuals versus fitted values, can highlight patterns such as curves or systematic structures that indicate non-linearity. Both visual tools help in assessing whether the linearity assumption is violated and guide the need for more complex modeling techniques.

R Example: Simulating and Plotting Non-Linear Data

set.seed(42)  # set a seed so the simulated data are reproducible
x <- 1:100
y <- 3*x^2 + rnorm(100, mean=0, sd=500)  # the quadratic term makes the relationship non-linear
plot(x, y, main="Non-Linear Relationship", xlab="X", ylab="Y")
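
A residual plot from a naive straight-line fit makes the curvature even more apparent; a brief sketch:

lin_fit <- lm(y ~ x)  # deliberately misspecified straight-line fit
plot(fitted(lin_fit), resid(lin_fit), main="Residuals vs Fitted", xlab="Fitted values", ylab="Residuals")
abline(h=0, lty=2)    # a pronounced U-shaped pattern around zero signals non-linearity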

Solution 1: Polynomial Regression

Adding a polynomial term lets an ordinary linear model capture the curvature while remaining linear in its parameters.

poly_model <- lm(y ~ poly(x, 2))  # quadratic fit using orthogonal polynomial terms
plot(x, y, main="Quadratic Fit", xlab="X", ylab="Y")
lines(x, fitted(poly_model), col="red", lwd=2)  # overlay the fitted curve on the data
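
To check whether the quadratic term is actually needed, the straight-line and quadratic fits can be compared with a nested-model F-test; a minimal sketch:

anova(lm(y ~ x), poly_model)  # a small p-value supports keeping the quadratic term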

Solution 2: Spline Regression

A regression spline provides a more flexible, piecewise-polynomial fit; here a natural cubic spline basis from the splines package is used.

library(splines)                     # provides ns()
spline_model <- lm(y ~ ns(x, df=5))  # natural cubic spline with 5 degrees of freedom
plot(x, y, main="Spline Fit", xlab="X", ylab="Y")
lines(x, fitted(spline_model), col="blue", lwd=2)  # overlay the fitted curve
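
The polynomial and spline fits can be compared on the same data, for example by AIC (lower is better); a brief sketch:

AIC(poly_model, spline_model)  # penalized goodness-of-fit comparison of the two models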

Solution 3: Use of Transformations, e.g., Log

Let’s consider a dataset where the relationship between the predictor x and the response variable y is exponential. A log transformation can help linearize this relationship.

# Create a sample dataset
set.seed(123)
x <- seq(1, 100, by=1)
y <- exp(0.05 * x) + rnorm(length(x), sd=10)  # Exponential relationship with noise

# Scatterplot of original data
plot(x, y, main="Scatterplot of Original Data", xlab="x", ylab="y")

# Apply log transformation (the warning below arises because the additive
# noise pushes a few y values below zero, where log() is undefined)
y_log <- log(y)
## Warning in log(y): NaNs produced
# Scatterplot of transformed data
plot(x, y_log, main="Scatterplot After Log Transformation", xlab="x", ylab="log(y)")
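
Once the relationship looks roughly linear on the log scale, an ordinary linear model can be fit to the transformed response; a brief sketch (dropping the few observations where noise made y negative, so log(y) is undefined):

ok <- is.finite(y_log)             # keep only observations with a defined log(y)
log_fit <- lm(y_log[ok] ~ x[ok])   # linear fit on the log scale
coef(log_fit)                      # the slope should roughly recover the simulated growth rate (0.05)
abline(log_fit, col="red", lwd=2)  # overlay the fitted line on the transformed-data scatterplot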

Solution 3: Use of Transformations, e.g., Square Root

When the response grows roughly quadratically with the predictor, a square-root transformation of the response can likewise help linearize the relationship.

# Create a sample dataset with a quadratic relationship between x and y
set.seed(456)
x <- seq(1, 100, by=1)
y <- x^2 + rnorm(length(x), sd=50)  # Quadratic relationship with noise

# Scatterplot of original data
plot(x, y, main="Scatterplot of Original Data", xlab="x", ylab="y")

# Apply square root transformation (the warning below arises because noise
# makes a few y values negative, where sqrt() is undefined)
y_sqrt <- sqrt(y)
## Warning in sqrt(y): NaNs produced
# Scatterplot of transformed data
plot(x, y_sqrt, main="Scatterplot After Square Root Transformation", xlab="x", ylab="sqrt(y)")

Violation 2: Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, leading to unreliable estimates of the coefficients and difficulties in assessing the effect of each predictor.

Introduction

Because highly correlated predictors carry overlapping information, the model cannot cleanly attribute changes in the response to any one of them. The practical symptoms are inflated standard errors, unstable coefficient estimates that can swing with small changes in the data, and difficulty interpreting the individual effect of each predictor. Detecting and addressing multicollinearity is therefore essential for reliable regression analysis.

Example of Multicollinearity

Let’s create a dataset where multicollinearity is present and demonstrate how to detect it using Variance Inflation Factor (VIF) and correlation matrices.

# Load necessary libraries
library(corrplot)  # correlation-matrix visualization
library(car)       # vif()
library(glmnet)    # ridge and lasso regression

# Create a sample dataset
set.seed(789)
x1 <- rnorm(100)
rho <- 0.99                                  # target correlation between x1 and x2
x2 <- rho*x1 + sqrt(1 - rho^2) * rnorm(100)  # construct x2 as a near-copy of x1
x3 <- rnorm(100)
y <- 2 * x1 + 3 * x2 + 4 * x3 + rnorm(100)  # Response variable

# Combine into a data frame
data <- data.frame(x1, x2, x3, y)

# Calculate the correlation matrix
cor_matrix <- cor(data)
cor_matrix
##            x1         x2         x3         y
## x1 1.00000000 0.98851183 0.08831085 0.7959001
## x2 0.98851183 1.00000000 0.07940875 0.7913076
## x3 0.08831085 0.07940875 1.00000000 0.6494448
## y  0.79590010 0.79130759 0.64944483 1.0000000
corrplot(cor_matrix, method="circle", main=" ")

# Fit a linear regression model
model <- lm(y ~ x1 + x2 + x3, data=data)
# Calculate Variance Inflation Factor (VIF)
vif_values <- vif(model)
vif_values
##        x1        x2        x3 
## 43.961025 43.894972  1.010634
# VIFs near 44 for x1 and x2 (well above the usual cutoff of 10) confirm severe multicollinearity
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86054 -0.72486  0.08838  0.74269  2.40076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1200     0.1063   1.129 0.261540    
## x1            2.3409     0.7193   3.254 0.001570 ** 
## x2            2.7318     0.7239   3.774 0.000279 ***
## x3            3.9433     0.1086  36.320  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 96 degrees of freedom
## Multiple R-squared:  0.9752, Adjusted R-squared:  0.9744 
## F-statistic:  1258 on 3 and 96 DF,  p-value: < 2.2e-16

Solution: Dropping Highly Correlated Predictors

# Drop x2, which is nearly a duplicate of x1
lm_model_revised <- lm(y ~ x1 + x3)

summary(lm_model_revised)
## 
## Call:
## lm(formula = y ~ x1 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6224 -0.7538  0.1639  0.8413  2.6085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.08962    0.11296   0.793     0.43    
## x1           5.02401    0.11611  43.269   <2e-16 ***
## x3           3.92182    0.11558  33.931   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.124 on 97 degrees of freedom
## Multiple R-squared:  0.9715, Adjusted R-squared:  0.9709 
## F-statistic:  1654 on 2 and 97 DF,  p-value: < 2.2e-16
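
With x2 removed, the coefficient on x1 (about 5) absorbs the effect of its near-duplicate (2 + 3 in the simulation), and the standard errors shrink markedly. A quick sketch to confirm that the collinearity itself has been resolved:

# Re-compute VIF for the reduced model; values near 1 indicate no remaining collinearity
vif(lm(y ~ x1 + x3, data=data))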

Solution: Regularization (Ridge/Lasso Regression)

# alpha = 0 requests the ridge penalty (alpha = 1 would give the lasso)
ridge_model <- glmnet(as.matrix(cbind(x1, x2, x3)), y, alpha=0)
summary(ridge_model)  # for a glmnet object this only lists the components of the fit
##           Length Class     Mode   
## a0        100    -none-    numeric
## beta      300    dgCMatrix S4     
## df        100    -none-    numeric
## dim         2    -none-    numeric
## lambda    100    -none-    numeric
## dev.ratio 100    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        4    -none-    call   
## nobs        1    -none-    numeric
ridge_model  # printing the fit shows the regularization path: Df, % deviance explained, lambda
## 
## Call:  glmnet(x = as.matrix(cbind(x1, x2, x3)), y = y, alpha = 0) 
## 
##     Df  %Dev Lambda
## 1    3  0.00 5220.0
## 2    3  0.46 4757.0
## 3    3  0.51 4334.0
## 4    3  0.56 3949.0
## 5    3  0.61 3598.0
## 6    3  0.67 3279.0
## 7    3  0.73 2987.0
## 8    3  0.80 2722.0
## 9    3  0.88 2480.0
## 10   3  0.97 2260.0
## 11   3  1.06 2059.0
## 12   3  1.16 1876.0
## 13   3  1.28 1709.0
## 14   3  1.40 1558.0
## 15   3  1.53 1419.0
## 16   3  1.68 1293.0
## 17   3  1.84 1178.0
## 18   3  2.02 1074.0
## 19   3  2.21  978.2
## 20   3  2.43  891.3
## 21   3  2.66  812.1
## 22   3  2.91  740.0
## 23   3  3.19  674.2
## 24   3  3.49  614.3
## 25   3  3.82  559.8
## 26   3  4.18  510.0
## 27   3  4.57  464.7
## 28   3  4.99  423.4
## 29   3  5.46  385.8
## 30   3  5.97  351.6
## 31   3  6.52  320.3
## 32   3  7.11  291.9
## 33   3  7.76  265.9
## 34   3  8.47  242.3
## 35   3  9.23  220.8
## 36   3 10.05  201.2
## 37   3 10.94  183.3
## 38   3 11.90  167.0
## 39   3 12.94  152.2
## 40   3 14.05  138.7
## 41   3 15.25  126.3
## 42   3 16.53  115.1
## 43   3 17.90  104.9
## 44   3 19.36   95.6
## 45   3 20.91   87.1
## 46   3 22.56   79.3
## 47   3 24.31   72.3
## 48   3 26.15   65.9
## 49   3 28.09   60.0
## 50   3 30.12   54.7
## 51   3 32.25   49.8
## 52   3 34.46   45.4
## 53   3 36.75   41.4
## 54   3 39.12   37.7
## 55   3 41.55   34.4
## 56   3 44.04   31.3
## 57   3 46.58   28.5
## 58   3 49.16   26.0
## 59   3 51.75   23.7
## 60   3 54.36   21.6
## 61   3 56.96   19.6
## 62   3 59.54   17.9
## 63   3 62.09   16.3
## 64   3 64.59   14.9
## 65   3 67.03   13.6
## 66   3 69.40   12.3
## 67   3 71.69   11.2
## 68   3 73.88   10.2
## 69   3 75.97    9.3
## 70   3 77.96    8.5
## 71   3 79.83    7.8
## 72   3 81.58    7.1
## 73   3 83.22    6.4
## 74   3 84.73    5.9
## 75   3 86.13    5.3
## 76   3 87.41    4.9
## 77   3 88.58    4.4
## 78   3 89.65    4.0
## 79   3 90.61    3.7
## 80   3 91.47    3.4
## 81   3 92.24    3.1
## 82   3 92.93    2.8
## 83   3 93.54    2.5
## 84   3 94.08    2.3
## 85   3 94.55    2.1
## 86   3 94.97    1.9
## 87   3 95.33    1.8
## 88   3 95.64    1.6
## 89   3 95.92    1.5
## 90   3 96.15    1.3
## 91   3 96.36    1.2
## 92   3 96.53    1.1
## 93   3 96.68    1.0
## 94   3 96.81    0.9
## 95   3 96.92    0.8
## 96   3 97.01    0.8
## 97   3 97.09    0.7
## 98   3 97.16    0.6
## 99   3 97.22    0.6
## 100  3 97.27    0.5
plot(ridge_model)  # coefficient profiles along the regularization path

coef(ridge_model, s = 0.1)  # coefficients at lambda = 0.1; ridge shrinks x1 and x2 toward nearly equal values
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) 0.08468395
## x1          2.43992133
## x2          2.45989918
## x3          3.66428705
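
In practice the penalty strength lambda is usually chosen by cross-validation rather than fixed by hand, and setting alpha = 1 fits the lasso instead of ridge; a brief sketch using cv.glmnet:

X <- as.matrix(cbind(x1, x2, x3))
set.seed(789)                          # cross-validation folds are random
cv_ridge <- cv.glmnet(X, y, alpha=0)   # 10-fold cross-validation over the lambda path
cv_ridge$lambda.min                    # lambda with the smallest cross-validated error
coef(cv_ridge, s="lambda.min")         # coefficients at that lambda
# cv.glmnet(X, y, alpha=1) would fit and tune the lasso in the same way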

For more details, see the glmnet vignette: https://glmnet.stanford.edu/articles/glmnet.html