Regression analysis is a statistical method used to examine the relationship between a dependent variable and one or more independent variables. It helps identify the strength and nature of these relationships, and can be used for prediction and forecasting. By fitting a model to the data, regression analysis provides insights into how changes in the independent variables impact the dependent variable.
Assumptions in regression analysis are crucial because they ensure the validity and reliability of the model’s results. Key assumptions, such as linearity, independence, and homoscedasticity, help in making accurate inferences and predictions. Violating these assumptions can lead to misleading conclusions and unreliable estimates, undermining the effectiveness of the regression model.
The linearity assumption in regression analysis posits that the relationship between the independent and dependent variables is linear, meaning that changes in the independent variables are associated with proportional changes in the dependent variable. This assumption allows for a straightforward interpretation of the model coefficients, which represent constant effects of the predictors. If the true relationship is non-linear, a linear model may produce inaccurate predictions and misleading insights, necessitating alternative modeling approaches.
To identify non-linearity in data, a scatterplot of the independent variable(s) against the dependent variable can reveal whether the relationship deviates from a straight line. In addition, a residual plot, which displays residuals versus fitted values, can highlight patterns such as curves or other systematic structure that indicate non-linearity (a residual-plot check is sketched after the code below). Both visual tools help assess whether the linearity assumption is violated and signal when more flexible modeling techniques are needed.
# The ns() function for natural cubic splines comes from the splines package
library(splines)
x <- 1:100
y <- 3*x^2 + rnorm(100, mean=0, sd=500) # Quadratic term makes the relationship non-linear
plot(x, y, main="Non-Linear Relationship", xlab="X", ylab="Y")
# Fit a quadratic polynomial to capture the curvature
poly_model <- lm(y ~ poly(x, 2))
plot(x, fitted(poly_model))
# Fit a natural cubic spline with 5 degrees of freedom
spline_model <- lm(y ~ ns(x, df=5))
plot(x, fitted(spline_model))
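A residual plot makes the curvature even more apparent. As a minimal sketch using the same simulated x and y from above, fit a straight line and plot residuals against fitted values; a systematic U-shape indicates that the linearity assumption is violated.
# Fit a straight line and inspect residuals versus fitted values
lin_model <- lm(y ~ x)
plot(fitted(lin_model), resid(lin_model),
     main="Residuals vs Fitted (Linear Fit)", xlab="Fitted values", ylab="Residuals")
abline(h=0, lty=2) # reference line at zero; a curved band around it signals non-linearity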
Let’s consider a dataset where the relationship between the predictor x and the response variable y is exponential. A log transformation can help linearize this relationship.
# Create a sample dataset
set.seed(123)
x <- seq(1, 100, by=1)
y <- exp(0.05 * x) + rnorm(length(x), sd=10) # Exponential relationship with noise
# Scatterplot of original data
plot(x, y, main="Scatterplot of Original Data", xlab="x", ylab="y")
# Apply log transformation
# Note: for small x the noise can push y below zero, and log() of a negative
# value is NaN, which is what the warning below refers to
y_log <- log(y)
## Warning in log(y): NaNs produced
# Scatterplot of transformed data
plot(x, y_log, main="Scatterplot After Log Transformation", xlab="x", ylab="log(y)")
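After the transformation, a quick check is to fit a straight line on the log scale, dropping the few NaN observations; this is a minimal sketch using the x and y_log objects created above.
# Fit a linear model on the log scale, ignoring the NaN observations
log_model <- lm(y_log ~ x, na.action = na.omit)
summary(log_model)$coefficients # the slope should be roughly near the true growth rate of 0.05
The same idea applies when the mean of y grows quadratically with x: a square root transformation can pull the relationship back toward a straight line, as in the next example.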
# Create a sample dataset with a quadratic mean trend
set.seed(456)
x <- seq(1, 100, by=1)
y <- x^2 + rnorm(length(x), sd=50) # Quadratic relationship with additive noise
# Scatterplot of original data
plot(x, y, main="Scatterplot of Original Data", xlab="x", ylab="y")
# Apply square root transformation
# Note: for small x the noise can make y negative, and sqrt() of a negative
# value is NaN, which is what the warning below refers to
y_sqrt <- sqrt(y)
## Warning in sqrt(y): NaNs produced
# Scatterplot of transformed data
plot(x, y_sqrt, main="Scatterplot After Square Root Transformation", xlab="x", ylab="sqrt(y)")
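As with the log example, a straight-line fit on the transformed scale gives a quick check of linearization; this short sketch uses the x and y_sqrt objects created above.
# Because sqrt(x^2) = x, the fitted slope should be roughly 1 if the
# transformation has linearized the trend
sqrt_model <- lm(y_sqrt ~ x, na.action = na.omit)
coef(sqrt_model)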
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated with each other. This inflates standard errors, destabilizes the coefficient estimates, and makes it difficult to isolate the individual effect of each predictor, so detecting and addressing it is crucial for accurate and reliable regression analysis.
Let’s create a dataset where multicollinearity is present and demonstrate how to detect it using the Variance Inflation Factor (VIF) and a correlation matrix.
# Load necessary libraries
library(corrplot) # correlation matrix visualization
library(car)      # vif()
library(glmnet)   # ridge regression
# Create a sample dataset
set.seed(789)
x1 <- rnorm(100)
rho <- 0.99
x2 <- rho*x1 + sqrt(1 - rho^2) * rnorm(100) # x2 is almost perfectly correlated with x1
x3 <- rnorm(100)
y <- 2 * x1 + 3 * x2 + 4 * x3 + rnorm(100) # Response variable
# Combine into a data frame
data <- data.frame(x1, x2, x3, y)
# Calculate the correlation matrix
cor_matrix <- cor(data)
cor_matrix
##            x1         x2         x3         y
## x1 1.00000000 0.98851183 0.08831085 0.7959001
## x2 0.98851183 1.00000000 0.07940875 0.7913076
## x3 0.08831085 0.07940875 1.00000000 0.6494448
## y  0.79590010 0.79130759 0.64944483 1.0000000
# Visualize the correlation matrix; x1 and x2 are almost perfectly correlated
corrplot(cor_matrix, method="circle", main=" ")
# Fit a linear regression model
model <- lm(y ~ x1 + x2 + x3, data=data)
# Calculate Variance Inflation Factor (VIF)
vif_values <- vif(model)
vif_values
##        x1        x2        x3
## 43.961025 43.894972  1.010634
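The VIF for a predictor equals 1/(1 - R²) from regressing that predictor on the other predictors, and values above roughly 5-10 are usually taken as a sign of problematic multicollinearity. As a quick sanity check, the VIF for x1 can be reproduced by hand:
# Regress x1 on the remaining predictors and compute 1 / (1 - R^2);
# the result should match vif(model)["x1"]
r2_x1 <- summary(lm(x1 ~ x2 + x3, data = data))$r.squared
1 / (1 - r2_x1)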
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86054 -0.72486  0.08838  0.74269  2.40076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1200     0.1063   1.129 0.261540    
## x1            2.3409     0.7193   3.254 0.001570 ** 
## x2            2.7318     0.7239   3.774 0.000279 ***
## x3            3.9433     0.1086  36.320  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 96 degrees of freedom
## Multiple R-squared:  0.9752, Adjusted R-squared:  0.9744 
## F-statistic:  1258 on 3 and 96 DF,  p-value: < 2.2e-16
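Note how the standard errors for x1 and x2 (about 0.72) are far larger than for x3 (about 0.11), even though all three predictors were simulated with unit variance; this is exactly the inflation that the VIFs quantify. One simple remedy is to drop one of the nearly redundant predictors. The short sketch below refits the model without x2; the standard error of x1 should shrink sharply, at the cost of x1 now absorbing the combined effect of the two correlated predictors.
# Refit without x2 and inspect the remaining coefficients and standard errors
model_reduced <- lm(y ~ x1 + x3, data = data)
summary(model_reduced)$coefficients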
A remedy that keeps all predictors in the model is ridge regression, which shrinks the coefficients of correlated predictors and stabilizes their estimates; in glmnet, setting alpha=0 fits the ridge penalty.
# Fit a ridge regression over a grid of penalty values lambda
ridge_model <- glmnet(as.matrix(cbind(x1, x2, x3)), y, alpha=0)
summary(ridge_model)
##           Length Class     Mode   
## a0        100    -none-    numeric
## beta      300    dgCMatrix S4     
## df        100    -none-    numeric
## dim         2    -none-    numeric
## lambda    100    -none-    numeric
## dev.ratio 100    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        4    -none-    call   
## nobs        1    -none-    numeric
ridge_model
##
## Call: glmnet(x = as.matrix(cbind(x1, x2, x3)), y = y, alpha = 0)
##
## Df %Dev Lambda
## 1 3 0.00 5220.0
## 2 3 0.46 4757.0
## 3 3 0.51 4334.0
## 4 3 0.56 3949.0
## 5 3 0.61 3598.0
## 6 3 0.67 3279.0
## 7 3 0.73 2987.0
## 8 3 0.80 2722.0
## 9 3 0.88 2480.0
## 10 3 0.97 2260.0
## 11 3 1.06 2059.0
## 12 3 1.16 1876.0
## 13 3 1.28 1709.0
## 14 3 1.40 1558.0
## 15 3 1.53 1419.0
## 16 3 1.68 1293.0
## 17 3 1.84 1178.0
## 18 3 2.02 1074.0
## 19 3 2.21 978.2
## 20 3 2.43 891.3
## 21 3 2.66 812.1
## 22 3 2.91 740.0
## 23 3 3.19 674.2
## 24 3 3.49 614.3
## 25 3 3.82 559.8
## 26 3 4.18 510.0
## 27 3 4.57 464.7
## 28 3 4.99 423.4
## 29 3 5.46 385.8
## 30 3 5.97 351.6
## 31 3 6.52 320.3
## 32 3 7.11 291.9
## 33 3 7.76 265.9
## 34 3 8.47 242.3
## 35 3 9.23 220.8
## 36 3 10.05 201.2
## 37 3 10.94 183.3
## 38 3 11.90 167.0
## 39 3 12.94 152.2
## 40 3 14.05 138.7
## 41 3 15.25 126.3
## 42 3 16.53 115.1
## 43 3 17.90 104.9
## 44 3 19.36 95.6
## 45 3 20.91 87.1
## 46 3 22.56 79.3
## 47 3 24.31 72.3
## 48 3 26.15 65.9
## 49 3 28.09 60.0
## 50 3 30.12 54.7
## 51 3 32.25 49.8
## 52 3 34.46 45.4
## 53 3 36.75 41.4
## 54 3 39.12 37.7
## 55 3 41.55 34.4
## 56 3 44.04 31.3
## 57 3 46.58 28.5
## 58 3 49.16 26.0
## 59 3 51.75 23.7
## 60 3 54.36 21.6
## 61 3 56.96 19.6
## 62 3 59.54 17.9
## 63 3 62.09 16.3
## 64 3 64.59 14.9
## 65 3 67.03 13.6
## 66 3 69.40 12.3
## 67 3 71.69 11.2
## 68 3 73.88 10.2
## 69 3 75.97 9.3
## 70 3 77.96 8.5
## 71 3 79.83 7.8
## 72 3 81.58 7.1
## 73 3 83.22 6.4
## 74 3 84.73 5.9
## 75 3 86.13 5.3
## 76 3 87.41 4.9
## 77 3 88.58 4.4
## 78 3 89.65 4.0
## 79 3 90.61 3.7
## 80 3 91.47 3.4
## 81 3 92.24 3.1
## 82 3 92.93 2.8
## 83 3 93.54 2.5
## 84 3 94.08 2.3
## 85 3 94.55 2.1
## 86 3 94.97 1.9
## 87 3 95.33 1.8
## 88 3 95.64 1.6
## 89 3 95.92 1.5
## 90 3 96.15 1.3
## 91 3 96.36 1.2
## 92 3 96.53 1.1
## 93 3 96.68 1.0
## 94 3 96.81 0.9
## 95 3 96.92 0.8
## 96 3 97.01 0.8
## 97 3 97.09 0.7
## 98 3 97.16 0.6
## 99 3 97.22 0.6
## 100 3 97.27 0.5
# Plot the coefficient paths as the penalty lambda varies
plot(ridge_model)
# Extract coefficients at a specific penalty value (lambda = 0.1)
coef(ridge_model, s = 0.1)
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) 0.08468395
## x1          2.43992133
## x2          2.45989918
## x3          3.66428705
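The penalty s = 0.1 above was fixed arbitrarily for illustration. In practice lambda is usually chosen by cross-validation; a minimal sketch using cv.glmnet (reusing the same design matrix and response) is shown below.
# Choose lambda by 10-fold cross-validation (the cv.glmnet default)
set.seed(789)
cv_ridge <- cv.glmnet(as.matrix(cbind(x1, x2, x3)), y, alpha = 0)
cv_ridge$lambda.min               # lambda minimizing the cross-validated error
coef(cv_ridge, s = "lambda.min")  # ridge coefficients at that lambda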
For more detail, see https://glmnet.stanford.edu/articles/glmnet.html.