What is Regression?
Regression is a statistical method used to understand relationships
between variables. For example, it can show how income (the dependent
variable) is affected by education level and age (the independent variables).
Dependent Variable (Y):
The outcome or the variable you want to predict or explain.
Independent Variables (X):
The predictors or factors that influence the dependent
variable.
Multiple Regression:
Multiple regression is a statistical technique used to predict the value
of a dependent variable (\(Y\)) based
on two or more independent variables (\(X_1,
X_2, \dots, X_n\)). It is widely applied across various fields to
solve real-world problems, optimize resources, and make
predictions.
Introducing the equation for multiple regression:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_kX_k + \epsilon \]
\(Y\): Dependent variable.
\(X_1, X_2, \ldots, X_k\): Independent variables.
\(\beta_0\): Intercept (value of \(Y\) when all \(X\)s are 0).
\(\beta_1, \beta_2, \ldots\): Coefficients (how much \(Y\) changes when \(X\) changes by one unit).
\(\epsilon\): Error term (unexplained variation).
Real-life examples:
The assumptions in simple terms:
Linearity: The relationship between \(X\) and \(Y\) is linear.
No Multicollinearity: Independent variables should not be highly correlated.
Homoscedasticity: Variance of errors is constant.
Normality: Residuals (errors) are normally distributed.
Hours Studied (\(X_1\)) | Tutor Hours (\(X_2\)) | Test Score (\(Y\)) |
---|---|---|
2 | 3 | 50 |
4 | 5 | 70 |
6 | 8 | 90 |
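As a minimal sketch of how the coefficients in the regression equation are estimated, the three rows of the table above can be fitted by least squares with NumPy. The variable names are illustrative and not from the original text. With three observations and three parameters the fit is exact (\(Y = 30 + 10X_1 + 0X_2\) reproduces every row), so this only illustrates the mechanics and says nothing reliable about the effect of tutor hours.

```python
import numpy as np

# Rows of the small table above: hours studied (X1), tutor hours (X2), test score (Y).
X1 = np.array([2.0, 4.0, 6.0])
X2 = np.array([3.0, 5.0, 8.0])
Y = np.array([50.0, 70.0, 90.0])

# Design matrix with a leading column of ones for the intercept (beta_0).
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates of (beta_0, beta_1, beta_2).
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Fitted values and residuals (the estimated error term).
fitted = X @ beta
residuals = Y - fitted

print("coefficients:", np.round(beta, 3))
print("residuals:", np.round(residuals, 3))
```

A real analysis needs many more observations than parameters; otherwise the residuals are zero by construction and \(R^2\) is uninformative.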
Example: Predicting compressive strength of concrete (\(Y\)) using:
Example: Estimating energy consumption (\(Y\)) based on:
Example: Predicting crop yield (\(Y\)) using:
Dependent Variable (\(Y\)):
The outcome or result you want to predict.
Independent Variables (\(X_1, X_2, \dots, X_n\)):
Factors that influence the dependent variable.
Regression Equation:
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon
\]
Goodness of Fit (\(R^2\)):
Measures how well the independent variables explain the variability in
\(Y\).
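As a sketch of what \(R^2\) measures, assuming the usual least-squares definitions (the symbols \(\hat{Y}_i\) for the fitted value and \(\bar{Y}\) for the sample mean of \(Y\) are notation introduced here, not from the original text):
\[
R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}
\]
The numerator is the residual sum of squares and the denominator is the total sum of squares, so \(R^2\) is the share of the variation in \(Y\) that the fitted model accounts for.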
Civil Engineering:
Use real data to predict concrete strength based on mix design.
Electrical Engineering:
Analyze energy consumption data for a specific building or
area.
Agriculture:
Model the impact of fertilizer and irrigation on crop yield.
Geology:
Assess soil erosion risk using environmental data.
Mechanical Engineering:
Predict thermal efficiency using engine parameters.
Mechatronic Engineering:
Analyze robotic system data to improve positioning accuracy.
Intercept (\(\beta_0\)):
Value of \(Y\) when all \(X\) are zero.
Coefficients (\(\beta_1, \beta_2, \dots\)):
How much \(Y\) changes for a unit
increase in \(X\), keeping others
constant.
Significance:
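Use each coefficient's p-value (Sig.) to judge whether its effect is statistically distinguishable from zero, then focus on the significant variables with the largest impact on \(Y\).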
\(R^2\):
A higher \(R^2\) indicates a better
model fit.
Multiple regression is a powerful tool that allows professionals across disciplines to make data-driven decisions. By understanding the relationships between variables, you can predict outcomes, optimize resources, and design better systems.
SPSS provides built-in datasets that are perfect for training purposes. A commonly used dataset for regression analysis is the “Employee Data” file, which is included with SPSS. Here’s how you can access and use it:
Open SPSS.
Click on File > Open > Data.
In the Open Data dialog box:
Navigate to the SPSS installation directory on your computer (usually something like C:\Program Files\IBM\SPSS\Statistics\Samples for Windows).
Select Employee Data.sav and click Open.
The Employee Data file contains information on employees, including:
Dependent Variable (\(Y\)):
Current Salary: The salary of employees.
Independent Variables (\(X_1, X_2, \dots\)):
Education Level: Years of education completed.
Previous Experience: Months of prior work experience.
Gender: Male or Female.
Minority Classification: Minority group status.
We will predict Current Salary (\(Y\)) using the following independent variables:
Education Level (\(X_1\))
Previous Experience (\(X_2\))
Gender (\(X_3\))
Minority Classification (\(X_4\))
Open the Dataset: Follow the steps above to load the Employee Data.
Navigate to Regression Analysis:
Specify Variables:
Drag Current Salary into the Dependent box.
Drag Education Level, Previous Experience, Gender, and Minority Classification into the Independent(s) box.
Run the Analysis:
Regression Equation:
SPSS will provide coefficients that you can use to write the regression
equation: \[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 +
\epsilon
\]
Coefficients Table:
Check the Unstandardized Coefficients (the B column in the SPSS output) to interpret the effect of each variable on the dependent variable.
Look at the Significance (Sig.) column to determine whether each predictor is statistically significant.
Model Summary Table:
ANOVA Table:
Real-Life Application: Relate variables like salary, education, and experience to scenarios in various disciplines, such as assessing the impact of training on employee performance.
Assumptions: Discuss the assumptions of multiple regression, such as linearity, independence, and normality.
Visualization: Use scatterplots or residual plots to demonstrate the model’s fit.
Visualizations play a crucial role in understanding and communicating the results of a multiple regression analysis. Here are some key examples you can create in SPSS using the built-in Employee Data dataset.
Visualization | Purpose | Key Insight |
---|---|---|
Scatterplot Matrix | Check relationships and linearity | Visualize pairwise relationships. |
Residual Plot | Assess homoscedasticity | Random scatter indicates equal variance. |
Histogram of Residuals | Check normality of residuals | Residuals should form a bell curve. |
Normal P-P Plot | Confirm residuals’ normal distribution | Points align with the diagonal line. |
Predicted vs. Actual Plot | Assess predictive accuracy | Alignment with 45° line shows accuracy. |
Interaction Plots | Visualize interaction effects | Shows changes due to interaction terms. |
Boxplot | Compare dependent variable across categories | Highlights median differences by category. |
Line Chart | Observe trends over continuous predictors | Shows increasing/decreasing trends. |
Sample ID | Cement Content (kg/m³) | Water-Cement Ratio | Aggregate Size (mm) | Curing Time (days) | Compressive Strength (MPa) |
---|---|---|---|---|---|
1 | 350 | 0.45 | 20 | 7 | 24.5 |
2 | 300 | 0.50 | 25 | 14 | 26.7 |
3 | 400 | 0.40 | 10 | 28 | 42.1 |
4 | 320 | 0.55 | 20 | 28 | 29.3 |
5 | 380 | 0.48 | 15 | 14 | 35.2 |
6 | 410 | 0.38 | 25 | 7 | 30.8 |
7 | 330 | 0.52 | 20 | 14 | 27.5 |
8 | 370 | 0.46 | 10 | 28 | 38.4 |
9 | 340 | 0.50 | 25 | 7 | 23.7 |
10 | 360 | 0.42 | 20 | 28 | 39.5 |
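Dependent Variable (\(Y\)):
Compressive Strength (MPa): The strength of the concrete sample, which is the outcome to be predicted.
Independent Variables (\(X_1, X_2, X_3, X_4\)):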
Cement Content (kg/m³): The amount of cement used per cubic meter.
Water-Cement Ratio: The ratio of water to cement in the mix.
Aggregate Size (mm): The size of coarse aggregate in millimeters.
Curing Time (days): The duration for which the concrete is allowed to cure.
To predict the compressive strength of concrete based on cement content, water-cement ratio, aggregate size, and curing time.
How do cement content, water-cement ratio, aggregate size, and curing time affect the compressive strength of concrete?
Launch SPSS.
Go to File > New > Data to open a new data sheet.
In the Variable View tab:
Enter the variable names:
Sample_ID (Numeric)
Cement_Content (Numeric)
Water_Cement_Ratio (Numeric)
Aggregate_Size (Numeric)
Curing_Time (Numeric)
Compressive_Strength (Numeric)
Set the Measure for each variable:
Use Scale for all variables since they are continuous.
Switch to the Data View tab.
Go to Analyze > Regression > Linear.
In the Linear Regression Dialog Box:
Move Compressive_Strength to the Dependent box.
Move Cement_Content, Water_Cement_Ratio, Aggregate_Size, and Curing_Time to the Independent(s) box.
Click on Statistics, then check Estimates, Model fit, Descriptives, and Collinearity diagnostics.
Click Continue.
Click on Plots:
Add ZPRED to the X-axis and ZRESID to the Y-axis to create a residual plot.
Check Normal probability plot.
Click Continue.
Click OK to run the analysis.
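For readers who want to cross-check the SPSS run outside SPSS, the same linear model can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the data are copied from the table above, the variable names are chosen here, and an ordinary least-squares fit with an intercept is assumed to match what Analyze > Regression > Linear does by default.

```python
import numpy as np

# Columns: cement content (kg/m^3), water-cement ratio, aggregate size (mm),
# curing time (days), compressive strength (MPa), copied from the table above.
data = np.array([
    [350, 0.45, 20,  7, 24.5],
    [300, 0.50, 25, 14, 26.7],
    [400, 0.40, 10, 28, 42.1],
    [320, 0.55, 20, 28, 29.3],
    [380, 0.48, 15, 14, 35.2],
    [410, 0.38, 25,  7, 30.8],
    [330, 0.52, 20, 14, 27.5],
    [370, 0.46, 10, 28, 38.4],
    [340, 0.50, 25,  7, 23.7],
    [360, 0.42, 20, 28, 39.5],
])
predictors, y = data[:, :4], data[:, 4]

# Design matrix: a column of ones (intercept) followed by the four predictors.
X = np.column_stack([np.ones(len(y)), predictors])

# Ordinary least-squares coefficient estimates.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sums of squares and R^2.
resid = y - X @ beta
ss_res = float(resid @ resid)
ss_tot = float(((y - y.mean()) ** 2).sum())
r2 = 1.0 - ss_res / ss_tot

names = ["(Constant)", "Cement_Content", "Water_Cement_Ratio",
         "Aggregate_Size", "Curing_Time"]
for name, b in zip(names, beta):
    print(f"{name:<20} {b: .4f}")
print(f"R^2 = {r2:.3f}")
```

The printed coefficients and \(R^2\) can then be compared with the Coefficients and Model Summary tables that SPSS produces.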
The descriptive statistics for the dataset indicate that the average compressive strength is 31.77 MPa with a standard deviation of 6.59 MPa, reflecting moderate variability. The cement content averages 356.00 kg/m³, with a standard deviation of 35.02 kg/m³, which is small relative to the mean and suggests consistent material proportions. The water-cement ratio has a mean of 0.466 and low variability (standard deviation of 0.054). The aggregate size averages 19.00 mm with a standard deviation of 5.68 mm, indicating some variation in particle size. Lastly, the curing time has the highest relative variability, with a mean of 17.5 days and a standard deviation of 9.48 days (more than half the mean), showing large differences in curing durations across samples.
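The descriptive statistics quoted above can be reproduced from the data table with Python's standard library. This is a small sketch: the dictionary keys follow the SPSS variable names used earlier, and the coefficient of variation (standard deviation divided by mean) is added here as one way to compare variability across variables measured in different units.

```python
from statistics import mean, stdev

# Values copied from the concrete mix table above.
columns = {
    "Cement_Content": [350, 300, 400, 320, 380, 410, 330, 370, 340, 360],
    "Water_Cement_Ratio": [0.45, 0.50, 0.40, 0.55, 0.48, 0.38, 0.52, 0.46, 0.50, 0.42],
    "Aggregate_Size": [20, 25, 10, 20, 15, 25, 20, 10, 25, 20],
    "Curing_Time": [7, 14, 28, 28, 14, 7, 14, 28, 7, 28],
    "Compressive_Strength": [24.5, 26.7, 42.1, 29.3, 35.2, 30.8, 27.5, 38.4, 23.7, 39.5],
}

summary = {}
for name, values in columns.items():
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    summary[name] = {"mean": m, "sd": s, "cv": s / m}
    print(f"{name:<22} mean={m:8.3f}  sd={s:7.3f}  cv={s / m:5.2f}")
```

Comparing the `cv` values rather than the raw standard deviations avoids ranking variables by their units.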
Model Summary:
Look at \(R^2\) to assess how well the model explains the variability in the dependent variable.
The model summary shows a strong association between the predictors (cement content, water-cement ratio, aggregate size, and curing time) and the dependent variable (compressive strength), as indicated by the multiple correlation coefficient \(R = 0.960\). The \(R^2 = 0.921\) implies that 92.1% of the variation in compressive strength is explained by the model. The adjusted \(R^2 = 0.858\) accounts for the number of predictors and indicates a lower but still substantial explanatory power. The standard error of the estimate (2.48 MPa) reflects the typical deviation of observed compressive strength from the predicted values. The model fits this sample well, although with only 10 observations and four predictors the estimates should be interpreted with caution.
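The gap between \(R^2\) and adjusted \(R^2\) can be checked by hand. The sketch below assumes the standard adjustment formula, \(1 - (1 - R^2)(n - 1)/(n - k - 1)\), with \(n = 10\) samples and \(k = 4\) predictors taken from this example. The function name is illustrative.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Values from the model summary: R^2 = 0.921, 10 samples, 4 predictors.
adj = adjusted_r2(0.921, n=10, k=4)
print(round(adj, 3))  # 0.858
```

Because \(n - k - 1\) is only 5 here, the penalty for each extra predictor is large, which is why the adjusted value drops noticeably below \(R^2\).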
ANOVA Table:
Check the significance value (p-value) to determine if the regression model is statistically significant.
The ANOVA table evaluates the overall significance of the regression model. The regression sum of squares (360.266) and residual sum of squares (30.875) indicate that most of the variation in compressive strength is explained by the model. The mean square for regression (90.066) and residual (6.175) are used to calculate the F-statistic (14.585). The associated p-value (\(\text{Sig.} = 0.006\)) is below 0.05, confirming that the regression model is statistically significant and that the predictors (cement content, water-cement ratio, aggregate size, and curing time) collectively influence compressive strength.
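The arithmetic behind the ANOVA table can be retraced in a few lines. This sketch uses only the sums of squares reported above and assumes the usual degrees of freedom (\(k\) for regression, \(n - k - 1\) for residual). The p-value itself is taken from the SPSS output and is not recomputed here.

```python
# Sums of squares reported in the ANOVA table.
ss_regression = 360.266
ss_residual = 30.875

n, k = 10, 4                 # samples and predictors in this example
df_regression = k            # 4
df_residual = n - k - 1      # 5

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F-statistic, R^2, and standard error of the estimate.
f_stat = ms_regression / ms_residual
r2 = ss_regression / (ss_regression + ss_residual)
std_error = ms_residual ** 0.5

print(f"MS regression = {ms_regression:.3f}, MS residual = {ms_residual:.3f}")
print(f"F = {f_stat:.3f}, R^2 = {r2:.3f}, SE of estimate = {std_error:.2f}")
```

Small differences in the last decimal place relative to the SPSS tables are expected, since SPSS works from unrounded sums of squares.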
Coefficients Table:
Examine the coefficients (\(\beta\)) and their significance values to understand the effect of each predictor.
The coefficients table provides insights into the individual contributions of each predictor to the model. The intercept (\(\text{Constant} = 22.995\)) represents the estimated compressive strength when all predictors are zero. Among the predictors:
Collinearity diagnostics show tolerances above 0.1 and VIF values below 10, suggesting no severe multicollinearity among the predictors.
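Tolerance and VIF can be recomputed directly from the predictors. The sketch below assumes the standard definitions: each predictor is regressed on the remaining predictors, tolerance is \(1 - R_j^2\), and VIF is its reciprocal. The data are copied from the concrete table, and the helper name `r_squared` is illustrative.

```python
import numpy as np

names = ["Cement_Content", "Water_Cement_Ratio", "Aggregate_Size", "Curing_Time"]
# Predictor columns copied from the concrete mix table.
X = np.array([
    [350, 0.45, 20,  7], [300, 0.50, 25, 14], [400, 0.40, 10, 28],
    [320, 0.55, 20, 28], [380, 0.48, 15, 14], [410, 0.38, 25,  7],
    [330, 0.52, 20, 14], [370, 0.46, 10, 28], [340, 0.50, 25,  7],
    [360, 0.42, 20, 28],
])


def r_squared(y, others):
    """R^2 from regressing y on the columns of `others` plus an intercept."""
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()


vif = {}
for j, name in enumerate(names):
    # Regress predictor j on all the other predictors.
    r2_j = r_squared(X[:, j], np.delete(X, j, axis=1))
    vif[name] = 1.0 / (1.0 - r2_j)
    print(f"{name:<20} tolerance={1.0 - r2_j:.3f}  VIF={vif[name]:.2f}")
```

A VIF of 1 means a predictor is uncorrelated with the others. The thresholds of 0.1 for tolerance and 10 for VIF are common rules of thumb rather than strict cut-offs.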
The Collinearity Diagnostics table helps identify multicollinearity among the predictors by examining eigenvalues, condition indices, and variance proportions:
The high condition index in Dimension 5 (>30) and overlapping variance proportions suggest potential multicollinearity issues, particularly involving Cement Content and Water-Cement Ratio. To address this:
- Consider removing or combining predictors.
- Use regularization techniques (e.g., ridge regression) if multicollinearity significantly affects model stability.
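The eigenvalues and condition indices in this kind of table can be approximated outside SPSS. The sketch below assumes the usual construction: the design matrix, including the intercept column, has each column scaled to unit length without centering, the eigenvalues come from its cross-product matrix, and each condition index is the square root of the largest eigenvalue divided by that dimension's eigenvalue. Whether this matches SPSS to the last digit is not verified here.

```python
import numpy as np

# Predictor columns copied from the concrete mix table:
# cement content, water-cement ratio, aggregate size, curing time.
X = np.array([
    [350, 0.45, 20,  7], [300, 0.50, 25, 14], [400, 0.40, 10, 28],
    [320, 0.55, 20, 28], [380, 0.48, 15, 14], [410, 0.38, 25,  7],
    [330, 0.52, 20, 14], [370, 0.46, 10, 28], [340, 0.50, 25,  7],
    [360, 0.42, 20, 28],
])

# Design matrix with intercept; scale every column to unit length (no centering).
A = np.column_stack([np.ones(len(X)), X])
A_scaled = A / np.linalg.norm(A, axis=0)

# Singular values come back in descending order; their squares are the
# eigenvalues of the scaled cross-product matrix A_scaled.T @ A_scaled.
singular_values = np.linalg.svd(A_scaled, compute_uv=False)
eigenvalues = singular_values ** 2
condition_index = singular_values[0] / singular_values

for dim, (ev, ci) in enumerate(zip(eigenvalues, condition_index), start=1):
    print(f"Dimension {dim}: eigenvalue={ev:.5f}  condition index={ci:.2f}")
```

Because the columns are not centered, predictors whose values are large relative to their spread (such as cement content here) are close to a multiple of the intercept column, which can inflate the largest condition index even when pairwise correlations look moderate.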
The Residuals Statistics table provides information about the model’s predicted values, residuals (differences between observed and predicted values), and standardized values, helping to evaluate model fit and detect outliers:
The residuals exhibit reasonable variability without extreme outliers, indicating a good model fit. However, further diagnostic plots (e.g., residual vs. predicted plots or normal Q-Q plots) may help confirm assumptions of linear regression, such as normality and homoscedasticity of residuals.
Residual Plot:
Ensure residuals are randomly distributed (no patterns).
This is a scatterplot of standardized residuals vs. standardized predicted values for the dependent variable, Compressive Strength (MPa). This plot is used to assess whether the assumptions of linear regression are met, particularly the assumptions of homoscedasticity (constant variance of residuals) and linearity.
The scatterplot supports the assumptions of linear regression, with residuals showing no systematic patterns and a roughly constant spread. This indicates that the model is a good fit for the data.
Normal P-P Plot:
Confirm residuals follow a normal distribution.
This is a Normal P-P Plot of Regression Standardized Residuals for the dependent variable, Compressive Strength (MPa). The plot evaluates whether the residuals (differences between observed and predicted values) follow a normal distribution, which is a key assumption in linear regression.
The plot suggests that the residuals are reasonably normally distributed, meaning the model satisfies the normality assumption for regression. This is a good indication of the model’s validity, but you might still consider additional tests (e.g., Shapiro-Wilk) or plots (e.g., histogram of residuals) for a more thorough analysis.
Create scatterplots or predicted vs. actual plots to visualize the model fit. Refer to the visualizations in the earlier steps for detailed guidance.
This plot shows the relationship between the observed compressive strength (MPa) and the unstandardized predicted values (PRED) from the regression model.
The plot supports the reliability of the regression model in predicting compressive strength (MPa). The closeness of the scatter points to the diagonal line indicates that the model’s predictions align well with the observed values.
For elementary, intermediate, and advanced training in data science using R, Python, Stata, and other tools, visit https://softdataconsult.com/ or https://softdataconsult.github.io/, or email softdataconsult@gmail.com.