Comprehensive Note on Multiple Regression for Multidisciplinary Learners


1. Basics of Regression

Key Concepts:

What is Regression?
Regression is a statistical method used to understand relationships between variables. For example, it can show how income (the dependent variable) is affected by education level and age (the independent variables).

  • Dependent Variable (Y):
    The outcome or the variable you want to predict or explain.

  • Independent Variables (X):
    The predictors or factors that influence the dependent variable.

  • Multiple Regression:
    Multiple regression is a statistical technique used to predict the value of a dependent variable (\(Y\)) based on two or more independent variables (\(X_1, X_2, \dots, X_n\)). It is widely applied across various fields to solve real-world problems, optimize resources, and make predictions.

2. Breakdown of the Equation

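The equation for multiple regression is:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon \]

  • \(Y\): Dependent variable.
  • \(X_1, X_2, \dots, X_n\): Independent variables.
  • \(\beta_0\): Intercept (value of \(Y\) when all \(X\) are zero).
  • \(\beta_1, \beta_2, \dots, \beta_n\): Coefficients of the independent variables.
  • \(\epsilon\): Error term (unexplained variation in \(Y\)).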

Real-life examples:


3. Assumptions

The assumptions in simple terms:


4. Data Collection and Preparation

Example Dataset:

Hours Studied (\(X_1\))   Tutor Hours (\(X_2\))   Test Score (\(Y\))
2                         3                       50
4                         5                       70
6                         8                       90
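
To make the link between the table and the regression equation concrete, here is a minimal sketch in Python that fits \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2\) to the three rows above by least squares. It assumes NumPy is installed; the new student (5 hours studied, 6 tutor hours) is a hypothetical input. With three rows and three coefficients the fit passes through every point exactly, so this only illustrates the mechanics and says nothing about how reliable the model is.

```python
import numpy as np

# Example dataset from the table above.
hours_studied = np.array([2.0, 4.0, 6.0])   # X1
tutor_hours = np.array([3.0, 5.0, 8.0])     # X2
test_score = np.array([50.0, 70.0, 90.0])   # Y

# Design matrix: a column of ones for the intercept, then X1 and X2.
X = np.column_stack([np.ones_like(hours_studied), hours_studied, tutor_hours])

# Least-squares estimate of (beta_0, beta_1, beta_2).
coefficients, _, _, _ = np.linalg.lstsq(X, test_score, rcond=None)
b0, b1, b2 = coefficients

# Hypothetical new student: 5 hours studied, 6 tutor hours.
predicted = b0 + b1 * 5 + b2 * 6

print("coefficients:", np.round(coefficients, 3))
print("predicted score:", round(float(predicted), 2))
```

For these three rows the solution works out to roughly \(Y = 30 + 10X_1 + 0X_2\). A real analysis needs many more observations than coefficients.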

Steps:

  • Gather data.
  • Ensure the data is clean and relevant.

Materials Needed:


Applications in Various Disciplines

Civil Engineering

  • Example: Predicting compressive strength of concrete (\(Y\)) using:

    • Water-to-cement ratio (\(X_1\))
    • Aggregate size (\(X_2\))
    • Curing time (\(X_3\))

Electrical Engineering

  • Example: Estimating energy consumption (\(Y\)) based on:

    • Number of appliances (\(X_1\))
    • Hours of operation (\(X_2\))
    • Temperature (\(X_3\))

Agriculture

  • Example: Predicting crop yield (\(Y\)) using:

    • Fertilizer application rate (\(X_1\))
    • Irrigation volume (\(X_2\))
    • Sunlight hours (\(X_3\))

Geology

  • Example: Modeling soil erosion rate (\(Y\)) based on:
    • Slope gradient (\(X_1\))
    • Soil type (\(X_2\))
    • Rainfall intensity (\(X_3\))

Mechanical Engineering

  • Example: Predicting thermal efficiency of an engine (\(Y\)) using:
    • Compression ratio (\(X_1\))
    • Engine speed (\(X_2\))
    • Fuel-air ratio (\(X_3\))

Mechatronic Engineering

  • Example: Predicting robotic arm positioning accuracy (\(Y\)) using:
    • Motor torque (\(X_1\))
    • Joint angles (\(X_2\))
    • Object weight (\(X_3\))

Key Concepts in Multiple Regression

  1. Dependent Variable (\(Y\)):
    The outcome or result you want to predict.

  2. Independent Variables (\(X_1, X_2, \dots, X_n\)):
    Factors that influence the dependent variable.

  3. Regression Equation:
    \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon \]

    • \(\beta_0\): Intercept (value of \(Y\) when all \(X\) are zero)
    • \(\beta_1, \beta_2, \dots, \beta_n\): Coefficients (how much \(Y\) changes for a unit change in the corresponding \(X\), with the other predictors held constant)
    • \(\epsilon\): Error term (unexplained variation in \(Y\))
  4. Goodness of Fit (\(R^2\)):
    Measures how well the independent variables explain the variability in \(Y\).

Activities for Learners

  1. Civil Engineering:
    Use real data to predict concrete strength based on mix design.

  2. Electrical Engineering:
    Analyze energy consumption data for a specific building or area.

  3. Agriculture:
    Model the impact of fertilizer and irrigation on crop yield.

  4. Geology:
    Assess soil erosion risk using environmental data.

  5. Mechanical Engineering:
    Predict thermal efficiency using engine parameters.

  6. Mechatronic Engineering:
    Analyze robotic system data to improve positioning accuracy.


Practical Tips for Interpreting Results

  1. Intercept (\(\beta_0\)):
    Value of \(Y\) when all \(X\) are zero.

  2. Coefficients (\(\beta_1, \beta_2, \dots\)):
    How much \(Y\) changes for a unit increase in \(X\), keeping others constant.

  3. Significance:
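    Check the p-value (Sig.) of each predictor; values below 0.05 are conventionally treated as statistically significant. Among the significant predictors, focus on those with the largest effect on \(Y\).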

  4. \(R^2\):
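    A higher \(R^2\) indicates a better model fit. Because \(R^2\) never decreases when predictors are added, use adjusted \(R^2\) to compare models with different numbers of predictors.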


Conclusion

Multiple regression is a powerful tool that allows professionals across disciplines to make data-driven decisions. By understanding the relationships between variables, you can predict outcomes, optimize resources, and design better systems.

Steps to Perform Multiple Regression in SPSS


Step 1: Prepare Your Data

  • Ensure your dataset is complete, with all variables properly defined.
  • Label your variables clearly in the SPSS Data Editor.
    For example:
    • Dependent Variable (\(Y\))
    • Independent Variables (\(X_1, X_2, X_3\))

Step 2: Load Data into SPSS

  1. Open SPSS.
  2. Import or manually enter your data into the Data View tab.
  3. In the Variable View tab:
    • Assign meaningful names to the variables.
    • Define the type (e.g., numeric).
    • Add labels if needed for clarity.

Step 3: Navigate to Regression Analysis

  1. Go to the Analyze menu.
  2. Select Regression > Linear from the dropdown menu.

Step 4: Specify the Model

  1. In the Linear Regression dialog box:
    • Drag and drop the Dependent Variable (\(Y\)) into the Dependent box.
    • Drag and drop all Independent Variables (\(X_1, X_2, \dots\)) into the Independent(s) box.
  2. Optional: Click Statistics to include additional outputs, such as confidence intervals and collinearity diagnostics, then click Continue.

Step 5: Check Additional Options (Optional)

  • Plots: Create scatterplots or residual plots to check assumptions.
  • Save: Save predicted values or residuals for further analysis.
  • Method: Choose the default Enter method to include all variables simultaneously, or explore stepwise options.

Step 6: Run the Analysis

  1. Click OK to run the regression.
  2. SPSS will output the results in the Output Viewer.

Interpreting the Output

1. Model Summary Table

  • \(R^2\): The proportion of variance in the dependent variable accounted for by the independent variables.
  • Adjusted \(R^2\): Adjusted for the number of predictors in the model.

2. ANOVA Table

  • Shows whether the regression model is statistically significant (look at the Sig. value).

3. Coefficients Table

  • Unstandardized Coefficients (the B column, which holds the estimates of \(\beta\)): Use these for the regression equation: \[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon \]
    • \(\beta_0\): Intercept.
    • \(\beta_1, \beta_2, \beta_3\): Effects of each predictor.
  • Significance (Sig.): Indicates whether each predictor is statistically significant.

4. Residual Statistics (if requested)

  • Check residuals to assess model assumptions, like linearity and homoscedasticity.

Using Built-In SPSS Data for Multiple Regression

SPSS provides built-in datasets that are perfect for training purposes. A commonly used dataset for regression analysis is the “Employee Data” file, which is included with SPSS. Here’s how you can access and use it:

Accessing the Built-In Dataset

  1. Open SPSS.

  2. Click on File > Open > Data.

  3. In the Open Data dialog box:

    • Navigate to the SPSS installation directory on your computer (usually something like C:\Program Files\IBM\SPSS\Statistics\Samples for Windows).

    • Select Employee Data.sav and click Open.


Understanding the Employee Data

The Employee Data file contains information on employees, including:

  • Dependent Variable (\(Y\)):
    Current Salary: The salary of employees.


Setting Up the Regression Analysis

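We will predict Current Salary (\(Y\)) using the following independent variables:

  • Education Level (\(X_1\))
  • Previous Experience (\(X_2\))
  • Gender (\(X_3\))
  • Minority Classification (\(X_4\))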

Steps in SPSS

  1. Open the Dataset: Follow the steps above to load the Employee Data.

  2. Navigate to Regression Analysis:

    • Go to Analyze > Regression > Linear.
  3. Specify Variables:

    • Drag Current Salary into the Dependent box.

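    • Drag Education Level, Previous Experience, Gender, and Minority Classification into the Independent(s) box. If a categorical predictor such as Gender is stored as text, recode it as a numeric 0/1 variable first.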

  4. Run the Analysis:

    • Click OK.

Interpreting the Results

  1. Regression Equation:
    SPSS will provide coefficients that you can use to write the regression equation: \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \epsilon \]

  2. Coefficients Table:

    • Check the Unstandardized Coefficients (the B column, which holds the estimates of \(\beta\)) to interpret the effect of each variable on the dependent variable.

    • Look at the Significance (Sig.) column to determine whether each predictor is statistically significant.

  3. Model Summary Table:

    • \(R^2\) shows the percentage of variance in Current Salary explained by the predictors.
  4. ANOVA Table:

    • Tests whether the regression model is statistically significant overall.

Practical Notes for Learners

Essential Visualizations for Multiple Regression in SPSS

Visualizations play a crucial role in understanding and communicating the results of a multiple regression analysis. Here are some key examples you can create in SPSS using the built-in Employee Data dataset.


1. Scatterplot Matrix

Purpose:

  • To examine the relationship between the dependent variable and independent variables.
  • To check for linearity.

Steps in SPSS:

  1. Go to Graphs > Chart Builder.
  2. Choose Scatter/Dot from the gallery.
  3. Select Matrix Scatterplot.
  4. Drag the dependent variable (Current Salary) and independent variables (Education Level, Previous Experience) into the chart.

Interpretation:

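  • Each panel shows how one pair of variables is related; the panels in the Current Salary row or column show each independent variable against the dependent variable.
  • Look for roughly linear patterns, which suggest that a linear regression model is appropriate.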

2. Residual Plot

Purpose:

  • To check the assumption of homoscedasticity (equal variance of residuals).
  • To identify any patterns that indicate a poor model fit.

Steps in SPSS:

  1. While performing regression, click Plots in the Linear Regression dialog.
  2. Move *ZPRED (standardized predicted values) to the X axis and *ZRESID (standardized residuals) to the Y axis.
  3. Click Continue and run the analysis.

Interpretation:

  • Residuals should be randomly scattered around zero.
  • Patterns (e.g., a funnel shape) may indicate heteroscedasticity.

3. Histogram of Residuals

Purpose:

  • To check the normality assumption of residuals.

Steps in SPSS:

  1. In the Linear Regression dialog, click Save and select Standardized Residuals.
  2. Go to Graphs > Chart Builder.
  3. Choose Histogram and plot the saved standardized residuals.

Interpretation:

  • The histogram should approximate a normal distribution. Deviations may suggest non-normal residuals.

4. Normal P-P Plot of Residuals

Purpose:

  • To assess whether residuals follow a normal distribution.

Steps in SPSS:

  1. In the Linear Regression dialog, click Plots.
  2. Check Normal probability plot.
  3. Run the analysis.

Interpretation:

  • Points should fall along the diagonal line if the residuals are normally distributed.

5. Predicted vs. Actual Plot

Purpose:

  • To assess the model’s predictive accuracy.

Steps in SPSS:

  1. Save the predicted values during the regression analysis (Save > Predicted Values).
  2. Go to Graphs > Chart Builder.
  3. Choose Scatter/Dot and plot Predicted Values on the X-axis and Actual Values (Dependent Variable) on the Y-axis.

Interpretation:

  • A strong alignment along the 45° line indicates high predictive accuracy.

6. Interaction Plots (Optional)

Purpose:

  • To visualize the effect of interaction terms (e.g., Gender × Education Level) if they are included in the model.

Steps in SPSS:

  1. Create the interaction term by multiplying the two variables (e.g., Gender × Education Level), for example with Transform > Compute Variable.
  2. Include the interaction term in your regression model, together with the original variables.
  3. Use Graphs > Chart Builder to plot the interaction effect.
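
The multiplication in step 1 can be sketched outside SPSS as well. The short Python example below builds an interaction column and the matching rows of a design matrix. The six records are hypothetical values invented for illustration (they are not taken from Employee Data), and the 0/1 coding of gender is an assumption.

```python
# Hypothetical records for illustration only (not taken from Employee Data).
# Gender is assumed to be coded 0/1; education is measured in years.
records = [
    {"gender": 0, "education": 12},
    {"gender": 1, "education": 16},
    {"gender": 0, "education": 15},
    {"gender": 1, "education": 12},
    {"gender": 1, "education": 19},
    {"gender": 0, "education": 8},
]

# The interaction term is the product of the two predictors.
for record in records:
    record["gender_x_education"] = record["gender"] * record["education"]

# Each design row holds: intercept, both main effects, and the interaction.
design_rows = [
    [1, r["gender"], r["education"], r["gender_x_education"]]
    for r in records
]

for row in design_rows:
    print(row)
```

Keeping both original variables in the model next to their product is what allows the interaction coefficient to be read as a change in the education slope between the two groups.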

Interpretation:

  • The graph shows how the dependent variable changes with one predictor at different levels of another predictor.

7. Boxplot of Categorical Predictors

Purpose:

  • To visualize the distribution of the dependent variable across levels of categorical predictors (e.g., Gender, Minority Classification).

Steps in SPSS:

  1. Go to Graphs > Chart Builder.
  2. Choose Boxplot and drag the dependent variable (Current Salary) to the Y-axis.
  3. Add the categorical variable (e.g., Gender) to the X-axis.

Interpretation:

  • Compare medians and spreads across groups. Large differences suggest that the variable is related to the dependent variable; the regression output shows whether the effect is statistically significant.

Summary of Visualizations and Their Use

Visualization               Purpose                                        Key Insight
Scatterplot Matrix          Check relationships and linearity              Visualize pairwise relationships.
Residual Plot               Assess homoscedasticity                        Random scatter indicates equal variance.
Histogram of Residuals      Check normality of residuals                   Residuals should form a bell curve.
Normal P-P Plot             Confirm residuals’ normal distribution         Points align with the diagonal line.
Predicted vs. Actual Plot   Assess predictive accuracy                     Alignment with 45° line shows accuracy.
Interaction Plots           Visualize interaction effects                  Shows changes due to interaction terms.
Boxplot                     Compare dependent variable across categories   Highlights median differences by category.
Line Chart                  Observe trends over continuous predictors      Shows increasing/decreasing trends.

Dataset: Compressive Strength of Concrete

Sample ID   Cement Content (kg/m³)   Water-Cement Ratio   Aggregate Size (mm)   Curing Time (days)   Compressive Strength (MPa)
1           350                      0.45                 20                    7                    24.5
2           300                      0.50                 25                    14                   26.7
3           400                      0.40                 10                    28                   42.1
4           320                      0.55                 20                    28                   29.3
5           380                      0.48                 15                    14                   35.2
6           410                      0.38                 25                    7                    30.8
7           330                      0.52                 20                    14                   27.5
8           370                      0.46                 10                    28                   38.4
9           340                      0.50                 25                    7                    23.7
10          360                      0.42                 20                    28                   39.5

Variable Descriptions

Using This Dataset

Goal:

To predict the compressive strength of concrete based on cement content, water-cement ratio, aggregate size, and curing time.

Example Research Question:

How do cement content, water-cement ratio, aggregate size, and curing time affect the compressive strength of concrete?

Steps to Enter the Dataset into SPSS and Perform Multiple Regression

Step 1: Open SPSS

Launch SPSS.

Go to File > New > Data to open a new data sheet.

Step 2: Create Variables

In the Variable View tab:

  • Enter the variable names:

    • Sample_ID (Numeric)
    • Cement_Content (Numeric)
    • Water_Cement_Ratio (Numeric)
    • Aggregate_Size (Numeric)
    • Curing_Time (Numeric)
    • Compressive_Strength (Numeric)

  • Set the Measure for each variable:

    • Use Scale for the five measured variables, since they are continuous. Sample_ID is only an identifier and can be set to Nominal.

Step 3: Input the Data

Switch to the Data View tab and enter the ten rows from the dataset table above, one sample per row.
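
Typing sixty values by hand is error-prone, so an alternative is to prepare the data as CSV text and import it. The sketch below builds that text with Python's standard library, using the variable names from Step 2. The helper name `to_csv_text` is hypothetical; saving the text to a file and importing it into SPSS are left to the reader.

```python
import csv
import io

# Column names follow the variable names defined in Step 2.
header = ["Sample_ID", "Cement_Content", "Water_Cement_Ratio",
          "Aggregate_Size", "Curing_Time", "Compressive_Strength"]

# The ten samples from the dataset table above.
rows = [
    (1, 350, 0.45, 20, 7, 24.5),
    (2, 300, 0.50, 25, 14, 26.7),
    (3, 400, 0.40, 10, 28, 42.1),
    (4, 320, 0.55, 20, 28, 29.3),
    (5, 380, 0.48, 15, 14, 35.2),
    (6, 410, 0.38, 25, 7, 30.8),
    (7, 330, 0.52, 20, 14, 27.5),
    (8, 370, 0.46, 10, 28, 38.4),
    (9, 340, 0.50, 25, 7, 23.7),
    (10, 360, 0.42, 20, 28, 39.5),
]


def to_csv_text(header, rows):
    """Return the dataset as CSV text with one header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(header, rows)
print(csv_text)
```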

Step 4: Perform Multiple Regression

Go to Analyze > Regression > Linear.

In the Linear Regression Dialog Box:

  • Move Compressive_Strength to the Dependent box.

  • Move Cement_Content, Water_Cement_Ratio, Aggregate_Size, and Curing_Time to the Independent(s) box.

Click on Statistics:

  • Check Estimates, Model Fit, Descriptives, and Collinearity Diagnostics.

  • Click Continue.

Click on Plots:

  • Add ZPRED to the X-axis and ZRESID to the Y-axis to create a residual plot.

  • Check Normal probability plot.

  • Click Continue.

Click OK to run the analysis.
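
For readers who want to cross-check the SPSS run, the same model can be fitted in a few lines of Python. This is a minimal sketch, assuming NumPy is installed; it uses plain least squares on the ten samples and prints only the coefficients and \(R^2\), not the full set of SPSS tables.

```python
import numpy as np

# Columns: cement (kg/m^3), water-cement ratio, aggregate (mm),
# curing time (days), compressive strength (MPa).
data = np.array([
    [350, 0.45, 20, 7, 24.5],
    [300, 0.50, 25, 14, 26.7],
    [400, 0.40, 10, 28, 42.1],
    [320, 0.55, 20, 28, 29.3],
    [380, 0.48, 15, 14, 35.2],
    [410, 0.38, 25, 7, 30.8],
    [330, 0.52, 20, 14, 27.5],
    [370, 0.46, 10, 28, 38.4],
    [340, 0.50, 25, 7, 23.7],
    [360, 0.42, 20, 28, 39.5],
])
predictors, strength = data[:, :4], data[:, 4]

# Add an intercept column, then solve the least-squares problem.
X = np.column_stack([np.ones(len(strength)), predictors])
beta, _, rank, _ = np.linalg.lstsq(X, strength, rcond=None)

fitted = X @ beta
ss_residual = float(np.sum((strength - fitted) ** 2))
ss_total = float(np.sum((strength - strength.mean()) ** 2))
r_squared = 1.0 - ss_residual / ss_total

names = ["Constant", "Cement_Content", "Water_Cement_Ratio",
         "Aggregate_Size", "Curing_Time"]
for name, value in zip(names, beta):
    print(f"{name:>20}: {value: .3f}")
print(f"R^2 = {r_squared:.3f}")
```

The printed values can be compared with the B column of the Coefficients table and the Model Summary in the SPSS output.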

Step 5: Interpret the Output

  • Descriptive Statistics:

The descriptive statistics for the dataset indicate that the average compressive strength is 31.77 MPa with a standard deviation of 6.59 MPa, reflecting moderate variability. The cement content averages 356.00 kg/m³, with a standard deviation of 35.02 kg/m³, which is small relative to the mean and suggests consistent material proportions. The water-cement ratio has a mean of 0.466 and low variability (standard deviation of 0.054). The aggregate size averages 19.00 mm with a standard deviation of 5.68 mm, indicating some variation in particle size. Lastly, the curing time has the highest variability relative to its mean, with a mean of 17.5 days and a standard deviation of 9.48 days, showing large differences in curing durations across samples.
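
The figures in this paragraph can be reproduced from the dataset with Python's standard library. The sketch below computes the mean, the sample standard deviation, and the coefficient of variation (standard deviation divided by mean) for each variable. The coefficient of variation is an addition for comparing variability across variables measured in different units.

```python
from statistics import mean, stdev

columns = {
    "Cement_Content": [350, 300, 400, 320, 380, 410, 330, 370, 340, 360],
    "Water_Cement_Ratio": [0.45, 0.50, 0.40, 0.55, 0.48, 0.38, 0.52, 0.46, 0.50, 0.42],
    "Aggregate_Size": [20, 25, 10, 20, 15, 25, 20, 10, 25, 20],
    "Curing_Time": [7, 14, 28, 28, 14, 7, 14, 28, 7, 28],
    "Compressive_Strength": [24.5, 26.7, 42.1, 29.3, 35.2, 30.8, 27.5, 38.4, 23.7, 39.5],
}

summary = {}
for name, values in columns.items():
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    summary[name] = (m, s, s / m)  # last entry: coefficient of variation
    print(f"{name:>22}: mean={m:.3f}  sd={s:.3f}  cv={s / m:.2f}")
```

In absolute terms cement content has the largest standard deviation, but relative to its mean curing time varies the most.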

  • Correlations:

  • Model Summary:

  • Look at \(R^2\) to assess how well the model explains the variability in the dependent variable.

The model summary shows a strong relationship between the predictors (cement content, water-cement ratio, aggregate size, and curing time) and the dependent variable (compressive strength), as indicated by the multiple correlation coefficient \(R = 0.960\). The \(R^2 = 0.921\) implies that 92.1% of the variation in compressive strength is explained by the model. The adjusted \(R^2 = 0.858\) accounts for the number of predictors; it is noticeably lower than \(R^2\) because four predictors are fitted to only ten samples, but it still indicates substantial explanatory power. The standard error of the estimate (2.48 MPa) reflects the typical deviation of observed compressive strength from the predicted values. Overall, the model fits this sample well, although a fit based on ten observations should be confirmed on more data before it is relied on for prediction.

  • ANOVA Table:

  • Check the significance value (p-value) to determine if the regression model is statistically significant.

The ANOVA table evaluates the overall significance of the regression model. The regression sum of squares (360.266) and residual sum of squares (30.875) indicate that most of the variation in compressive strength is explained by the model. The mean square for regression (90.066) and residual (6.175) are used to calculate the F-statistic (14.585). The associated p-value (\(\text{Sig.} = 0.006\)) is below 0.05, confirming that the regression model is statistically significant and that the predictors (cement content, water-cement ratio, aggregate size, and curing time) collectively influence compressive strength.
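
The quantities in the Model Summary and ANOVA tables are linked by simple arithmetic. The sketch below starts from the two sums of squares reported above and recomputes the mean squares, the F-statistic, \(R^2\), adjusted \(R^2\), and the standard error of the estimate. The degrees of freedom (4 and 5) follow from four predictors and ten samples.

```python
import math

# Values read from the ANOVA table reported above.
ss_regression = 360.266
ss_residual = 30.875
n_samples = 10       # rows in the concrete dataset
n_predictors = 4     # cement, water-cement ratio, aggregate size, curing time

ss_total = ss_regression + ss_residual
df_regression = n_predictors
df_residual = n_samples - n_predictors - 1

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual

r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n_samples - 1) / df_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"MS regression = {ms_regression:.3f}, MS residual = {ms_residual:.3f}")
print(f"F = {f_statistic:.3f}")
print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adjusted_r_squared:.3f}")
print(f"Standard error of the estimate = {std_error_estimate:.2f} MPa")
```

The results agree with the reported tables up to rounding in the last digit, since the inputs here are themselves rounded to three decimals.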

  • Coefficients Table:

  • Examine the coefficients (\(\beta\)) and their significance values to understand the effect of each predictor.

The coefficients table provides insights into the individual contributions of each predictor to the model. The intercept (\(\text{Constant} = 22.995\)) represents the estimated compressive strength when all predictors are zero. Among the predictors:

  • Cement Content (B = 0.055, p = 0.355): The effect is positive but not statistically significant, suggesting it has a limited independent impact on compressive strength.
  • Water-Cement Ratio (B = -30.866, p = 0.350): The negative coefficient implies an inverse relationship with compressive strength, though it is not statistically significant.
  • Aggregate Size (B = -0.205, p = 0.449): The negative coefficient suggests a slight decrease in compressive strength with larger aggregates, but the effect is not significant.
  • Curing Time (B = 0.437, p = 0.018): This predictor is significant (p < 0.05), indicating that increased curing time strongly enhances compressive strength.
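
To see how these coefficients turn into a prediction, the sketch below plugs the reported B values into the regression equation. The function name `predict_strength` is hypothetical. Because the coefficients are rounded to three decimals, the result can differ slightly from the predicted values that SPSS saves.

```python
# Unstandardized coefficients as reported in the Coefficients table (rounded).
coefficients = {
    "Constant": 22.995,
    "Cement_Content": 0.055,
    "Water_Cement_Ratio": -30.866,
    "Aggregate_Size": -0.205,
    "Curing_Time": 0.437,
}


def predict_strength(cement, water_cement_ratio, aggregate_size, curing_time):
    """Predicted compressive strength (MPa) from the rounded coefficients."""
    return (coefficients["Constant"]
            + coefficients["Cement_Content"] * cement
            + coefficients["Water_Cement_Ratio"] * water_cement_ratio
            + coefficients["Aggregate_Size"] * aggregate_size
            + coefficients["Curing_Time"] * curing_time)


# Sample 3 from the dataset: observed strength is 42.1 MPa.
sample_3 = predict_strength(400, 0.40, 10, 28)

# Effect of 7 extra curing days with the other predictors held fixed.
extra_week = predict_strength(400, 0.40, 10, 35) - sample_3

print(f"Sample 3 prediction: {sample_3:.2f} MPa")
print(f"Gain from 7 more curing days: {extra_week:.2f} MPa")
```

The second figure equals seven times the curing-time coefficient, which is what "a unit increase in \(X\), keeping others constant" means in practice.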

Collinearity diagnostics show tolerances above 0.1 and VIF values below 10, suggesting no severe multicollinearity among the predictors.

  • Collinearity Diagnostics:

The Collinearity Diagnostics table helps identify multicollinearity among the predictors by examining eigenvalues, condition indices, and variance proportions:

  1. Eigenvalues and Condition Indices:
    • The first dimension has a high eigenvalue (4.716) and a low condition index (1.000), indicating it captures most of the variance in the predictors.
    • Dimensions 2, 3, and 4 have smaller eigenvalues and progressively larger condition indices (4.448, 12.380, and 18.173), reflecting increasing levels of dependency among predictors.
    • Dimension 5 has the smallest eigenvalue (0.000) and a very high condition index (116.834), which is a red flag for potential multicollinearity.
  2. Variance Proportions:
    • In Dimension 5, the variance proportions for Cement Content (0.96) and Water-Cement Ratio (0.85) are high, suggesting these variables may be highly collinear.
    • Significant overlap in variance proportions across dimensions indicates shared variance among predictors, particularly for water-cement ratio, cement content, and aggregate size.

Conclusion:

The high condition index in Dimension 5 (>30) and overlapping variance proportions suggest potential multicollinearity issues, particularly involving Cement Content and Water-Cement Ratio. To address this:

  • Consider removing or combining predictors.
  • Use regularization techniques (e.g., ridge regression) if multicollinearity significantly affects model stability.
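
One way to cross-check the multicollinearity concern is to compute the variance inflation factor (VIF) for each predictor directly. The sketch below, which assumes NumPy is installed, uses the textbook definition \(\text{VIF}_j = 1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. The function name is hypothetical, and the sketch is not claimed to match SPSS's internal procedure.

```python
import numpy as np

# Predictors only: cement, water-cement ratio, aggregate size, curing time.
predictors = np.array([
    [350, 0.45, 20, 7],
    [300, 0.50, 25, 14],
    [400, 0.40, 10, 28],
    [320, 0.55, 20, 28],
    [380, 0.48, 15, 14],
    [410, 0.38, 25, 7],
    [330, 0.52, 20, 14],
    [370, 0.46, 10, 28],
    [340, 0.50, 25, 7],
    [360, 0.42, 20, 28],
], dtype=float)
names = ["Cement_Content", "Water_Cement_Ratio", "Aggregate_Size", "Curing_Time"]


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        beta, _, _, _ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        ss_total = np.sum((target - target.mean()) ** 2)
        r_squared_j = 1.0 - (residuals @ residuals) / ss_total
        vifs.append(1.0 / (1.0 - r_squared_j))
    return np.array(vifs)


vifs = variance_inflation_factors(predictors)
for name, vif in zip(names, vifs):
    print(f"{name:>20}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```

Tolerance is the reciprocal of VIF, so the two columns carry the same information.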

The Residuals Statistics table provides information about the model’s predicted values, residuals (differences between observed and predicted values), and standardized values, helping to evaluate model fit and detect outliers:

  1. Predicted Values:
    • The predicted compressive strength ranges from 24.0340 MPa to 42.6341 MPa, with a mean of 31.7700 MPa, aligning well with the actual mean compressive strength.
    • The standard deviation of predicted values is 6.32689, indicating variability in the predictions.
  2. Residuals:
    • Residuals (actual - predicted) range from -2.64596 to 3.26372, with a mean of 0.00000 (as expected in a regression model).
    • The standard deviation of the residuals (1.85219) shows the typical size of the deviation between predicted and observed values.
  3. Standardized Predicted and Residual Values:
    • Standardized predicted values range from -1.223 to 1.717, with a mean of 0 and standard deviation of 1, indicating no extreme predictions.
    • Standardized residuals range from -1.065 to 1.313, with a mean of 0 and standard deviation of 0.745, showing no significant outliers in residuals.

Conclusion:

The residuals exhibit reasonable variability without extreme outliers, indicating a good model fit. However, further diagnostic plots (e.g., residual vs. predicted plots or normal Q-Q plots) may help confirm assumptions of linear regression, such as normality and homoscedasticity of residuals.
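
The residual checks described above can be sketched in Python as follows, assuming NumPy is installed. Residuals are divided by the standard error of the estimate to standardize them, and any value beyond ±3 is flagged. The cut-off of 3 is a common rule of thumb, not a setting taken from the SPSS output.

```python
import numpy as np

# Columns: cement, water-cement ratio, aggregate size, curing time, strength.
data = np.array([
    [350, 0.45, 20, 7, 24.5],
    [300, 0.50, 25, 14, 26.7],
    [400, 0.40, 10, 28, 42.1],
    [320, 0.55, 20, 28, 29.3],
    [380, 0.48, 15, 14, 35.2],
    [410, 0.38, 25, 7, 30.8],
    [330, 0.52, 20, 14, 27.5],
    [370, 0.46, 10, 28, 38.4],
    [340, 0.50, 25, 7, 23.7],
    [360, 0.42, 20, 28, 39.5],
])
X = np.column_stack([np.ones(len(data)), data[:, :4]])
y = data[:, 4]

# Fit by least squares and compute raw residuals (actual - predicted).
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Standardize by the standard error of the estimate.
df_residual = len(y) - X.shape[1]
std_error_estimate = np.sqrt(residuals @ residuals / df_residual)
standardized = residuals / std_error_estimate

# Flag potential outliers (|z| > 3 is an assumed rule of thumb).
flagged = np.flatnonzero(np.abs(standardized) > 3.0)

print("mean residual:", round(float(residuals.mean()), 6))
print("sd of standardized residuals:", round(float(standardized.std(ddof=1)), 3))
print("flagged sample indices:", flagged.tolist())
```

With ten samples and five estimated coefficients, the standard deviation of the standardized residuals is \(\sqrt{5/9} \approx 0.745\) by construction, which matches the value in the Residuals Statistics table.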

This is a scatterplot of standardized residuals vs. standardized predicted values for the dependent variable, Compressive Strength (MPa). This plot is used to assess whether the assumptions of linear regression are met, particularly the assumptions of homoscedasticity (constant variance of residuals) and linearity.

Interpretation:

  1. Pattern of Residuals:
    • The residuals appear to be randomly scattered around zero, without any clear pattern or systematic structure. This suggests that the assumption of linearity is satisfied.
  2. Homoscedasticity (Constant Variance):
    • The spread of residuals seems fairly consistent across all predicted values, indicating that the assumption of homoscedasticity is also likely met. If there were a funnel shape or clustering, it would indicate heteroscedasticity.
  3. Outliers:
    • There do not appear to be any extreme outliers in the standardized residuals, which is a good sign for model reliability.

Conclusion:

The scatterplot supports the assumptions of linear regression, with residuals showing no systematic patterns and a roughly constant spread. This indicates that the model is a good fit for the data.

This is a Normal P-P Plot of Regression Standardized Residuals for the dependent variable, Compressive Strength (MPa). The plot evaluates whether the residuals (differences between observed and predicted values) follow a normal distribution, which is a key assumption in linear regression.

Interpretation:

  1. Alignment with the Line:
    • Most data points lie close to the diagonal line, indicating that the residuals approximately follow a normal distribution. This supports the assumption of normality.
  2. Deviations:
    • Minor deviations from the diagonal line are acceptable and can occur due to sampling variability. However, large or systematic deviations might suggest non-normality.

Conclusion:

The plot suggests that the residuals are reasonably normally distributed, meaning the model satisfies the normality assumption for regression. This is a good indication of the model’s validity, but you might still consider additional tests (e.g., Shapiro-Wilk) or plots (e.g., histogram of residuals) for a more thorough analysis.

Step 6: Visualize the Results

Create scatterplots or predicted vs. actual plots to visualize the model fit. Refer to the visualizations in the earlier steps for detailed guidance.


Steps in SPSS to Create Scatterplots

Step 1: Run Multiple Regression

  1. Follow the steps to enter the data and run multiple regression as described earlier.
  2. Ensure that the option for Save > Predicted Values is checked in the regression dialog box. This will generate a new column in the dataset with predicted values.

Step 2: Create Scatterplot of Actual vs. Predicted

  1. After running the regression:
    • Switch to the Data View.
    • Note the newly created column for predicted values (e.g., PRED).
  2. Go to Graphs > Chart Builder.
  3. In the Chart Builder:
    • Select Scatter/Dot from the list of chart types.
    • Drag the Simple Scatter icon into the workspace.
    • Assign:
      • Predicted Values (PRED) to the X-axis.
      • Actual Values (Compressive_Strength) to the Y-axis.
  4. Click OK to generate the scatterplot.

Step 3: Add Line of Best Fit

  1. Double-click the scatterplot to open the Chart Editor.
  2. In the Chart Editor:
    • Right-click on the scatterplot and select Add Fit Line at Total.
    • Choose Linear as the type of fit line.
    • Apply the changes and close the Chart Editor.

Step 4: Interpret the Scatterplot

  • A well-fitting regression model will show points closely aligned with the diagonal line.
  • If the points are widely scattered, the model may need improvement.
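
The visual check can be backed by two numbers. In the sketch below (assuming NumPy is installed), the actual values are regressed on the predicted values: for a least-squares model with an intercept, that fit line has slope 1 and intercept 0, so it coincides with the 45° line. The correlation between actual and predicted values then measures how tightly the points cluster around it.

```python
import numpy as np

# Columns: cement, water-cement ratio, aggregate size, curing time, strength.
data = np.array([
    [350, 0.45, 20, 7, 24.5],
    [300, 0.50, 25, 14, 26.7],
    [400, 0.40, 10, 28, 42.1],
    [320, 0.55, 20, 28, 29.3],
    [380, 0.48, 15, 14, 35.2],
    [410, 0.38, 25, 7, 30.8],
    [330, 0.52, 20, 14, 27.5],
    [370, 0.46, 10, 28, 38.4],
    [340, 0.50, 25, 7, 23.7],
    [360, 0.42, 20, 28, 39.5],
])
X = np.column_stack([np.ones(len(data)), data[:, :4]])
actual = data[:, 4]

# Predicted values from the fitted multiple regression.
beta, _, _, _ = np.linalg.lstsq(X, actual, rcond=None)
predicted = X @ beta

# Fit line of actual on predicted (np.polyfit returns slope first).
slope, intercept = np.polyfit(predicted, actual, deg=1)

# Correlation between actual and predicted; its square equals R^2.
correlation = float(np.corrcoef(predicted, actual)[0, 1])

print(f"fit line: actual = {intercept:.4f} + {slope:.4f} * predicted")
print(f"correlation = {correlation:.3f}, squared = {correlation ** 2:.3f}")
```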

This plot shows the relationship between the observed compressive strength (MPa) and the unstandardized predicted values from the regression model.

Interpretation:

  1. Trend Line:
    • The plot includes a diagonal line, which represents the line of perfect prediction (where observed values equal predicted values).
  2. Closeness to the Line:
    • The scatter points lie close to the line, indicating that the regression model provides a good fit to the data. The predicted values are generally accurate in estimating the observed compressive strength.
  3. Deviation from the Line:
    • Small deviations are observed, which represent prediction errors (residuals). These deviations appear relatively minor, confirming that the model’s predictions are reasonably accurate.

Conclusion:

The plot supports the reliability of the regression model in predicting compressive strength (MPa). The closeness of the scatter points to the diagonal line indicates that the model’s predictions align well with the observed values.
