This is the Chapter 3 & 4 project for STAT 321, using the house-sales data from the Kaggle notebook at https://www.kaggle.com/code/divan0/multiple-linear-regression/input.
Multiple linear regression is a statistical method used to model the relationship between a dependent variable and two or more independent variables. The goal is to understand how the independent variables collectively influence the dependent variable. The model assumes a linear relationship, meaning each one-unit change in an independent variable is associated with a constant change in the dependent variable, holding the others fixed. The equation for a multiple linear regression model is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$
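In R, such a model is fitted with lm(). Here is a minimal sketch on simulated data (the variable names and the true coefficients are purely illustrative):
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(50)  # true coefficients: beta0 = 3, beta1 = 2, beta2 = -1.5
coef(lm(y ~ x1 + x2))                    # the estimates should be close to the true values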
First, we shall prepare the data:
house_data <- read.csv(file = "house_data.csv",
header = TRUE,
sep = ",")
# Print variables
names(house_data)
## [1] "id" "date" "price" "bedrooms"
## [5] "bathrooms" "sqft_living" "sqft_lot" "floors"
## [9] "waterfront" "view" "condition" "grade"
## [13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
## [17] "zipcode" "lat" "long" "sqft_living15"
## [21] "sqft_lot15"
Check Dimensions
# Check dimension
dim(house_data)
## [1] 21613 21
Print First 10 Observations
# Print first 10 observations
head(house_data, n = 10)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000 221900 3 1.00 1180 5650
## 2 6414100192 20141209T000000 538000 3 2.25 2570 7242
## 3 5631500400 20150225T000000 180000 2 1.00 770 10000
## 4 2487200875 20141209T000000 604000 4 3.00 1960 5000
## 5 1954400510 20150218T000000 510000 3 2.00 1680 8080
## 6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## 7 1321400060 20140627T000000 257500 3 2.25 1715 6819
## 8 2008000270 20150115T000000 291850 3 1.50 1060 9711
## 9 2414600126 20150415T000000 229500 3 1.00 1780 7470
## 10 3793500160 20150312T000000 323000 3 2.50 1890 6560
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## 7 2 0 0 3 7 1715 0 1995
## 8 1 0 0 3 7 1060 0 1963
## 9 1 0 0 3 7 1050 730 1960
## 10 2 0 0 3 7 1890 0 2003
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
## 7 0 98003 47.3097 -122.327 2238 6819
## 8 0 98198 47.4095 -122.315 1650 9711
## 9 0 98146 47.5123 -122.337 1780 8113
## 10 0 98038 47.3684 -122.031 2390 7570
Visualizing the Data
To gain insights into the relationship between price and each of the other variables (called explanatory variables or predictors), we'll create a scatter plot matrix with price as the response variable.
house_data_new <- house_data[, 3:10]  # keep price (column 3) first, followed by the other numeric predictors
plot(house_data_new)                  # scatter plot matrix of price and the predictors
Fitting the multiple linear regression model
Now, let's fit a multiple linear regression model to predict price based on the other numeric variables.
model <- lm(price ~ ., data = house_data_new)
We then print the summary of the results:
summary(model)
##
## Call:
## lm(formula = price ~ ., data = house_data_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1533199 -134322 -17631 101548 4229409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.764e+04 7.277e+03 9.295 <2e-16 ***
## bedrooms -4.645e+04 2.237e+03 -20.764 <2e-16 ***
## bathrooms 6.604e+03 3.600e+03 1.835 0.0666 .
## sqft_living 2.818e+02 3.047e+00 92.496 <2e-16 ***
## sqft_lot -3.893e-01 4.083e-02 -9.534 <2e-16 ***
## floors 8.905e+03 3.580e+03 2.488 0.0129 *
## waterfront 5.529e+05 2.091e+04 26.445 <2e-16 ***
## view 7.374e+04 2.464e+03 29.929 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 243200 on 21605 degrees of freedom
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.5611
## F-statistic: 3948 on 7 and 21605 DF, p-value: < 2.2e-16
Interpretation of Results:
Coefficients: Intercept (β₀): The estimated intercept is $67,640. This represents the estimated baseline price when all other predictor variables are zero. In the context of housing data, it might not have a direct practical interpretation.
Bedrooms (β₁): For each additional bedroom, the estimated price decreases by $46,450. This implies that houses with more bedrooms tend to have lower prices, holding other factors constant.
Bathrooms (β₂): The estimated coefficient for bathrooms is $6,604. However, with a p-value of 0.0666, it is marginally above the common significance level of 0.05. This suggests that the number of bathrooms might not be statistically significant in predicting house prices.
Sqft_living (β₃): For each additional square foot of living space, the estimated price increases by $281.80. This is a highly significant predictor with a very low p-value, indicating a strong positive relationship between living space and house price.
Sqft_lot (β₄): An increase of one unit in square footage of the lot is associated with a decrease in price by $0.3893. This suggests that larger lots might be associated with lower prices.
Floors (β₅): Each additional floor is associated with an estimated $8,905 increase in price, holding the other variables constant. The p-value of 0.0129 indicates that the number of floors is statistically significant at the 5% level.
Waterfront (β₆): Houses with waterfronts have an estimated price that is $552,900 higher than those without. This variable is highly significant, as indicated by the very low p-value.
View (β₇): Each unit increase in the view rating is associated with an increase in price by $73,740. This is a highly significant predictor with a very low p-value.
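To make these interpretations concrete, here is a small sketch using the fitted coefficients (the 100 square foot increment is just an illustrative value):
# A one-unit change in a predictor changes the predicted price by its coefficient,
# holding the other predictors fixed.
coef(model)["sqft_living"]          # about 281.8 dollars per additional square foot
100 * coef(model)["sqft_living"]    # about 28,180 dollars for 100 additional square feet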
P-values: The p-value for bathrooms is 0.0666, suggesting marginal significance. It might be worth further investigation to determine its importance.
The p-values for bedrooms, sqft_living, sqft_lot, waterfront, and view are all extremely small (< 2e-16), indicating that these variables are highly significant predictors of house prices; floors is also significant at the 5% level (p = 0.0129).
Model Fit: Residual Standard Error: The standard deviation of the residuals is $243,200. This is a measure of how well the model fits the data, with lower values indicating better fit.
R-squared: The R-squared value is 0.5613, suggesting that the model explains approximately 56.13% of the variance in house prices.
Adjusted R-squared: The adjusted R-squared value is 0.5611, which takes into account the number of predictors in the model.
F-statistic: The F-statistic is 3948 with a very low p-value (< 2.2e-16), indicating that the overall model is statistically significant.
In summary, the model suggests that bedrooms, sqft_living, sqft_lot, floors, waterfront, and view are significant predictors of house prices, while bathrooms is only marginally significant. The F-statistic shows the model as a whole is statistically significant, and the R-squared indicates it explains a moderate share (about 56%) of the variation in price.
Model Checking
To ensure the validity of our model, let's perform diagnostic checks.
plot(model, which = 1)  # Residuals vs Fitted: checks linearity and constant variance
plot(model, which = 2)  # Normal Q-Q: checks normality of the residuals
plot(model, which = 3)  # Scale-Location: checks homoscedasticity
plot(model, which = 4)  # Cook's distance: flags influential observations
Cook’s Distance is a statistical measure used to identify influential data points in a regression analysis. It measures the impact of deleting a particular observation on the overall fit of the model. Large values of Cook’s Distance indicate that removing a specific data point significantly changes the regression coefficients.
The plot identifies observations 3915, 7253, and 15871 as having the largest Cook's Distance values; these are the most influential points and are worth examining further.
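The same influential points can be listed directly from the fitted model; a minimal sketch (the 4/n cutoff is a common rule of thumb, not part of the analysis above):
cd <- cooks.distance(model)
head(sort(cd, decreasing = TRUE), 3)    # the three largest Cook's Distance values
sum(cd > 4 / nrow(house_data_new))      # how many observations exceed the rough 4/n cutoff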
R-squared: It is a statistical measure that represents the proportion of the variability in the response variable that is explained by the model; it quantifies the goodness of fit. R-squared is defined the same way in multiple linear regression. Its values range from 0 to 1, and a higher value indicates a better fit, meaning that a larger proportion of the variance in the dependent variable is accounted for by the independent variables in the model. Remember, while R-squared provides information about goodness of fit, it does not by itself indicate causation between variables or the appropriateness of the model for making predictions; it should be used in conjunction with other model evaluation techniques.
The R-squared here is 0.5613, indicating that 56.13% of the total variation in price is explained by the model.
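Both quantities can be pulled directly from the model summary:
summary(model)$r.squared       # multiple R-squared, about 0.5613
summary(model)$adj.r.squared   # adjusted R-squared, about 0.5611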
Prediction Interval for a New Observation
A prediction interval for a new observation in the context of regression analysis provides a range within which a future response value falls with a certain desired probability. This interval takes into account both the uncertainty in estimating the population regression line and the variability of individual observations around that line.
Let’s make predictions based on our model in the previous section for new data points.
new_data <- data.frame(bedrooms = 0.75,
bathrooms = 2.0,
sqft_living= 162,
sqft_lot = 28,
floors = 0.6,
waterfront = 17.2,
view = 1.5
)
# Predict price for the new observation
predictions <- predict(model, newdata = new_data, interval = "prediction", level = 0.95)
# Display predictions
predictions
## fit lwr upr
## 1 9717722 8869009 10566435
Explanation of code:
A new data frame new_data is created with specific values for each predictor variable. The predict function is then used to generate predictions from the fitted model:
- model: the multiple regression model fitted previously.
- newdata = new_data: the new data for which predictions are to be made; it must be a data frame containing values for the predictor variables used in the model.
- interval = "prediction": requests prediction intervals, which give a range within which the true response value is likely to fall with the specified probability.
- level = 0.95: the desired probability that the interval contains the true value of the response for a new observation.
- predictions <- ...: the results of predict, a matrix with columns for the point prediction (fit) and the lower (lwr) and upper (upr) bounds of the prediction interval, are stored in the predictions object and then displayed.
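For example, the individual pieces of the prediction matrix can be extracted by column name:
predictions[, "fit"]   # point prediction
predictions[, "lwr"]   # lower bound of the 95% prediction interval
predictions[, "upr"]   # upper bound of the 95% prediction interval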
Confidence Interval for the Mean Response
The mean response, that is, the average price of all houses with the particular set of values on the seven predictors considered here, can also be of interest. Since it is a parameter, we can calculate a confidence interval for it. The code is:
# Confidence interval for the mean response at the particular set of
# predictor values given above
CI <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)
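The confidence interval is narrower than the prediction interval at the same predictor values, because it reflects only the uncertainty in estimating the mean response, not the variability of an individual observation. A quick side-by-side comparison:
CI            # confidence interval for the mean response
predictions   # prediction interval for a single new observation (wider)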
Including indicator variables in multiple regression models
DT::datatable(house_data)
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
augmented_house <- cbind(house_data,
                         I1 = as.numeric(house_data$price == "bedrooms"),
                         I2 = as.numeric(house_data$price == "bathrooms"),
                         I3 = as.numeric(house_data$price == "sqft_living")
                         )
# Note: price is numeric, so these comparisons are always FALSE and I1-I3 are
# columns of zeros; this is why I2 and I3 have NA coefficients in the fit below.
The following is the R code that does the model fitting:
fit <- lm(price ~ bedrooms + bathrooms + sqft_living + I2 + I3,
data = augmented_house)
summary(fit)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + I2 +
## I3, data = augmented_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1642456 -144268 -22821 102461 4180678
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74847.141 6913.667 10.83 <2e-16 ***
## bedrooms -57860.894 2334.607 -24.78 <2e-16 ***
## bathrooms 7932.712 3510.556 2.26 0.0239 *
## sqft_living 309.392 3.087 100.23 <2e-16 ***
## I2 NA NA NA NA
## I3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257800 on 21609 degrees of freedom
## Multiple R-squared: 0.5069, Adjusted R-squared: 0.5069
## F-statistic: 7405 on 3 and 21609 DF, p-value: < 2.2e-16
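For reference, an indicator variable is usually created from a condition on a predictor rather than on the response. Here is a minimal sketch, separate from the fit above, using an indicator for whether a house has been renovated (this particular indicator and model are illustrative choices, not part of the original analysis):
# 1 if the house has ever been renovated, 0 otherwise
house_data$renovated <- as.numeric(house_data$yr_renovated > 0)
fit_ind <- lm(price ~ bedrooms + bathrooms + sqft_living + renovated,
              data = house_data)
summary(fit_ind)$coefficients["renovated", ]   # estimate, std. error, t value, p-value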
Fitting a multiple linear regression model
fit <- lm(price ~ bedrooms + bathrooms + sqft_living,
data = house_data)
# Display the summary of the model
summary(fit)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1642456 -144268 -22821 102461 4180678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74847.141 6913.667 10.83 <2e-16 ***
## bedrooms -57860.894 2334.607 -24.78 <2e-16 ***
## bathrooms 7932.712 3510.556 2.26 0.0239 *
## sqft_living 309.392 3.087 100.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257800 on 21609 degrees of freedom
## Multiple R-squared: 0.5069, Adjusted R-squared: 0.5069
## F-statistic: 7405 on 3 and 21609 DF, p-value: < 2.2e-16
library(ggplot2)  # for ggplot()
ggplot(house_data, aes(x = bedrooms, y = price, color = sqft_living)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Scatter Plot with Regression Lines for Each sqft_living",
x = "bedrooms",
y = "Price")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
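One way to address the warning above (a sketch; the number of bins is an arbitrary choice) is to bin sqft_living into a factor, so each bin gets its own regression line:
house_data$sqft_bin <- cut(house_data$sqft_living, breaks = 3)  # three arbitrary bins
ggplot(house_data, aes(x = bedrooms, y = price, color = sqft_bin)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatter Plot with Regression Lines for Each sqft_living Bin",
       x = "bedrooms",
       y = "Price")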
Additional Topics in Multiple Regression
Added Variable Plots
The added variable plot (also known as a partial regression plot), introduced by Mosteller and Tukey (1977), provides a visual representation of the regression coefficient of a new variable being considered for inclusion in a model. Here is a step-by-step guide to constructing an added variable plot:
Regress Y on all variables other than X: This step involves fitting a regression model where the response variable Y is regressed on all predictor variables except for X. The residuals obtained from this regression represent the part of Y not explained by the other variables (Y residuals).
Regress X on all other variables: Similarly, regress the variable X on all other variables included in the model. The residuals obtained from this regression represent the part of X not explained by the other variables (X residuals).
Construct a scatter plot: Create a scatter plot with Y residuals on the y-axis and X residuals on the x-axis.
The interpretation of Y and X residuals is as follows:
Y residuals: These represent the portion of Y that is not explained by all variables other than X.
X residuals: These represent the portion of X that is not explained by the other variables.
The slope of the line fitted to the points in the added variable plot is equal to the regression coefficient of X when Y is regressed on all variables, including X.
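As a check of the slope property just described, here is a minimal sketch that constructs the added variable plot by hand for Sepal.Length, using the same iris model that is fitted with avPlot below:
# Y residuals: the response regressed on all predictors except Sepal.Length
y_res <- resid(lm(Sepal.Width ~ Petal.Length + Petal.Width, data = iris))
# X residuals: Sepal.Length regressed on those same predictors
x_res <- resid(lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris))
plot(x_res, y_res, xlab = "Sepal.Length residuals", ylab = "Sepal.Width residuals")
# The slope of the residual-on-residual fit equals the Sepal.Length coefficient
# in the full model containing all three predictors.
coef(lm(y_res ~ x_res))[2]
coef(lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width, data = iris))["Sepal.Length"]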
The interpretation of the plot:
A strong linear relationship in the added variable plot indicates the increased importance of the contribution of X to the model already containing the other predictors. Essentially, it suggests that X adds significant explanatory power to the model beyond what is already captured by the existing predictors. If the relationship in the added variable plot is non-linear, it suggests that the relationship between the response variable Y and the predictor variable X is not adequately captured by a linear model, even after accounting for the effects of other predictor variables. Transforming the predictor variable X or the response variable Y may help linearize the relationship.
# Install and load the 'car' package (if not already installed)
# install.packages("car")
library(car)
## Loading required package: carData
# Fit a multiple regression model
model <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width, data = iris)
# Create an Added Variable Plot for the variable 'Sepal.Length'
avPlot(model, "Sepal.Length")
Best Subsets of Predictors
Best subsets of predictors is an approach in regression analysis where multiple linear regression models are fitted using different combinations of predictor variables, and the best model is selected based on a specified criterion. The goal is to identify the subset of predictors that results in the most parsimonious and effective model.
Mallow’s Cp is one of many statistical criteria used in the process of choosing the best subset of predictor variables for a regression model. The goal is to strike a balance between model simplicity and predictive accuracy.
Mallow’s Cp is defined as:
$$C_p = \frac{SSE_m}{MSE_k} + 2(m+1) - n$$
where:
SSE_m is the sum of squared errors for the model with m predictors.
MSE_k is the mean square error (the estimated variance of the error term) for the full model with all k predictors.
n is the number of observations.
A model with a lower Cp indicates a better balance between goodness of fit (as measured by SSE) and model complexity, and is therefore the preferred model.
# Install and load the leaps package
library(leaps)
# Example data
set.seed(123)
simulated_data <- data.frame(
Predictor1 = rnorm(100),
Predictor2 = rnorm(100),
Predictor3 = rnorm(100),
Predictor4 = rnorm(100),
Predictor5 = rnorm(100),
Predictor6 = rnorm(100)
)
# Generate response variable with a linear combination of predictors and random noise
# Response = 20 + 2 * Predictor2 - 4 * Predictor4 + 5 * Predictor5 + random noise
simulated_data$Response = 20 + 2*simulated_data$Predictor2 - 4*simulated_data$Predictor4 + 5*simulated_data$Predictor5 - 6*simulated_data$Predictor6 + rnorm(100)
# Use leaps to find the best subset of predictors
best_subsets <- regsubsets(Response ~ ., data = simulated_data, nvmax = 4) # Adjust nvmax as needed
# Get the Mallow's Cp values for each subset
res = summary(best_subsets)
cp_values <- res$cp
cat("The best subset is:\n ",
paste(" ", best_subsets$xnames[-1][tail(summary(best_subsets)$which,1)[-1]], collapse = ", "),
"\nwith Mallow's Cp =", min(cp_values), ".", sep = "")
## The best subset is:
## Predictor2, Predictor4, Predictor5, Predictor6
## with Mallow's Cp =3.112201.
Cross-Validation
Many times, the model we build is intended for prediction in the future. A natural question is: will our model perform well on new data? That a model performs well on the data used to fit it does not guarantee it will also perform well on new data. How can we assess the performance of the model for future prediction?
Training and Holdout Samples
Cross-validation is a statistical technique used in machine learning and statistical modeling to assess the performance and generalizability of a predictive model. The main goal of cross-validation is to estimate how well a model will perform on an independent dataset by partitioning the available data into two subsets: one for training the model and another for testing its performance.
Training Sample: The training sample is the portion of the dataset used to train the model. During model training, the algorithm learns the patterns and relationships within the data, adjusting its parameters to minimize the difference between predicted and actual outcomes. The larger and more representative the training sample is, the better the model may capture the underlying patterns in the data. The training sample is typically used for fitting the model, estimating coefficients, and adjusting internal parameters.
Holdout Sample (Testing Sample): The holdout sample is a separate subset of the data that is not used during the training phase. After the model is trained on the training sample, it is tested on the holdout sample to assess its performance on new, unseen data. The holdout sample serves as an independent evaluation set, providing an estimate of how well the model is expected to generalize, and its performance metrics can help identify issues such as overfitting or underfitting. The model's performance on the holdout data is generally expected to be somewhat worse than on the training data; if the drop in performance is large, the model may have overfit the training sample rather than capturing patterns that hold in the more general population.
Here is an example using base R (packages such as caret are also commonly used for this kind of workflow). We will use the iris data to fit a multiple linear regression model with Sepal.Length as the response and Sepal.Width, Petal.Length, and Petal.Width as predictors.
# Set a seed for reproducibility
set.seed(123)
# Split the data into a training set (80%) and a testing set (20%)
size_of_data = nrow(iris)
size_of_training_sample = size_of_data * 0.8
train_indices <- sample(1:size_of_data, size_of_training_sample)
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]
# Fit a model using the training data
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = train_data)
# Make predictions on the test data
predictions <- predict(model, newdata = test_data)
# Evaluate the model's performance on the test data (using mean squared error)
mse <- mean((test_data$Sepal.Length - predictions)^2)
cat("Mean Squared Error:", mse, "\n")
## Mean Squared Error: 0.07225017
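The single split above depends on which observations happen to land in the holdout set. A natural extension is k-fold cross-validation; here is a minimal sketch for the same iris model (the choice of five folds is arbitrary):
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))  # randomly assign each row to a fold
mse_by_fold <- sapply(1:k, function(i) {
  fit_i  <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
               data = iris[folds != i, ])           # train on all folds except fold i
  pred_i <- predict(fit_i, newdata = iris[folds == i, ])
  mean((iris$Sepal.Length[folds == i] - pred_i)^2)  # test MSE on fold i
})
mean(mse_by_fold)   # average test MSE across the five folds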
In conclusion, this project explores the application of multiple linear regression in predicting house prices using a dataset obtained from Kaggle. The initial steps involve data preparation, exploration, and visualization to gain insights into the relationships between predictor variables and the response variable (price). The multiple linear regression model is then fitted, and the results are thoroughly interpreted.
The coefficients of the model provide valuable insights into the impact of each predictor variable on house prices. For instance, the number of bedrooms is associated with a decrease in price, while square footage of living space, waterfront presence, and view ratings significantly contribute to higher prices. The p-values help assess the significance of each predictor, and the overall model fit is evaluated through metrics like R-squared and the F-statistic.
Additionally, diagnostic checks are performed to ensure the validity of the model, and influential observations are identified using Cook’s Distance. The document also delves into the concept of R-squared, providing an understanding of the proportion of variance explained by the model.
Furthermore, the document demonstrates the use of prediction intervals and confidence intervals for new observations, allowing for a comprehensive assessment of the model’s predictive capabilities. Model checking is emphasized through diagnostic plots, ensuring the assumptions of linear regression are met.
The document concludes with the exploration of additional topics in multiple regression, including added variable plots, best subsets of predictors using Mallow’s Cp, and cross-validation techniques to assess the model’s performance on new, unseen data. Overall, this comprehensive analysis serves as a practical guide for utilizing multiple linear regression in real-world data analysis scenarios.