#Import the data
library(readr)
data <- read_csv("dataset_2_ILM312 2.csv")
Hypothesis 1:
Research Question: Do the lot area, number of bedrooms above ground, and garage area have a significant impact on the sale price of the houses?
Null Hypothesis (H0): There is no relationship between the lot area, number of bedrooms above ground, garage area, and the sale price of the houses.
Alternative Hypothesis (H1): There is a relationship between the lot area, number of bedrooms above ground, garage area, and the sale price of the houses.
Hypothesis 2:
Research Question: Are any of the predictors (lot area, number of bedrooms above ground, garage area) individually associated with the sale price of the houses?
Null Hypothesis (H0): The coefficients of the lot area, number of bedrooms above ground, and garage area in the regression model are all zero (i.e., no effect on the sale price).
Alternative Hypothesis (H1): At least one of the coefficients of the lot area, number of bedrooms above ground, and garage area in the regression model is not zero (i.e., there is an effect on the sale price).
Assumption 1: Linearity: The relationship between the predictors and the response variable is linear.
library(ggplot2)
# Scatterplot of the response variable against the observation index
data$Index <- seq_len(nrow(data)) # Create an index column
ggplot(data, aes(x = Index, y = SalePrice * 0.001)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(
    title = "Scatterplot of response variable against index",
    x = "Index",
    y = "Sale Price (thousands)"
  ) +
  theme_bw()
• The points in the scatterplot above lie along the smoothed line,
suggesting a consistent relationship between the response variable and
the observation index, and on this basis the assumption of linearity is
taken to hold. Because the plot shows sale price against row order
only, it is an indirect check: it does not by itself show how sale
price relates to each predictor.
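A plot of residuals against fitted values is a common, more direct check of linearity: a roughly flat smoother centred on zero supports a linear fit. The sketch below is illustrative only. It uses simulated data so that it runs on its own, and the names `sim`, `x1`, `x2`, `y`, and `fit` are hypothetical stand-ins, not objects from this report.

```r
# Simulated stand-in data (hypothetical; not the housing dataset)
set.seed(1)
sim <- data.frame(x1 = runif(200, 0, 10), x2 = rnorm(200))
sim$y <- 5 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(200)

fit <- lm(y ~ x1 + x2, data = sim)

# Residuals vs fitted values: look for a flat band centred on zero
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)                                  # reference line at zero
lines(lowess(fitted(fit), resid(fit)), col = "blue")    # smoothed trend
```

The same three plotting calls could be applied to the regression model fitted to the housing data; a curved smoother would suggest that a transformation or an extra term is worth considering.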
Assumption 2: Homoscedasticity: The variability of the residuals is constant across all levels of the predictors.
# regression model
model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = data)
# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 130.52, df = 3, p-value < 2.2e-16
• The test statistic, BP, is 130.52 on 3 degrees of freedom. The p-value is reported as "< 2.2e-16", which is effectively zero and far below any conventional significance level. This is strong evidence against the null hypothesis of constant variance, so the assumption of homoscedasticity is not met.
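To make the test statistic less of a black box, the studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the same predictors and multiply the R-squared of that auxiliary regression by the sample size. The sketch below assumes simulated data whose error variance grows with `x1`; all object names are hypothetical and only base R is used.

```r
# Simulated stand-in data with variance that grows with x1 (hypothetical)
set.seed(2)
n <- 300
sim <- data.frame(x1 = runif(n, 1, 10), x2 = rnorm(n))
sim$y <- 3 + 2 * sim$x1 + sim$x2 + rnorm(n, sd = sim$x1)

fit <- lm(y ~ x1 + x2, data = sim)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(resid(fit)^2 ~ x1 + x2, data = sim)

# Studentized statistic: n * R^2, compared with a chi-squared
# distribution whose df equals the number of predictors
bp_stat <- n * summary(aux)$r.squared
bp_df <- 2
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

With three predictors, as in the housing model, the reference distribution has 3 degrees of freedom, which matches the `df = 3` in the output above.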
Assumption 3: Normality: The residuals should follow a normal distribution.
library(ggplot2)
# Histogram of residuals with a kernel density overlay
ggplot(data, aes(x = residuals(model))) +
  geom_histogram(binwidth = 1000, aes(y = after_stat(density)), fill = "lightblue", color = "cornflowerblue") +
  geom_density(color = "red", linewidth = 1) +
  labs(x = "Residuals", y = "Density") +
  scale_y_continuous(labels = function(x) x * 10000)
# Q-Q plot of residuals
qqnorm(model$residuals, yaxt = "n") # Disable automatic y-axis labels
qqline(model$residuals)
# Modify y-axis labels
y_vals <- axTicks(2) * 0.001 # Compute new y-axis values
axis(2, at = axTicks(2), labels = y_vals, las = 1) # Add custom y-axis labels
• From the histogram and Q-Q plot above, the residuals appear to follow
an approximately normal distribution, and hence the assumption of
normality is taken to hold.
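A visual reading of a histogram and Q-Q plot can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals of a model fitted to simulated data; the object names are hypothetical and the result says nothing about the housing model itself.

```r
# Simulated stand-in data (hypothetical; not the housing dataset)
set.seed(3)
sim <- data.frame(x = runif(150))
sim$y <- 1 + 2 * sim$x + rnorm(150, sd = 0.5)

fit <- lm(y ~ x, data = sim)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
sw$statistic
sw$p.value
```

A small p-value would count against normality. Note that `shapiro.test()` accepts between 3 and 5000 observations, and with large samples it tends to flag even mild departures, so it is best read alongside the plots rather than instead of them.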
Assumption 4: No multicollinearity: The predictors should not be highly correlated with each other.
# Correlation matrix
cor(data[, c("LotArea", "BedroomAbvGr", "GarageArea")])
##                LotArea BedroomAbvGr GarageArea
## LotArea      1.0000000   0.11968991 0.18040276
## BedroomAbvGr 0.1196899   1.00000000 0.06525253
## GarageArea   0.1804028   0.06525253 1.00000000
# Variance Inflation Factors (VIF)
library(car)
vif(model)
##      LotArea BedroomAbvGr   GarageArea
##     1.046289     1.016566     1.035710
• No Strong Evidence of Multicollinearity: The pairwise correlations between the predictors are all low (the largest is about 0.18, between LotArea and GarageArea), and the VIF values are all close to 1. Together these indicate that the predictors are not strongly associated with one another, so multicollinearity is not a concern for this model.
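The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. For example, a VIF of 1.046 for LotArea implies an R² of roughly 0.044. The sketch below computes VIFs by hand in base R on simulated data; the names `sim`, `x1`, `x2`, `x3`, and `vif_manual` are hypothetical.

```r
# Simulated stand-in data with mild correlation between x1 and x3 (hypothetical)
set.seed(4)
n <- 250
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x3 <- 0.3 * sim$x1 + rnorm(n)
sim$y <- 1 + sim$x1 + sim$x2 + sim$x3 + rnorm(n)

predictors <- c("x1", "x2", "x3")

# For each predictor: regress it on the others, then apply 1 / (1 - R^2)
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = sim))$r.squared
  1 / (1 - r2)
})

vif_manual
```

Values near 1 mean a predictor is almost unexplained by the others; commonly quoted rules of thumb treat values above about 5 or 10 as a warning sign.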
The linear regression model is:
SalePrice ~ LotArea + BedroomAbvGr + GarageArea
This specifies that “SalePrice” is the response variable, and “LotArea”, “BedroomAbvGr,” and “GarageArea” are the predictor variables. The data argument is used to specify the dataset containing the variables.
# Fit linear regression model
model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = data)
# Summary of the regression model
summary(model)
##
## Call:
## lm(formula = SalePrice ~ LotArea + BedroomAbvGr + GarageArea,
##     data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -290734  -31321   -4321   23270  471257
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.377e+04  6.583e+03   5.129 3.30e-07 ***
## LotArea      1.145e+00  1.618e-01   7.079 2.25e-12 ***
## BedroomAbvGr 1.095e+04  1.952e+03   5.613 2.38e-08 ***
## GarageArea   2.193e+02  7.516e+00  29.174  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60310 on 1456 degrees of freedom
## Multiple R-squared: 0.4248, Adjusted R-squared: 0.4236
## F-statistic: 358.4 on 3 and 1456 DF, p-value: < 2.2e-16
• Impact of Predictors: The coefficients in the model summary show the estimated effect of each predictor on SalePrice, holding the other predictors constant:
- LotArea: SalePrice is predicted to rise by roughly 1.145 units for every one-unit increase in LotArea.
- BedroomAbvGr: SalePrice is predicted to rise by roughly 10,950 units for each additional bedroom above ground.
- GarageArea: SalePrice is predicted to rise by about 219 units for every one-unit increase in GarageArea.
• Statistical Significance: All of the predictor variables (LotArea, BedroomAbvGr, and GarageArea) have p-values below 0.001, indicating that each is statistically significantly associated with SalePrice. This implies that these variables offer useful information for predicting SalePrice.
• Model Fit: According to the R-squared value of 0.4248, the predictor variables in the model can account for about 42.48% of the variation in the sale price. This means that based on the provided predictors, the model captures a moderate amount of the variation in the SalePrice.
• Overall Model Significance: The F-statistic of 358.4 has a p-value below 2.2e-16, which is exceptionally low and indicates that the overall model is statistically significant. In other words, at least one of the predictor variables has a significant linear relationship with SalePrice.
• Residuals: The residuals measure how far the observed SalePrice values depart from the values predicted by the model. Their range (from -290734 to 471257) shows the spread of the model's prediction errors.
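As a quick consistency check on the summary output, the F-statistic can be recovered from R-squared, and each t value is the estimate divided by its standard error. The sketch below uses only the rounded numbers printed in the summary, so small discrepancies from the reported values are expected; the variable names are hypothetical.

```r
# Values copied (rounded) from the summary output above
r2 <- 0.4248       # Multiple R-squared
k <- 3             # number of predictors
df_resid <- 1456   # residual degrees of freedom

# F-statistic implied by R-squared
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# t value for LotArea: estimate divided by its standard error
t_lotarea <- 1.145 / 0.1618

round(c(F = f_stat, t_LotArea = t_lotarea), 2)
```

Both results land close to the reported 358.4 and 7.079, which confirms that the figures in the table are internally consistent.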
Hypothesis 1:
Research Question: Do the lot area, number of bedrooms above ground, and garage area have a significant impact on the sale price of the houses?
• The lot area, number of bedrooms above ground, and garage area significantly affect the sale price of houses: each has a p-value below 0.05, and the overall F-test is also significant (p < 2.2e-16). Larger lot area, more bedrooms, and larger garage area are all associated with higher sale prices. Therefore, we reject the null hypothesis of no relationship.
Hypothesis 2:
Research Question: Are any of the predictors (lot area, number of bedrooms above ground, garage area) individually associated with the sale price of the houses?
• We reject the null hypothesis. All of the predictors (lot area, number of bedrooms above ground, and garage area) are individually associated with the sale price of the houses (p < 0.001), i.e., each coefficient is significantly different from zero.
# Predicting sale price
new_data <- data.frame(LotArea = 9000, BedroomAbvGr = 3, GarageArea = 700)
predicted_price <- predict(model, newdata = new_data)
# Printing the predicted sale price
cat("The predicted sale price of a house with a Lot Area of 9000 sq. meters, 3 bedrooms above ground, and a garage area of 700 sq. meters is:", predicted_price, "\n")
## The predicted sale price of a house with a Lot Area of 9000 sq. meters, 3 bedrooms above ground, and a garage area of 700 sq. meters is: 230425.6
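The prediction can be checked by hand by plugging the new house's values into the fitted equation. The sketch below uses the rounded coefficients printed in the model summary, so it is expected to differ slightly from the value returned by `predict()`; the names `coefs`, `new_house`, and `manual_prediction` are hypothetical.

```r
# Rounded coefficients copied from the summary output
coefs <- c(Intercept = 3.377e4, LotArea = 1.145,
           BedroomAbvGr = 1.095e4, GarageArea = 219.3)

# Predictor values for the new house, with a leading 1 for the intercept
new_house <- c(1, 9000, 3, 700)

# Fitted equation: intercept + sum of coefficient * predictor value
manual_prediction <- sum(coefs * new_house)
manual_prediction
```

This gives about 230435, within roughly 10 units of the 230425.6 reported above; the gap comes from rounding the coefficients to four significant figures.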