The “homeprice.csv” file is read into a data frame named “hprice”; its structure is shown below. The ggplot2 package is used for plotting.
'data.frame': 29 obs. of 7 variables:
$ list : num 80 151 310 295 339 ...
$ sale : num 118 151 300 275 340 ...
$ full : int 1 1 2 2 2 1 3 1 1 1 ...
$ half : int 0 0 1 1 0 1 0 1 2 0 ...
$ bedrooms : int 3 4 4 4 3 4 3 3 3 1 ...
$ rooms : int 6 7 9 8 7 8 7 7 7 3 ...
$ neighborhood: int 1 1 3 3 4 3 2 2 3 2 ...
library(ggplot2)
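The loading step itself is not shown in the output above. A minimal sketch of how it might look follows; the helper name `load_hprice` is hypothetical, and the temporary file only holds the first three rows as printed by str() (rounded as displayed) so the example can run without the real file.

```r
# Hypothetical helper: read the CSV and check the columns the analysis relies on.
load_hprice <- function(path) {
  df <- read.csv(path)
  needed <- c("list", "sale", "full", "half", "bedrooms", "rooms", "neighborhood")
  absent <- setdiff(needed, names(df))
  if (length(absent) > 0) {
    stop("missing columns: ", paste(absent, collapse = ", "))
  }
  df
}

# Demonstration on a temporary file with the first rows shown by str().
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "list,sale,full,half,bedrooms,rooms,neighborhood",
  "80,118,1,0,3,6,1",
  "151,151,1,0,4,7,1",
  "310,300,2,1,4,9,3"
), tmp)
demo <- load_hprice(tmp)
str(demo)

# For the real analysis, assuming the file is in the working directory:
# hprice <- load_hprice("homeprice.csv")
```

Checking the column names up front makes a misnamed or truncated file fail early rather than inside a later lm() call.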
Relationship Between Sale Price and Other Variables
Sale Price vs List Price
This scatter plot shows the relationship between sale price (x-axis) and list price (y-axis).
qplot(sale, list, data = hprice, xlab = "Sale Price", ylab = "List Price", main = "Sale Price vs List Price")
Warning: `qplot()` was deprecated in ggplot2 3.4.0.
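Since the warning reports that qplot() is deprecated, the same plot could be written with ggplot() instead. The following is a sketch, not the code used for this report; the data frame `hprice_head` is a stand-in built from the first five values printed by str() (rounded as displayed), not the full data set.

```r
library(ggplot2)

# Stand-in data: first five list/sale values as printed by str() (rounded).
hprice_head <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340)
)

# Same axes and labels as the qplot() call, expressed with ggplot().
p <- ggplot(hprice_head, aes(x = sale, y = list)) +
  geom_point() +
  labs(x = "Sale Price", y = "List Price", title = "Sale Price vs List Price")
print(p)
```

The other scatter plots in this section would follow the same pattern, changing only the y variable and the labels.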
Sale Price vs Number of Full Bathrooms
This scatter plot shows the relationship between sale price (x-axis) and number of full bathrooms (y-axis).
qplot(sale, full, data = hprice, xlab = "Sale Price", ylab = "Full Bathroom", main = "Sale Price vs Full Bathroom")
Sale Price vs Number of Half Bathrooms
This scatter plot shows the relationship between sale price (x-axis) and number of half bathrooms (y-axis).
qplot(sale, half, data = hprice, xlab = "Sale Price", ylab = "Half Bathroom", main = "Sale Price vs Half Bathroom")
Sale Price vs Number of Bedrooms
This scatter plot shows the relationship between sale price (x-axis) and number of bedrooms (y-axis).
qplot(sale, bedrooms, data = hprice, xlab = "Sale Price", ylab = "Bedrooms", main = "Sale Price vs Bedrooms")
Sale Price vs Number of Non-Bedrooms
This scatter plot shows the relationship between sale price (x-axis) and number of non-bedrooms (y-axis).
qplot(sale, rooms, data = hprice, xlab = "Sale Price", ylab = "Non-Bedrooms", main = "Sale Price vs Non-Bedrooms")
Sale Price vs Neighborhood Rank
This scatter plot shows the relationship between sale price (x-axis) and neighborhood rank (y-axis).
qplot(sale, neighborhood, data = hprice, xlab = "Sale Price", ylab = "Neighborhood Rank", main = "Sale Price vs Neighborhood Rank")
Result
After reviewing the scatter plots of sale price against the other variables, list price appears to have the strongest relationship with sale price.
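The visual impression could be backed up with correlation coefficients. The sketch below shows one way to do that; it runs on a five-row stand-in (`hprice_head`, built from the first values printed by str(), rounded as displayed), so its numbers will not match those of the full 29-row data set.

```r
# Stand-in data: first five rows as printed by str() (rounded).
hprice_head <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340),
  full = c(1, 1, 2, 2, 2),
  half = c(0, 0, 1, 1, 0),
  bedrooms = c(3, 4, 4, 4, 3),
  rooms = c(6, 7, 9, 8, 7),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Pearson correlation of sale price with every other column.
predictors <- setdiff(names(hprice_head), "sale")
cors <- sapply(hprice_head[predictors], function(x) cor(hprice_head$sale, x))

# Rank the variables by the strength of their linear relationship with sale.
sort(abs(cors), decreasing = TRUE)
```

With the real data, replacing `hprice_head` by `hprice` would give the ranking for all 29 observations.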
Linear Regression Model
This section fits multiple linear regression models to examine how sale price and list price each relate to the other variables. The summary() and anova() functions are used to interpret the fitted models, and histograms and diagnostic plots are used to examine the residuals.
Sale Price vs Other Variables
Linear Regression of Sale Price and Other Variables
lin1 <- lm(sale ~ list + full + half + bedrooms + rooms + neighborhood, data = hprice)
summary(lin1)
Call:
lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood,
data = hprice)
Residuals:
Min 1Q Median 3Q Max
-28.807 -6.626 -0.270 5.580 32.933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.13359 17.15496 0.299 0.768
list 0.97131 0.07616 12.754 1.22e-11 ***
full -4.97759 5.48033 -0.908 0.374
half -1.00644 5.70418 -0.176 0.862
bedrooms 2.49224 6.43616 0.387 0.702
rooms -0.43411 3.70424 -0.117 0.908
neighborhood 2.03434 6.88609 0.295 0.770
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.87 on 22 degrees of freedom
Multiple R-squared: 0.989, Adjusted R-squared: 0.986
F-statistic: 330.5 on 6 and 22 DF, p-value: < 2.2e-16
For this linear regression model, the multiple R-squared value is 0.989 and the adjusted R-squared value is 0.986. Both values are close to 1, which means the model explains about 99% of the variation in sale price and fits the data closely. In the coefficient table, list price is the only predictor with a p-value below 0.05.
ANOVA of Sale Price and Other Variables
anova(lin1)
Analysis of Variance Table
Response: sale
Df Sum Sq Mean Sq F value Pr(>F)
list 1 381050 381050 1981.6252 <2e-16 ***
full 1 156 156 0.8116 0.3774
half 1 21 21 0.1092 0.7441
bedrooms 1 25 25 0.1314 0.7204
rooms 1 3 3 0.0141 0.9065
neighborhood 1 17 17 0.0873 0.7704
Residuals 22 4230 192
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The linear regression and ANOVA results show that list price has by far the largest effect on sale price. Note that anova() tests the terms sequentially, so each row measures what a variable adds after the variables listed above it are already in the model. The F-value of list price is 1981.6252, which indicates a strong relationship between sale price and list price. The F-values of the other variables are all less than 1, which means that, once list price is in the model, they explain no more variation than would be expected from noise.
The p-value of list price is below 2e-16, far smaller than 0.05, so the null hypothesis that list price has no effect on sale price is rejected. The p-values of the other variables are all larger than 0.05, so for each of them the null hypothesis of no effect cannot be rejected.
Since list price is the only variable with a large F-value and a p-value below 0.05, it is the variable that most affects sale price.
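The F-values, p-values, and R-squared discussed above can all be recomputed from the sums of squares in the ANOVA table, which shows where they come from. The sketch below uses the values as printed in the table; because those are rounded, the results agree with the table only approximately.

```r
# Sums of squares copied from the anova(lin1) table (rounded as printed).
ss_terms <- c(list = 381050, full = 156, half = 21,
              bedrooms = 25, rooms = 3, neighborhood = 17)
ss_resid <- 4230
df_resid <- 22

# Residual mean square: the noise level every term is compared against.
ms_resid <- ss_resid / df_resid

# R-squared is the share of the total sum of squares not left in the residuals.
r_squared <- 1 - ss_resid / (sum(ss_terms) + ss_resid)

# Each term has 1 degree of freedom, so its mean square equals its sum of squares.
f_values <- ss_terms / ms_resid
p_values <- pf(f_values, df1 = 1, df2 = df_resid, lower.tail = FALSE)

round(r_squared, 3)
round(f_values, 2)
round(p_values, 4)
```

An F-value near or below 1 means the term's mean square is no larger than the residual mean square, which is why those terms have large p-values.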
Residuals of Sale Price and Other Variables Model
The residuals of this model are not well distributed, so the model's results may be biased.
Histogram of Residuals
hist(residuals(lin1))
Residuals Against the Fitted Values
plot(lin1, which = 1)
Cook’s Distance for Each Observation
plot(lin1, which = 4)
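The diagnostic plots can be complemented with numeric checks. The sketch below is an illustration only: it fits a simplified model (sale on list alone) to a five-row stand-in built from the first values printed by str(), and the 4/n cutoff for Cook's distance is a common rule of thumb rather than a threshold taken from this report.

```r
# Stand-in data: first five list/sale values as printed by str() (rounded).
hprice_head <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340)
)

# Simplified model used only to demonstrate the diagnostics.
fit <- lm(sale ~ list, data = hprice_head)

# Cook's distance per observation, flagged against the 4/n rule of thumb.
cd <- cooks.distance(fit)
threshold <- 4 / nrow(hprice_head)
which(cd > threshold)

# Shapiro-Wilk test of the residuals; a small p-value suggests non-normality.
sw <- shapiro.test(residuals(fit))
sw$p.value
```

Applied to the real model, `cooks.distance(lin1)` and `shapiro.test(residuals(lin1))` would give the corresponding values for all 29 observations.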
List Price vs Other Variables
Linear Regression of List Price and Other Variables
lin2 <- lm(list ~ sale + full + half + bedrooms + rooms + neighborhood, data = hprice)
summary(lin2)
Call:
lm(formula = list ~ sale + full + half + bedrooms + rooms + neighborhood,
data = hprice)
Residuals:
Min 1Q Median 3Q Max
-27.8544 -6.7013 -0.7265 6.7894 31.3427
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.8752 15.9419 -1.372 0.184
sale 0.9069 0.0711 12.754 1.22e-11 ***
full 8.3411 5.0923 1.638 0.116
half 6.3398 5.3475 1.186 0.248
bedrooms -0.0627 6.2402 -0.010 0.992
rooms 1.2426 3.5706 0.348 0.731
neighborhood 7.3793 6.4787 1.139 0.267
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.4 on 22 degrees of freedom
Multiple R-squared: 0.9903, Adjusted R-squared: 0.9876
F-statistic: 373.3 on 6 and 22 DF, p-value: < 2.2e-16
For this linear regression model, the multiple R-squared value is 0.9903 and the adjusted R-squared value is 0.9876. Both values are close to 1, which means the model explains about 99% of the variation in list price and fits the data closely. In the coefficient table, sale price is the only predictor with a p-value below 0.05.
ANOVA of List Price and Other Variables
anova(lin2)
Analysis of Variance Table
Response: list
Df Sum Sq Mean Sq F value Pr(>F)
sale 1 401374 401374 2235.5702 <2e-16 ***
full 1 346 346 1.9259 0.1791
half 1 134 134 0.7440 0.3977
bedrooms 1 4 4 0.0209 0.8864
rooms 1 24 24 0.1326 0.7192
neighborhood 1 233 233 1.2973 0.2670
Residuals 22 3950 180
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The linear regression and ANOVA results show that sale price has by far the largest effect on list price. The F-value of sale price is 2235.5702, which indicates a strong relationship between list price and sale price. The F-values for number of full bathrooms and neighborhood rank are 1.9259 and 1.2973, respectively. Although these are above 1, they are small, and they are tiny compared with the F-value of sale price. The F-values of the remaining variables are all less than 1, so those variables add essentially nothing once sale price is in the model.
The p-value of sale price is below 2e-16, far smaller than 0.05, so the null hypothesis that sale price has no relationship with list price is rejected. The p-values of all other variables are larger than 0.05, so for each of them the null hypothesis of no effect on list price cannot be rejected.
Number of full bathrooms and neighborhood rank have F-values above 1 but p-values of 0.1791 and 0.2670, both above 0.05, so neither is statistically significant. Sale price is the only variable with both a large F-value and a p-value below 0.05, which means it most affects list price among all variables. A real estate agent could use this information when advising on a list price, looking first at the expected sale price and only secondarily at number of full bathrooms and neighborhood rank.
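Whether the five non-price variables matter jointly can also be checked with a single F-test built from the ANOVA table. This is a sketch using the sums of squares as printed (rounded); it relies on sale price being the first term, so that the remaining rows add up to the extra variation explained beyond a sale-only model.

```r
# Sequential sums of squares for the terms entered after sale,
# copied from the anova(lin2) table (rounded as printed).
ss_extra <- c(full = 346, half = 134, bedrooms = 4, rooms = 24, neighborhood = 233)
ss_resid <- 3950
df_resid <- 22

# Joint F-statistic: average extra sum of squares over the residual mean square.
df_extra <- length(ss_extra)
f_joint <- (sum(ss_extra) / df_extra) / (ss_resid / df_resid)
p_joint <- pf(f_joint, df1 = df_extra, df2 = df_resid, lower.tail = FALSE)

round(f_joint, 3)
round(p_joint, 3)
```

With the fitted models available, the same comparison could be obtained from `anova(lm(list ~ sale, data = hprice), lin2)`.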
Residuals of List Price and Other Variables Model
The residuals of this model are not well distributed, so the model's results may be biased.
Histogram of Residuals
hist(residuals(lin2))
Residuals Against the Fitted Values
plot(lin2, which = 1)
Cook’s Distance for Each Observation
plot(lin2, which = 4)
Effect of Neighborhood on the Difference between Sale Price and List Price
This section fits a multiple linear regression model to examine how neighborhood rank relates to sale price and list price. The summary() and anova() functions are used to interpret the fitted model, and a histogram and diagnostic plots are used to examine the residuals.
Linear Regression Model and ANOVA
lin3 <- lm(neighborhood ~ sale + list, data = hprice)
summary(lin3)
Call:
lm(formula = neighborhood ~ sale + list, data = hprice)
Residuals:
Min 1Q Median 3Q Max
-0.97992 -0.31827 -0.01618 0.33585 0.84921
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8551268 0.2395232 3.570 0.00142 **
sale 0.0008349 0.0074462 0.112 0.91159
list 0.0065966 0.0072552 0.909 0.37158
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4968 on 26 degrees of freedom
Multiple R-squared: 0.7763, Adjusted R-squared: 0.7591
F-statistic: 45.11 on 2 and 26 DF, p-value: 3.516e-09
anova(lin3)
Analysis of Variance Table
Response: neighborhood
Df Sum Sq Mean Sq F value Pr(>F)
sale 1 22.0673 22.0673 89.3927 6.724e-10 ***
list 1 0.2041 0.2041 0.8267 0.3716
Residuals 26 6.4183 0.2469
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
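The results suggest that neighborhood rank is more closely associated with sale price than with list price. In the ANOVA table, sale price has an F-value of 89.3927 and a p-value of 6.724e-10, while list price has an F-value of 0.8267 and a p-value of 0.3716. Because anova() tests terms sequentially, this means that list price adds little once sale price is already in the model. In the coefficient table, where each variable is adjusted for the other, neither sale price (p = 0.912) nor list price (p = 0.372) is individually significant, which is expected when two predictors are as strongly related as these. One possible interpretation is that houses in higher-ranked neighborhoods are more likely to sell above the asking price, although this model does not test that directly.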
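Because the ANOVA rows depend on the order in which the terms enter the model, it can help to fit the model with both orderings and compare. The sketch below does this on a five-row stand-in (`hprice_head`, built from the first values printed by str(), rounded as displayed), so its numbers are illustrative only.

```r
# Stand-in data: first five rows as printed by str() (rounded).
hprice_head <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Same model, two term orders.
fit_sale_first <- lm(neighborhood ~ sale + list, data = hprice_head)
fit_list_first <- lm(neighborhood ~ list + sale, data = hprice_head)

a1 <- anova(fit_sale_first)
a2 <- anova(fit_list_first)

# The per-term rows differ between the two tables, but the total explained
# and residual sums of squares are identical.
a1
a2
```

If list price entered first with the real data, it would likely absorb most of the explained variation instead, since the two prices carry nearly the same information.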
For this linear regression model, the multiple R-squared value is 0.7763 and the adjusted R-squared value is 0.7591. The model therefore explains about 78% of the variation in neighborhood rank, which is a reasonably good fit but a clearly looser one than the two previous models.
Residuals of the Model
The residuals of this model are well distributed, which gives no indication of bias in the results.
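The heading of this section refers to the difference between sale price and list price, which could also be modeled directly. The following sketch is an alternative formulation, not the model fitted in this report; the column name `price_gap` is hypothetical, and the five-row stand-in data (first values printed by str(), rounded as displayed) is far too small to draw conclusions from.

```r
# Stand-in data: first five rows as printed by str() (rounded).
hprice_head <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Hypothetical response: how far the sale price ended up above the list price.
gap_data <- transform(hprice_head, price_gap = sale - list)

# Regress the gap on neighborhood rank; the slope estimates how the gap
# changes per one-step increase in rank.
gap_fit <- lm(price_gap ~ neighborhood, data = gap_data)
coef(gap_fit)
```

With the full data, `summary(lm(I(sale - list) ~ neighborhood, data = hprice))` would show whether the slope is distinguishable from zero.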