This is a real dataset of house prices sold in Seattle, WA, USA between August and December 2022.
The dataset consists of a dataframe with 1665 observations with the following 6 variables:
1.beds = Number of bedrooms in property 2. region = based on zip code, north, central or south Seattle. 3. baths = Number of bathrooms in property. Note 0.5 corresponds to a half-bath which has a sink and toilet but no tub or shower 4. size = Total floor area of property in square feet 5. lot_size = Total area of the land where the property is located on in square feet 6. price = Price the property was sold for (US dollars) ##THIS IS THE RESPONSE VARIABLE
# reading the dataset in to RStudio
# Call the data table price_data
#https://drive.google.com/file/d/1LGxmER8s0PiGtrxoJA0fLCGU5IvCB4mh/view?usp=sharing
# Set up the actual file ID from the share link
file_id <- "1LGxmER8s0PiGtrxoJA0fLCGU5IvCB4mh"
data_url <- paste0("https://docs.google.com/uc?export=download&id=", file_id)
#Read the data into R
price_data <- read.csv(data_url, header = TRUE)
# Show the first few rows of data
head(price_data)
## beds Region baths size lot_size price
## 1 1 Central 1 704 500 645000
## 2 3 Central 3 1524 513 850000
## 3 3 Central 3 1524 513 875000
## 4 3 North 3 1450 525 750000
## 5 1 Central 1 480 560 275000
## 6 2 North 1 800 560 690000
# Compute univariate numeric summary statistics
summary(as.data.frame(price_data))
## beds Region baths size
## Min. : 1.000 Length:1665 Min. :0.500 Min. : 376
## 1st Qu.: 2.000 Class :character 1st Qu.:1.500 1st Qu.: 1260
## Median : 3.000 Mode :character Median :2.000 Median : 1720
## Mean : 3.126 Mean :2.298 Mean : 1896
## 3rd Qu.: 4.000 3rd Qu.:3.000 3rd Qu.: 2360
## Max. :15.000 Max. :9.000 Max. :11010
## lot_size price
## Min. : 500 Min. : 159488
## 1st Qu.: 2734 1st Qu.: 680000
## Median : 5000 Median : 865000
## Mean : 9673 Mean :1010483
## 3rd Qu.: 7350 3rd Qu.:1175000
## Max. :400752 Max. :6250000
# Univariate Charts--histograms for the quantitative variables
# use hist(price_data$VARNAME)
par(mfrow = c(3, 2))
hist(price_data$beds)
hist(price_data$baths)
hist(price_data$size)
hist(price_data$lot_size)
hist(price_data$price)
#Create the correlation matrix for the quantitative variables CHANGE THE DATA TABLE NAME
round(cor(price_data [,-2]),2)
## beds baths size lot_size price
## beds 1.00 0.59 0.73 -0.13 0.46
## baths 0.59 1.00 0.62 -0.08 0.54
## size 0.73 0.62 1.00 -0.07 0.74
## lot_size -0.13 -0.08 -0.07 1.00 -0.09
## price 0.46 0.54 0.74 -0.09 1.00
Size has a 0.74 correlation with price, a strong positive correlation Baths has a 0.54 correlation with price, a moderate positive correlation Beds has a 0.46 correlation with price, a moderate positive correlation
# Box plot CHANGE THE VARIABLES AND DATA TABLE
boxplot(price_data$price ~ price_data$Region)
Central has the highest price overall, followed by North and South. All three have similar bottom ranges but the high end is higher for Central, then North, with South having the lowest high prices.
# Box plots by vehicle type CHANGE THE VARIABLES AND DATA TABLE
par(mfrow = c(2, 2))
boxplot(price_data$beds ~ price_data$Region)
boxplot(price_data$baths ~ price_data$Region)
boxplot(price_data$size ~ price_data$Region)
boxplot(price_data$lot_size ~ price_data$Region)
Baths - South has less Baths on average than Central or North Beds - All regions have the same average bed count, but South has a lower Q3 Size - Central has a slightly higher average size than North or South, which have similar graphs Lot Size - There are many outliers for all regions, but North and South have the highest means, with South having the highest median
## Explore the relationships between all predictor variables and the response
## CHANGE THE VARIABLES AND DATA TABLE AND THE XLAB AND YLAB
par(mfrow = c(2, 3))
plot(price_data$beds,price_data$price,xlab = "beds", ylab = "Price Emissions")
plot(price_data$baths,price_data$price,xlab = "baths", ylab = "Price Emissions")
plot(price_data$size,price_data$price,xlab = "size", ylab = "Price Emissions")
plot(price_data$lot_size,price_data$price,xlab = "lot_size", ylab = "Price Emissions")
beds - generally trending up, but with wide variance Region - baths - trending very slightly up, wide variance size - trending up, moderately correlated lot_size - lots of prices at first, settling into a low trend after
# Fit a linear regression model CHANGE THE VARIABLES AND DATA TABLE
model1 <- lm(price ~ size + baths + beds, data = price_data )
summary(model1)
##
## Call:
## lm(formula = price ~ size + baths + beds, data = price_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1841254 -181876 -23339 143997 4399224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 172751.59 26670.93 6.477 1.23e-10 ***
## size 496.11 15.25 32.534 < 2e-16 ***
## baths 105351.90 11820.29 8.913 < 2e-16 ***
## beds -110288.68 11481.24 -9.606 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 367700 on 1661 degrees of freedom
## Multiple R-squared: 0.5872, Adjusted R-squared: 0.5864
## F-statistic: 787.5 on 3 and 1661 DF, p-value: < 2.2e-16
All coefficients are significant
Yes, the p-value is under 0.05
172751.59 - when all other variables are zero, this is the price
# CHANGE THE VARIABLES AND DATA TABLE
model2 <- lm (price ~ Region, price_data )
summary(model2)
##
## Call:
## lm(formula = price ~ Region, data = price_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -976039 -305368 -122368 149942 5023961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1226039 26408 46.426 < 2e-16 ***
## RegionNorth -193671 33439 -5.792 8.31e-09 ***
## RegionSouth -425981 35779 -11.906 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 548900 on 1662 degrees of freedom
## Multiple R-squared: 0.07961, Adjusted R-squared: 0.0785
## F-statistic: 71.87 on 2 and 1662 DF, p-value: < 2.2e-16
Central - 1226039 North - 1032368 South - 800058
# Extract the standardized residuals
#NO CHANGES NEEDED
resids2 = rstandard(model2)
fits2 = model2$fitted.values
# Constant Variance Assumption
plot(fits2, resids2,
xlab="Fitted Values",
ylab="Residuals",
main="")
abline(0, 0, lty=2, lwd=2)
Variance seems to increase slightly
# CHANGE THE VARIABLES AND DATA TABLE
model3 <- lm (price ~ Region + size + baths + beds, price_data)
summary(model3)
##
## Call:
## lm(formula = price ~ Region + size + baths + beds, data = price_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1739043 -171506 -26067 139679 4354258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 306457.22 29239.64 10.481 < 2e-16 ***
## RegionNorth -107417.43 21669.17 -4.957 7.88e-07 ***
## RegionSouth -286054.93 23461.50 -12.193 < 2e-16 ***
## size 470.85 14.79 31.842 < 2e-16 ***
## baths 92190.41 11368.33 8.109 9.79e-16 ***
## beds -84877.34 11186.60 -7.587 5.41e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 351700 on 1659 degrees of freedom
## Multiple R-squared: 0.6228, Adjusted R-squared: 0.6216
## F-statistic: 547.8 on 5 and 1659 DF, p-value: < 2.2e-16
all coefficients are significant
When the Region is North and all other variables are zero, the price is the intercept plus -107417.43
yes, the p-value is < 2.2e-16
# Extract the standardized residuals
#NO CHANGES NEEDED
resids3 = rstandard(model3)
fits3 = model3$fitted.values
# Constant Variance Assumption
plot(fits3, resids3,
xlab="Fitted Values",
ylab="Residuals",
main="")
abline(0, 0, lty=2, lwd=2)
Mostly, but it increases for greater values
# Return the R^2 values for model 1 and model 3
#NO CHANGES NEEDED
paste("Model 1 adjusted R^2:", round(summary(model1)$adj.r.squared,2))
## [1] "Model 1 adjusted R^2: 0.59"
paste("Model 3 adjusted R^2:",round(summary(model3)$adj.r.squared,2))
## [1] "Model 3 adjusted R^2: 0.62"
Model 3 has a slightly better performance, with 62% of the variation in price explained by the model, while model 1 explains 59% of the variation
Region=“Central”, beds=3, baths=4, size=2000
Region=“South”, beds=1, baths=0.5, size=250,
#new data
# CHANGE THE VARIABLES AND DATA TABLE AND VALUES
newvals1 <- data.frame(Region="Central", beds=3, baths=4, size=2000)
newvals2 <- data.frame(Region="South", beds=1, baths=0.5, size=250)
# Confidence Interval for the response variable
predict(model3,newvals1,interval='confidence',level=.95)
## fit lwr upr
## 1 1362288 1312779 1411798
predict(model3,newvals2,interval='confidence',level=.95)
## fit lwr upr
## 1 99332.86 53958.28 144707.4
1362288
99332.86
Comment on whether the plot shows constant variance.
Yes, the variance is constant