Background

This is a real dataset of house prices sold in Seattle, WA, USA between August and December 2022.

Data Description

The dataset consists of a dataframe with 1665 observations with the following 6 variables:

1.beds = Number of bedrooms in property 2. region = based on zip code, north, central or south Seattle. 3. baths = Number of bathrooms in property. Note 0.5 corresponds to a half-bath which has a sink and toilet but no tub or shower 4. size = Total floor area of property in square feet 5. lot_size = Total area of the land where the property is located on in square feet 6. price = Price the property was sold for (US dollars) ##THIS IS THE RESPONSE VARIABLE

Source: Kaggle

Load the data

No changes needed

# reading the dataset in to RStudio
# Call the data table price_data 

#https://drive.google.com/file/d/1LGxmER8s0PiGtrxoJA0fLCGU5IvCB4mh/view?usp=sharing

# Set up the actual file ID from the share link
file_id <- "1LGxmER8s0PiGtrxoJA0fLCGU5IvCB4mh"
data_url <- paste0("https://docs.google.com/uc?export=download&id=", file_id)

#Read the data into R
price_data <- read.csv(data_url, header = TRUE)

# Show the first few rows of data
head(price_data)
##   beds  Region baths size lot_size  price
## 1    1 Central     1  704      500 645000
## 2    3 Central     3 1524      513 850000
## 3    3 Central     3 1524      513 875000
## 4    3   North     3 1450      525 750000
## 5    1 Central     1  480      560 275000
## 6    2   North     1  800      560 690000

Exploratory Data Analysis

# Compute univariate numeric summary statistics
summary(as.data.frame(price_data))
##       beds           Region              baths            size      
##  Min.   : 1.000   Length:1665        Min.   :0.500   Min.   :  376  
##  1st Qu.: 2.000   Class :character   1st Qu.:1.500   1st Qu.: 1260  
##  Median : 3.000   Mode  :character   Median :2.000   Median : 1720  
##  Mean   : 3.126                      Mean   :2.298   Mean   : 1896  
##  3rd Qu.: 4.000                      3rd Qu.:3.000   3rd Qu.: 2360  
##  Max.   :15.000                      Max.   :9.000   Max.   :11010  
##     lot_size          price        
##  Min.   :   500   Min.   : 159488  
##  1st Qu.:  2734   1st Qu.: 680000  
##  Median :  5000   Median : 865000  
##  Mean   :  9673   Mean   :1010483  
##  3rd Qu.:  7350   3rd Qu.:1175000  
##  Max.   :400752   Max.   :6250000
# Univariate Charts--histograms for the quantitative variables
# use hist(price_data$VARNAME)
par(mfrow = c(3, 2))
hist(price_data$beds)
hist(price_data$baths)
hist(price_data$size)
hist(price_data$lot_size)
hist(price_data$price)

Compute the correlation coefficient for each quantitative predicting variables against the response

#Create the correlation matrix for the quantitative variables  CHANGE THE DATA TABLE NAME
round(cor(price_data [,-2]),2)
##           beds baths  size lot_size price
## beds      1.00  0.59  0.73    -0.13  0.46
## baths     0.59  1.00  0.62    -0.08  0.54
## size      0.73  0.62  1.00    -0.07  0.74
## lot_size -0.13 -0.08 -0.07     1.00 -0.09
## price     0.46  0.54  0.74    -0.09  1.00

Describe the strength and direction of the top 3 predicting variables that have the strongest linear relationships with the response.

Size has a 0.74 correlation with price, a strong positive correlation Baths has a 0.54 correlation with price, a moderate positive correlation Beds has a 0.46 correlation with price, a moderate positive correlation

Create a boxplot of the qualitative predicting variable versus the response variable.

# Box plot CHANGE THE VARIABLES AND DATA TABLE
boxplot(price_data$price ~ price_data$Region)

Explain the relationship between the qualitative predicting variable versus the response variable.

Central has the highest price overall, followed by North and South. All three have similar bottom ranges but the high end is higher for Central, then North, with South having the lowest high prices.

Create boxplots of the qualitative predicting variable versus the quantitative PREDICTOR variables

# Box plots by vehicle type  CHANGE THE VARIABLES AND DATA TABLE
par(mfrow = c(2, 2))
boxplot(price_data$beds ~ price_data$Region)
boxplot(price_data$baths ~ price_data$Region)
boxplot(price_data$size ~ price_data$Region)
boxplot(price_data$lot_size ~ price_data$Region)

Explain the relationship between the variables.

Baths - South has less Baths on average than Central or North Beds - All regions have the same average bed count, but South has a lower Q3 Size - Central has a slightly higher average size than North or South, which have similar graphs Lot Size - There are many outliers for all regions, but North and South have the highest means, with South having the highest median

Create scatterplots of the response variable against each quantitative predicting variable Describe the general trend of each plot.

## Explore the relationships between all predictor variables and the response
## CHANGE THE VARIABLES AND DATA TABLE AND THE XLAB AND YLAB
par(mfrow = c(2, 3))
plot(price_data$beds,price_data$price,xlab = "beds", ylab = "Price Emissions")
plot(price_data$baths,price_data$price,xlab = "baths", ylab = "Price Emissions")
plot(price_data$size,price_data$price,xlab = "size", ylab = "Price Emissions")
plot(price_data$lot_size,price_data$price,xlab = "lot_size", ylab = "Price Emissions")

Describe the general trend of each plot.

beds - generally trending up, but with wide variance Region - baths - trending very slightly up, wide variance size - trending up, moderately correlated lot_size - lots of prices at first, settling into a low trend after

Regression Model Fitting and Interpretation

Fit a multiple linear regression model called model1 using with the response and the top 2 predicting variables with the strongest relationship with the response variable

# Fit a linear regression model  CHANGE THE VARIABLES AND DATA TABLE

model1 <- lm(price ~ size + baths + beds, data = price_data )

summary(model1)
## 
## Call:
## lm(formula = price ~ size + baths + beds, data = price_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1841254  -181876   -23339   143997  4399224 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  172751.59   26670.93   6.477 1.23e-10 ***
## size            496.11      15.25  32.534  < 2e-16 ***
## baths        105351.90   11820.29   8.913  < 2e-16 ***
## beds        -110288.68   11481.24  -9.606  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 367700 on 1661 degrees of freedom
## Multiple R-squared:  0.5872, Adjusted R-squared:  0.5864 
## F-statistic: 787.5 on 3 and 1661 DF,  p-value: < 2.2e-16

Using alpha = 0.05, which of the estimated coefficients are statistically significant?

All coefficients are significant

Is the overall regression significant? And why or why not?

Yes, the p-value is under 0.05

What is the estimated coefficient for the intercept? Interpret this coefficient in the context of the dataset.

172751.59 - when all other variables are zero, this is the price

What is the estimated coefficient for the strongest correlated predictor variable? Interpret this coefficient in the context of the dataset.

the estimated coefficient of size is 496.11 - this means that for each increase of 1 in size, the price increases by 496.11

Goodness of Fit

# Extract the standardized residuals
#NO CHANGES NEEDED
resids = rstandard(model1)
fits = model1$fitted.values

# Constant Variance Assumption
plot(fits, resids,
     xlab="Fitted Values",
     ylab="Residuals",
     main="")
abline(0, 0, lty=2, lwd=2)

Comment on whether the plot shows constant variance.

Yes, the variance is constant

Fit a linear regression model called model2 with response variable and the qualitataive variable as the predicting variable

# CHANGE THE VARIABLES AND DATA TABLE
model2 <- lm (price ~  Region, price_data )
summary(model2)
## 
## Call:
## lm(formula = price ~ Region, data = price_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -976039 -305368 -122368  149942 5023961 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1226039      26408  46.426  < 2e-16 ***
## RegionNorth  -193671      33439  -5.792 8.31e-09 ***
## RegionSouth  -425981      35779 -11.906  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 548900 on 1662 degrees of freedom
## Multiple R-squared:  0.07961,    Adjusted R-squared:  0.0785 
## F-statistic: 71.87 on 2 and 1662 DF,  p-value: < 2.2e-16

What is the model for each of the values of the qualitataive variable?

Central - 1226039 North - 1032368 South - 800058

Goodness of Fit

# Extract the standardized residuals
#NO CHANGES NEEDED

resids2 = rstandard(model2)
fits2 = model2$fitted.values

# Constant Variance Assumption
plot(fits2, resids2,
     xlab="Fitted Values",
     ylab="Residuals",
     main="")
abline(0, 0, lty=2, lwd=2)

Does the fitted values vs. standardized residuals plot indicate constant variance?

Variance seems to increase slightly

Fit a multiple linear regression model called model3 using the response variable and your choice of predictor variables.

# CHANGE THE VARIABLES AND DATA TABLE
model3 <- lm (price ~  Region + size + baths + beds, price_data)

summary(model3)
## 
## Call:
## lm(formula = price ~ Region + size + baths + beds, data = price_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1739043  -171506   -26067   139679  4354258 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  306457.22   29239.64  10.481  < 2e-16 ***
## RegionNorth -107417.43   21669.17  -4.957 7.88e-07 ***
## RegionSouth -286054.93   23461.50 -12.193  < 2e-16 ***
## size            470.85      14.79  31.842  < 2e-16 ***
## baths         92190.41   11368.33   8.109 9.79e-16 ***
## beds         -84877.34   11186.60  -7.587 5.41e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 351700 on 1659 degrees of freedom
## Multiple R-squared:  0.6228, Adjusted R-squared:  0.6216 
## F-statistic: 547.8 on 5 and 1659 DF,  p-value: < 2.2e-16

Using alpha = 0.05, which of the estimated coefficients arestatistically significant in model2?

all coefficients are significant

Interpret the estimated coefficient for RegionNorth in the context of the dataset.

When the Region is North and all other variables are zero, the price is the intercept plus -107417.43

Is the overall regression (model3) significant at an alpha-level of 0.05? Explain how you determined the answer.

yes, the p-value is < 2.2e-16

Goodness of Fit

# Extract the standardized residuals
#NO CHANGES NEEDED

resids3 = rstandard(model3)
fits3 = model3$fitted.values

# Constant Variance Assumption
plot(fits3, resids3,
     xlab="Fitted Values",
     ylab="Residuals",
     main="")
abline(0, 0, lty=2, lwd=2)

Does the fitted values vs. standardized residuals plot indicate constant variance?

Mostly, but it increases for greater values

What are the Adjusted R^2 values for model 1 and model 3?

# Return the R^2 values for model 1 and model 3
#NO CHANGES NEEDED

paste("Model 1 adjusted R^2:", round(summary(model1)$adj.r.squared,2))
## [1] "Model 1 adjusted R^2: 0.59"
paste("Model 3 adjusted R^2:",round(summary(model3)$adj.r.squared,2))
## [1] "Model 3 adjusted R^2: 0.62"

How would you describe the performance of model 1 and model 3 based on the R^2 value.

Model 3 has a slightly better performance, with 62% of the variation in price explained by the model, while model 1 explains 59% of the variation

Prediction

Using model3, predict the mean response for 2 sets of characteristics:

CHANGE THE VARIABLES AND THE VALUES TO MATCH YOUR DATA SET

  1. Region=“Central”, beds=3, baths=4, size=2000

  2. Region=“South”, beds=1, baths=0.5, size=250,

#new data
# CHANGE THE VARIABLES AND DATA TABLE AND VALUES

newvals1 <- data.frame(Region="Central", beds=3, baths=4, size=2000)
newvals2 <- data.frame(Region="South", beds=1, baths=0.5, size=250)


# Confidence Interval for the response variable
predict(model3,newvals1,interval='confidence',level=.95)
##       fit     lwr     upr
## 1 1362288 1312779 1411798
predict(model3,newvals2,interval='confidence',level=.95)
##        fit      lwr      upr
## 1 99332.86 53958.28 144707.4

What is the predicted mean response for set 1?

1362288

What is the predicted mean emissions for set 2?

99332.86