Background

This is a real dataset of sale prices for houses sold in Seattle, WA, USA between August and December 2022.

Data Description

The dataset consists of a dataframe with 1665 observations with the following 6 variables:

1. beds = Number of bedrooms in the property
2. Region = Region of Seattle, based on zip code: North, Central, or South
3. baths = Number of bathrooms in the property. Note that 0.5 corresponds to a half-bath, which has a sink and toilet but no tub or shower
4. size = Total floor area of the property in square feet
5. lot_size = Total area of the land on which the property is located, in square feet
6. price = Price the property was sold for, in US dollars (this is the response variable)

Source: Kaggle

Load the data

No changes needed

# Read the dataset into RStudio
# Call the data table price_data

#https://drive.google.com/file/d/1LGxmER8s0PiGtrxoJA0fLCGU5IvCB4mh/view?usp=sharing

# Set up the actual file ID from the share link
file_id <- "1LGxmER8s0PiGtrxoJA0fLCGU5IvCB4mh"
data_url <- paste0("https://docs.google.com/uc?export=download&id=", file_id)

#Read the data into R
price_data <- read.csv(data_url, header = TRUE)

# Show the first few rows of data
head(price_data)
##   beds  Region baths size lot_size  price
## 1    1 Central     1  704      500 645000
## 2    3 Central     3 1524      513 850000
## 3    3 Central     3 1524      513 875000
## 4    3   North     3 1450      525 750000
## 5    1 Central     1  480      560 275000
## 6    2   North     1  800      560 690000



# Exploratory Data Analysis


``` r
# Compute univariate numeric summary statistics
summary(price_data)
##       beds           Region              baths            size      
##  Min.   : 1.000   Length:1665        Min.   :0.500   Min.   :  376  
##  1st Qu.: 2.000   Class :character   1st Qu.:1.500   1st Qu.: 1260  
##  Median : 3.000   Mode  :character   Median :2.000   Median : 1720  
##  Mean   : 3.126                      Mean   :2.298   Mean   : 1896  
##  3rd Qu.: 4.000                      3rd Qu.:3.000   3rd Qu.: 2360  
##  Max.   :15.000                      Max.   :9.000   Max.   :11010  
##     lot_size          price        
##  Min.   :   500   Min.   : 159488  
##  1st Qu.:  2734   1st Qu.: 680000  
##  Median :  5000   Median : 865000  
##  Mean   :  9673   Mean   :1010483  
##  3rd Qu.:  7350   3rd Qu.:1175000  
##  Max.   :400752   Max.   :6250000
# Univariate Charts--histograms for the quantitative variables
# use hist(price_data$VARNAME)
par(mfrow = c(3, 2))
hist(price_data$beds)
hist(price_data$baths)
hist(price_data$size)
hist(price_data$lot_size)
hist(price_data$price)
```

Compute the correlation coefficient for each quantitative predicting variable against the response.

# Correlation matrix for the quantitative variables (column 2, Region, is dropped)
round(cor(price_data[, -2]), 2)
##           beds baths  size lot_size price
## beds      1.00  0.59  0.73    -0.13  0.46
## baths     0.59  1.00  0.62    -0.08  0.54
## size      0.73  0.62  1.00    -0.07  0.74
## lot_size -0.13 -0.08 -0.07     1.00 -0.09
## price     0.46  0.54  0.74    -0.09  1.00

Describe the strength and direction of the top 3 predicting variables that have the strongest linear relationships with the response.

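Answer: The three predictors with the strongest linear relationships with price are size, baths, and beds. Size has a strong positive correlation with price (r = 0.74), baths has a moderate positive correlation (r = 0.54), and beds has a moderate positive correlation (r = 0.46).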
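
As a minimal sketch, the ranking can be reproduced by sorting the absolute correlations with price. The values below are copied by hand from the price column of the correlation matrix; the names `cor_with_price`, `ranked`, and `top3` are illustrative and not part of the original analysis.

```r
# Correlations with price, copied from the correlation matrix above
cor_with_price <- c(beds = 0.46, baths = 0.54, size = 0.74, lot_size = -0.09)

# Rank by absolute value so that a strong negative correlation would also count
ranked <- cor_with_price[order(abs(cor_with_price), decreasing = TRUE)]

# Keep the three strongest linear relationships
top3 <- head(ranked, 3)
print(top3)
```

Sorting on the absolute value matters in general, although here the only negative correlation (lot_size) is also the weakest.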

Create a boxplot of the qualitative predicting variable versus the response variable.

# Box plot of price by region
boxplot(price ~ Region, data = price_data)

Explain the relationship between the qualitative predicting variable versus the response variable.

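Answer: The boxplot of price by region shows that the distribution of house prices differs across the three regions. The region means estimated by model2 later in this report bear this out: about $1.23 million in Central, $1.03 million in North, and $0.80 million in South.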
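
A boxplot compares groups mainly through their medians and quartiles. The sketch below computes per-region medians with `tapply()` on the six rows shown by `head(price_data)`; the data frame `sample_rows` is a hand-copied illustration, far too small to represent the full dataset, and it happens to contain no South rows.

```r
# First six rows of price_data, copied from the head() output
sample_rows <- data.frame(
  Region = c("Central", "Central", "Central", "North", "Central", "North"),
  price  = c(645000, 850000, 875000, 750000, 275000, 690000)
)

# Median price per region: the centre line of each box in a boxplot
region_medians <- tapply(sample_rows$price, sample_rows$Region, median)
print(region_medians)
```

Running the same `tapply()` call on the full `price_data` would give the medians that the boxplot draws.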

Create boxplots of the qualitative predicting variable versus the quantitative PREDICTOR variables

# Box plots of each quantitative predictor by region
par(mfrow = c(2, 2))
boxplot(price_data$beds ~ price_data$Region)
boxplot(price_data$baths ~ price_data$Region)
boxplot(price_data$size ~ price_data$Region)
boxplot(price_data$lot_size ~ price_data$Region)

Explain the relationship between the variables.

Answer: The boxplots of beds, baths, size, and lot_size by region show that the characteristics of homes differ by region.

Create scatterplots of the response variable against each quantitative predicting variable.

## Explore the relationship between each quantitative predictor and the response
par(mfrow = c(2, 2))
plot(price_data$beds, price_data$price, xlab = "beds", ylab = "price")
plot(price_data$baths, price_data$price, xlab = "baths", ylab = "price")
plot(price_data$size, price_data$price, xlab = "size", ylab = "price")
plot(price_data$lot_size, price_data$price, xlab = "lot_size", ylab = "price")

Describe the general trend of each plot.

Answer: Price vs beds shows an upward trend, as does price vs baths. Price vs size shows the clearest positive relationship. Price vs lot_size shows no clear linear trend, consistent with its near-zero correlation with price (-0.09).

Regression Model Fitting and Interpretation

Fit a multiple linear regression model called model1 using the response and the top 2 predicting variables with the strongest relationship with the response variable.

# Fit a linear regression model on the two predictors most correlated with price

model1 <- lm(price ~ size + baths, data = price_data)

summary(model1)
## 
## Call:
## lm(formula = price ~ size + baths, data = price_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1512769  -184967   -14583   139168  4682870 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54645.49   24309.14   2.248   0.0247 *  
## size          412.82      12.88  32.043  < 2e-16 ***
## baths       75416.75   11711.01   6.440 1.56e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 377700 on 1662 degrees of freedom
## Multiple R-squared:  0.5643, Adjusted R-squared:  0.5637 
## F-statistic:  1076 on 2 and 1662 DF,  p-value: < 2.2e-16

Using alpha = 0.05, which of the estimated coefficients are statistically significant?

Answer: All three estimated coefficients in model1 have p-values below 0.05: the intercept (p = 0.0247), size (p < 2e-16), and baths (p = 1.56e-10). Both size and baths are therefore statistically significant predictors of price.
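
The p-values in the coefficient table come from a t-test of each estimate against zero. The following sketch recomputes them from the estimates and standard errors printed by `summary(model1)`; the inputs are copied by hand and already rounded, so the results match the table only approximately.

```r
# Estimates and standard errors copied from the model1 summary
est <- c(intercept = 54645.49, size = 412.82, baths = 75416.75)
se  <- c(intercept = 24309.14, size = 12.88,  baths = 11711.01)
df_resid <- 1662  # residual degrees of freedom reported by summary(model1)

# t statistic and two-sided p-value for H0: coefficient = 0
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# TRUE where the coefficient is significant at alpha = 0.05
significant <- p_val < 0.05
print(round(t_stat, 3))
print(significant)
```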

Is the overall regression significant? And why or why not?

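Answer: Yes, the overall regression is significant. The F-statistic is 1076 on 2 and 1662 degrees of freedom, with a p-value < 2.2e-16, which is far below alpha = 0.05. We reject the null hypothesis that the coefficients of size and baths are both zero.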
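
The overall test compares the reported F-statistic with an F distribution. As a sketch, the p-value can be recovered with `pf()` from the numbers printed at the bottom of `summary(model1)`; the variable names here are illustrative.

```r
# F statistic and degrees of freedom copied from the model1 summary
f_stat   <- 1076
df_model <- 2     # number of predictors (size, baths)
df_resid <- 1662  # n - number of predictors - 1 = 1665 - 2 - 1

# Upper-tail probability of the F distribution: P(F > f_stat) under H0
p_overall <- pf(f_stat, df1 = df_model, df2 = df_resid, lower.tail = FALSE)

alpha <- 0.05
overall_significant <- p_overall < alpha
print(p_overall)
print(overall_significant)
```

R prints such tiny values as `< 2.2e-16` in the summary because that is roughly the machine precision it reports against.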

What is the estimated coefficient for the intercept? Interpret this coefficient in the context of the dataset.

Answer: The estimated intercept is 54,645.49. It is the predicted price, in US dollars, when both size = 0 and baths = 0. In context, this does not have a realistic meaning because a house cannot have 0 square feet and 0 baths.

What is the estimated coefficient for the strongest correlated predictor variable? Interpret this coefficient in the context of the dataset.

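Answer: The predictor most strongly correlated with price is size (r = 0.74), and its estimated coefficient in model1 is 412.82. Holding the number of baths constant, each additional square foot of floor area is associated with an increase of about $412.82 in the expected sale price.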
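
To make the "holding baths constant" reading of the size coefficient concrete, the sketch below compares two hypothetical houses that differ only in floor area. The coefficients are copied from `summary(model1)`; the helper `predict_model1` and the two example houses are assumptions made for illustration.

```r
# Coefficients copied from the model1 summary
b0      <- 54645.49   # intercept
b_size  <- 412.82     # dollars per additional square foot
b_baths <- 75416.75   # dollars per additional bathroom

# Predicted price under model1 for a given size and number of baths
predict_model1 <- function(size, baths) b0 + b_size * size + b_baths * baths

# Two hypothetical houses that differ only by 100 square feet
price_small <- predict_model1(size = 1700, baths = 2)
price_large <- predict_model1(size = 1800, baths = 2)

# The difference is 100 times the size coefficient
price_diff <- price_large - price_small
print(price_diff)
```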

Goodness of Fit

# Extract the standardized residuals and fitted values
resids <- rstandard(model1)
fits <- fitted(model1)

# Constant Variance Assumption
plot(fits, resids,
     xlab = "Fitted Values",
     ylab = "Standardized Residuals",
     main = "")
abline(h = 0, lty = 2, lwd = 2)

Comment on whether the plot shows constant variance.

Answer: Looking at the plot of standardized residuals versus fitted values for Model 1, the residuals are roughly spread around zero without a strong pattern. There may be some slight widening of the spread at higher fitted values, but there is no extreme shape.
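
A visual judgement of spread can be backed by a rough numerical check. The sketch below uses simulated data, not the housing data, in which the noise deliberately grows with the predictor, and compares the spread of standardized residuals below and above the median fitted value. It is an informal companion to the plot, not a formal test, and every name in it is illustrative.

```r
set.seed(1)  # simulated data only; not the housing data

# Simulate a response whose noise grows with the predictor
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

sim_model  <- lm(y ~ x)
sim_resids <- rstandard(sim_model)
sim_fits   <- fitted(sim_model)

# Compare residual spread below and above the median fitted value
upper_half   <- sim_fits > median(sim_fits)
spread_ratio <- sd(sim_resids[upper_half]) / sd(sim_resids[!upper_half])
print(spread_ratio)
```

A ratio well above 1 points to spread that widens with the fitted values, while a ratio near 1 is consistent with constant variance. The same three lines could be applied to `resids` and `fits` from model1.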

Fit a linear regression model called model2 with the response variable and the qualitative variable as the predicting variable.

# Fit a model with Region as the only predictor
model2 <- lm(price ~ Region, data = price_data)
summary(model2)
## 
## Call:
## lm(formula = price ~ Region, data = price_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -976039 -305368 -122368  149942 5023961 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1226039      26408  46.426  < 2e-16 ***
## RegionNorth  -193671      33439  -5.792 8.31e-09 ***
## RegionSouth  -425981      35779 -11.906  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 548900 on 1662 degrees of freedom
## Multiple R-squared:  0.07961,    Adjusted R-squared:  0.0785 
## F-statistic: 71.87 on 2 and 1662 DF,  p-value: < 2.2e-16

What is the model for each of the values of the qualitative variable?

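Answer: Central is the baseline level, so its mean price is the intercept. Each other region's mean price is the intercept plus the corresponding region coefficient:

Central: price = 1,226,039

North: price = 1,226,039 - 193,671 = 1,032,368

South: price = 1,226,039 - 425,981 = 800,058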
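
With a single categorical predictor, the fitted value for each region is its group mean, built from the intercept and the region coefficient. A short sketch of that arithmetic, using coefficients copied from `summary(model2)` and illustrative variable names:

```r
# Coefficients copied from the model2 summary
intercept    <- 1226039   # mean price in the baseline region (Central)
region_north <- -193671   # North minus Central
region_south <- -425981   # South minus Central

# Fitted mean price for each region
mean_price <- c(
  Central = intercept,
  North   = intercept + region_north,
  South   = intercept + region_south
)
print(mean_price)
```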

Goodness of Fit

# Extract the standardized residuals and fitted values
resids2 <- rstandard(model2)
fits2 <- fitted(model2)

# Constant Variance Assumption
plot(fits2, resids2,
     xlab = "Fitted Values",
     ylab = "Standardized Residuals",
     main = "")
abline(h = 0, lty = 2, lwd = 2)

Does the fitted values vs. standardized residuals plot indicate constant variance?

Answer: The residual vs fitted plot shows the residuals scattered around zero with no strong increasing or decreasing spread as fitted values change.

Fit a multiple linear regression model called model3 using the response variable and your choice of predictor variables.

# Fit a model with Region, size, and baths as predictors
model3 <- lm(price ~ Region + size + baths, data = price_data)

summary(model3)
## 
## Call:
## lm(formula = price ~ Region + size + baths, data = price_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1478315  -174273   -25602   134080  4555225 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  237727.73   28270.85   8.409  < 2e-16 ***
## RegionNorth -127937.34   21863.00  -5.852 5.85e-09 ***
## RegionSouth -319851.40   23423.98 -13.655  < 2e-16 ***
## size            405.70      12.24  33.140  < 2e-16 ***
## baths         68780.22   11126.49   6.182 7.97e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 357700 on 1660 degrees of freedom
## Multiple R-squared:  0.6097, Adjusted R-squared:  0.6087 
## F-statistic: 648.2 on 4 and 1660 DF,  p-value: < 2.2e-16

Using alpha = 0.05, which of the estimated coefficients are statistically significant in model3?

Answer: All of the estimated coefficients in model3 have p-values below 0.05, so all are statistically significant: the intercept, RegionNorth, RegionSouth, size, and baths.

Interpret the estimated coefficient for RegionNorth in the context of the dataset.

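Answer: The estimated coefficient for RegionNorth is -127,937.34. It represents the difference in expected price between a house in the North region and a house in the Central region (the baseline), holding size and baths constant. A house in North is expected to sell for about $127,937 less than a Central house of the same size with the same number of baths.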
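
The reason RegionNorth is read relative to Central is the way R codes a factor: the first level becomes the baseline and every other level gets its own 0/1 indicator column. The sketch below shows this on a toy factor with the same three level names; it is not the real data.

```r
# Toy factor with the same three levels as Region (not the real data)
Region <- factor(c("Central", "North", "South", "North"))

# Levels are sorted, so the first level, Central, is the baseline
baseline <- levels(Region)[1]

# The design matrix has an intercept plus one indicator per non-baseline level
design <- model.matrix(~ Region)
print(baseline)
print(design)
```

Because there is no RegionCentral column, the intercept absorbs the Central level, and the RegionNorth and RegionSouth coefficients are differences from it. A different baseline could be chosen with `relevel()`.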

Is the overall regression (model3) significant at an alpha-level of 0.05? Explain how you determined the answer.

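Answer: Yes, Model 3 is significant at an alpha level of 0.05. The F-statistic is 648.2 on 4 and 1660 degrees of freedom, with a p-value < 2.2e-16, which is far below 0.05.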

Goodness of Fit

# Extract the standardized residuals and fitted values
resids3 <- rstandard(model3)
fits3 <- fitted(model3)

# Constant Variance Assumption
plot(fits3, resids3,
     xlab = "Fitted Values",
     ylab = "Standardized Residuals",
     main = "")
abline(h = 0, lty = 2, lwd = 2)

Does the fitted values vs. standardized residuals plot indicate constant variance?

Answer: The residual vs fitted plot for Model 3 shows the residuals scattered around zero with no strong pattern. There might be some variation in the spread, but there is no clear shape, so the constant variance assumption appears to be satisfied for Model 3.

What are the Adjusted R^2 values for model 1 and model 3?

# Return the R^2 values for model 1 and model 3
#NO CHANGES NEEDED

paste("Model 1 adjusted R^2:", round(summary(model1)$adj.r.squared,2))
## [1] "Model 1 adjusted R^2: 0.56"
paste("Model 3 adjusted R^2:",round(summary(model3)$adj.r.squared,2))
## [1] "Model 3 adjusted R^2: 0.61"

How would you describe the performance of model 1 and model 3 based on the R^2 values?
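
Adjusted R^2 penalizes the ordinary R^2 for the number of predictors, which is what makes it suitable for comparing models of different sizes. As a sketch, the standard formula can be applied to the Multiple R-squared values printed in the two summaries; the helper `adjusted_r2` is illustrative, and the inputs are rounded, so agreement with the reported values is approximate.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_obs <- 1665  # number of observations in price_data

# Multiple R-squared values copied from the model summaries
adj_model1 <- adjusted_r2(r2 = 0.5643, n = n_obs, p = 2)  # size, baths
adj_model3 <- adjusted_r2(r2 = 0.6097, n = n_obs, p = 4)  # two Region dummies, size, baths

print(round(c(model1 = adj_model1, model3 = adj_model3), 2))
```

With 1665 observations the penalty is small, so the adjusted values sit very close to the unadjusted ones.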

Prediction

Using model3, predict the mean response for 2 sets of characteristics:

  1. Region=“North”, size=704, baths=2

  2. Region=“Central”, size=1010, baths=3

# New data: one row per set of characteristics

newvals1 <- data.frame(Region="North", size=704, baths=2)
newvals2 <- data.frame(Region="Central", size=1010, baths=3)


# Confidence Interval for the response variable
predict(model3,newvals1,interval='confidence',level=.95)
##        fit      lwr      upr
## 1 532963.6 497329.5 568597.8
predict(model3,newvals2,interval='confidence',level=.95)
##        fit    lwr      upr
## 1 853825.4 805391 902259.9
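
The `fit` column can be reproduced by hand from the model3 coefficients, which is a useful check that the new data were specified as intended. The sketch below uses coefficients copied from `summary(model3)`; `predict_model3` is a hypothetical helper, and it recovers only the point estimate, since the interval bounds also depend on the coefficient covariance matrix.

```r
# Coefficients copied from the model3 summary
coefs <- c(intercept = 237727.73, RegionNorth = -127937.34,
           RegionSouth = -319851.40, size = 405.70, baths = 68780.22)

# Hypothetical helper: point prediction from model3 for one house
predict_model3 <- function(region, size, baths) {
  region_effect <- switch(region,
    Central = 0,                       # baseline level
    North   = coefs[["RegionNorth"]],
    South   = coefs[["RegionSouth"]],
    stop("unknown region: ", region)
  )
  coefs[["intercept"]] + region_effect +
    coefs[["size"]] * size + coefs[["baths"]] * baths
}

fit1 <- predict_model3("North", size = 704, baths = 2)
fit2 <- predict_model3("Central", size = 1010, baths = 3)
print(c(set1 = fit1, set2 = fit2))
```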

What is the predicted mean response for set 1?

What is the predicted mean response for set 2?