Download the Data

library(RCurl)  # provides getURL()
my_git_url <- getURL("https://raw.githubusercontent.com/AhmedBuckets/SPS605/main/home_data.csv")
price_data <- read.csv(text = my_git_url)

Plotting Price against Square Feet
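
The plot for this step is not reproduced here, so the following is only a minimal sketch of how such a scatterplot could be drawn in base R. The helper name `plot_price_vs_sqft` and the styling choices are assumptions for illustration; only the column names `price` and `sqft_living` come from the model formula used in this document.

```r
# Hypothetical helper: scatterplot of price against living area.
# Assumes `data` has numeric columns `price` and `sqft_living`.
plot_price_vs_sqft <- function(data) {
  plot(
    data$sqft_living, data$price,
    xlab = "Living area (sq ft)",
    ylab = "Price",
    main = "Price vs. square feet",
    pch = 16,
    col = rgb(0, 0, 1, 0.2)  # semi-transparent points reduce overplotting
  )
  invisible(nrow(data))  # number of rows plotted, returned silently
}
```

With the data loaded above, the call would be `plot_price_vs_sqft(price_data)`.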

Make Model

price_sqft_model <- lm(price ~ sqft_living, data = price_data)

Display Line of Best Fit
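
The code behind this plot is not shown, so this is a sketch of one common way to overlay a fitted line on the scatterplot, not necessarily the original approach. The helper name `plot_best_fit` is hypothetical; `abline()` accepts a fitted `lm` object and draws its intercept and slope.

```r
# Hypothetical helper: scatterplot with the fitted regression line on top.
# Assumes `model` was fit as lm(price ~ sqft_living, data = data).
plot_best_fit <- function(data, model) {
  plot(
    data$sqft_living, data$price,
    xlab = "Living area (sq ft)",
    ylab = "Price"
  )
  abline(model, col = "red", lwd = 2)  # line defined by the model coefficients
  invisible(coef(model))               # intercept and slope, returned silently
}
```

Here the call would be `plot_best_fit(price_data, price_sqft_model)`.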

Look at the Summary

## 
## Call:
## lm(formula = price ~ sqft_living, data = price_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1476062  -147486   -24043   106182  4362067 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43580.743   4402.690  -9.899   <2e-16 ***
## sqft_living    280.624      1.936 144.920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261500 on 21611 degrees of freedom
## Multiple R-squared:  0.4929, Adjusted R-squared:  0.4928 
## F-statistic: 2.1e+04 on 1 and 21611 DF,  p-value: < 2.2e-16
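
To make the coefficients concrete, here is a small worked example of the fitted equation, price = intercept + slope × sqft_living, using the estimates printed above. The 2000 square foot home is a hypothetical input chosen for illustration, not a row from the data.

```r
# Coefficient estimates copied from the summary output above.
intercept <- -43580.743
slope <- 280.624

# Hypothetical home size, chosen only for illustration.
sqft <- 2000

# Predicted price = intercept + slope * sqft_living
predicted_price <- intercept + slope * sqft
print(predicted_price)
```

The result is roughly 517,667: each additional square foot adds about 280.62 to the predicted price. The same value should come from `predict(price_sqft_model, newdata = data.frame(sqft_living = 2000))`, up to rounding of the printed coefficients. The R-squared of 0.4929 indicates that living area alone accounts for about 49% of the variance in price.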

Residual Analysis

The residuals are the differences between the observed values and the values predicted by the model (observed minus predicted). The minimum residual, -1476062, is the model’s largest overestimate: the predicted price was about 1.48 million above the observed one. The maximum residual, 4362067, is its largest underestimate. The median residual is -24043, so the model overpredicts slightly for at least half of the homes. The residuals tell us more when we compare them with the model’s predictions:
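
The plot referred to here is not reproduced, so the following is a sketch of a standard residuals-versus-fitted plot in base R. The helper name `plot_residuals_vs_fitted` is an assumption for illustration; `resid()` and `fitted()` are the usual accessors for an `lm` object.

```r
# Hypothetical helper: residuals against fitted values, with a zero line.
plot_residuals_vs_fitted <- function(model) {
  res <- resid(model)  # observed minus fitted
  plot(
    fitted(model), res,
    xlab = "Fitted price",
    ylab = "Residual (observed - fitted)"
  )
  abline(h = 0, lty = 2, col = "red")  # reference line at zero
  invisible(range(res))                # smallest and largest residual
}
```

The call would be `plot_residuals_vs_fitted(price_sqft_model)`. A fan shape that widens to the right would suggest that the residual variance grows with the predicted price.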

We see that the values are fairly evenly distributed around 0. The residuals do tend to grow in magnitude as we move to the right, meaning the model will struggle with prediction at larger values.

We can use a Q-Q plot to visualize whether or not the residuals are normally distributed:
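
As the original plotting code is not shown, here is a minimal sketch of a normal Q-Q plot of the residuals using base R’s `qqnorm()` and `qqline()`. The helper name `plot_residual_qq` is hypothetical.

```r
# Hypothetical helper: normal Q-Q plot of the model residuals.
plot_residual_qq <- function(model) {
  res <- resid(model)
  qqnorm(res, main = "Normal Q-Q plot of residuals")  # sample vs. theoretical quantiles
  qqline(res, col = "red")                            # line through the 1st and 3rd quartiles
  invisible(length(res))                              # number of residuals plotted
}
```

The call would be `plot_residual_qq(price_sqft_model)`. Points that follow the line are consistent with normally distributed residuals, while points that curve away at either end indicate heavier tails than a normal distribution.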

The residuals don’t deviate too much from the line until they reach the rightmost extreme, which is consistent with the large maximum residual (4362067) reported in the summary.