I believe that, in the Ames data, the best indicator of sale price will be the above-grade square footage, GrLivArea. Basements are nice, but overall the square footage of the actual house matters more.
Data Cleanup: To clean up the data, I filtered out rows with NA in the variables I was interested in using, converted overall quality into a factor, and deselected various variables I knew I didn't want, just to keep the data easier to work with.
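The cleanup steps described above could look roughly like the following. This is a minimal sketch in base R, not the code actually used for this report: the toy data frame `raw_ames`, the result name `clean_ames`, the vector `keep_vars`, and the dropped column `PoolQC` are all assumptions made for illustration.

```r
# Toy stand-in for the raw Ames data; the real data has many more columns.
raw_ames <- data.frame(
  SalePrice   = c(215000, 105000, NA, 244000, 189900),
  GrLivArea   = c(1656, 896, 1329, 2110, 1629),
  OverallQual = c(6, 5, 6, 7, 5),
  GarageCars  = c(2, 1, 1, 2, NA),
  PoolQC      = c(NA, NA, NA, NA, NA)  # illustrative column to drop
)

# Variables of interest (taken from the model formulas in this report).
keep_vars <- c("SalePrice", "GrLivArea", "OverallQual", "GarageCars")

# 1. Deselect unwanted columns.
clean_ames <- raw_ames[, keep_vars]

# 2. Keep only rows with no NA in the remaining columns.
clean_ames <- clean_ames[complete.cases(clean_ames), ]

# 3. Treat overall quality as a category rather than a number.
clean_ames$OverallQual <- factor(clean_ames$OverallQual)

str(clean_ames)
```

Dropping columns before calling `complete.cases()` matters here: a mostly empty column would otherwise cause every row to be removed.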
Models: I made three single-predictor models to determine which I thought would be the best. I eliminated the third model (GarageCars) because its R-squared of 0.4193 was the lowest; the other two were higher (0.4995 for GrLivArea and 0.7023 for OverallQual) and closer to my target. For the second model I made a function to get the average price for each quality category.
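The function for the average price per category is not shown in this report, so the following is only a sketch of what such a helper might look like. The function name `average_price_by` and the toy data are assumptions; the output column names `OverallQual` and `AveragePrice` match the ones used in the bar chart code later on.

```r
# Hypothetical helper: mean sale price for each level of a grouping column.
average_price_by <- function(data, group_col, price_col = "SalePrice") {
  out <- aggregate(data[[price_col]],
                   by = list(data[[group_col]]),
                   FUN = mean)
  names(out) <- c(group_col, "AveragePrice")
  out
}

# Made-up data, for illustration only.
toy <- data.frame(
  OverallQual = c(5, 5, 6, 6, 7),
  SalePrice   = c(100000, 120000, 150000, 170000, 200000)
)

average_price <- average_price_by(toy, "OverallQual")
print(average_price)
```

The result has one row per category, which is the shape a bar chart of average prices needs.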
ames_model1 <- lm(SalePrice ~ GrLivArea, data = ames_data)
summary(ames_model1)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = ames_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -483467  -30219   -1966   22728  334323
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13289.634   3269.703   4.064 4.94e-05 ***
## GrLivArea     111.694      2.066  54.061  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared: 0.4995, Adjusted R-squared: 0.4994
## F-statistic: 2923 on 1 and 2928 DF, p-value: < 2.2e-16
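To read the first model's coefficients concretely, the fitted line can be applied by hand. The sketch below copies the intercept and slope from the output above; the 1,500 sq ft house is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary(ames_model1) output.
intercept <- 13289.634
slope     <- 111.694   # dollars per additional square foot of GrLivArea

# Hypothetical house size, chosen only for illustration.
living_area <- 1500

# Fitted line: price = intercept + slope * living area.
predicted_price <- intercept + slope * living_area
predicted_price
```

This gives a predicted price of about $180,831. With a residual standard error of 56520, individual predictions from this model can still be off by tens of thousands of dollars.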
ames_model2 <- lm(SalePrice ~ OverallQual, data = ames_data)
summary(ames_model2)
##
## Call:
## lm(formula = SalePrice ~ OverallQual, data = ames_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -290217  -23985   -2691   19339  304783
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 162130       1614 100.474  < 2e-16 ***
## OverallQualAverage          -27378       2217 -12.350  < 2e-16 ***
## OverallQualBelow_Average    -55645       3322 -16.749  < 2e-16 ***
## OverallQualExcellent        206206       4519  45.635  < 2e-16 ***
## OverallQualFair             -78944       7089 -11.136  < 2e-16 ***
## OverallQualGood              42895       2402  17.857  < 2e-16 ***
## OverallQualPoor            -109805      12216  -8.989  < 2e-16 ***
## OverallQualVery_Excellent   288087       8006  35.986  < 2e-16 ***
## OverallQualVery_Good        108783       2837  38.342  < 2e-16 ***
## OverallQualVery_Poor       -113405      21889  -5.181 2.36e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43660 on 2920 degrees of freedom
## Multiple R-squared: 0.7023, Adjusted R-squared: 0.7013
## F-statistic: 765.2 on 9 and 2920 DF, p-value: < 2.2e-16
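Because OverallQual is a factor, the second model's coefficients are best read as group averages. With R's default treatment contrasts, the intercept (162130) is the mean sale price of the omitted reference level, which is the alphabetically first category (presumably Above_Average, the only quality level without its own row). Every other coefficient is the difference from that level, so the average price for Average homes is 162130 - 27378 = 134752 and for Excellent homes 162130 + 206206 = 368336. The sketch below checks this relationship on made-up data; the data frame `toy` and its values are assumptions for illustration.

```r
# Toy data with three quality categories (values are made up).
toy <- data.frame(
  OverallQual = factor(c("Above_Average", "Above_Average",
                         "Average", "Average",
                         "Good", "Good")),
  SalePrice   = c(160000, 164000, 130000, 140000, 200000, 210000)
)

fit <- lm(SalePrice ~ OverallQual, data = toy)

# Group means computed directly.
group_means <- tapply(toy$SalePrice, toy$OverallQual, mean)

# Group means rebuilt from the coefficients:
# intercept = mean of the reference level, other terms = offsets from it.
b <- coef(fit)
rebuilt <- c(
  Above_Average = unname(b["(Intercept)"]),
  Average       = unname(b["(Intercept)"] + b["OverallQualAverage"]),
  Good          = unname(b["(Intercept)"] + b["OverallQualGood"])
)

print(group_means)
print(rebuilt)
```

The two printed vectors agree, which is why a table of average prices per category carries the same information as this model's coefficient table.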
ames_model3 <- lm(SalePrice ~ GarageCars, data = ames_data)
summary(ames_model3)
##
## Call:
## lm(formula = SalePrice ~ GarageCars, data = ames_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -274084  -36686   -6686   25306  490348
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    60753       2843   21.37   <2e-16 ***
## GarageCars     67966       1478   45.98   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60890 on 2928 degrees of freedom
## Multiple R-squared: 0.4193, Adjusted R-squared: 0.4191
## F-statistic: 2115 on 1 and 2928 DF, p-value: < 2.2e-16
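The three summaries report R-squared values of 0.4995 (GrLivArea), 0.7023 (OverallQual), and 0.4193 (GarageCars). Rather than reading them off each printout, the values can be collected into one table. The following is a sketch on simulated data, so the numbers it prints are not the Ames results; the data frame `toy` and the names `models` and `fit_stats` are assumptions.

```r
set.seed(1)  # simulated data, for illustration only
n <- 200
toy <- data.frame(
  GrLivArea  = runif(n, 600, 3000),
  GarageCars = sample(0:3, n, replace = TRUE)
)
toy$SalePrice <- 15000 + 110 * toy$GrLivArea + 20000 * toy$GarageCars +
  rnorm(n, sd = 40000)

# One single-predictor model per candidate variable.
models <- list(
  GrLivArea  = lm(SalePrice ~ GrLivArea,  data = toy),
  GarageCars = lm(SalePrice ~ GarageCars, data = toy)
)

# Pull R-squared and adjusted R-squared out of each model summary.
fit_stats <- t(sapply(models, function(m) {
  s <- summary(m)
  c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared)
}))

# Show the models from best to worst fit.
fit_stats[order(-fit_stats[, "r_squared"]), ]
```

With the real objects, the same approach would apply to `list(ames_model1, ames_model2, ames_model3)`.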
library(ggplot2)

# Scatter plot of sale price against above-grade living area, with a fitted line
ggplot(clean_ames_data, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Sale Price vs. Above Grade Living Area",
       x = "Above Grade Living Area (sq ft)",
       y = "Sale Price ($)") +
  theme_minimal()
# Bar chart of the average sale price for each overall quality category
ggplot(average_price, aes(x = factor(OverallQual), y = AveragePrice)) +
  geom_col(fill = "blue") +  # same result as geom_bar(stat = "identity")
  labs(title = "Average Sale Price by Overall Quality",
       x = "Overall Quality (1-10)",
       y = "Average Sale Price ($)") +
  theme_minimal()
# Box plot of sale price for each garage capacity
ggplot(clean_ames_data, aes(x = factor(GarageCars), y = SalePrice)) +
  geom_boxplot() +
  labs(title = "Sale Price by Number of Garage Cars",
       x = "Number of Garage Cars",
       y = "Sale Price ($)") +
  theme_minimal()