Thesis

I believe that, in the Ames data, the best indicator of sale price will be the above-grade living area in square feet (GrLivArea). Basements are nice, but overall the square footage of the actual house matters more.

Key Steps Explained

  1. Data Cleanup: To clean up the data, I filtered out rows with NA values in the variables I was interested in, converted overall quality into a factor, and deselected the variables I knew I would not need.
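
The cleanup described in step 1 is not shown in this report, so the following is only a minimal sketch of what it might look like. It uses base R so it runs without extra packages (the original may well have used dplyr's filter, mutate, and select instead). The data frame `ames_raw_toy`, its made-up rows, the `PoolQC` column, and the 1-to-10 ordering of `qual_labels` are assumptions for illustration; only the label names themselves come from the model output.

```r
# Made-up stand-in for the raw data; the real Ames file has 2,930 rows and many more columns.
ames_raw_toy <- data.frame(
  SalePrice   = c(215000, 105000, 172000, NA, 189900),
  GrLivArea   = c(1656, 896, 1329, 2110, NA),
  OverallQual = c(6, 5, 6, 7, 5),
  GarageCars  = c(2, 1, 1, 2, 2),
  PoolQC      = c(NA, NA, NA, NA, NA)  # hypothetical example of an unwanted column
)

# Assumed mapping from the numeric 1-10 rating to the labels seen in the model output.
qual_labels <- c("Very_Poor", "Poor", "Fair", "Below_Average", "Average",
                 "Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent")

# Step A: keep only rows with no NA in the variables of interest.
vars_of_interest <- c("SalePrice", "GrLivArea", "OverallQual", "GarageCars")
keep_rows <- complete.cases(ames_raw_toy[, vars_of_interest])

# Step B: drop unwanted columns by keeping only the variables of interest.
clean_toy <- ames_raw_toy[keep_rows, vars_of_interest]

# Step C: convert overall quality into a factor (levels default to alphabetical order).
clean_toy$OverallQual <- factor(qual_labels[clean_toy$OverallQual])

str(clean_toy)
```

Because factor levels default to alphabetical order, Above_Average becomes the reference level, which would be consistent with it being the one category missing from the coefficient table for the second model.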

  2. Models: I made three single-predictor models to determine which one I thought would be the best. I eliminated the third model (GarageCars) because its R-squared was the lowest of the three (0.4193, versus 0.4995 and 0.7023) and the furthest from my target. For the second model I made a function to get the average price for each quality category.
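
The function for the per-category average price is not shown in this report. A minimal base R sketch of one way to write it is below. The helper name `average_price_by` and the toy data are hypothetical; the output column names `OverallQual` and `AveragePrice` are chosen to match the ones used in the bar chart code later on.

```r
# Hypothetical helper: mean sale price for each level of a grouping column.
average_price_by <- function(data, group_col) {
  out <- aggregate(data$SalePrice, by = list(data[[group_col]]), FUN = mean)
  names(out) <- c(group_col, "AveragePrice")
  out
}

# Made-up rows for illustration only.
toy <- data.frame(
  SalePrice   = c(100000, 120000, 200000, 220000, 300000),
  OverallQual = c("Average", "Average", "Good", "Good", "Excellent")
)

average_price <- average_price_by(toy, "OverallQual")
print(average_price)
```

On the real data the call would presumably be something like `average_price_by(clean_ames_data, "OverallQual")`, assuming the cleaned data frame has both columns.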

ames_model1 <- lm(SalePrice ~  GrLivArea, data = ames_data)
summary(ames_model1)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = ames_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483467  -30219   -1966   22728  334323 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13289.634   3269.703   4.064 4.94e-05 ***
## GrLivArea     111.694      2.066  54.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared:  0.4995, Adjusted R-squared:  0.4994 
## F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16
ames_model2 <- lm(SalePrice ~  OverallQual, data = ames_data)
summary(ames_model2)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual, data = ames_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -290217  -23985   -2691   19339  304783 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 162130       1614 100.474  < 2e-16 ***
## OverallQualAverage          -27378       2217 -12.350  < 2e-16 ***
## OverallQualBelow_Average    -55645       3322 -16.749  < 2e-16 ***
## OverallQualExcellent        206206       4519  45.635  < 2e-16 ***
## OverallQualFair             -78944       7089 -11.136  < 2e-16 ***
## OverallQualGood              42895       2402  17.857  < 2e-16 ***
## OverallQualPoor            -109805      12216  -8.989  < 2e-16 ***
## OverallQualVery_Excellent   288087       8006  35.986  < 2e-16 ***
## OverallQualVery_Good        108783       2837  38.342  < 2e-16 ***
## OverallQualVery_Poor       -113405      21889  -5.181 2.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43660 on 2920 degrees of freedom
## Multiple R-squared:  0.7023, Adjusted R-squared:  0.7013 
## F-statistic: 765.2 on 9 and 2920 DF,  p-value: < 2.2e-16
ames_model3 <- lm(SalePrice ~  GarageCars, data = ames_data)
summary(ames_model3)
## 
## Call:
## lm(formula = SalePrice ~ GarageCars, data = ames_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -274084  -36686   -6686   25306  490348 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    60753       2843   21.37   <2e-16 ***
## GarageCars     67966       1478   45.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60890 on 2928 degrees of freedom
## Multiple R-squared:  0.4193, Adjusted R-squared:  0.4191 
## F-statistic:  2115 on 1 and 2928 DF,  p-value: < 2.2e-16
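
Rather than reading the R-squared values off three separate summaries, they can be pulled into one table. The sketch below shows the idea on simulated toy data, so the numbers it prints are not the Ames results; the data frame `toy` and the table name `r2_table` are assumptions for illustration.

```r
set.seed(1)  # toy data only, not the Ames data
n <- 200
toy <- data.frame(
  GrLivArea   = runif(n, 600, 3000),
  GarageCars  = sample(0:3, n, replace = TRUE),
  OverallQual = factor(sample(c("Average", "Good", "Excellent"), n, replace = TRUE))
)
toy$SalePrice <- 50000 + 100 * toy$GrLivArea + 20000 * toy$GarageCars +
  rnorm(n, sd = 40000)

# One single-predictor model per candidate variable, as in the report.
models <- list(
  GrLivArea   = lm(SalePrice ~ GrLivArea, data = toy),
  OverallQual = lm(SalePrice ~ OverallQual, data = toy),
  GarageCars  = lm(SalePrice ~ GarageCars, data = toy)
)

# Collect multiple and adjusted R-squared from each model summary.
r2_table <- data.frame(
  predictor     = names(models),
  r_squared     = sapply(models, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names     = NULL
)

# Sort so the strongest single predictor comes first.
r2_table[order(-r2_table$r_squared), ]
```

With the real models, the list would be `list(GrLivArea = ames_model1, OverallQual = ames_model2, GarageCars = ames_model3)`. Comparing adjusted R-squared is a little fairer here because the OverallQual model estimates nine coefficients while the other two estimate one each.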
  3. Graphs: I made three graphs, one for each model, to see which one I liked the most. For model one I used a scatter plot with a geom_smooth regression line to show the trend. For model two I used a bar chart of the average price for each quality rating, and for model three I used a box plot.
ggplot(clean_ames_data, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.5) +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Sale Price vs. Above Grade Living Area",
       x = "Above Grade Living Area (sq ft)",
       y = "Sale Price ($)") +
  theme_minimal()

ggplot(average_price, aes(x = factor(OverallQual), y = AveragePrice)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Average Sale Price by Overall Quality",
       x = "Overall Quality (1-10)",
       y = "Average Sale Price ($)") +
  theme_minimal()

ggplot(clean_ames_data, aes(x = factor(GarageCars), y = SalePrice)) +
  geom_boxplot() +
  labs(title = "Sale Price by Number of Garage Cars",
       x = "Number of Garage Cars",
       y = "Sale Price ($)") +
  theme_minimal()

  4. Conclusion: Overall, I found that my thesis was incorrect. The adjusted R-squared for my first model, which compared sale price and above-grade living area, was only 0.4994, while model two, which compared sale price and overall quality, had a much stronger adjusted R-squared of 0.7013 (multiple R-squared 0.7023). I therefore believe that overall quality is the best indicator of sale price among the three variables I tested.
