R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

This exercise is under construction. Please report any errors at https://forms.gle/2W4tffs4YJA1jeBv9

Goal: Understand and experience outlier detection techniques in action.

Background: The data for this question has been adapted from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. Please review information at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview before you get started.

Before starting: 1. You are not allowed to search for solutions to this assignment. 2. You are allowed to search for information about packages and functions that can help you.

Individual assignment only: 70 total points (Rmd and html solution) Team assignment: 20 points (written analysis)

[1 point] Q1.

Start by entering your name and today’s date in Lines 3 and 4, respectively, to agree to the Fuqua Honor Code. Then, run the chunk of code below by clicking on the green arrow (that points to the right) on the top right of the chunk. Tip: I numbered the code chunks to match their question numbers. Chunk 1 specifies the knitting parameters.

[4 points] Q2.

Read and store the data from the file PricesBefore2009.csv into a variable called before2009. Then, inspect the data. Rubric: 1 point each for reading and storing; 1 point each for using 2 R commands for inspecting. Tip: I recommend using the read_csv() function from the tidyverse package to read and store data for this and all subsequent assignments.

## [4 points] Q2.
# Install and load the tidyverse package if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the tidyverse package
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read and store the data from PricesBefore2009.csv
before2009 <- read_csv("PricesBefore2009.csv")
## New names:
## • `` -> `...1`
## Rows: 1933 Columns: 82
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (39): ...1, Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect the data using head() and str() commands
head(before2009)
## # A tibble: 6 × 82
##    ...1    Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
##   <dbl> <dbl>      <dbl> <chr>          <dbl>   <dbl> <chr>  <chr> <chr>   
## 1     1     1         60 RL                65    8450 Pave   <NA>  Reg     
## 2     2     2         20 RL                80    9600 Pave   <NA>  Reg     
## 3     3     3         60 RL                68   11250 Pave   <NA>  IR1     
## 4     4     4         70 RL                60    9550 Pave   <NA>  IR1     
## 5     5     5         60 RL                84   14260 Pave   <NA>  IR1     
## 6     6     7         20 RL                75   10084 Pave   <NA>  Reg     
## # ℹ 73 more variables: LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## #   LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## #   BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## #   YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## #   Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>,
## #   ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <chr>,
## #   BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
str(before2009)
## spc_tbl_ [1,933 × 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1         : num [1:1933] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Id           : num [1:1933] 1 2 3 4 5 7 9 10 11 12 ...
##  $ MSSubClass   : num [1:1933] 60 20 60 70 60 20 50 190 20 60 ...
##  $ MSZoning     : chr [1:1933] "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : num [1:1933] 65 80 68 60 84 75 51 50 70 85 ...
##  $ LotArea      : num [1:1933] 8450 9600 11250 9550 14260 ...
##  $ Street       : chr [1:1933] "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr [1:1933] NA NA NA NA ...
##  $ LotShape     : chr [1:1933] "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr [1:1933] "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr [1:1933] "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr [1:1933] "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr [1:1933] "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr [1:1933] "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr [1:1933] "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr [1:1933] "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr [1:1933] "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr [1:1933] "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : num [1:1933] 7 6 7 7 8 8 7 5 5 9 ...
##  $ OverallCond  : num [1:1933] 5 8 5 5 5 5 5 6 5 5 ...
##  $ YearBuilt    : num [1:1933] 2003 1976 2001 1915 2000 ...
##  $ YearRemodAdd : num [1:1933] 2003 1976 2002 1970 2000 ...
##  $ RoofStyle    : chr [1:1933] "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr [1:1933] "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr [1:1933] "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr [1:1933] "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr [1:1933] "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : num [1:1933] 196 0 162 0 350 186 0 0 0 286 ...
##  $ ExterQual    : chr [1:1933] "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr [1:1933] "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr [1:1933] "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr [1:1933] "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr [1:1933] "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr [1:1933] "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr [1:1933] "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : num [1:1933] 706 978 486 216 655 ...
##  $ BsmtFinType2 : chr [1:1933] "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BsmtUnfSF    : num [1:1933] 150 284 434 540 490 317 952 140 134 177 ...
##  $ TotalBsmtSF  : num [1:1933] 856 1262 920 756 1145 ...
##  $ Heating      : chr [1:1933] "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr [1:1933] "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr [1:1933] "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr [1:1933] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : num [1:1933] 856 1262 920 961 1145 ...
##  $ X2ndFlrSF    : num [1:1933] 854 0 866 756 1053 ...
##  $ LowQualFinSF : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : num [1:1933] 1710 1262 1786 1717 2198 ...
##  $ BsmtFullBath : num [1:1933] 1 0 1 1 1 1 0 1 1 1 ...
##  $ BsmtHalfBath : num [1:1933] 0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : num [1:1933] 2 2 2 1 2 2 2 1 1 3 ...
##  $ HalfBath     : num [1:1933] 1 0 1 0 1 0 0 0 0 0 ...
##  $ BedroomAbvGr : num [1:1933] 3 3 3 3 4 3 2 2 3 4 ...
##  $ KitchenAbvGr : num [1:1933] 1 1 1 1 1 1 2 2 1 1 ...
##  $ KitchenQual  : chr [1:1933] "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : num [1:1933] 8 6 6 7 9 7 8 5 5 11 ...
##  $ Functional   : chr [1:1933] "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : num [1:1933] 0 1 1 1 1 1 2 2 0 2 ...
##  $ FireplaceQu  : chr [1:1933] NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr [1:1933] "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : num [1:1933] 2003 1976 2001 1998 2000 ...
##  $ GarageFinish : chr [1:1933] "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : num [1:1933] 2 2 2 3 3 2 2 1 1 3 ...
##  $ GarageArea   : num [1:1933] 548 460 608 642 836 636 468 205 384 736 ...
##  $ GarageQual   : chr [1:1933] "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr [1:1933] "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr [1:1933] "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : num [1:1933] 0 298 0 0 192 255 90 0 0 147 ...
##  $ OpenPorchSF  : num [1:1933] 61 0 42 35 84 57 0 4 0 21 ...
##  $ EnclosedPorch: num [1:1933] 0 0 0 272 0 0 205 0 0 0 ...
##  $ X3SsnPorch   : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ScreenPorch  : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr [1:1933] NA NA NA NA ...
##  $ Fence        : chr [1:1933] NA NA NA NA ...
##  $ MiscFeature  : chr [1:1933] NA NA NA NA ...
##  $ MiscVal      : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
##  $ MoSold       : num [1:1933] 2 5 9 2 12 8 4 1 2 7 ...
##  $ YrSold       : num [1:1933] 2008 2007 2008 2006 2008 ...
##  $ SaleType     : chr [1:1933] "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr [1:1933] "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : num [1:1933] 208500 181500 223500 140000 250000 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   Id = col_double(),
##   ..   MSSubClass = col_double(),
##   ..   MSZoning = col_character(),
##   ..   LotFrontage = col_double(),
##   ..   LotArea = col_double(),
##   ..   Street = col_character(),
##   ..   Alley = col_character(),
##   ..   LotShape = col_character(),
##   ..   LandContour = col_character(),
##   ..   Utilities = col_character(),
##   ..   LotConfig = col_character(),
##   ..   LandSlope = col_character(),
##   ..   Neighborhood = col_character(),
##   ..   Condition1 = col_character(),
##   ..   Condition2 = col_character(),
##   ..   BldgType = col_character(),
##   ..   HouseStyle = col_character(),
##   ..   OverallQual = col_double(),
##   ..   OverallCond = col_double(),
##   ..   YearBuilt = col_double(),
##   ..   YearRemodAdd = col_double(),
##   ..   RoofStyle = col_character(),
##   ..   RoofMatl = col_character(),
##   ..   Exterior1st = col_character(),
##   ..   Exterior2nd = col_character(),
##   ..   MasVnrType = col_character(),
##   ..   MasVnrArea = col_double(),
##   ..   ExterQual = col_character(),
##   ..   ExterCond = col_character(),
##   ..   Foundation = col_character(),
##   ..   BsmtQual = col_character(),
##   ..   BsmtCond = col_character(),
##   ..   BsmtExposure = col_character(),
##   ..   BsmtFinType1 = col_character(),
##   ..   BsmtFinSF1 = col_double(),
##   ..   BsmtFinType2 = col_character(),
##   ..   BsmtFinSF2 = col_double(),
##   ..   BsmtUnfSF = col_double(),
##   ..   TotalBsmtSF = col_double(),
##   ..   Heating = col_character(),
##   ..   HeatingQC = col_character(),
##   ..   CentralAir = col_character(),
##   ..   Electrical = col_character(),
##   ..   X1stFlrSF = col_double(),
##   ..   X2ndFlrSF = col_double(),
##   ..   LowQualFinSF = col_double(),
##   ..   GrLivArea = col_double(),
##   ..   BsmtFullBath = col_double(),
##   ..   BsmtHalfBath = col_double(),
##   ..   FullBath = col_double(),
##   ..   HalfBath = col_double(),
##   ..   BedroomAbvGr = col_double(),
##   ..   KitchenAbvGr = col_double(),
##   ..   KitchenQual = col_character(),
##   ..   TotRmsAbvGrd = col_double(),
##   ..   Functional = col_character(),
##   ..   Fireplaces = col_double(),
##   ..   FireplaceQu = col_character(),
##   ..   GarageType = col_character(),
##   ..   GarageYrBlt = col_double(),
##   ..   GarageFinish = col_character(),
##   ..   GarageCars = col_double(),
##   ..   GarageArea = col_double(),
##   ..   GarageQual = col_character(),
##   ..   GarageCond = col_character(),
##   ..   PavedDrive = col_character(),
##   ..   WoodDeckSF = col_double(),
##   ..   OpenPorchSF = col_double(),
##   ..   EnclosedPorch = col_double(),
##   ..   X3SsnPorch = col_double(),
##   ..   ScreenPorch = col_double(),
##   ..   PoolArea = col_double(),
##   ..   PoolQC = col_character(),
##   ..   Fence = col_character(),
##   ..   MiscFeature = col_character(),
##   ..   MiscVal = col_double(),
##   ..   MoSold = col_double(),
##   ..   YrSold = col_double(),
##   ..   SaleType = col_character(),
##   ..   SaleCondition = col_character(),
##   ..   SalePrice = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
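The `...1` column and the column-specification message in the output above can be reproduced on a small scale. The sketch below is only an illustration. The two-row CSV text is made up, and it assumes readr 2.0 or later, where an empty header cell is repaired to `...1` (older versions named it X1, which is the name the later questions use).

```r
library(readr)

# Hypothetical two-row CSV whose first header cell is empty,
# as happens when row names are written to the file.
csv_text <- ",Id,SalePrice\n1,1,208500\n2,2,181500\n"

# I() marks the string as literal data rather than a file path;
# show_col_types = FALSE quiets the column specification message.
toy <- read_csv(I(csv_text), show_col_types = FALSE)

names(toy)   # the unnamed first column is repaired to "...1"
dim(toy)     # another quick inspection command alongside head() and str()
```

On the real file, the same `show_col_types = FALSE` argument should shorten the knitted output without changing the data that is read.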

[4 points] Q3.

Convert the following columns to character or factor type: MSSubClass, OverallQual, OverallCond. Then, inspect the result to verify that your code works. Tip: You can refer to a column in before2009 as before2009$colName or before2009[, "colName"] or before2009[["colName"]]. Tip: You can use as.character() or factor(). Tip: You can print multiple columns using summary(before2009[, c("colName1", "colName2", "colName3")]). Rubric: 3 points (1 point each) for conversion and 1 point for verification.

# Convert columns to character or factor type
before2009$MSSubClass <- as.character(before2009$MSSubClass)
before2009$OverallQual <- as.factor(before2009$OverallQual)
before2009$OverallCond <- as.factor(before2009$OverallCond)

# Verify the conversion
summary(before2009[, c("MSSubClass", "OverallQual", "OverallCond")])
##   MSSubClass         OverallQual   OverallCond  
##  Length:1933        5      :554   5      :1094  
##  Class :character   6      :485   6      : 354  
##  Mode  :character   7      :395   7      : 260  
##                     8      :233   8      :  94  
##                     4      :141   4      :  69  
##                     9      : 68   3      :  30  
##                     (Other): 57   (Other):  32
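The three conversions can also be written as one step. The sketch below is one possible alternative, not the required solution. The data frame `toy` and the vector `codedCols` are hypothetical stand-ins introduced for illustration.

```r
library(dplyr)

# Hypothetical stand-in for before2009 with the three coded columns.
toy <- tibble(
  MSSubClass  = c(60, 20, 60),
  OverallQual = c(7, 6, 7),
  OverallCond = c(5, 8, 5),
  LotArea     = c(8450, 9600, 11250)
)

codedCols <- c("MSSubClass", "OverallQual", "OverallCond")

# Apply as.factor() to every column named in codedCols; LotArea stays numeric.
toy <- toy %>% mutate(across(all_of(codedCols), as.factor))

# Verify: each converted column should now be a factor.
sapply(toy[codedCols], is.factor)
```

Keeping the column names in one vector makes it easy to reuse the same list for the verification step.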

[7 points] Q4.

How many NAs does each column have? Display your answer as a dataframe (or tibble) called beforeNAs. The dataset beforeNAs should contain two columns, one containing the names of the columns of before2009, and the other containing the number of NAs in each column. Then, print only the first 10 (head) rows of this dataframe to verify that your code worked. Tip: See what as_tibble(map(before2009, ~sum(is.na(.)))) does for you. Rubric: 6 points for constructing beforeNAs and 1 point for verification.

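# Count the NAs in every column, then transpose so each column name becomes a row
temp <- map(before2009, ~sum(is.na(.))) %>% as_tibble() %>% t()
# Build the two-column summary: column name and its NA count
beforeNAs <- tibble(Columns = rownames(temp), NAs = temp[, 1])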
beforeNAs %>% head(10)
## # A tibble: 10 × 2
##    Columns       NAs
##    <chr>       <int>
##  1 ...1            0
##  2 Id              0
##  3 MSSubClass      0
##  4 MSZoning        3
##  5 LotFrontage   317
##  6 LotArea         0
##  7 Street          0
##  8 Alley        1797
##  9 LotShape        0
## 10 LandContour     0
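A shorter route to the same kind of table is sketched below, assuming a small made-up data frame `toy` in place of before2009. It relies on `is.na()` returning a logical matrix for a data frame, so `colSums()` can count the missing values per column directly.

```r
library(tibble)

# Hypothetical data with a known number of NAs per column.
toy <- tibble(
  Id          = c(1, 2, 3, 4),
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, NA, "Pave")
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
# unname() drops the names so the NAs column is a plain vector.
toyNAs <- tibble(
  Columns = names(toy),
  NAs     = unname(colSums(is.na(toy)))
)

toyNAs
```

Unlike the transposed version, this keeps the counts unnamed, which avoids the `Named int` column that `str()` would otherwise report.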

[9 points] Q5.

Drop (remove) all the columns (except SalePrice) that have 20 or more missing values. Also, drop (remove) the columns called X1 (named ...1 by recent versions of read_csv(), as in the output above), Id, and Utilities (all its values are the same). While some of the columns we drop here may contribute to the predictive accuracy of our model, the majority of the information will be contained in the remaining variables. Then, print only the first 10 (head) rows of this dataframe to verify that your code worked. Tip: You can put the names of all the columns to be dropped into a vector called dropCols (based on NAs >= 20 and the other conditions above). Then, you can call dplyr::select(before2009, -dropCols) to exclude all columns in dropCols. Rubric: 8 points for dropping the columns and 1 point for verification.

# Inspect the structure of beforeNAs before building the drop list
str(beforeNAs)
## tibble [82 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Columns: chr [1:82] "...1" "Id" "MSSubClass" "MSZoning" ...
##  $ NAs    : Named int [1:82] 0 0 0 3 317 0 0 1797 0 0 ...
##   ..- attr(*, "names")= chr [1:82] "...1" "Id" "MSSubClass" "MSZoning" ...
# Create a vector of column names with 20 or more NAs
dropCols <- beforeNAs$Columns[beforeNAs$NAs >= 20]

# Drop those columns; naming SalePrice afterwards re-selects it so the response is always kept
before2009 <- before2009 %>%
  select(-any_of(dropCols), SalePrice)
# Drop Id, Utilities, and the first column (the unnamed index column, ...1)
before2009 <- before2009 %>%
  select(-Id, -Utilities, -1)
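The drop logic can be checked on a tiny example before trusting it on 82 columns. The sketch below is illustrative only. The data frame `toy`, the threshold of 2, and the column X1 standing in for the index column are all assumptions chosen to keep the example small.

```r
library(dplyr)

# Hypothetical stand-in for before2009; X1 plays the role of the index column.
toy <- tibble(
  X1        = 1:4,
  Id        = 1:4,
  Utilities = "AllPub",
  Alley     = c(NA, NA, NA, "Pave"),
  LotArea   = c(8450, 9600, NA, 9550),
  SalePrice = c(208500, NA, NA, 140000)
)

naCounts  <- colSums(is.na(toy))
threshold <- 2   # the assignment uses 20; 2 suits this tiny example

# Columns at or above the threshold, excluding the response, plus the fixed drops.
dropCols <- setdiff(names(naCounts)[naCounts >= threshold], "SalePrice")
dropCols <- union(dropCols, c("X1", "Id", "Utilities"))

# all_of() states that dropCols holds column names, not a column called dropCols.
toySmall <- toy %>% select(-all_of(dropCols))

# Verification step: print the first rows of the reduced data.
head(toySmall, 10)
```

Removing SalePrice from the drop list up front, rather than re-selecting it afterwards, also leaves the response in its original column position.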

[5 points] Q6.

Conduct a multiple linear regression on all the variables. Set SalePrice as the response and store the results in regBefore2009. Then, print the summary of regBefore2009 to verify that your code works. Tip: The formula for regression is lm(SalePrice ~ ., data = before2009). Rubric: 4 points for setting regBefore2009 and 1 point for verification.

# Fit multiple linear regression
regBefore2009 <- lm(SalePrice ~ ., data = before2009)

# Print the summary
summary(regBefore2009)
## 
## Call:
## lm(formula = SalePrice ~ ., data = before2009)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -178924   -4505     -76    4002  157196 
## 
## Coefficients: (3 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.572e+06  9.826e+05  -2.618 0.008929 ** 
## MSSubClass150         1.061e+04  1.760e+04   0.603 0.546766    
## MSSubClass160         2.518e+03  5.629e+03   0.447 0.654737    
## MSSubClass180         2.272e+03  7.209e+03   0.315 0.752622    
## MSSubClass190        -5.529e+02  1.459e+04  -0.038 0.969767    
## MSSubClass20          8.063e+03  7.107e+03   1.135 0.256704    
## MSSubClass30          7.803e+03  7.565e+03   1.031 0.302466    
## MSSubClass40          7.856e+03  1.115e+04   0.705 0.481006    
## MSSubClass45          1.458e+04  1.382e+04   1.055 0.291381    
## MSSubClass50          1.195e+04  8.571e+03   1.395 0.163310    
## MSSubClass60          9.306e+03  8.412e+03   1.106 0.268783    
## MSSubClass70          1.163e+04  8.551e+03   1.360 0.173888    
## MSSubClass75          8.722e+03  1.040e+04   0.839 0.401607    
## MSSubClass80         -3.468e+03  1.022e+04  -0.339 0.734344    
## MSSubClass85          5.338e+02  9.204e+03   0.058 0.953758    
## MSSubClass90         -4.085e+03  8.549e+03  -0.478 0.632780    
## MSZoningFV            4.095e+04  6.833e+03   5.993 2.51e-09 ***
## MSZoningRH            2.490e+04  6.983e+03   3.567 0.000372 ***
## MSZoningRL            3.009e+04  5.710e+03   5.269 1.55e-07 ***
## MSZoningRM            3.019e+04  5.331e+03   5.663 1.74e-08 ***
## LotArea               5.643e-01  7.679e-02   7.348 3.13e-13 ***
## StreetPave            2.965e+04  6.601e+03   4.492 7.54e-06 ***
## LotShapeIR2           4.777e+03  2.431e+03   1.965 0.049604 *  
## LotShapeIR3           7.886e+03  5.038e+03   1.565 0.117678    
## LotShapeReg           5.291e+02  9.504e+02   0.557 0.577754    
## LandContourHLS        1.283e+04  2.912e+03   4.406 1.12e-05 ***
## LandContourLow        8.402e+02  4.041e+03   0.208 0.835341    
## LandContourLvl        1.041e+04  2.160e+03   4.818 1.58e-06 ***
## LotConfigCulDSac      6.820e+03  1.939e+03   3.517 0.000449 ***
## LotConfigFR2         -7.573e+03  2.562e+03  -2.956 0.003161 ** 
## LotConfigFR3         -1.235e+04  5.077e+03  -2.432 0.015109 *  
## LotConfigInside      -3.084e+03  1.062e+03  -2.904 0.003728 ** 
## LandSlopeMod          1.175e+04  2.376e+03   4.943 8.46e-07 ***
## LandSlopeSev         -2.130e+04  7.277e+03  -2.926 0.003476 ** 
## NeighborhoodBlueste  -1.081e+04  9.673e+03  -1.118 0.263830    
## NeighborhoodBrDale   -2.852e+03  6.729e+03  -0.424 0.671753    
## NeighborhoodBrkSide  -1.128e+04  5.441e+03  -2.073 0.038294 *  
## NeighborhoodClearCr  -2.081e+04  5.730e+03  -3.631 0.000290 ***
## NeighborhoodCollgCr  -1.806e+04  4.275e+03  -4.224 2.53e-05 ***
## NeighborhoodCrawfor   3.828e+03  4.936e+03   0.775 0.438221    
## NeighborhoodEdwards  -2.614e+04  4.686e+03  -5.578 2.83e-08 ***
## NeighborhoodGilbert  -1.959e+04  4.545e+03  -4.312 1.71e-05 ***
## NeighborhoodIDOTRR   -1.865e+04  5.904e+03  -3.159 0.001613 ** 
## NeighborhoodMeadowV  -2.288e+04  6.805e+03  -3.362 0.000792 ***
## NeighborhoodMitchel  -2.956e+04  4.753e+03  -6.220 6.26e-10 ***
## NeighborhoodNAmes    -2.254e+04  4.574e+03  -4.929 9.09e-07 ***
## NeighborhoodNoRidge   1.408e+04  5.029e+03   2.799 0.005181 ** 
## NeighborhoodNPkVill   5.527e+03  1.022e+04   0.541 0.588839    
## NeighborhoodNridgHt   1.329e+04  4.439e+03   2.994 0.002796 ** 
## NeighborhoodNWAmes   -2.666e+04  4.712e+03  -5.659 1.79e-08 ***
## NeighborhoodOldTown  -2.330e+04  5.415e+03  -4.303 1.78e-05 ***
## NeighborhoodSawyer   -1.786e+04  4.745e+03  -3.763 0.000174 ***
## NeighborhoodSawyerW  -1.319e+04  4.649e+03  -2.838 0.004596 ** 
## NeighborhoodSomerst  -1.647e+04  5.199e+03  -3.169 0.001559 ** 
## NeighborhoodStoneBr   2.869e+04  5.073e+03   5.656 1.81e-08 ***
## NeighborhoodSWISU    -1.355e+04  5.847e+03  -2.318 0.020557 *  
## NeighborhoodTimber   -1.265e+04  4.795e+03  -2.638 0.008427 ** 
## NeighborhoodVeenker  -4.591e+03  5.740e+03  -0.800 0.423928    
## Condition1Feedr       2.806e+03  2.954e+03   0.950 0.342241    
## Condition1Norm        1.340e+04  2.470e+03   5.425 6.62e-08 ***
## Condition1PosA        1.274e+04  5.569e+03   2.288 0.022259 *  
## Condition1PosN        7.585e+03  4.569e+03   1.660 0.097089 .  
## Condition1RRAe       -1.425e+04  4.652e+03  -3.063 0.002226 ** 
## Condition1RRAn        1.101e+04  3.895e+03   2.826 0.004765 ** 
## Condition1RRNe       -1.495e+03  8.783e+03  -0.170 0.864886    
## Condition1RRNn        6.815e+03  9.224e+03   0.739 0.460134    
## Condition2Feedr      -8.786e+03  1.040e+04  -0.845 0.398354    
## Condition2Norm       -3.235e+03  9.104e+03  -0.355 0.722366    
## Condition2PosA       -7.289e+03  1.437e+04  -0.507 0.612077    
## Condition2PosN       -2.439e+05  1.368e+04 -17.834  < 2e-16 ***
## Condition2RRAe       -1.074e+05  2.368e+04  -4.537 6.11e-06 ***
## Condition2RRAn       -7.330e+03  1.870e+04  -0.392 0.695047    
## Condition2RRNn       -3.959e+02  1.467e+04  -0.027 0.978476    
## BldgType2fmCon       -7.442e+02  1.277e+04  -0.058 0.953546    
## BldgTypeDuplex               NA         NA      NA       NA    
## BldgTypeTwnhs        -1.811e+04  7.659e+03  -2.365 0.018163 *  
## BldgTypeTwnhsE       -1.752e+04  7.082e+03  -2.474 0.013460 *  
## HouseStyle1.5Unf      7.953e+03  1.114e+04   0.714 0.475592    
## HouseStyle1Story      1.086e+04  4.960e+03   2.190 0.028633 *  
## HouseStyle2.5Fin     -1.193e+04  1.025e+04  -1.164 0.244636    
## HouseStyle2.5Unf     -8.651e+03  7.330e+03  -1.180 0.238061    
## HouseStyle2Story     -6.055e+03  4.809e+03  -1.259 0.208150    
## HouseStyleSFoyer      1.538e+04  6.497e+03   2.367 0.018057 *  
## HouseStyleSLvl        1.827e+04  7.821e+03   2.336 0.019626 *  
## OverallQual2          2.961e+04  1.988e+04   1.490 0.136484    
## OverallQual3          3.496e+04  1.859e+04   1.880 0.060248 .  
## OverallQual4          3.600e+04  1.847e+04   1.949 0.051465 .  
## OverallQual5          4.016e+04  1.853e+04   2.167 0.030388 *  
## OverallQual6          4.530e+04  1.858e+04   2.438 0.014872 *  
## OverallQual7          5.296e+04  1.860e+04   2.847 0.004468 ** 
## OverallQual8          6.663e+04  1.867e+04   3.569 0.000368 ***
## OverallQual9          8.783e+04  1.890e+04   4.647 3.63e-06 ***
## OverallQual10         1.369e+05  1.948e+04   7.028 3.03e-12 ***
## OverallCond2          1.315e+04  2.332e+04   0.564 0.572872    
## OverallCond3          2.004e+04  1.342e+04   1.494 0.135405    
## OverallCond4          2.603e+04  1.336e+04   1.948 0.051607 .  
## OverallCond5          3.390e+04  1.336e+04   2.536 0.011290 *  
## OverallCond6          4.023e+04  1.342e+04   2.998 0.002759 ** 
## OverallCond7          4.572e+04  1.345e+04   3.400 0.000690 ***
## OverallCond8          5.177e+04  1.350e+04   3.836 0.000130 ***
## OverallCond9          6.002e+04  1.400e+04   4.288 1.90e-05 ***
## YearBuilt             3.595e+02  4.731e+01   7.599 4.92e-14 ***
## YearRemodAdd          1.055e+02  3.212e+01   3.286 0.001038 ** 
## RoofStyleGable       -4.815e+03  8.656e+03  -0.556 0.578143    
## RoofStyleGambrel     -2.186e+03  9.745e+03  -0.224 0.822557    
## RoofStyleHip         -3.696e+03  8.704e+03  -0.425 0.671165    
## RoofStyleMansard      5.986e+03  1.100e+04   0.544 0.586339    
## RoofStyleShed         7.902e+04  1.515e+04   5.215 2.06e-07 ***
## RoofMatlCompShg       6.617e+05  2.022e+04  32.718  < 2e-16 ***
## RoofMatlMembran       7.396e+05  2.887e+04  25.620  < 2e-16 ***
## RoofMatlMetal         6.992e+05  2.872e+04  24.346  < 2e-16 ***
## RoofMatlRoll          6.538e+05  2.634e+04  24.818  < 2e-16 ***
## RoofMatlTar&Grv       6.672e+05  2.175e+04  30.676  < 2e-16 ***
## RoofMatlWdShake       6.437e+05  2.152e+04  29.911  < 2e-16 ***
## RoofMatlWdShngl       7.432e+05  2.132e+04  34.854  < 2e-16 ***
## Exterior1stAsphShn   -1.850e+04  2.276e+04  -0.813 0.416323    
## Exterior1stBrkComm   -7.338e+03  1.387e+04  -0.529 0.596760    
## Exterior1stBrkFace    7.740e+03  7.342e+03   1.054 0.291930    
## Exterior1stCemntBd   -1.151e+04  1.224e+04  -0.940 0.347353    
## Exterior1stHdBoard   -1.014e+04  7.086e+03  -1.431 0.152611    
## Exterior1stImStucc   -6.986e+04  1.799e+04  -3.884 0.000107 ***
## Exterior1stMetalSd    1.209e+03  7.963e+03   0.152 0.879352    
## Exterior1stPlywood   -1.547e+04  6.952e+03  -2.226 0.026170 *  
## Exterior1stStone     -2.641e+04  1.549e+04  -1.705 0.088463 .  
## Exterior1stStucco    -4.405e+03  8.089e+03  -0.544 0.586173    
## Exterior1stVinylSd   -1.611e+04  8.066e+03  -1.997 0.045966 *  
## Exterior1stWd Sdng   -8.627e+03  6.955e+03  -1.240 0.215021    
## Exterior1stWdShing   -3.086e+03  7.381e+03  -0.418 0.675971    
## Exterior2ndAsphShn    2.238e+03  1.431e+04   0.156 0.875675    
## Exterior2ndBrk Cmn    4.508e+03  1.355e+04   0.333 0.739463    
## Exterior2ndBrkFace   -3.269e+02  8.265e+03  -0.040 0.968454    
## Exterior2ndCmentBd    1.024e+04  1.261e+04   0.812 0.417099    
## Exterior2ndHdBoard    1.608e+03  7.571e+03   0.212 0.831880    
## Exterior2ndImStucc    3.464e+04  8.906e+03   3.890 0.000104 ***
## Exterior2ndMetalSd   -3.830e+03  8.357e+03  -0.458 0.646778    
## Exterior2ndOther     -1.006e+04  1.825e+04  -0.551 0.581676    
## Exterior2ndPlywood    3.119e+03  7.283e+03   0.428 0.668547    
## Exterior2ndStone      1.465e+04  1.425e+04   1.029 0.303846    
## Exterior2ndStucco    -2.973e+03  8.510e+03  -0.349 0.726834    
## Exterior2ndVinylSd    1.104e+04  8.440e+03   1.308 0.191016    
## Exterior2ndWd Sdng    4.350e+03  7.496e+03   0.580 0.561750    
## Exterior2ndWd Shng   -4.197e+03  7.836e+03  -0.536 0.592332    
## MasVnrTypeBrkFace     8.602e+03  3.955e+03   2.175 0.029777 *  
## MasVnrTypeNone        1.164e+04  3.963e+03   2.938 0.003349 ** 
## MasVnrTypeStone       1.311e+04  4.195e+03   3.125 0.001807 ** 
## MasVnrArea            1.884e+01  3.395e+00   5.550 3.32e-08 ***
## ExterQualFa           1.336e+04  6.288e+03   2.125 0.033726 *  
## ExterQualGd          -8.024e+03  3.085e+03  -2.601 0.009375 ** 
## ExterQualTA          -1.003e+04  3.394e+03  -2.956 0.003160 ** 
## ExterCondFa          -3.499e+03  7.066e+03  -0.495 0.620488    
## ExterCondGd          -1.037e+04  6.377e+03  -1.625 0.104247    
## ExterCondTA          -6.483e+03  6.373e+03  -1.017 0.309186    
## FoundationCBlock      1.819e+03  1.801e+03   1.010 0.312470    
## FoundationPConc       6.034e+03  1.964e+03   3.072 0.002159 ** 
## FoundationSlab        6.341e+03  4.468e+03   1.419 0.156047    
## FoundationStone       4.014e+03  7.200e+03   0.557 0.577301    
## FoundationWood       -2.481e+04  1.172e+04  -2.116 0.034456 *  
## BsmtFinSF1            3.360e+01  2.392e+00  14.048  < 2e-16 ***
## BsmtFinSF2            2.188e+01  3.201e+00   6.835 1.14e-11 ***
## BsmtUnfSF             1.270e+01  2.228e+00   5.698 1.43e-08 ***
## TotalBsmtSF                  NA         NA      NA       NA    
## HeatingGasA           2.343e+03  1.677e+04   0.140 0.888850    
## HeatingGasW          -6.304e+03  1.730e+04  -0.364 0.715552    
## HeatingGrav          -3.139e+03  1.894e+04  -0.166 0.868428    
## HeatingOthW          -2.839e+04  2.062e+04  -1.377 0.168675    
## HeatingWall           4.965e+03  2.129e+04   0.233 0.815610    
## HeatingQCFa          -1.928e+03  2.706e+03  -0.712 0.476280    
## HeatingQCGd          -3.737e+03  1.183e+03  -3.160 0.001606 ** 
## HeatingQCPo           1.191e+04  1.245e+04   0.956 0.339135    
## HeatingQCTA          -3.674e+03  1.197e+03  -3.069 0.002180 ** 
## CentralAirY          -3.873e+02  2.091e+03  -0.185 0.853039    
## ElectricalFuseF      -2.989e+03  3.552e+03  -0.842 0.400166    
## ElectricalFuseP      -9.314e+03  7.068e+03  -1.318 0.187772    
## ElectricalMix         9.603e+03  2.643e+04   0.363 0.716405    
## ElectricalSBrkr      -1.970e+03  1.715e+03  -1.148 0.251012    
## X1stFlrSF             5.350e+01  2.858e+00  18.718  < 2e-16 ***
## X2ndFlrSF             6.487e+01  3.035e+00  21.375  < 2e-16 ***
## LowQualFinSF          1.211e+01  1.099e+01   1.102 0.270735    
## GrLivArea                    NA         NA      NA       NA    
## BsmtFullBath          2.144e+03  1.114e+03   1.925 0.054395 .  
## BsmtHalfBath          8.477e+02  1.648e+03   0.514 0.607154    
## FullBath              3.703e+03  1.295e+03   2.860 0.004292 ** 
## HalfBath             -7.771e+01  1.241e+03  -0.063 0.950057    
## BedroomAbvGr         -3.233e+03  7.977e+02  -4.053 5.28e-05 ***
## KitchenAbvGr         -6.874e+03  4.161e+03  -1.652 0.098739 .  
## KitchenQualFa        -1.566e+04  3.644e+03  -4.296 1.84e-05 ***
## KitchenQualGd        -2.059e+04  2.131e+03  -9.663  < 2e-16 ***
## KitchenQualTA        -1.784e+04  2.372e+03  -7.523 8.70e-14 ***
## TotRmsAbvGrd          7.895e+02  5.523e+02   1.430 0.153012    
## FunctionalMaj2       -5.674e+03  9.442e+03  -0.601 0.547938    
## FunctionalMin1        5.206e+03  5.666e+03   0.919 0.358376    
## FunctionalMin2        6.318e+03  5.832e+03   1.083 0.278778    
## FunctionalMod        -5.581e+03  6.326e+03  -0.882 0.377781    
## FunctionalSev        -5.304e+04  1.823e+04  -2.910 0.003666 ** 
## FunctionalTyp         1.758e+04  5.075e+03   3.465 0.000544 ***
## Fireplaces            4.200e+03  7.970e+02   5.270 1.54e-07 ***
## GarageCars            2.574e+03  1.278e+03   2.015 0.044109 *  
## GarageArea            1.710e+01  4.373e+00   3.910 9.61e-05 ***
## PavedDriveP          -3.467e+03  2.994e+03  -1.158 0.247036    
## PavedDriveY          -2.394e+03  1.905e+03  -1.257 0.208887    
## WoodDeckSF            1.471e+01  3.400e+00   4.325 1.61e-05 ***
## OpenPorchSF           1.655e+01  6.399e+00   2.586 0.009807 ** 
## EnclosedPorch         6.099e+00  6.780e+00   0.900 0.368493    
## X3SsnPorch            5.566e+01  1.690e+01   3.293 0.001012 ** 
## ScreenPorch           2.423e+01  7.027e+00   3.449 0.000577 ***
## PoolArea              6.602e+01  9.534e+00   6.924 6.20e-12 ***
## MiscVal              -5.596e-01  6.572e-01  -0.852 0.394563    
## MoSold               -5.360e+02  1.417e+02  -3.784 0.000160 ***
## YrSold                4.488e+02  4.868e+02   0.922 0.356638    
## SaleTypeCon           3.738e+04  9.712e+03   3.849 0.000123 ***
## SaleTypeConLD         1.265e+04  5.268e+03   2.402 0.016432 *  
## SaleTypeConLI        -5.307e+03  9.946e+03  -0.534 0.593718    
## SaleTypeConLw        -5.056e+02  8.633e+03  -0.059 0.953308    
## SaleTypeCWD           2.103e+04  5.336e+03   3.942 8.40e-05 ***
## SaleTypeNew           1.496e+04  8.815e+03   1.697 0.089953 .  
## SaleTypeOth           1.122e+04  9.702e+03   1.156 0.247642    
## SaleTypeWD           -1.063e+03  2.538e+03  -0.419 0.675428    
## SaleConditionAdjLand  9.288e+03  5.828e+03   1.594 0.111226    
## SaleConditionAlloca   6.959e+03  6.002e+03   1.160 0.246397    
## SaleConditionFamily  -2.638e+02  3.233e+03  -0.082 0.934961    
## SaleConditionNormal   4.383e+03  1.704e+03   2.573 0.010165 *  
## SaleConditionPartial  5.334e+03  8.442e+03   0.632 0.527558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15630 on 1684 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.966,  Adjusted R-squared:  0.9616 
## F-statistic: 219.2 on 218 and 1684 DF,  p-value: < 2.2e-16

[9 points] Q7.

Using the result of this regression and your general understanding of which variables should be important in determining SalePrice, choose a maximum of 15 variables, fit a smaller regression, and call it regBefore2009optimal. Then print the summary of regBefore2009optimal to verify that your code works. Tip: Normally you would do a more detailed variable selection using a backward or stepwise selection approach, but this is NOT required for this question. Tip: This is the formula for the regression: lm(SalePrice ~ var1 + var2 + … + varN, data = before2009), where var1, etc. are the variables of your choice. Tip: Pick the variables with the lowest Pr(>|t|). Rubric: 8 points for setting regBefore2009optimal and 1 point for verification.
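
One way to act on the last tip is to read the p-values straight from the coefficient table that summary() returns. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in for before2009, and the helper name rank_terms_by_p is hypothetical. It takes the term names from the rows of the coefficient table itself rather than from coef(), because coef() can contain NA entries for aliased terms that the summary table omits, which would shift the positions.

```r
# Stand-in model on built-in data (assumption: mtcars replaces before2009)
fit <- lm(mpg ~ wt + hp + qsec + factor(cyl), data = mtcars)

# Hypothetical helper: return the n model terms with the lowest p-values
rank_terms_by_p <- function(model, n = 15) {
  tab <- summary(model)$coefficients
  # Drop the intercept row, keeping the matrix shape
  tab <- tab[rownames(tab) != "(Intercept)", , drop = FALSE]
  # Sort rows by the p-value column, smallest first
  ranked <- tab[order(tab[, "Pr(>|t|)"]), , drop = FALSE]
  head(ranked, n)
}

top_terms <- rank_terms_by_p(fit, n = 3)
print(rownames(top_terms))
```

Factor predictors show up as one row per level (for example RoofMatlWdShake), so each ranked name still has to be mapped back to its column (RoofMatl) before it goes into the lm() formula.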

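# List the 40 coefficient terms with the lowest Pr(>|t|) values as candidates for the 15 variables
# Caveat: coef() keeps NA (aliased) terms that summary() drops, so the positions below can be misaligned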
selected_vars <- names(coef(regBefore2009)[-1])[order(summary(regBefore2009)$coefficients[-1, 4])[1:40]]
print(selected_vars)
##  [1] "RoofMatlWdShake"     "RoofStyleShed"       "RoofMatlRoll"       
##  [4] "RoofMatlTar&Grv"     "RoofMatlCompShg"     "RoofMatlMetal"      
##  [7] "RoofMatlMembran"     "ElectricalSBrkr"     "ElectricalMix"      
## [10] "Condition2PosN"      "FoundationWood"      "BedroomAbvGr"       
## [13] "OverallCond9"        "KitchenAbvGr"        "LotArea"            
## [16] "OverallQual9"        "EnclosedPorch"       "BsmtFinSF1"         
## [19] "NeighborhoodMitchel" "MSZoningFV"          "BsmtFinSF2"         
## [22] "MSZoningRM"          "NeighborhoodNWAmes"  "NeighborhoodStoneBr"
## [25] "NeighborhoodEdwards" "MasVnrTypeStone"     "Condition1Norm"     
## [28] "FunctionalMod"       "MSZoningRL"          "RoofStyleMansard"   
## [31] "LandSlopeMod"        "NeighborhoodNAmes"   "LandContourLvl"     
## [34] "OverallQual8"        "Condition2RRAe"      "StreetPave"         
## [37] "LandContourHLS"      "GarageArea"          "NeighborhoodGilbert"
## [40] "NeighborhoodOldTown"
regBefore2009optimal <- lm(
  SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF + OverallQual + Condition2 +
    MSZoning + Neighborhood + LotArea + OverallCond + Foundation +
    BedroomAbvGr + EnclosedPorch + BsmtFinSF1 + BsmtFinSF2 + MasVnrType,
  data = before2009
)

# Print the summary
summary(regBefore2009optimal)
## 
## Call:
## lm(formula = SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF + OverallQual + 
##     Condition2 + MSZoning + Neighborhood + LotArea + OverallCond + 
##     Foundation + BedroomAbvGr + EnclosedPorch + BsmtFinSF1 + 
##     BsmtFinSF2 + MasVnrType, data = before2009)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126988  -15520   -1473   13956  187597 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6.723e+05  5.381e+04 -12.493  < 2e-16 ***
## RoofMatlCompShg      5.839e+05  3.168e+04  18.433  < 2e-16 ***
## RoofMatlMembran      6.160e+05  4.481e+04  13.747  < 2e-16 ***
## RoofMatlMetal        6.445e+05  4.510e+04  14.292  < 2e-16 ***
## RoofMatlRoll         5.786e+05  4.287e+04  13.495  < 2e-16 ***
## RoofMatlTar&Grv      5.862e+05  3.250e+04  18.040  < 2e-16 ***
## RoofMatlWdShake      5.870e+05  3.362e+04  17.460  < 2e-16 ***
## RoofMatlWdShngl      6.633e+05  3.346e+04  19.825  < 2e-16 ***
## LandSlopeMod         1.080e+04  3.589e+03   3.009 0.002654 ** 
## LandSlopeSev        -4.146e+04  1.213e+04  -3.418 0.000645 ***
## BsmtUnfSF            2.807e+01  2.443e+00  11.488  < 2e-16 ***
## OverallQual2         4.370e+04  3.308e+04   1.321 0.186691    
## OverallQual3         3.858e+04  3.060e+04   1.261 0.207594    
## OverallQual4         4.067e+04  3.048e+04   1.334 0.182334    
## OverallQual5         4.746e+04  3.054e+04   1.554 0.120390    
## OverallQual6         6.640e+04  3.062e+04   2.169 0.030228 *  
## OverallQual7         9.085e+04  3.065e+04   2.964 0.003074 ** 
## OverallQual8         1.245e+05  3.070e+04   4.054 5.24e-05 ***
## OverallQual9         1.729e+05  3.092e+04   5.594 2.56e-08 ***
## OverallQual10        2.870e+05  3.162e+04   9.079  < 2e-16 ***
## Condition2Feedr     -7.713e+03  1.734e+04  -0.445 0.656452    
## Condition2Norm       1.214e+03  1.464e+04   0.083 0.933906    
## Condition2PosA      -4.793e+04  2.348e+04  -2.041 0.041388 *  
## Condition2PosN      -2.343e+05  2.270e+04 -10.320  < 2e-16 ***
## Condition2RRAe       2.725e+04  3.238e+04   0.842 0.400075    
## Condition2RRAn      -2.672e+04  3.244e+04  -0.824 0.410288    
## Condition2RRNn       1.542e+03  2.498e+04   0.062 0.950802    
## MSZoningFV           4.900e+04  1.121e+04   4.373 1.30e-05 ***
## MSZoningRH           2.496e+04  1.181e+04   2.113 0.034729 *  
## MSZoningRL           4.049e+04  9.214e+03   4.394 1.18e-05 ***
## MSZoningRM           3.501e+04  8.642e+03   4.052 5.30e-05 ***
## NeighborhoodBlueste -2.377e+04  1.643e+04  -1.447 0.148186    
## NeighborhoodBrDale  -3.148e+04  9.979e+03  -3.155 0.001634 ** 
## NeighborhoodBrkSide -1.670e+04  8.244e+03  -2.025 0.042981 *  
## NeighborhoodClearCr -1.231e+04  9.202e+03  -1.338 0.181206    
## NeighborhoodCollgCr -9.268e+03  6.955e+03  -1.333 0.182795    
## NeighborhoodCrawfor  8.885e+03  7.769e+03   1.144 0.252898    
## NeighborhoodEdwards -3.248e+04  7.502e+03  -4.329 1.58e-05 ***
## NeighborhoodGilbert -3.688e+03  7.360e+03  -0.501 0.616348    
## NeighborhoodIDOTRR  -2.445e+04  8.843e+03  -2.765 0.005757 ** 
## NeighborhoodMeadowV -3.192e+04  9.947e+03  -3.209 0.001355 ** 
## NeighborhoodMitchel -3.232e+04  7.699e+03  -4.197 2.83e-05 ***
## NeighborhoodNAmes   -2.768e+04  7.232e+03  -3.827 0.000134 ***
## NeighborhoodNoRidge  4.475e+04  8.059e+03   5.553 3.22e-08 ***
## NeighborhoodNPkVill -2.614e+04  1.190e+04  -2.196 0.028248 *  
## NeighborhoodNridgHt  2.934e+04  7.500e+03   3.912 9.49e-05 ***
## NeighborhoodNWAmes  -2.052e+04  7.602e+03  -2.699 0.007009 ** 
## NeighborhoodOldTown -2.335e+04  8.078e+03  -2.891 0.003889 ** 
## NeighborhoodSawyer  -3.012e+04  7.633e+03  -3.946 8.23e-05 ***
## NeighborhoodSawyerW -8.314e+03  7.599e+03  -1.094 0.274112    
## NeighborhoodSomerst -8.833e+03  8.598e+03  -1.027 0.304385    
## NeighborhoodStoneBr  3.185e+04  8.446e+03   3.770 0.000168 ***
## NeighborhoodSWISU   -2.465e+04  9.145e+03  -2.696 0.007091 ** 
## NeighborhoodTimber  -6.031e+03  7.992e+03  -0.755 0.450570    
## NeighborhoodVeenker  1.021e+04  9.536e+03   1.071 0.284485    
## LotArea              1.321e+00  1.176e-01  11.231  < 2e-16 ***
## OverallCond2         1.281e+04  2.991e+04   0.428 0.668375    
## OverallCond3         2.431e+04  2.223e+04   1.094 0.274248    
## OverallCond4         3.282e+04  2.204e+04   1.489 0.136691    
## OverallCond5         3.912e+04  2.199e+04   1.779 0.075474 .  
## OverallCond6         4.538e+04  2.203e+04   2.060 0.039556 *  
## OverallCond7         5.190e+04  2.205e+04   2.354 0.018690 *  
## OverallCond8         5.651e+04  2.213e+04   2.553 0.010746 *  
## OverallCond9         7.254e+04  2.274e+04   3.190 0.001445 ** 
## FoundationCBlock     4.564e+03  2.831e+03   1.612 0.107069    
## FoundationPConc      1.883e+04  3.082e+03   6.109 1.22e-09 ***
## FoundationSlab       3.196e+04  6.716e+03   4.759 2.10e-06 ***
## FoundationStone     -8.538e+03  1.225e+04  -0.697 0.485947    
## FoundationWood       7.437e+03  2.099e+04   0.354 0.723098    
## BedroomAbvGr         1.307e+04  9.057e+02  14.432  < 2e-16 ***
## EnclosedPorch       -1.836e+00  1.095e+01  -0.168 0.866779    
## BsmtFinSF1           5.586e+01  2.483e+00  22.499  < 2e-16 ***
## BsmtFinSF2           5.029e+01  4.559e+00  11.030  < 2e-16 ***
## MasVnrTypeBrkFace    2.000e+04  6.728e+03   2.973 0.002989 ** 
## MasVnrTypeNone       1.593e+04  6.630e+03   2.402 0.016394 *  
## MasVnrTypeStone      2.614e+04  7.143e+03   3.660 0.000260 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28640 on 1828 degrees of freedom
##   (29 observations deleted due to missingness)
## Multiple R-squared:  0.8759, Adjusted R-squared:  0.8708 
## F-statistic:   172 on 75 and 1828 DF,  p-value: < 2.2e-16

[5 points] Q8.

Display diagnostic plots of your regression. Tip: The diagnostic plots include QQ-Plot, Residual versus Fitted Values plot, a \(\sqrt{Standardized \; Residuals}\) vs Fitted Values plot, and a Standardized Residuals vs Leverage plot. Do not worry if your residuals have a slight curve to them. Tip: Google “Plotting Diagnostics for Linear Models - CRAN” and don’t use any arguments for the function autoplot at this time.
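
To make the four plots less of a black box, the quantities they display can be computed directly with base R. This is a minimal sketch on the built-in mtcars data (an assumption, standing in for the assignment model); the data frame name diag_df is hypothetical. Note that the scale-location plot uses the square root of the absolute standardized residuals.

```r
# Stand-in model on built-in data (assumption: mtcars replaces before2009)
fit <- lm(mpg ~ wt + hp, data = mtcars)

diag_df <- data.frame(
  fitted       = fitted(fit),                # x-axis: residuals-vs-fitted, scale-location
  residual     = resid(fit),                 # y-axis: residuals-vs-fitted
  std_residual = rstandard(fit),             # y-axis: QQ plot and leverage plot
  sqrt_abs_std = sqrt(abs(rstandard(fit))),  # y-axis: scale-location plot
  leverage     = hatvalues(fit),             # x-axis: leverage plot
  cooks_d      = cooks.distance(fit)         # influence measure
)

# Rows with the largest Cook's distance are usually worth inspecting first
head(diag_df[order(-diag_df$cooks_d), ], 3)
```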

library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.3.2
regBefore2009optimal %>%
  autoplot()
## Warning: Removed 1904 rows containing missing values (`geom_line()`).
## Warning: Removed 7 rows containing missing values (`geom_point()`).
## Warning: Removed 14 rows containing missing values (`geom_line()`).

[5 points] Q9.

Now read in the PricesAfter2009.csv data and assign it to a variable called after2009. The dataset contains data for house prices after 2009. Then, repeat your data manipulation operations from Q2 and Q3 on this new dataset. Drop (remove) unnecessary columns that you dropped in Q5. Rubric: 1 point for reading and 4 points for data manipulation.

after2009 <- read.csv("PricesAfter2009.csv")
# Inspect the data using head() and str() commands
head(after2009)
##   X Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1  6         50       RL          85   14115   Pave  <NA>      IR1
## 2 2  8         60       RL          NA   10382   Pave  <NA>      IR1
## 3 3 17         20       RL          NA   11241   Pave  <NA>      IR1
## 4 4 20         20       RL          70    7560   Pave  <NA>      Reg
## 5 5 25         20       RL          NA    8246   Pave  <NA>      IR1
## 6 6 26         20       RL         110   14230   Pave  <NA>      Reg
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1         Lvl    AllPub    Inside       Gtl      Mitchel       Norm       Norm
## 2         Lvl    AllPub    Corner       Gtl       NWAmes       PosN       Norm
## 3         Lvl    AllPub   CulDSac       Gtl        NAmes       Norm       Norm
## 4         Lvl    AllPub    Inside       Gtl        NAmes       Norm       Norm
## 5         Lvl    AllPub    Inside       Gtl       Sawyer       Norm       Norm
## 6         Lvl    AllPub    Corner       Gtl      NridgHt       Norm       Norm
##   BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1     1Fam     1.5Fin           5           5      1993         1995     Gable
## 2     1Fam     2Story           7           6      1973         1973     Gable
## 3     1Fam     1Story           6           7      1970         1970     Gable
## 4     1Fam     1Story           5           6      1958         1965       Hip
## 5     1Fam     1Story           5           8      1968         2001     Gable
## 6     1Fam     1Story           8           5      2007         2007     Gable
##   RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1  CompShg     VinylSd     VinylSd       None          0        TA        TA
## 2  CompShg     HdBoard     HdBoard      Stone        240        TA        TA
## 3  CompShg     Wd Sdng     Wd Sdng    BrkFace        180        TA        TA
## 4  CompShg     BrkFace     Plywood       None          0        TA        TA
## 5  CompShg     Plywood     Plywood       None          0        TA        Gd
## 6  CompShg     VinylSd     VinylSd      Stone        640        Gd        TA
##   Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1       Wood       Gd       TA           No          GLQ        732
## 2     CBlock       Gd       TA           Mn          ALQ        859
## 3     CBlock       TA       TA           No          ALQ        578
## 4     CBlock       TA       TA           No          LwQ        504
## 5     CBlock       TA       TA           Mn          Rec        188
## 6      PConc       Gd       TA           No          Unf          0
##   BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1          Unf          0        64         796    GasA        Ex          Y
## 2          BLQ         32       216        1107    GasA        Ex          Y
## 3          Unf          0       426        1004    GasA        Ex          Y
## 4          Unf          0       525        1029    GasA        TA          Y
## 5          ALQ        668       204        1060    GasA        Ex          Y
## 6          Unf          0      1566        1566    GasA        Ex          Y
##   Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1      SBrkr       796       566            0      1362            1
## 2      SBrkr      1107       983            0      2090            1
## 3      SBrkr      1004         0            0      1004            1
## 4      SBrkr      1339         0            0      1339            0
## 5      SBrkr      1060         0            0      1060            1
## 6      SBrkr      1600         0            0      1600            0
##   BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1            0        1        1            1            1          TA
## 2            0        2        1            3            1          TA
## 3            0        1        0            2            1          TA
## 4            0        1        0            3            1          TA
## 5            0        1        0            3            1          Gd
## 6            0        2        0            3            1          Gd
##   TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1            5        Typ          0        <NA>     Attchd        1993
## 2            7        Typ          2          TA     Attchd        1973
## 3            5        Typ          1          TA     Attchd        1970
## 4            6       Min1          0        <NA>     Attchd        1958
## 5            6        Typ          1          TA     Attchd        1968
## 6            7        Typ          1          Gd     Attchd        2007
##   GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1          Unf          2        480         TA         TA          Y
## 2          RFn          2        484         TA         TA          Y
## 3          Fin          2        480         TA         TA          Y
## 4          Unf          1        294         TA         TA          Y
## 5          Unf          1        270         TA         TA          Y
## 6          RFn          3        890         TA         TA          Y
##   WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1         40          30             0        320           0        0     NA
## 2        235         204           228          0           0        0     NA
## 3          0           0             0          0           0        0     NA
## 4          0           0             0          0           0        0     NA
## 5        406          90             0          0           0        0     NA
## 6          0          56             0          0           0        0     NA
##   Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 MnPrv        Shed     700     10   2009       WD        Normal    143000
## 2  <NA>        Shed     350     11   2009       WD        Normal    200000
## 3  <NA>        Shed     700      3   2010       WD        Normal    149000
## 4 MnPrv        <NA>       0      5   2009      COD       Abnorml    139000
## 5 MnPrv        <NA>       0      5   2010       WD        Normal    154000
## 6  <NA>        <NA>       0      7   2009       WD        Normal    256300
str(after2009)
## 'data.frame':    986 obs. of  82 variables:
##  $ X            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Id           : int  6 8 17 20 25 26 27 28 34 37 ...
##  $ MSSubClass   : int  50 60 20 20 20 20 20 20 20 20 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  85 NA NA 70 NA 110 60 98 70 112 ...
##  $ LotArea      : int  14115 10382 11241 7560 8246 14230 7200 11478 10552 10859 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "IR1" "IR1" "IR1" "Reg" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "Corner" "CulDSac" "Inside" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "Mitchel" "NWAmes" "NAmes" "NAmes" ...
##  $ Condition1   : chr  "Norm" "PosN" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "1.5Fin" "2Story" "1Story" "1Story" ...
##  $ OverallQual  : int  5 7 6 5 5 8 5 8 5 5 ...
##  $ OverallCond  : int  5 6 7 6 8 5 7 5 5 5 ...
##  $ YearBuilt    : int  1993 1973 1970 1958 1968 2007 1951 2007 1959 1994 ...
##  $ YearRemodAdd : int  1995 1973 1970 1965 2001 2007 2000 2008 1959 1995 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Hip" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "HdBoard" "Wd Sdng" "BrkFace" ...
##  $ Exterior2nd  : chr  "VinylSd" "HdBoard" "Wd Sdng" "Plywood" ...
##  $ MasVnrType   : chr  "None" "Stone" "BrkFace" "None" ...
##  $ MasVnrArea   : int  0 240 180 0 0 640 0 200 0 0 ...
##  $ ExterQual    : chr  "TA" "TA" "TA" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "Wood" "CBlock" "CBlock" "CBlock" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "TA" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "TA" ...
##  $ BsmtExposure : chr  "No" "Mn" "No" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "ALQ" "LwQ" ...
##  $ BsmtFinSF1   : int  732 859 578 504 188 0 234 1218 1018 0 ...
##  $ BsmtFinType2 : chr  "Unf" "BLQ" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 32 0 0 668 0 486 0 0 0 ...
##  $ BsmtUnfSF    : int  64 216 426 525 204 1566 180 486 380 1097 ...
##  $ TotalBsmtSF  : int  796 1107 1004 1029 1060 1566 900 1704 1398 1097 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "TA" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  796 1107 1004 1339 1060 1600 900 1704 1700 1097 ...
##  $ X2ndFlrSF    : int  566 983 0 0 0 0 0 0 0 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1362 2090 1004 1339 1060 1600 900 1704 1700 1097 ...
##  $ BsmtFullBath : int  1 1 1 0 1 0 0 1 0 0 ...
##  $ BsmtHalfBath : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ FullBath     : int  1 2 1 1 1 2 1 2 1 1 ...
##  $ HalfBath     : int  1 1 0 0 0 0 0 0 1 1 ...
##  $ BedroomAbvGr : int  1 3 2 3 3 3 3 3 4 3 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ KitchenQual  : chr  "TA" "TA" "TA" "TA" ...
##  $ TotRmsAbvGrd : int  5 7 5 6 6 7 5 7 6 6 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Min1" ...
##  $ Fireplaces   : int  0 2 1 0 1 1 0 1 1 0 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" NA ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ GarageYrBlt  : int  1993 1973 1970 1958 1968 2007 2005 2008 1959 1995 ...
##  $ GarageFinish : chr  "Unf" "RFn" "Fin" "Unf" ...
##  $ GarageCars   : int  2 2 2 1 1 3 2 3 2 2 ...
##  $ GarageArea   : int  480 484 480 294 270 890 576 772 447 672 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  40 235 0 0 406 0 222 0 0 392 ...
##  $ OpenPorchSF  : int  30 204 0 0 90 56 32 50 38 64 ...
##  $ EnclosedPorch: int  0 228 0 0 0 0 0 0 0 0 ...
##  $ X3SsnPorch   : int  320 0 0 0 0 0 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : logi  NA NA NA NA NA NA ...
##  $ Fence        : chr  "MnPrv" NA NA "MnPrv" ...
##  $ MiscFeature  : chr  "Shed" "Shed" "Shed" NA ...
##  $ MiscVal      : int  700 350 700 0 0 0 0 0 0 0 ...
##  $ MoSold       : int  10 11 3 5 5 7 5 5 4 6 ...
##  $ YrSold       : int  2009 2009 2010 2009 2010 2009 2010 2010 2010 2009 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "COD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : num  143000 200000 149000 139000 154000 ...
# Convert columns to character or factor type
after2009$MSSubClass <- as.character(after2009$MSSubClass)
after2009$OverallQual <- as.factor(after2009$OverallQual)
after2009$OverallCond <- as.factor(after2009$OverallCond)

# Verify the conversion
summary(after2009[, c("MSSubClass", "OverallQual", "OverallCond")])
##   MSSubClass         OverallQual   OverallCond 
##  Length:986         5      :271   5      :551  
##  Class :character   6      :246   6      :177  
##  Mode  :character   7      :205   7      :130  
##                     8      :109   8      : 50  
##                     4      : 85   4      : 32  
##                     9      : 39   3      : 20  
##                     (Other): 31   (Other): 26
temp = map(after2009, ~sum(is.na(.))) %>% as_tibble() %>% t()
afterNAs = tibble('Columns' = rownames(temp), "NAs" = temp[,1])
afterNAs %>% head(10)
## # A tibble: 10 × 2
##    Columns       NAs
##    <chr>       <int>
##  1 X               0
##  2 Id              0
##  3 MSSubClass      0
##  4 MSZoning        1
##  5 LotFrontage   169
##  6 LotArea         0
##  7 Street          0
##  8 Alley         924
##  9 LotShape        0
## 10 LandContour     0
# Inspect the structure of the NA-count table
str(afterNAs)
## tibble [82 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Columns: chr [1:82] "X" "Id" "MSSubClass" "MSZoning" ...
##  $ NAs    : Named int [1:82] 0 0 0 1 169 0 0 924 0 0 ...
##   ..- attr(*, "names")= chr [1:82] "X" "Id" "MSSubClass" "MSZoning" ...
# Create a vector of column names to drop
dropCols <- afterNAs$Columns[afterNAs$NAs >= 20]

# Drop specified columns
after2009 <- after2009 %>%
  select(-any_of(dropCols),SalePrice)
# Drop specified columns and the first column
after2009 <- after2009 %>%
  select(-Id, -Utilities, -1)
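
The code above derives the drop list from the NA counts in after2009 itself. If the goal is for both periods to end up with exactly the same columns, one option is to decide the drop list once and reuse it. The sketch below shows that idea in base R on made-up toy data; the helper names cols_over_na_threshold and clean_prices, the toy values, and the threshold of 3 are all assumptions for illustration.

```r
# Toy stand-ins for before2009 / after2009 (assumption: made-up values)
before_toy <- data.frame(
  Id = 1:6,
  Alley = c(NA, NA, NA, NA, "Grvl", NA),
  LotArea = c(8450, 9600, NA, 9550, 14260, 14115),
  OverallQual = c(7, 6, 7, 7, 8, 5),
  SalePrice = c(208500, 181500, 223500, 140000, 250000, 143000)
)
after_toy <- data.frame(
  Id = 7:10,
  Alley = c(NA, "Pave", NA, NA),
  LotArea = c(10382, 11241, 7560, 8246),
  OverallQual = c(7, 6, 5, 5),
  SalePrice = c(200000, 149000, 139000, 154000)
)

# Hypothetical helper: names of columns with at least `threshold` missing values
cols_over_na_threshold <- function(df, threshold) {
  na_counts <- colSums(is.na(df))
  names(na_counts)[na_counts >= threshold]
}

# Hypothetical helper: apply identical cleaning steps to either data set
clean_prices <- function(df, drop_cols) {
  df <- df[, setdiff(names(df), c(drop_cols, "Id")), drop = FALSE]
  df$OverallQual <- as.factor(df$OverallQual)
  df
}

# Decide the drop list once, then reuse it for both periods
drop_cols <- cols_over_na_threshold(before_toy, threshold = 3)
before_clean <- clean_prices(before_toy, drop_cols)
after_clean <- clean_prices(after_toy, drop_cols)
```

Reusing one drop list keeps the two data frames aligned, which matters later when the same regression formula is fitted to both.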

[8 points] Q10.

Local authorities found in 2011 that housing fraud was taking place in several neighborhoods, including NAmes, Gilbert, and NridgHt. Make a density plot of SalePrice (after 2009) for all the neighborhoods (with or without fraud) and arrange them all in a grid. Tip: Data scientists often use density plots to catch outliers or anomalous activity. Tip: I recommend using ggplot2 for these plots with facet_wrap(~ Neighborhood). Your call will look something like this: ggplot(data = …, aes(x = SalePrice)) + geom_density() + facet_wrap(~ …) + ggtitle("…") + xlab("…")

# Assuming you have loaded the necessary libraries and the after2009 data

library(ggplot2)

# Subset of the neighborhoods named in the fraud report (not used by the plot below, which shows every neighborhood)
fraud_neighborhoods <- c("NAmes", "Gilbert", "NridgHt")
after2009_fraud <- after2009[after2009$Neighborhood %in% fraud_neighborhoods, ]

# Create a density plot with ggplot2 and facet_wrap
ggplot(data = after2009, aes(x = SalePrice)) +
  geom_density() +
  facet_wrap(~ Neighborhood) +
  ggtitle("Density Plot of SalePrice by Neighborhood (After 2009)") +
  xlab('SalePrice') +
  theme_minimal()  # You can customize the theme if needed
## Warning: Removed 5 rows containing non-finite values (`stat_density()`).

[8 points] Q11.

As you can see, the density plot for NAmes between 2009 and 2010 does not look any different from other density plots. If there are fraudsters, they are making an effort to mask their activities. Now, make 2 density plots, one for SalePrice in NAmes before 2009 and the other for after 2009. Compare the two to see if there is visual evidence of anomalous activity. Then, do the same for Gilbert and see if anything anomalous is detectable between these plots. Tip: I recommend using the gridExtra library’s grid.arrange function for all four plots so you can see the plots for each neighborhood side by side.

# Assuming you have loaded the necessary libraries and the before2009 and after2009 data

library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Function to create density plot for a neighborhood
create_density_plot <- function(data, neighborhood, title) {
  ggplot(data = data[data$Neighborhood == neighborhood, ], aes(x = SalePrice)) +
    geom_density() +
    ggtitle(title) +
    xlab('SalePrice') +
    theme_minimal()
}

# Create density plots for NAmes and Gilbert before and after 2009
plot_NAmes_before <- create_density_plot(before2009, "NAmes", "Density Plot - NAmes (Before 2009)")
plot_NAmes_after <- create_density_plot(after2009, "NAmes", "Density Plot - NAmes (After 2009)")

plot_Gilbert_before <- create_density_plot(before2009, "Gilbert", "Density Plot - Gilbert (Before 2009)")
plot_Gilbert_after <- create_density_plot(after2009, "Gilbert", "Density Plot - Gilbert (After 2009)")

# Arrange plots side by side
grid.arrange(plot_NAmes_before, plot_NAmes_after, plot_Gilbert_before, plot_Gilbert_after, ncol = 2)
## Warning: Removed 4 rows containing non-finite values (`stat_density()`).
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
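
An alternative to four separate panels is to stack the two periods into one data frame and draw both densities on shared axes, which makes shifts in shape easier to see. The sketch below uses simulated prices rather than the assignment data, and the Period column and object names are assumptions; the plot is drawn only if ggplot2 is installed.

```r
set.seed(1)
# Simulated stand-ins (assumption: not the assignment data)
before_toy <- data.frame(Neighborhood = "NAmes",
                         SalePrice = rnorm(200, mean = 145000, sd = 25000))
after_toy <- data.frame(Neighborhood = "NAmes",
                        SalePrice = rnorm(120, mean = 145000, sd = 12000))

# Tag each row with its period and stack the two data sets
before_toy$Period <- "Before 2009"
after_toy$Period <- "After 2009"
both <- rbind(before_toy, after_toy)
both$Period <- factor(both$Period, levels = c("Before 2009", "After 2009"))

# Overlay both densities, one panel per neighborhood, when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(both, ggplot2::aes(x = SalePrice, colour = Period)) +
    ggplot2::geom_density() +
    ggplot2::facet_wrap(~ Neighborhood) +
    ggplot2::ggtitle("SalePrice density by period (simulated data)")
  print(p)
}
```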

* * * * * * * * * * * * * * * * * * * * * * * * * * *

The questions above were in the Previous Assignment.

Team Assignment worth a total of 60 points.

* * * * * * * * * * * * * * * * * * * * * * * * * * *

We pick up this story from new Question 12 below and continue the investigation after you have learned regression in more detail. Tip: I bookended this assignment with the regression module so you can reinforce your understanding and apply it. (I also wanted to have empathy for your learning-life blend.) This will also, hopefully, cement your understanding and build your confidence.

[5 points] Q12

Analyze the visualizations above for Gilbert and NAmes to detect possible fraud. Tip: Look for a fraud pattern.

### This section doesn't require code. Just answer the question as a comment.
# Density plots normally have a single peak near the mean, but the "Gilbert" plot shows two peaks plus another peak around 145000, which may indicate fraud. In addition, the "NAmes" plot looks unusually concentrated around the average, which could also be a sign of fraudulent activity.
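
The "number of peaks" in a density plot can be made explicit by counting the local maxima of the kernel density estimate. The sketch below is not a fraud test and is not part of the assignment: it runs on simulated prices, the helper count_density_peaks is hypothetical, and the result depends on the bandwidth (controlled here through adjust) and on the height cutoff.

```r
# Hypothetical helper: count local maxima of a kernel density estimate.
# Peaks lower than `min_rel_height` times the tallest peak are ignored,
# which filters out numerical noise in the tails.
count_density_peaks <- function(x, min_rel_height = 0.25, adjust = 2) {
  d <- density(x[is.finite(x)], adjust = adjust)
  # A strict local maximum is where the slope switches from positive to negative
  turning <- diff(sign(diff(d$y)))
  peak_idx <- which(turning == -2) + 1
  sum(d$y[peak_idx] >= min_rel_height * max(d$y))
}

set.seed(42)
# Simulated prices (assumption: not the assignment data)
one_cluster <- rnorm(1000, mean = 150000, sd = 20000)
two_clusters <- c(rnorm(500, mean = 120000, sd = 10000),
                  rnorm(500, mean = 220000, sd = 10000))

count_density_peaks(one_cluster)
count_density_peaks(two_clusters)
```

A larger adjust value smooths the curve and merges nearby peaks, so the count should be read alongside the plot rather than on its own.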

[5 points] Q13.

You may feel that the fraudsters were not very careful in masking their activity after identifying the fraud pattern. However, we don’t have sufficient evidence to claim that this is fraudulent activity (just based on the density plots). We will now use multiple linear regression to attempt to get more evidence. Run a regression on the data in after2009 using variables you already know to be good at predicting the SalePrice. Store the result in variable called regAfter2009optimal. Then print summary of regAfter2009optimal to verify that your code works. Tip: You can reuse your previous work on before2009. Rubric: 4 points for regression, 1 point for printing summary.
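
When the same specification is meant to be fitted to both periods, writing the formula once and passing it to each lm() call keeps the two models comparable. The sketch below uses simulated data and a shortened predictor list; simulate_prices and the object names are hypothetical, and the coefficients used to generate the data are arbitrary.

```r
set.seed(7)
# Hypothetical helper that simulates a small housing data set (assumption: made-up data)
simulate_prices <- function(n) {
  df <- data.frame(
    LotArea = runif(n, 5000, 15000),
    BedroomAbvGr = sample(1:5, n, replace = TRUE),
    OverallQual = factor(sample(4:8, n, replace = TRUE), levels = 4:8)
  )
  df$SalePrice <- 50000 + 3 * df$LotArea + 10000 * df$BedroomAbvGr +
    15000 * as.integer(df$OverallQual) + rnorm(n, sd = 20000)
  df
}
before_toy <- simulate_prices(300)
after_toy <- simulate_prices(150)

# Write the model specification once so both fits use identical predictors
price_formula <- SalePrice ~ LotArea + BedroomAbvGr + OverallQual

reg_before_toy <- lm(price_formula, data = before_toy)
reg_after_toy <- lm(price_formula, data = after_toy)

# Side-by-side coefficient estimates for the two periods
cbind(before = coef(reg_before_toy), after = coef(reg_after_toy))
```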

# Fit the after-2009 regression with 15 predictors, mostly reused from the before-2009 model

regAfter2009optimal <- lm(SalePrice ~ RoofMatl + KitchenQual + OverallQual + Condition2 + MSZoning + Neighborhood + LotArea +OverallCond  +Foundation + BedroomAbvGr + ExterQual + BsmtFinSF1 +BsmtFinSF2 + MSSubClass + BsmtUnfSF, data = after2009)

# Print the summary
summary(regAfter2009optimal)
## 
## Call:
## lm(formula = SalePrice ~ RoofMatl + KitchenQual + OverallQual + 
##     Condition2 + MSZoning + Neighborhood + LotArea + OverallCond + 
##     Foundation + BedroomAbvGr + ExterQual + BsmtFinSF1 + BsmtFinSF2 + 
##     MSSubClass + BsmtUnfSF, data = after2009)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -276926  -11883      76   13297  209716 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.398e+04  5.027e+04   1.273 0.203394    
## RoofMatlTar&Grv     -2.977e+03  1.227e+04  -0.243 0.808350    
## RoofMatlWdShake      5.169e+04  2.488e+04   2.077 0.038071 *  
## RoofMatlWdShngl     -7.834e+03  3.065e+04  -0.256 0.798336    
## KitchenQualFa       -3.421e+04  9.574e+03  -3.573 0.000372 ***
## KitchenQualGd       -2.741e+04  6.134e+03  -4.468 8.89e-06 ***
## KitchenQualTA       -3.147e+04  6.666e+03  -4.721 2.72e-06 ***
## OverallQual2        -6.917e+04  4.065e+04  -1.701 0.089222 .  
## OverallQual3        -5.758e+04  4.058e+04  -1.419 0.156331    
## OverallQual4        -5.078e+04  3.980e+04  -1.276 0.202312    
## OverallQual5        -4.972e+04  4.000e+04  -1.243 0.214174    
## OverallQual6        -3.574e+04  4.013e+04  -0.891 0.373375    
## OverallQual7        -2.085e+04  4.027e+04  -0.518 0.604729    
## OverallQual8        -3.968e+03  4.046e+04  -0.098 0.921893    
## OverallQual9         2.645e+04  4.120e+04   0.642 0.520965    
## OverallQual10        5.314e+04  4.296e+04   1.237 0.216457    
## Condition2Feedr      3.155e+04  3.640e+04   0.867 0.386233    
## Condition2Norm       1.827e+04  3.294e+04   0.555 0.579241    
## Condition2PosA       6.630e+04  4.582e+04   1.447 0.148213    
## Condition2PosN      -2.183e+05  4.615e+04  -4.730 2.61e-06 ***
## MSZoningFV           3.040e+04  1.808e+04   1.681 0.093068 .  
## MSZoningRH           2.227e+04  1.707e+04   1.305 0.192278    
## MSZoningRL           2.850e+04  1.377e+04   2.070 0.038718 *  
## MSZoningRM           2.944e+04  1.289e+04   2.284 0.022578 *  
## NeighborhoodBlueste -4.273e+03  1.864e+04  -0.229 0.818750    
## NeighborhoodBrDale  -6.282e+03  1.815e+04  -0.346 0.729400    
## NeighborhoodBrkSide -1.042e+04  1.509e+04  -0.691 0.489874    
## NeighborhoodClearCr -2.594e+03  1.518e+04  -0.171 0.864360    
## NeighborhoodCollgCr -1.154e+04  1.240e+04  -0.930 0.352424    
## NeighborhoodCrawfor  2.400e+04  1.374e+04   1.747 0.080900 .  
## NeighborhoodEdwards -2.634e+04  1.315e+04  -2.004 0.045410 *  
## NeighborhoodGilbert -1.910e+04  1.293e+04  -1.477 0.139978    
## NeighborhoodIDOTRR  -2.964e+04  1.692e+04  -1.752 0.080070 .  
## NeighborhoodMeadowV -2.545e+04  1.806e+04  -1.409 0.159108    
## NeighborhoodMitchel -1.454e+04  1.314e+04  -1.107 0.268675    
## NeighborhoodNAmes   -2.106e+04  1.281e+04  -1.643 0.100714    
## NeighborhoodNoRidge  2.979e+04  1.379e+04   2.160 0.031058 *  
## NeighborhoodNPkVill -2.485e+03  1.473e+04  -0.169 0.866033    
## NeighborhoodNridgHt  3.254e+03  1.229e+04   0.265 0.791237    
## NeighborhoodNWAmes  -1.549e+04  1.327e+04  -1.167 0.243553    
## NeighborhoodOldTown -2.707e+04  1.481e+04  -1.827 0.068007 .  
## NeighborhoodSawyer  -2.699e+04  1.344e+04  -2.009 0.044882 *  
## NeighborhoodSawyerW -9.161e+03  1.263e+04  -0.725 0.468477    
## NeighborhoodSomerst  1.505e+03  1.611e+04   0.093 0.925608    
## NeighborhoodStoneBr  4.206e+04  1.384e+04   3.040 0.002435 ** 
## NeighborhoodSWISU   -1.620e+04  1.477e+04  -1.097 0.272957    
## NeighborhoodTimber  -8.974e+03  1.377e+04  -0.651 0.514892    
## NeighborhoodVeenker  2.162e+04  1.973e+04   1.096 0.273572    
## LotArea              8.138e-01  1.377e-01   5.908 4.90e-09 ***
## OverallCond2         6.608e+04  2.589e+04   2.552 0.010862 *  
## OverallCond3         5.661e+04  2.373e+04   2.385 0.017285 *  
## OverallCond4         5.811e+04  2.349e+04   2.474 0.013538 *  
## OverallCond5         6.654e+04  2.298e+04   2.895 0.003881 ** 
## OverallCond6         6.772e+04  2.302e+04   2.942 0.003346 ** 
## OverallCond7         7.402e+04  2.304e+04   3.213 0.001360 ** 
## OverallCond8         6.858e+04  2.342e+04   2.928 0.003496 ** 
## OverallCond9         7.267e+04  2.419e+04   3.004 0.002739 ** 
## FoundationCBlock     2.981e+03  4.553e+03   0.655 0.512875    
## FoundationPConc      1.389e+04  4.971e+03   2.794 0.005310 ** 
## FoundationSlab       3.349e+04  9.557e+03   3.504 0.000481 ***
## FoundationStone      3.245e+04  1.473e+04   2.204 0.027808 *  
## FoundationWood      -1.024e+04  1.912e+04  -0.536 0.592293    
## BedroomAbvGr         7.871e+03  1.613e+03   4.880 1.26e-06 ***
## ExterQualFa         -5.173e+04  1.258e+04  -4.111 4.31e-05 ***
## ExterQualGd         -4.107e+04  8.145e+03  -5.042 5.58e-07 ***
## ExterQualTA         -4.873e+04  8.917e+03  -5.465 5.99e-08 ***
## BsmtFinSF1           6.805e+01  4.328e+00  15.721  < 2e-16 ***
## BsmtFinSF2           4.776e+01  7.003e+00   6.820 1.68e-11 ***
## MSSubClass160        1.273e+04  7.919e+03   1.608 0.108250    
## MSSubClass180        9.389e+03  1.753e+04   0.535 0.592450    
## MSSubClass190        1.566e+04  1.130e+04   1.385 0.166303    
## MSSubClass20         1.539e+04  6.024e+03   2.555 0.010776 *  
## MSSubClass30         1.031e+04  8.556e+03   1.205 0.228425    
## MSSubClass40         6.438e+03  2.535e+04   0.254 0.799547    
## MSSubClass45        -4.934e+02  1.933e+04  -0.026 0.979639    
## MSSubClass50         2.449e+04  7.491e+03   3.270 0.001118 ** 
## MSSubClass60         4.511e+04  6.600e+03   6.835 1.52e-11 ***
## MSSubClass70         2.301e+04  9.048e+03   2.543 0.011163 *  
## MSSubClass75         3.285e+04  1.598e+04   2.056 0.040099 *  
## MSSubClass80         3.190e+04  7.851e+03   4.063 5.27e-05 ***
## MSSubClass85         1.594e+04  9.162e+03   1.740 0.082234 .  
## MSSubClass90         2.064e+04  8.456e+03   2.440 0.014870 *  
## BsmtUnfSF            3.715e+01  4.366e+00   8.509  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29720 on 898 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8599, Adjusted R-squared:  0.8471 
## F-statistic: 67.21 on 82 and 898 DF,  p-value: < 2.2e-16

[2 points] Q14.

Now, display diagnostic plots of your regression (regAfter2009optimal). Tip: You already know how to use autoplot.

library(ggfortify)
regAfter2009optimal %>%
  autoplot()
## Warning: Removed 981 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 11 rows containing missing values (`geom_line()`).

[6 points] Q15.

Now, let’s focus on the Residual vs. Fitted graph by plotting it by itself using ggplot. Tip: Call ggplot with the data parameter set to regAfter2009optimal. The aes parameters are (.fitted, .resid), respectively. You can use stat_smooth() for the trendline, and you should title the plot and label both axes appropriately. Tip: Check out cheatsheets such as https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf.

library(ggplot2)

# Create Residual vs. Fitted graph
ggplot(data = regAfter2009optimal, aes(x = .fitted, y = .resid)) +
  geom_point() +                 # Scatter plot of residuals vs fitted values
  stat_smooth(method = "loess", se = FALSE, color = "red") +  # Add a trendline
  ggtitle("Residual vs. Fitted") +
  xlab("Fitted Values") +
  ylab("Residuals")
## `geom_smooth()` using formula = 'y ~ x'
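
The plot above depends on the lm object being converted into a data frame with .fitted and .resid columns. As a minimal sketch of what those columns contain, the following builds them by hand in base R. The built-in mtcars data and the toy_fit model are stand-ins assumed for illustration; they are not the assignment's regression.

```r
# Stand-in model on built-in data (assumption: mtcars replaces after2009)
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Build the two columns that the residual-vs-fitted plot uses
resid_df <- data.frame(
  .fitted = fitted(toy_fit),  # predicted values
  .resid  = resid(toy_fit)    # observed minus predicted
)

# Base-R version of the same plot, with a lowess trend line
plot(resid_df$.fitted, resid_df$.resid,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual vs. Fitted")
lines(lowess(resid_df$.fitted, resid_df$.resid), col = "red")
abline(h = 0, lty = 2)  # reference line at zero residual
```

Each residual is the observed value minus the fitted value, so a point far above or below the zero line is a row the model predicts poorly.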

[5 points] Q16.

Identify any outliers in the visualization from the last two chunks.

### This section doesn't require code. Just answer the question as a comment.
# First chunk (autoplot diagnostics):
# Residuals vs. Fitted Values Plot: points far from the horizontal center line, such as 280, are outliers.
# Q-Q Plot (Quantile-Quantile Plot): points deviating from the sloped reference line, such as 280, may indicate outliers.
# Scale-Location: points far from the horizontal center, such as 280, would be outliers.
# Residuals vs. Leverage Plot: points outside the dashed horizontal lines, such as 529, are outliers.
# Second chunk (Residuals vs. Fitted Values Plot): points far from the red line, such as the point with a residual over 2e+05, are outliers.
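
Reading outliers off the plots can be cross-checked numerically. The sketch below computes standardized residuals and Cook's distance and flags rows with common rule-of-thumb cut-offs. The mtcars model and the cut-offs (|standardized residual| > 2, Cook's distance > 4/n) are assumptions for illustration, not thresholds prescribed by the assignment.

```r
# Stand-in model on built-in data (assumption: mtcars replaces after2009)
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

diag_df <- data.frame(
  row       = rownames(mtcars),
  std_resid = rstandard(toy_fit),      # standardized residuals
  cooks_d   = cooks.distance(toy_fit)  # influence of each row on the fit
)

# Rule-of-thumb cut-offs (assumptions, not hard rules)
n <- nrow(mtcars)
flagged <- subset(diag_df, abs(std_resid) > 2 | cooks_d > 4 / n)

# Most influential rows first
print(flagged[order(-flagged$cooks_d), ])
```

Applied to the assignment's model, the same calls would list candidate row labels to compare against the ones picked out by eye.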

[20 points] Q17.

Now, let’s think like a fraudster and do something smarter fraudsters may do. Instead of misrepresenting values by just reporting the mean value of the houses sold in NAmes before 2009, what is a more clever and nuanced way the fraudsters could report these values? Specifically, consider a method smarter fraudsters may use to set the values in the rows in which the prices are misrepresented. Then, using this method, generate and set values for the SalePrice in those rows. Then, try your fraud inspection techniques of comparing old and new density plots as well as using the diagnostic plots to show that the fraud is now much harder to catch. Tip: You must use exact commands/functions to set the values and tell us why you chose to generate values this way. You must share the resulting diagnostic plots with us. Tip: Consider using more information (instead of the mean values) to generate the fraudulent values using what you learned from your work above. You can do this in two steps: Step 1: Find the rows set by the stupid fraudsters (by searching for the SalePrice of 142769.7). Step 2: Use a smarter way to generate and replace these values. Tip: For plotting, you may use ggplot to plot NAmes and NAmes. My ggplot call looked like this: before2009 %>% filter(Neighborhood == "???") %>% ggplot(aes(x = SalePrice)) + geom_density(fill = "???", alpha = 0.5) + ggtitle("???") + xlab("???") Tip: Always refine your model as fraudsters adapt their methods after they find out that you can catch them. Rubric: 10 points each for the fraud method and the plots.

### This section requires you to first explain your idea. Just answer this as a comment.
## 
# Step 1: Find the rows set by the original fraudulent method (mean value)
fraud_rows <- after2009 %>% filter(SalePrice == 142769.7 )

# Step 2: Use a smarter way to generate and replace these values:
# draw random values from a normal distribution with the overall mean and sd of SalePrice
set.seed(156)
fraud_rows$SalePrice <- rnorm(nrow(fraud_rows), mean = mean(after2009$SalePrice, na.rm = TRUE), sd = sd(after2009$SalePrice, na.rm = TRUE))

# Now, create a density plot for NAmes before and after fraud
ggplot() +
  geom_density(data = fraud_rows %>% filter(Neighborhood == "NAmes"), aes(x = SalePrice), fill = "blue", alpha = 0.5) +
  geom_density(data = before2009 %>% filter(Neighborhood == "NAmes"), aes(x = SalePrice), fill = "green", alpha = 0.5) +
  geom_density(data = after2009 %>% filter(Neighborhood == "NAmes"), aes(x = SalePrice), fill = "red", alpha = 0.5) +
  ggtitle("Density Plot for NAmes Fraud") +
  xlab("SalePrice")

ggplot() +
  geom_density(data = fraud_rows %>% filter(Neighborhood == "Gilbert"), aes(x = SalePrice), fill = "blue", alpha = 0.5) +
  geom_density(data = before2009 %>% filter(Neighborhood == "Gilbert"), aes(x = SalePrice), fill = "green", alpha = 0.5) +
  geom_density(data = after2009 %>% filter(Neighborhood == "Gilbert"), aes(x = SalePrice), fill = "red", alpha = 0.5) +
  ggtitle("Density Plot for Gilbert Fraud") +
  xlab("SalePrice")
## Warning: Removed 4 rows containing non-finite values (`stat_density()`).
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

# Both the after2009 and before2009 densities are clustered around 140,000-150,000.
# The spike created by reporting the mean is clearly fraudulent, so we chose to replace
# these mean values with random draws from a normal distribution.
# With this approach the density curve is smoother (the values are spread out instead of
# piled on a single number), so it is not as easy to detect that the data is fraudulent.
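
One way to "use more information", as the tip suggests, is to generate each replacement price from a regression prediction plus noise on the scale of the residuals, instead of from one overall mean and standard deviation. The sketch below shows the idea on simulated data. The simulated relationship, the single LotArea predictor, and the names clean, target, and fit are all assumptions for illustration; this is not the method used in the solution above.

```r
set.seed(156)

# Simulated stand-in for clean historical sales (assumption: not the real data)
clean <- data.frame(LotArea = runif(200, 5000, 15000))
clean$SalePrice <- 50000 + 10 * clean$LotArea + rnorm(200, sd = 15000)

# Rows whose price will be fabricated
target <- data.frame(LotArea = runif(20, 5000, 15000))

# Fit on the clean data, then predict and add residual-scale noise
fit <- lm(SalePrice ~ LotArea, data = clean)
target$SalePrice <- predict(fit, newdata = target) +
  rnorm(nrow(target), mean = 0, sd = sigma(fit))

head(target)
```

Because the fabricated prices follow each house's own characteristics, they would be expected to produce ordinary-looking residuals, which is what makes this kind of fraud harder to spot in diagnostic plots.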

[5 points] Q18.

Now, run a regression on the new data in after2009 using variables you know are good at predicting SalePrice. Store the result in a variable called regAfter2009optimalFraud. Then print the summary of regAfter2009optimalFraud to verify that your code works. Tip: You can reuse your previous work on before2009. Rubric: 4 points for regression, 1 point for printing summary.

# Identify rows in after2009 where SalePrice is equal to 142769.7
fraud_rows <- after2009 %>% filter(SalePrice == 142769.7)

# Set seed for reproducibility
set.seed(156)

# Generate random values from a normal distribution
fraud_rows$SalePrice <- rnorm(nrow(fraud_rows), mean(after2009$SalePrice, na.rm = TRUE), sd = sd(after2009$SalePrice,na.rm = TRUE))

# Replace the flagged rows in after2009 with the generated values, in row order
# (index assignment avoids ifelse() recycling fraud_rows$SalePrice across all rows)
fraud_idx <- which(after2009$SalePrice == 142769.7)
after2009$SalePrice[fraud_idx] <- fraud_rows$SalePrice

# Verify the changes
head(after2009)
##   MSSubClass MSZoning LotArea Street LotShape LandContour LotConfig LandSlope
## 1         50       RL   14115   Pave      IR1         Lvl    Inside       Gtl
## 2         60       RL   10382   Pave      IR1         Lvl    Corner       Gtl
## 3         20       RL   11241   Pave      IR1         Lvl   CulDSac       Gtl
## 4         20       RL    7560   Pave      Reg         Lvl    Inside       Gtl
## 5         20       RL    8246   Pave      IR1         Lvl    Inside       Gtl
## 6         20       RL   14230   Pave      Reg         Lvl    Corner       Gtl
##   Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual
## 1      Mitchel       Norm       Norm     1Fam     1.5Fin           5
## 2       NWAmes       PosN       Norm     1Fam     2Story           7
## 3        NAmes       Norm       Norm     1Fam     1Story           6
## 4        NAmes       Norm       Norm     1Fam     1Story           5
## 5       Sawyer       Norm       Norm     1Fam     1Story           5
## 6      NridgHt       Norm       Norm     1Fam     1Story           8
##   OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 1           5      1993         1995     Gable  CompShg     VinylSd     VinylSd
## 2           6      1973         1973     Gable  CompShg     HdBoard     HdBoard
## 3           7      1970         1970     Gable  CompShg     Wd Sdng     Wd Sdng
## 4           6      1958         1965       Hip  CompShg     BrkFace     Plywood
## 5           8      1968         2001     Gable  CompShg     Plywood     Plywood
## 6           5      2007         2007     Gable  CompShg     VinylSd     VinylSd
##   MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtFinSF1 BsmtFinSF2
## 1       None          0        TA        TA       Wood        732          0
## 2      Stone        240        TA        TA     CBlock        859         32
## 3    BrkFace        180        TA        TA     CBlock        578          0
## 4       None          0        TA        TA     CBlock        504          0
## 5       None          0        TA        Gd     CBlock        188        668
## 6      Stone        640        Gd        TA      PConc          0          0
##   BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF
## 1        64         796    GasA        Ex          Y      SBrkr       796
## 2       216        1107    GasA        Ex          Y      SBrkr      1107
## 3       426        1004    GasA        Ex          Y      SBrkr      1004
## 4       525        1029    GasA        TA          Y      SBrkr      1339
## 5       204        1060    GasA        Ex          Y      SBrkr      1060
## 6      1566        1566    GasA        Ex          Y      SBrkr      1600
##   X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 1       566            0      1362            1            0        1        1
## 2       983            0      2090            1            0        2        1
## 3         0            0      1004            1            0        1        0
## 4         0            0      1339            0            0        1        0
## 5         0            0      1060            1            0        1        0
## 6         0            0      1600            0            0        2        0
##   BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces
## 1            1            1          TA            5        Typ          0
## 2            3            1          TA            7        Typ          2
## 3            2            1          TA            5        Typ          1
## 4            3            1          TA            6       Min1          0
## 5            3            1          Gd            6        Typ          1
## 6            3            1          Gd            7        Typ          1
##   GarageCars GarageArea PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## 1          2        480          Y         40          30             0
## 2          2        484          Y        235         204           228
## 3          2        480          Y          0           0             0
## 4          1        294          Y          0           0             0
## 5          1        270          Y        406          90             0
## 6          3        890          Y          0          56             0
##   X3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
## 1        320           0        0     700     10   2009       WD        Normal
## 2          0           0        0     350     11   2009       WD        Normal
## 3          0           0        0     700      3   2010       WD        Normal
## 4          0           0        0       0      5   2009      COD       Abnorml
## 5          0           0        0       0      5   2010       WD        Normal
## 6          0           0        0       0      7   2009       WD        Normal
##   SalePrice
## 1    143000
## 2    200000
## 3    149000
## 4    139000
## 5    154000
## 6    256300
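# Fit the regression on the modified data and store it under the name the question asks for
regAfter2009optimalFraud <- lm(SalePrice ~ RoofMatl + KitchenQual + OverallQual + Condition2 + MSZoning + Neighborhood + LotArea + OverallCond + Foundation + BedroomAbvGr + ExterQual + BsmtFinSF1 + BsmtFinSF2 + MSSubClass + BsmtUnfSF, data = after2009)
# Keep the earlier name pointing at the same fit, since later chunks refer to regAfter2009optimal
regAfter2009optimal <- regAfter2009optimalFraud

# Print the summary
summary(regAfter2009optimalFraud)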
## 
## Call:
## lm(formula = SalePrice ~ RoofMatl + KitchenQual + OverallQual + 
##     Condition2 + MSZoning + Neighborhood + LotArea + OverallCond + 
##     Foundation + BedroomAbvGr + ExterQual + BsmtFinSF1 + BsmtFinSF2 + 
##     MSSubClass + BsmtUnfSF, data = after2009)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -198807  -12665    -599   13262  198561 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.879e+04  4.804e+04   1.432 0.152466    
## RoofMatlTar&Grv     -2.618e+03  1.172e+04  -0.223 0.823385    
## RoofMatlWdShake      5.147e+04  2.378e+04   2.164 0.030706 *  
## RoofMatlWdShngl     -6.740e+03  2.929e+04  -0.230 0.818071    
## KitchenQualFa       -3.752e+04  9.150e+03  -4.101 4.49e-05 ***
## KitchenQualGd       -2.892e+04  5.862e+03  -4.934 9.58e-07 ***
## KitchenQualTA       -3.444e+04  6.370e+03  -5.407 8.23e-08 ***
## OverallQual2        -6.920e+04  3.885e+04  -1.781 0.075225 .  
## OverallQual3        -5.773e+04  3.878e+04  -1.489 0.136970    
## OverallQual4        -5.232e+04  3.803e+04  -1.376 0.169289    
## OverallQual5        -5.073e+04  3.823e+04  -1.327 0.184853    
## OverallQual6        -3.596e+04  3.836e+04  -0.938 0.348723    
## OverallQual7        -2.216e+04  3.848e+04  -0.576 0.564962    
## OverallQual8        -4.554e+03  3.866e+04  -0.118 0.906277    
## OverallQual9         3.126e+04  3.937e+04   0.794 0.427382    
## OverallQual10        4.893e+04  4.106e+04   1.192 0.233681    
## Condition2Feedr      3.078e+04  3.478e+04   0.885 0.376463    
## Condition2Norm       1.746e+04  3.148e+04   0.554 0.579414    
## Condition2PosA       6.545e+04  4.379e+04   1.495 0.135360    
## Condition2PosN      -2.109e+05  4.410e+04  -4.781 2.04e-06 ***
## MSZoningFV           4.121e+04  1.728e+04   2.385 0.017282 *  
## MSZoningRH           1.466e+04  1.631e+04   0.899 0.368890    
## MSZoningRL           2.671e+04  1.316e+04   2.030 0.042613 *  
## MSZoningRM           2.959e+04  1.231e+04   2.403 0.016458 *  
## NeighborhoodBlueste -1.408e+04  1.782e+04  -0.790 0.429482    
## NeighborhoodBrDale   7.083e+02  1.735e+04   0.041 0.967445    
## NeighborhoodBrkSide -2.314e+04  1.442e+04  -1.605 0.108907    
## NeighborhoodClearCr -1.322e+04  1.451e+04  -0.911 0.362668    
## NeighborhoodCollgCr -2.174e+04  1.185e+04  -1.834 0.067007 .  
## NeighborhoodCrawfor  1.342e+04  1.313e+04   1.022 0.306902    
## NeighborhoodEdwards -3.736e+04  1.257e+04  -2.973 0.003024 ** 
## NeighborhoodGilbert -2.013e+04  1.236e+04  -1.629 0.103613    
## NeighborhoodIDOTRR  -4.263e+04  1.617e+04  -2.637 0.008499 ** 
## NeighborhoodMeadowV -3.348e+04  1.726e+04  -1.940 0.052734 .  
## NeighborhoodMitchel -2.528e+04  1.255e+04  -2.014 0.044360 *  
## NeighborhoodNAmes   -3.185e+04  1.225e+04  -2.601 0.009460 ** 
## NeighborhoodNoRidge  2.634e+04  1.318e+04   1.999 0.045963 *  
## NeighborhoodNPkVill -3.457e+03  1.407e+04  -0.246 0.806005    
## NeighborhoodNridgHt -5.515e+01  1.175e+04  -0.005 0.996255    
## NeighborhoodNWAmes  -2.739e+04  1.268e+04  -2.160 0.031067 *  
## NeighborhoodOldTown -3.999e+04  1.416e+04  -2.825 0.004832 ** 
## NeighborhoodSawyer  -3.786e+04  1.284e+04  -2.948 0.003283 ** 
## NeighborhoodSawyerW -1.865e+04  1.207e+04  -1.545 0.122678    
## NeighborhoodSomerst -1.895e+04  1.540e+04  -1.230 0.218848    
## NeighborhoodStoneBr  3.673e+04  1.322e+04   2.778 0.005584 ** 
## NeighborhoodSWISU   -2.620e+04  1.412e+04  -1.856 0.063822 .  
## NeighborhoodTimber  -2.076e+04  1.316e+04  -1.577 0.115188    
## NeighborhoodVeenker  1.185e+04  1.886e+04   0.628 0.529901    
## LotArea              8.433e-01  1.316e-01   6.406 2.40e-10 ***
## OverallCond2         6.487e+04  2.474e+04   2.622 0.008890 ** 
## OverallCond3         5.737e+04  2.268e+04   2.529 0.011605 *  
## OverallCond4         5.945e+04  2.245e+04   2.648 0.008231 ** 
## OverallCond5         6.743e+04  2.197e+04   3.070 0.002207 ** 
## OverallCond6         6.991e+04  2.200e+04   3.177 0.001537 ** 
## OverallCond7         7.520e+04  2.202e+04   3.416 0.000664 ***
## OverallCond8         6.919e+04  2.238e+04   3.091 0.002055 ** 
## OverallCond9         7.476e+04  2.312e+04   3.234 0.001267 ** 
## FoundationCBlock     3.362e+03  4.352e+03   0.773 0.439951    
## FoundationPConc      1.357e+04  4.751e+03   2.857 0.004373 ** 
## FoundationSlab       3.482e+04  9.134e+03   3.812 0.000147 ***
## FoundationStone      3.331e+04  1.408e+04   2.367 0.018153 *  
## FoundationWood      -1.134e+04  1.827e+04  -0.620 0.535216    
## BedroomAbvGr         7.570e+03  1.542e+03   4.911 1.08e-06 ***
## ExterQualFa         -4.459e+04  1.203e+04  -3.707 0.000222 ***
## ExterQualGd         -3.717e+04  7.784e+03  -4.775 2.10e-06 ***
## ExterQualTA         -4.218e+04  8.522e+03  -4.949 8.89e-07 ***
## BsmtFinSF1           6.794e+01  4.137e+00  16.424  < 2e-16 ***
## BsmtFinSF2           4.668e+01  6.692e+00   6.975 5.91e-12 ***
## MSSubClass160        1.251e+04  7.568e+03   1.653 0.098707 .  
## MSSubClass180        1.042e+04  1.676e+04   0.622 0.534095    
## MSSubClass190        2.048e+04  1.080e+04   1.896 0.058277 .  
## MSSubClass20         1.997e+04  5.757e+03   3.469 0.000547 ***
## MSSubClass30         1.470e+04  8.177e+03   1.798 0.072515 .  
## MSSubClass40         1.234e+04  2.422e+04   0.509 0.610580    
## MSSubClass45         5.362e+03  1.847e+04   0.290 0.771649    
## MSSubClass50         2.862e+04  7.159e+03   3.998 6.90e-05 ***
## MSSubClass60         4.948e+04  6.308e+03   7.845 1.22e-14 ***
## MSSubClass70         2.759e+04  8.647e+03   3.191 0.001469 ** 
## MSSubClass75         3.696e+04  1.527e+04   2.420 0.015701 *  
## MSSubClass80         3.697e+04  7.503e+03   4.927 9.93e-07 ***
## MSSubClass85         1.998e+04  8.756e+03   2.282 0.022732 *  
## MSSubClass90         2.478e+04  8.081e+03   3.066 0.002233 ** 
## BsmtUnfSF            3.905e+01  4.173e+00   9.360  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28400 on 898 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8737, Adjusted R-squared:  0.8622 
## F-statistic: 75.79 on 82 and 898 DF,  p-value: < 2.2e-16
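
When generated values are substituted into selected rows, note that ifelse() recycles its yes argument to the full length of the test, whereas logical-index assignment fills the selected positions in order. A small sketch on toy vectors (the values and the 999 marker are made up for illustration):

```r
price <- c(100, 999, 200, 999, 300)  # 999 marks rows to replace (toy values)
new_vals <- c(11, 22)                # one replacement per flagged row

# ifelse() recycles `yes` to length 5: c(11, 22, 11, 22, 11),
# then picks by position, so flagged rows 2 and 4 both receive 22
via_ifelse <- ifelse(price == 999, new_vals, price)

# Logical-index assignment fills the flagged positions in order: 11, then 22
via_index <- price
via_index[price == 999] <- new_vals

print(via_ifelse)  # 100  22 200  22 300
print(via_index)   # 100  11 200  22 300
```

With random draws the difference is easy to miss, because every substituted value still comes from the intended distribution; the draws are just not matched one-to-one with the flagged rows.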

[2 points] Q19.

Now, display diagnostic plots of your regression (regAfter2009optimalFraud). Tip: You already know how to use autoplot.

library(ggfortify)
regAfter2009optimal %>%
  autoplot()
## Warning: Removed 981 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 11 rows containing missing values (`geom_line()`).

[5 points] Q20.

Now, look for outliers in diagnostic plots of your regression (regAfter2009optimal). Tip: You already know how to use autoplot.

### This section doesn't require code. Just answer the question as a comment.
# Residuals vs. Fitted Values Plot: points far from the horizontal center line, such as 280, are outliers.
# Q-Q Plot (Quantile-Quantile Plot): points deviating from the sloped reference line, such as 533, may indicate outliers.
# Scale-Location: points far from the horizontal center, such as 280, would be outliers.
# Residuals vs. Leverage Plot: points outside the dashed horizontal lines, such as 524, are outliers.
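
The Residuals vs. Leverage panel can also be checked numerically through the hat values. The sketch below flags high-leverage rows with the common 2p/n rule of thumb. The mtcars model and the cut-off are assumptions for illustration, not part of the assignment.

```r
# Stand-in model on built-in data (assumption: mtcars replaces after2009)
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

lev <- hatvalues(toy_fit)     # leverage: diagonal of the hat matrix
p   <- length(coef(toy_fit))  # number of estimated coefficients
n   <- nrow(mtcars)

# Rule of thumb (an assumption, not a strict test): leverage above 2p/n
high_leverage <- lev[lev > 2 * p / n]
print(sort(high_leverage, decreasing = TRUE))
```

A row with high leverage is unusual in its predictors rather than in its price, so it is worth checking separately from the rows with large residuals.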

[5 points] Q21.

Knit to html after eliminating all the errors. Submit both the Rmd and html files. Tip: Do not worry about minor formatting issues.

### This section doesn't require code. Just knit and submit the Rmd and html files.###