This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
This exercise is under construction. Please report any errors at https://forms.gle/2W4tffs4YJA1jeBv9
Goal: Understand and experience outlier detection techniques in action.
Background: The data for this question has been adapted from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. Please review information at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview before you get started.
Before starting: 1. You are not allowed to search for solutions to this assignment. 2. You are allowed to search information about packages and functions that can help you.
Individual assignment only: 70 total points (Rmd and html solution) Team assignment: 20 points (written analysis)
Start by entering your name and today’s date in Lines 3 and 4, respectively, to agree to the Fuqua Honor Code. Then, run the chunk of code below by clicking on the green arrow (that points to the right) on the top right of the chunk. Tip: I numbered the code chunks to match their question numbers. Chunk 1 specifies the knitting parameters.
Read and store the data from the file PricesBefore2009.csv into a variable called before2009. Then, inspect the data. Rubric: 1 point each for reading and storing; 1 point each for using 2 R commands for inspecting. Tip: I recommend using the read_csv() function from the tidyverse package to read and store data for this and all subsequent assignments.
## [4 points] Q2.
# Install and load the tidyverse package if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
# Load the tidyverse package
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read and store the data from PricesBefore2009.csv
before2009 <- read_csv("PricesBefore2009.csv")
## New names:
## • `` -> `...1`
## Rows: 1933 Columns: 82
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (39): ...1, Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect the data using head() and str() commands
head(before2009)
## # A tibble: 6 × 82
## ...1 Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 7 20 RL 75 10084 Pave <NA> Reg
## # ℹ 73 more variables: LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <chr>,
## # BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
str(before2009)
## spc_tbl_ [1,933 × 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:1933] 1 2 3 4 5 6 7 8 9 10 ...
## $ Id : num [1:1933] 1 2 3 4 5 7 9 10 11 12 ...
## $ MSSubClass : num [1:1933] 60 20 60 70 60 20 50 190 20 60 ...
## $ MSZoning : chr [1:1933] "RL" "RL" "RL" "RL" ...
## $ LotFrontage : num [1:1933] 65 80 68 60 84 75 51 50 70 85 ...
## $ LotArea : num [1:1933] 8450 9600 11250 9550 14260 ...
## $ Street : chr [1:1933] "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr [1:1933] NA NA NA NA ...
## $ LotShape : chr [1:1933] "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr [1:1933] "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr [1:1933] "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr [1:1933] "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr [1:1933] "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr [1:1933] "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr [1:1933] "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr [1:1933] "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr [1:1933] "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr [1:1933] "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : num [1:1933] 7 6 7 7 8 8 7 5 5 9 ...
## $ OverallCond : num [1:1933] 5 8 5 5 5 5 5 6 5 5 ...
## $ YearBuilt : num [1:1933] 2003 1976 2001 1915 2000 ...
## $ YearRemodAdd : num [1:1933] 2003 1976 2002 1970 2000 ...
## $ RoofStyle : chr [1:1933] "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr [1:1933] "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr [1:1933] "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr [1:1933] "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr [1:1933] "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : num [1:1933] 196 0 162 0 350 186 0 0 0 286 ...
## $ ExterQual : chr [1:1933] "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr [1:1933] "TA" "TA" "TA" "TA" ...
## $ Foundation : chr [1:1933] "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr [1:1933] "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr [1:1933] "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr [1:1933] "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr [1:1933] "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : num [1:1933] 706 978 486 216 655 ...
## $ BsmtFinType2 : chr [1:1933] "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
## $ BsmtUnfSF : num [1:1933] 150 284 434 540 490 317 952 140 134 177 ...
## $ TotalBsmtSF : num [1:1933] 856 1262 920 756 1145 ...
## $ Heating : chr [1:1933] "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr [1:1933] "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr [1:1933] "Y" "Y" "Y" "Y" ...
## $ Electrical : chr [1:1933] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : num [1:1933] 856 1262 920 961 1145 ...
## $ X2ndFlrSF : num [1:1933] 854 0 866 756 1053 ...
## $ LowQualFinSF : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : num [1:1933] 1710 1262 1786 1717 2198 ...
## $ BsmtFullBath : num [1:1933] 1 0 1 1 1 1 0 1 1 1 ...
## $ BsmtHalfBath : num [1:1933] 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : num [1:1933] 2 2 2 1 2 2 2 1 1 3 ...
## $ HalfBath : num [1:1933] 1 0 1 0 1 0 0 0 0 0 ...
## $ BedroomAbvGr : num [1:1933] 3 3 3 3 4 3 2 2 3 4 ...
## $ KitchenAbvGr : num [1:1933] 1 1 1 1 1 1 2 2 1 1 ...
## $ KitchenQual : chr [1:1933] "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : num [1:1933] 8 6 6 7 9 7 8 5 5 11 ...
## $ Functional : chr [1:1933] "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : num [1:1933] 0 1 1 1 1 1 2 2 0 2 ...
## $ FireplaceQu : chr [1:1933] NA "TA" "TA" "Gd" ...
## $ GarageType : chr [1:1933] "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : num [1:1933] 2003 1976 2001 1998 2000 ...
## $ GarageFinish : chr [1:1933] "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : num [1:1933] 2 2 2 3 3 2 2 1 1 3 ...
## $ GarageArea : num [1:1933] 548 460 608 642 836 636 468 205 384 736 ...
## $ GarageQual : chr [1:1933] "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr [1:1933] "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr [1:1933] "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : num [1:1933] 0 298 0 0 192 255 90 0 0 147 ...
## $ OpenPorchSF : num [1:1933] 61 0 42 35 84 57 0 4 0 21 ...
## $ EnclosedPorch: num [1:1933] 0 0 0 272 0 0 205 0 0 0 ...
## $ X3SsnPorch : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
## $ ScreenPorch : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr [1:1933] NA NA NA NA ...
## $ Fence : chr [1:1933] NA NA NA NA ...
## $ MiscFeature : chr [1:1933] NA NA NA NA ...
## $ MiscVal : num [1:1933] 0 0 0 0 0 0 0 0 0 0 ...
## $ MoSold : num [1:1933] 2 5 9 2 12 8 4 1 2 7 ...
## $ YrSold : num [1:1933] 2008 2007 2008 2006 2008 ...
## $ SaleType : chr [1:1933] "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr [1:1933] "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : num [1:1933] 208500 181500 223500 140000 250000 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. Id = col_double(),
## .. MSSubClass = col_double(),
## .. MSZoning = col_character(),
## .. LotFrontage = col_double(),
## .. LotArea = col_double(),
## .. Street = col_character(),
## .. Alley = col_character(),
## .. LotShape = col_character(),
## .. LandContour = col_character(),
## .. Utilities = col_character(),
## .. LotConfig = col_character(),
## .. LandSlope = col_character(),
## .. Neighborhood = col_character(),
## .. Condition1 = col_character(),
## .. Condition2 = col_character(),
## .. BldgType = col_character(),
## .. HouseStyle = col_character(),
## .. OverallQual = col_double(),
## .. OverallCond = col_double(),
## .. YearBuilt = col_double(),
## .. YearRemodAdd = col_double(),
## .. RoofStyle = col_character(),
## .. RoofMatl = col_character(),
## .. Exterior1st = col_character(),
## .. Exterior2nd = col_character(),
## .. MasVnrType = col_character(),
## .. MasVnrArea = col_double(),
## .. ExterQual = col_character(),
## .. ExterCond = col_character(),
## .. Foundation = col_character(),
## .. BsmtQual = col_character(),
## .. BsmtCond = col_character(),
## .. BsmtExposure = col_character(),
## .. BsmtFinType1 = col_character(),
## .. BsmtFinSF1 = col_double(),
## .. BsmtFinType2 = col_character(),
## .. BsmtFinSF2 = col_double(),
## .. BsmtUnfSF = col_double(),
## .. TotalBsmtSF = col_double(),
## .. Heating = col_character(),
## .. HeatingQC = col_character(),
## .. CentralAir = col_character(),
## .. Electrical = col_character(),
## .. X1stFlrSF = col_double(),
## .. X2ndFlrSF = col_double(),
## .. LowQualFinSF = col_double(),
## .. GrLivArea = col_double(),
## .. BsmtFullBath = col_double(),
## .. BsmtHalfBath = col_double(),
## .. FullBath = col_double(),
## .. HalfBath = col_double(),
## .. BedroomAbvGr = col_double(),
## .. KitchenAbvGr = col_double(),
## .. KitchenQual = col_character(),
## .. TotRmsAbvGrd = col_double(),
## .. Functional = col_character(),
## .. Fireplaces = col_double(),
## .. FireplaceQu = col_character(),
## .. GarageType = col_character(),
## .. GarageYrBlt = col_double(),
## .. GarageFinish = col_character(),
## .. GarageCars = col_double(),
## .. GarageArea = col_double(),
## .. GarageQual = col_character(),
## .. GarageCond = col_character(),
## .. PavedDrive = col_character(),
## .. WoodDeckSF = col_double(),
## .. OpenPorchSF = col_double(),
## .. EnclosedPorch = col_double(),
## .. X3SsnPorch = col_double(),
## .. ScreenPorch = col_double(),
## .. PoolArea = col_double(),
## .. PoolQC = col_character(),
## .. Fence = col_character(),
## .. MiscFeature = col_character(),
## .. MiscVal = col_double(),
## .. MoSold = col_double(),
## .. YrSold = col_double(),
## .. SaleType = col_character(),
## .. SaleCondition = col_character(),
## .. SalePrice = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
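The "New names" message in the read_csv() output above comes from the blank first header in the CSV, which readr renames to ...1. As a minimal sketch, assuming a small made-up file written to a temporary path (not the assignment data), the same behaviour can be reproduced and the column-specification message silenced like this:

```r
library(readr)

# Hypothetical toy file standing in for PricesBefore2009.csv (values are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c(",Id,SalePrice", "1,1,208500", "2,2,181500"), tmp)

# show_col_types = FALSE suppresses the column specification message;
# the blank first header is still repaired to ...1
toy <- read_csv(tmp, show_col_types = FALSE)

# Drop the auto-named index column by name
toy <- toy[, names(toy) != "...1"]
print(names(toy))
```

Dropping the column by name rather than by position keeps the step valid even if the column order changes.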
Convert the following columns to character or factor type: MSSubClass, OverallQual, OverallCond. Then, inspect the result to verify that your code works. Tip: You can refer to a column in before2009 as before2009$colName or before2009[, "colName"] or before2009[["colName"]]. Tip: You can use as.character() or factor(). Tip: You can print multiple columns using summary(before2009[, c("colName1", "colName2", "colName3")]). Rubric: 3 points (1 point each) for conversion and 1 point for verification.
# Convert columns to character or factor type
before2009$MSSubClass <- as.character(before2009$MSSubClass)
before2009$OverallQual <- as.factor(before2009$OverallQual)
before2009$OverallCond <- as.factor(before2009$OverallCond)
# Verify the conversion
summary(before2009[, c("MSSubClass", "OverallQual", "OverallCond")])
## MSSubClass OverallQual OverallCond
## Length:1933 5 :554 5 :1094
## Class :character 6 :485 6 : 354
## Mode :character 7 :395 7 : 260
## 8 :233 8 : 94
## 4 :141 4 : 69
## 9 : 68 3 : 30
## (Other): 57 (Other): 32
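The three conversions above can also be written as a single step. The sketch below assumes a made-up toy tibble that only borrows the assignment's column names, and uses dplyr's mutate() with across(). Converting matters because codes such as MSSubClass are labels, and lm() would otherwise fit a single numeric slope to them.

```r
library(dplyr)

# Hypothetical toy data: column names mirror the assignment, values are made up
toy <- tibble(
  MSSubClass  = c(60, 20, 60),
  OverallQual = c(7, 6, 7),
  OverallCond = c(5, 8, 5),
  LotArea     = c(8450, 9600, 11250)
)

# Convert the three coded columns to factors in one step
toy <- toy %>%
  mutate(across(c(MSSubClass, OverallQual, OverallCond), as.factor))

# Verify: the coded columns are factors, LotArea stays numeric
print(sapply(toy, class))
```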
How many NAs does each column have? Display your answer as a dataframe (or tibble) called beforeNAs. The dataset beforeNAs should contain two columns, one containing the names of the columns of before2009, and the other containing the number of NAs in each column. Then, print only the first 10 (head) rows of this dataframe to verify that your code worked. Tip: See what as_tibble(map(before2009, ~sum(is.na(.)))) does for you. Rubric: 6 points for constructing beforeNAs and 1 point for verification.
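# Count the NAs in each column, then transpose so each column name becomes a row name
temp <- map(before2009, ~sum(is.na(.))) %>% as_tibble() %>% t()
# Build the two-column tibble: column names and their NA counts
beforeNAs <- tibble(Columns = rownames(temp), NAs = temp[, 1])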
beforeNAs %>% head(10)
## # A tibble: 10 × 2
## Columns NAs
## <chr> <int>
## 1 ...1 0
## 2 Id 0
## 3 MSSubClass 0
## 4 MSZoning 3
## 5 LotFrontage 317
## 6 LotArea 0
## 7 Street 0
## 8 Alley 1797
## 9 LotShape 0
## 10 LandContour 0
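An alternative way to build the same kind of two-column table, sketched here on made-up toy data rather than before2009, is to combine base R's colSums(is.na(.)) with tibble's enframe(). This is one possible approach, not the required solution.

```r
library(tibble)

# Hypothetical toy data with some missing values
toy <- tibble(
  Id          = c(1, 2, 3, 4),
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, NA, "Grvl")
)

# colSums(is.na(.)) returns a named numeric vector with one NA count per column;
# enframe() turns the names and values into a two-column tibble
toyNAs <- enframe(colSums(is.na(toy)), name = "Columns", value = "NAs")
print(toyNAs)
```

This avoids the transpose step, since the names of the vector become the Columns column directly.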
Drop (remove) all the columns (except SalePrice) that have 20 or more missing values. Also, drop (remove) the columns called X1 (read in as ...1 by read_csv() here), Id, and Utilities (all its values are the same). While some of the columns we drop here may contribute to the predictive accuracy of our model, the majority of the information will be contained in the remaining variables. Then, print only the first 10 (head) rows of this dataframe to verify that your code worked. Tip: You can put the names of all the columns to be dropped into a vector called dropCols (based on NAs >= 20 and the other conditions above). Then, you can call dplyr::select(before2009, -all_of(dropCols)) to exclude all columns in dropCols. Rubric: 8 points for dropping the columns and 1 point for verification.
# Inspect beforeNAs before defining the columns to be dropped
str(beforeNAs)
## tibble [82 × 2] (S3: tbl_df/tbl/data.frame)
## $ Columns: chr [1:82] "...1" "Id" "MSSubClass" "MSZoning" ...
## $ NAs : Named int [1:82] 0 0 0 3 317 0 0 1797 0 0 ...
## ..- attr(*, "names")= chr [1:82] "...1" "Id" "MSSubClass" "MSZoning" ...
# Create a vector of the column names that have 20 or more NAs
dropCols <- beforeNAs$Columns[beforeNAs$NAs >= 20]
# Drop those columns; listing SalePrice afterwards keeps it even if it were in dropCols
before2009 <- before2009 %>%
  select(-any_of(dropCols), SalePrice)
# Drop Id, Utilities, and the first column (the ...1 row index)
before2009 <- before2009 %>%
  select(-Id, -Utilities, -1)
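The drop-by-threshold logic can be checked on a small example. The sketch below assumes a made-up toy tibble and a threshold of 2 NAs (instead of 20) so that the effect is visible on three rows. It uses all_of(), which raises an error if a listed column is missing, where any_of() would silently skip it.

```r
library(dplyr)

# Hypothetical toy data; the threshold of 2 is an assumption made for this small example
toy <- tibble(
  Id        = 1:3,
  Alley     = c(NA, NA, "Grvl"),
  LotArea   = c(8450, 9600, 11250),
  SalePrice = c(208500, 181500, 223500)
)

# Columns at or above the NA threshold, never including the response
naCounts <- colSums(is.na(toy))
dropCols <- setdiff(names(naCounts)[naCounts >= 2], "SalePrice")

# Add columns that are dropped for other reasons
dropCols <- union(dropCols, "Id")

toySmall <- toy %>% select(-all_of(dropCols))
print(head(toySmall, 10))
```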
Conduct a multiple linear regression on all the variables. Set SalePrice as the response and store the results in regBefore2009. Then, print the summary of regBefore2009 to verify that your code works. Tip: The formula for regression is lm(SalePrice ~ ., data = before2009) Rubric: 4 points for setting regBefore2009 and 1 point for verification.
# Fit multiple linear regression
regBefore2009 <- lm(SalePrice ~ ., data = before2009)
# Print the summary
summary(regBefore2009)
##
## Call:
## lm(formula = SalePrice ~ ., data = before2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -178924 -4505 -76 4002 157196
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.572e+06 9.826e+05 -2.618 0.008929 **
## MSSubClass150 1.061e+04 1.760e+04 0.603 0.546766
## MSSubClass160 2.518e+03 5.629e+03 0.447 0.654737
## MSSubClass180 2.272e+03 7.209e+03 0.315 0.752622
## MSSubClass190 -5.529e+02 1.459e+04 -0.038 0.969767
## MSSubClass20 8.063e+03 7.107e+03 1.135 0.256704
## MSSubClass30 7.803e+03 7.565e+03 1.031 0.302466
## MSSubClass40 7.856e+03 1.115e+04 0.705 0.481006
## MSSubClass45 1.458e+04 1.382e+04 1.055 0.291381
## MSSubClass50 1.195e+04 8.571e+03 1.395 0.163310
## MSSubClass60 9.306e+03 8.412e+03 1.106 0.268783
## MSSubClass70 1.163e+04 8.551e+03 1.360 0.173888
## MSSubClass75 8.722e+03 1.040e+04 0.839 0.401607
## MSSubClass80 -3.468e+03 1.022e+04 -0.339 0.734344
## MSSubClass85 5.338e+02 9.204e+03 0.058 0.953758
## MSSubClass90 -4.085e+03 8.549e+03 -0.478 0.632780
## MSZoningFV 4.095e+04 6.833e+03 5.993 2.51e-09 ***
## MSZoningRH 2.490e+04 6.983e+03 3.567 0.000372 ***
## MSZoningRL 3.009e+04 5.710e+03 5.269 1.55e-07 ***
## MSZoningRM 3.019e+04 5.331e+03 5.663 1.74e-08 ***
## LotArea 5.643e-01 7.679e-02 7.348 3.13e-13 ***
## StreetPave 2.965e+04 6.601e+03 4.492 7.54e-06 ***
## LotShapeIR2 4.777e+03 2.431e+03 1.965 0.049604 *
## LotShapeIR3 7.886e+03 5.038e+03 1.565 0.117678
## LotShapeReg 5.291e+02 9.504e+02 0.557 0.577754
## LandContourHLS 1.283e+04 2.912e+03 4.406 1.12e-05 ***
## LandContourLow 8.402e+02 4.041e+03 0.208 0.835341
## LandContourLvl 1.041e+04 2.160e+03 4.818 1.58e-06 ***
## LotConfigCulDSac 6.820e+03 1.939e+03 3.517 0.000449 ***
## LotConfigFR2 -7.573e+03 2.562e+03 -2.956 0.003161 **
## LotConfigFR3 -1.235e+04 5.077e+03 -2.432 0.015109 *
## LotConfigInside -3.084e+03 1.062e+03 -2.904 0.003728 **
## LandSlopeMod 1.175e+04 2.376e+03 4.943 8.46e-07 ***
## LandSlopeSev -2.130e+04 7.277e+03 -2.926 0.003476 **
## NeighborhoodBlueste -1.081e+04 9.673e+03 -1.118 0.263830
## NeighborhoodBrDale -2.852e+03 6.729e+03 -0.424 0.671753
## NeighborhoodBrkSide -1.128e+04 5.441e+03 -2.073 0.038294 *
## NeighborhoodClearCr -2.081e+04 5.730e+03 -3.631 0.000290 ***
## NeighborhoodCollgCr -1.806e+04 4.275e+03 -4.224 2.53e-05 ***
## NeighborhoodCrawfor 3.828e+03 4.936e+03 0.775 0.438221
## NeighborhoodEdwards -2.614e+04 4.686e+03 -5.578 2.83e-08 ***
## NeighborhoodGilbert -1.959e+04 4.545e+03 -4.312 1.71e-05 ***
## NeighborhoodIDOTRR -1.865e+04 5.904e+03 -3.159 0.001613 **
## NeighborhoodMeadowV -2.288e+04 6.805e+03 -3.362 0.000792 ***
## NeighborhoodMitchel -2.956e+04 4.753e+03 -6.220 6.26e-10 ***
## NeighborhoodNAmes -2.254e+04 4.574e+03 -4.929 9.09e-07 ***
## NeighborhoodNoRidge 1.408e+04 5.029e+03 2.799 0.005181 **
## NeighborhoodNPkVill 5.527e+03 1.022e+04 0.541 0.588839
## NeighborhoodNridgHt 1.329e+04 4.439e+03 2.994 0.002796 **
## NeighborhoodNWAmes -2.666e+04 4.712e+03 -5.659 1.79e-08 ***
## NeighborhoodOldTown -2.330e+04 5.415e+03 -4.303 1.78e-05 ***
## NeighborhoodSawyer -1.786e+04 4.745e+03 -3.763 0.000174 ***
## NeighborhoodSawyerW -1.319e+04 4.649e+03 -2.838 0.004596 **
## NeighborhoodSomerst -1.647e+04 5.199e+03 -3.169 0.001559 **
## NeighborhoodStoneBr 2.869e+04 5.073e+03 5.656 1.81e-08 ***
## NeighborhoodSWISU -1.355e+04 5.847e+03 -2.318 0.020557 *
## NeighborhoodTimber -1.265e+04 4.795e+03 -2.638 0.008427 **
## NeighborhoodVeenker -4.591e+03 5.740e+03 -0.800 0.423928
## Condition1Feedr 2.806e+03 2.954e+03 0.950 0.342241
## Condition1Norm 1.340e+04 2.470e+03 5.425 6.62e-08 ***
## Condition1PosA 1.274e+04 5.569e+03 2.288 0.022259 *
## Condition1PosN 7.585e+03 4.569e+03 1.660 0.097089 .
## Condition1RRAe -1.425e+04 4.652e+03 -3.063 0.002226 **
## Condition1RRAn 1.101e+04 3.895e+03 2.826 0.004765 **
## Condition1RRNe -1.495e+03 8.783e+03 -0.170 0.864886
## Condition1RRNn 6.815e+03 9.224e+03 0.739 0.460134
## Condition2Feedr -8.786e+03 1.040e+04 -0.845 0.398354
## Condition2Norm -3.235e+03 9.104e+03 -0.355 0.722366
## Condition2PosA -7.289e+03 1.437e+04 -0.507 0.612077
## Condition2PosN -2.439e+05 1.368e+04 -17.834 < 2e-16 ***
## Condition2RRAe -1.074e+05 2.368e+04 -4.537 6.11e-06 ***
## Condition2RRAn -7.330e+03 1.870e+04 -0.392 0.695047
## Condition2RRNn -3.959e+02 1.467e+04 -0.027 0.978476
## BldgType2fmCon -7.442e+02 1.277e+04 -0.058 0.953546
## BldgTypeDuplex NA NA NA NA
## BldgTypeTwnhs -1.811e+04 7.659e+03 -2.365 0.018163 *
## BldgTypeTwnhsE -1.752e+04 7.082e+03 -2.474 0.013460 *
## HouseStyle1.5Unf 7.953e+03 1.114e+04 0.714 0.475592
## HouseStyle1Story 1.086e+04 4.960e+03 2.190 0.028633 *
## HouseStyle2.5Fin -1.193e+04 1.025e+04 -1.164 0.244636
## HouseStyle2.5Unf -8.651e+03 7.330e+03 -1.180 0.238061
## HouseStyle2Story -6.055e+03 4.809e+03 -1.259 0.208150
## HouseStyleSFoyer 1.538e+04 6.497e+03 2.367 0.018057 *
## HouseStyleSLvl 1.827e+04 7.821e+03 2.336 0.019626 *
## OverallQual2 2.961e+04 1.988e+04 1.490 0.136484
## OverallQual3 3.496e+04 1.859e+04 1.880 0.060248 .
## OverallQual4 3.600e+04 1.847e+04 1.949 0.051465 .
## OverallQual5 4.016e+04 1.853e+04 2.167 0.030388 *
## OverallQual6 4.530e+04 1.858e+04 2.438 0.014872 *
## OverallQual7 5.296e+04 1.860e+04 2.847 0.004468 **
## OverallQual8 6.663e+04 1.867e+04 3.569 0.000368 ***
## OverallQual9 8.783e+04 1.890e+04 4.647 3.63e-06 ***
## OverallQual10 1.369e+05 1.948e+04 7.028 3.03e-12 ***
## OverallCond2 1.315e+04 2.332e+04 0.564 0.572872
## OverallCond3 2.004e+04 1.342e+04 1.494 0.135405
## OverallCond4 2.603e+04 1.336e+04 1.948 0.051607 .
## OverallCond5 3.390e+04 1.336e+04 2.536 0.011290 *
## OverallCond6 4.023e+04 1.342e+04 2.998 0.002759 **
## OverallCond7 4.572e+04 1.345e+04 3.400 0.000690 ***
## OverallCond8 5.177e+04 1.350e+04 3.836 0.000130 ***
## OverallCond9 6.002e+04 1.400e+04 4.288 1.90e-05 ***
## YearBuilt 3.595e+02 4.731e+01 7.599 4.92e-14 ***
## YearRemodAdd 1.055e+02 3.212e+01 3.286 0.001038 **
## RoofStyleGable -4.815e+03 8.656e+03 -0.556 0.578143
## RoofStyleGambrel -2.186e+03 9.745e+03 -0.224 0.822557
## RoofStyleHip -3.696e+03 8.704e+03 -0.425 0.671165
## RoofStyleMansard 5.986e+03 1.100e+04 0.544 0.586339
## RoofStyleShed 7.902e+04 1.515e+04 5.215 2.06e-07 ***
## RoofMatlCompShg 6.617e+05 2.022e+04 32.718 < 2e-16 ***
## RoofMatlMembran 7.396e+05 2.887e+04 25.620 < 2e-16 ***
## RoofMatlMetal 6.992e+05 2.872e+04 24.346 < 2e-16 ***
## RoofMatlRoll 6.538e+05 2.634e+04 24.818 < 2e-16 ***
## RoofMatlTar&Grv 6.672e+05 2.175e+04 30.676 < 2e-16 ***
## RoofMatlWdShake 6.437e+05 2.152e+04 29.911 < 2e-16 ***
## RoofMatlWdShngl 7.432e+05 2.132e+04 34.854 < 2e-16 ***
## Exterior1stAsphShn -1.850e+04 2.276e+04 -0.813 0.416323
## Exterior1stBrkComm -7.338e+03 1.387e+04 -0.529 0.596760
## Exterior1stBrkFace 7.740e+03 7.342e+03 1.054 0.291930
## Exterior1stCemntBd -1.151e+04 1.224e+04 -0.940 0.347353
## Exterior1stHdBoard -1.014e+04 7.086e+03 -1.431 0.152611
## Exterior1stImStucc -6.986e+04 1.799e+04 -3.884 0.000107 ***
## Exterior1stMetalSd 1.209e+03 7.963e+03 0.152 0.879352
## Exterior1stPlywood -1.547e+04 6.952e+03 -2.226 0.026170 *
## Exterior1stStone -2.641e+04 1.549e+04 -1.705 0.088463 .
## Exterior1stStucco -4.405e+03 8.089e+03 -0.544 0.586173
## Exterior1stVinylSd -1.611e+04 8.066e+03 -1.997 0.045966 *
## Exterior1stWd Sdng -8.627e+03 6.955e+03 -1.240 0.215021
## Exterior1stWdShing -3.086e+03 7.381e+03 -0.418 0.675971
## Exterior2ndAsphShn 2.238e+03 1.431e+04 0.156 0.875675
## Exterior2ndBrk Cmn 4.508e+03 1.355e+04 0.333 0.739463
## Exterior2ndBrkFace -3.269e+02 8.265e+03 -0.040 0.968454
## Exterior2ndCmentBd 1.024e+04 1.261e+04 0.812 0.417099
## Exterior2ndHdBoard 1.608e+03 7.571e+03 0.212 0.831880
## Exterior2ndImStucc 3.464e+04 8.906e+03 3.890 0.000104 ***
## Exterior2ndMetalSd -3.830e+03 8.357e+03 -0.458 0.646778
## Exterior2ndOther -1.006e+04 1.825e+04 -0.551 0.581676
## Exterior2ndPlywood 3.119e+03 7.283e+03 0.428 0.668547
## Exterior2ndStone 1.465e+04 1.425e+04 1.029 0.303846
## Exterior2ndStucco -2.973e+03 8.510e+03 -0.349 0.726834
## Exterior2ndVinylSd 1.104e+04 8.440e+03 1.308 0.191016
## Exterior2ndWd Sdng 4.350e+03 7.496e+03 0.580 0.561750
## Exterior2ndWd Shng -4.197e+03 7.836e+03 -0.536 0.592332
## MasVnrTypeBrkFace 8.602e+03 3.955e+03 2.175 0.029777 *
## MasVnrTypeNone 1.164e+04 3.963e+03 2.938 0.003349 **
## MasVnrTypeStone 1.311e+04 4.195e+03 3.125 0.001807 **
## MasVnrArea 1.884e+01 3.395e+00 5.550 3.32e-08 ***
## ExterQualFa 1.336e+04 6.288e+03 2.125 0.033726 *
## ExterQualGd -8.024e+03 3.085e+03 -2.601 0.009375 **
## ExterQualTA -1.003e+04 3.394e+03 -2.956 0.003160 **
## ExterCondFa -3.499e+03 7.066e+03 -0.495 0.620488
## ExterCondGd -1.037e+04 6.377e+03 -1.625 0.104247
## ExterCondTA -6.483e+03 6.373e+03 -1.017 0.309186
## FoundationCBlock 1.819e+03 1.801e+03 1.010 0.312470
## FoundationPConc 6.034e+03 1.964e+03 3.072 0.002159 **
## FoundationSlab 6.341e+03 4.468e+03 1.419 0.156047
## FoundationStone 4.014e+03 7.200e+03 0.557 0.577301
## FoundationWood -2.481e+04 1.172e+04 -2.116 0.034456 *
## BsmtFinSF1 3.360e+01 2.392e+00 14.048 < 2e-16 ***
## BsmtFinSF2 2.188e+01 3.201e+00 6.835 1.14e-11 ***
## BsmtUnfSF 1.270e+01 2.228e+00 5.698 1.43e-08 ***
## TotalBsmtSF NA NA NA NA
## HeatingGasA 2.343e+03 1.677e+04 0.140 0.888850
## HeatingGasW -6.304e+03 1.730e+04 -0.364 0.715552
## HeatingGrav -3.139e+03 1.894e+04 -0.166 0.868428
## HeatingOthW -2.839e+04 2.062e+04 -1.377 0.168675
## HeatingWall 4.965e+03 2.129e+04 0.233 0.815610
## HeatingQCFa -1.928e+03 2.706e+03 -0.712 0.476280
## HeatingQCGd -3.737e+03 1.183e+03 -3.160 0.001606 **
## HeatingQCPo 1.191e+04 1.245e+04 0.956 0.339135
## HeatingQCTA -3.674e+03 1.197e+03 -3.069 0.002180 **
## CentralAirY -3.873e+02 2.091e+03 -0.185 0.853039
## ElectricalFuseF -2.989e+03 3.552e+03 -0.842 0.400166
## ElectricalFuseP -9.314e+03 7.068e+03 -1.318 0.187772
## ElectricalMix 9.603e+03 2.643e+04 0.363 0.716405
## ElectricalSBrkr -1.970e+03 1.715e+03 -1.148 0.251012
## X1stFlrSF 5.350e+01 2.858e+00 18.718 < 2e-16 ***
## X2ndFlrSF 6.487e+01 3.035e+00 21.375 < 2e-16 ***
## LowQualFinSF 1.211e+01 1.099e+01 1.102 0.270735
## GrLivArea NA NA NA NA
## BsmtFullBath 2.144e+03 1.114e+03 1.925 0.054395 .
## BsmtHalfBath 8.477e+02 1.648e+03 0.514 0.607154
## FullBath 3.703e+03 1.295e+03 2.860 0.004292 **
## HalfBath -7.771e+01 1.241e+03 -0.063 0.950057
## BedroomAbvGr -3.233e+03 7.977e+02 -4.053 5.28e-05 ***
## KitchenAbvGr -6.874e+03 4.161e+03 -1.652 0.098739 .
## KitchenQualFa -1.566e+04 3.644e+03 -4.296 1.84e-05 ***
## KitchenQualGd -2.059e+04 2.131e+03 -9.663 < 2e-16 ***
## KitchenQualTA -1.784e+04 2.372e+03 -7.523 8.70e-14 ***
## TotRmsAbvGrd 7.895e+02 5.523e+02 1.430 0.153012
## FunctionalMaj2 -5.674e+03 9.442e+03 -0.601 0.547938
## FunctionalMin1 5.206e+03 5.666e+03 0.919 0.358376
## FunctionalMin2 6.318e+03 5.832e+03 1.083 0.278778
## FunctionalMod -5.581e+03 6.326e+03 -0.882 0.377781
## FunctionalSev -5.304e+04 1.823e+04 -2.910 0.003666 **
## FunctionalTyp 1.758e+04 5.075e+03 3.465 0.000544 ***
## Fireplaces 4.200e+03 7.970e+02 5.270 1.54e-07 ***
## GarageCars 2.574e+03 1.278e+03 2.015 0.044109 *
## GarageArea 1.710e+01 4.373e+00 3.910 9.61e-05 ***
## PavedDriveP -3.467e+03 2.994e+03 -1.158 0.247036
## PavedDriveY -2.394e+03 1.905e+03 -1.257 0.208887
## WoodDeckSF 1.471e+01 3.400e+00 4.325 1.61e-05 ***
## OpenPorchSF 1.655e+01 6.399e+00 2.586 0.009807 **
## EnclosedPorch 6.099e+00 6.780e+00 0.900 0.368493
## X3SsnPorch 5.566e+01 1.690e+01 3.293 0.001012 **
## ScreenPorch 2.423e+01 7.027e+00 3.449 0.000577 ***
## PoolArea 6.602e+01 9.534e+00 6.924 6.20e-12 ***
## MiscVal -5.596e-01 6.572e-01 -0.852 0.394563
## MoSold -5.360e+02 1.417e+02 -3.784 0.000160 ***
## YrSold 4.488e+02 4.868e+02 0.922 0.356638
## SaleTypeCon 3.738e+04 9.712e+03 3.849 0.000123 ***
## SaleTypeConLD 1.265e+04 5.268e+03 2.402 0.016432 *
## SaleTypeConLI -5.307e+03 9.946e+03 -0.534 0.593718
## SaleTypeConLw -5.056e+02 8.633e+03 -0.059 0.953308
## SaleTypeCWD 2.103e+04 5.336e+03 3.942 8.40e-05 ***
## SaleTypeNew 1.496e+04 8.815e+03 1.697 0.089953 .
## SaleTypeOth 1.122e+04 9.702e+03 1.156 0.247642
## SaleTypeWD -1.063e+03 2.538e+03 -0.419 0.675428
## SaleConditionAdjLand 9.288e+03 5.828e+03 1.594 0.111226
## SaleConditionAlloca 6.959e+03 6.002e+03 1.160 0.246397
## SaleConditionFamily -2.638e+02 3.233e+03 -0.082 0.934961
## SaleConditionNormal 4.383e+03 1.704e+03 2.573 0.010165 *
## SaleConditionPartial 5.334e+03 8.442e+03 0.632 0.527558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15630 on 1684 degrees of freedom
## (30 observations deleted due to missingness)
## Multiple R-squared: 0.966, Adjusted R-squared: 0.9616
## F-statistic: 219.2 on 218 and 1684 DF, p-value: < 2.2e-16
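The note "3 not defined because of singularities" and the NA rows for BldgTypeDuplex, TotalBsmtSF, and GrLivArea indicate that these terms are exact linear combinations of other predictors. The str() output earlier is consistent with this: in the first row, X1stFlrSF + X2ndFlrSF + LowQualFinSF = 856 + 854 + 0 = 1710 = GrLivArea, and BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF = 706 + 0 + 150 = 856 = TotalBsmtSF. A minimal sketch on made-up toy data (the names a, b, total, and price are hypothetical) shows the same behaviour:

```r
# Hypothetical toy data: total is constructed as the exact sum of a and b
set.seed(1)
toy <- data.frame(a = runif(20, 500, 1500), b = runif(20, 0, 1000))
toy$total <- toy$a + toy$b
toy$price <- 100 * toy$a + 80 * toy$b + rnorm(20, sd = 1000)

fit <- lm(price ~ a + b + total, data = toy)

# The redundant term gets an NA coefficient
print(coef(fit))

# alias() reports which terms are linear combinations of the others
print(alias(fit)$Complete)
```

Removing the redundant column before fitting gives the same fitted values without the NA row.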
Using the result of this and your general understanding of what variables should be important in determining SalePrice, choose a maximum of 15 variables and create another, smaller regression, and call it regBefore2009optimal. Then, print the summary of regBefore2009optimal to verify that your code works. Tip: Normally you would do a more detailed variable selection using a backward or step-wise selection approach but this is NOT required for this question. Tip: This is the formula for regression: lm(SalePrice ~ var1 + var2 + … + varN, data = before2009), where var1, etc. are the variables of your choice. Tip: Pick the variables with the lowest Pr(>|t|) Rubric: 8 points for setting regBefore2009optimal and 1 point for verification.
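# List 40 candidate terms ordered by the Pr(>|t|) column (the variables are then picked by hand).
# Caveat: coef() keeps the 3 NA (singular) terms but summary()$coefficients drops them, so the
# names printed here are shifted relative to the ranked p-values; rownames() of the table avoids this.
selected_vars <- names(coef(regBefore2009)[-1])[order(summary(regBefore2009)$coefficients[-1, 4])[1:40]]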
print(selected_vars)
## [1] "RoofMatlWdShake" "RoofStyleShed" "RoofMatlRoll"
## [4] "RoofMatlTar&Grv" "RoofMatlCompShg" "RoofMatlMetal"
## [7] "RoofMatlMembran" "ElectricalSBrkr" "ElectricalMix"
## [10] "Condition2PosN" "FoundationWood" "BedroomAbvGr"
## [13] "OverallCond9" "KitchenAbvGr" "LotArea"
## [16] "OverallQual9" "EnclosedPorch" "BsmtFinSF1"
## [19] "NeighborhoodMitchel" "MSZoningFV" "BsmtFinSF2"
## [22] "MSZoningRM" "NeighborhoodNWAmes" "NeighborhoodStoneBr"
## [25] "NeighborhoodEdwards" "MasVnrTypeStone" "Condition1Norm"
## [28] "FunctionalMod" "MSZoningRL" "RoofStyleMansard"
## [31] "LandSlopeMod" "NeighborhoodNAmes" "LandContourLvl"
## [34] "OverallQual8" "Condition2RRAe" "StreetPave"
## [37] "LandContourHLS" "GarageArea" "NeighborhoodGilbert"
## [40] "NeighborhoodOldTown"
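Several names in the list above (for example ElectricalSBrkr, with Pr(>|t|) of about 0.25 in the full summary) are not among the most significant terms. This happens because coef() includes the NA coefficients while the summary coefficient table omits them, so positions in the two objects do not line up. A minimal sketch on made-up toy data (all variable names here are hypothetical) ranks terms by taking the names from the coefficient table itself:

```r
# Hypothetical toy data with one aliased predictor (dup = x1 + x2)
set.seed(2)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
toy$dup <- toy$x1 + toy$x2
toy$x4 <- rnorm(50)
toy$y <- 5 * toy$x4 + 0.5 * toy$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2 + dup + x3 + x4, data = toy)

# coef() keeps the NA term; the summary coefficient table drops it
print(length(coef(fit)))
print(nrow(summary(fit)$coefficients))

# Rank by p-value using the table's own row names so names and p-values stay paired
coefTable <- summary(fit)$coefficients[-1, , drop = FALSE]
ranked <- rownames(coefTable)[order(coefTable[, "Pr(>|t|)"])]
print(ranked)
```

Because the names and p-values come from the same matrix, the aliased term can never shift the ranking.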
regBefore2009optimal <- lm(
  SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF + OverallQual + Condition2 +
    MSZoning + Neighborhood + LotArea + OverallCond + Foundation +
    BedroomAbvGr + EnclosedPorch + BsmtFinSF1 + BsmtFinSF2 + MasVnrType,
  data = before2009
)
# Print the summary
summary(regBefore2009optimal)
##
## Call:
## lm(formula = SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF + OverallQual +
## Condition2 + MSZoning + Neighborhood + LotArea + OverallCond +
## Foundation + BedroomAbvGr + EnclosedPorch + BsmtFinSF1 +
## BsmtFinSF2 + MasVnrType, data = before2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126988 -15520 -1473 13956 187597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.723e+05 5.381e+04 -12.493 < 2e-16 ***
## RoofMatlCompShg 5.839e+05 3.168e+04 18.433 < 2e-16 ***
## RoofMatlMembran 6.160e+05 4.481e+04 13.747 < 2e-16 ***
## RoofMatlMetal 6.445e+05 4.510e+04 14.292 < 2e-16 ***
## RoofMatlRoll 5.786e+05 4.287e+04 13.495 < 2e-16 ***
## RoofMatlTar&Grv 5.862e+05 3.250e+04 18.040 < 2e-16 ***
## RoofMatlWdShake 5.870e+05 3.362e+04 17.460 < 2e-16 ***
## RoofMatlWdShngl 6.633e+05 3.346e+04 19.825 < 2e-16 ***
## LandSlopeMod 1.080e+04 3.589e+03 3.009 0.002654 **
## LandSlopeSev -4.146e+04 1.213e+04 -3.418 0.000645 ***
## BsmtUnfSF 2.807e+01 2.443e+00 11.488 < 2e-16 ***
## OverallQual2 4.370e+04 3.308e+04 1.321 0.186691
## OverallQual3 3.858e+04 3.060e+04 1.261 0.207594
## OverallQual4 4.067e+04 3.048e+04 1.334 0.182334
## OverallQual5 4.746e+04 3.054e+04 1.554 0.120390
## OverallQual6 6.640e+04 3.062e+04 2.169 0.030228 *
## OverallQual7 9.085e+04 3.065e+04 2.964 0.003074 **
## OverallQual8 1.245e+05 3.070e+04 4.054 5.24e-05 ***
## OverallQual9 1.729e+05 3.092e+04 5.594 2.56e-08 ***
## OverallQual10 2.870e+05 3.162e+04 9.079 < 2e-16 ***
## Condition2Feedr -7.713e+03 1.734e+04 -0.445 0.656452
## Condition2Norm 1.214e+03 1.464e+04 0.083 0.933906
## Condition2PosA -4.793e+04 2.348e+04 -2.041 0.041388 *
## Condition2PosN -2.343e+05 2.270e+04 -10.320 < 2e-16 ***
## Condition2RRAe 2.725e+04 3.238e+04 0.842 0.400075
## Condition2RRAn -2.672e+04 3.244e+04 -0.824 0.410288
## Condition2RRNn 1.542e+03 2.498e+04 0.062 0.950802
## MSZoningFV 4.900e+04 1.121e+04 4.373 1.30e-05 ***
## MSZoningRH 2.496e+04 1.181e+04 2.113 0.034729 *
## MSZoningRL 4.049e+04 9.214e+03 4.394 1.18e-05 ***
## MSZoningRM 3.501e+04 8.642e+03 4.052 5.30e-05 ***
## NeighborhoodBlueste -2.377e+04 1.643e+04 -1.447 0.148186
## NeighborhoodBrDale -3.148e+04 9.979e+03 -3.155 0.001634 **
## NeighborhoodBrkSide -1.670e+04 8.244e+03 -2.025 0.042981 *
## NeighborhoodClearCr -1.231e+04 9.202e+03 -1.338 0.181206
## NeighborhoodCollgCr -9.268e+03 6.955e+03 -1.333 0.182795
## NeighborhoodCrawfor 8.885e+03 7.769e+03 1.144 0.252898
## NeighborhoodEdwards -3.248e+04 7.502e+03 -4.329 1.58e-05 ***
## NeighborhoodGilbert -3.688e+03 7.360e+03 -0.501 0.616348
## NeighborhoodIDOTRR -2.445e+04 8.843e+03 -2.765 0.005757 **
## NeighborhoodMeadowV -3.192e+04 9.947e+03 -3.209 0.001355 **
## NeighborhoodMitchel -3.232e+04 7.699e+03 -4.197 2.83e-05 ***
## NeighborhoodNAmes -2.768e+04 7.232e+03 -3.827 0.000134 ***
## NeighborhoodNoRidge 4.475e+04 8.059e+03 5.553 3.22e-08 ***
## NeighborhoodNPkVill -2.614e+04 1.190e+04 -2.196 0.028248 *
## NeighborhoodNridgHt 2.934e+04 7.500e+03 3.912 9.49e-05 ***
## NeighborhoodNWAmes -2.052e+04 7.602e+03 -2.699 0.007009 **
## NeighborhoodOldTown -2.335e+04 8.078e+03 -2.891 0.003889 **
## NeighborhoodSawyer -3.012e+04 7.633e+03 -3.946 8.23e-05 ***
## NeighborhoodSawyerW -8.314e+03 7.599e+03 -1.094 0.274112
## NeighborhoodSomerst -8.833e+03 8.598e+03 -1.027 0.304385
## NeighborhoodStoneBr 3.185e+04 8.446e+03 3.770 0.000168 ***
## NeighborhoodSWISU -2.465e+04 9.145e+03 -2.696 0.007091 **
## NeighborhoodTimber -6.031e+03 7.992e+03 -0.755 0.450570
## NeighborhoodVeenker 1.021e+04 9.536e+03 1.071 0.284485
## LotArea 1.321e+00 1.176e-01 11.231 < 2e-16 ***
## OverallCond2 1.281e+04 2.991e+04 0.428 0.668375
## OverallCond3 2.431e+04 2.223e+04 1.094 0.274248
## OverallCond4 3.282e+04 2.204e+04 1.489 0.136691
## OverallCond5 3.912e+04 2.199e+04 1.779 0.075474 .
## OverallCond6 4.538e+04 2.203e+04 2.060 0.039556 *
## OverallCond7 5.190e+04 2.205e+04 2.354 0.018690 *
## OverallCond8 5.651e+04 2.213e+04 2.553 0.010746 *
## OverallCond9 7.254e+04 2.274e+04 3.190 0.001445 **
## FoundationCBlock 4.564e+03 2.831e+03 1.612 0.107069
## FoundationPConc 1.883e+04 3.082e+03 6.109 1.22e-09 ***
## FoundationSlab 3.196e+04 6.716e+03 4.759 2.10e-06 ***
## FoundationStone -8.538e+03 1.225e+04 -0.697 0.485947
## FoundationWood 7.437e+03 2.099e+04 0.354 0.723098
## BedroomAbvGr 1.307e+04 9.057e+02 14.432 < 2e-16 ***
## EnclosedPorch -1.836e+00 1.095e+01 -0.168 0.866779
## BsmtFinSF1 5.586e+01 2.483e+00 22.499 < 2e-16 ***
## BsmtFinSF2 5.029e+01 4.559e+00 11.030 < 2e-16 ***
## MasVnrTypeBrkFace 2.000e+04 6.728e+03 2.973 0.002989 **
## MasVnrTypeNone 1.593e+04 6.630e+03 2.402 0.016394 *
## MasVnrTypeStone 2.614e+04 7.143e+03 3.660 0.000260 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28640 on 1828 degrees of freedom
## (29 observations deleted due to missingness)
## Multiple R-squared: 0.8759, Adjusted R-squared: 0.8708
## F-statistic: 172 on 75 and 1828 DF, p-value: < 2.2e-16
Display diagnostic plots of your regression. Tip: The diagnostic plots include a QQ-Plot, a Residuals vs Fitted Values plot, a \(\sqrt{|Standardized \; Residuals|}\) vs Fitted Values plot, and a Standardized Residuals vs Leverage plot. Do not worry if your residuals have a slight curve to them. Tip: Google “Plotting Diagnostics for Linear Models - CRAN” and don’t use any arguments for the function autoplot at this time.
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.3.2
regBefore2009optimal %>%
autoplot()
## Warning: Removed 1904 rows containing missing values (`geom_line()`).
## Warning: Removed 7 rows containing missing values (`geom_point()`).
## Warning: Removed 14 rows containing missing values (`geom_line()`).
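The four diagnostic panels are drawn from quantities that base R can extract from any fitted `lm` object. The sketch below is only an illustration: it fits a toy model on the built-in `mtcars` data (an assumption, not the housing regression) and collects the values each panel plots.

```r
# Toy model on built-in data; stands in for regBefore2009optimal
fit <- lm(mpg ~ wt + hp, data = mtcars)

diag_df <- data.frame(
  fitted             = unname(fitted(fit)),               # x-axis of Residuals vs Fitted and Scale-Location
  resid              = unname(resid(fit)),                # y-axis of Residuals vs Fitted
  std_resid          = unname(rstandard(fit)),            # y-axis of Q-Q and Residuals vs Leverage
  sqrt_abs_std_resid = unname(sqrt(abs(rstandard(fit)))), # y-axis of Scale-Location
  leverage           = unname(hatvalues(fit))             # x-axis of Residuals vs Leverage
)

# Show the three observations with the largest standardized residuals
head(diag_df[order(-abs(diag_df$std_resid)), ], 3)
```

Sorting this table by `std_resid` or `leverage` gives a numeric counterpart to reading row labels off the plots.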
Now read in the PricesAfter2009.csv data and assign it to a variable called after2009. The dataset contains data for house prices after 2009. Then, repeat your data manipulation operations from Q2 and Q3 on this new dataset. Drop (remove) unnecessary columns that you dropped in Q5. Rubric: 1 point for reading and 4 points for data manipulation.
after2009 <- read.csv("PricesAfter2009.csv")
# Inspect the data using head() and str() commands
head(after2009)
## X Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 6 50 RL 85 14115 Pave <NA> IR1
## 2 2 8 60 RL NA 10382 Pave <NA> IR1
## 3 3 17 20 RL NA 11241 Pave <NA> IR1
## 4 4 20 20 RL 70 7560 Pave <NA> Reg
## 5 5 25 20 RL NA 8246 Pave <NA> IR1
## 6 6 26 20 RL 110 14230 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1 Lvl AllPub Inside Gtl Mitchel Norm Norm
## 2 Lvl AllPub Corner Gtl NWAmes PosN Norm
## 3 Lvl AllPub CulDSac Gtl NAmes Norm Norm
## 4 Lvl AllPub Inside Gtl NAmes Norm Norm
## 5 Lvl AllPub Inside Gtl Sawyer Norm Norm
## 6 Lvl AllPub Corner Gtl NridgHt Norm Norm
## BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 1Fam 1.5Fin 5 5 1993 1995 Gable
## 2 1Fam 2Story 7 6 1973 1973 Gable
## 3 1Fam 1Story 6 7 1970 1970 Gable
## 4 1Fam 1Story 5 6 1958 1965 Hip
## 5 1Fam 1Story 5 8 1968 2001 Gable
## 6 1Fam 1Story 8 5 2007 2007 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1 CompShg VinylSd VinylSd None 0 TA TA
## 2 CompShg HdBoard HdBoard Stone 240 TA TA
## 3 CompShg Wd Sdng Wd Sdng BrkFace 180 TA TA
## 4 CompShg BrkFace Plywood None 0 TA TA
## 5 CompShg Plywood Plywood None 0 TA Gd
## 6 CompShg VinylSd VinylSd Stone 640 Gd TA
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1 Wood Gd TA No GLQ 732
## 2 CBlock Gd TA Mn ALQ 859
## 3 CBlock TA TA No ALQ 578
## 4 CBlock TA TA No LwQ 504
## 5 CBlock TA TA Mn Rec 188
## 6 PConc Gd TA No Unf 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1 Unf 0 64 796 GasA Ex Y
## 2 BLQ 32 216 1107 GasA Ex Y
## 3 Unf 0 426 1004 GasA Ex Y
## 4 Unf 0 525 1029 GasA TA Y
## 5 ALQ 668 204 1060 GasA Ex Y
## 6 Unf 0 1566 1566 GasA Ex Y
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1 SBrkr 796 566 0 1362 1
## 2 SBrkr 1107 983 0 2090 1
## 3 SBrkr 1004 0 0 1004 1
## 4 SBrkr 1339 0 0 1339 0
## 5 SBrkr 1060 0 0 1060 1
## 6 SBrkr 1600 0 0 1600 0
## BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1 0 1 1 1 1 TA
## 2 0 2 1 3 1 TA
## 3 0 1 0 2 1 TA
## 4 0 1 0 3 1 TA
## 5 0 1 0 3 1 Gd
## 6 0 2 0 3 1 Gd
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1 5 Typ 0 <NA> Attchd 1993
## 2 7 Typ 2 TA Attchd 1973
## 3 5 Typ 1 TA Attchd 1970
## 4 6 Min1 0 <NA> Attchd 1958
## 5 6 Typ 1 TA Attchd 1968
## 6 7 Typ 1 Gd Attchd 2007
## GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1 Unf 2 480 TA TA Y
## 2 RFn 2 484 TA TA Y
## 3 Fin 2 480 TA TA Y
## 4 Unf 1 294 TA TA Y
## 5 Unf 1 270 TA TA Y
## 6 RFn 3 890 TA TA Y
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1 40 30 0 320 0 0 NA
## 2 235 204 228 0 0 0 NA
## 3 0 0 0 0 0 0 NA
## 4 0 0 0 0 0 0 NA
## 5 406 90 0 0 0 0 NA
## 6 0 56 0 0 0 0 NA
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 MnPrv Shed 700 10 2009 WD Normal 143000
## 2 <NA> Shed 350 11 2009 WD Normal 200000
## 3 <NA> Shed 700 3 2010 WD Normal 149000
## 4 MnPrv <NA> 0 5 2009 COD Abnorml 139000
## 5 MnPrv <NA> 0 5 2010 WD Normal 154000
## 6 <NA> <NA> 0 7 2009 WD Normal 256300
str(after2009)
## 'data.frame': 986 obs. of 82 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Id : int 6 8 17 20 25 26 27 28 34 37 ...
## $ MSSubClass : int 50 60 20 20 20 20 20 20 20 20 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 85 NA NA 70 NA 110 60 98 70 112 ...
## $ LotArea : int 14115 10382 11241 7560 8246 14230 7200 11478 10552 10859 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "IR1" "IR1" "IR1" "Reg" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "Corner" "CulDSac" "Inside" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "Mitchel" "NWAmes" "NAmes" "NAmes" ...
## $ Condition1 : chr "Norm" "PosN" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "1.5Fin" "2Story" "1Story" "1Story" ...
## $ OverallQual : int 5 7 6 5 5 8 5 8 5 5 ...
## $ OverallCond : int 5 6 7 6 8 5 7 5 5 5 ...
## $ YearBuilt : int 1993 1973 1970 1958 1968 2007 1951 2007 1959 1994 ...
## $ YearRemodAdd : int 1995 1973 1970 1965 2001 2007 2000 2008 1959 1995 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Hip" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "HdBoard" "Wd Sdng" "BrkFace" ...
## $ Exterior2nd : chr "VinylSd" "HdBoard" "Wd Sdng" "Plywood" ...
## $ MasVnrType : chr "None" "Stone" "BrkFace" "None" ...
## $ MasVnrArea : int 0 240 180 0 0 640 0 200 0 0 ...
## $ ExterQual : chr "TA" "TA" "TA" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "Wood" "CBlock" "CBlock" "CBlock" ...
## $ BsmtQual : chr "Gd" "Gd" "TA" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "TA" ...
## $ BsmtExposure : chr "No" "Mn" "No" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "ALQ" "LwQ" ...
## $ BsmtFinSF1 : int 732 859 578 504 188 0 234 1218 1018 0 ...
## $ BsmtFinType2 : chr "Unf" "BLQ" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 32 0 0 668 0 486 0 0 0 ...
## $ BsmtUnfSF : int 64 216 426 525 204 1566 180 486 380 1097 ...
## $ TotalBsmtSF : int 796 1107 1004 1029 1060 1566 900 1704 1398 1097 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "TA" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 796 1107 1004 1339 1060 1600 900 1704 1700 1097 ...
## $ X2ndFlrSF : int 566 983 0 0 0 0 0 0 0 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1362 2090 1004 1339 1060 1600 900 1704 1700 1097 ...
## $ BsmtFullBath : int 1 1 1 0 1 0 0 1 0 0 ...
## $ BsmtHalfBath : int 0 0 0 0 0 0 1 0 1 0 ...
## $ FullBath : int 1 2 1 1 1 2 1 2 1 1 ...
## $ HalfBath : int 1 1 0 0 0 0 0 0 1 1 ...
## $ BedroomAbvGr : int 1 3 2 3 3 3 3 3 4 3 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ KitchenQual : chr "TA" "TA" "TA" "TA" ...
## $ TotRmsAbvGrd : int 5 7 5 6 6 7 5 7 6 6 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Min1" ...
## $ Fireplaces : int 0 2 1 0 1 1 0 1 1 0 ...
## $ FireplaceQu : chr NA "TA" "TA" NA ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ GarageYrBlt : int 1993 1973 1970 1958 1968 2007 2005 2008 1959 1995 ...
## $ GarageFinish : chr "Unf" "RFn" "Fin" "Unf" ...
## $ GarageCars : int 2 2 2 1 1 3 2 3 2 2 ...
## $ GarageArea : int 480 484 480 294 270 890 576 772 447 672 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 40 235 0 0 406 0 222 0 0 392 ...
## $ OpenPorchSF : int 30 204 0 0 90 56 32 50 38 64 ...
## $ EnclosedPorch: int 0 228 0 0 0 0 0 0 0 0 ...
## $ X3SsnPorch : int 320 0 0 0 0 0 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : logi NA NA NA NA NA NA ...
## $ Fence : chr "MnPrv" NA NA "MnPrv" ...
## $ MiscFeature : chr "Shed" "Shed" "Shed" NA ...
## $ MiscVal : int 700 350 700 0 0 0 0 0 0 0 ...
## $ MoSold : int 10 11 3 5 5 7 5 5 4 6 ...
## $ YrSold : int 2009 2009 2010 2009 2010 2009 2010 2010 2010 2009 ...
## $ SaleType : chr "WD" "WD" "WD" "COD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : num 143000 200000 149000 139000 154000 ...
# Convert columns to character or factor type
after2009$MSSubClass <- as.character(after2009$MSSubClass)
after2009$OverallQual <- as.factor(after2009$OverallQual)
after2009$OverallCond <- as.factor(after2009$OverallCond)
# Verify the conversion
summary(after2009[, c("MSSubClass", "OverallQual", "OverallCond")])
## MSSubClass OverallQual OverallCond
## Length:986 5 :271 5 :551
## Class :character 6 :246 6 :177
## Mode :character 7 :205 7 :130
## 8 :109 8 : 50
## 4 : 85 4 : 32
## 9 : 39 3 : 20
## (Other): 31 (Other): 26
temp = map(after2009, ~sum(is.na(.))) %>% as_tibble() %>% t()
afterNAs = tibble('Columns' = rownames(temp), "NAs" = temp[,1])
afterNAs %>% head(10)
## # A tibble: 10 × 2
## Columns NAs
## <chr> <int>
## 1 X 0
## 2 Id 0
## 3 MSSubClass 0
## 4 MSZoning 1
## 5 LotFrontage 169
## 6 LotArea 0
## 7 Street 0
## 8 Alley 924
## 9 LotShape 0
## 10 LandContour 0
# Inspect the structure of the NA-count table
str(afterNAs)
## tibble [82 × 2] (S3: tbl_df/tbl/data.frame)
## $ Columns: chr [1:82] "X" "Id" "MSSubClass" "MSZoning" ...
## $ NAs : Named int [1:82] 0 0 0 1 169 0 0 924 0 0 ...
## ..- attr(*, "names")= chr [1:82] "X" "Id" "MSSubClass" "MSZoning" ...
# Create a vector of column names to drop
dropCols <- afterNAs$Columns[afterNAs$NAs >= 20]
# Drop specified columns
after2009 <- after2009 %>%
select(-any_of(dropCols),SalePrice)
# Drop specified columns and the first column
after2009 <- after2009 %>%
select(-Id, -Utilities, -1)
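The NA-count-then-drop steps above can also be sketched in base R. This is a minimal illustration on a hypothetical six-row data frame (the values and the cutoff of 3 are assumptions chosen for the toy data, not the assignment's threshold of 20).

```r
# Hypothetical toy data standing in for after2009
toy <- data.frame(
  Id        = 1:6,
  Alley     = c(NA, NA, NA, NA, "Grvl", NA),
  LotArea   = c(8450, 9600, NA, 9550, 14260, 14115),
  SalePrice = c(208500, 181500, 223500, NA, 250000, 143000)
)

# Number of missing values per column, as a named vector
na_counts <- colSums(is.na(toy))

# Columns at or above the cutoff are dropped; SalePrice is always kept
threshold <- 3
drop_cols <- setdiff(names(na_counts)[na_counts >= threshold], "SalePrice")

toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
```

Protecting the response column explicitly with `setdiff()` avoids relying on it happening to fall below the cutoff.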
Local authorities found in 2011 that there was housing fraud taking place in several neighborhoods, including NAmes, Gilbert and NridgHt. Make a density plot of SalePrice (after 2009) for all the neighborhoods (with or without fraud) and arrange them all in a grid. Tip: Data scientists often use density plots to catch outliers or anomalous activity. Tip: I recommend using ggplot2 for these plots with facet_wrap(~ Neighborhood). Your call will look something like this: ggplot(data = …, aes(x = SalePrice)) + geom_density() + facet_wrap(~ …) + ggtitle(“…”) + xlab(‘…’)
# Assuming you have loaded the necessary libraries and the after2009 data
library(ggplot2)
# Subset the neighborhoods flagged for fraud (kept for reference; the plot below uses all neighborhoods)
fraud_neighborhoods <- c("NAmes", "Gilbert", "NridgHt")
after2009_fraud <- after2009[after2009$Neighborhood %in% fraud_neighborhoods, ]
# Create a density plot with ggplot2 and facet_wrap
ggplot(data = after2009, aes(x = SalePrice)) +
geom_density() +
facet_wrap(~ Neighborhood) +
ggtitle("Density Plot of SalePrice by Neighborhood (After 2009)") +
xlab('SalePrice') +
theme_minimal() # You can customize the theme if needed
## Warning: Removed 5 rows containing non-finite values (`stat_density()`).
As you can see, the density plot for NAmes between 2009 and 2010 does not look any different from other density plots. If there are fraudsters, they are making an effort to mask their activities. Now, make 2 density plots, one for SalePrice in NAmes before 2009 and the other for after 2009. Compare the two to see if there is visual evidence of anomalous activity. Then, do the same for Gilbert and see if anything anomalous is detectable between these plots. Tip: I recommend using the gridExtra library’s grid.arrange function for all four plots so you can see the plots for each neighborhood side by side.
# Assuming you have loaded the necessary libraries and the before2009 and after2009 data
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Function to create density plot for a neighborhood
create_density_plot <- function(data, neighborhood, title) {
ggplot(data = data[data$Neighborhood == neighborhood, ], aes(x = SalePrice)) +
geom_density() +
ggtitle(title) +
xlab('SalePrice') +
theme_minimal()
}
# Create density plots for NAmes and Gilbert before and after 2009
plot_NAmes_before <- create_density_plot(before2009, "NAmes", "Density Plot - NAmes (Before 2009)")
plot_NAmes_after <- create_density_plot(after2009, "NAmes", "Density Plot - NAmes (After 2009)")
plot_Gilbert_before <- create_density_plot(before2009, "Gilbert", "Density Plot - Gilbert (Before 2009)")
plot_Gilbert_after <- create_density_plot(after2009, "Gilbert", "Density Plot - Gilbert (After 2009)")
# Arrange plots side by side
grid.arrange(plot_NAmes_before, plot_NAmes_after, plot_Gilbert_before, plot_Gilbert_after, ncol = 2)
## Warning: Removed 4 rows containing non-finite values (`stat_density()`).
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
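A before/after density comparison can be backed up with a few summary numbers. The sketch below uses simulated prices (all values are hypothetical) and assumes a crude masking scheme in which some sales are reported at the earlier mean. It only shows what such masking does to the spread.

```r
set.seed(1)
before <- rnorm(200, mean = 145000, sd = 30000)  # hypothetical pre-2009 prices
after  <- before
after[1:60] <- mean(before)                      # assumed masking: 60 sales reported at the old mean

# Compare location and spread of the two periods
spread <- data.frame(
  period = c("before", "after"),
  mean   = c(mean(before), mean(after)),
  sd     = c(sd(before), sd(after)),
  iqr    = c(IQR(before), IQR(after))
)
spread
```

A similar mean paired with a clearly smaller standard deviation is the numeric counterpart of a density curve that has become unusually sharp around its center.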
We pick up this story from new Question 12 below and continue the investigation after you have learned regression in more detail. Tip: I bookended this assignment with the regression module so you can reinforce your understanding and apply it. (I also wanted to have empathy for your learning-life blend.) This will also, hopefully, cement your understanding and build your confidence.
Analyze the visualizations above for Gilbert and NAmes to detect possible fraud. Tip: Look for a fraud pattern.
### This section doesn't require code. Just answer the question as a comment.
# Density plots normally peak around the mean, but "Gilbert" shows two peaks plus an additional peak around 145000, which is a possible sign of fraud. The "NAmes" values also look too tightly concentrated around the average, which could likewise indicate fraud.
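One concrete fraud pattern to look for is the same exact price reported many times within a neighborhood, which genuine sale prices rarely do. The sketch below counts repeated prices on a hypothetical toy table. The neighborhood labels, prices, and the cutoff of "more than twice" are all assumptions.

```r
# Hypothetical toy sales; neighborhood "B" repeats one price four times
toy <- data.frame(
  Neighborhood = c(rep("A", 6), rep("B", 6)),
  SalePrice    = c(120000, 135500, 150000, 142000, 161000, 128000,
                   140000, 140000, 140000, 140000, 155000, 131000)
)

# Count how often each exact price occurs within each neighborhood
price_counts <- as.data.frame(table(toy$Neighborhood, toy$SalePrice),
                              stringsAsFactors = FALSE)
names(price_counts) <- c("Neighborhood", "SalePrice", "n")
price_counts$SalePrice <- as.numeric(price_counts$SalePrice)

# Keep only prices reported more than twice (an arbitrary cutoff)
suspicious <- price_counts[price_counts$n > 2, ]
suspicious
```

A table like this points to the exact value behind a suspicious spike, which a density plot alone does not reveal.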
You may feel that the fraudsters were not very careful in masking their activity after identifying the fraud pattern. However, we don’t have sufficient evidence to claim that this is fraudulent activity (just based on the density plots). We will now use multiple linear regression to attempt to get more evidence. Run a regression on the data in after2009 using variables you already know to be good at predicting the SalePrice. Store the result in variable called regAfter2009optimal. Then print summary of regAfter2009optimal to verify that your code works. Tip: You can reuse your previous work on before2009. Rubric: 4 points for regression, 1 point for printing summary.
# Selecting the top 15 variables with the lowest Pr(>|t|) values
regAfter2009optimal <- lm(SalePrice ~ RoofMatl + KitchenQual + OverallQual + Condition2 + MSZoning + Neighborhood + LotArea +OverallCond +Foundation + BedroomAbvGr + ExterQual + BsmtFinSF1 +BsmtFinSF2 + MSSubClass + BsmtUnfSF, data = after2009)
# Print the summary
summary(regAfter2009optimal)
##
## Call:
## lm(formula = SalePrice ~ RoofMatl + KitchenQual + OverallQual +
## Condition2 + MSZoning + Neighborhood + LotArea + OverallCond +
## Foundation + BedroomAbvGr + ExterQual + BsmtFinSF1 + BsmtFinSF2 +
## MSSubClass + BsmtUnfSF, data = after2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -276926 -11883 76 13297 209716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.398e+04 5.027e+04 1.273 0.203394
## RoofMatlTar&Grv -2.977e+03 1.227e+04 -0.243 0.808350
## RoofMatlWdShake 5.169e+04 2.488e+04 2.077 0.038071 *
## RoofMatlWdShngl -7.834e+03 3.065e+04 -0.256 0.798336
## KitchenQualFa -3.421e+04 9.574e+03 -3.573 0.000372 ***
## KitchenQualGd -2.741e+04 6.134e+03 -4.468 8.89e-06 ***
## KitchenQualTA -3.147e+04 6.666e+03 -4.721 2.72e-06 ***
## OverallQual2 -6.917e+04 4.065e+04 -1.701 0.089222 .
## OverallQual3 -5.758e+04 4.058e+04 -1.419 0.156331
## OverallQual4 -5.078e+04 3.980e+04 -1.276 0.202312
## OverallQual5 -4.972e+04 4.000e+04 -1.243 0.214174
## OverallQual6 -3.574e+04 4.013e+04 -0.891 0.373375
## OverallQual7 -2.085e+04 4.027e+04 -0.518 0.604729
## OverallQual8 -3.968e+03 4.046e+04 -0.098 0.921893
## OverallQual9 2.645e+04 4.120e+04 0.642 0.520965
## OverallQual10 5.314e+04 4.296e+04 1.237 0.216457
## Condition2Feedr 3.155e+04 3.640e+04 0.867 0.386233
## Condition2Norm 1.827e+04 3.294e+04 0.555 0.579241
## Condition2PosA 6.630e+04 4.582e+04 1.447 0.148213
## Condition2PosN -2.183e+05 4.615e+04 -4.730 2.61e-06 ***
## MSZoningFV 3.040e+04 1.808e+04 1.681 0.093068 .
## MSZoningRH 2.227e+04 1.707e+04 1.305 0.192278
## MSZoningRL 2.850e+04 1.377e+04 2.070 0.038718 *
## MSZoningRM 2.944e+04 1.289e+04 2.284 0.022578 *
## NeighborhoodBlueste -4.273e+03 1.864e+04 -0.229 0.818750
## NeighborhoodBrDale -6.282e+03 1.815e+04 -0.346 0.729400
## NeighborhoodBrkSide -1.042e+04 1.509e+04 -0.691 0.489874
## NeighborhoodClearCr -2.594e+03 1.518e+04 -0.171 0.864360
## NeighborhoodCollgCr -1.154e+04 1.240e+04 -0.930 0.352424
## NeighborhoodCrawfor 2.400e+04 1.374e+04 1.747 0.080900 .
## NeighborhoodEdwards -2.634e+04 1.315e+04 -2.004 0.045410 *
## NeighborhoodGilbert -1.910e+04 1.293e+04 -1.477 0.139978
## NeighborhoodIDOTRR -2.964e+04 1.692e+04 -1.752 0.080070 .
## NeighborhoodMeadowV -2.545e+04 1.806e+04 -1.409 0.159108
## NeighborhoodMitchel -1.454e+04 1.314e+04 -1.107 0.268675
## NeighborhoodNAmes -2.106e+04 1.281e+04 -1.643 0.100714
## NeighborhoodNoRidge 2.979e+04 1.379e+04 2.160 0.031058 *
## NeighborhoodNPkVill -2.485e+03 1.473e+04 -0.169 0.866033
## NeighborhoodNridgHt 3.254e+03 1.229e+04 0.265 0.791237
## NeighborhoodNWAmes -1.549e+04 1.327e+04 -1.167 0.243553
## NeighborhoodOldTown -2.707e+04 1.481e+04 -1.827 0.068007 .
## NeighborhoodSawyer -2.699e+04 1.344e+04 -2.009 0.044882 *
## NeighborhoodSawyerW -9.161e+03 1.263e+04 -0.725 0.468477
## NeighborhoodSomerst 1.505e+03 1.611e+04 0.093 0.925608
## NeighborhoodStoneBr 4.206e+04 1.384e+04 3.040 0.002435 **
## NeighborhoodSWISU -1.620e+04 1.477e+04 -1.097 0.272957
## NeighborhoodTimber -8.974e+03 1.377e+04 -0.651 0.514892
## NeighborhoodVeenker 2.162e+04 1.973e+04 1.096 0.273572
## LotArea 8.138e-01 1.377e-01 5.908 4.90e-09 ***
## OverallCond2 6.608e+04 2.589e+04 2.552 0.010862 *
## OverallCond3 5.661e+04 2.373e+04 2.385 0.017285 *
## OverallCond4 5.811e+04 2.349e+04 2.474 0.013538 *
## OverallCond5 6.654e+04 2.298e+04 2.895 0.003881 **
## OverallCond6 6.772e+04 2.302e+04 2.942 0.003346 **
## OverallCond7 7.402e+04 2.304e+04 3.213 0.001360 **
## OverallCond8 6.858e+04 2.342e+04 2.928 0.003496 **
## OverallCond9 7.267e+04 2.419e+04 3.004 0.002739 **
## FoundationCBlock 2.981e+03 4.553e+03 0.655 0.512875
## FoundationPConc 1.389e+04 4.971e+03 2.794 0.005310 **
## FoundationSlab 3.349e+04 9.557e+03 3.504 0.000481 ***
## FoundationStone 3.245e+04 1.473e+04 2.204 0.027808 *
## FoundationWood -1.024e+04 1.912e+04 -0.536 0.592293
## BedroomAbvGr 7.871e+03 1.613e+03 4.880 1.26e-06 ***
## ExterQualFa -5.173e+04 1.258e+04 -4.111 4.31e-05 ***
## ExterQualGd -4.107e+04 8.145e+03 -5.042 5.58e-07 ***
## ExterQualTA -4.873e+04 8.917e+03 -5.465 5.99e-08 ***
## BsmtFinSF1 6.805e+01 4.328e+00 15.721 < 2e-16 ***
## BsmtFinSF2 4.776e+01 7.003e+00 6.820 1.68e-11 ***
## MSSubClass160 1.273e+04 7.919e+03 1.608 0.108250
## MSSubClass180 9.389e+03 1.753e+04 0.535 0.592450
## MSSubClass190 1.566e+04 1.130e+04 1.385 0.166303
## MSSubClass20 1.539e+04 6.024e+03 2.555 0.010776 *
## MSSubClass30 1.031e+04 8.556e+03 1.205 0.228425
## MSSubClass40 6.438e+03 2.535e+04 0.254 0.799547
## MSSubClass45 -4.934e+02 1.933e+04 -0.026 0.979639
## MSSubClass50 2.449e+04 7.491e+03 3.270 0.001118 **
## MSSubClass60 4.511e+04 6.600e+03 6.835 1.52e-11 ***
## MSSubClass70 2.301e+04 9.048e+03 2.543 0.011163 *
## MSSubClass75 3.285e+04 1.598e+04 2.056 0.040099 *
## MSSubClass80 3.190e+04 7.851e+03 4.063 5.27e-05 ***
## MSSubClass85 1.594e+04 9.162e+03 1.740 0.082234 .
## MSSubClass90 2.064e+04 8.456e+03 2.440 0.014870 *
## BsmtUnfSF 3.715e+01 4.366e+00 8.509 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29720 on 898 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8599, Adjusted R-squared: 0.8471
## F-statistic: 67.21 on 82 and 898 DF, p-value: < 2.2e-16
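Reusing earlier regression work can be done by refitting the same model specification on a new data set. The sketch below shows one way using `update()`. It runs on a hypothetical split of the built-in `mtcars` data rather than the housing files, and it is an alternative to retyping the formula, not the approach used in the chunk above (which chose a different variable set for after2009).

```r
# Hypothetical split standing in for before2009 / after2009
before_toy <- mtcars[1:16, ]
after_toy  <- mtcars[17:32, ]

# Fit once, then refit the identical formula on the second data set
fit_before <- lm(mpg ~ wt + hp, data = before_toy)
fit_after  <- update(fit_before, data = after_toy)

# Side-by-side coefficients make shifts between periods easy to spot
cbind(before = coef(fit_before), after = coef(fit_after))
```

Keeping the formula identical across periods means any change in the coefficients reflects the data rather than a change in specification.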
Now, display diagnostic plots of your regression (regAfter2009optimal). Tip: You already know how to use autoplot.
library(ggfortify)
regAfter2009optimal %>%
autoplot()
## Warning: Removed 981 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 11 rows containing missing values (`geom_line()`).
Now, let’s focus on the Residual vs. Fitted graph by plotting it by itself using ggplot. Tip: Call ggplot with the data parameter in regAfter2009optimal. The aes parameters are (.fitted, .resid), respectively. You can use stat_smooth() for the trendline and appropriately title the plot and label both axes. Tip: Check out cheatsheets such as https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf.
library(ggplot2)
# Create Residual vs. Fitted graph
ggplot(data = regAfter2009optimal, aes(x = .fitted, y = .resid)) +
geom_point() + # Scatter plot of residuals vs fitted values
stat_smooth(method = "loess", se = FALSE, color = "red") + # Add a trendline
ggtitle("Residual vs. Fitted") +
xlab("Fitted Values") +
ylab("Residuals")
## `geom_smooth()` using formula = 'y ~ x'
Identify any outliers in the visualization from the last two chunks.
### This section doesn't require code. Just answer the question as a comment.
# First chunk, Residuals vs. Fitted Values plot: points far from the horizontal center line, such as 280, are outliers.
# QQ-Plot (Quantile-Quantile plot): points deviating from the sloped reference line, such as 280, may be outliers.
# Scale-Location plot: points far from the horizontal center, such as 280, would be outliers.
# Residuals vs. Leverage plot: points outside the dashed horizontal lines, such as 529, are outliers.
# Second chunk, Residuals vs. Fitted Values plot: points far from the red line, such as the point with a residual above 2e+05, are outliers.
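Reading outliers off the plots can be cross-checked with simple numeric flags. The sketch below is an illustration on a toy `mtcars` model (not the housing regression). The cutoffs of 2 for standardized residuals and 4/n for Cook's distance are common rules of thumb, assumed here rather than taken from the assignment.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # toy model, stands in for regAfter2009optimal

n <- nobs(fit)
flags <- data.frame(
  row       = names(resid(fit)),
  std_resid = unname(rstandard(fit)),
  cooks_d   = unname(cooks.distance(fit))
)

# Rule-of-thumb flags (assumed cutoffs)
flags$large_resid <- abs(flags$std_resid) > 2
flags$influential <- flags$cooks_d > 4 / n

# Observations that trip at least one flag
flags[flags$large_resid | flags$influential, ]
```

Rows flagged by both rules are the ones most worth inspecting by hand.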
Now, let’s think like a fraudster and do something smarter fraudsters may do. Instead of misrepresenting values by just reporting the mean value of the houses sold in NAmes before 2009, what is a more clever and nuanced way for the fraudsters to report these values? Specifically, consider a method smarter fraudsters may use to set the rows in which the prices are misrepresented. Then, using this method, generate and set values for the SalePrice in those rows. Then, try your fraud inspection techniques of comparing old and new density plots as well as using the diagnostic plots to show that the fraud is now much harder to catch. Tip: You must use exact commands/functions to set the values and tell us why you chose to generate values this way. You must share the resulting diagnostic plots with us. Tip: Consider using more information (instead of the mean values) to generate the fraudulent values using what you learned from your work above. You can do this in two steps: Step 1: Find the rows set by the stupid fraudsters (by searching for the SalePrice of 142769.7). Step 2: Use a smarter way to generate and replace these values. Tip: For plotting, you may use ggplot to plot NAmes and NAmes. My ggplot call looked like this: before2009 %>% filter(Neighborhood == “???”) %>% ggplot(aes(x = SalePrice)) + geom_density(fill = “???”, alpha = 0.5) + ggtitle(“???”) + xlab(“???”) Tip: Always refine your model as fraudsters adapt their methods after they find out that you can catch them. Rubric: 10 points each for the fraud method and the plots.
### This section requires you to first explain your idea. Just answer this as a comment.
##
# Step 1: Find the rows set by the original fraudulent method (mean value)
fraud_rows <- after2009 %>% filter(SalePrice == 142769.7 )
# Step 2: Replace these values with random draws from a normal distribution that uses the mean and sd of SalePrice in after2009
set.seed(156)
fraud_rows$SalePrice <- rnorm(nrow(fraud_rows), mean(after2009$SalePrice, na.rm = TRUE), sd = sd(after2009$SalePrice,na.rm = TRUE))
# Now, create a density plot for NAmes before and after fraud
ggplot() +
geom_density(data = fraud_rows %>% filter(Neighborhood == "NAmes"), aes(x = SalePrice), fill = "blue", alpha = 0.5) +
geom_density(data = before2009 %>% filter(Neighborhood == "NAmes"), aes(x = SalePrice), fill = "green", alpha = 0.5) +
geom_density(data = after2009 %>% filter(Neighborhood == "NAmes"), aes(x = SalePrice), fill = "red", alpha = 0.5) +
ggtitle("Density Plot for NAmes Fraud") +
xlab("SalePrice")
ggplot() +
geom_density(data = fraud_rows %>% filter(Neighborhood == "Gilbert"), aes(x = SalePrice), fill = "blue", alpha = 0.5) +
geom_density(data = before2009 %>% filter(Neighborhood == "Gilbert"), aes(x = SalePrice), fill = "green", alpha = 0.5) +
geom_density(data = after2009 %>% filter(Neighborhood == "Gilbert"), aes(x = SalePrice), fill = "red", alpha = 0.5) +
ggtitle("Density Plot for Gilbert Fraud") +
xlab("SalePrice")
## Warning: Removed 4 rows containing non-finite values (`stat_density()`).
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
# Both the after2009 and before2009 densities cluster around 140,000-150,000, so a block of sales reported at exactly the mean stands out as fraudulent. We therefore chose to replace these mean values with random draws from a normal distribution. This spreads the replaced prices out and makes the density curve smoother, so the fraud is not as easy to detect.
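The find-and-replace step can be sketched with index-based assignment, which pairs each flagged row with exactly one generated value. By contrast, `ifelse()` recycles its `yes` argument to the length of the test, so a shorter replacement vector is matched by row position. The toy rows below are hypothetical; only the flag value 142769.7 comes from the assignment. Estimating the mean and sd from the unflagged rows is a choice made for this sketch.

```r
set.seed(156)
toy <- data.frame(
  Neighborhood = c("NAmes", "NAmes", "Gilbert", "NAmes", "Gilbert", "NAmes"),
  SalePrice    = c(142769.7, 150000, 142769.7, NA, 181000, 139000)
)

flag_value <- 142769.7
# Tolerance-based match avoids exact floating-point equality;
# which() skips NA comparisons, so missing prices are never selected
idx  <- which(abs(toy$SalePrice - flag_value) < 0.05)
keep <- setdiff(seq_len(nrow(toy)), idx)

# One draw per flagged row, using the mean and sd of the unflagged prices
toy$SalePrice[idx] <- rnorm(length(idx),
                            mean = mean(toy$SalePrice[keep], na.rm = TRUE),
                            sd   = sd(toy$SalePrice[keep], na.rm = TRUE))
toy
```

Rows that were not flagged, including the one with a missing price, are left untouched.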
Now, run a regression on the new data in after2009 using variables you know are good at predicting SalePrice. Store the result in a variable called regAfter2009optimalFraud. Then print the summary of regAfter2009optimalFraud to verify that your code works. Tip: You can reuse your previous work on before2009. Rubric: 4 points for regression, 1 point for printing summary.
# Identify rows in after2009 where SalePrice is equal to 142769.7
fraud_rows <- after2009 %>% filter(SalePrice == 142769.7)
# Set seed for reproducibility
set.seed(156)
# Generate random values from a normal distribution
fraud_rows$SalePrice <- rnorm(nrow(fraud_rows), mean(after2009$SalePrice, na.rm = TRUE), sd = sd(after2009$SalePrice,na.rm = TRUE))
# Replace the corresponding rows in after2009 with the modified fraud_rows
after2009 <- after2009 %>%
mutate(SalePrice = ifelse(SalePrice == 142769.7, fraud_rows$SalePrice, SalePrice))
# Verify the changes
head(after2009)
## MSSubClass MSZoning LotArea Street LotShape LandContour LotConfig LandSlope
## 1 50 RL 14115 Pave IR1 Lvl Inside Gtl
## 2 60 RL 10382 Pave IR1 Lvl Corner Gtl
## 3 20 RL 11241 Pave IR1 Lvl CulDSac Gtl
## 4 20 RL 7560 Pave Reg Lvl Inside Gtl
## 5 20 RL 8246 Pave IR1 Lvl Inside Gtl
## 6 20 RL 14230 Pave Reg Lvl Corner Gtl
## Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual
## 1 Mitchel Norm Norm 1Fam 1.5Fin 5
## 2 NWAmes PosN Norm 1Fam 2Story 7
## 3 NAmes Norm Norm 1Fam 1Story 6
## 4 NAmes Norm Norm 1Fam 1Story 5
## 5 Sawyer Norm Norm 1Fam 1Story 5
## 6 NridgHt Norm Norm 1Fam 1Story 8
## OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 1 5 1993 1995 Gable CompShg VinylSd VinylSd
## 2 6 1973 1973 Gable CompShg HdBoard HdBoard
## 3 7 1970 1970 Gable CompShg Wd Sdng Wd Sdng
## 4 6 1958 1965 Hip CompShg BrkFace Plywood
## 5 8 1968 2001 Gable CompShg Plywood Plywood
## 6 5 2007 2007 Gable CompShg VinylSd VinylSd
## MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtFinSF1 BsmtFinSF2
## 1 None 0 TA TA Wood 732 0
## 2 Stone 240 TA TA CBlock 859 32
## 3 BrkFace 180 TA TA CBlock 578 0
## 4 None 0 TA TA CBlock 504 0
## 5 None 0 TA Gd CBlock 188 668
## 6 Stone 640 Gd TA PConc 0 0
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF
## 1 64 796 GasA Ex Y SBrkr 796
## 2 216 1107 GasA Ex Y SBrkr 1107
## 3 426 1004 GasA Ex Y SBrkr 1004
## 4 525 1029 GasA TA Y SBrkr 1339
## 5 204 1060 GasA Ex Y SBrkr 1060
## 6 1566 1566 GasA Ex Y SBrkr 1600
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 1 566 0 1362 1 0 1 1
## 2 983 0 2090 1 0 2 1
## 3 0 0 1004 1 0 1 0
## 4 0 0 1339 0 0 1 0
## 5 0 0 1060 1 0 1 0
## 6 0 0 1600 0 0 2 0
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces
## 1 1 1 TA 5 Typ 0
## 2 3 1 TA 7 Typ 2
## 3 2 1 TA 5 Typ 1
## 4 3 1 TA 6 Min1 0
## 5 3 1 Gd 6 Typ 1
## 6 3 1 Gd 7 Typ 1
## GarageCars GarageArea PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## 1 2 480 Y 40 30 0
## 2 2 484 Y 235 204 228
## 3 2 480 Y 0 0 0
## 4 1 294 Y 0 0 0
## 5 1 270 Y 406 90 0
## 6 3 890 Y 0 56 0
## X3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
## 1 320 0 0 700 10 2009 WD Normal
## 2 0 0 0 350 11 2009 WD Normal
## 3 0 0 0 700 3 2010 WD Normal
## 4 0 0 0 0 5 2009 COD Abnorml
## 5 0 0 0 0 5 2010 WD Normal
## 6 0 0 0 0 7 2009 WD Normal
## SalePrice
## 1 143000
## 2 200000
## 3 149000
## 4 139000
## 5 154000
## 6 256300
# Fit the regression on the after2009 data
regAfter2009optimal <- lm(
  SalePrice ~ RoofMatl + KitchenQual + OverallQual + Condition2 + MSZoning +
    Neighborhood + LotArea + OverallCond + Foundation + BedroomAbvGr +
    ExterQual + BsmtFinSF1 + BsmtFinSF2 + MSSubClass + BsmtUnfSF,
  data = after2009
)
# Print the summary
summary(regAfter2009optimal)
##
## Call:
## lm(formula = SalePrice ~ RoofMatl + KitchenQual + OverallQual +
## Condition2 + MSZoning + Neighborhood + LotArea + OverallCond +
## Foundation + BedroomAbvGr + ExterQual + BsmtFinSF1 + BsmtFinSF2 +
## MSSubClass + BsmtUnfSF, data = after2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -198807 -12665 -599 13262 198561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.879e+04 4.804e+04 1.432 0.152466
## RoofMatlTar&Grv -2.618e+03 1.172e+04 -0.223 0.823385
## RoofMatlWdShake 5.147e+04 2.378e+04 2.164 0.030706 *
## RoofMatlWdShngl -6.740e+03 2.929e+04 -0.230 0.818071
## KitchenQualFa -3.752e+04 9.150e+03 -4.101 4.49e-05 ***
## KitchenQualGd -2.892e+04 5.862e+03 -4.934 9.58e-07 ***
## KitchenQualTA -3.444e+04 6.370e+03 -5.407 8.23e-08 ***
## OverallQual2 -6.920e+04 3.885e+04 -1.781 0.075225 .
## OverallQual3 -5.773e+04 3.878e+04 -1.489 0.136970
## OverallQual4 -5.232e+04 3.803e+04 -1.376 0.169289
## OverallQual5 -5.073e+04 3.823e+04 -1.327 0.184853
## OverallQual6 -3.596e+04 3.836e+04 -0.938 0.348723
## OverallQual7 -2.216e+04 3.848e+04 -0.576 0.564962
## OverallQual8 -4.554e+03 3.866e+04 -0.118 0.906277
## OverallQual9 3.126e+04 3.937e+04 0.794 0.427382
## OverallQual10 4.893e+04 4.106e+04 1.192 0.233681
## Condition2Feedr 3.078e+04 3.478e+04 0.885 0.376463
## Condition2Norm 1.746e+04 3.148e+04 0.554 0.579414
## Condition2PosA 6.545e+04 4.379e+04 1.495 0.135360
## Condition2PosN -2.109e+05 4.410e+04 -4.781 2.04e-06 ***
## MSZoningFV 4.121e+04 1.728e+04 2.385 0.017282 *
## MSZoningRH 1.466e+04 1.631e+04 0.899 0.368890
## MSZoningRL 2.671e+04 1.316e+04 2.030 0.042613 *
## MSZoningRM 2.959e+04 1.231e+04 2.403 0.016458 *
## NeighborhoodBlueste -1.408e+04 1.782e+04 -0.790 0.429482
## NeighborhoodBrDale 7.083e+02 1.735e+04 0.041 0.967445
## NeighborhoodBrkSide -2.314e+04 1.442e+04 -1.605 0.108907
## NeighborhoodClearCr -1.322e+04 1.451e+04 -0.911 0.362668
## NeighborhoodCollgCr -2.174e+04 1.185e+04 -1.834 0.067007 .
## NeighborhoodCrawfor 1.342e+04 1.313e+04 1.022 0.306902
## NeighborhoodEdwards -3.736e+04 1.257e+04 -2.973 0.003024 **
## NeighborhoodGilbert -2.013e+04 1.236e+04 -1.629 0.103613
## NeighborhoodIDOTRR -4.263e+04 1.617e+04 -2.637 0.008499 **
## NeighborhoodMeadowV -3.348e+04 1.726e+04 -1.940 0.052734 .
## NeighborhoodMitchel -2.528e+04 1.255e+04 -2.014 0.044360 *
## NeighborhoodNAmes -3.185e+04 1.225e+04 -2.601 0.009460 **
## NeighborhoodNoRidge 2.634e+04 1.318e+04 1.999 0.045963 *
## NeighborhoodNPkVill -3.457e+03 1.407e+04 -0.246 0.806005
## NeighborhoodNridgHt -5.515e+01 1.175e+04 -0.005 0.996255
## NeighborhoodNWAmes -2.739e+04 1.268e+04 -2.160 0.031067 *
## NeighborhoodOldTown -3.999e+04 1.416e+04 -2.825 0.004832 **
## NeighborhoodSawyer -3.786e+04 1.284e+04 -2.948 0.003283 **
## NeighborhoodSawyerW -1.865e+04 1.207e+04 -1.545 0.122678
## NeighborhoodSomerst -1.895e+04 1.540e+04 -1.230 0.218848
## NeighborhoodStoneBr 3.673e+04 1.322e+04 2.778 0.005584 **
## NeighborhoodSWISU -2.620e+04 1.412e+04 -1.856 0.063822 .
## NeighborhoodTimber -2.076e+04 1.316e+04 -1.577 0.115188
## NeighborhoodVeenker 1.185e+04 1.886e+04 0.628 0.529901
## LotArea 8.433e-01 1.316e-01 6.406 2.40e-10 ***
## OverallCond2 6.487e+04 2.474e+04 2.622 0.008890 **
## OverallCond3 5.737e+04 2.268e+04 2.529 0.011605 *
## OverallCond4 5.945e+04 2.245e+04 2.648 0.008231 **
## OverallCond5 6.743e+04 2.197e+04 3.070 0.002207 **
## OverallCond6 6.991e+04 2.200e+04 3.177 0.001537 **
## OverallCond7 7.520e+04 2.202e+04 3.416 0.000664 ***
## OverallCond8 6.919e+04 2.238e+04 3.091 0.002055 **
## OverallCond9 7.476e+04 2.312e+04 3.234 0.001267 **
## FoundationCBlock 3.362e+03 4.352e+03 0.773 0.439951
## FoundationPConc 1.357e+04 4.751e+03 2.857 0.004373 **
## FoundationSlab 3.482e+04 9.134e+03 3.812 0.000147 ***
## FoundationStone 3.331e+04 1.408e+04 2.367 0.018153 *
## FoundationWood -1.134e+04 1.827e+04 -0.620 0.535216
## BedroomAbvGr 7.570e+03 1.542e+03 4.911 1.08e-06 ***
## ExterQualFa -4.459e+04 1.203e+04 -3.707 0.000222 ***
## ExterQualGd -3.717e+04 7.784e+03 -4.775 2.10e-06 ***
## ExterQualTA -4.218e+04 8.522e+03 -4.949 8.89e-07 ***
## BsmtFinSF1 6.794e+01 4.137e+00 16.424 < 2e-16 ***
## BsmtFinSF2 4.668e+01 6.692e+00 6.975 5.91e-12 ***
## MSSubClass160 1.251e+04 7.568e+03 1.653 0.098707 .
## MSSubClass180 1.042e+04 1.676e+04 0.622 0.534095
## MSSubClass190 2.048e+04 1.080e+04 1.896 0.058277 .
## MSSubClass20 1.997e+04 5.757e+03 3.469 0.000547 ***
## MSSubClass30 1.470e+04 8.177e+03 1.798 0.072515 .
## MSSubClass40 1.234e+04 2.422e+04 0.509 0.610580
## MSSubClass45 5.362e+03 1.847e+04 0.290 0.771649
## MSSubClass50 2.862e+04 7.159e+03 3.998 6.90e-05 ***
## MSSubClass60 4.948e+04 6.308e+03 7.845 1.22e-14 ***
## MSSubClass70 2.759e+04 8.647e+03 3.191 0.001469 **
## MSSubClass75 3.696e+04 1.527e+04 2.420 0.015701 *
## MSSubClass80 3.697e+04 7.503e+03 4.927 9.93e-07 ***
## MSSubClass85 1.998e+04 8.756e+03 2.282 0.022732 *
## MSSubClass90 2.478e+04 8.081e+03 3.066 0.002233 **
## BsmtUnfSF 3.905e+01 4.173e+00 9.360 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28400 on 898 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8737, Adjusted R-squared: 0.8622
## F-statistic: 75.79 on 82 and 898 DF, p-value: < 2.2e-16
Now, display diagnostic plots of your regression (regAfter2009optimal). Tip: You already know how to use autoplot().
library(ggfortify)
regAfter2009optimal %>%
autoplot()
## Warning: Removed 981 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 11 rows containing missing values (`geom_line()`).
Now, look for outliers in the diagnostic plots of your regression (regAfter2009optimal). Tip: You already know how to use autoplot().
### This section doesn't require code. Just answer the question as a comment.
# Residuals vs. Fitted plot: points far from the horizontal zero line, such as observation 280, are potential outliers.
# Normal Q-Q plot: points that deviate from the diagonal reference line, such as observation 533, may be outliers.
# Scale-Location plot: points lying well above the rest, such as observation 280, are potential outliers.
# Residuals vs. Leverage plot: points that combine high leverage with a large standardized residual, such as observation 524, may be influential outliers.
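The visual reading of the diagnostic plots can be cross-checked numerically. The following is a minimal sketch and not part of the original assignment: it uses the built-in mtcars data as a stand-in for after2009, the object names fit, diagnostics, and flagged are hypothetical, and the cutoffs (absolute standardized residual above 2, leverage above 2p/n, Cook's distance above 4/n) are common rules of thumb rather than fixed standards.

```r
# Illustrative sketch: mtcars stands in for after2009 (assumption).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)          # number of observations used in the fit
p <- length(coef(fit))  # number of estimated coefficients

# Collect the quantities that the four diagnostic plots display
diagnostics <- data.frame(
  row       = names(residuals(fit)),
  std_resid = rstandard(fit),       # standardized residuals
  leverage  = hatvalues(fit),       # diagonal of the hat matrix
  cooks_d   = cooks.distance(fit)   # influence on the fitted coefficients
)

# Flag observations that exceed any rule-of-thumb cutoff (assumed thresholds)
flagged <- subset(
  diagnostics,
  abs(std_resid) > 2 | leverage > 2 * p / n | cooks_d > 4 / n
)

# Show the most influential candidates first
flagged[order(-flagged$cooks_d), ]
```

Swapping fit for regAfter2009optimal would apply the same checks to the model above, so the row numbers read off the plots can be compared against the flagged list.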
Knit to html after eliminating all the errors. Submit both the Rmd and html files. Tip: Do not worry about minor formatting issues.
### This section doesn't require code. Just knit and submit the Rmd and html files.###