This assignment is based on the Kaggle kernel by Erik Bruin, which we have been examining in class.
Create a folder named ames somewhere on your computer and make it an R project. If you are using RStudio Cloud, create a new project and name it ames.
Create three subfolders: Input, Output, and Code. In the cloud, use the terminal and the mkdir command. To document your completion of this task, take a screenshot showing the results of your efforts. Save this Rmd file in the Code folder.
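As an alternative to the terminal, the same folders can be created from R itself. The following is only a minimal sketch: it builds the structure under a temporary directory (an assumption made so the example is safe to run anywhere), whereas in the real project the root would be the ames project folder.

```r
# Sketch: create the three project subfolders from R instead of the terminal.
# A temporary directory stands in for the project root here (an assumption);
# in the real project you would use the ames project folder.
project_root <- file.path(tempdir(), "ames")
subfolders <- file.path(project_root, c("Input", "Output", "Code"))

# recursive = TRUE also creates project_root if it is missing;
# showWarnings = FALSE keeps reruns quiet when the folders already exist.
created <- vapply(subfolders, dir.create, logical(1),
                  recursive = TRUE, showWarnings = FALSE)

# Confirm that all three folders are present.
print(dir.exists(subfolders))
```

Note that on case-sensitive file systems such as Linux, any path used later in load() has to match the folder and file names exactly, including capitalization.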
Download the file all.rdata from Moodle. This file contains the updated train and test files which Bruin started with. You can identify the test records by the NA values in the SalePrice variable. All of the cleaning, imputation, and recoding done in Bruin’s kernel has been completed, so this file is suitable for constructing a model. Separate this file into train and test components. To demonstrate your success, do a glimpse of the test dataframe.
# Place your code here.
load(file="input/all.Rdata")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
trainames <- all %>%
  filter(!is.na(SalePrice))
testames <- all %>%
  filter(is.na(SalePrice))
glimpse(testames)
## Observations: 1,459
## Variables: 79
## $ MSSubClass <fct> 1 story 1946+, 1 story 1946+, 2 story 1946+, 2 sto…
## $ MSZoning <fct> RH, RL, RL, RL, RL, RL, RL, RL, RL, RL, RH, RM, RM…
## $ LotFrontage <int> 80, 81, 74, 78, 43, 75, 64, 63, 85, 70, 26, 21, 21…
## $ LotArea <int> 11622, 14267, 13830, 9978, 5005, 10000, 7980, 8402…
## $ Street <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Alley <fct> None, None, None, None, None, None, None, None, No…
## $ LotShape <int> 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 2,…
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "HLS", "Lvl", "Lvl", "…
## $ LotConfig <fct> Inside, Corner, Inside, Inside, Inside, Corner, In…
## $ LandSlope <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ Neighborhood <fct> NAmes, NAmes, Gilbert, Gilbert, StoneBr, Gilbert, …
## $ Condition1 <fct> Feedr, Norm, Norm, Norm, Norm, Norm, Norm, Norm, N…
## $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, TwnhsE, 1Fam, 1Fam, 1Fam, …
## $ HouseStyle <fct> 1Story, 1Story, 2Story, 2Story, 1Story, 2Story, 1S…
## $ OverallQual <int> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6, 5, 6, 7, 9, 8,…
## $ OverallCond <int> 6, 6, 5, 6, 5, 5, 7, 5, 5, 5, 5, 5, 5, 6, 6, 5, 5,…
## $ YearBuilt <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, 19…
## $ YearRemodAdd <int> 1961, 1958, 1998, 1998, 1992, 1994, 2007, 1998, 19…
## $ RoofStyle <fct> Gable, Hip, Gable, Gable, Gable, Gable, Gable, Gab…
## $ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompS…
## $ Exterior1st <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdBoa…
## $ Exterior2nd <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdBoa…
## $ MasVnrType <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 2,…
## $ MasVnrArea <dbl> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0, 504, 492, 0, 0…
## $ ExterQual <int> 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 5, 4,…
## $ ExterCond <int> 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Foundation <fct> CBlock, CBlock, PConc, PConc, PConc, PConc, PConc,…
## $ BsmtQual <int> 3, 3, 4, 3, 4, 4, 4, 4, 4, 3, 4, 3, 3, 3, 4, 5, 4,…
## $ BsmtCond <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ BsmtExposure <int> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ BsmtFinType1 <int> 3, 5, 6, 6, 5, 1, 5, 1, 6, 5, 6, 3, 3, 5, 1, 1, 1,…
## $ BsmtFinSF1 <dbl> 468, 923, 791, 602, 263, 0, 935, 0, 637, 804, 1051…
## $ BsmtFinType2 <int> 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 1, 1, 1, 1, 1, 1,…
## $ BsmtFinSF2 <dbl> 144, 0, 0, 0, 0, 0, 0, 0, 0, 78, 0, 0, 0, 0, 0, 0,…
## $ BsmtUnfSF <dbl> 270, 406, 137, 324, 1017, 763, 233, 789, 663, 0, 3…
## $ TotalBsmtSF <dbl> 882, 1329, 928, 926, 1280, 763, 1168, 789, 1300, 8…
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Ga…
## $ HeatingQC <int> 3, 3, 4, 5, 5, 4, 5, 4, 4, 3, 5, 3, 3, 3, 5, 5, 5,…
## $ CentralAir <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, S…
## $ X1stFlrSF <int> 896, 1329, 928, 926, 1280, 763, 1187, 789, 1341, 8…
## $ X2ndFlrSF <int> 0, 0, 701, 678, 0, 892, 0, 676, 0, 0, 0, 504, 567,…
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea <int> 896, 1329, 1629, 1604, 1280, 1655, 1187, 1465, 134…
## $ BsmtFullBath <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,…
## $ BsmtHalfBath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FullBath <int> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2,…
## $ HalfBath <int> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,…
## $ BedroomAbvGr <int> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, 3,…
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ KitchenQual <int> 3, 4, 3, 4, 4, 3, 3, 3, 4, 3, 4, 3, 3, 4, 3, 5, 4,…
## $ TotRmsAbvGrd <int> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5, 6, 6, 4, 10, 7…
## $ Functional <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…
## $ Fireplaces <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,…
## $ FireplaceQu <int> 0, 0, 3, 4, 0, 3, 0, 4, 1, 0, 2, 0, 0, 3, 0, 4, 0,…
## $ GarageType <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, At…
## $ GarageYrBlt <dbl> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, 19…
## $ GarageFinish <int> 1, 1, 3, 3, 2, 3, 3, 3, 1, 3, 3, 1, 1, 1, 1, 3, 2,…
## $ GarageCars <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 3, 3,…
## $ GarageArea <dbl> 730, 312, 482, 470, 506, 440, 420, 393, 506, 525, …
## $ GarageQual <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ GarageCond <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ PavedDrive <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ WoodDeckSF <int> 140, 393, 212, 360, 0, 157, 483, 0, 192, 240, 203,…
## $ OpenPorchSF <int> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0, 68, 0, 0, 0, …
## $ EnclosedPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ScreenPorch <int> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Fence <fct> MnPrv, None, MnPrv, None, None, None, GdPrv, None,…
## $ MiscFeature <fct> None, Gar2, None, None, None, None, Shed, None, No…
## $ MiscVal <int> 0, 12500, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MoSold <fct> 6, 6, 3, 6, 1, 4, 3, 5, 2, 4, 6, 2, 3, 6, 6, 1, 6,…
## $ YrSold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 20…
## $ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, COD, W…
## $ SaleCondition <fct> Normal, Normal, Normal, Normal, Normal, Normal, No…
## $ SalePrice <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
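The split above relies on SalePrice being NA exactly for the test records. A quick way to convince yourself that such a split loses no rows is to check it on a small example. The sketch below uses a made-up toy data frame (toy_all and its values are illustrative, not taken from all.Rdata) and the same filter() pattern.

```r
library(dplyr)

# Toy stand-in for the combined data frame (values are made up).
toy_all <- tibble(
  GrLivArea = c(1500, 900, 1200, 2000, 1100),
  SalePrice = c(200000L, NA, 150000L, NA, 135000L)
)

# Same pattern as above: known prices go to train, NA prices go to test.
toy_train <- toy_all %>% filter(!is.na(SalePrice))
toy_test  <- toy_all %>% filter(is.na(SalePrice))

# The two pieces together should account for every original row.
print(c(train = nrow(toy_train), test = nrow(toy_test), all = nrow(toy_all)))
```

The same row-count comparison could be applied to trainames, testames, and all.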
Build a regression model to predict sale price based on variables you choose after reading Bruin’s document. You need to include at least 4 variables. Two of these variables must have been altered in Bruin’s cleanup work. Describe the variables. In the explanation of the variables altered by Bruin, describe what he did.
Display a summary of your regression model.
Fence was one of the variables Bruin changed. In the raw data, Fence contains many NA values, but here NA is meaningful: it indicates that the house has no fence. All the NAs were therefore changed to the factor level “None”. This change was made by selecting the rows with an NA value using the $ operator and bracket indexing, then assigning “None” to those rows.
Bruin also changed the month sold variable, MoSold, because it was coded as a numeric variable. Month should be a factor because the month numbers are labels rather than quantities. This change was made by applying as.factor to the column.
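The two recoding steps described above can be illustrated on a small example. This is a sketch of the general technique in base R, not a copy of Bruin's code; the toy data frame and its values are made up, and it assumes the columns start out as character and integer.

```r
# Toy stand-in for the raw data (values are made up). Fence is character
# and MoSold is integer, as they would be before any recoding.
toy <- data.frame(
  Fence  = c("MnPrv", NA, "GdWo", NA),
  MoSold = c(6L, 3L, 12L, 6L),
  stringsAsFactors = FALSE
)

# NA here means "no fence", so replace it with an explicit "None" value.
# This is done while the column is still character: assigning a value that
# is not an existing level of a factor would produce NA with a warning.
toy$Fence[is.na(toy$Fence)] <- "None"
toy$Fence <- as.factor(toy$Fence)

# Month numbers are labels, not quantities, so treat them as a factor.
toy$MoSold <- as.factor(toy$MoSold)

print(levels(toy$Fence))
print(levels(toy$MoSold))
```

Once MoSold is a factor, lm() estimates a separate coefficient for each month relative to a baseline month, which is why the model summary lists MoSold2 through MoSold12.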
# Place your code here.
modames <- lm(SalePrice ~ OverallQual + GrLivArea + MoSold + Fence, data=trainames)
summary(modames)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + MoSold + Fence,
## data = trainames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -389474 -23149 -335 19211 286497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.124e+05 9.345e+03 -12.033 < 2e-16 ***
## OverallQual 3.311e+04 1.020e+03 32.456 < 2e-16 ***
## GrLivArea 5.559e+01 2.641e+00 21.048 < 2e-16 ***
## MoSold2 -1.483e+04 8.126e+03 -1.825 0.06819 .
## MoSold3 -8.974e+03 6.936e+03 -1.294 0.19594
## MoSold4 -1.354e+04 6.636e+03 -2.040 0.04151 *
## MoSold5 -7.023e+03 6.331e+03 -1.109 0.26747
## MoSold6 -8.983e+03 6.186e+03 -1.452 0.14668
## MoSold7 -5.026e+03 6.233e+03 -0.806 0.42017
## MoSold8 -8.878e+03 6.773e+03 -1.311 0.19016
## MoSold9 -1.345e+04 7.742e+03 -1.738 0.08247 .
## MoSold10 -1.575e+04 7.180e+03 -2.194 0.02839 *
## MoSold11 -9.234e+03 7.357e+03 -1.255 0.20961
## MoSold12 -1.264e+04 7.877e+03 -1.605 0.10880
## FenceGdWo 1.803e+04 8.070e+03 2.234 0.02561 *
## FenceMnPrv 1.499e+04 6.561e+03 2.285 0.02247 *
## FenceMnWw 1.969e+04 1.401e+04 1.405 0.16028
## FenceNone 1.718e+04 5.683e+03 3.024 0.00254 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42430 on 1442 degrees of freedom
## Multiple R-squared: 0.7181, Adjusted R-squared: 0.7148
## F-statistic: 216.1 on 17 and 1442 DF, p-value: < 2.2e-16
Use predict() to make predicted sale prices for the test dataframe. Convert these predictions to a form suitable for submission to Kaggle. Then do the submission. Do a screenshot of your submission report from Kaggle and post it here.
# Place your code here.
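# Predict sale prices for the test records.
predames <- predict(modames, newdata = testames)
# Test Ids continue after the 1,460 training rows, so they start at 1461.
testames$Id <- seq.int(nrow(testames))
testames <- testames %>%
  mutate(Id = Id + 1460)
# Keep only the Id and the prediction, renamed to SalePrice.
submission <- cbind(testames, predames)
readysubmission <- submission %>%
  select(Id, predames)
finalsubmission <- readysubmission %>%
  rename(SalePrice = predames)
head(finalsubmission)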
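The code above stops at a data frame, while Kaggle expects an uploaded CSV file with the columns Id and SalePrice. A minimal sketch of that last step follows. It uses a made-up stand-in for finalsubmission and a hypothetical file name in a temporary directory; in the project, the file would more naturally go in the Output folder.

```r
# Toy stand-in for finalsubmission (values are made up).
toy_submission <- data.frame(
  Id        = 1461:1463,
  SalePrice = c(118500.2, 162300.7, 181950.0)
)

# Hypothetical output path; a temporary directory is used so the sketch
# can be run anywhere without touching the project folders.
out_file <- file.path(tempdir(), "submission.csv")

# row.names = FALSE keeps R from writing an extra unnamed index column.
write.csv(toy_submission, out_file, row.names = FALSE)

# Read the file back to confirm the header and row count.
check <- read.csv(out_file)
print(names(check))
print(nrow(check))
```

Before uploading a real file, it may also be worth checking the predictions for NA or negative values, since a linear model does not rule either out.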
This will insert a jpg file named Path.jpg from your working directory. Run getwd() in a chunk to confirm where that is.