Final Problem 1.

Playing with PageRank

You’ll verify for yourself that PageRank works by performing calculations on a small universe of web pages. Let’s use the 6-page universe that we had in the previous discussion. For this directed graph, perform the following calculations in R.

Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes. (5 Points)

m <- matrix(c(0, 1/2, 0, 0, 1/2, 0,
             1/2, 0, 1/2, 0, 1/2, 0,
             0, 1/2, 1/2, 0, 0, 0,
             0, 0, 1/2, 0, 1/2, 0,
             0, 0, 1/2, 0, 0, 0,
             1/2, 0, 0, 1/2, 0, 0), nrow=6)
decay <- 0.85

# Damped matrix: B = decay * A + (1 - decay) / n
B <- decay * m + (1 - decay) / nrow(m)
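
As a minimal sketch of the construction above, assume a hypothetical 3-page link matrix `A_demo` (not the 6-page graph from the assignment) whose columns each sum to 1. Under that assumption the damped matrix stays column-stochastic, which is the property that makes 1 its dominant eigenvalue. The names `A_demo`, `decay_demo`, and `B_demo` are illustrative only; the same `colSums()` check can be run on `m` and `B`.

```r
# Hypothetical 3-page link matrix (assumption, not the assignment graph).
# matrix() fills column-wise, so each group of three values is one column,
# and column j holds the out-link probabilities of page j.
A_demo <- matrix(c(0,   1/2, 1/2,
                   1/2, 0,   1/2,
                   1,   0,   0), nrow = 3)

decay_demo <- 0.85
n_pages <- nrow(A_demo)

# B = d * A + (1 - d) / n adds the uniform jump term to every entry
B_demo <- decay_demo * A_demo + (1 - decay_demo) / n_pages

# Both matrices should have columns that sum to 1
print(colSums(A_demo))
print(colSums(B_demo))
```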

Start with a uniform rank vector r and perform power iterations on B until convergence. That is, compute the solution r = B^n × r. Attempt this for a sufficiently large n so that r actually converges. (5 Points)

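# NOTE: a uniform rank vector over 6 pages would have entries 1/6;
# the run recorded below used 1/8.
r <- matrix(c(1/8,1/8,1/8,1/8,1/8,1/8),nrow=6)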


# Form p^n by repeated multiplication, then apply it to r
iterations <- function(p, r, n) {
  Bn <- diag(nrow(p))  # start from the identity matrix

  for (i in seq_len(n)) {
    Bn <- Bn %*% p
  }
  Bn %*% r
}

# Convergence at 40
eig1<-iterations(B, r, 40) 

eig1
##           [,1]
## [1,] 0.1934180
## [2,] 0.3429422
## [3,] 0.5433138
## [4,] 0.0503023
## [5,] 0.2803157
## [6,] 0.0354912
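
Rather than fixing n in advance, convergence can be checked directly. The following is a sketch only, again on a hypothetical column-stochastic 3-page matrix; the function name `power_iterate` and its `tol` and `max_iter` arguments are assumptions introduced for illustration.

```r
# Repeat r <- B r until successive vectors differ by less than tol
power_iterate <- function(B, r, tol = 1e-10, max_iter = 1000) {
  for (i in seq_len(max_iter)) {
    r_next <- B %*% r
    if (max(abs(r_next - r)) < tol) {
      return(list(vector = as.numeric(r_next), iterations = i))
    }
    r <- r_next
  }
  warning("power iteration did not converge within max_iter steps")
  list(vector = as.numeric(r), iterations = max_iter)
}

# Hypothetical column-stochastic 3-page example with decay 0.85
A_demo <- matrix(c(0,   1/2, 1/2,
                   1/2, 0,   1/2,
                   1,   0,   0), nrow = 3)
B_demo <- 0.85 * A_demo + 0.15 / 3
r0 <- rep(1 / 3, 3)  # uniform start over 3 pages

res <- power_iterate(B_demo, r0)
print(res$iterations)
print(res$vector)
print(sum(res$vector))
```

Because the columns of `B_demo` sum to 1, the iterate keeps summing to 1 at every step, so no rescaling is needed in this example.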

Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1. (10 points)

# Decomposing B
decomp <- eigen(B)

# Locate the largest eigenvalue (expected to be 1 for a column-stochastic B).
# eigen() returns complex values here, so compare real parts.
max_value <- which.max(Re(decomp$values))
# Rescale the dominant eigenvector so it sums to 1, then check its entries are positive
eig2 <- Re(decomp$vectors[, max_value] / sum(decomp$vectors[, max_value]))
sum(eig2)
## [1] 1
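
The eigen-decomposition route can be sketched on the same kind of hypothetical 3-page matrix. This assumes a column-stochastic `B_demo`; all names ending in `_demo` are illustrative. `eigen()` orders eigenvalues by decreasing modulus, so the dominant pair comes first.

```r
# Hypothetical column-stochastic 3-page example with decay 0.85
A_demo <- matrix(c(0,   1/2, 1/2,
                   1/2, 0,   1/2,
                   1,   0,   0), nrow = 3)
B_demo <- 0.85 * A_demo + 0.15 / 3

decomp_demo <- eigen(B_demo)

# Take real parts in case eigen() returns complex values
lambda1 <- Re(decomp_demo$values[1])
v1 <- Re(decomp_demo$vectors[, 1])

# Eigenvectors are defined only up to scale and sign: rescale to sum to 1
rank_eigen <- v1 / sum(v1)

print(lambda1)
print(rank_eigen)
```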

Use the igraph package in R and its page.rank method to compute the PageRank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally. Verify that you do get the same PageRank vector as the two approaches above. (10 points)

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
g <- graph.adjacency(t(m), weighted = TRUE, mode = "directed")
plot(g)

eig3 <- page.rank(g)$vector
eig3
## [1] 0.1067407 0.2509968 0.4250163 0.0356250 0.1566212 0.0250000
final <- cbind(eig1, eig2, eig3)
colnames(final) <- c('Eigen 1', 'Eigen 2', 'Eigen 3')
final
##        Eigen 1    Eigen 2   Eigen 3
## [1,] 0.1934180 0.13378079 0.1067407
## [2,] 0.3429422 0.23720167 0.2509968
## [3,] 0.5433138 0.37579205 0.4250163
## [4,] 0.0503023 0.03479242 0.0356250
## [5,] 0.2803157 0.19388498 0.1566212
## [6,] 0.0354912 0.02454808 0.0250000
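
An eigenvector is only defined up to scale, so one way to compare the columns is to rescale each to sum to 1 first. The sketch below uses the values printed in the table; the names `eig1_printed` and `eig2_printed` are illustrative. With the printed figures, the rescaled Eigen 1 column agrees with Eigen 2 to about six decimal places.

```r
# Values copied from the printed comparison table
eig1_printed <- c(0.1934180, 0.3429422, 0.5433138,
                  0.0503023, 0.2803157, 0.0354912)
eig2_printed <- c(0.13378079, 0.23720167, 0.37579205,
                  0.03479242, 0.19388498, 0.02454808)

# Rescale the power-iteration result so that it sums to 1
eig1_scaled <- eig1_printed / sum(eig1_printed)

print(sum(eig1_printed))
print(round(eig1_scaled, 7))

# Largest absolute difference between the two normalized vectors
print(max(abs(eig1_scaled - eig2_printed)))
```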

Final Problem 2.

  1. Go to Kaggle.com and build an account if you do not already have one. It is free.

  2. Go to https://www.kaggle.com/c/digit-recognizer/overview, accept the rules of the competition, and download the data. You will not be required to submit work to Kaggle, but you do need the data. ‘MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.’

  3. Using the train.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.) (5 points)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## x purrr::compose()       masks igraph::compose()
## x tidyr::crossing()      masks igraph::crossing()
## x dplyr::filter()        masks stats::filter()
## x dplyr::groups()        masks igraph::groups()
## x dplyr::lag()           masks stats::lag()
## x purrr::simplify()      masks igraph::simplify()
library(readr)
library(OpenImageR)
library(nnet)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr)
train<-read_csv('train.csv')
## Rows: 42000 Columns: 785
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (785): label, pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test<-read_csv('test.csv')
## Rows: 28000 Columns: 784
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (784): pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, p...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
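# NOTE: unlist(train) still includes the label column (785 columns), so the
# first "pixel" of each 28 x 28 slice here is the label rather than pixel0.
Train<-array(as.vector(unlist(train)),dim = c(nrow(train),28,28))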
par(mfrow=c(3,4))

for (x in 1:10){ 
  image(flipImage(Train[x,,]))
}
Train<-Train/255
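
A minimal sketch of the reshape with the label column removed first, assuming a hypothetical data frame `demo` laid out like train.csv (one `label` column followed by `pixel0` to `pixel783`). The random values only stand in for the real images.

```r
set.seed(1)
n_img <- 5

# Hypothetical stand-in for train.csv: a label column plus 784 pixel columns
pixels <- matrix(sample(0:255, n_img * 784, replace = TRUE), nrow = n_img)
demo <- data.frame(label = sample(0:9, n_img, replace = TRUE), pixels)
names(demo)[-1] <- paste0("pixel", 0:783)

# Drop the label, scale to [0, 1], then reshape to images x 28 x 28
X_demo <- as.matrix(demo[, -1]) / 255
img_demo <- array(X_demo, dim = c(n_img, 28, 28))

# After the reshape, pixel k sits at [ , (k %% 28) + 1, (k %/% 28) + 1]
print(dim(img_demo))
print(range(img_demo))
print(all(img_demo[, 1, 1] == X_demo[, "pixel0"]))
print(all(img_demo[, 2, 1] == X_demo[, "pixel1"]))
```

Under this layout, `image(img_demo[1, , 28:1])` should draw the first image upright, since `image()` places the first column of its input at the bottom of the plot.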

  4. What is the frequency distribution of the numbers in the dataset? (5 points)
hist(Train)

hist(Train[Train>0])
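
The histograms above show the distribution of pixel intensities. If "numbers" is read as the digit labels instead, a frequency table is a complementary view. This is a sketch on a hypothetical label vector `labels_demo`, standing in for `train$label`.

```r
# Hypothetical labels standing in for train$label
labels_demo <- c(1, 0, 1, 4, 0, 0, 7, 3, 5, 3, 8, 9, 1, 3, 3)

# Counts and relative frequencies for each digit 0-9;
# fixing the levels keeps digits with zero count in the table
freq <- table(factor(labels_demo, levels = 0:9))
rel_freq <- prop.table(freq)

print(freq)
print(round(rel_freq, 3))
```

`barplot(freq)` would draw the same counts as a bar chart.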

  5. For each number, provide the mean pixel intensity. What does this tell you? (5 points)
options(scipen = 999)
colMeans(Train)[1:50]
##  [1] 0.0174770308123 0.0000000000000 0.0000000000000 0.0000000000000
##  [5] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
##  [9] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [13] 0.0000000000000 0.0000117647059 0.0000438842204 0.0000201680672
## [17] 0.0000008403361 0.0000000000000 0.0000000000000 0.0000000000000
## [21] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [25] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [29] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [33] 0.0000000000000 0.0000014939309 0.0000051353875 0.0000413632120
## [37] 0.0001069094304 0.0001996265173 0.0002604108310 0.0005081232493
## [41] 0.0006828197946 0.0007502334267 0.0007474323063 0.0007688141923
## [45] 0.0006719887955 0.0006450046685 0.0005949579832 0.0004129785247
## [49] 0.0002383753501 0.0001767507003

The mean values increase toward the center of the image, which tells us that, on average, the central pixels are brighter (closer to white) than the border pixels.
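
If "for each number" is read as "for each digit label", the mean intensity can also be grouped by label. The sketch below assumes a hypothetical pixel matrix `pix_demo` (already scaled to [0, 1]) and label vector `lab_demo`; neither comes from the real data.

```r
set.seed(2)

# Hypothetical stand-in: 6 images with 4 pixels each, scaled to [0, 1]
pix_demo <- matrix(runif(24), nrow = 6)
lab_demo <- c(0, 1, 1, 8, 8, 8)

# Mean intensity of each image, then averaged within each digit label
img_mean <- rowMeans(pix_demo)
digit_mean <- tapply(img_mean, lab_demo, mean)

print(digit_mean)
```

On the real data, a digit drawn with more ink would be expected to show a higher mean than one drawn with fewer strokes.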

  6. Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why? (5 points)
tr<-train
trcov<-cov(tr/255)
pca<-prcomp(trcov)
(cumsum(pca$sdev^2)/sum(pca$sdev^2))[1:20]
##  [1] 0.2533214 0.4211306 0.5443493 0.6382875 0.7099679 0.7691895 0.8028369
##  [8] 0.8302285 0.8531289 0.8711703 0.8853394 0.8986566 0.9080934 0.9174708
## [15] 0.9256099 0.9328111 0.9384893 0.9438301 0.9484009 0.9527328
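
In the cumulative proportions printed above, the first value at or above 0.95 is the 20th (0.9527). The rule can be sketched generically as follows, assuming a hypothetical observations-by-variables matrix `X_pca`; note that here `prcomp()` is given the data matrix itself, which it centers internally, rather than a covariance matrix.

```r
set.seed(3)

# Hypothetical data: 200 observations of 10 correlated variables
base <- matrix(rnorm(200 * 3), nrow = 200)
X_pca <- base %*% matrix(rnorm(3 * 10), nrow = 3) +
  matrix(rnorm(200 * 10, sd = 0.1), nrow = 200)

pca_demo <- prcomp(X_pca)

# Cumulative share of variance explained by the leading components
cum_var <- cumsum(pca_demo$sdev^2) / sum(pca_demo$sdev^2)

# Smallest number of components whose cumulative share reaches 95%
k95 <- which(cum_var >= 0.95)[1]

print(round(cum_var, 4))
print(k95)
```
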
  7. Plot the first 10 images generated by PCA. They will appear to be noise. Why? (5 points)
par(mfrow=c(3,4))

for (i in 1:10) {
  image(array(pca$x[,i], dim=c(28,28)))
}

The images are blurry because each component captures variance across all ten digits rather than the shape of any single digit.

  8. Now, select only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance (100%). Plot the first 10 images. What do you see? (5 points)
tr8<-train%>%
  filter(label==8)
dim(tr8)
## [1] 4063  785
trcov8<-cov(tr8/255)
pca8<-prcomp(trcov8)

par(mfrow=c(3,4))

for (i in 1:10) {
  image(array(pca8$x[,i], dim=c(28,28)))
}

The 8s are recovered successfully: each image shows an 8 with a different combination of dark and bright regions.

  9. An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be a 10 x 10, and correct classifications will be on the diagonal. (10 points)
train$label <- as.factor(train$label)
X <- train[2:785]/255
X$label <- train$label
model <- multinom(label~., data=X, MaxNWts=1000000)
## # weights:  7860 (7065 variable)
## initial  value 96708.573906 
## iter  10 value 25322.714106
## iter  20 value 20402.086316
## iter  30 value 19312.872829
## iter  40 value 18703.256586
## iter  50 value 18197.815143
## iter  60 value 17732.985798
## iter  70 value 16739.962157
## iter  80 value 14961.658448
## iter  90 value 13446.085942
## iter 100 value 12442.636014
## final  value 12442.636014 
## stopped after 100 iterations
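# NOTE: the model was fit on X (pixels divided by 255), while this call
# predicts on the unscaled pixel columns of train.
prediction <- predict(model, train[2:785])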
prediction <- data.frame(prediction = prediction)
prediction$actual <- train$label
prediction$equal <- ifelse(prediction$prediction==prediction$actual,1,0)
confusionMatrix(prediction$prediction, prediction$actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 3841    0   12    5    6   31   14   12    4    9
##          1    1 3734   11    3    2   11    4    7    5    3
##          2    9   13 3460   49   22   14   17   43    5    8
##          3   11   37   99 3833   18  235    4   52   32   36
##          4    6    2   22    2 3222   15    8   13    2   17
##          5    9    0    3   12    2 1926   11    3    1    4
##          6   27    5   31   13   22   53 3808    3    2    0
##          7    2    4   14    8    2    8    2 3420    1   19
##          8  217  867  492  391  473 1425  266  234 4005  303
##          9    9   22   33   35  303   77    3  614    6 3789
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8342               
##                  95% CI : (0.8306, 0.8378)     
##     No Information Rate : 0.1115               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8158               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.92957  0.79718  0.82835  0.88095  0.79126  0.50751
## Specificity           0.99754  0.99874  0.99524  0.98608  0.99771  0.99882
## Pos Pred Value        0.97636  0.98757  0.95055  0.87973  0.97371  0.97717
## Neg Pred Value        0.99236  0.97514  0.98131  0.98624  0.97803  0.95331
## Prevalence            0.09838  0.11152  0.09945  0.10360  0.09695  0.09036
## Detection Rate        0.09145  0.08890  0.08238  0.09126  0.07671  0.04586
## Detection Prevalence  0.09367  0.09002  0.08667  0.10374  0.07879  0.04693
## Balanced Accuracy     0.96356  0.89796  0.91179  0.93351  0.89448  0.75317
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.92047  0.77710  0.98572  0.90473
## Specificity           0.99588  0.99840  0.87695  0.97086
## Pos Pred Value        0.96065  0.98276  0.46178  0.77469
## Neg Pred Value        0.99135  0.97453  0.99826  0.98925
## Prevalence            0.09850  0.10479  0.09674  0.09971
## Detection Rate        0.09067  0.08143  0.09536  0.09021
## Detection Prevalence  0.09438  0.08286  0.20650  0.11645
## Balanced Accuracy     0.95818  0.88775  0.93134  0.93779
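
The headline accuracy can be recomputed from the printed confusion matrix as the diagonal total over the grand total, 35038 / 42000. The sketch below re-enters the printed counts by hand; the object names `cm`, `recall`, and `precision` are illustrative.

```r
# Counts copied from the printed confusion matrix
# (rows = Prediction 0-9, columns = Reference 0-9)
cm <- matrix(c(
  3841,    0,   12,    5,    6,   31,   14,   12,    4,    9,
     1, 3734,   11,    3,    2,   11,    4,    7,    5,    3,
     9,   13, 3460,   49,   22,   14,   17,   43,    5,    8,
    11,   37,   99, 3833,   18,  235,    4,   52,   32,   36,
     6,    2,   22,    2, 3222,   15,    8,   13,    2,   17,
     9,    0,    3,   12,    2, 1926,   11,    3,    1,    4,
    27,    5,   31,   13,   22,   53, 3808,    3,    2,    0,
     2,    4,   14,    8,    2,    8,    2, 3420,    1,   19,
   217,  867,  492,  391,  473, 1425,  266,  234, 4005,  303,
     9,   22,   33,   35,  303,   77,    3,  614,    6, 3789
), nrow = 10, byrow = TRUE,
dimnames = list(Prediction = as.character(0:9),
                Reference = as.character(0:9)))

# Overall accuracy: correct classifications sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall (sensitivity) and precision (positive predictive value)
recall <- diag(cm) / colSums(cm)
precision <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(recall, 5))
print(round(precision, 5))
```

The row for predicted class 8 stands out: it contains 8673 predictions against 4063 true 8s, which is why the precision for that class is the lowest in the table.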

Final Problem 3.

You are to compete in the House Prices: Advanced Regression Techniques competition https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not? 5 points
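
One possible shape for the correlation step, sketched on synthetic data. The variables `area`, `qual`, and `price` are made up for illustration and are not the Kaggle columns. The familywise figure at the end assumes the three tests are independent.

```r
set.seed(4)

# Hypothetical stand-ins for three quantitative variables
n <- 200
area <- rnorm(n, mean = 1500, sd = 400)
qual <- 5 + 0.002 * (area - 1500) + rnorm(n)
price <- 50000 + 100 * area + 15000 * qual + rnorm(n, sd = 30000)
vars <- data.frame(area, qual, price)

print(round(cor(vars), 3))

# Test H0: correlation = 0 for each pair, with an 80% confidence interval
pairs_idx <- combn(names(vars), 2)
for (j in seq_len(ncol(pairs_idx))) {
  a <- pairs_idx[1, j]
  b <- pairs_idx[2, j]
  ct <- cor.test(vars[[a]], vars[[b]], conf.level = 0.80)
  cat(a, "vs", b, ": r =", round(ct$estimate, 3),
      " 80% CI = [", round(ct$conf.int[1], 3), ",",
      round(ct$conf.int[2], 3), "]  p =", signif(ct$p.value, 3), "\n")
}

# Chance of at least one false rejection across 3 independent tests
# when each is run at alpha = 0.20
fwer <- 1 - (1 - 0.20)^3
print(fwer)
```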

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix. 5 points
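
A sketch of the linear algebra steps on a hypothetical 3 x 3 correlation matrix `R`. Base R has no built-in LU routine, so `lu_decompose` below is a hand-written Doolittle factorization without pivoting, which is an assumption that holds for a positive definite matrix like this one.

```r
# Hypothetical 3 x 3 correlation matrix (symmetric, positive definite)
R <- matrix(c(1.0, 0.6, 0.3,
              0.6, 1.0, 0.5,
              0.3, 0.5, 1.0), nrow = 3)

# Precision matrix: the inverse of the correlation matrix;
# its diagonal holds the variance inflation factors
P <- solve(R)
print(round(diag(P), 4))

# Both products should be the identity matrix (up to rounding)
print(round(R %*% P, 10))
print(round(P %*% R, 10))

# Doolittle LU decomposition without pivoting: A = L %*% U,
# with L unit lower triangular and U upper triangular
lu_decompose <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]
    }
  }
  list(L = L, U = U)
}

lu <- lu_decompose(R)
print(lu$L)
print(lu$U)
```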

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss. 10 points
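
A base-R sketch of these steps on a synthetic right-skewed variable `x`, which is an assumption standing in for a column of the training data. For the exponential distribution the maximum-likelihood rate has the closed form 1 / mean(x), so the sketch computes it directly; `MASS::fitdistr(x, "exponential")` reports the same estimate. The 95% interval is read here as an interval for the mean, which is one interpretation of the task.

```r
set.seed(5)

# Hypothetical right-skewed variable, shifted so its minimum is above zero
x <- rexp(1460, rate = 1 / 500) + 1

# Maximum-likelihood estimate of the exponential rate
lambda_hat <- 1 / mean(x)

# 1000 draws from the fitted exponential, for a side-by-side histogram
sim <- rexp(1000, rate = lambda_hat)

# 5th and 95th percentiles from the fitted model (inverse of the CDF)
q_model <- qexp(c(0.05, 0.95), rate = lambda_hat)

# 95% confidence interval for the mean, assuming normality
ci_mean <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(length(x))

# Empirical 5th and 95th percentiles of the data
q_emp <- quantile(x, c(0.05, 0.95))

print(lambda_hat)
print(q_model)
print(ci_mean)
print(q_emp)
```

`hist(x)` and `hist(sim)` can then be compared visually. The model percentiles follow from inverting the CDF, F(q) = 1 - exp(-λq), which gives q = -log(1 - p) / λ.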

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
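
As a rough sketch of the modeling step, the following fits a two-predictor regression on synthetic data and builds a submission-shaped data frame. The column names `GrLivArea`, `OverallQual`, `SalePrice`, and `Id` come from the data description; the values, the coefficients used to simulate them, and the choice of predictors are assumptions for illustration only.

```r
set.seed(6)
n <- 300

# Synthetic training data with two predictors named after real columns
demo_train <- data.frame(
  GrLivArea = round(runif(n, 600, 3000)),
  OverallQual = sample(1:10, n, replace = TRUE)
)
demo_train$SalePrice <- 20000 + 60 * demo_train$GrLivArea +
  15000 * demo_train$OverallQual + rnorm(n, sd = 20000)

# Multiple regression of price on living area and overall quality
fit <- lm(SalePrice ~ GrLivArea + OverallQual, data = demo_train)
print(summary(fit))

# Synthetic test rows, then one predicted price per Id
demo_test <- data.frame(Id = 1461:1463,
                        GrLivArea = c(900, 1500, 2400),
                        OverallQual = c(4, 6, 9))
submission <- data.frame(Id = demo_test$Id,
                         SalePrice = predict(fit, newdata = demo_test))
print(submission)

# write.csv(submission, "submission.csv", row.names = FALSE)
```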

Data Description

MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

   A    Agriculture
   C    Commercial
   FV   Floating Village Residential
   I    Industrial
   RH   Residential High Density
   RL   Residential Low Density
   RP   Residential Low Density Park 
   RM   Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

   Grvl Gravel  
   Pave Paved
    

Alley: Type of alley access to property

   Grvl Gravel
   Pave Paved
   NA   No alley access
    

LotShape: General shape of property

   Reg  Regular 
   IR1  Slightly irregular
   IR2  Moderately Irregular
   IR3  Irregular
   

LandContour: Flatness of the property

   Lvl  Near Flat/Level 
   Bnk  Banked - Quick and significant rise from street grade to building
   HLS  Hillside - Significant slope from side to side
   Low  Depression
    

Utilities: Type of utilities available

   AllPub   All public Utilities (E,G,W,& S)    
   NoSewr   Electricity, Gas, and Water (Septic Tank)
   NoSeWa   Electricity and Gas Only
   ELO  Electricity only    

LotConfig: Lot configuration

   Inside   Inside lot
   Corner   Corner lot
   CulDSac  Cul-de-sac
   FR2  Frontage on 2 sides of property
   FR3  Frontage on 3 sides of property

LandSlope: Slope of property

   Gtl  Gentle slope
   Mod  Moderate Slope  
   Sev  Severe Slope

Neighborhood: Physical locations within Ames city limits

   Blmngtn  Bloomington Heights
   Blueste  Bluestem
   BrDale   Briardale
   BrkSide  Brookside
   ClearCr  Clear Creek
   CollgCr  College Creek
   Crawfor  Crawford
   Edwards  Edwards
   Gilbert  Gilbert
   IDOTRR   Iowa DOT and Rail Road
   MeadowV  Meadow Village
   Mitchel  Mitchell
   Names    North Ames
   NoRidge  Northridge
   NPkVill  Northpark Villa
   NridgHt  Northridge Heights
   NWAmes   Northwest Ames
   OldTown  Old Town
   SWISU    South & West of Iowa State University
   Sawyer   Sawyer
   SawyerW  Sawyer West
   Somerst  Somerset
   StoneBr  Stone Brook
   Timber   Timberland
   Veenker  Veenker
        

Condition1: Proximity to various conditions

   Artery   Adjacent to arterial street
   Feedr    Adjacent to feeder street   
   Norm Normal  
   RRNn Within 200' of North-South Railroad
   RRAn Adjacent to North-South Railroad
   PosN Near positive off-site feature--park, greenbelt, etc.
   PosA Adjacent to positive off-site feature
   RRNe Within 200' of East-West Railroad
   RRAe Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

   Artery   Adjacent to arterial street
   Feedr    Adjacent to feeder street   
   Norm Normal  
   RRNn Within 200' of North-South Railroad
   RRAn Adjacent to North-South Railroad
   PosN Near positive off-site feature--park, greenbelt, etc.
   PosA Adjacent to positive off-site feature
   RRNe Within 200' of East-West Railroad
   RRAe Adjacent to East-West Railroad

BldgType: Type of dwelling

   1Fam Single-family Detached  
   2FmCon   Two-family Conversion; originally built as one-family dwelling
   Duplx    Duplex
   TwnhsE   Townhouse End Unit
   TwnhsI   Townhouse Inside Unit

HouseStyle: Style of dwelling

   1Story   One story
   1.5Fin   One and one-half story: 2nd level finished
   1.5Unf   One and one-half story: 2nd level unfinished
   2Story   Two story
   2.5Fin   Two and one-half story: 2nd level finished
   2.5Unf   Two and one-half story: 2nd level unfinished
   SFoyer   Split Foyer
   SLvl Split Level

OverallQual: Rates the overall material and finish of the house

   10   Very Excellent
   9    Excellent
   8    Very Good
   7    Good
   6    Above Average
   5    Average
   4    Below Average
   3    Fair
   2    Poor
   1    Very Poor

OverallCond: Rates the overall condition of the house

   10   Very Excellent
   9    Excellent
   8    Very Good
   7    Good
   6    Above Average   
   5    Average
   4    Below Average   
   3    Fair
   2    Poor
   1    Very Poor
    

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

   Flat Flat
   Gable    Gable
   Gambrel  Gambrel (Barn)
   Hip  Hip
   Mansard  Mansard
   Shed Shed
    

RoofMatl: Roof material

   ClyTile  Clay or Tile
   CompShg  Standard (Composite) Shingle
   Membran  Membrane
   Metal    Metal
   Roll Roll
   Tar&Grv  Gravel & Tar
   WdShake  Wood Shakes
   WdShngl  Wood Shingles
    

Exterior1st: Exterior covering on house

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast 
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles

MasVnrType: Masonry veneer type

   BrkCmn   Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   None None
   Stone    Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor
    

ExterCond: Evaluates the present condition of the material on the exterior

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor
    

Foundation: Type of foundation

   BrkTil   Brick & Tile
   CBlock   Cinder Block
   PConc    Poured Concrete
   Slab Slab
   Stone    Stone
   Wood Wood
    

BsmtQual: Evaluates the height of the basement

   Ex   Excellent (100+ inches) 
   Gd   Good (90-99 inches)
   TA   Typical (80-89 inches)
   Fa   Fair (70-79 inches)
   Po   Poor (<70 inches)
   NA   No Basement
    

BsmtCond: Evaluates the general condition of the basement

   Ex   Excellent
   Gd   Good
   TA   Typical - slight dampness allowed
   Fa   Fair - dampness or some cracking or settling
   Po   Poor - Severe cracking, settling, or wetness
   NA   No Basement

BsmtExposure: Refers to walkout or garden level walls

   Gd   Good Exposure
   Av   Average Exposure (split levels or foyers typically score average or above)  
   Mn   Minimum Exposure
   No   No Exposure
   NA   No Basement

BsmtFinType1: Rating of basement finished area

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinished
   NA   No Basement
    

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinished
   NA   No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

   Floor    Floor Furnace
   GasA Gas forced warm air furnace
   GasW Gas hot water or steam heat
   Grav Gravity furnace 
   OthW Hot water or steam heat other than gas
   Wall Wall furnace
    

HeatingQC: Heating quality and condition

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor
    

CentralAir: Central air conditioning

   N    No
   Y    Yes
    

Electrical: Electrical system

   SBrkr    Standard Circuit Breakers & Romex
   FuseA    Fuse Box over 60 AMP and all Romex wiring (Average) 
   FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
   FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
   Mix  Mixed
    

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Bedrooms above grade (does NOT include basement bedrooms)

Kitchen: Kitchens above grade

KitchenQual: Kitchen quality

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
    

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Import Data

house_train<-read.csv('house_train.csv', sep =",", stringsAsFactors=T)
house_test<-read.csv('house_test.csv',  sep =",",stringsAsFactors = T)
str(house_train)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

The str() output suggests that some variables need to be converted to a different class.
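
Which variables to convert is a judgment call. As one hedged illustration, `MSSubClass` is read in as an integer but is described as a dwelling-type code, and `OverallQual` is a 1 to 10 rating. The sketch below uses a hypothetical three-row data frame `demo_house` with values taken from rows 1, 2, and 4 of the str() output.

```r
# Hypothetical three-row extract using column names from house_train
demo_house <- data.frame(
  MSSubClass = c(60L, 20L, 70L),
  OverallQual = c(7L, 6L, 7L),
  SalePrice = c(208500L, 181500L, 140000L)
)

# MSSubClass is a category code, so treat it as an unordered factor
demo_house$MSSubClass <- factor(demo_house$MSSubClass)

# OverallQual is a rating, so an ordered factor keeps the ranking
demo_house$OverallQual <- factor(demo_house$OverallQual,
                                 levels = 1:10, ordered = TRUE)

str(demo_house)
```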

head(house_train)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
tail(house_train)
##        Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1455 1455         20       FV          62    7500   Pave  Pave      Reg
## 1456 1456         60       RL          62    7917   Pave  <NA>      Reg
## 1457 1457         20       RL          85   13175   Pave  <NA>      Reg
## 1458 1458         70       RL          66    9042   Pave  <NA>      Reg
## 1459 1459         20       RL          68    9717   Pave  <NA>      Reg
## 1460 1460         20       RL          75    9937   Pave  <NA>      Reg
##      LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1455         Lvl    AllPub    Inside       Gtl      Somerst       Norm
## 1456         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 1457         Lvl    AllPub    Inside       Gtl       NWAmes       Norm
## 1458         Lvl    AllPub    Inside       Gtl      Crawfor       Norm
## 1459         Lvl    AllPub    Inside       Gtl        NAmes       Norm
## 1460         Lvl    AllPub    Inside       Gtl      Edwards       Norm
##      Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1455       Norm     1Fam     1Story           7           5      2004
## 1456       Norm     1Fam     2Story           6           5      1999
## 1457       Norm     1Fam     1Story           6           6      1978
## 1458       Norm     1Fam     2Story           7           9      1941
## 1459       Norm     1Fam     1Story           5           6      1950
## 1460       Norm     1Fam     1Story           5           6      1965
##      YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1455         2005     Gable  CompShg     VinylSd     VinylSd       None
## 1456         2000     Gable  CompShg     VinylSd     VinylSd       None
## 1457         1988     Gable  CompShg     Plywood     Plywood      Stone
## 1458         2006     Gable  CompShg     CemntBd     CmentBd       None
## 1459         1996       Hip  CompShg     MetalSd     MetalSd       None
## 1460         1965     Gable  CompShg     HdBoard     HdBoard       None
##      MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1455          0        Gd        TA      PConc       Gd       TA           No
## 1456          0        TA        TA      PConc       Gd       TA           No
## 1457        119        TA        TA     CBlock       Gd       TA           No
## 1458          0        Ex        Gd      Stone       TA       Gd           No
## 1459          0        TA        TA     CBlock       TA       TA           Mn
## 1460          0        Gd        TA     CBlock       TA       TA           No
##      BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1455          GLQ        410          Unf          0       811        1221
## 1456          Unf          0          Unf          0       953         953
## 1457          ALQ        790          Rec        163       589        1542
## 1458          GLQ        275          Unf          0       877        1152
## 1459          GLQ         49          Rec       1029         0        1078
## 1460          BLQ        830          LwQ        290       136        1256
##      Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1455    GasA        Ex          Y      SBrkr      1221         0            0
## 1456    GasA        Ex          Y      SBrkr       953       694            0
## 1457    GasA        TA          Y      SBrkr      2073         0            0
## 1458    GasA        Ex          Y      SBrkr      1188      1152            0
## 1459    GasA        Gd          Y      FuseA      1078         0            0
## 1460    GasA        Gd          Y      SBrkr      1256         0            0
##      GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1455      1221            1            0        2        0            2
## 1456      1647            0            0        2        1            3
## 1457      2073            1            0        2        0            3
## 1458      2340            0            0        2        0            4
## 1459      1078            1            0        1        0            2
## 1460      1256            1            0        1        1            3
##      KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1455            1          Gd            6        Typ          0        <NA>
## 1456            1          TA            7        Typ          1          TA
## 1457            1          TA            7       Min1          2          TA
## 1458            1          Gd            9        Typ          2          Gd
## 1459            1          Gd            5        Typ          0        <NA>
## 1460            1          TA            6        Typ          0        <NA>
##      GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1455     Attchd        2004          RFn          2        400         TA
## 1456     Attchd        1999          RFn          2        460         TA
## 1457     Attchd        1978          Unf          2        500         TA
## 1458     Attchd        1941          RFn          1        252         TA
## 1459     Attchd        1950          Unf          1        240         TA
## 1460     Attchd        1965          Fin          1        276         TA
##      GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1455         TA          Y          0         113             0          0
## 1456         TA          Y          0          40             0          0
## 1457         TA          Y        349           0             0          0
## 1458         TA          Y          0          60             0          0
## 1459         TA          Y        366           0           112          0
## 1460         TA          Y        736          68             0          0
##      ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1455           0        0   <NA>  <NA>        <NA>       0     10   2009
## 1456           0        0   <NA>  <NA>        <NA>       0      8   2007
## 1457           0        0   <NA> MnPrv        <NA>       0      2   2010
## 1458           0        0   <NA> GdPrv        Shed    2500      5   2010
## 1459           0        0   <NA>  <NA>        <NA>       0      4   2010
## 1460           0        0   <NA>  <NA>        <NA>       0      6   2008
##      SaleType SaleCondition SalePrice
## 1455       WD        Normal    185000
## 1456       WD        Normal    175000
## 1457       WD        Normal    210000
## 1458       WD        Normal    266500
## 1459       WD        Normal    142125
## 1460       WD        Normal    147500

Visualization

ggplot(house_train, aes(x=GrLivArea,y=SalePrice))+geom_point()+geom_smooth()+ggtitle('Above Grade Living Area and Sales Price')
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The sale price increases with the above-grade living area up to about 4,000 square feet. Above that point, the smoothed trend reverses.

ggplot(house_train, aes(x=as.factor(OverallQual), y=SalePrice))+geom_boxplot()+xlab('OverallQual')+ggtitle('The Overall Quality and Sales Price')

The sale price increases as the overall quality increases.

ggplot(house_train, aes(x=as.factor(OverallCond), y=SalePrice))+geom_boxplot()+xlab('OverallCond')+ggtitle('The Overall Condition and Sales Price')

Surprisingly, the median sale price increases as the overall condition goes from 1 to 5, drops from 6 to 8, and rises slightly at 9.
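
Medians per group can also be tabulated directly instead of being read off a boxplot. The sketch below uses a small made-up data frame (`toy`, with invented prices), so the numbers say nothing about the real data; with the actual data, the same call would use `house_train$SalePrice` and `house_train$OverallCond`.

```r
# Invented data: three condition levels with a handful of prices each.
toy <- data.frame(
  OverallCond = c(3, 3, 5, 5, 5, 7, 7),
  SalePrice   = c(100000, 120000, 200000, 210000, 190000, 150000, 170000)
)

# Median sale price within each condition level.
med_by_cond <- tapply(toy$SalePrice, toy$OverallCond, median)
med_by_cond
```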

ggplot(house_train,aes(x=GarageCars))+geom_bar()+ggtitle('Car Capacity of Garage')

ggplot(house_train, aes(x=as.factor(GarageCars),y=SalePrice))+geom_boxplot()+ggtitle('Car Capacity of Garage and Sale Price')

Two-car garages are the most common in the data set, and houses with three-car garages have the highest sale prices.

ggplot(house_train, aes(x=CentralAir, y=SalePrice))+geom_boxplot()+ggtitle('With Central Air Conditioning and Without Central Air Conditioning')

Houses with central air conditioning have higher sale prices than those without it.

ggplot(house_train, aes(x=LandSlope, y=SalePrice))+geom_boxplot()

The plot shows that houses with severe and moderate land slopes have very similar sale prices.

ggplot(house_train, aes(x=ExterQual, y=SalePrice))+geom_boxplot()+ggtitle('Exterior Quality and Sales Price')

As expected, houses with excellent exterior quality have the highest sale prices, followed by good and average.

ggplot(house_train, aes(x=FullBath, y= SalePrice, fill = SalePrice)) +
  geom_bar(stat="identity")

Most houses with 3 full bathrooms show a higher sale price.

ggplot(house_train, aes(x=YearBuilt, y=SalePrice))+geom_point()+ggtitle('Year Built and Sales Price')

In most cases, more recently built houses tend to have higher sale prices.

ggplot(house_train, aes(x=SaleType, y=SalePrice))+geom_boxplot()+ggtitle('Types and Sales Price')

Scatterplot Matrix and Correlation

The plots above show the relationship between the dependent variable and individual independent variables. A scatterplot matrix gives an overall view of the pairwise relationships.

pairs(~SalePrice + GrLivArea+TotalBsmtSF+GarageArea, data=house_train)

library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
library(tidyverse)

cor.test(~SalePrice+GarageArea, data=house_train, conf.level=0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  SalePrice and GarageArea
## t = 30.446, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6024756 0.6435283
## sample estimates:
##       cor 
## 0.6234314
cor.test(~SalePrice+GrLivArea, data=house_train, conf.level=0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  SalePrice and GrLivArea
## t = 38.348, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245
cor.test(~GarageArea+GrLivArea, data=house_train, conf.level=0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  GarageArea and GrLivArea
## t = 20.276, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975

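According to the correlation tests, both garage area (r ≈ 0.62) and above-grade living area (r ≈ 0.71) are positively correlated with the sale price. The two predictors are also positively correlated with each other (r ≈ 0.47).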
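
The 80 percent intervals reported by cor.test() come from the Fisher z-transformation. As a rough check, the sketch below recomputes the interval for SalePrice and GarageArea from the reported correlation and degrees of freedom alone; the variable names are illustrative.

```r
# Values taken from the cor.test() output for SalePrice and GarageArea.
r <- 0.6234314        # sample correlation
n <- 1458 + 2         # df = n - 2 for a Pearson correlation
conf_level <- 0.8

# Fisher z-transformation: atanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3).
z    <- atanh(r)
se   <- 1 / sqrt(n - 3)
crit <- qnorm((1 + conf_level) / 2)

# Transform the interval back to the correlation scale.
ci <- tanh(z + c(-1, 1) * crit * se)
round(ci, 4)
```

The result should agree with the reported interval up to rounding of the inputs.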

house_cor <- house_train %>%
  dplyr::select(SalePrice, GrLivArea, GarageArea) %>%
  cor(., method = "pearson")

corrplot(house_cor)

sol_house_cor<-solve(house_cor)
pre_house_cor<-house_cor %*% sol_house_cor
pre_house_cor
##                           SalePrice                 GrLivArea GarageArea
## SalePrice  1.0000000000000004440892 0.00000000000000004163336          0
## GrLivArea  0.0000000000000002220446 1.00000000000000000000000          0
## GarageArea 0.0000000000000003330669 0.00000000000000008326673          1
library(matrixcalc)
## 
## Attaching package: 'matrixcalc'
## The following object is masked from 'package:igraph':
## 
##     %s%
lu<-lu.decomposition(house_cor)
lu
## $L
##           [,1]       [,2] [,3]
## [1,] 1.0000000 0.00000000    0
## [2,] 0.7086245 1.00000000    0
## [3,] 0.6234314 0.05467234    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.7086245 0.6234314
## [2,]    0 0.4978513 0.0272187
## [3,]    0 0.0000000 0.6098451
house_cor==lu$L %*% lu$U
##            SalePrice GrLivArea GarageArea
## SalePrice       TRUE      TRUE       TRUE
## GrLivArea       TRUE      TRUE       TRUE
## GarageArea      TRUE      TRUE       TRUE
slu<-lu.decomposition(sol_house_cor)
slu
## $L
##            [,1]       [,2] [,3]
## [1,]  1.0000000  0.0000000    0
## [2,] -0.5336085  1.0000000    0
## [3,] -0.3731704 -0.4689975    1
## 
## $U
##          [,1]      [,2]       [,3]
## [1,] 2.569203 -1.370948 -0.9587504
## [2,] 0.000000  1.281983 -0.6012469
## [3,] 0.000000  0.000000  1.0000000
round(sol_house_cor)==round(slu$L %*% slu$U)
##            SalePrice GrLivArea GarageArea
## SalePrice       TRUE      TRUE       TRUE
## GrLivArea       TRUE      TRUE       TRUE
## GarageArea      TRUE      TRUE       TRUE
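
Matrix identities such as these hold only up to floating-point error, so a tolerance-based comparison is usually safer than `==`. The following is a minimal sketch in base R that rebuilds the correlation matrix from the three correlations reported above; the names `R`, `P` and `identity_ok` are illustrative.

```r
# Correlation matrix rebuilt from the reported pairwise correlations.
v <- c("SalePrice", "GrLivArea", "GarageArea")
R <- matrix(c(1,         0.7086245, 0.6234314,
              0.7086245, 1,         0.4689975,
              0.6234314, 0.4689975, 1),
            nrow = 3, byrow = TRUE, dimnames = list(v, v))

# Precision matrix: the inverse of the correlation matrix.
P <- solve(R)

# all.equal() compares within a numerical tolerance;
# check.attributes = FALSE ignores the row and column names.
identity_ok <- isTRUE(all.equal(R %*% P, diag(3), check.attributes = FALSE))
identity_ok

round(P, 4)
```
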
ggplot(house_train, aes(x=SalePrice))+geom_histogram()+ggtitle('Histogram of Sales Price')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
fitdist<-fitdistr(house_train$SalePrice, densfun = 'exponential')
lam<-fitdist$estimate
redist<-rexp(500, lam)
summary(redist)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     125.4   52360.6  127844.9  176968.1  247065.2 1006709.4
hist(redist)

The 5th and 95th percentile values.

fifth<-round(log(1-0.05)/-lam,2)
fifth
##    rate 
## 9280.04
nintyfiv<-round(log(1-0.95)/-lam,2)
nintyfiv
##     rate 
## 541991.5
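
The two percentiles above use the closed-form exponential quantile, Q(p) = -log(1 - p) / lambda. The sketch below checks that formula against the built-in qexp() and shows how empirical quantiles could be computed for comparison. The sample `x` is simulated with an arbitrary rate, not the housing data.

```r
set.seed(1)

# Simulated stand-in for a right-skewed price variable (arbitrary rate).
x <- rexp(1000, rate = 1 / 180000)

# Maximum-likelihood estimate of the exponential rate is 1 / mean.
lam <- 1 / mean(x)

p <- c(0.05, 0.95)
closed_form <- -log(1 - p) / lam        # quantile formula used above
builtin     <- qexp(p, rate = lam)      # same quantiles from base R
empirical   <- quantile(x, probs = p)   # quantiles of the sample itself

rbind(closed_form, builtin, empirical)
```

On real data, a large gap between the fitted and empirical quantiles would suggest that the exponential distribution is a poor fit.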

Multiple Regression Model

library(tidyverse)
drop_train<-house_train %>% 
  drop_na(LotFrontage,MasVnrType,MasVnrArea, BsmtQual, BsmtCond, BsmtExposure,BsmtFinType1,BsmtFinType2,Electrical,GarageType,GarageYrBlt, GarageFinish,GarageQual,GarageCond)


mod<-lm(SalePrice ~    LotArea +Street + LandContour  + LotConfig + LandSlope + Neighborhood   + OverallQual + OverallCond + YearBuilt + BsmtQual   + BsmtExposure + CentralAir     + X1stFlrSF + X2ndFlrSF + BedroomAbvGr + KitchenQual + TotRmsAbvGrd+ Fireplaces  + GarageCars , data= drop_train)

summary(mod)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + Street + LandContour + LotConfig + 
##     LandSlope + Neighborhood + OverallQual + OverallCond + YearBuilt + 
##     BsmtQual + BsmtExposure + CentralAir + X1stFlrSF + X2ndFlrSF + 
##     BedroomAbvGr + KitchenQual + TotRmsAbvGrd + Fireplaces + 
##     GarageCars, data = drop_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -370155  -14157    -218   12977  231013 
## 
## Coefficients:
##                         Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)         -580584.9262  179212.1323  -3.240             0.001235 ** 
## LotArea                   0.7969       0.1575   5.058  0.00000050027327094 ***
## StreetPave            31972.7485   17713.6100   1.805             0.071367 .  
## LandContourHLS        22344.5124    7503.0848   2.978             0.002968 ** 
## LandContourLow         7610.9878   11294.5822   0.674             0.500549    
## LandContourLvl        23107.5893    5517.3436   4.188  0.00003050621763567 ***
## LotConfigCulDSac      17262.1354    5646.6593   3.057             0.002292 ** 
## LotConfigFR2         -12201.9085    6596.3032  -1.850             0.064625 .  
## LotConfigFR3         -25535.6900   16448.1083  -1.553             0.120847    
## LotConfigInside        1486.4159    2669.8692   0.557             0.577827    
## LandSlopeMod          11659.6407    5974.0124   1.952             0.051239 .  
## LandSlopeSev         -32718.8760   17213.4674  -1.901             0.057608 .  
## NeighborhoodBlueste   -9747.0752   24630.6113  -0.396             0.692385    
## NeighborhoodBrDale    -9916.6593   12892.6348  -0.769             0.441966    
## NeighborhoodBrkSide    6360.4216   11606.6356   0.548             0.583810    
## NeighborhoodClearCr   16926.1641   13973.5289   1.211             0.226055    
## NeighborhoodCollgCr   22343.5126    9428.2193   2.370             0.017977 *  
## NeighborhoodCrawfor   29655.4531   11403.0031   2.601             0.009436 ** 
## NeighborhoodEdwards  -10782.7530   10473.0223  -1.030             0.303450    
## NeighborhoodGilbert   12337.6131   10250.7284   1.204             0.229025    
## NeighborhoodIDOTRR     6624.7098   12374.3704   0.535             0.592517    
## NeighborhoodMeadowV   -6342.3616   14057.6037  -0.451             0.651962    
## NeighborhoodMitchel   -3723.4236   11162.1627  -0.334             0.738767    
## NeighborhoodNAmes      3700.9265   10058.2247   0.368             0.712985    
## NeighborhoodNoRidge   80496.4228   10930.3709   7.364  0.00000000000036149 ***
## NeighborhoodNPkVill     868.4867   15291.9117   0.057             0.954720    
## NeighborhoodNridgHt   49068.0076    9787.3172   5.013  0.00000062832220933 ***
## NeighborhoodNWAmes     3362.5445   10476.4661   0.321             0.748304    
## NeighborhoodOldTown   -4802.4838   11179.6515  -0.430             0.667595    
## NeighborhoodSawyer     5711.4393   10893.7139   0.524             0.600190    
## NeighborhoodSawyerW   18964.1016   10236.1776   1.853             0.064215 .  
## NeighborhoodSomerst   33686.7207    9666.2448   3.485             0.000513 ***
## NeighborhoodStoneBr   73860.7657   11573.6850   6.382  0.00000000026322087 ***
## NeighborhoodSWISU     13155.1504   13067.3157   1.007             0.314303    
## NeighborhoodTimber    16695.9502   11055.2474   1.510             0.131289    
## NeighborhoodVeenker   34130.1301   15475.6234   2.205             0.027644 *  
## OverallQual            9266.5480    1421.2839   6.520  0.00000000010962288 ***
## OverallCond            8035.9665    1138.6456   7.057  0.00000000000309474 ***
## YearBuilt               270.7543      88.2643   3.068             0.002214 ** 
## BsmtQualFa           -35504.6555    8432.0517  -4.211  0.00002766622757058 ***
## BsmtQualGd           -35717.9901    4413.6528  -8.093  0.00000000000000162 ***
## BsmtQualTA           -32000.5166    5735.0203  -5.580  0.00000003070331741 ***
## BsmtExposureGd        20355.1316    4470.4283   4.553  0.00000590607452817 ***
## BsmtExposureMn        -1990.1470    4370.7806  -0.455             0.648967    
## BsmtExposureNo        -7177.1106    2994.5198  -2.397             0.016717 *  
## CentralAirY            8318.8421    5062.5596   1.643             0.100642    
## X1stFlrSF                48.8335       4.9888   9.789 < 0.0000000000000002 ***
## X2ndFlrSF                37.9051       4.5111   8.403 < 0.0000000000000002 ***
## BedroomAbvGr          -2074.4525    1963.9138  -1.056             0.291084    
## KitchenQualFa        -31762.6040    9124.7759  -3.481             0.000520 ***
## KitchenQualGd        -34564.7144    4591.8416  -7.527  0.00000000000011198 ***
## KitchenQualTA        -36578.8929    5326.1370  -6.868  0.00000000001120575 ***
## TotRmsAbvGrd           2449.2646    1306.3633   1.875             0.061091 .  
## Fireplaces             5985.0837    1958.1753   3.056             0.002297 ** 
## GarageCars            13593.2101    2203.7300   6.168  0.00000000098741630 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31970 on 1039 degrees of freedom
## Multiple R-squared:  0.8595, Adjusted R-squared:  0.8522 
## F-statistic: 117.7 on 54 and 1039 DF,  p-value: < 0.00000000000000022

The multiple R-squared is 0.8595, which means that the model explains about 85.95% of the variance in SalePrice in the training data.
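
To make the definition concrete, the sketch below recomputes R-squared and adjusted R-squared by hand. It uses the built-in mtcars data and an arbitrary model as a stand-in, since only the formulas matter here.

```r
# Arbitrary example model on a built-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(resid(fit)^2)                        # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2  <- 1 - rss / tss                            # share of variance explained

n <- nrow(mtcars)
k <- length(coef(fit)) - 1                      # predictors, excluding intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalizes extra predictors

c(r2 = r2, adj_r2 = adj_r2)
```

Applying the adjusted formula to the reported values (R-squared 0.8595, 54 coefficients besides the intercept, 1039 residual degrees of freedom) gives roughly 0.8522, in line with the summary output.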

plot(fitted(mod), resid(mod))

ggplot(data = mod, aes(x = .resid)) + geom_histogram() + xlab('Residuals')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = mod) + stat_qq(aes(sample = .stdresid)) + geom_abline()

Model Prediction

pred_price <- predict(mod, newdata=house_test,type="response")
head(pred_price)
##        1        2        3        4        5        6 
## 112449.5 145059.0 168926.7 190641.8 235372.4 174203.3
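
With its default settings, predict() returns NA for any new row that is missing a predictor value, so predictions on a test set with gaps need a fallback. A minimal sketch on the built-in mtcars data follows; the model and the mean fill-in are illustrative only.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Three new rows, one of them with a missing predictor.
new_rows <- mtcars[1:3, c("wt", "hp")]
new_rows$hp[2] <- NA

pred <- predict(fit, newdata = new_rows)
n_missing <- sum(is.na(pred))   # rows the model could not score

# Simple fallback: replace missing predictions with the training mean.
pred[is.na(pred)] <- mean(mtcars$mpg)
pred
```

A mean fill-in keeps the output complete but ignores everything known about those rows; imputing the missing predictors before predicting would usually be more accurate.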

Submission

pred_price[is.na(pred_price)] <- mean(house_train$SalePrice)
submission <- data.frame(list("Id"=house_test$Id, "SalePrice"=pred_price), stringsAsFactors = FALSE)
head(submission)
##     Id SalePrice
## 1 1461  112449.5
## 2 1462  145059.0
## 3 1463  168926.7
## 4 1464  190641.8
## 5 1465  235372.4
## 6 1466  174203.3
write.csv(submission, file="final_test.csv", row.names=FALSE)

Submitted by Chunjie Nan, Score: 0.38856