You’ll verify for yourself that PageRank works by performing calculations on a small universe of web pages. Let’s use the 6-page universe that we had in the previous discussion. For this directed graph, perform the following calculations in R.
Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes. (5 Points)
m <- matrix(c(0, 1/2, 0, 0, 1/2, 0,
1/2, 0, 1/2, 0, 1/2, 0,
0, 1/2, 1/2, 0, 0, 0,
0, 0, 1/2, 0, 1/2, 0,
0, 0, 1/2, 0, 0, 0,
1/2, 0, 0, 1/2, 0, 0), nrow=6)
decay <- 0.85
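# Apply the decay factor defined above and spread the remainder evenly over all pages
B <- decay * m + (1 - decay) / nrow(m)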
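As a hedged sketch of the decay step: assuming a column-stochastic link matrix (every column sums to 1), the damped matrix B = d·A + (1 − d)/n is column-stochastic as well. The 3-page matrix `A_demo` below is hypothetical and is not the graph from this exercise.

```r
# Hypothetical 3-page link matrix, filled column by column.
# Each column sums to 1 (column-stochastic).
A_demo <- matrix(c(0,   1/2, 1/2,
                   1/2, 0,   1/2,
                   1,   0,   0), nrow = 3)
d <- 0.85            # decay factor, as in the text
n <- nrow(A_demo)

# Damped matrix: follow a link with probability d, jump uniformly otherwise.
B_demo <- d * A_demo + (1 - d) / n

colSums(A_demo)   # column sums of the link matrix
colSums(B_demo)   # column sums after applying decay
```

If a column of the input does not sum to 1, the matching column of the damped matrix will not sum to 1 either, so `colSums()` is a quick way to spot a mis-entered link matrix.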
Start with a uniform rank vector r and perform power iterations on B until convergence. That is, compute the solution r = B^n × r. Attempt this for a sufficiently large n so that r actually converges. (5 Points)
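# Equal starting weights for the 6 pages (note: rep(1/8, 6) sums to 0.75, not 1)
r <- matrix(rep(1/8, 6), nrow = 6)
# Apply p to r n times by first accumulating the matrix power p^n
iterations <- function(p, r, n) {
  Bn <- diag(nrow(p))
  for (i in seq_len(n)) {
    Bn <- Bn %*% p
  }
  Bn %*% r
}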
# Convergence at 40
eig1<-iterations(B, r, 40)
eig1
## [,1]
## [1,] 0.1934180
## [2,] 0.3429422
## [3,] 0.5433138
## [4,] 0.0503023
## [5,] 0.2803157
## [6,] 0.0354912
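Instead of fixing n in advance, the iteration can stop once successive rank vectors agree within a tolerance. The following is a minimal sketch under stated assumptions: `power_iterate`, `tol`, and `max_iter` are illustrative names, and the 3-page matrix is hypothetical rather than the graph from this exercise.

```r
# Power iteration that stops once successive rank vectors agree within `tol`.
power_iterate <- function(B, r, tol = 1e-10, max_iter = 1000) {
  for (i in seq_len(max_iter)) {
    r_next <- B %*% r
    if (max(abs(r_next - r)) < tol) {
      return(list(rank = r_next, iterations = i))
    }
    r <- r_next
  }
  stop("power iteration did not converge within max_iter steps")
}

# Hypothetical 3-page column-stochastic matrix with decay 0.85.
A_demo <- matrix(c(0,   1/2, 1/2,
                   1/2, 0,   1/2,
                   1,   0,   0), nrow = 3)
B_demo <- 0.85 * A_demo + 0.15 / 3
r0 <- matrix(rep(1/3, 3), nrow = 3)   # uniform start that sums to 1

result <- power_iterate(B_demo, r0)
result$rank         # converged rank vector
result$iterations   # number of steps actually needed
```

Because `B_demo` is column-stochastic and `r0` sums to 1, every iterate keeps a sum of 1, so no renormalization is needed in this sketch.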
Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1. (10 points)
# Decomposing B
decomp <- eigen(B)
# Locate the largest eigenvalue (expected to be 1); which.max() drops imaginary parts
max_value<-which.max(decomp$values)
## Warning in which.max(decomp$values): imaginary parts discarded in coercion
# check this eigenvector has all positive entries and it sums to 1
eig2 <- as.numeric((1/sum(decomp$vectors[,1]))*decomp$vectors[,1])
sum(eig2)
## [1] 1
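The eigenvalue check can be made explicit by printing the dominant eigenvalue and testing the eigenvector directly. This is a sketch on a hypothetical 3-page column-stochastic matrix, not the matrix from this exercise; `Mod()` and `Re()` are used because `eigen()` can return complex values for a non-symmetric matrix.

```r
# Hypothetical 3-page column-stochastic matrix with decay 0.85.
A_demo <- matrix(c(0,   1/2, 1/2,
                   1/2, 0,   1/2,
                   1,   0,   0), nrow = 3)
B_demo <- 0.85 * A_demo + 0.15 / 3

decomp_demo <- eigen(B_demo)

# Pick the eigenvalue with the largest modulus and keep its real part.
idx <- which.max(Mod(decomp_demo$values))
lambda_max <- Re(decomp_demo$values[idx])

# Rescale the matching eigenvector so its entries sum to 1.
# Dividing by the sum also removes the arbitrary sign returned by eigen().
v <- Re(decomp_demo$vectors[, idx])
v <- v / sum(v)

lambda_max           # dominant eigenvalue
all(v > 0)           # are all entries positive?
sum(v)               # does the vector sum to 1?
max(abs(B_demo %*% v - v))   # residual of B v = v
```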
Use the igraph package in R and its page.rank method to compute the PageRank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally. Verify that you do get the same PageRank vector as the two approaches above. (10 points)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
g <- graph.adjacency(t(m), weighted = TRUE, mode = "directed")
plot(g)
eig3 <- page.rank(g)$vector
eig3
## [1] 0.1067407 0.2509968 0.4250163 0.0356250 0.1566212 0.0250000
final <- cbind(eig1, eig2, eig3)
colnames(final) <- c('Eigen 1', 'Eigen 2', 'Eigen 3')
final
## Eigen 1 Eigen 2 Eigen 3
## [1,] 0.1934180 0.13378079 0.1067407
## [2,] 0.3429422 0.23720167 0.2509968
## [3,] 0.5433138 0.37579205 0.4250163
## [4,] 0.0503023 0.03479242 0.0356250
## [5,] 0.2803157 0.19388498 0.1566212
## [6,] 0.0354912 0.02454808 0.0250000
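The three columns are on different scales (the first does not sum to 1), so a like-for-like comparison needs each vector rescaled to sum to 1. The sketch below copies the values from the table above; the column names `power`, `eigen`, and `igraph` are illustrative.

```r
# Values copied from the comparison table above.
eig1 <- c(0.1934180, 0.3429422, 0.5433138, 0.0503023, 0.2803157, 0.0354912)
eig2 <- c(0.13378079, 0.23720167, 0.37579205, 0.03479242, 0.19388498, 0.02454808)
eig3 <- c(0.1067407, 0.2509968, 0.4250163, 0.0356250, 0.1566212, 0.0250000)

# Rescale every vector so that its entries sum to 1.
normalized <- cbind(power  = eig1 / sum(eig1),
                    eigen  = eig2 / sum(eig2),
                    igraph = eig3 / sum(eig3))

round(normalized, 4)
colSums(normalized)

# Largest absolute difference between each pair of methods.
max(abs(normalized[, "power"] - normalized[, "eigen"]))
max(abs(normalized[, "eigen"] - normalized[, "igraph"]))
```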
Go to Kaggle.com and build an account if you do not already have one. It is free.
Go to https://www.kaggle.com/c/digit-recognizer/overview, accept the rules of the competition, and download the data. You will not be required to submit work to Kaggle, but you do need the data.
“MNIST (‘Modified National Institute of Standards and Technology’) is the de facto ‘hello world’ dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.”
Using the train.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.) (5 points)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## x purrr::compose() masks igraph::compose()
## x tidyr::crossing() masks igraph::crossing()
## x dplyr::filter() masks stats::filter()
## x dplyr::groups() masks igraph::groups()
## x dplyr::lag() masks stats::lag()
## x purrr::simplify() masks igraph::simplify()
library(readr)
library(OpenImageR)
library(nnet)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
train<-read_csv('train.csv')
## Rows: 42000 Columns: 785
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (785): label, pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test<-read_csv('test.csv')
## Rows: 28000 Columns: 784
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (784): pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, p...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
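# Note: train still contains the label column here, so the first 42000 values
# of the array are labels rather than pixels; train[, -1] would keep pixels only.
Train<-array(as.vector(unlist(train)),dim = c(nrow(train),28,28))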
par(mfrow=c(3,4))
for (x in 1:10){
image(flipImage(Train[x,,]))
}
Train<-Train/255
hist(Train)
hist(Train[Train>0])
options(scipen = 999)
colMeans(Train)[1:50]
## [1] 0.0174770308123 0.0000000000000 0.0000000000000 0.0000000000000
## [5] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [9] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [13] 0.0000000000000 0.0000117647059 0.0000438842204 0.0000201680672
## [17] 0.0000008403361 0.0000000000000 0.0000000000000 0.0000000000000
## [21] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [25] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [29] 0.0000000000000 0.0000000000000 0.0000000000000 0.0000000000000
## [33] 0.0000000000000 0.0000014939309 0.0000051353875 0.0000413632120
## [37] 0.0001069094304 0.0001996265173 0.0002604108310 0.0005081232493
## [41] 0.0006828197946 0.0007502334267 0.0007474323063 0.0007688141923
## [45] 0.0006719887955 0.0006450046685 0.0005949579832 0.0004129785247
## [49] 0.0002383753501 0.0001767507003
Moving toward the center of the image, the mean pixel values increase, which indicates that the central pixels are brighter on average.
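Dividing by 255 matches min-max scaling only when the observed minimum is 0 and the observed maximum is 255. A small sketch with hypothetical pixel values shows the equivalence:

```r
# Hypothetical pixel intensities that span the full 0-255 range.
pixels <- c(0, 51, 128, 255)

# Min-max scaling: (x - min) / (max - min)
min_max <- (pixels - min(pixels)) / (max(pixels) - min(pixels))

# Plain division by the largest possible intensity
scaled <- pixels / 255

all.equal(min_max, scaled)
```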
tr<-train
trcov<-cov(tr/255)
pca<-prcomp(trcov)
(cumsum(pca$sdev^2)/sum(pca$sdev^2))[1:20]
## [1] 0.2533214 0.4211306 0.5443493 0.6382875 0.7099679 0.7691895 0.8028369
## [8] 0.8302285 0.8531289 0.8711703 0.8853394 0.8986566 0.9080934 0.9174708
## [15] 0.9256099 0.9328111 0.9384893 0.9438301 0.9484009 0.9527328
par(mfrow=c(3,4))
for (i in 1:10) {
image(array(pca$x[,i], dim=c(28,28)))
}
The images are blurry because of the variance across the digits.
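In the cumulative proportions printed earlier, the 0.95 threshold is first crossed at the 20th value (0.9527). That count can be read off programmatically. The sketch below uses the built-in `USArrests` data as a stand-in, and `k` is an illustrative name.

```r
# Stand-in data: PCA on the built-in USArrests data set, standardized.
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance per component and its running total.
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 95%.
k <- which(cum_var >= 0.95)[1]

cum_var
k
```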
tr8<-train%>%
filter(label==8)
dim(tr8)
## [1] 4063 785
trcov8<-cov(tr8/255)
pca8<-prcomp(trcov8)
par(mfrow=c(3,4))
for (i in 1:10) {
image(array(pca8$x[,i], dim=c(28,28)))
}
The 8s were retrieved successfully, and the images show 8s rendered as different combinations of dark and bright regions.
train$label <- as.factor(train$label)
X <- train[2:785]/255
X$label <- train$label
model <- multinom(label~., data=X, MaxNWts=1000000)
## # weights: 7860 (7065 variable)
## initial value 96708.573906
## iter 10 value 25322.714106
## iter 20 value 20402.086316
## iter 30 value 19312.872829
## iter 40 value 18703.256586
## iter 50 value 18197.815143
## iter 60 value 17732.985798
## iter 70 value 16739.962157
## iter 80 value 14961.658448
## iter 90 value 13446.085942
## iter 100 value 12442.636014
## final value 12442.636014
## stopped after 100 iterations
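# Note: the model was fit on pixels scaled to [0, 1] (X), while this call
# passes the unscaled pixel columns of train.
prediction <- predict(model, train[2:785])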
prediction <- as.data.frame(prediction, row.names=c('predicted_value'))
## Warning in as.data.frame.factor(prediction, row.names = c("predicted_value")):
## 'row.names' is not a character vector of length 42000 -- omitting it. Will be an
## error!
prediction$actual <- train$label
prediction$equal <- ifelse(prediction$prediction==prediction$actual,1,0)
confusionMatrix(prediction$prediction, prediction$actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 3841 0 12 5 6 31 14 12 4 9
## 1 1 3734 11 3 2 11 4 7 5 3
## 2 9 13 3460 49 22 14 17 43 5 8
## 3 11 37 99 3833 18 235 4 52 32 36
## 4 6 2 22 2 3222 15 8 13 2 17
## 5 9 0 3 12 2 1926 11 3 1 4
## 6 27 5 31 13 22 53 3808 3 2 0
## 7 2 4 14 8 2 8 2 3420 1 19
## 8 217 867 492 391 473 1425 266 234 4005 303
## 9 9 22 33 35 303 77 3 614 6 3789
##
## Overall Statistics
##
## Accuracy : 0.8342
## 95% CI : (0.8306, 0.8378)
## No Information Rate : 0.1115
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8158
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.92957 0.79718 0.82835 0.88095 0.79126 0.50751
## Specificity 0.99754 0.99874 0.99524 0.98608 0.99771 0.99882
## Pos Pred Value 0.97636 0.98757 0.95055 0.87973 0.97371 0.97717
## Neg Pred Value 0.99236 0.97514 0.98131 0.98624 0.97803 0.95331
## Prevalence 0.09838 0.11152 0.09945 0.10360 0.09695 0.09036
## Detection Rate 0.09145 0.08890 0.08238 0.09126 0.07671 0.04586
## Detection Prevalence 0.09367 0.09002 0.08667 0.10374 0.07879 0.04693
## Balanced Accuracy 0.96356 0.89796 0.91179 0.93351 0.89448 0.75317
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.92047 0.77710 0.98572 0.90473
## Specificity 0.99588 0.99840 0.87695 0.97086
## Pos Pred Value 0.96065 0.98276 0.46178 0.77469
## Neg Pred Value 0.99135 0.97453 0.99826 0.98925
## Prevalence 0.09850 0.10479 0.09674 0.09971
## Detection Rate 0.09067 0.08143 0.09536 0.09021
## Detection Prevalence 0.09438 0.08286 0.20650 0.11645
## Balanced Accuracy 0.95818 0.88775 0.93134 0.93779
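The overall accuracy is the share of observations on the diagonal of the confusion matrix. As a quick check, the sketch below copies the diagonal counts from the table above and uses the 42000 rows reported when train.csv was read.

```r
# Diagonal of the confusion matrix above: correct predictions for digits 0-9.
correct <- c(3841, 3734, 3460, 3833, 3222, 1926, 3808, 3420, 4005, 3789)
n_obs <- 42000   # rows in train.csv, per the read_csv() message

accuracy <- sum(correct) / n_obs
round(accuracy, 4)
```

The result agrees with the Accuracy line reported by `confusionMatrix()`.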
You are to compete in the House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypothesis that the correlation between each pair of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not? 5 points
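One way to approach the correlation test is `cor.test()` with `conf.level = 0.80`. This is a sketch only: the built-in `mtcars` data stands in for the housing data, and the familywise figure assumes three independent tests each run at α = 0.20 (the level that matches an 80% interval).

```r
# Stand-in data: three quantitative columns from the built-in mtcars data set.
vars <- mtcars[, c("mpg", "hp", "wt")]

# Correlation matrix for the three variables.
cor_matrix <- cor(vars)
cor_matrix

# Test H0: correlation = 0 for one pair, with an 80% confidence interval.
test_mpg_wt <- cor.test(vars$mpg, vars$wt, conf.level = 0.80)
test_mpg_wt$conf.int
test_mpg_wt$p.value

# Familywise error rate for 3 tests, assuming independence.
alpha <- 0.20
n_tests <- 3
fwer <- 1 - (1 - alpha)^n_tests
fwer
```

The same call would be repeated for the other two pairs. The familywise rate grows with the number of tests, which is why a correction such as Bonferroni is often considered when many pairs are tested.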
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix. 5 points
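A sketch of these steps in base R follows. The `mtcars` columns are stand-in data, and `lu_doolittle` is a hypothetical helper that performs Doolittle LU decomposition without pivoting, which is assumed to be adequate here because a correlation matrix of non-collinear variables is positive definite.

```r
# Stand-in data: correlation matrix of three mtcars columns.
R <- cor(mtcars[, c("mpg", "hp", "wt")])

# Precision matrix: the inverse of the correlation matrix.
# Its diagonal holds the variance inflation factors.
P <- solve(R)
diag(P)

# Both products should equal the identity matrix up to rounding error.
round(R %*% P, 10)
round(P %*% R, 10)

# Doolittle LU decomposition without pivoting (hypothetical helper).
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)   # unit lower-triangular factor
  U <- A         # reduced to upper-triangular form below
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]
    }
  }
  list(L = L, U = U)
}

lu_R <- lu_doolittle(R)
lu_R$L
lu_R$U
lu_R$L %*% lu_R$U   # reproduces R
```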
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/Rdevel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss. 10 points
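The workflow described above might look like the following sketch. The variable `x` is simulated and merely stands in for a right-skewed column of the training data, and the normal-theory interval is read here as a 95% confidence interval for the mean, which is one possible interpretation of the prompt.

```r
library(MASS)

set.seed(1)
# Hypothetical right-skewed, strictly positive variable.
x <- rexp(500, rate = 0.002)

# Fit an exponential distribution; the estimate is the rate lambda.
fit <- fitdistr(x, "exponential")
lambda_hat <- unname(fit$estimate["rate"])
lambda_hat

# Draw 1000 samples from the fitted distribution and compare histograms.
samples <- rexp(1000, lambda_hat)
par(mfrow = c(1, 2))
hist(x, main = "Original variable")
hist(samples, main = "Fitted exponential")

# 5th and 95th percentiles from the fitted exponential distribution.
qexp(c(0.05, 0.95), rate = lambda_hat)

# 95% confidence interval for the mean, assuming normality.
ci_normal <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(length(x))
ci_normal

# Empirical 5th and 95th percentiles of the data.
quantile(x, c(0.05, 0.95))
```

For the exponential family the fitted rate equals 1 / mean(x), so the fit can be checked by hand.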
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access to property
Grvl Gravel
Pave Paved
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood: Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
NAmes North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition1: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
BldgType: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
RoofStyle: Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
RoofMatl: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior1st: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
MasVnrArea: Masonry veneer area in square feet
ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
ExterCond: Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Foundation: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
BsmtCond: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFinType1: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
CentralAir: Central air conditioning
N No
Y Yes
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Bedrooms above grade (does NOT include basement bedrooms)
Kitchen: Kitchens above grade
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
house_train<-read.csv('house_train.csv', sep =",", stringsAsFactors=T)
house_test<-read.csv('house_test.csv', sep =",",stringsAsFactors = T)
str(house_train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
## $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
## $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
## $ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
## $ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
## $ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
The str() output suggests that some variables need to be converted to a different class.
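As an illustration of such a conversion, the sketch below rebuilds the first four rows of three columns from the str() output and recodes two of them. Which columns to convert is an assumption: MSSubClass is a dwelling-type code rather than a quantity, and OverallQual is an ordered 1-10 rating.

```r
# First four rows of three columns, copied from the str() output above.
houses <- data.frame(
  MSSubClass  = c(60L, 20L, 60L, 70L),
  OverallQual = c(7L, 6L, 7L, 7L),
  SalePrice   = c(208500, 181500, 223500, 140000)
)

# Dwelling-type codes are categories, not quantities.
houses$MSSubClass <- factor(houses$MSSubClass)

# Quality ratings have a natural order from 1 (worst) to 10 (best).
houses$OverallQual <- factor(houses$OverallQual, levels = 1:10, ordered = TRUE)

str(houses)
```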
head(house_train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
tail(house_train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1455 1455 20 FV 62 7500 Pave Pave Reg
## 1456 1456 60 RL 62 7917 Pave <NA> Reg
## 1457 1457 20 RL 85 13175 Pave <NA> Reg
## 1458 1458 70 RL 66 9042 Pave <NA> Reg
## 1459 1459 20 RL 68 9717 Pave <NA> Reg
## 1460 1460 20 RL 75 9937 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1455 Lvl AllPub Inside Gtl Somerst Norm
## 1456 Lvl AllPub Inside Gtl Gilbert Norm
## 1457 Lvl AllPub Inside Gtl NWAmes Norm
## 1458 Lvl AllPub Inside Gtl Crawfor Norm
## 1459 Lvl AllPub Inside Gtl NAmes Norm
## 1460 Lvl AllPub Inside Gtl Edwards Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1455 Norm 1Fam 1Story 7 5 2004
## 1456 Norm 1Fam 2Story 6 5 1999
## 1457 Norm 1Fam 1Story 6 6 1978
## 1458 Norm 1Fam 2Story 7 9 1941
## 1459 Norm 1Fam 1Story 5 6 1950
## 1460 Norm 1Fam 1Story 5 6 1965
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1455 2005 Gable CompShg VinylSd VinylSd None
## 1456 2000 Gable CompShg VinylSd VinylSd None
## 1457 1988 Gable CompShg Plywood Plywood Stone
## 1458 2006 Gable CompShg CemntBd CmentBd None
## 1459 1996 Hip CompShg MetalSd MetalSd None
## 1460 1965 Gable CompShg HdBoard HdBoard None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1455 0 Gd TA PConc Gd TA No
## 1456 0 TA TA PConc Gd TA No
## 1457 119 TA TA CBlock Gd TA No
## 1458 0 Ex Gd Stone TA Gd No
## 1459 0 TA TA CBlock TA TA Mn
## 1460 0 Gd TA CBlock TA TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1455 GLQ 410 Unf 0 811 1221
## 1456 Unf 0 Unf 0 953 953
## 1457 ALQ 790 Rec 163 589 1542
## 1458 GLQ 275 Unf 0 877 1152
## 1459 GLQ 49 Rec 1029 0 1078
## 1460 BLQ 830 LwQ 290 136 1256
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1455 GasA Ex Y SBrkr 1221 0 0
## 1456 GasA Ex Y SBrkr 953 694 0
## 1457 GasA TA Y SBrkr 2073 0 0
## 1458 GasA Ex Y SBrkr 1188 1152 0
## 1459 GasA Gd Y FuseA 1078 0 0
## 1460 GasA Gd Y SBrkr 1256 0 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1455 1221 1 0 2 0 2
## 1456 1647 0 0 2 1 3
## 1457 2073 1 0 2 0 3
## 1458 2340 0 0 2 0 4
## 1459 1078 1 0 1 0 2
## 1460 1256 1 0 1 1 3
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1455 1 Gd 6 Typ 0 <NA>
## 1456 1 TA 7 Typ 1 TA
## 1457 1 TA 7 Min1 2 TA
## 1458 1 Gd 9 Typ 2 Gd
## 1459 1 Gd 5 Typ 0 <NA>
## 1460 1 TA 6 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1455 Attchd 2004 RFn 2 400 TA
## 1456 Attchd 1999 RFn 2 460 TA
## 1457 Attchd 1978 Unf 2 500 TA
## 1458 Attchd 1941 RFn 1 252 TA
## 1459 Attchd 1950 Unf 1 240 TA
## 1460 Attchd 1965 Fin 1 276 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1455 TA Y 0 113 0 0
## 1456 TA Y 0 40 0 0
## 1457 TA Y 349 0 0 0
## 1458 TA Y 0 60 0 0
## 1459 TA Y 366 0 112 0
## 1460 TA Y 736 68 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1455 0 0 <NA> <NA> <NA> 0 10 2009
## 1456 0 0 <NA> <NA> <NA> 0 8 2007
## 1457 0 0 <NA> MnPrv <NA> 0 2 2010
## 1458 0 0 <NA> GdPrv Shed 2500 5 2010
## 1459 0 0 <NA> <NA> <NA> 0 4 2010
## 1460 0 0 <NA> <NA> <NA> 0 6 2008
## SaleType SaleCondition SalePrice
## 1455 WD Normal 185000
## 1456 WD Normal 175000
## 1457 WD Normal 210000
## 1458 WD Normal 266500
## 1459 WD Normal 142125
## 1460 WD Normal 147500
ggplot(house_train, aes(x=GrLivArea,y=SalePrice))+geom_point()+geom_smooth()+ggtitle('Above Grade Living Area and Sales Price')
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Sale price increases with above-grade living area up to roughly 4,000 square feet. Beyond that point the smoothed trend reverses, driven by a handful of very large houses.
ggplot(house_train, aes(x=as.factor(OverallQual), y=SalePrice))+geom_boxplot()+xlab('OverallQual')+ggtitle('The Overall Quality and Sales Price')
Sale price increases as overall quality increases.
ggplot(house_train, aes(x=as.factor(OverallCond), y=SalePrice))+geom_boxplot()+xlab('OverallCond')+ggtitle('The Overall Condition and Sales Price')
Surprisingly, the median sale price increases as the overall condition rises from 1 to 5, drops for conditions 6 to 8, and increases slightly at 9.
ggplot(house_train,aes(x=GarageCars))+geom_bar()+ggtitle('Garage Car Capacity')
ggplot(house_train, aes(x=as.factor(GarageCars),y=SalePrice))+geom_boxplot()+ggtitle('Car Capacity of Garage and Sale Price')
Two-car garages are the most common in the data set, and houses with three-car garages tend to sell for the most.
ggplot(house_train, aes(x=CentralAir, y=SalePrice))+geom_boxplot()+ggtitle('With Central Air Conditioning and Without Central Air Conditioning')
Houses with central air conditioning sell for more than houses without it.
ggplot(house_train, aes(x=LandSlope, y=SalePrice))+geom_boxplot()
The plot shows that houses on severe and moderate land slopes have very similar sale prices.
ggplot(house_train, aes(x=ExterQual, y=SalePrice))+geom_boxplot()+ggtitle('Exterior Quality and Sales Price')
Houses with excellent exterior quality have the highest sale prices, followed by those rated good and then average.
ggplot(house_train, aes(x=FullBath, y= SalePrice, fill = SalePrice)) +
geom_bar(stat="identity")
Most of the houses with 3 full bathrooms show higher sale prices.
ggplot(house_train, aes(x=YearBuilt, y=SalePrice))+geom_point()+ggtitle('Year Built and Sales Price')
In most cases, recently built houses tend to have higher sale prices.
ggplot(house_train, aes(x=SaleType, y=SalePrice))+geom_boxplot()+ggtitle('Types and Sales Price')
The plots above show the relationships between the dependent variable and individual independent variables. A scatterplot matrix gives an overall view of the pairwise relationships.
pairs(~SalePrice + GrLivArea+TotalBsmtSF+GarageArea, data=house_train)
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
library(tidyverse)
cor.test(~SalePrice+GarageArea, data=house_train, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: SalePrice and GarageArea
## t = 30.446, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
cor.test(~SalePrice+GrLivArea, data=house_train, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: SalePrice and GrLivArea
## t = 38.348, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
cor.test(~GarageArea+GrLivArea, data=house_train, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: GarageArea and GrLivArea
## t = 20.276, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
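According to the correlation tests, both garage area (r = 0.62) and above-grade living area (r = 0.71) are positively correlated with sale price. The two predictors are also correlated with each other (r = 0.47). None of the 80 percent confidence intervals contains 0, and every p-value is far below 0.05, so all three correlations are statistically significant.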
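The 80 percent interval printed for SalePrice and GarageArea can be reproduced by hand. The sketch below assumes the Fisher z-transform approach that the cor.test help page describes for Pearson correlations, and takes the estimate and degrees of freedom from the output above. The names r_hat, n_obs and ci are illustrative and not part of the original analysis.

```r
# Reproduce the 80% confidence interval for cor(SalePrice, GarageArea)
# from the printed estimate, assuming a Fisher z-transform interval.
r_hat <- 0.6234314        # sample correlation from the output above
n_obs <- 1458 + 2         # df = n - 2 for a Pearson correlation test
conf  <- 0.80

z_hat  <- atanh(r_hat)                 # Fisher z-transform of r
se_z   <- 1 / sqrt(n_obs - 3)          # approximate standard error of z
z_crit <- qnorm(1 - (1 - conf) / 2)    # two-sided critical value
ci     <- tanh(z_hat + c(-1, 1) * z_crit * se_z)   # back-transform to r

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r_hat * sqrt(n_obs - 2) / sqrt(1 - r_hat^2)

print(round(ci, 4))
print(round(t_stat, 3))
```

Because the interval is built on the z scale and then back-transformed, it is not symmetric around the estimate.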
house_cor <- house_train %>%
dplyr::select(SalePrice, GrLivArea, GarageArea) %>%
cor(., method = "pearson")
corrplot(house_cor)
sol_house_cor<-solve(house_cor)
pre_house_cor<-house_cor %*% sol_house_cor
pre_house_cor
## SalePrice GrLivArea GarageArea
## SalePrice 1.0000000000000004440892 0.00000000000000004163336 0
## GrLivArea 0.0000000000000002220446 1.00000000000000000000000 0
## GarageArea 0.0000000000000003330669 0.00000000000000008326673 1
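The product above is the identity matrix up to floating-point error, which is what multiplying a matrix by its inverse should give. As a rough sketch, the same check can be run on the three correlations printed by cor.test, using all.equal for a tolerance-based comparison instead of reading the digits by eye. The names cor_mat, prec_mat and vif are illustrative and are not objects from the analysis.

```r
# Rebuild the 3x3 correlation matrix from the printed cor.test estimates
vars <- c("SalePrice", "GrLivArea", "GarageArea")
cor_mat <- matrix(c(1,         0.7086245, 0.6234314,
                    0.7086245, 1,         0.4689975,
                    0.6234314, 0.4689975, 1),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Precision matrix = inverse of the correlation matrix
prec_mat <- solve(cor_mat)

# Compare the product with the identity using a numerical tolerance
is_identity <- isTRUE(all.equal(cor_mat %*% prec_mat, diag(3),
                                check.attributes = FALSE))

# The diagonal of the precision matrix holds the variance inflation factors
vif <- diag(prec_mat)

print(is_identity)
print(round(vif, 3))
```

The diagonal entries are worth a look on their own: each one is the variance inflation factor of that variable given the other two, so values close to 1 indicate little collinearity.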
library(matrixcalc)
##
## Attaching package: 'matrixcalc'
## The following object is masked from 'package:igraph':
##
## %s%
lu<-lu.decomposition(house_cor)
lu
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.00000000 0
## [2,] 0.7086245 1.00000000 0
## [3,] 0.6234314 0.05467234 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.7086245 0.6234314
## [2,] 0 0.4978513 0.0272187
## [3,] 0 0.0000000 0.6098451
house_cor==lu$L %*% lu$U
## SalePrice GrLivArea GarageArea
## SalePrice TRUE TRUE TRUE
## GrLivArea TRUE TRUE TRUE
## GarageArea TRUE TRUE TRUE
slu<-lu.decomposition(sol_house_cor)
slu
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] -0.5336085 1.0000000 0
## [3,] -0.3731704 -0.4689975 1
##
## $U
## [,1] [,2] [,3]
## [1,] 2.569203 -1.370948 -0.9587504
## [2,] 0.000000 1.281983 -0.6012469
## [3,] 0.000000 0.000000 1.0000000
round(sol_house_cor)==round(slu$L %*% slu$U)
## SalePrice GrLivArea GarageArea
## SalePrice TRUE TRUE TRUE
## GrLivArea TRUE TRUE TRUE
## GarageArea TRUE TRUE TRUE
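To show where the L and U factors come from, here is a minimal base-R sketch of an LU factorization without pivoting. It assumes a Doolittle-style scheme (unit diagonal in L), which matches the shape of the lu.decomposition output above but is not taken from the matrixcalc source. lu_doolittle is a hypothetical helper written for illustration, and the input is the correlation matrix rebuilt from the printed estimates.

```r
# Doolittle LU factorization without pivoting (illustrative helper).
# Returns unit lower-triangular L and upper-triangular U with a = L %*% U.
lu_doolittle <- function(a) {
  n <- nrow(a)
  l <- diag(n)
  u <- matrix(0, n, n)
  for (i in seq_len(n)) {
    k <- seq_len(i - 1)          # indices of rows/columns already eliminated
    for (j in i:n) {             # fill row i of U
      u[i, j] <- a[i, j] - sum(l[i, k] * u[k, j])
    }
    if (i < n) {
      for (j in (i + 1):n) {     # fill column i of L below the diagonal
        l[j, i] <- (a[j, i] - sum(l[j, k] * u[k, i])) / u[i, i]
      }
    }
  }
  list(L = l, U = u)
}

cor_mat <- matrix(c(1,         0.7086245, 0.6234314,
                    0.7086245, 1,         0.4689975,
                    0.6234314, 0.4689975, 1),
                  nrow = 3, byrow = TRUE)
fac <- lu_doolittle(cor_mat)

# Tolerance-based check that the factors rebuild the original matrix
rebuilt_ok <- isTRUE(all.equal(fac$L %*% fac$U, cor_mat))
print(fac$L)
print(fac$U)
print(rebuilt_ok)
```

Comparing with all.equal is generally safer than `==` or comparing integer-rounded values, because floating-point products rarely match exactly and rounding to whole numbers can hide real differences.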
ggplot(house_train, aes(x=SalePrice))+geom_histogram()+ggtitle('Histogram of Sales Price')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
fitdist<-fitdistr(house_train$SalePrice, densfun = 'exponential')
lam<-fitdist$estimate
redist<-rexp(500, lam)
summary(redist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 125.4 52360.6 127844.9 176968.1 247065.2 1006709.4
hist(redist)
The 5th and 95th percentiles of the fitted exponential distribution, computed from its inverse CDF:
fifth<-round(log(1-0.05)/-lam,2)
fifth
## rate
## 9280.04
nintyfiv<-round(log(1-0.95)/-lam,2)
nintyfiv
## rate
## 541991.5
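The two values above come from the exponential inverse CDF, x_p = -log(1 - p) / lambda. The sketch below checks that formula against R's built-in qexp. The fitted rate is never printed in this report, so lam_assumed is an assumption: it is back-computed from the printed 5th percentile and is not the original fitdistr object.

```r
# Assumption: recover the rate from the printed 5th percentile (9280.04),
# because the fitted rate itself is not shown in the output above.
p5_printed  <- 9280.04
lam_assumed <- -log(1 - 0.05) / p5_printed

probs <- c(0.05, 0.95)

# Closed-form exponential quantiles: x_p = -log(1 - p) / lambda
by_formula <- -log(1 - probs) / lam_assumed

# The same quantiles from the built-in quantile function
by_qexp <- qexp(probs, rate = lam_assumed)

print(round(by_formula, 1))
print(round(by_qexp, 1))
print(round(1 / lam_assumed))   # implied mean of the exponential
```

Comparing these model-based percentiles with the empirical ones from quantile(house_train$SalePrice, c(0.05, 0.95)) would show how well the exponential distribution describes the observed prices.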
library(tidyverse)
drop_train<-house_train %>%
drop_na(LotFrontage,MasVnrType,MasVnrArea, BsmtQual, BsmtCond, BsmtExposure,BsmtFinType1,BsmtFinType2,Electrical,GarageType,GarageYrBlt, GarageFinish,GarageQual,GarageCond)
mod<-lm(SalePrice ~ LotArea +Street + LandContour + LotConfig + LandSlope + Neighborhood + OverallQual + OverallCond + YearBuilt + BsmtQual + BsmtExposure + CentralAir + X1stFlrSF + X2ndFlrSF + BedroomAbvGr + KitchenQual + TotRmsAbvGrd+ Fireplaces + GarageCars , data= drop_train)
summary(mod)
##
## Call:
## lm(formula = SalePrice ~ LotArea + Street + LandContour + LotConfig +
## LandSlope + Neighborhood + OverallQual + OverallCond + YearBuilt +
## BsmtQual + BsmtExposure + CentralAir + X1stFlrSF + X2ndFlrSF +
## BedroomAbvGr + KitchenQual + TotRmsAbvGrd + Fireplaces +
## GarageCars, data = drop_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -370155 -14157 -218 12977 231013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -580584.9262 179212.1323 -3.240 0.001235 **
## LotArea 0.7969 0.1575 5.058 0.00000050027327094 ***
## StreetPave 31972.7485 17713.6100 1.805 0.071367 .
## LandContourHLS 22344.5124 7503.0848 2.978 0.002968 **
## LandContourLow 7610.9878 11294.5822 0.674 0.500549
## LandContourLvl 23107.5893 5517.3436 4.188 0.00003050621763567 ***
## LotConfigCulDSac 17262.1354 5646.6593 3.057 0.002292 **
## LotConfigFR2 -12201.9085 6596.3032 -1.850 0.064625 .
## LotConfigFR3 -25535.6900 16448.1083 -1.553 0.120847
## LotConfigInside 1486.4159 2669.8692 0.557 0.577827
## LandSlopeMod 11659.6407 5974.0124 1.952 0.051239 .
## LandSlopeSev -32718.8760 17213.4674 -1.901 0.057608 .
## NeighborhoodBlueste -9747.0752 24630.6113 -0.396 0.692385
## NeighborhoodBrDale -9916.6593 12892.6348 -0.769 0.441966
## NeighborhoodBrkSide 6360.4216 11606.6356 0.548 0.583810
## NeighborhoodClearCr 16926.1641 13973.5289 1.211 0.226055
## NeighborhoodCollgCr 22343.5126 9428.2193 2.370 0.017977 *
## NeighborhoodCrawfor 29655.4531 11403.0031 2.601 0.009436 **
## NeighborhoodEdwards -10782.7530 10473.0223 -1.030 0.303450
## NeighborhoodGilbert 12337.6131 10250.7284 1.204 0.229025
## NeighborhoodIDOTRR 6624.7098 12374.3704 0.535 0.592517
## NeighborhoodMeadowV -6342.3616 14057.6037 -0.451 0.651962
## NeighborhoodMitchel -3723.4236 11162.1627 -0.334 0.738767
## NeighborhoodNAmes 3700.9265 10058.2247 0.368 0.712985
## NeighborhoodNoRidge 80496.4228 10930.3709 7.364 0.00000000000036149 ***
## NeighborhoodNPkVill 868.4867 15291.9117 0.057 0.954720
## NeighborhoodNridgHt 49068.0076 9787.3172 5.013 0.00000062832220933 ***
## NeighborhoodNWAmes 3362.5445 10476.4661 0.321 0.748304
## NeighborhoodOldTown -4802.4838 11179.6515 -0.430 0.667595
## NeighborhoodSawyer 5711.4393 10893.7139 0.524 0.600190
## NeighborhoodSawyerW 18964.1016 10236.1776 1.853 0.064215 .
## NeighborhoodSomerst 33686.7207 9666.2448 3.485 0.000513 ***
## NeighborhoodStoneBr 73860.7657 11573.6850 6.382 0.00000000026322087 ***
## NeighborhoodSWISU 13155.1504 13067.3157 1.007 0.314303
## NeighborhoodTimber 16695.9502 11055.2474 1.510 0.131289
## NeighborhoodVeenker 34130.1301 15475.6234 2.205 0.027644 *
## OverallQual 9266.5480 1421.2839 6.520 0.00000000010962288 ***
## OverallCond 8035.9665 1138.6456 7.057 0.00000000000309474 ***
## YearBuilt 270.7543 88.2643 3.068 0.002214 **
## BsmtQualFa -35504.6555 8432.0517 -4.211 0.00002766622757058 ***
## BsmtQualGd -35717.9901 4413.6528 -8.093 0.00000000000000162 ***
## BsmtQualTA -32000.5166 5735.0203 -5.580 0.00000003070331741 ***
## BsmtExposureGd 20355.1316 4470.4283 4.553 0.00000590607452817 ***
## BsmtExposureMn -1990.1470 4370.7806 -0.455 0.648967
## BsmtExposureNo -7177.1106 2994.5198 -2.397 0.016717 *
## CentralAirY 8318.8421 5062.5596 1.643 0.100642
## X1stFlrSF 48.8335 4.9888 9.789 < 0.0000000000000002 ***
## X2ndFlrSF 37.9051 4.5111 8.403 < 0.0000000000000002 ***
## BedroomAbvGr -2074.4525 1963.9138 -1.056 0.291084
## KitchenQualFa -31762.6040 9124.7759 -3.481 0.000520 ***
## KitchenQualGd -34564.7144 4591.8416 -7.527 0.00000000000011198 ***
## KitchenQualTA -36578.8929 5326.1370 -6.868 0.00000000001120575 ***
## TotRmsAbvGrd 2449.2646 1306.3633 1.875 0.061091 .
## Fireplaces 5985.0837 1958.1753 3.056 0.002297 **
## GarageCars 13593.2101 2203.7300 6.168 0.00000000098741630 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31970 on 1039 degrees of freedom
## Multiple R-squared: 0.8595, Adjusted R-squared: 0.8522
## F-statistic: 117.7 on 54 and 1039 DF, p-value: < 0.00000000000000022
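The multiple R-squared is 0.8595, which means the model explains about 85.95% of the variance in SalePrice among the 1,094 complete cases used for fitting. The adjusted R-squared, which penalizes the 54 estimated slopes, is 0.8522.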
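The adjusted R-squared and the F-statistic in the summary can both be recovered from the multiple R-squared and the degrees of freedom. This is a small arithmetic sketch using only the numbers printed above, and the variable names are illustrative.

```r
# Values read from the summary(mod) output above
r2       <- 0.8595   # multiple R-squared
df_resid <- 1039     # residual degrees of freedom
n_pred   <- 54       # number of estimated slopes (numerator df of F)

# Number of rows actually used in the fit (slopes + intercept + residual df)
n_obs <- df_resid + n_pred + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / df_resid

# Overall F statistic expressed through R-squared
f_stat <- (r2 / n_pred) / ((1 - r2) / df_resid)

print(n_obs)
print(round(adj_r2, 4))
print(round(f_stat, 1))
```

The row count is smaller than the 1,460 rows of the training set because drop_na removed every house with a missing value in the listed columns.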
plot(fitted(mod), resid(mod))
ggplot(data = mod, aes(x = .resid)) + geom_histogram() + xlab('Residuals')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = mod) + stat_qq(aes(sample = .stdresid)) + geom_abline()
pred_price <- predict(mod, newdata=house_test,type="response")
head(pred_price)
## 1 2 3 4 5 6
## 112449.5 145059.0 168926.7 190641.8 235372.4 174203.3
pred_price[is.na(pred_price)] <- mean(house_train$SalePrice)
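The replacement above is needed because predict returns NA for any test row with a missing value in one of the predictors. The toy sketch below uses made-up data, not the housing set, to show that behaviour and to count the affected rows before they are filled with the training mean. All object names here are hypothetical.

```r
# Toy data standing in for the training and test sets (made-up values)
toy_train <- data.frame(x = c(1, 2, 3, 4, 5),
                        y = c(2.1, 3.9, 6.2, 7.8, 10.1))
toy_test  <- data.frame(x = c(1.5, NA, 4.5))

toy_mod  <- lm(y ~ x, data = toy_train)

# predict() keeps one entry per test row and yields NA where x is missing
toy_pred <- predict(toy_mod, newdata = toy_test)

# Count how many predictions will be replaced before overwriting them
n_missing <- sum(is.na(toy_pred))

# Same fallback as above: fill the gaps with the mean training response
toy_pred[is.na(toy_pred)] <- mean(toy_train$y)

print(n_missing)
print(toy_pred)
```

Counting the NA predictions first is a cheap sanity check: if a large share of the test rows fall back to the mean, the submission score will likely suffer, and imputing the missing predictors instead may be worth considering.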
submission <- data.frame(list("Id"=house_test$Id, "SalePrice"=pred_price), stringsAsFactors = FALSE)
head(submission)
## Id SalePrice
## 1 1461 112449.5
## 2 1462 145059.0
## 3 1463 168926.7
## 4 1464 190641.8
## 5 1465 235372.4
## 6 1466 174203.3
write.csv(submission, file="final_test.csv", row.names=FALSE)
Submitted by Chunjie Nan, Score: 0.38856