You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
We can start by plotting a correlation matrix of the Pearson correlation coefficients between the numeric predictors. This gives an overview of the dataset and provides insight for model building:
# Libraries used below: dplyr (select_if), tidyr (gather), ggplot2,
# reshape2 (melt), gridExtra (grid.arrange)
library(dplyr); library(tidyr); library(ggplot2)
library(reshape2); library(gridExtra)
df_train <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-605/main/train.csv")
df.orig <- df_train
tib <- df_train
# Keep only the numeric variables for the correlation matrix
df.num <- select_if(df_train, is.numeric)
mat.correlation <- cor(df.num, use = 'complete')
mat.correlation[upper.tri(mat.correlation)] <- NA
mlt.correlation <- melt(mat.correlation)
ggplot(data = mlt.correlation, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, vjust = 1,
size = 5, hjust = 1),
axis.text.y = element_text(size = 7)) +
coord_fixed()
Next, summary() reports the minimum, first quartile, median, mean, third quartile, maximum, and the number of NA's for each numeric variable in the dataset (character columns show only their length and class):
summary(df_train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
The skimr package gives a richer summary: for every variable it reports the number of missing values and the completeness rate; for numeric variables it adds the mean, standard deviation, the quantiles p0 through p100, and an inline histogram; and for character variables it adds the range of string lengths and the number of unique values.
library(skimr)
skim(df_train)
Name | df_train |
Number of rows | 1460 |
Number of columns | 81 |
_______________________ | |
Column type frequency: | |
character | 43 |
numeric | 38 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
MSZoning | 0 | 1.00 | 2 | 7 | 0 | 5 | 0 |
Street | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
Alley | 1369 | 0.06 | 4 | 4 | 0 | 2 | 0 |
LotShape | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
LandContour | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
Utilities | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
LotConfig | 0 | 1.00 | 3 | 7 | 0 | 5 | 0 |
LandSlope | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
Neighborhood | 0 | 1.00 | 5 | 7 | 0 | 25 | 0 |
Condition1 | 0 | 1.00 | 4 | 6 | 0 | 9 | 0 |
Condition2 | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
BldgType | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
HouseStyle | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
RoofStyle | 0 | 1.00 | 3 | 7 | 0 | 6 | 0 |
RoofMatl | 0 | 1.00 | 4 | 7 | 0 | 8 | 0 |
Exterior1st | 0 | 1.00 | 5 | 7 | 0 | 15 | 0 |
Exterior2nd | 0 | 1.00 | 5 | 7 | 0 | 16 | 0 |
MasVnrType | 8 | 0.99 | 4 | 7 | 0 | 4 | 0 |
ExterQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
ExterCond | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
Foundation | 0 | 1.00 | 4 | 6 | 0 | 6 | 0 |
BsmtQual | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
BsmtCond | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
BsmtExposure | 38 | 0.97 | 2 | 2 | 0 | 4 | 0 |
BsmtFinType1 | 37 | 0.97 | 3 | 3 | 0 | 6 | 0 |
BsmtFinType2 | 38 | 0.97 | 3 | 3 | 0 | 6 | 0 |
Heating | 0 | 1.00 | 4 | 5 | 0 | 6 | 0 |
HeatingQC | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
CentralAir | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
Electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
KitchenQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
Functional | 0 | 1.00 | 3 | 4 | 0 | 7 | 0 |
FireplaceQu | 690 | 0.53 | 2 | 2 | 0 | 5 | 0 |
GarageType | 81 | 0.94 | 6 | 7 | 0 | 6 | 0 |
GarageFinish | 81 | 0.94 | 3 | 3 | 0 | 3 | 0 |
GarageQual | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
GarageCond | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
PavedDrive | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
PoolQC | 1453 | 0.00 | 2 | 2 | 0 | 3 | 0 |
Fence | 1179 | 0.19 | 4 | 5 | 0 | 4 | 0 |
MiscFeature | 1406 | 0.04 | 4 | 4 | 0 | 4 | 0 |
SaleType | 0 | 1.00 | 2 | 5 | 0 | 9 | 0 |
SaleCondition | 0 | 1.00 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 | ▇▇▇▇▇ |
MSSubClass | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
OverallQual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
OverallCond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
YearRemodAdd | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
MasVnrArea | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 | ▇▁▁▁▁ |
BsmtFinSF1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 | ▇▁▁▁▁ |
BsmtFinSF2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 | ▇▁▁▁▁ |
BsmtUnfSF | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 | ▇▅▂▁▁ |
TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
X1stFlrSF | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 | ▇▅▁▁▁ |
X2ndFlrSF | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 | ▇▃▂▁▁ |
LowQualFinSF | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 | ▇▁▁▁▁ |
GrLivArea | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 | ▇▇▁▁▁ |
BsmtFullBath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
BsmtHalfBath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
FullBath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▁▇▁▇▁ |
HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
Fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▁▁ |
GarageYrBlt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 | ▁▁▅▅▇ |
GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
WoodDeckSF | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 | ▇▂▁▁▁ |
OpenPorchSF | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 | ▇▁▁▁▁ |
EnclosedPorch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 | ▇▁▁▁▁ |
X3SsnPorch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
ScreenPorch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 | ▇▁▁▁▁ |
PoolArea | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 | ▇▁▁▁▁ |
MiscVal | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 | ▇▁▁▁▁ |
MoSold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 | ▃▆▇▃▃ |
YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 | ▇▅▁▁▁ |
Here is a brief description, taken from the data description file, of the variables used below.
Dependent Variable
SalePrice: the property's sale price in dollars
Independent Variables
LotArea: Lot size in square feet
YearBuilt: Original construction date
PoolArea: Pool area in square feet
subset.df_train <- subset(df_train, select = c("SalePrice","LotArea", "YearBuilt","PoolArea"))
head(subset.df_train)
## SalePrice LotArea YearBuilt PoolArea
## 1 208500 8450 2003 0
## 2 181500 9600 1976 0
## 3 223500 11250 2001 0
## 4 140000 9550 1915 0
## 5 250000 14260 2000 0
## 6 143000 14115 1993 0
p1 <-ggplot(df_train, aes(x=LotArea)) +
geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
geom_density(alpha=.2, fill="green")+
labs(title = "Lot Area", x = "", y = "")
p2 <- ggplot(df_train, aes(x=YearBuilt)) +
geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
geom_density(alpha=.2, fill="green")+
labs(title = "Original construction date", x = "", y = "")
p3 <-ggplot(df_train, aes(x=PoolArea)) +
geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
geom_density(alpha=.2, fill="green")+
labs(title = "Pool Area", x = "", y = "")
grid.arrange(p1, p2,p3, nrow=2)
s1 <- ggplot(df_train, aes(sample = LotArea))+
stat_qq()+
stat_qq_line()+
labs(title="Lot Area",x = "", y = "")
s2 <- ggplot(df_train, aes(sample = YearBuilt))+
stat_qq()+
stat_qq_line()+
labs(title="Original construction date", x = "", y = "")
s3 <- ggplot(df_train, aes(sample = PoolArea))+
stat_qq()+
stat_qq_line()+
labs(title="Pool Area", x = "", y = "")
grid.arrange(s1, s2, s3, nrow=2)
subset.df_train %>%
select_if(is.numeric) %>%
tidyr::gather()%>%
ggplot(aes(x = key, y = value)) +
geom_boxplot() +
xlab("Variable") +
ylab("Value")
subset.df_train %>%
select_if(is.numeric) %>%
tidyr::gather()%>%
ggplot(aes(x = value, fill = key)) +
geom_density(alpha = 0.5) +
xlab("Value") +
ylab("Density") +
facet_wrap(~key, ncol = 1, scales = "free")
# Scatterplots of SalePrice against each selected predictor
# (a full scatterplot matrix follows below)
ggplot(df_train, aes(x = LotArea, y = SalePrice)) +
geom_point() +
labs(title = "Scatterplot of LotArea and SalePrice")
ggplot(df_train, aes(x = YearBuilt, y = SalePrice)) +
geom_point() +
labs(title = "Scatterplot of YearBuilt and SalePrice")
ggplot(df_train, aes(x = PoolArea, y = SalePrice)) +
geom_point() +
labs(title = "Scatterplot of PoolArea and SalePrice")
# Select three quantitative variables
vars <- c("SalePrice", "LotArea", "YearBuilt")
df_sub <- df_train[, vars]
# Calculate correlation matrix
cor_matrix <- cor(df_sub)
# Print correlation matrix
print(cor_matrix)
## SalePrice LotArea YearBuilt
## SalePrice 1.0000000 0.26384335 0.52289733
## LotArea 0.2638434 1.00000000 0.01422765
## YearBuilt 0.5228973 0.01422765 1.00000000
df.orig <- subset.df_train
tib <- df_train
mat.correlation <- cor(subset.df_train, use = 'complete')
mat.correlation[upper.tri(mat.correlation)] <- NA
mlt.correlation <- melt(mat.correlation)
ggplot(data = mlt.correlation, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 8, hjust = 1), axis.text.y = element_text(size = 8))+
coord_fixed()
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Hypotheses:
H0: The correlation between each pair of variables is 0.
HA: The correlation between each pair of variables is not 0.
cor.test(subset.df_train$LotArea, subset.df_train$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: subset.df_train$LotArea and subset.df_train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(subset.df_train$YearBuilt, subset.df_train$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: subset.df_train$YearBuilt and subset.df_train$SalePrice
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
cor.test(subset.df_train$PoolArea, subset.df_train$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: subset.df_train$PoolArea and subset.df_train$SalePrice
## t = 3.5435, df = 1458, p-value = 0.0004073
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.05902496 0.12557575
## sample estimates:
## cor
## 0.09240355
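The discussion below refers to a corr_results data frame that is not built in the chunks above; a minimal sketch that assembles one from the same three cor.test() calls could look like this (the object and column names are illustrative, not from the original code):
# Collect the three pairwise tests into one data frame
pairs_list <- list(c("LotArea", "SalePrice"),
                   c("YearBuilt", "SalePrice"),
                   c("PoolArea", "SalePrice"))
corr_results <- do.call(rbind, lapply(pairs_list, function(v) {
  ct <- cor.test(subset.df_train[[v[1]]], subset.df_train[[v[2]]], conf.level = 0.80)
  data.frame(pair = paste(v, collapse = " vs "),
             cor = unname(ct$estimate),
             t = unname(ct$statistic),
             p.value = ct$p.value,
             ci.lower = ct$conf.int[1],
             ci.upper = ct$conf.int[2])
}))
corr_results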
The correlation coefficient between PoolArea and SalePrice is 0.0924, which indicates a weak positive correlation between the two variables.
The t value is 3.5435, and the p-value is 0.0004073, which means that the correlation coefficient is statistically significant at a significance level of 0.05. Therefore, we can reject the null hypothesis that there is no correlation between PoolArea and SalePrice.
The 80 percent confidence interval for the true correlation coefficient runs from 0.059 to 0.126, meaning we are 80 percent confident that the true correlation falls within this range.
The corr_results data frame assembled above collects the correlation coefficient, t-value, p-value, and 80% confidence interval for each pairwise correlation.
The null hypothesis of zero correlation is rejected when the p-value falls below the chosen significance level. An 80% confidence interval corresponds to a two-sided significance level of 0.20; all three p-values here are far below both 0.20 and the conventional 0.05.
In terms of familywise error, it would be a concern if we were running many hypothesis tests simultaneously without adjusting the significance level: with three tests each at alpha = 0.05, the probability of at least one false rejection rises to about 1 - (1 - 0.05)^3, roughly 0.14. Since we are conducting only three tests and all of the p-values are orders of magnitude below any reasonable threshold, familywise error is not a major concern here, and a Bonferroni correction would not change the conclusions.
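As a quick check on that claim, the three p-values can be adjusted for multiple comparisons; this is a sketch assuming the corr_results data frame assembled above, and the adjusted p-values remain essentially zero:
# Bonferroni and Holm adjustments for the three pairwise tests
p.adjust(corr_results$p.value, method = "bonferroni")
p.adjust(corr_results$p.value, method = "holm")
# Familywise error rate if each of the three tests were run at alpha = 0.05 unadjusted
1 - (1 - 0.05)^3   # roughly 0.14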
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# To invert a matrix, use the solve() function in R; the inverse of the
# correlation matrix is the precision matrix.
prec_mat <- solve(cor_matrix)
cor_prec <- cor_matrix %*% prec_mat
prec_cor <- prec_mat %*% cor_matrix
# LU decomposition via the matrixcalc package
library(matrixcalc)
lu_decomp <- lu.decomposition(cor_matrix)
L <- lu_decomp$L
L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.2638434 1.0000000 0
## [3,] 0.5228973 -0.1329934 1
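For completeness, a short sketch (not part of the original output) that prints U, verifies that L %*% U reconstructs the correlation matrix, shows that both correlation-precision products are identity matrices up to rounding, and reads the variance inflation factors off the diagonal of the precision matrix:
U <- lu_decomp$U
U
# The factorization should reproduce the original correlation matrix
all.equal(L %*% U, cor_matrix, check.attributes = FALSE)
# Both products of the correlation and precision matrices are ~ identity
round(cor_prec, 10)
round(prec_cor, 10)
# Variance inflation factors sit on the diagonal of the precision matrix
diag(prec_mat)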
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
numeric_cols <- which(sapply(df_train, is.numeric))
all_num <- df_train[, numeric_cols]
df_train %>%
dplyr::select(names(all_num)) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
From above we can see BsmtUnfSF is skewed to the right.
ggplot(df_train, aes(BsmtUnfSF)) +
geom_histogram(bins = 30, fill = "lightgreen", color = "white") +
labs(title = "Histogram of Basement Unfinished Square Feet",
x = "Basement Unfinished Square Feet",
y = "Count")
summary(df_train$BsmtUnfSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 223.0 477.5 567.2 808.0 2336.0
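The summary already hints at right skew (the mean of 567.2 sits well above the median of 477.5); as a numeric check, the sample skewness can be computed in base R without extra packages (a small sketch, not from the original code):
# Sample skewness: third standardized central moment (positive => right skew)
x <- df_train$BsmtUnfSF
mean((x - mean(x))^3) / sd(x)^3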
fit_data <- df_train$BsmtUnfSF
fit_data <- fit_data[complete.cases(fit_data)]
hist(fit_data)
length(fit_data[fit_data == 0])
## [1] 118
# Shift so the minimum is strictly above zero (118 observations are exactly 0)
fit_data <- fit_data + .01
library(MASS)
# fitdistr() is applied to the original (unshifted) variable here; for the
# exponential MLE the 0.01 shift makes no practical difference to the rate.
BsmtUnfSF_exp_dist <- fitdistr(df_train$BsmtUnfSF, 'exponential')
BsmtUnfSF_lamb <- BsmtUnfSF_exp_dist$estimate
BsmtUnfSF_lamb
## rate
## 0.001762921
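As a sanity check (not in the original output), the maximum-likelihood estimate of an exponential rate is simply the reciprocal of the sample mean, so fitdistr's rate can be reproduced directly:
# MLE of the exponential rate is 1 / sample mean; should match 0.001762921
1 / mean(df_train$BsmtUnfSF)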
set.seed(1000)
BsmtUnfSF_sample <- rexp(1000,BsmtUnfSF_lamb)
hist(BsmtUnfSF_sample)
par(mfrow=c(1,2))
hist(fit_data)
hist(BsmtUnfSF_sample)
The histograms of fit_data and BsmtUnfSF_sample are both right-skewed; however, the second bin of BsmtUnfSF_sample has a frequency roughly double that of the corresponding bin of fit_data.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
qexp(.05, rate=BsmtUnfSF_lamb)
## [1] 29.09563
qexp(.95, rate=BsmtUnfSF_lamb)
## [1] 1699.3
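These quantiles can also be obtained by inverting the exponential CDF by hand, since F(x) = 1 - exp(-lambda * x) gives x_p = -log(1 - p) / lambda; a quick check that this matches qexp():
# Invert the exponential CDF directly: x_p = -log(1 - p) / lambda
-log(1 - 0.05) / BsmtUnfSF_lamb   # 5th percentile, ~29.1
-log(1 - 0.95) / BsmtUnfSF_lamb   # 95th percentile, ~1699.3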
norm.interval <- function(data, variance = var(data), conf.level = 0.95) {
z <- qnorm((1 - conf.level)/2, lower.tail = FALSE)
xbar <- mean(data)
sdx <- sqrt(variance/length(data))
c(xbar - z * sdx, xbar + z * sdx)
}
norm.interval(fit_data, variance=var(fit_data), conf.level = 0.95)
## [1] 544.5850 589.9158
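As a cross-check (not in the original output), t.test() gives an almost identical 95% interval, since with n = 1460 the t and normal critical values are nearly the same:
# t-based 95% confidence interval for the mean of the shifted variable
t.test(fit_data)$conf.int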
quantile(x=fit_data, probs=c(.05, .95))
## 5% 95%
## 0.01 1468.01
We are 95% confident that the mean unfinished basement area is between 544.59 and 589.92 square feet. Comparing tails, the empirical 5th percentile (0.01, reflecting the 118 zero-area basements) is far below the exponential distribution's 29.1, and the empirical 95th percentile (1468.01) is somewhat below the exponential's 1699.3. The exponential distribution captures the overall right skew of the variable, but it cannot capture the spike of observations at zero, so the fit is only approximate.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
I am choosing the following variables for my multiple linear regression model.
Dependent Variable: SalePrice
Independent Variables: LotArea, PoolArea, YearBuilt, TotalBsmtSF
We want to build a model for estimating SalePrice based on the LotArea, PoolArea, YearBuilt, and TotalBsmtSF of each house.
Model 1:
model_1 <- lm(SalePrice ~ LotArea + PoolArea + YearBuilt + TotalBsmtSF, data = df_train)
summary(model_1)
##
## Call:
## lm(formula = SalePrice ~ LotArea + PoolArea + YearBuilt + TotalBsmtSF,
## data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -537777 -34223 -10964 24191 431224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.730e+06 1.046e+05 -16.529 < 2e-16 ***
## LotArea 1.140e+00 1.552e-01 7.346 3.39e-13 ***
## PoolArea 4.859e+01 3.737e+01 1.300 0.194
## YearBuilt 9.207e+02 5.380e+01 17.114 < 2e-16 ***
## TotalBsmtSF 7.897e+01 3.859e+00 20.463 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56770 on 1455 degrees of freedom
## Multiple R-squared: 0.4907, Adjusted R-squared: 0.4893
## F-statistic: 350.4 on 4 and 1455 DF, p-value: < 2.2e-16
This model is a multiple linear regression model with SalePrice as the response variable and LotArea, PoolArea, YearBuilt and TotalBsmtSF as the predictor variables.
Multiple R-squared: The multiple R-squared value measures the proportion of the variation in the dependent variable that can be explained by the independent variables in a regression model. Here, the multiple R-squared is 0.4907, indicating that approximately 49.07% of the variability in the dependent variable is explained by the independent variables in the model.
The adjusted R-squared (0.4893) is very close to the multiple R-squared, indicating that the model is not being unduly penalized for the number of predictors it uses.
The p-values for each predictor variable show whether they are statistically significant in predicting SalePrice or not. In this case, PoolArea is not significant (p-value = 0.194), while all the other predictor variables have very small p-values (less than 0.05), indicating strong evidence of a significant linear relationship between each predictor variable and SalePrice.
F-statistic: The F-statistic tests the overall significance of the regression model. It compares the variability explained by the model to the variability not explained. In this case, the F-statistic is 350.4 on 4 and 1455 degrees of freedom, with a p-value of < 2.2e-16. The small p-value indicates that the model is highly significant, suggesting that the relationship between the independent variables and the dependent variable is not due to chance.
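As a check on the reported F-statistic (a small sketch, not in the original output), it can be reproduced from the multiple R-squared via F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), with k = 4 predictors and n = 1460 observations:
# Reproduce the F-statistic from R-squared (rounding-level agreement with 350.4)
r2 <- 0.4907; k <- 4; n <- 1460
(r2 / k) / ((1 - r2) / (n - k - 1))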
summary(model_1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.729567e+06 1.046366e+05 -16.529280 2.235997e-56
## LotArea 1.139763e+00 1.551523e-01 7.346090 3.386472e-13
## PoolArea 4.859126e+01 3.736973e+01 1.300284 1.937095e-01
## YearBuilt 9.206557e+02 5.379628e+01 17.113745 5.787068e-60
## TotalBsmtSF 7.897346e+01 3.859405e+00 20.462598 5.436356e-82
Residual vs. Fitted Values
ggplot(model_1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
labs(title = "Residual vs. Fitted Values",
x = "Fitted Values", y= "Residuals")
# define residuals
res <- resid(model_1)
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res, col = "red")
The normal probability plot is roughly straight through the middle of the distribution but deviates in the tails (consistent with the very large extreme residuals in the summary above), so the assumption that the errors are normally distributed is questionable. Let's do a Box-Cox transformation and see if we get a better QQ plot.
Box Cox Transformation:
b <- boxcox(SalePrice ~ LotArea + PoolArea + YearBuilt + TotalBsmtSF, data = df_train)
lamda <- b$x
lik <- b$y
bc <- cbind(lamda, lik)
head(bc)
## lamda lik
## [1,] -2.000000 -4766.602
## [2,] -1.959596 -4715.197
## [3,] -1.919192 -4664.572
## [4,] -1.878788 -4614.743
## [5,] -1.838384 -4565.727
## [6,] -1.797980 -4517.542
# Order by likelihood, highest first
bc[order(-lik),]
## lamda lik
## [1,] -0.02020202 -3417.663
## [2,] 0.02020202 -3418.003
## [3,] -0.06060606 -3418.435
## [4,] 0.06060606 -3419.447
## [5,] -0.10101010 -3420.324
## [6,] 0.10101010 -3421.986
## [7,] -0.14141414 -3423.338
## [8,] 0.14141414 -3425.612
## [9,] -0.18181818 -3427.484
## [10,] 0.18181818 -3430.319
## [11,] -0.22222222 -3432.766
## [12,] 0.22222222 -3436.096
## [13,] -0.26262626 -3439.192
## [14,] 0.26262626 -3442.935
## [15,] -0.30303030 -3446.765
## [16,] 0.30303030 -3450.828
## [17,] -0.34343434 -3455.490
## [18,] 0.34343434 -3459.765
## [19,] -0.38383838 -3465.371
## [20,] 0.38383838 -3469.737
## [21,] -0.42424242 -3476.411
## [22,] 0.42424242 -3480.736
## [23,] -0.46464646 -3488.612
## [24,] 0.46464646 -3492.753
## [25,] -0.50505051 -3501.978
## [26,] 0.50505051 -3505.777
## [27,] -0.54545455 -3516.508
## [28,] 0.54545455 -3519.801
## [29,] -0.58585859 -3532.203
## [30,] 0.58585859 -3534.815
## [31,] -0.62626263 -3549.064
## [32,] 0.62626263 -3550.809
## [33,] -0.66666667 -3567.089
## [34,] 0.66666667 -3567.775
## [35,] 0.70707071 -3585.704
## [36,] -0.70707071 -3586.277
## [37,] 0.74747475 -3604.588
## [38,] -0.74747475 -3606.625
## [39,] 0.78787879 -3624.416
## [40,] -0.78787879 -3628.130
## [41,] 0.82828283 -3645.180
## [42,] -0.82828283 -3650.788
## [43,] 0.86868687 -3666.873
## [44,] -0.86868687 -3674.594
## [45,] 0.90909091 -3689.484
## [46,] -0.90909091 -3699.543
## [47,] 0.94949495 -3713.006
## [48,] -0.94949495 -3725.628
## [49,] 0.98989899 -3737.430
## [50,] -0.98989899 -3752.843
## [51,] 1.03030303 -3762.748
## [52,] -1.03030303 -3781.179
## [53,] 1.07070707 -3788.952
## [54,] -1.07070707 -3810.629
## [55,] 1.11111111 -3816.032
## [56,] -1.11111111 -3841.182
## [57,] 1.15151515 -3843.983
## [58,] 1.19191919 -3872.794
## [59,] -1.15151515 -3872.828
## [60,] 1.23232323 -3902.458
## [61,] -1.19191919 -3905.558
## [62,] 1.27272727 -3932.968
## [63,] -1.23232323 -3939.360
## [64,] 1.31313131 -3964.315
## [65,] -1.27272727 -3974.222
## [66,] 1.35353535 -3996.491
## [67,] -1.31313131 -4010.131
## [68,] 1.39393939 -4029.488
## [69,] -1.35353535 -4047.074
## [70,] 1.43434343 -4063.299
## [71,] -1.39393939 -4085.039
## [72,] 1.47474747 -4097.916
## [73,] -1.43434343 -4124.010
## [74,] 1.51515152 -4133.331
## [75,] -1.47474747 -4163.973
## [76,] 1.55555556 -4169.535
## [77,] -1.51515152 -4204.914
## [78,] 1.59595960 -4206.521
## [79,] 1.63636364 -4244.281
## [80,] -1.55555556 -4246.817
## [81,] 1.67676768 -4282.807
## [82,] -1.59595960 -4289.666
## [83,] 1.71717172 -4322.091
## [84,] -1.63636364 -4333.446
## [85,] 1.75757576 -4362.124
## [86,] -1.67676768 -4378.140
## [87,] 1.79797980 -4402.899
## [88,] -1.71717172 -4423.732
## [89,] 1.83838384 -4444.407
## [90,] -1.75757576 -4470.205
## [91,] 1.87878788 -4486.641
## [92,] -1.79797980 -4517.542
## [93,] 1.91919192 -4529.590
## [94,] -1.83838384 -4565.727
## [95,] 1.95959596 -4573.248
## [96,] -1.87878788 -4614.743
## [97,] 2.00000000 -4617.606
## [98,] -1.91919192 -4664.572
## [99,] -1.95959596 -4715.197
## [100,] -2.00000000 -4766.602
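The lambda with the highest profile log-likelihood can also be read off programmatically (a one-line sketch using the lamda and lik vectors defined above):
# Lambda at the maximum of the profile log-likelihood (about -0.02 here)
lamda[which.max(lik)]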
The profile log-likelihood peaks near lambda ≈ 0 (about -0.02), which points toward a log transformation of SalePrice; here we try the milder square-root transformation (lambda = 0.5) as model 2.
model2_transform = lm(SalePrice^(1/2)~ LotArea + PoolArea +YearBuilt + TotalBsmtSF, data = df_train)
summary(model2_transform)
##
## Call:
## lm(formula = SalePrice^(1/2) ~ LotArea + PoolArea + YearBuilt +
## TotalBsmtSF, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -559.74 -36.21 -9.33 30.97 341.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.896e+03 1.087e+02 -17.440 < 2e-16 ***
## LotArea 1.263e-03 1.612e-04 7.832 9.21e-15 ***
## PoolArea 2.707e-02 3.884e-02 0.697 0.486
## YearBuilt 1.122e+00 5.591e-02 20.067 < 2e-16 ***
## TotalBsmtSF 8.344e-02 4.011e-03 20.804 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59 on 1455 degrees of freedom
## Multiple R-squared: 0.5281, Adjusted R-squared: 0.5268
## F-statistic: 407 on 4 and 1455 DF, p-value: < 2.2e-16
Overall, the regression model is statistically significant and provides a moderate level of explanatory power: the multiple R-squared rises to 0.5281 (from 0.4907 before the transformation). The residual standard error of 59 is on the square-root scale of SalePrice, so it measures the typical deviation between observed and predicted values on that scale. The adjusted R-squared is close to the multiple R-squared, indicating that the model is not being unduly penalized for the number of predictors it uses.
Residual Q-Q Plot for the Transformed Model:
# define residuals
res <- resid(model2_transform)
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res, col = "red")
Now we can see that the QQ plot is much closer to a straight line passing through the first and third quartiles of the data. The residuals of the transformed model appear approximately normally distributed, so this plot provides evidence that it is reasonable to assume the errors have a normal distribution.
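The chunk that generated the predictions below is not shown. A sketch along these lines would produce a similar table, assuming a test.csv is available alongside train.csv (the URL is an assumption) and keeping in mind that predictions from model2_transform are on the square-root scale, so they should be squared before an actual Kaggle submission:
# Sketch only: the test.csv location is assumed, and missing predictor values
# in the test set may need simple imputation before predict() returns a value
# for every row.
df_test <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-605/main/test.csv")
pred_sqrt <- predict(model2_transform, newdata = df_test)
prediction <- data.frame(Id = df_test$Id, SalePrice = pred_sqrt)
head(prediction)   # values on the square-root scale, as in the table below
# For submission, back-transform to dollars: prediction$SalePrice <- pred_sqrt^2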
## Id SalePrice
## 1 1461 391.7807
## 2 1462 429.0543
## 3 1463 438.7944
## 4 1464 434.8850
## 5 1465 451.4129
## 6 1466 415.7022
write.csv(prediction, file = "prediction.csv", row.names = FALSE)
References:
https://towardsdatascience.com/predicting-housing-prices-with-r-c9ec0821328d
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
https://www.youtube.com/watch?v=rkXc25Uvyl4
https://www.youtube.com/watch?v=vtm35gVP8JU
https://www.kaggle.com/code/fedesoriano/house-prices-what-s-a-good-score