Your final is due by the end of day on 19 May. This project will show off your ability to understand the elements of the class.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
For this final exam, we were asked to go to Kaggle.com and download the train.csv data set for House Prices: Advanced Regression Techniques. We use the data to build a model of housing prices, then use the model's final output to enter the Kaggle competition. The submission receives a score, and I will report my Kaggle username along with my score.
First I will begin by importing the train.csv data set that I have downloaded from Kaggle.com. I will use the read.csv command to import my csv file.
## [1] 1460 81
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
To clean the data, I use mean imputation: each NA in a numeric column is replaced with the mean of that column. This method was applied to LotFrontage, MasVnrArea, and GarageYrBlt.
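The imputation step can be sketched as follows. This is a minimal illustration on a made-up data frame, not the code used for the report; the helper name `impute_mean` and the toy values are assumptions introduced here.

```r
# Toy data frame standing in for train.csv (values are made up)
toy <- data.frame(
  LotFrontage = c(65, NA, 80, 60, NA),
  MasVnrArea  = c(196, 0, NA, 0, 350),
  Street      = c("Pave", "Pave", "Grvl", "Pave", "Pave")
)

# Hypothetical helper: replace NA with the column mean,
# leaving non-numeric columns untouched
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy_imputed <- impute_mean(toy)
print(colSums(is.na(toy_imputed)))  # NA count per column after imputation
```

Mean imputation leaves the column mean unchanged but can pull the median toward the mean, which is consistent with the LotFrontage median of 70.05 in the summary that follows.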
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 60.00
## Median : 730.5 Median : 50.0 Mode :character Median : 70.05
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 79.00
## Max. :1460.0 Max. :190.0 Max. :313.00
##
## LotArea Street LotShape LandContour
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Condition1 Condition2 BldgType HouseStyle
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## RoofStyle RoofMatl Exterior1st Exterior2nd
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## MasVnrType MasVnrArea ExterQual ExterCond
## Length:1460 Min. : 0.0 Length:1460 Length:1460
## Class :character 1st Qu.: 0.0 Class :character Class :character
## Mode :character Median : 0.0 Mode :character Mode :character
## Mean : 103.7
## 3rd Qu.: 164.2
## Max. :1600.0
##
## Foundation BsmtQual BsmtCond BsmtExposure
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## Length:1460 Min. : 0.0 Length:1460 Min. : 0.00
## Class :character 1st Qu.: 0.0 Class :character 1st Qu.: 0.00
## Mode :character Median : 383.5 Mode :character Median : 0.00
## Mean : 443.6 Mean : 46.55
## 3rd Qu.: 712.2 3rd Qu.: 0.00
## Max. :5644.0 Max. :1474.00
##
## BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.0 Min. : 0.0 Length:1460 Length:1460
## 1st Qu.: 223.0 1st Qu.: 795.8 Class :character Class :character
## Median : 477.5 Median : 991.5 Mode :character Mode :character
## Mean : 567.2 Mean :1057.4
## 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :2336.0 Max. :6110.0
##
## CentralAir Electrical X1stFlrSF X2ndFlrSF
## Length:1460 Length:1460 Min. : 334 Min. : 0
## Class :character Class :character 1st Qu.: 882 1st Qu.: 0
## Mode :character Mode :character Median :1087 Median : 0
## Mean :1163 Mean : 347
## 3rd Qu.:1391 3rd Qu.: 728
## Max. :4692 Max. :2065
##
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.000 Median :1464 Median :0.0000 Median :0.00000
## Mean : 5.845 Mean :1515 Mean :0.4253 Mean :0.05753
## 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :572.000 Max. :5642 Max. :3.0000 Max. :2.00000
##
## FullBath HalfBath BedroomAbvGr KitchenAbvGr
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :0.0000 Median :3.000 Median :1.000
## Mean :1.565 Mean :0.3829 Mean :2.866 Mean :1.047
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :2.0000 Max. :8.000 Max. :3.000
##
## KitchenQual TotRmsAbvGrd Functional Fireplaces
## Length:1460 Min. : 2.000 Length:1460 Min. :0.000
## Class :character 1st Qu.: 5.000 Class :character 1st Qu.:0.000
## Mode :character Median : 6.000 Mode :character Median :1.000
## Mean : 6.518 Mean :0.613
## 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :14.000 Max. :3.000
##
## FireplaceQu GarageType GarageYrBlt GarageFinish
## Length:1460 Length:1460 Min. :1900 Length:1460
## Class :character Class :character 1st Qu.:1962 Class :character
## Mode :character Mode :character Median :1979 Mode :character
## Mean :1979
## 3rd Qu.:2001
## Max. :2010
##
## GarageCars GarageArea GarageQual GarageCond
## Min. :0.000 Min. : 0.0 Length:1460 Length:1460
## 1st Qu.:1.000 1st Qu.: 334.5 Class :character Class :character
## Median :2.000 Median : 480.0 Mode :character Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Length:1460 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Median : 0.00 Median : 25.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Length:1460
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Class :character
## Median : 0.00 Median : 0.00 Median : 0.000 Mode :character
## Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## Length:1460 Length:1460 Min. : 0.00 6 :253
## Class :character Class :character 1st Qu.: 0.00 7 :234
## Mode :character Mode :character Median : 0.00 5 :204
## Mean : 43.49 4 :141
## 3rd Qu.: 0.00 8 :122
## Max. :15500.00 3 :106
## (Other):400
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 Length:1460 Length:1460 Min. : 34900
## 1st Qu.:2007 Class :character Class :character 1st Qu.:129975
## Median :2008 Mode :character Mode :character Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
To pick a variable X that is skewed to the right, I plot several of the variables and see which one is right skewed. The variables plotted are LotFrontage, LotArea, TotalBsmtSF, GrLivArea, and SalePrice.
Based on the plots above, the right-skewed variable is LotArea. Therefore, variable X is LotArea and variable Y is SalePrice.
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
\[ \text{A. } P(X > x \mid Y > y) \]
## [1] 0.8200913
\[ \text{B. } P(X > x, Y > y) \]
## [1] 0.6150685
\[ \text{C. } P(X < x \mid Y > y) \]
## [1] 0.1799087
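One way to compute these three probabilities and the table of counts is sketched below on synthetic data, not the Kaggle file. The variable names and the data-generating formula are assumptions made for illustration. Two identities make useful sanity checks: because x is the third quartile, P(X > x) is about 0.25, so the joint probability in B cannot exceed about 0.25; and B equals A multiplied by P(Y > y), which is about 0.5 when y is the median.

```r
set.seed(1)
# Synthetic right-skewed X and a correlated Y (not the Kaggle data)
X <- rexp(2000, rate = 1 / 10000)
Y <- 100000 + 8 * X + rnorm(2000, sd = 50000)

x <- quantile(X, 0.75)  # 3rd quartile of X
y <- quantile(Y, 0.50)  # 2nd quartile (median) of Y

p_a <- sum(X > x & Y > y) / sum(Y > y)  # A: P(X > x | Y > y)
p_b <- mean(X > x & Y > y)              # B: P(X > x, Y > y)
p_c <- sum(X < x & Y > y) / sum(Y > y)  # C: P(X < x | Y > y)

# Table of counts with row and column totals
counts <- addmargins(table(X_above_x = X > x, Y_above_y = Y > y))

print(c(A = p_a, B = p_b, C = p_c))
print(counts)
```

A and C are complementary given Y > y, so they sum to 1 whenever no observation sits exactly at x.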
Does splitting the training data in this fashion make them independent?
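Splitting the training data in this fashion does not make the variables independent. Independence would require P(X>x | Y>y) = P(X>x), and the conditional probability in A differs from the marginal probability of about 0.25 implied by the third-quartile cutoff.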
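The independence question can also be checked formally. The sketch below, again on synthetic data with assumed variable names, compares the marginal and conditional probabilities and runs a chi-square test of independence on the 2 x 2 table of counts.

```r
set.seed(1)
# Synthetic right-skewed X and a correlated Y (not the Kaggle data)
X <- rexp(2000, rate = 1 / 10000)
Y <- 100000 + 8 * X + rnorm(2000, sd = 50000)

above_x <- X > quantile(X, 0.75)
above_y <- Y > quantile(Y, 0.50)

# Under independence, the conditional probability equals the marginal one
p_marginal    <- mean(above_x)
p_conditional <- mean(above_x[above_y])

# Chi-square test of independence on the 2 x 2 table of counts
test <- chisq.test(table(above_x, above_y))

print(c(marginal = p_marginal, conditional = p_conditional))
print(test$p.value)
```

A small p-value here means the split variables are unlikely to be independent, which matches a visible gap between the marginal and conditional probabilities.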
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
Below is a boxplot of both variable X and variable Y so that I can see where the outliers lie. Since the data have over 1,400 rows, I applied a log transform to the boxplots to see the outliers more clearly.
Univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y.
I begin with univariate descriptive statistics and a scatterplot of variables X and Y, as required. The results are below. In the scatterplot, most points cluster close to zero, while the remaining points appear to be outliers scattered across the plot.
## LotArea SalePrice
## Min. : 1300 Min. : 34900
## 1st Qu.: 7554 1st Qu.:129975
## Median : 9478 Median :163000
## Mean : 10517 Mean :180921
## 3rd Qu.: 11602 3rd Qu.:214000
## Max. :215245 Max. :755000
Because the data contain both numerical and categorical values, I keep only the numerical columns of the housing market value data.
Below, I keep only the variables whose correlation with SalePrice is greater than 0.70.
## OverallQual GrLivArea SalePrice
## OverallQual 1.0000000 0.5930074 0.7909816
## GrLivArea 0.5930074 1.0000000 0.7086245
## SalePrice 0.7909816 0.7086245 1.0000000
Derive a correlation matrix for two of the quantitative variables you selected
I provide a correlation matrix for the two quantitative variables I selected, OverallQual and SalePrice. The correlation test gives a sample correlation of 0.7909816 with a 99% confidence interval of 0.7643382 to 0.8149288. Because this interval does not contain 0 and the p-value is below 2.2e-16, I reject the hypothesis that the correlation is 0: there is a strong positive correlation between OverallQual and the SalePrice of the house.
Correlation Test of SalePrice and OverallQual.
##
## Pearson's product-moment correlation
##
## data: Housing_MarketValue_subset$OverallQual and Housing_MarketValue_subset$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.7643382 0.8149288
## sample estimates:
## cor
## 0.7909816
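A test like the one above can be reproduced with `cor.test` and `conf.level = 0.99`. The sketch below uses synthetic stand-ins for the two variables; the names `quality` and `price` and their generating formula are assumptions for illustration only.

```r
set.seed(42)
# Synthetic stand-ins for OverallQual and SalePrice (not the Kaggle data)
quality <- sample(1:10, 500, replace = TRUE)
price   <- 50000 + 20000 * quality + rnorm(500, sd = 30000)

# Pearson correlation test with a 99% confidence interval
ct <- cor.test(quality, price, method = "pearson", conf.level = 0.99)

print(ct$estimate)  # sample correlation
print(ct$conf.int)  # 99% confidence interval
print(ct$p.value)   # p-value for H0: correlation = 0

# H0 is rejected at the 1% level when the interval excludes 0
excludes_zero <- ct$conf.int[1] > 0 || ct$conf.int[2] < 0
```

The decision rests on whether the interval contains 0, not on where the point estimate sits inside the interval.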
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
Precision matrix (inverse of the correlation matrix)
## [,1] [,2]
## [1,] 2.671310 -2.112957
## [2,] -2.112957 2.671310
Multiply the precision matrix by the correlation matrix
Because the precision matrix is the inverse of the correlation matrix, their product is the identity matrix in either order. The first output below is the correlation matrix multiplied by the precision matrix; the second is labeled as the precision matrix multiplied by the correlation matrix.
Correlation Matrix by Precision Matrix
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
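Precision Matrix by Correlation Matrix (this product should also equal the identity matrix; the values printed below instead match the precision matrix multiplied by itself)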
## [,1] [,2]
## [1,] 11.60049 -11.28873
## [2,] -11.28873 11.60049
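The inversion and both products can be checked directly from the reported correlation. The sketch below rebuilds the 2 x 2 correlation matrix from r = 0.7909816; the object names are assumptions introduced here.

```r
r <- 0.7909816  # correlation between OverallQual and SalePrice reported above
cor_mat <- matrix(c(1, r, r, 1), nrow = 2)

# The precision matrix is the inverse of the correlation matrix
prec_mat <- solve(cor_mat)
print(round(prec_mat, 6))

# Both products are the 2 x 2 identity matrix, up to rounding error
print(round(cor_mat %*% prec_mat, 10))
print(round(prec_mat %*% cor_mat, 10))

# The diagonal holds the variance inflation factor, 1 / (1 - r^2)
vif <- 1 / (1 - r^2)
print(vif)
```

With two variables the variance inflation factor is the same for both, about 2.67 here, which is well below the levels usually treated as problematic collinearity.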
Conduct principal components analysis.
Principal components analysis centers the data around 0 by shifting the variables and rescales each variable to unit variance. The eigenvalues measure the amount of variation captured by each principal component (PC). They are evaluated to determine the number of principal components to keep.
## Standard deviations (1, .., p=2):
## [1] 1.414214e+00 7.850462e-17
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## [1,] -0.7071068 0.7071068
## [2,] 0.7071068 0.7071068
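As a point of comparison, `prcomp` is usually applied to the observations themselves (one row per house) rather than to a correlation matrix. The sketch below does this on synthetic data with assumed variable names. For two standardized variables with correlation r, the eigenvalues are 1 + |r| and 1 - |r|, so a correlation of 0.7909816 would correspond to eigenvalues of roughly 1.791 and 0.209, with the first component carrying about 89.5% of the variance.

```r
set.seed(7)
# Synthetic pair of correlated variables (not the Kaggle data)
n <- 1000
quality <- rnorm(n)
price   <- 0.8 * quality + rnorm(n, sd = 0.6)
dat <- data.frame(quality, price)

# PCA on the observations, centered and scaled to unit variance
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca$sdev^2
r <- cor(quality, price)

print(eigenvalues)
print(c(1 + abs(r), 1 - abs(r)))       # closed form for two variables
print(eigenvalues / sum(eigenvalues))  # proportion of variance explained
```

A second standard deviation that is numerically zero, as opposed to merely small, usually signals that the input had only two rows or perfectly dependent columns.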
Eigenvalues of Principal Components Analysis.
Plot of Correlation matrix.
Below is a correlation plot of housing market value data set with correlation greater than 0.70.
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
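To look for right skew across the housing market value data, I compare the mean and the median of each numeric variable, expressing the gap relative to the standard deviation; a positive value indicates right skew. Based on the results below, MSSubClass and LotArea both have a mean above the median, which indicates right skew.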
| rowname | V1 |
|---|---|
| MSSubClass | 0.1630535974320969361262 |
| LotFrontage | -0.0000000000000002427602 |
| LotArea | 0.1040277048276167931595 |
| OverallQual | 0.0718115086600481927759 |
| OverallCond | 0.5170226533860245998753 |
| YearBuilt | -0.0573518287639829746349 |
Skewness Summary
At the beginning, I plotted some of the variables from the train.csv data to see which one is right skewed, then set it as the independent variable X. The skewness of X, shown below together with its summary, measures how strongly right skewed it is.
## [1] 12.19514
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
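Both skew measures used in this section can be written in a few lines of base R. The sketch below is an assumption about the formulas involved, not the report's own code: `moment_skewness` is the third central moment divided by the variance to the power 3/2, and `mean_median_gap` is the mean minus the median in standard-deviation units. Both helper names are hypothetical, and the data are synthetic.

```r
set.seed(3)
# Synthetic right-skewed variable (not the Kaggle data)
v <- rexp(1000, rate = 1 / 10000)

# Moment-based sample skewness: third central moment over variance^(3/2)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

# Simpler indicator: gap between mean and median in standard-deviation units
mean_median_gap <- function(x) {
  (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

print(moment_skewness(v))
print(mean_median_gap(v))
```

Both measures are positive for right-skewed data, but the moment-based version is far more sensitive to a few extreme values, so the two can rank variables differently.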
For your variable that is skewed to the right, shift it so that the minimum value is above zero.
Since MSSubClass and LotArea are both right skewed, I will use MSSubClass for the remainder of this question.
Minimum value is above zero.
The minimum value of MSSubClass is 20, which is already above zero.
## [1] 20
Load the MASS package and run fitdistr to fit an exponential probability density function.
I will use the fitdistr function to fit this variable to an exponential distribution, with lambda as the exponential rate.
## rate
## 0.0175755387
## (0.0004599729)
## rate
## 0.01757554
The estimated rate parameter is \(\lambda\) = 0.0175755, which is 1 divided by the sample mean of 56.89726.
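The fitting step can be sketched as below on synthetic data; the vector `v` and its generating rate are assumptions for illustration. For the exponential distribution the maximum likelihood estimate of the rate is 1 / mean, so the fitted value can be checked by hand.

```r
library(MASS)  # provides fitdistr()

set.seed(11)
# Synthetic positive, right-skewed values (not the Kaggle data)
v <- rexp(1460, rate = 1 / 57)

fit <- fitdistr(v, densfun = "exponential")
lambda <- unname(fit$estimate["rate"])

print(fit)          # rate estimate with its standard error in parentheses
print(lambda)
print(1 / mean(v))  # the exponential MLE of the rate is 1 / mean
```

The value in parentheses under the rate in the `fitdistr` printout is the standard error of the estimate, not a second parameter.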
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
I will now take 1000 samples from this distribution and print the first few values below.
## .
## 1 80.679153
## 2 8.069586
## 3 28.547301
## 4 9.181433
## 5 85.706641
## 6 57.165496
Plot a histogram and compare it with a histogram of your original variable.
Comparing the original histogram with the histogram of the samples, the two look different. The original is right skewed but spread out, while the sample histogram is more strongly right skewed, with most values falling between 0 and 100.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
## [1] 2.918448 170.448959
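These percentiles follow from inverting the exponential CDF, \(F(q) = 1 - e^{-\lambda q}\), which gives \(q = -\ln(1 - p)/\lambda\). The sketch below checks the built-in `qexp` against that formula using the rate reported above; the object names are assumptions.

```r
lambda <- 0.01757554  # rate reported by fitdistr above

# Quantile function of the exponential distribution
q_builtin <- qexp(c(0.05, 0.95), rate = lambda)

# Same result from inverting the CDF F(q) = 1 - exp(-lambda * q)
q_manual <- -log(1 - c(0.05, 0.95)) / lambda

print(q_builtin)
print(q_manual)
```

Because the exponential density starts at zero, its 5th percentile falls well below the smallest value that MSSubClass actually takes.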
Generate a 95% confidence interval from the empirical data, assuming normality.
##
## One Sample t-test
##
## data: Housing_MarketValue$MSSubClass
## t = 51.395, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 54.72567 59.06885
## sample estimates:
## mean of x
## 56.89726
Finally, provide the empirical 5th percentile and 95th percentile of the data.
## 5% 95%
## 20 160
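The two empirical summaries answer different questions: the t-based interval describes uncertainty about the mean, while the 5th and 95th percentiles describe the spread of individual values. A minimal sketch on synthetic data (the vector `v` is an assumption) shows both, along with the interval computed by hand.

```r
set.seed(5)
# Synthetic stand-in for the variable (not the Kaggle data)
v <- rexp(1460, rate = 1 / 57) + 20

# 95% confidence interval for the mean, from the one-sample t-test
ci_t <- t.test(v, conf.level = 0.95)$conf.int

# Same interval by hand: mean +/- t quantile * standard error
n  <- length(v)
se <- sd(v) / sqrt(n)
ci_manual <- mean(v) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Empirical 5th and 95th percentiles of the data
emp <- quantile(v, probs = c(0.05, 0.95))

print(ci_t)
print(ci_manual)
print(emp)
```

With about 1,460 observations the standard error is small, which is why the interval for the mean is much narrower than the range between the two percentiles.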
Discuss.
I plotted the original histogram of MSSubClass from the housing market value data and found its minimum value to be 20, which is above zero. I used fitdistr from the MASS package to fit an exponential distribution to MSSubClass, which gave a rate of \(\lambda\) = 0.01757554. As instructed, I drew 1000 samples from an exponential distribution with this rate and plotted them as a histogram. The exponential histogram looks very different from the original histogram, since it is more strongly skewed to the right. The fitted exponential distribution puts the 5th and 95th percentiles at 2.918448 and 170.448959, while the empirical 5th and 95th percentiles are 20 and 160, so the exponential fit is poorest in the lower tail, where it places values below the observed minimum of 20. The t-test gives a 95% confidence interval for the mean of 54.72567 to 59.06885, centered on the sample mean of 56.89726.
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com username and score.
From the housing market value data I build a training set, fit the model on it, and then apply the fitted model to the test data.
Train Data
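```r
# Train data set: keep numeric columns only and drop the Id column
# (assumes dplyr is already loaded for %>% and select_if)
Housing_MarketValue_train = Housing_MarketValue %>%
  select_if(is.numeric) %>%
  dplyr::select(-Id)

summary(Housing_MarketValue_train)
```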
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 60.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 70.05 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 79.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 164.2
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1962 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1979 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2001 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
## PoolArea MiscVal YrSold SalePrice
## Min. : 0.000 Min. : 0.00 Min. :2006 Min. : 34900
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:2007 1st Qu.:129975
## Median : 0.000 Median : 0.00 Median :2008 Median :163000
## Mean : 2.759 Mean : 43.49 Mean :2008 Mean :180921
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.:2009 3rd Qu.:214000
## Max. :738.000 Max. :15500.00 Max. :2010 Max. :755000
Linear Model
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + MasVnrArea + X1stFlrSF + X2ndFlrSF +
## BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch +
## PoolArea, data = Housing_MarketValue_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -452376 -17295 -1323 13837 292715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.468e+05 8.790e+04 -9.634 < 2e-16 ***
## MSSubClass -1.909e+02 2.420e+01 -7.886 6.09e-15 ***
## LotArea 4.468e-01 1.002e-01 4.458 8.93e-06 ***
## OverallQual 1.978e+04 1.096e+03 18.055 < 2e-16 ***
## OverallCond 5.408e+03 9.265e+02 5.837 6.54e-09 ***
## YearBuilt 3.915e+02 4.493e+01 8.714 < 2e-16 ***
## MasVnrArea 3.208e+01 5.865e+00 5.470 5.29e-08 ***
## X1stFlrSF 7.072e+01 3.717e+00 19.027 < 2e-16 ***
## X2ndFlrSF 6.072e+01 3.331e+00 18.230 < 2e-16 ***
## BsmtFullBath 1.368e+04 1.922e+03 7.117 1.73e-12 ***
## BedroomAbvGr -8.115e+03 1.446e+03 -5.611 2.41e-08 ***
## GarageCars 1.056e+04 1.707e+03 6.187 7.96e-10 ***
## WoodDeckSF 2.811e+01 7.955e+00 3.534 0.000423 ***
## ScreenPorch 5.623e+01 1.688e+01 3.332 0.000885 ***
## PoolArea -2.959e+01 2.352e+01 -1.258 0.208649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35270 on 1445 degrees of freedom
## Multiple R-squared: 0.8048, Adjusted R-squared: 0.8029
## F-statistic: 425.4 on 14 and 1445 DF, p-value: < 2.2e-16
Linear Model Plot
Linear Model with Log Transform
##
## Call:
## lm(formula = log(SalePrice) ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + MasVnrArea + X1stFlrSF + X2ndFlrSF +
## BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch +
## PoolArea, data = Housing_MarketValue_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88574 -0.06865 0.00413 0.07766 0.50022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.743e+00 3.776e-01 9.913 < 2e-16 ***
## MSSubClass -7.229e-04 1.040e-04 -6.951 5.47e-12 ***
## LotArea 2.281e-06 4.307e-07 5.297 1.36e-07 ***
## OverallQual 9.955e-02 4.708e-03 21.146 < 2e-16 ***
## OverallCond 5.462e-02 3.981e-03 13.720 < 2e-16 ***
## YearBuilt 3.421e-03 1.930e-04 17.720 < 2e-16 ***
## MasVnrArea -1.940e-06 2.520e-05 -0.077 0.938635
## X1stFlrSF 3.162e-04 1.597e-05 19.800 < 2e-16 ***
## X2ndFlrSF 2.552e-04 1.431e-05 17.836 < 2e-16 ***
## BsmtFullBath 7.461e-02 8.260e-03 9.034 < 2e-16 ***
## BedroomAbvGr 1.492e-03 6.214e-03 0.240 0.810316
## GarageCars 7.505e-02 7.334e-03 10.233 < 2e-16 ***
## WoodDeckSF 1.392e-04 3.418e-05 4.072 4.92e-05 ***
## ScreenPorch 4.031e-04 7.251e-05 5.560 3.21e-08 ***
## PoolArea -3.599e-04 1.011e-04 -3.562 0.000381 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1515 on 1445 degrees of freedom
## Multiple R-squared: 0.8575, Adjusted R-squared: 0.8561
## F-statistic: 620.8 on 14 and 1445 DF, p-value: < 2.2e-16
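Coefficients in a log-transformed model are read differently from those in the plain linear model: a coefficient b corresponds to roughly a 100 × (exp(b) − 1) percent change in price per one-unit increase in the predictor. For example, the OverallQual estimate of 9.955e-02 above corresponds to about a 10.5 percent increase per quality point. The sketch below illustrates this, and the back-transformation of predictions, on synthetic data; the data frame, the coefficients used to generate it, and the object names are assumptions.

```r
set.seed(21)
# Synthetic training data (not the Kaggle data); column names are illustrative
train <- data.frame(
  OverallQual = sample(1:10, 300, replace = TRUE),
  X1stFlrSF   = round(runif(300, 400, 2500))
)
train$SalePrice <- exp(10.5 + 0.10 * train$OverallQual +
                       0.0003 * train$X1stFlrSF + rnorm(300, sd = 0.15))

# Fit on the log scale
fit_log <- lm(log(SalePrice) ~ OverallQual + X1stFlrSF, data = train)

# Percent change in price per one-unit increase in each predictor
pct_change <- 100 * (exp(coef(fit_log)[-1]) - 1)

# Predictions come back on the log scale, so exponentiate to get prices
new_homes <- data.frame(OverallQual = c(5, 8), X1stFlrSF = c(900, 1600))
pred_price <- exp(predict(fit_log, newdata = new_homes))

print(pct_change)
print(pred_price)
```

Note that R-squared values from the raw and log models are measured on different scales, so they are not directly comparable.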
Linear Model with Log Transform Plot
Linear Model Trained with NAs Removed
I use the train data with missing values (NAs) removed to fit the model before applying it to the test.csv file.
##
## Call:
## lm(formula = log(SalePrice) ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + MasVnrArea + X1stFlrSF + X2ndFlrSF +
## BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch +
## PoolArea, data = Housing_MarketValue_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88574 -0.06865 0.00413 0.07766 0.50022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.743e+00 3.776e-01 9.913 < 2e-16 ***
## MSSubClass -7.229e-04 1.040e-04 -6.951 5.47e-12 ***
## LotArea 2.281e-06 4.307e-07 5.297 1.36e-07 ***
## OverallQual 9.955e-02 4.708e-03 21.146 < 2e-16 ***
## OverallCond 5.462e-02 3.981e-03 13.720 < 2e-16 ***
## YearBuilt 3.421e-03 1.930e-04 17.720 < 2e-16 ***
## MasVnrArea -1.940e-06 2.520e-05 -0.077 0.938635
## X1stFlrSF 3.162e-04 1.597e-05 19.800 < 2e-16 ***
## X2ndFlrSF 2.552e-04 1.431e-05 17.836 < 2e-16 ***
## BsmtFullBath 7.461e-02 8.260e-03 9.034 < 2e-16 ***
## BedroomAbvGr 1.492e-03 6.214e-03 0.240 0.810316
## GarageCars 7.505e-02 7.334e-03 10.233 < 2e-16 ***
## WoodDeckSF 1.392e-04 3.418e-05 4.072 4.92e-05 ***
## ScreenPorch 4.031e-04 7.251e-05 5.560 3.21e-08 ***
## PoolArea -3.599e-04 1.011e-04 -3.562 0.000381 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1515 on 1445 degrees of freedom
## Multiple R-squared: 0.8575, Adjusted R-squared: 0.8561
## F-statistic: 620.8 on 14 and 1445 DF, p-value: < 2.2e-16
Test Data Loaded
I load the test.csv data set and apply the model fitted on the housing market value train data set after NA removal.
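```r
# Load the Kaggle test set
Housing_MarketValue_test = read.csv('test.csv')

# Predict on the log scale with lm3 (the log-transformed model fitted above),
# back-transform with exp(), and attach the Id column
pred = predict(lm3, Housing_MarketValue_test) %>%
  exp() %>%
  cbind(Housing_MarketValue_test$Id, .) %>%
  as.data.frame() %>%
  set_names(c("Id", "SalePrice"))

# Show the first few predictions
head(pred) %>%
  flextable()
```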
| Id | SalePrice |
|---|---|
| 1,461 | 122,794.4 |
| 1,462 | 153,066.0 |
| 1,463 | 161,158.5 |
| 1,464 | 189,662.3 |
| 1,465 | 192,457.6 |
| 1,466 | 172,162.8 |
Export to CSV
I will now export the predictions and save them to a new CSV file, as seen below.
```r
# Write the submission file; any missing prediction is written as 0
pred %>%
  replace(is.na(.), 0) %>%
  write.csv("Housing_MarketValue_Final.csv", row.names = FALSE)
```
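A prediction is missing whenever a test row has an NA in one of the model's predictors, and a price of 0 is very far from any real sale price on a log scale. One alternative, sketched here under assumptions, is to fill missing test predictors with the corresponding training mean before predicting. The data are synthetic and the object names are hypothetical; this is not the approach used for the submissions in this report.

```r
set.seed(8)
# Synthetic train and test sets (not the Kaggle data)
train <- data.frame(GarageCars = sample(0:3, 200, replace = TRUE))
train$SalePrice <- exp(11.5 + 0.2 * train$GarageCars + rnorm(200, sd = 0.1))
test <- data.frame(Id = 1:4, GarageCars = c(2, NA, 1, 3))

fit <- lm(log(SalePrice) ~ GarageCars, data = train)

# A missing predictor yields a missing prediction
raw_pred <- exp(predict(fit, newdata = test))

# Fill missing test predictors with the training mean before predicting
test_filled <- test
test_filled$GarageCars[is.na(test_filled$GarageCars)] <- mean(train$GarageCars)
filled_pred <- exp(predict(fit, newdata = test_filled))

submission <- data.frame(Id = test$Id, SalePrice = filled_pred)
print(raw_pred)
print(submission)
```

Using the training mean keeps the test set from influencing the fill value and gives every row a plausible, positive predicted price.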
Conclusion
In conclusion, after performing all these tests and training the model, I uploaded the final predictions to Kaggle.com. Using the log-transformed linear model, I first received a score of 5.54830, or errors, which required me to revise the model. After the revision, I re-ran the training and re-tested against the test.csv file. I then re-uploaded the Housing Market Value Final file to Kaggle.com and received a score lower than in my earlier trials; I am unsure what the score means. Below are my Kaggle username and scores, along with a screenshot of my Kaggle.com score.
Kaggle.com Username and Score
My Kaggle username is Valor383. My scores across multiple submissions were 1.36971, 4.78295, and 5.54830.
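For context on these numbers, this competition is scored by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, so a lower score is better. The sketch below computes that metric on made-up prices; the function name `rmse_log` and the example values are assumptions.

```r
# RMSE between the logs of predicted and observed prices
rmse_log <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

actual <- c(200000, 150000, 300000)  # made-up observed prices
close  <- c(210000, 140000, 290000)  # predictions within a few percent
far    <- close * 3                  # predictions off by a factor of three

print(rmse_log(close, actual))
print(rmse_log(far, actual))
```

On this scale a score near 0.1 means predictions are typically within about 10 percent of the sale price, while a score above 1 indicates predictions that are off by large multiplicative factors for at least some houses.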
[Screenshots of Kaggle.com submission scores]