Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ=σ=(N+1)/2.
set.seed(123)
n = 10000
X = runif(n,1,6)
Y = rnorm(n,(6+1)/2,(6+1)/2)
x <- median(X)
y <- quantile(Y)[['25%']]
x
## [1] 3.472838
y
## [1] 1.171246
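As a sanity check, the sample values of x and y can be compared with their theoretical counterparts. The sketch below assumes N = 6, as in the code above, and uses only base R quantile functions; the variable names `x_theory` and `y_theory` are introduced here for illustration.

```r
# Theoretical counterparts of x and y, assuming N = 6 as above
N  <- 6
mu <- (N + 1) / 2                              # mean and standard deviation of Y

x_theory <- qunif(0.5, min = 1, max = N)       # median of Uniform(1, N)
y_theory <- qnorm(0.25, mean = mu, sd = mu)    # 1st quartile of Normal(mu, mu)

x_theory   # 3.5
y_theory   # roughly 1.14
```

Both sample estimates (3.472838 and 1.171246) lie close to these values, as expected with 10,000 draws.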
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
The probability that random variable X is greater than the median of X given that random variable X is greater than the 1st quartile of Y.
sum(X>x & X>y)/sum(X>y)
## [1] 0.5186184
The probability that random variable X is greater than the median of X and that random variable Y is greater than the first quartile of Y.
sum(X>x & Y>y)/length(X)
## [1] 0.3756
The probability that a random variable X is less than the median of X given that random variable X is greater than the first quartile of Y.
sum(X<x & X>y)/sum(X>y)
## [1] 0.4813816
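The three probabilities can also be written with a small helper that estimates a conditional probability from logical vectors. This is a minimal sketch: `cond_prob` is a hypothetical helper name, and the data are regenerated with the same seed and parameters as above so that the block runs on its own.

```r
# Hypothetical helper: estimate P(A | B) from two logical vectors
cond_prob <- function(a, b) mean(a & b) / mean(b)

set.seed(123)
n <- 10000
X <- runif(n, 1, 6)
Y <- rnorm(n, (6 + 1) / 2, (6 + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)[["25%"]]

p_a <- cond_prob(X > x, X > y)   # a: P(X > x | X > y)
p_b <- mean(X > x & Y > y)       # b: P(X > x and Y > y)
p_c <- cond_prob(X < x, X > y)   # c: P(X < x | X > y)

# a and c condition on the same event and are complementary
p_a + p_c
```

Because y lies near the lower end of the range of X, conditioning on X > y removes only a small part of the sample, so probabilities a and c stay close to 0.5.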
# finding number of observations for each joint probability.
obs_1 = sum(X > x & Y > y)
obs_2 = sum(X < x & Y > y)
obs_3 = sum(X > x & Y < y)
obs_4 = sum(X < x & Y < y)
# calculating marginal probabilities and joint probabilities into one table
mtrx = (matrix(c(obs_1,obs_2,obs_1+obs_2,obs_3,obs_4,obs_3+obs_4,obs_1+obs_3,obs_2+obs_4,obs_1+obs_2+obs_3+obs_4),byrow = F,nrow=3))/10000
rownames(mtrx) = c("X > x","X < x","total")
colnames(mtrx) = c("Y > y ","Y < y", "total")
mtrx
## Y > y Y < y total
## X > x 0.3756 0.1244 0.5
## X < x 0.3744 0.1256 0.5
## total 0.7500 0.2500 1.0
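An alternative to assembling the table by hand is to cross-tabulate the two events with `table()` and let `prop.table()` and `addmargins()` compute the joint and marginal probabilities. The sketch below regenerates the data with the same seed so that it is self-contained; `x_side`, `y_side`, `counts` and `probs` are names chosen for this example.

```r
set.seed(123)
n <- 10000
X <- runif(n, 1, 6)
Y <- rnorm(n, 3.5, 3.5)
x <- median(X)
y <- quantile(Y, 0.25)[["25%"]]

# Cross-tabulate the two events; factor levels fix the row and column order.
# An observation exactly equal to x or y would fall into the "<" group.
x_side <- factor(ifelse(X > x, "X > x", "X < x"), levels = c("X > x", "X < x"))
y_side <- factor(ifelse(Y > y, "Y > y", "Y < y"), levels = c("Y > y", "Y < y"))
counts <- table(x_side, y_side)

# Joint probabilities with marginal totals appended as a "Sum" row and column
probs <- addmargins(prop.table(counts))
probs
```

This form avoids hard-coding the sample size in the denominator and keeps the counts available for the independence tests.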
The joint probability P(X>x and Y>y) is 0.3756.
The product of the marginal probabilities P(X>x)P(Y>y) is 0.5∗0.75=0.375.
P(X>x and Y>y) is almost the same as P(X>x)P(Y>y), which suggests that the events X>x and Y>y are independent, i.e. P(X>x and Y>y) = P(X>x)P(Y>y) up to sampling error.
mtrx_2 = matrix(c(obs_1,obs_2,obs_3,obs_4),byrow = F,nrow = 2)
rownames(mtrx_2) = c("X > x","X < x")
colnames(mtrx_2) = c("Y > y","Y < y")
mtrx_2
## Y > y Y < y
## X > x 3756 1244
## X < x 3744 1256
# checking independence using Fisher’s Exact Test
fisher.test(mtrx_2)
##
## Fisher's Exact Test for Count Data
##
## data: mtrx_2
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9242273 1.1100187
## sample estimates:
## odds ratio
## 1.012883
# checking independence using Chi Square Test
chisq.test(mtrx_2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mtrx_2
## X-squared = 0.064533, df = 1, p-value = 0.7995
The p-values for both tests are approximately 0.7995, well above the significance level α = 0.05, so we cannot reject the null hypothesis of independence. There is not enough evidence to conclude that the events are dependent.
Fisher’s Exact Test is appropriate when sample sizes are small or the cell counts are highly unequal, because it computes the p-value exactly. The Chi-squared test is a large-sample approximation, so it can give erroneous results when the expected cell counts are small. Here every cell holds more than 1,000 observations, so both tests are suitable and, as expected, they agree.
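The Chi-squared statistic reported above can be reproduced by hand, which also makes the expected cell counts visible. The sketch below copies the observed counts from the 2 x 2 table; the names `observed`, `expected`, `stat` and `p_value` are introduced for illustration only.

```r
# Observed counts copied from the 2 x 2 table above (filled column by column)
observed <- matrix(c(3756, 3744, 1244, 1256), nrow = 2,
                   dimnames = list(c("X > x", "X < x"), c("Y > y", "Y < y")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic with Yates' continuity correction, 1 degree of freedom
stat    <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

expected
stat       # about 0.0645, matching chisq.test()
p_value    # about 0.80
```

All expected counts are far above the usual rule-of-thumb threshold of 5, which is why the approximation and the exact test give the same answer here.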
train = read.csv("https://raw.githubusercontent.com/olgashiligin/data_605/master/train.csv",stringsAsFactors = F)
test = read.csv("https://raw.githubusercontent.com/olgashiligin/data_605/master/test.csv",stringsAsFactors = F)
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Dependent Variable: SalePrice
Independent Variables: LotArea OverallQual
# distributions of the numeric variables are presented below
library(ggplot2)
library(dplyr)
library(tidyr)
train%>%dplyr::select_if(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
## Warning: Removed 348 rows containing non-finite values (stat_density).
# statistical summary is presented below
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## LandSlope Neighborhood Condition1
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Condition2 BldgType HouseStyle OverallQual
## Length:1460 Length:1460 Length:1460 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 5.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 6.099
## 3rd Qu.: 7.000
## Max. :10.000
##
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Length:1460
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Class :character
## Median :5.000 Median :1973 Median :1994 Mode :character
## Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :9.000 Max. :2010 Max. :2010
##
## RoofMatl Exterior1st Exterior2nd
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## MasVnrType MasVnrArea ExterQual ExterCond
## Length:1460 Min. : 0.0 Length:1460 Length:1460
## Class :character 1st Qu.: 0.0 Class :character Class :character
## Mode :character Median : 0.0 Mode :character Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 Length:1460
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 Class :character
## Median : 0.00 Median : 6.000 Median :2008 Mode :character
## Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :15500.00 Max. :12.000 Max. :2010
##
## SaleCondition SalePrice
## Length:1460 Min. : 34900
## Class :character 1st Qu.:129975
## Mode :character Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
# summary of LotArea
summary(train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
# histogram of LotArea
hist(train$LotArea, xlab = "Lot Area", main = "House Lot Area")
# summary of the OverallQual
summary(train$OverallQual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.099 7.000 10.000
# histogram of OverallQual
hist(train$OverallQual, xlab = "Overall Quality", main = "House Overall Quality")
# OverallQual vs. Sale Price (the higher house quality the higher the price)
ggplot(train, aes(x = OverallQual, y = SalePrice)) +
geom_point() +
labs(title = "Sales Price vs OverallQual")
# log(LotArea) vs. Sale Price (sale price tends to increase with lot area)
ggplot(train, aes(x = log(LotArea), y = SalePrice)) +
geom_point() +
labs(title = "Sale Price vs log(LotArea)")
cor_mtrx = train %>% dplyr::select(LotArea, OverallQual, SalePrice) %>% cor() %>%
as.matrix()
cor_mtrx
## LotArea OverallQual SalePrice
## LotArea 1.0000000 0.1058057 0.2638434
## OverallQual 0.1058057 1.0000000 0.7909816
## SalePrice 0.2638434 0.7909816 1.0000000
The correlation between OverallQual and SalePrice (0.7909816) is much higher than the correlation between LotArea and SalePrice (0.2638434), which is consistent with the scatter plots.
cor.test(train$LotArea, train$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
As p < 0.05, we reject the null hypothesis in favor of the alternative and conclude that the correlation between SalePrice and LotArea is not equal to 0.
The 80% confidence interval for the correlation between SalePrice and LotArea is 0.2323391 to 0.2947946.
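The 80% interval printed by `cor.test` comes from the Fisher z-transformation of the sample correlation. The sketch below reproduces it from the reported correlation and the sample size of 1460; `fisher_ci` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper: confidence interval for a Pearson correlation
# via the Fisher z-transformation
fisher_ci <- function(r, n, level = 0.80) {
  z  <- atanh(r)                      # Fisher transform of r
  se <- 1 / sqrt(n - 3)               # approximate standard error of z
  q  <- qnorm(1 - (1 - level) / 2)    # two-sided normal quantile
  tanh(z + c(-1, 1) * q * se)         # back-transform to the correlation scale
}

n <- 1460                             # rows in the training set
fisher_ci(0.2638434, n)               # LotArea vs SalePrice
```

The same helper can be applied to any other entry of the correlation matrix, for example `fisher_ci(0.1058057, n)` for LotArea and OverallQual.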
cor.test(train$OverallQual, train$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
As p < 0.05, we reject the null hypothesis in favor of the alternative and conclude that the correlation between SalePrice and OverallQual is not equal to 0.
The 80% confidence interval for the correlation between SalePrice and OverallQual is 0.7780752 to 0.8032204.
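On the familywise error question, a rough calculation shows how the error rate grows with the number of tests and how a Bonferroni correction would affect the conclusions. This is a sketch under stated assumptions: three pairwise tests, a per-test level of 0.05, and p-values replaced by their reported upper bound of 2.2e-16.

```r
m     <- 3        # number of pairwise correlation tests (assumption)
alpha <- 0.05     # per-test significance level (assumption)

# Chance of at least one false rejection if the tests were independent
fwer <- 1 - (1 - alpha)^m

# Bonferroni: test each hypothesis at alpha / m to keep the familywise rate <= alpha
alpha_bonf <- alpha / m

# The reported p-values are only known to be below 2.2e-16, so that bound is used
p_bounds <- c(2.2e-16, 2.2e-16)
p_adj    <- p.adjust(p_bounds, method = "bonferroni", n = m)

fwer         # 0.142625
alpha_bonf
p_adj
```

Since the adjusted p-value bounds remain far below 0.05, familywise error is of little practical concern for the two tests reported above.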
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Precision matrix
# calculating precision matrix
pres_mtrx = solve(cor_mtrx)
pres_mtrx
## LotArea OverallQual SalePrice
## LotArea 1.1085153 0.3046752 -0.5334669
## OverallQual 0.3046752 2.7550503 -2.2595806
## SalePrice -0.5334669 -2.2595806 2.9280384
# multiplying the correlation matrix by the precision matrix
mult3 = round(cor_mtrx %*% pres_mtrx)
mult3
## LotArea OverallQual SalePrice
## LotArea 1 0 0
## OverallQual 0 1 0
## SalePrice 0 0 1
# multiplying the precision matrix by the correlation matrix
mult4 = round(pres_mtrx %*% cor_mtrx)
mult4
## LotArea OverallQual SalePrice
## LotArea 1 0 0
## OverallQual 0 1 0
## SalePrice 0 0 1
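The statement that the diagonal of the precision matrix holds the variance inflation factors can be checked directly. The sketch below copies the correlation matrix from the output above, verifies the inverse with a numerical tolerance rather than `round()`, and compares the diagonal with 1 / (1 - R²) computed for each variable; the names `R`, `P`, `r2` and `vif` are chosen for this example.

```r
# Correlation matrix copied from the output above
vars <- c("LotArea", "OverallQual", "SalePrice")
R <- matrix(c(1.0000000, 0.1058057, 0.2638434,
              0.1058057, 1.0000000, 0.7909816,
              0.2638434, 0.7909816, 1.0000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(R)                                   # precision matrix

# Identity check with a numerical tolerance instead of round()
isTRUE(all.equal(R %*% P, diag(3), check.attributes = FALSE))

# VIF of variable i is 1 / (1 - R_i^2), where R_i^2 is the squared multiple
# correlation of variable i with the remaining variables
r2  <- sapply(1:3, function(i) drop(R[i, -i] %*% solve(R[-i, -i]) %*% R[-i, i]))
vif <- 1 / (1 - r2)
cbind(diag_precision = diag(P), vif = vif)
```

The two columns agree, so the values 1.1085153, 2.7550503 and 2.9280384 can be read as the variance inflation factors of the three variables.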
# conducting LU decomposition on the matrix
library(matrixcalc)
mtrx_decomp = lu.decomposition(cor_mtrx)
mtrx_decomp
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.1058057 1.0000000 0
## [3,] 0.2638434 0.7717046 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.1058057 0.2638434
## [2,] 0 0.9888051 0.7630655
## [3,] 0 0.0000000 0.3415256
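To see where the factors come from, the decomposition can be reproduced in base R with Doolittle elimination. This is a minimal sketch without pivoting, which is adequate for this positive definite correlation matrix but is not a general-purpose routine; `lu_doolittle` is a hypothetical helper, not the matrixcalc implementation.

```r
# Hypothetical helper: Doolittle LU factorisation without pivoting
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]            # multiplier for row i
      U[i, ]  <- U[i, ] - L[i, k] * U[k, ]    # eliminate the entry below the pivot
      U[i, k] <- 0                            # remove floating-point residue
    }
  }
  list(L = L, U = U)
}

# Correlation matrix copied from the output above
R <- matrix(c(1.0000000, 0.1058057, 0.2638434,
              0.1058057, 1.0000000, 0.7909816,
              0.2638434, 0.7909816, 1.0000000), nrow = 3, byrow = TRUE)

fact <- lu_doolittle(R)
fact$L
fact$U

# The product of the factors should give back the original matrix
isTRUE(all.equal(fact$L %*% fact$U, R))
```

Because the correlation matrix is symmetric with a unit diagonal, the first column of L equals the first row of U, as the output above shows.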
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Selected right-skewed variable - GrLivArea: Above grade (ground) living area square feet
library(MASS)
# selecting a variable that is skewed to the right
hist(train$GrLivArea)
# checking that min value is absolutely above zero
check_min = min(train$GrLivArea)
check_min
## [1] 334
# fitting an exponential distribution by maximum likelihood
exp_prob = fitdistr(train$GrLivArea, "exponential")
exp_prob
## rate
## 6.598640e-04
## (1.726943e-05)
# finding the optimal value of λ for this distribution
lambda = exp_prob$estimate
lambda
## rate
## 0.000659864
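For the exponential distribution the maximum likelihood estimate of the rate has a closed form, 1 / mean(x). The sketch below checks this numerically on a simulated stand-in sample (not the housing data) by maximising the log-likelihood with base R's `optimize`; all names are chosen for this illustration.

```r
set.seed(1)
# Stand-in sample: NOT the housing data, only an illustration
x <- rexp(500, rate = 1 / 1500)

# Negative log-likelihood of the exponential model, parameterised by log(rate)
neg_loglik <- function(log_rate) {
  rate <- exp(log_rate)
  -(length(x) * log(rate) - rate * sum(x))
}

fit          <- optimize(neg_loglik, interval = c(-15, 0), tol = 1e-8)
rate_numeric <- exp(fit$minimum)
rate_closed  <- 1 / mean(x)           # closed-form maximum likelihood estimate

c(numeric = rate_numeric, closed_form = rate_closed)
```

The rate reported above, 6.598640e-04, corresponds to a mean of about 1515, which agrees with the mean of GrLivArea in the summary table.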
# taking 1000 samples from this exponential distribution using this value
opt_value = rexp(1000, lambda)
opt_value
## [1] 3767.015832 847.482099 1239.371164 150.634184 2944.772065
## [6] 1592.610633 1014.279859 1640.552748 703.447437 2659.006881
## [11] 40.459591 2291.302840 723.334479 3829.976776 437.601477
## [16] 695.056488 716.001343 1544.190109 3222.711849 2072.746613
## [21] 657.157772 301.333508 1575.387718 423.226777 1677.309503
## [26] 4250.560540 804.139088 1599.792715 1004.976879 4361.837892
## [31] 2071.826338 555.217463 306.386857 54.975330 17.697524
## [36] 2879.057747 106.009957 185.802878 362.270717 226.209141
## [41] 3097.754818 1297.801020 250.061749 751.214877 3003.369132
## [46] 3133.688914 5111.834174 617.334662 1404.153270 486.729773
## [51] 6763.528967 765.239898 2738.495732 1464.276865 821.762070
## [56] 76.259900 1537.576590 837.990929 615.769463 1002.010720
## [61] 4455.379189 423.366067 1280.782854 1873.456220 2221.308933
## [66] 394.366138 199.935060 4058.168057 523.024887 2429.079862
## [71] 707.542599 1553.726406 2408.571170 261.728067 933.970012
## [76] 2403.873486 771.889537 640.131397 582.646777 2160.830839
## [81] 3354.864147 3996.194349 277.307685 828.183470 345.072957
## [86] 3700.882748 1154.927299 1241.084495 421.215128 1883.670871
## [91] 1354.810668 1996.594636 2858.190941 479.806317 403.375321
## [96] 1361.392695 3299.198957 2032.189458 520.167173 1183.135274
## [101] 5139.299692 692.939877 8197.915585 1322.664911 1706.870835
## [106] 962.834795 870.489735 591.800720 2524.561119 1362.201873
## [111] 2343.061198 555.666739 1098.779047 623.703780 1368.527133
## [116] 159.995340 1513.888946 592.266385 1063.769708 26.825615
## [121] 3469.312295 4165.651300 729.078248 11675.545641 1017.747682
## [126] 468.492475 2732.542654 1456.016243 3763.707789 2636.277338
## [131] 1391.887433 775.319395 1274.538420 699.559463 78.595611
## [136] 491.801760 1238.555792 1650.170681 2978.737893 3721.881518
## [141] 4021.826708 139.167299 4682.507196 459.554188 1327.314500
## [146] 186.506646 3359.323289 2133.321468 4832.941250 2226.724767
## [151] 801.978958 198.354778 648.030552 48.366529 215.085986
## [156] 606.805090 2114.787293 4011.122647 2437.408725 1232.386583
## [161] 1629.635286 833.008355 6629.073288 1163.137094 614.931056
## [166] 70.382307 774.828040 2520.465213 1482.117636 51.613064
## [171] 5115.610906 105.515574 2113.815137 1341.010653 552.676418
## [176] 479.231934 1714.427003 767.854155 1880.614177 482.053119
## [181] 782.564060 87.999243 358.979749 80.558965 2254.920707
## [186] 207.209011 1704.038588 2195.947404 2345.294844 3826.087630
## [191] 2660.791801 857.959123 2364.774522 452.005330 390.310046
## [196] 981.001722 39.269496 254.801118 603.070545 2109.252516
## [201] 398.372528 2248.002134 1435.740361 2423.786965 204.212506
## [206] 3800.447563 247.709175 2389.165279 1584.175059 260.482058
## [211] 1720.766932 689.040534 1908.567037 162.288197 1039.879770
## [216] 564.329157 1892.149378 922.065346 507.705000 2570.690830
## [221] 2033.939154 1357.470806 19.706968 994.028400 1580.274719
## [226] 361.835959 555.488639 2418.787366 1333.187606 784.817275
## [231] 177.055122 1420.412712 976.297968 1119.250594 593.997493
## [236] 256.846422 572.189043 2509.157003 1759.744195 1899.571719
## [241] 714.856104 454.687335 3251.379116 1487.233612 5774.598888
## [246] 2429.169418 144.408189 1340.419446 534.056104 629.534568
## [251] 3266.130376 1178.990012 3965.691631 975.331548 366.172514
## [256] 2557.589351 1818.701866 614.256700 2926.320615 1278.517771
## [261] 457.606678 1038.573359 381.169083 1556.647487 119.607405
## [266] 67.055824 1611.330964 3218.650128 1212.982730 3172.902239
## [271] 179.045857 1994.662706 204.298230 895.203163 1228.025995
## [276] 3470.692201 928.283220 458.507974 263.358544 4818.732346
## [281] 915.618433 1605.716074 1268.556037 4694.918386 1255.937968
## [286] 1047.088656 269.869310 1961.906323 1047.227160 832.629590
## [291] 5.409209 2017.304997 30.378319 359.224033 956.179313
## [296] 1045.040768 698.995106 1412.258812 11.704891 558.867087
## [301] 713.304452 1069.259518 1962.708270 1973.241393 254.046417
## [306] 1165.483073 468.133724 3643.642257 856.454316 1029.734697
## [311] 1531.619679 1564.299540 1359.304351 570.611919 67.612835
## [316] 1992.490478 3466.485464 4080.627249 1339.458713 1568.962017
## [321] 4211.096280 625.842754 1712.169550 1369.161705 1395.636442
## [326] 1531.635913 5071.979130 963.555670 3850.327274 1750.936710
## [331] 3.887590 535.061020 643.722172 370.562741 965.860934
## [336] 61.032985 1733.262110 31.395535 2807.067509 4321.759373
## [341] 172.128065 1798.910557 646.963313 98.399856 2447.883808
## [346] 2284.599024 5408.686886 1676.708802 875.851981 197.978244
## [351] 709.796293 2610.961521 188.256577 96.579548 2150.570818
## [356] 193.853169 1075.193703 144.756528 127.995678 1623.192770
## [361] 675.424349 1637.123886 371.340545 126.739670 4511.789976
## [366] 214.603326 1720.431988 335.829150 1048.493131 930.309933
## [371] 98.275160 1132.451315 2963.612004 526.565594 1131.571630
## [376] 726.804045 252.796969 489.861483 1297.278936 2245.089834
## [381] 523.532600 203.103082 673.003202 1133.680804 428.919104
## [386] 1341.747409 1160.826805 466.776504 2184.699950 1446.713782
## [391] 2769.902360 199.048313 872.160960 569.593120 837.822715
## [396] 1953.977866 1500.871267 61.876395 725.860445 147.703421
## [401] 1186.666550 468.763253 133.610669 801.484874 19.255746
## [406] 299.100112 65.109477 3843.505257 419.278319 329.476881
## [411] 1898.547590 1596.798446 3260.414144 378.114610 974.439118
## [416] 483.336428 4177.201492 853.126844 530.372637 1146.967631
## [421] 2148.958927 437.858365 372.195652 1004.519225 1855.998297
## [426] 877.480502 206.251474 27.872330 1067.924488 354.703362
## [431] 1625.337580 3592.223643 4452.376643 471.792790 1862.050842
## [436] 469.717801 158.666295 712.218767 973.300572 178.905627
## [441] 2679.956974 169.824413 715.835910 1402.218275 519.853443
## [446] 1969.972449 1498.965491 2081.130779 5607.112045 763.435955
## [451] 859.006583 2906.238346 752.799309 4571.077539 1127.669777
## [456] 337.221007 1030.685671 1423.217997 911.797605 4264.357759
## [461] 577.547458 4862.951789 2241.414668 3427.926349 2866.162612
## [466] 2169.920388 507.686257 3922.579149 567.264520 1644.253472
## [471] 1461.830607 403.595767 224.322289 298.184839 1492.262790
## [476] 431.875522 1455.021060 110.267918 2233.452926 2008.814525
## [481] 59.563723 398.356196 55.141194 300.469522 2126.091156
## [486] 111.214900 3014.806083 7158.364459 888.120790 2594.971968
## [491] 1109.076156 1259.291192 1895.277801 1513.428757 2147.241469
## [496] 3759.105920 1531.888448 128.667301 1350.493514 548.372172
## [501] 1311.274768 3103.142243 3222.513510 5977.285393 298.905376
## [506] 380.162252 69.359525 792.371973 4870.667890 220.362160
## [511] 758.795941 3028.707605 83.586537 903.129629 825.837467
## [516] 153.363456 1400.918820 3513.473267 181.859620 279.894907
## [521] 444.712984 453.574137 2176.526164 633.944729 1840.848565
## [526] 4583.412461 842.641604 158.042683 191.526022 880.433115
## [531] 2196.789719 1403.932126 280.961527 229.467639 4223.098443
## [536] 21.697092 1072.020054 176.630536 4057.923114 1692.110020
## [541] 3317.663055 866.191064 1073.506525 216.855608 819.229707
## [546] 152.047998 3067.076445 3367.074241 3596.751942 1947.828394
## [551] 100.146546 1399.227118 660.790718 478.072051 2875.856794
## [556] 594.114543 2760.759111 849.314311 2745.770128 643.631003
## [561] 1650.512014 774.400040 238.272534 2737.494385 447.493269
## [566] 21.614621 1871.372096 354.557530 893.891546 404.878010
## [571] 1079.522991 1908.113385 1672.247844 2229.522347 1890.025424
## [576] 2682.497333 116.119538 1092.840311 2501.645620 205.100002
## [581] 317.613305 2174.866995 3565.275910 1018.103059 38.509314
## [586] 507.985359 908.858773 1888.829435 4465.694652 829.352365
## [591] 4441.978012 202.032773 2428.831754 317.652071 1013.880063
## [596] 343.426907 977.017147 1437.007192 580.961155 823.389433
## [601] 73.735119 2779.237625 1775.854394 493.647906 2969.906637
## [606] 1134.520110 1295.823657 306.589837 1019.516252 837.639284
## [611] 1058.765338 470.046126 192.276938 495.219951 1900.868079
## [616] 547.239817 100.568747 2031.733702 414.352864 537.877502
## [621] 1862.062868 631.982593 60.938451 7200.618271 438.277392
## [626] 265.060638 1625.227838 620.258054 211.909512 63.149191
## [631] 3296.490893 1761.956769 1079.148161 1520.067928 1987.213296
## [636] 1087.937817 1544.860099 1265.154300 5181.485340 2829.460197
## [641] 591.937261 1748.097029 1919.909139 5339.894396 1209.898620
## [646] 2525.851653 738.760525 1588.680822 2429.664305 664.180018
## [651] 111.890325 839.984095 1052.147995 875.482279 208.003625
## [656] 1401.214974 757.854215 172.246791 4432.148095 3073.042191
## [661] 1417.294458 1753.269229 1882.150496 1927.866923 974.107993
## [666] 1518.202841 1095.498266 1217.519915 1852.803804 3193.512957
## [671] 3824.251163 3635.633043 1848.754083 1294.496958 1864.289333
## [676] 836.628902 1251.633630 5433.614464 3554.683681 1188.460522
## [681] 1256.720376 840.518902 521.507878 581.353500 253.779328
## [686] 591.638852 2370.490368 632.528008 6296.962415 2257.650465
## [691] 2322.822181 781.994079 558.029984 10.949457 5313.820504
## [696] 2703.433126 4116.374896 1331.573004 2213.957376 2715.860836
## [701] 2376.147814 2318.422218 5748.400645 4225.524145 1028.991140
## [706] 387.426591 1064.326751 2742.691202 146.965741 1019.785566
## [711] 586.084292 1088.796686 1214.275537 105.556570 2888.465112
## [716] 1109.277102 44.290126 2166.211521 2283.123522 2345.681295
## [721] 668.557102 1730.238062 1377.869809 1448.459848 3562.603402
## [726] 409.725462 81.499877 629.046772 899.917704 445.212987
## [731] 155.765193 1818.280554 2132.522259 853.731012 66.947247
## [736] 211.351750 2064.349252 4561.485540 1938.142501 3085.219400
## [741] 958.982277 1886.266963 3863.105010 228.991791 1251.251711
## [746] 162.131594 154.172329 1922.090578 255.093064 999.710404
## [751] 3697.525168 138.275961 1982.330426 1127.766010 1553.272588
## [756] 3060.154572 727.710095 133.939274 536.784273 1845.343273
## [761] 1675.023569 4759.288502 254.579239 7917.238557 128.834440
## [766] 1386.430588 1502.053880 1224.825001 2303.578987 2356.808628
## [771] 231.292918 2581.882827 1014.942122 654.194078 1015.788083
## [776] 1464.882566 166.531230 2633.672516 3634.469284 1429.692417
## [781] 282.514396 297.846954 1745.893540 819.226470 1444.482560
## [786] 588.432483 1139.447985 4882.047644 668.375585 3013.792095
## [791] 1174.571566 2883.494432 826.119888 279.541727 21.650728
## [796] 2410.507145 4388.946961 1055.445573 2416.576191 839.190190
## [801] 17.845965 1367.381932 1210.752960 1692.523046 31.472629
## [806] 2187.890271 4030.062047 467.382318 1684.916459 1788.736222
## [811] 610.648057 477.877692 668.271283 92.528237 5731.797910
## [816] 2247.775337 4091.067206 666.392853 549.284859 349.006014
## [821] 5641.696546 2544.128241 2722.688134 114.848358 162.252149
## [826] 5767.811140 100.315760 802.643845 2505.613477 2303.357501
## [831] 305.318307 1002.191767 2055.738615 1350.453207 600.457850
## [836] 1714.393900 1675.035018 365.166089 2055.954032 329.798059
## [841] 9173.638979 1758.826283 1922.934389 1880.093930 616.680208
## [846] 107.246237 731.216418 1834.499678 1789.333008 2096.354529
## [851] 268.156816 211.612888 283.878122 395.517494 47.392194
## [856] 1757.359318 716.088840 638.505813 7122.658392 1836.413122
## [861] 38.965622 169.076177 108.909315 4389.367543 103.545593
## [866] 2361.023220 2068.244798 1293.318379 355.502968 480.434010
## [871] 2020.795935 300.458363 1988.780919 3060.263780 2263.235361
## [876] 188.627175 3018.141418 1428.382189 466.994813 4999.067855
## [881] 291.971797 92.936363 1100.714256 1405.958936 599.471633
## [886] 892.369000 114.472739 1241.199591 51.562100 2265.739579
## [891] 17.699669 1399.423114 438.002247 1583.634922 1096.834706
## [896] 592.657308 2180.452601 865.324680 1748.795666 3051.723249
## [901] 2040.825338 2576.907368 390.012873 387.406435 595.240793
## [906] 102.120711 1792.803073 111.934423 34.332204 1063.732961
## [911] 3921.940830 902.342046 4466.355751 623.657576 682.938910
## [916] 228.190288 3834.362293 1309.406922 2637.984600 2281.568580
## [921] 806.653214 1285.089905 1912.472268 1650.533966 683.646362
## [926] 933.709007 2557.206959 2402.655229 2167.437677 2638.548685
## [931] 3866.657101 2620.562156 568.584289 243.591711 3177.231609
## [936] 3136.754184 79.015299 283.469281 1058.500246 1901.623099
## [941] 306.516859 32.389869 1990.686562 4913.313672 1646.433033
## [946] 120.584349 3303.612324 438.244132 662.657958 744.691158
## [951] 556.778066 304.360878 1377.204271 488.150552 3538.887940
## [956] 1465.968104 139.217057 298.984563 1260.663215 973.893364
## [961] 2390.554938 718.153541 40.319247 2840.474074 75.503877
## [966] 224.102710 482.474104 351.612039 468.738973 1241.167115
## [971] 1505.558156 3572.493081 843.382103 1209.556399 1785.730750
## [976] 377.689692 488.340869 1522.672164 554.473458 304.633569
## [981] 1855.515143 1390.768351 1320.557218 609.973592 261.031694
## [986] 1458.705977 2424.824135 1210.715515 228.098464 1659.284115
## [991] 3393.420744 432.159384 88.910685 4164.001243 3485.012889
## [996] 2066.498642 2080.767779 2509.380244 924.972985 1065.807018
# comparing the simulated exponential sample with the original data
par(mfrow = c(1, 2))
hist(opt_value, breaks = 30, main = "Exponential: GrLivArea", col = "grey")
hist(train$GrLivArea, breaks = 30, main = "Original: GrLivArea", col = "pink")
# finding the 5th and 95th percentiles of the fitted exponential distribution using the quantile function (inverse CDF)
qexp(0.05, rate = lambda)
## [1] 77.73313
qexp(0.95, rate = lambda)
## [1] 4539.924
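These percentiles follow from inverting the exponential CDF F(q) = 1 - exp(-λq), which gives q = -log(1 - p) / λ. The sketch below applies the formula with the rate reported above; `exp_quantile` is a hypothetical helper name.

```r
lambda <- 6.598640e-04                 # rate reported by fitdistr above

# Inverse CDF of the exponential distribution
exp_quantile <- function(p, rate) -log(1 - p) / rate

exp_quantile(c(0.05, 0.95), lambda)    # about 77.73 and 4539.92

# Plugging a quantile back into the CDF recovers the probability
pexp(exp_quantile(0.95, lambda), rate = lambda)
```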
# generating a 95% confidence interval for the mean (assuming normality); note that it is computed on the simulated sample opt_value
library("Rmisc")
CI(na.exclude(opt_value), ci = 0.95)
## upper mean lower
## 1584.893 1497.636 1410.379
# providing the 5th and 95th percentiles of the simulated sample opt_value
quantile(opt_value, 0.05)
## 5%
## 87.77861
quantile(opt_value, 0.95)
## 95%
## 4267.228
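The 5th and 95th percentiles from the exponential quantile function (77.73 and 4539.92) are close to the 5th and 95th percentiles of the simulated sample (87.78 and 4267.23), which is expected because the sample was drawn from that fitted distribution. Assuming normality, the 95% confidence interval for the mean of the simulated sample is 1410.379 to 1584.893.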
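The assignment asks for the confidence interval and percentiles of the empirical data, so the same steps can also be applied to the original variable. The sketch below shows a base R version of a t-based interval for the mean; `mean_ci` is a hypothetical helper, and the vector `v` is a simulated stand-in because the training file is not bundled with this example. With the real data, `v` would be `train$GrLivArea`.

```r
# Hypothetical helper: t-based confidence interval for the mean
mean_ci <- function(v, level = 0.95) {
  v  <- v[!is.na(v)]
  n  <- length(v)
  se <- sd(v) / sqrt(n)
  q  <- qt(1 - (1 - level) / 2, df = n - 1)
  c(lower = mean(v) - q * se, mean = mean(v), upper = mean(v) + q * se)
}

set.seed(42)
# Simulated right-skewed stand-in; replace with train$GrLivArea for the real data
v <- 334 + rgamma(1460, shape = 2, scale = 600)

mean_ci(v)
quantile(v, c(0.05, 0.95))             # empirical 5th and 95th percentiles
```

The interval describes the uncertainty of the mean, whereas the percentiles describe the spread of individual observations, so the two ranges are not expected to be similar.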
# checking NAs
sapply(train, function(y) sum(length(which(is.na(y)))))/nrow(train)*100
## Id MSSubClass MSZoning LotFrontage LotArea
## 0.00000000 0.00000000 0.00000000 17.73972603 0.00000000
## Street Alley LotShape LandContour Utilities
## 0.00000000 93.76712329 0.00000000 0.00000000 0.00000000
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 0.54794521 0.54794521 0.00000000 0.00000000 0.00000000
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 2.53424658 2.53424658 2.60273973 2.53424658 0.00000000
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 2.60273973 0.00000000 0.00000000 0.00000000 0.00000000
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0.00000000 0.00000000 0.06849315 0.00000000 0.00000000
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0.00000000 0.00000000 47.26027397 5.54794521 5.54794521
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 5.54794521 0.00000000 0.00000000 5.54794521 5.54794521
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0.00000000 0.00000000 99.52054795 80.75342466 96.30136986
## MiscVal MoSold YrSold SaleType SaleCondition
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## SalePrice
## 0.00000000
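With this many columns, it can help to list only the columns that actually contain missing values, sorted from worst to best. The sketch below does that on a small made-up data frame (toy, with hypothetical values), not on the housing data.

```r
# Toy data frame standing in for `train` (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# Share of missing values per column, in percent
na_pct <- colMeans(is.na(toy)) * 100

# Keep only columns with missing values, worst first
na_pct_sorted <- sort(na_pct[na_pct > 0], decreasing = TRUE)
print(na_pct_sorted)
```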
Missing values are imputed with MICE (Multivariate Imputation by Chained Equations). The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. Here every incomplete variable is imputed with the random forest method (method = "rf").
library(mice)
mice_imputes = mice(train, method = "rf", print = FALSE)
train = complete(mice_imputes)
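As a minimal sketch of the impute-then-complete workflow, the example below runs mice on the small nhanes data set that ships with the package. The method (predictive mean matching, "pmm"), the single imputation (m = 1) and the fixed seed are illustrative choices, not the settings used on the housing data.

```r
library(mice)

# nhanes is a small example data set bundled with mice
imp <- mice(nhanes, m = 1, method = "pmm", seed = 123, printFlag = FALSE)

# complete() returns the data with missing cells replaced by imputed values
nhanes_filled <- complete(imp, 1)

# Missing values per column before and after imputation
colSums(is.na(nhanes))
colSums(is.na(nhanes_filled))
```

Fixing the seed makes the imputed values reproducible between runs.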
Selecting the 20 numeric variables that correlate most strongly with the response variable. The list includes SalePrice itself, so it contains 19 candidate predictors.
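library(plyr) # loaded before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(tidyverse)
# correlation matrix of all numeric columns
data_cor <- cor(train %>% dplyr::select_if(is.numeric), use = "complete.obs")
# correlations with SalePrice, sorted from highest to lowest
top_data_cor <- data_cor %>%
  as.data.frame() %>%
  dplyr::select(SalePrice) %>%
  rownames_to_column() %>%
  arrange(desc(SalePrice))
# 20 highest correlations (the first row is SalePrice itself)
top_data_cor %>%
  top_n(20, SalePrice)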
## rowname SalePrice
## 1 SalePrice 1.0000000
## 2 OverallQual 0.7909816
## 3 GrLivArea 0.7086245
## 4 GarageCars 0.6404092
## 5 GarageArea 0.6234314
## 6 TotalBsmtSF 0.6135806
## 7 X1stFlrSF 0.6058522
## 8 FullBath 0.5606638
## 9 TotRmsAbvGrd 0.5337232
## 10 YearBuilt 0.5228973
## 11 YearRemodAdd 0.5071010
## 12 GarageYrBlt 0.5009348
## 13 MasVnrArea 0.4726145
## 14 Fireplaces 0.4669288
## 15 BsmtFinSF1 0.3864198
## 16 LotFrontage 0.3368489
## 17 WoodDeckSF 0.3244134
## 18 X2ndFlrSF 0.3193338
## 19 OpenPorchSF 0.3158562
## 20 HalfBath 0.2841077
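The same ranking can be sketched in base R. The example below uses the built-in mtcars data set with mpg as a stand-in response (both are illustrative assumptions) and drops the response from its own ranking.

```r
# Numeric columns only; mtcars is a built-in data set used as a stand-in
num <- mtcars[sapply(mtcars, is.numeric)]

# Correlation of every numeric column with the response (mpg here)
cors <- cor(num, use = "complete.obs")[, "mpg"]

# Drop the response itself, then rank by signed correlation
cors <- cors[names(cors) != "mpg"]
top <- head(sort(cors, decreasing = TRUE), 5)
print(top)
```

Ranking by abs(cors) instead would also surface strongly negative correlations, which a signed ranking pushes to the bottom of the list.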
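Building a model from the predictors most highly correlated with the SalePrice response variable. The model uses the first 18 predictors from the list above plus LotArea; HalfBath is not included.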
fit <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea + TotalBsmtSF + X1stFlrSF + FullBath + TotRmsAbvGrd + YearBuilt + YearRemodAdd + GarageYrBlt + MasVnrArea + Fireplaces + BsmtFinSF1 + LotFrontage + OpenPorchSF+ WoodDeckSF + X2ndFlrSF + LotArea, data = train)
summary(fit)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## GarageArea + TotalBsmtSF + X1stFlrSF + FullBath + TotRmsAbvGrd +
## YearBuilt + YearRemodAdd + GarageYrBlt + MasVnrArea + Fireplaces +
## BsmtFinSF1 + LotFrontage + OpenPorchSF + WoodDeckSF + X2ndFlrSF +
## LotArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -520322 -17145 -1892 14332 288173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.098e+06 1.325e+05 -8.285 2.68e-16 ***
## OverallQual 1.948e+04 1.171e+03 16.642 < 2e-16 ***
## GrLivArea 2.331e+01 2.047e+01 1.138 0.255187
## GarageCars 1.006e+04 2.953e+03 3.407 0.000676 ***
## GarageArea 1.037e+01 1.044e+01 0.993 0.320738
## TotalBsmtSF 1.019e+01 4.284e+00 2.379 0.017494 *
## X1stFlrSF 2.104e+01 2.096e+01 1.004 0.315538
## FullBath -1.905e+03 2.615e+03 -0.728 0.466561
## TotRmsAbvGrd 1.683e+03 1.088e+03 1.547 0.122127
## YearBuilt 2.017e+02 6.550e+01 3.080 0.002109 **
## YearRemodAdd 3.698e+02 6.334e+01 5.838 6.53e-09 ***
## GarageYrBlt -5.160e+01 7.742e+01 -0.667 0.505183
## MasVnrArea 2.917e+01 6.117e+00 4.769 2.03e-06 ***
## Fireplaces 6.290e+03 1.798e+03 3.498 0.000484 ***
## BsmtFinSF1 1.641e+01 2.588e+00 6.342 3.03e-10 ***
## LotFrontage 2.777e+01 4.757e+01 0.584 0.559430
## OpenPorchSF 9.888e+00 1.557e+01 0.635 0.525587
## WoodDeckSF 2.890e+01 8.133e+00 3.554 0.000392 ***
## X2ndFlrSF 1.372e+01 2.061e+01 0.666 0.505741
## LotArea 4.703e-01 1.057e-01 4.449 9.29e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36200 on 1440 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.7923
## F-statistic: 294 on 19 and 1440 DF, p-value: < 2.2e-16
Checking residuals
plot(fit$fitted.values, fit$residuals,
xlab="Fitted Values", ylab="Residuals", main="Fitted Values vs. Residuals")
abline(h=0)
qqnorm(fit$residuals)
qqline(fit$residuals)
R^2 = 0.795, which means that about 79.5% of the variance in SalePrice is explained by this model (adjusted R^2 = 0.7923).
The p-value of the overall F-test (< 2.2e-16) is far below 0.05, so we can reject the null hypothesis that all slope coefficients are zero. Likewise, an individual predictor with a low p-value is likely to be a meaningful addition to our model because changes in the predictor’s value are related to changes in the response variable.
The statistically significant predictors in this model (p < 0.01) are the following: OverallQual, GarageCars, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, Fireplaces, WoodDeckSF, LotArea. TotalBsmtSF is also significant at the 0.05 level.
If we take the most significant predictor according to this model, OverallQual, then a 1 unit increase in OverallQual is associated with an increase of about 19,480 (1.948e+04) in SalePrice, holding the other predictors constant.
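Reading a coefficient as the change in the predicted response per one-unit change in a predictor, with the other predictors held fixed, can be checked directly. The sketch below does so on the built-in mtcars data with an illustrative model (mpg ~ wt + hp); the two cars are hypothetical.

```r
# Built-in mtcars as a stand-in; the model is illustrative only
m <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ only by one unit of wt
base_car <- data.frame(wt = 3, hp = 110)
plus_one <- data.frame(wt = 4, hp = 110)

# The difference in predictions equals the wt coefficient
delta <- unname(predict(m, plus_one) - predict(m, base_car))
all.equal(delta, unname(coef(m)["wt"]))
```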
Residual analysis indicates that the residuals do not look normally distributed and that they show some pattern (they are not randomly spread around the horizontal line). That means some useful information is still hidden in the residuals, and this model can be improved.
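The visual checks can be complemented with simple numeric ones. The sketch below, again on an illustrative mtcars model rather than the housing model, runs a Shapiro-Wilk normality test on the residuals and computes a rough indicator of whether their spread changes with the fitted values.

```r
# Illustrative model on built-in data
m <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(m)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
print(sw$p.value)

# Correlation between |residuals| and fitted values: a value far from 0
# hints that the residual spread depends on the fitted value
spread_cor <- cor(abs(res), fitted(m))
print(spread_cor)
```

shapiro.test accepts between 3 and 5000 observations, so it is applicable to a training set of this size.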
Making Predictions
Applying the same transformations to the test set that were applied to the training set.
# imputing missing values using MICE method
mice_imputes = mice(test, method = "rf", print = FALSE)
test = complete(mice_imputes)
# making predictions
pred_price <- predict(fit, test)
# saving predictions in csv file
kaggle_subm <- data.frame(Id=test$Id, SalePrice=pred_price)
write.csv(kaggle_subm, file = "SalePrice_pred.csv", row.names=FALSE)
The model was submitted to the Kaggle competition board.
Kaggle.com user name and score: Olga #3, score: 0.50061
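To estimate a comparable score locally before submitting, one option is to compute the error on held-out training rows. The sketch below assumes the competition scores submissions by the root mean squared error between the logarithms of predicted and actual prices; the function name rmse_log and the example values are hypothetical.

```r
# RMSE on the log scale (assumed competition metric)
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Hypothetical held-out prices and predictions
actual <- c(200000, 150000, 320000)
predicted <- c(210000, 140000, 300000)
rmse_log(actual, predicted)

# log() of a non-positive prediction is NaN or -Inf, so check before scoring
any(predicted <= 0)
```

A linear model can produce non-positive predictions for unusual inputs, which is why the last check is worth running on pred_price as well.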