Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ=σ=(N+1)/2.

set.seed(123)

n = 10000
X = runif(n,1,6)

Y = rnorm(n,(6+1)/2,(6+1)/2)

x <- median(X)
y <- quantile(Y)[['25%']]

x
## [1] 3.472838
y
## [1] 1.171246

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

  a. P(X>x | X>y)

The probability that random variable X is greater than the median of X given that random variable X is greater than the 1st quartile of Y.

sum(X>x & X>y)/sum(X>y)
## [1] 0.5186184
  b. P(X>x, Y>y)

The probability that random variable X is greater than the median of X and that random variable Y is greater than the first quartile of Y.

sum(X>x & Y>y)/length(X)
## [1] 0.3756
  c. P(X<x | X>y)

The probability that a random variable X is less than the median of X given that random variable X is greater than the first quartile of Y.

sum(X<x & X>y)/sum(X>y)
## [1] 0.4813816
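As a rough sanity check on parts a and c, the simulated conditional probabilities can be compared with the value implied by the uniform CDF. The sketch below re-creates X and Y with the same seed and parameters as above (N = 6) and treats x and y as fixed cutoffs. The names `empirical_a`, `theoretical_a` and `empirical_c` are introduced here for illustration only.

```r
# Re-create the simulated variables (same seed and parameters as above)
set.seed(123)
n <- 10000
X <- runif(n, 1, 6)
Y <- rnorm(n, (6 + 1) / 2, (6 + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# Empirical P(X > x | X > y), computed as in part a
empirical_a <- sum(X > x & X > y) / sum(X > y)

# Value implied by X ~ Uniform(1, 6), treating x and y as fixed cutoffs:
# P(X > max(x, y)) / P(X > y)
theoretical_a <- punif(max(x, y), 1, 6, lower.tail = FALSE) /
  punif(y, 1, 6, lower.tail = FALSE)

# Parts a and c condition on the same event, so they should sum to 1
empirical_c <- sum(X < x & X > y) / sum(X > y)

c(empirical_a = empirical_a, theoretical_a = theoretical_a,
  a_plus_c = empirical_a + empirical_c)
```

The small gap between the empirical and CDF-based values is sampling error. Parts a and c are complements because both condition on X > y.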
  1. Investigate whether P(X>x and Y>y) = P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
# finding number of observations for each joint probability. 
obs_1 = sum(X > x & Y > y)
obs_2 = sum(X < x & Y > y)
obs_3 = sum(X > x & Y < y)
obs_4 = sum(X < x & Y < y)

# calculating marginal probabilities and joint probabilities into one table
mtrx = (matrix(c(obs_1,obs_2,obs_1+obs_2,obs_3,obs_4,obs_3+obs_4,obs_1+obs_3,obs_2+obs_4,obs_1+obs_2+obs_3+obs_4),byrow = F,nrow=3))/10000
rownames(mtrx) = c("X > x","X < x","total")
colnames(mtrx) = c("Y > y ","Y < y", "total")
mtrx
##       Y > y   Y < y total
## X > x 0.3756 0.1244   0.5
## X < x 0.3744 0.1256   0.5
## total 0.7500 0.2500   1.0

The joint probability P(X>x and Y>y) is 0.3756.

The product of the marginal probabilities P(X>x)P(Y>y) is 0.5∗0.75=0.375.

The two values differ by only 0.0006, which is consistent with independence: P(X>x and Y>y) = P(X>x)P(Y>y) up to sampling error.
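The same comparison can be made for all four cells at once. The sketch below hard-codes the four joint counts reported in the table above and compares each observed joint probability with the product of its marginals. The names `counts`, `joint`, `expected` and `max_gap` are illustrative.

```r
# Counts taken from the four joint cells reported above
counts <- matrix(c(3756, 3744, 1244, 1256), nrow = 2,
                 dimnames = list(c("X > x", "X < x"), c("Y > y", "Y < y")))

joint <- prop.table(counts)                        # observed joint probabilities
expected <- outer(rowSums(joint), colSums(joint))  # product of the marginals

# Difference between observed and "independent" probabilities, cell by cell
round(joint - expected, 4)

max_gap <- max(abs(joint - expected))
max_gap
```

Every cell deviates from the product of its marginals by the same small amount, which is what one would expect from independent samples of this size.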

  1. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
mtrx_2 = matrix(c(obs_1,obs_2,obs_3,obs_4),byrow = F,nrow = 2)

rownames(mtrx_2) = c("X > x","X < x")
colnames(mtrx_2) = c("Y > y","Y < y")

mtrx_2
##       Y > y Y < y
## X > x  3756  1244
## X < x  3744  1256
#  checking independence using Fisher’s Exact Test
fisher.test(mtrx_2)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mtrx_2
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9242273 1.1100187
## sample estimates:
## odds ratio 
##   1.012883
#  checking independence using Chi Square Test
chisq.test(mtrx_2)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mtrx_2
## X-squared = 0.064533, df = 1, p-value = 0.7995

The p-values for both tests are approximately 0.7995, well above the 0.05 significance level. We cannot reject the null hypothesis: there is not enough evidence to conclude that the events are dependent.

Fisher’s Exact Test computes an exact p-value and is the appropriate choice when sample sizes are small or cell counts are highly unequal. The Chi-squared test is a large-sample approximation, so it can give erroneous results when a cell holds few observations. Here every cell holds more than 1,000 observations, so the approximation is reliable, the two tests agree, and the Chi-squared test is adequate.
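One common way to judge whether the Chi-squared approximation is reasonable is to inspect the expected cell counts under independence (a frequently quoted rule of thumb asks for at least 5 in each cell). The sketch below re-enters the counts from the table above and prints the expected counts alongside both p-values. The object names are illustrative.

```r
# Counts from the 2x2 table above
counts <- matrix(c(3756, 3744, 1244, 1256), nrow = 2,
                 dimnames = list(c("X > x", "X < x"), c("Y > y", "Y < y")))

chi <- chisq.test(counts)   # Yates' continuity correction is the default for 2x2 tables
fis <- fisher.test(counts)

chi$expected                # expected counts under independence
min_expected <- min(chi$expected)

c(chisq_p = chi$p.value, fisher_p = fis$p.value)
```

With expected counts in the thousands, the approximation is not a concern for this table.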

Problem 2

train = read.csv("https://raw.githubusercontent.com/olgashiligin/data_605/master/train.csv",stringsAsFactors = F)
test = read.csv("https://raw.githubusercontent.com/olgashiligin/data_605/master/test.csv",stringsAsFactors = F)

Descriptive and Inferential Statistics

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Dependent Variable: SalePrice

Independent Variables: LotArea, OverallQual

Descriptive Statistics

#  distributions of the numeric variables are presented below 
library(ggplot2)
library(dplyr)
library(tidyr)
train%>%dplyr::select_if(is.numeric) %>%                     
  gather() %>%                             
  ggplot(aes(value)) +                
    facet_wrap(~ key, scales = "free") +   
    geom_density() 
## Warning: Removed 348 rows containing non-finite values (stat_density).

#  the statistical summary is presented below
summary(train)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig        
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   LandSlope         Neighborhood        Condition1       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   Condition2          BldgType          HouseStyle         OverallQual    
##  Length:1460        Length:1460        Length:1460        Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 6.099  
##                                                           3rd Qu.: 7.000  
##                                                           Max.   :10.000  
##                                                                           
##   OverallCond      YearBuilt     YearRemodAdd   RoofStyle        
##  Min.   :1.000   Min.   :1872   Min.   :1950   Length:1460       
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Class :character  
##  Median :5.000   Median :1973   Median :1994   Mode  :character  
##  Mean   :5.575   Mean   :1971   Mean   :1985                     
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004                     
##  Max.   :9.000   Max.   :2010   Max.   :2010                     
##                                                                  
##    RoofMatl         Exterior1st        Exterior2nd       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   MasVnrType          MasVnrArea      ExterQual          ExterCond        
##  Length:1460        Min.   :   0.0   Length:1460        Length:1460       
##  Class :character   1st Qu.:   0.0   Class :character   Class :character  
##  Mode  :character   Median :   0.0   Mode  :character   Mode  :character  
##                     Mean   : 103.7                                        
##                     3rd Qu.: 166.0                                        
##                     Max.   :1600.0                                        
##                     NA's   :8                                             
##   Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     MiscVal             MoSold           YrSold       SaleType        
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   Length:1460       
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   Class :character  
##  Median :    0.00   Median : 6.000   Median :2008   Mode  :character  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008                     
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009                     
##  Max.   :15500.00   Max.   :12.000   Max.   :2010                     
##                                                                       
##  SaleCondition        SalePrice     
##  Length:1460        Min.   : 34900  
##  Class :character   1st Qu.:129975  
##  Mode  :character   Median :163000  
##                     Mean   :180921  
##                     3rd Qu.:214000  
##                     Max.   :755000  
## 
#  summary of LotArea
summary(train$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
#  histogram of LotArea
hist(train$LotArea, xlab = "Lot Area", main = "House Lot Area")

#  summary of the OverallQual
summary(train$OverallQual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   6.099   7.000  10.000
#  histogram of OverallQual
hist(train$OverallQual, xlab = "Overall Quality", main = "House Overall Quality")

Scatterplots

# OverallQual vs. Sale Price (the higher house quality the higher the price)
ggplot(train, aes(x = OverallQual, y = SalePrice)) + 
  geom_point() +
  labs(title = "Sales Price vs OverallQual")

# Lot Area vs. Sale Price (SalePrice is positively correlated with LotArea; LotArea is log-transformed)
ggplot(train, aes(x = log(LotArea), y = SalePrice)) + 
  geom_point() +
  labs(title = "Sales Price vs log(LotArea)")

Correlation Matrix

cor_mtrx = train %>% dplyr::select(LotArea, OverallQual, SalePrice) %>% cor() %>% 
    as.matrix()
cor_mtrx
##               LotArea OverallQual SalePrice
## LotArea     1.0000000   0.1058057 0.2638434
## OverallQual 0.1058057   1.0000000 0.7909816
## SalePrice   0.2638434   0.7909816 1.0000000

The correlation between OverallQual and SalePrice (0.7909816) is much higher than that between LotArea and SalePrice (0.2638434), as the scatter plots suggested.

Test Hypotheses

cor.test(train$LotArea, train$SalePrice, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434

Because the p-value (< 2.2e-16) is far below 0.05, we reject the null hypothesis in favor of the alternative and conclude that the correlation between SalePrice and LotArea is not equal to 0.

With 80% confidence, the correlation between SalePrice and LotArea lies between 0.2323391 and 0.2947946.
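The 80% interval reported for LotArea and SalePrice can be reproduced by hand with the Fisher z-transform, which is the standard construction for a Pearson correlation interval. The sketch below uses only the numbers printed in the output above (r = 0.2638434, and n = 1460 from df = 1458). The variable names are illustrative.

```r
r <- 0.2638434     # sample correlation reported by cor.test above
n_obs <- 1460      # df + 2, with df = 1458 from the output above
conf_level <- 0.80

z <- atanh(r)                            # Fisher z-transform of r
se <- 1 / sqrt(n_obs - 3)                # approximate standard error of z
crit <- qnorm(1 - (1 - conf_level) / 2)  # two-sided normal critical value

ci <- tanh(z + c(-1, 1) * crit * se)     # back-transform to the correlation scale
ci
```

An 80% level gives a narrower interval than the usual 95% because the critical value is about 1.28 rather than 1.96.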

cor.test(train$OverallQual, train$SalePrice, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816

Because the p-value (< 2.2e-16) is far below 0.05, we reject the null hypothesis in favor of the alternative and conclude that the correlation between SalePrice and OverallQual is not equal to 0.

With 80% confidence, the correlation between SalePrice and OverallQual lies between 0.7780752 and 0.8032204.
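On the familywise error question, a quick calculation shows how the error rate grows when several tests are run at the 80% level. The sketch below assumes three pairwise tests (one per pair among three variables) and, as a simplification, treats them as independent. The Bonferroni adjustment shown is one common correction, not the only one.

```r
alpha <- 1 - 0.80   # per-test error rate implied by an 80% confidence level
m <- choose(3, 2)   # number of pairwise tests among three variables

# Chance of at least one false rejection if the m tests were independent
fwer <- 1 - (1 - alpha)^m

# Bonferroni: split alpha evenly across the m tests
alpha_bonf <- alpha / m
conf_bonf <- 1 - alpha_bonf

c(fwer = fwer, alpha_bonf = alpha_bonf, conf_bonf = conf_bonf)
```

Under these assumptions the familywise error rate is close to 49%, so some concern is warranted at this confidence level. For the two tests reported above, the p-values are below 2.2e-16, so a Bonferroni adjustment would not change either conclusion.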

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Precision matrix

#  calculating precision matrix
pres_mtrx = solve(cor_mtrx)
pres_mtrx
##                LotArea OverallQual  SalePrice
## LotArea      1.1085153   0.3046752 -0.5334669
## OverallQual  0.3046752   2.7550503 -2.2595806
## SalePrice   -0.5334669  -2.2595806  2.9280384
# multiplying the correlation matrix by the precision matrix
mult3 = round(cor_mtrx %*% pres_mtrx)
mult3
##             LotArea OverallQual SalePrice
## LotArea           1           0         0
## OverallQual       0           1         0
## SalePrice         0           0         1
# multiplying the precision matrix by the correlation matrix
mult4 = round(pres_mtrx %*% cor_mtrx)
mult4
##             LotArea OverallQual SalePrice
## LotArea           1           0         0
## OverallQual       0           1         0
## SalePrice         0           0         1
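The claim that the diagonal of the precision matrix holds variance inflation factors can be checked directly. The sketch below re-enters the correlation matrix as printed above (rounded to 7 decimals) and, for each variable, computes the R-squared of a regression on the other two from the correlations alone, then compares 1/(1 - R²) with the diagonal of the inverse. It also repeats the identity check without `round()`. The names `R`, `P`, `r_squared` and `vif` are illustrative.

```r
# Correlation matrix as printed above (values rounded to 7 decimals)
vars <- c("LotArea", "OverallQual", "SalePrice")
R <- matrix(c(1.0000000, 0.1058057, 0.2638434,
              0.1058057, 1.0000000, 0.7909816,
              0.2638434, 0.7909816, 1.0000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(R)   # precision matrix

# R-squared of each variable regressed on the other two, from correlations alone
r_squared <- sapply(seq_along(vars), function(j) {
  r_j <- R[-j, j]
  drop(t(r_j) %*% solve(R[-j, -j]) %*% r_j)
})
vif <- 1 / (1 - r_squared)

rbind(precision_diag = diag(P), vif = vif)

# Check, without rounding, that P is the inverse of R up to floating-point error
identity_ok <- isTRUE(all.equal(unname(R %*% P), diag(3)))
identity_ok
```

Using `all.equal()` rather than `round()` makes the tolerance explicit instead of hiding small numerical errors.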
# conducting LU decomposition on the matrix
library(matrixcalc)
mtrx_decomp = lu.decomposition(cor_mtrx)
mtrx_decomp
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.1058057 1.0000000    0
## [3,] 0.2638434 0.7717046    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.1058057 0.2638434
## [2,]    0 0.9888051 0.7630655
## [3,]    0 0.0000000 0.3415256
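To see where the L and U factors come from, the decomposition can be reproduced with a few lines of base R. The function `lu_doolittle` below is a hand-written sketch of Doolittle elimination without pivoting (unit diagonal in L). It is not the implementation used by `matrixcalc`, although the unit diagonal in the output above suggests the same convention. The matrix is re-entered from the printed correlation values.

```r
# Doolittle LU decomposition without pivoting (unit diagonal in L)
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

# Correlation matrix as printed above (values rounded to 7 decimals)
R <- matrix(c(1.0000000, 0.1058057, 0.2638434,
              0.1058057, 1.0000000, 0.7909816,
              0.2638434, 0.7909816, 1.0000000),
            nrow = 3, byrow = TRUE)

lu <- lu_doolittle(R)
lu

# L %*% U should reproduce R, and det(R) equals the product of U's diagonal
reconstruction_ok <- isTRUE(all.equal(lu$L %*% lu$U, R))
det_from_U <- prod(diag(lu$U))
c(reconstruction_ok = reconstruction_ok, det_from_U = det_from_U)
```

Skipping pivoting is safe here because the leading pivots of a positive-definite correlation matrix are all positive. A general-purpose routine would pivot.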

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Selected right-skewed variable - GrLivArea: Above grade (ground) living area square feet

library(MASS)
# selecting a variable that is skewed to the right
hist(train$GrLivArea)

#  checking that min value is absolutely above zero
check_min = min(train$GrLivArea)
check_min
## [1] 334
# fitting an exponential distribution by maximum likelihood
exp_prob = fitdistr(train$GrLivArea, "exponential")
exp_prob
##        rate    
##   6.598640e-04 
##  (1.726943e-05)
# finding the optimal value of λ for this distribution
lambda = exp_prob$estimate
lambda
##        rate 
## 0.000659864
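For an exponential distribution, the maximum-likelihood estimate of the rate has a closed form: it is the reciprocal of the sample mean. The fitted rate above is consistent with that, since 1/0.000659864 is about 1515, the mean of GrLivArea shown in the summary. The sketch below illustrates the identity on a simulated sample (a hypothetical stand-in, not the housing data) by maximising the log-likelihood numerically and comparing the result with 1/mean.

```r
set.seed(42)
sample_data <- rexp(500, rate = 0.002)   # hypothetical stand-in, not the housing data

# Closed-form maximum-likelihood estimate of the rate
closed_form <- 1 / mean(sample_data)

# Numerical maximisation of the exponential log-likelihood
log_lik <- function(rate) sum(dexp(sample_data, rate = rate, log = TRUE))
numeric_fit <- optimize(log_lik, interval = c(1e-6, 1), maximum = TRUE,
                        tol = 1e-10)$maximum

c(closed_form = closed_form, numeric_fit = numeric_fit)

# Mean implied by the rate reported above
implied_mean <- 1 / 0.000659864
implied_mean
```

The tight `tol` matters because the default tolerance of `optimize()` is coarse relative to a rate this small.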
# taking 1000 samples from this exponential distribution using this value 
opt_value = rexp(1000, lambda)
opt_value
##    [1]  3767.015832   847.482099  1239.371164   150.634184  2944.772065
##    [6]  1592.610633  1014.279859  1640.552748   703.447437  2659.006881
##   [11]    40.459591  2291.302840   723.334479  3829.976776   437.601477
##   [16]   695.056488   716.001343  1544.190109  3222.711849  2072.746613
##   [21]   657.157772   301.333508  1575.387718   423.226777  1677.309503
##   [26]  4250.560540   804.139088  1599.792715  1004.976879  4361.837892
##   [31]  2071.826338   555.217463   306.386857    54.975330    17.697524
##   [36]  2879.057747   106.009957   185.802878   362.270717   226.209141
##   [41]  3097.754818  1297.801020   250.061749   751.214877  3003.369132
##   [46]  3133.688914  5111.834174   617.334662  1404.153270   486.729773
##   [51]  6763.528967   765.239898  2738.495732  1464.276865   821.762070
##   [56]    76.259900  1537.576590   837.990929   615.769463  1002.010720
##   [61]  4455.379189   423.366067  1280.782854  1873.456220  2221.308933
##   [66]   394.366138   199.935060  4058.168057   523.024887  2429.079862
##   [71]   707.542599  1553.726406  2408.571170   261.728067   933.970012
##   [76]  2403.873486   771.889537   640.131397   582.646777  2160.830839
##   [81]  3354.864147  3996.194349   277.307685   828.183470   345.072957
##   [86]  3700.882748  1154.927299  1241.084495   421.215128  1883.670871
##   [91]  1354.810668  1996.594636  2858.190941   479.806317   403.375321
##   [96]  1361.392695  3299.198957  2032.189458   520.167173  1183.135274
##  [101]  5139.299692   692.939877  8197.915585  1322.664911  1706.870835
##  [106]   962.834795   870.489735   591.800720  2524.561119  1362.201873
##  [111]  2343.061198   555.666739  1098.779047   623.703780  1368.527133
##  [116]   159.995340  1513.888946   592.266385  1063.769708    26.825615
##  [121]  3469.312295  4165.651300   729.078248 11675.545641  1017.747682
##  [126]   468.492475  2732.542654  1456.016243  3763.707789  2636.277338
##  [131]  1391.887433   775.319395  1274.538420   699.559463    78.595611
##  [136]   491.801760  1238.555792  1650.170681  2978.737893  3721.881518
##  [141]  4021.826708   139.167299  4682.507196   459.554188  1327.314500
##  [146]   186.506646  3359.323289  2133.321468  4832.941250  2226.724767
##  [151]   801.978958   198.354778   648.030552    48.366529   215.085986
##  [156]   606.805090  2114.787293  4011.122647  2437.408725  1232.386583
##  [161]  1629.635286   833.008355  6629.073288  1163.137094   614.931056
##  [166]    70.382307   774.828040  2520.465213  1482.117636    51.613064
##  [171]  5115.610906   105.515574  2113.815137  1341.010653   552.676418
##  [176]   479.231934  1714.427003   767.854155  1880.614177   482.053119
##  [181]   782.564060    87.999243   358.979749    80.558965  2254.920707
##  [186]   207.209011  1704.038588  2195.947404  2345.294844  3826.087630
##  [191]  2660.791801   857.959123  2364.774522   452.005330   390.310046
##  [196]   981.001722    39.269496   254.801118   603.070545  2109.252516
##  [201]   398.372528  2248.002134  1435.740361  2423.786965   204.212506
##  [206]  3800.447563   247.709175  2389.165279  1584.175059   260.482058
##  [211]  1720.766932   689.040534  1908.567037   162.288197  1039.879770
##  [216]   564.329157  1892.149378   922.065346   507.705000  2570.690830
##  [221]  2033.939154  1357.470806    19.706968   994.028400  1580.274719
##  [226]   361.835959   555.488639  2418.787366  1333.187606   784.817275
##  [231]   177.055122  1420.412712   976.297968  1119.250594   593.997493
##  [236]   256.846422   572.189043  2509.157003  1759.744195  1899.571719
##  [241]   714.856104   454.687335  3251.379116  1487.233612  5774.598888
##  [246]  2429.169418   144.408189  1340.419446   534.056104   629.534568
##  [251]  3266.130376  1178.990012  3965.691631   975.331548   366.172514
##  [256]  2557.589351  1818.701866   614.256700  2926.320615  1278.517771
##  [261]   457.606678  1038.573359   381.169083  1556.647487   119.607405
##  [266]    67.055824  1611.330964  3218.650128  1212.982730  3172.902239
##  [271]   179.045857  1994.662706   204.298230   895.203163  1228.025995
##  [276]  3470.692201   928.283220   458.507974   263.358544  4818.732346
##  [281]   915.618433  1605.716074  1268.556037  4694.918386  1255.937968
##  [286]  1047.088656   269.869310  1961.906323  1047.227160   832.629590
##  [291]     5.409209  2017.304997    30.378319   359.224033   956.179313
##  [296]  1045.040768   698.995106  1412.258812    11.704891   558.867087
##  [301]   713.304452  1069.259518  1962.708270  1973.241393   254.046417
##  [306]  1165.483073   468.133724  3643.642257   856.454316  1029.734697
##  [311]  1531.619679  1564.299540  1359.304351   570.611919    67.612835
##  [316]  1992.490478  3466.485464  4080.627249  1339.458713  1568.962017
##  [321]  4211.096280   625.842754  1712.169550  1369.161705  1395.636442
##  [326]  1531.635913  5071.979130   963.555670  3850.327274  1750.936710
##  [331]     3.887590   535.061020   643.722172   370.562741   965.860934
##  [336]    61.032985  1733.262110    31.395535  2807.067509  4321.759373
##  [341]   172.128065  1798.910557   646.963313    98.399856  2447.883808
##  [346]  2284.599024  5408.686886  1676.708802   875.851981   197.978244
##  [351]   709.796293  2610.961521   188.256577    96.579548  2150.570818
##  [356]   193.853169  1075.193703   144.756528   127.995678  1623.192770
##  [361]   675.424349  1637.123886   371.340545   126.739670  4511.789976
##  [366]   214.603326  1720.431988   335.829150  1048.493131   930.309933
##  [371]    98.275160  1132.451315  2963.612004   526.565594  1131.571630
##  [376]   726.804045   252.796969   489.861483  1297.278936  2245.089834
##  [381]   523.532600   203.103082   673.003202  1133.680804   428.919104
##  [386]  1341.747409  1160.826805   466.776504  2184.699950  1446.713782
##  [391]  2769.902360   199.048313   872.160960   569.593120   837.822715
##  [396]  1953.977866  1500.871267    61.876395   725.860445   147.703421
##  [401]  1186.666550   468.763253   133.610669   801.484874    19.255746
##  [406]   299.100112    65.109477  3843.505257   419.278319   329.476881
##  [411]  1898.547590  1596.798446  3260.414144   378.114610   974.439118
##  [416]   483.336428  4177.201492   853.126844   530.372637  1146.967631
##  [421]  2148.958927   437.858365   372.195652  1004.519225  1855.998297
##  [426]   877.480502   206.251474    27.872330  1067.924488   354.703362
##  [431]  1625.337580  3592.223643  4452.376643   471.792790  1862.050842
##  [436]   469.717801   158.666295   712.218767   973.300572   178.905627
##  [441]  2679.956974   169.824413   715.835910  1402.218275   519.853443
##  [446]  1969.972449  1498.965491  2081.130779  5607.112045   763.435955
##  [451]   859.006583  2906.238346   752.799309  4571.077539  1127.669777
##  [456]   337.221007  1030.685671  1423.217997   911.797605  4264.357759
##  [461]   577.547458  4862.951789  2241.414668  3427.926349  2866.162612
##  [466]  2169.920388   507.686257  3922.579149   567.264520  1644.253472
##  [471]  1461.830607   403.595767   224.322289   298.184839  1492.262790
##  [476]   431.875522  1455.021060   110.267918  2233.452926  2008.814525
##  [481]    59.563723   398.356196    55.141194   300.469522  2126.091156
##  [486]   111.214900  3014.806083  7158.364459   888.120790  2594.971968
##  [491]  1109.076156  1259.291192  1895.277801  1513.428757  2147.241469
##  [496]  3759.105920  1531.888448   128.667301  1350.493514   548.372172
##  [501]  1311.274768  3103.142243  3222.513510  5977.285393   298.905376
##  [506]   380.162252    69.359525   792.371973  4870.667890   220.362160
##  [511]   758.795941  3028.707605    83.586537   903.129629   825.837467
##  [516]   153.363456  1400.918820  3513.473267   181.859620   279.894907
##  [521]   444.712984   453.574137  2176.526164   633.944729  1840.848565
##  [526]  4583.412461   842.641604   158.042683   191.526022   880.433115
##  [531]  2196.789719  1403.932126   280.961527   229.467639  4223.098443
##  [536]    21.697092  1072.020054   176.630536  4057.923114  1692.110020
##  [541]  3317.663055   866.191064  1073.506525   216.855608   819.229707
##  [546]   152.047998  3067.076445  3367.074241  3596.751942  1947.828394
##  [551]   100.146546  1399.227118   660.790718   478.072051  2875.856794
##  [556]   594.114543  2760.759111   849.314311  2745.770128   643.631003
##  [561]  1650.512014   774.400040   238.272534  2737.494385   447.493269
##  [566]    21.614621  1871.372096   354.557530   893.891546   404.878010
##  [571]  1079.522991  1908.113385  1672.247844  2229.522347  1890.025424
##  [576]  2682.497333   116.119538  1092.840311  2501.645620   205.100002
##  [581]   317.613305  2174.866995  3565.275910  1018.103059    38.509314
##  [586]   507.985359   908.858773  1888.829435  4465.694652   829.352365
##  [591]  4441.978012   202.032773  2428.831754   317.652071  1013.880063
##  [596]   343.426907   977.017147  1437.007192   580.961155   823.389433
##  [601]    73.735119  2779.237625  1775.854394   493.647906  2969.906637
##  [606]  1134.520110  1295.823657   306.589837  1019.516252   837.639284
##  [611]  1058.765338   470.046126   192.276938   495.219951  1900.868079
##  [616]   547.239817   100.568747  2031.733702   414.352864   537.877502
##  [621]  1862.062868   631.982593    60.938451  7200.618271   438.277392
##  [626]   265.060638  1625.227838   620.258054   211.909512    63.149191
##  [631]  3296.490893  1761.956769  1079.148161  1520.067928  1987.213296
##  [636]  1087.937817  1544.860099  1265.154300  5181.485340  2829.460197
##  [641]   591.937261  1748.097029  1919.909139  5339.894396  1209.898620
##  [646]  2525.851653   738.760525  1588.680822  2429.664305   664.180018
##  [651]   111.890325   839.984095  1052.147995   875.482279   208.003625
##  [656]  1401.214974   757.854215   172.246791  4432.148095  3073.042191
##  [661]  1417.294458  1753.269229  1882.150496  1927.866923   974.107993
##  [666]  1518.202841  1095.498266  1217.519915  1852.803804  3193.512957
##  [671]  3824.251163  3635.633043  1848.754083  1294.496958  1864.289333
##  [676]   836.628902  1251.633630  5433.614464  3554.683681  1188.460522
##  [681]  1256.720376   840.518902   521.507878   581.353500   253.779328
##  [686]   591.638852  2370.490368   632.528008  6296.962415  2257.650465
##  [691]  2322.822181   781.994079   558.029984    10.949457  5313.820504
##  [696]  2703.433126  4116.374896  1331.573004  2213.957376  2715.860836
##  [701]  2376.147814  2318.422218  5748.400645  4225.524145  1028.991140
##  [706]   387.426591  1064.326751  2742.691202   146.965741  1019.785566
##  [711]   586.084292  1088.796686  1214.275537   105.556570  2888.465112
##  [716]  1109.277102    44.290126  2166.211521  2283.123522  2345.681295
##  [721]   668.557102  1730.238062  1377.869809  1448.459848  3562.603402
##  [726]   409.725462    81.499877   629.046772   899.917704   445.212987
##  [731]   155.765193  1818.280554  2132.522259   853.731012    66.947247
##  [736]   211.351750  2064.349252  4561.485540  1938.142501  3085.219400
##  [741]   958.982277  1886.266963  3863.105010   228.991791  1251.251711
##  [746]   162.131594   154.172329  1922.090578   255.093064   999.710404
##  [751]  3697.525168   138.275961  1982.330426  1127.766010  1553.272588
##  [756]  3060.154572   727.710095   133.939274   536.784273  1845.343273
##  [761]  1675.023569  4759.288502   254.579239  7917.238557   128.834440
##  [766]  1386.430588  1502.053880  1224.825001  2303.578987  2356.808628
##  [771]   231.292918  2581.882827  1014.942122   654.194078  1015.788083
##  [776]  1464.882566   166.531230  2633.672516  3634.469284  1429.692417
##  [781]   282.514396   297.846954  1745.893540   819.226470  1444.482560
##  [786]   588.432483  1139.447985  4882.047644   668.375585  3013.792095
##  [791]  1174.571566  2883.494432   826.119888   279.541727    21.650728
##  [796]  2410.507145  4388.946961  1055.445573  2416.576191   839.190190
##  [801]    17.845965  1367.381932  1210.752960  1692.523046    31.472629
##  [806]  2187.890271  4030.062047   467.382318  1684.916459  1788.736222
##  [811]   610.648057   477.877692   668.271283    92.528237  5731.797910
##  [816]  2247.775337  4091.067206   666.392853   549.284859   349.006014
##  [821]  5641.696546  2544.128241  2722.688134   114.848358   162.252149
##  [826]  5767.811140   100.315760   802.643845  2505.613477  2303.357501
##  [831]   305.318307  1002.191767  2055.738615  1350.453207   600.457850
##  [836]  1714.393900  1675.035018   365.166089  2055.954032   329.798059
##  [841]  9173.638979  1758.826283  1922.934389  1880.093930   616.680208
##  [846]   107.246237   731.216418  1834.499678  1789.333008  2096.354529
##  [851]   268.156816   211.612888   283.878122   395.517494    47.392194
##  [856]  1757.359318   716.088840   638.505813  7122.658392  1836.413122
##  [861]    38.965622   169.076177   108.909315  4389.367543   103.545593
##  [866]  2361.023220  2068.244798  1293.318379   355.502968   480.434010
##  [871]  2020.795935   300.458363  1988.780919  3060.263780  2263.235361
##  [876]   188.627175  3018.141418  1428.382189   466.994813  4999.067855
##  [881]   291.971797    92.936363  1100.714256  1405.958936   599.471633
##  [886]   892.369000   114.472739  1241.199591    51.562100  2265.739579
##  [891]    17.699669  1399.423114   438.002247  1583.634922  1096.834706
##  [896]   592.657308  2180.452601   865.324680  1748.795666  3051.723249
##  [901]  2040.825338  2576.907368   390.012873   387.406435   595.240793
##  [906]   102.120711  1792.803073   111.934423    34.332204  1063.732961
##  [911]  3921.940830   902.342046  4466.355751   623.657576   682.938910
##  [916]   228.190288  3834.362293  1309.406922  2637.984600  2281.568580
##  [921]   806.653214  1285.089905  1912.472268  1650.533966   683.646362
##  [926]   933.709007  2557.206959  2402.655229  2167.437677  2638.548685
##  [931]  3866.657101  2620.562156   568.584289   243.591711  3177.231609
##  [936]  3136.754184    79.015299   283.469281  1058.500246  1901.623099
##  [941]   306.516859    32.389869  1990.686562  4913.313672  1646.433033
##  [946]   120.584349  3303.612324   438.244132   662.657958   744.691158
##  [951]   556.778066   304.360878  1377.204271   488.150552  3538.887940
##  [956]  1465.968104   139.217057   298.984563  1260.663215   973.893364
##  [961]  2390.554938   718.153541    40.319247  2840.474074    75.503877
##  [966]   224.102710   482.474104   351.612039   468.738973  1241.167115
##  [971]  1505.558156  3572.493081   843.382103  1209.556399  1785.730750
##  [976]   377.689692   488.340869  1522.672164   554.473458   304.633569
##  [981]  1855.515143  1390.768351  1320.557218   609.973592   261.031694
##  [986]  1458.705977  2424.824135  1210.715515   228.098464  1659.284115
##  [991]  3393.420744   432.159384    88.910685  4164.001243  3485.012889
##  [996]  2066.498642  2080.767779  2509.380244   924.972985  1065.807018
# comparing data distribution before and after transformation
par(mfrow = c(1, 2))
hist(opt_value, breaks = 30, main = "Exponential: GrLivArea",  col = "grey")
hist(train$GrLivArea, breaks = 30, main = "Original: GrLivArea", col = "pink")

# finding the 5th and 95th percentiles of the fitted exponential distribution using its quantile function (the inverse CDF)
qexp(0.05, rate = lambda)
## [1] 77.73313
qexp(0.95, rate = lambda)
## [1] 4539.924
# generating a 95% confidence interval from the empirical data (assuming normality)
library("Rmisc")
CI(na.exclude(opt_value), ci = 0.95)
##    upper     mean    lower 
## 1584.893 1497.636 1410.379
# providing the empirical 5th percentile and 95th percentile of the data
quantile(opt_value, 0.05)
##       5% 
## 87.77861
quantile(opt_value, 0.95)
##      95% 
## 4267.228

The 5th and 95th percentiles from the fitted exponential distribution (77.73 and 4539.92) are close to the empirical 5th and 95th percentiles of the data (87.78 and 4267.23). Assuming normality, the 95% confidence interval for the mean of the transformed GrLivArea runs from 1410.379 to 1584.893.
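The comparison above can be reproduced on simulated data. The following is a minimal base-R sketch, not the code used in this report: `sim_rate` and `sim` are hypothetical names, and the rate is an arbitrary assumption. The interval is a t-based interval for the mean computed by hand, which is the kind of interval a normality-based CI reports.

```r
set.seed(42)

# Hypothetical rate: an exponential distribution with mean 1500
sim_rate <- 1 / 1500
sim <- rexp(1000, rate = sim_rate)

# Theoretical percentiles from the quantile function (inverse CDF)
theo <- qexp(c(0.05, 0.95), rate = sim_rate)

# Empirical percentiles of the simulated sample
emp <- quantile(sim, c(0.05, 0.95))

# t-based 95% confidence interval for the mean of the sample
n <- length(sim)
half_width <- qt(0.975, df = n - 1) * sd(sim) / sqrt(n)
ci_mean <- c(lower = mean(sim) - half_width, upper = mean(sim) + half_width)

rbind(theoretical = theo, empirical = emp)
ci_mean
```

Note that the confidence interval describes the uncertainty of the mean, while the percentiles describe the spread of individual values, so the two are not expected to be similar.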

Model Building

# percentage of missing values (NAs) in each column
colMeans(is.na(train)) * 100
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##    0.00000000    0.00000000    0.00000000   17.73972603    0.00000000 
##        Street         Alley      LotShape   LandContour     Utilities 
##    0.00000000   93.76712329    0.00000000    0.00000000    0.00000000 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##    0.54794521    0.54794521    0.00000000    0.00000000    0.00000000 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##    2.53424658    2.53424658    2.60273973    2.53424658    0.00000000 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##    2.60273973    0.00000000    0.00000000    0.00000000    0.00000000 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##    0.00000000    0.00000000    0.06849315    0.00000000    0.00000000 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##    0.00000000    0.00000000   47.26027397    5.54794521    5.54794521 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##    5.54794521    0.00000000    0.00000000    5.54794521    5.54794521 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##    0.00000000    0.00000000   99.52054795   80.75342466   96.30136986 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##     SalePrice 
##    0.00000000

Imputing NAs with the MICE method (Multivariate Imputation by Chained Equations). The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model, so the MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. Here each variable is imputed with the random forest method (method = "rf").

library(mice)
# printFlag = FALSE suppresses the iteration log
mice_imputes = mice(train, method = "rf", printFlag = FALSE)
train = complete(mice_imputes)
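To make the chained-equations idea concrete, here is a toy base-R sketch on the built-in `airquality` data. It is an illustration under simplifying assumptions and not the mice algorithm itself: it uses linear regression instead of random forests, fills in deterministic predictions rather than random draws, and produces a single completed data set. The names `dat`, `miss` and `incomplete` are hypothetical.

```r
# Toy illustration of imputing each incomplete variable from the others
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
miss <- is.na(dat)                            # remember where the NAs were
incomplete <- names(dat)[colSums(miss) > 0]   # variables that need imputing

# Step 1: start from a simple fill, here the column mean
for (v in incomplete) {
  dat[miss[, v], v] <- mean(dat[[v]], na.rm = TRUE)
}

# Step 2: cycle through the incomplete variables, re-imputing each one
# from a model fitted on the rows where it was originally observed
for (iter in 1:5) {
  for (v in incomplete) {
    f <- reformulate(setdiff(names(dat), v), response = v)
    m <- lm(f, data = dat[!miss[, v], ])
    dat[miss[, v], v] <- predict(m, newdata = dat[miss[, v], ])
  }
}

colSums(is.na(dat))
```

Each variable gets its own model, which is why the approach can mix model types across continuous and categorical columns.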

Selecting the 20 numeric variables that correlate most strongly with the response variable. The list includes SalePrice itself, so it yields 19 candidate predictors.

library(dplyr)
library(tidyverse)
library(plyr)
data_cor <- cor(train%>%dplyr::select_if(is.numeric), use="complete.obs")
top_data_cor <- data_cor %>%  as.data.frame() %>% dplyr::select(SalePrice) %>% 
  rownames_to_column() %>% 
  arrange(desc(SalePrice))
top_data_cor %>%
  top_n(20, SalePrice)
##         rowname SalePrice
## 1     SalePrice 1.0000000
## 2   OverallQual 0.7909816
## 3     GrLivArea 0.7086245
## 4    GarageCars 0.6404092
## 5    GarageArea 0.6234314
## 6   TotalBsmtSF 0.6135806
## 7     X1stFlrSF 0.6058522
## 8      FullBath 0.5606638
## 9  TotRmsAbvGrd 0.5337232
## 10    YearBuilt 0.5228973
## 11 YearRemodAdd 0.5071010
## 12  GarageYrBlt 0.5009348
## 13   MasVnrArea 0.4726145
## 14   Fireplaces 0.4669288
## 15   BsmtFinSF1 0.3864198
## 16  LotFrontage 0.3368489
## 17   WoodDeckSF 0.3244134
## 18    X2ndFlrSF 0.3193338
## 19  OpenPorchSF 0.3158562
## 20     HalfBath 0.2841077
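The same ranking can be sketched in base R. The example below uses the built-in `mtcars` data with `mpg` as a stand-in response, so the data and names are assumptions for illustration only. Unlike the code above, it drops the response from the list and ranks by absolute value, which also surfaces strong negative correlations.

```r
# Keep numeric columns only and take the correlations with the response
num <- mtcars[sapply(mtcars, is.numeric)]
cors <- cor(num, use = "complete.obs")[, "mpg"]

# Drop the response itself (its correlation with itself is always 1)
cors <- cors[names(cors) != "mpg"]

# Rank by absolute correlation and keep the top 5
top <- head(cors[order(abs(cors), decreasing = TRUE)], 5)
top
```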

Building a model using the predictors most highly correlated with the SalePrice response variable. HalfBath is left out and LotArea is added.

fit <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea + TotalBsmtSF + X1stFlrSF + FullBath + TotRmsAbvGrd + YearBuilt + YearRemodAdd + GarageYrBlt + MasVnrArea + Fireplaces + BsmtFinSF1 + LotFrontage + OpenPorchSF+ WoodDeckSF + X2ndFlrSF + LotArea, data = train)

summary(fit)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars + 
##     GarageArea + TotalBsmtSF + X1stFlrSF + FullBath + TotRmsAbvGrd + 
##     YearBuilt + YearRemodAdd + GarageYrBlt + MasVnrArea + Fireplaces + 
##     BsmtFinSF1 + LotFrontage + OpenPorchSF + WoodDeckSF + X2ndFlrSF + 
##     LotArea, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -520322  -17145   -1892   14332  288173 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.098e+06  1.325e+05  -8.285 2.68e-16 ***
## OverallQual   1.948e+04  1.171e+03  16.642  < 2e-16 ***
## GrLivArea     2.331e+01  2.047e+01   1.138 0.255187    
## GarageCars    1.006e+04  2.953e+03   3.407 0.000676 ***
## GarageArea    1.037e+01  1.044e+01   0.993 0.320738    
## TotalBsmtSF   1.019e+01  4.284e+00   2.379 0.017494 *  
## X1stFlrSF     2.104e+01  2.096e+01   1.004 0.315538    
## FullBath     -1.905e+03  2.615e+03  -0.728 0.466561    
## TotRmsAbvGrd  1.683e+03  1.088e+03   1.547 0.122127    
## YearBuilt     2.017e+02  6.550e+01   3.080 0.002109 ** 
## YearRemodAdd  3.698e+02  6.334e+01   5.838 6.53e-09 ***
## GarageYrBlt  -5.160e+01  7.742e+01  -0.667 0.505183    
## MasVnrArea    2.917e+01  6.117e+00   4.769 2.03e-06 ***
## Fireplaces    6.290e+03  1.798e+03   3.498 0.000484 ***
## BsmtFinSF1    1.641e+01  2.588e+00   6.342 3.03e-10 ***
## LotFrontage   2.777e+01  4.757e+01   0.584 0.559430    
## OpenPorchSF   9.888e+00  1.557e+01   0.635 0.525587    
## WoodDeckSF    2.890e+01  8.133e+00   3.554 0.000392 ***
## X2ndFlrSF     1.372e+01  2.061e+01   0.666 0.505741    
## LotArea       4.703e-01  1.057e-01   4.449 9.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36200 on 1440 degrees of freedom
## Multiple R-squared:  0.795,  Adjusted R-squared:  0.7923 
## F-statistic:   294 on 19 and 1440 DF,  p-value: < 2.2e-16

Checking residuals

plot(fit$fitted.values, fit$residuals,
     xlab="Fitted Values", ylab="Residuals", main="Fitted Values vs. Residuals")
abline(h=0)

qqnorm(fit$residuals)
qqline(fit$residuals)

R^2 = 0.795, which means that 79.5% of the variance in SalePrice is explained by this model.

The p-value of the overall F-test (< 2.2e-16) is far below 0.05, so we can reject the null hypothesis that all coefficients are zero. Likewise, an individual predictor that has a low p-value is likely to be a meaningful addition to our model, because changes in the predictor's value are related to changes in the response variable.

The strongest predictors according to this model (p-value below 0.01) are the following: OverallQual, GarageCars, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, Fireplaces, WoodDeckSF, LotArea. TotalBsmtSF is also significant at the 0.05 level.

Taking the most important predictor according to this model, OverallQual, we can say that a one-unit increase in OverallQual is associated with an increase of about 19,480 in SalePrice (coefficient 1.948e+04), holding the other predictors constant.
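A coefficient and its uncertainty can be read directly from a fitted model. The sketch below uses the built-in `mtcars` data as a stand-in, so the model and variable names are assumptions for illustration and not the model from this report.

```r
# Stand-in model: fuel economy explained by weight and horsepower
m <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated change in the response for a one-unit increase in wt,
# holding hp constant
est <- coef(m)["wt"]

# 95% confidence interval for that coefficient
ci <- confint(m, "wt", level = 0.95)

est
ci
```

The estimate is expressed in units of the response per unit of the predictor, which is why the scientific notation in the `summary()` table has to be expanded before interpreting it.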

Residual analysis indicates that the residuals do not look normally distributed and that they show some pattern (they are not randomly spread around the horizontal line). This means that some useful information remains in the residuals and that the model can be improved.
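One common direction for right-skewed price data is to model the logarithm of the response. The sketch below illustrates this on simulated data and is not a result for this data set: `toy`, `area` and `price` are hypothetical, and the data are generated so that the log scale is the correct one.

```r
set.seed(7)

# Simulated data: price is log-linear in area with multiplicative noise
n <- 500
toy <- data.frame(area = runif(n, 500, 3000))
toy$price <- exp(11 + 0.0004 * toy$area + rnorm(n, sd = 0.4))

# Fit on the raw scale and on the log scale
m_raw <- lm(price ~ area, data = toy)
m_log <- lm(log(price) ~ area, data = toy)

# Shapiro-Wilk test of residual normality (small p-value = non-normal)
sw_raw <- shapiro.test(residuals(m_raw))
sw_log <- shapiro.test(residuals(m_log))
c(raw = sw_raw$p.value, log = sw_log$p.value)
```

If a log-scale model were used for prediction, the predictions would need to be transformed back with `exp()` before being reported as prices.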

Making Predictions

Applying the training set transformations to the test set.

#  imputing missing values using MICE method
mice_imputes = mice(test, method = "rf", printFlag = FALSE)
test = complete(mice_imputes)
# making predictions
pred_price <- predict(fit, test)

# saving predictions in csv file
kaggle_subm <- data.frame(Id=test$Id, SalePrice=pred_price)
write.csv(kaggle_subm, file = "SalePrice_pred.csv", row.names=FALSE)

The model was submitted to the Kaggle competition board.

Kaggle.com user name and score: Olga #3, score: 0.50061
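
The leaderboard for this competition scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. A local estimate of that metric can be obtained from a hold-out split before submitting. The sketch below uses simulated data, so `toy`, `area`, `qual` and the coefficients are hypothetical and the resulting number says nothing about the model above.

```r
set.seed(1)

# Simulated stand-in for a housing data set
n <- 200
toy <- data.frame(area = runif(n, 500, 3000),
                  qual = sample(1:10, n, replace = TRUE))
toy$price <- 20000 + 60 * toy$area + 15000 * toy$qual + rnorm(n, sd = 10000)

# Hold out 50 rows for evaluation
idx <- sample(n, 150)
m <- lm(price ~ area + qual, data = toy[idx, ])
pred <- predict(m, newdata = toy[-idx, ])

# A linear model can predict non-positive prices; guard before taking logs
pred <- pmax(pred, 1)

# RMSE on the log scale
rmsle <- sqrt(mean((log(pred) - log(toy$price[-idx]))^2))
rmsle
```

Checking the predictions for NA or non-positive values before writing the submission file avoids errors in a log-based score.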