Your final is due by the end of day on 12/27/2017. You should post your solutions to your GitHub account. You are also expected to make a short presentation (3-5 minutes) during our last meeting, or to post a recording to the board. This project will demonstrate your ability to apply the elements of the class.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition on Kaggle. I want you to do the following.

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the following analysis.

The variable I chose as my quantitative variable is LotArea.

# Libraries used throughout this analysis
library(MASS)       # fitdistr
library(forecast)   # BoxCox, BoxCox.lambda
library(corrplot)   # corrplot
library(dplyr)      # select_if

loc_train = "https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/train.csv"
loc_test = "https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/test.csv"
train_df = read.csv(loc_train, stringsAsFactors = FALSE)
test_df = read.csv(loc_test, stringsAsFactors = FALSE)

X = train_df$LotArea
Y = train_df$SalePrice

Probability

Calculate, at a minimum, the probabilities a through c below. Assume the small letter "x" is estimated as the 1st quartile of the X variable, and the small letter "y" is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.

a P(X>x | Y>y)

The probability that the independent variable X (LotArea) is greater than its 1st quartile, 7553.5, given that the dependent variable Y (SalePrice) is greater than its 2nd quartile, $163,000.

x_Q1 = quantile(X, .25)
x_Q1
##    25% 
## 7553.5
y_Q2 = quantile(Y, .50)
y_Q2
##    50% 
## 163000
length(X[X > x_Q1 & Y > y_Q2]) / length(Y[Y > y_Q2])
## [1] 0.8653846

b P(X>x, Y>y)

The probability that both X (LotArea) is greater than its 1st quartile and Y (SalePrice) is greater than its 2nd quartile.

length(X[X > x_Q1 & Y > y_Q2]) / length(X)
## [1] 0.4315068

c P(X<x, Y>y)

The probability that X (LotArea) is less than its 1st quartile and Y (SalePrice) is greater than its 2nd quartile.

length(X[X < x_Q1 & Y > y_Q2]) / length(X)
## [1] 0.06712329
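These three results are consistent with the definition of conditional probability: the value in (a) should equal the joint probability in (b) divided by P(Y>y). A quick check of my own (not part of the original analysis), using mean() on the logical vectors:

# Sketch (my check): P(X>x | Y>y) = P(X>x, Y>y) / P(Y>y)
p_joint = mean(X > x_Q1 & Y > y_Q2)   # the joint probability from (b)
p_y     = mean(Y > y_Q2)              # P(Y > y)
p_joint / p_y                         # should reproduce the 0.8653846 from (a)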

Does splitting the training data in this fashion make them independent? In other words, does P(X|Y)=P(X)P(Y)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this.

PXY = length(X[X > x_Q1 & Y > y_Q2]) / length(Y[Y > y_Q2])

PX = length(X[X > x_Q1]) / length(X)

PY = length(Y[Y > y_Q2]) / length(Y)


PXY
## [1] 0.8653846
PX * PY
## [1] 0.3739726
PXY == (PX * PY)
## [1] FALSE

Since P(X|Y) = 0.865 while P(X)P(Y) = 0.374, splitting the data in this fashion does not make them independent.

Chi square test for independence

chisq.test(train_df$SalePrice, train_df$LotArea)
## Warning in chisq.test(train_df$SalePrice, train_df$LotArea): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  train_df$SalePrice and train_df$LotArea
## X-squared = 735090, df = 709660, p-value < 2.2e-16

Based on the p-value of < .05, we can reject the null hypothesis that X (LotArea) and Y (SalePrice) are independent. We can say that there is an association between Sale Price and Lot Area.
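The warning above appears because chisq.test was run directly on two continuous variables, which produces an extremely sparse contingency table. A cleaner variant (my addition, not part of the original submission) is to test the association of the quartile split itself, which yields a proper 2x2 table:

# Sketch (my addition): chi-square test of association on the binary split,
# avoiding the sparse-table warning from testing raw continuous values
above_x = X > x_Q1   # TRUE when LotArea exceeds its 1st quartile
above_y = Y > y_Q2   # TRUE when SalePrice exceeds its median
chisq.test(table(above_x, above_y))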

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations. You might have to research this.

#X = train_df$LotArea 
#Y =train_df$SalePrice
summary(train_df$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10520   11600  215200
summary(train_df$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  130000  163000  180900  214000  755000
# Lot Area is right skewed
hist(train_df$LotArea, 100)

# Sale Price is right skewed
hist(train_df$SalePrice, 25)

boxplot(train_df$LotArea, xlab = "Lot Area", horizontal = TRUE)

boxplot(train_df$SalePrice, xlab = "SalePrice", horizontal = TRUE)

Scatterplot of X Lot Area vs Y Sale Price

plot(train_df$LotArea, train_df$SalePrice, xlab = "LotArea", ylab = "SalePrice", main = "Scatterplot of X and Y")

Box-Cox transformations (reference: r-bloggers)

# Transformation of Lot Area using the Box-Cox method
LotArea_lambda = BoxCox.lambda(train_df$LotArea)
LotArea_BoxCox = BoxCox(train_df$LotArea, LotArea_lambda)
# Transformation of Sale Price using the Box-Cox method
SalePrice_lambda = BoxCox.lambda(train_df$SalePrice)
SalePrice_BoxCox = BoxCox(train_df$SalePrice, SalePrice_lambda)

par(mfrow = c(2, 2))

hist(train_df$LotArea, 100)
hist(LotArea_BoxCox, 25)
hist(train_df$SalePrice, 25)
hist(SalePrice_BoxCox, 25)

Applying the Box-Cox transformation made both histograms noticeably more symmetric and closer to normal.
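As a sanity check (my addition, assuming forecast::BoxCox implements the standard Box-Cox definition for positive data), the transformation can be reproduced by hand:

# Sketch (my check): the standard Box-Cox transform is
# (y^lambda - 1) / lambda when lambda != 0, and log(y) when lambda == 0
manual_boxcox = function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
# Should return TRUE if the definitions match
all.equal(manual_boxcox(train_df$LotArea, LotArea_lambda), LotArea_BoxCox)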

Linear Algebra and Correlation.

Using at least three untransformed variables, build a correlation matrix. Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

cor_df = train_df[, c("SalePrice", "LotArea", "OverallCond")]

cor_mx = cor(cor_df, method = "pearson")
cor_mx
##               SalePrice     LotArea OverallCond
## SalePrice    1.00000000  0.26384335 -0.07785589
## LotArea      0.26384335  1.00000000 -0.00563627
## OverallCond -0.07785589 -0.00563627  1.00000000
corrplot(cor_mx, method = "number")

cor_df_inv_pre_mx = solve(cor_mx)
cor_df_inv_pre_mx
##               SalePrice     LotArea OverallCond
## SalePrice    1.08160223 -0.28490799  0.08260329
## LotArea     -0.28490799  1.07508021 -0.01612232
## OverallCond  0.08260329 -0.01612232  1.00634028
# correlation matrix by the precision matrix
cor_mx %*% cor_df_inv_pre_mx
##                 SalePrice       LotArea   OverallCond
## SalePrice    1.000000e+00  8.023096e-18  0.000000e+00
## LotArea     -3.735076e-17  1.000000e+00 -2.602085e-18
## OverallCond  0.000000e+00 -3.469447e-18  1.000000e+00
# precision matrix by the correlation matrix
cor_df_inv_pre_mx %*% cor_mx
##                 SalePrice       LotArea  OverallCond
## SalePrice    1.000000e+00  1.805197e-17 1.387779e-17
## LotArea     -4.705437e-17  1.000000e+00 0.000000e+00
## OverallCond -1.387779e-17 -2.602085e-18 1.000000e+00
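Both products recover the 3x3 identity matrix up to floating-point rounding, as expected, since the precision matrix is by construction the inverse of the correlation matrix. The diagonal of the precision matrix should also contain the variance inflation factors, i.e. 1/(1-R^2) from regressing each variable on the other two; a quick check of my own for SalePrice:

# Sketch (my addition): the SalePrice diagonal entry of the precision matrix
# should equal 1 / (1 - R^2) from regressing SalePrice on the other variables
r2 = summary(lm(SalePrice ~ LotArea + OverallCond, data = cor_df))$r.squared
1 / (1 - r2)   # should match cor_df_inv_pre_mx["SalePrice", "SalePrice"] = 1.0816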

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift it (if necessary) so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, \(\lambda\)) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.

# Summary to verify that values are above zero
summary(train_df$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10520   11600  215200
# Fit Lot Area to a normal distribution; the optimal parameter values are
# the MLEs returned in normal_fit$estimate (mean and sd)
normal_fit = fitdistr(train_df$LotArea, "normal")

# Draw 1000 samples from the fitted normal distribution
samp_fit = rnorm(1000, mean = normal_fit$estimate["mean"], sd = normal_fit$estimate["sd"])


par(mfrow = c(1, 2))

hist(train_df$LotArea, 25)
hist(samp_fit, 25)

# Display the original distribution next to samples from the fitted distribution.
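For comparison (my addition), the exponential fit suggested in the prompt can be drawn the same way; for an exponential, fitdistr returns the MLE of the rate, which is 1/mean(LotArea):

# Sketch (my addition): exponential fit of Lot Area, as in the prompt's
# rexp(1000, lambda) example
exp_fit = fitdistr(train_df$LotArea, "exponential")
exp_fit$estimate                        # optimal rate parameter
samp_exp = rexp(1000, exp_fit$estimate)
hist(samp_exp, 25)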

Modeling

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

# Stepwise selection in both directions was considered but not run here:
# step(lm(SalePrice ~ ., data = train_df), direction = "both")

# Select only the numeric columns
numeric_cols = select_if(train_df, is.numeric)
# Check for NAs
sum(is.na(numeric_cols))
## [1] 348
# Replace NAs with 0
numeric_cols[is.na(numeric_cols)] <- 0

Linearmodel = lm(SalePrice ~ ., data = numeric_cols)

summary(Linearmodel)
## 
## Call:
## lm(formula = SalePrice ~ ., data = numeric_cols)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473606  -16047   -2186   14258  298649 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.843e+05  1.401e+06   0.346 0.729585    
## Id            -1.050e+00  2.168e+00  -0.485 0.628079    
## MSSubClass    -1.673e+02  2.649e+01  -6.315 3.61e-10 ***
## LotFrontage    8.827e+00  2.852e+01   0.310 0.756985    
## LotArea        3.917e-01  9.990e-02   3.921 9.23e-05 ***
## OverallQual    1.731e+04  1.181e+03  14.660  < 2e-16 ***
## OverallCond    5.113e+03  1.023e+03   5.000 6.45e-07 ***
## YearBuilt      3.381e+02  6.044e+01   5.594 2.65e-08 ***
## YearRemodAdd   1.221e+02  6.602e+01   1.849 0.064638 .  
## MasVnrArea     2.808e+01  5.918e+00   4.744 2.31e-06 ***
## BsmtFinSF1     1.861e+01  4.632e+00   4.019 6.16e-05 ***
## BsmtFinSF2     9.144e+00  7.003e+00   1.306 0.191873    
## BsmtUnfSF      8.506e+00  4.168e+00   2.041 0.041452 *  
## TotalBsmtSF           NA         NA      NA       NA    
## X1stFlrSF      4.686e+01  5.726e+00   8.183 6.07e-16 ***
## X2ndFlrSF      4.810e+01  4.918e+00   9.779  < 2e-16 ***
## LowQualFinSF   1.779e+01  1.968e+01   0.904 0.366030    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   8.549e+03  2.594e+03   3.295 0.001008 ** 
## BsmtHalfBath   1.709e+03  4.054e+03   0.421 0.673467    
## FullBath       3.234e+03  2.800e+03   1.155 0.248390    
## HalfBath      -1.913e+03  2.642e+03  -0.724 0.469234    
## BedroomAbvGr  -1.027e+04  1.680e+03  -6.114 1.25e-09 ***
## KitchenAbvGr  -1.576e+04  5.195e+03  -3.033 0.002466 ** 
## TotRmsAbvGrd   5.005e+03  1.228e+03   4.076 4.84e-05 ***
## Fireplaces     4.075e+03  1.757e+03   2.319 0.020556 *  
## GarageYrBlt   -1.457e+01  2.683e+00  -5.428 6.67e-08 ***
## GarageCars     1.569e+04  2.974e+03   5.275 1.53e-07 ***
## GarageArea     5.001e+00  9.716e+00   0.515 0.606854    
## WoodDeckSF     2.579e+01  7.926e+00   3.254 0.001164 ** 
## OpenPorchSF   -6.285e+00  1.506e+01  -0.417 0.676420    
## EnclosedPorch  1.162e+01  1.671e+01   0.695 0.487055    
## X3SsnPorch     2.023e+01  3.114e+01   0.650 0.515997    
## ScreenPorch    5.771e+01  1.704e+01   3.387 0.000726 ***
## PoolArea      -3.215e+01  2.354e+01  -1.366 0.172244    
## MiscVal       -4.808e-01  1.844e+00  -0.261 0.794330    
## MoSold        -4.188e+01  3.420e+02  -0.122 0.902546    
## YrSold        -7.129e+02  6.964e+02  -1.024 0.306203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34450 on 1424 degrees of freedom
## Multiple R-squared:  0.8164, Adjusted R-squared:  0.8119 
## F-statistic: 180.9 on 35 and 1424 DF,  p-value: < 2.2e-16
# Drop the aliased and non-significant predictors from the full model
Linearmodel = update(Linearmodel, . ~ . - GrLivArea - BsmtHalfBath - FullBath - HalfBath - GarageYrBlt - GarageArea - OpenPorchSF - EnclosedPorch - X3SsnPorch - PoolArea - MiscVal - MoSold - YrSold - LowQualFinSF - TotRmsAbvGrd - Id - LotFrontage - YearRemodAdd - BsmtFinSF2 - TotalBsmtSF - KitchenAbvGr - Fireplaces - OverallCond - BsmtUnfSF, data = train_df)
summary(Linearmodel)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     YearBuilt + MasVnrArea + BsmtFinSF1 + X1stFlrSF + X2ndFlrSF + 
##     BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch, 
##     data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -488335  -16880   -1527   14250  267222 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.155e+05  8.015e+04  -7.679 2.95e-14 ***
## MSSubClass   -2.002e+02  2.430e+01  -8.237 3.93e-16 ***
## LotArea       4.370e-01  1.010e-01   4.327 1.61e-05 ***
## OverallQual   2.107e+04  1.081e+03  19.495  < 2e-16 ***
## YearBuilt     2.877e+02  4.205e+01   6.842 1.15e-11 ***
## MasVnrArea    2.873e+01  5.962e+00   4.819 1.59e-06 ***
## BsmtFinSF1    1.036e+01  2.985e+00   3.470 0.000536 ***
## X1stFlrSF     6.358e+01  3.817e+00  16.655  < 2e-16 ***
## X2ndFlrSF     5.869e+01  3.335e+00  17.598  < 2e-16 ***
## BsmtFullBath  8.944e+03  2.415e+03   3.704 0.000220 ***
## BedroomAbvGr -7.281e+03  1.467e+03  -4.965 7.71e-07 ***
## GarageCars    1.072e+04  1.716e+03   6.249 5.44e-10 ***
## WoodDeckSF    3.033e+01  7.987e+00   3.798 0.000152 ***
## ScreenPorch   5.712e+01  1.696e+01   3.368 0.000779 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35470 on 1438 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8017, Adjusted R-squared:  0.7999 
## F-statistic: 447.2 on 13 and 1438 DF,  p-value: < 2.2e-16
# The residuals appear approximately normally distributed.
hist(Linearmodel$residuals,100)

# The Q-Q plot shows somewhat heavy tails.
qqnorm(Linearmodel$residuals)
qqline(Linearmodel$residuals)
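For a fuller view of the fit, base R's standard lm diagnostics can be plotted as well (my addition, not part of the original submission):

# Sketch (my addition): residuals vs. fitted, Q-Q, scale-location,
# and residuals vs. leverage in one panel
par(mfrow = c(2, 2))
plot(Linearmodel)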

# Check the model's predictor variables in the test data before predicting SalePrice.


# Variables in the test data with NAs need to be replaced before predicting

# MasVnrArea NAs are not mean-imputed: they correspond to houses with no
# masonry veneer, so 0 is the appropriate value
test_df$MasVnrArea[is.na(test_df$MasVnrArea)] = 0

# BsmtFinSF1 has NAs: impute with the mean
test_df$BsmtFinSF1[is.na(test_df$BsmtFinSF1)] = mean(test_df$BsmtFinSF1, na.rm = TRUE)

# BsmtFullBath has NAs: impute with the median
test_df$BsmtFullBath[is.na(test_df$BsmtFullBath)] = median(test_df$BsmtFullBath, na.rm = TRUE)

# GarageCars has NAs: impute with the median
test_df$GarageCars[is.na(test_df$GarageCars)] = median(test_df$GarageCars, na.rm = TRUE)
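To confirm that no predictor was missed, the remaining NA counts for the model's variables can be listed (my addition; model_vars is a helper name introduced here):

# Sketch (my check): count remaining NAs among the model's predictors
model_vars = all.vars(formula(Linearmodel))[-1]   # drop the response
colSums(is.na(test_df[, model_vars]))             # should now be all zeros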


Final_Result = data.frame('Id' = test_df$Id, 'SalePrice' = predict(Linearmodel, test_df))
# Optional clean-up of negative or missing predictions before submitting:
#Final_Result$SalePrice[Final_Result$SalePrice < 0] = 0
#Final_Result$SalePrice[is.na(Final_Result$SalePrice)] = 0

boxplot(Final_Result$SalePrice, xlab = "SalePrice predicted", horizontal = TRUE)

#write.csv(Final_Result, file = "Results_2nd_sub.csv", row.names = FALSE)
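As a rough in-sample check of my own (Kaggle scores this competition on the RMSE of the logged prices, so this is optimistic relative to the leaderboard score):

# Sketch (my addition): in-sample RMSE on log(SalePrice), mirroring the
# competition metric; NA or non-positive predictions are excluded
train_pred = predict(Linearmodel, train_df)
ok = !is.na(train_pred) & train_pred > 0
sqrt(mean((log(train_pred[ok]) - log(train_df$SalePrice[ok]))^2))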

My model was based on numeric variables only. In my first submission I did not check the test data for missing values; fixing those values greatly improved my score. In a third attempt I would like to add factor and binary variables to the model, and to handle the NAs in the training data more carefully.

[Kaggle submission screenshot: user name and score]