Instruction:

Your final is due by the end of day on 5/20/2018. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

The Data Set: Ames Iowa Housing Data

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!
Pick the dependent variable and define it as Y.

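library(dplyr)  # provides %>% and select_if()
library(Hmisc)  # provides describe()

setwd('/Users/gaboston/Documents/bethany/ds605_final')
train <- read.csv('train.csv')

train_names <- colnames(train)
# Keep only the numeric columns for the descriptive statistics
numeric_train <- train %>%
  select_if(is.numeric)
(describe(numeric_train))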

## numeric_train 
## 
##  38  Variables      1460  Observations
## ---------------------------------------------------------------------------
## Id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0     1460        1    730.5      487    73.95   146.90 
##      .25      .50      .75      .90      .95 
##   365.75   730.50  1095.25  1314.10  1387.05 
## 
## lowest :    1    2    3    4    5, highest: 1456 1457 1458 1459 1460
## ---------------------------------------------------------------------------
## MSSubClass 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       15     0.94     56.9    43.19       20       20 
##      .25      .50      .75      .90      .95 
##       20       50       70      120      160 
##                                                                       
## Value         20    30    40    45    50    60    70    75    80    85
## Frequency    536    69     4    12   144   299    60    16    58    20
## Proportion 0.367 0.047 0.003 0.008 0.099 0.205 0.041 0.011 0.040 0.014
##                                         
## Value         90   120   160   180   190
## Frequency     52    87    63    10    30
## Proportion 0.036 0.060 0.043 0.007 0.021
## ---------------------------------------------------------------------------
## LotFrontage 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1201      259      110    0.998    70.05    24.61       34       44 
##      .25      .50      .75      .90      .95 
##       59       69       80       96      107 
## 
## lowest :  21  24  30  32  33, highest: 160 168 174 182 313
## ---------------------------------------------------------------------------
## LotArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0     1073        1    10517     5718     3312     5000 
##      .25      .50      .75      .90      .95 
##     7554     9478    11602    14382    17401 
## 
## lowest :   1300   1477   1491   1526   1533, highest:  70761 115149 159000 164660 215245
## ---------------------------------------------------------------------------
## OverallQual 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       10    0.951    6.099    1.522        4        5 
##      .25      .50      .75      .90      .95 
##        5        6        7        8        8 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency      2     3    20   116   397   374   319   168    43    18
## Proportion 0.001 0.002 0.014 0.079 0.272 0.256 0.218 0.115 0.029 0.012
## ---------------------------------------------------------------------------
## OverallCond 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        9    0.814    5.575    1.111 
##                                                                 
## Value          1     2     3     4     5     6     7     8     9
## Frequency      1     5    25    57   821   252   205    72    22
## Proportion 0.001 0.003 0.017 0.039 0.562 0.173 0.140 0.049 0.015
## ---------------------------------------------------------------------------
## YearBuilt 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      112        1     1971    33.88     1916     1925 
##      .25      .50      .75      .90      .95 
##     1954     1973     2000     2006     2007 
## 
## lowest : 1872 1875 1880 1882 1885, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## YearRemodAdd 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       61    0.997     1985    23.05     1950     1950 
##      .25      .50      .75      .90      .95 
##     1967     1994     2004     2006     2007 
## 
## lowest : 1950 1951 1952 1953 1954, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## MasVnrArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1452        8      327    0.791    103.7    156.9        0        0 
##      .25      .50      .75      .90      .95 
##        0        0      166      335      456 
## 
## lowest :    0    1   11   14   16, highest: 1115 1129 1170 1378 1600
## ---------------------------------------------------------------------------
## BsmtFinSF1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      637    0.967    443.6    484.5      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0    383.5    712.2   1065.5   1274.0 
## 
## lowest :    0    2   16   20   24, highest: 1904 2096 2188 2260 5644
## ---------------------------------------------------------------------------
## BsmtFinSF2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      144    0.305    46.55    86.58      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0      0.0    117.2    396.2 
## 
## lowest :    0   28   32   35   40, highest: 1080 1085 1120 1127 1474
## ---------------------------------------------------------------------------
## BsmtUnfSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      780    0.999    567.2    486.6      0.0     74.9 
##      .25      .50      .75      .90      .95 
##    223.0    477.5    808.0   1232.0   1468.0 
## 
## lowest :    0   14   15   23   26, highest: 2042 2046 2121 2153 2336
## ---------------------------------------------------------------------------
## TotalBsmtSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      721        1     1057    459.5    519.3    636.9 
##      .25      .50      .75      .90      .95 
##    795.8    991.5   1298.2   1602.2   1753.0 
## 
## lowest :    0  105  190  264  270, highest: 3094 3138 3200 3206 6110
## ---------------------------------------------------------------------------
## X1stFlrSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      753        1     1163    416.4    673.0    756.9 
##      .25      .50      .75      .90      .95 
##    882.0   1087.0   1391.2   1680.0   1831.2 
## 
## lowest :  334  372  438  480  483, highest: 2633 2898 3138 3228 4692
## ---------------------------------------------------------------------------
## X2ndFlrSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      417    0.817      347    450.2      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0    728.0    954.2   1141.0 
## 
## lowest :    0  110  167  192  208, highest: 1611 1796 1818 1872 2065
## ---------------------------------------------------------------------------
## LowQualFinSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       24    0.052    5.845    11.55        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
## 
## lowest :   0  53  80 120 144, highest: 513 514 515 528 572
## ---------------------------------------------------------------------------
## GrLivArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      861        1     1515    563.1      848      912 
##      .25      .50      .75      .90      .95 
##     1130     1464     1777     2158     2466 
## 
## lowest :  334  438  480  520  605, highest: 3627 4316 4476 4676 5642
## ---------------------------------------------------------------------------
## BsmtFullBath 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        4    0.733   0.4253   0.5085 
##                                   
## Value          0     1     2     3
## Frequency    856   588    15     1
## Proportion 0.586 0.403 0.010 0.001
## ---------------------------------------------------------------------------
## BsmtHalfBath 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        3    0.159  0.05753   0.1088 
##                             
## Value          0     1     2
## Frequency   1378    80     2
## Proportion 0.944 0.055 0.001
## ---------------------------------------------------------------------------
## FullBath 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        4    0.766    1.565   0.5521 
##                                   
## Value          0     1     2     3
## Frequency      9   650   768    33
## Proportion 0.006 0.445 0.526 0.023
## ---------------------------------------------------------------------------
## HalfBath 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        3    0.706   0.3829   0.4852 
##                             
## Value          0     1     2
## Frequency    913   535    12
## Proportion 0.625 0.366 0.008
## ---------------------------------------------------------------------------
## BedroomAbvGr 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        8    0.815    2.866    0.818 
##                                                           
## Value          0     1     2     3     4     5     6     8
## Frequency      6    50   358   804   213    21     7     1
## Proportion 0.004 0.034 0.245 0.551 0.146 0.014 0.005 0.001
## ---------------------------------------------------------------------------
## KitchenAbvGr 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        4    0.133    1.047  0.09174 
##                                   
## Value          0     1     2     3
## Frequency      1  1392    65     2
## Proportion 0.001 0.953 0.045 0.001
## ---------------------------------------------------------------------------
## TotRmsAbvGrd 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       12    0.958    6.518    1.762        4        5 
##      .25      .50      .75      .90      .95 
##        5        6        7        9       10 
##                                                                       
## Value          2     3     4     5     6     7     8     9    10    11
## Frequency      1    17    97   275   402   329   187    75    47    18
## Proportion 0.001 0.012 0.066 0.188 0.275 0.225 0.128 0.051 0.032 0.012
##                       
## Value         12    14
## Frequency     11     1
## Proportion 0.008 0.001
## ---------------------------------------------------------------------------
## Fireplaces 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        4    0.806    0.613   0.6566 
##                                   
## Value          0     1     2     3
## Frequency    690   650   115     5
## Proportion 0.473 0.445 0.079 0.003
## ---------------------------------------------------------------------------
## GarageYrBlt 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1379       81       97        1     1979    27.63     1930     1945 
##      .25      .50      .75      .90      .95 
##     1961     1980     2002     2006     2007 
## 
## lowest : 1900 1906 1908 1910 1914, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## GarageCars 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        5    0.802    1.767   0.7609 
##                                         
## Value          0     1     2     3     4
## Frequency     81   369   824   181     5
## Proportion 0.055 0.253 0.564 0.124 0.003
## ---------------------------------------------------------------------------
## GarageArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      441        1      473    234.9      0.0    240.0 
##      .25      .50      .75      .90      .95 
##    334.5    480.0    576.0    757.1    850.1 
## 
## lowest :    0  160  164  180  186, highest: 1220 1248 1356 1390 1418
## ---------------------------------------------------------------------------
## WoodDeckSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      274    0.858    94.24      125        0        0 
##      .25      .50      .75      .90      .95 
##        0        0      168      262      335 
## 
## lowest :   0  12  24  26  28, highest: 668 670 728 736 857
## ---------------------------------------------------------------------------
## OpenPorchSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      202    0.909    46.66    62.43        0        0 
##      .25      .50      .75      .90      .95 
##        0       25       68      130      175 
## 
## lowest :   0   4   8  10  11, highest: 406 418 502 523 547
## ---------------------------------------------------------------------------
## EnclosedPorch 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      120    0.369    21.95    39.39      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0      0.0    112.0    180.1 
## 
## lowest :   0  19  20  24  30, highest: 301 318 330 386 552
## ---------------------------------------------------------------------------
## X3SsnPorch 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       20    0.049     3.41    6.739        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
##                                                                       
## Value          0    23    96   130   140   144   153   162   168   180
## Frequency   1436     1     1     1     1     2     1     1     3     2
## Proportion 0.984 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.002 0.001
##                                                                       
## Value        182   196   216   238   245   290   304   320   407   508
## Frequency      1     1     2     1     1     1     1     1     1     1
## Proportion 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## ScreenPorch 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       76     0.22    15.06    28.27        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0      160 
## 
## lowest :   0  40  53  60  63, highest: 385 396 410 440 480
## ---------------------------------------------------------------------------
## PoolArea 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        8    0.014    2.759    5.497 
##                                                           
## Value          0   480   512   519   555   576   648   738
## Frequency   1453     1     1     1     1     1     1     1
## Proportion 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## MiscVal 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       21    0.103    43.49    85.67        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
##                                                                       
## Value          0    50   350   400   450   500   550   600   700   800
## Frequency   1408     1     1    11     4    10     1     5     5     1
## Proportion 0.964 0.001 0.001 0.008 0.003 0.007 0.001 0.003 0.003 0.001
##                                                                 
## Value       1150  1200  1300  1400  2000  2500  3500  8300 15500
## Frequency      1     2     1     1     4     1     1     1     1
## Proportion 0.001 0.001 0.001 0.001 0.003 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## MoSold 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       12    0.985    6.322    3.041        2        3 
##      .25      .50      .75      .90      .95 
##        5        6        8       10       11 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency     58    52   106   141   204   253   234   122    63    89
## Proportion 0.040 0.036 0.073 0.097 0.140 0.173 0.160 0.084 0.043 0.061
##                       
## Value         11    12
## Frequency     79    59
## Proportion 0.054 0.040
## ---------------------------------------------------------------------------
## YrSold 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        5    0.955     2008    1.498 
##                                         
## Value       2006  2007  2008  2009  2010
## Frequency    314   329   304   338   175
## Proportion 0.215 0.225 0.208 0.232 0.120
## ---------------------------------------------------------------------------
## SalePrice 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      663        1   180921    81086    88000   106475 
##      .25      .50      .75      .90      .95 
##   129975   163000   214000   278000   326100 
## 
## lowest :  34900  35311  37900  39300  40000, highest: 582933 611657 625000 745000 755000
## ---------------------------------------------------------------------------

# The Output Above Gives the Basic Statistics of the Numeric Variables

Extracting The Outcome Variable

### Outcome Variable 
Y <- as.numeric(numeric_train[,'SalePrice'])
# The following are Statistical Summaries for the Outcome Variable, Sales Price
(summary(Y))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Choosing a Right Skewed X Variable

Looking for variables with a positive skew greater than 2 among the descriptive statistics above, I chose LotArea as my X for this exploration. A large lot might be seen as valuable, and many homes do NOT have one, so there might be some connection to the outcome variable SalePrice. LotArea is also distributed as requested: it is skewed to the right.

X = as.numeric(numeric_train[,'LotArea'])
hist(X, breaks=30, main = "Histogram of Lot Area", xlab = 'Lot Area in Square Feet', ylab = 'Frequency', col= 'magenta')
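
The describe() output shown above lists counts, means, and quantiles but no skewness column, so the skew screening has to be computed separately. The following is a minimal sketch of one way to do that in base R, assuming a moment-based definition of skewness. The helper `skewness_simple`, the toy data frame, and the cutoff of 1 are all hypothetical and only stand in for the numeric columns of train.csv.

```r
# Moment-based sample skewness: third central moment divided by sd^3.
# `skewness_simple` is a hypothetical helper, not a function from the report.
skewness_simple <- function(v) {
  v <- v[!is.na(v)]
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^(3 / 2)
}

# Made-up toy data standing in for the numeric columns of train.csv
toy <- data.frame(
  symmetric  = c(1, 2, 3, 4, 5),
  right_tail = c(1, 1, 2, 2, 50)
)

# One skewness value per column
skews <- sapply(toy, skewness_simple)
print(round(skews, 2))

# Columns whose skewness exceeds a chosen cutoff (1 here, purely illustrative)
right_skewed <- names(skews)[skews > 1]
print(right_skewed)
```

Applied to the real data, `sapply(numeric_train, skewness_simple)` would give one value per numeric variable, and a positive value indicates a longer right tail.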

Probability

Calculate as a minimum the below probabilities a through c.

Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.

Interpret the meaning of all probabilities.

$P(X>x | Y>y)$
$P(X>x,Y>y)$
$P(X<x|Y>y)$

In addition, make a table of counts as shown below.

Creating The 25th, 50th and 75th Percentiles

Setting the Thresholds for X & Y at each Percentile Boundary

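# quantile() returns the 0%, 25%, 50%, 75% and 100% points, in that order
x   <- quantile(X)[2]  # 25th percentile of X
y   <- quantile(Y)[2]  # 25th percentile of Y
x_2 <- quantile(X)[3]  # median of X
y_2 <- quantile(Y)[3]  # median of Y
x_3 <- quantile(X)[4]  # 75th percentile of X
y_3 <- quantile(Y)[4]  # 75th percentile of Y
# Create a combined frame of X and Y for simplicity's sake and calculate denominators for the
# whole set, denXY (all observations), and for just the Y values greater than the 25th percentile, denY.
XY <- cbind(X, Y)
denXY <- nrow(XY)  # all of XY
denY <- nrow(subset(XY, Y > y))  # all of XY where Y > y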

a. What is the probability that a home has a lot larger than about 7,554 square feet (the 25th percentile of LotArea), given that it sold for more than $129,975 (the 25th percentile of SalePrice)?

Because we only count X values above the 25th percentile among homes already known to be above the 25th percentile of Y, the numerator is the number of rows of XY above both thresholds, and the denominator is the number of rows with Y greater than its 25th percentile.

a <- nrow(subset(XY, (X>x & Y>y)))/denY

$P(X>x|Y>y)$ = 0.82

b. What is the probability that a home sold for more than $129,975 (the 25th percentile of SalePrice) and had a lot larger than about 7,554 square feet (the 25th percentile of LotArea)?

In this case the numerator is the same as above, but because the probability is not conditioned on Y being greater than its 25th percentile, the denominator is all observations in the set.

b <- nrow(subset(XY, (X>x & Y>y)))/denXY

$P(X>x,Y>y)$ = 0.615

c. What is the probability that a home has a lot smaller than about 7,554 square feet (the 25th percentile of LotArea), given that it sold for more than $129,975 (the 25th percentile of SalePrice)?

c <- nrow(subset(XY, (X<x & Y>y)))/denY

$P(X<x|Y>y)$ = 0.18
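
The three probabilities above can also be read off logical vectors, which makes the difference between a conditional and a joint probability explicit. This is a minimal sketch on made-up toy vectors (not the Ames data); the names `in_A`, `in_B`, `p_given`, `p_joint`, and `p_lt_given` are hypothetical.

```r
# Made-up toy data standing in for LotArea (X) and SalePrice (Y)
X <- c(10, 20, 30, 40, 50, 60, 70, 80)
Y <- c(300, 100, 200, 500, 400, 800, 600, 700)

# First-quartile thresholds, as in the report
x <- quantile(X)[2]
y <- quantile(Y)[2]

in_A <- X > x  # event A: X above its 25th percentile
in_B <- Y > y  # event B: Y above its 25th percentile

# Conditional probability: restrict to rows where B holds, then average A
p_given <- mean(in_A[in_B])     # P(X > x | Y > y)

# Joint probability: average over all rows
p_joint <- mean(in_A & in_B)    # P(X > x, Y > y)

# Complementary conditional probability
p_lt_given <- mean(X[in_B] < x) # P(X < x | Y > y)

print(c(p_given = p_given, p_joint = p_joint, p_lt_given = p_lt_given))
```

Two checks follow from the definitions: the conditional probabilities in a. and c. sum to 1 when no value of X equals x exactly, and the joint probability equals the conditional probability times P(Y > y).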

Table

# Lots at or below the 25th percentile of X (x), split at the median price (y_2)
X_LE_Y_LE <- nrow(subset(XY, (X <= x & Y <= y_2)))
X_LE_Y_GT <- nrow(subset(XY, (X <= x & Y > y_2)))
X_LE_TOT <- X_LE_Y_LE + X_LE_Y_GT
Lots_Less_Than <- data.frame(Price_At_Or_Below_163k = X_LE_Y_LE, Price_Greater_163k = X_LE_Y_GT, Total = X_LE_TOT)

# Lots above the 25th percentile of X, split at the median price
X_GT_Y_LE <- nrow(subset(XY, (X > x & Y <= y_2)))
X_GT_Y_GT <- nrow(subset(XY, (X > x & Y > y_2)))
X_GT_TOT <- X_GT_Y_LE + X_GT_Y_GT
Lots_Greater_Than <- data.frame(Price_At_Or_Below_163k = X_GT_Y_LE, Price_Greater_163k = X_GT_Y_GT, Total = X_GT_TOT)

# Column totals
Totals <- data.frame(Price_At_Or_Below_163k = (X_LE_Y_LE + X_GT_Y_LE), Price_Greater_163k = (X_LE_Y_GT + X_GT_Y_GT), Total = (X_LE_TOT + X_GT_TOT))

table_probs <- data.frame(rbind(Lots_Less_Than, Lots_Greater_Than, Totals))

row.names(table_probs) <- c("Lots_Less_Than_7553", "Lots_Greater_Than_7553", "Totals")

DT::datatable(table_probs)
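
A count table of this shape can also be built in a single call with `table()` and `addmargins()`, which avoids tracking each cell by hand. This is a minimal sketch on made-up toy vectors, not the Ames data, and it splits both variables at their first quartile purely for illustration. The dimension names `LargeLot` and `HighPrice` are hypothetical.

```r
# Made-up toy data standing in for LotArea (X) and SalePrice (Y)
X <- c(10, 20, 30, 40, 50, 60, 70, 80)
Y <- c(300, 100, 200, 500, 400, 800, 600, 700)

x <- quantile(X)[2]  # threshold for X
y <- quantile(Y)[2]  # threshold for Y

# Cross-tabulate the two indicators, then append row and column totals
counts <- table(LargeLot = X > x, HighPrice = Y > y)
counts_with_totals <- addmargins(counts)
print(counts_with_totals)
```

The `Sum` row and column produced by `addmargins()` play the role of the Total column and Totals row in the hand-built table.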

Does $P(AB) = P(A)P(B)$?

Define $P(AB)$ as $P(A | B) \times P(B)$

(where $P(A | B)$ is problem a. from above)

p_ab <-nrow(subset(XY, (X>x & Y>y)))/denY
p_b <-nrow(subset(XY, Y>y))/nrow(XY)  
pAB <- p_ab * p_b

Now Calculate P(A) P(B)

p_a <-nrow(subset(XY, X>x))/nrow(XY)
P_A_B <- p_a*p_b
# Compare with a numeric tolerance rather than exact floating-point equality
answer <- isTRUE(all.equal(pAB, P_A_B))

So does $P(A | B) \times P(B) = P(A) \times P(B)$ ?

FALSE

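Because $P(AB)$ is not equal to $P(A) \times P(B)$, the events $X>x$ and $Y>y$ do not behave as independent events in this sample: knowing that a home sold above the 25th percentile price changes the probability that its lot is above the 25th percentile size. Splitting the data this way therefore does not capture the same distributions as the unsplit sets.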
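
The comparison of $P(AB)$ with $P(A) \times P(B)$ involves floating-point numbers, so a tolerance-based check is safer than `==`. The sketch below first shows why, then applies the check to made-up toy vectors (not the Ames data). The variable names are hypothetical.

```r
# Exact equality can fail for values that are equal on paper
exact_match <- (0.1 + 0.2) == 0.3                    # FALSE in floating point
tolerant_match <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE within tolerance

# Made-up toy data standing in for LotArea (X) and SalePrice (Y)
X <- c(10, 20, 30, 40, 50, 60, 70, 80)
Y <- c(300, 100, 200, 500, 400, 800, 600, 700)
x <- quantile(X)[2]
y <- quantile(Y)[2]

p_a  <- mean(X > x)          # P(A)
p_b  <- mean(Y > y)          # P(B)
p_ab <- mean(X > x & Y > y)  # P(AB), the joint probability

# Under independence the joint probability would equal the product
product <- p_a * p_b
looks_independent <- isTRUE(all.equal(p_ab, product))

print(c(p_ab = p_ab, product = product))
print(looks_independent)
```

In this toy example the joint probability exceeds the product, which is the same pattern the report finds for lot area and sale price.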

Chi_Square

Are these two variables independent when subset above the 25th percentile?

A<- subset(X, X>x)
B<-subset(Y, Y>y)
tab_ab <- table(A, B)
ch_sq <-chisq.test(tab_ab)

Statistic: 428917.745478
p-value: 0.705902

In this case the null hypothesis is that the two variables are independent. Given a p-value of 0.706, we fail to reject the null hypothesis, so this test gives no evidence against independence for the two variables as divided here.
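
Note that `table(A, B)` above cross-tabulates every distinct lot area against every distinct price, and A and B are filtered separately, so their rows no longer refer to the same homes. A more conventional alternative, sketched here as an assumption rather than as the method used in this report, is to run the chi-square test on the 2 x 2 table of the indicators $X>x$ and $Y>y$. The data below are simulated for illustration only and are not the Ames data.

```r
# Simulated, right-skewed X and a Y that depends on X (illustration only)
set.seed(1)
n <- 400
X <- rexp(n)
Y <- 2 * X + rnorm(n)

# First-quartile thresholds, as in the report
x <- quantile(X)[2]
y <- quantile(Y)[2]

# 2 x 2 contingency table of the paired indicators
tab <- table(LargeLot = X > x, HighPrice = Y > y)
print(tab)

# Pearson chi-square test of independence
# (chisq.test applies Yates' continuity correction to 2 x 2 tables by default)
res <- chisq.test(tab)
print(res$statistic)
print(res$p.value)
```

Because the indicators are computed on the same rows, each home contributes to exactly one cell, and a small p-value would indicate that the two events are associated.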

Provide univariate descriptive statistics and appropriate plots for the training data set.

(describe(numeric_train))

## numeric_train 
## 
##  38  Variables      1460  Observations
## ---------------------------------------------------------------------------
## Id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0     1460        1    730.5      487    73.95   146.90 
##      .25      .50      .75      .90      .95 
##   365.75   730.50  1095.25  1314.10  1387.05 
## 
## lowest :    1    2    3    4    5, highest: 1456 1457 1458 1459 1460
## ---------------------------------------------------------------------------
## MSSubClass 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       15     0.94     56.9    43.19       20       20 
##      .25      .50      .75      .90      .95 
##       20       50       70      120      160 
##                                                                       
## Value         20    30    40    45    50    60    70    75    80    85
## Frequency    536    69     4    12   144   299    60    16    58    20
## Proportion 0.367 0.047 0.003 0.008 0.099 0.205 0.041 0.011 0.040 0.014
##                                         
## Value         90   120   160   180   190
## Frequency     52    87    63    10    30
## Proportion 0.036 0.060 0.043 0.007 0.021
## ---------------------------------------------------------------------------
## LotFrontage 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1201      259      110    0.998    70.05    24.61       34       44 
##      .25      .50      .75      .90      .95 
##       59       69       80       96      107 
## 
## lowest :  21  24  30  32  33, highest: 160 168 174 182 313
## ---------------------------------------------------------------------------
## LotArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0     1073        1    10517     5718     3312     5000 
##      .25      .50      .75      .90      .95 
##     7554     9478    11602    14382    17401 
## 
## lowest :   1300   1477   1491   1526   1533, highest:  70761 115149 159000 164660 215245
## ---------------------------------------------------------------------------
## OverallQual 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       10    0.951    6.099    1.522        4        5 
##      .25      .50      .75      .90      .95 
##        5        6        7        8        8 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency      2     3    20   116   397   374   319   168    43    18
## Proportion 0.001 0.002 0.014 0.079 0.272 0.256 0.218 0.115 0.029 0.012
## ---------------------------------------------------------------------------
## OverallCond 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        9    0.814    5.575    1.111 
##                                                                 
## Value          1     2     3     4     5     6     7     8     9
## Frequency      1     5    25    57   821   252   205    72    22
## Proportion 0.001 0.003 0.017 0.039 0.562 0.173 0.140 0.049 0.015
## ---------------------------------------------------------------------------
## YearBuilt 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      112        1     1971    33.88     1916     1925 
##      .25      .50      .75      .90      .95 
##     1954     1973     2000     2006     2007 
## 
## lowest : 1872 1875 1880 1882 1885, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## YearRemodAdd 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       61    0.997     1985    23.05     1950     1950 
##      .25      .50      .75      .90      .95 
##     1967     1994     2004     2006     2007 
## 
## lowest : 1950 1951 1952 1953 1954, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## MasVnrArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1452        8      327    0.791    103.7    156.9        0        0 
##      .25      .50      .75      .90      .95 
##        0        0      166      335      456 
## 
## lowest :    0    1   11   14   16, highest: 1115 1129 1170 1378 1600
## ---------------------------------------------------------------------------
## BsmtFinSF1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      637    0.967    443.6    484.5      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0    383.5    712.2   1065.5   1274.0 
## 
## lowest :    0    2   16   20   24, highest: 1904 2096 2188 2260 5644
## ---------------------------------------------------------------------------
## BsmtFinSF2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      144    0.305    46.55    86.58      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0      0.0    117.2    396.2 
## 
## lowest :    0   28   32   35   40, highest: 1080 1085 1120 1127 1474
## ---------------------------------------------------------------------------
## BsmtUnfSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      780    0.999    567.2    486.6      0.0     74.9 
##      .25      .50      .75      .90      .95 
##    223.0    477.5    808.0   1232.0   1468.0 
## 
## lowest :    0   14   15   23   26, highest: 2042 2046 2121 2153 2336
## ---------------------------------------------------------------------------
## TotalBsmtSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      721        1     1057    459.5    519.3    636.9 
##      .25      .50      .75      .90      .95 
##    795.8    991.5   1298.2   1602.2   1753.0 
## 
## lowest :    0  105  190  264  270, highest: 3094 3138 3200 3206 6110
## ---------------------------------------------------------------------------
## X1stFlrSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      753        1     1163    416.4    673.0    756.9 
##      .25      .50      .75      .90      .95 
##    882.0   1087.0   1391.2   1680.0   1831.2 
## 
## lowest :  334  372  438  480  483, highest: 2633 2898 3138 3228 4692
## ---------------------------------------------------------------------------
## X2ndFlrSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      417    0.817      347    450.2      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0    728.0    954.2   1141.0 
## 
## lowest :    0  110  167  192  208, highest: 1611 1796 1818 1872 2065
## ---------------------------------------------------------------------------
## LowQualFinSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       24    0.052    5.845    11.55        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
## 
## lowest :   0  53  80 120 144, highest: 513 514 515 528 572
## ---------------------------------------------------------------------------
## GrLivArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      861        1     1515    563.1      848      912 
##      .25      .50      .75      .90      .95 
##     1130     1464     1777     2158     2466 
## 
## lowest :  334  438  480  520  605, highest: 3627 4316 4476 4676 5642
## ---------------------------------------------------------------------------
## BsmtFullBath 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        4    0.733   0.4253   0.5085 
##                                   
## Value          0     1     2     3
## Frequency    856   588    15     1
## Proportion 0.586 0.403 0.010 0.001
## ---------------------------------------------------------------------------
## BsmtHalfBath 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        3    0.159  0.05753   0.1088 
##                             
## Value          0     1     2
## Frequency   1378    80     2
## Proportion 0.944 0.055 0.001
## ---------------------------------------------------------------------------
## FullBath 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        4    0.766    1.565   0.5521 
##                                   
## Value          0     1     2     3
## Frequency      9   650   768    33
## Proportion 0.006 0.445 0.526 0.023
## ---------------------------------------------------------------------------
## HalfBath 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        3    0.706   0.3829   0.4852 
##                             
## Value          0     1     2
## Frequency    913   535    12
## Proportion 0.625 0.366 0.008
## ---------------------------------------------------------------------------
## BedroomAbvGr 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        8    0.815    2.866    0.818 
##                                                           
## Value          0     1     2     3     4     5     6     8
## Frequency      6    50   358   804   213    21     7     1
## Proportion 0.004 0.034 0.245 0.551 0.146 0.014 0.005 0.001
## ---------------------------------------------------------------------------
## KitchenAbvGr 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        4    0.133    1.047  0.09174 
##                                   
## Value          0     1     2     3
## Frequency      1  1392    65     2
## Proportion 0.001 0.953 0.045 0.001
## ---------------------------------------------------------------------------
## TotRmsAbvGrd 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       12    0.958    6.518    1.762        4        5 
##      .25      .50      .75      .90      .95 
##        5        6        7        9       10 
##                                                                       
## Value          2     3     4     5     6     7     8     9    10    11
## Frequency      1    17    97   275   402   329   187    75    47    18
## Proportion 0.001 0.012 0.066 0.188 0.275 0.225 0.128 0.051 0.032 0.012
##                       
## Value         12    14
## Frequency     11     1
## Proportion 0.008 0.001
## ---------------------------------------------------------------------------
## Fireplaces 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        4    0.806    0.613   0.6566 
##                                   
## Value          0     1     2     3
## Frequency    690   650   115     5
## Proportion 0.473 0.445 0.079 0.003
## ---------------------------------------------------------------------------
## GarageYrBlt 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1379       81       97        1     1979    27.63     1930     1945 
##      .25      .50      .75      .90      .95 
##     1961     1980     2002     2006     2007 
## 
## lowest : 1900 1906 1908 1910 1914, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## GarageCars 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        5    0.802    1.767   0.7609 
##                                         
## Value          0     1     2     3     4
## Frequency     81   369   824   181     5
## Proportion 0.055 0.253 0.564 0.124 0.003
## ---------------------------------------------------------------------------
## GarageArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      441        1      473    234.9      0.0    240.0 
##      .25      .50      .75      .90      .95 
##    334.5    480.0    576.0    757.1    850.1 
## 
## lowest :    0  160  164  180  186, highest: 1220 1248 1356 1390 1418
## ---------------------------------------------------------------------------
## WoodDeckSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      274    0.858    94.24      125        0        0 
##      .25      .50      .75      .90      .95 
##        0        0      168      262      335 
## 
## lowest :   0  12  24  26  28, highest: 668 670 728 736 857
## ---------------------------------------------------------------------------
## OpenPorchSF 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      202    0.909    46.66    62.43        0        0 
##      .25      .50      .75      .90      .95 
##        0       25       68      130      175 
## 
## lowest :   0   4   8  10  11, highest: 406 418 502 523 547
## ---------------------------------------------------------------------------
## EnclosedPorch 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      120    0.369    21.95    39.39      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0      0.0    112.0    180.1 
## 
## lowest :   0  19  20  24  30, highest: 301 318 330 386 552
## ---------------------------------------------------------------------------
## X3SsnPorch 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       20    0.049     3.41    6.739        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
##                                                                       
## Value          0    23    96   130   140   144   153   162   168   180
## Frequency   1436     1     1     1     1     2     1     1     3     2
## Proportion 0.984 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.002 0.001
##                                                                       
## Value        182   196   216   238   245   290   304   320   407   508
## Frequency      1     1     2     1     1     1     1     1     1     1
## Proportion 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## ScreenPorch 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       76     0.22    15.06    28.27        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0      160 
## 
## lowest :   0  40  53  60  63, highest: 385 396 410 440 480
## ---------------------------------------------------------------------------
## PoolArea 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        8    0.014    2.759    5.497 
##                                                           
## Value          0   480   512   519   555   576   648   738
## Frequency   1453     1     1     1     1     1     1     1
## Proportion 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## MiscVal 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       21    0.103    43.49    85.67        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
##                                                                       
## Value          0    50   350   400   450   500   550   600   700   800
## Frequency   1408     1     1    11     4    10     1     5     5     1
## Proportion 0.964 0.001 0.001 0.008 0.003 0.007 0.001 0.003 0.003 0.001
##                                                                 
## Value       1150  1200  1300  1400  2000  2500  3500  8300 15500
## Frequency      1     2     1     1     4     1     1     1     1
## Proportion 0.001 0.001 0.001 0.001 0.003 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## MoSold 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0       12    0.985    6.322    3.041        2        3 
##      .25      .50      .75      .90      .95 
##        5        6        8       10       11 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency     58    52   106   141   204   253   234   122    63    89
## Proportion 0.040 0.036 0.073 0.097 0.140 0.173 0.160 0.084 0.043 0.061
##                       
## Value         11    12
## Frequency     79    59
## Proportion 0.054 0.040
## ---------------------------------------------------------------------------
## YrSold 
##        n  missing distinct     Info     Mean      Gmd 
##     1460        0        5    0.955     2008    1.498 
##                                         
## Value       2006  2007  2008  2009  2010
## Frequency    314   329   304   338   175
## Proportion 0.215 0.225 0.208 0.232 0.120
## ---------------------------------------------------------------------------
## SalePrice 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      663        1   180921    81086    88000   106475 
##      .25      .50      .75      .90      .95 
##   129975   163000   214000   278000   326100 
## 
## lowest :  34900  35311  37900  39300  40000, highest: 582933 611657 625000 745000 755000
## ---------------------------------------------------------------------------

Simple Correlation Plot to Establish Relationships Between Variables

corrplot(cor(as.matrix(numeric_train),use = "complete.obs"), type="lower", insig = "n", tl.srt = 45,number.cex = .30, order = "hclust",tl.col = 'black',tl.cex=.75)

hist(numeric_train$GrLivArea, main="Gross Living Area",xlab ="Square Feet of Finished Living Area", col='navy')

hist(numeric_train$TotRmsAbvGrd, main="Total Number of Above Ground Rooms",xlab ='Rooms Above Ground', col='navy')

hist(numeric_train$YearRemodAdd, main="Year of Remodel or Addition", xlab ='Year', col='navy')

hist(numeric_train$FullBath, main="Number of Full Baths", xlab = 'Full Baths', col='navy')

hist(numeric_train$GarageCars, main ="Number of Car Bays in Garage", xlab ="Car Bays", col='navy')

hist(numeric_train$GarageArea, main = "Square Footage of Garage", xlab = 'Square Feet', col ='navy')

hist(numeric_train$TotalBsmtSF, main = 'Square Footage of Basement', xlab='Square Feet', col = 'navy')

hist(numeric_train$X1stFlrSF, main='First Floor Square Footage', xlab ='Square Feet', col ='navy')

Provide a scatterplot of X and Y

data <- na.omit(train[,c('LotArea','SalePrice')])

plot(x = data$LotArea, y = data$SalePrice, main = "Sale Price vs. Square Feet of Lot", xlab = 'Square Feet', ylab = 'Sale Price', col = 'maroon')

Derive a correlation matrix for any THREE quantitative variables in the dataset.

I chose three variables that each showed a relatively high correlation with SalePrice in the correlation plot above, so that the tests below would have meaningful pairwise relationships to evaluate.

three_variables<- data.frame( cbind(train$GrLivArea, train$GarageArea, train$TotalBsmtSF))
colnames(three_variables)<-c('GrLivArea', 'GarageArea', 'TotalBsmtSF')
three_cor <- cor(three_variables, method = 'pearson', use = 'complete.obs')

	GrLivArea	GarageArea	TotalBsmtSF
GrLivArea	1.0000000	0.4689975	0.4548682
GarageArea	0.4689975	1.0000000	0.4866655
TotalBsmtSF	0.4548682	0.4866655	1.0000000

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval.

GrLivArea_GarageArea <-cor.test(train$GrLivArea, train$GarageArea,
         alternative = "two.sided",
         method = "pearson",
         exact = NULL, conf.level = 0.92)

GrLivArea_TotalBsmtSF <-cor.test(train$GrLivArea, train$TotalBsmtSF,
                                alternative = "two.sided",
                                method = "pearson",
                                exact = NULL, conf.level = 0.92)

GarageArea_TotalBsmtSF <-cor.test(train$GarageArea, train$TotalBsmtSF,
                                 alternative = "two.sided",
                                 method = "pearson",
                                 exact = NULL, conf.level = 0.92)

Residential Area & Garage Area:

Estimate: 0.4689975
p-value: 0
Interval: 0.4324608 0.5039965

Garage Area & Total Basement Area:

Estimate: 0.4866655
p-value: 0
Interval: 0.4508901 0.5208797

Residential Area & Total Basement Area:
Estimate: 0.4548682
p-value: 0
Interval: 0.4177447 0.4904754

Discuss the meaning of your analysis.

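All three correlation coefficients (roughly 0.45 to 0.49) indicate a moderate positive association, and each is statistically significant: the p-values are small enough to round to 0, and none of the 92% confidence intervals comes close to including 0. We can therefore reject the hypothesis of zero correlation for every pair. The variables tend to increase together, either because they are directly related or because all three follow another variable with which they trend.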
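For reference, the first interval above can be reproduced by hand. The sketch below assumes the Fisher z-transformation, which is what `cor.test` uses for Pearson intervals, and plugs in the estimate and sample size reported in this document (r = 0.4689975, n = 1460). The helper name `fisher_ci` is made up for illustration.

```r
# Hypothetical helper: confidence interval for a Pearson correlation
# via the Fisher z-transformation.
fisher_ci <- function(r, n, conf_level = 0.92) {
  z  <- atanh(r)                          # move r to the z scale
  se <- 1 / sqrt(n - 3)                   # approximate standard error of z
  q  <- qnorm(1 - (1 - conf_level) / 2)   # two-sided normal quantile
  tanh(z + c(-1, 1) * q * se)             # back-transform both bounds
}

# Living area vs. garage area, using the values reported above
ci_92 <- fisher_ci(r = 0.4689975, n = 1460)
print(round(ci_92, 4))
```

The result should agree with the Residential Area & Garage Area interval to several decimal places; changing `conf_level` shows how the interval widens as the confidence level rises.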

Would you be worried about familywise error? Why or why not?

Familywise error is the risk of at least one false positive when several tests are run together, and it grows with the number of tests. Our results are significant by such a wide margin that this is unlikely to be an issue. However, to be sure, we can simply re-run the analysis with a Bonferroni-style adjustment: the alpha value (1 minus the confidence level) divided by the number of tests we are doing, in this case three.

$\alpha = 0.05$
$n = 3$
$\alpha_n = \frac{\alpha}{n} = \frac{0.05}{3} \approx 0.0167$
$1-\alpha_n \approx 0.9833$
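The adjustment is simple arithmetic, sketched below in base R. The second half shows the equivalent view of inflating p-values with `p.adjust`; the three p-values there are invented purely for illustration and are not results from this analysis.

```r
alpha   <- 0.05   # familywise error rate chosen in the text
n_tests <- 3      # three pairwise correlation tests

alpha_adj  <- alpha / n_tests   # per-test significance level
conf_level <- 1 - alpha_adj     # per-test confidence level

print(round(c(alpha_adj = alpha_adj, conf_level = conf_level), 4))

# Equivalent view: inflate the p-values instead of shrinking alpha.
# These p-values are made up for illustration only.
p_raw <- c(0.001, 0.020, 0.040)
p_adj <- p.adjust(p_raw, method = "bonferroni")
print(p_adj)
```

With real p-values as small as the ones in this analysis, multiplying by three leaves every test significant.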

GrLivArea_GarageArea2 <-cor.test(train$GrLivArea, train$GarageArea,
         alternative = "two.sided",
         method = "pearson",
         exact = NULL, conf.level = 0.9833)

GrLivArea_TotalBsmtSF2 <-cor.test(train$GrLivArea, train$TotalBsmtSF,
                                alternative = "two.sided",
                                method = "pearson",
                                exact = NULL, conf.level = 0.9833)

GarageArea_TotalBsmtSF2 <-cor.test(train$GarageArea, train$TotalBsmtSF,
                                 alternative = "two.sided",
                                 method = "pearson",
                                 exact = NULL, conf.level = 0.9833)

Residential Area & Garage Area:

Estimate: 0.4689975
p-value: 0
Interval: 0.4186762 0.5164475

Garage Area & Total Basement Area:

Estimate: 0.4866655
p-value: 0
Interval: 0.4373772 0.5330386

Residential Area & Total Basement Area:
Estimate: 0.4548682
p-value: 0
Interval: 0.4037514 0.5031538

From the above calculations, you can see that even at the stricter per-test confidence level implied by the adjusted $\alpha$, the p-values still round to 0 and the intervals, although slightly wider, stay well away from 0. There are still many zeros between our p-values and the threshold for a Type I error. It is a very safe bet that these results are not distorted by familywise error.

Linear Algebra and Correlation

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

##Invert Three-Way Correlation
require(Matrix)
three_precise <- solve(three_cor)
print(three_precise)

##              GrLivArea GarageArea TotalBsmtSF
## GrLivArea    1.4030275 -0.4552539  -0.4166363
## GarageArea  -0.4552539  1.4580675  -0.5025106
## TotalBsmtSF -0.4166363 -0.5025106   1.4340691

Multiply the correlation matrix by the precision matrix…

cor_prec <- three_cor%*%three_precise
print(cor_prec)

##                             GrLivArea                 GarageArea
## GrLivArea   1.00000000000000000000000 -0.00000000000000002775558
## GarageArea  0.00000000000000002775558  0.99999999999999977795540
## TotalBsmtSF 0.00000000000000005551115 -0.00000000000000011102230
##                           TotalBsmtSF
## GrLivArea    0.0000000000000000000000
## GarageArea  -0.0000000000000001110223
## TotalBsmtSF  0.9999999999999997779554

Then multiply the precision matrix by the correlation matrix…

prec_cor<- three_precise%*%three_cor
print(prec_cor)

##                              GrLivArea                 GarageArea
## GrLivArea    1.00000000000000000000000  0.00000000000000002775558
## GarageArea  -0.00000000000000002775558  0.99999999999999977795540
## TotalBsmtSF  0.00000000000000000000000 -0.00000000000000011102230
##                            TotalBsmtSF
## GrLivArea    0.00000000000000005551115
## GarageArea  -0.00000000000000011102230
## TotalBsmtSF  0.99999999999999977795540

From these values we can see that both products equal the 3 x 3 identity matrix up to floating-point error: the diagonal entries equal 1 to about 15 decimal places, and the off-diagonal entries are on the order of $10^{-16}$, so rounding sets them to 0. This is what we expect, since a matrix multiplied by its inverse gives the identity matrix regardless of the order of multiplication.
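Rather than comparing the printed digits by eye, the identity can be checked with a numerical tolerance. The sketch below retypes the correlation matrix from the values printed earlier (so it is accurate only to seven decimals) and uses base R's `all.equal`. The names `R` and `P` are chosen for this example only.

```r
vars <- c("GrLivArea", "GarageArea", "TotalBsmtSF")

# Correlation matrix typed in from the values printed earlier
R <- matrix(c(1.0000000, 0.4689975, 0.4548682,
              0.4689975, 1.0000000, 0.4866655,
              0.4548682, 0.4866655, 1.0000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(R)   # precision matrix

# Compare both products with the identity, ignoring dimnames
is_identity_left  <- isTRUE(all.equal(R %*% P, diag(3), check.attributes = FALSE))
is_identity_right <- isTRUE(all.equal(P %*% R, diag(3), check.attributes = FALSE))
print(c(is_identity_left, is_identity_right))

# Diagonal of the precision matrix: the variance inflation factors
print(round(diag(P), 4))
```

The diagonal values of about 1.40 to 1.46 are low as variance inflation factors go, which is consistent with the moderate pairwise correlations.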

Now Conduct LU decomposition on the matrix

lu_decomp <-lu(three_cor)
expand(lu_decomp)

## $L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000         .         .
## [2,] 0.4689975 1.0000000         .
## [3,] 0.4548682 0.3504089 1.0000000
## 
## $U
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000 0.4689975 0.4548682
## [2,]         . 0.7800414 0.2733334
## [3,]         .         . 0.6973165
## 
## $P
## 3 x 3 sparse Matrix of class "pMatrix"
##           
## [1,] | . .
## [2,] . | .
## [3,] . . |
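To see where the entries of L and U come from, here is a minimal sketch of the elimination done by hand in base R. It assumes no row swaps are needed, which matches the identity permutation matrix P shown above, and it retypes the correlation matrix from the printed values. It is an illustration, not the algorithm the Matrix package uses internally.

```r
# Correlation matrix typed in from the values printed earlier
R <- matrix(c(1.0000000, 0.4689975, 0.4548682,
              0.4689975, 1.0000000, 0.4866655,
              0.4548682, 0.4866655, 1.0000000),
            nrow = 3, byrow = TRUE)

n <- nrow(R)
L <- diag(n)   # unit lower-triangular factor, filled in below
U <- R         # reduced to upper-triangular form in place

for (k in seq_len(n - 1)) {
  for (i in (k + 1):n) {
    L[i, k] <- U[i, k] / U[k, k]           # elimination multiplier
    U[i, ]  <- U[i, ] - L[i, k] * U[k, ]   # zero out the entry below the pivot
  }
}

print(round(L, 7))
print(round(U, 7))

# The factors should multiply back to the original matrix
max_err <- max(abs(L %*% U - R))
print(max_err)
```

Each multiplier stored in L is the factor used to clear one entry below the diagonal, which is why L has ones on its diagonal and U carries the pivots.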

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. See: MASS.

(summary(X))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245

The minimum is greater than zero, so we can proceed with the rest of this process as-is.

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).

x_dist<-fitdistr(X, densfun = 'exponential')
lambda <- x_dist$estimate
re_distibution <- rexp(1000, lambda)

(summary(re_distibution))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.24  3344.42  7692.38 10571.80 14703.51 70187.17
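For an exponential model the maximum-likelihood rate has a closed form, 1 / mean(x), and that is the value `fitdistr` reports for `densfun = 'exponential'`. The sketch below checks this on simulated data by maximizing the log-likelihood numerically. The sample is a made-up stand-in (in thousands of square feet), not the Ames data.

```r
set.seed(42)
# Simulated stand-in for a right-skewed variable, in thousands of square feet
x <- rexp(500, rate = 0.1)

# Log-likelihood of an exponential sample as a function of the rate
loglik <- function(lambda) length(x) * log(lambda) - lambda * sum(x)

# Numerical maximum of the log-likelihood over a wide interval
lambda_numeric <- optimize(loglik, interval = c(1e-3, 10),
                           maximum = TRUE, tol = 1e-10)$maximum

# Closed-form maximum-likelihood estimate
lambda_closed <- 1 / mean(x)

print(c(numeric = lambda_numeric, closed_form = lambda_closed))
```

Because the rate is determined entirely by the mean, the simulated mean (10571.80) lands close to the observed mean (10517) even though the two samples have very different shapes.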

Plot a histogram and compare it with a histogram of your original variable.

par(mfrow=c(1,2))
hist(numeric_train$LotArea, main = "Ames Iowa Lot Areas ", xlab='Square Feet', col ='navy')
hist(re_distibution, main = "Simulated Lot Areas", xlab="Simulated Values Square Feet", col = "darkorange")

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

 p_05_exp <- round(log(1-0.05)/-lambda,2)
 p_95_exp <- round(log(1-0.95)/-lambda,2)

Exponential Distribution:
5th Percentile: 539.44
95th Percentile: 31505.6
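As a cross-check, the inverse-CDF formula used above is what base R's `qexp` computes. The sketch assumes a rate of 1 / 10517, taken from the rounded mean in the summary output, so the results differ slightly from the figures above, which used the unrounded estimate.

```r
# Rate implied by the rounded mean shown in summary(X)
lambda_approx <- 1 / 10517

probs <- c(0.05, 0.95)

# Inverse of the exponential CDF F(x) = 1 - exp(-lambda * x)
manual  <- -log(1 - probs) / lambda_approx

# Built-in exponential quantile function
builtin <- qexp(probs, rate = lambda_approx)

print(round(rbind(manual, builtin), 2))
```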

Also generate a 95% confidence interval from the empirical data, assuming normality.

 empirical_interval <- CI(train$LotArea, ci=0.95)

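Empirical Confidence Interval (on Lot Size): upper bound 11029, mean 10516. The interval is symmetric about the mean, so the lower bound is about 10004, giving an interval of roughly (10004, 11029).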
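A normal-theory interval for the mean is the mean plus or minus a t quantile times the standard error. The sketch below writes that out in base R on a simulated sample and cross-checks it against `t.test`. The helper `mean_ci` and the simulated data are illustrative assumptions. If the `CI` function used above is `Rmisc::CI`, note that it lists the upper bound first, then the mean, then the lower bound.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, conf_level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                            # standard error of the mean
  q  <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # two-sided t quantile
  c(lower = mean(x) - q * se, mean = mean(x), upper = mean(x) + q * se)
}

set.seed(1)
# Simulated right-skewed sample standing in for lot areas
x_demo  <- rexp(1460, rate = 1 / 10000)
ci_demo <- mean_ci(x_demo)
print(round(ci_demo))

# Cross-check against the interval reported by t.test()
print(round(as.numeric(t.test(x_demo)$conf.int)))
```

This is an interval for the mean, not for individual lots, which is why it is so much narrower than the 5th-to-95th percentile range.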

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

 p_05_X <-quantile(train$LotArea, 0.05)
 p_95_X <- quantile(train$LotArea, 0.95)

Distribution of Lot Size:
5th Percentile: 3311.7
95th Percentile: 17401.15

Table Comparing Distributions:

 # unname() drops the 'rate' and '5%' labels so they do not become row names
 dtable <- data.frame(c('Simulated_Data', 'Lot_Area'), unname(c(p_05_exp, p_05_X)), unname(c(p_95_exp, p_95_X)))
 colnames(dtable) <- c('Data Set', 'Fifth Percentile', 'Ninety-Fifth Percentile')

Data Set	Fifth Percentile	Ninety-Fifth Percentile
Simulated_Data	539.44	31505.6
Lot_Area	3311.7	17401.15

Discuss:

The variable Lot Size is clearly right-skewed, but its values do not start near zero: the smallest lot is 1,300 square feet, most lots cluster at the smaller end, and frequency drops sharply above roughly 10,000 square feet. Lot size is often limited by the space between streets (with homes back to back) or by rows of streets with alleys in between, so a relatively compact range of sizes makes sense for homes within the city proper, where land is divided up into blocks.

It also makes sense that a developer would create as many lots as possible from a contiguous parcel, so lots tend to converge on a size at or near the city minimum. Because of this, the distribution is not exponential despite its long right tail, nor is it normal. In fact, with the extremes removed it comes closer to a uniform distribution, or to a few compact uniform distributions.

This is why the exponential distribution, whose single parameter lambda is set only by the mean of Lot Size, produces more extreme percentiles than the real data: a 5th percentile of about 539 against an empirical 3,312, and a 95th percentile of about 31,506 against an empirical 17,401.

These values are an artifact of the model rather than the data. An exponential density is highest at zero and decays smoothly over its entire range, so it places probability both below the smallest real lot and far out in the right tail, instead of following the actual values.

That same smoothing explains the unrealistic 5th percentile of about 540 square feet, a size much smaller than any home could occupy or the city would allow.

Although the data are heavily skewed right, the exponential distribution is clearly not the right model for simulating them.

A better choice might be to draw samples, with weighted probabilities, from a normal distribution centered at zero, take their absolute values, and then shift the result by the minimum of the empirical distribution. This would stack the values up in the lower range and taper off to the right more appropriately, with the weights providing some compensation for the plateaus.
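A rough sketch of that idea follows, using only the minimum (1300) and mean (10517) reported earlier. It leaves out the weighting step and simply takes absolute values of normal draws, then shifts them by the minimum. Choosing the scale so the simulated mean matches the observed mean is an assumption made here for illustration, not something derived in the analysis.

```r
set.seed(7)
lot_min  <- 1300    # minimum from summary(X)
lot_mean <- 10517   # mean from summary(X)

# Assumption: pick the scale so the shifted half-normal matches the mean.
# A half-normal with scale s has mean s * sqrt(2 / pi).
s <- (lot_mean - lot_min) / sqrt(2 / pi)

# Absolute values of zero-centered normal draws, shifted by the minimum
sim <- lot_min + abs(rnorm(100000, mean = 0, sd = s))

print(round(mean(sim)))
print(round(quantile(sim, c(0.05, 0.95))))
```

By construction no simulated lot falls below the observed minimum, which removes the 540-square-foot problem; whether the upper tail is a good match would still need to be checked against the empirical percentiles.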

Linear Regression Model: Kaggle Competition

Grooming Decisions

This data set was acquired from the City of Ames, Iowa by Professor Dean De Cock of Truman State University, with the express purpose of becoming a go-to open-source regression modeling tool. Having read his publication, ‘Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project’, it became clear to me that this was never meant to become the festival of Ridge-Lasso-ElasticNet-and-cross-validation that Kaggle has reduced it to, but was instead intended to be a data set rich with useful predictors of sale price, groomed for easy access.

With that in mind I am starting as suggested by the author:
- removing homes with footage greater than 4000 square feet
- targeting the predictions to the squared sale price
- using neighborhood, gross living area, and finished basement as the core variables
- scaling the data
- using a few discrete ordinal variables
- including a categorical variable or two

All the Ames Variables:

types <- sapply(train, class)
data_view<- data.frame(types)

	types
Id	integer
MSSubClass	integer
MSZoning	factor
LotFrontage	integer
LotArea	integer
Street	factor
Alley	factor
LotShape	factor
LandContour	factor
Utilities	factor
LotConfig	factor
LandSlope	factor
Neighborhood	factor
Condition1	factor
Condition2	factor
BldgType	factor
HouseStyle	factor
OverallQual	integer
OverallCond	integer
YearBuilt	integer
YearRemodAdd	integer
RoofStyle	factor
RoofMatl	factor
Exterior1st	factor
Exterior2nd	factor
MasVnrType	factor
MasVnrArea	integer
ExterQual	factor
ExterCond	factor
Foundation	factor
BsmtQual	factor
BsmtCond	factor
BsmtExposure	factor
BsmtFinType1	factor
BsmtFinSF1	integer
BsmtFinType2	factor
BsmtFinSF2	integer
BsmtUnfSF	integer
TotalBsmtSF	integer
Heating	factor
HeatingQC	factor
CentralAir	factor
Electrical	factor
X1stFlrSF	integer
X2ndFlrSF	integer
LowQualFinSF	integer
GrLivArea	integer
BsmtFullBath	integer
BsmtHalfBath	integer
FullBath	integer
HalfBath	integer
BedroomAbvGr	integer
KitchenAbvGr	integer
KitchenQual	factor
TotRmsAbvGrd	integer
Functional	factor
Fireplaces	integer
FireplaceQu	factor
GarageType	factor
GarageYrBlt	integer
GarageFinish	factor
GarageCars	integer
GarageArea	integer
GarageQual	factor
GarageCond	factor
PavedDrive	factor
WoodDeckSF	integer
OpenPorchSF	integer
EnclosedPorch	integer
X3SsnPorch	integer
ScreenPorch	integer
PoolArea	integer
PoolQC	factor
Fence	factor
MiscFeature	factor
MiscVal	integer
MoSold	integer
YrSold	integer
SaleType	factor
SaleCondition	factor
SalePrice	integer

I also tried to keep the number of variables to 10 or fewer, as complexity is the enemy of comprehension and consistency. Picking from the above, I used the correlation matrix below to choose variables that had a high correlation with SalePrice.

LotArea
Neighborhood
YearBuilt
YearRemodAdd
TotalBsmtSF
GarageCars
FullBath
TotRmsAbvGrd
GrLivArea

Basis for Choosing Variables

corrplot(cor(as.matrix(numeric_train),use = "complete.obs"), type="lower", insig = "n", tl.srt = 45,number.cex = .30, order = "hclust",tl.col = 'black',tl.cex=.75)

Looking at NA values in Chosen Data

#sub_train <- train[ ,c('LotArea','GrLivArea','Neighborhood',  'YearBuilt', 'YearRemodAdd',  'TotalBsmtSF',  'GarageCars',  'FullBath',  'TotRmsAbvGrd', 'YrSold','GarageArea', 'SalePrice')]

sapply(train, function(x) sum(is.na(x)))

##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0           259             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1369             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             8             8             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            37            37            38            37             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            38             0             0             0             0 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0           690            81            81 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            81             0             0            81            81 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1453          1179          1406 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0
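With this many columns, the full listing is easier to read once it is filtered down to the columns that actually contain missing values. A minimal sketch on a small made-up data frame is shown below; the column names only echo the real data set, and the values are invented.

```r
# Small made-up data frame; column names only echo the real data set
toy <- data.frame(
  LotArea     = c(8450, 9600, 11250, 9550),
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, "Pave", NA),
  stringsAsFactors = FALSE
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Keep only columns with at least one NA, largest count first
na_only <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_only)
```

Applied to `train`, the same two lines would reduce the listing above to the fields with missing values, such as PoolQC, MiscFeature, Alley, and Fence.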

Moving Forward

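With no NA values in our chosen fields, the next step is type conversion. Neighborhood is already stored as a factor (see the type listing above), so only the area measurements need converting.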

**To Numeric**

Before turning the integer measurements into continuous numerics or factors, I decided to combine two variables into one that might be more meaningful, given that this data was collected over four years. YearRemodAdd on its own loses value as the sale date moves away from it: a 2005 remodel is less impressive in 2010 than it was in 2005. Instead I create YearsSinceRemod, the difference between the sale year and the remodel year, which should tighten up the value attribution a little bit.

train[,c('LotArea','TotalBsmtSF','GrLivArea', 'GarageArea')]<-lapply(train[,c('LotArea','TotalBsmtSF','GrLivArea', 'GarageArea')], as.numeric)
train['YearsSinceRemod'] <- train['YrSold'] - train['YearRemodAdd']
train<- train[which(train['GrLivArea']<=4000),]
temp<-data.frame(sapply(train, class))

	sapply.train..class.
Id	integer
MSSubClass	integer
MSZoning	factor
LotFrontage	integer
LotArea	numeric
Street	factor
Alley	factor
LotShape	factor
LandContour	factor
Utilities	factor
LotConfig	factor
LandSlope	factor
Neighborhood	factor
Condition1	factor
Condition2	factor
BldgType	factor
HouseStyle	factor
OverallQual	integer
OverallCond	integer
YearBuilt	integer
YearRemodAdd	integer
RoofStyle	factor
RoofMatl	factor
Exterior1st	factor
Exterior2nd	factor
MasVnrType	factor
MasVnrArea	integer
ExterQual	factor
ExterCond	factor
Foundation	factor
BsmtQual	factor
BsmtCond	factor
BsmtExposure	factor
BsmtFinType1	factor
BsmtFinSF1	integer
BsmtFinType2	factor
BsmtFinSF2	integer
BsmtUnfSF	integer
TotalBsmtSF	numeric
Heating	factor
HeatingQC	factor
CentralAir	factor
Electrical	factor
X1stFlrSF	integer
X2ndFlrSF	integer
LowQualFinSF	integer
GrLivArea	numeric
BsmtFullBath	integer
BsmtHalfBath	integer
FullBath	integer
HalfBath	integer
BedroomAbvGr	integer
KitchenAbvGr	integer
KitchenQual	factor
TotRmsAbvGrd	integer
Functional	factor
Fireplaces	integer
FireplaceQu	factor
GarageType	factor
GarageYrBlt	integer
GarageFinish	factor
GarageCars	integer
GarageArea	numeric
GarageQual	factor
GarageCond	factor
PavedDrive	factor
WoodDeckSF	integer
OpenPorchSF	integer
EnclosedPorch	integer
X3SsnPorch	integer
ScreenPorch	integer
PoolArea	integer
PoolQC	factor
Fence	factor
MiscFeature	factor
MiscVal	integer
MoSold	integer
YrSold	integer
SaleType	factor
SaleCondition	factor
SalePrice	integer
YearsSinceRemod	integer

Let’s Build a Model

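Since the goal is to make useful predictions that I can also explain easily, I decided to keep it simple. The plan: raise sale price to the second power, add an interaction between the GrLivArea and TotalBsmtSF variables, and square LotArea to help pick up value on the very large lots, which might be more valuable. Note that the formula below, as written, includes the GrLivArea and TotalBsmtSF interaction but leaves SalePrice and LotArea untransformed.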

I will reserve some of the training data to validate on prior to transforming the test data and submitting it.
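As a minimal sketch of what hold-out validation involves, the example below splits a simulated data set by row, fits on 75% of the rows, and scores the remaining 25%. It uses base R only. The toy data, the variable names `train_part` and `valid_part`, and the choice of root-mean-squared error as the score are all assumptions for illustration.

```r
set.seed(1)
# Simulated data: price depends linearly on living area plus noise
n   <- 400
toy <- data.frame(GrLivArea = runif(n, 600, 3000))
toy$SalePrice <- 20000 + 100 * toy$GrLivArea + rnorm(n, sd = 15000)

# Row-level split: TRUE marks the 75% of rows used for fitting
in_train   <- seq_len(n) %in% sample(n, size = round(0.75 * n))
train_part <- toy[in_train, ]
valid_part <- toy[!in_train, ]

# Fit on the training rows only, then predict the held-out rows
toy_fit <- lm(SalePrice ~ GrLivArea, data = train_part)
pred    <- predict(toy_fit, newdata = valid_part)

# Root-mean-squared error on the held-out rows
rmse <- sqrt(mean((valid_part$SalePrice - pred)^2))
print(round(rmse))
```

The key point is that the split is a logical vector with one entry per row, so the fitting rows and the validation rows never overlap.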

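library(caTools)  # provides sample.split()
set.seed(1)
# sample.split() takes the outcome vector and returns one TRUE/FALSE flag per row
in_train <- sample.split(train$SalePrice, SplitRatio = .75)
# Index rows with the flags; subset(train, sample=TRUE) treats sample as an unused argument and keeps every row
train_2 <- train[in_train, ]
validation <- train[!in_train, ]
# Note: the summary output shown below reflects a fit on all 1,460 training rows
fit <- lm(SalePrice ~ GrLivArea * TotalBsmtSF + LotArea + GarageArea + Neighborhood + YearBuilt + YearsSinceRemod + KitchenQual + WoodDeckSF + GarageCars + ExterQual + Fireplaces + BsmtQual + MasVnrArea, data = train_2)
summary(fit)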

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea * TotalBsmtSF + LotArea + 
##     GarageArea + Neighborhood + YearBuilt + YearsSinceRemod + 
##     KitchenQual + WoodDeckSF + GarageCars + ExterQual + Fireplaces + 
##     BsmtQual + MasVnrArea, data = train_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -236625  -14310     710   13892  282211 
## 
## Coefficients:
##                             Estimate     Std. Error t value
## (Intercept)           -377992.372327  142282.686136  -2.657
## GrLivArea                  71.878798       3.201212  22.454
## TotalBsmtSF                57.628181       4.712041  12.230
## LotArea                     0.609556       0.096510   6.316
## GarageArea                 16.418702       9.073949   1.809
## NeighborhoodBlueste     -1109.512065   23497.567838  -0.047
## NeighborhoodBrDale       -958.312115   11735.038930  -0.082
## NeighborhoodBrkSide     17583.950101    9913.298649   1.774
## NeighborhoodClearCr     11277.380097   10320.706445   1.093
## NeighborhoodCollgCr     16935.426841    8176.931521   2.071
## NeighborhoodCrawfor     37786.341599    9672.330203   3.907
## NeighborhoodEdwards     -1171.472465    9085.007958  -0.129
## NeighborhoodGilbert     10866.643262    8596.286851   1.264
## NeighborhoodIDOTRR       3009.245320   10585.359257   0.284
## NeighborhoodMeadowV     -6652.978812   11336.259114  -0.587
## NeighborhoodMitchel       578.385179    9288.420842   0.062
## NeighborhoodNAmes        6741.502866    8694.219399   0.775
## NeighborhoodNoRidge     75695.510049    9462.053591   8.000
## NeighborhoodNPkVill      6919.906488   13175.760595   0.525
## NeighborhoodNridgHt     32654.885648    8825.311821   3.700
## NeighborhoodNWAmes       5523.006891    8909.184759   0.620
## NeighborhoodOldTown      -305.338087    9650.436864  -0.032
## NeighborhoodSawyer       5620.883903    9131.484560   0.616
## NeighborhoodSawyerW     14027.805029    8865.825699   1.582
## NeighborhoodSomerst     24053.968481    8515.308006   2.825
## NeighborhoodStoneBr     63115.035219    9960.659785   6.336
## NeighborhoodSWISU        -384.255014   11129.971822  -0.035
## NeighborhoodTimber      14956.938404    9395.312184   1.592
## NeighborhoodVeenker     38961.777696   12198.305181   3.194
## YearBuilt                 246.820249      70.697294   3.491
## YearsSinceRemod          -300.245411      60.659574  -4.950
## KitchenQualFa          -42573.026487    7531.793003  -5.652
## KitchenQualGd          -34238.042327    4309.920063  -7.944
## KitchenQualTA          -40596.605374    4843.115388  -8.382
## WoodDeckSF                 23.534965       7.151939   3.291
## GarageCars               4735.900002    2711.904799   1.746
## ExterQualFa            -46709.359725   11584.465977  -4.032
## ExterQualGd            -32065.276924    5946.729001  -5.392
## ExterQualTA            -35524.482013    6503.823563  -5.462
## Fireplaces               8907.003576    1590.723691   5.599
## BsmtQualFa             -38510.801345    7542.480578  -5.106
## BsmtQualGd             -37107.519633    4071.244774  -9.115
## BsmtQualTA             -36703.494372    4929.067806  -7.446
## MasVnrArea                 12.533570       5.794510   2.163
## GrLivArea:TotalBsmtSF      -0.019667       0.001656 -11.874
##                                   Pr(>|t|)    
## (Intercept)                       0.007984 ** 
## GrLivArea             < 0.0000000000000002 ***
## TotalBsmtSF           < 0.0000000000000002 ***
## LotArea                0.00000000036194846 ***
## GarageArea                        0.070603 .  
## NeighborhoodBlueste               0.962346    
## NeighborhoodBrDale                0.934927    
## NeighborhoodBrkSide               0.076323 .  
## NeighborhoodClearCr               0.274720    
## NeighborhoodCollgCr               0.038534 *  
## NeighborhoodCrawfor    0.00009815094442356 ***
## NeighborhoodEdwards               0.897420    
## NeighborhoodGilbert               0.206406    
## NeighborhoodIDOTRR                0.776236    
## NeighborhoodMeadowV               0.557384    
## NeighborhoodMitchel               0.950357    
## NeighborhoodNAmes                 0.438237    
## NeighborhoodNoRidge    0.00000000000000263 ***
## NeighborhoodNPkVill               0.599529    
## NeighborhoodNridgHt               0.000224 ***
## NeighborhoodNWAmes                0.535412    
## NeighborhoodOldTown               0.974764    
## NeighborhoodSawyer                0.538294    
## NeighborhoodSawyerW               0.113827    
## NeighborhoodSomerst               0.004800 ** 
## NeighborhoodStoneBr    0.00000000031831850 ***
## NeighborhoodSWISU                 0.972464    
## NeighborhoodTimber                0.111625    
## NeighborhoodVeenker               0.001435 ** 
## YearBuilt                         0.000496 ***
## YearsSinceRemod        0.00000083559708956 ***
## KitchenQualFa          0.00000001922066606 ***
## KitchenQualGd          0.00000000000000405 ***
## KitchenQualTA         < 0.0000000000000002 ***
## WoodDeckSF                        0.001025 ** 
## GarageCars                        0.080977 .  
## ExterQualFa            0.00005833417842443 ***
## ExterQualGd            0.00000008193012827 ***
## ExterQualTA            0.00000005581621460 ***
## Fireplaces             0.00000002596352784 ***
## BsmtQualFa             0.00000037568023232 ***
## BsmtQualGd            < 0.0000000000000002 ***
## BsmtQualTA             0.00000000000016902 ***
## MasVnrArea                        0.030713 *  
## GrLivArea:TotalBsmtSF < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31050 on 1370 degrees of freedom
##   (45 observations deleted due to missingness)
## Multiple R-squared:  0.8512, Adjusted R-squared:  0.8464 
## F-statistic: 178.2 on 44 and 1370 DF,  p-value: < 0.00000000000000022

I started with just a few variables (GrLivArea through YearsSinceRemod), but the performance was too low, around 62% according to the $R^2$. So I added and removed variables until the remaining terms were significant (preferably extremely significant, with ***); GarageArea and GarageCars are only marginal, at the 0.1 level. In the case of Neighborhood, only some neighborhoods were significant, but dropping the variable cost about 3% overall, so I decided to keep it in rather than add more quantitative variables. This model is a bit larger than I had hoped initially, and it is not fit against the squared home price as suggested by the author; however, it gives a solid performance. So let's look at some plots and see if it is in fact doing as well as I hoped it would.
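One way to judge a multi-level factor such as Neighborhood as a whole, rather than level by level, is to compare the models with and without it. The sketch below does this on R's built-in `mtcars` data (not the housing data), using adjusted $R^2$ and a partial F-test from `anova()`. The variable choices are placeholders, with `cyl` standing in for the factor.

```r
# Built-in data stands in for the housing set; cyl plays the role of the factor
cars <- transform(mtcars, cyl = factor(cyl))

small <- lm(mpg ~ wt + hp, data = cars)
big   <- lm(mpg ~ wt + hp + cyl, data = cars)

# Adjusted R^2 penalizes the extra coefficients the factor brings in
adj_r2 <- c(small = summary(small)$adj.r.squared,
            big   = summary(big)$adj.r.squared)
print(round(adj_r2, 3))

# Partial F-test: does the factor, taken as a whole, improve the fit?
cmp <- anova(small, big)
print(cmp)
```

The F-test gives a single p-value for the whole factor, which avoids reading significance off individual levels that are each compared only against the baseline level.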

# Residuals against fitted values
plot(fitted(fit), resid(fit))

# Normal Q-Q plot of the residuals with a reference line
qqnorm(resid(fit))
qqline(resid(fit))

From the charts above there is clearly a lack of adherence at the upper end of home prices, and the residuals and fitted values are less consistent at the high end as well. But they seem to be solid in the most common price range, which leads me to believe there might be two mechanisms at work here. I will settle on this particular model, as adding new variables is not improving the base performance enough to justify the added complexity, which is already pretty high.
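One rough way to put a number on that impression is to compare the spread of the residuals for low and high fitted values. The sketch below uses simulated data in which the noise grows with the predictor by construction; the variable names are invented for the example.

```r
# Simulated data: the noise standard deviation grows with x by construction
set.seed(7)
x <- runif(300, 1, 10)
y <- 3 * x + rnorm(300, sd = 0.5 * x)
m <- lm(y ~ x)

# Split observations at the median fitted value and compare residual spread
upper_half <- fitted(m) > median(fitted(m))
spread <- tapply(resid(m), upper_half, sd)
print(spread)
```

A clearly larger standard deviation in the upper half would support the reading that the model is less reliable for expensive homes.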

preds<-  predict(fit, validation)
difference <- preds - validation$SalePrice
plot(difference)

From this plot you can see a little bit of wildness, but the core of the predictions is in a tight band. At this point I think it is time to run the test set, submit to Kaggle and hope for the best!
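The spread in a plot like this can also be summarized in a couple of numbers. Here is a minimal sketch with made-up predicted and actual prices; the vectors are hypothetical, and the `NA` mimics a row where a predictor was missing so that `predict()` returned no value.

```r
# Hypothetical predicted and actual prices; NA mimics a row with a missing predictor
pred   <- c(200000, 150000, NA, 310000)
actual <- c(210000, 140000, 180000, 300000)
err <- pred - actual

# Root mean squared error and mean absolute error, skipping the NA row
rmse <- sqrt(mean(err^2, na.rm = TRUE))
mae  <- mean(abs(err), na.rm = TRUE)
print(c(rmse = rmse, mae = mae))
```

Both values are in dollars, so they are easy to compare with typical home prices.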

Making Test Predictions

test <- read.csv('test.csv')
test[,c('LotArea','TotalBsmtSF','GrLivArea', 'GarageArea')]<-lapply(test[,c('LotArea','TotalBsmtSF','GrLivArea', 'GarageArea')], as.numeric)
test['YearsSinceRemod'] <- test['YrSold'] - test['YearRemodAdd']
preds_2<-  predict(fit, test)
kaggle<- data.frame(test$Id, preds_2)
colnames(kaggle)<- c('Id', 'SalePrice')
write.csv(kaggle, 'submission.csv', row.names = FALSE)
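Because `predict()` returns `NA` for any row where a predictor used by the model is missing, it can be worth checking the predictions before writing the file. The sketch below uses a small hypothetical submission frame, and the median fill is just one simple fallback chosen for illustration, not something done in the steps above.

```r
# Hypothetical submission frame; the NA stands for a test row with a missing predictor
sub <- data.frame(Id = 1461:1465,
                  SalePrice = c(120000, NA, 185000, 240000, 160000))

# Count the predictions that came back missing
n_missing <- sum(is.na(sub$SalePrice))
print(n_missing)

# One simple (assumed) fallback: fill gaps with the median of the other predictions
sub$SalePrice[is.na(sub$SalePrice)] <- median(sub$SalePrice, na.rm = TRUE)
print(sub)
```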

Kaggle Results: Number 519 with a score of .20458. I have NO idea what that means… I am guessing it means I still have work to do!
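For reference, the competition's evaluation page describes the score as the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, so lower is better. A minimal sketch of that calculation follows, using made-up numbers and a hypothetical helper named `rmsle`.

```r
# Hypothetical helper: RMSE computed on the log scale
rmsle <- function(pred, actual) {
  sqrt(mean((log(pred) - log(actual))^2))
}

# Made-up predicted and actual prices
pred   <- c(200000, 150000, 310000)
actual <- c(210000, 140000, 300000)
score <- rmsle(pred, actual)
print(score)
```

On this scale, an error of about 0.2 corresponds to being off by a factor of roughly exp(0.2), or about 1.22, so the metric measures relative error rather than error in dollars.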

Thanks for everything, Doc!

Bethany

Data 605 Computational Mathematics Final

Bethany Poulin - May 20, 2018