Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = \frac{N+1}{2}\).

set.seed(1234)

N = 10
X = runif(10000, 1, N)              # 10,000 uniform draws between 1 and N
Y = rnorm(10000, (N+1)/2, (N+1)/2)  # 10,000 normal draws with mean = sd = (N+1)/2

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter x is estimated as the median of the X variable, and the small letter y is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

x = median(X)
y = quantile(Y, 0.25)

Part 1

5 points. a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)

a. P(X>x | X>y)

\[ P(X>x | X>y) = \frac{P(X>x \cap X>y)}{P(X>y)} \]

prob_Xy = length(X[X > y]) / length(X)             # P(X > y)
prob_Xx_Xy = length(X[X > y & X > x]) / length(X)  # P(X > x and X > y)
prob_Xx_Xy / prob_Xy
## [1] 0.5543237

There is a 55.43% probability that X would be greater than the median of the X variable, given that X is greater than the first quartile of the Y variable.
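
As a rough sanity check (a sketch, not part of the assignment), the same conditional probability can be recovered from the theoretical Uniform(1, N) CDF: since y falls well below x here, \(\{X>x\}\subset\{X>y\}\) and the conditional probability reduces to \(P(X>x)/P(X>y)\).

# Sketch: theoretical check using the Uniform(1, N) CDF. Because y < x here,
# P(X > x | X > y) = P(X > x) / P(X > y); this should land close to the
# empirical 0.5543 above.
(1 - punif(x, min = 1, max = N)) / (1 - punif(y, min = 1, max = N))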

b. P(X>x, Y>y)

\[ P(X>x, Y>y) = P(X>x \cap Y>y) \]

prob_Xx_Yy = length(which(X > x & Y > y)) / length(X)
prob_Xx_Yy
## [1] 0.3738

There is a 37.38% probability that X is greater than its median and that Y is greater than its first quartile.

c. P(X<x | X>y)

\[ P(X<x | X>y) = \frac{P(X<x \cap X>y)}{P(X>y)} \]

prob_Xy = length(X[X > y]) / length(X)             # P(X > y)
prob_Xx_Xy = length(X[X > y & X < x]) / length(X)  # P(X < x and X > y)
prob_Xx_Xy / prob_Xy
## [1] 0.4456763

There is a 44.57% probability that X would be less than the median of the X variable, given that X is greater than the first quartile of the Y variable.

Part 2

5 points. Investigate whether \(P(X>x\text{ and }Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.

probtable <- data.frame(c(length(which(X > x & Y > y)) / 10000,
                            length(which(X > x & Y <= y)) / 10000),
                          c(length(which(X <= x & Y > y)) / 10000,
                            length(which(X <= x & Y <= y)) / 10000)) %>%
  mutate(Total = rowSums(.)) %>%
  rbind(., colSums(.)) %>%
  set_colnames(c('$X>x$', '$X\\leq x$', 'Total')) %>%
  set_rownames(c('$Y>y$', '$Y\\leq y$', 'Total')) %>% 
  kable(escape = F) %>% 
  kable_styling(latex_options = "hold_position") %>%
  kable_classic(full_width = F)
probtable
              \(X>x\)   \(X\leq x\)   Total
\(Y>y\)       0.3738    0.3762        0.75
\(Y\leq y\)   0.1262    0.1238        0.25
Total         0.5000    0.5000        1.00
prob_Xx = length(X[X > x]) / length(X)
prob_Yy = length(Y[Y > y]) / length(Y)
prob_Xx * prob_Yy
## [1] 0.375

As seen here, the joint probability \(P(X>x, Y>y)\) is 0.3738, while the product of the marginal probabilities, \(P(X>x)\cdot P(Y>y) = 0.5 \times 0.75\), is 0.375. They are close in value but not exactly equal.

Part 3

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

For both Fisher’s Exact Test and the Chi Square Test, the hypotheses are as follows:

  • \(H_0\): The events \(X>x\) and \(Y>y\) are independent
  • \(H_1\): The events \(X>x\) and \(Y>y\) are not independent
table(X > x, Y > y) %>%
  fisher.test()
## 
##  Fisher's Exact Test for Count Data
## 
## data:  .
## p-value = 0.5953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8894241 1.0682094
## sample estimates:
## odds ratio 
##  0.9747199
table(X > x, Y > y) %>%
  chisq.test(.)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 0.28213, df = 1, p-value = 0.5953

The p-value for both tests is 0.5953. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. According to both Fisher’s Exact Test and the Chi Square Test, there is no evidence against the independence of the events \(X>x\) and \(Y>y\).

Fisher’s Exact Test is better suited to small samples and produces an exact test of independence. The Chi Square Test, on the other hand, only produces an approximation; it is intended for larger data sets, and its accuracy improves as the sample size grows. It can be unreliable when the sample is too small.

Given that Fisher’s Exact Test can be computationally expensive with large samples, the Chi Square Test would be preferable here. However, since both tests produced the same result, either test is appropriate.
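
To illustrate the small-sample point, here is a minimal sketch with made-up counts: on a sparse 2x2 table, chisq.test() warns that its approximation may be incorrect, while fisher.test() still returns an exact p-value.

# Sketch with made-up counts, purely to illustrate the small-sample behavior
small_tab <- matrix(c(3, 1, 2, 5), nrow = 2)
chisq.test(small_tab)   # warns that the chi-squared approximation may be incorrect
fisher.test(small_tab)  # exact p-value, no large-sample approximation needed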

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. I want you to do the following.

train <- read.csv("train.csv", stringsAsFactors = TRUE) %>%
  #converts to the appropriate type
  type.convert(.) %>%
  #convert into factors
  mutate(OverallQual = as.factor(OverallQual),
         OverallCond = as.factor(OverallCond),
         MSSubClass = as.factor(MSSubClass))
test <- read.csv("test.csv", stringsAsFactors = TRUE) %>%
  #converts to the appropriate type
  type.convert(.) %>%
  #convert into factors
  mutate(OverallQual = as.factor(OverallQual),
         OverallCond = as.factor(OverallCond),
         MSSubClass = as.factor(MSSubClass))

Part 1

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset.
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis.
Would you be worried about familywise error? Why or why not?

The data set describes houses sold between 2006 and 2010 in Ames, Iowa. It has 80 predictor variables, the response variable is the sale price, and there are 1460 observations in the training data set.

Summary

summary(train)
##        Id           MSSubClass     MSZoning     LotFrontage    
##  Min.   :   1.0   20     :536   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   60     :299   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   50     :144   RH     :  16   Median : 69.00  
##  Mean   : 730.5   120    : 87   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   30     : 69   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   160    : 63                  Max.   :313.00  
##                   (Other):262                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour  Utilities   
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63    AllPub:1459  
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50    NoSeWa:   1  
##  Median :  9478               NA's:1369   IR3: 10   Low:  36                 
##  Mean   : 10517                           Reg:925   Lvl:1311                 
##  3rd Qu.: 11602                                                              
##  Max.   :215245                                                              
##                                                                              
##    LotConfig    LandSlope   Neighborhood   Condition1     Condition2  
##  Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260   Norm   :1445  
##  CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81   Feedr  :   6  
##  FR2    :  47   Sev:  13   OldTown:113   Artery :  48   Artery :   2  
##  FR3    :   4              Edwards:100   RRAn   :  26   PosN   :   2  
##  Inside :1052              Somerst: 86   PosN   :  19   RRNn   :   2  
##                            Gilbert: 79   RRAe   :  11   PosA   :   1  
##                            (Other):707   (Other):  15   (Other):   2  
##    BldgType      HouseStyle   OverallQual   OverallCond    YearBuilt   
##  1Fam  :1220   1Story :726   5      :397   5      :821   Min.   :1872  
##  2fmCon:  31   2Story :445   6      :374   6      :252   1st Qu.:1954  
##  Duplex:  52   1.5Fin :154   7      :319   7      :205   Median :1973  
##  Twnhs :  43   SLvl   : 65   8      :168   8      : 72   Mean   :1971  
##  TwnhsE: 114   SFoyer : 37   4      :116   4      : 57   3rd Qu.:2000  
##                1.5Unf : 14   9      : 43   3      : 25   Max.   :2010  
##                (Other): 19   (Other): 43   (Other): 28                 
##   YearRemodAdd    RoofStyle       RoofMatl     Exterior1st   Exterior2nd 
##  Min.   :1950   Flat   :  13   CompShg:1434   VinylSd:515   VinylSd:504  
##  1st Qu.:1967   Gable  :1141   Tar&Grv:  11   HdBoard:222   MetalSd:214  
##  Median :1994   Gambrel:  11   WdShngl:   6   MetalSd:220   HdBoard:207  
##  Mean   :1985   Hip    : 286   WdShake:   5   Wd Sdng:206   Wd Sdng:197  
##  3rd Qu.:2004   Mansard:   7   ClyTile:   1   Plywood:108   Plywood:142  
##  Max.   :2010   Shed   :   2   Membran:   1   CemntBd: 61   CmentBd: 60  
##                                (Other):   2   (Other):128   (Other):136  
##    MasVnrType    MasVnrArea     ExterQual ExterCond  Foundation  BsmtQual  
##  BrkCmn : 15   Min.   :   0.0   Ex: 52    Ex:   3   BrkTil:146   Ex  :121  
##  BrkFace:445   1st Qu.:   0.0   Fa: 14    Fa:  28   CBlock:634   Fa  : 35  
##  None   :864   Median :   0.0   Gd:488    Gd: 146   PConc :647   Gd  :618  
##  Stone  :128   Mean   : 103.7   TA:906    Po:   1   Slab  : 24   TA  :649  
##  NA's   :  8   3rd Qu.: 166.0             TA:1282   Stone :  6   NA's: 37  
##                Max.   :1600.0                       Wood  :  3             
##                NA's   :8                                                   
##  BsmtCond    BsmtExposure BsmtFinType1   BsmtFinSF1     BsmtFinType2
##  Fa  :  45   Av  :221     ALQ :220     Min.   :   0.0   ALQ :  19   
##  Gd  :  65   Gd  :134     BLQ :148     1st Qu.:   0.0   BLQ :  33   
##  Po  :   2   Mn  :114     GLQ :418     Median : 383.5   GLQ :  14   
##  TA  :1311   No  :953     LwQ : 74     Mean   : 443.6   LwQ :  46   
##  NA's:  37   NA's: 38     Rec :133     3rd Qu.: 712.2   Rec :  54   
##                           Unf :430     Max.   :5644.0   Unf :1256   
##                           NA's: 37                      NA's:  38   
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741   
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49   
##  Median :   0.00   Median : 477.5   Median : 991.5   GasW :  18   Gd:241   
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1   
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428   
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0   Wall :   4            
##                                                                            
##  CentralAir Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  N:  95     FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  Y:1365     FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##             FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##             Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##             SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##             NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                         
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual  TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100      Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39      1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586      Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735      Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000               3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000               Max.   :14.000  
##                                                                               
##  Functional    Fireplaces    FireplaceQu   GarageType   GarageYrBlt  
##  Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6   Min.   :1900  
##  Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870   1st Qu.:1961  
##  Min1:  31   Median :1.000   Gd  :380    Basment: 19   Median :1980  
##  Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88   Mean   :1979  
##  Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9   3rd Qu.:2002  
##  Sev :   1   Max.   :3.000   NA's:690    Detchd :387   Max.   :2010  
##  Typ :1360                               NA's   : 81   NA's   :81    
##  GarageFinish   GarageCars      GarageArea     GarageQual  GarageCond 
##  Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3   Ex  :   2  
##  RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48   Fa  :  35  
##  Unf :605     Median :2.000   Median : 480.0   Gd  :  14   Gd  :   9  
##  NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3   Po  :   7  
##               3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311   TA  :1326  
##               Max.   :4.000   Max.   :1418.0   NA's:  81   NA's:  81  
##                                                                       
##  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Y:1340     Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##             Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##             3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##             Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                                
##   ScreenPorch        PoolArea        PoolQC       Fence      MiscFeature
##  Min.   :  0.00   Min.   :  0.000   Ex  :   2   GdPrv:  59   Gar2:   2  
##  1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2   GdWo :  54   Othr:   2  
##  Median :  0.00   Median :  0.000   Gd  :   3   MnPrv: 157   Shed:  49  
##  Mean   : 15.06   Mean   :  2.759   NA's:1453   MnWw :  11   TenC:   1  
##  3rd Qu.:  0.00   3rd Qu.:  0.000               NA's :1179   NA's:1406  
##  Max.   :480.00   Max.   :738.000                                       
##                                                                         
##     MiscVal             MoSold           YrSold        SaleType   
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :1267  
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   New    : 122  
##  Median :    0.00   Median : 6.000   Median :2008   COD    :  43  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008   ConLD  :   9  
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   ConLI  :   5  
##  Max.   :15500.00   Max.   :12.000   Max.   :2010   ConLw  :   5  
##                                                     (Other):   9  
##  SaleCondition    SalePrice     
##  Abnorml: 101   Min.   : 34900  
##  AdjLand:   4   1st Qu.:129975  
##  Alloca :  12   Median :163000  
##  Family :  20   Mean   :180921  
##  Normal :1198   3rd Qu.:214000  
##  Partial: 125   Max.   :755000  
## 

Variable Plots

Here are some interesting plots for the training data set.

  • It can be seen that as the combined above-ground and basement square footage increases, the sale price increases as well.
  • As the overall quality of the house increases (which rates the overall material and finish of the house), so does the sale price.
  • The sale price tends to increase as garage capacity grows from 0 to 3 cars, but tends to be lower for 4-car garages. When there are 3 cars in the garage, there is more variation in the sale price.
  • The sale price is higher for homes that have central air conditioning.
  • The sale prices of homes on gentle slopes tend to be right-skewed and show greater variation than those on moderate and severe slopes, which show steadier sale prices.
  • When it comes to land conditions, homes located near a busy street such as a highway, or near a railroad, tend to sell for less than homes in normal locations or near positive off-site features such as parks.
  • As the quality of the exterior decreases, home prices decrease.
  • Generally, as the number of bathrooms increases, the value of the home increases too.
  • Newer homes tend to sell for more, and prices decrease as a home ages toward 100 years.
  • Normal sales tend to sell for more than abnormal sales (foreclosures, trades), sales between family members, and allocations. Partial sales had the highest average price, but that category is associated with new homes.
  • New homes sell for more on average, with great variability. Homes sold with a conventional warranty deed also show the greatest variability, and their price distribution is right-skewed.
train %>%
  mutate(TotalSqFt = GrLivArea + TotalBsmtSF) %>%
  ggplot(., aes(x = TotalSqFt, y = SalePrice)) +
  geom_point() + 
  geom_smooth() +
  ggtitle("Total Square Footage vs Sales Price") +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  ggplot(., aes(x = OverallQual, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Distributions of Overall Quality vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  ggplot(., aes(x = OverallCond, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Distributions of Overall Condition vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  mutate(GarageCars = as.factor(GarageCars)) %>%
  ggplot(., aes(x = GarageCars, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Distributions of Amount of Cars in Garage vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  ggplot(., aes(x = CentralAir, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Central Air Conditioning vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  mutate(LandSlope = recode(LandSlope, `Gtl` = "Gentle",
                           `Mod` = "Moderate",
                           `Sev` = "Severe")) %>%
  ggplot(., aes(x = LandSlope, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Land Slope vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma()) 

train %>%
  mutate(Condition = ifelse(Condition1 == "PosN" | Condition1 == "PosA", "positive off-site",
                             ifelse(Condition1 == "Norm", "normal" ,
                                    ifelse(Condition1 == "Artery" | Condition1 == "Feedr", "busy street",
                                           "railroad")))) %>%
  ggplot(., aes(x = Condition, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Proximity to Various Land Features vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma()) 

train %>%
  mutate(ExterQual = fct_relevel(ExterQual, "Ex", "Gd", "TA", "Fa"),
         ExterQual = recode(ExterQual, `Ex` = "Excellent",
                           `TA` = "Average",
                           `Fa` = "Fair",
                           `Gd` = "Good")) %>%
  ggplot(., aes(x = ExterQual, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Exterior Quality vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  mutate(TotalBath = BsmtFullBath + 0.5 * BsmtHalfBath + FullBath + 0.5 * HalfBath) %>%
  ggplot(., aes(x = TotalBath, y = SalePrice)) +
  geom_point() + 
  geom_smooth() +
  ggtitle("Total Bathrooms vs Sales Price") +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  mutate(Age = YrSold - YearBuilt) %>%
  ggplot(., aes(x = Age, y = SalePrice)) +
  geom_point() + 
  geom_smooth() +
  ggtitle("Age of House vs Sales Price") +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  ggplot(., aes(x = SaleCondition, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Condition of Sale vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma())

train %>%
  ggplot(., aes(x = SaleType, y = SalePrice)) + 
  geom_boxplot() +
  labs(title = 'Type of Sale vs Sales Price') +
  scale_y_continuous(labels = scales::label_comma())

Scatterplot Matrix

As seen in the scatterplots, the sale price generally increases as the square footage of the above-ground living area (GrLivArea), the basement (TotalBsmtSF), and the garage (GarageArea) increases. There is also an increasing relationship between the above-ground and basement square footage. There also appear to be outliers with extremely large areas. It should be noted that the compiler of the data set recommends removing houses with more than 4,000 sq. ft. of above-ground living area for modeling purposes.

pairs(~SalePrice + GrLivArea + TotalBsmtSF + GarageArea, data = train, lwd =0.5)

Correlation Matrix

correlation_table <- train %>%
  dplyr::select(SalePrice, GrLivArea, GarageArea) %>%
  cor(., method = "pearson")

correlation_table %>% 
  kable(escape = F) %>% 
  kable_styling(latex_options = "hold_position") %>%
  kable_classic(full_width = F)
             SalePrice   GrLivArea   GarageArea
SalePrice    1.0000000   0.7086245   0.6234314
GrLivArea    0.7086245   1.0000000   0.4689975
GarageArea   0.6234314   0.4689975   1.0000000

Testing the correlations between each pairwise set of variables

  • \(H_0\): The correlation is 0
  • \(H_1\): The correlation is not 0
cor.test(~ SalePrice + GrLivArea, data = train, method = "pearson", conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  SalePrice and GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245
cor.test(~ SalePrice + GarageArea, data = train, method = "pearson", conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  SalePrice and GarageArea
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6024756 0.6435283
## sample estimates:
##       cor 
## 0.6234314
cor.test(~ GarageArea + GrLivArea, data = train, method = "pearson", conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  GarageArea and GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975

Zero is not contained in any of the confidence intervals, and all three p-values are far below any conventional significance level (e.g., 0.05), which means the correlation between each pairwise set of variables is not 0. There are indeed relationships between these pairs that may be of further interest.

The familywise error rate is the probability of making at least one Type I error (incorrectly rejecting a null hypothesis) across a family of hypothesis tests. I would not be worried in this case: the plots above already suggested relationships among these variables, and the p-values are so close to 0 that we would still reject the null hypotheses even if the significance level were lowered to correct for multiple comparisons.
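
If familywise error were a concern, a simple guard is a Bonferroni correction across the three correlation tests; a quick sketch (reusing the same cor.test() calls) shows the adjusted p-values would still be effectively zero.

# Sketch: Bonferroni-adjusted p-values for the three pairwise correlation tests
pvals <- c(cor.test(~ SalePrice + GrLivArea, data = train)$p.value,
           cor.test(~ SalePrice + GarageArea, data = train)$p.value,
           cor.test(~ GarageArea + GrLivArea, data = train)$p.value)
p.adjust(pvals, method = "bonferroni")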

Part 2

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

# inverse of matrix
precision_mat <- solve(correlation_table)
precision_mat %>% 
  kable(escape = F, caption = "Precision Matrix") %>% 
  kable_styling(latex_options = "hold_position") %>%
  kable_classic(full_width = F)
Precision Matrix
             SalePrice    GrLivArea    GarageArea
SalePrice     2.5692028   -1.3709485   -0.9587504
GrLivArea    -1.3709485    2.0135330   -0.0896496
GarageArea   -0.9587504   -0.0896496    1.6397606
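
As a side check (a sketch, not required by the prompt), each diagonal entry of the precision matrix should equal \(1/(1-R^2)\) from regressing that variable on the other two, i.e. its variance inflation factor.

# Sketch: the SalePrice diagonal entry of the precision matrix should match
# 1 / (1 - R^2) from regressing SalePrice on the other two variables (its VIF)
r2 <- summary(lm(SalePrice ~ GrLivArea + GarageArea, data = train))$r.squared
1 / (1 - r2)   # compare with precision_mat["SalePrice", "SalePrice"]
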
# Multiply the correlation matrix by the precision matrix
corr_prec <- correlation_table %*% precision_mat
corr_prec %>% 
  kable(escape = F, caption = "Correlation Matrix $\\cdot$ Precision Matrix", digits = 24) %>% 
  kable_styling(latex_options = "hold_position") %>%
  kable_classic(full_width = F)
Correlation Matrix \(\cdot\) Precision Matrix
             SalePrice      GrLivArea      GarageArea
SalePrice    1.000000e+00   4.163336e-17   0
GrLivArea    2.220446e-16   1.000000e+00   0
GarageArea   3.330669e-16   8.326673e-17   1
# multiply the precision matrix by the correlation matrix
prec_corr <- precision_mat %*% correlation_table
prec_corr %>% 
  kable(escape = F, caption = "Precision Matrix $\\cdot$ Correlation Matrix", digits = 24 ) %>% 
  kable_styling(latex_options = "hold_position") %>%
  kable_classic(full_width = F)
Precision Matrix \(\cdot\) Correlation Matrix
             SalePrice       GrLivArea       GarageArea
SalePrice     1.000000e+00    4.440892e-16    4.440892e-16
GrLivArea    -1.804112e-16    1.000000e+00   -1.387779e-16
GarageArea    0.000000e+00    0.000000e+00    1.000000e+00
# When multiplying by the inverse, the result should be identity matrix
# when rounded the products are equal
round(corr_prec, 15) == round(prec_corr, 15)
##            SalePrice GrLivArea GarageArea
## SalePrice       TRUE      TRUE       TRUE
## GrLivArea       TRUE      TRUE       TRUE
## GarageArea      TRUE      TRUE       TRUE
#Conduct LU decomposition on the matrix (lu.decomposition() is from the matrixcalc package)
clu <- lu.decomposition(correlation_table)
clu
## $L
##           [,1]       [,2] [,3]
## [1,] 1.0000000 0.00000000    0
## [2,] 0.7086245 1.00000000    0
## [3,] 0.6234314 0.05467234    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.7086245 0.6234314
## [2,]    0 0.4978513 0.0272187
## [3,]    0 0.0000000 0.6098451
# check to see if A=LU
correlation_table == clu$L %*% clu$U
##            SalePrice GrLivArea GarageArea
## SalePrice       TRUE      TRUE       TRUE
## GrLivArea       TRUE      TRUE       TRUE
## GarageArea      TRUE      TRUE       TRUE
plu <- lu.decomposition(precision_mat)
plu
## $L
##            [,1]       [,2] [,3]
## [1,]  1.0000000  0.0000000    0
## [2,] -0.5336085  1.0000000    0
## [3,] -0.3731704 -0.4689975    1
## 
## $U
##          [,1]      [,2]       [,3]
## [1,] 2.569203 -1.370948 -0.9587504
## [2,] 0.000000  1.281983 -0.6012469
## [3,] 0.000000  0.000000  1.0000000
round(precision_mat,16) ==  round(plu$L %*% plu$U,16)
##            SalePrice GrLivArea GarageArea
## SalePrice       TRUE      TRUE       TRUE
## GrLivArea       TRUE      TRUE       TRUE
## GarageArea      TRUE      TRUE       TRUE
#side note: two values only come up as FALSE when rounded to the 17th digit

Part 3

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. See this. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

e1071::skewness(train$GrLivArea)
## [1] 1.363754
min(train$GrLivArea) > 0
## [1] TRUE

The skewness of GrLivArea is 1.363754 which indicates that it is highly skewed to the right. Shifting the variable is not necessary as it is strictly above zero.

Optimal Lambda and Sample Exponential distribution

exp_pdf <- fitdistr(train$GrLivArea, "exponential")
exp_pdf
##        rate    
##   6.598640e-04 
##  (1.726943e-05)
optimal_lambda <- exp_pdf$estimate %>% unname(.)

exp_sample <- rexp(1000, optimal_lambda)
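
As a quick check on the fit (a one-line sketch), the maximum-likelihood estimate of the exponential rate is simply the reciprocal of the sample mean, so it should match the rate reported by fitdistr above.

# Sketch: the MLE of the exponential rate is 1 / sample mean
1 / mean(train$GrLivArea)   # should agree with optimal_lambda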

Histogram Comparison

As seen in the histograms below, the scales on both axes change. The exponential sample is even more heavily right-skewed than the original variable: its peak sits near zero, since the exponential density is strictly decreasing, while its tail stretches much further to the right.

train %>%
  ggplot(., aes(x = GrLivArea)) +
  geom_histogram(bins = 50) + 
  ggtitle("Original Variable") +
  scale_x_continuous(labels = scales::label_comma()) +
  geom_vline(aes(xintercept = mean(GrLivArea)),col='red',size=0.5) +
  geom_vline(aes(xintercept = median(GrLivArea)),col='blue',size=0.5) +
  geom_text(aes(x=mean(GrLivArea) + 70, label="Mean", y=20), colour="red", angle=90) +
  geom_text(aes(x=median(GrLivArea) - 80, label="Median", y=35), colour="blue", angle=90)

exp_sample%>%
  as.data.frame() %>%
  ggplot(., aes(x = .)) +
  geom_histogram(bins = 50) + 
  ggtitle("Exponential Distribution") +
  scale_x_continuous(labels = scales::label_comma()) +
  geom_vline(aes(xintercept = mean(.)),col='red',size=0.5) +
  geom_vline(aes(xintercept = median(.)),col='blue',size=0.5) +
  geom_text(aes(x=mean(.) + 70, label="Mean", y=20), colour="red", angle=90) +
  geom_text(aes(x=median(.) - 120, label="Median", y=35), colour="blue", angle=90)

5th and 95th Percentile

quantile(exp_sample, c(0.05, 0.95))
##         5%        95% 
##   81.22943 4376.21881
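
The percentiles can also be taken directly from the fitted exponential CDF rather than from the simulated sample; a sketch using the quantile function \(q_p = -\ln(1-p)/\lambda\):

# Sketch: theoretical 5th and 95th percentiles of the fitted exponential,
# i.e. the inverse CDF evaluated at 0.05 and 0.95
qexp(c(0.05, 0.95), rate = optimal_lambda)
# roughly 78 and 4540 with lambda = 6.5986e-04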

95% Confidence Interval

avg = mean(train$GrLivArea)
sd = sd(train$GrLivArea)
n = length(train$GrLivArea)
z = qnorm(0.975)

c(avg - z * sd/sqrt(n), avg + z * sd/sqrt(n))
## [1] 1488.509 1542.418

The 95% confidence interval for the mean from the empirical data is (1488.509, 1542.418), centered on the sample mean of about 1515.46. The mean can be seen on the histogram as the red line.
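
An equivalent interval can be obtained with t.test(); with n = 1460 the t- and z-based intervals are virtually identical (a one-line sketch):

# Sketch: t-based 95% confidence interval for the mean of GrLivArea
t.test(train$GrLivArea)$conf.int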

Empirical 5th and 95th Percentile

quantile(train$GrLivArea, c(0.05, 0.95))
##     5%    95% 
##  848.0 2466.1

The 5th and 95th percentiles of the empirical data cover a much narrower range than those of the samples drawn from the exponential distribution with \(\lambda \approx 6.5986 \times 10^{-4}\), indicating that the fitted exponential overstates the spread of GrLivArea.

Part 4

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Prior to building a model, the data has to be cleaned, as there are many missing values. Some of the factor variables can be imputed with the value “none”, since the house may simply lack that feature and the data set’s description text states that “NA” may mean none. The remaining missing values can be imputed with the mice package, using the CART (classification and regression trees) method.

train %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  gather(variable, value) %>%
  filter(value != 0)
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
##        variable value
## 1   LotFrontage   259
## 2         Alley  1369
## 3    MasVnrType     8
## 4    MasVnrArea     8
## 5      BsmtQual    37
## 6      BsmtCond    37
## 7  BsmtExposure    38
## 8  BsmtFinType1    37
## 9  BsmtFinType2    38
## 10   Electrical     1
## 11  FireplaceQu   690
## 12   GarageType    81
## 13  GarageYrBlt    81
## 14 GarageFinish    81
## 15   GarageQual    81
## 16   GarageCond    81
## 17       PoolQC  1453
## 18        Fence  1179
## 19  MiscFeature  1406
# replacing na values
train <- train %>%
  mutate(Alley = fct_explicit_na(Alley, "none"),
         MasVnrType = fct_explicit_na(MasVnrType, "none"),
         BsmtQual = fct_explicit_na(BsmtQual, "none"),
         BsmtCond = fct_explicit_na(BsmtCond, "none"),
         BsmtExposure = fct_explicit_na(BsmtExposure, "none"),
         BsmtFinType1 = fct_explicit_na(BsmtFinType1, "none"),
         BsmtFinType2 = fct_explicit_na(BsmtFinType2, "none"),
         FireplaceQu = fct_explicit_na(FireplaceQu, "none"),
         GarageType = fct_explicit_na(GarageType, "none"),
         GarageFinish = fct_explicit_na(GarageFinish, "none"),
         GarageQual = fct_explicit_na(GarageQual, "none"),
         GarageCond = fct_explicit_na(GarageCond, "none"),
         PoolQC = fct_explicit_na(PoolQC, "none"),
         Fence = fct_explicit_na(Fence, "none"),
         MiscFeature = fct_explicit_na(MiscFeature, "none"))

train %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  gather(variable, value) %>%
  filter(value != 0)
##      variable value
## 1 LotFrontage   259
## 2  MasVnrArea     8
## 3  Electrical     1
## 4 GarageYrBlt    81
#imputing missing values with CART (mice method 'cart')
tempData <- mice(train, m=1, maxit=1, meth='cart',seed=1234)
## 
##  iter imp variable
##   1   1  LotFrontage  MasVnrArea  Electrical  GarageYrBlt
## Warning: Number of logged events: 4
imputed <- complete(tempData,1)
  • Since the variation in SalePrice increases with square footage, the natural log of SalePrice was taken.
  • Outliers were removed where the above-ground square footage exceeded 4000.
  • TotalSqFt is the sum of the above-ground and basement square footage.
  • TotalBath is the total number of bathrooms, with half bathrooms weighted by 0.5.
  • Age is the difference in years between when the house was built and when it was sold.
  • SaleCondition was recoded as “normal”, “partial”, and “other”; the “other” group covers abnormal sales, sales between family members, and similar cases, which had less effect on the variation in sale price.
  • PorchSqFt is the combined square footage of all porch and deck areas.
  • NewHome denotes whether the house is newly constructed.
  • OverallCond was regrouped into {1,2,3,4}, {5}, {6,7}, and {8,9}.
#transforming variables
transformed = imputed %>%
  filter(GrLivArea < 4000) %>%
  mutate(TotalSqFt = GrLivArea + TotalBsmtSF,
         TotalBath = BsmtFullBath + 0.5 * BsmtHalfBath + FullBath + 0.5 * HalfBath,
         Age = YrSold - YearBuilt,
         SaleCondition = ifelse(SaleCondition == "Normal", "normal", 
                                ifelse(SaleCondition == "Partial", "partial", "other")),
         PorchSqFt = ScreenPorch + X3SsnPorch + EnclosedPorch + OpenPorchSF + WoodDeckSF,
         OverallCond = ifelse(OverallCond %in% "5", "5" ,
                                    ifelse(OverallCond %in% c("1", "2" ,"3" ,"4"), "less than 5",
                                           ifelse(OverallCond %in% c("6", "7"), "6-7",
                                           "8-9"))),
         NewHome = ifelse(SaleType == 'New', 'new', 'other'))

model = lm(log(SalePrice) ~ TotalSqFt + OverallQual + Neighborhood + NewHome +
     Age + CentralAir + Fireplaces + GarageArea + TotalBath  + PorchSqFt + PoolArea +
     SaleCondition + MSZoning + BldgType + OverallCond , data = transformed)

summary(model)
## 
## Call:
## lm(formula = log(SalePrice) ~ TotalSqFt + OverallQual + Neighborhood + 
##     NewHome + Age + CentralAir + Fireplaces + GarageArea + TotalBath + 
##     PorchSqFt + PoolArea + SaleCondition + MSZoning + BldgType + 
##     OverallCond, data = transformed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69166 -0.06142  0.00437  0.06784  0.50492 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.064e+01  1.269e-01  83.908  < 2e-16 ***
## TotalSqFt               1.813e-04  7.252e-06  25.006  < 2e-16 ***
## OverallQual2            9.797e-02  1.095e-01   0.895 0.370911    
## OverallQual3            2.829e-01  8.869e-02   3.190 0.001456 ** 
## OverallQual4            3.502e-01  8.640e-02   4.054 5.32e-05 ***
## OverallQual5            3.860e-01  8.654e-02   4.461 8.82e-06 ***
## OverallQual6            4.341e-01  8.687e-02   4.997 6.57e-07 ***
## OverallQual7            5.068e-01  8.737e-02   5.800 8.17e-09 ***
## OverallQual8            5.821e-01  8.842e-02   6.583 6.49e-11 ***
## OverallQual9            7.228e-01  9.069e-02   7.970 3.26e-15 ***
## OverallQual10           7.564e-01  9.513e-02   7.951 3.77e-15 ***
## NeighborhoodBlueste    -8.858e-02  9.158e-02  -0.967 0.333558    
## NeighborhoodBrDale     -7.403e-02  4.953e-02  -1.495 0.135213    
## NeighborhoodBrkSide     1.283e-02  4.212e-02   0.305 0.760754    
## NeighborhoodClearCr     1.038e-01  4.092e-02   2.536 0.011332 *  
## NeighborhoodCollgCr     1.589e-02  3.410e-02   0.466 0.641268    
## NeighborhoodCrawfor     1.604e-01  3.885e-02   4.129 3.85e-05 ***
## NeighborhoodEdwards    -3.309e-02  3.700e-02  -0.894 0.371256    
## NeighborhoodGilbert     3.248e-02  3.592e-02   0.904 0.366150    
## NeighborhoodIDOTRR     -3.407e-02  4.853e-02  -0.702 0.482686    
## NeighborhoodMeadowV    -1.499e-01  4.738e-02  -3.164 0.001592 ** 
## NeighborhoodMitchel    -1.668e-02  3.775e-02  -0.442 0.658654    
## NeighborhoodNAmes      -8.865e-03  3.544e-02  -0.250 0.802498    
## NeighborhoodNoRidge     7.228e-02  3.847e-02   1.879 0.060488 .  
## NeighborhoodNPkVill    -4.514e-02  5.083e-02  -0.888 0.374668    
## NeighborhoodNridgHt     7.171e-02  3.464e-02   2.070 0.038628 *  
## NeighborhoodNWAmes     -4.988e-02  3.658e-02  -1.364 0.172938    
## NeighborhoodOldTown    -4.240e-02  4.374e-02  -0.969 0.332579    
## NeighborhoodSawyer     -2.206e-02  3.740e-02  -0.590 0.555393    
## NeighborhoodSawyerW     6.852e-03  3.626e-02   0.189 0.850154    
## NeighborhoodSomerst     2.913e-02  4.191e-02   0.695 0.487166    
## NeighborhoodStoneBr     1.208e-01  3.889e-02   3.105 0.001938 ** 
## NeighborhoodSWISU      -1.180e-03  4.452e-02  -0.027 0.978857    
## NeighborhoodTimber      3.687e-02  3.815e-02   0.966 0.334046    
## NeighborhoodVeenker     8.480e-02  4.787e-02   1.771 0.076719 .  
## NewHomeother           -1.376e-01  7.041e-02  -1.954 0.050860 .  
## Age                    -2.175e-03  2.590e-04  -8.398  < 2e-16 ***
## CentralAirY             7.330e-02  1.557e-02   4.709 2.74e-06 ***
## Fireplaces              3.668e-02  6.130e-03   5.983 2.77e-09 ***
## GarageArea              1.987e-04  2.027e-05   9.803  < 2e-16 ***
## TotalBath               6.410e-02  5.970e-03  10.737  < 2e-16 ***
## PorchSqFt               1.336e-04  2.303e-05   5.804 8.00e-09 ***
## PoolArea                1.664e-04  8.919e-05   1.866 0.062282 .  
## SaleConditionother     -6.775e-02  1.117e-02  -6.063 1.72e-09 ***
## SaleConditionpartial   -8.540e-02  6.964e-02  -1.226 0.220243    
## MSZoningFV              3.396e-01  5.575e-02   6.092 1.44e-09 ***
## MSZoningRH              2.947e-01  5.604e-02   5.260 1.67e-07 ***
## MSZoningRL              3.070e-01  4.681e-02   6.558 7.63e-11 ***
## MSZoningRM              2.799e-01  4.386e-02   6.383 2.36e-10 ***
## BldgType2fmCon         -6.620e-04  2.309e-02  -0.029 0.977134    
## BldgTypeDuplex         -5.707e-02  1.859e-02  -3.070 0.002180 ** 
## BldgTypeTwnhs          -9.490e-02  2.541e-02  -3.735 0.000195 ***
## BldgTypeTwnhsE         -4.295e-02  1.667e-02  -2.576 0.010110 *  
## OverallCond6-7          7.539e-02  8.887e-03   8.483  < 2e-16 ***
## OverallCond8-9          1.243e-01  1.497e-02   8.307 2.29e-16 ***
## OverallCondless than 5 -9.790e-02  1.548e-02  -6.324 3.42e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1179 on 1400 degrees of freedom
## Multiple R-squared:  0.9147, Adjusted R-squared:  0.9114 
## F-statistic: 273.1 on 55 and 1400 DF,  p-value: < 2.2e-16

Overall, the model has an adjusted R-squared of 0.9114, meaning that 91.14% of the variation in the log of the sale price is explained by the predictor variables.
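
Since Kaggle scores this competition on the root mean squared error of the log sale price, an in-sample version of that metric can be computed directly from the model residuals (a sketch; it will be optimistic relative to the leaderboard score):

# Sketch: in-sample RMSE on the log(SalePrice) scale, the metric Kaggle uses
sqrt(mean(residuals(model)^2))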

ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() + geom_hline(yintercept = 0, linetype = 'dashed') +
  geom_smooth(se = FALSE) + xlab('Fitted values') + ylab('Residuals')
ggplot(data = model, aes(x = .resid)) + geom_histogram() + xlab('Residuals')
ggplot(data = model) + stat_qq(aes(sample = .stdresid)) + geom_abline()

Testing the Model

#replacing na values
test <- test %>%
  mutate(Alley = fct_explicit_na(Alley, "none"),
         MasVnrType = fct_explicit_na(MasVnrType, "none"),
         BsmtQual = fct_explicit_na(BsmtQual, "none"),
         BsmtCond = fct_explicit_na(BsmtCond, "none"),
         BsmtExposure = fct_explicit_na(BsmtExposure, "none"),
         BsmtFinType1 = fct_explicit_na(BsmtFinType1, "none"),
         BsmtFinType2 = fct_explicit_na(BsmtFinType2, "none"),
         FireplaceQu = fct_explicit_na(FireplaceQu, "none"),
         GarageType = fct_explicit_na(GarageType, "none"),
         GarageFinish = fct_explicit_na(GarageFinish, "none"),
         GarageQual = fct_explicit_na(GarageQual, "none"),
         GarageCond = fct_explicit_na(GarageCond, "none"),
         PoolQC = fct_explicit_na(PoolQC, "none"),
         Fence = fct_explicit_na(Fence, "none"),
         MiscFeature = fct_explicit_na(MiscFeature, "none"))

#imputing missing values with CART (mice method 'cart')
tempData <- mice(test, m=1, maxit=1, meth='cart',seed=1234)
## 
##  iter imp variable
##   1   1  MSZoning  LotFrontage  Exterior1st  Exterior2nd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageYrBlt  GarageCars  GarageArea  SaleType
## Warning: Number of logged events: 18
imputed_test <- complete(tempData,1)

#transforming variables
imputed_test <- imputed_test %>%
  mutate(TotalSqFt = GrLivArea + TotalBsmtSF,
         TotalBath = BsmtFullBath + 0.5 * BsmtHalfBath + FullBath + 0.5 * HalfBath,
         Age = YrSold - YearBuilt,
         SaleCondition = ifelse(SaleCondition == "Normal", "normal", 
                                ifelse(SaleCondition == "Partial", "partial", "other")),
         PorchSqFt = ScreenPorch + X3SsnPorch + EnclosedPorch + OpenPorchSF + WoodDeckSF,
         OverallCond = ifelse(OverallCond %in% "5", "5" ,
                                    ifelse(OverallCond %in% c("1", "2" ,"3" ,"4"), "less than 5",
                                           ifelse(OverallCond %in% c("6", "7"), "6-7",
                                           "8-9"))),
         NewHome = ifelse(SaleType == 'New', 'new', 'other'))

#predicting sale price
imputed_test$SalePrice = exp(predict(model, imputed_test))

Creating CSV for submission

#creating file for submission
tosubmit <- imputed_test %>%
  dplyr::select(Id, SalePrice)
#row.names = FALSE keeps the file to the two columns (Id, SalePrice) Kaggle expects
write.csv(tosubmit, "finalmodel.csv", row.names = FALSE)
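
Before uploading, it is worth a quick sanity check (a sketch) that the submission has exactly one row per test observation and only the two columns Kaggle expects.

# Sketch: sanity-check the submission format
head(tosubmit)                 # should show only Id and SalePrice
nrow(tosubmit) == nrow(test)   # one prediction per test house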

Kaggle.com user name: llamas24 Score: 0.14102