Load necessary libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(matrixcalc)
#library(gmodels)

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=(N+1)/2\).

Definition of X and Y

set.seed(123)

N<-9

X<-runif(10000,1,N)# generate 10 000 random uniform  numbers from 1 to 9

Y<-rnorm(10000,mean =(N+1)/2)

Create a dataframe of the values

df<-data.frame(cbind(X,Y))

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

Definition of x and y

x<-median(X)
y<-quantile(Y,.25)

Find all probabilities

pA<-nrow(subset(df, X > x ))/nrow(df)
pB<-nrow(subset(df, X > y))/nrow(df)
pC<-nrow(subset(df, X < x ))/nrow(df)
pD <- nrow(subset(df, Y > y))/nrow(df)
pA_and_B<-nrow(subset(df, X > x & X > y))/nrow(df)
pC_and_B<-nrow(subset(df, X < x & X > y))/nrow(df)
pA_and_pD<-nrow(subset(df, X > x & Y > y))/nrow(df)

a. P(X>x | X>y)

pA_given_B <- pA_and_B/pB
pA_given_B

## [1] 0.8570449

P(X>x | Y>y) = .85 or 85%, which means that there is 85% probablity of X>x or X will be greater than its median value given that X is greater than the value of 1st quartile of the variable y.

b. P(X>x,Y>y)

pA_and_pD

## [1] 0.3756

P(X>x, Y>y) = .38 or 38%, which means that there is 38% probablity of X>x or X will be greater than its median value , while Y is greater than the value of 1st quartile of the variable y.

c. P(Xy)

pC_given_B <- pC_and_B/pB
pC_given_B

## [1] 0.1429551

P(Xy) = .49 or 49%, which means that there is 49% probablity of X < x or X will be smaller than its median value , given that X is greater than the value of the 1st quartile of the variable y.

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.fair

prob_tab<-c(sum(X < x & Y< y),sum(X < x  & Y > y))
prob_tab<-rbind(prob_tab,c(sum(X > x & Y < y), sum(X > x & Y< y)))
prob_tab<-cbind(prob_tab,prob_tab[,1]+prob_tab[,2])
prob_tab<-rbind(prob_tab,prob_tab[1,]+prob_tab[2,])
colnames(prob_tab)<-c('Y<y', 'Y>y', 'Total')
rownames(prob_tab)<-c('X<x', 'X>x', 'Total')
prob_tab

##        Y<y  Y>y Total
## X<x   1256 3744  5000
## X>x   1244 1244  2488
## Total 2500 4988  7488

Fisher’s Exact Test

fisher.test(table(X>x,Y>y))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(X > x, Y > y)
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9242273 1.1100187
## sample estimates:
## odds ratio 
##   1.012883

The fisher’s exact test in R by default tests whether the odds ratio associated with the first cell being 1 or not.The odds ratio is a measure of how far from independence the table is.

Chi-Square Test

Using a significance level of 0.05, if the chi-square p-value is less than 0.05 then there is very little chance that we would get the results we did if the null were true and we would reject the null hypothesis.If the calculated chi-square p-value is greater than the significance level of 0.05, then there is good chance we could get the results we did even if the null is true. This implies that we will accept the alternative hypothesis.

chisq.test(table(X>x,Y>y))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(X > x, Y > y)
## X-squared = 0.064533, df = 1, p-value = 0.7995

From research it has been gathered that Fisher’s exact test is preferable to the chi-squared test because it is an exact test. It is ideal for small sample sizes.On the other hand, the chi-squared test should be particularly avoided if there are few observations (e.g. less than 10) for individual cells.

Which is more appropriate??

Research: https://www.graphpad.com/guides/prism/7/statistics/stat_chi-square_or_fishers_test.htm?toc=0&printWindow

-https://www.datascienceblog.net/post/statistical_test/contingency_table_tests/

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following:(see main headings below)

Note:Kaggle name

My Kaggle name isJuanelle_Marks

Load dataset

train<-read.csv("/Users/juanelle/Desktop/MSDS/Spring2019/DATA605_Textbooks/DATA605 course masterials/Week 16/Final_Project/train.csv")
test_data<-read.csv("/Users/juanelle/Desktop/MSDS/Spring2019/DATA605_Textbooks/DATA605 course masterials/Week 16/Final_Project/test.csv")

#str(train)

Summary of entire dataset

summary(train)

##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  Median :  9478               NA's:1369   IR3: 10   Low:  36   
##  Mean   : 10517                           Reg:925   Lvl:1311   
##  3rd Qu.: 11602                                                
##  Max.   :215245                                                
##                                                                
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle   OverallQual    
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000  
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000  
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000  
##  PosA   :   1                 1.5Unf : 14   Max.   :10.000  
##  (Other):   2                 (Other): 19                   
##   OverallCond      YearBuilt     YearRemodAdd    RoofStyle   
##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  13  
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Gable  :1141  
##  Median :5.000   Median :1973   Median :1994   Gambrel:  11  
##  Mean   :5.575   Mean   :1971   Mean   :1985   Hip    : 286  
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   Mansard:   7  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :   2  
##                                                              
##     RoofMatl     Exterior1st   Exterior2nd    MasVnrType    MasVnrArea    
##  CompShg:1434   VinylSd:515   VinylSd:504   BrkCmn : 15   Min.   :   0.0  
##  Tar&Grv:  11   HdBoard:222   MetalSd:214   BrkFace:445   1st Qu.:   0.0  
##  WdShngl:   6   MetalSd:220   HdBoard:207   None   :864   Median :   0.0  
##  WdShake:   5   Wd Sdng:206   Wd Sdng:197   Stone  :128   Mean   : 103.7  
##  ClyTile:   1   Plywood:108   Plywood:142   NA's   :  8   3rd Qu.: 166.0  
##  Membran:   1   CemntBd: 61   CmentBd: 60                 Max.   :1600.0  
##  (Other):   2   (Other):128   (Other):136                 NA's   :8       
##  ExterQual ExterCond  Foundation  BsmtQual   BsmtCond    BsmtExposure
##  Ex: 52    Ex:   3   BrkTil:146   Ex  :121   Fa  :  45   Av  :221    
##  Fa: 14    Fa:  28   CBlock:634   Fa  : 35   Gd  :  65   Gd  :134    
##  Gd:488    Gd: 146   PConc :647   Gd  :618   Po  :   2   Mn  :114    
##  TA:906    Po:   1   Slab  : 24   TA  :649   TA  :1311   No  :953    
##            TA:1282   Stone :  6   NA's: 37   NA's:  37   NA's: 38    
##                      Wood  :  3                                      
##                                                                      
##  BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2     
##  ALQ :220     Min.   :   0.0   ALQ :  19    Min.   :   0.00  
##  BLQ :148     1st Qu.:   0.0   BLQ :  33    1st Qu.:   0.00  
##  GLQ :418     Median : 383.5   GLQ :  14    Median :   0.00  
##  LwQ : 74     Mean   : 443.6   LwQ :  46    Mean   :  46.55  
##  Rec :133     3rd Qu.: 712.2   Rec :  54    3rd Qu.:   0.00  
##  Unf :430     Max.   :5644.0   Unf :1256    Max.   :1474.00  
##  NA's: 37                      NA's:  38                     
##    BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC CentralAir
##  Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741    N:  95    
##  1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49    Y:1365    
##  Median : 477.5   Median : 991.5   GasW :  18   Gd:241              
##  Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1              
##  3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428              
##  Max.   :2336.0   Max.   :6110.0   Wall :   4                       
##                                                                     
##  Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##  Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                              
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100     
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39     
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586     
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735     
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000              
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000              
##                                                              
##   TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu   GarageType 
##  Min.   : 2.000   Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6  
##  1st Qu.: 5.000   Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870  
##  Median : 6.000   Min1:  31   Median :1.000   Gd  :380    Basment: 19  
##  Mean   : 6.518   Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88  
##  3rd Qu.: 7.000   Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9  
##  Max.   :14.000   Sev :   1   Max.   :3.000   NA's:690    Detchd :387  
##                   Typ :1360                               NA's   : 81  
##   GarageYrBlt   GarageFinish   GarageCars      GarageArea     GarageQual 
##  Min.   :1900   Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3  
##  1st Qu.:1961   RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48  
##  Median :1980   Unf :605     Median :2.000   Median : 480.0   Gd  :  14  
##  Mean   :1979   NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3  
##  3rd Qu.:2002                3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311  
##  Max.   :2010                Max.   :4.000   Max.   :1418.0   NA's:  81  
##  NA's   :81                                                              
##  GarageCond  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch   
##  Ex  :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  Fa  :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Gd  :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00  
##  Po  :   7              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
##  TA  :1326              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
##  NA's:  81              Max.   :857.00   Max.   :547.00   Max.   :552.00  
##                                                                           
##    X3SsnPorch      ScreenPorch        PoolArea        PoolQC    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex  :   2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2  
##  Median :  0.00   Median :  0.00   Median :  0.000   Gd  :   3  
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   NA's:1453  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000              
##  Max.   :508.00   Max.   :480.00   Max.   :738.000              
##                                                                 
##    Fence      MiscFeature    MiscVal             MoSold      
##  GdPrv:  59   Gar2:   2   Min.   :    0.00   Min.   : 1.000  
##  GdWo :  54   Othr:   2   1st Qu.:    0.00   1st Qu.: 5.000  
##  MnPrv: 157   Shed:  49   Median :    0.00   Median : 6.000  
##  MnWw :  11   TenC:   1   Mean   :   43.49   Mean   : 6.322  
##  NA's :1179   NA's:1406   3rd Qu.:    0.00   3rd Qu.: 8.000  
##                           Max.   :15500.00   Max.   :12.000  
##                                                              
##      YrSold        SaleType    SaleCondition    SalePrice     
##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900  
##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975  
##  Median :2008   COD    :  43   Alloca :  12   Median :163000  
##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921  
##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000  
##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000  
##                 (Other):   9

Scale down dataset to numeric variables only

numeric_data <- train %>% select_if(is.numeric)
#str(numeric_data)

Subset dataset to remove variables that have a lot of Na’s

numeric_data <- numeric_data[,c(1,2,4,7,8,10,12:14,17,28,37,38)]
str(numeric_data)

## 'data.frame':    1460 obs. of  13 variables:
##  $ Id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass  : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ LotArea     : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ YearBuilt   : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd: int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ BsmtFinSF1  : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtUnfSF   : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ X1stFlrSF   : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ GrLivArea   : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ GarageArea  : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ YrSold      : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SalePrice   : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Descriptive and inferential statistics

Univariate descriptive stats of dataset used for the problem set 2

summary(numeric_data)

##        Id           MSSubClass       LotArea         YearBuilt   
##  Min.   :   1.0   Min.   : 20.0   Min.   :  1300   Min.   :1872  
##  1st Qu.: 365.8   1st Qu.: 20.0   1st Qu.:  7554   1st Qu.:1954  
##  Median : 730.5   Median : 50.0   Median :  9478   Median :1973  
##  Mean   : 730.5   Mean   : 56.9   Mean   : 10517   Mean   :1971  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   3rd Qu.: 11602   3rd Qu.:2000  
##  Max.   :1460.0   Max.   :190.0   Max.   :215245   Max.   :2010  
##   YearRemodAdd    BsmtFinSF1       BsmtUnfSF       TotalBsmtSF    
##  Min.   :1950   Min.   :   0.0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:1967   1st Qu.:   0.0   1st Qu.: 223.0   1st Qu.: 795.8  
##  Median :1994   Median : 383.5   Median : 477.5   Median : 991.5  
##  Mean   :1985   Mean   : 443.6   Mean   : 567.2   Mean   :1057.4  
##  3rd Qu.:2004   3rd Qu.: 712.2   3rd Qu.: 808.0   3rd Qu.:1298.2  
##  Max.   :2010   Max.   :5644.0   Max.   :2336.0   Max.   :6110.0  
##    X1stFlrSF      GrLivArea      GarageArea         YrSold    
##  Min.   : 334   Min.   : 334   Min.   :   0.0   Min.   :2006  
##  1st Qu.: 882   1st Qu.:1130   1st Qu.: 334.5   1st Qu.:2007  
##  Median :1087   Median :1464   Median : 480.0   Median :2008  
##  Mean   :1163   Mean   :1515   Mean   : 473.0   Mean   :2008  
##  3rd Qu.:1391   3rd Qu.:1777   3rd Qu.: 576.0   3rd Qu.:2009  
##  Max.   :4692   Max.   :5642   Max.   :1418.0   Max.   :2010  
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000

Plot of data set

numeric_data[, 2:10] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Scatter plot matrix with at least two independent variables and 1 dependent variable.

pairs(~LotArea+ GrLivArea+ YearRemodAdd+SalePrice,data = numeric_data,col=numeric_data$SalePrice, main="Scatter matrix of two independent and one dependent Variable")

There seems to be strong correlation between the three independent variables and the dependent variable.

#Scatterplot matrix: https://www.statmethods.net/graphs/scatterplot.html
library(gclus)

## Warning: package 'gclus' was built under R version 3.5.2

## Loading required package: cluster

data <- numeric_data # get data 
data.r <- abs(cor(data )) # get correlations
data.col <- dmat.color(data.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
data.o <- order.single(data.r) 
cpairs(data, data.o, panel.colors=data.col, gap=.8,
main="Variables Ordered and Colored by Correlation" )

The gclus package provides options to rearrange the variables so that those with higher correlations are closer to the principal diagonal. It can also color code the cells to reflect the size of the correlations.

Correlation Matrix

Plot of all 12 variables used in problem set

#compuation of correlaion matrix
library(corrplot)

## corrplot 0.84 loaded

library(RColorBrewer)
M <-cor(numeric_data)
corrplot(M, type="upper", order="hclust",
         col=brewer.pal(n=8, name="RdYlBu"))

#source("http://www.sthda.com/upload/rquery_cormat.r")
#mydata <- numeric_data
#require("corrplot")
#rquery.cormat(mydata)
#

Derive a correlation matrix for any ‘three’ quantitative variables in the dataset.

The variables i would use are: GrLivArea“,”GarageArea" and “SalePrice.

corr_data <- numeric_data[, c("GrLivArea", "GarageArea", "SalePrice")]
#corr_data <- numeric_data
corr_matrix <- round(cor(corr_data),2)
corr_matrix

##            GrLivArea GarageArea SalePrice
## GrLivArea       1.00       0.47      0.71
## GarageArea      0.47       1.00      0.62
## SalePrice       0.71       0.62      1.00

Hypothesis test

cor.test(corr_data$GrLivArea, corr_data$GarageArea, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$GrLivArea and corr_data$GarageArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975

cor.test(corr_data$GrLivArea, corr_data$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$GrLivArea and corr_data$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

cor.test(corr_data$GarageArea, corr_data$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$GarageArea and corr_data$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6024756 0.6435283
## sample estimates:
##       cor 
## 0.6234314

Discussion

In all 3 tests we have a very small p value, therefore, we can reject the the null hypothesis. The true correlation is not 0 for any of the three pairs of variables.

Family wise error

The formula to estimate the familywise error rate is:

\(FWE\leq 1-(1-a_{|T}) ^c\)

Where:

\(a_{|T}\) = alpha level for an individual test (e.g. .05), c = Number of comparisons.

Source: https://www.statisticshowto.datasciencecentral.com/familywise-error-rate/

Thus

FWE <- 1-((1-0.05)^3)
FWE

## [1] 0.142625

So the probability of a family-wise error is just over 14% which is somewhat significant

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Inversion of correlation matrix

precision_matrix <- solve(corr_matrix)
precision_matrix

##              GrLivArea  GarageArea  SalePrice
## GrLivArea   2.02241876 -0.09790136 -1.3752185
## GarageArea -0.09790136  1.62917066 -0.9405758
## SalePrice  -1.37521847 -0.94057584  2.5595621

Multiplication of matrices and LU Decomposition

round(corr_matrix %*% precision_matrix, 2)

##            GrLivArea GarageArea SalePrice
## GrLivArea          1          0         0
## GarageArea         0          1         0
## SalePrice          0          0         1

round(precision_matrix %*% corr_matrix, 2)

##            GrLivArea GarageArea SalePrice
## GrLivArea          1          0         0
## GarageArea         0          1         0
## SalePrice          0          0         1

lu.decomposition(corr_matrix)

## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.47 1.0000000    0
## [3,] 0.71 0.3674753    1
## 
## $U
##      [,1]   [,2]      [,3]
## [1,]    1 0.4700 0.7100000
## [2,]    0 0.7791 0.2863000
## [3,]    0 0.0000 0.3906918

Calculus-Based Probability and Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\) )). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Variable that is skewed to the right

hist(numeric_data$X1stFlrSF, breaks = 20)

Check of minimum value

min(numeric_data$X1stFlrSF)

## [1] 334

Minimum value is above 0.

Optimal value of lambda and 1000 samples from this exponential distribution

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

shifted<- numeric_data$X1stFlrSF + 0.0000001

fit<- fitdistr(shifted, densfun = 'exponential')
lambda <- fit$estimate
sample_exp_dist <- rexp(1000, lambda)
head(sample_exp_dist)

## [1] 2889.9625  650.1676  950.8153  115.5629 2259.1572 1221.8120

Comparison Histograms

par(mfrow=c(1,2))
hist(numeric_data$X1stFlrSF, breaks = 20, main='Original')
hist(sample_exp_dist, breaks = 20, main='Exponential')

The 5th and 95th percentiles of sample using CDF

ex_pdn<-ecdf(sample_exp_dist)
sample_exp_dist[ex_pdn(sample_exp_dist)==0.05]# 5th percentile

## [1] 64.12555

sample_exp_dist[ex_pdn(sample_exp_dist)==0.95]# 95th percentile

## [1] 3271.511

95% confidence interval from the empirical data, assuming normality.

t.test(numeric_data$X1stFlrSF,conf.level=0.95)

## 
##  One Sample t-test
## 
## data:  numeric_data$X1stFlrSF
## t = 114.91, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1142.780 1182.473
## sample estimates:
## mean of x 
##  1162.627

With 95% confidence, the mean of X1stFlrSF is between 1142.780 and 1182.473

5th and 95th percentile of the empirical data

quantile(numeric_data$X1stFlrSF,c(0.05,0.95))

##      5%     95% 
##  672.95 1831.25

The simulated data is not a good fit for the observed data in this case. The simulated exponential distribution is much more skewed than our original data.

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Multiple regression

For my model i will use the 8 variables described above. The variable SalePrice will be the dependent variable

My First Model

first_model <- lm(SalePrice ~ ., data = numeric_data )
summary(first_model )

## 
## Call:
## lm(formula = SalePrice ~ ., data = numeric_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -619839  -18147   -2733   13711  283529 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.641e+05  1.616e+06  -0.535  0.59302    
## Id           -3.305e+00  2.533e+00  -1.305  0.19216    
## MSSubClass   -1.418e+02  2.765e+01  -5.129  3.3e-07 ***
## LotArea       3.516e-01  1.154e-01   3.047  0.00235 ** 
## YearBuilt     5.110e+02  4.881e+01  10.469  < 2e-16 ***
## YearRemodAdd  5.939e+02  6.617e+01   8.975  < 2e-16 ***
## BsmtFinSF1    1.761e+01  6.990e+00   2.519  0.01187 *  
## BsmtUnfSF     9.304e-01  6.797e+00   0.137  0.89114    
## TotalBsmtSF   2.371e+01  7.846e+00   3.022  0.00256 ** 
## X1stFlrSF    -2.801e+00  5.372e+00  -0.521  0.60218    
## GrLivArea     7.208e+01  2.754e+00  26.177  < 2e-16 ***
## GarageArea    5.425e+01  6.572e+00   8.254  3.4e-16 ***
## YrSold       -6.472e+02  8.058e+02  -0.803  0.42198    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40700 on 1447 degrees of freedom
## Multiple R-squared:  0.7397, Adjusted R-squared:  0.7375 
## F-statistic: 342.7 on 12 and 1447 DF,  p-value: < 2.2e-16

In the model above, it can be seen that the F-statistic is < 2.2e-16. This means that at least one of the predictor variables is significantly related to the outcome

summary(first_model)$coefficient

##                   Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)  -8.641140e+05 1.616420e+06 -0.5345850  5.930190e-01
## Id           -3.305444e+00 2.533255e+00 -1.3048208  1.921613e-01
## MSSubClass   -1.418308e+02 2.765184e+01 -5.1291610  3.304565e-07
## LotArea       3.515733e-01 1.153700e-01  3.0473552  2.350450e-03
## YearBuilt     5.110352e+02 4.881193e+01 10.4694723  8.929786e-25
## YearRemodAdd  5.939148e+02 6.617124e+01  8.9754226  8.549581e-19
## BsmtFinSF1    1.760840e+01 6.989867e+00  2.5191320  1.187147e-02
## BsmtUnfSF     9.304323e-01 6.796904e+00  0.1368906  8.911363e-01
## TotalBsmtSF   2.370706e+01 7.845512e+00  3.0217349  2.557422e-03
## X1stFlrSF    -2.800707e+00 5.371755e+00 -0.5213766  6.021842e-01
## GrLivArea     7.208112e+01 2.753640e+00 26.1766719 5.712084e-124
## GarageArea    5.424615e+01 6.571726e+00  8.2544754  3.397671e-16
## YrSold       -6.471967e+02 8.057529e+02 -0.8032198  4.219796e-01

The t-statistic evaluates whether or not there is significant association between the predictor and the outcome variable, that is whether the beta coefficient of the predictor is significantly different from zero. It can be seen that, changing in lot area year buillt, year remodadd, bsmtfinsf1,totalbsmftsf grlivarea and garagearea are significantly associated to changes in sales price while changes in mssubclass,x1stFlrSF and yrs sold are not significantly associated with sale price.

The latter variables will be removed from the model.

Though i am not confident about this model, i will submit it to kaggle to see what score it will receive before doing an updated model.

*First Kaggle submission

Predict test data sale price

predict_price<-predict(first_model,test_data,type = 'response')
results1<-data.frame(test_data$Id,predict_price)
colnames(results1)<-c("ID","SalePrice")
results1[is.na(results1)] <- 0
head(results1)

##     ID SalePrice
## 1 1461  129324.1
## 2 1462  152994.8
## 3 1463  210739.8
## 4 1464  204244.4
## 5 1465  168028.2
## 6 1466  187763.9

summary(results1)

##        ID         SalePrice     
##  Min.   :1461   Min.   :     0  
##  1st Qu.:1826   1st Qu.:122601  
##  Median :2190   Median :165597  
##  Mean   :2190   Mean   :172488  
##  3rd Qu.:2554   3rd Qu.:215097  
##  Max.   :2919   Max.   :666834

Write result to file

write.csv(results1,file="juanelle_predicted_prices1.csv",row.names = FALSE)

Kaggle score for first model

0.49197

An updated model removing the aforementioned variables

updated_model <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + TotalBsmtSF  + GrLivArea + GarageArea , data = train)
summary(updated_model)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + TotalBsmtSF + GrLivArea + GarageArea, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633710  -17570   -3483   14170  285650 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.115e+06  1.157e+05 -18.273  < 2e-16 ***
## LotArea       4.191e-01  1.147e-01   3.655 0.000267 ***
## YearBuilt     4.850e+02  4.870e+01   9.959  < 2e-16 ***
## YearRemodAdd  5.882e+02  6.643e+01   8.855  < 2e-16 ***
## BsmtFinSF1    1.529e+01  2.795e+00   5.470 5.29e-08 ***
## TotalBsmtSF   2.832e+01  3.366e+00   8.415  < 2e-16 ***
## GrLivArea     6.837e+01  2.506e+00  27.278  < 2e-16 ***
## GarageArea    5.742e+01  6.554e+00   8.760  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41030 on 1452 degrees of freedom
## Multiple R-squared:  0.7345, Adjusted R-squared:  0.7332 
## F-statistic: 573.9 on 7 and 1452 DF,  p-value: < 2.2e-16

summary(updated_model)$coefficient

##                   Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)  -2.114573e+06 1.157208e+05 -18.273067  2.644333e-67
## LotArea       4.190669e-01 1.146620e-01   3.654803  2.665494e-04
## YearBuilt     4.849713e+02 4.869894e+01   9.958561  1.208342e-22
## YearRemodAdd  5.882385e+02 6.643144e+01   8.854820  2.390500e-18
## BsmtFinSF1    1.528645e+01 2.794531e+00   5.470130  5.288629e-08
## TotalBsmtSF   2.832309e+01 3.365860e+00   8.414815  9.298045e-17
## GrLivArea     6.837236e+01 2.506505e+00  27.277965 1.274846e-132
## GarageArea    5.741747e+01 6.554538e+00   8.759956  5.336645e-18

Check of confidence Interval

confint(updated_model)

##                      2.5 %        97.5 %
## (Intercept)  -2.341571e+06 -1.887576e+06
## LotArea       1.941461e-01  6.439878e-01
## YearBuilt     3.894436e+02  5.804991e+02
## YearRemodAdd  4.579266e+02  7.185503e+02
## BsmtFinSF1    9.804697e+00  2.076820e+01
## TotalBsmtSF   2.172062e+01  3.492556e+01
## GrLivArea     6.345560e+01  7.328912e+01
## GarageArea    4.456009e+01  7.027484e+01

Based on results from confidence levels i removed LotArea from the model

* Second Kaggle submission

second_model<- update(updated_model, ~ .-LotArea, data = train)
summary(second_model)

## 
## Call:
## lm(formula = SalePrice ~ YearBuilt + YearRemodAdd + BsmtFinSF1 + 
##     TotalBsmtSF + GrLivArea + GarageArea, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -630138  -17988   -3226   14245  283248 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.064e+06  1.154e+05 -17.888  < 2e-16 ***
## YearBuilt     4.707e+02  4.875e+01   9.656  < 2e-16 ***
## YearRemodAdd  5.767e+02  6.664e+01   8.654  < 2e-16 ***
## BsmtFinSF1    1.634e+01  2.791e+00   5.855 5.87e-09 ***
## TotalBsmtSF   2.971e+01  3.359e+00   8.846  < 2e-16 ***
## GrLivArea     6.982e+01  2.486e+00  28.087  < 2e-16 ***
## GarageArea    5.861e+01  6.574e+00   8.915  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41200 on 1453 degrees of freedom
## Multiple R-squared:  0.7321, Adjusted R-squared:  0.731 
## F-statistic: 661.7 on 6 and 1453 DF,  p-value: < 2.2e-16

Predict test data sale price

predict_price<-predict(second_model,test_data,type = 'response')
results2<-data.frame(test_data$Id,predict_price)
colnames(results2)<-c("ID","SalePrice")
results2[is.na(results2)] <- 0
head(results2)

##     ID SalePrice
## 1 1461  129326.4
## 2 1462  152633.1
## 3 1463  210895.6
## 4 1464  205769.2
## 5 1465  183950.7
## 6 1466  188229.4

summary(results2)

##        ID         SalePrice     
##  Min.   :1461   Min.   :     0  
##  1st Qu.:1826   1st Qu.:128370  
##  Median :2190   Median :170120  
##  Mean   :2190   Mean   :177605  
##  3rd Qu.:2554   3rd Qu.:219136  
##  Max.   :2919   Max.   :680192

Write result to file

write.csv(results2,file="juanelle_predicted_prices.csv",row.names = FALSE)

Kaggle submission

My kaggle score for second_model submission scored is 0.21903 which turned out to be my best score.

Future Work

I intend to do futher work which involves not testing the quality of the models above but also trying other strategies of modeling for predicting sales price. The model offers a decent fit, but transforming the variables should helo linearity and collinearity.

plot(second_model)

JMarks_DATA605_Final Project_PS1&PS2

Juanelle Marks

5/22/2019

Load necessary libraries

Problem 1

Definition of X and Y

Create a dataframe of the values

Probability

Definition of x and y

Find all probabilities

a. P(X>x | X>y)

b. P(X>x,Y>y)

c. P(Xy)

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.fair

Fisher’s Exact Test

Chi-Square Test

Which is more appropriate??

Problem 2

Note:Kaggle name

Load dataset

Summary of entire dataset

Subset dataset to remove variables that have a lot of Na’s

Descriptive and inferential statistics

Univariate descriptive stats of dataset used for the problem set 2

Plot of data set

Scatter plot matrix with at least two independent variables and 1 dependent variable.

Correlation Matrix

Plot of all 12 variables used in problem set

Derive a correlation matrix for any ‘three’ quantitative variables in the dataset.

Hypothesis test

Discussion

Family wise error

Linear Algebra and Correlation

Inversion of correlation matrix

Multiplication of matrices and LU Decomposition

Calculus-Based Probability and Statistics

Variable that is skewed to the right

Check of minimum value

Optimal value of lambda and 1000 samples from this exponential distribution

Comparison Histograms

The 5th and 95th percentiles of sample using CDF

95% confidence interval from the empirical data, assuming normality.

5th and 95th percentile of the empirical data

Modeling

Multiple regression

My First Model

*First Kaggle submission

Predict test data sale price

Write result to file

Kaggle score for first model

An updated model removing the aforementioned variables

Check of confidence Interval

*** Second Kaggle submission**

Predict test data sale price

Write result to file

Kaggle submission

Future Work

* Second Kaggle submission