Problem 1

TASK: Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation \(\mu = \sigma = (N+1)/2\).

#Set the seed for reproducibility
set.seed(19)

#define N
N <- 25

#mean and standard deviation parameter, mu = sigma = (N+1)/2
m <- (N+1)/2

#Random variable X: 10,000 uniform numbers between 1 and N
x <- runif(10000, min = 1, max = N)

#Random variable Y: 10,000 normal numbers with mean m and standard deviation m
y <- rnorm(10000, mean = m, sd = m)

Probability

TASK: Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

#Assumptions
smallX <- median(x)
smallY <- quantile(y,0.25)

The estimated value for “small x” is 13.0282515 and the estimated value for “small y” is 4.1538109.

a. P(X>x | X>y)

Since smallX is greater than smallY, we can compute this in R using the uniform distribution: given that X exceeds smallY, X is uniformly distributed on [smallY, N], so the conditional probability that X also exceeds smallX is a right-tail probability of that uniform distribution, which we obtain by specifying lower.tail = FALSE.

p <- punif(smallX, smallY, max = N, lower.tail = FALSE)
p
## [1] 0.5742895

Answer: P(X>x | X>y) = 0.5742895 when N = 25. As N increases, P(X>x | X>y) increases only slightly, levelling off at roughly 0.6.
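As a rough check of that claim, the sketch below (my own addition, not part of the original solution) repeats the same punif-based calculation for a few larger values of N; the exact values will vary slightly with the random draws.

#Sketch: repeat the calculation for several values of N to see how the
#conditional probability behaves as N grows
sapply(c(25, 100, 1000), function(N) {
  m <- (N + 1) / 2
  xs <- runif(10000, min = 1, max = N)
  ys <- rnorm(10000, mean = m, sd = m)
  punif(median(xs), min = quantile(ys, 0.25), max = N, lower.tail = FALSE)
})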

b. P(X>x, Y>y)

Assuming X and Y are independent of each other, we can calculate the probability that X > smallX and Y > smallY using the formula:

\(P(A,B) = P(A) \times P(B)\)

#calculate the probability of A and B by counting the number
#of entries that satisfy the condition and
#divide by the total number of entries
pA <- length(x[x > smallX])/ 10000

pB <- length(y[y > smallY])/10000

#Apply the formula
pA * pB
## [1] 0.375

Alternatively, we can calculate these probabilities in R directly from the theoretical distributions:

#x as a uniform distribution from 1 to N
pAsim <- punif(smallX, min = 1, max = N) 
#y as a normal distribution with mu = sigma as defined above
pBsim<- pnorm(smallY, m, m, lower.tail = FALSE)

#Apply formula
pAsim * pBsim
##       25% 
## 0.3768343

Answer: P(X>x, Y>y) = 0.3768343 when N = 25

c. P(X<x | X>y)

Since smallX is greater than smallY, we can use the same uniform distribution on [smallY, N] to calculate the probability that X is less than smallX given that X is greater than smallY. This is a left-tail probability (since it asks for the probability that X is less than smallX), and we expect it to equal 1 - P(X>x | X>y) from part (a).

p3 <- punif(smallX, smallY, max = N, lower.tail = TRUE)
p3
## [1] 0.4257105

Answer: P(X<x|X>y) = 0.4257105 when N = 25

Investigate

TASK: Investigate whether P(X>x and Y>y) = P(X>x) P(Y>y) by building a table and evaluating the marginal and joint probabilities.

#Import library for displaying tables
library(kableExtra) 

#Create a dataframe of probabilities that is 3x3
probDF <- data.frame(matrix(c(length(which(x>smallX & y>smallY)), #P(X > x, Y > y)
  length(which(x<smallX & y>smallY)), #P(X < x, Y > y)
  length(which(x>smallX & y>smallY)) + length(which(x<smallX & y>smallY)), #row sum
  length(which(x>smallX & y<smallY)), #P(X > x, Y < y)
  length(which(x<smallX & y<smallY)), #P(X < x, Y < y)
  length(which(x>smallX & y<smallY)) + length(which(x<smallX & y<smallY)), #row sum
  length(which(x>smallX & y>smallY)) + length(which(x>smallX & y<smallY)), #column sum
  length(which(x<smallX & y>smallY)) + length(which(x<smallX & y<smallY)), #column sum
  length(which(x>smallX & y>smallY)) + length(which(x<smallX & y>smallY)) +
    length(which(x>smallX & y<smallY)) + length(which(x<smallX & y<smallY)))/10000, #total
  ncol = 3, byrow = TRUE))

#Add the names for each row as Y
rownames(probDF) <- c('P(Y > y)','P(Y < y)','Sum')

#add names for each column as X
kable(probDF, 
      col.names = c('P(X > x)','P(X < x)','Sum'), 
      row.names = 1, 
      format = "html", 
      caption = "Probabilities")
Probabilities
P(X > x) P(X < x) Sum
P(Y > y) 0.3744 0.3756 0.75
P(Y < y) 0.1256 0.1244 0.25
Sum 0.5000 0.5000 1.00
#Probability calculation using product
(length(which(x > smallX))/10000 ) * (length(which(y > smallY))/10000)
## [1] 0.375

Answer: According to the table, P(X>x and Y>y) = 0.3744, which is approximately equal to the product \(P(X > x) \times P(Y > y) = 0.375\); the small difference is due to sampling variability.
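As a quick side check (a sketch, not part of the original code), the same joint and marginal probabilities can be computed directly as proportions of logical vectors:

#joint probability as the proportion of cases satisfying both conditions
mean(x > smallX & y > smallY)
#product of the marginal proportions
mean(x > smallX) * mean(y > smallY)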

Independence Check

TASK: Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

First we create a contingency table that counts each combination of cases we are interested in. We can then pass this table to built-in R functions to run the various tests.

#Create table
contTable <- data.frame(matrix(c(length(which(x > smallX & y > smallY)), # X > x; Y > y
                                 length(which(x < smallX & y > smallY)), # X < x; Y > y
                                 length(which(x > smallX & y < smallY)), # X > x; Y < y
                                 length(which(x < smallX & y < smallY)) # X < x; Y < y
                                 ), nrow = 2, byrow = TRUE))

# specify names for rows
rownames(contTable) <- c("Y > y","Y < y")

#Display table with column frames
kable(contTable, 
      col.names = c("X > x","X < x"), 
      row.names = 1, 
      format = "html",
      caption ="Contingency Table")
Contingency Table
X > x X < x
Y > y 3744 3756
Y < y 1256 1244

Fisher’s Exact Test

Fisher’s Exact Test is useful for testing the independence of categorical variables that each have two possible outcomes. It uses the contingency table counts to calculate an exact p-value, which we can use to determine whether there is evidence against independence. It is worth noting that the counts themselves are random here, because x and y are simulated random variables.

fisher.test(contTable)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contTable
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9008857 1.0819850
## sample estimates:
## odds ratio 
##   0.987281

From the output above we see a p-value of 0.7995, which is much larger than 0.05; therefore we fail to reject the null hypothesis of independence at the 5% significance level, and the data are consistent with the two events being independent.

Chi Square Test

The \(\chi^2\) test is another useful test for identifying whether categorical variables are independent of each other. It is similar to Fisher’s Exact Test in that it uses the contingency table to calculate a p-value; however, it does so by comparing the observed counts with the counts we would expect if the variables were truly independent.

#run the test
chiTest <- chisq.test(contTable)
#display results
chiTest
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contTable
## X-squared = 0.064533, df = 1, p-value = 0.7995
#display expected values and residuals
chiTest$expected
##         X1   X2
## Y > y 3750 3750
## Y < y 1250 1250

As we can see from the output above, the p-value for the \(\chi^2\) test is the same as in Fisher’s Exact Test (p = 0.7995), so again there is no evidence against independence. In this specific example, because the sample size is very large (10,000), the \(\chi^2\) test is the better choice. Fisher’s Exact Test computes its p-value from the hypergeometric distribution using binomial coefficients, and these coefficients become enormous as the sample size increases.
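To illustrate how large those coefficients get, the short sketch below (my own illustration, not part of the original analysis) shows that a central binomial coefficient at this sample size already overflows double precision, so exact-test implementations must work on the log scale.

#a central binomial coefficient for n = 10,000 overflows to Inf in double precision
choose(10000, 5000)
#but remains finite and manageable on the log scale
lchoose(10000, 5000)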

Problem 2

TASK: You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. Complete the tasks as stated below.

Descriptive and Inferential Statistics

Initial Data Analysis

TASK: Provide univariate descriptive statistics and appropriate plots for the training data set.

Before we can choose variables to highlight, it is a good idea to get a general sense of the behavior of each variable in the dataset, which we can do by running descriptive statistics and visualizing the shape of each predictor in the training set.

#Import libraries
library(tidyverse)
library(RCurl)
library(caret)
library(ggplot2)
library(GGally)
library(ggpubr)
library(visdat)
library(psych)

#Import data from the github repository and place it in a data frame
urlTest <- "https://raw.githubusercontent.com/MsQCompSci/605Final/main/test.csv"
urlTrain <- "https://raw.githubusercontent.com/MsQCompSci/605Final/main/train.csv"
rawTrain <- getURL(urlTrain)
rawTest <- getURL(urlTest)
trainDf <- read.csv(text = rawTrain) %>% 
  data.frame()
testDf <- read.csv(text = rawTest) %>% 
  data.frame()
#summary statistics of training data
summary(trainDf)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour  Utilities   
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63    AllPub:1459  
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50    NoSeWa:   1  
##  Median :  9478               NA's:1369   IR3: 10   Low:  36                 
##  Mean   : 10517                           Reg:925   Lvl:1311                 
##  3rd Qu.: 11602                                                              
##  Max.   :215245                                                              
##                                                                              
##    LotConfig    LandSlope   Neighborhood   Condition1     Condition2  
##  Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260   Norm   :1445  
##  CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81   Feedr  :   6  
##  FR2    :  47   Sev:  13   OldTown:113   Artery :  48   Artery :   2  
##  FR3    :   4              Edwards:100   RRAn   :  26   PosN   :   2  
##  Inside :1052              Somerst: 86   PosN   :  19   RRNn   :   2  
##                            Gilbert: 79   RRAe   :  11   PosA   :   1  
##                            (Other):707   (Other):  15   (Other):   2  
##    BldgType      HouseStyle   OverallQual      OverallCond      YearBuilt   
##  1Fam  :1220   1Story :726   Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  2fmCon:  31   2Story :445   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Duplex:  52   1.5Fin :154   Median : 6.000   Median :5.000   Median :1973  
##  Twnhs :  43   SLvl   : 65   Mean   : 6.099   Mean   :5.575   Mean   :1971  
##  TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                1.5Unf : 14   Max.   :10.000   Max.   :9.000   Max.   :2010  
##                (Other): 19                                                  
##   YearRemodAdd    RoofStyle       RoofMatl     Exterior1st   Exterior2nd 
##  Min.   :1950   Flat   :  13   CompShg:1434   VinylSd:515   VinylSd:504  
##  1st Qu.:1967   Gable  :1141   Tar&Grv:  11   HdBoard:222   MetalSd:214  
##  Median :1994   Gambrel:  11   WdShngl:   6   MetalSd:220   HdBoard:207  
##  Mean   :1985   Hip    : 286   WdShake:   5   Wd Sdng:206   Wd Sdng:197  
##  3rd Qu.:2004   Mansard:   7   ClyTile:   1   Plywood:108   Plywood:142  
##  Max.   :2010   Shed   :   2   Membran:   1   CemntBd: 61   CmentBd: 60  
##                                (Other):   2   (Other):128   (Other):136  
##    MasVnrType    MasVnrArea     ExterQual ExterCond  Foundation  BsmtQual  
##  BrkCmn : 15   Min.   :   0.0   Ex: 52    Ex:   3   BrkTil:146   Ex  :121  
##  BrkFace:445   1st Qu.:   0.0   Fa: 14    Fa:  28   CBlock:634   Fa  : 35  
##  None   :864   Median :   0.0   Gd:488    Gd: 146   PConc :647   Gd  :618  
##  Stone  :128   Mean   : 103.7   TA:906    Po:   1   Slab  : 24   TA  :649  
##  NA's   :  8   3rd Qu.: 166.0             TA:1282   Stone :  6   NA's: 37  
##                Max.   :1600.0                       Wood  :  3             
##                NA's   :8                                                   
##  BsmtCond    BsmtExposure BsmtFinType1   BsmtFinSF1     BsmtFinType2
##  Fa  :  45   Av  :221     ALQ :220     Min.   :   0.0   ALQ :  19   
##  Gd  :  65   Gd  :134     BLQ :148     1st Qu.:   0.0   BLQ :  33   
##  Po  :   2   Mn  :114     GLQ :418     Median : 383.5   GLQ :  14   
##  TA  :1311   No  :953     LwQ : 74     Mean   : 443.6   LwQ :  46   
##  NA's:  37   NA's: 38     Rec :133     3rd Qu.: 712.2   Rec :  54   
##                           Unf :430     Max.   :5644.0   Unf :1256   
##                           NA's: 37                      NA's:  38   
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741   
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49   
##  Median :   0.00   Median : 477.5   Median : 991.5   GasW :  18   Gd:241   
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1   
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428   
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0   Wall :   4            
##                                                                            
##  CentralAir Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  N:  95     FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  Y:1365     FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##             FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##             Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##             SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##             NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                         
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual  TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100      Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39      1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586      Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735      Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000               3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000               Max.   :14.000  
##                                                                               
##  Functional    Fireplaces    FireplaceQu   GarageType   GarageYrBlt  
##  Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6   Min.   :1900  
##  Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870   1st Qu.:1961  
##  Min1:  31   Median :1.000   Gd  :380    Basment: 19   Median :1980  
##  Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88   Mean   :1979  
##  Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9   3rd Qu.:2002  
##  Sev :   1   Max.   :3.000   NA's:690    Detchd :387   Max.   :2010  
##  Typ :1360                               NA's   : 81   NA's   :81    
##  GarageFinish   GarageCars      GarageArea     GarageQual  GarageCond 
##  Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3   Ex  :   2  
##  RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48   Fa  :  35  
##  Unf :605     Median :2.000   Median : 480.0   Gd  :  14   Gd  :   9  
##  NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3   Po  :   7  
##               3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311   TA  :1326  
##               Max.   :4.000   Max.   :1418.0   NA's:  81   NA's:  81  
##                                                                       
##  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Y:1340     Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##             Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##             3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##             Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                                
##   ScreenPorch        PoolArea        PoolQC       Fence      MiscFeature
##  Min.   :  0.00   Min.   :  0.000   Ex  :   2   GdPrv:  59   Gar2:   2  
##  1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2   GdWo :  54   Othr:   2  
##  Median :  0.00   Median :  0.000   Gd  :   3   MnPrv: 157   Shed:  49  
##  Mean   : 15.06   Mean   :  2.759   NA's:1453   MnWw :  11   TenC:   1  
##  3rd Qu.:  0.00   3rd Qu.:  0.000               NA's :1179   NA's:1406  
##  Max.   :480.00   Max.   :738.000                                       
##                                                                         
##     MiscVal             MoSold           YrSold        SaleType   
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :1267  
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   New    : 122  
##  Median :    0.00   Median : 6.000   Median :2008   COD    :  43  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008   ConLD  :   9  
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   ConLI  :   5  
##  Max.   :15500.00   Max.   :12.000   Max.   :2010   ConLw  :   5  
##                                                     (Other):   9  
##  SaleCondition    SalePrice     
##  Abnorml: 101   Min.   : 34900  
##  AdjLand:   4   1st Qu.:129975  
##  Alloca :  12   Median :163000  
##  Family :  20   Mean   :180921  
##  Normal :1198   3rd Qu.:214000  
##  Partial: 125   Max.   :755000  
## 
dim(trainDf)
## [1] 1460   81

Upon initial inspection of the data, we can see that there are 1460 cases and 81 variables, which include both numerical and categorical data. Next we visualize each variable so we can choose which ones to highlight for this project.

#Explore shape of predictors and response variable 
# Plot each variable frequency for each class (not including class column)
par(mfrow = c(3,3))
for(i in 2:ncol(trainDf)) {
  plot(trainDf[i], main = colnames(trainDf[i]), col = "pink")
}

#Identify near zero variance predictors
#so we can throw them out or not choose them
nearZero<- nearZeroVar(trainDf)
names(trainDf)[nearZero]
##  [1] "Street"        "LandContour"   "Utilities"     "LandSlope"    
##  [5] "Condition2"    "RoofMatl"      "BsmtCond"      "BsmtFinType2" 
##  [9] "BsmtFinSF2"    "Heating"       "LowQualFinSF"  "KitchenAbvGr" 
## [13] "Functional"    "GarageQual"    "GarageCond"    "EnclosedPorch"
## [17] "X3SsnPorch"    "ScreenPorch"   "PoolArea"      "MiscFeature"  
## [21] "MiscVal"
#identify predictors that have NA values 
#so we can impute or throw them out
#Look at each column with Missing Values
missing <- colSums(is.na(trainDf))

#Select only columns with NA values
missingColumns <- names(missing)[missing>0]

# Add column with variable names             
missing <- as.data.frame(missing)%>%
  filter(missing!=0)%>%
  mutate(key = as.factor(missingColumns))

#Visualize columns with missing data
  ggplot(missing) +
    geom_bar(aes(x = reorder(key, missing), y = missing), fill = "purple", stat = "identity") +
    labs(x='variable', y="number of missing values", title='Missing values by Column') +
    coord_flip() +
    geom_text(aes(x = key, y = missing + 1, label = missing))+
    theme( plot.title = element_text(hjust = 0.5))

Based on the initial analysis above, and taking SalePrice as the response variable, let’s take a look at GarageArea, GrLivArea, BsmtUnfSF, Exterior1st, and Neighborhood. These variables were chosen because they do not have near-zero variance or missing values and include a mix of categorical and numerical data. First, let’s look at the response variable SalePrice by plotting it as a histogram.

ggplot(trainDf, aes(x=SalePrice)) + 
  geom_histogram(color="black", fill="pink") +
  labs(x='Sale Price', y="Frequency", title='Shape of Response Variable') +
   theme(plot.title = element_text(hjust = 0.5))

From the above histogram we can tell that SalePrice is right-skewed, with most prices landing near the median and very few houses at the higher end of the price spectrum. Let’s take a deeper look at each of the predictor variables identified above.

#plot Garage Area
ggplot(trainDf, aes(x=GarageArea)) + 
  geom_histogram(color="black", fill="purple") +
  labs(x='Garage Area (square feet)', y="Frequency", title='Shape of Predictor Variable') +
   theme(plot.title = element_text(hjust = 0.5))

summary(trainDf$GarageArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   334.5   480.0   473.0   576.0  1418.0
#plot Above Ground Living Area
ggplot(trainDf, aes(x=GrLivArea)) + 
  geom_histogram(color="black", fill="purple") +
  labs(x='Living Area (square feet)', y="Frequency", title='Shape of Predictor Variable') +
   theme(plot.title = element_text(hjust = 0.5))

summary(trainDf$GrLivArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642
#plot Unfinished Basement Area
ggplot(trainDf, aes(x=BsmtUnfSF)) + 
  geom_histogram(color="black", fill="purple") +
  labs(x='Unfinished Basement Area (square feet)', y="Frequency", title='Shape of Predictor Variable') +
   theme(plot.title = element_text(hjust = 0.5))

summary(trainDf$BsmtUnfSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   223.0   477.5   567.2   808.0  2336.0

The visualizations above show that the GarageArea, GrLivArea, and BsmtUnfSF predictors have a slight right skew; for GrLivArea and BsmtUnfSF this is confirmed by the descriptive statistics, where the mean exceeds the median.
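As a quick numeric check (a sketch using the psych package loaded earlier), the skew column returned by describe() should be positive for right-skewed variables:

#mean, median and skewness of the three numeric predictors;
#positive skew values confirm the right tails seen in the histograms
describe(trainDf[, c("GarageArea", "GrLivArea", "BsmtUnfSF")])[, c("mean", "median", "skew")]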

#plot Exterior Covering on House
ggplot(trainDf, aes(x=Exterior1st)) + 
  geom_histogram(color="black", fill="purple", stat = "count") +
  labs(x='Covering Type', y="Frequency", title='Shape of Predictor Variable') +
   theme(plot.title = element_text(hjust = 0.5))

#plot Neighborhood of House
ggplot(trainDf, aes(x=Neighborhood)) + 
  geom_histogram(color="black", fill="purple", stat = "count") +
  labs(x='Neighborhood', y="Frequency", title='Shape of Predictor Variable') +
   theme(plot.title = element_text(hjust = 0.5))

Visualizing the above categorical variables, we can see an imbalance in the distribution of Exterior1st, with most houses covered in vinyl siding compared to other coverings. There are also imbalances in Neighborhood, with more houses sold in some neighborhoods than in others.
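A quick frequency count (sketch) makes the imbalance concrete:

#counts per category, largest first
sort(table(trainDf$Exterior1st), decreasing = TRUE)
sort(table(trainDf$Neighborhood), decreasing = TRUE)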

Scatterplots and Correlation

TASK: Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset.

#Let's also take a look at the correlation between variables
ggcorr(trainDf, method = c("pairwise", "pearson")) +
  labs(title = "Matrix Plot of All Variables") +
  theme(plot.title = element_text(hjust = 0.5))

From an initial inspection of all predictors, we see that a few are highly correlated. Let’s isolate some so we can take a better look at their correlation.

#select predictors we are interested in and response
trainThreePred <- trainDf %>%
  select(GarageArea, GrLivArea, BsmtUnfSF, SalePrice)

#Derive Correlation Matrix
cor(trainThreePred)
##            GarageArea GrLivArea BsmtUnfSF SalePrice
## GarageArea  1.0000000 0.4689975 0.1833027 0.6234314
## GrLivArea   0.4689975 1.0000000 0.2402573 0.7086245
## BsmtUnfSF   0.1833027 0.2402573 1.0000000 0.2144791
## SalePrice   0.6234314 0.7086245 0.2144791 1.0000000
#Plot correlation matrix
vis_cor(trainThreePred)

The above visualization shows that the three quantitative variables chosen are moderately correlated, with GrLivArea showing the strongest correlation with SalePrice (0.71) and GarageArea and GrLivArea (0.47) being the most correlated pair of predictors. Let’s take a look at a scatterplot matrix of these variables.

pairs.panels(trainThreePred, 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

Hypothesis Test

TASK: Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

#GarageArea vs. GrLivArea
cor.test(trainThreePred$GarageArea, trainThreePred$GrLivArea, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  trainThreePred$GarageArea and trainThreePred$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975

The above test shows that the p-value is extremely small (less than 2.2e-16), far below our 0.05 significance level, so there is strong evidence that the correlation between GarageArea and GrLivArea is not zero. We can also be 80% confident that the true correlation between these two variables falls approximately between 0.4424 and 0.4948, with the test estimating it at approximately 0.469.

#GarageArea vs. BsmtUnfSF
cor.test(trainThreePred$GarageArea, trainThreePred$BsmtUnfSF , conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  trainThreePred$GarageArea and trainThreePred$BsmtUnfSF
## t = 7.1198, df = 1458, p-value = 1.691e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1506680 0.2155384
## sample estimates:
##       cor 
## 0.1833027

This test also yields a very small p-value (1.691e-12), far below our 0.05 significance level, so there is strong evidence that the correlation between GarageArea and BsmtUnfSF is not zero. We can also be 80% confident that the true correlation between these two variables falls approximately between 0.1507 and 0.2155, with the test estimating it at approximately 0.1833.

#GrLivArea vs. BsmtUnfSF
cor.test(trainThreePred$GrLivArea, trainThreePred$BsmtUnfSF , conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  trainThreePred$GrLivArea and trainThreePred$BsmtUnfSF
## t = 9.4507, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2083759 0.2716286
## sample estimates:
##       cor 
## 0.2402573

Again the p-value is extremely small (less than 2.2e-16), far below our 0.05 significance level, so there is strong evidence that the correlation between GrLivArea and BsmtUnfSF is not zero. We can also be 80% confident that the true correlation between these two variables falls approximately between 0.2084 and 0.2716, with the test estimating it at approximately 0.2403.

For the three tests highlighted here, familywise error is not a major concern because only 3 tests are being performed; however, when conducting many hypothesis tests (for example, for every pair of predictors in the data set), I would be concerned about familywise error. The familywise error rate is the probability of making at least one false discovery (declaring two variables correlated when in fact they are not), and it increases as more tests are conducted.
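As a small illustration (my own sketch, not part of the original analysis), the familywise error rate for m independent tests each run at \(\alpha = 0.05\) is \(1 - (1 - 0.05)^m\), which grows quickly with the number of comparisons:

#familywise error rate for m independent tests at alpha = 0.05
nTests <- c(3, 10, 50)
round(1 - (1 - 0.05)^nTests, 3) #gives 0.143, 0.401, 0.923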

Linear Algebra and Correlation

TASK: Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

#library for LU decomposition
library(pracma)
#Derive Correlation Matrix
cMatrix <- cor(trainThreePred)
#display
cMatrix
##            GarageArea GrLivArea BsmtUnfSF SalePrice
## GarageArea  1.0000000 0.4689975 0.1833027 0.6234314
## GrLivArea   0.4689975 1.0000000 0.2402573 0.7086245
## BsmtUnfSF   0.1833027 0.2402573 1.0000000 0.2144791
## SalePrice   0.6234314 0.7086245 0.2144791 1.0000000
#invert to precision matrix
pMatrix <- solve(cMatrix)
#display
pMatrix
##             GarageArea   GrLivArea   BsmtUnfSF   SalePrice
## GarageArea  1.64552254 -0.07604617 -0.07849925 -0.95514586
## GrLivArea  -0.07604617  2.04564912 -0.18532802 -1.36243848
## BsmtUnfSF  -0.07849925 -0.18532802  1.06944804 -0.04910739
## SalePrice  -0.95514586 -1.36243848 -0.04910739  2.57145772
#multiply the correlation matrix by the precision matrix (should yield the identity)
CdotP <- cMatrix %*% pMatrix
#Display
CdotP
##              GarageArea     GrLivArea     BsmtUnfSF SalePrice
## GarageArea 1.000000e+00  0.000000e+00 -1.040834e-17         0
## GrLivArea  0.000000e+00  1.000000e+00  1.387779e-17         0
## BsmtUnfSF  2.775558e-17  0.000000e+00  1.000000e+00         0
## SalePrice  0.000000e+00 -2.220446e-16 -1.387779e-17         1
#multiply the precision matrix by the correlation matrix (also yields the identity)
PdotC<- pMatrix %*% cMatrix
#display
PdotC
##               GarageArea     GrLivArea    BsmtUnfSF    SalePrice
## GarageArea  1.000000e+00  2.220446e-16 5.551115e-17 1.110223e-16
## GrLivArea   0.000000e+00  1.000000e+00 5.551115e-17 0.000000e+00
## BsmtUnfSF   2.081668e-17  4.857226e-17 1.000000e+00 2.081668e-17
## SalePrice  -2.220446e-16 -4.440892e-16 0.000000e+00 1.000000e+00
#LU decomposition of correlation matrix
lu(cMatrix)
## $L
##            GarageArea GrLivArea BsmtUnfSF SalePrice
## GarageArea  1.0000000 0.0000000 0.0000000         0
## GrLivArea   0.4689975 1.0000000 0.0000000         0
## BsmtUnfSF   0.1833027 0.1977956 1.0000000         0
## SalePrice   0.6234314 0.5336085 0.0190971         1
## 
## $U
##            GarageArea GrLivArea BsmtUnfSF  SalePrice
## GarageArea          1 0.4689975 0.1833027 0.62343144
## GrLivArea           0 0.7800414 0.1542888 0.41623671
## BsmtUnfSF           0 0.0000000 0.9358825 0.01787264
## SalePrice           0 0.0000000 0.0000000 0.38888448
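As a quick sanity check on the decomposition (a sketch, not part of the original code), multiplying the factors back together should reproduce the correlation matrix up to rounding error:

#verify the factorization: L %*% U should equal the original correlation matrix
luDecomp <- lu(cMatrix)
max(abs(luDecomp$L %*% luDecomp$U - cMatrix)) #essentially zero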

Calculus-Based Probability & Statistics

TASK: Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

After my initial inspection of all variables, I chose TotalBsmtSF for this problem because it appeared strongly right-skewed. Further inspection confirms this, as seen in the histogram and summary statistics below. No shifting was necessary because the minimum value is already zero; however, code was written to handle the case where it is not, to demonstrate the approach.

#Histogram of TotalBsmtSF
ggplot(trainDf, aes(x=TotalBsmtSF)) + 
  geom_histogram(color="black", fill="lightblue") +
  labs(x='Total Basement Area (square feet)', y="Frequency", title='Shape of Variable') +
   theme(plot.title = element_text(hjust = 0.5))

#summary statistics
summary(trainDf$TotalBsmtSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0
#Code for shifting (vectorized so the data frame is actually updated)
if(min(trainDf$TotalBsmtSF) < 0){ #check whether the minimum value is negative
  
  offset <- min(trainDf$TotalBsmtSF) #save the (negative) minimum value
  
  #subtract the minimum from every case so the shifted minimum is zero
  trainDf$TotalBsmtSF <- trainDf$TotalBsmtSF - offset
}

Fit Exponential Density

TASK: Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda)). Plot a histogram and compare it with a histogram of your original variable.

#load library
library(MASS)
#fit exponential probability density function
expDist <- fitdistr(trainDf$TotalBsmtSF, "exponential")
#Find Lambda
lambdaEXP <- expDist$estimate
#Display Lambda
lambdaEXP
##         rate 
## 0.0009456896

As we can see from the output above, the optimal value for lambda is approximately \(9.4568957 \times 10^{-4}\).
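This matches what we would expect analytically: for an exponential distribution, the maximum likelihood estimate of \(\lambda\) is the reciprocal of the sample mean, as the quick check below (a sketch) confirms.

#the MLE of lambda for an exponential distribution is 1 / (sample mean),
#which should agree with the fitdistr estimate above
1 / mean(trainDf$TotalBsmtSF)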

library(Stack) #to stack DF
set.seed(111)

#Take 1000 samples from exponential distribution from above
expSamples <- rexp(1000, lambdaEXP)

#put in a dataframe
expDF <- data.frame(sqFT = c(expSamples), type = c("Exponential"))
bsmtDF <- data.frame(sqFT = c(trainDf$TotalBsmtSF), type = c("Original"))

#stack them
bsmtDF <- Stack(expDF, bsmtDF)

#plot a histogram with  both distributions to compare
ggplot(bsmtDF, aes(x=sqFT, fill = type)) + 
  geom_histogram(color = "white", alpha=0.5, position="identity") +
  labs(x='Total Basement Area (square feet)', y="Frequency", title='Comparing Exponential Fit with Original Data') +
   theme(plot.title = element_text(hjust = 0.5))

#summary stats of both
summary(expSamples)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.63  281.06  730.57 1051.76 1428.55 8075.10
summary(trainDf$TotalBsmtSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0

We can clearly see some differences between the original data and the fitted exponential distribution. From the overlapping histograms above, the medians are quite different: the exponential samples concentrate most of their mass near the minimum and are more spread out. The summary statistics confirm that the median of the exponential samples is much lower than that of the original data (730.57 versus 991.5); however, the means are very close to one another, as expected since the exponential fit matches the sample mean, and the overall scale is unchanged. The minimum and maximum also differ (the exponential samples have a larger minimum and maximum than the original data), and the exponential samples show a greater spread than the original data.

Percentiles and Confidence Intervals

TASK: Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

We can find the 5th and 95th percentile by using the lambda calculated above.

#5th percentile
qexp(0.05, lambdaEXP)
## [1] 54.23904
#95th percentile
qexp(0.95, lambdaEXP)
## [1] 3167.776

We can use the normal distribution to construct a confidence interval for the empirical data.
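Under the normality assumption, the interval is \(\bar{x} \pm z_{0.975}\frac{s}{\sqrt{n}}\), where \(z_{0.975} \approx 1.96\) is the critical value that leaves 2.5% in each tail of the standard normal distribution.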

#calculate the mean
meanExp <- mean(expSamples)
#calculate the standard dev
s <- sd(expSamples)
#number of cases
n <- 1000
#a two-sided 95% CI uses the critical value qnorm(0.975), approximately 1.96
error <- qnorm(0.975)*s/sqrt(n)
lower <- meanExp-error
upper <- meanExp+error
round(c(lower, upper), 3)
## [1]  985.059 1118.457
#load library with built in z.test function
library(BSDA)
## 
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
## 
##     Orange
#confirm results using the built-in z.test function
z.test(expSamples,mu = meanExp, sigma.x = s, sigma.y = NULL, alternative = "two.sided", conf.level = 0.95)
## 
##  One-sample z-Test
## 
## data:  expSamples
## z = 0, p-value = 1
## alternative hypothesis: true mean is not equal to 1051.758
## 95 percent confidence interval:
##   985.059 1118.457
## sample estimates:
## mean of x 
##  1051.758

Based on the calculation above, we can say that we are 95% confident that the true mean of the empirical data falls between approximately 985.06 and 1118.46. The built-in z.test() function returns the same interval, which confirms the manual calculation. We can also calculate the 5th and 95th percentiles from the empirical data and compare them to the 5th and 95th percentiles found via the exponential distribution.

quantile(expSamples, c(0.05, 0.95))
##         5%        95% 
##   59.02471 3256.78363

Comparing the empirical 5th and 95th percentiles with those calculated from the fitted exponential distribution, we see some differences, as expected: the empirical samples are random draws from that exponential distribution, so as the number of samples grows, the empirical percentiles should converge to the theoretical ones. In addition, comparing the percentiles from the empirical data, the exponential density function, and a normal model suggests that the exponential function is the better way to model the behavior of this variable.
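For reference, the sketch below (my own addition) computes the 5th and 95th percentiles implied by a normal model with the sample mean and standard deviation calculated above; the fact that its 5th percentile falls below zero, which is impossible for a square-footage variable, is one more reason the exponential model is preferable here.

#percentiles implied by a normal model fit to the exponential samples;
#note that the 5th percentile is negative, which square footage cannot be
qnorm(c(0.05, 0.95), mean = meanExp, sd = s)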

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

As seen in the EDA above, there are 79 predictor variables; some are highly correlated, some have near-zero variance, and some have NA values that need attention. We also saw that some variables are skewed left or right. All of these factors help determine which model is appropriate and which preprocessing methods can correct for these flaws in the data.

#Plot before
#Visualize columns with missing data
  ggplot(missing) +
    geom_bar(aes(x = reorder(key, missing), y = missing), fill = "purple", stat = "identity") +
    labs(x='variable', y="number of missing values", title='Missing values by Column') +
    coord_flip() +
    geom_text(aes(x = key, y = missing + 1, label = missing))+
    theme( plot.title = element_text(hjust = 0.5))

Variables such as PoolQC and MiscFeature stand out because they consist mostly of NA values, but here NA is meaningful: it indicates that the house lacks the feature and is effectively its own category. These columns will be recoded, and the remaining missing values will be imputed after the train and test sets are defined.

#Columns where NA means the house lacks the feature: add a "None" level
#and replace NA with "None" (same recoding as doing it column by column)
noneCols <- c("PoolQC", "Fence", "Alley", "FireplaceQu", "GarageType",
              "GarageYrBlt", "GarageFinish", "GarageQual", "GarageCond",
              "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
              "BsmtFinType2", "MiscFeature")

for (col in noneCols) {
  trainDf[[col]] <- factor(trainDf[[col]], levels = c(levels(trainDf[[col]]), "None"))
  trainDf[[col]][is.na(trainDf[[col]])] <- "None"
}

#Look at missing values now (Same method as above)
missing <- colSums(is.na(trainDf))
missingColumns <- names(missing)[missing>0]       
missing <- as.data.frame(missing)%>%
  filter(missing!=0)%>%
  mutate(key = as.factor(missingColumns))

#Visualize columns with missing data
  ggplot(missing) +
    geom_bar(aes(x = reorder(key, missing), y = missing), fill = "purple", stat = "identity") +
    labs(x='variable', y="number of missing values", title='Missing values by Column') +
    coord_flip() +
    geom_text(aes(x = key, y = missing + 1, label = missing))+
    theme( plot.title = element_text(hjust = 0.5))

#Variables that still contain NA values and will be imputed during preprocessing:
#LotFrontage
#MasVnrType
#MasVnrArea
#Electrical

Next we split the data into train and test sets. The train set will be used to develop the model, and the test set will be used to evaluate how well the model predicts the target variable, SalePrice.

set.seed(8)
#create train and test sets (80/20 split)
#select random 80% of numbers between 1 and the row length
trainIndex <- sample(1:nrow(trainDf), size = round(0.8*nrow(trainDf)), replace = FALSE)

#Use the index to split the data 80/20
trainData <- trainDf[trainIndex,]
testData <- trainDf[-trainIndex,]

#separate predictors (X) from the response (Y)
#and drop the Id column, which carries no predictive information
trainDataX <- trainData %>% dplyr::select(-Id,-SalePrice)
trainDataY <- trainData %>% dplyr::select(SalePrice)
testDataX <-  testData %>% dplyr::select(-Id,-SalePrice)
testDataY <-  testData %>% dplyr::select(SalePrice)
predictDfX <- testDf %>% dplyr::select(-Id) #data to use for predictions

After exploring the shape of the variables in the training set, the following preprocessing steps were applied: center (subtract the mean from each predictor), scale (divide each predictor by its standard deviation), BoxCox to reduce skewness, pca (Principal Component Analysis) to handle correlated and near-zero-variance predictors (dimensionality reduction to linearly independent components), and knnImpute, which fills the remaining NA values using k-nearest-neighbor averages. The transformations were estimated on the training predictors and then applied to the training and test sets separately.

#Preprocess the predictors (the response SalePrice is left untransformed)
preProcessedValues <- preProcess(trainDataX, method = c("center", "scale", "pca", "BoxCox", "knnImpute")) 
preProcessedValues2 <- preProcess(trainDataX, method = c("pca")) 
#Take  a look at the transformations being done
preProcessedValues
## Created from 951 samples and 79 variables
## 
## Pre-processing:
##   - Box-Cox transformation (12)
##   - centered (35)
##   - ignored (44)
##   - 5 nearest neighbor imputation (35)
##   - principal component signal extraction (35)
##   - scaled (35)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000  0.0000  0.4000  0.3917  0.7250  2.0000 
## 
## PCA needed 25 components to capture 95 percent of the variance
#Apply transformations
trainDataX <- predict(preProcessedValues, trainDataX)
testDataX <- predict(preProcessedValues, testDataX)

#Check for near-zero-variance predictors; the columns flagged below are among the categorical columns that preProcess ignored
nearZeroVar(trainDataX)
##  [1]  2  3  5  6  8 11 15 26 27 32 35 40 42
#Let's also take a look at the correlation between variables before and after PCA
p1 <- ggcorr(trainDf, method = c("pairwise", "pearson")) +
  labs(title = "Matrix Plot of All Variables") +
  theme(plot.title = element_text(hjust = 0.5))
## Warning in ggcorr(trainDf, method = c("pairwise", "pearson")): data in
## column(s) 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
## 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
## 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
## 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
## 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC',
## 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu',
## 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond',
## 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition' are
## not numeric and were ignored
p2 <- ggcorr(trainDataX, method = c("pairwise", "pearson")) +
  labs(title = "Matrix Plot After PCA") +
  theme(plot.title = element_text(hjust = 0.5))
## Warning in ggcorr(trainDataX, method = c("pairwise", "pearson")): data in
## column(s) 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
## 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
## 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
## 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
## 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC',
## 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu',
## 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond',
## 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition' are
## not numeric and were ignored
#display the before and after correlation plots side by side (1 row, 2 columns)
ggarrange(p1, p2, ncol = 2, nrow = 1)

Now that we have cleaned the data, we can begin training models. Let’s take a multiple linear regression approach, using the model-building tools that the caret package provides. First, we merge the dependent and independent variables into one data frame so we can pass them to caret’s train() function. Next we specify how the model will be fit and validated; here I chose 10-fold cross-validation, repeated ten times, in which a portion of the training set is held out on each fold to validate the model and measure how well it performs. caret lets us specify the validation scheme with trainControl(), where we set the resampling method, the number of folds, the number of repeats, and a random search over tuning hyperparameters.

#merges X and Y data
df <- data.frame(trainDataX, SalePrice = c(trainDataY$SalePrice))

## 10-fold CV
fitControl <- trainControl(method = "repeatedcv",   
                           number = 10,     # number of folds
                           repeats = 10, # repeated ten times
                           search = "random") #random search over tuning hyperparameters
# Use only the principal components, since this is a linear regression on the numeric predictors
regModelOneA <- train(SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC17 + PC18 + PC19 +PC20 + PC21 + PC22 +PC23 + PC24 +PC25,
                  data = df,
                  method = "lm", 
                  trControl = fitControl,
                  na.action = na.omit)
#display the model and its resampling results
regModelOneA
## Linear Regression 
## 
## 1168 samples
##   25 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 1052, 1052, 1051, 1052, 1051, 1051, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   34829.08  0.8104366  22464.05
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

We can see that the \(R^2\) value is about 0.81, which means this model accounts for roughly 81% of the variation in the response variable under cross-validation. It would be interesting to see whether a better \(R^2\) could be achieved by incorporating the categorical variables as well, perhaps with a gradient boosted tree model. For now, we can check the accuracy of our model’s predictions.

We can start to get a sense of how much each predictor contributes to the model by plotting the variables in order of importance. Although all of the principal components were used in the model, some are much more influential on the response variable than others.

#display variable importance
ggplot(varImp(regModelOneA)) +
  scale_fill_manual(values = "pink") +
  labs(title = "Most Important Variables in Model") +
  theme(plot.title = element_text(hjust = 0.5))

We can evaluate the model on our test set to measure how accurately it predicts the known values of the response variable.

#Make a prediction
predictions <- predict(regModelOneA, testDataX)

#R-squared on test data
postResample(pred = predictions, obs = testDataY$SalePrice)
##         RMSE     Rsquared          MAE 
## 4.856255e+04 6.789216e-01 2.510901e+04

We can see from the \(R^2\) above that the model accounts for about 68 percent of the variation in the response variable on the test data. In the future I would like to improve this approach and choose a model that can incorporate both categorical and numerical predictors, in the hope of improving \(R^2\) and prediction accuracy.
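As a possible next step, the sketch below (not run for this submission) shows how a gradient boosted tree could be trained with the same caret workflow; unlike the linear model on principal components, it can also make use of the categorical predictors. This assumes the gbm package is installed.

#Sketch: gradient boosted trees via caret on the full preprocessed data frame,
#reusing the fitControl defined above (assumes the gbm package is available)
gbmModel <- train(SalePrice ~ .,
                  data = df,
                  method = "gbm",
                  trControl = fitControl,
                  verbose = FALSE,
                  na.action = na.omit)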

Kaggle Username: laylaquinones

Score: 0.61

References: