Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu =\sigma =(N+1)/2\).

N <- 11 # Set a value for N
numbers <- 10000 # Set the amount for the random numbers
X <- runif(numbers, min = 1, max = N) # Generate random variable X
summary(X) # Display X's information
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.543   6.069   6.047   8.558  11.000
mu = sigma = (N + 1) / 2 # Set value for mu and sigma
Y <- rnorm(numbers, mean = mu, sd = sigma) # Generate random variable Y
summary(Y) # Display Y's information
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -18.616   2.010   5.909   6.020  10.058  26.463
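
Because no seed is set, each run of the code above draws different numbers, so the summaries and the probabilities computed from them shift slightly between runs. A minimal sketch of making the simulation reproducible follows; the seed value 605 is an arbitrary assumption and is not part of the original analysis.

```r
set.seed(605)  # arbitrary seed (an assumption); any fixed integer works
N <- 11
numbers <- 10000
X <- runif(numbers, min = 1, max = N)
mu <- sigma <- (N + 1) / 2
Y <- rnorm(numbers, mean = mu, sd = sigma)

# Re-seeding and re-drawing reproduces exactly the same uniform values
set.seed(605)
X_again <- runif(numbers, min = 1, max = N)
identical(X, X_again)
```

With a fixed seed, any number quoted in the write-up can be matched against the printed output.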

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. 5 points.

  a. P(X>x | X>y)
  b. P(X>x, Y>y)
  c. P(X<x | X>y)

x <- median(X) # Median of X
y <- quantile(Y, 0.25) # 1st quartile of Y
  a. P(X>x | X>y)
prob_Xx_and_Xy <- sum(X > x & X > y) / numbers # Probability that X > x and X > y
prob_Xy <- sum(X > y) / numbers # Probability that X > y
round(prob_Xx_given_Xy <- prob_Xx_and_Xy / prob_Xy, 4) # Divide the first probability found by the second probability found
## [1] 0.554

The probability that X > x given that X > y is 0.554, or 55.4%, in the run shown above. The value changes slightly from run to run because no seed is set.

  b. P(X>x, Y>y)
round(sum(X > x & Y > y) / numbers, 4) # Probability that X > x and Y > y
## [1] 0.3742

The probability that X > x while at the same time Y > y is 0.3742, or 37.42%, in the run shown above.

  c. P(X<x | X>y)
prob_Xx_and_Xy_less <- sum(X < x & X > y) / numbers # Probability that X < x and X > y
round(prob_Xx_and_Xy_less / prob_Xy, 4)
## [1] 0.446

The probability that X < x given that X > y is 0.446, or 44.6%, in the run shown above. Together with the result for part a, the two conditional probabilities sum to 1, as they must.
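
A conditional probability can also be read as the share of cases satisfying one event among only those cases where the conditioning event holds. The sketch below checks that this subsetting view agrees with the ratio P(A and B) / P(B) used above. It re-simulates X and Y with an arbitrary seed (an assumption), so its numbers will not match the outputs shown earlier.

```r
set.seed(1)  # arbitrary seed, an assumption for reproducibility
N <- 11
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# P(A | B) as a ratio of joint to marginal probability
p_ratio <- mean(X > x & X > y) / mean(X > y)

# The same quantity as the share of X > x among the draws with X > y
p_subset <- mean(X[X > y] > x)
all.equal(p_ratio, p_subset)

# P(X > x | X > y) and P(X < x | X > y) are complements
p_less <- mean(X[X > y] < x)
p_subset + p_less
```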

5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

table <- matrix(c(sum(X > x & Y < y) / numbers, sum(X > x & Y > y) / numbers, sum(X < x & Y < y) / numbers, sum(X < x & Y > y) / numbers), ncol = 2, byrow = TRUE) # Create a matrix showing the different probabilities
table <- cbind(table, c(table[1,1] + table[1,2], table[2,1] + table[2,2])) # Append the row totals as a third column
table <- rbind(table, c(table[1,1] + table[2,1], table[1,2] + table[2,2], table[1,3] + table[2,3])) # Append the column totals as a third row
colnames(table) <- c("Y < y", "Y > y", "Total") # Columns hold the Y events, matching the order used to fill the matrix
rownames(table) <- c("X > x", "X < x", "Total") # Rows hold the X events
as.table(table) # Convert matrix into a table
##        Y < y  Y > y  Total
## X > x 0.1258 0.3742 0.5000
## X < x 0.1242 0.3758 0.5000
## Total 0.2500 0.7500 1.0000
round(table[1,3] * table[3,2], 4) # P(X > x) * P(Y > y)
## [1] 0.375
round(table[1,2], 4) # P(X > x & Y > y)
## [1] 0.3742

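From the table built and by evaluating the marginal and joint probabilities, we can see that P(X > x and Y > y) is almost equal to P(X > x) * P(Y > y), which is what independence predicts. The small difference reflects sampling variation in the simulated data rather than a real dependence between X and Y.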
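
The same comparison can be made for every cell at once: under independence each joint probability equals the product of its row and column marginals. The sketch below builds the table with `table()` and `prop.table()` and compares it with the outer product of the marginals. It uses freshly simulated data with an arbitrary seed (an assumption), so the values differ from those printed above.

```r
set.seed(1)  # arbitrary seed, an assumption
N <- 11
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# Cross-tabulate the two events; rows are the X event, columns are the Y event
counts <- table(X = ifelse(X > x, "X > x", "X < x"),
                Y = ifelse(Y > y, "Y > y", "Y < y"))
joint <- prop.table(counts)   # joint probabilities
addmargins(joint)             # joint table with marginal totals

# Under independence each joint cell equals the product of its marginals
expected <- outer(rowSums(joint), colSums(joint))
round(joint - expected, 4)    # differences should all be close to 0
```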

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

fisher.test(table)
## Warning in fisher.test(table): 'x' has been rounded to integer: Mean relative
## difference: 0.8333333
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(table)
## Warning in chisq.test(table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table
## X-squared = 1.3653e-05, df = 4, p-value = 1

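Both Fisher’s Exact Test and the Chi Square Test return a high p-value, so we fail to reject the null hypothesis that X > x and Y > y are independent. Note, however, that both tests were given the table of proportions with its totals rather than the 2 x 2 table of counts, which is why R warns about rounding and reports df = 4; the tests are only valid on the raw counts. The difference between the two is that Fisher’s Exact Test computes an exact p-value from the hypergeometric distribution and is preferred for small samples (for example, when expected cell counts fall below about 5), while the Chi Square Test relies on a large-sample approximation. With 10,000 observations the Chi Square Test is the most appropriate here.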
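
Both tests expect a contingency table of counts without totals. The sketch below shows how the calls might look on a 2 x 2 count table; it re-simulates the data with an arbitrary seed (an assumption), so no output is quoted here.

```r
set.seed(1)  # arbitrary seed, an assumption
N <- 11
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# 2 x 2 table of counts: rows are X > x (FALSE/TRUE), columns are Y > y (FALSE/TRUE)
counts <- table(X > x, Y > y)

fisher_result <- fisher.test(counts)  # exact test
chisq_result <- chisq.test(counts)    # Yates' continuity correction is applied by default for 2 x 2 tables

fisher_result$p.value
chisq_result$p.value
chisq_result$parameter                # degrees of freedom; 1 for a 2 x 2 table
```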

Problem 2

train_dataset <- read.csv("https://raw.githubusercontent.com/bpersaud104/Data605/master/train.csv", header = TRUE) # Load training dataset from Kaggle
test_dataset <- read.csv("https://raw.githubusercontent.com/bpersaud104/Data605/master/test.csv", header = TRUE) # Load testing dataset from Kaggle

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

summary(train_dataset) # Show statistics for training dataset
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour  Utilities   
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63    AllPub:1459  
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50    NoSeWa:   1  
##  Median :  9478               NA's:1369   IR3: 10   Low:  36                 
##  Mean   : 10517                           Reg:925   Lvl:1311                 
##  3rd Qu.: 11602                                                              
##  Max.   :215245                                                              
##                                                                              
##    LotConfig    LandSlope   Neighborhood   Condition1     Condition2  
##  Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260   Norm   :1445  
##  CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81   Feedr  :   6  
##  FR2    :  47   Sev:  13   OldTown:113   Artery :  48   Artery :   2  
##  FR3    :   4              Edwards:100   RRAn   :  26   PosN   :   2  
##  Inside :1052              Somerst: 86   PosN   :  19   RRNn   :   2  
##                            Gilbert: 79   RRAe   :  11   PosA   :   1  
##                            (Other):707   (Other):  15   (Other):   2  
##    BldgType      HouseStyle   OverallQual      OverallCond      YearBuilt   
##  1Fam  :1220   1Story :726   Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  2fmCon:  31   2Story :445   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Duplex:  52   1.5Fin :154   Median : 6.000   Median :5.000   Median :1973  
##  Twnhs :  43   SLvl   : 65   Mean   : 6.099   Mean   :5.575   Mean   :1971  
##  TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                1.5Unf : 14   Max.   :10.000   Max.   :9.000   Max.   :2010  
##                (Other): 19                                                  
##   YearRemodAdd    RoofStyle       RoofMatl     Exterior1st   Exterior2nd 
##  Min.   :1950   Flat   :  13   CompShg:1434   VinylSd:515   VinylSd:504  
##  1st Qu.:1967   Gable  :1141   Tar&Grv:  11   HdBoard:222   MetalSd:214  
##  Median :1994   Gambrel:  11   WdShngl:   6   MetalSd:220   HdBoard:207  
##  Mean   :1985   Hip    : 286   WdShake:   5   Wd Sdng:206   Wd Sdng:197  
##  3rd Qu.:2004   Mansard:   7   ClyTile:   1   Plywood:108   Plywood:142  
##  Max.   :2010   Shed   :   2   Membran:   1   CemntBd: 61   CmentBd: 60  
##                                (Other):   2   (Other):128   (Other):136  
##    MasVnrType    MasVnrArea     ExterQual ExterCond  Foundation  BsmtQual  
##  BrkCmn : 15   Min.   :   0.0   Ex: 52    Ex:   3   BrkTil:146   Ex  :121  
##  BrkFace:445   1st Qu.:   0.0   Fa: 14    Fa:  28   CBlock:634   Fa  : 35  
##  None   :864   Median :   0.0   Gd:488    Gd: 146   PConc :647   Gd  :618  
##  Stone  :128   Mean   : 103.7   TA:906    Po:   1   Slab  : 24   TA  :649  
##  NA's   :  8   3rd Qu.: 166.0             TA:1282   Stone :  6   NA's: 37  
##                Max.   :1600.0                       Wood  :  3             
##                NA's   :8                                                   
##  BsmtCond    BsmtExposure BsmtFinType1   BsmtFinSF1     BsmtFinType2
##  Fa  :  45   Av  :221     ALQ :220     Min.   :   0.0   ALQ :  19   
##  Gd  :  65   Gd  :134     BLQ :148     1st Qu.:   0.0   BLQ :  33   
##  Po  :   2   Mn  :114     GLQ :418     Median : 383.5   GLQ :  14   
##  TA  :1311   No  :953     LwQ : 74     Mean   : 443.6   LwQ :  46   
##  NA's:  37   NA's: 38     Rec :133     3rd Qu.: 712.2   Rec :  54   
##                           Unf :430     Max.   :5644.0   Unf :1256   
##                           NA's: 37                      NA's:  38   
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741   
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49   
##  Median :   0.00   Median : 477.5   Median : 991.5   GasW :  18   Gd:241   
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1   
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428   
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0   Wall :   4            
##                                                                            
##  CentralAir Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  N:  95     FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  Y:1365     FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##             FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##             Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##             SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##             NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                         
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual  TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100      Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39      1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586      Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735      Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000               3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000               Max.   :14.000  
##                                                                               
##  Functional    Fireplaces    FireplaceQu   GarageType   GarageYrBlt  
##  Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6   Min.   :1900  
##  Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870   1st Qu.:1961  
##  Min1:  31   Median :1.000   Gd  :380    Basment: 19   Median :1980  
##  Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88   Mean   :1979  
##  Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9   3rd Qu.:2002  
##  Sev :   1   Max.   :3.000   NA's:690    Detchd :387   Max.   :2010  
##  Typ :1360                               NA's   : 81   NA's   :81    
##  GarageFinish   GarageCars      GarageArea     GarageQual  GarageCond 
##  Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3   Ex  :   2  
##  RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48   Fa  :  35  
##  Unf :605     Median :2.000   Median : 480.0   Gd  :  14   Gd  :   9  
##  NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3   Po  :   7  
##               3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311   TA  :1326  
##               Max.   :4.000   Max.   :1418.0   NA's:  81   NA's:  81  
##                                                                       
##  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Y:1340     Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##             Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##             3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##             Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                                
##   ScreenPorch        PoolArea        PoolQC       Fence      MiscFeature
##  Min.   :  0.00   Min.   :  0.000   Ex  :   2   GdPrv:  59   Gar2:   2  
##  1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2   GdWo :  54   Othr:   2  
##  Median :  0.00   Median :  0.000   Gd  :   3   MnPrv: 157   Shed:  49  
##  Mean   : 15.06   Mean   :  2.759   NA's:1453   MnWw :  11   TenC:   1  
##  3rd Qu.:  0.00   3rd Qu.:  0.000               NA's :1179   NA's:1406  
##  Max.   :480.00   Max.   :738.000                                       
##                                                                         
##     MiscVal             MoSold           YrSold        SaleType   
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :1267  
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   New    : 122  
##  Median :    0.00   Median : 6.000   Median :2008   COD    :  43  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008   ConLD  :   9  
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   ConLI  :   5  
##  Max.   :15500.00   Max.   :12.000   Max.   :2010   ConLw  :   5  
##                                                     (Other):   9  
##  SaleCondition    SalePrice     
##  Abnorml: 101   Min.   : 34900  
##  AdjLand:   4   1st Qu.:129975  
##  Alloca :  12   Median :163000  
##  Family :  20   Mean   :180921  
##  Normal :1198   3rd Qu.:214000  
##  Partial: 125   Max.   :755000  
## 

I chose LotArea and GarageArea as the two independent variables and SalePrice as the dependent variable. Let’s plot these variables.

hist(train_dataset$LotArea) # Plot LotArea variable from training dataset

hist(train_dataset$GarageArea) # Plot GarageArea variable from training dataset

hist(train_dataset$SalePrice) # Plot SalePrice from training dataset

For the most part, all three appear to be right skewed, with LotArea being heavily right skewed.

Let’s show a scatterplot of LotArea and GarageArea with SalePrice.

plot(train_dataset$LotArea, train_dataset$SalePrice) # Scatterplot of LotArea and SalePrice

plot(train_dataset$GarageArea, train_dataset$SalePrice) # Scatterplot of GarageArea and SalePrice

The scatterplot for LotArea has most of its points all in one area, the bottom left of the graph. The scatterplot for GarageArea has its points more spread out.
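
The task asks for a scatterplot matrix, which base R draws with a single `pairs()` call. The sketch below uses stand-in data (an assumption, generated only so the example runs on its own); with the real data the same call would take `train_dataset[, c("LotArea", "GarageArea", "SalePrice")]`.

```r
# Stand-in data so the sketch runs without the Kaggle file (an assumption)
set.seed(1)
demo <- data.frame(LotArea = rlnorm(200, meanlog = 9, sdlog = 0.5),
                   GarageArea = runif(200, min = 0, max = 1000))
demo$SalePrice <- 50000 + 5 * demo$LotArea + 150 * demo$GarageArea +
  rnorm(200, sd = 20000)

# One call draws every pairwise scatterplot of the selected columns
pairs(demo[, c("LotArea", "GarageArea", "SalePrice")],
      main = "Scatterplot matrix (stand-in data)")
```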

Let’s use the same three variables, LotArea, GarageArea, and SalePrice as the three quantitative variables and use them to make a correlation matrix.

correlation_matrix <- cbind(train_dataset$LotArea, train_dataset$GarageArea, train_dataset$SalePrice) # Columns: 1 = LotArea, 2 = GarageArea, 3 = SalePrice
correlation_matrix <- cor(correlation_matrix) # Pairwise Pearson correlations
correlation_matrix
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.1804028 0.2638434
## [2,] 0.1804028 1.0000000 0.6234314
## [3,] 0.2638434 0.6234314 1.0000000

Let’s test each pair of these three variables for zero correlation and report an 80% confidence interval.

cor.test(train_dataset$LotArea, train_dataset$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_dataset$LotArea and train_dataset$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
cor.test(train_dataset$LotArea, train_dataset$GarageArea, conf.level =  0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_dataset$LotArea and train_dataset$GarageArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1477356 0.2126767
## sample estimates:
##       cor 
## 0.1804028
cor.test(train_dataset$GarageArea, train_dataset$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_dataset$GarageArea and train_dataset$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6024756 0.6435283
## sample estimates:
##       cor 
## 0.6234314

From the hypothesis test between LotArea and SalePrice, we reject the null hypothesis of zero correlation since the p-value is below 0.05; the 80% confidence interval is (0.2323391, 0.2947946). The same holds for LotArea and GarageArea, with an 80% confidence interval of (0.1477356, 0.2126767), and for GarageArea and SalePrice, with an 80% confidence interval of (0.6024756, 0.6435283). None of the intervals contains 0. My analysis shows that there is a correlation between each pair of the three variables picked, LotArea, GarageArea, and SalePrice. The correlation between LotArea and GarageArea is weak at 0.1804, while the correlation between GarageArea and SalePrice is fairly strong at 0.6234.

Familywise error is the chance of at least one false positive across a set of tests, and it grows with the number of tests: for three independent tests at the 0.05 level it can reach 1 - 0.95^3, or about 0.14. That is a reason to be cautious in general, but here each p-value is far below even a Bonferroni-adjusted threshold of 0.05 / 3, so the decision to reject the null hypothesis in all three tests would not change.
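
One way to make the familywise-error question concrete is to adjust the three p-values for multiple testing. The sketch below uses the p-values printed by the `cor.test()` calls above; where R reports only "< 2.2e-16", that printed floor is used as an upper bound. The 0.05 level is an assumption.

```r
# p-values read off the three cor.test() outputs; 2.2e-16 is an upper bound
p_values <- c(LotArea_SalePrice    = 2.2e-16,
              LotArea_GarageArea   = 3.803e-12,
              GarageArea_SalePrice = 2.2e-16)

# Chance of at least one false positive across 3 independent tests at alpha = 0.05
alpha <- 0.05
fwer <- 1 - (1 - alpha)^length(p_values)
fwer

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p_adjusted <- p.adjust(p_values, method = "bonferroni")
p_adjusted < alpha
```

All three adjusted p-values stay below 0.05, so the conclusions do not depend on the correction.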

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

precision_matrix <- solve(correlation_matrix) # Create precision matrix using correlation matrix
precision_matrix
##             [,1]        [,2]       [,3]
## [1,]  1.07530074 -0.02799273 -0.2662594
## [2,] -0.02799273  1.63649778 -1.0128585
## [3,] -0.26625940 -1.01285847  1.7016986
round(correlation_matrix %*% precision_matrix, 4) # Multiply correlation matrix by precision matrix
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
round(precision_matrix %*% correlation_matrix, 4) # Multiply precision matrix by correlation matrix
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

In both cases, multiplying the correlation matrix by the precision matrix and multiplying the precision matrix by the correlation matrix gives the identity matrix, as expected for a matrix and its inverse.

LU_decomposition <- function(A){ # Simple function to calculate LU decomposition
  rows = columns = dim(A)[1] # Get rows and columns
  U = A # Get upper
  L = diag(rows) # Get Lower
  for (j in 1:(columns-1)) { 
    for(i in (j+1):rows){
      L[i,j] = (U[i,j] / U[j,j]) # Calculate Lower
      U[i,] = U[i,] - (U[j,] * L[i,j]) # Calculate Upper
    }
  }
  LU = list("Lower" = L, "Upper" = U) # List Lower and Upper
  return(LU)
}
LU_decomposition(correlation_matrix) # LU decomposition on correlation matrix
## $Lower
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.1804028 1.0000000    0
## [3,] 0.2638434 0.5952044    1
## 
## $Upper
##      [,1]      [,2]      [,3]
## [1,]    1 0.1804028 0.2638434
## [2,]    0 0.9674548 0.5758334
## [3,]    0 0.0000000 0.5876481
LU_decomposition(precision_matrix) # LU decomposition on precision matrix
## $Lower
##             [,1]       [,2] [,3]
## [1,]  1.00000000  0.0000000    0
## [2,] -0.02603247  1.0000000    0
## [3,] -0.24761389 -0.6234314    1
## 
## $Upper
##          [,1]        [,2]       [,3]
## [1,] 1.075301 -0.02799273 -0.2662594
## [2,] 0.000000  1.63576906 -1.0197899
## [3,] 0.000000  0.00000000  1.0000000

From the LU decomposition above we can see the lower and upper matrices for both the correlation matrix and the precision matrix. In both cases, multiplying the lower by the upper gives us the original matrix.
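
The claim that the lower factor times the upper factor returns the original matrix can be checked directly. The sketch below rebuilds the correlation matrix from the rounded values printed above and uses a compact helper, `lu_no_pivot()` (a hypothetical name), that follows the same elimination scheme as `LU_decomposition()`.

```r
# Correlation matrix rebuilt from the rounded values printed above
A <- matrix(c(1.0000000, 0.1804028, 0.2638434,
              0.1804028, 1.0000000, 0.6234314,
              0.2638434, 0.6234314, 1.0000000), nrow = 3, byrow = TRUE)

# Gaussian elimination without pivoting; L stores the multipliers
lu_no_pivot <- function(A) {
  n <- nrow(A)
  U <- A
  L <- diag(n)
  for (j in seq_len(n - 1)) {
    for (i in (j + 1):n) {
      L[i, j] <- U[i, j] / U[j, j]
      U[i, ] <- U[i, ] - L[i, j] * U[j, ]
    }
  }
  list(Lower = L, Upper = U)
}

lu <- lu_no_pivot(A)
all.equal(lu$Lower %*% lu$Upper, A)  # TRUE when L %*% U reproduces A
```

Because there is no pivoting, this scheme fails if a zero appears on the diagonal during elimination; that does not happen for the matrices used here.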

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

library(MASS) # Load MASS package
## Warning: package 'MASS' was built under R version 3.6.3

I chose to use the PoolArea variable due to it being heavily right skewed. The min of PoolArea is 0 so let us add 1 to it so we can have a min above 0.

fit_variable <- train_dataset$PoolArea + 1 
summary(fit_variable)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   3.759   1.000 739.000
fit_exp_function <- fitdistr(fit_variable, "exponential") # Fit exponential probability function using created variable
fit_exp_function
##       rate    
##   0.266034985 
##  (0.006962454)
lambda <- fit_exp_function$estimate # Compute optimal values of lambda
lambda
##     rate 
## 0.266035
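
The fitted rate agrees with the reciprocal of the sample mean reported by `summary()` (1 / 3.759 is about 0.266), because the maximum-likelihood estimate of an exponential rate is 1 / mean. The sketch below checks this numerically on stand-in data (an assumption, generated so the example runs without the Kaggle file).

```r
# Stand-in right-skewed data (an assumption); use fit_variable for the real fit
set.seed(1)
v <- rexp(500, rate = 0.25) + 1

# Closed-form maximum-likelihood estimate of the exponential rate
rate_closed <- 1 / mean(v)

# Numerical check: maximise the log-likelihood n * log(r) - r * sum(v)
loglik <- function(r) length(v) * log(r) - r * sum(v)
rate_numeric <- optimize(loglik, interval = c(1e-6, 10), maximum = TRUE)$maximum

c(closed_form = rate_closed, numeric = rate_numeric)
```
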
exponential_samples <- rexp(1000, lambda) # Use optimal value to create 1000 samples of exponential distribution
exponential_samples
##    [1]  0.852380777  3.109206304  3.037865460  2.598455560 15.450979425
##    [6]  4.733622179  2.480186403  0.399630561  2.628919552  3.698496697
##   [11]  1.012472765  6.941859424  1.999521234  6.490322617  5.111915806
##   [16]  2.034294575  8.030406645  3.201296004  0.093622765  8.939915284
##   [21]  1.014436810  0.640881663  1.126627964  2.914497835  2.806180258
##   [26]  9.920630091  3.254174138  0.414889752  4.700588570  0.354377038
##   [31]  1.480375479  1.121633241  6.121231978  2.389816054  2.702871970
##   [36]  9.013886222  0.178775466  1.320370346  0.919730573  1.725417562
##   [41]  1.553309645  5.739827394  2.751278504  3.698643028  4.676062258
##   [46]  3.812122871  7.278435182  1.086761368  0.124159468  2.048646278
##   [51]  4.563502357  3.770401518  5.671556286  6.970588894  6.658117896
##   [56]  4.729882910  3.375228787  4.616870751  1.669679373  3.238358169
##   [61]  0.734958774  7.672791555  1.283713880  2.323640401 10.073728576
##   [66]  4.691019099  2.863218501  6.754300664  3.091523854  1.260383614
##   [71]  0.158996287  0.690491030  7.236519352  1.061367000  0.513654500
##   [76]  9.838932450  2.080668946  0.332561861  8.316471693  0.408903168
##   [81]  4.538622902  3.574722497  4.327325436  3.416672516  2.385242902
##   [86]  8.146210675  0.864146173  2.651705128  1.641118190  2.092865386
##   [91]  3.366107034  0.216609427  0.489303533  0.912900267  1.063013014
##   [96]  6.837928773 12.323481865  3.582094965  0.602631463  4.174828061
##  [101]  0.729270777  5.709839433  5.384287102  0.429651588  2.415587528
##  [106]  2.101911631  4.523340456  0.834685430  6.257241062  0.727302836
##  [111]  0.531167509  0.802529630  5.759148717  1.330296913  1.609193382
##  [116]  1.746698182  2.339419424  2.841063026  2.097262603  2.354760737
##  [121]  0.442306265  7.208270615  2.188616038  0.056413418 10.349939847
##  [126]  3.371936024 14.535708813  0.735872557  3.511678413  2.265849160
##  [131]  0.039256050  6.454424082  1.277832265  9.021999225  1.096409059
##  [136]  2.761917359  0.128671054  2.703598817  1.324357012  0.453988084
##  [141]  0.351512039  5.526375298  0.544090047  0.299089870  5.980070150
##  [146]  1.110647936  2.195951058  0.938600828  2.608013852  6.079154224
##  [151]  0.045280994  4.794426685  1.542379777  0.376087033  5.158523467
##  [156]  3.432638193 11.434221842  0.428077402  0.282466288 11.776089768
##  [161]  3.352738438  3.901712607 19.628117738  0.596319692  3.709154079
##  [166]  2.216636779  8.353473930  1.486519917  0.734489799  3.541697835
##  [171]  1.103258397  2.245158585  4.107367700  3.387252978  4.489355612
##  [176]  4.897362970  4.352877975  3.836680073  9.337416895  3.248104137
##  [181]  1.952001829  7.402323233  0.292054202  1.212592127  3.669444952
##  [186]  7.472067446  1.632741017  0.539565548  4.906654344  0.984561781
##  [191]  3.213994475  2.820256405  0.204425920  2.954599908 12.142129904
##  [196]  0.351177784  2.807273979  2.492472886  0.896519479  6.538667236
##  [201]  4.575715613  0.237983242  5.381611140  0.025422440 17.491400374
##  [206]  0.151228919  0.041210897  1.578045222  7.990793803  1.140586769
##  [211]  1.425772294  6.117681055  0.122802461  4.302937554  1.706823089
##  [216]  4.959760503  1.319614683  0.735676323  0.906169734 18.355265597
##  [221] 10.248724824  1.050553241  1.042025647  5.428994957  0.466698141
##  [226]  2.485027851  1.696976707  0.162050682  2.784301194  5.406383486
##  [231]  0.643220011  0.132868987  1.149794318  0.892150693  1.568816138
##  [236]  7.364369463  8.998408752  0.543853058  0.958449287  1.263275060
##  [241]  8.680305617  2.339584733  2.209614357  3.350989042  3.086619864
##  [246]  1.005272038  5.311309230  3.412816654  6.491440954  0.815249501
##  [251]  5.335092228  4.382267810  0.021344030  3.562973129  5.972171844
##  [256]  4.466163163  1.774893340  5.339354415  0.047721405  1.124701108
##  [261]  2.111336078  4.545810384 13.507553111  0.726596290  4.517635427
##  [266]  5.521027689  5.993788191  5.522238420  9.020437122  1.845947382
##  [271]  2.758370653  2.630063535  3.106118238  2.814219960  1.097099094
##  [276]  1.551164221  2.236425812  1.887930756  1.134933993  6.999261161
##  [281]  5.810937292 12.423489647  3.712093265  4.214996231  7.181064957
##  [286] 13.043878173 12.745118012  3.624773053  1.151511931  1.266968981
##  [291]  3.157247835  1.205185171  6.552687777  4.394870557  1.126610416
##  [296]  7.377538173  7.653611704  3.157608430  4.884736981  0.257439633
##  [301]  0.173081807  3.122581583  6.102784372  9.643505729  0.985096031
##  [306]  7.125045843  5.284928940  5.769075857  4.296431203 11.712784961
##  [311]  3.172470494  9.762607692  6.098729311  5.183835327  1.580303506
##  [316]  0.637555005  0.462442590  1.161795756  2.502392681  1.057626565
##  [321]  2.896204941  0.358888866  0.901430908  2.269119239  0.878178085
##  [326]  4.013095148  6.374461770  2.307405697  1.240727119 11.784148300
##  [331]  0.470845465  6.978872206 11.199566691  1.553074221  5.342843268
##  [336]  5.689289848  0.810161658  1.450717526  2.521929504  2.113526903
##  [341]  3.658785620 13.179302035  4.200538961  2.989226279  0.387669032
##  [346]  3.872490448  0.864035873  3.486516601  3.163213256  2.707902346
##  [351]  3.187016707 12.998997550  0.622204894  3.351495178  1.661535142
##  [356]  7.238432065  8.287132155  1.888532565  0.122006030  5.072128027
##  [361]  0.255464184  2.552254480  0.016718131  0.776912679  3.035450005
##  [366]  3.390102601  4.132966809  3.870962916  3.816933464  0.640003574
##  [371]  7.020040072  8.584313562  7.398310629  5.929003340  1.612647196
##  [376]  0.574984327  5.405448967 11.219921583  0.387533154  2.208030879
##  [381]  5.174942845  1.129994590  4.817359207  3.564708938  2.779931453
##  [386]  4.558146010  6.493833088  1.480950845  1.993990386  2.598352948
##  [391]  0.235597801  5.767812351  1.953187454  1.845705080  0.476565890
##  [396]  0.102024194  5.576784506  3.763834845  2.753240318 12.385887858
##  [401]  2.695898668  0.245797222  1.351726887  0.382203316  2.889544988
##  [406]  2.188239421  0.007314591  2.180737562  8.455559816  3.670920790
##  [411]  0.534617207  6.249717490  6.648628660  0.792581094  3.325800479
##  [416]  0.454891186  2.674123396  2.113808374  2.847370190  1.515487467
##  [421]  3.068626215  3.940373360  0.029553950  7.412565678  0.426190725
##  [426]  5.941848030  7.934791386  7.367036421  2.380379489  3.158717364
##  [431]  0.264631303  6.395575003  5.891494660  2.280915456  0.590257565
##  [436]  1.393090545  5.428280831 12.982228023  0.720591364  1.442643707
##  [441]  2.012946027  1.147126026  3.127349219 12.769599305  1.159377325
##  [446]  1.082054583  2.152348612  4.437215159  1.814568959  3.552740668
##  [451]  0.364645427  2.764492022  5.786840683  8.129179529  5.838393095
##  [456]  5.065214686  4.704299819  2.292423090  1.785100941  0.014974688
##  [461]  3.300249570 13.562155156  6.404286997  4.950455038  0.637304864
##  [466]  3.064036965  0.196256388  1.051735545  0.350995114  1.119146561
##  [471]  3.729249947  3.159922817  1.795796851  3.358802725  0.861435969
##  [476]  0.749399076  4.891437921  2.248180470  2.011495858  1.225542400
##  [481]  2.282391671  1.974742291  0.441139306 17.977525586  0.703740006
##  [486]  1.979284650  1.178705470  5.121465190  0.091690723  8.110342720
##  [491]  3.963697738  6.167933827 13.226997449  7.987267956  5.114614385
##  [496]  4.752185544  2.122685058  0.769359079  8.032245251  2.123409198
##  [501]  2.103962335 15.567096465  3.663813144  1.655976231  3.552741203
##  [506]  1.593309907  5.179473607 10.951095403  1.040711872  4.099125725
##  [511]  4.237197918  2.572024191  2.016867249  7.031318865  2.821717689
##  [516]  1.282456844  6.947770108  1.956671671  0.183012118  2.146704829
##  [521] 16.222517601  0.067345716  5.098047646  0.414723334  0.174381948
##  [526]  1.746275834  4.891523056  3.173488829  0.895149848  0.861065840
##  [531] 10.273287054  1.939077042  8.090423401  2.141558340  4.212666784
##  [536] 17.816471827  1.741094069  5.033103811  4.231596189  5.044652562
##  [541]  2.233930714  5.335560926  6.288484604  3.255643881  3.176925332
##  [546]  2.032306620  2.401038270 16.656873319  8.831298915  1.497618954
##  [551]  3.273533519  5.676920724  0.468115783  2.396620893  1.411948697
##  [556]  3.525146034  6.446577700 14.579456509 12.508583693  0.161352202
##  [561]  0.862589488  0.627549815  0.200960618  4.090325209  5.863252897
##  [566]  2.558674242  5.040691692  3.620324461  9.023190857  5.367471350
##  [571]  0.404223172  1.741810817  4.071977490  0.343893936  2.808708165
##  [576]  7.202852603  3.889148585  3.412414047  0.617888558  5.675214485
##  [581]  5.618802211  0.102580306  4.724093954  5.004489013  8.682202963
##  [586]  1.777339206  1.388411775  1.562734505 15.756496874  2.764846931
##  [591]  1.554158273  1.653136528  2.864645094  5.088013779  2.986616955
##  [596]  3.809038883  8.645431148  3.462754562  8.706952363  5.212929725
##  [601]  5.909984993 14.214520115 10.412667769  4.478829809  7.535102376
##  [606]  2.474374832 16.683076439  8.935206017  2.265774501  8.856569506
##  [611]  8.873468211  2.692473938  1.052170857  2.593719198  2.816731168
##  [616] 19.854743426  0.306654270  6.526598953  3.584431973  0.042259546
##  [621] 15.280476267  0.431434010  7.813080372 11.932090006  0.685169745
##  [626]  6.851725318  2.380698502  2.881421772  7.581749704  7.450086501
##  [631] 11.076412363  0.736356550  1.439531728  0.778429504  0.692785157
##  [636]  0.355585299  0.324299223  2.067483474  1.595062298  0.015306725
##  [641]  0.841275359  0.683355859  2.614527722  2.534891453  6.233784041
##  [646]  8.024730194 10.099592932  7.437462622  3.465241255  0.200207742
##  [651]  0.211857585  0.937115860  1.259119324  1.678063119  6.394659325
##  [656]  3.011221684  3.251165094  1.366039947  2.791243334  3.690138535
##  [661]  2.901972707  7.914513090  6.792207341  2.906616637  1.853794930
##  [666]  2.475678201  1.890062899  4.699198529  6.628684668  0.758380330
##  [671]  3.844901769  1.050397850  0.029760500  4.141698198  5.012647172
##  [676]  2.426943809  2.358472233  3.093293526 11.518005402  3.158870477
##  [681]  0.506056574  6.871217842  2.518932426  4.591710361  3.216643298
##  [686]  2.137160534 13.109966240  2.472459653  1.141206229  0.277636368
##  [691]  2.694946285  2.088629390  6.578202400  8.878764298  4.328744998
##  [696]  0.607274359  3.069341699  5.442913176  9.406815836  2.611704325
##  [701]  1.671234855  3.769147336  0.485564539  1.130400538  0.262000246
##  [706]  7.595307557  5.147652826  3.178325038  0.609244341  0.008868522
##  [711]  2.152343715  5.443811800  3.908195815  0.490855336  0.058184358
##  [716]  1.964450296  8.390083496  0.856995072  2.477643203  9.208026739
##  [721]  3.523652532  2.720149190  2.624908443  5.734281005  5.024077514
##  [726]  3.196871936  2.132845372  7.359918649  2.840503316  1.998027167
##  [731]  0.310575137  2.499625478  1.333684625  6.309060120  0.707022135
##  [736]  1.928535626  1.434800158  0.658521462  0.245469297  5.993584623
##  [741]  5.937031228  1.593910736  1.527705828  0.067859999  2.690595077
##  [746]  8.355777758  1.691539876  6.259858561  1.125528391  5.386997777
##  [751]  0.432592757  2.144274358  0.189539115  9.736312120  3.995563830
##  [756]  3.098790397  0.655318932  7.391047093  2.688961535  0.369209675
##  [761]  6.235347669  1.601585834  6.314107086  1.817275504  0.646999158
##  [766]  1.815696047  1.724770802  0.774702586  0.811582995  2.217054690
##  [771]  0.076225382  3.633700681 11.227947325 13.148544401  6.199935776
##  [776]  8.874757934  4.517807389  2.417023408  4.623699675  6.658633093
##  [781]  4.044615154  1.738341573  2.673738006  8.288505789  2.659223646
##  [786]  2.517005672  3.431147181  0.499002050  0.733022676  2.641812718
##  [791]  2.362901281  0.285893860  1.862018361  7.877983989  0.464151235
##  [796]  1.083381076  6.504808849 13.528513805  9.237068525  3.608855006
##  [801]  0.775139373  3.896374551  0.089003510  2.309198184  9.594046149
##  [806]  1.408336706  7.656132476  3.371754636  3.986925607  5.712004467
##  [811]  8.648842563  0.139181142  1.691625608  7.276914357  6.604882480
##  [816]  0.280871244  2.345232353  3.199164099  1.731213731  0.906785560
##  [821]  4.353474186  0.155725221  0.223448376  3.779911847  5.176427749
##  [826]  4.184171240  5.313580686  4.576825646  4.154953482  0.690737635
##  [831]  0.728782526  2.967448326  0.446174854  1.007025395  1.204849221
##  [836]  5.799620060  0.746522082  4.166733702  0.456762605  2.883034835
##  [841]  0.957403897  3.417426347 10.630175786  0.712674703  1.389872899
##  [846]  1.199138965  1.740738221  1.256558937  2.136952393 12.862346190
##  [851]  9.154381163  6.023547435  2.673257619  0.775279255  0.569082082
##  [856]  2.182246668 16.641444613  0.948426927  0.441191746  1.878861975
##  [861]  0.509324963  1.622948053  0.918635471 11.584130495  5.128186455
##  [866]  0.370337457  3.196808432  1.231893442  5.700845411  4.622922690
##  [871]  2.717617468 16.570756079  2.416864462  4.078287076  2.345650635
##  [876]  1.363981748  0.399134804  9.178234876  1.814036548  5.087943547
##  [881]  8.862191737  1.643509433  0.650427614  6.635995843  3.093449415
##  [886]  4.941914827  4.977007712  0.895248282  4.153066244  0.977259029
##  [891]  6.879590823  0.081975055  5.403212667  0.777847826  1.523383920
##  [896]  0.527664158  6.904729458  3.096941553  5.421100277  4.785761545
##  [901]  2.235347163  2.630839297  1.742589060  0.379240239  2.151756017
##  [906]  1.855044218  1.829441301  1.311091631  9.410755133  1.832939741
##  [911]  1.016571894  2.323072064  3.416716791  5.691269871  6.937546595
##  [916]  3.074685985  4.051563308  5.380218206  1.401663074  2.857890718
##  [921]  3.640364825  4.446774159  0.278134563  0.234070802  0.611373852
##  [926]  0.146156202  5.548456363  1.136106661  6.749551690  2.917973362
##  [931]  1.460174430  6.129404288  7.764868796  0.402434853  1.297066737
##  [936] 12.727066677  3.339954187  1.351916647  6.814858903  0.692328156
##  [941]  1.906778167  4.920045540  0.496406559  8.786102320  0.735574799
##  [946]  1.261677328  2.883606101  1.037399860  1.165448174  0.235606413
##  [951]  3.631852737  1.174257072  3.663826643  7.147493261 11.913512787
##  [956]  1.318980999  5.587728663 11.270492976  8.665113082  4.024052846
##  [961]  0.200409087  1.163920141  1.839020544  1.372396442  0.504132049
##  [966]  5.027656477 12.325916485  4.433231304  0.119796384  0.079255939
##  [971]  4.847343293  1.953165870  3.345510554  0.185884247  4.192279924
##  [976]  5.241749251  0.127498458  1.557928790  4.329972568  0.704232753
##  [981] 11.210979535  3.827976164  3.638728880  0.890811402  0.470430526
##  [986]  0.668710732  0.003484520 12.668219144  0.475208306  1.396938945
##  [991]  8.023092047 11.512461818 13.604057731  7.089480370  5.525544437
##  [996] 11.250142095  6.965915312  0.016391932  2.940792556  1.493531197
hist(exponential_samples) # Plot samples

hist(train_dataset$PoolArea) # Plot original variable

From the plots, the original variable PoolArea is heavily right-skewed, with almost all of its values at 0. The samples drawn from the fitted exponential distribution are still right-skewed, but the values are more spread out.

round(qexp(0.05, rate = lambda), 4) # Find 5th percentile
## [1] 0.1928
round(qexp(0.95, rate = lambda), 4) # Find 95th percentile
## [1] 11.2607

For the fitted exponential distribution, the 5th percentile is 0.1928 and the 95th percentile is 11.2607.
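As a cross-check on what qexp() computes, the exponential quantile has the closed form -log(1 - p) / rate. The sketch below compares the two. It uses a hypothetical rate (demo_rate), not the lambda fitted earlier in this report.

```r
# demo_rate is a hypothetical value chosen for illustration only
demo_rate <- 0.25
p <- c(0.05, 0.95)

# Closed-form quantile of an exponential distribution
closed_form <- -log(1 - p) / demo_rate

# Built-in quantile function for comparison
built_in <- qexp(p, rate = demo_rate)

all.equal(closed_form, built_in)
```

The same comparison should hold for any positive rate, including the fitted lambda.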

Z <- 1.96 # Z value for 95% confidence interval
n <- length(train_dataset$PoolArea) # length of PoolArea
mean <- mean(train_dataset$PoolArea) # Mean of PoolArea
standard_deviation <- sd(train_dataset$PoolArea) # Standard deviation of PoolArea
upper_bound <- round(mean + Z * standard_deviation / sqrt(n), 4) # Calculate upper bound of confidence interval
lower_bound <- round(mean - Z * standard_deviation / sqrt(n), 4) # Calculate lower bound of confidence interval
c(lower_bound, upper_bound) # Display confidence interval
## [1] 0.6980 4.8198

The 95% confidence interval for the mean of PoolArea, assuming normality, is (0.6980, 4.8198).
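The interval above is mean ± Z * sd / sqrt(n). A minimal sketch of the same calculation as a reusable function follows. It assumes a normal approximation and takes Z from qnorm() rather than hard-coding 1.96. The names mean_ci and demo_values are hypothetical, and the data are simulated, not PoolArea.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean
mean_ci <- function(values, level = 0.95) {
  n <- length(values)
  z <- qnorm(1 - (1 - level) / 2)  # about 1.96 when level = 0.95
  se <- sd(values) / sqrt(n)       # standard error of the mean
  c(lower = mean(values) - z * se, upper = mean(values) + z * se)
}

# Simulated data for illustration only
set.seed(42)
demo_values <- rexp(1000, rate = 0.25)
demo_ci <- mean_ci(demo_values)
demo_ci
```

Calling mean_ci(train_dataset$PoolArea) should give nearly the same bounds as the code above. Small differences in the last digits are possible because qnorm(0.975) is slightly below 1.96.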

quantile(fit_variable, 0.05) # Find empirical 5th percentile
## 5% 
##  1
quantile(fit_variable, 0.95) # Find empirical 95th percentile
## 95% 
##   1

The empirical 5th percentile and empirical 95th percentile are both 1.

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

To build the multiple regression model, let’s choose variables that look like a good fit. From looking over the data, I chose LotArea, GarageArea, OverallQual, and GrLivArea. The first model also includes PoolArea, the variable analyzed above.

linear_model <- lm(train_dataset$SalePrice ~ train_dataset$LotArea + train_dataset$GarageArea + train_dataset$PoolArea + train_dataset$OverallQual + train_dataset$GrLivArea)
summary(linear_model)
## 
## Call:
## lm(formula = train_dataset$SalePrice ~ train_dataset$LotArea + 
##     train_dataset$GarageArea + train_dataset$PoolArea + train_dataset$OverallQual + 
##     train_dataset$GrLivArea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -418154  -20584   -1794   17549  301223 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.036e+05  4.803e+03 -21.564  < 2e-16 ***
## train_dataset$LotArea      7.889e-01  1.093e-01   7.214 8.69e-13 ***
## train_dataset$GarageArea   6.849e+01  6.066e+00  11.291  < 2e-16 ***
## train_dataset$PoolArea    -2.074e+01  2.643e+01  -0.785    0.433    
## train_dataset$OverallQual  2.862e+04  1.029e+03  27.810  < 2e-16 ***
## train_dataset$GrLivArea    4.572e+01  2.620e+00  17.451  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39910 on 1454 degrees of freedom
## Multiple R-squared:  0.7485, Adjusted R-squared:  0.7476 
## F-statistic: 865.5 on 5 and 1454 DF,  p-value: < 2.2e-16

Not every predictor in this model is useful: PoolArea is not significant (p = 0.433). Let’s remove LotArea and PoolArea, since those have the highest p-values.

linear_model <- lm(train_dataset$SalePrice ~ + train_dataset$GarageArea + train_dataset$OverallQual + train_dataset$GrLivArea)
summary(linear_model)
## 
## Call:
## lm(formula = train_dataset$SalePrice ~ +train_dataset$GarageArea + 
##     train_dataset$OverallQual + train_dataset$GrLivArea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403609  -21227   -1439   17917  299973 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -99060.087   4837.368  -20.48   <2e-16 ***
## train_dataset$GarageArea      72.948      6.138   11.88   <2e-16 ***
## train_dataset$OverallQual  27910.785   1040.867   26.82   <2e-16 ***
## train_dataset$GrLivArea       49.649      2.565   19.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40590 on 1456 degrees of freedom
## Multiple R-squared:  0.7394, Adjusted R-squared:  0.7389 
## F-statistic:  1377 on 3 and 1456 DF,  p-value: < 2.2e-16
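When comparing predictors across summaries like the ones above, the p-values can also be pulled out of the summary object instead of being read from the printout. A minimal sketch on hypothetical toy data follows. The names toy, signal, noise, and response are illustrative and not part of the housing dataset.

```r
# Hypothetical toy data: one informative predictor and one pure-noise predictor
set.seed(7)
toy <- data.frame(signal = rnorm(200), noise = rnorm(200))
toy$response <- 3 * toy$signal + rnorm(200)

toy_fit <- lm(response ~ signal + noise, data = toy)

# The coefficient table is a matrix; the p-values are in the "Pr(>|t|)" column
p_values <- summary(toy_fit)$coefficients[, "Pr(>|t|)"]

# Largest p-values first: these are the candidates for removal
sort(p_values, decreasing = TRUE)
```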

The formula based on the summary of the linear model is SalePrice = 72.948 * GarageArea + 27910.785 * OverallQual + 49.649 * GrLivArea - 99060.087.

hist(linear_model$residuals)

qqnorm(linear_model$residuals)
qqline(linear_model$residuals)

The histogram of the residuals looks nearly normal, and most points in the Q-Q plot follow the straight line. So we can reasonably assume the residuals are approximately normally distributed.

SalePrice <- (72.948 * train_dataset$GarageArea) + (27910.785 * train_dataset$OverallQual) + (49.649 * train_dataset$GrLivArea) - 99060.087 # Apply the fitted equation; note this uses the train_dataset predictors (1460 rows)
test_data <- test_dataset[,c("Id", "GarageArea", "OverallQual", "GrLivArea")] # Get Id and the model variables from the test dataset
model_submission <- cbind(test_data$Id, SalePrice) # Pair the test Ids (1459 rows) with the computed SalePrice; the length mismatch triggers the warning below
## Warning in cbind(test_data$Id, SalePrice): number of rows of result is not a
## multiple of vector length (arg 1)
model_submission[model_submission < 0] <- median(SalePrice) # To avoid any negative numbers in model
colnames(model_submission) <- c("Id", "SalePrice") # Change to appropriate column names
model_submission <- as.data.frame(model_submission[1:1459,]) # Change to a dataframe with 1459 observations
head(model_submission) # Display model
##     Id SalePrice
## 1 1461  221190.7
## 2 1462  164617.7
## 3 1463  229340.9
## 4 1464  228395.4
## 5 1465  294339.2
## 6 1466  143130.8
write.csv(model_submission, file = "Kaggle Submission.csv", row.names = FALSE) # Create csv file to submit to Kaggle

Kaggle.com Username: bpersaud Score: 0.59837
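The cbind() warning above arises because SalePrice was computed from the 1460 training rows, while test_data has 1459 Ids. One way to get a prediction for each test row is to fit the model with bare column names plus a data argument, then call predict() with newdata. The sketch below uses hypothetical toy data (toy_train, toy_test) in place of the real files. It illustrates the approach and is not the submission reported above.

```r
# Hypothetical toy data standing in for train_dataset and test_dataset
set.seed(1)
toy_train <- data.frame(
  GarageArea  = runif(50, 200, 800),
  OverallQual = sample(1:10, 50, replace = TRUE),
  GrLivArea   = runif(50, 800, 3000)
)
toy_train$SalePrice <- 70 * toy_train$GarageArea +
  28000 * toy_train$OverallQual +
  50 * toy_train$GrLivArea - 99000 + rnorm(50, sd = 20000)

toy_test <- data.frame(
  Id          = 1001:1010,
  GarageArea  = runif(10, 200, 800),
  OverallQual = sample(1:10, 10, replace = TRUE),
  GrLivArea   = runif(10, 800, 3000)
)

# Bare column names plus data = let predict() look the predictors up in newdata
toy_model <- lm(SalePrice ~ GarageArea + OverallQual + GrLivArea, data = toy_train)

# One prediction per test row, so the lengths match the Id column
toy_pred <- predict(toy_model, newdata = toy_test)
toy_submission <- data.frame(Id = toy_test$Id, SalePrice = as.numeric(toy_pred))
head(toy_submission)
```

When a formula is written as train_dataset$SalePrice ~ train_dataset$GarageArea + ..., predict() cannot substitute the columns of newdata, which is why the data argument form is used here.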