Data 605 - Final Exam

library(matrixcalc)
library(kableExtra)
library(MASS)

Problem 1

Generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ=σ=(N+1)/2.

Similar to Assignment 5

set.seed(1234)
#I chose 8
N <- 8
X <- runif(10000, 1, N) 
Y <- rnorm(10000, (N+1)/2, (N+1)/2)
df <- data.frame(cbind(X, Y))
summary(X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.002   2.767   4.509   4.502   6.232   7.997

summary(Y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -13.506   1.562   4.549   4.551   7.534  20.781

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

#given
x<-median(X)
#given
y<-quantile(df$Y,.25)

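a. \(P(X>x | X>y)\)

#numerator is the joint event X > x and X > y (not Y > y)
pXXy <- nrow(subset(df, X > x & X > y))/10000
pXy <- nrow(subset(df, X > y))/10000
a <- (pXXy/pXy)
a

## [1] 0.5423582

Given that X exceeds y (the first quartile of Y), there is about a 54% chance that X also exceeds its own median.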

b. \(P(X>x,Y>y)\)

b <- nrow(subset(df, X > x & Y > y))/10000
b

## [1] 0.3738

About 37% of the draws have X above its median and Y above its first quartile at the same time.

c. \(P(X<x|X>y)\)

pXXy2 <- nrow(subset(df, X < x & X > y))/10000
pXy2<- nrow(subset(df, X > y))/10000
c<-pXXy2/pXy2
c

## [1] 0.4576418

Given that X exceeds y, there is about a 46% chance that X falls below its median.
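The conditional probabilities above share one pattern: count the draws that satisfy both the event and the condition, then divide by the count that satisfies the condition. A minimal sketch of that pattern follows. `cond_prob` is a hypothetical helper introduced only for illustration, and the data are regenerated with the same seed and N as above.

```r
# Hypothetical helper: empirical P(event | given) from two logical vectors.
cond_prob <- function(event, given) {
  sum(event & given) / sum(given)
}

set.seed(1234)
N <- 8
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

p_a <- cond_prob(X > x, X > y)   # P(X > x | X > y)
p_b <- mean(X > x & Y > y)       # P(X > x, Y > y)
p_c <- cond_prob(X < x, X > y)   # P(X < x | X > y)

# (a) and (c) condition on the same event and split it in two,
# so they should sum to 1 as long as no draw equals the median.
p_a + p_c
```

Writing the condition once as a logical vector makes it harder to mix up the conditioning event between the numerator and the denominator.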

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

The table is built using the kable, kable_styling, and row_spec functions loaded with kableExtra.

mtrx <-matrix( c(sum(X>x & Y<y),sum(X>x & Y>y), sum(X<x & Y<y),sum(X<x & Y>y)), 2,2)
mtrx<-cbind(mtrx,c(mtrx[1,1]+mtrx[1,2],mtrx[2,1]+mtrx[2,2]))
mtrx<-rbind(mtrx,c(mtrx[1,1]+mtrx[2,1],mtrx[1,2]+mtrx[2,2],mtrx[1,3]+mtrx[2,3]))
df_mtrx<-as.data.frame(mtrx)
names(df_mtrx) <- c("X>x","X<x", "Total")
row.names(df_mtrx) <- c("Y<y","Y>y", "Total")
kable(df_mtrx) %>%
  kable_styling(bootstrap_options = "bordered") %>%
  row_spec(0, bold = T, color = "black", background = "#7fcdbb")

	X>x	X<x	Total
Y<y	1262	1238	2500
Y>y	3738	3762	7500
Total	5000	5000	10000

Probability Calculations

prob <-mtrx/10000
df_prob <-as.data.frame(prob)
names(df_prob) <- c("X>x","X<x", "Total")
row.names(df_prob) <- c("Y<y","Y>y", "Total")
#round probability to two decimal places
kable(round(df_prob,2)) %>%
  kable_styling(bootstrap_options = "bordered") %>%
  row_spec(0, bold = T, color = "black", background = "#7fcdbb")

	X>x	X<x	Total
Y<y	0.13	0.12	0.25
Y>y	0.37	0.38	0.75
Total	0.50	0.50	1.00

\(P(X>x \text{ and } Y>y)\)

#this is the second row and first column
df_prob [2,1]

## [1] 0.3738

\(P(X>x)P(Y>y)\)

#marginal totals: P(X>x) is [3,1] and P(Y>y) is [2,3]
df_prob[3,1]*df_prob[2,3]

## [1] 0.375

The values are extremely close (0.3738 versus 0.375), which is what independence predicts. The Chi Square and Fisher's Exact Tests below check this formally.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Before computers were readily available, people analyzed contingency tables by hand or with a calculator, using the chi-square test. This test computes the expected count for each cell under independence (a relative risk or odds ratio of 1.0). It then combines the discrepancies between observed and expected counts into a chi-square statistic, from which a p-value is computed. The chi-square test is only an approximation, but with large sample sizes it works very well. Chi-square is also more suitable for contingency tables larger than 2x2. Fisher's exact test always gives an exact p-value and works well with small sample sizes. Unlike chi-square, it is very hard to calculate by hand but easy to compute with a computer. Here, with 10,000 observations and every expected cell count far above 5, the chi-square approximation is appropriate; Fisher's exact test matters most when expected counts are small.

Chi Square Test

chisq.test(mtrx, correct=TRUE)

## 
##  Pearson's Chi-squared test
## 
## data:  mtrx
## X-squared = 0.3072, df = 4, p-value = 0.9893

Fisher’s Exact Test

fisher.test(mtrx,simulate.p.value=TRUE)

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  mtrx
## p-value = 0.987
## alternative hypothesis: two.sided

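With high p-values, we fail to reject the null hypothesis that X>x and Y>y are independent events. Note that mtrx includes the Total row and column, so both tests were run on a 3x3 table (hence df = 4). The total cells match their expected counts exactly and add nothing to the statistic, but the extra degrees of freedom inflate the p-value. On the interior 2x2 table the test has df = 1; the statistic of about 0.31 is still far below the 5% critical value of 3.84, so the conclusion is unchanged.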
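A test of independence is normally applied to the cell counts alone, without the marginal totals. The sketch below reruns both tests on the interior 2x2 table, with the counts copied from the contingency table above. The object names `tbl`, `chi`, and `fis` are introduced here for illustration only.

```r
# Counts copied from the contingency table above, without the Total margins.
tbl <- matrix(c(1262, 3738, 1238, 3762), nrow = 2,
              dimnames = list(c("Y<y", "Y>y"), c("X>x", "X<x")))

chi <- chisq.test(tbl, correct = TRUE)  # Yates correction is used for 2x2 tables
fis <- fisher.test(tbl)                 # exact test; no simulation needed for 2x2

chi$parameter   # degrees of freedom: (2 - 1) * (2 - 1) = 1
chi$statistic   # continuity-corrected chi-square statistic
chi$p.value
fis$p.value
```

Both p-values stay well above 0.05, so independence is still not rejected, but they are the p-values that belong to a 2x2 test with one degree of freedom.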

Problem 2

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set.

Training Dataset

traindf <- read.csv("/Users/aaronzalki/Downloads/train.csv")
head(traindf)

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000

I chose the following variables:

LotArea : Lot size in square feet

GrLivArea : Above grade (ground) living area square feet

SalePrice : the property’s sale price in dollars. This is the target variable that I will try to predict using Python.

summary(traindf$LotArea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245

hist(traindf$LotArea)

summary(traindf$GrLivArea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642

hist(traindf$GrLivArea)

summary(traindf$SalePrice)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

hist(traindf$SalePrice)

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

plot(traindf$GrLivArea,traindf$SalePrice)

plot(traindf$LotArea,traindf$SalePrice)
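Base R's `pairs` function draws all pairwise scatterplots in a single matrix, which may be closer to what the prompt asks for than two separate plots. The sketch below runs on simulated stand-in data so that it does not depend on train.csv. The `demo` data frame and its values are assumptions for illustration, not the Kaggle data.

```r
# Stand-in data so the sketch runs without train.csv; the values are
# simulated and are NOT the Kaggle data.
set.seed(1)
demo <- data.frame(LotArea = rlnorm(200, 9, 0.5),
                   GrLivArea = rlnorm(200, 7.3, 0.3))
demo$SalePrice <- 100 * demo$GrLivArea + 2 * demo$LotArea + rnorm(200, 0, 20000)

# One panel per pair of variables.
pairs(demo[, c("LotArea", "GrLivArea", "SalePrice")],
      main = "Scatterplot matrix (simulated stand-in data)")
```

With the real data the equivalent call would be `pairs(traindf[, c("LotArea", "GrLivArea", "SalePrice")])`.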

Derive a correlation matrix for any three quantitative variables in the dataset.

df_for_cor <-data.frame(traindf$SalePrice,traindf$LotArea,traindf$GrLivArea)
corr <-cor(df_for_cor)
corr

##                   traindf.SalePrice traindf.LotArea traindf.GrLivArea
## traindf.SalePrice         1.0000000       0.2638434         0.7086245
## traindf.LotArea           0.2638434       1.0000000         0.2631162
## traindf.GrLivArea         0.7086245       0.2631162         1.0000000

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

Null Hypothesis: The correlation between each pair of variables is zero.

Alternative Hypothesis: The correlation between each pair of variables is not zero.

cor.test(traindf$SalePrice, traindf$LotArea, conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  traindf$SalePrice and traindf$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434

cor.test(traindf$SalePrice, traindf$GrLivArea, conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  traindf$SalePrice and traindf$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

cor.test(traindf$LotArea, traindf$GrLivArea, conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  traindf$LotArea and traindf$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

For all three tests, the p-value is below 2.2e-16, far lower than an alpha of .05, so we reject the null hypothesis of zero correlation for every pair. We are 80% confident that the correlation is between 0.2323391 and 0.2947946 for SalePrice and LotArea, between 0.6915087 and 0.7249450 for SalePrice and GrLivArea, and between 0.2315997 and 0.2940809 for LotArea and GrLivArea. Familywise error is the chance of at least one false rejection across a family of tests, and it grows with the number of tests: with three tests at alpha = .05 it is about 14% (1 - 0.95^3) if the tests are independent. We would not be worried here, because the p-values are so small that all three results remain significant even under a Bonferroni correction (alpha/3, about .0167). The three 80% intervals are less likely to hold simultaneously, which can be addressed by raising the confidence level of each interval.
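The familywise error rate for a small family of tests can be worked out directly. The sketch below assumes the three tests are independent, which is a simplification since the pairs share variables, and shows the Bonferroni-adjusted levels. The variable names are introduced here for illustration.

```r
alpha <- 0.05   # per-test significance level
m <- 3          # number of pairwise correlation tests

# Chance of at least one false rejection if the m tests were independent.
fwer <- 1 - (1 - alpha)^m

# Bonferroni: test each pair at alpha / m. The matching per-interval level
# that keeps three 80% intervals at roughly 80% joint coverage is 1 - 0.20 / m.
alpha_bonf <- alpha / m
conf_bonf <- 1 - 0.20 / m

c(fwer = fwer, alpha_bonf = alpha_bonf, conf_bonf = conf_bonf)
```

The adjusted interval level could then be passed to `cor.test` through its `conf.level` argument.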

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

inverted <- matrix.inverse(corr)
inverted

##                   traindf.SalePrice traindf.LotArea traindf.GrLivArea
## traindf.SalePrice         2.0349350      -0.1692033        -1.3974846
## traindf.LotArea          -0.1692033       1.0884485        -0.1664868
## traindf.GrLivArea        -1.3974846      -0.1664868         2.0340972
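The link between the precision matrix and variance inflation factors can be checked numerically. Each diagonal entry equals 1/(1 - R^2), where R^2 comes from regressing that variable on the other two. The sketch below uses the correlations printed above and base R's `solve` in place of `matrix.inverse`. The shortened variable names are for illustration only.

```r
# Correlations copied from the matrix above.
vars <- c("SalePrice", "LotArea", "GrLivArea")
corr <- matrix(c(1.0000000, 0.2638434, 0.7086245,
                 0.2638434, 1.0000000, 0.2631162,
                 0.7086245, 0.2631162, 1.0000000),
               nrow = 3, dimnames = list(vars, vars))

precision <- solve(corr)      # base-R matrix inverse
vif <- diag(precision)        # diagonal of the precision matrix
r_squared <- 1 - 1 / vif      # implied R^2 of each variable on the other two

round(vif, 4)
round(r_squared, 4)
```

A VIF near 1, as for LotArea, means the variable is only weakly explained by the other two. The values near 2 for SalePrice and GrLivArea reflect their correlation of about 0.71.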

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

corr_by_precision<- corr%*%inverted
corr_by_precision

##                   traindf.SalePrice traindf.LotArea traindf.GrLivArea
## traindf.SalePrice      1.000000e+00    1.387779e-17      0.000000e+00
## traindf.LotArea       -5.551115e-17    1.000000e+00     -1.110223e-16
## traindf.GrLivArea      0.000000e+00    2.775558e-17      1.000000e+00

precision_by_corr <-inverted%*%corr
precision_by_corr

##                   traindf.SalePrice traindf.LotArea traindf.GrLivArea
## traindf.SalePrice      1.000000e+00   -1.110223e-16     -2.220446e-16
## traindf.LotArea        1.387779e-17    1.000000e+00      2.775558e-17
## traindf.GrLivArea      2.220446e-16    0.000000e+00      1.000000e+00

Conduct LU decomposition on the matrix.

lu.decomposition(precision_by_corr)

## $L
##              [,1]        [,2] [,3]
## [1,] 1.000000e+00 0.00000e+00    0
## [2,] 1.387779e-17 1.00000e+00    0
## [3,] 2.220446e-16 2.46519e-32    1
## 
## $U
##      [,1]          [,2]          [,3]
## [1,]    1 -1.110223e-16 -2.220446e-16
## [2,]    0  1.000000e+00  2.775558e-17
## [3,]    0  0.000000e+00  1.000000e+00
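Because the product above is numerically the identity matrix, its L and U factors are close to the identity as well. A less trivial illustration is to factor the correlation matrix itself. The sketch below is a minimal Doolittle LU without pivoting written in base R. `lu_doolittle` is a hypothetical helper, not the matrixcalc implementation, and it assumes no zero pivots.

```r
# Minimal Doolittle LU (no pivoting); a sketch, not the matrixcalc code.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # multiplier that clears U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # row elimination
    }
  }
  list(L = L, U = U)
}

# Correlations copied from the correlation matrix derived earlier.
corr <- matrix(c(1.0000000, 0.2638434, 0.7086245,
                 0.2638434, 1.0000000, 0.2631162,
                 0.7086245, 0.2631162, 1.0000000), nrow = 3)

dec <- lu_doolittle(corr)
dec$L
dec$U
max(abs(dec$L %*% dec$U - corr))   # reconstruction error, should be near zero
```

Multiplying the factors back together is a quick check that the decomposition is correct.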

Calculus-Based Probability & Statistics

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function.

I chose the variable TotalBsmtSF (Total square feet of basement area)

hist(traindf$TotalBsmtSF, breaks=100)

massfit <- traindf$TotalBsmtSF
#minimum value is 0 (square feet can't be negative); the exponential distribution allows 0, so no shift is applied
min(massfit)

## [1] 0

lamb <- fitdistr(massfit, "exponential")
lamb

##        rate    
##   9.456896e-04 
##  (2.474983e-05)

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).

lamb$estimate

##         rate 
## 0.0009456896

simulator <- rexp(1000,lamb$estimate)

Plot a histogram and compare it with a histogram of your original variable.

hist(simulator,breaks=100)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.

#percentiles of the simulated sample, which approximate the quantiles of the fitted exponential
quantile(simulator, probs=c(0.05, 0.95))

##         5%        95% 
##   73.69268 2988.52768
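The percentiles of the fitted exponential can also be obtained directly from its CDF rather than from a simulated sample. Setting F(q) = 1 - exp(-λq) equal to p gives q = -log(1 - p)/λ, which is what `qexp` computes. The sketch below uses the rate reported by fitdistr above. The variable names are introduced for illustration.

```r
rate <- 0.0009456896   # fitted rate reported by fitdistr above

# Closed-form quantile: F(q) = 1 - exp(-rate * q) = p  =>  q = -log(1 - p) / rate
p <- c(0.05, 0.95)
by_formula <- -log(1 - p) / rate
by_qexp <- qexp(p, rate = rate)

rbind(by_formula, by_qexp)
```

Both rows give roughly 54 and 3168. The simulated percentiles differ from these because they are based on only 1000 random draws.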

mean(massfit)

## [1] 1057.429

Normal Distribution Histogram Centered Around 1057.429

normality <- rnorm(length(massfit), mean(massfit), sd(massfit))
hist(normality)
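One common reading of "a 95% confidence interval assuming normality" is a normal-theory interval for the mean, mean ± z·sd/√n. The sketch below shows that calculation. `norm_ci` is a hypothetical helper, and the short vector is a stand-in so the code runs without the dataset.

```r
# Hypothetical helper: normal-theory confidence interval for the mean.
norm_ci <- function(v, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)       # about 1.96 for a 95% interval
  se <- sd(v) / sqrt(length(v))         # standard error of the mean
  c(lower = mean(v) - z * se, upper = mean(v) + z * se)
}

# Stand-in vector; with the real data this would be norm_ci(traindf$TotalBsmtSF).
v <- c(1, 2, 3, 4, 5)
ci <- norm_ci(v)
ci
```

If the prompt instead means an interval covering 95% of the observations, the standard error would be replaced by `sd(v)` itself.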

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

quantile(massfit, probs=c(0.05, 0.95))

##     5%    95% 
##  519.3 1753.0

The simulated exponential sample puts its 5th and 95th percentiles at about 74 and 2989, while the observed data have theirs at 519.3 and 1753.0. The exponential model therefore places far too much mass near zero and in the far right tail. TotalBsmtSF is right-skewed but concentrated around its mean of 1057, so the exponential distribution is a poor fit for this variable.

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Refer to Python Code

Username aaronzalki

Score 0.13771