Generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ=σ=(N+1)/2.
Similar to Assignment 5
set.seed(1234)
#I chose 8
N <- 8
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N+1)/2, (N+1)/2)
df <- data.frame(cbind(X, Y))
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.002 2.767 4.509 4.502 6.232 7.997
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -13.506 1.562 4.549 4.551 7.534 20.781
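# x and y are not defined in this excerpt; the table totals further down
# (5000 for X>x, 2500 for Y<y) suggest x = median(X) and y = first quartile of Y
pXXy <- nrow(subset(df, X > x & Y > y))/10000
pXy <- nrow(subset(df, X > y))/10000
a <- (pXXy/pXy)
a
## [1] 0.405467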
## [1] 0.4576418
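A conditional probability such as \(P(X>x \mid X>y)\) is the number of draws that satisfy both conditions divided by the number that satisfy the conditioning event. The following is a minimal sketch, assuming x is the median of X and y is the first quartile of Y (an assumption, since neither is defined above). The names `p_a` and `p_c` are illustrative only, and `p_a` is not the same quantity as `a` above, whose numerator uses the joint event X > x and Y > y.

```r
set.seed(1234)
N <- 8
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)

# Assumed cut points (not defined in the text above)
x <- median(X)
y <- quantile(Y, 0.25)

# P(X > x | X > y): share of the draws with X > y that also have X > x
p_a <- sum(X > x & X > y) / sum(X > y)

# P(X < x | X > y): the complementary share within the same conditioning event
p_c <- sum(X < x & X > y) / sum(X > y)

print(c(p_a = p_a, p_c = p_c))
```

Because both probabilities condition on the same event, they should sum to 1 (barring a draw exactly equal to x).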
The table is built with the kable, kable_styling, and row_spec functions from kableExtra.
mtrx <- matrix(c(sum(X > x & Y < y), sum(X > x & Y > y),
                 sum(X < x & Y < y), sum(X < x & Y > y)), 2, 2)
# append the row totals as a third column, then the column totals as a third row
mtrx <- cbind(mtrx, c(mtrx[1, 1] + mtrx[1, 2], mtrx[2, 1] + mtrx[2, 2]))
mtrx <- rbind(mtrx, c(mtrx[1, 1] + mtrx[2, 1], mtrx[1, 2] + mtrx[2, 2], mtrx[1, 3] + mtrx[2, 3]))
df_mtrx <- as.data.frame(mtrx)
names(df_mtrx) <- c("X>x", "X<x", "Total")
row.names(df_mtrx) <- c("Y<y", "Y>y", "Total")
kable(df_mtrx) %>%
  kable_styling(bootstrap_options = "bordered") %>%
  row_spec(0, bold = T, color = "black", background = "#7fcdbb")

| | X>x | X<x | Total |
|---|---|---|---|
| Y<y | 1262 | 1238 | 2500 |
| Y>y | 3738 | 3762 | 7500 |
| Total | 5000 | 5000 | 10000 |
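The same counts can also be tabulated with base R alone. This is a sketch rather than the method used above; it assumes x is the median of X and y is the first quartile of Y, which is consistent with the 5000 and 2500 totals in the table. Rows and columns are labeled FALSE/TRUE for the events Y > y and X > x.

```r
set.seed(1234)
N <- 8
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)            # assumed cut point for X
y <- quantile(Y, 0.25)    # assumed cut point for Y

# Cross-tabulate the two indicator events, then append row and column totals
tab <- addmargins(table(Y > y, X > x, dnn = c("Y>y", "X>x")))
print(tab)
```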
Probability Calculations
prob <- mtrx/10000
df_prob <- as.data.frame(prob)
names(df_prob) <- c("X>x", "X<x", "Total")
row.names(df_prob) <- c("Y<y", "Y>y", "Total")
#round probability to two decimal places
kable(round(df_prob, 2)) %>%
  kable_styling(bootstrap_options = "bordered") %>%
  row_spec(0, bold = T, color = "black", background = "#7fcdbb")

| | X>x | X<x | Total |
|---|---|---|---|
| Y<y | 0.13 | 0.12 | 0.25 |
| Y>y | 0.37 | 0.38 | 0.75 |
| Total | 0.50 | 0.50 | 1.00 |
\(P(X>x \text{ and } Y>y)\)
## [1] 0.3738
\(P(X>x)\,P(Y>y)\)
## [1] 0.375
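Two events are independent when the joint probability equals the product of the marginals, \(P(A \cap B) = P(A)P(B)\). The following sketch recomputes both sides from the counts in the table above; the variable names are illustrative.

```r
n <- 10000
joint <- 3738 / n      # P(X > x and Y > y), from the count table
p_x <- 5000 / n        # P(X > x), column total
p_y <- 7500 / n        # P(Y > y), row total
product <- p_x * p_y   # value expected under independence

print(c(joint = joint, product = product, difference = joint - product))
```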
The values are extremely close, which is consistent with independence. The chi-square test and Fisher's exact test below do not contradict it.
Before computers were readily available, people analyzed contingency tables by hand or with a calculator, using chi-square tests. The test computes the expected value of each cell under the null hypothesis of independence (equivalently, a relative risk or odds ratio of 1.0), then combines the discrepancies between observed and expected values into a chi-square statistic, from which a P value is computed. The chi-square test is only an approximation, but with large sample sizes it works very well. Chi-square is also more suitable for contingency tables with large \(n \times n\) dimensions. Fisher's exact test always gives an exact P value and works fine with small sample sizes. Unlike chi-square, Fisher's test is very hard to calculate by hand but easy to compute with a computer.
Chi Square Test
##
## Pearson's Chi-squared test
##
## data: mtrx
## X-squared = 0.3072, df = 4, p-value = 0.9893
Fisher’s Exact Test
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: mtrx
## p-value = 0.987
## alternative hypothesis: two.sided
With high p-values, we fail to reject the null hypothesis that X>x and Y>y are independent events.
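The chi-square output above reports df = 4 and the Fisher output a simulated p-value, which suggests both tests were run on the 3x3 matrix that includes the Total row and column. As a hedged sketch, the same tests can be run on the four inner counts only, which gives a 2x2 table with one degree of freedom. The object names `counts`, `chi`, and `fis` are illustrative.

```r
# Inner 2x2 counts from the table above (totals excluded)
counts <- matrix(c(1262, 3738, 1238, 3762), nrow = 2,
                 dimnames = list(c("Y<y", "Y>y"), c("X>x", "X<x")))

# Pearson chi-square without Yates' continuity correction
chi <- chisq.test(counts, correct = FALSE)

# Fisher's exact test; exact (not simulated) for a 2x2 table
fis <- fisher.test(counts)

print(chi)
print(fis)
```

The chi-square statistic is unchanged (the margin cells contribute nothing to it), but the degrees of freedom, and therefore the p-value, differ.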
Training Dataset
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
I chose the following variables:
LotArea : Lot size in square feet
GrLivArea : Above grade (ground) living area square feet
SalePrice : the property’s sale price in dollars. This is the target variable that I will try to predict using Python.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
df_for_cor <- data.frame(traindf$SalePrice, traindf$LotArea, traindf$GrLivArea)
corr <- cor(df_for_cor)
corr
## traindf.SalePrice traindf.LotArea traindf.GrLivArea
## traindf.SalePrice 1.0000000 0.2638434 0.7086245
## traindf.LotArea 0.2638434 1.0000000 0.2631162
## traindf.GrLivArea 0.7086245 0.2631162 1.0000000
Null Hypothesis: The correlation between each pair of variables is zero
Alternative Hypothesis: The correlation between each pair of variables is not zero
##
## Pearson's product-moment correlation
##
## data: traindf$SalePrice and traindf$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
##
## Pearson's product-moment correlation
##
## data: traindf$SalePrice and traindf$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
##
## Pearson's product-moment correlation
##
## data: traindf$LotArea and traindf$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
For all three tests, the p-value is below an alpha of 0.05, so in each case we reject the null hypothesis of zero correlation. We are 80% confident that the correlation is between 0.2323391 and 0.2947946 for SalePrice and LotArea, between 0.6915087 and 0.7249450 for SalePrice and GrLivArea, and between 0.2315997 and 0.2940809 for LotArea and GrLivArea. Familywise error is a concern because we run three tests rather than one: at an alpha of 0.05 per test, the chance of at least one false rejection would be roughly 14% if the tests were independent (\(1 - 0.95^3\)). We can reduce it by running each correlation test at a higher confidence level.
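To make the familywise error concrete, the sketch below computes the error rate for three independent tests and a Bonferroni-adjusted confidence level. It also rebuilds the SalePrice and LotArea interval from the reported r and sample size (df = 1458, so n = 1460) using the Fisher z transformation. The helper `ci_from_r` is a hypothetical name introduced here, not a function from any package.

```r
alpha <- 0.05
m <- 3                          # number of pairwise tests
fwer <- 1 - (1 - alpha)^m       # familywise error rate if the tests were independent
conf_bonf <- 1 - alpha / m      # Bonferroni-adjusted confidence level per test

# Confidence interval for a Pearson correlation via the Fisher z transformation
ci_from_r <- function(r, n, conf) {
  z <- atanh(r)
  half <- qnorm(1 - (1 - conf) / 2) / sqrt(n - 3)
  tanh(c(lower = z - half, upper = z + half))
}

ci_80 <- ci_from_r(0.2638434, 1460, 0.80)         # interval at the 80% level
ci_bonf <- ci_from_r(0.2638434, 1460, conf_bonf)  # wider, Bonferroni-adjusted

print(fwer)
print(ci_80)
print(ci_bonf)
```

With the original data frame available, the equivalent call would pass `conf.level` to `cor.test`.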
## traindf.SalePrice traindf.LotArea traindf.GrLivArea
## traindf.SalePrice 2.0349350 -0.1692033 -1.3974846
## traindf.LotArea -0.1692033 1.0884485 -0.1664868
## traindf.GrLivArea -1.3974846 -0.1664868 2.0340972
## traindf.SalePrice traindf.LotArea traindf.GrLivArea
## traindf.SalePrice 1.000000e+00 1.387779e-17 0.000000e+00
## traindf.LotArea -5.551115e-17 1.000000e+00 -1.110223e-16
## traindf.GrLivArea 0.000000e+00 2.775558e-17 1.000000e+00
## traindf.SalePrice traindf.LotArea traindf.GrLivArea
## traindf.SalePrice 1.000000e+00 -1.110223e-16 -2.220446e-16
## traindf.LotArea 1.387779e-17 1.000000e+00 2.775558e-17
## traindf.GrLivArea 2.220446e-16 0.000000e+00 1.000000e+00
## $L
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.00000e+00 0
## [2,] 1.387779e-17 1.00000e+00 0
## [3,] 2.220446e-16 2.46519e-32 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 -1.110223e-16 -2.220446e-16
## [2,] 0 1.000000e+00 2.775558e-17
## [3,] 0 0.000000e+00 1.000000e+00
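The outputs above appear to be the inverse of the correlation matrix (the precision matrix), its product with the correlation matrix in both orders, and an LU decomposition. The following is a minimal base R sketch of those steps, using the correlation values reported earlier. The function `lu_doolittle` is a hypothetical helper written for illustration (Doolittle's method without pivoting), not the routine that produced the output above, and it is applied here to the correlation matrix itself.

```r
corr <- matrix(c(1.0000000, 0.2638434, 0.7086245,
                 0.2638434, 1.0000000, 0.2631162,
                 0.7086245, 0.2631162, 1.0000000),
               nrow = 3, byrow = TRUE)

# Precision matrix: the inverse of the correlation matrix
precision <- solve(corr)

# Both products should equal the identity up to floating-point error
prod1 <- corr %*% precision
prod2 <- precision %*% corr

# Doolittle LU decomposition without pivoting: A = L %*% U,
# with L unit lower triangular and U upper triangular
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]
    }
  }
  list(L = L, U = U)
}

lu <- lu_doolittle(corr)
print(round(precision, 4))
print(lu)
```

The diagonal entries of the precision matrix are the variance inflation factors of each variable regressed on the other two.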
I chose the variable TotalBsmtSF (Total square feet of basement area)
## [1] 0
## rate
## 9.456896e-04
## (2.474983e-05)
## rate
## 0.0009456896
## 5% 95%
## 73.69268 2988.52768
## [1] 1057.429
Normal Distribution Histogram Centered Around 1057.429
## 5% 95%
## 519.3 1753.0
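The reported rate is the reciprocal of the reported mean (1/1057.429), which is the maximum likelihood estimate of an exponential rate. The sketch below starts from that mean and compares the theoretical 5th and 95th percentiles with percentiles of simulated draws. The sample size of 1000 and the seed are assumptions, so the simulated values will not match the output above exactly.

```r
mean_bsmt <- 1057.429          # sample mean of TotalBsmtSF reported above
rate <- 1 / mean_bsmt          # maximum likelihood estimate of the exponential rate

# Theoretical 5th and 95th percentiles of the fitted exponential distribution
q_theory <- qexp(c(0.05, 0.95), rate = rate)

# Percentiles from simulated draws (sample size and seed are assumptions)
set.seed(1234)
samples <- rexp(1000, rate = rate)
q_sim <- quantile(samples, c(0.05, 0.95))

print(rate)
print(q_theory)
print(q_sim)
```

The empirical percentiles of the data (519.3 and 1753.0) are much narrower than the exponential ones, which suggests the exponential model is a poor fit for this variable.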
Refer to Python Code
Username aaronzalki
Score 0.13771