Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu=\sigma=\frac{N+1}{2}\).
library(tidyverse)   # dplyr, tidyr, ggplot2 (used throughout)
library(knitr)       # kable
library(kableExtra)  # kable_styling
library(RCurl)       # getURL
set.seed(12345)
N <- 10
n <- 10000
mu <- sigma <- (N + 1)/2
df <- data.frame(X = runif(n, min = 1, max = N), Y = rnorm(n, mean = mu, sd = sigma))
summary(df$X)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.283 5.541 5.506 7.742 10.000
summary(df$Y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-14.664 1.846 5.482 5.500 9.154 26.766
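Both simulated means land where theory predicts (a quick sketch; N and mu are defined above):
# Theoretical means: Uniform(1, N) has mean (1 + N)/2; Y was drawn with mean mu
c(uniform = (1 + N)/2, normal = mu)
uniform  normal 
    5.5     5.5 
These match the simulated means of 5.506 and 5.500 shown in the summaries above.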
(Histograms: "Random Uniform Numbers - runif" and "Random Normal Numbers - rnorm")
Calculate as a minimum the below probabilities a through c. Assume the small letter "x" is estimated as the median of the X variable, and the small letter "y" is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x = quantile(df$X, 0.5)
x
50%
5.540952
y = quantile(df$Y, 0.25)
y
25%
1.846076
P(A|B) means the probability of A occurring, given that B has occurred. For part (a), A is the event that X exceeds its median x (5.54) and B is the event that X exceeds y (1.85), the 1st quartile of Y; the resulting conditional probability \(P(X>x \mid X>y)\) is 0.55.
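In symbols, \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\); the code below computes exactly this ratio, with \(A = \{X > x\}\) and \(B = \{X > y\}\).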
pba_a = df %>% filter(X > x, X > y) %>% nrow()/n
pb_a = df %>% filter(X > y) %>% nrow()/n
Qa = pba_a/pb_a
Qa
[1] 0.5512679
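As a cross-check, the same conditional probability can be computed in one line (a minimal base-R sketch that looks at how often X exceeds x among the rows where X exceeds y):
# P(X > x | X > y) computed directly
mean(df$X[df$X > y] > x)
[1] 0.5512679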
For part (b), the quantity is a joint probability rather than a conditional one: \(P(X>x \text{ and } Y>y)\), the probability that X exceeds its median and Y simultaneously exceeds its 1st quartile. This probability is 0.38.
Qb = df %>% filter(X > x, Y > y) %>% nrow()/n
Qb
[1] 0.3808
P(A|B) again denotes a conditional probability. For part (c), \(P(X<x \mid X>y)\) is the probability that X falls below its median, given that X exceeds the 1st quartile of Y; it comes out to 0.45, which is the complement of part (a) within the same conditioning event.
pba_c = df %>% filter(X < x, X > y) %>% nrow()/n
pb_c = df %>% filter(X > y) %>% nrow()/n
Qc = pba_c/pb_c
Qc
[1] 0.4487321
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
# Probability - Joint
inv = df %>% mutate(A = ifelse(X > x, " X > x", " X < x"), B = ifelse(Y > y,
" Y > y", " Y < y")) %>% group_by(A, B) %>% summarise(count = n()) %>% mutate(probability = count/n)
# Probability - Marginal
inv = inv %>% ungroup() %>% group_by(A) %>% summarize(count = sum(count), probability = sum(probability)) %>%
mutate(B = "Total") %>% bind_rows(inv)
inv = inv %>% ungroup() %>% group_by(B) %>% summarize(count = sum(count), probability = sum(probability)) %>%
mutate(A = "Total") %>% bind_rows(inv)
# Table
inv %>% select(-count) %>% spread(A, probability) %>% rename(` ` = B) %>% kable() %>%
kable_styling()
|       | X < x  | X > x  | Total |
|-------|--------|--------|-------|
| Y < y | 0.1308 | 0.1192 | 0.25  |
| Y > y | 0.3692 | 0.3808 | 0.75  |
| Total | 0.5000 | 0.5000 | 1.00  |

From the table, P(X>x)P(Y>y) = 0.5 × 0.75 = 0.375, while the joint probability P(X>x and Y>y) is 0.3808. The two values are close but not identical, so independence holds approximately, but not exactly, in this sample.
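The same comparison can be made directly in code (a small sketch recomputing the joint probability and the product of the marginals from df):
# Joint probability vs. product of the marginal probabilities
p_joint = mean(df$X > x & df$Y > y)
p_prod = mean(df$X > x) * mean(df$Y > y)
c(joint = p_joint, product = p_prod)
  joint product 
 0.3808  0.3750 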
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher's Exact Test is a statistical significance test used in the analysis of contingency tables, typically when sample sizes are small. Here the 95 percent confidence interval for the odds ratio (1.03 to 1.24) does not contain one, and the p-value (0.0079) is below 0.05, so we reject the null hypothesis of independence.
The Chi Square test is a statistical significance test whose test statistic follows a chi-squared distribution when the null hypothesis is true. It is appropriate here because the sampling method was simple random sampling. We reject the null hypothesis when the p-value is less than the significance level, which is again the case (p = 0.0079).
Fisher's Exact Test is preferred when the sample is small, while the Chi Square test is designed for large samples. With 10,000 observations, the Chi Square test is the more appropriate choice here.
# Fisher
count_data = inv %>% filter(A != "Total", B != "Total") %>% select(-probability) %>%
spread(A, count) %>% as.data.frame()
row.names(count_data) = count_data$B
count_data = count_data %>% select(-B) %>% as.matrix()
fisher.test(count_data)
Fisher's Exact Test for Count Data
data: count_data
p-value = 0.007904
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.032680 1.240419
sample estimates:
odds ratio
1.131777
# Chi
chisq.test(count_data)
Pearson's Chi-squared test with Yates' continuity correction
data: count_data
X-squared = 7.0533, df = 1, p-value = 0.007912
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
Report your Kaggle.com user name and score. My Kaggle.com user name is isaram (https://www.kaggle.com/isaram) and my score is 0.56288.
I took the training data provided (train.csv) and fit a linear regression with Sale Price as the dependent variable and Overall Quality, Above-Grade Living Area (square feet), Garage Car Capacity, and Garage Area as independent variables. The coefficients from the regression summary define the multiple linear regression equation. I plotted each variable against Sale Price, loaded the test data set, created a predicted Sale Price column from the regression equation, and combined Id and Sale Price into the columns required for a contest submission.
Youtube Video: https://youtu.be/xxahRg7G99s
# Data Load
kdata = read.csv(text = getURL("https://raw.githubusercontent.com/IsARam/DATA605/master/train%5B1%5D.csv?_sm_au_=iVV6f56RMnTkt5jMkRvMGK3JRp2ft"),
header = TRUE) %>% # Outlier Removal
filter(GrLivArea < 4000)
clean_na = function(df) {
    # Replace NA's in the numeric predictors with 0
    df %>% replace_na(list(BedroomAbvGr = 0, BsmtFullBath = 0, BsmtHalfBath = 0,
        BsmtUnfSF = 0, EnclosedPorch = 0, Fireplaces = 0, GarageArea = 0,
        GarageCars = 0, HalfBath = 0, KitchenAbvGr = 0, LotFrontage = 0,
        OpenPorchSF = 0, PoolArea = 0, ScreenPorch = 0, TotRmsAbvGrd = 0,
        WoodDeckSF = 0))
}
kdata = clean_na(kdata)
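A quick sanity check (a minimal sketch over a few of the imputed columns) confirms that no NA's remain:
# No NA's should remain in the imputed columns
colSums(is.na(kdata[, c("LotFrontage", "GarageCars", "GarageArea")]))
LotFrontage  GarageCars  GarageArea 
          0           0           0 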
print(kdata)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
LandContour Utilities LotConfig LandSlope Neighborhood Condition1
Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
SalePrice
[ reached 'max' / getOption("max.print") -- omitted 1456 rows ]
Provide univariate descriptive statistics and appropriate plots for the training data set.
An independent variable is the variable that is changed or controlled to test its effect on the dependent variable; a dependent variable is the variable being tested and measured. For this exercise, I will use the following variables:
- Independent: LotArea, OverallQual
- Dependent: SalePrice
Lot Area
summary(kdata$LotArea)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1300 7539 9468 10449 11588 215245
hist(kdata$LotArea, xlab = "Lot Area", main = "Histogram of Lot Area", col = "red")
Overall Quality
summary(kdata$OverallQual)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 5.000 6.000 6.089 7.000 10.000
hist(kdata$OverallQual, xlab = "Overall Quality", main = "Histogram of Overall Quality",
col = "purple")
Sale Price
summary(kdata$SalePrice)
Min. 1st Qu. Median Mean 3rd Qu. Max.
34900 129900 163000 180151 214000 625000
hist(kdata$SalePrice, xlab = "Sale Price", main = "Histogram of Sale Price", col = "grey")
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
# Lot Area vs. Sale Price
ggplot(kdata, aes(x = LotArea, y = SalePrice)) + geom_point() + theme(axis.text.x = element_text(angle = 60,
hjust = 1))
# Overall Quality vs. Sale Price
ggplot(kdata, aes(x = OverallQual, y = SalePrice)) + geom_point(aes(color = factor(OverallQual))) +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
Derive a correlation matrix for any three quantitative variables in the dataset.
correlationmatrix = kdata %>% select(LotArea, OverallQual, SalePrice) %>% cor() %>%
as.matrix()
correlationmatrix
LotArea OverallQual SalePrice
LotArea 1.00000000 0.08871877 0.2698665
OverallQual 0.08871877 1.00000000 0.8008584
SalePrice 0.26986648 0.80085836 1.0000000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
cor.test(kdata$LotArea, kdata$SalePrice, conf.level = 0.8)
Pearson's product-moment correlation
data: kdata$LotArea and kdata$SalePrice
t = 10.687, df = 1454, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
80 percent confidence interval:
0.2384211 0.3007466
sample estimates:
cor
0.2698665
In the above, the correlation value (0.27) indicates a weak positive correlation between Lot Area and Sale Price. Because the p-value is far below 0.05 and the 80 percent confidence interval (0.238 to 0.301) excludes zero, the correlation, although weak, is significantly different from zero.
cor.test(kdata$OverallQual, kdata$SalePrice, conf.level = 0.8)
Pearson's product-moment correlation
data: kdata$OverallQual and kdata$SalePrice
t = 50.994, df = 1454, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
80 percent confidence interval:
0.7884724 0.8125951
sample estimates:
cor
0.8008584
In the above, the correlation value (0.80) indicates a strong positive correlation between Overall Quality and Sale Price. The p-value is less than the 0.05 significance level, so it can be concluded that the variables are significantly correlated.
In statistics, the familywise error rate is the probability of making one or more false discoveries (Type I errors) when performing multiple hypothesis tests. In general, if one runs enough hypothesis tests, it becomes highly likely that at least one will appear significant by chance alone. With only a few tests here and p-values this small, however, a correction would not change any conclusion, so familywise error is not a serious worry in this case.
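The concern can be quantified (a short sketch; with k independent tests at level \(\alpha\), the chance of at least one false rejection is \(1-(1-\alpha)^k\)):
# Familywise error rate for k = 3 pairwise tests at alpha = 0.05,
# and the Bonferroni-adjusted per-test level alpha/k
alpha = 0.05
k = 3
c(fwer = 1 - (1 - alpha)^k, bonferroni = alpha/k)
      fwer bonferroni 
0.14262500 0.01666667 
Even at the Bonferroni-adjusted level, the p-values above (< 2.2e-16) remain significant.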
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
precisionmatrix = solve(correlationmatrix)
precisionmatrix
LotArea OverallQual SalePrice
LotArea 1.1339031 0.4028324 -0.6286141
OverallQual 0.4028324 2.9315320 -2.4564529
SalePrice -0.6286141 -2.4564529 3.1369127
mult = round(correlationmatrix %*% precisionmatrix)
mult
LotArea OverallQual SalePrice
LotArea 1 0 0
OverallQual 0 1 0
SalePrice 0 0 1
Multiplying the correlation matrix by the precision matrix results in the identity matrix.
mult2 = round(precisionmatrix %*% correlationmatrix)
mult2
LotArea OverallQual SalePrice
LotArea 1 0 0
OverallQual 0 1 0
SalePrice 0 0 1
Multiplying the precision matrix by the correlation matrix also results in the identity matrix.
# install.packages("matrixcalc")  # if not already installed
library(matrixcalc)
decomp = lu.decomposition(correlationmatrix)
decomp
$L
[,1] [,2] [,3]
[1,] 1.00000000 0.0000000 0
[2,] 0.08871877 1.0000000 0
[3,] 0.26986648 0.7830798 1
$U
[,1] [,2] [,3]
[1,] 1 0.08871877 0.2698665
[2,] 0 0.99212898 0.7769161
[3,] 0 0.00000000 0.3187848
correlationmatrix
LotArea OverallQual SalePrice
LotArea 1.00000000 0.08871877 0.2698665
OverallQual 0.08871877 1.00000000 0.8008584
SalePrice 0.26986648 0.80085836 1.0000000
LU decomposition factors a matrix into a lower triangular matrix \(L\) and an upper triangular matrix \(U\), and it is also a key step when inverting a matrix or computing its determinant. Multiplying the two components should yield the correlation matrix; as the check below indicates, this holds true.
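A one-line verification (a minimal sketch; decomp and correlationmatrix are defined above):
# L %*% U should reproduce the original correlation matrix
all.equal(decomp$L %*% decomp$U, correlationmatrix, check.attributes = FALSE)
[1] TRUE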
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
rs = min(kdata$GrLivArea)  # GrLivArea is right-skewed; check whether a shift is needed
rs
[1] 334
Since the minimum value (334) is already above zero, no shift is required.
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
library(MASS)
epb = fitdistr(kdata$GrLivArea, "exponential")
epb
      rate    
  6.637893e-04 
 (1.739601e-05)
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
lambda = epb$estimate
lambda
rate
0.0006637893
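For an exponential distribution, the maximum-likelihood estimate of the rate is simply the reciprocal of the sample mean, so the fitted \(\lambda\) can be sanity-checked directly (a quick sketch):
# MLE of the exponential rate: 1 / sample mean
1/mean(kdata$GrLivArea)
[1] 0.0006637893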
opval = rexp(1000, lambda)
opval
[1] 1903.8259 1008.4676 245.6237 2234.9092 615.6004 592.8081 2846.3432
[8] 387.6053 2233.2733 2091.3630 137.4129 399.8992 792.4274 1205.5701
[15] 1719.2709 1721.7270 320.0463 237.6520 2017.0141 6051.5086 1301.5289
[22] 3575.0064 332.5678 2113.7544 155.7340 1488.4323 1802.8334 1413.2108
[29] 5433.6522 549.7137 291.8545 4547.6083 3034.5772 475.4163 3006.8165
[36] 576.3791 618.9506 473.0999 127.4649 3875.5549 4956.1278 960.2526
[43] 189.2189 3169.8487 523.2383 1569.3683 3568.7595 3065.4054 644.6788
[50] 2332.7803 3222.2351 2631.5834 598.6394 396.4437 2838.6606 345.4371
[57] 338.1813 1379.9616 2210.0985 1542.5684 1082.6792 3680.3106 2654.0881
[64] 1141.9233 153.0491 691.7564 407.4845 769.9246 370.9539 493.2269
[71] 231.9066 633.0874 299.3800 765.2258 4369.8172
[ reached getOption("max.print") -- omitted 925 entries ]
Plot a histogram and compare it with a histogram of your original variable.
par(mfrow = c(1, 2))
hist(opval, breaks = 50, xlim = c(0, 6000), main = "Exponential - GrLivArea",
col = "green")
hist(kdata$GrLivArea, breaks = 50, main = "Original - GrLivArea", col = "blue")
The simulated exponential values decline steadily from zero, while the original GrLivArea data peak around 1,000 to 1,500 square feet; the exponential fit captures the right skew but not the location of the mode.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
qexp(0.05, rate = lambda)
[1] 77.27345
qexp(0.95, rate = lambda)
[1] 4513.077
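These quantiles invert the exponential CDF \(F(q) = 1 - e^{-\lambda q}\); feeding the 95th percentile back through pexp recovers the probability (a quick sketch):
# The CDF evaluated at the 95th percentile returns 0.95
pexp(qexp(0.95, rate = lambda), rate = lambda)
[1] 0.95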
Also generate a 95% confidence interval from the empirical data, assuming normality.
library("Rmisc")
CI(na.exclude(kdata$GrLivArea), ci = 0.95)
   upper     mean    lower 
1532.042 1506.502 1480.962 
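Rmisc's CI computes the t-based interval, mean \(\pm\, t_{0.975,\,n-1}\, s/\sqrt{n}\); the same bounds can be reproduced by hand (a minimal sketch):
# Manual 95% confidence interval for the mean of GrLivArea
m = mean(kdata$GrLivArea)
s = sd(kdata$GrLivArea)
k = length(kdata$GrLivArea)
m + c(-1, 1) * qt(0.975, k - 1) * s/sqrt(k)
[1] 1480.962 1532.042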
Finally, provide the empirical 5th percentile and 95th percentile of the data.
quantile(kdata$GrLivArea, 0.05)
5%
848
quantile(kdata$GrLivArea, 0.95)
95%
2450.5
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
# Keep only the columns with no missing values (candidate predictors)
model = kdata[, which(sapply(kdata, function(x) sum(is.na(x))) == 0)]
# Fit the regression with bare column names (rather than kdata$...) so the
# model can later be applied to new data with predict()
fit = lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea, data = kdata)
summary(fit)
Call:
lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
    GarageArea, data = kdata)
Residuals:
Min 1Q Median 3Q Max
-132308 -21454 -1728 18215 280617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.000e+05 4.461e+03 -22.419 < 2e-16 ***
OverallQual  2.670e+04  9.742e+02  27.410  < 2e-16 ***
GrLivArea    5.252e+01  2.447e+00  21.466  < 2e-16 ***
GarageCars   4.513e+03  2.924e+03   1.543    0.123
GarageArea   6.469e+01  9.913e+00   6.526 9.31e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36880 on 1451 degrees of freedom
Multiple R-squared: 0.7694, Adjusted R-squared: 0.7688
F-statistic: 1210 on 4 and 1451 DF, p-value: < 2.2e-16
The model explains roughly 77% of the variance in Sale Price (adjusted R-squared = 0.769), and the overall F-test is highly significant. OverallQual, GrLivArea, and GarageArea are all strongly significant predictors; GarageCars is not significant (p = 0.123) once GarageArea is in the model, most likely because the two garage variables are highly collinear.
par(mfrow = c(2, 2))
X1 = kdata$OverallQual
X2 = kdata$GrLivArea
X3 = kdata$GarageCars
X4 = kdata$GarageArea
Y = kdata$SalePrice
plot(X1, Y, col = "#00FF00", main = "OverallQual", ylab = "Sale Price")
abline(lm(Y ~ X1), col = "blue", lwd = 3)
plot(X2, Y, col = "#C00000", main = "GrLivArea", ylab = "Sale Price")
abline(lm(Y ~ X2), col = "blue", lwd = 3)
plot(X3, Y, col = "#aa4371", main = "GarageCars", ylab = "Sale Price")
abline(lm(Y ~ X3), col = "blue", lwd = 3)
plot(X4, Y, col = "#37004d", main = "GarageArea", ylab = "Sale Price")
abline(lm(Y ~ X4), col = "blue", lwd = 3)
kdatatest = read.csv(text = getURL("https://raw.githubusercontent.com/IsARam/DATA605/master/test%5B1%5D.csv?_sm_au_=iVV1fQtQ75TPVL5NkRvMGK3JRp2ft"),
header = TRUE) %>% # Outlier Removal
filter(GrLivArea < 4000)
# Apply the same NA cleaning used for the training data
kdatatest = clean_na(kdatatest)
print(kdatatest)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
LandContour Utilities LotConfig LandSlope Neighborhood Condition1
Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
[ reached 'max' / getOption("max.print") -- omitted 1458 rows ]
# Predicted Sale Price on the test set, from the regression equation
SalePrice = ((26988.854 * kdatatest$OverallQual) + (49.573 * kdatatest$GrLivArea) +
    (11317.522 * kdatatest$GarageCars) + (41.478 * kdatatest$GarageArea) - 98436.05)
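Since the model formula above uses bare column names, predictions could also come straight from the fitted model (a sketch; note that the hand-entered coefficients differ slightly from the rounded summary output, so the two sets of predictions will not agree exactly):
# Alternative: predictions directly from the fitted model
pred = predict(fit, newdata = kdatatest)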
kdatatest = kdatatest[, c("Id", "OverallQual", "GrLivArea", "GarageCars", "GarageArea")]
(head(kdatatest))
Id OverallQual GrLivArea GarageCars GarageArea
1 1461 5 896 1 730
2 1462 6 1329 1 312
3 1463 5 1629 2 482
4 1464 6 1604 2 470
5 1465 8 1280 2 506
6 1466 6 1655 2 440
ksubmission = cbind(kdatatest$Id, SalePrice)
colnames(ksubmission) = c("Id", "SalePrice")
# Guard against negative predictions by substituting the median prediction
ksubmission[ksubmission < 0] = median(SalePrice)
ksubmission = as.data.frame(ksubmission)
head(ksubmission)
    Id SalePrice
1 1461  122522.1
2 1462  153638.2
3 1463  159890.1
4 1464  185141.9
5 1465  224551.1
6 1466  186425.8
write.csv(ksubmission, file = "KaggleSubmissionIsARam.csv", quote = FALSE, row.names = FALSE)