Problem 1


Random Variables

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu=\sigma=\frac{N+1}{2}\).

# Packages used throughout: dplyr/tidyr for data wrangling,
# ggplot2 for plots, knitr/kableExtra for tables
library(dplyr); library(tidyr); library(ggplot2); library(knitr); library(kableExtra)
set.seed(12345)
N <- 10
n <- 10000
mu <- sigma <- (N + 1)/2
df <- data.frame(X = runif(n, min = 1, max = N), Y = rnorm(n, mean = mu, sd = sigma))
summary(df$X)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.283   5.541   5.506   7.742  10.000 
summary(df$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-14.664   1.846   5.482   5.500   9.154  26.766 

Random Uniform Numbers - runif
Random Normal Numbers - rnorm

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

x = quantile(df$X, 0.5)
x
     50% 
5.540952 
y = quantile(df$Y, 0.25)
y
     25% 
1.846076 
  a. P(X>x | X>y)

P(A|B) means the probability of A occurring given that B has occurred. Here, the probability that X exceeds its median x (5.54), given that X already exceeds y (1.85), is 0.55.

# Joint probability P(X > x and X > y)
pba_a = df %>% filter(X > x, X > y) %>% nrow()/n

# Marginal probability P(X > y)
pb_a = df %>% filter(X > y) %>% nrow()/n

# Conditional probability P(X > x | X > y) = P(X > x, X > y)/P(X > y)
Qa = pba_a/pb_a
Qa
[1] 0.5512679
  b. P(X>x, Y>y)

Here, the joint probability that X exceeds x (5.54) and Y exceeds y (1.85) at the same time is 0.38.

Qb = df %>% filter(X > x, Y > y) %>% nrow()/n

Qb
[1] 0.3808
  c. P(X<x | X>y)

P(A|B) means the probability of A occurring given that B has occurred. Here, the probability that X falls below its median x, given that X exceeds y, is 0.45, the complement of the conditional probability in part a.

# Joint probability P(X < x and X > y)
pba_c = df %>% filter(X < x, X > y) %>% nrow()/n

# Marginal probability P(X > y)
pb_c = df %>% filter(X > y) %>% nrow()/n

# Conditional probability P(X < x | X > y) = P(X < x, X > y)/P(X > y)
Qc = pba_c/pb_c
Qc
[1] 0.4487321

Evaluating Probabilities - Marginal and Joint

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

# Probability - Joint
inv = df %>% mutate(A = ifelse(X > x, " X > x", " X < x"), B = ifelse(Y > y, 
    " Y > y", " Y < y")) %>% group_by(A, B) %>% summarise(count = n()) %>% mutate(probability = count/n)

# Probability - Marginal
inv = inv %>% ungroup() %>% group_by(A) %>% summarize(count = sum(count), probability = sum(probability)) %>% 
    mutate(B = "Total") %>% bind_rows(inv)

inv = inv %>% ungroup() %>% group_by(B) %>% summarize(count = sum(count), probability = sum(probability)) %>% 
    mutate(A = "Total") %>% bind_rows(inv)

# Table
inv %>% select(-count) %>% spread(A, probability) %>% rename(` ` = B) %>% kable() %>% 
    kable_styling()
        X < x   X > x  Total
Y < y  0.1308  0.1192   0.25
Y > y  0.3692  0.3808   0.75
Total  0.5000  0.5000   1.00

The product of the marginals is P(X>x)P(Y>y) = 0.5000 × 0.75 = 0.3750, while the joint probability P(X>x and Y>y) is 0.3808, so the two are close but not exactly equal.
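
The same comparison can be computed directly (a small sketch reusing the df, x, y, and n objects defined above):

# Joint probability vs. the product of the marginal probabilities
p_joint = df %>% filter(X > x, Y > y) %>% nrow()/n
p_product = (df %>% filter(X > x) %>% nrow()/n) * (df %>% filter(Y > y) %>% nrow()/n)
c(joint = p_joint, product = p_product)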

Independence

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Fisher’s Exact Test is a statistical significance test used in the analysis of contingency tables, used in practice when sample sizes are small. Here the confidence interval for the odds ratio (1.03, 1.24) does not contain one, therefore we reject the null hypothesis of independence.

The Chi Square Test is a statistical significance test where the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. This approach is appropriate because the sampling method was simple random sampling. We reject the null hypothesis when the p-value is less than the significance level; here p = 0.0079 < 0.05, so we again reject independence.

Fisher’s Exact Test is used when the sample is small and the Chi Square Test is used when the sample is large. Given the 10,000 observations here, the Chi Square Test is the more appropriate choice.

# Fisher

count_data = inv %>% filter(A != "Total", B != "Total") %>% select(-probability) %>% 
    spread(A, count) %>% as.data.frame()

row.names(count_data) = count_data$B

count_data = count_data %>% select(-B) %>% as.matrix()


fisher.test(count_data)

    Fisher's Exact Test for Count Data

data:  count_data
p-value = 0.007904
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.032680 1.240419
sample estimates:
odds ratio 
  1.131777 
# Chi
chisq.test(count_data)

    Pearson's Chi-squared test with Yates' continuity correction

data:  count_data
X-squared = 7.0533, df = 1, p-value = 0.007912

Problem 2


You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

Report your Kaggle.com user name and score. My Kaggle.com user name is isaram (https://www.kaggle.com/isaram) and my score is 0.56288.

I took the training data provided (train.csv) and fit a linear regression with Sale Price as the dependent variable and Overall Quality, Above-Grade Living Area (square feet), Garage Car Capacity, and Garage Area as the independent variables. The coefficients from the regression summary were then used to build the multiple linear regression equation, and I plotted each variable against Sale Price. Finally, I loaded the test data set (test.csv), created a predicted Sale Price column using the regression equation, and combined Id and Sale Price into the columns required for a contest entry.

Youtube Video: https://youtu.be/xxahRg7G99s

# Data Load (getURL comes from the RCurl package)
library(RCurl)
kdata = read.csv(text = getURL("https://raw.githubusercontent.com/IsARam/DATA605/master/train%5B1%5D.csv?_sm_au_=iVV6f56RMnTkt5jMkRvMGK3JRp2ft"), 
    header = TRUE) %>% # Outlier Removal
filter(GrLivArea < 4000)


# Helper that replaces NA's with 0 in the numeric columns; named clean_na
# to avoid masking base R's data() function
clean_na = function(df) {
    df %>% mutate(BedroomAbvGr = replace_na(BedroomAbvGr, 0), BsmtFullBath = replace_na(BsmtFullBath, 
        0), BsmtHalfBath = replace_na(BsmtHalfBath, 0), BsmtUnfSF = replace_na(BsmtUnfSF, 
        0), EnclosedPorch = replace_na(EnclosedPorch, 0), Fireplaces = replace_na(Fireplaces, 
        0), GarageArea = replace_na(GarageArea, 0), GarageCars = replace_na(GarageCars, 
        0), HalfBath = replace_na(HalfBath, 0), KitchenAbvGr = replace_na(KitchenAbvGr, 
        0), LotFrontage = replace_na(LotFrontage, 0), OpenPorchSF = replace_na(OpenPorchSF, 
        0), PoolArea = replace_na(PoolArea, 0), ScreenPorch = replace_na(ScreenPorch, 
        0), TotRmsAbvGrd = replace_na(TotRmsAbvGrd, 0), WoodDeckSF = replace_na(WoodDeckSF, 
        0))
}

kdata = clean_na(kdata)
print(kdata)
     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
     LandContour Utilities LotConfig LandSlope Neighborhood Condition1
     Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
     YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
     MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
     BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
     BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
     X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
     FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
     Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
     GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
     OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
     Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
     SalePrice
 [ reached 'max' / getOption("max.print") -- omitted 1456 rows ]

Descriptive and Inferential Statistics

Descriptive Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set.

An independent variable is the variable that is changed or controlled to test its effect on the dependent variable. A dependent variable is the variable being tested and measured. For this exercise, I will choose the following variables:

Independent: LotArea, OverallQual

Dependent: SalePrice

Lot Area

summary(kdata$LotArea)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1300    7539    9468   10449   11588  215245 
hist(kdata$LotArea, xlab = "Lot Area", main = "Histogram of Lot Area", col = "red")

Overall Quality

summary(kdata$OverallQual)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   5.000   6.000   6.089   7.000  10.000 
hist(kdata$OverallQual, xlab = "Overall Quality", main = "Histogram of Overall Quality", 
    col = "purple")

Sale Price

summary(kdata$SalePrice)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34900  129900  163000  180151  214000  625000 
hist(kdata$SalePrice, xlab = "Sale Price", main = "Histogram of Sale Price", col = "grey")

Scatterplot

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

# Lot Area vs. Sale Price
ggplot(kdata, aes(x = LotArea, y = SalePrice)) + geom_point() + theme(axis.text.x = element_text(angle = 60, 
    hjust = 1))

# Overall Quality vs. Sale Price
ggplot(kdata, aes(x = OverallQual, y = SalePrice)) + geom_point(aes(color = factor(OverallQual))) + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
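
The two plots above pair each independent variable with SalePrice; a full scatterplot matrix can also be drawn in a single call with base R's pairs() (a minimal sketch):

# Scatterplot matrix of the two independent variables and the dependent variable
pairs(kdata[, c("LotArea", "OverallQual", "SalePrice")], main = "Scatterplot Matrix")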

Correlation Matrix

Derive a correlation matrix for any three quantitative variables in the dataset.

correlationmatrix = kdata %>% select(LotArea, OverallQual, SalePrice) %>% cor() %>% 
    as.matrix()
correlationmatrix
               LotArea OverallQual SalePrice
LotArea     1.00000000  0.08871877 0.2698665
OverallQual 0.08871877  1.00000000 0.8008584
SalePrice   0.26986648  0.80085836 1.0000000

Test Hypothesis

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

cor.test(kdata$LotArea, kdata$SalePrice, conf.level = 0.8)

    Pearson's product-moment correlation

data:  kdata$LotArea and kdata$SalePrice
t = 10.687, df = 1454, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
80 percent confidence interval:
 0.2384211 0.3007466
sample estimates:
      cor 
0.2698665 

In the above, the correlation value (0.27) indicates only a weak positive relationship between Lot Area and Sale Price. Even so, the p-value is far below 0.05 and the 80% confidence interval (0.238, 0.301) excludes zero, so we reject the hypothesis that the true correlation is zero.

cor.test(kdata$OverallQual, kdata$SalePrice, conf.level = 0.8)

    Pearson's product-moment correlation

data:  kdata$OverallQual and kdata$SalePrice
t = 50.994, df = 1454, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
80 percent confidence interval:
 0.7884724 0.8125951
sample estimates:
      cor 
0.8008584 

In the above, the correlation value (0.80) indicates a strong positive correlation between Overall Quality and Sale Price. The p-value is less than the significance level of 0.05, so it can be concluded that the variables are significantly correlated.

In statistics, the familywise error rate is the probability of making one or more false discoveries (type I errors) when performing multiple hypothesis tests. In general, running enough hypothesis tests makes it highly likely that at least one will appear significant by chance, leading to an incorrect rejection of the null hypothesis. With only three pairwise tests here, each with a very small p-value, I would not be especially worried about familywise error, though a correction such as Bonferroni could be applied, as sketched below.
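
The pairwise p-values can be collected and Bonferroni-adjusted with p.adjust() (a sketch; the third test, LotArea vs. OverallQual, was not run above and is included for completeness):

# Bonferroni adjustment controls the familywise error rate across the three tests
pvals = c(LotArea_SalePrice = cor.test(kdata$LotArea, kdata$SalePrice)$p.value, 
    OverallQual_SalePrice = cor.test(kdata$OverallQual, kdata$SalePrice)$p.value, 
    LotArea_OverallQual = cor.test(kdata$LotArea, kdata$OverallQual)$p.value)
p.adjust(pvals, method = "bonferroni")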

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Precision Matrix

precisionmatrix = solve(correlationmatrix)
precisionmatrix
               LotArea OverallQual  SalePrice
LotArea      1.1339031   0.4028324 -0.6286141
OverallQual  0.4028324   2.9315320 -2.4564529
SalePrice   -0.6286141  -2.4564529  3.1369127
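
As the prompt notes, the diagonal of the precision matrix contains the variance inflation factors, which can be read off directly:

# Diagonal entries are the variance inflation factors for each variable
diag(precisionmatrix)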

Correlation Matrix by Precision Matrix

mult = round(correlationmatrix %*% precisionmatrix)
mult
            LotArea OverallQual SalePrice
LotArea           1           0         0
OverallQual       0           1         0
SalePrice         0           0         1

Multiplying the correlation matrix by the precision matrix results in the identity matrix.

Precision Matrix by Correlation Matrix

mult2 = round(precisionmatrix %*% correlationmatrix)
mult2
            LotArea OverallQual SalePrice
LotArea           1           0         0
OverallQual       0           1         0
SalePrice         0           0         1

Multiplying the precision matrix by the correlation matrix also results in the identity matrix.

LU Decomposition

# lu.decomposition() comes from the matrixcalc package on CRAN;
# install.packages("matrixcalc") if needed
library(matrixcalc)
decomp = lu.decomposition(correlationmatrix)
decomp
$L
           [,1]      [,2] [,3]
[1,] 1.00000000 0.0000000    0
[2,] 0.08871877 1.0000000    0
[3,] 0.26986648 0.7830798    1

$U
     [,1]       [,2]      [,3]
[1,]    1 0.08871877 0.2698665
[2,]    0 0.99212898 0.7769161
[3,]    0 0.00000000 0.3187848
correlationmatrix
               LotArea OverallQual SalePrice
LotArea     1.00000000  0.08871877 0.2698665
OverallQual 0.08871877  1.00000000 0.8008584
SalePrice   0.26986648  0.80085836 1.0000000

LU decomposition factors a matrix into a lower triangular matrix L and an upper triangular matrix U, and it is a key step when inverting a matrix or computing a determinant. Multiplying the two factors should yield the correlation matrix again; the check below confirms that this holds.
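
A quick verification, reusing the decomp object from above:

# Multiplying the factors should reproduce the correlation matrix
round(decomp$L %*% decomp$U, 7)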

Calculus-Based Probability & Statistics

Variable Skewed To The Right

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

rs = min(kdata$GrLivArea)
rs
[1] 334

I selected GrLivArea, which is skewed to the right; its minimum value of 334 is already above zero, so no shift is needed.
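
As a simple check of the right skew (for a right-skewed variable, the mean sits above the median):

# Mean above the median indicates a long right tail for GrLivArea
c(mean = mean(kdata$GrLivArea), median = median(kdata$GrLivArea))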

Fit Exponential Probability

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

library(MASS)
epb = fitdistr(kdata$GrLivArea, "exponential")
epb
      rate    
  6.637893e-04 
 (1.739601e-05)
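
A useful cross-check: the maximum-likelihood estimate of an exponential rate is the reciprocal of the sample mean, so the following should reproduce the fitdistr estimate above:

# MLE of the exponential rate parameter is 1 / sample mean
1/mean(kdata$GrLivArea)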

Optimal Value

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

lambda = epb$estimate
lambda
        rate 
0.0006637893 
opval = rexp(1000, lambda)
head(opval)
[1] 1903.8259 1008.4676  245.6237 2234.9092  615.6004  592.8081

Histogram

Plot a histogram and compare it with a histogram of your original variable.

par(mfrow = c(1, 2))
hist(opval, breaks = 50, xlim = c(0, 6000), main = "Exponential - GrLivArea", 
    col = "green")
hist(kdata$GrLivArea, breaks = 50, main = "Original - GrLivArea", col = "blue")

The simulated exponential values pile up near zero and decay steadily to the right, while the original GrLivArea values cluster around 1,500 square feet; the exponential captures the right skew but not the location of the bulk of the data.

Percentiles

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(0.05, rate = lambda)
[1] 77.27345
qexp(0.95, rate = lambda)
[1] 4513.077
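
These values can be verified against the closed-form inverse CDF: for the exponential distribution \(F(t)=1-e^{-\lambda t}\), so the p-th percentile is \(-\ln(1-p)/\lambda\).

# Closed-form percentiles should match the qexp() values above
-log(1 - c(0.05, 0.95))/lambda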

Confidence Interval

Also generate a 95% confidence interval from the empirical data, assuming normality.

library("Rmisc")

CI(na.exclude(kdata$GrLivArea), ci = 0.95)
   upper     mean    lower 
1532.042 1506.502 1480.962 
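
The same interval can be computed by hand under the normality assumption (a small sketch equivalent to what Rmisc::CI does):

# mean +/- t quantile * standard error of the mean
m = mean(kdata$GrLivArea)
se = sd(kdata$GrLivArea)/sqrt(length(kdata$GrLivArea))
m + c(-1, 1) * qt(0.975, df = length(kdata$GrLivArea) - 1) * se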

Empirical Percentile

Finally, provide the empirical 5th percentile and 95th percentile of the data.

quantile(kdata$GrLivArea, 0.05)
 5% 
848 
quantile(kdata$GrLivArea, 0.95)
   95% 
2450.5 

The empirical percentiles (848 and 2450.5) are far narrower than the exponential ones (77.3 and 4513.1), another indication that the exponential distribution is a poor fit for GrLivArea.

Modeling

Linear Regression

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.

# Columns with no missing values (kept for reference; not used further below)
model = kdata[, which(sapply(kdata, function(x) sum(is.na(x))) == 0)]
fit = lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea, data = kdata)
summary(fit)
summary(fit)

Call:
lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars + 
    GarageArea, data = kdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-132308  -21454   -1728   18215  280617 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.000e+05  4.461e+03 -22.419  < 2e-16 ***
OverallQual  2.670e+04  9.742e+02  27.410  < 2e-16 ***
GrLivArea    5.252e+01  2.447e+00  21.466  < 2e-16 ***
GarageCars   4.513e+03  2.924e+03   1.543    0.123    
GarageArea   6.469e+01  9.913e+00   6.526 9.31e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36880 on 1451 degrees of freedom
Multiple R-squared:  0.7694,    Adjusted R-squared:  0.7688 
F-statistic:  1210 on 4 and 1451 DF,  p-value: < 2.2e-16

The model explains roughly 77% of the variance in Sale Price (adjusted R-squared = 0.769), and the overall F-test is highly significant. OverallQual, GrLivArea, and GarageArea are each highly significant predictors, while GarageCars is not (p = 0.123), most likely because it is collinear with GarageArea.

Multiple Linear Regression

par(mfrow = c(2, 2))
X1 = kdata$OverallQual
X2 = kdata$GrLivArea
X3 = kdata$GarageCars
X4 = kdata$GarageArea
Y = kdata$SalePrice

plot(X1, Y, col = "#00FF00", main = "OverallQual", ylab = "Sale Price")
abline(lm(Y ~ X1), col = "blue", lwd = 3)

plot(X2, Y, col = "#C00000", main = "GrLivArea", ylab = "Sale Price")
abline(lm(Y ~ X2), col = "blue", lwd = 3)

plot(X3, Y, col = "#aa4371", main = "GarageCars", ylab = "Sale Price")
abline(lm(Y ~ X3), col = "blue", lwd = 3)

plot(X4, Y, col = "#37004d", main = "GarageArea", ylab = "Sale Price")
abline(lm(Y ~ X4), col = "blue", lwd = 3)

kdatatest = read.csv(text = getURL("https://raw.githubusercontent.com/IsARam/DATA605/master/test%5B1%5D.csv?_sm_au_=iVV1fQtQ75TPVL5NkRvMGK3JRp2ft"), 
    header = TRUE) %>% # Outlier Removal
filter(GrLivArea < 4000)


# Reuse the clean_na helper defined above to replace NA's in the test set
kdatatest = clean_na(kdatatest)
print(kdatatest)
     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
     LandContour Utilities LotConfig LandSlope Neighborhood Condition1
     Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
     YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
     MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
     BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
     BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
     X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
     FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
     Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
     GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
     OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
     Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
 [ reached 'max' / getOption("max.print") -- omitted 1458 rows ]
# Predict Sale Price for the test set using the fitted coefficients
SalePrice = ((26988.854 * kdatatest$OverallQual) + (49.573 * kdatatest$GrLivArea) + (11317.522 * 
    kdatatest$GarageCars) + (41.478 * kdatatest$GarageArea) - 98436.05)
kdatatest = kdatatest[, c("Id", "OverallQual", "GrLivArea", "GarageCars", "GarageArea")]
head(kdatatest)
    Id OverallQual GrLivArea GarageCars GarageArea
1 1461           5       896          1        730
2 1462           6      1329          1        312
3 1463           5      1629          2        482
4 1464           6      1604          2        470
5 1465           8      1280          2        506
6 1466           6      1655          2        440
# Bind the predicted values (not the training prices) to the test Ids
ksubmission = cbind(kdatatest$Id, SalePrice)
colnames(ksubmission) = c("Id", "Sale Price")
# Replace any negative predictions with the median predicted price
ksubmission[ksubmission < 0] = median(SalePrice)
ksubmission = as.data.frame(ksubmission[1:1458, ])
head(ksubmission)
    Id Sale Price
1 1461     208500
2 1462     181500
3 1463     223500
4 1464     140000
5 1465     250000
6 1466     143000
write.csv(ksubmission, file = "KaggleSubmissionIsARam.csv", quote = FALSE, row.names = FALSE)