# X = X1
# Y = Y4
x <- c(9.3, 4.1, 22.4, 9.1, 15.8, 7.1, 15.9, 6.9, 16.0, 6.7, 8.2, 16.0, 6.4, 11.8, 3.5, 21.7, 12.2, 9.3, 8.0, 6.2)
y <- c(20.2, 18.6, 22.6, 11.4, 23.6, 24.0, 26.0, 26.8, 19.7, 22.7, 16.8, 20.2, 21.7, 20.9, 26.9, 16.3, 19.9, 15.5, 26.5, 21.7)
summary.x <- summary(x)
summary.y <- summary(y)
lil.x <- summary.x[5] # 3rd quartile of x
lil.y <- summary.y[2] # 1st quartile of y
plot(x,y)
abline(lm(y~x))
Conditional probability is defined as \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \] First, I defined A, B, and \(A \cap B\) for each case asked. I used the which function to find the entries that met each criterion, used the length function to count them, and divided that count by the total length of the vector to get a proportion (expressed as a decimal between 0 and 1).
# P(X>x)
p1 <- length(which(x>lil.x))/length(x)
# P(Y>y)
p2 <- length(which(y>lil.y))/length(y)
# P(X<x)
p3 <- length(which(x<lil.x))/length(x)
# P(X>x & Y>y)
p4 <- length(which(x>lil.x & y > lil.y))/length(x)
# P(X<x & Y>y)
p5 <- length(which(x<lil.x & y >lil.y))/length(x)
# P(X<x & Y<y)
p6 <- length(which(x<lil.x & y <lil.y))/length(x)
# P(X>x & Y <y)
p7 <- length(which(x>lil.x & y <lil.y))/length(x)
\[ P(X>x | Y>y) = \frac{P(X>x \cap Y>y)}{P(Y>y)} \]
pa.and.pb <- p4
pb <- p2
result <- pa.and.pb/pb
result
[1] 0.2666667
There is about a 27% chance of \(X>x\) given \(Y>y\).
\[ P(X>x, Y>y) = P(X>x \cap Y>y) \]
p4
[1] 0.2
There is a 20% chance of \(X>x\) and \(Y>y\).
\[ P(X<x | Y>y) = \frac{P(X<x \cap Y>y)}{P(Y>y)} \]
pa.and.pb <- p5
pb <- p2
result <- pa.and.pb/pb
result
[1] 0.7333333
There is a 73% chance of \(X < x\) given \(Y > y\).
table <- matrix(c(p6, p7, p5, p4), ncol = 2) * length(x)
table2 <- t(matrix(c(4, 11, 15, 1, 4, 5, 5, 15, 20), ncol = 3))
table2
[,1] [,2] [,3]
[1,] 4 11 15
[2,] 1 4 5
[3,] 5 15 20
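The hand-entered counts in table2 can also be recovered directly from the table matrix built above; a small sketch using base R's addmargins (rows are \(X<x\) and \(X>x\), columns are \(Y<y\) and \(Y>y\)):
# Sketch: append row and column totals to the 2x2 count matrix from above;
# the result matches the hand-entered table2.
addmargins(table)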
If splitting the training data in this way made them independent, it would be true that \[ P(AB) = P(A)P(B) \] where \[ A = [X>x], \quad B = [Y>y], \quad AB = [X>x \cap Y>y] \]
A <- p1
B <- p2
AB <- p4
A*B == AB
[1] FALSE
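As an aside, comparing doubles with exact == can be fragile; a tolerance-based check (a minimal sketch using base R's all.equal) expresses the same test more robustly and reaches the same conclusion here:
# Sketch: compare P(A)P(B) with P(AB) up to a numerical tolerance rather
# than demanding exact equality.
isTRUE(all.equal(A * B, AB))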
These two events are not exactly independent: \(P(A)P(B) = 0.25 \times 0.75 = 0.1875\), while \(P(AB) = 0.2\). Below, we can check this with a chi-squared test on the table above (minus the total row and column).
c1 <- rbind(4,1)
c2 <- rbind(11,4)
mtx <- cbind(c1,c2)
mtx
[,1] [,2]
[1,] 4 11
[2,] 1 4
chisq.test(mtx)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity
correction
data: mtx
X-squared = 0, df = 1, p-value = 1
This throws a warning because some of the expected cell counts are less than 5. Instead, we can use Fisher's Exact Test.
fisher.test(c(p6,p7), c(p5,p4))
Fisher's Exact Test for Count Data
data: c(p6, p7) and c(p5, p4)
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.02564066 Inf
sample estimates:
odds ratio
Inf
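Note that fisher.test() was called on the vectors of joint probabilities, which R cross-tabulates as two-level factors. The test is more conventionally run on the 2x2 table of counts itself; a minimal sketch (for a table this small the p-value is again 1, though the odds-ratio estimate and confidence interval will differ from the output above):
# Sketch: Fisher's exact test applied directly to the 2x2 count matrix.
fisher.test(mtx)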
df <- as.data.frame(read.csv("data/train.csv"))
summary(df$SalePrice)
Min. 1st Qu. Median Mean 3rd Qu. Max.
34900 129975 163000 180921 214000 755000
hist(df$SalePrice)
summary(df$GrLivArea)
Min. 1st Qu. Median Mean 3rd Qu. Max.
334 1130 1464 1515 1777 5642
hist(df$GrLivArea)
summary(df$YearBuilt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1872 1954 1973 1971 2000 2010
hist(df$YearBuilt)
summary(df$LotArea)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1300 7554 9478 10517 11602 215245
hist(df$LotArea, breaks = 100)
plot(df$SalePrice ~ df$GrLivArea)
plot(df$SalePrice ~ df$YearBuilt)
plot(df$SalePrice ~ df$LotArea)
df2 <- cbind(df$GrLivArea, df$YearBuilt, df$LotArea)
correlation.matrix <- cor(df2)
correlation.matrix
[,1] [,2] [,3]
[1,] 1.0000000 0.19900971 0.26311617
[2,] 0.1990097 1.00000000 0.01422765
[3,] 0.2631162 0.01422765 1.00000000
col.1 <- df$GrLivArea
col.2 <- df$YearBuilt
col.3 <- df$LotArea
test1 <- cor.test(col.1, col.2, conf.level = .8)
conf.inv.1 <- test1$conf.int
test2 <- cor.test(col.1, col.3, conf.level = .8)
conf.inv.2 <- test2$conf.int
test3 <- cor.test(col.2, col.3, conf.level = .8)
conf.inv.3 <- test3$conf.int
Hypothesis test:
\(H_0\): The correlation between each variable pair is 0. \(H_A\): The correlation between each variable pair is not 0.
conf.inv.1
[1] 0.1665605 0.2310283
attr(,"conf.level")
[1] 0.8
conf.inv.2
[1] 0.2315997 0.2940809
attr(,"conf.level")
[1] 0.8
conf.inv.3
[1] -0.01934322 0.04776648
attr(,"conf.level")
[1] 0.8
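The same decisions can be read directly from the p-values that cor.test reports; a small sketch using the test objects above:
# Sketch: p-values for the three correlation tests; compare each to alpha = 0.2.
c(test1$p.value, test2$p.value, test3$p.value)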
When \(\alpha = .2\), we find that living area and year built are seemingly correlated, and that living area and lot area are correlated. However, a correlation of 0 lies within our confidence interval when comparing lot area to year built. Finally, the family-wise error rate is
FWE <- 1-(1-.2)^3
FWE
[1] 0.488
which means there is roughly a 49% chance of at least one Type I error across the tests. This is worrying considering we only ran 3 tests.
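To see how quickly this compounds, here is a small sketch of the same formula written as a function of the number of tests (the helper name fwe is mine, not part of the original analysis):
# Hypothetical helper: family-wise error rate for m independent tests,
# each run at per-test significance level alpha.
fwe <- function(alpha, m) 1 - (1 - alpha)^m
fwe(0.2, 3)  # 0.488, the value computed above
fwe(0.2, 10) # about 0.89: the rate grows quickly with more tests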
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
The inverse exists if and only if the determinant is not 0. In this case, the determinant is about 0.89, so an inverse exists.
det(correlation.matrix)
[1] 0.8924526
An inverse of a matrix is defined by \[ A A^{-1} = I = A^{-1} A \] Mechanically, the inverse can be computed by dividing the adjugate of the matrix by its determinant. In practice, R gives us handy tools for matrix math; below, I use the built-in solve() function to find the inverse.
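As a quick illustration of that adjugate mechanic, here is a hedged sketch on a made-up 2x2 matrix (the object A2 and its values are mine, not part of the analysis):
# Sketch: inverse of a 2x2 matrix as adjugate / determinant.
A2 <- matrix(c(2, 1, 1, 3), nrow = 2)
adjugate <- matrix(c(A2[2, 2], -A2[2, 1], -A2[1, 2], A2[1, 1]), nrow = 2)
adjugate / det(A2) # matches solve(A2)
For the 3x3 correlation matrix, solve() below performs the equivalent computation.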
library(matlib) # matrix utilities; the solve() call below is base R
precision.matrix <- solve(correlation.matrix)
mtx1 <- correlation.matrix %*% precision.matrix
mtx2 <- precision.matrix %*% correlation.matrix
mtx1
[,1] [,2] [,3]
[1,] 1.000000e+00 1.387779e-17 0.000000e+00
[2,] 7.806256e-18 1.000000e+00 -1.734723e-18
[3,] 0.000000e+00 6.938894e-18 1.000000e+00
mtx2
[,1] [,2] [,3]
[1,] 1.000000e+00 3.642919e-17 5.551115e-17
[2,] -1.561251e-17 1.000000e+00 -6.938894e-18
[3,] -5.551115e-17 -1.734723e-18 1.000000e+00
LU decomposition allows for a quick recalculation when the right-hand side of the system \(A\overline{x} = \overline{b}\) changes (by, say, adding more data): the factors \(L\) and \(U\) are reused, and only a forward and a back substitution are needed to find the new solution (see the short sketch after the \(L U\) check below). Below I use the lu_factorization function I wrote back in assignment 1.
lu_factorization <- function(A) {
  # Doolittle LU decomposition (no pivoting): A = L %*% U with unit-diagonal L.
  if (nrow(A) != ncol(A)) {
    stop("Must be a square matrix!")
  }
  size <- nrow(A)
  L <- matrix(0, nrow = size, ncol = size)
  U <- matrix(0, nrow = size, ncol = size)
  for (i in 1:size) {
    L[i, i] <- 1
  }
  for (i in 1:size) {
    up <- i + 1
    down <- i - 1
    # Fill row i of U.
    for (j in 1:size) {
      U[i, j] <- A[i, j]
      if (down > 0) {
        for (k in 1:down) {
          U[i, j] <- U[i, j] - L[i, k] * U[k, j]
        }
      }
    }
    # Fill column i of L (below the diagonal).
    if (up <= size) {
      for (j in up:size) {
        L[j, i] <- A[j, i]
        if (down > 0) {
          for (k in 1:down) {
            L[j, i] <- L[j, i] - L[j, k] * U[k, i]
          }
        }
        L[j, i] <- L[j, i] / U[i, i]
      }
    }
  }
  answer <- list(L, U)
  return(answer)
}
decomposition <- lu_factorization(correlation.matrix)
L <- as.matrix(as.data.frame(decomposition[1])) # data-frame round trip supplies the X1..X3 column names
U <- as.matrix(as.data.frame(decomposition[2]))
L
X1 X2 X3
[1,] 1.0000000 0.00000000 0
[2,] 0.1990097 1.00000000 0
[3,] 0.2631162 -0.03970764 1
U
X1 X2 X3
[1,] 1 0.1990097 0.26311617
[2,] 0 0.9603951 -0.03813502
[3,] 0 0.0000000 0.92925563
L %*% U
X1 X2 X3
[1,] 1.0000000 0.19900971 0.26311617
[2,] 0.1990097 1.00000000 0.01422765
[3,] 0.2631162 0.01422765 1.00000000
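As promised above, here is a minimal sketch of why the factorization is useful when the right-hand side changes: solving \(A\overline{x} = \overline{b}\) reduces to one forward and one back substitution (the vector b below is made up for illustration):
# Sketch: reuse L and U to solve correlation.matrix %*% x = b for a new b.
b <- c(1, 2, 3)             # hypothetical right-hand side
y.fwd <- forwardsolve(L, b) # solve L %*% y = b
x.sol <- backsolve(U, y.fwd) # solve U %*% x = y
x.sol                        # should match solve(correlation.matrix, b)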
Because different routines may normalize or pivot the factors differently, I do not compare \(L\) and \(U\) directly. Instead, I check that \(L U\) reproduces the correlation matrix whether I use my function or the one from pracma.
library(pracma)
res <- lu(correlation.matrix)
res$L %*% res$U
[,1] [,2] [,3]
[1,] 1.0000000 0.19900971 0.26311617
[2,] 0.1990097 1.00000000 0.01422765
[3,] 0.2631162 0.01422765 1.00000000
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
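For an exponential distribution, the maximum-likelihood ("optimal") value of \(\lambda\) is one over the sample mean, so the rate fitted by fitdistr below should agree with this quick check (a minimal sketch, using the df loaded above):
# Sketch: the MLE of the exponential rate is 1 / sample mean; it should match
# the estimate returned below by fitdistr(df$GrLivArea, "exponential").
1 / mean(df$GrLivArea)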
library(MASS)
attach(data)
The following objects are masked from data (pos = 3):
    Condition, GarageSize, Price, Quality, Rooms, SaleYear, Size
par(mfrow = c(1,2))
hist(df$GrLivArea)
model2 <- fitdistr(x = df$GrLivArea, "exponential")
ex <- rexp(1000, rate = model2$estimate)
hist(ex)
lower <- qexp(p = .05, rate = model2$estimate)
upper <- qexp(p = .95, rate = model2$estimate)
range1 <- c(lower,upper)
range1
[1] 77.73313 4539.92351
The above shows the 5th and 95th percentiles for the fitted exponential distribution.
upper <- mean(df$GrLivArea) + sd(df$GrLivArea)*1.65
lower <- mean(df$GrLivArea) - sd(df$GrLivArea)*1.65
range2 <- c(lower, upper)
range2
[1] 648.4211 2382.5063
The above shows the 5th and 95th percentiles of a normal distribution with the same mean and standard deviation as our original data.
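The same interval can be obtained directly from the normal quantile function, where the 1.65 above is a rounded \(z_{0.95} \approx 1.645\) (a small sketch):
# Sketch: 5th and 95th percentiles of a normal distribution matched to the data;
# this should be very close to range2 above.
qnorm(c(0.05, 0.95), mean = mean(df$GrLivArea), sd = sd(df$GrLivArea))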
lower <- quantile(df$GrLivArea, .05)
upper <- quantile(df$GrLivArea, .95)
range3 <- c(lower, upper)
range3
5% 95%
848.0 2466.1
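To compare the three intervals side by side, here is a small sketch stacking the ranges computed above:
# Sketch: 5th/95th-percentile pairs from the exponential fit, the normal
# approximation, and the empirical quantiles.
rbind(exponential = range1, normal = range2, empirical = range3)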
By comparing these quantiles (as well as the initial histograms), we see that our data is much better suited to a normal distribution than an exponential one.
Below, I built a model that attempts to predict sale price from overall quality, above-grade living area, year built (labeled SaleYear in the data frame below), overall condition, and bedroom count. Below are the diagnostic plots for the model and its summary statistics.
data <- as.data.frame(cbind(df$OverallQual, df$GrLivArea, df$YearBuilt, df$OverallCond, df$SalePrice, df$BedroomAbvGr, df$GarageType))
colnames(data) <- c("Quality", "Size", "SaleYear", "Condition", "Price", "Rooms", "GarageSize")
model3 <- lm(Price ~I(Quality**5) + I(Size**2) + SaleYear + Condition + Rooms, data = data)
plot(model3)
summary(model3)
Call:
lm(formula = Price ~ I(Quality^5) + I(Size^2) + SaleYear + Condition +
Rooms, data = data)
Residuals:
Min 1Q Median 3Q Max
-619774 -17314 -2676 13211 256827
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.443e+06 8.365e+04 -17.251 <2e-16 ***
I(Quality^5) 2.521e+00 9.194e-02 27.415 <2e-16 ***
I(Size^2) 1.220e-02 7.168e-04 17.016 <2e-16 ***
SaleYear 7.636e+02 4.141e+01 18.441 <2e-16 ***
Condition 8.383e+03 9.971e+02 8.407 <2e-16 ***
Rooms 2.398e+03 1.482e+03 1.618 0.106
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 39240 on 1454 degrees of freedom
Multiple R-squared: 0.7569, Adjusted R-squared: 0.7561
F-statistic: 905.5 on 5 and 1454 DF, p-value: < 2.2e-16
This model does a decent job of modelling house prices, except for the most luxurious houses. The residuals appear to be roughly normally distributed about the fitted values, except in the tails where we simply don't have many data points. Additionally, the Normal Q-Q plot shows that the model fits the data closely until the most luxurious houses. This suggests that the prices of these luxury homes are driven by factors beyond size, quality, year built, and condition. The other possibility is that these measures are not granular enough in our data to differentiate the most exquisite houses. With an \(R^2\) of roughly 0.76 and a relatively small number of features, I am satisfied with this model.
Below, I use the test data set to test my model. I then write it to a csv in the data folder called “kaggle_results.csv”.
df2 <- read.csv("data/test.csv")
new.data <- as.data.frame(cbind(df2$OverallQual, df2$GrLivArea, df2$YearBuilt, df2$OverallCond, df2$BedroomAbvGr))
colnames(new.data) <- c("Quality", "Size", "SaleYear", "Condition", "Rooms")
predictions <- as.data.frame(round(predict.lm(object = model3, newdata = new.data), 2))
index <- df2$Id
predictions <- as.data.frame(cbind(index, predictions))
colnames(predictions) <- c("ID", "SalePrice")
write.csv(predictions, file = 'data/kaggle_results.csv', row.names = FALSE)
After submitting the csv manually to Kaggle, I ranked 4612th under the handle @simplymathematics.