Let’s load the necessary libraries.
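The library chunk is not echoed in the knitted output; based on the functions called later in the report, a minimal sketch would be the following (the choice of e1071 for the skewness calculation is an assumption; moments or psych would work similarly).
# Packages used throughout the report; MASS is loaded later, just before fitdistr()
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(e1071)      # skewness() -- assumed package for the skewness values below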
Let’s start by loading the train.csv file and examining the numerical variables to identify one that is right-skewed. We’ll then select that variable as X and choose the dependent variable Y.
train <- read_csv("https://raw.githubusercontent.com/Kossi-Akplaka/Data605_Computational_mathematics/main/data605/train.csv")
head(train)
## # A tibble: 6 × 81
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## # ℹ 73 more variables: LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <chr>,
## # BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
Let’s identify the numerical variables in the data set
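The chunk that produced the output below is not shown; a minimal sketch using dplyr might look like this (the object name numeric_vars is an assumption).
# Keep only the numeric columns of the training data
numeric_vars <- train %>% select(where(is.numeric))
numeric_vars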
## # A tibble: 1,460 × 38
## Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 60 65 8450 7 5 2003
## 2 2 20 80 9600 6 8 1976
## 3 3 60 68 11250 7 5 2001
## 4 4 70 60 9550 7 5 1915
## 5 5 60 84 14260 8 5 2000
## 6 6 50 85 14115 5 5 1993
## 7 7 20 75 10084 8 5 2004
## 8 8 60 NA 10382 7 6 1973
## 9 9 50 51 6120 7 5 1931
## 10 10 190 50 7420 5 6 1939
## # ℹ 1,450 more rows
## # ℹ 31 more variables: YearRemodAdd <dbl>, MasVnrArea <dbl>, BsmtFinSF1 <dbl>,
## # BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>, `1stFlrSF` <dbl>,
## # `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>, BsmtFullBath <dbl>,
## # BsmtHalfBath <dbl>, FullBath <dbl>, HalfBath <dbl>, BedroomAbvGr <dbl>,
## # KitchenAbvGr <dbl>, TotRmsAbvGrd <dbl>, Fireplaces <dbl>,
## # GarageYrBlt <dbl>, GarageCars <dbl>, GarageArea <dbl>, WoodDeckSF <dbl>, …
Then we can calculate the skewness of each variable
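The skewness chunk is likewise not shown; assuming the skewness() function from e1071 (or an equivalent from moments or psych) and the numeric_vars object sketched above, it could look like this.
# Skewness of each numeric variable, ignoring missing values
sapply(numeric_vars, function(col) skewness(col, na.rm = TRUE))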
## Id MSSubClass LotFrontage LotArea OverallQual
## 0.00000000 1.40476562 2.15816772 12.18261502 0.21649836
## OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## 0.69164401 -0.61220121 -0.50252776 2.66357211 1.68204129
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF
## 4.24652141 0.91837835 1.52112395 1.37392896 0.81135997
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 8.99283329 1.36375364 0.59484237 4.09497490 0.03648647
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 0.67450925 0.21135511 4.47917826 0.67495173 0.64823107
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## -0.64800251 -0.34184538 0.17961125 1.53820999 2.35948572
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal
## 3.08352575 10.28317840 4.11374731 14.79791829 24.42652237
## MoSold YrSold SalePrice
## 0.21161746 0.09607079 1.87900860
Positive skewness means a variable is right-skewed. Let’s select LotArea (lot size in square feet) as the variable X and SalePrice as the dependent variable Y.
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.
# Define X and Y
X <- train$LotArea
Y <- train$SalePrice
# Calculate the quartiles: x = 3rd quartile of X, y = 2nd quartile (median) of Y
x_quartile <- quantile(X, probs = 0.75)
y_quartile <- quantile(Y, probs = 0.5)
# a. P(X > x | Y > y)
prob_a <- sum(X > x_quartile & Y > y_quartile) / sum(Y > y_quartile)
# b. P(X > x, Y > y)
prob_b <- sum(X > x_quartile & Y > y_quartile) / length(X)
# c. P(X < x | Y > y)
prob_c <- sum(X < x_quartile & Y > y_quartile) / sum(Y > y_quartile)
# Display the probabilities
prob_a
## [1] 0.3791209
prob_b
## [1] 0.1890411
prob_c
## [1] 0.6208791
Interpretation: P(X > x | Y > y) ≈ 0.38 means that, among houses that sold above the median price, about 38% have a lot area above the 3rd quartile. P(X > x, Y > y) ≈ 0.19 means about 19% of all houses have both a lot area above the 3rd quartile and a sale price above the median. P(X < x | Y > y) ≈ 0.62 means that, among houses that sold above the median price, about 62% have a lot area below the 3rd quartile.
Let’s make the table of counts.
# Count the observations in each quadrant of the joint split of X and Y
count_low_low   <- sum(X <= x_quartile & Y <= y_quartile)  # X <= 3rd quartile, Y <= 2nd quartile
count_high_low  <- sum(X >  x_quartile & Y <= y_quartile)  # X >  3rd quartile, Y <= 2nd quartile
count_low_high  <- sum(X <= x_quartile & Y >  y_quartile)  # X <= 3rd quartile, Y >  2nd quartile
count_high_high <- sum(X >  x_quartile & Y >  y_quartile)  # X >  3rd quartile, Y >  2nd quartile
total <- length(X)
# Create the table of counts with marginal totals (rows: Y split, columns: X split)
table_counts <- matrix(c(count_low_low,  count_high_low,  count_low_low + count_high_low,
                         count_low_high, count_high_high, count_low_high + count_high_high,
                         count_low_low + count_low_high, count_high_low + count_high_high, total),
                       nrow = 3, byrow = TRUE)
rownames(table_counts) <- c("<=2nd quartile", ">2nd quartile", "Total")
colnames(table_counts) <- c("<=3rd quartile", ">3rd quartile", "Total")
# Display the table
table_counts
## <=3rd quartile >3rd quartile Total
## <=2nd quartile 643 89 732
## >2nd quartile 452 276 728
## Total 1095 365 1460
First, let’s check mathematically whether P(A | B) = P(A) · P(B), where A is the event that X is above its 3rd quartile and B is the event that Y is above its 2nd quartile.
# P(A|B): probability that X is above its 3rd quartile given Y is above its median
prob_A_given_B <- sum(X > x_quartile & Y > y_quartile) / sum(Y > y_quartile)
# Calculate the marginal probabilities P(A) and P(B)
prob_A <- sum(X > x_quartile) / length(X)
prob_B <- sum(Y > y_quartile) / length(Y)
# Calculate P(A) * P(B)
prob_A_times_prob_B <- prob_A * prob_B
# Check if P(A|B) equals P(A) * P(B)
prob_A_given_B == prob_A_times_prob_B
## [1] FALSE
No, P(A | B) does not equal P(A) · P(B), so splitting the training data in this fashion does not make the variables independent. Let’s also evaluate this by running a Chi-squared test for association.
# Create contingency table for A and B
contingency_table <- table(X > x_quartile, Y > y_quartile)
# Perform Chi-Square test
chisq.test(contingency_table)
## 
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 127.74, df = 1, p-value < 2.2e-16
Since the p-value is less than 0.05, we reject the null hypothesis of independence, which indicates a significant association between A and B.
Univariate descriptive statistics
# Calculate summary statistics for X
summary_X <- summary(train$LotArea)
# Calculate summary statistics for Y
summary_Y <- summary(train$SalePrice)
summary_X
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
summary_Y
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Let’s create a scatterplot of variables X and Y
ggplot(train, aes(x = LotArea, y = SalePrice)) +
  geom_point() +
  labs(x = "LotArea (sq ft)", y = "SalePrice ($)", title = "Scatterplot of LotArea vs SalePrice")
To calculate a 95% confidence interval for the difference in means of X and Y, we can use a t-test for independent samples.
# Perform t-test for independent samples
t_test <- t.test(train$SalePrice, train$LotArea)
# Calculate 95% confidence interval for the difference in means
t_test$conf.int
## [1] 166294.1 174514.7
## attr(,"conf.level")
## [1] 0.95
Let’s derive the correlation matrix for X and Y
# Calculate the correlation matrix
correlation_matrix <- cor(train[c("LotArea", "SalePrice")])
correlation_matrix
## LotArea SalePrice
## LotArea 1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
Test the hypothesis that the correlation between LotArea and SalePrice is zero.
# Perform hypothesis test for correlation coefficient
cor_test <- cor.test(train$LotArea, train$SalePrice, method = "pearson")
# Results of the hypothesis test
cor_test
## 
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2154574 0.3109369
## sample estimates:
## cor
## 0.2638434
# Extract the 95% confidence interval for the correlation coefficient
# (cor.test returns a 95% interval by default)
conf_interval <- cor_test$conf.int
# Confidence interval
conf_interval
## [1] 0.2154574 0.3109369
## attr(,"conf.level")
## [1] 0.95
The correlation coefficient of 0.2638 indicates a weak positive linear relationship between LotArea and SalePrice.
The p-value being very small (< 0.05) suggests strong evidence against the null hypothesis, indicating that the correlation between LotArea and SalePrice is statistically significant.
The 95% confidence interval for the correlation coefficient (0.2155, 0.3109) indicates that we are 95% confident that the true correlation coefficient falls within this interval.
Invert the correlation matrix
# Invert the correlation matrix
inverse_correlation_matrix <- solve(correlation_matrix)
inverse_correlation_matrix
## LotArea SalePrice
## LotArea 1.0748219 -0.2835846
## SalePrice -0.2835846 1.0748219
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix
# Multiply the correlation matrix by the precision matrix
product_1 <- correlation_matrix %*% inverse_correlation_matrix
# Multiply the precision matrix by the correlation matrix
product_2 <- inverse_correlation_matrix %*% correlation_matrix
# Display the results
product_1
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
product_2
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
Both products equal the 2 × 2 identity matrix, confirming that the precision matrix is indeed the inverse of the correlation matrix.
Conduct PCA
# Perform PCA
pca_result <- prcomp(train[c("LotArea", "SalePrice")], scale. = TRUE)
# Display the PCA results
summary(pca_result)
## Importance of components:
## PC1 PC2
## Standard deviation 1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion 0.6319 1.0000
PC1 captures the majority of the variability in the data, while PC2 captures the remaining variability. The cumulative proportion of variance reaching 100% indicates that all variability in the data is accounted for by the two principal components.
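As a quick check, the reported proportions can be reproduced directly from the component standard deviations returned by prcomp(); a minimal sketch:
# Proportion of variance explained, derived from the component standard deviations
prop_var <- pca_result$sdev^2 / sum(pca_result$sdev^2)
round(prop_var, 4)            # matches the proportions in the summary above
round(cumsum(prop_var), 4)    # cumulative proportion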
Next, for the right-skewed variable X (LotArea), we shift it so that its minimum value is above zero before fitting an exponential distribution. (The minimum LotArea is 1,300, so it is already positive; the shift simply re-anchors the variable at 1.)
# Shift the variable so that the minimum value is above zero
X_shifted <- X - min(X) + 1
# Display the first few values of the shifted variable
head(X_shifted)
## [1] 7151 8301 9951 8251 12961 12816
Load the MASS package to run fitdistr().
library(MASS)
# Fit an exponential distribution to the shifted variable
fit <- fitdistr(X_shifted, densfun = "exponential")
# Display the fit results (rate estimate with its standard error)
fit
## rate
## 1.084854e-04
## (2.839193e-06)
Find the optimal value of λ (the fitted rate) for this distribution, then take 1,000 samples from an exponential distribution with that rate.
lambda <- fit$estimate
# Take 1000 samples from the exponential distribution
samples <- rexp(1000, rate = lambda)
# Display the first few samples
head(samples)
## [1] 2152.665 3341.839 1605.561 24158.226 4434.988 7394.965
Plot a histogram and compare it with a histogram of the original variable X
par(mfrow = c(1, 2))
# Plot histogram of the original variable
hist(X, breaks = "FD", col = "blue", main = "Histogram of Original Variable",
xlab = "Original Variable")
# Plot histogram of the samples from the fitted exponential distribution
hist(samples, breaks = "FD", col = "red", main = "Histogram of Samples",
     xlab = "Samples from Fitted Exponential Distribution")
Find the 5th and 95th percentiles using the exponential cumulative distribution function (CDF).
# Find the 5th percentile using the cumulative distribution function
fifth_percentile <- qexp(0.05, rate = lambda)
# Find the 95th percentile using the cumulative distribution function
ninety_fifth_percentile <- qexp(0.95, rate = lambda)
fifth_percentile
## [1] 472.8128
ninety_fifth_percentile
## [1] 27614.15
Generate a 95% confidence interval for the mean of X from the empirical data, assuming normality.
# Calculate the standard error of the mean
standard_error <- sd(X) / sqrt(length(X))
# Calculate the margin of error for a 95% confidence interval
margin_of_error <- qnorm(0.975) * standard_error
# Calculate the confidence interval
confidence_interval <- c(mean(X) - margin_of_error, mean(X) + margin_of_error)
# Display the confidence interval
confidence_interval
## [1] 10004.84 11028.81
# Find the empirical 5th percentile of the data
empirical_5th_percentile <- quantile(X, 0.05)
# Find the empirical 95th percentile of the data
empirical_95th_percentile <- quantile(X, 0.95)
# Display the results
empirical_5th_percentile
## 5%
## 3311.7
empirical_95th_percentile
## 95%
## 17401.15
The empirical 5th percentile indicates that 5% of lot areas fall below approximately 3,312 square feet, and the empirical 95th percentile indicates that 95% fall below approximately 17,401 square feet.
These empirical quantiles are much closer together than the corresponding quantiles of the fitted exponential distribution (about 473 and 27,614 for the shifted variable), which suggests the exponential model is a poor fit for LotArea.
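A side-by-side comparison on the same (shifted) scale makes the mismatch between the empirical and fitted quantiles explicit; a minimal sketch:
# Compare empirical quantiles of the shifted variable with the fitted exponential quantiles
rbind(empirical   = quantile(X_shifted, probs = c(0.05, 0.95)),
      exponential = qexp(c(0.05, 0.95), rate = lambda))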
Let’s choose a subset of numerical variables as predictors for our regression model. We’ll include variables that are likely to have a significant impact on the sale price of a property
# Select predictors
predictors <- c("LotArea", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd",
"MasVnrArea", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "GrLivArea",
"FullBath", "HalfBath", "BedroomAbvGr", "TotRmsAbvGrd", "GarageArea",
"WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "ScreenPorch", "PoolArea")
# Build the linear regression model
lm_model <- lm(SalePrice ~ ., data = train[, c("SalePrice", predictors)])
# Summarize the linear regression model
summary(lm_model)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train[, c("SalePrice", predictors)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -506172 -16760 -1545 14141 318920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.234e+06 1.350e+05 -9.142 < 2e-16 ***
## LotArea 6.042e-01 1.021e-01 5.917 4.10e-09 ***
## OverallQual 1.776e+04 1.174e+03 15.127 < 2e-16 ***
## OverallCond 5.465e+03 1.040e+03 5.253 1.72e-07 ***
## YearBuilt 4.196e+02 6.195e+01 6.773 1.84e-11 ***
## YearRemodAdd 1.637e+02 6.726e+01 2.433 0.015082 *
## MasVnrArea 3.155e+01 6.122e+00 5.154 2.91e-07 ***
## TotalBsmtSF 1.725e+01 4.140e+00 4.167 3.27e-05 ***
## `1stFlrSF` 3.177e+01 2.062e+01 1.541 0.123544
## `2ndFlrSF` 2.449e+01 2.050e+01 1.195 0.232328
## GrLivArea 2.252e+01 2.028e+01 1.110 0.267000
## FullBath -1.620e+03 2.813e+03 -0.576 0.564674
## HalfBath -1.263e+03 2.751e+03 -0.459 0.646089
## BedroomAbvGr -9.715e+03 1.716e+03 -5.663 1.79e-08 ***
## TotRmsAbvGrd 4.319e+03 1.226e+03 3.522 0.000442 ***
## GarageArea 3.260e+01 5.952e+00 5.477 5.11e-08 ***
## WoodDeckSF 3.294e+01 8.171e+00 4.032 5.82e-05 ***
## OpenPorchSF -7.321e-01 1.568e+01 -0.047 0.962774
## EnclosedPorch 1.958e+01 1.747e+01 1.121 0.262584
## ScreenPorch 6.842e+01 1.757e+01 3.894 0.000103 ***
## PoolArea -3.170e+01 2.432e+01 -1.303 0.192667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36130 on 1431 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7952, Adjusted R-squared: 0.7923
## F-statistic: 277.7 on 20 and 1431 DF, p-value: < 2.2e-16
Get the test set
test <- read_csv("https://raw.githubusercontent.com/Kossi-Akplaka/Data605_Computational_mathematics/main/data605/test.csv")
head(test)
## # A tibble: 6 × 80
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1461 20 RH 80 11622 Pave <NA> Reg
## 2 1462 20 RL 81 14267 Pave <NA> IR1
## 3 1463 60 RL 74 13830 Pave <NA> IR1
## 4 1464 60 RL 78 9978 Pave <NA> IR1
## 5 1465 120 RL 43 5005 Pave <NA> IR1
## 6 1466 60 RL 75 10000 Pave <NA> IR1
## # ℹ 72 more variables: LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <chr>,
## # BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
Predict SalePrice using the linear regression model and the test set
predicted_prices <- predict(lm_model, newdata = test)
# Calculate the mean of non-missing values in predicted_prices
mean_predicted_price <- mean(predicted_prices, na.rm = TRUE)
# Replace missing values with the mean
predicted_prices[is.na(predicted_prices)] <- mean_predicted_price
# Display the predicted SalePrice values
head(predicted_prices)
## 1 2 3 4 5 6
## 138956.9 171001.7 169241.8 198768.7 216613.4 180631.4
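The chunk that writes the Kaggle submission file is not shown; a minimal sketch using base R would be the following (the file name submission.csv is an assumption).
# Two-column file expected by Kaggle: Id and predicted SalePrice
submission <- data.frame(Id = test$Id, SalePrice = predicted_prices)
write.csv(submission, "submission.csv", row.names = FALSE)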
Submission to Kaggle
Username: flavioakplaka
Score: 46%
…