You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlation between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, and shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html.) Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
# Load the required libraries
library(ggplot2)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(Matrix)
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(glmnet)
## Loaded glmnet 4.1-7
library(missForest)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
First, I downloaded the CSV files and uploaded them to GitHub so that the training data set is available online.
# Read the data from the URL
data_url <- "https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/data_605/Final/house-prices-advanced-regression-techniques/train.csv"
df <- read.csv(data_url)
# Univariate descriptive statistics
univariate_stats <- describe(df)
print(univariate_stats)
## vars n mean sd median trimmed mad min
## Id 1 1460 730.50 421.61 730.5 730.50 541.15 1
## MSSubClass 2 1460 56.90 42.30 50.0 49.15 44.48 20
## MSZoning* 3 1460 4.03 0.63 4.0 4.06 0.00 1
## LotFrontage 4 1201 70.05 24.28 69.0 68.94 16.31 21
## LotArea 5 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300
## Street* 6 1460 2.00 0.06 2.0 2.00 0.00 1
## Alley* 7 91 1.45 0.50 1.0 1.44 0.00 1
## LotShape* 8 1460 2.94 1.41 4.0 3.05 0.00 1
## LandContour* 9 1460 3.78 0.71 4.0 4.00 0.00 1
## Utilities* 10 1460 1.00 0.03 1.0 1.00 0.00 1
## LotConfig* 11 1460 4.02 1.62 5.0 4.27 0.00 1
## LandSlope* 12 1460 1.06 0.28 1.0 1.00 0.00 1
## Neighborhood* 13 1460 13.15 5.89 13.0 13.11 7.41 1
## Condition1* 14 1460 3.03 0.87 3.0 3.00 0.00 1
## Condition2* 15 1460 3.01 0.26 3.0 3.00 0.00 1
## BldgType* 16 1460 1.49 1.20 1.0 1.14 0.00 1
## HouseStyle* 17 1460 4.04 1.91 3.0 4.03 1.48 1
## OverallQual 18 1460 6.10 1.38 6.0 6.08 1.48 1
## OverallCond 19 1460 5.58 1.11 5.0 5.48 0.00 1
## YearBuilt 20 1460 1971.27 30.20 1973.0 1974.13 37.06 1872
## YearRemodAdd 21 1460 1984.87 20.65 1994.0 1986.37 19.27 1950
## RoofStyle* 22 1460 2.41 0.83 2.0 2.26 0.00 1
## RoofMatl* 23 1460 2.08 0.60 2.0 2.00 0.00 1
## Exterior1st* 24 1460 10.62 3.20 13.0 10.93 1.48 1
## Exterior2nd* 25 1460 11.34 3.54 14.0 11.65 2.97 1
## MasVnrType* 26 1452 2.76 0.62 3.0 2.73 0.00 1
## MasVnrArea 27 1452 103.69 181.07 0.0 63.15 0.00 0
## ExterQual* 28 1460 3.54 0.69 4.0 3.65 0.00 1
## ExterCond* 29 1460 4.73 0.73 5.0 4.95 0.00 1
## Foundation* 30 1460 2.40 0.72 2.0 2.46 1.48 1
## BsmtQual* 31 1423 3.26 0.87 3.0 3.43 1.48 1
## BsmtCond* 32 1423 3.81 0.66 4.0 4.00 0.00 1
## BsmtExposure* 33 1422 3.27 1.15 4.0 3.46 0.00 1
## BsmtFinType1* 34 1423 3.73 1.83 3.0 3.79 2.97 1
## BsmtFinSF1 35 1460 443.64 456.10 383.5 386.08 568.58 0
## BsmtFinType2* 36 1422 5.71 0.94 6.0 5.98 0.00 1
## BsmtFinSF2 37 1460 46.55 161.32 0.0 1.38 0.00 0
## BsmtUnfSF 38 1460 567.24 441.87 477.5 519.29 426.99 0
## TotalBsmtSF 39 1460 1057.43 438.71 991.5 1036.70 347.67 0
## Heating* 40 1460 2.04 0.30 2.0 2.00 0.00 1
## HeatingQC* 41 1460 2.54 1.74 1.0 2.42 0.00 1
## CentralAir* 42 1460 1.93 0.25 2.0 2.00 0.00 1
## Electrical* 43 1459 4.68 1.05 5.0 5.00 0.00 1
## X1stFlrSF 44 1460 1162.63 386.59 1087.0 1129.99 347.67 334
## X2ndFlrSF 45 1460 346.99 436.53 0.0 285.36 0.00 0
## LowQualFinSF 46 1460 5.84 48.62 0.0 0.00 0.00 0
## GrLivArea 47 1460 1515.46 525.48 1464.0 1467.67 483.33 334
## BsmtFullBath 48 1460 0.43 0.52 0.0 0.39 0.00 0
## BsmtHalfBath 49 1460 0.06 0.24 0.0 0.00 0.00 0
## FullBath 50 1460 1.57 0.55 2.0 1.56 0.00 0
## HalfBath 51 1460 0.38 0.50 0.0 0.34 0.00 0
## BedroomAbvGr 52 1460 2.87 0.82 3.0 2.85 0.00 0
## KitchenAbvGr 53 1460 1.05 0.22 1.0 1.00 0.00 0
## KitchenQual* 54 1460 3.34 0.83 4.0 3.50 0.00 1
## TotRmsAbvGrd 55 1460 6.52 1.63 6.0 6.41 1.48 2
## Functional* 56 1460 6.75 0.98 7.0 7.00 0.00 1
## Fireplaces 57 1460 0.61 0.64 1.0 0.53 1.48 0
## FireplaceQu* 58 770 3.73 1.13 3.0 3.80 1.48 1
## GarageType* 59 1379 3.28 1.79 2.0 3.11 0.00 1
## GarageYrBlt 60 1379 1978.51 24.69 1980.0 1981.07 31.13 1900
## GarageFinish* 61 1379 2.18 0.81 2.0 2.23 1.48 1
## GarageCars 62 1460 1.77 0.75 2.0 1.77 0.00 0
## GarageArea 63 1460 472.98 213.80 480.0 469.81 177.91 0
## GarageQual* 64 1379 4.86 0.61 5.0 5.00 0.00 1
## GarageCond* 65 1379 4.90 0.52 5.0 5.00 0.00 1
## PavedDrive* 66 1460 2.86 0.50 3.0 3.00 0.00 1
## WoodDeckSF 67 1460 94.24 125.34 0.0 71.76 0.00 0
## OpenPorchSF 68 1460 46.66 66.26 25.0 33.23 37.06 0
## EnclosedPorch 69 1460 21.95 61.12 0.0 3.87 0.00 0
## X3SsnPorch 70 1460 3.41 29.32 0.0 0.00 0.00 0
## ScreenPorch 71 1460 15.06 55.76 0.0 0.00 0.00 0
## PoolArea 72 1460 2.76 40.18 0.0 0.00 0.00 0
## PoolQC* 73 7 2.14 0.90 2.0 2.14 1.48 1
## Fence* 74 281 2.43 0.86 3.0 2.48 0.00 1
## MiscFeature* 75 54 2.91 0.45 3.0 3.00 0.00 1
## MiscVal 76 1460 43.49 496.12 0.0 0.00 0.00 0
## MoSold 77 1460 6.32 2.70 6.0 6.25 2.97 1
## YrSold 78 1460 2007.82 1.33 2008.0 2007.77 1.48 2006
## SaleType* 79 1460 8.51 1.56 9.0 8.92 0.00 1
## SaleCondition* 80 1460 4.77 1.10 5.0 5.00 0.00 1
## SalePrice 81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
## max range skew kurtosis se
## Id 1460 1459 0.00 -1.20 11.03
## MSSubClass 190 170 1.40 1.56 1.11
## MSZoning* 5 4 -1.73 6.25 0.02
## LotFrontage 313 292 2.16 17.34 0.70
## LotArea 215245 213945 12.18 202.26 261.22
## Street* 2 1 -15.49 238.01 0.00
## Alley* 2 1 0.20 -1.98 0.05
## LotShape* 4 3 -0.61 -1.60 0.04
## LandContour* 4 3 -3.16 8.65 0.02
## Utilities* 2 1 38.13 1453.00 0.00
## LotConfig* 5 4 -1.13 -0.59 0.04
## LandSlope* 3 2 4.80 24.47 0.01
## Neighborhood* 25 24 0.02 -1.06 0.15
## Condition1* 9 8 3.01 16.34 0.02
## Condition2* 8 7 13.14 247.54 0.01
## BldgType* 5 4 2.24 3.41 0.03
## HouseStyle* 8 7 0.31 -0.96 0.05
## OverallQual 10 9 0.22 0.09 0.04
## OverallCond 9 8 0.69 1.09 0.03
## YearBuilt 2010 138 -0.61 -0.45 0.79
## YearRemodAdd 2010 60 -0.50 -1.27 0.54
## RoofStyle* 6 5 1.47 0.61 0.02
## RoofMatl* 8 7 8.09 66.28 0.02
## Exterior1st* 15 14 -0.72 -0.37 0.08
## Exterior2nd* 16 15 -0.69 -0.52 0.09
## MasVnrType* 4 3 -0.07 -0.13 0.02
## MasVnrArea 1600 1600 2.66 10.03 4.75
## ExterQual* 4 3 -1.83 3.86 0.02
## ExterCond* 5 4 -2.56 5.29 0.02
## Foundation* 6 5 0.09 1.02 0.02
## BsmtQual* 4 3 -1.31 1.27 0.02
## BsmtCond* 4 3 -3.39 10.14 0.02
## BsmtExposure* 4 3 -1.15 -0.39 0.03
## BsmtFinType1* 6 5 -0.02 -1.39 0.05
## BsmtFinSF1 5644 5644 1.68 11.06 11.94
## BsmtFinType2* 6 5 -3.56 12.32 0.02
## BsmtFinSF2 1474 1474 4.25 20.01 4.22
## BsmtUnfSF 2336 2336 0.92 0.46 11.56
## TotalBsmtSF 6110 6110 1.52 13.18 11.48
## Heating* 6 5 9.83 110.98 0.01
## HeatingQC* 5 4 0.48 -1.51 0.05
## CentralAir* 2 1 -3.52 10.42 0.01
## Electrical* 5 4 -3.06 7.49 0.03
## X1stFlrSF 4692 4358 1.37 5.71 10.12
## X2ndFlrSF 2065 2065 0.81 -0.56 11.42
## LowQualFinSF 572 572 8.99 82.83 1.27
## GrLivArea 5642 5308 1.36 4.86 13.75
## BsmtFullBath 3 3 0.59 -0.84 0.01
## BsmtHalfBath 2 2 4.09 16.31 0.01
## FullBath 3 3 0.04 -0.86 0.01
## HalfBath 2 2 0.67 -1.08 0.01
## BedroomAbvGr 8 8 0.21 2.21 0.02
## KitchenAbvGr 3 3 4.48 21.42 0.01
## KitchenQual* 4 3 -1.42 1.72 0.02
## TotRmsAbvGrd 14 12 0.67 0.87 0.04
## Functional* 7 6 -4.08 16.37 0.03
## Fireplaces 3 3 0.65 -0.22 0.02
## FireplaceQu* 5 4 -0.16 -0.98 0.04
## GarageType* 6 5 0.76 -1.30 0.05
## GarageYrBlt 2010 110 -0.65 -0.42 0.66
## GarageFinish* 3 2 -0.35 -1.41 0.02
## GarageCars 4 4 -0.34 0.21 0.02
## GarageArea 1418 1418 0.18 0.90 5.60
## GarageQual* 5 4 -4.43 18.25 0.02
## GarageCond* 5 4 -5.28 26.77 0.01
## PavedDrive* 3 2 -3.30 9.22 0.01
## WoodDeckSF 857 857 1.54 2.97 3.28
## OpenPorchSF 547 547 2.36 8.44 1.73
## EnclosedPorch 552 552 3.08 10.37 1.60
## X3SsnPorch 508 508 10.28 123.06 0.77
## ScreenPorch 480 480 4.11 18.34 1.46
## PoolArea 738 738 14.80 222.19 1.05
## PoolQC* 3 2 -0.22 -1.90 0.34
## Fence* 4 3 -0.57 -0.88 0.05
## MiscFeature* 4 3 -2.93 10.71 0.06
## MiscVal 15500 15500 24.43 697.64 12.98
## MoSold 12 11 0.21 -0.41 0.07
## YrSold 2010 4 0.10 -1.19 0.03
## SaleType* 9 8 -3.83 14.57 0.04
## SaleCondition* 6 5 -2.74 6.82 0.03
## SalePrice 755000 720100 1.88 6.50 2079.11
The univariate descriptive statistics provide summary measures for each variable in the dataset, including the mean, standard deviation, median, minimum, maximum, skewness, and kurtosis. These statistics help us understand the distribution and spread of the variables.
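As one example of the univariate plots requested, a histogram of the dependent variable makes its right skew (skewness 1.88 in the table above) visible; a minimal sketch:
# Sketch: histogram of the dependent variable
hist(df$SalePrice, breaks = 50, main = "Distribution of SalePrice",
     xlab = "SalePrice")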
# Convert non-numeric variables to appropriate types
df$SalePrice <- as.numeric(as.character(df$SalePrice))
df$GrLivArea <- as.numeric(as.character(df$GrLivArea))
df$OverallQual <- as.numeric(as.character(df$OverallQual))
# Select the variables for the scatterplot matrix
scatterplot_vars <- c("GrLivArea", "OverallQual", "SalePrice")
# Create the scatterplot matrix (upper panels only)
pairs(df[, scatterplot_vars], lower.panel = NULL)
The scatterplot matrix allows us to visualize the relationships between two independent variables (GrLivArea and OverallQual) and the dependent variable (SalePrice). By examining the scatterplots, we can identify any linear or nonlinear associations between the variables.
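For a closer look at a single panel, ggplot2 (loaded above) can draw one scatterplot with a least-squares line; a minimal sketch:
# Sketch: GrLivArea vs. SalePrice with a fitted least-squares line
ggplot(df, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "SalePrice vs. GrLivArea")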
# Correlation matrix
cor_vars <- c("SalePrice", "GrLivArea", "OverallQual")
# Create the correlation matrix
cor_matrix <- rcorr(as.matrix(df[, cor_vars]))$r
print(cor_matrix)
## SalePrice GrLivArea OverallQual
## SalePrice 1.0000000 0.7086245 0.7909816
## GrLivArea 0.7086245 1.0000000 0.5930074
## OverallQual 0.7909816 0.5930074 1.0000000
The correlation matrix shows the pairwise correlations between SalePrice, GrLivArea, and OverallQual. The values range from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no correlation. The correlation matrix helps us understand the strength and direction of the relationships between variables.
# Test hypotheses for each pairwise correlation
alpha <- 0.2 # Significance level (1 - confidence level)
n <- nrow(df) # Number of observations
# Calculate the critical t-value
critical_t <- qt(1 - alpha / 2, df = n - 2)
# Approximate margin of error (half-width of the 80% CI) for each correlation
conf_interval <- critical_t * sqrt((1 - cor_matrix^2) / (n - 2))
print(conf_interval)
## SalePrice GrLivArea OverallQual
## SalePrice 0.00000000 0.02369212 0.02054433
## GrLivArea 0.02369212 0.00000000 0.02703686
## OverallQual 0.02054433 0.02703686 0.00000000
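The matrix above holds half-widths (the diagonal is 0 only because r = 1 there); approximate 80% confidence bounds follow by adding and subtracting them from the correlations. A minimal sketch:
# Sketch: approximate 80% confidence bounds for each correlation
cor_matrix - conf_interval # lower bounds
cor_matrix + conf_interval # upper bounds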
# Test hypotheses for each pairwise correlation
for (i in 1:(length(cor_vars) - 1)) {
for (j in (i + 1):length(cor_vars)) {
hypothesis_results <- cor.test(df[[cor_vars[i]]], df[[cor_vars[j]]])
print(hypothesis_results)
}
}
##
## Pearson's product-moment correlation
##
## data: df[[cor_vars[i]]] and df[[cor_vars[j]]]
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6821200 0.7332695
## sample estimates:
## cor
## 0.7086245
##
##
## Pearson's product-moment correlation
##
## data: df[[cor_vars[i]]] and df[[cor_vars[j]]]
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7709644 0.8094376
## sample estimates:
## cor
## 0.7909816
##
##
## Pearson's product-moment correlation
##
## data: df[[cor_vars[i]]] and df[[cor_vars[j]]]
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5587023 0.6252869
## sample estimates:
## cor
## 0.5930074
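The intervals printed above are cor.test's default 95% intervals; the 80% intervals the assignment asks for can be requested directly by setting conf.level. A minimal sketch:
# Sketch: 80% confidence intervals for each pairwise correlation
for (i in 1:(length(cor_vars) - 1)) {
  for (j in (i + 1):length(cor_vars)) {
    ci <- cor.test(df[[cor_vars[i]]], df[[cor_vars[j]]],
                   conf.level = 0.80)$conf.int
    cat(cor_vars[i], "vs", cor_vars[j], ":",
        round(ci[1], 4), "to", round(ci[2], 4), "\n")
  }
}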
The hypothesis tests examine whether each pairwise correlation is significantly different from zero. The assignment's 80% confidence level corresponds to a significance level of α = 0.2; note that cor.test reports 95% intervals by default, which is what appears above, while the 80% intervals are obtained by setting conf.level explicitly (see the sketch above). Each test reports the correlation coefficient, p-value, and confidence interval; since every p-value is below 2.2e-16, we reject the null hypothesis of zero correlation for all three pairs.
Regarding familywise error, conducting multiple hypothesis tests increases the chance of making at least one Type I error (false positive). Using a lenient significance level of 0.2 does not adjust for this; it makes it worse: with three independent tests at α = 0.2, the familywise error rate under true nulls could be as high as 1 - 0.8^3 ≈ 0.49. In this particular analysis, however, there is little cause for worry, because only three tests were run and all three p-values are many orders of magnitude below any reasonable threshold.
Familywise error can still be a concern, especially if a lower significance level is desired or if additional hypotheses are tested in the same analysis. To further address familywise error, one could employ methods such as Bonferroni correction or control the false discovery rate (FDR) using procedures like the Benjamini-Hochberg procedure. These methods help mitigate the potential inflation of Type I error rates when conducting multiple hypothesis tests.
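Both corrections are one-liners on the collected p-values; a minimal sketch:
# Sketch: adjust the three pairwise p-values for multiple testing
p_values <- c(
  SalePrice_GrLivArea   = cor.test(df$SalePrice, df$GrLivArea)$p.value,
  SalePrice_OverallQual = cor.test(df$SalePrice, df$OverallQual)$p.value,
  GrLivArea_OverallQual = cor.test(df$GrLivArea, df$OverallQual)$p.value
)
p.adjust(p_values, method = "bonferroni") # Bonferroni correction
p.adjust(p_values, method = "BH")         # Benjamini-Hochberg FDR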
# Convert the correlation matrix to a sparse matrix
sparse_cor_matrix <- as(Matrix(cor_matrix), "CsparseMatrix")
# Invert the correlation matrix (precision matrix)
precision_matrix <- solve(sparse_cor_matrix)
# Multiply the correlation matrix by the precision matrix
cor_by_precision <- sparse_cor_matrix %*% precision_matrix
# Multiply the precision matrix by the correlation matrix
precision_by_cor <- precision_matrix %*% sparse_cor_matrix
# Conduct LU decomposition
lu_decomposition <- lu(sparse_cor_matrix)
We convert the correlation matrix (cor_matrix) to a Matrix-class sparse matrix using the as() function from the Matrix package. Base R has no lu() function; the lu() method used below is supplied by the Matrix package, and converting once up front lets the same object flow through the inversion and multiplication steps. (Matrix::lu() also accepts dense matrices, so the sparse representation is a convenience rather than a requirement here.)
We then invert the correlation matrix (now a sparse matrix) using the solve() function, which calculates the inverse of a matrix.
Next, we multiply the correlation matrix by the precision matrix (cor_by_precision) and the precision matrix by the correlation matrix (precision_by_cor) using the %*% operator. Because the precision matrix is the inverse of the correlation matrix, both products should equal the 3×3 identity matrix up to floating-point rounding; computing them in both orders confirms this.
Finally, we perform LU decomposition on the sparse correlation matrix using the lu() function from the Matrix package. The LU decomposition breaks down the matrix into lower and upper triangular matrices, which can be used for various calculations and solving systems of linear equations.
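A few sanity checks tie these results together (a minimal sketch): both products should be numerically the identity, the precision-matrix diagonal holds the variance inflation factors, and Matrix::expand() recovers the triangular factors.
# Sketch: verify both products are (numerically) the identity matrix
round(as.matrix(cor_by_precision), 10)
round(as.matrix(precision_by_cor), 10)
# Variance inflation factors sit on the precision-matrix diagonal
diag(as.matrix(precision_matrix))
# Recover the triangular factors (and permutations) of the LU decomposition
expand(lu_decomposition)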
# Check skewness of the variable
skewness(df$LotArea)
## [1] 12.18262
# Shift the variable if necessary
if (min(df$LotArea) <= 0) {
df$LotArea <- df$LotArea - min(df$LotArea) + 1
}
# Fit exponential distribution
fit <- fitdistr(df$LotArea, "exponential")
# Extract the estimated lambda parameter
lambda <- fit$estimate["rate"]
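For the exponential distribution, the maximum-likelihood estimate of the rate is 1/mean, which gives a quick sanity check on fitdistr's output; a minimal sketch:
# Sketch: fitdistr's rate estimate should equal 1 / sample mean
lambda
1 / mean(df$LotArea) # should match lambda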
Next, we’ll generate 1000 samples from the fitted exponential distribution using the estimated lambda value.
# Generate 1000 samples from exponential distribution
samples <- rexp(1000, lambda)
To compare the histograms of the original variable and the generated samples, we can plot them using the hist function.
# Plot histograms
par(mfrow = c(1, 2))
hist(df$LotArea, main = "Original Variable: LotArea", xlab = "LotArea")
hist(samples, main = "Exponential Distribution", xlab = "Samples")
# Calculate percentiles using CDF
percentile_5th <- qexp(0.05, rate = lambda)
percentile_95th <- qexp(0.95, rate = lambda)
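Because the exponential CDF is F(x) = 1 - e^(-λx), the quantile function has the closed form -log(1 - p)/λ, so qexp can be verified by hand; a minimal sketch:
# Sketch: closed-form check of the exponential quantiles
c(qexp = percentile_5th, closed_form = -log(0.95) / lambda)
c(qexp = percentile_95th, closed_form = -log(0.05) / lambda)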
To generate a 95% confidence interval from the empirical data, assuming normality, we can use the t.test function.
# Calculate 95% confidence interval assuming normality
conf_interval <- t.test(df$LotArea)$conf.int
Finally, we can calculate the empirical 5th and 95th percentiles of the original data using the quantile function.
# Calculate empirical percentiles
empirical_5th <- quantile(df$LotArea, 0.05)
empirical_95th <- quantile(df$LotArea, 0.95)
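None of these quantities are printed above; collecting them side by side supports the discussion. A minimal sketch:
# Sketch: compare theoretical (exponential) and empirical quantities
c(exp_5th = percentile_5th, exp_95th = percentile_95th)
conf_interval # 95% CI for the mean, assuming normality
c(empirical_5th, empirical_95th) # empirical percentiles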
For the multiple regression model, we’ll use the variable “SalePrice” as the response variable and select a few predictor variables from the dataset.
# Convert necessary variables to the correct data type
df$SalePrice <- as.numeric(as.character(df$SalePrice))
df$OverallQual <- as.numeric(as.character(df$OverallQual))
df$GrLivArea <- as.numeric(as.character(df$GrLivArea))
df$GarageCars <- as.numeric(as.character(df$GarageCars))
df$TotalBsmtSF <- as.numeric(as.character(df$TotalBsmtSF))
df$YearBuilt <- as.numeric(as.character(df$YearBuilt))
# Select predictor variables
predictors <- c("OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF", "YearBuilt")
# Build the multiple regression model
model <- lm(SalePrice ~ ., data = df[, c("SalePrice", predictors)])
We have built a multiple regression model with “SalePrice” as the response variable and the selected predictors. Now let’s analyze the model by summarizing the results
summary(model)
##
## Call:
## lm(formula = SalePrice ~ ., data = df[, c("SalePrice", predictors)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -476135 -20008 -2706 16431 285876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.720e+05 8.485e+04 -7.920 4.69e-15 ***
## OverallQual 2.039e+04 1.156e+03 17.633 < 2e-16 ***
## GrLivArea 5.083e+01 2.564e+00 19.825 < 2e-16 ***
## GarageCars 1.451e+04 1.824e+03 7.957 3.52e-15 ***
## TotalBsmtSF 2.998e+01 2.821e+00 10.628 < 2e-16 ***
## YearBuilt 3.014e+02 4.459e+01 6.760 1.99e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38330 on 1454 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7672
## F-statistic: 962.4 on 5 and 1454 DF, p-value: < 2.2e-16
# Check for missing values in the dataset
missing_values <- colSums(is.na(df))
missing_values
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
# Count the total number of missing values
total_missing <- sum(missing_values)
total_missing
## [1] 6965
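The mice package loaded above can fill these gaps via multiple imputation; a minimal sketch on three numeric columns (m = 1 for brevity, and the seed is an arbitrary choice):
# Sketch: impute numeric gaps with predictive mean matching (mice)
imp <- mice(df[, c("LotFrontage", "MasVnrArea", "GarageYrBlt")],
            m = 1, method = "pmm", seed = 123, printFlag = FALSE)
imputed <- complete(imp)
colSums(is.na(imputed)) # should be all zero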
# Repeat the exercise with a second variable (note: OverallQual is only
# mildly right-skewed, with skewness 0.22 in the table above)
skewed_variable <- "OverallQual"
# Shift the variable so that the minimum value is above zero if necessary
df$ShiftedVar <- df[, skewed_variable] - min(df[, skewed_variable]) + 1
# Fit an exponential distribution using fitdistr
fit <- fitdistr(df$ShiftedVar, "exponential")
# Extract the optimal value of lambda (fitdistr returns the rate directly)
optimal_lambda <- fit$estimate["rate"]
# Generate 1000 samples from the exponential distribution
# (the estimate is already the rate, so it is passed to rexp() as-is)
exponential_samples <- rexp(1000, rate = optimal_lambda)
# Plot a histogram of the original variable and the exponential samples
hist(df[, skewed_variable], breaks = 30, col = "lightblue", main = "Original Variable", xlab = skewed_variable)
hist(exponential_samples, breaks = 30, col = "lightgreen", add = TRUE)
legend("topright", c("Original", "Exponential"), fill = c("lightblue", "lightgreen"))
# Calculate the 5th and 95th percentiles using the exponential CDF
percentile_5th <- qexp(0.05, rate = optimal_lambda)
percentile_95th <- qexp(0.95, rate = optimal_lambda)
# Generate a 95% confidence interval assuming normality
confidence_interval <- t.test(df$ShiftedVar)$conf.int
# Calculate the empirical 5th and 95th percentiles of the data
empirical_percentile_5th <- quantile(df$ShiftedVar, probs = 0.05)
empirical_percentile_95th <- quantile(df$ShiftedVar, probs = 0.95)
This code will shift the selected variable, fit an exponential distribution using the fitdistr function, generate 1000 samples from the exponential distribution, plot a histogram comparing the original variable and the exponential samples, and calculate the 5th and 95th percentiles using the exponential cumulative distribution function (CDF). It will also generate a 95% confidence interval assuming normality and calculate the empirical 5th and 95th percentiles of the data.
# Select the predictor variables for the regression model
predictors <- c("OverallQual", "GrLivArea", "TotalBsmtSF", "GarageCars", "YearBuilt")
# Create a new data frame with the predictor variables and the response
# (note: the -1 drops the first row, so this model is fit on 1,459 of
# the 1,460 training rows)
regression_data <- df[-1, c(predictors, "SalePrice")]
# Remove rows with missing values
regression_data <- na.omit(regression_data)
# Fit the multiple regression model
model <- lm(SalePrice ~ ., data = regression_data)
# Print the model summary
summary(model)
##
## Call:
## lm(formula = SalePrice ~ ., data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -476100 -20035 -2712 16441 285867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.725e+05 8.492e+04 -7.920 4.70e-15 ***
## OverallQual 2.039e+04 1.157e+03 17.628 < 2e-16 ***
## GrLivArea 5.084e+01 2.565e+00 19.819 < 2e-16 ***
## TotalBsmtSF 2.996e+01 2.823e+00 10.615 < 2e-16 ***
## GarageCars 1.451e+04 1.824e+03 7.952 3.65e-15 ***
## YearBuilt 3.017e+02 4.462e+01 6.761 1.98e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38350 on 1453 degrees of freedom
## Multiple R-squared: 0.7679, Adjusted R-squared: 0.7671
## F-statistic: 961.7 on 5 and 1453 DF, p-value: < 2.2e-16
Overall, the model appears to have a good fit based on the high R-squared value and significant predictor variables. However, it is important to further examine the assumptions of the regression model, such as the normality of residuals and the absence of influential outliers, to ensure the reliability of the results.
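One quick way to examine those assumptions is R's built-in diagnostic plots for lm objects; a minimal sketch:
# Sketch: residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(model)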
# Read the sample_submission file
sample_submission <- read.csv("https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/data_605/Final/house-prices-advanced-regression-techniques/sample_submission.csv")
# Create a new data frame with the Id column (Kaggle expects it named "Id")
predictions_df <- data.frame(Id = sample_submission$Id)
# Predict SalePrice with the fitted regression model
# (note: these predictions are made on the training rows in regression_data,
# which happen to number 1,459, matching the test-set Ids they are paired
# with; see the sketch below for scoring on test.csv instead)
predictions_df$SalePrice <- predict(model, newdata = regression_data)
# Write the predictions to a CSV file
write.csv(predictions_df, file = "predictions.csv", row.names = FALSE)
# Verify the number of rows in the predictions file
num_rows <- nrow(predictions_df)
print(num_rows) # Should be 1459
## [1] 1459
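Note that the predictions above were generated from the training rows (which happen to number 1,459, the same as the test set). For a valid leaderboard submission the model should be scored on test.csv; a minimal sketch, assuming test.csv sits alongside train.csv in the same GitHub folder:
# Sketch: score the model on the held-out test set
test_url <- "https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/data_605/Final/house-prices-advanced-regression-techniques/test.csv"
test_df <- read.csv(test_url)
# Fill any missing predictor values with the training medians
for (p in predictors) {
  test_df[[p]][is.na(test_df[[p]])] <- median(df[[p]], na.rm = TRUE)
}
submission <- data.frame(Id = test_df$Id,
                         SalePrice = predict(model, newdata = test_df))
write.csv(submission, file = "predictions.csv", row.names = FALSE)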