Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.

# Load the required libraries
library(ggplot2)
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(Matrix)
library(MASS)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(glmnet)
## Loaded glmnet 4.1-7
library(missForest)
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute

Reading the data

First, I downloaded the CSV files and uploaded them to GitHub so that the training data set is available online.

# Read the data from the URL
data_url <- "https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/data_605/Final/house-prices-advanced-regression-techniques/train.csv"
df <- read.csv(data_url)

a. Descriptive and Inferential Statistics

# Univariate descriptive statistics
univariate_stats <- describe(df)
print(univariate_stats)
##                vars    n      mean       sd   median   trimmed      mad   min
## Id                1 1460    730.50   421.61    730.5    730.50   541.15     1
## MSSubClass        2 1460     56.90    42.30     50.0     49.15    44.48    20
## MSZoning*         3 1460      4.03     0.63      4.0      4.06     0.00     1
## LotFrontage       4 1201     70.05    24.28     69.0     68.94    16.31    21
## LotArea           5 1460  10516.83  9981.26   9478.5   9563.28  2962.23  1300
## Street*           6 1460      2.00     0.06      2.0      2.00     0.00     1
## Alley*            7   91      1.45     0.50      1.0      1.44     0.00     1
## LotShape*         8 1460      2.94     1.41      4.0      3.05     0.00     1
## LandContour*      9 1460      3.78     0.71      4.0      4.00     0.00     1
## Utilities*       10 1460      1.00     0.03      1.0      1.00     0.00     1
## LotConfig*       11 1460      4.02     1.62      5.0      4.27     0.00     1
## LandSlope*       12 1460      1.06     0.28      1.0      1.00     0.00     1
## Neighborhood*    13 1460     13.15     5.89     13.0     13.11     7.41     1
## Condition1*      14 1460      3.03     0.87      3.0      3.00     0.00     1
## Condition2*      15 1460      3.01     0.26      3.0      3.00     0.00     1
## BldgType*        16 1460      1.49     1.20      1.0      1.14     0.00     1
## HouseStyle*      17 1460      4.04     1.91      3.0      4.03     1.48     1
## OverallQual      18 1460      6.10     1.38      6.0      6.08     1.48     1
## OverallCond      19 1460      5.58     1.11      5.0      5.48     0.00     1
## YearBuilt        20 1460   1971.27    30.20   1973.0   1974.13    37.06  1872
## YearRemodAdd     21 1460   1984.87    20.65   1994.0   1986.37    19.27  1950
## RoofStyle*       22 1460      2.41     0.83      2.0      2.26     0.00     1
## RoofMatl*        23 1460      2.08     0.60      2.0      2.00     0.00     1
## Exterior1st*     24 1460     10.62     3.20     13.0     10.93     1.48     1
## Exterior2nd*     25 1460     11.34     3.54     14.0     11.65     2.97     1
## MasVnrType*      26 1452      2.76     0.62      3.0      2.73     0.00     1
## MasVnrArea       27 1452    103.69   181.07      0.0     63.15     0.00     0
## ExterQual*       28 1460      3.54     0.69      4.0      3.65     0.00     1
## ExterCond*       29 1460      4.73     0.73      5.0      4.95     0.00     1
## Foundation*      30 1460      2.40     0.72      2.0      2.46     1.48     1
## BsmtQual*        31 1423      3.26     0.87      3.0      3.43     1.48     1
## BsmtCond*        32 1423      3.81     0.66      4.0      4.00     0.00     1
## BsmtExposure*    33 1422      3.27     1.15      4.0      3.46     0.00     1
## BsmtFinType1*    34 1423      3.73     1.83      3.0      3.79     2.97     1
## BsmtFinSF1       35 1460    443.64   456.10    383.5    386.08   568.58     0
## BsmtFinType2*    36 1422      5.71     0.94      6.0      5.98     0.00     1
## BsmtFinSF2       37 1460     46.55   161.32      0.0      1.38     0.00     0
## BsmtUnfSF        38 1460    567.24   441.87    477.5    519.29   426.99     0
## TotalBsmtSF      39 1460   1057.43   438.71    991.5   1036.70   347.67     0
## Heating*         40 1460      2.04     0.30      2.0      2.00     0.00     1
## HeatingQC*       41 1460      2.54     1.74      1.0      2.42     0.00     1
## CentralAir*      42 1460      1.93     0.25      2.0      2.00     0.00     1
## Electrical*      43 1459      4.68     1.05      5.0      5.00     0.00     1
## X1stFlrSF        44 1460   1162.63   386.59   1087.0   1129.99   347.67   334
## X2ndFlrSF        45 1460    346.99   436.53      0.0    285.36     0.00     0
## LowQualFinSF     46 1460      5.84    48.62      0.0      0.00     0.00     0
## GrLivArea        47 1460   1515.46   525.48   1464.0   1467.67   483.33   334
## BsmtFullBath     48 1460      0.43     0.52      0.0      0.39     0.00     0
## BsmtHalfBath     49 1460      0.06     0.24      0.0      0.00     0.00     0
## FullBath         50 1460      1.57     0.55      2.0      1.56     0.00     0
## HalfBath         51 1460      0.38     0.50      0.0      0.34     0.00     0
## BedroomAbvGr     52 1460      2.87     0.82      3.0      2.85     0.00     0
## KitchenAbvGr     53 1460      1.05     0.22      1.0      1.00     0.00     0
## KitchenQual*     54 1460      3.34     0.83      4.0      3.50     0.00     1
## TotRmsAbvGrd     55 1460      6.52     1.63      6.0      6.41     1.48     2
## Functional*      56 1460      6.75     0.98      7.0      7.00     0.00     1
## Fireplaces       57 1460      0.61     0.64      1.0      0.53     1.48     0
## FireplaceQu*     58  770      3.73     1.13      3.0      3.80     1.48     1
## GarageType*      59 1379      3.28     1.79      2.0      3.11     0.00     1
## GarageYrBlt      60 1379   1978.51    24.69   1980.0   1981.07    31.13  1900
## GarageFinish*    61 1379      2.18     0.81      2.0      2.23     1.48     1
## GarageCars       62 1460      1.77     0.75      2.0      1.77     0.00     0
## GarageArea       63 1460    472.98   213.80    480.0    469.81   177.91     0
## GarageQual*      64 1379      4.86     0.61      5.0      5.00     0.00     1
## GarageCond*      65 1379      4.90     0.52      5.0      5.00     0.00     1
## PavedDrive*      66 1460      2.86     0.50      3.0      3.00     0.00     1
## WoodDeckSF       67 1460     94.24   125.34      0.0     71.76     0.00     0
## OpenPorchSF      68 1460     46.66    66.26     25.0     33.23    37.06     0
## EnclosedPorch    69 1460     21.95    61.12      0.0      3.87     0.00     0
## X3SsnPorch       70 1460      3.41    29.32      0.0      0.00     0.00     0
## ScreenPorch      71 1460     15.06    55.76      0.0      0.00     0.00     0
## PoolArea         72 1460      2.76    40.18      0.0      0.00     0.00     0
## PoolQC*          73    7      2.14     0.90      2.0      2.14     1.48     1
## Fence*           74  281      2.43     0.86      3.0      2.48     0.00     1
## MiscFeature*     75   54      2.91     0.45      3.0      3.00     0.00     1
## MiscVal          76 1460     43.49   496.12      0.0      0.00     0.00     0
## MoSold           77 1460      6.32     2.70      6.0      6.25     2.97     1
## YrSold           78 1460   2007.82     1.33   2008.0   2007.77     1.48  2006
## SaleType*        79 1460      8.51     1.56      9.0      8.92     0.00     1
## SaleCondition*   80 1460      4.77     1.10      5.0      5.00     0.00     1
## SalePrice        81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
##                   max  range   skew kurtosis      se
## Id               1460   1459   0.00    -1.20   11.03
## MSSubClass        190    170   1.40     1.56    1.11
## MSZoning*           5      4  -1.73     6.25    0.02
## LotFrontage       313    292   2.16    17.34    0.70
## LotArea        215245 213945  12.18   202.26  261.22
## Street*             2      1 -15.49   238.01    0.00
## Alley*              2      1   0.20    -1.98    0.05
## LotShape*           4      3  -0.61    -1.60    0.04
## LandContour*        4      3  -3.16     8.65    0.02
## Utilities*          2      1  38.13  1453.00    0.00
## LotConfig*          5      4  -1.13    -0.59    0.04
## LandSlope*          3      2   4.80    24.47    0.01
## Neighborhood*      25     24   0.02    -1.06    0.15
## Condition1*         9      8   3.01    16.34    0.02
## Condition2*         8      7  13.14   247.54    0.01
## BldgType*           5      4   2.24     3.41    0.03
## HouseStyle*         8      7   0.31    -0.96    0.05
## OverallQual        10      9   0.22     0.09    0.04
## OverallCond         9      8   0.69     1.09    0.03
## YearBuilt        2010    138  -0.61    -0.45    0.79
## YearRemodAdd     2010     60  -0.50    -1.27    0.54
## RoofStyle*          6      5   1.47     0.61    0.02
## RoofMatl*           8      7   8.09    66.28    0.02
## Exterior1st*       15     14  -0.72    -0.37    0.08
## Exterior2nd*       16     15  -0.69    -0.52    0.09
## MasVnrType*         4      3  -0.07    -0.13    0.02
## MasVnrArea       1600   1600   2.66    10.03    4.75
## ExterQual*          4      3  -1.83     3.86    0.02
## ExterCond*          5      4  -2.56     5.29    0.02
## Foundation*         6      5   0.09     1.02    0.02
## BsmtQual*           4      3  -1.31     1.27    0.02
## BsmtCond*           4      3  -3.39    10.14    0.02
## BsmtExposure*       4      3  -1.15    -0.39    0.03
## BsmtFinType1*       6      5  -0.02    -1.39    0.05
## BsmtFinSF1       5644   5644   1.68    11.06   11.94
## BsmtFinType2*       6      5  -3.56    12.32    0.02
## BsmtFinSF2       1474   1474   4.25    20.01    4.22
## BsmtUnfSF        2336   2336   0.92     0.46   11.56
## TotalBsmtSF      6110   6110   1.52    13.18   11.48
## Heating*            6      5   9.83   110.98    0.01
## HeatingQC*          5      4   0.48    -1.51    0.05
## CentralAir*         2      1  -3.52    10.42    0.01
## Electrical*         5      4  -3.06     7.49    0.03
## X1stFlrSF        4692   4358   1.37     5.71   10.12
## X2ndFlrSF        2065   2065   0.81    -0.56   11.42
## LowQualFinSF      572    572   8.99    82.83    1.27
## GrLivArea        5642   5308   1.36     4.86   13.75
## BsmtFullBath        3      3   0.59    -0.84    0.01
## BsmtHalfBath        2      2   4.09    16.31    0.01
## FullBath            3      3   0.04    -0.86    0.01
## HalfBath            2      2   0.67    -1.08    0.01
## BedroomAbvGr        8      8   0.21     2.21    0.02
## KitchenAbvGr        3      3   4.48    21.42    0.01
## KitchenQual*        4      3  -1.42     1.72    0.02
## TotRmsAbvGrd       14     12   0.67     0.87    0.04
## Functional*         7      6  -4.08    16.37    0.03
## Fireplaces          3      3   0.65    -0.22    0.02
## FireplaceQu*        5      4  -0.16    -0.98    0.04
## GarageType*         6      5   0.76    -1.30    0.05
## GarageYrBlt      2010    110  -0.65    -0.42    0.66
## GarageFinish*       3      2  -0.35    -1.41    0.02
## GarageCars          4      4  -0.34     0.21    0.02
## GarageArea       1418   1418   0.18     0.90    5.60
## GarageQual*         5      4  -4.43    18.25    0.02
## GarageCond*         5      4  -5.28    26.77    0.01
## PavedDrive*         3      2  -3.30     9.22    0.01
## WoodDeckSF        857    857   1.54     2.97    3.28
## OpenPorchSF       547    547   2.36     8.44    1.73
## EnclosedPorch     552    552   3.08    10.37    1.60
## X3SsnPorch        508    508  10.28   123.06    0.77
## ScreenPorch       480    480   4.11    18.34    1.46
## PoolArea          738    738  14.80   222.19    1.05
## PoolQC*             3      2  -0.22    -1.90    0.34
## Fence*              4      3  -0.57    -0.88    0.05
## MiscFeature*        4      3  -2.93    10.71    0.06
## MiscVal         15500  15500  24.43   697.64   12.98
## MoSold             12     11   0.21    -0.41    0.07
## YrSold           2010      4   0.10    -1.19    0.03
## SaleType*           9      8  -3.83    14.57    0.04
## SaleCondition*      6      5  -2.74     6.82    0.03
## SalePrice      755000 720100   1.88     6.50 2079.11

The univariate descriptive statistics summarize each variable in the training set, reporting the sample size (n), mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skewness, kurtosis, and standard error. These measures help us understand the center, spread, and shape of each variable's distribution. Variables marked with an asterisk are categorical; describe() codes them numerically, so their statistics should be read with caution.
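
The prompt also asks for appropriate plots. As one example, here is a minimal sketch of a histogram of the dependent variable SalePrice using ggplot2 (loaded above); the bin count of 50 is an arbitrary choice:

# Histogram of the dependent variable; SalePrice is visibly right-skewed (skew ~ 1.88 above)
ggplot(df, aes(x = SalePrice)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white") +
  labs(title = "Distribution of SalePrice", x = "Sale Price", y = "Count")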

# Ensure the selected variables are numeric (they are read from the CSV as integers)
df$SalePrice <- as.numeric(as.character(df$SalePrice))
df$GrLivArea <- as.numeric(as.character(df$GrLivArea))
df$OverallQual <- as.numeric(as.character(df$OverallQual))

# Select the variables for the scatterplot matrix
scatterplot_vars <- c("GrLivArea", "OverallQual", "SalePrice")

# Create the scatterplot matrix (pairs() draws the plot directly)
pairs(df[, scatterplot_vars], lower.panel = NULL)

The scatterplot matrix allows us to visualize the relationships between two independent variables (GrLivArea and OverallQual) and the dependent variable (SalePrice). By examining the scatterplots, we can identify any linear or nonlinear associations between the variables.

# Correlation matrix
cor_vars <- c("SalePrice", "GrLivArea", "OverallQual")

# Create the correlation matrix
cor_matrix <- rcorr(as.matrix(df[, cor_vars]))$r

print(cor_matrix)
##             SalePrice GrLivArea OverallQual
## SalePrice   1.0000000 0.7086245   0.7909816
## GrLivArea   0.7086245 1.0000000   0.5930074
## OverallQual 0.7909816 0.5930074   1.0000000

The correlation matrix shows the pairwise correlations between SalePrice, GrLivArea, and OverallQual. The values range from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no correlation. The correlation matrix helps us understand the strength and direction of the relationships between variables.

# Test hypotheses for each pairwise correlation
alpha <- 0.2  # Significance level (1 - confidence level)
n <- nrow(df)  # Number of observations

# Calculate the critical t-value for an 80% confidence level
critical_t <- qt(1 - alpha / 2, df = n - 2)

# Approximate half-width (margin of error) of an 80% confidence interval for each
# correlation, using the standard error of r: sqrt((1 - r^2) / (n - 2))
conf_interval <- critical_t * sqrt((1 - cor_matrix^2) / (n - 2))
print(conf_interval)
##              SalePrice  GrLivArea OverallQual
## SalePrice   0.00000000 0.02369212  0.02054433
## GrLivArea   0.02369212 0.00000000  0.02703686
## OverallQual 0.02054433 0.02703686  0.00000000
# Test hypotheses for each pairwise correlation
for (i in 1:(length(cor_vars) - 1)) {
  for (j in (i + 1):length(cor_vars)) {
    hypothesis_results <- cor.test(df[[cor_vars[i]]], df[[cor_vars[j]]])
    print(hypothesis_results)
  }
}
## 
##  Pearson's product-moment correlation
## 
## data:  df[[cor_vars[i]]] and df[[cor_vars[j]]]
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6821200 0.7332695
## sample estimates:
##       cor 
## 0.7086245 
## 
## 
##  Pearson's product-moment correlation
## 
## data:  df[[cor_vars[i]]] and df[[cor_vars[j]]]
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7709644 0.8094376
## sample estimates:
##       cor 
## 0.7909816 
## 
## 
##  Pearson's product-moment correlation
## 
## data:  df[[cor_vars[i]]] and df[[cor_vars[j]]]
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5587023 0.6252869
## sample estimates:
##       cor 
## 0.5930074
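
Note that cor.test() defaults to a 95% confidence level; the 80% intervals the prompt asks for can be requested explicitly. A minimal sketch, reusing cor_vars and df from above:

# Pairwise correlation tests with 80% confidence intervals
for (i in 1:(length(cor_vars) - 1)) {
  for (j in (i + 1):length(cor_vars)) {
    test_80 <- cor.test(df[[cor_vars[i]]], df[[cor_vars[j]]], conf.level = 0.80)
    cat(cor_vars[i], "vs", cor_vars[j], ": 80% CI =",
        round(test_80$conf.int, 4), ", p-value =", format.pval(test_80$p.value), "\n")
  }
}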

The hypothesis tests for correlation examine whether the correlations between each pairwise set of variables are significantly different from zero. Note that cor.test() reports a 95% confidence interval by default; the 80% intervals requested by the prompt are produced in the sketch above and approximated by the margin-of-error calculation. Each test reports the correlation coefficient, p-value, and confidence interval. All three p-values are below 2.2e-16, far smaller than any conventional significance level, so we reject the null hypothesis of zero correlation for every pair.

Regarding familywise error: conducting multiple hypothesis tests increases the chance of making at least one Type I error (false positive) across the family of tests. Here we perform three pairwise tests, so the familywise error rate is higher than the per-test error rate, and using a more lenient level such as 0.2 inflates it further rather than adjusting for it. In this particular analysis the concern is minor, because all three p-values are below 2.2e-16 and would survive any reasonable correction, but the results should still be interpreted with the multiplicity in mind.

Familywise error can still be a concern, especially if a lower significance level is desired or if additional hypotheses are tested in the same analysis. To further address familywise error, one could employ methods such as Bonferroni correction or control the false discovery rate (FDR) using procedures like the Benjamini-Hochberg procedure. These methods help mitigate the potential inflation of Type I error rates when conducting multiple hypothesis tests.
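
A minimal sketch of such an adjustment, applying both the Bonferroni and Benjamini-Hochberg corrections to the three p-values collected from cor.test() (reusing cor_vars and df from above):

# Collect the three pairwise p-values and adjust them for multiple testing
pvals <- c()
pair_labels <- c()
for (i in 1:(length(cor_vars) - 1)) {
  for (j in (i + 1):length(cor_vars)) {
    pvals <- c(pvals, cor.test(df[[cor_vars[i]]], df[[cor_vars[j]]])$p.value)
    pair_labels <- c(pair_labels, paste(cor_vars[i], cor_vars[j], sep = " - "))
  }
}
data.frame(pair = pair_labels,
           raw_p = pvals,
           bonferroni = p.adjust(pvals, method = "bonferroni"),
           BH = p.adjust(pvals, method = "BH"))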

b. Linear Algebra and Correlation

# Convert the correlation matrix to a sparse matrix
sparse_cor_matrix <- as(Matrix(cor_matrix), "CsparseMatrix")

# Invert the correlation matrix (precision matrix)
precision_matrix <- solve(sparse_cor_matrix)

# Multiply the correlation matrix by the precision matrix
cor_by_precision <- sparse_cor_matrix %*% precision_matrix

# Multiply the precision matrix by the correlation matrix
precision_by_cor <- precision_matrix %*% sparse_cor_matrix

# Conduct LU decomposition
lu_decomposition <- lu(sparse_cor_matrix)

We convert the correlation matrix (cor_matrix) to a sparse Matrix-class object with as() from the Matrix package. Strictly speaking this is not required, since solve() and Matrix::lu() also accept dense matrices, but it keeps all subsequent operations within the Matrix package's methods.

We then invert the correlation matrix using the solve() function. This inverse is the precision matrix, whose diagonal entries are the variance inflation factors.

Next, we multiply the correlation matrix by the precision matrix (cor_by_precision) and the precision matrix by the correlation matrix (precision_by_cor). Because the precision matrix is the inverse of the correlation matrix, both products should equal the 3×3 identity matrix, up to floating-point rounding error.

Finally, we perform LU decomposition on the sparse correlation matrix using the lu() function from the Matrix package. The LU decomposition breaks down the matrix into lower and upper triangular matrices, which can be used for various calculations and solving systems of linear equations.
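
A minimal sketch to confirm these properties: the diagonal of the precision matrix holds the variance inflation factors, and both products should be (numerically) the identity matrix:

# Variance inflation factors sit on the diagonal of the precision matrix
diag(precision_matrix)

# Both products should be the 3x3 identity matrix, up to rounding
round(as.matrix(cor_by_precision), 10)
round(as.matrix(precision_by_cor), 10)

# Inspect the LU factorization object
str(lu_decomposition)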

# Check skewness of the variable
skewness(df$LotArea)
## [1] 12.18262
# Shift the variable if necessary
if (min(df$LotArea) <= 0) {
  df$LotArea <- df$LotArea - min(df$LotArea) + 1
}
# Fit exponential distribution
fit <- fitdistr(df$LotArea, "exponential")

# Extract the estimated lambda parameter
lambda <- fit$estimate["rate"]

Next, we’ll generate 1000 samples from the fitted exponential distribution using the estimated lambda value.

# Generate 1000 samples from exponential distribution
samples <- rexp(1000, lambda)

To compare the histograms of the original variable and the generated samples, we can plot them using the hist function.

# Plot histograms
par(mfrow = c(1, 2))
hist(df$LotArea, main = "Original Variable: LotArea", xlab = "LotArea")
hist(samples, main = "Exponential Distribution", xlab = "Samples")

# Calculate percentiles using CDF
percentile_5th <- qexp(0.05, rate = lambda)
percentile_95th <- qexp(0.95, rate = lambda)

To generate a 95% confidence interval from the empirical data, assuming normality, we can use the t.test function.

# Calculate 95% confidence interval assuming normality
conf_interval <- t.test(df$LotArea)$conf.int

Finally, we can calculate the empirical 5th and 95th percentiles of the original data using the quantile function.

# Calculate empirical percentiles
empirical_5th <- quantile(df$LotArea, 0.05)
empirical_95th <- quantile(df$LotArea, 0.95)
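
The quantities above are computed but not printed; a short sketch to display them side by side for discussion (the exact values depend on the fitted rate λ). Comparing the model-based and empirical percentiles gives a sense of how well the exponential distribution captures the tails of LotArea.

# Compare the exponential-model percentiles with the empirical ones,
# along with the normal-theory 95% CI for the mean of LotArea
cat("Fitted rate (lambda):", lambda, "\n")
cat("Exponential 5th / 95th percentiles:", percentile_5th, "/", percentile_95th, "\n")
cat("Empirical   5th / 95th percentiles:", empirical_5th, "/", empirical_95th, "\n")
cat("95% CI for the mean (normality assumption):", conf_interval, "\n")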

For the multiple regression model, we’ll use the variable “SalePrice” as the response variable and select a few predictor variables from the dataset.

# Convert necessary variables to the correct data type
df$SalePrice <- as.numeric(as.character(df$SalePrice))
df$OverallQual <- as.numeric(as.character(df$OverallQual))
df$GrLivArea <- as.numeric(as.character(df$GrLivArea))
df$GarageCars <- as.numeric(as.character(df$GarageCars))
df$TotalBsmtSF <- as.numeric(as.character(df$TotalBsmtSF))
df$YearBuilt <- as.numeric(as.character(df$YearBuilt))

# Select predictor variables
predictors <- c("OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF", "YearBuilt")

# Build the multiple regression model
model <- lm(SalePrice ~ ., data = df[, c("SalePrice", predictors)])

We have built a multiple regression model with "SalePrice" as the response variable and the selected predictors. Next, we summarize the model to analyze the results.

summary(model)
## 
## Call:
## lm(formula = SalePrice ~ ., data = df[, c("SalePrice", predictors)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -476135  -20008   -2706   16431  285876 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.720e+05  8.485e+04  -7.920 4.69e-15 ***
## OverallQual  2.039e+04  1.156e+03  17.633  < 2e-16 ***
## GrLivArea    5.083e+01  2.564e+00  19.825  < 2e-16 ***
## GarageCars   1.451e+04  1.824e+03   7.957 3.52e-15 ***
## TotalBsmtSF  2.998e+01  2.821e+00  10.628  < 2e-16 ***
## YearBuilt    3.014e+02  4.459e+01   6.760 1.99e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38330 on 1454 degrees of freedom
## Multiple R-squared:  0.768,  Adjusted R-squared:  0.7672 
## F-statistic: 962.4 on 5 and 1454 DF,  p-value: < 2.2e-16
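All five predictors are highly significant and the model explains roughly 77% of the variance in SalePrice (adjusted R-squared of about 0.767). As an optional check of out-of-sample performance, here is a minimal 10-fold cross-validation sketch of the same specification with caret (loaded above); the seed and fold count are arbitrary choices:

# 10-fold cross-validation of the same five-predictor linear model
set.seed(605)
cv_model <- train(SalePrice ~ .,
                  data = df[, c("SalePrice", predictors)],
                  method = "lm",
                  trControl = trainControl(method = "cv", number = 10))
cv_model  # reports cross-validated RMSE, R-squared, and MAE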
# Check for missing values in the dataset
missing_values <- colSums(is.na(df))
missing_values
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0           259             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1369             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             8             8             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            37            37            38            37             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            38             0             0             0             0 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0           690            81            81 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            81             0             0            81            81 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1453          1179          1406 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0
# Count the total number of missing values
total_missing <- sum(missing_values)
total_missing
## [1] 6965
# Repeat the exercise with a second variable (note: OverallQual is only mildly right-skewed)
skewed_variable <- "OverallQual"
# Shift the variable so that the minimum value is above zero if necessary
df$ShiftedVar <- df[, skewed_variable] - min(df[, skewed_variable]) + 1

# Fit an exponential distribution using fitdistr
fit <- fitdistr(df$ShiftedVar, "exponential")

# Extract the optimal value of lambda (l)
optimal_lambda <- fit$estimate

# Generate 1000 samples from the exponential distribution
# (fitdistr's estimate is the rate parameter lambda, so pass it directly to rexp)
exponential_samples <- rexp(1000, rate = optimal_lambda)

# Plot overlaid histograms of the original variable and the exponential samples,
# sharing an x-range so both distributions are visible on the same axes
hist(df[, skewed_variable], breaks = 30, col = "lightblue",
     xlim = range(c(df[, skewed_variable], exponential_samples)),
     main = "Original Variable vs. Exponential Samples", xlab = skewed_variable)
hist(exponential_samples, breaks = 30, col = "lightgreen", add = TRUE)
legend("topright", c("Original", "Exponential"), fill = c("lightblue", "lightgreen"))

# Calculate the 5th and 95th percentiles from the exponential CDF (via its inverse, qexp)
percentile_5th <- qexp(0.05, rate = optimal_lambda)
percentile_95th <- qexp(0.95, rate = optimal_lambda)

# Generate a 95% confidence interval assuming normality
confidence_interval <- t.test(df$ShiftedVar)$conf.int

# Calculate the empirical 5th and 95th percentiles of the data
empirical_percentile_5th <- quantile(df$ShiftedVar, probs = 0.05)
empirical_percentile_95th <- quantile(df$ShiftedVar, probs = 0.95)

This code will shift the selected variable, fit an exponential distribution using the fitdistr function, generate 1000 samples from the exponential distribution, plot a histogram comparing the original variable and the exponential samples, and calculate the 5th and 95th percentiles using the exponential cumulative distribution function (CDF). It will also generate a 95% confidence interval assuming normality and calculate the empirical 5th and 95th percentiles of the data.

# Select the predictor variables for the regression model
predictors <- c("OverallQual", "GrLivArea", "TotalBsmtSF", "GarageCars", "YearBuilt")

# Create a new data frame with the predictor variables and the response variable
# (dropping the first row leaves 1459 rows, matching the 1459-row submission template used below)
regression_data <- df[-1, c(predictors, "SalePrice")]


# Remove rows with missing values
regression_data <- na.omit(regression_data)

# Fit the multiple regression model
model <- lm(SalePrice ~ ., data = regression_data)

# Print the model summary
summary(model)
## 
## Call:
## lm(formula = SalePrice ~ ., data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -476100  -20035   -2712   16441  285867 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.725e+05  8.492e+04  -7.920 4.70e-15 ***
## OverallQual  2.039e+04  1.157e+03  17.628  < 2e-16 ***
## GrLivArea    5.084e+01  2.565e+00  19.819  < 2e-16 ***
## TotalBsmtSF  2.996e+01  2.823e+00  10.615  < 2e-16 ***
## GarageCars   1.451e+04  1.824e+03   7.952 3.65e-15 ***
## YearBuilt    3.017e+02  4.462e+01   6.761 1.98e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38350 on 1453 degrees of freedom
## Multiple R-squared:  0.7679, Adjusted R-squared:  0.7671 
## F-statistic: 961.7 on 5 and 1453 DF,  p-value: < 2.2e-16

Overall, the model appears to have a good fit based on the high R-squared value and significant predictor variables. However, it is important to further examine the assumptions of the regression model, such as the normality of residuals and the absence of influential outliers, to ensure the reliability of the results.
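
A minimal sketch of those diagnostics, using base R's built-in plots for lm objects (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage):

# Standard diagnostic plots for the fitted regression model
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))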

# Read the sample_submission file
sample_submission <- read.csv("https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/data_605/Final/house-prices-advanced-regression-techniques/sample_submission.csv")

# Create a new data frame with only the "Id" column (Kaggle expects this exact column name)
predictions_df <- data.frame(Id = sample_submission$Id)


# Predict SalePrice with the fitted regression model
# (note: regression_data is derived from the training set, not Kaggle's test set)
predictions_df$SalePrice <- predict(model, newdata = regression_data)

# Write the predictions to a CSV file
write.csv(predictions_df, file = "predictions.csv", row.names = FALSE)

# Verify the number of rows in the predictions file
num_rows <- nrow(predictions_df)
print(num_rows)  # Should be 1459
## [1] 1459
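
Note that the predictions written above come from regression_data, i.e. training-set rows paired with test-set Ids, so the file will not score meaningfully on Kaggle. Below is a sketch of generating the submission from the competition's test.csv instead; the test-set URL is an assumption (a path parallel to the train.csv URL used earlier), and the median imputation of the predictors is a simple choice of mine, not part of the original workflow:

# Assumed URL: test.csv placed alongside train.csv in the same GitHub folder (hypothetical path)
test_url <- "https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/data_605/Final/house-prices-advanced-regression-techniques/test.csv"
test_df <- read.csv(test_url)

# The test set may contain missing values in the predictors; fill them with training medians
for (p in predictors) {
  test_df[[p]][is.na(test_df[[p]])] <- median(df[[p]], na.rm = TRUE)
}

# Predict on the actual test set and write a Kaggle-ready submission file
submission <- data.frame(Id = test_df$Id,
                         SalePrice = predict(model, newdata = test_df))
write.csv(submission, file = "submission.csv", row.names = FALSE)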