DATA 605 Final
Gregg Maloy
PART I: SETUP
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
library(readr)
library(ggplot2)
library(scales)
library(dplyr)
library(MASS)
library(tidyverse)
url <- "https://raw.githubusercontent.com/greggmaloy/Data_605/main/train.csv"
data <- read_csv(url)
#write.table(data, "clipboard", sep = "\t", row.names = FALSE)
#transforming to long for plots further in the analysis
datalong <- gather(data, key="variable", value="value", c("Id", "MSSubClass", "MSZoning", "LotFrontage", "LotArea", "Street", "Alley", "LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "MasVnrArea", "ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "Heating", "HeatingQC", "CentralAir", "Electrical", "1stFlrSF", "2ndFlrSF", "LowQualFinSF", "GrLivArea", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "KitchenQual", "TotRmsAbvGrd", "Functional", "Fireplaces", "FireplaceQu", "GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea", "GarageQual", "GarageCond", "PavedDrive", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch", "PoolArea", "PoolQC", "Fence", "MiscFeature", "MiscVal", "MoSold", "YrSold", "SaleType", "SaleCondition", "SalePrice"))
#view(datalong)
#write.table(datalong, "clipboard", sep = "\t", row.names = FALSE)
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!
Let’s make ‘LotArea’ the independent variable and plot a histogram. The
resulting histogram is skewed to the right.
p <- ggplot(data, aes(x = LotArea)) +
geom_histogram(bins = 100, fill = "yellow", color = "gray") +
labs(title = "Lot Area", x = "LotArea", y = "Frequency") +
scale_x_continuous(labels = label_number(big.mark = ","))+
theme_minimal()
print(p)
Pick the dependent variable and define it as Y. Let’s make ‘SalePrice’ the dependent variable and plot a histogram. The resulting histogram is also skewed to the right (skew of about 1.9 per the describe() output in Part III), though far less severely than LotArea.
p <- ggplot(data, aes(x = SalePrice)) +
geom_histogram(bins = 100, fill = "yellow", color = "gray") +
labs(title = "Sale Price", x = "Sale Price", y = "Frequency") +
scale_x_continuous(labels = label_number(big.mark = ","))+
theme_minimal()
print(p)
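To back the visual read with a number, here is a quick moment-based skewness check (a small sketch; positive values indicate right skew, and the results should agree with the psych::describe output in Part III, roughly 12.2 for LotArea and 1.9 for SalePrice):
#sample skewness via the third standardized moment
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
skewness(data$LotArea) #~12.2: strongly right-skewed
skewness(data$SalePrice) #~1.9: right-skewed, far less extreme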
PART II: PROBABILITY
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
a. P(X>x | Y>y) b. P(X>x, Y>y) c. P(X<x | Y>y)
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
#preparing the data
#define X and Y
X= data$LotArea
Y= data$SalePrice
#compute the quartiles of X; 'x' will be the 3rd quartile
x_1quartile = quantile(X, 0.25)
x_2quartile = quantile(X, 0.50)
x_3quartile = quantile(X, 0.75)
x_max = quantile(X, 1.00)
x<-x_3quartile #assigning the 3rd quartile value to 'x'
#compute the quartiles of Y; 'y' will be the 2nd quartile
y_1quartile = quantile(Y, 0.25)
y_2quartile = quantile(Y, 0.50)
y_3quartile = quantile(Y, 0.75)
y_max = quantile(Y, 1.00)
y=y_2quartile #assigning the 2nd quartile value to 'y'
#x_1quartile
#x_2quartile
#x_3quartile
#x_max
#y_1quartile
#y_2quartile
#y_3quartile
#y_max
#validate
d2 = data %>%
dplyr::select(LotArea, SalePrice)
summary(d2)
## LotArea SalePrice
## Min. : 1300 Min. : 34900
## 1st Qu.: 7554 1st Qu.:129975
## Median : 9478 Median :163000
## Mean : 10517 Mean :180921
## 3rd Qu.: 11602 3rd Qu.:214000
## Max. :215245 Max. :755000
a. P(X > x | Y > y)
Given that Y > y, determine the probability that X >
x.
Answer: Of the houses whose SalePrice is in the top 50% (above the 2nd quartile), the probability that the LotArea is in the top 25% (above the 3rd quartile) is about 38%.
den=nrow(data)
p_x<- sum(X > x)/den
#p_x
p_y<- sum(Y > y)/den
#p_y
p_x_and_y<- sum(Y > y & X > x)/den
#p_x_and_y
p_given <-p_x_and_y/p_y
p_given
## [1] 0.3791209
b. P(X>x, Y>y)
Answer: The probability that the LotArea is in the top 25% (above the 3rd quartile) and the SalePrice is in the top 50% (above the 2nd quartile) is about 19%.
p_x_and_y
## [1] 0.1890411
c. P(X<x | Y>y)
Given that Y > y, determine the probability that X < x.
Answer: Of the houses whose SalePrice is in the top 50% (above the 2nd quartile), the probability that the LotArea is in the bottom 75% (1st, 2nd, or 3rd quartile) is about 62%. This also follows from part (a), since the two probabilities are complements: 100% - 38% = 62%.
p_xless_and_y<- sum(X < x & Y > y)/den
#p_xless_and_y
p_Xless_given_Y <- p_xless_and_y/p_y
p_Xless_given_Y
## [1] 0.6208791
In addition, make a table of counts
See the table below. The information in the table is correct, but the code could be more elegant; a more compact alternative is sketched after the table.
total=den
#total
total_2nd_Q_y<- sum(Y > y)
#total_2nd_Q_y
total_2nd_Q_yless<- sum(Y <= y)
#total_2nd_Q_yless
total_2nd_Q_x<- sum(X > x)
#total_2nd_Q_x
total_2nd_Q_xless<- sum(X <= x)
#total_2nd_Q_xless
y_and_x <- sum(Y > y & X > x)
#y_and_x
y_and_xless <-sum(X <= x & Y > y)
#y_and_xless
yless_and_xless <-sum(X <= x & Y <= y)
#yless_and_xless
yless_and_x <-sum(X > x & Y <= y)
#yless_and_x
# empty df
table_df <- data.frame(
`x/y` = c("<=3rd quartile", ">3rd quartile", "Total"),
"Less_Than_Equal_to_2nd_Quartile" = NA,
"Greater_Than_2nd_Quartile" = NA,
`Total` = NA
)
rownames(table_df) <- table_df$`x/y`
table_df$`x/y` <- NULL
table_df[1, 1] <- "<=_3rd_Quartile"
table_df[1, 2] <- yless_and_xless # <=3d quartile, <=2d quartile
table_df[1, 3] <- y_and_xless # <=3d quartile, >2d quartile
table_df[1, 4] <- total_2nd_Q_xless # <=3d quartile, Total
table_df[2, 1] <- ">_3rd_Quartile"#
table_df[2, 2] <- yless_and_x # >3d quartile, <=2d quartile
table_df[2, 3] <- y_and_x # >3d quartile, >2d quartile
table_df[2, 4] <- total_2nd_Q_x # >3d quartile, Total
table_df[3, 1] <- "Total" # Total, <=2d quartile
table_df[3, 2] <- total-total_2nd_Q_y # Total, <=2d quartile
table_df[3, 3] <- total_2nd_Q_y # Total, >2d quartile
table_df[3, 4] <- total # Total, Total
table_df
## x.y Less_Than_Equal_to_2nd_Quartile Greater_Than_2nd_Quartile
## 1 <=_3rd_Quartile 643 452
## 2 >_3rd_Quartile 89 276
## 3 Total 732 728
## Total
## 1 1095
## 2 365
## 3 1460
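As noted above, the same table of counts can be produced more compactly; a sketch using base R's table() and addmargins(), reusing the X, x, Y, and y defined earlier:
#cross-tabulate the two indicators, then add row/column totals
counts <- table(LotArea = X > x, SalePrice = Y > y) #FALSE/TRUE = at-or-below/above
addmargins(counts)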
Does splitting the training data in this fashion make them independent? Does P(A|B)=P(A)P(B)?
No, P(A|B) does not equal P(A)P(B):
\[P(A \mid B) \approx 37.9\% \neq 12.47\% \approx P(A)P(B)\]
Equivalently, \(P(A \cap B) \approx 18.9\% \neq 12.47\% \approx P(A)P(B)\). Note that for independence, P(A|B) would equal P(A) = 25%; since 37.9% ≠ 25%, knowing B changes the probability of A, so A and B are not independent.
A<- sum(X > x)
#A#count
p_A<- sum(X > x)/den # den is the total number of rows
#p_A
B<- sum(Y > y)
#B#count
p_B<- sum(Y > y)/den # den is the total number of rows
#p_B
#P(A)P(B)
P_A_and_P_B<-p_B*p_A#=12.47%
P_A_and_P_B
## [1] 0.1246575
#P(A|B) = P(A and B) / P(B)
P_A_Given_B<-p_x_and_y/p_B
P_A_Given_B
## [1] 0.3791209
Evaluate by running a Chi Square test for association
Below, the chi-square test is statistically significant and can be interpreted as a rejection of the null hypothesis of independence. Although ‘SalePrice’ and ‘LotArea’ are continuous variables, the test here is applied to their dichotomized versions (the 2x2 table of counts above/below the chosen quartiles), for which a chi-square test of association is appropriate. The result therefore agrees with the mathematical check above: the two indicators are not independent.
chisq.test(matrix(c(643,452,89,276), ncol=2))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix(c(643, 452, 89, 276), ncol = 2)
## X-squared = 127.74, df = 1, p-value < 2.2e-16
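The same test can also be run without hard-coding the counts, building the 2x2 table directly from the indicator variables (a sketch; it reproduces the matrix used above):
#chi-square test of association on the dichotomized variables
chisq.test(table(X > x, Y > y))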
PART III: DESCRIPTIVE & INFERENTIAL
STATISTICS
Provide univariate descriptive statistics and appropriate plots
for the training data set. Provide a scatter plot of X and Y. Provide a
95% CI for the difference in the mean of the variables. Derive a
correlation matrix for two of the quantitative variables you selected.
Test the hypothesis that the correlation between these variables is 0
and provide a 99% confidence interval. Discuss the meaning of your
analysis.
Descriptive statistics
library(psych)
describe(data)
## vars n mean sd median trimmed mad min
## Id 1 1460 730.50 421.61 730.5 730.50 541.15 1
## MSSubClass 2 1460 56.90 42.30 50.0 49.15 44.48 20
## MSZoning 3 1460 4.03 0.63 4.0 4.06 0.00 1
## LotFrontage 4 1201 70.05 24.28 69.0 68.94 16.31 21
## LotArea 5 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300
## Street 6 1460 2.00 0.06 2.0 2.00 0.00 1
## Alley 7 91 1.45 0.50 1.0 1.44 0.00 1
## LotShape 8 1460 2.94 1.41 4.0 3.05 0.00 1
## LandContour 9 1460 3.78 0.71 4.0 4.00 0.00 1
## Utilities 10 1460 1.00 0.03 1.0 1.00 0.00 1
## LotConfig 11 1460 4.02 1.62 5.0 4.27 0.00 1
## LandSlope 12 1460 1.06 0.28 1.0 1.00 0.00 1
## Neighborhood 13 1460 13.15 5.89 13.0 13.11 7.41 1
## Condition1 14 1460 3.03 0.87 3.0 3.00 0.00 1
## Condition2 15 1460 3.01 0.26 3.0 3.00 0.00 1
## BldgType 16 1460 1.49 1.20 1.0 1.14 0.00 1
## HouseStyle 17 1460 4.04 1.91 3.0 4.03 1.48 1
## OverallQual 18 1460 6.10 1.38 6.0 6.08 1.48 1
## OverallCond 19 1460 5.58 1.11 5.0 5.48 0.00 1
## YearBuilt 20 1460 1971.27 30.20 1973.0 1974.13 37.06 1872
## YearRemodAdd 21 1460 1984.87 20.65 1994.0 1986.37 19.27 1950
## RoofStyle 22 1460 2.41 0.83 2.0 2.26 0.00 1
## RoofMatl 23 1460 2.08 0.60 2.0 2.00 0.00 1
## Exterior1st 24 1460 10.62 3.20 13.0 10.93 1.48 1
## Exterior2nd 25 1460 11.34 3.54 14.0 11.65 2.97 1
## MasVnrType 26 1452 2.76 0.62 3.0 2.73 0.00 1
## MasVnrArea 27 1452 103.69 181.07 0.0 63.15 0.00 0
## ExterQual 28 1460 3.54 0.69 4.0 3.65 0.00 1
## ExterCond 29 1460 4.73 0.73 5.0 4.95 0.00 1
## Foundation 30 1460 2.40 0.72 2.0 2.46 1.48 1
## BsmtQual 31 1423 3.26 0.87 3.0 3.43 1.48 1
## BsmtCond 32 1423 3.81 0.66 4.0 4.00 0.00 1
## BsmtExposure 33 1422 3.27 1.15 4.0 3.46 0.00 1
## BsmtFinType1 34 1423 3.73 1.83 3.0 3.79 2.97 1
## BsmtFinSF1 35 1460 443.64 456.10 383.5 386.08 568.58 0
## BsmtFinType2 36 1422 5.71 0.94 6.0 5.98 0.00 1
## BsmtFinSF2 37 1460 46.55 161.32 0.0 1.38 0.00 0
## BsmtUnfSF 38 1460 567.24 441.87 477.5 519.29 426.99 0
## TotalBsmtSF 39 1460 1057.43 438.71 991.5 1036.70 347.67 0
## Heating 40 1460 2.04 0.30 2.0 2.00 0.00 1
## HeatingQC 41 1460 2.54 1.74 1.0 2.42 0.00 1
## CentralAir 42 1460 1.93 0.25 2.0 2.00 0.00 1
## Electrical 43 1459 4.68 1.05 5.0 5.00 0.00 1
## 1stFlrSF 44 1460 1162.63 386.59 1087.0 1129.99 347.67 334
## 2ndFlrSF 45 1460 346.99 436.53 0.0 285.36 0.00 0
## LowQualFinSF 46 1460 5.84 48.62 0.0 0.00 0.00 0
## GrLivArea 47 1460 1515.46 525.48 1464.0 1467.67 483.33 334
## BsmtFullBath 48 1460 0.43 0.52 0.0 0.39 0.00 0
## BsmtHalfBath 49 1460 0.06 0.24 0.0 0.00 0.00 0
## FullBath 50 1460 1.57 0.55 2.0 1.56 0.00 0
## HalfBath 51 1460 0.38 0.50 0.0 0.34 0.00 0
## BedroomAbvGr 52 1460 2.87 0.82 3.0 2.85 0.00 0
## KitchenAbvGr 53 1460 1.05 0.22 1.0 1.00 0.00 0
## KitchenQual 54 1460 3.34 0.83 4.0 3.50 0.00 1
## TotRmsAbvGrd 55 1460 6.52 1.63 6.0 6.41 1.48 2
## Functional 56 1460 6.75 0.98 7.0 7.00 0.00 1
## Fireplaces 57 1460 0.61 0.64 1.0 0.53 1.48 0
## FireplaceQu 58 770 3.73 1.13 3.0 3.80 1.48 1
## GarageType 59 1379 3.28 1.79 2.0 3.11 0.00 1
## GarageYrBlt 60 1379 1978.51 24.69 1980.0 1981.07 31.13 1900
## GarageFinish 61 1379 2.18 0.81 2.0 2.23 1.48 1
## GarageCars 62 1460 1.77 0.75 2.0 1.77 0.00 0
## GarageArea 63 1460 472.98 213.80 480.0 469.81 177.91 0
## GarageQual 64 1379 4.86 0.61 5.0 5.00 0.00 1
## GarageCond 65 1379 4.90 0.52 5.0 5.00 0.00 1
## PavedDrive 66 1460 2.86 0.50 3.0 3.00 0.00 1
## WoodDeckSF 67 1460 94.24 125.34 0.0 71.76 0.00 0
## OpenPorchSF 68 1460 46.66 66.26 25.0 33.23 37.06 0
## EnclosedPorch 69 1460 21.95 61.12 0.0 3.87 0.00 0
## 3SsnPorch 70 1460 3.41 29.32 0.0 0.00 0.00 0
## ScreenPorch 71 1460 15.06 55.76 0.0 0.00 0.00 0
## PoolArea 72 1460 2.76 40.18 0.0 0.00 0.00 0
## PoolQC 73 7 2.14 0.90 2.0 2.14 1.48 1
## Fence 74 281 2.43 0.86 3.0 2.48 0.00 1
## MiscFeature 75 54 2.91 0.45 3.0 3.00 0.00 1
## MiscVal 76 1460 43.49 496.12 0.0 0.00 0.00 0
## MoSold 77 1460 6.32 2.70 6.0 6.25 2.97 1
## YrSold 78 1460 2007.82 1.33 2008.0 2007.77 1.48 2006
## SaleType 79 1460 8.51 1.56 9.0 8.92 0.00 1
## SaleCondition 80 1460 4.77 1.10 5.0 5.00 0.00 1
## SalePrice 81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
## max range skew kurtosis se
## Id 1460 1459 0.00 -1.20 11.03
## MSSubClass 190 170 1.40 1.56 1.11
## MSZoning 5 4 -1.73 6.25 0.02
## LotFrontage 313 292 2.16 17.34 0.70
## LotArea 215245 213945 12.18 202.26 261.22
## Street 2 1 -15.49 238.01 0.00
## Alley 2 1 0.20 -1.98 0.05
## LotShape 4 3 -0.61 -1.60 0.04
## LandContour 4 3 -3.16 8.65 0.02
## Utilities 2 1 38.13 1453.00 0.00
## LotConfig 5 4 -1.13 -0.59 0.04
## LandSlope 3 2 4.80 24.47 0.01
## Neighborhood 25 24 0.02 -1.06 0.15
## Condition1 9 8 3.01 16.34 0.02
## Condition2 8 7 13.14 247.54 0.01
## BldgType 5 4 2.24 3.41 0.03
## HouseStyle 8 7 0.31 -0.96 0.05
## OverallQual 10 9 0.22 0.09 0.04
## OverallCond 9 8 0.69 1.09 0.03
## YearBuilt 2010 138 -0.61 -0.45 0.79
## YearRemodAdd 2010 60 -0.50 -1.27 0.54
## RoofStyle 6 5 1.47 0.61 0.02
## RoofMatl 8 7 8.09 66.28 0.02
## Exterior1st 15 14 -0.72 -0.37 0.08
## Exterior2nd 16 15 -0.69 -0.52 0.09
## MasVnrType 4 3 -0.07 -0.13 0.02
## MasVnrArea 1600 1600 2.66 10.03 4.75
## ExterQual 4 3 -1.83 3.86 0.02
## ExterCond 5 4 -2.56 5.29 0.02
## Foundation 6 5 0.09 1.02 0.02
## BsmtQual 4 3 -1.31 1.27 0.02
## BsmtCond 4 3 -3.39 10.14 0.02
## BsmtExposure 4 3 -1.15 -0.39 0.03
## BsmtFinType1 6 5 -0.02 -1.39 0.05
## BsmtFinSF1 5644 5644 1.68 11.06 11.94
## BsmtFinType2 6 5 -3.56 12.32 0.02
## BsmtFinSF2 1474 1474 4.25 20.01 4.22
## BsmtUnfSF 2336 2336 0.92 0.46 11.56
## TotalBsmtSF 6110 6110 1.52 13.18 11.48
## Heating 6 5 9.83 110.98 0.01
## HeatingQC 5 4 0.48 -1.51 0.05
## CentralAir 2 1 -3.52 10.42 0.01
## Electrical 5 4 -3.06 7.49 0.03
## 1stFlrSF 4692 4358 1.37 5.71 10.12
## 2ndFlrSF 2065 2065 0.81 -0.56 11.42
## LowQualFinSF 572 572 8.99 82.83 1.27
## GrLivArea 5642 5308 1.36 4.86 13.75
## BsmtFullBath 3 3 0.59 -0.84 0.01
## BsmtHalfBath 2 2 4.09 16.31 0.01
## FullBath 3 3 0.04 -0.86 0.01
## HalfBath 2 2 0.67 -1.08 0.01
## BedroomAbvGr 8 8 0.21 2.21 0.02
## KitchenAbvGr 3 3 4.48 21.42 0.01
## KitchenQual 4 3 -1.42 1.72 0.02
## TotRmsAbvGrd 14 12 0.67 0.87 0.04
## Functional 7 6 -4.08 16.37 0.03
## Fireplaces 3 3 0.65 -0.22 0.02
## FireplaceQu 5 4 -0.16 -0.98 0.04
## GarageType 6 5 0.76 -1.30 0.05
## GarageYrBlt 2010 110 -0.65 -0.42 0.66
## GarageFinish 3 2 -0.35 -1.41 0.02
## GarageCars 4 4 -0.34 0.21 0.02
## GarageArea 1418 1418 0.18 0.90 5.60
## GarageQual 5 4 -4.43 18.25 0.02
## GarageCond 5 4 -5.28 26.77 0.01
## PavedDrive 3 2 -3.30 9.22 0.01
## WoodDeckSF 857 857 1.54 2.97 3.28
## OpenPorchSF 547 547 2.36 8.44 1.73
## EnclosedPorch 552 552 3.08 10.37 1.60
## 3SsnPorch 508 508 10.28 123.06 0.77
## ScreenPorch 480 480 4.11 18.34 1.46
## PoolArea 738 738 14.80 222.19 1.05
## PoolQC 3 2 -0.22 -1.90 0.34
## Fence 4 3 -0.57 -0.88 0.05
## MiscFeature 4 3 -2.93 10.71 0.06
## MiscVal 15500 15500 24.43 697.64 12.98
## MoSold 12 11 0.21 -0.41 0.07
## YrSold 2010 4 0.10 -1.19 0.03
## SaleType 9 8 -3.83 14.57 0.04
## SaleCondition 6 5 -2.74 6.82 0.03
## SalePrice 755000 720100 1.88 6.50 2079.11
p <- ggplot(datalong, aes(x = value)) +
geom_bar(fill = "yellow", color = "black") + #counts per value; value is character after gather()
facet_wrap(~ variable, scales = "free") +
labs(title = "Histograms of Train Dataset Variables", x = "Value", y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 20),
axis.title = element_text(size = 20, face = "bold"),
legend.position = "none",
strip.text = element_text(face = "bold", size = 20))
p
Provide a scatter plot of X and Y
data %>%
ggplot(aes(x = LotArea, y = SalePrice)) +
geom_point(alpha = 0.3, color = "black", fill = "yellow", shape = 21) +
labs(x = "Lot Area", y = "Sale Price", title = "Lot Area versus Sale Price") +
scale_y_continuous(labels = label_number(big.mark = ",")) +
theme_minimal()
Provide a 95% CI for the difference in the mean of the variables
Interpretation: If we repeatedly drew samples of this size and computed this interval each time, about 95 out of 100 such intervals would contain the true mean difference between LotArea and SalePrice; here the 95% CI is (-174378.4, -166430.4). The p-value (< 2.2e-16) is statistically significant, so we reject the null hypothesis that the mean difference is zero.
t1 <- t.test( data$LotArea, data$SalePrice, paired = TRUE, conf.level = 0.95)
t1
##
## Paired t-test
##
## data: data$LotArea and data$SalePrice
## t = -84.112, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -174378.4 -166430.4
## sample estimates:
## mean difference
## -170404.4
Derive a correlation matrix and test the hypothesis that the
correlation between these variables is 0.
The correlation coefficient of 0.2638 denotes a weak positive association between LotArea and SalePrice. Generally, as the lot area increases, the sale price tends to increase, but the relationship is weak, as the correlation coefficient is far from 1.
#weak positive correlation
c <- cor(data[,c("LotArea","SalePrice")])
c
## LotArea SalePrice
## LotArea 1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
Provide a 99% confidence interval
cor.test(data$LotArea, data$SalePrice,conf.level=.99)
##
## Pearson's product-moment correlation
##
## data: data$LotArea and data$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.2000196 0.3254375
## sample estimates:
## cor
## 0.2638434
Discuss the meaning of your analysis
There is a statistically significant positive, but weak, linear relationship between LotArea and SalePrice. Despite the weak correlation, the p-value indicates that the association is unlikely to be due to chance. The 99% confidence interval, (0.2000, 0.3254), further supports a positive association, as it does not include 0.
PART IV: LINEAR ALGEBRA &
CORRELATION
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
Invert your correlation matrix
p <- solve(c)
p
## LotArea SalePrice
## LotArea 1.0748219 -0.2835846
## SalePrice -0.2835846 1.0748219
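As the prompt notes, the diagonal of the precision matrix holds variance inflation factors; for two variables the closed form is 1/(1 - r^2), which we can verify directly against the output above (a one-line check):
#VIF for a two-variable model: 1/(1 - r^2)
r <- c["LotArea", "SalePrice"]
1/(1 - r^2) #~1.0748, matching the diagonal of the precision matrix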
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix
#multiply the correlation matrix by the precision matrix
p %*% c
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
#multiply the precision matrix by the correlation matrix
c %*% p
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
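Both products print as the 2x2 identity matrix (off-diagonal entries are zero up to floating-point rounding), confirming that the precision matrix is indeed the inverse of the correlation matrix; a quick numerical check:
#compare to the identity matrix, ignoring dimnames
all.equal(p %*% c, diag(2), check.attributes = FALSE) #TRUE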
Conduct principal components analysis, interpret and discuss
PC1, which represents the combined measure of the two variables, captures about 63% of the variance in the data, with LotArea and SalePrice contributing equally and positively.
PC2, which represents the contrast between the two variables, captures the remaining 37% of the variance, with LotArea and SalePrice contributing equally but in opposite directions.
The results show that the variability in LotArea and SalePrice decomposes into two components: the first captures their combined effect and the second their difference. For two standardized variables, the eigenvalues of the correlation matrix are 1 + r and 1 - r, i.e., 1.2638 and 0.7362 here, matching pca_result$sdev^2 below.
df <- data.frame(LotArea = data$LotArea, SalePrice = data$SalePrice)
#prcomp() standardizes internally via center/scale., so a separate scale() step is unnecessary
pca_result <- prcomp(df, center = TRUE, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2
## Standard deviation 1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion 0.6319 1.0000
print(pca_result$rotation)
## PC1 PC2
## LotArea 0.7071068 0.7071068
## SalePrice 0.7071068 -0.7071068
print(pca_result$sdev^2)
## [1] 1.2638434 0.7361566
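The variance decomposition can be checked against the correlation directly (a small sketch of the 1 ± r identity stated above):
#eigenvalues of a 2x2 correlation matrix are 1 + r and 1 - r
r <- cor(data$LotArea, data$SalePrice)
c(1 + r, 1 - r) #~1.2638 and ~0.7362, matching pca_result$sdev^2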
PART V: CALCULUS-BASED PROBABILITY &
STATISTICS
Many times, it makes sense to fit a closed form distribution to
data. For your variable that is skewed to the right, shift it so that
the minimum value is above zero. Then load the MASS package and run
fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your
original variable. Using the exponential pdf, find the 5th and 95th
percentiles using the cumulative distribution function (CDF). Also
generate a 95% confidence interval from the empirical data, assuming
normality. Finally, provide the empirical 5th percentile and 95th
percentile of the data. Discuss.
For your variable that is skewed to the right, shift it so
that the minimum value is above zero
There is no need to shift as the minimum value is already above zero
#minimum is above zero
summary(data$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
Run fitdistr to fit an exponential probability density function
exp_pdf = fitdistr(data$LotArea, "exponential")
exp_pdf
## rate
## 9.508570e-05
## (2.488507e-06)
Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ))
lambda = exp_pdf$estimate
lambda
## rate
## 9.50857e-05
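For the exponential distribution, the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, so the fitted λ can be sanity-checked in one line:
#MLE of the exponential rate = 1 / sample mean
1/mean(data$LotArea) #~9.5086e-05, matching fitdistr's estimate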
samples <- rexp(1000, rate = lambda) #random draws; exact results vary run to run without set.seed()
Plot a histogram and compare it with a histogram of your original variable
par(mfrow = c(1, 2))
hist(data$LotArea, main = "Original Variable", xlab = "Value")
hist(samples, main = "Exponential Distribution", xlab = "Value")
Using the exponential pdf, find the 5th and 95th percentiles
using the cumulative distribution function (CDF)
quantile(samples, probs=c(0.05, 0.95)) #empirical quantiles of the 1000 simulated draws
## 5% 95%
## 567.5301 30915.0411
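The prompt asks for percentiles via the CDF itself; inverting F(q) = 1 - exp(-λq) gives q_p = -log(1 - p)/λ, which qexp() computes directly. For the fitted λ this gives approximately 539 and 31506, close to the simulated values above:
#theoretical 5th and 95th percentiles of the fitted exponential
qexp(c(0.05, 0.95), rate = lambda)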
Also generate a 95% confidence interval from the empirical data, assuming normality
#
mean_x <- mean(data$LotArea)
std_x <- sd(data$LotArea)
n <- length(data$LotArea)
z <- qnorm(1 - 0.05/2)
lower_ci <- mean_x - z * (std_x / sqrt(n))
upper_ci <- mean_x + z * (std_x / sqrt(n))
lower_ci
## [1] 10004.84
upper_ci
## [1] 11028.81
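For reference, essentially the same interval can be obtained with t.test(); with n = 1460 the t quantile is nearly identical to the z quantile used above (a sketch):
#one-sample t interval for the mean of LotArea
t.test(data$LotArea, conf.level = 0.95)$conf.int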
Provide the empirical 5th percentile and 95th percentile of the data
quantile(data$LotArea, probs=c(0.05, 0.95))
## 5% 95%
## 3311.70 17401.15
Discuss
The exponential model is a poor fit for LotArea. Its 5th and 95th percentiles (567.53 and 30915.04 from the simulated draws) are far more spread out than the empirical 5th and 95th percentiles (3311.70 and 17401.15): the exponential pdf puts too much mass near zero and in the far right tail, while the observed lot areas are concentrated between roughly 3,000 and 17,500. Note that the 95% confidence interval computed above (10004.84, 11028.81) is an interval for the mean of LotArea, not a percentile range of the data, so it is not directly comparable to the percentiles.
exponential percentiles - 5%: 567.53, 95%: 30915.04
empirical percentiles - 5%: 3311.70, 95%: 17401.15
PART VI: MODELING
Build some type of regression model and submit your model to the
competition board. Provide your complete model summary and results with
analysis. Report your Kaggle.com user name and score.
Simple Linear Regression
train.lm <- lm(SalePrice ~ LotArea, data=data)
summary(train.lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -275668 -48169 -17725 31248 553356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.588e+05 2.915e+03 54.49 <2e-16 ***
## LotArea 2.100e+00 2.011e-01 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76650 on 1458 degrees of freedom
## Multiple R-squared: 0.06961, Adjusted R-squared: 0.06898
## F-statistic: 109.1 on 1 and 1458 DF, p-value: < 2.2e-16
Residuals
The median residual is far from zero, and the first and third quartiles are not symmetric about zero, which suggests that the residuals are not normally distributed. Additionally, the minimum and maximum residuals are of very different magnitudes, which provides further evidence that the model is not a good fit.
Coefficients
The LotArea p-value for this model is statistically significant (p < 0.05). The t-value for LotArea is 10.45; this high t-value indicates that the LotArea coefficient is a strongly significant predictor of SalePrice.
Residual Standard Error and Degrees of Freedom
Generally speaking, a residual standard error roughly 1.5 times the magnitude of the first- and third-quartile residuals is consistent with normally distributed residuals. Here the RSE (76,650) is about 1.6 times the first-quartile residual (48,169) but roughly 2.5 times the third-quartile residual (31,248), suggesting that the residuals are not normally distributed.
The Multiple R-squared Value
The reported R² of 0.06961 for this model means that only about 7% of the variability in SalePrice is explained by LotArea.
Residual Plots
# Plot
par(mfrow=c(2,2))
plot(train.lm)
Below, residual analysis is conducted on the Residuals versus Fitted plot and the Q-Q plot shown above.
Residuals versus Fitted Plot
For a Residuals versus Fitted plot to support the linear model, the residuals should be scattered randomly around the horizontal line where the residual equals zero. The plot above provides possible evidence that (1) there are significant outliers influencing the model, and (2) the relationship is not linear, as the points are not scattered randomly around zero.
Q-Q Plot
For the Q-Q plot to support our linear model, we would expect the plotted values to follow a straight line, indicating normally distributed residuals. Our model’s Q-Q plot (above) suggests that the distribution of the residuals is not normal, with the right tail deviating significantly from the expected straight line, pointing to skew and outliers. This suggests that the model could be improved.
Conclusion Although there is evidence of an association between LotArea and SalePrice, chiefly the statistically significant t-value, the residual analysis points to outliers and a non-normal residual distribution. Accounting for the outliers could improve the model, as could a multiple regression approach; even then, the relationship between the two variables may not be linear. More analysis is needed to improve the model; one possible refinement is sketched below.
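One possible refinement (a hypothetical sketch, not the submitted model): regressing log(SalePrice) on log(LotArea) damps the right skew and the influence of very large lots, and would likely improve the residual diagnostics discussed above:
#hypothetical log-log refinement of the simple model
train.loglm <- lm(log(SalePrice) ~ log(LotArea), data = data)
summary(train.loglm) #compare R-squared and residual quartiles to train.lm
par(mfrow = c(2, 2))
plot(train.loglm) #re-check the Residuals vs Fitted and Q-Q diagnostics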