Data Loading and Initial Exploration

We begin by loading the training set from a URL and taking a preliminary look at its structure.

train <- read.csv("https://raw.githubusercontent.com/Kingtilon1/House-prices-competition/main/train.csv", stringsAsFactors = FALSE)
head(train)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
summary(train)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 

Identification of Variables

We identify quantitative variables and check for skewness, choosing LotArea as \(X\) and SalePrice as \(Y\).

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
ggplot(train, aes(x=LotArea)) + geom_histogram(bins=30, fill="blue", color="black") +
  ggtitle("Distribution of LotArea") + theme_minimal()

ggplot(train, aes(x=SalePrice)) + geom_histogram(bins=30, fill="red", color="black") +
  ggtitle("Distribution of SalePrice") + theme_minimal()
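
The histograms give a visual check for skewness. A numeric check can complement them. The sketch below is a minimal illustration in base R: `sample_skewness` is a hypothetical helper (not part of the original analysis), and the two vectors are toy data rather than values from the housing set.

```r
# Hypothetical helper: moment-based sample skewness, base R only.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy vectors for illustration; not taken from the housing data.
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 50)
symmetric <- c(-2, -1, 0, 1, 2)

sample_skewness(right_skewed)  # positive for a long right tail
sample_skewness(symmetric)     # zero for a symmetric sample
```

With the data loaded, the same helper could be applied as `sample_skewness(train$LotArea)` and `sample_skewness(train$SalePrice)`; a positive value would agree with a right-skewed histogram.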

Calculation of Probabilities and Independence Check

Calculation of Quartiles and Probabilities

We calculate the necessary quartiles and probabilities.

x_third_quartile <- quantile(train$LotArea, 0.75)
y_second_quartile <- quantile(train$SalePrice, 0.5)

condition_X_greater_x <- train$LotArea > x_third_quartile
condition_Y_greater_y <- train$SalePrice > y_second_quartile

prob_X_greater_x <- mean(condition_X_greater_x)
prob_Y_greater_y <- mean(condition_Y_greater_y)
prob_X_greater_x_given_Y_greater_y <- mean(train$LotArea[condition_Y_greater_y] > x_third_quartile)
prob_X_greater_x_and_Y_greater_y <- mean(condition_X_greater_x & condition_Y_greater_y)
prob_X_less_x_given_Y_greater_y <- mean(train$LotArea[condition_Y_greater_y] < x_third_quartile)

cat("P(X > x): ", prob_X_greater_x, "\nP(Y > y): ", prob_Y_greater_y, "\nP(X > x | Y > y): ", prob_X_greater_x_given_Y_greater_y,
    "\nP(X > x and Y > y): ", prob_X_greater_x_and_Y_greater_y, "\nP(X < x | Y > y): ", prob_X_less_x_given_Y_greater_y)
## P(X > x):  0.25 
## P(Y > y):  0.4986301 
## P(X > x | Y > y):  0.3791209 
## P(X > x and Y > y):  0.1890411 
## P(X < x | Y > y):  0.6208791

1: \(P(X > x) = 0.25\): 25% of the properties have a lot area greater than the third quartile of all lot areas in the dataset. This follows from the definition of the third quartile (75th percentile), the value below which 75% of the data fall.

2: \(P(Y > y) = 0.4986301\): About 49.9% of the properties have a sale price higher than the median (second quartile, 50th percentile) sale price. This is very close to 50%, as expected, because the median is the middle value in a data set.

3: \(P(X > x \mid Y > y) = 0.3791209\): Given that a property’s sale price is above the median, there is approximately a 37.9% chance that its lot area is also above the third quartile. This is well above the unconditional 25%, so larger lots are over-represented among higher-priced homes, though they are still not the majority.

4: \(P(X > x \wedge Y > y) = 0.1890411\): There is an 18.9% chance that a property will have both a sale price above the median and a lot area greater than the third quartile. This joint probability is greater than the product of the individual probabilities, \(P(X > x) \times P(Y > y) = 0.25 \times 0.4986301 = 0.1246575\), indicating a dependency between lot area and sale price: large lots and above-median prices occur together more often than would be expected if the two were independent.

5: \(P(X < x \mid Y > y) = 0.6208791\): Given that a property’s sale price is above the median, there is a 62.1% chance that its lot area is less than the third quartile. Even among higher-priced homes, it is more common for the lot area to be in the smaller three-quarters of all lot areas.

Chi-Square Test for Independence

Creating a table of counts

Additional Analysis: Contingency Table of Counts

To further analyze the relationship between LotArea (X) and SalePrice (Y), we create a contingency table that categorizes properties based on whether they fall above or below these quartiles.

train$X_group <- ifelse(train$LotArea > x_third_quartile, ">3rd quartile", "<=3rd quartile")
train$Y_group <- ifelse(train$SalePrice > y_second_quartile, ">2nd quartile", "<=2nd quartile")

count_table <- table(train$X_group, train$Y_group)
addmargins(count_table)
##                 
##                  <=2nd quartile >2nd quartile  Sum
##   <=3rd quartile            643           452 1095
##   >3rd quartile              89           276  365
##   Sum                       732           728 1460

Interpretation of Contingency Table

The contingency table reveals insightful patterns about the distribution of LotArea and SalePrice among the properties:

  • Properties with LotArea <= 3rd Quartile and SalePrice <= 2nd Quartile (643 properties): This is the most populous category, covering about 44% of the 1,460 properties. These homes have lot areas at or below the third quartile and are priced at or below the median sale price. They represent a typical, more affordable segment of the housing market.

  • Properties with LotArea <= 3rd Quartile and SalePrice > 2nd Quartile (452 properties): A substantial number of properties have lots at or below the third quartile but are priced above the median. This suggests that factors other than lot size, such as location, home features, or market conditions, may be contributing to higher property values in this group.

  • Properties with LotArea > 3rd Quartile and SalePrice <= 2nd Quartile (89 properties): Fewer properties fall into this category where larger lot sizes do not correspond to higher prices, possibly indicating underdeveloped areas or locales where land is less of a premium factor in determining house prices.

  • Properties with LotArea > 3rd Quartile and SalePrice > 2nd Quartile (276 properties): These properties represent a premium segment of the market, where both larger lot sizes and higher prices coincide. This could suggest more desirable locations or luxury estates where buyers are willing to pay a premium for more space.

The distribution highlights the nuanced relationship between lot size and property price, showing that while there is a tendency for larger lots to fetch higher prices, many properties defy this trend due to other influencing factors.
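
One way to make that tendency concrete is to convert the counts into row proportions. The sketch below re-enters the four cell counts from the table above by hand (the object names `counts` and `row_props` are ours, not part of the original analysis) and computes the share of each lot-size group that sold above or below the median.

```r
# Cell counts copied from the contingency table above.
counts <- matrix(c(643, 452,
                   89, 276),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(LotArea = c("<=3rd quartile", ">3rd quartile"),
                                 SalePrice = c("<=2nd quartile", ">2nd quartile")))

# Proportion of each row (lot-size group) that falls in each price group.
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

Roughly 76% of the large-lot properties (276 of 365) sold above the median, against about 41% (452 of 1095) of the remaining properties.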

table_A_B <- table(condition_X_greater_x, condition_Y_greater_y)
chisq.test(table_A_B)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_A_B
## X-squared = 127.74, df = 1, p-value < 2.2e-16

Analysis of Independence Using Chi-Square Test

The Chi-Square test results provide a strong indication of the relationship between LotArea (X) and SalePrice (Y) when split by the 3rd and 2nd quartiles, respectively. The test yields a Chi-Square statistic of 127.74 with a p-value significantly less than 0.05 (p-value < 2.2e-16), which strongly rejects the null hypothesis of independence. This means that splitting the data by these quartiles does not result in independent subsets; rather, there is a significant association between larger lot areas and higher sale prices.
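
To see where the statistic comes from, the test can be reproduced by hand from the four cell counts in the contingency table. This is a sketch for illustration; the names `observed`, `expected`, `x_squared` and `p_value` are ours and do not appear in the original code.

```r
# Observed counts re-entered from the contingency table
# (rows: LotArea group, columns: SalePrice group).
observed <- matrix(c(643, 452,
                     89, 276), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Pearson statistic with Yates' continuity correction, which chisq.test()
# applies by default to 2x2 tables.
x_squared <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)
round(x_squared, 2)
```

The expected counts are 549, 546, 183 and 182, so every observed cell differs from its expected value by 94 properties, and the corrected statistic rounds to the 127.74 reported above.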

Additionally, we can compare the calculated probabilities, writing \(A\) for the event \(X > x\) and \(B\) for the event \(Y > y\):

  • \(P(A|B)\), the probability of \(X > x\) given \(Y > y\), is 0.3791209, whereas the unconditional probability \(P(A)\) is 0.25.

  • \(P(A \cap B)\), the joint probability, is 0.1890411, whereas the product \(P(A)P(B)\) is \(0.25 \times 0.4986301 = 0.1246575\).

Under independence we would have \(P(A|B) = P(A)\), or equivalently \(P(A \cap B) = P(A)P(B)\). Neither equality holds, which indicates a dependency between the variables: properties with a higher sale price are more likely to also have a larger lot area than would be expected under independence. The result of this check aligns with the Chi-Square test, further affirming that the variables are not independent.
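
The comparison can be reproduced directly from the printed probabilities. The sketch below only re-enters the numbers shown in the output earlier; the variable names are ours.

```r
# Probabilities as printed in the output above.
p_a <- 0.25             # P(X > x)
p_b <- 0.4986301        # P(Y > y)
p_a_and_b <- 0.1890411  # P(X > x and Y > y)

p_a_given_b <- p_a_and_b / p_b  # definition of conditional probability
p_product <- p_a * p_b          # joint probability implied by independence

c(p_a_given_b = p_a_given_b, p_a = p_a)
c(p_a_and_b = p_a_and_b, p_product = p_product)
```

If the two events were independent, the values within each printed pair would match; here the conditional and joint probabilities are both about 1.5 times their independence counterparts.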

In summary, the statistical evidence from both the probability calculations and the Chi-Square test confirms that the manner in which the training data has been split (based on quartiles of LotArea and SalePrice) leads to a dependent relationship between the two variables. This dependency should be considered when analyzing or modeling these real estate data.

Univariate Descriptive Statistics and Plots

We start by providing basic descriptive statistics for the entire dataset, focusing particularly on LotArea and SalePrice.

summary(train$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

The summaries of LotArea and SalePrice illustrate key characteristics of the real estate market in the dataset. The LotArea ranges significantly from 1,300 to 215,245 square feet, indicating diverse property sizes, while the SalePrice varies from $34,900 to $755,000, reflecting a wide economic spread in property values. Both distributions are right-skewed, as indicated by means higher than medians, typical for real estate where a few high values can skew the average upward. These statistics are vital for understanding property size and pricing dynamics within the market.

plot(train$LotArea, train$SalePrice, main="Scatterplot of LotArea vs SalePrice",
     xlab="LotArea", ylab="SalePrice", pch=19, col=rgb(0.1, 0.2, 0.5, 0.7))

The scatter plot of LotArea versus SalePrice reveals a broadly positive relationship, indicating that properties with larger lot areas generally tend to have higher sale prices. However, the relationship isn’t strictly linear, and there is significant variability, especially among properties with larger lot areas. Most data points cluster toward the lower end of both axes, suggesting that smaller, more affordable properties are more prevalent. Outliers and the spread of data points at higher lot areas underscore that factors other than lot size, such as location and property features, also significantly influence sale prices.

3. 95% Confidence Interval for the Difference in the Mean of the Variables

95% Confidence Interval for the Difference in Mean

We compute the 95% confidence interval for the difference in the mean of LotArea and SalePrice.

lot_mean <- mean(train$LotArea)
sale_mean <- mean(train$SalePrice)
se_diff <- sqrt(var(train$LotArea)/length(train$LotArea) + var(train$SalePrice)/length(train$SalePrice))
ci_lower <- (lot_mean - sale_mean) - qt(0.975, df=min(length(train$LotArea), length(train$SalePrice))-1) * se_diff
ci_upper <- (lot_mean - sale_mean) + qt(0.975, df=min(length(train$LotArea), length(train$SalePrice))-1) * se_diff
c(ci_lower, ci_upper)
## [1] -174514.8 -166293.9

The output gives a 95% confidence interval for the difference in means (LotArea minus SalePrice) of approximately -174,514.8 to -166,293.9. The interval is negative because of the subtraction order: the mean SalePrice exceeds the mean LotArea by roughly 166,000 to 175,000. Because LotArea is measured in square feet and SalePrice in dollars, this difference mainly reflects the different scales and units of the two variables rather than a substantive comparison between them.
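
For reference, the interval computed by the code above corresponds to the following formula, where \(\bar{x}\) and \(s_x^2\) are the mean and variance of LotArea, \(\bar{y}\) and \(s_y^2\) are those of SalePrice, and \(n = 1460\) is the number of rows (the symbols are our notation for the quantities in the code):

\[
(\bar{x} - \bar{y}) \;\pm\; t_{0.975,\,n-1}\,\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{n}}
\]

The midpoint of the reported interval, about -170,404, matches the difference of the two means shown in the summaries (10,517 minus 180,921). Note that this formula treats the two columns as independent samples, even though both are measured on the same houses.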

Correlation Matrix and Hypothesis Testing

First, I’ll get the correlation matrix for LotArea and SalePrice.

cor_matrix <- cor(train[,c("LotArea", "SalePrice")])
cor_matrix
##             LotArea SalePrice
## LotArea   1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000

The output presents a correlation matrix between LotArea and SalePrice, showing a correlation coefficient of approximately 0.264 between these two variables. This value indicates a positive but weak correlation, suggesting that while there is some degree of association where larger lot areas tend to correlate with higher sale prices, the relationship is not strongly linear. The coefficients on the diagonal (1.0000000) confirm that each variable perfectly correlates with itself, as expected.

Next, let’s test the hypothesis that the correlation between LotArea and SalePrice is 0 using a t-test, and provide a 99% confidence interval.

cor_test <- cor.test(train$LotArea, train$SalePrice)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2154574 0.3109369
## sample estimates:
##       cor 
## 0.2638434
# The object returned by cor.test() has no std.error element, so an interval
# built from cor_test$std.error evaluates to numeric(0).
# Request the 99% interval directly through conf.level instead.
cor_test_99 <- cor.test(train$LotArea, train$SalePrice, conf.level = 0.99)
cor_test_99$conf.int

The correlation test between LotArea and SalePrice shows a statistically significant but modest positive correlation of approximately 0.264. With a t-value of 10.445 on 1458 degrees of freedom, the p-value is less than 2.2e-16, so we reject the null hypothesis that the correlation is zero. The confidence interval printed in the cor.test output, 0.215 to 0.311, is the default 95% interval; a 99% interval would be somewhat wider. Either way the interval excludes zero: larger lot areas are generally associated with higher sale prices, but the weak correlation indicates that other factors also play significant roles in determining property values.
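
The interval for a Pearson correlation is usually built on Fisher's z-transform rather than on a standard error of the correlation itself. The sketch below applies that construction to the estimate and sample size printed above; `fisher_ci` is a hypothetical helper written for illustration, not a function from the original analysis.

```r
# Values taken from the cor.test() output above.
r <- 0.2638434
n <- 1460

# Hypothetical helper: confidence interval for a Pearson correlation
# via Fisher's z-transform (atanh), back-transformed with tanh.
fisher_ci <- function(r, n, conf_level) {
  z <- atanh(r)
  se <- 1 / sqrt(n - 3)
  crit <- qnorm(1 - (1 - conf_level) / 2)
  tanh(z + c(-1, 1) * crit * se)
}

ci_95 <- fisher_ci(r, n, 0.95)  # should agree with the interval printed by cor.test()
ci_99 <- fisher_ci(r, n, 0.99)  # wider interval at the 99% level
ci_95
ci_99
```

Checking the 95% result against the printed interval (0.2154574 to 0.3109369) is a quick way to confirm the construction before reading off the 99% bounds.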

Inverting the Correlation Matrix and Matrix Operations

Compute the correlation matrix, invert it to obtain the precision matrix, then multiply the two in both orders. Each product should be the identity matrix:

cor_matrix <- cor(train[,c("LotArea", "SalePrice")])
precision_matrix <- solve(cor_matrix)

mult_cor_prec <- cor_matrix %*% precision_matrix

mult_prec_cor <- precision_matrix %*% cor_matrix

mult_cor_prec
##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1
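
For a 2 x 2 correlation matrix the inverse has a simple closed form, which gives an independent check on solve(). The sketch below assumes the rounded correlation r = 0.2638434 reported earlier rather than reading the data, so the matrix is rebuilt by hand; the object names are illustrative.

```r
# Rebuild the 2 x 2 correlation matrix from the reported correlation (assumed value)
r <- 0.2638434
vars <- c("LotArea", "SalePrice")
cor_matrix <- matrix(c(1, r, r, 1), nrow = 2, dimnames = list(vars, vars))

# Closed-form inverse of [[1, r], [r, 1]] is [[1, -r], [-r, 1]] / (1 - r^2)
precision_closed <- matrix(c(1, -r, -r, 1), nrow = 2) / (1 - r^2)
precision_solve <- solve(cor_matrix)

# Both routes should agree, and both products should be the identity
all.equal(unname(precision_solve), precision_closed)
all.equal(unname(cor_matrix %*% precision_solve), diag(2))
all.equal(unname(precision_solve %*% cor_matrix), diag(2))
```

Using all.equal rather than == allows for the small floating-point error that matrix inversion can introduce.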

Conducting Principal Components Analysis (PCA)

Principal Components Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset, increasing interpretability while minimizing information loss. It transforms the data into a new coordinate system, such that the greatest variance comes to lie on the first few principal axes.

Principal Components Analysis (PCA)

Conduct PCA on the selected variables and interpret the results:

library(stats)
pca_result <- prcomp(train[,c("LotArea", "SalePrice")], scale. = TRUE)

summary(pca_result)
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion  0.6319 1.0000
plot(pca_result, type = "lines")

The PCA on the standardized LotArea and SalePrice variables shows that the first principal component (PC1) accounts for 63.19% of the total variance, with a standard deviation of approximately 1.1242. PC2 accounts for the remaining 36.81% in the direction orthogonal to PC1. Because only two variables were included, the two components necessarily explain 100% of the variance. With two standardized variables the split is determined entirely by their correlation: PC1's share is (1 + 0.264) / 2 ≈ 0.632. PC1 therefore captures the tendency of lot area and sale price to rise together, while PC2 captures the variation in which they move in opposite directions.
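
The link between the PCA summary and the correlation can be checked directly, since PCA on scaled variables amounts to an eigen-decomposition of the correlation matrix. This is a small sketch assuming the rounded correlation r = 0.2638434 from the earlier output; it does not rerun prcomp on the data.

```r
# Correlation matrix of the two standardized variables (assumed rounded r)
r <- 0.2638434
eig <- eigen(matrix(c(1, r, r, 1), nrow = 2))

# Eigenvalues are the component variances: 1 + r and 1 - r
sdev <- sqrt(eig$values)
prop_var <- eig$values / sum(eig$values)

round(sdev, 4)      # compare with the "Standard deviation" row of summary(pca_result)
round(prop_var, 4)  # compare with the "Proportion of Variance" row
eig$vectors         # loadings, up to sign
```

The signs of the eigenvectors are arbitrary, so they may differ from pca_result$rotation by a factor of -1.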

Preparing Data and Fitting Exponential Distribution

library(MASS)
shifted_lot_area <- train$LotArea - min(train$LotArea) + 1

fit <- fitdistr(shifted_lot_area, densfun = "exponential")

lambda <- fit$estimate
lambda
##         rate 
## 0.0001084854
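
For the exponential distribution, the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, so the value returned by fitdistr can be verified without an optimizer. The sketch below uses simulated data as a stand-in (the vector x and its true rate of 0.002 are made up for illustration and are not the housing data).

```r
library(MASS)

# Simulated stand-in data; the rate 0.002 is an arbitrary illustrative choice
set.seed(42)
x <- rexp(5000, rate = 0.002)

# Rate estimated by fitdistr versus the closed-form MLE 1 / mean(x)
fit_demo <- fitdistr(x, densfun = "exponential")
rate_mle <- unname(fit_demo$estimate)
rate_closed <- 1 / mean(x)

all.equal(rate_mle, rate_closed)
```

Applied to the analysis above, this means lambda should equal 1 / mean(shifted_lot_area).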

Sample from the Fitted Distribution and Plot Histograms

After fitting the distribution, sample from it and compare the results with the original data through histograms.

Sampling and Plotting

samples <- rexp(1000, rate = lambda)

hist(samples, main="Histogram of Exponential Samples", col="blue", breaks=30)

hist(shifted_lot_area, main="Histogram of Shifted LotArea", col="red", breaks=30)

Calculate Percentiles and Confidence Intervals

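Find the 5th and 95th percentiles using the exponential quantile function (the inverse of the cumulative distribution function), compare them with the empirical percentiles, and compute a 95% confidence interval for the mean of the empirical data.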

Percentiles and Confidence Intervals

exp_5th <- qexp(0.05, rate = lambda)
exp_95th <- qexp(0.95, rate = lambda)

empirical_5th <- quantile(shifted_lot_area, 0.05)
empirical_95th <- quantile(shifted_lot_area, 0.95)

mean_lot <- mean(train$LotArea)
sd_lot <- sd(train$LotArea)
n_lot <- length(train$LotArea)
ci_lower <- mean_lot - qt(0.975, df=n_lot-1) * sd_lot / sqrt(n_lot)
ci_upper <- mean_lot + qt(0.975, df=n_lot-1) * sd_lot / sqrt(n_lot)

list(exp_5th = exp_5th, exp_95th = exp_95th, empirical_5th = empirical_5th, empirical_95th = empirical_95th, ci_lower = ci_lower, ci_upper = ci_upper)
## $exp_5th
## [1] 472.8128
## 
## $exp_95th
## [1] 27614.15
## 
## $empirical_5th
##     5% 
## 2012.7 
## 
## $empirical_95th
##      95% 
## 16102.15 
## 
## $ci_lower
## [1] 10004.42
## 
## $ci_upper
## [1] 11029.24

We shifted LotArea so that its minimum equals 1 before fitting an exponential distribution. The fitted rate λ (lambda) is approximately 0.0001085, which corresponds to a mean of 1/λ, about 9,218 square feet on the shifted scale, and therefore a relatively slow rate of decay. Comparing the histograms of the sampled and actual data shows that the exponential model roughly captures the right skew but does not reproduce the shape of the distribution. The percentiles make this concrete. On the shifted scale, the empirical 5th and 95th percentiles are 2,012.7 and 16,102.15, whereas the exponential model gives 472.81 and 27,614.15. The model underestimates the 5th percentile and overestimates the 95th, so the observed lot areas are more tightly concentrated than an exponential distribution with the same mean would imply, even though LotArea itself ranges from 1,300 to 215,245 square feet.

The 95% confidence interval for the mean of the unshifted LotArea is 10,004.42 to 11,029.24 square feet, which quantifies the uncertainty in the average lot size. Together with the weak correlation of 0.264 between LotArea and SalePrice, these results indicate that an exponential model is too simple for LotArea and that a more flexible right-skewed distribution would be needed to describe it accurately.
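
The model-based percentiles follow from inverting the exponential CDF, F(x) = 1 - exp(-λx), which gives x_p = -log(1 - p) / λ. As a brief sketch, assuming the rounded rate 0.0001084854 from the fit above, the closed form can be compared with qexp; the helper exp_quantile is a hypothetical name introduced for illustration.

```r
# Fitted rate as printed above (rounded)
lambda <- 0.0001084854

# Closed-form exponential quantile (hypothetical helper for illustration)
exp_quantile <- function(p, rate) -log(1 - p) / rate

p <- c(0.05, 0.95)
closed_form <- exp_quantile(p, lambda)
built_in <- qexp(p, rate = lambda)

all.equal(closed_form, built_in)
round(closed_form, 2)  # compare with exp_5th and exp_95th above
```

Both values are on the shifted scale, so adding min(train$LotArea) - 1 would convert them back to square feet of actual lot area.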

Building the Regression Model

library(stats)

model <- lm(SalePrice ~ LotArea, data=train)

summary(model)
## 
## Call:
## lm(formula = SalePrice ~ LotArea, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -275668  -48169  -17725   31248  553356 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.588e+05  2.915e+03   54.49   <2e-16 ***
## LotArea     2.100e+00  2.011e-01   10.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76650 on 1458 degrees of freedom
## Multiple R-squared:  0.06961,    Adjusted R-squared:  0.06898 
## F-statistic: 109.1 on 1 and 1458 DF,  p-value: < 2.2e-16

The linear regression of SalePrice on LotArea estimates that each additional square foot of lot area is associated with an increase of approximately $2.10 in sale price, and the slope is statistically significant (p < 2e-16). The R-squared of 0.06961 means that LotArea alone explains about 6.96% of the variance in SalePrice. For a single-predictor model this equals the squared correlation (0.2638^2 ≈ 0.0696), so most of the variation in price is driven by other factors. The residuals range from about -$275,668 to $553,356, with a residual standard error of $76,650, so individual predictions can be far off. The intercept of approximately $158,800 is the predicted price at a lot area of zero, which lies outside the observed range (the smallest lot is 1,300 square feet) and should not be read as a realistic base price. Overall, the model shows a positive but limited relationship between LotArea and SalePrice and underscores the need for models with more predictors to capture real estate pricing.
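
To make the fitted line concrete, the coefficients can be applied by hand. The sketch below assumes the rounded estimates from the summary output (intercept 1.588e+05, slope 2.100) and a few hypothetical lot sizes chosen for illustration; predict_price is a hypothetical helper, and results from predict(model, ...) would differ slightly because it uses unrounded coefficients.

```r
# Rounded coefficients copied from summary(model)
b0 <- 1.588e5   # intercept
b1 <- 2.100     # dollars per square foot of lot area

# Hypothetical helper applying the fitted line
predict_price <- function(lot_area) b0 + b1 * lot_area

# Hypothetical lot sizes, in square feet
lot_sizes <- c(5000, 10000, 20000)
data.frame(LotArea = lot_sizes, PredictedPrice = predict_price(lot_sizes))

# In simple regression, R-squared equals the squared correlation
r <- 0.2638434
round(r^2, 5)
```

Quadrupling the lot size from 5,000 to 20,000 square feet raises the predicted price by only about $31,500, which illustrates how little of the price the model attributes to lot area.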

# Load the competition test set
test <- read.csv('https://raw.githubusercontent.com/Kingtilon1/House-prices-competition/main/test.csv')

# Predict SalePrice from LotArea and write an Id / SalePrice submission file
predictions <- predict(model, newdata = test)
submission <- data.frame(Id = test$Id, SalePrice = predictions)
write.csv(submission, "submission.csv", row.names = FALSE)