We begin by loading the training set from a URL and taking a preliminary look at its structure.
train <- read.csv("https://raw.githubusercontent.com/Kingtilon1/House-prices-competition/main/train.csv", stringsAsFactors = FALSE)
head(train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
We identify quantitative variables and check for skewness, choosing
LotArea as \(X\) and
SalePrice as \(Y\).
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
ggplot(train, aes(x=LotArea)) + geom_histogram(bins=30, fill="blue", color="black") +
ggtitle("Distribution of LotArea") + theme_minimal()
ggplot(train, aes(x=SalePrice)) + geom_histogram(bins=30, fill="red", color="black") +
ggtitle("Distribution of SalePrice") + theme_minimal()
We calculate the necessary quartiles and probabilities.
x_third_quartile <- quantile(train$LotArea, 0.75)
y_second_quartile <- quantile(train$SalePrice, 0.5)
condition_X_greater_x <- train$LotArea > x_third_quartile
condition_Y_greater_y <- train$SalePrice > y_second_quartile
prob_X_greater_x <- mean(condition_X_greater_x)
prob_Y_greater_y <- mean(condition_Y_greater_y)
prob_X_greater_x_given_Y_greater_y <- mean(train$LotArea[condition_Y_greater_y] > x_third_quartile)
prob_X_greater_x_and_Y_greater_y <- mean(condition_X_greater_x & condition_Y_greater_y)
prob_X_less_x_given_Y_greater_y <- mean(train$LotArea[condition_Y_greater_y] < x_third_quartile)
cat("P(X > x): ", prob_X_greater_x, "\nP(Y > y): ", prob_Y_greater_y, "\nP(X > x | Y > y): ", prob_X_greater_x_given_Y_greater_y,
"\nP(X > x and Y > y): ", prob_X_greater_x_and_Y_greater_y, "\nP(X < x | Y > y): ", prob_X_less_x_given_Y_greater_y)
## P(X > x): 0.25
## P(Y > y): 0.4986301
## P(X > x | Y > y): 0.3791209
## P(X > x and Y > y): 0.1890411
## P(X < x | Y > y): 0.6208791
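The printed values are linked by the definition of conditional probability, \(P(A|B) = P(A \cap B) / P(B)\). The following is a minimal sketch that rechecks this using only the figures printed above; the variable names are illustrative and the inputs are the rounded values from the output, not a fresh computation on the data.

```r
# Values copied from the printed output above
p_x     <- 0.25        # P(X > x)
p_y     <- 0.4986301   # P(Y > y)
p_joint <- 0.1890411   # P(X > x and Y > y)

# Conditional probability from its definition: P(A | B) = P(A and B) / P(B)
p_x_given_y <- p_joint / p_y

# Product of the marginals: the joint probability expected under independence
p_product <- p_x * p_y

round(c(conditional = p_x_given_y, joint = p_joint, product = p_product), 4)
```

The recomputed conditional probability should agree with the printed 0.3791209 up to rounding, and the joint probability can be compared directly with the product of the marginals.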
1: \(P(X > x) = 0.25\): This probability indicates that 25% of the properties have a lot area greater than the third quartile of all lot areas in the dataset. This follows from the definition of the quartile, as the third quartile (75th percentile) is the value below which 75% of the data fall.
2: \(P(Y > y) = 0.4986301\): About 49.9% of the properties have a sale price higher than the median (second quartile, 50th percentile) sale price. This is very close to 50%, as expected, because the median is the middle value in a data set; it falls slightly short because sale prices equal to the median are not counted as above it.
3: \(P(X > x \mid Y > y) = 0.3791209\): Given that a property’s sale price is above the median, there is approximately a 37.9% chance that its lot area is also above the third quartile. This is noticeably higher than the unconditional 25%, so higher-priced homes are more likely to have large lots, though most of them still do not.
4: \(P(X > x \cap Y > y) = 0.1890411\): There is an 18.9% chance that a property will have both a sale price above the median and a lot area greater than the third quartile. This joint probability is greater than the product of the individual probabilities, \(P(X > x) \times P(Y > y) = 0.25 \times 0.4986301 \approx 0.1247\), indicating a dependency between lot area and sale price: large lot areas occur together with higher prices more often than would be expected if the two were independent.
5: \(P(X < x \mid Y > y) = 0.6208791\): Given a property’s sale price is above the median, there is a 62.1% chance that its lot area is less than the third quartile. This indicates that even among higher-priced homes, it’s more common for the lot area to be in the smaller three-quarters of all lot areas.
### Chi-Square Test for Independence
To further analyze the relationship between LotArea (X)
and SalePrice (Y), we create a contingency table that
categorizes properties based on whether they fall above or below these
quartiles.
train$X_group <- ifelse(train$LotArea > x_third_quartile, ">3rd quartile", "<=3rd quartile")
train$Y_group <- ifelse(train$SalePrice > y_second_quartile, ">2nd quartile", "<=2nd quartile")
count_table <- table(train$X_group, train$Y_group)
addmargins(count_table)
##
## <=2nd quartile >2nd quartile Sum
## <=3rd quartile 643 452 1095
## >3rd quartile 89 276 365
## Sum 732 728 1460
The contingency table reveals insightful patterns about the
distribution of LotArea and SalePrice among
the properties:
Properties with LotArea <= 3rd Quartile
and SalePrice <= 2nd Quartile (643 properties):
This is the most populous category, covering about 44% of the
properties: lot areas at or below the third quartile paired with prices
at or below the median sale price. These properties represent a typical,
more affordable segment of the housing market.
Properties with LotArea <= 3rd Quartile
and SalePrice > 2nd Quartile (452 properties):
A significant number of properties have lots at or below the third
quartile but are priced above the median. This suggests that factors
other than lot size, such as location, home features, or market
conditions, may be contributing to higher property values in this group.
Properties with LotArea > 3rd Quartile
and SalePrice <= 2nd Quartile (89 properties):
Fewer properties fall into this category where larger lot sizes do not
correspond to higher prices, possibly indicating underdeveloped areas or
locales where land is less of a premium factor in determining house
prices.
Properties with LotArea > 3rd Quartile
and SalePrice > 2nd Quartile (276 properties):
These properties represent a premium segment of the market, where both
larger lot sizes and higher prices coincide. This could suggest more
desirable locations or luxury estates where buyers are willing to pay a
premium for more space.
The distribution highlights the nuanced relationship between lot size and property price, showing that while there is a tendency for larger lots to fetch higher prices, many properties defy this trend due to other influencing factors.
table_A_B <- table(condition_X_greater_x, condition_Y_greater_y)
chisq.test(table_A_B)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_A_B
## X-squared = 127.74, df = 1, p-value < 2.2e-16
The Chi-Square test results provide a strong indication of the
relationship between LotArea (X) and SalePrice
(Y) when split by the 3rd and 2nd quartiles, respectively. The test
yields a Chi-Square statistic of 127.74 with a p-value significantly
less than 0.05 (p-value < 2.2e-16), which strongly rejects the null
hypothesis of independence. This means that splitting the data by these
quartiles does not result in independent subsets; rather, there is a
significant association between larger lot areas and higher sale
prices.
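The reported statistic can be reproduced from the four counts in the contingency table. The sketch below rebuilds the table by hand (the short dimension labels are illustrative), computes the counts expected under independence, and applies Yates' continuity correction, which `chisq.test()` uses by default for 2x2 tables.

```r
# Counts copied from the contingency table above
# (rows: LotArea group, columns: SalePrice group)
counts <- matrix(c(643, 452,
                    89, 276),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(LotArea   = c("<=Q3", ">Q3"),
                                 SalePrice = c("<=median", ">median")))

# Expected counts if the two groupings were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic with Yates' continuity correction
stat_manual <- sum((abs(counts - expected) - 0.5)^2 / expected)

# Same statistic from the built-in test
stat_builtin <- unname(chisq.test(counts)$statistic)

expected
c(manual = stat_manual, builtin = stat_builtin)  # both about 127.74
```

Every observed count differs from its expected count by 94 properties, which is what drives the large statistic.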
Additionally, we can compare the calculated probabilities directly. Let \(A\) be the event \(X > x\) and \(B\) the event \(Y > y\). Under independence, \(P(A|B) = P(A)\), or equivalently \(P(A \cap B) = P(A)P(B)\):
- \(P(A|B)\) (the probability of \(X > x\) given \(Y > y\)) is 0.3791209, whereas \(P(A)\) is 0.25.
- \(P(A \cap B)\) is 0.1890411, whereas \(P(A)P(B)\) (the product of the probabilities \(P(X > x)\) and \(P(Y > y)\)) is \(0.25 \times 0.4986301 = 0.1246575\).
The inequality \(P(A|B) \neq P(A)\), equivalently \(P(A \cap B) \neq P(A)P(B)\), indicates a dependency between the variables, where properties with a higher sale price are more likely to also have a larger lot area than would be expected under independence. The result of this mathematical check aligns with the Chi-Square test, further affirming that the variables are not independent.
In summary, the statistical evidence from both the probability
calculations and the Chi-Square test confirms that the two
quartile-based groupings of the training data (LotArea above its third
quartile, SalePrice above its median) are not independent of each
other. This dependency should be
considered when analyzing or modeling these real estate data.
We start by providing basic descriptive statistics for the entire
dataset, focusing particularly on LotArea and
SalePrice.
summary(train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
The summaries of LotArea and SalePrice illustrate key characteristics of the real estate market in the dataset. The LotArea ranges significantly from 1,300 to 215,245 square feet, indicating diverse property sizes, while the SalePrice varies from $34,900 to $755,000, reflecting a wide economic spread in property values. Both distributions are right-skewed, as indicated by means higher than medians, typical for real estate where a few high values can skew the average upward. These statistics are vital for understanding property size and pricing dynamics within the market.
plot(train$LotArea, train$SalePrice, main="Scatterplot of LotArea vs SalePrice",
xlab="LotArea", ylab="SalePrice", pch=19, col=rgb(0.1, 0.2, 0.5, 0.7))
The scatter plot of LotArea versus SalePrice reveals a broadly positive relationship, indicating that properties with larger lot areas generally tend to have higher sale prices. However, the relationship isn’t strictly linear, and there is significant variability, especially among properties with larger lot areas. Most data points cluster toward the lower end of both axes, suggesting that smaller, more affordable properties are more prevalent. Outliers and the spread of data points at higher lot areas underscore that factors other than lot size, such as location and property features, also significantly influence sale prices.
We compute the 95% confidence interval for the difference in the mean
of LotArea and SalePrice.
lot_mean <- mean(train$LotArea)
sale_mean <- mean(train$SalePrice)
se_diff <- sqrt(var(train$LotArea)/length(train$LotArea) + var(train$SalePrice)/length(train$SalePrice))
ci_lower <- (lot_mean - sale_mean) - qt(0.975, df=min(length(train$LotArea), length(train$SalePrice))-1) * se_diff
ci_upper <- (lot_mean - sale_mean) + qt(0.975, df=min(length(train$LotArea), length(train$SalePrice))-1) * se_diff
c(ci_lower, ci_upper)
## [1] -174514.8 -166293.9
The output displays a 95% confidence interval for the difference in means between LotArea and SalePrice (LotArea minus SalePrice), ranging from approximately -174,514.8 to -166,293.9. The interval lies entirely below zero, so the mean SalePrice is numerically larger than the mean LotArea. Because one variable is measured in square feet and the other in dollars, however, the difference has no practical interpretation; it mainly reflects the distinct scales and units of the two variables.
First, I’ll get the correlation matrix for LotArea and
SalePrice.
cor_matrix <- cor(train[,c("LotArea", "SalePrice")])
cor_matrix
## LotArea SalePrice
## LotArea 1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
The output presents a correlation matrix between LotArea and SalePrice, showing a correlation coefficient of approximately 0.264 between these two variables. This value indicates a positive but weak correlation, suggesting that while there is some degree of association where larger lot areas tend to correlate with higher sale prices, the relationship is not strongly linear. The coefficients on the diagonal (1.0000000) confirm that each variable perfectly correlates with itself, as expected.
Next, let’s test the hypothesis that the correlation between LotArea and SalePrice is 0, using a t-test, and provide a 99% confidence interval.
cor_test <- cor.test(train$LotArea, train$SalePrice)
cor_test
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2154574 0.3109369
## sample estimates:
## cor
## 0.2638434
# An htest object has no std.error component, so ask cor.test() for the 99% interval directly.
cor_test_99 <- cor.test(train$LotArea, train$SalePrice, conf.level = 0.99)
cor_test_99$conf.int
The correlation test performed between LotArea and SalePrice demonstrates a statistically significant but modest positive correlation of approximately 0.264, as confirmed by the Pearson’s product-moment correlation test. This result, with a t-value of 10.445 and 1458 degrees of freedom, leads to a p-value less than 2.2e-16, strongly rejecting the null hypothesis that no correlation exists between the two variables. The 95% confidence interval printed in the test output runs from 0.215 to 0.311; a 99% interval is somewhat wider but, given the p-value, still excludes zero. Larger lot areas are therefore generally associated with higher sale prices, although the relationship is not strong and other factors also play significant roles in determining property values.
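For reference, both the t statistic and the confidence interval can be recomputed from the sample correlation and the sample size alone. The sketch below assumes the printed values r = 0.2638434 and n = 1460; `fisher_ci` is a hypothetical helper written for illustration. It uses Fisher's z-transformation, which is the asymptotic method `cor.test()` documents for its Pearson interval.

```r
r <- 0.2638434   # sample correlation from the output above
n <- 1460        # number of rows in the training data

# t statistic for H0: correlation = 0, on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Confidence interval via Fisher's z-transformation (illustrative helper)
fisher_ci <- function(r, n, conf_level) {
  z  <- atanh(r)           # transform r to the z scale
  se <- 1 / sqrt(n - 3)    # approximate standard error on the z scale
  tanh(z + c(-1, 1) * qnorm((1 + conf_level) / 2) * se)
}

ci_95 <- fisher_ci(r, n, 0.95)  # should match the interval printed by cor.test()
ci_99 <- fisher_ci(r, n, 0.99)  # the 99% interval requested in the text

t_stat
ci_95
ci_99
```

The 99% interval is wider than the 95% one by construction, since it uses a larger normal quantile on the same standard error.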
Calculate and invert the correlation matrix, then perform matrix multiplications:
cor_matrix <- cor(train[,c("LotArea", "SalePrice")])
precision_matrix <- solve(cor_matrix)
mult_cor_prec <- cor_matrix %*% precision_matrix
mult_prec_cor <- precision_matrix %*% cor_matrix
mult_cor_prec
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
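For a 2x2 correlation matrix with off-diagonal entry \(r\), the inverse has a simple closed form: \(\frac{1}{1 - r^2}\) times the matrix with 1 on the diagonal and \(-r\) off the diagonal. The sketch below rebuilds the matrix from the printed correlation (an assumption in place of reloading the data) and checks the closed form against `solve()`.

```r
r <- 0.2638434   # correlation copied from the output above

vars <- c("LotArea", "SalePrice")
cor_matrix <- matrix(c(1, r,
                       r, 1),
                     nrow = 2, byrow = TRUE, dimnames = list(vars, vars))

# Closed-form inverse of a 2x2 correlation matrix
precision_closed <- matrix(c( 1, -r,
                             -r,  1),
                           nrow = 2, byrow = TRUE) / (1 - r^2)

# Numerical inverse, as in the code above
precision_solve <- solve(cor_matrix)

precision_solve
# Multiplying in either order should return the identity matrix
round(cor_matrix %*% precision_solve, 10)
round(precision_solve %*% cor_matrix, 10)
```

The diagonal entries, \(1 / (1 - r^2) \approx 1.075\), are the variance inflation factors; a value this close to 1 is consistent with the weak correlation between the two variables.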
Principal Components Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset, increasing interpretability while minimizing information loss. It transforms the data into a new coordinate system, such that the greatest variance comes to lie on the first few principal axes.
Conduct PCA on the selected variables and interpret the results:
library(stats)
pca_result <- prcomp(train[,c("LotArea", "SalePrice")], scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2
## Standard deviation 1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion 0.6319 1.0000
plot(pca_result, type = "lines")
The PCA analysis performed on the “LotArea” and “SalePrice” variables reveals that the first principal component (PC1) accounts for the majority of the variability in the data (63.19%), with a standard deviation of approximately 1.1242. PC2 captures additional variability (36.81%) orthogonal to PC1. Together, PC1 and PC2 explain 100% of the total variability in the data. This suggests that PC1 represents the primary trend in the dataset, while PC2 captures secondary patterns not explained by PC1.
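With two standardized variables, the PCA is determined entirely by their correlation: the eigenvalues of the correlation matrix are \(1 + r\) and \(1 - r\), and the component standard deviations are their square roots. The sketch below checks this using the printed correlation rather than the raw data.

```r
r <- 0.2638434   # correlation copied from the output above

cor_matrix <- matrix(c(1, r,
                       r, 1), nrow = 2, byrow = TRUE)

eig <- eigen(cor_matrix)

# Component standard deviations: square roots of the eigenvalues (1 + r, 1 - r)
component_sd <- sqrt(eig$values)

# Share of total variance carried by each component
var_share <- eig$values / sum(eig$values)

round(component_sd, 4)   # compare with the "Standard deviation" row above
round(var_share, 4)      # compare with the "Proportion of Variance" row above
abs(eig$vectors)         # both variables load equally (1/sqrt(2)) on each component
```

Because PC1's share is \((1 + r)/2\), it can only exceed 50% by half the correlation, which is why a weak correlation of 0.264 yields a first component explaining about 63% of the variance.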
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
shifted_lot_area <- train$LotArea - min(train$LotArea) + 1
fit <- fitdistr(shifted_lot_area, densfun = "exponential")
lambda <- fit$estimate
lambda
## rate
## 0.0001084854
After fitting the distribution, sample from it and compare the results with the original data through histograms.
### Sampling and Plotting
samples <- rexp(1000, rate = lambda)
hist(samples, main="Histogram of Exponential Samples", col="blue", breaks=30)
hist(shifted_lot_area, main="Histogram of Shifted LotArea", col="red", breaks=30)
Find the 5th and 95th percentiles using the exponential quantile function (the inverse of the cumulative distribution function) and generate a 95% confidence interval for the mean of the empirical data.
exp_5th <- qexp(0.05, rate = lambda)
exp_95th <- qexp(0.95, rate = lambda)
empirical_5th <- quantile(shifted_lot_area, 0.05)
empirical_95th <- quantile(shifted_lot_area, 0.95)
mean_lot <- mean(train$LotArea)
sd_lot <- sd(train$LotArea)
n_lot <- length(train$LotArea)
ci_lower <- mean_lot - qt(0.975, df=n_lot-1) * sd_lot / sqrt(n_lot)
ci_upper <- mean_lot + qt(0.975, df=n_lot-1) * sd_lot / sqrt(n_lot)
list(exp_5th = exp_5th, exp_95th = exp_95th, empirical_5th = empirical_5th, empirical_95th = empirical_95th, ci_lower = ci_lower, ci_upper = ci_upper)
## $exp_5th
## [1] 472.8128
##
## $exp_95th
## [1] 27614.15
##
## $empirical_5th
## 5%
## 2012.7
##
## $empirical_95th
## 95%
## 16102.15
##
## $ci_lower
## [1] 10004.42
##
## $ci_upper
## [1] 11029.24
I shifted LotArea so that its minimum equals 1 before fitting an exponential distribution, which gave a λ (lambda) parameter of approximately 0.0001085, indicating a relatively slow rate of decay (a mean of about 9,218 square feet for the shifted values). The comparison between histograms of the sampled data and the actual data shows that while the exponential model roughly captures the distribution’s right skewness, it fails to accurately represent the distribution’s tail behavior, particularly at higher values. Descriptive statistics highlight this by showing significant variance in LotArea values, ranging from 1,300 to 215,245 square feet.
The empirical 5th (2,012.7) and 95th (16,102.15) percentiles of the shifted LotArea contrast with those derived from the exponential model (472.81 for the 5th and 27,614.15 for the 95th): the bulk of the data is more concentrated than the fitted exponential implies, underscoring the model’s limitations. Moreover, the correlation between LotArea and SalePrice is weak at 0.264, suggesting limited linear predictability between lot size and sale price.
The 95% confidence interval for the mean of the unshifted LotArea (10,004.42 to 11,029.24) quantifies the uncertainty in estimating the average lot size. Taken together, these results point to the need for a more suitable distribution than the exponential to describe LotArea accurately.
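The exponential percentiles above follow from a closed form: the quantile at probability \(p\) is \(-\ln(1 - p) / \lambda\), and the maximum-likelihood rate is the reciprocal of the sample mean. The sketch below checks both against the printed values; `exp_quantile` is a hypothetical helper, and the rate is the rounded estimate from the output, so results agree only up to rounding.

```r
lambda <- 0.0001084854   # fitted rate copied from the output above

# For an exponential distribution the MLE of the rate is 1 / mean,
# so 1 / lambda recovers the mean of the shifted lot areas
mean_shifted <- 1 / lambda

# Closed-form exponential quantile function (illustrative helper)
exp_quantile <- function(p, rate) -log(1 - p) / rate

q_closed  <- exp_quantile(c(0.05, 0.95), lambda)
q_builtin <- qexp(c(0.05, 0.95), rate = lambda)

mean_shifted
rbind(closed_form = q_closed, qexp = q_builtin)
```

The ratio of the 95th to the 5th percentile is fixed at \(\ln(20) / \ln(20/19)\), roughly 58, for any exponential distribution, whereas the empirical ratio here is about 8, which is one way to see why the fit is poor.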
library(stats)
model <- lm(SalePrice ~ LotArea, data=train)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ LotArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -275668 -48169 -17725 31248 553356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.588e+05 2.915e+03 54.49 <2e-16 ***
## LotArea 2.100e+00 2.011e-01 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76650 on 1458 degrees of freedom
## Multiple R-squared: 0.06961, Adjusted R-squared: 0.06898
## F-statistic: 109.1 on 1 and 1458 DF, p-value: < 2.2e-16
The linear regression model, using LotArea to predict SalePrice, estimates that each additional square foot of lot area is associated with an increase of approximately $2.10 in sale price, an effect that is statistically significant with a p-value less than 2e-16. The model has a relatively low R-squared value of 0.06961, indicating that LotArea alone explains about 6.96% of the variance in SalePrice, so other factors play significant roles in determining property prices. The residuals show considerable variability in predictions, with errors ranging from about -$275,668 to $553,356. The intercept of approximately $158,800 is the predicted price at a lot area of zero, an extrapolation below the smallest lot in the data (1,300 square feet); it mainly shows that most of a predicted price does not depend on lot size. Overall, the model highlights the positive yet limited influence of LotArea on SalePrice and underscores the need for more complex models to better capture the dynamics of real estate pricing.
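In a simple linear regression, R-squared is the square of the Pearson correlation, and the F statistic is the square of the slope's t statistic. The sketch below checks these relationships with the printed values and shows how the fitted line turns a lot area into a predicted price. The coefficients are the rounded estimates from the summary, so predictions are approximate; `predict_price` is a hypothetical helper, not part of the original analysis.

```r
r <- 0.2638434   # correlation between LotArea and SalePrice (from earlier output)
n <- 1460        # number of observations

# R-squared in simple regression equals the squared correlation
r_squared <- r^2

# F statistic on 1 and n - 2 degrees of freedom
f_stat <- r_squared / (1 - r_squared) * (n - 2)

# Rounded coefficients copied from the model summary
intercept <- 1.588e5
slope     <- 2.100

# Illustrative helper: predicted sale price for a given lot area (sq ft)
predict_price <- function(lot_area) intercept + slope * lot_area

round(c(r_squared = r_squared, f_stat = f_stat), 4)
predict_price(c(9478, 11602))   # median and third-quartile lot areas
```

Moving from the median lot to the third-quartile lot changes the predicted price by only a few thousand dollars, which illustrates how little of the price variation this model attributes to lot size.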
test <- read.csv('https://raw.githubusercontent.com/Kingtilon1/House-prices-competition/main/test.csv')
predictions <- predict(model, newdata = test)
submission <- data.frame(Id = test$Id, SalePrice = predictions)
write.csv(submission, "submission.csv", row.names = FALSE)