DATA 605 Final
Gregg Maloy
PART I: SETUP
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
library(readr)
library(ggplot2)
library(scales)
library(dplyr)
library(MASS)
library(tidyverse)
url <- "https://raw.githubusercontent.com/greggmaloy/Data_605/main/train.csv"
data <- read_csv(url)
#write.table(data, "clipboard", sep = "\t", row.names = FALSE)
#transforming to long for plots further in the analysis
datalong <- gather(data, key="variable", value="value", c("Id", "MSSubClass", "MSZoning", "LotFrontage", "LotArea", "Street", "Alley", "LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "MasVnrArea", "ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "Heating", "HeatingQC", "CentralAir", "Electrical", "1stFlrSF", "2ndFlrSF", "LowQualFinSF", "GrLivArea", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "KitchenQual", "TotRmsAbvGrd", "Functional", "Fireplaces", "FireplaceQu", "GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea", "GarageQual", "GarageCond", "PavedDrive", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch", "PoolArea", "PoolQC", "Fence", "MiscFeature", "MiscVal", "MoSold", "YrSold", "SaleType", "SaleCondition", "SalePrice"))
#view(datalong)
#write.table(datalong, "clipboard", sep = "\t", row.names = FALSE)
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!
Let’s make ‘LotArea’ the independent variable and plot a histogram. The
resulting histogram is skewed to the right.
p <- ggplot(data, aes(x = LotArea)) +
geom_histogram(bins = 100, fill = "yellow", color = "gray") +
labs(title = "Lot Area", x = "LotArea", y = "Frequency") +
scale_x_continuous(labels = label_number(big.mark = ","))+
theme_minimal()
print(p)
Pick the dependent variable and define it as Y. Let’s make ‘SalePrice’ the dependent variable and plot a histogram. The resulting histogram is also skewed to the right (skew of about 1.9 per the describe() output in Part III), though far less severely than LotArea.
p <- ggplot(data, aes(x = SalePrice)) +
geom_histogram(bins = 100, fill = "yellow", color = "gray") +
labs(title = "Sale Price", x = "Sale Price", y = "Frequency") +
scale_x_continuous(labels = label_number(big.mark = ","))+
theme_minimal()
print(p)
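To back the visual read with a number, here is a quick moment-based skewness check (a small sketch; positive values indicate right skew, and the results should agree with the psych::describe output in Part III, roughly 12.2 for LotArea and 1.9 for SalePrice):
#sample skewness via the third standardized moment
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
skewness(data$LotArea) #~12.2: strongly right-skewed
skewness(data$SalePrice) #~1.9: right-skewed, far less extreme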
PART II: PROBABILITY
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
a. P(X>x | Y>y) b. P(X>x, Y>y) c. P(X<x | Y>y)
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
#preparing the data
#define X and Y
X= data$LotArea
Y= data$SalePrice
#compute the quartiles of X; 'x' will be the 3rd quartile
x_1quartile = quantile(X, 0.25)
x_2quartile = quantile(X, 0.50)
x_3quartile = quantile(X, 0.75)
x_max = quantile(X, 1.00)
x<-x_3quartile #assigning the 3rd quartile value to 'x'
#compute the quartiles of Y; 'y' will be the 2nd quartile
y_1quartile = quantile(Y, 0.25)
y_2quartile = quantile(Y, 0.50)
y_3quartile = quantile(Y, 0.75)
y_max = quantile(Y, 1.00)
y=y_2quartile #assigning the 2nd quartile value to 'y'
#x_1quartile
#x_2quartile
#x_3quartile
#x_max
#y_1quartile
#y_2quartile
#y_3quartile
#y_max
#validate
d2 = data %>%
dplyr::select(LotArea, SalePrice)
summary(d2)
## LotArea SalePrice
## Min. : 1300 Min. : 34900
## 1st Qu.: 7554 1st Qu.:129975
## Median : 9478 Median :163000
## Mean : 10517 Mean :180921
## 3rd Qu.: 11602 3rd Qu.:214000
## Max. :215245 Max. :755000
a. P(X > x | Y > y)
Given that Y > y, determine the probability that X >
x.
Answer: Of the houses whose SalePrice is in the top 50% (above the 2nd quartile), the probability that the LotArea is in the top 25% (above the 3rd quartile) is about 38%.
den=nrow(data)
p_x<- sum(X > x)/den
#p_x
p_y<- sum(Y > y)/den
#p_y
p_x_and_y<- sum(Y > y & X > x)/den
#p_x_and_y
p_given <-p_x_and_y/p_y
p_given
## [1] 0.3791209
b. P(X>x, Y>y)
Answer: The probability that the LotArea is in the top 25% (above the 3rd quartile) and the SalePrice is in the top 50% (above the 2nd quartile) is about 19%.
p_x_and_y
## [1] 0.1890411
c. P(X<x | Y>y)
Given that Y > y, determine the probability that X < x.
Answer: Of the houses whose SalePrice is in the top 50% (above the 2nd quartile), the probability that the LotArea is in the bottom 75% (1st, 2nd, or 3rd quartile) is about 62%. This also follows from part (a), since the two probabilities are complements: 100% - 38% = 62%.
p_xless_and_y<- sum(X < x & Y > y)/den
#p_xless_and_y
p_Xless_given_Y <- p_xless_and_y/p_y
p_Xless_given_Y
## [1] 0.6208791
In addition, make a table of counts
See the table below. The information in the table is correct, but the code could be more elegant; a more compact alternative is sketched after the table.
total=den
#total
total_2nd_Q_y<- sum(Y > y)
#total_2nd_Q_y
total_2nd_Q_yless<- sum(Y <= y)
#total_2nd_Q_yless
total_2nd_Q_x<- sum(X > x)
#total_2nd_Q_x
total_2nd_Q_xless<- sum(X <= x)
#total_2nd_Q_xless
y_and_x <- sum(Y > y & X > x)
#y_and_x
y_and_xless <-sum(X <= x & Y > y)
#y_and_xless
yless_and_xless <-sum(X <= x & Y <= y)
#yless_and_xless
yless_and_x <-sum(X > x & Y <= y)
#yless_and_x
# empty df
table_df <- data.frame(
`x/y` = c("<=3rd quartile", ">3rd quartile", "Total"),
"Less_Than_Equal_to_2nd_Quartile" = NA,
"Greater_Than_2nd_Quartile" = NA,
`Total` = NA
)
rownames(table_df) <- table_df$`x/y`
table_df$`x/y` <- NULL
table_df[1, 1] <- "<=_3rd_Quartile"
table_df[1, 2] <- yless_and_xless # <=3d quartile, <=2d quartile
table_df[1, 3] <- y_and_xless # <=3d quartile, >2d quartile
table_df[1, 4] <- total_2nd_Q_xless # <=3d quartile, Total
table_df[2, 1] <- ">_3rd_Quartile"#
table_df[2, 2] <- yless_and_x # >3d quartile, <=2d quartile
table_df[2, 3] <- y_and_x # >3d quartile, >2d quartile
table_df[2, 4] <- total_2nd_Q_x # >3d quartile, Total
table_df[3, 1] <- "Total" # Total, <=2d quartile
table_df[3, 2] <- total-total_2nd_Q_y # Total, <=2d quartile
table_df[3, 3] <- total_2nd_Q_y # Total, >2d quartile
table_df[3, 4] <- total # Total, Total
table_df
## x.y Less_Than_Equal_to_2nd_Quartile Greater_Than_2nd_Quartile
## 1 <=_3rd_Quartile 643 452
## 2 >_3rd_Quartile 89 276
## 3 Total 732 728
## Total
## 1 1095
## 2 365
## 3 1460
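As noted above, the same table of counts can be produced more compactly; a sketch using base R's table() and addmargins(), reusing the X, x, Y, and y defined earlier:
#cross-tabulate the two indicators, then add row/column totals
counts <- table(LotArea = X > x, SalePrice = Y > y) #FALSE/TRUE = at-or-below/above
addmargins(counts)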
Does splitting the training data in this fashion make them independent? Does P(A|B)=P(A)P(B)?
No, P(A|B) does not equal P(A)P(B):
\[P(A \mid B) \approx 37.9\% \neq 12.47\% \approx P(A)P(B)\]
Equivalently, \(P(A \cap B) \approx 18.9\% \neq 12.47\% \approx P(A)P(B)\). Note that for independence, P(A|B) would equal P(A) = 25%; since 37.9% ≠ 25%, knowing B changes the probability of A, so A and B are not independent.
A<- sum(X > x)
#A#count
p_A<- sum(X > x)/den # den is the total number of rows
#p_A
B<- sum(Y > y)
#B#count
p_B<- sum(Y > y)/den # den is the total number of rows
#p_B
#P(A)P(B)
P_A_and_P_B<-p_B*p_A#=12.47%
P_A_and_P_B
## [1] 0.1246575
#P(A|B) = P(A and B) / P(B)
P_A_Given_B<-p_x_and_y/p_B
P_A_Given_B
## [1] 0.3791209
Evaluate by running a Chi Square test for association
Below, the chi-square test is statistically significant and can be interpreted as a rejection of the null hypothesis of independence. Although ‘SalePrice’ and ‘LotArea’ are continuous variables, the test here is applied to their dichotomized versions (the 2x2 table of counts above/below the chosen quartiles), for which a chi-square test of association is appropriate. The result therefore agrees with the mathematical check above: the two indicators are not independent.
chisq.test(matrix(c(643,452,89,276), ncol=2))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix(c(643, 452, 89, 276), ncol = 2)
## X-squared = 127.74, df = 1, p-value < 2.2e-16
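The same test can also be run without hard-coding the counts, building the 2x2 table directly from the indicator variables (a sketch; it reproduces the matrix used above):
#chi-square test of association on the dichotomized variables
chisq.test(table(X > x, Y > y))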
PART III: DESCRIPTIVE & INFERENTIAL
STATISTICS
Provide univariate descriptive statistics and appropriate plots
for the training data set. Provide a scatter plot of X and Y. Provide a
95% CI for the difference in the mean of the variables. Derive a
correlation matrix for two of the quantitative variables you selected.
Test the hypothesis that the correlation between these variables is 0
and provide a 99% confidence interval. Discuss the meaning of your
analysis.
Descriptive statistics
library(psych)
describe(data)
## vars n mean sd median trimmed mad min
## Id 1 1460 730.50 421.61 730.5 730.50 541.15 1
## MSSubClass 2 1460 56.90 42.30 50.0 49.15 44.48 20
## MSZoning 3 1460 4.03 0.63 4.0 4.06 0.00 1
## LotFrontage 4 1201 70.05 24.28 69.0 68.94 16.31 21
## LotArea 5 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300
## Street 6 1460 2.00 0.06 2.0 2.00 0.00 1
## Alley 7 91 1.45 0.50 1.0 1.44 0.00 1
## LotShape 8 1460 2.94 1.41 4.0 3.05 0.00 1
## LandContour 9 1460 3.78 0.71 4.0 4.00 0.00 1
## Utilities 10 1460 1.00 0.03 1.0 1.00 0.00 1
## LotConfig 11 1460 4.02 1.62 5.0 4.27 0.00 1
## LandSlope 12 1460 1.06 0.28 1.0 1.00 0.00 1
## Neighborhood 13 1460 13.15 5.89 13.0 13.11 7.41 1
## Condition1 14 1460 3.03 0.87 3.0 3.00 0.00 1
## Condition2 15 1460 3.01 0.26 3.0 3.00 0.00 1
## BldgType 16 1460 1.49 1.20 1.0 1.14 0.00 1
## HouseStyle 17 1460 4.04 1.91 3.0 4.03 1.48 1
## OverallQual 18 1460 6.10 1.38 6.0 6.08 1.48 1
## OverallCond 19 1460 5.58 1.11 5.0 5.48 0.00 1
## YearBuilt 20 1460 1971.27 30.20 1973.0 1974.13 37.06 1872
## YearRemodAdd 21 1460 1984.87 20.65 1994.0 1986.37 19.27 1950
## RoofStyle 22 1460 2.41 0.83 2.0 2.26 0.00 1
## RoofMatl 23 1460 2.08 0.60 2.0 2.00 0.00 1
## Exterior1st 24 1460 10.62 3.20 13.0 10.93 1.48 1
## Exterior2nd 25 1460 11.34 3.54 14.0 11.65 2.97 1
## MasVnrType 26 1452 2.76 0.62 3.0 2.73 0.00 1
## MasVnrArea 27 1452 103.69 181.07 0.0 63.15 0.00 0
## ExterQual 28 1460 3.54 0.69 4.0 3.65 0.00 1
## ExterCond 29 1460 4.73 0.73 5.0 4.95 0.00 1
## Foundation 30 1460 2.40 0.72 2.0 2.46 1.48 1
## BsmtQual 31 1423 3.26 0.87 3.0 3.43 1.48 1
## BsmtCond 32 1423 3.81 0.66 4.0 4.00 0.00 1
## BsmtExposure 33 1422 3.27 1.15 4.0 3.46 0.00 1
## BsmtFinType1 34 1423 3.73 1.83 3.0 3.79 2.97 1
## BsmtFinSF1 35 1460 443.64 456.10 383.5 386.08 568.58 0
## BsmtFinType2 36 1422 5.71 0.94 6.0 5.98 0.00 1
## BsmtFinSF2 37 1460 46.55 161.32 0.0 1.38 0.00 0
## BsmtUnfSF 38 1460 567.24 441.87 477.5 519.29 426.99 0
## TotalBsmtSF 39 1460 1057.43 438.71 991.5 1036.70 347.67 0
## Heating 40 1460 2.04 0.30 2.0 2.00 0.00 1
## HeatingQC 41 1460 2.54 1.74 1.0 2.42 0.00 1
## CentralAir 42 1460 1.93 0.25 2.0 2.00 0.00 1
## Electrical 43 1459 4.68 1.05 5.0 5.00 0.00 1
## 1stFlrSF 44 1460 1162.63 386.59 1087.0 1129.99 347.67 334
## 2ndFlrSF 45 1460 346.99 436.53 0.0 285.36 0.00 0
## LowQualFinSF 46 1460 5.84 48.62 0.0 0.00 0.00 0
## GrLivArea 47 1460 1515.46 525.48 1464.0 1467.67 483.33 334
## BsmtFullBath 48 1460 0.43 0.52 0.0 0.39 0.00 0
## BsmtHalfBath 49 1460 0.06 0.24 0.0 0.00 0.00 0
## FullBath 50 1460 1.57 0.55 2.0 1.56 0.00 0
## HalfBath 51 1460 0.38 0.50 0.0 0.34 0.00 0
## BedroomAbvGr 52 1460 2.87 0.82 3.0 2.85 0.00 0
## KitchenAbvGr 53 1460 1.05 0.22 1.0 1.00 0.00 0
## KitchenQual 54 1460 3.34 0.83 4.0 3.50 0.00 1
## TotRmsAbvGrd 55 1460 6.52 1.63 6.0 6.41 1.48 2
## Functional 56 1460 6.75 0.98 7.0 7.00 0.00 1
## Fireplaces 57 1460 0.61 0.64 1.0 0.53 1.48 0
## FireplaceQu 58 770 3.73 1.13 3.0 3.80 1.48 1
## GarageType 59 1379 3.28 1.79 2.0 3.11 0.00 1
## GarageYrBlt 60 1379 1978.51 24.69 1980.0 1981.07 31.13 1900
## GarageFinish 61 1379 2.18 0.81 2.0 2.23 1.48 1
## GarageCars 62 1460 1.77 0.75 2.0 1.77 0.00 0
## GarageArea 63 1460 472.98 213.80 480.0 469.81 177.91 0
## GarageQual 64 1379 4.86 0.61 5.0 5.00 0.00 1
## GarageCond 65 1379 4.90 0.52 5.0 5.00 0.00 1
## PavedDrive 66 1460 2.86 0.50 3.0 3.00 0.00 1
## WoodDeckSF 67 1460 94.24 125.34 0.0 71.76 0.00 0
## OpenPorchSF 68 1460 46.66 66.26 25.0 33.23 37.06 0
## EnclosedPorch 69 1460 21.95 61.12 0.0 3.87 0.00 0
## 3SsnPorch 70 1460 3.41 29.32 0.0 0.00 0.00 0
## ScreenPorch 71 1460 15.06 55.76 0.0 0.00 0.00 0
## PoolArea 72 1460 2.76 40.18 0.0 0.00 0.00 0
## PoolQC 73 7 2.14 0.90 2.0 2.14 1.48 1
## Fence 74 281 2.43 0.86 3.0 2.48 0.00 1
## MiscFeature 75 54 2.91 0.45 3.0 3.00 0.00 1
## MiscVal 76 1460 43.49 496.12 0.0 0.00 0.00 0
## MoSold 77 1460 6.32 2.70 6.0 6.25 2.97 1
## YrSold 78 1460 2007.82 1.33 2008.0 2007.77 1.48 2006
## SaleType 79 1460 8.51 1.56 9.0 8.92 0.00 1
## SaleCondition 80 1460 4.77 1.10 5.0 5.00 0.00 1
## SalePrice 81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
## max range skew kurtosis se
## Id 1460 1459 0.00 -1.20 11.03
## MSSubClass 190 170 1.40 1.56 1.11
## MSZoning 5 4 -1.73 6.25 0.02
## LotFrontage 313 292 2.16 17.34 0.70
## LotArea 215245 213945 12.18 202.26 261.22
## Street 2 1 -15.49 238.01 0.00
## Alley 2 1 0.20 -1.98 0.05
## LotShape 4 3 -0.61 -1.60 0.04
## LandContour 4 3 -3.16 8.65 0.02
## Utilities 2 1 38.13 1453.00 0.00
## LotConfig 5 4 -1.13 -0.59 0.04
## LandSlope 3 2 4.80 24.47 0.01
## Neighborhood 25 24 0.02 -1.06 0.15
## Condition1 9 8 3.01 16.34 0.02
## Condition2 8 7 13.14 247.54 0.01
## BldgType 5 4 2.24 3.41 0.03
## HouseStyle 8 7 0.31 -0.96 0.05
## OverallQual 10 9 0.22 0.09 0.04
## OverallCond 9 8 0.69 1.09 0.03
## YearBuilt 2010 138 -0.61 -0.45 0.79
## YearRemodAdd 2010 60 -0.50 -1.27 0.54
## RoofStyle 6 5 1.47 0.61 0.02
## RoofMatl 8 7 8.09 66.28 0.02
## Exterior1st 15 14 -0.72 -0.37 0.08
## Exterior2nd 16 15 -0.69 -0.52 0.09
## MasVnrType 4 3 -0.07 -0.13 0.02
## MasVnrArea 1600 1600 2.66 10.03 4.75
## ExterQual 4 3 -1.83 3.86 0.02
## ExterCond 5 4 -2.56 5.29 0.02
## Foundation 6 5 0.09 1.02 0.02
## BsmtQual 4 3 -1.31 1.27 0.02
## BsmtCond 4 3 -3.39 10.14 0.02
## BsmtExposure 4 3 -1.15 -0.39 0.03
## BsmtFinType1 6 5 -0.02 -1.39 0.05
## BsmtFinSF1 5644 5644 1.68 11.06 11.94
## BsmtFinType2 6 5 -3.56 12.32 0.02
## BsmtFinSF2 1474 1474 4.25 20.01 4.22
## BsmtUnfSF 2336 2336 0.92 0.46 11.56
## TotalBsmtSF 6110 6110 1.52 13.18 11.48
## Heating 6 5 9.83 110.98 0.01
## HeatingQC 5 4 0.48 -1.51 0.05
## CentralAir 2 1 -3.52 10.42 0.01
## Electrical 5 4 -3.06 7.49 0.03
## 1stFlrSF 4692 4358 1.37 5.71 10.12
## 2ndFlrSF 2065 2065 0.81 -0.56 11.42
## LowQualFinSF 572 572 8.99 82.83 1.27
## GrLivArea 5642 5308 1.36 4.86 13.75
## BsmtFullBath 3 3 0.59 -0.84 0.01
## BsmtHalfBath 2 2 4.09 16.31 0.01
## FullBath 3 3 0.04 -0.86 0.01
## HalfBath 2 2 0.67 -1.08 0.01
## BedroomAbvGr 8 8 0.21 2.21 0.02
## KitchenAbvGr 3 3 4.48 21.42 0.01
## KitchenQual 4 3 -1.42 1.72 0.02
## TotRmsAbvGrd 14 12 0.67 0.87 0.04
## Functional 7 6 -4.08 16.37 0.03
## Fireplaces 3 3 0.65 -0.22 0.02
## FireplaceQu 5 4 -0.16 -0.98 0.04
## GarageType 6 5 0.76 -1.30 0.05
## GarageYrBlt 2010 110 -0.65 -0.42 0.66
## GarageFinish 3 2 -0.35 -1.41 0.02
## GarageCars 4 4 -0.34 0.21 0.02
## GarageArea 1418 1418 0.18 0.90 5.60
## GarageQual 5 4 -4.43 18.25 0.02
## GarageCond 5 4 -5.28 26.77 0.01
## PavedDrive 3 2 -3.30 9.22 0.01
## WoodDeckSF 857 857 1.54 2.97 3.28
## OpenPorchSF 547 547 2.36 8.44 1.73
## EnclosedPorch 552 552 3.08 10.37 1.60
## 3SsnPorch 508 508 10.28 123.06 0.77
## ScreenPorch 480 480 4.11 18.34 1.46
## PoolArea 738 738 14.80 222.19 1.05
## PoolQC 3 2 -0.22 -1.90 0.34
## Fence 4 3 -0.57 -0.88 0.05
## MiscFeature 4 3 -2.93 10.71 0.06
## MiscVal 15500 15500 24.43 697.64 12.98
## MoSold 12 11 0.21 -0.41 0.07
## YrSold 2010 4 0.10 -1.19 0.03
## SaleType 9 8 -3.83 14.57 0.04
## SaleCondition 6 5 -2.74 6.82 0.03
## SalePrice 755000 720100 1.88 6.50 2079.11
p <- ggplot(datalong, aes(x = value)) +
geom_bar(fill = "yellow", color = "black") + #counts per value; value is character after gather()
facet_wrap(~ variable, scales = "free") +
labs(title = "Histograms of Train Dataset Variables", x = "Value", y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 20),
axis.title = element_text(size = 20, face = "bold"),
legend.position = "none",
strip.text = element_text(face = "bold", size = 20))
p
Provide a scatter plot of X and Y
data %>%
ggplot(aes(x = LotArea, y = SalePrice)) +
geom_point(alpha = 0.3, color = "black", fill = "yellow", shape = 21) +
labs(x = "Lot Area", y = "Sale Price", title = "Lot Area versus Sale Price") +
scale_y_continuous(labels = label_number(big.mark = ",")) +
theme_minimal()
Provide a 95% CI for the difference in the mean of the variables
Interpretation: If we repeatedly drew samples of this size and computed this interval each time, about 95 out of 100 such intervals would contain the true mean difference between LotArea and SalePrice; here the 95% CI is (-174378.4, -166430.4). The p-value (< 2.2e-16) is statistically significant, so we reject the null hypothesis that the mean difference is zero.
t1 <- t.test( data$LotArea, data$SalePrice, paired = TRUE, conf.level = 0.95)
t1
##
## Paired t-test
##
## data: data$LotArea and data$SalePrice
## t = -84.112, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -174378.4 -166430.4
## sample estimates:
## mean difference
## -170404.4
Derive a correlation matrix and test the hypothesis that the
correlation between these variables is 0.
The correlation coefficient of 0.2638 denotes a weak positive association between LotArea and SalePrice. Generally, as the lot area increases, the sale price tends to increase, but the relationship is weak, as the correlation coefficient is far from 1.
#weak positive correlation
c <- cor(data[,c("LotArea","SalePrice")])
c
## LotArea SalePrice
## LotArea 1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
Provide a 99% confidence interval
cor.test(data$LotArea, data$SalePrice,conf.level=.99)
##
## Pearson's product-moment correlation
##
## data: data$LotArea and data$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.2000196 0.3254375
## sample estimates:
## cor
## 0.2638434
Discuss the meaning of your analysis
There is a statistically significant positive, but weak, linear relationship between LotArea and SalePrice. Despite the weak correlation, the p-value indicates that the association is unlikely to be due to chance. The 99% confidence interval, (0.2000, 0.3254), further supports a positive association, as it does not include 0.
PART IV: LINEAR ALGEBRA &
CORRELATION
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
Invert your correlation matrix
p <- solve(c)
p
## LotArea SalePrice
## LotArea 1.0748219 -0.2835846
## SalePrice -0.2835846 1.0748219
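As the prompt notes, the diagonal of the precision matrix holds variance inflation factors; for two variables the closed form is 1/(1 - r^2), which we can verify directly against the output above (a one-line check):
#VIF for a two-variable model: 1/(1 - r^2)
r <- c["LotArea", "SalePrice"]
1/(1 - r^2) #~1.0748, matching the diagonal of the precision matrix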
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix
#multiply the correlation matrix by the precision matrix
p %*% c
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
#multiply the precision matrix by the correlation matrix
c %*% p
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
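Both products print as the 2x2 identity matrix (off-diagonal entries are zero up to floating-point rounding), confirming that the precision matrix is indeed the inverse of the correlation matrix; a quick numerical check:
#compare to the identity matrix, ignoring dimnames
all.equal(p %*% c, diag(2), check.attributes = FALSE) #TRUE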
Conduct principal components analysis, interpret and discuss
PC1, which represents the combined measure of the two variables, captures about 63% of the variance in the data, with LotArea and SalePrice contributing equally and positively.
PC2, which represents the contrast between the two variables, captures the remaining 37% of the variance, with LotArea and SalePrice contributing equally but in opposite directions.
The results show that the variability in LotArea and SalePrice decomposes into two components: the first captures their combined effect and the second their difference. For two standardized variables, the eigenvalues of the correlation matrix are 1 + r and 1 - r, i.e., 1.2638 and 0.7362 here, matching pca_result$sdev^2 below.
df <- data.frame(LotArea = data$LotArea, SalePrice = data$SalePrice)
#prcomp() standardizes internally via center/scale., so a separate scale() step is unnecessary
pca_result <- prcomp(df, center = TRUE, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2
## Standard deviation 1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion 0.6319 1.0000
print(pca_result$rotation)
## PC1 PC2
## LotArea 0.7071068 0.7071068
## SalePrice 0.7071068 -0.7071068
print(pca_result$sdev^2)
## [1] 1.2638434 0.7361566
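The variance decomposition can be checked against the correlation directly (a small sketch of the 1 ± r identity stated above):
#eigenvalues of a 2x2 correlation matrix are 1 + r and 1 - r
r <- cor(data$LotArea, data$SalePrice)
c(1 + r, 1 - r) #~1.2638 and ~0.7362, matching pca_result$sdev^2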
PART V: CALCULUS-BASED PROBABILITY &
STATISTICS
Many times, it makes sense to fit a closed form distribution to
data. For your variable that is skewed to the right, shift it so that
the minimum value is above zero. Then load the MASS package and run
fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your
original variable. Using the exponential pdf, find the 5th and 95th
percentiles using the cumulative distribution function (CDF). Also
generate a 95% confidence interval from the empirical data, assuming
normality. Finally, provide the empirical 5th percentile and 95th
percentile of the data. Discuss.
For your variable that is skewed to the right, shift it so
that the minimum value is above zero
There is no need to shift as the minimum value is already above zero
#minimum is above zero
summary(data$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
Run fitdistr to fit an exponential probability density function
exp_pdf = fitdistr(data$LotArea, "exponential")
exp_pdf
## rate
## 9.508570e-05
## (2.488507e-06)
Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ))
lambda = exp_pdf$estimate
lambda
## rate
## 9.50857e-05
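For the exponential distribution, the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, so the fitted λ can be sanity-checked in one line:
#MLE of the exponential rate = 1 / sample mean
1/mean(data$LotArea) #~9.5086e-05, matching fitdistr's estimate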
samples <- rexp(1000, rate = lambda) #random draws; exact results vary run to run without set.seed()
Plot a histogram and compare it with a histogram of your original variable
par(mfrow = c(1, 2))
hist(data$LotArea, main = "Original Variable", xlab = "Value")
hist(samples, main = "Exponential Distribution", xlab = "Value")
Using the exponential pdf, find the 5th and 95th percentiles
using the cumulative distribution function (CDF)
quantile(samples, probs=c(0.05, 0.95)) #empirical quantiles of the 1000 simulated draws
## 5% 95%
## 567.5301 30915.0411
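The prompt asks for percentiles via the CDF itself; inverting F(q) = 1 - exp(-λq) gives q_p = -log(1 - p)/λ, which qexp() computes directly. For the fitted λ this gives approximately 539 and 31506, close to the simulated values above:
#theoretical 5th and 95th percentiles of the fitted exponential
qexp(c(0.05, 0.95), rate = lambda)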
Also generate a 95% confidence interval from the empirical data, assuming normality
#
mean_x <- mean(data$LotArea)
std_x <- sd(data$LotArea)
n <- length(data$LotArea)
z <- qnorm(1 - 0.05/2)
lower_ci <- mean_x - z * (std_x / sqrt(n))
upper_ci <- mean_x + z * (std_x / sqrt(n))
lower_ci
## [1] 10004.84
upper_ci
## [1] 11028.81
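For reference, essentially the same interval can be obtained with t.test(); with n = 1460 the t quantile is nearly identical to the z quantile used above (a sketch):
#one-sample t interval for the mean of LotArea
t.test(data$LotArea, conf.level = 0.95)$conf.int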
Provide the empirical 5th percentile and 95th percentile of the data
quantile(data$LotArea, probs=c(0.05, 0.95))
## 5% 95%
## 3311.70 17401.15
Discuss
The exponential model is a poor fit for LotArea. Its 5th and 95th percentiles (567.53 and 30915.04 from the simulated draws) are far more spread out than the empirical 5th and 95th percentiles (3311.70 and 17401.15): the exponential pdf puts too much mass near zero and in the far right tail, while the observed lot areas are concentrated between roughly 3,000 and 17,500. Note that the 95% confidence interval computed above (10004.84, 11028.81) is an interval for the mean of LotArea, not a percentile range of the data, so it is not directly comparable to the percentiles.
exponential percentiles - 5%: 567.53, 95%: 30915.04
empirical percentiles - 5%: 3311.70, 95%: 17401.15
PART VI: MODELING
Build some type of regression model and submit your model to the
competition board. Provide your complete model summary and results with
analysis. Report your Kaggle.com user name and score.
Simple Linear Regression
train.lm <- lm(SalePrice ~ LotArea, data=data)
summary(train.lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -275668 -48169 -17725 31248 553356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.588e+05 2.915e+03 54.49 <2e-16 ***
## LotArea 2.100e+00 2.011e-01 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76650 on 1458 degrees of freedom
## Multiple R-squared: 0.06961, Adjusted R-squared: 0.06898
## F-statistic: 109.1 on 1 and 1458 DF, p-value: < 2.2e-16
Residuals
The median residual is far from zero, and the first and third quartiles are not symmetric about zero, which suggests that the residuals are not normally distributed. Additionally, the minimum and maximum residuals are of very different magnitudes, which provides further evidence that the model is not a good fit.
Coefficients
The LotArea p-value for this model is statistically significant (p < 0.05). The t-value for LotArea is 10.45; this high t-value indicates that the LotArea coefficient is a strongly significant predictor of SalePrice.
Residual Standard Error and Degrees of Freedom
Generally speaking, a residual standard error roughly 1.5 times the magnitude of the first- and third-quartile residuals is consistent with normally distributed residuals. Here the RSE (76,650) is about 1.6 times the first-quartile residual (48,169) but roughly 2.5 times the third-quartile residual (31,248), suggesting that the residuals are not normally distributed.
The Multiple R-squared Value
The reported R² of 0.06961 for this model means that only about 7% of the variability in SalePrice is explained by LotArea.
Residual Plots
# Plot
par(mfrow=c(2,2))
plot(train.lm)
Below, residual analysis is conducted on the Residuals versus Fitted plot and the Q-Q plot shown above.
Residuals versus Fitted Plot
For a Residuals versus Fitted plot to support the linear model, the residuals should be scattered randomly around the horizontal line where the residual equals zero. The plot above provides possible evidence that (1) there are significant outliers influencing the model, and (2) the relationship is not linear, as the points are not scattered randomly around zero.
Q-Q Plot
For the Q-Q plot to support our linear model, we would expect the plotted values to follow a straight line, indicating normally distributed residuals. Our model’s Q-Q plot (above) suggests that the distribution of the residuals is not normal, with the right tail deviating significantly from the expected straight line, pointing to skew and outliers. This suggests that the model could be improved.
Conclusion Although there is evidence of an association between LotArea and SalePrice, chiefly the statistically significant t-value, the residual analysis points to outliers and a non-normal residual distribution. Accounting for the outliers could improve the model, as could a multiple regression approach; even then, the relationship between the two variables may not be linear. More analysis is needed to improve the model; one possible refinement is sketched below.
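One possible refinement (a hypothetical sketch, not the submitted model): regressing log(SalePrice) on log(LotArea) damps the right skew and the influence of very large lots, and would likely improve the residual diagnostics discussed above:
#hypothetical log-log refinement of the simple model
train.loglm <- lm(log(SalePrice) ~ log(LotArea), data = data)
summary(train.loglm) #compare R-squared and residual quartiles to train.lm
par(mfrow = c(2, 2))
plot(train.loglm) #re-check the Residuals vs Fitted and Q-Q diagnostics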