You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
We can start by plotting a correlation matrix of the Pearson correlation coefficients between the numeric predictors. This gives an overview of the dataset and provides insight for model building:
# Libraries used below: dplyr (select_if), tidyr (gather), ggplot2,
# reshape2 (melt), gridExtra (grid.arrange)
library(dplyr); library(tidyr); library(ggplot2)
library(reshape2); library(gridExtra)
df_train <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-605/main/train.csv")
df.orig <- df_train
tib <- df_train
# Keep only the numeric variables for the correlation matrix
df.num <- select_if(df_train, is.numeric)
mat.correlation <- cor(df.num, use = 'complete')
mat.correlation[upper.tri(mat.correlation)] <- NA
mlt.correlation <- melt(mat.correlation)
ggplot(data = mlt.correlation, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, vjust = 1,
size = 5, hjust = 1),
axis.text.y = element_text(size = 7)) +
coord_fixed()
Next, summary() reports the minimum, first quartile, median, mean, third quartile, maximum, and the number of NA's for each numeric variable in the dataset (character columns show only their length and class):
summary(df_train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
The skimr package gives a richer summary: for every variable it reports the number of missing values and the completeness rate; for numeric variables it adds the mean, standard deviation, the quantiles p0 through p100, and an inline histogram; and for character variables it adds the range of string lengths and the number of unique values.
library(skimr)
skim(df_train)
Name | df_train |
Number of rows | 1460 |
Number of columns | 81 |
_______________________ | |
Column type frequency: | |
character | 43 |
numeric | 38 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
MSZoning | 0 | 1.00 | 2 | 7 | 0 | 5 | 0 |
Street | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
Alley | 1369 | 0.06 | 4 | 4 | 0 | 2 | 0 |
LotShape | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
LandContour | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
Utilities | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
LotConfig | 0 | 1.00 | 3 | 7 | 0 | 5 | 0 |
LandSlope | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
Neighborhood | 0 | 1.00 | 5 | 7 | 0 | 25 | 0 |
Condition1 | 0 | 1.00 | 4 | 6 | 0 | 9 | 0 |
Condition2 | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
BldgType | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
HouseStyle | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
RoofStyle | 0 | 1.00 | 3 | 7 | 0 | 6 | 0 |
RoofMatl | 0 | 1.00 | 4 | 7 | 0 | 8 | 0 |
Exterior1st | 0 | 1.00 | 5 | 7 | 0 | 15 | 0 |
Exterior2nd | 0 | 1.00 | 5 | 7 | 0 | 16 | 0 |
MasVnrType | 8 | 0.99 | 4 | 7 | 0 | 4 | 0 |
ExterQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
ExterCond | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
Foundation | 0 | 1.00 | 4 | 6 | 0 | 6 | 0 |
BsmtQual | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
BsmtCond | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
BsmtExposure | 38 | 0.97 | 2 | 2 | 0 | 4 | 0 |
BsmtFinType1 | 37 | 0.97 | 3 | 3 | 0 | 6 | 0 |
BsmtFinType2 | 38 | 0.97 | 3 | 3 | 0 | 6 | 0 |
Heating | 0 | 1.00 | 4 | 5 | 0 | 6 | 0 |
HeatingQC | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
CentralAir | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
Electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
KitchenQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
Functional | 0 | 1.00 | 3 | 4 | 0 | 7 | 0 |
FireplaceQu | 690 | 0.53 | 2 | 2 | 0 | 5 | 0 |
GarageType | 81 | 0.94 | 6 | 7 | 0 | 6 | 0 |
GarageFinish | 81 | 0.94 | 3 | 3 | 0 | 3 | 0 |
GarageQual | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
GarageCond | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
PavedDrive | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
PoolQC | 1453 | 0.00 | 2 | 2 | 0 | 3 | 0 |
Fence | 1179 | 0.19 | 4 | 5 | 0 | 4 | 0 |
MiscFeature | 1406 | 0.04 | 4 | 4 | 0 | 4 | 0 |
SaleType | 0 | 1.00 | 2 | 5 | 0 | 9 | 0 |
SaleCondition | 0 | 1.00 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 | ▇▇▇▇▇ |
MSSubClass | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
OverallQual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
OverallCond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
YearRemodAdd | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
MasVnrArea | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 | ▇▁▁▁▁ |
BsmtFinSF1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 | ▇▁▁▁▁ |
BsmtFinSF2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 | ▇▁▁▁▁ |
BsmtUnfSF | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 | ▇▅▂▁▁ |
TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
X1stFlrSF | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 | ▇▅▁▁▁ |
X2ndFlrSF | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 | ▇▃▂▁▁ |
LowQualFinSF | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 | ▇▁▁▁▁ |
GrLivArea | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 | ▇▇▁▁▁ |
BsmtFullBath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
BsmtHalfBath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
FullBath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▁▇▁▇▁ |
HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
Fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▁▁ |
GarageYrBlt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 | ▁▁▅▅▇ |
GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
WoodDeckSF | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 | ▇▂▁▁▁ |
OpenPorchSF | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 | ▇▁▁▁▁ |
EnclosedPorch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 | ▇▁▁▁▁ |
X3SsnPorch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
ScreenPorch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 | ▇▁▁▁▁ |
PoolArea | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 | ▇▁▁▁▁ |
MiscVal | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 | ▇▁▁▁▁ |
MoSold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 | ▃▆▇▃▃ |
YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 | ▇▅▁▁▁ |
Here is a brief description, taken from the data description file, of the variables used below.
Dependent Variable
SalePrice: the property's sale price in dollars
Independent Variables
LotArea: Lot size in square feet
YearBuilt: Original construction date
PoolArea: Pool area in square feet
subset.df_train <- subset(df_train, select = c("SalePrice","LotArea", "YearBuilt","PoolArea"))
head(subset.df_train)
## SalePrice LotArea YearBuilt PoolArea
## 1 208500 8450 2003 0
## 2 181500 9600 1976 0
## 3 223500 11250 2001 0
## 4 140000 9550 1915 0
## 5 250000 14260 2000 0
## 6 143000 14115 1993 0
p1 <-ggplot(df_train, aes(x=LotArea)) +
geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
geom_density(alpha=.2, fill="green")+
labs(title = "Lot Area", x = "", y = "")
p2 <- ggplot(df_train, aes(x=YearBuilt)) +
geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
geom_density(alpha=.2, fill="green")+
labs(title = "Original construction date", x = "", y = "")
p3 <-ggplot(df_train, aes(x=PoolArea)) +
geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
geom_density(alpha=.2, fill="green")+
labs(title = "Pool Area", x = "", y = "")
grid.arrange(p1, p2,p3, nrow=2)
s1 <- ggplot(df_train, aes(sample = LotArea))+
stat_qq()+
stat_qq_line()+
labs(title="Lot Area",x = "", y = "")
s2 <- ggplot(df_train, aes(sample = YearBuilt))+
stat_qq()+
stat_qq_line()+
labs(title="Original construction date", x = "", y = "")
s3 <- ggplot(df_train, aes(sample = PoolArea))+
stat_qq()+
stat_qq_line()+
labs(title="Pool Area", x = "", y = "")
grid.arrange(s1, s2, s3, nrow=2)
subset.df_train %>%
select_if(is.numeric) %>%
tidyr::gather()%>%
ggplot(aes(x = key, y = value)) +
geom_boxplot() +
xlab("Variable") +
ylab("Value")
subset.df_train %>%
select_if(is.numeric) %>%
tidyr::gather()%>%
ggplot(aes(x = value, fill = key)) +
geom_density(alpha = 0.5) +
xlab("Value") +
ylab("Density") +
facet_wrap(~key, ncol = 1, scales = "free")
# Scatterplots of SalePrice against each selected predictor
# (a full scatterplot matrix follows below)
ggplot(df_train, aes(x = LotArea, y = SalePrice)) +
geom_point() +
labs(title = "Scatterplot of LotArea and SalePrice")
ggplot(df_train, aes(x = YearBuilt, y = SalePrice)) +
geom_point() +
labs(title = "Scatterplot of YearBuilt and SalePrice")
ggplot(df_train, aes(x = PoolArea, y = SalePrice)) +
geom_point() +
labs(title = "Scatterplot of PoolArea and SalePrice")
# Select three quantitative variables
vars <- c("SalePrice", "LotArea", "YearBuilt")
df_sub <- df_train[, vars]
# Calculate correlation matrix
cor_matrix <- cor(df_sub)
# Print correlation matrix
print(cor_matrix)
## SalePrice LotArea YearBuilt
## SalePrice 1.0000000 0.26384335 0.52289733
## LotArea 0.2638434 1.00000000 0.01422765
## YearBuilt 0.5228973 0.01422765 1.00000000
df.orig <- subset.df_train
tib <- df_train
mat.correlation <- cor(subset.df_train, use = 'complete')
mat.correlation[upper.tri(mat.correlation)] <- NA
mlt.correlation <- melt(mat.correlation)
ggplot(data = mlt.correlation, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 8, hjust = 1), axis.text.y = element_text(size = 8))+
coord_fixed()
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Hypotheses:
H0: The correlation between each pair of variables is 0.
HA: The correlation between each pair of variables is not 0.
cor.test(subset.df_train$LotArea, subset.df_train$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: subset.df_train$LotArea and subset.df_train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(subset.df_train$YearBuilt, subset.df_train$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: subset.df_train$YearBuilt and subset.df_train$SalePrice
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
cor.test(subset.df_train$PoolArea, subset.df_train$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: subset.df_train$PoolArea and subset.df_train$SalePrice
## t = 3.5435, df = 1458, p-value = 0.0004073
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.05902496 0.12557575
## sample estimates:
## cor
## 0.09240355
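The discussion below refers to a corr_results data frame that is not built in the chunks above; a minimal sketch that assembles one from the same three cor.test() calls could look like this (the object and column names are illustrative, not from the original code):
# Collect the three pairwise tests into one data frame
pairs_list <- list(c("LotArea", "SalePrice"),
                   c("YearBuilt", "SalePrice"),
                   c("PoolArea", "SalePrice"))
corr_results <- do.call(rbind, lapply(pairs_list, function(v) {
  ct <- cor.test(subset.df_train[[v[1]]], subset.df_train[[v[2]]], conf.level = 0.80)
  data.frame(pair = paste(v, collapse = " vs "),
             cor = unname(ct$estimate),
             t = unname(ct$statistic),
             p.value = ct$p.value,
             ci.lower = ct$conf.int[1],
             ci.upper = ct$conf.int[2])
}))
corr_results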
The correlation coefficient between PoolArea and SalePrice is 0.0924, which indicates a weak positive correlation between the two variables.
The t value is 3.5435, and the p-value is 0.0004073, which means that the correlation coefficient is statistically significant at a significance level of 0.05. Therefore, we can reject the null hypothesis that there is no correlation between PoolArea and SalePrice.
The 80 percent confidence interval for the true correlation coefficient runs from 0.059 to 0.126, meaning we are 80 percent confident that the true correlation falls within this range.
The corr_results data frame assembled above collects the correlation coefficient, t-value, p-value, and 80% confidence interval for each pairwise correlation.
The null hypothesis of zero correlation is rejected when the p-value falls below the chosen significance level. An 80% confidence interval corresponds to a two-sided significance level of 0.20; all three p-values here are far below both 0.20 and the conventional 0.05.
In terms of familywise error, it would be a concern if we were running many hypothesis tests simultaneously without adjusting the significance level: with three tests each at alpha = 0.05, the probability of at least one false rejection rises to about 1 - (1 - 0.05)^3, roughly 0.14. Since we are conducting only three tests and all of the p-values are orders of magnitude below any reasonable threshold, familywise error is not a major concern here, and a Bonferroni correction would not change the conclusions.
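As a quick check on that claim, the three p-values can be adjusted for multiple comparisons; this is a sketch assuming the corr_results data frame assembled above, and the adjusted p-values remain essentially zero:
# Bonferroni and Holm adjustments for the three pairwise tests
p.adjust(corr_results$p.value, method = "bonferroni")
p.adjust(corr_results$p.value, method = "holm")
# Familywise error rate if each of the three tests were run at alpha = 0.05 unadjusted
1 - (1 - 0.05)^3   # roughly 0.14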
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# To invert a matrix, use the solve() function in R; the inverse of the
# correlation matrix is the precision matrix.
prec_mat <- solve(cor_matrix)
cor_prec <- cor_matrix %*% prec_mat
prec_cor <- prec_mat %*% cor_matrix
# LU decomposition via the matrixcalc package
library(matrixcalc)
lu_decomp <- lu.decomposition(cor_matrix)
L <- lu_decomp$L
L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.2638434 1.0000000 0
## [3,] 0.5228973 -0.1329934 1
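For completeness, a short sketch (not part of the original output) that prints U, verifies that L %*% U reconstructs the correlation matrix, shows that both correlation-precision products are identity matrices up to rounding, and reads the variance inflation factors off the diagonal of the precision matrix:
U <- lu_decomp$U
U
# The factorization should reproduce the original correlation matrix
all.equal(L %*% U, cor_matrix, check.attributes = FALSE)
# Both products of the correlation and precision matrices are ~ identity
round(cor_prec, 10)
round(prec_cor, 10)
# Variance inflation factors sit on the diagonal of the precision matrix
diag(prec_mat)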
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
numeric_cols <- which(sapply(df_train, is.numeric))
all_num <- df_train[, numeric_cols]
df_train %>%
dplyr::select(names(all_num)) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
From above we can see BsmtUnfSF is skewed to the right.
ggplot(df_train, aes(BsmtUnfSF)) +
geom_histogram(bins = 30, fill = "lightgreen", color = "white") +
labs(title = "Histogram of Basement Unfinished Square Feet",
x = "Basement Unfinished Square Feet",
y = "Count")
summary(df_train$BsmtUnfSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 223.0 477.5 567.2 808.0 2336.0
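The summary already hints at right skew (the mean of 567.2 sits well above the median of 477.5); as a numeric check, the sample skewness can be computed in base R without extra packages (a small sketch, not from the original code):
# Sample skewness: third standardized central moment (positive => right skew)
x <- df_train$BsmtUnfSF
mean((x - mean(x))^3) / sd(x)^3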
fit_data <- df_train$BsmtUnfSF
fit_data <- fit_data[complete.cases(fit_data)]
hist(fit_data)
length(fit_data[fit_data == 0])
## [1] 118
# Shift so the minimum is strictly above zero (118 observations are exactly 0)
fit_data <- fit_data + .01
library(MASS)
# fitdistr() is applied to the original (unshifted) variable here; for the
# exponential MLE the 0.01 shift makes no practical difference to the rate.
BsmtUnfSF_exp_dist <- fitdistr(df_train$BsmtUnfSF, 'exponential')
BsmtUnfSF_lamb <- BsmtUnfSF_exp_dist$estimate
BsmtUnfSF_lamb
## rate
## 0.001762921
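As a sanity check (not in the original output), the maximum-likelihood estimate of an exponential rate is simply the reciprocal of the sample mean, so fitdistr's rate can be reproduced directly:
# MLE of the exponential rate is 1 / sample mean; should match 0.001762921
1 / mean(df_train$BsmtUnfSF)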
set.seed(1000)
BsmtUnfSF_sample <- rexp(1000,BsmtUnfSF_lamb)
hist(BsmtUnfSF_sample)
par(mfrow=c(1,2))
hist(fit_data)
hist(BsmtUnfSF_sample)
The histograms of fit_data and BsmtUnfSF_sample are both right-skewed; however, the second bin of BsmtUnfSF_sample has a frequency roughly double that of the corresponding bin of fit_data.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
qexp(.05, rate=BsmtUnfSF_lamb)
## [1] 29.09563
qexp(.95, rate=BsmtUnfSF_lamb)
## [1] 1699.3
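These quantiles can also be obtained by inverting the exponential CDF by hand, since F(x) = 1 - exp(-lambda * x) gives x_p = -log(1 - p) / lambda; a quick check that this matches qexp():
# Invert the exponential CDF directly: x_p = -log(1 - p) / lambda
-log(1 - 0.05) / BsmtUnfSF_lamb   # 5th percentile, ~29.1
-log(1 - 0.95) / BsmtUnfSF_lamb   # 95th percentile, ~1699.3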
norm.interval <- function(data, variance = var(data), conf.level = 0.95) {
z <- qnorm((1 - conf.level)/2, lower.tail = FALSE)
xbar <- mean(data)
sdx <- sqrt(variance/length(data))
c(xbar - z * sdx, xbar + z * sdx)
}
norm.interval(fit_data, variance=var(fit_data), conf.level = 0.95)
## [1] 544.5850 589.9158
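As a cross-check (not in the original output), t.test() gives an almost identical 95% interval, since with n = 1460 the t and normal critical values are nearly the same:
# t-based 95% confidence interval for the mean of the shifted variable
t.test(fit_data)$conf.int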
quantile(x=fit_data, probs=c(.05, .95))
## 5% 95%
## 0.01 1468.01
We are 95% confident that the mean unfinished basement area is between 544.59 and 589.92 square feet. Comparing tails, the empirical 5th percentile (0.01, reflecting the 118 zero-area basements) is far below the exponential distribution's 29.1, and the empirical 95th percentile (1468.01) is somewhat below the exponential's 1699.3. The exponential distribution captures the overall right skew of the variable, but it cannot capture the spike of observations at zero, so the fit is only approximate.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
I am choosing the following variables for my multiple linear regression model.
Dependent Variable: SalePrice
Independent Variables: LotArea, PoolArea, YearBuilt, TotalBsmtSF
We want to build a model for estimating SalePrice based on the LotArea, PoolArea, YearBuilt, and TotalBsmtSF of each house.
Model 1:
model_1 <- lm(SalePrice ~ LotArea + PoolArea + YearBuilt + TotalBsmtSF, data = df_train)
summary(model_1)
##
## Call:
## lm(formula = SalePrice ~ LotArea + PoolArea + YearBuilt + TotalBsmtSF,
## data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -537777 -34223 -10964 24191 431224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.730e+06 1.046e+05 -16.529 < 2e-16 ***
## LotArea 1.140e+00 1.552e-01 7.346 3.39e-13 ***
## PoolArea 4.859e+01 3.737e+01 1.300 0.194
## YearBuilt 9.207e+02 5.380e+01 17.114 < 2e-16 ***
## TotalBsmtSF 7.897e+01 3.859e+00 20.463 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56770 on 1455 degrees of freedom
## Multiple R-squared: 0.4907, Adjusted R-squared: 0.4893
## F-statistic: 350.4 on 4 and 1455 DF, p-value: < 2.2e-16
This model is a multiple linear regression model with SalePrice as the response variable and LotArea, PoolArea, YearBuilt and TotalBsmtSF as the predictor variables.
Multiple R-squared: The multiple R-squared value measures the proportion of the variation in the dependent variable that can be explained by the independent variables in a regression model. Here, the multiple R-squared is 0.4907, indicating that approximately 49.07% of the variability in the dependent variable is explained by the independent variables in the model.
The adjusted R-squared (0.4893) is very close to the multiple R-squared, indicating that the model is not being unduly penalized for the number of predictors it uses.
The p-values for each predictor variable show whether they are statistically significant in predicting SalePrice or not. In this case, PoolArea is not significant (p-value = 0.194), while all the other predictor variables have very small p-values (less than 0.05), indicating strong evidence of a significant linear relationship between each predictor variable and SalePrice.
F-statistic: The F-statistic tests the overall significance of the regression model. It compares the variability explained by the model to the variability not explained. In this case, the F-statistic is 350.4 on 4 and 1455 degrees of freedom, with a p-value of < 2.2e-16. The small p-value indicates that the model is highly significant, suggesting that the relationship between the independent variables and the dependent variable is not due to chance.
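As a check on the reported F-statistic (a small sketch, not in the original output), it can be reproduced from the multiple R-squared via F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), with k = 4 predictors and n = 1460 observations:
# Reproduce the F-statistic from R-squared (rounding-level agreement with 350.4)
r2 <- 0.4907; k <- 4; n <- 1460
(r2 / k) / ((1 - r2) / (n - k - 1))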
summary(model_1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.729567e+06 1.046366e+05 -16.529280 2.235997e-56
## LotArea 1.139763e+00 1.551523e-01 7.346090 3.386472e-13
## PoolArea 4.859126e+01 3.736973e+01 1.300284 1.937095e-01
## YearBuilt 9.206557e+02 5.379628e+01 17.113745 5.787068e-60
## TotalBsmtSF 7.897346e+01 3.859405e+00 20.462598 5.436356e-82
Residual vs. Fitted Values
ggplot(model_1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
labs(title = "Residual vs. Fitted Values",
x = "Fitted Values", y= "Residuals")
# define residuals
res <- resid(model_1)
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res, col = "red")
The normal probability plot is roughly straight through the middle of the distribution but deviates in the tails (consistent with the very large extreme residuals in the summary above), so the assumption that the errors are normally distributed is questionable. Let's do a Box-Cox transformation and see if we get a better QQ plot.
Box Cox Transformation:
b <- boxcox(SalePrice ~ LotArea + PoolArea + YearBuilt + TotalBsmtSF, data = df_train)
lamda <- b$x
lik <- b$y
bc <- cbind(lamda, lik)
head(bc)
## lamda lik
## [1,] -2.000000 -4766.602
## [2,] -1.959596 -4715.197
## [3,] -1.919192 -4664.572
## [4,] -1.878788 -4614.743
## [5,] -1.838384 -4565.727
## [6,] -1.797980 -4517.542
# Order by likelihood, highest first
bc[order(-lik),]
## lamda lik
## [1,] -0.02020202 -3417.663
## [2,] 0.02020202 -3418.003
## [3,] -0.06060606 -3418.435
## [4,] 0.06060606 -3419.447
## [5,] -0.10101010 -3420.324
## [6,] 0.10101010 -3421.986
## [7,] -0.14141414 -3423.338
## [8,] 0.14141414 -3425.612
## [9,] -0.18181818 -3427.484
## [10,] 0.18181818 -3430.319
## [11,] -0.22222222 -3432.766
## [12,] 0.22222222 -3436.096
## [13,] -0.26262626 -3439.192
## [14,] 0.26262626 -3442.935
## [15,] -0.30303030 -3446.765
## [16,] 0.30303030 -3450.828
## [17,] -0.34343434 -3455.490
## [18,] 0.34343434 -3459.765
## [19,] -0.38383838 -3465.371
## [20,] 0.38383838 -3469.737
## [21,] -0.42424242 -3476.411
## [22,] 0.42424242 -3480.736
## [23,] -0.46464646 -3488.612
## [24,] 0.46464646 -3492.753
## [25,] -0.50505051 -3501.978
## [26,] 0.50505051 -3505.777
## [27,] -0.54545455 -3516.508
## [28,] 0.54545455 -3519.801
## [29,] -0.58585859 -3532.203
## [30,] 0.58585859 -3534.815
## [31,] -0.62626263 -3549.064
## [32,] 0.62626263 -3550.809
## [33,] -0.66666667 -3567.089
## [34,] 0.66666667 -3567.775
## [35,] 0.70707071 -3585.704
## [36,] -0.70707071 -3586.277
## [37,] 0.74747475 -3604.588
## [38,] -0.74747475 -3606.625
## [39,] 0.78787879 -3624.416
## [40,] -0.78787879 -3628.130
## [41,] 0.82828283 -3645.180
## [42,] -0.82828283 -3650.788
## [43,] 0.86868687 -3666.873
## [44,] -0.86868687 -3674.594
## [45,] 0.90909091 -3689.484
## [46,] -0.90909091 -3699.543
## [47,] 0.94949495 -3713.006
## [48,] -0.94949495 -3725.628
## [49,] 0.98989899 -3737.430
## [50,] -0.98989899 -3752.843
## [51,] 1.03030303 -3762.748
## [52,] -1.03030303 -3781.179
## [53,] 1.07070707 -3788.952
## [54,] -1.07070707 -3810.629
## [55,] 1.11111111 -3816.032
## [56,] -1.11111111 -3841.182
## [57,] 1.15151515 -3843.983
## [58,] 1.19191919 -3872.794
## [59,] -1.15151515 -3872.828
## [60,] 1.23232323 -3902.458
## [61,] -1.19191919 -3905.558
## [62,] 1.27272727 -3932.968
## [63,] -1.23232323 -3939.360
## [64,] 1.31313131 -3964.315
## [65,] -1.27272727 -3974.222
## [66,] 1.35353535 -3996.491
## [67,] -1.31313131 -4010.131
## [68,] 1.39393939 -4029.488
## [69,] -1.35353535 -4047.074
## [70,] 1.43434343 -4063.299
## [71,] -1.39393939 -4085.039
## [72,] 1.47474747 -4097.916
## [73,] -1.43434343 -4124.010
## [74,] 1.51515152 -4133.331
## [75,] -1.47474747 -4163.973
## [76,] 1.55555556 -4169.535
## [77,] -1.51515152 -4204.914
## [78,] 1.59595960 -4206.521
## [79,] 1.63636364 -4244.281
## [80,] -1.55555556 -4246.817
## [81,] 1.67676768 -4282.807
## [82,] -1.59595960 -4289.666
## [83,] 1.71717172 -4322.091
## [84,] -1.63636364 -4333.446
## [85,] 1.75757576 -4362.124
## [86,] -1.67676768 -4378.140
## [87,] 1.79797980 -4402.899
## [88,] -1.71717172 -4423.732
## [89,] 1.83838384 -4444.407
## [90,] -1.75757576 -4470.205
## [91,] 1.87878788 -4486.641
## [92,] -1.79797980 -4517.542
## [93,] 1.91919192 -4529.590
## [94,] -1.83838384 -4565.727
## [95,] 1.95959596 -4573.248
## [96,] -1.87878788 -4614.743
## [97,] 2.00000000 -4617.606
## [98,] -1.91919192 -4664.572
## [99,] -1.95959596 -4715.197
## [100,] -2.00000000 -4766.602
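The lambda with the highest profile log-likelihood can also be read off programmatically (a one-line sketch using the lamda and lik vectors defined above):
# Lambda at the maximum of the profile log-likelihood (about -0.02 here)
lamda[which.max(lik)]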
The profile log-likelihood peaks near lambda ≈ 0 (about -0.02), which points toward a log transformation of SalePrice; here we try the milder square-root transformation (lambda = 0.5) as model 2.
model2_transform = lm(SalePrice^(1/2)~ LotArea + PoolArea +YearBuilt + TotalBsmtSF, data = df_train)
summary(model2_transform)
##
## Call:
## lm(formula = SalePrice^(1/2) ~ LotArea + PoolArea + YearBuilt +
## TotalBsmtSF, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -559.74 -36.21 -9.33 30.97 341.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.896e+03 1.087e+02 -17.440 < 2e-16 ***
## LotArea 1.263e-03 1.612e-04 7.832 9.21e-15 ***
## PoolArea 2.707e-02 3.884e-02 0.697 0.486
## YearBuilt 1.122e+00 5.591e-02 20.067 < 2e-16 ***
## TotalBsmtSF 8.344e-02 4.011e-03 20.804 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59 on 1455 degrees of freedom
## Multiple R-squared: 0.5281, Adjusted R-squared: 0.5268
## F-statistic: 407 on 4 and 1455 DF, p-value: < 2.2e-16
Overall, the regression model is statistically significant and provides a moderate level of explanatory power: the multiple R-squared rises to 0.5281 (from 0.4907 before the transformation). The residual standard error of 59 is on the square-root scale of SalePrice, so it measures the typical deviation between observed and predicted values on that scale. The adjusted R-squared is close to the multiple R-squared, indicating that the model is not being unduly penalized for the number of predictors it uses.
Residual Q-Q Plot for the Transformed Model:
# define residuals
res <- resid(model2_transform)
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res, col = "red")
Now we can see that the QQ plot is much closer to a straight line passing through the first and third quartiles of the data. The residuals of the transformed model appear approximately normally distributed, so this plot provides evidence that it is reasonable to assume the errors have a normal distribution.
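The chunk that generated the predictions below is not shown. A sketch along these lines would produce a similar table, assuming a test.csv is available alongside train.csv (the URL is an assumption) and keeping in mind that predictions from model2_transform are on the square-root scale, so they should be squared before an actual Kaggle submission:
# Sketch only: the test.csv location is assumed, and missing predictor values
# in the test set may need simple imputation before predict() returns a value
# for every row.
df_test <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-605/main/test.csv")
pred_sqrt <- predict(model2_transform, newdata = df_test)
prediction <- data.frame(Id = df_test$Id, SalePrice = pred_sqrt)
head(prediction)   # values on the square-root scale, as in the table below
# For submission, back-transform to dollars: prediction$SalePrice <- pred_sqrt^2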
## Id SalePrice
## 1 1461 391.7807
## 2 1462 429.0543
## 3 1463 438.7944
## 4 1464 434.8850
## 5 1465 451.4129
## 6 1466 415.7022
write.csv(prediction, file = "prediction.csv", row.names = FALSE)
References:
https://towardsdatascience.com/predicting-housing-prices-with-r-c9ec0821328d
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
https://www.youtube.com/watch?v=rkXc25Uvyl4
https://www.youtube.com/watch?v=vtm35gVP8JU
https://www.kaggle.com/code/fedesoriano/house-prices-what-s-a-good-score