library(knitr)
library(ggplot2)
library(stats)
library(kableExtra)
library(grid)
library(gridExtra)
library(MASS)
library(data.table)
library(RCurl)
options(scipen=8)
set.seed(0305)

Problem: 1


Solution:

To answer the above question I will using data from GrLivArea column, according to documentation it is described as Above grade (ground) living area square feet. This column will be X, independent variable and Y would be SalePrice.

Load data from train.csv. It is downloaded from kaggle.com.

#Load data
#kaggle.data <- read.csv("D:\\CUNY\\605\\FinalProject\\train.csv", header=T)
#kaggle.data.test <- read.csv("D:\\CUNY\\605\\FinalProject\\test.csv", header=T)

kaggle.data <- read.csv(text=getURL("https://raw.githubusercontent.com/akulapa/Akula-Data605-Final/master/train.csv"), header=T)
kaggle.data.test <- read.csv(text=getURL("https://raw.githubusercontent.com/akulapa/Akula-Data605-Final/master/test.csv"), header=T)

#Get required data
reqCols <- c('GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]

kable(req.data[sample(nrow(req.data), 10), ] , format="pandoc", align="l", row.names = F, caption = "Sample Data")
Sample Data
GrLivArea SalePrice
2108 235000
2290 122500
2097 423000
860 137000
1953 240000
1980 315000
3194 359100
1743 175000
990 126000
1768 224900

Summary of the table. First quartile of X is \(x = 1130\), Second quartile of Y is \(y = 163000\), it is median sale price of the houses.

summary(req.data)
##    GrLivArea      SalePrice     
##  Min.   : 334   Min.   : 34900  
##  1st Qu.:1130   1st Qu.:129975  
##  Median :1464   Median :163000  
##  Mean   :1515   Mean   :180921  
##  3rd Qu.:1777   3rd Qu.:214000  
##  Max.   :5642   Max.   :755000
#Create matrix based on conditions
#Total observations
total.obs <- nrow(req.data)

#Number of observations where X>x and Y>y
xy.pos.data <- req.data[which(req.data$GrLivArea > 1130 & req.data$SalePrice > 163000),]
xy.pos <- nrow(xy.pos.data)

#Number of observations where X<x and Y<y
xy.neg.data <- req.data[which(req.data$GrLivArea <= 1130 & req.data$SalePrice <= 163000),]
xy.neg <- nrow(xy.neg.data)

#Number of observations where X>x and Y<y
x.pos.y.neg.data <- req.data[which(req.data$GrLivArea > 1130 & req.data$SalePrice <= 163000),]
x.pos.y.neg <- nrow(x.pos.y.neg.data)

#Number of observations where X<x and Y>y
x.neg.y.pos.data <- req.data[which(req.data$GrLivArea <= 1130 & req.data$SalePrice > 163000),]
x.neg.y.pos <- nrow(x.neg.y.pos.data)

house.data<- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
#add column and row totals
house.data<- cbind(house.data, Total = rowSums(house.data))
house.data<- rbind(house.data, Total = colSums(house.data))

rownames(house.data)<- c('(X>x)', '(X<=x)', 'Total')


kable(house.data, digits = 2, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Observed Data')
Observed Data
(Y>y) (Y<=y) Total
(X>x) 720 374 1094
(X<=x) 8 358 366
Total 728 732 1460

Joint Probabilities

house.data.prob <- matrix(c(round(xy.pos/total.obs,4), round(x.neg.y.pos/total.obs,4),round(x.pos.y.neg/total.obs,4),round(xy.neg/total.obs,4)), nrow=2, ncol=2)

house.data.prob <- cbind(house.data.prob, Total = round(rowSums(house.data.prob),2))
house.data.prob <- rbind(house.data.prob, Total = round(colSums(house.data.prob),2))

rownames(house.data.prob) <- c('(X>x)', '(X<=x)', 'Total')


kable(house.data.prob, digits = 4, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Observed Data Joint Probabilities')
Observed Data Joint Probabilities
(Y>y) (Y<=y) Total
(X>x) 0.4932 0.2562 0.75
(X<=x) 0.0055 0.2452 0.25
Total 0.5000 0.5000 1.00

Note: Column and rows values are added and then rounded to 2 digits.


a. \(P(X>x~ |~ Y>y )\), read as probability X (GrLivArea) is greater than 1130 square feet given Y (SalePrice) is greater than $163000.

This is known as conditional probability because we are computing the probability under a condition, SalePrice is greater than $163000. Two parts to a conditional probability, the outcome of interest and the condition. We can assume condition as information we know to be true, and this information usually can be used to describe outcome.

\(P(GrLivArea~ > 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)

= \(\frac{720}{728} = 98.9 \%\)

Using the joint probabilities,

= \(\frac{0.4932}{0.5000} = 98.64 \%\)

Therefore, probability that SalePrice will be greater than \(\$163000\), if GrLivArea is greater than \(1130~ sq.ft.\) is \(99\%\)


b. \(P(X>x~ \&~ Y>y)\), read a probability X (GrLivArea) is greater than 1130 square feet and Y (SalePrice) is greater than $163000.

This is known as joint probability because we are computing the probability using outcomes of two variables.

\(P(GrLivArea~ > 1130~ sq.ft. and SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ observed}\)

= \(\frac{720}{1460} = 49.32 \%\)

Therefore, probability that GrLivArea is greater than \(1130~ sq.ft.\) and SalePrice will be greater than \(\$163000\), is \(49.32 \%\)


c. \(P(X<x~ |~ Y>y )\), read as probability X (GrLivArea) is less than 1130 square feet given Y (SalePrice) is greater than $163000. This is conditional probability.

\(P(GrLivArea~ < 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ < 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)

= \(\frac{8}{728} = 1.1 \%\)

Using the joint probabilities,

= \(\frac{0.0055}{0.5000} = 1.1 \%\)

Therefore, probability that SalePrice will be greater than \(\$163000\), if GrLivArea is less than \(1130~ sq.ft.\) is \(1.1 \%\)


Relation to independence:

\(P(XY) = P(X)P(Y)\)

Above condition can be rewritten as

\(P(X \cap Y) = P(X)P(Y)\), condition will be true only when \(X\) and \(Y\) are independent.

We can say that above grade living area and sale price are independent only when an increase or decrease in the area does not affect the probability of increase or decrease of the sale price of the house. We can test the condition by using the following hypothesis.

Null Hypothesis(\(H_0\)): Sale price of the house is not influenced by above grade living area.

Alternative Hypothesis(\(H_A\)): Above grade living area has significant influence on sale price of the house.

If two variables were to be independent it should satisfy the condition

\(P(X>x~ |~ Y>y)P(Y>y) = P(X>x~ \cap ~ Y>y) = P(Y>y~ |~ X>x)P(X>x)\)

We will solve above conditions in two parts,

\(P(X>x~ |~ Y>y)P(Y>y) = P(X>x~ \cap ~ Y>y)\)

= \(P(X>x~ |~ Y>y) = \frac{P(X>x~ \cap ~ Y>y)}{P(Y>y)}\)

= \(P(X>x~ \cap ~ Y>y)\) - probability where GrLivArea > 1130 sq.ft. and SalePrice > $163000

= \(P(Y>y)\) - probability where SalePrice > $163000

= \(P(X>x~ |~ Y>y) = \frac{720}{728} = 98.9 \%\)

Comparing other way,

\(P(Y>y~ |~ X>x)P(X>x) = P(X>x~ \cap ~ Y>y)\)

= \(P(X>x~ \cap ~ Y>y) = P(Y>y~ |~ X>x)P(X>x)\)

= \(P(Y>y~ |~ X>x) = \frac{P(X>x~ \cap ~ Y>y)}{P(X>x)}\)

= \(P(X>x)\) - probability where GrLivArea > 1130 sq.ft.

= \(P(Y>y~ |~ X>x) = \frac{720}{1094} = 65.81 \%\)

Since \(P(X>x~ |~ Y>y)P(Y>y) = P(X>x~ \cap ~ Y>y) = P(Y>y~ |~ X>x)P(X>x)\), condition is not met we reject Null Hypothesis(\(H_0\)), and accept Alternative Hypothesis(\(H_A\)) that above grade living area has significant influence on sale price of the house.


Using Chi Square test

house.data <- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)

chisq.test(house.data) 
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  house.data
## X-squared = 441.58, df = 1, p-value < 2.2e-16

Since we have 2 variables GrLivArea and SalePrice, degrees of freedom(df) = 1, p-value = \(2.2 \times 10^{-16}\) is very small compared at \(0.05\) significance level, we reject Null Hypothesis(\(H_0\)), and accept Alternative Hypothesis(\(H_A\)) that above grade living area has significant influence on sale price of the house.


Problem: 2


Solution:

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

house.stats <- matrix(data = NA, nrow=11,ncol=2)
qarea <- quantile(req.data$GrLivArea)
qprice <- quantile(req.data$SalePrice)

house.stats[1,1] <- nrow(req.data)
house.stats[1,2] <- nrow(req.data)

house.stats[2,1] <- length(req.data$GrLivArea[!is.na(req.data$GrLivArea)])
house.stats[2,2] <- length(req.data$SalePrice[!is.na(req.data$SalePrice)])

house.stats[3,1] <- paste0(min(req.data$GrLivArea), ' sq. ft.')
house.stats[3,2] <- paste0('$', min(req.data$SalePrice))

house.stats[4,1] <- paste0(max(req.data$GrLivArea), ' sq. ft.')
house.stats[4,2] <- paste0('$', max(req.data$SalePrice))

house.stats[5,1] <- paste0(median(req.data$GrLivArea), ' sq. ft.')
house.stats[5,2] <- paste0('$', median(req.data$SalePrice))

house.stats[6,1] <- paste0(qarea[2], ' sq. ft.')
house.stats[6,2] <- paste0('$', qprice[2])

house.stats[7,1] <- paste0(qarea[4], ' sq. ft.')
house.stats[7,2] <- paste0('$', qprice[4])

house.stats[8,1] <- paste0(round(mean(req.data$GrLivArea),2), ' sq. ft.')
house.stats[8,2] <- paste0('$', round(mean(req.data$SalePrice),2))

house.stats[9,1] <- round(sd(req.data$GrLivArea),2)
house.stats[9,2] <- round(sd(req.data$SalePrice),2)

house.stats[10,1] <- paste0(getmode(req.data$GrLivArea), ' sq. ft.')
house.stats[10,2] <- paste0('$', getmode(req.data$SalePrice))

house.stats[11,1] <- paste0(IQR(req.data$GrLivArea), ' sq. ft.')
house.stats[11,2] <- paste0('$', IQR(req.data$SalePrice))

rownames(house.stats)<- c('Number of Observations', 'Non-missing values', 'Minimum','Maximum', 'Median','1st quartile','3rd quartile', 'Average(mean)', 'Standard deviation', 'Mode', 'Interquartile range(IQR)')

kable(house.stats, digits = 2, 
      col.names = c('GrLivArea', 'SalePrice'), 
      align = "l", 
      caption = 'Univariate Descriptive Statistics', "html") %>%  
  kable_styling(bootstrap_options = c("striped", "hover"))
Univariate Descriptive Statistics
GrLivArea SalePrice
Number of Observations 1460 1460
Non-missing values 1460 1460
Minimum 334 sq. ft. $34900
Maximum 5642 sq. ft. $755000
Median 1464 sq. ft. $163000
1st quartile 1129.5 sq. ft. $129975
3rd quartile 1776.75 sq. ft. $214000
Average(mean) 1515.46 sq. ft. $180921.2
Standard deviation 525.48 79442.5
Mode 864 sq. ft. $140000
Interquartile range(IQR) 647.25 sq. ft. $84025

Graphs

fill <- "gray45"
line <- "navy"

bxgg <- ggplot(req.data, aes(x='', y=GrLivArea)) + 
          geom_boxplot(fill = fill, colour = line, alpha = 0.7, varwidth=T, width=0.6) +
          #coord_flip() +
          scale_x_discrete(name = "") +
          scale_y_continuous(name = "Above Grade Living Area(sq.ft.)", breaks = seq(0, max(req.data$GrLivArea) + 100, 500),
                              limits=c(0, max(req.data$GrLivArea) + 100)) +
          ggtitle("Boxplot") +
          stat_summary(geom="text", fun.y=quantile,
                        aes(label = sprintf("%1.2f", ..y..)),
                        position = position_nudge(x=0.35), size=2.0)

hgg <- ggplot(req.data, aes(x=GrLivArea)) +
          geom_histogram(binwidth=100, colour = line, fill = fill, aes(y=..density.., fill=..count..)) +
          scale_x_continuous(name = "Above Grade Living Area(sq.ft.)") +
          stat_function(fun=dnorm, color="red",
                        args=list(mean=mean(req.data$GrLivArea), sd=sd(req.data$GrLivArea))) +
          ggtitle("Histogram")
          #geom_vline(xintercept = mean(req.data$GrLivArea), show_guide=TRUE, color="red", labels="Average") +
          #geom_vline(xintercept = median(req.data$GrLivArea), show_guide=TRUE, color="green", labels="Median")

grid.arrange(bxgg, hgg, nrow = 2, top='Above Grade Living Area(sq.ft.)')

Histogram shows distribution of above grade living area is. Average area is 1515.46 sq.ft. with standard deviation as 525.48. It also shows right tail, suggesting existence of outliers to the right of the average.

kable(qarea, digits = 2, 
      caption = 'Quartiles', 
      align = 'l', padding = 10, "html") %>%  
  kable_styling(bootstrap_options = c("striped", "hover"))
Quartiles
0% 334.00
25% 1129.50
50% 1464.00
75% 1776.75
100% 5642.00
fill <- "gray45"
line <- "navy"

bxgg <- ggplot(req.data, aes(x='', y=SalePrice)) + 
          geom_boxplot(fill = fill, colour = line, alpha = 0.7, varwidth=T, width=0.6) +
          #coord_flip() +
          scale_x_discrete(name = "") +
          scale_y_continuous(name = "Sale Price(Dollars)", breaks = seq(0, max(req.data$SalePrice) + 100, 100000),
                              limits=c(0, max(req.data$SalePrice) + 100)) +
          ggtitle("Boxplot") +
          stat_summary(geom="text", fun.y=quantile,
                      aes(label=sprintf("%1.2f", ..y..)),
                      position=position_nudge(x=0.35), size=2.0)

hgg <- ggplot(req.data, aes(x=SalePrice)) +
          geom_histogram(binwidth=10000, colour= line, fill = fill, aes(y=..density.., fill=..count..)) +
          scale_x_continuous(name = "Sale Price(Dollars)") +
          stat_function(fun=dnorm, color="red",
                        args=list(mean=mean(req.data$SalePrice), sd=sd(req.data$SalePrice))) +
          ggtitle("Histogram")
          #geom_vline(xintercept = mean(req.data$SalePrice), show_guide=TRUE, color="red", labels="Average") +
          #geom_vline(xintercept = median(req.data$SalePrice), show_guide=TRUE, color="green", labels="Median")

grid.arrange(bxgg, hgg, nrow = 2, top='Sale Price')

Histogram shows distribution of sale proce of houses. Average sale price(\(\mu\)) is $ 180921.2, with sandard deviation(\(\sigma\)) 79442.5. It also shows right tail, suggesting existence of outliers to the right of the average.

kable(qprice, digits = 2, 
      caption = 'Quartiles', 
      align = 'l', 
      padding = 10, "html") %>%  
  kable_styling(bootstrap_options = c("striped", "hover"))
Quartiles
0% 34900
25% 129975
50% 163000
75% 214000
100% 755000

Scatterplot

ggplot(req.data, aes(x=GrLivArea, y=SalePrice)) +
  geom_point() +
  geom_smooth(method=lm) +
  labs(title="Scatterplot Above Grade Living Area Vs. Sale Price",
       x="Above Grade Living Area(sq.ft.)", y = "Sale Price(Dollars)")

From above boxplots, histograms and scatterplot we can notice there are some outliers and the variation amoung area and sale price is not constant. This causes a longer tail on the right side.

Linear Model

house.lm <- lm(req.data$SalePrice ~ req.data$GrLivArea)
summary(house.lm)
## 
## Call:
## lm(formula = req.data$SalePrice ~ req.data$GrLivArea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462999  -29800   -1124   21957  339832 
## 
## Coefficients:
##                     Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)        18569.026   4480.755   4.144 0.0000361 ***
## req.data$GrLivArea   107.130      2.794  38.348   < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared:  0.5021, Adjusted R-squared:  0.5018 
## F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16

Above observations are without data normalization.

Box-Cox Transformation

A Box-Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox means that one will be able to run a broader number of tests.

For all positive values of \(y\), it is defined by

\[ y(\lambda)=\begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \text{if }\lambda \neq 0\ \\ log~ y, & \text{if }\lambda = 0\ \end{cases} \]

If \(y\) has negative values then it is defined as

\[ y(\lambda)=\begin{cases} \frac{(y + {\lambda}_2)^{\lambda_1} - 1}{\lambda_1}, & \text{if }\lambda_1 \neq 0\ \\ log~ (y + {\lambda}_2), & \text{if }\lambda_1 = 0\ \end{cases} \]

We will using R-function boxcox from MASS library to determine optimal lambda(\(\lambda\)) value.

par(mfrow=c(1,2))
house.bc <- boxcox(house.lm)
house.bc.df <- as.data.frame(house.bc)
lambda <- house.bc.df[which.max(house.bc.df$y),1]
boxcox(house.lm, plotit=T, lambda=seq(0,0.20,by=0.05))

From above boxcox plot, optimal lambda(\(\lambda\)) is 0.1010101. Confidence interval runs between \(0.02\) and \(0.18\). Since \(\lambda\) is less than \(0.5\), there is no good reason to transform data.

However, let’s check if transforming data provides a better understanding of data.

req.data$SalePrice_trans <- ((req.data$SalePrice^lambda) -1)/lambda


ggplot(req.data, aes(x=GrLivArea, y=SalePrice_trans)) +
  geom_point() +
  geom_smooth(method=lm) +
  labs(title="Scatterplot Above Grade Living Area Vs. Sale Price Transformed",
       x="Above Grade Living Area(sq.ft.)", y = "Sale Price Transformed")

house_t.lm <- lm(req.data$SalePrice_trans ~ req.data$GrLivArea)
summary(house_t.lm)
## 
## Call:
## lm(formula = req.data$SalePrice_trans ~ req.data$GrLivArea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6488 -0.4945  0.0843  0.5314  3.1560 
## 
## Coefficients:
##                       Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)        20.72985263  0.07656336  270.75   <2e-16 ***
## req.data$GrLivArea  0.00181347  0.00004774   37.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9581 on 1458 degrees of freedom
## Multiple R-squared:  0.4975, Adjusted R-squared:  0.4971 
## F-statistic:  1443 on 1 and 1458 DF,  p-value: < 2.2e-16

Conclusion


Problem: 3


Solution:

Following are the three untransformed variables that will be used.

Correlation Matrix

Correlation provides trends shared between two variables. If the value is close to 1 variables are positively related. In other words, as one variable increases, other variable tends to move in the same. If the value is close to -1, then variables are negatively related. When one variable increases other variable decreases. It is also known as variables are inversely related.

As dataset has more than 30 observations and mean represents the center of the data, it meets parametric conditions. Hence pearson correlation is chosen over spearman. Also pearson coefficient of correlation, when we square it we obtain the amount of variation in \(y\) explained by \(x\). Same is not true with spearman method.

reqCols <- c('LotArea','TotalBsmtSF','GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]

pearson.cor <- cor(req.data,method="pearson")
pearson.cor
##               LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea     1.0000000   0.2608331 0.2631162 0.2638434
## TotalBsmtSF 0.2608331   1.0000000 0.4548682 0.6135806
## GrLivArea   0.2631162   0.4548682 1.0000000 0.7086245
## SalePrice   0.2638434   0.6135806 0.7086245 1.0000000

Looking at the output all the variables are positively correlated.

Correlation between TotalBsmtSF and SalePrice is \(0.61\). It explains bigger basement area will result in the better sale price. Square value of the coefficient is \(0.3721\). It means \(37.21 \%\) percent of the variance in the sale price of a house can be explained by the total area of the basement.

Correlation between GrLivArea and SalePrice is \(0.71\). It explains bigger living area will result in the better sale price. Square value of the coefficient is \(0.5041\). It means \(50.41 \%\) percent of the variance in the sale price of a house can be explained by the total above grade living area.


Precision Matrix

Precision matrix is inverse of Correlation Matrix. To solve the problem, I will be using R-function solve.

inv.cor <- solve(pearson.cor)
inv.cor
##                 LotArea TotalBsmtSF  GrLivArea   SalePrice
## LotArea      1.10622180  -0.1703170 -0.1623394 -0.07232846
## TotalBsmtSF -0.17031695   1.6321069 -0.0397442 -0.92832834
## GrLivArea   -0.16233936  -0.0397442  2.0350650 -1.37487844
## SalePrice   -0.07232846  -0.9283283 -1.3748784  2.56296011

Matrix Multiplication

Correlation Matrix multiplied by Precision Matrix

round(pearson.cor %*% inv.cor)
##             LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea           1           0         0         0
## TotalBsmtSF       0           1         0         0
## GrLivArea         0           0         1         0
## SalePrice         0           0         0         1

Precision Matrix multiplied by Correlation Matrix

round(inv.cor %*% pearson.cor)
##             LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea           1           0         0         0
## TotalBsmtSF       0           1         0         0
## GrLivArea         0           0         1         0
## SalePrice         0           0         0         1

Conclusion


Problem: 4


Solution:

The non-transformed independent variable I will be using is GrLivArea and the distribution function is normal. Parameter estimates of the normal distribution are mean(\(\mu\)) and standard deviation(\(\sigma\)).

Basic Analysis

reqCols <- c('GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]

fit.output <- fitdistr(req.data$GrLivArea, densfun="normal")
fit.output
##       mean          sd    
##   1515.46370    525.30039 
##  (  13.74774) (   9.72112)

Output of optim function, average(\(\mu\)) = 1515.46 and standard deviation(\(\sigma\)) = 525.3.

To find the optimal estimates, I will be using optim and dnorm functions. dnorm is the R function that calculates the probability density of the normal distribution. Since GrLivArea cannot be negative and has to greater than zero, we will use the output of fitdistr to get optimum values. However, lower standard deviation would be preferred. Standard deviation generated by fitdistr is little over one-third of average. I will be passing the value of 50 as standard deviation to optim function.

likelihood.func <- function(params) { -sum(dnorm(req.data$GrLivArea, params[1], params[2], log=TRUE)) }
optim.output <- optim(c(fit.output$estimate[1], 50), likelihood.func)    # c(0,1) are just initial guesses
optim.output
## $par
##      mean           
## 1515.7270  525.3384 
## 
## $value
## [1] 11217.05
## 
## $counts
## function gradient 
##       47       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Output of optim function, average(\(\mu\)) = 1515.73 and standard deviation(\(\sigma\)) = 525.34.

Apart from rounding, optim function produced same output as fitdistr.

Sample Data

To generate 1000 samples, I will be using rnorm function with the optimal parameters generated by optim function.

#generate 1000 samples
house.sample <- rnorm(n=1000, mean=round(optim.output$par[1],2), sd=round(optim.output$par[2],2))
house.sample <- data.frame(house.sample)
names(house.sample)[1] <- "Samples"

req.data$areaSplit <- as.factor(req.data[,1]>=round(optim.output$par[1],2))

ogg <- ggplot(req.data, aes(GrLivArea, fill=req.data$areaSplit)) + 
          geom_histogram(color="white", binwidth=125) + 
          scale_x_continuous(name = "Observed Area(sq.ft.)") +
          scale_fill_manual(values=c("#a3c4dc","#0e668b"),labels=c("area >= 1515.73 sq.ft.","area < 1515.73 sq.ft.")) +
          ylab("Number of Observations") +
          ggtitle("Observed Area Distribution") +
          geom_vline(xintercept = mean(req.data$GrLivArea), color="red", labels="Average", lwd=1)


house.sample$areaSplit <- as.factor(house.sample[,1]>=round(optim.output$par[1],2))

sgg <- ggplot(house.sample, aes(Samples, fill=house.sample$areaSplit)) + 
          geom_histogram(color="white", binwidth=125) + 
          scale_x_continuous(name = "Sample Data Area(sq.ft.)") +
          scale_fill_manual(values=c("#a3c4dc","#0e668b"),labels=c("area >= 1515.73 sq.ft.","area < 1515.73 sq.ft.")) +
          ylab("Sample Split") +
          ggtitle("Sample Data Distribution") +
          geom_vline(xintercept = mean(req.data$GrLivArea), color="red", labels="Average", lwd=1)

grid.arrange(sgg, ogg, nrow = 2, top='Histograms')

Comparison

Goodness-of-Fit Test

I will be using Chi-Square test to see if the sample generated represents a normal distribution. In our case, there should 50% cases where the area is greater than or equal to average and 50% cases less than average.

Hypothesis,

\(H_0\) : Sample data follow a specified distribution.

\(H_A\) : Sample data do not follow the specified distribution.

#Chi-square test

#Ratio of actual observed values
nullp<-c(0.50, 0.50)

#Samples generated
sample.rows <- c(sum(house.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(house.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))

#Goodness-of-Fit Test
chisq.test(sample.rows, p=nullp)
## 
##  Chi-squared test for given probabilities
## 
## data:  sample.rows
## X-squared = 1.156, df = 1, p-value = 0.2823

Since p-value is greater than \(0.05\), we accept null hypothesis(\(H_0\)) with 95% confidence, that sample data represents normal distribution.

Following is a test to see if sample represents actual observed data. Actual observations ratio between greater than or equal to average and less than average, 0.45 : 0.55. For samples data ratio between samples greater than or equal to average and less than average, 517 : 483.

Hypothesis,

\(H_0\): Sample data represents actual observed data.

\(H_A\): Sample data do not represent actual observed data.

#Chi-square test

#Ratio of actual observed values
nullp<-c(round((sum(req.data$GrLivArea >= round(optim.output$par[1],2), na.rm=TRUE)) / nrow(req.data),2), round((sum(req.data$GrLivArea < round(optim.output$par[1],2), na.rm=TRUE)) / nrow(req.data),2))

#Samples generated
sample.rows <- c(sum(house.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(house.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))

#Goodness-of-Fit Test
chisq.test(sample.rows, p=nullp)
## 
##  Chi-squared test for given probabilities
## 
## data:  sample.rows
## X-squared = 18.137, df = 1, p-value = 0.00002055

Since p-value is less than \(0.05\), we can reject the null hypothesis(\(H_0\)) and accept alternative hypothesis(\(H_A\)) with 95% confidence that sample data does not represent actual observed data.

Conclusion


Problem: 5

#Training data
reqCols <- c('MSZoning', 'LotArea','Street','LotShape','BldgType','OverallQual','OverallCond','YearBuilt','YearRemodAdd','BsmtCond',
             'TotalBsmtSF','CentralAir','X1stFlrSF','X2ndFlrSF','GrLivArea','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr',
             'KitchenQual','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','SaleCondition','SalePrice')
kaggle.train <- kaggle.data[reqCols]

levels <- levels(kaggle.train$BsmtCond)
levels[length(levels) + 1] <- "NA"

kaggle.train$BsmtCond <- factor(kaggle.train$BsmtCond, levels = levels)
kaggle.train$BsmtCond[is.na(kaggle.train$BsmtCond)] <- "NA"
kaggle.train$KitchenQual[is.na(kaggle.train$KitchenQual)] <- "TA"
kaggle.train$MSZoning[is.na(kaggle.train$MSZoning)] <- "RM"
kaggle.train$TotalBsmtSF[is.na(kaggle.train$TotalBsmtSF)] <- 0
kaggle.train$GarageCars[is.na(kaggle.train$GarageCars)] <- 0
kaggle.train$GarageArea[is.na(kaggle.train$GarageArea)] <- 0


#Test data
reqCols <- c('Id','MSZoning', 'LotArea','Street','LotShape','BldgType','OverallQual','OverallCond','YearBuilt','YearRemodAdd','BsmtCond',
             'TotalBsmtSF','CentralAir','X1stFlrSF','X2ndFlrSF','GrLivArea','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr',
             'KitchenQual','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','SaleCondition')

kaggle.test <- kaggle.data.test[reqCols]

levels <- levels(kaggle.test$BsmtCond)
levels[length(levels) + 1] <- "NA"

kaggle.test$BsmtCond <- factor(kaggle.test$BsmtCond, levels = levels)
kaggle.test$BsmtCond[is.na(kaggle.test$BsmtCond)] <- "NA"

kaggle.test$KitchenQual[is.na(kaggle.test$KitchenQual)] <- "TA"
kaggle.test$MSZoning[is.na(kaggle.test$MSZoning)] <- "RM"
kaggle.test$TotalBsmtSF[is.na(kaggle.test$TotalBsmtSF)] <- 0
kaggle.test$GarageCars[is.na(kaggle.test$GarageCars)] <- 0
kaggle.test$GarageArea[is.na(kaggle.test$GarageArea)] <- 0


#kaggle.pred <- cbind(kaggle.test, s<-predict(kaggle.lm, kaggle.test))

#names(kaggle.pred)[ncol(kaggle.pred)] <- "SalePrice"

#reqCols <- c('Id','SalePrice')
#kaggle.pred.output <- kaggle.pred[reqCols]
#write.csv(kaggle.pred.output, file = "D:\\CUNY\\605\\FinalProject\\submission.csv",row.names=FALSE)

Model Building

Columns used for building regression model,

names(kaggle.train)
##  [1] "MSZoning"      "LotArea"       "Street"        "LotShape"     
##  [5] "BldgType"      "OverallQual"   "OverallCond"   "YearBuilt"    
##  [9] "YearRemodAdd"  "BsmtCond"      "TotalBsmtSF"   "CentralAir"   
## [13] "X1stFlrSF"     "X2ndFlrSF"     "GrLivArea"     "FullBath"     
## [17] "HalfBath"      "BedroomAbvGr"  "KitchenAbvGr"  "KitchenQual"  
## [21] "TotRmsAbvGrd"  "Fireplaces"    "GarageCars"    "GarageArea"   
## [25] "WoodDeckSF"    "OpenPorchSF"   "SaleCondition" "SalePrice"

Missing data for BsmtCond is replaced with NA. For TotalBsmtSF, GarageCars and GarageArea missing data is replaced with zero.

kaggle.lm <- lm(SalePrice ~ ., data = kaggle.train)
summary(kaggle.lm)
## 
## Call:
## lm(formula = SalePrice ~ ., data = kaggle.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -466707  -14819   -1247   12263  259151 
## 
## Coefficients:
##                          Estimate   Std. Error t value     Pr(>|t|)    
## (Intercept)          -834461.2017  148611.8703  -5.615 0.0000000236 ***
## MSZoningFV             13657.4321   12536.4521   1.089     0.276155    
## MSZoningRH             14191.9534   14213.1843   0.999     0.318205    
## MSZoningRL              9495.8430   11716.4513   0.810     0.417806    
## MSZoningRM              3612.8881   11690.6505   0.309     0.757336    
## LotArea                    0.4916       0.1067   4.606 0.0000044696 ***
## StreetPave             24453.2266   15140.2930   1.615     0.106510    
## LotShapeIR2             7857.2149    5646.9385   1.391     0.164319    
## LotShapeIR3           -51788.7586   11331.9555  -4.570 0.0000052983 ***
## LotShapeReg            -3311.6478    2041.0949  -1.622     0.104922    
## BldgType2fmCon          5844.6630    7452.0259   0.784     0.432992    
## BldgTypeDuplex           951.2547    7785.8292   0.122     0.902776    
## BldgTypeTwnhs         -18502.8225    5807.8563  -3.186     0.001475 ** 
## BldgTypeTwnhsE        -14400.2176    3925.5787  -3.668     0.000253 ***
## OverallQual            13660.7015    1178.8995  11.588      < 2e-16 ***
## OverallCond             5157.3916    1056.5760   4.881 0.0000011738 ***
## YearBuilt                347.4258      63.5952   5.463 0.0000000552 ***
## YearRemodAdd              61.7003      68.7175   0.898     0.369400    
## BsmtCondGd              7402.5137    7012.3297   1.056     0.291312    
## BsmtCondPo             24684.5539   24853.7180   0.993     0.320785    
## BsmtCondTA              8822.5327    5491.0097   1.607     0.108338    
## BsmtCondNA             22140.3824    9510.5536   2.328     0.020054 *  
## TotalBsmtSF               24.3378       5.1631   4.714 0.0000026708 ***
## CentralAirY            -3221.3162    4474.2420  -0.720     0.471663    
## X1stFlrSF                 27.8363      19.6979   1.413     0.157828    
## X2ndFlrSF                 32.0763      19.2219   1.669     0.095390 .  
## GrLivArea                 19.0507      19.1194   0.996     0.319222    
## FullBath                1011.8753    2711.2224   0.373     0.709043    
## HalfBath               -1636.4413    2624.6403  -0.623     0.533062    
## BedroomAbvGr           -7967.9128    1711.4010  -4.656 0.0000035302 ***
## KitchenAbvGr          -24572.0086    7147.6623  -3.438     0.000604 ***
## KitchenQualFa         -42158.0859    7729.0392  -5.455 0.0000000579 ***
## KitchenQualGd         -45369.2909    4128.5516 -10.989      < 2e-16 ***
## KitchenQualTA         -49501.7920    4818.3597 -10.274      < 2e-16 ***
## TotRmsAbvGrd            2290.0596    1219.3831   1.878     0.060580 .  
## Fireplaces              5384.1705    1722.6888   3.125     0.001811 ** 
## GarageCars             12073.8212    2790.1894   4.327 0.0000161590 ***
## GarageArea                -2.0307       9.5503  -0.213     0.831641    
## WoodDeckSF                28.8093       7.6375   3.772     0.000169 ***
## OpenPorchSF              -12.6482      14.8625  -0.851     0.394909    
## SaleConditionAdjLand   17304.1575   17861.7073   0.969     0.332818    
## SaleConditionAlloca    16667.3871   10954.5029   1.522     0.128355    
## SaleConditionFamily     -217.8811    8378.0094  -0.026     0.979256    
## SaleConditionNormal     8008.9981    3623.5822   2.210     0.027248 *  
## SaleConditionPartial   21527.7619    5001.1914   4.305 0.0000178853 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33540 on 1415 degrees of freedom
## Multiple R-squared:  0.8271, Adjusted R-squared:  0.8217 
## F-statistic: 153.8 on 44 and 1415 DF,  p-value: < 2.2e-16

\(R^2\) value of the model is \(0.8271\). It means model can explain \(82.71 \%\) of the variance in the sale price of a house. However, variables MSZoning, YearRemodAdd, OpenPorchSF, GarageArea, BsmtCond, HalfBath, CentralAir, GrLivArea and FullBath have high p-value. It suggests these variables are not contributing to model.

kaggle.lm <- update(kaggle.lm, .~ . -MSZoning -YearRemodAdd -OpenPorchSF -GarageArea -BsmtCond -HalfBath -CentralAir -GrLivArea -FullBath)
summary(kaggle.lm)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + Street + LotShape + BldgType + 
##     OverallQual + OverallCond + YearBuilt + TotalBsmtSF + X1stFlrSF + 
##     X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + 
##     Fireplaces + GarageCars + WoodDeckSF + SaleCondition, data = kaggle.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -460319  -15275   -1234   12643  262810 
## 
## Coefficients:
##                          Estimate   Std. Error t value        Pr(>|t|)    
## (Intercept)          -824312.8757   97193.2468  -8.481         < 2e-16 ***
## LotArea                    0.5207       0.1059   4.917 0.0000009802343 ***
## StreetPave             29323.9224   14553.6486   2.015        0.044104 *  
## LotShapeIR2             8459.2088    5613.1042   1.507        0.132020    
## LotShapeIR3           -50983.1758   11222.8435  -4.543 0.0000060181570 ***
## LotShapeReg            -3372.1393    2012.4911  -1.676        0.094034 .  
## BldgType2fmCon          6858.5577    7357.2576   0.932        0.351382    
## BldgTypeDuplex          3047.4816    7680.4211   0.397        0.691585    
## BldgTypeTwnhs         -20821.5965    5545.0916  -3.755        0.000180 ***
## BldgTypeTwnhsE        -15785.5942    3718.8552  -4.245 0.0000232938011 ***
## OverallQual            13780.4842    1157.0155  11.910         < 2e-16 ***
## OverallCond             5117.6174     920.2462   5.561 0.0000000319403 ***
## YearBuilt                409.3079      48.2739   8.479         < 2e-16 ***
## TotalBsmtSF               17.3150       3.8881   4.453 0.0000091129918 ***
## X1stFlrSF                 52.7975       5.3224   9.920         < 2e-16 ***
## X2ndFlrSF                 49.4065       3.8415  12.861         < 2e-16 ***
## BedroomAbvGr           -7746.2693    1660.4203  -4.665 0.0000033707724 ***
## KitchenAbvGr          -22970.0828    6987.5840  -3.287        0.001036 ** 
## KitchenQualFa         -42693.8655    7541.8616  -5.661 0.0000000181677 ***
## KitchenQualGd         -45548.2056    4067.3129 -11.199         < 2e-16 ***
## KitchenQualTA         -51154.7355    4678.5453 -10.934         < 2e-16 ***
## TotRmsAbvGrd            2327.3620    1188.6831   1.958        0.050432 .  
## Fireplaces              5019.5825    1667.6467   3.010        0.002658 ** 
## GarageCars             11222.2049    1655.5203   6.779 0.0000000000177 ***
## WoodDeckSF                28.3168       7.5812   3.735        0.000195 ***
## SaleConditionAdjLand   20224.9292   17685.0618   1.144        0.252974    
## SaleConditionAlloca    14514.8202   10856.2911   1.337        0.181437    
## SaleConditionFamily     -821.0596    8313.9049  -0.099        0.921345    
## SaleConditionNormal     8218.0876    3545.5362   2.318        0.020597 *  
## SaleConditionPartial   22001.7485    4877.4789   4.511 0.0000069839754 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33540 on 1430 degrees of freedom
## Multiple R-squared:  0.8253, Adjusted R-squared:  0.8217 
## F-statistic: 232.9 on 29 and 1430 DF,  p-value: < 2.2e-16

Second model after removing the variables, summary did not show any improvement in \(R^2\) value, but \(Adjusted~ R^2\) remained constant. Also, every variable is contributing to the model.

Price Prediction

kaggle.pred <- cbind(kaggle.test, s<-predict(kaggle.lm, kaggle.test))
names(kaggle.pred)[ncol(kaggle.pred)] <- "SalePrice"

Data is submitted to Kaggle website. Following is my username and score.


References