library(knitr)
library(ggplot2)
library(stats)
library(kableExtra)
library(grid)
library(gridExtra)
library(MASS)
library(data.table)
library(RCurl)
options(scipen=8)
set.seed(0305)
Problem: 1
Solution:
To answer the above question I will use data from the GrLivArea column, which the documentation describes as above grade (ground) living area in square feet. This column will be X, the independent variable, and Y will be SalePrice.
Load the data from train.csv; it was downloaded from kaggle.com.
#Load data
#kaggle.data <- read.csv("D:\\CUNY\\605\\FinalProject\\train.csv", header=T)
#kaggle.data.test <- read.csv("D:\\CUNY\\605\\FinalProject\\test.csv", header=T)
kaggle.data <- read.csv(text=getURL("https://raw.githubusercontent.com/akulapa/Akula-Data605-Final/master/train.csv"), header=T)
kaggle.data.test <- read.csv(text=getURL("https://raw.githubusercontent.com/akulapa/Akula-Data605-Final/master/test.csv"), header=T)
#Get required data
reqCols <- c('GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]
kable(req.data[sample(nrow(req.data), 10), ] , format="pandoc", align="l", row.names = F, caption = "Sample Data")
| GrLivArea | SalePrice |
|---|---|
| 2108 | 235000 |
| 2290 | 122500 |
| 2097 | 423000 |
| 860 | 137000 |
| 1953 | 240000 |
| 1980 | 315000 |
| 3194 | 359100 |
| 1743 | 175000 |
| 990 | 126000 |
| 1768 | 224900 |
Summary of the data. The first quartile of X is \(x = 1130\) (1129.5 before the summary output's rounding), and the second quartile of Y is \(y = 163000\), the median sale price of the houses.
summary(req.data)
## GrLivArea SalePrice
## Min. : 334 Min. : 34900
## 1st Qu.:1130 1st Qu.:129975
## Median :1464 Median :163000
## Mean :1515 Mean :180921
## 3rd Qu.:1777 3rd Qu.:214000
## Max. :5642 Max. :755000
#Create matrix based on conditions
#Total observations
total.obs <- nrow(req.data)
#Number of observations where X>x and Y>y
xy.pos.data <- req.data[which(req.data$GrLivArea > 1130 & req.data$SalePrice > 163000),]
xy.pos <- nrow(xy.pos.data)
#Number of observations where X<x and Y<y
xy.neg.data <- req.data[which(req.data$GrLivArea <= 1130 & req.data$SalePrice <= 163000),]
xy.neg <- nrow(xy.neg.data)
#Number of observations where X>x and Y<y
x.pos.y.neg.data <- req.data[which(req.data$GrLivArea > 1130 & req.data$SalePrice <= 163000),]
x.pos.y.neg <- nrow(x.pos.y.neg.data)
#Number of observations where X<x and Y>y
x.neg.y.pos.data <- req.data[which(req.data$GrLivArea <= 1130 & req.data$SalePrice > 163000),]
x.neg.y.pos <- nrow(x.neg.y.pos.data)
house.data<- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
#add column and row totals
house.data<- cbind(house.data, Total = rowSums(house.data))
house.data<- rbind(house.data, Total = colSums(house.data))
rownames(house.data)<- c('(X>x)', '(X<=x)', 'Total')
kable(house.data, digits = 2, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Observed Data')
| | (Y>y) | (Y<=y) | Total |
|---|---|---|---|
| (X>x) | 720 | 374 | 1094 |
| (X<=x) | 8 | 358 | 366 |
| Total | 728 | 732 | 1460 |
house.data.prob <- matrix(c(round(xy.pos/total.obs,4), round(x.neg.y.pos/total.obs,4),round(x.pos.y.neg/total.obs,4),round(xy.neg/total.obs,4)), nrow=2, ncol=2)
house.data.prob <- cbind(house.data.prob, Total = round(rowSums(house.data.prob),2))
house.data.prob <- rbind(house.data.prob, Total = round(colSums(house.data.prob),2))
rownames(house.data.prob) <- c('(X>x)', '(X<=x)', 'Total')
kable(house.data.prob, digits = 4, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Observed Data Joint Probabilities')
| | (Y>y) | (Y<=y) | Total |
|---|---|---|---|
| (X>x) | 0.4932 | 0.2562 | 0.75 |
| (X<=x) | 0.0055 | 0.2452 | 0.25 |
| Total | 0.5000 | 0.5000 | 1.00 |
Note: Row and column totals are computed first and then rounded to 2 digits.
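As a side check (a sketch, assuming req.data is still loaded), the same contingency table can be produced in one call with table and addmargins:
#Cross-tabulate the two logical conditions and add marginal totals
addmargins(table(X.gt.x = req.data$GrLivArea > 1130, Y.gt.y = req.data$SalePrice > 163000))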
a. \(P(X>x~ |~ Y>y )\), read as the probability that X (GrLivArea) is greater than 1130 square feet given that Y (SalePrice) is greater than $163000.
This is known as a conditional probability because we are computing the probability under a condition: SalePrice is greater than $163000. A conditional probability has two parts, the outcome of interest and the condition. We can treat the condition as information we know to be true, and this information is usually used to describe the outcome.
\(P(GrLivArea~ > 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)
= \(\frac{720}{728} = 98.9 \%\)
Using the joint probabilities,
= \(\frac{0.4932}{0.5000} = 98.64 \%\)
Therefore, the probability that GrLivArea is greater than \(1130~ sq.ft.\), given that SalePrice is greater than \(\$163000\), is about \(99\%\).
b. \(P(X>x~ \&~ Y>y)\), read as the probability that X (GrLivArea) is greater than 1130 square feet and Y (SalePrice) is greater than $163000.
This is known as joint probability because we are computing the probability using outcomes of two variables.
\(P(GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\#~ cases~ observed}\)
= \(\frac{720}{1460} = 49.32 \%\)
Therefore, the probability that GrLivArea is greater than \(1130~ sq.ft.\) and SalePrice is greater than \(\$163000\) is \(49.32 \%\).
c. \(P(X<x~ |~ Y>y )\), read as the probability that X (GrLivArea) is less than 1130 square feet given that Y (SalePrice) is greater than $163000. This is a conditional probability.
\(P(GrLivArea~ < 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ < 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)
= \(\frac{8}{728} = 1.1 \%\)
Using the joint probabilities,
= \(\frac{0.0055}{0.5000} = 1.1 \%\)
Therefore, the probability that GrLivArea is less than \(1130~ sq.ft.\), given that SalePrice is greater than \(\$163000\), is \(1.1 \%\).
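The three probabilities can be verified directly in R (a quick sketch; note that, matching the contingency table counts, part (c) uses \(X \le x\)):
#Verify parts (a), (b) and (c) from the raw data
p_a <- with(req.data, mean(GrLivArea[SalePrice > 163000] > 1130))  #~0.989
p_b <- with(req.data, mean(GrLivArea > 1130 & SalePrice > 163000)) #~0.4932
p_c <- with(req.data, mean(GrLivArea[SalePrice > 163000] <= 1130)) #~0.011
round(c(a = p_a, b = p_b, c = p_c), 4)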
\(P(XY) = P(X)P(Y)\)
Above condition can be rewritten as
\(P(X \cap Y) = P(X)P(Y)\), condition will be true only when \(X\) and \(Y\) are independent.
We can say that above grade living area and sale price are independent only if an increase or decrease in the area does not affect the probability of an increase or decrease in the sale price of the house. We can test this with the following hypotheses.
Null Hypothesis(\(H_0\)): Sale price of the house is not influenced by above grade living area.
Alternative Hypothesis(\(H_A\)): Above grade living area has significant influence on sale price of the house.
If the two variables were independent, they would satisfy the condition
\(P(X>x~ \cap ~ Y>y) = P(X>x)P(Y>y)\)
which is equivalent to requiring that the conditional probabilities equal the marginals: \(P(X>x~ |~ Y>y) = P(X>x)\) and \(P(Y>y~ |~ X>x) = P(Y>y)\).
We will check the condition in two parts,
\(P(X>x~ |~ Y>y) = \frac{P(X>x~ \cap ~ Y>y)}{P(Y>y)} = \frac{720}{728} = 98.9 \%\)
whereas the marginal probability is
\(P(X>x) = \frac{1094}{1460} = 74.93 \%\)
Comparing the other way,
\(P(Y>y~ |~ X>x) = \frac{P(X>x~ \cap ~ Y>y)}{P(X>x)} = \frac{720}{1094} = 65.81 \%\)
whereas the marginal probability is
\(P(Y>y) = \frac{728}{1460} = 49.86 \%\)
Since the conditional probabilities differ substantially from the marginal probabilities, the independence condition is not met. We reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis (\(H_A\)): above grade living area has a significant influence on the sale price of the house.
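As a numeric check (a sketch), we can compare the joint probability with the product of the marginals directly:
#Independence would require joint = product of marginals
p_x  <- mean(req.data$GrLivArea > 1130)   #P(X>x), ~0.7493
p_y  <- mean(req.data$SalePrice > 163000) #P(Y>y), ~0.4986
p_xy <- mean(req.data$GrLivArea > 1130 & req.data$SalePrice > 163000)
c(joint = p_xy, product = p_x * p_y)      #0.4932 vs ~0.3736, clearly unequal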
house.data <- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
chisq.test(house.data)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: house.data
## X-squared = 441.58, df = 1, p-value < 2.2e-16
For a \(2 \times 2\) table of GrLivArea and SalePrice, degrees of freedom \((df) = (2-1)(2-1) = 1\). The p-value \(< 2.2 \times 10^{-16}\) is far below the \(0.05\) significance level, so we reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis (\(H_A\)) that above grade living area has a significant influence on the sale price of the house.
Problem: 2
Solution:
getmode <- function(v) {
  #Return the most frequent value in v (ties broken by first occurrence)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
house.stats <- matrix(data = NA, nrow=11,ncol=2)
qarea <- quantile(req.data$GrLivArea)
qprice <- quantile(req.data$SalePrice)
house.stats[1,1] <- nrow(req.data)
house.stats[1,2] <- nrow(req.data)
house.stats[2,1] <- length(req.data$GrLivArea[!is.na(req.data$GrLivArea)])
house.stats[2,2] <- length(req.data$SalePrice[!is.na(req.data$SalePrice)])
house.stats[3,1] <- paste0(min(req.data$GrLivArea), ' sq. ft.')
house.stats[3,2] <- paste0('$', min(req.data$SalePrice))
house.stats[4,1] <- paste0(max(req.data$GrLivArea), ' sq. ft.')
house.stats[4,2] <- paste0('$', max(req.data$SalePrice))
house.stats[5,1] <- paste0(median(req.data$GrLivArea), ' sq. ft.')
house.stats[5,2] <- paste0('$', median(req.data$SalePrice))
house.stats[6,1] <- paste0(qarea[2], ' sq. ft.')
house.stats[6,2] <- paste0('$', qprice[2])
house.stats[7,1] <- paste0(qarea[4], ' sq. ft.')
house.stats[7,2] <- paste0('$', qprice[4])
house.stats[8,1] <- paste0(round(mean(req.data$GrLivArea),2), ' sq. ft.')
house.stats[8,2] <- paste0('$', round(mean(req.data$SalePrice),2))
house.stats[9,1] <- round(sd(req.data$GrLivArea),2)
house.stats[9,2] <- round(sd(req.data$SalePrice),2)
house.stats[10,1] <- paste0(getmode(req.data$GrLivArea), ' sq. ft.')
house.stats[10,2] <- paste0('$', getmode(req.data$SalePrice))
house.stats[11,1] <- paste0(IQR(req.data$GrLivArea), ' sq. ft.')
house.stats[11,2] <- paste0('$', IQR(req.data$SalePrice))
rownames(house.stats)<- c('Number of Observations', 'Non-missing values', 'Minimum','Maximum', 'Median','1st quartile','3rd quartile', 'Average(mean)', 'Standard deviation', 'Mode', 'Interquartile range(IQR)')
kable(house.stats, digits = 2,
col.names = c('GrLivArea', 'SalePrice'),
align = "l",
caption = 'Univariate Descriptive Statistics', "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | GrLivArea | SalePrice |
|---|---|---|
| Number of Observations | 1460 | 1460 |
| Non-missing values | 1460 | 1460 |
| Minimum | 334 sq. ft. | $34900 |
| Maximum | 5642 sq. ft. | $755000 |
| Median | 1464 sq. ft. | $163000 |
| 1st quartile | 1129.5 sq. ft. | $129975 |
| 3rd quartile | 1776.75 sq. ft. | $214000 |
| Average(mean) | 1515.46 sq. ft. | $180921.2 |
| Standard deviation | 525.48 | 79442.5 |
| Mode | 864 sq. ft. | $140000 |
| Interquartile range(IQR) | 647.25 sq. ft. | $84025 |
fill <- "gray45"
line <- "navy"
bxgg <- ggplot(req.data, aes(x='', y=GrLivArea)) +
geom_boxplot(fill = fill, colour = line, alpha = 0.7, varwidth=T, width=0.6) +
#coord_flip() +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Above Grade Living Area(sq.ft.)", breaks = seq(0, max(req.data$GrLivArea) + 100, 500),
limits=c(0, max(req.data$GrLivArea) + 100)) +
ggtitle("Boxplot") +
stat_summary(geom="text", fun.y=quantile,
aes(label = sprintf("%1.2f", ..y..)),
position = position_nudge(x=0.35), size=2.0)
hgg <- ggplot(req.data, aes(x=GrLivArea)) +
geom_histogram(binwidth=100, colour = line, fill = fill, aes(y=..density.., fill=..count..)) +
scale_x_continuous(name = "Above Grade Living Area(sq.ft.)") +
stat_function(fun=dnorm, color="red",
args=list(mean=mean(req.data$GrLivArea), sd=sd(req.data$GrLivArea))) +
ggtitle("Histogram")
#geom_vline(xintercept = mean(req.data$GrLivArea), show_guide=TRUE, color="red", labels="Average") +
#geom_vline(xintercept = median(req.data$GrLivArea), show_guide=TRUE, color="green", labels="Median")
grid.arrange(bxgg, hgg, nrow = 2, top='Above Grade Living Area(sq.ft.)')
The histogram shows the distribution of above grade living area. The average area is 1515.46 sq.ft. with a standard deviation of 525.48 sq.ft. The distribution has a right tail, suggesting outliers to the right of the average.
kable(qarea, digits = 2,
caption = 'Quartiles',
align = 'l', padding = 10, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Quantile | GrLivArea (sq. ft.) |
|---|---|
| 0% | 334.00 |
| 25% | 1129.50 |
| 50% | 1464.00 |
| 75% | 1776.75 |
| 100% | 5642.00 |
fill <- "gray45"
line <- "navy"
bxgg <- ggplot(req.data, aes(x='', y=SalePrice)) +
geom_boxplot(fill = fill, colour = line, alpha = 0.7, varwidth=T, width=0.6) +
#coord_flip() +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Sale Price(Dollars)", breaks = seq(0, max(req.data$SalePrice) + 100, 100000),
limits=c(0, max(req.data$SalePrice) + 100)) +
ggtitle("Boxplot") +
stat_summary(geom="text", fun.y=quantile,
aes(label=sprintf("%1.2f", ..y..)),
position=position_nudge(x=0.35), size=2.0)
hgg <- ggplot(req.data, aes(x=SalePrice)) +
geom_histogram(binwidth=10000, colour= line, fill = fill, aes(y=..density.., fill=..count..)) +
scale_x_continuous(name = "Sale Price(Dollars)") +
stat_function(fun=dnorm, color="red",
args=list(mean=mean(req.data$SalePrice), sd=sd(req.data$SalePrice))) +
ggtitle("Histogram")
#geom_vline(xintercept = mean(req.data$SalePrice), show_guide=TRUE, color="red", labels="Average") +
#geom_vline(xintercept = median(req.data$SalePrice), show_guide=TRUE, color="green", labels="Median")
grid.arrange(bxgg, hgg, nrow = 2, top='Sale Price')
The histogram shows the distribution of the sale price of houses. The average sale price (\(\mu\)) is $180921.2, with a standard deviation (\(\sigma\)) of $79442.5. The distribution has a right tail, suggesting outliers to the right of the average.
kable(qprice, digits = 2,
caption = 'Quartiles',
align = 'l',
padding = 10, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Quantile | SalePrice ($) |
|---|---|
| 0% | 34900 |
| 25% | 129975 |
| 50% | 163000 |
| 75% | 214000 |
| 100% | 755000 |
ggplot(req.data, aes(x=GrLivArea, y=SalePrice)) +
geom_point() +
geom_smooth(method=lm) +
labs(title="Scatterplot Above Grade Living Area Vs. Sale Price",
x="Above Grade Living Area(sq.ft.)", y = "Sale Price(Dollars)")
From the boxplots, histograms, and scatterplot above we can see that there are some outliers and that the variance of sale price is not constant across living area. This causes the longer tail on the right side.
house.lm <- lm(req.data$SalePrice ~ req.data$GrLivArea)
summary(house.lm)
##
## Call:
## lm(formula = req.data$SalePrice ~ req.data$GrLivArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462999 -29800 -1124 21957 339832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18569.026 4480.755 4.144 0.0000361 ***
## req.data$GrLivArea 107.130 2.794 38.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.5018
## F-statistic: 1471 on 1 and 1458 DF, p-value: < 2.2e-16
Multiple R-squared of 0.5021 suggests the regression model can explain 50.21% of the variation in the data. The residual standard error of 56070 suggests the average distance of the data points from the fitted line is about $56,070, and roughly 95% of sale prices should fall within \(\pm 2 \times 56070 = \pm 112140\) of the fitted line. This is a very high value. In other words, for a given living area, the sale price typically varies by about $56,070. The above observations are without any data transformation.
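The \(\pm 112140\) figure is roughly twice the residual standard error; predict can produce exact 95% prediction intervals (a sketch on the fitted model; R will warn that predictions on the training data refer to future responses):
#95% prediction intervals for the first few observations
head(predict(house.lm, interval = "prediction", level = 0.95), 3)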
A Box-Cox transformation is a way to transform a non-normal dependent variable into a normal shape. Normality is an important assumption for many statistical techniques; if the data is not normal, applying a Box-Cox transformation lets us run a broader range of tests.
For all positive values of \(y\), it is defined by
\[ y(\lambda)=\begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \text{if }\lambda \neq 0 \\ \log y, & \text{if }\lambda = 0 \end{cases} \]
If \(y\) has negative values then it is defined as
\[ y(\lambda)=\begin{cases} \frac{(y + \lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \text{if }\lambda_1 \neq 0 \\ \log (y + \lambda_2), & \text{if }\lambda_1 = 0 \end{cases} \]
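A minimal sketch of the positive-\(y\) case (boxcox_transform is an illustrative helper written here, not part of MASS):
#Box-Cox transform for positive y; lambda = 0 falls back to log
boxcox_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
boxcox_transform(c(34900, 163000, 755000), 0.1)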
We will use the R function boxcox from the MASS library to determine the optimal lambda (\(\lambda\)) value.
par(mfrow=c(1,2))
house.bc <- boxcox(house.lm)
house.bc.df <- as.data.frame(house.bc)
lambda <- house.bc.df[which.max(house.bc.df$y),1]
boxcox(house.lm, plotit=T, lambda=seq(0,0.20,by=0.05))
From the boxcox plot above, the optimal lambda (\(\lambda\)) is 0.1010101, and the confidence interval runs between roughly \(0.02\) and \(0.18\). Since \(\lambda\) is less than \(0.5\), there is no good reason to transform the data.
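The confidence interval can also be read off the likelihood profile numerically (a sketch using the house.bc object computed above):
#Lambda values whose log-likelihood is within the 95% cutoff of the maximum
ci.cutoff <- max(house.bc$y) - qchisq(0.95, 1)/2
range(house.bc$x[house.bc$y > ci.cutoff]) #roughly (0.02, 0.18)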
However, let’s check if transforming data provides a better understanding of data.
req.data$SalePrice_trans <- ((req.data$SalePrice^lambda) -1)/lambda
ggplot(req.data, aes(x=GrLivArea, y=SalePrice_trans)) +
geom_point() +
geom_smooth(method=lm) +
labs(title="Scatterplot Above Grade Living Area Vs. Sale Price Transformed",
x="Above Grade Living Area(sq.ft.)", y = "Sale Price Transformed")
house_t.lm <- lm(req.data$SalePrice_trans ~ req.data$GrLivArea)
summary(house_t.lm)
##
## Call:
## lm(formula = req.data$SalePrice_trans ~ req.data$GrLivArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6488 -0.4945 0.0843 0.5314 3.1560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.72985263 0.07656336 270.75 <2e-16 ***
## req.data$GrLivArea 0.00181347 0.00004774 37.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9581 on 1458 degrees of freedom
## Multiple R-squared: 0.4975, Adjusted R-squared: 0.4971
## F-statistic: 1443 on 1 and 1458 DF, p-value: < 2.2e-16
Multiple R-squared of 0.4975 suggests that the regression model fit to the transformed data explains 49.75% of the variation. This value is less than the value obtained with the non-transformed data, so we can conclude the Box-Cox transformation is not very helpful in explaining the variation in this data.
Problem: 3
Solution:
Following are the three untransformed independent variables, plus SalePrice, that will be used:

- LotArea - Lot size in square feet.
- TotalBsmtSF - Total square feet of the basement area.
- GrLivArea - Above grade (ground) living area square feet.
- SalePrice - Price the house was sold for.

Correlation measures the trend shared between two variables. If the value is close to 1, the variables are positively related: as one variable increases, the other tends to move in the same direction. If the value is close to -1, the variables are negatively related: when one variable increases, the other decreases; such variables are also said to be inversely related.
As the dataset has more than 30 observations and the mean represents the center of the data, it meets parametric conditions; hence the Pearson correlation is chosen over Spearman. Also, squaring the Pearson correlation coefficient gives the amount of variation in \(y\) explained by \(x\); the same is not true of the Spearman method.
reqCols <- c('LotArea','TotalBsmtSF','GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]
pearson.cor <- cor(req.data,method="pearson")
pearson.cor
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1.0000000 0.2608331 0.2631162 0.2638434
## TotalBsmtSF 0.2608331 1.0000000 0.4548682 0.6135806
## GrLivArea 0.2631162 0.4548682 1.0000000 0.7086245
## SalePrice 0.2638434 0.6135806 0.7086245 1.0000000
Looking at the output, all the variables are positively correlated.
The correlation between TotalBsmtSF and SalePrice is \(0.61\), suggesting a bigger basement area is associated with a higher sale price. The square of the coefficient is \(0.3721\), meaning \(37.21 \%\) of the variance in the sale price of a house can be explained by the total area of the basement.
The correlation between GrLivArea and SalePrice is \(0.71\), suggesting a bigger living area is associated with a higher sale price. The square of the coefficient is \(0.5041\), meaning \(50.41 \%\) of the variance in the sale price of a house can be explained by the above grade living area.
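As a side check (a sketch, not part of the assignment), squaring the Pearson coefficient reproduces the simple-regression \(R^2\) from Problem 2, and the Spearman rank correlation can be computed with the same function:
#Squared Pearson correlation vs. Spearman rank correlation
pearson.cor["GrLivArea", "SalePrice"]^2 #~0.5021, matches the lm R-squared
cor(req.data$GrLivArea, req.data$SalePrice, method = "spearman")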
The precision matrix is the inverse of the correlation matrix. To compute it, I will use the R function solve.
inv.cor <- solve(pearson.cor)
inv.cor
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1.10622180 -0.1703170 -0.1623394 -0.07232846
## TotalBsmtSF -0.17031695 1.6321069 -0.0397442 -0.92832834
## GrLivArea -0.16233936 -0.0397442 2.0350650 -1.37487844
## SalePrice -0.07232846 -0.9283283 -1.3748784 2.56296011
Correlation Matrix multiplied by Precision Matrix
round(pearson.cor %*% inv.cor)
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1 0 0 0
## TotalBsmtSF 0 1 0 0
## GrLivArea 0 0 1 0
## SalePrice 0 0 0 1
Precision Matrix multiplied by Correlation Matrix
round(inv.cor %*% pearson.cor)
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1 0 0 0
## TotalBsmtSF 0 1 0 0
## GrLivArea 0 0 1 0
## SalePrice 0 0 0 1
In both cases the product is the identity matrix, confirming that solve returned the inverse.
Problem: 4
Solution:
The non-transformed independent variable I will be using is GrLivArea, and the distribution fitted is the normal distribution. The parameter estimates of the normal distribution are the mean (\(\mu\)) and the standard deviation (\(\sigma\)).
reqCols <- c('GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]
fit.output <- fitdistr(req.data$GrLivArea, densfun="normal")
fit.output
## mean sd
## 1515.46370 525.30039
## ( 13.74774) ( 9.72112)
Output of the fitdistr function: average (\(\mu\)) = 1515.46 and standard deviation (\(\sigma\)) = 525.3.
To find the optimal estimates, I will use the optim and dnorm functions; dnorm is the R function that computes the probability density of the normal distribution. Since GrLivArea cannot be negative and has to be greater than zero, we will use the output of fitdistr as starting values. However, a lower standard deviation would be preferred, and the standard deviation from fitdistr is a little over one-third of the average, so I will pass a starting value of 50 for the standard deviation to the optim function.
#Negative log-likelihood of a normal with mean params[1] and sd params[2]
likelihood.func <- function(params) { -sum(dnorm(req.data$GrLivArea, params[1], params[2], log=TRUE)) }
optim.output <- optim(c(fit.output$estimate[1], 50), likelihood.func) #initial guesses: fitdistr mean, sd = 50
optim.output
## $par
## mean
## 1515.7270 525.3384
##
## $value
## [1] 11217.05
##
## $counts
## function gradient
## 47 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Output of optim function, average(\(\mu\)) = 1515.73 and standard deviation(\(\sigma\)) = 525.34.
Apart from rounding, the optim function produced the same estimates as fitdistr.
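This is expected: for the normal distribution the maximum-likelihood estimates have a closed form, the sample mean and the square root of the biased (divide-by-n) variance (a quick sketch):
#Closed-form normal MLE: mean and sqrt of the (n-1)/n-scaled variance
n <- nrow(req.data)
c(mean = mean(req.data$GrLivArea),
  sd = sqrt(var(req.data$GrLivArea) * (n - 1)/n)) #1515.46, 525.30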
To generate 1000 samples, I will be using rnorm function with the optimal parameters generated by optim function.
#generate 1000 samples
house.sample <- rnorm(n=1000, mean=round(optim.output$par[1],2), sd=round(optim.output$par[2],2))
house.sample <- data.frame(house.sample)
names(house.sample)[1] <- "Samples"
req.data$areaSplit <- as.factor(req.data$GrLivArea >= round(optim.output$par[1],2))
ogg <- ggplot(req.data, aes(GrLivArea, fill=areaSplit)) +
  geom_histogram(color="white", binwidth=125) +
  scale_x_continuous(name = "Observed Area(sq.ft.)") +
  #FALSE level (area below average) comes first, so its label is listed first
  scale_fill_manual(values=c("#a3c4dc","#0e668b"), labels=c("area < 1515.73 sq.ft.","area >= 1515.73 sq.ft.")) +
  ylab("Number of Observations") +
  ggtitle("Observed Area Distribution") +
  geom_vline(xintercept = mean(req.data$GrLivArea), color="red", lwd=1)
house.sample$areaSplit <- as.factor(house.sample$Samples >= round(optim.output$par[1],2))
sgg <- ggplot(house.sample, aes(Samples, fill=areaSplit)) +
  geom_histogram(color="white", binwidth=125) +
  scale_x_continuous(name = "Sample Data Area(sq.ft.)") +
  scale_fill_manual(values=c("#a3c4dc","#0e668b"), labels=c("area < 1515.73 sq.ft.","area >= 1515.73 sq.ft.")) +
  ylab("Number of Samples") +
  ggtitle("Sample Data Distribution") +
  geom_vline(xintercept = mean(req.data$GrLivArea), color="red", lwd=1)
grid.arrange(sgg, ogg, nrow = 2, top='Histograms')
The red line represents the average of the data. In the observed data, the split between observations greater than or equal to the average and those less than the average is 653 : 807, i.e. 44.73 : 55.27 in percent. In the generated sample, the split is 517 : 483, i.e. 51.7 : 48.3. I will use a Chi-Square test to check whether the generated sample represents a normal distribution; in that case there should be 50% of cases where the area is greater than or equal to the average and 50% where it is less than the average.
Hypothesis,
\(H_0\) : Sample data follow a specified distribution.
\(H_A\) : Sample data do not follow the specified distribution.
#Chi-square test
#Ratio of actual observed values
nullp<-c(0.50, 0.50)
#Samples generated
sample.rows <- c(sum(house.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(house.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))
#Goodness-of-Fit Test
chisq.test(sample.rows, p=nullp)
##
## Chi-squared test for given probabilities
##
## data: sample.rows
## X-squared = 1.156, df = 1, p-value = 0.2823
Since the p-value is greater than \(0.05\), we fail to reject the null hypothesis (\(H_0\)) at the 95% confidence level: the sample data is consistent with a normal distribution.
The following test checks whether the sample represents the actual observed data. In the actual observations, the ratio between values greater than or equal to the average and those less than the average is 0.45 : 0.55. In the sample data, the corresponding counts are 517 : 483.
Hypothesis,
\(H_0\): Sample data represents actual observed data.
\(H_A\): Sample data do not represent actual observed data.
#Chi-square test
#Ratio of actual observed values
nullp<-c(round((sum(req.data$GrLivArea >= round(optim.output$par[1],2), na.rm=TRUE)) / nrow(req.data),2), round((sum(req.data$GrLivArea < round(optim.output$par[1],2), na.rm=TRUE)) / nrow(req.data),2))
#Samples generated
sample.rows <- c(sum(house.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(house.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))
#Goodness-of-Fit Test
chisq.test(sample.rows, p=nullp)
##
## Chi-squared test for given probabilities
##
## data: sample.rows
## X-squared = 18.137, df = 1, p-value = 0.00002055
Since the p-value is less than \(0.05\), we reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis (\(H_A\)) with 95% confidence: the sample data does not represent the actual observed data.
This is expected: the samples were drawn from a symmetric normal distribution parameterized by the average and standard deviation of the actual observations, whereas the observed data is right-skewed.
Problem: 5
#Training data
reqCols <- c('MSZoning', 'LotArea','Street','LotShape','BldgType','OverallQual','OverallCond','YearBuilt','YearRemodAdd','BsmtCond',
'TotalBsmtSF','CentralAir','X1stFlrSF','X2ndFlrSF','GrLivArea','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr',
'KitchenQual','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','SaleCondition','SalePrice')
kaggle.train <- kaggle.data[reqCols]
levels <- levels(kaggle.train$BsmtCond)
levels[length(levels) + 1] <- "NA"
kaggle.train$BsmtCond <- factor(kaggle.train$BsmtCond, levels = levels)
kaggle.train$BsmtCond[is.na(kaggle.train$BsmtCond)] <- "NA"
kaggle.train$KitchenQual[is.na(kaggle.train$KitchenQual)] <- "TA"
kaggle.train$MSZoning[is.na(kaggle.train$MSZoning)] <- "RM"
kaggle.train$TotalBsmtSF[is.na(kaggle.train$TotalBsmtSF)] <- 0
kaggle.train$GarageCars[is.na(kaggle.train$GarageCars)] <- 0
kaggle.train$GarageArea[is.na(kaggle.train$GarageArea)] <- 0
#Test data
reqCols <- c('Id','MSZoning', 'LotArea','Street','LotShape','BldgType','OverallQual','OverallCond','YearBuilt','YearRemodAdd','BsmtCond',
'TotalBsmtSF','CentralAir','X1stFlrSF','X2ndFlrSF','GrLivArea','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr',
'KitchenQual','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','SaleCondition')
kaggle.test <- kaggle.data.test[reqCols]
levels <- levels(kaggle.test$BsmtCond)
levels[length(levels) + 1] <- "NA"
kaggle.test$BsmtCond <- factor(kaggle.test$BsmtCond, levels = levels)
kaggle.test$BsmtCond[is.na(kaggle.test$BsmtCond)] <- "NA"
kaggle.test$KitchenQual[is.na(kaggle.test$KitchenQual)] <- "TA"
kaggle.test$MSZoning[is.na(kaggle.test$MSZoning)] <- "RM"
kaggle.test$TotalBsmtSF[is.na(kaggle.test$TotalBsmtSF)] <- 0
kaggle.test$GarageCars[is.na(kaggle.test$GarageCars)] <- 0
kaggle.test$GarageArea[is.na(kaggle.test$GarageArea)] <- 0
#kaggle.pred <- cbind(kaggle.test, s<-predict(kaggle.lm, kaggle.test))
#names(kaggle.pred)[ncol(kaggle.pred)] <- "SalePrice"
#reqCols <- c('Id','SalePrice')
#kaggle.pred.output <- kaggle.pred[reqCols]
#write.csv(kaggle.pred.output, file = "D:\\CUNY\\605\\FinalProject\\submission.csv",row.names=FALSE)
Columns used for building regression model,
names(kaggle.train)
## [1] "MSZoning" "LotArea" "Street" "LotShape"
## [5] "BldgType" "OverallQual" "OverallCond" "YearBuilt"
## [9] "YearRemodAdd" "BsmtCond" "TotalBsmtSF" "CentralAir"
## [13] "X1stFlrSF" "X2ndFlrSF" "GrLivArea" "FullBath"
## [17] "HalfBath" "BedroomAbvGr" "KitchenAbvGr" "KitchenQual"
## [21] "TotRmsAbvGrd" "Fireplaces" "GarageCars" "GarageArea"
## [25] "WoodDeckSF" "OpenPorchSF" "SaleCondition" "SalePrice"
Missing data for BsmtCond is replaced with a new factor level "NA". Missing KitchenQual values are replaced with "TA" and missing MSZoning values with "RM". For TotalBsmtSF, GarageCars, and GarageArea, missing data is replaced with zero.
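A quick sanity check (a sketch) confirms the imputation left no missing values in the model columns:
#Count remaining NAs per column; all should be zero
colSums(is.na(kaggle.train))
colSums(is.na(kaggle.test))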
kaggle.lm <- lm(SalePrice ~ ., data = kaggle.train)
summary(kaggle.lm)
##
## Call:
## lm(formula = SalePrice ~ ., data = kaggle.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466707 -14819 -1247 12263 259151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -834461.2017 148611.8703 -5.615 0.0000000236 ***
## MSZoningFV 13657.4321 12536.4521 1.089 0.276155
## MSZoningRH 14191.9534 14213.1843 0.999 0.318205
## MSZoningRL 9495.8430 11716.4513 0.810 0.417806
## MSZoningRM 3612.8881 11690.6505 0.309 0.757336
## LotArea 0.4916 0.1067 4.606 0.0000044696 ***
## StreetPave 24453.2266 15140.2930 1.615 0.106510
## LotShapeIR2 7857.2149 5646.9385 1.391 0.164319
## LotShapeIR3 -51788.7586 11331.9555 -4.570 0.0000052983 ***
## LotShapeReg -3311.6478 2041.0949 -1.622 0.104922
## BldgType2fmCon 5844.6630 7452.0259 0.784 0.432992
## BldgTypeDuplex 951.2547 7785.8292 0.122 0.902776
## BldgTypeTwnhs -18502.8225 5807.8563 -3.186 0.001475 **
## BldgTypeTwnhsE -14400.2176 3925.5787 -3.668 0.000253 ***
## OverallQual 13660.7015 1178.8995 11.588 < 2e-16 ***
## OverallCond 5157.3916 1056.5760 4.881 0.0000011738 ***
## YearBuilt 347.4258 63.5952 5.463 0.0000000552 ***
## YearRemodAdd 61.7003 68.7175 0.898 0.369400
## BsmtCondGd 7402.5137 7012.3297 1.056 0.291312
## BsmtCondPo 24684.5539 24853.7180 0.993 0.320785
## BsmtCondTA 8822.5327 5491.0097 1.607 0.108338
## BsmtCondNA 22140.3824 9510.5536 2.328 0.020054 *
## TotalBsmtSF 24.3378 5.1631 4.714 0.0000026708 ***
## CentralAirY -3221.3162 4474.2420 -0.720 0.471663
## X1stFlrSF 27.8363 19.6979 1.413 0.157828
## X2ndFlrSF 32.0763 19.2219 1.669 0.095390 .
## GrLivArea 19.0507 19.1194 0.996 0.319222
## FullBath 1011.8753 2711.2224 0.373 0.709043
## HalfBath -1636.4413 2624.6403 -0.623 0.533062
## BedroomAbvGr -7967.9128 1711.4010 -4.656 0.0000035302 ***
## KitchenAbvGr -24572.0086 7147.6623 -3.438 0.000604 ***
## KitchenQualFa -42158.0859 7729.0392 -5.455 0.0000000579 ***
## KitchenQualGd -45369.2909 4128.5516 -10.989 < 2e-16 ***
## KitchenQualTA -49501.7920 4818.3597 -10.274 < 2e-16 ***
## TotRmsAbvGrd 2290.0596 1219.3831 1.878 0.060580 .
## Fireplaces 5384.1705 1722.6888 3.125 0.001811 **
## GarageCars 12073.8212 2790.1894 4.327 0.0000161590 ***
## GarageArea -2.0307 9.5503 -0.213 0.831641
## WoodDeckSF 28.8093 7.6375 3.772 0.000169 ***
## OpenPorchSF -12.6482 14.8625 -0.851 0.394909
## SaleConditionAdjLand 17304.1575 17861.7073 0.969 0.332818
## SaleConditionAlloca 16667.3871 10954.5029 1.522 0.128355
## SaleConditionFamily -217.8811 8378.0094 -0.026 0.979256
## SaleConditionNormal 8008.9981 3623.5822 2.210 0.027248 *
## SaleConditionPartial 21527.7619 5001.1914 4.305 0.0000178853 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33540 on 1415 degrees of freedom
## Multiple R-squared: 0.8271, Adjusted R-squared: 0.8217
## F-statistic: 153.8 on 44 and 1415 DF, p-value: < 2.2e-16
The \(R^2\) value of the model is \(0.8271\), meaning the model can explain \(82.71 \%\) of the variance in the sale price of a house. However, the variables MSZoning, YearRemodAdd, OpenPorchSF, GarageArea, BsmtCond, HalfBath, CentralAir, GrLivArea, and FullBath have high p-values, suggesting they are not contributing to the model.
kaggle.lm <- update(kaggle.lm, .~ . -MSZoning -YearRemodAdd -OpenPorchSF -GarageArea -BsmtCond -HalfBath -CentralAir -GrLivArea -FullBath)
summary(kaggle.lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea + Street + LotShape + BldgType +
## OverallQual + OverallCond + YearBuilt + TotalBsmtSF + X1stFlrSF +
## X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd +
## Fireplaces + GarageCars + WoodDeckSF + SaleCondition, data = kaggle.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -460319 -15275 -1234 12643 262810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -824312.8757 97193.2468 -8.481 < 2e-16 ***
## LotArea 0.5207 0.1059 4.917 0.0000009802343 ***
## StreetPave 29323.9224 14553.6486 2.015 0.044104 *
## LotShapeIR2 8459.2088 5613.1042 1.507 0.132020
## LotShapeIR3 -50983.1758 11222.8435 -4.543 0.0000060181570 ***
## LotShapeReg -3372.1393 2012.4911 -1.676 0.094034 .
## BldgType2fmCon 6858.5577 7357.2576 0.932 0.351382
## BldgTypeDuplex 3047.4816 7680.4211 0.397 0.691585
## BldgTypeTwnhs -20821.5965 5545.0916 -3.755 0.000180 ***
## BldgTypeTwnhsE -15785.5942 3718.8552 -4.245 0.0000232938011 ***
## OverallQual 13780.4842 1157.0155 11.910 < 2e-16 ***
## OverallCond 5117.6174 920.2462 5.561 0.0000000319403 ***
## YearBuilt 409.3079 48.2739 8.479 < 2e-16 ***
## TotalBsmtSF 17.3150 3.8881 4.453 0.0000091129918 ***
## X1stFlrSF 52.7975 5.3224 9.920 < 2e-16 ***
## X2ndFlrSF 49.4065 3.8415 12.861 < 2e-16 ***
## BedroomAbvGr -7746.2693 1660.4203 -4.665 0.0000033707724 ***
## KitchenAbvGr -22970.0828 6987.5840 -3.287 0.001036 **
## KitchenQualFa -42693.8655 7541.8616 -5.661 0.0000000181677 ***
## KitchenQualGd -45548.2056 4067.3129 -11.199 < 2e-16 ***
## KitchenQualTA -51154.7355 4678.5453 -10.934 < 2e-16 ***
## TotRmsAbvGrd 2327.3620 1188.6831 1.958 0.050432 .
## Fireplaces 5019.5825 1667.6467 3.010 0.002658 **
## GarageCars 11222.2049 1655.5203 6.779 0.0000000000177 ***
## WoodDeckSF 28.3168 7.5812 3.735 0.000195 ***
## SaleConditionAdjLand 20224.9292 17685.0618 1.144 0.252974
## SaleConditionAlloca 14514.8202 10856.2911 1.337 0.181437
## SaleConditionFamily -821.0596 8313.9049 -0.099 0.921345
## SaleConditionNormal 8218.0876 3545.5362 2.318 0.020597 *
## SaleConditionPartial 22001.7485 4877.4789 4.511 0.0000069839754 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33540 on 1430 degrees of freedom
## Multiple R-squared: 0.8253, Adjusted R-squared: 0.8217
## F-statistic: 232.9 on 29 and 1430 DF, p-value: < 2.2e-16
In the second model, after removing those variables, multiple \(R^2\) dropped slightly (from \(0.8271\) to \(0.8253\)) while \(Adjusted~ R^2\) remained constant at \(0.8217\). Also, every remaining variable is contributing to the model.
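The two nested models can also be compared formally with an F-test (a side sketch, not part of the original write-up; kaggle.lm.full is an illustrative name for a refit of the full model):
#Refit the full model and test whether the dropped variables matter jointly
kaggle.lm.full <- lm(SalePrice ~ ., data = kaggle.train)
anova(kaggle.lm, kaggle.lm.full)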
#Predict sale prices for the test set and attach them as a SalePrice column
kaggle.pred <- cbind(kaggle.test, SalePrice = predict(kaggle.lm, kaggle.test))
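A minimal sketch of writing the submission file (the output path is illustrative; the original local-path version appears commented out earlier):
#Keep only the Id and predicted SalePrice columns and write the CSV
kaggle.pred.output <- kaggle.pred[c('Id', 'SalePrice')]
write.csv(kaggle.pred.output, file = "submission.csv", row.names = FALSE)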
The predictions were submitted to the Kaggle website. Following is my username and score.