library(knitr)
library(ggplot2)
library(stats)
library(kableExtra)
library(grid)
library(gridExtra)
library(MASS)
library(data.table)
library(RCurl)
options(scipen=8)
set.seed(0305)
Problem: 1
Solution:
To answer the above question I will use data from the GrLivArea column, which the documentation describes as above grade (ground) living area in square feet. This column will be X, the independent variable, and Y will be SalePrice.
Load the data from train.csv; it was downloaded from kaggle.com.
#Load data
#kaggle.data <- read.csv("D:\\CUNY\\605\\FinalProject\\train.csv", header=T)
#kaggle.data.test <- read.csv("D:\\CUNY\\605\\FinalProject\\test.csv", header=T)
kaggle.data <- read.csv(text=getURL("https://raw.githubusercontent.com/akulapa/Akula-Data605-Final/master/train.csv"), header=T)
kaggle.data.test <- read.csv(text=getURL("https://raw.githubusercontent.com/akulapa/Akula-Data605-Final/master/test.csv"), header=T)
#Get required data
reqCols <- c('GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]
kable(req.data[sample(nrow(req.data), 10), ] , format="pandoc", align="l", row.names = F, caption = "Sample Data")
| GrLivArea | SalePrice |
|---|---|
| 2108 | 235000 |
| 2290 | 122500 |
| 2097 | 423000 |
| 860 | 137000 |
| 1953 | 240000 |
| 1980 | 315000 |
| 3194 | 359100 |
| 1743 | 175000 |
| 990 | 126000 |
| 1768 | 224900 |
Summary of the data. The first quartile of X is \(x = 1130\) (1129.5 before the summary output's rounding), and the second quartile of Y is \(y = 163000\), the median sale price of the houses.
summary(req.data)
## GrLivArea SalePrice
## Min. : 334 Min. : 34900
## 1st Qu.:1130 1st Qu.:129975
## Median :1464 Median :163000
## Mean :1515 Mean :180921
## 3rd Qu.:1777 3rd Qu.:214000
## Max. :5642 Max. :755000
#Create matrix based on conditions
#Total observations
total.obs <- nrow(req.data)
#Number of observations where X>x and Y>y
xy.pos.data <- req.data[which(req.data$GrLivArea > 1130 & req.data$SalePrice > 163000),]
xy.pos <- nrow(xy.pos.data)
#Number of observations where X<x and Y<y
xy.neg.data <- req.data[which(req.data$GrLivArea <= 1130 & req.data$SalePrice <= 163000),]
xy.neg <- nrow(xy.neg.data)
#Number of observations where X>x and Y<y
x.pos.y.neg.data <- req.data[which(req.data$GrLivArea > 1130 & req.data$SalePrice <= 163000),]
x.pos.y.neg <- nrow(x.pos.y.neg.data)
#Number of observations where X<x and Y>y
x.neg.y.pos.data <- req.data[which(req.data$GrLivArea <= 1130 & req.data$SalePrice > 163000),]
x.neg.y.pos <- nrow(x.neg.y.pos.data)
house.data<- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
#add column and row totals
house.data<- cbind(house.data, Total = rowSums(house.data))
house.data<- rbind(house.data, Total = colSums(house.data))
rownames(house.data)<- c('(X>x)', '(X<=x)', 'Total')
kable(house.data, digits = 2, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Observed Data')
| | (Y>y) | (Y<=y) | Total |
|---|---|---|---|
| (X>x) | 720 | 374 | 1094 |
| (X<=x) | 8 | 358 | 366 |
| Total | 728 | 732 | 1460 |
house.data.prob <- matrix(c(round(xy.pos/total.obs,4), round(x.neg.y.pos/total.obs,4),round(x.pos.y.neg/total.obs,4),round(xy.neg/total.obs,4)), nrow=2, ncol=2)
house.data.prob <- cbind(house.data.prob, Total = round(rowSums(house.data.prob),2))
house.data.prob <- rbind(house.data.prob, Total = round(colSums(house.data.prob),2))
rownames(house.data.prob) <- c('(X>x)', '(X<=x)', 'Total')
kable(house.data.prob, digits = 4, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Observed Data Joint Probabilities')
| | (Y>y) | (Y<=y) | Total |
|---|---|---|---|
| (X>x) | 0.4932 | 0.2562 | 0.75 |
| (X<=x) | 0.0055 | 0.2452 | 0.25 |
| Total | 0.5000 | 0.5000 | 1.00 |
Note: Row and column totals are computed first and then rounded to 2 digits.
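As a side check (a sketch, assuming req.data is still loaded), the same contingency table can be produced in one call with table and addmargins:
#Cross-tabulate the two logical conditions and add marginal totals
addmargins(table(X.gt.x = req.data$GrLivArea > 1130, Y.gt.y = req.data$SalePrice > 163000))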
a. \(P(X>x~ |~ Y>y )\), read as the probability that X (GrLivArea) is greater than 1130 square feet given that Y (SalePrice) is greater than $163000.
This is known as a conditional probability because we are computing the probability under a condition: SalePrice is greater than $163000. A conditional probability has two parts, the outcome of interest and the condition. We can treat the condition as information we know to be true, and this information is usually used to describe the outcome.
\(P(GrLivArea~ > 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)
= \(\frac{720}{728} = 98.9 \%\)
Using the joint probabilities,
= \(\frac{0.4932}{0.5000} = 98.64 \%\)
Therefore, the probability that GrLivArea is greater than \(1130~ sq.ft.\), given that SalePrice is greater than \(\$163000\), is about \(99\%\).
b. \(P(X>x~ \&~ Y>y)\), read as the probability that X (GrLivArea) is greater than 1130 square feet and Y (SalePrice) is greater than $163000.
This is known as joint probability because we are computing the probability using outcomes of two variables.
\(P(GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\#~ cases~ observed}\)
= \(\frac{720}{1460} = 49.32 \%\)
Therefore, the probability that GrLivArea is greater than \(1130~ sq.ft.\) and SalePrice is greater than \(\$163000\) is \(49.32 \%\).
c. \(P(X<x~ |~ Y>y )\), read as the probability that X (GrLivArea) is less than 1130 square feet given that Y (SalePrice) is greater than $163000. This is a conditional probability.
\(P(GrLivArea~ < 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ < 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)
= \(\frac{8}{728} = 1.1 \%\)
Using the joint probabilities,
= \(\frac{0.0055}{0.5000} = 1.1 \%\)
Therefore, the probability that GrLivArea is less than \(1130~ sq.ft.\), given that SalePrice is greater than \(\$163000\), is \(1.1 \%\).
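The three probabilities can be verified directly in R (a quick sketch; note that, matching the contingency table counts, part (c) uses \(X \le x\)):
#Verify parts (a), (b) and (c) from the raw data
p_a <- with(req.data, mean(GrLivArea[SalePrice > 163000] > 1130))  #~0.989
p_b <- with(req.data, mean(GrLivArea > 1130 & SalePrice > 163000)) #~0.4932
p_c <- with(req.data, mean(GrLivArea[SalePrice > 163000] <= 1130)) #~0.011
round(c(a = p_a, b = p_b, c = p_c), 4)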
\(P(XY) = P(X)P(Y)\)
Above condition can be rewritten as
\(P(X \cap Y) = P(X)P(Y)\), condition will be true only when \(X\) and \(Y\) are independent.
We can say that above grade living area and sale price are independent only if an increase or decrease in the area does not affect the probability of an increase or decrease in the sale price of the house. We can test this with the following hypotheses.
Null Hypothesis(\(H_0\)): Sale price of the house is not influenced by above grade living area.
Alternative Hypothesis(\(H_A\)): Above grade living area has significant influence on sale price of the house.
If the two variables were independent, they would satisfy the condition
\(P(X>x~ \cap ~ Y>y) = P(X>x)P(Y>y)\)
which is equivalent to requiring that the conditional probabilities equal the marginals: \(P(X>x~ |~ Y>y) = P(X>x)\) and \(P(Y>y~ |~ X>x) = P(Y>y)\).
We will check the condition in two parts,
\(P(X>x~ |~ Y>y) = \frac{P(X>x~ \cap ~ Y>y)}{P(Y>y)} = \frac{720}{728} = 98.9 \%\)
whereas the marginal probability is
\(P(X>x) = \frac{1094}{1460} = 74.93 \%\)
Comparing the other way,
\(P(Y>y~ |~ X>x) = \frac{P(X>x~ \cap ~ Y>y)}{P(X>x)} = \frac{720}{1094} = 65.81 \%\)
whereas the marginal probability is
\(P(Y>y) = \frac{728}{1460} = 49.86 \%\)
Since the conditional probabilities differ substantially from the marginal probabilities, the independence condition is not met. We reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis (\(H_A\)): above grade living area has a significant influence on the sale price of the house.
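As a numeric check (a sketch), we can compare the joint probability with the product of the marginals directly:
#Independence would require joint = product of marginals
p_x  <- mean(req.data$GrLivArea > 1130)   #P(X>x), ~0.7493
p_y  <- mean(req.data$SalePrice > 163000) #P(Y>y), ~0.4986
p_xy <- mean(req.data$GrLivArea > 1130 & req.data$SalePrice > 163000)
c(joint = p_xy, product = p_x * p_y)      #0.4932 vs ~0.3736, clearly unequal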
house.data <- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
chisq.test(house.data)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: house.data
## X-squared = 441.58, df = 1, p-value < 2.2e-16
For a \(2 \times 2\) table of GrLivArea and SalePrice, degrees of freedom \((df) = (2-1)(2-1) = 1\). The p-value \(< 2.2 \times 10^{-16}\) is far below the \(0.05\) significance level, so we reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis (\(H_A\)) that above grade living area has a significant influence on the sale price of the house.
Problem: 2
Solution:
getmode <- function(v) {
  #Return the most frequent value in v (ties broken by first occurrence)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
house.stats <- matrix(data = NA, nrow=11,ncol=2)
qarea <- quantile(req.data$GrLivArea)
qprice <- quantile(req.data$SalePrice)
house.stats[1,1] <- nrow(req.data)
house.stats[1,2] <- nrow(req.data)
house.stats[2,1] <- length(req.data$GrLivArea[!is.na(req.data$GrLivArea)])
house.stats[2,2] <- length(req.data$SalePrice[!is.na(req.data$SalePrice)])
house.stats[3,1] <- paste0(min(req.data$GrLivArea), ' sq. ft.')
house.stats[3,2] <- paste0('$', min(req.data$SalePrice))
house.stats[4,1] <- paste0(max(req.data$GrLivArea), ' sq. ft.')
house.stats[4,2] <- paste0('$', max(req.data$SalePrice))
house.stats[5,1] <- paste0(median(req.data$GrLivArea), ' sq. ft.')
house.stats[5,2] <- paste0('$', median(req.data$SalePrice))
house.stats[6,1] <- paste0(qarea[2], ' sq. ft.')
house.stats[6,2] <- paste0('$', qprice[2])
house.stats[7,1] <- paste0(qarea[4], ' sq. ft.')
house.stats[7,2] <- paste0('$', qprice[4])
house.stats[8,1] <- paste0(round(mean(req.data$GrLivArea),2), ' sq. ft.')
house.stats[8,2] <- paste0('$', round(mean(req.data$SalePrice),2))
house.stats[9,1] <- round(sd(req.data$GrLivArea),2)
house.stats[9,2] <- round(sd(req.data$SalePrice),2)
house.stats[10,1] <- paste0(getmode(req.data$GrLivArea), ' sq. ft.')
house.stats[10,2] <- paste0('$', getmode(req.data$SalePrice))
house.stats[11,1] <- paste0(IQR(req.data$GrLivArea), ' sq. ft.')
house.stats[11,2] <- paste0('$', IQR(req.data$SalePrice))
rownames(house.stats)<- c('Number of Observations', 'Non-missing values', 'Minimum','Maximum', 'Median','1st quartile','3rd quartile', 'Average(mean)', 'Standard deviation', 'Mode', 'Interquartile range(IQR)')
kable(house.stats, digits = 2,
col.names = c('GrLivArea', 'SalePrice'),
align = "l",
caption = 'Univariate Descriptive Statistics', "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | GrLivArea | SalePrice |
|---|---|---|
| Number of Observations | 1460 | 1460 |
| Non-missing values | 1460 | 1460 |
| Minimum | 334 sq. ft. | $34900 |
| Maximum | 5642 sq. ft. | $755000 |
| Median | 1464 sq. ft. | $163000 |
| 1st quartile | 1129.5 sq. ft. | $129975 |
| 3rd quartile | 1776.75 sq. ft. | $214000 |
| Average(mean) | 1515.46 sq. ft. | $180921.2 |
| Standard deviation | 525.48 | 79442.5 |
| Mode | 864 sq. ft. | $140000 |
| Interquartile range(IQR) | 647.25 sq. ft. | $84025 |
fill <- "gray45"
line <- "navy"
bxgg <- ggplot(req.data, aes(x='', y=GrLivArea)) +
geom_boxplot(fill = fill, colour = line, alpha = 0.7, varwidth=T, width=0.6) +
#coord_flip() +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Above Grade Living Area(sq.ft.)", breaks = seq(0, max(req.data$GrLivArea) + 100, 500),
limits=c(0, max(req.data$GrLivArea) + 100)) +
ggtitle("Boxplot") +
stat_summary(geom="text", fun.y=quantile,
aes(label = sprintf("%1.2f", ..y..)),
position = position_nudge(x=0.35), size=2.0)
hgg <- ggplot(req.data, aes(x=GrLivArea)) +
geom_histogram(binwidth=100, colour = line, fill = fill, aes(y=..density.., fill=..count..)) +
scale_x_continuous(name = "Above Grade Living Area(sq.ft.)") +
stat_function(fun=dnorm, color="red",
args=list(mean=mean(req.data$GrLivArea), sd=sd(req.data$GrLivArea))) +
ggtitle("Histogram")
#geom_vline(xintercept = mean(req.data$GrLivArea), show_guide=TRUE, color="red", labels="Average") +
#geom_vline(xintercept = median(req.data$GrLivArea), show_guide=TRUE, color="green", labels="Median")
grid.arrange(bxgg, hgg, nrow = 2, top='Above Grade Living Area(sq.ft.)')
The histogram shows the distribution of above grade living area. The average area is 1515.46 sq.ft. with a standard deviation of 525.48 sq.ft. The distribution has a right tail, suggesting outliers to the right of the average.
kable(qarea, digits = 2,
caption = 'Quartiles',
align = 'l', padding = 10, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Quantile | GrLivArea (sq. ft.) |
|---|---|
| 0% | 334.00 |
| 25% | 1129.50 |
| 50% | 1464.00 |
| 75% | 1776.75 |
| 100% | 5642.00 |
fill <- "gray45"
line <- "navy"
bxgg <- ggplot(req.data, aes(x='', y=SalePrice)) +
geom_boxplot(fill = fill, colour = line, alpha = 0.7, varwidth=T, width=0.6) +
#coord_flip() +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Sale Price(Dollars)", breaks = seq(0, max(req.data$SalePrice) + 100, 100000),
limits=c(0, max(req.data$SalePrice) + 100)) +
ggtitle("Boxplot") +
stat_summary(geom="text", fun.y=quantile,
aes(label=sprintf("%1.2f", ..y..)),
position=position_nudge(x=0.35), size=2.0)
hgg <- ggplot(req.data, aes(x=SalePrice)) +
geom_histogram(binwidth=10000, colour= line, fill = fill, aes(y=..density.., fill=..count..)) +
scale_x_continuous(name = "Sale Price(Dollars)") +
stat_function(fun=dnorm, color="red",
args=list(mean=mean(req.data$SalePrice), sd=sd(req.data$SalePrice))) +
ggtitle("Histogram")
#geom_vline(xintercept = mean(req.data$SalePrice), show_guide=TRUE, color="red", labels="Average") +
#geom_vline(xintercept = median(req.data$SalePrice), show_guide=TRUE, color="green", labels="Median")
grid.arrange(bxgg, hgg, nrow = 2, top='Sale Price')
The histogram shows the distribution of the sale price of houses. The average sale price (\(\mu\)) is $180921.2, with a standard deviation (\(\sigma\)) of $79442.5. The distribution has a right tail, suggesting outliers to the right of the average.
kable(qprice, digits = 2,
caption = 'Quartiles',
align = 'l',
padding = 10, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Quantile | SalePrice ($) |
|---|---|
| 0% | 34900 |
| 25% | 129975 |
| 50% | 163000 |
| 75% | 214000 |
| 100% | 755000 |
ggplot(req.data, aes(x=GrLivArea, y=SalePrice)) +
geom_point() +
geom_smooth(method=lm) +
labs(title="Scatterplot Above Grade Living Area Vs. Sale Price",
x="Above Grade Living Area(sq.ft.)", y = "Sale Price(Dollars)")
From the boxplots, histograms, and scatterplot above we can see that there are some outliers and that the variance of sale price is not constant across living area. This causes the longer tail on the right side.
house.lm <- lm(req.data$SalePrice ~ req.data$GrLivArea)
summary(house.lm)
##
## Call:
## lm(formula = req.data$SalePrice ~ req.data$GrLivArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462999 -29800 -1124 21957 339832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18569.026 4480.755 4.144 0.0000361 ***
## req.data$GrLivArea 107.130 2.794 38.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.5018
## F-statistic: 1471 on 1 and 1458 DF, p-value: < 2.2e-16
Multiple R-squared of 0.5021 suggests the regression model can explain 50.21% of the variation in the data. The residual standard error of 56070 suggests the average distance of the data points from the fitted line is about $56,070, and roughly 95% of sale prices should fall within \(\pm 2 \times 56070 = \pm 112140\) of the fitted line. This is a very high value. In other words, for a given living area, the sale price typically varies by about $56,070. The above observations are without any data transformation.
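The \(\pm 112140\) figure is roughly twice the residual standard error; predict can produce exact 95% prediction intervals (a sketch on the fitted model; R will warn that predictions on the training data refer to future responses):
#95% prediction intervals for the first few observations
head(predict(house.lm, interval = "prediction", level = 0.95), 3)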
A Box-Cox transformation is a way to transform a non-normal dependent variable into a normal shape. Normality is an important assumption for many statistical techniques; if the data is not normal, applying a Box-Cox transformation lets us run a broader range of tests.
For all positive values of \(y\), it is defined by
\[ y(\lambda)=\begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \text{if }\lambda \neq 0 \\ \log y, & \text{if }\lambda = 0 \end{cases} \]
If \(y\) has negative values then it is defined as
\[ y(\lambda)=\begin{cases} \frac{(y + \lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \text{if }\lambda_1 \neq 0 \\ \log (y + \lambda_2), & \text{if }\lambda_1 = 0 \end{cases} \]
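A minimal sketch of the positive-\(y\) case (boxcox_transform is an illustrative helper written here, not part of MASS):
#Box-Cox transform for positive y; lambda = 0 falls back to log
boxcox_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
boxcox_transform(c(34900, 163000, 755000), 0.1)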
We will use the R function boxcox from the MASS library to determine the optimal lambda (\(\lambda\)) value.
par(mfrow=c(1,2))
house.bc <- boxcox(house.lm)
house.bc.df <- as.data.frame(house.bc)
lambda <- house.bc.df[which.max(house.bc.df$y),1]
boxcox(house.lm, plotit=T, lambda=seq(0,0.20,by=0.05))
From the boxcox plot above, the optimal lambda (\(\lambda\)) is 0.1010101, and the confidence interval runs between roughly \(0.02\) and \(0.18\). Since \(\lambda\) is less than \(0.5\), there is no good reason to transform the data.
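The confidence interval can also be read off the likelihood profile numerically (a sketch using the house.bc object computed above):
#Lambda values whose log-likelihood is within the 95% cutoff of the maximum
ci.cutoff <- max(house.bc$y) - qchisq(0.95, 1)/2
range(house.bc$x[house.bc$y > ci.cutoff]) #roughly (0.02, 0.18)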
However, let’s check if transforming data provides a better understanding of data.
req.data$SalePrice_trans <- ((req.data$SalePrice^lambda) -1)/lambda
ggplot(req.data, aes(x=GrLivArea, y=SalePrice_trans)) +
geom_point() +
geom_smooth(method=lm) +
labs(title="Scatterplot Above Grade Living Area Vs. Sale Price Transformed",
x="Above Grade Living Area(sq.ft.)", y = "Sale Price Transformed")
house_t.lm <- lm(req.data$SalePrice_trans ~ req.data$GrLivArea)
summary(house_t.lm)
##
## Call:
## lm(formula = req.data$SalePrice_trans ~ req.data$GrLivArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6488 -0.4945 0.0843 0.5314 3.1560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.72985263 0.07656336 270.75 <2e-16 ***
## req.data$GrLivArea 0.00181347 0.00004774 37.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9581 on 1458 degrees of freedom
## Multiple R-squared: 0.4975, Adjusted R-squared: 0.4971
## F-statistic: 1443 on 1 and 1458 DF, p-value: < 2.2e-16
Multiple R-squared of 0.4975 suggests that the regression model fit to the transformed data explains 49.75% of the variation. This value is less than the value obtained with the non-transformed data, so we can conclude the Box-Cox transformation is not very helpful in explaining the variation in this data.
Problem: 3
Solution:
Following are the three untransformed independent variables, plus SalePrice, that will be used:

- LotArea - Lot size in square feet.
- TotalBsmtSF - Total square feet of the basement area.
- GrLivArea - Above grade (ground) living area square feet.
- SalePrice - Price the house was sold for.

Correlation measures the trend shared between two variables. If the value is close to 1, the variables are positively related: as one variable increases, the other tends to move in the same direction. If the value is close to -1, the variables are negatively related: when one variable increases, the other decreases; such variables are also said to be inversely related.
As the dataset has more than 30 observations and the mean represents the center of the data, it meets parametric conditions; hence the Pearson correlation is chosen over Spearman. Also, squaring the Pearson correlation coefficient gives the amount of variation in \(y\) explained by \(x\); the same is not true of the Spearman method.
reqCols <- c('LotArea','TotalBsmtSF','GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]
pearson.cor <- cor(req.data,method="pearson")
pearson.cor
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1.0000000 0.2608331 0.2631162 0.2638434
## TotalBsmtSF 0.2608331 1.0000000 0.4548682 0.6135806
## GrLivArea 0.2631162 0.4548682 1.0000000 0.7086245
## SalePrice 0.2638434 0.6135806 0.7086245 1.0000000
Looking at the output, all the variables are positively correlated.
The correlation between TotalBsmtSF and SalePrice is \(0.61\), suggesting a bigger basement area is associated with a higher sale price. The square of the coefficient is \(0.3721\), meaning \(37.21 \%\) of the variance in the sale price of a house can be explained by the total area of the basement.
The correlation between GrLivArea and SalePrice is \(0.71\), suggesting a bigger living area is associated with a higher sale price. The square of the coefficient is \(0.5041\), meaning \(50.41 \%\) of the variance in the sale price of a house can be explained by the above grade living area.
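As a side check (a sketch, not part of the assignment), squaring the Pearson coefficient reproduces the simple-regression \(R^2\) from Problem 2, and the Spearman rank correlation can be computed with the same function:
#Squared Pearson correlation vs. Spearman rank correlation
pearson.cor["GrLivArea", "SalePrice"]^2 #~0.5021, matches the lm R-squared
cor(req.data$GrLivArea, req.data$SalePrice, method = "spearman")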
The precision matrix is the inverse of the correlation matrix. To compute it, I will use the R function solve.
inv.cor <- solve(pearson.cor)
inv.cor
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1.10622180 -0.1703170 -0.1623394 -0.07232846
## TotalBsmtSF -0.17031695 1.6321069 -0.0397442 -0.92832834
## GrLivArea -0.16233936 -0.0397442 2.0350650 -1.37487844
## SalePrice -0.07232846 -0.9283283 -1.3748784 2.56296011
Correlation Matrix multiplied by Precision Matrix
round(pearson.cor %*% inv.cor)
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1 0 0 0
## TotalBsmtSF 0 1 0 0
## GrLivArea 0 0 1 0
## SalePrice 0 0 0 1
Precision Matrix multiplied by Correlation Matrix
round(inv.cor %*% pearson.cor)
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1 0 0 0
## TotalBsmtSF 0 1 0 0
## GrLivArea 0 0 1 0
## SalePrice 0 0 0 1
In both cases the product is the identity matrix, confirming that solve returned the inverse.
Problem: 4
Solution:
The non-transformed independent variable I will be using is GrLivArea, and the distribution fitted is the normal distribution. The parameter estimates of the normal distribution are the mean (\(\mu\)) and the standard deviation (\(\sigma\)).
reqCols <- c('GrLivArea', 'SalePrice')
req.data <- kaggle.data[reqCols]
fit.output <- fitdistr(req.data$GrLivArea, densfun="normal")
fit.output
## mean sd
## 1515.46370 525.30039
## ( 13.74774) ( 9.72112)
Output of the fitdistr function: average (\(\mu\)) = 1515.46 and standard deviation (\(\sigma\)) = 525.3.
To find the optimal estimates, I will use the optim and dnorm functions; dnorm is the R function that computes the probability density of the normal distribution. Since GrLivArea cannot be negative and has to be greater than zero, we will use the output of fitdistr as starting values. However, a lower standard deviation would be preferred, and the standard deviation from fitdistr is a little over one-third of the average, so I will pass a starting value of 50 for the standard deviation to the optim function.
#Negative log-likelihood of a normal with mean params[1] and sd params[2]
likelihood.func <- function(params) { -sum(dnorm(req.data$GrLivArea, params[1], params[2], log=TRUE)) }
optim.output <- optim(c(fit.output$estimate[1], 50), likelihood.func) #initial guesses: fitdistr mean, sd = 50
optim.output
## $par
## mean
## 1515.7270 525.3384
##
## $value
## [1] 11217.05
##
## $counts
## function gradient
## 47 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Output of optim function, average(\(\mu\)) = 1515.73 and standard deviation(\(\sigma\)) = 525.34.
Apart from rounding, the optim function produced the same estimates as fitdistr.
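This is expected: for the normal distribution the maximum-likelihood estimates have a closed form, the sample mean and the square root of the biased (divide-by-n) variance (a quick sketch):
#Closed-form normal MLE: mean and sqrt of the (n-1)/n-scaled variance
n <- nrow(req.data)
c(mean = mean(req.data$GrLivArea),
  sd = sqrt(var(req.data$GrLivArea) * (n - 1)/n)) #1515.46, 525.30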
To generate 1000 samples, I will be using rnorm function with the optimal parameters generated by optim function.
#generate 1000 samples
house.sample <- rnorm(n=1000, mean=round(optim.output$par[1],2), sd=round(optim.output$par[2],2))
house.sample <- data.frame(house.sample)
names(house.sample)[1] <- "Samples"
req.data$areaSplit <- as.factor(req.data$GrLivArea >= round(optim.output$par[1],2))
ogg <- ggplot(req.data, aes(GrLivArea, fill=areaSplit)) +
  geom_histogram(color="white", binwidth=125) +
  scale_x_continuous(name = "Observed Area(sq.ft.)") +
  #FALSE level (area below average) comes first, so its label is listed first
  scale_fill_manual(values=c("#a3c4dc","#0e668b"), labels=c("area < 1515.73 sq.ft.","area >= 1515.73 sq.ft.")) +
  ylab("Number of Observations") +
  ggtitle("Observed Area Distribution") +
  geom_vline(xintercept = mean(req.data$GrLivArea), color="red", lwd=1)
house.sample$areaSplit <- as.factor(house.sample$Samples >= round(optim.output$par[1],2))
sgg <- ggplot(house.sample, aes(Samples, fill=areaSplit)) +
  geom_histogram(color="white", binwidth=125) +
  scale_x_continuous(name = "Sample Data Area(sq.ft.)") +
  scale_fill_manual(values=c("#a3c4dc","#0e668b"), labels=c("area < 1515.73 sq.ft.","area >= 1515.73 sq.ft.")) +
  ylab("Number of Samples") +
  ggtitle("Sample Data Distribution") +
  geom_vline(xintercept = mean(req.data$GrLivArea), color="red", lwd=1)
grid.arrange(sgg, ogg, nrow = 2, top='Histograms')
The red line represents the average of the data. In the observed data, the split between observations greater than or equal to the average and those less than the average is 653 : 807, i.e. 44.73 : 55.27 in percent. In the generated sample, the split is 517 : 483, i.e. 51.7 : 48.3. I will use a Chi-Square test to check whether the generated sample represents a normal distribution; in that case there should be 50% of cases where the area is greater than or equal to the average and 50% where it is less than the average.
Hypothesis,
\(H_0\) : Sample data follow a specified distribution.
\(H_A\) : Sample data do not follow the specified distribution.
#Chi-square test
#Ratio of actual observed values
nullp<-c(0.50, 0.50)
#Samples generated
sample.rows <- c(sum(house.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(house.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))
#Goodness-of-Fit Test
chisq.test(sample.rows, p=nullp)
##
## Chi-squared test for given probabilities
##
## data: sample.rows
## X-squared = 1.156, df = 1, p-value = 0.2823
Since the p-value is greater than \(0.05\), we fail to reject the null hypothesis (\(H_0\)) at the 95% confidence level: the sample data is consistent with a normal distribution.
The following test checks whether the sample represents the actual observed data. In the actual observations, the ratio between values greater than or equal to the average and those less than the average is 0.45 : 0.55. In the sample data, the corresponding counts are 517 : 483.
Hypothesis,
\(H_0\): Sample data represents actual observed data.
\(H_A\): Sample data do not represent actual observed data.
#Chi-square test
#Ratio of actual observed values
nullp<-c(round((sum(req.data$GrLivArea >= round(optim.output$par[1],2), na.rm=TRUE)) / nrow(req.data),2), round((sum(req.data$GrLivArea < round(optim.output$par[1],2), na.rm=TRUE)) / nrow(req.data),2))
#Samples generated
sample.rows <- c(sum(house.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(house.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))
#Goodness-of-Fit Test
chisq.test(sample.rows, p=nullp)
##
## Chi-squared test for given probabilities
##
## data: sample.rows
## X-squared = 18.137, df = 1, p-value = 0.00002055
Since the p-value is less than \(0.05\), we reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis (\(H_A\)) with 95% confidence: the sample data does not represent the actual observed data.
This is expected: the samples were drawn from a symmetric normal distribution parameterized by the average and standard deviation of the actual observations, whereas the observed data is right-skewed.
Problem: 5
#Training data
reqCols <- c('MSZoning', 'LotArea','Street','LotShape','BldgType','OverallQual','OverallCond','YearBuilt','YearRemodAdd','BsmtCond',
'TotalBsmtSF','CentralAir','X1stFlrSF','X2ndFlrSF','GrLivArea','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr',
'KitchenQual','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','SaleCondition','SalePrice')
kaggle.train <- kaggle.data[reqCols]
levels <- levels(kaggle.train$BsmtCond)
levels[length(levels) + 1] <- "NA"
kaggle.train$BsmtCond <- factor(kaggle.train$BsmtCond, levels = levels)
kaggle.train$BsmtCond[is.na(kaggle.train$BsmtCond)] <- "NA"
kaggle.train$KitchenQual[is.na(kaggle.train$KitchenQual)] <- "TA"
kaggle.train$MSZoning[is.na(kaggle.train$MSZoning)] <- "RM"
kaggle.train$TotalBsmtSF[is.na(kaggle.train$TotalBsmtSF)] <- 0
kaggle.train$GarageCars[is.na(kaggle.train$GarageCars)] <- 0
kaggle.train$GarageArea[is.na(kaggle.train$GarageArea)] <- 0
#Test data
reqCols <- c('Id','MSZoning', 'LotArea','Street','LotShape','BldgType','OverallQual','OverallCond','YearBuilt','YearRemodAdd','BsmtCond',
'TotalBsmtSF','CentralAir','X1stFlrSF','X2ndFlrSF','GrLivArea','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr',
'KitchenQual','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','SaleCondition')
kaggle.test <- kaggle.data.test[reqCols]
levels <- levels(kaggle.test$BsmtCond)
levels[length(levels) + 1] <- "NA"
kaggle.test$BsmtCond <- factor(kaggle.test$BsmtCond, levels = levels)
kaggle.test$BsmtCond[is.na(kaggle.test$BsmtCond)] <- "NA"
kaggle.test$KitchenQual[is.na(kaggle.test$KitchenQual)] <- "TA"
kaggle.test$MSZoning[is.na(kaggle.test$MSZoning)] <- "RM"
kaggle.test$TotalBsmtSF[is.na(kaggle.test$TotalBsmtSF)] <- 0
kaggle.test$GarageCars[is.na(kaggle.test$GarageCars)] <- 0
kaggle.test$GarageArea[is.na(kaggle.test$GarageArea)] <- 0
#kaggle.pred <- cbind(kaggle.test, s<-predict(kaggle.lm, kaggle.test))
#names(kaggle.pred)[ncol(kaggle.pred)] <- "SalePrice"
#reqCols <- c('Id','SalePrice')
#kaggle.pred.output <- kaggle.pred[reqCols]
#write.csv(kaggle.pred.output, file = "D:\\CUNY\\605\\FinalProject\\submission.csv",row.names=FALSE)
Columns used for building regression model,
names(kaggle.train)
## [1] "MSZoning" "LotArea" "Street" "LotShape"
## [5] "BldgType" "OverallQual" "OverallCond" "YearBuilt"
## [9] "YearRemodAdd" "BsmtCond" "TotalBsmtSF" "CentralAir"
## [13] "X1stFlrSF" "X2ndFlrSF" "GrLivArea" "FullBath"
## [17] "HalfBath" "BedroomAbvGr" "KitchenAbvGr" "KitchenQual"
## [21] "TotRmsAbvGrd" "Fireplaces" "GarageCars" "GarageArea"
## [25] "WoodDeckSF" "OpenPorchSF" "SaleCondition" "SalePrice"
Missing data for BsmtCond is replaced with a new factor level "NA". Missing KitchenQual values are replaced with "TA" and missing MSZoning values with "RM". For TotalBsmtSF, GarageCars, and GarageArea, missing data is replaced with zero.
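A quick sanity check (a sketch) confirms the imputation left no missing values in the model columns:
#Count remaining NAs per column; all should be zero
colSums(is.na(kaggle.train))
colSums(is.na(kaggle.test))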
kaggle.lm <- lm(SalePrice ~ ., data = kaggle.train)
summary(kaggle.lm)
##
## Call:
## lm(formula = SalePrice ~ ., data = kaggle.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466707 -14819 -1247 12263 259151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -834461.2017 148611.8703 -5.615 0.0000000236 ***
## MSZoningFV 13657.4321 12536.4521 1.089 0.276155
## MSZoningRH 14191.9534 14213.1843 0.999 0.318205
## MSZoningRL 9495.8430 11716.4513 0.810 0.417806
## MSZoningRM 3612.8881 11690.6505 0.309 0.757336
## LotArea 0.4916 0.1067 4.606 0.0000044696 ***
## StreetPave 24453.2266 15140.2930 1.615 0.106510
## LotShapeIR2 7857.2149 5646.9385 1.391 0.164319
## LotShapeIR3 -51788.7586 11331.9555 -4.570 0.0000052983 ***
## LotShapeReg -3311.6478 2041.0949 -1.622 0.104922
## BldgType2fmCon 5844.6630 7452.0259 0.784 0.432992
## BldgTypeDuplex 951.2547 7785.8292 0.122 0.902776
## BldgTypeTwnhs -18502.8225 5807.8563 -3.186 0.001475 **
## BldgTypeTwnhsE -14400.2176 3925.5787 -3.668 0.000253 ***
## OverallQual 13660.7015 1178.8995 11.588 < 2e-16 ***
## OverallCond 5157.3916 1056.5760 4.881 0.0000011738 ***
## YearBuilt 347.4258 63.5952 5.463 0.0000000552 ***
## YearRemodAdd 61.7003 68.7175 0.898 0.369400
## BsmtCondGd 7402.5137 7012.3297 1.056 0.291312
## BsmtCondPo 24684.5539 24853.7180 0.993 0.320785
## BsmtCondTA 8822.5327 5491.0097 1.607 0.108338
## BsmtCondNA 22140.3824 9510.5536 2.328 0.020054 *
## TotalBsmtSF 24.3378 5.1631 4.714 0.0000026708 ***
## CentralAirY -3221.3162 4474.2420 -0.720 0.471663
## X1stFlrSF 27.8363 19.6979 1.413 0.157828
## X2ndFlrSF 32.0763 19.2219 1.669 0.095390 .
## GrLivArea 19.0507 19.1194 0.996 0.319222
## FullBath 1011.8753 2711.2224 0.373 0.709043
## HalfBath -1636.4413 2624.6403 -0.623 0.533062
## BedroomAbvGr -7967.9128 1711.4010 -4.656 0.0000035302 ***
## KitchenAbvGr -24572.0086 7147.6623 -3.438 0.000604 ***
## KitchenQualFa -42158.0859 7729.0392 -5.455 0.0000000579 ***
## KitchenQualGd -45369.2909 4128.5516 -10.989 < 2e-16 ***
## KitchenQualTA -49501.7920 4818.3597 -10.274 < 2e-16 ***
## TotRmsAbvGrd 2290.0596 1219.3831 1.878 0.060580 .
## Fireplaces 5384.1705 1722.6888 3.125 0.001811 **
## GarageCars 12073.8212 2790.1894 4.327 0.0000161590 ***
## GarageArea -2.0307 9.5503 -0.213 0.831641
## WoodDeckSF 28.8093 7.6375 3.772 0.000169 ***
## OpenPorchSF -12.6482 14.8625 -0.851 0.394909
## SaleConditionAdjLand 17304.1575 17861.7073 0.969 0.332818
## SaleConditionAlloca 16667.3871 10954.5029 1.522 0.128355
## SaleConditionFamily -217.8811 8378.0094 -0.026 0.979256
## SaleConditionNormal 8008.9981 3623.5822 2.210 0.027248 *
## SaleConditionPartial 21527.7619 5001.1914 4.305 0.0000178853 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33540 on 1415 degrees of freedom
## Multiple R-squared: 0.8271, Adjusted R-squared: 0.8217
## F-statistic: 153.8 on 44 and 1415 DF, p-value: < 2.2e-16
The \(R^2\) value of the model is \(0.8271\), meaning the model can explain \(82.71 \%\) of the variance in the sale price of a house. However, the variables MSZoning, YearRemodAdd, OpenPorchSF, GarageArea, BsmtCond, HalfBath, CentralAir, GrLivArea, and FullBath have high p-values, suggesting they are not contributing to the model.
kaggle.lm <- update(kaggle.lm, .~ . -MSZoning -YearRemodAdd -OpenPorchSF -GarageArea -BsmtCond -HalfBath -CentralAir -GrLivArea -FullBath)
summary(kaggle.lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea + Street + LotShape + BldgType +
## OverallQual + OverallCond + YearBuilt + TotalBsmtSF + X1stFlrSF +
## X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd +
## Fireplaces + GarageCars + WoodDeckSF + SaleCondition, data = kaggle.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -460319 -15275 -1234 12643 262810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -824312.8757 97193.2468 -8.481 < 2e-16 ***
## LotArea 0.5207 0.1059 4.917 0.0000009802343 ***
## StreetPave 29323.9224 14553.6486 2.015 0.044104 *
## LotShapeIR2 8459.2088 5613.1042 1.507 0.132020
## LotShapeIR3 -50983.1758 11222.8435 -4.543 0.0000060181570 ***
## LotShapeReg -3372.1393 2012.4911 -1.676 0.094034 .
## BldgType2fmCon 6858.5577 7357.2576 0.932 0.351382
## BldgTypeDuplex 3047.4816 7680.4211 0.397 0.691585
## BldgTypeTwnhs -20821.5965 5545.0916 -3.755 0.000180 ***
## BldgTypeTwnhsE -15785.5942 3718.8552 -4.245 0.0000232938011 ***
## OverallQual 13780.4842 1157.0155 11.910 < 2e-16 ***
## OverallCond 5117.6174 920.2462 5.561 0.0000000319403 ***
## YearBuilt 409.3079 48.2739 8.479 < 2e-16 ***
## TotalBsmtSF 17.3150 3.8881 4.453 0.0000091129918 ***
## X1stFlrSF 52.7975 5.3224 9.920 < 2e-16 ***
## X2ndFlrSF 49.4065 3.8415 12.861 < 2e-16 ***
## BedroomAbvGr -7746.2693 1660.4203 -4.665 0.0000033707724 ***
## KitchenAbvGr -22970.0828 6987.5840 -3.287 0.001036 **
## KitchenQualFa -42693.8655 7541.8616 -5.661 0.0000000181677 ***
## KitchenQualGd -45548.2056 4067.3129 -11.199 < 2e-16 ***
## KitchenQualTA -51154.7355 4678.5453 -10.934 < 2e-16 ***
## TotRmsAbvGrd 2327.3620 1188.6831 1.958 0.050432 .
## Fireplaces 5019.5825 1667.6467 3.010 0.002658 **
## GarageCars 11222.2049 1655.5203 6.779 0.0000000000177 ***
## WoodDeckSF 28.3168 7.5812 3.735 0.000195 ***
## SaleConditionAdjLand 20224.9292 17685.0618 1.144 0.252974
## SaleConditionAlloca 14514.8202 10856.2911 1.337 0.181437
## SaleConditionFamily -821.0596 8313.9049 -0.099 0.921345
## SaleConditionNormal 8218.0876 3545.5362 2.318 0.020597 *
## SaleConditionPartial 22001.7485 4877.4789 4.511 0.0000069839754 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33540 on 1430 degrees of freedom
## Multiple R-squared: 0.8253, Adjusted R-squared: 0.8217
## F-statistic: 232.9 on 29 and 1430 DF, p-value: < 2.2e-16
In the second model, after removing those variables, multiple \(R^2\) dropped slightly (from \(0.8271\) to \(0.8253\)) while \(Adjusted~ R^2\) remained constant at \(0.8217\). Also, every remaining variable is contributing to the model.
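The two nested models can also be compared formally with an F-test (a side sketch, not part of the original write-up; kaggle.lm.full is an illustrative name for a refit of the full model):
#Refit the full model and test whether the dropped variables matter jointly
kaggle.lm.full <- lm(SalePrice ~ ., data = kaggle.train)
anova(kaggle.lm, kaggle.lm.full)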
#Predict sale prices for the test set and attach them as a SalePrice column
kaggle.pred <- cbind(kaggle.test, SalePrice = predict(kaggle.lm, kaggle.test))
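A minimal sketch of writing the submission file (the output path is illustrative; the original local-path version appears commented out earlier):
#Keep only the Id and predicted SalePrice columns and write the CSV
kaggle.pred.output <- kaggle.pred[c('Id', 'SalePrice')]
write.csv(kaggle.pred.output, file = "submission.csv", row.names = FALSE)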
The predictions were submitted to the Kaggle website. Following is my username and score.