DATA605_Final_Project

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\)

library(knitr)
library(kableExtra)
#Setting variables and values
N <- 10
#Named mu and sigma so the base functions mean() and sd() are not masked
mu <- (N+1)/2
sigma <- (N+1)/2
#Calculate X and Y
set.seed(12)
X <- runif(10000, 1, N)
Y <- rnorm(length(X), mu, sigma)
#Build data frame
df <- data.frame(X,Y)
#Preview data
kable(data.frame(head(df, n = 10L))) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#ea7872") %>%
    scroll_box(width = "100%", height = "200px")

X	Y
1.624248	2.013936
8.359977	-4.616887
9.483596	6.414498
3.424437	8.840356
2.524133	5.920598
1.305061	4.692422
2.609065	-3.742906
6.774988	10.915229
1.205900	6.883874
1.074923	7.445159

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

#x is estimated as the median of the X variable
x <- median(X)
x

## [1] 5.540831

#y is estimated as the 1st quartile of the Y variable
y <- quantile(Y)[2]
y

##      25% 
## 1.813683

a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)

p_a <- sum(X>x & X>y)/sum(X>y)
p_a

## [1] 0.5514503

p_b <- sum(X>x & Y>y)/length(X)
p_b

## [1] 0.3742

p_c <- sum(X<x & X>y)/sum(X>y)
p_c

## [1] 0.4485497
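The three empirical values can be compared against what the underlying distributions predict. The sketch below is only a cross-check under stated assumptions: it reuses N = 10 and the printed values of x and y from above, treats X as Uniform(1, N) and Y as Normal with mean and standard deviation (N+1)/2, and assumes X and Y are independent. The names `p_X_gt`, `p_a_theory`, `p_b_theory` and `p_c_theory` are introduced here for illustration.

```r
# Theoretical counterparts of p_a, p_b and p_c (a sketch, not part of the original analysis)
N <- 10
mu <- (N + 1) / 2
sigma <- (N + 1) / 2
x <- 5.540831   # median of X, copied from the output above
y <- 1.813683   # 1st quartile of Y, copied from the output above

# P(X > t) for X ~ Uniform(1, N)
p_X_gt <- function(t) punif(t, min = 1, max = N, lower.tail = FALSE)

# a. P(X > x | X > y): both conditions hold when X exceeds the larger threshold
p_a_theory <- p_X_gt(max(x, y)) / p_X_gt(y)

# b. P(X > x, Y > y): product of the marginals under independence
p_b_theory <- p_X_gt(x) * pnorm(y, mean = mu, sd = sigma, lower.tail = FALSE)

# c. P(X < x | X > y): probability of y < X < x, valid here because y < x
p_c_theory <- (p_X_gt(y) - p_X_gt(x)) / p_X_gt(y)

c(a = p_a_theory, b = p_b_theory, c = p_c_theory)
```

In words: (a) is the chance that X is above its median given that it is above the first quartile of Y, (b) is the chance that both variables exceed their thresholds at once, and (c) is the complement of (a) under the same condition, so (a) and (c) sum to 1.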

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

c_tab <- c(sum(X<x & Y < y),sum(X < x & Y == y),sum(X < x & Y > y))
c_tab <- rbind(c_tab,c(sum(X==x & Y < y),sum(X == x & Y == y),sum(X == x & Y > y)))
c_tab <- rbind(c_tab,c(sum(X>x & Y < y),sum(X > x & Y == y),sum(X > x & Y > y)))
c_tab <- cbind(c_tab, c_tab[,1] + c_tab[,2] + c_tab[,3])
c_tab <- rbind(c_tab, c_tab[1,] + c_tab[2,] + c_tab[3,])
colnames(c_tab) <- c("Y<y", "Y=y", "Y>y", "Total")
rownames(c_tab) <- c("X<x", "X=x", "X>x", "Total")
#Preview data
jp <- as.data.frame(c_tab)
jp

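#P(X>x and Y>y), computed from the data rather than hard-coded
p1 <- sum(X > x & Y > y)/length(X)
p1

## [1] 0.3742

#P(X>x)P(Y>y), computed from the marginal counts
p2 <- (sum(X > x)/length(X))*(sum(Y > y)/length(Y))
p2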

## [1] 0.375
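The joint probability (0.3742) and the product of the marginals (0.375) are nearly identical, which is what independence of X and Y would predict. A formal check could look like the sketch below. It regenerates the data with the same seed and parameters used above and runs a chi-square test and Fisher's exact test on the 2x2 table of the two events; the object names `tab`, `chi_result` and `fisher_result` are introduced here, and no particular test result is claimed.

```r
# Sketch: test independence of the events {X > x} and {Y > y}
set.seed(12)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(length(X), (N + 1) / 2, (N + 1) / 2)

x <- median(X)                    # threshold for X
y <- unname(quantile(Y, 0.25))    # threshold for Y

# 2x2 contingency table of the two events
tab <- table(X_above = X > x, Y_above = Y > y)
tab

# Large p-values would mean independence cannot be rejected
chi_result <- chisq.test(tab)
fisher_result <- fisher.test(tab)
c(chi_square = chi_result$p.value, fisher = fisher_result$p.value)
```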

Problem 2

Advanced Regression Techniques competition - https://www.kaggle.com/c/house-prices-advanced-regression-techniques

#Load libraries
library(RCurl)
library(knitr)
library(kableExtra)
library(magrittr)
library(psych)
library(Matrix)
library(MASS)

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

#Training data
train <- read.csv("https://raw.githubusercontent.com/mohamedthasleem/DATA605/master/train.csv", header = TRUE, stringsAsFactors = FALSE)

#Summary Information
psych::describe(train)

#Preview data
kable(data.frame(head(train, n = 10L))) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#ea7872") %>%
    scroll_box(width = "100%", height = "300px")

Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	X1stFlrSF	X2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	X3SsnPorch	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
1	60	RL	65	8450	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	0	150	856	GasA	Ex	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	NA	Attchd	2003	RFn	2	548	TA	TA	Y	0	61	0	0	NA	NA	NA	0	2	2008	WD	Normal	208500
2	20	RL	80	9600	Pave	NA	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	0	284	1262	GasA	Ex	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976	RFn	2	460	TA	TA	Y	298	0	0	0	NA	NA	NA	0	5	2007	WD	Normal	181500
3	60	RL	68	11250	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	0	434	920	GasA	Ex	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001	RFn	2	608	TA	TA	Y	0	42	0	0	NA	NA	NA	0	9	2008	WD	Normal	223500
4	70	RL	60	9550	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	7	5	1915	1970	Gable	CompShg	Wd Sdng	Wd Shng	None	0	TA	TA	BrkTil	TA	Gd	No	ALQ	216	Unf	0	540	756	GasA	Gd	Y	SBrkr	961	756	1717	1	0	1	0	3	1	Gd	7	Typ	1	Gd	Detchd	1998	Unf	3	642	TA	TA	Y	0	35	272	0	NA	NA	NA	0	2	2006	WD	Abnorml	140000
5	60	RL	84	14260	Pave	NA	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	8	5	2000	2000	Gable	CompShg	VinylSd	VinylSd	BrkFace	350	Gd	TA	PConc	Gd	TA	Av	GLQ	655	Unf	0	490	1145	GasA	Ex	Y	SBrkr	1145	1053	2198	1	0	2	1	4	1	Gd	9	Typ	1	TA	Attchd	2000	RFn	3	836	TA	TA	Y	192	84	0	0	NA	NA	NA	0	12	2008	WD	Normal	250000
6	50	RL	85	14115	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	Mitchel	Norm	Norm	1Fam	1.5Fin	5	5	1993	1995	Gable	CompShg	VinylSd	VinylSd	None	0	TA	TA	Wood	Gd	TA	No	GLQ	732	Unf	0	64	796	GasA	Ex	Y	SBrkr	796	566	1362	1	0	1	1	1	1	TA	5	Typ	0	NA	Attchd	1993	Unf	2	480	TA	TA	Y	40	30	0	320	NA	MnPrv	Shed	700	10	2009	WD	Normal	143000
7	20	RL	75	10084	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	Somerst	Norm	Norm	1Fam	1Story	8	5	2004	2005	Gable	CompShg	VinylSd	VinylSd	Stone	186	Gd	TA	PConc	Ex	TA	Av	GLQ	1369	Unf	0	317	1686	GasA	Ex	Y	SBrkr	1694	0	1694	1	0	2	0	3	1	Gd	7	Typ	1	Gd	Attchd	2004	RFn	2	636	TA	TA	Y	255	57	0	0	NA	NA	NA	0	8	2007	WD	Normal	307000
8	60	RL	NA	10382	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	NWAmes	PosN	Norm	1Fam	2Story	7	6	1973	1973	Gable	CompShg	HdBoard	HdBoard	Stone	240	TA	TA	CBlock	Gd	TA	Mn	ALQ	859	BLQ	32	216	1107	GasA	Ex	Y	SBrkr	1107	983	2090	1	0	2	1	3	1	TA	7	Typ	2	TA	Attchd	1973	RFn	2	484	TA	TA	Y	235	204	228	0	NA	NA	Shed	350	11	2009	WD	Normal	200000
9	50	RM	51	6120	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	OldTown	Artery	Norm	1Fam	1.5Fin	7	5	1931	1950	Gable	CompShg	BrkFace	Wd Shng	None	0	TA	TA	BrkTil	TA	TA	No	Unf	0	Unf	0	952	952	GasA	Gd	Y	FuseF	1022	752	1774	0	0	2	0	2	2	TA	8	Min1	2	TA	Detchd	1931	Unf	2	468	Fa	TA	Y	90	0	205	0	NA	NA	NA	0	4	2008	WD	Abnorml	129900
10	190	RL	50	7420	Pave	NA	Reg	Lvl	AllPub	Corner	Gtl	BrkSide	Artery	Artery	2fmCon	1.5Unf	5	6	1939	1950	Gable	CompShg	MetalSd	MetalSd	None	0	TA	TA	BrkTil	TA	TA	No	GLQ	851	Unf	0	140	991	GasA	Ex	Y	SBrkr	1077	0	1077	1	0	1	0	2	2	TA	5	Typ	2	TA	Attchd	1939	RFn	1	205	Gd	TA	Y	0	4	0	0	NA	NA	NA	0	1	2008	WD	Normal	118000

Univariate descriptive statistics

-Provide univariate descriptive statistics and appropriate plots for the training data set

#Summary
summary(train$SalePrice)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

#Histogram
hist(train$SalePrice, main="Sale Price")

# QQ Plot
qqnorm(train$SalePrice)
qqline(train$SalePrice)

Scatterplot Matrix

-Provide a scatterplot matrix for at least two of the independent variables and the dependent variable

#ScatterPlot
pairs(~SalePrice+LotArea+GrLivArea+GarageArea,data=train, main="Scatterplot Matrix")

Correlation Matrix

-Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval

#Subsetting data
sub_df <- data.frame(train$LotArea,train$GrLivArea,train$GarageArea)
#Correlation
cormatrix <- cor(sub_df)
cormatrix

##                  train.LotArea train.GrLivArea train.GarageArea
## train.LotArea        1.0000000       0.2631162        0.1804028
## train.GrLivArea      0.2631162       1.0000000        0.4689975
## train.GarageArea     0.1804028       0.4689975        1.0000000

library(corrplot)
corrplot(cormatrix, method="square")

Hypotheses Test

#LotArea vs GrLivArea
cor.test(train$LotArea,train$GrLivArea,method = "pearson",conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162

#LotArea vs GarageArea
cor.test(train$LotArea,train$GarageArea,method = "pearson",conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$GarageArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1477356 0.2126767
## sample estimates:
##       cor 
## 0.1804028

#GarageArea vs GrLivArea
cor.test(train$GarageArea,train$GrLivArea,method = "pearson",conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  train$GarageArea and train$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975
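The 80% intervals reported by cor.test come from the Fisher z-transformation of the sample correlation. The sketch below reproduces the first interval by hand as a sanity check. The helper name `fisher_ci` is introduced here; the inputs are the printed correlation for LotArea and GrLivArea and n = 1460, which follows from df = 1458.

```r
# Sketch: confidence interval for a Pearson correlation via Fisher's z
fisher_ci <- function(r, n, conf.level = 0.80) {
  z <- atanh(r)                             # Fisher z-transform of r
  se <- 1 / sqrt(n - 3)                     # approximate standard error of z
  crit <- qnorm(1 - (1 - conf.level) / 2)   # two-sided normal critical value
  tanh(z + c(-1, 1) * crit * se)            # back-transform to the r scale
}

# LotArea vs GrLivArea: r = 0.2631162, n = df + 2 = 1460
ci_lot_living <- fisher_ci(0.2631162, 1460)
ci_lot_living
```

The result should agree with the interval printed above (0.2315997 to 0.2940809) up to rounding of the inputs.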

Analysis Observation

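All three tests have p-values far below 0.05 (the largest is 3.803e-12), and none of the 80% confidence intervals contains 0, so the null hypothesis of zero correlation is rejected for each pair. Familywise error (FWE) is the probability of making at least one Type I error when performing multiple hypothesis tests. It is a concern here because three tests are run, each at only an 80% confidence level: if the tests were independent, the chance of at least one false rejection would be 1 - 0.8^3, or about 0.49. This problem can be mitigated by adjusting each correlation test to a higher confidence level. In this case the p-values are so small that the conclusions would not change after such an adjustment.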
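The size of the familywise error for three tests can be worked out directly. The sketch below assumes the three tests are independent, which is a simplification since they share variables, and applies a Bonferroni adjustment as one common correction (it is not something the original analysis ran). The p-values printed as "< 2.2e-16" are entered as that upper bound, and the variable names are introduced here.

```r
# Sketch: familywise error rate and a Bonferroni adjustment for three tests
alpha <- 0.20   # per-test Type I error rate implied by an 80% confidence level
m <- 3          # number of pairwise correlation tests

# Chance of at least one false rejection, assuming independent tests
fwer <- 1 - (1 - alpha)^m

# Per-test confidence level that keeps the familywise rate at or below alpha
bonferroni_level <- 1 - alpha / m

# Reported p-values; "< 2.2e-16" is entered as an upper bound
p_values <- c(2.2e-16, 3.803e-12, 2.2e-16)
p_adjusted <- p.adjust(p_values, method = "bonferroni")

fwer
bonferroni_level
p_adjusted
```

The adjusted level could be passed to cor.test through its conf.level argument in place of 0.80.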

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Inversion

precisionmatrix <- solve(cormatrix)
precisionmatrix

##                  train.LotArea train.GrLivArea train.GarageArea
## train.LotArea       1.07920917      -0.2469705      -0.07886378
## train.GrLivArea    -0.24697046       1.3385010      -0.58319943
## train.GarageArea   -0.07886378      -0.5831994       1.28774631
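The statement that the diagonal of the precision matrix holds the variance inflation factors can be checked by hand. The sketch below rebuilds the correlation matrix from the rounded values printed above, so results match only to about six decimals, and compares the first diagonal entry of the inverse with 1/(1 - R^2), where R^2 comes from regressing LotArea on the other two variables. The names `R`, `P`, `r_squared` and `vif_lot` are introduced here.

```r
# Sketch: diagonal of the inverse correlation matrix equals the VIFs
vars <- c("LotArea", "GrLivArea", "GarageArea")
R <- matrix(c(1.0000000, 0.2631162, 0.1804028,
              0.2631162, 1.0000000, 0.4689975,
              0.1804028, 0.4689975, 1.0000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(R)   # precision matrix

# R^2 of LotArea regressed on the other two, from correlations alone
r_vec <- R[1, -1]       # correlations of LotArea with the others
R_rest <- R[-1, -1]     # correlations among the others
r_squared <- as.numeric(t(r_vec) %*% solve(R_rest) %*% r_vec)

vif_lot <- 1 / (1 - r_squared)

# The two numbers should agree
c(vif = vif_lot, precision_diag = unname(P[1, 1]))
```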

#multiply the correlation matrix by the precision matrix
round(cormatrix %*% precisionmatrix)

##                  train.LotArea train.GrLivArea train.GarageArea
## train.LotArea                1               0                0
## train.GrLivArea              0               1                0
## train.GarageArea             0               0                1

Identity Matrix

-Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix

#multiply the precision matrix by the correlation matrix
round(precisionmatrix %*% cormatrix)

##                  train.LotArea train.GrLivArea train.GarageArea
## train.LotArea                1               0                0
## train.GrLivArea              0               1                0
## train.GarageArea             0               0                1

LU Decomposition

-Conduct LU decomposition on the matrix.

#LU on cormatrix
expand(lu(cormatrix))$L

## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000         .         .
## [2,] 0.2631162 1.0000000         .
## [3,] 0.1804028 0.4528838 1.0000000

expand(lu(cormatrix))$U

## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000 0.2631162 0.1804028
## [2,]         . 0.9307699 0.4215306
## [3,]         .         . 0.7765505

#LU decomposition on precision matrix
expand(lu(precisionmatrix))$L

## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]        [,2]        [,3]       
## [1,]  1.00000000           .           .
## [2,] -0.22884393  1.00000000           .
## [3,] -0.07307553 -0.46899748  1.00000000

expand(lu(precisionmatrix))$U

## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]        [,2]        [,3]       
## [1,]  1.07920917 -0.24697046 -0.07886378
## [2,]           .  1.28198329 -0.60124693
## [3,]           .           .  1.00000000
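To see where these factors come from, the decomposition can be reproduced in base R. The sketch below is a plain Doolittle elimination without pivoting, which is a simplification relative to the lu function from the Matrix package used above (that one applies partial pivoting). It happens to give the same factors here because the largest entry in each pivot column already sits on the diagonal. The function name `lu_doolittle` is introduced here, and the matrix is rebuilt from the rounded correlations printed earlier.

```r
# Sketch: Doolittle LU factorisation (no pivoting) in base R
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)   # unit lower-triangular factor
  U <- A         # reduced in place to upper-triangular form
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

R <- matrix(c(1.0000000, 0.2631162, 0.1804028,
              0.2631162, 1.0000000, 0.4689975,
              0.1804028, 0.4689975, 1.0000000),
            nrow = 3, byrow = TRUE)

factors <- lu_doolittle(R)
factors$L
factors$U

# Multiplying the factors should give back the original matrix
all.equal(factors$L %*% factors$U, R)
```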

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

# MASS Package
library(MASS)
#run fitdistr to fit an exponential probability density function, Find optimal lambda
optimal_lambda <- fitdistr(train$TotalBsmtSF,"exponential")
optimal_lambda$estimate

##         rate 
## 0.0009456896

Exponential pdf

-Take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable

#1000 samples from this exponential distribution using this value
#The sample is stored once so the histogram and its x-axis limit use the same draws
exp_sample <- rexp(1000,optimal_lambda$estimate)
hist(exp_sample,breaks = 200,main = "Fitted Exponential PDF",xlim = c(1,quantile(exp_sample,0.99)))

hist(train$TotalBsmtSF,breaks = 400,main = "Observed Basement Area Size",xlim = c(1,quantile(train$TotalBsmtSF,0.99)))

#5th and 95th percentiles using CDF
qexp(0.05,rate = optimal_lambda$estimate,lower.tail = TRUE,log.p = FALSE)

## [1] 54.23904

qexp(0.95,rate = optimal_lambda$estimate,lower.tail = TRUE,log.p = FALSE)

## [1] 3167.776
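The two percentiles follow from inverting the exponential CDF, F(x) = 1 - exp(-lambda x), which gives x_p = -log(1 - p) / lambda. The sketch below applies that formula by hand and compares it with qexp. It uses the rounded rate printed above, and the helper name `exp_quantile` is introduced here.

```r
# Sketch: exponential percentiles by inverting the CDF
lambda <- 0.0009456896   # fitted rate, copied from the output above

# Solve 1 - exp(-rate * x) = p for x
exp_quantile <- function(p, rate) -log(1 - p) / rate

manual <- exp_quantile(c(0.05, 0.95), lambda)
builtin <- qexp(c(0.05, 0.95), rate = lambda)

# Both rows should match the values printed above
rbind(manual, builtin)
```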

#Normal approximation using the empirical mean and sd: this gives the 95th percentile of the fitted normal (an upper bound only, not a two-sided interval)
Bsmt_mean <- mean(train$TotalBsmtSF)
Bsmt_sd <- sd(train$TotalBsmtSF)
qnorm(0.95,Bsmt_mean,Bsmt_sd)

## [1] 1779.035
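A two-sided 95% confidence interval for the mean, assuming normality, can be sketched as follows. The helper name `normal_ci` and the vector `basement_toy` are introduced here; the vector is a small stand-in with made-up values, so the printed interval is not a result for the training data. With the data loaded, the same function could be called on train$TotalBsmtSF.

```r
# Sketch: two-sided normal-theory confidence interval for a mean
normal_ci <- function(values, level = 0.95) {
  values <- values[!is.na(values)]            # drop missing values
  se <- sd(values) / sqrt(length(values))     # standard error of the mean
  crit <- qnorm(1 - (1 - level) / 2)          # two-sided critical value
  mean(values) + c(-1, 1) * crit * se         # lower and upper bounds
}

# Hypothetical stand-in values, not the Kaggle data
basement_toy <- c(0, 520, 856, 920, 991, 1107, 1262, 1686)
toy_ci <- normal_ci(basement_toy)
toy_ci
```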

#empirical 5th and 95th percentile of the data
quantile(train$TotalBsmtSF,c(0.05,0.95))

##     5%    95% 
##  519.3 1753.0

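The exponential model is not a good fit for this data. Its 5th and 95th percentiles (54.2 and 3167.8) span a far wider range than the empirical percentiles (519.3 and 1753.0), so the fitted distribution is heavily biased: it places too much mass near zero and in the far right tail. The normal approximation's 95th percentile (1779.0) is much closer to the empirical value.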

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

#select all the quantitative variables and eliminate the ones with low correlations
quantitative <- data.frame(train$OverallQual,train$YearBuilt,train$YearRemodAdd,train$MasVnrArea,train$BsmtFinSF1,train$TotalBsmtSF,train$X1stFlrSF,train$X2ndFlrSF,train$GrLivArea,train$FullBath,train$TotRmsAbvGrd,train$Fireplaces,train$GarageCars,train$GarageArea,train$WoodDeckSF,train$OpenPorchSF,train$SalePrice) 

#create a linear regression model
m1 <- lm(train.SalePrice ~.,data = quantitative)
summary(m1)

## 
## Call:
## lm(formula = train.SalePrice ~ ., data = quantitative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -512233  -17548   -1737   14681  283280 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.094e+06  1.268e+05  -8.627  < 2e-16 ***
## train.OverallQual   1.856e+04  1.174e+03  15.807  < 2e-16 ***
## train.YearBuilt     1.638e+02  4.978e+01   3.290 0.001028 ** 
## train.YearRemodAdd  3.564e+02  6.208e+01   5.741 1.15e-08 ***
## train.MasVnrArea    2.881e+01  6.159e+00   4.678 3.17e-06 ***
## train.BsmtFinSF1    1.725e+01  2.596e+00   6.646 4.26e-11 ***
## train.TotalBsmtSF   1.165e+01  4.298e+00   2.711 0.006796 ** 
## train.X1stFlrSF     2.618e+01  2.082e+01   1.257 0.208871    
## train.X2ndFlrSF     1.753e+01  2.048e+01   0.856 0.392000    
## train.GrLivArea     2.135e+01  2.035e+01   1.049 0.294370    
## train.FullBath     -1.489e+03  2.630e+03  -0.566 0.571228    
## train.TotRmsAbvGrd  1.688e+03  1.089e+03   1.550 0.121402    
## train.Fireplaces    7.888e+03  1.783e+03   4.423 1.05e-05 ***
## train.GarageCars    1.011e+04  2.960e+03   3.414 0.000659 ***
## train.GarageArea    1.040e+01  1.005e+01   1.035 0.301006    
## train.WoodDeckSF    3.068e+01  8.129e+00   3.774 0.000167 ***
## train.OpenPorchSF   7.271e+00  1.572e+01   0.462 0.643861    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36380 on 1435 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7918, Adjusted R-squared:  0.7894 
## F-statistic:   341 on 16 and 1435 DF,  p-value: < 2.2e-16

#eliminate variables based on significance level
quantitative2 <- data.frame(train$OverallQual,train$YearRemodAdd,train$MasVnrArea,train$BsmtFinSF1,train$TotalBsmtSF,train$Fireplaces,train$GarageCars,train$WoodDeckSF,train$SalePrice)
colnames(quantitative2) <- c("OverallQual","YearRemodAdd","MasVnrArea","BsmtFinSF1","TotalBsmtSF","Fireplaces","GarageCars","WoodDeckSF","SalePrice")

#create a linear regression model
m2 <- lm(SalePrice ~.,data = quantitative2)
summary(m2)

## 
## Call:
## lm(formula = SalePrice ~ ., data = quantitative2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -407840  -21443   -2760   16410  363961 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.307e+05  1.210e+05  -6.867 9.70e-12 ***
## OverallQual   2.449e+04  1.183e+03  20.706  < 2e-16 ***
## YearRemodAdd  3.925e+02  6.256e+01   6.273 4.66e-10 ***
## MasVnrArea    4.651e+01  6.602e+00   7.045 2.85e-12 ***
## BsmtFinSF1    1.482e+01  2.752e+00   5.383 8.52e-08 ***
## TotalBsmtSF   2.504e+01  3.290e+00   7.611 4.89e-14 ***
## Fireplaces    1.551e+04  1.849e+03   8.389  < 2e-16 ***
## GarageCars    1.794e+04  1.820e+03   9.855  < 2e-16 ***
## WoodDeckSF    4.464e+01  8.848e+00   5.045 5.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39960 on 1443 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7474, Adjusted R-squared:  0.746 
## F-statistic: 533.6 on 8 and 1443 DF,  p-value: < 2.2e-16

#hist
hist(m2$residuals,breaks = 200)

#QQ plot
qqnorm(m2$residuals)
qqline(m2$residuals)

The residuals are nearly normally distributed, with perhaps some outliers. The fit is not perfect, but all independent variables are statistically significant. Let's check the performance using the test data.

#Fetch test data
test <- read.csv("https://raw.githubusercontent.com/mohamedthasleem/DATA605/master/test.csv")
#Rows with missing values are kept so that every Id receives a prediction; missing predictions are set to 0 below

#predicting
pred <- predict(m2,test)

#kaggle Score
kaggle <- data.frame( Id = test[,"Id"],  SalePrice =pred)
kaggle[kaggle<0] <- 0
kaggle <- replace(kaggle,is.na(kaggle),0)
write.csv(kaggle, file="kaggle.csv", row.names = FALSE)
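Setting missing predictions to 0 keeps the submission file complete, but a sale price of 0 is far from any plausible value. One alternative, sketched below, is to fill missing predictor values with the training median before calling predict. This swaps in a different technique from the one used above. The data frames `train_toy` and `test_toy` are hypothetical stand-ins with made-up values, and only two of the model's predictors are used to keep the example short.

```r
# Sketch: median imputation of predictors before prediction (alternative to NA -> 0)
# Toy stand-ins for the train/test data frames (hypothetical values)
train_toy <- data.frame(GarageCars  = c(1, 2, 2, 3),
                        TotalBsmtSF = c(800, 950, 1100, 1500),
                        SalePrice   = c(120000, 160000, 185000, 260000))
test_toy <- data.frame(Id          = 1:3,
                       GarageCars  = c(2, NA, 1),
                       TotalBsmtSF = c(1000, 900, NA))

fit <- lm(SalePrice ~ GarageCars + TotalBsmtSF, data = train_toy)

# Fill each missing predictor with the training median of that column
predictors <- c("GarageCars", "TotalBsmtSF")
for (v in predictors) {
  missing_rows <- is.na(test_toy[[v]])
  test_toy[[v]][missing_rows] <- median(train_toy[[v]], na.rm = TRUE)
}

submission <- data.frame(Id = test_toy$Id, SalePrice = predict(fit, test_toy))

# Sale prices cannot be negative, so clamp at zero as the original code does
submission$SalePrice <- pmax(submission$SalePrice, 0)
submission
```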

Kaggle Result and YouTube link

Presentation Link

https://www.youtube.com/watch?v=k08ZNq4wRP4

DATA 605 - Final Project
Fundamentals of Computational Mathematics

Mohamed Thasleem Kalikul Zaman

Dec 15 2019
