All requirements for the Final Project/Exam have been completed to the best of my knowledge, and all of the work is my own.
The presentation video for this Final Project/Exam can be found here: https://youtu.be/bw8zkzyiPtA
Values will change on each run of this program because the numbers are randomly generated. That means the mean, standard deviation, confidence intervals, etc. will change as well.
Objectives were completed based on my current and somewhat limited knowledge of R and Computational Mathematics. Due to the deadline, I had to heavily rely on R functions to complete the tasks at hand.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\)
numiterations <- 10000
x <- sample(1:10,numiterations, replace=TRUE)
y <- sample(1:1000,numiterations, replace=TRUE)
medianx <- median(x)
quantiley <- quantile(y)
medianx
## [1] 6
quantiley
##   0%  25%  50%  75% 100%
##    1  244  498  746 1000
firstquartiley <- quantiley[2]
numtimesxgtmedian <- 0
numtimesygtfirstquartiley <- 0
probxgtmedian <- 0
probygtfirstquartiley <- 0
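The task statement asks for uniform and normal draws, while the code above uses sample(), which draws integers with equal probability. A minimal sketch of the alternative reading follows. It assumes N = 10, and the seed and the names n_draws, x_unif, y_norm, small_x and small_y are illustrative, not taken from the report.

```r
set.seed(42)                      # assumed seed, only so the sketch is reproducible
N <- 10                           # any N >= 6, per the task statement
n_draws <- 10000

# Continuous uniform draws on [1, N]
x_unif <- runif(n_draws, min = 1, max = N)

# Normal draws whose mean and standard deviation both equal (N + 1) / 2
mu <- (N + 1) / 2
y_norm <- rnorm(n_draws, mean = mu, sd = mu)

# The "small x" and "small y" thresholds used in the probability questions
small_x <- median(x_unif)                 # median of X
small_y <- quantile(y_norm, 0.25)         # first quartile of Y
```

With this reading, the results reported below would differ, because Y would be normally rather than uniformly distributed.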
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
(5 points)
numtimes <- 0
for (i in 1:numiterations) {
X <- x[i]
Y <- y[i]
if (X > medianx || Y > firstquartiley)
numtimes = numtimes + 1
}
curprob <- numtimes / numiterations
print(curprob)
## [1] 0.8545
numtimes <- 0
for (i in 1:numiterations) {
X <- x[i]
Y <- y[i]
if (X > medianx && Y > firstquartiley)
numtimes = numtimes + 1
}
curprob <- numtimes / numiterations
print(curprob)
## [1] 0.298
numtimes <- 0
for (i in 1:numiterations) {
X <- x[i]
Y <- y[i]
if (X < medianx || Y > firstquartiley)
numtimes = numtimes + 1
}
curprob <- numtimes / numiterations
print(curprob)
## [1] 0.8704
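The three loops above can also be written without explicit iteration, because the mean of a logical vector is the proportion of TRUE values. The sketch below regenerates its own data with an assumed seed, so its numbers will not match the output above; the names x_med, y_q1, p_or, p_and and p_lt_or are illustrative.

```r
set.seed(1)                                   # assumed seed for reproducibility
n <- 10000
x <- sample(1:10, n, replace = TRUE)
y <- sample(1:1000, n, replace = TRUE)

x_med <- median(x)
y_q1 <- quantile(y, 0.25)

# Elementwise | and & replace the scalar || and && used inside the loops
p_or    <- mean(x > x_med | y > y_q1)         # P(X > x or Y > y)
p_and   <- mean(x > x_med & y > y_q1)         # P(X > x and Y > y)
p_lt_or <- mean(x < x_med | y > y_q1)         # P(X < x or Y > y)

# Loop version of the first probability, kept only as a cross-check
count <- 0
for (i in seq_len(n)) {
  if (x[i] > x_med || y[i] > y_q1) count <- count + 1
}
```

Here count / n and p_or are the same quantity computed two ways.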
(5 points) Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
for (i in 1:numiterations) {
X <- x[i]
Y <- y[i]
if (X > medianx)
numtimesxgtmedian = numtimesxgtmedian + 1
if (Y > firstquartiley)
numtimesygtfirstquartiley = numtimesygtfirstquartiley + 1
}
probxgtmedian <- numtimesxgtmedian / numiterations
probygtfirstquartiley <- numtimesygtfirstquartiley / numiterations
probxlemedian <- 1 - probxgtmedian
probylefirstquariley <- 1 - probygtfirstquartiley
print(probxgtmedian)
## [1] 0.403
print(probygtfirstquartiley)
## [1] 0.7495
ScenarioC = c("X and Median", "Y and 1st Quartile", "TOTAL")
GreaterThanC = c(probxgtmedian,probygtfirstquartiley, probxgtmedian+probygtfirstquartiley)
LessThanEqualC = c(probxlemedian,probylefirstquariley, probxlemedian+probylefirstquariley)
TotalC = c(GreaterThanC[1]+LessThanEqualC[1],GreaterThanC[2]+LessThanEqualC[2],GreaterThanC[3]+LessThanEqualC[3])
library(data.table)  # provides data.table()
DT <- data.table(Scenario = ScenarioC, GreaterThan = GreaterThanC, LessThanEqual = LessThanEqualC, TOTAL=TotalC)
print(DT)
##              Scenario GreaterThan LessThanEqual TOTAL
## 1:       X and Median      0.4030        0.5970     1
## 2: Y and 1st Quartile      0.7495        0.2505     1
## 3:              TOTAL      1.1525        0.8475     2
I am not satisfied with the table I created. Its grand total is 2, not 1 as I expected. The reason is that each row lists the marginal probabilities of a single variable, so each row sums to 1 on its own and the two rows together sum to 2. A true joint probability table would cross-tabulate the events X>x and X<=x against Y>y and Y<=y, so that the four cells sum to a grand total of 1 and the row and column totals give the marginal probabilities. Independence is then checked by multiplication, not addition: P(X>x and Y>y) should equal P(X>x) * P(Y>y). Using the values above, P(X>x) * P(Y>y) = 0.403 * 0.7495, or about 0.302, which is close to the simulated joint probability of 0.298 obtained earlier.
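A joint table of the kind described in the task can be sketched with table() and prop.table(). This is an illustration under assumptions, not the table produced in this report: the data are regenerated with an assumed seed, and the names x_event, y_event, joint, p_joint and p_product are hypothetical.

```r
set.seed(1)                                   # assumed seed for reproducibility
n <- 10000
x <- sample(1:10, n, replace = TRUE)
y <- sample(1:1000, n, replace = TRUE)

# Turn each comparison into a labelled two-level factor
x_event <- factor(x > median(x), levels = c(TRUE, FALSE),
                  labels = c("X>x", "X<=x"))
y_event <- factor(y > quantile(y, 0.25), levels = c(TRUE, FALSE),
                  labels = c("Y>y", "Y<=y"))

counts <- table(x_event, y_event)             # 2 x 2 table of counts
joint <- prop.table(counts)                   # joint probabilities, cells sum to 1
joint_with_margins <- addmargins(joint)       # adds "Sum" row and column (marginals)

# Compare the joint cell with the product of its marginals
p_joint <- joint["X>x", "Y>y"]
p_product <- sum(joint["X>x", ]) * sum(joint[, "Y>y"])
```

If X and Y are independent, p_joint and p_product should differ only by sampling noise.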
(5 points) Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: testmatrix
## p-value = 0.0004998
## alternative hypothesis: two.sided
As the p-value of about 0.0005 is well below the .05 significance level, we have enough evidence to reject the null hypothesis that the X column is independent of the Y column.
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 8834.3, df = 8991, p-value = 0.8791
As the p-value 0.8791 is greater than the .05 significance level, we do not reject the null hypothesis that the X column is independent of the Y column.
It appears that Fisher’s Exact Test is less appropriate than the Chi Square Test here. The two tests disagree sharply: Fisher’s Exact Test (p-value of about 0.0005) gives evidence to reject the null hypothesis of independence, while the Chi Square Test (p-value 0.8791) does not. Both tests work on a contingency table of categorical counts. Fisher’s Exact Test is intended for small samples (expected cell counts less than 5), while the Chi Square Test relies on a large-sample approximation and is intended for large samples. The number of values in this experiment is 10,000, so the Chi Square Test is the appropriate choice, although the warning R printed above suggests that some expected cell counts in the table may still be small.
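For comparison, both tests can be applied to a 2 x 2 table of the events X>x and Y>y, where every expected count is large. This is a sketch under assumptions and not the code that produced the output above: that output reports df = 8991, which is consistent with a 10 x 1000 table of raw values. The seed and the names counts, chi_res, fisher_res and min_expected are illustrative.

```r
set.seed(1)                                   # assumed seed for reproducibility
n <- 10000
x <- sample(1:10, n, replace = TRUE)
y <- sample(1:1000, n, replace = TRUE)

# 2 x 2 contingency table of the two events
counts <- table(x_gt_median = x > median(x),
                y_gt_q1 = y > quantile(y, 0.25))

chi_res <- chisq.test(counts)                 # large-sample approximation
fisher_res <- fisher.test(counts)             # exact test on the same table

# The chi-squared approximation is usually considered safe when
# every expected cell count is at least 5
min_expected <- min(chi_res$expected)
```

With a table this coarse, the two tests would normally be expected to give similar p-values.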
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
NOTE: I will focus on the relationship between Sales Price and the square footage of the basement, 1st floor, and 2nd floor of homes. The analysis is strictly limited to homes that have a finished basement and a second floor. After filtering, the training data has 359 rows.
trainingDF <- read.csv(file="training.csv",header=TRUE,sep=",")
trainingDF <- subset(trainingDF,trainingDF$BsmtFinSF1 > 0)
trainingDF <- subset(trainingDF,trainingDF$X2ndFlrSF > 0)
#trainingDF <- subset(trainingDF,trainingDF$X1stFlrSF > 0)
nrow(trainingDF)
## [1] 359
(5 points) Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
For the calculations, I will first provide univariate statistics for each of the variables in this study: Sale Price, Basement, First Floor, and Second Floor square footage. Next, I will provide inferential statistics by comparing the Sales Price to each of the independent variables: Basement, First Floor, and Second Floor square footage. I will plot a scatter chart for each and fit a linear regression to the data.
# ref: [UNI]
library(dplyr)   # provides %>% and summarize()
library(pander)  # provides pander()
trainingDF %>%
summarize(variable = "SalePrice",
mean = mean(SalePrice),
sd = sd(SalePrice),
q0.25 = quantile(SalePrice, 0.25),
q0.75 = quantile(SalePrice, 0.75)) %>%
pander()
variable | mean | sd | q0.25 | q0.75 |
---|---|---|---|---|
SalePrice | 208285 | 89807 | 150250 | 239593 |
trainingDF %>%
summarize(variable = "BsmtFinSF1",
mean = mean(BsmtFinSF1),
sd = sd(BsmtFinSF1),
q0.25 = quantile(BsmtFinSF1, 0.25),
q0.75 = quantile(BsmtFinSF1, 0.75)) %>%
pander()
variable | mean | sd | q0.25 | q0.75 |
---|---|---|---|---|
BsmtFinSF1 | 582.8 | 426.2 | 341 | 730 |
trainingDF %>%
summarize(variable = "X2ndFlrSF",
mean = mean(X2ndFlrSF),
sd = sd(X2ndFlrSF),
q0.25 = quantile(X2ndFlrSF, 0.25),
q0.75 = quantile(X2ndFlrSF, 0.75)) %>%
pander()
variable | mean | sd | q0.25 | q0.75 |
---|---|---|---|---|
X2ndFlrSF | 826.9 | 280 | 660 | 973 |
trainingDF %>%
summarize(variable = "X1stFlrSF",
mean = mean(X1stFlrSF),
sd = sd(X1stFlrSF),
q0.25 = quantile(X1stFlrSF, 0.25),
q0.75 = quantile(X1stFlrSF, 0.75)) %>%
pander()
variable | mean | sd | q0.25 | q0.75 |
---|---|---|---|---|
X1stFlrSF | 1062 | 390.8 | 812.5 | 1189 |
plot( trainingDF$BsmtFinSF1, trainingDF$SalePrice, main="Finished Basement SF vs. Sales Price", ylab="Sales Price", xlab="Finished Basement SF")
lmresult1 <- lm(SalePrice ~ BsmtFinSF1, data = trainingDF)
abline(lmresult1)
summary(lmresult1)
##
## Call:
## lm(formula = SalePrice ~ BsmtFinSF1, data = trainingDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -509884 -45473 -11540 28776 467168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 155131.03 7254.96 21.383 <2e-16 ***
## BsmtFinSF1 91.20 10.05 9.072 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81070 on 357 degrees of freedom
## Multiple R-squared: 0.1874, Adjusted R-squared: 0.1851
## F-statistic: 82.31 on 1 and 357 DF, p-value: < 2.2e-16
Based on the plot, finished basement area beyond about 1,000 square feet does not appear to raise the sale price much further. The outlying points show that the sale price does not keep rising even as the finished basement area approaches 1,500 square feet. The linear regression gives a slope p-value below 2e-16, so there is enough evidence to reject the null hypothesis of no linear relationship: a finished basement may influence the sale price, although it explains only about 19% of the variance (R-squared = 0.1874).
plot( trainingDF$X2ndFlrSF, trainingDF$SalePrice, main="2nd Floor SF vs. Sales Price", ylab="Sales Price", xlab="2nd Floor SF")
lmresult1 <- lm(SalePrice ~ X2ndFlrSF, data = trainingDF)
abline(lmresult1)
summary(lmresult1)
##
## Call:
## lm(formula = SalePrice ~ X2ndFlrSF, data = trainingDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174811 -35526 -11596 18655 337857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32358.5 11089.4 2.918 0.00375 **
## X2ndFlrSF 212.8 12.7 16.747 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67300 on 357 degrees of freedom
## Multiple R-squared: 0.44, Adjusted R-squared: 0.4384
## F-statistic: 280.5 on 1 and 357 DF, p-value: < 2.2e-16
Based on the plot, the 2nd floor square footage has a roughly linear relationship with the sale price of the house. Most homes cluster between 500 and 1,000 square feet of second-floor space, and it appears that the majority of home buyers are satisfied with at most 1,500 square feet of 2nd floor space. The linear regression (slope p-value below 2e-16, R-squared = 0.44) gives enough evidence to reject the null hypothesis, indicating that 2nd floor area may influence the sale price.
plot( trainingDF$X1stFlrSF, trainingDF$SalePrice, main="1st Floor SF vs. Sales Price", ylab="Sales Price", xlab="1st Floor SF")
lmresult1 <- lm(SalePrice ~ X1stFlrSF, data = trainingDF)
abline(lmresult1)
summary(lmresult1)
##
## Call:
## lm(formula = SalePrice ~ X1stFlrSF, data = trainingDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -541112 -38424 979 30222 359110
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64055.895 11105.808 5.768 1.74e-08 ***
## X1stFlrSF 135.775 9.814 13.835 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 72560 on 357 degrees of freedom
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3472
## F-statistic: 191.4 on 1 and 357 DF, p-value: < 2.2e-16
Based on the plot, the 1st floor square footage has a roughly linear relationship with the sale price of the house up to about 1,800 square feet. Beyond that point additional area does not seem to matter, despite the presence of outliers. Most homes cluster between 500 and 1,500 square feet of first-floor space, and it appears that the majority of home buyers are satisfied with at most 1,500 square feet of 1st floor space. The linear regression (slope p-value below 2e-16, R-squared = 0.349) gives enough evidence to reject the null hypothesis, indicating that 1st floor area may influence the sale price.
#ref: [CRE]
salePrice <- trainingDF$SalePrice
finBsmtSF <- trainingDF$BsmtFinSF1
Flr1SF <- trainingDF$X1stFlrSF
Flr2SF <- trainingDF$X2ndFlrSF
data2A2 = data.frame(salePrice,finBsmtSF,Flr1SF,Flr2SF)
pairs(data2A2)
round(cor(data2A2), 2)
## salePrice finBsmtSF Flr1SF Flr2SF
## salePrice 1.00 0.43 0.59 0.66
## finBsmtSF 0.43 1.00 0.69 0.28
## Flr1SF 0.59 0.69 1.00 0.45
## Flr2SF 0.66 0.28 0.45 1.00
##
## Pearson's product-moment correlation
##
## data: salePrice and finBsmtSF
## t = 9.0724, df = 357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3760735 0.4863915
## sample estimates:
## cor
## 0.4328517
The p-value of the test is less than \(2.2 \times 10^{-16}\), which is below the significance level alpha = 0.20 that corresponds to the 80% confidence interval. We can conclude that Sale Price and Finished Basement Square Footage are significantly correlated, with a correlation coefficient of 0.43 and an 80% confidence interval of 0.376 to 0.486. The t-statistic of 9.07 measures how far the estimated correlation lies from 0 in units of standard error, and such a large value is consistent with the very low p-value.
##
## Pearson's product-moment correlation
##
## data: salePrice and Flr1SF
## t = 13.835, df = 357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5448047 0.6332404
## sample estimates:
## cor
## 0.5907942
The p-value of the test is less than \(2.2 \times 10^{-16}\), which is below the significance level alpha = 0.20 that corresponds to the 80% confidence interval. We can conclude that Sale Price and 1st Floor Square Footage are significantly correlated, with a correlation coefficient of 0.59 and an 80% confidence interval of 0.545 to 0.633. The t-statistic of 13.84 measures how far the estimated correlation lies from 0 in units of standard error, and such a large value is consistent with the very low p-value.
##
## Pearson's product-moment correlation
##
## data: salePrice and Flr2SF
## t = 16.747, df = 357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6235253 0.6996407
## sample estimates:
## cor
## 0.663295
The p-value of the test is less than \(2.2 \times 10^{-16}\), which is below the significance level alpha = 0.20 that corresponds to the 80% confidence interval. We can conclude that Sale Price and 2nd Floor Square Footage are significantly correlated, with a correlation coefficient of 0.66 and an 80% confidence interval of 0.624 to 0.700. The t-statistic of 16.75 measures how far the estimated correlation lies from 0 in units of standard error, and such a large value is consistent with the very low p-value.
I don’t think I should worry about familywise error. The data set is reasonably large with 359 observations. Moreover, each of the 3 pairwise tests has a high t-statistic and a p-value below \(2.2 \times 10^{-16}\), so even after a correction for running three tests (for example, a Bonferroni correction) we would still have enough evidence to reject each null hypothesis.
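The familywise argument can be made concrete with a short calculation. The sketch below assumes the three tests are independent when computing the familywise error rate, which is a simplification, and it uses 2.2e-16 as an upper bound for each p-value because R only reports "p-value < 2.2e-16". The names fwer_upper, bonf_alpha and p_adj are illustrative.

```r
alpha <- 0.20                       # significance level matching the 80% intervals
m <- 3                              # number of pairwise correlation tests

# Chance of at least one false rejection across m independent tests at level alpha
fwer_upper <- 1 - (1 - alpha)^m

# Bonferroni: test each hypothesis at alpha / m instead
bonf_alpha <- alpha / m

# Upper bounds on the reported p-values (assumption: the true values are smaller)
p_values <- c(finBsmtSF = 2.2e-16, Flr1SF = 2.2e-16, Flr2SF = 2.2e-16)
p_adj <- p.adjust(p_values, method = "bonferroni")

all_still_significant <- all(p_adj < alpha)
```

At alpha = 0.20 the uncorrected familywise rate is large, but the adjusted p-values remain far below alpha, which supports the conclusion above.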
(5 points) Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
## salePrice finBsmtSF Flr1SF Flr2SF
## salePrice 1.0000000 0.4328517 0.5907942 0.6632950
## finBsmtSF 0.4328517 1.0000000 0.6852469 0.2788414
## Flr1SF 0.5907942 0.6852469 1.0000000 0.4497692
## Flr2SF 0.6632950 0.2788414 0.4497692 1.0000000
## salePrice finBsmtSF Flr1SF Flr2SF
## salePrice 2.2253001 -0.1792725 -0.6900508 -1.1156782
## finBsmtSF -0.1792725 1.9035304 -1.2701740 0.1594125
## Flr1SF -0.6900508 -1.2701740 2.3979965 -0.2666606
## Flr2SF -1.1156782 0.1594125 -0.2666606 1.8155086
## salePrice finBsmtSF Flr1SF Flr2SF
## salePrice 1.000000e+00 8.459451e-18 3.124601e-18 8.042068e-18
## finBsmtSF -9.588181e-17 1.000000e+00 -9.795823e-17 -9.862991e-17
## Flr1SF -2.227974e-16 -6.641383e-17 1.000000e+00 9.903251e-17
## Flr2SF -2.220446e-16 2.775558e-17 5.551115e-17 1.000000e+00
## salePrice finBsmtSF Flr1SF Flr2SF
## salePrice 1.000000e+00 2.154458e-17 -1.190629e-17 0.000000e+00
## finBsmtSF -2.274629e-16 1.000000e+00 -2.884584e-16 -1.110223e-16
## Flr1SF 2.251692e-16 3.461310e-16 1.000000e+00 1.665335e-16
## Flr2SF -2.140025e-16 -2.096522e-16 -1.230121e-16 1.000000e+00
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.00000000 0.0000000 0
## [2,] 0.4328517 1.00000000 0.0000000 0
## [3,] 0.5907942 0.52855010 1.0000000 0
## [4,] 0.6632950 -0.01017292 0.1468793 1
## [,1] [,2] [,3] [,4]
## [1,] 1 0.4328517 0.5907942 0.663294968
## [2,] 0 0.8126394 0.4295206 -0.008266919
## [3,] 0 0.0000000 0.4239391 0.062267857
## [4,] 0 0.0000000 0.0000000 0.550809829
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.4328517 0.5907942 0.6632950
## [2,] 0.4328517 1.0000000 0.6852469 0.2788414
## [3,] 0.5907942 0.6852469 1.0000000 0.4497692
## [4,] 0.6632950 0.2788414 0.4497692 1.0000000
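The matrices above can be reproduced from the printed correlation matrix. The sketch below is an assumption about the steps, not the report's own code: it re-enters the correlation values shown above, uses solve() for the inverse, and uses a small hand-written Doolittle factorization (the hypothetical helper lu_doolittle, without pivoting) in place of a package function such as the one in reference [LUR].

```r
vars <- c("salePrice", "finBsmtSF", "Flr1SF", "Flr2SF")
cor_mat <- matrix(c(1.0000000, 0.4328517, 0.5907942, 0.6632950,
                    0.4328517, 1.0000000, 0.6852469, 0.2788414,
                    0.5907942, 0.6852469, 1.0000000, 0.4497692,
                    0.6632950, 0.2788414, 0.4497692, 1.0000000),
                  nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

precision_mat <- solve(cor_mat)          # inverse; diagonal holds the VIFs
left_product <- cor_mat %*% precision_mat
right_product <- precision_mat %*% cor_mat

# Doolittle LU without pivoting: L is unit lower triangular, U is upper triangular
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]       # multiplier that zeroes U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]
    }
  }
  list(L = L, U = U)
}

lu <- lu_doolittle(cor_mat)
reconstructed <- lu$L %*% lu$U           # should equal cor_mat
```

Both products are the identity matrix up to floating-point error, which is why the off-diagonal entries printed above are on the order of 1e-16 rather than exactly 0.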
(5 points) Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, ???)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Based on experimentation, I found that the Finished Basement Square Footage is skewed to the right, as shown by its histogram below.
## [1] 426.2223
## rate
## 1.715823e-03
## (9.055769e-05)
## [1] 582.8185
I followed the instructions and found that the histogram of the samples drawn from the fitted exponential distribution is still right-skewed. While I am not satisfied with this result, I noticed that the simulated histogram decays more smoothly from left to right, with no gaps between bins.
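The fitting and percentile steps can be sketched in base R. For an exponential distribution the maximum-likelihood rate is 1 / mean, which is the estimate MASS::fitdistr(x, "exponential") reports. The CDF is \(F(x) = 1 - e^{-\lambda x}\), so the p-th percentile is \(-\ln(1 - p)/\lambda\). The data below are a hypothetical stand-in generated with an assumed seed, not the Kaggle variable, and the names basement_sf, lambda_hat, pct_model and pct_closed are illustrative.

```r
set.seed(7)                                        # assumed seed for reproducibility
basement_sf <- rexp(359, rate = 1 / 580)           # hypothetical right-skewed stand-in data

# Maximum-likelihood estimate of the exponential rate
lambda_hat <- 1 / mean(basement_sf)

# 1000 samples from the fitted distribution, as the task requests
sim <- rexp(1000, rate = lambda_hat)

# 5th and 95th percentiles from the fitted model, two equivalent ways
pct_model <- qexp(c(0.05, 0.95), rate = lambda_hat)
pct_closed <- -log(1 - c(0.05, 0.95)) / lambda_hat

# Histograms of the original and simulated values can then be compared, e.g.
# hist(basement_sf); hist(sim)
```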
# ref: [CAL]
epmean2 <- mean(rvalues)
epsd2 <- sd(rvalues)
eplen2 <- 1000
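# Note: qnorm(0.95) gives a two-sided 90% interval; a two-sided 95% interval would use qnorm(0.975)
error <- qnorm(0.95)*epsd2/sqrt(eplen2)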
left <- epmean2 - error
right <- epmean2 + error
left
## [1] 561.2449
right
## [1] 623.8408
epmean2
## [1] 592.5428
epsd2
## [1] 601.712
quantile(rvalues, c(0.05, 0.95))
##         5%        95%
##   34.05061 1870.01164
Based on the 1,000 values generated for Finished Basement Square Footage, the confidence interval for the mean computed above runs from 561.2 to 623.8, centered on a mean of 592.5. However, the standard deviation of 601.7 means individual values can fluctuate widely around that mean. The 5th and 95th percentiles confirm this variability, with values as low as 34.1 and as high as 1870.0.
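The effect of the normal quantile on the interval width can be seen in a short sketch. It uses hypothetical simulated values with an assumed seed, not the report's rvalues, and the names ci90, ci95 and emp_pct are illustrative.

```r
set.seed(7)                                   # assumed seed for reproducibility
values <- rexp(1000, rate = 1 / 580)          # hypothetical stand-in for the sampled data

m <- mean(values)
s <- sd(values)
n <- length(values)

# Normal-theory intervals for the mean: estimate +/- z * s / sqrt(n)
ci90 <- m + c(-1, 1) * qnorm(0.95) * s / sqrt(n)    # leaves 5% in each tail
ci95 <- m + c(-1, 1) * qnorm(0.975) * s / sqrt(n)   # leaves 2.5% in each tail

# Empirical 5th and 95th percentiles describe the spread of individual values,
# which is a different question from the uncertainty in the mean
emp_pct <- quantile(values, c(0.05, 0.95))
```

The interval for the mean is narrow because it shrinks with sqrt(n), while the empirical percentiles stay wide because they reflect the spread of the data itself.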
(10 points) Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I will include 10 predictor variables in the initial model and regress SalePrice on them.
lmresult2D1 <- lm(SalePrice ~ MSSubClass + LotFrontage + LotArea + MasVnrArea + BsmtUnfSF + BsmtFinSF1 + X2ndFlrSF + X1stFlrSF + TotRmsAbvGrd + KitchenAbvGr, data = trainingDF)
summary(lmresult2D1)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea +
## MasVnrArea + BsmtUnfSF + BsmtFinSF1 + X2ndFlrSF + X1stFlrSF +
## TotRmsAbvGrd + KitchenAbvGr, data = trainingDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -329042 -26668 -4967 21906 258686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.006e+05 2.395e+04 4.201 3.61e-05 ***
## MSSubClass -3.242e+02 1.152e+02 -2.813 0.005264 **
## LotFrontage -4.500e+02 1.838e+02 -2.449 0.014976 *
## LotArea -2.635e-01 8.297e-01 -0.318 0.751076
## MasVnrArea 5.889e+01 1.670e+01 3.526 0.000495 ***
## BsmtUnfSF 1.951e+01 1.991e+01 0.980 0.327875
## BsmtFinSF1 2.421e+01 1.726e+01 1.402 0.161961
## X2ndFlrSF 1.726e+02 1.829e+01 9.434 < 2e-16 ***
## X1stFlrSF 5.564e+01 2.117e+01 2.629 0.009057 **
## TotRmsAbvGrd 2.063e+03 3.682e+03 0.560 0.575863
## KitchenAbvGr -7.772e+04 2.121e+04 -3.664 0.000298 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59990 on 272 degrees of freedom
## (76 observations deleted due to missingness)
## Multiple R-squared: 0.6322, Adjusted R-squared: 0.6186
## F-statistic: 46.74 on 10 and 272 DF, p-value: < 2.2e-16
Analysis: Based on the initial pass, we have identified that LotArea, Unfinished Basement SF, Finished Basement SF, and Total Rooms Above Ground should be eliminated from the second pass since their p-values are greater than 0.05.
lmresult2D2 <- lm(SalePrice ~ MSSubClass + LotFrontage + MasVnrArea + X2ndFlrSF + X1stFlrSF + KitchenAbvGr, data = trainingDF)
summary(lmresult2D2)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + MasVnrArea +
## X2ndFlrSF + X1stFlrSF + KitchenAbvGr, data = trainingDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -333902 -26830 -6670 21177 261020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100476.09 22698.38 4.427 1.38e-05 ***
## MSSubClass -303.57 109.48 -2.773 0.005938 **
## LotFrontage -409.27 172.41 -2.374 0.018291 *
## MasVnrArea 62.50 16.19 3.860 0.000141 ***
## X2ndFlrSF 176.13 14.89 11.832 < 2e-16 ***
## X1stFlrSF 78.79 11.82 6.665 1.43e-10 ***
## KitchenAbvGr -76140.46 19927.94 -3.821 0.000164 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59790 on 276 degrees of freedom
## (76 observations deleted due to missingness)
## Multiple R-squared: 0.6293, Adjusted R-squared: 0.6213
## F-statistic: 78.1 on 6 and 276 DF, p-value: < 2.2e-16
Analysis: After the second pass, we have arrived at our final model. All remaining variables have p-values less than 0.05. Moreover, the Multiple R-squared (0.6293) and Adjusted R-squared (0.6213) are both close to 0.62, so the model explains about 62% of the variance in sale price and changed little after the four variables were dropped, indicating a stable linear model.
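Turning a fitted model into a competition submission could look like the sketch below. It is an illustration on synthetic data with an assumed seed, not the submission made for this report: the data frames train and test, the two-predictor formula, and the median fallback for rows with missing predictors are all assumptions.

```r
set.seed(3)                                        # assumed seed for reproducibility
n <- 200

# Synthetic training data with two of the predictors used above
train <- data.frame(X1stFlrSF = runif(n, 500, 2000),
                    X2ndFlrSF = runif(n, 300, 1500))
train$SalePrice <- 50000 + 80 * train$X1stFlrSF + 170 * train$X2ndFlrSF +
  rnorm(n, sd = 20000)

fit <- lm(SalePrice ~ X1stFlrSF + X2ndFlrSF, data = train)

# Synthetic test rows; the last one has a missing predictor
test <- data.frame(Id = 1:5,
                   X1stFlrSF = c(800, 1200, 1500, 950, 1100),
                   X2ndFlrSF = c(600, 900, 700, 400, NA))

pred <- predict(fit, newdata = test)               # NA where a predictor is missing

# Assumed fallback: fill unpredictable rows with the median training price
pred[is.na(pred)] <- median(train$SalePrice)

submission <- data.frame(Id = test$Id, SalePrice = pred)
# write.csv(submission, "submission.csv", row.names = FALSE)
```

The fallback matters because a submission file generally needs a prediction for every Id, and lm() drops rows with missing values during fitting, as the "observations deleted due to missingness" note above shows.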
[CAL] Calculating Confidence Intervals. Retrieved from website: https://www.cyclismo.org/tutorial/R/confidence.html
[COR1] Correlation matrix : A quick start guide to analyze, format and visualize a correlation matrix using R software. Retrieved from website: http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
[COR2] Correlation Test Between Two Variables in R. Retrieved from website: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
[CRE] Creating and Interpretting a Scatterplot Matrix in R. Retrieved from website: https://www.youtube.com/watch?v=tS7dX-wTa9I
[LUR] lu.decomposition. Retrieved from website: https://www.rdocumentation.org/packages/matrixcalc/versions/1.0-3/topics/lu.decomposition
[UNI] Univariate and bivariate descriptive analysis. Retrieved from website: https://beta.rstudioconnect.com/content/3350/dplyr_tutorial.html