Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = (N+1)/2\).
library(knitr)
library(kableExtra)
#Setting variables and values
N <- 10
mean <- (N+1)/2
sd <- (N+1)/2
#Calculate X and Y
set.seed(12)
X <- runif(10000, 1, N)
Y <- rnorm(10000, mean, sd)
#Build data frame
df <- data.frame(X,Y)
#Preview data
kable(data.frame(head(df, n = 10L))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#ea7872") %>%
scroll_box(width = "100%", height = "200px")

| X | Y |
|---|---|
| 1.624248 | 2.013936 |
| 8.359977 | -4.616887 |
| 9.483596 | 6.414498 |
| 3.424437 | 8.840356 |
| 2.524133 | 5.920598 |
| 1.305061 | 4.692422 |
| 2.609065 | -3.742906 |
| 6.774988 | 10.915229 |
| 1.205900 | 6.883874 |
| 1.074923 | 7.445159 |
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
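As a sketch consistent with the values printed below, x and y can be estimated directly from the simulated vectors:
#Estimate x as the median of X and y as the 1st quartile of Y
x <- median(X)
x
y <- quantile(Y, 0.25)
y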
## [1] 5.540831
## 25%
## 1.813683
a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)
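A sketch of how these probabilities can be computed empirically from the simulated data, conditioning on the event X>y as stated above:
#a. P(X>x | X>y)
sum(X > x & X > y) / sum(X > y)
#b. P(X>x, Y>y)
sum(X > x & Y > y) / length(X)
#c. P(X<x | X>y)
sum(X < x & X > y) / sum(X > y)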
## [1] 0.5514503
## [1] 0.3742
## [1] 0.4485497
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
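Before tabulating the counts, the joint probability and the product of the marginals can be compared directly; this is a sketch using the x and y defined above:
#Joint probability P(X>x and Y>y)
p_joint <- sum(X > x & Y > y) / length(X)
#Product of the marginal probabilities P(X>x) and P(Y>y)
p_marginal <- (sum(X > x) / length(X)) * (sum(Y > y) / length(Y))
c(p_joint, p_marginal)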
c_tab <- c(sum(X<x & Y < y),sum(X < x & Y == y),sum(X < x & Y > y))
c_tab <- rbind(c_tab,c(sum(X==x & Y < y),sum(X == x & Y == y),sum(X == x & Y > y)))
c_tab <- rbind(c_tab,c(sum(X>x & Y < y),sum(X > x & Y == y),sum(X > x & Y > y)))
c_tab <- cbind(c_tab, c_tab[,1] + c_tab[,2] + c_tab[,3])
c_tab <- rbind(c_tab, c_tab[1,] + c_tab[2,] + c_tab[3,])
colnames(c_tab) <- c("Y<y", "Y=y", "Y>y", "Total")
rownames(c_tab) <- c("X<x", "X=x", "X>x", "Total")
#Preview data
jp <- as.data.frame(c_tab)
jp
## [1] 0.3742
## [1] 0.375
The joint probability P(X>x and Y>y) = 0.3742 is very close to the product of the marginal probabilities P(X>x)P(Y>y) = 0.375, which suggests that the events X>x and Y>y are approximately independent.
House Prices: Advanced Regression Techniques competition - https://www.kaggle.com/c/house-prices-advanced-regression-techniques
#Load libraries
library(RCurl)
library(knitr)
library(kableExtra)
library(magrittr)
library(psych)
library(Matrix)
library(MASS)
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
#Training data
#train <- read.csv("https://raw.githubusercontent.com/mohamedthasleem/DATA605/master/train.csv", header = TRUE, stringsAsFactors = FALSE)
train <- read.csv("C:/Users/aisha/Dropbox/CUNY/github/DATA602/DATA605/train.csv", header = TRUE, stringsAsFactors = FALSE)
#Summary Information
psych::describe(train)
#Preview data
kable(data.frame(head(train, n = 10L))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#ea7872") %>%
scroll_box(width = "100%", height = "300px")

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
| 6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
| 7 | 20 | RL | 75 | 10084 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2004 | 2005 | Gable | CompShg | VinylSd | VinylSd | Stone | 186 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1369 | Unf | 0 | 317 | 1686 | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 2004 | RFn | 2 | 636 | TA | TA | Y | 255 | 57 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 8 | 2007 | WD | Normal | 307000 |
| 8 | 60 | RL | NA | 10382 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NWAmes | PosN | Norm | 1Fam | 2Story | 7 | 6 | 1973 | 1973 | Gable | CompShg | HdBoard | HdBoard | Stone | 240 | TA | TA | CBlock | Gd | TA | Mn | ALQ | 859 | BLQ | 32 | 216 | 1107 | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 2 | TA | Attchd | 1973 | RFn | 2 | 484 | TA | TA | Y | 235 | 204 | 228 | 0 | 0 | 0 | NA | NA | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
| 9 | 50 | RM | 51 | 6120 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Artery | Norm | 1Fam | 1.5Fin | 7 | 5 | 1931 | 1950 | Gable | CompShg | BrkFace | Wd Shng | None | 0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 952 | 952 | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 0 | 2 | 2 | TA | 8 | Min1 | 2 | TA | Detchd | 1931 | Unf | 2 | 468 | Fa | TA | Y | 90 | 0 | 205 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2008 | WD | Abnorml | 129900 |
| 10 | 190 | RL | 50 | 7420 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Artery | Artery | 2fmCon | 1.5Unf | 5 | 6 | 1939 | 1950 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | BrkTil | TA | TA | No | GLQ | 851 | Unf | 0 | 140 | 991 | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 2 | 2 | TA | 5 | Typ | 2 | TA | Attchd | 1939 | RFn | 1 | 205 | Gd | TA | Y | 0 | 4 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 1 | 2008 | WD | Normal | 118000 |
-Provide univariate descriptive statistics and appropriate plots for the training data set
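A minimal sketch for the dependent variable; the summary below corresponds to SalePrice, and the histogram call is an added example:
#Univariate summary of SalePrice
summary(train$SalePrice)
#Histogram of the dependent variable
hist(train$SalePrice, breaks = 50, main = "Distribution of SalePrice", xlab = "SalePrice")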
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
-Provide a scatterplot matrix for at least two of the independent variables and the dependent variable
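A sketch of a scatterplot matrix; the choice of GrLivArea and GarageArea as the two independent variables is an assumption:
#Scatterplot matrix of two independent variables and the dependent variable
pairs(~ GrLivArea + GarageArea + SalePrice, data = train, main = "Scatterplot Matrix")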
-Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval
#Subsetting data
sub_df <- data.frame(train$LotArea,train$GrLivArea,train$GarageArea)
#Correlation
cormatrix <- cor(sub_df)
cormatrix
## train.LotArea train.GrLivArea train.GarageArea
## train.LotArea 1.0000000 0.2631162 0.1804028
## train.GrLivArea 0.2631162 1.0000000 0.4689975
## train.GarageArea 0.1804028 0.4689975 1.0000000
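The pairwise hypothesis tests reported below can be reproduced with cor.test at an 80% confidence level; this sketch follows the order of the printed output:
#Pairwise correlation tests with 80% confidence intervals
cor.test(train$LotArea, train$GrLivArea, conf.level = 0.80)
cor.test(train$LotArea, train$GarageArea, conf.level = 0.80)
cor.test(train$GarageArea, train$GrLivArea, conf.level = 0.80)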
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$GarageArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1477356 0.2126767
## sample estimates:
## cor
## 0.1804028
##
## Pearson's product-moment correlation
##
## data: train$GarageArea and train$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
All three correlation tests have p-values far below 0.05, so the null hypothesis that the true correlation is zero can be rejected for each pair. Familywise error is a concern when performing multiple hypothesis tests, because the probability of at least one Type I error grows with the number of tests; since several tests are run on the same data here, that inflated error rate should be kept in mind. The problem can be mitigated by adjusting the individual tests to a stricter significance level (a higher confidence level), for example with a Bonferroni-style correction.
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
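A sketch of the matrix algebra using the cormatrix computed above; the round() calls are an assumption to suppress floating-point noise so the products print as the identity matrices shown below:
#Precision matrix: inverse of the correlation matrix
precision <- solve(cormatrix)
precision
#Correlation matrix multiplied by the precision matrix
round(cormatrix %*% precision, 10)
#Precision matrix multiplied by the correlation matrix
round(precision %*% cormatrix, 10)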
## train.LotArea train.GrLivArea train.GarageArea
## train.LotArea 1.07920917 -0.2469705 -0.07886378
## train.GrLivArea -0.24697046 1.3385010 -0.58319943
## train.GarageArea -0.07886378 -0.5831994 1.28774631
## train.LotArea train.GrLivArea train.GarageArea
## train.LotArea 1 0 0
## train.GrLivArea 0 1 0
## train.GarageArea 0 0 1
-Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix
## train.LotArea train.GrLivArea train.GarageArea
## train.LotArea 1 0 0
## train.GrLivArea 0 1 0
## train.GarageArea 0 0 1
-Conduct LU decomposition on the matrix.
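A sketch of the LU decomposition using lu() and expand() from the Matrix package loaded earlier; factoring both the correlation matrix and its inverse is consistent with the two L/U pairs shown below:
#LU decomposition of the correlation matrix
lu_cor <- expand(lu(cormatrix))
lu_cor$L
lu_cor$U
#LU decomposition of the precision matrix
lu_prec <- expand(lu(solve(cormatrix)))
lu_prec$L
lu_prec$U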
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] 0.2631162 1.0000000 .
## [3,] 0.1804028 0.4528838 1.0000000
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.2631162 0.1804028
## [2,] . 0.9307699 0.4215306
## [3,] . . 0.7765505
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.00000000 . .
## [2,] -0.22884393 1.00000000 .
## [3,] -0.07307553 -0.46899748 1.00000000
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.07920917 -0.24697046 -0.07886378
## [2,] . 1.28198329 -0.60124693
## [3,] . . 1.00000000
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
# MASS Package
library(MASS)
#run fitdistr to fit an exponential probability density function, Find optimal lambda
optimal_lambda <- fitdistr(train$TotalBsmtSF,"exponential")
optimal_lambda$estimate
## rate
## 0.0009456896
-Take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable
#1000 samples from this exponential distribution using this value
hist(rexp(1000, optimal_lambda$estimate), breaks = 200, main = "Fitted Exponential PDF", xlim = c(1, quantile(rexp(1000, optimal_lambda$estimate), 0.99)))
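For the comparison, a sketch of a histogram of the original variable on a similar scale:
#Histogram of the original TotalBsmtSF variable for comparison
hist(train$TotalBsmtSF, breaks = 200, main = "Original TotalBsmtSF", xlim = c(1, quantile(train$TotalBsmtSF, 0.99)))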
#5th and 95th percentiles using the exponential CDF
qexp(0.05, rate = optimal_lambda$estimate, lower.tail = TRUE, log.p = FALSE)
## [1] 54.23904
qexp(0.95, rate = optimal_lambda$estimate, lower.tail = TRUE, log.p = FALSE)
## [1] 3167.776
#95% confidence interval from the empirical data - normal
Bsmt_mean <- mean(train$TotalBsmtSF)
Bsmt_sd <- sd(train$TotalBsmtSF)
qnorm(0.95, Bsmt_mean, Bsmt_sd)
## [1] 1779.035
#Empirical 5th and 95th percentiles of TotalBsmtSF
quantile(train$TotalBsmtSF, c(0.05, 0.95))
## 5% 95%
## 519.3 1753.0
The exponential model does not appear to be a good fit for this variable: the range it implies does not match the actual data, and the fit is heavily biased.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
#select all the quantitative variables and eliminate the ones with low correlations
quantitative <- data.frame(train$OverallQual,train$YearBuilt,train$YearRemodAdd,train$MasVnrArea,train$BsmtFinSF1,train$TotalBsmtSF,train$X1stFlrSF,train$X2ndFlrSF,train$GrLivArea,train$FullBath,train$TotRmsAbvGrd,train$Fireplaces,train$GarageCars,train$GarageArea,train$WoodDeckSF,train$OpenPorchSF,train$SalePrice)
#create a linear regression model
m1 <- lm(train.SalePrice ~.,data = quantitative)
summary(m1)
##
## Call:
## lm(formula = train.SalePrice ~ ., data = quantitative)
##
## Residuals:
## Min 1Q Median 3Q Max
## -512233 -17548 -1737 14681 283280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.094e+06 1.268e+05 -8.627 < 2e-16 ***
## train.OverallQual 1.856e+04 1.174e+03 15.807 < 2e-16 ***
## train.YearBuilt 1.638e+02 4.978e+01 3.290 0.001028 **
## train.YearRemodAdd 3.564e+02 6.208e+01 5.741 1.15e-08 ***
## train.MasVnrArea 2.881e+01 6.159e+00 4.678 3.17e-06 ***
## train.BsmtFinSF1 1.725e+01 2.596e+00 6.646 4.26e-11 ***
## train.TotalBsmtSF 1.165e+01 4.298e+00 2.711 0.006796 **
## train.X1stFlrSF 2.618e+01 2.082e+01 1.257 0.208871
## train.X2ndFlrSF 1.753e+01 2.048e+01 0.856 0.392000
## train.GrLivArea 2.135e+01 2.035e+01 1.049 0.294370
## train.FullBath -1.489e+03 2.630e+03 -0.566 0.571228
## train.TotRmsAbvGrd 1.688e+03 1.089e+03 1.550 0.121402
## train.Fireplaces 7.888e+03 1.783e+03 4.423 1.05e-05 ***
## train.GarageCars 1.011e+04 2.960e+03 3.414 0.000659 ***
## train.GarageArea 1.040e+01 1.005e+01 1.035 0.301006
## train.WoodDeckSF 3.068e+01 8.129e+00 3.774 0.000167 ***
## train.OpenPorchSF 7.271e+00 1.572e+01 0.462 0.643861
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36380 on 1435 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7918, Adjusted R-squared: 0.7894
## F-statistic: 341 on 16 and 1435 DF, p-value: < 2.2e-16
#eliminate variables based on significant level
quantitative2 <- data.frame(train$OverallQual,train$YearRemodAdd,train$MasVnrArea,train$BsmtFinSF1,train$TotalBsmtSF,train$Fireplaces,train$GarageCars,train$WoodDeckSF,train$SalePrice)
colnames(quantitative2) <- c("OverallQual","YearRemodAdd","MasVnrArea","BsmtFinSF1","TotalBsmtSF","Fireplaces","GarageCars","WoodDeckSF","SalePrice")
#create a linear regression model
m2 <- lm(SalePrice ~.,data = quantitative2)
summary(m2)
##
## Call:
## lm(formula = SalePrice ~ ., data = quantitative2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -407840 -21443 -2760 16410 363961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.307e+05 1.210e+05 -6.867 9.70e-12 ***
## OverallQual 2.449e+04 1.183e+03 20.706 < 2e-16 ***
## YearRemodAdd 3.925e+02 6.256e+01 6.273 4.66e-10 ***
## MasVnrArea 4.651e+01 6.602e+00 7.045 2.85e-12 ***
## BsmtFinSF1 1.482e+01 2.752e+00 5.383 8.52e-08 ***
## TotalBsmtSF 2.504e+01 3.290e+00 7.611 4.89e-14 ***
## Fireplaces 1.551e+04 1.849e+03 8.389 < 2e-16 ***
## GarageCars 1.794e+04 1.820e+03 9.855 < 2e-16 ***
## WoodDeckSF 4.464e+01 8.848e+00 5.045 5.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39960 on 1443 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7474, Adjusted R-squared: 0.746
## F-statistic: 533.6 on 8 and 1443 DF, p-value: < 2.2e-16
The residuals are nearly normally distributed, with perhaps some outliers; the fit is not perfect, but all of the retained predictors are statistically significant. Let's check the performance using the test data.
#Fetch test data
test <- read.csv("https://raw.githubusercontent.com/mohamedthasleem/DATA605/master/test.csv")
test[complete.cases(test),]
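A sketch of how predictions from m2 could be generated and written out for a Kaggle submission; the NA handling and the output file name are assumptions:
#Predict SalePrice on the test set with the reduced model
test_pred <- predict(m2, newdata = test)
#Fill any missing predictions with the mean training-set price
test_pred[is.na(test_pred)] <- mean(train$SalePrice)
#Write the submission file in the format Kaggle expects (Id, SalePrice)
submission <- data.frame(Id = test$Id, SalePrice = test_pred)
write.csv(submission, "submission.csv", row.names = FALSE)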