Problem 1

Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.

X <- c(9.3,7.4, 9.5,    9.3, 4.1,   6.4 ,3.7,   12.4,22.4,  8.5,    11.7,   19.9,9.1,   9.5,    7.4,    6.9,15.8,   11.8,   5.3 ,-1,7.1,    8.8,    7.4 ,10.6,15.9, 8.4,    7.4,    6.4,6.9,    5.1,    8.6,    10.6,16,    11.4,   9.1,    1.2,6.7,    15.1,   11.4,   7.7,8.2,    12.6,   8.4,    15.5,16,    8,  7.3,    6.9,6.4 ,10.3   ,11.3   ,13.7, 11.8 ,10.4   ,4.4    ,3.7,3.5,   9.5,    9.3,    4.4,21.7,   9.5 ,10.9   ,11.5,12.2 ,15.1,   10.9,   4.2,9.3,    6.6,    7.7,    13.9,8  ,15.4,  7.7,    12.9,6.2,   8.2 ,11.5,  1.2)
Y <- c(20.3,    20.8,   28.4,   20.2,   19.1,   14.6,   21.5,   18.6,   15.2,   16.2,   21.3,   26.9,   13.8,   14.7,   25.2,   26,
22.3,   13.1,   21.4,   16.8,   19.3,   18, 20.8,   22.6,   20.9,   15.7,   15.1,   16.3,   18.8,   15.3,   22.5,   26.8,
17.6,   10.3,   20.8,   20.2,   20.9,   7.3,    22.2,   11.4,   18.4,   16.3,   17.8,   19.9,   20.9,   12.6,   21.1,   19.7,
20.8,   14.9,   23, 21.7,   22, 19.4,   21.6,   23.6,   10.3,   11.5,   26.4,   15.5,   18.6,   13, 21.7,   22.7,
28.7,   14.8,   17.4,   20.9,   23.5,   13.5,   21.8,   24, 26.3,   12.2,   21.6,   26.5,   28.1,   11.8,   22.5,   21.7)

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

Q3_X <- quantile(X, 0.75)
Q1_Y <- quantile(Y, 0.25)
Q3_X
##   75% 
## 11.55
Q1_Y
##   25% 
## 15.65
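  a. P(X>x | Y>y)

The conditional probability of X>x given Y>y is the joint probability divided by the probability of the conditioning event:

\(P(X>x | Y>y) = \frac{P(X>x,Y>y) }{P(Y>y)}\)

(length(X[X > Q3_X &  Y > Q1_Y])/length(X)) / (length(Y[Y > Q1_Y])/ length(Y))
## [1] 0.25

Among the observations with Y above its 1st quartile, 25% also have X above its 3rd quartile.

  b. P(X>x, Y>y)

\(P(X>x , Y>y) = P(X>x|Y>y) \, P(Y>y)\)

(length(X[X > Q3_X &  Y > Q1_Y])/length(X)) 
## [1] 0.1875

Of all observations, 18.75% have both X above its 3rd quartile and Y above its 1st quartile.

  c. P(X<x | Y>y)

\(P(X<x | Y>y) = \frac{P(X<x,Y>y) }{P(Y>y)}\)

(length(X[X < Q3_X &  Y > Q1_Y])/length(X)) / (length(Y[Y > Q1_Y])/ length(Y))
## [1] 0.75

Among the observations with Y above its 1st quartile, 75% have X below its 3rd quartile.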
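
The relationship between joint and conditional probability used above can be sketched on a small example. The vectors `x_toy` and `y_toy` and the cut-offs below are hypothetical values chosen for illustration, not the assignment data; the sketch relies on the fact that `mean()` of a logical vector gives the proportion of TRUE values.

```r
# Hypothetical toy data (not the assignment's X and Y)
x_toy <- c(1, 2, 3, 4, 5, 6, 7, 8)
y_toy <- c(8, 1, 6, 3, 7, 2, 5, 4)
x_cut <- 6   # stands in for the 3rd quartile of X
y_cut <- 3   # stands in for the 1st quartile of Y

a <- x_toy > x_cut   # event X > x
b <- y_toy > y_cut   # event Y > y

p_joint <- mean(a & b)        # P(X>x, Y>y)
p_b     <- mean(b)            # P(Y>y)
p_cond  <- p_joint / p_b      # P(X>x | Y>y) by the definition

# Same conditional probability, computed by restricting to the rows where b holds
p_cond_direct <- mean(a[b])

c(joint = p_joint, conditional = p_cond, direct = p_cond_direct)
```

Both routes to the conditional probability agree, which is a quick way to check that a joint probability has not been confused with a conditional one.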

In addition, make a table of counts as shown below.

# X<=x, Y<=y
row1_col1 <- length(X[X <= Q3_X &  Y <= Q1_Y])
# X>x, Y<=y
row1_col2 <- length(X[X > Q3_X &  Y <= Q1_Y])
# X<=x, Y>y
row2_col1 <- length(X[X <= Q3_X &  Y > Q1_Y])
# X>x, Y>y
row2_col2 <- length(X[X > Q3_X &  Y > Q1_Y])
count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
count[3,] = count[1,] + count[2,]
count[,3] = count[,1] + count[,2]
names(count) <- c('X<=x', 'X>x', 'Total')
rownames(count) <- c('Y<=y', 'Y>y', 'Total')
count 
##       X<=x X>x Total
## Y<=y    15   5    20
## Y>y     45  15    60
## Total   60  20    80
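
As a possible shortcut, base R's `table()` and `addmargins()` can build the same kind of 2 x 2 table with totals in two calls. The sketch below uses hypothetical toy vectors and cut-offs rather than the assignment data.

```r
# Hypothetical toy data and thresholds, for illustration only
x_toy <- c(1, 2, 3, 4, 5, 6, 7, 8)
y_toy <- c(8, 1, 6, 3, 7, 2, 5, 4)

# Cross-tabulate the two logical conditions (rows: Y above cut, columns: X above cut)
tab <- table(Y_above = y_toy > 3, X_above = x_toy > 6)

# Append row and column totals (the margin is labelled "Sum")
tab_with_totals <- addmargins(tab)
tab_with_totals
```

The rows and columns are labelled FALSE and TRUE instead of `Y<=y` and `Y>y`, so the labels would still need renaming to match the layout requested in the assignment.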

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

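Let A be the event X>x and B the event Y>y. Independence requires the joint probability P(AB) to equal the product P(A)P(B). The quantity to compare is the joint probability, not the conditional probability P(A|B).

A <- 20   # number of observations with X > x
B <- 60   # number of observations with Y > y
P_A <- A/80
P_B <- B/80
# joint probability P(X>x, Y>y): 15 of the 80 observations
P_AB <- 15/80
P_AB 
## [1] 0.1875
P_A * P_B
## [1] 0.1875

\(P(AB) = P(A)P(B)\)

In this sample the joint probability equals the product of the marginal probabilities, so A and B satisfy the definition of independence.

H_0: \(X>x\) and \(Y>y\) are independent

H_a: \(X>x\) and \(Y>y\) are not independent

\(\chi^2 =\sum { \frac { { (observed-expected) }^2 }{ expected } }\)

# expected count for each cell: row total * column total / grand total
row1_col1 <- count[1,3] * count[3,1] / count[3,3]
row1_col2 <- count[1,3] * count[3,2] / count[3,3]
row2_col1 <- count[2,3] * count[3,1] / count[3,3]
row2_col2 <- count[2,3] * count[3,2] / count[3,3]
expected_count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
expected_count
##   c.row1_col1..row2_col1. c.row1_col2..row2_col2.
## 1                      15                       5
## 2                      45                      15
diff_sq <- (count[1:2, 1:2] - expected_count) ^2 / expected_count
diff_sq
##   c.row1_col1..row2_col1. c.row1_col2..row2_col2.
## 1                       0                       0
## 2                       0                       0
chi_square <- sum(diff_sq)
chi_square
## [1] 0
p_value <- pchisq(chi_square, 1, lower.tail=F)
p_value
## [1] 1

The expected counts match the observed counts exactly, so the test statistic is 0 and the p-value is 1. We fail to reject \(H_0\): the events X>x and Y>y show no association in this table. This says nothing stronger about X and Y themselves; it only describes the two dichotomized events.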
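
One way to cross-check a hand-computed chi-square statistic is the built-in `chisq.test()`. The sketch below re-enters the observed counts from the table of counts above as a matrix; `correct = FALSE` turns off the Yates continuity correction so the statistic follows the plain formula. For this table every expected count (row total times column total divided by 80) works out to the observed count.

```r
# Observed counts copied from the table of counts (rows: Y<=y, Y>y; columns: X<=x, X>x)
counts <- matrix(c(15, 45, 5, 15), nrow = 2,
                 dimnames = list(c("Y<=y", "Y>y"), c("X<=x", "X>x")))

# Pearson chi-square test without the continuity correction
result <- chisq.test(counts, correct = FALSE)

result$expected    # expected counts under independence
result$statistic   # chi-square statistic
result$p.value     # p-value on 1 degree of freedom
```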

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Data fields

Here’s a brief version of what you’ll find in the data description file.

  • SalePrice - the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Number of bedrooms above basement level
  • Kitchen: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: $Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale
train.dt<-read.csv('https://raw.githubusercontent.com/Lidiia25/Data605_Final_Problem2/master/train.csv?token=Ac3_PlkWy_zjurpLjUxOWZj4UvMrBCdiks5cHachwA%3D%3D', header=TRUE)
head(train.dt)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    208500
## 2       WD        Normal    181500
## 3       WD        Normal    223500
## 4       WD       Abnorml    140000
## 5       WD        Normal    250000
## 6       WD        Normal    143000

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

summary(train.dt$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
hist(train.dt$SalePrice, main="Sale Price")

qqnorm(train.dt$SalePrice)
qqline(train.dt$SalePrice)

pairs(~SalePrice+TotalBsmtSF+YearBuilt+LotFrontage+GrLivArea+GarageArea, data=train.dt, 
   main="Scatterplot Matrix")

Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

selected <-  data.frame(train.dt$SalePrice,train.dt$TotalBsmtSF,train.dt$GrLivArea)
matrix <- cor(selected)
matrix
##                      train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice            1.0000000            0.6135806
## train.dt.TotalBsmtSF          0.6135806            1.0000000
## train.dt.GrLivArea            0.7086245            0.4548682
##                      train.dt.GrLivArea
## train.dt.SalePrice            0.7086245
## train.dt.TotalBsmtSF          0.4548682
## train.dt.GrLivArea            1.0000000
cor.test(train.dt$GrLivArea, train.dt$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train.dt$GrLivArea and train.dt$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

In the result above:

t is the t-test statistic (t = 38.348), df is the degrees of freedom (df = 1458), and the p-value is the probability of a test statistic at least this extreme if the true correlation were 0 (p-value < 2.2e-16). conf.int is the 80% confidence interval for the correlation coefficient (0.69 to 0.72), and the sample estimate is the correlation coefficient itself (0.71).

cor.test(train.dt$TotalBsmtSF, train.dt$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train.dt$TotalBsmtSF and train.dt$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5922142 0.6340846
## sample estimates:
##       cor 
## 0.6135806

In the result above:

t is the t-test statistic (t = 29.671), df is the degrees of freedom (df = 1458), and the p-value is the probability of a test statistic at least this extreme if the true correlation were 0 (p-value < 2.2e-16). conf.int is the 80% confidence interval for the correlation coefficient (0.59 to 0.63), and the sample estimate is the correlation coefficient itself (0.61).

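Both p-values are far below the 0.20 significance level implied by an 80% confidence interval (and below the conventional 0.05), and neither interval contains 0. We can conclude that SalePrice is significantly correlated with both GrLivArea and TotalBsmtSF. The third pair, TotalBsmtSF and GrLivArea (r = 0.45), was not tested here.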
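
The familywise error question can be made concrete with a short calculation. The sketch below assumes three pairwise tests, each run at the 0.20 level that corresponds to an 80% confidence interval, and treats the tests as independent. That independence is an approximation, since the three tests share variables; the Bonferroni adjustment shown does not need it.

```r
alpha <- 0.20   # per-test significance level implied by an 80% interval
m <- 3          # number of pairwise correlation tests

# Chance of at least one false rejection across m independent tests
fwer <- 1 - (1 - alpha)^m
fwer

# Bonferroni adjustment: split alpha across the tests
bonferroni_alpha <- alpha / m
bonferroni_conf_level <- 1 - bonferroni_alpha   # value to pass as conf.level
bonferroni_conf_level
```

With p-values as small as those reported above, the conclusions would survive this adjustment, so familywise error is of limited concern for these particular tests.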

Linear Algebra and Correlation. Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

cor_matrix <- matrix
precision_matrix <- solve(cor_matrix)
precision_matrix
##                      train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice            2.5582310          -0.93946422
## train.dt.TotalBsmtSF         -0.9394642           1.60588442
## train.dt.GrLivArea           -1.3854927          -0.06473842
##                      train.dt.GrLivArea
## train.dt.SalePrice          -1.38549273
## train.dt.TotalBsmtSF        -0.06473842
## train.dt.GrLivArea           2.01124151
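
The statement that the diagonal of the precision matrix holds variance inflation factors can be checked numerically. The sketch below uses simulated data (the variables `z1`, `z2`, `z3` are hypothetical, not the housing data) and compares the first diagonal entry of the inverted correlation matrix with the VIF computed as 1 / (1 - R^2) from regressing `z1` on the other two variables.

```r
set.seed(1)
n <- 200
# Hypothetical correlated variables, for illustration only
z1 <- rnorm(n)
z2 <- 0.5 * z1 + rnorm(n)
z3 <- 0.3 * z1 + 0.4 * z2 + rnorm(n)
d <- data.frame(z1, z2, z3)

# Precision matrix: inverse of the correlation matrix
precision <- solve(cor(d))

# VIF of z1 from its regression on the remaining variables
r2 <- summary(lm(z1 ~ z2 + z3, data = d))$r.squared
vif_z1 <- 1 / (1 - r2)

c(from_regression = vif_z1, from_precision = unname(diag(precision)[1]))
```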

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

matrix_2 <- cor_matrix%*% precision_matrix
matrix_2
##                      train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice                    1         1.387779e-17
## train.dt.TotalBsmtSF                  0         1.000000e+00
## train.dt.GrLivArea                    0         5.551115e-17
##                      train.dt.GrLivArea
## train.dt.SalePrice         0.000000e+00
## train.dt.TotalBsmtSF       1.110223e-16
## train.dt.GrLivArea         1.000000e+00

and then multiply the precision matrix by the correlation matrix.

matrix_3 <-precision_matrix%*%cor_matrix
matrix_3
##                      train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice         1.000000e+00                    0
## train.dt.TotalBsmtSF       2.775558e-17                    1
## train.dt.GrLivArea         0.000000e+00                    0
##                      train.dt.GrLivArea
## train.dt.SalePrice         0.000000e+00
## train.dt.TotalBsmtSF       6.938894e-17
## train.dt.GrLivArea         1.000000e+00

Conduct LU decomposition on the matrix.

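# LDU decomposition by Gaussian elimination without pivoting
solver_func <- function(A){
  rows <- columns <- dim(A)[1]
  U <- A
  L <- D <- diag(rows)
  for (j in 1:(columns-1)){
    for (i in (j+1):rows){
      # multiplier that eliminates entry (i, j)
      L[i,j] <- U[i,j]/U[j,j]
      U[i,] <- U[i,]-(U[j,]*L[i,j])
    }
  }
  # move the pivots into D and scale U to a unit diagonal
  diag(D) <- diag(U)
  for (l in 1:rows){
    U[l,] <- U[l,]/U[l,l]
  }
  LDU <- list("Lower matrix"=L,"Diagonal matrix"=D,"Upper matrix"=U)
  return(LDU)
}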
solver_func(matrix_3)
## $`Lower matrix`
##              [,1] [,2] [,3]
## [1,] 1.000000e+00    0    0
## [2,] 2.775558e-17    1    0
## [3,] 0.000000e+00    0    1
## 
## $`Diagonal matrix`
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## 
## $`Upper matrix`
##                      train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice                    1                    0
## train.dt.TotalBsmtSF                  0                    1
## train.dt.GrLivArea                    0                    0
##                      train.dt.GrLivArea
## train.dt.SalePrice         0.000000e+00
## train.dt.TotalBsmtSF       6.938894e-17
## train.dt.GrLivArea         1.000000e+00
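
Because the product of the correlation and precision matrices is essentially the identity, its factors are close to trivial. A more informative check is to factor the correlation matrix itself and confirm that the factors multiply back to it. The sketch below is a compact LU routine without pivoting, where `lu_no_pivot` is a hypothetical helper name, applied to the correlation values copied from the output above.

```r
# Doolittle-style LU factorisation without pivoting (assumes non-zero pivots)
lu_no_pivot <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (j in seq_len(n - 1)) {
    for (i in (j + 1):n) {
      L[i, j] <- U[i, j] / U[j, j]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, j] * U[j, ]   # zero out entry (i, j)
    }
  }
  list(L = L, U = U)
}

# Correlation matrix of SalePrice, TotalBsmtSF and GrLivArea, copied from the output
R <- matrix(c(1,         0.6135806, 0.7086245,
              0.6135806, 1,         0.4548682,
              0.7086245, 0.4548682, 1), nrow = 3, byrow = TRUE)

f <- lu_no_pivot(R)
f$L
f$U
all.equal(f$L %*% f$U, R)   # TRUE when the factors reproduce R
```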

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

min(train.dt$GrLivArea)
## [1] 334

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

library(MASS)
exp_prob <- fitdistr(train.dt$GrLivArea, "exponential")
exp_prob
##        rate    
##   6.598640e-04 
##  (1.726943e-05)

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).

lambda <- exp_prob$estimate
lambda 
##        rate 
## 0.000659864
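
For the exponential distribution the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, so the fitted value above corresponds to a mean of about 1/0.000659864, roughly 1515. The sketch below illustrates that relationship on simulated data; `toy` is a hypothetical sample, not GrLivArea.

```r
library(MASS)

set.seed(42)
# Hypothetical right-skewed sample, for illustration only
toy <- rexp(500, rate = 0.002)

fit <- fitdistr(toy, "exponential")
rate_closed_form <- 1 / mean(toy)   # closed-form MLE of the rate

c(fitdistr = unname(fit$estimate), closed_form = rate_closed_form)
```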
samples <- rexp(1000,lambda)

Plot a histogram and compare it with a histogram of your original variable.

par(mfrow=c(1,2))
hist(samples, breaks=50, xlim=c(0, 6000) , main = "Exponential GrLivArea")
hist(train.dt$GrLivArea, breaks=50, main = "Original GrLivArea")

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(.05,rate = lambda)
## [1] 77.73313
qexp(.95,rate = lambda)
## [1] 4539.924
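
The percentiles returned by `qexp` come from inverting the exponential CDF, F(x) = 1 - exp(-λx), which gives x = -log(1 - p)/λ. The sketch below applies that closed form with the rate copied from the fit above; `exp_quantile` is a hypothetical helper name introduced for illustration.

```r
lambda_hat <- 0.000659864   # rate copied from the fitdistr output

# Inverse of the exponential CDF: solve 1 - exp(-rate * x) = p for x
exp_quantile <- function(p, rate) -log(1 - p) / rate

q05 <- exp_quantile(0.05, lambda_hat)
q95 <- exp_quantile(0.95, lambda_hat)
c(q05 = q05, q95 = q95)

# Agreement with the built-in quantile function
all.equal(q05, qexp(0.05, rate = lambda_hat))
all.equal(q95, qexp(0.95, rate = lambda_hat))
```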

Also generate a 95% confidence interval from the empirical data, assuming normality.

Assuming normality, a 95% confidence interval for the mean is built from the sample mean, the sample standard deviation and the sample size:

\(\bar{x} \pm Z \cdot \frac{s}{\sqrt{n}}\)

Z <- 1.96   # approximate 97.5th percentile of the standard normal
x_bar <- mean(train.dt$GrLivArea)   # named x_bar to avoid masking mean()
std <- sd(train.dt$GrLivArea)
n <- length(train.dt$GrLivArea)
upper <- x_bar + Z * std / sqrt(n)
lower <- x_bar - Z * std / sqrt(n)
lower
## [1] 1488.509
upper
## [1] 1542.419
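
The constant 1.96 is a rounded value of the 97.5th percentile of the standard normal distribution, which `qnorm` returns directly. The sketch below shows the same interval computed that way on a simulated sample; `toy` and its parameters are hypothetical and unrelated to the housing data.

```r
set.seed(7)
# Hypothetical sample, for illustration only
toy <- rnorm(400, mean = 1500, sd = 500)

z <- qnorm(0.975)   # two-sided 95% critical value, about 1.96
half_width <- z * sd(toy) / sqrt(length(toy))

ci <- c(lower = mean(toy) - half_width, upper = mean(toy) + half_width)
ci
```

An interval of this kind describes uncertainty about the mean, so it is much narrower than the range between the 5th and 95th percentiles of the data.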

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

quantile(train.dt$GrLivArea, .05)
##  5% 
## 848
quantile(train.dt$GrLivArea, .95)
##    95% 
## 2466.1

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.

only_numeric <- sapply(train.dt, is.numeric)
train_data <- train.dt[ , only_numeric]
head(train_data)
##   Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
## 1  1         60          65    8450           7           5      2003
## 2  2         20          80    9600           6           8      1976
## 3  3         60          68   11250           7           5      2001
## 4  4         70          60    9550           7           5      1915
## 5  5         60          84   14260           8           5      2000
## 6  6         50          85   14115           5           5      1993
##   YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1         2003        196        706          0       150         856
## 2         1976          0        978          0       284        1262
## 3         2002        162        486          0       434         920
## 4         1970          0        216          0       540         756
## 5         2000        350        655          0       490        1145
## 6         1995          0        732          0        64         796
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1       856       854            0      1710            1            0
## 2      1262         0            0      1262            0            1
## 3       920       866            0      1786            1            0
## 4       961       756            0      1717            1            0
## 5      1145      1053            0      2198            1            0
## 6       796       566            0      1362            1            0
##   FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 1        2        1            3            1            8          0
## 2        2        0            3            1            6          1
## 3        2        1            3            1            6          1
## 4        1        0            3            1            7          1
## 5        2        1            4            1            9          1
## 6        1        1            1            1            5          0
##   GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch
## 1        2003          2        548          0          61             0
## 2        1976          2        460        298           0             0
## 3        2001          2        608          0          42             0
## 4        1998          3        642          0          35           272
## 5        2000          3        836        192          84             0
## 6        1993          2        480         40          30             0
##   X3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
## 1          0           0        0       0      2   2008    208500
## 2          0           0        0       0      5   2007    181500
## 3          0           0        0       0      9   2008    223500
## 4          0           0        0       0      2   2006    140000
## 5          0           0        0       0     12   2008    250000
## 6        320           0        0     700     10   2009    143000
cor <- as.data.frame(cor(train_data))
cor <- cor[order(cor$SalePrice, decreasing=T),]
cor_saleprice <- cor$SalePrice
names(cor_saleprice) <- rownames(cor)
cor_saleprice
##     SalePrice   OverallQual     GrLivArea    GarageCars    GarageArea 
##    1.00000000    0.79098160    0.70862448    0.64040920    0.62343144 
##   TotalBsmtSF     X1stFlrSF      FullBath  TotRmsAbvGrd     YearBuilt 
##    0.61358055    0.60585218    0.56066376    0.53372316    0.52289733 
##  YearRemodAdd    Fireplaces    BsmtFinSF1    WoodDeckSF     X2ndFlrSF 
##    0.50710097    0.46692884    0.38641981    0.32441344    0.31933380 
##   OpenPorchSF      HalfBath       LotArea  BsmtFullBath     BsmtUnfSF 
##    0.31585623    0.28410768    0.26384335    0.22712223    0.21447911 
##  BedroomAbvGr   ScreenPorch      PoolArea        MoSold    X3SsnPorch 
##    0.16821315    0.11144657    0.09240355    0.04643225    0.04458367 
##    BsmtFinSF2  BsmtHalfBath       MiscVal            Id  LowQualFinSF 
##   -0.01137812   -0.01684415   -0.02118958   -0.02191672   -0.02560613 
##        YrSold   OverallCond    MSSubClass EnclosedPorch  KitchenAbvGr 
##   -0.02892259   -0.07785589   -0.08428414   -0.12857796   -0.13590737 
##   LotFrontage    MasVnrArea   GarageYrBlt 
##            NA            NA            NA
model <- lm(SalePrice ~ OverallQual + GrLivArea+ GarageCars + GarageArea  + TotalBsmtSF  +   X1stFlrSF   + TotRmsAbvGrd 
     +     YearBuilt  + YearRemodAdd  + Fireplaces  +  BsmtFinSF1  +  WoodDeckSF    + X2ndFlrSF +  OpenPorchSF    +  HalfBath    +   LotArea+ BsmtFullBath  +   BsmtUnfSF + BedroomAbvGr +  ScreenPorch  , data = train.dt)

summary(model)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars + 
##     GarageArea + TotalBsmtSF + X1stFlrSF + TotRmsAbvGrd + YearBuilt + 
##     YearRemodAdd + Fireplaces + BsmtFinSF1 + WoodDeckSF + X2ndFlrSF + 
##     OpenPorchSF + HalfBath + LotArea + BsmtFullBath + BsmtUnfSF + 
##     BedroomAbvGr + ScreenPorch, data = train.dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -513055  -16892   -1319   14475  297422 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.057e+06  1.215e+05  -8.703  < 2e-16 ***
## OverallQual   1.919e+04  1.177e+03  16.298  < 2e-16 ***
## GrLivArea     1.462e+01  2.017e+01   0.725 0.468876    
## GarageCars    9.433e+03  2.929e+03   3.220 0.001308 ** 
## GarageArea    1.036e+01  9.935e+00   1.043 0.297272    
## TotalBsmtSF   1.245e+01  7.221e+00   1.724 0.084876 .  
## X1stFlrSF     3.192e+01  2.061e+01   1.549 0.121638    
## TotRmsAbvGrd  4.258e+03  1.223e+03   3.481 0.000515 ***
## YearBuilt     2.171e+02  4.821e+01   4.503 7.24e-06 ***
## YearRemodAdd  2.842e+02  6.172e+01   4.605 4.49e-06 ***
## Fireplaces    5.142e+03  1.809e+03   2.843 0.004536 ** 
## BsmtFinSF1    1.304e+01  6.239e+00   2.090 0.036820 *  
## WoodDeckSF    2.987e+01  8.163e+00   3.660 0.000261 ***
## X2ndFlrSF     2.798e+01  2.039e+01   1.372 0.170294    
## OpenPorchSF   4.942e-01  1.556e+01   0.032 0.974666    
## HalfBath     -9.331e+02  2.547e+03  -0.366 0.714130    
## LotArea       4.799e-01  1.037e-01   4.628 4.03e-06 ***
## BsmtFullBath  4.932e+03  2.530e+03   1.949 0.051436 .  
## BsmtUnfSF     3.463e-02  6.326e+00   0.005 0.995633    
## BedroomAbvGr -7.586e+03  1.707e+03  -4.443 9.54e-06 ***
## ScreenPorch   5.942e+01  1.761e+01   3.375 0.000757 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36090 on 1439 degrees of freedom
## Multiple R-squared:  0.7964, Adjusted R-squared:  0.7936 
## F-statistic: 281.5 on 20 and 1439 DF,  p-value: < 2.2e-16
par(mfrow=c(2,1))
hist(model$residuals, breaks=60, main = "Histogram of Residuals", xlab= "")
qqnorm(model$residuals)
qqline(model$residuals)

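The R-squared value is 0.7964, which means that the model explains 79.64 percent of the variation in SalePrice. The residuals are centered near zero (median -1319), but the extremes (minimum -513055 and maximum 297422, against a residual standard error of 36090) point to heavier tails than a normal distribution, so the normality assumption holds only approximately. Several predictors, including GrLivArea, GarageArea and OpenPorchSF, are not individually significant in the presence of the others.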

test.dt<-read.csv('https://raw.githubusercontent.com/Lidiia25/Data605_Final_Problem2/master/test.csv?token=Ac3_PjYkV8Ye1x60BeCC-8mqFmnuEmUiks5cHyU_wA%3D%3D', header=TRUE)
SalesPred <- predict(model, test.dt) 
par(mfrow=c(1,2))
hist(SalesPred, breaks=40, main = 'Predicted Sales Prices from test data')
hist(train.dt$SalePrice, breaks=50, xlim=c(0, 600000), main = 'Sales Prices from train data')

summary(SalesPred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1139  127385  167504  177967  221957  632904       3
summary(train.dt$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Report your Kaggle.com user name and score

kaggle <- data.frame( Id = test.dt[,"Id"],  SalePrice =SalesPred)
kaggle[kaggle<0] <- 0
kaggle <- replace(kaggle,is.na(kaggle),0)
write.csv(kaggle, file="kaggle.csv", row.names = FALSE)
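
The competition scores submissions by the root-mean-squared error between the logarithm of the predicted price and the logarithm of the actual price, so a predicted price of 0 is likely to be penalized very heavily. A possible alternative, shown here only as a sketch and not what was submitted, is to fall back to the median of the valid predictions. The vector `pred_toy` is hypothetical.

```r
# Hypothetical predictions with one negative and one missing value
pred_toy <- c(150000, -1139, NA, 210000, 98000)

# Median of the usable (non-missing, positive) predictions
fallback <- median(pred_toy[!is.na(pred_toy) & pred_toy > 0])

# Replace missing or non-positive predictions with the fallback
needs_fix <- is.na(pred_toy) | pred_toy <= 0
pred_clean <- ifelse(needs_fix, fallback, pred_toy)
pred_clean
```

Whether this would improve the reported score is untested; it only avoids submitting prices that the log-based metric cannot handle sensibly.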

Kaggle Score Results

Username: Lidiia T

Score : 0.64354

Because I excluded the categorical variables and did not check all of the numerical variables for zero and NA values, I didn’t get a high score on http://kaggle.com. There are also some outliers, which could distort the fitted model. If I revise the model along these lines, I may get a better score.