Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.
X <- c(9.3, 7.4, 9.5, 9.3, 4.1, 6.4, 3.7, 12.4, 22.4, 8.5, 11.7, 19.9, 9.1, 9.5, 7.4, 6.9,
15.8, 11.8, 5.3, -1, 7.1, 8.8, 7.4, 10.6, 15.9, 8.4, 7.4, 6.4, 6.9, 5.1, 8.6, 10.6,
16, 11.4, 9.1, 1.2, 6.7, 15.1, 11.4, 7.7, 8.2, 12.6, 8.4, 15.5, 16, 8, 7.3, 6.9,
6.4, 10.3, 11.3, 13.7, 11.8, 10.4, 4.4, 3.7, 3.5, 9.5, 9.3, 4.4, 21.7, 9.5, 10.9, 11.5,
12.2, 15.1, 10.9, 4.2, 9.3, 6.6, 7.7, 13.9, 8, 15.4, 7.7, 12.9, 6.2, 8.2, 11.5, 1.2)
Y <- c(20.3, 20.8, 28.4, 20.2, 19.1, 14.6, 21.5, 18.6, 15.2, 16.2, 21.3, 26.9, 13.8, 14.7, 25.2, 26,
22.3, 13.1, 21.4, 16.8, 19.3, 18, 20.8, 22.6, 20.9, 15.7, 15.1, 16.3, 18.8, 15.3, 22.5, 26.8,
17.6, 10.3, 20.8, 20.2, 20.9, 7.3, 22.2, 11.4, 18.4, 16.3, 17.8, 19.9, 20.9, 12.6, 21.1, 19.7,
20.8, 14.9, 23, 21.7, 22, 19.4, 21.6, 23.6, 10.3, 11.5, 26.4, 15.5, 18.6, 13, 21.7, 22.7,
28.7, 14.8, 17.4, 20.9, 23.5, 13.5, 21.8, 24, 26.3, 12.2, 21.6, 26.5, 28.1, 11.8, 22.5, 21.7)
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
Q3_X <- quantile(X, 0.75)
Q1_Y <- quantile(Y, 0.25)
Q3_X
## 75%
## 11.55
Q1_Y
## 25%
## 15.65
- P(X>x | Y>y)
The conditional probability of X>x given Y>y is
\(P(X>x \mid Y>y) = \frac{P(X>x,\ Y>y)}{P(Y>y)}\)
(length(X[X > Q3_X & Y > Q1_Y])/length(X)) / (length(Y[Y > Q1_Y])/ length(Y))
## [1] 0.25
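Equivalently, since the 1/n factors cancel, the conditional probability is just a ratio of counts (a more compact sketch of the same computation):
sum(X > Q3_X & Y > Q1_Y) / sum(Y > Q1_Y)
## [1] 0.25
Interpretation: given that Y is above its 1st quartile, there is a 25% chance that X is above its 3rd quartile.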
- P(X>x, Y>y)
\(P(X>x,\ Y>y) = P(X>x \mid Y>y)\, P(Y>y)\)
(length(X[X > Q3_X & Y > Q1_Y])/length(X))
## [1] 0.1875
Interpretation: 18.75% of all observations have X above its 3rd quartile and, simultaneously, Y above its 1st quartile.
- P(X<x | Y>y)
\(P(X<x \mid Y>y) = \frac{P(X<x,\ Y>y)}{P(Y>y)}\)
(length(X[X < Q3_X & Y > Q1_Y])/length(X)) / (length(Y[Y > Q1_Y])/ length(Y))
## [1] 0.75
Interpretation: given that Y is above its 1st quartile, there is a 75% chance that X is below its 3rd quartile.
In addition, make a table of counts as shown below.
# X<=x, Y<=y
row1_col1 <- length(X[X <= Q3_X & Y <= Q1_Y])
# X>x, Y<=y
row1_col2 <- length(X[X > Q3_X & Y <= Q1_Y])
# X<=x, Y>y
row2_col1 <- length(X[X <= Q3_X & Y > Q1_Y])
# X>x, Y>y
row2_col2 <- length(X[X > Q3_X & Y > Q1_Y])
count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
count[3,] = count[1,] + count[2,]
count[,3] = count[,1] + count[,2]
names(count) <- c('X<=x', 'X>x', 'Total')
rownames(count) <- c('Y<=y', 'Y>y', 'Total')
count
## X<=x X>x Total
## Y<=y 15 5 20
## Y>y 45 15 60
## Total 60 20 80
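The same table can be built more compactly with table() and addmargins() (an alternative sketch; the labels are chosen to match the table above):
addmargins(table(ifelse(Y > Q1_Y, "Y>y", "Y<=y"), ifelse(X > Q3_X, "X>x", "X<=x")))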
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
A <- 20 # observations with X > x (the column total for X>x in the table)
B <- 60 # observations with Y > y (the row total for Y>y)
# joint probability from the table: 15 of 80 observations have both X>x and Y>y
P_AB <- 15/80
PA_PB <- (A/80) * (B/80)
P_AB
## [1] 0.1875
PA_PB
## [1] 0.1875
\(P(AB) = P(A)P(B) = 0.1875\)
By the multiplication check, A and B behave as independent events: the joint probability equals the product of the marginal probabilities exactly.
H_0: \(X>x\) and \(Y>y\) are independent
H_a: \(X>x\) and \(Y>y\) are not independent
\(\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}\)
# expected count = (row total) * (column total) / (grand total)
row1_col1 <- count[1,3] * count[3,1] / count[3,3]
row1_col2 <- count[1,3] * count[3,2] / count[3,3]
row2_col1 <- count[2,3] * count[3,1] / count[3,3]
row2_col2 <- count[2,3] * count[3,2] / count[3,3]
expected_count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
expected_count
## c.row1_col1..row2_col1. c.row1_col2..row2_col2.
## 1 15 5
## 2 45 15
diff_sq <- (count[1:2, 1:2] - expected_count)^2 / expected_count
diff_sq
## c.row1_col1..row2_col1. c.row1_col2..row2_col2.
## 1 0 0
## 2 0 0
chi_square <- sum(diff_sq)
chi_square
## [1] 0
p_value <- pchisq(chi_square, 1, lower.tail = FALSE)
p_value
## [1] 1
With a chi-square statistic of 0 the p-value is 1, so \(H_0\) cannot be rejected: the observed counts match the expected counts exactly, and there is no evidence of association between \(X>x\) and \(Y>y\). This agrees with the multiplication check above. Note, however, that splitting the data at quantiles does not by itself make the underlying X and Y independent; the test only addresses the dichotomized variables.
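As a cross-check, the built-in chi-square test of association gives the same answer (a sketch; correct = FALSE turns off the Yates continuity correction so the statistic matches the hand computation above):
chisq.test(as.matrix(count[1:2, 1:2]), correct = FALSE)
Because observed and expected counts coincide, this returns X-squared = 0 with a p-value of 1.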
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Below is a brief look at the training data; the full field definitions are in the competition’s data description file.
train.dt<-read.csv('https://raw.githubusercontent.com/Lidiia25/Data605_Final_Problem2/master/train.csv?token=Ac3_PlkWy_zjurpLjUxOWZj4UvMrBCdiks5cHachwA%3D%3D', header=TRUE)
head(train.dt)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
summary(train.dt$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
hist(train.dt$SalePrice, main="Sale Price")
qqnorm(train.dt$SalePrice)
qqline(train.dt$SalePrice)
pairs(~SalePrice + TotalBsmtSF + YearBuilt + LotFrontage + GrLivArea + GarageArea,
      data = train.dt, main = "Scatterplot Matrix")
Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypothesis that the correlation between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
selected <- data.frame(train.dt$SalePrice,train.dt$TotalBsmtSF,train.dt$GrLivArea)
cor_matrix <- cor(selected) # named cor_matrix to avoid masking base::matrix
cor_matrix
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 1.0000000 0.6135806
## train.dt.TotalBsmtSF 0.6135806 1.0000000
## train.dt.GrLivArea 0.7086245 0.4548682
## train.dt.GrLivArea
## train.dt.SalePrice 0.7086245
## train.dt.TotalBsmtSF 0.4548682
## train.dt.GrLivArea 1.0000000
cor.test(train.dt$GrLivArea, train.dt$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.dt$GrLivArea and train.dt$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
In the result above:
t is the t-test statistic (t = 38.348), df is the degrees of freedom (df = 1458), and the p-value is below 2.2e-16. conf.int is the 80% confidence interval for the correlation coefficient (0.69 to 0.72), and the sample estimate is the correlation coefficient itself (r = 0.71).
cor.test(train.dt$TotalBsmtSF, train.dt$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.dt$TotalBsmtSF and train.dt$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
In the result above:
t is the t-test statistic (t = 29.671), df is the degrees of freedom (df = 1458), and the p-value is below 2.2e-16. conf.int is the 80% confidence interval for the correlation coefficient (0.59 to 0.63), and the sample estimate is the correlation coefficient itself (r = 0.61).
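For completeness, the third pairwise test, between TotalBsmtSF and GrLivArea, runs the same way (output omitted here; from the correlation matrix above the sample correlation is 0.45, which with n = 1460 is likewise significant at any conventional level):
cor.test(train.dt$TotalBsmtSF, train.dt$GrLivArea, conf.level = 0.80)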
The p-values of all three tests are far below the 0.05 significance level, so we conclude that each pair of variables is significantly correlated. Familywise error is a legitimate concern when several tests are run at once: with three tests at the 80% confidence level (\(\alpha = 0.20\)), the chance of at least one false positive could approach \(1 - 0.8^3 \approx 0.49\) if the tests were independent. Here, though, the p-values are so small that they survive any standard correction (e.g., Bonferroni), so the conclusions are not in doubt.
Linear Algebra and Correlation. Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
precision_matrix <- solve(cor_matrix)
precision_matrix
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 2.5582310 -0.93946422
## train.dt.TotalBsmtSF -0.9394642 1.60588442
## train.dt.GrLivArea -1.3854927 -0.06473842
## train.dt.GrLivArea
## train.dt.SalePrice -1.38549273
## train.dt.TotalBsmtSF -0.06473842
## train.dt.GrLivArea 2.01124151
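The diagonal entries of the precision matrix are variance inflation factors, \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing variable \(j\) on the other two. A quick sketch of that check for SalePrice, using the selected data frame from above:
r2 <- summary(lm(train.dt.SalePrice ~ ., data = selected))$r.squared
1 / (1 - r2) # matches the first diagonal entry, about 2.558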
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
matrix_2 <- cor_matrix%*% precision_matrix
matrix_2
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 1 1.387779e-17
## train.dt.TotalBsmtSF 0 1.000000e+00
## train.dt.GrLivArea 0 5.551115e-17
## train.dt.GrLivArea
## train.dt.SalePrice 0.000000e+00
## train.dt.TotalBsmtSF 1.110223e-16
## train.dt.GrLivArea 1.000000e+00
The product is the identity matrix, up to floating-point rounding, as expected for a matrix multiplied by its inverse. Next, multiply the precision matrix by the correlation matrix:
matrix_3 <-precision_matrix%*%cor_matrix
matrix_3
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 1.000000e+00 0
## train.dt.TotalBsmtSF 2.775558e-17 1
## train.dt.GrLivArea 0.000000e+00 0
## train.dt.GrLivArea
## train.dt.SalePrice 0.000000e+00
## train.dt.TotalBsmtSF 6.938894e-17
## train.dt.GrLivArea 1.000000e+00
Conduct LU decomposition on the matrix.
# LDU decomposition via Gaussian elimination: returns L, D, U with A = L %*% D %*% U
solver_func <- function(A){
  rows <- columns <- dim(A)[1]
  U <- A
  L <- D <- diag(rows)
  # forward elimination: store the multipliers in L and reduce U row by row
  for (j in 1:(columns - 1)){
    for (i in (j + 1):rows){
      L[i, j] <- U[i, j] / U[j, j]
      U[i, ] <- U[i, ] - U[j, ] * L[i, j]
    }
  }
  # factor the pivots out of U into the diagonal matrix D
  diag(D) <- diag(U)
  for (l in 1:rows){
    U[l, ] <- U[l, ] / U[l, l] # leave U with a unit diagonal
  }
  LDU <- list("Lower matrix" = L, "Diagonal matrix" = D, "Upper matrix" = U)
  return(LDU)
}
solver_func(matrix_3)
## $`Lower matrix`
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0 0
## [2,] 2.775558e-17 1 0
## [3,] 0.000000e+00 0 1
##
## $`Diagonal matrix`
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## $`Upper matrix`
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 1 0
## train.dt.TotalBsmtSF 0 1
## train.dt.GrLivArea 0 0
## train.dt.GrLivArea
## train.dt.SalePrice 0.000000e+00
## train.dt.TotalBsmtSF 6.938894e-17
## train.dt.GrLivArea 1.000000e+00
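Because matrix_3 is numerically the identity, its decomposition above is trivially three identity factors. A more informative check (an aside, applying the same function to the correlation matrix itself) is to confirm that the factors multiply back to the original:
ldu <- solver_func(cor_matrix)
ldu$`Lower matrix` %*% ldu$`Diagonal matrix` %*% ldu$`Upper matrix` # reproduces cor_matrix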
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed-form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, and shift it so that the minimum value is strictly above zero if necessary. GrLivArea, whose distribution has a long right tail, is selected here.
min(train.dt$GrLivArea)
## [1] 334
The minimum is 334 square feet, already above zero, so no shift is needed. Next, load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html.)
library(MASS)
exp_prob <- fitdistr(train.dt$GrLivArea, "exponential")
exp_prob
## rate
## 6.598640e-04
## (1.726943e-05)
Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).
lambda <- exp_prob$estimate
lambda
## rate
## 0.000659864
set.seed(605) # arbitrary seed, so the simulated sample is reproducible
samples <- rexp(1000, lambda)
Plot a histogram and compare it with a histogram of your original variable.
par(mfrow = c(1, 2))
hist(samples, breaks = 50, xlim = c(0, 6000), main = "Exponential GrLivArea")
hist(train.dt$GrLivArea, breaks = 50, main = "Original GrLivArea")
The simulated exponential values pile up near zero and decay steadily, while the original GrLivArea is unimodal around 1,500 square feet: the exponential captures the right skew but not the shape around the mode.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
qexp(.05,rate = lambda)
## [1] 77.73313
qexp(.95,rate = lambda)
## [1] 4539.924
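These values follow from inverting the exponential CDF \(F(t) = 1 - e^{-\lambda t}\): setting \(F(t_p) = p\) gives \(t_p = -\ln(1 - p)/\lambda\), so \(t_{0.05} = -\ln(0.95)/\lambda \approx 77.7\) and \(t_{0.95} = -\ln(0.05)/\lambda \approx 4539.9\), matching qexp above.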
Also generate a 95% confidence interval from the empirical data, assuming normality.
The confidence interval for the mean is based on the sample mean and standard deviation. Assuming normality, its formula is
\(\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}\)
where \(z = 1.96\) for a 95% confidence level.
z <- 1.96
xbar <- mean(train.dt$GrLivArea)
s <- sd(train.dt$GrLivArea)
n <- length(train.dt$GrLivArea)
upper <- xbar + z * s / sqrt(n)
lower <- xbar - z * s / sqrt(n)
lower
## [1] 1488.509
upper
## [1] 1542.419
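The same interval can be cross-checked with the built-in t.test (a sketch; with n = 1460 the t-based and z-based intervals agree to well under one square foot):
t.test(train.dt$GrLivArea, conf.level = 0.95)$conf.int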
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(train.dt$GrLivArea, .05)
## 5%
## 848
quantile(train.dt$GrLivArea, .95)
## 95%
## 2466.1
The empirical 5th and 95th percentiles (848 and 2466 square feet) differ sharply from those of the fitted exponential (77.7 and 4539.9): the exponential puts far too much mass near zero and in the extreme right tail. Consistent with the histogram comparison above, GrLivArea is right-skewed but not exponential; a log-normal or gamma distribution would likely fit better.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
only_numeric <- sapply(train.dt, is.numeric) # keep only the numeric columns
train_data <- train.dt[, only_numeric]
head(train_data)
## Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
## 1 1 60 65 8450 7 5 2003
## 2 2 20 80 9600 6 8 1976
## 3 3 60 68 11250 7 5 2001
## 4 4 70 60 9550 7 5 1915
## 5 5 60 84 14260 8 5 2000
## 6 6 50 85 14115 5 5 1993
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 2003 196 706 0 150 856
## 2 1976 0 978 0 284 1262
## 3 2002 162 486 0 434 920
## 4 1970 0 216 0 540 756
## 5 2000 350 655 0 490 1145
## 6 1995 0 732 0 64 796
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1 856 854 0 1710 1 0
## 2 1262 0 0 1262 0 1
## 3 920 866 0 1786 1 0
## 4 961 756 0 1717 1 0
## 5 1145 1053 0 2198 1 0
## 6 796 566 0 1362 1 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 1 2 1 3 1 8 0
## 2 2 0 3 1 6 1
## 3 2 1 3 1 6 1
## 4 1 0 3 1 7 1
## 5 2 1 4 1 9 1
## 6 1 1 1 1 5 0
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch
## 1 2003 2 548 0 61 0
## 2 1976 2 460 298 0 0
## 3 2001 2 608 0 42 0
## 4 1998 3 642 0 35 272
## 5 2000 3 836 192 84 0
## 6 1993 2 480 40 30 0
## X3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
## 1 0 0 0 0 2 2008 208500
## 2 0 0 0 0 5 2007 181500
## 3 0 0 0 0 9 2008 223500
## 4 0 0 0 0 2 2006 140000
## 5 0 0 0 0 12 2008 250000
## 6 320 0 0 700 10 2009 143000
cor_df <- as.data.frame(cor(train_data)) # named cor_df to avoid masking base::cor
cor_df <- cor_df[order(cor_df$SalePrice, decreasing = TRUE), ]
cor_saleprice <- cor_df$SalePrice
names(cor_saleprice) <- rownames(cor_df)
cor_saleprice
## SalePrice OverallQual GrLivArea GarageCars GarageArea
## 1.00000000 0.79098160 0.70862448 0.64040920 0.62343144
## TotalBsmtSF X1stFlrSF FullBath TotRmsAbvGrd YearBuilt
## 0.61358055 0.60585218 0.56066376 0.53372316 0.52289733
## YearRemodAdd Fireplaces BsmtFinSF1 WoodDeckSF X2ndFlrSF
## 0.50710097 0.46692884 0.38641981 0.32441344 0.31933380
## OpenPorchSF HalfBath LotArea BsmtFullBath BsmtUnfSF
## 0.31585623 0.28410768 0.26384335 0.22712223 0.21447911
## BedroomAbvGr ScreenPorch PoolArea MoSold X3SsnPorch
## 0.16821315 0.11144657 0.09240355 0.04643225 0.04458367
## BsmtFinSF2 BsmtHalfBath MiscVal Id LowQualFinSF
## -0.01137812 -0.01684415 -0.02118958 -0.02191672 -0.02560613
## YrSold OverallCond MSSubClass EnclosedPorch KitchenAbvGr
## -0.02892259 -0.07785589 -0.08428414 -0.12857796 -0.13590737
## LotFrontage MasVnrArea GarageYrBlt
## NA NA NA
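The NA entries for LotFrontage, MasVnrArea, and GarageYrBlt arise because those columns contain missing values, and cor() returns NA for them by default; this is presumably why they are left out of the model below. A sketch of one way to rank them anyway:
cor(train_data, use = "pairwise.complete.obs")[, "SalePrice"]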
model <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea + TotalBsmtSF +
            X1stFlrSF + TotRmsAbvGrd + YearBuilt + YearRemodAdd + Fireplaces + BsmtFinSF1 +
            WoodDeckSF + X2ndFlrSF + OpenPorchSF + HalfBath + LotArea + BsmtFullBath +
            BsmtUnfSF + BedroomAbvGr + ScreenPorch, data = train.dt)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## GarageArea + TotalBsmtSF + X1stFlrSF + TotRmsAbvGrd + YearBuilt +
## YearRemodAdd + Fireplaces + BsmtFinSF1 + WoodDeckSF + X2ndFlrSF +
## OpenPorchSF + HalfBath + LotArea + BsmtFullBath + BsmtUnfSF +
## BedroomAbvGr + ScreenPorch, data = train.dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513055 -16892 -1319 14475 297422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.057e+06 1.215e+05 -8.703 < 2e-16 ***
## OverallQual 1.919e+04 1.177e+03 16.298 < 2e-16 ***
## GrLivArea 1.462e+01 2.017e+01 0.725 0.468876
## GarageCars 9.433e+03 2.929e+03 3.220 0.001308 **
## GarageArea 1.036e+01 9.935e+00 1.043 0.297272
## TotalBsmtSF 1.245e+01 7.221e+00 1.724 0.084876 .
## X1stFlrSF 3.192e+01 2.061e+01 1.549 0.121638
## TotRmsAbvGrd 4.258e+03 1.223e+03 3.481 0.000515 ***
## YearBuilt 2.171e+02 4.821e+01 4.503 7.24e-06 ***
## YearRemodAdd 2.842e+02 6.172e+01 4.605 4.49e-06 ***
## Fireplaces 5.142e+03 1.809e+03 2.843 0.004536 **
## BsmtFinSF1 1.304e+01 6.239e+00 2.090 0.036820 *
## WoodDeckSF 2.987e+01 8.163e+00 3.660 0.000261 ***
## X2ndFlrSF 2.798e+01 2.039e+01 1.372 0.170294
## OpenPorchSF 4.942e-01 1.556e+01 0.032 0.974666
## HalfBath -9.331e+02 2.547e+03 -0.366 0.714130
## LotArea 4.799e-01 1.037e-01 4.628 4.03e-06 ***
## BsmtFullBath 4.932e+03 2.530e+03 1.949 0.051436 .
## BsmtUnfSF 3.463e-02 6.326e+00 0.005 0.995633
## BedroomAbvGr -7.586e+03 1.707e+03 -4.443 9.54e-06 ***
## ScreenPorch 5.942e+01 1.761e+01 3.375 0.000757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36090 on 1439 degrees of freedom
## Multiple R-squared: 0.7964, Adjusted R-squared: 0.7936
## F-statistic: 281.5 on 20 and 1439 DF, p-value: < 2.2e-16
par(mfrow=c(2,1))
hist(model$residuals, breaks=60, main = "Histogram of Residuals", xlab= "")
qqnorm(model$residuals)
qqline(model$residuals)
The multiple R-squared is 0.7964, meaning the model explains about 80 percent of the variation in SalePrice. The residuals are centered near zero and roughly symmetric, but the Q-Q plot shows heavy tails driven by a few extreme observations, so the normality assumption holds only approximately.
test.dt<-read.csv('https://raw.githubusercontent.com/Lidiia25/Data605_Final_Problem2/master/test.csv?token=Ac3_PjYkV8Ye1x60BeCC-8mqFmnuEmUiks5cHyU_wA%3D%3D', header=TRUE)
SalesPred <- predict(model, test.dt)
par(mfrow=c(1,2))
hist(SalesPred, breaks=40, main = 'Predicted Sales Prices from test data')
hist(train.dt$SalePrice, breaks=50, xlim=c(0, 600000), main = 'Sales Prices from train data')
summary(SalesPred)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1139 127385 167504 177967 221957 632904 3
summary(train.dt$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Report your Kaggle.com user name and score
kaggle <- data.frame(Id = test.dt[, "Id"], SalePrice = SalesPred)
kaggle[kaggle < 0] <- 0 # clip negative predictions to zero
kaggle <- replace(kaggle, is.na(kaggle), 0) # fill the 3 NA predictions with zero
write.csv(kaggle, file = "kaggle.csv", row.names = FALSE)
Username: Lidiia T
Score : 0.64354
Because I excluded the categorical variables and did not handle zero and NA values among the numeric predictors, I did not get a high score on kaggle.com. A few extreme outliers may also be distorting the fit. Addressing these issues, for example by imputing missing values as sketched below, would likely improve the score.
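A minimal sketch of such an imputation, assuming median-filling is acceptable for the numeric predictors (the loop and column handling are illustrative, not part of the submitted model):
num_cols <- names(train.dt)[sapply(train.dt, is.numeric)]
for (v in num_cols) {
  med <- median(train.dt[[v]], na.rm = TRUE) # training-set median
  train.dt[[v]][is.na(train.dt[[v]])] <- med
  if (v %in% names(test.dt)) test.dt[[v]][is.na(test.dt[[v]])] <- med # same value for test
}
Refitting the model on the imputed data would then produce predictions with no NAs on the test set.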