Problem 1. Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of (N+1)/2.
#create variables: choose N = 9, so the uniform runs from 1 to 9 and the normal mean is (N+1)/2 = 5
X <- runif(10000, min = 1, max = 9)
Y <- rnorm(10000, mean = 5, sd = 5)
#x is the median of X; y is the 1st quartile of Y
x <- median(X)
y <- summary(Y)[2]
#calculate probabilities
Xgtx <- sum(X > x)/10000
Xgty <- sum(X > y)/10000
Ygtx <- sum(Y > x)/10000
Ygty <- sum(Y > y)/10000
Xltx <- sum(X < x)/10000
Xlty <- sum(X < y)/10000
Yltx <- sum(Y < x)/10000
Ylty <- sum(Y < y)/10000
Probability. Calculate as a minimum the below probabilities a through c.
Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. 5 points
a. P(X>x | X>y)
#first, find P(X>x and X>y), which is the probability that X is greater than the larger of x and y
(sum(X > max(x, y))/10000)/Xgty
## [1] 0.5457919
b. P(X>x, Y>y)
#X and Y were generated independently, so the joint probability is approximately the product of the marginals
Xgtx*Ygty
## [1] 0.375
c. P(X<x | X>y)
#find P(X<x and X>y), then divide by P(X>y)
Xltx_gty <- Xgty-Xgtx
(Xltx_gty)/Xgty
## [1] 0.4542081
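These conditional probabilities can also be computed directly with logical indexing; a quick check, assuming the X, Y, x, and y defined above:
mean(X > x & X > y) / mean(X > y)  #P(X>x | X>y)
mean(X < x & X > y) / mean(X > y)  #P(X<x | X>y)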
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
a <- c(.38, .38, .75)
b <- c(.12, .12, .25)
c <-c(.5, .5, 1)
tbl <- data.frame(rbind(a, b, c))
names(tbl) <- c("P(X>x)", "P(X<x)", "Marginal Probabilities")
row.names(tbl) <- c("P(Y>y)", "P(Y<y)", "Marginal Probabilities")
tbl
## P(X>x) P(X<x) Marginal Probabilities
## P(Y>y) 0.38 0.38 0.75
## P(Y<y) 0.12 0.12 0.25
## Marginal Probabilities 0.50 0.50 1.00
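The table entries above are hand-rounded; the joint and marginal probabilities can also be computed directly from the simulated values. A sketch, assuming the X, Y, x, and y defined above:
#joint distribution of the four quadrants, with marginal probabilities added
joint <- prop.table(table(X > x, Y > y))
addmargins(joint)
#compare the observed joint probability with the product of the marginals
mean(X > x & Y > y)
mean(X > x) * mean(Y > y)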
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
m <- rbind(a, b)
fisher.test(m)
## Warning in fisher.test(m): 'x' has been rounded to integer: Mean relative
## difference: 0.75
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(m)
## Warning in chisq.test(m): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: m
## X-squared = 0.00027031, df = 2, p-value = 0.9999
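Both tests above were run on rounded proportions (including the marginal column), which is what triggers the warnings; both tests expect a table of counts. A sketch using the raw counts, assuming the X, Y, x, and y defined above:
#2x2 contingency table of counts for the four quadrants
counts <- table(X > x, Y > y)
fisher.test(counts)
chisq.test(counts)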
Given the large p-values of both the Chi-Square test and Fisher's Exact Test, we fail to reject the null hypothesis, so it is safe to treat the variables as independent. The difference between the two is that Fisher's test computes an exact p-value for the table, while the Chi-Square test relies on a large-sample approximation (hence the warning above). Fisher's test is more appropriate here, since the contingency table's margins are fixed and the test does not depend on an approximation.
Problem 2 You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set.
train <- read.csv("train.csv")
head(train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
summary(train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
hist(train$MoSold)
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
plot(train$SalePrice~train$LotArea)
plot(train$SalePrice~train$OverallQual)
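The two plots above are individual scatterplots; an actual scatterplot matrix can be produced with pairs(). A sketch using the same variables:
#scatterplot matrix of the dependent variable and two predictors
pairs(train[, c("SalePrice", "LotArea", "OverallQual")])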
Derive a correlation matrix for any three quantitative variables in the dataset.
#correlation matrix for three quantitative variables: SalePrice, LotArea, and YearBuilt
m <- cbind(train$SalePrice, train$LotArea, train$YearBuilt)
m <- cor(m)
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
cor.test(train$SalePrice, train$LotArea, method = "pearson", conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(train$SalePrice, train$YearBuilt, method = "pearson", conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
cor.test(train$YearBuilt, train$LotArea, method = "pearson", conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train$YearBuilt and train$LotArea
## t = 0.54332, df = 1458, p-value = 0.587
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.01934322 0.04776648
## sample estimates:
## cor
## 0.01422765
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Since the p-values for the tests between sale price and lot area and between sale price and year built are well below 0.05, it is safe to say sale price is correlated with both lot size and year built. The p-value for the correlation test between lot area and year built is 0.587, so we fail to reject the null hypothesis that their correlation is zero. I would be worried about familywise error: three hypothesis tests were run, which inflates the chance of at least one false rejection, and I rejected the null hypothesis in the first two tests.
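One common way to control familywise error is to adjust the p-values, for example with a Bonferroni correction. A sketch, plugging in the p-values reported above and using 2.2e-16 as a stand-in for the two that R prints as < 2.2e-16:
#Bonferroni-adjusted p-values for the three pairwise correlation tests
pvals <- c(2.2e-16, 2.2e-16, 0.587)
p.adjust(pvals, method = "bonferroni")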
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
#Gauss-Jordan elimination: augment the correlation matrix with the 3x3 identity
I <- diag(3)
m1 <- cbind(m, I)
#clear the first column below the pivot (m1[1,1] is already 1)
m1[2,] <- (-m1[2,1])*m1[1,]+m1[2,]
m1[3,] <- (-m1[3,1])*m1[1,]+m1[3,]
#scale the second pivot to 1 and clear the rest of column 2
m1[2,] <- (1/m1[2,2])*m1[2,]
m1[1,] <- (-m1[1,2])*m1[2,] + m1[1,]
m1[3,] <- (-m1[3,2])*m1[2,] + m1[3,]
#scale the third pivot to 1 and clear the rest of column 3
m1[3,] <- (1/m1[3,3])*m1[3,]
m1[2,] <- (-m1[2,3])*m1[3,] + m1[2,]
m1[1,] <- (-m1[1,3])*m1[3,] + m1[1,]
#the right-hand block is now the precision (inverse) matrix
inv <- m1[,4:6]
m*inv
## [,1] [,2] [,3]
## [1,] 1.5132664 -0.102393843 -0.410872545
## [2,] -0.1023938 1.099729254 0.002664589
## [3,] -0.4108725 0.002664589 1.408207956
inv*m
## [,1] [,2] [,3]
## [1,] 1.5132664 -0.102393843 -0.410872545
## [2,] -0.1023938 1.099729254 0.002664589
## [3,] -0.4108725 0.002664589 1.408207956
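Note that * performs element-wise multiplication in R, which is why the products above are not the identity matrix. The matrix products the prompt asks for use %*%, and both should come out as (approximately) the 3x3 identity. A sketch, which also cross-checks the hand-computed inverse against R's built-in solve():
round(m %*% inv, 10)  #correlation matrix times precision matrix
round(inv %*% m, 10)  #precision matrix times correlation matrix
all.equal(inv, solve(m), check.attributes = FALSE)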
#LU decomposition: record the row multipliers in L while reducing U
L <- diag(3)
U <- m
#eliminate column 1 (U[1,1] = 1, so the multipliers are the entries themselves)
L[2,1] <- U[2,1]
U[2,] <- (-U[2,1])*U[1,]+U[2,]
L[3,1] <- U[3,1]
U[3,] <- (-U[3,1])*U[1,]+U[3,]
#eliminate column 2
L[3,2] <- (U[3,2]/U[2,2])
U[3,] <- (-(U[3,2]/U[2,2]))*U[2,]+U[3,]
L*U
## [,1] [,2] [,3]
## [1,] 1 0.0000000 0.0000000
## [2,] 0 0.9303867 0.0000000
## [3,] 0 0.0000000 0.7101224
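As above, * is element-wise; verifying the LU decomposition requires the matrix product, which should reproduce the original correlation matrix. A sketch:
#the product of the LU factors should recover the correlation matrix
all.equal(L %*% U, m, check.attributes = FALSE)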
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
library(MASS)
fit <- fitdistr(train$SalePrice, densfun = "exponential")
lambda <- fit$estimate
redist <- rexp(1000, lambda)
summary(redist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 86.2 52696.5 124848.7 183935.2 261845.0 1378081.3
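The prompt asks for the optimal value of \(\lambda\). For an exponential distribution the maximum-likelihood estimate is simply 1 divided by the sample mean, which serves as a sanity check on fitdistr's estimate; a sketch:
lambda                    #rate estimated by fitdistr
1/mean(train$SalePrice)   #analytic MLE; both should be roughly 1/180921, about 5.5e-06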
hist(train$SalePrice)
hist(redist)
#exponential quantiles via the inverse CDF: q = -log(1 - p)/lambda
p05 <- round(log(1-0.05)/-lambda,2)
p95 <- round(log(1-0.95)/-lambda,2)
5th percentile: 9280; 95th percentile: 541991.
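Equivalently, the exponential quantile function returns the same percentiles; a quick check:
qexp(c(0.05, 0.95), rate = lambda)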
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
CI(train$SalePrice, ci=0.95)
## upper mean lower
## 184999.6 180921.2 176842.8
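The prompt also asks for the empirical 5th and 95th percentiles of the data, which can be read off with quantile(); a sketch:
#empirical 5th and 95th percentiles of the observed sale prices
quantile(train$SalePrice, probs = c(0.05, 0.95))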
While the data is heavily skewed right, I believe an exponential model is not the correct model. The 5th percentile of the exponential fit is 9280, while the minimum observed value is 34900, so the exponential model puts far too much probability on low sale prices. In addition, the 95% confidence interval computed above is an interval for the mean sale price (assuming normality), not a range covering 95% of the observations, and it is very narrow relative to the spread of the data. A better approach would be a linear regression model, possibly after removing the most extreme outliers to reduce the effect of the skew. It might also be useful to analyze sale price by neighborhood or by home type.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
n <- lm(train$SalePrice ~ train$YearBuilt + train$LotArea)
summary(n)
##
## Call:
## lm(formula = train$SalePrice ~ train$YearBuilt + train$LotArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -258044 -37455 -14472 21135 520542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.533e+06 1.104e+05 -22.93 <2e-16 ***
## train$YearBuilt 1.366e+03 5.602e+01 24.38 <2e-16 ***
## train$LotArea 2.041e+00 1.695e-01 12.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64620 on 1457 degrees of freedom
## Multiple R-squared: 0.3392, Adjusted R-squared: 0.3383
## F-statistic: 373.9 on 2 and 1457 DF, p-value: < 2.2e-16
Since all of the p-values are well below .05, the coefficients of the model are statistically significant. However, the \(R^2\) value is low, .339, which tells us this model explains only 33.9% of the variation in sale price.
To improve the model, I will try a different set of predictors.
n <- lm(train$SalePrice ~ train$BedroomAbvGr + train$LotArea +train$WoodDeckSF + train$OpenPorchSF + train$TotRmsAbvGrd)
summary(n)
##
## Call:
## lm(formula = train$SalePrice ~ train$BedroomAbvGr + train$LotArea +
## train$WoodDeckSF + train$OpenPorchSF + train$TotRmsAbvGrd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -310247 -33687 -3075 26523 467012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938e+04 6.657e+03 4.414 1.09e-05 ***
## train$BedroomAbvGr -3.011e+04 2.604e+03 -11.562 < 2e-16 ***
## train$LotArea 1.031e+00 1.602e-01 6.435 1.67e-10 ***
## train$WoodDeckSF 1.265e+02 1.274e+01 9.928 < 2e-16 ***
## train$OpenPorchSF 2.051e+02 2.421e+01 8.473 < 2e-16 ***
## train$TotRmsAbvGrd 3.153e+04 1.362e+03 23.144 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59270 on 1454 degrees of freedom
## Multiple R-squared: 0.4453, Adjusted R-squared: 0.4433
## F-statistic: 233.4 on 5 and 1454 DF, p-value: < 2.2e-16
These variables all have low p-values, meaning they are statistically significant, but the model is still weak (\(R^2=.445\)).
plot(fitted(n), resid(n))
The residual plot suggests the model fits reasonably well for smaller fitted values, but the residuals spread out and become less well-behaved as the fitted values increase.
qqnorm(resid(n))
qqline(resid(n))
As I suspected, the residuals are close to normal in the middle of the distribution, but there are a few extreme residuals in the tails. It would be better to rerun the analysis after removing those outliers.
#drop the last training row so train and test have the same number of rows
train <- train[-1460,]
test <- read.csv("test.csv")
#make sure the predictor columns in the test set are numeric
test[,c('BedroomAbvGr', 'LotArea','WoodDeckSF','OpenPorchSF', 'TotRmsAbvGrd')] <- lapply(test[,c('BedroomAbvGr', 'LotArea','WoodDeckSF','OpenPorchSF', 'TotRmsAbvGrd')], as.numeric)
#regress the training sale prices on the test-set predictors (see the note after the score below)
t <- lm(train$SalePrice ~ test$BedroomAbvGr + test$LotArea + test$WoodDeckSF + test$OpenPorchSF + test$TotRmsAbvGrd)
preds_2<- predict(t, test)
kaggle<- data.frame(test$Id, preds_2)
colnames(kaggle)<- c('Id', 'SalePrice')
write.csv(kaggle, 'submission.csv', row.names = FALSE)
Kaggle name: msiracusa. Kaggle score: 0.42322. I'm not sure how to interpret the score, but that is what I got!
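For reference, the more standard workflow is to fit the model on the training data with a formula and a data argument, then score the test set with predict(newdata = ...). A sketch under those assumptions (it may produce a different leaderboard score than the one reported above):
#fit on the training data using column names rather than full-length vectors
fit <- lm(SalePrice ~ BedroomAbvGr + LotArea + WoodDeckSF + OpenPorchSF + TotRmsAbvGrd, data = train)
#predict on the test set; any missing predictor values would need to be imputed first
preds <- predict(fit, newdata = test)
submission <- data.frame(Id = test$Id, SalePrice = preds)
write.csv(submission, "submission.csv", row.names = FALSE)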