You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Pick one of the quantitative independent variables from the training data set (train.csv) , and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the next analysis.
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 4th quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities.

library(ggplot2)
library(MASS)
library(Metrics)

## Warning: package 'Metrics' was built under R version 3.3.3

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.3.3

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

df <- read.csv('https://raw.githubusercontent.com/bkreis84/Math/master/train.csv')

#Independent: 1st floor square feet
X <- df$X1stFlrSF
# dependent: Sale Price
Y <- df$SalePrice

df2 <- data.frame(X,Y)

#The assignment asks for the 4th quartile of X, but that is the maximum (100%), so P(X > x) would be 0. I used the 3rd quartile instead.
x <- quantile(X, 0.75)
x

##     75% 
## 1391.25

# 2nd quartile (median) of Y
y <- quantile(Y,0.5)
y

##    50% 
## 163000

#P(A|B) Probability of A given that B has occurred is equal to P(A,B)/P(B)


#A.P(X>x|Y>y)
df2$'P(X>x & Y>y)' <- ifelse(df2$X > x & df2$Y > y,1,0)

df2$'Y>y' <- ifelse(df2$Y > y,1,0)

#P(X,Y)/P(Y)
(sum(df2$`P(X>x & Y>y)`)/nrow(df2))/(sum(df2$`Y>y`)/nrow(df2))

## [1] 0.4299451

#B. P(X>x & Y>y)
sum(df2$`P(X>x & Y>y)`)/nrow(df2)

## [1] 0.2143836

#C. P(X<x | Y>y)
df2$'P(X<x & Y>y)' <- ifelse(X < x & Y > y,1,0)
#P(X,Y)/P(Y)
(sum(df2$`P(X<x & Y>y)`)/nrow(df2))/(sum(df2$`Y>y`)/nrow(df2))

## [1] 0.5700549
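
The same three probabilities can be computed more compactly from logical vectors, since the mean of a logical vector is the proportion of TRUE values. The sketch below uses simulated stand-in data (the names `X_sim`, `Y_sim`, `A` and `B` are hypothetical and not part of the analysis above), so the numbers will differ from the ones reported.

```r
# Simulated stand-ins for the two variables (hypothetical data, not train.csv)
set.seed(1)
X_sim <- rnorm(1000, mean = 1160, sd = 380)
Y_sim <- 100 * X_sim + rnorm(1000, sd = 50000)

x_cut <- quantile(X_sim, 0.75)  # 3rd quartile of X
y_cut <- quantile(Y_sim, 0.50)  # median of Y

A <- X_sim > x_cut  # event X > x
B <- Y_sim > y_cut  # event Y > y

p_joint         <- mean(A & B)   # P(X>x & Y>y)
p_a_given_b     <- mean(A[B])    # P(X>x | Y>y): share of A among rows where B holds
p_not_a_given_b <- mean(!A[B])   # P(X<=x | Y>y): complement of the line above

c(joint = p_joint, a_given_b = p_a_given_b, not_a_given_b = p_not_a_given_b)
```

Subsetting with `A[B]` conditions on B directly, which is equivalent to dividing the joint probability by P(B).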

Does splitting the training data in this fashion make them independent? In other words, does P(X|Y)=P(X)P(Y)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this.

Independence can be checked mathematically by testing whether P(A|B) = P(A). If the probability of A given that B has occurred equals the unconditional probability of A, then B has no effect on A and the two are independent. The check below gives P(X>x|Y>y) = 0.43 versus P(X>x) = 0.25, so the variables are not independent.

#P(X>x|Y>y) = P(X>x)

#P(X>x|Y>y)
(sum(df2$`P(X>x & Y>y)`)/nrow(df2))/(sum(df2$`Y>y`)/nrow(df2))

## [1] 0.4299451

#P(X>x)
df2$'P(X>x)' <- ifelse(df2$X > x ,1,0)
sum(df2$'P(X>x)')/nrow(df2)

## [1] 0.25

#contingency table
tbl <- table(df2$Y > y, df2$X >x)

# Chi square test
chisq.test(tbl)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl
## X-squared = 248.85, df = 1, p-value < 2.2e-16

Because the p-value (< 2.2e-16) is far below 0.05, we reject the null hypothesis that the two indicators are independent. The data support the alternative: whether X exceeds x is associated with whether Y exceeds y.
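
An equivalent way to look at independence is to compare the joint probability with the product of the marginals, which is also what the chi-square test does through its expected counts. This is a minimal sketch on simulated data (all names ending in `_sim` are hypothetical), not a rerun of the analysis above.

```r
set.seed(2)
x_sim <- rnorm(500)
y_sim <- 0.6 * x_sim + rnorm(500, sd = 0.8)  # built to depend on x_sim

A <- x_sim > quantile(x_sim, 0.75)
B <- y_sim > quantile(y_sim, 0.50)

# Under independence, P(A & B) would equal P(A) * P(B)
p_joint   <- mean(A & B)
p_product <- mean(A) * mean(B)
c(joint = p_joint, product = p_product)

# The chi-square test compares observed counts with the counts
# that independence would imply (row total * column total / n)
tbl_sim  <- table(B, A)
test_sim <- chisq.test(tbl_sim)
test_sim$observed
test_sim$expected
```

The larger the gap between observed and expected counts, the larger the test statistic.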

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y.

df3 <- data.frame(X,Y)

summary(df3)

##        X              Y         
##  Min.   : 334   Min.   : 34900  
##  1st Qu.: 882   1st Qu.:129975  
##  Median :1087   Median :163000  
##  Mean   :1163   Mean   :180921  
##  3rd Qu.:1391   3rd Qu.:214000  
##  Max.   :4692   Max.   :755000

Sqmed <- median(X)
Sqmn <- mean(X)

ggplot(df3, aes(x=X)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="#56B4E9")+
  geom_density(alpha=0.5) +
  xlab("1st Floor Square Ft") +
  geom_vline(aes(xintercept=median(df3$X)),
             color="blue", linetype="dashed", size=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(df3, aes(x=Y)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="#56B4E9", binwidth = 20000)+
  geom_density(alpha=0.5) +
  xlab("Sales Price") +
  geom_vline(aes(xintercept=median(df3$Y)),
             color="blue", linetype="dashed", size=1)

ggplot(df3, aes(x=X, y=Y)) +
  geom_point(shape=1) +    
  geom_smooth(method=lm,   
              se=TRUE)

Transform both variables simultaneously using Box-Cox transformations. You might have to research this. Using the transformed variables, run a correlation analysis and interpret. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

reg <- lm(Y~X, df3)
hist(reg$resid)

qqnorm(reg$resid)
qqline(reg$resid)

b=boxcox(Y~X, data=df3, seq(-0.05,0.05,0.01))

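#The log-likelihood appears to peak near lambda = 0.04 within the range searched (-0.05 to 0.05)

#Note: the Box-Cox transform for lambda = 0.04 is (Y^0.04 - 1)/0.04. The line below raises Y to the power 1.04 instead, and the results that follow are for that variable as run.
df3$Yprime <- df3$Y^1.04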

regP <- lm(Yprime~X, df3)
hist(regP$resid)

qqnorm(regP$resid)
qqline(regP$resid)

cor(df3)

##                X         Y    Yprime
## X      1.0000000 0.6058522 0.6048023
## Y      0.6058522 1.0000000 0.9999131
## Yprime 0.6048023 0.9999131 1.0000000

#The residuals of the original model appeared roughly normal, apart from some large outliers.
#The transformation had little effect: the correlation with X changed only from 0.606 to 0.605.
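
For reference, a Box-Cox transformation with parameter lambda maps y to (y^lambda - 1)/lambda, with log(y) as the limiting case when lambda is 0. The sketch below shows one way to pick lambda from the `boxcox()` profile and apply the transform. It uses simulated data, and the helper `box_cox()` and the `_sim` variables are hypothetical names introduced only for illustration.

```r
library(MASS)

set.seed(3)
# Hypothetical right-skewed response and a predictor
x_sim <- runif(300, 500, 3000)
y_sim <- exp(10 + 0.0005 * x_sim + rnorm(300, sd = 0.3))

# Profile log-likelihood over a grid of lambda values (no plot)
bc <- boxcox(y_sim ~ x_sim, lambda = seq(-2, 2, 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Box-Cox transform; log(y) is the limiting case as lambda approaches 0
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y_bc <- box_cox(y_sim, lambda_hat)
cor(x_sim, y_bc)
```

Searching a wide grid such as -2 to 2 avoids the estimate being pinned to the edge of a narrow search range.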

Null hypothesis: The correlation between first floor square footage and sale price is 0.
Alternative hypothesis: The correlation between first floor square footage and sale price is not 0.

There is a positive correlation of 0.606 with a p-value below 2.2e-16, well under 0.05. We therefore reject the null hypothesis in favor of the alternative. The 99% confidence interval, 0.561 to 0.647, does not include 0.

cor.test(df3$X, df3$Y, conf.level = 0.99)

## 
##  Pearson's product-moment correlation
## 
## data:  df3$X and df3$Y
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.5613896 0.6468270
## sample estimates:
##       cor 
## 0.6058522
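
The interval and t statistic reported by `cor.test` can be reproduced by hand with the Fisher z-transformation. This sketch takes the sample correlation and sample size from the output above (n = df + 2 = 1460); the variable names are introduced here only for illustration.

```r
r <- 0.6058522   # sample correlation from the cor.test output
n <- 1460        # df + 2, with df = 1458

# Fisher z-transformation: atanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3)
z    <- atanh(r)
se   <- 1 / sqrt(n - 3)
crit <- qnorm(1 - 0.01 / 2)            # two-sided 99% critical value
ci   <- tanh(z + c(-1, 1) * crit * se) # back-transform to the correlation scale
ci

# t statistic for H0: correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat
```

Both results agree with the printed interval (0.5613896, 0.6468270) and t = 29.078 up to rounding.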

Linear Algebra and Correlation.

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

df3 <- subset(df3, select = c(X, Yprime))

cormat <- cor(df3)
cormat

##                X    Yprime
## X      1.0000000 0.6048023
## Yprime 0.6048023 1.0000000

prec <-solve(cormat)
prec

##                 X     Yprime
## X       1.5767543 -0.9536246
## Yprime -0.9536246  1.5767543

round(cormat %*% prec, digits = 0)

##        X Yprime
## X      1      0
## Yprime 0      1

round(prec %*% cormat, digits = 0)

##        X Yprime
## X      1      0
## Yprime 0      1

Multiplying the correlation matrix by the precision matrix, in either order, gives the identity matrix, as expected for a matrix and its inverse.
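
For two variables, the diagonal of the precision matrix has a closed form: the variance inflation factor 1/(1 - r^2). The sketch below checks this against the correlation of 0.6048023 reported above; `cormat_chk`, `prec_chk` and `vif` are names introduced only for this check.

```r
r <- 0.6048023  # correlation between X and Yprime from the output above

cormat_chk <- matrix(c(1, r, r, 1), nrow = 2)
prec_chk   <- solve(cormat_chk)

# With two variables, the VIF is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

vif             # matches the diagonal of the precision matrix
diag(prec_chk)
-r * vif        # matches the off-diagonal entries
```

These values correspond to the 1.5767543 and -0.9536246 entries in the precision matrix printed earlier.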

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, ) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.

#Our independent variable has a minimum value above 0, so I do not believe the shift is necessary.
min(df2$X)

## [1] 334

fit <- fitdistr(df2$X, densfun = 'exponential')
fit

##        rate    
##   0.0008601213 
##  (0.0000225104)

lam <- fit$estimate
lam

##         rate 
## 0.0008601213

#samples from distribution 
est <- rexp(1000, lam)

hist(est, breaks = 100)

hist(df2$X, breaks = 100)
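
For the exponential distribution, the maximum likelihood estimate of the rate is simply 1/mean, which is consistent with the fitted rate above (1/0.0008601213 is about 1163, the mean of X). The sketch below illustrates this on simulated positive data and compares summary statistics of the data with draws from the fitted distribution. The data and the `_sim` names are hypothetical.

```r
library(MASS)

set.seed(4)
# Hypothetical positive, unimodal data with mean near 1160
x_sim <- rgamma(1460, shape = 9, rate = 9 / 1160)

fit_sim  <- fitdistr(x_sim, densfun = "exponential")
rate_mle <- unname(fit_sim$estimate)

# The exponential MLE has a closed form: rate = 1 / mean
c(fitted = rate_mle, closed_form = 1 / mean(x_sim))

# Draws from the fitted distribution match the data's mean,
# but an exponential always has sd equal to its mean
draws <- rexp(1000, rate_mle)
c(mean_data = mean(x_sim), mean_draws = mean(draws))
c(sd_data = sd(x_sim), sd_draws = sd(draws))
```

Comparing the standard deviations alongside the histograms is a quick way to judge whether a one-parameter exponential is flexible enough for the variable.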

library(randomForest)

#factors are needed for the randomForest model (character columns are not interpreted)
train <- read.csv('https://raw.githubusercontent.com/bkreis84/Math/master/train.csv', stringsAsFactors = TRUE)

train$Id <- NULL

#to handle missing values the random forest package has a function to impute values
train.na <- train
train.imputed <- rfImpute(SalePrice~ ., train.na)

##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 7.803e+08    12.37 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 7.482e+08    11.86 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 7.622e+08    12.09 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 7.581e+08    12.02 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 7.556e+08    11.98 |

#setting seed ensures the same result.
set.seed(123)
id <- sample(2, nrow(train.imputed), prob = c(0.7,0.3), replace = TRUE)
trn <- train.imputed[id==1,]
tst <- train.imputed[id==2,]



#mtry=27 tells the function to try approximately 1/3rd of the predictor variables at each split.
rforest <- randomForest(SalePrice~., data = trn, mtry=27, ntree=400)
rforest

## 
## Call:
##  randomForest(formula = SalePrice ~ ., data = trn, mtry = 27,      ntree = 400) 
##                Type of random forest: regression
##                      Number of trees: 400
## No. of variables tried at each split: 27
## 
##           Mean of squared residuals: 868579042
##                     % Var explained: 87.1

importance <- importance(rforest)

#The top 5 predictor variables are overall quality (by a large margin), neighborhood, 
#above ground square feet, # of cars that fit in garage and total basement square feet. 
varImpPlot(rforest)

prediction <- predict(rforest, tst)



rmsle(prediction, tst$SalePrice)

## [1] 0.1657099
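
The RMSLE used here is the root mean squared difference between log(1 + predicted) and log(1 + actual), so it penalizes relative rather than absolute errors. A minimal base R version is sketched below; `rmsle_manual` and the example vectors are hypothetical and only illustrate the formula.

```r
# Root mean squared logarithmic error, written out in base R
rmsle_manual <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

# Hypothetical sale prices and predictions
actual    <- c(100000, 200000, 300000)
predicted <- c(110000, 190000, 330000)

rmsle_manual(actual, predicted)
```

Because the metric is symmetric in its two arguments, the order in which predictions and actual values are passed does not change the result.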

I attempted to use only the variables with higher predictive value, but found that it did not improve the model.

Modeling.

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

#Predicting on the raw test set failed with this error:
#  predict.randomForest(rforest, test) : Type of predictors in new data do not match that of the training data.
#To work around it, I combined the training and test sets, imputed missing values on the combined data, and then split them again.
#In the future I may use a different package that handles NA values a little better.
train <- read.csv('https://raw.githubusercontent.com/bkreis84/Math/master/train.csv', stringsAsFactors = TRUE)
test <- read.csv('https://raw.githubusercontent.com/bkreis84/Math/master/test.csv', stringsAsFactors = TRUE)
test$SalePrice = 0

df <- rbind(train, test)

#to handle missing values the random forest package has a function to impute values
df.na <- df
df.imputed <- rfImpute(SalePrice~ ., df.na)

##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 4.341e+08     3.83 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 4.631e+08     4.09 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 4.711e+08     4.16 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 4.709e+08     4.15 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 4.463e+08     3.94 |

trainDF <- df.imputed[1:1460,]
testDF <- df.imputed[1461:2919,]

trainDF2 <- subset(trainDF, select = -Id)
testDF2 <- subset(testDF, select = -c(Id,SalePrice))

rforest <- randomForest(SalePrice~., data = trainDF2, mtry=27, ntree=400)
rforest

## 
## Call:
##  randomForest(formula = SalePrice ~ ., data = trainDF2, mtry = 27,      ntree = 400) 
##                Type of random forest: regression
##                      Number of trees: 400
## No. of variables tried at each split: 27
## 
##           Mean of squared residuals: 774030876
##                     % Var explained: 87.73

predFinal <- predict(rforest, testDF2)

#submission format
kag <- data.frame(Id = testDF$Id, SalePrice = predFinal)

write.csv(kag, file = "Kreis.csv", row.names = FALSE)
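
One common cause of the "Type of predictors in new data do not match" error is a factor column whose levels differ between the training and test sets; that this was the cause here is an assumption. Under that assumption, an alternative to combining the two files is to re-code each test factor against the training levels. The miniature data frames below are hypothetical.

```r
# Hypothetical miniature train/test frames with one factor column
train_mini <- data.frame(Zone = factor(c("RL", "RM", "FV", "RL")),
                         Area = c(900, 1200, 1500, 1100))
test_mini  <- data.frame(Zone = factor(c("RL", "RM")),
                         Area = c(1000, 1300))

levels(test_mini$Zone)  # only the levels seen in the test data

# Re-code every factor in the test set using the training levels
for (col in names(test_mini)) {
  if (is.factor(train_mini[[col]])) {
    test_mini[[col]] <- factor(test_mini[[col]],
                               levels = levels(train_mini[[col]]))
  }
}

identical(levels(test_mini$Zone), levels(train_mini$Zone))
```

Any test value that never appears in the training data becomes NA under this approach, so missing values would still need to be handled before prediction.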

Submission

Box-Cox Transformations https://www.youtube.com/watch?v=TgVx9Rqsewo

Chi Square http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence

Final Exam

Brian Kreis

May 19, 2017
