Problem 1

Random Variables

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = \frac{N+1}{2}\).

# generating random numbers
set.seed(4242)
n <- 7
X <- runif(10000, 1, n) 
Y <- rnorm(10000, (n+1)/2, (n+1)/2)
df <- data.frame(cbind(X, Y))
#check total
total <- nrow(df)
head(df, 10)
##           X          Y
## 1  6.916686  2.2173825
## 2  3.086296  2.4859062
## 3  2.379411  1.3856085
## 4  5.012267 -0.2024224
## 5  5.216723  2.7324347
## 6  6.607061  8.4151991
## 7  5.927274  3.3911077
## 8  2.011771  2.3692095
## 9  4.970623  4.0439280
## 10 6.304814  4.5068214

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

  a. P(X>x | X>y)
#Storing x and y
x = quantile(df$X, 0.5)
x
##      50% 
## 3.929004
y = quantile(df$Y, 0.25)
y
##      25% 
## 1.244764

We are asked to calculate the probability that X > 3.929 given that X > 1.245, that is, P(X > x | X > y).

We know that

\[P(A|B)=\frac{P(A \cap B)}{P(B)}\]

Using the above formula

# P(X>y)
pX_greatery <- nrow(subset(df, X > y))/total

# P(X>x & X>y); because x > y here, this is the same event as X>x
pXgreaterx_and_greatery <- nrow(subset(df, X > x & X > y))/total

p1 <- round(pXgreaterx_and_greatery / pX_greatery, 4)
print(paste0("P(X>x | X>y) = ", p1))
## [1] "P(X>x | Y>y) = 0.3931"

  b. P(X>x, Y>y)
# P(X>x & Y>y)
pXgreaterx_pYgreatery <- nrow(subset(df, X > x & Y > y))/total

print(paste0("P(X>x, Y>y) = ", pXgreaterx_pYgreatery ))
## [1] "P(X>x, Y>y) = 0.377"

That is, about 37.7% of the simulated pairs have X above its median and Y above its first quartile.

  c. P(X<x | X>y)
# P(X<x & X>y)
pXlessx_and_greatery <- nrow(subset(df, X < x & X > y))/total

# P(X<x | X>y)
prob2 <- round(pXlessx_and_greatery / pX_greatery, 4)
print(paste0("P(X<x | X>y) = ", prob2))
## [1] "P(X<x | X>y) = 0.4786"

Marginal and Joint Probability

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

# marginal probablity of P(X>x)
pXgreaterx <- nrow(subset(df, X > x))/total

# marginal probablity of P(Y>y)
pYgreatery <- nrow(subset(df, Y > y))/total

# joint probability of P(X>x & Y>y)
pXgreatergx_Ygreatery <- nrow(subset(df, X > x & Y > y))/total

# product of marginal probabilities
prod_marginal <- pXgreaterx*pYgreatery

#Test if P(X>x and Y>y)=P(X>x)P(Y>y)
round(pXgreatergx_Ygreatery,2)==round(prod_marginal,2)
## [1] TRUE

We found that P(X>x and Y>y) ≈ P(X>x)P(Y>y): 0.377 versus 0.375, equal to two decimal places. This is what we would expect if X and Y were independent.

Next we build the table of marginal and joint probabilities.

kable(cbind(pXgreaterx, pYgreatery, prod_marginal,  pXgreatergx_Ygreatery), col.names = c("P(X>x)", "P(Y>y)", "P(X>x)P(Y>y)", "P(X>x & Y>y)")) %>%
   kable_styling(position = "center", "striped")
P(X>x) P(Y>y) P(X>x)P(Y>y) P(X>x & Y>y)
0.5 0.75 0.375 0.377

Independence

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Creating the count table (counts of each variable above and at or below its cutoff):

greaterx <- nrow(subset(df, X > x))
greatery <- nrow(subset(df, Y > y))
lessx <- nrow(subset(df, X <= x))
lessy <- nrow(subset(df, Y <= y))
  
table <- matrix(c(greaterx, greatery, lessx, lessy), 2, 2,
                dimnames = list(c("X", "Y"),
                                c("above cutoff", "at or below cutoff")))
table
##   above cutoff at or below cutoff
## X         5000               5000
## Y         7500               2500

Running Fisher test

Ho: X and Y are independent of each other

Ha: X and Y are not independent of each other

fisher.test(table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.3137986 0.3540659
## sample estimates:
## odds ratio 
##  0.3333333

Running Chi-squared test

chisq.test(table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 1332.3, df = 1, p-value < 2.2e-16

The p-value is below 0.05 for both tests, so we reject the null hypothesis; on the basis of this table, X and Y are not independent of each other.

Fisher’s exact test is designed for small sample sizes, where it computes the exact probability of the observed table, whereas the Chi-squared test relies on a large-sample approximation and is better suited to large samples. Here we have a sample size of 10,000, so the Chi-squared test is the more appropriate choice.
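For reference, independence of the two events could also be checked by cross-tabulating them directly into a single joint contingency table and passing that table to the same tests; a minimal sketch (output not shown):

# cross-tabulate the events X > x and Y > y, then apply the same tests
joint_tab <- table(X_above_x = df$X > x, Y_above_y = df$Y > y)
fisher.test(joint_tab)
chisq.test(joint_tab)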

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.

Choosing the variables

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

#read the data
train<-read.csv("https://raw.githubusercontent.com/zahirf/Data605/master/train.csv")

There are 1,460 observations of 81 variables. I choose SalePrice as the dependent variable.

I run pairs.panels for the variables in groups, keeping SalePrice in each panel. The code below shows the first ten variables; the same was done across all 80 predictors.

train %>% 
  dplyr::select(2:11, 81) %>%
  pairs.panels(method = "pearson", hist.col = "#f44542")

The train dataset is then reduced to the variables whose absolute correlation with SalePrice is greater than 0.50.

train<-train%>%
  dplyr::select(c('SalePrice','Alley', 'OverallQual', 'YearBuilt', 'YearRemodAdd',
                'BsmtQual', 'ExterQual', 'TotalBsmtSF', 'FullBath', 'GrLivArea', 'X1stFlrSF',
                'KitchenQual', 'TotRmsAbvGrd', 'GarageFinish', 'GarageCars', 'GarageArea'))
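The screening itself can be done programmatically by correlating each numerically coded column of the full training data with SalePrice; a rough sketch of that idea (illustrative only; it re-reads the full file because train has just been overwritten):

# correlate every numerically coded column with SalePrice and keep |r| > 0.5
full <- read.csv("https://raw.githubusercontent.com/zahirf/Data605/master/train.csv")
full_num <- data.frame(data.matrix(full))   # factors become integer codes
cors <- sapply(full_num, cor, y = full_num$SalePrice, use = "pairwise.complete.obs")
sort(cors[!is.na(cors) & abs(cors) > 0.5], decreasing = TRUE)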

Now we have 15 predictors that are most strongly correlated with SalePrice. Let us run the pairs panels again, splitting the dataset in two:

train %>% 
  dplyr::select(2:9, 1) %>%
  pairs.panels(method = "pearson", hist.col = "#f44542")

train %>% 
  dplyr::select(10:16, 1) %>%
  pairs.panels(method = "pearson", hist.col = "#f44542")

Descriptive and Inferential statistics

Descriptive statistics

summary(train)
##    SalePrice       Alley       OverallQual       YearBuilt     YearRemodAdd 
##  Min.   : 34900   Grvl:  50   Min.   : 1.000   Min.   :1872   Min.   :1950  
##  1st Qu.:129975   Pave:  41   1st Qu.: 5.000   1st Qu.:1954   1st Qu.:1967  
##  Median :163000   NA's:1369   Median : 6.000   Median :1973   Median :1994  
##  Mean   :180921               Mean   : 6.099   Mean   :1971   Mean   :1985  
##  3rd Qu.:214000               3rd Qu.: 7.000   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :755000               Max.   :10.000   Max.   :2010   Max.   :2010  
##  BsmtQual   ExterQual  TotalBsmtSF        FullBath       GrLivArea   
##  Ex  :121   Ex: 52    Min.   :   0.0   Min.   :0.000   Min.   : 334  
##  Fa  : 35   Fa: 14    1st Qu.: 795.8   1st Qu.:1.000   1st Qu.:1130  
##  Gd  :618   Gd:488    Median : 991.5   Median :2.000   Median :1464  
##  TA  :649   TA:906    Mean   :1057.4   Mean   :1.565   Mean   :1515  
##  NA's: 37             3rd Qu.:1298.2   3rd Qu.:2.000   3rd Qu.:1777  
##                       Max.   :6110.0   Max.   :3.000   Max.   :5642  
##    X1stFlrSF    KitchenQual  TotRmsAbvGrd    GarageFinish   GarageCars   
##  Min.   : 334   Ex:100      Min.   : 2.000   Fin :352     Min.   :0.000  
##  1st Qu.: 882   Fa: 39      1st Qu.: 5.000   RFn :422     1st Qu.:1.000  
##  Median :1087   Gd:586      Median : 6.000   Unf :605     Median :2.000  
##  Mean   :1163   TA:735      Mean   : 6.518   NA's: 81     Mean   :1.767  
##  3rd Qu.:1391               3rd Qu.: 7.000                3rd Qu.:2.000  
##  Max.   :4692               Max.   :14.000                Max.   :4.000  
##    GarageArea    
##  Min.   :   0.0  
##  1st Qu.: 334.5  
##  Median : 480.0  
##  Mean   : 473.0  
##  3rd Qu.: 576.0  
##  Max.   :1418.0

Dependent variable: SalePrice

ggplot(train, aes(x=SalePrice))+
  geom_histogram(bins=30,color="darkblue", fill="lightblue")+
  geom_vline(aes(xintercept=mean(SalePrice)),color="blue", linetype="dashed", size=1)+
  theme_classic()+
  ggtitle('Histogram: SalePrice')+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(labels=comma)

Independent variables: OverallQual (correlation with SalePrice 0.79) and GrLivArea (correlation 0.71)

ggplot(train, aes(x=OverallQual))+
  geom_histogram(binwidth=1, color="darkblue", fill="lightblue")+
  theme_classic()+
  ggtitle('Histogram: Overall Quality')+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_vline(aes(xintercept=mean(OverallQual)),color="blue", linetype="dashed", size=1)

ggplot(train, aes(x=GrLivArea))+
  geom_histogram(bins=30, color="darkblue", fill="lightblue")+
  theme_classic()+
  ggtitle('Histogram: Above Ground Living Area')+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_vline(aes(xintercept=mean(GrLivArea)),color="blue", linetype="dashed", size=1)

SalePrice vs Overall Quality

ggplot(train, aes(x = OverallQual, y = SalePrice)) + 
  geom_point(aes(color = factor(OverallQual)))+
  theme_classic() +
  ggtitle ('SalePrice vs OverallQual')+
  scale_y_continuous(labels=comma)+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

SalePrice vs GrLivArea

ggplot(train, aes(x = GrLivArea, y = SalePrice)) + 
  geom_point(color='grey') + 
  ggtitle ('SalePrice vs GrLivArea')+
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels=comma)+
   geom_smooth(method = lm, se = FALSE, color='red')

Both variables appear positively related to SalePrice.

Scatterplot matrix

Dependent variable: SalePrice

Independent variables (as plotted in the matrix below):

OverallQual (correlation 0.79), GrLivArea (correlation 0.71), BsmtQual, and GarageCars (correlation 0.64)

train %>% 
  dplyr::select(c('SalePrice', 'OverallQual','GrLivArea', 'BsmtQual', 'GarageCars')) %>% 
  pairs.panels(method = "pearson", hist.col = "#5D6D7E")

Correlation matrix

Derive a correlation matrix for any three quantitative variables in the dataset.

corrmat = train %>% 
  dplyr::select(SalePrice, OverallQual, GrLivArea) %>% 
  cor() %>% 
  as.matrix()
corrmat
##             SalePrice OverallQual GrLivArea
## SalePrice   1.0000000   0.7909816 0.7086245
## OverallQual 0.7909816   1.0000000 0.5930074
## GrLivArea   0.7086245   0.5930074 1.0000000

Test Hypothesis

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

SalePrice vs OverallQual

pair1<-cor.test(train$SalePrice, train$OverallQual, conf.level = 0.8)
pair1
## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816

The correlation of 0.79 indicates a strong positive association between OverallQual and SalePrice. The p-value is far below 0.05 and the 80% confidence interval (0.778, 0.803) excludes 0, so we reject the hypothesis that the true correlation is 0.

pair2<-cor.test(train$SalePrice, train$GrLivArea, conf.level = 0.8)
pair2
## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

The correlation of 0.71 indicates a strong positive association between GrLivArea (above-ground living area) and SalePrice. The p-value is far below 0.05 and the 80% confidence interval (0.692, 0.725) excludes 0, so again we reject the hypothesis that the true correlation is 0.

Family-wise error rate is the probability of making one or more false discoveries (type I errors) when performing multiple hypothesis tests. I would not be worried in this case because only two tests were performed and both p-values are extremely small (< 2.2e-16), so the conclusions would survive any reasonable multiple-comparison correction.
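If one did want to guard against familywise error here, the two p-values could be adjusted directly, for example with a Bonferroni correction; a small sketch:

# Bonferroni-adjusted p-values for the two correlation tests
p.adjust(c(pair1$p.value, pair2$p.value), method = "bonferroni")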

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Precision matrix

precisionmatrix <-solve(corrmat)
precisionmatrix
##             SalePrice OverallQual  GrLivArea
## SalePrice    3.498623  -2.0007280 -1.2927630
## OverallQual -2.000728   2.6865350 -0.1753704
## GrLivArea   -1.292763  -0.1753704  2.0200794
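As the prompt notes, the diagonal of this precision matrix contains the variance inflation factors, \(1/(1-R_j^2)\) for each variable regressed on the other two; here they are roughly 3.50, 2.69, and 2.02 and can be read off directly:

# variance inflation factors sit on the diagonal of the precision matrix
diag(precisionmatrix)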

Correlation times precision

multiple <- round(corrmat %*% precisionmatrix)
multiple
##             SalePrice OverallQual GrLivArea
## SalePrice           1           0         0
## OverallQual         0           1         0
## GrLivArea           0           0         1

Precision times correlation

multiple2<- round(precisionmatrix %*% corrmat)
multiple2
##             SalePrice OverallQual GrLivArea
## SalePrice           1           0         0
## OverallQual         0           1         0
## GrLivArea           0           0         1

LU Decomposition

#conduct LU decomposition on the matrix
lu <- lu.decomposition(corrmat)
lu
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.7909816 1.0000000    0
## [3,] 0.7086245 0.0868136    1
## 
## $U
##      [,1]      [,2]       [,3]
## [1,]    1 0.7909816 0.70862448
## [2,]    0 0.3743481 0.03249851
## [3,]    0 0.0000000 0.49503004
# test to show the product of LU is the original matrix
lu$L%*%lu$U
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.7909816 0.7086245
## [2,] 0.7909816 1.0000000 0.5930074
## [3,] 0.7086245 0.5930074 1.0000000
#The two matrices multiplied give us the correlation matrix as below
corrmat
##             SalePrice OverallQual GrLivArea
## SalePrice   1.0000000   0.7909816 0.7086245
## OverallQual 0.7909816   1.0000000 0.5930074
## GrLivArea   0.7086245   0.5930074 1.0000000

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

I have selected the GrLivArea variable. Its minimum value is 334, which is already above zero, so no shift is needed.

min(train$GrLivArea)
## [1] 334

The distribution of the variable is shown below

ggplot(train, aes(GrLivArea)) +
  geom_histogram(aes(y = ..density..), colour="black", fill="#5D6D7E") +
  geom_density(alpha=.5, fill="#85929E") +
  geom_vline(xintercept = min(train$GrLivArea)) +
  labs(title = "GrLivArea") +
  theme_classic()

Then load the MASS package and run fitdistr to fit an exponential probability density function. Find the optimal value of λ for this distribution.

# fit exponential distribution
fit_dist <- fitdistr(train$GrLivArea, "exponential")
rate <- fit_dist$estimate
rate
##        rate 
## 0.000659864

The optimal (maximum-likelihood) value of λ for this distribution is 0.000659864.
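For an exponential distribution the maximum-likelihood estimate of the rate is simply the reciprocal of the sample mean, so the fitted value can be sanity-checked directly (1/1515.46 ≈ 0.00066, matching the estimate above):

# MLE of the exponential rate is 1/mean(x); should reproduce fitdistr's estimate
1/mean(train$GrLivArea)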

Then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable.

#Get values for exponential distribution
set.seed(450)
opval <-rexp(1000, rate)
opval<-data.frame(opval)
#Plot the histograms
ep<-ggplot(opval, aes(opval)) +  
  geom_histogram(bins=50, colour="black", fill="#5D6D7E") +
  theme_classic()+
  labs(x = "Exponential Distribution") 
od<-ggplot(train, aes(GrLivArea)) +  
  geom_histogram(bins=50, colour="black", fill="#5D6D7E") +
  theme_classic()+
  labs(x = "Original Distribution") 
ggarrange(ep, od, nrow = 1, ncol = 2)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
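The percentiles come from inverting the exponential CDF: since \(F(x) = 1 - e^{-\lambda x}\), the \(p\)-th quantile is

\[x_p = F^{-1}(p) = -\frac{\ln(1-p)}{\lambda},\]

which is what qexp computes below using the fitted rate.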

# the 5th and 95th percentiles of exponential pdf
qexp(0.05, rate = rate)
## [1] 77.73313
qexp(0.95, rate = rate)
## [1] 4539.924
# 95% confidence interval from the empirical data, assuming normality
CI<-CI(na.exclude(train$GrLivArea), ci = 0.95)
CI
##    upper     mean    lower 
## 1542.440 1515.464 1488.487
# the 5th and 95th percentiles of empirical data
quantile(train$GrLivArea, c(0.05, 0.95))
##     5%    95% 
##  848.0 2466.1

The 5th and 95th percentile values of the exponential distribution are 77.73 and 4539.92.

The 95% confidence interval for the mean of the empirical data runs from about 1488 to 1542.

The empirical data have 5th and 95th percentile values of 848 and 2466, which differ substantially from the values implied by the fitted exponential distribution.

From the visualization we see that the exponential distribution shows a much higher positive skew than the empirical distribution. This implies that the exponential distribution is not a good fit for this particular variable in this dataset.
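A family with a heavy right tail but a non-zero mode, such as the log-normal, would likely track GrLivArea more closely; a quick sketch of how such a fit could be obtained with the same fitdistr interface:

# fit a log-normal for comparison and look at its 5th/95th percentiles
fit_ln <- fitdistr(train$GrLivArea, "lognormal")
qlnorm(c(0.05, 0.95), meanlog = fit_ln$estimate["meanlog"], sdlog = fit_ln$estimate["sdlog"])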

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

We saw in the descriptive statistics that some of the variables have a few extreme outliers. We remove those observations first:

##removing outliers
train<-train%>%
 filter(GrLivArea<=4000)%>%
 filter(SalePrice<=500000)%>%
 filter(TotalBsmtSF<=2250)

Converting to a data matrix (and back to a data frame) so that the categorical variables are numerically coded and can be used in the regression:

temp<-data.matrix(train)
temp<-data.frame(temp)

After several attempts using combinations of variables with an absolute correlation of over 0.60 with SalePrice, I arrived at the following model, which gave the highest adjusted R-squared of the combinations tried:

fit = lm(log(temp$SalePrice) ~ temp$OverallQual + log(temp$GrLivArea) + temp$GarageCars +temp$TotalBsmtSF, data = temp)
summary(fit)
## 
## Call:
## lm(formula = log(temp$SalePrice) ~ temp$OverallQual + log(temp$GrLivArea) + 
##     temp$GarageCars + temp$TotalBsmtSF, data = temp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99029 -0.08041  0.01505  0.10711  0.53603 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         8.267e+00  1.113e-01   74.25   <2e-16 ***
## temp$OverallQual    1.171e-01  4.633e-03   25.28   <2e-16 ***
## log(temp$GrLivArea) 3.642e-01  1.710e-02   21.30   <2e-16 ***
## temp$GarageCars     9.974e-02  7.541e-03   13.23   <2e-16 ***
## temp$TotalBsmtSF    2.086e-04  1.324e-05   15.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1653 on 1439 degrees of freedom
## Multiple R-squared:  0.8166, Adjusted R-squared:  0.8161 
## F-statistic:  1602 on 4 and 1439 DF,  p-value: < 2.2e-16

The model has an adjusted R-squared of 0.82. All of the coefficient estimates have very small p-values and are highly significant; the standard errors are more than an order of magnitude smaller than the corresponding estimates (t-values range from about 13 to 74).

Checking the residuals

Constant variability

Most of the residuals are scattered fairly evenly around 0, but a handful of large negative residuals fall below -0.5.

plot(fitted(fit), resid(fit))
abline(a=0, b=0, col='red')

Nearly Normal Residuals

The Q-Q plot shows a heavier left tail (negative skew), so the assumption of nearly normal residuals does not appear to be fully met.

qqnorm(resid(fit))
qqline(resid(fit))

Independence

The predictors show moderate positive correlations with one another, so some multicollinearity is present.

data <- dplyr::select( train, GrLivArea, OverallQual, TotalBsmtSF, GarageCars) 
m<-cor(data)
corrplot.mixed(m, lower.col = "black", number.cex = .7)
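A numeric check of the same concern is the variance inflation factor of each predictor in the fitted model; a sketch using the car package (assuming it is installed):

# VIFs for the fitted model; values well above 5 would indicate problematic collinearity
library(car)
vif(fit)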

Generating predictions for test.csv and writing the submission file for Kaggle:

#Loading test data and converting to data frame
test<-read.csv("test.csv")
temp1<-data.matrix(test)
temp1<-data.frame(temp1)

# Compute logSalePrice from the fitted coefficients, then convert back to SalePrice
temp1$logSalePrice <- fit$coefficients[1] +
  (fit$coefficients[2] * temp1$OverallQual) +
  (fit$coefficients[3] * log(temp1$GrLivArea)) +
  (fit$coefficients[4] * temp1$GarageCars) +
  (fit$coefficients[5] * temp1$TotalBsmtSF)
temp1$SalePrice<-exp(temp1$logSalePrice)

#make file ready for submission
submission = cbind(temp1$Id, temp1$SalePrice)
colnames(submission) = c("Id", "SalePrice")
submission<-na.locf(submission)
submission = as.data.frame(submission)
head(submission, 5)
##     Id SalePrice
## 1 1461  110457.9
## 2 1462  157367.0
## 3 1463  153188.2
## 4 1464  171186.5
## 5 1465  214581.7
#write file for submission
write.csv(submission, file = "experiment1.csv", quote = FALSE, row.names = FALSE)
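Because the model formula was written with temp$ prefixes, predict() would effectively ignore a newdata argument, which is why the test-set predictions were assembled by hand above. With a refit using plain column names (same variables), the predictions could instead come straight from predict(); a hypothetical sketch (fit2 and SalePrice_alt are illustrative names):

# hypothetical refit with plain column names, then predict directly on the test data
fit2 <- lm(log(SalePrice) ~ OverallQual + log(GrLivArea) + GarageCars + TotalBsmtSF, data = temp)
temp1$SalePrice_alt <- exp(predict(fit2, newdata = temp1))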

Kaggle Score

User: https://www.kaggle.com/zahirf

Score: 0.17296

knitr::include_graphics('Score.png')