Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = \frac{N+1}{2}\).
# setting the seed for reproducible results
set.seed(1234)
n <- 10000   # number of random draws
N <- 7       # chosen N (must be >= 6; 7 is consistent with the median of X reported below)
X <- runif(n, min = 1, max = N)
hist(X)

set.seed(1234)
mu <- (N + 1) / 2    # mean of Y
sig <- (N + 1) / 2   # standard deviation of Y
Y <- rnorm(n, mean = mu, sd = sig)
hist(Y)

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
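The values of x and y can then be obtained from the simulated data, for example (a minimal sketch):

x <- quantile(X, 0.50)   # median of X
y <- quantile(Y, 0.25)   # 1st quartile of Y
x
y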
## 50%
## 4.007595
## 25%
## 1.356831
\(P(A|B) = \frac{P(A, B)}{P(B)}\)
\(P(X>x | X>y) = \frac{P(X>x,X>y)}{P(X>y)}\)
# df: data frame of the simulated X and Y values (dplyr assumed loaded for %>% / filter)
df <- data.frame(X = X, Y = Y)
P_Xx_Xy <- df %>% filter(X > x, X > y) %>% nrow()/n
P_Xy <- df %>% filter(X > y) %>% nrow()/n
P_a <- P_Xx_Xy/P_Xy
P_a

## [1] 0.5315756

This is the probability that X exceeds its median, given that X exceeds the 1st quartile of Y.
The probability that X is greater than x and Y is greater than y, i.e. the joint probability \(P(X>x, Y>y)\):
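A sketch of this computation (using the same df as above):

P_b <- df %>% filter(X > x, Y > y) %>% nrow()/n
P_b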
## [1] 0.3762
\(P(X<x | X>y) = \frac{P(X<x,X>y)}{P(X>y)}\)
P_Xx_Xy <- df %>% filter(X < x, X > y) %>% nrow()/n
P_X_y <- df %>% filter(X > y) %>% nrow()/n
P_c <- P_Xx_Xy/P_X_y
P_c

## [1] 0.4684244

This is the probability that X is below its median, given that X exceeds the 1st quartile of Y.
# Marginal probability of A = {X > x}: sum the joint table over B = {Y > y}
# (jointProb is assumed to be the joint count/probability table built earlier)
marProb_A <- jointProb %>% ungroup() %>% group_by(A) %>% dplyr::summarize(count = sum(count), probability = sum(probability))
datatable(marProb_A)

From the table below we see P(X>x and Y>y) = 0.3762, P(X>x) = 0.1238 + 0.3762 = 0.5, and P(Y>y) = 0.3738 + 0.3762 = 0.75.
So P(X>x)P(Y>y) = 0.5 × 0.75 = 0.375, while P(X>x and Y>y) = 0.3762. With a difference of only about 0.001, we can say P(X>x and Y>y) ≈ P(X>x)P(Y>y), i.e. the events appear independent.
We define the following hypotheses:

\(H_0\): \(X>x\) and \(Y>y\) are independent events.

\(H_1\): \(X>x\) and \(Y>y\) are dependent events.
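A minimal sketch of how the 2×2 contingency table (dfTest) and the test below might be produced; the exact construction is an assumption, since the original chunk is not shown:

dfTest <- table(df$X > x, df$Y > y)   # 2x2 table of counts for the two events
fisher.test(dfTest)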
##
## Fisher's Exact Test for Count Data
##
## data: dfTest
## p-value = 0.5953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9361461 1.1243231
## sample estimates:
## odds ratio
## 1.025936
From Fisher’s test we see that the p-value (0.5953) is higher than 0.05, so we fail to reject the null hypothesis; there is no evidence of dependence, which is consistent with the events being independent.
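The same table can be passed to the Chi-squared test (a sketch under the same assumption):

chisq.test(dfTest)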
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: dfTest
## X-squared = 0.28213, df = 1, p-value = 0.5953
The Chi-squared test agrees: the p-value is higher than 0.05, so we again fail to reject the null hypothesis, consistent with the events being independent.
Fisher’s exact test is typically used when one or more expected cell counts in the cross-tabulation are less than 5, whereas the Chi-squared test is used when the expected counts are large, which is the case here.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
I uploaded the training and testing data from Kaggle to my GitHub account and loaded them here as shown below.
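A sketch of the loading step; the repository paths below are placeholders rather than the actual URLs, and the test-set object name is an assumption:

training_set <- read.csv("https://raw.githubusercontent.com/<user>/<repo>/master/train.csv")
test_set <- read.csv("https://raw.githubusercontent.com/<user>/<repo>/master/test.csv")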
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
I chose LotArea and OverallQual as the two independent variables and SalePrice as the dependent variable. The larger the lot area and the better the overall quality, the higher the sale price tends to be, so SalePrice plausibly depends on both.
df <- data.frame(training_set$LotArea, training_set$OverallQual, training_set$SalePrice)
datatable(df)

ggplot(training_set, aes(x = LotArea, y = SalePrice)) + geom_point() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

ggplot(training_set, aes(x = OverallQual, y = SalePrice)) + geom_point() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

cor_mat <- training_set %>% select(LotArea, OverallQual, SalePrice) %>% cor() %>% as.matrix()
cor_mat

##             LotArea OverallQual SalePrice
## LotArea 1.0000000 0.1058057 0.2638434
## OverallQual 0.1058057 1.0000000 0.7909816
## SalePrice 0.2638434 0.7909816 1.0000000
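The pairwise tests below can be produced with cor.test at an 80% confidence level, for example (a sketch; only the calls are assumed, the output is as reported):

cor.test(training_set$LotArea, training_set$SalePrice, conf.level = 0.80)
cor.test(training_set$OverallQual, training_set$SalePrice, conf.level = 0.80)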
##
## Pearson's product-moment correlation
##
## data: training_set$LotArea and training_set$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
##
## Pearson's product-moment correlation
##
## data: training_set$OverallQual and training_set$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
We see from the above that there is a strong correlation between SalePrice and OverallQual and a weak correlation between SalePrice and LotArea.
The familywise error rate is the probability of reaching at least one false conclusion (a Type I error) across a series of hypothesis tests. I increased the confidence level to 90% and did not see any meaningful change in the intervals or conclusions, so I would not be worried about it in this case.
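The same tests repeated at a 90% confidence level (a sketch of the presumed calls):

cor.test(training_set$LotArea, training_set$SalePrice, conf.level = 0.90)
cor.test(training_set$OverallQual, training_set$SalePrice, conf.level = 0.90)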
##
## Pearson's product-moment correlation
##
## data: training_set$LotArea and training_set$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
## 0.2233154 0.3034607
## sample estimates:
## cor
## 0.2638434
##
## Pearson's product-moment correlation
##
## data: training_set$OverallQual and training_set$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
## 0.7742916 0.8065720
## sample estimates:
## cor
## 0.7909816
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
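A sketch of these steps: solve() inverts the correlation matrix, and lu.decomposition() from the matrixcalc package (an assumption, since the original chunk is not shown) performs the LU decomposition of the two products.

library(matrixcalc)

prec_mat <- solve(cor_mat)    # precision matrix (inverse of the correlation matrix)
prec_mat
cp <- cor_mat %*% prec_mat    # correlation x precision (approximately the identity)
pc <- prec_mat %*% cor_mat    # precision x correlation (approximately the identity)
lu.decomposition(cp)
lu.decomposition(pc)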
## LotArea OverallQual SalePrice
## LotArea 1.1085153 0.3046752 -0.5334669
## OverallQual 0.3046752 2.7550503 -2.2595806
## SalePrice -0.5334669 -2.2595806 2.9280384
## $L
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## $L
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
I chose the variable X1stFlrSF; let’s verify that it is skewed to the right and strictly greater than 0.
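A sketch of the check and the exponential fit (fitdistr comes from the MASS package as the assignment specifies; the object names fit and lambda are assumptions):

library(MASS)

min(training_set$X1stFlrSF)   # verify the minimum is strictly above zero
hist(training_set$X1stFlrSF)  # verify the right skew
fit <- fitdistr(training_set$X1stFlrSF, densfun = "exponential")
lambda <- fit$estimate        # optimal rate (lambda) of the fitted exponential
fit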
## rate
## 0.0008601213
## (0.0000225104)
# draw 1,000 samples from the fitted exponential distribution
expo <- rexp(1000, rate = lambda)

par(mfrow = c(1, 2))
hist(expo, breaks = 50, xlim = c(0, 6000), main = "Exponential")
hist(training_set$X1stFlrSF, breaks = 50, main = "Original")

The 5th and 95th percentiles of the fitted exponential distribution, obtained from its cumulative distribution function:

## [1] 59.63495
## [1] 3482.918

A 95% confidence interval for the mean of the original variable, assuming normality:

##    upper     mean    lower
## 1182.473 1162.627 1142.780
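A sketch of how these values might be computed (the original chunk is not shown, so the calls below are assumptions):

qexp(0.05, rate = lambda)   # 5th percentile of the fitted exponential
qexp(0.95, rate = lambda)   # 95th percentile
# 95% confidence interval for the mean, assuming normality
xbar <- mean(training_set$X1stFlrSF)
se <- sd(training_set$X1stFlrSF) / sqrt(nrow(training_set))
c(upper = xbar + 1.96 * se, mean = xbar, lower = xbar - 1.96 * se)
# empirical 5th and 95th percentiles of the original data
quantile(training_set$X1stFlrSF, c(0.05, 0.95))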
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I selected SalePrice as the dependent variable with LotArea and OverallQual as predictors and ran a linear regression, which gave an R-squared of 0.6585.
##
## Call:
## lm(formula = SalePrice ~ OverallQual + LotArea, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -271225 -26819 -1459 20172 385190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.047e+05 5.547e+03 -18.88 <2e-16 ***
## OverallQual 4.433e+04 8.844e+02 50.12 <2e-16 ***
## LotArea 1.450e+00 1.225e-01 11.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46460 on 1457 degrees of freedom
## Multiple R-squared: 0.6585, Adjusted R-squared: 0.658
## F-statistic: 1405 on 2 and 1457 DF, p-value: < 2.2e-16
I then added a few more variables and re-ran the model, which improved the R-squared from 0.6585 to 0.7802. Adding further variables did not yield significant improvement, so I stopped at Multiple R-squared 0.7802 (Adjusted R-squared 0.7756).
rlm <- lm(SalePrice ~ OverallQual + LotArea + KitchenQual + factor(Neighborhood) + GarageArea,
          data = training_set, na.action = na.omit)
summary(rlm)

##
## Call:
## lm(formula = SalePrice ~ OverallQual + LotArea + KitchenQual +
## factor(Neighborhood) + GarageArea, data = training_set, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -273065 -19355 -1546 15308 304198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.870e+04 1.382e+04 2.801 0.00517 **
## OverallQual 2.392e+04 1.177e+03 20.325 < 2e-16 ***
## LotArea 1.009e+00 1.102e-01 9.158 < 2e-16 ***
## KitchenQualFa -6.446e+04 8.380e+03 -7.691 2.70e-14 ***
## KitchenQualGd -5.508e+04 4.649e+03 -11.850 < 2e-16 ***
## KitchenQualTA -6.674e+04 5.260e+03 -12.689 < 2e-16 ***
## factor(Neighborhood)Blueste -9.547e+03 2.826e+04 -0.338 0.73551
## factor(Neighborhood)BrDale -2.465e+04 1.337e+04 -1.844 0.06543 .
## factor(Neighborhood)BrkSide 1.803e+03 1.078e+04 0.167 0.86717
## factor(Neighborhood)ClearCr 2.806e+04 1.208e+04 2.322 0.02037 *
## factor(Neighborhood)CollgCr 9.232e+03 9.686e+03 0.953 0.34069
## factor(Neighborhood)Crawfor 4.180e+04 1.068e+04 3.913 9.53e-05 ***
## factor(Neighborhood)Edwards -2.170e+03 1.026e+04 -0.211 0.83260
## factor(Neighborhood)Gilbert 1.559e+04 1.017e+04 1.534 0.12531
## factor(Neighborhood)IDOTRR -1.644e+04 1.145e+04 -1.436 0.15117
## factor(Neighborhood)MeadowV -4.266e+03 1.333e+04 -0.320 0.74895
## factor(Neighborhood)Mitchel 4.880e+03 1.089e+04 0.448 0.65406
## factor(Neighborhood)NAmes 4.137e+03 9.811e+03 0.422 0.67332
## factor(Neighborhood)NoRidge 9.669e+04 1.098e+04 8.803 < 2e-16 ***
## factor(Neighborhood)NPkVill -6.191e+03 1.569e+04 -0.395 0.69320
## factor(Neighborhood)NridgHt 4.736e+04 1.037e+04 4.569 5.33e-06 ***
## factor(Neighborhood)NWAmes 1.317e+04 1.033e+04 1.275 0.20259
## factor(Neighborhood)OldTown -1.136e+04 1.009e+04 -1.126 0.26043
## factor(Neighborhood)Sawyer 6.377e+03 1.052e+04 0.606 0.54444
## factor(Neighborhood)SawyerW 1.258e+04 1.044e+04 1.205 0.22832
## factor(Neighborhood)Somerst 1.118e+04 1.003e+04 1.115 0.26489
## factor(Neighborhood)StoneBr 5.974e+04 1.195e+04 4.998 6.51e-07 ***
## factor(Neighborhood)SWISU 1.015e+04 1.211e+04 0.838 0.40227
## factor(Neighborhood)Timber 1.865e+04 1.123e+04 1.660 0.09711 .
## factor(Neighborhood)Veenker 3.690e+04 1.467e+04 2.515 0.01202 *
## GarageArea 6.550e+01 6.037e+00 10.850 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37630 on 1429 degrees of freedom
## Multiple R-squared: 0.7802, Adjusted R-squared: 0.7756
## F-statistic: 169.1 on 30 and 1429 DF, p-value: < 2.2e-16
From the plot below we see that the residuals are not randomly distributed: they are dense near 0 and become more scattered toward the extremes, almost forming a line. The normal Q-Q plot shows that the residuals do not fall on the theoretical normal line except in the center of the distribution.
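These are the standard diagnostic plots for the fitted model, for example (a sketch):

par(mfrow = c(1, 2))
plot(rlm, which = 1)   # residuals vs fitted values
plot(rlm, which = 2)   # normal Q-Q plot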
I tested my model on the test data and submitted the predictions to Kaggle. My Kaggle username is petferns.
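A sketch of how the submission file could be generated from the fitted model (the test-set object name and output filename are assumptions):

predictions <- predict(rlm, newdata = test_set)
# rows with missing predictor values may need imputation before submission
submission <- data.frame(Id = test_set$Id, SalePrice = predictions)
write.csv(submission, "submission.csv", row.names = FALSE)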
Kaggle entry (screenshot).