Pick one of the quantitative independent variables \(X_i\) from the data set below, and define that variable as \(X\). Also, pick one of the dependent variables \(Y_i\) below, and define that as \(Y\).
I selected \(X_1\) and \(Y_1\).
# tidyverse provides tibble, dplyr, and ggplot2, which are used throughout
library(tidyverse)
data <- tibble(x = c(9.3, 4.1, 22.4, 9.1, 15.8, 7.1, 15.9, 6.9, 16.0, 6.7, 8.2, 16.0, 6.4, 11.8, 3.5, 21.7, 12.2, 9.3, 8.0, 6.2),
y = c(20.3, 19.1, 19.3, 20.9, 22.0, 23.5, 13.8, 18.8, 20.9, 18.6, 22.3, 17.6, 20.8, 28.7, 15.2, 20.9, 18.4, 10.3, 26.3, 28.1))
Calculate, at a minimum, the probabilities a through c below. Assume the small letter \(x\) is estimated as the 3rd quartile of the \(X\) variable, and the small letter \(y\) is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
summary(data)
## x y
## Min. : 3.50 Min. :10.30
## 1st Qu.: 6.85 1st Qu.:18.55
## Median : 9.20 Median :20.55
## Mean :10.83 Mean :20.29
## 3rd Qu.:15.82 3rd Qu.:22.07
## Max. :22.40 Max. :28.70
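The two cut points used throughout this section can be pulled out directly with quantile(); a quick check:
quantile(data$x, 0.75)  # 3rd quartile of X, the 15.82 shown in the summary above
quantile(data$y, 0.25)  # 1st quartile of Y, the 18.55 shown in the summary above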
\(P(X > 15.82 \cup Y > 18.55) = \frac{17}{20}\) - The probability that a randomly selected row has X greater than 15.82 OR Y greater than 18.55.
data %>%
filter(x > 15.82 | y > 18.55) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 17
\(P(X > 15.82, Y>18.55)=\frac{3}{20}\) - The probability that a randomly selected row has X greater than 15.82 AND Y greater than 18.55.
data %>%
filter(x > 15.82, y > 18.55) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 3
\(P(X < 15.82 \cup Y > 18.55) = \frac{18}{20}=\frac{9}{10}\) - The probability that a randomly selected row has X less than 15.82 OR Y greater than 18.55.
data %>%
filter(x < 15.82 | y > 18.55) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 18
In addition, make a table of counts as shown below.
data %>%
mutate(x = x <= 15.82,
y = y <= 18.55) %>%
table()
## y
## x FALSE TRUE
## FALSE 3 2
## TRUE 12 3
|   | x <= 3rd Qu. | x > 3rd Qu. | Total |
|---|---|---|---|
| y <= 1st Qu. | 3 | 2 | 5 |
| y > 1st Qu. | 12 | 3 | 15 |
| Total | 15 | 5 | 20 |
Does splitting the training data in this fashion make them independent? Let \(A\) be the new variable counting those observations above the 1st quartile for \(X\), and let \(B\) be the new variable counting those observations above the 1st quartile for \(Y\). Does \(P(AB)=P(A)P(B)\)? Check mathematically, and then evaluate by running a Chi Square test for independence.
\(P(A)=\frac{15}{20}\)
\(P(B)=\frac{15}{20}\)
\(P(A)\times P(B)=\frac{15}{20}\times\frac{15}{20}=\frac{9}{16}\)
\(P(AB)=P(A)\times P(B|A)=\frac{15}{20}\times\frac{11}{15}=\frac{11}{20}\)
\(P(AB)=P(B)\times P(A|B)=\frac{15}{20}\times\frac{11}{15}=\frac{11}{20}\)
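These probabilities can be reproduced directly from the data, defining \(A\) and \(B\) as indicators for being above the respective 1st quartiles (A and B below are just scratch names for this check):
A <- data$x > quantile(data$x, 0.25)  # above the 1st quartile of X
B <- data$y > quantile(data$y, 0.25)  # above the 1st quartile of Y
mean(A)            # P(A)  = 15/20
mean(B)            # P(B)  = 15/20
mean(A & B)        # P(AB) = 11/20
mean(A) * mean(B)  # P(A)P(B) = 9/16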
Mathematically, \(P(AB)=\frac{11}{20}=0.55\) while \(P(A)P(B)=\frac{9}{16}=0.5625\), so the two are not equal and the events are not independent: the probability of \(B\) occurring changes once we know whether \(A\) has occurred, and vice versa, although the difference is small. The chi-squared test (run on the quartile split tabulated above) fails to reject the null hypothesis of independence (p = 0.77), so it is inconclusive, possibly because of the small number of observations.
data %>%
mutate(x = x <= 15.82,
y = y <= 18.55) %>%
table() %>%
chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 0.088889, df = 1, p-value = 0.7656
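One reason the test is unreliable here is the expected cell counts: with only 20 observations, three of the four expected counts (11.25, 3.75, 3.75, and 1.25) fall below 5, so the chi-squared approximation is questionable. A quick check (counts is just a scratch variable):
counts <- data %>%
mutate(x = x <= 15.82,
y = y <= 18.55) %>%
table()
chisq.test(counts)$expected  # expected counts under independence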
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
raw.data <- read_csv('C:\\Users\\Brian\\Desktop\\GradClasses\\Fall18\\605\\final\\all\\train.csv') %>%
select(-c(Id))
Below are the univariate descriptive statistics for all of the numeric variables.
raw.data %>%
select_if(is.numeric) %>%
summary()
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :8
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :81
## OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.759 Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
I plotted my three selected variables from the training set. The variables SalePrice and LotArea were heavily skewed, so I log-transformed them to make the data easier to visualize.
mod.data <- raw.data %>%
dplyr::select(SalePrice, TotRmsAbvGrd, LotArea) %>%
mutate(LotArea = log(LotArea),
SalePrice = log(SalePrice))
mod.data %>%
gather("key", "value") %>%
ggplot() +
geom_histogram(aes(value), bins=30) +
facet_wrap(~key, scales='free_x')
Below is the scatterplot matrix for my three selected variables. SalePrice and LotArea are still log-transformed.
mod.data %>%
select(SalePrice, LotArea, TotRmsAbvGrd) %>%
pairs()
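The numeric correlation matrix for the same three (log-transformed) variables can be computed with cor(); its off-diagonal entries should match the pairwise estimates reported by cor.test below (roughly 0.40, 0.53, and 0.36):
mod.data %>%
select(SalePrice, LotArea, TotRmsAbvGrd) %>%
cor()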
The 80% confidence interval for the correlation between SalePrice and LotArea is \((0.37, 0.42)\)
cor.test(mod.data$SalePrice, mod.data$LotArea, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: mod.data$SalePrice and mod.data$LotArea
## t = 16.661, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3713402 0.4277383
## sample estimates:
## cor
## 0.3999177
The 80% confidence interval for the correlation between SalePrice and TotRmsAbvGrd is \((0.51, 0.55)\)
cor.test(mod.data$SalePrice, mod.data$TotRmsAbvGrd, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: mod.data$SalePrice and mod.data$TotRmsAbvGrd
## t = 24.143, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5100081 0.5579759
## sample estimates:
## cor
## 0.5344222
The 80% confidence interval for the correlation between TotRmsAbvGrd and LotArea is \((0.33, 0.38)\)
cor.test(mod.data$TotRmsAbvGrd, mod.data$LotArea, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: mod.data$TotRmsAbvGrd and mod.data$LotArea
## t = 14.74, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3305629 0.3889893
## sample estimates:
## cor
## 0.3601291
The correlations between all three variables are positive (roughly 0.40, 0.53, and 0.36) and significantly different from zero at the 80% confidence level. Familywise error can occur when performing multiple hypothesis tests: the likelihood of making at least one Type I error increases as the number of comparisons grows. Given the small number of tests here and the extremely small p-values, I am not worried that such an error has occurred. For completeness, however, I used the Bonferroni correction to adjust the p-values for the multiple tests. The adjusted p-values are still statistically significant, providing further evidence of a positive relationship between all of the selected variables.
p.adjust(c(2.2e-16, 2.2e-16, 2.2e-16), method="bonferroni")
## [1] 6.6e-16 6.6e-16 6.6e-16
Invert your \(3\times3\) correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct \(LU\) decomposition on the matrix.
cor.matrix <- raw.data %>%
select(SalePrice, TotRmsAbvGrd, LotArea) %>%
cor()
cor.matrix
## SalePrice TotRmsAbvGrd LotArea
## SalePrice 1.0000000 0.5337232 0.2638434
## TotRmsAbvGrd 0.5337232 1.0000000 0.1900148
## LotArea 0.2638434 0.1900148 1.0000000
prec.matrix <- solve(cor.matrix)
prec.matrix
## SalePrice TotRmsAbvGrd LotArea
## SalePrice 1.4539777 -0.72946544 -0.24501314
## TotRmsAbvGrd -0.7294654 1.40343330 -0.07420846
## LotArea -0.2450131 -0.07420846 1.07874579
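As the prompt notes, the diagonal of the precision matrix holds the variance inflation factors, which can be pulled out directly; all three are close to 1 (about 1.45, 1.40, and 1.08), suggesting little multicollinearity among these variables:
diag(prec.matrix)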
cor.matrix %*% prec.matrix %>%
round(2)
## SalePrice TotRmsAbvGrd LotArea
## SalePrice 1 0 0
## TotRmsAbvGrd 0 1 0
## LotArea 0 0 1
prec.matrix %*% cor.matrix %>%
round(2)
## SalePrice TotRmsAbvGrd LotArea
## SalePrice 1 0 0
## TotRmsAbvGrd 0 1 0
## LotArea 0 0 1
Conduct \(LU\) decomposition on the matrix.
library(matrixcalc)
lu.decomposition(cor.matrix)
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.00000000 0
## [2,] 0.5337232 1.00000000 0
## [3,] 0.2638434 0.06879142 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.5337232 0.26384335
## [2,] 0 0.7151396 0.04919547
## [3,] 0 0.0000000 0.92700246
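As a quick sanity check, the two factors can be multiplied back together; the product should reproduce the original correlation matrix (lu is just a scratch variable holding the decomposition):
lu <- lu.decomposition(cor.matrix)
all.equal(lu$L %*% lu$U, cor.matrix, check.attributes = FALSE)  # should be TRUE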
Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
I selected LotArea. It is heavily right-skewed, as seen in the boxplot below.
raw.data %>%
select(LotArea) %>%
gather("key", "value") %>%
ggplot() +
geom_boxplot(aes(key, value))
Then load the MASS package and run fitdistr to fit an exponential probability density function. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value.
library(MASS)
fitdistr(raw.data$LotArea, densfun='exponential')
## rate
## 9.508570e-05
## (2.488507e-06)
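For an exponential distribution, the maximum-likelihood estimate of the rate is simply the reciprocal of the sample mean, so the fitted rate lines up with the mean LotArea of about 10,517 reported in the summary above:
1 / mean(raw.data$LotArea)  # roughly 9.51e-05, matching the rate returned by fitdistr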
Plot a histogram and compare it with a histogram of your original variable.
The theoretical exponential distribution is spread much more widely around its mean, while the actual data are concentrated in a very narrow range and then fall off more steeply than the theoretical draws. This suggests that the exponential distribution may not accurately describe the LotArea variable.
compare.data <- tibble(exp = rexp(1460, 9.508570e-5)) %>%
mutate(lotarea = raw.data$LotArea)
compare.data %>%
gather("key", "value") %>%
mutate(value = as.numeric(value)) %>%
ggplot() +
geom_histogram(aes(value), bins=70) +
facet_wrap(~key)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.
quantile(raw.data$LotArea, c(0.05, 0.95))
## 5% 95%
## 3311.70 17401.15
library(gmodels)
ci(compare.data$exp, confidence=0.95)
## Estimate CI lower CI upper Std. Error
## 10748.754 10184.438 11313.071 287.683
quantile(rexp(1000, 9.508570e-5), c(0.05, 0.95))
## 5% 95%
## 475.0215 30308.2985
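The 5th and 95th percentiles can also be taken directly from the fitted exponential CDF with qexp(), and a normal-theory confidence interval can be computed on the empirical LotArea values themselves (lambda below is just a scratch name for the fitted rate):
lambda <- 9.508570e-05                  # rate estimated by fitdistr above
qexp(c(0.05, 0.95), rate = lambda)      # closed-form percentiles, roughly 539 and 31,506

# 95% CI for the mean of the empirical data, assuming normality
ci(raw.data$LotArea, confidence = 0.95)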
As mentioned above, the 5th to 95th percentile range of the actual data (roughly 14,000) is far narrower than that of the theoretical draws (roughly 30,000). The 95% confidence interval from the simulated exponential data, (10184, 11313), does contain the empirical mean of 10517, but given the mismatch in spread I would not model LotArea with an exponential distribution.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Please see the other RMD file.
https://www.youtube.com/watch?v=njCM-Iaeq4A&feature=youtu.be