Overview

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs.

You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Problem 1

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)).

set.seed(1234)
n = 10000

Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.

X <- runif(n, 5, 15)
hist(X)

Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

Y <- rnorm(n, 10, 2.89)
hist(Y)

Probability.

Calculate as a minimum the below probabilities a through c.  Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

x <- median(X)
y <- median(Y)

a) \(P(X>x|X>y)\)

We proceed under the assumption of independence, which we validate later with Fisher's Exact Test and the Chi-Square Test.

\[ P(X>x \mid X>y) = \frac{P(X>x \cap X>y)}{P(X>y)} \\ \text{If independent,} \\ = \frac{P(X>x)\,P(X>y)}{P(X>y)} \]

mean(X > max(x, y))  # empirical P(X > x and X > y)
## [1] 0.4979
1 - punif(max(x, y), 5, 15)  # theoretical counterpart from the Uniform(5, 15) CDF
## [1] 0.4968461
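
As a cross-check, the definition in the first line of the formula above can also be evaluated directly from the simulated values (a minimal sketch using the definition of conditional probability):

# empirical P(X > x | X > y) = P(X > x and X > y)/P(X > y)
p_joint <- mean(X > x & X > y)
p_cond <- p_joint/mean(X > y)
p_cond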

b) \(P(X>x \& Y>y)\)

mean(X > x & Y > y)
## [1] 0.2507
p_x <- punif(x, 5, 15)  # ~0.5 since x is the sample median; strictly, P(X > x) = 1 - punif(x, 5, 15)
p_y <- 1 - pnorm(y, 10, 2.89)  # theoretical P(Y > y)

p_x * p_y
## [1] 0.2484506

c) \(P(X<x|X>y)\)

mean(X < x & X > y)
## [1] 0

Since x < y, there are no values that are less than x and at the same time greater than y.

Independence Tests

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

\[ H_0 : \text{X and Y are Independent} \\ H_1 : \text{X and Y are not Independent} \\ \alpha = 0.05 \]

table <- table(X > x, Y > y)

fisher_test <- fisher.test(table)

chi_square_test <- chisq.test(table)

print(paste("Fisher's Exact Test p-value: ", fisher_test$p.value))
## [1] "Fisher's Exact Test p-value:  0.794865333155194"
print(paste("Chi-Square Test p-value: ", chi_square_test$p.value))
## [1] "Chi-Square Test p-value:  0.794863773596479"

Both tests return essentially identical p-values of about 0.795, so we fail to reject the null hypothesis that X and Y are independent. Fisher's Exact Test is better suited to small sample sizes, while the Chi-Square Test is better suited to large samples; since both yield nearly the same value here, a sample of 10,000 is small enough for Fisher's test to remain computable and large enough for the Chi-Square approximation to be accurate. We are not surprised by the result: X and Y were generated from separate univariate distributions, so neither depends on the other, which is why we assumed independence at the beginning of the problem.

Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

library(dplyr)
library(data.table)
library(ggplot2)
library(GGally)
library(corrplot)
library(corrr)
library(matlib)
library(factoextra)
library(ResourceSelection)
library(car)
train_df <- fread("Final-Project/train.csv")
train_df <- train_df[, -1]

test_df <- fread("Final-Project/test.csv")

a) Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set.

par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)[-1]) {
    if (is.numeric(train_df[[col_name]])) {
        hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
    }
}

par(mfrow = c(1, 1))
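
The histograms can be complemented with numeric summaries; a minimal sketch using base summary():

# five-number summary plus the mean for every column in the training data
summary(train_df)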

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))

target_col <- "Age"

for (col_name in names(train_df)[-1]) {
    if (is.numeric(train_df[[col_name]]) && col_name != target_col) {
        plot(train_df[[col_name]], train_df[[target_col]], main = paste("Scatterplot:",
            col_name, "vs", target_col), xlab = col_name, ylab = target_col)
    }
}

par(mfrow = c(1, 1))
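
Since GGally is already loaded, a true scatterplot matrix of two independent variables and the dependent variable can also be drawn (a sketch; Length, Diameter, and Age are column names from the training data):

# scatterplot matrix of two predictors and the response
ggpairs(train_df[, c("Length", "Diameter", "Age")])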

Derive a correlation matrix for any three quantitative variables in the dataset.

correlation_Matrix <- cor(train_df[, 2:9])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

cor_results <- list()

for (col_name in names(train_df)) {
    if (is.numeric(train_df[[col_name]]) && col_name != target_col) {
        cor_result <- cor.test(train_df[[target_col]], train_df[[col_name]])  # default conf.level = 0.95; 80% intervals shown in the sketch below
        cor_results[[col_name]] <- cor_result
    }
}

cor_results
## $Length
## 
##  Pearson's product-moment correlation
## 
## data:  train_df[[target_col]] and train_df[[col_name]]
## t = 211.04, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6083257 0.6173207
## sample estimates:
##       cor 
## 0.6128431 
## 
## 
## $Diameter
## 
##  Pearson's product-moment correlation
## 
## data:  train_df[[target_col]] and train_df[[col_name]]
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6168134 0.6256588
## sample estimates:
##       cor 
## 0.6212559 
## 
## 
## $Height
## 
##  Pearson's product-moment correlation
## 
## data:  train_df[[target_col]] and train_df[[col_name]]
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6337771 0.6423175
## sample estimates:
##       cor 
## 0.6380669 
## 
## 
## $Weight
## 
##  Pearson's product-moment correlation
## 
## data:  train_df[[target_col]] and train_df[[col_name]]
## t = 204.73, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5965757 0.6057744
## sample estimates:
##      cor 
## 0.601195 
## 
## 
## $`Shucked Weight`
## 
##  Pearson's product-moment correlation
## 
## data:  train_df[[target_col]] and train_df[[col_name]]
## t = 158.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4979228 0.5086787
## sample estimates:
##       cor 
## 0.5033202 
## 
## 
## $`Viscera Weight`
## 
##  Pearson's product-moment correlation
## 
## data:  train_df[[target_col]] and train_df[[col_name]]
## t = 192.15, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5719815 0.5815941
## sample estimates:
##       cor 
## 0.5768078 
## 
## 
## $`Shell Weight`
## 
##  Pearson's product-moment correlation
## 
## data:  train_df[[target_col]] and train_df[[col_name]]
## t = 241.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6594219 0.6674861
## sample estimates:
##       cor 
## 0.6634733
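
Note that cor.test defaults to a 95% confidence interval; the 80% intervals requested in the prompt can be obtained by passing conf.level = 0.8 (a sketch, output omitted):

# 80% confidence intervals for the same pairwise correlations
for (col_name in names(train_df)) {
    if (is.numeric(train_df[[col_name]]) && col_name != target_col) {
        print(cor.test(train_df[[target_col]], train_df[[col_name]], conf.level = 0.8)$conf.int)
    }
}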

Discuss the meaning of your analysis. Would you be worried about family wise error? Why or why not?

We found strong pairwise correlations among all of the explanatory variables, i.e., multicollinearity, which reduces the reliability of the model we will build later on. Family-wise error is also a concern: we ran simultaneous hypothesis tests of the response variable against every explanatory variable, which inflates the overall risk of a Type I error.

b) Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

precision_matrix <- inv(correlation_Matrix)  # matlib::inv; per the prompt, the diagonal holds variance inflation factors

corr_preci <- correlation_Matrix * precision_matrix  # note: `*` is element-wise; see the %*% sketch below for the matrix products
preci_corr <- precision_matrix * correlation_Matrix
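
The matrix products asked for in the prompt use the %*% operator; both products should recover the identity matrix up to floating-point error, since the precision matrix is the inverse of the correlation matrix (a sketch, output omitted):

# true matrix products of the correlation matrix and its inverse
round(correlation_Matrix %*% precision_matrix, 10)
round(precision_matrix %*% correlation_Matrix, 10)
diag(precision_matrix)  # variance inflation factors, per the prompt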

Conduct LDU decomposition on the matrix.

lu_decomp <- LU(preci_corr)
lu_decomp
## $P
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    0    0    0    0    0    0    0
## [2,]    0    1    0    0    0    0    0    0
## [3,]    0    0    1    0    0    0    0    0
## [4,]    0    0    0    1    0    0    0    0
## [5,]    0    0    0    0    1    0    0    0
## [6,]    0    0    0    0    0    1    0    0
## [7,]    0    0    0    0    0    0    1    0
## [8,]    0    0    0    0    0    0    0    1
## 
## $L
##             [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## [1,]  1.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000
## [2,] -0.89201644  1.00000000  0.00000000  0.00000000  0.00000000  0.00000000
## [3,] -0.03572430 -0.31419950  1.00000000  0.00000000  0.00000000  0.00000000
## [4,]  0.00205901 -0.10300229 -0.08342595  1.00000000  0.00000000  0.00000000
## [5,] -0.05358117 -0.21990895 -0.11658437 -0.40595443  1.00000000  0.00000000
## [6,] -0.02719368 -0.08061884 -0.13726058 -0.23919253 -0.56292232  1.00000000
## [7,]  0.02847184 -0.10080164 -0.34445414 -0.32355276 -0.30158631 -0.71188559
## [8,] -0.00202978 -0.03060876 -0.06783414 -0.01423663  0.05360755  0.02055726
##            [,7] [,8]
## [1,]  0.0000000    0
## [2,]  0.0000000    0
## [3,]  0.0000000    0
## [4,]  0.0000000    0
## [5,]  0.0000000    0
## [6,]  0.0000000    0
## [7,]  1.0000000    0
## [8,] -0.1777528    1
## 
## $U
##                                                                                
## [1,] 50.03636 -44.63326 -1.787514  0.1030252  -2.6810071  -1.3606731   1.424627
## [2,]  0.00000  12.54154 -3.940544 -1.2918070  -2.7579960  -1.0110842  -1.264207
## [3,]  0.00000   0.00000  6.509302 -0.5430447  -0.7588829  -0.8934706  -2.242156
## [4,]  0.00000   0.00000  0.000000 77.8744608 -31.6134820 -18.6269893 -25.196496
## [5,]  0.00000   0.00000  0.000000  0.0000000  11.6296254  -6.5465757  -3.507336
## [6,]  0.00000   0.00000  0.000000  0.0000000   0.0000000   9.5871525  -6.824956
## [7,]  0.00000   0.00000  0.000000  0.0000000   0.0000000   0.0000000   5.988008
## [8,]  0.00000   0.00000  0.000000  0.0000000   0.0000000   0.0000000   0.000000
##                
## [1,] -0.1015629
## [2,] -0.3838808
## [3,] -0.4415529
## [4,] -1.1086695
## [5,]  0.6234357
## [6,]  0.1970856
## [7,] -1.0643852
## [8,]  1.8859049
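
matlib::LU returns an LU factorization; since P is the identity here, the D factor of the full LDU decomposition can be obtained by pulling the diagonal out of U (a sketch):

# preci_corr = L %*% D %*% U_unit, where U_unit is unit upper-triangular
D <- diag(diag(lu_decomp$U))
U_unit <- diag(1/diag(lu_decomp$U)) %*% lu_decomp$U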

c) Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data.

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
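
Diameter is used below; the shift mentioned in the prompt is only needed if the variable contains non-positive values, which can be checked first (a sketch):

# the exponential fit needs non-negative data; shift only if the minimum is not above zero
min(train_df$Diameter)
# if needed: shifted <- train_df$Diameter - min(train_df$Diameter) + 0.001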

library(MASS)
shifted_diameter <- fitdistr(train_df$Diameter, "exponential")  # looks right skewed 

shifted_diameter
##       rate    
##   0.976089667 
##  (0.003586941)

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

exp_samples <- rexp(1000, shifted_diameter$estimate)  # shifted_diameter$estimate is the fitted rate lambda

Plot a histogram and compare it with a histogram of your original variable.

par(mfrow = c(1, 2))

hist(train_df$Diameter, main = "Original")

hist(exp_samples, main = "Exponential Sample")

par(mfrow = c(1, 1))

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

pi_5 <- qexp(0.05, shifted_diameter$estimate)
pi_5
## [1] 0.05254978
pi_95 <- qexp(0.95, shifted_diameter$estimate)
pi_95
## [1] 3.069116
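
These quantiles come from inverting the exponential CDF; for example, the 95th percentile is \(-\ln(0.05)/\hat{\lambda} \approx 2.996/0.976 \approx 3.07\), matching the value above.

\[ F(x) = 1 - e^{-\lambda x} \quad \Rightarrow \quad x_p = \frac{-\ln(1-p)}{\lambda} \]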

Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

# assuming normality
mean <- mean(exp_samples)
std_error <- sd(exp_samples)/sqrt(length(data))  # note: 'data' is the base R function here, so length(data) is 1; see the corrected sketch below

CI <- c(mean - 1.96 * std_error, mean + 1.96 * std_error)
CI
## [1] -0.9647212  2.9874881
# ci on exp_samples
exp_pi_5 <- quantile(exp_samples, 0.05)
exp_pi_95 <- quantile(exp_samples, 0.95)

exp_pi_5
##         5% 
## 0.05491338
exp_pi_95
##      95% 
## 2.958447
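
Because of the length(data) issue noted above, the printed interval is roughly the mean plus or minus 1.96 standard deviations rather than 1.96 standard errors. A corrected sketch (not re-run here):

# 95% CI for the mean of the sampled values, using the proper standard error
se <- sd(exp_samples)/sqrt(length(exp_samples))
c(mean(exp_samples) - 1.96 * se, mean(exp_samples) + 1.96 * se)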

d) Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

train_pca <- train_df |>
    dplyr::select(-Sex)

Here, I standardized the data so that each variable is centered at its mean and scaled to unit standard deviation.

train_normalized <- scale(train_pca)
head(train_normalized)
##          Length    Diameter     Height       Weight Shucked Weight
## [1,]  0.7212332  0.63397775  0.2923977  0.441801426      0.4671847
## [2,] -0.7557067 -0.84035033 -0.7941577 -1.025191245     -0.9936809
## [3,]  0.2433997  0.37070488  0.2923977  0.110075046      0.2199225
## [4,]  1.3293850  1.63441467  1.6505920  2.156468183      1.8246040
## [5,] -0.2344338 -0.05053172 -0.1150606 -0.007598159      0.3334613
## [6,]  0.6343544  0.63397775  0.6998560  0.431715150      0.5882928
##      Viscera Weight Shell Weight         Age
## [1,]      0.5691823   0.45337303 -0.30480262
## [2,]     -0.9788731  -0.92678160 -0.61974448
## [3,]      0.1783618  -0.01722411 -0.30480262
## [4,]      2.1246076   2.30807939  0.32508111
## [5,]     -0.1972320  -0.21495400 -0.61974448
## [6,]      0.6199382   0.33868969  0.01013925

Then, perform principal component analysis (PCA).

data_pca <- princomp(train_normalized)
summary(data_pca)
## Importance of components:
##                          Comp.1     Comp.2     Comp.3     Comp.4      Comp.5
## Standard deviation     2.642825 0.78010357 0.41082335 0.32134737 0.251281016
## Proportion of Variance 0.873077 0.07607122 0.02109726 0.01290819 0.007892875
## Cumulative Proportion  0.873077 0.94914825 0.97024551 0.98315370 0.991046576
##                             Comp.6      Comp.7      Comp.8
## Standard deviation     0.227590692 0.101849691 0.097239596
## Proportion of Variance 0.006474778 0.001296687 0.001181958
## Cumulative Proportion  0.997521354 0.998818042 1.000000000

It produces 8 principal components, corresponding to the number of variables in the data. The first two components together account for about 95% of the total variance; in other words, the first two components represent the data well.
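
A scree plot makes the drop-off in explained variance easy to see (a sketch; factoextra is already loaded):

# scree plot of the proportion of variance explained by each component
fviz_eig(data_pca, addlabels = TRUE)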

data_pca$loadings[, 1]
##         Length       Diameter         Height         Weight Shucked Weight 
##      0.3674625      0.3682866      0.3576739      0.3717238      0.3593877 
## Viscera Weight   Shell Weight            Age 
##      0.3646823      0.3671664      0.2574165

These loadings represent the weight each original variable contributes to the first principal component.

model_1 <- lm(Age ~ Length, data = train_df)
summary(model_1)
## 
## Call:
## lm(formula = Age ~ Length, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2309 -1.5546 -0.6248  0.7133 17.7978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.05878    0.04321    24.5   <2e-16 ***
## Length       6.76227    0.03204   211.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 74049 degrees of freedom
## Multiple R-squared:  0.3756, Adjusted R-squared:  0.3756 
## F-statistic: 4.454e+04 on 1 and 74049 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_1)
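
The exponents applied below (Age^-0.41, Length^0.3, and so on) are Box-Cox style power transformations. One way such values can be obtained is with car::powerTransform, which is already loaded; this is only a sketch of that approach, not necessarily how the exponents above were chosen:

# Box-Cox style estimates of a power transformation (illustrative only)
summary(powerTransform(train_pca$Age))
summary(powerTransform(train_pca$Length))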

df2 <- train_pca |>
    mutate(Age = Age^-0.41) |>
    mutate(Length = Length^0.3) |>
    mutate(Diameter = Diameter^0.3) |>
    mutate(Height = Height^0.26)
model_3 <- lm(Age ~ Length + Diameter + Height, data = df2)
summary(model_3)
## 
## Call:
## lm(formula = Age ~ Length + Diameter + Height, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27566 -0.01735  0.00497  0.02180  0.52644 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.922492   0.001840  501.36   <2e-16 ***
## Length       0.029164   0.011857    2.46   0.0139 *  
## Diameter    -0.316118   0.012125  -26.07   <2e-16 ***
## Height      -0.314387   0.005653  -55.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03388 on 74047 degrees of freedom
## Multiple R-squared:  0.5905, Adjusted R-squared:  0.5905 
## F-statistic: 3.56e+04 on 3 and 74047 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_3)

test_id <- fread("Final-Project/train.csv")
predicted_age <- as.data.frame(test_id$id)
predictions <- predict(model_3, newdata = test_df)
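
Because model_3 was fit on transformed predictors and on Age^-0.41, the test predictors need the same transforms and the predictions need to be back-transformed before being written out. A sketch of that step (the submission file name and the id/Age column layout are assumptions based on the usual Kaggle sample submission):

# apply the same power transforms used in df2, predict, back-transform, and write the submission
test_transformed <- test_df |>
    mutate(Length = Length^0.3, Diameter = Diameter^0.3, Height = Height^0.26)
predictions <- predict(model_3, newdata = test_transformed)
submission <- data.frame(id = test_id$id, Age = predictions^(-1/0.41))
write.csv(submission, "submission.csv", row.names = FALSE)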

Kaggle UserName : Nick AMC

Score : 1.61924