Final Project

# Libraries needed
library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(psych)
library(corrplot)
library(pracma)
library(matrixcalc)
library(MASS)
library(fitdistrplus)
library(xgboost)
library(caret)

Problem 1:

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

set.seed(1234)
X = runif(n=10000,
          min = 5,
          max = 15)

Y = rnorm(10000,
          mean = 10,
          sd = 2.89)



ggplot(data = data.frame(val = X), aes(x = val)) + 
  geom_histogram(binwidth = 0.5) +
  ggtitle("X Normal Distribution") +
  xlab("Value")+
  theme_bw()

ggplot(data = data.frame(val = Y), aes(x = val)) + 
  geom_histogram(binwidth = 0.5) +
  ggtitle("Y Normal Distribution") +
  xlab("Value")+
  theme_bw()

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

x <- median(X)
y <- median(Y)

print(paste("Median of X:", x, " | Median of Y:", y))
## [1] "Median of X: 10.0126582302619  | Median of Y: 10.0315387998106"
  1. \[P(X>x | X>y)\]

\[P(X>x|X>y)=\frac{P(X>x\cap X>y)}{P(X>y)} \]

p_xx_xy <- mean(X > x & X > y)
p_xy <- mean(X > y)
p_conditional <- p_xx_xy / p_xy
p_conditional
## [1] 1
  1. \[P(X>x & Y>y)\]

\[ P(X>x,Y>y)=P(X>x \cap Y>y)\]

p_joint <- mean(X > x & Y > y)
p_joint
## [1] 0.2507
  1. \[P(X<x | X>y)\]

\[P(X<x|X>y) = \frac{P(X<x \cap X>y)}{P(X>y)} \] This is just 0 since there the median of X is larger than the median of Y making it so that if x is smaller than the median of x there is no possibility that it can be larger than the median of y.

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

p_cond <- mean(X < x & X > y)
p_cond
## [1] 0

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

contingency_table <- table(X > x, Y > y)

# Fisher's Exact Test
ft_result <- fisher.test(contingency_table)
ft_result
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264
# Chi-square Test
chi_square_result <- chisq.test(contingency_table)
chi_square_result
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 0.0676, df = 1, p-value = 0.7949

Since the P-Values of Fisher’s Exact Test and Chi-Square Test are both at 0.7949. We fail to reject the null hypothesis that X and Y are independent. The difference between the Fisher and Chi-Square test is that Fisher’s is typically used for smaller sample sizes as Chi-Square are for larger samples. As well as methodologies are that Fisher’s calculates an exact probability compared to Chi-Square which is an approximation using chi-square statistics. We aren’t surprised at the result since both Fisher’s Exact Test and Chi-Square Test suggest no significant association between the variables in your dataset.

Problem 2:

Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16

test_df <- read.csv("https://raw.githubusercontent.com/Jlok17/Data-Science-Projects/main/test.csv")
train_df <- read.csv("https://raw.githubusercontent.com/Jlok17/Data-Science-Projects/main/train.csv")
train_df <- train_df[, !names(train_df) %in% "id"]


describe(train_df)
##                vars     n  mean    sd median trimmed   mad  min   max range
## Sex*              1 74051  2.06  0.82   2.00    2.07  1.48 1.00  3.00  2.00
## Length            2 74051  1.32  0.29   1.38    1.35  0.28 0.19  2.01  1.83
## Diameter          3 74051  1.02  0.24   1.07    1.05  0.22 0.14  1.61  1.48
## Height            4 74051  0.35  0.09   0.36    0.35  0.09 0.00  2.83  2.83
## Weight            5 74051 23.39 12.65  23.80   23.05 13.98 0.06 80.10 80.04
## Shucked.Weight    6 74051 10.10  5.62   9.91    9.92  6.18 0.03 42.18 42.16
## Viscera.Weight    7 74051  5.06  2.79   4.99    4.97  3.07 0.04 21.55 21.50
## Shell.Weight      8 74051  6.72  3.58   6.93    6.62  3.80 0.04 28.49 28.45
## Age               9 74051  9.97  3.18  10.00    9.69  2.97 1.00 29.00 28.00
##                 skew kurtosis   se
## Sex*           -0.10    -1.51 0.00
## Length         -0.84     0.29 0.00
## Diameter       -0.81     0.18 0.00
## Height          0.09    14.15 0.00
## Weight          0.23    -0.40 0.05
## Shucked.Weight  0.35    -0.12 0.02
## Viscera.Weight  0.29    -0.37 0.01
## Shell.Weight    0.28    -0.14 0.01
## Age             1.09     2.30 0.01

Descriptive and Inferential Statistics:

Since ID isn’t useful for training data as it just holds unique identifiers that doesn’t provide any value. It should be noted that column[“sex”] isn’t added to these histograms and scatterplots since I wanted to look at only numerical values.

train_df %>%
  keep(is.numeric) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Counts of Variables", x = "Value") +
  theme_minimal()

train_df <- train_df[, !names(train_df) %in% "Sex"]
train_df %>%
  pivot_longer(cols = -Age, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value, y = Age)) +
  geom_point() +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Scatterplots of Variables vs Age", x = "Value", y = "Age") +
  theme_minimal()

correlation_Matrix <- cor(train_df[, 1:8])
corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
         number.cex = 0.7, tl.col = "black", tl.cex = 0.8,
         diag = FALSE, order = "hclust", addrect = 2, rect.col = "grey")

cor_results <- list()

for (col_name in names(train_df)) {
    if (is.numeric(train_df[[col_name]]) && col_name != "Age") {
        cor_result <- cor.test(train_df$Age, train_df[[col_name]])
        cor_results[[col_name]] <- cor_result
    }
}

cor_results
## $Length
## 
##  Pearson's product-moment correlation
## 
## data:  train_df$Age and train_df[[col_name]]
## t = 211.04, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6083257 0.6173207
## sample estimates:
##       cor 
## 0.6128431 
## 
## 
## $Diameter
## 
##  Pearson's product-moment correlation
## 
## data:  train_df$Age and train_df[[col_name]]
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6168134 0.6256588
## sample estimates:
##       cor 
## 0.6212559 
## 
## 
## $Height
## 
##  Pearson's product-moment correlation
## 
## data:  train_df$Age and train_df[[col_name]]
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6337771 0.6423175
## sample estimates:
##       cor 
## 0.6380669 
## 
## 
## $Weight
## 
##  Pearson's product-moment correlation
## 
## data:  train_df$Age and train_df[[col_name]]
## t = 204.73, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5965757 0.6057744
## sample estimates:
##      cor 
## 0.601195 
## 
## 
## $Shucked.Weight
## 
##  Pearson's product-moment correlation
## 
## data:  train_df$Age and train_df[[col_name]]
## t = 158.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4979228 0.5086787
## sample estimates:
##       cor 
## 0.5033202 
## 
## 
## $Viscera.Weight
## 
##  Pearson's product-moment correlation
## 
## data:  train_df$Age and train_df[[col_name]]
## t = 192.15, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5719815 0.5815941
## sample estimates:
##       cor 
## 0.5768078 
## 
## 
## $Shell.Weight
## 
##  Pearson's product-moment correlation
## 
## data:  train_df$Age and train_df[[col_name]]
## t = 241.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6594219 0.6674861
## sample estimates:
##       cor 
## 0.6634733

The Pearson’s correlation test for all the variables except “Age” compared to “Age” showed very low P-Values. As a P-Value of 2.2e-16 which is significantly lower than the value of 0.05 shows that we can reject the null hypothesis. Which further Indicates that the results and correlation is statistically significant. Due to this we can reject the null hypothesis. It was also shown that the confidence interval for each variable was very narrow which suggests a precise estimate. Finally due to the low P-Values we can assume the familywise error to be negible as with a large sample size. Overall Type I error rate doesn’t seem to have concern at this time but if there was suspect to this error rate we can always deploy a Bonferroni correction.

Linear Algebra and Correlation:

precision_matrix <- inv(correlation_Matrix)
corr_preci <- correlation_Matrix * precision_matrix
preci_corr <- precision_matrix * correlation_Matrix
lu.decomposition(preci_corr )
## $L
##              [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## [1,]  1.000000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000
## [2,] -0.892016443  1.00000000  0.00000000  0.00000000  0.00000000  0.00000000
## [3,] -0.035724298 -0.31419950  1.00000000  0.00000000  0.00000000  0.00000000
## [4,]  0.002059007 -0.10300229 -0.08342595  1.00000000  0.00000000  0.00000000
## [5,] -0.053581174 -0.21990895 -0.11658437 -0.40595443  1.00000000  0.00000000
## [6,] -0.027193684 -0.08061884 -0.13726058 -0.23919253 -0.56292232  1.00000000
## [7,]  0.028471839 -0.10080164 -0.34445414 -0.32355276 -0.30158631 -0.71188559
## [8,] -0.002029781 -0.03060876 -0.06783414 -0.01423663  0.05360755  0.02055726
##            [,7] [,8]
## [1,]  0.0000000    0
## [2,]  0.0000000    0
## [3,]  0.0000000    0
## [4,]  0.0000000    0
## [5,]  0.0000000    0
## [6,]  0.0000000    0
## [7,]  1.0000000    0
## [8,] -0.1777528    1
## 
## $U
##          [,1]      [,2]          [,3]       [,4]        [,5]        [,6]
## [1,] 50.03636 -44.63326 -1.787514e+00  0.1030252  -2.6810071  -1.3606731
## [2,]  0.00000  12.54154 -3.940544e+00 -1.2918070  -2.7579960  -1.0110842
## [3,]  0.00000   0.00000  6.509302e+00 -0.5430447  -0.7588829  -0.8934706
## [4,]  0.00000   0.00000  0.000000e+00 77.8744608 -31.6134820 -18.6269893
## [5,]  0.00000   0.00000  0.000000e+00  0.0000000  11.6296254  -6.5465757
## [6,]  0.00000   0.00000 -1.110223e-16  0.0000000   0.0000000   9.5871525
## [7,]  0.00000   0.00000 -7.903518e-17  0.0000000   0.0000000   0.0000000
## [8,]  0.00000   0.00000 -1.176641e-17  0.0000000   0.0000000   0.0000000
##            [,7]       [,8]
## [1,]   1.424627 -0.1015629
## [2,]  -1.264207 -0.3838808
## [3,]  -2.242156 -0.4415529
## [4,] -25.196496 -1.1086695
## [5,]  -3.507336  0.6234357
## [6,]  -6.824956  0.1970856
## [7,]   5.988008 -1.0643852
## [8,]   0.000000  1.8859049

Calculus-Based Probability & Statistics

# fitted exponential distribution

fit_exp <- fitdist(train_df$Weight, "exp")
summary(fit_exp)
## Fitting of the distribution ' exp ' by maximum likelihood 
## Parameters : 
##        estimate   Std. Error
## rate 0.04276206 0.0001570563
## Loglikelihood:  -307467.5   AIC:  614936.9   BIC:  614946.1
# Estimate lambda
shifted_Weight <- fitdistr(train_df$Weight, "exponential")
lambda_optimal <- shifted_Weight$estimate
exp_samples <- tibble(Weight = rexp(1000, lambda_optimal))

par(mfrow = c(1, 2))  
hist(train_df$Weight, main = "Original Weight", xlab = "Weight", col = "skyblue")
hist(exp_samples$Weight, main = "Exponential Distribution", xlab = "Weight", col = "salmon")

# Exponential distribution's CDF
exp_pi_5 <- qexp(0.05, rate = lambda_optimal)
exp_pi_95 <- qexp(0.95, rate = lambda_optimal)
print(paste("5th percentile (Exponential):", exp_pi_5))
## [1] "5th percentile (Exponential): 1.19950481850864"
print(paste("95th percentile (Exponential):", exp_pi_95))
## [1] "95th percentile (Exponential): 70.0558492098341"
# Mean and Standard deviation 
sample_mean <- mean(exp_samples$Weight)
sample_sd <- sd(exp_samples$Weight)

# 95% confidence interval
z_value <- qnorm(0.975)
n <- length(exp_samples$Weight)
moe <- z_value * (sample_sd / sqrt(n))
lower_bound <- sample_mean - moe
upper_bound <- sample_mean + moe
print(paste("95% Confidence Interval (Normal assumption): [", lower_bound, ",", upper_bound, "]"))
## [1] "95% Confidence Interval (Normal assumption): [ 21.659534525738 , 24.5122803206977 ]"
# Empirical 5th and 95th percentiles
emp_pi_5 <- quantile(exp_samples$Weight, 0.05)
emp_pi_95 <- quantile(exp_samples$Weight, 0.95)

# Print the empirical 5th and 95th percentiles
print(paste("Empirical 5th percentile:", emp_pi_5))
## [1] "Empirical 5th percentile: 1.25345648637394"
print(paste("Empirical 95th percentile:", emp_pi_95))
## [1] "Empirical 95th percentile: 67.5297114706328"

The Exponential 5th and 95th percentile are 1.1995 and 70.055 compared to the Empirical of 1.2534 and 67.529. As both percentiles values are close to one another it can be suggested that the data follows the pattern of assumed distribution.

Modeling:

# Multiple linear regression model
model <- lm(Age ~ Length + Diameter + Height + Weight + Shucked.Weight + Viscera.Weight + Shell.Weight, data = train_df)
summary(model)
## 
## Call:
## lm(formula = Age ~ Length + Diameter + Height + Weight + Shucked.Weight + 
##     Viscera.Weight + Shell.Weight, data = train_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.2001  -1.2411  -0.3769   0.7587  21.7936 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.275640   0.068356  33.291  < 2e-16 ***
## Length          0.842589   0.194685   4.328 1.51e-05 ***
## Diameter        2.909390   0.241185  12.063  < 2e-16 ***
## Height          7.905423   0.238777  33.108  < 2e-16 ***
## Weight          0.198657   0.005484  36.223  < 2e-16 ***
## Shucked.Weight -0.628563   0.006705 -93.743  < 2e-16 ***
## Viscera.Weight -0.186771   0.012003 -15.560  < 2e-16 ***
## Shell.Weight    0.520522   0.009944  52.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.155 on 74043 degrees of freedom
## Multiple R-squared:  0.5392, Adjusted R-squared:  0.5392 
## F-statistic: 1.238e+04 on 7 and 74043 DF,  p-value: < 2.2e-16

The model reveals a positive correlation between Age and variables such as Length, Diameter, Height, Weight, and Shell Weight, whereas Shucked Weight and Viscera Weight display a negative correlation with Age. The model shows a strong fit with accounting for 53.92% of the variance within the variable “Age” with a low residual standard error of 2.155. With the F-Statistic at 1.238e+04 and a P-Value of less than 2.2e-16, that will support that the model is statistically significant.

# Impute missing values in both train and test data
preProcValues <- preProcess(train_df, method = c("medianImpute"))
train_df_imputed <- predict(preProcValues, newdata = train_df)

preProcValues_test <- preProcess(test_df, method = c("medianImpute"))
test_df_imputed <- predict(preProcValues_test, newdata = test_df)

# Fit a K-Nearest Neighbors (KNN) classifier
knn_model <- train(
  Age ~ .,                             
  data = train_df_imputed,             
  method = "knn",                      
  trControl = trainControl(method = "cv", number = 5), 
  preProcess = c("center", "scale")    
)

# Make predictions
predictions <- predict(knn_model, newdata = test_df_imputed)
## Outputting Submission
#submission_df <- data.frame(Id = test_df$id, Age = predictions)
#write.csv(submission_df, "submission.csv", row.names = FALSE)

Kaggle:

Username: Josh L17 Score: 1.51935