# Libraries needed
library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(psych)
library(corrplot)
library(pracma)
library(matrixcalc)
library(MASS)
library(fitdistrplus)
library(xgboost)
library(caret)
Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
set.seed(1234)
X = runif(n=10000,
min = 5,
max = 15)
Y = rnorm(10000,
mean = 10,
sd = 2.89)
ggplot(data = data.frame(val = X), aes(x = val)) +
geom_histogram(binwidth = 0.5) +
ggtitle("X Normal Distribution") +
xlab("Value")+
theme_bw()
ggplot(data = data.frame(val = Y), aes(x = val)) +
geom_histogram(binwidth = 0.5) +
ggtitle("Y Normal Distribution") +
xlab("Value")+
theme_bw()
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
x <- median(X)
y <- median(Y)
print(paste("Median of X:", x, " | Median of Y:", y))
## [1] "Median of X: 10.0126582302619 | Median of Y: 10.0315387998106"
\[P(X>x|X>y)=\frac{P(X>x\cap X>y)}{P(X>y)} \]
p_xx_xy <- mean(X > x & X > y)
p_xy <- mean(X > y)
p_conditional <- p_xx_xy / p_xy
p_conditional
## [1] 1
\[ P(X>x,Y>y)=P(X>x \cap Y>y)\]
p_joint <- mean(X > x & Y > y)
p_joint
## [1] 0.2507
\[P(X<x|X>y) = \frac{P(X<x \cap X>y)}{P(X>y)} \] This is just 0 since there the median of X is larger than the median of Y making it so that if x is smaller than the median of x there is no possibility that it can be larger than the median of y.
Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
p_cond <- mean(X < x & X > y)
p_cond
## [1] 0
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?
contingency_table <- table(X > x, Y > y)
# Fisher's Exact Test
ft_result <- fisher.test(contingency_table)
ft_result
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
# Chi-square Test
chi_square_result <- chisq.test(contingency_table)
chi_square_result
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 0.0676, df = 1, p-value = 0.7949
Since the P-Values of Fisher’s Exact Test and Chi-Square Test are both at 0.7949. We fail to reject the null hypothesis that X and Y are independent. The difference between the Fisher and Chi-Square test is that Fisher’s is typically used for smaller sample sizes as Chi-Square are for larger samples. As well as methodologies are that Fisher’s calculates an exact probability compared to Chi-Square which is an approximation using chi-square statistics. We aren’t surprised at the result since both Fisher’s Exact Test and Chi-Square Test suggest no significant association between the variables in your dataset.
Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16
test_df <- read.csv("https://raw.githubusercontent.com/Jlok17/Data-Science-Projects/main/test.csv")
train_df <- read.csv("https://raw.githubusercontent.com/Jlok17/Data-Science-Projects/main/train.csv")
train_df <- train_df[, !names(train_df) %in% "id"]
describe(train_df)
## vars n mean sd median trimmed mad min max range
## Sex* 1 74051 2.06 0.82 2.00 2.07 1.48 1.00 3.00 2.00
## Length 2 74051 1.32 0.29 1.38 1.35 0.28 0.19 2.01 1.83
## Diameter 3 74051 1.02 0.24 1.07 1.05 0.22 0.14 1.61 1.48
## Height 4 74051 0.35 0.09 0.36 0.35 0.09 0.00 2.83 2.83
## Weight 5 74051 23.39 12.65 23.80 23.05 13.98 0.06 80.10 80.04
## Shucked.Weight 6 74051 10.10 5.62 9.91 9.92 6.18 0.03 42.18 42.16
## Viscera.Weight 7 74051 5.06 2.79 4.99 4.97 3.07 0.04 21.55 21.50
## Shell.Weight 8 74051 6.72 3.58 6.93 6.62 3.80 0.04 28.49 28.45
## Age 9 74051 9.97 3.18 10.00 9.69 2.97 1.00 29.00 28.00
## skew kurtosis se
## Sex* -0.10 -1.51 0.00
## Length -0.84 0.29 0.00
## Diameter -0.81 0.18 0.00
## Height 0.09 14.15 0.00
## Weight 0.23 -0.40 0.05
## Shucked.Weight 0.35 -0.12 0.02
## Viscera.Weight 0.29 -0.37 0.01
## Shell.Weight 0.28 -0.14 0.01
## Age 1.09 2.30 0.01
Since ID isn’t useful for training data as it just holds unique identifiers that doesn’t provide any value. It should be noted that column[“sex”] isn’t added to these histograms and scatterplots since I wanted to look at only numerical values.
train_df %>%
keep(is.numeric) %>%
pivot_longer(everything()) %>%
ggplot(aes(value)) +
geom_histogram() +
facet_wrap(~name, scales = "free") +
labs(title = "Counts of Variables", x = "Value") +
theme_minimal()
train_df <- train_df[, !names(train_df) %in% "Sex"]
train_df %>%
pivot_longer(cols = -Age, names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value, y = Age)) +
geom_point() +
facet_wrap(~ Variable, scales = "free") +
labs(title = "Scatterplots of Variables vs Age", x = "Value", y = "Age") +
theme_minimal()
correlation_Matrix <- cor(train_df[, 1:8])
corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
number.cex = 0.7, tl.col = "black", tl.cex = 0.8,
diag = FALSE, order = "hclust", addrect = 2, rect.col = "grey")
cor_results <- list()
for (col_name in names(train_df)) {
if (is.numeric(train_df[[col_name]]) && col_name != "Age") {
cor_result <- cor.test(train_df$Age, train_df[[col_name]])
cor_results[[col_name]] <- cor_result
}
}
cor_results
## $Length
##
## Pearson's product-moment correlation
##
## data: train_df$Age and train_df[[col_name]]
## t = 211.04, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6083257 0.6173207
## sample estimates:
## cor
## 0.6128431
##
##
## $Diameter
##
## Pearson's product-moment correlation
##
## data: train_df$Age and train_df[[col_name]]
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6168134 0.6256588
## sample estimates:
## cor
## 0.6212559
##
##
## $Height
##
## Pearson's product-moment correlation
##
## data: train_df$Age and train_df[[col_name]]
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6337771 0.6423175
## sample estimates:
## cor
## 0.6380669
##
##
## $Weight
##
## Pearson's product-moment correlation
##
## data: train_df$Age and train_df[[col_name]]
## t = 204.73, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5965757 0.6057744
## sample estimates:
## cor
## 0.601195
##
##
## $Shucked.Weight
##
## Pearson's product-moment correlation
##
## data: train_df$Age and train_df[[col_name]]
## t = 158.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4979228 0.5086787
## sample estimates:
## cor
## 0.5033202
##
##
## $Viscera.Weight
##
## Pearson's product-moment correlation
##
## data: train_df$Age and train_df[[col_name]]
## t = 192.15, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5719815 0.5815941
## sample estimates:
## cor
## 0.5768078
##
##
## $Shell.Weight
##
## Pearson's product-moment correlation
##
## data: train_df$Age and train_df[[col_name]]
## t = 241.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6594219 0.6674861
## sample estimates:
## cor
## 0.6634733
The Pearson’s correlation test for all the variables except “Age” compared to “Age” showed very low P-Values. As a P-Value of 2.2e-16 which is significantly lower than the value of 0.05 shows that we can reject the null hypothesis. Which further Indicates that the results and correlation is statistically significant. Due to this we can reject the null hypothesis. It was also shown that the confidence interval for each variable was very narrow which suggests a precise estimate. Finally due to the low P-Values we can assume the familywise error to be negible as with a large sample size. Overall Type I error rate doesn’t seem to have concern at this time but if there was suspect to this error rate we can always deploy a Bonferroni correction.
precision_matrix <- inv(correlation_Matrix)
corr_preci <- correlation_Matrix * precision_matrix
preci_corr <- precision_matrix * correlation_Matrix
lu.decomposition(preci_corr )
## $L
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.000000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## [2,] -0.892016443 1.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## [3,] -0.035724298 -0.31419950 1.00000000 0.00000000 0.00000000 0.00000000
## [4,] 0.002059007 -0.10300229 -0.08342595 1.00000000 0.00000000 0.00000000
## [5,] -0.053581174 -0.21990895 -0.11658437 -0.40595443 1.00000000 0.00000000
## [6,] -0.027193684 -0.08061884 -0.13726058 -0.23919253 -0.56292232 1.00000000
## [7,] 0.028471839 -0.10080164 -0.34445414 -0.32355276 -0.30158631 -0.71188559
## [8,] -0.002029781 -0.03060876 -0.06783414 -0.01423663 0.05360755 0.02055726
## [,7] [,8]
## [1,] 0.0000000 0
## [2,] 0.0000000 0
## [3,] 0.0000000 0
## [4,] 0.0000000 0
## [5,] 0.0000000 0
## [6,] 0.0000000 0
## [7,] 1.0000000 0
## [8,] -0.1777528 1
##
## $U
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 50.03636 -44.63326 -1.787514e+00 0.1030252 -2.6810071 -1.3606731
## [2,] 0.00000 12.54154 -3.940544e+00 -1.2918070 -2.7579960 -1.0110842
## [3,] 0.00000 0.00000 6.509302e+00 -0.5430447 -0.7588829 -0.8934706
## [4,] 0.00000 0.00000 0.000000e+00 77.8744608 -31.6134820 -18.6269893
## [5,] 0.00000 0.00000 0.000000e+00 0.0000000 11.6296254 -6.5465757
## [6,] 0.00000 0.00000 -1.110223e-16 0.0000000 0.0000000 9.5871525
## [7,] 0.00000 0.00000 -7.903518e-17 0.0000000 0.0000000 0.0000000
## [8,] 0.00000 0.00000 -1.176641e-17 0.0000000 0.0000000 0.0000000
## [,7] [,8]
## [1,] 1.424627 -0.1015629
## [2,] -1.264207 -0.3838808
## [3,] -2.242156 -0.4415529
## [4,] -25.196496 -1.1086695
## [5,] -3.507336 0.6234357
## [6,] -6.824956 0.1970856
## [7,] 5.988008 -1.0643852
## [8,] 0.000000 1.8859049
# fitted exponential distribution
fit_exp <- fitdist(train_df$Weight, "exp")
summary(fit_exp)
## Fitting of the distribution ' exp ' by maximum likelihood
## Parameters :
## estimate Std. Error
## rate 0.04276206 0.0001570563
## Loglikelihood: -307467.5 AIC: 614936.9 BIC: 614946.1
# Estimate lambda
shifted_Weight <- fitdistr(train_df$Weight, "exponential")
lambda_optimal <- shifted_Weight$estimate
exp_samples <- tibble(Weight = rexp(1000, lambda_optimal))
par(mfrow = c(1, 2))
hist(train_df$Weight, main = "Original Weight", xlab = "Weight", col = "skyblue")
hist(exp_samples$Weight, main = "Exponential Distribution", xlab = "Weight", col = "salmon")
# Exponential distribution's CDF
exp_pi_5 <- qexp(0.05, rate = lambda_optimal)
exp_pi_95 <- qexp(0.95, rate = lambda_optimal)
print(paste("5th percentile (Exponential):", exp_pi_5))
## [1] "5th percentile (Exponential): 1.19950481850864"
print(paste("95th percentile (Exponential):", exp_pi_95))
## [1] "95th percentile (Exponential): 70.0558492098341"
# Mean and Standard deviation
sample_mean <- mean(exp_samples$Weight)
sample_sd <- sd(exp_samples$Weight)
# 95% confidence interval
z_value <- qnorm(0.975)
n <- length(exp_samples$Weight)
moe <- z_value * (sample_sd / sqrt(n))
lower_bound <- sample_mean - moe
upper_bound <- sample_mean + moe
print(paste("95% Confidence Interval (Normal assumption): [", lower_bound, ",", upper_bound, "]"))
## [1] "95% Confidence Interval (Normal assumption): [ 21.659534525738 , 24.5122803206977 ]"
# Empirical 5th and 95th percentiles
emp_pi_5 <- quantile(exp_samples$Weight, 0.05)
emp_pi_95 <- quantile(exp_samples$Weight, 0.95)
# Print the empirical 5th and 95th percentiles
print(paste("Empirical 5th percentile:", emp_pi_5))
## [1] "Empirical 5th percentile: 1.25345648637394"
print(paste("Empirical 95th percentile:", emp_pi_95))
## [1] "Empirical 95th percentile: 67.5297114706328"
The Exponential 5th and 95th percentile are 1.1995 and 70.055 compared to the Empirical of 1.2534 and 67.529. As both percentiles values are close to one another it can be suggested that the data follows the pattern of assumed distribution.
# Multiple linear regression model
model <- lm(Age ~ Length + Diameter + Height + Weight + Shucked.Weight + Viscera.Weight + Shell.Weight, data = train_df)
summary(model)
##
## Call:
## lm(formula = Age ~ Length + Diameter + Height + Weight + Shucked.Weight +
## Viscera.Weight + Shell.Weight, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.2001 -1.2411 -0.3769 0.7587 21.7936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.275640 0.068356 33.291 < 2e-16 ***
## Length 0.842589 0.194685 4.328 1.51e-05 ***
## Diameter 2.909390 0.241185 12.063 < 2e-16 ***
## Height 7.905423 0.238777 33.108 < 2e-16 ***
## Weight 0.198657 0.005484 36.223 < 2e-16 ***
## Shucked.Weight -0.628563 0.006705 -93.743 < 2e-16 ***
## Viscera.Weight -0.186771 0.012003 -15.560 < 2e-16 ***
## Shell.Weight 0.520522 0.009944 52.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.155 on 74043 degrees of freedom
## Multiple R-squared: 0.5392, Adjusted R-squared: 0.5392
## F-statistic: 1.238e+04 on 7 and 74043 DF, p-value: < 2.2e-16
The model reveals a positive correlation between Age and variables such as Length, Diameter, Height, Weight, and Shell Weight, whereas Shucked Weight and Viscera Weight display a negative correlation with Age. The model shows a strong fit with accounting for 53.92% of the variance within the variable “Age” with a low residual standard error of 2.155. With the F-Statistic at 1.238e+04 and a P-Value of less than 2.2e-16, that will support that the model is statistically significant.
# Impute missing values in both train and test data
preProcValues <- preProcess(train_df, method = c("medianImpute"))
train_df_imputed <- predict(preProcValues, newdata = train_df)
preProcValues_test <- preProcess(test_df, method = c("medianImpute"))
test_df_imputed <- predict(preProcValues_test, newdata = test_df)
# Fit a K-Nearest Neighbors (KNN) classifier
knn_model <- train(
Age ~ .,
data = train_df_imputed,
method = "knn",
trControl = trainControl(method = "cv", number = 5),
preProcess = c("center", "scale")
)
# Make predictions
predictions <- predict(knn_model, newdata = test_df_imputed)
## Outputting Submission
#submission_df <- data.frame(Id = test_df$id, Age = predictions)
#write.csv(submission_df, "submission.csv", row.names = FALSE)
Username: Josh L17 Score: 1.51935