Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
set.seed(1234)
X <- runif(10000, min = 5, max = 15)
Y <- rnorm(10000, mean = 10, sd = 2.89)
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.003 7.525 10.013 10.003 12.475 14.996
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.564 8.113 10.032 10.033 11.949 20.456
hist(X)
hist(Y)
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
x <- median(X)
y <- median(Y)
x
## [1] 10.01266
y
## [1] 10.03154
The probability that X is greater than its median (x) given that X is already greater than the median of Y.
First, I created a subset where X > y.
X_larger_y <- X[X > y]
summary(X_larger_y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.03 11.26 12.48 12.49 13.72 15.00
Next, I used the mean function to calculate the probability that a value in the subset is greater than the median.
X_larger_x <- mean(X_larger_y > x)
X_larger_x
## [1] 1
The probability of X being larger than its median, given that X is greater than y, is 1. This makes sense as the medians are pretty much the same. If X is greater than the median of Y, it is almost certainly going to be greater than the median of X as well.
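As a quick sanity check (a minimal sketch, not part of the original solution), the same conditional probability can be computed directly from its definition, P(X > x | X > y) = P(X > x and X > y) / P(X > y).
#Conditional probability from its definition; should match the subset approach above
p_cond <- mean(X > x & X > y) / mean(X > y)
p_cond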
The probability that both X and Y are greater than their respective medians.
As before, I will use the mean function to calculate the proportion of observations where both conditions are true.
joint_prob <- mean(X > x & Y > y)
joint_prob
## [1] 0.2507
There is roughly a 25% chance that both conditions are met simultaneously.
This problem is similar to the first one: it asks for the probability that X < x, given that X is greater than y. Intuitively, it should be 0 or close to it, since the probability of X > x under the same condition was 1. The calculation is essentially the same as in the first question.
X_less_x <- mean(X_larger_y < x)
X_less_x
## [1] 0
The probability is 0.
This is a contingency table evaluating the marginal and joint probabilities.
probability_matrix <- matrix(c((mean(X > x & Y > y)), (mean(X > x & Y <= y)),
(mean(X <= x & Y > y)), (mean(X <= x & Y <= y))),
nrow = 2, ncol = 2, byrow = TRUE,
dimnames = list(c("X > x", "X <= x"), c("Y > y", "Y <= y")))
probability_matrix <- rbind(probability_matrix, colSums(probability_matrix))
probability_matrix <- cbind(probability_matrix, rowSums(probability_matrix))
print(probability_matrix)
## Y > y Y <= y
## X > x 0.2507 0.2493 0.5
## X <= x 0.2493 0.2507 0.5
## 0.5000 0.5000 1.0
Based on this table, the joint probability P(X > x & Y > y) is 0.2507. The individual probabilities P(X > x) and P(Y > y) are both 0.5, so the product of the marginal probabilities is 0.25. The joint probability is therefore slightly higher than the product of the marginal probabilities. This small difference could indicate a slight dependence between X and Y, although with 10,000 independent draws a gap this small is also consistent with ordinary sampling variation. The joint probability and the product of the marginal probabilities are equal exactly when the two events are independent.
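The comparison can be made explicit with a quick numerical check (a small added sketch using the variables defined above):
#Compare the joint probability with the product of the marginal probabilities
joint <- mean(X > x & Y > y)
marginal_product <- mean(X > x) * mean(Y > y)
c(joint = joint, product = marginal_product, difference = joint - marginal_product)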
Let’s make a count matrix to perform Fisher’s Exact Test and the Chi-Square Test.
matrix_prob <- matrix(c((sum(X > x & Y > y)), (sum(X > x & Y <= y)),
(sum(X <= x & Y > y)), (sum(X <= x & Y <= y))),
nrow = 2, ncol = 2, byrow = TRUE)
matrix_prob
## [,1] [,2]
## [1,] 2507 2493
## [2,] 2493 2507
#Fisher's Exact Test
fisher.test(matrix_prob)
##
## Fisher's Exact Test for Count Data
##
## data: matrix_prob
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
#Chi Square Test
chisq.test(matrix_prob)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix_prob
## X-squared = 0.0676, df = 1, p-value = 0.7949
Fisher’s exact test and the chi-square test are both used to determine whether there is a significant association between two categorical variables. Both test the independence of the variables, and both produce a p-value. A p-value less than the typical alpha level of 0.05 suggests that we should reject the null hypothesis of independence, indicating a significant association between the two variables, X and Y. The p-value for Fisher’s exact test is 0.7949, and the p-value for the chi-square test is 0.7949 as well. Therefore, we fail to reject the null hypothesis in both tests, indicating that there is no statistically significant association between X and Y.
Fisher’s exact test is typically used when sample sizes are small, while the chi-square test is more appropriate for large sample sizes, so with 10,000 observations the chi-square test is the better choice here. The results are not surprising: the contingency table from the previous problem showed that the joint probability was almost exactly the product of the marginal probabilities, which is what we expect when the variables are independent.
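To make the mechanics of the chi-square test concrete, the statistic can be reproduced by hand from the 2x2 count table (a small sketch without the Yates continuity correction, so it will differ slightly from the corrected output above):
#Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(matrix_prob), colSums(matrix_prob)) / sum(matrix_prob)
#Pearson chi-square statistic and its p-value on 1 degree of freedom
chi_sq <- sum((matrix_prob - expected)^2 / expected)
pchisq(chi_sq, df = 1, lower.tail = FALSE)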
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
df <- read_csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20605/playground-series-s3e16/train.csv")
## Rows: 74051 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (9): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
## Rows: 74,051
## Columns: 10
## $ id <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ Sex <chr> "I", "I", "M", "F", "I", "M", "M", "I", "F", "M", "I"…
## $ Length <dbl> 1.5250, 1.1000, 1.3875, 1.7000, 1.2500, 1.5000, 1.575…
## $ Diameter <dbl> 1.1750, 0.8250, 1.1125, 1.4125, 1.0125, 1.1750, 1.137…
## $ Height <dbl> 0.3750, 0.2750, 0.3750, 0.5000, 0.3375, 0.4125, 0.350…
## $ Weight <dbl> 28.973189, 10.418441, 24.777463, 50.660556, 23.289114…
## $ `Shucked Weight` <dbl> 12.7289255, 4.5217453, 11.3398000, 20.3549410, 11.977…
## $ `Viscera Weight` <dbl> 6.6479577, 2.3246590, 5.5565020, 10.9918385, 4.507570…
## $ `Shell Weight` <dbl> 8.348928, 3.401940, 6.662133, 14.996885, 5.953395, 7.…
## $ Age <dbl> 9, 8, 9, 11, 8, 10, 11, 11, 12, 11, 7, 10, 10, 7, 9, …
#Univariate descriptive statistics
summary(df)
## id Sex Length Diameter
## Min. : 0 Length:74051 Min. :0.1875 Min. :0.1375
## 1st Qu.:18513 Class :character 1st Qu.:1.1500 1st Qu.:0.8875
## Median :37025 Mode :character Median :1.3750 Median :1.0750
## Mean :37025 Mean :1.3175 Mean :1.0245
## 3rd Qu.:55538 3rd Qu.:1.5375 3rd Qu.:1.2000
## Max. :74050 Max. :2.0128 Max. :1.6125
## Height Weight Shucked Weight Viscera Weight
## Min. :0.0000 Min. : 0.0567 Min. : 0.02835 Min. : 0.04252
## 1st Qu.:0.3000 1st Qu.:13.4377 1st Qu.: 5.71242 1st Qu.: 2.86330
## Median :0.3625 Median :23.7994 Median : 9.90815 Median : 4.98951
## Mean :0.3481 Mean :23.3852 Mean :10.10427 Mean : 5.05839
## 3rd Qu.:0.4125 3rd Qu.:32.1625 3rd Qu.:14.03300 3rd Qu.: 6.98815
## Max. :2.8250 Max. :80.1015 Max. :42.18406 Max. :21.54562
## Shell Weight Age
## Min. : 0.04252 Min. : 1.000
## 1st Qu.: 3.96893 1st Qu.: 8.000
## Median : 6.93145 Median :10.000
## Mean : 6.72387 Mean : 9.968
## 3rd Qu.: 9.07184 3rd Qu.:11.000
## Max. :28.49125 Max. :29.000
A couple of things that stand out in the summary statistics are the wide ranges of the Weight and Shell Weight variables.
#Histogram of all the numeric variables
numeric_cols <- sapply(df, is.numeric)
numeric_col_names <- names(df)[numeric_cols]
plot_list <- list()
for (col_name in numeric_col_names) {
#use the .data pronoun to refer to the column by its name as a string
p <- ggplot(df, aes(x = .data[[col_name]])) +
geom_histogram(bins = 30) +
theme_minimal() +
ggtitle(paste("Histogram of", col_name))
plot_list[[col_name]] <- p
}
do.call(grid.arrange, c(plot_list, ncol = 3))
For the most part, the histograms of the variables are right-skewed.
#Scatter plot of Diameter vs Age
df %>%
ggplot(aes(x=Diameter, y=Age)) +
geom_point() +
theme_minimal() +
ggtitle("Diameter vs Age")
Kind of looks like there’s a positive correlation between Diameter and Age here.
#Scatter plot of Height vs Age
df %>%
ggplot(aes(x=Height, y=Age)) +
geom_point() +
theme_minimal() +
ggtitle("Height vs Age") +
coord_cartesian(xlim = c(0, 1))
It looks like a couple of outliers are skewing the plot, which is why the x-axis is limited above. If those two points are set aside, there seems to be a positive correlation between Height and Age.
#Scatter plot of Shell Weight vs Age
df %>%
ggplot(aes(x=`Shell Weight`, y=Age)) +
geom_point() +
theme_minimal() +
ggtitle("Shell Weight vs Age")
Looks like a positive correlation between shell weight and age.
#Correlation Matrix
cor_matrix <- cor(df[,c("Diameter","Height","Shell Weight")])
print(cor_matrix)
## Diameter Height Shell Weight
## Diameter 1.0000000 0.9213531 0.9226882
## Height 0.9213531 1.0000000 0.9033982
## Shell Weight 0.9226882 0.9033982 1.0000000
It looks like the 3 variables I picked have a strong correlation with one another. They also seem to be similarly correlated to Age.
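Since the comment above also refers to how these predictors relate to Age, here is a quick check of those correlations (a small added sketch using the columns already in df):
#Correlation of each chosen predictor with Age
cor(df[, c("Diameter", "Height", "Shell Weight")], df$Age)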
#Correlation Test
test_dia_hei <- cor.test(df$Diameter, df$Height, conf.level = 0.80)
print("Diameter and Height")
## [1] "Diameter and Height"
print(test_dia_hei)
##
## Pearson's product-moment correlation
##
## data: df$Diameter and df$Height
## t = 644.97, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9206384 0.9220617
## sample estimates:
## cor
## 0.9213531
test_dia_she <- cor.test(df$Diameter, df$`Shell Weight`, conf.level = 0.80)
print("Diameter and Shell Weight")
## [1] "Diameter and Shell Weight"
print(test_dia_she)
##
## Pearson's product-moment correlation
##
## data: df$Diameter and df$`Shell Weight`
## t = 651.23, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9219851 0.9233852
## sample estimates:
## cor
## 0.9226882
test_hei_she <- cor.test(df$Height, df$`Shell Weight`, conf.level = 0.80)
print("Height and Shell Weight")
## [1] "Height and Shell Weight"
print(test_hei_she)
##
## Pearson's product-moment correlation
##
## data: df$Height and df$`Shell Weight`
## t = 573.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9025285 0.9042605
## sample estimates:
## cor
## 0.9033982
The correlation tests for “Diameter and Height”, “Diameter and Shell Weight”, and “Height and Shell Weight” all demonstrate a strong positive correlation between each pair of variables. They also all have p-values far below 0.05, indicating the results are statistically significant, meaning the correlations are unlikely to be due to random chance. In addition, each 80% confidence interval is narrow, suggesting a precise estimate of the correlation.
Given the extremely small p-values (< 2.2e-16), which indicate that we should reject the null hypothesis in each test, I would not be worried about familywise error. Since we are only performing three tests, all of which resulted in very small p-values, the chance that all three results are false positives (Type I errors) is extremely low. Although it is important to consider familywise error, I would not be concerned about it in this instance.
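If one did want to guard against familywise error, a simple option is to adjust the three p-values, for example with a Bonferroni or Holm correction (a minimal sketch using the test objects created above):
#Adjust the three correlation-test p-values for multiple comparisons
p_values <- c(test_dia_hei$p.value, test_dia_she$p.value, test_hei_she$p.value)
p.adjust(p_values, method = "bonferroni")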
#Inverting the correlation matrix
precision_matrix <- solve(cor_matrix)
print("Precision Matrix")
## [1] "Precision Matrix"
print(precision_matrix)
## Diameter Height Shell Weight
## Diameter 9.370051 -4.474177 -4.603672
## Height -4.474177 7.574983 -2.714955
## Shell Weight -4.603672 -2.714955 7.700440
#Multiply the correlation matrix by the precision matrix
product_1 <- cor_matrix %*% precision_matrix
print("Correlation Matrix * Precision Matrix")
## [1] "Correlation Matrix * Precision Matrix"
print(product_1)
## Diameter Height Shell Weight
## Diameter 1.000000e+00 -4.440892e-16 8.881784e-16
## Height 1.776357e-15 1.000000e+00 0.000000e+00
## Shell Weight 8.881784e-16 -8.881784e-16 1.000000e+00
#Multiply the precision matrix by the correlation matrix
product_2 <- precision_matrix %*% cor_matrix
print("Precision Matrix * Correlation Matrix")
## [1] "Precision Matrix * Correlation Matrix"
print(product_2)
## Diameter Height Shell Weight
## Diameter 1.000000e+00 8.881784e-16 0.000000e+00
## Height 8.881784e-16 1.000000e+00 -4.440892e-16
## Shell Weight 0.000000e+00 -8.881784e-16 1.000000e+00
library(Matrix)
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
#LU decomposition of the correlation matrix; expand() returns the P, L, and U factors
ldu <- lu(cor_matrix)
expand(ldu)
## $L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] 0.9213531 1.0000000 .
## [3,] 0.9226882 0.3525715 1.0000000
##
## $U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.9213531 0.9226882
## [2,] . 0.1511084 0.0532765
## [3,] . . 0.1298627
##
## $P
## 3 x 3 sparse Matrix of class "pMatrix"
##
## [1,] | . .
## [2,] . | .
## [3,] . . |
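Since the permutation matrix P is the identity here, L %*% U should reproduce the correlation matrix, and a diagonal matrix D can be pulled out of U to obtain a true LDU factorization with unit-triangular factors (a small added sketch using the objects above):
#Verify that L %*% U reproduces the correlation matrix (P is the identity here)
factors <- expand(ldu)
round(as.matrix(factors$L %*% factors$U) - cor_matrix, 10)
#Pull the diagonal out of U so that both triangular factors have unit diagonals
D <- diag(diag(as.matrix(factors$U)))
U_unit <- solve(D) %*% as.matrix(factors$U)
#L %*% D %*% U_unit should also reproduce the correlation matrix
round(as.matrix(factors$L) %*% D %*% U_unit - cor_matrix, 10)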
The histogram of ‘Shell Weight’ is skewed to the right
df %>%
ggplot(aes(x=`Shell Weight`)) +
geom_histogram() +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#minimum value
min(df$`Shell Weight`)
## [1] 0.04252425
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
#Running fitdistr to fit an exponential probability density function
fit <- fitdistr(df$`Shell Weight`, "exponential")
#The optimal value of lambda
lambda <- fit$estimate
lambda
## rate
## 0.1487239
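For an exponential distribution, the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, so the fitted lambda can be verified directly:
#The MLE of the exponential rate is 1 / sample mean
1 / mean(df$`Shell Weight`)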
#Take 1000 samples from this exponential distribution
samples <- rexp(1000, rate = lambda)
#Histogram of samples data
ggplot(data.frame(samples), aes(x = samples)) +
geom_histogram() +
theme_minimal() +
ggtitle("Histogram of Sampled Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram of the sampled data shows a rapid decline in frequency from left to right compared to the original histogram, indicating many more small values than large ones. Since the shape of the exponential distribution is noticeably different from that of the original histogram, the exponential distribution is probably not the best model for the ‘Shell Weight’ variable.
#Finding the 5th and 95th percentiles of the fitted exponential distribution
p5_sample <- qexp(0.05, rate = lambda)
p95_sample <- qexp(0.95, rate = lambda)
p5_sample
## [1] 0.3448894
p95_sample
## [1] 20.14291
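These percentiles can also be checked against the closed-form exponential quantile function, Q(p) = -log(1 - p) / lambda (a quick added check):
#Closed-form exponential quantiles; should match qexp()
-log(1 - c(0.05, 0.95)) / lambda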
#Empirical 2.5th and 97.5th percentiles of the data (no distributional assumption)
ci_empirical <- quantile(df$`Shell Weight`, probs = c(0.025, 0.975))
ci_empirical
## 2.5% 97.5%
## 0.7087375 13.8912550
This returns the 2.5th and 97.5th percentiles of the ‘Shell Weight’ variable, i.e., the interval containing the middle 95% of the observed data.
#95% confidence interval for the mean, using the z-score, assuming the data follow a normal distribution
test = 1.96 * (sd(df$`Shell Weight`) / sqrt(length(df$`Shell Weight`)))
ci_95 = c(mean(df$`Shell Weight`) - test, mean(df$`Shell Weight`) + test)
ci_95
## [1] 6.698053 6.749687
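For comparison (a small added check), t.test() returns essentially the same normal-theory confidence interval for the mean, since the sample is large:
#Normal-theory 95% CI for the mean of Shell Weight via t.test()
t.test(df$`Shell Weight`)$conf.int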
#Finding the 5th and 95th percentile of the empirical data
p5_empirical <- quantile(df$`Shell Weight`, probs = 0.05)
p95_empirical <- quantile(df$`Shell Weight`, probs = 0.95)
p5_empirical
## 5%
## 1.10563
p95_empirical
## 95%
## 12.7431
The 5th and 95th percentiles of the empirical data and of the fitted exponential distribution are quite different. The fitted distribution’s 5th and 95th percentiles are 0.3448894 and 20.14291, while the empirical data’s 5th and 95th percentiles are 1.10563 and 12.7431. The empirical data do not follow an exponential distribution, so it is not surprising that the percentiles differ so much. The fitted exponential also produces more extreme values than the empirical data, which accounts for the large gap at the 95th percentile.
To find the variables for the multiple regression model, let’s make a correlation matrix with all the continuous variables.
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
cor_matrix <- cor(select_if(df, is.numeric))
corrplot(cor_matrix, method = "number")
Based on this plot, ‘Shell Weight’, ‘Height’, ‘Diameter’, and ‘Length’ are the most positively correlated with Age.
#Was having trouble with the space in 'Shell Weight' so I had to change the column name
df <- df %>%
rename(Shell_Weight = `Shell Weight`)
multiple_regression <- lm(Age ~ Shell_Weight + Height + Diameter + Length, data = df)
summary(multiple_regression)
##
## Call:
## lm(formula = Age ~ Shell_Weight + Height + Diameter + Length,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.3488 -1.3795 -0.4852 0.7508 18.5093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.361115 0.065592 81.734 <2e-16 ***
## Shell_Weight 0.460719 0.006699 68.778 <2e-16 ***
## Height 9.084957 0.259946 34.949 <2e-16 ***
## Diameter 2.242626 0.263066 8.525 <2e-16 ***
## Length -2.999000 0.208923 -14.355 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.353 on 74046 degrees of freedom
## Multiple R-squared: 0.4507, Adjusted R-squared: 0.4507
## F-statistic: 1.519e+04 on 4 and 74046 DF, p-value: < 2.2e-16
Let’s begin with the coefficients. For each unit increase in ‘Shell_Weight’, the predicted Age increases by about 0.46 years, holding the other predictors constant. One thing to note is that ‘Length’ has a negative coefficient: as ‘Length’ increases by one unit with the other predictors held constant, predicted ‘Age’ decreases by about 3 years. This counterintuitive sign likely reflects the strong correlation between ‘Length’ and the other size variables. All of the predictor variables are statistically significant, as they all have very small p-values.
The R-squared is .4507 which means that 45.07% of the variability in ‘Age’ can be explained by the model. More than half of the variation is not accounted for by these predictors, which suggests there are other factors that influence ‘Age’ that are not included. Lastly, the p-value is extremely small indicating that the multiple regression model is statistically significant in predicting ‘Age’.
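One way to check whether that multicollinearity is real is to look at variance inflation factors (a minimal sketch; it assumes the car package is installed, which is not loaded elsewhere in this document):
#Variance inflation factors for the predictors (assumes the car package is available)
library(car)
vif(multiple_regression)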
A quick analysis of the residuals.
par(mfrow=c(2,2))
plot(multiple_regression)
For the Residuals vs Fitted plot, ideally the residuals should be randomly scattered around the horizontal line at zero without any discernible pattern. However, this residual plot shows clear grouping into parallel bands, partly because Age only takes integer values, and the spread is not uniform, suggesting potential non-linearity in the relationship between the predictors and the target variable.
The Q-Q plot looks roughly normal in the middle, but there are deviations in both tails, suggesting possible outliers or heavier-than-normal tails in the data.
Uploading to Kaggle
test_df <- read_csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20605/playground-series-s3e16/test.csv")
## Rows: 49368 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (8): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_df <- test_df %>%
rename(Shell_Weight = `Shell Weight`)
test_predictions <- predict(multiple_regression, newdata= test_df)
submission_df <- data.frame(Id = test_df$id, Age = test_predictions)
write.csv(submission_df, "submission.csv", row.names = FALSE)
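As a rough sanity check before submitting, the in-sample mean absolute error of the model can be computed (this is only a training-set estimate and will not match the leaderboard score exactly):
#In-sample mean absolute error of the fitted model on the training data
mean(abs(df$Age - predict(multiple_regression)))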
Kaggle Username: jianjqc Score: 1.65103 / 1.64027