GITHUB LINK: https://github.com/rkasa01/DATA605_FINALPROJECT/tree/main

Problem 1

Task:

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

5 points. a. P(X>x | X>y) b. P(X>x & Y>y) c. P(X<x | X>y)

5 points. Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

Setting the Seed

set.seed(1234)

I started by setting the random seed to 1234 so that the results are reproducible.

Generating Variables

X <- runif(10000, min = 5, max = 15)
Y <- rnorm(10000, mean = 10, sd = 2.89)

We are given that random variable X has 10,000 continuous uniform values between 5 and 15, and that random variable Y has 10,000 normal values with a mean of 10 and a standard deviation of 2.89. Accordingly, I generated X with 10,000 continuous uniform values with a minimum of 5 and a maximum of 15, and Y with 10,000 normal values with a mean of 10 and a standard deviation of 2.89.

hist(X)

For X, the histogram shows a uniform distribution: the bin frequencies are all approximately equal across the 5 to 15 range, as expected from runif().

hist(Y)

For Y, the histogram shows a normal, bell-shaped distribution. For a symmetric distribution like this, the mean and median are approximately the same.
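As a quick check of that claim, the two statistics can be printed side by side; a one-line sketch using the Y vector generated above:

# mean and median of Y should nearly coincide for a symmetric distribution
c(mean = mean(Y), median = median(Y))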

Calculating the Probabilities:

a. P(X>x | X>y)

x <- median(X)
y <- median(Y)
x
## [1] 10.01266
y 
## [1] 10.03154
probability_a <- mean(X > x & X > y) / mean(X > y)

cat('P(X>x | X>y):', probability_a)
## P(X>x | X>y): 1

We are told that x and y are the medians of their respective variables, so we assign them those values: x is about 10.01 and y is about 10.03. This first probability is the probability that X exceeds its own median x, given that X already exceeds y (the median of Y). By the definition of conditional probability, P(X>x | X>y) = P(X>x & X>y) / P(X>y), which mean(X > x & X > y) / mean(X > y) computes directly: the numerator is the proportion of values of X above both thresholds, and the denominator is the proportion above y. Because y (10.03) is slightly larger than x (10.01), every value of X above y is automatically above x, so the conditional probability is exactly 1: once we know X > y, it is certain that X > x.

b. P(X>x & Y>y)

probability_B <- mean(X>x & Y>y)

cat('P(X>x & Y>y):', probability_B)
## P(X>x & Y>y): 0.2507

This second probability is the joint probability that X is greater than its median and Y is greater than its median. X > x checks which values in X exceed the median x, and Y > y checks which values in Y exceed the median y. Applying mean() to the logical conjunction gives the proportion of the 10,000 pairs for which both conditions hold simultaneously. Because each median splits its variable in half and X and Y were generated independently, we expect roughly 0.5 × 0.5 = 0.25; the observed probability is 0.2507, or about 0.25.
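Equivalently, the same proportion can be computed with an explicit count; this sketch should print the identical value:

# explicit count of pairs where both exceed their medians, over all pairs
sum(X > x & Y > y) / length(X)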

c. P(X<x | X>y)

probability_C <- mean(X<x & X>y)/mean(X>y)

cat('P(X<x | X>y):', probability_C)
## P(X<x | X>y): 0

This third probability is the probability that X is less than its median x, given that X is greater than y. X < x selects the values in X below the median x, and X > y selects the values above y; mean(X < x & X > y) is the proportion satisfying both conditions, and dividing by mean(X > y) conditions on the event X > y, per the definition of conditional probability. Because y (10.03) is greater than x (10.01), no value of X can be below x and above y at the same time, so the probability is exactly 0. This is the complement of part (a): given X > y, the event X > x is certain and X < x is impossible.

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

table <- table(X>x, Y>y)
table
##        
##         FALSE TRUE
##   FALSE  2507 2493
##   TRUE   2493 2507

For this task, we build a 2×2 contingency table of the events X > x and Y > y. Because x and y are medians, each marginal probability is 0.5 by construction. The joint probability P(X>x & Y>y) is 2507/10000 = 0.2507, while the product of the marginals is P(X>x)P(Y>y) = 0.5 × 0.5 = 0.25. The two are nearly equal, which is exactly what we would expect if X and Y are independent.
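To make the comparison explicit, the counts can be converted to probabilities with base R's prop.table() and addmargins(); a short sketch using the table built above:

joint_probs <- prop.table(table)   # cell counts -> joint probabilities
addmargins(joint_probs)            # append the marginal probabilities

# joint probability of both exceedances vs. the product of the marginals
joint_probs["TRUE", "TRUE"]
mean(X > x) * mean(Y > y)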

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

Null hypothesis: There is no relationship between the two variables, meaning that they are independent of one another. Alternative hypothesis: There is a relationship between the two variables, meaning that they are not independent of one another.

fisher_test <- fisher.test(table)
fisher_test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264

Fisher's Exact Test calculates the exact probability of observing a table at least as extreme as the one observed, given the marginal totals and assuming independence; it is typically used for small samples, where large-sample approximations are unreliable. Here, the null hypothesis states that there is no relationship between the two variables, making them independent; the alternative states that there is a relationship, making them dependent on one another. The p-value of 0.79 is far above any conventional significance level, so we fail to reject the null hypothesis of independence. The 95% confidence interval for the odds ratio, 0.9343 to 1.0946, contains 1, and the estimated odds ratio of 1.011 is very close to 1, both of which provide further evidence that there is no association between the two variables.
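The raw cross-product odds ratio can also be read off the 2×2 table; this sketch approximately reproduces the fisher.test() estimate (which uses a conditional maximum-likelihood estimate rather than the raw cross-product):

# sample (cross-product) odds ratio from the contingency table
(table["TRUE", "TRUE"] * table["FALSE", "FALSE"]) /
  (table["TRUE", "FALSE"] * table["FALSE", "TRUE"])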

chisq_test <- chisq.test(table)
chisq_test
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 0.0676, df = 1, p-value = 0.7949

The Chi-Square Test compares the observed cell frequencies with the frequencies expected under independence. It relies on a large-sample approximation, so it works well for large samples but is imprecise for small ones; with 10,000 observations, it is the more appropriate of the two tests here, although with counts this large the two give essentially identical answers. The hypotheses are the same as before, and the Chi-Square Test again yields a p-value of 0.79, so we fail to reject the null hypothesis of independence. The X-squared statistic measures the discrepancy between the observed and expected frequencies; a smaller value means the observed counts are closer to what independence predicts. An X-squared value of 0.068 on 1 degree of freedom is tiny, further evidence in support of independence. The results are not surprising: X and Y were generated by independent calls to runif() and rnorm(), so independence is exactly what we built into the data.
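The expected counts the statistic is built from are stored on the test object; printing them shows that each cell's expectation under independence is 2500 (10,000 × 0.5 × 0.5):

# expected cell counts under the independence hypothesis (all 2500 here)
chisq_test$expected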

Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following:

5 points. Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

library(curl)
## Using libcurl 7.68.0 with OpenSSL/1.1.1f
test <- read.csv(curl("https://raw.githubusercontent.com/rkasa01/DATA605_FINALPROJECT/main/test.csv"))
train <- read.csv(curl("https://raw.githubusercontent.com/rkasa01/DATA605_FINALPROJECT/main/train.csv"))

Descriptive Statistics

str(test)
## 'data.frame':    49368 obs. of  9 variables:
##  $ id            : int  74051 74052 74053 74054 74055 74056 74057 74058 74059 74060 ...
##  $ Sex           : chr  "I" "I" "F" "F" ...
##  $ Length        : num  1.05 1.16 1.29 1.55 1.11 ...
##  $ Diameter      : num  0.762 0.887 0.988 0.988 0.85 ...
##  $ Height        : num  0.275 0.275 0.325 0.388 0.263 ...
##  $ Weight        : num  8.62 15.51 14.57 28.38 11.77 ...
##  $ Shucked.Weight: num  3.66 7.03 5.56 13.38 5.53 ...
##  $ Viscera.Weight: num  1.73 3.25 3.88 6.55 2.47 ...
##  $ Shell.Weight  : num  2.72 3.97 4.82 7.03 3.33 ...
str(train)
## 'data.frame':    74051 obs. of  10 variables:
##  $ id            : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Sex           : chr  "I" "I" "M" "F" ...
##  $ Length        : num  1.52 1.1 1.39 1.7 1.25 ...
##  $ Diameter      : num  1.175 0.825 1.113 1.413 1.012 ...
##  $ Height        : num  0.375 0.275 0.375 0.5 0.338 ...
##  $ Weight        : num  29 10.4 24.8 50.7 23.3 ...
##  $ Shucked.Weight: num  12.73 4.52 11.34 20.35 11.98 ...
##  $ Viscera.Weight: num  6.65 2.32 5.56 10.99 4.51 ...
##  $ Shell.Weight  : num  8.35 3.4 6.66 15 5.95 ...
##  $ Age           : int  9 8 9 11 8 10 11 11 12 11 ...
summary(train)
##        id            Sex                Length          Diameter     
##  Min.   :    0   Length:74051       Min.   :0.1875   Min.   :0.1375  
##  1st Qu.:18512   Class :character   1st Qu.:1.1500   1st Qu.:0.8875  
##  Median :37025   Mode  :character   Median :1.3750   Median :1.0750  
##  Mean   :37025                      Mean   :1.3175   Mean   :1.0245  
##  3rd Qu.:55538                      3rd Qu.:1.5375   3rd Qu.:1.2000  
##  Max.   :74050                      Max.   :2.0128   Max.   :1.6125  
##      Height           Weight        Shucked.Weight     Viscera.Weight    
##  Min.   :0.0000   Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252  
##  1st Qu.:0.3000   1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330  
##  Median :0.3625   Median :23.7994   Median : 9.90815   Median : 4.98951  
##  Mean   :0.3481   Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839  
##  3rd Qu.:0.4125   3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815  
##  Max.   :2.8250   Max.   :80.1015   Max.   :42.18406   Max.   :21.54562  
##   Shell.Weight           Age        
##  Min.   : 0.04252   Min.   : 1.000  
##  1st Qu.: 3.96893   1st Qu.: 8.000  
##  Median : 6.93145   Median :10.000  
##  Mean   : 6.72387   Mean   : 9.968  
##  3rd Qu.: 9.07184   3rd Qu.:11.000  
##  Max.   :28.49125   Max.   :29.000

Scatterplot Matrix

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)

train %>%
  gather(-id,-Age, key="var", value="value") %>%
  ggplot(aes(x=value,y=Age))+
  geom_point() +
  geom_smooth(method="lm",color="red") +
  facet_wrap(~ var, scales="free") +
  theme_bw()+
  ggtitle("Scatterplot Matrix")  
## `geom_smooth()` using formula = 'y ~ x'

Analysis

corr_mat <- cor(train[, c("Diameter", "Height", "Age")])
corr_mat
##           Diameter    Height       Age
## Diameter 1.0000000 0.9213531 0.6212559
## Height   0.9213531 1.0000000 0.6380669
## Age      0.6212559 0.6380669 1.0000000
corr_hyptest_a <- cor.test(train$Diameter, train$Height, conf.level = 0.80)
corr_hyptest_b <- cor.test(train$Diameter, train$Age, conf.level = 0.80)
corr_hyptest_c <- cor.test(train$Height, train$Age, conf.level = 0.80)
corr_hyptest_a
## 
##  Pearson's product-moment correlation
## 
## data:  train$Diameter and train$Height
## t = 644.97, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9206384 0.9220617
## sample estimates:
##       cor 
## 0.9213531

Discussion

The sample Pearson correlation coefficient between ‘Diameter’ and ‘Height’ is 0.9214, with t = 644.97 on 74,049 degrees of freedom and a p-value below 2.2e-16, so we reject the null hypothesis that there is no linear relationship between ‘Diameter’ and ‘Height’. The 80% confidence interval means that we can be 80% confident that the true correlation between ‘Diameter’ and ‘Height’ lies between 0.9206 and 0.9221. The other two tests (Diameter vs. Age and Height vs. Age) likewise report p-values below 2.2e-16: their sample correlations of 0.6213 and 0.6381 with n = 74,051 correspond to t-statistics well over 200. Familywise error, the chance of at least one false positive across the family of three tests, is not a practical worry here: the p-values are so small that even a conservative Bonferroni adjustment (multiplying each p-value by 3) would leave all three overwhelmingly significant.
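To make the familywise-error point concrete, the three p-values can be Bonferroni-adjusted with p.adjust(); a small sketch using the test objects computed above:

# Bonferroni adjustment: multiply each p-value by the number of tests (3)
raw_p <- c(corr_hyptest_a$p.value, corr_hyptest_b$p.value, corr_hyptest_c$p.value)
p.adjust(raw_p, method = "bonferroni")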

5 points. Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.

Inverting the Correlation Matrix

inv_corr_mat <- solve(corr_mat)  
print(inv_corr_mat)
##            Diameter     Height        Age
## Diameter  6.7010643 -5.9333786 -0.3771832
## Height   -5.9333786  6.9403490 -0.7422606
## Age      -0.3771832 -0.7422606  1.7079392
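As the task notes, the diagonal of the precision matrix holds the variance inflation factors, which can be read off directly:

# variance inflation factors sit on the diagonal of the precision matrix
diag(inv_corr_mat)

The VIFs for Diameter (6.70) and Height (6.94) are high, reflecting their strong mutual correlation of 0.92, while the VIF for Age (1.71) is modest.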

Matrix Multiplication

prod_corr_prec <- corr_mat %*% inv_corr_mat

prod_prec_corr <- inv_corr_mat %*% corr_mat

print(prod_corr_prec)
##               Diameter        Height          Age
## Diameter  1.000000e+00 -3.330669e-16 0.000000e+00
## Height   -2.498002e-16  1.000000e+00 2.220446e-16
## Age      -1.665335e-16 -2.220446e-16 1.000000e+00
print(prod_prec_corr)
##              Diameter        Height           Age
## Diameter 1.000000e+00 -1.165734e-15 -6.661338e-16
## Height   5.551115e-16  1.000000e+00  6.661338e-16
## Age      0.000000e+00  2.220446e-16  1.000000e+00
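Both products are the 3×3 identity matrix up to floating-point rounding (the off-diagonal entries are on the order of 1e-16), exactly as expected when a matrix is multiplied by its inverse. A quick confirmation:

# both products should equal the identity up to numerical tolerance
all.equal(prod_corr_prec, diag(3), check.attributes = FALSE)
all.equal(prod_prec_corr, diag(3), check.attributes = FALSE)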

LDU Decomposition

Note that La.svd() computes a singular value decomposition (U D V'), not an LDU factorization, so it cannot be used here. An LDU decomposition factors the matrix as L %*% D %*% U, with L unit lower triangular, D diagonal, and U unit upper triangular. Because the precision matrix is symmetric positive definite, Gaussian elimination needs no pivoting, and U works out to be t(L). A minimal Doolittle-style implementation:

ldu <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]         # elimination multiplier, stored in L
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]  # zero out the entry below the pivot
    }
  }
  D <- diag(diag(U))    # the pivots form the diagonal matrix D
  U <- solve(D) %*% U   # rescale rows so U has a unit diagonal
  list(L = L, D = D, U = U)
}

decomp <- ldu(inv_corr_mat)
decomp$L
decomp$D
decomp$U

# sanity check: L %*% D %*% U reproduces the precision matrix
all.equal(decomp$L %*% decomp$D %*% decomp$U, inv_corr_mat, check.attributes = FALSE)

5 points. Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lamda)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

library(fitdistrplus)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: survival
skewed <- train$Length
exponential_dist <- fitdistr(skewed, "exponential")
lambda <- exponential_dist$estimate["rate"]
samples_1000 <- rexp(1000, rate = lambda) #taking 1000 samples

hist(train$Length, probability = TRUE)

hist(samples_1000, main = "Exponential Distribution", probability = TRUE)

5th and 95th Percentiles from the Cumulative Distribution Function

percentile_5_exp_pdf <- qexp(0.05, rate = lambda)
percentile_5_exp_pdf
## [1] 0.06757686
percentile_95_exp_pdf <- qexp(0.95, rate = lambda)
percentile_95_exp_pdf
## [1] 3.946757
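Here qexp() is inverting the exponential CDF F(x) = 1 − exp(−λx): solving F(x) = p gives x = −log(1 − p)/λ. A sketch verifying both percentiles against this closed form:

# manual inversion of the exponential CDF; should match qexp() above
-log(1 - 0.05) / lambda   # 5th percentile
-log(1 - 0.95) / lambda   # 95th percentile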

Generate 95% Confidence Interval from Empirical Data

confidence_interval <- t.test(skewed)$conf.int
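The same interval can also be written out explicitly under the normality assumption; with n = 74,051 the t and z critical values are essentially identical, so this sketch should agree with t.test() to several decimal places:

# 95% CI assuming normality: mean ± z * standard error
n <- length(skewed)
mean(skewed) + c(-1, 1) * qnorm(0.975) * sd(skewed) / sqrt(n)
confidence_interval   # interval from t.test(), for comparison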

empirical_percentiles <- quantile(skewed, probs = c(0.05, 0.95))
empirical_percentiles
##     5%    95% 
## 0.7375 1.6750

Comparing the two sets of percentiles, the exponential fit is poor: the exponential model's 5th and 95th percentiles (0.068 and 3.947) span a far wider range than the empirical ones (0.7375 and 1.675). The empirical Length values are concentrated near their center, while an exponential distribution with the same mean places substantial mass near zero and in a long right tail, so its percentiles are much more spread out.

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

lm_model <- lm(Age ~ ., data = train)
lm_model<- step(lm_model, direction = c("backward"))
## Start:  AIC=111867.1
## Age ~ id + Sex + Length + Diameter + Height + Weight + Shucked.Weight + 
##     Viscera.Weight + Shell.Weight
## 
##                  Df Sum of Sq    RSS    AIC
## - id              1         0 335336 111865
## <none>                        335336 111867
## - Length          1       102 335438 111888
## - Diameter        1       363 335699 111945
## - Viscera.Weight  1      1504 336840 112197
## - Height          1      4232 339568 112794
## - Weight          1      5826 341163 113141
## - Sex             2      8658 343994 113751
## - Shell.Weight    1     12318 347654 114537
## - Shucked.Weight  1     38838 374174 119980
## 
## Step:  AIC=111865.1
## Age ~ Sex + Length + Diameter + Height + Weight + Shucked.Weight + 
##     Viscera.Weight + Shell.Weight
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                        335336 111865
## - Length          1       102 335438 111886
## - Diameter        1       363 335700 111943
## - Viscera.Weight  1      1504 336840 112195
## - Height          1      4232 339568 112792
## - Weight          1      5826 341163 113139
## - Sex             2      8658 343994 113749
## - Shell.Weight    1     12318 347654 114535
## - Shucked.Weight  1     38838 374174 119978
plot(residuals(lm_model))

par(mfrow = c(2, 2))

plot(lm_model)
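Finally, to post a score on the competition board, the fitted model is used to predict Age for the Kaggle test set and the predictions are written out; a sketch assuming the submission format in the competition's sample_submission.csv (columns id and Age):

# predict on the test set and write a two-column Kaggle submission file
predictions <- predict(lm_model, newdata = test)
submission <- data.frame(id = test$id, Age = predictions)
write.csv(submission, "submission.csv", row.names = FALSE)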