GITHUB LINK: https://github.com/rkasa01/DATA605_FINALPROJECT/tree/main
Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
set.seed(1234)
I started by setting the random seed to 1234 so the results are reproducible.
X <- runif(10000, min = 5, max = 15)
Y <- rnorm(10000, mean = 10, sd = 2.89)
We are told that the random variable X has 10,000 continuous uniform values between 5 and 15, and that the random variable Y has 10,000 normal values with a mean of 10 and a standard deviation of 2.89. Accordingly, I generated X with runif() using a minimum of 5 and a maximum of 15, and Y with rnorm() using a mean of 10 and a standard deviation of 2.89.
hist(X)
For X, the histogram shows an approximately uniform distribution: the bin frequencies are all about the same across the 5 to 15 range, as expected for runif().
hist(Y)
For Y, the histogram shows a normal distribution. For this type of distribution, the median and mean are about the same.
x <- median(X)
y <- median(Y)
x
## [1] 10.01266
y
## [1] 10.03154
A <- X[X > y] # values of X greater than y
B <- X[X > x] # values of X greater than x
probability_a <- length(A) / length(B)
cat('P(X>x | X>y):', probability_a)
## P(X>x | X>y): 0.9958
We are told that x and y are the medians of their respective variables, so we assign them accordingly: x is about 10.01 and y is about 10.03. This first probability is the probability that X is greater than its own median, given that X is already greater than y. X[X > y] filters the values of X greater than y, forming subset A, and X[X > x] filters the values of X greater than x, forming subset B. The ratio length(A) / length(B) is then the proportion of values exceeding y among the values exceeding x. The result is 0.9958, or about 1, which makes sense: because y (10.03) is only slightly larger than x (10.01), nearly every value above x is also above y, and every value above y is automatically above x.
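As a cross-check, the conditional probability can also be computed directly from its definition, P(X>x | X>y) = P(X>x and X>y) / P(X>y); a minimal sketch, reusing the X, x, and y defined above:
# Definition-based conditional probability P(X>x | X>y).
# Because y slightly exceeds x in this sample, X > y implies X > x,
# so this ratio evaluates to exactly 1 here.
mean(X > x & X > y) / mean(X > y)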
probability_B <- mean(X>x & Y>y)
cat('P(X>x & Y>y):', probability_B)
## P(X>x & Y>y): 0.2507
This second probability is the probability that X is greater than its median and, simultaneously, Y is greater than its median. X > x checks which values of X exceed the median x, and Y > y checks which values of Y exceed the median y. Applying mean() to the logical vector X > x & Y > y gives the proportion of cases where both conditions hold at once. The probability is 0.2507, or about 0.25, which is exactly what independence predicts: since P(X > x) = P(Y > y) = 0.5 by the definition of the median, independent variables would give 0.5 × 0.5 = 0.25.
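That independence check can be sketched directly, comparing the joint proportion above with the product of the marginal proportions (reusing the objects already defined):
mean(X > x) * mean(Y > y) # product of marginals; ~0.25, matching the joint proportion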
probability_C <- mean(X<x & X>y)/mean(X>y)
cat('P(X<x | X>y):', probability_C)
## P(X<x | X>y): 0
This third probability is the probability that X is less than its median, given that X is greater than y: in instances where X exceeds y, how often does X fall below its own median? X < x selects the values of X below the median x, and X > y selects the values above y; dividing mean(X < x & X > y) by mean(X > y) gives the conditional proportion. The probability is 0, which is expected: since y (10.03) is larger than x (10.01), no value of X can simultaneously be below x and above y.
table <- table(X > x, Y > y) # 2x2 contingency table: X > x (rows) vs. Y > y (columns)
table
##
## FALSE TRUE
## FALSE 2507 2493
## TRUE 2493 2507
For this task, we build a contingency table and evaluate the marginal and joint probabilities. Each joint cell holds roughly 2,500 of the 10,000 observations (about 0.25), which is close to the product of the marginal probabilities (0.5 × 0.5 = 0.25), suggesting that these may be independent random variables.
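A sketch of the joint and marginal proportions, using base R's prop.table() and addmargins() on the table built above:
addmargins(prop.table(table)) # joint cells ~0.25, margins ~0.5 each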
Null hypothesis: There is no relationship between the two variables, meaning that they are independent of one another. Alternative hypothesis: There is a relationship between the two variables, meaning that they are not independent of one another.
fisher_test <- fisher.test(table)
fisher_test
##
## Fisher's Exact Test for Count Data
##
## data: table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
Fisher's Exact Test calculates the exact probability of observing the data given the marginal totals, assuming independence; it is typically used for smaller sample sizes. Here, the null hypothesis states that there is no relationship between the two variables (they are independent), and the alternative hypothesis states that there is a relationship (they are dependent). The test gives a p-value of 0.79, which is not significant, so we fail to reject the null hypothesis of independence. The 95% confidence interval for the odds ratio, 0.9343 to 1.0946, contains 1, providing further evidence of no association, and the estimated odds ratio of approximately 1 (1.011) likewise implies no association between the variables.
chisq_test <- chisq.test(table)
chisq_test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 0.0676, df = 1, p-value = 0.7949
The Chi-Square Test compares the observed frequencies with the frequencies expected under independence. It works well for larger sample sizes but is less precise for small ones. As before, the null hypothesis states that the two variables are independent, and the alternative states that they are dependent. The test again gives a p-value of 0.79, which is not significant, so we fail to reject the null hypothesis. The X-squared statistic measures the discrepancy between the observed and expected frequencies; a smaller value means the observed counts are close to what independence would predict. With an X-squared value of only 0.068, this is further evidence in support of independence.
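The expected counts under independence can be pulled straight from the test object; a minimal sketch using the chisq_test computed above:
chisq_test$expected # each cell ~2,500 under independence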
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following:
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
library(curl)
## Using libcurl 7.68.0 with OpenSSL/1.1.1f
test <- read.csv(curl("https://raw.githubusercontent.com/rkasa01/DATA605_FINALPROJECT/main/test.csv"))
train <- read.csv(curl("https://raw.githubusercontent.com/rkasa01/DATA605_FINALPROJECT/main/train.csv"))
str(test)
## 'data.frame': 49368 obs. of 9 variables:
## $ id : int 74051 74052 74053 74054 74055 74056 74057 74058 74059 74060 ...
## $ Sex : chr "I" "I" "F" "F" ...
## $ Length : num 1.05 1.16 1.29 1.55 1.11 ...
## $ Diameter : num 0.762 0.887 0.988 0.988 0.85 ...
## $ Height : num 0.275 0.275 0.325 0.388 0.263 ...
## $ Weight : num 8.62 15.51 14.57 28.38 11.77 ...
## $ Shucked.Weight: num 3.66 7.03 5.56 13.38 5.53 ...
## $ Viscera.Weight: num 1.73 3.25 3.88 6.55 2.47 ...
## $ Shell.Weight : num 2.72 3.97 4.82 7.03 3.33 ...
str(train)
## 'data.frame': 74051 obs. of 10 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Sex : chr "I" "I" "M" "F" ...
## $ Length : num 1.52 1.1 1.39 1.7 1.25 ...
## $ Diameter : num 1.175 0.825 1.113 1.413 1.012 ...
## $ Height : num 0.375 0.275 0.375 0.5 0.338 ...
## $ Weight : num 29 10.4 24.8 50.7 23.3 ...
## $ Shucked.Weight: num 12.73 4.52 11.34 20.35 11.98 ...
## $ Viscera.Weight: num 6.65 2.32 5.56 10.99 4.51 ...
## $ Shell.Weight : num 8.35 3.4 6.66 15 5.95 ...
## $ Age : int 9 8 9 11 8 10 11 11 12 11 ...
summary(train)
## id Sex Length Diameter
## Min. : 0 Length:74051 Min. :0.1875 Min. :0.1375
## 1st Qu.:18512 Class :character 1st Qu.:1.1500 1st Qu.:0.8875
## Median :37025 Mode :character Median :1.3750 Median :1.0750
## Mean :37025 Mean :1.3175 Mean :1.0245
## 3rd Qu.:55538 3rd Qu.:1.5375 3rd Qu.:1.2000
## Max. :74050 Max. :2.0128 Max. :1.6125
## Height Weight Shucked.Weight Viscera.Weight
## Min. :0.0000 Min. : 0.0567 Min. : 0.02835 Min. : 0.04252
## 1st Qu.:0.3000 1st Qu.:13.4377 1st Qu.: 5.71242 1st Qu.: 2.86330
## Median :0.3625 Median :23.7994 Median : 9.90815 Median : 4.98951
## Mean :0.3481 Mean :23.3852 Mean :10.10427 Mean : 5.05839
## 3rd Qu.:0.4125 3rd Qu.:32.1625 3rd Qu.:14.03300 3rd Qu.: 6.98815
## Max. :2.8250 Max. :80.1015 Max. :42.18406 Max. :21.54562
## Shell.Weight Age
## Min. : 0.04252 Min. : 1.000
## 1st Qu.: 3.96893 1st Qu.: 8.000
## Median : 6.93145 Median :10.000
## Mean : 6.72387 Mean : 9.968
## 3rd Qu.: 9.07184 3rd Qu.:11.000
## Max. :28.49125 Max. :29.000
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
train %>%
  gather(-id, -Age, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = Age)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  facet_wrap(~ var, scales = "free") +
  theme_bw() +
  ggtitle("Scatterplot Matrix")
## `geom_smooth()` using formula = 'y ~ x'
corr_mat <- cor(train[, c("Diameter", "Height", "Age")])
corr_mat
## Diameter Height Age
## Diameter 1.0000000 0.9213531 0.6212559
## Height 0.9213531 1.0000000 0.6380669
## Age 0.6212559 0.6380669 1.0000000
corr_hyptest_a <- cor.test(train$Diameter, train$Height, conf.level = 0.80)
corr_hyptest_b <- cor.test(train$Diameter, train$Age, conf.level = 0.80)
corr_hyptest_c <- cor.test(train$Height, train$Age, conf.level = 0.80)
corr_hyptest_a
##
## Pearson's product-moment correlation
##
## data: train$Diameter and train$Height
## t = 644.97, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9206384 0.9220617
## sample estimates:
## cor
## 0.9213531
The sample Pearson correlation coefficient between 'Diameter' and 'Height' is 0.9214, with t = 644.97 on 74,049 degrees of freedom and a p-value below 2.2e-16, which is significant. This is strong evidence against the null hypothesis that there is no correlation between 'Diameter' and 'Height'. The 80% confidence interval means we can be 80% confident that the true correlation lies within the reported range, 0.9206 to 0.9221. The other two tests (corr_hyptest_b and corr_hyptest_c) tell the same story: with sample correlations of 0.62 (Diameter vs. Age) and 0.64 (Height vs. Age) and n = 74,051, their p-values are also far below any conventional threshold. As for familywise error: running three tests at once inflates the probability of at least one false rejection to 1 - (1 - alpha)^3, about 14% at alpha = 0.05, but because every p-value here is many orders of magnitude below even a Bonferroni-corrected threshold of 0.05/3, familywise error is not a practical worry for this analysis.
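A sketch of a Bonferroni correction across the three tests, pulling the p-values from the cor.test objects computed above:
p_values <- c(corr_hyptest_a$p.value, corr_hyptest_b$p.value, corr_hyptest_c$p.value)
p.adjust(p_values, method = "bonferroni") # all remain effectively 0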
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
inv_corr_mat <- solve(corr_mat)
print(inv_corr_mat)
## Diameter Height Age
## Diameter 6.7010643 -5.9333786 -0.3771832
## Height -5.9333786 6.9403490 -0.7422606
## Age -0.3771832 -0.7422606 1.7079392
prod_corr_prec <- corr_mat %*% inv_corr_mat
prod_prec_corr <- inv_corr_mat %*% corr_mat
print(prod_corr_prec)
## Diameter Height Age
## Diameter 1.000000e+00 -3.330669e-16 0.000000e+00
## Height -2.498002e-16 1.000000e+00 2.220446e-16
## Age -1.665335e-16 -2.220446e-16 1.000000e+00
print(prod_prec_corr)
## Diameter Height Age
## Diameter 1.000000e+00 -1.165734e-15 -6.661338e-16
## Height 5.551115e-16 1.000000e+00 6.661338e-16
## Age 0.000000e+00 2.220446e-16 1.000000e+00
ldu_decomp <- La.svd(inv_corr_mat) # note: La.svd() returns a singular value decomposition (u, d, vt), not a true LDU factorization
L <- ldu_decomp$u %*% diag(ldu_decomp$d)
D <- diag(ldu_decomp$d)
U <- t(ldu_decomp$u)
print(L)
## [,1] [,2] [,3]
## [1,] -8.9152141 -0.8430313 0.2444744
## [2,] 9.1260854 -0.7605555 0.2459622
## [3,] -0.3086059 1.8629266 0.2110483
print(D)
## [,1] [,2] [,3]
## [1,] 12.76173 0.00000 0.0000000
## [2,] 0.00000 2.18166 0.0000000
## [3,] 0.00000 0.00000 0.4059637
print(U)
## [,1] [,2] [,3]
## [1,] -0.6985899 0.7151136 -0.02418214
## [2,] -0.3864173 -0.3486131 0.85390313
## [3,] 0.6022075 0.6058725 0.51986986
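The decomposition above is built from La.svd(), which is a singular value decomposition rather than a true LDU factorization (note that the L and U printed above are not triangular). Because the precision matrix is symmetric and positive definite, a genuine LDU factorization can be obtained from the Cholesky decomposition in base R; a minimal sketch:
R_chol <- chol(inv_corr_mat) # upper triangular R with A = t(R) %*% R
D_ldu <- diag(diag(R_chol)^2) # D: squared Cholesky pivots on the diagonal
U_ldu <- diag(1 / diag(R_chol)) %*% R_chol # unit upper-triangular U
L_ldu <- t(U_ldu) # unit lower-triangular L (A is symmetric)
max(abs(inv_corr_mat - L_ldu %*% D_ldu %*% U_ldu)) # ~0: confirms A = L D U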
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lamda)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
library(fitdistrplus)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: survival
skewed <- train$Length # variable selected for the distribution fit
exponential_dist <- fitdistr(skewed, "exponential") # fit an exponential pdf (MASS)
lambda <- exponential_dist$estimate["rate"] # optimal (maximum likelihood) value of lambda
samples_1000 <- rexp(1000, rate = lambda) # taking 1000 samples from the fitted distribution
hist(train$Length, probability = TRUE)
hist(samples_1000, main = "Exponential Distribution", probability = TRUE)
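For a more direct visual comparison, the fitted exponential density can be overlaid on the histogram of the original variable; a minimal sketch reusing the lambda fitted above:
hist(train$Length, probability = TRUE, main = "Length with fitted exponential density")
curve(dexp(x, rate = lambda), col = "red", lwd = 2, add = TRUE) # fitted pdf overlay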
## 5th and 95th Percentiles from Cumulative Distribution Function
percentile_5_exp_pdf <- qexp(0.05, rate = lambda)
percentile_5_exp_pdf
## [1] 0.06757686
percentile_95_exp_pdf <- qexp(0.95, rate = lambda)
percentile_95_exp_pdf
## [1] 3.946757
confidence_interval <- t.test(skewed)$conf.int # 95% CI for the mean, assuming normality
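As a sketch, the same normal-theory interval can be computed by hand (mean plus or minus z times the standard error); with n = 74,051 the t and z intervals are effectively identical:
mean(skewed) + c(-1, 1) * qnorm(0.975) * sd(skewed) / sqrt(length(skewed))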
empirical_percentiles <- quantile(skewed, probs = c(0.05, 0.95))
empirical_percentiles
## 5% 95%
## 0.7375 1.6750
Comparing the 5th and 95th percentiles from the fitted exponential CDF (0.068 and 3.947) with the empirical percentiles (0.7375 and 1.675), the empirical interval is far narrower and therefore more precise. This is because Length is not actually exponentially distributed: an exponential density is highest at zero and decays steadily, while Length is concentrated around its center, so the exponential fit spreads probability over a much wider range than the data actually occupy. The two histograms show the same mismatch.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
lm_model <- lm(Age ~ ., data = train) # full model with all predictors
lm_model <- step(lm_model, direction = "backward") # backward stepwise selection by AIC
## Start: AIC=111867.1
## Age ~ id + Sex + Length + Diameter + Height + Weight + Shucked.Weight +
## Viscera.Weight + Shell.Weight
##
## Df Sum of Sq RSS AIC
## - id 1 0 335336 111865
## <none> 335336 111867
## - Length 1 102 335438 111888
## - Diameter 1 363 335699 111945
## - Viscera.Weight 1 1504 336840 112197
## - Height 1 4232 339568 112794
## - Weight 1 5826 341163 113141
## - Sex 2 8658 343994 113751
## - Shell.Weight 1 12318 347654 114537
## - Shucked.Weight 1 38838 374174 119980
##
## Step: AIC=111865.1
## Age ~ Sex + Length + Diameter + Height + Weight + Shucked.Weight +
## Viscera.Weight + Shell.Weight
##
## Df Sum of Sq RSS AIC
## <none> 335336 111865
## - Length 1 102 335438 111886
## - Diameter 1 363 335700 111943
## - Viscera.Weight 1 1504 336840 112195
## - Height 1 4232 339568 112792
## - Weight 1 5826 341163 113139
## - Sex 2 8658 343994 113749
## - Shell.Weight 1 12318 347654 114535
## - Shucked.Weight 1 38838 374174 119978
plot(residuals(lm_model)) # residuals vs. observation index
par(mfrow = c(2, 2)) # 2x2 grid for the standard diagnostic plots
plot(lm_model) # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
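Finally, a minimal sketch of turning the fitted model into a competition submission (the file name and the id/Age column layout are assumptions based on the usual Kaggle playground format):
pred <- predict(lm_model, newdata = test) # predicted crab ages for the test set
submission <- data.frame(id = test$id, Age = pred) # assumed submission layout: id + Age
write.csv(submission, "submission.csv", row.names = FALSE)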