library(reactable)
library(vtable)
library(caret)
library(GGally)
library(tidyverse)
library(plotly)
library(LaplacesDemon)
library(ggeasy)
library(corrplot)
library(matrixcalc)
library(Matrix)
library(MASS)
library(moments)
library(car)
Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
# Random seed
set.seed(1234)
# Random variable X with 10,000 continuous random uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)
# Random variable Y with 10,000 random normal values
# Mean = 10, Standard Deviation = 2.89
Y <- rnorm(10000, mean = 10, sd = 2.89)
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.003 7.525 10.013 10.003 12.475 14.996
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.564 8.113 10.032 10.033 11.949 20.456
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
Formula for Conditional Probability
\[ \frac{P(X>x \cap X>y)}{P(X>y)} \]
# Estimate the medians
x_med <- median(X)
y_med <- median(Y)
prob_a <- sum(X>x_med & X>y_med)/sum(X>y_med)
prob_a
## [1] 1
prob_b <- sum(X>x_med & Y>y_med)/length(X)
prob_b
## [1] 0.2507
\[ \frac{P(X<x \cap X>y)}{P(X>y)} \]
prob_c <- sum(X<x_med & X>y_med)/sum(X>y_med)
prob_c
## [1] 0
Investigate whether \(P(X>x \cap Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
# Creating table
comparison_table <- table(X>x_med, Y>y_med)
# Calculate marginal probabilities
marg_prob_X_gt_x <- sum(X>x_med)/length(X)
marg_prob_Y_gt_y <- sum(Y>y_med)/length(Y)
# Calculate joint probability P(X>x & Y>y)
joint_prob <- sum(X>x_med & Y>y_med) / length(X)
# Check if P(X>x & Y>y) = P(X>x)*P(Y>y)
equality_check <- abs(joint_prob - (marg_prob_X_gt_x * marg_prob_Y_gt_y))
cat("Comparison Table:", comparison_table, "\n")
## Comparison Table: 2507 2493 2493 2507
cat("Marginal Probability P(X>x_med):", marg_prob_X_gt_x,"\n")
## Marginal Probability P(X>x_med): 0.5
cat("Marginal Probability P(Y>y_med):", marg_prob_Y_gt_y,"\n")
## Marginal Probability P(Y>y_med): 0.5
cat("Joint Probability P(X>x & Y>y):", joint_prob, "\n")
## Joint Probability P(X>x & Y>y): 0.2507
cat("Absolute difference between P(X>x & Y>y) and P(X>x) * P(Y>y):", equality_check, "\n")
## Absolute difference between P(X>x & Y>y) and P(X>x) * P(Y>y): 7e-04
# Matrix with the cell counts from comparison_table
# (rows = Y condition, columns = X condition)
comparison_matrix <- matrix(c(2507, 2493, 2493, 2507), nrow = 2, byrow = TRUE)
# Convert the matrix to a data frame for better display
comparison_df <- as.data.frame(comparison_matrix)
names(comparison_df) <- c("X > x_med", "X <= x_med")
row.names(comparison_df) <- c("Y > y_med", "Y <= y_med")
reactable(comparison_df)
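As a cross-check, the joint and marginal probabilities can also be read straight off comparison_table with base R's prop.table() and margin.table(); this is just a small sketch reusing the objects already defined above.
# Joint probabilities for each cell of the 2x2 table
prop.table(comparison_table)
# Marginal probabilities for the X condition (rows) and the Y condition (columns)
prop.table(margin.table(comparison_table, 1))
prop.table(margin.table(comparison_table, 2))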
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?
fisher_test_result <- fisher.test(comparison_table)
fisher_test_result
##
## Fisher's Exact Test for Count Data
##
## data: comparison_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
chi_square_test_result <- chisq.test(comparison_table)
chi_square_test_result
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: comparison_table
## X-squared = 0.0676, df = 1, p-value = 0.7949
Using an outside website as a reference, Fisher's Exact Test and the Chi-square test have similarities and differences. Both assess the relationship between two categorical variables and test whether the variables are independent. Fisher's Exact Test considers all possible contingency tables with the same row and column totals (i.e., the same marginal distributions) as the observed table, and computes the p-value as the proportion of those tables that are at least as extreme as the observed one. The test is valid for any sample size, but it is mainly used for small samples; common guidelines for preferring it include cell counts smaller than 20, any cell with an expected value of 5 or less, and extremely uneven row or column marginal totals. The Chi-Square test, by contrast, is used for large sample sizes. It uses a test statistic and its sampling distribution to calculate a p-value, but the chi-square sampling distribution only approximates the correct distribution, and the approximation improves as the cell counts in the table increase. As a result, chi-square p-values are unreliable when cell counts are small.
In this case, both tests return the same p-value: 0.7949. I am somewhat surprised the two tests delivered identical results. Based on the descriptions above, I would have thought the Chi-square test would have produced a different p-value; however, the sample size of 10,000 is large and the cell counts (around 2,500 each) are well beyond the small-count guidelines for Fisher's Exact Test, and with counts this large the two tests converge to essentially the same answer.
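One quick way to confirm that the chi-square guideline is satisfied here is to inspect the expected cell counts stored in the test object; the short sketch below simply reuses comparison_table.
# Expected cell counts under independence, taken from the chi-square test object
chisq.test(comparison_table)$expected
With 5,000 observations in each margin of the 10,000-observation table, every expected count is 2,500, far above the usual cutoff of 5, which is consistent with both tests returning essentially the same answer.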
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition (https://www.kaggle.com/competitions/playground-series-s3e16). I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
crab_age_df_train <- read.csv("https://raw.githubusercontent.com/moham6839/Data_605_Final_Project/main/train.csv", check.names = FALSE)
reactable(crab_age_df_train)
As stated in the description of the Crab Age dataset on the Kaggle website, “For a commercial crab farmer knowing the right age of the crab helps them decide if and when to harvest the crabs. Beyond a certain age, there is negligible growth in crab’s physical characteristics and hence, it is important to time the harvesting to reduce cost and increase profit.” Therefore, I will use Age as the dependent variable and treat the physical measurements as candidate independent variables.
st(crab_age_df_train)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| id | 74051 | 37025 | 21377 | 0 | 18512 | 55538 | 74050 |
| Sex | 74051 | | | | | | |
| … F | 23010 | 31% | | | | | |
| … I | 23957 | 32% | | | | | |
| … M | 27084 | 37% | | | | | |
| Length | 74051 | 1.3 | 0.29 | 0.19 | 1.1 | 1.5 | 2 |
| Diameter | 74051 | 1 | 0.24 | 0.14 | 0.89 | 1.2 | 1.6 |
| Height | 74051 | 0.35 | 0.092 | 0 | 0.3 | 0.41 | 2.8 |
| Weight | 74051 | 23 | 13 | 0.057 | 13 | 32 | 80 |
| Shucked Weight | 74051 | 10 | 5.6 | 0.028 | 5.7 | 14 | 42 |
| Viscera Weight | 74051 | 5.1 | 2.8 | 0.043 | 2.9 | 7 | 22 |
| Shell Weight | 74051 | 6.7 | 3.6 | 0.043 | 4 | 9.1 | 28 |
| Age | 74051 | 10 | 3.2 | 1 | 8 | 11 | 29 |
summary(crab_age_df_train$Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1875 1.1500 1.3750 1.3175 1.5375 2.0128
ggplot(crab_age_df_train, aes(Length)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Length") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Length, xlab = "Length", main = "Length of Crabs", col = "blue")
summary(crab_age_df_train$Diameter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1375 0.8875 1.0750 1.0245 1.2000 1.6125
ggplot(crab_age_df_train, aes(Diameter)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Diameter") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Diameter, xlab = "Diameter", main = "Diameter of Crabs", col = "blue")
summary(crab_age_df_train$Height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.3625 0.3481 0.4125 2.8250
ggplot(crab_age_df_train, aes(Height)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Height") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Height, xlab = "Height", main = "Height of Crabs", col = "blue")
summary(crab_age_df_train$Weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0567 13.4377 23.7994 23.3852 32.1625 80.1015
ggplot(crab_age_df_train, aes(Weight)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Weight") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Weight, xlab = "Weight", main = "Weight of Crabs", col = "blue")
summary(crab_age_df_train$`Shucked Weight`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02835 5.71242 9.90815 10.10427 14.03300 42.18406
ggplot(crab_age_df_train, aes(`Shucked Weight`)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Shucked Weight") +
ggeasy::easy_center_title()
hist(crab_age_df_train$`Shucked Weight`, xlab = "Shucked Weight", main = "Shucked Weight of Crabs", col = "blue")
summary(crab_age_df_train$`Viscera Weight`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04252 2.86330 4.98951 5.05839 6.98815 21.54562
ggplot(crab_age_df_train, aes(`Viscera Weight`)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Viscera Weight") +
ggeasy::easy_center_title()
hist(crab_age_df_train$`Viscera Weight`, xlab = "Viscera Weight", main = "Viscera Weight of Crabs", col = "blue")
summary(crab_age_df_train$`Shell Weight`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04252 3.96893 6.93145 6.72387 9.07184 28.49125
ggplot(crab_age_df_train, aes(`Shell Weight`)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Shell Weight") +
ggeasy::easy_center_title()
hist(crab_age_df_train$`Shell Weight`, xlab = "Shell Weight", main = "Shell Weight of Crabs", col = "blue")
summary(crab_age_df_train$Sex)
## Length Class Mode
## 74051 character character
ggplot(crab_age_df_train, aes(Sex)) +
geom_bar() +
theme_minimal() +
ggtitle("Gender of Crabs") +
ggeasy::easy_center_title()
summary(crab_age_df_train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 8.000 10.000 9.968 11.000 29.000
ggplot(crab_age_df_train, aes(Age)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Age") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Age, xlab = "Age", main = "Age of Crabs", col = "blue")
Using Age as a dependent variable, I will use Height and Weight as my independent variables:
train_set2 <- crab_age_df_train %>%
dplyr::select(Age, Height, Weight)
pairs(train_set2, main = "Scatter Plot Matrix for Age, Height, and Weight")
ggpairs(train_set2, title="Scatterplot Matrix Using Age, Height, and Weight")
train_set_df_corr <- crab_age_df_train %>%
dplyr::select(`Shucked Weight`, `Viscera Weight`, `Shell Weight`)
corr_mat <- cor(train_set_df_corr)
corr_mat
## Shucked Weight Viscera Weight Shell Weight
## Shucked Weight 1.0000000 0.942626 0.9103977
## Viscera Weight 0.9426260 1.000000 0.9339190
## Shell Weight 0.9103977 0.933919 1.0000000
ggpairs(train_set_df_corr)
ggcorrplot::ggcorrplot(corr_mat, hc.order = TRUE, type = "lower",
lab = TRUE)
shuck_and_vis <- cor.test(train_set_df_corr$`Shucked Weight`, train_set_df_corr$`Viscera Weight`, conf.level = 0.8)
shuck_and_vis
##
## Pearson's product-moment correlation
##
## data: train_set_df_corr$`Shucked Weight` and train_set_df_corr$`Viscera Weight`
## t = 768.33, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9420988 0.9431486
## sample estimates:
## cor
## 0.942626
shuck_and_shell <- cor.test(train_set_df_corr$`Shucked Weight`, train_set_df_corr$`Shell Weight`, conf.level = 0.8)
shuck_and_shell
##
## Pearson's product-moment correlation
##
## data: train_set_df_corr$`Shucked Weight` and train_set_df_corr$`Shell Weight`
## t = 598.78, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9095880 0.9112004
## sample estimates:
## cor
## 0.9103977
vis_and_shell <- cor.test(train_set_df_corr$`Viscera Weight`, train_set_df_corr$`Shell Weight`, conf.level = 0.8)
vis_and_shell
##
## Pearson's product-moment correlation
##
## data: train_set_df_corr$`Viscera Weight` and train_set_df_corr$`Shell Weight`
## t = 710.9, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9333145 0.9345182
## sample estimates:
## cor
## 0.933919
When analyzing the correlations of the three variables, each pairwise correlation is statistically significant, with p-values far below .05, so we can reject the null hypothesis that the true correlation is 0 for every pair. However, the three variables are highly collinear with one another (all pairwise correlations above 0.9), which will make it difficult to separate their individual effects when they are used together in a regression model.
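To make the three interval estimates easier to compare, the 80% confidence intervals reported above can be collected into one small table; this is a sketch that assumes the cor.test objects shuck_and_vis, shuck_and_shell, and vis_and_shell are still in the workspace.
# Collect the three 80% confidence intervals into a single table
ci_table <- rbind(
  "Shucked vs. Viscera" = shuck_and_vis$conf.int,
  "Shucked vs. Shell" = shuck_and_shell$conf.int,
  "Viscera vs. Shell" = vis_and_shell$conf.int
)
colnames(ci_table) <- c("Lower 80%", "Upper 80%")
round(ci_table, 4)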
Familywise error rate (FWER) is the probability of making at least one Type I error, i.e., a false positive, across a family of tests (https://statweb.stanford.edu/~joftius/slides/testing.pdf). The formula for calculating FWER is \(1-(1-\alpha)^m\), where \(m\) is the number of tests (https://www.statistics.com/glossary/family-wise-type-i-error/). We can calculate the error rate for our three correlation tests:
alpha <- 0.2 # 1 - 0.8 (confidence interval)
m <- 3
FWER <- (1 - ((1 - alpha)^m))*100
FWER
## [1] 48.8
The FWER is 48.8%, which is quite high and a cause for concern. However, there are methods that can reduce the FWER. One is the Bonferroni Correction, which adjusts the significance level to control the probability of a Type I error (https://statisticsbyjim.com/hypothesis-testing/bonferroni-correction/). It is calculated by dividing the significance level by the number of tests. In this case, it would be:
BC <- alpha/m
BC
## [1] 0.06666667
The adjusted significance level is about 0.067, which means each individual test is now conducted at roughly a 6.7% level rather than the original 0.2 (20%), keeping the familywise error rate at or below 20%.
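Equivalently, the correction can be applied to the p-values themselves with R's built-in p.adjust(); the sketch below assumes the three cor.test objects created earlier, and with p-values this small the Bonferroni adjustment will not change the conclusions.
# Bonferroni-adjust the three correlation-test p-values
p_vals <- c(shuck_and_vis$p.value, shuck_and_shell$p.value, vis_and_shell$p.value)
p.adjust(p_vals, method = "bonferroni")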
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
# Invert Correlation Matrix to Precision Matrix
precision_matrix <- Cov2Prec(corr_mat)
reactable(precision_matrix)
# Multiply Correlation Matrix by the Precision Matrix
corr_times_prec <- corr_mat %*% precision_matrix
reactable(corr_times_prec)
# Multiply Precision Matrix by Correlation Matrix
prec_times_corr <- precision_matrix %*% corr_mat
reactable(prec_times_corr)
LDU <- lu.decomposition(prec_times_corr)
LDU
## $L
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00 0
## [2,] -2.979398e-16 1.000000e+00 0
## [3,] -5.718615e-16 -2.936959e-16 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 -1.663803e-15 0
## [2,] 0 1.000000e+00 0
## [3,] 0 0.000000e+00 1
reactable(LDU$L)
reactable(LDU$U)
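A few sanity checks can confirm the linear-algebra steps: the precision matrix should match a direct inversion with solve(), both products should be (numerically) the identity matrix, and multiplying the LDU factors back together should recover the decomposed matrix. This sketch reuses the objects created above and assumes Cov2Prec performs a plain matrix inversion.
# Each line should return TRUE up to floating-point tolerance
all.equal(solve(corr_mat), precision_matrix, check.attributes = FALSE)
all.equal(corr_times_prec, diag(3), check.attributes = FALSE)
all.equal(LDU$L %*% LDU$U, prec_times_corr, check.attributes = FALSE)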
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
When examining the histograms of each quantitative variable, Length and Diameter were the only variables that appeared to be skewed to the left. I decided to check the skewness of each variable:
skewness(crab_age_df_train$Length)
## [1] -0.8443598
skewness(crab_age_df_train$Diameter)
## [1] -0.8128491
skewness(crab_age_df_train$Height)
## [1] 0.08657593
skewness(crab_age_df_train$Weight)
## [1] 0.2314599
skewness(crab_age_df_train$`Shucked Weight`)
## [1] 0.3494646
skewness(crab_age_df_train$`Viscera Weight`)
## [1] 0.286377
skewness(crab_age_df_train$`Shell Weight`)
## [1] 0.2774533
skewness(crab_age_df_train$Age)
## [1] 1.092897
My assumptions were accurate. Length and Diameter have negative skewness values, indicating a skew to the left. The rest of the variables have positive skewness values, indicating a skew to the right. I decided to work with Shucked Weight, since it has the highest skewness among the independent variables and shows a distinct skew to the right.
shuck_weight_epdf <- fitdistr(crab_age_df_train$`Shucked Weight`, densfun = 'exponential')
shuck_weight_epdf
## rate
## 0.0989680626
## (0.0003636885)
lambda <- shuck_weight_epdf$estimate
shuck_weight_lambda_epdf <- rexp(1000, lambda)
shuck_weight_lambda_df <- as.data.frame(shuck_weight_lambda_epdf)
reactable(shuck_weight_lambda_df)
par(mfrow = c(1, 2))
hist(crab_age_df_train$`Shucked Weight`, xlab = "Shucked Weight", main = "Original Shucked Weight of Crabs", col = "blue")
hist(shuck_weight_lambda_df$shuck_weight_lambda_epdf, xlab = "Transformed Shucked Weight", main = "Shucked Weight of Crabs Transformed", col = "red")
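For a more direct visual comparison, the fitted exponential density can be overlaid on the original histogram; this is a sketch that reuses the lambda estimated by fitdistr and resets the plotting layout first.
# Reset the plotting layout and overlay the fitted exponential density
par(mfrow = c(1, 1))
hist(crab_age_df_train$`Shucked Weight`, freq = FALSE, breaks = 30,
     xlab = "Shucked Weight", main = "Shucked Weight with Fitted Exponential", col = "blue")
curve(dexp(x, rate = lambda), add = TRUE, col = "red", lwd = 2)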
# 5th and 95th percentiles, here taken as empirical quantiles of the 1,000 simulated draws
percentiles <- quantile(shuck_weight_lambda_df$shuck_weight_lambda_epdf, c(0.05, .95))
percentiles
## 5% 95%
## 0.5415927 29.1781950
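The percentiles above come from the 1,000 simulated draws. The percentiles implied directly by the fitted exponential CDF can also be computed with the inverse CDF, qexp(), using the same lambda; values close to the simulated quantiles would be expected.
# Theoretical 5th and 95th percentiles of the fitted exponential distribution
qexp(c(0.05, 0.95), rate = lambda)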
# Also generate a 95% confidence interval from the empirical data, assuming normality.
confidence_interval <- t.test(crab_age_df_train$`Shucked Weight`)$conf.int
names(confidence_interval) <- c("LowerCI", "UpperCI")
confidence_interval
## LowerCI UpperCI
## 10.06381 10.14473
## attr(,"conf.level")
## [1] 0.95
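Because the sample is so large, essentially the same interval can be reproduced by hand from the normal approximation \(\bar{x} \pm z \cdot s/\sqrt{n}\); with n = 74,051 the t and z quantiles are nearly identical. A quick base R sketch:
# 95% confidence interval for the mean from the normal approximation
sw <- crab_age_df_train$`Shucked Weight`
mean(sw) + c(-1, 1) * qnorm(0.975) * sd(sw) / sqrt(length(sw))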
#Finally, provide the empirical 5th percentile and 95th percentile of the data.
empirical_percentiles <- quantile(crab_age_df_train$`Shucked Weight`, c(0.05, .95))
empirical_percentiles
## 5% 95%
## 1.502523 19.376883
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
To begin, I will run an initial model using Age as the dependent variable and including the rest of the variables as independent, with the exception of id.
sum(is.na(crab_age_df_train))
## [1] 0
# Removing the id variable
train_set_model <- crab_age_df_train %>%
dplyr::select(-id)
ggpairs(train_set_model)
train_set_model2 <- train_set_model %>%
dplyr::select(-Sex)
cor_matrix_train <- cor(train_set_model2)
ggcorrplot::ggcorrplot(cor_matrix_train, hc.order = TRUE, type = "lower",
lab = TRUE)
There appears to be high multicollinearity among the independent variables. The lowest correlation coefficient among them is 0.864, the correlation between Height and Shucked Weight. Since even the lowest pairwise correlation is that high, the independent variables carry a great deal of multicollinearity with one another, which can make it difficult to determine each variable’s effect on the dependent variable. Nevertheless, I’ll create an initial model with all the independent variables and evaluate the model’s performance.
initial_ml <- lm(Age ~ ., data=train_set_model)
summary(initial_ml)
##
## Call:
## lm(formula = Age ~ ., data = train_set_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0347 -1.2228 -0.3320 0.7537 22.0418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.755758 0.075897 49.485 < 2e-16 ***
## SexI -1.040735 0.025209 -41.284 < 2e-16 ***
## SexM -0.115212 0.019157 -6.014 1.82e-09 ***
## Length 0.910211 0.192234 4.735 2.20e-06 ***
## Diameter 2.138507 0.238786 8.956 < 2e-16 ***
## Height 7.222272 0.236275 30.567 < 2e-16 ***
## Weight 0.194265 0.005416 35.867 < 2e-16 ***
## `Shucked Weight` -0.614115 0.006632 -92.603 < 2e-16 ***
## `Viscera Weight` -0.216351 0.011872 -18.224 < 2e-16 ***
## `Shell Weight` 0.512127 0.009820 52.152 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.009e+04 on 9 and 74041 DF, p-value: < 2.2e-16
# Checking for multicollinearity using the Variance Inflation Factor
vif_values <- car::vif(initial_ml)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Sex 1.815731 2 1.160815
## Length 50.029756 1 7.073172
## Diameter 52.539130 1 7.248388
## Height 7.731224 1 2.780508
## Weight 76.731546 1 8.759654
## `Shucked Weight` 22.695250 1 4.763953
## `Viscera Weight` 17.972270 1 4.239371
## `Shell Weight` 20.256450 1 4.500717
plot(initial_ml)
ggplot(initial_ml, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title="Residual vs. Fitted Values Plot") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = initial_ml, aes(x = initial_ml$residuals)) +
geom_histogram(bins = 10, fill = 'steelblue', color = 'black') +
labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')
ggplot(data = initial_ml, aes(x = .resid)) +
geom_histogram(binwidth = 0.4) +
xlab("Residuals")
qqnorm(resid(initial_ml))
qqline(resid(initial_ml))
When analyzing the results of the initial model, several observations stand out. The overall p-value of the model is far below .05, which indicates that the predictors, taken together, are statistically significant for explaining the dependent variable Age. The Adjusted R-squared of 55.08% is a reasonable starting point, meaning that about 55.08% of the variation in Age is explained by the variables in the model. The F-statistic of 1.009e+04, combined with its p-value of less than 2.2e-16, indicates that the null hypothesis that all coefficients are zero can be rejected and the model as a whole is statistically significant. The p-value of each variable is below .05, indicating that each is individually statistically significant, and every coefficient has an absolute t-value well above the usual threshold of about 2, meaning the standard errors are small relative to their respective coefficients.
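The same significance conclusions can be read from the coefficient confidence intervals; given the t-values above, none of the intervals should contain zero. A quick sketch on the fitted model object:
# 95% confidence intervals for the coefficients of the initial model
confint(initial_ml, level = 0.95)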
The diagnostic plots show that the model does not provide a good fit for the training data. The Residuals vs. Fitted plot shows the residuals mostly concentrated around zero for fitted values between 5 and 20, which may indicate that the model does not fit well outside that range. The histogram of the residuals is centered near zero but slightly skewed to the right. The Q-Q plot shows both the right and left tails deviating from the reference line, which indicates that the model does not do a good job of capturing a linear relationship between the independent variables and Age.
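The right skew noted in the residual histogram can be quantified with the skewness() function from the moments package loaded earlier; a value noticeably above zero would confirm the visual impression.
# Skewness of the residuals from the initial model
skewness(resid(initial_ml))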
Since the variables have high multicollinearity and are relatively skewed, I decided to log transform the dependent and independent variables. For purposes of this specific model, I will not include the Sex column.
# train_set_model2 is the data frame without the Sex column
# Note: Height has a minimum of 0 in the training data, so log(Height) will produce -Inf
# for those rows; they may need a small shift or removal before modeling
revised_train_model <- log(train_set_model2)
names(revised_train_model) <- paste0(names(train_set_model2), "_log")
reactable(revised_train_model)