Problem 1.

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

# Set the random seed
set.seed(1234)

# Random variable X with continuous uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)
hist(X,main="Histogram for Random Distribution ", 
     xlab="Random Distribution", 
     border="blue", 
     col="gray",
     xlim=c(-10,20),
     breaks=10)

#Random variable Y with normal distribution (mean = 10, sd = 2.89)
Y <- rnorm(10000, mean = 10, sd = 2.89)
hist(Y,main="Histogram for Normal Distribution ", 
     xlab="Random Distribution", 
     border="blue", 
     col="gray",
     xlim=c(-20,20),
     breaks=20)

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | X>y) b. P(X>x & Y>y) c. P(X<x | X>y)

# Assuming x is the median of X and y is the median of Y
x <- median(X)
y <- median(Y)

# Calculate P(X > x ∩ X > y)
prob_Xx_Xy <- sum(X > x & X > y) / length(X)

# Calculate P(X > y)
prob_Xy <- sum(X > y) / length(X)

# Calculate P(X > x | X > y)
prob_Xx_given_Xy <- prob_Xx_Xy / prob_Xy
prob_Xx_given_Xy
## [1] 1

P(X>x∣X>y)=1 implies that the likelihood of X being above its median when it’s already above the median of Y is absolute or certain.

# Calculate P(X > x ∩ Y > y)
prob_Xx_Yy <- sum(X > x & Y > y) / length(X)
prob_Xx_Yy
## [1] 0.2507

P(X>x&Y>y) = 0.2507 implies that 25% of the observations from variable X are greater than its median (x) and, at the same time, 25% of the observations from variable Y are greater than its median (y).

# Calculate P(X < x ∩ X > y)
prob_XltX_YgtY <- sum(X < x & X > y) / length(X)

# Calculate P(X < x | X > y)
prob_XltX_given_YgtY <- prob_XltX_YgtY / prob_Xy
prob_XltX_given_YgtY
## [1] 0

P(X<x∣X>y) = 0 implies that the probability of a randomly chosen value from variable X being less than its median (x), given that it’s greater than the median of Y (y), is equal to zero or 0%.

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

# Calculate joint probabilities
prob_X_gt_x_and_Y_gt_y <- mean(X > x & Y > y)
prob_X_gt_x_and_Y_gt_y
## [1] 0.2507
# Calculate the product of marginal probabilities
prob_product <- mean(X > x) * mean(Y > y)
prob_product
## [1] 0.25
# Build a table with joint and marginal probabilities
table <- matrix(c(
  (mean(X > x & Y > y)), (mean(X > x & Y <= y)),
  (mean(X <= x & Y > y)), (mean(X <= x & Y <= y))
), nrow = 2, byrow = TRUE)

rownames(table) <- c("X > x", "X <= x")
colnames(table) <- c("Y > y", "Y <= y")

# Create a data frame for the table
contingency_table <- as.data.frame(table)
contingency_table$Total_X <- rowSums(table)
contingency_table <- cbind(contingency_table, Total_Y = colSums(table))
contingency_table <- rbind(
  contingency_table, 
  Total_Y = c(rowSums(table), sum(table))
)
## Warning in rbind(deparse.level, ...): number of columns of result, 4, is not a
## multiple of vector length 3 of arg 2
print(contingency_table[, -ncol(contingency_table)])
##          Y > y Y <= y Total_X
## X > x   0.2507 0.2493     0.5
## X <= x  0.2493 0.2507     0.5
## Total_Y 0.5000 0.5000     1.0

The joint probabilities P(X>x&Y>y) = 0.2507, individual probabilities P(X>x) and P(Y>y) are both 0.5, and the product of marginal probabilities P(X>x)× P(Y>y) = 0.25. They are slightly different, and the value of joint probabilities is a bit higher than product of marginal probabilities.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

Fisher’s Exact Test always gives an exact P value and works fine with small sample sizes.Because it calculates the exact probability of observing a particular distribution of counts in a contingency table, given the marginal totals.

The chi-square test is only an approximation. With large sample sizes, the chi-square test works very well but with small sample sizes, chi-square is not accurate.Because it assesses the independence of two categorical variables by comparing observed frequencies to expected frequencies under the assumption of independence.

test_table <- matrix(c((sum(X > x & Y > y)), (sum(X > x & Y <= y)),
                               (sum(X <= x & Y > y)), (sum(X <= x & Y <= y))),
                                nrow = 2, ncol = 2, byrow = TRUE)
# Fisher's Exact Test
fisher.test(test_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  test_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264
#Chi Square Test
chisq.test(test_table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  test_table
## X-squared = 0.0676, df = 1, p-value = 0.7949

Surprising Results

Both tests produced a same, high p-value (0.7949), that means there isn’t sufficient evidence to reject the null hypothesis of independence between the variables. The 95% Confidence Interval in Fisher’s Exact Test implies that the association between the variables falls within this range ( 0.9342763, 1.0946016).

Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Loading required libraries

library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.1
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(Matrix)
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(moments)
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

Descriptive and Inferential Statistics

# Read the CSV files
test_data <- read_csv("605FinalExam Data/test.csv")
## Rows: 49368 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (8): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_data <- read_csv("605FinalExam Data/train.csv")
## Rows: 74051 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (9): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(train_data)
##  [1] "id"             "Sex"            "Length"         "Diameter"      
##  [5] "Height"         "Weight"         "Shucked Weight" "Viscera Weight"
##  [9] "Shell Weight"   "Age"
str(train_data)
## spc_tbl_ [74,051 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id            : num [1:74051] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Sex           : chr [1:74051] "I" "I" "M" "F" ...
##  $ Length        : num [1:74051] 1.52 1.1 1.39 1.7 1.25 ...
##  $ Diameter      : num [1:74051] 1.175 0.825 1.113 1.413 1.012 ...
##  $ Height        : num [1:74051] 0.375 0.275 0.375 0.5 0.338 ...
##  $ Weight        : num [1:74051] 29 10.4 24.8 50.7 23.3 ...
##  $ Shucked Weight: num [1:74051] 12.73 4.52 11.34 20.35 11.98 ...
##  $ Viscera Weight: num [1:74051] 6.65 2.32 5.56 10.99 4.51 ...
##  $ Shell Weight  : num [1:74051] 8.35 3.4 6.66 15 5.95 ...
##  $ Age           : num [1:74051] 9 8 9 11 8 10 11 11 12 11 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   Sex = col_character(),
##   ..   Length = col_double(),
##   ..   Diameter = col_double(),
##   ..   Height = col_double(),
##   ..   Weight = col_double(),
##   ..   `Shucked Weight` = col_double(),
##   ..   `Viscera Weight` = col_double(),
##   ..   `Shell Weight` = col_double(),
##   ..   Age = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Provide univariate descriptive statistics and appropriate plots for the training data set
# Summary statistics for numerical variables
summary(train_data)
##        id            Sex                Length          Diameter     
##  Min.   :    0   Length:74051       Min.   :0.1875   Min.   :0.1375  
##  1st Qu.:18512   Class :character   1st Qu.:1.1500   1st Qu.:0.8875  
##  Median :37025   Mode  :character   Median :1.3750   Median :1.0750  
##  Mean   :37025                      Mean   :1.3175   Mean   :1.0245  
##  3rd Qu.:55538                      3rd Qu.:1.5375   3rd Qu.:1.2000  
##  Max.   :74050                      Max.   :2.0128   Max.   :1.6125  
##      Height           Weight        Shucked Weight     Viscera Weight    
##  Min.   :0.0000   Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252  
##  1st Qu.:0.3000   1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330  
##  Median :0.3625   Median :23.7994   Median : 9.90815   Median : 4.98951  
##  Mean   :0.3481   Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839  
##  3rd Qu.:0.4125   3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815  
##  Max.   :2.8250   Max.   :80.1015   Max.   :42.18406   Max.   :21.54562  
##   Shell Weight           Age        
##  Min.   : 0.04252   Min.   : 1.000  
##  1st Qu.: 3.96893   1st Qu.: 8.000  
##  Median : 6.93145   Median :10.000  
##  Mean   : 6.72387   Mean   : 9.968  
##  3rd Qu.: 9.07184   3rd Qu.:11.000  
##  Max.   :28.49125   Max.   :29.000

#Provide univariate descriptive statistics and appropriate plots for the training data set

# Calculate summary statistics of Age per Sex
age_summary <- train_data %>%
  group_by(Sex) %>%
  summarise(mean_age = mean(Age),   # Calculate mean age
            median_age = median(Age))  # Calculate median age

# Plotting bar plot for Age by Sex
ggplot(age_summary, aes(x = Sex, y = mean_age, fill = Sex)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  geom_text(aes(label = round(mean_age, 1)), position = position_dodge(width = 0.5), vjust = -0.5) +
  labs(x = "Sex", y = "Mean Age") +
  ggtitle("Mean Age of Crabs by Sex") +
  theme(plot.title = element_text(hjust = 0.5))

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable

ggpairs(train_data,
        columns = c("Length", "Diameter", "Height", "Age"),
        mapping = aes(color = cut_number(Age, 5), alpha = 0.5),
        lower = list(continuous = "points", alpha = 0.3, size = 0.01),
        upper = list(continuous = "smooth", alpha = 0.005, size = 0.1),
        progress = FALSE) +
  theme(axis.text = element_blank(), axis.ticks = element_blank()) +
  labs(title = "Pair plots: lower - scatter; upper - linear fit, colored by target")

Derive a correlation matrix for any three quantitative variables in the dataset.

correlation_matrix = train_data %>% dplyr::select(Length, Diameter, Height, Height) %>% cor() %>% as.matrix()
correlation_matrix
##             Length  Diameter    Height
## Length   1.0000000 0.9894374 0.9183517
## Diameter 0.9894374 1.0000000 0.9213531
## Height   0.9183517 0.9213531 1.0000000

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

# Perform correlation tests between Lenght, Diameter, and Weight variables
cor_test_AL <- cor.test(train_data$Age, train_data$Length, conf.level = 0.80)
cor_test_AD <- cor.test(train_data$Age, train_data$Diameter, conf.level = 0.80)
cor_test_AH <- cor.test(train_data$Age, train_data$Height, conf.level = 0.80)


# Display correlation test results
print(cor_test_AL)
## 
##  Pearson's product-moment correlation
## 
## data:  train_data$Age and train_data$Length
## t = 211.04, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6098938 0.6157754
## sample estimates:
##       cor 
## 0.6128431
print(cor_test_AD)
## 
##  Pearson's product-moment correlation
## 
## data:  train_data$Age and train_data$Diameter
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6183556 0.6241393
## sample estimates:
##       cor 
## 0.6212559
print(cor_test_AH)
## 
##  Pearson's product-moment correlation
## 
## data:  train_data$Age and train_data$Height
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6352664 0.6408507
## sample estimates:
##       cor 
## 0.6380669

The three pair-wised p-values obtained from these correlation tests are very small (close to zero), reject the null hypothesis and there is strong evidence of a linear relationship between these variables.

In the context of conducting multiple correlation tests between ‘Diameter’, ‘Height’, and ‘Weight’, there is a concern about familywise error, especially when conducting several tests without adjusting the significance level.

Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.

# Invert the correlation matrix to obtain the precision matrix
precision_matrix <- solve(correlation_matrix)
precision_matrix
##              Length   Diameter    Height
## Length    48.276661 -45.785573 -2.150272
## Diameter -45.785573  50.040791 -4.057983
## Height    -2.150272  -4.057983  6.713541
# Multiply correlation matrix by precision matrix
correlation_by_precision <- correlation_matrix %*% precision_matrix
correlation_by_precision
##                Length      Diameter       Height
## Length   1.000000e+00 -1.000842e-16 2.953585e-16
## Diameter 7.403530e-16  1.000000e+00 6.518226e-17
## Height   1.776357e-15  8.881784e-16 1.000000e+00
# Multiply precision matrix by correlation matrix
precision_by_correlation <- precision_matrix %*% correlation_matrix
precision_by_correlation
##                 Length     Diameter       Height
## Length    1.000000e+00 3.311900e-16 1.332268e-15
## Diameter  7.155759e-16 1.000000e+00 1.776357e-15
## Height   -5.928199e-16 6.518226e-17 1.000000e+00
# Perform LDU decomposition on the precision matrix
ldu_decomposition <- lu(correlation_matrix)
ldu_decomposition
## LU factorization of Formal class 'denseLU' [package "Matrix"] with 4 slots
##   ..@ x       : num [1:9] 1 0.989 0.918 0.989 0.021 ...
##   ..@ perm    : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 3
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:3] "Length" "Diameter" "Height"
##   .. ..$ : chr [1:3] "Length" "Diameter" "Height"

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of  for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Calculus-Based Probability & Statistics

#Rename the column
names(train_data)[names(train_data) == "Viscera Weight"] <- "Viscera.Weight"

Viscera.Weight_dist <- ggplot(train_data, aes(x = Viscera.Weight)) + 
  geom_histogram(binwidth = 1.5) + 
  coord_cartesian(xlim = c(0, 25)) +
  labs(title = "Distribution of Crab Viscera.Weight")
Viscera.Weight_dist

# Calculate skewness
skewness(train_data$Viscera.Weight)
## [1] 0.286377
# Shift the variable if necessary
if (min(train_data$Viscera.Weight) <= 0) {
  train_data$Viscera.Weight <- train_data$Viscera.Weight - min(train_data$Viscera.Weight) + 1
}

# Fit exponential distribution to the 'Viscera.Weight' variable
fit <- fitdistr(train_data$Viscera.Weight, "exponential")

#Extract the estimated lambda parameter
lambda <- fit$estimate["rate"]

# Generate 1000 samples from exponential distribution
samples <- rexp(1000, lambda)

#compare the histograms of the original variable and the generated samples, we can plot them using the hist function.
# Plot histograms
par(mfrow = c(1, 2))
hist(train_data$Viscera.Weight, main = "Original Variable: Viscera.Weight", col="darkgreen", xlab = "Viscera.Weight")
hist(samples, main = "Exponential Distribution", col="skyblue", xlab = "Samples")

#find the 5th and 95th percentiles using the cumulative distribution function (CDF)
qexp(c(0.05, 0.95), rate = lambda)
## [1]  0.2594613 15.1535702
#Generate a 95% confidence interval from the empirical data, assuming normality
qnorm(c(0.025, 0.975), mean=mean(train_data$Viscera.Weight), sd=sd(train_data$Viscera.Weight))
## [1] -0.4152617 10.5320337
# Empirical 5th percentile and 95th percentile of the data:
quantile(train_data$Viscera.Weight, c(0.05, 0.95))
##       5%      95% 
## 0.793786 9.738053

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Linear Regression Model

# 'Age' is the dependent variable and other variables are independent
M1 <- lm(Age ~ Length + Diameter + Height + Viscera.Weight, data = train_data)
# Summary of the model
summary(M1)
## 
## Call:
## lm(formula = Age ~ Length + Diameter + Height + Viscera.Weight, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.911  -1.469  -0.553   0.722  18.405 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.56227    0.06835  22.856  < 2e-16 ***
## Length         -2.31995    0.21734 -10.674  < 2e-16 ***
## Diameter        6.13654    0.26790  22.906  < 2e-16 ***
## Height         15.78694    0.25770  61.262  < 2e-16 ***
## Viscera.Weight -0.06329    0.00839  -7.543 4.63e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.426 on 74046 degrees of freedom
## Multiple R-squared:  0.416,  Adjusted R-squared:  0.416 
## F-statistic: 1.319e+04 on 4 and 74046 DF,  p-value: < 2.2e-16
#Create a residuals vs. fitted values plot and evaluate the linear regression model
# Obtain model predictions (fitted values) and residuals
fitted_values <- predict(M1)
residuals <- resid(M1)

# Create the residuals vs. fitted values plot
plot(fitted_values, residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs. Fitted Values Plot",
     col = "blue")
abline(h = 0, col = "red")  # Adding a horizontal line at y = 0 for reference

# Create Normal Q-Q plot 
qqnorm(M1$residuals, main = "Normal Q-Q Plot")
qqline(M1$residuals, col = 2)  # Add a reference line for normal distribution

Model Summary

The intercept and coefficients show the estimated effect of each predictor on the dependent variable (Age). Low p-values (< 2.2e-16) for all coefficients implies that they are strongly related to Age. The model explains about 41.6% of the variance in Age. The F-statistic is highly prominent, indicating that the overall model is fit with the train_data set.

Poisson Regression Model

Poisson regression is used when the dependent variable represents Age and and ‘Length’, ‘Diameter’, ‘Height’, and ‘Viscera.Weight’ are the predictor variables those follow a Poisson distribution.

# Assuming 'Age' is the dependent variable
M2 <- glm(Age ~ Length + Diameter + Height + Viscera.Weight, data = train_data, family = "poisson")
summary(M2)
## 
## Call:
## glm(formula = Age ~ Length + Diameter + Height + Viscera.Weight, 
##     family = "poisson", data = train_data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.245322   0.010128 122.960   <2e-16 ***
## Length         -0.033419   0.027935  -1.196    0.232    
## Diameter        0.777556   0.033838  22.979   <2e-16 ***
## Height          1.057764   0.020310  52.080   <2e-16 ***
## Viscera.Weight -0.017794   0.001072 -16.601   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 71090  on 74050  degrees of freedom
## Residual deviance: 38570  on 74046  degrees of freedom
## AIC: 342722
## 
## Number of Fisher Scoring iterations: 6
# Obtain fitted values and residuals
fitted_values <- fitted(M2)
residuals <- residuals(M2, type = "response")  # Use type = "response" for Poisson models

# Create residuals vs. fitted values plot
plot(fitted_values, residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs. Fitted Values Plot")

# Add a horizontal line at y = 0 for reference
abline(h = 0, col = "red", lty = 2)

# Create Normal Q-Q plot 
qqnorm(M2$residuals, main = "Normal Q-Q Plot")
qqline(M2$residuals, col = 2)  # Add a reference line for normal distribution

### Model Summary

The intercept indicates the expected Age when all predictor variables are zero. The coefficient of variable “Length” is not significant (p = 0.232). The model shows that ‘Diameter’, ‘Height’, and ‘Viscera.Weight’ have a highly impact on the expected count of ‘Age’. ‘Length’ doesn’t appear effect on Crab Age based on the p-value. AIC (Akaike Information Criterion) also measures the model’s goodness-of-fit. Lower values indicate a better fit. The AIC in the model is 342722. So, the Poisson regression model does not fit for the train_data set.

Performance Evaluation

#Rename the column
names(test_data)[names(test_data) == "Viscera Weight"] <- "Viscera.Weight"
names(test_data)
## [1] "id"             "Sex"            "Length"         "Diameter"      
## [5] "Height"         "Weight"         "Shucked Weight" "Viscera.Weight"
## [9] "Shell Weight"
# Make predictions using the trained model on the test data for Linear Regression Model
predict_age <- predict(M1, newdata = test_data, type = "response")
# Calculate the mean absolute error (MAE)
mae_lm <- mean(abs(predict_age - train_data$Age))
## Warning in predict_age - train_data$Age: longer object length is not a multiple
## of shorter object length
mae_lm
## [1] 2.917973
# Calculate RMSE
rmse_lm <- sqrt(mean((predict_age - train_data$Age)^2))
## Warning in predict_age - train_data$Age: longer object length is not a multiple
## of shorter object length
rmse_lm
## [1] 3.775332
# Make predictions using the trained model on the test data for Poisson Model
predicted_ages <- predict(M2, newdata = test_data, type = "response")
# Calculate the mean absolute error (MAE)
mae_Poisson <- mean(abs(predicted_ages - train_data$Age))
## Warning in predicted_ages - train_data$Age: longer object length is not a
## multiple of shorter object length
mae_Poisson
## [1] 2.912248
# Calculate RMSE
rmse_Poisson <- sqrt(mean((predicted_ages - train_data$Age)^2))
## Warning in predicted_ages - train_data$Age: longer object length is not a
## multiple of shorter object length
rmse_Poisson
## [1] 3.825034

Model Selection and its Results

In the context of evaluating MAE and RMSE of both models (Poisson Regression model and and multiple liner regression model), or predicting Crab Age, to assess the model’s predictive performance. Lower values for both metrics indicate better model performance, but they measure different aspects of the error.

Thus, Linear Regression model is selected to find the predicted age of Crab because the lm model shows that all predictors ‘Length’, ‘Diameter’, ‘Height’, and ‘Viscera.Weight’ have a deeply impact on the expected ‘Age’ of Crabs and rmse_lm is much lesser than rmse_Poisson.

#Append a new last column called "Result" in which there are predicted Crab Ages to test_data
test_data$Age=predict_age
names(test_data)
##  [1] "id"             "Sex"            "Length"         "Diameter"      
##  [5] "Height"         "Weight"         "Shucked Weight" "Viscera.Weight"
##  [9] "Shell Weight"   "Age"
test_data$id <- as.integer(test_data$id)
test_data <- test_data[, -c(2:9)]
head(test_data)
## # A tibble: 6 × 2
##      id   Age
##   <int> <dbl>
## 1 74051  8.04
## 2 74052  8.45
## 3 74053  9.52
## 4 74054  9.73
## 5 74055  8.19
## 6 74056 10.2

Report Kaggle.com user name and score.

# Write the predicted Age of Result to a csv file for submission to the Kaggle.com.
write.csv(test_data, file = "Lwin.Shwe.csv", row.names = FALSE)

My Kaggle user name is lwinnandar Shwe, and the resulting score from this model is Private score: 1.70788 and Public Score: 1.69888.

The link of submission on Kaggle.com is below: https://www.kaggle.com/competitions/playground-series-s3e16/submissions#