Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
# Set the random seed
set.seed(1234)
# Random variable X with continuous uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)
hist(X, main = "Histogram of X (Uniform Distribution)",
     xlab = "X (uniform on [5, 15])",
     border = "blue",
     col = "gray",
     xlim = c(-10, 20),
     breaks = 10)
#Random variable Y with normal distribution (mean = 10, sd = 2.89)
Y <- rnorm(10000, mean = 10, sd = 2.89)
hist(Y, main = "Histogram of Y (Normal Distribution)",
     xlab = "Y (normal, mean = 10, sd = 2.89)",
     border = "blue",
     col = "gray",
     xlim = c(-20, 20),
     breaks = 20)
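As a quick visual sanity check (a sketch beyond what the prompt requires), the theoretical N(10, 2.89) density can be overlaid on a density-scaled histogram of Y:
# Overlay the theoretical N(10, 2.89) density on a density-scaled histogram of Y
hist(Y, freq = FALSE, main = "Y with Theoretical Density", xlab = "Y",
     border = "blue", col = "gray", breaks = 20)
curve(dnorm(x, mean = 10, sd = 2.89), add = TRUE, col = "red", lwd = 2)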
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
# Assuming x is the median of X and y is the median of Y
x <- median(X)
y <- median(Y)
# Calculate P(X > x ∩ X > y)
prob_Xx_Xy <- sum(X > x & X > y) / length(X)
# Calculate P(X > y)
prob_Xy <- sum(X > y) / length(X)
# Calculate P(X > x | X > y)
prob_Xx_given_Xy <- prob_Xx_Xy / prob_Xy
prob_Xx_given_Xy
## [1] 1
P(X>x∣X>y) = 1 means that, given X already exceeds the median of Y, X is certain to exceed its own median. This is no coincidence: both medians are close to 10, and in this sample y is at least as large as x, so the event {X > y} is contained in {X > x}.
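This can be verified directly from the two sample medians (whose exact values depend only on the seed):
# If x <= y, then {X > y} is a subset of {X > x}, forcing P(X > x | X > y) = 1 exactly
c(x = x, y = y)
x <= y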
# Calculate P(X > x ∩ Y > y)
prob_Xx_Yy <- sum(X > x & Y > y) / length(X)
prob_Xx_Yy
## [1] 0.2507
P(X>x&Y>y) = 0.2507 means that about 25% of the (X, Y) pairs simultaneously have X above its median and Y above its median. Since each marginal event has probability 0.5, a joint probability near 0.5 × 0.5 = 0.25 is exactly what independence predicts.
# Calculate P(X < x ∩ X > y)
prob_Xltx_Xgty <- sum(X < x & X > y) / length(X)
# Calculate P(X < x | X > y)
prob_Xltx_given_Xgty <- prob_Xltx_Xgty / prob_Xy
prob_Xltx_given_Xgty
## [1] 0
P(X<x∣X>y) = 0 means that a value of X that is greater than the median of Y can never be less than the median of X. Again this follows from y ≥ x in this sample: X > y implies X > x, leaving no room for X < x.
# Calculate joint probabilities
prob_X_gt_x_and_Y_gt_y <- mean(X > x & Y > y)
prob_X_gt_x_and_Y_gt_y
## [1] 0.2507
# Calculate the product of marginal probabilities
prob_product <- mean(X > x) * mean(Y > y)
prob_product
## [1] 0.25
# Build a table with joint and marginal probabilities
table <- matrix(c(
(mean(X > x & Y > y)), (mean(X > x & Y <= y)),
(mean(X <= x & Y > y)), (mean(X <= x & Y <= y))
), nrow = 2, byrow = TRUE)
rownames(table) <- c("X > x", "X <= x")
colnames(table) <- c("Y > y", "Y <= y")
# Create a data frame for the table
contingency_table <- as.data.frame(table)
contingency_table$Total_X <- rowSums(table)
contingency_table <- cbind(contingency_table, Total_Y = colSums(table))
contingency_table <- rbind(
  contingency_table,
  Total_Y = c(colSums(table), sum(table), NA)  # NA fills the unused corner cell
)
print(contingency_table[, -ncol(contingency_table)])
## Y > y Y <= y Total_X
## X > x 0.2507 0.2493 0.5
## X <= x 0.2493 0.2507 0.5
## Total_Y 0.5000 0.5000 1.0
The joint probability P(X>x&Y>y) = 0.2507, the marginal probabilities P(X>x) and P(Y>y) are both 0.5, and the product of the marginals is P(X>x) × P(Y>y) = 0.25. The joint probability is only slightly higher than the product of the marginals, which is what we would expect if X and Y are independent.
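The gap between the two quantities is tiny and easy to compute explicitly:
# Difference between the joint probability and the product of marginals
prob_X_gt_x_and_Y_gt_y - prob_product  # 0.2507 - 0.25 = 0.0007, consistent with independence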
Fisher’s Exact Test always gives an exact p-value and works well even with small sample sizes, because it computes the exact probability of observing a particular distribution of counts in a contingency table, given the marginal totals.
The chi-square test, by contrast, is an approximation: it assesses the independence of two categorical variables by comparing observed frequencies to the frequencies expected under independence, and the approximation is accurate only when the expected cell counts are large. With roughly 2,500 observations per cell here, the chi-square approximation is entirely adequate.
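To illustrate the difference, consider a toy 2x2 table with small, hypothetical counts, where the chi-square approximation becomes unreliable while Fisher's test remains exact:
# Toy table with small (hypothetical) counts
small_table <- matrix(c(3, 1, 1, 4), nrow = 2)
fisher.test(small_table)$p.value  # exact p-value
chisq.test(small_table)$p.value   # approximate; R warns that expected counts are small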
test_table <- matrix(c((sum(X > x & Y > y)), (sum(X > x & Y <= y)),
(sum(X <= x & Y > y)), (sum(X <= x & Y <= y))),
nrow = 2, ncol = 2, byrow = TRUE)
# Fisher's Exact Test
fisher.test(test_table)
##
## Fisher's Exact Test for Count Data
##
## data: test_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
# Chi-Square Test
chisq.test(test_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: test_table
## X-squared = 0.0676, df = 1, p-value = 0.7949
Both tests produce the same high p-value (0.7949), meaning there is insufficient evidence to reject the null hypothesis of independence between the variables. The 95% confidence interval for the odds ratio from Fisher’s Exact Test (0.9342763, 1.0946016) contains 1, which likewise points to independence.
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(Matrix)
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(moments)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# Read the CSV files
test_data <- read_csv("605FinalExam Data/test.csv")
## Rows: 49368 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (8): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_data <- read_csv("605FinalExam Data/train.csv")
## Rows: 74051 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (9): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(train_data)
## [1] "id" "Sex" "Length" "Diameter"
## [5] "Height" "Weight" "Shucked Weight" "Viscera Weight"
## [9] "Shell Weight" "Age"
str(train_data)
## spc_tbl_ [74,051 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:74051] 0 1 2 3 4 5 6 7 8 9 ...
## $ Sex : chr [1:74051] "I" "I" "M" "F" ...
## $ Length : num [1:74051] 1.52 1.1 1.39 1.7 1.25 ...
## $ Diameter : num [1:74051] 1.175 0.825 1.113 1.413 1.012 ...
## $ Height : num [1:74051] 0.375 0.275 0.375 0.5 0.338 ...
## $ Weight : num [1:74051] 29 10.4 24.8 50.7 23.3 ...
## $ Shucked Weight: num [1:74051] 12.73 4.52 11.34 20.35 11.98 ...
## $ Viscera Weight: num [1:74051] 6.65 2.32 5.56 10.99 4.51 ...
## $ Shell Weight : num [1:74051] 8.35 3.4 6.66 15 5.95 ...
## $ Age : num [1:74051] 9 8 9 11 8 10 11 11 12 11 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. Sex = col_character(),
## .. Length = col_double(),
## .. Diameter = col_double(),
## .. Height = col_double(),
## .. Weight = col_double(),
## .. `Shucked Weight` = col_double(),
## .. `Viscera Weight` = col_double(),
## .. `Shell Weight` = col_double(),
## .. Age = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Provide univariate descriptive statistics and appropriate plots for the training data set
# Summary statistics for numerical variables
summary(train_data)
## id Sex Length Diameter
## Min. : 0 Length:74051 Min. :0.1875 Min. :0.1375
## 1st Qu.:18512 Class :character 1st Qu.:1.1500 1st Qu.:0.8875
## Median :37025 Mode :character Median :1.3750 Median :1.0750
## Mean :37025 Mean :1.3175 Mean :1.0245
## 3rd Qu.:55538 3rd Qu.:1.5375 3rd Qu.:1.2000
## Max. :74050 Max. :2.0128 Max. :1.6125
## Height Weight Shucked Weight Viscera Weight
## Min. :0.0000 Min. : 0.0567 Min. : 0.02835 Min. : 0.04252
## 1st Qu.:0.3000 1st Qu.:13.4377 1st Qu.: 5.71242 1st Qu.: 2.86330
## Median :0.3625 Median :23.7994 Median : 9.90815 Median : 4.98951
## Mean :0.3481 Mean :23.3852 Mean :10.10427 Mean : 5.05839
## 3rd Qu.:0.4125 3rd Qu.:32.1625 3rd Qu.:14.03300 3rd Qu.: 6.98815
## Max. :2.8250 Max. :80.1015 Max. :42.18406 Max. :21.54562
## Shell Weight Age
## Min. : 0.04252 Min. : 1.000
## 1st Qu.: 3.96893 1st Qu.: 8.000
## Median : 6.93145 Median :10.000
## Mean : 6.72387 Mean : 9.968
## 3rd Qu.: 9.07184 3rd Qu.:11.000
## Max. :28.49125 Max. :29.000
# Calculate summary statistics of Age per Sex
age_summary <- train_data %>%
group_by(Sex) %>%
summarise(mean_age = mean(Age), # Calculate mean age
median_age = median(Age)) # Calculate median age
# Plotting bar plot for Age by Sex
ggplot(age_summary, aes(x = Sex, y = mean_age, fill = Sex)) +
geom_bar(stat = "identity", position = "dodge", width = 0.5) +
geom_text(aes(label = round(mean_age, 1)), position = position_dodge(width = 0.5), vjust = -0.5) +
labs(x = "Sex", y = "Mean Age") +
ggtitle("Mean Age of Crabs by Sex") +
theme(plot.title = element_text(hjust = 0.5))
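A univariate plot of the dependent variable itself is also informative; a simple histogram of Age:
# Histogram of the dependent variable Age
ggplot(train_data, aes(x = Age)) +
  geom_histogram(binwidth = 1, fill = "gray", color = "blue") +
  labs(x = "Age", y = "Count") +
  ggtitle("Distribution of Crab Age") +
  theme(plot.title = element_text(hjust = 0.5))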
ggpairs(train_data,
columns = c("Length", "Diameter", "Height", "Age"),
mapping = aes(color = cut_number(Age, 5), alpha = 0.5),
lower = list(continuous = "points", alpha = 0.3, size = 0.01),
upper = list(continuous = "smooth", alpha = 0.005, size = 0.1),
progress = FALSE) +
theme(axis.text = element_blank(), axis.ticks = element_blank()) +
labs(title = "Pair plots: lower - scatter; upper - linear fit, colored by target")
correlation_matrix <- train_data %>% dplyr::select(Length, Diameter, Height) %>% cor() %>% as.matrix()
correlation_matrix
## Length Diameter Height
## Length 1.0000000 0.9894374 0.9183517
## Diameter 0.9894374 1.0000000 0.9213531
## Height 0.9183517 0.9213531 1.0000000
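As an optional visualization (a sketch using GGally, which is already loaded), the same correlation matrix can be drawn as a heat map:
# Heat map of the correlation matrix
ggcorr(train_data[, c("Length", "Diameter", "Height")], label = TRUE, label_round = 2)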
# Perform correlation tests between Age and each of Length, Diameter, and Height
cor_test_AL <- cor.test(train_data$Age, train_data$Length, conf.level = 0.80)
cor_test_AD <- cor.test(train_data$Age, train_data$Diameter, conf.level = 0.80)
cor_test_AH <- cor.test(train_data$Age, train_data$Height, conf.level = 0.80)
# Display correlation test results
print(cor_test_AL)
##
## Pearson's product-moment correlation
##
## data: train_data$Age and train_data$Length
## t = 211.04, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6098938 0.6157754
## sample estimates:
## cor
## 0.6128431
print(cor_test_AD)
##
## Pearson's product-moment correlation
##
## data: train_data$Age and train_data$Diameter
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6183556 0.6241393
## sample estimates:
## cor
## 0.6212559
print(cor_test_AH)
##
## Pearson's product-moment correlation
##
## data: train_data$Age and train_data$Height
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6352664 0.6408507
## sample estimates:
## cor
## 0.6380669
All three pairwise p-values from these correlation tests are essentially zero (< 2.2e-16), so we reject the null hypothesis of zero correlation in each case: there is strong evidence of a linear relationship between Age and each of Length, Diameter, and Height.
Because we conducted three correlation tests (Age vs. Length, Diameter, and Height), there is in principle a concern about familywise error: running several tests at a fixed significance level inflates the probability of at least one false rejection. In practice I would not be worried here, since the p-values are so small that any standard correction (e.g., Bonferroni) leaves the conclusions unchanged.
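As a check, the three p-values can be Bonferroni-adjusted; all three remain effectively zero, so the conclusions stand:
# Bonferroni adjustment to control the familywise error rate
p_values <- c(cor_test_AL$p.value, cor_test_AD$p.value, cor_test_AH$p.value)
p.adjust(p_values, method = "bonferroni")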
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
# Invert the correlation matrix to obtain the precision matrix
precision_matrix <- solve(correlation_matrix)
precision_matrix
## Length Diameter Height
## Length 48.276661 -45.785573 -2.150272
## Diameter -45.785573 50.040791 -4.057983
## Height -2.150272 -4.057983 6.713541
# Multiply correlation matrix by precision matrix
correlation_by_precision <- correlation_matrix %*% precision_matrix
correlation_by_precision
## Length Diameter Height
## Length 1.000000e+00 -1.000842e-16 2.953585e-16
## Diameter 7.403530e-16 1.000000e+00 6.518226e-17
## Height 1.776357e-15 8.881784e-16 1.000000e+00
# Multiply precision matrix by correlation matrix
precision_by_correlation <- precision_matrix %*% correlation_matrix
precision_by_correlation
## Length Diameter Height
## Length 1.000000e+00 3.311900e-16 1.332268e-15
## Diameter 7.155759e-16 1.000000e+00 1.776357e-15
## Height -5.928199e-16 6.518226e-17 1.000000e+00
# Perform LDU decomposition on the correlation matrix (via its LU factorization)
ldu_decomposition <- lu(correlation_matrix)
ldu_decomposition
## LU factorization of Formal class 'denseLU' [package "Matrix"] with 4 slots
## ..@ x : num [1:9] 1 0.989 0.918 0.989 0.021 ...
## ..@ perm : int [1:3] 1 2 3
## ..@ Dim : int [1:2] 3 3
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:3] "Length" "Diameter" "Height"
## .. ..$ : chr [1:3] "Length" "Diameter" "Height"
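The denseLU object can be expanded into explicit factors to complete the LDU decomposition; a sketch using Matrix::expand (the permutation here is the identity, since perm = 1 2 3):
# Expand the LU factorization into P (permutation), L (unit lower-triangular),
# and U (upper-triangular)
factors <- expand(ldu_decomposition)
L <- factors$L
U <- factors$U
# Split U into D (the pivots on its diagonal) and a unit upper-triangular factor
D <- Diagonal(x = diag(U))
U_unit <- solve(D) %*% U
# Verify that L %*% D %*% U_unit reproduces the correlation matrix
max(abs(as.matrix(L %*% D %*% U_unit) - correlation_matrix))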
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
#Rename the column
names(train_data)[names(train_data) == "Viscera Weight"] <- "Viscera.Weight"
Viscera.Weight_dist <- ggplot(train_data, aes(x = Viscera.Weight)) +
geom_histogram(binwidth = 1.5) +
coord_cartesian(xlim = c(0, 25)) +
labs(title = "Distribution of Crab Viscera.Weight")
Viscera.Weight_dist
# Calculate skewness
skewness(train_data$Viscera.Weight)
## [1] 0.286377
# Shift the variable if necessary
if (min(train_data$Viscera.Weight) <= 0) {
train_data$Viscera.Weight <- train_data$Viscera.Weight - min(train_data$Viscera.Weight) + 1
}
# Fit exponential distribution to the 'Viscera.Weight' variable
fit <- fitdistr(train_data$Viscera.Weight, "exponential")
#Extract the estimated lambda parameter
lambda <- fit$estimate["rate"]
# Generate 1000 samples from exponential distribution
samples <- rexp(1000, lambda)
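# Sanity check (optional): for the exponential distribution, the MLE of the
# rate is 1 / mean(x), so lambda should match 1 / mean(train_data$Viscera.Weight)
# (roughly 1 / 5.06, per the summary statistics above)
lambda
1 / mean(train_data$Viscera.Weight)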
# Compare histograms of the original variable and the generated exponential samples
# Plot histograms
par(mfrow = c(1, 2))
hist(train_data$Viscera.Weight, main = "Original Variable: Viscera.Weight", col="darkgreen", xlab = "Viscera.Weight")
hist(samples, main = "Exponential Distribution", col="skyblue", xlab = "Samples")
#find the 5th and 95th percentiles using the cumulative distribution function (CDF)
qexp(c(0.05, 0.95), rate = lambda)
## [1] 0.2594613 15.1535702
#Generate a 95% confidence interval from the empirical data, assuming normality
qnorm(c(0.025, 0.975), mean=mean(train_data$Viscera.Weight), sd=sd(train_data$Viscera.Weight))
## [1] -0.4152617 10.5320337
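# Note: the qnorm interval above uses the sample SD, so it brackets individual
# observations (it even dips below zero, impossible for a weight). A 95% CI for
# the mean uses sd/sqrt(n) and is far narrower, e.g. via a one-sample t-test:
t.test(train_data$Viscera.Weight, conf.level = 0.95)$conf.int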
# Empirical 5th percentile and 95th percentile of the data:
quantile(train_data$Viscera.Weight, c(0.05, 0.95))
## 5% 95%
## 0.793786 9.738053
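Discussion: the exponential fit is poor. Its 5th and 95th percentiles (0.26 and 15.15) are much more spread out than the empirical percentiles (0.79 and 9.74), because the exponential density decreases monotonically from zero while the observed Viscera.Weight distribution is unimodal with only mild right skew (skewness 0.29). The normal-theory interval likewise extends below zero, which is impossible for a weight, so neither closed form is entirely satisfactory for this variable.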
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
# 'Age' is the dependent variable and other variables are independent
M1 <- lm(Age ~ Length + Diameter + Height + Viscera.Weight, data = train_data)
# Summary of the model
summary(M1)
##
## Call:
## lm(formula = Age ~ Length + Diameter + Height + Viscera.Weight,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.911 -1.469 -0.553 0.722 18.405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.56227 0.06835 22.856 < 2e-16 ***
## Length -2.31995 0.21734 -10.674 < 2e-16 ***
## Diameter 6.13654 0.26790 22.906 < 2e-16 ***
## Height 15.78694 0.25770 61.262 < 2e-16 ***
## Viscera.Weight -0.06329 0.00839 -7.543 4.63e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.426 on 74046 degrees of freedom
## Multiple R-squared: 0.416, Adjusted R-squared: 0.416
## F-statistic: 1.319e+04 on 4 and 74046 DF, p-value: < 2.2e-16
#Create a residuals vs. fitted values plot and evaluate the linear regression model
# Obtain model predictions (fitted values) and residuals
fitted_values <- predict(M1)
residuals <- resid(M1)
# Create the residuals vs. fitted values plot
plot(fitted_values, residuals,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs. Fitted Values Plot",
col = "blue")
abline(h = 0, col = "red") # Adding a horizontal line at y = 0 for reference
# Create Normal Q-Q plot
qqnorm(M1$residuals, main = "Normal Q-Q Plot")
qqline(M1$residuals, col = 2) # Add a reference line for normal distribution
The intercept and coefficients show the estimated effect of each predictor on the dependent variable (Age). The very low p-values (< 2.2e-16) for all coefficients imply that each predictor is strongly related to Age. The model explains about 41.6% of the variance in Age, and the large F-statistic (p < 2.2e-16) indicates that the model as a whole fits significantly better than an intercept-only model.
Poisson regression is appropriate when the dependent variable is a count. Age in this dataset is a non-negative integer, so as an alternative we fit a Poisson GLM with ‘Length’, ‘Diameter’, ‘Height’, and ‘Viscera.Weight’ as predictors.
# Assuming 'Age' is the dependent variable
M2 <- glm(Age ~ Length + Diameter + Height + Viscera.Weight, data = train_data, family = "poisson")
summary(M2)
##
## Call:
## glm(formula = Age ~ Length + Diameter + Height + Viscera.Weight,
## family = "poisson", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.245322 0.010128 122.960 <2e-16 ***
## Length -0.033419 0.027935 -1.196 0.232
## Diameter 0.777556 0.033838 22.979 <2e-16 ***
## Height 1.057764 0.020310 52.080 <2e-16 ***
## Viscera.Weight -0.017794 0.001072 -16.601 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 71090 on 74050 degrees of freedom
## Residual deviance: 38570 on 74046 degrees of freedom
## AIC: 342722
##
## Number of Fisher Scoring iterations: 6
# Obtain fitted values and residuals
fitted_values <- fitted(M2)
residuals <- residuals(M2, type = "response") # Use type = "response" for Poisson models
# Create residuals vs. fitted values plot
plot(fitted_values, residuals,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs. Fitted Values Plot")
# Add a horizontal line at y = 0 for reference
abline(h = 0, col = "red", lty = 2)
# Create Normal Q-Q plot
qqnorm(M2$residuals, main = "Normal Q-Q Plot")
qqline(M2$residuals, col = 2) # Add a reference line for normal distribution
### Model Summary
The intercept is the log of the expected Age when all predictor variables are zero. The coefficient for ‘Length’ is not significant (p = 0.232), while ‘Diameter’, ‘Height’, and ‘Viscera.Weight’ all have highly significant effects on the expected value of ‘Age’. AIC (Akaike Information Criterion) measures the model’s goodness of fit, with lower values indicating a better fit; the AIC of this model is 342722, which is very high, suggesting the Poisson regression model is not a good fit for the training data.
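The deviance output also lets us check the Poisson assumption that the variance equals the mean:
# Residual deviance / residual df should be near 1 under the Poisson assumption
deviance(M2) / df.residual(M2)  # 38570 / 74046, about 0.52, i.e., underdispersion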
#Rename the column
names(test_data)[names(test_data) == "Viscera Weight"] <- "Viscera.Weight"
names(test_data)
## [1] "id" "Sex" "Length" "Diameter"
## [5] "Height" "Weight" "Shucked Weight" "Viscera.Weight"
## [9] "Shell Weight"
# Make predictions with the linear regression model on the test data
predict_age <- predict(M1, newdata = test_data)
# The test set carries no Age labels, so prediction error is assessed in-sample
train_pred_lm <- predict(M1, newdata = train_data)
# Calculate the mean absolute error (MAE)
mae_lm <- mean(abs(train_pred_lm - train_data$Age))
mae_lm
# Calculate RMSE (for lm this is approximately the residual standard error, 2.426)
rmse_lm <- sqrt(mean((train_pred_lm - train_data$Age)^2))
rmse_lm
# In-sample predictions from the Poisson model
train_pred_pois <- predict(M2, newdata = train_data, type = "response")
# Calculate the mean absolute error (MAE)
mae_Poisson <- mean(abs(train_pred_pois - train_data$Age))
mae_Poisson
# Calculate RMSE
rmse_Poisson <- sqrt(mean((train_pred_pois - train_data$Age)^2))
rmse_Poisson
MAE and RMSE are computed for both models (the Poisson regression and the multiple linear regression) to assess their in-sample predictive performance for crab age. Lower values of both metrics indicate better performance, but they measure different aspects of the error: MAE weights all errors equally, while RMSE penalizes large errors more heavily.
The linear regression model is therefore selected to predict crab age: all four predictors (‘Length’, ‘Diameter’, ‘Height’, and ‘Viscera.Weight’) are highly significant in that model, its residual standard error (2.426) is modest, and the Poisson model’s very high AIC (342722) and non-significant ‘Length’ coefficient make it the weaker candidate.
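A more reliable comparison would hold out part of the training data; a minimal sketch (the seed and 80/20 split are arbitrary choices for illustration):
# Hold out 20% of the training data and compare holdout RMSE of both models
set.seed(42)  # arbitrary seed for the illustrative split
idx <- sample(nrow(train_data), size = 0.8 * nrow(train_data))
holdout <- train_data[-idx, ]
fit_lm <- lm(Age ~ Length + Diameter + Height + Viscera.Weight, data = train_data[idx, ])
fit_pois <- glm(Age ~ Length + Diameter + Height + Viscera.Weight,
                data = train_data[idx, ], family = "poisson")
sqrt(mean((predict(fit_lm, holdout) - holdout$Age)^2))                      # linear model RMSE
sqrt(mean((predict(fit_pois, holdout, type = "response") - holdout$Age)^2)) # Poisson RMSE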
# Append the predicted crab ages to test_data as a new "Age" column
test_data$Age <- predict_age
names(test_data)
## [1] "id" "Sex" "Length" "Diameter"
## [5] "Height" "Weight" "Shucked Weight" "Viscera.Weight"
## [9] "Shell Weight" "Age"
test_data$id <- as.integer(test_data$id)
# Keep only the id and Age columns required for the submission file
test_data <- test_data[, -c(2:9)]
head(test_data)
## # A tibble: 6 × 2
## id Age
## <int> <dbl>
## 1 74051 8.04
## 2 74052 8.45
## 3 74053 9.52
## 4 74054 9.73
## 5 74055 8.19
## 6 74056 10.2
# Write the predicted ages to a csv file for submission to Kaggle.com
write.csv(test_data, file = "Lwin.Shwe.csv", row.names = FALSE)
My Kaggle user name is lwinnandar Shwe, and the resulting scores from this model are a private score of 1.70788 and a public score of 1.69888.
The submission on Kaggle.com can be found at: https://www.kaggle.com/competitions/playground-series-s3e16/submissions#