---
title: "Final Data605"
author: "Shariq Mian"
date: "2023-12-14"
output: html_document
---

Problem 1.

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

set.seed(1234)
X = runif(n = 10000, min = 5, max = 15)
Y = rnorm(10000, mean = 10, sd = 2.89)

hist(X)

hist(Y)

### Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

x = median(X)
y = median(Y)
print(x)
## [1] 10.01266
print(y)
## [1] 10.03154
#a. The number of times X > x and X > y, divided by the number of times X > x: the conditional probability P(X > y | X > x)
a = sum(X > x & X > y) / sum(X > x)
#b. The number of times X > x and Y > y, divided by the total number of observations: the joint probability P(X > x & Y > y)
b = sum(X > x & Y > y) / length(X)
#c. The number of times X < x and X > y, divided by the total number of observations: the joint probability P(X < x & X > y)
c = sum(X < x & X > y) / length(X)

cat("a. P(X > y | X > x):", a, "\n")
## a. P(X > y | X > x): 0.9958
cat("b. P(X > x & Y > y):", b, "\n")
## b. P(X > x & Y > y): 0.2507
cat("c. P(X < x & X > y):", c, "\n")
## c. P(X < x & X > y): 0
  a. Given X > x, the probability that X also exceeds y is 0.9958, close to 1. This makes sense because y (10.03) is only slightly larger than x (10.01), so nearly every value above the median of X is also above y.
  b. The joint probability that both X > x and Y > y is 0.2507, close to the 0.25 we would expect if X and Y were independent.
  c. The joint probability that X < x and X > y is 0: since y is slightly larger than x, no value of X can be below x and above y at the same time.

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

table_XY <- table(X > x, Y > y)
print(table_XY)
##        
##         FALSE TRUE
##   FALSE  2507 2493
##   TRUE   2493 2507
prob_X_gt_x <- sum(X > x) / length(X)
prob_Y_gt_y <- sum(Y > y) / length(Y)


prob_Xgt_x_and_Ygt_y <- table_XY[2, 2] / sum(table_XY)

equality_check <- prob_Xgt_x_and_Ygt_y == (prob_X_gt_x * prob_Y_gt_y)


print(paste("P(X > x) =", prob_X_gt_x))
## [1] "P(X > x) = 0.5"
print(paste("P(Y > y) =", prob_Y_gt_y))
## [1] "P(Y > y) = 0.5"
print(paste("P(X > x and Y > y) =", prob_Xgt_x_and_Ygt_y))
## [1] "P(X > x and Y > y) = 0.2507"
print(paste("P(X > x) * P(Y > y) =", prob_X_gt_x * prob_Y_gt_y))
## [1] "P(X > x) * P(Y > y) = 0.25"
print(paste("Are they equal? (TRUE/FALSE):", equality_check))
## [1] "Are they equal? (TRUE/FALSE): FALSE"

Based on the table, the joint probability P(X > x & Y > y) is observed to be 0.2507. The marginal probabilities P(X > x) and P(Y > y) are each 0.5, so the product of the marginals is 0.25.

Hence the joint probability P(X > x & Y > y) and the product of the marginal probabilities P(X > x)P(Y > y) are not exactly identical: the joint probability is marginally higher (0.2507 vs. 0.25), a difference small enough to be plausibly due to sampling variation.
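Note that the exact `==` comparison above will flag even this tiny sampling difference. A tolerance-based check (a small sketch using base R's all.equal) makes the comparison explicit:

# Compare the joint probability to the product of the marginals with a
# relative tolerance instead of exact equality; 0.2507 is within 1% of 0.25.
isTRUE(all.equal(prob_Xgt_x_and_Ygt_y, prob_X_gt_x * prob_Y_gt_y,
                 tolerance = 0.01))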

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

matrix_prob <- matrix(c((sum(X > x & Y > y)), (sum(X > x & Y <= y)),
                               (sum(X <= x & Y > y)), (sum(X <= x & Y <= y))),
                                nrow = 2, ncol = 2, byrow = TRUE)
matrix_prob
##      [,1] [,2]
## [1,] 2507 2493
## [2,] 2493 2507
fisher.test(matrix_prob)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix_prob
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264

Fisher’s Exact Test and the Chi Square Test are used to assess the relationship between two variables based on observed data. Fisher’s Exact Test determines the probability of observing the data or a more extreme outcome under the assumption of independence. It considers all possible tables with the same marginal totals and sums the probabilities of tables as or more extreme than the observed table. If this probability is small (typically < 0.05), the null hypothesis of independence is rejected.

In contrast, the Chi Square Test compares observed counts in a contingency table to expected counts under independence. It calculates a test statistic by comparing squared differences between observed and expected counts, divided by the expected counts. Under the null hypothesis, the test statistic follows a Chi Square distribution with degrees of freedom based on the number of variable categories. If the calculated statistic exceeds the critical value, the null hypothesis is rejected.
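To make the mechanics concrete, the statistic can be reproduced by hand from the 2x2 table above (a sketch; note that chisq.test applies Yates' continuity correction by default for 2x2 tables, subtracting 0.5 from each |observed - expected| before squaring):

# Manual Pearson chi-square from the observed 2x2 counts
observed <- matrix(c(2507, 2493, 2493, 2507), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
sum((observed - expected)^2 / expected)              # 0.0784 uncorrected
sum((abs(observed - expected) - 0.5)^2 / expected)   # 0.0676 with Yates' correction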

#Chi Square Test
chisq.test(matrix_prob)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  matrix_prob
## X-squared = 0.0676, df = 1, p-value = 0.7949

For Fisher’s exact test, the p-value of 0.7949 is the probability of observing a table as extreme as, or more extreme than, the observed table, assuming the null hypothesis of independence is true. A high p-value indicates weak evidence against the null hypothesis, so there is no statistical basis to reject independence between X and Y. The point estimate of the odds ratio (1.011) is essentially 1, further supporting the absence of association. Based on Fisher’s Exact Test, there is no significant evidence that X and Y are dependent; the data support the hypothesis of independence.

Fisher’s exact test is typically used when sample sizes are small, while the chi-square test is more appropriate for large samples. With 10,000 observations here, the chi-square test is the more appropriate choice.

Similar to Fisher’s exact test, the chi-square p-value represents the probability of observing a table at least this extreme under independence; here it is the same value, 0.7949. The chi-square statistic of 0.0676 measures the (very small) discrepancy between the observed and expected frequencies. These results are not surprising: X and Y were generated independently, so we should expect both tests to fail to reject independence.

Problem 2

# Load required libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'purrr' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Matrix)
## Warning: package 'Matrix' was built under R version 4.3.2
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(caret)
## Warning: package 'caret' was built under R version 4.3.1
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
# Read the crab training dataset from GitHub
train_df <- read.csv(url("https://github.com/mianshariq/Datasets/raw/e72c11f47fcbf656ba206d69d7db3245f6079ada/train_crab.csv"))

Descriptive and Inferential Statistics.

Univariate descriptive statistics for Age, the dependent variable.

summary_age <- summary(train_df$Age)


age_histogram <- ggplot(train_df, aes(x = Age)) + 
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) + 
  labs(title = "Distribution of Crab Ages",
       x = "Age",
       y = "Frequency")

print(summary_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   8.000  10.000   9.968  11.000  29.000
print(age_histogram)

scatter_diameter_age <- ggplot(train_df, aes(x = Diameter, y = Age)) +
  geom_point(color = "blue", alpha = 0.7) +
  labs(title = "Scatterplot of Crab Diameter vs. Age",
       x = "Diameter",
       y = "Age")

shell_weight_dist=ggplot(train_df, aes(x = Shell.Weight)) + 
  geom_histogram(binwidth = 1) + 
  coord_cartesian(xlim = c(0,25)) +
  labs(title = "Distribution of Crab Shell Weights")

scatter_shell_weight_age <- ggplot(train_df, aes(x = `Shell.Weight`, y = Age)) +
  geom_point(color = "green", alpha = 0.7) +
  labs(title = "Scatterplot of Crab Shell Weight vs. Age",
       x = "Shell Weight",
       y = "Age")

print(scatter_diameter_age)

print(shell_weight_dist)

print(scatter_shell_weight_age)

The distribution of crab shell weights appears mostly normal, with a slight rightward skew. A noticeable positive linear correlation is observed between crab shell weight and age, indicating that as the shell weight of a crab increases, there is an associated likelihood of it being older.

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

scatter_matrix <- GGally::ggpairs(train_df[, c("Age", "Length","Height", "Diameter","Shell.Weight")])
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
print(scatter_matrix)

Correlation Matrix and Hypothesis Testing: Derive a correlation matrix for any three quantitative variables in the dataset

The correlation matrix quantifies the strength and direction of linear relationships. The hypothesis test assesses if the correlations are significantly different from zero.

Null Hypothesis: There is no (significant) correlation between a crab’s shell weight and its age.

Alt. Hypothesis: There is a (significant) correlation between a crab’s shell weight and its age.

correlation_vars <- train_df[, c("Diameter", "Height", "Weight", "Shell.Weight", "Age")]

cor_matrix=cor(correlation_vars)
print(cor_matrix)
##               Diameter    Height    Weight Shell.Weight       Age
## Diameter     1.0000000 0.9213531 0.9382486    0.9226882 0.6212559
## Height       0.9213531 1.0000000 0.9017748    0.9033982 0.6380669
## Weight       0.9382486 0.9017748 1.0000000    0.9655246 0.6011950
## Shell.Weight 0.9226882 0.9033982 0.9655246    1.0000000 0.6634733
## Age          0.6212559 0.6380669 0.6011950    0.6634733 1.0000000
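For a quicker visual read of these relationships, the matrix can be rendered as a heatmap (an optional sketch; assumes the corrplot package is installed):

library(corrplot)
# Color-coded correlation matrix with the coefficients overlaid
corrplot(cor_matrix, method = "color", addCoef.col = "black")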

Null Hypothesis: There is no (significant) correlation between a crab’s diameter and its age. Alt. Hypothesis: There is a (significant) correlation between a crab’s diameter and its age.

Null Hypothesis: There is no (significant) correlation between a crab’s height and its age. Alt. Hypothesis: There is a (significant) correlation between a crab’s height and its age.

Null Hypothesis: There is no (significant) correlation between a crab’s shell weight and its age. Alt. Hypothesis: There is a (significant) correlation between a crab’s shell weight and its age.

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

selected_columns <- train_df %>%
  select(Diameter, Height, Shell.Weight, Age)

correlation_results <- cor.test(selected_columns$Diameter, selected_columns$Age, method = "pearson",conf.level = 0.80)
print(correlation_results)
## 
##  Pearson's product-moment correlation
## 
## data:  selected_columns$Diameter and selected_columns$Age
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6183556 0.6241393
## sample estimates:
##       cor 
## 0.6212559
correlation_results <- cor.test(selected_columns$Height, selected_columns$Age, method = "pearson",conf.level = 0.80)
print(correlation_results)
## 
##  Pearson's product-moment correlation
## 
## data:  selected_columns$Height and selected_columns$Age
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6352664 0.6408507
## sample estimates:
##       cor 
## 0.6380669
correlation_results <- cor.test(selected_columns$Shell.Weight, selected_columns$Age, method = "pearson",conf.level = 0.80)
print(correlation_results)
## 
##  Pearson's product-moment correlation
## 
## data:  selected_columns$Shell.Weight and selected_columns$Age
## t = 241.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6608286 0.6661015
## sample estimates:
##       cor 
## 0.6634733

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Since the p-values are below the alpha of 0.05, we reject the null hypothesis for all three pairs: there is a significant correlation between a crab’s age and its diameter, its height, and its shell weight. In other words, as crabs get older they tend to have larger diameters, greater heights, and heavier shells, and the extremely small p-values (< 2.2e-16) make it very unlikely these correlations are due to random chance.

Familywise error refers to the increased probability of making at least one Type I error (false positive) when conducting multiple hypothesis tests; the risk grows with the number of tests. Since we ran three tests, some concern is warranted in principle, but with p-values this small even a corrected significance threshold would not change the conclusions.
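As a quick illustration (a sketch: R reports the three p-values only as < 2.2e-16, so that bound is used as a stand-in value), a Bonferroni adjustment simply multiplies each p-value by the number of tests:

# Bonferroni-adjust the three correlation-test p-values; even after
# adjustment they remain far below 0.05.
p_values <- c(2.2e-16, 2.2e-16, 2.2e-16)  # Diameter, Height, Shell.Weight vs. Age
p.adjust(p_values, method = "bonferroni")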

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
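Before inverting, it is worth unpacking the claim about the diagonal: entry j of the precision matrix equals 1 / (1 - R²_j), where R²_j comes from regressing variable j on the other variables, i.e., its variance inflation factor. A small sketch verifying this for Diameter (since R² is scale-invariant, the raw variables can be used):

# The precision matrix's diagonal holds VIFs: 1 / (1 - R^2_j)
r2_diam <- summary(lm(Diameter ~ Height + Weight + Shell.Weight + Age,
                      data = correlation_vars))$r.squared
c(vif_from_precision = solve(cor_matrix)["Diameter", "Diameter"],
  vif_from_regression = 1 / (1 - r2_diam))  # the two values should agree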

precision_matrix <- solve(cor_matrix)

result_matrix_1 <- cor_matrix %*% precision_matrix
result_matrix_2 <- precision_matrix %*% cor_matrix

# LDU of the first product (note: lu() returns L with a unit diagonal,
# so only U needs rescaling; the sparse LU's row/column permutations are
# trivial for this near-identity matrix)
lu_decomp_1 <- lu(as(Matrix(result_matrix_1), "CsparseMatrix"))
L_matrix_1 <- lu_decomp_1@L
U_matrix_1 <- lu_decomp_1@U

# Pull the pivots of U into D, leaving U unit upper triangular
D_matrix_1 <- diag(diag(U_matrix_1))
U_matrix_1 <- diag(1 / diag(U_matrix_1)) %*% U_matrix_1

# Reconstruct: A = L D U
A_matrix_1 <- L_matrix_1 %*% D_matrix_1 %*% U_matrix_1

# LDU of the second product
lu_decomp_2 <- lu(as(Matrix(result_matrix_2), "CsparseMatrix"))
L_matrix_2 <- lu_decomp_2@L
U_matrix_2 <- lu_decomp_2@U

# Pull the pivots of U into D, leaving U unit upper triangular
D_matrix_2 <- diag(diag(U_matrix_2))
U_matrix_2 <- diag(1 / diag(U_matrix_2)) %*% U_matrix_2

# Reconstruct: A = L D U
A_matrix_2 <- L_matrix_2 %*% D_matrix_2 %*% U_matrix_2

cat("\nResults of LDU Decomp Matrix 1")
## 
## Results of LDU Decomp Matrix 1
print(A_matrix_1)
## 5 x 5 Matrix of class "dgeMatrix"
##               [,1]          [,2]          [,3]          [,4]          [,5]
## [1,]  1.000000e+00  1.110223e-16 -2.220446e-16  4.440892e-16  0.000000e+00
## [2,]  1.110223e-16  1.000000e+00  2.886580e-15  3.330669e-15 -2.220446e-16
## [3,] -2.220446e-16  2.886580e-15  1.000000e+00 -1.110223e-15 -2.220446e-16
## [4,]  4.440892e-16  3.330669e-15 -1.110223e-15  1.000000e+00 -4.440892e-16
## [5,]  0.000000e+00 -2.220446e-16 -2.220446e-16 -4.440892e-16  1.000000e+00
cat("\nResults of LDU Decomp Matrix 2")
## 
## Results of LDU Decomp Matrix 2
print(A_matrix_2)
## 5 x 5 Matrix of class "dgeMatrix"
##               [,1]          [,2]          [,3]          [,4]          [,5]
## [1,]  1.000000e+00  4.440892e-16 -5.551115e-17 -1.110223e-16  7.216450e-16
## [2,]  4.440892e-16  1.000000e+00 -1.387779e-15 -1.110223e-15 -1.221245e-15
## [3,] -5.551115e-17 -1.387779e-15  1.000000e+00  8.881784e-16  8.881784e-16
## [4,] -1.110223e-16 -1.110223e-15  8.881784e-16  1.000000e+00 -8.881784e-16
## [5,]  7.216450e-16 -1.221245e-15  8.881784e-16 -8.881784e-16  1.000000e+00
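Both reconstructed products come back as the 5 x 5 identity up to floating-point rounding (off-diagonal entries on the order of 1e-15), which is expected: a matrix multiplied by its inverse is the identity, and the LDU factors of the identity are themselves identity matrices. A quick numerical sanity check (a sketch):

# Both products should be numerically equal to the identity matrix
all.equal(as.matrix(result_matrix_1), diag(5), check.attributes = FALSE)
all.equal(as.matrix(result_matrix_2), diag(5), check.attributes = FALSE)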

Calculus-Based Probability & Statistics.

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary

We use ggpairs to look at the distributions of the different features; the distribution of crab shell weights is skewed to the right.

scatter_matrix <- GGally::ggpairs(train_df[, c("Age", "Sex", "Shucked.Weight", "Length",'Diameter','Height','Weight','Shell.Weight')])
print(scatter_matrix)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

selected_variable <- train_df$Shell.Weight
shifted_variable <- selected_variable - min(selected_variable) + 1

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
exp_fit <- fitdistr(shifted_variable, densfun = "exponential")
lambda <- exp_fit$estimate["rate"]
exp_samples <- rexp(1000, rate = lambda)

alpha_value <- 0.5
hist(shifted_variable, main = "Original Variable vs. Exponential Distribution Samples", xlab = "Values", col = rgb(0, 0, 1, alpha = alpha_value), freq = FALSE, breaks = 30)
hist(exp_samples, main = "Exponential Distribution Samples", xlab = "Values", col = rgb(0, 1, 0, alpha = alpha_value), add = TRUE, freq = FALSE, breaks = 30)
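A legend makes the overlaid histograms easier to read (a small addition):

legend("topright", legend = c("Shifted Shell.Weight", "Exponential samples"),
       fill = c(rgb(0, 0, 1, alpha = 0.5), rgb(0, 1, 0, alpha = 0.5)))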

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

exp_5th_percentile <- qexp(0.05, rate = lambda)
exp_95th_percentile <- qexp(0.95, rate = lambda)

cat("The exponential 5th percentile is", exp_5th_percentile, "\n")
## The exponential 5th percentile is 0.3940015
cat("The exponential 95th percentile is", exp_95th_percentile, "\n")
## The exponential 95th percentile is 23.01126

The 5th percentile (0.3940015) of the exponential distribution suggests a lower threshold for shell.weight. This value indicates the weight below which approximately 5% of hypothetical crab shells would fall if they were following an exponential distribution. The 95th percentile (23.01126) provides an upper threshold, suggesting the weight above which approximately 95% of hypothetical crab shells would fall.
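These percentiles follow directly from the exponential CDF, F(x) = 1 - exp(-λx): inverting gives the p-th percentile in closed form as -log(1 - p)/λ, which matches qexp (a quick check):

# Closed-form exponential percentiles, equivalent to qexp()
-log(1 - 0.05) / lambda   # 5th percentile, ~0.394
-log(1 - 0.95) / lambda   # 95th percentile, ~23.01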

95% confidence interval from the empirical data, assuming normality.

mean_variable <- mean(shifted_variable)
sd_variable <- sd(shifted_variable)
ci_low <- mean_variable - 1.96 * sd_variable / sqrt(length(shifted_variable))
ci_up <- mean_variable + 1.96 * sd_variable / sqrt(length(shifted_variable))

cat("The empirical 95% CI is", ci_low, "-", ci_up, "\n")
## The empirical 95% CI is 7.655529 - 7.707163

The empirical 95% CI reflects the precision of estimating the population mean based on the sample mean and standard deviation. The empirical 95% CI for the mean shell.weight (7.655529 to 7.707163) indicates a range where we are reasonably confident the true mean shell.weight lies, assuming a normal distribution.
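The same interval can be obtained in one line with t.test; with roughly 74,000 observations the t and normal critical values are effectively identical (a sketch):

# One-line 95% confidence interval for the mean of the shifted variable
t.test(shifted_variable, conf.level = 0.95)$conf.int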

Empirical 5th percentile and 95th percentile of the data

emp_5th_percentile <- quantile(shifted_variable, 0.05)
emp_95th_percentile <- quantile(shifted_variable, 0.95)

cat("The empirical 5th percentile is", emp_5th_percentile, "\n")
## The empirical 5th percentile is 2.063106
cat("The empirical 95th percentile is", emp_95th_percentile, "\n")
## The empirical 95th percentile is 13.70058

The empirical 5th percentile (2.063106) represents a lower threshold for observed shell.weight values, below which 5% of actual crab shells fall. The empirical 95th percentile (13.70058) is an upper threshold where 95% of observed crab shells fall. Empirical percentiles provide insights into the distribution of the original variable, indicating its central tendency and spread.

Understanding the distribution of shell.weight is essential for predicting the age of a crab. Deviations from the theoretical exponential model may suggest factors influencing shell weight beyond a simple exponential relationship.

Modeling

Build some type of multiple regression model and submit your model to the competition board.

Initial Model

initial_model <- lm(Age ~ Sex + Length + Diameter + Height + Weight + `Shucked.Weight` + `Viscera.Weight` + `Shell.Weight`, data = train_df)

initial_model <- step(initial_model, direction = "backward")
## Start:  AIC=111865.1
## Age ~ Sex + Length + Diameter + Height + Weight + Shucked.Weight + 
##     Viscera.Weight + Shell.Weight
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                        335336 111865
## - Length          1       102 335438 111886
## - Diameter        1       363 335700 111943
## - Viscera.Weight  1      1504 336840 112195
## - Height          1      4232 339568 112792
## - Weight          1      5826 341163 113139
## - Sex             2      8658 343994 113749
## - Shell.Weight    1     12318 347654 114535
## - Shucked.Weight  1     38838 374174 119978
summary(initial_model)
## 
## Call:
## lm(formula = Age ~ Sex + Length + Diameter + Height + Weight + 
##     Shucked.Weight + Viscera.Weight + Shell.Weight, data = train_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0347  -1.2228  -0.3320   0.7537  22.0418 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.755758   0.075897  49.485  < 2e-16 ***
## SexI           -1.040735   0.025209 -41.284  < 2e-16 ***
## SexM           -0.115212   0.019157  -6.014 1.82e-09 ***
## Length          0.910211   0.192234   4.735 2.20e-06 ***
## Diameter        2.138507   0.238786   8.956  < 2e-16 ***
## Height          7.222272   0.236275  30.567  < 2e-16 ***
## Weight          0.194265   0.005416  35.867  < 2e-16 ***
## Shucked.Weight -0.614115   0.006632 -92.603  < 2e-16 ***
## Viscera.Weight -0.216351   0.011872 -18.224  < 2e-16 ***
## Shell.Weight    0.512127   0.009820  52.152  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5508 
## F-statistic: 1.009e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

Overall, the model has a moderate fit (R-squared of 0.5508) with all predictor variables significant. However, it is important to further examine the regression assumptions, such as the normality of residuals and the absence of influential outliers, to ensure the reliability of the results.

library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
plot(initial_model, which = c(1, 3))

set.seed(123)
sample_indices <- sample(1:nrow(train_df), 5000)
shapiro.test(residuals(initial_model)[sample_indices])
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(initial_model)[sample_indices]
## W = 0.9033, p-value < 2.2e-16
qqnorm(residuals(initial_model))
qqline(residuals(initial_model))

vif(initial_model)
##                     GVIF Df GVIF^(1/(2*Df))
## Sex             1.815731  2        1.160815
## Length         50.029756  1        7.073172
## Diameter       52.539130  1        7.248388
## Height          7.731224  1        2.780508
## Weight         76.731546  1        8.759654
## Shucked.Weight 22.695250  1        4.763953
## Viscera.Weight 17.972270  1        4.239371
## Shell.Weight   20.256450  1        4.500717

With a p-value very close to zero, we reject the null hypothesis of the Shapiro-Wilk test and conclude that the residuals (the differences between the observed and predicted values) are not normally distributed. The Shapiro-Wilk test specifically assesses departure from normality, and a low p-value indicates such a departure.

When residuals are normally distributed, the prediction errors follow a normal distribution, which is essential for making valid statistical inferences and constructing reliable prediction intervals. The VIF output also flags strong multicollinearity among the size and weight predictors (GVIFs well above 10 for Length, Diameter, Weight, and the weight components), which inflates coefficient standard errors even though it does not bias the fitted predictions.

Since the Shapiro-Wilk p-value is very close to zero, we can try some transformations to see if they improve the model. First we add a quadratic term and examine the residuals-vs-fitted plot.

# List of predictors
predictors <- c('Sex', 'Length', 'Diameter', 'Height', 'Weight', 'Shucked.Weight', 'Viscera.Weight', 'Shell.Weight')

model <- lm(Age ~ . + I(Length^2), data = train_df[, c("Age", predictors)])

plot(model, which = 1)  # Residuals vs. Fitted

summary(model)
## 
## Call:
## lm(formula = Age ~ . + I(Length^2), data = train_df[, c("Age", 
##     predictors)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3301  -1.2255  -0.3330   0.7497  21.9674 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.055101   0.120119   8.784  < 2e-16 ***
## SexI           -1.034540   0.025069 -41.267  < 2e-16 ***
## SexM           -0.121361   0.019052  -6.370  1.9e-10 ***
## Length          8.593002   0.327432  26.244  < 2e-16 ***
## Diameter        0.136602   0.247349   0.552    0.581    
## Height          5.598562   0.241579  23.175  < 2e-16 ***
## Weight          0.231749   0.005540  41.832  < 2e-16 ***
## Shucked.Weight -0.587295   0.006660 -88.187  < 2e-16 ***
## Viscera.Weight -0.164884   0.011939 -13.811  < 2e-16 ***
## Shell.Weight    0.562182   0.009918  56.686  < 2e-16 ***
## I(Length^2)    -3.601417   0.124615 -28.900  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.116 on 74040 degrees of freedom
## Multiple R-squared:  0.5558, Adjusted R-squared:  0.5558 
## F-statistic:  9265 on 10 and 74040 DF,  p-value: < 2.2e-16

The results are essentially unchanged: R-squared improves only marginally, from 0.5508 to 0.5558.

Log transformation of the dependent variable

We can try another transformation to see if it improves the model: log-transform the dependent variable, then examine the residuals-vs-fitted plot and the model summary.

log_model <- lm(log(Age) ~ ., data = train_df[, c("Age", predictors)])
plot(log_model, which = 1)  # Residuals vs. Fitted

summary(log_model)
## 
## Call:
## lm(formula = log(Age) ~ ., data = train_df[, c("Age", predictors)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.03326 -0.11932 -0.01918  0.09387  1.37992 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.3046848  0.0067216 194.104  < 2e-16 ***
## SexI           -0.1100606  0.0022326 -49.297  < 2e-16 ***
## SexM           -0.0093908  0.0016966  -5.535 3.12e-08 ***
## Length          0.3085725  0.0170246  18.125  < 2e-16 ***
## Diameter        0.3899925  0.0211474  18.442  < 2e-16 ***
## Height          0.8036416  0.0209250  38.406  < 2e-16 ***
## Weight          0.0113024  0.0004797  23.563  < 2e-16 ***
## Shucked.Weight -0.0513660  0.0005873 -87.459  < 2e-16 ***
## Viscera.Weight -0.0186469  0.0010514 -17.736  < 2e-16 ***
## Shell.Weight    0.0369459  0.0008697  42.482  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1885 on 74041 degrees of freedom
## Multiple R-squared:  0.6393, Adjusted R-squared:  0.6393 
## F-statistic: 1.458e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

R-squared increases to 0.6393. Overall this is the best of the three models, with a reasonably good fit and significant predictor variables.

Predicting on the test data and saving the submission file.

test_df <- read.csv(url("https://github.com/mianshariq/Datasets/raw/e72c11f47fcbf656ba206d69d7db3245f6079ada/test_crab.csv"))

predicted_ages= predict(log_model, newdata = test_df)

predicted_ages <- exp(predicted_ages)

test_df$Age=predicted_ages
names(test_df)
##  [1] "id"             "Sex"            "Length"         "Diameter"      
##  [5] "Height"         "Weight"         "Shucked.Weight" "Viscera.Weight"
##  [9] "Shell.Weight"   "Age"
crab_results = test_df %>%
  dplyr::select( id, Age)

write.csv(crab_results, file = "submission.csv", row.names = FALSE)
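One design note on the back-transform: exponentiating a log-scale prediction estimates the conditional median of Age, not its mean. If the competition metric rewards mean predictions, a standard lognormal bias correction multiplies by exp(sigma^2 / 2) (a sketch, using the residual standard error of the log model):

# Optional lognormal bias correction for mean (rather than median) predictions
sigma_hat <- summary(log_model)$sigma  # residual SD on the log scale
predicted_mean_ages <- exp(predict(log_model, newdata = test_df) + sigma_hat^2 / 2)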

Provide your complete model summary and results with analysis.

Submitting File to Kaggle Competition

Overall, the model scored 1.47 in the Kaggle competition.

Kaggle Results (screenshot)