Problem #2

Question

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition.  https://www.kaggle.com/competitions/playground-series-s3e16  I want you to do the following.

5 points.  Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set.  Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset.  Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.  Discuss the meaning of your analysis.  Would you be worried about familywise error? Why or why not?

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix. 

5 points.  Calculus-Based Probability & Statistics.  Many times, it makes sense to fit a closed form distribution to data.  Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.  Then load the MASS package and run fitdistr to fit an exponential probability density function.  (See  https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).  Find the optimal value of l for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, l)).  Plot a histogram and compare it with a histogram of your original variable.   Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).   Also generate a 95% confidence interval from the empirical data, assuming normality.  Finally, provide the empirical 5th percentile and 95th percentile of the data.  Discuss.

10 points.  Modeling.  Build some type of multiple regression  model and submit your model to the competition board.  Provide your complete model summary and results with analysis.  Report your Kaggle.com user name and score.

Work and Answer

Dependent Variable: Crab Age

Independent Variables Chosen: Height, Weight, Shell Weight

Descriptive and Inferential Statistics

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(Matrix)
## Warning: package 'Matrix' was built under R version 4.3.2
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
testcrab=read.csv(url("https://raw.githubusercontent.com/sleepysloth12/data605_final/main/crab_dataset/test.csv"))

traincrab=read.csv(url("https://raw.githubusercontent.com/sleepysloth12/data605_final/main/crab_dataset/train.csv"))


summary_stats_raw=traincrab %>%
  select(Height, Weight, Shell.Weight, Age)

#dependent var age dist

age_dat= summary_stats_raw %>%
  summarise(mean_age=mean(Age),
            med_age=median(Age),
            std_age=sd(Age),
            var_age=std_age^2,
            min_age=min(Age),
            max_age=max(Age),
            range_age= range(Age),
            iqr_age= IQR(Age),
            n_age=n())
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
age_dist=ggplot(summary_stats_raw, aes(x = Age)) + 
  geom_histogram(binwidth = 1) + 
  coord_cartesian(xlim = c(0,35)) +
  labs(title = "Distribution of Crab Ages")

age_dist

Age Data seems to follow a normal distribution pattern.

Height as an Independent Variable

height_dat= summary_stats_raw %>%
  summarise(mean_height=mean(Height),
            med_height=median(Height),
            std_height=sd(Height),
            var_height=std_height^2,
            min_height=min(Height),
            max_height=max(Height),
            range_height= range(Height),
            iqr_height= IQR(Height),
            n_height=n())
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
height_dist=ggplot(summary_stats_raw, aes(x = Height)) + 
  geom_histogram(binwidth = 0.01) + 
  coord_cartesian(xlim = c(0,.75)) +
  labs(title = "Distribution of Crab Heights")

height_scatter=ggplot(summary_stats_raw, aes(x = Height, y = Age)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.2), alpha = 0.5) + 
  theme_minimal() +
  labs(title = "Scatterplot of Crab's Height vs. Age",
       x = "Height",
       y = "Age")

height_dist

height_scatter

Data seems to follow a normal distribution pattern.

There seems to be a positive linear correlation between crab weight and age. As the crab increases in weight, the chances of it being older increases.

Weight as an Independent Variable

weight_dat= summary_stats_raw %>%
  summarise(mean_weight=mean(Weight),
            med_weight=median(Weight),
            std_weight=sd(Weight),
            var_weight=std_weight^2,
            min_weight=min(Weight),
            max_weight=max(Weight),
            range_weight= range(Weight),
            iqr_weight= IQR(Weight),
            n_weight=n())
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
weight_dist=ggplot(summary_stats_raw, aes(x = Weight)) + 
  geom_histogram(binwidth = 2) + 
  coord_cartesian(xlim = c(0,90)) +
  labs(title = "Distribution of Crab Weights")
height_dist

weight_scatter=ggplot(summary_stats_raw, aes(x = Weight, y = Age)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.2), alpha = 0.5) + 
  theme_minimal() +
  labs(title = "Scatterplot of Crab's Weight vs. Age",
       x = "Weight",
       y = "Age")
weight_dist

weight_scatter

The Distribution of crab weights seem normal. It is a little bit skewed to the right.

There seems to be a positive linear correlation between crab weight and age. As the crab increases in weight, the chances of it being older increases.

Shell Weight as an Independent Variable

shell_weight_dat= summary_stats_raw %>%
  summarise(mean_shell_weight=mean(Shell.Weight),
            med_shell_weight=median(Shell.Weight),
            std_shell_weight=sd(Shell.Weight),
            var_shell_weight=std_shell_weight^2,
            min_shell_weight=min(Shell.Weight),
            max_shell_weight=max(Shell.Weight),
            range_shell_weight= range(Shell.Weight),
            iqr_shell_weight= IQR(Shell.Weight),
            n_shell_weight=n())
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
shell_weight_dist=ggplot(summary_stats_raw, aes(x = Shell.Weight)) + 
  geom_histogram(binwidth = 1) + 
  coord_cartesian(xlim = c(0,25)) +
  labs(title = "Distribution of Crab Shell Weights")


shell_weight_scatter=ggplot(summary_stats_raw, aes(x = Shell.Weight, y = Age)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.2), alpha = 0.5) + 
  theme_minimal() +
  labs(title = "Scatterplot of Crab's Shell Weight vs. Age",
       x = "Shell Weight",
       y = "Age")

shell_weight_dist

shell_weight_scatter

The Distribution of crab shell weights seem normal. It is a little bit skewed to the right.

There seems to be a positive linear correlation between crab shell weight and age. As the crab increases in shell weight, the chances of it being older increases.

Correlation Matrix

Null Hypothesis: There is no (significant) correlation between a Crab’s shell weight and a crab’s age.

Alt. Hypothesis: There is a (significant) correlation between a Crab’s Shell weight and a crab’s age.

cor_matrix = cor(summary_stats_raw[, c("Height","Weight","Shell.Weight", "Age")], use = "complete.obs")
print(cor_matrix)
##                 Height    Weight Shell.Weight       Age
## Height       1.0000000 0.9017748    0.9033982 0.6380669
## Weight       0.9017748 1.0000000    0.9655246 0.6011950
## Shell.Weight 0.9033982 0.9655246    1.0000000 0.6634733
## Age          0.6380669 0.6011950    0.6634733 1.0000000
cor_test_result = cor.test(summary_stats_raw$Shell.Weight, summary_stats_raw$Age, conf.level=0.80)


print(cor_test_result)
## 
##  Pearson's product-moment correlation
## 
## data:  summary_stats_raw$Shell.Weight and summary_stats_raw$Age
## t = 241.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6608286 0.6661015
## sample estimates:
##       cor 
## 0.6634733

There is a 80% that the values fall between 0.6608 and 0.6661 [80% Confidence Interval].

Since the P-value is less than the alpha value of 0.05, we reject the null hypothesis.

There is a significant correlation between a Crab’s Shell Weight and a Crab’s age.

Null Hypothesis: There is no (significant) correlation between a Crab’s weight and a crab’s age.

Alt. Hypothesis: There is a (significant) correlation between a Crab’s weight and a crab’s age.

cor_test_result2 = cor.test(summary_stats_raw$Weight, summary_stats_raw$Age, conf.level=0.80)


print(cor_test_result2)
## 
##  Pearson's product-moment correlation
## 
## data:  summary_stats_raw$Weight and summary_stats_raw$Age
## t = 204.73, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5981791 0.6041938
## sample estimates:
##      cor 
## 0.601195

There is a 80% that the values fall between 0.5981 and 0.6041 [80% Confidence Interval].

Since the P-value is less than the alpha value of 0.05, we reject the null hypothesis.

There is a significant correlation between a Crab’s Weight and a Crab’s age.

Null Hypothesis: There is no (significant) correlation between a Crab’s height and a crab’s age.

Alt. Hypothesis: There is a (significant) correlation between a Crab’s height and a crab’s age.

cor_test_result3 = cor.test(summary_stats_raw$Height, summary_stats_raw$Age, conf.level=0.80)


print(cor_test_result3)
## 
##  Pearson's product-moment correlation
## 
## data:  summary_stats_raw$Height and summary_stats_raw$Age
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6352664 0.6408507
## sample estimates:
##       cor 
## 0.6380669

There is a 80% that the values fall between 0.6352 and 0.6408 [80% Confidence Interval].

Since the P-value is less than the alpha value of 0.05, we reject the null hypothesis.

There is a significant correlation between a Crab’s height and a Crab’s age.

Interpretation

  1. The Age, Height, Weight, and Shell Weight of crabs seem to follow a normal distribution.

  2. There is a positive correlation between each of the physical characteristics (Height, Weight, Shell Weight) and the Age of crabs. This means that as crabs grow older, they tend to be larger in these physical dimensions.

  3. The hypothesis tests conducted for each pair of variables (Height vs. Age, Weight vs. Age, Shell Weight vs. Age) show significant correlations, as indicated by the p-values being less than the alpha value of 0.05. This means that these correlations are unlikely to be due to random chance.

Family wise Error

In my analysis, I tested multiple hypotheses. When conducting multiple hypothesis tests, the chance of encountering at least one Type I error (false positive) increases. Yes, I should be worried about family wise error. To control for family wise error, you can use Bonferroni correction. This makes the criteria for statistical significance more stringent by adjusting the alpha value.

Linear Algebra and Correlation

#precision matrix
precision_matrix = solve(cor_matrix)
print(precision_matrix)
##                  Height     Weight Shell.Weight        Age
## Height        6.0532835  -3.003406    -2.150850 -0.6297359
## Weight       -3.0034063  16.877510   -14.508305  1.3955723
## Shell.Weight -2.1508496 -14.508305    18.316492 -2.0577974
## Age          -0.6297359   1.395572    -2.057797  1.9280962
#correlation matrix x precision matrix

r1= cor_matrix %*% precision_matrix
r2= precision_matrix %*% cor_matrix

#corr matrix x precision matrix

lu_decomp1=lu(as(Matrix(r1),"CsparseMatrix"))
L1=lu_decomp1@L
U1= lu_decomp1@U

D1= diag(diag(U1))

L1= L1 %*% diag(1/diag(U1))

U1= diag(1/diag(U1)) %*% U1

A1= L1 %*% D1 %*% U1

print("Results of LDU Decomp precision corr matrix x precision matrix")
## [1] "Results of LDU Decomp precision corr matrix x precision matrix"
print(A1)
## 4 x 4 Matrix of class "dgeMatrix"
##               [,1]          [,2]          [,3]         [,4]
## [1,]  1.000000e+00 -2.220446e-16 -4.440892e-16 0.000000e+00
## [2,] -2.220446e-16  1.000000e+00 -2.886580e-15 0.000000e+00
## [3,] -4.440892e-16 -2.886580e-15  1.000000e+00 2.220446e-16
## [4,]  0.000000e+00  0.000000e+00  2.220446e-16 1.000000e+00
#precision matrix x corr matrix
lu_decomp2=lu(as(Matrix(r2),"CsparseMatrix"))
L2=lu_decomp2@L
U2= lu_decomp2@U

D2= diag(diag(U2))

L2= L2 %*% diag(1/diag(U2))

U2= diag(1/diag(U2))%*% U2

A2= L2 %*% D2 %*% U2

print("Results of LDU Decomp precision matrix x corr matrix")
## [1] "Results of LDU Decomp precision matrix x corr matrix"
print(A2)
## 4 x 4 Matrix of class "dgeMatrix"
##               [,1]          [,2]          [,3]          [,4]
## [1,]  1.000000e+00 -9.436896e-16 -2.775558e-16 -6.661338e-16
## [2,] -9.436896e-16  1.000000e+00  9.992007e-16  1.554312e-15
## [3,] -2.775558e-16  9.992007e-16  1.000000e+00 -8.881784e-16
## [4,] -6.661338e-16  1.554312e-15 -8.881784e-16  1.000000e+00

Calculus-Based Probability & Statistics

Finding Variable skewed to the Right

The distribution of Crab Shell Weights, although seeming like a bell curve and relatively normal; is skewed a little bit to the right.

shell_weight_dist=ggplot(summary_stats_raw, aes(x = Shell.Weight)) + 
  geom_histogram(binwidth = 1.5) + 
  coord_cartesian(xlim = c(0,25)) +
  labs(title = "Distribution of Crab Shell Weights")

shell_weight_dist

Making Sure Minimum Value is above 0

sw_min=min(summary_stats_raw$Shell.Weight)

cat("The minimum value of Shell weight is", sw_min, "\n which is greater than 0")
## The minimum value of Shell weight is 0.04252425 
##  which is greater than 0

MASS Package, fitdistr and sampling

library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
sw_exp=fitdistr(summary_stats_raw$Shell.Weight, densfun = "exponential")

lambda=sw_exp$estimate["rate"]

sw_samp=rexp(1000,lambda)

Comparing the Two Distributions

Original Histogram

Sampled Exponential Distribution Histogram
sw_swamp_df=data.frame(sw_samp=sw_samp)

ggplot(sw_swamp_df, aes(x = sw_samp)) + 
  geom_histogram(binwidth = 1) + 
  coord_cartesian(xlim = c(0,40)) +
  labs(title = "Distribution of Crab Shell Weights")

Both distributions are right-skewed, but the skewness in the sampled data is more pronounced with a long tail extending further to the right. The original data has a larger sample size. You can tell by the higher counts in each bin. The y-axis scale for the original data reaches up to 10,000, while the sampled data’s y-axis scale goes up to 100, suggesting a much smaller sample size. The mode of the original data is higher than the mode of the sampled data. This means that the most common value of shell weight is higher in the original data set.

Percentiles and Confidence Intervals

#exp dist percentiles

exp_5=qexp(0.05, rate=lambda)
exp_95=qexp(0.95, rate=lambda)

#emp 95th CI

mean_sw=mean(summary_stats_raw$Shell.Weight)
sd_sw=sd(summary_stats_raw$Shell.Weight)
ci_low=mean_sw - 1.96 * sd_sw / sqrt(length(summary_stats_raw$Shell.Weight))
ci_up=mean_sw + 1.96 * sd_sw / sqrt(length(summary_stats_raw$Shell.Weight))

#emp 95th percentile

emp_5=quantile(summary_stats_raw$Shell.Weight, 0.05)
emp_95=quantile(summary_stats_raw$Shell.Weight, 0.95)


cat("The exponential 5th percentile is", exp_5, "\n")
## The exponential 5th percentile is 0.3448894
cat("The exponential 95th percentile is", exp_95, "\n")
## The exponential 95th percentile is 20.14291
cat("The empirical 95% CI is", ci_low, "-", ci_up, "\n")
## The empirical 95% CI is 6.698053 - 6.749687
cat("The empirical 5th percentile is", emp_5, "\n")
## The empirical 5th percentile is 1.10563
cat("The empirical 95th percentile is", emp_95, "\n")
## The empirical 95th percentile is 12.7431

The differences between the empirical and exponential percentiles, as well as the narrow confidence interval (assuming normality), suggest that the exponential distribution may not be the best model for the distribution of crab Shell weight, especially if it is significantly skewed.

Multiple Regression Model

Building a multiple regression model to predict crab age.

Model

set.seed(2245)

crab_model=lm(Age ~ Height + Weight + Shucked.Weight +Viscera.Weight +  Shell.Weight ,data=traincrab)

summary(crab_model)
## 
## Call:
## lm(formula = Age ~ Height + Weight + Shucked.Weight + Viscera.Weight + 
##     Shell.Weight, data = traincrab)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.9363  -1.2506  -0.3699   0.7866  20.8045 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.971612   0.043561   91.17   <2e-16 ***
## Height         11.826985   0.210711   56.13   <2e-16 ***
## Weight          0.210333   0.005516   38.13   <2e-16 ***
## Shucked.Weight -0.593106   0.006663  -89.02   <2e-16 ***
## Viscera.Weight -0.155522   0.012052  -12.90   <2e-16 ***
## Shell.Weight    0.556266   0.009948   55.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.172 on 74045 degrees of freedom
## Multiple R-squared:  0.532,  Adjusted R-squared:  0.532 
## F-statistic: 1.683e+04 on 5 and 74045 DF,  p-value: < 2.2e-16

3.971612 (the intercept) is the expected age when all other predictors are zero.The coefficients for Height, Weight, and Shell Weight are positive, meaning that as these measurements increase , so does age. Height has the most substantial positive impact. Shucked Weight and Viscera Weight have negative coefficients, meaning that increases in these measurements are associated with lower predicted ages. All predictors are statistically significant (<2e-16). 53.2% (Multiple R-squared) of the variability in the age of crabs is explained by the model. The F-statistic and its associated p-value (< 2.2e-16) are statistically significant. The predictors as a group have a significant relationship with the age of crabs.

Plotting Residuals
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.3.2
autoplot(crab_model)

In the residuals vs. fitted plot, there seems to be a pattern where the residuals fan out as the fitted values increase, indicating potential heteroscedasticity. The Q-Q plot indicates that the residuals deviate from the line at both ends, suggesting that residuals may not be normally distributed. The scale-location plot shows a upward trend. This means that the variance of the residuals is increasing with the fitted values, which is another sign of heteroscedasticity.

Using Test Data to Submit to Kaggle

In order to evaluate the competition, they ask you to predict ages using your code on the test data set. Afterwards, upload a file to kaggle of the data of predicted age linked with ID. They will use that to evaluate how well my model worked.

predicted_ages= predict(crab_model, newdata = testcrab)

testcrab$yield=predicted_ages

names(testcrab)
##  [1] "id"             "Sex"            "Length"         "Diameter"      
##  [5] "Height"         "Weight"         "Shucked.Weight" "Viscera.Weight"
##  [9] "Shell.Weight"   "yield"
crab_sub_df = testcrab %>%
  dplyr::select( id, yield)

write.csv(crab_sub_df, "jjimenez_crab_subfile.csv")

My Kaggle Username is jeanj12

After submitting my data, I got the score 1.571782