Problem 1

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
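The chunks below assume a setup block along these lines; the package list is inferred from the functions called later in the document:

library(tidyverse)    # ggplot2, dplyr, tidyr, readr
library(janitor)      # clean_names()
library(psych)        # describe()
library(kableExtra)   # kable_styling(), landscape()
library(MASS)         # fitdistr(), boxcox()
library(matrixcalc)   # lu.decomposition()
library(car)          # vif()
library(gridExtra)    # grid.arrange()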

set.seed(1234)

X <- runif(10000, min=5, max=15)
Y <- rnorm(10000, mean=10, sd=2.89)

Let's check our distributions:

tibble(val = X) |>
  ggplot(aes(val)) + 
  geom_histogram() +
  ggtitle("X Normal Distribution") +
  theme_bw()

tibble(val = Y) |>
  ggplot(aes(val)) + 
  geom_histogram() +
  ggtitle("Y Normal Distribution") +
  theme_bw()

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

Calculate the medians

x <- median(X)
y <- median(Y) 

a. P(X>x | X>y)

\[ P(X>x | X>y) = \frac{P(X>x \cap X>y)}{P(X>y)} \]

prob_df <- tibble(X = X, Y = Y)

p_xx_xy <- 
  prob_df %>%
  filter(X>x, X>y) |>
  nrow()/10000

p_xy <- 
  prob_df %>%
  filter(X>y) |>
  nrow()/10000

p_xx_xy / p_xy
## [1] 1

Given that X exceeds y, it is certain that X also exceeds x: in this simulation the median of Y lands slightly above the median of X (both are near 10), so every draw with X > y automatically satisfies X > x, and the conditional probability is 1.

b. P(X>x & Y>y)

\[ P(X>x, Y>y) = P(X>x \cap Y>y) \]

p_xx_xy <- 
  prob_df %>%
  filter(X>x, Y>y) |>
  nrow()/10000

p_xx_xy
## [1] 0.2507

About a quarter of all draws have X above its median and Y above its median at the same time, which is exactly what independence would predict (0.5 × 0.5 = 0.25).

c. P(X<x | X>y)

\[ P(X<x | X>y) = \frac{P(X<x \cap X>y)}{P(X>y)} \]

p_xx_xy <- 
  prob_df %>%
  filter(X<x, X>y) |>
  nrow()/10000

p_xx_xy / p_xy
## [1] 0

Because y sits slightly above x, a draw with X below x cannot also be above y: the intersection is empty, so the conditional probability is 0.

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

joint_table <- 
  data.frame(
    c(prob_df %>% filter(X>x, Y>y) |> nrow()/10000,
      prob_df %>% filter(X>x, Y<=y) |> nrow()/10000),
    c(prob_df %>% filter(X<=x, Y>y) |> nrow()/10000,
      prob_df %>% filter(X<=x, Y<=y) |> nrow()/10000)) %>%
  mutate(Total = rowSums(.)) %>%
  rbind(., colSums(.))

colnames(joint_table) <- c("X > x", "$X \\le x$", "total")  
rownames(joint_table) <- c("Y > y", "$Y \\le y$", "total")  

knitr::kable(joint_table, caption='Joint Probability', format="simple")
Joint Probability

           X > x    X ≤ x    total
--------  -------  -------  -------
Y > y      0.2507   0.2493      0.5
Y ≤ y      0.2493   0.2507      0.5
total      0.5000   0.5000      1.0
prob_xx <- 
  prob_df %>% filter(X>x) |> nrow()/10000
prob_yy <- 
  prob_df %>% filter(Y>y) |> nrow()/10000
prob_xx * prob_yy
## [1] 0.25
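
The same table can be built more compactly with base R; a one-line sketch (addmargins() appends the row and column totals):

addmargins(prop.table(table(X > x, Y > y)))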

Comparing the two quantities, the joint probability \(P(X>x \cap Y>y)\) is 0.2507 while the product of the marginal probabilities \(P(X>x)P(Y>y)\) is exactly 0.25. They are quite close to each other but not exactly the same.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

Let's set up the following hypotheses:

\(H_0\): the events \(X>x\) and \(Y>y\) are independent.
\(H_a\): the events \(X>x\) and \(Y>y\) are not independent.


table(X>x, Y>y) |> fisher.test()
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(X > x, Y > y)
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264
table(X>x, Y>y) |> chisq.test()
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(X > x, Y > y)
## X-squared = 0.0676, df = 1, p-value = 0.7949

Both the Fisher and chi-squared tests return a p-value of 0.7949, so we fail to reject \(H_0\): the data are consistent with \(X>x\) and \(Y>y\) being independent.

It's noted that "Fisher's exact test is practically applied only in analysis of small samples but actually it is valid for all sample sizes. While the chi-squared test relies on an approximation, Fisher's exact test is one of exact tests." (NCBI). With 10,000 observations and large expected counts in every cell, the chi-squared approximation is entirely adequate here, so either test is appropriate, and the matching p-values are not surprising.

Problem 2

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Import the training dataset

df <- read_csv('train.csv') |> clean_names()

Retrieve descriptive statistics for each variable. We can see that sex is a categorical variable (psych's describe flags it with an asterisk).

knitr::kable(describe(df), caption = "Descriptive Statistics of the Training Dataset", digits=4) %>% 
  kable_styling(latex_options = "scale_down") %>%
  kableExtra::landscape()
Descriptive Statistics of the Training Dataset

                  vars       n         mean           sd       median      trimmed          mad      min          max        range      skew   kurtosis        se
---------------  -----  ------  -----------  -----------  -----------  -----------  -----------  -------  -----------  -----------  --------  ---------  --------
id                   1   74051   37025.0000   21376.8267   37025.0000   37025.0000   27447.3738   0.0000   74050.0000   74050.0000    0.0000    -1.2000   78.5557
sex*                 2   74051       2.0550       0.8206       2.0000       2.0688       1.4826   1.0000       3.0000       2.0000   -0.1019    -1.5081    0.0030
length               3   74051       1.3175       0.2878       1.3750       1.3450       0.2780   0.1875       2.0128       1.8253   -0.8443     0.2916    0.0011
diameter             4   74051       1.0245       0.2374       1.0750       1.0464       0.2224   0.1375       1.6125       1.4750   -0.8128     0.1771    0.0009
height               5   74051       0.3481       0.0920       0.3625       0.3525       0.0927   0.0000       2.8250       2.8250    0.0866    14.1518    0.0003
weight               6   74051      23.3852      12.6482      23.7994      23.0514      13.9753   0.0567      80.1015      80.0448    0.2315    -0.4019    0.0465
shucked_weight       7   74051      10.1043       5.6180       9.9082       9.9184       6.1786   0.0283      42.1841      42.1557    0.3495    -0.1192    0.0206
viscera_weight       8   74051       5.0584       2.7927       4.9895       4.9707       3.0683   0.0425      21.5456      21.5031    0.2864    -0.3654    0.0103
shell_weight         9   74051       6.7239       3.5844       6.9315       6.6245       3.8038   0.0425      28.4912      28.4487    0.2774    -0.1426    0.0132
age                 10   74051       9.9678       3.1752      10.0000       9.6888       2.9652   1.0000      29.0000      28.0000    1.0929     2.2963    0.0117

The histograms below show the distribution of each numeric variable. For example, age and shell_weight are right-skewed, while diameter and length are left-skewed.

df |> 
  pivot_longer(
    -c(id, sex),
    names_to = "variable",
    values_to = "value"
  )|>
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(vars(variable), scales = "free") +
  theme_bw()

Scatterplot matrix for age, height, length, weight, and diameter. In general, as age increases we see a positive trend in each of the independent variables (height, length, weight, diameter).

pairs(df |> subset(select=c(age, height, length, weight, diameter)))

Let's create a correlation matrix to measure the variables' relationships with each other. Unsurprisingly, but still strikingly high, height, length, and weight are all strongly correlated with one another, with every pairwise correlation above 0.90!

knitr::kable(cor(df |> subset(select=c(height, length, weight))), caption = "Correlation Matrix", digits=4)
Correlation Matrix

          height   length   weight
-------  -------  -------  -------
height    1.0000   0.9184   0.9018
length    0.9184   1.0000   0.9364
weight    0.9018   0.9364   1.0000

All three 80% confidence intervals below sit far from zero, around 0.90–0.94, confirming that the pairwise correlations are high and indicating a substantial degree of multicollinearity. This violates one of the assumptions of multiple linear regression and will pose a problem later on with our results. As for familywise error, it is defined as "the probability of making a Type I error among a specified group, or 'family,' of tests" ([Statology](https://www.statology.org/family-wise-error-rate/)). This is a concern for us because we performed several hypothesis tests. The rate is calculated as \(1-(1-\alpha)^c\); with \(\alpha = 0.2\) (from the 80% confidence level) and \(c = 3\) tests, \(1-(1-0.2)^3 = 0.488\), so we have roughly a 49% chance of committing at least one Type I error.
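
As a quick check of that arithmetic:

alpha <- 0.20            # each test below uses an 80% confidence level
c_tests <- 3             # three pairwise correlation tests
1 - (1 - alpha)^c_tests  # familywise error rate
## [1] 0.488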

cor.test(df$height, df$length , method=c("pearson"), conf.level=0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  df$height and df$length
## t = 631.44, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9176108 0.9190862
## sample estimates:
##       cor 
## 0.9183517
cor.test(df$height, df$weight , method=c("pearson"), conf.level=0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  df$height and df$weight
## t = 567.76, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9008913 0.9026508
## sample estimates:
##       cor 
## 0.9017748
cor.test(df$length, df$weight , method=c("pearson"), conf.level=0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  df$length and df$weight
## t = 725.93, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9357910 0.9369515
## sample estimates:
##       cor 
## 0.9363738

Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.

correlation_matrix <- cor(df |> subset(select=c(height, length, weight)))
precision_matrix <- solve(correlation_matrix)

correlation_precision <- correlation_matrix %*% precision_matrix
print("correlation_precision")
## [1] "correlation_precision"
print(correlation_precision)
##              height       length weight
## height 1.000000e+00 0.000000e+00      0
## length 4.440892e-16 1.000000e+00      0
## weight 4.440892e-16 8.881784e-16      1
precision_correlation <- precision_matrix %*% correlation_matrix
print("precision_correlation")
## [1] "precision_correlation"
print(precision_correlation)
##        height       length       weight
## height      1 4.440892e-16 4.440892e-16
## length      0 1.000000e+00 8.881784e-16
## weight      0 0.000000e+00 1.000000e+00
lu.decomposition(precision_correlation)
## $L
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## 
## $U
##      [,1]         [,2]         [,3]
## [1,]    1 4.440892e-16 4.440892e-16
## [2,]    0 1.000000e+00 8.881784e-16
## [3,]    0 0.000000e+00 1.000000e+00
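
Both products come out as the 3x3 identity matrix (up to floating-point error), as expected since the precision matrix is the inverse of the correlation matrix; the LU factors of an identity matrix are likewise identities, so the D in a full LDU factorization would be the identity as well. As a sketch of the earlier claim that the precision matrix carries variance inflation factors on its diagonal (the second line is a hypothetical spot check for height):

# Diagonal of the precision matrix = VIFs, i.e., 1 / (1 - R^2_j)
diag(precision_matrix)
# Equivalent computation for height regressed on the other two variables:
1 / (1 - summary(lm(height ~ length + weight, data = df))$r.squared)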

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. See link.

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

original_distribution <- 
  df |> 
  ggplot(aes(weight)) +
  geom_histogram() +
  geom_vline(aes(xintercept = mean(weight)), color = "blue") +
  theme_bw() +
  theme(legend.position = "none") + 
  ggtitle("Weight Distribution") +
  geom_text(aes(x = mean(weight) + 3, label = paste0("Mean = ", round(mean(weight), 2)), y = 1000), colour = "red", angle = 90)
exponential_pdf <- fitdistr(df$weight, "exponential")
exponential_pdf
##        rate    
##   0.0427620578 
##  (0.0001571423)
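# Note: the exponential MLE is rate = 1/mean(weight) = 1/23.385 ≈ 0.0428, matching fitdistr above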
lambda <- exponential_pdf$estimate
exponential_sampling <- rexp(1000, rate=lambda)

exponential_distribution <- 
  ggplot(mapping = aes(exponential_sampling)) +
  geom_histogram() +
  geom_vline(aes(xintercept = mean(exponential_sampling)), color = "red") +
  theme_bw() +
  theme(legend.position = "none") +  
  ggtitle("Exponential Weight Distribution") + 
  geom_text(aes(x = mean(exponential_sampling) + 3, label = paste0("Mean = ", round(mean(exponential_sampling), 2)), y = 35), colour = "red", angle = 90)
grid.arrange(original_distribution, exponential_distribution, nrow = 1)

original_stats <- describe(df$weight)
rownames(original_stats) <- c("Original")

exponential_stats <- describe(exponential_sampling)
rownames(exponential_stats) <- c("Exp. Sample")

compare_stats <- rbind(original_stats, exponential_stats)

knitr::kable(compare_stats, digits=4)
                vars       n      mean        sd    median   trimmed       mad      min        max      range    skew   kurtosis      se
------------  ------  ------  --------  --------  --------  --------  --------  -------  ---------  ---------  ------  ---------  ------
Original           1   74051   23.3852   12.6482   23.7994   23.0514   13.9753   0.0567    80.1015    80.0448  0.2315    -0.4019  0.0465
Exp. Sample        1    1000   23.0859   23.0136   16.2572   19.2704   16.7087   0.0094   189.5325   189.5231  2.1294     7.3002  0.7278

Comparing the two histograms, the y-axis counts drop from over 6,000 per bin to the low 200s simply because the simulated sample has 1,000 draws versus 74,051 in the original data. The means are nearly identical (23.39 vs. 23.09), as expected since the exponential is fit to the sample mean, but the medians differ sharply (23.80 vs. 16.26) and the exponential sample is far more skewed (skew 2.13 vs. 0.23). Given this skewness and the potential outliers visible in the plot, the median and IQR would summarize the simulated variable better than the mean and standard deviation.

CDF Percentiles

First, the empirical 5th and 95th percentiles of the 1,000 simulated draws:

quantile(exponential_sampling, c(0.05, 0.95))
##        5%       95% 
##  1.253456 67.529711
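
Strictly, the question asks for percentiles of the fitted exponential distribution itself, which come from its CDF via qexp() rather than from the simulated sample; the closed form \(-\log(1-p)/\lambda\) gives roughly 1.20 and 70.06 for \(\hat{\lambda} \approx 0.0428\):

qexp(c(0.05, 0.95), rate = lambda)  # ≈ 1.1995 and 70.0559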

Confidence Intervals of Empirical Data

t.test(df$weight)
## 
##  One Sample t-test
## 
## data:  df$weight
## t = 503.13, df = 74050, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  23.29412 23.47632
## sample estimates:
## mean of x 
##  23.38522
quantile(df$weight, c(0.05, 0.95))
##        5%       95% 
##  3.600386 44.211045

Here the 5th–95th percentile range of the fitted exponential (≈1.25 to ≈67.53 in the simulated sample) is far wider than the empirical range (3.60 to 44.21), while the normal-theory 95% confidence interval for the mean, (23.29, 23.48), is extremely tight because of the large sample size. Given how different the percentiles are, we can conclude that the weight variable does not really follow an exponential distribution.
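
A more formal check is possible; a sketch using a Kolmogorov-Smirnov test against the fitted exponential (with 74,051 observations even small departures will register as significant, and ties in weight may trigger a warning):

ks.test(df$weight, "pexp", rate = lambda)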

Modeling.

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

We are first going to assume that the values provided in the test set are a random sampling of observations!

lm_model <- lm(age ~ sex + length + diameter + weight + height  + shucked_weight + viscera_weight + shell_weight, data=df)
summary(lm_model)
## 
## Call:
## lm(formula = age ~ sex + length + diameter + weight + height + 
##     shucked_weight + viscera_weight + shell_weight, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0347  -1.2228  -0.3320   0.7537  22.0418 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.755758   0.075897  49.485  < 2e-16 ***
## sexI           -1.040735   0.025209 -41.284  < 2e-16 ***
## sexM           -0.115212   0.019157  -6.014 1.82e-09 ***
## length          0.910211   0.192234   4.735 2.20e-06 ***
## diameter        2.138507   0.238786   8.956  < 2e-16 ***
## weight          0.194265   0.005416  35.867  < 2e-16 ***
## height          7.222272   0.236275  30.567  < 2e-16 ***
## shucked_weight -0.614115   0.006632 -92.603  < 2e-16 ***
## viscera_weight -0.216351   0.011872 -18.224  < 2e-16 ***
## shell_weight    0.512127   0.009820  52.152  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5508 
## F-statistic: 1.009e+04 on 9 and 74041 DF,  p-value: < 2.2e-16
res <- resid(lm_model)
plot(fitted(lm_model), res)
abline(0,0)

We see some heteroscedasticity: the residuals fan outward in a cone shape to the right, which makes the usual standard-error estimates unreliable. On top of this, the high correlations among the independent variables, confirmed in the correlation matrix earlier, mean the individual coefficient estimates are unstable.
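
A formal test can back up this visual impression; a sketch assuming the lmtest package is available:

library(lmtest)   # assumed available; provides bptest()
bptest(lm_model)  # Breusch-Pagan test: a small p-value indicates heteroscedasticity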

qqnorm(res)
qqline(res)

Lastly, the Q-Q plot shows a heavy right tail, violating the assumption of normally distributed residuals. So even though the model has strongly significant coefficients and explains a mediocre share of the variation (\(R^2 \approx 0.55\)), it is difficult to trust its results for prediction without expecting sizable errors.

Improve Model

vif(lm_model, type='predictor')
lm_model2 <- lm(age ~ sex + height  + shucked_weight + viscera_weight + shell_weight, data=df)
summary(lm_model2)
## 
## Call:
## lm(formula = age ~ sex + height + shucked_weight + viscera_weight + 
##     shell_weight, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.9484  -1.2479  -0.3228   0.7710  21.2596 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.180358   0.052043  99.539  < 2e-16 ***
## sexI           -1.156181   0.025362 -45.588  < 2e-16 ***
## sexM           -0.125536   0.019438  -6.458 1.06e-10 ***
## height         10.768804   0.211527  50.910  < 2e-16 ***
## shucked_weight -0.399978   0.004397 -90.958  < 2e-16 ***
## viscera_weight  0.033403   0.010337   3.232  0.00123 ** 
## shell_weight    0.792911   0.007152 110.860  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.16 on 74044 degrees of freedom
## Multiple R-squared:  0.5374, Adjusted R-squared:  0.5374 
## F-statistic: 1.434e+04 on 6 and 74044 DF,  p-value: < 2.2e-16
vif(lm_model2, type='predictor')
## GVIFs computed for predictors
# define intercept-only model
intercept_only <- lm(age ~ 1, data=df)

# define model with all predictors (excluding the row id)
all <- lm(age ~ . - id, data=df)

# perform stepwise regression
both_lm <- step(intercept_only, direction='both', scope=formula(all), trace=0)

summary(both_lm)
## 
## Call:
## lm(formula = age ~ shell_weight + shucked_weight + diameter + 
##     sex + weight + height + viscera_weight + length, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0347  -1.2228  -0.3320   0.7537  22.0418 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.755758   0.075897  49.485  < 2e-16 ***
## shell_weight    0.512127   0.009820  52.152  < 2e-16 ***
## shucked_weight -0.614115   0.006632 -92.603  < 2e-16 ***
## diameter        2.138507   0.238786   8.956  < 2e-16 ***
## sexI           -1.040735   0.025209 -41.284  < 2e-16 ***
## sexM           -0.115212   0.019157  -6.014 1.82e-09 ***
## weight          0.194265   0.005416  35.867  < 2e-16 ***
## height          7.222272   0.236275  30.567  < 2e-16 ***
## viscera_weight -0.216351   0.011872 -18.224  < 2e-16 ***
## length          0.910211   0.192234   4.735 2.20e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5508 
## F-statistic: 1.009e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

Box-Cox transformations help determine how each skewed variable should be transformed: an optimal \(\lambda\) near 0.5 points to a square-root transform, while \(\lambda\) near 0 points to a log transform.

box1 <- boxcox(df$length ~ 1)

box2 <- boxcox(df$diameter ~ 1)

box3 <- boxcox(df$weight ~ 1)

box4 <- boxcox(df$shucked_weight ~ 1)

box5 <- boxcox(df$viscera_weight ~ 1)

box6 <- boxcox(df$shell_weight ~ 1)
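
The optimal \(\lambda\) can also be read off programmatically; a small sketch using the weight fit (boxcox() returns the \(\lambda\) grid in x and the profile log-likelihood in y):

lambda_weight <- box3$x[which.max(box3$y)]  # lambda maximizing the log-likelihood
lambda_weight  # a value near 0.5 motivates the sqrt() transforms used below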

Now let's update the model again with the transformations. Note, however, that inside an R model formula the caret denotes interaction crossing, not arithmetic squaring, so length^2 reduces to plain length; a true quadratic term requires I(length^2). The coefficient names in the summary below confirm that length and diameter enter the model untransformed.

lm_model3 <- lm(age ~ sex + length^2 + diameter^2 + sqrt(weight) + height + sqrt(shucked_weight) + sqrt(viscera_weight) + sqrt(shell_weight), data=df)
summary(lm_model3)
## 
## Call:
## lm(formula = age ~ sex + length^2 + diameter^2 + sqrt(weight) + 
##     height + sqrt(shucked_weight) + sqrt(viscera_weight) + sqrt(shell_weight), 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3284  -1.2384  -0.3219   0.7784  19.5942 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.10683    0.07227  56.828  < 2e-16 ***
## sexI                 -0.89601    0.02528 -35.448  < 2e-16 ***
## sexM                 -0.10989    0.01904  -5.772 7.88e-09 ***
## length               -0.43852    0.19954  -2.198    0.028 *  
## diameter              0.28903    0.24794   1.166    0.244    
## sqrt(weight)          2.29455    0.05795  39.593  < 2e-16 ***
## height                5.22885    0.24287  21.530  < 2e-16 ***
## sqrt(shucked_weight) -4.21579    0.04644 -90.777  < 2e-16 ***
## sqrt(viscera_weight) -0.93277    0.05870 -15.889  < 2e-16 ***
## sqrt(shell_weight)    3.56084    0.05844  60.932  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.116 on 74041 degrees of freedom
## Multiple R-squared:  0.5561, Adjusted R-squared:  0.5561 
## F-statistic: 1.031e+04 on 9 and 74041 DF,  p-value: < 2.2e-16
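
If true quadratic terms were intended, the formula needs I() to force arithmetic squaring; a hypothetical corrected specification (its results would differ from the summary above):

lm_sq <- lm(age ~ sex + I(length^2) + I(diameter^2) + sqrt(weight) + height +
              sqrt(shucked_weight) + sqrt(viscera_weight) + sqrt(shell_weight),
            data = df)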
# define intercept-only model
intercept_only <- lm(age ~ 1, data=df)

# define model with all (transformed) predictors
all <- lm(age ~ sex + length^2 + diameter^2 + sqrt(weight) + height + sqrt(shucked_weight) + sqrt(viscera_weight) + sqrt(shell_weight), data=df)

# perform stepwise regression
both_lm_2 <- step(intercept_only, direction='both', scope=formula(all), trace=0)

summary(both_lm_2)
## 
## Call:
## lm(formula = age ~ sqrt(shell_weight) + sqrt(shucked_weight) + 
##     sqrt(weight) + sex + height + sqrt(viscera_weight) + length, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3556  -1.2393  -0.3217   0.7786  19.6024 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.11762    0.07167  57.450  < 2e-16 ***
## sqrt(shell_weight)    3.57149    0.05772  61.874  < 2e-16 ***
## sqrt(shucked_weight) -4.21371    0.04641 -90.799  < 2e-16 ***
## sqrt(weight)          2.29985    0.05778  39.807  < 2e-16 ***
## sexI                 -0.89699    0.02526 -35.506  < 2e-16 ***
## sexM                 -0.11002    0.01904  -5.779 7.56e-09 ***
## height                5.23997    0.24268  21.592  < 2e-16 ***
## sqrt(viscera_weight) -0.93139    0.05869 -15.869  < 2e-16 ***
## length               -0.27020    0.13773  -1.962   0.0498 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.116 on 74042 degrees of freedom
## Multiple R-squared:  0.5561, Adjusted R-squared:  0.5561 
## F-statistic: 1.16e+04 on 8 and 74042 DF,  p-value: < 2.2e-16
vif(both_lm_2, type='predictor')
lm_model6 <- lm(age ~ sex + length^2 + height , data=df)
summary(lm_model6)
## 
## Call:
## lm(formula = age ~ sex + length^2 + height, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.685  -1.457  -0.488   0.706  18.317 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.83896    0.06144   62.48   <2e-16 ***
## sexI        -1.31761    0.02809  -46.91   <2e-16 ***
## sexM        -0.23402    0.02155  -10.86   <2e-16 ***
## length       1.06281    0.07932   13.40   <2e-16 ***
## height      15.05503    0.24396   61.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.398 on 74046 degrees of freedom
## Multiple R-squared:  0.4297, Adjusted R-squared:  0.4297 
## F-statistic: 1.395e+04 on 4 and 74046 DF,  p-value: < 2.2e-16
vif(lm_model6, type='predictor')

In the end, the best model (both_lm_2, summarized above) removes diameter and applies square-root transformations to the skewed weight variables. Overall, OLS may not be the best regression model to use here, and another type of regression model would likely offer better predictive power.
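
For reference, a rough in-sample error for the chosen model (a sketch; the Kaggle leaderboard scores on held-out test data, so this in-sample figure is optimistic):

sqrt(mean(resid(both_lm_2)^2))  # in-sample RMSE, essentially the 2.116 residual standard error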

test <- read_csv("test.csv") |> clean_names()
predictions <- predict(both_lm_2, test)
submission_df <- data.frame("id" = test$id, 
                            "Age" = round(predictions, 0))
submission_df |> write_csv("submission.csv")
knitr::include_graphics("kaggle.png")