Final Project

Problem 1

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

# Setting the random seed
set.seed(1234)

# uniform random variable X
X <- runif(10000, min = 5, max = 15)

# normal random variable Y with mean = 10, standard deviation = 2.89
Y <- rnorm(10000, mean = 10, sd = 2.89)

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

x <- median(X)
print(x)
## [1] 10.01266
y <- median(Y)
print(y)
## [1] 10.03154

Part A

\(P(X>x | X>y)\)

This is interpreted as the probability that X is greater than its median given that X is greater than the median of Y. \(P(X > x |X>y) = \frac{P(X>x \ \cap \ X>y)}{P(X>y)}\)

# Calculating P(X > x and X > y)
P_X_greater_x_and_y <- mean(X > x & X > y)
print(P_X_greater_x_and_y)
## [1] 0.4979
# Calculating P(X > y)
P_X_greater_y <- mean(X > y)
print(P_X_greater_y)
## [1] 0.4979
# Calculating the conditional probability P(X > x | X > y)
conditional_probability <- P_X_greater_x_and_y / P_X_greater_y

# Output the conditional probability
conditional_probability
## [1] 1

As shown above \(P(X>x \cap X>y) = .4979\)

\(P(X>y) = .4979\)

So, \(P(X>x | X>y) = \frac{.4979}{.4979}=1\)

This makes sense: the median of Y (10.0315) is slightly larger than the median of X (10.0127), so the event \(X>y\) is a subset of \(X>x\), and conditioning on \(X>y\) guarantees \(X>x\).

Part B

\(P(X>x \ \& \ Y >y)\)

This can be interpreted as the joint probability that X is greater than x AND Y is greater than y.

p_X_greater_x_Y_greater_y <- mean(X > x & Y > y)
p_X_greater_x_Y_greater_y
## [1] 0.2507

Part C

\(P(X<x | X>y)\)

This can be interpreted as the probability that X is less than x given that X is greater than y.

\(P(X < x |X>y) = \frac{P(X<x \ \cap \ X>y)}{P(X>y)}\)

# Calculating P(X < x and X > y)
P_X_less_x_and_y <- mean(X < x & X > y)
print(P_X_less_x_and_y)
## [1] 0
# Calculating P(X > y)
P_X_greater_y <- mean(X > y)
print(P_X_greater_y)
## [1] 0.4979
# Calculating the conditional probability P(X < x | X > y)
part_c <- P_X_less_x_and_y / P_X_greater_y

# Output the conditional probability
part_c
## [1] 0

As shown above \(P(X<x \cap X>y) = 0\)

\(P(X>y) = .4979\)

So, \(P(X<x | X>y) = \frac{0}{.4979}=0\)

This also follows from Part A: since \(y > x\) here, the events \(X<x\) and \(X>y\) are mutually exclusive, so their joint probability is exactly 0.

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

p_X_greater_x <- mean(X > x)

p_Y_greater_y <- mean(Y > y)

p_X_greater_x_and_Y_greater_y <- mean(X > x & Y>y)

p_X_times_p_Y <- p_X_greater_x * p_Y_greater_y

prob_table <- data.frame(
  `p_X_greater_x` = p_X_greater_x,
  `p_Y_greater_y` = p_Y_greater_y,
  `p_X_greater_x_and_Y_greater_y` = p_X_greater_x_and_Y_greater_y,
  `product_of_marginals` = p_X_times_p_Y
)

print(prob_table)
##   p_X_greater_x p_Y_greater_y p_X_greater_x_and_Y_greater_y
## 1           0.5           0.5                        0.2507
##   product_of_marginals
## 1                 0.25

From the table above we can see that:

\(P(X \gt x \ \& \ Y \gt y) = 0.2507\)

\(P(X \gt x)P(Y \gt y) = 0.25\)

So while the joint probability and the product of the marginals are not exactly equal, they are very close, which is what we would expect if X > x and Y > y were independent events.
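To make this comparison explicit, we can also build the full joint and marginal probability table directly. Below is a minimal sketch using base R, with prop.table for the joint probabilities and addmargins for the marginal sums:

# Joint probability table for the events X > x and Y > y,
# with marginal probabilities added as the Sum row/column
joint_table <- prop.table(table(X > x, Y > y))
addmargins(joint_table)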

# Creating contingency table
table_data <- table(X > x, Y > y)

# Performing Chi-Square Test
chi_square_result <- chisq.test(table_data)

# Printing the results
print(chi_square_result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_data
## X-squared = 0.0676, df = 1, p-value = 0.7949
fisher_result <- fisher.test(table_data)
print(fisher_result)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table_data
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264

For both the Chi-squared test and Fisher’s Exact test we get a p-value of 0.7949. Since this is far above any conventional significance level, we fail to reject the null hypothesis of independence; the data are consistent with the two events being independent. These results are not surprising: X and Y were generated independently from two different distributions, a uniform and a normal, so we would not expect to see a relationship between them. The difference between the two tests is that the Chi-squared test is an approximation generally used for larger sample sizes, while Fisher’s Exact test computes exact probabilities from the hypergeometric distribution and is typically reserved for smaller samples. Since we get the same p-value for both tests, either is appropriate; with 10,000 observations, the Chi-squared approximation is more than adequate.

Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

Part 1

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

# Load in the training data

training_data <- read.csv("train.csv")

head(training_data)
##   id Sex Length Diameter Height   Weight Shucked.Weight Viscera.Weight
## 1  0   I 1.5250   1.1750 0.3750 28.97319      12.728926       6.647958
## 2  1   I 1.1000   0.8250 0.2750 10.41844       4.521745       2.324659
## 3  2   M 1.3875   1.1125 0.3750 24.77746      11.339800       5.556502
## 4  3   F 1.7000   1.4125 0.5000 50.66056      20.354941      10.991839
## 5  4   I 1.2500   1.0125 0.3375 23.28911      11.977664       4.507570
## 6  5   M 1.5000   1.1750 0.4125 28.84562      13.409313       6.789705
##   Shell.Weight Age
## 1     8.348928   9
## 2     3.401940   8
## 3     6.662133   9
## 4    14.996885  11
## 5     5.953395   8
## 6     7.937860  10

From calling head on the training data, we can see that the data set has 10 columns, with Age as our dependent variable.

summary(training_data)
##        id            Sex                Length          Diameter     
##  Min.   :    0   Length:74051       Min.   :0.1875   Min.   :0.1375  
##  1st Qu.:18512   Class :character   1st Qu.:1.1500   1st Qu.:0.8875  
##  Median :37025   Mode  :character   Median :1.3750   Median :1.0750  
##  Mean   :37025                      Mean   :1.3175   Mean   :1.0245  
##  3rd Qu.:55538                      3rd Qu.:1.5375   3rd Qu.:1.2000  
##  Max.   :74050                      Max.   :2.0128   Max.   :1.6125  
##      Height           Weight        Shucked.Weight     Viscera.Weight    
##  Min.   :0.0000   Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252  
##  1st Qu.:0.3000   1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330  
##  Median :0.3625   Median :23.7994   Median : 9.90815   Median : 4.98951  
##  Mean   :0.3481   Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839  
##  3rd Qu.:0.4125   3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815  
##  Max.   :2.8250   Max.   :80.1015   Max.   :42.18406   Max.   :21.54562  
##   Shell.Weight           Age        
##  Min.   : 0.04252   Min.   : 1.000  
##  1st Qu.: 3.96893   1st Qu.: 8.000  
##  Median : 6.93145   Median :10.000  
##  Mean   : 6.72387   Mean   : 9.968  
##  3rd Qu.: 9.07184   3rd Qu.:11.000  
##  Max.   :28.49125   Max.   :29.000
# Load necessary libraries
library(ggplot2)
library(rlang)

# Loop over each column in the dataframe
for (col_name in names(training_data)) {
  # Check if the column is numeric and not the "id" column
  if (is.numeric(training_data[[col_name]]) && col_name != "id") {
    # Plot histogram for numeric variables
    p <- ggplot(training_data, aes(x = !!sym(col_name))) + 
      geom_histogram(fill = "blue", color = "black", bins = 50) +
      ggtitle(paste("Histogram of", col_name)) +
      xlab(col_name) +
      ylab("Frequency")
    print(p)
  }
}

The histograms above show the distributions of the features in the training data. The Length and Diameter features are skewed to the left. The Weight, Shucked.Weight, Viscera.Weight, and Shell.Weight features are all skewed to the right. None of the variables appears perfectly normally distributed. The Age column (the dependent variable) is also skewed to the right.

Now examine scatterplots of the dependent variable against individual predictors, starting with Length on the x-axis.

plot(training_data$Length, training_data$Age, xlab="Length", ylab="Age")

The scatterplot with Length on the x-axis and Age on the y-axis suggests a slight positive relationship, where longer crabs tend to be older, but the relationship does not appear very strong.

plot(training_data$Weight, training_data$Age, xlab="Weight", ylab="Age")

The scatterplot with Weight on the x-axis and Age on the y-axis does not show any clearly discernible relationship between the two variables.
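For a true scatterplot matrix, as the prompt requests, we can also plot several variables at once with pairs(); a minimal sketch using the two predictors above plus the dependent variable:

# Scatterplot matrix of Length, Weight, and Age
pairs(training_data[, c("Length", "Weight", "Age")],
      main = "Scatterplot Matrix: Length, Weight, Age")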

Now generate the correlation matrix for Weight, Shucked.Weight, and Length.

cor_matrix <- cor(training_data[, c("Weight", "Shucked.Weight", "Length")])
cor_matrix
##                   Weight Shucked.Weight    Length
## Weight         1.0000000      0.9712672 0.9363738
## Shucked.Weight 0.9712672      1.0000000 0.9155158
## Length         0.9363738      0.9155158 1.0000000

We can see very high correlations among all of these variables, all above 0.90. This hints at multicollinearity, which can make a model very sensitive: small changes in the data can lead to significant changes in the predictions. Of these variables, Weight and Shucked.Weight have the highest correlation at 0.97, which makes sense since both are measures of the crab’s weight.
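One standard way to quantify this multicollinearity is the variance inflation factor (VIF). Here is a quick check, assuming the car package is installed; for these three predictors the VIFs equal the diagonal of the inverse of their correlation matrix, which is exactly the precision matrix computed in Part 2 below:

# VIFs for a model using the three highly correlated predictors;
# values well above 5-10 are a common multicollinearity warning sign
car::vif(lm(Age ~ Weight + Shucked.Weight + Length, data = training_data))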

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units

Loop over each pair of numeric columns to compute the correlation, test the hypothesis that the correlation between the pair is 0, and report an 80% confidence interval for each. This uses Pearson’s product-moment correlation test.

# Get column names
cols <- names(training_data[, -c(1)])

# Loop through each pair of columns
for (i in 1:(length(cols)-1)) {
  for (j in (i+1):length(cols)) {
    if (is.numeric(training_data[[cols[i]]]) && is.numeric(training_data[[cols[j]]])) {
      cat("\nCorrelation between", cols[i], "and", cols[j], ":\n")
      # Perform the test
      test <- cor.test(training_data[[cols[i]]], training_data[[cols[j]]], conf.level = 0.8)
      
      # Print the results
      print(test)
    }
  }
}
## 
## Correlation between Length and Diameter :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 1857.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9893379 0.9895359
## sample estimates:
##       cor 
## 0.9894374 
## 
## 
## Correlation between Length and Height :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 631.44, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9176108 0.9190862
## sample estimates:
##       cor 
## 0.9183517 
## 
## 
## Correlation between Length and Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 725.93, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9357910 0.9369515
## sample estimates:
##       cor 
## 0.9363738 
## 
## 
## Correlation between Length and Shucked.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 619.29, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9147504 0.9162747
## sample estimates:
##       cor 
## 0.9155158 
## 
## 
## Correlation between Length and Viscera.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 629.27, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9171100 0.9185939
## sample estimates:
##       cor 
## 0.9178552 
## 
## 
## Correlation between Length and Shell.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 625.39, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9162044 0.9177038
## sample estimates:
##       cor 
## 0.9169573 
## 
## 
## Correlation between Length and Age :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 211.04, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6098938 0.6157754
## sample estimates:
##       cor 
## 0.6128431 
## 
## 
## Correlation between Diameter and Height :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 644.97, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9206384 0.9220617
## sample estimates:
##       cor 
## 0.9213531 
## 
## 
## Correlation between Diameter and Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 737.99, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9376824 0.9388098
## sample estimates:
##       cor 
## 0.9382486 
## 
## 
## Correlation between Diameter and Shucked.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 613.85, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9134223 0.9149693
## sample estimates:
##       cor 
## 0.9141991 
## 
## 
## Correlation between Diameter and Viscera.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 631.44, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9176101 0.9190854
## sample estimates:
##      cor 
## 0.918351 
## 
## 
## Correlation between Diameter and Shell.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 651.23, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9219851 0.9233852
## sample estimates:
##       cor 
## 0.9226882 
## 
## 
## Correlation between Diameter and Age :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6183556 0.6241393
## sample estimates:
##       cor 
## 0.6212559 
## 
## 
## Correlation between Height and Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 567.76, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9008913 0.9026508
## sample estimates:
##       cor 
## 0.9017748 
## 
## 
## Correlation between Height and Shucked.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 467.14, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.8628846 0.8652710
## sample estimates:
##       cor 
## 0.8640826 
## 
## 
## Correlation between Height and Viscera.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 512.25, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.8820858 0.8841588
## sample estimates:
##       cor 
## 0.8831266 
## 
## 
## Correlation between Height and Shell.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 573.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9025285 0.9042605
## sample estimates:
##       cor 
## 0.9033982 
## 
## 
## Correlation between Height and Age :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6352664 0.6408507
## sample estimates:
##       cor 
## 0.6380669 
## 
## 
## Correlation between Weight and Shucked.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 1110.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9709993 0.9715328
## sample estimates:
##       cor 
## 0.9712672 
## 
## 
## Correlation between Weight and Viscera.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 1106.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9707920 0.9713293
## sample estimates:
##       cor 
## 0.9710619 
## 
## 
## Correlation between Weight and Shell.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 1009.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9652040 0.9658423
## sample estimates:
##       cor 
## 0.9655246 
## 
## 
## Correlation between Weight and Age :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 204.73, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5981791 0.6041938
## sample estimates:
##      cor 
## 0.601195 
## 
## 
## Correlation between Shucked.Weight and Viscera.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 768.33, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9420988 0.9431486
## sample estimates:
##      cor 
## 0.942626 
## 
## 
## Correlation between Shucked.Weight and Shell.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 598.78, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9095880 0.9112004
## sample estimates:
##       cor 
## 0.9103977 
## 
## 
## Correlation between Shucked.Weight and Age :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 158.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4997954 0.5068284
## sample estimates:
##       cor 
## 0.5033202 
## 
## 
## Correlation between Viscera.Weight and Shell.Weight :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 710.9, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9333145 0.9345182
## sample estimates:
##      cor 
## 0.933919 
## 
## 
## Correlation between Viscera.Weight and Age :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 192.15, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5736566 0.5799419
## sample estimates:
##       cor 
## 0.5768078 
## 
## 
## Correlation between Shell.Weight and Age :
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 241.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6608286 0.6661015
## sample estimates:
##       cor 
## 0.6634733

Looking at the results above, we can reject the null hypothesis that the correlation is 0 for every pairwise set of variables: all of the p-values are essentially zero and every estimated correlation is at least 0.5. This is not what we want, since ideally our independent variables would not be correlated with each other.

Because we are testing many hypotheses, we should be worried about familywise error: the probability of at least one false positive across the whole set of tests. Since we are testing the correlations across all pairwise variables, we are running a lot of tests, so we can rerun them with the Bonferroni correction and see if anything changes.

Run the tests again with the Bonferroni correction.

# get the number of tests
num_tests <- choose(length(cols), 2)

# set alpha level
familywise_alpha <- 0.05

# Calculate the Bonferroni-corrected alpha level
corrected_alpha <- familywise_alpha / num_tests

# Loop through each pair of columns to perform the correlation tests
for (i in 1:(length(cols) - 1)) {
  for (j in (i + 1):length(cols)) {
    if (is.numeric(training_data[[cols[i]]]) && is.numeric(training_data[[cols[j]]])) {
      cat("\nCorrelation between", cols[i], "and", cols[j], ":\n")
      
      # Perform the test with the Bonferroni-corrected alpha level
      test <- cor.test(training_data[[cols[i]]], training_data[[cols[j]]], 
                       conf.level = 0.8, method = "pearson", 
                       exact = FALSE, p.adjust.method = "none")
      
      # Check if p-value is less than the corrected alpha to determine significance
      if (test$p.value < corrected_alpha) {
        cat("No change after  Bonferroni correction\n")
      } else {
        cat("No correlation found, change after correction\n")
      }

      # Print the results
      print(test)
    }
  }
}
## 
## Correlation between Length and Diameter :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 1857.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9893379 0.9895359
## sample estimates:
##       cor 
## 0.9894374 
## 
## 
## Correlation between Length and Height :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 631.44, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9176108 0.9190862
## sample estimates:
##       cor 
## 0.9183517 
## 
## 
## Correlation between Length and Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 725.93, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9357910 0.9369515
## sample estimates:
##       cor 
## 0.9363738 
## 
## 
## Correlation between Length and Shucked.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 619.29, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9147504 0.9162747
## sample estimates:
##       cor 
## 0.9155158 
## 
## 
## Correlation between Length and Viscera.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 629.27, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9171100 0.9185939
## sample estimates:
##       cor 
## 0.9178552 
## 
## 
## Correlation between Length and Shell.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 625.39, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9162044 0.9177038
## sample estimates:
##       cor 
## 0.9169573 
## 
## 
## Correlation between Length and Age :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 211.04, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6098938 0.6157754
## sample estimates:
##       cor 
## 0.6128431 
## 
## 
## Correlation between Diameter and Height :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 644.97, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9206384 0.9220617
## sample estimates:
##       cor 
## 0.9213531 
## 
## 
## Correlation between Diameter and Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 737.99, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9376824 0.9388098
## sample estimates:
##       cor 
## 0.9382486 
## 
## 
## Correlation between Diameter and Shucked.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 613.85, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9134223 0.9149693
## sample estimates:
##       cor 
## 0.9141991 
## 
## 
## Correlation between Diameter and Viscera.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 631.44, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9176101 0.9190854
## sample estimates:
##      cor 
## 0.918351 
## 
## 
## Correlation between Diameter and Shell.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 651.23, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9219851 0.9233852
## sample estimates:
##       cor 
## 0.9226882 
## 
## 
## Correlation between Diameter and Age :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6183556 0.6241393
## sample estimates:
##       cor 
## 0.6212559 
## 
## 
## Correlation between Height and Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 567.76, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9008913 0.9026508
## sample estimates:
##       cor 
## 0.9017748 
## 
## 
## Correlation between Height and Shucked.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 467.14, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.8628846 0.8652710
## sample estimates:
##       cor 
## 0.8640826 
## 
## 
## Correlation between Height and Viscera.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 512.25, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.8820858 0.8841588
## sample estimates:
##       cor 
## 0.8831266 
## 
## 
## Correlation between Height and Shell.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 573.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9025285 0.9042605
## sample estimates:
##       cor 
## 0.9033982 
## 
## 
## Correlation between Height and Age :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6352664 0.6408507
## sample estimates:
##       cor 
## 0.6380669 
## 
## 
## Correlation between Weight and Shucked.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 1110.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9709993 0.9715328
## sample estimates:
##       cor 
## 0.9712672 
## 
## 
## Correlation between Weight and Viscera.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 1106.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9707920 0.9713293
## sample estimates:
##       cor 
## 0.9710619 
## 
## 
## Correlation between Weight and Shell.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 1009.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9652040 0.9658423
## sample estimates:
##       cor 
## 0.9655246 
## 
## 
## Correlation between Weight and Age :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 204.73, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5981791 0.6041938
## sample estimates:
##      cor 
## 0.601195 
## 
## 
## Correlation between Shucked.Weight and Viscera.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 768.33, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9420988 0.9431486
## sample estimates:
##      cor 
## 0.942626 
## 
## 
## Correlation between Shucked.Weight and Shell.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 598.78, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9095880 0.9112004
## sample estimates:
##       cor 
## 0.9103977 
## 
## 
## Correlation between Shucked.Weight and Age :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 158.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4997954 0.5068284
## sample estimates:
##       cor 
## 0.5033202 
## 
## 
## Correlation between Viscera.Weight and Shell.Weight :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 710.9, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9333145 0.9345182
## sample estimates:
##      cor 
## 0.933919 
## 
## 
## Correlation between Viscera.Weight and Age :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 192.15, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5736566 0.5799419
## sample estimates:
##       cor 
## 0.5768078 
## 
## 
## Correlation between Shell.Weight and Age :
## No change after  Bonferroni correction
## 
##  Pearson's product-moment correlation
## 
## data:  training_data[[cols[i]]] and training_data[[cols[j]]]
## t = 241.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6608286 0.6661015
## sample estimates:
##       cor 
## 0.6634733

As we can see, after accounting for familywise error by applying the Bonferroni correction, all of the variables remain significantly correlated, so the correction did not change any of our conclusions.
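A more compact way to apply the correction is to collect the raw p-values and adjust them with p.adjust(); a sketch of that approach:

# Collect the raw p-values for every numeric pair, then Bonferroni-adjust
pvals <- c()
for (i in 1:(length(cols) - 1)) {
  for (j in (i + 1):length(cols)) {
    if (is.numeric(training_data[[cols[i]]]) && is.numeric(training_data[[cols[j]]])) {
      pvals <- c(pvals, cor.test(training_data[[cols[i]]],
                                 training_data[[cols[j]]])$p.value)
    }
  }
}
adjusted <- p.adjust(pvals, method = "bonferroni")
# Count how many pairs remain significant at alpha = 0.05 after adjustment
sum(adjusted < 0.05)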

Part 2

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.

Invert the correlation matrix using the inv function from the pracma package.

library("pracma")
## 
## Attaching package: 'pracma'
## The following object is masked from 'package:Hmisc':
## 
##     ceil
precision_matrix <- inv(cor_matrix)
precision_matrix
##                    Weight Shucked.Weight     Length
## Weight          23.312795    -16.4227734 -6.7941818
## Shucked.Weight -16.422773     17.7483695 -0.8710585
## Length          -6.794182     -0.8710585  8.1593616

Now, multiply the correlation matrix by the precision matrix.

cor_times_prec <- cor_matrix %*% precision_matrix
cor_times_prec
##                      Weight Shucked.Weight       Length
## Weight         1.000000e+00  -5.335785e-16 3.493993e-16
## Shucked.Weight 2.576022e-15   1.000000e+00 2.282746e-16
## Length         0.000000e+00  -1.110223e-16 1.000000e+00

Now, multiply the precision matrix by the correlation matrix. We get the same result as the previous multiplication: ones on the diagonal (the identity matrix) with only floating-point noise off the diagonal.

prec_times_cor <- precision_matrix %*% cor_matrix
prec_times_cor
##                      Weight Shucked.Weight        Length
## Weight         1.000000e+00  -9.766913e-16 -3.552714e-15
## Shucked.Weight 3.123094e-15   1.000000e+00  3.552714e-15
## Length         3.493993e-16   2.282746e-16  1.000000e+00
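The off-diagonal entries in both products are pure floating-point noise; rounding them away with zapsmall makes the identity structure explicit:

# Both products should display as the 3x3 identity matrix
zapsmall(cor_times_prec)
zapsmall(prec_times_cor)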

Now perform LDU decomposition on the precision matrix using the lu function from pracma.

# run decomposition
lu_decomp <- lu(precision_matrix)
# get L 
L <- lu_decomp$L
#get U
U <- lu_decomp$U

# get D (diagonal of U)
D<- diag(diag(U))

# adjust U matrix
U <- diag(1/diag(U)) %*% U

print(L)
##                    Weight Shucked.Weight Length
## Weight          1.0000000      0.0000000      0
## Shucked.Weight -0.7044532      1.0000000      0
## Length         -0.2914357     -0.9155158      1
print(U)
##      Weight Shucked.Weight     Length
## [1,]      1     -0.7044532 -0.2914357
## [2,]      0      1.0000000 -0.9155158
## [3,]      0      0.0000000  1.0000000
print(D)
##         [,1]     [,2] [,3]
## [1,] 23.3128 0.000000    0
## [2,]  0.0000 6.179294    0
## [3,]  0.0000 0.000000    1
print(L %*% D %*% U)
##                    Weight Shucked.Weight     Length
## Weight          23.312795    -16.4227734 -6.7941818
## Shucked.Weight -16.422773     17.7483695 -0.8710585
## Length          -6.794182     -0.8710585  8.1593616
print(precision_matrix)
##                    Weight Shucked.Weight     Length
## Weight          23.312795    -16.4227734 -6.7941818
## Shucked.Weight -16.422773     17.7483695 -0.8710585
## Length          -6.794182     -0.8710585  8.1593616

We know that the LDU decomposition is correct because multiplying L * D * U reproduces our original matrix, the precision matrix. The results are printed above: the LDU product appears directly above the precision matrix output, and the two are identical.
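As a programmatic check of the same claim, rather than comparing printouts by eye:

# TRUE if L %*% D %*% U reproduces the precision matrix up to numerical tolerance
all.equal(L %*% D %*% U, precision_matrix, check.attributes = FALSE)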

Part 3

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

I will use the Weight variable for this part. It is skewed to the right and its minimum value is above zero.

Fit the Weight column to an exponential distribution and print out the optimal rate λ.

weight_fitdistr <- MASS::fitdistr(training_data$Weight, "exponential")
weight_fitdistr$estimate
##       rate 
## 0.04276206

The computed optimal value for lambda is 0.04276206
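This agrees with the closed-form maximum likelihood estimate for an exponential rate, which is the reciprocal of the sample mean, \(\hat{\lambda} = 1/\bar{x}\):

# The exponential-rate MLE in closed form: 1 / sample mean
1 / mean(training_data$Weight)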

Now take 1,000 samples from this exponential distribution, then plot histograms of the original variable and of the samples to compare.

weight_exp_dist <- rexp(1000, weight_fitdistr$estimate)
hist(training_data$Weight, xlab = "Weight", main = "Histogram of Weight")

hist(weight_exp_dist, xlab = "Weight", main = "Histogram of Weight Fit to Exponential Distribution")

Overall, the exponential distribution is a rough fit to the original data. The samples reach a much greater maximum value of almost 150 (versus roughly 80 for the original) and pile up many more values at the left end of the distribution. On the plus side, the samples are right-skewed like the original, but there are significant differences at the tails of the distributions.
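A more direct visual comparison is to overlay the fitted exponential density on a density-scale histogram of the original variable; a minimal sketch:

# Density-scale histogram of Weight with the fitted exponential pdf overlaid
hist(training_data$Weight, freq = FALSE, breaks = 50,
     xlab = "Weight", main = "Weight with Fitted Exponential Density")
curve(dexp(x, rate = weight_fitdistr$estimate),
      col = "red", lwd = 2, add = TRUE)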

Now find the 5th and 95th percentiles.

# Find the 5th and 95th percentiles using the exponential distribution
p5 <- qexp(0.05, rate =  weight_fitdistr$estimate)
p95 <- qexp(0.95, rate =  weight_fitdistr$estimate)
print(p5)
## [1] 1.199505
print(p95)
## [1] 70.05585

For the fitted distribution we get a 5th percentile of 1.199505 and a 95th percentile of 70.05585.
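These values come straight from inverting the exponential CDF, \(F(x) = 1 - e^{-\lambda x}\), which gives \(x_p = -\ln(1-p)/\lambda\). Computing them by hand reproduces what qexp returns:

# Invert the exponential CDF directly: x_p = -log(1 - p) / lambda
-log(1 - 0.05) / weight_fitdistr$estimate  # 5th percentile
-log(1 - 0.95) / weight_fitdistr$estimate  # 95th percentile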

ci_95 <- t.test(weight_exp_dist)$conf.int
ci_95
## [1] 21.65780 24.51401
## attr(,"conf.level")
## [1] 0.95

For the 1,000 exponential samples we get a 95% confidence interval for the mean of 21.65780 to 24.51401.

# Empirical 95% CI 
emp_ci_95 <- t.test(training_data$Weight)$conf.int
emp_ci_95
## [1] 23.29412 23.47632
## attr(,"conf.level")
## [1] 0.95

On the original data we get a 95% confidence interval for the mean of 23.29412 to 23.47632. This is much tighter than the interval from the exponential samples, which is expected given the far larger sample size (74,051 vs. 1,000), and together with the histograms it suggests the exponential fit might not be the best for this column.

# Empirical 5th and 95th percentiles
emp_p5 <- quantile(training_data$Weight, 0.05)
emp_p95 <- quantile(training_data$Weight, 0.95)
print(emp_p5)
##       5% 
## 3.600386
print(emp_p95)
##      95% 
## 44.21105

For the empirical data we get a 5th percentile of 3.600386 and a 95th percentile of 44.21105. Again, these differ from the exponential fit’s values of 1.199505 and 70.05585. The 5th percentile is not off by too much, but the 95th percentile is off by almost 30, confirming that the exponential distribution overstates the right tail of this data.

Part 4

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Let’s fit a model on all the available variables to start.

# Remove the id column
training_data <- training_data[, -which(names(training_data) == "id")]
head(training_data)
##   Sex Length Diameter Height   Weight Shucked.Weight Viscera.Weight
## 1   I 1.5250   1.1750 0.3750 28.97319      12.728926       6.647958
## 2   I 1.1000   0.8250 0.2750 10.41844       4.521745       2.324659
## 3   M 1.3875   1.1125 0.3750 24.77746      11.339800       5.556502
## 4   F 1.7000   1.4125 0.5000 50.66056      20.354941      10.991839
## 5   I 1.2500   1.0125 0.3375 23.28911      11.977664       4.507570
## 6   M 1.5000   1.1750 0.4125 28.84562      13.409313       6.789705
##   Shell.Weight Age
## 1     8.348928   9
## 2     3.401940   8
## 3     6.662133   9
## 4    14.996885  11
## 5     5.953395   8
## 6     7.937860  10

Make sure the Sex column is treated as a factor

training_data$Sex <- as.factor(training_data$Sex)
head(training_data)
##   Sex Length Diameter Height   Weight Shucked.Weight Viscera.Weight
## 1   I 1.5250   1.1750 0.3750 28.97319      12.728926       6.647958
## 2   I 1.1000   0.8250 0.2750 10.41844       4.521745       2.324659
## 3   M 1.3875   1.1125 0.3750 24.77746      11.339800       5.556502
## 4   F 1.7000   1.4125 0.5000 50.66056      20.354941      10.991839
## 5   I 1.2500   1.0125 0.3375 23.28911      11.977664       4.507570
## 6   M 1.5000   1.1750 0.4125 28.84562      13.409313       6.789705
##   Shell.Weight Age
## 1     8.348928   9
## 2     3.401940   8
## 3     6.662133   9
## 4    14.996885  11
## 5     5.953395   8
## 6     7.937860  10

Build a regression model on all the variables as-is.

lm_model <- lm(Age~., data = training_data)
summary(lm_model)
## 
## Call:
## lm(formula = Age ~ ., data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0347  -1.2228  -0.3320   0.7537  22.0418 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.755758   0.075897  49.485  < 2e-16 ***
## SexI           -1.040735   0.025209 -41.284  < 2e-16 ***
## SexM           -0.115212   0.019157  -6.014 1.82e-09 ***
## Length          0.910211   0.192234   4.735 2.20e-06 ***
## Diameter        2.138507   0.238786   8.956  < 2e-16 ***
## Height          7.222272   0.236275  30.567  < 2e-16 ***
## Weight          0.194265   0.005416  35.867  < 2e-16 ***
## Shucked.Weight -0.614115   0.006632 -92.603  < 2e-16 ***
## Viscera.Weight -0.216351   0.011872 -18.224  < 2e-16 ***
## Shell.Weight    0.512127   0.009820  52.152  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5508 
## F-statistic: 1.009e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

Let’s try using the step function for stepwise selection.

# Stepwise model selection
step_model <- step(lm_model, direction = "both", trace = 0)

# Show the final model
print(summary(step_model))
## 
## Call:
## lm(formula = Age ~ Sex + Length + Diameter + Height + Weight + 
##     Shucked.Weight + Viscera.Weight + Shell.Weight, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0347  -1.2228  -0.3320   0.7537  22.0418 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.755758   0.075897  49.485  < 2e-16 ***
## SexI           -1.040735   0.025209 -41.284  < 2e-16 ***
## SexM           -0.115212   0.019157  -6.014 1.82e-09 ***
## Length          0.910211   0.192234   4.735 2.20e-06 ***
## Diameter        2.138507   0.238786   8.956  < 2e-16 ***
## Height          7.222272   0.236275  30.567  < 2e-16 ***
## Weight          0.194265   0.005416  35.867  < 2e-16 ***
## Shucked.Weight -0.614115   0.006632 -92.603  < 2e-16 ***
## Viscera.Weight -0.216351   0.011872 -18.224  < 2e-16 ***
## Shell.Weight    0.512127   0.009820  52.152  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5508 
## F-statistic: 1.009e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

We can see that stepwise selection returns the same model as before: an \(R^2\) of 0.5508, with a significant F-statistic of 1.009e+04 on 9 and 74041 DF and a significant p-value of < 2.2e-16.

Now let’s create some diagnostic plots to further investigate the model.

plot(step_model)

The Residuals vs Fitted plot shows a pattern among the residuals. This heteroscedasticity means the variability of the residuals is not constant across the range of fitted values; ideally, we would like the residuals to be spread randomly and evenly around the horizontal line at zero.

In the Q-Q residual plot we would like the values to fit closely to the line. Instead we see fairly significant deviation at the tails, showing that the residuals are not normally distributed.
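To back up the visual impression with a formal test, we could run a Breusch-Pagan test for heteroscedasticity, assuming the lmtest package is installed; a small p-value indicates non-constant residual variance:

# Breusch-Pagan test on the stepwise model
lmtest::bptest(step_model)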

Let’s try a model that scales the quantitative variables to see if we get any improvement.

scaled_training_data <- training_data
scaled_training_data$Length <- scale(scaled_training_data$Length)
scaled_training_data$Diameter <- scale(scaled_training_data$Diameter)
scaled_training_data$Height <- scale(scaled_training_data$Height)
scaled_training_data$Weight <- scale(scaled_training_data$Weight)
scaled_training_data$Shucked.Weight <- scale(scaled_training_data$Shucked.Weight)
scaled_training_data$Viscera.Weight <- scale(scaled_training_data$Viscera.Weight)
scaled_training_data$Shell.Weight <- scale(scaled_training_data$Shell.Weight)
head(scaled_training_data)
##   Sex     Length    Diameter     Height       Weight Shucked.Weight
## 1   I  0.7212332  0.63397775  0.2923977  0.441801426      0.4671847
## 2   I -0.7557067 -0.84035033 -0.7941577 -1.025191245     -0.9936809
## 3   M  0.2433997  0.37070488  0.2923977  0.110075046      0.2199225
## 4   F  1.3293850  1.63441467  1.6505920  2.156468183      1.8246040
## 5   I -0.2344338 -0.05053172 -0.1150606 -0.007598159      0.3334613
## 6   M  0.6343544  0.63397775  0.6998560  0.431715150      0.5882928
##   Viscera.Weight Shell.Weight Age
## 1      0.5691823   0.45337303   9
## 2     -0.9788731  -0.92678160   8
## 3      0.1783618  -0.01722411   9
## 4      2.1246076   2.30807939  11
## 5     -0.1972320  -0.21495400   8
## 6      0.6199382   0.33868969  10
scaled_model <- lm(Age~., data = scaled_training_data)
summary(scaled_model)
## 
## Call:
## lm(formula = Age ~ ., data = scaled_training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0347  -1.2228  -0.3320   0.7537  22.0418 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.34664    0.01513 683.888  < 2e-16 ***
## SexI           -1.04074    0.02521 -41.284  < 2e-16 ***
## SexM           -0.11521    0.01916  -6.014 1.82e-09 ***
## Length          0.26192    0.05532   4.735 2.20e-06 ***
## Diameter        0.50767    0.05669   8.956  < 2e-16 ***
## Height          0.66469    0.02175  30.567  < 2e-16 ***
## Weight          2.45709    0.06851  35.867  < 2e-16 ***
## Shucked.Weight -3.45012    0.03726 -92.603  < 2e-16 ***
## Viscera.Weight -0.60421    0.03315 -18.224  < 2e-16 ***
## Shell.Weight    1.83566    0.03520  52.152  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5508 
## F-statistic: 1.009e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

Scaling the data made no difference to the \(R^2\), F-statistic, or p-value. This is expected: centering and scaling the predictors is just a linear reparameterization, so it changes the coefficients (which are now per standard deviation) but cannot change the quality of the fit.
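As a quick sanity check of that invariance, the fitted values from the two models should be identical:

# Fitted values are unchanged by linearly rescaling the predictors
all.equal(as.numeric(fitted(lm_model)), as.numeric(fitted(scaled_model)))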

plot(scaled_model)

Scaling did not change the diagnostics either; the plots look almost identical to those from the unscaled data. Let’s try transforming some variables to see if we can get a better fit.

First, let’s apply a log transformation to the variables that appear skewed to the right. From the histograms made previously, these are Weight, Shucked.Weight, Viscera.Weight, and Shell.Weight.

log_model <- lm(Age~log(Weight+1) + log(Shucked.Weight+1) + log(Viscera.Weight +1) + log(Shell.Weight+1) + Sex+Length + Diameter+Height , data = training_data)
summary(log_model)
## 
## Call:
## lm(formula = Age ~ log(Weight + 1) + log(Shucked.Weight + 1) + 
##     log(Viscera.Weight + 1) + log(Shell.Weight + 1) + Sex + Length + 
##     Diameter + Height, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5504  -1.2983  -0.3346   0.7993  19.1946 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3.07287    0.09126  33.672  < 2e-16 ***
## log(Weight + 1)          3.65299    0.10331  35.361  < 2e-16 ***
## log(Shucked.Weight + 1) -6.40704    0.07678 -83.443  < 2e-16 ***
## log(Viscera.Weight + 1) -0.45166    0.07472  -6.044 1.51e-09 ***
## log(Shell.Weight + 1)    6.40043    0.08640  74.083  < 2e-16 ***
## SexI                    -0.87746    0.02561 -34.258  < 2e-16 ***
## SexM                    -0.13572    0.01921  -7.066 1.61e-12 ***
## Length                  -1.93945    0.20925  -9.268  < 2e-16 ***
## Diameter                -0.28698    0.25360  -1.132    0.258    
## Height                   5.76077    0.24029  23.974  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.134 on 74041 degrees of freedom
## Multiple R-squared:  0.5482, Adjusted R-squared:  0.5481 
## F-statistic:  9980 on 9 and 74041 DF,  p-value: < 2.2e-16

Transforming the predictors with a log transformation does not seem to have made the model any better. We see a slight decrease in \(R^2\), suggesting the transformations did not improve the model at all.

plot(log_model)

The residuals vs fitted plot has not changed much. We still see heteroscedasticity in the fit, the line around 0 is no longer completely horizontal, and a “funnel” shape is visible, another sign of heteroscedasticity.

Let’s also try transforming the dependent variable with a log transformation to address this issue.

log_model <- lm(log(Age+1)~log(Weight+1) + log(Shucked.Weight+1) + log(Viscera.Weight +1) + log(Shell.Weight+1) + Sex+Length + Diameter+Height , data = training_data)
summary(log_model)
## 
## Call:
## lm(formula = log(Age + 1) ~ log(Weight + 1) + log(Shucked.Weight + 
##     1) + log(Viscera.Weight + 1) + log(Shell.Weight + 1) + Sex + 
##     Length + Diameter + Height, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31192 -0.10779 -0.01947  0.08368  1.19956 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.570619   0.007129 220.327  < 2e-16 ***
## log(Weight + 1)          0.483682   0.008070  59.938  < 2e-16 ***
## log(Shucked.Weight + 1) -0.534591   0.005998 -89.131  < 2e-16 ***
## log(Viscera.Weight + 1) -0.084533   0.005837 -14.482  < 2e-16 ***
## log(Shell.Weight + 1)    0.461313   0.006749  68.356  < 2e-16 ***
## SexI                    -0.081270   0.002001 -40.620  < 2e-16 ***
## SexM                    -0.009418   0.001500  -6.277 3.47e-10 ***
## Length                  -0.155527   0.016345  -9.515  < 2e-16 ***
## Diameter                -0.092843   0.019810  -4.687 2.78e-06 ***
## Height                   0.347963   0.018770  18.538  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1667 on 74041 degrees of freedom
## Multiple R-squared:  0.6491, Adjusted R-squared:  0.6491 
## F-statistic: 1.522e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

Transforming the dependent variable produced a significant increase in the \(R^2\) value, while the F-statistic and p-value remained significant. This is a good start; now let’s inspect the plots for the model.

plot(log_model)

The spread of the residuals now seems more constant, but a clear pattern remains in the plot. Some of this banding is expected because Age only takes integer values, but it still hints that the model may not be fully capturing the relationship between the variables.

The Q-Q plot still shows a lot of deviation at the tails, indicating the residuals are not normally distributed.

Let’s try adding an interaction term to the model.

log_interaction_model <- lm(log(Age+1)~log(Weight+1) + log(Shucked.Weight+1) + log(Viscera.Weight +1) + log(Shell.Weight+1) + Sex*log(Shucked.Weight+1)+Length , data = training_data)
summary(log_interaction_model)
## 
## Call:
## lm(formula = log(Age + 1) ~ log(Weight + 1) + log(Shucked.Weight + 
##     1) + log(Viscera.Weight + 1) + log(Shell.Weight + 1) + Sex * 
##     log(Shucked.Weight + 1) + Length, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29084 -0.10683 -0.01942  0.08319  1.08965 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.801660   0.010673  168.80  < 2e-16 ***
## log(Weight + 1)               0.370235   0.008836   41.90  < 2e-16 ***
## log(Shucked.Weight + 1)      -0.560744   0.006271  -89.42  < 2e-16 ***
## log(Viscera.Weight + 1)      -0.031938   0.005893   -5.42 5.98e-08 ***
## log(Shell.Weight + 1)         0.508586   0.006324   80.42  < 2e-16 ***
## SexI                         -0.312167   0.008652  -36.08  < 2e-16 ***
## SexM                         -0.114575   0.009394  -12.20  < 2e-16 ***
## Length                       -0.127574   0.012719  -10.03  < 2e-16 ***
## log(Shucked.Weight + 1):SexI  0.104205   0.003667   28.42  < 2e-16 ***
## log(Shucked.Weight + 1):SexM  0.040449   0.003657   11.06  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1661 on 74041 degrees of freedom
## Multiple R-squared:  0.6517, Adjusted R-squared:  0.6517 
## F-statistic: 1.539e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

Adding the interaction term improved the \(R^2\) slightly.

plot(log_interaction_model)

We can see that adding the interaction term did not really help much, as there is still an obvious pattern in the residuals vs fitted plot.

The Q-Q plot also still shows deviations at the tails.

So let’s try one more transformation to see if we can get any improvement.

We’ll try a Box-Cox transformation.

library(MASS)

# Find the optimal lambda
bc_out <- boxcox(step_model, lambda = seq(-2, 2, by = 0.1))

# Plot to find the lambda that maximizes the log-likelihood
plot(bc_out)

library(car)  
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:pracma':
## 
##     logit
training_data$transformed_Age <- bcPower(training_data$Age, bc_out$x[which.max(bc_out$y)])

Create a model with the Box-Cox transformation applied to the Age column.

box_cox_model <- lm(transformed_Age~log(Weight+1) + log(Shucked.Weight+1) + log(Viscera.Weight +1) + log(Shell.Weight+1) +Sex+Length +Diameter+Height, data = training_data)
summary(box_cox_model)
## 
## Call:
## lm(formula = transformed_Age ~ log(Weight + 1) + log(Shucked.Weight + 
##     1) + log(Viscera.Weight + 1) + log(Shell.Weight + 1) + Sex + 
##     Length + Diameter + Height, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62657 -0.08487 -0.01296  0.06968  0.97587 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.237632   0.005683 217.784  < 2e-16 ***
## log(Weight + 1)          0.444006   0.006433  69.019  < 2e-16 ***
## log(Shucked.Weight + 1) -0.427257   0.004781 -89.358  < 2e-16 ***
## log(Viscera.Weight + 1) -0.085979   0.004653 -18.477  < 2e-16 ***
## log(Shell.Weight + 1)    0.342225   0.005380  63.611  < 2e-16 ***
## SexI                    -0.064377   0.001595 -40.362  < 2e-16 ***
## SexM                    -0.006862   0.001196  -5.737 9.67e-09 ***
## Length                  -0.121781   0.013030  -9.346  < 2e-16 ***
## Diameter                -0.097614   0.015792  -6.181 6.39e-10 ***
## Height                   0.240775   0.014963  16.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1329 on 74041 degrees of freedom
## Multiple R-squared:  0.6659, Adjusted R-squared:  0.6659 
## F-statistic: 1.64e+04 on 9 and 74041 DF,  p-value: < 2.2e-16
plot(box_cox_model)

The plots above still show a clear pattern and non-constant variance. This could be hinting at a nonlinear relationship with one of the variables, so let’s try adding a polynomial term to the model to see if we can make any improvement.

box_cox_model_poly <- lm(transformed_Age~log(Weight+1) + log(Shucked.Weight+1) + log(Viscera.Weight +1) + log(Shell.Weight+1) +Sex+Length+Diameter**2+Height, data = training_data)
summary(box_cox_model_poly)
## 
## Call:
## lm(formula = transformed_Age ~ log(Weight + 1) + log(Shucked.Weight + 
##     1) + log(Viscera.Weight + 1) + log(Shell.Weight + 1) + Sex + 
##     Length + Diameter^2 + Height, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62657 -0.08487 -0.01296  0.06968  0.97587 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.237632   0.005683 217.784  < 2e-16 ***
## log(Weight + 1)          0.444006   0.006433  69.019  < 2e-16 ***
## log(Shucked.Weight + 1) -0.427257   0.004781 -89.358  < 2e-16 ***
## log(Viscera.Weight + 1) -0.085979   0.004653 -18.477  < 2e-16 ***
## log(Shell.Weight + 1)    0.342225   0.005380  63.611  < 2e-16 ***
## SexI                    -0.064377   0.001595 -40.362  < 2e-16 ***
## SexM                    -0.006862   0.001196  -5.737 9.67e-09 ***
## Length                  -0.121781   0.013030  -9.346  < 2e-16 ***
## Diameter                -0.097614   0.015792  -6.181 6.39e-10 ***
## Height                   0.240775   0.014963  16.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1329 on 74041 degrees of freedom
## Multiple R-squared:  0.6659, Adjusted R-squared:  0.6659 
## F-statistic: 1.64e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

Adding Diameter**2 did not change the \(R^2\) at all; notice that the summary still shows a plain Diameter term. This is because ** (an alias for ^) has a special meaning inside an R formula: Diameter^2 expands to Diameter crossed with itself, which collapses back to just Diameter, so no squared term was actually fitted.

plot(box_cox_model_poly)

The plots look the same as before, which makes sense given the model is identical to the previous one.
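To genuinely include a quadratic term, the exponent has to be protected with I() (or added via poly()); a sketch of what that would look like:

# I() makes R treat ^ as arithmetic rather than formula crossing,
# so this model actually includes Diameter squared
box_cox_model_poly2 <- lm(transformed_Age ~ log(Weight + 1) + log(Shucked.Weight + 1) +
                            log(Viscera.Weight + 1) + log(Shell.Weight + 1) +
                            Sex + Length + I(Diameter^2) + Height,
                          data = training_data)
summary(box_cox_model_poly2)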

Since this seems about as good as we’re going to get, let’s add some interaction terms to this model and see if we can make any final improvements.

final_model <- lm(transformed_Age~log(Weight+1) + log(Shucked.Weight+1) + log(Viscera.Weight +1) + log(Shell.Weight+1) +Sex+Length+Height+Sex*log(Shucked.Weight +1) , data = training_data)
summary(final_model)
## 
## Call:
## lm(formula = transformed_Age ~ log(Weight + 1) + log(Shucked.Weight + 
##     1) + log(Viscera.Weight + 1) + log(Shell.Weight + 1) + Sex + 
##     Length + Height + Sex * log(Shucked.Weight + 1), data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.60067 -0.08351 -0.01282  0.06914  0.84750 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.448999   0.008454  171.40   <2e-16 ***
## log(Weight + 1)               0.325190   0.006999   46.46   <2e-16 ***
## log(Shucked.Weight + 1)      -0.452366   0.004965  -91.11   <2e-16 ***
## log(Viscera.Weight + 1)      -0.048464   0.004692  -10.33   <2e-16 ***
## log(Shell.Weight + 1)         0.347430   0.005216   66.61   <2e-16 ***
## SexI                         -0.303813   0.006884  -44.13   <2e-16 ***
## SexM                         -0.102505   0.007439  -13.78   <2e-16 ***
## Length                       -0.103915   0.010071  -10.32   <2e-16 ***
## Height                        0.315549   0.014916   21.16   <2e-16 ***
## log(Shucked.Weight + 1):SexI  0.109421   0.002923   37.44   <2e-16 ***
## log(Shucked.Weight + 1):SexM  0.036780   0.002896   12.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1315 on 74040 degrees of freedom
## Multiple R-squared:  0.673,  Adjusted R-squared:  0.673 
## F-statistic: 1.524e+04 on 10 and 74040 DF,  p-value: < 2.2e-16

With the interaction terms we see a slight increase in the \(R^2\) value, and we were also able to drop the Diameter variable from the model without hurting the fit.

plot(final_model)

Our plots still exhibit a clear pattern and non-constant variance, but this appears to be about the best this model family will do, so let’s make some predictions and submit to Kaggle.

library("forecast")
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# get the test data
test_data<- read.csv("test.csv")

preds<- predict(final_model, newdata=test_data)
# Invert the Box-Cox transformation to get back to the original age scale
predictions_final <- InvBoxCox(preds, bc_out$x[which.max(bc_out$y)])
predictions_final <- data.frame(id = test_data$id, Age = predictions_final)

write.csv(predictions_final, 'crab_kaggle_sub.csv', row.names = FALSE)
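For reference, with the chosen \(\lambda\), bcPower computes \(z = (y^\lambda - 1)/\lambda\) (all ages here are strictly positive, so no sign adjustment is involved), and InvBoxCox should undo this via \(y = (\lambda z + 1)^{1/\lambda}\), returning the predictions to the original age scale.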

I submitted the file to Kaggle and obtained an MAE score of 1.42417 under the username joebochnik.