DATA605 Final

Problem 1a

set.seed(1234)
X <- runif(10000,5,15) 
#mean should be 10 per uniform distribution characteristics
#median should also be equal to 10
Y <- rnorm(10000, mean = 10, sd = 2.89)
#mean, median should also be equal to 10

y_med <- median(Y)
x_med <- median(X)

cat((paste("Y Median:",y_med,"\nX Median:",x_med)))

## Y Median: 10.0315387998106 
## X Median: 10.0126582302619

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | X>y) = .50 | Interpretation: The probability that the random variable X takes a value greater than its median, given that the value taken is also greater than the median of Y. This is a conditional probability with its formula being represented as: P(A and B) / P(B)

P(A) = 50% (X>x). The probability of a value taken from a uniform distribution with mean 10 is greater than the median of a normal distribution with mean 10

| P(B) = 50% (X>y). Since the distributions share mean/meadians, the probability is the same as asking how many values in X are greater than its mean/median, which would again be 50%
| P(A and B) = .5*.5 = .25 | P(A | B) = P(A and B) / P(B) = .25/.50 = .50

prob_a <- sum(X > x_med)/length(X)
prob_b <- sum(X > y_med)/length(X)

prob_anb <- prob_a * prob_b

prob_agiveb <- prob_anb/prob_b

cat((paste("Prob of A:", prob_a, "\nProb of B:", prob_b, "\nProb of A and B:", prob_anb, "\nProb of A given B:", prob_agiveb)))

## Prob of A: 0.5 
## Prob of B: 0.4979 
## Prob of A and B: 0.24895 
## Prob of A given B: 0.5

P(X>x & Y>y) = .25 | Interpretation: The probability that the random variable X takes a value greater than the median and that the random variable Y taken is greater than the median of y. | P(X>x) = .5 | P(Y>y) = .5 | P(A and B) = .5*.5 = .25
P(X<x | X>y) = .50 | Interpretation: The probability that the random variable X takes a value less than the median and that the value taken is greater than the median of y. | P(X<x) = .5 | P(X>y) = .5 | P(A and B) = .5*.5 = .25 | P(A | B) = P(A and B) / P(B) = .25/.50 = .50

Problem 1b

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

# Joiny Probability of X>x & Y>y
prob_joint <- sum(X > x_med & Y > y_med) / length(X)

# Marginal probabilities P(X > x) and P(Y > y)
prob_x <- sum(X > x_med) / length(X)
prob_y <- sum(Y > y_med) / length(Y)

# Product of marginal probabilities P(X > x) * P(Y > y)
marginals_prod <- prob_x * prob_y

# Create a matrix
prob_table <- matrix(c(prob_x, prob_joint, prob_y, marginals_prod), nrow = 2, byrow = TRUE)

# Labeling the cols and rows
colnames(prob_table) <- c("Marginal Probs", "Joint Probs")
rownames(prob_table) <- c("P(X > x)", "P(Y > y)")


print(prob_table)

##          Marginal Probs Joint Probs
## P(X > x)            0.5      0.2507
## P(Y > y)            0.5      0.2500

\[P(X>x & Y>y) = P(X > x)*P(Y>y)\]

Problem 1c

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

| Fisher’s Test checks to see if there is evidence to suggest that two variables are independent from each other based on the counts of the marginal and joint probability conditions. Fisher’s Test is typically only used when more than 20% of cells have counts less than 5 and is more accurate on small samples. We can check the expected counts with Chi-Square function from R. The Chi Square Test checks for independence by comparing the observed frequencies versus the expected frequencies under the assumption that the variables were independent. The Chi Square test is more accurate than Fisher’s Test on large samples. Both check their values in relation to their critical values with their degrees of freedom.

Fisher’s Exact Test Formula: \[p = \frac{(\text{row 1 total}!) \; (\text{row 2 total}!) \; (\text{column 1}!) \; (\text{column 2}!)} {(\text{total samples}!) \; (\text{cell a}!) \; (\text{cell b}!) \; (\text{cell c}!) (\; \text{cell d}!)}\] \[p = \frac{{c_1 \choose a}{c_2 \choose b}}{{N \choose r_1}}\]

Chi-Square Formula: \[\chi^2 = \sum \frac{(\text{Original Counts} - \text{Expected Counts})^2 }{\text{Expected Counts}}\] Expected Frequencies Formula for Chi Square: - (row total)(column total)/grand total - P(A)P(B)*total_count

values_in_X_greater_than_med <- sum(X > x_med)
values_in_Y_greater_than_med <- sum(Y > y_med)

values_in_X_greater_than_y_med <- sum(X > y_med)
values_in_Y_greater_than_x_med <- sum(Y > x_med)

cont_table <- data.frame("Values from X" = c(values_in_X_greater_than_med,values_in_X_greater_than_y_med),
                         "Values from Y" =c(values_in_Y_greater_than_x_med,values_in_Y_greater_than_med),
                     row.names = c("> X Median","> Y Median"))


cont_table

##            Values.from.X Values.from.Y
## > X Median          5000          5036
## > Y Median          4979          5000

fisher.test(cont_table)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  cont_table
## p-value = 0.9212
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9429173 1.0542732
## sample estimates:
## odds ratio 
##   0.997039

The P-Value is certainly greater than .05 at a value of 1 and concludes that no evidence was found to suggest that X and Y are not independent. Fisher’s Exact Test assumes independence as the null hypothesis.

chisq.test(cont_table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont_table
## X-squared = 0.0082342, df = 1, p-value = 0.9277

The P-Value of the Chi-Square Test also has a p-value of 1, meaning there is no evidence to suggest that X and Y are not independent.
In this situation, I believe the Chi-Square Test would be most appropriate due to the number of observations. I am not surprised by the results since these counts are the result of two different distributions, and unless the Uniform and Normal distributions are dependent on each other I don’t see how the values ever could be dependent.

Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

| Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

train <- read_csv("data//train.csv")
submission_data <- read_csv("data//test.csv")

describe(train)

##                vars     n     mean       sd   median  trimmed      mad  min
## id                1 74051 37025.00 21376.83 37025.00 37025.00 27447.37 0.00
## Sex*              2 74051     2.06     0.82     2.00     2.07     1.48 1.00
## Length            3 74051     1.32     0.29     1.38     1.35     0.28 0.19
## Diameter          4 74051     1.02     0.24     1.07     1.05     0.22 0.14
## Height            5 74051     0.35     0.09     0.36     0.35     0.09 0.00
## Weight            6 74051    23.39    12.65    23.80    23.05    13.98 0.06
## Shucked Weight    7 74051    10.10     5.62     9.91     9.92     6.18 0.03
## Viscera Weight    8 74051     5.06     2.79     4.99     4.97     3.07 0.04
## Shell Weight      9 74051     6.72     3.58     6.93     6.62     3.80 0.04
## Age              10 74051     9.97     3.18    10.00     9.69     2.97 1.00
##                     max    range  skew kurtosis    se
## id             74050.00 74050.00  0.00    -1.20 78.56
## Sex*               3.00     2.00 -0.10    -1.51  0.00
## Length             2.01     1.83 -0.84     0.29  0.00
## Diameter           1.61     1.48 -0.81     0.18  0.00
## Height             2.83     2.83  0.09    14.15  0.00
## Weight            80.10    80.04  0.23    -0.40  0.05
## Shucked Weight    42.18    42.16  0.35    -0.12  0.02
## Viscera Weight    21.55    21.50  0.29    -0.37  0.01
## Shell Weight      28.49    28.45  0.28    -0.14  0.01
## Age               29.00    28.00  1.09     2.30  0.01

#trimmed refers to if any parts of the data were removed, none were removed here
#mad: median absolute deviation
colnames(train) <- c('id','sex','length','diameter','height','weight','shucked_weight','visc_weight','shell_weight','age')
colnames(submission_data) <- c('id','sex','length','diameter','height','weight','shucked_weight','visc_weight','shell_weight','age')

# Convert Sex column to factor
train <- train %>%
  mutate(
    sex = as.factor(sex) 
  )

# Convert Sex column to factor
submission_data <- submission_data %>%
  mutate(
    sex = as.factor(sex) 
  )

# Next code blocks create a function to create a boxplot for each column to 
#   then pass a list of colnames to the function

# Create boxplot for each column
plot_columns <- function(column_name) {

  boxplot <- ggplot(train, aes_string(x = column_name)) +
    geom_boxplot() +
    labs(title = paste("Boxplot of", column_name))
  
  return(list(boxplot))
}

# Loop through each column ignoring the first since its ID
for (col in names(train)[-1]) {
  plots <- plot_columns(col)
  print(plots[[1]]) # Print boxplot
}

  plot_histogram <- function(column_name) {
    hists <- ggplot(train, aes_string(x = column_name)) +
    geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram of", column_name))
    
    return(list(hists))
  }

for (col in names(train)[-2]) {
  hists <- plot_histogram(col)
  print(hists[[1]]) # Print histogram
}

plot_histogram('weight')

## [[1]]

Scatterplot

corr_shell <- cor(train$shell_weight, train$age)
ggplot(train, aes(x = shell_weight, y = age)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot of Shell Weight vs Age") +
  annotate("text", x = max(train$shell_weight), y = 25, 
           label = paste("Correlation =", round(corr_shell, 3)), 
           hjust = 1, vjust = 1, size = 4, color = "blue")

corr_diam <- cor(train$diameter, train$age)
ggplot(train, aes(x = diameter, y = age)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot of Diameter vs Age") +
  annotate("text", x = 2, y = 15, 
         label = paste("Correlation =", round(corr_diam, 3)), 
         hjust = 1, vjust = 1, size = 4, color = "blue")

Correlation Matrix

I’ve chosen length, diameter, height, and weight and we can see all of their correlation coefficients are nearly 1.

corrs <- cor(train[,3:6])

corrplot(corrs)

Testing Correlation Hypothesis

Using cor.test, below shows that none of the correlations between the four chosen variables have a correlation of 0 and procide their estimated Pearson’s R value.

len_diam <- cor.test(train$length, train$diameter, method = "pearson", conf.level = .80)
len_height <- cor.test(train$length, train$height, method = "pearson", conf.level = .80)
len_weight <- cor.test(train$length, train$weight, method = "pearson", conf.level = .80)
diam_height <- cor.test(train$diameter, train$height, method = "pearson", conf.level = .80)
diam_weight <- cor.test(train$diameter, train$weight, method = "pearson", conf.level = .80)
height_weight <- cor.test(train$height, train$weight, method = "pearson", conf.level = .80)


print(paste(len_diam$data.name," Correlation :",len_diam$estimate,"len_diam P-Val:", len_diam$p.value))

## [1] "train$length and train$diameter  Correlation : 0.989437354560577 len_diam P-Val: 0"

print(paste(len_height$data.name," Correlation :",len_height$estimate,"P-Val:", len_height$p.value))

## [1] "train$length and train$height  Correlation : 0.918351679384104 P-Val: 0"

print(paste(len_weight$data.name," Correlation :",len_weight$estimate,"P-Val:", len_weight$p.value))

## [1] "train$length and train$weight  Correlation : 0.936373792522644 P-Val: 0"

print(paste(diam_height$data.name," Correlation :",diam_height$estimate,"P-Val:", diam_height$p.value))

## [1] "train$diameter and train$height  Correlation : 0.921353147724645 P-Val: 0"

print(paste(diam_weight$data.name," Correlation :",diam_weight$estimate,"P-Val:", diam_weight$p.value))

## [1] "train$diameter and train$weight  Correlation : 0.938248621945664 P-Val: 0"

print(paste(height_weight$data.name," Correlation :",height_weight$estimate,"P-Val:", height_weight$p.value))

## [1] "train$height and train$weight  Correlation : 0.901774819491695 P-Val: 0"

pvals <- c(len_diam$p.value,len_height$p.value,len_weight$p.value,diam_height$p.value,diam_weight$p.value)

I would be concerned about a family wise error due to the low significance value chosen at .20. With 6 tests, there is a 73% chance that one of those significance tests are false. Although, these correlation values may be so strong that they outweigh the margin of a family wise error. I also run an adjustment of the p-values using Bonferroni-Holm’s correction, which leaves the P Values at 0.

# Adjusting p-values using Bonferroni-Holm correction
adjusted_pvals <- p.adjust(pvals, method = "holm")
adjusted_pvals

## [1] 0 0 0 0 0

alpha_fwe <- 1-((1-.20)^6)
adj_fwe <- 1-((1-.05)^6)

print(paste("Family Wise Error Rate at alpha = .8 & 6 Tests:", alpha_fwe))

## [1] "Family Wise Error Rate at alpha = .8 & 6 Tests: 0.737856"

print(paste("Family Wise Error Rate at alpha = .95 & 6 Tests:", adj_fwe))

## [1] "Family Wise Error Rate at alpha = .95 & 6 Tests: 0.264908109375"

| When running the correlation tests at a .05 significance level, the P Values are still 0, suggesting that the correlations are just that strong. It seems there isn’t a need to worry about the family wise error rate in this situation.

len_diam <- cor.test(train$length, train$diameter, method = "pearson", conf.level = .95)
height_weight <- cor.test(train$length, train$height, method = "pearson", conf.level = .95)
len_weight <- cor.test(train$length, train$weight, method = "pearson", conf.level = .95)
diam_height <- cor.test(train$diameter, train$height, method = "pearson", conf.level = .95)
diam_weight <- cor.test(train$diameter, train$weight, method = "pearson", conf.level = .95)
height_weight <- cor.test(train$height, train$weight, method = "pearson", conf.level = .95)


print(paste(len_diam$data.name," Correlation :",len_diam$estimate,"len_diam P-Val:", len_diam$p.value))

## [1] "train$length and train$diameter  Correlation : 0.989437354560577 len_diam P-Val: 0"

print(paste(len_height$data.name," Correlation :",len_height$estimate,"P-Val:", len_height$p.value))

## [1] "train$length and train$height  Correlation : 0.918351679384104 P-Val: 0"

print(paste(len_weight$data.name," Correlation :",len_weight$estimate,"P-Val:", len_weight$p.value))

## [1] "train$length and train$weight  Correlation : 0.936373792522644 P-Val: 0"

print(paste(diam_height$data.name," Correlation :",diam_height$estimate,"P-Val:", diam_height$p.value))

## [1] "train$diameter and train$height  Correlation : 0.921353147724645 P-Val: 0"

print(paste(diam_weight$data.name," Correlation :",diam_weight$estimate,"P-Val:", diam_weight$p.value))

## [1] "train$diameter and train$weight  Correlation : 0.938248621945664 P-Val: 0"

print(paste(height_weight$data.name," Correlation :",height_weight$estimate,"P-Val:", height_weight$p.value))

## [1] "train$height and train$weight  Correlation : 0.901774819491695 P-Val: 0"

## Precision Matrix and LDU

prec <- solve(corrs)
prec

##              length   diameter    height    weight
## length    49.149116 -44.597762 -1.488898 -2.835504
## diameter -44.597762  51.657946 -3.157551 -3.860421
## height    -1.488898  -3.157551  7.214902 -2.149484
## weight    -2.835504  -3.860421 -2.149484  9.215477

Multiplying the correlation matrix by the precision matrix

round(corrs %*% prec)

##          length diameter height weight
## length        1        0      0      0
## diameter      0        1      0      0
## height        0        0      1      0
## weight        0        0      0      1

Multiplying the precision matrix by the correlation matrix

round(prec %*% corrs)

##          length diameter height weight
## length        1        0      0      0
## diameter      0        1      0      0
## height        0        0      1      0
## weight        0        0      0      1

We can see these create matrices with 1s along the diagonal to create identity matrices, which is a natural property of multiplying the precision matrix (aka inverse matrix) to the original matrix.

“Conduct LDU decomposition on the matrix.” Assuming the matrix to be the original correlation matrix, LDU Factorization results in an identity matrix.

(rr_corrs <- rref(corrs))

##          length diameter height weight
## length        1        0      0      0
## diameter      0        1      0      0
## height        0        0      1      0
## weight        0        0      0      1

(lu_form <- lu.decomposition(rr_corrs))

## $L
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1
## 
## $U
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1

lu_form

## $L
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1
## 
## $U
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1

(L <- lu_form$L)

##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1

(U <- lu_form$U)

##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1

(D <- diag(diag(U)))

##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1

#(U_prime <- D %*% U) #D is an identity matrix and when inversed, is itself of the same order, it is also itself when multiplied by itself

Calculus-Based Probability & Statistics.

| Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function.

| Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Shucked weight has the highest right skew. When creating an exponential distribution on top of it with the estimate of \(\lambda\) that fitdistr() gives, they do not match particularly well.

skews <- apply(train[-2], MARGIN = 2, FUN = skewness) #get skewness of each column
print(skews[6])

## shucked_weight 
##      0.3494575

fit <- fitdistr(train$shucked_weight, densfun = "exponential") #find the expon. distr that works the best

print(fit$estimate) #print the rate estimate

##       rate 
## 0.09896806

exp_dis <- rexp(1000, fit$estimate) #create example distribution

hist(exp_dis, breaks = "FD", freq = FALSE, main = "Histogram of Exp. Dist.")
hist(train$shucked_weight, breaks = "FD", freq = FALSE, col = rgb(1, 0, 0, alpha = 0.5), add = TRUE)
legend("topright", legend = c("Exp Distribution", "Shucked Weight"), fill = c("grey", "red"))

Using a QQ Plot with both distributions reveals that they certainly do not match well.

qqplot(exp_dis, train$shucked_weight, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", main = "Q-Q Plot")
abline(0, 1, col = "red")

“Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).”

Although using the quantiles function would be more efficient, by trial and error once can use pexp(), the cumulative distribution function, which calculates the percent chance a value from an exponential variable is taken at or below a value. Passing the previously estimated rate to pexp() would give rough estimates of the variables 5th and 95th percentiles if an exponential distribution were a good fit to the data.

#pexp() calculates the CDF at a requested value and a given lambda

exp_5_val <- .519
exp_5 <- pexp(.519, rate = fit$estimate) # 5%

exp_95_val <- 30.27
exp_95 <- pexp(30.27, rate = fit$estimate) # 95%

print(paste("CDF Percentile:", round(exp_5,3),"Variable Value:",exp_5_val))

## [1] "CDF Percentile: 0.05 Variable Value: 0.519"

print(paste("CDF Percentile:", round(exp_95,3),"Variable Value:",exp_95_val))

## [1] "CDF Percentile: 0.95 Variable Value: 30.27"

#Another way to get the quantiles from our specific distribtuion:

real_quants <- quantile(exp_dis, c(0.05, 0.95))
print("Values from the 1000 observation distribution created earlier")

## [1] "Values from the 1000 observation distribution created earlier"

print(real_quants)

##         5%        95% 
##  0.5415927 29.1781950

“Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data.”
| It is unclear what the confidence interval is meant to be of or if it was a typo and the intention is to find the values of a normal distribution that are at 5% and 95% percentiles. I will do both and assume it is the mean the 95% confidence interval is for.

CI <- t.test(train$shucked_weight)$conf.int #95 CI of Mean of shucked_weight

print(paste("95% CI of the Mean: [", round(CI[1],3),",",round(CI[2],3),"]"))

## [1] "95% CI of the Mean: [ 10.064 , 10.145 ]"

print(paste("Value at 5th Percentile:",round(qnorm(0.05, mean = mean(train$shucked_weight), sd = sd(train$shucked_weight)),3)))

## [1] "Value at 5th Percentile: 0.863"

print(paste("value at 95th Percentile:",round(qnorm(0.95, mean = mean(train$shucked_weight), sd = sd(train$shucked_weight)),3)))

## [1] "value at 95th Percentile: 19.345"

| The exponential distribution did not match the distribution of the most skewed data the training set has to offer (shucked_weight), but the value of the exercise is that when distributions match they can be very good ways of estimating important points in the data, such as the 5th and 95th percentiles. It becomes even more valuable when attempting to generalize to a population when the distribution of values is known. If the exponential distribution was a good fit, one could attempt to generalize those important points to the population in hopes of creating more accurate predictions. To attempt a more accurate skewed distribution, one could only take values of the mean and higher and scale the data such that the mean was equal to zero, but that seemed too drastic in comparison to moving the minimum to zero. In this case the minimum of ‘shucked_weight’ was .028 so it seemed like a negligible gain.

print(paste("Minimum of shucked_weight:",min(train$shucked_weight)))

## [1] "Minimum of shucked_weight: 0.0283495"

hist(exp_dis, breaks = "FD", freq = FALSE, main = "Histogram of Exp. Dist.")
hist((train$shucked_weight[train$shucked_weight > mean(train$shucked_weight)])-mean(train$shucked_weight), breaks = "FD", freq = FALSE, col = rgb(1, 0, 0, alpha = 0.5), add = TRUE)
legend("topright", legend = c("Exp Distribution", "Shucked Weight"), fill = c("grey", "red"))

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.

# Separating Index Column off the dataframe
train_sub <- train[-1]

# Basic model
model <- lm(age ~ ., data = train_sub)

#Playing with step
updated_model <- step(model, direction = "backward")

## Start:  AIC=111865.1
## age ~ sex + length + diameter + height + weight + shucked_weight + 
##     visc_weight + shell_weight
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                        335336 111865
## - length          1       102 335438 111886
## - diameter        1       363 335700 111943
## - visc_weight     1      1504 336840 112195
## - height          1      4232 339568 112792
## - weight          1      5826 341163 113139
## - sex             2      8658 343994 113749
## - shell_weight    1     12318 347654 114535
## - shucked_weight  1     38838 374174 119978

#Testing out removing some variables
train_sub_2 <- train[-1, -6,-8,-9]

# Setting train and test datasets
set.seed(123)
index <- createDataPartition(train$age, p = 0.7, list = FALSE)
train_data <- train_sub[index, ]  # 70% for training
test_data <- train_sub[-index, ]


model <- train(age ~ ., data = train_data, method = "lm")

preds <- predict(model, newdata = test_data)

metrics <- caret::postResample(preds, test_data$age)
print(metrics)

##      RMSE  Rsquared       MAE 
## 2.1314662 0.5545252 1.4831273

summary(model)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2290  -1.2248  -0.3331   0.7537  19.0409 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.790101   0.090896  41.697  < 2e-16 ***
## sexI           -1.043613   0.030219 -34.535  < 2e-16 ***
## sexM           -0.099047   0.022891  -4.327 1.52e-05 ***
## length          0.740504   0.231687   3.196  0.00139 ** 
## diameter        2.244536   0.287056   7.819 5.42e-15 ***
## height          7.466731   0.289088  25.829  < 2e-16 ***
## weight          0.183947   0.006502  28.292  < 2e-16 ***
## shucked_weight -0.604673   0.007955 -76.008  < 2e-16 ***
## visc_weight    -0.205292   0.014238 -14.419  < 2e-16 ***
## shell_weight    0.524065   0.011795  44.431  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.127 on 51829 degrees of freedom
## Multiple R-squared:  0.5492, Adjusted R-squared:  0.5491 
## F-statistic:  7015 on 9 and 51829 DF,  p-value: < 2.2e-16

train_sub_2 <- train[-1, -6,-8,-9]

set.seed(123)

index <- createDataPartition(train_sub_2$age, p = 0.7, list = FALSE)
train_data_2 <- train_sub_2[index, ]  # 70% for training
test_data_2 <- train_sub_2[-index, ]

# Assuming 'df' is your dataframe
model_2 <- train(age ~ ., data = train_data_2, method = "lm")

preds_2 <- predict(model_2, newdata = test_data_2)

metrics <- caret::postResample(preds_2, test_data_2$age)
print(metrics)

##      RMSE  Rsquared       MAE 
## 2.1538186 0.5453647 1.4992540

matrix_train <- as.matrix(train_data[-9])
matrix_train_age <- as.matrix(train_data[9])
# Perform cross-validation to select the best lambda
cv_model <- cv.glmnet(matrix_train, matrix_train_age , alpha = 0.5)  # Cross-validation for Elastic Net

# Plot the cross-validation results (optional)
plot(cv_model)

# Find the best lambda value
best_lambda <- cv_model$lambda.min  # You can also use lambda.1se for a more parsimonious model

# Refit the model with the best lambda
final_model <- glmnet(matrix_train, matrix_train_age, alpha = 0.5, lambda = best_lambda)

# Make predictions
matrix_test_data <- as.matrix(test_data[-9])
matrix_test_data_age <- as.matrix(test_data[9])
predictions <- predict(final_model, newx = matrix_test_data)


metrics <- caret::postResample(predictions, matrix_test_data_age)
print(metrics)

##     RMSE Rsquared      MAE 
## 2.158333 0.543224 1.507126

modelLookup("xgbLinear")

##       model parameter                 label forReg forClass probModel
## 1 xgbLinear   nrounds # Boosting Iterations   TRUE     TRUE      TRUE
## 2 xgbLinear    lambda     L2 Regularization   TRUE     TRUE      TRUE
## 3 xgbLinear     alpha     L1 Regularization   TRUE     TRUE      TRUE
## 4 xgbLinear       eta         Learning Rate   TRUE     TRUE      TRUE

getModelInfo("xgbLinear")

## $xgbLinear
## $xgbLinear$label
## [1] "eXtreme Gradient Boosting"
## 
## $xgbLinear$library
## [1] "xgboost"
## 
## $xgbLinear$type
## [1] "Regression"     "Classification"
## 
## $xgbLinear$parameters
##   parameter   class                 label
## 1   nrounds numeric # Boosting Iterations
## 2    lambda numeric     L2 Regularization
## 3     alpha numeric     L1 Regularization
## 4       eta numeric         Learning Rate
## 
## $xgbLinear$grid
## function (x, y, len = NULL, search = "grid") 
## {
##     if (search == "grid") {
##         out <- expand.grid(lambda = c(0, 10^seq(-1, -4, length = len - 
##             1)), alpha = c(0, 10^seq(-1, -4, length = len - 1)), 
##             nrounds = floor((1:len) * 50), eta = 0.3)
##     }
##     else {
##         out <- data.frame(lambda = 10^runif(len, min = -5, 0), 
##             alpha = 10^runif(len, min = -5, 0), nrounds = sample(1:100, 
##                 size = len, replace = TRUE), eta = runif(len, 
##                 max = 3))
##     }
##     out
## }
## 
## $xgbLinear$loop
## NULL
## 
## $xgbLinear$fit
## function (x, y, wts, param, lev, last, classProbs, ...) 
## {
##     if (!inherits(x, "xgb.DMatrix")) 
##         x <- as.matrix(x)
##     if (is.factor(y)) {
##         if (length(lev) == 2) {
##             y <- ifelse(y == lev[1], 1, 0)
##             if (!inherits(x, "xgb.DMatrix")) 
##                 x <- xgboost::xgb.DMatrix(x, label = y, missing = NA)
##             else xgboost::setinfo(x, "label", y)
##             if (!is.null(wts)) 
##                 xgboost::setinfo(x, "weight", wts)
##             out <- xgboost::xgb.train(list(lambda = param$lambda, 
##                 alpha = param$alpha), data = x, nrounds = param$nrounds, 
##                 objective = "binary:logistic", ...)
##         }
##         else {
##             y <- as.numeric(y) - 1
##             if (!inherits(x, "xgb.DMatrix")) 
##                 x <- xgboost::xgb.DMatrix(x, label = y, missing = NA)
##             else xgboost::setinfo(x, "label", y)
##             if (!is.null(wts)) 
##                 xgboost::setinfo(x, "weight", wts)
##             out <- xgboost::xgb.train(list(lambda = param$lambda, 
##                 alpha = param$alpha), data = x, num_class = length(lev), 
##                 nrounds = param$nrounds, objective = "multi:softprob", 
##                 ...)
##         }
##     }
##     else {
##         if (!inherits(x, "xgb.DMatrix")) 
##             x <- xgboost::xgb.DMatrix(x, label = y, missing = NA)
##         else xgboost::setinfo(x, "label", y)
##         if (!is.null(wts)) 
##             xgboost::setinfo(x, "weight", wts)
##         out <- xgboost::xgb.train(list(lambda = param$lambda, 
##             alpha = param$alpha), data = x, nrounds = param$nrounds, 
##             objective = "reg:squarederror", ...)
##     }
##     out
## }
## 
## $xgbLinear$predict
## function (modelFit, newdata, submodels = NULL) 
## {
##     if (!inherits(newdata, "xgb.DMatrix")) {
##         newdata <- as.matrix(newdata)
##         newdata <- xgboost::xgb.DMatrix(data = newdata, missing = NA)
##     }
##     out <- predict(modelFit, newdata)
##     if (modelFit$problemType == "Classification") {
##         if (length(modelFit$obsLevels) == 2) {
##             out <- ifelse(out >= 0.5, modelFit$obsLevels[1], 
##                 modelFit$obsLevels[2])
##         }
##         else {
##             out <- matrix(out, ncol = length(modelFit$obsLevels), 
##                 byrow = TRUE)
##             out <- modelFit$obsLevels[apply(out, 1, which.max)]
##         }
##     }
##     out
## }
## 
## $xgbLinear$prob
## function (modelFit, newdata, submodels = NULL) 
## {
##     if (!inherits(newdata, "xgb.DMatrix")) {
##         newdata <- as.matrix(newdata)
##         newdata <- xgboost::xgb.DMatrix(data = newdata, missing = NA)
##     }
##     out <- predict(modelFit, newdata)
##     if (length(modelFit$obsLevels) == 2) {
##         out <- cbind(out, 1 - out)
##         colnames(out) <- modelFit$obsLevels
##     }
##     else {
##         out <- matrix(out, ncol = length(modelFit$obsLevels), 
##             byrow = TRUE)
##         colnames(out) <- modelFit$obsLevels
##     }
##     as.data.frame(out, stringsAsFactors = TRUE)
## }
## 
## $xgbLinear$predictors
## function (x, ...) 
## {
##     imp <- xgboost::xgb.importance(x$xNames, model = x)
##     x$xNames[x$xNames %in% imp$Feature]
## }
## 
## $xgbLinear$varImp
## function (object, numTrees = NULL, ...) 
## {
##     imp <- xgboost::xgb.importance(object$xNames, model = object)
##     imp <- as.data.frame(imp, stringsAsFactors = TRUE)[, 1:2]
##     rownames(imp) <- as.character(imp[, 1])
##     imp <- imp[, 2, drop = FALSE]
##     colnames(imp) <- "Overall"
##     missing <- object$xNames[!(object$xNames %in% rownames(imp))]
##     missing_imp <- data.frame(Overall = rep(0, times = length(missing)))
##     rownames(missing_imp) <- missing
##     imp <- rbind(imp, missing_imp)
##     imp
## }
## 
## $xgbLinear$levels
## function (x) 
## x$obsLevels
## 
## $xgbLinear$tags
## [1] "Linear Classifier Models"   "Linear Regression Models"  
## [3] "L1 Regularization Models"   "L2 Regularization Models"  
## [5] "Boosting"                   "Ensemble Model"            
## [7] "Implicit Feature Selection"
## 
## $xgbLinear$sort
## function (x) 
## {
##     x[order(x$nrounds, x$alpha, x$lambda), ]
## }

ctrl <- trainControl(method = "cv", number =10)
xgbGrid <- expand.grid(nrounds = c(1, 10),
                       lambda = c(.01, .1, .3 ,.5, .7),
                       alpha = c(.01, .1, .3, .5, .7),
                       eta = c(.3, .5, .7))

xgbLinear <- train(age ~ ., data = train_sub, 
                 method = "xgbLinear", 
                 trControl = ctrl, 
                 verbose = FALSE, 
                 tuneGrid = xgbGrid,
                 metric = "MAE")

xgbLinear$bestTune

##     nrounds lambda alpha eta
## 145      10    0.7   0.5 0.3

xgbPreds <- predict(xgbLinear, test_data[-9])

metrics <- caret::postResample(xgbPreds, test_data$age)
print(metrics)

##      RMSE  Rsquared       MAE 
## 2.0312610 0.6050321 1.3578110

# all + weight:length notable .606
# all + weight:diameter .607 1.355
xgbLim1 <- train(age ~ sex+length+diameter+height+visc_weight+shell_weight + weight:diameter+ shucked_weight:shucked_weight, data = train_sub, 
                 method = "xgbLinear", 
                 trControl = ctrl, 
                 verbose = FALSE, 
                 ## Only a single model can be passed to the
                 ## function when no resampling is used:
                 tuneGrid = xgbGrid,
                 metric = "MAE")

xgbPreds_Lim1 <- predict(xgbLim1, test_data[-9])

metrics <- caret::postResample(xgbPreds_Lim1, test_data$age)
print(metrics)

# Creating data to submit with

names(submission_data)

## [1] "id"             "sex"            "length"         "diameter"      
## [5] "height"         "weight"         "shucked_weight" "visc_weight"   
## [9] "shell_weight"

submit <- predict(xgbLinear, submission_data)

submit <- data.frame(round(submit))

write_csv(submit,"age_pred_column.csv")