Problem 1.

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

set.seed(1234)

X <- runif(10000, min = 5, max = 15)
Y <- rnorm(10000, mean = 10, sd = 2.89)

summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.003   7.525  10.013  10.003  12.475  14.996
summary(Y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.564   8.113  10.032  10.033  11.949  20.456
hist(X)

hist(Y)

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

x <- median(X)
y <- median(Y)

x
## [1] 10.01266
y
## [1] 10.03154

a. P(X > x | X > y)

The probability that X is greater than its median (x) given that X is already greater than the median of Y.

First, I created a subset where X > y.

X_larger_y <- X[X > y]
summary(X_larger_y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.03   11.26   12.48   12.49   13.72   15.00

Next, I used the mean function to calculate the probability that a value in the subset is greater than x, the median of X.

X_larger_x <- mean(X_larger_y > x)
X_larger_x
## [1] 1

The probability of X being larger than its median, given that X is greater than y, is 1. This makes sense because the two medians are nearly identical, and y is in fact slightly larger than x; any value of X that exceeds y therefore necessarily exceeds x as well.
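
As a sanity check, the same value can be computed directly from the definition of conditional probability, P(A | B) = P(A and B) / P(B). This is just a sketch of the equivalent calculation and should agree with the subset approach above.

#P(X > x | X > y) computed directly from the definition of conditional probability
mean(X > x & X > y) / mean(X > y)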

b. P(X > x & Y > y)

The probability that X and Y are each greater than their respective medians.

As before, I use the mean to calculate the proportion of observations where both conditions are true.

joint_prob <- mean(X > x & Y > y)
joint_prob
## [1] 0.2507

There is roughly a 25% chance (0.2507) that both conditions are met simultaneously.

c. P(X < x | X > y)

This problem is the complement of part a: it asks for the probability of X < x, given that X is greater than y. Intuitively, it should be 0 or close to it, since we just found that P(X > x | X > y) = 1.

The calculation mirrors the first question.

X_less_x <- mean(X_larger_y < x)
X_less_x
## [1] 0

The probability is 0.

Investigate whether P(X > x & Y > y) = P(X > x)P(Y > y) by building a table and evaluating the marginal and joint probabilities.

This is a contingency table evaluating the marginal and joint probabilities.

probability_matrix <- matrix(c((mean(X > x & Y > y)), (mean(X > x & Y <= y)),
                               (mean(X <= x & Y > y)), (mean(X <= x & Y <= y))),
                             nrow = 2, ncol = 2, byrow = TRUE,
                             dimnames = list(c("X > x", "X <= x"), c("Y > y", "Y <= y")))
probability_matrix <- rbind(probability_matrix, colSums(probability_matrix))
probability_matrix <- cbind(probability_matrix, rowSums(probability_matrix))

print(probability_matrix)
##         Y > y Y <= y    
## X > x  0.2507 0.2493 0.5
## X <= x 0.2493 0.2507 0.5
##        0.5000 0.5000 1.0

Based on this table, the joint probability P(X > x & Y > y) is 0.2507. The marginal probabilities P(X > x) and P(Y > y) are each 0.5, so the product of the marginals is 0.25. The joint probability is therefore slightly higher than the product of the marginal probabilities, but the two are nearly equal. The joint probability equals the product of the marginals exactly when the two variables are independent; a discrepancy this small (0.0007) is most plausibly sampling variation rather than evidence of real dependence, since X and Y were generated independently.
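
A quick numerical comparison of the two quantities, reusing the same logical vectors (a sketch; no new data involved):

#Joint probability versus the product of the marginal probabilities
joint <- mean(X > x & Y > y)
product <- mean(X > x) * mean(Y > y)
c(joint = joint, product = product, difference = joint - product)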

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

Let’s make a matrix of counts on which to run Fisher’s Exact Test and the Chi-Square Test.

matrix_prob <- matrix(c(sum(X > x & Y > y), sum(X > x & Y <= y),
                        sum(X <= x & Y > y), sum(X <= x & Y <= y)),
                      nrow = 2, ncol = 2, byrow = TRUE)
matrix_prob
##      [,1] [,2]
## [1,] 2507 2493
## [2,] 2493 2507
#Fisher's Exact Test
fisher.test(matrix_prob)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix_prob
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264
#Chi Square Test
chisq.test(matrix_prob)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  matrix_prob
## X-squared = 0.0676, df = 1, p-value = 0.7949

Fisher’s Exact Test and the Chi-Square Test are both used to determine whether there is a significant association between two categorical variables: each tests the null hypothesis of independence and produces a p-value. A p-value below the typical alpha level of 0.05 would lead us to reject the null hypothesis of independence, indicating a significant association between the two variables, X and Y. Here the p-value is 0.7949 for Fisher’s Exact Test and 0.7949 for the Chi-Square Test as well. Therefore, we fail to reject the null hypothesis in both tests, indicating that there is no statistically significant association between X and Y.

Fisher’s Exact Test is typically used when sample sizes (cell counts) are small, while the Chi-Square Test is a large-sample approximation, so with 10,000 observations the Chi-Square Test is the more appropriate choice here, although both give essentially the same answer. The results are not surprising: the contingency table from the previous part showed the joint probability (0.2507) to be almost exactly the product of the marginal probabilities (0.25), and the two variables were generated independently, so finding no significant association is exactly what we would expect.
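
As a side note, the chi-square output above uses Yates’ continuity correction. With cell counts this large the correction barely matters; a quick sketch to confirm, rerunning the test on the same matrix without the correction:

#Chi Square Test without Yates' continuity correction, for comparison
chisq.test(matrix_prob, correct = FALSE)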

Problem 2.

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
df <- read_csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20605/playground-series-s3e16/train.csv")
## Rows: 74051 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (9): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
## Rows: 74,051
## Columns: 10
## $ id               <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ Sex              <chr> "I", "I", "M", "F", "I", "M", "M", "I", "F", "M", "I"…
## $ Length           <dbl> 1.5250, 1.1000, 1.3875, 1.7000, 1.2500, 1.5000, 1.575…
## $ Diameter         <dbl> 1.1750, 0.8250, 1.1125, 1.4125, 1.0125, 1.1750, 1.137…
## $ Height           <dbl> 0.3750, 0.2750, 0.3750, 0.5000, 0.3375, 0.4125, 0.350…
## $ Weight           <dbl> 28.973189, 10.418441, 24.777463, 50.660556, 23.289114…
## $ `Shucked Weight` <dbl> 12.7289255, 4.5217453, 11.3398000, 20.3549410, 11.977…
## $ `Viscera Weight` <dbl> 6.6479577, 2.3246590, 5.5565020, 10.9918385, 4.507570…
## $ `Shell Weight`   <dbl> 8.348928, 3.401940, 6.662133, 14.996885, 5.953395, 7.…
## $ Age              <dbl> 9, 8, 9, 11, 8, 10, 11, 11, 12, 11, 7, 10, 10, 7, 9, …
#Univariate descriptive statistics
summary(df)
##        id            Sex                Length          Diameter     
##  Min.   :    0   Length:74051       Min.   :0.1875   Min.   :0.1375  
##  1st Qu.:18513   Class :character   1st Qu.:1.1500   1st Qu.:0.8875  
##  Median :37025   Mode  :character   Median :1.3750   Median :1.0750  
##  Mean   :37025                      Mean   :1.3175   Mean   :1.0245  
##  3rd Qu.:55538                      3rd Qu.:1.5375   3rd Qu.:1.2000  
##  Max.   :74050                      Max.   :2.0128   Max.   :1.6125  
##      Height           Weight        Shucked Weight     Viscera Weight    
##  Min.   :0.0000   Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252  
##  1st Qu.:0.3000   1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330  
##  Median :0.3625   Median :23.7994   Median : 9.90815   Median : 4.98951  
##  Mean   :0.3481   Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839  
##  3rd Qu.:0.4125   3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815  
##  Max.   :2.8250   Max.   :80.1015   Max.   :42.18406   Max.   :21.54562  
##   Shell Weight           Age        
##  Min.   : 0.04252   Min.   : 1.000  
##  1st Qu.: 3.96893   1st Qu.: 8.000  
##  Median : 6.93145   Median :10.000  
##  Mean   : 6.72387   Mean   : 9.968  
##  3rd Qu.: 9.07184   3rd Qu.:11.000  
##  Max.   :28.49125   Max.   :29.000

A couple of things stand out in the summary statistics, most notably the wide ranges of Weight and Shell Weight.

#Histogram of all the numeric variables
numeric_cols <- sapply(df, is.numeric)
numeric_col_names <- names(df)[numeric_cols]

plot_list <- list()

for (col_name in numeric_col_names) {
  #.data[[col_name]] lets ggplot refer to columns whose names contain spaces
  p <- ggplot(df, aes(x = .data[[col_name]])) +
    geom_histogram(bins = 30) +
    theme_minimal() +
    ggtitle(paste("Histogram of", col_name))
  plot_list[[col_name]] <- p
}

do.call(grid.arrange, c(plot_list, ncol = 3))

For the most part, the histograms of these variables are right-skewed.
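
To back up that visual impression with a number, here is a rough sketch computing the moment-based skewness coefficient for each numeric column (positive values indicate right skew; base R only, no extra packages assumed):

#Moment-based skewness for each numeric column; positive values = right-skewed
sapply(df[numeric_col_names], function(v) mean((v - mean(v))^3) / sd(v)^3)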

#Scatter plot of Diameter vs Age
df %>% 
  ggplot(aes(x=Diameter, y=Age)) +
  geom_point() +
  theme_minimal() + 
  ggtitle("Diameter vs Age")

There appears to be a positive correlation between Diameter and Age.

#Scatter plot of Height vs Age
df %>% 
  ggplot(aes(x=Height, y=Age)) +
  geom_point() +
  theme_minimal() + 
  ggtitle("Height vs Age") +
  coord_cartesian(xlim = c(0, 1))

There are a couple of extreme Height values that skew the plot, so I restricted the x-axis to the range 0 to 1. With those outliers out of view, there appears to be a positive correlation between Height and Age.

#Scatter plot of Shell Weight vs Age
df %>% 
  ggplot(aes(x=`Shell Weight`, y=Age)) +
  geom_point() +
  theme_minimal() + 
  ggtitle("Shell Weight vs Age")

There also appears to be a positive correlation between Shell Weight and Age.

#Correlation Matrix
cor_matrix <- cor(df[,c("Diameter","Height","Shell Weight")])

print(cor_matrix)
##               Diameter    Height Shell Weight
## Diameter     1.0000000 0.9213531    0.9226882
## Height       0.9213531 1.0000000    0.9033982
## Shell Weight 0.9226882 0.9033982    1.0000000

The three variables I picked are all strongly correlated with one another. They are also similarly correlated with Age, as the correlation plot in the modeling section later shows.

#Correlation Test
test_dia_hei <- cor.test(df$Diameter, df$Height, conf.level = 0.80)
print("Diameter and Height")
## [1] "Diameter and Height"
print(test_dia_hei)
## 
##  Pearson's product-moment correlation
## 
## data:  df$Diameter and df$Height
## t = 644.97, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9206384 0.9220617
## sample estimates:
##       cor 
## 0.9213531
test_dia_she <- cor.test(df$Diameter, df$`Shell Weight`, conf.level = 0.80)
print("Diameter and Shell Weight")
## [1] "Diameter and Shell Weight"
print(test_dia_she)
## 
##  Pearson's product-moment correlation
## 
## data:  df$Diameter and df$`Shell Weight`
## t = 651.23, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9219851 0.9233852
## sample estimates:
##       cor 
## 0.9226882
test_hei_she <- cor.test(df$Height, df$`Shell Weight`, conf.level = 0.80)
print("Height and Shell Weight")
## [1] "Height and Shell Weight"
print(test_hei_she)
## 
##  Pearson's product-moment correlation
## 
## data:  df$Height and df$`Shell Weight`
## t = 573.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9025285 0.9042605
## sample estimates:
##       cor 
## 0.9033982

The correlation tests for “Diameter and Height”, “Diameter and Shell Weight”, and “Height and Shell Weight” all demonstrate a strong positive correlation between each pair of variables. Each test has a p-value far below 0.05, indicating the results are statistically significant, meaning the correlations are very unlikely to be due to random chance. In addition, the 80% confidence intervals are narrow, suggesting precise estimates.

Given the extremely small p-values (all < 2.2e-16), which lead us to reject the null hypothesis of zero correlation in every test, I would not be worried about familywise error. We are performing only three tests, and each p-value is so small that even a conservative Bonferroni correction (dividing alpha by three) would not change any conclusion, so the chance that all three results are false positives (Type I errors) is negligible. Familywise error is worth keeping in mind in general, but it is not a concern in this instance.
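
To make that argument concrete, here is a sketch applying a Bonferroni adjustment (the most conservative familywise correction) to the three p-values; with values this small the adjustment changes nothing of substance.

#Bonferroni-adjusted p-values for the three correlation tests
p_values <- c(test_dia_hei$p.value, test_dia_she$p.value, test_hei_she$p.value)
p.adjust(p_values, method = "bonferroni")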

B. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.

#Inverting the correlation matrix
precision_matrix <- solve(cor_matrix)
print("Precision Matrix")
## [1] "Precision Matrix"
print(precision_matrix)
##               Diameter    Height Shell Weight
## Diameter      9.370051 -4.474177    -4.603672
## Height       -4.474177  7.574983    -2.714955
## Shell Weight -4.603672 -2.714955     7.700440
#Multiply the correlation matrix by the precision matrix
product_1 <- cor_matrix %*% precision_matrix
print("Correlation Matrix * Precision Matrix")
## [1] "Correlation Matrix * Precision Matrix"
print(product_1)
##                  Diameter        Height Shell Weight
## Diameter     1.000000e+00 -4.440892e-16 8.881784e-16
## Height       1.776357e-15  1.000000e+00 0.000000e+00
## Shell Weight 8.881784e-16 -8.881784e-16 1.000000e+00
#Multiply the precision matrix by the correlation matrix
product_2 <- precision_matrix %*% cor_matrix
print("Precision Matrix * Correlation Matrix")
## [1] "Precision Matrix * Correlation Matrix"
print(product_2)
##                  Diameter        Height  Shell Weight
## Diameter     1.000000e+00  8.881784e-16  0.000000e+00
## Height       8.881784e-16  1.000000e+00 -4.440892e-16
## Shell Weight 0.000000e+00 -8.881784e-16  1.000000e+00
library(Matrix)
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
#LDU decomposition on correlation matrix
ldu <- lu(cor_matrix)
expand(ldu)
## $L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000         .         .
## [2,] 0.9213531 1.0000000         .
## [3,] 0.9226882 0.3525715 1.0000000
## 
## $U
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000 0.9213531 0.9226882
## [2,]         . 0.1511084 0.0532765
## [3,]         .         . 0.1298627
## 
## $P
## 3 x 3 sparse Matrix of class "pMatrix"
##           
## [1,] | . .
## [2,] . | .
## [3,] . . |
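
Note that lu()/expand() return an LU factorization (plus the permutation P, which is the identity here), so the pivots still sit on the diagonal of U. A short sketch to express the result as a true LDU factorization by pulling the diagonal of U out into D:

#Convert the LU factors into L, D, U with U unit upper triangular
L_mat <- as.matrix(expand(ldu)$L)
U_full <- as.matrix(expand(ldu)$U)
D_mat <- diag(diag(U_full))                 #pivots on the diagonal
U_unit <- diag(1 / diag(U_full)) %*% U_full #unit upper triangular
round(L_mat %*% D_mat %*% U_unit, 10)       #should reproduce cor_matrix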

C. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

The histogram of ‘Shell Weight’ below is skewed to the right.

df %>% 
  ggplot(aes(x=`Shell Weight`)) +
  geom_histogram() + 
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#minimum value
min(df$`Shell Weight`)
## [1] 0.04252425
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
#Running fitdistr to fit an exponential probability density function
fit <- fitdistr(df$`Shell Weight`, "exponential")

#The optimal value of lambda
lambda <- fit$estimate
lambda
##      rate 
## 0.1487239
#Take 1000 samples from this exponential distribution
samples <- rexp(1000, rate = lambda)

#Histogram of samples data
ggplot(data.frame(samples), aes(x = samples)) +
  geom_histogram() +
  theme_minimal() +
  ggtitle("Histogram of Sampled Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram of the sampled data shows a rapid decline in frequency from left to right compared with the original histogram, indicating far more small values than large ones. Since the shape of the exponential distribution is noticeably different from the original histogram, the exponential distribution is not a good model for the ‘Shell Weight’ variable.
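
To see the mismatch directly, here is a sketch overlaying the fitted exponential density (rate = lambda) on the observed Shell Weight distribution:

#Histogram of Shell Weight on the density scale with the fitted exponential curve
ggplot(df, aes(x = `Shell Weight`)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  stat_function(fun = dexp, args = list(rate = lambda), colour = "red") +
  theme_minimal() +
  ggtitle("Shell Weight with fitted exponential density")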

#Finding the 5th and 95th percentile of the sample
p5_sample <- qexp(0.05, rate = lambda)
p95_sample <- qexp(0.95, rate = lambda)

p5_sample
## [1] 0.3448894
p95_sample
## [1] 20.14291
#Empirical 95% interval from the 2.5th and 97.5th percentiles
ci_empirical <- quantile(df$`Shell Weight`, probs = c(0.025, 0.975))
ci_empirical
##       2.5%      97.5% 
##  0.7087375 13.8912550

This returns the 2.5th and 97.5th percentiles of the ‘Shell Weight’ variable, i.e., an empirical 95% interval that does not rely on a normality assumption.

#Calculating CI, using z-score, if the data followed a normal distribution
test = 1.96 * (sd(df$`Shell Weight`) / sqrt(length(df$`Shell Weight`)))
ci_95 = c(mean(df$`Shell Weight`) - test, mean(df$`Shell Weight`) + test)
ci_95
## [1] 6.698053 6.749687
#Finding the 5th and 95th percentile of the empirical data
p5_empirical <- quantile(df$`Shell Weight`, probs = 0.05)
p95_empirical <- quantile(df$`Shell Weight`, probs = 0.95)
p5_empirical
##      5% 
## 1.10563
p95_empirical
##     95% 
## 12.7431

The 5th and 95th percentiles of the empirical data and of the exponential samples are quite different. The sampled (exponential) 5th and 95th percentiles are 0.3448894 and 20.14291, while the empirical 5th and 95th percentiles are 1.10563 and 12.7431. The empirical data do not follow an exponential distribution, so it is not surprising that the percentiles differ so much: the exponential model places far more mass near zero and in the right tail than the data actually have, which accounts for the gaps at both ends.
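
As a check on the theoretical percentiles, the exponential CDF F(t) = 1 - exp(-lambda * t) can be inverted in closed form, t_p = -log(1 - p) / lambda, which should reproduce the qexp() values above:

#Closed-form inversion of the exponential CDF for the 5th and 95th percentiles
-log(1 - c(0.05, 0.95)) / lambda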

D. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

To find the variables for the multiple regression model, let’s make a correlation matrix with all the continuous variables.

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
cor_matrix <- cor(select_if(df, is.numeric))

corrplot(cor_matrix, method = "number")

Based on this plot, ‘Shell Weight’, ‘Height’, ‘Diameter’, and ‘Length’ are the most positively correlated with Age.

#Was having trouble with the space in 'Shell Weight' so I had to change the column name
df <- df %>%
  rename(Shell_Weight = `Shell Weight`)

multiple_regression <- lm(Age ~ Shell_Weight + Height + Diameter + Length, data = df)

summary(multiple_regression)
## 
## Call:
## lm(formula = Age ~ Shell_Weight + Height + Diameter + Length, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.3488  -1.3795  -0.4852   0.7508  18.5093 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.361115   0.065592  81.734   <2e-16 ***
## Shell_Weight  0.460719   0.006699  68.778   <2e-16 ***
## Height        9.084957   0.259946  34.949   <2e-16 ***
## Diameter      2.242626   0.263066   8.525   <2e-16 ***
## Length       -2.999000   0.208923 -14.355   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.353 on 74046 degrees of freedom
## Multiple R-squared:  0.4507, Adjusted R-squared:  0.4507 
## F-statistic: 1.519e+04 on 4 and 74046 DF,  p-value: < 2.2e-16

Let’s begin with the coefficients. For each one-unit increase in ‘Shell_Weight’, the predicted ‘Age’ increases by about 0.46 years, holding the other predictors constant. One thing to note is that ‘Length’ has a negative coefficient: as ‘Length’ increases by one unit, predicted ‘Age’ decreases by about 3 years. All of these predictor variables are statistically significant, as they all have very small p-values.
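
The negative sign on ‘Length’ is most likely a multicollinearity artifact rather than a real negative relationship, since Length, Diameter, and the weight measures are all highly correlated with one another. A quick sketch to check this computes variance inflation factors from the diagonal of the inverse of the predictors’ correlation matrix (the same precision-matrix idea used in the linear algebra part above; base R only):

#Variance inflation factors: the diagonal of the inverse of the predictors'
#correlation matrix. Values well above 5-10 indicate strong multicollinearity.
pred_cor <- cor(df[, c("Shell_Weight", "Height", "Diameter", "Length")])
diag(solve(pred_cor))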

The R-squared is 0.4507, which means that 45.07% of the variability in ‘Age’ can be explained by the model. More than half of the variation is not accounted for by these predictors, which suggests there are other factors that influence ‘Age’ that are not included. Lastly, the p-value of the overall F-statistic is extremely small, indicating that the multiple regression model as a whole is statistically significant in predicting ‘Age’.
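
Before submitting, a rough in-sample sanity check of the error (a sketch only; training error will understate the leaderboard error, and, if I recall correctly, this competition is scored on mean absolute error, so MAE is the more relevant number):

#In-sample error metrics for the fitted model
pred_train <- predict(multiple_regression, newdata = df)
c(MAE = mean(abs(df$Age - pred_train)),
  RMSE = sqrt(mean((df$Age - pred_train)^2)))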

A quick analysis of the residuals.

par(mfrow=c(2,2))
plot(multiple_regression)

For the Residuals vs Fitted plot, ideally the residuals should be randomly scattered around the horizontal line at zero without any discernible pattern. However, the residuals here appear grouped into bands (likely a consequence of ‘Age’ taking only integer values), and the overall pattern suggests potential non-linearity in the relationship between the predictors and the target variable.

The Q-Q plot looks roughly normal through the middle of the distribution, but there are deviations in both tails, suggesting possible outliers or heavier-than-normal tails in the residuals.
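
To follow up on the possible outliers flagged by the Q-Q plot, a short sketch listing the observations with the largest Cook's distance (i.e., the most influential points):

#The six most influential observations by Cook's distance
cooks <- cooks.distance(multiple_regression)
head(sort(cooks, decreasing = TRUE))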

Uploading to Kaggle

test_df <- read_csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20605/playground-series-s3e16/test.csv")
## Rows: 49368 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (8): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_df <- test_df %>%
  rename(Shell_Weight = `Shell Weight`)
test_predictions <- predict(multiple_regression, newdata= test_df)
submission_df <- data.frame(Id = test_df$id, Age = test_predictions)

write.csv(submission_df, "submission.csv", row.names = FALSE)

Kaggle Username: jianjqc Score: 1.65103 / 1.64027