Final

Computational Mathematics

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs.You are also expected to make a short presentation via YouTube and post that recording to the board. 5 mins explaining answers This project will show off your ability to understand the elements of the class.

Problem 1.

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.

set.seed(1234)

X <- runif(10000, min = 5, max = 15)

Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

set.seed(1234)

Y <- rnorm(10000, mean = 10, sd = 2.89)

Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable.

# define medians
x_median <- median(X)
y_median <- median(Y)

Interpret the meaning of all probabilities.

5pts a. P(X>x | X>y) b. P(X>x & Y>y) c. P(X<x | X>y)

# a. P(X > x | X > y)
prob_a <- sum(X > x_median & X > y_median) / sum(X > y_median)
prob_a

## [1] 1

represents conditional probability where both conditions are true; the random variable X is more than the median of X, given that X is more than the median of y. The output of 1 means there is a 100% probably that when X exceeds median of y, it will also exceed median of x.

# b. P(X > x & Y > y)
prob_b <- sum(X > x_median & Y > y_median) / length(X)
prob_b

## [1] 0.2497

is a joint probability which represents that both both X and Y are greater than their medians .2497 means there is about a 24.97% probability that both conditions are ture: X is more than median of x and Y is more than median of y.

# c. P(X < x | X > y)
prob_c <- sum(X < x_median & X > y_median) / sum(X > y_median)
prob_c

## [1] 0

similar to (a) represents a conditional probability, but the Probability in question here is that X is less than the median of x, and X is more than median of y; the output of 0 means there is no chance or no observation found where this statement holds to be true, implying that X is always more than median of x when given that X is more than the median of y.

5pts Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

table(X > x_median, Y > y_median)

##        
##         FALSE TRUE
##   FALSE  2497 2503
##   TRUE   2503 2497

# P(X>x)P(Y>y); considered marginal probabilities
x_more <- sum(X > x_median) / length(X)
y_more <- sum(Y > y_median) / length(Y)
product <- x_more * y_more
cat("The product of Marginal Probabilities P(X > x) * P(Y > y)", product, "is almost identical to Joint Probabilities P(X>x & Y>y)", prob_b, "\n")

## The product of Marginal Probabilities P(X > x) * P(Y > y) 0.25 is almost identical to Joint Probabilities P(X>x & Y>y) 0.2497

The results above suggests that the Marginal Probabilities and Joint Probabilities are independent (do not impact each other).

5pts Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test.

contingency_table <- table(X > x_median, Y > y_median)
fisher_result <- fisher.test(contingency_table)

print(fisher_result) #when I restarted R and ran it again, there were issues so I redid this code chunk

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.9203
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9194656 1.0772299
## sample estimates:
## odds ratio 
##  0.9952116

chisq.test(contingency_table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 0.01, df = 1, p-value = 0.9203

What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not? The difference between the two is the type of data the output provides, additionally Fisher’s is used for a smaller sample size. Since the sample size was 10,000 so the Chi test would be more appropriate. The value of X-squared suggests the observed counts in the table are very close to the expected counts assuming independence.

Null hypothesis: X & Y are independent (no relationship); Alternative: they are not independent and X and Y have a relationship. I am surprised that both tests have the exact same p-value .9203 indicating statistical significance or to not reject to null hypothesis therefore, X & Y are independent. But not generally surprised at the results since we saw in earlier chunk that P(X>x & Y>y)=P(X>x)P(Y>y).

Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

5 points. Descriptive and Inferential Statistics Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

#load data & libraries
test=read.csv(url("https://raw.githubusercontent.com/MarjeteV/Data605/main/test.csv"))
train=read.csv(url("https://raw.githubusercontent.com/MarjeteV/Data605/main/train.csv"))
library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## corrplot 0.92 loaded

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Descriptive statistics

summary (train)

##        id            Sex                Length          Diameter     
##  Min.   :    0   Length:74051       Min.   :0.1875   Min.   :0.1375  
##  1st Qu.:18512   Class :character   1st Qu.:1.1500   1st Qu.:0.8875  
##  Median :37025   Mode  :character   Median :1.3750   Median :1.0750  
##  Mean   :37025                      Mean   :1.3175   Mean   :1.0245  
##  3rd Qu.:55538                      3rd Qu.:1.5375   3rd Qu.:1.2000  
##  Max.   :74050                      Max.   :2.0128   Max.   :1.6125  
##      Height           Weight        Shucked.Weight     Viscera.Weight    
##  Min.   :0.0000   Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252  
##  1st Qu.:0.3000   1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330  
##  Median :0.3625   Median :23.7994   Median : 9.90815   Median : 4.98951  
##  Mean   :0.3481   Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839  
##  3rd Qu.:0.4125   3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815  
##  Max.   :2.8250   Max.   :80.1015   Max.   :42.18406   Max.   :21.54562  
##   Shell.Weight           Age        
##  Min.   : 0.04252   Min.   : 1.000  
##  1st Qu.: 3.96893   1st Qu.: 8.000  
##  Median : 6.93145   Median :10.000  
##  Mean   : 6.72387   Mean   : 9.968  
##  3rd Qu.: 9.07184   3rd Qu.:11.000  
##  Max.   :28.49125   Max.   :29.000

glimpse(train)

## Rows: 74,051
## Columns: 10
## $ id             <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Sex            <chr> "I", "I", "M", "F", "I", "M", "M", "I", "F", "M", "I", …
## $ Length         <dbl> 1.5250, 1.1000, 1.3875, 1.7000, 1.2500, 1.5000, 1.5750,…
## $ Diameter       <dbl> 1.1750, 0.8250, 1.1125, 1.4125, 1.0125, 1.1750, 1.1375,…
## $ Height         <dbl> 0.3750, 0.2750, 0.3750, 0.5000, 0.3375, 0.4125, 0.3500,…
## $ Weight         <dbl> 28.973189, 10.418441, 24.777463, 50.660556, 23.289114, …
## $ Shucked.Weight <dbl> 12.7289255, 4.5217453, 11.3398000, 20.3549410, 11.97766…
## $ Viscera.Weight <dbl> 6.6479577, 2.3246590, 5.5565020, 10.9918385, 4.5075705,…
## $ Shell.Weight   <dbl> 8.348928, 3.401940, 6.662133, 14.996885, 5.953395, 7.93…
## $ Age            <int> 9, 8, 9, 11, 8, 10, 11, 11, 12, 11, 7, 10, 10, 7, 9, 7,…

#note some crab species are indertemined
ggplot(train, aes(x = Sex, y = Length)) +
  geom_boxplot()

ggplot(train, aes(x = Age)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of Crab Ages",
       x = "Age",
       y = "Frequency") +
  theme_minimal()

train[, 3:9] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram() #curious about how the different weights look; "fixed" scale looks less clear so I kept like this

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Scatter Plot Matrix

Kaggle says “Age is the target;” Crab age is the dependent variable
Scatter Plot matrix shows positive correlation with all variables

#Scatter plot matrix is grid of multiple plots to see relationship across all
pairs(train |> subset(select=c(Age, Length, Diameter)))

pairs(train |> subset(select=c(Age, Shell.Weight, Viscera.Weight, Shucked.Weight))) #I initially did all these variables together in one code chunk but since it was so many variables it mamde it harder to visualize the patterns

Correlation matrix

correlation_matrix <- cor(train[c("Age", "Length", "Diameter")])

print(correlation_matrix)

##                Age    Length  Diameter
## Age      1.0000000 0.6128431 0.6212559
## Length   0.6128431 1.0000000 0.9894374
## Diameter 0.6212559 0.9894374 1.0000000

####Test the hypotheses - the correlations between each pairwise set (this is ) of variables is 0, meaning there is no relationship (null hypothesis) - provide an 80% confidence interval, included in code below - I will test the hypothesis for Age, Length, and Diameter to keep it simple

cor.test(train$Age, train$Length, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Age and train$Length
## t = 211.04, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6098938 0.6157754
## sample estimates:
##       cor 
## 0.6128431

cor.test(train$Age, train$Diameter, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Age and train$Diameter
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6183556 0.6241393
## sample estimates:
##       cor 
## 0.6212559

cor.test(train$Diameter, train$Length, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$Diameter and train$Length
## t = 1857.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9893379 0.9895359
## sample estimates:
##       cor 
## 0.9894374

Analysis

Would you be worried about familywise error? Why?

This is also known as type 1 error, meaning there is a relationship between the independent variables which will impact the ability to find the significant coefficients. Yes I am worries about the familywise error. From the correlation matrix above it is clear that correlation coefficients between the variables: length, diameter, height, weight, shucked weight, viscera weight and shell weight being 0.9 above above would make false positive very likely. For example in the 3 correlation tests above, they all indicate that “alternative hypothesis: true correlation is not equal to 0.” Familywise error increases the probability that I will incorrectly reject the null hypothesis, which seems might be occurring in the 3 correlation tests mentioned, which makes the model less reliable.

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the [correlation] matrix.

precision_matrix <- solve(correlation_matrix) #invert correlation matrix

c_p <- correlation_matrix * precision_matrix # multiply the correlation matrix by the precision matrix
p_c <- precision_matrix * correlation_matrix # multiply the precision matrix by the correlation matrix
ldu_c <- Matrix::lu(correlation_matrix)

ldu_c

## LU factorization of Formal class 'denseLU' [package "Matrix"] with 4 slots
##   ..@ x       : num [1:9] 1 0.613 0.621 0.613 0.624 ...
##   ..@ perm    : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 3
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:3] "Age" "Length" "Diameter"
##   .. ..$ : chr [1:3] "Age" "Length" "Diameter"

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. - From the earlier question to provide descriptive analysis, the facetted histogram showed several variables skewed to the right and select weight variable

#shift it so that the minimum value is absolutely above zero if necessary; This isnt necessary since there is no negative weight 
min(train$Weight)

## [1] 0.056699

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lamda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g.,rexp(1000, lambda)).

exp_distr <- fitdistr(train$Weight, "exponential")
lambda <- exp_distr$estimate["rate"]
print (lambda)

##       rate 
## 0.04276206

sample <- rexp(1000, rate = lambda) #generates 1k samples

Plot a histogram and compare it with a histogram of your original variable. - While both histograms are still skewed the the right, the exponential graph is far more skewed and has a smaller spread with most to the density falling on the first few bars

par(mfrow = c(1, 2))
hist(train$Weight, main = "Weight", probability = TRUE)
hist(sample, main = "exp_distr", probability = TRUE)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

p5 <- qexp(0.05, rate = lambda)
p95 <- qexp(0.95, rate = lambda)
cat("Exponential: the 5th percentile is", p5, "and the 95th percentile is", p95, "\n")

## Exponential: the 5th percentile is 1.199505 and the 95th percentile is 70.05585

Also generate a 95% confidence interval from the empirical data, assuming normality.

ci = 1.96 * (sd(train$Weight) / sqrt(length(train$Weight)))
confidence_int = c(mean(train$Weight) - ci, mean(train$Weight) + ci)
confidence_int #using z score

## [1] 23.29412 23.47632

ci_emp = t.test(train$Weight)
ci_emp #using t test to confirm similar values

## 
##  One Sample t-test
## 
## data:  train$Weight
## t = 503.13, df = 74050, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  23.29412 23.47632
## sample estimates:
## mean of x 
##  23.38522

Finally, provide the empirical 5th percentile and 95th percentile of the data.

p5_e <- quantile(train$Weight, probs = 0.05)
p95_e <- quantile(train$Weight, probs = 0.95)
cat("Empirical: the 5th percentile is", p5_e, "and the 95th percentile is", p95_e, "\n")

## Empirical: the 5th percentile is 3.600386 and the 95th percentile is 44.21105

Discuss.

The values of the of 5th percentile and 9th percentile of our exponential data (1 and 70, respectively) and our empirical data (3 and 44, respectively) differ substantially I would say. I believe this is due to the graphs and samples being so different in their spread. I tried a few different methods (qnorm with the assumption of of it being a normal spread) and noticed slightly different answers for the 5% and 95% of empirical data, so I kept the code with quantile method since in reality it is skewed. In these question I did not consider outliers and how that could result in ther difference. The other possible option is that the fitted exponential model might not align with the empirical data if the data doesnt fit exponential decay.

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

correlation_all <- cor(train[, 3:10])
corrplot(correlation_all, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

I know there is multicollinearity, so I should try to have the smallest number of variables in the lm function to try to account for this & remove some variables that have the highest correlation to one another (as long as its not related to age) - plan to remove some of the weights and keep the one which has the highest r value from the correlation matrix (shell) & remove either diameter or length since they are almost the same.
when should I use powertransform function / boxcox (when the variable looks like it is not linear) - might want to do that with age, dependent variable since it didnt have a high coefficient in comparison to the other variables

model1=lm(Age ~ Height + Diameter +  Shell.Weight ,data=train)

summary (model1)

## 
## Call:
## lm(formula = Age ~ Height + Diameter + Shell.Weight, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3556  -1.3767  -0.4910   0.7475  18.4115 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.065273   0.062356   81.23   <2e-16 ***
## Height        8.709115   0.258982   33.63   <2e-16 ***
## Diameter     -1.177516   0.111667  -10.54   <2e-16 ***
## Shell.Weight  0.457674   0.006705   68.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.357 on 74047 degrees of freedom
## Multiple R-squared:  0.4492, Adjusted R-squared:  0.4491 
## F-statistic: 2.013e+04 on 3 and 74047 DF,  p-value: < 2.2e-16

Based on the summary all variables seem to be significant. I initially tried doing backward selection but there were still too many variables in my opinion. The model above all has the same P value (yes it is statistically significant) but makes me think something is off. The F statistic also suggests a significant model. The r value indicates the model (or the accounts for 44.92% of the age variable.

plot (model1)

These visuals dont look the best, I’m going to try to transform age variable to see if that helps.

# Use boxcox function 
boxcox_result <- boxcox(train$Age ~ 1, data = train, plotit = TRUE)

lambda_age <- boxcox_result$x[which.max(boxcox_result$y)]

transformed_age <- if (lambda_age == 0) log(train$Age) else ((train$Age^lambda_age) - 1) / lambda_age

# Create a new dataset with boxcox applied to Age
train_transformed <- transform(train, Age = transformed_age)

# test new model with the transformed Age
model2 <- lm(Age ~ Height + Diameter + Shell.Weight, data = train_transformed)
summary(model2)

## 
## Call:
## lm(formula = Age ~ Height + Diameter + Shell.Weight, data = train_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6563 -0.2055 -0.0522  0.1449  1.8713 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.6839178  0.0085360  197.27   <2e-16 ***
## Height       1.4251326  0.0354521   40.20   <2e-16 ***
## Diameter     0.3414732  0.0152861   22.34   <2e-16 ***
## Shell.Weight 0.0393902  0.0009178   42.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3226 on 74047 degrees of freedom
## Multiple R-squared:  0.5311, Adjusted R-squared:  0.5311 
## F-statistic: 2.796e+04 on 3 and 74047 DF,  p-value: < 2.2e-16

The R value increased after completing the boxcox on age, almost 10% so I’m heading in the right track! 53% of variability in age is now explained by model. The std errors have decreased and the t values have increased (from model1 to 2). Additionally, the coefficient estimate of the diameter was negative in model 1 and not in model 2.

plot (model2)

The residuals of model 2, dont appear visually perfect, which makes me believe there are still improvements that could be made.The QQ residuals appear to follow the line for the most part so that good, but the residuals vs leverage, scale vs location and residual vs fitted (suggest heteroscedasticity) do not fit the pattern I would expect.

# I tried doing all the variables together but got some errors bc some height variable seem to be negative
boxcox_result_Diameter <- boxcox(train$Diameter ~ 1, data = train, plotit = FALSE)

lambda_Diameter <- boxcox_result_Diameter$x[which.max(boxcox_result_Diameter$y)]

transformed_Diameter <- if (lambda_Diameter == 0) log(train$Diameter) else ((train$Diameter^lambda_Diameter) - 1) / lambda_Diameter

# update dataset with the Box-Cox 
train_t <- transform(train, Diameter = transformed_Diameter)

# next variable 
boxcox_result_Shell.Weight <- boxcox(train$Shell.Weight ~ 1, data = train, plotit = FALSE)

lambda_Shell.Weight <- boxcox_result_Shell.Weight$x[which.max(boxcox_result_Shell.Weight$y)]

transformed_Shell.Weight <- if (lambda_Shell.Weight == 0) log(train$Shell.Weight) else ((train$Shell.Weight^lambda_Shell.Weight) - 1) / lambda_Shell.Weight

# update dataset with the Box-Cox 
train_final <- transform(train, Shell.Weight = transformed_Shell.Weight)

model3 <- lm(Age ~ Height + Diameter + Shell.Weight, data = train_final)
summary(model3) #i also ran the box cox variables above  separately but deleted it since the R values were still lower

## 
## Call:
## lm(formula = Age ~ Height + Diameter + Shell.Weight, data = train_final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2943  -1.3577  -0.4593   0.7575  18.3881 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.56911    0.07918   95.59   <2e-16 ***
## Height        6.33836    0.25964   24.41   <2e-16 ***
## Diameter     -4.37276    0.12696  -34.44   <2e-16 ***
## Shell.Weight  1.22834    0.01459   84.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.321 on 74047 degrees of freedom
## Multiple R-squared:  0.4656, Adjusted R-squared:  0.4656 
## F-statistic: 2.151e+04 on 3 and 74047 DF,  p-value: < 2.2e-16

plot (model3)

The R values went down in model 3, so Model 2 is the best fit. However, after doing another look and plotting model 3 the residuals vs fitted actually looks better for model 3 (the line is closer to being on zero across the graph) indicating more linearity than model 2. The QQplot for both follow the line, indicating normality. Scale location for both doesnt seem ideal which means there might not be constant variance in both models. Residuals vs leverage still look off for both models

Double check which model is ideal

anova(model3, model2)

## Analysis of Variance Table
## 
## Model 1: Age ~ Height + Diameter + Shell.Weight
## Model 2: Age ~ Height + Diameter + Shell.Weight
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1  74047 398933                      
## 2  74047   7706  0    391227

Based on the anova comparison model 2 has a smaller RSS which is desirable.

Make Predictions on Test Data: I tried to use RMSE to test it but the result I got was NaN which suggests something was wrong.

predicted_values <- predict(model2, newdata = test)
MVkaggle <- data.frame(Id = test$id, Age = predicted_values)

write.csv(MVkaggle, "MVkaggle.csv", row.names = FALSE)

Username: marjete Score: 7.19212