Problem 1

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

Probability: Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

Probabilities: a. P(X>x | X>y); b. P(X>x & Y>y); c. P(X<x | X>y)

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

Solution 1

# Set the seed
set.seed(1234)

# Generate random variable X with 10,000 continuous random uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)

# Generate random variable Y with 10,000 random normal values with mean 10 and standard deviation 2.89
Y <- rnorm(10000, mean = 10, sd = 2.89)

# Calculate median values for X and Y
x_median <- median(X)
y_median <- median(Y)

Part A
P(X > x | X > y)

# a. P(X > x | X > y)
prob_a <- mean(X > x_median & X > Y)

# Display the probabilities
prob_a
## [1] 0.3897

P(X > x & Y > y)

# b. P(X > x & Y > y)
prob_b <- mean(X > x_median & Y > y_median)

# Display the probabilities
prob_b
## [1] 0.2507

P(X < x | X > y)

# c. P(X < x | X > y)
prob_c <- mean(X < x_median & X > Y)

# Display the probabilities
prob_c
## [1] 0.1083

Part B
Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y)

# Calculate joint probabilities
joint_prob <- mean(X > x_median & Y > y_median)

# Calculate marginal probabilities
marginal_prob_X <- mean(X > x_median)
marginal_prob_Y <- mean(Y > y_median)

# Display the joint and product of marginal probabilities
joint_prob
## [1] 0.2507
marginal_prob_X * marginal_prob_Y
## [1] 0.25

We see that the joint probability is equal the product of marginal probabilities: P(X>x & Y>y)=P(X>x)P(Y>y)

Part C
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test.

# Create a contingency table
cont_table <- table(X > x_median, Y > y_median)

# Perform Fisher’s Exact Test
fisher_test <- fisher.test(cont_table)

# Perform Chi-Square Test
chi_square_test <- chisq.test(cont_table)

# Display the results of both tests
fisher_test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cont_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264
chi_square_test
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont_table
## X-squared = 0.0676, df = 1, p-value = 0.7949

If the results from the tests indicate a low p-value (typically less than 0.05), it suggests dependence between X and Y, while a higher p-value suggests independence. in this case, the p-value for both Fisher’s Test and Chi-Square are about the same (0.7949) which is greater than 0.05 indicating independence. Therefore, Independence exists.

Fisher’s Exact Test is used for contingency tables with small sample sizes, providing exact probabilities while Chi-Square Test is more suitable for larger samples, relying on approximations. I am not surprised by the results considering the fact that the data is fairly large (10,000).


Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Solution 2

Load Libraries

library(Amelia)
library(caret)
library(caTools)
library(cowplot)
library(e1071)
library(fastDummies)
library(grid)
library(kableExtra)
library(MASS)
library(matrixcalc)
library(Metrics)
library(nnet)
library(OpenImageR)
library(reshape2)
library(tidyverse)

Read the train dataset

url_crab_age = "https://raw.githubusercontent.com/chinedu2301/data605-computer-maths/main/final_project/data/crab_age_dataset.csv"
crab_age_train_raw = read_csv(url_crab_age)
crab_age_train_raw = crab_age_train_raw %>% rename(
    Shucked_Weight = `Shucked Weight`,
    Viscera_Weight = `Viscera Weight`,
    Shell_Weight = `Shell Weight`
    )

Display the data

# display a few records of the raw data
crab_age_raw_data_few_records <- kable(head(crab_age_train_raw, 50), "html") %>%
                                 kable_paper("hover", full_width = F) %>%
                                 scroll_box(width = "850px", height = "350px")
crab_age_raw_data_few_records
id Sex Length Diameter Height Weight Shucked_Weight Viscera_Weight Shell_Weight Age
0 I 1.5250 1.1750 0.3750 28.973189 12.7289255 6.6479577 8.3489277 9
1 I 1.1000 0.8250 0.2750 10.418441 4.5217453 2.3246590 3.4019400 8
2 M 1.3875 1.1125 0.3750 24.777463 11.3398000 5.5565020 6.6621325 9
3 F 1.7000 1.4125 0.5000 50.660556 20.3549410 10.9918385 14.9968855 11
4 I 1.2500 1.0125 0.3375 23.289114 11.9776637 4.5075705 5.9533950 8
5 M 1.5000 1.1750 0.4125 28.845616 13.4093135 6.7897052 7.9378600 10
6 M 1.5750 1.1375 0.3500 30.022120 11.9351395 7.3425205 8.6465975 11
7 I 1.3125 1.0250 0.3500 18.299602 8.2497045 3.8980562 5.6699000 11
8 F 1.6000 1.2875 0.4375 38.824640 16.9671758 7.4133942 10.7728100 12
9 M 1.0250 0.7625 0.2625 10.305043 4.4933958 2.1262125 2.9766975 11
10 I 0.9250 0.7250 0.2125 6.109317 2.6932025 1.7293195 2.1262125 7
11 F 1.2750 0.9750 0.3750 19.575330 7.3425205 4.2807745 6.2368900 10
12 M 1.3500 1.0250 0.3500 20.964455 8.5048500 4.6351432 6.9456275 10
13 I 1.1000 0.8625 0.2875 11.439023 4.5075705 2.3388338 3.4019400 7
14 M 1.3125 0.9875 0.3125 17.449117 8.5473743 3.9831047 5.2446575 9
15 I 0.9375 0.7500 0.2000 7.243297 3.2318430 1.6300962 2.1262125 7
16 I 1.0125 0.7750 0.3000 9.454558 3.7846583 2.6081540 2.8349500 8
17 F 1.6375 1.2500 0.4250 39.179009 16.2584382 8.7174712 10.4893150 12
18 I 1.1375 0.9125 0.2875 13.366789 6.8605790 2.6648530 3.9689300 7
19 I 0.6750 0.4875 0.1000 2.083688 0.8788345 0.6378637 0.7370870 5
20 M 1.5500 1.2000 0.4250 32.772022 14.5432935 6.2368900 9.0860148 9
21 F 1.0875 0.8750 0.3125 15.379604 5.3438808 3.2460178 4.9611625 14
22 F 1.7375 1.3375 0.4125 47.513762 21.1770765 11.4957222 13.8912550 10
23 M 1.4375 1.1000 0.3125 24.805812 12.0343627 5.5848515 6.3786375 9
24 I 0.9250 0.7125 0.2375 6.988152 2.6365035 1.4883488 2.1262125 7
25 M 1.6125 1.2500 0.4375 39.774349 17.0097000 8.8308692 9.7805775 11
26 F 1.4250 1.0750 0.3875 23.473386 11.6232950 5.8966960 4.2665998 10
27 I 1.2000 0.9375 0.3125 15.762322 6.7613558 2.3671832 4.7768907 8
28 F 1.3250 1.0750 0.4125 20.326591 8.1930055 4.9469878 5.5281525 9
29 I 0.6250 0.5000 0.1500 2.608154 0.6945628 0.5102910 0.5102910 6
30 M 1.2750 0.9500 0.3375 16.910477 7.3425205 3.9405805 4.9611625 9
31 M 1.5500 1.2000 0.4125 34.699788 17.1089232 7.6401902 8.5048500 11
32 F 1.7625 1.3875 0.5125 46.677452 23.0906678 6.0809678 11.2972758 13
33 I 1.1875 0.9000 0.3125 12.813974 5.4147545 2.8066005 3.9689300 9
34 F 1.5875 1.2375 0.4250 34.614739 14.7842642 8.6891218 8.2213550 13
35 F 1.4250 1.1500 0.3875 26.308336 11.7933920 5.5281525 7.7961125 10
36 F 1.7250 1.3125 0.5000 47.513762 19.6462035 11.5665960 13.3242650 16
37 F 1.2500 0.9750 0.3250 18.016107 7.4700932 4.4366968 5.2446575 8
38 I 0.5125 0.3750 0.1375 1.190679 0.4819415 0.1559222 0.1842718 4
39 M 1.1250 0.8500 0.3250 12.941547 5.9533950 2.8774743 3.4019400 9
40 M 1.5375 1.3125 0.4125 37.988330 16.4427100 9.7805775 10.4893150 11
41 M 1.4375 1.1250 0.3750 24.933385 9.8372765 5.9392202 7.7961125 18
42 M 1.5250 1.2125 0.3875 34.841535 15.4504775 7.4559185 8.2213550 10
43 F 1.6125 1.2875 0.4125 35.621147 11.4531980 6.4353365 11.0563050 15
44 M 1.6750 1.2875 0.4750 39.745999 17.9027092 8.3347530 11.0563050 10
45 I 1.0375 0.7750 0.2250 10.957082 4.2807745 1.7718437 2.6932025 7
46 I 1.1250 0.8375 0.2500 14.004653 6.0667930 2.9625228 3.8271825 7
47 F 1.8875 1.4125 0.5125 57.606184 21.2621250 12.4879548 16.4427100 17
48 M 1.4375 1.1000 0.3625 24.720764 10.0640725 5.4005798 7.1157245 9
49 I 1.1750 0.9250 0.3125 15.223681 7.9236852 3.5011632 4.9611625 8

Descriptive and Inferential Statistics.

Univariate Descriptive Statistics

summary(crab_age_train_raw)
##        id            Sex                Length          Diameter     
##  Min.   :    0   Length:74051       Min.   :0.1875   Min.   :0.1375  
##  1st Qu.:18513   Class :character   1st Qu.:1.1500   1st Qu.:0.8875  
##  Median :37025   Mode  :character   Median :1.3750   Median :1.0750  
##  Mean   :37025                      Mean   :1.3175   Mean   :1.0245  
##  3rd Qu.:55538                      3rd Qu.:1.5375   3rd Qu.:1.2000  
##  Max.   :74050                      Max.   :2.0128   Max.   :1.6125  
##      Height           Weight        Shucked_Weight     Viscera_Weight    
##  Min.   :0.0000   Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252  
##  1st Qu.:0.3000   1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330  
##  Median :0.3625   Median :23.7994   Median : 9.90815   Median : 4.98951  
##  Mean   :0.3481   Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839  
##  3rd Qu.:0.4125   3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815  
##  Max.   :2.8250   Max.   :80.1015   Max.   :42.18406   Max.   :21.54562  
##   Shell_Weight           Age        
##  Min.   : 0.04252   Min.   : 1.000  
##  1st Qu.: 3.96893   1st Qu.: 8.000  
##  Median : 6.93145   Median :10.000  
##  Mean   : 6.72387   Mean   : 9.968  
##  3rd Qu.: 9.07184   3rd Qu.:11.000  
##  Max.   :28.49125   Max.   :29.000

Scatter Plots

# Age vs Sex
scatter_age_sex = crab_age_train_raw %>% ggplot(aes(x = Sex, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Sex", x = "Sex", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Length
scatter_age_length = crab_age_train_raw %>% ggplot(aes(x = Length, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Length", x = "Length", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))  

# Age vs Diameter
scatter_age_diameter = crab_age_train_raw %>% ggplot(aes(x = Diameter, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Diameter", x = "Diameter", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite")) 

# Age vs Height
scatter_age_height = crab_age_train_raw %>% ggplot(aes(x = Height, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Height", x = "Height", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Weight
scatter_age_weight = crab_age_train_raw %>% ggplot(aes(x = Weight, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Weight", x = "Weight", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Shucked_Weight
scatter_age_shucked_weight = crab_age_train_raw %>% ggplot(aes(x = Shucked_Weight, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Shucked_Weight", x = "Shucked_Weight", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Shucked_Weight
scatter_age_viscera_weight = crab_age_train_raw %>% ggplot(aes(x = Viscera_Weight, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Viscera_Weight", x = "Viscera_Weight", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Shucked_Weight
scatter_age_shell_weight = crab_age_train_raw %>% ggplot(aes(x = Shell_Weight, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Shell_Weight", x = "Shell_Weight", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))


# Plot grid of all plots
crab_age_scatter_plot1 <- plot_grid(scatter_age_sex, scatter_age_length, scatter_age_diameter, scatter_age_height,
                                   byrow = TRUE, nrow = 2) 

crab_age_scatter_plot2 <- plot_grid(scatter_age_weight, scatter_age_shucked_weight, scatter_age_viscera_weight, scatter_age_shell_weight,
                                   byrow = TRUE, nrow = 2) 
#display all plots
crab_age_scatter_plot1

crab_age_scatter_plot2 # For Age vs Weights

Correlation Matrix
Derive a correlation matrix for any three quantitative variables in the dataset.

#using the Diameter, Length, and Weight
correlation_data = crab_age_train_raw[, c("Diameter", "Length", "Weight")]
correlation_matrix = round(cor(correlation_data), 2)

# display the correlation matrix
correlation_matrix
##          Diameter Length Weight
## Diameter     1.00   0.99   0.94
## Length       0.99   1.00   0.94
## Weight       0.94   0.94   1.00

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

# Find the pariwise correlation between Diameter and Length using a confidence interval of 80%
pairwise_corr_dia_len = cor.test(correlation_data$Diameter, correlation_data$Length, conf.level = 0.8)
pairwise_corr_dia_len
## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$Diameter and correlation_data$Length
## t = 1857.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9893379 0.9895359
## sample estimates:
##       cor 
## 0.9894374
# Find the pariwise correlation between Diameter and Weight using a confidence interval of 80%
pairwise_corr_dia_weight = cor.test(correlation_data$Diameter, correlation_data$Weight, conf.level = 0.8)
pairwise_corr_dia_weight
## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$Diameter and correlation_data$Weight
## t = 737.99, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9376824 0.9388098
## sample estimates:
##       cor 
## 0.9382486
# Find the pariwise correlation between Length and Weight using a confidence interval of 80%
pairwise_corr_len_weight = cor.test(correlation_data$Length, correlation_data$Weight, conf.level = 0.8)
pairwise_corr_len_weight
## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$Length and correlation_data$Weight
## t = 725.93, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9357910 0.9369515
## sample estimates:
##       cor 
## 0.9363738

From the correlation test results for all three cases above, we see that the p-value is very small (almost zero). Hence, we reject the Null Hypothesis that the true correlation is 0. Hence, the true correlation is NOT zero(0) for any of the three pairs of variables - Diameter, Length, and Weight.

Familywise Error
The family-wise error rate (FWER) is the probability of at least 1 false positive when multiple comparisons are being tested. Basically, FWER is the probability of a coming to at least one false conclusion in a series of hypothesis tests. In other words, it’s the probability of making at least one Type I Error.
Using the Bonferroni correction which adjusts the level of significance for a family of tests as adjusted \(\alpha\) = \(\frac{\alpha}{n}\) where n is the number of tests.
In this case, with 3 correlation tests:

Alpha level for each individual test = 1 - 80% = 0.20 Number of tests = 3

Adjusted alpha level for multiple tests = 0.20 / 3 ≈ 0.067
Therefore, using the Bonferroni correction, the adjusted significance level for each individual correlation test to control the familywise error rate at an overall 80% confidence level is approximately 0.067 or 6.7%. Based on this result, I will not be worried about familywise error.

Linear Algebra and Correlation.

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Invert Correlation Matrix
Using the solve function to invert the correlation matrix.

precision_matrix <- solve(correlation_matrix)
precision_matrix
##            Diameter     Length    Weight
## Diameter  52.244165 -47.755835 -4.219031
## Length   -47.755835  52.244165 -4.219031
## Weight    -4.219031  -4.219031  8.931777

Multiply correlation_matrix by the precision_matrix

# multiply correlation matrix by precision matrix
mul_corr_matrix_precision = round(correlation_matrix %*% precision_matrix, 2)
mul_corr_matrix_precision
##          Diameter Length Weight
## Diameter        1      0      0
## Length          0      1      0
## Weight          0      0      1

This shows that a matrix multiplied by its inverse gives us the identity matrix.

Multiply the precision_matrix by the correlation_matrix

# multiply precision matrix by correlation matrix
mul_precision_by_corr_matrix = round(precision_matrix %*% correlation_matrix, 2)
mul_precision_by_corr_matrix
##          Diameter Length Weight
## Diameter        1      0      0
## Length          0      1      0
## Weight          0      0      1

This also shows that the inverse of a matrix multiplied by the matrix is still the identity matrix.

Essentially, this proves the theorem that if A is a square matrix and \(A^{-1}\) exists, then \(AA^{-1}=A^{-1}A = I\) where \(I\) is the Identity matrix. This holds only for square matrices and their inverses.

Conduct LU decomposition on the matrix

lu_decomposition = lu.decomposition(correlation_matrix)
lu_decomposition
## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.99 1.0000000    0
## [3,] 0.94 0.4723618    1
## 
## $U
##      [,1]   [,2]      [,3]
## [1,]    1 0.9900 0.9400000
## [2,]    0 0.0199 0.0094000
## [3,]    0 0.0000 0.1119598

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Find Skewness
Using the skewness function in e1071 package, we compute the skewness of the variables.
Note:
Positive skewness (>0): Indicates right-skewed distribution (tail is longer on the right).
Negative skewness (<0): Indicates left-skewed distribution (tail is longer on the left).
Skewness around 0: Indicates approximately symmetric distribution.

# Get skewness for each column
for (col in colnames(crab_age_train_raw)) {
  if (!(col %in% c("Sex", "id"))){
    skew_val <- skewness(crab_age_train_raw[[col]]) 
    cat(sprintf("The skewness for the %s variable is: %.2f\n", col, round(skew_val, 2)))
  }
}
## The skewness for the Length variable is: -0.84
## The skewness for the Diameter variable is: -0.81
## The skewness for the Height variable is: 0.09
## The skewness for the Weight variable is: 0.23
## The skewness for the Shucked_Weight variable is: 0.35
## The skewness for the Viscera_Weight variable is: 0.29
## The skewness for the Shell_Weight variable is: 0.28
## The skewness for the Age variable is: 1.09

Therefore, we select the Weight variable since it is one of the right-skewed variables.
Check the minimum value of Weight

# check the minimum value:
min(crab_age_train_raw$Weight)
## [1] 0.056699

Since the minimum value of Weight is not negative, no need for adjustment.

Histogram to check distribution of Weight

#hist(crab_age_train_raw$Weight, breaks = 20, xlab = 'Weight', main = "Histogram of Weight")
hist_weight = crab_age_train_raw %>% ggplot(aes(x = Weight)) + 
              geom_histogram(binwidth = 0.5, color = "black", fill = "skyblue", alpha = 0.7) +
                               labs(title = "Histogram for Weight", x = "Weight", y = "Frequency") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
hist_weight

Density Plot to visualize the distribution of Weight

density_weight = crab_age_train_raw %>% ggplot(aes(x = Weight)) + geom_density() +
                               labs(title = "Density Plot for Weight", x = "Weight", y = "Density") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
density_weight

From the Histogram and Density plot, we can see that the Weight variable is skewed to the right.

Load the MASS package and run fitdistr to fit an exponential probability density function

prob_density_function <- fitdistr(crab_age_train_raw$Weight, densfun = 'exponential')
lambda <- prob_density_function$estimate
exponential_pdf <- rexp(1000, lambda)

Optimal Value of λ:
The optimal value of lambda = 1/λ.

optimal_value <- round((1/lambda), 4)
optimal_value
##    rate 
## 23.3852

Plot a histogram and compare it with a histogram of your original variable
Using ggplot2

# histogram for exponential distribution
exp_pdf = exponential_pdf %>% tibble()
colnames(exp_pdf) = "Weight"
hist_exponential_pdf = exp_pdf %>% ggplot(aes(x = Weight)) + 
                       geom_histogram(binwidth = 1, color = "black", fill = "skyblue", alpha = 0.7) +
                               labs(title = "Histogram for Weight-Exponential PDF", x = "Weight", y = "Frequency") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
plot_grid(hist_weight, NULL, hist_exponential_pdf, nrow = 1, rel_widths = c(1, 0.008, 1))

Using Base R

par(mfrow=c(1,2))
hist(crab_age_train_raw$Weight, breaks = 20, col="violet", main = "Histogram - Original", xlab = 'Weight')
hist(exponential_pdf, breaks = 20, col="royalblue", main = "Histogram - Exponential PDF", xlab = 'Weight')

The 5th and 95th percentiles using the CDF are:

cdf_5th_95th_percentiles = round(quantile(exponential_pdf, c(0.05, 0.95)), 4)
cdf_5th_95th_percentiles
##      5%     95% 
##  1.2535 67.5297

95% confidence interval from empirical data

confidence_interval = t.test(crab_age_train_raw$Weight)
confidence_interval
## 
##  One Sample t-test
## 
## data:  crab_age_train_raw$Weight
## t = 503.13, df = 74050, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  23.29412 23.47632
## sample estimates:
## mean of x 
##  23.38522

The 5th and 95th percentiles using the empirical data are:

empirical_5th_95th_percentiles = round(quantile(crab_age_train_raw$Weight, c(0.05, 0.95)), 4)
empirical_5th_95th_percentiles
##      5%     95% 
##  3.6004 44.2110

Comparing the 5th and 95th percentiles of the exponential pdf and the empirical data as well as their histograms, we can clearly see the exponential pdf is much more skewed to the right than the original/empirical data. Also, the original empirical data appears to be better and the exponential data is not a good approximation of the original data in this case.

Modeling.

Conduct Exploratory Data Analysis Prior to Modeling

glimpse of the data

glimpse(crab_age_train_raw)
## Rows: 74,051
## Columns: 10
## $ id             <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Sex            <chr> "I", "I", "M", "F", "I", "M", "M", "I", "F", "M", "I", …
## $ Length         <dbl> 1.5250, 1.1000, 1.3875, 1.7000, 1.2500, 1.5000, 1.5750,…
## $ Diameter       <dbl> 1.1750, 0.8250, 1.1125, 1.4125, 1.0125, 1.1750, 1.1375,…
## $ Height         <dbl> 0.3750, 0.2750, 0.3750, 0.5000, 0.3375, 0.4125, 0.3500,…
## $ Weight         <dbl> 28.973189, 10.418441, 24.777463, 50.660556, 23.289114, …
## $ Shucked_Weight <dbl> 12.7289255, 4.5217453, 11.3398000, 20.3549410, 11.97766…
## $ Viscera_Weight <dbl> 6.6479577, 2.3246590, 5.5565020, 10.9918385, 4.5075705,…
## $ Shell_Weight   <dbl> 8.348928, 3.401940, 6.662133, 14.996885, 5.953395, 7.93…
## $ Age            <dbl> 9, 8, 9, 11, 8, 10, 11, 11, 12, 11, 7, 10, 10, 7, 9, 7,…

There are about 74,051 observations and 10 variables (9 predictor variables and 1 response variable). The response variable is the age and we aim to predict the age of the crab using the other features given. Only Sex is the categorical variable in the data.

Check for missing Values

# use missmap function from the Amelia package to check for NA values
missmap(crab_age_train_raw,
        plot.background = element_rect(fill = "antiquewhite"),
        main = "Crab Age Dataset - Missing Values", 
        x.cex = 0.45,
        y.cex = 0.6,
        margins = c(7.1, 7.1),
        col = c("yellow", "black"), legend = FALSE)

As can be seen from the missing values plot, there are no missing values.

Histogram for Age to check the Age distribution

hist_age = crab_age_train_raw %>% ggplot(aes(x = Age)) + 
                                  geom_histogram(binwidth = 1, color = "black", fill = "skyblue", alpha = 0.7) +
                                  labs(title = "Histogram for Age", x = "Age", y = "Frequency") +
                                  theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                           panel.background = element_rect(fill = "gray80"),
                                                           plot.background = element_rect(fill = "antiquewhite"))
hist_age

From the Age distribution, we can see that ages of crabs given are between 3 and 20.

Bar Plot for Sex

# group the data by sex and get count/percentage
crab_sex_grouped <- crab_age_train_raw %>% group_by(Sex) %>% 
                        summarise(count = n()) %>%
                        mutate(percentage = round((count / sum(count) * 100), 2))

# bar plot using ggplot2
bar_plot_crab_sex <- crab_sex_grouped %>% ggplot(aes(x=factor(Sex), y = percentage, fill = factor(Sex))) +
                               geom_bar(stat = "identity", position = "dodge") +
                               labs(title = "Bar Chart - Crab Sex", x = "Crab Sex", y = "Percentage", 
                                    fill = "Crab Sex") +
                               scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentages
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))                          
bar_plot_crab_sex

Each of the crab sex constitute a little above 30% of the data with the “M” category being a little more than 35% and representing the highest category by a small margin.

Data Pre-Processing
Develop a function to pre-process the data and make it ready for training. This function can be easily used to pre-process any new test data to avoid code duplications and errors.

# data pre-processing
data_preprocess_scaling <- function(df){
                      # This function standardizes the numeric variables of the df using the standard normal method
                      df <- as.data.frame(df)
                      if ("Age" %in% colnames(df)){
                          df_char <- df %>% select(Age, Sex) 
                          df_numeric <- df %>% select(-Sex, -Age) # Exclude Age from scaling as well.
                      } else {
                          df_char <- df %>% select(Sex) 
                          df_numeric <- df %>% select(-Sex) # Exclude Age from scaling as well.                        
                      }
                      # scale the other numeric values using standard scaler
                      df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
                      # combine the categorical and numeric features and drop the id column
                      df_scaled_combined <- cbind(df_char, df_numeric_scaled) %>% select(-id)
                      return(df_scaled_combined)
}

Pre-process the train data

# pre-process the data using the function above
crab_data_pre_processed = data_preprocess_scaling(crab_age_train_raw)

# display some records for the standardized data
kable(head(crab_data_pre_processed, 50), "html") %>%
                                 kable_paper("hover", full_width = F) %>%
                                 scroll_box(width = "850px", height = "350px")
Age Sex Length Diameter Height Weight Shucked_Weight Viscera_Weight Shell_Weight
9 I 0.7212332 0.6339778 0.2923977 0.4418014 0.4671847 0.5691823 0.4533730
8 I -0.7557067 -0.8403503 -0.7941577 -1.0251912 -0.9936809 -0.9788731 -0.9267816
9 M 0.2433997 0.3707049 0.2923977 0.1100750 0.2199225 0.1783618 -0.0172241
11 F 1.3293850 1.6344147 1.6505920 2.1564682 1.8246040 2.1246076 2.3080794
8 I -0.2344338 -0.0505317 -0.1150606 -0.0075982 0.3334613 -0.1972320 -0.2149540
10 M 0.6343544 0.6339778 0.6998560 0.4317152 0.5882928 0.6199382 0.3386897
11 M 0.8949909 0.4760140 0.0207589 0.5247330 0.3258920 0.8178863 0.5364196
11 I -0.0172367 0.0021229 0.0207589 -0.4020836 -0.3301098 -0.4154824 -0.2940460
12 F 0.9818697 1.1078689 0.9714949 1.2206860 1.2215869 0.8432642 1.1296093
11 M -1.0163432 -1.1036232 -0.9299772 -1.0341568 -0.9987271 -1.0499314 -1.0454195
7 I -1.3638585 -1.2615869 -1.4732549 -1.3658832 -1.3191587 -1.1920479 -1.2826954
10 F -0.1475550 -0.2084954 0.2923977 -0.3012208 -0.4915872 -0.2784415 -0.1358620
10 M 0.1130815 0.0021229 0.0207589 -0.1913925 -0.2846943 -0.1515517 0.0618678
7 I -0.7557067 -0.6823866 -0.6583383 -0.9445010 -0.9962040 -0.9737975 -0.9267816
9 M -0.0172367 -0.1558409 -0.3866994 -0.4693254 -0.2771250 -0.3850289 -0.4126839
7 I -1.3204191 -1.1562778 -1.6090743 -1.2762274 -1.2232815 -1.2275771 -1.2826954
8 I -1.0597826 -1.0509686 -0.5225189 -1.1013987 -1.1248813 -0.8773613 -1.0849655
12 F 1.1121879 0.9499052 0.8356754 1.2487034 1.0954327 1.3102187 1.0505173
7 I -0.6253885 -0.4717683 -0.6583383 -0.7920862 -0.5773720 -0.8570589 -0.7685977
5 I -2.2326467 -2.2620238 -2.6956297 -1.6841612 -1.6421135 -1.5828685 -1.6702460
9 M 0.8081121 0.7392869 0.8356754 0.7421483 0.7901395 0.4219901 0.6590121
14 F -0.7991461 -0.6297320 -0.3866994 -0.6329472 -0.8473420 -0.6489597 -0.4917758
10 F 1.4597032 1.3184872 0.6998560 1.9076734 1.9709428 2.3050346 1.9996208
9 M 0.4171574 0.3180503 -0.3866994 0.1123164 0.3435536 0.1885129 -0.0963161
7 I -1.3638585 -1.3142415 -1.2016160 -1.2964000 -1.3292511 -1.2783330 -1.2826954
11 M 1.0253091 0.9499052 0.9714949 1.2957727 1.2291561 1.3508234 0.8527874
10 F 0.3737180 0.2127412 0.4282171 0.0069709 0.2703842 0.3001760 -0.6855511
8 I -0.4081914 -0.3664592 -0.3866994 -0.6026884 -0.5950336 -0.9636463 -0.5431856
9 F 0.0262027 0.2127412 0.6998560 -0.2418239 -0.3402021 -0.0398887 -0.3335919
6 I -2.4064043 -2.2093693 -2.1523520 -1.6426954 -1.6749136 -1.6285488 -1.7335196
9 M -0.1475550 -0.3138046 -0.1150606 -0.5119119 -0.4915872 -0.4002557 -0.4917758
11 M 0.8081121 0.7392869 0.6998560 0.8945631 1.2468177 0.9244737 0.4968736
13 F 1.5465820 1.5291055 1.7864114 1.8415523 2.3115592 0.3661586 1.2759294
9 I -0.4516309 -0.5244229 -0.3866994 -0.8357934 -0.8347266 -0.8063030 -0.7685977
13 F 0.9384303 0.8972506 0.8356754 0.8878389 0.8330319 1.3000675 0.4177816
10 F 0.3737180 0.5286686 0.4282171 0.2311103 0.3006612 0.1682106 0.2991437
16 F 1.4162638 1.2131781 1.6505920 1.9076734 1.6984498 2.3304125 1.8414369
8 F -0.2344338 -0.2084954 -0.2508800 -0.4244975 -0.4688794 -0.2226100 -0.4126839
4 I -2.7973590 -2.7359150 -2.2881714 -1.7547651 -1.7127598 -1.7554386 -1.8244753
9 M -0.6688279 -0.7350412 -0.2508800 -0.8257071 -0.7388494 -0.7809250 -0.9267816
11 M 0.7646727 1.2131781 0.6998560 1.1545649 1.1282328 1.6908880 1.0505173
18 M 0.4171574 0.4233595 0.2923977 0.1224027 -0.0475244 0.3154027 0.2991437
10 M 0.7212332 0.7919415 0.4282171 0.9057701 0.9516169 0.8584910 0.4177816
15 F 1.0253091 1.1078689 0.6998560 0.9674084 0.2401072 0.4930484 1.2087012
10 M 1.2425062 1.1078689 1.3789531 1.2935313 1.3881104 1.1731777 1.2087012
7 I -0.9729038 -1.0509686 -1.3374354 -0.9826048 -1.0365733 -1.1768212 -1.1245115
7 I -0.6688279 -0.7876958 -1.0657966 -0.7416548 -0.7186647 -0.7504715 -0.8081437
17 F 1.9809762 1.6344147 1.7864114 2.7056098 1.9860813 2.6603260 2.7114484
9 M 0.4171574 0.3180503 0.1565783 0.1055923 -0.0071550 0.1225303 0.1093230
8 I -0.4950703 -0.4191137 -0.3866994 -0.6452749 -0.3881407 -0.5575990 -0.4917758

Train Test Split

Use the CaTools library to split the cleaned dataset into training and testing datasets in 70:30 ratio.

# Set a seed
set.seed(1994)
#Split the sample
sampling <- sample.split(crab_data_pre_processed$Age, SplitRatio = 0.7) 
# Training Data
df_train_subset <- subset(crab_data_pre_processed, sampling == TRUE)
# Testing Data
df_test_subset <- subset(crab_data_pre_processed, sampling == FALSE)

Train the model

multiple_regression_model = lm(Age ~ ., data = df_train_subset)
# display the model
summary(multiple_regression_model)
## 
## Call:
## lm(formula = Age ~ ., data = df_train_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4992  -1.2221  -0.3402   0.7562  18.2745 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.34098    0.01807 572.244  < 2e-16 ***
## SexI           -1.02614    0.03016 -34.019  < 2e-16 ***
## SexM           -0.11444    0.02287  -5.004 5.63e-07 ***
## Length          0.25076    0.06591   3.805 0.000142 ***
## Diameter        0.54040    0.06754   8.001 1.26e-15 ***
## Height          0.60991    0.02513  24.267  < 2e-16 ***
## Weight          2.36161    0.08195  28.817  < 2e-16 ***
## Shucked_Weight -3.44562    0.04452 -77.391  < 2e-16 ***
## Viscera_Weight -0.57802    0.03964 -14.584  < 2e-16 ***
## Shell_Weight    1.93384    0.04217  45.862  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.127 on 51826 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5513 
## F-statistic:  7076 on 9 and 51826 DF,  p-value: < 2.2e-16

Predictions

predictions_test = round(predict(multiple_regression_model, df_test_subset), 0) # round predictions to nearest age

Evaluate the Model

mean_absolute_error = round(Metrics::mae(df_test_subset$Age, predictions_test), 2)
mean_sq_error = round(Metrics::mse(df_test_subset$Age, predictions_test), 2)
root_mean_sq_error = round(Metrics::rmse(df_test_subset$Age, predictions_test), 2)

model_eval_metrics = c(mean_absolute_error, mean_sq_error, root_mean_sq_error) %>% t()
column_names = c("Mean Absolute Error", "Mean Square Error", "Root Mean Square Error")
evaluation_metrics = data.frame(values = model_eval_metrics)
colnames(evaluation_metrics) = column_names

# combine the results and display
kable(evaluation_metrics, "html") %>%
                        kable_paper("hover", full_width = F) %>%
                        scroll_box(width = "450px", height = "100px")
Mean Absolute Error Mean Square Error Root Mean Square Error
1.46 4.63 2.15

From the MAE, and RMSE, we can see that the model is able to predict the Age of Crabs within 2 units of the Crab Age.

Using the test data provided to make predictions
Read test data from remote location

url_crab_age_test = "https://raw.githubusercontent.com/chinedu2301/data605-computer-maths/main/final_project/data/crab_age_test_data.csv"
crab_age_test_raw = read_csv(url_crab_age_test)
crab_age_test_raw = crab_age_test_raw %>% rename(
    Shucked_Weight = `Shucked Weight`,
    Viscera_Weight = `Viscera Weight`,
    Shell_Weight = `Shell Weight`
    )

Display a few records of the test data:

kable(head(crab_age_test_raw, 100), "html") %>%
                                 kable_paper("hover", full_width = F) %>%
                                 scroll_box(width = "850px", height = "350px")
id Sex Length Diameter Height Weight Shucked_Weight Viscera_Weight Shell_Weight
74051 I 1.0500 0.7625 0.2750 8.6182480 3.6570855 1.7293195 2.7215520
74052 I 1.1625 0.8875 0.2750 15.5071765 7.0306760 3.2460178 3.9689300
74053 F 1.2875 0.9875 0.3250 14.5716430 5.5565020 3.8838815 4.8194150
74054 F 1.5500 0.9875 0.3875 28.3778495 13.3809640 6.5487345 7.0306760
74055 I 1.1125 0.8500 0.2625 11.7650425 5.5281525 2.4664065 3.3310663
74056 M 1.4250 1.1125 0.3500 24.8341620 8.7316460 5.7124242 8.0796075
74057 M 1.7125 1.3250 0.4500 46.6774518 21.2337755 11.9634890 11.3681495
74058 I 1.1750 0.8875 0.3000 13.9337793 5.6982495 2.9908722 3.9405805
74059 F 1.3125 1.0625 0.2875 23.5867840 8.1646560 5.6273757 8.4623258
74060 F 1.4250 1.0500 0.3125 17.7893112 7.1015498 4.6067937 5.5281525
74061 M 1.5125 1.2000 0.4750 32.2333815 12.9415468 6.2227153 9.8372765
74062 F 1.5625 1.2000 0.4125 36.5141560 17.2931950 8.6040733 8.9300925
74063 I 0.4375 0.3250 0.1250 0.8221355 0.3543688 0.1842718 0.2834950
74064 F 1.6250 1.2875 0.4125 39.1931838 16.6836808 7.4700932 11.9067900
74065 I 0.6750 0.5000 0.1500 2.7215520 1.1906790 0.8079608 0.8504850
74066 I 0.8375 0.6250 0.1750 4.8194150 1.7718437 0.7654365 1.2473780
74067 M 1.6750 1.3000 0.4000 41.6312408 18.8524175 9.6955290 11.0563050
74068 M 1.4625 1.1750 0.3250 28.2644515 11.4248485 6.0384435 9.4970825
74069 F 1.6750 1.3000 0.3625 40.8374547 17.2931950 9.1143642 9.0860148
74070 M 1.5875 1.2625 0.4125 35.5077487 17.0805738 8.2922287 9.4970825
74071 M 1.6000 1.2500 0.4750 31.1277510 11.1271788 6.6763072 12.8423235
74072 F 1.2625 0.9375 0.3000 16.8396030 7.5267922 3.0192217 4.8194150
74073 I 1.1875 0.9000 0.3125 13.8062065 5.3297060 3.6003865 4.3941725
74074 I 1.4375 1.1125 0.3625 25.6988218 12.2328093 5.9250455 6.7755305
74075 F 1.5875 1.2500 0.4750 33.8493030 13.9479540 5.8966960 9.4970825
74076 F 1.2250 0.9625 0.3000 17.7893112 8.3489277 3.8271825 5.6699000
74077 F 1.8500 1.4125 0.5125 57.6061840 27.6691120 9.6104805 18.7106700
74078 F 1.5875 1.2125 0.4500 36.5708550 15.5213513 8.6324228 9.9790240
74079 M 1.2500 1.0000 0.3000 16.0741665 7.1015498 4.4366968 4.9611625
74080 F 1.8375 1.4875 0.4500 53.2970600 25.7838702 11.7650425 13.8912550
74081 F 1.6250 1.3000 0.3875 38.7821160 17.5341658 8.1646560 10.3475675
74082 M 1.7625 1.4000 0.3875 47.4854125 22.5945515 11.6091202 10.9996060
74083 M 1.6250 1.2375 0.4000 39.4199798 19.6603782 8.3772773 9.9223250
74084 M 1.6875 1.3375 0.5125 39.7743485 17.0805738 8.2922287 11.9067900
74085 F 1.5375 1.1500 0.4125 29.9512468 13.6502843 5.4431040 9.6388300
74086 I 1.0375 0.7875 0.2375 8.1221317 4.2665998 2.2537852 2.5514550
74087 M 1.5750 1.3125 0.3625 27.5415392 14.1322258 5.5706767 7.0165012
74088 M 1.4250 1.1625 0.4000 26.7052290 11.9209647 5.1596090 8.7741702
74089 I 1.2500 0.9375 0.3250 15.4504775 6.0242688 3.8838815 4.7060170
74090 M 1.4250 1.1125 0.3750 23.5159102 10.9854312 5.5281525 7.1015498
74091 M 0.8500 0.6375 0.2125 5.1029100 1.9844650 1.1906790 1.7009700
74092 M 1.5375 1.1750 0.3750 31.1561005 15.2803805 6.9598022 7.9378600
74093 F 1.5125 1.1875 0.4375 29.2425093 13.0549448 6.5062102 8.0796075
74094 M 1.5000 1.1875 0.4125 29.2708587 13.7920317 6.1943658 7.7961125
74095 M 1.6000 1.2625 0.5000 39.7034747 12.4737800 8.0796075 14.5999925
74096 I 1.4375 1.1125 0.3375 26.1240642 10.0357230 5.9250455 8.5048500
74097 I 0.8625 0.6250 0.2000 3.8697068 1.8568922 0.9922325 1.7009700
74098 I 0.8250 0.6250 0.1625 5.1312595 2.0411640 0.9780577 1.2615527
74099 M 1.3000 0.9625 0.4125 23.7143568 6.1376668 5.9959192 7.0873750
74100 F 1.5000 1.1750 0.4125 32.2900805 15.4504775 6.8747538 8.7883450
74101 M 1.7375 1.3375 0.4625 45.9970638 24.6357155 9.8372765 13.4660125
74102 M 1.4375 1.1000 0.3750 24.7632882 11.7650425 4.6351432 6.5912587
74103 M 1.7375 1.3875 0.5375 61.9011333 31.3970712 13.5794105 14.1747500
74104 I 1.4125 1.1250 0.3750 26.1240642 10.0357230 5.9250455 6.7046567
74105 I 1.1875 0.9000 0.3250 13.5368862 5.9533950 2.7215520 4.5359200
74106 M 1.6625 1.3125 0.4375 41.0075517 18.1862042 9.7805775 11.9067900
74107 M 1.3750 1.0625 0.3250 21.7298918 9.7664027 4.7768907 6.8038800
74108 I 0.6375 0.4625 0.1750 2.2679600 0.7370870 0.5102910 0.7087375
74109 I 1.3125 1.0125 0.4000 20.1139703 8.5615490 4.7343665 5.2446575
74110 I 0.7125 0.5250 0.1875 3.3027168 1.2332033 0.7796112 0.9922325
74111 F 1.5000 1.1875 0.4375 31.7514400 14.0471773 7.4275690 9.0718400
74112 I 0.9875 0.7375 0.2250 8.7741702 3.8980562 1.9277660 2.1687367
74113 I 1.2000 0.9750 0.3000 15.4788270 6.3786375 3.0192217 4.6776675
74114 F 1.6125 1.2125 0.4375 38.6120190 18.3846507 7.4133942 10.5034898
74115 I 1.3500 1.1000 0.3750 26.9887240 13.4660125 5.8541718 7.2007730
74116 I 1.6500 1.3125 0.4375 48.2860657 21.7298918 9.6104805 9.3978592
74117 F 1.6125 1.3125 0.5000 37.3221167 15.4363027 9.2135875 12.0485375
74118 F 1.3125 1.1250 0.3750 25.3728025 10.2483443 5.2730070 6.6621325
74119 F 1.2000 0.9250 0.3000 15.1953320 7.1157245 3.2318430 4.2524250
74120 F 1.5625 1.2500 0.4750 35.7628943 13.7353327 6.4069870 13.8912550
74121 F 1.3375 1.0875 0.3250 21.5597947 9.1001895 5.1737838 6.6621325
74122 I 1.0000 0.7500 0.2125 9.4970825 4.4225220 1.3182517 2.4947560
74123 F 1.5000 1.2000 0.3750 29.1716355 11.3256252 6.3786375 8.3631025
74124 I 0.6125 0.4500 0.1500 2.5656297 0.9638830 0.4961162 0.6095143
74125 I 1.0375 0.8375 0.2500 9.0151410 4.3799978 1.8143680 2.9766975
74126 I 1.0375 0.7625 0.2500 9.2561118 4.4366968 1.8427175 2.6932025
74127 M 1.3250 1.0750 0.3375 22.1409595 7.4133942 4.5926190 6.3786375
74128 F 1.5000 1.1875 0.4250 32.1625078 13.9337793 6.9739770 8.9300925
74129 M 1.3125 0.9625 0.3000 15.0394098 6.7330062 2.9341733 4.8052402
74130 F 1.6125 1.2000 0.4250 31.3828965 13.8912550 6.9598022 8.2213550
74131 F 1.6625 1.2625 0.4375 36.1881367 16.6128070 7.7110640 9.4262088
74132 I 1.1625 0.8875 0.3000 12.7572750 6.1234920 2.9483480 3.2885420
74133 I 1.4000 1.1125 0.3750 24.7207640 10.8578585 3.7137845 7.4842680
74134 M 1.5000 1.2125 0.4250 34.2320213 14.0046530 8.0229085 9.2135875
74135 I 0.5000 0.3750 0.1000 1.0489315 0.4677668 0.3118445 0.3685435
74136 I 0.9000 0.6625 0.2250 5.0603858 2.0128145 1.1481548 1.5592225
74137 I 0.7000 0.5125 0.1625 2.8349500 1.0205820 0.5244657 0.8504850
74138 M 1.6625 1.3625 0.4750 46.0537627 16.7262050 7.3850447 15.0252350
74139 M 1.6250 1.3125 0.4375 38.9663878 18.9091165 8.8166945 10.7728100
74140 M 1.4500 1.1500 0.4000 31.1277510 11.4248485 6.9739770 10.9145575
74141 F 1.3250 1.0250 0.3500 21.3896978 9.9081502 4.8619392 5.9675698
74142 F 1.3500 1.1250 0.3375 22.8922212 9.1285390 5.1312595 7.0873750
74143 F 1.6125 1.3375 0.4750 35.1392053 13.2675660 6.7613558 12.0201880
74144 F 1.6375 1.3625 0.5000 53.6089045 19.4477570 10.9570818 14.6141672
74145 F 1.5500 1.2125 0.4500 36.2731852 16.2584382 8.0937822 10.0640725
74146 I 1.1250 0.8375 0.2875 12.3320325 5.1454342 2.3530085 3.5436875
74147 F 1.3750 1.1625 0.3750 26.4075592 11.6232950 6.4920355 7.5126175
74148 I 0.7000 0.5375 0.2000 3.1467945 1.5450477 0.7796112 0.9922325
74149 F 1.6000 1.2500 0.4125 38.0025048 16.6411565 8.4198015 9.0718400
74150 F 0.9250 0.6875 0.2500 6.3077638 2.9483480 1.3182517 1.8710670

Pre-Process the test data

# pre-process the data using the function above
crab_test_data_pre_processed = data_preprocess_scaling(crab_age_test_raw)

Predictions

predictions_test_data = round(predict(multiple_regression_model, crab_test_data_pre_processed), 0)

Package the predictions results
Convert the predictions results to dataframe and join back to the Ids from the test data and get the data ready for export.

# convert the predictions data to dataframe
crab_age_predictions = predictions_test_data %>% as.vector() %>% data.frame()
crab_age_predictions$id = as.integer(crab_age_test_raw$id)
# switch the id column to come first
crab_age_predictions[,c(1,2)] = crab_age_predictions[,c(2,1)]
colnames(crab_age_predictions) = c("id", "Age")
# set missing predictions to 0 if any
crab_age_predictions[is.na(crab_age_predictions)] = 0

# display the final predictions data
kable(head(crab_age_predictions, 100), "html") %>%
                                 kable_paper("hover", full_width = F) %>%
                                 scroll_box(width = "850px", height = "350px")
id Age
74051 8
74052 8
74053 10
74054 9
74055 7
74056 12
74057 11
74058 8
74059 12
74060 10
74061 13
74062 10
74063 5
74064 13
74065 6
74066 6
74067 11
74068 12
74069 11
74070 10
74071 16
74072 10
74073 9
74074 9
74075 13
74076 9
74077 14
74078 12
74079 9
74080 11
74081 11
74082 10
74083 9
74084 13
74085 12
74086 7
74087 10
74088 12
74089 9
74090 10
74091 8
74092 9
74093 11
74094 10
74095 17
74096 11
74097 7
74098 6
74099 13
74100 10
74101 10
74102 10
74103 9
74104 10
74105 9
74106 12
74107 10
74108 6
74109 9
74110 6
74111 11
74112 7
74113 9
74114 11
74115 8
74116 9
74117 14
74118 11
74119 9
74120 16
74121 11
74122 7
74123 12
74124 5
74125 7
74126 7
74127 12
74128 11
74129 10
74130 11
74131 11
74132 8
74133 10
74134 12
74135 5
74136 7
74137 6
74138 16
74139 11
74140 13
74141 10
74142 11
74143 15
74144 15
74145 11
74146 8
74147 11
74148 6
74149 11
74150 8

Write the Predictions data to local storage for submission

write.csv(crab_age_predictions, "../data/crab_age_predictions.csv", row.names=FALSE)

Kaggle Submission
kaggle username: Chinedu Onyeka
kaggle score: 1.46

See screenshot below: