CUNY SPS DATA 605 Final Project

Problem 1

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

Probability: Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

Probabilities: a. P(X>x | X>y); b. P(X>x & Y>y); c. P(X<x | X>y)

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

Solution 1

# Set the seed
set.seed(1234)

# Generate random variable X with 10,000 continuous random uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)

# Generate random variable Y with 10,000 random normal values with mean 10 and standard deviation 2.89
Y <- rnorm(10000, mean = 10, sd = 2.89)

# Calculate median values for X and Y
x_median <- median(X)
y_median <- median(Y)

Part A
P(X > x | X > y)

# a. P(X > x | X > y)
prob_a <- mean(X > x_median & X > Y)

# Display the probabilities
prob_a

## [1] 0.3897

P(X > x & Y > y)

# b. P(X > x & Y > y)
prob_b <- mean(X > x_median & Y > y_median)

# Display the probabilities
prob_b

## [1] 0.2507

P(X < x | X > y)

# c. P(X < x | X > y)
prob_c <- mean(X < x_median & X > Y)

# Display the probabilities
prob_c

## [1] 0.1083

Part B
Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y)

# Calculate joint probabilities
joint_prob <- mean(X > x_median & Y > y_median)

# Calculate marginal probabilities
marginal_prob_X <- mean(X > x_median)
marginal_prob_Y <- mean(Y > y_median)

# Display the joint and product of marginal probabilities
joint_prob

## [1] 0.2507

marginal_prob_X * marginal_prob_Y

## [1] 0.25

We see that the joint probability is equal the product of marginal probabilities: P(X>x & Y>y)=P(X>x)P(Y>y)

Part C
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test.

# Create a contingency table
cont_table <- table(X > x_median, Y > y_median)

# Perform Fisher’s Exact Test
fisher_test <- fisher.test(cont_table)

# Perform Chi-Square Test
chi_square_test <- chisq.test(cont_table)

# Display the results of both tests
fisher_test

## 
##  Fisher's Exact Test for Count Data
## 
## data:  cont_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264

chi_square_test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont_table
## X-squared = 0.0676, df = 1, p-value = 0.7949

If the results from the tests indicate a low p-value (typically less than 0.05), it suggests dependence between X and Y, while a higher p-value suggests independence. in this case, the p-value for both Fisher’s Test and Chi-Square are about the same (0.7949) which is greater than 0.05 indicating independence. Therefore, Independence exists.

Fisher’s Exact Test is used for contingency tables with small sample sizes, providing exact probabilities while Chi-Square Test is more suitable for larger samples, relying on approximations. I am not surprised by the results considering the fact that the data is fairly large (10,000).

Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Solution 2

Load Libraries

library(Amelia)
library(caret)
library(caTools)
library(cowplot)
library(e1071)
library(fastDummies)
library(grid)
library(kableExtra)
library(MASS)
library(matrixcalc)
library(Metrics)
library(nnet)
library(OpenImageR)
library(reshape2)
library(tidyverse)

Read the train dataset

url_crab_age = "https://raw.githubusercontent.com/chinedu2301/data605-computer-maths/main/final_project/data/crab_age_dataset.csv"
crab_age_train_raw = read_csv(url_crab_age)
crab_age_train_raw = crab_age_train_raw %>% rename(
    Shucked_Weight = `Shucked Weight`,
    Viscera_Weight = `Viscera Weight`,
    Shell_Weight = `Shell Weight`
    )

Display the data

# display a few records of the raw data
crab_age_raw_data_few_records <- kable(head(crab_age_train_raw, 50), "html") %>%
                                 kable_paper("hover", full_width = F) %>%
                                 scroll_box(width = "850px", height = "350px")
crab_age_raw_data_few_records

id	Sex	Length	Diameter	Height	Weight	Shucked_Weight	Viscera_Weight	Shell_Weight	Age
0	I	1.5250	1.1750	0.3750	28.973189	12.7289255	6.6479577	8.3489277	9
1	I	1.1000	0.8250	0.2750	10.418441	4.5217453	2.3246590	3.4019400	8
2	M	1.3875	1.1125	0.3750	24.777463	11.3398000	5.5565020	6.6621325	9
3	F	1.7000	1.4125	0.5000	50.660556	20.3549410	10.9918385	14.9968855	11
4	I	1.2500	1.0125	0.3375	23.289114	11.9776637	4.5075705	5.9533950	8
5	M	1.5000	1.1750	0.4125	28.845616	13.4093135	6.7897052	7.9378600	10
6	M	1.5750	1.1375	0.3500	30.022120	11.9351395	7.3425205	8.6465975	11
7	I	1.3125	1.0250	0.3500	18.299602	8.2497045	3.8980562	5.6699000	11
8	F	1.6000	1.2875	0.4375	38.824640	16.9671758	7.4133942	10.7728100	12
9	M	1.0250	0.7625	0.2625	10.305043	4.4933958	2.1262125	2.9766975	11
10	I	0.9250	0.7250	0.2125	6.109317	2.6932025	1.7293195	2.1262125	7
11	F	1.2750	0.9750	0.3750	19.575330	7.3425205	4.2807745	6.2368900	10
12	M	1.3500	1.0250	0.3500	20.964455	8.5048500	4.6351432	6.9456275	10
13	I	1.1000	0.8625	0.2875	11.439023	4.5075705	2.3388338	3.4019400	7
14	M	1.3125	0.9875	0.3125	17.449117	8.5473743	3.9831047	5.2446575	9
15	I	0.9375	0.7500	0.2000	7.243297	3.2318430	1.6300962	2.1262125	7
16	I	1.0125	0.7750	0.3000	9.454558	3.7846583	2.6081540	2.8349500	8
17	F	1.6375	1.2500	0.4250	39.179009	16.2584382	8.7174712	10.4893150	12
18	I	1.1375	0.9125	0.2875	13.366789	6.8605790	2.6648530	3.9689300	7
19	I	0.6750	0.4875	0.1000	2.083688	0.8788345	0.6378637	0.7370870	5
20	M	1.5500	1.2000	0.4250	32.772022	14.5432935	6.2368900	9.0860148	9
21	F	1.0875	0.8750	0.3125	15.379604	5.3438808	3.2460178	4.9611625	14
22	F	1.7375	1.3375	0.4125	47.513762	21.1770765	11.4957222	13.8912550	10
23	M	1.4375	1.1000	0.3125	24.805812	12.0343627	5.5848515	6.3786375	9
24	I	0.9250	0.7125	0.2375	6.988152	2.6365035	1.4883488	2.1262125	7
25	M	1.6125	1.2500	0.4375	39.774349	17.0097000	8.8308692	9.7805775	11
26	F	1.4250	1.0750	0.3875	23.473386	11.6232950	5.8966960	4.2665998	10
27	I	1.2000	0.9375	0.3125	15.762322	6.7613558	2.3671832	4.7768907	8
28	F	1.3250	1.0750	0.4125	20.326591	8.1930055	4.9469878	5.5281525	9
29	I	0.6250	0.5000	0.1500	2.608154	0.6945628	0.5102910	0.5102910	6
30	M	1.2750	0.9500	0.3375	16.910477	7.3425205	3.9405805	4.9611625	9
31	M	1.5500	1.2000	0.4125	34.699788	17.1089232	7.6401902	8.5048500	11
32	F	1.7625	1.3875	0.5125	46.677452	23.0906678	6.0809678	11.2972758	13
33	I	1.1875	0.9000	0.3125	12.813974	5.4147545	2.8066005	3.9689300	9
34	F	1.5875	1.2375	0.4250	34.614739	14.7842642	8.6891218	8.2213550	13
35	F	1.4250	1.1500	0.3875	26.308336	11.7933920	5.5281525	7.7961125	10
36	F	1.7250	1.3125	0.5000	47.513762	19.6462035	11.5665960	13.3242650	16
37	F	1.2500	0.9750	0.3250	18.016107	7.4700932	4.4366968	5.2446575	8
38	I	0.5125	0.3750	0.1375	1.190679	0.4819415	0.1559222	0.1842718	4
39	M	1.1250	0.8500	0.3250	12.941547	5.9533950	2.8774743	3.4019400	9
40	M	1.5375	1.3125	0.4125	37.988330	16.4427100	9.7805775	10.4893150	11
41	M	1.4375	1.1250	0.3750	24.933385	9.8372765	5.9392202	7.7961125	18
42	M	1.5250	1.2125	0.3875	34.841535	15.4504775	7.4559185	8.2213550	10
43	F	1.6125	1.2875	0.4125	35.621147	11.4531980	6.4353365	11.0563050	15
44	M	1.6750	1.2875	0.4750	39.745999	17.9027092	8.3347530	11.0563050	10
45	I	1.0375	0.7750	0.2250	10.957082	4.2807745	1.7718437	2.6932025	7
46	I	1.1250	0.8375	0.2500	14.004653	6.0667930	2.9625228	3.8271825	7
47	F	1.8875	1.4125	0.5125	57.606184	21.2621250	12.4879548	16.4427100	17
48	M	1.4375	1.1000	0.3625	24.720764	10.0640725	5.4005798	7.1157245	9
49	I	1.1750	0.9250	0.3125	15.223681	7.9236852	3.5011632	4.9611625	8

Descriptive and Inferential Statistics.

Univariate Descriptive Statistics

summary(crab_age_train_raw)

##        id            Sex                Length          Diameter     
##  Min.   :    0   Length:74051       Min.   :0.1875   Min.   :0.1375  
##  1st Qu.:18513   Class :character   1st Qu.:1.1500   1st Qu.:0.8875  
##  Median :37025   Mode  :character   Median :1.3750   Median :1.0750  
##  Mean   :37025                      Mean   :1.3175   Mean   :1.0245  
##  3rd Qu.:55538                      3rd Qu.:1.5375   3rd Qu.:1.2000  
##  Max.   :74050                      Max.   :2.0128   Max.   :1.6125  
##      Height           Weight        Shucked_Weight     Viscera_Weight    
##  Min.   :0.0000   Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252  
##  1st Qu.:0.3000   1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330  
##  Median :0.3625   Median :23.7994   Median : 9.90815   Median : 4.98951  
##  Mean   :0.3481   Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839  
##  3rd Qu.:0.4125   3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815  
##  Max.   :2.8250   Max.   :80.1015   Max.   :42.18406   Max.   :21.54562  
##   Shell_Weight           Age        
##  Min.   : 0.04252   Min.   : 1.000  
##  1st Qu.: 3.96893   1st Qu.: 8.000  
##  Median : 6.93145   Median :10.000  
##  Mean   : 6.72387   Mean   : 9.968  
##  3rd Qu.: 9.07184   3rd Qu.:11.000  
##  Max.   :28.49125   Max.   :29.000

Scatter Plots

# Age vs Sex
scatter_age_sex = crab_age_train_raw %>% ggplot(aes(x = Sex, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Sex", x = "Sex", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Length
scatter_age_length = crab_age_train_raw %>% ggplot(aes(x = Length, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Length", x = "Length", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))  

# Age vs Diameter
scatter_age_diameter = crab_age_train_raw %>% ggplot(aes(x = Diameter, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Diameter", x = "Diameter", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite")) 

# Age vs Height
scatter_age_height = crab_age_train_raw %>% ggplot(aes(x = Height, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Height", x = "Height", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Weight
scatter_age_weight = crab_age_train_raw %>% ggplot(aes(x = Weight, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Weight", x = "Weight", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Shucked_Weight
scatter_age_shucked_weight = crab_age_train_raw %>% ggplot(aes(x = Shucked_Weight, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Shucked_Weight", x = "Shucked_Weight", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Shucked_Weight
scatter_age_viscera_weight = crab_age_train_raw %>% ggplot(aes(x = Viscera_Weight, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Viscera_Weight", x = "Viscera_Weight", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))

# Age vs Shucked_Weight
scatter_age_shell_weight = crab_age_train_raw %>% ggplot(aes(x = Shell_Weight, y = Age)) + geom_point(aes(color = Sex)) +
                               labs(title = "Crab Age vs Shell_Weight", x = "Shell_Weight", y = "Crab Age", fill = "Sex") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))


# Plot grid of all plots
crab_age_scatter_plot1 <- plot_grid(scatter_age_sex, scatter_age_length, scatter_age_diameter, scatter_age_height,
                                   byrow = TRUE, nrow = 2) 

crab_age_scatter_plot2 <- plot_grid(scatter_age_weight, scatter_age_shucked_weight, scatter_age_viscera_weight, scatter_age_shell_weight,
                                   byrow = TRUE, nrow = 2) 
#display all plots
crab_age_scatter_plot1

crab_age_scatter_plot2 # For Age vs Weights

Correlation Matrix
Derive a correlation matrix for any three quantitative variables in the dataset.

#using the Diameter, Length, and Weight
correlation_data = crab_age_train_raw[, c("Diameter", "Length", "Weight")]
correlation_matrix = round(cor(correlation_data), 2)

# display the correlation matrix
correlation_matrix

##          Diameter Length Weight
## Diameter     1.00   0.99   0.94
## Length       0.99   1.00   0.94
## Weight       0.94   0.94   1.00

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

# Find the pariwise correlation between Diameter and Length using a confidence interval of 80%
pairwise_corr_dia_len = cor.test(correlation_data$Diameter, correlation_data$Length, conf.level = 0.8)
pairwise_corr_dia_len

## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$Diameter and correlation_data$Length
## t = 1857.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9893379 0.9895359
## sample estimates:
##       cor 
## 0.9894374

# Find the pariwise correlation between Diameter and Weight using a confidence interval of 80%
pairwise_corr_dia_weight = cor.test(correlation_data$Diameter, correlation_data$Weight, conf.level = 0.8)
pairwise_corr_dia_weight

## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$Diameter and correlation_data$Weight
## t = 737.99, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9376824 0.9388098
## sample estimates:
##       cor 
## 0.9382486

# Find the pariwise correlation between Length and Weight using a confidence interval of 80%
pairwise_corr_len_weight = cor.test(correlation_data$Length, correlation_data$Weight, conf.level = 0.8)
pairwise_corr_len_weight

## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$Length and correlation_data$Weight
## t = 725.93, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9357910 0.9369515
## sample estimates:
##       cor 
## 0.9363738

From the correlation test results for all three cases above, we see that the p-value is very small (almost zero). Hence, we reject the Null Hypothesis that the true correlation is 0. Hence, the true correlation is NOT zero(0) for any of the three pairs of variables - Diameter, Length, and Weight.

Familywise Error
The family-wise error rate (FWER) is the probability of at least 1 false positive when multiple comparisons are being tested. Basically, FWER is the probability of a coming to at least one false conclusion in a series of hypothesis tests. In other words, it’s the probability of making at least one Type I Error.
Using the Bonferroni correction which adjusts the level of significance for a family of tests as adjusted \(\alpha\) = \(\frac{\alpha}{n}\) where n is the number of tests.
In this case, with 3 correlation tests:

Alpha level for each individual test = 1 - 80% = 0.20 Number of tests = 3

Adjusted alpha level for multiple tests = 0.20 / 3 ≈ 0.067
Therefore, using the Bonferroni correction, the adjusted significance level for each individual correlation test to control the familywise error rate at an overall 80% confidence level is approximately 0.067 or 6.7%. Based on this result, I will not be worried about familywise error.

Linear Algebra and Correlation.

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Invert Correlation Matrix
Using the solve function to invert the correlation matrix.

precision_matrix <- solve(correlation_matrix)
precision_matrix

##            Diameter     Length    Weight
## Diameter  52.244165 -47.755835 -4.219031
## Length   -47.755835  52.244165 -4.219031
## Weight    -4.219031  -4.219031  8.931777

Multiply correlation_matrix by the precision_matrix

# multiply correlation matrix by precision matrix
mul_corr_matrix_precision = round(correlation_matrix %*% precision_matrix, 2)
mul_corr_matrix_precision

##          Diameter Length Weight
## Diameter        1      0      0
## Length          0      1      0
## Weight          0      0      1

This shows that a matrix multiplied by its inverse gives us the identity matrix.

Multiply the precision_matrix by the correlation_matrix

# multiply precision matrix by correlation matrix
mul_precision_by_corr_matrix = round(precision_matrix %*% correlation_matrix, 2)
mul_precision_by_corr_matrix

##          Diameter Length Weight
## Diameter        1      0      0
## Length          0      1      0
## Weight          0      0      1

This also shows that the inverse of a matrix multiplied by the matrix is still the identity matrix.

Essentially, this proves the theorem that if A is a square matrix and \(A^{-1}\) exists, then \(AA^{-1}=A^{-1}A = I\) where \(I\) is the Identity matrix. This holds only for square matrices and their inverses.

Conduct LU decomposition on the matrix

lu_decomposition = lu.decomposition(correlation_matrix)
lu_decomposition

## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.99 1.0000000    0
## [3,] 0.94 0.4723618    1
## 
## $U
##      [,1]   [,2]      [,3]
## [1,]    1 0.9900 0.9400000
## [2,]    0 0.0199 0.0094000
## [3,]    0 0.0000 0.1119598

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Find Skewness
Using the skewness function in e1071 package, we compute the skewness of the variables.
Note:
Positive skewness (>0): Indicates right-skewed distribution (tail is longer on the right).
Negative skewness (<0): Indicates left-skewed distribution (tail is longer on the left).
Skewness around 0: Indicates approximately symmetric distribution.

# Get skewness for each column
for (col in colnames(crab_age_train_raw)) {
  if (!(col %in% c("Sex", "id"))){
    skew_val <- skewness(crab_age_train_raw[[col]]) 
    cat(sprintf("The skewness for the %s variable is: %.2f\n", col, round(skew_val, 2)))
  }
}

## The skewness for the Length variable is: -0.84
## The skewness for the Diameter variable is: -0.81
## The skewness for the Height variable is: 0.09
## The skewness for the Weight variable is: 0.23
## The skewness for the Shucked_Weight variable is: 0.35
## The skewness for the Viscera_Weight variable is: 0.29
## The skewness for the Shell_Weight variable is: 0.28
## The skewness for the Age variable is: 1.09

Therefore, we select the Weight variable since it is one of the right-skewed variables.
Check the minimum value of Weight

# check the minimum value:
min(crab_age_train_raw$Weight)

## [1] 0.056699

Since the minimum value of Weight is not negative, no need for adjustment.

Histogram to check distribution of Weight

#hist(crab_age_train_raw$Weight, breaks = 20, xlab = 'Weight', main = "Histogram of Weight")
hist_weight = crab_age_train_raw %>% ggplot(aes(x = Weight)) + 
              geom_histogram(binwidth = 0.5, color = "black", fill = "skyblue", alpha = 0.7) +
                               labs(title = "Histogram for Weight", x = "Weight", y = "Frequency") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
hist_weight

Density Plot to visualize the distribution of Weight

density_weight = crab_age_train_raw %>% ggplot(aes(x = Weight)) + geom_density() +
                               labs(title = "Density Plot for Weight", x = "Weight", y = "Density") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
density_weight

From the Histogram and Density plot, we can see that the Weight variable is skewed to the right.

Load the MASS package and run fitdistr to fit an exponential probability density function

prob_density_function <- fitdistr(crab_age_train_raw$Weight, densfun = 'exponential')
lambda <- prob_density_function$estimate
exponential_pdf <- rexp(1000, lambda)

Optimal Value of λ:
The optimal value of lambda = 1/λ.

optimal_value <- round((1/lambda), 4)
optimal_value

##    rate 
## 23.3852

Plot a histogram and compare it with a histogram of your original variable
Using ggplot2

# histogram for exponential distribution
exp_pdf = exponential_pdf %>% tibble()
colnames(exp_pdf) = "Weight"
hist_exponential_pdf = exp_pdf %>% ggplot(aes(x = Weight)) + 
                       geom_histogram(binwidth = 1, color = "black", fill = "skyblue", alpha = 0.7) +
                               labs(title = "Histogram for Weight-Exponential PDF", x = "Weight", y = "Frequency") +
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
plot_grid(hist_weight, NULL, hist_exponential_pdf, nrow = 1, rel_widths = c(1, 0.008, 1))

Using Base R

par(mfrow=c(1,2))
hist(crab_age_train_raw$Weight, breaks = 20, col="violet", main = "Histogram - Original", xlab = 'Weight')
hist(exponential_pdf, breaks = 20, col="royalblue", main = "Histogram - Exponential PDF", xlab = 'Weight')

The 5th and 95th percentiles using the CDF are:

cdf_5th_95th_percentiles = round(quantile(exponential_pdf, c(0.05, 0.95)), 4)
cdf_5th_95th_percentiles

##      5%     95% 
##  1.2535 67.5297

95% confidence interval from empirical data

confidence_interval = t.test(crab_age_train_raw$Weight)
confidence_interval

## 
##  One Sample t-test
## 
## data:  crab_age_train_raw$Weight
## t = 503.13, df = 74050, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  23.29412 23.47632
## sample estimates:
## mean of x 
##  23.38522

The 5th and 95th percentiles using the empirical data are:

empirical_5th_95th_percentiles = round(quantile(crab_age_train_raw$Weight, c(0.05, 0.95)), 4)
empirical_5th_95th_percentiles

##      5%     95% 
##  3.6004 44.2110

Comparing the 5th and 95th percentiles of the exponential pdf and the empirical data as well as their histograms, we can clearly see the exponential pdf is much more skewed to the right than the original/empirical data. Also, the original empirical data appears to be better and the exponential data is not a good approximation of the original data in this case.

Modeling.

Conduct Exploratory Data Analysis Prior to Modeling

glimpse of the data

glimpse(crab_age_train_raw)

## Rows: 74,051
## Columns: 10
## $ id             <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Sex            <chr> "I", "I", "M", "F", "I", "M", "M", "I", "F", "M", "I", …
## $ Length         <dbl> 1.5250, 1.1000, 1.3875, 1.7000, 1.2500, 1.5000, 1.5750,…
## $ Diameter       <dbl> 1.1750, 0.8250, 1.1125, 1.4125, 1.0125, 1.1750, 1.1375,…
## $ Height         <dbl> 0.3750, 0.2750, 0.3750, 0.5000, 0.3375, 0.4125, 0.3500,…
## $ Weight         <dbl> 28.973189, 10.418441, 24.777463, 50.660556, 23.289114, …
## $ Shucked_Weight <dbl> 12.7289255, 4.5217453, 11.3398000, 20.3549410, 11.97766…
## $ Viscera_Weight <dbl> 6.6479577, 2.3246590, 5.5565020, 10.9918385, 4.5075705,…
## $ Shell_Weight   <dbl> 8.348928, 3.401940, 6.662133, 14.996885, 5.953395, 7.93…
## $ Age            <dbl> 9, 8, 9, 11, 8, 10, 11, 11, 12, 11, 7, 10, 10, 7, 9, 7,…

There are about 74,051 observations and 10 variables (9 predictor variables and 1 response variable). The response variable is the age and we aim to predict the age of the crab using the other features given. Only Sex is the categorical variable in the data.

Check for missing Values

# use missmap function from the Amelia package to check for NA values
missmap(crab_age_train_raw,
        plot.background = element_rect(fill = "antiquewhite"),
        main = "Crab Age Dataset - Missing Values", 
        x.cex = 0.45,
        y.cex = 0.6,
        margins = c(7.1, 7.1),
        col = c("yellow", "black"), legend = FALSE)

As can be seen from the missing values plot, there are no missing values.

Histogram for Age to check the Age distribution

hist_age = crab_age_train_raw %>% ggplot(aes(x = Age)) + 
                                  geom_histogram(binwidth = 1, color = "black", fill = "skyblue", alpha = 0.7) +
                                  labs(title = "Histogram for Age", x = "Age", y = "Frequency") +
                                  theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                           panel.background = element_rect(fill = "gray80"),
                                                           plot.background = element_rect(fill = "antiquewhite"))
hist_age

From the Age distribution, we can see that ages of crabs given are between 3 and 20.

Bar Plot for Sex

# group the data by sex and get count/percentage
crab_sex_grouped <- crab_age_train_raw %>% group_by(Sex) %>% 
                        summarise(count = n()) %>%
                        mutate(percentage = round((count / sum(count) * 100), 2))

# bar plot using ggplot2
bar_plot_crab_sex <- crab_sex_grouped %>% ggplot(aes(x=factor(Sex), y = percentage, fill = factor(Sex))) +
                               geom_bar(stat = "identity", position = "dodge") +
                               labs(title = "Bar Chart - Crab Sex", x = "Crab Sex", y = "Percentage", 
                                    fill = "Crab Sex") +
                               scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentages
                               theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))                          
bar_plot_crab_sex

Each of the crab sex constitute a little above 30% of the data with the “M” category being a little more than 35% and representing the highest category by a small margin.

Data Pre-Processing
Develop a function to pre-process the data and make it ready for training. This function can be easily used to pre-process any new test data to avoid code duplications and errors.

# data pre-processing
data_preprocess_scaling <- function(df){
                      # This function standardizes the numeric variables of the df using the standard normal method
                      df <- as.data.frame(df)
                      if ("Age" %in% colnames(df)){
                          df_char <- df %>% select(Age, Sex) 
                          df_numeric <- df %>% select(-Sex, -Age) # Exclude Age from scaling as well.
                      } else {
                          df_char <- df %>% select(Sex) 
                          df_numeric <- df %>% select(-Sex) # Exclude Age from scaling as well.                        
                      }
                      # scale the other numeric values using standard scaler
                      df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
                      # combine the categorical and numeric features and drop the id column
                      df_scaled_combined <- cbind(df_char, df_numeric_scaled) %>% select(-id)
                      return(df_scaled_combined)
}

Pre-process the train data

# pre-process the data using the function above
crab_data_pre_processed = data_preprocess_scaling(crab_age_train_raw)

# display some records for the standardized data
kable(head(crab_data_pre_processed, 50), "html") %>%
                                 kable_paper("hover", full_width = F) %>%
                                 scroll_box(width = "850px", height = "350px")

Age	Sex	Length	Diameter	Height	Weight	Shucked_Weight	Viscera_Weight	Shell_Weight
9	I	0.7212332	0.6339778	0.2923977	0.4418014	0.4671847	0.5691823	0.4533730
8	I	-0.7557067	-0.8403503	-0.7941577	-1.0251912	-0.9936809	-0.9788731	-0.9267816
9	M	0.2433997	0.3707049	0.2923977	0.1100750	0.2199225	0.1783618	-0.0172241
11	F	1.3293850	1.6344147	1.6505920	2.1564682	1.8246040	2.1246076	2.3080794
8	I	-0.2344338	-0.0505317	-0.1150606	-0.0075982	0.3334613	-0.1972320	-0.2149540
10	M	0.6343544	0.6339778	0.6998560	0.4317152	0.5882928	0.6199382	0.3386897
11	M	0.8949909	0.4760140	0.0207589	0.5247330	0.3258920	0.8178863	0.5364196
11	I	-0.0172367	0.0021229	0.0207589	-0.4020836	-0.3301098	-0.4154824	-0.2940460
12	F	0.9818697	1.1078689	0.9714949	1.2206860	1.2215869	0.8432642	1.1296093
11	M	-1.0163432	-1.1036232	-0.9299772	-1.0341568	-0.9987271	-1.0499314	-1.0454195
7	I	-1.3638585	-1.2615869	-1.4732549	-1.3658832	-1.3191587	-1.1920479	-1.2826954
10	F	-0.1475550	-0.2084954	0.2923977	-0.3012208	-0.4915872	-0.2784415	-0.1358620
10	M	0.1130815	0.0021229	0.0207589	-0.1913925	-0.2846943	-0.1515517	0.0618678
7	I	-0.7557067	-0.6823866	-0.6583383	-0.9445010	-0.9962040	-0.9737975	-0.9267816
9	M	-0.0172367	-0.1558409	-0.3866994	-0.4693254	-0.2771250	-0.3850289	-0.4126839
7	I	-1.3204191	-1.1562778	-1.6090743	-1.2762274	-1.2232815	-1.2275771	-1.2826954
8	I	-1.0597826	-1.0509686	-0.5225189	-1.1013987	-1.1248813	-0.8773613	-1.0849655
12	F	1.1121879	0.9499052	0.8356754	1.2487034	1.0954327	1.3102187	1.0505173
7	I	-0.6253885	-0.4717683	-0.6583383	-0.7920862	-0.5773720	-0.8570589	-0.7685977
5	I	-2.2326467	-2.2620238	-2.6956297	-1.6841612	-1.6421135	-1.5828685	-1.6702460
9	M	0.8081121	0.7392869	0.8356754	0.7421483	0.7901395	0.4219901	0.6590121
14	F	-0.7991461	-0.6297320	-0.3866994	-0.6329472	-0.8473420	-0.6489597	-0.4917758
10	F	1.4597032	1.3184872	0.6998560	1.9076734	1.9709428	2.3050346	1.9996208
9	M	0.4171574	0.3180503	-0.3866994	0.1123164	0.3435536	0.1885129	-0.0963161
7	I	-1.3638585	-1.3142415	-1.2016160	-1.2964000	-1.3292511	-1.2783330	-1.2826954
11	M	1.0253091	0.9499052	0.9714949	1.2957727	1.2291561	1.3508234	0.8527874
10	F	0.3737180	0.2127412	0.4282171	0.0069709	0.2703842	0.3001760	-0.6855511
8	I	-0.4081914	-0.3664592	-0.3866994	-0.6026884	-0.5950336	-0.9636463	-0.5431856
9	F	0.0262027	0.2127412	0.6998560	-0.2418239	-0.3402021	-0.0398887	-0.3335919
6	I	-2.4064043	-2.2093693	-2.1523520	-1.6426954	-1.6749136	-1.6285488	-1.7335196
9	M	-0.1475550	-0.3138046	-0.1150606	-0.5119119	-0.4915872	-0.4002557	-0.4917758
11	M	0.8081121	0.7392869	0.6998560	0.8945631	1.2468177	0.9244737	0.4968736
13	F	1.5465820	1.5291055	1.7864114	1.8415523	2.3115592	0.3661586	1.2759294
9	I	-0.4516309	-0.5244229	-0.3866994	-0.8357934	-0.8347266	-0.8063030	-0.7685977
13	F	0.9384303	0.8972506	0.8356754	0.8878389	0.8330319	1.3000675	0.4177816
10	F	0.3737180	0.5286686	0.4282171	0.2311103	0.3006612	0.1682106	0.2991437
16	F	1.4162638	1.2131781	1.6505920	1.9076734	1.6984498	2.3304125	1.8414369
8	F	-0.2344338	-0.2084954	-0.2508800	-0.4244975	-0.4688794	-0.2226100	-0.4126839
4	I	-2.7973590	-2.7359150	-2.2881714	-1.7547651	-1.7127598	-1.7554386	-1.8244753
9	M	-0.6688279	-0.7350412	-0.2508800	-0.8257071	-0.7388494	-0.7809250	-0.9267816
11	M	0.7646727	1.2131781	0.6998560	1.1545649	1.1282328	1.6908880	1.0505173
18	M	0.4171574	0.4233595	0.2923977	0.1224027	-0.0475244	0.3154027	0.2991437
10	M	0.7212332	0.7919415	0.4282171	0.9057701	0.9516169	0.8584910	0.4177816
15	F	1.0253091	1.1078689	0.6998560	0.9674084	0.2401072	0.4930484	1.2087012
10	M	1.2425062	1.1078689	1.3789531	1.2935313	1.3881104	1.1731777	1.2087012
7	I	-0.9729038	-1.0509686	-1.3374354	-0.9826048	-1.0365733	-1.1768212	-1.1245115
7	I	-0.6688279	-0.7876958	-1.0657966	-0.7416548	-0.7186647	-0.7504715	-0.8081437
17	F	1.9809762	1.6344147	1.7864114	2.7056098	1.9860813	2.6603260	2.7114484
9	M	0.4171574	0.3180503	0.1565783	0.1055923	-0.0071550	0.1225303	0.1093230
8	I	-0.4950703	-0.4191137	-0.3866994	-0.6452749	-0.3881407	-0.5575990	-0.4917758

Train Test Split

Use the CaTools library to split the cleaned dataset into training and testing datasets in 70:30 ratio.

# Set a seed
set.seed(1994)
#Split the sample
sampling <- sample.split(crab_data_pre_processed$Age, SplitRatio = 0.7) 
# Training Data
df_train_subset <- subset(crab_data_pre_processed, sampling == TRUE)
# Testing Data
df_test_subset <- subset(crab_data_pre_processed, sampling == FALSE)

Train the model

multiple_regression_model = lm(Age ~ ., data = df_train_subset)
# display the model
summary(multiple_regression_model)

## 
## Call:
## lm(formula = Age ~ ., data = df_train_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4992  -1.2221  -0.3402   0.7562  18.2745 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.34098    0.01807 572.244  < 2e-16 ***
## SexI           -1.02614    0.03016 -34.019  < 2e-16 ***
## SexM           -0.11444    0.02287  -5.004 5.63e-07 ***
## Length          0.25076    0.06591   3.805 0.000142 ***
## Diameter        0.54040    0.06754   8.001 1.26e-15 ***
## Height          0.60991    0.02513  24.267  < 2e-16 ***
## Weight          2.36161    0.08195  28.817  < 2e-16 ***
## Shucked_Weight -3.44562    0.04452 -77.391  < 2e-16 ***
## Viscera_Weight -0.57802    0.03964 -14.584  < 2e-16 ***
## Shell_Weight    1.93384    0.04217  45.862  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.127 on 51826 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5513 
## F-statistic:  7076 on 9 and 51826 DF,  p-value: < 2.2e-16

Predictions

predictions_test = round(predict(multiple_regression_model, df_test_subset), 0) # round predictions to nearest age

Evaluate the Model

mean_absolute_error = round(Metrics::mae(df_test_subset$Age, predictions_test), 2)
mean_sq_error = round(Metrics::mse(df_test_subset$Age, predictions_test), 2)
root_mean_sq_error = round(Metrics::rmse(df_test_subset$Age, predictions_test), 2)

model_eval_metrics = c(mean_absolute_error, mean_sq_error, root_mean_sq_error) %>% t()
column_names = c("Mean Absolute Error", "Mean Square Error", "Root Mean Square Error")
evaluation_metrics = data.frame(values = model_eval_metrics)
colnames(evaluation_metrics) = column_names

# combine the results and display
kable(evaluation_metrics, "html") %>%
                        kable_paper("hover", full_width = F) %>%
                        scroll_box(width = "450px", height = "100px")

Mean Absolute Error	Mean Square Error	Root Mean Square Error
1.46	4.63	2.15

From the MAE, and RMSE, we can see that the model is able to predict the Age of Crabs within 2 units of the Crab Age.

Using the test data provided to make predictions
Read test data from remote location

url_crab_age_test = "https://raw.githubusercontent.com/chinedu2301/data605-computer-maths/main/final_project/data/crab_age_test_data.csv"
crab_age_test_raw = read_csv(url_crab_age_test)
crab_age_test_raw = crab_age_test_raw %>% rename(
    Shucked_Weight = `Shucked Weight`,
    Viscera_Weight = `Viscera Weight`,
    Shell_Weight = `Shell Weight`
    )

Display a few records of the test data:

kable(head(crab_age_test_raw, 100), "html") %>%
                                 kable_paper("hover", full_width = F) %>%
                                 scroll_box(width = "850px", height = "350px")

id	Sex	Length	Diameter	Height	Weight	Shucked_Weight	Viscera_Weight	Shell_Weight
74051	I	1.0500	0.7625	0.2750	8.6182480	3.6570855	1.7293195	2.7215520
74052	I	1.1625	0.8875	0.2750	15.5071765	7.0306760	3.2460178	3.9689300
74053	F	1.2875	0.9875	0.3250	14.5716430	5.5565020	3.8838815	4.8194150
74054	F	1.5500	0.9875	0.3875	28.3778495	13.3809640	6.5487345	7.0306760
74055	I	1.1125	0.8500	0.2625	11.7650425	5.5281525	2.4664065	3.3310663
74056	M	1.4250	1.1125	0.3500	24.8341620	8.7316460	5.7124242	8.0796075
74057	M	1.7125	1.3250	0.4500	46.6774518	21.2337755	11.9634890	11.3681495
74058	I	1.1750	0.8875	0.3000	13.9337793	5.6982495	2.9908722	3.9405805
74059	F	1.3125	1.0625	0.2875	23.5867840	8.1646560	5.6273757	8.4623258
74060	F	1.4250	1.0500	0.3125	17.7893112	7.1015498	4.6067937	5.5281525
74061	M	1.5125	1.2000	0.4750	32.2333815	12.9415468	6.2227153	9.8372765
74062	F	1.5625	1.2000	0.4125	36.5141560	17.2931950	8.6040733	8.9300925
74063	I	0.4375	0.3250	0.1250	0.8221355	0.3543688	0.1842718	0.2834950
74064	F	1.6250	1.2875	0.4125	39.1931838	16.6836808	7.4700932	11.9067900
74065	I	0.6750	0.5000	0.1500	2.7215520	1.1906790	0.8079608	0.8504850
74066	I	0.8375	0.6250	0.1750	4.8194150	1.7718437	0.7654365	1.2473780
74067	M	1.6750	1.3000	0.4000	41.6312408	18.8524175	9.6955290	11.0563050
74068	M	1.4625	1.1750	0.3250	28.2644515	11.4248485	6.0384435	9.4970825
74069	F	1.6750	1.3000	0.3625	40.8374547	17.2931950	9.1143642	9.0860148
74070	M	1.5875	1.2625	0.4125	35.5077487	17.0805738	8.2922287	9.4970825
74071	M	1.6000	1.2500	0.4750	31.1277510	11.1271788	6.6763072	12.8423235
74072	F	1.2625	0.9375	0.3000	16.8396030	7.5267922	3.0192217	4.8194150
74073	I	1.1875	0.9000	0.3125	13.8062065	5.3297060	3.6003865	4.3941725
74074	I	1.4375	1.1125	0.3625	25.6988218	12.2328093	5.9250455	6.7755305
74075	F	1.5875	1.2500	0.4750	33.8493030	13.9479540	5.8966960	9.4970825
74076	F	1.2250	0.9625	0.3000	17.7893112	8.3489277	3.8271825	5.6699000
74077	F	1.8500	1.4125	0.5125	57.6061840	27.6691120	9.6104805	18.7106700
74078	F	1.5875	1.2125	0.4500	36.5708550	15.5213513	8.6324228	9.9790240
74079	M	1.2500	1.0000	0.3000	16.0741665	7.1015498	4.4366968	4.9611625
74080	F	1.8375	1.4875	0.4500	53.2970600	25.7838702	11.7650425	13.8912550
74081	F	1.6250	1.3000	0.3875	38.7821160	17.5341658	8.1646560	10.3475675
74082	M	1.7625	1.4000	0.3875	47.4854125	22.5945515	11.6091202	10.9996060
74083	M	1.6250	1.2375	0.4000	39.4199798	19.6603782	8.3772773	9.9223250
74084	M	1.6875	1.3375	0.5125	39.7743485	17.0805738	8.2922287	11.9067900
74085	F	1.5375	1.1500	0.4125	29.9512468	13.6502843	5.4431040	9.6388300
74086	I	1.0375	0.7875	0.2375	8.1221317	4.2665998	2.2537852	2.5514550
74087	M	1.5750	1.3125	0.3625	27.5415392	14.1322258	5.5706767	7.0165012
74088	M	1.4250	1.1625	0.4000	26.7052290	11.9209647	5.1596090	8.7741702
74089	I	1.2500	0.9375	0.3250	15.4504775	6.0242688	3.8838815	4.7060170
74090	M	1.4250	1.1125	0.3750	23.5159102	10.9854312	5.5281525	7.1015498
74091	M	0.8500	0.6375	0.2125	5.1029100	1.9844650	1.1906790	1.7009700
74092	M	1.5375	1.1750	0.3750	31.1561005	15.2803805	6.9598022	7.9378600
74093	F	1.5125	1.1875	0.4375	29.2425093	13.0549448	6.5062102	8.0796075
74094	M	1.5000	1.1875	0.4125	29.2708587	13.7920317	6.1943658	7.7961125
74095	M	1.6000	1.2625	0.5000	39.7034747	12.4737800	8.0796075	14.5999925
74096	I	1.4375	1.1125	0.3375	26.1240642	10.0357230	5.9250455	8.5048500
74097	I	0.8625	0.6250	0.2000	3.8697068	1.8568922	0.9922325	1.7009700
74098	I	0.8250	0.6250	0.1625	5.1312595	2.0411640	0.9780577	1.2615527
74099	M	1.3000	0.9625	0.4125	23.7143568	6.1376668	5.9959192	7.0873750
74100	F	1.5000	1.1750	0.4125	32.2900805	15.4504775	6.8747538	8.7883450
74101	M	1.7375	1.3375	0.4625	45.9970638	24.6357155	9.8372765	13.4660125
74102	M	1.4375	1.1000	0.3750	24.7632882	11.7650425	4.6351432	6.5912587
74103	M	1.7375	1.3875	0.5375	61.9011333	31.3970712	13.5794105	14.1747500
74104	I	1.4125	1.1250	0.3750	26.1240642	10.0357230	5.9250455	6.7046567
74105	I	1.1875	0.9000	0.3250	13.5368862	5.9533950	2.7215520	4.5359200
74106	M	1.6625	1.3125	0.4375	41.0075517	18.1862042	9.7805775	11.9067900
74107	M	1.3750	1.0625	0.3250	21.7298918	9.7664027	4.7768907	6.8038800
74108	I	0.6375	0.4625	0.1750	2.2679600	0.7370870	0.5102910	0.7087375
74109	I	1.3125	1.0125	0.4000	20.1139703	8.5615490	4.7343665	5.2446575
74110	I	0.7125	0.5250	0.1875	3.3027168	1.2332033	0.7796112	0.9922325
74111	F	1.5000	1.1875	0.4375	31.7514400	14.0471773	7.4275690	9.0718400
74112	I	0.9875	0.7375	0.2250	8.7741702	3.8980562	1.9277660	2.1687367
74113	I	1.2000	0.9750	0.3000	15.4788270	6.3786375	3.0192217	4.6776675
74114	F	1.6125	1.2125	0.4375	38.6120190	18.3846507	7.4133942	10.5034898
74115	I	1.3500	1.1000	0.3750	26.9887240	13.4660125	5.8541718	7.2007730
74116	I	1.6500	1.3125	0.4375	48.2860657	21.7298918	9.6104805	9.3978592
74117	F	1.6125	1.3125	0.5000	37.3221167	15.4363027	9.2135875	12.0485375
74118	F	1.3125	1.1250	0.3750	25.3728025	10.2483443	5.2730070	6.6621325
74119	F	1.2000	0.9250	0.3000	15.1953320	7.1157245	3.2318430	4.2524250
74120	F	1.5625	1.2500	0.4750	35.7628943	13.7353327	6.4069870	13.8912550
74121	F	1.3375	1.0875	0.3250	21.5597947	9.1001895	5.1737838	6.6621325
74122	I	1.0000	0.7500	0.2125	9.4970825	4.4225220	1.3182517	2.4947560
74123	F	1.5000	1.2000	0.3750	29.1716355	11.3256252	6.3786375	8.3631025
74124	I	0.6125	0.4500	0.1500	2.5656297	0.9638830	0.4961162	0.6095143
74125	I	1.0375	0.8375	0.2500	9.0151410	4.3799978	1.8143680	2.9766975
74126	I	1.0375	0.7625	0.2500	9.2561118	4.4366968	1.8427175	2.6932025
74127	M	1.3250	1.0750	0.3375	22.1409595	7.4133942	4.5926190	6.3786375
74128	F	1.5000	1.1875	0.4250	32.1625078	13.9337793	6.9739770	8.9300925
74129	M	1.3125	0.9625	0.3000	15.0394098	6.7330062	2.9341733	4.8052402
74130	F	1.6125	1.2000	0.4250	31.3828965	13.8912550	6.9598022	8.2213550
74131	F	1.6625	1.2625	0.4375	36.1881367	16.6128070	7.7110640	9.4262088
74132	I	1.1625	0.8875	0.3000	12.7572750	6.1234920	2.9483480	3.2885420
74133	I	1.4000	1.1125	0.3750	24.7207640	10.8578585	3.7137845	7.4842680
74134	M	1.5000	1.2125	0.4250	34.2320213	14.0046530	8.0229085	9.2135875
74135	I	0.5000	0.3750	0.1000	1.0489315	0.4677668	0.3118445	0.3685435
74136	I	0.9000	0.6625	0.2250	5.0603858	2.0128145	1.1481548	1.5592225
74137	I	0.7000	0.5125	0.1625	2.8349500	1.0205820	0.5244657	0.8504850
74138	M	1.6625	1.3625	0.4750	46.0537627	16.7262050	7.3850447	15.0252350
74139	M	1.6250	1.3125	0.4375	38.9663878	18.9091165	8.8166945	10.7728100
74140	M	1.4500	1.1500	0.4000	31.1277510	11.4248485	6.9739770	10.9145575
74141	F	1.3250	1.0250	0.3500	21.3896978	9.9081502	4.8619392	5.9675698
74142	F	1.3500	1.1250	0.3375	22.8922212	9.1285390	5.1312595	7.0873750
74143	F	1.6125	1.3375	0.4750	35.1392053	13.2675660	6.7613558	12.0201880
74144	F	1.6375	1.3625	0.5000	53.6089045	19.4477570	10.9570818	14.6141672
74145	F	1.5500	1.2125	0.4500	36.2731852	16.2584382	8.0937822	10.0640725
74146	I	1.1250	0.8375	0.2875	12.3320325	5.1454342	2.3530085	3.5436875
74147	F	1.3750	1.1625	0.3750	26.4075592	11.6232950	6.4920355	7.5126175
74148	I	0.7000	0.5375	0.2000	3.1467945	1.5450477	0.7796112	0.9922325
74149	F	1.6000	1.2500	0.4125	38.0025048	16.6411565	8.4198015	9.0718400
74150	F	0.9250	0.6875	0.2500	6.3077638	2.9483480	1.3182517	1.8710670

Pre-Process the test data

# pre-process the data using the function above
crab_test_data_pre_processed = data_preprocess_scaling(crab_age_test_raw)

Predictions

predictions_test_data = round(predict(multiple_regression_model, crab_test_data_pre_processed), 0)

Package the predictions results
Convert the predictions results to dataframe and join back to the Ids from the test data and get the data ready for export.

# convert the predictions data to dataframe
crab_age_predictions = predictions_test_data %>% as.vector() %>% data.frame()
crab_age_predictions$id = as.integer(crab_age_test_raw$id)
# switch the id column to come first
crab_age_predictions[,c(1,2)] = crab_age_predictions[,c(2,1)]
colnames(crab_age_predictions) = c("id", "Age")
# set missing predictions to 0 if any
crab_age_predictions[is.na(crab_age_predictions)] = 0

# display the final predictions data
kable(head(crab_age_predictions, 100), "html") %>%
                                 kable_paper("hover", full_width = F) %>%
                                 scroll_box(width = "850px", height = "350px")

id	Age
74051	8
74052	8
74053	10
74054	9
74055	7
74056	12
74057	11
74058	8
74059	12
74060	10
74061	13
74062	10
74063	5
74064	13
74065	6
74066	6
74067	11
74068	12
74069	11
74070	10
74071	16
74072	10
74073	9
74074	9
74075	13
74076	9
74077	14
74078	12
74079	9
74080	11
74081	11
74082	10
74083	9
74084	13
74085	12
74086	7
74087	10
74088	12
74089	9
74090	10
74091	8
74092	9
74093	11
74094	10
74095	17
74096	11
74097	7
74098	6
74099	13
74100	10
74101	10
74102	10
74103	9
74104	10
74105	9
74106	12
74107	10
74108	6
74109	9
74110	6
74111	11
74112	7
74113	9
74114	11
74115	8
74116	9
74117	14
74118	11
74119	9
74120	16
74121	11
74122	7
74123	12
74124	5
74125	7
74126	7
74127	12
74128	11
74129	10
74130	11
74131	11
74132	8
74133	10
74134	12
74135	5
74136	7
74137	6
74138	16
74139	11
74140	13
74141	10
74142	11
74143	15
74144	15
74145	11
74146	8
74147	11
74148	6
74149	11
74150	8

Write the Predictions data to local storage for submission

write.csv(crab_age_predictions, "../data/crab_age_predictions.csv", row.names=FALSE)

Kaggle Submission
kaggle username: Chinedu Onyeka
kaggle score: 1.46

See screenshot below: