Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
Probability: Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
Probabilities: a. P(X>x | X>y); b. P(X>x & Y>y); c. P(X<x | X>y)
Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?
# Set the seed
set.seed(1234)
# Generate random variable X with 10,000 continuous random uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)
# Generate random variable Y with 10,000 random normal values with mean 10 and standard deviation 2.89
Y <- rnorm(10000, mean = 10, sd = 2.89)
# Calculate median values for X and Y
x_median <- median(X)
y_median <- median(Y)Part A
P(X > x | X > y)
## [1] 0.3897
P(X > x & Y > y)
# b. P(X > x & Y > y)
prob_b <- mean(X > x_median & Y > y_median)
# Display the probabilities
prob_b## [1] 0.2507
P(X < x | X > y)
## [1] 0.1083
Part B
Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y)
# Calculate joint probabilities
joint_prob <- mean(X > x_median & Y > y_median)
# Calculate marginal probabilities
marginal_prob_X <- mean(X > x_median)
marginal_prob_Y <- mean(Y > y_median)
# Display the joint and product of marginal probabilities
joint_prob## [1] 0.2507
## [1] 0.25
We see that the joint probability is equal the product of marginal probabilities: P(X>x & Y>y)=P(X>x)P(Y>y)
Part C
Check to see if independence holds by using Fisher’s Exact Test and the
Chi Square Test.
# Create a contingency table
cont_table <- table(X > x_median, Y > y_median)
# Perform Fisher’s Exact Test
fisher_test <- fisher.test(cont_table)
# Perform Chi-Square Test
chi_square_test <- chisq.test(cont_table)
# Display the results of both tests
fisher_test##
## Fisher's Exact Test for Count Data
##
## data: cont_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont_table
## X-squared = 0.0676, df = 1, p-value = 0.7949
If the results from the tests indicate a low p-value (typically less than 0.05), it suggests dependence between X and Y, while a higher p-value suggests independence. in this case, the p-value for both Fisher’s Test and Chi-Square are about the same (0.7949) which is greater than 0.05 indicating independence. Therefore, Independence exists.
Fisher’s Exact Test is used for contingency tables with small sample sizes, providing exact probabilities while Chi-Square Test is more suitable for larger samples, relying on approximations. I am not surprised by the results considering the fact that the data is fairly large (10,000).
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Load Libraries
library(Amelia)
library(caret)
library(caTools)
library(cowplot)
library(e1071)
library(fastDummies)
library(grid)
library(kableExtra)
library(MASS)
library(matrixcalc)
library(Metrics)
library(nnet)
library(OpenImageR)
library(reshape2)
library(tidyverse)Read the train dataset
url_crab_age = "https://raw.githubusercontent.com/chinedu2301/data605-computer-maths/main/final_project/data/crab_age_dataset.csv"
crab_age_train_raw = read_csv(url_crab_age)
crab_age_train_raw = crab_age_train_raw %>% rename(
Shucked_Weight = `Shucked Weight`,
Viscera_Weight = `Viscera Weight`,
Shell_Weight = `Shell Weight`
)Display the data
# display a few records of the raw data
crab_age_raw_data_few_records <- kable(head(crab_age_train_raw, 50), "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "850px", height = "350px")
crab_age_raw_data_few_records| id | Sex | Length | Diameter | Height | Weight | Shucked_Weight | Viscera_Weight | Shell_Weight | Age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | I | 1.5250 | 1.1750 | 0.3750 | 28.973189 | 12.7289255 | 6.6479577 | 8.3489277 | 9 |
| 1 | I | 1.1000 | 0.8250 | 0.2750 | 10.418441 | 4.5217453 | 2.3246590 | 3.4019400 | 8 |
| 2 | M | 1.3875 | 1.1125 | 0.3750 | 24.777463 | 11.3398000 | 5.5565020 | 6.6621325 | 9 |
| 3 | F | 1.7000 | 1.4125 | 0.5000 | 50.660556 | 20.3549410 | 10.9918385 | 14.9968855 | 11 |
| 4 | I | 1.2500 | 1.0125 | 0.3375 | 23.289114 | 11.9776637 | 4.5075705 | 5.9533950 | 8 |
| 5 | M | 1.5000 | 1.1750 | 0.4125 | 28.845616 | 13.4093135 | 6.7897052 | 7.9378600 | 10 |
| 6 | M | 1.5750 | 1.1375 | 0.3500 | 30.022120 | 11.9351395 | 7.3425205 | 8.6465975 | 11 |
| 7 | I | 1.3125 | 1.0250 | 0.3500 | 18.299602 | 8.2497045 | 3.8980562 | 5.6699000 | 11 |
| 8 | F | 1.6000 | 1.2875 | 0.4375 | 38.824640 | 16.9671758 | 7.4133942 | 10.7728100 | 12 |
| 9 | M | 1.0250 | 0.7625 | 0.2625 | 10.305043 | 4.4933958 | 2.1262125 | 2.9766975 | 11 |
| 10 | I | 0.9250 | 0.7250 | 0.2125 | 6.109317 | 2.6932025 | 1.7293195 | 2.1262125 | 7 |
| 11 | F | 1.2750 | 0.9750 | 0.3750 | 19.575330 | 7.3425205 | 4.2807745 | 6.2368900 | 10 |
| 12 | M | 1.3500 | 1.0250 | 0.3500 | 20.964455 | 8.5048500 | 4.6351432 | 6.9456275 | 10 |
| 13 | I | 1.1000 | 0.8625 | 0.2875 | 11.439023 | 4.5075705 | 2.3388338 | 3.4019400 | 7 |
| 14 | M | 1.3125 | 0.9875 | 0.3125 | 17.449117 | 8.5473743 | 3.9831047 | 5.2446575 | 9 |
| 15 | I | 0.9375 | 0.7500 | 0.2000 | 7.243297 | 3.2318430 | 1.6300962 | 2.1262125 | 7 |
| 16 | I | 1.0125 | 0.7750 | 0.3000 | 9.454558 | 3.7846583 | 2.6081540 | 2.8349500 | 8 |
| 17 | F | 1.6375 | 1.2500 | 0.4250 | 39.179009 | 16.2584382 | 8.7174712 | 10.4893150 | 12 |
| 18 | I | 1.1375 | 0.9125 | 0.2875 | 13.366789 | 6.8605790 | 2.6648530 | 3.9689300 | 7 |
| 19 | I | 0.6750 | 0.4875 | 0.1000 | 2.083688 | 0.8788345 | 0.6378637 | 0.7370870 | 5 |
| 20 | M | 1.5500 | 1.2000 | 0.4250 | 32.772022 | 14.5432935 | 6.2368900 | 9.0860148 | 9 |
| 21 | F | 1.0875 | 0.8750 | 0.3125 | 15.379604 | 5.3438808 | 3.2460178 | 4.9611625 | 14 |
| 22 | F | 1.7375 | 1.3375 | 0.4125 | 47.513762 | 21.1770765 | 11.4957222 | 13.8912550 | 10 |
| 23 | M | 1.4375 | 1.1000 | 0.3125 | 24.805812 | 12.0343627 | 5.5848515 | 6.3786375 | 9 |
| 24 | I | 0.9250 | 0.7125 | 0.2375 | 6.988152 | 2.6365035 | 1.4883488 | 2.1262125 | 7 |
| 25 | M | 1.6125 | 1.2500 | 0.4375 | 39.774349 | 17.0097000 | 8.8308692 | 9.7805775 | 11 |
| 26 | F | 1.4250 | 1.0750 | 0.3875 | 23.473386 | 11.6232950 | 5.8966960 | 4.2665998 | 10 |
| 27 | I | 1.2000 | 0.9375 | 0.3125 | 15.762322 | 6.7613558 | 2.3671832 | 4.7768907 | 8 |
| 28 | F | 1.3250 | 1.0750 | 0.4125 | 20.326591 | 8.1930055 | 4.9469878 | 5.5281525 | 9 |
| 29 | I | 0.6250 | 0.5000 | 0.1500 | 2.608154 | 0.6945628 | 0.5102910 | 0.5102910 | 6 |
| 30 | M | 1.2750 | 0.9500 | 0.3375 | 16.910477 | 7.3425205 | 3.9405805 | 4.9611625 | 9 |
| 31 | M | 1.5500 | 1.2000 | 0.4125 | 34.699788 | 17.1089232 | 7.6401902 | 8.5048500 | 11 |
| 32 | F | 1.7625 | 1.3875 | 0.5125 | 46.677452 | 23.0906678 | 6.0809678 | 11.2972758 | 13 |
| 33 | I | 1.1875 | 0.9000 | 0.3125 | 12.813974 | 5.4147545 | 2.8066005 | 3.9689300 | 9 |
| 34 | F | 1.5875 | 1.2375 | 0.4250 | 34.614739 | 14.7842642 | 8.6891218 | 8.2213550 | 13 |
| 35 | F | 1.4250 | 1.1500 | 0.3875 | 26.308336 | 11.7933920 | 5.5281525 | 7.7961125 | 10 |
| 36 | F | 1.7250 | 1.3125 | 0.5000 | 47.513762 | 19.6462035 | 11.5665960 | 13.3242650 | 16 |
| 37 | F | 1.2500 | 0.9750 | 0.3250 | 18.016107 | 7.4700932 | 4.4366968 | 5.2446575 | 8 |
| 38 | I | 0.5125 | 0.3750 | 0.1375 | 1.190679 | 0.4819415 | 0.1559222 | 0.1842718 | 4 |
| 39 | M | 1.1250 | 0.8500 | 0.3250 | 12.941547 | 5.9533950 | 2.8774743 | 3.4019400 | 9 |
| 40 | M | 1.5375 | 1.3125 | 0.4125 | 37.988330 | 16.4427100 | 9.7805775 | 10.4893150 | 11 |
| 41 | M | 1.4375 | 1.1250 | 0.3750 | 24.933385 | 9.8372765 | 5.9392202 | 7.7961125 | 18 |
| 42 | M | 1.5250 | 1.2125 | 0.3875 | 34.841535 | 15.4504775 | 7.4559185 | 8.2213550 | 10 |
| 43 | F | 1.6125 | 1.2875 | 0.4125 | 35.621147 | 11.4531980 | 6.4353365 | 11.0563050 | 15 |
| 44 | M | 1.6750 | 1.2875 | 0.4750 | 39.745999 | 17.9027092 | 8.3347530 | 11.0563050 | 10 |
| 45 | I | 1.0375 | 0.7750 | 0.2250 | 10.957082 | 4.2807745 | 1.7718437 | 2.6932025 | 7 |
| 46 | I | 1.1250 | 0.8375 | 0.2500 | 14.004653 | 6.0667930 | 2.9625228 | 3.8271825 | 7 |
| 47 | F | 1.8875 | 1.4125 | 0.5125 | 57.606184 | 21.2621250 | 12.4879548 | 16.4427100 | 17 |
| 48 | M | 1.4375 | 1.1000 | 0.3625 | 24.720764 | 10.0640725 | 5.4005798 | 7.1157245 | 9 |
| 49 | I | 1.1750 | 0.9250 | 0.3125 | 15.223681 | 7.9236852 | 3.5011632 | 4.9611625 | 8 |
Univariate Descriptive Statistics
## id Sex Length Diameter
## Min. : 0 Length:74051 Min. :0.1875 Min. :0.1375
## 1st Qu.:18513 Class :character 1st Qu.:1.1500 1st Qu.:0.8875
## Median :37025 Mode :character Median :1.3750 Median :1.0750
## Mean :37025 Mean :1.3175 Mean :1.0245
## 3rd Qu.:55538 3rd Qu.:1.5375 3rd Qu.:1.2000
## Max. :74050 Max. :2.0128 Max. :1.6125
## Height Weight Shucked_Weight Viscera_Weight
## Min. :0.0000 Min. : 0.0567 Min. : 0.02835 Min. : 0.04252
## 1st Qu.:0.3000 1st Qu.:13.4377 1st Qu.: 5.71242 1st Qu.: 2.86330
## Median :0.3625 Median :23.7994 Median : 9.90815 Median : 4.98951
## Mean :0.3481 Mean :23.3852 Mean :10.10427 Mean : 5.05839
## 3rd Qu.:0.4125 3rd Qu.:32.1625 3rd Qu.:14.03300 3rd Qu.: 6.98815
## Max. :2.8250 Max. :80.1015 Max. :42.18406 Max. :21.54562
## Shell_Weight Age
## Min. : 0.04252 Min. : 1.000
## 1st Qu.: 3.96893 1st Qu.: 8.000
## Median : 6.93145 Median :10.000
## Mean : 6.72387 Mean : 9.968
## 3rd Qu.: 9.07184 3rd Qu.:11.000
## Max. :28.49125 Max. :29.000
Scatter Plots
# Age vs Sex
scatter_age_sex = crab_age_train_raw %>% ggplot(aes(x = Sex, y = Age)) + geom_point(aes(color = Sex)) +
labs(title = "Crab Age vs Sex", x = "Sex", y = "Crab Age", fill = "Sex") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age vs Length
scatter_age_length = crab_age_train_raw %>% ggplot(aes(x = Length, y = Age)) + geom_point(aes(color = Sex)) +
labs(title = "Crab Age vs Length", x = "Length", y = "Crab Age", fill = "Sex") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age vs Diameter
scatter_age_diameter = crab_age_train_raw %>% ggplot(aes(x = Diameter, y = Age)) + geom_point(aes(color = Sex)) +
labs(title = "Crab Age vs Diameter", x = "Diameter", y = "Crab Age", fill = "Sex") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age vs Height
scatter_age_height = crab_age_train_raw %>% ggplot(aes(x = Height, y = Age)) + geom_point(aes(color = Sex)) +
labs(title = "Crab Age vs Height", x = "Height", y = "Crab Age", fill = "Sex") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age vs Weight
scatter_age_weight = crab_age_train_raw %>% ggplot(aes(x = Weight, y = Age)) + geom_point(aes(color = Sex)) +
labs(title = "Crab Age vs Weight", x = "Weight", y = "Crab Age", fill = "Sex") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age vs Shucked_Weight
scatter_age_shucked_weight = crab_age_train_raw %>% ggplot(aes(x = Shucked_Weight, y = Age)) + geom_point(aes(color = Sex)) +
labs(title = "Crab Age vs Shucked_Weight", x = "Shucked_Weight", y = "Crab Age", fill = "Sex") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age vs Shucked_Weight
scatter_age_viscera_weight = crab_age_train_raw %>% ggplot(aes(x = Viscera_Weight, y = Age)) + geom_point(aes(color = Sex)) +
labs(title = "Crab Age vs Viscera_Weight", x = "Viscera_Weight", y = "Crab Age", fill = "Sex") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age vs Shucked_Weight
scatter_age_shell_weight = crab_age_train_raw %>% ggplot(aes(x = Shell_Weight, y = Age)) + geom_point(aes(color = Sex)) +
labs(title = "Crab Age vs Shell_Weight", x = "Shell_Weight", y = "Crab Age", fill = "Sex") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Plot grid of all plots
crab_age_scatter_plot1 <- plot_grid(scatter_age_sex, scatter_age_length, scatter_age_diameter, scatter_age_height,
byrow = TRUE, nrow = 2)
crab_age_scatter_plot2 <- plot_grid(scatter_age_weight, scatter_age_shucked_weight, scatter_age_viscera_weight, scatter_age_shell_weight,
byrow = TRUE, nrow = 2)
#display all plots
crab_age_scatter_plot1Correlation Matrix
Derive a correlation matrix for any three quantitative variables in the
dataset.
#using the Diameter, Length, and Weight
correlation_data = crab_age_train_raw[, c("Diameter", "Length", "Weight")]
correlation_matrix = round(cor(correlation_data), 2)
# display the correlation matrix
correlation_matrix## Diameter Length Weight
## Diameter 1.00 0.99 0.94
## Length 0.99 1.00 0.94
## Weight 0.94 0.94 1.00
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
# Find the pariwise correlation between Diameter and Length using a confidence interval of 80%
pairwise_corr_dia_len = cor.test(correlation_data$Diameter, correlation_data$Length, conf.level = 0.8)
pairwise_corr_dia_len##
## Pearson's product-moment correlation
##
## data: correlation_data$Diameter and correlation_data$Length
## t = 1857.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9893379 0.9895359
## sample estimates:
## cor
## 0.9894374
# Find the pariwise correlation between Diameter and Weight using a confidence interval of 80%
pairwise_corr_dia_weight = cor.test(correlation_data$Diameter, correlation_data$Weight, conf.level = 0.8)
pairwise_corr_dia_weight##
## Pearson's product-moment correlation
##
## data: correlation_data$Diameter and correlation_data$Weight
## t = 737.99, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9376824 0.9388098
## sample estimates:
## cor
## 0.9382486
# Find the pariwise correlation between Length and Weight using a confidence interval of 80%
pairwise_corr_len_weight = cor.test(correlation_data$Length, correlation_data$Weight, conf.level = 0.8)
pairwise_corr_len_weight##
## Pearson's product-moment correlation
##
## data: correlation_data$Length and correlation_data$Weight
## t = 725.93, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9357910 0.9369515
## sample estimates:
## cor
## 0.9363738
From the correlation test results for all three cases above, we see that the p-value is very small (almost zero). Hence, we reject the Null Hypothesis that the true correlation is 0. Hence, the true correlation is NOT zero(0) for any of the three pairs of variables - Diameter, Length, and Weight.
Familywise Error
The family-wise error rate (FWER) is the probability of at least 1 false
positive when multiple comparisons are being tested. Basically, FWER is
the probability of a coming to at least one false conclusion in a series
of hypothesis tests. In other words, it’s the probability of making at
least one Type I Error.
Using the Bonferroni correction which adjusts the level of significance
for a family of tests as adjusted \(\alpha\) = \(\frac{\alpha}{n}\) where n is the number of
tests.
In this case, with 3 correlation tests:
Alpha level for each individual test = 1 - 80% = 0.20 Number of tests = 3
Adjusted alpha level for multiple tests = 0.20 / 3 ≈ 0.067
Therefore, using the Bonferroni correction, the adjusted significance
level for each individual correlation test to control the familywise
error rate at an overall 80% confidence level is approximately 0.067 or
6.7%. Based on this result, I will not be worried about familywise
error.
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Invert Correlation Matrix
Using the solve function to invert the correlation
matrix.
## Diameter Length Weight
## Diameter 52.244165 -47.755835 -4.219031
## Length -47.755835 52.244165 -4.219031
## Weight -4.219031 -4.219031 8.931777
Multiply correlation_matrix by the precision_matrix
# multiply correlation matrix by precision matrix
mul_corr_matrix_precision = round(correlation_matrix %*% precision_matrix, 2)
mul_corr_matrix_precision## Diameter Length Weight
## Diameter 1 0 0
## Length 0 1 0
## Weight 0 0 1
This shows that a matrix multiplied by its inverse gives us the identity matrix.
Multiply the precision_matrix by the correlation_matrix
# multiply precision matrix by correlation matrix
mul_precision_by_corr_matrix = round(precision_matrix %*% correlation_matrix, 2)
mul_precision_by_corr_matrix## Diameter Length Weight
## Diameter 1 0 0
## Length 0 1 0
## Weight 0 0 1
This also shows that the inverse of a matrix multiplied by the matrix is still the identity matrix.
Essentially, this proves the theorem that if A is a square matrix and \(A^{-1}\) exists, then \(AA^{-1}=A^{-1}A = I\) where \(I\) is the Identity matrix. This holds only for square matrices and their inverses.
Conduct LU decomposition on the matrix
## $L
## [,1] [,2] [,3]
## [1,] 1.00 0.0000000 0
## [2,] 0.99 1.0000000 0
## [3,] 0.94 0.4723618 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.9900 0.9400000
## [2,] 0 0.0199 0.0094000
## [3,] 0 0.0000 0.1119598
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
Find Skewness
Using the skewness function in e1071 package,
we compute the skewness of the variables.
Note:
Positive skewness (>0): Indicates right-skewed distribution (tail is
longer on the right).
Negative skewness (<0): Indicates left-skewed distribution (tail is
longer on the left).
Skewness around 0: Indicates approximately symmetric distribution.
# Get skewness for each column
for (col in colnames(crab_age_train_raw)) {
if (!(col %in% c("Sex", "id"))){
skew_val <- skewness(crab_age_train_raw[[col]])
cat(sprintf("The skewness for the %s variable is: %.2f\n", col, round(skew_val, 2)))
}
}## The skewness for the Length variable is: -0.84
## The skewness for the Diameter variable is: -0.81
## The skewness for the Height variable is: 0.09
## The skewness for the Weight variable is: 0.23
## The skewness for the Shucked_Weight variable is: 0.35
## The skewness for the Viscera_Weight variable is: 0.29
## The skewness for the Shell_Weight variable is: 0.28
## The skewness for the Age variable is: 1.09
Therefore, we select the Weight variable since it is one of the
right-skewed variables.
Check the minimum value of Weight
## [1] 0.056699
Since the minimum value of Weight is not negative, no need for adjustment.
Histogram to check distribution of Weight
#hist(crab_age_train_raw$Weight, breaks = 20, xlab = 'Weight', main = "Histogram of Weight")
hist_weight = crab_age_train_raw %>% ggplot(aes(x = Weight)) +
geom_histogram(binwidth = 0.5, color = "black", fill = "skyblue", alpha = 0.7) +
labs(title = "Histogram for Weight", x = "Weight", y = "Frequency") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
hist_weightDensity Plot to visualize the distribution of Weight
density_weight = crab_age_train_raw %>% ggplot(aes(x = Weight)) + geom_density() +
labs(title = "Density Plot for Weight", x = "Weight", y = "Density") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
density_weightFrom the Histogram and Density plot, we can see that the Weight variable is skewed to the right.
Load the MASS package and run fitdistr to fit an exponential probability density function
prob_density_function <- fitdistr(crab_age_train_raw$Weight, densfun = 'exponential')
lambda <- prob_density_function$estimate
exponential_pdf <- rexp(1000, lambda)Optimal Value of λ:
The optimal value of lambda = 1/λ.
## rate
## 23.3852
Plot a histogram and compare it with a histogram of your
original variable
Using ggplot2
# histogram for exponential distribution
exp_pdf = exponential_pdf %>% tibble()
colnames(exp_pdf) = "Weight"
hist_exponential_pdf = exp_pdf %>% ggplot(aes(x = Weight)) +
geom_histogram(binwidth = 1, color = "black", fill = "skyblue", alpha = 0.7) +
labs(title = "Histogram for Weight-Exponential PDF", x = "Weight", y = "Frequency") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
plot_grid(hist_weight, NULL, hist_exponential_pdf, nrow = 1, rel_widths = c(1, 0.008, 1))Using Base R
par(mfrow=c(1,2))
hist(crab_age_train_raw$Weight, breaks = 20, col="violet", main = "Histogram - Original", xlab = 'Weight')
hist(exponential_pdf, breaks = 20, col="royalblue", main = "Histogram - Exponential PDF", xlab = 'Weight')The 5th and 95th percentiles using the CDF are:
cdf_5th_95th_percentiles = round(quantile(exponential_pdf, c(0.05, 0.95)), 4)
cdf_5th_95th_percentiles## 5% 95%
## 1.2535 67.5297
95% confidence interval from empirical data
##
## One Sample t-test
##
## data: crab_age_train_raw$Weight
## t = 503.13, df = 74050, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 23.29412 23.47632
## sample estimates:
## mean of x
## 23.38522
The 5th and 95th percentiles using the empirical data are:
empirical_5th_95th_percentiles = round(quantile(crab_age_train_raw$Weight, c(0.05, 0.95)), 4)
empirical_5th_95th_percentiles## 5% 95%
## 3.6004 44.2110
Comparing the 5th and 95th percentiles of the exponential pdf and the empirical data as well as their histograms, we can clearly see the exponential pdf is much more skewed to the right than the original/empirical data. Also, the original empirical data appears to be better and the exponential data is not a good approximation of the original data in this case.
Conduct Exploratory Data Analysis Prior to Modeling
glimpse of the data
## Rows: 74,051
## Columns: 10
## $ id <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Sex <chr> "I", "I", "M", "F", "I", "M", "M", "I", "F", "M", "I", …
## $ Length <dbl> 1.5250, 1.1000, 1.3875, 1.7000, 1.2500, 1.5000, 1.5750,…
## $ Diameter <dbl> 1.1750, 0.8250, 1.1125, 1.4125, 1.0125, 1.1750, 1.1375,…
## $ Height <dbl> 0.3750, 0.2750, 0.3750, 0.5000, 0.3375, 0.4125, 0.3500,…
## $ Weight <dbl> 28.973189, 10.418441, 24.777463, 50.660556, 23.289114, …
## $ Shucked_Weight <dbl> 12.7289255, 4.5217453, 11.3398000, 20.3549410, 11.97766…
## $ Viscera_Weight <dbl> 6.6479577, 2.3246590, 5.5565020, 10.9918385, 4.5075705,…
## $ Shell_Weight <dbl> 8.348928, 3.401940, 6.662133, 14.996885, 5.953395, 7.93…
## $ Age <dbl> 9, 8, 9, 11, 8, 10, 11, 11, 12, 11, 7, 10, 10, 7, 9, 7,…
There are about 74,051 observations and 10 variables (9 predictor variables and 1 response variable). The response variable is the age and we aim to predict the age of the crab using the other features given. Only Sex is the categorical variable in the data.
Check for missing Values
# use missmap function from the Amelia package to check for NA values
missmap(crab_age_train_raw,
plot.background = element_rect(fill = "antiquewhite"),
main = "Crab Age Dataset - Missing Values",
x.cex = 0.45,
y.cex = 0.6,
margins = c(7.1, 7.1),
col = c("yellow", "black"), legend = FALSE)As can be seen from the missing values plot, there are no missing values.
Histogram for Age to check the Age distribution
hist_age = crab_age_train_raw %>% ggplot(aes(x = Age)) +
geom_histogram(binwidth = 1, color = "black", fill = "skyblue", alpha = 0.7) +
labs(title = "Histogram for Age", x = "Age", y = "Frequency") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
hist_ageFrom the Age distribution, we can see that ages of crabs given are between 3 and 20.
Bar Plot for Sex
# group the data by sex and get count/percentage
crab_sex_grouped <- crab_age_train_raw %>% group_by(Sex) %>%
summarise(count = n()) %>%
mutate(percentage = round((count / sum(count) * 100), 2))
# bar plot using ggplot2
bar_plot_crab_sex <- crab_sex_grouped %>% ggplot(aes(x=factor(Sex), y = percentage, fill = factor(Sex))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Bar Chart - Crab Sex", x = "Crab Sex", y = "Percentage",
fill = "Crab Sex") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentages
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
bar_plot_crab_sexEach of the crab sex constitute a little above 30% of the data with the “M” category being a little more than 35% and representing the highest category by a small margin.
Data Pre-Processing
Develop a function to pre-process the data and make it ready for
training. This function can be easily used to pre-process any new test
data to avoid code duplications and errors.
# data pre-processing
data_preprocess_scaling <- function(df){
# This function standardizes the numeric variables of the df using the standard normal method
df <- as.data.frame(df)
if ("Age" %in% colnames(df)){
df_char <- df %>% select(Age, Sex)
df_numeric <- df %>% select(-Sex, -Age) # Exclude Age from scaling as well.
} else {
df_char <- df %>% select(Sex)
df_numeric <- df %>% select(-Sex) # Exclude Age from scaling as well.
}
# scale the other numeric values using standard scaler
df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
# combine the categorical and numeric features and drop the id column
df_scaled_combined <- cbind(df_char, df_numeric_scaled) %>% select(-id)
return(df_scaled_combined)
}Pre-process the train data
# pre-process the data using the function above
crab_data_pre_processed = data_preprocess_scaling(crab_age_train_raw)
# display some records for the standardized data
kable(head(crab_data_pre_processed, 50), "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "850px", height = "350px")| Age | Sex | Length | Diameter | Height | Weight | Shucked_Weight | Viscera_Weight | Shell_Weight |
|---|---|---|---|---|---|---|---|---|
| 9 | I | 0.7212332 | 0.6339778 | 0.2923977 | 0.4418014 | 0.4671847 | 0.5691823 | 0.4533730 |
| 8 | I | -0.7557067 | -0.8403503 | -0.7941577 | -1.0251912 | -0.9936809 | -0.9788731 | -0.9267816 |
| 9 | M | 0.2433997 | 0.3707049 | 0.2923977 | 0.1100750 | 0.2199225 | 0.1783618 | -0.0172241 |
| 11 | F | 1.3293850 | 1.6344147 | 1.6505920 | 2.1564682 | 1.8246040 | 2.1246076 | 2.3080794 |
| 8 | I | -0.2344338 | -0.0505317 | -0.1150606 | -0.0075982 | 0.3334613 | -0.1972320 | -0.2149540 |
| 10 | M | 0.6343544 | 0.6339778 | 0.6998560 | 0.4317152 | 0.5882928 | 0.6199382 | 0.3386897 |
| 11 | M | 0.8949909 | 0.4760140 | 0.0207589 | 0.5247330 | 0.3258920 | 0.8178863 | 0.5364196 |
| 11 | I | -0.0172367 | 0.0021229 | 0.0207589 | -0.4020836 | -0.3301098 | -0.4154824 | -0.2940460 |
| 12 | F | 0.9818697 | 1.1078689 | 0.9714949 | 1.2206860 | 1.2215869 | 0.8432642 | 1.1296093 |
| 11 | M | -1.0163432 | -1.1036232 | -0.9299772 | -1.0341568 | -0.9987271 | -1.0499314 | -1.0454195 |
| 7 | I | -1.3638585 | -1.2615869 | -1.4732549 | -1.3658832 | -1.3191587 | -1.1920479 | -1.2826954 |
| 10 | F | -0.1475550 | -0.2084954 | 0.2923977 | -0.3012208 | -0.4915872 | -0.2784415 | -0.1358620 |
| 10 | M | 0.1130815 | 0.0021229 | 0.0207589 | -0.1913925 | -0.2846943 | -0.1515517 | 0.0618678 |
| 7 | I | -0.7557067 | -0.6823866 | -0.6583383 | -0.9445010 | -0.9962040 | -0.9737975 | -0.9267816 |
| 9 | M | -0.0172367 | -0.1558409 | -0.3866994 | -0.4693254 | -0.2771250 | -0.3850289 | -0.4126839 |
| 7 | I | -1.3204191 | -1.1562778 | -1.6090743 | -1.2762274 | -1.2232815 | -1.2275771 | -1.2826954 |
| 8 | I | -1.0597826 | -1.0509686 | -0.5225189 | -1.1013987 | -1.1248813 | -0.8773613 | -1.0849655 |
| 12 | F | 1.1121879 | 0.9499052 | 0.8356754 | 1.2487034 | 1.0954327 | 1.3102187 | 1.0505173 |
| 7 | I | -0.6253885 | -0.4717683 | -0.6583383 | -0.7920862 | -0.5773720 | -0.8570589 | -0.7685977 |
| 5 | I | -2.2326467 | -2.2620238 | -2.6956297 | -1.6841612 | -1.6421135 | -1.5828685 | -1.6702460 |
| 9 | M | 0.8081121 | 0.7392869 | 0.8356754 | 0.7421483 | 0.7901395 | 0.4219901 | 0.6590121 |
| 14 | F | -0.7991461 | -0.6297320 | -0.3866994 | -0.6329472 | -0.8473420 | -0.6489597 | -0.4917758 |
| 10 | F | 1.4597032 | 1.3184872 | 0.6998560 | 1.9076734 | 1.9709428 | 2.3050346 | 1.9996208 |
| 9 | M | 0.4171574 | 0.3180503 | -0.3866994 | 0.1123164 | 0.3435536 | 0.1885129 | -0.0963161 |
| 7 | I | -1.3638585 | -1.3142415 | -1.2016160 | -1.2964000 | -1.3292511 | -1.2783330 | -1.2826954 |
| 11 | M | 1.0253091 | 0.9499052 | 0.9714949 | 1.2957727 | 1.2291561 | 1.3508234 | 0.8527874 |
| 10 | F | 0.3737180 | 0.2127412 | 0.4282171 | 0.0069709 | 0.2703842 | 0.3001760 | -0.6855511 |
| 8 | I | -0.4081914 | -0.3664592 | -0.3866994 | -0.6026884 | -0.5950336 | -0.9636463 | -0.5431856 |
| 9 | F | 0.0262027 | 0.2127412 | 0.6998560 | -0.2418239 | -0.3402021 | -0.0398887 | -0.3335919 |
| 6 | I | -2.4064043 | -2.2093693 | -2.1523520 | -1.6426954 | -1.6749136 | -1.6285488 | -1.7335196 |
| 9 | M | -0.1475550 | -0.3138046 | -0.1150606 | -0.5119119 | -0.4915872 | -0.4002557 | -0.4917758 |
| 11 | M | 0.8081121 | 0.7392869 | 0.6998560 | 0.8945631 | 1.2468177 | 0.9244737 | 0.4968736 |
| 13 | F | 1.5465820 | 1.5291055 | 1.7864114 | 1.8415523 | 2.3115592 | 0.3661586 | 1.2759294 |
| 9 | I | -0.4516309 | -0.5244229 | -0.3866994 | -0.8357934 | -0.8347266 | -0.8063030 | -0.7685977 |
| 13 | F | 0.9384303 | 0.8972506 | 0.8356754 | 0.8878389 | 0.8330319 | 1.3000675 | 0.4177816 |
| 10 | F | 0.3737180 | 0.5286686 | 0.4282171 | 0.2311103 | 0.3006612 | 0.1682106 | 0.2991437 |
| 16 | F | 1.4162638 | 1.2131781 | 1.6505920 | 1.9076734 | 1.6984498 | 2.3304125 | 1.8414369 |
| 8 | F | -0.2344338 | -0.2084954 | -0.2508800 | -0.4244975 | -0.4688794 | -0.2226100 | -0.4126839 |
| 4 | I | -2.7973590 | -2.7359150 | -2.2881714 | -1.7547651 | -1.7127598 | -1.7554386 | -1.8244753 |
| 9 | M | -0.6688279 | -0.7350412 | -0.2508800 | -0.8257071 | -0.7388494 | -0.7809250 | -0.9267816 |
| 11 | M | 0.7646727 | 1.2131781 | 0.6998560 | 1.1545649 | 1.1282328 | 1.6908880 | 1.0505173 |
| 18 | M | 0.4171574 | 0.4233595 | 0.2923977 | 0.1224027 | -0.0475244 | 0.3154027 | 0.2991437 |
| 10 | M | 0.7212332 | 0.7919415 | 0.4282171 | 0.9057701 | 0.9516169 | 0.8584910 | 0.4177816 |
| 15 | F | 1.0253091 | 1.1078689 | 0.6998560 | 0.9674084 | 0.2401072 | 0.4930484 | 1.2087012 |
| 10 | M | 1.2425062 | 1.1078689 | 1.3789531 | 1.2935313 | 1.3881104 | 1.1731777 | 1.2087012 |
| 7 | I | -0.9729038 | -1.0509686 | -1.3374354 | -0.9826048 | -1.0365733 | -1.1768212 | -1.1245115 |
| 7 | I | -0.6688279 | -0.7876958 | -1.0657966 | -0.7416548 | -0.7186647 | -0.7504715 | -0.8081437 |
| 17 | F | 1.9809762 | 1.6344147 | 1.7864114 | 2.7056098 | 1.9860813 | 2.6603260 | 2.7114484 |
| 9 | M | 0.4171574 | 0.3180503 | 0.1565783 | 0.1055923 | -0.0071550 | 0.1225303 | 0.1093230 |
| 8 | I | -0.4950703 | -0.4191137 | -0.3866994 | -0.6452749 | -0.3881407 | -0.5575990 | -0.4917758 |
Train Test Split
Use the CaTools library to split the cleaned dataset into training and testing datasets in 70:30 ratio.
# Set a seed
set.seed(1994)
#Split the sample
sampling <- sample.split(crab_data_pre_processed$Age, SplitRatio = 0.7)
# Training Data
df_train_subset <- subset(crab_data_pre_processed, sampling == TRUE)
# Testing Data
df_test_subset <- subset(crab_data_pre_processed, sampling == FALSE)Train the model
multiple_regression_model = lm(Age ~ ., data = df_train_subset)
# display the model
summary(multiple_regression_model)##
## Call:
## lm(formula = Age ~ ., data = df_train_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.4992 -1.2221 -0.3402 0.7562 18.2745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.34098 0.01807 572.244 < 2e-16 ***
## SexI -1.02614 0.03016 -34.019 < 2e-16 ***
## SexM -0.11444 0.02287 -5.004 5.63e-07 ***
## Length 0.25076 0.06591 3.805 0.000142 ***
## Diameter 0.54040 0.06754 8.001 1.26e-15 ***
## Height 0.60991 0.02513 24.267 < 2e-16 ***
## Weight 2.36161 0.08195 28.817 < 2e-16 ***
## Shucked_Weight -3.44562 0.04452 -77.391 < 2e-16 ***
## Viscera_Weight -0.57802 0.03964 -14.584 < 2e-16 ***
## Shell_Weight 1.93384 0.04217 45.862 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.127 on 51826 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5513
## F-statistic: 7076 on 9 and 51826 DF, p-value: < 2.2e-16
Predictions
predictions_test = round(predict(multiple_regression_model, df_test_subset), 0) # round predictions to nearest ageEvaluate the Model
mean_absolute_error = round(Metrics::mae(df_test_subset$Age, predictions_test), 2)
mean_sq_error = round(Metrics::mse(df_test_subset$Age, predictions_test), 2)
root_mean_sq_error = round(Metrics::rmse(df_test_subset$Age, predictions_test), 2)
model_eval_metrics = c(mean_absolute_error, mean_sq_error, root_mean_sq_error) %>% t()
column_names = c("Mean Absolute Error", "Mean Square Error", "Root Mean Square Error")
evaluation_metrics = data.frame(values = model_eval_metrics)
colnames(evaluation_metrics) = column_names
# combine the results and display
kable(evaluation_metrics, "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "450px", height = "100px")| Mean Absolute Error | Mean Square Error | Root Mean Square Error |
|---|---|---|
| 1.46 | 4.63 | 2.15 |
From the MAE, and RMSE, we can see that the model is able to predict the Age of Crabs within 2 units of the Crab Age.
Using the test data provided to make
predictions
Read test data from remote location
url_crab_age_test = "https://raw.githubusercontent.com/chinedu2301/data605-computer-maths/main/final_project/data/crab_age_test_data.csv"
crab_age_test_raw = read_csv(url_crab_age_test)
crab_age_test_raw = crab_age_test_raw %>% rename(
Shucked_Weight = `Shucked Weight`,
Viscera_Weight = `Viscera Weight`,
Shell_Weight = `Shell Weight`
)Display a few records of the test data:
kable(head(crab_age_test_raw, 100), "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "850px", height = "350px")| id | Sex | Length | Diameter | Height | Weight | Shucked_Weight | Viscera_Weight | Shell_Weight |
|---|---|---|---|---|---|---|---|---|
| 74051 | I | 1.0500 | 0.7625 | 0.2750 | 8.6182480 | 3.6570855 | 1.7293195 | 2.7215520 |
| 74052 | I | 1.1625 | 0.8875 | 0.2750 | 15.5071765 | 7.0306760 | 3.2460178 | 3.9689300 |
| 74053 | F | 1.2875 | 0.9875 | 0.3250 | 14.5716430 | 5.5565020 | 3.8838815 | 4.8194150 |
| 74054 | F | 1.5500 | 0.9875 | 0.3875 | 28.3778495 | 13.3809640 | 6.5487345 | 7.0306760 |
| 74055 | I | 1.1125 | 0.8500 | 0.2625 | 11.7650425 | 5.5281525 | 2.4664065 | 3.3310663 |
| 74056 | M | 1.4250 | 1.1125 | 0.3500 | 24.8341620 | 8.7316460 | 5.7124242 | 8.0796075 |
| 74057 | M | 1.7125 | 1.3250 | 0.4500 | 46.6774518 | 21.2337755 | 11.9634890 | 11.3681495 |
| 74058 | I | 1.1750 | 0.8875 | 0.3000 | 13.9337793 | 5.6982495 | 2.9908722 | 3.9405805 |
| 74059 | F | 1.3125 | 1.0625 | 0.2875 | 23.5867840 | 8.1646560 | 5.6273757 | 8.4623258 |
| 74060 | F | 1.4250 | 1.0500 | 0.3125 | 17.7893112 | 7.1015498 | 4.6067937 | 5.5281525 |
| 74061 | M | 1.5125 | 1.2000 | 0.4750 | 32.2333815 | 12.9415468 | 6.2227153 | 9.8372765 |
| 74062 | F | 1.5625 | 1.2000 | 0.4125 | 36.5141560 | 17.2931950 | 8.6040733 | 8.9300925 |
| 74063 | I | 0.4375 | 0.3250 | 0.1250 | 0.8221355 | 0.3543688 | 0.1842718 | 0.2834950 |
| 74064 | F | 1.6250 | 1.2875 | 0.4125 | 39.1931838 | 16.6836808 | 7.4700932 | 11.9067900 |
| 74065 | I | 0.6750 | 0.5000 | 0.1500 | 2.7215520 | 1.1906790 | 0.8079608 | 0.8504850 |
| 74066 | I | 0.8375 | 0.6250 | 0.1750 | 4.8194150 | 1.7718437 | 0.7654365 | 1.2473780 |
| 74067 | M | 1.6750 | 1.3000 | 0.4000 | 41.6312408 | 18.8524175 | 9.6955290 | 11.0563050 |
| 74068 | M | 1.4625 | 1.1750 | 0.3250 | 28.2644515 | 11.4248485 | 6.0384435 | 9.4970825 |
| 74069 | F | 1.6750 | 1.3000 | 0.3625 | 40.8374547 | 17.2931950 | 9.1143642 | 9.0860148 |
| 74070 | M | 1.5875 | 1.2625 | 0.4125 | 35.5077487 | 17.0805738 | 8.2922287 | 9.4970825 |
| 74071 | M | 1.6000 | 1.2500 | 0.4750 | 31.1277510 | 11.1271788 | 6.6763072 | 12.8423235 |
| 74072 | F | 1.2625 | 0.9375 | 0.3000 | 16.8396030 | 7.5267922 | 3.0192217 | 4.8194150 |
| 74073 | I | 1.1875 | 0.9000 | 0.3125 | 13.8062065 | 5.3297060 | 3.6003865 | 4.3941725 |
| 74074 | I | 1.4375 | 1.1125 | 0.3625 | 25.6988218 | 12.2328093 | 5.9250455 | 6.7755305 |
| 74075 | F | 1.5875 | 1.2500 | 0.4750 | 33.8493030 | 13.9479540 | 5.8966960 | 9.4970825 |
| 74076 | F | 1.2250 | 0.9625 | 0.3000 | 17.7893112 | 8.3489277 | 3.8271825 | 5.6699000 |
| 74077 | F | 1.8500 | 1.4125 | 0.5125 | 57.6061840 | 27.6691120 | 9.6104805 | 18.7106700 |
| 74078 | F | 1.5875 | 1.2125 | 0.4500 | 36.5708550 | 15.5213513 | 8.6324228 | 9.9790240 |
| 74079 | M | 1.2500 | 1.0000 | 0.3000 | 16.0741665 | 7.1015498 | 4.4366968 | 4.9611625 |
| 74080 | F | 1.8375 | 1.4875 | 0.4500 | 53.2970600 | 25.7838702 | 11.7650425 | 13.8912550 |
| 74081 | F | 1.6250 | 1.3000 | 0.3875 | 38.7821160 | 17.5341658 | 8.1646560 | 10.3475675 |
| 74082 | M | 1.7625 | 1.4000 | 0.3875 | 47.4854125 | 22.5945515 | 11.6091202 | 10.9996060 |
| 74083 | M | 1.6250 | 1.2375 | 0.4000 | 39.4199798 | 19.6603782 | 8.3772773 | 9.9223250 |
| 74084 | M | 1.6875 | 1.3375 | 0.5125 | 39.7743485 | 17.0805738 | 8.2922287 | 11.9067900 |
| 74085 | F | 1.5375 | 1.1500 | 0.4125 | 29.9512468 | 13.6502843 | 5.4431040 | 9.6388300 |
| 74086 | I | 1.0375 | 0.7875 | 0.2375 | 8.1221317 | 4.2665998 | 2.2537852 | 2.5514550 |
| 74087 | M | 1.5750 | 1.3125 | 0.3625 | 27.5415392 | 14.1322258 | 5.5706767 | 7.0165012 |
| 74088 | M | 1.4250 | 1.1625 | 0.4000 | 26.7052290 | 11.9209647 | 5.1596090 | 8.7741702 |
| 74089 | I | 1.2500 | 0.9375 | 0.3250 | 15.4504775 | 6.0242688 | 3.8838815 | 4.7060170 |
| 74090 | M | 1.4250 | 1.1125 | 0.3750 | 23.5159102 | 10.9854312 | 5.5281525 | 7.1015498 |
| 74091 | M | 0.8500 | 0.6375 | 0.2125 | 5.1029100 | 1.9844650 | 1.1906790 | 1.7009700 |
| 74092 | M | 1.5375 | 1.1750 | 0.3750 | 31.1561005 | 15.2803805 | 6.9598022 | 7.9378600 |
| 74093 | F | 1.5125 | 1.1875 | 0.4375 | 29.2425093 | 13.0549448 | 6.5062102 | 8.0796075 |
| 74094 | M | 1.5000 | 1.1875 | 0.4125 | 29.2708587 | 13.7920317 | 6.1943658 | 7.7961125 |
| 74095 | M | 1.6000 | 1.2625 | 0.5000 | 39.7034747 | 12.4737800 | 8.0796075 | 14.5999925 |
| 74096 | I | 1.4375 | 1.1125 | 0.3375 | 26.1240642 | 10.0357230 | 5.9250455 | 8.5048500 |
| 74097 | I | 0.8625 | 0.6250 | 0.2000 | 3.8697068 | 1.8568922 | 0.9922325 | 1.7009700 |
| 74098 | I | 0.8250 | 0.6250 | 0.1625 | 5.1312595 | 2.0411640 | 0.9780577 | 1.2615527 |
| 74099 | M | 1.3000 | 0.9625 | 0.4125 | 23.7143568 | 6.1376668 | 5.9959192 | 7.0873750 |
| 74100 | F | 1.5000 | 1.1750 | 0.4125 | 32.2900805 | 15.4504775 | 6.8747538 | 8.7883450 |
| 74101 | M | 1.7375 | 1.3375 | 0.4625 | 45.9970638 | 24.6357155 | 9.8372765 | 13.4660125 |
| 74102 | M | 1.4375 | 1.1000 | 0.3750 | 24.7632882 | 11.7650425 | 4.6351432 | 6.5912587 |
| 74103 | M | 1.7375 | 1.3875 | 0.5375 | 61.9011333 | 31.3970712 | 13.5794105 | 14.1747500 |
| 74104 | I | 1.4125 | 1.1250 | 0.3750 | 26.1240642 | 10.0357230 | 5.9250455 | 6.7046567 |
| 74105 | I | 1.1875 | 0.9000 | 0.3250 | 13.5368862 | 5.9533950 | 2.7215520 | 4.5359200 |
| 74106 | M | 1.6625 | 1.3125 | 0.4375 | 41.0075517 | 18.1862042 | 9.7805775 | 11.9067900 |
| 74107 | M | 1.3750 | 1.0625 | 0.3250 | 21.7298918 | 9.7664027 | 4.7768907 | 6.8038800 |
| 74108 | I | 0.6375 | 0.4625 | 0.1750 | 2.2679600 | 0.7370870 | 0.5102910 | 0.7087375 |
| 74109 | I | 1.3125 | 1.0125 | 0.4000 | 20.1139703 | 8.5615490 | 4.7343665 | 5.2446575 |
| 74110 | I | 0.7125 | 0.5250 | 0.1875 | 3.3027168 | 1.2332033 | 0.7796112 | 0.9922325 |
| 74111 | F | 1.5000 | 1.1875 | 0.4375 | 31.7514400 | 14.0471773 | 7.4275690 | 9.0718400 |
| 74112 | I | 0.9875 | 0.7375 | 0.2250 | 8.7741702 | 3.8980562 | 1.9277660 | 2.1687367 |
| 74113 | I | 1.2000 | 0.9750 | 0.3000 | 15.4788270 | 6.3786375 | 3.0192217 | 4.6776675 |
| 74114 | F | 1.6125 | 1.2125 | 0.4375 | 38.6120190 | 18.3846507 | 7.4133942 | 10.5034898 |
| 74115 | I | 1.3500 | 1.1000 | 0.3750 | 26.9887240 | 13.4660125 | 5.8541718 | 7.2007730 |
| 74116 | I | 1.6500 | 1.3125 | 0.4375 | 48.2860657 | 21.7298918 | 9.6104805 | 9.3978592 |
| 74117 | F | 1.6125 | 1.3125 | 0.5000 | 37.3221167 | 15.4363027 | 9.2135875 | 12.0485375 |
| 74118 | F | 1.3125 | 1.1250 | 0.3750 | 25.3728025 | 10.2483443 | 5.2730070 | 6.6621325 |
| 74119 | F | 1.2000 | 0.9250 | 0.3000 | 15.1953320 | 7.1157245 | 3.2318430 | 4.2524250 |
| 74120 | F | 1.5625 | 1.2500 | 0.4750 | 35.7628943 | 13.7353327 | 6.4069870 | 13.8912550 |
| 74121 | F | 1.3375 | 1.0875 | 0.3250 | 21.5597947 | 9.1001895 | 5.1737838 | 6.6621325 |
| 74122 | I | 1.0000 | 0.7500 | 0.2125 | 9.4970825 | 4.4225220 | 1.3182517 | 2.4947560 |
| 74123 | F | 1.5000 | 1.2000 | 0.3750 | 29.1716355 | 11.3256252 | 6.3786375 | 8.3631025 |
| 74124 | I | 0.6125 | 0.4500 | 0.1500 | 2.5656297 | 0.9638830 | 0.4961162 | 0.6095143 |
| 74125 | I | 1.0375 | 0.8375 | 0.2500 | 9.0151410 | 4.3799978 | 1.8143680 | 2.9766975 |
| 74126 | I | 1.0375 | 0.7625 | 0.2500 | 9.2561118 | 4.4366968 | 1.8427175 | 2.6932025 |
| 74127 | M | 1.3250 | 1.0750 | 0.3375 | 22.1409595 | 7.4133942 | 4.5926190 | 6.3786375 |
| 74128 | F | 1.5000 | 1.1875 | 0.4250 | 32.1625078 | 13.9337793 | 6.9739770 | 8.9300925 |
| 74129 | M | 1.3125 | 0.9625 | 0.3000 | 15.0394098 | 6.7330062 | 2.9341733 | 4.8052402 |
| 74130 | F | 1.6125 | 1.2000 | 0.4250 | 31.3828965 | 13.8912550 | 6.9598022 | 8.2213550 |
| 74131 | F | 1.6625 | 1.2625 | 0.4375 | 36.1881367 | 16.6128070 | 7.7110640 | 9.4262088 |
| 74132 | I | 1.1625 | 0.8875 | 0.3000 | 12.7572750 | 6.1234920 | 2.9483480 | 3.2885420 |
| 74133 | I | 1.4000 | 1.1125 | 0.3750 | 24.7207640 | 10.8578585 | 3.7137845 | 7.4842680 |
| 74134 | M | 1.5000 | 1.2125 | 0.4250 | 34.2320213 | 14.0046530 | 8.0229085 | 9.2135875 |
| 74135 | I | 0.5000 | 0.3750 | 0.1000 | 1.0489315 | 0.4677668 | 0.3118445 | 0.3685435 |
| 74136 | I | 0.9000 | 0.6625 | 0.2250 | 5.0603858 | 2.0128145 | 1.1481548 | 1.5592225 |
| 74137 | I | 0.7000 | 0.5125 | 0.1625 | 2.8349500 | 1.0205820 | 0.5244657 | 0.8504850 |
| 74138 | M | 1.6625 | 1.3625 | 0.4750 | 46.0537627 | 16.7262050 | 7.3850447 | 15.0252350 |
| 74139 | M | 1.6250 | 1.3125 | 0.4375 | 38.9663878 | 18.9091165 | 8.8166945 | 10.7728100 |
| 74140 | M | 1.4500 | 1.1500 | 0.4000 | 31.1277510 | 11.4248485 | 6.9739770 | 10.9145575 |
| 74141 | F | 1.3250 | 1.0250 | 0.3500 | 21.3896978 | 9.9081502 | 4.8619392 | 5.9675698 |
| 74142 | F | 1.3500 | 1.1250 | 0.3375 | 22.8922212 | 9.1285390 | 5.1312595 | 7.0873750 |
| 74143 | F | 1.6125 | 1.3375 | 0.4750 | 35.1392053 | 13.2675660 | 6.7613558 | 12.0201880 |
| 74144 | F | 1.6375 | 1.3625 | 0.5000 | 53.6089045 | 19.4477570 | 10.9570818 | 14.6141672 |
| 74145 | F | 1.5500 | 1.2125 | 0.4500 | 36.2731852 | 16.2584382 | 8.0937822 | 10.0640725 |
| 74146 | I | 1.1250 | 0.8375 | 0.2875 | 12.3320325 | 5.1454342 | 2.3530085 | 3.5436875 |
| 74147 | F | 1.3750 | 1.1625 | 0.3750 | 26.4075592 | 11.6232950 | 6.4920355 | 7.5126175 |
| 74148 | I | 0.7000 | 0.5375 | 0.2000 | 3.1467945 | 1.5450477 | 0.7796112 | 0.9922325 |
| 74149 | F | 1.6000 | 1.2500 | 0.4125 | 38.0025048 | 16.6411565 | 8.4198015 | 9.0718400 |
| 74150 | F | 0.9250 | 0.6875 | 0.2500 | 6.3077638 | 2.9483480 | 1.3182517 | 1.8710670 |
Pre-Process the test data
# pre-process the data using the function above
crab_test_data_pre_processed = data_preprocess_scaling(crab_age_test_raw)Predictions
Package the predictions results
Convert the predictions results to dataframe and join back to the Ids
from the test data and get the data ready for export.
# convert the predictions data to dataframe
crab_age_predictions = predictions_test_data %>% as.vector() %>% data.frame()
crab_age_predictions$id = as.integer(crab_age_test_raw$id)
# switch the id column to come first
crab_age_predictions[,c(1,2)] = crab_age_predictions[,c(2,1)]
colnames(crab_age_predictions) = c("id", "Age")
# set missing predictions to 0 if any
crab_age_predictions[is.na(crab_age_predictions)] = 0
# display the final predictions data
kable(head(crab_age_predictions, 100), "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "850px", height = "350px")| id | Age |
|---|---|
| 74051 | 8 |
| 74052 | 8 |
| 74053 | 10 |
| 74054 | 9 |
| 74055 | 7 |
| 74056 | 12 |
| 74057 | 11 |
| 74058 | 8 |
| 74059 | 12 |
| 74060 | 10 |
| 74061 | 13 |
| 74062 | 10 |
| 74063 | 5 |
| 74064 | 13 |
| 74065 | 6 |
| 74066 | 6 |
| 74067 | 11 |
| 74068 | 12 |
| 74069 | 11 |
| 74070 | 10 |
| 74071 | 16 |
| 74072 | 10 |
| 74073 | 9 |
| 74074 | 9 |
| 74075 | 13 |
| 74076 | 9 |
| 74077 | 14 |
| 74078 | 12 |
| 74079 | 9 |
| 74080 | 11 |
| 74081 | 11 |
| 74082 | 10 |
| 74083 | 9 |
| 74084 | 13 |
| 74085 | 12 |
| 74086 | 7 |
| 74087 | 10 |
| 74088 | 12 |
| 74089 | 9 |
| 74090 | 10 |
| 74091 | 8 |
| 74092 | 9 |
| 74093 | 11 |
| 74094 | 10 |
| 74095 | 17 |
| 74096 | 11 |
| 74097 | 7 |
| 74098 | 6 |
| 74099 | 13 |
| 74100 | 10 |
| 74101 | 10 |
| 74102 | 10 |
| 74103 | 9 |
| 74104 | 10 |
| 74105 | 9 |
| 74106 | 12 |
| 74107 | 10 |
| 74108 | 6 |
| 74109 | 9 |
| 74110 | 6 |
| 74111 | 11 |
| 74112 | 7 |
| 74113 | 9 |
| 74114 | 11 |
| 74115 | 8 |
| 74116 | 9 |
| 74117 | 14 |
| 74118 | 11 |
| 74119 | 9 |
| 74120 | 16 |
| 74121 | 11 |
| 74122 | 7 |
| 74123 | 12 |
| 74124 | 5 |
| 74125 | 7 |
| 74126 | 7 |
| 74127 | 12 |
| 74128 | 11 |
| 74129 | 10 |
| 74130 | 11 |
| 74131 | 11 |
| 74132 | 8 |
| 74133 | 10 |
| 74134 | 12 |
| 74135 | 5 |
| 74136 | 7 |
| 74137 | 6 |
| 74138 | 16 |
| 74139 | 11 |
| 74140 | 13 |
| 74141 | 10 |
| 74142 | 11 |
| 74143 | 15 |
| 74144 | 15 |
| 74145 | 11 |
| 74146 | 8 |
| 74147 | 11 |
| 74148 | 6 |
| 74149 | 11 |
| 74150 | 8 |
Write the Predictions data to local storage for submission
Kaggle Submission
kaggle username: Chinedu Onyeka
kaggle score: 1.46