DATA 605 Final Exam

#To clear console
rm(list=ls())
#To avoid exponential notation
format(1810032000, scientific = FALSE)

## [1] "1810032000"

Problem 1

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15.Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89. Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities. 5 points. a. P(X>x | X>y) b. P(X>x & Y>y) c. P(X<x | X>y)
5 points. Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities. 5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

# Set the seed for reproducibility
set.seed(1234)

# X variable with 10000 random uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)
head(X)

## [1]  6.137034 11.222994 11.092747 11.233794 13.609154 11.403106

# x
x <- median(X)
x

## [1] 10.01266

# Y variable with 10000 random normal values with mean of 10 and a standard deviation of 2.89
Y <- rnorm(10000,mean=10,sd=2.89)
head(Y)

## [1]  7.615333 11.003316  7.340931  9.169598  8.407233 12.452586

# y
y<-median(Y)
y

## [1] 10.03154

# Calculate Probabilities

# a. P(B_X > x | B_X > y)
prob_a <- sum(X > x & X > y) / sum(X > y)

# b. P(B_X > x & B_Y > y)
prob_b <- sum(X > x & Y > y) / length(X)

# c. P(B_X < x | B_X > y)
prob_c <- sum(X < x & X > y) / sum(X > y)

# Display results
cat("Median of X:", x, "\n")

## Median of X: 10.01266

cat("Median of Y:", y, "\n\n")

## Median of Y: 10.03154

cat("a. P(X > x | X > y):", prob_a, "\n")

## a. P(X > x | X > y): 1

cat("b. P(X > x & Y > y):", prob_b, "\n")

## b. P(X > x & Y > y): 0.2507

cat("c. P(X < x | X > y):", prob_c, "\n")

## c. P(X < x | X > y): 0

Interpretations:

P(B_X > x | B_X > y):

This represents the probability that B_X is greater than its median (x) given that it is greater than Y’s median (y). In other words, given that X is already greater than Y’s median, what is the probability that it’s also greater than its own median.

P(B_X > x & B_Y > y):

This is the joint probability that X is greater than its median and Y is greater than its median. So, it means that the probability that both X is greater than its median and Y is greater than its median. It’s the probability that both conditions are satisfied.

P(B_X < x | B_X > y):

This is the conditional probability that X is less than its median given that X is greater than Y.

It represents the probability that X is less than its median (x) given that it is greater than Y’s median (y). In other words, given that X is already greater than Y’s median, what is the probability that it’s less than its own median.

Investigate whether P(B_X>x & B_Y>y) =P(B_X>x)P(B_Y>y) by building a table and evaluating the marginal and joint probabilities.

# Create a contingency table
contingency_table <- table(X > x, Y > y)

print(contingency_table)

##        
##         FALSE TRUE
##   FALSE  2507 2493
##   TRUE   2493 2507

# Calculate probabilities
prob_joint <- sum(X > x & Y > y) / length(X)
prob_marginal_x <- sum(X > x) / length(X)
prob_marginal_y <- sum(Y > y) / length(Y)
prob_expected <- prob_marginal_x * prob_marginal_y

# Display probabilities
cat("P(X > x and Y > y):", prob_joint, "\n")

## P(X > x and Y > y): 0.2507

cat("P(X > x) * P(Y > y):", prob_expected, "\n")

## P(X > x) * P(Y > y): 0.25

In conclusion, P(X>x & Y>y) = P(X>x) * P(Y>y) and it suggests that the two variables X and Y are independent of each other.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. And answer the following questions:

What is the difference between the two?
Which is most appropriate?
Are you surprised at the results?
Why or why not?

# Create categorical variables based on medians
X_category <- factor(ifelse(X > x, "Yes", "No"))
Y_category <- factor(ifelse(Y > y, "Yes", "No"))

# Create a contingency table
contingency_table <- table(X_category, Y_category)

# Perform Fisher's Exact Test
fisher_test_result <- fisher.test(contingency_table)
fisher_test_result

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264

# Perform Chi-Square Test
chi_square_test_result <- chisq.test(contingency_table)
chi_square_test_result

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 0.0676, df = 1, p-value = 0.7949

1. Difference between Fisher’s Exact Test and Chi-Square Test:

Fisher’s Exact Test is used when the sample size is small, and it calculates the exact probability of observing the given distribution of data under the assumption of independence.
Chi-Square Test is more commonly used for larger sample sizes, and it assesses the difference between the expected and observed frequencies.

2. Which is most appropriate?

Fisher’s Exact Test is more appropriate when dealing with small sample sizes, while the Chi-Square Test is often used for larger sample sizes.
Both tests can be applied, and the choice may depend on the specific characteristics of the data.

3. Are you surprised at the results?

The surprise or lack thereof would depend on the specific outcome of the tests. If independence is rejected, it may suggest some relationship between the two variables.

4. Why or why not?

The results of independence tests provide insights into whether two events are related. If there is a significant result, it suggests that the variables are not independent, and the relationship is statistically significant. If there is no significant result, it implies that the variables are independent.

The interpretation of the test results and their significance would depend on the p-values obtained from the tests. Lower p-values suggest evidence against independence. The p-value of 0.9203 from the Chi-square test and Fisher’s Exact test at a 95% confidence level indicates that that the two variables are independent.

Problem 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following. 5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not? 5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of  for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss. 10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

crb <- read_csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/Crab_age_train.csv")

#Replace space between variable names with "_"
crb <- crb %>%
        clean_names()

crb$sex <- as.factor(crb$sex)
tail(crb)

## # A tibble: 6 × 10
##      id sex   length diameter height weight shucked_weight viscera_weight
##   <dbl> <fct>  <dbl>    <dbl>  <dbl>  <dbl>          <dbl>          <dbl>
## 1 74045 F      1.62     1.41   0.488  49.9           23.0           10.2 
## 2 74046 F      1.66     1.26   0.438  50.7           20.7           10.4 
## 3 74047 I      1.08     0.862  0.275  10.4            4.32           2.30
## 4 74048 F      1.49     1.2    0.412  29.5           12.3            7.54
## 5 74049 I      1.21     0.962  0.312  16.8            8.97           2.92
## 6 74050 I      0.912    0.675  0.2     5.39           2.06           1.03
## # ℹ 2 more variables: shell_weight <dbl>, age <dbl>

#Checking for missing data

Descriptive Statistics

summary(crb)

##        id        sex           length          diameter          height      
##  Min.   :    0   F:23010   Min.   :0.1875   Min.   :0.1375   Min.   :0.0000  
##  1st Qu.:18513   I:23957   1st Qu.:1.1500   1st Qu.:0.8875   1st Qu.:0.3000  
##  Median :37025   M:27084   Median :1.3750   Median :1.0750   Median :0.3625  
##  Mean   :37025             Mean   :1.3175   Mean   :1.0245   Mean   :0.3481  
##  3rd Qu.:55538             3rd Qu.:1.5375   3rd Qu.:1.2000   3rd Qu.:0.4125  
##  Max.   :74050             Max.   :2.0128   Max.   :1.6125   Max.   :2.8250  
##      weight        shucked_weight     viscera_weight      shell_weight     
##  Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252   Min.   : 0.04252  
##  1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330   1st Qu.: 3.96893  
##  Median :23.7994   Median : 9.90815   Median : 4.98951   Median : 6.93145  
##  Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839   Mean   : 6.72387  
##  3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815   3rd Qu.: 9.07184  
##  Max.   :80.1015   Max.   :42.18406   Max.   :21.54562   Max.   :28.49125  
##       age        
##  Min.   : 1.000  
##  1st Qu.: 8.000  
##  Median :10.000  
##  Mean   : 9.968  
##  3rd Qu.:11.000  
##  Max.   :29.000

Scatter Plot between variables and Age

library(grid)
library(cowplot)
#Crab Age vs Diameter
scatter_age_diameter <- crb %>% ggplot(aes(diameter,age)) + geom_point(aes(color = sex)) +
  labs(title = "Crab Age vs Diameter", x = "Diameter", y = "Crab Age")+theme_bw()

#Crab Age vs Height
scatter_age_height <- crb %>% ggplot(aes(height,age)) + geom_point(aes(color = sex))+ 
  labs(title = "Crab Age vs Height", x = "Height", y = "Crab Age") +theme_bw()

#Crab Age vs Weight
scatter_age_weight <- crb %>% ggplot(aes(weight,age)) + geom_point(aes(color = sex))+ 
  labs(title = "Age vs Weight", x = "Weight", y = "Crab Age") +theme_bw()

#Crab Age vs Shucked weight
scatter_age_shucked_weight <- crb %>% ggplot(aes(shucked_weight,age)) + geom_point(aes(color = sex))+ 
  labs(title = "Age vs Shucked weight", x = "Shucked weight", y = "Crab Age") +theme_bw()

#Crab Age vs Viscera weight
scatter_age_viscera_weight <- crb %>% ggplot(aes(viscera_weight,age)) + geom_point(aes(color = sex))+ 
  labs(title = "Age vs Viscera weight", x = "Viscera weight", y = "Crab Age") +theme_bw()

#Crab Age vs Shell weight
scatter_age_shell_weight <- crb %>% ggplot(aes(shell_weight,age)) + geom_point(aes(color = sex))+ 
  labs(title = "Age vs Shell weight", x = "Shell weight", y = "Crab Age") +theme_bw()

#Crab Age vs Sex
scatter_age_sex <- crb %>% ggplot(aes(sex,age)) + geom_point()+ 
  labs(title = "Age vs Sex", x = "Sex", y = "Crab Age") +theme_bw()

#Crab Age vs Length
scatter_age_length <- crb %>% ggplot(aes(length,age)) + geom_point(aes(color = sex))+ 
  labs(title = "Age vs Length", x = "Length", y = "Crab Age") +theme_bw()

#Display Plots as matrix
scatter_plot <- plot_grid(scatter_age_diameter, scatter_age_height, scatter_age_weight, scatter_age_shucked_weight,scatter_age_viscera_weight,scatter_age_shell_weight,scatter_age_sex,scatter_age_length,
                                   byrow = TRUE, nrow = 3)


scatter_plot

Correlation Matrix For Three variables

crb %>% dplyr::select(-c(sex,age,id,length,height,diameter,shell_weight,age)) %>% 
  pairs.panels()

cor_matrix <- crb %>% dplyr::select(-c(sex,age,id,length,height,diameter,shell_weight,age)) %>% 
  cor() %>% round(.,2)

cor_matrix

##                weight shucked_weight viscera_weight
## weight           1.00           0.97           0.97
## shucked_weight   0.97           1.00           0.94
## viscera_weight   0.97           0.94           1.00

Test the hypothesis that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval

To test the hypothesis that the pairwise correlation between weight, shucked weight and viscera weight is zero, a pairwise correlation between each of the variables would be determined and the p-value is ascertained to accept or reject the Null \(H_0\) hypothesis.

# Pariwise correlation between Weight and shucked weight at 80% confidence interval
weight_sh_weight = cor.test(crb$weight, crb$shucked_weight, conf.level = 0.8)
weight_vs_weight = cor.test(crb$weight, crb$viscera_weight, conf.level = 0.8)
sh_weight_vs_weight = cor.test(crb$shucked_weight, crb$viscera_weight, conf.level = 0.8)

#Weight vs shucked weight:
weight_sh_weight

## 
##  Pearson's product-moment correlation
## 
## data:  crb$weight and crb$shucked_weight
## t = 1110.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9709993 0.9715328
## sample estimates:
##       cor 
## 0.9712672

#Weight vs Viscera weight:
weight_vs_weight

## 
##  Pearson's product-moment correlation
## 
## data:  crb$weight and crb$viscera_weight
## t = 1106.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9707920 0.9713293
## sample estimates:
##       cor 
## 0.9710619

#shucked vs Viscera weight
sh_weight_vs_weight

## 
##  Pearson's product-moment correlation
## 
## data:  crb$shucked_weight and crb$viscera_weight
## t = 768.33, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.9420988 0.9431486
## sample estimates:
##      cor 
## 0.942626

Going by the p-value which is < 2.2e-16 (~ 0) for each pairwise set of variables above, the Null hypothesis \(H_0\) which postulates that the correlations between each pairwise set of variables is 0 should be rejected as it appears false.

Evaluate Familywise Error Using Bonferroni corrected P-value

Bonferroni corrected P-value (BCPv) = \(\frac{\alpha}{n}\)

α = Confidence level (1 - 0.8)
n = Number of test (3)

# Bonferroni corrected P-value
α <- 1-0.8
n <- 3

BCPv = (1-0.8)/3

cat("Bonferroni corrected P-value (BCPv): ",BCPv)

## Bonferroni corrected P-value (BCPv):  0.06666667

The Bonferroni test is a multiple-comparison correction used when several dependent or independent statistical tests are being performed simultaneously. However, the adjusted significance level for each pairwise correlation test above for familywise error rate at 80% confidence level is approximately 0.067 or 6.7%. In other words, the error rate seems statistically insignificant at 80% confidence level. So, there is nothing to worry about as it concerns the familywise error.

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.

Invert your correlation matrix from above: The solve function is used to invert matrix.

precision_matrix <- solve(cor_matrix)
precision_matrix

##                   weight shucked_weight viscera_weight
## weight          33.33333     -16.666667     -16.666667
## shucked_weight -16.66667      16.924399       0.257732
## viscera_weight -16.66667       0.257732      16.924399

Multiply the correlation matrix by the precision matrix:

cor_matrix_by_precision_mat = cor_matrix %*% precision_matrix  %>% round(.,4)
cor_matrix_by_precision_mat

##                weight shucked_weight viscera_weight
## weight              1              0              0
## shucked_weight      0              1              0
## viscera_weight      0              0              1

Multiply the precision matrix by the correlation matrix

precision_mat_by_cor_matrix = precision_matrix %*% cor_matrix %>% round(.,4)
precision_mat_by_cor_matrix

##                weight shucked_weight viscera_weight
## weight              1              0              0
## shucked_weight      0              1              0
## viscera_weight      0              0              1

Conduct LU decomposition on the matrix:

lu_decomposition = lu.decomposition(cor_matrix)
lu_decomposition

## $L
##      [,1]        [,2] [,3]
## [1,] 1.00  0.00000000    0
## [2,] 0.97  1.00000000    0
## [3,] 0.97 -0.01522843    1
## 
## $U
##      [,1]   [,2]        [,3]
## [1,]    1 0.9700  0.97000000
## [2,]    0 0.0591 -0.00090000
## [3,]    0 0.0000  0.05908629

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of  for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Determine variable skewness

skew_length <- skewness(crb$length)
skew_diameter <- skewness(crb$diameter)
skew_height <- skewness(crb$height)
skew_weight <- skewness(crb$weight)
skew_shucked_weight <- skewness(crb$shucked_weight)
skew_viscera_weight <- skewness(crb$viscera_weight)
skew_shell_weight <- skewness(crb$shell_weight)
skew_age <- skewness(crb$age)

#Display Skewness
cat("Length Skewness: ",skew_length)

## Length Skewness:  -0.8443427

cat("Diameter Skewness: ",skew_diameter)

## Diameter Skewness:  -0.8128327

cat("Weight Skewness: ",skew_weight)

## Weight Skewness:  0.2314552

cat("Shucked weight: ",skew_shucked_weight)

## Shucked weight:  0.3494575

cat("Viscera weight: ",skew_viscera_weight)

## Viscera weight:  0.2863712

cat("Shell weight: ",skew_shell_weight)

## Shell weight:  0.2774477

cat("Age Skewness: ",skew_age)

## Age Skewness:  1.092875

Negatively skewed: If the coefficient of skewness is less than 0 then said to be negatively skewed or left skewed.
Positive Skewed: If the coefficient of skewness is greater than 0 then said to be positively skewed or right skewed.

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

# Minimum value for shucked weight:
min(crb$shucked_weight)

## [1] 0.0283495

Using shucked weight as the select right skewed variable makes more sense as the minimum value is above zero (0).

#Histogram for shucked weight
histogram_shucked_weight <- crb %>% ggplot(aes(x = shucked_weight)) + 
              geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "shucked weight", x = "shucked weight", y = "Frequency") +
                               theme_bw()

#Histogram for Height
histogram_height <- crb %>% ggplot(aes(height)) + geom_density()+
  #geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "Height", x = "Height", y = "Frequency") +
                               theme_bw()

#Histogram for Weight
histogram_weight <- crb %>% ggplot(aes(weight)) + 
  geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "weight", x = "weight", y = "Frequency") +
                               theme_bw()

#Histogram for viscera weight
histogram_viscera_weight <- crb %>% ggplot(aes(viscera_weight)) + 
  geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "viscera weight", x = "viscera weight", y = "Frequency") +
                               theme_bw()

#Histogram for shell weight
histogram_shell_weight <- crb %>% ggplot(aes(shell_weight)) + 
  geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "Shell weight", x = "Shell weight", y = "Frequency") +
                               theme_bw()

#Histogram for length
histogram_length <- crb %>% ggplot(aes(length)) + geom_density() +
  #geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "Length", x = "Length", y = "Frequency") +
                               theme_bw()

#Histogram for Age
histogram_age <- crb %>% ggplot(aes(age)) + 
  geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "Age", x = "Age", y = "Frequency") +
                               theme_bw()


#Display Plots as matrix
histogram_plot <- plot_grid(histogram_weight,histogram_shucked_weight,histogram_height,histogram_viscera_weight,histogram_shell_weight,histogram_length,histogram_age, byrow = TRUE, ncol= 3)


histogram_plot

From the histograms above, it is correct to establish that viscera weight, weight, shell weight, shucked weight and age are skewed to the right or positively skewed. While length and diameter are skewed to the left or negatively skewed.

Load the MASS package and run fitdistr to fit an exponential probability density function

shucked_weight_prob_density <- fitdistr(crb$shucked_weight, densfun = 'exponential')
λ <- shucked_weight_prob_density$estimate
shucked_weight_exp_pdf <- rexp(1000, λ)
#shucked_weight_exp_pdf

Optimal Value of λ: The optimal value of \(\lambda = \frac{1}{\lambda}\).

optimal_λ <- (1/λ) %>% round(., 4)
optimal_λ

##    rate 
## 10.1043

Plot a histogram and compare it with a histogram of your original variable.

#Convert the vector to dataframe

shucked_weight_exp_pdf_df <- data.frame(weight = shucked_weight_exp_pdf)

#Histogram for shucked_weight_exp_pdf_df 
histogram_shucked_weight_exp_pdf <- shucked_weight_exp_pdf_df %>% ggplot(aes(weight)) + 
  geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "Exponential PDF", x = "shucked weight", y = "Frequency") +
                               theme_bw()

#Histogram for Original shucked weight
histogram_shucked_weight <- crb %>% ggplot(aes(x = shucked_weight)) + 
              geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
                               labs(title = "Original shucked weight", x = "shucked weight", y = "Frequency") +
                               theme_bw()
#Display Plots as matrix
shucked_weight_compare <- plot_grid(histogram_shucked_weight_exp_pdf,histogram_shucked_weight, byrow = TRUE, ncol= 2)


shucked_weight_compare

The 5th and 95th percentiles using the cumulative distribution function (CDF):

cdf_5th_95th_perc = quantile(shucked_weight_exp_pdf, c(0.05, 0.95)) %>%round(.,4)
cdf_5th_95th_perc

##      5%     95% 
##  0.5416 29.1782

95% confidence interval from empirical data, assuming normality

conf_int = t.test(crb$shucked_weight)
conf_int

## 
##  One Sample t-test
## 
## data:  crb$shucked_weight
## t = 489.43, df = 74050, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  10.06381 10.14473
## sample estimates:
## mean of x 
##  10.10427

The empirical 5th percentile and 95th percentile of the data

emp_5_95_perc = quantile(crb$shucked_weight, c(0.05, 0.95)) %>% round(.,4)
emp_5_95_perc

##      5%     95% 
##  1.5025 19.3769

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Build a multiple regression

mod1 <- crb %>% dplyr::select(-id) %>% lm(age ~ ., data=.)
summary(mod1)

## 
## Call:
## lm(formula = age ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0347  -1.2228  -0.3320   0.7537  22.0418 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.755758   0.075897  49.485  < 2e-16 ***
## sexI           -1.040735   0.025209 -41.284  < 2e-16 ***
## sexM           -0.115212   0.019157  -6.014 1.82e-09 ***
## length          0.910211   0.192234   4.735 2.20e-06 ***
## diameter        2.138507   0.238786   8.956  < 2e-16 ***
## height          7.222272   0.236275  30.567  < 2e-16 ***
## weight          0.194265   0.005416  35.867  < 2e-16 ***
## shucked_weight -0.614115   0.006632 -92.603  < 2e-16 ***
## viscera_weight -0.216351   0.011872 -18.224  < 2e-16 ***
## shell_weight    0.512127   0.009820  52.152  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5508 
## F-statistic: 1.009e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

plot(mod1)

Analysis of Residual

The Residual vs Fitted value plot shows that the error did maintained a fairly constant variability. However, there are few data points at the far right of the distribution. The data distribution is fairly normal. The R-squared \(R^2\) of of approximately 0.55 or 55% shows that the model is reasonable to explain about 55% variation in age as a result of the variables. All the variables are statistically significant due to their p-value that is nearly zero (0).

Read test dataset

test <- read_csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/Crab_age_test.csv")

#Replace space between variable names with "_"
test <- test %>%
        clean_names()

head(test)

## # A tibble: 6 × 9
##      id sex   length diameter height weight shucked_weight viscera_weight
##   <dbl> <chr>  <dbl>    <dbl>  <dbl>  <dbl>          <dbl>          <dbl>
## 1 74051 I       1.05    0.762  0.275   8.62           3.66           1.73
## 2 74052 I       1.16    0.888  0.275  15.5            7.03           3.25
## 3 74053 F       1.29    0.988  0.325  14.6            5.56           3.88
## 4 74054 F       1.55    0.988  0.388  28.4           13.4            6.55
## 5 74055 I       1.11    0.85   0.262  11.8            5.53           2.47
## 6 74056 M       1.42    1.11   0.35   24.8            8.73           5.71
## # ℹ 1 more variable: shell_weight <dbl>

Prediction of test dataset

#Prediction of test data set
y.pred <- predict(mod1,test) %>% round(.,0) %>% data.frame(Age=.)
prediction <- cbind(test$id,y.pred) %>% rename(id=`test$id`)
head(prediction)

##      id Age
## 1 74051   8
## 2 74052   8
## 3 74053  10
## 4 74054  10
## 5 74055   8
## 6 74056  12

length(y.pred)

## [1] 1

nrow(test)

## [1] 49368

Standardize the data

scaled_crb <- cbind(scale(crb[,3:9]),crb[,2],crb[,10])
head(scaled_crb)

##       length    diameter     height       weight shucked_weight viscera_weight
## 1  0.7212332  0.63397775  0.2923977  0.441801426      0.4671847      0.5691823
## 2 -0.7557067 -0.84035033 -0.7941577 -1.025191245     -0.9936809     -0.9788731
## 3  0.2433997  0.37070488  0.2923977  0.110075046      0.2199225      0.1783618
## 4  1.3293850  1.63441467  1.6505920  2.156468183      1.8246040      2.1246076
## 5 -0.2344338 -0.05053172 -0.1150606 -0.007598159      0.3334613     -0.1972320
## 6  0.6343544  0.63397775  0.6998560  0.431715150      0.5882928      0.6199382
##   shell_weight sex age
## 1   0.45337303   I   9
## 2  -0.92678160   I   8
## 3  -0.01722411   M   9
## 4   2.30807939   F  11
## 5  -0.21495400   I   8
## 6   0.33868969   M  10

Build a regression model with the standardized data

mod2 <- scaled_crb %>% lm(age ~ ., data=.)
summary(mod2)

## 
## Call:
## lm(formula = age ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0347  -1.2228  -0.3320   0.7537  22.0418 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.34664    0.01513 683.888  < 2e-16 ***
## length          0.26192    0.05532   4.735 2.20e-06 ***
## diameter        0.50767    0.05669   8.956  < 2e-16 ***
## height          0.66469    0.02175  30.567  < 2e-16 ***
## weight          2.45709    0.06851  35.867  < 2e-16 ***
## shucked_weight -3.45012    0.03726 -92.603  < 2e-16 ***
## viscera_weight -0.60421    0.03315 -18.224  < 2e-16 ***
## shell_weight    1.83566    0.03520  52.152  < 2e-16 ***
## sexI           -1.04074    0.02521 -41.284  < 2e-16 ***
## sexM           -0.11521    0.01916  -6.014 1.82e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5508 
## F-statistic: 1.009e+04 on 9 and 74041 DF,  p-value: < 2.2e-16

plot(mod2)

Analysis of Residual

The status of the residuals remain the same after standardizing the data. There was no significant change with or without standardization. The Residual vs Fitted value plot shows that the error did maintained a fairly constant variability. However, there are few data points at the far right of the distribution. The data distribution is fairly normal. The R-squared \(R^2\) of of approximately 0.55 or 55% shows that the model is reasonable to explain about 55% variation in age as a result of the variables. All the variables are statistically significant due to their p-value that is nearly zero (0).

Standardize the test dataset

scaled_test <- cbind(scale(test[,3:9]),test[,1:2])
scaled_test$id <- as.integer(scaled_test$id)
head(scaled_test)

##       length   diameter      height     weight shucked_weight viscera_weight
## 1 -0.9402177 -1.1152375 -0.79875436 -1.1743970     -1.1534907     -1.1976428
## 2 -0.5481630 -0.5869038 -0.79875436 -0.6294603     -0.5523282     -0.6543278
## 3 -0.1125467 -0.1642368 -0.25680473 -0.7034641     -0.8150211     -0.4258309
## 4  0.8022475 -0.1642368  0.42063230  0.3886519      0.5792716      0.5287786
## 5 -0.7224096 -0.7454039 -0.93424176 -0.9254753     -0.8200729     -0.9336019
## 6  0.3666312  0.3640969  0.01417008  0.1083347     -0.2492211      0.2291938
##   shell_weight    id sex
## 1  -1.12548020 74051   I
## 2  -0.77662673 74052   I
## 3  -0.53877210 74053   F
## 4   0.07964996 74054   F
## 5  -0.95501771 74055   I
## 6   0.37300402 74056   M

Prediction of test dataset

#Prediction of test data set
y.pred_scale <- predict(mod2,scaled_test) %>% round(.,0) %>% data.frame(Age=.)
prediction_scale <- cbind(scaled_test$id,y.pred_scale) %>% rename(id=`scaled_test$id`)
head(prediction_scale)

##      id Age
## 1 74051   8
## 2 74052   8
## 3 74053  10
## 4 74054  10
## 5 74055   7
## 6 74056  12

Compare The Two models

anova(mod1,mod2)

## Analysis of Variance Table
## 
## Model 1: age ~ sex + length + diameter + height + weight + shucked_weight + 
##     viscera_weight + shell_weight
## Model 2: age ~ length + diameter + height + weight + shucked_weight + 
##     viscera_weight + shell_weight + sex
##   Res.Df    RSS Df   Sum of Sq F Pr(>F)
## 1  74041 335336                        
## 2  74041 335336  0 -5.8208e-10

In summary, the comparison of these two models suggests that there is no significant difference between them either in terms of the number of parameters they use or their fit to the data. The extremely small difference in the RSS and the zero difference in degrees of freedom point to these models being essentially identical in performance.

Write the Predictions data to local storage for submission

write.csv(prediction_scale, "C:/Users/newma/OneDrive/Desktop/MSDS Fall 2023/Data 605/Final Exam/crab_age_kaggle.csv", row.names=FALSE)

DATA 605 Final Exam

Nnaemeka Okereafor

2023-12-10