#To clear console
rm(list=ls())
#To avoid exponential notation
format(1810032000, scientific = FALSE)
## [1] "1810032000"
Problem 1
Using R, set a random seed equal to 1234 (i.e., set.seed(1234)).
Generate a random variable X that has 10,000 continuous random uniform
values between 5 and 15.Then generate a random variable Y that has
10,000 random normal values with a mean of 10 and a standard deviation
of 2.89. Probability. Calculate as a minimum the below probabilities a
through c. Assume the small letter “x” is estimated as the median of the
X variable, and the small letter “y” is estimated as the median of the Y
variable. Interpret the meaning of all probabilities. 5 points. a.
P(X>x | X>y) b. P(X>x & Y>y) c. P(X<x | X>y)
5 points. Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y)
by building a table and evaluating the marginal and joint probabilities.
5 points. Check to see if independence holds by using Fisher’s Exact
Test and the Chi Square Test. What is the difference between the two?
Which is most appropriate? Are you surprised at the results? Why or why
not?
# Set the seed for reproducibility
set.seed(1234)
# X variable with 10000 random uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)
head(X)
## [1] 6.137034 11.222994 11.092747 11.233794 13.609154 11.403106
# x
x <- median(X)
x
## [1] 10.01266
# Y variable with 10000 random normal values with mean of 10 and a standard deviation of 2.89
Y <- rnorm(10000,mean=10,sd=2.89)
head(Y)
## [1] 7.615333 11.003316 7.340931 9.169598 8.407233 12.452586
# y
y<-median(Y)
y
## [1] 10.03154
# Calculate Probabilities
# a. P(B_X > x | B_X > y)
prob_a <- sum(X > x & X > y) / sum(X > y)
# b. P(B_X > x & B_Y > y)
prob_b <- sum(X > x & Y > y) / length(X)
# c. P(B_X < x | B_X > y)
prob_c <- sum(X < x & X > y) / sum(X > y)
# Display results
cat("Median of X:", x, "\n")
## Median of X: 10.01266
cat("Median of Y:", y, "\n\n")
## Median of Y: 10.03154
cat("a. P(X > x | X > y):", prob_a, "\n")
## a. P(X > x | X > y): 1
cat("b. P(X > x & Y > y):", prob_b, "\n")
## b. P(X > x & Y > y): 0.2507
cat("c. P(X < x | X > y):", prob_c, "\n")
## c. P(X < x | X > y): 0
Interpretations:
This represents the probability that B_X is greater than its median (x) given that it is greater than Y’s median (y). In other words, given that X is already greater than Y’s median, what is the probability that it’s also greater than its own median.
This is the joint probability that X is greater than its median and Y is greater than its median. So, it means that the probability that both X is greater than its median and Y is greater than its median. It’s the probability that both conditions are satisfied.
This is the conditional probability that X is less than its median given that X is greater than Y.
It represents the probability that X is less than its median (x) given that it is greater than Y’s median (y). In other words, given that X is already greater than Y’s median, what is the probability that it’s less than its own median.
Investigate whether P(B_X>x & B_Y>y) =P(B_X>x)P(B_Y>y) by building a table and evaluating the marginal and joint probabilities.
# Create a contingency table
contingency_table <- table(X > x, Y > y)
print(contingency_table)
##
## FALSE TRUE
## FALSE 2507 2493
## TRUE 2493 2507
# Calculate probabilities
prob_joint <- sum(X > x & Y > y) / length(X)
prob_marginal_x <- sum(X > x) / length(X)
prob_marginal_y <- sum(Y > y) / length(Y)
prob_expected <- prob_marginal_x * prob_marginal_y
# Display probabilities
cat("P(X > x and Y > y):", prob_joint, "\n")
## P(X > x and Y > y): 0.2507
cat("P(X > x) * P(Y > y):", prob_expected, "\n")
## P(X > x) * P(Y > y): 0.25
In conclusion, P(X>x & Y>y) = P(X>x) * P(Y>y) and it suggests that the two variables X and Y are independent of each other.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. And answer the following questions:
# Create categorical variables based on medians
X_category <- factor(ifelse(X > x, "Yes", "No"))
Y_category <- factor(ifelse(Y > y, "Yes", "No"))
# Create a contingency table
contingency_table <- table(X_category, Y_category)
# Perform Fisher's Exact Test
fisher_test_result <- fisher.test(contingency_table)
fisher_test_result
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
# Perform Chi-Square Test
chi_square_test_result <- chisq.test(contingency_table)
chi_square_test_result
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 0.0676, df = 1, p-value = 0.7949
1. Difference between Fisher’s Exact Test and Chi-Square Test:
Fisher’s Exact Test is used when the sample size is small, and it calculates the exact probability of observing the given distribution of data under the assumption of independence.
Chi-Square Test is more commonly used for larger sample sizes, and it assesses the difference between the expected and observed frequencies.
2. Which is most appropriate?
Fisher’s Exact Test is more appropriate when dealing with small sample sizes, while the Chi-Square Test is often used for larger sample sizes.
Both tests can be applied, and the choice may depend on the specific characteristics of the data.
3. Are you surprised at the results?
4. Why or why not?
The interpretation of the test results and their significance would depend on the p-values obtained from the tests. Lower p-values suggest evidence against independence. The p-value of 0.9203 from the Chi-square test and Fisher’s Exact test at a 95% confidence level indicates that that the two variables are independent.
Problem 2
You are to register for Kaggle.com (free) and compete in the
Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16
I want you to do the following. 5 points. Descriptive and Inferential
Statistics. Provide univariate descriptive statistics and appropriate
plots for the training data set. Provide a scatterplot matrix for at
least two of the independent variables and the dependent variable.
Derive a correlation matrix for any three quantitative variables in the
dataset. Test the hypotheses that the correlations between each pairwise
set of variables is 0 and provide an 80% confidence interval. Discuss
the meaning of your analysis. Would you be worried about familywise
error? Why or why not? 5 points. Linear Algebra and Correlation. Invert
your correlation matrix from above. (This is known as the precision
matrix and contains variance inflation factors on the diagonal.)
Multiply the correlation matrix by the precision matrix, and then
multiply the precision matrix by the correlation matrix. Conduct LDU
decomposition on the matrix.
5 points. Calculus-Based Probability & Statistics. Many times, it
makes sense to fit a closed form distribution to data. Select a variable
in the Kaggle.com training dataset that is skewed to the right, shift it
so that the minimum value is absolutely above zero if necessary. Then
load the MASS package and run fitdistr to fit an exponential probability
density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
). Find the optimal value of for this distribution, and then take 1000
samples from this exponential distribution using this value (e.g.,
rexp(1000, )). Plot a histogram and compare it with a histogram of your
original variable. Using the exponential pdf, find the 5th and 95th
percentiles using the cumulative distribution function (CDF). Also
generate a 95% confidence interval from the empirical data, assuming
normality. Finally, provide the empirical 5th percentile and 95th
percentile of the data. Discuss. 10 points. Modeling. Build some type of
multiple regression model and submit your model to the competition
board. Provide your complete model summary and results with analysis.
Report your Kaggle.com user name and score.
crb <- read_csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/Crab_age_train.csv")
#Replace space between variable names with "_"
crb <- crb %>%
clean_names()
crb$sex <- as.factor(crb$sex)
tail(crb)
## # A tibble: 6 × 10
## id sex length diameter height weight shucked_weight viscera_weight
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 74045 F 1.62 1.41 0.488 49.9 23.0 10.2
## 2 74046 F 1.66 1.26 0.438 50.7 20.7 10.4
## 3 74047 I 1.08 0.862 0.275 10.4 4.32 2.30
## 4 74048 F 1.49 1.2 0.412 29.5 12.3 7.54
## 5 74049 I 1.21 0.962 0.312 16.8 8.97 2.92
## 6 74050 I 0.912 0.675 0.2 5.39 2.06 1.03
## # ℹ 2 more variables: shell_weight <dbl>, age <dbl>
#Checking for missing data
Descriptive Statistics
summary(crb)
## id sex length diameter height
## Min. : 0 F:23010 Min. :0.1875 Min. :0.1375 Min. :0.0000
## 1st Qu.:18513 I:23957 1st Qu.:1.1500 1st Qu.:0.8875 1st Qu.:0.3000
## Median :37025 M:27084 Median :1.3750 Median :1.0750 Median :0.3625
## Mean :37025 Mean :1.3175 Mean :1.0245 Mean :0.3481
## 3rd Qu.:55538 3rd Qu.:1.5375 3rd Qu.:1.2000 3rd Qu.:0.4125
## Max. :74050 Max. :2.0128 Max. :1.6125 Max. :2.8250
## weight shucked_weight viscera_weight shell_weight
## Min. : 0.0567 Min. : 0.02835 Min. : 0.04252 Min. : 0.04252
## 1st Qu.:13.4377 1st Qu.: 5.71242 1st Qu.: 2.86330 1st Qu.: 3.96893
## Median :23.7994 Median : 9.90815 Median : 4.98951 Median : 6.93145
## Mean :23.3852 Mean :10.10427 Mean : 5.05839 Mean : 6.72387
## 3rd Qu.:32.1625 3rd Qu.:14.03300 3rd Qu.: 6.98815 3rd Qu.: 9.07184
## Max. :80.1015 Max. :42.18406 Max. :21.54562 Max. :28.49125
## age
## Min. : 1.000
## 1st Qu.: 8.000
## Median :10.000
## Mean : 9.968
## 3rd Qu.:11.000
## Max. :29.000
Scatter Plot between variables and Age
library(grid)
library(cowplot)
#Crab Age vs Diameter
scatter_age_diameter <- crb %>% ggplot(aes(diameter,age)) + geom_point(aes(color = sex)) +
labs(title = "Crab Age vs Diameter", x = "Diameter", y = "Crab Age")+theme_bw()
#Crab Age vs Height
scatter_age_height <- crb %>% ggplot(aes(height,age)) + geom_point(aes(color = sex))+
labs(title = "Crab Age vs Height", x = "Height", y = "Crab Age") +theme_bw()
#Crab Age vs Weight
scatter_age_weight <- crb %>% ggplot(aes(weight,age)) + geom_point(aes(color = sex))+
labs(title = "Age vs Weight", x = "Weight", y = "Crab Age") +theme_bw()
#Crab Age vs Shucked weight
scatter_age_shucked_weight <- crb %>% ggplot(aes(shucked_weight,age)) + geom_point(aes(color = sex))+
labs(title = "Age vs Shucked weight", x = "Shucked weight", y = "Crab Age") +theme_bw()
#Crab Age vs Viscera weight
scatter_age_viscera_weight <- crb %>% ggplot(aes(viscera_weight,age)) + geom_point(aes(color = sex))+
labs(title = "Age vs Viscera weight", x = "Viscera weight", y = "Crab Age") +theme_bw()
#Crab Age vs Shell weight
scatter_age_shell_weight <- crb %>% ggplot(aes(shell_weight,age)) + geom_point(aes(color = sex))+
labs(title = "Age vs Shell weight", x = "Shell weight", y = "Crab Age") +theme_bw()
#Crab Age vs Sex
scatter_age_sex <- crb %>% ggplot(aes(sex,age)) + geom_point()+
labs(title = "Age vs Sex", x = "Sex", y = "Crab Age") +theme_bw()
#Crab Age vs Length
scatter_age_length <- crb %>% ggplot(aes(length,age)) + geom_point(aes(color = sex))+
labs(title = "Age vs Length", x = "Length", y = "Crab Age") +theme_bw()
#Display Plots as matrix
scatter_plot <- plot_grid(scatter_age_diameter, scatter_age_height, scatter_age_weight, scatter_age_shucked_weight,scatter_age_viscera_weight,scatter_age_shell_weight,scatter_age_sex,scatter_age_length,
byrow = TRUE, nrow = 3)
scatter_plot
Correlation Matrix For Three variables
crb %>% dplyr::select(-c(sex,age,id,length,height,diameter,shell_weight,age)) %>%
pairs.panels()
cor_matrix <- crb %>% dplyr::select(-c(sex,age,id,length,height,diameter,shell_weight,age)) %>%
cor() %>% round(.,2)
cor_matrix
## weight shucked_weight viscera_weight
## weight 1.00 0.97 0.97
## shucked_weight 0.97 1.00 0.94
## viscera_weight 0.97 0.94 1.00
Test the hypothesis that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval
To test the hypothesis that the pairwise correlation between weight, shucked weight and viscera weight is zero, a pairwise correlation between each of the variables would be determined and the p-value is ascertained to accept or reject the Null \(H_0\) hypothesis.
# Pariwise correlation between Weight and shucked weight at 80% confidence interval
weight_sh_weight = cor.test(crb$weight, crb$shucked_weight, conf.level = 0.8)
weight_vs_weight = cor.test(crb$weight, crb$viscera_weight, conf.level = 0.8)
sh_weight_vs_weight = cor.test(crb$shucked_weight, crb$viscera_weight, conf.level = 0.8)
#Weight vs shucked weight:
weight_sh_weight
##
## Pearson's product-moment correlation
##
## data: crb$weight and crb$shucked_weight
## t = 1110.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9709993 0.9715328
## sample estimates:
## cor
## 0.9712672
#Weight vs Viscera weight:
weight_vs_weight
##
## Pearson's product-moment correlation
##
## data: crb$weight and crb$viscera_weight
## t = 1106.4, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9707920 0.9713293
## sample estimates:
## cor
## 0.9710619
#shucked vs Viscera weight
sh_weight_vs_weight
##
## Pearson's product-moment correlation
##
## data: crb$shucked_weight and crb$viscera_weight
## t = 768.33, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9420988 0.9431486
## sample estimates:
## cor
## 0.942626
Going by the p-value which is < 2.2e-16 (~ 0) for each pairwise set of variables above, the Null hypothesis \(H_0\) which postulates that the correlations between each pairwise set of variables is 0 should be rejected as it appears false.
Evaluate Familywise Error Using Bonferroni corrected P-value
Bonferroni corrected P-value (BCPv) = \(\frac{\alpha}{n}\)
# Bonferroni corrected P-value
α <- 1-0.8
n <- 3
BCPv = (1-0.8)/3
cat("Bonferroni corrected P-value (BCPv): ",BCPv)
## Bonferroni corrected P-value (BCPv): 0.06666667
The Bonferroni test is a multiple-comparison correction used when several dependent or independent statistical tests are being performed simultaneously. However, the adjusted significance level for each pairwise correlation test above for familywise error rate at 80% confidence level is approximately 0.067 or 6.7%. In other words, the error rate seems statistically insignificant at 80% confidence level. So, there is nothing to worry about as it concerns the familywise error.
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
Invert your correlation matrix from above: The solve function is used to invert matrix.
precision_matrix <- solve(cor_matrix)
precision_matrix
## weight shucked_weight viscera_weight
## weight 33.33333 -16.666667 -16.666667
## shucked_weight -16.66667 16.924399 0.257732
## viscera_weight -16.66667 0.257732 16.924399
Multiply the correlation matrix by the precision matrix:
cor_matrix_by_precision_mat = cor_matrix %*% precision_matrix %>% round(.,4)
cor_matrix_by_precision_mat
## weight shucked_weight viscera_weight
## weight 1 0 0
## shucked_weight 0 1 0
## viscera_weight 0 0 1
Multiply the precision matrix by the correlation matrix
precision_mat_by_cor_matrix = precision_matrix %*% cor_matrix %>% round(.,4)
precision_mat_by_cor_matrix
## weight shucked_weight viscera_weight
## weight 1 0 0
## shucked_weight 0 1 0
## viscera_weight 0 0 1
Conduct LU decomposition on the matrix:
lu_decomposition = lu.decomposition(cor_matrix)
lu_decomposition
## $L
## [,1] [,2] [,3]
## [1,] 1.00 0.00000000 0
## [2,] 0.97 1.00000000 0
## [3,] 0.97 -0.01522843 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.9700 0.97000000
## [2,] 0 0.0591 -0.00090000
## [3,] 0 0.0000 0.05908629
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Determine variable skewness
skew_length <- skewness(crb$length)
skew_diameter <- skewness(crb$diameter)
skew_height <- skewness(crb$height)
skew_weight <- skewness(crb$weight)
skew_shucked_weight <- skewness(crb$shucked_weight)
skew_viscera_weight <- skewness(crb$viscera_weight)
skew_shell_weight <- skewness(crb$shell_weight)
skew_age <- skewness(crb$age)
#Display Skewness
cat("Length Skewness: ",skew_length)
## Length Skewness: -0.8443427
cat("Diameter Skewness: ",skew_diameter)
## Diameter Skewness: -0.8128327
cat("Weight Skewness: ",skew_weight)
## Weight Skewness: 0.2314552
cat("Shucked weight: ",skew_shucked_weight)
## Shucked weight: 0.3494575
cat("Viscera weight: ",skew_viscera_weight)
## Viscera weight: 0.2863712
cat("Shell weight: ",skew_shell_weight)
## Shell weight: 0.2774477
cat("Age Skewness: ",skew_age)
## Age Skewness: 1.092875
Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
# Minimum value for shucked weight:
min(crb$shucked_weight)
## [1] 0.0283495
Using shucked weight as the select right skewed variable makes more sense as the minimum value is above zero (0).
#Histogram for shucked weight
histogram_shucked_weight <- crb %>% ggplot(aes(x = shucked_weight)) +
geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "shucked weight", x = "shucked weight", y = "Frequency") +
theme_bw()
#Histogram for Height
histogram_height <- crb %>% ggplot(aes(height)) + geom_density()+
#geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "Height", x = "Height", y = "Frequency") +
theme_bw()
#Histogram for Weight
histogram_weight <- crb %>% ggplot(aes(weight)) +
geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "weight", x = "weight", y = "Frequency") +
theme_bw()
#Histogram for viscera weight
histogram_viscera_weight <- crb %>% ggplot(aes(viscera_weight)) +
geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "viscera weight", x = "viscera weight", y = "Frequency") +
theme_bw()
#Histogram for shell weight
histogram_shell_weight <- crb %>% ggplot(aes(shell_weight)) +
geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "Shell weight", x = "Shell weight", y = "Frequency") +
theme_bw()
#Histogram for length
histogram_length <- crb %>% ggplot(aes(length)) + geom_density() +
#geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "Length", x = "Length", y = "Frequency") +
theme_bw()
#Histogram for Age
histogram_age <- crb %>% ggplot(aes(age)) +
geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "Age", x = "Age", y = "Frequency") +
theme_bw()
#Display Plots as matrix
histogram_plot <- plot_grid(histogram_weight,histogram_shucked_weight,histogram_height,histogram_viscera_weight,histogram_shell_weight,histogram_length,histogram_age, byrow = TRUE, ncol= 3)
histogram_plot
From the histograms above, it is correct to establish that viscera weight, weight, shell weight, shucked weight and age are skewed to the right or positively skewed. While length and diameter are skewed to the left or negatively skewed.
Load the MASS package and run fitdistr to fit an exponential probability density function
shucked_weight_prob_density <- fitdistr(crb$shucked_weight, densfun = 'exponential')
λ <- shucked_weight_prob_density$estimate
shucked_weight_exp_pdf <- rexp(1000, λ)
#shucked_weight_exp_pdf
Optimal Value of λ: The optimal value of \(\lambda = \frac{1}{\lambda}\).
optimal_λ <- (1/λ) %>% round(., 4)
optimal_λ
## rate
## 10.1043
Plot a histogram and compare it with a histogram of your original variable.
#Convert the vector to dataframe
shucked_weight_exp_pdf_df <- data.frame(weight = shucked_weight_exp_pdf)
#Histogram for shucked_weight_exp_pdf_df
histogram_shucked_weight_exp_pdf <- shucked_weight_exp_pdf_df %>% ggplot(aes(weight)) +
geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "Exponential PDF", x = "shucked weight", y = "Frequency") +
theme_bw()
#Histogram for Original shucked weight
histogram_shucked_weight <- crb %>% ggplot(aes(x = shucked_weight)) +
geom_histogram(binwidth = 0.5, color = "red", fill = "skyblue", alpha = 0.5) +
labs(title = "Original shucked weight", x = "shucked weight", y = "Frequency") +
theme_bw()
#Display Plots as matrix
shucked_weight_compare <- plot_grid(histogram_shucked_weight_exp_pdf,histogram_shucked_weight, byrow = TRUE, ncol= 2)
shucked_weight_compare
The 5th and 95th percentiles using the cumulative distribution function (CDF):
cdf_5th_95th_perc = quantile(shucked_weight_exp_pdf, c(0.05, 0.95)) %>%round(.,4)
cdf_5th_95th_perc
## 5% 95%
## 0.5416 29.1782
95% confidence interval from empirical data, assuming normality
conf_int = t.test(crb$shucked_weight)
conf_int
##
## One Sample t-test
##
## data: crb$shucked_weight
## t = 489.43, df = 74050, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 10.06381 10.14473
## sample estimates:
## mean of x
## 10.10427
The empirical 5th percentile and 95th percentile of the data
emp_5_95_perc = quantile(crb$shucked_weight, c(0.05, 0.95)) %>% round(.,4)
emp_5_95_perc
## 5% 95%
## 1.5025 19.3769
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Build a multiple regression
mod1 <- crb %>% dplyr::select(-id) %>% lm(age ~ ., data=.)
summary(mod1)
##
## Call:
## lm(formula = age ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0347 -1.2228 -0.3320 0.7537 22.0418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.755758 0.075897 49.485 < 2e-16 ***
## sexI -1.040735 0.025209 -41.284 < 2e-16 ***
## sexM -0.115212 0.019157 -6.014 1.82e-09 ***
## length 0.910211 0.192234 4.735 2.20e-06 ***
## diameter 2.138507 0.238786 8.956 < 2e-16 ***
## height 7.222272 0.236275 30.567 < 2e-16 ***
## weight 0.194265 0.005416 35.867 < 2e-16 ***
## shucked_weight -0.614115 0.006632 -92.603 < 2e-16 ***
## viscera_weight -0.216351 0.011872 -18.224 < 2e-16 ***
## shell_weight 0.512127 0.009820 52.152 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.009e+04 on 9 and 74041 DF, p-value: < 2.2e-16
plot(mod1)
Analysis of Residual
The Residual vs Fitted value plot shows that the error did maintained a fairly constant variability. However, there are few data points at the far right of the distribution. The data distribution is fairly normal. The R-squared \(R^2\) of of approximately 0.55 or 55% shows that the model is reasonable to explain about 55% variation in age as a result of the variables. All the variables are statistically significant due to their p-value that is nearly zero (0).
Read test dataset
test <- read_csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/Crab_age_test.csv")
#Replace space between variable names with "_"
test <- test %>%
clean_names()
head(test)
## # A tibble: 6 × 9
## id sex length diameter height weight shucked_weight viscera_weight
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 74051 I 1.05 0.762 0.275 8.62 3.66 1.73
## 2 74052 I 1.16 0.888 0.275 15.5 7.03 3.25
## 3 74053 F 1.29 0.988 0.325 14.6 5.56 3.88
## 4 74054 F 1.55 0.988 0.388 28.4 13.4 6.55
## 5 74055 I 1.11 0.85 0.262 11.8 5.53 2.47
## 6 74056 M 1.42 1.11 0.35 24.8 8.73 5.71
## # ℹ 1 more variable: shell_weight <dbl>
Prediction of test dataset
#Prediction of test data set
y.pred <- predict(mod1,test) %>% round(.,0) %>% data.frame(Age=.)
prediction <- cbind(test$id,y.pred) %>% rename(id=`test$id`)
head(prediction)
## id Age
## 1 74051 8
## 2 74052 8
## 3 74053 10
## 4 74054 10
## 5 74055 8
## 6 74056 12
length(y.pred)
## [1] 1
nrow(test)
## [1] 49368
Standardize the data
scaled_crb <- cbind(scale(crb[,3:9]),crb[,2],crb[,10])
head(scaled_crb)
## length diameter height weight shucked_weight viscera_weight
## 1 0.7212332 0.63397775 0.2923977 0.441801426 0.4671847 0.5691823
## 2 -0.7557067 -0.84035033 -0.7941577 -1.025191245 -0.9936809 -0.9788731
## 3 0.2433997 0.37070488 0.2923977 0.110075046 0.2199225 0.1783618
## 4 1.3293850 1.63441467 1.6505920 2.156468183 1.8246040 2.1246076
## 5 -0.2344338 -0.05053172 -0.1150606 -0.007598159 0.3334613 -0.1972320
## 6 0.6343544 0.63397775 0.6998560 0.431715150 0.5882928 0.6199382
## shell_weight sex age
## 1 0.45337303 I 9
## 2 -0.92678160 I 8
## 3 -0.01722411 M 9
## 4 2.30807939 F 11
## 5 -0.21495400 I 8
## 6 0.33868969 M 10
Build a regression model with the standardized data
mod2 <- scaled_crb %>% lm(age ~ ., data=.)
summary(mod2)
##
## Call:
## lm(formula = age ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0347 -1.2228 -0.3320 0.7537 22.0418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.34664 0.01513 683.888 < 2e-16 ***
## length 0.26192 0.05532 4.735 2.20e-06 ***
## diameter 0.50767 0.05669 8.956 < 2e-16 ***
## height 0.66469 0.02175 30.567 < 2e-16 ***
## weight 2.45709 0.06851 35.867 < 2e-16 ***
## shucked_weight -3.45012 0.03726 -92.603 < 2e-16 ***
## viscera_weight -0.60421 0.03315 -18.224 < 2e-16 ***
## shell_weight 1.83566 0.03520 52.152 < 2e-16 ***
## sexI -1.04074 0.02521 -41.284 < 2e-16 ***
## sexM -0.11521 0.01916 -6.014 1.82e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.009e+04 on 9 and 74041 DF, p-value: < 2.2e-16
plot(mod2)
Analysis of Residual
The status of the residuals remain the same after standardizing the data. There was no significant change with or without standardization. The Residual vs Fitted value plot shows that the error did maintained a fairly constant variability. However, there are few data points at the far right of the distribution. The data distribution is fairly normal. The R-squared \(R^2\) of of approximately 0.55 or 55% shows that the model is reasonable to explain about 55% variation in age as a result of the variables. All the variables are statistically significant due to their p-value that is nearly zero (0).
Standardize the test dataset
scaled_test <- cbind(scale(test[,3:9]),test[,1:2])
scaled_test$id <- as.integer(scaled_test$id)
head(scaled_test)
## length diameter height weight shucked_weight viscera_weight
## 1 -0.9402177 -1.1152375 -0.79875436 -1.1743970 -1.1534907 -1.1976428
## 2 -0.5481630 -0.5869038 -0.79875436 -0.6294603 -0.5523282 -0.6543278
## 3 -0.1125467 -0.1642368 -0.25680473 -0.7034641 -0.8150211 -0.4258309
## 4 0.8022475 -0.1642368 0.42063230 0.3886519 0.5792716 0.5287786
## 5 -0.7224096 -0.7454039 -0.93424176 -0.9254753 -0.8200729 -0.9336019
## 6 0.3666312 0.3640969 0.01417008 0.1083347 -0.2492211 0.2291938
## shell_weight id sex
## 1 -1.12548020 74051 I
## 2 -0.77662673 74052 I
## 3 -0.53877210 74053 F
## 4 0.07964996 74054 F
## 5 -0.95501771 74055 I
## 6 0.37300402 74056 M
Prediction of test dataset
#Prediction of test data set
y.pred_scale <- predict(mod2,scaled_test) %>% round(.,0) %>% data.frame(Age=.)
prediction_scale <- cbind(scaled_test$id,y.pred_scale) %>% rename(id=`scaled_test$id`)
head(prediction_scale)
## id Age
## 1 74051 8
## 2 74052 8
## 3 74053 10
## 4 74054 10
## 5 74055 7
## 6 74056 12
Compare The Two models
anova(mod1,mod2)
## Analysis of Variance Table
##
## Model 1: age ~ sex + length + diameter + height + weight + shucked_weight +
## viscera_weight + shell_weight
## Model 2: age ~ length + diameter + height + weight + shucked_weight +
## viscera_weight + shell_weight + sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 74041 335336
## 2 74041 335336 0 -5.8208e-10
In summary, the comparison of these two models suggests that there is no significant difference between them either in terms of the number of parameters they use or their fit to the data. The extremely small difference in the RSS and the zero difference in degrees of freedom point to these models being essentially identical in performance.
Write the Predictions data to local storage for submission
write.csv(prediction_scale, "C:/Users/newma/OneDrive/Desktop/MSDS Fall 2023/Data 605/Final Exam/crab_age_kaggle.csv", row.names=FALSE)