library(reactable)
library(vtable)
library(caret)
library(GGally)
library(tidyverse)
library(plotly)
library(LaplacesDemon)
library(ggeasy)
library(corrplot)
library(matrixcalc)
library(Matrix)
library(MASS)
library(moments)
library(car)
Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
# Random seed
set.seed(1234)
# Random variable X with 10,000 continuous random uniform values between 5 and 15
X <- runif(10000, min = 5, max = 15)
# Random variable Y with 10,000 random normal values
# Mean = 10, Standard Deviation = 2.89
Y <- rnorm(10000, mean = 10, sd = 2.89)
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.003 7.525 10.013 10.003 12.475 14.996
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.564 8.113 10.032 10.033 11.949 20.456
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
Formula for Conditional Probability
\[ \frac{P(X>x \cap X>y)}{P(X>y)} \]
# Estimate the medians
x_med <- median(X)
y_med <- median(Y)
prob_a <- sum(X>x_med & X>y_med)/sum(X>y_med)
prob_a
## [1] 1
prob_b <- sum(X>x_med & Y>y_med)/length(X)
prob_b
## [1] 0.2507
\[ \frac{P(X<x \cap X>y)}{P(X>y)} \]
prob_c <- sum(X<x_med & X>y_med)/sum(X>y_med)
prob_c
## [1] 0
Investigate whether \(P(X>x \cap Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
# Creating table
comparison_table <- table(X>x_med, Y>y_med)
# Calculate marginal probabilities
marg_prob_X_gt_x <- sum(X>x_med)/length(X)
marg_prob_Y_gt_y <- sum(Y>y_med)/length(Y)
# Calculate joint probability P(X>x & Y>y)
joint_prob <- sum(X>x_med & Y>y_med) / length(X)
# Check if P(X>x & Y>y) = P(X>x)*P(Y>y)
equality_check <- abs(joint_prob - (marg_prob_X_gt_x * marg_prob_Y_gt_y))
cat("Comparison Table:", comparison_table, "\n")
## Comparison Table: 2507 2493 2493 2507
cat("Marginal Probability P(X>x_med):", marg_prob_X_gt_x,"\n")
## Marginal Probability P(X>x_med): 0.5
cat("Marginal Probability P(Y>y_med):", marg_prob_Y_gt_y,"\n")
## Marginal Probability P(Y>y_med): 0.5
cat("Joint Probability P(X>x & Y>y):", joint_prob, "\n")
## Joint Probability P(X>x & Y>y): 0.2507
cat("Absolute difference between P(X>x & Y>y) and P(X>x) * P(Y>y):", equality_check, "\n")
## Absolute difference between P(X>x & Y>y) and P(X>x) * P(Y>y): 7e-04
# Matrix with the cell counts from comparison_table
# (rows = Y condition, columns = X condition)
comparison_matrix <- matrix(c(2507, 2493, 2493, 2507), nrow = 2, byrow = TRUE)
# Convert the matrix to a data frame for better display
comparison_df <- as.data.frame(comparison_matrix)
names(comparison_df) <- c("X > x_med", "X <= x_med")
row.names(comparison_df) <- c("Y > y_med", "Y <= y_med")
reactable(comparison_df)
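As a cross-check, the joint and marginal probabilities can also be read straight off comparison_table with base R's prop.table() and margin.table(); this is just a small sketch reusing the objects already defined above.
# Joint probabilities for each cell of the 2x2 table
prop.table(comparison_table)
# Marginal probabilities for the X condition (rows) and the Y condition (columns)
prop.table(margin.table(comparison_table, 1))
prop.table(margin.table(comparison_table, 2))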
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?
fisher_test_result <- fisher.test(comparison_table)
fisher_test_result
##
## Fisher's Exact Test for Count Data
##
## data: comparison_table
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
chi_square_test_result <- chisq.test(comparison_table)
chi_square_test_result
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: comparison_table
## X-squared = 0.0676, df = 1, p-value = 0.7949
Using an outside website as a reference, Fisher's Exact Test and the Chi-square test have similarities and differences. Both assess the relationship between two categorical variables and test whether the variables are independent. Fisher's Exact Test considers all possible contingency tables with the same row and column totals (i.e., the same marginal distributions) as the observed table, and computes the p-value as the proportion of those tables that are at least as extreme as the observed one. The test is valid for any sample size, but it is mainly used for small samples; common guidelines for preferring it include cell counts smaller than 20, any cell with an expected value of 5 or less, and extremely uneven row or column marginal totals. The Chi-Square test, by contrast, is used for large sample sizes. It uses a test statistic and its sampling distribution to calculate a p-value, but the chi-square sampling distribution only approximates the correct distribution, and the approximation improves as the cell counts in the table increase. As a result, chi-square p-values are unreliable when cell counts are small.
In this case, both tests return the same p-value: 0.7949. I am somewhat surprised the two tests delivered identical results. Based on the descriptions above, I would have thought the Chi-square test would have produced a different p-value; however, the sample size of 10,000 is large and the cell counts (around 2,500 each) are well beyond the small-count guidelines for Fisher's Exact Test, and with counts this large the two tests converge to essentially the same answer.
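One quick way to confirm that the chi-square guideline is satisfied here is to inspect the expected cell counts stored in the test object; the short sketch below simply reuses comparison_table.
# Expected cell counts under independence, taken from the chi-square test object
chisq.test(comparison_table)$expected
With 5,000 observations in each margin of the 10,000-observation table, every expected count is 2,500, far above the usual cutoff of 5, which is consistent with both tests returning essentially the same answer.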
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition (https://www.kaggle.com/competitions/playground-series-s3e16). I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
crab_age_df_train <- read.csv("https://raw.githubusercontent.com/moham6839/Data_605_Final_Project/main/train.csv", check.names = FALSE)
reactable(crab_age_df_train)
As stated in the description of the Crab Age dataset on the Kaggle website, “For a commercial crab farmer knowing the right age of the crab helps them decide if and when to harvest the crabs. Beyond a certain age, there is negligible growth in crab’s physical characteristics and hence, it is important to time the harvesting to reduce cost and increase profit.” Therefore, I will use Age as the dependent variable and treat the physical measurements as candidate independent variables.
st(crab_age_df_train)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| id | 74051 | 37025 | 21377 | 0 | 18512 | 55538 | 74050 |
| Sex | 74051 | | | | | | |
| … F | 23010 | 31% | | | | | |
| … I | 23957 | 32% | | | | | |
| … M | 27084 | 37% | | | | | |
| Length | 74051 | 1.3 | 0.29 | 0.19 | 1.1 | 1.5 | 2 |
| Diameter | 74051 | 1 | 0.24 | 0.14 | 0.89 | 1.2 | 1.6 |
| Height | 74051 | 0.35 | 0.092 | 0 | 0.3 | 0.41 | 2.8 |
| Weight | 74051 | 23 | 13 | 0.057 | 13 | 32 | 80 |
| Shucked Weight | 74051 | 10 | 5.6 | 0.028 | 5.7 | 14 | 42 |
| Viscera Weight | 74051 | 5.1 | 2.8 | 0.043 | 2.9 | 7 | 22 |
| Shell Weight | 74051 | 6.7 | 3.6 | 0.043 | 4 | 9.1 | 28 |
| Age | 74051 | 10 | 3.2 | 1 | 8 | 11 | 29 |
summary(crab_age_df_train$Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1875 1.1500 1.3750 1.3175 1.5375 2.0128
ggplot(crab_age_df_train, aes(Length)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Length") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Length, xlab = "Length", main = "Length of Crabs", col = "blue")
summary(crab_age_df_train$Diameter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1375 0.8875 1.0750 1.0245 1.2000 1.6125
ggplot(crab_age_df_train, aes(Diameter)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Diameter") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Diameter, xlab = "Diameter", main = "Diameter of Crabs", col = "blue")
summary(crab_age_df_train$Height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.3625 0.3481 0.4125 2.8250
ggplot(crab_age_df_train, aes(Height)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Height") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Height, xlab = "Height", main = "Height of Crabs", col = "blue")
summary(crab_age_df_train$Weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0567 13.4377 23.7994 23.3852 32.1625 80.1015
ggplot(crab_age_df_train, aes(Weight)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Weight") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Weight, xlab = "Weight", main = "Weight of Crabs", col = "blue")
summary(crab_age_df_train$`Shucked Weight`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02835 5.71242 9.90815 10.10427 14.03300 42.18406
ggplot(crab_age_df_train, aes(`Shucked Weight`)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Shucked Weight") +
ggeasy::easy_center_title()
hist(crab_age_df_train$`Shucked Weight`, xlab = "Shucked Weight", main = "Shucked Weight of Crabs", col = "blue")
summary(crab_age_df_train$`Viscera Weight`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04252 2.86330 4.98951 5.05839 6.98815 21.54562
ggplot(crab_age_df_train, aes(`Viscera Weight`)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Viscera Weight") +
ggeasy::easy_center_title()
hist(crab_age_df_train$`Viscera Weight`, xlab = "Viscera Weight", main = "Viscera Weight of Crabs", col = "blue")
summary(crab_age_df_train$`Shell Weight`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04252 3.96893 6.93145 6.72387 9.07184 28.49125
ggplot(crab_age_df_train, aes(`Shell Weight`)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Shell Weight") +
ggeasy::easy_center_title()
hist(crab_age_df_train$`Shell Weight`, xlab = "Shell Weight", main = "Shell Weight of Crabs", col = "blue")
summary(crab_age_df_train$Sex)
## Length Class Mode
## 74051 character character
ggplot(crab_age_df_train, aes(Sex)) +
geom_bar() +
theme_minimal() +
ggtitle("Gender of Crabs") +
ggeasy::easy_center_title()
summary(crab_age_df_train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 8.000 10.000 9.968 11.000 29.000
ggplot(crab_age_df_train, aes(Age)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Age") +
ggeasy::easy_center_title()
hist(crab_age_df_train$Age, xlab = "Age", main = "Age of Crabs", col = "blue")
Using Age as a dependent variable, I will use Height and Weight as my independent variables:
train_set2 <- crab_age_df_train %>%
dplyr::select(Age, Height, Weight)
pairs(train_set2, main = "Scatter Plot Matrix for Age, Height, and Weight")
ggpairs(train_set2, title="Scatterplot Matrix Using Age, Height, and Weight")
train_set_df_corr <- crab_age_df_train %>%
dplyr::select(`Shucked Weight`, `Viscera Weight`, `Shell Weight`)
corr_mat <- cor(train_set_df_corr)
corr_mat
## Shucked Weight Viscera Weight Shell Weight
## Shucked Weight 1.0000000 0.942626 0.9103977
## Viscera Weight 0.9426260 1.000000 0.9339190
## Shell Weight 0.9103977 0.933919 1.0000000
ggpairs(train_set_df_corr)
ggcorrplot::ggcorrplot(corr_mat, hc.order = TRUE, type = "lower",
lab = TRUE)
shuck_and_vis <- cor.test(train_set_df_corr$`Shucked Weight`, train_set_df_corr$`Viscera Weight`, conf.level = 0.8)
shuck_and_vis
##
## Pearson's product-moment correlation
##
## data: train_set_df_corr$`Shucked Weight` and train_set_df_corr$`Viscera Weight`
## t = 768.33, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9420988 0.9431486
## sample estimates:
## cor
## 0.942626
shuck_and_shell <- cor.test(train_set_df_corr$`Shucked Weight`, train_set_df_corr$`Shell Weight`, conf.level = 0.8)
shuck_and_shell
##
## Pearson's product-moment correlation
##
## data: train_set_df_corr$`Shucked Weight` and train_set_df_corr$`Shell Weight`
## t = 598.78, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9095880 0.9112004
## sample estimates:
## cor
## 0.9103977
vis_and_shell <- cor.test(train_set_df_corr$`Viscera Weight`, train_set_df_corr$`Shell Weight`, conf.level = 0.8)
vis_and_shell
##
## Pearson's product-moment correlation
##
## data: train_set_df_corr$`Viscera Weight` and train_set_df_corr$`Shell Weight`
## t = 710.9, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9333145 0.9345182
## sample estimates:
## cor
## 0.933919
When analyzing the correlations of the three variables, each pairwise correlation is statistically significant, with p-values far below .05, so we can reject the null hypothesis that the true correlation is 0 for every pair. However, the three variables are highly collinear with one another (all pairwise correlations above 0.9), which will make it difficult to separate their individual effects when they are used together in a regression model.
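To make the three interval estimates easier to compare, the 80% confidence intervals reported above can be collected into one small table; this is a sketch that assumes the cor.test objects shuck_and_vis, shuck_and_shell, and vis_and_shell are still in the workspace.
# Collect the three 80% confidence intervals into a single table
ci_table <- rbind(
  "Shucked vs. Viscera" = shuck_and_vis$conf.int,
  "Shucked vs. Shell" = shuck_and_shell$conf.int,
  "Viscera vs. Shell" = vis_and_shell$conf.int
)
colnames(ci_table) <- c("Lower 80%", "Upper 80%")
round(ci_table, 4)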
Familywise error rate (FWER) is the probability of making at least one Type I error, i.e., a false positive, across a family of tests (https://statweb.stanford.edu/~joftius/slides/testing.pdf). The formula for calculating FWER is \(1-(1-\alpha)^m\), where \(m\) is the number of tests (https://www.statistics.com/glossary/family-wise-type-i-error/). We can calculate the error rate for our three correlation tests:
alpha <- 0.2 # 1 - 0.8 (confidence interval)
m <- 3
FWER <- (1 - ((1 - alpha)^m))*100
FWER
## [1] 48.8
The FWER is 48.8%, which is quite high and a cause for concern. However, there are methods that can reduce the FWER. One is the Bonferroni Correction, which adjusts the significance level to control the probability of a Type I error (https://statisticsbyjim.com/hypothesis-testing/bonferroni-correction/). It is calculated by dividing the significance level by the number of tests. In this case, it would be:
BC <- alpha/m
BC
## [1] 0.06666667
The adjusted significance level is about 0.067, which means each individual test is now conducted at roughly a 6.7% level rather than the original 0.2 (20%), keeping the familywise error rate at or below 20%.
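Equivalently, the correction can be applied to the p-values themselves with R's built-in p.adjust(); the sketch below assumes the three cor.test objects created earlier, and with p-values this small the Bonferroni adjustment will not change the conclusions.
# Bonferroni-adjust the three correlation-test p-values
p_vals <- c(shuck_and_vis$p.value, shuck_and_shell$p.value, vis_and_shell$p.value)
p.adjust(p_vals, method = "bonferroni")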
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
# Invert Correlation Matrix to Precision Matrix
precision_matrix <- Cov2Prec(corr_mat)
reactable(precision_matrix)
# Multiply Correlation Matrix by the Precision Matrix
corr_times_prec <- corr_mat %*% precision_matrix
reactable(corr_times_prec)
# Multiply Precision Matrix by Correlation Matrix
prec_times_corr <- precision_matrix %*% corr_mat
reactable(prec_times_corr)
LDU <- lu.decomposition(prec_times_corr)
LDU
## $L
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00 0
## [2,] -2.979398e-16 1.000000e+00 0
## [3,] -5.718615e-16 -2.936959e-16 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 -1.663803e-15 0
## [2,] 0 1.000000e+00 0
## [3,] 0 0.000000e+00 1
reactable(LDU$L)
reactable(LDU$U)
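A few sanity checks can confirm the linear-algebra steps: the precision matrix should match a direct inversion with solve(), both products should be (numerically) the identity matrix, and multiplying the LDU factors back together should recover the decomposed matrix. This sketch reuses the objects created above and assumes Cov2Prec performs a plain matrix inversion.
# Each line should return TRUE up to floating-point tolerance
all.equal(solve(corr_mat), precision_matrix, check.attributes = FALSE)
all.equal(corr_times_prec, diag(3), check.attributes = FALSE)
all.equal(LDU$L %*% LDU$U, prec_times_corr, check.attributes = FALSE)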
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
When examining the histograms of each quantitative variable, Length and Diameter were the only variables that appeared to be skewed to the left. I decided to check the skewness of each variable:
skewness(crab_age_df_train$Length)
## [1] -0.8443598
skewness(crab_age_df_train$Diameter)
## [1] -0.8128491
skewness(crab_age_df_train$Height)
## [1] 0.08657593
skewness(crab_age_df_train$Weight)
## [1] 0.2314599
skewness(crab_age_df_train$`Shucked Weight`)
## [1] 0.3494646
skewness(crab_age_df_train$`Viscera Weight`)
## [1] 0.286377
skewness(crab_age_df_train$`Shell Weight`)
## [1] 0.2774533
skewness(crab_age_df_train$Age)
## [1] 1.092897
My assumptions were accurate. Length and Diameter have negative skewness values, indicating a skew to the left. The rest of the variables have positive skewness values, indicating a skew to the right. I decided to work with Shucked Weight, since it has the highest skewness among the independent variables and shows a distinct skew to the right.
shuck_weight_epdf <- fitdistr(crab_age_df_train$`Shucked Weight`, densfun = 'exponential')
shuck_weight_epdf
## rate
## 0.0989680626
## (0.0003636885)
lambda <- shuck_weight_epdf$estimate
shuck_weight_lambda_epdf <- rexp(1000, lambda)
shuck_weight_lambda_df <- as.data.frame(shuck_weight_lambda_epdf)
reactable(shuck_weight_lambda_df)
par(mfrow = c(1, 2))
hist(crab_age_df_train$`Shucked Weight`, xlab = "Shucked Weight", main = "Original Shucked Weight of Crabs", col = "blue")
hist(shuck_weight_lambda_df$shuck_weight_lambda_epdf, xlab = "Transformed Shucked Weight", main = "Shucked Weight of Crabs Transformed", col = "red")
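For a more direct visual comparison, the fitted exponential density can be overlaid on the original histogram; this is a sketch that reuses the lambda estimated by fitdistr and resets the plotting layout first.
# Reset the plotting layout and overlay the fitted exponential density
par(mfrow = c(1, 1))
hist(crab_age_df_train$`Shucked Weight`, freq = FALSE, breaks = 30,
     xlab = "Shucked Weight", main = "Shucked Weight with Fitted Exponential", col = "blue")
curve(dexp(x, rate = lambda), add = TRUE, col = "red", lwd = 2)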
# 5th and 95th percentiles, here taken as empirical quantiles of the 1,000 simulated draws
percentiles <- quantile(shuck_weight_lambda_df$shuck_weight_lambda_epdf, c(0.05, .95))
percentiles
## 5% 95%
## 0.5415927 29.1781950
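The percentiles above come from the 1,000 simulated draws. The percentiles implied directly by the fitted exponential CDF can also be computed with the inverse CDF, qexp(), using the same lambda; values close to the simulated quantiles would be expected.
# Theoretical 5th and 95th percentiles of the fitted exponential distribution
qexp(c(0.05, 0.95), rate = lambda)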
# Also generate a 95% confidence interval from the empirical data, assuming normality.
confidence_interval <- t.test(crab_age_df_train$`Shucked Weight`)$conf.int
names(confidence_interval) <- c("LowerCI", "UpperCI")
confidence_interval
## LowerCI UpperCI
## 10.06381 10.14473
## attr(,"conf.level")
## [1] 0.95
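Because the sample is so large, essentially the same interval can be reproduced by hand from the normal approximation \(\bar{x} \pm z \cdot s/\sqrt{n}\); with n = 74,051 the t and z quantiles are nearly identical. A quick base R sketch:
# 95% confidence interval for the mean from the normal approximation
sw <- crab_age_df_train$`Shucked Weight`
mean(sw) + c(-1, 1) * qnorm(0.975) * sd(sw) / sqrt(length(sw))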
#Finally, provide the empirical 5th percentile and 95th percentile of the data.
empirical_percentiles <- quantile(crab_age_df_train$`Shucked Weight`, c(0.05, .95))
empirical_percentiles
## 5% 95%
## 1.502523 19.376883
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
To begin, I will run an initial model using Age as the dependent variable and including the rest of the variables as independent, with the exception of id.
sum(is.na(crab_age_df_train))
## [1] 0
# Removing the id variable
train_set_model <- crab_age_df_train %>%
dplyr::select(-id)
ggpairs(train_set_model)
train_set_model2 <- train_set_model %>%
dplyr::select(-Sex)
cor_matrix_train <- cor(train_set_model2)
ggcorrplot::ggcorrplot(cor_matrix_train, hc.order = TRUE, type = "lower",
lab = TRUE)
There appears to be high multicollinearity among the independent variables. The lowest correlation coefficient among them is 0.864, the correlation between Height and Shucked Weight. Since even the lowest pairwise correlation is that high, the independent variables carry a great deal of multicollinearity with one another, which can make it difficult to determine each variable’s effect on the dependent variable. Nevertheless, I’ll create an initial model with all the independent variables and evaluate the model’s performance.
initial_ml <- lm(Age ~ ., data=train_set_model)
summary(initial_ml)
##
## Call:
## lm(formula = Age ~ ., data = train_set_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0347 -1.2228 -0.3320 0.7537 22.0418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.755758 0.075897 49.485 < 2e-16 ***
## SexI -1.040735 0.025209 -41.284 < 2e-16 ***
## SexM -0.115212 0.019157 -6.014 1.82e-09 ***
## Length 0.910211 0.192234 4.735 2.20e-06 ***
## Diameter 2.138507 0.238786 8.956 < 2e-16 ***
## Height 7.222272 0.236275 30.567 < 2e-16 ***
## Weight 0.194265 0.005416 35.867 < 2e-16 ***
## `Shucked Weight` -0.614115 0.006632 -92.603 < 2e-16 ***
## `Viscera Weight` -0.216351 0.011872 -18.224 < 2e-16 ***
## `Shell Weight` 0.512127 0.009820 52.152 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.009e+04 on 9 and 74041 DF, p-value: < 2.2e-16
# Checking for multicollinearity using the Variance Inflation Factor
vif_values <- car::vif(initial_ml)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Sex 1.815731 2 1.160815
## Length 50.029756 1 7.073172
## Diameter 52.539130 1 7.248388
## Height 7.731224 1 2.780508
## Weight 76.731546 1 8.759654
## `Shucked Weight` 22.695250 1 4.763953
## `Viscera Weight` 17.972270 1 4.239371
## `Shell Weight` 20.256450 1 4.500717
plot(initial_ml)
ggplot(initial_ml, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title="Residual vs. Fitted Values Plot") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = initial_ml, aes(x = initial_ml$residuals)) +
geom_histogram(bins = 10, fill = 'steelblue', color = 'black') +
labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')
ggplot(data = initial_ml, aes(x = .resid)) +
geom_histogram(binwidth = 0.4) +
xlab("Residuals")
qqnorm(resid(initial_ml))
qqline(resid(initial_ml))
When analyzing the results of the initial model, several observations stand out. The overall p-value of the model is far below .05, which indicates that the predictors, taken together, are statistically significant for explaining the dependent variable Age. The Adjusted R-squared of 55.08% is a reasonable starting point, meaning that about 55.08% of the variation in Age is explained by the variables in the model. The F-statistic of 1.009e+04, combined with its p-value of less than 2.2e-16, indicates that the null hypothesis that all coefficients are zero can be rejected and the model as a whole is statistically significant. The p-value of each variable is below .05, indicating that each is individually statistically significant, and every coefficient has an absolute t-value well above the usual threshold of about 2, meaning the standard errors are small relative to their respective coefficients.
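The same significance conclusions can be read from the coefficient confidence intervals; given the t-values above, none of the intervals should contain zero. A quick sketch on the fitted model object:
# 95% confidence intervals for the coefficients of the initial model
confint(initial_ml, level = 0.95)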
The diagnostic plots show that the model does not provide a good fit for the training data. The Residuals vs. Fitted plot shows the residuals mostly concentrated around zero for fitted values between 5 and 20, which may indicate that the model does not fit well outside that range. The histogram of the residuals is centered near zero but slightly skewed to the right. The Q-Q plot shows both the right and left tails deviating from the reference line, which indicates that the model does not do a good job of capturing a linear relationship between the independent variables and Age.
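The right skew noted in the residual histogram can be quantified with the skewness() function from the moments package loaded earlier; a value noticeably above zero would confirm the visual impression.
# Skewness of the residuals from the initial model
skewness(resid(initial_ml))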
Since the variables have high multicollinearity and are relatively skewed, I decided to log transform the dependent and independent variables. For purposes of this specific model, I will not include the Sex column.
# train_set_model2 is the data frame without the Sex column
# Note: Height has a minimum of 0 in the training data, so log(Height) will produce -Inf
# for those rows; they may need a small shift or removal before modeling
revised_train_model <- log(train_set_model2)
names(revised_train_model) <- paste0(names(train_set_model2), "_log")
reactable(revised_train_model)