Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
set.seed(1234)
X <- runif(10000, min=5, max=15)
Y <- rnorm(10000, mean=10, sd=2.89)
Let's check our distributions.
tibble(val = X) |>
ggplot(aes(val)) +
geom_histogram() +
ggtitle("X Normal Distribution") +
theme_bw()
tibble(val = Y) |>
ggplot(aes(val)) +
geom_histogram() +
ggtitle("Y Normal Distribution") +
theme_bw()
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
Calculate the medians
x <- median(X)
y <- median(Y)
a. P(X>x | X>y)
\[ P(X>x | X>y) = \frac{P(X>x \cap X>y)}{P(X>y)} \]
prob_df <- tibble(X = X, Y = Y)
p_xx_xy <-
prob_df %>%
filter(X>x, X>y) |>
nrow()/10000
p_xy <-
prob_df %>%
filter(X>y) |>
nrow()/10000
p_xx_xy / p_xy
## [1] 0.4979
b. P(X>x & Y>y)
\[ P(X>x, Y>y) = P(X>x \cap Y>y) \]
p_xx_xy <-
prob_df %>%
filter(X>x, Y>y) |>
nrow()/10000
p_xx_xy
## [1] 0.2507
c. P(X<x | X>y)
\[ P(X<x | X>y) = \frac{P(X<x \cap X>y)}{P(X>y)} \]
p_xx_xy <-
prob_df %>%
filter(X<x, X>y) |>
nrow()/10000
p_xx_xy / p_xy
## [1] 0
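Why zero? Both events involve X only, so the answer depends on whether the median of Y sits above the median of X in this sample: if y is at least x, no draw of X can be below x yet above y. A quick check of the two medians makes this concrete:

x
y
x < y  # if TRUE, the event {X < x, X > y} is empty, so the probability is 0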
Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
joint_table <-
data.frame(
c(prob_df %>% filter(X>x, Y>y) |> nrow()/10000,
prob_df %>% filter(X>x, Y<=y) |> nrow()/10000),
c(prob_df %>% filter(X<=x, Y>y) |> nrow()/10000,
prob_df %>% filter(X<=x, Y<=y) |> nrow()/10000)) %>%
mutate(Total = rowSums(.)) %>%
rbind(., colSums(.))
colnames(joint_table) <- c("X > x", "$X \\le x$", "total")
rownames(joint_table) <- c("Y > y", "$Y \\le y$", "total")
knitr::kable(joint_table, caption='Joint Probability', format="simple")
|             | X > x  | \(X \le x\) | total |
|-------------|--------|-------------|-------|
| Y > y       | 0.2507 | 0.2493      | 0.5   |
| \(Y \le y\) | 0.2493 | 0.2507      | 0.5   |
| total       | 0.5000 | 0.5000      | 1.0   |
prob_xx <-
prob_df %>% filter(X>x) |> nrow()/10000
prob_yy <-
prob_df %>% filter(Y>y) |> nrow()/10000
prob_xx * prob_yy
## [1] 0.25
Comparing the joint probability \(P(X>x, Y>y) = 0.2507\) with the product of the marginal probabilities \(P(X>x)P(Y>y) = 0.25\), the two are quite close but not exactly equal, which is consistent with X and Y being independent.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?
Let’s set the following:
\(H_0\): \(X>x\) and \(Y>y\) are independent. \(H_a\): \(X>x\) and \(Y>y\) are NOT independent.
table(X>x, Y>y) |> fisher.test()
##
## Fisher's Exact Test for Count Data
##
## data: table(X > x, Y > y)
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
table(X>x, Y>y) |> chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(X > x, Y > y)
## X-squared = 0.0676, df = 1, p-value = 0.7949
Both the Fisher and chi-square tests give a p-value of 0.7949, so we fail to reject \(H_0\): there is no evidence against independence between the events \(X>x\) and \(Y>y\). This is not surprising, since X and Y were generated independently of one another.
It’s noted that “Fisher’s exact test is practically applied only in analysis of small samples but actually it is valid for all sample sizes. While the chi-squared test relies on an approximation, Fisher’s exact test is one of exact tests.” NCBI
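Since the reliability of the chi-square approximation depends on the expected cell counts, a quick look at them (all roughly 2,500 here, given the 0.5/0.5 marginals and n = 10,000) confirms that the chi-square test is perfectly appropriate at this sample size:

# Expected counts under independence; the chi-square approximation is
# reliable when these are all reasonably large (rule of thumb: at least 5)
chisq.test(table(X > x, Y > y))$expected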
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Import the training dataset.
df <- read_csv('train.csv') |> clean_names()
Retrieve descriptive statistics of each variable. We can see Sex is a categorical variable.
knitr::kable(describe(df), caption = "Descriptive Statistics of the Training Dataset", digits=4) %>%
kable_styling(latex_options = "scale_down") %>%
kableExtra::landscape()
|                | vars | n     | mean       | sd         | median     | trimmed    | mad        | min    | max        | range      | skew    | kurtosis | se      |
|----------------|------|-------|------------|------------|------------|------------|------------|--------|------------|------------|---------|----------|---------|
| id             | 1    | 74051 | 37025.0000 | 21376.8267 | 37025.0000 | 37025.0000 | 27447.3738 | 0.0000 | 74050.0000 | 74050.0000 | 0.0000  | -1.2000  | 78.5557 |
| sex*           | 2    | 74051 | 2.0550     | 0.8206     | 2.0000     | 2.0688     | 1.4826     | 1.0000 | 3.0000     | 2.0000     | -0.1019 | -1.5081  | 0.0030  |
| length         | 3    | 74051 | 1.3175     | 0.2878     | 1.3750     | 1.3450     | 0.2780     | 0.1875 | 2.0128     | 1.8253     | -0.8443 | 0.2916   | 0.0011  |
| diameter       | 4    | 74051 | 1.0245     | 0.2374     | 1.0750     | 1.0464     | 0.2224     | 0.1375 | 1.6125     | 1.4750     | -0.8128 | 0.1771   | 0.0009  |
| height         | 5    | 74051 | 0.3481     | 0.0920     | 0.3625     | 0.3525     | 0.0927     | 0.0000 | 2.8250     | 2.8250     | 0.0866  | 14.1518  | 0.0003  |
| weight         | 6    | 74051 | 23.3852    | 12.6482    | 23.7994    | 23.0514    | 13.9753    | 0.0567 | 80.1015    | 80.0448    | 0.2315  | -0.4019  | 0.0465  |
| shucked_weight | 7    | 74051 | 10.1043    | 5.6180     | 9.9082     | 9.9184     | 6.1786     | 0.0283 | 42.1841    | 42.1557    | 0.3495  | -0.1192  | 0.0206  |
| viscera_weight | 8    | 74051 | 5.0584     | 2.7927     | 4.9895     | 4.9707     | 3.0683     | 0.0425 | 21.5456    | 21.5031    | 0.2864  | -0.3654  | 0.0103  |
| shell_weight   | 9    | 74051 | 6.7239     | 3.5844     | 6.9315     | 6.6245     | 3.8038     | 0.0425 | 28.4912    | 28.4487    | 0.2774  | -0.1426  | 0.0132  |
| age            | 10   | 74051 | 9.9678     | 3.1752     | 10.0000    | 9.6888     | 2.9652     | 1.0000 | 29.0000    | 28.0000    | 1.0929  | 2.2963   | 0.0117  |
The histogram plots show the distribution of every numeric variable. For example, age and shell_weight are right-skewed, while diameter and length are left-skewed.
df |>
pivot_longer(
-c(id, sex),
names_to = "variable",
values_to = "value"
)|>
ggplot(aes(value)) +
geom_histogram() +
facet_wrap(vars(variable), scales = "free") +
theme_bw()
Scatterplot matrix for age, height, length, weight, and diameter. In general, as age increases there is a positive trend in each of the independent variables (height, length, weight, diameter).
pairs(df |> subset(select=c(age, height, length, weight, diameter)))
Let's create a correlation matrix to measure their relationships with each other. Unsurprisingly, but still strikingly, height, length, and weight are all highly correlated with one another, with pairwise correlations above 0.90.
knitr::kable(cor(df |> subset(select=c(height, length, weight))), caption = "Correlation Matrix", digits=4)
|        | height | length | weight |
|--------|--------|--------|--------|
| height | 1.0000 | 0.9184 | 0.9018 |
| length | 0.9184 | 1.0000 | 0.9364 |
| weight | 0.9018 | 0.9364 | 1.0000 |
Testing each pairwise correlation with an 80% confidence interval, every interval sits far above zero (all p-values < 2.2e-16), so the correlations are clearly nonzero and the predictors show a high degree of multicollinearity. This violates one of the assumptions of multiple linear regression and will pose a problem for our model later on. As for familywise error, it is defined as "the probability of making a Type I error among a specified group, or 'family,' of tests" ([Statology](https://www.statology.org/family-wise-error-rate/)). This is a concern here because we performed several hypothesis tests. The familywise error rate is \(1-(1-\alpha)^c\); with \(\alpha = 0.2\) and \(c = 3\) tests, \(1-(1-0.2)^3 = 0.488\), so there is roughly a 49% chance of committing at least one Type I error across this family of tests.
cor.test(df$height, df$length , method=c("pearson"), conf.level=0.80)
##
## Pearson's product-moment correlation
##
## data: df$height and df$length
## t = 631.44, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9176108 0.9190862
## sample estimates:
## cor
## 0.9183517
cor.test(df$height, df$weight , method=c("pearson"), conf.level=0.80)
##
## Pearson's product-moment correlation
##
## data: df$height and df$weight
## t = 567.76, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9008913 0.9026508
## sample estimates:
## cor
## 0.9017748
cor.test(df$length, df$weight , method=c("pearson"), conf.level=0.80)
##
## Pearson's product-moment correlation
##
## data: df$length and df$weight
## t = 725.93, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9357910 0.9369515
## sample estimates:
## cor
## 0.9363738
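As a quick check of the familywise error arithmetic above (\(\alpha = 0.20\) for the 80% intervals, \(c = 3\) pairwise tests):

alpha <- 0.20     # 1 minus the confidence level of the intervals
c_tests <- 3      # number of pairwise correlation tests performed
1 - (1 - alpha)^c_tests   # familywise error rate = 0.488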
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
correlation_matrix <- cor(df |> subset(select=c(height, length, weight)))
precision_matrix <- solve(correlation_matrix)
correlation_precision <- correlation_matrix %*% precision_matrix
print("correlation_precision")
## [1] "correlation_precision"
print(correlation_precision)
## height length weight
## height 1.000000e+00 0.000000e+00 0
## length 4.440892e-16 1.000000e+00 0
## weight 4.440892e-16 8.881784e-16 1
precision_correlation <- precision_matrix %*% correlation_matrix
print("precision_correlation")
## [1] "precision_correlation"
print(precision_correlation)
## height length weight
## height 1 4.440892e-16 4.440892e-16
## length 0 1.000000e+00 8.881784e-16
## weight 0 0.000000e+00 1.000000e+00
lu.decomposition(precision_correlation)
## $L
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 4.440892e-16 4.440892e-16
## [2,] 0 1.000000e+00 8.881784e-16
## [3,] 0 0.000000e+00 1.000000e+00
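Note that the decomposition above is applied to precision_correlation, which is numerically the identity, so L and U come out trivial. For a non-trivial factorization, one option is to decompose the correlation matrix itself and pull the diagonal factor D out of U; a sketch, assuming the matrixcalc package (the source of lu.decomposition()):

library(matrixcalc)
ldu <- lu.decomposition(correlation_matrix)   # correlation_matrix = L %*% U
L <- ldu$L                                    # unit lower triangular
D <- diag(diag(ldu$U))                        # diagonal factor
U <- solve(D) %*% ldu$U                       # unit upper triangular
# Check that L %*% D %*% U reproduces the correlation matrix
all.equal(L %*% D %*% U, correlation_matrix, check.attributes = FALSE)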
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. See link.
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
original_distribution <-
df |>
ggplot(aes(weight)) +
geom_histogram() +
geom_vline(aes(xintercept=mean(weight), color='blue')) +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Weight Distribution") +
geom_text(aes(x=mean(weight) + 3, label=paste0("Mean = ", round(mean(weight), 2)), y=1000), colour="red", angle=90)
exponential_pdf <- fitdistr(df$weight, "exponential")
exponential_pdf
## rate
## 0.0427620578
## (0.0001571423)
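For an exponential distribution, the maximum-likelihood estimate of the rate is simply the reciprocal of the sample mean, so the fitted rate of roughly 0.0428 can be sanity-checked directly:

1 / mean(df$weight)   # MLE of the exponential rate; should match the fitdistr estimate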
lambda <- exponential_pdf$estimate
exponential_sampling <- rexp(1000, rate=lambda)
exponential_distribution <-
ggplot(mapping = aes(exponential_sampling)) +
geom_histogram() +
geom_vline(aes(xintercept=mean(exponential_sampling), color='red')) +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Exponential Weight Distribution") +
geom_text(aes(x=mean(exponential_sampling) + 3, label=paste0("Mean = ", round(mean(exponential_sampling), 2)), y=35), colour="red", angle=90)
grid.arrange(original_distribution, exponential_distribution, nrow = 1)
original_stats <- describe(df$weight)
rownames(original_stats) <- c("Original")
exponential_stats <- describe(exponential_sampling)
rownames(exponential_stats) <- c("Exp. Sample")
compare_stats <- rbind(original_stats, exponential_stats)
knitr::kable(compare_stats, digits=4)
|             | vars | n     | mean    | sd      | median  | trimmed | mad     | min    | max      | range    | skew   | kurtosis | se     |
|-------------|------|-------|---------|---------|---------|---------|---------|--------|----------|----------|--------|----------|--------|
| Original    | 1    | 74051 | 23.3852 | 12.6482 | 23.7994 | 23.0514 | 13.9753 | 0.0567 | 80.1015  | 80.0448  | 0.2315 | -0.4019  | 0.0465 |
| Exp. Sample | 1    | 1000  | 23.0859 | 23.0136 | 16.2572 | 19.2704 | 16.7087 | 0.0094 | 189.5325 | 189.5231 | 2.1294 | 7.3002   | 0.7278 |
Comparing the two histograms, the y-axis counts change drastically, from over 6,000 per bin for the original data down to the low 200s per bin for the exponential sample (which has only 1,000 draws). The means stay roughly the same, but the table shows the medians differ substantially, dropping from 23.79 in the original data to 16.25 in the exponential sample. Given the skewness and potential outliers visible in the exponential sample, the median and IQR would be better summary statistics than the mean and standard deviation.
CDF Percentile
quantile(exponential_sampling, c(0.05, 0.95))
## 5% 95%
## 1.253456 67.529711
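The percentiles above come from the 1,000 simulated draws. The question asks for percentiles from the exponential CDF itself; those can be read directly from the inverse CDF using the fitted rate:

# Theoretical 5th and 95th percentiles of the fitted exponential distribution
qexp(c(0.05, 0.95), rate = lambda)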
Confidence Intervals of Empirical Data
t.test(df$weight)
##
## One Sample t-test
##
## data: df$weight
## t = 503.13, df = 74050, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 23.29412 23.47632
## sample estimates:
## mean of x
## 23.38522
quantile(df$weight, c(0.05, 0.95))
## 5% 95%
## 3.600386 44.211045
Here the ranges again change drastically: the exponential sample spans a much wider range than the original empirical data, and its 5th/95th percentiles (1.25, 67.53) differ noticeably from the empirical ones (3.60, 44.21). Given these differences, the weight variable does not appear to follow an exponential distribution.
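One way to formalize this comparison, as a rough check, is a Kolmogorov-Smirnov test of the observed weights against the fitted exponential distribution (a small p-value indicates the data are inconsistent with that distribution; ties in the data will produce a warning):

ks.test(df$weight, "pexp", rate = lambda)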
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
We are first going to assume that the values provided in the test set are a random sampling of observations!
lm_model <- lm(age ~ sex + length + diameter + weight + height + shucked_weight + viscera_weight + shell_weight, data=df)
summary(lm_model)
##
## Call:
## lm(formula = age ~ sex + length + diameter + weight + height +
## shucked_weight + viscera_weight + shell_weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0347 -1.2228 -0.3320 0.7537 22.0418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.755758 0.075897 49.485 < 2e-16 ***
## sexI -1.040735 0.025209 -41.284 < 2e-16 ***
## sexM -0.115212 0.019157 -6.014 1.82e-09 ***
## length 0.910211 0.192234 4.735 2.20e-06 ***
## diameter 2.138507 0.238786 8.956 < 2e-16 ***
## weight 0.194265 0.005416 35.867 < 2e-16 ***
## height 7.222272 0.236275 30.567 < 2e-16 ***
## shucked_weight -0.614115 0.006632 -92.603 < 2e-16 ***
## viscera_weight -0.216351 0.011872 -18.224 < 2e-16 ***
## shell_weight 0.512127 0.009820 52.152 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.009e+04 on 9 and 74041 DF, p-value: < 2.2e-16
res <- resid(lm_model)
plot(fitted(lm_model), res)
abline(0,0)
We see some heteroscedasticity: the residuals fan out in a cone shape to the right, which can bias the standard errors of our estimates. Separately, the correlation matrix earlier showed strong correlations among the predictors, so multicollinearity is also a concern for the coefficient estimates.
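As a more formal check of the heteroscedasticity seen in the residual plot, one option (assuming the lmtest package is available) is the Breusch-Pagan test:

library(lmtest)
bptest(lm_model)   # a small p-value indicates heteroscedastic residuals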
qqnorm(res)
qqline(res)
Lastly, the Q-Q plot shows a heavy right tail, which violates the normality assumption for the residuals. So even though the model has statistically significant predictors and a moderate \(R^2\), it is difficult to trust the results for predictive purposes without expecting sizable errors.
vif(lm_model, type='predictor')
lm_model2 <- lm(age ~ sex + height + shucked_weight + viscera_weight + shell_weight, data=df)
summary(lm_model2)
##
## Call:
## lm(formula = age ~ sex + height + shucked_weight + viscera_weight +
## shell_weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.9484 -1.2479 -0.3228 0.7710 21.2596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.180358 0.052043 99.539 < 2e-16 ***
## sexI -1.156181 0.025362 -45.588 < 2e-16 ***
## sexM -0.125536 0.019438 -6.458 1.06e-10 ***
## height 10.768804 0.211527 50.910 < 2e-16 ***
## shucked_weight -0.399978 0.004397 -90.958 < 2e-16 ***
## viscera_weight 0.033403 0.010337 3.232 0.00123 **
## shell_weight 0.792911 0.007152 110.860 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.16 on 74044 degrees of freedom
## Multiple R-squared: 0.5374, Adjusted R-squared: 0.5374
## F-statistic: 1.434e+04 on 6 and 74044 DF, p-value: < 2.2e-16
vif(lm_model2, type='predictor')
## GVIFs computed for predictors
#define intercept-only model
intercept_only <- lm(age ~ 1, data=df)
#define model with all predictors
all <- lm(age ~ ., data=df)
#perform stepwise regression
both_lm <- step(intercept_only, direction='both', scope=formula(all), trace=0)
summary(both_lm)
##
## Call:
## lm(formula = age ~ shell_weight + shucked_weight + diameter +
## sex + weight + height + viscera_weight + length, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0347 -1.2228 -0.3320 0.7537 22.0418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.755758 0.075897 49.485 < 2e-16 ***
## shell_weight 0.512127 0.009820 52.152 < 2e-16 ***
## shucked_weight -0.614115 0.006632 -92.603 < 2e-16 ***
## diameter 2.138507 0.238786 8.956 < 2e-16 ***
## sexI -1.040735 0.025209 -41.284 < 2e-16 ***
## sexM -0.115212 0.019157 -6.014 1.82e-09 ***
## weight 0.194265 0.005416 35.867 < 2e-16 ***
## height 7.222272 0.236275 30.567 < 2e-16 ***
## viscera_weight -0.216351 0.011872 -18.224 < 2e-16 ***
## length 0.910211 0.192234 4.735 2.20e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.009e+04 on 9 and 74041 DF, p-value: < 2.2e-16
Box-Cox transformations can help determine how the skewed variables should be transformed.
box1 <- boxcox(df$length ~ 1)
box2 <- boxcox(df$diameter ~ 1)
box3 <- boxcox(df$weight ~ 1)
box4 <- boxcox(df$shucked_weight ~ 1)
box5 <- boxcox(df$viscera_weight ~ 1)
box6 <- boxcox(df$shell_weight ~ 1)
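boxcox() from MASS returns the profile log-likelihood over a grid of \(\lambda\) values, so the optimal \(\lambda\) for each variable can be pulled from that grid, for example:

# Optimal lambda = grid value that maximizes the profile log-likelihood
optimal_lambda <- function(bc) bc$x[which.max(bc$y)]
sapply(list(length = box1, diameter = box2, weight = box3,
            shucked_weight = box4, viscera_weight = box5, shell_weight = box6),
       optimal_lambda)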
Now let's update the model with the transformations. (Note that inside an R formula, length^2 and diameter^2 are treated the same as length and diameter unless wrapped in I(), so only the square-root transformations actually take effect below.)
lm_model3 <- lm(age ~ sex + length^2 + diameter^2 + sqrt(weight) + height + sqrt(shucked_weight) + sqrt(viscera_weight) + sqrt(shell_weight), data=df)
summary(lm_model3)
##
## Call:
## lm(formula = age ~ sex + length^2 + diameter^2 + sqrt(weight) +
## height + sqrt(shucked_weight) + sqrt(viscera_weight) + sqrt(shell_weight),
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3284 -1.2384 -0.3219 0.7784 19.5942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.10683 0.07227 56.828 < 2e-16 ***
## sexI -0.89601 0.02528 -35.448 < 2e-16 ***
## sexM -0.10989 0.01904 -5.772 7.88e-09 ***
## length -0.43852 0.19954 -2.198 0.028 *
## diameter 0.28903 0.24794 1.166 0.244
## sqrt(weight) 2.29455 0.05795 39.593 < 2e-16 ***
## height 5.22885 0.24287 21.530 < 2e-16 ***
## sqrt(shucked_weight) -4.21579 0.04644 -90.777 < 2e-16 ***
## sqrt(viscera_weight) -0.93277 0.05870 -15.889 < 2e-16 ***
## sqrt(shell_weight) 3.56084 0.05844 60.932 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.116 on 74041 degrees of freedom
## Multiple R-squared: 0.5561, Adjusted R-squared: 0.5561
## F-statistic: 1.031e+04 on 9 and 74041 DF, p-value: < 2.2e-16
#define intercept-only model
intercept_only <- lm(age ~ 1, data=df)
#define model with all predictors
all <- lm(age ~ sex + length^2 + diameter^2 + sqrt(weight) + height + sqrt(shucked_weight) + sqrt(viscera_weight) + sqrt(shell_weight), data=df)
#perform stepwise regression
both_lm_2 <- step(intercept_only, direction='both', scope=formula(all), trace=0)
summary(both_lm_2)
##
## Call:
## lm(formula = age ~ sqrt(shell_weight) + sqrt(shucked_weight) +
## sqrt(weight) + sex + height + sqrt(viscera_weight) + length,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3556 -1.2393 -0.3217 0.7786 19.6024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.11762 0.07167 57.450 < 2e-16 ***
## sqrt(shell_weight) 3.57149 0.05772 61.874 < 2e-16 ***
## sqrt(shucked_weight) -4.21371 0.04641 -90.799 < 2e-16 ***
## sqrt(weight) 2.29985 0.05778 39.807 < 2e-16 ***
## sexI -0.89699 0.02526 -35.506 < 2e-16 ***
## sexM -0.11002 0.01904 -5.779 7.56e-09 ***
## height 5.23997 0.24268 21.592 < 2e-16 ***
## sqrt(viscera_weight) -0.93139 0.05869 -15.869 < 2e-16 ***
## length -0.27020 0.13773 -1.962 0.0498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.116 on 74042 degrees of freedom
## Multiple R-squared: 0.5561, Adjusted R-squared: 0.5561
## F-statistic: 1.16e+04 on 8 and 74042 DF, p-value: < 2.2e-16
vif(both_lm_2, type='predictor')
lm_model6 <- lm(age ~ sex + length^2 + height , data=df)
summary(lm_model6)
##
## Call:
## lm(formula = age ~ sex + length^2 + height, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.685 -1.457 -0.488 0.706 18.317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.83896 0.06144 62.48 <2e-16 ***
## sexI -1.31761 0.02809 -46.91 <2e-16 ***
## sexM -0.23402 0.02155 -10.86 <2e-16 ***
## length 1.06281 0.07932 13.40 <2e-16 ***
## height 15.05503 0.24396 61.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.398 on 74046 degrees of freedom
## Multiple R-squared: 0.4297, Adjusted R-squared: 0.4297
## F-statistic: 1.395e+04 on 4 and 74046 DF, p-value: < 2.2e-16
vif(lm_model6, type='predictor')
In the end, the best model removes diameter and applies transformations to the skewed variables. Overall, OLS may not be the best choice here; another type of regression model would likely offer better predictive power.
summary(both_lm_2)
##
## Call:
## lm(formula = age ~ sqrt(shell_weight) + sqrt(shucked_weight) +
## sqrt(weight) + sex + height + sqrt(viscera_weight) + length,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3556 -1.2393 -0.3217 0.7786 19.6024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.11762 0.07167 57.450 < 2e-16 ***
## sqrt(shell_weight) 3.57149 0.05772 61.874 < 2e-16 ***
## sqrt(shucked_weight) -4.21371 0.04641 -90.799 < 2e-16 ***
## sqrt(weight) 2.29985 0.05778 39.807 < 2e-16 ***
## sexI -0.89699 0.02526 -35.506 < 2e-16 ***
## sexM -0.11002 0.01904 -5.779 7.56e-09 ***
## height 5.23997 0.24268 21.592 < 2e-16 ***
## sqrt(viscera_weight) -0.93139 0.05869 -15.869 < 2e-16 ***
## length -0.27020 0.13773 -1.962 0.0498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.116 on 74042 degrees of freedom
## Multiple R-squared: 0.5561, Adjusted R-squared: 0.5561
## F-statistic: 1.16e+04 on 8 and 74042 DF, p-value: < 2.2e-16
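As a rough gauge of fit before submitting, we can compute the in-sample RMSE of the chosen model (an optimistic estimate of predictive accuracy, since it uses the same training data the model was fit on):

sqrt(mean(resid(both_lm_2)^2))   # in-sample root-mean-squared error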
test <- read_csv("test.csv") |> clean_names()
predictions <- predict(both_lm_2, test)
submission_df <- data.frame("id" = test$id,
"Age" = round(predictions, 0))
submission_df |> write_csv("submission.csv")
knitr::include_graphics("kaggle.png")