---
title: "Final Data605"
author: "Shariq Mian"
date: "2023-12-14"
output: html_document
---
set.seed(1234)
X = runif(n = 10000, min = 5, max = 15)
Y = rnorm(10000, mean = 10, sd = 2.89)
hist(X)
hist(Y)
### Probability

Calculate, at a minimum, probabilities a through c below. Assume the lowercase letter "x" is estimated as the median of the X variable, and the lowercase letter "y" is estimated as the median of the Y variable. Interpret the meaning of each probability.
x = median(X)
y = median(Y)
print(x)
## [1] 10.01266
print(y)
## [1] 10.03154
# a. P(X > y | X > x): of the observations with X above its median x, the
#    fraction that also exceed y (here y is slightly larger than x)
a = sum(X > x & X > y) / sum(X > x)
# b. P(X > x & Y > y): the number of times both X > x and Y > y, divided by
#    the total number of observations
b = sum(X > x & Y > y) / length(X)
# c. P(X < x | X > y): of the observations with X above y, the fraction that
#    are also below x. Since y > x, no observation can satisfy both, so the
#    probability is 0.
c = sum(X < x & X > y) / sum(X > y)
cat("a. P(X > x | X > y):", a, "\n")
## a. P(X > x | X > y): 0.9958
cat("b. P(X > x & Y > y):", b, "\n")
## b. P(X > x & Y > y): 0.2507
cat("c. P(X < x | X > y):", c, "\n")
## c. P(X < x | X > y): 0
table_XY <- table(X > x, Y > y)
print(table_XY)
##
## FALSE TRUE
## FALSE 2507 2493
## TRUE 2493 2507
prob_X_gt_x <- sum(X > x) / length(X)
prob_Y_gt_y <- sum(Y > y) / length(Y)
prob_Xgt_x_and_Ygt_y <- table_XY[2, 2] / sum(table_XY)
equality_check <- prob_Xgt_x_and_Ygt_y == (prob_X_gt_x * prob_Y_gt_y)
print(paste("P(X > x) =", prob_X_gt_x))
## [1] "P(X > x) = 0.5"
print(paste("P(Y > y) =", prob_Y_gt_y))
## [1] "P(Y > y) = 0.5"
print(paste("P(X > x and Y > y) =", prob_Xgt_x_and_Ygt_y))
## [1] "P(X > x and Y > y) = 0.2507"
print(paste("P(X > x) * P(Y > y) =", prob_X_gt_x * prob_Y_gt_y))
## [1] "P(X > x) * P(Y > y) = 0.25"
print(paste("Are they equal? (TRUE/FALSE):", equality_check))
## [1] "Are they equal? (TRUE/FALSE): FALSE"
Based on the table, the joint probability P(X > x & Y > y) is observed to be 0.2507, while the marginal probabilities P(X > x) and P(Y > y) are each 0.5, so the product of the marginals is 0.25.
The joint probability and the product of the marginal probabilities are therefore not identical: the joint probability is marginally higher (0.2507 vs. 0.25), a difference small enough to be attributable to sampling variability.
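Note that the FALSE above partly reflects comparing floating-point numbers with ==, which demands exact equality. A tolerance-based comparison is more informative; the sketch below (the 0.01 tolerance is our assumption, not part of the original analysis) shows the two quantities agree to within sampling error:
# Compare with a tolerance instead of exact equality (tolerance is assumed)
isTRUE(all.equal(prob_Xgt_x_and_Ygt_y, prob_X_gt_x * prob_Y_gt_y,
                 tolerance = 0.01))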
matrix_prob <- matrix(c((sum(X > x & Y > y)), (sum(X > x & Y <= y)),
(sum(X <= x & Y > y)), (sum(X <= x & Y <= y))),
nrow = 2, ncol = 2, byrow = TRUE)
matrix_prob
## [,1] [,2]
## [1,] 2507 2493
## [2,] 2493 2507
fisher.test(matrix_prob)
##
## Fisher's Exact Test for Count Data
##
## data: matrix_prob
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
Fisher’s Exact Test and the Chi Square Test are used to assess the relationship between two variables based on observed data. Fisher’s Exact Test determines the probability of observing the data or a more extreme outcome under the assumption of independence. It considers all possible tables with the same marginal totals and sums the probabilities of tables as or more extreme than the observed table. If this probability is small (typically < 0.05), the null hypothesis of independence is rejected.
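As a sketch of those mechanics (not part of the original analysis): with all margins fixed, the probability of any particular 2x2 table is hypergeometric, and for our table with row and column totals of 5000 each, the probability of the observed top-left count comes from dhyper():
# Probability of the observed table under independence with fixed margins:
# 2507 first-row observations among the 5000 drawn in the first column,
# from rows of 5000 and 5000
dhyper(2507, 5000, 5000, 5000)
# Fisher's p-value sums such probabilities over all tables at least as
# extreme as the one observed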
In contrast, the Chi Square Test compares observed counts in a contingency table to expected counts under independence. It calculates a test statistic by comparing squared differences between observed and expected counts, divided by the expected counts. Under the null hypothesis, the test statistic follows a Chi Square distribution with degrees of freedom based on the number of variable categories. If the calculated statistic exceeds the critical value, the null hypothesis is rejected.
# Chi-square test
chisq.test(matrix_prob)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix_prob
## X-squared = 0.0676, df = 1, p-value = 0.7949
For Fisher’s exact test, the p-value of 0.7949 is the probability of observing a table at least as extreme as the one calculated, assuming the null hypothesis of independence is true. A high p-value indicates weak evidence against the null hypothesis, so there is no statistical evidence to reject independence between variables X and Y. The point estimate of the odds ratio, 1.011, is very close to 1, further supporting the absence of association. Based on Fisher’s Exact Test, there is no significant evidence that X and Y are dependent; the data support the hypothesis of independence.
Fisher’s exact test is typically used when sample sizes are small, while the chi-square test is more appropriate for large samples, so with 10,000 observations the chi-square test is the better choice here.
Similar to Fisher’s exact test, the chi-square p-value represents the probability of observing a table this extreme under independence; in this case it is the same value, 0.7949.
The chi-square statistic of 0.0676 measures the discrepancy between the observed and expected frequencies.
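To make that computation concrete, here is a minimal sketch of the statistic described above, computed by hand on matrix_prob (without the Yates continuity correction, so it differs slightly from the chisq.test() output):
observed <- matrix_prob
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)  # counts expected under independence
chi_sq_stat <- sum((observed - expected)^2 / expected)
pchisq(chi_sq_stat, df = 1, lower.tail = FALSE)  # upper-tail p-value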
# Load required libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'purrr' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Matrix)
## Warning: package 'Matrix' was built under R version 4.3.2
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(caret)
## Warning: package 'caret' was built under R version 4.3.1
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
# Load the crab training dataset from GitHub
train_df <- read.csv(url("https://github.com/mianshariq/Datasets/raw/e72c11f47fcbf656ba206d69d7db3245f6079ada/train_crab.csv"))
summary_age <- summary(train_df$Age)
age_histogram <- ggplot(train_df, aes(x = Age)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Distribution of Crab Ages",
x = "Age",
y = "Frequency")
print(summary_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 8.000 10.000 9.968 11.000 29.000
print(age_histogram)
scatter_diameter_age <- ggplot(train_df, aes(x = Diameter, y = Age)) +
geom_point(color = "blue", alpha = 0.7) +
labs(title = "Scatterplot of Crab Diameter vs. Age",
x = "Diameter",
y = "Age")
shell_weight_dist <- ggplot(train_df, aes(x = Shell.Weight)) +
geom_histogram(binwidth = 1) +
coord_cartesian(xlim = c(0,25)) +
labs(title = "Distribution of Crab Shell Weights")
scatter_shell_weight_age <- ggplot(train_df, aes(x = `Shell.Weight`, y = Age)) +
geom_point(color = "green", alpha = 0.7) +
labs(title = "Scatterplot of Crab Shell Weight vs. Age",
x = "Shell Weight",
y = "Age")
print(scatter_diameter_age)
print(shell_weight_dist)
print(scatter_shell_weight_age)
The distribution of crab shell weights appears roughly normal with a slight right skew. There is a noticeable positive linear relationship between crab shell weight and age: as shell weight increases, the crab is more likely to be older.
scatter_matrix <- GGally::ggpairs(train_df[, c("Age", "Length","Height", "Diameter","Shell.Weight")])
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
print(scatter_matrix)
The correlation matrix quantifies the strength and direction of linear relationships. The hypothesis test assesses if the correlations are significantly different from zero.
Null Hypothesis: There is no (significant) correlation between a crab’s shell weight and its age.
Alt. Hypothesis: There is a (significant) correlation between a crab’s shell weight and its age.
correlation_vars <- train_df[, c("Diameter", "Height", "Weight", "Shell.Weight", "Age")]
cor_matrix <- cor(correlation_vars)
print(cor_matrix)
## Diameter Height Weight Shell.Weight Age
## Diameter 1.0000000 0.9213531 0.9382486 0.9226882 0.6212559
## Height 0.9213531 1.0000000 0.9017748 0.9033982 0.6380669
## Weight 0.9382486 0.9017748 1.0000000 0.9655246 0.6011950
## Shell.Weight 0.9226882 0.9033982 0.9655246 1.0000000 0.6634733
## Age 0.6212559 0.6380669 0.6011950 0.6634733 1.0000000
Null Hypothesis: There is no (significant) correlation between a crab’s diameter and its age. Alt. Hypothesis: There is a (significant) correlation between a crab’s diameter and its age.
Null Hypothesis: There is no (significant) correlation between a crab’s height and its age. Alt. Hypothesis: There is a (significant) correlation between a crab’s height and its age.
Null Hypothesis: There is no (significant) correlation between a crab’s shell weight and its age. Alt. Hypothesis: There is a (significant) correlation between a crab’s shell weight and its age.
selected_columns <- train_df %>%
select(Diameter, Height, Shell.Weight, Age)
correlation_results <- cor.test(selected_columns$Diameter, selected_columns$Age, method = "pearson",conf.level = 0.80)
print(correlation_results)
##
## Pearson's product-moment correlation
##
## data: selected_columns$Diameter and selected_columns$Age
## t = 215.74, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6183556 0.6241393
## sample estimates:
## cor
## 0.6212559
correlation_results <- cor.test(selected_columns$Height, selected_columns$Age, method = "pearson",conf.level = 0.80)
print(correlation_results)
##
## Pearson's product-moment correlation
##
## data: selected_columns$Height and selected_columns$Age
## t = 225.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6352664 0.6408507
## sample estimates:
## cor
## 0.6380669
correlation_results <- cor.test(selected_columns$Shell.Weight, selected_columns$Age, method = "pearson",conf.level = 0.80)
print(correlation_results)
##
## Pearson's product-moment correlation
##
## data: selected_columns$Shell.Weight and selected_columns$Age
## t = 241.3, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6608286 0.6661015
## sample estimates:
## cor
## 0.6634733
Since each p-value is below the alpha of 0.05, we reject the null hypothesis for all three pairs: there is a significant correlation between a crab’s diameter and age, its height and age, and its shell weight and age. In other words, as crabs get older they tend to have larger diameters, greater heights, and heavier shells, and we can be confident of this because every p-value falls well under 0.05, meaning the observed correlations are very unlikely to be due to random chance.
Regarding family-wise error: it refers to the increased probability of making at least one Type I error (false positive) when conducting multiple hypothesis tests, and the risk grows with the number of tests. Since we ran three tests here, we should be mindful of it and, if we want to be strict, tighten the per-test significance threshold as sketched below.
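A minimal sketch of one common adjustment, the Bonferroni correction, applied to the three correlation tests above (the p-value vector simply restates the p < 2.2e-16 results already reported):
m <- 3                      # number of hypothesis tests performed
alpha <- 0.05
alpha / m                   # Bonferroni per-test threshold, about 0.0167
# p.adjust() applies the equivalent correction to the p-values themselves;
# all three remain far below 0.05 after adjustment
p.adjust(rep(2.2e-16, 3), method = "bonferroni")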
precision_matrix <- solve(cor_matrix)   # inverse of the correlation matrix
result_matrix_1 <- cor_matrix %*% precision_matrix
result_matrix_2 <- precision_matrix %*% cor_matrix
# LDU decomposition of the first product (cor_matrix %*% precision_matrix)
lu_decomp_1 <- lu(as(Matrix(result_matrix_1), "CsparseMatrix"))
L_matrix_1 <- lu_decomp_1@L
U_matrix_1 <- lu_decomp_1@U
D_matrix_1 <- diag(diag(U_matrix_1))    # pivots of U form the diagonal factor D
U_matrix_1 <- diag(1 / diag(U_matrix_1)) %*% U_matrix_1   # rescale U to unit diagonal
A_matrix_1 <- L_matrix_1 %*% D_matrix_1 %*% U_matrix_1    # reconstruct A = L D U
# LDU decomposition of the second product (precision_matrix %*% cor_matrix)
lu_decomp_2 <- lu(as(Matrix(result_matrix_2), "CsparseMatrix"))
L_matrix_2 <- lu_decomp_2@L
U_matrix_2 <- lu_decomp_2@U
D_matrix_2 <- diag(diag(U_matrix_2))    # pivots of U form the diagonal factor D
U_matrix_2 <- diag(1 / diag(U_matrix_2)) %*% U_matrix_2   # rescale U to unit diagonal
A_matrix_2 <- L_matrix_2 %*% D_matrix_2 %*% U_matrix_2    # reconstruct A = L D U
cat("\nResults of LDU Decomp Matrix 1")
##
## Results of LDU Decomp Matrix 1
print(A_matrix_1)
## 5 x 5 Matrix of class "dgeMatrix"
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000e+00 1.110223e-16 -2.220446e-16 4.440892e-16 0.000000e+00
## [2,] 1.110223e-16 1.000000e+00 2.886580e-15 3.330669e-15 -2.220446e-16
## [3,] -2.220446e-16 2.886580e-15 1.000000e+00 -1.110223e-15 -2.220446e-16
## [4,] 4.440892e-16 3.330669e-15 -1.110223e-15 1.000000e+00 -4.440892e-16
## [5,] 0.000000e+00 -2.220446e-16 -2.220446e-16 -4.440892e-16 1.000000e+00
cat("\nResults of LDU Decomp Matrix 2")
##
## Results of LDU Decomp Matrix 2
print(A_matrix_2)
## 5 x 5 Matrix of class "dgeMatrix"
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000e+00 4.440892e-16 -5.551115e-17 -1.110223e-16 7.216450e-16
## [2,] 4.440892e-16 1.000000e+00 -1.387779e-15 -1.110223e-15 -1.221245e-15
## [3,] -5.551115e-17 -1.387779e-15 1.000000e+00 8.881784e-16 8.881784e-16
## [4,] -1.110223e-16 -1.110223e-15 8.881784e-16 1.000000e+00 -8.881784e-16
## [5,] 7.216450e-16 -1.221245e-15 8.881784e-16 -8.881784e-16 1.000000e+00
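Both printed products are the 5 x 5 identity up to floating-point rounding (off-diagonal entries of order 1e-15), confirming that solve(cor_matrix) really is the inverse of the correlation matrix. A quick programmatic check (a sketch, using all.equal's default tolerance):
# Both products should match the identity matrix to within numerical error
all.equal(as.matrix(result_matrix_1), diag(5), check.attributes = FALSE)
all.equal(as.matrix(result_matrix_2), diag(5), check.attributes = FALSE)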
Next we use GGally::ggpairs to examine the distributions of the different features; the distribution of crab shell weights is skewed to the right.
scatter_matrix <- GGally::ggpairs(train_df[, c("Age", "Sex", "Shucked.Weight", "Length",'Diameter','Height','Weight','Shell.Weight')])
print(scatter_matrix)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Shift Shell.Weight so its minimum is 1: fitdistr() with an exponential
# density requires strictly positive values
selected_variable <- train_df$Shell.Weight
shifted_variable <- selected_variable - min(selected_variable) + 1
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
exp_fit <- fitdistr(shifted_variable, densfun = "exponential")
lambda <- exp_fit$estimate["rate"]
exp_samples <- rexp(1000, rate = lambda)
alpha_value <- 0.5   # transparency so the overlaid histograms remain visible
hist(shifted_variable, main = "Original Variable vs. Exponential Distribution Samples",
     xlab = "Values", col = rgb(0, 0, 1, alpha = alpha_value), freq = FALSE, breaks = 30)
hist(exp_samples, col = rgb(0, 1, 0, alpha = alpha_value), add = TRUE,
     freq = FALSE, breaks = 30)
exp_5th_percentile <- qexp(0.05, rate = lambda)
exp_95th_percentile <- qexp(0.95, rate = lambda)
cat("The exponential 5th percentile is", exp_5th_percentile, "\n")
## The exponential 5th percentile is 0.3940015
cat("The exponential 95th percentile is", exp_95th_percentile, "\n")
## The exponential 95th percentile is 23.01126
The 5th percentile (0.3940015) of the fitted exponential distribution is a lower threshold for shell weight: the value below which approximately 5% of hypothetical crab shells would fall if shell weight were exponentially distributed. The 95th percentile (23.01126) is the corresponding upper threshold, below which approximately 95% of hypothetical crab shells would fall (only 5% would exceed it).
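These percentiles also have a closed form for the exponential distribution, q_p = -log(1 - p) / lambda; a short sketch confirms it agrees with qexp():
# Closed-form exponential quantile vs. qexp()
-log(1 - 0.95) / lambda
qexp(0.95, rate = lambda)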
A 95% confidence interval for the mean from the empirical data, assuming normality:
mean_variable <- mean(shifted_variable)
sd_variable <- sd(shifted_variable)
ci_low <- mean_variable - 1.96 * sd_variable / sqrt(length(shifted_variable))
ci_up <- mean_variable + 1.96 * sd_variable / sqrt(length(shifted_variable))
cat("The empirical 95% CI is", ci_low, "-", ci_up, "\n")
## The empirical 95% CI is 7.655529 - 7.707163
The empirical 95% CI reflects the precision of estimating the population mean based on the sample mean and standard deviation. The empirical 95% CI for the mean shell.weight (7.655529 to 7.707163) indicates a range where we are reasonably confident the true mean shell.weight lies, assuming a normal distribution.
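As a cross-check (a sketch, not part of the original analysis), t.test() produces an equivalent interval using the t quantile instead of the hard-coded 1.96; with roughly 74,000 observations the two are virtually identical:
# Confidence interval for the mean via t.test()
t.test(shifted_variable, conf.level = 0.95)$conf.int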
Empirical 5th and 95th percentiles of the data:
emp_5th_percentile <- quantile(shifted_variable, 0.05)
emp_95th_percentile <- quantile(shifted_variable, 0.95)
cat("The empirical 5th percentile is", emp_5th_percentile, "\n")
## The empirical 5th percentile is 2.063106
cat("The empirical 95th percentile is", emp_95th_percentile, "\n")
## The empirical 95th percentile is 13.70058
The empirical 5th percentile (2.063106) is a lower threshold for observed shell weight: 5% of actual crab shells fall below it. The empirical 95th percentile (13.70058) is the upper threshold below which 95% of observed crab shells fall. These empirical percentiles describe the distribution of the original variable directly, indicating its central tendency and spread without assuming any distributional form.
Understanding the distribution of shell.weight is essential for predicting the age of a crab. Deviations from the theoretical exponential model may suggest factors influencing shell weight beyond a simple exponential relationship.
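One way to visualize such deviations (a sketch, assuming the lambda fitted above) is a QQ-plot of the shifted data against the fitted exponential; under a good fit the points would fall on the line y = x:
# QQ-plot: empirical quantiles vs. quantiles of the fitted exponential
theoretical_q <- qexp(ppoints(length(shifted_variable)), rate = lambda)
qqplot(theoretical_q, shifted_variable,
       xlab = "Theoretical exponential quantiles",
       ylab = "Empirical quantiles (shifted Shell.Weight)")
abline(0, 1, col = "red")  # reference line for a perfect fit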
initial_model <- lm(Age ~ Sex + Length + Diameter + Height + Weight + `Shucked.Weight` + `Viscera.Weight` + `Shell.Weight`, data = train_df)
initial_model <- step(initial_model, direction = "backward")
## Start: AIC=111865.1
## Age ~ Sex + Length + Diameter + Height + Weight + Shucked.Weight +
## Viscera.Weight + Shell.Weight
##
## Df Sum of Sq RSS AIC
## <none> 335336 111865
## - Length 1 102 335438 111886
## - Diameter 1 363 335700 111943
## - Viscera.Weight 1 1504 336840 112195
## - Height 1 4232 339568 112792
## - Weight 1 5826 341163 113139
## - Sex 2 8658 343994 113749
## - Shell.Weight 1 12318 347654 114535
## - Shucked.Weight 1 38838 374174 119978
summary(initial_model)
##
## Call:
## lm(formula = Age ~ Sex + Length + Diameter + Height + Weight +
## Shucked.Weight + Viscera.Weight + Shell.Weight, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0347 -1.2228 -0.3320 0.7537 22.0418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.755758 0.075897 49.485 < 2e-16 ***
## SexI -1.040735 0.025209 -41.284 < 2e-16 ***
## SexM -0.115212 0.019157 -6.014 1.82e-09 ***
## Length 0.910211 0.192234 4.735 2.20e-06 ***
## Diameter 2.138507 0.238786 8.956 < 2e-16 ***
## Height 7.222272 0.236275 30.567 < 2e-16 ***
## Weight 0.194265 0.005416 35.867 < 2e-16 ***
## Shucked.Weight -0.614115 0.006632 -92.603 < 2e-16 ***
## Viscera.Weight -0.216351 0.011872 -18.224 < 2e-16 ***
## Shell.Weight 0.512127 0.009820 52.152 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74041 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.009e+04 on 9 and 74041 DF, p-value: < 2.2e-16
Overall, the model has a moderate fit (R-squared = 0.5508) with all predictors significant. However, it is important to examine the assumptions of the regression model, such as the normality of residuals and the absence of influential outliers, to ensure the reliability of the results.
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
plot(initial_model, which = c(1, 3))
set.seed(123)
# shapiro.test() accepts at most 5000 observations, so we test a random
# sample of the residuals
sample_indices <- sample(1:nrow(train_df), 5000)
shapiro.test(residuals(initial_model)[sample_indices])
##
## Shapiro-Wilk normality test
##
## data: residuals(initial_model)[sample_indices]
## W = 0.9033, p-value < 2.2e-16
qqnorm(residuals(initial_model))
qqline(residuals(initial_model))
vif(initial_model)
## GVIF Df GVIF^(1/(2*Df))
## Sex 1.815731 2 1.160815
## Length 50.029756 1 7.073172
## Diameter 52.539130 1 7.248388
## Height 7.731224 1 2.780508
## Weight 76.731546 1 8.759654
## Shucked.Weight 22.695250 1 4.763953
## Viscera.Weight 17.972270 1 4.239371
## Shell.Weight 20.256450 1 4.500717
With a p-value very close to zero, we reject the null hypothesis of the Shapiro-Wilk test and conclude that the residuals (the differences between observed and predicted values) are not normally distributed. The VIF output also flags substantial multicollinearity: Length, Diameter, and Weight all have GVIF values far above the conventional threshold of 10, which is unsurprising given how strongly the size and weight measurements correlate with one another.
When residuals are normally distributed, the errors in prediction follow a normal distribution, which is essential for making valid statistical inferences and constructing reliable prediction intervals. Since our residuals depart from normality, we can try some transformations to see whether they improve the model. First we add a quadratic term and inspect the residuals-vs-fitted plot.
# List of predictors
predictors <- c('Sex', 'Length', 'Diameter', 'Height', 'Weight', 'Shucked.Weight', 'Viscera.Weight', 'Shell.Weight')
model <- lm(Age ~ . + I(Length^2), data = train_df[, c("Age", predictors)])
plot(model, which = 1) # Residuals vs. Fitted
summary(model)
##
## Call:
## lm(formula = Age ~ . + I(Length^2), data = train_df[, c("Age",
## predictors)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3301 -1.2255 -0.3330 0.7497 21.9674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.055101 0.120119 8.784 < 2e-16 ***
## SexI -1.034540 0.025069 -41.267 < 2e-16 ***
## SexM -0.121361 0.019052 -6.370 1.9e-10 ***
## Length 8.593002 0.327432 26.244 < 2e-16 ***
## Diameter 0.136602 0.247349 0.552 0.581
## Height 5.598562 0.241579 23.175 < 2e-16 ***
## Weight 0.231749 0.005540 41.832 < 2e-16 ***
## Shucked.Weight -0.587295 0.006660 -88.187 < 2e-16 ***
## Viscera.Weight -0.164884 0.011939 -13.811 < 2e-16 ***
## Shell.Weight 0.562182 0.009918 56.686 < 2e-16 ***
## I(Length^2) -3.601417 0.124615 -28.900 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.116 on 74040 degrees of freedom
## Multiple R-squared: 0.5558, Adjusted R-squared: 0.5558
## F-statistic: 9265 on 10 and 74040 DF, p-value: < 2.2e-16
The results are essentially unchanged: the model’s R-squared moves only from 0.5508 to 0.5558.
We can try another transformation to see if it improves the model: a log transformation of the dependent variable, again checking the residuals-vs-fitted plot.
log_model <- lm(log(Age) ~ ., data = train_df[, c("Age", predictors)])
plot(log_model, which = 1) # Residuals vs. Fitted
summary(log_model)
##
## Call:
## lm(formula = log(Age) ~ ., data = train_df[, c("Age", predictors)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.03326 -0.11932 -0.01918 0.09387 1.37992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3046848 0.0067216 194.104 < 2e-16 ***
## SexI -0.1100606 0.0022326 -49.297 < 2e-16 ***
## SexM -0.0093908 0.0016966 -5.535 3.12e-08 ***
## Length 0.3085725 0.0170246 18.125 < 2e-16 ***
## Diameter 0.3899925 0.0211474 18.442 < 2e-16 ***
## Height 0.8036416 0.0209250 38.406 < 2e-16 ***
## Weight 0.0113024 0.0004797 23.563 < 2e-16 ***
## Shucked.Weight -0.0513660 0.0005873 -87.459 < 2e-16 ***
## Viscera.Weight -0.0186469 0.0010514 -17.736 < 2e-16 ***
## Shell.Weight 0.0369459 0.0008697 42.482 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1885 on 74041 degrees of freedom
## Multiple R-squared: 0.6393, Adjusted R-squared: 0.6393
## F-statistic: 1.458e+04 on 9 and 74041 DF, p-value: < 2.2e-16
R-squared increases to 0.6393. Overall this is the best of the three models, and it appears to fit well given the higher R-squared value and significant predictor variables.
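As a rough cross-check of that claim (a sketch; back-transforming with exp() ignores retransformation bias), we can compare in-sample RMSE on the original Age scale:
# In-sample RMSE of the original model vs. the log model (back-transformed)
rmse_initial <- sqrt(mean(residuals(initial_model)^2))
rmse_log <- sqrt(mean((train_df$Age - exp(fitted(log_model)))^2))
c(initial = rmse_initial, log = rmse_log)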
test_df <- read.csv(url("https://github.com/mianshariq/Datasets/raw/e72c11f47fcbf656ba206d69d7db3245f6079ada/test_crab.csv"))
predicted_ages <- predict(log_model, newdata = test_df)
predicted_ages <- exp(predicted_ages)   # back-transform from the log scale
test_df$Age <- predicted_ages
names(test_df)
## [1] "id" "Sex" "Length" "Diameter"
## [5] "Height" "Weight" "Shucked.Weight" "Viscera.Weight"
## [9] "Shell.Weight" "Age"
crab_results <- test_df %>%
  dplyr::select(id, Age)
write.csv(crab_results, file = "submission.csv", row.names = FALSE)
Overall, this submission scored 1.47 in the Kaggle competition.