\(\\\) \(\\\)
Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the discussion board. This project will demonstrate your ability to apply the elements of the class.
Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
\(\\\) \(\\\)
\(Probability\)
\(\\\)
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.
set.seed(1234)
X <- runif(n = 10000, min = 5, max = 15)
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)
x <- median(X)          # x: the median of X
y <- quantile(Y, 0.25)  # y: the first quartile of Y
\(\\\) \(\\\) a. \(P(X>x | X>y) = \frac{P(X>x \cap X>y)}{P(X>y)}\)
num <- length(X[X>x & X>y])/length(X)
denom <- length(X[X>y])/length(X)
num/denom
## [1] 0.7245327
\(\\\) \(\\\) b. \(P(X>x, Y>y)\)
length(which(X>x & Y>y))/10000
## [1] 0.3738
\(\\\) \(\\\) c. \(P(X<x | X>y) = \frac{P(X<x \cap X>y)}{P(X>y)}\)
num <- length(X[X<x & X>y])/length(X)
denom <- length(X[X>y])/length(X)
num/denom
## [1] 0.2754673
Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
library(magrittr)
library(kableExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
A <- length(which(X>x & Y>y))/10000
B <- length(which(X<=x & Y>y))/10000
C <- length(which(X>x & Y<=y))/10000
D <- length(which(X<=x & Y<=y))/10000
df <- data.frame(c(A,C),c(B,D)) %>%
mutate(Total = rowSums(.)) %>%
rbind(., colSums(.)) %>%
set_colnames(c('$X>x$', '$X\\leq x$', 'Total')) %>%
set_rownames(c('$Y>y$', '$Y\\leq y$', 'Total')) %>%
kable() %>%
kable_styling(latex_options = "hold_position")
df
| | \(X>x\) | \(X\leq x\) | Total |
|---|---|---|---|
| \(Y>y\) | 0.3738 | 0.3762 | 0.75 |
| \(Y\leq y\) | 0.1262 | 0.1238 | 0.25 |
| Total | 0.5000 | 0.5000 | 1.00 |
From the table, the joint probability \(P(X>x, Y>y)\) is 0.3738, while the product of the marginal probabilities, \(P(X>x) = 0.5\) times \(P(Y>y) = 0.75\), is 0.375. The two values are close but not exactly equal. \(\\\) \(\\\)
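As a quick numeric sanity check (a small sketch using the variables already defined above), the joint probability can be computed directly and compared with the product of the marginals:
# Joint probability vs. product of marginal probabilities
p_joint <- mean(X > x & Y > y)
p_prod <- mean(X > x) * mean(Y > y)
c(joint = p_joint, product = p_prod)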
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not? Null Hypothesis: \(P(X>x)\) and \(P(Y>y)\) are independent \(\\\) Alternative Hypothesis: \(P(X>x)\) and \(P(Y>y)\) are dependent
table(X > x, Y > y) %>%
fisher.test()
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.5953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8894241 1.0682094
## sample estimates:
## odds ratio
## 0.9747199
Fisher’s Exact Test yields a p-value of 0.5953, well above the 0.05 significance level, so we cannot reject the null hypothesis based on this test.
table(X>x, Y>y) %>%
chisq.test(.)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 0.28213, df = 1, p-value = 0.5953
The Chi-squared test yields the same p-value as Fisher’s Exact Test, so based on it we likewise cannot reject the null hypothesis that \(P(X>x)\) and \(P(Y>y)\) are independent.
\(\\\) \(\\\)
Fisher’s Exact Test is best suited to small samples, whereas the Chi-squared test is an asymptotic approximation whose accuracy improves as the sample size grows. With 10,000 observations, the Chi-squared test is the more natural choice here; that said, this example does not illustrate the distinction well, because the two tests produced the same p-value.
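To illustrate what the Chi-squared test is doing under the hood, here is a sketch that computes Pearson’s statistic by hand from observed and expected counts (without the Yates continuity correction, so it will differ slightly from the corrected output above):
# Pearson's chi-squared statistic from first principles
obs <- table(X > x, Y > y)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
sum((obs - expected)^2 / expected)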
\(\\\) \(\\\)
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.
\(\\\) \(\\\)
Here I load the training data:
library(readr)
crabs <- read_csv('train.csv')
## Rows: 74051 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (9): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
\(Descriptive \: and \: Inferential \: Statistics\)
\(\\\)
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
summary(crabs)
## id Sex Length Diameter
## Min. : 0 Length:74051 Min. :0.1875 Min. :0.1375
## 1st Qu.:18512 Class :character 1st Qu.:1.1500 1st Qu.:0.8875
## Median :37025 Mode :character Median :1.3750 Median :1.0750
## Mean :37025 Mean :1.3175 Mean :1.0245
## 3rd Qu.:55538 3rd Qu.:1.5375 3rd Qu.:1.2000
## Max. :74050 Max. :2.0128 Max. :1.6125
## Height Weight Shucked Weight Viscera Weight
## Min. :0.0000 Min. : 0.0567 Min. : 0.02835 Min. : 0.04252
## 1st Qu.:0.3000 1st Qu.:13.4377 1st Qu.: 5.71242 1st Qu.: 2.86330
## Median :0.3625 Median :23.7994 Median : 9.90815 Median : 4.98951
## Mean :0.3481 Mean :23.3852 Mean :10.10427 Mean : 5.05839
## 3rd Qu.:0.4125 3rd Qu.:32.1625 3rd Qu.:14.03300 3rd Qu.: 6.98815
## Max. :2.8250 Max. :80.1015 Max. :42.18406 Max. :21.54562
## Shell Weight Age
## Min. : 0.04252 Min. : 1.000
## 1st Qu.: 3.96893 1st Qu.: 8.000
## Median : 6.93145 Median :10.000
## Mean : 6.72387 Mean : 9.968
## 3rd Qu.: 9.07184 3rd Qu.:11.000
## Max. :28.49125 Max. :29.000
\(\\\) \(\\\)
pairs(~`Shucked Weight`+Weight+Height+`Shell Weight`, data = crabs, lwd = 0.5)
vars <- c("Weight", "Height", "Shucked Weight")
corr_matrix <- cor(crabs[vars])
corr_matrix
## Weight Height Shucked Weight
## Weight 1.0000000 0.9017748 0.9712672
## Height 0.9017748 1.0000000 0.8640826
## Shucked Weight 0.9712672 0.8640826 1.0000000
Now I’ll check the pairwise correlations of Weight, Height, and Shucked Weight.
cor.test(~ Weight + Height, data = crabs, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: Weight and Height
## t = 567.76, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9008913 0.9026508
## sample estimates:
## cor
## 0.9017748
cor.test(~ `Shucked Weight` + Height, data = crabs, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: Shucked Weight and Height
## t = 467.14, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.8628846 0.8652710
## sample estimates:
## cor
## 0.8640826
cor.test(~ Weight + `Shucked Weight`, data = crabs, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: Weight and Shucked Weight
## t = 1110.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9709993 0.9715328
## sample estimates:
## cor
## 0.9712672
All three variable pairs yield p-values far below 0.05, and none of the 80% confidence intervals contain 0, so there is strong evidence of a meaningful linear relationship within each pair.
Familywise error is the probability of committing at least one Type I error (rejecting a null hypothesis that is actually true) across a family of tests. I would not be worried about it in this scenario: with only three tests and p-values this close to zero, even a conservative correction would leave every result significant.
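As a quick check (a sketch re-running the three tests above), a Bonferroni adjustment, which controls the familywise error rate, leaves all three p-values effectively at zero:
# Bonferroni-adjusted p-values for the three pairwise correlation tests
pvals <- c(
  cor.test(~ Weight + Height, data = crabs)$p.value,
  cor.test(~ `Shucked Weight` + Height, data = crabs)$p.value,
  cor.test(~ Weight + `Shucked Weight`, data = crabs)$p.value
)
p.adjust(pvals, method = "bonferroni")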
\(Linear \: Algebra \: and \: Correlation\)
\(\\\)
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
\(\\\) \(\\\)
# here I invert the correlation matrix
inverted_mat <- solve(corr_matrix)
inverted_mat
## Weight Height Shucked Weight
## Weight 24.264456 -5.987522 -18.393558
## Height -5.987522 5.424421 1.128336
## Shucked Weight -18.393558 1.128336 17.890085
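Since the diagonal of this precision matrix contains the variance inflation factors, here is a quick sketch verifying the first diagonal entry (about 24.26) against the usual VIF formula \(1/(1-R^2)\):
# VIF of Weight: regress it on the other two variables
r2 <- summary(lm(Weight ~ Height + `Shucked Weight`, data = crabs))$r.squared
1 / (1 - r2)   # should match the Weight diagonal entry above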
# here I multiply the inverted matrix by the correlation matrix
pre_cor <- inverted_mat %*% corr_matrix
pre_cor
## Weight Height Shucked Weight
## Weight 1.000000e+00 3.493138e-15 3.552714e-15
## Height -2.940922e-16 1.000000e+00 0.000000e+00
## Shucked Weight -4.060847e-15 -1.833191e-15 1.000000e+00
# here I multiply the correlation matrix by the inverted matrix
cor_pre <- corr_matrix %*% inverted_mat
cor_pre
## Weight Height Shucked Weight
## Weight 1.000000e+00 -7.842758e-17 -5.081335e-16
## Height 4.232997e-16 1.000000e+00 1.719523e-15
## Shucked Weight 0.000000e+00 2.220446e-16 1.000000e+00
Both products return the identity matrix (up to floating-point rounding), as expected, since a matrix multiplied by its inverse yields the identity. Now, I’ll conduct an LU decomposition of the original correlation matrix:
library(matrixcalc)
lu.decomposition(corr_matrix)
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.00000000 0
## [2,] 0.9017748 1.00000000 0
## [3,] 0.9712672 -0.06307045 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.9017748 0.97126723
## [2,] 0 0.1868022 -0.01178170
## [3,] 0 0.0000000 0.05589689
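The assignment asks for an LDU decomposition; here is a sketch deriving it from the LU factors above by pulling the pivots of U into a diagonal matrix D:
# Factor U into D (pivots) times a unit upper triangular matrix
lu <- lu.decomposition(corr_matrix)
D <- diag(diag(lu$U))
U1 <- solve(D) %*% lu$U   # unit upper triangular
# L %*% D %*% U1 should reproduce the correlation matrix
all.equal(lu$L %*% D %*% U1, corr_matrix, check.attributes = FALSE)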
\(Calculus-Based \: Probability \: \& \: Statistics\)
\(\\\)
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
\(\\\) \(\\\)
Here, I’ll inspect for right-skewness:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.1
library(tidyr) # so I can have access to gather()
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
# subsetting to only numeric values
crabs_numeric <- subset(crabs, select = -c(id, Sex))
# preliminary looks for skewness
ggplot(gather(crabs_numeric), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.1
skews <- crabs_numeric %>%
summarise(across(where(is.numeric), ~ skewness(.x, na.rm = TRUE)))
min(crabs$Age) > 0
## [1] TRUE
Age has a skewness of about 1.09, so it is clearly skewed to the right. The minimum value of crabs$Age is also greater than zero, which means shifting the variable is not needed in this case.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## crabs
## The following object is masked from 'package:dplyr':
##
## select
exp_age <- fitdistr(crabs$Age, "exponential")
lambda_age <- exp_age$estimate
exp_sample <- rexp(1000, lambda_age)
summary(exp_sample)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00401 3.03722 6.92952 9.84023 13.94293 80.78709
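For reference, the optimal value of λ is the maximum likelihood estimate, which for an exponential distribution is 1/mean; with a mean Age of about 9.968, we expect λ ≈ 0.10 (a quick sketch printing it):
lambda_age            # fitted exponential rate (MLE)
1 / mean(crabs$Age)   # should match, since the exponential MLE is 1/mean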
hist(exp_sample, main = "Histogram of Exponential Sample of Crabs Age")
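For the side-by-side comparison the assignment asks for, here is a sketch plotting the original variable next to the simulated sample:
# Compare the original Age distribution with the exponential draws
par(mfrow = c(1, 2))
hist(crabs$Age, main = "Original Crab Age", xlab = "Age")
hist(exp_sample, main = "Exponential Sample", xlab = "Value")
par(mfrow = c(1, 1))   # reset the plotting layout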
lower <- qexp(0.05, lambda_age)
upper <- qexp(0.95, lambda_age)
print(paste('5th percentile: ', lower))
## [1] "5th percentile: 0.511281606097218"
print(paste('95th percentile: ', upper))
## [1] "95th percentile: 29.8608780455269"
quantile(crabs$Age, c(0.05, 0.95))
## 5% 95%
## 6 16
The exponential model is a poor fit for Age: its 5th and 95th percentiles (about 0.51 and 29.86) are far wider than the empirical 5th and 95th percentiles (6 and 16). The exponential density concentrates mass near zero and has a long right tail, whereas the observed ages cluster between roughly 6 and 16, so while Age is right-skewed, it is not exponentially distributed.
\(Modeling\) \(\\\) \(\\\)
# No missing values
colSums(is.na(crabs_numeric))
## Length Diameter Height Weight Shucked Weight
## 0 0 0 0 0
## Viscera Weight Shell Weight Age
## 0 0 0
crabs_numeric <- subset(crabs, select = -c(id, Sex))
crabs_model <- lm(Age ~., crabs_numeric)
summary(crabs_model)
##
## Call:
## lm(formula = Age ~ ., data = crabs_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.2001 -1.2411 -0.3769 0.7587 21.7936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.275640 0.068356 33.291 < 2e-16 ***
## Length 0.842589 0.194685 4.328 1.51e-05 ***
## Diameter 2.909390 0.241185 12.063 < 2e-16 ***
## Height 7.905423 0.238777 33.108 < 2e-16 ***
## Weight 0.198657 0.005484 36.223 < 2e-16 ***
## `Shucked Weight` -0.628563 0.006705 -93.743 < 2e-16 ***
## `Viscera Weight` -0.186771 0.012003 -15.560 < 2e-16 ***
## `Shell Weight` 0.520522 0.009944 52.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.155 on 74043 degrees of freedom
## Multiple R-squared: 0.5392, Adjusted R-squared: 0.5392
## F-statistic: 1.238e+04 on 7 and 74043 DF, p-value: < 2.2e-16
Every predictor has a p-value below 0.05 and a t-value more than 1.96 away from 0, so all are likely statistically significant. The multiple and adjusted R-squared are both just below 0.54, meaning the model accounts for about 54% of the variation in Age.
plot(crabs_model)
unique(crabs$Sex)
## [1] "I" "M" "F"
crabs <- crabs %>%
group_by(Sex) %>%
mutate(mean_weight = mean(Weight))
crabs_numeric <- subset(crabs, select = -c(id, Sex))
crab_model2 <- lm(Age ~ ., data = crabs_numeric)
summary(crab_model2)
##
## Call:
## lm(formula = Age ~ ., data = crabs_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0221 -1.2220 -0.3315 0.7544 22.0546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.056882 0.067676 30.393 < 2e-16 ***
## Length 0.908607 0.192228 4.727 2.29e-06 ***
## Diameter 2.138428 0.238786 8.955 < 2e-16 ***
## Height 7.221825 0.236275 30.565 < 2e-16 ***
## Weight 0.194339 0.005416 35.883 < 2e-16 ***
## `Shucked Weight` -0.614345 0.006628 -92.685 < 2e-16 ***
## `Viscera Weight` -0.216153 0.011870 -18.209 < 2e-16 ***
## `Shell Weight` 0.512206 0.009820 52.161 < 2e-16 ***
## mean_weight 0.056508 0.001293 43.709 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74042 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.135e+04 on 8 and 74042 DF, p-value: < 2.2e-16
I initially built a much more complicated model that yielded an R-squared of about 0.57; however, the p-values of some variables grew and their t-values shrank, so I simplified it.
The variables in my new model all have p-values below 0.05 and t-values more than 1.96 away from 0, meaning they are likely statistically significant. This new model also improves the R-squared to about 0.55, so it accounts for more of the variation than the first model.
plot(crab_model2)
Looking at the residual plots, the residuals are more evenly dispersed above and below zero, and the Q-Q plot shows residuals that adhere more closely to the normal line than those from the first model.
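To make this comparison more formal, here is a sketch of a nested-model F-test (the second model adds only mean_weight to the first, so the models are nested):
# F-test: does mean_weight significantly improve the fit?
anova(crabs_model, crab_model2)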
crab_test <- read.csv('test.csv')
crab_test <- crab_test %>%
group_by(Sex) %>%
mutate(mean_weight = mean(Weight))
crab_test <- crab_test %>%
rename(`Shucked Weight`=Shucked.Weight, `Viscera Weight`=Viscera.Weight, `Shell Weight`=Shell.Weight)
prediction <- predict(crab_model2, crab_test, type = "response")
head(prediction)
## 1 2 3 4 5 6
## 7.739995 7.686829 10.424255 9.546894 7.508559 12.223504
kaggle_predictions <- data.frame(Id = crab_test$id, Age = prediction)
head(kaggle_predictions)
## Id Age
## 1 74051 7.739995
## 2 74052 7.686829
## 3 74053 10.424255
## 4 74054 9.546894
## 5 74055 7.508559
## 6 74056 12.223504
# I only needed to run this once; I don't want it to run every time I knit
write.csv(kaggle_predictions, file = "crab_age2.csv", row.names=FALSE)
Kaggle says the deadline for submission has passed, so I could not submit my predictions to the leaderboard; however, my score was 1.48.