\(\\\) \(\\\)
Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the discussion board. This project will demonstrate your ability to apply the elements of the class.
Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
\(\\\) \(\\\)
\(Probability\)
\(\\\)
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.
set.seed(1234)
X <- runif(n = 10000, min = 5, max = 15)
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)
x <- median(X)          # x: the median of X
y <- quantile(Y, 0.25)  # y: the first quartile of Y
\(\\\) \(\\\) a. \(P(X>x | X>y) = \frac{P(X>x \cap X>y)}{P(X>y)}\)
num <- length(X[X>x & X>y])/length(X)
denom <- length(X[X>y])/length(X)
num/denom
## [1] 0.7245327
\(\\\) \(\\\) b. \(P(X>x, Y>y)\)
length(which(X>x & Y>y))/10000
## [1] 0.3738
\(\\\) \(\\\) c. \(P(X<x | X>y) = \frac{P(X<x \cap X>y)}{P(X>y)}\)
num <- length(X[X<x & X>y])/length(X)
denom <- length(X[X>y])/length(X)
num/denom
## [1] 0.2754673
Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
library(magrittr)
library(kableExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
A <- length(which(X>x & Y>y))/10000
B <- length(which(X<=x & Y>y))/10000
C <- length(which(X>x & Y<=y))/10000
D <- length(which(X<=x & Y<=y))/10000
df <- data.frame(c(A,C),c(B,D)) %>%
mutate(Total = rowSums(.)) %>%
rbind(., colSums(.)) %>%
set_colnames(c('$X>x$', '$X\\leq x$', 'Total')) %>%
set_rownames(c('$Y>y$', '$Y\\leq y$', 'Total')) %>%
kable() %>%
kable_styling(latex_options = "hold_position")
df
| | \(X>x\) | \(X\leq x\) | Total |
|---|---|---|---|
| \(Y>y\) | 0.3738 | 0.3762 | 0.75 |
| \(Y\leq y\) | 0.1262 | 0.1238 | 0.25 |
| Total | 0.5000 | 0.5000 | 1.00 |
From the table, the joint probability \(P(X>x, Y>y)\) is 0.3738, while the product of the marginal probabilities, \(P(X>x) = 0.5\) times \(P(Y>y) = 0.75\), is 0.375. The two values are close but not exactly equal. \(\\\) \(\\\)
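As a quick numeric sanity check (a small sketch using the variables already defined above), the joint probability can be computed directly and compared with the product of the marginals:
# Joint probability vs. product of marginal probabilities
p_joint <- mean(X > x & Y > y)
p_prod <- mean(X > x) * mean(Y > y)
c(joint = p_joint, product = p_prod)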
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not? Null Hypothesis: \(P(X>x)\) and \(P(Y>y)\) are independent \(\\\) Alternative Hypothesis: \(P(X>x)\) and \(P(Y>y)\) are dependent
table(X > x, Y > y) %>%
fisher.test()
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.5953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8894241 1.0682094
## sample estimates:
## odds ratio
## 0.9747199
Fisher’s Exact Test yields a p-value of 0.5953, well above the 0.05 significance level, so we cannot reject the null hypothesis based on this test.
table(X>x, Y>y) %>%
chisq.test(.)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 0.28213, df = 1, p-value = 0.5953
The Chi-squared test yields the same p-value as Fisher’s Exact Test, so based on it we likewise cannot reject the null hypothesis that \(P(X>x)\) and \(P(Y>y)\) are independent.
\(\\\) \(\\\)
Fisher’s Exact Test is best suited to small samples, whereas the Chi-squared test is an asymptotic approximation whose accuracy improves as the sample size grows. With 10,000 observations, the Chi-squared test is the more natural choice here; that said, this example does not illustrate the distinction well, because the two tests produced the same p-value.
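To illustrate what the Chi-squared test is doing under the hood, here is a sketch that computes Pearson’s statistic by hand from observed and expected counts (without the Yates continuity correction, so it will differ slightly from the corrected output above):
# Pearson's chi-squared statistic from first principles
obs <- table(X > x, Y > y)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
sum((obs - expected)^2 / expected)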
\(\\\) \(\\\)
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.
\(\\\) \(\\\)
Here I load the training data:
library(readr)
crabs <- read_csv('train.csv')
## Rows: 74051 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (9): id, Length, Diameter, Height, Weight, Shucked Weight, Viscera Weigh...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
\(Descriptive \: and \: Inferential \: Statistics\)
\(\\\)
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
summary(crabs)
## id Sex Length Diameter
## Min. : 0 Length:74051 Min. :0.1875 Min. :0.1375
## 1st Qu.:18512 Class :character 1st Qu.:1.1500 1st Qu.:0.8875
## Median :37025 Mode :character Median :1.3750 Median :1.0750
## Mean :37025 Mean :1.3175 Mean :1.0245
## 3rd Qu.:55538 3rd Qu.:1.5375 3rd Qu.:1.2000
## Max. :74050 Max. :2.0128 Max. :1.6125
## Height Weight Shucked Weight Viscera Weight
## Min. :0.0000 Min. : 0.0567 Min. : 0.02835 Min. : 0.04252
## 1st Qu.:0.3000 1st Qu.:13.4377 1st Qu.: 5.71242 1st Qu.: 2.86330
## Median :0.3625 Median :23.7994 Median : 9.90815 Median : 4.98951
## Mean :0.3481 Mean :23.3852 Mean :10.10427 Mean : 5.05839
## 3rd Qu.:0.4125 3rd Qu.:32.1625 3rd Qu.:14.03300 3rd Qu.: 6.98815
## Max. :2.8250 Max. :80.1015 Max. :42.18406 Max. :21.54562
## Shell Weight Age
## Min. : 0.04252 Min. : 1.000
## 1st Qu.: 3.96893 1st Qu.: 8.000
## Median : 6.93145 Median :10.000
## Mean : 6.72387 Mean : 9.968
## 3rd Qu.: 9.07184 3rd Qu.:11.000
## Max. :28.49125 Max. :29.000
\(\\\) \(\\\)
pairs(~`Shucked Weight`+Weight+Height+`Shell Weight`, data = crabs, lwd = 0.5)
vars <- c("Weight", "Height", "Shucked Weight")
corr_matrix <- cor(crabs[vars])
corr_matrix
## Weight Height Shucked Weight
## Weight 1.0000000 0.9017748 0.9712672
## Height 0.9017748 1.0000000 0.8640826
## Shucked Weight 0.9712672 0.8640826 1.0000000
Now I’ll check the pairwise correlations of Weight, Height, and Shucked Weight.
cor.test(~ Weight + Height, data = crabs, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: Weight and Height
## t = 567.76, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9008913 0.9026508
## sample estimates:
## cor
## 0.9017748
cor.test(~ `Shucked Weight` + Height, data = crabs, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: Shucked Weight and Height
## t = 467.14, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.8628846 0.8652710
## sample estimates:
## cor
## 0.8640826
cor.test(~ Weight + `Shucked Weight`, data = crabs, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: Weight and Shucked Weight
## t = 1110.5, df = 74049, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.9709993 0.9715328
## sample estimates:
## cor
## 0.9712672
All three variable pairs yield p-values far below 0.05, and none of the 80% confidence intervals contain 0, so there is strong evidence of a meaningful linear relationship within each pair.
Familywise error is the probability of committing at least one Type I error (rejecting a null hypothesis that is actually true) across a family of tests. I would not be worried about it in this scenario: with only three tests and p-values this close to zero, even a conservative correction would leave every result significant.
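As a quick check (a sketch re-running the three tests above), a Bonferroni adjustment, which controls the familywise error rate, leaves all three p-values effectively at zero:
# Bonferroni-adjusted p-values for the three pairwise correlation tests
pvals <- c(
  cor.test(~ Weight + Height, data = crabs)$p.value,
  cor.test(~ `Shucked Weight` + Height, data = crabs)$p.value,
  cor.test(~ Weight + `Shucked Weight`, data = crabs)$p.value
)
p.adjust(pvals, method = "bonferroni")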
\(Linear \: Algebra \: and \: Correlation\)
\(\\\)
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LDU decomposition on the matrix.
\(\\\) \(\\\)
# here I invert the correlation matrix
inverted_mat <- solve(corr_matrix)
inverted_mat
## Weight Height Shucked Weight
## Weight 24.264456 -5.987522 -18.393558
## Height -5.987522 5.424421 1.128336
## Shucked Weight -18.393558 1.128336 17.890085
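Since the diagonal of this precision matrix contains the variance inflation factors, here is a quick sketch verifying the first diagonal entry (about 24.26) against the usual VIF formula \(1/(1-R^2)\):
# VIF of Weight: regress it on the other two variables
r2 <- summary(lm(Weight ~ Height + `Shucked Weight`, data = crabs))$r.squared
1 / (1 - r2)   # should match the Weight diagonal entry above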
# here I multiply the inverted matrix by the correlation matrix
pre_cor <- inverted_mat %*% corr_matrix
pre_cor
## Weight Height Shucked Weight
## Weight 1.000000e+00 3.493138e-15 3.552714e-15
## Height -2.940922e-16 1.000000e+00 0.000000e+00
## Shucked Weight -4.060847e-15 -1.833191e-15 1.000000e+00
# here I multiply the correlation matrix by the inverted matrix
cor_pre <- corr_matrix %*% inverted_mat
cor_pre
## Weight Height Shucked Weight
## Weight 1.000000e+00 -7.842758e-17 -5.081335e-16
## Height 4.232997e-16 1.000000e+00 1.719523e-15
## Shucked Weight 0.000000e+00 2.220446e-16 1.000000e+00
Both products return the identity matrix (up to floating-point rounding), as expected, since a matrix multiplied by its inverse yields the identity. Now, I’ll conduct an LU decomposition of the original correlation matrix:
library(matrixcalc)
lu.decomposition(corr_matrix)
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.00000000 0
## [2,] 0.9017748 1.00000000 0
## [3,] 0.9712672 -0.06307045 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.9017748 0.97126723
## [2,] 0 0.1868022 -0.01178170
## [3,] 0 0.0000000 0.05589689
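The assignment asks for an LDU decomposition; here is a sketch deriving it from the LU factors above by pulling the pivots of U into a diagonal matrix D:
# Factor U into D (pivots) times a unit upper triangular matrix
lu <- lu.decomposition(corr_matrix)
D <- diag(diag(lu$U))
U1 <- solve(D) %*% lu$U   # unit upper triangular
# L %*% D %*% U1 should reproduce the correlation matrix
all.equal(lu$L %*% D %*% U1, corr_matrix, check.attributes = FALSE)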
\(Calculus-Based \: Probability \: \& \: Statistics\)
\(\\\)
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
\(\\\) \(\\\)
Here, I’ll inspect for right-skewness:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.1
library(tidyr) # so I can have access to gather()
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
# subsetting to only numeric values
crabs_numeric <- subset(crabs, select = -c(id, Sex))
# preliminary looks for skewness
ggplot(gather(crabs_numeric), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.1
skews <- crabs_numeric %>%
summarise(across(where(is.numeric), ~ skewness(.x, na.rm = TRUE)))
min(crabs$Age) > 0
## [1] TRUE
Age has a skewness of about 1.09, so it is clearly skewed to the right. The minimum value of crabs$Age is also greater than zero, which means shifting the variable is not needed in this case.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## crabs
## The following object is masked from 'package:dplyr':
##
## select
exp_age <- fitdistr(crabs$Age, "exponential")
lambda_age <- exp_age$estimate
exp_sample <- rexp(1000, lambda_age)
summary(exp_sample)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00401 3.03722 6.92952 9.84023 13.94293 80.78709
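For reference, the optimal value of λ is the maximum likelihood estimate, which for an exponential distribution is 1/mean; with a mean Age of about 9.968, we expect λ ≈ 0.10 (a quick sketch printing it):
lambda_age            # fitted exponential rate (MLE)
1 / mean(crabs$Age)   # should match, since the exponential MLE is 1/mean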
hist(exp_sample, main = "Histogram of Exponential Sample of Crabs Age")
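For the side-by-side comparison the assignment asks for, here is a sketch plotting the original variable next to the simulated sample:
# Compare the original Age distribution with the exponential draws
par(mfrow = c(1, 2))
hist(crabs$Age, main = "Original Crab Age", xlab = "Age")
hist(exp_sample, main = "Exponential Sample", xlab = "Value")
par(mfrow = c(1, 1))   # reset the plotting layout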
lower <- qexp(0.05, lambda_age)
upper <- qexp(0.95, lambda_age)
print(paste('5th percentile: ', lower))
## [1] "5th percentile: 0.511281606097218"
print(paste('95th percentile: ', upper))
## [1] "95th percentile: 29.8608780455269"
quantile(crabs$Age, c(0.05, 0.95))
## 5% 95%
## 6 16
The exponential model is a poor fit for Age: its 5th and 95th percentiles (about 0.51 and 29.86) are far wider than the empirical 5th and 95th percentiles (6 and 16). The exponential density concentrates mass near zero and has a long right tail, whereas the observed ages cluster between roughly 6 and 16, so while Age is right-skewed, it is not exponentially distributed.
\(Modeling\) \(\\\) \(\\\)
# No missing values
colSums(is.na(crabs_numeric))
## Length Diameter Height Weight Shucked Weight
## 0 0 0 0 0
## Viscera Weight Shell Weight Age
## 0 0 0
crabs_numeric <- subset(crabs, select = -c(id, Sex))
crabs_model <- lm(Age ~., crabs_numeric)
summary(crabs_model)
##
## Call:
## lm(formula = Age ~ ., data = crabs_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.2001 -1.2411 -0.3769 0.7587 21.7936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.275640 0.068356 33.291 < 2e-16 ***
## Length 0.842589 0.194685 4.328 1.51e-05 ***
## Diameter 2.909390 0.241185 12.063 < 2e-16 ***
## Height 7.905423 0.238777 33.108 < 2e-16 ***
## Weight 0.198657 0.005484 36.223 < 2e-16 ***
## `Shucked Weight` -0.628563 0.006705 -93.743 < 2e-16 ***
## `Viscera Weight` -0.186771 0.012003 -15.560 < 2e-16 ***
## `Shell Weight` 0.520522 0.009944 52.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.155 on 74043 degrees of freedom
## Multiple R-squared: 0.5392, Adjusted R-squared: 0.5392
## F-statistic: 1.238e+04 on 7 and 74043 DF, p-value: < 2.2e-16
Every predictor has a p-value below 0.05 and a t-value more than 1.96 away from 0, so all are likely statistically significant. The multiple and adjusted R-squared are both just below 0.54, meaning the model accounts for about 54% of the variation in Age.
plot(crabs_model)
unique(crabs$Sex)
## [1] "I" "M" "F"
crabs <- crabs %>%
group_by(Sex) %>%
mutate(mean_weight = mean(Weight))
crabs_numeric <- subset(crabs, select = -c(id, Sex))
crab_model2 <- lm(Age ~ ., data = crabs_numeric)
summary(crab_model2)
##
## Call:
## lm(formula = Age ~ ., data = crabs_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0221 -1.2220 -0.3315 0.7544 22.0546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.056882 0.067676 30.393 < 2e-16 ***
## Length 0.908607 0.192228 4.727 2.29e-06 ***
## Diameter 2.138428 0.238786 8.955 < 2e-16 ***
## Height 7.221825 0.236275 30.565 < 2e-16 ***
## Weight 0.194339 0.005416 35.883 < 2e-16 ***
## `Shucked Weight` -0.614345 0.006628 -92.685 < 2e-16 ***
## `Viscera Weight` -0.216153 0.011870 -18.209 < 2e-16 ***
## `Shell Weight` 0.512206 0.009820 52.161 < 2e-16 ***
## mean_weight 0.056508 0.001293 43.709 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 74042 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5508
## F-statistic: 1.135e+04 on 8 and 74042 DF, p-value: < 2.2e-16
I initially built a much more complicated model that yielded an R-squared of about 0.57; however, the p-values of some variables grew and their t-values shrank, so I simplified it.
The variables in my new model all have p-values below 0.05 and t-values more than 1.96 away from 0, meaning they are likely statistically significant. This new model also improves the R-squared to about 0.55, so it accounts for more of the variation than the first model.
plot(crab_model2)
Looking at the residual plots, the residuals are more evenly dispersed above and below zero, and the Q-Q plot shows residuals that adhere more closely to the normal line than those from the first model.
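To make this comparison more formal, here is a sketch of a nested-model F-test (the second model adds only mean_weight to the first, so the models are nested):
# F-test: does mean_weight significantly improve the fit?
anova(crabs_model, crab_model2)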
crab_test <- read.csv('test.csv')
crab_test <- crab_test %>%
group_by(Sex) %>%
mutate(mean_weight = mean(Weight))
crab_test <- crab_test %>%
rename(`Shucked Weight`=Shucked.Weight, `Viscera Weight`=Viscera.Weight, `Shell Weight`=Shell.Weight)
prediction <- predict(crab_model2, crab_test, type = "response")
head(prediction)
## 1 2 3 4 5 6
## 7.739995 7.686829 10.424255 9.546894 7.508559 12.223504
kaggle_predictions <- data.frame(Id = crab_test$id, Age = prediction)
head(kaggle_predictions)
## Id Age
## 1 74051 7.739995
## 2 74052 7.686829
## 3 74053 10.424255
## 4 74054 9.546894
## 5 74055 7.508559
## 6 74056 12.223504
# I only needed to run this once; I don't want it to run every time I knit
write.csv(kaggle_predictions, file = "crab_age2.csv", row.names=FALSE)
Kaggle says the deadline for submission has passed, so I could not submit my predictions to the leaderboard; however, my score was 1.48.