Introduction

Instructions: Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Part 1

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

set.seed(1234)

# 10,000 continuous random uniform values between 5 and 15
X <- runif(n = 10000, min = 5, max = 15)

# 10,000 random normal values with a mean of 10 and a standard deviation of 2.89 
Y <- rnorm(10000, mean=10, sd = 2.89)

5 points: Probability. Calculate as a minimum the below probabilities a through c. 

Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.

  a. P(X>x | X>y)
  b. P(X>x & Y>y)
  c. P(X<x | X>y)
# x is median of X
# y is median of Y

x <- median(X)
y <- median(Y)
  
# a.  P(X>x | X>y)
# The P(B|A) = P(A&B)/P(A)
n <- 10000
p_ab <- sum(X>x & X>y)/n
p_a <- sum(X>y)/n
p <- p_ab/p_a
print(paste("P(X>x | X>y) = ", p))
## [1] "P(X>x | X>y) =  1"
# b.  P(X>x & Y>y)
p <- sum(X>x & Y>y)/n
print(paste("P(X>x & Y>y) = ", p))
## [1] "P(X>x & Y>y) =  0.2507"
# c.  P(X<x | X>y)
# The P(B|A) = P(A&B)/P(A)
n <- 10000
p_ab <- sum(X<x & X>y)/n
p_a <- sum(X>y)/n
p <- p_ab/p_a
print(paste("P(X<x | X>y) = ", p))
## [1] "P(X<x | X>y) =  0"

Now to interpret the meanings:

  a. P(X>x | X>y) = 1

The probability that X>x given that X>y is 1. In other words, among the values of X that exceed the median of the Y series, every one also exceeds the median of the X series. This makes sense because the medians of the X and Y series are nearly the same (and here the median of Y is slightly larger than the median of X):

print(paste("Median X: ", median(X)))
## [1] "Median X:  10.0126582302619"
print(paste("Median Y: ", median(Y)))
## [1] "Median Y:  10.0315387998106"

We also know that X is a uniform distribution with values from 5 to 15 and Y is a normal distribution centered at 10 with a standard deviation of 2.89, so both medians sit near 10. Since we condition on the values of X that are greater than the median of the Y series, and the median of Y is at least as large as the median of X, every one of those values is also greater than the median of the X series.

  b. P(X>x & Y>y) = 0.2507

This is the joint probability that X is greater than its median and Y is greater than its median. Since half of the values fall on either side of each median, and the two variables were generated independently, we expect roughly 0.5 * 0.5 = 0.25, which matches the result.

  c. P(X<x | X>y) = 0

The probability that X<x given that X>y is 0. This means that among the values of X that are greater than the median of the Y series, none are less than the median of the X series. This makes sense: the conditioning already selected values above (essentially) the median of X, so none of the remaining values can fall below it.

5 points. Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Help from: probabilities (https://www.youtube.com/watch?v=FWrEaSaW2mc&ab_channel=BryanNelson) and tables (https://www.geeksforgeeks.org/how-to-create-tables-in-r/).

# Now let's build the table of counts of conditions
# Exp 1: P(X>x), P(X<=x)
# Exp 2: P(Y>y), P(Y<=y)
#_____________
#          (Y>y)   (Y<=y)
# (X>x)  |       | 
# (X<=x) |       |
# ____________

c1 <- sum(X>x & Y>y)
c2 <- sum(X>x & Y<=y)
c3 <- sum(X<=x & Y>y)
c4 <- sum(X<=x & Y<=y)

# create matrix with 2 columns and 2 rows
data= matrix(c(c1,c2,c3,c4), ncol=2, byrow=TRUE)
 
# specify the column names and row names of matrix
colnames(data) = c('(Y>y)','(Y<=y)')
rownames(data) <- c('(X>x)','(X<=x)')
 
# assign to table
final=as.table(data)
 
# display
final
##        (Y>y) (Y<=y)
## (X>x)   2507   2493
## (X<=x)  2493   2507
# P(X>x & Y>y)=P(X>x)P(Y>y) ?

print(paste("P(X>x & Y>y): ", c1/(sum(c1,c2,c3,c4))))
## [1] "P(X>x & Y>y):  0.2507"
print(paste("P(X>x)P(Y>y): ", ((c1+c2)/sum(c1,c2,c3,c4))*((c1+c3)/sum(c1,c2,c3,c4))))
## [1] "P(X>x)P(Y>y):  0.25"

Using the table of counts of each joint condition, P(X>x & Y>y) = 0.2507, which is approximately equal to P(X>x)P(Y>y) = 0.25; the two quantities agree up to sampling variation.
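
As a cross-check, the same comparison can be made directly from the final table built above using prop.table; this is just a small sketch of the idea.

# Joint probabilities from the table of counts, plus the marginal probabilities
joint <- prop.table(final)
marg_x <- rowSums(joint)
marg_y <- colSums(joint)
joint["(X>x)", "(Y>y)"]              # joint P(X>x & Y>y)
marg_x["(X>x)"] * marg_y["(Y>y)"]    # product of marginals P(X>x)P(Y>y)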

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

print("Fisher Test")
## [1] "Fisher Test"
fisher.test(final)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  final
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264
print("Chi Square Test")
## [1] "Chi Square Test"
chisq.test(final)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  final
## X-squared = 0.0676, df = 1, p-value = 0.7949

For both Fisher's exact test and the chi-square test, the null hypothesis is that the two conditions are independent and the alternative hypothesis is that they are not. The p-value is the same for both, 0.7949, which is large, so we do not reject the null hypothesis; the data are consistent with independence. The difference between the two tests is their suitability for different situations. As the name suggests, Fisher's exact test is exact, but it is typically used for smaller sample sizes. In contrast, the chi-square test is approximate but works well for larger sample sizes. Because we have a large sample, the chi-square test makes more sense here. I am not surprised by the results. The analysis up to this point also pointed to independence: the two conditions were based on two separately generated distributions that depended only on themselves, so it would not have made sense for them to be dependent on one another.

Reference material: https://www.datascienceblog.net/post/statistical_test/contingency_table_tests/
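
To illustrate why sample size matters for the choice of test, here is a quick sketch on a hypothetical 2x2 table with tiny counts (not from our data); Fisher's exact test remains valid, while chisq.test warns that its approximation may be unreliable when expected cell counts are small.

# Hypothetical small 2x2 table to contrast the two tests
small_tab <- matrix(c(3, 1, 2, 4), nrow = 2)
fisher.test(small_tab)   # exact p-value, valid even for tiny counts
chisq.test(small_tab)    # approximate; warns that expected counts are small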

Part 2

You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition. https://www.kaggle.com/competitions/playground-series-s3e16 I want you to do the following.

5 points. Descriptive and Inferential Statistics.

Load necessary packages.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the test and train data.

train <- read_csv("https://raw.githubusercontent.com/klgriffen96/data605/main/train.csv", show_col_types = FALSE)
test <- read_csv("https://raw.githubusercontent.com/klgriffen96/data605/main/test.csv", show_col_types = FALSE)

Provide univariate descriptive statistics and appropriate plots for the training data set.

print("Columns")
## [1] "Columns"
colnames(train)
##  [1] "id"             "Sex"            "Length"         "Diameter"      
##  [5] "Height"         "Weight"         "Shucked Weight" "Viscera Weight"
##  [9] "Shell Weight"   "Age"
print("Descriptive stats")
## [1] "Descriptive stats"
summary(train)
##        id            Sex                Length          Diameter     
##  Min.   :    0   Length:74051       Min.   :0.1875   Min.   :0.1375  
##  1st Qu.:18513   Class :character   1st Qu.:1.1500   1st Qu.:0.8875  
##  Median :37025   Mode  :character   Median :1.3750   Median :1.0750  
##  Mean   :37025                      Mean   :1.3175   Mean   :1.0245  
##  3rd Qu.:55538                      3rd Qu.:1.5375   3rd Qu.:1.2000  
##  Max.   :74050                      Max.   :2.0128   Max.   :1.6125  
##      Height           Weight        Shucked Weight     Viscera Weight    
##  Min.   :0.0000   Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.04252  
##  1st Qu.:0.3000   1st Qu.:13.4377   1st Qu.: 5.71242   1st Qu.: 2.86330  
##  Median :0.3625   Median :23.7994   Median : 9.90815   Median : 4.98951  
##  Mean   :0.3481   Mean   :23.3852   Mean   :10.10427   Mean   : 5.05839  
##  3rd Qu.:0.4125   3rd Qu.:32.1625   3rd Qu.:14.03300   3rd Qu.: 6.98815  
##  Max.   :2.8250   Max.   :80.1015   Max.   :42.18406   Max.   :21.54562  
##   Shell Weight           Age        
##  Min.   : 0.04252   Min.   : 1.000  
##  1st Qu.: 3.96893   1st Qu.: 8.000  
##  Median : 6.93145   Median :10.000  
##  Mean   : 6.72387   Mean   : 9.968  
##  3rd Qu.: 9.07184   3rd Qu.:11.000  
##  Max.   :28.49125   Max.   :29.000

Let’s go through and look at the distributions of each variable in the training set.

train |>
  #keep(is.numeric) |>
  subset(select = -c(Sex, id)) |>
  gather() |>
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(bins = 30)

From these plots we can see that some of the variables are skewed.

Reference for quick plotting: https://drsimonj.svbtle.com/quick-plot-of-all-variables

Check the count of the sexes.

ggplot(train, aes(x=Sex)) +
  geom_bar()

There appear to be approximately equal counts of each sex category.

I also want to look out for outliers.

train |>
  #keep(is.numeric) |>
  subset(select = -c(Sex, id)) |>
  gather() |>
  ggplot(aes(x=value)) +
    facet_wrap(~ key, scales = "free") +
    geom_boxplot()

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

The dependent variable is Age; we will plot it against the independent variables.

train |>
  gather(key = "variable", value = "value",
         -id, -Age) |>
  ggplot(aes(x = value, y = Age)) +
  geom_point() + 
  facet_wrap(~variable, scales = "free")

From the looks of these scatterplots, it seems that diameter, height, and length are all good candidates for strong relationships to Age.

Derive a correlation matrix for any three quantitative variables in the dataset.

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
train |>
  subset(select = -c(Sex, id)) |>
  cor() |>
  corrplot(addCoef.col = "white", type = "lower", number.cex=0.75)

Based on the correlation plot, we can see that many of the variables are highly correlated with one another.

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

So the null hypothesis is that the pairwise correlation is 0 and the alternative hypothesis is that it is not 0. I predict that none of them will be zero.

cs <- colnames(train)[3:10]

for (n1 in cs){
  for (n2 in cs){
    if (n1 != n2){
      print(paste("Variables:", n1, "and", n2))
      result <- cor.test(train[[n1]], train[[n2]], method = "pearson", conf.level = 0.8)
      print(paste("P Value: ", round(result$p.value, 4), "Confidence Interval", round(result$conf.int[1], 4), "to" ,round(result$conf.int[2], 4)))
      print("____________________________________________________________")
    }
  }
}
## [1] "Variables: Length and Diameter"
## [1] "P Value:  0 Confidence Interval 0.9893 to 0.9895"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Height"
## [1] "P Value:  0 Confidence Interval 0.9176 to 0.9191"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Weight"
## [1] "P Value:  0 Confidence Interval 0.9358 to 0.937"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Shucked Weight"
## [1] "P Value:  0 Confidence Interval 0.9148 to 0.9163"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Viscera Weight"
## [1] "P Value:  0 Confidence Interval 0.9171 to 0.9186"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Shell Weight"
## [1] "P Value:  0 Confidence Interval 0.9162 to 0.9177"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Age"
## [1] "P Value:  0 Confidence Interval 0.6099 to 0.6158"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Length"
## [1] "P Value:  0 Confidence Interval 0.9893 to 0.9895"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Height"
## [1] "P Value:  0 Confidence Interval 0.9206 to 0.9221"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Weight"
## [1] "P Value:  0 Confidence Interval 0.9377 to 0.9388"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Shucked Weight"
## [1] "P Value:  0 Confidence Interval 0.9134 to 0.915"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Viscera Weight"
## [1] "P Value:  0 Confidence Interval 0.9176 to 0.9191"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Shell Weight"
## [1] "P Value:  0 Confidence Interval 0.922 to 0.9234"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Age"
## [1] "P Value:  0 Confidence Interval 0.6184 to 0.6241"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Length"
## [1] "P Value:  0 Confidence Interval 0.9176 to 0.9191"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Diameter"
## [1] "P Value:  0 Confidence Interval 0.9206 to 0.9221"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Weight"
## [1] "P Value:  0 Confidence Interval 0.9009 to 0.9027"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Shucked Weight"
## [1] "P Value:  0 Confidence Interval 0.8629 to 0.8653"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Viscera Weight"
## [1] "P Value:  0 Confidence Interval 0.8821 to 0.8842"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Shell Weight"
## [1] "P Value:  0 Confidence Interval 0.9025 to 0.9043"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Age"
## [1] "P Value:  0 Confidence Interval 0.6353 to 0.6409"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Length"
## [1] "P Value:  0 Confidence Interval 0.9358 to 0.937"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Diameter"
## [1] "P Value:  0 Confidence Interval 0.9377 to 0.9388"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Height"
## [1] "P Value:  0 Confidence Interval 0.9009 to 0.9027"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Shucked Weight"
## [1] "P Value:  0 Confidence Interval 0.971 to 0.9715"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Viscera Weight"
## [1] "P Value:  0 Confidence Interval 0.9708 to 0.9713"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Shell Weight"
## [1] "P Value:  0 Confidence Interval 0.9652 to 0.9658"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Age"
## [1] "P Value:  0 Confidence Interval 0.5982 to 0.6042"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Length"
## [1] "P Value:  0 Confidence Interval 0.9148 to 0.9163"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Diameter"
## [1] "P Value:  0 Confidence Interval 0.9134 to 0.915"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Height"
## [1] "P Value:  0 Confidence Interval 0.8629 to 0.8653"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Weight"
## [1] "P Value:  0 Confidence Interval 0.971 to 0.9715"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Viscera Weight"
## [1] "P Value:  0 Confidence Interval 0.9421 to 0.9431"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Shell Weight"
## [1] "P Value:  0 Confidence Interval 0.9096 to 0.9112"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Age"
## [1] "P Value:  0 Confidence Interval 0.4998 to 0.5068"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Length"
## [1] "P Value:  0 Confidence Interval 0.9171 to 0.9186"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Diameter"
## [1] "P Value:  0 Confidence Interval 0.9176 to 0.9191"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Height"
## [1] "P Value:  0 Confidence Interval 0.8821 to 0.8842"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Weight"
## [1] "P Value:  0 Confidence Interval 0.9708 to 0.9713"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Shucked Weight"
## [1] "P Value:  0 Confidence Interval 0.9421 to 0.9431"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Shell Weight"
## [1] "P Value:  0 Confidence Interval 0.9333 to 0.9345"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Age"
## [1] "P Value:  0 Confidence Interval 0.5737 to 0.5799"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Length"
## [1] "P Value:  0 Confidence Interval 0.9162 to 0.9177"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Diameter"
## [1] "P Value:  0 Confidence Interval 0.922 to 0.9234"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Height"
## [1] "P Value:  0 Confidence Interval 0.9025 to 0.9043"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Weight"
## [1] "P Value:  0 Confidence Interval 0.9652 to 0.9658"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Shucked Weight"
## [1] "P Value:  0 Confidence Interval 0.9096 to 0.9112"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Viscera Weight"
## [1] "P Value:  0 Confidence Interval 0.9333 to 0.9345"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Age"
## [1] "P Value:  0 Confidence Interval 0.6608 to 0.6661"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Length"
## [1] "P Value:  0 Confidence Interval 0.6099 to 0.6158"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Diameter"
## [1] "P Value:  0 Confidence Interval 0.6184 to 0.6241"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Height"
## [1] "P Value:  0 Confidence Interval 0.6353 to 0.6409"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Weight"
## [1] "P Value:  0 Confidence Interval 0.5982 to 0.6042"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Shucked Weight"
## [1] "P Value:  0 Confidence Interval 0.4998 to 0.5068"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Viscera Weight"
## [1] "P Value:  0 Confidence Interval 0.5737 to 0.5799"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Shell Weight"
## [1] "P Value:  0 Confidence Interval 0.6608 to 0.6661"
## [1] "____________________________________________________________"

Reference: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r

Discuss the meaning of your analysis.
Would you be worried about familywise error? Why or why not?

I tested the pairwise correlations between all of the numeric variables. The familywise error rate is the probability of committing at least one Type I error, which would be rejecting a null hypothesis that is actually true. In our case the null hypothesis for each test is that the pairwise correlation is 0, and we rejected it in every case because of the very small p-values. If any of those null hypotheses were actually true, we would have made a Type I error. The familywise error rate increases with the number of hypothesis tests; however, because the 80% confidence intervals are all far from 0, I am not worried about a familywise error in this case. Adjustments such as the Bonferroni or Holm corrections could be applied to mitigate the problem; a small sketch follows the reference below.

Reference: https://www.youtube.com/watch?v=8CuFKY45UXQ&ab_channel=TileStats
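
As a sketch of one such adjustment (assuming the train data and the cs column names defined above), the unique pairwise p-values could be collected and corrected with p.adjust:

# Collect the p-value for each unique pair of numeric variables, then apply a
# Bonferroni (or Holm) adjustment to control the familywise error rate
pairs <- t(combn(cs, 2))
p_values <- apply(pairs, 1, function(v) cor.test(train[[v[1]]], train[[v[2]]])$p.value)
round(p.adjust(p_values, method = "bonferroni"), 4)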

5 points. Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

corr_matrix <- train |>
  subset(select = -c(Sex, id)) |>
  cor() 

prec_matrix <- corr_matrix |> solve()
prec_matrix
##                     Length    Diameter     Height      Weight Shucked Weight
## Length          50.0363646 -45.1097381 -1.9464374   0.1100257     -2.9284116
## Diameter       -45.1097381  52.3551380 -2.5463119  -1.4747766     -0.4008903
## Height          -1.9464374  -2.5463119  7.8112768  -0.1561810      0.2354580
## Weight           0.1100257  -1.4747766 -0.1561810  78.0530360    -32.1967124
## Shucked Weight  -2.9284116  -0.4008903  0.2354580 -32.1967124     25.3018917
## Viscera Weight  -1.4824486   0.2206767 -0.5969465 -19.0009603      1.5006489
## Shell Weight     1.5536462  -2.7474053 -2.0985613 -25.7645353      7.8914648
## Age             -0.1657241  -0.4720841 -0.4972981  -1.7174146      2.4136599
##                Viscera Weight Shell Weight        Age
## Length             -1.4824486     1.553646 -0.1657241
## Diameter            0.2206767    -2.747405 -0.4720841
## Height             -0.5969465    -2.098561 -0.4972981
## Weight            -19.0009603   -25.764535 -1.7174146
## Shucked Weight      1.5006489     7.891465  2.4136599
## Viscera Weight     17.9689552     1.656625  0.3565189
## Shell Weight        1.6566248    20.997071 -1.2752517
## Age                 0.3565189    -1.275252  2.1702671
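
As a quick spot-check that the diagonal entries really are variance inflation factors, the VIF for one variable can be computed as 1/(1 - R^2) from regressing it on the other variables in the correlation matrix and compared to the corresponding diagonal entry; this is a sketch using the numeric columns of train.

# VIF for Length: 1 / (1 - R^2) from regressing Length on the other numeric
# variables; this should match the Length diagonal entry of the precision matrix
num_train <- subset(train, select = -c(Sex, id))
r2 <- summary(lm(Length ~ ., data = num_train))$r.squared
1 / (1 - r2)
prec_matrix["Length", "Length"]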

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

round(corr_matrix%*% prec_matrix,2)
##                Length Diameter Height Weight Shucked Weight Viscera Weight
## Length              1        0      0      0              0              0
## Diameter            0        1      0      0              0              0
## Height              0        0      1      0              0              0
## Weight              0        0      0      1              0              0
## Shucked Weight      0        0      0      0              1              0
## Viscera Weight      0        0      0      0              0              1
## Shell Weight        0        0      0      0              0              0
## Age                 0        0      0      0              0              0
##                Shell Weight Age
## Length                    0   0
## Diameter                  0   0
## Height                    0   0
## Weight                    0   0
## Shucked Weight            0   0
## Viscera Weight            0   0
## Shell Weight              1   0
## Age                       0   1
round(prec_matrix%*% corr_matrix,2)
##                Length Diameter Height Weight Shucked Weight Viscera Weight
## Length              1        0      0      0              0              0
## Diameter            0        1      0      0              0              0
## Height              0        0      1      0              0              0
## Weight              0        0      0      1              0              0
## Shucked Weight      0        0      0      0              1              0
## Viscera Weight      0        0      0      0              0              1
## Shell Weight        0        0      0      0              0              0
## Age                 0        0      0      0              0              0
##                Shell Weight Age
## Length                    0   0
## Diameter                  0   0
## Height                    0   0
## Weight                    0   0
## Shucked Weight            0   0
## Viscera Weight            0   0
## Shell Weight              1   0
## Age                       0   1

Conduct LDU decomposition on the matrix.

library(matrixcalc)
round(lu.decomposition(corr_matrix)$L,4)
##        [,1]   [,2]   [,3]   [,4]    [,5]    [,6]   [,7] [,8]
## [1,] 1.0000 0.0000 0.0000 0.0000  0.0000  0.0000 0.0000    0
## [2,] 0.9894 1.0000 0.0000 0.0000  0.0000  0.0000 0.0000    0
## [3,] 0.9184 0.6044 1.0000 0.0000  0.0000  0.0000 0.0000    0
## [4,] 0.9364 0.5599 0.2332 1.0000  0.0000  0.0000 0.0000    0
## [5,] 0.9155 0.3975 0.1226 0.9682  1.0000  0.0000 0.0000    0
## [6,] 0.9179 0.4850 0.2286 0.9027 -0.0139  1.0000 0.0000    0
## [7,] 0.9170 0.7336 0.3490 0.7939 -0.4585 -0.0922 1.0000    0
## [8,] 0.6128 0.7084 0.4449 0.0328 -1.3793 -0.2184 0.5876    1
round(lu.decomposition(corr_matrix)$U,4)
##      [,1]   [,2]   [,3]   [,4]   [,5]    [,6]    [,7]    [,8]
## [1,]    1 0.9894 0.9184 0.9364 0.9155  0.9179  0.9170  0.6128
## [2,]    0 0.0210 0.0127 0.0118 0.0084  0.0102  0.0154  0.0149
## [3,]    0 0.0000 0.1490 0.0347 0.0183  0.0341  0.0520  0.0663
## [4,]    0 0.0000 0.0000 0.1085 0.1051  0.0980  0.0862  0.0036
## [5,]    0 0.0000 0.0000 0.0000 0.0545 -0.0008 -0.0250 -0.0752
## [6,]    0 0.0000 0.0000 0.0000 0.0000  0.0564 -0.0052 -0.0123
## [7,]    0 0.0000 0.0000 0.0000 0.0000  0.0000  0.0494  0.0290
## [8,]    0 0.0000 0.0000 0.0000 0.0000  0.0000  0.0000  0.4608
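
The prompt asks for an LDU decomposition; lu.decomposition() returns only L and U, but a diagonal factor can be pulled out of U so that both triangular factors have unit diagonals. This is a small sketch using the corr_matrix from above.

# Factor U into D %*% U_unit, where D holds U's diagonal and U_unit has ones on
# its diagonal, so that corr_matrix = L %*% D %*% U_unit
lu <- lu.decomposition(corr_matrix)
D <- diag(diag(lu$U))
U_unit <- diag(1 / diag(lu$U)) %*% lu$U
max(abs(lu$L %*% D %*% U_unit - corr_matrix))   # should be approximately 0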

5 points. Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data.
Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Based on the exploratory data analysis section, the Weight appears right skewed.

library(e1071)
hist(train$Weight)

print(paste("Skewness:" , skewness(train$Weight)))
## [1] "Skewness: 0.231455178371882"
print(paste("Min: ", min(train$Weight)))
## [1] "Min:  0.056699"

The minimum is already above 0 so that is all set.

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
expon_weight <- fitdistr(train$Weight, "exponential")

Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda)).

lambda <- expon_weight$estimate
lambda_samples <- rexp(1000, lambda)
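
For the exponential distribution, the maximum likelihood estimate of the rate parameter is simply one over the sample mean, so a quick sanity check on the fitted value (using train$Weight as above) is:

# The exponential MLE of the rate is 1 / mean(x); this should agree with the
# estimate returned by fitdistr()
1 / mean(train$Weight)
lambda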

Plot a histogram and compare it with a histogram of your original variable.

train |> ggplot(aes(x = Weight)) + geom_histogram(bins = 30)

data.frame(exp_weight = lambda_samples) |> ggplot(aes(x = exp_weight)) + geom_histogram(bins = 30)

The fitted distribution does not actually seem like a great fit for the weight data, but we will continue with it nonetheless.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

Reference: https://r-coder.com/exponential-distribution-r/#Plot_exponential_cumulative_distribution_function_in_R

print(paste("5th percentile:", round(qexp(0.05, rate = lambda), 4), "and the 95th percentile:", round(qexp(0.95, rate = lambda), 4)))
## [1] "5th percentile: 1.1995 and the 95th percentile: 70.0558"

Also generate a 95% confidence interval from the empirical data, assuming normality.

I am assuming that this is looking for the confidence interval around the mean.

Reference: https://makemeanalyst.com/statistics-with-r/how-to-find-confidence-intervals-in-r/

c95 <- t.test(train$Weight, conf.level = 0.95)
c95$conf.int
## [1] 23.29412 23.47632
## attr(,"conf.level")
## [1] 0.95
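
The same interval can be computed by hand under the normality assumption as mean ± z * s / sqrt(n); a small sketch using train$Weight:

# Normal-approximation 95% confidence interval for the mean weight
m <- mean(train$Weight)
s <- sd(train$Weight)
n_w <- length(train$Weight)
m + c(-1, 1) * qnorm(0.975) * s / sqrt(n_w)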

Finally, provide the empirical 5th percentile and 95th percentile of the data.

Reference: https://www.digitalocean.com/community/tutorials/quantile-function-in-r

quantile(train$Weight,na.rm = T,probs = c(0.05, 0.95))
##        5%       95% 
##  3.600386 44.211045

Discuss.

The fitted exponential distribution does not seem like a great match for the Weight data. The exponential percentiles (5th: 1.1995, 95th: 70.0558) do not line up with the empirical percentiles (5th: 3.6004, 95th: 44.2110), which is not a surprise because the histograms are not similar either. I would not go on to use the exponential distribution for the Weight variable.

10 points. Modeling.

Build some type of multiple regression model and submit your model to the competition board.

There are many types of regression models; a strategy for multiple linear regression is described in chapters 4 and 5 of this book: https://conservancy.umn.edu/handle/11299/189222

I also previously took Data 624, where I learned many methods for multiple regression: https://docs.google.com/document/d/1duSd351OXVB1P28lX9t5ChCEV0yeswaHv2VIkUU4Sm0/edit?usp=sharing

Based on my knowledge, I will build a few models. The Kaggle competition is judged on mean absolute error (MAE), so I will use MAE to compare my models and pick the best one.
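
For reference, MAE is just the average absolute difference between the observed and predicted ages; a minimal sketch of the metric (the mae() function from the Metrics package used later computes the same thing):

# Mean absolute error: the mean of |actual - predicted|
mae_manual <- function(actual, predicted) {
  mean(abs(actual - predicted))
}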

First, let’s split my own training data into train and test subsets so I can compare a couple of models.

library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(train$Age, p=0.8, list=F)

# Separate training from test
dfm_full_train <- train[train_rows,2:(ncol(train))]
dfm_full_test <- train[-train_rows,2:(ncol(train))]

# Separate outcome from predictors
trainx_full <- dfm_full_train[,1:(ncol(dfm_full_train)-1)]
trainy_full <- dfm_full_train[,ncol(dfm_full_train)]
testx_full <- dfm_full_test[,1:(ncol(dfm_full_train)-1)]
testy_full <- dfm_full_test[,ncol(dfm_full_train)]

Previously, I did some exploratory data analysis to look at the shape of the distributions and at outliers. From that analysis it can be noted that there was some skewness in the predictor variables as well as some outliers.

Let’s start with a multiple regression of everything, minus the ID (of course).

library(Metrics)
## Warning: package 'Metrics' was built under R version 4.2.3
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
lm.full <- lm(Age ~ ., data = dfm_full_train)
summary(lm.full)
## 
## Call:
## lm(formula = Age ~ ., data = dfm_full_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.3209  -1.2232  -0.3315   0.7524  22.0343 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.724161   0.085538  43.538  < 2e-16 ***
## SexI             -1.040505   0.028252 -36.829  < 2e-16 ***
## SexM             -0.114695   0.021472  -5.342 9.25e-08 ***
## Length            0.882034   0.215196   4.099 4.16e-05 ***
## Diameter          2.077091   0.267339   7.769 8.01e-15 ***
## Height            7.728391   0.272786  28.331  < 2e-16 ***
## Weight            0.189910   0.006074  31.268  < 2e-16 ***
## `Shucked Weight` -0.607680   0.007435 -81.727  < 2e-16 ***
## `Viscera Weight` -0.220906   0.013312 -16.594  < 2e-16 ***
## `Shell Weight`    0.514515   0.011018  46.696  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.133 on 59232 degrees of freedom
## Multiple R-squared:  0.5488, Adjusted R-squared:  0.5487 
## F-statistic:  8005 on 9 and 59232 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(lm.full)

actual <- dfm_full_test$Age
calculated <- predict(lm.full, testx_full)
print(paste("Full Model MAE:", round(mae(actual, calculated),4)))
## [1] "Full Model MAE: 1.4778"

Now, let’s try another model where the highly correlated predictors are removed.

high_corr <- findCorrelation(cor(train[3:9], use='complete'), names=T, cutoff=0.9)

# Take out highly correlated
dfm_red_train <- dfm_full_train[,-match(high_corr, colnames(dfm_full_train))]
trainx_red <- trainx_full[,-match(high_corr, colnames(trainx_full))]
testx_red <- testx_full[,-match(high_corr, colnames(testx_full))]

lm.red <- lm(Age ~ ., data = dfm_red_train)
summary(lm.red)
## 
## Call:
## lm(formula = Age ~ ., data = dfm_red_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.398  -1.438  -0.429   0.760  18.661 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.212826   0.062084   51.75   <2e-16 ***
## SexI             -1.527862   0.030692  -49.78   <2e-16 ***
## SexM             -0.233679   0.023772   -9.83   <2e-16 ***
## Height           25.333707   0.222235  114.00   <2e-16 ***
## `Shucked Weight` -0.146866   0.003516  -41.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.365 on 59237 degrees of freedom
## Multiple R-squared:  0.4449, Adjusted R-squared:  0.4449 
## F-statistic: 1.187e+04 on 4 and 59237 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(lm.red)

actual <- testy_full$Age
calculated <- predict(lm.red, testx_red)
print(paste("Reduced Model MAE:", round(mae(actual, calculated),4)))
## [1] "Reduced Model MAE: 1.644"

The full model still did slightly better than the reduced model, even though the reduced model removed some of the collinearity.

Lastly, let’s try an elastic net.

Reference: https://daviddalpiaz.github.io/r4sl/elastic-net.html http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/

library(fastDummies)
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
train_dummy <- dummy_cols(train, select_columns='Sex', remove_first_dummy=T, remove_selected_columns=T)

# Separate training from test
dfm_full_train <- train_dummy[train_rows,2:(ncol(train_dummy))]
dfm_full_test <- train_dummy[-train_rows,2:(ncol(train_dummy))]

# Separate outcome from predictors
trainx_full <- dfm_full_train[,c(1:7,9,10)]
trainy_full <- dfm_full_train[,8]
testx_full <- dfm_full_test[,c(1:7,9,10)]
testy_full <- dfm_full_test[,8]

elnet <- train(x=trainx_full, y=trainy_full$Age, method='glmnet', trControl=trainControl(method = "cv", number = 5))

actual <- testy_full$Age
calculated <- predict(elnet, testx_full)
coef(elnet$finalModel, elnet$bestTune$lambda)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)     3.7297685
## Length          0.8646587
## Diameter        2.0691882
## Height          7.7897654
## Weight          0.1627276
## Shucked Weight -0.5753122
## Viscera Weight -0.1926445
## Shell Weight    0.5398616
## Sex_I          -1.0427540
## Sex_M          -0.1149381
print(paste("Reduced Model MAE:", round(mae(actual, calculated),4)))
## [1] "Reduced Model MAE: 1.478"
plot(elnet)

It looks like the elastic net did just as well as the original multiple regression.

Now let’s try some more pre-processing.

# Separate training from test
dfm_full_train <- train[train_rows,2:(ncol(train))]
dfm_full_test <- train[-train_rows,2:(ncol(train))]

# Separate outcome from predictors
trainx_full <- dfm_full_train[,1:(ncol(dfm_full_train)-1)]
trainy_full <- dfm_full_train[,ncol(dfm_full_train)]
testx_full <- dfm_full_test[,1:(ncol(dfm_full_train)-1)]
testy_full <- dfm_full_test[,ncol(dfm_full_train)]

pp_no_nzv <- preProcess(trainx_full, 
                        method = c("center", "scale", "YeoJohnson", "nzv"))
pp_no_nzv
## Created from 59242 samples and 8 variables
## 
## Pre-processing:
##   - centered (7)
##   - ignored (1)
##   - scaled (7)
##   - Yeo-Johnson transformation (5)
## 
## Lambda estimates for Yeo-Johnson transformation:
## 1.35, 0.7, 0.64, 0.64, 0.68
train_preproc <- predict(pp_no_nzv, trainx_full)
test_preproc <- predict(pp_no_nzv, testx_full)

dfm_prepoc <- cbind(train_preproc, trainy_full)

lm.preproc <- lm(Age ~ . , data = dfm_prepoc)
summary(lm.preproc)
## 
## Call:
## lm(formula = Age ~ ., data = dfm_prepoc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.6288  -1.2295  -0.3265   0.7721  19.6562 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.33190    0.01695 609.727  < 2e-16 ***
## SexI             -0.99499    0.02832 -35.138  < 2e-16 ***
## SexM             -0.10917    0.02143  -5.095 3.49e-07 ***
## Length            0.28910    0.06241   4.633 3.62e-06 ***
## Diameter          0.46554    0.06406   7.267 3.72e-13 ***
## Height            0.59046    0.02447  24.129  < 2e-16 ***
## Weight            2.70495    0.08592  31.481  < 2e-16 ***
## `Shucked Weight` -3.86342    0.04708 -82.060  < 2e-16 ***
## `Viscera Weight` -0.69074    0.04128 -16.731  < 2e-16 ***
## `Shell Weight`    2.20057    0.04442  49.539  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.128 on 59232 degrees of freedom
## Multiple R-squared:  0.5507, Adjusted R-squared:  0.5506 
## F-statistic:  8066 on 9 and 59232 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(lm.preproc)

actual <- dfm_full_test$Age
calculated <- predict(lm.preproc, test_preproc)
print(paste("Preprocessed Model MAE:", round(mae(actual, calculated),4)))
## [1] "Preprocessed Model MAE: 1.4826"

Based on this, I will go with the full multiple regression model, which had the lowest MAE.

Now I need to make the predictions.

One important step is to check the distributions of the test set and make sure they are comparable to the training set.

test_d <- test |>
  #keep(is.numeric) |>
  subset(select = -c(Sex, id)) |>
  gather() 

train_d <- train |>
  #keep(is.numeric) |>
  subset(select = -c(Sex, id, Age)) |>
  gather() 

ggplot() +
    geom_density(data = test_d, aes(x = value),color="darkblue") + geom_density(data = train_d, aes(x = value),color="green") +facet_wrap(~ key, scales = "free") 

ggplot(test, aes(x=Sex)) +
  geom_bar()

test |>
  #keep(is.numeric) |>
  subset(select = -c(Sex, id)) |>
  gather() |>
  ggplot(aes(x=value)) +
    facet_wrap(~ key, scales = "free") +
    geom_boxplot()

calculated <- predict(lm.full, test |> subset(select = -c(id)))
summary(calculated)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.088   8.317  10.126   9.987  11.514  27.246
length(calculated)
## [1] 49368
write_csv(data.frame(cbind(id = test$id, Age = calculated)), "kgriffen_submission.csv")

Report your Kaggle.com user name and score.

My username is: kayleahgriffen

My score was: 1.48544

Recording

My recording is posted here: https://drive.google.com/drive/folders/17v-nLHOgB_i65XaJWvmq9vRWFkFj1Xv4?usp=sharing