Instructions: Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.
set.seed(1234)
# 10,000 continuous random uniform values between 5 and 15
X <- runif(n = 10000, min = 5, max = 15)
# 10,000 random normal values with a mean of 10 and a standard deviation of 2.89
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)
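As an aside, the standard deviation of 2.89 appears to be chosen deliberately: it is approximately the standard deviation of a Uniform(5, 15) distribution, (b - a)/sqrt(12), so X and Y share both their center (10) and their spread.
# standard deviation of Uniform(5, 15)
(15 - 5) / sqrt(12)
## [1] 2.886751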
Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities.
# x is median of X
# y is median of Y
x <- median(X)
y <- median(Y)
# a. P(X>x | X>y)
# By definition, P(B|A) = P(A & B) / P(A)
n <- 10000
p_ab <- sum(X>x & X>y)/n
p_a <- sum(X>y)/n
p <- p_ab/p_a
print(paste("P(X>x | X>y) = ", p))
## [1] "P(X>x | X>y) = 1"
# b. P(X>x & Y>y)
p <- sum(X>x & Y>y)/n
print(paste("P(X>x & Y>y) = ", p))
## [1] "P(X>x & Y>y) = 0.2507"
# c. P(X<x | X>y)
# By definition, P(B|A) = P(A & B) / P(A)
n <- 10000
p_ab <- sum(X<x & X>y)/n
p_a <- sum(X>y)/n
p <- p_ab/p_a
print(paste("P(X<x | X>y) = ", p))
## [1] "P(X<x | X>y) = 0"
Now to interpret the meanings:
The probability that X>x given that X>y is 1. This means that every value of X greater than the median of the Y series is also greater than the median of the X series. This makes sense because the medians of the X and Y series are about the same:
print(paste("Median X: ", median(X)))
## [1] "Median X: 10.0126582302619"
print(paste("Median Y: ", median(Y)))
## [1] "Median Y: 10.0315387998106"
We also know that X follows a uniform distribution with values from 5 to 15 and Y follows a normal distribution centered at 10 with a standard deviation of 2.89. Since we are conditioning on the values of X that are greater than the median of the Y series, and the two medians are essentially the same, those X values must also be greater than the median of the X series.
Part (b) asks for the probability of the intersection of X being greater than its median and Y being greater than its median. Half of the values lie on either side of each median, and X and Y were generated independently, so multiplying 0.5 * 0.5 = 0.25, which matches the observed 0.2507.
Part (c) asks for the probability that X<x given that X>y, which came out to 0. This is the complement of part (a): since every value of X above the median of the Y series is also above the median of the X series, none of those values can fall below the median of the X series.
Help from: probabilities (https://www.youtube.com/watch?v=FWrEaSaW2mc&ab_channel=BryanNelson) and tables (https://www.geeksforgeeks.org/how-to-create-tables-in-r/).
# Now let's build the table of counts for each pair of conditions
# Condition 1: X>x vs. X<=x (rows)
# Condition 2: Y>y vs. Y<=y (columns)
#
#          (Y>y)  (Y<=y)
# (X>x)  |      |
# (X<=x) |      |
c1 <- sum(X>x & Y>y)
c2 <- sum(X>x & Y<=y)
c3 <- sum(X<=x & Y>y)
c4 <- sum(X<=x & Y<=y)
# create a matrix with 2 columns and 2 rows, filling by row
data <- matrix(c(c1, c2, c3, c4), ncol = 2, byrow = TRUE)
# specify the column names and row names of the matrix
colnames(data) <- c('(Y>y)', '(Y<=y)')
rownames(data) <- c('(X>x)', '(X<=x)')
# convert to a table
final <- as.table(data)
# display
final
## (Y>y) (Y<=y)
## (X>x) 2507 2493
## (X<=x) 2493 2507
# Does P(X>x & Y>y) = P(X>x)P(Y>y)?
print(paste("P(X>x & Y>y): ", c1/(sum(c1,c2,c3,c4))))
## [1] "P(X>x & Y>y): 0.2507"
print(paste("P(X>x)P(Y>y): ", ((c1+c2)/sum(c1,c2,c3,c4))*((c1+c3)/sum(c1,c2,c3,c4))))
## [1] "P(X>x)P(Y>y): 0.25"
Using the table of counts of each joint condition, I can see that P(X>x & Y>y) = 0.2507 is approximately equal to P(X>x)P(Y>y) = 0.25, which is what we would expect if the two conditions are independent.
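As a further cross-check (a small sketch using the final table from above), prop.table() converts the counts to joint probabilities, so the (X>x, Y>y) cell can be compared directly to the product of its margins:
probs <- prop.table(final)
round(c(joint = probs[1, 1], margins_product = sum(probs[1, ]) * sum(probs[, 1])), 4)
##           joint margins_product 
##          0.2507          0.2500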
print("Fisher Test")
## [1] "Fisher Test"
fisher.test(final)
##
## Fisher's Exact Test for Count Data
##
## data: final
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342763 1.0946016
## sample estimates:
## odds ratio
## 1.011264
print("Chi Square Test")
## [1] "Chi Square Test"
chisq.test(final)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: final
## X-squared = 0.0676, df = 1, p-value = 0.7949
In both the Fisher and chi-square tests, the null hypothesis is that the conditions are independent and the alternative hypothesis is that they are not. The p-value is the same for both, 0.7949; since it is large, we do not reject the null hypothesis, which is consistent with independence. The difference between Fisher’s exact test and the chi-square test is their suitability for different situations. As the name suggests, Fisher’s exact test is exact, but it is best suited to small sample sizes. In contrast, the chi-square test is approximate but works well for large sample sizes. Because we have a large sample, I think the chi-square test makes more sense here. I am not surprised by the results; the analysis up to this point also pointed to the independence of the conditions. The two conditions were based on probabilities from two distributions that were generated independently of one another, so it would not have made sense for them to be dependent.
Reference material: https://www.datascienceblog.net/post/statistical_test/contingency_table_tests/
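To illustrate the small-sample point with a made-up 2x2 table (hypothetical counts, for demonstration only): with cells this small the chi-square approximation degrades, and R will even warn that it may be incorrect, while Fisher’s exact test remains valid.
small <- matrix(c(3, 1, 2, 6), nrow = 2)
fisher.test(small)$p.value  # exact, valid even for tiny counts
chisq.test(small)$p.value   # approximate; R warns the approximation may be poor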
You are to register for Kaggle.com (free) and compete in the Regression with a Crab Age Dataset competition (https://www.kaggle.com/competitions/playground-series-s3e16). I want you to do the following.
Load necessary packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load the test and train data.
train <- read_csv("https://raw.githubusercontent.com/klgriffen96/data605/main/train.csv", show_col_types = FALSE)
test <- read_csv("https://raw.githubusercontent.com/klgriffen96/data605/main/test.csv", show_col_types = FALSE)
Provide univariate descriptive statistics and appropriate plots for the training data set.
print("Columns")
## [1] "Columns"
colnames(train)
## [1] "id" "Sex" "Length" "Diameter"
## [5] "Height" "Weight" "Shucked Weight" "Viscera Weight"
## [9] "Shell Weight" "Age"
print("Descriptive stats")
## [1] "Descriptive stats"
summary(train)
## id Sex Length Diameter
## Min. : 0 Length:74051 Min. :0.1875 Min. :0.1375
## 1st Qu.:18513 Class :character 1st Qu.:1.1500 1st Qu.:0.8875
## Median :37025 Mode :character Median :1.3750 Median :1.0750
## Mean :37025 Mean :1.3175 Mean :1.0245
## 3rd Qu.:55538 3rd Qu.:1.5375 3rd Qu.:1.2000
## Max. :74050 Max. :2.0128 Max. :1.6125
## Height Weight Shucked Weight Viscera Weight
## Min. :0.0000 Min. : 0.0567 Min. : 0.02835 Min. : 0.04252
## 1st Qu.:0.3000 1st Qu.:13.4377 1st Qu.: 5.71242 1st Qu.: 2.86330
## Median :0.3625 Median :23.7994 Median : 9.90815 Median : 4.98951
## Mean :0.3481 Mean :23.3852 Mean :10.10427 Mean : 5.05839
## 3rd Qu.:0.4125 3rd Qu.:32.1625 3rd Qu.:14.03300 3rd Qu.: 6.98815
## Max. :2.8250 Max. :80.1015 Max. :42.18406 Max. :21.54562
## Shell Weight Age
## Min. : 0.04252 Min. : 1.000
## 1st Qu.: 3.96893 1st Qu.: 8.000
## Median : 6.93145 Median :10.000
## Mean : 6.72387 Mean : 9.968
## 3rd Qu.: 9.07184 3rd Qu.:11.000
## Max. :28.49125 Max. :29.000
Let’s go through and look at the distributions of each variable in the training set.
train |>
#keep(is.numeric) |>
subset(select = -c(Sex, id)) |>
gather() |>
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins = 30)
From these plots we can see that some of the variables are skewed.
Reference for quick plotting: https://drsimonj.svbtle.com/quick-plot-of-all-variables
Check the count of the sexes.
ggplot(train, aes(x=Sex)) +
geom_bar()
There appear to be approximately equal counts of each sex category.
I also want to look out for outliers.
train |>
#keep(is.numeric) |>
subset(select = -c(Sex, id)) |>
gather() |>
ggplot(aes(x=value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot()
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
The dependent variable is Age; we will plot it against the independent variables.
train |>
gather(key = "variable", value = "value",
-id, -Age) |>
ggplot(aes(x = value, y = Age)) +
geom_point() +
facet_wrap(~variable, scales = "free")
From these scatterplots, it seems that Diameter, Height, and Length are all good candidates for strong relationships with Age.
Derive a correlation matrix for any three quantitative variables in the dataset.
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
train |>
subset(select = -c(Sex, id)) |>
cor() |>
corrplot(addCoef.col = "white", type = "lower", number.cex=0.75)
Based on the correlation plot, we can see that a lot of the variables are highly correlated with one another.
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
So the null hypothesis is that the pairwise correlation is 0 and the alternative hypothesis is that it is not 0. I predict that none of them will be zero.
cs <- colnames(train)[3:10]
for (n1 in cs){
for (n2 in cs){
if (n1 != n2){
print(paste("Variables:", n1, "and", n2))
result <- cor.test(train[[n1]], train[[n2]], method = "pearson", conf.level = 0.8)
print(paste("P Value: ", round(result$p.value, 4), "Confidence Interval", round(result$conf.int[1], 4), "to" ,round(result$conf.int[2], 4)))
print("____________________________________________________________")
}
}
}
## [1] "Variables: Length and Diameter"
## [1] "P Value: 0 Confidence Interval 0.9893 to 0.9895"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Height"
## [1] "P Value: 0 Confidence Interval 0.9176 to 0.9191"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Weight"
## [1] "P Value: 0 Confidence Interval 0.9358 to 0.937"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Shucked Weight"
## [1] "P Value: 0 Confidence Interval 0.9148 to 0.9163"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Viscera Weight"
## [1] "P Value: 0 Confidence Interval 0.9171 to 0.9186"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Shell Weight"
## [1] "P Value: 0 Confidence Interval 0.9162 to 0.9177"
## [1] "____________________________________________________________"
## [1] "Variables: Length and Age"
## [1] "P Value: 0 Confidence Interval 0.6099 to 0.6158"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Length"
## [1] "P Value: 0 Confidence Interval 0.9893 to 0.9895"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Height"
## [1] "P Value: 0 Confidence Interval 0.9206 to 0.9221"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Weight"
## [1] "P Value: 0 Confidence Interval 0.9377 to 0.9388"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Shucked Weight"
## [1] "P Value: 0 Confidence Interval 0.9134 to 0.915"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Viscera Weight"
## [1] "P Value: 0 Confidence Interval 0.9176 to 0.9191"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Shell Weight"
## [1] "P Value: 0 Confidence Interval 0.922 to 0.9234"
## [1] "____________________________________________________________"
## [1] "Variables: Diameter and Age"
## [1] "P Value: 0 Confidence Interval 0.6184 to 0.6241"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Length"
## [1] "P Value: 0 Confidence Interval 0.9176 to 0.9191"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Diameter"
## [1] "P Value: 0 Confidence Interval 0.9206 to 0.9221"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Weight"
## [1] "P Value: 0 Confidence Interval 0.9009 to 0.9027"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Shucked Weight"
## [1] "P Value: 0 Confidence Interval 0.8629 to 0.8653"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Viscera Weight"
## [1] "P Value: 0 Confidence Interval 0.8821 to 0.8842"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Shell Weight"
## [1] "P Value: 0 Confidence Interval 0.9025 to 0.9043"
## [1] "____________________________________________________________"
## [1] "Variables: Height and Age"
## [1] "P Value: 0 Confidence Interval 0.6353 to 0.6409"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Length"
## [1] "P Value: 0 Confidence Interval 0.9358 to 0.937"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Diameter"
## [1] "P Value: 0 Confidence Interval 0.9377 to 0.9388"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Height"
## [1] "P Value: 0 Confidence Interval 0.9009 to 0.9027"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Shucked Weight"
## [1] "P Value: 0 Confidence Interval 0.971 to 0.9715"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Viscera Weight"
## [1] "P Value: 0 Confidence Interval 0.9708 to 0.9713"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Shell Weight"
## [1] "P Value: 0 Confidence Interval 0.9652 to 0.9658"
## [1] "____________________________________________________________"
## [1] "Variables: Weight and Age"
## [1] "P Value: 0 Confidence Interval 0.5982 to 0.6042"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Length"
## [1] "P Value: 0 Confidence Interval 0.9148 to 0.9163"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Diameter"
## [1] "P Value: 0 Confidence Interval 0.9134 to 0.915"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Height"
## [1] "P Value: 0 Confidence Interval 0.8629 to 0.8653"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Weight"
## [1] "P Value: 0 Confidence Interval 0.971 to 0.9715"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Viscera Weight"
## [1] "P Value: 0 Confidence Interval 0.9421 to 0.9431"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Shell Weight"
## [1] "P Value: 0 Confidence Interval 0.9096 to 0.9112"
## [1] "____________________________________________________________"
## [1] "Variables: Shucked Weight and Age"
## [1] "P Value: 0 Confidence Interval 0.4998 to 0.5068"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Length"
## [1] "P Value: 0 Confidence Interval 0.9171 to 0.9186"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Diameter"
## [1] "P Value: 0 Confidence Interval 0.9176 to 0.9191"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Height"
## [1] "P Value: 0 Confidence Interval 0.8821 to 0.8842"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Weight"
## [1] "P Value: 0 Confidence Interval 0.9708 to 0.9713"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Shucked Weight"
## [1] "P Value: 0 Confidence Interval 0.9421 to 0.9431"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Shell Weight"
## [1] "P Value: 0 Confidence Interval 0.9333 to 0.9345"
## [1] "____________________________________________________________"
## [1] "Variables: Viscera Weight and Age"
## [1] "P Value: 0 Confidence Interval 0.5737 to 0.5799"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Length"
## [1] "P Value: 0 Confidence Interval 0.9162 to 0.9177"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Diameter"
## [1] "P Value: 0 Confidence Interval 0.922 to 0.9234"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Height"
## [1] "P Value: 0 Confidence Interval 0.9025 to 0.9043"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Weight"
## [1] "P Value: 0 Confidence Interval 0.9652 to 0.9658"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Shucked Weight"
## [1] "P Value: 0 Confidence Interval 0.9096 to 0.9112"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Viscera Weight"
## [1] "P Value: 0 Confidence Interval 0.9333 to 0.9345"
## [1] "____________________________________________________________"
## [1] "Variables: Shell Weight and Age"
## [1] "P Value: 0 Confidence Interval 0.6608 to 0.6661"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Length"
## [1] "P Value: 0 Confidence Interval 0.6099 to 0.6158"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Diameter"
## [1] "P Value: 0 Confidence Interval 0.6184 to 0.6241"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Height"
## [1] "P Value: 0 Confidence Interval 0.6353 to 0.6409"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Weight"
## [1] "P Value: 0 Confidence Interval 0.5982 to 0.6042"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Shucked Weight"
## [1] "P Value: 0 Confidence Interval 0.4998 to 0.5068"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Viscera Weight"
## [1] "P Value: 0 Confidence Interval 0.5737 to 0.5799"
## [1] "____________________________________________________________"
## [1] "Variables: Age and Shell Weight"
## [1] "P Value: 0 Confidence Interval 0.6608 to 0.6661"
## [1] "____________________________________________________________"
Reference: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
Discuss the meaning of your analysis.
Would you be worried about familywise error? Why or why not?
I tested the pairwise correlations between all of the numerical variables. The familywise error rate is the risk of committing at least one Type I error, that is, rejecting a null hypothesis when it is actually true. In our case the null hypothesis is that a pairwise correlation is 0, and we rejected it in every case due to the very low p-values. If any of those null hypotheses were actually true, we would have made a Type I error. The familywise error rate increases with the number of hypothesis tests; however, because our 80% confidence intervals are so far away from 0, I am not worried about familywise error in this case. There are also corrections that could be applied to mitigate the familywise error rate, as sketched below.
Reference: https://www.youtube.com/watch?v=8CuFKY45UXQ&ab_channel=TileStats
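For example (a minimal sketch, reusing the cs column names from the loop above), the unique pairwise p-values could be collected once each and adjusted with p.adjust(), e.g. using the Holm correction:
# collect each unique pairwise p-value once
pvals <- c()
for (i in 1:(length(cs) - 1)){
  for (j in (i + 1):length(cs)){
    pvals <- c(pvals, cor.test(train[[cs[i]]], train[[cs[j]]])$p.value)
  }
}
# Holm-adjusted p-values control the familywise error rate
round(p.adjust(pvals, method = "holm"), 4)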
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
corr_matrix <- train |>
subset(select = -c(Sex, id)) |>
cor()
prec_matrix <- corr_matrix |> solve()
prec_matrix
## Length Diameter Height Weight Shucked Weight
## Length 50.0363646 -45.1097381 -1.9464374 0.1100257 -2.9284116
## Diameter -45.1097381 52.3551380 -2.5463119 -1.4747766 -0.4008903
## Height -1.9464374 -2.5463119 7.8112768 -0.1561810 0.2354580
## Weight 0.1100257 -1.4747766 -0.1561810 78.0530360 -32.1967124
## Shucked Weight -2.9284116 -0.4008903 0.2354580 -32.1967124 25.3018917
## Viscera Weight -1.4824486 0.2206767 -0.5969465 -19.0009603 1.5006489
## Shell Weight 1.5536462 -2.7474053 -2.0985613 -25.7645353 7.8914648
## Age -0.1657241 -0.4720841 -0.4972981 -1.7174146 2.4136599
## Viscera Weight Shell Weight Age
## Length -1.4824486 1.553646 -0.1657241
## Diameter 0.2206767 -2.747405 -0.4720841
## Height -0.5969465 -2.098561 -0.4972981
## Weight -19.0009603 -25.764535 -1.7174146
## Shucked Weight 1.5006489 7.891465 2.4136599
## Viscera Weight 17.9689552 1.656625 0.3565189
## Shell Weight 1.6566248 20.997071 -1.2752517
## Age 0.3565189 -1.275252 2.1702671
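As a quick sanity check on the variance inflation factor claim (a sketch): the j-th diagonal entry of an inverted correlation matrix equals 1/(1 - R_j^2), where R_j^2 comes from regressing variable j on all of the other variables. Checking this for Length:
# R^2 from regressing Length on the other quantitative variables
r2 <- summary(lm(Length ~ ., data = subset(train, select = -c(Sex, id))))$r.squared
round(c(vif_formula = 1 / (1 - r2), precision_diag = prec_matrix["Length", "Length"]), 4)
##    vif_formula precision_diag 
##        50.0364        50.0364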
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
round(corr_matrix%*% prec_matrix,2)
## Length Diameter Height Weight Shucked Weight Viscera Weight
## Length 1 0 0 0 0 0
## Diameter 0 1 0 0 0 0
## Height 0 0 1 0 0 0
## Weight 0 0 0 1 0 0
## Shucked Weight 0 0 0 0 1 0
## Viscera Weight 0 0 0 0 0 1
## Shell Weight 0 0 0 0 0 0
## Age 0 0 0 0 0 0
## Shell Weight Age
## Length 0 0
## Diameter 0 0
## Height 0 0
## Weight 0 0
## Shucked Weight 0 0
## Viscera Weight 0 0
## Shell Weight 1 0
## Age 0 1
round(prec_matrix%*% corr_matrix,2)
## Length Diameter Height Weight Shucked Weight Viscera Weight
## Length 1 0 0 0 0 0
## Diameter 0 1 0 0 0 0
## Height 0 0 1 0 0 0
## Weight 0 0 0 1 0 0
## Shucked Weight 0 0 0 0 1 0
## Viscera Weight 0 0 0 0 0 1
## Shell Weight 0 0 0 0 0 0
## Age 0 0 0 0 0 0
## Shell Weight Age
## Length 0 0
## Diameter 0 0
## Height 0 0
## Weight 0 0
## Shucked Weight 0 0
## Viscera Weight 0 0
## Shell Weight 1 0
## Age 0 1
Conduct LDU decomposition on the matrix.
library(matrixcalc)
round(lu.decomposition(corr_matrix)$L,4)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0
## [2,] 0.9894 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0
## [3,] 0.9184 0.6044 1.0000 0.0000 0.0000 0.0000 0.0000 0
## [4,] 0.9364 0.5599 0.2332 1.0000 0.0000 0.0000 0.0000 0
## [5,] 0.9155 0.3975 0.1226 0.9682 1.0000 0.0000 0.0000 0
## [6,] 0.9179 0.4850 0.2286 0.9027 -0.0139 1.0000 0.0000 0
## [7,] 0.9170 0.7336 0.3490 0.7939 -0.4585 -0.0922 1.0000 0
## [8,] 0.6128 0.7084 0.4449 0.0328 -1.3793 -0.2184 0.5876 1
round(lu.decomposition(corr_matrix)$U,4)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 0.9894 0.9184 0.9364 0.9155 0.9179 0.9170 0.6128
## [2,] 0 0.0210 0.0127 0.0118 0.0084 0.0102 0.0154 0.0149
## [3,] 0 0.0000 0.1490 0.0347 0.0183 0.0341 0.0520 0.0663
## [4,] 0 0.0000 0.0000 0.1085 0.1051 0.0980 0.0862 0.0036
## [5,] 0 0.0000 0.0000 0.0000 0.0545 -0.0008 -0.0250 -0.0752
## [6,] 0 0.0000 0.0000 0.0000 0.0000 0.0564 -0.0052 -0.0123
## [7,] 0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0494 0.0290
## [8,] 0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.4608
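Note that lu.decomposition() returns the two-factor LU form, with the pivots on the diagonal of U. To obtain the LDU form the assignment names, the diagonal can be factored out of U so that both triangular factors have unit diagonals (a small sketch):
LU <- lu.decomposition(corr_matrix)
D <- diag(diag(LU$U))     # diagonal matrix of pivots
U1 <- solve(D) %*% LU$U   # unit upper-triangular factor
# confirm corr_matrix = L %*% D %*% U1 (up to floating point error)
all.equal(corr_matrix, LU$L %*% D %*% U1, check.attributes = FALSE)
## [1] TRUE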
Many times, it makes sense to fit a closed form distribution to data.
Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
Based on the exploratory data analysis section, the Weight appears right skewed.
library(e1071)
hist(train$Weight)
print(paste("Skewness:" , skewness(train$Weight)))
## [1] "Skewness: 0.231455178371882"
print(paste("Min: ", min(train$Weight)))
## [1] "Min: 0.056699"
The minimum is already above 0 so that is all set.
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
expon_weight <- fitdistr(train$Weight, "exponential")
Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda)).
lambda <- expon_weight$estimate
lambda_samples <- rexp(1000, lambda)
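As a sanity check, the maximum likelihood estimate of an exponential rate is simply one over the sample mean, so fitdistr’s estimate can be verified by hand:
# exponential MLE: lambda = 1 / sample mean
round(c(fitdistr_lambda = unname(lambda), one_over_mean = 1 / mean(train$Weight)), 5)
## fitdistr_lambda   one_over_mean 
##         0.04276         0.04276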
Plot a histogram and compare it with a histogram of your original variable.
train |> ggplot(aes(x = Weight)) + geom_histogram(bins = 30)
data.frame(exp_weight = lambda_samples) |> ggplot(aes(x = exp_weight)) + geom_histogram(bins = 30)
The fitted distribution does not actually seem like a great fit for the weight data, but we will continue with it nonetheless.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
print(paste("5th percentile:", round(qexp(0.05, rate = lambda), 4), "and the 95th percentile:", round(qexp(0.95, rate = lambda), 4)))
## [1] "5th percentile: 1.1995 and the 95th percentile: 70.0558"
Also generate a 95% confidence interval from the empirical data, assuming normality.
I am assuming that this is looking for the confidence interval around the mean.
Reference: https://makemeanalyst.com/statistics-with-r/how-to-find-confidence-intervals-in-r/
c95 <- t.test(train$Weight, conf.level = 0.95)
c95$conf.int
## [1] 23.29412 23.47632
## attr(,"conf.level")
## [1] 0.95
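The same interval can be computed by hand under the normality assumption, mean plus or minus the t quantile times the standard error, which is what t.test() does internally:
w <- train$Weight
mean(w) + c(-1, 1) * qt(0.975, df = length(w) - 1) * sd(w) / sqrt(length(w))
## [1] 23.29412 23.47632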
Finally, provide the empirical 5th percentile and 95th percentile of the data.
Reference: https://www.digitalocean.com/community/tutorials/quantile-function-in-r
quantile(train$Weight,na.rm = T,probs = c(0.05, 0.95))
## 5% 95%
## 3.600386 44.211045
Discuss.
The fitted exponential distribution does not seem like a great match for the Weight data. The exponential percentiles are 1.1995 (5th) and 70.0558 (95th), while the empirical percentiles are 3.6004 (5th) and 44.2110 (95th). The percentiles are not well matched, which is not a surprise because the histograms are not similar either. I would not go on to use the exponential distribution for Weight.
Build some type of multiple regression model and submit your model to the competition board.
There are many types of regression models; a strategy for multiple linear regression is described in chapters 4 and 5 of this book: https://conservancy.umn.edu/handle/11299/189222
I also previously took DATA 624, where I learned many methods for multiple regression: https://docs.google.com/document/d/1duSd351OXVB1P28lX9t5ChCEV0yeswaHv2VIkUU4Sm0/edit?usp=sharing
Based on my knowledge, I will make a few models. The Kaggle competition is judged on mean absolute error (MAE), so I will use this metric to compare my models and pick the best one from my options.
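For clarity, mean absolute error is just the average absolute difference between predicted and actual ages; a hand-rolled equivalent of the Metrics::mae() function used below would be:
# competition metric: mean absolute error
mae_manual <- function(actual, predicted) mean(abs(actual - predicted))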
First, let’s split my own training data into train and test subsets so I can compare a couple of models.
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(train$Age, p=0.8, list=F)
# Separate training from test
dfm_full_train <- train[train_rows,2:(ncol(train))]
dfm_full_test <- train[-train_rows,2:(ncol(train))]
# Separate outcome from predictors
trainx_full <- dfm_full_train[,1:(ncol(dfm_full_train)-1)]
trainy_full <- dfm_full_train[,ncol(dfm_full_train)]
testx_full <- dfm_full_test[,1:(ncol(dfm_full_train)-1)]
testy_full <- dfm_full_test[,ncol(dfm_full_train)]
Previously, I had done some exploratory data analysis to see the shape of the distributions and the outliers. From that exploratory data analysis it can be noted that there was some skewness in the predictor variables as well as some outliers.
Let’s start with a multiple regression of everything, minus the ID (of course).
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.2.3
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
lm.full <- lm(Age ~ ., data = dfm_full_train)
summary(lm.full)
##
## Call:
## lm(formula = Age ~ ., data = dfm_full_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.3209 -1.2232 -0.3315 0.7524 22.0343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.724161 0.085538 43.538 < 2e-16 ***
## SexI -1.040505 0.028252 -36.829 < 2e-16 ***
## SexM -0.114695 0.021472 -5.342 9.25e-08 ***
## Length 0.882034 0.215196 4.099 4.16e-05 ***
## Diameter 2.077091 0.267339 7.769 8.01e-15 ***
## Height 7.728391 0.272786 28.331 < 2e-16 ***
## Weight 0.189910 0.006074 31.268 < 2e-16 ***
## `Shucked Weight` -0.607680 0.007435 -81.727 < 2e-16 ***
## `Viscera Weight` -0.220906 0.013312 -16.594 < 2e-16 ***
## `Shell Weight` 0.514515 0.011018 46.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.133 on 59232 degrees of freedom
## Multiple R-squared: 0.5488, Adjusted R-squared: 0.5487
## F-statistic: 8005 on 9 and 59232 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.full)
actual <- dfm_full_test$Age
calculated <- predict(lm.full, testx_full)
print(paste("Full Model MAE:", round(mae(actual, calculated),4)))
## [1] "Full Model MAE: 1.4778"
Now, let’s try another model where the highly correlated predictors are removed.
high_corr <- findCorrelation(cor(train[3:9], use='complete'), names=T, cutoff=0.9)
# Take out highly correlated
dfm_red_train <- dfm_full_train[,-match(high_corr, colnames(dfm_full_train))]
trainx_red <- trainx_full[,-match(high_corr, colnames(trainx_full))]
testx_red <- testx_full[,-match(high_corr, colnames(testx_full))]
lm.red <- lm(Age ~ ., data = dfm_red_train)
summary(lm.red)
##
## Call:
## lm(formula = Age ~ ., data = dfm_red_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.398 -1.438 -0.429 0.760 18.661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.212826 0.062084 51.75 <2e-16 ***
## SexI -1.527862 0.030692 -49.78 <2e-16 ***
## SexM -0.233679 0.023772 -9.83 <2e-16 ***
## Height 25.333707 0.222235 114.00 <2e-16 ***
## `Shucked Weight` -0.146866 0.003516 -41.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.365 on 59237 degrees of freedom
## Multiple R-squared: 0.4449, Adjusted R-squared: 0.4449
## F-statistic: 1.187e+04 on 4 and 59237 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.red)
actual <- testy_full$Age
calculated <- predict(lm.red, testx_red)
print(paste("Reduced Model MAE:", round(mae(actual, calculated),4)))
## [1] "Reduced Model MAE: 1.644"
The full model still did slightly better than the reduced one, even though the reduced model removed some of the collinearity.
Lastly, let’s try an elastic net.
Reference: https://daviddalpiaz.github.io/r4sl/elastic-net.html http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/
library(fastDummies)
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
train_dummy <- dummy_cols(train, select_columns='Sex', remove_first_dummy=T, remove_selected_columns=T)
# Separate training from test
dfm_full_train <- train_dummy[train_rows,2:(ncol(train_dummy))]
dfm_full_test <- train_dummy[-train_rows,2:(ncol(train_dummy))]
# Separate outcome from predictors
trainx_full <- dfm_full_train[,c(1:7,9,10)]
trainy_full <- dfm_full_train[,8]
testx_full <- dfm_full_test[,c(1:7,9,10)]
testy_full <- dfm_full_test[,8]
elnet <- train(x=trainx_full, y=trainy_full$Age, method='glmnet', trControl=trainControl(method = "cv", number = 5))
actual <- testy_full$Age
calculated <- predict(elnet, testx_full)
coef(elnet$finalModel, elnet$bestTune$lambda)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.7297685
## Length 0.8646587
## Diameter 2.0691882
## Height 7.7897654
## Weight 0.1627276
## Shucked Weight -0.5753122
## Viscera Weight -0.1926445
## Shell Weight 0.5398616
## Sex_I -1.0427540
## Sex_M -0.1149381
print(paste("Reduced Model MAE:", round(mae(actual, calculated),4)))
## [1] "Reduced Model MAE: 1.478"
plot(elnet)
It looks like the elastic net did just as well as the original multiple regression.
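For reference, the tuning parameters caret selected can be inspected as follows (alpha controls the ridge/lasso mix, lambda the penalty strength); the exact values will vary with the cross-validation folds:
elnet$bestTune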
Now let’s try some more pre-processing.
# Separate training from test
dfm_full_train <- train[train_rows,2:(ncol(train))]
dfm_full_test <- train[-train_rows,2:(ncol(train))]
# Separate outcome from predictors
trainx_full <- dfm_full_train[,1:(ncol(dfm_full_train)-1)]
trainy_full <- dfm_full_train[,ncol(dfm_full_train)]
testx_full <- dfm_full_test[,1:(ncol(dfm_full_train)-1)]
testy_full <- dfm_full_test[,ncol(dfm_full_train)]
pp_no_nzv <- preProcess(trainx_full,
method = c("center", "scale", "YeoJohnson", "nzv"))
pp_no_nzv
## Created from 59242 samples and 8 variables
##
## Pre-processing:
## - centered (7)
## - ignored (1)
## - scaled (7)
## - Yeo-Johnson transformation (5)
##
## Lambda estimates for Yeo-Johnson transformation:
## 1.35, 0.7, 0.64, 0.64, 0.68
train_preproc <- predict(pp_no_nzv, trainx_full)
test_preproc <- predict(pp_no_nzv, testx_full)
dfm_prepoc <- cbind(train_preproc, trainy_full)
lm.preproc <- lm(Age ~ . , data = dfm_prepoc)
summary(lm.preproc)
##
## Call:
## lm(formula = Age ~ ., data = dfm_prepoc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6288 -1.2295 -0.3265 0.7721 19.6562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.33190 0.01695 609.727 < 2e-16 ***
## SexI -0.99499 0.02832 -35.138 < 2e-16 ***
## SexM -0.10917 0.02143 -5.095 3.49e-07 ***
## Length 0.28910 0.06241 4.633 3.62e-06 ***
## Diameter 0.46554 0.06406 7.267 3.72e-13 ***
## Height 0.59046 0.02447 24.129 < 2e-16 ***
## Weight 2.70495 0.08592 31.481 < 2e-16 ***
## `Shucked Weight` -3.86342 0.04708 -82.060 < 2e-16 ***
## `Viscera Weight` -0.69074 0.04128 -16.731 < 2e-16 ***
## `Shell Weight` 2.20057 0.04442 49.539 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.128 on 59232 degrees of freedom
## Multiple R-squared: 0.5507, Adjusted R-squared: 0.5506
## F-statistic: 8066 on 9 and 59232 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.preproc)
actual <- dfm_full_test$Age
calculated <- predict(lm.preproc, test_preproc)
print(paste("Preprocessed Model MAE:", round(mae(actual, calculated),4)))
## [1] "Preprocessed Model MAE: 1.4826"
Based on this, I will go with the simplest model with the lowest MAE, which is the full linear model.
Now I need to make the predictions.
One important step is to check the distributions of the test set and make sure they are comparable to the training set.
test_d <- test |>
#keep(is.numeric) |>
subset(select = -c(Sex, id)) |>
gather()
train_d <- train |>
#keep(is.numeric) |>
subset(select = -c(Sex, id, Age)) |>
gather()
ggplot() +
geom_density(data = test_d, aes(x = value),color="darkblue") + geom_density(data = train_d, aes(x = value),color="green") +facet_wrap(~ key, scales = "free")
ggplot(test, aes(x=Sex)) +
geom_bar()
test |>
#keep(is.numeric) |>
subset(select = -c(Sex, id)) |>
gather() |>
ggplot(aes(x=value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot()
calculated <- predict(lm.full, test |> subset(select = -c(id)))
summary(calculated)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.088 8.317 10.126 9.987 11.514 27.246
length(calculated)
## [1] 49368
write_csv(data.frame(cbind(id = test$id, Age = calculated)), "kgriffen_submission.csv")
Report your Kaggle.com user name and score.
My username is: kayleahgriffen
My score was: 1.48544
My recording is posted here: https://drive.google.com/drive/folders/17v-nLHOgB_i65XaJWvmq9vRWFkFj1Xv4?usp=sharing