DATA605 Final
Problem 1a
set.seed(1234)
X <- runif(10000,5,15)
#mean should be 10 per uniform distribution characteristics
#median should also be equal to 10
Y <- rnorm(10000, mean = 10, sd = 2.89)
#mean, median should also be equal to 10
y_med <- median(Y)
x_med <- median(X)
cat((paste("Y Median:",y_med,"\nX Median:",x_med)))## Y Median: 10.0315387998106
## X Median: 10.0126582302619
a. P(X>x | X>y) = .50 |
Interpretation: The probability that the random
variable X takes a value greater than its median, given that the value
taken is also greater than the median of Y. This is a conditional
probability with its formula being represented as: P(A and B) / P(B)
| P(B) = 50% (X>y). Since the distributions share
mean/meadians, the probability is the same as asking how many values in
X are greater than its mean/median, which would again be 50%
| P(A
and B) = .5*.5 = .25 | P(A | B) = P(A and B) / P(B) = .25/.50 = .50
prob_a <- sum(X > x_med)/length(X)
prob_b <- sum(X > y_med)/length(X)
prob_anb <- prob_a * prob_b
prob_agiveb <- prob_anb/prob_b
cat((paste("Prob of A:", prob_a, "\nProb of B:", prob_b, "\nProb of A and B:", prob_anb, "\nProb of A given B:", prob_agiveb)))## Prob of A: 0.5
## Prob of B: 0.4979
## Prob of A and B: 0.24895
## Prob of A given B: 0.5
P(X>x & Y>y) = .25 | Interpretation: The probability that the random variable X takes a value greater than the median and that the random variable Y taken is greater than the median of y. | P(X>x) = .5 | P(Y>y) = .5 | P(A and B) = .5*.5 = .25
P(X<x | X>y) = .50 | Interpretation: The probability that the random variable X takes a value less than the median and that the value taken is greater than the median of y. | P(X<x) = .5 | P(X>y) = .5 | P(A and B) = .5*.5 = .25 | P(A | B) = P(A and B) / P(B) = .25/.50 = .50
Problem 1b
# Joiny Probability of X>x & Y>y
prob_joint <- sum(X > x_med & Y > y_med) / length(X)
# Marginal probabilities P(X > x) and P(Y > y)
prob_x <- sum(X > x_med) / length(X)
prob_y <- sum(Y > y_med) / length(Y)
# Product of marginal probabilities P(X > x) * P(Y > y)
marginals_prod <- prob_x * prob_y
# Create a matrix
prob_table <- matrix(c(prob_x, prob_joint, prob_y, marginals_prod), nrow = 2, byrow = TRUE)
# Labeling the cols and rows
colnames(prob_table) <- c("Marginal Probs", "Joint Probs")
rownames(prob_table) <- c("P(X > x)", "P(Y > y)")
print(prob_table)## Marginal Probs Joint Probs
## P(X > x) 0.5 0.2507
## P(Y > y) 0.5 0.2500
\[P(X>x & Y>y) = P(X > x)*P(Y>y)\]
Problem 1c
| Fisher’s Test checks to see if there is evidence to
suggest that two variables are independent from each other based on the
counts of the marginal and joint probability conditions. Fisher’s Test
is typically only used when more than 20% of cells have counts less than
5 and is more accurate on small samples. We can check the expected
counts with Chi-Square function from R. The Chi Square Test checks for
independence by comparing the observed frequencies versus the expected
frequencies under the assumption that the variables were independent.
The Chi Square test is more accurate than Fisher’s Test on large
samples. Both check their values in relation to their critical values
with their degrees of freedom.
Fisher’s Exact Test Formula:
\[p = \frac{(\text{row 1 total}!) \;
(\text{row 2 total}!) \; (\text{column 1}!) \; (\text{column 2}!)}
{(\text{total samples}!) \; (\text{cell a}!) \; (\text{cell b}!) \;
(\text{cell c}!) (\; \text{cell d}!)}\] \[p = \frac{{c_1 \choose a}{c_2 \choose b}}{{N
\choose r_1}}\]
Chi-Square Formula: \[\chi^2 = \sum \frac{(\text{Original Counts} - \text{Expected Counts})^2 }{\text{Expected Counts}}\] Expected Frequencies Formula for Chi Square: - (row total)(column total)/grand total - P(A)P(B)*total_count
values_in_X_greater_than_med <- sum(X > x_med)
values_in_Y_greater_than_med <- sum(Y > y_med)
values_in_X_greater_than_y_med <- sum(X > y_med)
values_in_Y_greater_than_x_med <- sum(Y > x_med)
cont_table <- data.frame("Values from X" = c(values_in_X_greater_than_med,values_in_X_greater_than_y_med),
"Values from Y" =c(values_in_Y_greater_than_x_med,values_in_Y_greater_than_med),
row.names = c("> X Median","> Y Median"))
cont_table## Values.from.X Values.from.Y
## > X Median 5000 5036
## > Y Median 4979 5000
##
## Fisher's Exact Test for Count Data
##
## data: cont_table
## p-value = 0.9212
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9429173 1.0542732
## sample estimates:
## odds ratio
## 0.997039
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont_table
## X-squared = 0.0082342, df = 1, p-value = 0.9277
In this situation, I believe the Chi-Square Test would be most appropriate due to the number of observations. I am not surprised by the results since these counts are the result of two different distributions, and unless the Uniform and Normal distributions are dependent on each other I don’t see how the values ever could be dependent.
Problem 2
| Provide univariate descriptive statistics and appropriate
plots for the training data set. Provide a scatterplot matrix for at
least two of the independent variables and the dependent variable.
Derive a correlation matrix for any three quantitative variables in the
dataset. Test the hypotheses that the correlations between each pairwise
set of variables is 0 and provide an 80% confidence interval. Discuss
the meaning of your analysis. Would you be worried about familywise
error? Why or why not?
## vars n mean sd median trimmed mad min
## id 1 74051 37025.00 21376.83 37025.00 37025.00 27447.37 0.00
## Sex* 2 74051 2.06 0.82 2.00 2.07 1.48 1.00
## Length 3 74051 1.32 0.29 1.38 1.35 0.28 0.19
## Diameter 4 74051 1.02 0.24 1.07 1.05 0.22 0.14
## Height 5 74051 0.35 0.09 0.36 0.35 0.09 0.00
## Weight 6 74051 23.39 12.65 23.80 23.05 13.98 0.06
## Shucked Weight 7 74051 10.10 5.62 9.91 9.92 6.18 0.03
## Viscera Weight 8 74051 5.06 2.79 4.99 4.97 3.07 0.04
## Shell Weight 9 74051 6.72 3.58 6.93 6.62 3.80 0.04
## Age 10 74051 9.97 3.18 10.00 9.69 2.97 1.00
## max range skew kurtosis se
## id 74050.00 74050.00 0.00 -1.20 78.56
## Sex* 3.00 2.00 -0.10 -1.51 0.00
## Length 2.01 1.83 -0.84 0.29 0.00
## Diameter 1.61 1.48 -0.81 0.18 0.00
## Height 2.83 2.83 0.09 14.15 0.00
## Weight 80.10 80.04 0.23 -0.40 0.05
## Shucked Weight 42.18 42.16 0.35 -0.12 0.02
## Viscera Weight 21.55 21.50 0.29 -0.37 0.01
## Shell Weight 28.49 28.45 0.28 -0.14 0.01
## Age 29.00 28.00 1.09 2.30 0.01
#trimmed refers to if any parts of the data were removed, none were removed here
#mad: median absolute deviation
colnames(train) <- c('id','sex','length','diameter','height','weight','shucked_weight','visc_weight','shell_weight','age')
colnames(submission_data) <- c('id','sex','length','diameter','height','weight','shucked_weight','visc_weight','shell_weight','age')
# Convert Sex column to factor
train <- train %>%
mutate(
sex = as.factor(sex)
)
# Convert Sex column to factor
submission_data <- submission_data %>%
mutate(
sex = as.factor(sex)
)# Next code blocks create a function to create a boxplot for each column to
# then pass a list of colnames to the function
# Create boxplot for each column
plot_columns <- function(column_name) {
boxplot <- ggplot(train, aes_string(x = column_name)) +
geom_boxplot() +
labs(title = paste("Boxplot of", column_name))
return(list(boxplot))
}
# Loop through each column ignoring the first since its ID
for (col in names(train)[-1]) {
plots <- plot_columns(col)
print(plots[[1]]) # Print boxplot
} plot_histogram <- function(column_name) {
hists <- ggplot(train, aes_string(x = column_name)) +
geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
labs(title = paste("Histogram of", column_name))
return(list(hists))
}
for (col in names(train)[-2]) {
hists <- plot_histogram(col)
print(hists[[1]]) # Print histogram
}## [[1]]
Scatterplot
corr_shell <- cor(train$shell_weight, train$age)
ggplot(train, aes(x = shell_weight, y = age)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Scatterplot of Shell Weight vs Age") +
annotate("text", x = max(train$shell_weight), y = 25,
label = paste("Correlation =", round(corr_shell, 3)),
hjust = 1, vjust = 1, size = 4, color = "blue")corr_diam <- cor(train$diameter, train$age)
ggplot(train, aes(x = diameter, y = age)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Scatterplot of Diameter vs Age") +
annotate("text", x = 2, y = 15,
label = paste("Correlation =", round(corr_diam, 3)),
hjust = 1, vjust = 1, size = 4, color = "blue")Correlation Matrix
Testing Correlation Hypothesis
len_diam <- cor.test(train$length, train$diameter, method = "pearson", conf.level = .80)
len_height <- cor.test(train$length, train$height, method = "pearson", conf.level = .80)
len_weight <- cor.test(train$length, train$weight, method = "pearson", conf.level = .80)
diam_height <- cor.test(train$diameter, train$height, method = "pearson", conf.level = .80)
diam_weight <- cor.test(train$diameter, train$weight, method = "pearson", conf.level = .80)
height_weight <- cor.test(train$height, train$weight, method = "pearson", conf.level = .80)
print(paste(len_diam$data.name," Correlation :",len_diam$estimate,"len_diam P-Val:", len_diam$p.value))## [1] "train$length and train$diameter Correlation : 0.989437354560577 len_diam P-Val: 0"
print(paste(len_height$data.name," Correlation :",len_height$estimate,"P-Val:", len_height$p.value))## [1] "train$length and train$height Correlation : 0.918351679384104 P-Val: 0"
print(paste(len_weight$data.name," Correlation :",len_weight$estimate,"P-Val:", len_weight$p.value))## [1] "train$length and train$weight Correlation : 0.936373792522644 P-Val: 0"
print(paste(diam_height$data.name," Correlation :",diam_height$estimate,"P-Val:", diam_height$p.value))## [1] "train$diameter and train$height Correlation : 0.921353147724645 P-Val: 0"
print(paste(diam_weight$data.name," Correlation :",diam_weight$estimate,"P-Val:", diam_weight$p.value))## [1] "train$diameter and train$weight Correlation : 0.938248621945664 P-Val: 0"
print(paste(height_weight$data.name," Correlation :",height_weight$estimate,"P-Val:", height_weight$p.value))## [1] "train$height and train$weight Correlation : 0.901774819491695 P-Val: 0"
pvals <- c(len_diam$p.value,len_height$p.value,len_weight$p.value,diam_height$p.value,diam_weight$p.value)# Adjusting p-values using Bonferroni-Holm correction
adjusted_pvals <- p.adjust(pvals, method = "holm")
adjusted_pvals## [1] 0 0 0 0 0
alpha_fwe <- 1-((1-.20)^6)
adj_fwe <- 1-((1-.05)^6)
print(paste("Family Wise Error Rate at alpha = .8 & 6 Tests:", alpha_fwe))## [1] "Family Wise Error Rate at alpha = .8 & 6 Tests: 0.737856"
## [1] "Family Wise Error Rate at alpha = .95 & 6 Tests: 0.264908109375"
| When running the correlation tests at a .05 significance
level, the P Values are still 0, suggesting that the correlations are
just that strong. It seems there isn’t a need to worry about the family
wise error rate in this situation.
len_diam <- cor.test(train$length, train$diameter, method = "pearson", conf.level = .95)
height_weight <- cor.test(train$length, train$height, method = "pearson", conf.level = .95)
len_weight <- cor.test(train$length, train$weight, method = "pearson", conf.level = .95)
diam_height <- cor.test(train$diameter, train$height, method = "pearson", conf.level = .95)
diam_weight <- cor.test(train$diameter, train$weight, method = "pearson", conf.level = .95)
height_weight <- cor.test(train$height, train$weight, method = "pearson", conf.level = .95)
print(paste(len_diam$data.name," Correlation :",len_diam$estimate,"len_diam P-Val:", len_diam$p.value))## [1] "train$length and train$diameter Correlation : 0.989437354560577 len_diam P-Val: 0"
print(paste(len_height$data.name," Correlation :",len_height$estimate,"P-Val:", len_height$p.value))## [1] "train$length and train$height Correlation : 0.918351679384104 P-Val: 0"
print(paste(len_weight$data.name," Correlation :",len_weight$estimate,"P-Val:", len_weight$p.value))## [1] "train$length and train$weight Correlation : 0.936373792522644 P-Val: 0"
print(paste(diam_height$data.name," Correlation :",diam_height$estimate,"P-Val:", diam_height$p.value))## [1] "train$diameter and train$height Correlation : 0.921353147724645 P-Val: 0"
print(paste(diam_weight$data.name," Correlation :",diam_weight$estimate,"P-Val:", diam_weight$p.value))## [1] "train$diameter and train$weight Correlation : 0.938248621945664 P-Val: 0"
print(paste(height_weight$data.name," Correlation :",height_weight$estimate,"P-Val:", height_weight$p.value))## [1] "train$height and train$weight Correlation : 0.901774819491695 P-Val: 0"
## Precision Matrix and LDU
## length diameter height weight
## length 49.149116 -44.597762 -1.488898 -2.835504
## diameter -44.597762 51.657946 -3.157551 -3.860421
## height -1.488898 -3.157551 7.214902 -2.149484
## weight -2.835504 -3.860421 -2.149484 9.215477
## length diameter height weight
## length 1 0 0 0
## diameter 0 1 0 0
## height 0 0 1 0
## weight 0 0 0 1
## length diameter height weight
## length 1 0 0 0
## diameter 0 1 0 0
## height 0 0 1 0
## weight 0 0 0 1
## length diameter height weight
## length 1 0 0 0
## diameter 0 1 0 0
## height 0 0 1 0
## weight 0 0 0 1
## $L
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
##
## $U
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
## $L
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
##
## $U
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
Calculus-Based Probability & Statistics.
| Many times, it makes sense to fit a closed form distribution
to data. Select a variable in the Kaggle.com training dataset that is
skewed to the right, shift it so that the minimum value is absolutely
above zero if necessary. Then load the MASS package and run fitdistr to
fit an exponential probability density function.
| Find the
optimal value of \(\lambda\) for this
distribution, and then take 1000 samples from this exponential
distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare
it with a histogram of your original variable. Using the exponential
pdf, find the 5th and 95th percentiles using the cumulative distribution
function (CDF). Also generate a 95% confidence interval from the
empirical data, assuming normality. Finally, provide the empirical 5th
percentile and 95th percentile of the data. Discuss.
## shucked_weight
## 0.3494575
fit <- fitdistr(train$shucked_weight, densfun = "exponential") #find the expon. distr that works the best
print(fit$estimate) #print the rate estimate## rate
## 0.09896806
exp_dis <- rexp(1000, fit$estimate) #create example distribution
hist(exp_dis, breaks = "FD", freq = FALSE, main = "Histogram of Exp. Dist.")
hist(train$shucked_weight, breaks = "FD", freq = FALSE, col = rgb(1, 0, 0, alpha = 0.5), add = TRUE)
legend("topright", legend = c("Exp Distribution", "Shucked Weight"), fill = c("grey", "red"))qqplot(exp_dis, train$shucked_weight, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", main = "Q-Q Plot")
abline(0, 1, col = "red")“Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).”
#pexp() calculates the CDF at a requested value and a given lambda
exp_5_val <- .519
exp_5 <- pexp(.519, rate = fit$estimate) # 5%
exp_95_val <- 30.27
exp_95 <- pexp(30.27, rate = fit$estimate) # 95%
print(paste("CDF Percentile:", round(exp_5,3),"Variable Value:",exp_5_val))## [1] "CDF Percentile: 0.05 Variable Value: 0.519"
## [1] "CDF Percentile: 0.95 Variable Value: 30.27"
#Another way to get the quantiles from our specific distribtuion:
real_quants <- quantile(exp_dis, c(0.05, 0.95))
print("Values from the 1000 observation distribution created earlier")## [1] "Values from the 1000 observation distribution created earlier"
## 5% 95%
## 0.5415927 29.1781950
“Also generate a 95% confidence interval from the empirical
data, assuming normality. Finally, provide the empirical 5th percentile
and 95th percentile of the data.”
| It is unclear what the
confidence interval is meant to be of or if it was a typo and the
intention is to find the values of a normal distribution that are at 5%
and 95% percentiles. I will do both and assume it is the mean the 95%
confidence interval is for.
CI <- t.test(train$shucked_weight)$conf.int #95 CI of Mean of shucked_weight
print(paste("95% CI of the Mean: [", round(CI[1],3),",",round(CI[2],3),"]"))## [1] "95% CI of the Mean: [ 10.064 , 10.145 ]"
print(paste("Value at 5th Percentile:",round(qnorm(0.05, mean = mean(train$shucked_weight), sd = sd(train$shucked_weight)),3)))## [1] "Value at 5th Percentile: 0.863"
print(paste("value at 95th Percentile:",round(qnorm(0.95, mean = mean(train$shucked_weight), sd = sd(train$shucked_weight)),3)))## [1] "value at 95th Percentile: 19.345"
| The exponential distribution did not match the
distribution of the most skewed data the training set has to offer
(shucked_weight), but the value of the exercise is that when
distributions match they can be very good ways of estimating important
points in the data, such as the 5th and 95th percentiles. It becomes
even more valuable when attempting to generalize to a population when
the distribution of values is known. If the exponential distribution was
a good fit, one could attempt to generalize those important points to
the population in hopes of creating more accurate predictions. To
attempt a more accurate skewed distribution, one could only take values
of the mean and higher and scale the data such that the mean was equal
to zero, but that seemed too drastic in comparison to moving the minimum
to zero. In this case the minimum of ‘shucked_weight’ was .028 so it
seemed like a negligible gain.
## [1] "Minimum of shucked_weight: 0.0283495"
hist(exp_dis, breaks = "FD", freq = FALSE, main = "Histogram of Exp. Dist.")
hist((train$shucked_weight[train$shucked_weight > mean(train$shucked_weight)])-mean(train$shucked_weight), breaks = "FD", freq = FALSE, col = rgb(1, 0, 0, alpha = 0.5), add = TRUE)
legend("topright", legend = c("Exp Distribution", "Shucked Weight"), fill = c("grey", "red"))Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
# Separating Index Column off the dataframe
train_sub <- train[-1]
# Basic model
model <- lm(age ~ ., data = train_sub)
#Playing with step
updated_model <- step(model, direction = "backward")## Start: AIC=111865.1
## age ~ sex + length + diameter + height + weight + shucked_weight +
## visc_weight + shell_weight
##
## Df Sum of Sq RSS AIC
## <none> 335336 111865
## - length 1 102 335438 111886
## - diameter 1 363 335700 111943
## - visc_weight 1 1504 336840 112195
## - height 1 4232 339568 112792
## - weight 1 5826 341163 113139
## - sex 2 8658 343994 113749
## - shell_weight 1 12318 347654 114535
## - shucked_weight 1 38838 374174 119978
#Testing out removing some variables
train_sub_2 <- train[-1, -6,-8,-9]
# Setting train and test datasets
set.seed(123)
index <- createDataPartition(train$age, p = 0.7, list = FALSE)
train_data <- train_sub[index, ] # 70% for training
test_data <- train_sub[-index, ]
model <- train(age ~ ., data = train_data, method = "lm")
preds <- predict(model, newdata = test_data)
metrics <- caret::postResample(preds, test_data$age)
print(metrics)## RMSE Rsquared MAE
## 2.1314662 0.5545252 1.4831273
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2290 -1.2248 -0.3331 0.7537 19.0409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.790101 0.090896 41.697 < 2e-16 ***
## sexI -1.043613 0.030219 -34.535 < 2e-16 ***
## sexM -0.099047 0.022891 -4.327 1.52e-05 ***
## length 0.740504 0.231687 3.196 0.00139 **
## diameter 2.244536 0.287056 7.819 5.42e-15 ***
## height 7.466731 0.289088 25.829 < 2e-16 ***
## weight 0.183947 0.006502 28.292 < 2e-16 ***
## shucked_weight -0.604673 0.007955 -76.008 < 2e-16 ***
## visc_weight -0.205292 0.014238 -14.419 < 2e-16 ***
## shell_weight 0.524065 0.011795 44.431 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.127 on 51829 degrees of freedom
## Multiple R-squared: 0.5492, Adjusted R-squared: 0.5491
## F-statistic: 7015 on 9 and 51829 DF, p-value: < 2.2e-16
train_sub_2 <- train[-1, -6,-8,-9]
set.seed(123)
index <- createDataPartition(train_sub_2$age, p = 0.7, list = FALSE)
train_data_2 <- train_sub_2[index, ] # 70% for training
test_data_2 <- train_sub_2[-index, ]
# Assuming 'df' is your dataframe
model_2 <- train(age ~ ., data = train_data_2, method = "lm")
preds_2 <- predict(model_2, newdata = test_data_2)
metrics <- caret::postResample(preds_2, test_data_2$age)
print(metrics)## RMSE Rsquared MAE
## 2.1538186 0.5453647 1.4992540
matrix_train <- as.matrix(train_data[-9])
matrix_train_age <- as.matrix(train_data[9])
# Perform cross-validation to select the best lambda
cv_model <- cv.glmnet(matrix_train, matrix_train_age , alpha = 0.5) # Cross-validation for Elastic Net
# Plot the cross-validation results (optional)
plot(cv_model)# Find the best lambda value
best_lambda <- cv_model$lambda.min # You can also use lambda.1se for a more parsimonious model
# Refit the model with the best lambda
final_model <- glmnet(matrix_train, matrix_train_age, alpha = 0.5, lambda = best_lambda)
# Make predictions
matrix_test_data <- as.matrix(test_data[-9])
matrix_test_data_age <- as.matrix(test_data[9])
predictions <- predict(final_model, newx = matrix_test_data)
metrics <- caret::postResample(predictions, matrix_test_data_age)
print(metrics)## RMSE Rsquared MAE
## 2.158333 0.543224 1.507126
## model parameter label forReg forClass probModel
## 1 xgbLinear nrounds # Boosting Iterations TRUE TRUE TRUE
## 2 xgbLinear lambda L2 Regularization TRUE TRUE TRUE
## 3 xgbLinear alpha L1 Regularization TRUE TRUE TRUE
## 4 xgbLinear eta Learning Rate TRUE TRUE TRUE
## $xgbLinear
## $xgbLinear$label
## [1] "eXtreme Gradient Boosting"
##
## $xgbLinear$library
## [1] "xgboost"
##
## $xgbLinear$type
## [1] "Regression" "Classification"
##
## $xgbLinear$parameters
## parameter class label
## 1 nrounds numeric # Boosting Iterations
## 2 lambda numeric L2 Regularization
## 3 alpha numeric L1 Regularization
## 4 eta numeric Learning Rate
##
## $xgbLinear$grid
## function (x, y, len = NULL, search = "grid")
## {
## if (search == "grid") {
## out <- expand.grid(lambda = c(0, 10^seq(-1, -4, length = len -
## 1)), alpha = c(0, 10^seq(-1, -4, length = len - 1)),
## nrounds = floor((1:len) * 50), eta = 0.3)
## }
## else {
## out <- data.frame(lambda = 10^runif(len, min = -5, 0),
## alpha = 10^runif(len, min = -5, 0), nrounds = sample(1:100,
## size = len, replace = TRUE), eta = runif(len,
## max = 3))
## }
## out
## }
##
## $xgbLinear$loop
## NULL
##
## $xgbLinear$fit
## function (x, y, wts, param, lev, last, classProbs, ...)
## {
## if (!inherits(x, "xgb.DMatrix"))
## x <- as.matrix(x)
## if (is.factor(y)) {
## if (length(lev) == 2) {
## y <- ifelse(y == lev[1], 1, 0)
## if (!inherits(x, "xgb.DMatrix"))
## x <- xgboost::xgb.DMatrix(x, label = y, missing = NA)
## else xgboost::setinfo(x, "label", y)
## if (!is.null(wts))
## xgboost::setinfo(x, "weight", wts)
## out <- xgboost::xgb.train(list(lambda = param$lambda,
## alpha = param$alpha), data = x, nrounds = param$nrounds,
## objective = "binary:logistic", ...)
## }
## else {
## y <- as.numeric(y) - 1
## if (!inherits(x, "xgb.DMatrix"))
## x <- xgboost::xgb.DMatrix(x, label = y, missing = NA)
## else xgboost::setinfo(x, "label", y)
## if (!is.null(wts))
## xgboost::setinfo(x, "weight", wts)
## out <- xgboost::xgb.train(list(lambda = param$lambda,
## alpha = param$alpha), data = x, num_class = length(lev),
## nrounds = param$nrounds, objective = "multi:softprob",
## ...)
## }
## }
## else {
## if (!inherits(x, "xgb.DMatrix"))
## x <- xgboost::xgb.DMatrix(x, label = y, missing = NA)
## else xgboost::setinfo(x, "label", y)
## if (!is.null(wts))
## xgboost::setinfo(x, "weight", wts)
## out <- xgboost::xgb.train(list(lambda = param$lambda,
## alpha = param$alpha), data = x, nrounds = param$nrounds,
## objective = "reg:squarederror", ...)
## }
## out
## }
##
## $xgbLinear$predict
## function (modelFit, newdata, submodels = NULL)
## {
## if (!inherits(newdata, "xgb.DMatrix")) {
## newdata <- as.matrix(newdata)
## newdata <- xgboost::xgb.DMatrix(data = newdata, missing = NA)
## }
## out <- predict(modelFit, newdata)
## if (modelFit$problemType == "Classification") {
## if (length(modelFit$obsLevels) == 2) {
## out <- ifelse(out >= 0.5, modelFit$obsLevels[1],
## modelFit$obsLevels[2])
## }
## else {
## out <- matrix(out, ncol = length(modelFit$obsLevels),
## byrow = TRUE)
## out <- modelFit$obsLevels[apply(out, 1, which.max)]
## }
## }
## out
## }
##
## $xgbLinear$prob
## function (modelFit, newdata, submodels = NULL)
## {
## if (!inherits(newdata, "xgb.DMatrix")) {
## newdata <- as.matrix(newdata)
## newdata <- xgboost::xgb.DMatrix(data = newdata, missing = NA)
## }
## out <- predict(modelFit, newdata)
## if (length(modelFit$obsLevels) == 2) {
## out <- cbind(out, 1 - out)
## colnames(out) <- modelFit$obsLevels
## }
## else {
## out <- matrix(out, ncol = length(modelFit$obsLevels),
## byrow = TRUE)
## colnames(out) <- modelFit$obsLevels
## }
## as.data.frame(out, stringsAsFactors = TRUE)
## }
##
## $xgbLinear$predictors
## function (x, ...)
## {
## imp <- xgboost::xgb.importance(x$xNames, model = x)
## x$xNames[x$xNames %in% imp$Feature]
## }
##
## $xgbLinear$varImp
## function (object, numTrees = NULL, ...)
## {
## imp <- xgboost::xgb.importance(object$xNames, model = object)
## imp <- as.data.frame(imp, stringsAsFactors = TRUE)[, 1:2]
## rownames(imp) <- as.character(imp[, 1])
## imp <- imp[, 2, drop = FALSE]
## colnames(imp) <- "Overall"
## missing <- object$xNames[!(object$xNames %in% rownames(imp))]
## missing_imp <- data.frame(Overall = rep(0, times = length(missing)))
## rownames(missing_imp) <- missing
## imp <- rbind(imp, missing_imp)
## imp
## }
##
## $xgbLinear$levels
## function (x)
## x$obsLevels
##
## $xgbLinear$tags
## [1] "Linear Classifier Models" "Linear Regression Models"
## [3] "L1 Regularization Models" "L2 Regularization Models"
## [5] "Boosting" "Ensemble Model"
## [7] "Implicit Feature Selection"
##
## $xgbLinear$sort
## function (x)
## {
## x[order(x$nrounds, x$alpha, x$lambda), ]
## }
ctrl <- trainControl(method = "cv", number =10)
xgbGrid <- expand.grid(nrounds = c(1, 10),
lambda = c(.01, .1, .3 ,.5, .7),
alpha = c(.01, .1, .3, .5, .7),
eta = c(.3, .5, .7))
xgbLinear <- train(age ~ ., data = train_sub,
method = "xgbLinear",
trControl = ctrl,
verbose = FALSE,
tuneGrid = xgbGrid,
metric = "MAE")## nrounds lambda alpha eta
## 145 10 0.7 0.5 0.3
xgbPreds <- predict(xgbLinear, test_data[-9])
metrics <- caret::postResample(xgbPreds, test_data$age)
print(metrics)## RMSE Rsquared MAE
## 2.0312610 0.6050321 1.3578110
# all + weight:length notable .606
# all + weight:diameter .607 1.355
xgbLim1 <- train(age ~ sex+length+diameter+height+visc_weight+shell_weight + weight:diameter+ shucked_weight:shucked_weight, data = train_sub,
method = "xgbLinear",
trControl = ctrl,
verbose = FALSE,
## Only a single model can be passed to the
## function when no resampling is used:
tuneGrid = xgbGrid,
metric = "MAE")
xgbPreds_Lim1 <- predict(xgbLim1, test_data[-9])
metrics <- caret::postResample(xgbPreds_Lim1, test_data$age)
print(metrics)## [1] "id" "sex" "length" "diameter"
## [5] "height" "weight" "shucked_weight" "visc_weight"
## [9] "shell_weight"