In this project, we will simulate genomic data and milk yield from dairy cows and build a machine learning model to predict milk yield based on genomic markers. The analysis will include feature selection, regression modeling, and visualization of prediction accuracy.
We will simulate Single Nucleotide Polymorphism (SNP) data and create a target variable representing milk yield
# Load necessary libraries
library(tidyverse)
library(caret) # For model building
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘caret’
The following object is masked from ‘package:survival’:
cluster
The following object is masked from ‘package:purrr’:
lift
library(randomForest) # For Random Forest models
# Set random seed for reproducibility
set.seed(123)
# Simulate genomic SNP data (100 SNPs, 500 cows)
n_cows <- 500 # Number of cows
n_snps <- 100 # Number of SNP markers
# Generate random SNP data (0, 1, 2 representing the number of minor alleles)
snp_data <- as.data.frame(matrix(sample(0:2, n_cows * n_snps, replace = TRUE), nrow = n_cows, ncol = n_snps))
colnames(snp_data) <- paste0("SNP_", 1:n_snps)
# Simulate milk yield based on a few key SNPs (introducing some SNPs with high influence on milk yield)
milk_yield <- 30 + 1.5 * snp_data$SNP_1 - 2.0 * snp_data$SNP_5 + 1.2 * snp_data$SNP_10 + rnorm(n_cows, mean = 0, sd = 2)
# Combine SNP data with milk yield
data <- cbind(snp_data, milk_yield)
head(data)
SNP_1 <int> | SNP_2 <int> | SNP_3 <int> | SNP_4 <int> | SNP_5 <int> | SNP_6 <int> | SNP_7 <int> | SNP_8 <int> | SNP_9 <int> | ||
---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | |
2 | 2 | 1 | 2 | 2 | 1 | 1 | 2 | 0 | 2 | |
3 | 2 | 2 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | |
4 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 2 | 2 | |
5 | 2 | 2 | 2 | 2 | 0 | 0 | 1 | 2 | 0 | |
6 | 1 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 1 |
NA
Before building models, we will perform feature selection to identify the most important SNPs that influence milk yield.
# Perform feature selection using caret package
set.seed(123)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(milk_yield ~ ., data = data, method = "lm", trControl = control)
# Get the importance of each SNP
importance <- varImp(model)
print(importance)
lm variable importance
only 20 most important variables shown (out of 100)
Overall <dbl> | ||||
---|---|---|---|---|
SNP_5 | 100.00000 | |||
SNP_1 | 77.38699 | |||
SNP_10 | 56.98812 | |||
SNP_99 | 15.30143 | |||
SNP_61 | 14.04266 | |||
SNP_20 | 13.92271 | |||
SNP_69 | 13.78804 | |||
SNP_74 | 13.77585 | |||
SNP_24 | 12.93515 | |||
SNP_4 | 12.19906 |
# Visualize important SNPs
plot(importance, top = 10, main = "Top 10 Important SNPs for Milk Yield Prediction")
We will build Linear Regression and Random Forest models to predict milk yield based on the SNP data.
# Linear regression model
lm_model <- lm(milk_yield ~ ., data = data)
summary(lm_model)
Call:
lm(formula = milk_yield ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-4.7307 -1.0865 -0.0072 1.1021 4.7633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.904806 1.208254 23.923 < 2e-16 ***
SNP_1 1.580666 0.118277 13.364 < 2e-16 ***
SNP_2 -0.113100 0.117284 -0.964 0.33546
SNP_3 -0.177384 0.117204 -1.513 0.13095
SNP_4 -0.250731 0.118293 -2.120 0.03466 *
SNP_5 -1.963461 0.113727 -17.265 < 2e-16 ***
SNP_6 -0.085199 0.120635 -0.706 0.48044
SNP_7 -0.112746 0.125120 -0.901 0.36808
SNP_8 0.251053 0.122293 2.053 0.04073 *
SNP_9 0.112428 0.116837 0.962 0.33650
SNP_10 1.223993 0.124321 9.845 < 2e-16 ***
SNP_11 -0.075918 0.116228 -0.653 0.51401
SNP_12 0.127867 0.116537 1.097 0.27321
SNP_13 0.019332 0.115133 0.168 0.86674
SNP_14 -0.070012 0.116008 -0.604 0.54651
SNP_15 -0.088050 0.119216 -0.739 0.46060
SNP_16 0.132314 0.115949 1.141 0.25450
SNP_17 -0.001758 0.114733 -0.015 0.98779
SNP_18 0.008537 0.121931 0.070 0.94422
SNP_19 0.113803 0.116633 0.976 0.32979
SNP_20 0.283229 0.117187 2.417 0.01610 *
SNP_21 -0.087775 0.120415 -0.729 0.46647
SNP_22 -0.122367 0.122277 -1.001 0.31756
SNP_23 0.097498 0.113839 0.856 0.39226
SNP_24 0.269754 0.120075 2.247 0.02521 *
SNP_25 0.029439 0.119639 0.246 0.80576
SNP_26 0.013128 0.115212 0.114 0.90934
SNP_27 -0.005180 0.123382 -0.042 0.96653
SNP_28 0.243947 0.118256 2.063 0.03977 *
SNP_29 -0.125609 0.119134 -1.054 0.29236
SNP_30 -0.237695 0.118152 -2.012 0.04492 *
SNP_31 0.102362 0.115260 0.888 0.37502
SNP_32 0.073090 0.119611 0.611 0.54151
SNP_33 0.083392 0.121349 0.687 0.49235
SNP_34 0.087353 0.118912 0.735 0.46301
SNP_35 -0.166785 0.116241 -1.435 0.15212
SNP_36 0.054863 0.114546 0.479 0.63223
SNP_37 0.128753 0.120935 1.065 0.28768
SNP_38 -0.119380 0.117109 -1.019 0.30864
SNP_39 0.095005 0.117292 0.810 0.41843
SNP_40 0.093207 0.122308 0.762 0.44647
SNP_41 -0.187800 0.117325 -1.601 0.11024
SNP_42 0.082752 0.113905 0.727 0.46796
SNP_43 0.222577 0.121258 1.836 0.06717 .
SNP_44 -0.068667 0.117790 -0.583 0.56025
SNP_45 -0.050694 0.119392 -0.425 0.67135
SNP_46 -0.088444 0.117793 -0.751 0.45319
SNP_47 0.238045 0.122711 1.940 0.05310 .
SNP_48 -0.196666 0.119744 -1.642 0.10130
SNP_49 -0.034879 0.121694 -0.287 0.77456
SNP_50 -0.147701 0.116042 -1.273 0.20382
SNP_51 -0.054824 0.118300 -0.463 0.64331
SNP_52 -0.159518 0.118391 -1.347 0.17862
SNP_53 -0.037992 0.119906 -0.317 0.75153
SNP_54 0.207060 0.116304 1.780 0.07578 .
SNP_55 0.216290 0.117407 1.842 0.06619 .
SNP_56 0.013221 0.114416 0.116 0.90807
SNP_57 0.188615 0.119471 1.579 0.11518
SNP_58 -0.246363 0.117766 -2.092 0.03707 *
SNP_59 0.201013 0.120099 1.674 0.09497 .
SNP_60 0.143955 0.114815 1.254 0.21065
SNP_61 0.286249 0.117431 2.438 0.01522 *
SNP_62 0.085058 0.118213 0.720 0.47224
SNP_63 0.247981 0.118664 2.090 0.03727 *
SNP_64 0.136256 0.119588 1.139 0.25523
SNP_65 -0.148861 0.121627 -1.224 0.22171
SNP_66 0.146374 0.115708 1.265 0.20660
SNP_67 -0.167923 0.115903 -1.449 0.14817
SNP_68 -0.127682 0.114659 -1.114 0.26613
SNP_69 0.276993 0.115719 2.394 0.01714 *
SNP_70 0.108795 0.112436 0.968 0.33382
SNP_71 -0.031897 0.120808 -0.264 0.79189
SNP_72 0.172916 0.117874 1.467 0.14318
SNP_73 -0.080621 0.119910 -0.672 0.50176
SNP_74 0.301033 0.125873 2.392 0.01724 *
SNP_75 -0.033367 0.117666 -0.284 0.77689
SNP_76 -0.225923 0.122888 -1.838 0.06674 .
SNP_77 -0.140040 0.119204 -1.175 0.24078
SNP_78 0.035662 0.119605 0.298 0.76573
SNP_79 0.007060 0.117309 0.060 0.95204
SNP_80 0.027632 0.115725 0.239 0.81141
SNP_81 0.023246 0.122188 0.190 0.84921
SNP_82 -0.082775 0.123756 -0.669 0.50398
SNP_83 -0.164546 0.121783 -1.351 0.17742
SNP_84 0.051815 0.120121 0.431 0.66644
SNP_85 -0.107796 0.116589 -0.925 0.35574
SNP_86 -0.048208 0.114926 -0.419 0.67510
SNP_87 -0.038587 0.114953 -0.336 0.73729
SNP_88 -0.068601 0.117390 -0.584 0.55929
SNP_89 -0.107721 0.124072 -0.868 0.38580
SNP_90 0.082622 0.118527 0.697 0.48616
SNP_91 -0.064020 0.116695 -0.549 0.58358
SNP_92 -0.119136 0.117202 -1.017 0.31001
SNP_93 -0.007217 0.117809 -0.061 0.95118
SNP_94 0.028621 0.117727 0.243 0.80804
SNP_95 -0.053211 0.116132 -0.458 0.64706
SNP_96 -0.002861 0.114367 -0.025 0.98006
SNP_97 -0.038005 0.114042 -0.333 0.73912
SNP_98 0.237946 0.119362 1.993 0.04689 *
SNP_99 -0.311685 0.117408 -2.655 0.00826 **
SNP_100 0.104263 0.123258 0.846 0.39812
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.933 on 399 degrees of freedom
Multiple R-squared: 0.6677, Adjusted R-squared: 0.5845
F-statistic: 8.019 on 100 and 399 DF, p-value: < 2.2e-16
# Predict using linear regression
lm_predictions <- predict(lm_model, data)
# Evaluate model performance
lm_rmse <- sqrt(mean((lm_predictions - data$milk_yield)^2))
cat("Linear Regression RMSE:", lm_rmse)
Linear Regression RMSE: 1.727098
# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(milk_yield ~ ., data = data, ntree = 100)
# Predict using Random Forest
rf_predictions <- predict(rf_model, data)
# Evaluate Random Forest performance
rf_rmse <- sqrt(mean((rf_predictions - data$milk_yield)^2))
cat("Random Forest RMSE:", rf_rmse)
Random Forest RMSE: 0.9130019
# Visualize variable importance from Random Forest
varImpPlot(rf_model, main = "Variable Importance - Random Forest")
We will compare the performance of the Linear Regression and Random Forest models using Root Mean Squared Error (RMSE) and visualize the predictions.
# Compare actual vs predicted milk yield (Random Forest)
ggplot(data, aes(x = milk_yield, y = rf_predictions)) +
geom_point(alpha = 0.6, color = "blue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(title = "Actual vs Predicted Milk Yield (Random Forest)", x = "Actual Milk Yield", y = "Predicted Milk Yield") +
theme_minimal()
# Visualize distribution of prediction errors
ggplot(data, aes(x = rf_predictions - milk_yield)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
labs(title = "Prediction Error Distribution (Random Forest)", x = "Prediction Error", y = "Count") +
theme_minimal()
To incorporate BLUE (Best Linear Unbiased Estimator) and REML (Restricted Maximum Likelihood) into the above milk yield prediction project, we can apply linear mixed models that account for both fixed and random effects. These models are commonly used in genetics and animal breeding to model both genetic and environmental influences on traits like milk yield.
In this context, BLUE applies to the estimation of fixed effects (like treatment, nutrition, or specific SNP markers) and REML helps in estimating variance components for random effects (like cow-to-cow variability or genomic data).
In this project, we explored the use of genomic data (SNP markers) to predict milk yield in dairy cows using two machine learning approaches: Linear Regression and Random Forest.
The Random Forest model significantly outperformed the Linear Regression model in predicting milk yield. This suggests that Random Forest was able to capture complex, non-linear interactions between the genomic markers (SNPs) and milk yield, while the Linear Regression model struggled with these complexities.
Feature Importance: Several SNPs (such as SNP_1, SNP_5, and SNP_10) were identified as highly influential in predicting milk yield. This finding aligns with the notion that genetic variations play a critical role in milk production traits.
Predictive Power: The lower RMSE of 0.913 from the Random Forest model indicates a highly accurate prediction of milk yield, making it a reliable tool for selecting high-performing dairy cows based on genomic data.
Practical Application: The ability to predict milk yield using genomic data can have profound implications in precision breeding. By identifying cows with optimal genetic markers for milk production, farmers and breeders can make more informed decisions to enhance productivity and sustainability.
The project demonstrates that Random Forest is a more effective model than Linear Regression for predicting milk yield based on genomic data due to its ability to handle non-linear relationships and interactions between SNPs. This approach provides a valuable framework for precision breeding, enabling farmers to optimize milk production through data-driven genetic selection.
This methodology can be applied by companies involved in livestock health and productivity enhancement, showcasing the potential of machine learning in improving dairy production outcomes.