We set out to determine which of the provided variables were significantly correlated with–and thus could be used to accurately predict–our outcome variables of either plasma levels of Retinol (Retinol.Plasma) or plasma levels of Beta Carotene (Beta.Plasma). For the purposes of this project, we chose to focus only on Retinol.Plasma.
We had some initial difficulties fitting a model around Retinol.Plasma, and after running the Power Transformation function, it was revealed to us that taking the natural logarithm of the variable would be the best transformation. After doing so, the model fit much better!
Using our newly transformed Log.Retinol.Plasma, we utilized the Mass Step AIC function to find the best prediction model. The result: Log.Retinol.Plasma can be predicted using the variables Age, Smoking.Status, Calories, Fat, Fiber, and Alcohol. Then to transform the predicted result back into meaningful terms, we exponentiate.
Removing one random observation, we were able to test our model and prove that, in fact, the real Retinol.Plasma value did fall within our model’s prediction interval!
The resulting formula we found for predicting Retinol.Plasma:
\(\widehat{Retinol.Plasma}\) = \(e^{6.086 + 0.006X_{Age} + 0.080X_{Smoking.StatusFormer} - 0.010X_{Smoking.StatusCurrent Smoker} + 0.0002X_{Calories} - 0.003X_{Fat} - 0.009X_{Fiber} + 0.011X_{Alcohol} \pm 0.320}\)
Retinol is among the widely studied compounds in various populations, for both human plasma(or serum) concentrations and dietary intake.
Observational studies have suggested that low dietary intake or low plasma concentrations of retinol, beta-carotene, or other carotenoids might be associated with increased risk of developing certain types of cancer. However, relatively few studies have investigated how drinking and smoking habits, dietary intake, sex, and age influence plasma concentrations of retinol.
This cross-sectional study aims at studying relationships between personal characteristics and dietary factors, and plasma concentrations of retinol.
Reference: These data have not been published yet but a related reference is Nierenberg DW, Stukel TA, Baron JA, Dain BJ, Greenberg ER. Determinants of plasma levels of beta-carotene and retinol. American Journal of Epidemiology 1989;130:511-521.
Determinants of Plasma Retinol and Beta-Carotene Levels
Variable Names in order from left to right:
- Age: Age (years)
- Sex: Sex (1=Male, 2=Female)
- Smoking.Status: Smoking status (1=Never, 2=Former, 3=Current Smoker)
- Quetelet: Quetelet (weight/\(height^{2}\))
- Vitamin.Use: Vitamin Use (1=Yes, fairly often, 2=Yes, not often, 3=No)
- Calories: Number of calories consumed per day
- Fet: Grams of fat consumed per day
- Fiber: Grams of fiber consumed per day
- Alcohol: Number of alcoholic drinks consumed per week
- Cholesterol: Cholesterol consumed (mg per day)
- Beta.Diet: Dietary beta-carotene consumed (mcg per day)
- Retinol.Diet: Dietary retinol consumed (mcg per day)
- Beta.Plasma: Plasma beta-carotene (ng/ml)
- Retinol.Plasma: Plasma Retinol (ng/ml)
# turn sex as a factor
plasma$Sex <- factor(plasma$Sex, levels=c(1,2), labels=c("Male", "Female"))
# turn Smoking Status into a Factor
plasma$Smoking.Status <- factor(plasma$Smoking.Status, levels=c(1,2,3), labels=c("Never", "Former", "Current Smoker"))
# turn Vitamin Use into a Factor
plasma$Vitamin.Use <- factor(plasma$Vitamin.Use, levels=c(3,2,1), labels=c("No", "Yes, not often", "Yes, fairly often"))
| Age | Sex | Smoking.Status | Quetelet | Vitamin.Use | Calories | Fat | Fiber | Alcohol | Cholesterol | Beta.Diet | Retinol.Diet | Beta.Plasma | Retinol.Plasma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | Female | Former | 21.48380 | Yes, fairly often | 1298.8 | 57.0 | 6.3 | 0.0 | 170.3 | 1945 | 890 | 200 | 915 |
| 76 | Female | Never | 23.87631 | Yes, fairly often | 1032.5 | 50.1 | 15.8 | 0.0 | 75.8 | 2653 | 451 | 124 | 727 |
| 38 | Female | Former | 20.01080 | Yes, not often | 2372.3 | 83.6 | 19.1 | 14.1 | 257.9 | 6321 | 660 | 328 | 721 |
| 40 | Female | Former | 25.14062 | No | 2449.5 | 97.5 | 26.5 | 0.5 | 332.6 | 1061 | 864 | 153 | 615 |
| 72 | Female | Never | 20.98504 | Yes, fairly often | 1952.1 | 82.6 | 16.2 | 0.0 | 170.8 | 2863 | 1209 | 92 | 799 |
| 40 | Female | Former | 27.52136 | No | 1366.9 | 56.0 | 9.6 | 1.3 | 154.6 | 1729 | 1439 | 148 | 654 |
We randomly selected one entry from our dataset to apply to our final model in Part 6. This entry was selected using the set.seed(1). Once the observation was selected we stored the value as the selection object and removed it from the plasma data frame.
#set seed to ensure the same sample is repeatidly selected
set.seed(1)
#select 1 sample from the plasma data frame
selection <- plasma[sample(nrow(plasma), 1), ]
#Remove Selected Observation
plasma <- plasma[-c(84), ]
# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(selection, format = "markdown")
| Age | Sex | Smoking.Status | Quetelet | Vitamin.Use | Calories | Fat | Fiber | Alcohol | Cholesterol | Beta.Diet | Retinol.Diet | Beta.Plasma | Retinol.Plasma | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | 19 | Female | Never | 29.24145 | Yes, fairly often | 2558.9 | 116.1 | 12.3 | 0 | 324.5 | 1498 | 1066 | 327 | 693 |
In this section, an exploratory data analysis was carried out for all the variables registered in the study in order to detect any visual characteristics of the variables or potential correlations.The histogram for Age shows an apparent bimodality in distribution. The sex scatterplot suggest a correlation between female gender and Retinol plasma level. One significant observation was that there was a much higher count of females than males. This could affect the high correlation value due to gender, because the variable is biased toward more female entries. The histogram for Calories is skewed to the left but still normally distributed. The histogram for Alcohol reveals that the distribution is positively skewed and there are still several outliers.
# Creates Variable of all Numeric Columns in Data Set
NumCol <- c(1,4,6:14)
# Creates Variable of all Factor Columns in Data Set
FacCol <- c(2,3,5)
# Summary of Numerical DataSet
summaryPlasNum <- summary(plasma[,NumCol])
# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(summaryPlasNum, format = "markdown")
| Age | Quetelet | Calories | Fat | Fiber | Alcohol | Cholesterol | Beta.Diet | Retinol.Diet | Beta.Plasma | Retinol.Plasma | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :22.00 | Min. :16.33 | Min. : 445.2 | Min. : 14.40 | Min. : 3.100 | Min. : 0.00 | Min. : 37.7 | Min. : 214 | Min. : 30.0 | Min. : 0.0 | Min. : 179.0 | |
| 1st Qu.:39.00 | 1st Qu.:21.79 | 1st Qu.:1335.9 | 1st Qu.: 53.92 | 1st Qu.: 9.125 | 1st Qu.: 0.00 | 1st Qu.:154.9 | 1st Qu.:1115 | 1st Qu.: 479.5 | 1st Qu.: 89.5 | 1st Qu.: 466.0 | |
| Median :48.00 | Median :24.71 | Median :1665.0 | Median : 72.90 | Median :12.100 | Median : 0.30 | Median :206.2 | Median :1814 | Median : 707.0 | Median : 139.0 | Median : 565.0 | |
| Mean :50.25 | Mean :26.15 | Mean :1794.2 | Mean : 76.91 | Mean :12.790 | Mean : 3.29 | Mean :242.2 | Mean :2188 | Mean : 832.0 | Mean : 189.5 | Mean : 602.5 | |
| 3rd Qu.:62.75 | 3rd Qu.:28.75 | 3rd Qu.:2092.5 | 3rd Qu.: 95.17 | 3rd Qu.:15.600 | 3rd Qu.: 3.20 | 3rd Qu.:308.2 | 3rd Qu.:2850 | 3rd Qu.:1026.8 | 3rd Qu.: 228.0 | 3rd Qu.: 717.5 | |
| Max. :83.00 | Max. :50.40 | Max. :6662.2 | Max. :235.90 | Max. :36.800 | Max. :203.00 | Max. :900.7 | Max. :9642 | Max. :6901.0 | Max. :1415.0 | Max. :1727.0 |
# Summary of Factor DataSet
summaryPlasFac <- summary(plasma[,FacCol])
# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(summaryPlasFac, format = "markdown")
| Sex | Smoking.Status | Vitamin.Use | |
|---|---|---|---|
| Male : 42 | Never :156 | No :111 | |
| Female:272 | Former :115 | Yes, not often : 82 | |
| NA | Current Smoker: 43 | Yes, fairly often:121 |
# Pairs
pairs(plasma[,c(1:14)], pch=19, lower.panel=panel.smooth)
It is important to note that there is one outlier that was removed from the dataset before conducting analysis. This record indicated that an individual consumed, on average, 203 alcoholic beverages per week. This data point was greatly skewing our data; the next highest record was 35. Given that an adult may sleep 6-8 hours each day, this would indicate that an alcoholic beverage would be consumed every 30-45 minutes each day. This seems impractical.
# explaining remove of outliers
#first Alcohol Outlier # 62
ggplot(data = plasma, aes(x = Alcohol), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 20) +
labs(title = "Alcohol") +
scale_fill_gradient("Count", low = "Dark Blue", high = "Red")
summary(plasma$Alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.30 3.29 3.20 203.00
# remove Outliers
plasma <- plasma[-c(62),]
# 62 is alcohol of 203 drinks per week
# 257 is Beta Plasma of 0.0 (rounding error)
# Sex Characteristic BoxPlots
### -- Retinol Plasma by Sex
SexRetinolBox <- ggplot(data = plasma, aes(x = Sex, y = Retinol.Plasma)) +
stat_boxplot(geom='errorbar', width=0.5) +
geom_boxplot(outlier.size = 1, aes(fill=Sex)) +
coord_flip() +
stat_summary(fun.y = mean, color="yellow", geom="point", size=2, shape=18) +
labs(title = "Sex and Retinol Plasma")
# Smoking Characteristic BoxPlots
### -- Retinol Plasma by Sex
SmokeRetinolBox <- ggplot(data = plasma, aes(x = Smoking.Status, y = Retinol.Plasma)) +
stat_boxplot(geom='errorbar', width=0.5) +
geom_boxplot(outlier.size = 1, aes(fill=Smoking.Status)) +
coord_flip() +
stat_summary(fun.y = mean, color="yellow", geom="point", size=2, shape=18) +
labs(title = "Smoking Status and Retinol Plasma")
# Vitamin Characteristic BoxPlots
### -- Retinol Plasma by Vitamin Use
VitaminRetinolBox <- ggplot(data = plasma, aes(x = Vitamin.Use, y = Retinol.Plasma)) +
stat_boxplot(geom='errorbar', width=0.5) +
geom_boxplot(outlier.size = 1, aes(fill=Vitamin.Use)) +
coord_flip() +
stat_summary(fun.y = mean, color="yellow", geom="point", size=2, shape=18) +
labs(title = "Vitamin Use and Retinol Plasma")
#BoxPlots Graph Display
grid.arrange(SexRetinolBox, SmokeRetinolBox, VitaminRetinolBox, ncol=1,
top=textGrob("Boxplots",gp=gpar(fontface="bold"))) #Adding a main title
#Age Histogram
AgeHist <- ggplot(data = plasma, aes(x = Age), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Age") +
scale_fill_gradient("Count", low = "Orange", high = "Blue")
#Quetelet Histogram
QueteletHist <- ggplot(data = plasma, aes(x = Quetelet), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Quetelet") +
scale_fill_gradient("Count", low = "Orange", high = "Blue")
#Calories Histogram
CaloriesHist <- ggplot(data = plasma, aes(x = Calories), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Calories") +
scale_fill_gradient("Count", low = "Orange", high = "Blue")
#Fat Histogram
FatHist <- ggplot(data = plasma, aes(x = Fat), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Fat") +
scale_fill_gradient("Count", low = "Orange", high = "Blue")
#Fiber Histogram
FiberHist <- ggplot(data = plasma, aes(x = Fiber), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Fiber") +
scale_fill_gradient("Count", low = "Orange", high = "Blue")
#Alcohol Histogram
AlcoholHist <- ggplot(data = plasma, aes(x = Alcohol), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Alcohol") +
scale_fill_gradient("Count", low = "Orange", high = "Blue")
#Cholesterol Histogram
CholesterolHist <- ggplot(data = plasma, aes(x = Cholesterol), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Cholesterol") +
scale_fill_gradient("Count", low = "Orange", high = "Blue")
#Beta Diet Histogram
BetaDietHist <- ggplot(data = plasma, aes(x = Beta.Diet), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Beta Diet") +
scale_fill_gradient("Count", low = "Red", high = "Blue")
#Retinol Diet Histogram
RetinolDietHist <- ggplot(data = plasma, aes(x = Retinol.Diet), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Retinol Diet") +
scale_fill_gradient("Count", low = "Green", high = "Blue")
#Retinol Plasma Histogram
RetinolHist <- ggplot(data = plasma, aes(x = Retinol.Plasma), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Retinol Plasma") +
scale_fill_gradient("Count", low = "Green", high = "Blue")
#Beta Plasma Histogram
BetaHist <- ggplot(data = plasma, aes(x = Beta.Plasma), colour = barlines, fill = barfill) +
geom_histogram(aes(fill = ..count..), bins = 40) +
labs(title = "Beta Plasma") +
scale_fill_gradient("Count", low = "Red", high = "Blue")
#All Histograms Graph Display
grid.arrange(AgeHist, QueteletHist, CaloriesHist, FatHist, FiberHist, AlcoholHist, CholesterolHist, BetaDietHist, RetinolDietHist, BetaHist, RetinolHist, ncol=3,
top=textGrob("All Variable Histograms",gp=gpar(fontface="bold"))) #Adding a main title
# Age
## Age by Sex StackedBar
AgeStackSex <- ggplot(plasma, aes(Age, fill = Sex)) +
geom_bar() +
labs(title = "Age by Sex", x = "Ages", y = "Count")
## Age by Sex StackedBar Percentage
AgeStackSexP <- ggplot(plasma, aes(Age, fill = Sex)) +
geom_bar(position="fill")+
labs(title = "Age by Sex Percent", x = "Ages", y = "Percentage")
## Age by Smoking Status StackedBar
AgeStackSmoke <- ggplot(plasma, aes(Age, fill = Smoking.Status)) +
geom_bar()+
labs(title = "Age by Smoking Status", x = "Ages", y = "Count")
## Age by Smoking Status StackedBar Percentage
AgeStackSmokeP <- ggplot(plasma, aes(Age, fill = Smoking.Status)) +
geom_bar(position="fill")+
labs(title = "Age by Smoking Status Percent", x = "Ages", y = "Percentage")
## Age by Vitamin Use StackedBar
AgeStackVitamin <- ggplot(plasma, aes(Age, fill = Vitamin.Use)) +
geom_bar()+
labs(title = "Age by Vitamin Use", x = "Ages", y = "Count")
## Age by Vitamin Use StackedBar Percentage
AgeStackVitaminP <- ggplot(plasma, aes(Age, fill = Vitamin.Use)) +
geom_bar(position="fill")+
labs(title = "Age by Vitamin Use Percent", x = "Ages", y = "Percentage")
#Age Graph Display
grid.arrange(AgeStackSex, AgeStackSexP, AgeStackSmoke, AgeStackSmokeP, AgeStackVitamin, AgeStackVitaminP, ncol=2,
top=textGrob("Age",gp=gpar(fontface="bold"))) #Adding a main title
CrossTable(plasma$Sex, plasma$Smoking.Status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 313
##
##
## | plasma$Smoking.Status
## plasma$Sex | Never | Former | Current Smoker | Row Total |
## -------------|----------------|----------------|----------------|----------------|
## Male | 13 | 22 | 6 | 41 |
## | 2.705 | 3.194 | 0.045 | |
## | 0.317 | 0.537 | 0.146 | 0.131 |
## | 0.083 | 0.191 | 0.143 | |
## | 0.042 | 0.070 | 0.019 | |
## -------------|----------------|----------------|----------------|----------------|
## Female | 143 | 93 | 36 | 272 |
## | 0.408 | 0.481 | 0.007 | |
## | 0.526 | 0.342 | 0.132 | 0.869 |
## | 0.917 | 0.809 | 0.857 | |
## | 0.457 | 0.297 | 0.115 | |
## -------------|----------------|----------------|----------------|----------------|
## Column Total | 156 | 115 | 42 | 313 |
## | 0.498 | 0.367 | 0.134 | |
## -------------|----------------|----------------|----------------|----------------|
##
##
CrossTable(plasma$Sex, plasma$Vitamin.Use)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 313
##
##
## | plasma$Vitamin.Use
## plasma$Sex | No | Yes, not often | Yes, fairly often | Row Total |
## -------------|-------------------|-------------------|-------------------|-------------------|
## Male | 23 | 5 | 13 | 41 |
## | 5.122 | 3.069 | 0.512 | |
## | 0.561 | 0.122 | 0.317 | 0.131 |
## | 0.209 | 0.061 | 0.107 | |
## | 0.073 | 0.016 | 0.042 | |
## -------------|-------------------|-------------------|-------------------|-------------------|
## Female | 87 | 77 | 108 | 272 |
## | 0.772 | 0.463 | 0.077 | |
## | 0.320 | 0.283 | 0.397 | 0.869 |
## | 0.791 | 0.939 | 0.893 | |
## | 0.278 | 0.246 | 0.345 | |
## -------------|-------------------|-------------------|-------------------|-------------------|
## Column Total | 110 | 82 | 121 | 313 |
## | 0.351 | 0.262 | 0.387 | |
## -------------|-------------------|-------------------|-------------------|-------------------|
##
##
# A histogram of Ages by Sex and Smoking Status
ggplot(plasma, aes(Age, fill = Vitamin.Use)) +
geom_histogram(bins = 40,colour="white") +
facet_grid(Sex ~ Smoking.Status) +
labs(title = "Ages by Sex and Smoking Status", y = "Count of Observations", x = "Ages")
# A histogram of Quetelet by Sex and Smoking Status
ggplot(plasma, aes(Quetelet, fill = Vitamin.Use)) +
geom_histogram(bins = 40,colour="white") +
facet_grid(Sex ~ Smoking.Status) +
labs(title = "Quetelet by Sex and Smoking Status", y = "Count of Observations", x = "Quetelet")
# A histogram of Retinol.Plasma by Sex and Smoking Status
ggplot(plasma, aes(Retinol.Plasma, fill = Vitamin.Use)) +
geom_histogram(bins = 40, color="white") +
facet_grid(Sex ~ Smoking.Status) +
labs(title = "Retinol.Plasma by Sex and Smoking Status", y = "Count of Observations", x = "Retinol.Plasma")
#Sex ScactterPlot
SexScatter <- ggplot(plasma, aes(x = Age, y = Retinol.Plasma))+
geom_point() +
geom_smooth(aes(color = Sex, fill = Sex), method = "lm") +
labs(title = "Retinol Plasma vs Age by Sex")
#Smoking ScactterPlot
SmokeScatter <- ggplot(plasma, aes(x = Age, y = Retinol.Plasma))+
geom_point() +
geom_smooth(aes(color = Smoking.Status, fill = Smoking.Status), method = "lm") +
labs(title = "Retinol Plasma vs Age by Sex")
#Vitamin ScactterPlot
VitaminScatter <- ggplot(plasma, aes(x = Age, y = Retinol.Plasma))+
geom_point() +
geom_smooth(aes(color = Vitamin.Use, fill = Vitamin.Use), method = "lm") +
labs(title = "Retinol Plasma vs Age by Sex")
ggplot(plasma, aes(x = Age, y = Retinol.Plasma)) +
geom_point() +
scale_x_continuous("Age")+
scale_y_continuous("Retinol Plasma") +
theme_bw() + labs(title="Sex Scatterplot") + facet_wrap( ~ Sex) +
geom_smooth(method = "lm", aes(x = Age, y = Retinol.Plasma))
ggplot(plasma, aes(x = Age, y = Retinol.Plasma)) +
geom_point() +
scale_x_continuous("Age")+
scale_y_continuous("Retinol Plasma") +
theme_bw() + labs(title="Smoking Status Scatterplot") + facet_wrap( ~ Smoking.Status) +
geom_smooth(method = "lm", aes(x = Age, y = Retinol.Plasma))
ggplot(plasma, aes(x = Age, y = Retinol.Plasma)) +
geom_point() +
scale_x_continuous("Age")+
scale_y_continuous("Retinol Plasma") +
theme_bw() + labs(title="Vitamin Use Scatterplot") +
facet_wrap( ~ Vitamin.Use) +
geom_smooth(method = "lm", aes(x = Age, y = Retinol.Plasma))
#Sex Characteristic Graph Display
grid.arrange(SexScatter, SmokeScatter, VitaminScatter, ncol=2,
top=textGrob("Scatter Plots",gp=gpar(fontface="bold"))) #Adding a main title
# MosaicPlot
mosaicplot(table(plasma$Sex,plasma$Smoking.Status, plasma$Vitamin.Use),
color = c("blue", "green", "orange"),
main = "MosaicPlot: Sex by Smoking Status by Vitamin Use",
ylab = "Smoking Status",
xlab = "Sex",
shade = F)
# creates scatterplots of all numeric variables
pairs(plasma[,NumCol], pch=16, lower.panel=panel.smooth)
# Creates Correlation Matrix of all Numeric Variables
CorTable <- cor(plasma[,NumCol])
# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(round(CorTable,4), format = "markdown")
| Age | Quetelet | Calories | Fat | Fiber | Alcohol | Cholesterol | Beta.Diet | Retinol.Diet | Beta.Plasma | Retinol.Plasma | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.0000 | -0.0126 | -0.2126 | -0.1737 | 0.0455 | -0.0067 | -0.1208 | 0.0678 | -0.0099 | 0.1090 | 0.2218 |
| Quetelet | -0.0126 | 1.0000 | 0.0134 | 0.0514 | -0.0880 | -0.1208 | 0.1148 | -0.0051 | 0.0328 | -0.2318 | 0.0105 |
| Calories | -0.2126 | 0.0134 | 1.0000 | 0.8979 | 0.5176 | 0.2241 | 0.6599 | 0.2568 | 0.4176 | -0.0145 | -0.0480 |
| Fat | -0.1737 | 0.0514 | 0.8979 | 1.0000 | 0.2828 | 0.1339 | 0.7028 | 0.1431 | 0.4092 | -0.0915 | -0.0826 |
| Fiber | 0.0455 | -0.0880 | 0.5176 | 0.2828 | 1.0000 | -0.0143 | 0.1586 | 0.4833 | 0.2159 | 0.2361 | -0.0457 |
| Alcohol | -0.0067 | -0.1208 | 0.2241 | 0.1339 | -0.0143 | 1.0000 | 0.1043 | 0.0356 | -0.0037 | 0.0122 | 0.2210 |
| Cholesterol | -0.1208 | 0.1148 | 0.6599 | 0.7028 | 0.1586 | 0.1043 | 1.0000 | 0.1139 | 0.4410 | -0.1291 | -0.0601 |
| Beta.Diet | 0.0678 | -0.0051 | 0.2568 | 0.1431 | 0.4833 | 0.0356 | 0.1139 | 1.0000 | 0.0522 | 0.2271 | -0.0108 |
| Retinol.Diet | -0.0099 | 0.0328 | 0.4176 | 0.4092 | 0.2159 | -0.0037 | 0.4410 | 0.0522 | 1.0000 | -0.0457 | -0.0597 |
| Beta.Plasma | 0.1090 | -0.2318 | -0.0145 | -0.0915 | 0.2361 | 0.0122 | -0.1291 | 0.2271 | -0.0457 | 1.0000 | 0.0686 |
| Retinol.Plasma | 0.2218 | 0.0105 | -0.0480 | -0.0826 | -0.0457 | 0.2210 | -0.0601 | -0.0108 | -0.0597 | 0.0686 | 1.0000 |
# Visually displays the Correlation Matrix
## The narrower the width of the ellipse and darker the color the stronger the degree of correlation
## Blue indicates Positive Correlation
## Red Negative Correlation
corrplot(cor(plasma[,NumCol]), method="ellipse",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, number.cex = 0.7) #Text label color and rotation
The process of creating a model began with looking at our response variable and determining if it needed to be transformed for a more accurate fit. When creating a histogram for the Retinol.Plasma variable, the results looked highly skewed and not normally distributed. Therefore, we ran a power transformation test to determine what degree of transformation yields a more accurate model. Since the result was close to the value of 0, we decided to take the natural logarithm of the Retinol.Plasma variable, creating the potential for a more accurate model.
#Retinol Plasma Histogram
grid.arrange(RetinolHist)
# compare Lamda to find what transformation needs to happen
summary(powerTransform(plasma$Retinol.Plasma))
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## plasma$Retinol.Plasma 0.1691 0 -0.06 0.3983
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 2.101759 1 0.14713
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 49.68566 1 1.8046e-12
#lamba is 0.1717 - close to Zero - recommended log(Y)
# Create Log of Retinol Plasma to
plasma <- plasma %>%
mutate(Log.Retinol.Plasma = log(Retinol.Plasma)) %>%
select(-c(Beta.Plasma))
# Log of Retinol Plasma Histogram
ggplot(data = plasma, aes(x = Log.Retinol.Plasma)) +
geom_histogram(bins = 40) +
labs(title = "Log Retinol Plasma")
# Visually displays the Correlation Matrix
## The narrower the width of the ellipse and darker the color the stronger the degree of correlation
## Blue indicates Positive Correlation
## Red Negative Correlation
corrplot(cor(plasma[,NumCol]), method="ellipse",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, number.cex = 0.7) #Text label color and rotation
To create our first model, we manually observed the chart of correlations of the predicting variables with the natural log of the Retinol.Plasma variable. Choosing the variables with a higher correlation, we decided to use these as dependencies in building a linear model.
fit1 <- lm(Log.Retinol.Plasma~Age+Alcohol+Calories+Cholesterol+Fat+Fiber+Retinol.Diet, data = plasma)
summary(fit1)
##
## Call:
## lm(formula = Log.Retinol.Plasma ~ Age + Alcohol + Calories +
## Cholesterol + Fat + Fiber + Retinol.Diet, data = plasma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.15571 -0.18270 -0.00495 0.19678 1.02533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.095e+00 9.558e-02 63.772 < 2e-16 ***
## Age 5.793e-03 1.343e-03 4.312 2.19e-05 ***
## Alcohol 1.231e-02 4.025e-03 3.058 0.00243 **
## Calories 1.738e-04 9.647e-05 1.802 0.07258 .
## Cholesterol -1.188e-04 2.055e-04 -0.578 0.56373
## Fat -3.065e-03 1.522e-03 -2.013 0.04497 *
## Fiber -8.635e-03 4.970e-03 -1.737 0.08334 .
## Retinol.Diet -9.210e-06 3.543e-05 -0.260 0.79505
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3225 on 305 degrees of freedom
## Multiple R-squared: 0.1178, Adjusted R-squared: 0.09758
## F-statistic: 5.819 on 7 and 305 DF, p-value: 2.387e-06
After observing the results of our first model, not all of the variables we selected were significant. Therefore, to create our second linear model, we eliminated the insignificant variables and reran the model.
fit2 <- lm(Log.Retinol.Plasma~Age+Alcohol+Fat, data = plasma)
summary(fit2)
##
## Call:
## lm(formula = Log.Retinol.Plasma ~ Age + Alcohol + Fat, data = plasma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14145 -0.18135 -0.00519 0.19781 1.00840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1361059 0.0848867 72.286 < 2e-16 ***
## Age 0.0050038 0.0012792 3.912 0.000113 ***
## Alcohol 0.0151855 0.0037176 4.085 5.63e-05 ***
## Fat -0.0010621 0.0005583 -1.902 0.058044 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3224 on 309 degrees of freedom
## Multiple R-squared: 0.1067, Adjusted R-squared: 0.098
## F-statistic: 12.3 on 3 and 309 DF, p-value: 1.271e-07
Finally, for the third model, we modified our dataset and ran it through the stepAIC algorithm. Since our goal was to predict the natural logarithm of the Retinol.Plasma, we removed the column of simply Retinol.Plasma from being a potential predicting variable. In other words, Retinol.Plasma should not be used as a predicting variable for the natural log of Retinol.Plasma. We also removed the Beta.Plasma result from being a potential predicting variable, because it was intended to be a secondary outcome, not a determining factor of the Retinol.Plasma. After running the algorithm, we used the result to create our final model that would allow us to predict the Retinol.Plasma values. In the end, we needed to take e (euler’s constant) to the power of the resulting equation to get the accurate value for the Retinol.Plasma that we were trying to predict.
plasma2 <- plasma %>%
select(-c(Retinol.Plasma))
fit3 <- lm(Log.Retinol.Plasma~., data=plasma2)
summary(fit3)
##
## Call:
## lm(formula = Log.Retinol.Plasma ~ ., data = plasma2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06406 -0.17796 0.00256 0.20275 0.95608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.119e+00 1.610e-01 38.012 < 2e-16 ***
## Age 5.318e-03 1.445e-03 3.679 0.000278 ***
## SexFemale -6.392e-02 6.153e-02 -1.039 0.299700
## Smoking.StatusFormer 7.914e-02 4.115e-02 1.923 0.055400 .
## Smoking.StatusCurrent Smoker -7.482e-03 5.942e-02 -0.126 0.899889
## Quetelet 1.375e-03 3.152e-03 0.436 0.663065
## Vitamin.UseYes, not often 4.302e-02 4.857e-02 0.886 0.376416
## Vitamin.UseYes, fairly often 3.276e-02 4.466e-02 0.734 0.463829
## Calories 1.632e-04 9.793e-05 1.666 0.096733 .
## Fat -3.053e-03 1.545e-03 -1.976 0.049098 *
## Fiber -7.167e-03 5.418e-03 -1.323 0.186885
## Alcohol 1.097e-02 4.271e-03 2.569 0.010682 *
## Cholesterol -1.482e-04 2.107e-04 -0.703 0.482324
## Beta.Diet -1.078e-05 1.437e-05 -0.750 0.453775
## Retinol.Diet -7.413e-06 3.570e-05 -0.208 0.835628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3225 on 298 degrees of freedom
## Multiple R-squared: 0.138, Adjusted R-squared: 0.09755
## F-statistic: 3.409 on 14 and 298 DF, p-value: 3.866e-05
# helpful in identifing Model
MASS::stepAIC(fit3, direction = "backward")
## Start: AIC=-693.85
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Quetelet +
## Vitamin.Use + Calories + Fat + Fiber + Alcohol + Cholesterol +
## Beta.Diet + Retinol.Diet
##
## Df Sum of Sq RSS AIC
## - Vitamin.Use 2 0.09391 31.081 -696.91
## - Retinol.Diet 1 0.00448 30.992 -695.81
## - Quetelet 1 0.01978 31.007 -695.65
## - Cholesterol 1 0.05146 31.039 -695.33
## - Beta.Diet 1 0.05851 31.046 -695.26
## - Sex 1 0.11223 31.099 -694.72
## - Fiber 1 0.18198 31.169 -694.02
## <none> 30.987 -693.85
## - Smoking.Status 2 0.45071 31.438 -693.33
## - Calories 1 0.28867 31.276 -692.95
## - Fat 1 0.40594 31.393 -691.78
## - Alcohol 1 0.68633 31.673 -689.00
## - Age 1 1.40748 32.395 -681.95
##
## Step: AIC=-696.91
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Quetelet +
## Calories + Fat + Fiber + Alcohol + Cholesterol + Beta.Diet +
## Retinol.Diet
##
## Df Sum of Sq RSS AIC
## - Retinol.Diet 1 0.00559 31.087 -698.85
## - Quetelet 1 0.01557 31.097 -698.75
## - Cholesterol 1 0.04888 31.130 -698.41
## - Beta.Diet 1 0.05303 31.134 -698.37
## - Sex 1 0.08891 31.170 -698.01
## <none> 31.081 -696.91
## - Fiber 1 0.20155 31.283 -696.88
## - Smoking.Status 2 0.45016 31.531 -696.41
## - Calories 1 0.35197 31.433 -695.38
## - Fat 1 0.47577 31.557 -694.15
## - Alcohol 1 0.64089 31.722 -692.52
## - Age 1 1.41208 32.493 -685.00
##
## Step: AIC=-698.85
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Quetelet +
## Calories + Fat + Fiber + Alcohol + Cholesterol + Beta.Diet
##
## Df Sum of Sq RSS AIC
## - Quetelet 1 0.01584 31.102 -700.69
## - Beta.Diet 1 0.05040 31.137 -700.34
## - Cholesterol 1 0.06080 31.147 -700.24
## - Sex 1 0.09176 31.178 -699.93
## <none> 31.087 -698.85
## - Fiber 1 0.20624 31.293 -698.78
## - Smoking.Status 2 0.45059 31.537 -698.35
## - Calories 1 0.34733 31.434 -697.37
## - Fat 1 0.47696 31.564 -696.08
## - Alcohol 1 0.65155 31.738 -694.36
## - Age 1 1.40741 32.494 -686.99
##
## Step: AIC=-700.69
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Calories +
## Fat + Fiber + Alcohol + Cholesterol + Beta.Diet
##
## Df Sum of Sq RSS AIC
## - Beta.Diet 1 0.04819 31.151 -702.21
## - Cholesterol 1 0.05550 31.158 -702.13
## - Sex 1 0.09402 31.196 -701.75
## <none> 31.102 -700.69
## - Fiber 1 0.22357 31.326 -700.45
## - Smoking.Status 2 0.45121 31.554 -700.18
## - Calories 1 0.35138 31.454 -699.17
## - Fat 1 0.47999 31.582 -697.90
## - Alcohol 1 0.63610 31.739 -696.35
## - Age 1 1.40410 32.507 -688.87
##
## Step: AIC=-702.21
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Calories +
## Fat + Fiber + Alcohol + Cholesterol
##
## Df Sum of Sq RSS AIC
## - Cholesterol 1 0.06284 31.213 -703.58
## - Sex 1 0.10493 31.256 -703.15
## <none> 31.151 -702.21
## - Smoking.Status 2 0.43127 31.582 -701.90
## - Fiber 1 0.34575 31.496 -700.75
## - Calories 1 0.35323 31.504 -700.68
## - Fat 1 0.47767 31.628 -699.44
## - Alcohol 1 0.62004 31.771 -698.04
## - Age 1 1.37720 32.528 -690.67
##
## Step: AIC=-703.58
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Calories +
## Fat + Fiber + Alcohol
##
## Df Sum of Sq RSS AIC
## - Sex 1 0.08169 31.295 -704.76
## <none> 31.213 -703.58
## - Smoking.Status 2 0.44179 31.655 -703.18
## - Fiber 1 0.30795 31.521 -702.50
## - Calories 1 0.31186 31.525 -702.46
## - Fat 1 0.56820 31.782 -699.93
## - Alcohol 1 0.65917 31.873 -699.03
## - Age 1 1.37530 32.589 -692.08
##
## Step: AIC=-704.76
## Log.Retinol.Plasma ~ Age + Smoking.Status + Calories + Fat +
## Fiber + Alcohol
##
## Df Sum of Sq RSS AIC
## <none> 31.295 -704.76
## - Smoking.Status 2 0.47054 31.766 -704.09
## - Fiber 1 0.31626 31.611 -703.61
## - Calories 1 0.32013 31.615 -703.57
## - Fat 1 0.54568 31.841 -701.35
## - Alcohol 1 0.77476 32.070 -699.10
## - Age 1 1.78199 33.077 -689.42
##
## Call:
## lm(formula = Log.Retinol.Plasma ~ Age + Smoking.Status + Calories +
## Fat + Fiber + Alcohol, data = plasma2)
##
## Coefficients:
## (Intercept) Age
## 6.0856104 0.0055881
## Smoking.StatusFormer Smoking.StatusCurrent Smoker
## 0.0797737 -0.0097973
## Calories Fat
## 0.0001661 -0.0034261
## Fiber Alcohol
## -0.0086179 0.0111156
#Subset Regression
subsets <- regsubsets(Log.Retinol.Plasma~., data=plasma, really.big=T)
plot(subsets, scale="adjr2")
\(\widehat{Retinol.Plasma}\) = \(e^{6.086 + 0.006X_{Age} + 0.080X_{Smoking.StatusFormer} - 0.010X_{Smoking.StatusCurrent Smoker} + 0.0002X_{Calories} - 0.003X_{Fat} - 0.009X_{Fiber} + 0.011X_{Alcohol} \pm 0.320}\)
#Create Linear Regression Model
plasma.fit <- lm(Log.Retinol.Plasma ~ Age + Smoking.Status + Calories + Fat + Fiber + Alcohol, data = plasma)
summary(plasma.fit)
##
## Call:
## lm(formula = Log.Retinol.Plasma ~ Age + Smoking.Status + Calories +
## Fat + Fiber + Alcohol, data = plasma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12759 -0.18186 0.00772 0.21015 0.98086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.086e+00 9.804e-02 62.076 < 2e-16 ***
## Age 5.588e-03 1.341e-03 4.167 4.02e-05 ***
## Smoking.StatusFormer 7.977e-02 4.044e-02 1.973 0.04943 *
## Smoking.StatusCurrent Smoker -9.797e-03 5.774e-02 -0.170 0.86537
## Calories 1.661e-04 9.402e-05 1.766 0.07834 .
## Fat -3.426e-03 1.486e-03 -2.306 0.02177 *
## Fiber -8.618e-03 4.909e-03 -1.756 0.08015 .
## Alcohol 1.112e-02 4.045e-03 2.748 0.00636 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3203 on 305 degrees of freedom
## Multiple R-squared: 0.1295, Adjusted R-squared: 0.1095
## F-statistic: 6.48 on 7 and 305 DF, p-value: 3.928e-07
# anova Test
anova(plasma.fit)
## Analysis of Variance Table
##
## Response: Log.Retinol.Plasma
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 1.9074 1.90741 18.5894 2.191e-05 ***
## Smoking.Status 2 0.6892 0.34460 3.3584 0.036081 *
## Calories 1 0.0561 0.05615 0.5472 0.460033
## Fat 1 0.5199 0.51988 5.0667 0.025101 *
## Fiber 1 0.7071 0.70711 6.8914 0.009097 **
## Alcohol 1 0.7748 0.77476 7.5507 0.006356 **
## Residuals 305 31.2952 0.10261
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveragePlots(plasma.fit, pch=16)
plasma.fit.df <- augment(plasma.fit)
ggplot(plasma.fit.df, aes(x=.fitted, y=.resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
labs(x = "fitted values", y = "Residuals")
ncvTest(plasma.fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.9956292, Df = 1, p = 0.31837
qqPlot(plasma.fit,pch=16)
## [1] 80 291
shapiro.test(plasma.fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: plasma.fit$residuals
## W = 0.99042, p-value = 0.03866
#looking for outliers using Cook's D Plot
cutoff <- 4/(nrow(plasma)- length(plasma.fit$coefficients)-2)
plot(plasma.fit, which=4, cook.levels = cutoff)
abline(h=cutoff, lty=2, col="red")
# looking for Multicollinearity
MultiPlasma <- cor(plasma[NumCol])
# Variance Inflation Factor looking for Multicollinearity
vif(plasma.fit)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.148042 1 1.071467
## Smoking.Status 1.122996 2 1.029425
## Calories 10.422533 1 3.228395
## Fat 7.532301 1 2.744504
## Fiber 2.094368 1 1.447193
## Alcohol 1.221526 1 1.105227
# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(MultiPlasma, format = "markdown")
| Age | Quetelet | Calories | Fat | Fiber | Alcohol | Cholesterol | Beta.Diet | Retinol.Diet | Retinol.Plasma | Log.Retinol.Plasma | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.0000000 | -0.0125944 | -0.2126350 | -0.1737258 | 0.0455450 | -0.0066697 | -0.1207996 | 0.0677929 | -0.0099325 | 0.2217522 | 0.2303427 |
| Quetelet | -0.0125944 | 1.0000000 | 0.0134108 | 0.0513615 | -0.0879637 | -0.1208441 | 0.1148000 | -0.0051417 | 0.0328123 | 0.0104610 | -0.0095042 |
| Calories | -0.2126350 | 0.0134108 | 1.0000000 | 0.8978788 | 0.5176226 | 0.2240924 | 0.6598935 | 0.2567768 | 0.4176231 | -0.0479794 | -0.0723291 |
| Fat | -0.1737258 | 0.0513615 | 0.8978788 | 1.0000000 | 0.2828084 | 0.1338801 | 0.7028062 | 0.1431437 | 0.4091983 | -0.0825821 | -0.1122599 |
| Fiber | 0.0455450 | -0.0879637 | 0.5176226 | 0.2828084 | 1.0000000 | -0.0142741 | 0.1586174 | 0.4833462 | 0.2158992 | -0.0456814 | -0.0585116 |
| Alcohol | -0.0066697 | -0.1208441 | 0.2240924 | 0.1338801 | -0.0142741 | 1.0000000 | 0.1042547 | 0.0355940 | -0.0036564 | 0.2209611 | 0.2061992 |
| Cholesterol | -0.1207996 | 0.1148000 | 0.6598935 | 0.7028062 | 0.1586174 | 0.1042547 | 1.0000000 | 0.1139018 | 0.4410165 | -0.0600768 | -0.0876945 |
| Beta.Diet | 0.0677929 | -0.0051417 | 0.2567768 | 0.1431437 | 0.4833462 | 0.0355940 | 0.1139018 | 1.0000000 | 0.0521951 | -0.0108499 | -0.0393540 |
| Retinol.Diet | -0.0099325 | 0.0328123 | 0.4176231 | 0.4091983 | 0.2158992 | -0.0036564 | 0.4410165 | 0.0521951 | 1.0000000 | -0.0596948 | -0.0592769 |
| Retinol.Plasma | 0.2217522 | 0.0104610 | -0.0479794 | -0.0825821 | -0.0456814 | 0.2209611 | -0.0600768 | -0.0108499 | -0.0596948 | 1.0000000 | 0.9597215 |
| Log.Retinol.Plasma | 0.2303427 | -0.0095042 | -0.0723291 | -0.1122599 | -0.0585116 | 0.2061992 | -0.0876945 | -0.0393540 | -0.0592769 | 0.9597215 | 1.0000000 |
A simple linear regression model was used to describe the trends of the data and make predictions. The general equation for linear regression is as follows:
y = \(b_{0}\) + \(b_{1}\)\(x_{1}\) + \(b_{2}\)\(x_{2}\) + … + \(b_{i}\)\(x_{i}\) + … + \(b_{n}\)\(x_{n}\)
where there are n predicting variables used to estimate the response, y. \(b_{0}\) serves as the y-intercept. Each predictor variables, \(x_{i}\), has a weight coefficient, \(b_{i}\), indicating how significant its influence is on the response. Out of the 12 potential predictor variables, our final model used (?). (The natural log of Retinol.Plasma) was the response that was modeled.
Selected observation removed in Part 3 stored as object: ‘selection’. We used our model to predict the Retinol.Plasma and compare our prediction against the actual removed observation.
The Model we used transformed Retinol.Plasma using a natural log, so to transform the result of the prediction back into meaningful terms we had to exponentiate.
After we transformed the predicted value, we were able to prove that, in fact, the real Retinol.Plasma value did fall within our model’s prediction interval!
The resulting formula we found for predicting Retinol.Plasma:
Retinol.Plasma = \(e^{6.086 + 0.006X_{Age} + 0.080X_{Smoking.StatusFormer} - 0.010X_{Smoking.StatusCurrent Smoker} + 0.0002X_{Calories} - 0.003X_{Fat} - 0.009X_{Fiber} + 0.011X_{Alcohol} \pm 0.320}\)
# Removed Observation for testing
# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(selection, format = "markdown")
| Age | Sex | Smoking.Status | Quetelet | Vitamin.Use | Calories | Fat | Fiber | Alcohol | Cholesterol | Beta.Diet | Retinol.Diet | Beta.Plasma | Retinol.Plasma | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | 19 | Female | Never | 29.24145 | Yes, fairly often | 2558.9 | 116.1 | 12.3 | 0 | 324.5 | 1498 | 1066 | 327 | 693 |
predict(plasma.fit, newdata=selection, interval = "predict")
## fit lwr upr
## 84 6.11296 5.473549 6.752372
exp(predict(plasma.fit, newdata=selection, interval = "predict"))
## fit lwr upr
## 84 451.6737 238.3044 856.0866
There are several factors missing from the data that have a potential to negatively impact the accuracy of our model and analysis. The variation in plasma levels could be attributed to a variety of complex health conditions that is difficult to reduce to concrete measurements or classifications. There are a few terms that are not well-defined, such as what is considered “often vitamin use” or “one alcoholic drink”. Finally, it is unclear how some of the measurements were taken or recorded, leaving potential for inaccuracy or rounding error in some of the data points.
Insufficient variables to find any “smoking gun” that had a high level of influence on Retinol Plasma levels; Age the most influential predictor variable for our model only had a p-value of 4.02e-05.
As mentioned above one outlier was removed from our dataset prior to conducting analysis. Once this point was removed, our analysis was able to produce a more accurate fit.
Data appears to be self-reported by the participants, so we have to (optimistically) assume the patients were being truthful in their reporting. Calorie / Fat / Fiber content of their daily diets. how would patients be expected to accurately notate this?
A helpful variable could be hours of exercise each week, or some other quantification of activity level. Outside of vitamin use, Diet doesn’t have a significant impact on Retinol Plasma, but perhaps exercise would.