Part 1: Abstract

We set out to determine which of the provided variables were significantly correlated with–and thus could be used to accurately predict–our outcome variables of either plasma levels of Retinol (Retinol.Plasma) or plasma levels of Beta Carotene (Beta.Plasma). For the purposes of this project, we chose to focus only on Retinol.Plasma.

We had some initial difficulties fitting a model around Retinol.Plasma, and after running the Power Transformation function, it was revealed to us that taking the natural logarithm of the variable would be the best transformation. After doing so, the model fit much better!

Using our newly transformed Log.Retinol.Plasma, we utilized the Mass Step AIC function to find the best prediction model. The result: Log.Retinol.Plasma can be predicted using the variables Age, Smoking.Status, Calories, Fat, Fiber, and Alcohol. Then to transform the predicted result back into meaningful terms, we exponentiate.

Removing one random observation, we were able to test our model and prove that, in fact, the real Retinol.Plasma value did fall within our model’s prediction interval!

The resulting formula we found for predicting Retinol.Plasma:

\(\widehat{Retinol.Plasma}\) = \(e^{6.086 + 0.006X_{Age} + 0.080X_{Smoking.StatusFormer} - 0.010X_{Smoking.StatusCurrent Smoker} + 0.0002X_{Calories} - 0.003X_{Fat} - 0.009X_{Fiber} + 0.011X_{Alcohol} \pm 0.320}\)

 

Part 2: Introduction

Retinol is among the widely studied compounds in various populations, for both human plasma(or serum) concentrations and dietary intake.
Observational studies have suggested that low dietary intake or low plasma concentrations of retinol, beta-carotene, or other carotenoids might be associated with increased risk of developing certain types of cancer. However, relatively few studies have investigated how drinking and smoking habits, dietary intake, sex, and age influence plasma concentrations of retinol.

This cross-sectional study aims at studying relationships between personal characteristics and dietary factors, and plasma concentrations of retinol.

 

Part 3: Overview of Dataset

Reference: These data have not been published yet but a related reference is Nierenberg DW, Stukel TA, Baron JA, Dain BJ, Greenberg ER. Determinants of plasma levels of beta-carotene and retinol. American Journal of Epidemiology 1989;130:511-521.

Determinants of Plasma Retinol and Beta-Carotene Levels

Variable Names in order from left to right:
- Age: Age (years)
- Sex: Sex (1=Male, 2=Female)
- Smoking.Status: Smoking status (1=Never, 2=Former, 3=Current Smoker)
- Quetelet: Quetelet (weight/\(height^{2}\))
- Vitamin.Use: Vitamin Use (1=Yes, fairly often, 2=Yes, not often, 3=No)
- Calories: Number of calories consumed per day
- Fet: Grams of fat consumed per day
- Fiber: Grams of fiber consumed per day
- Alcohol: Number of alcoholic drinks consumed per week
- Cholesterol: Cholesterol consumed (mg per day)
- Beta.Diet: Dietary beta-carotene consumed (mcg per day)
- Retinol.Diet: Dietary retinol consumed (mcg per day)
- Beta.Plasma: Plasma beta-carotene (ng/ml)
- Retinol.Plasma: Plasma Retinol (ng/ml)

# turn sex as a factor
plasma$Sex <- factor(plasma$Sex, levels=c(1,2),  labels=c("Male", "Female"))

# turn Smoking Status into a Factor
plasma$Smoking.Status <- factor(plasma$Smoking.Status, levels=c(1,2,3),  labels=c("Never", "Former", "Current Smoker"))

# turn Vitamin Use into a Factor
plasma$Vitamin.Use <- factor(plasma$Vitamin.Use, levels=c(3,2,1),  labels=c("No", "Yes, not often", "Yes, fairly often"))

Plasma Table

Age Sex Smoking.Status Quetelet Vitamin.Use Calories Fat Fiber Alcohol Cholesterol Beta.Diet Retinol.Diet Beta.Plasma Retinol.Plasma
64 Female Former 21.48380 Yes, fairly often 1298.8 57.0 6.3 0.0 170.3 1945 890 200 915
76 Female Never 23.87631 Yes, fairly often 1032.5 50.1 15.8 0.0 75.8 2653 451 124 727
38 Female Former 20.01080 Yes, not often 2372.3 83.6 19.1 14.1 257.9 6321 660 328 721
40 Female Former 25.14062 No 2449.5 97.5 26.5 0.5 332.6 1061 864 153 615
72 Female Never 20.98504 Yes, fairly often 1952.1 82.6 16.2 0.0 170.8 2863 1209 92 799
40 Female Former 27.52136 No 1366.9 56.0 9.6 1.3 154.6 1729 1439 148 654

Remove Random Row Selection

We randomly selected one entry from our dataset to apply to our final model in Part 6. This entry was selected using the set.seed(1). Once the observation was selected we stored the value as the selection object and removed it from the plasma data frame.

#set seed to ensure the same sample is repeatidly selected
set.seed(1)

#select 1 sample from the plasma data frame
selection <- plasma[sample(nrow(plasma), 1), ]


#Remove Selected Observation
plasma <- plasma[-c(84), ]

# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(selection, format = "markdown")
Age Sex Smoking.Status Quetelet Vitamin.Use Calories Fat Fiber Alcohol Cholesterol Beta.Diet Retinol.Diet Beta.Plasma Retinol.Plasma
84 19 Female Never 29.24145 Yes, fairly often 2558.9 116.1 12.3 0 324.5 1498 1066 327 693

 

Part 4: Exploratory analysis

In this section, an exploratory data analysis was carried out for all the variables registered in the study in order to detect any visual characteristics of the variables or potential correlations.The histogram for Age shows an apparent bimodality in distribution. The sex scatterplot suggest a correlation between female gender and Retinol plasma level. One significant observation was that there was a much higher count of females than males. This could affect the high correlation value due to gender, because the variable is biased toward more female entries. The histogram for Calories is skewed to the left but still normally distributed. The histogram for Alcohol reveals that the distribution is positively skewed and there are still several outliers.

# Creates Variable of all Numeric Columns in Data Set
NumCol <- c(1,4,6:14)
# Creates Variable of all Factor Columns in Data Set
FacCol <- c(2,3,5)

# Summary of Numerical DataSet
summaryPlasNum <- summary(plasma[,NumCol])

# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(summaryPlasNum, format = "markdown")
Age Quetelet Calories Fat Fiber Alcohol Cholesterol Beta.Diet Retinol.Diet Beta.Plasma Retinol.Plasma
Min. :22.00 Min. :16.33 Min. : 445.2 Min. : 14.40 Min. : 3.100 Min. : 0.00 Min. : 37.7 Min. : 214 Min. : 30.0 Min. : 0.0 Min. : 179.0
1st Qu.:39.00 1st Qu.:21.79 1st Qu.:1335.9 1st Qu.: 53.92 1st Qu.: 9.125 1st Qu.: 0.00 1st Qu.:154.9 1st Qu.:1115 1st Qu.: 479.5 1st Qu.: 89.5 1st Qu.: 466.0
Median :48.00 Median :24.71 Median :1665.0 Median : 72.90 Median :12.100 Median : 0.30 Median :206.2 Median :1814 Median : 707.0 Median : 139.0 Median : 565.0
Mean :50.25 Mean :26.15 Mean :1794.2 Mean : 76.91 Mean :12.790 Mean : 3.29 Mean :242.2 Mean :2188 Mean : 832.0 Mean : 189.5 Mean : 602.5
3rd Qu.:62.75 3rd Qu.:28.75 3rd Qu.:2092.5 3rd Qu.: 95.17 3rd Qu.:15.600 3rd Qu.: 3.20 3rd Qu.:308.2 3rd Qu.:2850 3rd Qu.:1026.8 3rd Qu.: 228.0 3rd Qu.: 717.5
Max. :83.00 Max. :50.40 Max. :6662.2 Max. :235.90 Max. :36.800 Max. :203.00 Max. :900.7 Max. :9642 Max. :6901.0 Max. :1415.0 Max. :1727.0
# Summary of Factor DataSet
summaryPlasFac <- summary(plasma[,FacCol])

# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(summaryPlasFac, format = "markdown")
Sex Smoking.Status Vitamin.Use
Male : 42 Never :156 No :111
Female:272 Former :115 Yes, not often : 82
NA Current Smoker: 43 Yes, fairly often:121
# Pairs
pairs(plasma[,c(1:14)], pch=19, lower.panel=panel.smooth)

Remove Outliers

It is important to note that there is one outlier that was removed from the dataset before conducting analysis. This record indicated that an individual consumed, on average, 203 alcoholic beverages per week. This data point was greatly skewing our data; the next highest record was 35. Given that an adult may sleep 6-8 hours each day, this would indicate that an alcoholic beverage would be consumed every 30-45 minutes each day. This seems impractical.

# explaining remove of outliers

#first Alcohol Outlier # 62
ggplot(data = plasma, aes(x = Alcohol), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 20) +
  labs(title = "Alcohol") +
  scale_fill_gradient("Count", low = "Dark Blue", high = "Red")

summary(plasma$Alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.30    3.29    3.20  203.00
# remove Outliers

plasma <- plasma[-c(62),]
# 62 is alcohol of 203 drinks per week
# 257 is Beta Plasma of 0.0 (rounding error)

BoxPlot Graphs

# Sex Characteristic BoxPlots

### -- Retinol Plasma by Sex
SexRetinolBox <- ggplot(data = plasma, aes(x = Sex, y = Retinol.Plasma)) +
  stat_boxplot(geom='errorbar', width=0.5) +
  geom_boxplot(outlier.size = 1, aes(fill=Sex)) +
  coord_flip() +
  stat_summary(fun.y = mean, color="yellow", geom="point", size=2, shape=18) +
  labs(title = "Sex and Retinol Plasma")


# Smoking Characteristic BoxPlots

### -- Retinol Plasma by Sex
SmokeRetinolBox <- ggplot(data = plasma, aes(x = Smoking.Status, y = Retinol.Plasma)) +
  stat_boxplot(geom='errorbar', width=0.5) +
  geom_boxplot(outlier.size = 1, aes(fill=Smoking.Status)) +
  coord_flip() +
  stat_summary(fun.y = mean, color="yellow", geom="point", size=2, shape=18) +
  labs(title = "Smoking Status and Retinol Plasma")


# Vitamin Characteristic BoxPlots

### -- Retinol Plasma by Vitamin Use
VitaminRetinolBox <- ggplot(data = plasma, aes(x = Vitamin.Use, y = Retinol.Plasma)) +
  stat_boxplot(geom='errorbar', width=0.5) +
  geom_boxplot(outlier.size = 1, aes(fill=Vitamin.Use)) +
  coord_flip() +
  stat_summary(fun.y = mean, color="yellow", geom="point", size=2, shape=18) +
  labs(title = "Vitamin Use and Retinol Plasma")

#BoxPlots Graph Display
grid.arrange(SexRetinolBox, SmokeRetinolBox, VitaminRetinolBox, ncol=1, 
             top=textGrob("Boxplots",gp=gpar(fontface="bold"))) #Adding a main title

Histogram Graphs

#Age Histogram
AgeHist <- ggplot(data = plasma, aes(x = Age), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Age") +
  scale_fill_gradient("Count", low = "Orange", high = "Blue")

#Quetelet Histogram
QueteletHist <- ggplot(data = plasma, aes(x = Quetelet), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Quetelet") +
  scale_fill_gradient("Count", low = "Orange", high = "Blue")

#Calories Histogram
CaloriesHist <- ggplot(data = plasma, aes(x = Calories), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Calories") +
  scale_fill_gradient("Count", low = "Orange", high = "Blue")

#Fat Histogram
FatHist <- ggplot(data = plasma, aes(x = Fat), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Fat") +
  scale_fill_gradient("Count", low = "Orange", high = "Blue")

#Fiber Histogram
FiberHist <- ggplot(data = plasma, aes(x = Fiber), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Fiber") +
  scale_fill_gradient("Count", low = "Orange", high = "Blue")

#Alcohol Histogram
AlcoholHist <- ggplot(data = plasma, aes(x = Alcohol), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Alcohol") +
  scale_fill_gradient("Count", low = "Orange", high = "Blue")

#Cholesterol Histogram
CholesterolHist <- ggplot(data = plasma, aes(x = Cholesterol), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Cholesterol") +
  scale_fill_gradient("Count", low = "Orange", high = "Blue")

#Beta Diet Histogram
BetaDietHist <- ggplot(data = plasma, aes(x = Beta.Diet), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Beta Diet") +
  scale_fill_gradient("Count", low = "Red", high = "Blue")

#Retinol Diet Histogram
RetinolDietHist <- ggplot(data = plasma, aes(x = Retinol.Diet), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Retinol Diet") +
  scale_fill_gradient("Count", low = "Green", high = "Blue")

#Retinol Plasma Histogram
RetinolHist <- ggplot(data = plasma, aes(x = Retinol.Plasma), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Retinol Plasma") +
  scale_fill_gradient("Count", low = "Green", high = "Blue")

#Beta Plasma Histogram
BetaHist <- ggplot(data = plasma, aes(x = Beta.Plasma), colour = barlines, fill = barfill) +
  geom_histogram(aes(fill = ..count..), bins = 40) +
  labs(title = "Beta Plasma") +
  scale_fill_gradient("Count", low = "Red", high = "Blue")




#All Histograms Graph Display
grid.arrange(AgeHist, QueteletHist, CaloriesHist, FatHist, FiberHist, AlcoholHist, CholesterolHist, BetaDietHist, RetinolDietHist, BetaHist, RetinolHist, ncol=3, 
             top=textGrob("All Variable Histograms",gp=gpar(fontface="bold"))) #Adding a main title

StackedBar Graphs

# Age 

## Age by Sex StackedBar
AgeStackSex <- ggplot(plasma, aes(Age, fill = Sex)) + 
  geom_bar() + 
  labs(title = "Age by Sex", x = "Ages", y = "Count")

## Age by Sex StackedBar Percentage
AgeStackSexP <- ggplot(plasma, aes(Age, fill = Sex)) + 
  geom_bar(position="fill")+ 
  labs(title = "Age by Sex Percent", x = "Ages", y = "Percentage")

## Age by Smoking Status StackedBar
AgeStackSmoke <- ggplot(plasma, aes(Age, fill = Smoking.Status)) + 
  geom_bar()+ 
  labs(title = "Age by Smoking Status", x = "Ages", y = "Count")

## Age by Smoking Status StackedBar Percentage
AgeStackSmokeP <- ggplot(plasma, aes(Age, fill = Smoking.Status)) + 
  geom_bar(position="fill")+ 
  labs(title = "Age by Smoking Status Percent", x = "Ages", y = "Percentage")

## Age by Vitamin Use StackedBar
AgeStackVitamin <- ggplot(plasma, aes(Age, fill = Vitamin.Use)) + 
  geom_bar()+ 
  labs(title = "Age by Vitamin Use", x = "Ages", y = "Count")

## Age by Vitamin Use StackedBar Percentage
AgeStackVitaminP <- ggplot(plasma, aes(Age, fill = Vitamin.Use)) + 
  geom_bar(position="fill")+ 
  labs(title = "Age by Vitamin Use Percent", x = "Ages", y = "Percentage")

#Age Graph Display
grid.arrange(AgeStackSex, AgeStackSexP, AgeStackSmoke, AgeStackSmokeP, AgeStackVitamin, AgeStackVitaminP, ncol=2, 
             top=textGrob("Age",gp=gpar(fontface="bold"))) #Adding a main title

CrossTable(plasma$Sex, plasma$Smoking.Status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  313 
## 
##  
##              | plasma$Smoking.Status 
##   plasma$Sex |          Never |         Former | Current Smoker |      Row Total | 
## -------------|----------------|----------------|----------------|----------------|
##         Male |             13 |             22 |              6 |             41 | 
##              |          2.705 |          3.194 |          0.045 |                | 
##              |          0.317 |          0.537 |          0.146 |          0.131 | 
##              |          0.083 |          0.191 |          0.143 |                | 
##              |          0.042 |          0.070 |          0.019 |                | 
## -------------|----------------|----------------|----------------|----------------|
##       Female |            143 |             93 |             36 |            272 | 
##              |          0.408 |          0.481 |          0.007 |                | 
##              |          0.526 |          0.342 |          0.132 |          0.869 | 
##              |          0.917 |          0.809 |          0.857 |                | 
##              |          0.457 |          0.297 |          0.115 |                | 
## -------------|----------------|----------------|----------------|----------------|
## Column Total |            156 |            115 |             42 |            313 | 
##              |          0.498 |          0.367 |          0.134 |                | 
## -------------|----------------|----------------|----------------|----------------|
## 
## 
CrossTable(plasma$Sex, plasma$Vitamin.Use)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  313 
## 
##  
##              | plasma$Vitamin.Use 
##   plasma$Sex |                No |    Yes, not often | Yes, fairly often |         Row Total | 
## -------------|-------------------|-------------------|-------------------|-------------------|
##         Male |                23 |                 5 |                13 |                41 | 
##              |             5.122 |             3.069 |             0.512 |                   | 
##              |             0.561 |             0.122 |             0.317 |             0.131 | 
##              |             0.209 |             0.061 |             0.107 |                   | 
##              |             0.073 |             0.016 |             0.042 |                   | 
## -------------|-------------------|-------------------|-------------------|-------------------|
##       Female |                87 |                77 |               108 |               272 | 
##              |             0.772 |             0.463 |             0.077 |                   | 
##              |             0.320 |             0.283 |             0.397 |             0.869 | 
##              |             0.791 |             0.939 |             0.893 |                   | 
##              |             0.278 |             0.246 |             0.345 |                   | 
## -------------|-------------------|-------------------|-------------------|-------------------|
## Column Total |               110 |                82 |               121 |               313 | 
##              |             0.351 |             0.262 |             0.387 |                   | 
## -------------|-------------------|-------------------|-------------------|-------------------|
## 
## 
# A histogram of Ages by Sex and Smoking Status
ggplot(plasma, aes(Age, fill = Vitamin.Use)) + 
  geom_histogram(bins = 40,colour="white") +
  facet_grid(Sex ~ Smoking.Status) +
  labs(title = "Ages by Sex and Smoking Status", y = "Count of Observations", x = "Ages")

# A histogram of Quetelet by Sex and Smoking Status
ggplot(plasma, aes(Quetelet, fill = Vitamin.Use)) + 
  geom_histogram(bins = 40,colour="white") +
  facet_grid(Sex ~ Smoking.Status) +
  labs(title = "Quetelet by Sex and Smoking Status", y = "Count of Observations", x = "Quetelet")

# A histogram of Retinol.Plasma by Sex and Smoking Status
ggplot(plasma, aes(Retinol.Plasma, fill = Vitamin.Use)) + 
  geom_histogram(bins = 40, color="white") +
  facet_grid(Sex ~ Smoking.Status) +
  labs(title = "Retinol.Plasma by Sex and Smoking Status", y = "Count of Observations", x = "Retinol.Plasma")

ScatterPlot Graphs

#Sex ScactterPlot
SexScatter <- ggplot(plasma, aes(x = Age, y = Retinol.Plasma))+
  geom_point() +
  geom_smooth(aes(color = Sex, fill = Sex), method = "lm") +
  labs(title = "Retinol Plasma vs Age by Sex")

#Smoking ScactterPlot
SmokeScatter <- ggplot(plasma, aes(x = Age, y = Retinol.Plasma))+
  geom_point() +
  geom_smooth(aes(color = Smoking.Status, fill = Smoking.Status), method = "lm") +
  labs(title = "Retinol Plasma vs Age by Sex")

#Vitamin ScactterPlot
VitaminScatter <- ggplot(plasma, aes(x = Age, y = Retinol.Plasma))+
  geom_point() +
  geom_smooth(aes(color = Vitamin.Use, fill = Vitamin.Use), method = "lm") +
  labs(title = "Retinol Plasma vs Age by Sex")

ggplot(plasma, aes(x = Age, y = Retinol.Plasma)) + 
  geom_point() + 
  scale_x_continuous("Age")+
  scale_y_continuous("Retinol Plasma") + 
  theme_bw() + labs(title="Sex Scatterplot") + facet_wrap( ~ Sex) +
  geom_smooth(method = "lm", aes(x = Age, y = Retinol.Plasma))

ggplot(plasma, aes(x = Age, y = Retinol.Plasma)) + 
  geom_point() + 
  scale_x_continuous("Age")+
  scale_y_continuous("Retinol Plasma") + 
  theme_bw() + labs(title="Smoking Status Scatterplot") + facet_wrap( ~ Smoking.Status) +
  geom_smooth(method = "lm", aes(x = Age, y = Retinol.Plasma))

ggplot(plasma, aes(x = Age, y = Retinol.Plasma)) + 
  geom_point() + 
  scale_x_continuous("Age")+
  scale_y_continuous("Retinol Plasma") + 
  theme_bw() + labs(title="Vitamin Use Scatterplot") + 
  facet_wrap( ~ Vitamin.Use) +
  geom_smooth(method = "lm", aes(x = Age, y = Retinol.Plasma))

#Sex Characteristic Graph Display
grid.arrange(SexScatter, SmokeScatter, VitaminScatter, ncol=2, 
             top=textGrob("Scatter Plots",gp=gpar(fontface="bold"))) #Adding a main title

Mosaic Plot

# MosaicPlot
mosaicplot(table(plasma$Sex,plasma$Smoking.Status, plasma$Vitamin.Use), 
           color = c("blue", "green", "orange"), 
           main = "MosaicPlot: Sex by Smoking Status by Vitamin Use",
           ylab = "Smoking Status",
           xlab = "Sex",
           shade = F)

Correlation Plots

# creates scatterplots of all numeric variables
pairs(plasma[,NumCol], pch=16, lower.panel=panel.smooth)

# Creates Correlation Matrix of all Numeric Variables
CorTable <- cor(plasma[,NumCol])

# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(round(CorTable,4), format = "markdown")
Age Quetelet Calories Fat Fiber Alcohol Cholesterol Beta.Diet Retinol.Diet Beta.Plasma Retinol.Plasma
Age 1.0000 -0.0126 -0.2126 -0.1737 0.0455 -0.0067 -0.1208 0.0678 -0.0099 0.1090 0.2218
Quetelet -0.0126 1.0000 0.0134 0.0514 -0.0880 -0.1208 0.1148 -0.0051 0.0328 -0.2318 0.0105
Calories -0.2126 0.0134 1.0000 0.8979 0.5176 0.2241 0.6599 0.2568 0.4176 -0.0145 -0.0480
Fat -0.1737 0.0514 0.8979 1.0000 0.2828 0.1339 0.7028 0.1431 0.4092 -0.0915 -0.0826
Fiber 0.0455 -0.0880 0.5176 0.2828 1.0000 -0.0143 0.1586 0.4833 0.2159 0.2361 -0.0457
Alcohol -0.0067 -0.1208 0.2241 0.1339 -0.0143 1.0000 0.1043 0.0356 -0.0037 0.0122 0.2210
Cholesterol -0.1208 0.1148 0.6599 0.7028 0.1586 0.1043 1.0000 0.1139 0.4410 -0.1291 -0.0601
Beta.Diet 0.0678 -0.0051 0.2568 0.1431 0.4833 0.0356 0.1139 1.0000 0.0522 0.2271 -0.0108
Retinol.Diet -0.0099 0.0328 0.4176 0.4092 0.2159 -0.0037 0.4410 0.0522 1.0000 -0.0457 -0.0597
Beta.Plasma 0.1090 -0.2318 -0.0145 -0.0915 0.2361 0.0122 -0.1291 0.2271 -0.0457 1.0000 0.0686
Retinol.Plasma 0.2218 0.0105 -0.0480 -0.0826 -0.0457 0.2210 -0.0601 -0.0108 -0.0597 0.0686 1.0000
# Visually displays the Correlation Matrix
## The narrower the width of the ellipse and darker the color the stronger the degree of correlation
## Blue indicates Positive Correlation
## Red Negative Correlation
corrplot(cor(plasma[,NumCol]), method="ellipse", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, number.cex = 0.7) #Text label color and rotation

 

Part 5: Identification and evaluation of a suitable model

Transform Data

The process of creating a model began with looking at our response variable and determining if it needed to be transformed for a more accurate fit. When creating a histogram for the Retinol.Plasma variable, the results looked highly skewed and not normally distributed. Therefore, we ran a power transformation test to determine what degree of transformation yields a more accurate model. Since the result was close to the value of 0, we decided to take the natural logarithm of the Retinol.Plasma variable, creating the potential for a more accurate model.

#Retinol Plasma Histogram
grid.arrange(RetinolHist)

# compare Lamda to find what transformation needs to happen
summary(powerTransform(plasma$Retinol.Plasma))
## bcPower Transformation to Normality 
##                       Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## plasma$Retinol.Plasma    0.1691           0        -0.06       0.3983
## 
## Likelihood ratio test that transformation parameter is equal to 0
##  (log transformation)
##                            LRT df    pval
## LR test, lambda = (0) 2.101759  1 0.14713
## 
## Likelihood ratio test that no transformation is needed
##                            LRT df       pval
## LR test, lambda = (1) 49.68566  1 1.8046e-12
#lamba is 0.1717 - close to Zero -  recommended log(Y)


# Create Log of Retinol Plasma to 
plasma <- plasma %>% 
  mutate(Log.Retinol.Plasma = log(Retinol.Plasma)) %>%
  select(-c(Beta.Plasma))


# Log of Retinol Plasma Histogram
ggplot(data = plasma, aes(x = Log.Retinol.Plasma)) +
  geom_histogram(bins = 40) +
  labs(title = "Log Retinol Plasma")

# Visually displays the Correlation Matrix
## The narrower the width of the ellipse and darker the color the stronger the degree of correlation
## Blue indicates Positive Correlation
## Red Negative Correlation
corrplot(cor(plasma[,NumCol]), method="ellipse", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, number.cex = 0.7) #Text label color and rotation

Model 1

To create our first model, we manually observed the chart of correlations of the predicting variables with the natural log of the Retinol.Plasma variable. Choosing the variables with a higher correlation, we decided to use these as dependencies in building a linear model.

fit1 <- lm(Log.Retinol.Plasma~Age+Alcohol+Calories+Cholesterol+Fat+Fiber+Retinol.Diet, data = plasma)
summary(fit1)
## 
## Call:
## lm(formula = Log.Retinol.Plasma ~ Age + Alcohol + Calories + 
##     Cholesterol + Fat + Fiber + Retinol.Diet, data = plasma)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15571 -0.18270 -0.00495  0.19678  1.02533 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.095e+00  9.558e-02  63.772  < 2e-16 ***
## Age           5.793e-03  1.343e-03   4.312 2.19e-05 ***
## Alcohol       1.231e-02  4.025e-03   3.058  0.00243 ** 
## Calories      1.738e-04  9.647e-05   1.802  0.07258 .  
## Cholesterol  -1.188e-04  2.055e-04  -0.578  0.56373    
## Fat          -3.065e-03  1.522e-03  -2.013  0.04497 *  
## Fiber        -8.635e-03  4.970e-03  -1.737  0.08334 .  
## Retinol.Diet -9.210e-06  3.543e-05  -0.260  0.79505    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3225 on 305 degrees of freedom
## Multiple R-squared:  0.1178, Adjusted R-squared:  0.09758 
## F-statistic: 5.819 on 7 and 305 DF,  p-value: 2.387e-06

Model 2

After observing the results of our first model, not all of the variables we selected were significant. Therefore, to create our second linear model, we eliminated the insignificant variables and reran the model.

fit2 <- lm(Log.Retinol.Plasma~Age+Alcohol+Fat, data = plasma)
summary(fit2)
## 
## Call:
## lm(formula = Log.Retinol.Plasma ~ Age + Alcohol + Fat, data = plasma)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14145 -0.18135 -0.00519  0.19781  1.00840 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.1361059  0.0848867  72.286  < 2e-16 ***
## Age          0.0050038  0.0012792   3.912 0.000113 ***
## Alcohol      0.0151855  0.0037176   4.085 5.63e-05 ***
## Fat         -0.0010621  0.0005583  -1.902 0.058044 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3224 on 309 degrees of freedom
## Multiple R-squared:  0.1067, Adjusted R-squared:  0.098 
## F-statistic:  12.3 on 3 and 309 DF,  p-value: 1.271e-07

Model 3

Finally, for the third model, we modified our dataset and ran it through the stepAIC algorithm. Since our goal was to predict the natural logarithm of the Retinol.Plasma, we removed the column of simply Retinol.Plasma from being a potential predicting variable. In other words, Retinol.Plasma should not be used as a predicting variable for the natural log of Retinol.Plasma. We also removed the Beta.Plasma result from being a potential predicting variable, because it was intended to be a secondary outcome, not a determining factor of the Retinol.Plasma. After running the algorithm, we used the result to create our final model that would allow us to predict the Retinol.Plasma values. In the end, we needed to take e (euler’s constant) to the power of the resulting equation to get the accurate value for the Retinol.Plasma that we were trying to predict.

plasma2 <- plasma %>% 
  select(-c(Retinol.Plasma))

fit3 <- lm(Log.Retinol.Plasma~., data=plasma2)
summary(fit3)
## 
## Call:
## lm(formula = Log.Retinol.Plasma ~ ., data = plasma2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06406 -0.17796  0.00256  0.20275  0.95608 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   6.119e+00  1.610e-01  38.012  < 2e-16 ***
## Age                           5.318e-03  1.445e-03   3.679 0.000278 ***
## SexFemale                    -6.392e-02  6.153e-02  -1.039 0.299700    
## Smoking.StatusFormer          7.914e-02  4.115e-02   1.923 0.055400 .  
## Smoking.StatusCurrent Smoker -7.482e-03  5.942e-02  -0.126 0.899889    
## Quetelet                      1.375e-03  3.152e-03   0.436 0.663065    
## Vitamin.UseYes, not often     4.302e-02  4.857e-02   0.886 0.376416    
## Vitamin.UseYes, fairly often  3.276e-02  4.466e-02   0.734 0.463829    
## Calories                      1.632e-04  9.793e-05   1.666 0.096733 .  
## Fat                          -3.053e-03  1.545e-03  -1.976 0.049098 *  
## Fiber                        -7.167e-03  5.418e-03  -1.323 0.186885    
## Alcohol                       1.097e-02  4.271e-03   2.569 0.010682 *  
## Cholesterol                  -1.482e-04  2.107e-04  -0.703 0.482324    
## Beta.Diet                    -1.078e-05  1.437e-05  -0.750 0.453775    
## Retinol.Diet                 -7.413e-06  3.570e-05  -0.208 0.835628    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3225 on 298 degrees of freedom
## Multiple R-squared:  0.138,  Adjusted R-squared:  0.09755 
## F-statistic: 3.409 on 14 and 298 DF,  p-value: 3.866e-05
# helpful in identifing Model
MASS::stepAIC(fit3, direction = "backward")
## Start:  AIC=-693.85
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Quetelet + 
##     Vitamin.Use + Calories + Fat + Fiber + Alcohol + Cholesterol + 
##     Beta.Diet + Retinol.Diet
## 
##                  Df Sum of Sq    RSS     AIC
## - Vitamin.Use     2   0.09391 31.081 -696.91
## - Retinol.Diet    1   0.00448 30.992 -695.81
## - Quetelet        1   0.01978 31.007 -695.65
## - Cholesterol     1   0.05146 31.039 -695.33
## - Beta.Diet       1   0.05851 31.046 -695.26
## - Sex             1   0.11223 31.099 -694.72
## - Fiber           1   0.18198 31.169 -694.02
## <none>                        30.987 -693.85
## - Smoking.Status  2   0.45071 31.438 -693.33
## - Calories        1   0.28867 31.276 -692.95
## - Fat             1   0.40594 31.393 -691.78
## - Alcohol         1   0.68633 31.673 -689.00
## - Age             1   1.40748 32.395 -681.95
## 
## Step:  AIC=-696.91
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Quetelet + 
##     Calories + Fat + Fiber + Alcohol + Cholesterol + Beta.Diet + 
##     Retinol.Diet
## 
##                  Df Sum of Sq    RSS     AIC
## - Retinol.Diet    1   0.00559 31.087 -698.85
## - Quetelet        1   0.01557 31.097 -698.75
## - Cholesterol     1   0.04888 31.130 -698.41
## - Beta.Diet       1   0.05303 31.134 -698.37
## - Sex             1   0.08891 31.170 -698.01
## <none>                        31.081 -696.91
## - Fiber           1   0.20155 31.283 -696.88
## - Smoking.Status  2   0.45016 31.531 -696.41
## - Calories        1   0.35197 31.433 -695.38
## - Fat             1   0.47577 31.557 -694.15
## - Alcohol         1   0.64089 31.722 -692.52
## - Age             1   1.41208 32.493 -685.00
## 
## Step:  AIC=-698.85
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Quetelet + 
##     Calories + Fat + Fiber + Alcohol + Cholesterol + Beta.Diet
## 
##                  Df Sum of Sq    RSS     AIC
## - Quetelet        1   0.01584 31.102 -700.69
## - Beta.Diet       1   0.05040 31.137 -700.34
## - Cholesterol     1   0.06080 31.147 -700.24
## - Sex             1   0.09176 31.178 -699.93
## <none>                        31.087 -698.85
## - Fiber           1   0.20624 31.293 -698.78
## - Smoking.Status  2   0.45059 31.537 -698.35
## - Calories        1   0.34733 31.434 -697.37
## - Fat             1   0.47696 31.564 -696.08
## - Alcohol         1   0.65155 31.738 -694.36
## - Age             1   1.40741 32.494 -686.99
## 
## Step:  AIC=-700.69
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Calories + 
##     Fat + Fiber + Alcohol + Cholesterol + Beta.Diet
## 
##                  Df Sum of Sq    RSS     AIC
## - Beta.Diet       1   0.04819 31.151 -702.21
## - Cholesterol     1   0.05550 31.158 -702.13
## - Sex             1   0.09402 31.196 -701.75
## <none>                        31.102 -700.69
## - Fiber           1   0.22357 31.326 -700.45
## - Smoking.Status  2   0.45121 31.554 -700.18
## - Calories        1   0.35138 31.454 -699.17
## - Fat             1   0.47999 31.582 -697.90
## - Alcohol         1   0.63610 31.739 -696.35
## - Age             1   1.40410 32.507 -688.87
## 
## Step:  AIC=-702.21
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Calories + 
##     Fat + Fiber + Alcohol + Cholesterol
## 
##                  Df Sum of Sq    RSS     AIC
## - Cholesterol     1   0.06284 31.213 -703.58
## - Sex             1   0.10493 31.256 -703.15
## <none>                        31.151 -702.21
## - Smoking.Status  2   0.43127 31.582 -701.90
## - Fiber           1   0.34575 31.496 -700.75
## - Calories        1   0.35323 31.504 -700.68
## - Fat             1   0.47767 31.628 -699.44
## - Alcohol         1   0.62004 31.771 -698.04
## - Age             1   1.37720 32.528 -690.67
## 
## Step:  AIC=-703.58
## Log.Retinol.Plasma ~ Age + Sex + Smoking.Status + Calories + 
##     Fat + Fiber + Alcohol
## 
##                  Df Sum of Sq    RSS     AIC
## - Sex             1   0.08169 31.295 -704.76
## <none>                        31.213 -703.58
## - Smoking.Status  2   0.44179 31.655 -703.18
## - Fiber           1   0.30795 31.521 -702.50
## - Calories        1   0.31186 31.525 -702.46
## - Fat             1   0.56820 31.782 -699.93
## - Alcohol         1   0.65917 31.873 -699.03
## - Age             1   1.37530 32.589 -692.08
## 
## Step:  AIC=-704.76
## Log.Retinol.Plasma ~ Age + Smoking.Status + Calories + Fat + 
##     Fiber + Alcohol
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        31.295 -704.76
## - Smoking.Status  2   0.47054 31.766 -704.09
## - Fiber           1   0.31626 31.611 -703.61
## - Calories        1   0.32013 31.615 -703.57
## - Fat             1   0.54568 31.841 -701.35
## - Alcohol         1   0.77476 32.070 -699.10
## - Age             1   1.78199 33.077 -689.42
## 
## Call:
## lm(formula = Log.Retinol.Plasma ~ Age + Smoking.Status + Calories + 
##     Fat + Fiber + Alcohol, data = plasma2)
## 
## Coefficients:
##                  (Intercept)                           Age  
##                    6.0856104                     0.0055881  
##         Smoking.StatusFormer  Smoking.StatusCurrent Smoker  
##                    0.0797737                    -0.0097973  
##                     Calories                           Fat  
##                    0.0001661                    -0.0034261  
##                        Fiber                       Alcohol  
##                   -0.0086179                     0.0111156
#Subset Regression
subsets <- regsubsets(Log.Retinol.Plasma~., data=plasma, really.big=T)
plot(subsets, scale="adjr2")

Create Model

\(\widehat{Retinol.Plasma}\) = \(e^{6.086 + 0.006X_{Age} + 0.080X_{Smoking.StatusFormer} - 0.010X_{Smoking.StatusCurrent Smoker} + 0.0002X_{Calories} - 0.003X_{Fat} - 0.009X_{Fiber} + 0.011X_{Alcohol} \pm 0.320}\)

#Create Linear Regression Model
plasma.fit <- lm(Log.Retinol.Plasma ~ Age + Smoking.Status + Calories + Fat + Fiber + Alcohol, data = plasma)
summary(plasma.fit)
## 
## Call:
## lm(formula = Log.Retinol.Plasma ~ Age + Smoking.Status + Calories + 
##     Fat + Fiber + Alcohol, data = plasma)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12759 -0.18186  0.00772  0.21015  0.98086 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   6.086e+00  9.804e-02  62.076  < 2e-16 ***
## Age                           5.588e-03  1.341e-03   4.167 4.02e-05 ***
## Smoking.StatusFormer          7.977e-02  4.044e-02   1.973  0.04943 *  
## Smoking.StatusCurrent Smoker -9.797e-03  5.774e-02  -0.170  0.86537    
## Calories                      1.661e-04  9.402e-05   1.766  0.07834 .  
## Fat                          -3.426e-03  1.486e-03  -2.306  0.02177 *  
## Fiber                        -8.618e-03  4.909e-03  -1.756  0.08015 .  
## Alcohol                       1.112e-02  4.045e-03   2.748  0.00636 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3203 on 305 degrees of freedom
## Multiple R-squared:  0.1295, Adjusted R-squared:  0.1095 
## F-statistic:  6.48 on 7 and 305 DF,  p-value: 3.928e-07

Evaluate Model

# anova Test
anova(plasma.fit)
## Analysis of Variance Table
## 
## Response: Log.Retinol.Plasma
##                 Df  Sum Sq Mean Sq F value    Pr(>F)    
## Age              1  1.9074 1.90741 18.5894 2.191e-05 ***
## Smoking.Status   2  0.6892 0.34460  3.3584  0.036081 *  
## Calories         1  0.0561 0.05615  0.5472  0.460033    
## Fat              1  0.5199 0.51988  5.0667  0.025101 *  
## Fiber            1  0.7071 0.70711  6.8914  0.009097 ** 
## Alcohol          1  0.7748 0.77476  7.5507  0.006356 ** 
## Residuals      305 31.2952 0.10261                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveragePlots(plasma.fit, pch=16)

plasma.fit.df <- augment(plasma.fit)

ggplot(plasma.fit.df, aes(x=.fitted, y=.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "fitted values", y = "Residuals")

ncvTest(plasma.fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.9956292, Df = 1, p = 0.31837
qqPlot(plasma.fit,pch=16)

## [1]  80 291
shapiro.test(plasma.fit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  plasma.fit$residuals
## W = 0.99042, p-value = 0.03866
#looking for outliers using Cook's D Plot
cutoff <- 4/(nrow(plasma)- length(plasma.fit$coefficients)-2)
plot(plasma.fit, which=4, cook.levels = cutoff)
abline(h=cutoff, lty=2, col="red")

# looking for Multicollinearity
MultiPlasma <- cor(plasma[NumCol])

# Variance Inflation Factor looking for Multicollinearity
vif(plasma.fit)
##                     GVIF Df GVIF^(1/(2*Df))
## Age             1.148042  1        1.071467
## Smoking.Status  1.122996  2        1.029425
## Calories       10.422533  1        3.228395
## Fat             7.532301  1        2.744504
## Fiber           2.094368  1        1.447193
## Alcohol         1.221526  1        1.105227
# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(MultiPlasma, format = "markdown")
Age Quetelet Calories Fat Fiber Alcohol Cholesterol Beta.Diet Retinol.Diet Retinol.Plasma Log.Retinol.Plasma
Age 1.0000000 -0.0125944 -0.2126350 -0.1737258 0.0455450 -0.0066697 -0.1207996 0.0677929 -0.0099325 0.2217522 0.2303427
Quetelet -0.0125944 1.0000000 0.0134108 0.0513615 -0.0879637 -0.1208441 0.1148000 -0.0051417 0.0328123 0.0104610 -0.0095042
Calories -0.2126350 0.0134108 1.0000000 0.8978788 0.5176226 0.2240924 0.6598935 0.2567768 0.4176231 -0.0479794 -0.0723291
Fat -0.1737258 0.0513615 0.8978788 1.0000000 0.2828084 0.1338801 0.7028062 0.1431437 0.4091983 -0.0825821 -0.1122599
Fiber 0.0455450 -0.0879637 0.5176226 0.2828084 1.0000000 -0.0142741 0.1586174 0.4833462 0.2158992 -0.0456814 -0.0585116
Alcohol -0.0066697 -0.1208441 0.2240924 0.1338801 -0.0142741 1.0000000 0.1042547 0.0355940 -0.0036564 0.2209611 0.2061992
Cholesterol -0.1207996 0.1148000 0.6598935 0.7028062 0.1586174 0.1042547 1.0000000 0.1139018 0.4410165 -0.0600768 -0.0876945
Beta.Diet 0.0677929 -0.0051417 0.2567768 0.1431437 0.4833462 0.0355940 0.1139018 1.0000000 0.0521951 -0.0108499 -0.0393540
Retinol.Diet -0.0099325 0.0328123 0.4176231 0.4091983 0.2158992 -0.0036564 0.4410165 0.0521951 1.0000000 -0.0596948 -0.0592769
Retinol.Plasma 0.2217522 0.0104610 -0.0479794 -0.0825821 -0.0456814 0.2209611 -0.0600768 -0.0108499 -0.0596948 1.0000000 0.9597215
Log.Retinol.Plasma 0.2303427 -0.0095042 -0.0723291 -0.1122599 -0.0585116 0.2061992 -0.0876945 -0.0393540 -0.0592769 0.9597215 1.0000000

 

Part 6: Application of the model

A simple linear regression model was used to describe the trends of the data and make predictions. The general equation for linear regression is as follows:

y = \(b_{0}\) + \(b_{1}\)\(x_{1}\) + \(b_{2}\)\(x_{2}\) + … + \(b_{i}\)\(x_{i}\) + … + \(b_{n}\)\(x_{n}\)

where there are n predicting variables used to estimate the response, y. \(b_{0}\) serves as the y-intercept. Each predictor variables, \(x_{i}\), has a weight coefficient, \(b_{i}\), indicating how significant its influence is on the response. Out of the 12 potential predictor variables, our final model used (?). (The natural log of Retinol.Plasma) was the response that was modeled.

Comparing Fit Prediction to Removed Observation

Selected observation removed in Part 3 stored as object: ‘selection’. We used our model to predict the Retinol.Plasma and compare our prediction against the actual removed observation.

The Model we used transformed Retinol.Plasma using a natural log, so to transform the result of the prediction back into meaningful terms we had to exponentiate.

After we transformed the predicted value, we were able to prove that, in fact, the real Retinol.Plasma value did fall within our model’s prediction interval!

The resulting formula we found for predicting Retinol.Plasma:

Retinol.Plasma = \(e^{6.086 + 0.006X_{Age} + 0.080X_{Smoking.StatusFormer} - 0.010X_{Smoking.StatusCurrent Smoker} + 0.0002X_{Calories} - 0.003X_{Fat} - 0.009X_{Fiber} + 0.011X_{Alcohol} \pm 0.320}\)

# Removed Observation for testing

# Do not modify the following code:
# displays selection on Markdown File
knitr::kable(selection, format = "markdown")
Age Sex Smoking.Status Quetelet Vitamin.Use Calories Fat Fiber Alcohol Cholesterol Beta.Diet Retinol.Diet Beta.Plasma Retinol.Plasma
84 19 Female Never 29.24145 Yes, fairly often 2558.9 116.1 12.3 0 324.5 1498 1066 327 693
predict(plasma.fit, newdata=selection, interval = "predict")
##        fit      lwr      upr
## 84 6.11296 5.473549 6.752372
exp(predict(plasma.fit, newdata=selection, interval = "predict"))
##         fit      lwr      upr
## 84 451.6737 238.3044 856.0866

 

Part 7: Limitations and Assumptions

There are several factors missing from the data that have a potential to negatively impact the accuracy of our model and analysis. The variation in plasma levels could be attributed to a variety of complex health conditions that is difficult to reduce to concrete measurements or classifications. There are a few terms that are not well-defined, such as what is considered “often vitamin use” or “one alcoholic drink”. Finally, it is unclear how some of the measurements were taken or recorded, leaving potential for inaccuracy or rounding error in some of the data points.

 

Limitations

Insufficient variables to find any “smoking gun” that had a high level of influence on Retinol Plasma levels; Age the most influential predictor variable for our model only had a p-value of 4.02e-05.

 

Assumptions

As mentioned above one outlier was removed from our dataset prior to conducting analysis. Once this point was removed, our analysis was able to produce a more accurate fit.

Data appears to be self-reported by the participants, so we have to (optimistically) assume the patients were being truthful in their reporting. Calorie / Fat / Fiber content of their daily diets. how would patients be expected to accurately notate this?

 

Missing Variables

A helpful variable could be hours of exercise each week, or some other quantification of activity level. Outside of vitamin use, Diet doesn’t have a significant impact on Retinol Plasma, but perhaps exercise would.