Problem 1:
You will need to download from blackboard the dataset
“MultipleR.csv.” This dataset contains data on blood fat content, weight
in kilograms, and age in years.
a) Conduct a multiple linear regression analysis
(you can use the linear model “lm” function) using blood fat content as
the dependent variable and weight and age as independent variables.
Report the R^2 value, the best fit parameters, the p-values for the
independent variables, and the p-value for the overall model. Explain
the results in the context of the data.
multi <- lm(BloodFat ~ Weight + Age, data = Multiple)
summary(multi)
##
## Call:
## lm(formula = BloodFat ~ Weight + Age, data = Multiple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.570 -30.374 -5.449 28.626 89.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.9825 52.4296 1.487 0.151
## Weight 0.4174 0.7288 0.573 0.573
## Age 5.2166 0.7572 6.889 6.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.11 on 22 degrees of freedom
## Multiple R-squared: 0.7056, Adjusted R-squared: 0.6788
## F-statistic: 26.36 on 2 and 22 DF, p-value: 1.443e-06
R^2 = 0.7056 (adjusted R^2 = 0.6788), intercept = 77.9825, best fit
value (weight) = 0.4174, best fit value (age) = 5.2166, weight p-value =
0.573, age p-value = 6.43e-07, overall p-value = 1.443e-06
The multiple R^2 is 0.7056, meaning that 70.56% of the variability
in bloodfat can be explained by the model (the adjusted R^2, which
penalizes for the number of predictors, is 0.6788). The intercept
(77.98) is the predicted bloodfat at a weight of 0 kg and an age of 0
years, so it is an extrapolation outside the data rather than a
meaningful baseline, and it is not significantly different from 0
(p = 0.151). Both best-fit slopes were positive, which makes sense
because bloodfat level generally trends upward as people get older
and/or gain weight. However, age has a much larger estimated effect:
each additional year is associated with an increase of 5.22 units of
bloodfat with weight held constant, over 10-fold more than the 0.42
units per additional kilogram with age held constant (note that the two
slopes are in different units, per year vs. per kg). Age is also the
only predictor with a p-value < 0.05 (6.43e-07), so we can reject the
null hypothesis that its coefficient is 0 and conclude that there is a
relationship between age and bloodfat content. On the other hand, the
p-value for weight (0.573) was above 0.05, so we cannot reject the null
hypothesis, which states that there is no relationship between the x
variable (weight) and the y (bloodfat) once age is accounted for. The
overall p-value (1.443e-06) is also < 0.05, so we reject the null
hypothesis that all of the slope coefficients are 0 (that is, that none
of the predictors has an effect on the y-value).
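The values reported above can also be pulled out of the fitted model programmatically instead of being read off the printed summary. The following is a minimal sketch; because MultipleR.csv is not reproduced here, it uses R's built-in mtcars data as a stand-in (an assumption for illustration only), so the numbers will differ from those in this problem.

```r
# Stand-in model on built-in data (mtcars is NOT the assignment dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

r2     <- s$r.squared        # multiple R-squared
adj_r2 <- s$adj.r.squared    # adjusted R-squared
coefs  <- s$coefficients     # estimates, std. errors, t values, p-values
p_vals <- coefs[, "Pr(>|t|)"]

# The overall p-value is not stored directly; recompute it from the F statistic
f <- s$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(c(r2 = r2, adj_r2 = adj_r2, overall_p = overall_p))
print(p_vals)
```

Keeping the multiple and adjusted R^2 in separately named variables makes it harder to report one in place of the other.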
b) Conduct a multiple linear regression analysis
again but additionally include an interaction effect between age and
weight. Is there a significant interaction between age and weight?
Explain what an interaction effect is and use as an example the
variables tested (age and weight).
multi2 = lm (BloodFat ~ Weight + Age + (Weight*Age), data = Multiple)
summary(multi2)
##
## Call:
## lm(formula = BloodFat ~ Weight + Age + (Weight * Age), data = Multiple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.910 -26.925 -7.654 30.275 89.961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141.67498 161.54291 0.877 0.390
## Weight -0.53587 2.39943 -0.223 0.825
## Age 3.23690 4.80082 0.674 0.508
## Weight:Age 0.02910 0.06964 0.418 0.680
##
## Residual standard error: 44.96 on 21 degrees of freedom
## Multiple R-squared: 0.708, Adjusted R-squared: 0.6663
## F-statistic: 16.97 on 3 and 21 DF, p-value: 7.899e-06
Weight-Age interaction p-value = 0.68. An interaction effect occurs
when the effect of one variable on the response depends on the value of
another variable in the model. In this case, an interaction between age
and weight would mean that the change in bloodfat per kilogram of weight
is different for younger and older people (for example, extra weight
might raise bloodfat more in older people than in younger ones). This is
not the same as age and weight simply being correlated with each other.
However, this analysis found that there is no significant interaction
between age and weight. The p-value (0.68) is too large for us to
reject the null hypothesis, which states that the coefficient for the
interaction term is 0 (the effect of weight on bloodfat content does not
depend on age, and vice versa).
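To make the idea of an interaction concrete, the sketch below does two things. First, it checks that the formula used above, with `(Weight*Age)` added to the main effects, expands to the same terms as the shorter `a * b` form (the names `y`, `a`, `b` are placeholders). Second, it plugs the printed estimates from the model with the interaction into a small helper, `weight_slope()` (a hypothetical name introduced here), which gives the estimated slope of bloodfat on weight at a given age. These are point estimates only; the interaction term is not significant, so the differences between ages should not be over-interpreted.

```r
# a * b is shorthand for a + b + a:b, so repeating the main effects is harmless
labels_long  <- attr(terms(y ~ a + b + (a * b)), "term.labels")
labels_short <- attr(terms(y ~ a * b), "term.labels")
print(labels_long)
print(identical(labels_long, labels_short))

# Estimates copied from the summary(multi2) output above
b_weight <- -0.53587   # Weight
b_inter  <-  0.02910   # Weight:Age

# With an interaction, the weight slope is a function of age
weight_slope <- function(age) b_weight + b_inter * age

print(weight_slope(c(20, 40, 60)))
```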
c) Use an F ratio test to compare the fit of the
two models (without interaction effect vs. with interaction effect).
Explain the results and indicate which model is better. Hint: you will
have to use the “anova” function in R.
anova(multi, multi2)
## Analysis of Variance Table
##
## Model 1: BloodFat ~ Weight + Age
## Model 2: BloodFat ~ Weight + Age + (Weight * Age)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 22 42806
## 2 21 42453 1 352.88 0.1746 0.6803
F = 0.1746, P = 0.6803 > 0.05
Because the p-value for the F-test is > 0.05, we cannot reject the
null hypothesis that the coefficient added in model 2 (the interaction
term) is 0. In other words, model 2 does not fit the data significantly
better than model 1. This makes sense, because the only added term was
the interaction effect in model 2, which was found to be non-significant
in 1b). Therefore, we should stick with the simpler first model.
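The F ratio in the table can be reproduced by hand from the residual sums of squares, which shows what `anova` is comparing: the drop in RSS per added parameter, relative to the residual variance of the larger model. The sketch below uses only the rounded numbers printed in the table above, so small rounding differences from R's own output are expected.

```r
# Values copied from the anova(multi, multi2) table above
sum_sq_added <- 352.88   # reduction in RSS from adding the interaction term
df_added     <- 1        # one extra parameter in model 2
rss_full     <- 42453    # RSS of model 2 (with interaction)
df_full      <- 21       # residual degrees of freedom of model 2

# F = (drop in RSS per added parameter) / (residual mean square of full model)
f_ratio <- (sum_sq_added / df_added) / (rss_full / df_full)
p_value <- pf(f_ratio, df_added, df_full, lower.tail = FALSE)

print(c(F = f_ratio, p = p_value))
```

With a single added term, this F ratio is the square of that term's t value in the larger model (0.418^2 is approximately 0.175), which is why the p-value here matches the interaction p-value in 1b).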
Problem 2:
You will need to download from blackboard the dataset
“LogisticR.csv.” This dataset was obtained from the following paper:
“Veltman, C.J., S. Nee, and M.J. Crawley. 1996. Correlates of
introduction success in exotic New Zealand birds. American Naturalist
147: 542-557” (pdf of article available on blackboard). In this article
the authors used logistic regression to discover significant correlates
of introduction success of exotic birds in New Zealand.
In the original dataset there is one response variable (introduction
success = status) and 13 different predictor variables (listed with the
letters A–M) for 79 species of birds. However, in the csv file I have
reduced the number of variables (I excluded C, E, H and J) because of
missing data and to avoid overfitting. I also deleted species with
missing data for the remaining variables.
library(readxl)  # provides read_excel()
Logistic = read_excel("/Users/The-Queen/Documents/Biometry/Homework_Stuff/HW_4/LogisticR.xlsx")
a) “R” will interpret by default all variables with
numbers as numerical, so you will have to convert the variables D
(migration) and F (diet) into categorical variables. To do so you can
use the “factor” function of R. See the following webpage for help: https://stats.idre.ucla.edu/r/modules/coding-for-categorical-variables-in-regression-models/
Logistic$D.f <- factor(Logistic$D)
Logistic$F.f <- factor(Logistic$F)
print(Logistic$D.f)
## [1] 1 1 1 3 3 2 1 1 2 3 3 3 3 3 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 3 2 1 1 1 2 1 2 2
## [39] 2 2 3 2 2 3 2 1 1 2 1 2 1 1 1 1 1 1 2 3 2 3 2 2 2 2 2 3 1 1 3 2 2
## Levels: 1 2 3
print(Logistic$F.f)
## [1] 2 1 1 2 1 1 1 2 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 1 1 1 1 3 3 1 2 1 3 3 3 2 2
## [39] 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2
## Levels: 1 2 3
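As a small illustration of what the conversion changes, the sketch below builds the design matrix R would use for a three-level factor. The vector `migration` is a made-up toy example, not the assignment data. Once the variable is a factor, R replaces it with one indicator column per non-reference level, which is why a regression output lists separate rows for levels 2 and 3 while level 1 is absorbed into the intercept.

```r
# Hypothetical toy codes standing in for a numerically coded category
migration <- c(1, 2, 3, 2, 1, 3)

# As a number, this would enter a model as a single slope;
# as a factor, each level is treated as its own category
migration_f <- factor(migration)
print(levels(migration_f))

# Column names of the design matrix: level 1 is the reference level
dummy_cols <- colnames(model.matrix(~ migration_f))
print(dummy_cols)
```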
b) Once you have converted the necessary variables
to categorical, go ahead and conduct a logistic regression analysis with
all variables in the csv file. Use the generalized linear model “glm”
function of R [glm will also do linear regression if you want] with
the argument “family = binomial.” This argument needs to be applied to
run a logistic regression. Explain the results in the context of the
study—read the paper! You only need to discuss statistical significance
(p-values) of the predictor variables. Don’t worry about interpreting
the coefficients for each variable because they are log-transformed odds
ratios (if you want to make sense of them you would need to do
antilog transformations). How do your results compare to the results in
the paper?
summary(glm(status ~ A + B + Logistic$D.f + Logistic$F.f + G + I + K + L + M, data = Logistic, family = binomial))
##
## Call:
## glm(formula = status ~ A + B + Logistic$D.f + Logistic$F.f +
## G + I + K + L + M, family = binomial, data = Logistic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.70327 -0.26161 -0.00005 0.14183 2.64632
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.600e+00 2.904e+00 -1.928 0.0538 .
## A -4.772e-03 4.351e-03 -1.097 0.2728
## B 2.880e-03 1.500e-03 1.919 0.0549 .
## Logistic$D.f2 -6.855e-01 1.128e+00 -0.608 0.5434
## Logistic$D.f3 -1.878e+01 2.447e+03 -0.008 0.9939
## Logistic$F.f2 3.262e+00 1.775e+00 1.838 0.0661 .
## Logistic$F.f3 4.355e+00 2.565e+00 1.698 0.0896 .
## G -2.399e-01 2.189e-01 -1.096 0.2730
## I 2.238e+00 1.717e+00 1.303 0.1926
## K 9.526e-01 2.123e+00 0.449 0.6537
## L 1.396e-01 1.357e-01 1.028 0.3037
## M 1.073e-02 5.102e-03 2.104 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 94.317 on 70 degrees of freedom
## Residual deviance: 30.304 on 59 degrees of freedom
## AIC: 54.304
##
## Number of Fisher Scoring iterations: 18
The only p-value < 0.05 was for factor M (p = 0.0354), the minimum
number of individuals introduced to the habitat. This result is
corroborated by the paper, which states that both introduction effort
factors (number of individuals released and releasing at multiple
locations) were positively correlated with introduction success. The
paper also states that the only life history variable to make a
significant difference in introduction success is migration. We did not
find these other two factors to be significant in our analysis; for
migration (D), the p-values were 0.543 and 0.994 for levels 2 and 3.
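Although the prompt does not require interpreting the coefficients, the antilog transformation it mentions is a single call to `exp()`. The sketch below applies it to the printed estimate for M, giving the multiplicative change in the odds of introduction success per additional individual introduced, with the other predictors held constant. The 10-individual step is an arbitrary choice made here for illustration.

```r
# Estimate for M copied from the glm summary above (log odds scale)
b_M <- 1.073e-02

# Antilog: odds ratio for one additional individual introduced
odds_ratio_M <- exp(b_M)

# Odds ratio for 10 additional individuals (illustrative step size)
odds_ratio_M_10 <- exp(10 * b_M)

print(c(per_individual = odds_ratio_M, per_10 = odds_ratio_M_10))
```

For a fitted model object, `exp(coef(model))` applies the same transformation to every coefficient at once.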
Problem 3:
In your own words briefly explain the difference between “general
linear models” and “generalized linear models.” Note: Please don’t just
copy and paste an answer from the internet because I will notice.
:0)
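In a generAL linear model, the dependent variable (y) is continuous
and the errors (residuals) are assumed to follow a normal distribution
with constant variance. In a generalIZED linear model, y does not need
to be normal: it can follow other distributions such as the binomial or
Poisson, and a link function connects the mean of y to the linear
combination of predictors. The logistic regression in Problem 2 is an
example (binomial y with a logit link), and the general linear model is
the special case with a normal y and an identity link.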
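One way to see the relationship between the two is to fit the same model with `lm` and with `glm` using a Gaussian family: the coefficients agree, because the general linear model is a special case of the generalized one. This is a minimal sketch on R's built-in mtcars data, used here only as a convenient stand-in and not part of the assignment.

```r
# General linear model
fit_lm <- lm(mpg ~ wt, data = mtcars)

# Generalized linear model with normal errors and identity link
fit_glm <- glm(mpg ~ wt, data = mtcars, family = gaussian)

# The two sets of coefficients should match up to numerical tolerance
same_coefs <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(same_coefs)
```

Changing `family` (for example to `binomial` for a 0/1 response) is what takes the fit outside the general linear model.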