Biometry HW 4

Problem 1:

You will need to download from blackboard the dataset “MultipleR.csv.” This dataset contains data on blood fat content, weight in kilograms, and age in years.

a) Conduct a multiple linear regression analysis (you can use the linear model “lm” function) using blood fat content as the dependent variable and weight and age as independent variables. Report the R^2 value, the best fit parameters, the p-values for the independent variables, and the p-value for the overall model. Explain the results in the context of the data.

multi <- lm(BloodFat ~ Weight + Age, data = Multiple)
summary(multi)

## 
## Call:
## lm(formula = BloodFat ~ Weight + Age, data = Multiple)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.570 -30.374  -5.449  28.626  89.170 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.9825    52.4296   1.487    0.151    
## Weight        0.4174     0.7288   0.573    0.573    
## Age           5.2166     0.7572   6.889 6.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.11 on 22 degrees of freedom
## Multiple R-squared:  0.7056, Adjusted R-squared:  0.6788 
## F-statistic: 26.36 on 2 and 22 DF,  p-value: 1.443e-06

R^2 = 0.7056 (adjusted R^2 = 0.6788), intercept = 77.9825, best fit value (weight) = 0.4174, best fit value (age) = 5.2166, weight p-value = 0.573, age p-value = 6.43e-07, overall p-value = 1.443e-06

The adjusted R^2 is 0.6788, meaning that about 68% of the variability in blood fat is explained by the model after adjusting for the number of predictors (the unadjusted multiple R^2 is 0.7056). The intercept (77.98) is the predicted blood fat for a weight of 0 kg and an age of 0 years; that is an extrapolation outside the data, so it acts as a baseline for the fitted model rather than a biologically meaningful value, and it is not significantly different from 0 (p = 0.151). Both best-fit slopes are positive, which makes sense because blood fat generally trends upward as people get older and/or gain weight. Holding the other variable constant, blood fat increases by about 5.22 units per year of age but only about 0.42 units per kilogram of weight, a more than 10-fold difference per unit (although the units differ, so the two slopes are not directly comparable). Age is the only predictor with p < 0.05 (p = 6.43e-07), so we reject the null hypothesis that its coefficient is 0 and conclude that there is a relationship between age and blood fat content after accounting for weight. On the other hand, the p-value for weight (0.573) is above 0.05, so we cannot reject the null hypothesis that there is no relationship between weight and blood fat once age is in the model. The overall p-value (1.443e-06) is also < 0.05, so we reject the null hypothesis that all of the slope coefficients are 0 (that is, that neither predictor has any effect on blood fat).
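As a quick consistency check on the numbers reported above, the adjusted R^2 and the overall F statistic can be rebuilt from the multiple R^2 alone. This is a minimal sketch that uses only values copied from the summary(multi) output; the sample size n = 25 is inferred from the 22 residual degrees of freedom plus 3 estimated coefficients, and the variable names are made up for illustration.

```r
# Values copied from the summary(multi) output
r2 <- 0.7056   # multiple R-squared
n  <- 25       # inferred: 22 residual df + 3 estimated coefficients
k  <- 2        # number of predictors (Weight, Age)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F statistic and its p-value, rebuilt from R-squared
f_stat    <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_overall <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

print(round(adj_r2, 4))
print(round(f_stat, 2))
print(p_overall)
```

Because the inputs are rounded, the rebuilt values should agree with the printed summary only up to rounding.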

b) Conduct a multiple linear regression analysis again but additionally include an interaction effect between age and weight. Is there a significant interaction between age and weight? Explain what an interaction effect is and use as an example the variables tested (age and weight).

multi2 <- lm(BloodFat ~ Weight + Age + (Weight * Age), data = Multiple)
summary(multi2)

## 
## Call:
## lm(formula = BloodFat ~ Weight + Age + (Weight * Age), data = Multiple)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.910 -26.925  -7.654  30.275  89.961 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141.67498  161.54291   0.877    0.390
## Weight       -0.53587    2.39943  -0.223    0.825
## Age           3.23690    4.80082   0.674    0.508
## Weight:Age    0.02910    0.06964   0.418    0.680
## 
## Residual standard error: 44.96 on 21 degrees of freedom
## Multiple R-squared:  0.708,  Adjusted R-squared:  0.6663 
## F-statistic: 16.97 on 3 and 21 DF,  p-value: 7.899e-06

Weight-Age interaction p-value = 0.68. An interaction effect occurs when the effect of one predictor on the response depends on the value of another predictor in the model. Here, an interaction would mean that the change in blood fat per kilogram of weight differs between younger and older people (for example, extra weight raising blood fat more at age 60 than at age 20). This seemed plausible, as people tend to gain weight with age as their metabolism slows, but their diets may not shift to accommodate these changes. However, this analysis found no significant interaction between age and weight. The p-value is too large for us to reject the null hypothesis that the coefficient for the interaction term is 0 (that is, that the effect of weight on blood fat content does not depend on age). Note that Weight and Age are no longer individually significant in this model either: once the interaction term is included, each main-effect coefficient is the slope when the other variable equals 0, which is far outside the data.
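To make the idea of an interaction concrete, the fitted interaction model implies that the slope of blood fat on weight is not one number but a function of age: slope = b_weight + b_interaction * age. The sketch below plugs in the estimates from the summary(multi2) output; the three ages are hypothetical values chosen only for illustration, and the helper name weight_slope is made up.

```r
# Coefficients copied from the summary(multi2) output
b_weight      <- -0.53587
b_interaction <-  0.02910

# Slope of BloodFat on Weight at a given age under the interaction model
weight_slope <- function(age) b_weight + b_interaction * age

ages <- c(20, 40, 60)   # hypothetical ages, for illustration only
print(data.frame(age = ages, weight_slope = weight_slope(ages)))
```

The slopes differ across ages in this table, but since the interaction coefficient is not significantly different from 0, these differences are compatible with a single constant slope.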

c) Use an F ratio test to compare the fit of the two models (without interaction effect vs. with interaction effect). Explain the results and indicate which model is better. Hint: you will have to use the “anova” function in R.

anova(multi, multi2)

## Analysis of Variance Table
## 
## Model 1: BloodFat ~ Weight + Age
## Model 2: BloodFat ~ Weight + Age + (Weight * Age)
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     22 42806                           
## 2     21 42453  1    352.88 0.1746 0.6803

F = 0.1746, P = 0.6803 > 0.05

Because the p-value for the F-test is > 0.05, we cannot reject the null hypothesis that the additional coefficient in model 2 (the interaction term) is 0; in other words, model 2 does not fit the data significantly better than model 1. This makes sense, because the only added term was the interaction effect, which was found to be non-significant in 1b). Therefore, we should stick with the simpler first model.
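The F ratio in the anova table can be reproduced by hand from the residual sums of squares, which may help show what the test is comparing. This is a sketch using only the rounded values printed in the table, so the result matches the table only up to rounding; the variable names are made up for illustration.

```r
# Residual sums of squares and residual df copied from the anova table
rss1 <- 42806; df1 <- 22   # model 1: no interaction
rss2 <- 42453; df2 <- 21   # model 2: with interaction

# F ratio: drop in RSS per extra parameter, relative to the
# residual variance of the larger model
f_ratio <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_value <- pf(f_ratio, df1 - df2, df2, lower.tail = FALSE)

# With one added term, F equals the squared t value of that term
t_interaction <- 0.418   # t value of Weight:Age in summary(multi2)

print(f_ratio)
print(p_value)
print(t_interaction^2)
```

Since only one term was added, this F-test and the t-test on the Weight:Age coefficient in 1b) are the same test, which is why their p-values agree.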

Problem 2:

You will need to download from blackboard the dataset “LogisticR.csv.” This dataset was obtained from the following paper: “Veltman, C.J., S. Nee, and M.J. Crawley. 1996. Correlates of introduction success in exotic New Zealand birds. American Naturalist 147: 542-557” (pdf of article available on blackboard). In this article the authors used logistic regression to discover significant correlates of introduction success of exotic birds in New Zealand.

In the original dataset there is one response variable (introduction success = status) and 13 different predictor variables (listed with the letters A–M) for 79 species of birds. However, in the csv file I have reduced the number of variables (I excluded C, E, H and J) because of missing data and to avoid overfitting. I also deleted species with missing data for the remaining variables.

library(readxl)  # provides read_excel()
Logistic <- read_excel("/Users/The-Queen/Documents/Biometry/Homework_Stuff/HW_4/LogisticR.xlsx")

a) “R” will interpret by default all variables with numbers as numerical, so you will have to convert the variables D (migration) and F (diet) into categorical variables. To do so you can use the “factor” function of R. See the following webpage for help: https://stats.idre.ucla.edu/r/modules/coding-for-categorical-variables-in-regression-models/

Logistic$D.f <- factor(Logistic$D)
Logistic$F.f <- factor(Logistic$F)
print(Logistic$D.f)

##  [1] 1 1 1 3 3 2 1 1 2 3 3 3 3 3 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 3 2 1 1 1 2 1 2 2
## [39] 2 2 3 2 2 3 2 1 1 2 1 2 1 1 1 1 1 1 2 3 2 3 2 2 2 2 2 3 1 1 3 2 2
## Levels: 1 2 3

print(Logistic$F.f)

##  [1] 2 1 1 2 1 1 1 2 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 1 1 1 1 3 3 1 2 1 3 3 3 2 2
## [39] 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2
## Levels: 1 2 3
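The reason for the conversion is that a factor enters a regression as a set of 0/1 indicator (dummy) columns, one per non-reference level, rather than as a single numeric slope. A minimal sketch on a hypothetical toy vector (not the homework data) shows the columns R would build:

```r
# Hypothetical toy codes standing in for a numerically coded category
diet_code <- c(1, 2, 3, 2, 1, 3)

# Convert to a factor so the codes are treated as labels, not quantities
diet <- factor(diet_code)
print(levels(diet))

# Design matrix a model would use: level 1 is the reference,
# levels 2 and 3 each get their own indicator column
dummies <- model.matrix(~ diet)
print(dummies)
```

This is why a three-level factor shows up in regression output as two coefficients, each comparing one level against the reference level.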

b) Once you have converted the necessary variables to categorical, go ahead and conduct a logistic regression analysis with all variables in the csv file. Use the generalized linear model “glm” function of R (it can also fit an ordinary linear regression if you want) with the argument “family = binomial.” This argument is required to run a logistic regression. Explain the results in the context of the study (read the paper!). You only need to discuss statistical significance (p-values) of the predictor variables. Don’t worry about interpreting the coefficients for each variable because they are log-transformed odds ratios (to make sense of them you would need to apply antilog transformations). How do your results compare to the results in the paper?

summary(glm(status ~ A + B + Logistic$D.f + Logistic$F.f + G + I + K + L + M, data = Logistic, family = binomial))

## 
## Call:
## glm(formula = status ~ A + B + Logistic$D.f + Logistic$F.f + 
##     G + I + K + L + M, family = binomial, data = Logistic)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.70327  -0.26161  -0.00005   0.14183   2.64632  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -5.600e+00  2.904e+00  -1.928   0.0538 .
## A             -4.772e-03  4.351e-03  -1.097   0.2728  
## B              2.880e-03  1.500e-03   1.919   0.0549 .
## Logistic$D.f2 -6.855e-01  1.128e+00  -0.608   0.5434  
## Logistic$D.f3 -1.878e+01  2.447e+03  -0.008   0.9939  
## Logistic$F.f2  3.262e+00  1.775e+00   1.838   0.0661 .
## Logistic$F.f3  4.355e+00  2.565e+00   1.698   0.0896 .
## G             -2.399e-01  2.189e-01  -1.096   0.2730  
## I              2.238e+00  1.717e+00   1.303   0.1926  
## K              9.526e-01  2.123e+00   0.449   0.6537  
## L              1.396e-01  1.357e-01   1.028   0.3037  
## M              1.073e-02  5.102e-03   2.104   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 94.317  on 70  degrees of freedom
## Residual deviance: 30.304  on 59  degrees of freedom
## AIC: 54.304
## 
## Number of Fisher Scoring iterations: 18

The only p-value < 0.05 was for variable M (p = 0.0354), the minimum number of individuals introduced to the habitat. This result is corroborated by the paper, as it states that both introduction effort variables (number of individuals released and releasing at multiple locations) were positively correlated with introduction success. It also stated that the only life history variable to make a significant difference in introduction success is migration. We did not find the other two variables to be significant in our analysis: neither migration level (D) came close to significance (p = 0.5434 and 0.9939), while B (p = 0.0549) and the two diet levels (F, p = 0.0661 and 0.0896) were only marginal.
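Although the assignment does not require it, the antilog transformation mentioned in the prompt is a one-liner, and the deviances in the output allow a rough check of the model as a whole. The sketch below uses only numbers copied from the glm output; the likelihood-ratio comparison against the null model is an extra check that is not part of the original analysis, and the variable names are made up for illustration.

```r
# Coefficient for M copied from the glm summary (log-odds scale)
b_M <- 1.073e-02

# Antilog: multiplicative change in the odds of success per unit of M
odds_ratio_M <- exp(b_M)

# Deviances and df copied from the glm summary
null_dev  <- 94.317; null_df  <- 70
resid_dev <- 30.304; resid_df <- 59

# Likelihood-ratio (deviance) test of the full model vs. intercept only
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, lr_df, lower.tail = FALSE)

print(odds_ratio_M)
print(c(lr_stat = lr_stat, lr_df = lr_df))
print(lr_p)
```

An odds ratio slightly above 1 for M means each additional unit of M multiplies the odds of introduction success by that factor, holding the other predictors fixed.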

Problem 3:

In your own words briefly explain the difference between “general linear models” and “generalized linear models.” Note: Please don’t just copy and paste an answer from the internet because I will notice. :0)

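In a generAL linear model, the dependent variable (y) is continuous and its errors are assumed to be normally distributed with constant variance, so the mean of y is modeled directly as a linear combination of the predictors (as with “lm” in Problem 1). In a generalIZED linear model, y does not need to be normal: it can follow other distributions from the exponential family (for example binomial or Poisson), and a link function (such as the logit in Problem 2) connects the mean of y to the linear combination of predictors. The general linear model is the special case with a normal distribution and an identity link.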
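The relationship between the two can be illustrated in R on simulated data. This is a minimal sketch under stated assumptions: the data are randomly generated (not from either homework dataset), and all variable names are hypothetical.

```r
set.seed(1)

# Simulated predictor, a continuous response, and a binary response
x      <- runif(50, 0, 10)
y_cont <- 2 + 0.5 * x + rnorm(50)
y_bin  <- rbinom(50, 1, plogis(-2 + 0.5 * x))

# General linear model: normal errors, mean modeled directly
fit_lm <- lm(y_cont ~ x)

# The same model written as a generalized linear model
# (gaussian family, identity link)
fit_glm_gauss <- glm(y_cont ~ x, family = gaussian)

# A generalized linear model with a non-normal response:
# binomial family, logit link (logistic regression)
fit_glm_binom <- glm(y_bin ~ x, family = binomial)

print(all.equal(coef(fit_lm), coef(fit_glm_gauss)))
print(family(fit_glm_binom)$link)
```

The first two fits give the same coefficients, while the third can handle a 0/1 response that a general linear model is not designed for.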

Yee-Ann Wong

2022-11-19
