helpful R code

we often want to recode variables for specific analyses, and there are multiple ways we can do this.

first, let’s just mean center, or subtract some constant from a variable already in our dataset.

apgar$gestat_mean<- apgar$GESTAT-mean(apgar$GESTAT)
# this is saying from our apgar dataset, create a new variable called gestat_mean (apgar$gestat_mean<- )
# then, assign that variable to be the original GESTAT var minus the mean of GESTAT

# you could also do
mean(apgar$GESTAT)
## [1] 37.11667
apgar$gestat_mean<- apgar$GESTAT-37.11667

# or
gest_mean<- mean(apgar$GESTAT)
apgar$gestat_mean<- apgar$GESTAT-gest_mean

recoding categorical variables is a common one too. let’s take the SMOKES variable (where 1=smokes and 0=did not smoke) and recode it to be contrast codes of -.5 and .5.

# the easiest way IMO is with ifelse

apgar$smokes_contrast<- ifelse(apgar$SMOKES==0, -.5, .5)
# this says if apgar$SMOKES is equal to 0, give this new variable the value of -.5, for everything ELSE, give a value of .5.

# we can only do this this easily beacuse we know there are only 2 possible values of SMOKES (or or 1)

table(apgar$SMOKES)
## 
##  0  1 
## 44 16
table(apgar$smokes_contrast)
## 
## -0.5  0.5 
##   44   16
# if you have more than one level, it will recode 0's to 0.5 and then literally evereything else to .5. we can use nested ifelse statements, but that's not always the best way

table(apgar$PRENAT)
## 
##  0  1  2  3 
##  6 12 20 22
# lets make prenatal be -.5 for a value of 0 or 1 and .5 for a value of 2 or 3

apgar$prenat_contrast<- ifelse(apgar$PRENAT==0, -.5, ifelse(apgar$PRENAT==1, -.5, ifelse(apgar$PRENAT==2, .5, ifelse(apgar$PRENAT==3, .5, NA)))) # you have to end evrey ifelse statement with the "ELSE" category, so our last ifelse statement I just put NA, basically saying if there are any values other than 0-3, put NA

table(apgar$prenat_contrast) ### adds up to the previous table
## 
## -0.5  0.5 
##   18   42
# but writing that many ifelse statements can be a pain and prone to error. we could simplifiy in this case:

apgar$prenat_contrast<- ifelse(apgar$PRENAT<=1, -.5, ifelse(apgar$PRENAT>1, .5, NA))

table(apgar$prenat_contrast) ### adds up to the previous table
## 
## -0.5  0.5 
##   18   42
# or we could reecode by saying multiply (but not really multiply) a value by .5/-.5 depending on what the original variable was

apgar$prenat_contrast<- -.5*(apgar$PRENAT<=1) + .5*(apgar$PRENAT>1) 

table(apgar$prenat_contrast) ### adds up to the previous table
## 
## -0.5  0.5 
##   18   42
# finally we could use an R recode function itself
apgar$prenat_contrast<- recode(apgar$PRENAT, "0:1=0.5; 2:3=-0.5")

table(apgar$prenat_contrast) ### adds up to the previous table
## 
## -0.5  0.5 
##   42   18
# so do whatever makes the most sense to you!

apgar data set examples

We are going to use the APGAR dataset to answer some questions about the relationship between whether mothers smoke or not and the health of their baby, as measured by APGAR scores.

head(apgar)
##   CASE APGAR GENDER SMOKES WGTGAIN GESTAT PRENAT ANNINC gestat_mean
## 1   43     6      0      0      22     37      0     32  -0.1166667
## 2   51     5      0      0      50     35      0     17  -2.1166667
## 3   56     4      0      0      60     36      0     21  -1.1166667
## 4   53     4      1      0      60     37      0     32  -0.1166667
## 5   40     6      0      0      35     41      1     23   3.8833333
## 6   59     2      0      0      50     38      1     12   0.8833333
##   smokes_contrast prenat_contrast
## 1            -0.5             0.5
## 2            -0.5             0.5
## 3            -0.5             0.5
## 4            -0.5             0.5
## 5            -0.5             0.5
## 6            -0.5             0.5

Immediately after birth, babies are assigned an APGAR score based on such factors of breathing, reflexes, skin color, etc. Scores range from 0 to 10, with 10 being a normal, healthy baby. Low scores indicate the need for immediate critical care. The data in the dataset ‘apgar’ represent hypothetical data from a study of disadvantaged mothers who were provided with various opportunities to obtain pre-natal care. Presumably, the longer the gestation period, the healthier the baby.

  1. provide a test of whether longer gestation periods are associated with higher APGAR scores. Write model A, model C, SSE(A), SSE(C), PRE, F*, and a short, substantive conclusion.

model A: \(apgar= \beta_0+\beta_1*gestat+\epsilon_i\) model C: \(apgar= \beta_0+\epsilon_i\)

mod.a<- lm(APGAR~GESTAT, data=apgar)
mcSummary(mod.a)
## lm(formula = APGAR ~ GESTAT, data = apgar)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq      F p
## Model       72.044  1 72.044 0.276 22.116 0
## Error      188.939 58  3.258               
## Corr Total 260.983 59  4.423               
## 
##   RMSE AdjEtaSq
##  1.805    0.264
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) -0.930 1.636 -0.569  1.054 0.006  NA -4.204   2.344 0.572
## GESTAT       0.205 0.044  4.703 72.044 0.276  NA  0.118   0.292 0.000
mod.c<- lm(APGAR~1, data=apgar)
mcSummary(mod.c)
## lm(formula = APGAR ~ 1, data = apgar)
## 
## Omnibus ANOVA
##                 SS df    MS EtaSq F p
## Model        0.000  0   Inf     0    
## Error      260.983 59 4.423          
## Corr Total 260.983 59 4.423          
## 
##   RMSE AdjEtaSq
##  2.103        0
## 
## Coefficients
##               Est StErr      t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.272 24.614 2680.017 0.911  NA   6.14   7.227 0

SSE(A)= 188.939 SSE(C)= 260.983 PRE= .276 t= 4.703

Longer gestation periods are significantly associated with higher apgar scores.

  1. Re run the same analysis, but with GESTAT mean centered to provide a different (more meaningful) interpretation of the intercept.
apgar$gestat_mean<- apgar$GESTAT-mean(apgar$GESTAT)
mod.a<- lm(APGAR~gestat_mean, data=apgar)
mcSummary(mod.a)
## lm(formula = APGAR ~ gestat_mean, data = apgar)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq      F p
## Model       72.044  1 72.044 0.276 22.116 0
## Error      188.939 58  3.258               
## Corr Total 260.983 59  4.423               
## 
##   RMSE AdjEtaSq
##  1.805    0.264
## 
## Coefficients
##               Est StErr      t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.233 28.683 2680.017 0.934  NA  6.217   7.150 0
## gestat_mean 0.205 0.044  4.703   72.044 0.276  NA  0.118   0.292 0

Here, the intercept is 6.683, or the predicted APGAR score for an average baby controlling for gestation period (or a baby with an average gestation period). The slope is the same – the increase in APGAR score has gestation increases one week.

  1. Finally, redo the analysis one last time, but this time changing GESTAT not by subtracting a constant from it (i.e., its mean) but rather changing its units into days rather than weeks. In other words, compute a new predictor variable from GESTAT that is GESTAT multiplied by 7 (the number of days in a week). Regress APGAR on this new transformed version of GESTAT and again provide interpretations of both the intercept and slope.
apgar$gestat_days<- apgar$GESTAT*7
mod.a<- lm(APGAR~gestat_days, data=apgar)
mcSummary(mod.a)
## lm(formula = APGAR ~ gestat_days, data = apgar)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq      F p
## Model       72.044  1 72.044 0.276 22.116 0
## Error      188.939 58  3.258               
## Corr Total 260.983 59  4.423               
## 
##   RMSE AdjEtaSq
##  1.805    0.264
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) -0.930 1.636 -0.569  1.054 0.006  NA -4.204   2.344 0.572
## gestat_days  0.029 0.006  4.703 72.044 0.276  NA  0.017   0.042 0.000

Here, the intercept is the same. the predicted APGAR score for someone who had a gestation period of 0 days. The slope is now (still significantly associated with apgar) but is the amount an apgar score is expected to increase as gestation period increases one day (much smaller).

  1. predict the expected AGPAR score for a baby who had a gestation period of 35 weeks using all three of the above models!

mod 1 - normal

\(apgar_i= -.930 + .205*gestat\)

mod 2 - mean centered

\(apgar_i= 6.683 + .205*gestatmean\)

mod 3 - with gestat as days not weeks

\(apgar_i= -.930 + .029*gestatdays\)

# mod 1= 
-.930 + .205*(35)
## [1] 6.245
# mod 2= 
6.683 + .205*(35-mean(apgar$GESTAT))
## [1] 6.249083
# mod 3=
-.930 + .029*(35*7)
## [1] 6.175

by using all three of our different models (that had slightly different interpretations), we get the same predicted apgar score of someone who gestated 35 weeks. so you can use any model but make sure to plug in the value that corresponds to your units (i.e num. of weeks that differed from the mean or gestation period in days, not weeks)

  1. Using all the data, ask whether the means of each of the two separate groups of babies, those with mothers who smoke and those with mothers who don’t, are significantly different from 7. (Hint: recode SMOKES using alternative dummy codes so that the intercepts can be used to estimate the means in each group. Use the confidence intervals for the intercepts to answer the question.)
apgar$apgar_7<- apgar$APGAR-7
mod.smokes<- lm(apgar_7~SMOKES, data=apgar) # 0 = did not smoke, 1= smoked
mcSummary(mod.smokes)
## lm(formula = apgar_7 ~ SMOKES, data = apgar)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq     F     p
## Model       41.000  1 41.000 0.157 10.81 0.002
## Error      219.983 58  3.793                  
## Corr Total 260.983 59  4.423                  
## 
##   RMSE AdjEtaSq
##  1.948    0.143
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept)  0.182 0.294  0.619  1.455 0.007  NA -0.406   0.770 0.538
## SMOKES      -1.869 0.569 -3.288 41.000 0.157  NA -3.007  -0.731 0.002
apgar$no_smoke<- ifelse(apgar$SMOKES==1, 0, 1) # just flipping so if SMOKES=1, recode to 0, make everything else 1. we know there are no other values so this is ok here.

apgar$apgar_7<- apgar$APGAR-7
mod.nosmokes<- lm(apgar_7~no_smoke, data=apgar) # 0 = did not smoke, 1= smoked
mcSummary(mod.nosmokes)
## lm(formula = apgar_7 ~ no_smoke, data = apgar)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq     F     p
## Model       41.000  1 41.000 0.157 10.81 0.002
## Error      219.983 58  3.793                  
## Corr Total 260.983 59  4.423                  
## 
##   RMSE AdjEtaSq
##  1.948    0.143
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) -1.688 0.487 -3.466 45.563 0.172  NA -2.662  -0.713 0.001
## no_smoke     1.869 0.569  3.288 41.000 0.157  NA  0.731   3.007 0.002

The first model has an intercept of .182. That is the mean difference in apgar score from 7 for kids in our sample who’s mom did not smoke (when SMOKES=0). because 0 is in the CI, we fail to reject the null that kids without smoking mother’s have an APGAR score of 7. These kids do have an APGAR score similar to 7 (mean apgar=7.1818).

The second model has an intercept of -1.688. That is the mean difference in apgar score from 7 for kids in our sample who’s mom did smoke (when no_smoke=0). because 0 is not in the CI, we reject the null that kids without smoking mother’s have an APGAR score of 7. These kids have a significantly different APGAR score from 7 (mean apgar=5.3125).

plot!

let’s plot means and relationships based on the categorical predictor. first, make SMOKES be non-numeric so it’s easier for GGplot to understand what we want to do.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
apgar$SMOKES_cat<- ifelse(apgar$SMOKES==1, "Smoker", "Non-smoker")
ggplot(data=apgar, aes(y=APGAR, x=SMOKES_cat, fill=SMOKES_cat)) + 
    geom_bar(position="dodge", stat="identity")+
    ggtitle("Distribution of APGAR scores for smoking moms and non smoking moms")+
    xlab("Mother who smoked")

ggplot(data=apgar, aes(x=GESTAT, y=APGAR, group=SMOKES_cat, color=SMOKES_cat)) +
    geom_line()

this graph is weird because those in the no smoke category had a very different range of GESTAT scores.

project application

  1. Choose a continuous predictor variable from your dataset that you are interested in using to predict your outcome variable of interest. Write model A, model C, SSE(A), SSE(C), PRE, F*, and a short, substantive conclusion.

obviously I’m not going to do this but think about doing this with your own data because it will be helpful for the final project