we often want to recode variables for specific analyses, and there are multiple ways we can do this.
first, let’s just mean center, or subtract some constant from a variable already in our dataset.
apgar$gestat_mean<- apgar$GESTAT-mean(apgar$GESTAT)
# this is saying from our apgar dataset, create a new variable called gestat_mean (apgar$gestat_mean<- )
# then, assign that variable to be the original GESTAT var minus the mean of GESTAT
# you could also do
mean(apgar$GESTAT)
## [1] 37.11667
apgar$gestat_mean<- apgar$GESTAT-37.11667
# or
gest_mean<- mean(apgar$GESTAT)
apgar$gestat_mean<- apgar$GESTAT-gest_mean
recoding categorical variables is a common one too. let’s take the SMOKES variable (where 1=smokes and 0=did not smoke) and recode it to be contrast codes of -.5 and .5.
# the easiest way IMO is with ifelse
apgar$smokes_contrast<- ifelse(apgar$SMOKES==0, -.5, .5)
# this says if apgar$SMOKES is equal to 0, give this new variable the value of -.5, for everything ELSE, give a value of .5.
# we can only do this this easily beacuse we know there are only 2 possible values of SMOKES (or or 1)
table(apgar$SMOKES)
##
## 0 1
## 44 16
table(apgar$smokes_contrast)
##
## -0.5 0.5
## 44 16
# if you have more than one level, it will recode 0's to 0.5 and then literally evereything else to .5. we can use nested ifelse statements, but that's not always the best way
table(apgar$PRENAT)
##
## 0 1 2 3
## 6 12 20 22
# lets make prenatal be -.5 for a value of 0 or 1 and .5 for a value of 2 or 3
apgar$prenat_contrast<- ifelse(apgar$PRENAT==0, -.5, ifelse(apgar$PRENAT==1, -.5, ifelse(apgar$PRENAT==2, .5, ifelse(apgar$PRENAT==3, .5, NA)))) # you have to end evrey ifelse statement with the "ELSE" category, so our last ifelse statement I just put NA, basically saying if there are any values other than 0-3, put NA
table(apgar$prenat_contrast) ### adds up to the previous table
##
## -0.5 0.5
## 18 42
# but writing that many ifelse statements can be a pain and prone to error. we could simplifiy in this case:
apgar$prenat_contrast<- ifelse(apgar$PRENAT<=1, -.5, ifelse(apgar$PRENAT>1, .5, NA))
table(apgar$prenat_contrast) ### adds up to the previous table
##
## -0.5 0.5
## 18 42
# or we could reecode by saying multiply (but not really multiply) a value by .5/-.5 depending on what the original variable was
apgar$prenat_contrast<- -.5*(apgar$PRENAT<=1) + .5*(apgar$PRENAT>1)
table(apgar$prenat_contrast) ### adds up to the previous table
##
## -0.5 0.5
## 18 42
# finally we could use an R recode function itself
apgar$prenat_contrast<- recode(apgar$PRENAT, "0:1=0.5; 2:3=-0.5")
table(apgar$prenat_contrast) ### adds up to the previous table
##
## -0.5 0.5
## 42 18
# so do whatever makes the most sense to you!
We are going to use the APGAR dataset to answer some questions about the relationship between whether mothers smoke or not and the health of their baby, as measured by APGAR scores.
head(apgar)
## CASE APGAR GENDER SMOKES WGTGAIN GESTAT PRENAT ANNINC gestat_mean
## 1 43 6 0 0 22 37 0 32 -0.1166667
## 2 51 5 0 0 50 35 0 17 -2.1166667
## 3 56 4 0 0 60 36 0 21 -1.1166667
## 4 53 4 1 0 60 37 0 32 -0.1166667
## 5 40 6 0 0 35 41 1 23 3.8833333
## 6 59 2 0 0 50 38 1 12 0.8833333
## smokes_contrast prenat_contrast
## 1 -0.5 0.5
## 2 -0.5 0.5
## 3 -0.5 0.5
## 4 -0.5 0.5
## 5 -0.5 0.5
## 6 -0.5 0.5
Immediately after birth, babies are assigned an APGAR score based on such factors of breathing, reflexes, skin color, etc. Scores range from 0 to 10, with 10 being a normal, healthy baby. Low scores indicate the need for immediate critical care. The data in the dataset ‘apgar’ represent hypothetical data from a study of disadvantaged mothers who were provided with various opportunities to obtain pre-natal care. Presumably, the longer the gestation period, the healthier the baby.
model A: \(apgar= \beta_0+\beta_1*gestat+\epsilon_i\) model C: \(apgar= \beta_0+\epsilon_i\)
mod.a<- lm(APGAR~GESTAT, data=apgar)
mcSummary(mod.a)
## lm(formula = APGAR ~ GESTAT, data = apgar)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 72.044 1 72.044 0.276 22.116 0
## Error 188.939 58 3.258
## Corr Total 260.983 59 4.423
##
## RMSE AdjEtaSq
## 1.805 0.264
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) -0.930 1.636 -0.569 1.054 0.006 NA -4.204 2.344 0.572
## GESTAT 0.205 0.044 4.703 72.044 0.276 NA 0.118 0.292 0.000
mod.c<- lm(APGAR~1, data=apgar)
mcSummary(mod.c)
## lm(formula = APGAR ~ 1, data = apgar)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0.000 0 Inf 0
## Error 260.983 59 4.423
## Corr Total 260.983 59 4.423
##
## RMSE AdjEtaSq
## 2.103 0
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.272 24.614 2680.017 0.911 NA 6.14 7.227 0
SSE(A)= 188.939 SSE(C)= 260.983 PRE= .276 t= 4.703
Longer gestation periods are significantly associated with higher apgar scores.
apgar$gestat_mean<- apgar$GESTAT-mean(apgar$GESTAT)
mod.a<- lm(APGAR~gestat_mean, data=apgar)
mcSummary(mod.a)
## lm(formula = APGAR ~ gestat_mean, data = apgar)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 72.044 1 72.044 0.276 22.116 0
## Error 188.939 58 3.258
## Corr Total 260.983 59 4.423
##
## RMSE AdjEtaSq
## 1.805 0.264
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.233 28.683 2680.017 0.934 NA 6.217 7.150 0
## gestat_mean 0.205 0.044 4.703 72.044 0.276 NA 0.118 0.292 0
Here, the intercept is 6.683, or the predicted APGAR score for an average baby controlling for gestation period (or a baby with an average gestation period). The slope is the same – the increase in APGAR score has gestation increases one week.
apgar$gestat_days<- apgar$GESTAT*7
mod.a<- lm(APGAR~gestat_days, data=apgar)
mcSummary(mod.a)
## lm(formula = APGAR ~ gestat_days, data = apgar)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 72.044 1 72.044 0.276 22.116 0
## Error 188.939 58 3.258
## Corr Total 260.983 59 4.423
##
## RMSE AdjEtaSq
## 1.805 0.264
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) -0.930 1.636 -0.569 1.054 0.006 NA -4.204 2.344 0.572
## gestat_days 0.029 0.006 4.703 72.044 0.276 NA 0.017 0.042 0.000
Here, the intercept is the same. the predicted APGAR score for someone who had a gestation period of 0 days. The slope is now (still significantly associated with apgar) but is the amount an apgar score is expected to increase as gestation period increases one day (much smaller).
mod 1 - normal
\(apgar_i= -.930 + .205*gestat\)
mod 2 - mean centered
\(apgar_i= 6.683 + .205*gestatmean\)
mod 3 - with gestat as days not weeks
\(apgar_i= -.930 + .029*gestatdays\)
# mod 1=
-.930 + .205*(35)
## [1] 6.245
# mod 2=
6.683 + .205*(35-mean(apgar$GESTAT))
## [1] 6.249083
# mod 3=
-.930 + .029*(35*7)
## [1] 6.175
by using all three of our different models (that had slightly different interpretations), we get the same predicted apgar score of someone who gestated 35 weeks. so you can use any model but make sure to plug in the value that corresponds to your units (i.e num. of weeks that differed from the mean or gestation period in days, not weeks)
apgar$apgar_7<- apgar$APGAR-7
mod.smokes<- lm(apgar_7~SMOKES, data=apgar) # 0 = did not smoke, 1= smoked
mcSummary(mod.smokes)
## lm(formula = apgar_7 ~ SMOKES, data = apgar)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 41.000 1 41.000 0.157 10.81 0.002
## Error 219.983 58 3.793
## Corr Total 260.983 59 4.423
##
## RMSE AdjEtaSq
## 1.948 0.143
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 0.182 0.294 0.619 1.455 0.007 NA -0.406 0.770 0.538
## SMOKES -1.869 0.569 -3.288 41.000 0.157 NA -3.007 -0.731 0.002
apgar$no_smoke<- ifelse(apgar$SMOKES==1, 0, 1) # just flipping so if SMOKES=1, recode to 0, make everything else 1. we know there are no other values so this is ok here.
apgar$apgar_7<- apgar$APGAR-7
mod.nosmokes<- lm(apgar_7~no_smoke, data=apgar) # 0 = did not smoke, 1= smoked
mcSummary(mod.nosmokes)
## lm(formula = apgar_7 ~ no_smoke, data = apgar)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 41.000 1 41.000 0.157 10.81 0.002
## Error 219.983 58 3.793
## Corr Total 260.983 59 4.423
##
## RMSE AdjEtaSq
## 1.948 0.143
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) -1.688 0.487 -3.466 45.563 0.172 NA -2.662 -0.713 0.001
## no_smoke 1.869 0.569 3.288 41.000 0.157 NA 0.731 3.007 0.002
The first model has an intercept of .182. That is the mean difference in apgar score from 7 for kids in our sample who’s mom did not smoke (when SMOKES=0). because 0 is in the CI, we fail to reject the null that kids without smoking mother’s have an APGAR score of 7. These kids do have an APGAR score similar to 7 (mean apgar=7.1818).
The second model has an intercept of -1.688. That is the mean difference in apgar score from 7 for kids in our sample who’s mom did smoke (when no_smoke=0). because 0 is not in the CI, we reject the null that kids without smoking mother’s have an APGAR score of 7. These kids have a significantly different APGAR score from 7 (mean apgar=5.3125).
let’s plot means and relationships based on the categorical predictor. first, make SMOKES be non-numeric so it’s easier for GGplot to understand what we want to do.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
apgar$SMOKES_cat<- ifelse(apgar$SMOKES==1, "Smoker", "Non-smoker")
ggplot(data=apgar, aes(y=APGAR, x=SMOKES_cat, fill=SMOKES_cat)) +
geom_bar(position="dodge", stat="identity")+
ggtitle("Distribution of APGAR scores for smoking moms and non smoking moms")+
xlab("Mother who smoked")
ggplot(data=apgar, aes(x=GESTAT, y=APGAR, group=SMOKES_cat, color=SMOKES_cat)) +
geom_line()
this graph is weird because those in the no smoke category had a very different range of GESTAT scores.
obviously I’m not going to do this but think about doing this with your own data because it will be helpful for the final project