library(tidyverse)
library(dplyr)
library(dslabs)

Introduction

This document was composed from Dr. Snopkowski’s ANTH 504 Week 8 lecture and Rafael A. Irizarry’s 2019 Introduction to Data Science: Data Analysis and Prediction Algorithms with R Chapter 21-22.

Data Management

•The initial step in data analysis is typically getting the data and then converting it from its raw form to the tidy form that allows us to complete the rest of the analysis.

• This is also referred to as: data management, data cleaning, data preparation, pre-processing

• 90% of your time will be spent in data management; 10% will be in analysis

• Take good notes

• Important to be organized to know what you need and what your ultimate goal is

Example Data

Indonesia Family Life Survey

The Indonesian Family Life Survey (IFLS) is an on-going longitudinal survey in Indonesia. The sample is representative of about 83% of the Indonesian population and contains over 30,000 individuals living in 13 of the 27 provinces in the country. The map below identifies the 13 IFLS provinces in the IFLS.

The first wave of the IFLS (IFLS1) was conducted in 1993/94 by RAND in collaboration with Lembaga Demografi, University of Indonesia. IFLS2 and IFLS2+ were conducted in 1997 and 1998, respectively, by RAND in collaboration with UCLA and Lembaga Demografi, University of Indonesia. IFLS2+ covered a 25% sub-sample of the IFLS households. IFLS3, which was fielded in 2000 and covered the full sample, was conducted by RAND in collaboration with the Population Research center, University of Gadjah Mada. The fourth wave of the IFLS (IFLS4), fielded in 2007/2008 covering the full sample, was conducted by RAND, the center for Population and Policy Studies (CPPS) of the University of Gadjah Mada and Survey METRE. The fifth wave of the IFLS (IFLS-5) was fielded 2014-15.

IFLS Questionnaire

Motivating Problem

We are interested in the factors that affect family size in Indonesia.

We decide that we are only interested in women over 20.

Things that likely affect fertility levels include:

• Education of woman

• Her age (especially as countries have gone through the Demographic Transition)

• Religion (?)

Things you might want to explore

• Does her age at menarche affect fertility?

• Does the value of her dowry affect fertility?

Importing Data

• While you can import SPSS datafiles from the Data -> Import section of your Rstudio console, let’s instead import it using the foreign package, read.spss() so that we can get the variable labels

# install.packages('foreign')
library(foreign)

IFLS_birth <- read.spss("IFLS_birth_info.sav", to.data.frame=T)
## re-encoding from CP1252
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br03
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br04
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br06
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br07
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br09
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br10
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 2 added in variable: br16a
birth.labels <- as.data.frame(attr(IFLS_birth, "variable.labels"))

head(birth.labels)
##        attr(IFLS_birth, "variable.labels")
## hhid97            Household ID (1997 wave)
## pid97             Person identifier (1997)
## br00x      IVWR CK: Is Respondent Panel or
## br00a         IVWR CK: Pre-printer roster?
## br01            Have you ever given birth?
## br02       Any biological children live w/

Using read.spss() to import

If we import using the Import Dataset button, our data will load without the variable labels.

IFLS_HH <- read.spss("IFLS_HH_data.sav", to.data.frame=T)
## re-encoding from CP1252
## Warning in read.spss("IFLS_HH_data.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 added in variable: ar10
## Warning in read.spss("IFLS_HH_data.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17 added in variable:
## ar11
## Warning in read.spss("IFLS_HH_data.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3 added in variable: ar12
## Warning in read.spss("IFLS_HH_data.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17 added in variable:
## ar14
HH.labels <- as.data.frame(attr(IFLS_HH, "variable.labels"))

head(HH.labels)
##        attr(IFLS_HH, "variable.labels")
## hhid97         Household ID (1997 wave)
## pid97          Person identifier (1997)
## ar02    Relation to HH head (preprt IFL
## ar01a                     Member of HH?
## ar01b         Respondent to be tracked?
## ar02b               Relation to HH head

to.data.frame=T Puts the file into a data frame

The Warnings are normal

IFLS_menarche <- read.spss("IFLS_menarche.sav", to.data.frame=T)
## re-encoding from CP1252
## Warning in read.spss("IFLS_menarche.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3, 4, 5, 6, 7, 10 added in variable: kw26
## Warning in read.spss("IFLS_menarche.sav", to.data.frame = T): Undeclared
## level(s) 0, 1, 2, 3, 4, 5, 6 added in variable: kw27a
## Warning in read.spss("IFLS_menarche.sav", to.data.frame = T): Undeclared
## level(s) 0, 1, 2, 3, 4, 5, 6 added in variable: kw27b
menarche.labels <- as.data.frame(attr(IFLS_menarche, "variable.labels"))

head(menarche.labels)
##        attr(IFLS_menarche, "variable.labels")
## hhid97               Household ID (1997 wave)
## pid97                Person identifier (1997)
## kw02            What is your marital status ?
## kw02b          Taken pill Zat Besi last 4 wks
## kw02cx         Number of pills Zat Besi (able
## kw02c          Number of pills Zat Besi taken

Joining Tables (Ch. 22 in Irizarry)

• Frequently, you’ll have to join tables. We’ve already seen some of these functions before

• In general, you need to identify one or more columns that match the two tables (sometimes referred to as the “key” variable)

• To see how these different “join” functions work, we’ll subset our tables to see how they work.

data(murders)
data(polls_us_election_2016)
tab1 <- slice(murders, 1:6) %>% select(state, population) 
tab2 <- results_us_election_2016 %>% filter(state %in% 
c("Alabama", "Alaska", "Arizona", "California", "Connecticut", "Delaware")) %>% select(state, 
electoral_votes) 
head(tab1)
##        state population
## 1    Alabama    4779736
## 2     Alaska     710231
## 3    Arizona    6392017
## 4   Arkansas    2915918
## 5 California   37253956
## 6   Colorado    5029196
head(tab2)
##         state electoral_votes
## 1  California              55
## 2     Arizona              11
## 3     Alabama               9
## 4 Connecticut               7
## 5      Alaska               3
## 6    Delaware               3

Join

• Left join – adding a 2nd table for the elements that match in the first table.

left_join(tab1, tab2, by="state")
##        state population electoral_votes
## 1    Alabama    4779736               9
## 2     Alaska     710231               3
## 3    Arizona    6392017              11
## 4   Arkansas    2915918              NA
## 5 California   37253956              55
## 6   Colorado    5029196              NA

• Right join – adding a table where we keep the rows of the second table.

right_join(tab1, tab2, by="state")
##         state population electoral_votes
## 1     Alabama    4779736               9
## 2      Alaska     710231               3
## 3     Arizona    6392017              11
## 4  California   37253956              55
## 5 Connecticut         NA               7
## 6    Delaware         NA               3

• Inner join – keep only the rows that have information in both tables

inner_join(tab1, tab2, by="state")
##        state population electoral_votes
## 1    Alabama    4779736               9
## 2     Alaska     710231               3
## 3    Arizona    6392017              11
## 4 California   37253956              55

• Full join – keep all the rows (of both tables) filling in missing parts with NAs

full_join(tab1, tab2, by="state")
##         state population electoral_votes
## 1     Alabama    4779736               9
## 2      Alaska     710231               3
## 3     Arizona    6392017              11
## 4    Arkansas    2915918              NA
## 5  California   37253956              55
## 6    Colorado    5029196              NA
## 7 Connecticut         NA               7
## 8    Delaware         NA               3

• Semi join – keep the part of the first table for which we have information in the 2nd.

semi_join(tab1, tab2, by="state")
##        state population
## 1    Alabama    4779736
## 2     Alaska     710231
## 3    Arizona    6392017
## 4 California   37253956

• Anti join– keep elements of the first table for which there is no information in the 2nd

anti_join(tab1, tab2, by="state")
##      state population
## 1 Arkansas    2915918
## 2 Colorado    5029196

Set operators

There are a variety of set operators which are also useful for managing our data tables. Let’s create these two tables so we can play around with these operators

tab <- left_join(murders, 
results_us_election_2016, by = "state") %>% 
select(-others) %>% rename(ev = 
electoral_votes) 
table1 <- tab[1:5,] 
table2 <- tab[3:7,] 
head(table1)
##        state abb region population total ev clinton trump
## 1    Alabama  AL  South    4779736   135  9    34.4  62.1
## 2     Alaska  AK   West     710231    19  3    36.6  51.3
## 3    Arizona  AZ   West    6392017   232 11    45.1  48.7
## 4   Arkansas  AR  South    2915918    93  6    33.7  60.6
## 5 California  CA   West   37253956  1257 55    61.7  31.6
head(table2)
##         state abb    region population total ev clinton trump
## 3     Arizona  AZ      West    6392017   232 11    45.1  48.7
## 4    Arkansas  AR     South    2915918    93  6    33.7  60.6
## 5  California  CA      West   37253956  1257 55    61.7  31.6
## 6    Colorado  CO      West    5029196    65  9    48.2  43.3
## 7 Connecticut  CT Northeast    3574097    97  7    54.6  40.9

Set operators

Using our two tables, try to determine what each set operator does

Intersect – intersect(table1, table2) (note: you may need to use dplyr::intersect)

Union – union(table1, table2)

Setdiff – setdiff(table1, table2)

Intersect – returns the rows in common between two tables.

intersect(table1, table2) 
##        state abb region population total ev clinton trump
## 1    Arizona  AZ   West    6392017   232 11    45.1  48.7
## 2   Arkansas  AR  South    2915918    93  6    33.7  60.6
## 3 California  CA   West   37253956  1257 55    61.7  31.6

This also works with vectors

intersect(1:10, 6:15)
## [1]  6  7  8  9 10
intersect(c("a", "b", "c"), c("b", "c", "d")) 
## [1] "b" "c"

It gives you what is in common.

Union – returns all rows of the two tables with the same column names

union(table1, table2)
##         state abb    region population total ev clinton trump
## 1     Alabama  AL     South    4779736   135  9    34.4  62.1
## 2      Alaska  AK      West     710231    19  3    36.6  51.3
## 3     Arizona  AZ      West    6392017   232 11    45.1  48.7
## 4    Arkansas  AR     South    2915918    93  6    33.7  60.6
## 5  California  CA      West   37253956  1257 55    61.7  31.6
## 6    Colorado  CO      West    5029196    65  9    48.2  43.3
## 7 Connecticut  CT Northeast    3574097    97  7    54.6  40.9

Returns the rows with the same column names

Setdiff – The rows in the first table that don’t exist in the 2nd (set difference between first and second argument)

setdiff(table1, table2)
##     state abb region population total ev clinton trump
## 1 Alabama  AL  South    4779736   135  9    34.4  62.1
## 2  Alaska  AK   West     710231    19  3    36.6  51.3

Returns the columns in the first column that do not occur in the second table.

Merging Tables / Dataframes

• How can we merge our dataframes together?

• We need to have a “key” variable. Let’s make one using the hhid97 and pid97 variables.

sapply(IFLS_HH, class)
##      hhid97       pid97        ar02       ar01a       ar01b       ar02b 
##   "numeric"   "numeric"    "factor"    "factor"    "factor"    "factor" 
##        ar07    ar08mthx     ar08mth     ar08yrx      ar08yr        ar09 
##    "factor"    "factor"   "numeric"    "factor"   "numeric"   "numeric" 
##        ar10        ar11        ar12        ar13        ar14        ar15 
##    "factor"    "factor"    "factor"    "factor"    "factor"    "factor" 
##       ar15a      ar15bx       ar15b        ar16        ar17       ar18a 
##    "factor"    "factor"   "numeric"    "factor"    "factor"    "factor" 
##       ar18b       ar18c       ar18d    ar18emtx    ar18emth    ar18eyrx 
##    "factor"    "factor"    "factor"    "factor"    "factor"    "factor" 
##     ar18eyr       ar18f       ar18g       ar18h       ar18i       ar18j 
##   "numeric"    "factor"    "factor"    "factor"    "factor"    "factor" 
##      age_97    res97b3a    res97b3b    res97b3p     res97b4     res97b5 
##   "numeric"    "factor"    "factor"    "factor"    "factor"    "factor" 
##      panel3      panel4     version 
##    "factor"    "factor" "character"
sapply(IFLS_birth, class)
##      hhid97       pid97       br00x       br00a        br01        br02 
##   "numeric"   "numeric"    "factor"    "factor"    "factor"    "factor" 
##        br03        br04        br05        br06        br07        br08 
##    "factor"    "factor"    "factor"    "factor"    "factor"    "factor" 
##        br09        br10        br11        br12        br13        br14 
##    "factor"    "factor"    "factor"    "factor"    "factor"    "factor" 
##        br15       br15k        br16       br16k       br16a     version 
##   "numeric"    "factor"   "numeric"    "factor"    "factor" "character"
IFLS_birth <- IFLS_birth %>% mutate(pid = paste(hhid97, pid97))
head(IFLS_birth$pid)
## [1] "12200 2" "12900 2" "20100 2" "20100 3" "20200 2" "20300 5"
IFLS_menarche <- IFLS_menarche %>% mutate(pid = paste(hhid97, pid97))
head(IFLS_menarche$pid)
## [1] "12200 2" "12900 2" "20100 2" "20100 3" "20200 2" "20300 5"
IFLS_HH <- IFLS_HH %>% mutate(pid = paste(hhid97, pid97))
head(IFLS_HH$pid)
## [1] "10600 2"  "10600 3"  "10800 2"  "10800 4"  "10800 11" "10800 12"
IFLS <- full_join(IFLS_HH, IFLS_birth, by="pid")
IFLS <- full_join(IFLS, IFLS_menarche, by="pid")

Selecting individuals & variables you want in your analyses

Now we have all the data together. Based on our problem, the variables and people we want to include: ar16: highest level of education, ar09: age, ar15: religion, kb12b: dowry, and kw23b: age at menarche.

Recoding Data

Let’s look at each variable we are interested in and determine how to recode them.

  1. br15: total number of births
table(IFLS$br15)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12 
## 450 872 432 206 131  70  36  21  10   7   3   1   1

These look like they are coded ok.

  1. ar16: education
table(IFLS$ar16)
## 
##       01. None       02. Elem   03. Gen JrHi   04. Voc JrHi   05. Gen SrHi 
##           2608           5868           1824            256           1627 
##   06. Voc SrHi  07. Dipl(1/2)    08. Dipl(3)       09. Univ      10. Other 
##           1263            131            214            470             16 
##  11. Adlt Ed A  12. Adlt Ed B  13. Open Univ  14. Islam Scl 17. Disabl Scl 
##              0              3              0              2              0 
##   70. Madrasah         90. KG         98. DK       99. Miss 
##              0              0             59              0

It makes sense that None should be lowest. Then it is order elementary, general junior high, vocational junior high, general high school, vocational high school, etc. Everything from 10 and above are probably not levels of higher education.It will make sense for us to recode this variable so that anyone with a 10 or higher is missing.

  1. ar09: birth year
table(IFLS$ar09)
## 
##   0   3  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29 
##   1   1   2   2  16 463 488 534 547 450 476 461 448 356 385 404 322 352 305 391 
##  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49 
## 387 238 326 226 377 333 236 255 208 309 289 168 206 175 265 216 125 155  95 186 
##  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69 
## 171  85 145  87 258 213 109 147  83 180 173  70  89  58 201 124  54  94  37 115 
##  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89 
## 104  23  39  21 117  49  18  28  13  37  34   7  12   7  50  17   8   3   1  11 
##  90  91  92  93  94  95  96  97  98 100 115 120 998 
##  15   1   1   1  11   3   1   2  13   6   1   1  11

Here are all the possible values that people entered for their age. The first row is the age and the number below is how many their are. 998 is a missing code. Is there someone who is 115 or 120? We could code it so that greater than 100 goes to missing.

  1. ar15: religion
table(IFLS$ar15)
## 
##  01. Islam  02. Prot.  03. Cath.  04. Hindu 05. Buddh.  06. Other     98. DK 
##      12491        702        294        655        157         30         12 
##   99. Miss 
##          0

We may want to recode 98 to missing (don’t know).

  1. kw12b: dowry
hist(IFLS$kw12b)

Since this looks really ugly.

max(IFLS$kw12b)
## [1] NA

The max value is NA which does us not good here.

max(IFLS$kw12b, na.rm=T)
## [1] 1e+07

This is the largest value.

table(IFLS$kw14x)
## 
##           01. Rupiah           02. Gulden            03. Perak 
##                 5350                    0                    0 
##             04. Tali           05. Kethip          06. Ringgit 
##                    0                    0                    0 
##         07. Cent/Sen             08. Suku    09. Kepeng Belong 
##                    0                    1                    0 
##             10. Gelo    11. Grams of gold       12. Kg of rice 
##                    0                    1                    0 
## 90. Foreign currency         95. Top code              96. N/A 
##                    0                    0                    0 
##           97. Refuse               98. DK          99. Missing 
##                    0                  801                    7

This measures the currency. For the most part every one is using Rupiah. When looking at money, we usually want to take the log transformation.

hist(log10(IFLS$kw12b + 1))

We add 1 because if anyone has a value between 0 and 1, it can’t take the log of that. This is more normal.

min(IFLS$kw12b, na.rm=T)
## [1] 0

There are some individuals with the value of 0.

  1. kw23b: age at first menstruation
table(IFLS$kw23b)
## 
##    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22 
##    5   11   33  103  124  800  963 1159 1697  549  398  148   34   46    6    4 
##   23   24   25   26   27 
##    1    2    6    2    2

The age at first menstruation runs from 7 to 27. That seems late but is perhaps not impossible. This is fine.

IFLS <- IFLS %>% 
  mutate(educ = ifelse(as.numeric(ar16) >= 10, NA, ar16), 
         religion = ifelse(ar15 == 98, NA, ar15),
         log_dowry = log10(kw12b + 1)) %>% 
  filter(as.numeric(ar09) >= 20 & as.numeric(ar09) <= 100) %>% 
  select(br15, educ, ar09, religion, log_dowry, kw23b)
head(IFLS)
##   br15 educ ar09 religion log_dowry kw23b
## 1   NA    2   42        2        NA    NA
## 2   NA    1   23        2        NA    NA
## 3   NA    2   32        2  6.164353    15
## 4   NA    3   44        2        NA    NA
## 5   NA    2   20        2        NA    NA
## 6   NA    3   26        2        NA    NA

We filter the rows that we want and select the columns that we want.

Running model

Let’s start by running our model predicting br15 (number of live births) using the predictors: educ, yr_birth, log_dowry, kw23b

Model selection

A common problem is trying to figure out which variables to include in a model. There are a variety of strategies to try to solve this:

• Forced Entry: All predictors are entered simultaneously.

• Hierarchical: Experimenter decides the order in which variables are entered into the model and checks the model fit using AIC / BIC

• Stepwise: Predictors are added or removed based on AIC values automatically

AIC stands for Akaike information criterion. The AIC value takes the residual sum of squares (which will go down as you add more predictors) and then adds 2*K – where k is the number of predictors.

Lower AIC scores are better. So, AIC takes into account that as you add predictors, you get a model with less error by penalizing models with more predictors. So, a predictor will only be added (if you use this method) if it improves the model by a “substantial” amount.

Forced Entry Regression

• All variables are entered into the model simultaneously.

• The results obtained depend on the variables entered into the model.

• It is important, therefore, to have good theoretical reasons for including a particular variable.

Hierarchical Regression

Known predictors (based on past research) are entered into the regression model first.

• New predictors are then entered in a separate step/block.

• Experimenter makes the decisions.

• It is the best method:

• Based on theory testing.

• You can see the unique predictive influence of a new variable on the outcome because known predictors are held constant in the model.

• Possible con:

• Relies on the experimenter knowing what they’re doing

IFLS_nomiss <- IFLS %>% filter(!is.na(log_dowry))
#Using education and age to predict the number of live births
model1 <- lm(br15 ~ educ + ar09, data=IFLS_nomiss)
summary(model1)
## 
## Call:
## lm(formula = br15 ~ educ + ar09, data = IFLS_nomiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4071 -0.8564 -0.1399  0.6686  8.8558 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.527598   0.157244  -3.355 0.000809 ***
## educ        -0.095766   0.018185  -5.266 1.56e-07 ***
## ar09         0.085852   0.004442  19.326  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.48 on 1782 degrees of freedom
##   (3394 observations deleted due to missingness)
## Multiple R-squared:  0.1951, Adjusted R-squared:  0.1942 
## F-statistic:   216 on 2 and 1782 DF,  p-value: < 2.2e-16

Education (educ) is a significant predictor because the p-value is less than 0.05 (p=1.56e-07) The number of births decreases with more education (t value = -5.266). Age (ar09) is a significant predictor because the p-value is less than 0.05 (p=2e-16) As individuals get older, the number of births increases (t value = 19.326)

#Using education, age, and dowry to predict the number of live births
model2 <- lm(br15 ~ educ + ar09 + log_dowry, data=IFLS_nomiss)
summary(model2)
## 
## Call:
## lm(formula = br15 ~ educ + ar09 + log_dowry, data = IFLS_nomiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1945 -0.8338 -0.1384  0.6818  8.8520 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01424    0.24963  -0.057  0.95451    
## educ        -0.07936    0.01919  -4.136 3.69e-05 ***
## ar09         0.08294    0.00457  18.148  < 2e-16 ***
## log_dowry   -0.11114    0.04202  -2.645  0.00824 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.478 on 1781 degrees of freedom
##   (3394 observations deleted due to missingness)
## Multiple R-squared:  0.1983, Adjusted R-squared:  0.1969 
## F-statistic: 146.8 on 3 and 1781 DF,  p-value: < 2.2e-16

Education (educ) is a significant predictor because the p-value is less than 0.05 (p=3.69e-05) The number of births decreases with more education (t value = -4.136). Age (ar09) is a significant predictor because the p-value is less than 0.05 (p=2e-16) As individuals get older, the number of births increases (t value = 18.148) The dowry is a significant predictor because the p-value is less than 0.05 (p=0.00824) As the dowry increases, the number of births decreases (-2.645)

The way we determine if the second model is an improvement over the first is by using anova() (This is not aov())

anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: br15 ~ educ + ar09
## Model 2: br15 ~ educ + ar09 + log_dowry
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1   1782 3904.8                                
## 2   1781 3889.5  1    15.278 6.9958 0.008242 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This returns showing that that second model is improved over the first model (p=0.008242).

Stepwise Regression I

• Variables are entered into the model based on mathematical criteria.

• Computer selects variables in steps

fullmodel <- lm(br15 ~ educ + ar09 + log_dowry + kw23b, data=IFLS_nomiss)
IFLS_nomiss <- IFLS %>% filter(!is.na(log_dowry) & !is.na(kw23b))
model <- lm(br15 ~ educ + ar09 + log_dowry + kw23b, data=IFLS_nomiss)
step(fullmodel, direction="backward")
## Start:  AIC=1389.16
## br15 ~ educ + ar09 + log_dowry + kw23b
## 
##             Df Sum of Sq    RSS    AIC
## - kw23b      1      0.06 3859.1 1387.2
## <none>                   3859.0 1389.2
## - log_dowry  1     14.39 3873.4 1393.8
## - educ       1     38.59 3897.6 1404.8
## - ar09       1    726.53 4585.6 1692.8
## 
## Step:  AIC=1387.19
## br15 ~ educ + ar09 + log_dowry
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   3859.1 1387.2
## - log_dowry  1     14.40 3873.5 1391.8
## - educ       1     38.53 3897.6 1402.8
## - ar09       1    726.95 4586.0 1691.0
## 
## Call:
## lm(formula = br15 ~ educ + ar09 + log_dowry, data = IFLS_nomiss)
## 
## Coefficients:
## (Intercept)         educ         ar09    log_dowry  
##    -0.04002     -0.08077      0.08363     -0.10837

Number of births br15 is predicted by education educ + age ar09 + log size of dowry log_dowry + age of menarche kw23b(AIC =1389.2). In the first step, it looks at what happens if we take out one variable. Taking out age of menarche improves the AIC by a little bit (1387.2). If we take out log size of dowry, the AIC gets worse (1393.8). If we take out education, the AIC gets a lot worse (1404.8). If we take out age, the AIC is the worst of all (1692.8). By “Worse” we mean, we want the smallest possible value. It is ordered from the lowest to the highest AIC where is not as good as removing age of menarche. This is why - kw23b is listed first.

Both starts with all the variables, kicks one out then looks to add one back in.This catches suppressor effects when you need both things in to actually see it.

step(fullmodel, direction="both")
## Start:  AIC=1389.16
## br15 ~ educ + ar09 + log_dowry + kw23b
## 
##             Df Sum of Sq    RSS    AIC
## - kw23b      1      0.06 3859.1 1387.2
## <none>                   3859.0 1389.2
## - log_dowry  1     14.39 3873.4 1393.8
## - educ       1     38.59 3897.6 1404.8
## - ar09       1    726.53 4585.6 1692.8
## 
## Step:  AIC=1387.19
## br15 ~ educ + ar09 + log_dowry
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   3859.1 1387.2
## + kw23b      1      0.06 3859.0 1389.2
## - log_dowry  1     14.40 3873.5 1391.8
## - educ       1     38.53 3897.6 1402.8
## - ar09       1    726.95 4586.0 1691.0
## 
## Call:
## lm(formula = br15 ~ educ + ar09 + log_dowry, data = IFLS_nomiss)
## 
## Coefficients:
## (Intercept)         educ         ar09    log_dowry  
##    -0.04002     -0.08077      0.08363     -0.10837

We end up with the same result as backward. It just gave us the possibility of adding other back in.

#forward
full_model <- formula(lm(br15 ~ educ + ar09 + log_dowry + 
kw23b, data=na.omit(IFLS)))
intercept_only <- lm(br15 ~ 1, data=na.omit(IFLS))
step(intercept_only, direction="forward", scope=full_model)
## Start:  AIC=1778.41
## br15 ~ 1
## 
##             Df Sum of Sq    RSS    AIC
## + ar09       1    893.74 3935.1 1417.7
## + log_dowry  1    192.41 4636.4 1708.4
## + educ       1    129.70 4699.1 1732.2
## <none>                   4828.8 1778.4
## + kw23b      1      1.98 4826.8 1779.7
## 
## Step:  AIC=1417.73
## br15 ~ ar09
## 
##             Df Sum of Sq    RSS    AIC
## + educ       1    61.561 3873.5 1391.8
## + log_dowry  1    37.426 3897.6 1402.8
## <none>                   3935.1 1417.7
## + kw23b      1     0.026 3935.0 1419.7
## 
## Step:  AIC=1391.79
## br15 ~ ar09 + educ
## 
##             Df Sum of Sq    RSS    AIC
## + log_dowry  1   14.3991 3859.1 1387.2
## <none>                   3873.5 1391.8
## + kw23b      1    0.0686 3873.4 1393.8
## 
## Step:  AIC=1387.19
## br15 ~ ar09 + educ + log_dowry
## 
##         Df Sum of Sq    RSS    AIC
## <none>               3859.1 1387.2
## + kw23b  1    0.0606 3859.0 1389.2
## 
## Call:
## lm(formula = br15 ~ ar09 + educ + log_dowry, data = na.omit(IFLS))
## 
## Coefficients:
## (Intercept)         ar09         educ    log_dowry  
##    -0.04002      0.08363     -0.08077     -0.10837

The code is trickier for forward. na.omit() helps remove NA.

Stepwise methods

• Forward

• Add predictor that improves the model most

• Then add the predictor that improves the model 2nd most (after first predictor is already included)

• May miss out on effect that only occurs when another predictor is in the model.

• Backward

• Include all variables

• Exclude variable that improves AIC most

• Both

• Include all variables

• Exclude variable that improves AIC most

• Then check to either add or remove the variable that improves AIC the most

Which method of regression should I use?

Stepwise should be avoided except for exploratory model building. Don’t P-Hack!

Use backward rather than forward to minimize suppressor effects (when a predictor has a significant effect but only when another variable is held constant)

Categorical Variables

Religion is a categorical variable.

How can we include categorical variables in our multiple regression models?

You throw it in in the same way.

l1 <- lm(br15 ~ educ + ar09 + as.factor(religion) + 
           log_dowry + kw23b, data=IFLS)
model1 <- lm(br15 ~ ., data=IFLS)
summary(model1)
## 
## Call:
## lm(formula = br15 ~ ., data = IFLS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2663 -0.8360 -0.1443  0.6837  8.8344 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.022904   0.363904   0.063   0.9498    
## educ        -0.081167   0.019265  -4.213 2.64e-05 ***
## ar09         0.083830   0.004600  18.224  < 2e-16 ***
## religion    -0.040687   0.079239  -0.513   0.6077    
## log_dowry   -0.104553   0.042855  -2.440   0.0148 *  
## kw23b       -0.002845   0.018571  -0.153   0.8782    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.478 on 1766 degrees of freedom
##   (10049 observations deleted due to missingness)
## Multiple R-squared:  0.2009, Adjusted R-squared:  0.1987 
## F-statistic: 88.82 on 5 and 1766 DF,  p-value: < 2.2e-16

Interpreting Categorical Variables

audio 45:00