library(tidyverse)
library(dplyr)
library(dslabs)
This document was composed from Dr. Snopkowski’s ANTH 504 Week 8 lecture and Rafael A. Irizarry’s 2019 Introduction to Data Science: Data Analysis and Prediction Algorithms with R Chapter 21-22.
•The initial step in data analysis is typically getting the data and then converting it from its raw form to the tidy form that allows us to complete the rest of the analysis.
• This is also referred to as: data management, data cleaning, data preparation, pre-processing
• 90% of your time will be spent in data management; 10% will be in analysis
• Take good notes
• Important to be organized to know what you need and what your ultimate goal is
The Indonesian Family Life Survey (IFLS) is an on-going longitudinal survey in Indonesia. The sample is representative of about 83% of the Indonesian population and contains over 30,000 individuals living in 13 of the 27 provinces in the country. The map below identifies the 13 IFLS provinces in the IFLS.
The first wave of the IFLS (IFLS1) was conducted in 1993/94 by RAND in collaboration with Lembaga Demografi, University of Indonesia. IFLS2 and IFLS2+ were conducted in 1997 and 1998, respectively, by RAND in collaboration with UCLA and Lembaga Demografi, University of Indonesia. IFLS2+ covered a 25% sub-sample of the IFLS households. IFLS3, which was fielded in 2000 and covered the full sample, was conducted by RAND in collaboration with the Population Research center, University of Gadjah Mada. The fourth wave of the IFLS (IFLS4), fielded in 2007/2008 covering the full sample, was conducted by RAND, the center for Population and Policy Studies (CPPS) of the University of Gadjah Mada and Survey METRE. The fifth wave of the IFLS (IFLS-5) was fielded 2014-15.
We are interested in the factors that affect family size in Indonesia.
We decide that we are only interested in women over 20.
Things that likely affect fertility levels include:
• Education of woman
• Her age (especially as countries have gone through the Demographic Transition)
• Religion (?)
Things you might want to explore
• Does her age at menarche affect fertility?
• Does the value of her dowry affect fertility?
• While you can import SPSS datafiles from the
Data -> Import section of your Rstudio console, let’s
instead import it using the foreign package, read.spss() so
that we can get the variable labels
# install.packages('foreign')
library(foreign)
IFLS_birth <- read.spss("IFLS_birth_info.sav", to.data.frame=T)
## re-encoding from CP1252
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br03
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br04
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br06
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br07
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br09
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 0 added in variable: br10
## Warning in read.spss("IFLS_birth_info.sav", to.data.frame = T): Undeclared
## level(s) 2 added in variable: br16a
birth.labels <- as.data.frame(attr(IFLS_birth, "variable.labels"))
head(birth.labels)
## attr(IFLS_birth, "variable.labels")
## hhid97 Household ID (1997 wave)
## pid97 Person identifier (1997)
## br00x IVWR CK: Is Respondent Panel or
## br00a IVWR CK: Pre-printer roster?
## br01 Have you ever given birth?
## br02 Any biological children live w/
read.spss() to importIf we import using the Import Dataset button, our data will load without the variable labels.
IFLS_HH <- read.spss("IFLS_HH_data.sav", to.data.frame=T)
## re-encoding from CP1252
## Warning in read.spss("IFLS_HH_data.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 added in variable: ar10
## Warning in read.spss("IFLS_HH_data.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17 added in variable:
## ar11
## Warning in read.spss("IFLS_HH_data.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3 added in variable: ar12
## Warning in read.spss("IFLS_HH_data.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17 added in variable:
## ar14
HH.labels <- as.data.frame(attr(IFLS_HH, "variable.labels"))
head(HH.labels)
## attr(IFLS_HH, "variable.labels")
## hhid97 Household ID (1997 wave)
## pid97 Person identifier (1997)
## ar02 Relation to HH head (preprt IFL
## ar01a Member of HH?
## ar01b Respondent to be tracked?
## ar02b Relation to HH head
to.data.frame=T Puts the file into a data frame
The Warnings are normal
IFLS_menarche <- read.spss("IFLS_menarche.sav", to.data.frame=T)
## re-encoding from CP1252
## Warning in read.spss("IFLS_menarche.sav", to.data.frame = T): Undeclared
## level(s) 1, 2, 3, 4, 5, 6, 7, 10 added in variable: kw26
## Warning in read.spss("IFLS_menarche.sav", to.data.frame = T): Undeclared
## level(s) 0, 1, 2, 3, 4, 5, 6 added in variable: kw27a
## Warning in read.spss("IFLS_menarche.sav", to.data.frame = T): Undeclared
## level(s) 0, 1, 2, 3, 4, 5, 6 added in variable: kw27b
menarche.labels <- as.data.frame(attr(IFLS_menarche, "variable.labels"))
head(menarche.labels)
## attr(IFLS_menarche, "variable.labels")
## hhid97 Household ID (1997 wave)
## pid97 Person identifier (1997)
## kw02 What is your marital status ?
## kw02b Taken pill Zat Besi last 4 wks
## kw02cx Number of pills Zat Besi (able
## kw02c Number of pills Zat Besi taken
• Frequently, you’ll have to join tables. We’ve already seen some of these functions before
• In general, you need to identify one or more columns that match the two tables (sometimes referred to as the “key” variable)
• To see how these different “join” functions work, we’ll subset our tables to see how they work.
data(murders)
data(polls_us_election_2016)
tab1 <- slice(murders, 1:6) %>% select(state, population)
tab2 <- results_us_election_2016 %>% filter(state %in%
c("Alabama", "Alaska", "Arizona", "California", "Connecticut", "Delaware")) %>% select(state,
electoral_votes)
head(tab1)
## state population
## 1 Alabama 4779736
## 2 Alaska 710231
## 3 Arizona 6392017
## 4 Arkansas 2915918
## 5 California 37253956
## 6 Colorado 5029196
head(tab2)
## state electoral_votes
## 1 California 55
## 2 Arizona 11
## 3 Alabama 9
## 4 Connecticut 7
## 5 Alaska 3
## 6 Delaware 3
• Left join – adding a 2nd table for the elements that match in the first table.
left_join(tab1, tab2, by="state")
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 55
## 6 Colorado 5029196 NA
• Right join – adding a table where we keep the rows of the second table.
right_join(tab1, tab2, by="state")
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 California 37253956 55
## 5 Connecticut NA 7
## 6 Delaware NA 3
• Inner join – keep only the rows that have information in both tables
inner_join(tab1, tab2, by="state")
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 California 37253956 55
• Full join – keep all the rows (of both tables) filling in missing
parts with NAs
full_join(tab1, tab2, by="state")
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 55
## 6 Colorado 5029196 NA
## 7 Connecticut NA 7
## 8 Delaware NA 3
• Semi join – keep the part of the first table for which we have information in the 2nd.
semi_join(tab1, tab2, by="state")
## state population
## 1 Alabama 4779736
## 2 Alaska 710231
## 3 Arizona 6392017
## 4 California 37253956
• Anti join– keep elements of the first table for which there is no information in the 2nd
anti_join(tab1, tab2, by="state")
## state population
## 1 Arkansas 2915918
## 2 Colorado 5029196
There are a variety of set operators which are also useful for managing our data tables. Let’s create these two tables so we can play around with these operators
tab <- left_join(murders,
results_us_election_2016, by = "state") %>%
select(-others) %>% rename(ev =
electoral_votes)
table1 <- tab[1:5,]
table2 <- tab[3:7,]
head(table1)
## state abb region population total ev clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 California CA West 37253956 1257 55 61.7 31.6
head(table2)
## state abb region population total ev clinton trump
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 California CA West 37253956 1257 55 61.7 31.6
## 6 Colorado CO West 5029196 65 9 48.2 43.3
## 7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
Using our two tables, try to determine what each set operator does
Intersect – intersect(table1, table2) (note: you may need to use dplyr::intersect)
Union – union(table1, table2)
Setdiff – setdiff(table1, table2)
Intersect – returns the rows in common between two tables.
intersect(table1, table2)
## state abb region population total ev clinton trump
## 1 Arizona AZ West 6392017 232 11 45.1 48.7
## 2 Arkansas AR South 2915918 93 6 33.7 60.6
## 3 California CA West 37253956 1257 55 61.7 31.6
This also works with vectors
intersect(1:10, 6:15)
## [1] 6 7 8 9 10
intersect(c("a", "b", "c"), c("b", "c", "d"))
## [1] "b" "c"
It gives you what is in common.
Union – returns all rows of the two tables with the same column names
union(table1, table2)
## state abb region population total ev clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 California CA West 37253956 1257 55 61.7 31.6
## 6 Colorado CO West 5029196 65 9 48.2 43.3
## 7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
Returns the rows with the same column names
Setdiff – The rows in the first table that don’t exist in the 2nd (set difference between first and second argument)
setdiff(table1, table2)
## state abb region population total ev clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
Returns the columns in the first column that do not occur in the second table.
• How can we merge our dataframes together?
• We need to have a “key” variable. Let’s make one using the hhid97 and pid97 variables.
sapply(IFLS_HH, class)
## hhid97 pid97 ar02 ar01a ar01b ar02b
## "numeric" "numeric" "factor" "factor" "factor" "factor"
## ar07 ar08mthx ar08mth ar08yrx ar08yr ar09
## "factor" "factor" "numeric" "factor" "numeric" "numeric"
## ar10 ar11 ar12 ar13 ar14 ar15
## "factor" "factor" "factor" "factor" "factor" "factor"
## ar15a ar15bx ar15b ar16 ar17 ar18a
## "factor" "factor" "numeric" "factor" "factor" "factor"
## ar18b ar18c ar18d ar18emtx ar18emth ar18eyrx
## "factor" "factor" "factor" "factor" "factor" "factor"
## ar18eyr ar18f ar18g ar18h ar18i ar18j
## "numeric" "factor" "factor" "factor" "factor" "factor"
## age_97 res97b3a res97b3b res97b3p res97b4 res97b5
## "numeric" "factor" "factor" "factor" "factor" "factor"
## panel3 panel4 version
## "factor" "factor" "character"
sapply(IFLS_birth, class)
## hhid97 pid97 br00x br00a br01 br02
## "numeric" "numeric" "factor" "factor" "factor" "factor"
## br03 br04 br05 br06 br07 br08
## "factor" "factor" "factor" "factor" "factor" "factor"
## br09 br10 br11 br12 br13 br14
## "factor" "factor" "factor" "factor" "factor" "factor"
## br15 br15k br16 br16k br16a version
## "numeric" "factor" "numeric" "factor" "factor" "character"
IFLS_birth <- IFLS_birth %>% mutate(pid = paste(hhid97, pid97))
head(IFLS_birth$pid)
## [1] "12200 2" "12900 2" "20100 2" "20100 3" "20200 2" "20300 5"
IFLS_menarche <- IFLS_menarche %>% mutate(pid = paste(hhid97, pid97))
head(IFLS_menarche$pid)
## [1] "12200 2" "12900 2" "20100 2" "20100 3" "20200 2" "20300 5"
IFLS_HH <- IFLS_HH %>% mutate(pid = paste(hhid97, pid97))
head(IFLS_HH$pid)
## [1] "10600 2" "10600 3" "10800 2" "10800 4" "10800 11" "10800 12"
IFLS <- full_join(IFLS_HH, IFLS_birth, by="pid")
IFLS <- full_join(IFLS, IFLS_menarche, by="pid")
Now we have all the data together. Based on our problem, the variables and people we want to include: ar16: highest level of education, ar09: age, ar15: religion, kb12b: dowry, and kw23b: age at menarche.
Let’s look at each variable we are interested in and determine how to recode them.
table(IFLS$br15)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 450 872 432 206 131 70 36 21 10 7 3 1 1
These look like they are coded ok.
table(IFLS$ar16)
##
## 01. None 02. Elem 03. Gen JrHi 04. Voc JrHi 05. Gen SrHi
## 2608 5868 1824 256 1627
## 06. Voc SrHi 07. Dipl(1/2) 08. Dipl(3) 09. Univ 10. Other
## 1263 131 214 470 16
## 11. Adlt Ed A 12. Adlt Ed B 13. Open Univ 14. Islam Scl 17. Disabl Scl
## 0 3 0 2 0
## 70. Madrasah 90. KG 98. DK 99. Miss
## 0 0 59 0
It makes sense that None should be lowest. Then it is
order elementary, general junior high, vocational junior high, general
high school, vocational high school, etc. Everything from 10 and above
are probably not levels of higher education.It will make sense for us to
recode this variable so that anyone with a 10 or higher is missing.
table(IFLS$ar09)
##
## 0 3 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 1 1 2 2 16 463 488 534 547 450 476 461 448 356 385 404 322 352 305 391
## 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 387 238 326 226 377 333 236 255 208 309 289 168 206 175 265 216 125 155 95 186
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## 171 85 145 87 258 213 109 147 83 180 173 70 89 58 201 124 54 94 37 115
## 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
## 104 23 39 21 117 49 18 28 13 37 34 7 12 7 50 17 8 3 1 11
## 90 91 92 93 94 95 96 97 98 100 115 120 998
## 15 1 1 1 11 3 1 2 13 6 1 1 11
Here are all the possible values that people entered for their age. The first row is the age and the number below is how many their are. 998 is a missing code. Is there someone who is 115 or 120? We could code it so that greater than 100 goes to missing.
table(IFLS$ar15)
##
## 01. Islam 02. Prot. 03. Cath. 04. Hindu 05. Buddh. 06. Other 98. DK
## 12491 702 294 655 157 30 12
## 99. Miss
## 0
We may want to recode 98 to missing (don’t know).
hist(IFLS$kw12b)
Since this looks really ugly.
max(IFLS$kw12b)
## [1] NA
The max value is NA which does us not good here.
max(IFLS$kw12b, na.rm=T)
## [1] 1e+07
This is the largest value.
table(IFLS$kw14x)
##
## 01. Rupiah 02. Gulden 03. Perak
## 5350 0 0
## 04. Tali 05. Kethip 06. Ringgit
## 0 0 0
## 07. Cent/Sen 08. Suku 09. Kepeng Belong
## 0 1 0
## 10. Gelo 11. Grams of gold 12. Kg of rice
## 0 1 0
## 90. Foreign currency 95. Top code 96. N/A
## 0 0 0
## 97. Refuse 98. DK 99. Missing
## 0 801 7
This measures the currency. For the most part every one is using Rupiah. When looking at money, we usually want to take the log transformation.
hist(log10(IFLS$kw12b + 1))
We add 1 because if anyone has a value between 0 and 1, it can’t take the log of that. This is more normal.
min(IFLS$kw12b, na.rm=T)
## [1] 0
There are some individuals with the value of 0.
table(IFLS$kw23b)
##
## 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## 5 11 33 103 124 800 963 1159 1697 549 398 148 34 46 6 4
## 23 24 25 26 27
## 1 2 6 2 2
The age at first menstruation runs from 7 to 27. That seems late but is perhaps not impossible. This is fine.
IFLS <- IFLS %>%
mutate(educ = ifelse(as.numeric(ar16) >= 10, NA, ar16),
religion = ifelse(ar15 == 98, NA, ar15),
log_dowry = log10(kw12b + 1)) %>%
filter(as.numeric(ar09) >= 20 & as.numeric(ar09) <= 100) %>%
select(br15, educ, ar09, religion, log_dowry, kw23b)
head(IFLS)
## br15 educ ar09 religion log_dowry kw23b
## 1 NA 2 42 2 NA NA
## 2 NA 1 23 2 NA NA
## 3 NA 2 32 2 6.164353 15
## 4 NA 3 44 2 NA NA
## 5 NA 2 20 2 NA NA
## 6 NA 3 26 2 NA NA
We filter the rows that we want and select the columns that we want.
Let’s start by running our model predicting br15 (number
of live births) using the predictors: educ,
yr_birth, log_dowry, kw23b
A common problem is trying to figure out which variables to include in a model. There are a variety of strategies to try to solve this:
• Forced Entry: All predictors are entered simultaneously.
• Hierarchical: Experimenter decides the order in which variables are entered into the model and checks the model fit using AIC / BIC
• Stepwise: Predictors are added or removed based on AIC values automatically
AIC stands for Akaike information criterion. The AIC value takes the
residual sum of squares (which will go down as you add more predictors)
and then adds 2*K – where k is the number of
predictors.
Lower AIC scores are better. So, AIC takes into account that as you add predictors, you get a model with less error by penalizing models with more predictors. So, a predictor will only be added (if you use this method) if it improves the model by a “substantial” amount.
• All variables are entered into the model simultaneously.
• The results obtained depend on the variables entered into the model.
• It is important, therefore, to have good theoretical reasons for including a particular variable.
• Known predictors (based on past research) are entered into the regression model first.
• New predictors are then entered in a separate step/block.
• Experimenter makes the decisions.
• It is the best method:
• Based on theory testing.
• You can see the unique predictive influence of a new variable on the outcome because known predictors are held constant in the model.
• Possible con:
• Relies on the experimenter knowing what they’re doing
IFLS_nomiss <- IFLS %>% filter(!is.na(log_dowry))
#Using education and age to predict the number of live births
model1 <- lm(br15 ~ educ + ar09, data=IFLS_nomiss)
summary(model1)
##
## Call:
## lm(formula = br15 ~ educ + ar09, data = IFLS_nomiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4071 -0.8564 -0.1399 0.6686 8.8558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.527598 0.157244 -3.355 0.000809 ***
## educ -0.095766 0.018185 -5.266 1.56e-07 ***
## ar09 0.085852 0.004442 19.326 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.48 on 1782 degrees of freedom
## (3394 observations deleted due to missingness)
## Multiple R-squared: 0.1951, Adjusted R-squared: 0.1942
## F-statistic: 216 on 2 and 1782 DF, p-value: < 2.2e-16
Education (educ) is a significant predictor because the
p-value is less than 0.05 (p=1.56e-07) The number of births decreases
with more education (t value = -5.266). Age (ar09) is a
significant predictor because the p-value is less than 0.05 (p=2e-16) As
individuals get older, the number of births increases (t value =
19.326)
#Using education, age, and dowry to predict the number of live births
model2 <- lm(br15 ~ educ + ar09 + log_dowry, data=IFLS_nomiss)
summary(model2)
##
## Call:
## lm(formula = br15 ~ educ + ar09 + log_dowry, data = IFLS_nomiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1945 -0.8338 -0.1384 0.6818 8.8520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01424 0.24963 -0.057 0.95451
## educ -0.07936 0.01919 -4.136 3.69e-05 ***
## ar09 0.08294 0.00457 18.148 < 2e-16 ***
## log_dowry -0.11114 0.04202 -2.645 0.00824 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.478 on 1781 degrees of freedom
## (3394 observations deleted due to missingness)
## Multiple R-squared: 0.1983, Adjusted R-squared: 0.1969
## F-statistic: 146.8 on 3 and 1781 DF, p-value: < 2.2e-16
Education (educ) is a significant predictor because the
p-value is less than 0.05 (p=3.69e-05) The number of births decreases
with more education (t value = -4.136). Age (ar09) is a
significant predictor because the p-value is less than 0.05 (p=2e-16) As
individuals get older, the number of births increases (t value = 18.148)
The dowry is a significant predictor because the p-value is less than
0.05 (p=0.00824) As the dowry increases, the number of births decreases
(-2.645)
The way we determine if the second model is an improvement over the
first is by using anova() (This is not
aov())
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: br15 ~ educ + ar09
## Model 2: br15 ~ educ + ar09 + log_dowry
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1782 3904.8
## 2 1781 3889.5 1 15.278 6.9958 0.008242 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This returns showing that that second model is improved over the first model (p=0.008242).
• Variables are entered into the model based on mathematical criteria.
• Computer selects variables in steps
fullmodel <- lm(br15 ~ educ + ar09 + log_dowry + kw23b, data=IFLS_nomiss)
IFLS_nomiss <- IFLS %>% filter(!is.na(log_dowry) & !is.na(kw23b))
model <- lm(br15 ~ educ + ar09 + log_dowry + kw23b, data=IFLS_nomiss)
step(fullmodel, direction="backward")
## Start: AIC=1389.16
## br15 ~ educ + ar09 + log_dowry + kw23b
##
## Df Sum of Sq RSS AIC
## - kw23b 1 0.06 3859.1 1387.2
## <none> 3859.0 1389.2
## - log_dowry 1 14.39 3873.4 1393.8
## - educ 1 38.59 3897.6 1404.8
## - ar09 1 726.53 4585.6 1692.8
##
## Step: AIC=1387.19
## br15 ~ educ + ar09 + log_dowry
##
## Df Sum of Sq RSS AIC
## <none> 3859.1 1387.2
## - log_dowry 1 14.40 3873.5 1391.8
## - educ 1 38.53 3897.6 1402.8
## - ar09 1 726.95 4586.0 1691.0
##
## Call:
## lm(formula = br15 ~ educ + ar09 + log_dowry, data = IFLS_nomiss)
##
## Coefficients:
## (Intercept) educ ar09 log_dowry
## -0.04002 -0.08077 0.08363 -0.10837
Number of births br15 is predicted by education
educ + age ar09 + log size of dowry
log_dowry + age of menarche kw23b(AIC
=1389.2). In the first step, it looks at what happens if we take out one
variable. Taking out age of menarche improves the AIC by a little bit
(1387.2). If we take out log size of dowry, the AIC gets worse (1393.8).
If we take out education, the AIC gets a lot worse (1404.8). If we take
out age, the AIC is the worst of all (1692.8). By “Worse” we mean, we
want the smallest possible value. It is ordered from the lowest to the
highest AIC where kw23b is listed first.
Both starts with all the variables, kicks one out then looks to add one back in.This catches suppressor effects when you need both things in to actually see it.
step(fullmodel, direction="both")
## Start: AIC=1389.16
## br15 ~ educ + ar09 + log_dowry + kw23b
##
## Df Sum of Sq RSS AIC
## - kw23b 1 0.06 3859.1 1387.2
## <none> 3859.0 1389.2
## - log_dowry 1 14.39 3873.4 1393.8
## - educ 1 38.59 3897.6 1404.8
## - ar09 1 726.53 4585.6 1692.8
##
## Step: AIC=1387.19
## br15 ~ educ + ar09 + log_dowry
##
## Df Sum of Sq RSS AIC
## <none> 3859.1 1387.2
## + kw23b 1 0.06 3859.0 1389.2
## - log_dowry 1 14.40 3873.5 1391.8
## - educ 1 38.53 3897.6 1402.8
## - ar09 1 726.95 4586.0 1691.0
##
## Call:
## lm(formula = br15 ~ educ + ar09 + log_dowry, data = IFLS_nomiss)
##
## Coefficients:
## (Intercept) educ ar09 log_dowry
## -0.04002 -0.08077 0.08363 -0.10837
We end up with the same result as backward. It just gave us the possibility of adding other back in.
#forward
full_model <- formula(lm(br15 ~ educ + ar09 + log_dowry +
kw23b, data=na.omit(IFLS)))
intercept_only <- lm(br15 ~ 1, data=na.omit(IFLS))
step(intercept_only, direction="forward", scope=full_model)
## Start: AIC=1778.41
## br15 ~ 1
##
## Df Sum of Sq RSS AIC
## + ar09 1 893.74 3935.1 1417.7
## + log_dowry 1 192.41 4636.4 1708.4
## + educ 1 129.70 4699.1 1732.2
## <none> 4828.8 1778.4
## + kw23b 1 1.98 4826.8 1779.7
##
## Step: AIC=1417.73
## br15 ~ ar09
##
## Df Sum of Sq RSS AIC
## + educ 1 61.561 3873.5 1391.8
## + log_dowry 1 37.426 3897.6 1402.8
## <none> 3935.1 1417.7
## + kw23b 1 0.026 3935.0 1419.7
##
## Step: AIC=1391.79
## br15 ~ ar09 + educ
##
## Df Sum of Sq RSS AIC
## + log_dowry 1 14.3991 3859.1 1387.2
## <none> 3873.5 1391.8
## + kw23b 1 0.0686 3873.4 1393.8
##
## Step: AIC=1387.19
## br15 ~ ar09 + educ + log_dowry
##
## Df Sum of Sq RSS AIC
## <none> 3859.1 1387.2
## + kw23b 1 0.0606 3859.0 1389.2
##
## Call:
## lm(formula = br15 ~ ar09 + educ + log_dowry, data = na.omit(IFLS))
##
## Coefficients:
## (Intercept) ar09 educ log_dowry
## -0.04002 0.08363 -0.08077 -0.10837
The code is trickier for forward. na.omit() helps remove
NA.
• Forward
• Add predictor that improves the model most
• Then add the predictor that improves the model 2nd most (after first predictor is already included)
• May miss out on effect that only occurs when another predictor is in the model.
• Backward
• Include all variables
• Exclude variable that improves AIC most
• Both
• Include all variables
• Exclude variable that improves AIC most
• Then check to either add or remove the variable that improves AIC the most
Stepwise should be avoided except for exploratory model building. Don’t P-Hack!
Use backward rather than forward to minimize suppressor effects (when a predictor has a significant effect but only when another variable is held constant)
Religion is a categorical variable.
How can we include categorical variables in our multiple regression models?
You throw it in in the same way.
l1 <- lm(br15 ~ educ + ar09 + as.factor(religion) +
log_dowry + kw23b, data=IFLS)
model1 <- lm(br15 ~ ., data=IFLS)
summary(model1)
##
## Call:
## lm(formula = br15 ~ ., data = IFLS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2663 -0.8360 -0.1443 0.6837 8.8344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.022904 0.363904 0.063 0.9498
## educ -0.081167 0.019265 -4.213 2.64e-05 ***
## ar09 0.083830 0.004600 18.224 < 2e-16 ***
## religion -0.040687 0.079239 -0.513 0.6077
## log_dowry -0.104553 0.042855 -2.440 0.0148 *
## kw23b -0.002845 0.018571 -0.153 0.8782
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.478 on 1766 degrees of freedom
## (10049 observations deleted due to missingness)
## Multiple R-squared: 0.2009, Adjusted R-squared: 0.1987
## F-statistic: 88.82 on 5 and 1766 DF, p-value: < 2.2e-16
audio 45:00