Introduction:

The research question for the analysis is “Is there a relationship (association) between the level of education obtained and the number of children in a household?”

The goal of this study is to answer this research question. If a correlation or trend is found, politicians could consider adjusting social care and support for different groups of citizens. I want to know whether social status, as defined by education, is related to the number of children in a family. There could be different scenarios. For instance, if there is an association between the number of kids and the education obtained, the government could consider improving access to higher education, for example by increasing scholarships, so that young people can obtain higher education more easily and may be more willing to have more kids. The country would then have more taxpayers and social policy would be in good shape. Moreover, politicians would need to pursue different policies to keep the birth rate in the country higher.

Data:

Data Collection

Data come from the General Social Survey (GSS), years 1972 - 2012, http://bit.ly/dasi_gss_data, using the degree and childs columns. This is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The survey is conducted by the National Opinion Research Center at the University of Chicago as a face-to-face, in-person interview of adults (18+) in randomly selected households.

Cases

Units of observation - adults (18+) in randomly selected households in the USA (57061 cases)

Variables

  1. degree - the highest education level obtained - categorical variable, ordinal (Lt High School, High School, Junior College, Bachelor, Graduate)

  2. childs - the number of kids - numeric, discrete variable (0-8)

Study

This is an observational study: researchers collect data based on what is seen and heard, and inference is made afterwards. Variables are simply observed, and researchers do not interfere with how the data arise. Unlike an experiment, this study has no control and treatment groups.

Scope of inference - generalizability

The units are adult residents of the United States living in households. Because respondents were randomly sampled and the sample is less than 10% of the population, the results can be generalised to the adult population of the United States. Possible sources of bias are: 1. Convenience bias, as not all groups/sectors of people were contacted 2. Non-response - there are missing rows across years, and some citizens may have refused to answer the survey; we do not have data on this.

Data - Scope of inference - causality

No causal links can be established between the variables, as this is an observational study with no random assignment, so confounding variables cannot be ruled out.

Exploratory data analysis:

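Because the childs variable has many levels, we create categories for it with the following levels: (0,2] (2,4] (4,6] (6,8]. Note that the first interval, (0,2], is open on the left, so respondents with 0 children fall outside every category.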

Now we have two categorical variables.
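The code that created `kids` is not shown in this report. The sketch below is an assumption about how such bins are typically built with `cut()`, run on a small made-up vector rather than on `gss$childs`. It shows that the default intervals turn 0 into NA, while `include.lowest = TRUE` keeps 0 in the first bin.

```r
# Hypothetical stand-in for gss$childs (the real column holds 0-8 plus NA).
childs <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, NA)

# Default cut(): intervals are open on the left, so 0 is not in (0,2]
# and is returned as NA.
kids_default <- cut(childs, breaks = c(0, 2, 4, 6, 8))

# include.lowest = TRUE closes the first interval, giving [0,2],
# so respondents with 0 children are kept.
kids_closed <- cut(childs, breaks = c(0, 2, 4, 6, 8), include.lowest = TRUE)

table(kids_default, useNA = "ifany")
table(kids_closed, useNA = "ifany")
```

If the bins in this report were built the first way, a large share of the NA's in `kids` would be respondents with no children rather than missing answers.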

Summary and Visualisation

summary(gss$degree)
## Lt High School    High School Junior College       Bachelor       Graduate 
##          11822          29287           3070           8002           3870 
##           NA's 
##           1010
barplot(table(gss$degree))

Comment: The largest group of respondents, around 50% of all respondents, finished their education at high school. There are about 1 000 NA’s in this variable.
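The "around 50%" figure can be checked directly from the counts printed by `summary(gss$degree)`. This is a minimal sketch that re-enters those counts by hand; the vector name `degree_counts` is introduced here only for illustration.

```r
# Counts copied from the summary(gss$degree) output in this report.
degree_counts <- c(
  "Lt High School" = 11822,
  "High School"    = 29287,
  "Junior College" = 3070,
  "Bachelor"       = 8002,
  "Graduate"       = 3870,
  "NA's"           = 1010
)

# Total should match the 57061 cases stated in the Cases section.
sum(degree_counts)

# Percentage of all respondents in each category, to one decimal place.
round(100 * degree_counts / sum(degree_counts), 1)
```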

Similar summary for childs:

summary(kids)
## (0,2] (2,4] (4,6] (6,8]  NA's 
## 23135 13509  3241  1502 15674
barplot(table(kids))

Comment: We notice that most respondents report having up to 2 kids, about 56% when NA’s are excluded. This alone suggests a low birth rate within our sample. The number of NA’s is large; since the (0,2] category excludes 0, many of them are probably respondents with no children rather than respondents who did not want to share information about childs.

In order to compare these two variables we use a proportion table and a mosaic plot, which give a clearer view of the data.

propor=table(kids,gss$degree)
round(prop.table(propor),digits=2)
##        
## kids    Lt High School High School Junior College Bachelor Graduate
##   (0,2]           0.10        0.31           0.03     0.08     0.04
##   (2,4]           0.08        0.17           0.02     0.04     0.02
##   (4,6]           0.03        0.03           0.00     0.00     0.00
##   (6,8]           0.02        0.01           0.00     0.00     0.00
propor2=table(gss$degree,kids)
mosaicplot(propor2,main = "Degree vs. Kids Distribution", color = TRUE)

Comment: Above we notice a difference in the number of kids between the Lt High School and High School groups and the other groups. This might be due to chance or sample size, but there could also be a significant difference. This gives us a hint about the answer to our research question.
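The table above shows joint proportions, which are dominated by the size of the High School group. Proportions within each degree group make the comparison easier. The sketch below re-enters the rounded values from that table by hand, so rounding error carries over; on the real data the equivalent call would presumably be `prop.table(propor, margin = 2)`.

```r
# Joint proportions copied from the rounded table printed earlier.
joint <- matrix(
  c(0.10, 0.31, 0.03, 0.08, 0.04,
    0.08, 0.17, 0.02, 0.04, 0.02,
    0.03, 0.03, 0.00, 0.00, 0.00,
    0.02, 0.01, 0.00, 0.00, 0.00),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    kids   = c("(0,2]", "(2,4]", "(4,6]", "(6,8]"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# margin = 2 rescales each column (degree group) so that it sums to 1.
by_degree <- prop.table(joint, margin = 2)
round(by_degree, 2)
```

Read column by column, this shows what share of each education group falls in each kids category, independent of how large the group is.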

Inference:

Method

Since we have two categorical variables (after creating groups within childs), each with more than two levels, we will conduct a chi-square test of independence, as this method is used to test for an association between categorical variables when at least one has more than two levels.

State hypothesis

H0 : The number of children in a household and the education obtained are independent. The number of children does not vary by education obtained.

HA : The number of children in a household and the education obtained are dependent. The number of children does vary by education obtained.

Conditions

  1. Independence:
     - random sample - the data come from random sampling
     - if sampling without replacement, n < 10% of population - our data cover less than 10% of the entire population, only about 57 000 responses
     - each case contributes to only one cell in the table - this is also true in this case

  2. Sample size: each cell must have at least 5 expected cases - this holds for every cell in our table

Inference

A chi-square test is performed to obtain the p-value. If the p-value is lower than 0.05, we will reject H0.
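The sample size condition refers to expected counts under independence, not observed counts. A minimal sketch of how they can be checked is given below; it uses a small table of made-up counts (not the GSS data), and the names `toy` and `expected` are hypothetical.

```r
# Hypothetical counts for illustration only (not the GSS data).
toy <- matrix(
  c(30, 20, 10,
    25, 25, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("A", "B"), level = c("low", "mid", "high"))
)

# Expected count for each cell = row total * column total / grand total.
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)
expected

# The condition holds when every expected count is at least 5.
all(expected >= 5)

# chisq.test() returns the same expected counts in its $expected component.
all.equal(unname(expected), unname(chisq.test(toy)$expected))
```

On the real data, `chisq.test(propor2)$expected` would give the corresponding table for degree and kids.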

chisq.test(propor2)
## 
##  Pearson's Chi-squared test
## 
## data:  propor2
## X-squared = 2158.5, df = 12, p-value < 2.2e-16

Comment: The p-value is much lower than the 0.05 significance level, therefore we reject the null hypothesis. The data provide evidence of a relationship between degree and number of children.

No other methods are applicable for testing the relationship between these two categorical variables, so there is nothing to compare this result with.
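The degrees of freedom and p-value in the output above can be reproduced from the table dimensions and the reported statistic. This is a small sketch using only numbers printed in this report; the variable names are introduced here for illustration.

```r
# propor2 has 5 degree levels (rows) and 4 kids categories (columns).
n_degree <- 5
n_kids   <- 4

# Degrees of freedom for a test of independence: (rows - 1) * (columns - 1).
df <- (n_degree - 1) * (n_kids - 1)
df

# Upper-tail probability of the reported statistic under the chi-square distribution.
x_squared <- 2158.5
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# R prints "< 2.2e-16" when the p-value is below this threshold.
p_value < 2.2e-16
```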

Conclusion:

Based on the hypothesis test performed and the whole data analysis, we find a statistically significant association between the number of kids and the degree obtained. This result is consistent with my assumption that the number of children people have is related to the level at which they finish their education, so this topic needs deeper analysis. As the data come from an observational study, we cannot establish causation. We still do not know which groups differ or how the variables are related, so I would advise fitting a regression model and also calculating the correlation between the variables.

References:

General Social Survey (GSS): A sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States

Data set: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html

http://bit.ly/dasi_gss_data

Appendix:

head(data.frame(gss$degree,gss$childs),n=40L)
##        gss.degree gss.childs
## 1        Bachelor          0
## 2  Lt High School          5
## 3     High School          4
## 4        Bachelor          0
## 5     High School          2
## 6     High School          0
## 7     High School          2
## 8        Bachelor          0
## 9     High School          2
## 10    High School          4
## 11    High School          1
## 12 Lt High School          5
## 13 Lt High School          1
## 14 Lt High School          2
## 15 Lt High School          5
## 16    High School          2
## 17    High School          2
## 18 Lt High School          3
## 19       Bachelor          3
## 20    High School          0
## 21    High School          2
## 22    High School          2
## 23    High School          0
## 24    High School          1
## 25       Bachelor          0
## 26    High School          2
## 27    High School          2
## 28    High School          2
## 29    High School          0
## 30 Lt High School          2
## 31 Lt High School          2
## 32    High School          1
## 33       Bachelor          0
## 34 Lt High School          5
## 35    High School          1
## 36    High School          2
## 37    High School          2
## 38 Lt High School          1
## 39 Lt High School          0
## 40    High School          2