Introduction:

With the world's population growing rapidly, and with poverty and the gap between rich and poor widening, it is worth examining the reasons behind this growth in depth and developing mechanisms to curb it. The title of this project, “Impact of advanced education on family planning”, states the purpose of the data analysis. The project addresses the research question “Does academic qualification have an impact on family planning?” It is commonly believed that the more educated an individual is, the fewer children he or she has, which implies an inverse relationship between educational qualification and number of children. The aim of this project is therefore to analyze the dataset produced by the GSS 1972-2012 survey and to examine the relationship between education and family planning, testing this established notion with statistical techniques and validation tests.

Data:

The data come from the General Social Survey (GSS), a sociological survey that collects data on the demographic characteristics and attitudes of residents of the United States. The dataset contains 57061 cases (observations), one for each individual surveyed between 1972 and 2012, and each case has 114 variables. The data were collected through computer-aided, face-to-face and telephone interviews. Analyzing every survey year would be cumbersome, so only two years of data are used for testing.

The GSS is an observational study that describes various characteristics and attitudes of residents of the United States. Its data are used to study and draw conclusions about a diverse set of subject areas, which fall under the following categories:

- Psychological: sexual behaviour, preferences
- Economic: employment, poverty, expenditures, incomes
- Social: security, mobility, inequality
- Educational: qualifications, degree, occupation
- Health: alcohol, AIDS, birth control, drugs
- Demographics: ethnicity, gender, immigration, religion
- Others: technology, terrorism, democracy, military service

These areas highlight the wide scope of the GSS.

Two variables, “degree” and “childs”, are studied in this project. Applying a few statistical techniques to them gives insight into the data and a better understanding of the relationship between the variables. The variables selected for the analysis fall into two categories:

- Categorical variable: “degree” is the highest qualification a respondent has attained. The levels are Lt High School, High School, Junior College, Bachelor and Graduate, coded 0, 1, 2, 3 and 4 respectively.
- Numerical variable: “childs” is the number of children of a respondent. It takes the values 0, 1, 2, 3, 4, 5, 6, 7 and 8, where 0 means no children and 8 means 8 or more children.

The study is observational because the answers to the survey questions were not controlled and the respondents were not subject to any treatment. The GSS staff merely observed the data, retrospectively and prospectively, without influencing any individual's responses, which rules out an experimental design. The objective of the GSS was to facilitate trend studies. Statistical techniques can therefore establish associations between the variables, but causation, which only an experiment can establish, cannot be inferred.

The target population of the GSS is all non-institutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States. For this project, all respondents in the two selected years are used to test for a relationship between education and number of children. Generalizations still cannot be made with confidence, because the sample could suffer from voluntary response or non-response bias: not all individuals responded to the survey.

Because this is an observational study, the data cannot be used to establish causal relationships between the variables of interest. Statistical tools can test these variables and reveal trends and correlations, but other factors, such as family tradition or religious background, might influence both variables.

gss[1:50, c("childs", "degree")]
##    childs         degree
## 1       0       Bachelor
## 2       5 Lt High School
## 3       4    High School
## 4       0       Bachelor
## 5       2    High School
## 6       0    High School
## 7       2    High School
## 8       0       Bachelor
## 9       2    High School
## 10      4    High School
## 11      1    High School
## 12      5 Lt High School
## 13      1 Lt High School
## 14      2 Lt High School
## 15      5 Lt High School
## 16      2    High School
## 17      2    High School
## 18      3 Lt High School
## 19      3       Bachelor
## 20      0    High School
## 21      2    High School
## 22      2    High School
## 23      0    High School
## 24      1    High School
## 25      0       Bachelor
## 26      2    High School
## 27      2    High School
## 28      2    High School
## 29      0    High School
## 30      2 Lt High School
## 31      2 Lt High School
## 32      1    High School
## 33      0       Bachelor
## 34      5 Lt High School
## 35      1    High School
## 36      2    High School
## 37      2    High School
## 38      1 Lt High School
## 39      0 Lt High School
## 40      2    High School
## 41      2 Lt High School
## 42      2    High School
## 43      2 Lt High School
## 44      0 Lt High School
## 45      4 Lt High School
## 46      0    High School
## 47      0    High School
## 48      4    High School
## 49      4 Lt High School
## 50      4 Lt High School
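
The coding of the two variables described above can be illustrated with a small sketch. The values below are made up for illustration and are not taken from the GSS; the names `degree_codes` and `raw_children` are hypothetical, and the top-coding step is only an assumption about how "8 or more" might be produced from raw counts.

```r
# Toy illustration only: these values are made up, not taken from the GSS.
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")

# Hypothetical raw codes 0-4 for six respondents, mapped to labelled levels
degree_codes <- c(3, 0, 1, 1, 4, 2)
degree <- factor(degree_codes, levels = 0:4, labels = degree_levels)

# Hypothetical raw child counts; anything above 8 is top-coded to 8 ("8 or more")
raw_children <- c(0, 5, 4, 11, 2, 8)
childs <- pmin(raw_children, 8)

toy <- data.frame(childs = childs, degree = degree)
str(toy)  # childs is numeric, degree is a factor with 5 levels
```

Storing “degree” as a factor is what makes `summary()` and `plot()` treat it as categorical in the analysis that follows.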

Exploratory data analysis:

The GSS contains data for about 40 years, which would be difficult to test as a whole because not all respondents answered every survey question in every year. The data for three years, namely 1980, 2010 and 2012, are therefore used; the analysis below covers 1980 and 2010. Separate statistical inference is conducted for each year to draw conclusions about the relationship between “degree” and “childs”, the number of children a respondent chose to bear.

YEAR 1980

For a numerical variable, the summary includes the minimum, the maximum, the 1st and 3rd quartiles, the mean and the median. Because “degree” is a categorical variable, its summary instead reports the number of respondents holding each type of degree, and the plot visualises these counts.

year_1980 = gss[gss$year==1980, ]
degreeholders_1980 = year_1980$degree
summary(degreeholders_1980)
## Lt High School    High School Junior College       Bachelor       Graduate 
##            407            745             45            158             71 
##           NA's 
##             42
plot(summary(degreeholders_1980))
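
As a minimal sketch of what `summary()` returns for a categorical variable, the example below uses a made-up factor (not GSS data). It also draws the counts with `barplot()`, which is offered here as one possible alternative display, not as the plot used in this report.

```r
# Made-up degree values for illustration; not GSS data
deg <- factor(c("High School", "Bachelor", "High School", NA, "Graduate",
                "High School"),
              levels = c("Lt High School", "High School", "Junior College",
                         "Bachelor", "Graduate"))

# For a factor, summary() returns the count per level plus an "NA's" entry
counts <- summary(deg)
print(counts)

# One bar per level; las = 2 rotates the axis labels so they stay readable
barplot(counts, las = 2, ylab = "Respondents")
```

Levels with no respondents still appear with a count of zero, so the bars always follow the order of the factor levels.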

The following outlines the data for the numerical variable “childs”. The summary outputs the standard statistics for this variable, and the histogram visualizes the number of respondents in each category of “number of children”, ranging from 0 to 8 or more.

childs_1980 = year_1980$childs
summary(childs_1980)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   2.000   2.031   3.000   8.000       3
hist(childs_1980)

Finally, plotting “childs” against “degree” shows the distribution of the number of children within each category of degree holder.

plot(degreeholders_1980, childs_1980)

summary(year_1980[, c("childs", "degree")])
##      childs                 degree   
##  Min.   :0.000   Lt High School:407  
##  1st Qu.:0.000   High School   :745  
##  Median :2.000   Junior College: 45  
##  Mean   :2.031   Bachelor      :158  
##  3rd Qu.:3.000   Graduate      : 71  
##  Max.   :8.000   NA's          : 42  
##  NA's   :3
data_1980 = table(degreeholders_1980, childs_1980)
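
The side-by-side plot can be complemented by the mean number of children per degree category, which is the quantity compared later in the inference section. The sketch below uses made-up values (not GSS data) and a hypothetical data frame `toy`; it shows one way to compute group means with `tapply()` and to build the same kind of cross-table as `data_1980`.

```r
# Made-up values for illustration; not GSS data
toy <- data.frame(
  degree = factor(c("High School", "High School", "Bachelor", "Bachelor",
                    "Graduate", "Graduate"),
                  levels = c("High School", "Bachelor", "Graduate")),
  childs = c(3, 2, 1, NA, 0, 2)
)

# Mean number of children per degree level, ignoring missing answers
group_means <- tapply(toy$childs, toy$degree, mean, na.rm = TRUE)
print(group_means)

# Cross-table of degree by number of children; rows with NA are dropped
cross_tab <- table(toy$degree, toy$childs)
print(cross_tab)
```

Note that `table()` silently drops missing values, which is why the number of cases in a cross-table can be smaller than the number of respondents in a year.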

YEAR 2010

The same procedure is followed for this year as for 1980. The data are explored below.

year_2010 = gss[gss$year==2010, ]
degreeholders_2010 = year_2010$degree
summary(degreeholders_2010)
## Lt High School    High School Junior College       Bachelor       Graduate 
##            293           1001            145            375            218 
##           NA's 
##             12
plot(summary(degreeholders_2010))

Data for the number of children in 2010:

childs_2010 = year_2010$childs
summary(childs_2010)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   2.000   1.885   3.000   8.000       3
hist(childs_2010)

Finally, plotting “childs” against “degree” shows the distribution of the number of children within each category of degree holder.

plot(degreeholders_2010, childs_2010)

summary(year_2010[, c("childs", "degree")])
##      childs                 degree    
##  Min.   :0.000   Lt High School: 293  
##  1st Qu.:0.000   High School   :1001  
##  Median :2.000   Junior College: 145  
##  Mean   :1.885   Bachelor      : 375  
##  3rd Qu.:3.000   Graduate      : 218  
##  Max.   :8.000   NA's          :  12  
##  NA's   :3

The final table presents the number of respondents with each number of children, for each category of highest degree a respondent possesses.

data_2010 = table(degreeholders_2010, childs_2010)

Inference:

This project studies one numerical variable and one categorical variable with more than two levels, so the analysis compares the mean number of children across the categories of degree holders (Lt High School, High School, Junior College, Bachelor, Graduate). With more than two groups there is no single parameter of interest to estimate, so the appropriate technique is hypothesis testing alone, carried out with ANOVA.

YEAR 1980

summary(table(degreeholders_1980, childs_1980))
## Number of cases in table: 1423 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 124.62, df = 32, p-value = 7.213e-13
##  Chi-squared approximation may be incorrect

YEAR 2010

summary(table(degreeholders_2010, childs_2010))
## Number of cases in table: 2029 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 199.82, df = 32, p-value = 3.607e-26
##  Chi-squared approximation may be incorrect
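
The warning "Chi-squared approximation may be incorrect" typically appears when some expected cell counts are small. One possible way to check this and to obtain a p-value that does not rely on the approximation is sketched below with `chisq.test()` and its `simulate.p.value` argument. The data are simulated stand-ins, not the GSS, and the variable names and probabilities are hypothetical.

```r
set.seed(1)  # reproducible simulation

# Simulated stand-in data (NOT the GSS): two unrelated categorical variables
n <- 300
degree <- factor(sample(c("Lt High School", "High School", "Bachelor"), n,
                        replace = TRUE, prob = c(0.2, 0.6, 0.2)))
childs <- factor(pmin(rpois(n, lambda = 2), 8))
tab <- table(degree, childs)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
print(min(expected))  # small values make the chi-squared approximation unreliable

# Monte Carlo p-value based on 2000 simulated tables
mc <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(mc$p.value)
```

Applied to this report's objects, the analogous call would presumably be `chisq.test(data_1980, simulate.p.value = TRUE)`. This has not been run here.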

Summaries of individual rows of the contingency table can be created with code such as:

summary(data_1980[1, 1:9])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   22.00   42.00   45.11   57.00  106.00

These are the basic statistics of the respondent counts for Lt High School holders across the nine categories of number of children (0 to 8).

Hypothesis:

H(0): Mean(1) = Mean(2) = Mean(3) = Mean(4) = Mean(5), i.e. the mean number of children is the same across all five categories of degree holders.

H(A): At least one of the means is different, i.e. the mean number of children differs for at least one pair of categories.

- Confidence level: 95%
- Significance level: 5%
- Test statistic: F-test

The F-test is an appropriate test of whether at least one pair of the means stated in the above hypothesis differs. If that is the case, we reject the null hypothesis H(0) at the 5% significance level and conclude that the average number of children differs among the classes of degree holders. The following ANOVA output is used to calculate the test statistic.

The relevant parameters are as follows:

- Sum of Squares Group (SSG)
- Sum of Squared Errors (SSE)
- Degrees of freedom for Group (Dfg)
- Degrees of freedom for Errors (Dfe)
- Mean Squared Group (MSG): SSG/Dfg
- Mean Squared Errors (MSE): SSE/Dfe
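
For reference, the standard one-way ANOVA definitions of these quantities can be written out as follows. The notation is an assumption introduced here, not taken from the report: k is the number of degree categories, n the total number of respondents, n_j the size of group j, y_ij the number of children of respondent i in group j, and bars denote group and overall means.

```latex
\begin{aligned}
\mathrm{SSG} &= \sum_{j=1}^{k} n_j \,(\bar{y}_j - \bar{y})^2, &
\mathrm{Dfg} &= k - 1, &
\mathrm{MSG} &= \frac{\mathrm{SSG}}{\mathrm{Dfg}}, \\
\mathrm{SSE} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2, &
\mathrm{Dfe} &= n - k, &
\mathrm{MSE} &= \frac{\mathrm{SSE}}{\mathrm{Dfe}}.
\end{aligned}
```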

F-test = MSG/MSE

The p-value in this test is the probability of observing a ratio between the “between” and “within” group variabilities, on which ANOVA builds, at least as large as the one observed, if in fact the mean number of children is equal across all categories of “degree”. If this p-value is less than the significance level stated for the test, we reject H(0) and conclude, at the 5% significance level, that the mean number of children differs among the categories of degree holders.
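
The calculation described above can be sketched in R as follows. The data are simulated stand-ins, not the GSS; the group labels, group sizes and Poisson means are hypothetical. The sketch computes the F statistic by hand from SSG and SSE and then compares it with the output of `aov()`.

```r
set.seed(42)  # reproducible simulation

# Simulated stand-in data (NOT the GSS); labels and means are hypothetical
groups <- c("Lt High School", "High School", "Bachelor")
toy <- data.frame(
  degree = factor(rep(groups, each = 40), levels = groups),
  childs = pmin(rpois(120, lambda = rep(c(3, 2, 1.5), each = 40)), 8)
)

k <- nlevels(toy$degree)   # number of groups
n <- nrow(toy)             # total number of observations
grand_mean  <- mean(toy$childs)
group_means <- tapply(toy$childs, toy$degree, mean)
group_sizes <- tapply(toy$childs, toy$degree, length)

# Between-group and within-group sums of squares
ssg <- sum(group_sizes * (group_means - grand_mean)^2)
sse <- sum((toy$childs - ave(toy$childs, toy$degree))^2)  # ave() gives each row its group mean

dfg <- k - 1
dfe <- n - k
f_stat  <- (ssg / dfg) / (sse / dfe)                 # F = MSG / MSE
p_value <- pf(f_stat, dfg, dfe, lower.tail = FALSE)  # upper-tail probability

# The same test through R's built-in ANOVA
fit <- summary(aov(childs ~ degree, data = toy))[[1]]
print(fit)
print(c(F = f_stat, p = p_value))
```

With this report's data, the analogous call would presumably be `summary(aov(childs ~ degree, data = year_1980))`, and likewise for `year_2010`. Those results are not reproduced here.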

[ANOVA output: 1980, 2010]

Conclusion:

The results in the previous section indicate that the mean number of children differs among the categories of degree holders. Degree is a barometer of education and awareness, which could guide an individual's decision to have fewer or more children. The chi-squared tests of independence between “degree” and “childs” give very small p-values in both years (7.213e-13 for 1980 and 3.607e-26 for 2010), although R warns that the chi-squared approximation may be incorrect. These differences support the notion behind the research question: more educated respondents might have fewer children. The data for 1980 and 2010 indicate that there might be some difference between the number of children born to high school graduates and to graduate degree holders, or between any other pair of categories of the categorical variable “degree”. Therefore, education might make a difference to family planning.