Preliminary Considerations

Purpose of This Document

This project has been completed in partial fulfillment of requirements for IS 606, The City University of New York MS in Data Analytics.

The purpose of this document is to demonstrate the statistical and data assumptions, procedures, and diagnostics used to winnow down a large dataset and a loosely developed research question in order to arrive at the data used to answer two specific, focused questions related to women’s health.

Obtaining Data

Data were downloaded through the graphical interface of the World Bank Data website and then added to the author’s GitHub account. The data file can be found at:

https://raw.githubusercontent.com/pm0kjp/IS_606_Final_Project/master/Data/Data_Extract_From_World_Development_Indicators_Data.csv

The data dictionary can be found at:

https://raw.githubusercontent.com/pm0kjp/IS_606_Final_Project/master/Data/Data_Extract_From_World_Development_Indicators_Definition%20and%20Source.csv
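As a minimal sketch of how the data file might be read programmatically, the Python snippet below downloads the CSV from the URL given above and parses it into rows. The function names (`parse_csv`, `fetch_rows`), the assumed UTF-8 encoding, and the sample column names are illustrative assumptions; the actual column layout of the extract is not described in this document.

```python
import csv
import io
import urllib.request

# URL of the data file as given in this document.
DATA_URL = (
    "https://raw.githubusercontent.com/pm0kjp/IS_606_Final_Project/"
    "master/Data/Data_Extract_From_World_Development_Indicators_Data.csv"
)


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_rows(url=DATA_URL):
    """Download the file at `url` and parse it (requires network access).

    Assumption: the file is UTF-8 encoded, possibly with a byte-order mark.
    """
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8-sig")
    return parse_csv(text)


# Offline illustration with made-up column names and values.
sample = "Country,Series,Value\nExampleland,SG.VAW.ARGU.ZS,12.5\n"
rows = parse_csv(sample)
```

Calling `fetch_rows()` with no arguments would retrieve the data file itself; the data dictionary can be read the same way by passing its URL.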

A Disclaimer: Culturally Mediated Language

Domestic violence and spousal abuse are value-laden terms that describe a sociological phenomenon – for the purposes of this study, the act of a man beating a woman he is married to – that is understood differently depending on cultural context. In some cultures, there is widespread agreement that a man striking his wife is a crime and constitutes unjustifiable abuse, but in others, this is considered part of marriage and a natural response to perceived failures on the part of a wife to fulfill her obligations. In this study I use the term domestic violence to describe a man’s physical punishment of his wife for perceived infractions.

Importance of the Topic

Understanding the factors that contribute to women finding it legitimate for a man to strike his wife can help drive efforts to influence these contributing factors. Directly challenging cultural norms is prone to failure and to being perceived as cultural hegemony, but providing a scaffolding of social opportunities for girls and women may indirectly influence attitudes about domestic violence.

I chose this topic because of my own experiences in the developing world, where women’s choices are often severely limited. I believe that exposure to limited possibilities for educational, economic, and political advancement contributes to women’s support or toleration of violence in the home. This topic is important not only as it relates to women’s rights but also in relation to children’s exposure to violence and to the economic, educational, and political success of societies generally.

Research Question

In this study, I investigate factors that influence a woman’s likelihood to consider it legitimate for a man to beat his wife under particular circumstances.

The research question that prompted this investigation can be phrased as follows:

Do reduced opportunities for women in the areas of education, health, politics, and work contribute to women’s belief that men are justified in striking their wives?

The final research questions, arrived at after limiting the scope of the project, are:

Can we successfully model female acceptance for men beating wives who argue with them?

Do nations with low levels (first quartile) of contraceptive prevalence and/or short (first quartile) female life expectancy have a level of female acceptance for men beating wives who argue with them that differs from that of nations generally?
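To make the "first quartile" grouping in the second question concrete, here is a minimal Python sketch that finds the first-quartile cutoff of an explanatory variable and compares mean acceptance in the low group with the overall mean. The record keys, the helper names, and all values are made up for illustration. The inclusive interpolation method is only one of several quartile definitions and is an assumption here, not necessarily the one used in the analysis.

```python
import statistics


def first_quartile_cutoff(values):
    """Return the first-quartile (25th percentile) cutoff of `values`."""
    return statistics.quantiles(values, n=4, method="inclusive")[0]


def low_group(records, key):
    """Names of records whose `key` value is at or below the first quartile."""
    cutoff = first_quartile_cutoff([r[key] for r in records])
    return [r["country"] for r in records if r[key] <= cutoff]


# Made-up values for illustration only; these are not World Bank figures.
records = [
    {"country": "A", "contraceptive": 10.0, "acceptance": 40.0},
    {"country": "B", "contraceptive": 20.0, "acceptance": 35.0},
    {"country": "C", "contraceptive": 30.0, "acceptance": 30.0},
    {"country": "D", "contraceptive": 40.0, "acceptance": 20.0},
    {"country": "E", "contraceptive": 50.0, "acceptance": 10.0},
]
low = low_group(records, "contraceptive")
low_mean = statistics.mean(r["acceptance"] for r in records if r["country"] in low)
overall_mean = statistics.mean(r["acceptance"] for r in records)
```

A formal answer to the question would additionally require a hypothesis test on the difference between the two groups; the sketch only shows how the groups could be formed.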

Data

Sources

The table below lists the Code, Indicator Name, and Source of the nine final explanatory variables and the one response variable chosen for regression modeling, as supplied by the data dictionary provided by the World Bank. The method of arriving at this limited set of variables is described in detail further on.



+----------------------+--------------------------------+----------------------------------------------------+
| Code                 | Indicator Name                 | Source                                             |
+======================+================================+====================================================+
| SG.VAW.ARGU.ZS       | Women who believe a husband is | Demographic and Health Surveys (DHS) and Multiple  |
|                      | justified in beating his wife  | Indicator Cluster Surveys (MICS)                   |
|                      | when she argues with him (%)   |                                                    |
+----------------------+--------------------------------+----------------------------------------------------+
| SP.ADO.TFRT          | Adolescent fertility rate      | United Nations Population Division, World          |
|                      | (births per 1,000 women ages   | Population Prospects.                              |
|                      | 15-19)                         |                                                    |
+----------------------+--------------------------------+----------------------------------------------------+
| SP.DYN.CONU.ZS       | Contraceptive prevalence (% of | UNICEF's State of the World's Children and         |
|                      | women ages 15-49)              | Childinfo, United Nations Population Division's    |
|                      |                                | World Contraceptive Use, household surveys         |
|                      |                                | including Demographic and Health Surveys and       |
|                      |                                | Multiple Indicator Cluster Surveys.                |
+----------------------+--------------------------------+----------------------------------------------------+
| SH.STA.ANVC.ZS       | Pregnant women receiving       | UNICEF, State of the World's Children, Childinfo,  |
|                      | prenatal care (%)              | and Demographic and Health Surveys.                |
+----------------------+--------------------------------+----------------------------------------------------+
| SP.DYN.LE00.FE.IN    | Life expectancy at birth,      | (1) United Nations Population Division. World      |
|                      | female (years)                 | Population Prospects, (2) United Nations           |
|                      |                                | Statistical Division. Population and Vital         |
|                      |                                | Statistics Report (various years), (3) Census      |
|                      |                                | reports and other statistical publications from    |
|                      |                                | national statistical offices, (4) Eurostat:        |
|                      |                                | Demographic Statistics, (5) Secretariat of the     |
|                      |                                | Pacific Community: Statistics and Demography       |
|                      |                                | Programme, and (6) U.S. Census Bureau:             |
|                      |                                | International Database.                            |
+----------------------+--------------------------------+----------------------------------------------------+
| SL.EMP.1524.SP.FE.ZS | Employment to population       | International Labour Organization, Key Indicators  |
|                      | ratio, ages 15-24, female (%)  | of the Labour Market database.                     |
|                      | (modeled ILO estimate)         |                                                    |
+----------------------+--------------------------------+----------------------------------------------------+
| SP.DYN.AMRT.FE       | Mortality rate, adult, female  | (1) United Nations Population Division. World      |
|                      | (per 1,000 female adults)      | Population Prospects. New York, United Nations,    |
|                      |                                | Department of Economic and Social Affairs          |
|                      |                                | (advanced Excel tables). Available at              |
|                      |                                | http://esa.un.org/wpp/unpp/panel_population.htm,   |
|                      |                                | (2) University of California, Berkeley, and Max    |
|                      |                                | Planck Institute for Demographic Research. Human   |
|                      |                                | Mortality Database. [www.mortality.org or          |
|                      |                                | www.humanmortality.de].                            |
+----------------------+--------------------------------+----------------------------------------------------+
| SP.DYN.TO65.FE.ZS    | Survival to age 65, female (%  | United Nations Population Division. World          |
|                      | of cohort)                     | Population Prospects. New York, United Nations,    |
|                      |                                | Department of Economic and Social Affairs          |
|                      |                                | (advanced Excel tables). Available at              |
|                      |                                | http://esa.un.org/wpp/unpp/panel_population.htm.   |
+----------------------+--------------------------------+----------------------------------------------------+
| SL.UEM.TOTL.FE.ZS    | Unemployment, female (% of     | International Labour Organization, Key Indicators  |
|                      | female labor force) (modeled   | of the Labour Market database.                     |
|                      | ILO estimate)                  |                                                    |
+----------------------+--------------------------------+----------------------------------------------------+
| SL.UEM.1524.FE.ZS    | Unemployment, youth female (%  | International Labour Organization, Key Indicators  |
|                      | of female labor force ages     | of the Labour Market database.                     |
|                      | 15-24) (modeled ILO estimate)  |                                                    |
+----------------------+--------------------------------+----------------------------------------------------+

Methodology

As shown in the table above, data were collected by a number of United Nations and World Bank organizations, using a number of methods, including data contributed by participating nations, household surveys, and economic and population estimates.

The final variables have the following sources and methods:

Adolescent fertility rate (births per 1,000 women ages 15-19)

Data were gathered from one source, the United Nations Population Division’s World Population Prospects.

United Nations Population Division’s World Population Prospects (WPP)

The WPP consists of “official United Nations population estimates and projections that have been prepared by the Population Division of the Department of Economic and Social Affairs of the United Nations Secretariat.”1 Population estimates (including life expectancy, infant mortality, migration, fertility, etc.) are calculated using a variety of locally gathered data sources.2

Contraceptive prevalence (% of women ages 15-49)

Data was gathered using various sources:

  • UNICEF’s State of the World’s Children and Childinfo
  • United Nations Population Division’s World Contraceptive Use
  • Household surveys including Demographic and Health Surveys (DHS) and Multiple Indicator Cluster Surveys (MICS)

State of the World’s Children

The State of the World’s Children is an annual report published by UNICEF which compiles data about children from a number of sources and reports it in a story-driven, narrative form.3

Multiple Indicator Cluster Surveys (MICS) / Childinfo

The MICS data collection is a UNICEF initiative dating from 1995 that conducts face-to-face surveys in low- and middle-income countries, with a special concentration on the issues that affect women and children.4 MICS surveys include four questionnaires: one for the household, one for adult women, one for adult men, and one for children.5 UNICEF describes MICS as the “centerpiece” of its data collection strategy and provides all MICS data at http://mics.unicef.org.6 Detailed descriptions of MICS survey design, implementation, and analysis can be found at the same website. ChildInfo is another name for the children’s data component of MICS.7

Demographic and Health Surveys (DHS)

The DHS data collection program is an initiative of USAID in which households are surveyed in person.8 “Surveys have large sample sizes (usually between 5,000 and 30,000 households) and typically are conducted about every 5 years, to allow comparisons over time.”9

World Contraceptive Use

The indicator provided by the UN Department of Economic and Social Affairs, Population Division, is defined as “the percentage of women who are currently using, or whose sexual partner is currently using, at least one method of contraception, regardless of the method used. It is usually reported for married or in union women aged 15 to 49.”10 Data used to calculate this indicator come from national surveys including the MICS and DHS (described elsewhere in this document).11

Pregnant women receiving prenatal care (%)

Sources for this data include:

  • UNICEF’s State of the World’s Children and Childinfo (described above)
  • Demographic and Health Surveys (DHS) (described above)

Life expectancy at birth, female (years)

Data was gathered using six source types:

  • United Nations Population Division. World Population Prospects (described above)
  • United Nations Statistical Division. Population and Vital Statistics Report (various years)
  • Census reports and other statistical publications from national statistical offices
  • Eurostat: Demographic Statistics
  • Secretariat of the Pacific Community: Statistics and Demography Programme
  • U.S. Census Bureau: International Database.

United Nations Statistical Division’s Population and Vital Statistics Report

The United Nations describes the Population and Vital Statistics Report as follows:

“The Population and Vital Statistics Report presents most recent data on population size (total, male and female) from the latest available census of the population, national official population estimates and the number and rate (births, deaths and infant deaths) for the latest available year within the past 15 years. It also presents United Nations estimates of the mid-year population of the world, and its major areas and regions.”12

Eurostat: Demographic Statistics

The European Commission describes its collection methodology on its website:

“Data on population demography and migration is collected every year: countries report to Eurostat their population on the 1st of January, along with breakdowns of the population by various characteristics. Data on vital events (births, deaths) … are also reported, resulting in a wealth of information on European population. …

Based on population, vital events and migration trends Eurostat also produces population projections every three years, to estimate the likely future size and structure of population.”13

Secretariat of the Pacific Community: Statistics and Demography Programme

Demographic statistics are compiled using population censuses and household surveys.14 Publications include the Pacific Mortality Trend Report15 and the Population and Demographic Indicators dataset.16

U.S. Census Bureau: International Database

The United States Census Bureau conducts decennial censuses as well as other surveys and censuses on an ongoing basis. These data drive population projections on a cohort-by-cohort basis.17

Employment to population ratio, ages 15-24, female (%)

Data come from the International Labour Organization, Key Indicators of the Labour Market database.

International Labour Organization, Key Indicators of the Labour Market

The ILO is a branch of the United Nations. It describes its Key Indicators of the Labour Market data product as follows: “Harvesting information from international data repositories as well as regional and national statistical sources, the KILM offers data for over 200 countries.”18

Mortality rate, adult, female (per 1,000 female adults)

Two sources are used for this data:

  • United Nations Population Division. World Population Prospects (described above)
  • University of California, Berkeley, and Max Planck Institute for Demographic Research. Human Mortality Database.

Human Mortality Database

The Human Mortality Database derives its raw data mainly from birth and death counts in vital statistics, plus population counts from periodic censuses and/or official population estimates.19

Survival to age 65, female (% of cohort)

Data are taken from the United Nations Population Division’s World Population Prospects (WPP), described above.

Unemployment, female (% of female labor force)

Data are taken from the International Labour Organization, Key Indicators of the Labour Market database, described above.

Women who believe a husband is justified in beating his wife when she argues with him

Data were gathered via Demographic and Health Surveys (DHS) and Multiple Indicator Cluster Surveys (MICS), both described above.

Variable Selection and Data Analysis

The data chosen for this analysis were provided by the World Bank’s World Development Indicators, which can be accessed at http://data.worldbank.org/data-catalog/world-development-indicators. These data are observational and can therefore be characterized more rigorously in terms of correlation than causality. Additionally, the data are sourced from a number of national and international governmental agencies and NGOs and may be heterogeneous in quality. Any conclusions reached in this paper will therefore be provisional and conservative. All variables are numeric, with the exception of case identifiers. Variable selection was a three-stage process of iterative narrowing and focusing, which is described in detail below.

Initial Response Variables

The response variable(s) will be one or more of the spousal violence variables collected by the World Bank. These are numerical variables representing female agreement on a national level with the belief that men are justified in striking their wives in certain situations. Some countries have several measurements of these fields from different years (coverage spans 1991 to 2015). These data are gathered in six fields:

  • Women who believe a husband is justified in beating his wife when she refuses sex with him (%)

  • Women who believe a husband is justified in beating his wife when she goes out without telling him (%)

  • Women who believe a husband is justified in beating his wife when she neglects the children (%)

  • Women who believe a husband is justified in beating his wife when she burns the food (%)

  • Women who believe a husband is justified in beating his wife when she argues with him (%)

  • Women who believe a husband is justified in beating his wife (any of five reasons) (%)

There are 74 countries for which we have data on at least one of the spousal violence opinion fields:

 [1] Albania                Armenia                Azerbaijan            
 [4] Bangladesh             Belize                 Benin                 
 [7] Bolivia                Bosnia and Herzegovina Burkina Faso          
[10] Burundi                Cambodia               Cameroon              
[13] Colombia               Comoros                Congo, Dem. Rep.      
[16] Congo, Rep.            Cote d'Ivoire          Dominican Republic    
[19] Egypt, Arab Rep.       Eritrea                Ethiopia              
[22] Gabon                  Gambia, The            Georgia               
[25] Ghana                  Guinea                 Guinea-Bissau         
[28] Guyana                 Haiti                  Honduras              
[31] India                  Indonesia              Jordan                
[34] Kazakhstan             Kenya                  Kyrgyz Republic       
[37] Lesotho                Liberia                Macedonia, FYR        
[40] Madagascar             Malawi                 Maldives              
[43] Mali                   Moldova                Mongolia              
[46] Montenegro             Morocco                Mozambique            
[49] Namibia                Nepal                  Nicaragua             
[52] Niger                  Nigeria                Pakistan              
[55] Peru                   Philippines            Rwanda                
[58] Sao Tome and Principe  Senegal                Serbia                
[61] Sierra Leone           Somalia                Swaziland             
[64] Tajikistan             Tanzania               Timor-Leste           
[67] Togo                   Trinidad and Tobago    Turkmenistan          
[70] Uganda                 Ukraine                Vietnam               
[73] Zambia                 Zimbabwe              
252 Levels:  Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe
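The selection of countries with data on at least one of the six opinion fields can be sketched as follows. This is a minimal Python illustration, not the code that produced the listing above. The short field keys, the row layout, and the missing-value markers (an empty string, `..`, and `NA`) are assumptions, and the rows are made up.

```python
# Hypothetical short keys standing in for the six survey fields listed above.
OPINION_FIELDS = ["refuses_sex", "goes_out", "neglects_children",
                  "burns_food", "argues", "any_reason"]


def has_value(cell):
    """True when a cell holds something other than a missing-value marker."""
    return cell is not None and str(cell).strip() not in ("", "..", "NA")


def countries_with_any(rows, fields=OPINION_FIELDS):
    """Sorted unique countries with at least one non-missing opinion field."""
    return sorted({
        row["country"] for row in rows
        if any(has_value(row.get(f)) for f in fields)
    })


# Made-up rows: several years per country are allowed, as in the source data.
rows = [
    {"country": "Exampleland", "year": 2005, "argues": "31.2"},
    {"country": "Exampleland", "year": 2010, "argues": "", "burns_food": "12.0"},
    {"country": "Samplestan", "year": 2010, "argues": ".."},
    {"country": "Testonia", "year": 2012, "any_reason": "45.1"},
]
found = countries_with_any(rows)
```

Because a country qualifies as soon as any one year has any one field populated, countries with repeated surveys appear only once in the result.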

Initial Pool of Explanatory Variables

There are 163 numerical variables relating to the economic, educational, political, and health situation of women that made up an initial selection of possible explanatory variables. Not all data were available for each country, and the most incomplete data were discarded. In the interest of full disclosure, the original 163 possible explanatory variables, which were later narrowed to two dozen, are listed below:

                                                                                             Indicator.Name
1                                                  Women's share of population ages 15+ living with HIV (%)
2                                                                  Wanted fertility rate (births per woman)
3                                              Unmet need for contraception (% of married women ages 15-49)
4                   Teenage mothers (% of women ages 15-19 who have had children or are currently pregnant)
5                                             Adolescent fertility rate (births per 1,000 women ages 15-19)
6                                                          Contraceptive prevalence (% of women ages 15-49)
7                                                                Pregnant women receiving prenatal care (%)
8                                     Prevalence of anemia among non-pregnant women (% of women ages 15-49)
9                                                             Prevalence of anemia among pregnant women (%)
10                                            Proportion of seats held by women in national parliaments (%)
11  Share of women in wage employment in the nonagricultural sector (% of total nonagricultural employment)
12                         Adjusted net enrollment rate, primary, female (% of primary school age children)
13                    Average working hours of children, study and work, female, ages 7-14 (hours per week)
14                      Average working hours of children, working only, female, ages 7-14 (hours per week)
15           Child employment in manufacturing, female (% of female economically active children ages 7-14)
16                Child employment in services, female (% of female economically active children ages 7-14)
17            Children in employment, self-employed, female (% of female children in employment, ages 7-14)
18           Children in employment, study and work, female (% of female children in employment, ages 7-14)
19                                                                  Children out of school, primary, female
20                Children in employment, work only, female (% of female children in employment, ages 7-14)
21             Children in employment, wage workers, female (% of female children in employment, ages 7-14)
22    Children in employment, unpaid family workers, female (% of female children in employment, ages 7-14)
23                                           Female legislators, senior officials and managers (% of total)
24                                                               Firms with female top manager (% of firms)
25                                            Female headed households (% of households with a female head)
26                                                Firms with female participation in ownership (% of firms)
27                 Gross intake ratio in first grade of primary education, female (% of relevant age group)
28                            Labor force participation rate for ages 15-24, female (%) (national estimate)
29                                                                 Life expectancy at birth, female (years)
30                                      Condom use, population ages 15-24, female (% of females ages 15-24)
31                                              Contributing family workers, female (% of females employed)
32                                          Children in employment, female (% of female children ages 7-14)
33             Child employment in agriculture, female (% of female economically active children ages 7-14)
34                                                  Employment in industry, female (% of female employment)
35                                                  Employment in services, female (% of female employment)
36                                   Employment to population ratio, 15+, female (%) (modeled ILO estimate)
37                                      Employment to population ratio, 15+, female (%) (national estimate)
38                            Employment to population ratio, ages 15-24, female (%) (modeled ILO estimate)
39                               Employment to population ratio, ages 15-24, female (%) (national estimate)
40                                               Employment in agriculture, female (% of female employment)
41                                                                      Employers, female (% of employment)
42             Labor force participation rate, female (% of female population ages 15+) (national estimate)
43        Labor force participation rate, female (% of female population ages 15-64) (modeled ILO estimate)
44                                     Labor force with primary education, female (% of female labor force)
45                                   Labor force with secondary education, female (% of female labor force)
46                                    Labor force with tertiary education, female (% of female labor force)
47                         Labor force participation rate for ages 15-24, female (%) (modeled ILO estimate)
48          Labor force participation rate, female (% of female population ages 15+) (modeled ILO estimate)
49                                             Literacy rate, adult female (% of females ages 15 and above)
50                                                    Literacy rate, youth female (% of females ages 15-24)
51                                                Long-term unemployment, female (% of female unemployment)
52                                        Lower secondary completion rate, female (% of relevant age group)
53                                                             Labor force, female (% of total labor force)
54                                                  Mortality rate, adult, female (per 1,000 female adults)
55                                                   Mortality rate, infant, female (per 1,000 live births)
56                                                  Mortality rate, under-5, female (per 1,000 live births)
57                                 Net intake rate in grade 1, female (% of official school-age population)
58                                 Prevalence of wasting, weight for height, female (% of children under 5)
59                                              Part time employment, female (% of total female employment)
60                                               Persistence to last grade of primary, female (% of cohort)
61                                                                          Population, female (% of total)
62                                                                 Prevalence of HIV, female (% ages 15-24)
63                              Prevalence of overweight, weight for height, female (% of children under 5)
64                          Prevalence of severe wasting, weight for height, female (% of children under 5)
65                                   Prevalence of stunting, height for age, female (% of children under 5)
66                                Prevalence of underweight, weight for age, female (% of children under 5)
67                                                             Persistence to grade 5, female (% of cohort)
68                                                Primary completion rate, female (% of relevant age group)
69                                                                     Primary education, pupils (% female)
70                                                                   Primary education, teachers (% female)
71                                                              Progression to secondary school, female (%)
72                                           Part time employment, female (% of total part time employment)
73                        Ratio of female to male labor force participation rate (%) (modeled ILO estimate)
74                           Ratio of female to male labor force participation rate (%) (national estimate)
75                                                      Repeaters, primary, female (% of female enrollment)
76                                                    Repeaters, secondary, female (% of female enrollment)
77                                                          School enrollment, preprimary, female (% gross)
78                                                             School enrollment, primary, female (% gross)
79                                                               School enrollment, primary, female (% net)
80                                                           School enrollment, secondary, female (% gross)
81                                                             School enrollment, secondary, female (% net)
82                                                            School enrollment, tertiary, female (% gross)
83                                                           Secondary education, general pupils (% female)
84                                                                   Secondary education, pupils (% female)
85                                                                 Secondary education, teachers (% female)
86                                                                    Secondary education, teachers, female
87                                                        Secondary education, vocational pupils (% female)
88                                                            Self-employed, female (% of females employed)
89           Share of youth not in education, employment or training, female (% of female youth population)
90                                                                Smoking prevalence, females (% of adults)
91                                                                 Survival to age 65, female (% of cohort)
92                                                            Tertiary education, academic staff (% female)
93                                     Trained teachers in primary education, female (% of female teachers)
94                                   Unemployment with primary education, female (% of female unemployment)
95                                 Unemployment with secondary education, female (% of female unemployment)
96                                  Unemployment with tertiary education, female (% of female unemployment)
97                                    Unemployment, female (% of female labor force) (modeled ILO estimate)
98                                       Unemployment, female (% of female labor force) (national estimate)
99                   Unemployment, youth female (% of female labor force ages 15-24) (modeled ILO estimate)
100                     Unemployment, youth female (% of female labor force ages 15-24) (national estimate)
101                                                  Vulnerable employment, female (% of female employment)
102                                               Wage and salaried workers, female (% of females employed)
103                                                                                                        
104                                                Literacy rate, adult male (% of males ages 15 and above)
105                                              Literacy rate, adult total (% of people ages 15 and above)
106                                            Literacy rate, youth (ages 15-24), gender parity index (GPI)
107                                                       Literacy rate, youth male (% of males ages 15-24)
108                                                     Literacy rate, youth total (% of people ages 15-24)
109                                                           CPIA gender equality rating (1=low to 6=high)
110                                           School enrollment, primary (gross), gender parity index (GPI)
111                             School enrollment, primary and secondary (gross), gender parity index (GPI)
112                                         School enrollment, secondary (gross), gender parity index (GPI)
113                                          School enrollment, tertiary (gross), gender parity index (GPI)
114                                                                    Poverty gap at $1.25 a day (PPP) (%)
115                                                                       Poverty gap at $2 a day (PPP) (%)
116                                                               Poverty gap at national poverty lines (%)
117                                          Poverty headcount ratio at $1.25 a day (PPP) (% of population)
118                                             Poverty headcount ratio at $2 a day (PPP) (% of population)
119                                     Poverty headcount ratio at national poverty lines (% of population)
120                                                         Rural poverty gap at national poverty lines (%)
121                         Rural poverty headcount ratio at national poverty lines (% of rural population)
122                                                         Urban poverty gap at national poverty lines (%)
123                         Urban poverty headcount ratio at national poverty lines (% of urban population)
124                                                      Adjusted savings: education expenditure (% of GNI)
125                                                   Adjusted savings: education expenditure (current US$)
126   All education staff compensation, secondary (% of total expenditure in secondary public institutions)
127     All education staff compensation, tertiary (% of total expenditure in tertiary public institutions)
128                 All education staff compensation, total (% of total expenditure in public institutions)
129       All education staff compensation, primary (% of total expenditure in primary public institutions)
130      Current education expenditure, secondary (% of total expenditure in secondary public institutions)
131                    Current education expenditure, total (% of total expenditure in public institutions)
132          Current education expenditure, primary (% of total expenditure in primary public institutions)
133        Current education expenditure, tertiary (% of total expenditure in tertiary public institutions)
134                             Expenditure on primary education (% of government expenditure on education)
135                           Expenditure on secondary education (% of government expenditure on education)
136                            Expenditure on tertiary education (% of government expenditure on education)
137                                Government expenditure on education, total (% of government expenditure)
138                  Gross intake ratio in first grade of primary education, male (% of relevant age group)
139                                                   Government expenditure on education, total (% of GDP)
140                 Gross intake ratio in first grade of primary education, total (% of relevant age group)
141                                                         Labor force with primary education (% of total)
142                                        Labor force with primary education, male (% of male labor force)
143                                                       Labor force with secondary education (% of total)
144                                      Labor force with secondary education, male (% of male labor force)
145                                                        Labor force with tertiary education (% of total)
146                                       Labor force with tertiary education, male (% of male labor force)
147                                                                     Primary education, duration (years)
148                                                                               Primary education, pupils
149                                                                             Primary education, teachers
150                                                                   Secondary education, duration (years)
151                                                                             Secondary education, pupils
152                                                                           Secondary education, teachers
153                                                                  Secondary education, vocational pupils
154              Share of youth not in education, employment or training, male (% of male youth population)
155                  Share of youth not in education, employment or training, total (% of youth population)
156                                                                     Secondary education, general pupils
157                                             Trained teachers in primary education (% of total teachers)
158                                        Trained teachers in primary education, male (% of male teachers)
159                                           Unemployment with primary education (% of total unemployment)
160                                      Unemployment with primary education, male (% of male unemployment)
161                                         Unemployment with secondary education (% of total unemployment)
162                                    Unemployment with secondary education, male (% of male unemployment)
163                                          Unemployment with tertiary education (% of total unemployment)
164                                     Unemployment with tertiary education, male (% of male unemployment)

Cleaning / Reduction of Explanatory and Response Variables

We tidy the data, reshaping it, altering variable names, and removing cases (a case here is data belonging to a single country in a single year) where data for the potential response variables are completely absent. This results in 135 cases. Code below demonstrates the methods by which this tidying was accomplished.

library(tidyr)
library(dplyr)  # provides filter(), select(), group_by() and %>% used below
# First, eliminate all the data that does not belong to the countries which
# have response variable data.
countries_of_interest <- filter(data, Country.Name %in% countries_with_data)
# We'll clean up the variable names
colnames(countries_of_interest) <- gsub("\\.", "_", colnames(countries_of_interest))
colnames(countries_of_interest) <- gsub(".+YR(\\d+)_", "\\1", colnames(countries_of_interest))
# And now we'll reshape the data
countries_of_interest <- (gather(countries_of_interest, year, value, -c(1:4)))
tidy_countries_of_interest <- select(countries_of_interest, -Series_Name) %>% 
    spread(Series_Code, value)
# We'll eliminate years for which no response variable data is known
tidy_countries_of_interest <- filter(tidy_countries_of_interest, !is.na(SG.VAW.REFU.ZS) | 
    !is.na(SG.VAW.GOES.ZS) | !is.na(SG.VAW.NEGL.ZS) | !is.na(SG.VAW.BURN.ZS) | 
    !is.na(SG.VAW.ARGU.ZS) | !is.na(SG.VAW.REAS.ZS))

Some countries appear in more than one year. Such cases are not independent of one another, so we keep only the most recent year for each country, in order that our cases remain independent.

# Year needs to be changed from factor to number:
tidy_countries_of_interest$year <- as.numeric(as.character(tidy_countries_of_interest$year))
tidy_countries_of_interest <- tidy_countries_of_interest %>% group_by(Country_Code) %>% 
    filter(year == max(year))
tidy_countries_of_interest <- ungroup(tidy_countries_of_interest)

This gives us 74 cases.
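As a quick illustration of the "most recent year per country" rule, here is a minimal sketch in base R. The data frame `toy` and its values are invented for illustration and are not part of the World Bank extract; the point is only to show that exactly one row per country survives.

```r
# Toy data: hypothetical countries and years, not the World Bank extract.
toy <- data.frame(
  Country_Code = c("AAA", "AAA", "BBB", "CCC", "CCC", "CCC"),
  year         = c(2005, 2011, 2008, 2001, 2006, 2012),
  value        = c(10, 12, 30, 50, 48, 45),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the maximum year within that row's country.
latest_year <- ave(toy$year, toy$Country_Code, FUN = max)

# Keep only the rows whose year equals the latest year for their country.
most_recent <- toy[toy$year == latest_year, ]

print(most_recent)
# A value of 0 means no country appears more than once.
print(anyDuplicated(most_recent$Country_Code))
```

Running the same duplicate check on `tidy_countries_of_interest$Country_Code` would be one way to confirm that the cases retained are one per country.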

We now select only those variables for which the number of absent values (out of 74 cases) is below 5, which gives us a much smaller list of measures (21 possible explanatory variables and 5 possible response variables, to be precise). This is a more workable scenario.

# Count the missing values in each column and keep columns with fewer than 5
na_counts <- colSums(is.na(tidy_countries_of_interest))
colnames_to_include <- names(na_counts)[na_counts < 5]
smaller_tidy <- tidy_countries_of_interest[, colnames_to_include]

The columns are, in total, 29: 3 which identify the case, 21 possible explanatory variables, and 5 possible response variables.

Final Variable Selection

Correlation Matrix

We create a correlation matrix to discover which variables of this more limited set will be the most promising to investigate further.

We’ll narrow down our correlation by setting a level of correlation that we consider sufficiently interesting. For use in single variable regression, we want the correlation level at an absolute value of at least 0.35. However, since we intend to use multiple regression, we can drop the limit to 0.25, since we will be combining various factors.

  SG.VAW.ARGU.ZS SG.VAW.BURN.ZS SG.VAW.GOES.ZS SG.VAW.NEGL.ZS
1     -0.3684022     -0.3967000     -0.3226340     -0.3086633
2      0.2319965      0.2622562      0.2620649      0.2841737
3     -0.3754656     -0.3665254     -0.3892853     -0.3967292
4     -0.2872609     -0.3000599     -0.3382916     -0.3402134
5      0.3945528      0.3118857      0.3190843      0.2875295
6      0.3236851      0.2585281      0.2598714      0.2709623
7     -0.5126210     -0.4927831     -0.4613978     -0.4244391
8     -0.4828889     -0.4022742     -0.4254645     -0.4167068
9     -0.4429030     -0.3714731     -0.3840760     -0.3846382
  SG.VAW.REFU.ZS            indicator
1     -0.3728898       SH.STA.ANVC.ZS
2      0.2882412 SL.EMP.1524.SP.FE.ZS
3     -0.3416662    SL.UEM.1524.FE.ZS
4     -0.2519426    SL.UEM.TOTL.FE.ZS
5      0.4029802          SP.ADO.TFRT
6      0.2899417       SP.DYN.AMRT.FE
7     -0.5161065       SP.DYN.CONU.ZS
8     -0.4711554    SP.DYN.LE00.FE.IN
9     -0.4202728    SP.DYN.TO65.FE.ZS
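The code that produced this table is not shown, so the following is only a sketch of how such a screen could be done, under the assumption that pairwise-complete correlations are acceptable. The data frame `toy` and its column names (`resp`, `expl_a`, `expl_b`) are hypothetical stand-ins for the response and explanatory columns of `smaller_tidy`.

```r
# Hypothetical stand-in data: one response and two candidate explanatory variables.
toy <- data.frame(
  resp   = c(9, 10, 7, 8, 5, 6, 3, 4, 1, 2),
  expl_a = c(1, 2, 3, 4, NA, 6, 7, 8, 9, 10),     # strongly related, one value missing
  expl_b = c(1, -1, 1, -1, 1, -1, 1, -1, 1, -1)   # weakly related
)

# Correlate every explanatory column with the response, using all pairs
# of observations that are complete for the two columns involved.
cors <- cor(toy[, c("expl_a", "expl_b")],
            toy[, "resp", drop = FALSE],
            use = "pairwise.complete.obs")

# Keep only the explanatory variables whose absolute correlation reaches 0.25.
threshold <- 0.25
keep <- cors[abs(cors[, "resp"]) >= threshold, , drop = FALSE]
print(keep)
```

With several response columns, the same idea applies row by row, for example by keeping a variable only when every response column meets the threshold.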

We can figure out what these indicators are in more detail by adding information from the data dictionary.



+----------------------+--------------------------------+
|      indicator       |         Indicator.Name         |
+======================+================================+
|    SH.STA.ANVC.ZS    |    Pregnant women receiving    |
|                      |       prenatal care (%)        |
+----------------------+--------------------------------+
| SL.EMP.1524.SP.FE.ZS |    Employment to population    |
|                      | ratio, ages 15-24, female (%)  |
|                      |     (modeled ILO estimate)     |
+----------------------+--------------------------------+
|  SL.UEM.1524.FE.ZS   | Unemployment, youth female (%  |
|                      |   of female labor force ages   |
|                      | 15-24) (modeled ILO estimate)  |
+----------------------+--------------------------------+
|  SL.UEM.TOTL.FE.ZS   |   Unemployment, female (% of   |
|                      |  female labor force) (modeled  |
|                      |         ILO estimate)          |
+----------------------+--------------------------------+
|     SP.ADO.TFRT      |   Adolescent fertility rate    |
|                      |  (births per 1,000 women ages  |
|                      |             15-19)             |
+----------------------+--------------------------------+
|    SP.DYN.AMRT.FE    | Mortality rate, adult, female  |
|                      |   (per 1,000 female adults)    |
+----------------------+--------------------------------+
|    SP.DYN.CONU.ZS    | Contraceptive prevalence (% of |
|                      |       women ages 15-49)        |
+----------------------+--------------------------------+
|  SP.DYN.LE00.FE.IN   |   Life expectancy at birth,    |
|                      |         female (years)         |
+----------------------+--------------------------------+
|  SP.DYN.TO65.FE.ZS   | Survival to age 65, female (%  |
|                      |           of cohort)           |
+----------------------+--------------------------------+
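A lookup like the one above can be sketched with base R's `merge()`. The two small data frames here are typed in by hand from values shown in this report; the dictionary column name `Code` is an assumption, since the actual column names of the data dictionary file are not shown.

```r
# Two rows of the correlation table (values copied from the output above).
cors <- data.frame(
  indicator      = c("SP.DYN.CONU.ZS", "SP.ADO.TFRT"),
  SG.VAW.ARGU.ZS = c(-0.5126210, 0.3945528),
  stringsAsFactors = FALSE
)

# A hand-typed excerpt standing in for the data dictionary.
# The column name "Code" is hypothetical.
dictionary <- data.frame(
  Code = c("SP.ADO.TFRT", "SP.DYN.CONU.ZS", "SP.DYN.LE00.FE.IN"),
  Indicator.Name = c(
    "Adolescent fertility rate (births per 1,000 women ages 15-19)",
    "Contraceptive prevalence (% of women ages 15-49)",
    "Life expectancy at birth, female (years)"
  ),
  stringsAsFactors = FALSE
)

# Left join: keep every row of cors and attach the matching description.
labelled <- merge(cors, dictionary,
                  by.x = "indicator", by.y = "Code", all.x = TRUE)
print(labelled)
```

Using `all.x = TRUE` keeps any indicator that has no dictionary entry, leaving its description as `NA` rather than silently dropping the row.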

Three of the possible explanatory variables have something to do with women’s reproductive health:

  • Pregnant women receiving prenatal care (%)

  • Adolescent fertility rate (births per 1,000 women ages 15-19)

  • Contraceptive prevalence (% of women ages 15-49)

Three of the indicators have to do with life span / life expectancy:

  • Mortality rate, adult, female (per 1,000 female adults)

  • Life expectancy at birth, female (years)

  • Survival to age 65, female (% of cohort)

And three of the indicators (which have lower correlation strength than the six above) have to do with employment:

  • Employment to population ratio, ages 15-24, female

  • Unemployment, youth female (% of female labor force ages 15-24)

  • Unemployment, female (% of female labor force)

Selecting for Simple Regression

For single-variable (simple) regression, we can simply choose a couple of variables to work with that have a strong correlation with our potential response variables.

The most striking correlation is with contraceptive prevalence. Since the other reproductive health indicators are unlikely to be independent of one another or of contraceptive prevalence (teenage pregnancy and access to reproductive health care are likely to be closely related to it), we can concentrate on contraceptive prevalence as the single reproductive health indicator we use.

The second most dramatic correlation is with female life expectancy at birth. Since this has at least thematic independence from contraceptive prevalence (although one could understand how lack of access to contraceptives could lead to earlier deaths in childbirth), we will consider it as well.

This brings us to two explanatory variables for which we would like to do simple regression. Which of the response variables will we consider? In order to reduce confounding, let’s eliminate the response variables most closely related to reproduction:

  • Women who believe a husband is justified in beating his wife when she goes out without telling him (as this could be related to paternity questions / infidelity)
  • Women who believe a husband is justified in beating his wife when she neglects the children (related to reproductive roles)
  • Women who believe a husband is justified in beating his wife when she refuses sex with him

This leaves two options:

  • Women who believe a husband is justified in beating his wife when she burns the food
  • Women who believe a husband is justified in beating his wife when she argues with him

The correlation with the two selected explanatory variables is stronger with the arguing variable than with the burning food variable. Additionally, a boxplot demonstrates a greater variance for arguing than for burning food:

We are likely to get a clearer statistical relationship for the variable that measures the proportion of women who believe a husband is justified in beating his wife when she argues with him, given its larger spread, and will therefore opt for using it.
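The comparison of spread can also be made numerically. The sketch below uses invented percentages in a hypothetical data frame `toy`; with the real data, the two columns would be `SG.VAW.ARGU.ZS` and `SG.VAW.BURN.ZS` from `smaller_tidy`.

```r
# Invented percentages for illustration only.
toy <- data.frame(
  argues     = c(5, 12, 20, 33, 41, 58, 64),
  burns_food = c(3, 6, 9, 12, 15, 20, 24)
)

# Variance and interquartile range for each candidate response variable.
spread <- sapply(toy, function(v) {
  c(variance = var(v, na.rm = TRUE), iqr = IQR(v, na.rm = TRUE))
})
print(spread)
```

Calling `boxplot(toy)` on the same data frame would draw the side-by-side boxplot that these numbers summarize.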

Data Visualization

We can do some initial data visualization and line fitting. We will first attempt to fit a line for the relationship between contraceptive prevalence and the acceptance of domestic violence and do some graphical diagnostics to see whether a linear model applies.

Contraceptive Model

Call:
lm(formula = contraceptive_tidy$SG.VAW.ARGU.ZS ~ contraceptive_tidy$SP.DYN.CONU.ZS)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.154 -11.560  -0.326  10.488  39.490 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       40.68348    4.21276   9.657 1.93e-14 ***
contraceptive_tidy$SP.DYN.CONU.ZS -0.45963    0.09268  -4.959 4.87e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.64 on 69 degrees of freedom
Multiple R-squared:  0.2628,    Adjusted R-squared:  0.2521 
F-statistic: 24.59 on 1 and 69 DF,  p-value: 4.869e-06

We can see that the \(R^2\) value is helpful, if not overwhelming, at 26.28%, and that the t value for the slope (-4.959) is very extreme, indicating a solid linear relationship, if conditions for reliability hold.
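In simple regression, \(R^2\) is the square of the correlation between the two variables, which gives a quick way to cross-check the summary output against the correlation table. The sketch below squares the correlation reported earlier for SP.DYN.CONU.ZS and SG.VAW.ARGU.ZS, then confirms the identity on a small invented data set (`x` and `y` are hypothetical values).

```r
# Correlation between SP.DYN.CONU.ZS and SG.VAW.ARGU.ZS, copied from the
# correlation table earlier in this report.
r <- -0.5126210
r_squared <- r^2
print(round(r_squared, 4))  # 0.2628, the Multiple R-squared in the summary

# The same identity on invented data: R-squared from lm() equals cor()^2.
x <- c(10, 20, 30, 40, 50, 60)
y <- c(40, 31, 30, 18, 20, 9)
fit <- lm(y ~ x)
print(all.equal(summary(fit)$r.squared, cor(x, y)^2))
```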

Conditions for Line Fitting

We check the conditions for model reliability by ensuring linearity, normal distribution of residuals, and constant variability:

Linearity and Constant Variability

By viewing the scatterplot and its overlaid linear model, we observe that variability does not appear constant throughout the data. At higher levels of contraceptive prevalence, residuals tend to fall slightly closer to the model than at lower levels of contraceptive prevalence. This is not quite problematic enough for us to discount the model.

We are already satisfied that the relationship is linear (and not lacking a trend at all or better represented by a curved model, for example) by having viewed the scatterplot with the overlaid linear model. To make sure, we can also plot the residuals:

The data appears approximately linear, although there is some interesting clustering of data toward the higher end of contraceptive prevalence which points to the lack of uniform variability in the data.

Residual Normality

We can display a histogram to determine whether the residuals have a normal distribution:

The distribution, while slightly right-skewed, is largely symmetrical and unimodal, with a nearly normal shape.

Conclusion

There is slight divergence of our contraceptive model from the conditions for linear models. We will therefore use this linear model as a predictive tool only with caution, aware that the lack of uniform variability in the data could affect its reliability.
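The checks above were made by eye. As a rough numeric complement (not the method used in this report), one could compare the spread of residuals in the lower and upper halves of the explanatory variable and compute a simple skewness measure. The sketch uses R's built-in `cars` data set as a stand-in for our data.

```r
# Stand-in model on the built-in cars data set (stopping distance vs speed).
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Constant variability: compare residual spread below and above the median
# of the explanatory variable. Very different values suggest a problem.
lower_half <- cars$speed <= median(cars$speed)
sd_by_half <- c(lower = sd(res[lower_half]), upper = sd(res[!lower_half]))
print(sd_by_half)

# Normality: a simple moment-based skewness of the standardized residuals.
# Values near 0 suggest symmetry; positive values suggest right skew.
z <- (res - mean(res)) / sd(res)
skewness <- mean(z^3)
print(skewness)
```

The graphical versions of these checks would be `plot(fitted(fit), resid(fit))` for the residual plot and `hist(resid(fit))` for the histogram.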

Life Expectancy Model

Call:
lm(formula = life_expectancy_tidy$SG.VAW.ARGU.ZS ~ life_expectancy_tidy$SP.DYN.LE00.FE.IN)

Residuals:
   Min     1Q Median     3Q    Max 
-27.95 -11.12  -5.13   8.63  47.08 

Coefficients:
                                       Estimate Std. Error t value
(Intercept)                             83.7596    13.4844   6.212
life_expectancy_tidy$SP.DYN.LE00.FE.IN  -0.9400     0.2037  -4.614
                                       Pr(>|t|)    
(Intercept)                            3.30e-08 ***
life_expectancy_tidy$SP.DYN.LE00.FE.IN 1.74e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.94 on 70 degrees of freedom
Multiple R-squared:  0.2332,    Adjusted R-squared:  0.2222 
F-statistic: 21.29 on 1 and 70 DF,  p-value: 1.739e-05

We can see that the \(R^2\) value is moderately helpful, at 23.32%, and that the t value for the slope (-4.614) is very extreme, indicating a solid linear relationship, if conditions hold.

Conditions for Line Fitting

We check the conditions for a reliable model by ensuring linearity, normal distribution of residuals, and constant variability:

Linearity and Constant Variability

By viewing the scatterplot and its overlaid linear model, we observe that, just as in the case of the contraceptive model, variability does not appear constant throughout the data. At higher life expectancies, residuals fall much closer to the model than at lower life expectancies. This makes us suspect that although the linear model may be helpful, we will be unable to use it as a reliable predictive model.

We are already satisfied that the relationship is linear (and not lacking a trend at all or better represented by a curved model, for example) by having viewed the scatterplot with the overlaid linear model. To make sure, we can also plot the residuals:

The data appears approximately linear, although there is some interesting clustering of data toward the higher end of life expectancy which points to the lack of uniform variability in the data. This mirrors quite closely what we’ve seen in the model based on contraceptive prevalence as well.

Residual Normality

We can display a histogram to determine whether the residuals have a normal distribution:

The distribution is right-skewed and not sufficiently normal to allow us to propose a linear model for prediction purposes.

Conclusion

Because of the divergence of our life expectancy model from the conditions for linear models, we cannot use this linear model as a reliable predictive tool. We can, however, keep in mind that it is useful for alerting us to a non-trivial negative correlation we may wish to investigate further.

Non-linear Smoothing

Plots that allow for curvature may illustrate trends in the data that a straight line misses, and so add to our understanding. Below, we use LOESS smoothing:

## Warning: Removed 3 rows containing missing values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
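The plots themselves were drawn with ggplot2, as the warnings above indicate. To show what LOESS does independently of the plotting layer, here is a minimal base R sketch on the built-in `cars` data set, used as a stand-in for our data. It compares the LOESS fit with a straight-line fit at a few points; the `span` value is simply the default written out explicitly.

```r
# Local regression (LOESS) on stand-in data.
fit_loess <- loess(dist ~ speed, data = cars, span = 0.75)

# A straight-line fit for comparison.
fit_line <- lm(dist ~ speed, data = cars)

# Evaluate both fits on a small grid across the observed range of speed.
grid <- data.frame(
  speed = seq(min(cars$speed), max(cars$speed), length.out = 5)
)
grid$smoothed <- predict(fit_loess, newdata = grid)
grid$linear   <- predict(fit_line, newdata = grid)
print(grid)
```

Where the `smoothed` and `linear` columns diverge, the data are bending away from the straight line, which is the kind of pattern the LOESS plots are meant to reveal.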

Selection for Multiple Regression (by Hand)

What about multiple regression? We have nine variables, but how should we combine them? We want to avoid redundancy and colinearity in our choices. Fitting a multiple regression model with all nine variables, and examining it alongside a 9x9 correlation matrix of those variables, will help us eliminate strongly colinear variables from our model.

We will continue to use the “arguing” variable as our response variable, for the same reasons as mentioned previously.

We begin by creating a model that accounts for all nine variables:

entire_model <- lm(smaller_tidy$SG.VAW.ARGU.ZS ~ smaller_tidy$SH.STA.ANVC.ZS + 
    smaller_tidy$SL.EMP.1524.SP.FE.ZS + smaller_tidy$SL.UEM.1524.FE.ZS + smaller_tidy$SL.UEM.TOTL.FE.ZS + 
    smaller_tidy$SP.ADO.TFRT + smaller_tidy$SP.DYN.AMRT.FE + smaller_tidy$SP.DYN.CONU.ZS + 
    smaller_tidy$SP.DYN.LE00.FE.IN + smaller_tidy$SP.DYN.TO65.FE.ZS)
summary(entire_model)

Call:
lm(formula = smaller_tidy$SG.VAW.ARGU.ZS ~ smaller_tidy$SH.STA.ANVC.ZS + 
    smaller_tidy$SL.EMP.1524.SP.FE.ZS + smaller_tidy$SL.UEM.1524.FE.ZS + 
    smaller_tidy$SL.UEM.TOTL.FE.ZS + smaller_tidy$SP.ADO.TFRT + 
    smaller_tidy$SP.DYN.AMRT.FE + smaller_tidy$SP.DYN.CONU.ZS + 
    smaller_tidy$SP.DYN.LE00.FE.IN + smaller_tidy$SP.DYN.TO65.FE.ZS)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.871  -8.690  -2.421   5.840  38.631 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)                       190.76165   76.46408   2.495   0.0155 *
smaller_tidy$SH.STA.ANVC.ZS        -0.20780    0.13341  -1.558   0.1248  
smaller_tidy$SL.EMP.1524.SP.FE.ZS  -0.13247    0.13671  -0.969   0.3366  
smaller_tidy$SL.UEM.1524.FE.ZS     -0.71428    0.47097  -1.517   0.1348  
smaller_tidy$SL.UEM.TOTL.FE.ZS      0.99157    0.87344   1.135   0.2609  
smaller_tidy$SP.ADO.TFRT           -0.03971    0.05970  -0.665   0.5087  
smaller_tidy$SP.DYN.AMRT.FE        -0.16553    0.12412  -1.334   0.1875  
smaller_tidy$SP.DYN.CONU.ZS        -0.19557    0.14643  -1.336   0.1869  
smaller_tidy$SP.DYN.LE00.FE.IN      0.69261    2.02035   0.343   0.7330  
smaller_tidy$SP.DYN.TO65.FE.ZS     -2.10894    2.01346  -1.047   0.2993  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.38 on 58 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.4362,    Adjusted R-squared:  0.3487 
F-statistic: 4.985 on 9 and 58 DF,  p-value: 5.691e-05

We’ll now do a backwards stepwise model reduction, removing, one by one, the variable with the highest p value (equivalently, the lowest absolute t statistic), while keeping in mind colinearity issues, which can be highlighted by correlation matrices like this one:

                     SH.STA.ANVC.ZS SL.EMP.1524.SP.FE.ZS SL.UEM.1524.FE.ZS
SH.STA.ANVC.ZS            1.0000000           -0.1790680         0.2749653
SL.EMP.1524.SP.FE.ZS     -0.1790680            1.0000000        -0.6926190
SL.UEM.1524.FE.ZS         0.2749653           -0.6926190         1.0000000
SL.UEM.TOTL.FE.ZS         0.2228170           -0.6184338         0.9562236
SP.ADO.TFRT              -0.2498688            0.4008194        -0.3958203
SP.DYN.AMRT.FE           -0.1237799            0.3484561        -0.2100498
SP.DYN.CONU.ZS            0.2320302           -0.2228853         0.1830244
SP.DYN.LE00.FE.IN         0.2366897           -0.3972221         0.3500810
SP.DYN.TO65.FE.ZS         0.2047651           -0.4025925         0.3334981
                     SL.UEM.TOTL.FE.ZS SP.ADO.TFRT SP.DYN.AMRT.FE
SH.STA.ANVC.ZS              0.22281697  -0.2498688     -0.1237799
SL.EMP.1524.SP.FE.ZS       -0.61843384   0.4008194      0.3484561
SL.UEM.1524.FE.ZS           0.95622356  -0.3958203     -0.2100498
SL.UEM.TOTL.FE.ZS           1.00000000  -0.3087964     -0.1056764
SP.ADO.TFRT                -0.30879641   1.0000000      0.6359953
SP.DYN.AMRT.FE             -0.10567643   0.6359953      1.0000000
SP.DYN.CONU.ZS              0.08984436  -0.4617582     -0.4009550
SP.DYN.LE00.FE.IN           0.24841976  -0.7197989     -0.9369888
SP.DYN.TO65.FE.ZS           0.22957105  -0.7010109     -0.9715663
                     SP.DYN.CONU.ZS SP.DYN.LE00.FE.IN SP.DYN.TO65.FE.ZS
SH.STA.ANVC.ZS           0.23203016         0.2366897         0.2047651
SL.EMP.1524.SP.FE.ZS    -0.22288527        -0.3972221        -0.4025925
SL.UEM.1524.FE.ZS        0.18302437         0.3500810         0.3334981
SL.UEM.TOTL.FE.ZS        0.08984436         0.2484198         0.2295710
SP.ADO.TFRT             -0.46175823        -0.7197989        -0.7010109
SP.DYN.AMRT.FE          -0.40095499        -0.9369888        -0.9715663
SP.DYN.CONU.ZS           1.00000000         0.5792463         0.5196357
SP.DYN.LE00.FE.IN        0.57924634         1.0000000         0.9896156
SP.DYN.TO65.FE.ZS        0.51963570         0.9896156         1.0000000

For example, we see extreme colinearity (>.90) between:

  • SL.UEM.1524.FE.ZS and SL.UEM.TOTL.FE.ZS

(so we should not use both of these employment variables)

  • SP.DYN.AMRT.FE and SP.DYN.LE00.FE.IN
  • SP.DYN.AMRT.FE and SP.DYN.TO65.FE.ZS
  • SP.DYN.LE00.FE.IN and SP.DYN.TO65.FE.ZS

(this indicates that we should keep at max ONE mortality variable)

Additionally, there are high levels of colinearity (> .60) between:

  • SL.EMP.1524.SP.FE.ZS and SL.UEM.1524.FE.ZS
  • SL.EMP.1524.SP.FE.ZS and SL.UEM.TOTL.FE.ZS

(so we should probably only select one of the employment variables)

  • SP.ADO.TFRT and SP.DYN.LE00.FE.IN
  • SP.ADO.TFRT and SP.DYN.TO65.FE.ZS

(the first colinearity we see between reproduction and mortality)

  • SP.ADO.TFRT and SP.DYN.AMRT.FE

(again a reproductive variable paired with a mortality variable)
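Rather than scanning the matrix by eye, pairs above a chosen threshold can be listed programmatically. The sketch below uses a four-variable excerpt of the correlation matrix above, typed in by hand; with the full data one would pass the complete 9x9 matrix instead. The object names `cm`, `flag` and `high_pairs` are my own.

```r
vars <- c("SL.UEM.1524.FE.ZS", "SL.UEM.TOTL.FE.ZS",
          "SP.DYN.LE00.FE.IN", "SP.DYN.TO65.FE.ZS")

# Excerpt of the correlation matrix shown above.
cm <- matrix(c(
  1.0000000, 0.9562236, 0.3500810, 0.3334981,
  0.9562236, 1.0000000, 0.2484198, 0.2295710,
  0.3500810, 0.2484198, 1.0000000, 0.9896156,
  0.3334981, 0.2295710, 0.9896156, 1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so each pair is reported once.
flag <- which(abs(cm) > 0.90 & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[flag[, "row"]],
  var2 = colnames(cm)[flag[, "col"]],
  r    = cm[flag],
  stringsAsFactors = FALSE
)
print(high_pairs)
```

Lowering the threshold from 0.90 to 0.60 in the same call would list the second tier of colinear pairs.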

In the interest of space, I will not show all of the R code or results. They are, however, available in the R Markdown document used to generate this report; to execute code chunks, remove the “eval=FALSE” attribute for the desired chunks.

The model arrived at by stepwise reduction is as follows:

smaller_model <- lm(smaller_tidy$SG.VAW.ARGU.ZS ~ smaller_tidy$SH.STA.ANVC.ZS + 
    smaller_tidy$SL.UEM.1524.FE.ZS + smaller_tidy$SP.DYN.CONU.ZS)
summary(smaller_model)

Call:
lm(formula = smaller_tidy$SG.VAW.ARGU.ZS ~ smaller_tidy$SH.STA.ANVC.ZS + 
    smaller_tidy$SL.UEM.1524.FE.ZS + smaller_tidy$SP.DYN.CONU.ZS)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.909 -10.646  -0.668   8.073  37.476 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    64.83227   10.78322   6.012 9.65e-08 ***
smaller_tidy$SH.STA.ANVC.ZS    -0.26349    0.12739  -2.068  0.04265 *  
smaller_tidy$SL.UEM.1524.FE.ZS -0.24394    0.12525  -1.948  0.05584 .  
smaller_tidy$SP.DYN.CONU.ZS    -0.37285    0.09155  -4.073  0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.37 on 64 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.3788,    Adjusted R-squared:  0.3497 
F-statistic: 13.01 on 3 and 64 DF,  p-value: 9.834e-07

As we can see, its R squared value (0.3788) is lower than that of the entire model (0.4362), as is expected whenever variables are removed, but its adjusted R squared value (0.3497) is slightly better than the 0.3487 value of the entire model. The stepwise reduction appears to have been successful.
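A single step of this kind of backward elimination can be sketched as follows. This is an illustration on R's built-in `mtcars` data set, not the code used for the report, and it applies the p value rule mechanically, without the colinearity judgment described above. Fitting with the `data =` argument, rather than `smaller_tidy$...` terms, is what makes `update()` convenient here.

```r
# Stand-in full model on the built-in mtcars data set.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# p values of the explanatory variables (dropping the intercept row).
p_values <- summary(full)$coefficients[-1, "Pr(>|t|)"]

# The variable with the highest p value is the candidate for removal.
weakest <- names(which.max(p_values))

# Refit the model without that variable.
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))

# Compare adjusted R-squared before and after the removal.
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(weakest)
print(adj_r2)
```

Repeating the step until the adjusted R squared stops improving, or until every remaining p value is acceptably small, gives the full stepwise procedure.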

Selection for Multiple Regression (Using the “leaps” Package)

Thanks to http://www.statmethods.net/stats/regression.html for great tips relating to this section! Leaps provides a great graphical way to choose any number of variables in the construction of a multiple regression model. First I can check my own earlier work to see if SP.DYN.CONU.ZS and SP.DYN.LE00.FE.IN really are the best choices if I’m sticking to one variable; then I can also check how close my hand-built set of three explanatory variables comes to the best subsets of that size. The summary of leaps creates a table in which asterisks represent a variable being chosen. In our case, I’ve asked to see the top four options for any given number of variables (nbest = 4), so I’ll be given first through fourth choices for one variable, first through fourth choices for two variables, and so on.

library(leaps)
leaps <- regsubsets(SG.VAW.ARGU.ZS ~ SH.STA.ANVC.ZS + SL.EMP.1524.SP.FE.ZS + 
    SL.UEM.1524.FE.ZS + SL.UEM.TOTL.FE.ZS + SP.ADO.TFRT + SP.DYN.AMRT.FE + SP.DYN.CONU.ZS + 
    SP.DYN.LE00.FE.IN + SP.DYN.TO65.FE.ZS, data = smaller_tidy, nbest = 4)
summary(leaps)$outmat
         SH.STA.ANVC.ZS SL.EMP.1524.SP.FE.ZS SL.UEM.1524.FE.ZS
1  ( 1 ) " "            " "                  " "              
1  ( 2 ) " "            " "                  " "              
1  ( 3 ) " "            " "                  " "              
1  ( 4 ) " "            " "                  "*"              
2  ( 1 ) " "            " "                  " "              
2  ( 2 ) "*"            " "                  " "              
2  ( 3 ) " "            " "                  "*"              
2  ( 4 ) " "            " "                  " "              
3  ( 1 ) "*"            " "                  " "              
3  ( 2 ) "*"            " "                  "*"              
3  ( 3 ) " "            " "                  " "              
3  ( 4 ) "*"            " "                  " "              
4  ( 1 ) "*"            " "                  " "              
4  ( 2 ) " "            " "                  "*"              
4  ( 3 ) "*"            " "                  " "              
4  ( 4 ) "*"            " "                  "*"              
5  ( 1 ) "*"            " "                  "*"              
5  ( 2 ) "*"            " "                  "*"              
5  ( 3 ) "*"            " "                  " "              
5  ( 4 ) "*"            " "                  " "              
6  ( 1 ) "*"            " "                  "*"              
6  ( 2 ) "*"            "*"                  "*"              
6  ( 3 ) "*"            "*"                  "*"              
6  ( 4 ) "*"            " "                  "*"              
7  ( 1 ) "*"            "*"                  "*"              
7  ( 2 ) "*"            " "                  "*"              
7  ( 3 ) "*"            " "                  "*"              
7  ( 4 ) "*"            "*"                  "*"              
8  ( 1 ) "*"            "*"                  "*"              
8  ( 2 ) "*"            "*"                  "*"              
8  ( 3 ) "*"            " "                  "*"              
8  ( 4 ) "*"            "*"                  "*"              
         SL.UEM.TOTL.FE.ZS SP.ADO.TFRT SP.DYN.AMRT.FE SP.DYN.CONU.ZS
1  ( 1 ) " "               " "         " "            "*"           
1  ( 2 ) " "               " "         " "            " "           
1  ( 3 ) " "               " "         " "            " "           
1  ( 4 ) " "               " "         " "            " "           
2  ( 1 ) " "               " "         "*"            " "           
2  ( 2 ) " "               " "         " "            "*"           
2  ( 3 ) " "               " "         " "            "*"           
2  ( 4 ) " "               " "         "*"            " "           
3  ( 1 ) " "               " "         "*"            " "           
3  ( 2 ) " "               " "         " "            "*"           
3  ( 3 ) " "               " "         "*"            "*"           
3  ( 4 ) "*"               " "         " "            "*"           
4  ( 1 ) " "               " "         "*"            "*"           
4  ( 2 ) "*"               " "         "*"            " "           
4  ( 3 ) " "               "*"         "*"            " "           
4  ( 4 ) " "               " "         " "            "*"           
5  ( 1 ) "*"               " "         "*"            " "           
5  ( 2 ) " "               " "         "*"            "*"           
5  ( 3 ) " "               " "         "*"            "*"           
5  ( 4 ) " "               "*"         "*"            "*"           
6  ( 1 ) "*"               " "         "*"            "*"           
6  ( 2 ) " "               " "         "*"            "*"           
6  ( 3 ) "*"               " "         "*"            " "           
6  ( 4 ) " "               "*"         "*"            "*"           
7  ( 1 ) "*"               " "         "*"            "*"           
7  ( 2 ) "*"               "*"         "*"            "*"           
7  ( 3 ) "*"               " "         "*"            "*"           
7  ( 4 ) " "               "*"         "*"            "*"           
8  ( 1 ) "*"               "*"         "*"            "*"           
8  ( 2 ) "*"               " "         "*"            "*"           
8  ( 3 ) "*"               "*"         "*"            "*"           
8  ( 4 ) "*"               "*"         "*"            "*"           
         SP.DYN.LE00.FE.IN SP.DYN.TO65.FE.ZS
1  ( 1 ) " "               " "              
1  ( 2 ) "*"               " "              
1  ( 3 ) " "               "*"              
1  ( 4 ) " "               " "              
2  ( 1 ) " "               "*"              
2  ( 2 ) " "               " "              
2  ( 3 ) " "               " "              
2  ( 4 ) "*"               " "              
3  ( 1 ) " "               "*"              
3  ( 2 ) " "               " "              
3  ( 3 ) " "               "*"              
3  ( 4 ) " "               " "              
4  ( 1 ) " "               "*"              
4  ( 2 ) " "               "*"              
4  ( 3 ) " "               "*"              
4  ( 4 ) "*"               " "              
5  ( 1 ) " "               "*"              
5  ( 2 ) " "               "*"              
5  ( 3 ) "*"               "*"              
5  ( 4 ) " "               "*"              
6  ( 1 ) " "               "*"              
6  ( 2 ) " "               "*"              
6  ( 3 ) " "               "*"              
6  ( 4 ) " "               "*"              
7  ( 1 ) " "               "*"              
7  ( 2 ) " "               "*"              
7  ( 3 ) "*"               "*"              
7  ( 4 ) " "               "*"              
8  ( 1 ) " "               "*"              
8  ( 2 ) "*"               "*"              
8  ( 3 ) "*"               "*"              
8  ( 4 ) "*"               " "              

First I notice that my selections for simple regression were correct and show up as the first and second best options for single-variable regression, according to leaps.

Next, I check out the four options offered to me by leaps for a regression using three variables:

(My by-hand attempt: SH.STA.ANVC.ZS, SL.UEM.1524.FE.ZS, SP.DYN.CONU.ZS)

First choice according to leaps: SH.STA.ANVC.ZS, SP.DYN.AMRT.FE, SP.DYN.TO65.FE.ZS (only one in common with mine)

Second choice according to leaps: SH.STA.ANVC.ZS, SL.UEM.1524.FE.ZS, SP.DYN.CONU.ZS (my choice)

Third choice according to leaps: SP.DYN.AMRT.FE, SP.DYN.CONU.ZS, SP.DYN.TO65.FE.ZS (one in common)

Fourth choice according to leaps: SH.STA.ANVC.ZS, SL.UEM.TOTL.FE.ZS, SP.DYN.CONU.ZS (two in common)
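To make the ranking less of a black box, here is a minimal base-R sketch of the same idea: fit every three-predictor model and sort by residual sum of squares. It is only an illustration under stated assumptions. The built-in mtcars data set stands in for smaller_tidy, and the names response, candidates, and ranking are my own. The leaps package uses a far more efficient search than this brute-force loop.

```r
# Stand-in data (an assumption): mtcars ships with R; mpg plays the response.
response   <- "mpg"
candidates <- c("cyl", "disp", "hp", "drat", "wt", "qsec")

# Enumerate every set of three candidate predictors.
subsets <- combn(candidates, 3, simplify = FALSE)

# Fit each model and record its fit statistics.
fits <- lapply(subsets, function(vars) {
  model <- lm(reformulate(vars, response = response), data = mtcars)
  data.frame(
    predictors = paste(vars, collapse = " + "),
    rss        = sum(residuals(model)^2),
    adj_r2     = summary(model)$adj.r.squared
  )
})
ranking <- do.call(rbind, fits)

# For a fixed number of predictors, the lowest residual sum of squares
# is also the highest adjusted R-squared.
ranking <- ranking[order(ranking$rss), ]
head(ranking, 4)  # analogous to nbest = 4 for the three-variable models
```

With real data, rows with missing values need care: each lm call here drops only the rows missing in its own variables, so the models may not all be fitted to the same cases.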

Let’s check the best model as proposed by leaps:

leaps_model <- lm(smaller_tidy$SG.VAW.ARGU.ZS ~ smaller_tidy$SH.STA.ANVC.ZS + 
    smaller_tidy$SP.DYN.AMRT.FE + smaller_tidy$SP.DYN.TO65.FE.ZS)
summary(leaps_model)

Call:
lm(formula = smaller_tidy$SG.VAW.ARGU.ZS ~ smaller_tidy$SH.STA.ANVC.ZS + 
    smaller_tidy$SP.DYN.AMRT.FE + smaller_tidy$SP.DYN.TO65.FE.ZS)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.490  -9.035  -2.893   9.271  40.925 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    257.53541   44.42735   5.797 1.99e-07 ***
smaller_tidy$SH.STA.ANVC.ZS     -0.21752    0.12613  -1.725 0.089223 .  
smaller_tidy$SP.DYN.AMRT.FE     -0.24311    0.06081  -3.998 0.000162 ***
smaller_tidy$SP.DYN.TO65.FE.ZS  -2.45046    0.51071  -4.798 9.32e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.15 on 67 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.4186,    Adjusted R-squared:  0.3926 
F-statistic: 16.08 on 3 and 67 DF,  p-value: 5.616e-08

The summary reports a multiple R-squared of 0.4186 and an adjusted R-squared of 0.3926. By these measures the model proposed by leaps fits the data better than my model, but it uses two variables that are very strongly collinear. Leaps’s second choice does not have this limitation, and it matches my attempt. We’ll go with it!
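Collinearity between predictors can be checked numerically as well as by eye. The following is a rough sketch, again using mtcars as stand-in data (an assumption, as are the names predictors, cor_matrix, and vif_values). It computes pairwise correlations and variance inflation factors by hand, without any extra packages.

```r
# Stand-in data (an assumption): three mtcars columns play the predictors.
# disp and cyl are strongly related in this data set.
predictors <- mtcars[, c("disp", "cyl", "drat")]

# Pairwise correlations; values near +1 or -1 signal collinearity.
cor_matrix <- cor(predictors, use = "pairwise.complete.obs")
round(cor_matrix, 2)

# Variance inflation factor for predictor j:
# 1 / (1 - R^2), where R^2 comes from regressing j on the other predictors.
vif_values <- sapply(names(predictors), function(target) {
  others <- setdiff(names(predictors), target)
  aux    <- lm(reformulate(others, response = target), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})
round(vif_values, 2)
```

A common rule of thumb treats large variance inflation factors (often above 5 or 10) as a warning sign, though any cutoff is a judgment call.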

Conditions for Multiple Regression

We want to confirm that the following criteria are met:

  1. The residuals of the model are nearly normal,

  2. The variability of the residuals is nearly constant,

  3. The residuals are independent, and

  4. Each variable is linearly related to the outcome.

Normal Residuals

hist(smaller_model$residuals)

We have a nearly normal distribution, unimodal and roughly symmetrical, with a slight but not worrisome right skew.
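A histogram can be complemented by a normal Q-Q plot and a formal test. This is a small sketch, assuming a stand-in model fitted to mtcars rather than the model from this report; stand_in_model, res, and sw are illustrative names.

```r
# Stand-in model (an assumption): any lm object works the same way.
stand_in_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
res <- residuals(stand_in_model)

# Points close to the reference line suggest nearly normal residuals.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(res)
sw$p.value
```

With modest sample sizes the test has limited power, so it is best read alongside the plots rather than instead of them.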

Residual Variability and Independence

plot(smaller_model$residuals)

There seems to be a fairly consistent amount of variability in residual data and no clustering that would indicate unforeseen lack of independence.
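Plotting residuals against their index mainly reveals dependence when the row order is meaningful. For constant variance, plotting residuals against fitted values is often more informative. The sketch below shows that plot plus a hand-computed, studentized Breusch-Pagan-style statistic. It assumes a stand-in model on mtcars, and all object names are my own.

```r
# Stand-in model (an assumption): replace with the model under review.
stand_in_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
res           <- residuals(stand_in_model)
fitted_values <- fitted(stand_in_model)

# A funnel shape in this plot would suggest non-constant variance.
plot(fitted_values, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Auxiliary regression of squared residuals on the same predictors.
aux_data <- cbind(mtcars, sq_res = res^2)
aux      <- lm(sq_res ~ wt + hp + qsec, data = aux_data)

# n * R^2 is compared with a chi-squared distribution whose degrees of
# freedom equal the number of predictors (three here).
bp_stat <- nrow(aux_data) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 3, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value would suggest that the residual variance changes with the predictors.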

Linear Variable Relationships

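To check this condition, I plot the model’s residuals against each predictor. Rows with missing values are dropped when the model is fitted, so the residuals would otherwise be shorter than the predictor columns and the plots would fail. I therefore restrict the data to complete cases first.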

# select() comes from dplyr, assumed to be loaded earlier in this document.
complete_data <- select(smaller_tidy, SG.VAW.ARGU.ZS, SH.STA.ANVC.ZS, SL.UEM.1524.FE.ZS, 
    SP.DYN.CONU.ZS)
# Keep only rows with no missing values, so residuals and predictors align.
complete_data <- complete_data[complete.cases(complete_data), ]
complete_data_model <- lm(SG.VAW.ARGU.ZS ~ SH.STA.ANVC.ZS + SL.UEM.1524.FE.ZS + 
    SP.DYN.CONU.ZS, data = complete_data)
# Residuals against each predictor: curvature would suggest a non-linear relationship.
plot(complete_data_model$residuals ~ complete_data$SH.STA.ANVC.ZS)
plot(complete_data_model$residuals ~ complete_data$SL.UEM.1524.FE.ZS)
plot(complete_data_model$residuals ~ complete_data$SP.DYN.CONU.ZS)

In this model, the first and second variables have some outliers (in the case of SH.STA.ANVC.ZS, data with a value below 70 may be considered outliers; in the case of SL.UEM.1524.FE.ZS, data with a value above 35 are outliers), but neither shows a noticeable pattern that deviates from a linear model (there is no curving of the data). The third variable seems clearly linear and can be accepted without difficulty.

Conclusion: We can give this model our conditional approval, noting that it may not hold for the outlier values described in the preceding paragraph. The fitted equation for our multiple regression model can be read from the model summary: predicted SG.VAW.ARGU.ZS = 64.83227 - 0.26349(SH.STA.ANVC.ZS) - 0.24394(SL.UEM.1524.FE.ZS) - 0.37285(SP.DYN.CONU.ZS).
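As a worked example of using the fitted equation, the sketch below wraps the coefficients quoted above in a small function. The function name, argument names, and input values are hypothetical, chosen only to illustrate the arithmetic; they do not describe any real country.

```r
# Coefficients are copied from the model summary quoted in the text.
# anvc      = SH.STA.ANVC.ZS (pregnant women receiving prenatal care, %)
# uem_youth = SL.UEM.1524.FE.ZS (female youth unemployment, %)
# conu      = SP.DYN.CONU.ZS (contraceptive prevalence, %)
predict_acceptance <- function(anvc, uem_youth, conu) {
  64.83227 - 0.26349 * anvc - 0.24394 * uem_youth - 0.37285 * conu
}

# Hypothetical inputs: 90% prenatal care, 20% youth unemployment,
# 50% contraceptive prevalence.
example_prediction <- predict_acceptance(anvc = 90, uem_youth = 20, conu = 50)
example_prediction  # about 17.6 (predicted SG.VAW.ARGU.ZS, in percent)
```

Predictions for inputs outside the range of the observed data, or in the outlier regions noted above, should be treated with caution.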

Conclusions and Future Directions

Caveats

Cases and Sources

The World Bank World Development Indicators gather data on a per-country, per-year basis, so each case represents a given nation in a given year. Note that many countries lacked response variable data, and not every country with response variable data has it available for every year. The final data set I analyzed contained 72 cases, spanning the years 1999-2013 and covering 72 countries.

The scope of this data is limited to a small subset of nations, and generalizability, particularly to wealthy, highly developed nations, is therefore suspect.

Additionally, the data was drawn from a number of sources that are likely to be heterogeneous in their data quality. It was beyond the scope of this project to delve more deeply into the original sources of all the data, but before making policy recommendations (for example, beginning a program that aims to increase access to contraception), greater clarity is required as to the quality of data collection.

Causality

This dataset consists of observational data only. Although statistical evidence for the relationship between low contraceptive prevalence and increased support for domestic violence was shown, it is unclear what the causal relationship, if any, might be. Additional study and, if ethical, an experimental protocol, would be required to draw any conclusions as to causality. The same is true of the relationship established between female life expectancy and support of domestic violence.

Conclusions

Women’s support of the right of men to beat wives when they argue with their husbands is negatively correlated with both contraceptive prevalence and female life expectancy. Nations whose contraceptive prevalence or female life expectancy falls below the first quartile have markedly higher levels of female acceptance of this kind of domestic violence, and the difference is statistically significant.

Additionally, we can roughly model nations’ female acceptance of this kind of domestic violence by the use of three variables: contraceptive prevalence, female youth unemployment, and pregnant women receiving prenatal care. We can also construct a fairly reliable model using simple regression, with just the contraceptive prevalence variable.

Future Directions

Besides the additional experimental investigation that may be called for, there are certainly areas for improvement in the analysis carried out in this paper. Principally, the multiple regression methodology would be improved by principal component analysis and/or factor analysis, given the number of closely affiliated measures relating to poverty, healthcare, and economic participation. Carrying out these analyses ended up being beyond the scope of this project, but they would be a welcome and helpful addition to the efforts found here.
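As a pointer for that future work, here is a minimal sketch of principal component analysis followed by regression on the leading components. It was not part of this project. It uses mtcars as stand-in data, and the object names are illustrative assumptions.

```r
# Stand-in data (an assumption): six correlated mtcars columns.
# prcomp() does not accept missing values, so real data would first
# need to be restricted to complete cases.
predictors <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")]

# Center and scale so that each variable contributes on the same footing.
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by successive components.
summary(pca)$importance["Cumulative Proportion", ]

# Regress the response on the first two component scores, which are
# uncorrelated with each other by construction.
scores     <- as.data.frame(pca$x[, 1:2])
scores$mpg <- mtcars$mpg
pcr_model  <- lm(mpg ~ PC1 + PC2, data = scores)
summary(pcr_model)$adj.r.squared
```

The trade-off is interpretability: each component blends several original indicators, so the coefficients no longer map onto a single measure such as contraceptive prevalence.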


  1. http://esa.un.org/unpd/wpp/, accessed 3 November 2015.

  2. More detailed information about sources that influenced each nation’s population estimates can be found at http://esa.un.org/unpd/wpp/DVD/Files/3_Other%20Files/WPP2015_F02_METAINFO.XLS.

  3. See http://www.unicef.org/sowc/.

  4. http://www.unicef.org/statistics/index_24302.html and http://mics.unicef.org/about, accessed 3 November 2015.

  5. Multiple Indicator Cluster Survey Plan, http://mics.unicef.org/files?job=W1siZiIsIjIwMTUvMDEvMTQvMDYvMDYvMDUvMzUzL0VuZ2xpc2hfTUlDU19TdXJ2ZXlfUGxhbl9UZW1wbGF0ZV8yMDEzMDMwNy5kb2N4Il1d&sha=419058ca9529848b, accessed 3 November 2015.

  6. http://www.unicef.org/statistics/index_24302.html, accessed 3 November 2015.

  7. e.g. http://www.childinfo.org/mics_compiler.html

  8. http://www.dhsprogram.com/, accessed 5 November 2015.

  9. http://www.dhsprogram.com/What-We-Do/Survey-Types/DHS.cfm, accessed 5 November 2015.

  10. http://www.un.org/en/development/desa/population/publications/dataset/contraception/prevalence.shtml, accessed 10 November 2015.

  11. http://www.un.org/en/development/desa/population/publications/dataset/contraception/prevalence.shtml, accessed 10 November 2015.

  12. http://unstats.un.org/unsd/demographic/products/vitstats/, accessed 5 November 2015.

  13. http://ec.europa.eu/eurostat/web/population-demography-migration-projections/overview, accessed 5 November 2015.

  14. http://www.spc.int/prism/regional-data-and-tools/demographic-statistics, accessed 5 November 2015.

  15. http://www.spc.int/sdd/index.php/en/new-sdp-releases/39-new-sdd-releases/91-thepacificmortalitytrendreport, accessed 3 November 2015.

  16. http://www.spc.int/sdd/index.php/en/component/content/article/1/38-welcome-to-the-sdp-website, accessed 3 November 2015.

  17. https://www.census.gov/population/projections/about/, accessed 3 November 2015.

  18. http://www.ilo.org/global/statistics-and-databases/research-and-databases/kilm/WCMS_422090/lang--en/index.htm, accessed 25 November 2015.

  19. See http://www.mortality.org/Public/Docs/MethodsProtocol.pdf for more details. Quote is from page 1, accessed 26 November 2015.