Relationship between Education level and TV watching for US residents

Introduction:

In this report, we examine a possible relationship between a person’s educational attainment and the number of hours they spend per day watching TV, for US residents. We examine the hypothesis that more education is correlated with less TV watching. For example, do college graduates watch less TV than high school graduates? Is there a decrease in TV watching as the education level increases from some high school to high school to junior college to college? If there is indeed a correlation, further studies can be designed to study the underlying causes and devise appropriate policies.

Data:

In order to answer the question, we use the data from the General Social Survey.

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.

More details can be obtained at the website http://www3.norc.org/GSS+Website

Variables of interest

The data set has 57,061 observations from 1972 to 2012 and collects 115 variables. For the purpose of this research, we have selected observations from 1988 to 2012, as the earlier years contain missing values for TV hours. We examine the following two variables:

degree - An ordinal categorical variable giving the highest degree attained. It has five levels, as listed below:

LT HIGH SCHOOL - Less than high school (some high school, not completed)
HIGH SCHOOL - Completed High School
JUNIOR COLLEGE - Completed Junior college
BACHELOR - Completed Bachelor degree
GRADUATE - Completed Graduate degree

tvhours - A numerical variable indicating hours of TV watched per day. It takes values from 0 to 24, recorded in 1-hour increments.

Applicability of the data

The survey collects data via random sampling (i.e., respondents are chosen at random from among United States residents) but does not use random assignment (i.e., respondents are not split into treatment and control groups). Hence this is an observational study whose results can be generalized to the English-speaking (and, from 2006, Spanish-speaking) residents of the United States. In particular, we cannot draw causal inferences from this data, as there is no random assignment.

The survey data from 1972 to 2004 were collected from English-speaking residents only. Starting in 2006, Spanish speakers were added to the survey respondents. This methodology may introduce convenience bias (it is convenient to sample English and Spanish speakers), and hence the study results cannot be generalized to residents who speak neither English nor Spanish, such as some immigrant groups.

The detailed survey design methodology can be found at http://www3.norc.org/GSS+Website/Documentation/

Exploratory data analysis:

Pre processing data

There are a large number of missing values, coded as NA, for both degree and tvhours. For tvhours in particular, data were not collected in some of the early years. Hence, to ensure continuous data, we have excluded all data before 1988. Records from 1988 onward with a missing value are also excluded, so only complete records in which both degree and tvhours are present are kept.

        ## Select only the observations and variables of interest
        ## (gss is the GSS cumulative data frame, assumed to be already loaded)
        gssfiltered <- gss[gss$year > 1987, c("year", "tvhours", "degree")]
        ## Remove all missing values
        gssfiltered <- gssfiltered[complete.cases(gssfiltered), ]
        dim(gssfiltered)

## [1] 21186     3

After this filtering, we have 21,186 observations. Each observation corresponds to a resident of the United States who responded to the GSS. The data cover the years 1988 to 2012.

Plot of TV hours Vs Education

Now let us look at a boxplot of tvhours against the level of education.

The boxplot indicates that median TV hours decrease as the level of education increases. Further, it shows that the variability decreases as the level of education increases: the interquartile range for the less-than-high-school group is much larger than for the graduate group. The observations are also right skewed, with a long tail.
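The plot itself is not reproduced in this text. A minimal sketch of how such a boxplot could be drawn in base R follows; the data frame `toy` and its values are simulated stand-ins (an assumption for illustration), not the GSS sample.

```r
## Simulated stand-in data (hypothetical values, not the GSS sample)
set.seed(42)
lvls <- c("Lt High School", "High School", "Junior College",
          "Bachelor", "Graduate")
toy <- data.frame(
  degree  = factor(rep(lvls, each = 100), levels = lvls),
  tvhours = pmin(rpois(500, lambda = rep(c(4, 3.2, 2.8, 2.4, 2), each = 100)), 24)
)

## Side-by-side boxplots of TV hours for each education level
bp <- boxplot(tvhours ~ degree, data = toy,
              xlab = "Highest degree", ylab = "TV hours per day",
              main = "TV hours by education level (simulated)")

## Group medians, i.e. the thick line drawn inside each box
bp$stats[3, ]
```

With the real data, `toy` would be replaced by `gssfiltered`; keeping `degree` as an ordered set of factor levels is what puts the boxes in education order.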

Outliers

There are also some interesting outliers. In general, the number of outliers seems to decrease as the level of education increases. Further, some outliers report 24 hours of TV watched per day, implying that someone watches TV all the time without pausing even for sleep. This seems intuitively wrong and may indicate a data collection or data entry error. So we do a quick check to see what share of the observations report more than 16 hours of TV per day, by education level and overall.

##                 
##                  FALSE  TRUE
##   Lt High School  3295    27
##   High School    11184    31
##   Junior College  1483     3
##   Bachelor        3450     5
##   Graduate        1707     1

        ## check the proportion of data that are outliers
        mean(gssfiltered$tvhours > 16)

## [1] 0.003162466

Since this is just about 0.3% of the data, we chose to continue with the analysis as is without removing these outliers.
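As a cross-check, the overall proportion can be recomputed directly from the counts in the table above. This is a small sketch; the object names are illustrative, and the per-group percentages at the end are an extra not shown in the original output.

```r
## Counts copied from the table above: tvhours <= 16 (FALSE) vs. > 16 (TRUE)
counts <- matrix(c(3295, 27,
                   11184, 31,
                   1483, 3,
                   3450, 5,
                   1707, 1),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("Lt High School", "High School",
                                   "Junior College", "Bachelor", "Graduate"),
                                 c("FALSE", "TRUE")))

n_total  <- sum(counts)             # all complete observations
n_over16 <- sum(counts[, "TRUE"])   # observations above 16 hours/day
prop_over16 <- n_over16 / n_total   # overall proportion of such outliers

## Percentage of each education group reporting more than 16 hours/day
pct_by_group <- round(100 * prop.table(counts, margin = 1)[, "TRUE"], 2)
```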

Data transformation

To handle the outliers as well as the right skewness revealed by the boxplot, we plot a histogram of the tvhours to look at the distribution

The histogram confirms that the data is right skewed. Let us transform the data with a log transformation and check whether the result looks more normal. The histogram of the transformed values looks much closer to a normal distribution, so we apply the log transformation to the data. To handle 0 hours of TV watching (log(0) is -Infinity), we add 1 to each value before taking the log.

        gssfiltered$tvhours <- log(gssfiltered$tvhours + 1)
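The effect of the shift can be seen on a handful of values. The sketch below uses a made-up vector `tv` (an assumption for illustration) and also shows base R's `log1p()`, which computes the same quantity as `log(x + 1)`.

```r
## Example TV-hour values, including the 0 and 24 extremes (made up)
tv <- c(0, 1, 2, 5, 24)

## Without the shift, zero hours maps to -Inf
raw_log <- log(tv)

## With the shift, zero hours maps to 0 and every value stays finite
log_tv <- log(tv + 1)

## log1p() is the built-in equivalent of log(x + 1)
same <- all.equal(log_tv, log1p(tv))

## The transformation is reversible: exp(y) - 1 recovers the hours
back <- exp(log_tv) - 1
```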

Summary statistics

Now let us look at summary statistics to check whether the group means show the same pattern as the medians.

        tapply(gssfiltered$tvhours, gssfiltered$degree, mean)

## Lt High School    High School Junior College       Bachelor       Graduate 
##      1.4456533      1.2741234      1.1588529      1.0484541      0.9513265

The mean clearly decreases from 1.4456 to 0.9513 (in units of log(hours + 1) of TV watching per day) as the education level increases from less than high school to graduate, which seems to support our hypothesis. Let us check the hypothesis formally.
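Because these means are on the log(hours + 1) scale, it can help to map them back to hours. The sketch below does this for the values printed above. Note that the back-transformed figures are not the arithmetic means of the raw hours (they are geometric means of hours + 1, minus 1), so they are only a rough guide to scale.

```r
## Group means on the log(hours + 1) scale, copied from the output above
log_means <- c("Lt High School" = 1.4456533,
               "High School"    = 1.2741234,
               "Junior College" = 1.1588529,
               "Bachelor"       = 1.0484541,
               "Graduate"       = 0.9513265)

## Invert the transformation: hours = exp(y) - 1
back_hours <- exp(log_means) - 1
round(back_hours, 2)

## TRUE if each level's value is lower than the one before it
strictly_decreasing <- all(diff(back_hours) < 0)
```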

Inference:

We want to investigate the hypothesis that more education is correlated with less TV watching. Because this involves a categorical variable with more than two levels (degree) and a numerical variable (tvhours), we can compare the means across these levels with an ANOVA to check whether at least one pair of means differs. If the ANOVA rejects the null hypothesis, we can then run pairwise t-tests to determine which means differ by a statistically significant amount.

ANOVA test

For the ANOVA test, let us formulate the null hypothesis that nothing is going on: there is no difference in the mean number of TV hours across educational levels. The alternative hypothesis is that at least one pair of means is different.

H0: Mean(Lt High School) = Mean(High School) = Mean(Junior College) = Mean(Bachelor) = Mean(Graduate)
HA: At least one pair of means is different

Conditions for ANOVA test

For the ANOVA test, the following conditions have to be checked

Independence

The observations are a random sample of US residents, and each group makes up less than 10% of the US population. Hence we can assume that the observations are independent both within and between groups.

Approximately normal

Next we need to check whether the response is approximately normal within each group. We can do this with a normal Q-Q plot for each education level.

The Q-Q plots show that the data is approximately normal, with the Graduate group deviating somewhat more from normality. However, since each group has a large number of observations (at least 1,000), we can ignore this deviation and proceed.
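The Q-Q plots are not reproduced in this text. One way to draw a normal Q-Q plot per group in base R is sketched below, using simulated stand-in data (the data frame `toy`, its two groups, and its values are assumptions for illustration, not the GSS sample).

```r
## Simulated stand-in data with two education levels (hypothetical values)
set.seed(1)
toy <- data.frame(
  degree  = factor(rep(c("Lt High School", "Graduate"), each = 200),
                   levels = c("Lt High School", "Graduate")),
  tvhours = c(rpois(200, 4), rpois(200, 2))
)
toy$logtv <- log(toy$tvhours + 1)   # same transformation as in the report

## One Q-Q plot per group, side by side, each with a reference line
op <- par(mfrow = c(1, 2))
for (lev in levels(toy$degree)) {
  vals <- toy$logtv[toy$degree == lev]
  qqnorm(vals, main = lev)
  qqline(vals)
}
par(op)   # restore the previous plotting layout
```

Points that follow the reference line suggest approximate normality; systematic curvature at the ends indicates skew or heavy tails.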

Constant Variance

We check whether the variance is constant across groups

        tapply(gssfiltered$tvhours, gssfiltered$degree, sd)

## Lt High School    High School Junior College       Bachelor       Graduate 
##      0.5625021      0.5099453      0.4771016      0.4739021      0.4633416

The group standard deviations range from about 0.46 to 0.56, so we find that the variance is approximately constant.
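One quick way to put a number on "approximately constant" is the ratio of the largest to the smallest group standard deviation. The sketch below uses the values printed above. The cutoff of 2 is a commonly quoted rule of thumb and is an assumption here, not something stated in the original analysis.

```r
## Group standard deviations, copied from the output above
sds <- c("Lt High School" = 0.5625021,
         "High School"    = 0.5099453,
         "Junior College" = 0.4771016,
         "Bachelor"       = 0.4739021,
         "Graduate"       = 0.4633416)

## Ratio of largest to smallest standard deviation
sd_ratio <- max(sds) / min(sds)

## Rule-of-thumb check (assumed threshold: ratio below 2)
roughly_constant <- sd_ratio < 2
```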

With all the conditions satisfied, we can proceed to do the ANOVA test

ANOVA results

        anova(lm(gssfiltered$tvhours ~gssfiltered$degree))

## Analysis of Variance Table
## 
## Response: gssfiltered$tvhours
##                       Df Sum Sq Mean Sq F value    Pr(>F)    
## gssfiltered$degree     4  430.4 107.591  418.36 < 2.2e-16 ***
## Residuals          21181 5447.1   0.257                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We note that the p-value is practically zero. The p-value is the probability of observing an F statistic at least as large as the one computed here if the null hypothesis were true.

At a significance level of 0.05, the p-value is far below our threshold, so we can confidently reject the null hypothesis in favor of the alternative hypothesis that at least one pair of means is different.

Because ANOVA is only a hypothesis test, there is no confidence interval to compare against.
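The F statistic in the ANOVA table can be recomputed by hand from its sums of squares and degrees of freedom. This sketch uses the rounded figures printed in the table, so the result matches the reported 418.36 only approximately; the variable names are illustrative.

```r
## Values copied from the ANOVA table above
ss_group <- 430.4;  df_group <- 4       # between groups: k - 1 = 5 - 1
ss_resid <- 5447.1; df_resid <- 21181   # within groups:  n - k = 21186 - 5

## Mean squares are sums of squares divided by their degrees of freedom
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid

## F is the ratio of between-group to within-group mean squares
f_stat <- ms_group / ms_resid

## Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)
```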

Pairwise T Test with Bonferroni correction

Now we need to see which of the means are statistically different, and whether our hypothesis of fewer TV hours with more education holds across all levels. For each pair of education levels, we formulate the null hypothesis that there is no difference in the means, and the alternative hypothesis that the mean for the higher level of education is less than the mean for the lower level.

We have already verified the conditions for the t-distribution, namely: 1. the data is independent; 2. the data is nearly normally distributed.

Hence we proceed further.

Since there are 5 levels, there will be 5 * (5-1)/2 = 10 pairwise t-tests
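The count of pairs, and the pairs themselves, can be enumerated in base R as a quick check. This is a small illustrative sketch; the object names are not part of the original analysis.

```r
## The five education levels, in order
lvls <- c("Lt High School", "High School", "Junior College",
          "Bachelor", "Graduate")

## Number of unordered pairs: k * (k - 1) / 2
n_pairs <- choose(length(lvls), 2)

## One row per pair, lower education level in the first column
pairs <- t(combn(lvls, 2))
pairs
```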

Hypothesis

The null hypothesis H0 would be

Mean(Lt High School) = Mean(High School)
Mean(Lt High School) = Mean(Junior College)
Mean(Lt High School) = Mean(Bachelor)
Mean(Lt High School) = Mean(Graduate)
Mean(High School) = Mean(Junior College)
Mean(High School) = Mean(Bachelor)
Mean(High School) = Mean(Graduate)
Mean(Junior College) = Mean(Bachelor)
Mean(Junior College) = Mean(Graduate)
Mean(Bachelor) = Mean(Graduate)

The alternative hypothesis HA (the lower level of education has the higher mean) would be

Mean(Lt High School) > Mean(High School)
Mean(Lt High School) > Mean(Junior College)
Mean(Lt High School) > Mean(Bachelor)
Mean(Lt High School) > Mean(Graduate)
Mean(High School) > Mean(Junior College)
Mean(High School) > Mean(Bachelor)
Mean(High School) > Mean(Graduate)
Mean(Junior College) > Mean(Bachelor)
Mean(Junior College) > Mean(Graduate)
Mean(Bachelor) > Mean(Graduate)
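The direction of a one-sided pairwise test is easy to get backwards. In R's `pairwise.t.test`, each cell compares the row level (the later factor level) against the column level (the earlier one), and `alternative = "less"` tests whether the row level's mean is smaller. The sketch below illustrates this with simulated data; the group names and values are made up for illustration.

```r
## Two simulated groups: the earlier level "Low" has the larger mean
set.seed(7)
g <- factor(rep(c("Low", "High"), each = 50), levels = c("Low", "High"))
x <- c(rnorm(50, mean = 5), rnorm(50, mean = 3))

## "less": is mean(High) - mean(Low) negative? Strong evidence, tiny p-value
p_less <- pairwise.t.test(x, g, p.adjust.method = "bonferroni",
                          alternative = "less")$p.value

## "greater": the opposite direction, so the p-value is close to 1
p_greater <- pairwise.t.test(x, g, p.adjust.method = "bonferroni",
                             alternative = "greater")$p.value

p_less["High", "Low"]
p_greater["High", "Low"]
```

With education levels ordered from lowest to highest, `alternative = "less"` therefore corresponds to the more educated group having the lower mean.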

T Test results with Bonferroni correction

Because we are running 10 pairwise tests, we should adjust the p-values for multiple comparisons. For this we use the Bonferroni correction

        pairwise.t.test(gssfiltered$tvhours, gssfiltered$degree,
                        p.adjust.method = "bonferroni", alternative = "less")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  gssfiltered$tvhours and gssfiltered$degree 
## 
##                Lt High School High School Junior College Bachelor
## High School    < 2e-16        -           -              -       
## Junior College < 2e-16        9.6e-16     -              -       
## Bachelor       < 2e-16        < 2e-16     1.2e-11        -       
## Graduate       < 2e-16        < 2e-16     < 2e-16        4.8e-10 
## 
## P value adjustment method: bonferroni

As we can see above, the p-values remain close to 0 for every pair even after the Bonferroni adjustment. Each p-value is the probability of observing a difference in group means at least as extreme (in the direction of the alternative) as the one in our sample, assuming the null hypothesis is true.

Because the reported p-values are already Bonferroni-adjusted, we compare them directly with the 0.05 significance level (equivalently, the unadjusted p-values would be compared with 0.05/10 = 0.005). All ten adjusted p-values are far below this threshold, so we can confidently reject each null hypothesis in favor of the alternative that mean hours per day of TV watching are lower among more highly educated US residents.
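A minimal sketch of how the Bonferroni adjustment works in R: `p.adjust` multiplies each raw p-value by the number of comparisons and caps the result at 1, so comparing adjusted p-values with 0.05 gives the same decisions as comparing raw p-values with 0.05/10. The raw p-values below are made up for illustration.

```r
## Hypothetical raw p-values from a family of m = 10 comparisons
raw_p <- c(0.001, 0.004, 0.02, 0.5)
m <- 10

## Bonferroni adjustment: min(1, p * m)
adj_p <- p.adjust(raw_p, method = "bonferroni", n = m)

## The two equivalent decision rules at alpha = 0.05
alpha <- 0.05
reject_adjusted <- adj_p < alpha        # adjusted p-values vs. alpha
reject_raw      <- raw_p < alpha / m    # raw p-values vs. alpha / m
same_decision   <- identical(reject_adjusted, reject_raw)
```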

Conclusion:

Using the GSS data from 1988 to 2012 and performing an ANOVA followed by pairwise t-tests, we have found sufficient statistical evidence to conclude that, among residents of the United States, higher educational attainment is associated with fewer hours of TV watched per day. The difference is present between every pair of education levels, and there is a clear downward pattern as we move up the education levels.

However, this study does not establish any causal relationship. As a possibility for future research, one could design an experiment to determine whether there is in fact a causal effect between these two variables. One would also have to examine whether confounding variables such as income, age, or gender affect this relationship.

References

The details regarding the GSS can be obtained at the website http://www3.norc.org/GSS+Website
The detailed survey design methodology can be found at http://www3.norc.org/GSS+Website/Documentation/
The data is obtained from: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1 Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

Appendix

A sample page of data

##       tvhours         degree
## 11000       2       Bachelor
## 11001       2       Bachelor
## 11002       2       Bachelor
## 11003       2       Bachelor
## 11004       3 Lt High School
## 11005       0 Lt High School
## 11006       5    High School
## 11007       4    High School
## 11008       3 Lt High School
## 11009       1    High School
## 11010       8    High School
## 11011       4           <NA>
## 11012       2    High School
## 11013       3 Lt High School
## 11014       3       Bachelor
## 11015       1    High School
## 11016       4 Lt High School
## 11017       2 Lt High School
## 11018       5 Lt High School
## 11019       6    High School
## 11020      16    High School
## 11021       3    High School
## 11022       3 Lt High School
## 11023       1    High School
## 11024       2 Lt High School
## 11025       3    High School
## 11026       2       Bachelor
## 11027       2    High School
## 11028       2    High School
## 11029       0    High School
## 11030       1    High School