In this report, we examine a possible relationship between a person’s educational attainment and the number of hours per day they spend watching TV, for US residents. We examine the hypothesis that more education is correlated with less TV watching. For example, do college graduates watch less TV than high school graduates? Does TV watching decrease as the education level rises from some high school to high school to junior college to college? If there is indeed a correlation, further studies can be designed to investigate the underlying causes and devise appropriate policies.
In order to answer the question, we use the data from the General Social Survey.
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
More details can be obtained at the website http://www3.norc.org/GSS+Website
The data has 57,061 observations from 1972 to 2012 on 115 variables. For this research, we selected observations from 1988 to 2012, as the earlier years contain missing values for TV hours. We examine two variables: degree (the respondent's level of education) and tvhours (hours per day spent watching TV).
The survey collects data via random sampling (i.e., respondents are chosen at random from among United States residents) but does not use random assignment (i.e., respondents are not split into treatment and control groups). Hence this is an observational study whose results can be generalized to the English-speaking residents of the United States. In particular, we cannot draw causal inferences from this data, as there is no random assignment.
The survey data from 1972 to 2004 was collected from English-speaking residents only. Starting in 2006, Spanish speakers were added to the survey respondents. This survey methodology may introduce convenience bias (it is convenient to sample English and Spanish speakers), and hence the study results cannot be generalized to populations that speak neither English nor Spanish, such as some immigrant groups.
The detailed survey design methodology can be found at http://www3.norc.org/GSS+Website/Documentation/
There are a large number of missing values, coded as NA, for both degree and tvhours. For the tvhours variable in particular, data was not collected in some of the early years. Hence, to ensure continuous data, we excluded data before 1988. Missing data for the years since 1988 is also excluded, so only complete records, where both degree and tvhours are present, are kept.
## Select only the observations and variables of interest
gssfiltered <- gss[gss$year > 1987, c("year", "tvhours", "degree")]
## Remove all missing values
gssfiltered <- gssfiltered[complete.cases(gssfiltered), ]
dim(gssfiltered)
## [1] 21186 3
After this filtering, we have 21,186 observations. Each observation corresponds to a resident of the United States who responded to the GSS. The data was collected over the years 1988 to 2012.
Now let us look at a plot of tvhours against the level of education.
The boxplot indicates that the median TV hours decrease as the level of education increases. It also shows that the variability decreases as the level of education increases: the interquartile range for Lt High School is much larger than for Graduate. The observations also appear right-skewed, with a long tail.
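A boxplot of this kind can be drawn with base R's `boxplot()`. The sketch below is only an illustration: the `toy` data frame and its values are invented and stand in for `gssfiltered`, so the resulting picture is not the one described above.

```r
# Hypothetical stand-in for gssfiltered (all values invented for illustration)
set.seed(1)
levels_deg <- c("Lt High School", "High School", "Junior College",
                "Bachelor", "Graduate")
toy <- data.frame(
  degree  = factor(rep(levels_deg, each = 50), levels = levels_deg),
  tvhours = rpois(250, lambda = rep(c(4, 3, 3, 2, 2), each = 50))
)

# One box per education level, in the order of the factor levels
bp <- boxplot(tvhours ~ degree, data = toy,
              xlab = "Highest degree", ylab = "TV hours per day",
              main = "TV hours by education level")

# bp$stats holds the five-number summary per group; row 3 is the median
medians <- bp$stats[3, ]
names(medians) <- bp$names
medians
```

Reading the medians and the box heights (rows 2 and 4 of `bp$stats`) off the returned object gives the same information as the visual comparison of medians and interquartile ranges.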
There are also some interesting outliers. In general, the outliers seem to decrease as the level of education increases. Further, some outliers show 24 hours of TV watched per day, implying that someone watches TV all the time without pausing even for sleep. This seems intuitively wrong and perhaps indicates a data collection or data entry error. So we do a quick check to see what share of the data is above 16 hours of TV per day.
##                 
##                  FALSE  TRUE
##   Lt High School  3295    27
##   High School    11184    31
##   Junior College  1483     3
##   Bachelor        3450     5
##   Graduate        1707     1
## check the proportion of data that are outliers
mean(gssfiltered$tvhours > 16)
## [1] 0.003162466
Since this is only about 0.3% of the data, we chose to continue the analysis without removing these outliers.
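The reported proportion can be cross-checked from the counts in the table above. The sketch below simply re-enters those counts by hand (the variable names are illustrative) and also computes the share within each education level, which the original analysis does not report.

```r
# Counts copied from the table above: observations with tvhours > 16 (TRUE)
# and the rest (FALSE), per education level
above16 <- c("Lt High School" = 27, "High School" = 31,
             "Junior College" = 3, "Bachelor" = 5, "Graduate" = 1)
at_most16 <- c("Lt High School" = 3295, "High School" = 11184,
               "Junior College" = 1483, "Bachelor" = 3450, "Graduate" = 1707)

group_n <- above16 + at_most16      # group sizes
total_n <- sum(group_n)             # should match dim(gssfiltered)[1]

overall_share <- sum(above16) / total_n
group_share   <- above16 / group_n  # share above 16 hours within each level

total_n
overall_share
round(100 * group_share, 2)         # in percent
```

The total recovers the 21,186 filtered observations, and the overall share agrees with the value returned by `mean(gssfiltered$tvhours > 16)`.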
To address the outliers as well as the right skewness revealed by the boxplot, we plot a histogram of tvhours to look at its distribution.
The histogram confirms that the data is right-skewed. Let us transform the data using a log transformation and check whether we get a more normal curve. The transformed histogram looks much more normally distributed, so we apply the log transformation to the data. To handle 0 hours of TV watching (as log(0) is -Infinity), we add 1 to the values before applying the log transformation.
gssfiltered$tvhours <- log(gssfiltered$tvhours + 1)
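As a small illustration of why 1 is added before taking the log, the sketch below uses a handful of made-up hour values. It also shows that base R's `log1p()` and `expm1()` compute the same transformation and its inverse; using them is an alternative, not what the analysis above does.

```r
# Made-up example values of TV hours per day, including 0
hours <- c(0, 1, 2, 8, 24)

log(hours)                    # the first element is -Inf, which breaks means and sds
log_hours <- log(hours + 1)   # the transformation used in the report

# log1p(x) is base R's built-in for log(1 + x)
same_as_log1p <- isTRUE(all.equal(log_hours, log1p(hours)))

# expm1(y) = exp(y) - 1 undoes the transformation
back <- expm1(log_hours)

same_as_log1p
back
```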
Now let us look at summary statistics to check whether the group means show the same pattern.
tapply(gssfiltered$tvhours, gssfiltered$degree, mean)
## Lt High School    High School Junior College       Bachelor       Graduate
##      1.4456533      1.2741234      1.1588529      1.0484541      0.9513265
The mean clearly decreases from 1.4456 to 0.9513 (in log(hours + 1) of TV watching per day) as the education level increases from Lt High School to Graduate, which seems to support our hypothesis. Let us formally test the hypothesis.
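Means on the log(hours + 1) scale are hard to read as hours. One rough way to interpret them is to undo the transformation with exp(mean) - 1. This is an added illustration, not part of the original analysis, and it rests on an assumption worth stating: the result is a geometric-mean-style figure, not the arithmetic mean of the raw hours.

```r
# Group means on the log(hours + 1) scale, copied from the output above
log_means <- c("Lt High School" = 1.4456533, "High School" = 1.2741234,
               "Junior College" = 1.1588529, "Bachelor" = 1.0484541,
               "Graduate" = 0.9513265)

# Undo log(x + 1): exp(y) - 1
approx_hours <- exp(log_means) - 1
round(approx_hours, 2)

# The ordering is preserved because exp() is monotone
strictly_decreasing <- all(diff(approx_hours) < 0)
strictly_decreasing
```

On this back-transformed scale the typical value falls from roughly 3.2 hours per day for Lt High School to roughly 1.6 hours per day for Graduate.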
We want to investigate the hypothesis that more education is correlated with less TV watching. As this involves a categorical variable with more than two levels (degree) and a numerical variable (tvhours), we can compare the means across these levels via an ANOVA to check whether at least one pair of means differs. If the ANOVA rejects the null hypothesis, we can then run pairwise t-tests to determine which of the means have a statistically significant difference.
For the ANOVA test, the null hypothesis is that nothing is going on: there is no difference in the mean number of TV hours across the educational levels. The alternative hypothesis is that at least one pair of means is different.
H0: mean(Lt High School) = mean(High School) = mean(Junior College) = mean(Bachelor) = mean(Graduate)
HA: At least one pair of means is different
For the ANOVA test, the following conditions have to be checked.
The observations are a random sample of US residents, and each group is less than 10% of the US population. Hence we can assume that the observations are independent both within and between groups.
Next we need to check whether tvhours is approximately normal within each group. We can do this with a Q-Q plot for each group.
The Q-Q plots show that the data is approximately normal, with the Graduate group deviating somewhat more from the normal. However, since we have a large number of observations in each group (at least 1,000), we can ignore this and proceed.
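Per-group Q-Q plots like those described above can be produced with base R's `qqnorm()` and `qqline()`. The following is a minimal sketch on invented data: the `toy` data frame, its two groups, and the Poisson-generated hours are assumptions for illustration only.

```r
# Hypothetical data with two education levels (values invented)
set.seed(42)
toy <- data.frame(
  degree  = factor(rep(c("High School", "Graduate"), each = 200),
                   levels = c("High School", "Graduate")),
  tvhours = c(rpois(200, 3), rpois(200, 2))
)
toy$log_tv <- log1p(toy$tvhours)   # same log(hours + 1) transformation

# One normal Q-Q panel per group, side by side
op <- par(mfrow = c(1, nlevels(toy$degree)))
for (lev in levels(toy$degree)) {
  x <- toy$log_tv[toy$degree == lev]
  qqnorm(x, main = lev)   # sample quantiles against normal quantiles
  qqline(x)               # reference line through the quartiles
}
par(op)                   # restore the previous plotting layout
```

Points that follow the reference line suggest approximate normality; systematic curvature at the ends indicates skew or heavy tails.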
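Finally, we check whether the variability is roughly constant across the groups.
tapply(gssfiltered$tvhours, gssfiltered$degree, sd)
## Lt High School    High School Junior College       Bachelor       Graduate
##      0.5625021      0.5099453      0.4771016      0.4739021      0.4633416
The group standard deviations range from about 0.46 to 0.56, so the variance is approximately constant.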
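One way to make "approximately constant" concrete is to compare the largest and smallest group standard deviations. The threshold used below (a ratio under 2) is a common rule of thumb and an assumption added here, not a criterion stated in the original analysis.

```r
# Group standard deviations copied from the output above
group_sd <- c("Lt High School" = 0.5625021, "High School" = 0.5099453,
              "Junior College" = 0.4771016, "Bachelor" = 0.4739021,
              "Graduate" = 0.4633416)

sd_ratio  <- max(group_sd) / min(group_sd)   # largest SD over smallest SD
var_ratio <- sd_ratio^2                      # the same comparison for variances

round(sd_ratio, 2)
round(var_ratio, 2)

# Rule-of-thumb check (assumed threshold of 2)
sd_ratio < 2
```

The ratio of standard deviations comes out at about 1.2, comfortably inside that rule of thumb.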
With all the conditions satisfied, we can proceed to do the ANOVA test
anova(lm(gssfiltered$tvhours ~ gssfiltered$degree))
## Analysis of Variance Table
##
## Response: gssfiltered$tvhours
##                       Df Sum Sq Mean Sq F value    Pr(>F)
## gssfiltered$degree     4  430.4 107.591  418.36 < 2.2e-16 ***
## Residuals          21181 5447.1   0.257
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We note that the p-value is practically zero. The p-value is the probability of observing an F statistic at least as large as the one obtained, given that the null hypothesis is true.
At a significance level of 0.05, the p-value is far below the threshold, so we can confidently reject the null hypothesis in favor of the alternative hypothesis that at least one pair of means is different.
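To show where the F value and its p-value come from, the sketch below rebuilds them from the sums of squares and degrees of freedom in the ANOVA table. The inputs are the rounded figures printed in the table, so the recomputed F differs slightly from the printed 418.36.

```r
# Figures copied (rounded) from the ANOVA table above
ss_between <- 430.4;  df_between <- 4       # degree: k - 1 = 5 - 1
ss_within  <- 5447.1; df_within  <- 21181   # residuals: n - k = 21186 - 5

ms_between <- ss_between / df_between   # mean square between groups
ms_within  <- ss_within / df_within     # mean square within groups
f_stat     <- ms_between / ms_within    # F = MSG / MSE

# Upper-tail probability of the F distribution under H0;
# it is so small that it may print as 0
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 1)
p_value
```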
Since ANOVA is purely a hypothesis test, there is no corresponding confidence interval to compare against.
Now we need to see which of the means are statistically different and whether our hypothesis of fewer TV hours with more education holds across all levels. For each pair of education levels, the null hypothesis is that there is no difference in the means, and the alternative hypothesis is that the mean for the higher level of education is less than the mean for the lower level.
We have already verified the conditions for the t-tests, namely: 1. the data is independent; 2. the data is nearly normally distributed; 3. the variance is roughly constant across groups (needed because the tests use a pooled standard deviation).
Hence we proceed further.
Since there are 5 levels, there will be 5 * (5-1)/2 = 10 pairwise t-tests.
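The count of 10 is the number of ways to choose 2 levels out of 5. As a quick illustration, base R's `choose()` gives the count and `combn()` lists the pairs; the level names are taken from the output shown earlier.

```r
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")

# Number of unordered pairs: k * (k - 1) / 2
k <- length(degree_levels)
n_pairs <- choose(k, 2)

# Enumerate the pairs, one per row (lower level first, higher level second)
pairs <- t(combn(degree_levels, 2))
colnames(pairs) <- c("lower", "higher")

n_pairs
pairs
```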
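The null hypothesis, H0, for each pair would be: mean(higher education level) = mean(lower education level)
The alternative hypothesis, HA: mean(higher education level) < mean(lower education level)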
As we are testing 10 pairs, we should adjust the p-values for multiple testing. For this we use the Bonferroni correction.
pairwise.t.test(gssfiltered$tvhours, gssfiltered$degree,
                p.adjust.method = "bonferroni", alternative = "less")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  gssfiltered$tvhours and gssfiltered$degree
##
##                Lt High School High School Junior College Bachelor
## High School    < 2e-16        -           -              -
## Junior College < 2e-16        9.6e-16     -              -
## Bachelor       < 2e-16        < 2e-16     1.2e-11        -
## Graduate       < 2e-16        < 2e-16     < 2e-16        4.8e-10
##
## P value adjustment method: bonferroni
As we can see above, the p-values, even after the Bonferroni correction, are close to 0 for all the pairs. Each p-value is the probability of observing a difference in means at least as far below zero as the one obtained, given that the null hypothesis is true.
Because these p-values are already Bonferroni-adjusted, we compare them against the significance level of 0.05 directly (equivalent to comparing the unadjusted p-values against 0.05/10 = 0.005). As every adjusted p-value is near zero and below the significance level, we can confidently reject the null hypothesis for each pair in favor of the alternative hypothesis that the hours per day spent watching TV are lower among more highly educated US residents.
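The relationship between the adjusted p-values and the 0.05/10 threshold can be checked with a short sketch. The raw p-values below are invented for illustration. The Bonferroni adjustment multiplies each raw p-value by the number of comparisons (capped at 1), so comparing adjusted p-values with 0.05 gives the same decisions as comparing raw p-values with 0.005.

```r
# Hypothetical unadjusted p-values (not from the GSS analysis)
raw_p <- c(0.0004, 0.003, 0.02, 0.2)
m <- 10   # number of pairwise comparisons

# Bonferroni by hand: multiply by m, cap at 1
adj_p <- pmin(1, raw_p * m)

# Same result from base R's p.adjust(), telling it there are m comparisons
matches_p_adjust <- isTRUE(all.equal(
  adj_p, p.adjust(raw_p, method = "bonferroni", n = m)))

# The two decision rules agree
reject_adjusted <- adj_p < 0.05       # adjusted p-values against alpha
reject_raw      <- raw_p < 0.05 / m   # raw p-values against alpha / m

matches_p_adjust
identical(reject_adjusted, reject_raw)
```

Applying both corrections at once, that is, comparing already-adjusted p-values against 0.005, would be more conservative than the Bonferroni procedure requires.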
Using the GSS data for the years 1988 to 2012 and performing an ANOVA followed by pairwise t-tests, we found sufficient statistical evidence to conclude that, among residents of the United States, higher educational attainment is associated with fewer hours of TV watched per day. This association is present between every pair of educational levels, and there is a clear pattern as we move up the educational levels.
However, this study does not establish any causal relationship. As potential future research, one could design an experiment to determine whether there is in fact a causal effect between these two variables. One would also have to examine whether there are confounding variables, such as income, age, or gender, that affect this relationship.
A sample page of data
##       tvhours         degree
## 11000       2       Bachelor
## 11001       2       Bachelor
## 11002       2       Bachelor
## 11003       2       Bachelor
## 11004       3 Lt High School
## 11005       0 Lt High School
## 11006       5    High School
## 11007       4    High School
## 11008       3 Lt High School
## 11009       1    High School
## 11010       8    High School
## 11011       4           <NA>
## 11012       2    High School
## 11013       3 Lt High School
## 11014       3       Bachelor
## 11015       1    High School
## 11016       4 Lt High School
## 11017       2 Lt High School
## 11018       5 Lt High School
## 11019       6    High School
## 11020      16    High School
## 11021       3    High School
## 11022       3 Lt High School
## 11023       1    High School
## 11024       2 Lt High School
## 11025       3    High School
## 11026       2       Bachelor
## 11027       2    High School
## 11028       2    High School
## 11029       0    High School
## 11030       1    High School