Recent work suggests that sedentary activity, such as TV watching, is associated with negative changes in many aspects of health, including cardiovascular health, bone health and cellular function [2]. Television use in particular has been linked with greater risk of obesity, type 2 diabetes and heart disease, lower life satisfaction, less frequent engagement in social and physical interaction, and increased risk of dementia [2,3].
The researchers concluded that increasing public awareness of alternatives to TV watching and reducing barriers to alternative activities that are more socially and physically engaging could reduce TV use by older people and diminish the potential for associated negative health effects [3].
But is it really the case that older people watch more TV? Do the GSS data support that claim? This study tries to answer the question of whether retired people watch more TV than people who are not retired.
The data were collected by the General Social Survey (GSS) between 1972 and 2012 by means of a questionnaire. Most surveys were administered in person. Computer-assisted interviews have been used since 2002, and some interviews were also conducted by phone [1]. The target sample size for each year in which the survey was administered is 1500. Since 1994 the GSS has been administered to two samples.
Each case is the result of an interview with a resident of the United States (the observational unit). The units of observation were adults living in households within the United States.
This study is investigating the relationship between the following two variables of the GSS dataset:
tvhours - average hours per day watching TV (numerical variable)
wrkstat - labor force status (categorical variable)
The variable wrkstat comprises 8 levels. For the purpose of this study, these levels were summarized to two levels (retired, not_retired) of a new categorical variable: retirementStatus.
This is an observational study based on the dataset of the General Social Survey (GSS). For this study, 200 retired and 200 not-retired individuals were selected randomly from about 3600 respondents surveyed between 2000 and 2005 with tvhours >= 0 (excluding NAs).
The population of interest for this study is the population of the United States in the years 2000 to 2005.
The findings from this study can be generalized to that population, because:
1. the GSS surveys 2000 to 2005 were taken on randomly selected households in the United States
2. for this study two stratified subsets (according to retirement status) of the GSS data were used
3. 200 cases were randomly selected from each of the subsets
There is a possible non-response bias, indicated by the value ‘NA’ for the variables ‘wrkstat’ and ‘tvhours’. The variable wrkstat has only a few ‘NA’ values, which can be neglected. But the variable tvhours has over 23000 ‘NA’ values in the original dataset. This could influence the results of this study. Further studies may be necessary.
As this is an observational study, the relationship between the two variables can show only an association, not a causal relationship, e.g. that retirement causes more TV consumption.
An early inspection found many ‘NA’ values for the tvhours variable in the GSS dataset. How are the ‘NA’ values distributed across the levels of wrkstat? Below is the proportion of ‘NA’ values for each wrkstat level:
tapply(gss$tvhours, gss$wrkstat, function(x) sum(is.na(x))/length(x))
## Working Fulltime Working Parttime Temp Not Working Unempl, Laid Off
## 0.4046513 0.3943855 0.4072547 0.4025627
## Retired School Keeping House Other
## 0.4068307 0.4283267 0.4172792 0.4045936
We find that ‘NA’ values are evenly distributed over the wrkstat levels. Each level has about the same proportion (40%) of ‘NA’ values, so the rate of missing tvhours values is about the same for retired and not-retired respondents.
We are only interested in two variables of the GSS dataset and in the years 2000 to 2005. Therefore we select wrkstat (labor force status) and tvhours (hours per day watching TV) for the cases from 2000 to 2005, omitting cases with ‘NA’ values for these variables.
# tvhours >= 0 evaluates to NA for missing values, and subset() drops those rows
gss2 <- subset(gss, year >= 2000 & year <= 2005 & tvhours >= 0 & !is.na(wrkstat),
               select = c(wrkstat, tvhours))
dim(gss2)
## [1] 3633 2
The filtered dataset comprises about 3600 cases.
Next we create a new categorical variable ‘retirementStatus’ with only two levels, specifying whether a respondent is retired or not.
gss2[,"retirementStatus"] <- "not_retired"
gss2[gss2$wrkstat=="Retired","retirementStatus"] <- "retired"
gss2$retirementStatus <- as.factor(gss2$retirementStatus)
Next, stratified sampling is performed. The filtered dataset is split into two subsets, one for retired and one for not-retired respondents.
From each subset 200 cases are then selected at random. Both samples are merged into the resulting dataset for this study (400 cases).
set.seed(1010101)
gss2n <- gss2[gss2$retirementStatus=="not_retired",]
gss2y <- gss2[gss2$retirementStatus=="retired",]
gss3n <- gss2n[sample(nrow(gss2n),200 ), ]
gss3y <- gss2y[sample(nrow(gss2y),200 ), ]
gss3 <- rbind(gss3n,gss3y)
Next we calculate some basic statistics (median, mean and SD in hours) for both groups (retired and not-retired cases).
| | Not-Retired | Retired |
|---|---|---|
| Median | 2 | 3.5 |
| Mean | 2.675 | 4.08 |
| SD | 2.336 | 2.877 |
| n | 200 | 200 |
This suggests that the average daily TV consumption of retired people is indeed higher than that of not-retired people. But the standard deviation is high for both groups (mean - 2*SD is negative), so there is some uncertainty.
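The code that produced the table above is not shown. As a minimal sketch, per-group statistics of this kind could be computed as follows. The data frame `toy` and its values are made up for illustration and are not the GSS sample; only the reported means and SDs at the end are taken from the table.

```r
# Toy data (made-up values, NOT the GSS sample used in this study)
toy <- data.frame(
  tvhours = c(0, 2, 1, 3, 2, 4, 2, 6, 13, 5, 1, 3),
  retirementStatus = factor(rep(c("not_retired", "retired"), each = 6))
)

# One row per statistic, one column per group
stats <- sapply(split(toy$tvhours, toy$retirementStatus), function(x) {
  c(Median = median(x), Mean = mean(x), SD = sd(x), n = length(x))
})
print(round(stats, 3))

# Values reported in the table: mean - 2 * SD is negative in both groups
reported_mean <- c(not_retired = 2.675, retired = 4.08)
reported_sd   <- c(not_retired = 2.336, retired = 2.877)
print(reported_mean - 2 * reported_sd)
```

Since TV hours cannot be negative, a negative value of mean - 2*SD is itself a hint that the distributions are right-skewed rather than symmetric.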
Next we examine both groups together (retired and not-retired persons).
We first create summary statistics for the variable tvhours:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.378 4.000 20.000
We observe an average of about 3.4 hours and a maximum of 20 hours for the daily TV consumption of all (400) cases.
The following figure shows the histograms of TV hours for both groups. For both groups the distribution is strongly right-skewed with outliers. But the sample size of 200 for each group is large enough to satisfy the normality condition for the hypothesis test that follows.
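The claim that 200 cases per group are enough despite the skew can be illustrated with a small simulation. This is only a sketch under stated assumptions: the gamma distribution is a hypothetical stand-in for a right-skewed TV-hours population (it is not fitted to the GSS data), and the seed and number of replications are arbitrary.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

n_per_group <- 200   # group size used in this study
n_reps      <- 5000  # number of simulated samples (assumption)

# Hypothetical right-skewed population: gamma with mean shape * scale = 3 hours
draw_sample <- function() rgamma(n_per_group, shape = 1.5, scale = 2)

# Means of many simulated samples of size 200
sample_means <- replicate(n_reps, mean(draw_sample()))

# Moment-based skewness, defined here to avoid extra packages
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

one_sample <- draw_sample()
cat("Skewness of one raw sample:  ", round(skewness(one_sample), 2), "\n")
cat("Skewness of the sample means:", round(skewness(sample_means), 2), "\n")
```

In this simulation the individual observations are clearly skewed, while the distribution of the sample means is close to symmetric, which is what the normality condition for the test requires.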
The side-by-side boxplot illustrates TV consumption versus retirement status. It shows that both distributions are right-skewed with some outliers. The median for retired persons is substantially higher than for not-retired persons, and the variability for retired persons is higher as well (wider IQR).
The exploratory analysis suggests that there is a relationship between retirement status and TV consumption: while the average daily TV consumption is about 2.7 hours for not-retired respondents, it is higher (about 4.1 hours) for retired respondents. But the uncertainty is relatively high. Therefore a hypothesis test will be performed.
For this study we measure and compare the difference in the average daily TV consumption between retired and not-retired respondents. It is an association between a numerical variable (tvhours) and a categorical variable (retirementStatus). As we compare two means, the inference is done via a hypothesis test and a check of the confidence interval.
For the hypothesis test we state both hypotheses:
H0: \(\mu_{rt} - \mu_{nrt} = 0\)
HA: \(\mu_{rt} - \mu_{nrt} > 0\) (one-sided)
The observed difference in the average daily TV consumption between retired and not-retired persons is 1.405 hours.
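Before performing a hypothesis test we must be sure that the conditions for inference for comparing two independent means are met. Independence within groups: the cases in each group were sampled randomly, and 200 cases are far less than 10% of the respective population. Independence between groups: the retired and not-retired respondents are not paired. Sample size and skew: both distributions are right-skewed, but each group has 200 cases.
As all conditions are met we can continue performing the hypothesis test.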
We have two independent groups (retired and not-retired residents of the US) and want to compare the average daily TV consumption (tvhours) of both groups.
We are interested in the difference in the average daily TV consumption between all US residents who are retired and those who are not retired. As the point estimate we use the difference between the average daily TV consumption of the two sampled groups of retired and not-retired US residents.
We will perform a hypothesis test and a confidence interval check to estimate the difference and margin of error.
For the hypothesis test we want to compare the means of the variable tvhours for both levels of the variable retirementStatus. The null value is 0 (no difference); the alternative is one-sided (tvhours(retired) > tvhours(not-retired)). The significance level is 5%. The test is done with the inference function of the ‘statistics_lab_resources_inference.R’ script.
How large is the probability, given the null hypothesis, of observing a difference at least as large as in the sample dataset (1.405 hours)?
inference(y = gss3$tvhours, x = gss3$retirementStatus,
est = "mean", type = "ht", null = 0, alternative = "greater", siglevel = 0.05,
method = "theoretical", order = c("retired","not_retired"),
eda_plot = FALSE, inf_plot = FALSE)
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_retired = 200, mean_retired = 4.08, sd_retired = 2.8766
## n_not_retired = 200, mean_not_retired = 2.675, sd_not_retired = 2.3359
## Observed difference between means (retired-not_retired) = 1.405
## H0: mu_retired - mu_not_retired = 0
## HA: mu_retired - mu_not_retired > 0
## Standard error = 0.262
## Test statistic: Z = 5.362
## p-value = 0
The p-value is nearly zero (reported as 0 after rounding). If there is no difference in the average daily TV consumption between retired and not-retired persons, there is nearly no chance of obtaining random samples of 200 retired and 200 not-retired persons where the difference in the average daily TV consumption is at least 1.405 hours.
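The standard error, test statistic and p-value in the output above can be checked by hand from the reported summary statistics. The following is a minimal sketch using base R only. It assumes the usual unpooled standard error for two independent means and a normal reference distribution, which is consistent with the reported Z statistic but is not taken from the source of the inference function itself.

```r
# Summary statistics as reported in the output above
n_rt  <- 200; mean_rt  <- 4.08;  sd_rt  <- 2.8766   # retired
n_nrt <- 200; mean_nrt <- 2.675; sd_nrt <- 2.3359   # not retired

# Point estimate: observed difference between the two sample means
diff_means <- mean_rt - mean_nrt

# Unpooled standard error of the difference between two independent means
se <- sqrt(sd_rt^2 / n_rt + sd_nrt^2 / n_nrt)

# Test statistic under H0: mu_rt - mu_nrt = 0
z <- (diff_means - 0) / se

# One-sided p-value for HA: mu_rt - mu_nrt > 0
p_value <- pnorm(z, lower.tail = FALSE)

cat("difference =", round(diff_means, 3), "\n")
cat("SE =", round(se, 3), " Z =", round(z, 3), "\n")
cat("p-value =", signif(p_value, 2), "\n")
```

The p-value is positive but so small that it is displayed as 0 when rounded to a few decimal places.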
Therefore we reject the null hypothesis (H0) in favor of the alternative hypothesis that the average daily TV consumption of retired persons is higher than that of not-retired persons.
Because H0 was rejected, a Type 2 error (failing to reject a false H0) cannot have occurred. A Type 1 error (rejecting a true H0) remains possible in principle, but with a p-value this close to 0 the evidence against H0 is very strong.
In a second step we want to estimate the uncertainty of the test result. We calculate the confidence interval for the difference in the average daily TV consumption (in hours) between retired and not-retired persons. The confidence level is 95%.
inference(y = gss3$tvhours, x = gss3$retirementStatus,
est = "mean", type = "ci", null = 0, alternative = "greater", conflevel = 0.95,
method = "theoretical", order = c("retired","not_retired"),
eda_plot = FALSE, inf_plot = FALSE)
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_retired = 200, mean_retired = 4.08, sd_retired = 2.8766
## n_not_retired = 200, mean_not_retired = 2.675, sd_not_retired = 2.3359
## Observed difference between means (retired-not_retired) = 1.405
## Standard error = 0.262
## 95 % Confidence interval = ( 0.8914 , 1.9186 )
We are 95% confident that retired persons in the US watch on average 0.89 to 1.92 hours more TV per day than not-retired persons. The interval does not include 0 and is in accordance with the hypothesis test. There is a significant difference in the daily TV consumption of the two groups.
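The reported interval can be reproduced from the summary statistics as point estimate plus or minus a margin of error. This is a sketch that assumes a normal critical value (about 1.96 for 95% confidence), which matches the reported interval up to rounding.

```r
# Summary statistics as reported in the output above
diff_means <- 4.08 - 2.675                        # observed difference (retired - not retired)
se <- sqrt(2.8766^2 / 200 + 2.3359^2 / 200)       # unpooled standard error

# Critical value for a two-sided 95% interval
conf_level <- 0.95
z_star <- qnorm(1 - (1 - conf_level) / 2)

# Margin of error and confidence interval
margin <- z_star * se
ci <- diff_means + c(-1, 1) * margin

cat("margin of error =", round(margin, 4), "\n")
cat("95% CI = (", round(ci[1], 4), ",", round(ci[2], 4), ")\n")
```

The lower bound is well above 0, which is why the interval agrees with the rejection of H0 in the test.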
Some scientists suggest that TV watching is associated with negative changes in many aspects of health, and it is said that older persons watch more TV [2,3]. In this study we wanted to check whether the GSS dataset supports that statement and to quantify the difference in daily TV consumption between retired and not-retired persons.
We learned that the average daily TV consumption of retired persons is about 4.1 hours versus 2.7 hours for not-retired persons, with some persons (in both groups) watching TV for more than 8 hours per day (strongly skewed distribution).
The hypothesis test provided strong evidence that retired persons watch more TV than not-retired persons, and the confidence interval puts the difference at about 0.9 to 1.9 hours per day. This is a significant difference and supports the efforts to reduce the TV consumption of older persons to improve their health.
However, there is a concern: the proportion of missing responses for the tvhours variable is relatively high (about 40% across all levels of wrkstat). Obtaining responses from non-responders could influence the study results. But because of the low p-value (nearly 0), it is not very probable that this would change our conclusion.
Future studies could concentrate on investigating alternative activities that are more socially and physically engaging and on identifying which activities best reduce passive TV consumption.
The table below shows one page of the data used for this study with the original variables (wrkstat, tvhours) and the derived variable (retirementStatus).
## wrkstat tvhours retirementStatus
## 39912 Working Fulltime 0 not_retired
## 45242 Working Fulltime 2 not_retired
## 42306 Retired 2 retired
## 42931 Retired 2 retired
## 45992 Retired 6 retired
## 39177 Retired 13 retired
## 42011 Retired 5 retired
## 46427 Retired 1 retired
## 41450 Keeping House 4 not_retired
## 38512 Working Parttime 4 not_retired
## 41828 Retired 3 retired
## 41593 Retired 5 retired
## 42697 Keeping House 4 not_retired
## 39595 Keeping House 3 not_retired
## 40500 Retired 6 retired
## 41523 Retired 7 retired
## 41005 Retired 8 retired
## 40399 Working Fulltime 1 not_retired
## 42070 Other 3 not_retired
## 38620 Retired 3 retired
## 45899 Keeping House 1 not_retired
## 42314 Retired 4 retired
## 40883 Retired 5 retired
## 45419 Retired 3 retired
## 39577 Working Fulltime 3 not_retired
## 43281 Retired 5 retired
## 41763 Working Fulltime 2 not_retired
## 46099 Retired 20 retired
## 40925 Retired 3 retired
## 40872 Working Parttime 2 not_retired
## 41182 Working Parttime 1 not_retired
## 41267 Working Fulltime 2 not_retired
## 41236 Retired 3 retired
## 44974 Working Fulltime 2 not_retired
## 40217 Retired 0 retired
## 39930 Working Fulltime 2 not_retired
## 40263 Retired 4 retired
## 45947 Working Fulltime 1 not_retired
## 46190 Retired 3 retired
## 44837 Retired 5 retired
## 38612 Working Fulltime 1 not_retired
## 45894 Retired 2 retired
## 46146 Retired 1 retired
## 38801 Keeping House 2 not_retired
## 43075 Keeping House 1 not_retired