To describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability/causality).
The General Social Survey (GSS) is a nationally representative survey of adults in the United States. It is conducted as a personal-interview survey.
The results are generalizable to noninstitutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States (NORC, University of Chicago. The General Social Survey. Available at: link).
As this is an observational, cross-sectional study without random assignment, only associations can be investigated; causal conclusions cannot be drawn.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## Warning: package 'readr' was built under R version 4.2.1
## Warning: package 'forcats' was built under R version 4.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.2.1
##
## Attaching package: 'rstatix'
##
## The following object is masked from 'package:stats':
##
## filter
load("~/R data/Social Survey (GSS)/_5db435f06000e694f6050a2d43fc7be3_gss (2).Rdata")
Come up with a research question that you want to answer using these data. You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. You are welcomed to create new variables based on existing ones. Along with your research question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience. Perform inference that addresses the research question you outlined above. Each R output and plot should be accompanied by a brief interpretation. INFERENCE: Statistical inference via hypothesis testing and/or confidence interval: State hypotheses, Check conditions, State the method(s) to be used and why and how, Perform inference, Interpret results. If applicable, state whether results from various methods agree.
Is there a difference between US males and US females in daily hours watching TV?
Null hypothesis (there is nothing going on: the population mean hours per day of watching TV do not differ between US males and females): H0: µmale = µfemale. Alternative hypothesis (there is something going on: the population means differ): Ha: µmale ≠ µfemale. Because the test is two-sided, this single alternative covers both possible directions, µmale < µfemale and µmale > µfemale.
A two-sided two-sample t-test (independent-samples t-test) was selected, as the objective is to test whether the unknown population means of TV watching (hours per day) in two groups (men and women) are equal. A two-sided test was selected because we are interested in a difference in either direction. The General Social Survey Cumulative File, 1972-2012, is used as the data source. R is used for the data description and analysis (R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: link).
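For reference, the statistic behind this test can be written out. The form below is the Welch version, which is what R's t.test() computes unless var.equal = TRUE is set; the subscripts and the symbol ν for the degrees of freedom are notation introduced here for illustration.

```latex
t = \frac{\bar{x}_{\text{male}} - \bar{x}_{\text{female}}}
         {\sqrt{\dfrac{s_{\text{male}}^{2}}{n_{\text{male}}} + \dfrac{s_{\text{female}}^{2}}{n_{\text{female}}}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_{\text{male}}^{2}}{n_{\text{male}}} + \dfrac{s_{\text{female}}^{2}}{n_{\text{female}}}\right)^{2}}
     {\dfrac{\left(s_{\text{male}}^{2}/n_{\text{male}}\right)^{2}}{n_{\text{male}} - 1}
      + \dfrac{\left(s_{\text{female}}^{2}/n_{\text{female}}\right)^{2}}{n_{\text{female}} - 1}}
```

Under H0 the statistic t is compared with a t distribution on ν degrees of freedom, and the two-sided p-value is the probability of a value at least as far from 0 as the observed t in either tail.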
Observations are independent (one respondent's value of the variable does not affect another's). The data in each group are a random sample from the population. The numbers of males (n = 14,754) and females (n = 19,101) with a recorded value are each well below 10% of the US male and female populations.
gss %>%
  group_by(sex) %>%
  get_summary_stats(tvhours, type = "common")
## # A tibble: 2 × 11
## sex variable n min max median iqr mean sd se ci
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male tvhours 14754 0 24 2 3 2.82 2.24 0.018 0.036
## 2 Female tvhours 19101 0 24 3 2 3.08 2.43 0.018 0.035
gss %>%
  ggplot(aes(x = tvhours, colour = sex)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw()
## Warning: Removed 23206 rows containing non-finite values (stat_boxplot).
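Based on the summary statistics and the side-by-side box plot, both distributions are right-skewed rather than symmetric: values range from 0 to 24 hours, the mean exceeds the median in both groups (2.82 vs. 2 for males, 3.08 vs. 3 for females), and for males the median lies close to the 25th percentile. The median is lower for males than for females. The IQR is larger for males (3 vs. 2), while the standard deviation is slightly larger for females (2.43 vs. 2.24), so the spreads of the two groups are similar. Despite the skewness, both samples are very large (n = 14,754 and n = 19,101), so the sampling distribution of the difference in sample means is approximately normal. In addition, R's t.test() uses the Welch version of the test by default, which does not assume equal variances. Based on this, the two-sample t-test appears to be an appropriate method to test for a difference in means at the significance level α = 0.05.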
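The claim that large samples compensate for skewness can be illustrated with a small simulation. This is only a sketch on simulated data, not on the GSS: the exponential distribution with mean 3, the sample size of 2000 and the helper function skewness() are all assumptions chosen for illustration.

```r
# Simulated illustration only: these are NOT GSS data.
set.seed(1)

# Sample skewness (third standardized moment), base R only.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

n_per_sample <- 2000  # smaller than either GSS group, but already large
n_reps       <- 2000  # number of simulated samples

# Right-skewed "hours" with mean 3 (exponential draws, an illustrative assumption).
raw_draws <- rexp(10000, rate = 1 / 3)

# Mean of each of n_reps independent samples of size n_per_sample.
sample_means <- replicate(n_reps, mean(rexp(n_per_sample, rate = 1 / 3)))

skew_raw   <- skewness(raw_draws)     # raw values: strongly right-skewed
skew_means <- skewness(sample_means)  # sample means: close to symmetric

print(c(raw = skew_raw, means = skew_means))
```

The raw values are clearly skewed, while the skewness of the simulated sample means is close to zero, which is the behaviour the t-test relies on here.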
gss %>%
  t.test(tvhours ~ sex, data = ., alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: tvhours by sex
## t = -10.34, df = 32857, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -0.3133439 -0.2134776
## sample estimates:
## mean in group Male mean in group Female
## 2.822082 3.085493
Males watch significantly less TV per day than females. The 95% confidence interval for the difference in means (male minus female) is (-0.313, -0.213) hours, i.e., we are 95% confident that males watch on average about 0.21 to 0.31 hours (roughly 13 to 19 minutes) less TV per day than females. The p-value is < 2.2e-16, which means that if the null hypothesis were true, the probability of observing a difference in sample means at least as large as the one obtained would be < 2.2e-16; at α = 0.05 we therefore reject the null hypothesis. The confidence interval does not contain 0, so the confidence interval and the hypothesis test agree. Although the difference of 0.2-0.3 hours a day is statistically significant, it is unknown whether it has any practical impact (e.g., a health impact, if females exercise less as a result). It should also be noted that the data are self-reported, which is not an objective measurement; however, it is assumed that males and females estimate their TV time with similar accuracy.
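As a cross-check, the test statistic and confidence interval can be recomputed by hand from the summary statistics printed above. This is a sketch rather than part of the original analysis: the standard deviations are taken from the summary table, where they are rounded to two decimals, so the results should land close to, but not exactly on, the t.test() output (the degrees of freedom in particular will differ slightly). All variable names are chosen here for illustration.

```r
# Summary statistics copied from the output above (SDs rounded to 2 decimals).
mean_m <- 2.822082; sd_m <- 2.24; n_m <- 14754  # males
mean_f <- 3.085493; sd_f <- 2.43; n_f <- 19101  # females

# Difference in sample means (male minus female) and its standard error.
mean_diff <- mean_m - mean_f
var_m     <- sd_m^2 / n_m
var_f     <- sd_f^2 / n_f
se_diff   <- sqrt(var_m + var_f)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_stat   <- mean_diff / se_diff
welch_df <- (var_m + var_f)^2 /
  (var_m^2 / (n_m - 1) + var_f^2 / (n_f - 1))

# Two-sided p-value and 95% confidence interval for the difference.
p_value    <- 2 * pt(-abs(t_stat), df = welch_df)
ci_hours   <- mean_diff + c(-1, 1) * qt(0.975, df = welch_df) * se_diff
ci_minutes <- ci_hours * 60  # same interval expressed in minutes per day

print(c(t = t_stat, df = welch_df, p = p_value))
print(rbind(hours = ci_hours, minutes = ci_minutes))
```

Expressing the interval in minutes per day makes it easier to judge the practical size of the difference, separately from its statistical significance.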