Setup

Load packages

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.1
library(dplyr)
library(statsr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called gss.

load("gss.Rdata")

Part 1: Data

SAMPLE : Observations in the sample are collected through personal interviews with randomly sampled respondents. Because random sampling is involved, we can generalize the conclusions to the population from which the sample was drawn.

SCOPE OF INFERENCE : Random sampling was used, but there was no random assignment of treatments, so this is an observational study. We can draw conclusions about association (correlation) between variables, but we cannot draw causal conclusions.

TARGET POPULATION : The target population of the GSS is adults (18+) living in households in the United States. From 1972 to 2004 it was further restricted to those able to do the survey in English. From 2006 to present it has included those able to do the survey in English or Spanish. Those unable to do the survey in either English or Spanish are out-of-scope. Residents of institutions and group quarters are out-of-scope. Those with mental and/or physical conditions that prevent them from doing the survey, but who live in households, are part of the target population and in-scope. In the reinterviews, those who have died, moved out of the United States, or who no longer live in a household have left the target population and are out-of-scope.

CORRELATIONAL CONCLUSIONS ABOUT THE VARIABLES IN THIS STUDY APPLY ONLY TO THE POPULATION DESCRIBED IN THE “TARGET POPULATION” PARAGRAPH ABOVE.


Part 2: Research question

There is a claim, based on GSS survey data for 1972 and 1973, that the proportion of females differed between the two years. The claim further states that 1973 had a higher proportion of females than 1972.

Our job is to check whether this claim is statistically supported. We will examine it using two kinds of statistical analysis.

  1. EXPLORATORY STATISTICAL ANALYSIS
     - data summary
     - plot visualization

  2. INFERENTIAL STATISTICAL ANALYSIS
     - METHOD-1 : hypothesis testing
     - METHOD-2 : confidence-interval method


Part 3: Exploratory data analysis

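# copy of gss data restricted to the years 1972 and 1973.
# NOTE : %in% tests membership in the set of years. the outputs shown below were
# produced with year == c(1972,1973), which recycles the two values along the rows
# (R warned "longer object length is not a multiple of shorter object length") and
# keeps only about half of each year's rows, so this line will return larger counts.
gss2 <- gss %>% filter(year %in% c(1972, 1973))
# converting numerical variable into categorical.
gss2$year <- as.factor(gss2$year)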
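Comparing a column with `== c(1972, 1973)` recycles the two values along the rows instead of testing membership. The sketch below illustrates the difference on a small made-up vector (the `year` values here are hypothetical and are not the GSS data).

```r
# Hypothetical toy vector standing in for a year column (not the GSS data).
year <- c(1972, 1972, 1972, 1973, 1973, 1974)

# `==` recycles c(1972, 1973) along the vector: position 1 is compared with
# 1972, position 2 with 1973, position 3 with 1972, and so on.
keep_recycled <- year == c(1972, 1973)

# `%in%` tests every element for membership in the set of years.
keep_member <- year %in% c(1972, 1973)

sum(keep_recycled)  # number of elements kept by the recycled comparison
sum(keep_member)    # number of elements kept by the membership test
```

On this toy vector the recycled comparison silently drops some 1972 and 1973 entries, which is why `%in%` is usually the safer choice when filtering on several values.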

# summary of count in table format.
# INTERPRETATION : this table shows the raw counts of females and males in the sample for the years 1972 and 1973. the counts are visible here, but the proportions are hard to read off directly.
table(gss2$year, gss2$sex) 
##       
##        Male Female
##   1972  432    375
##   1973  354    398
# data-summary of proportion in table format.
# INTERPRETATION : each cell is a share of the combined 1972 and 1973 sample (the four cells sum to 1), not a within-year proportion. females make up a larger share in the 1973 row (0.255) than in the 1972 row (0.241), while the reverse holds for males, which suggests that the female proportion within 1973 is larger than within 1972. but this alone cannot tell us whether the difference is statistically significant; we need the inferential analysis in Part 4 to conclude that.
prop.table(table(gss2$year, gss2$sex)) 
##       
##             Male    Female
##   1972 0.2771007 0.2405388
##   1973 0.2270686 0.2552919
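To read off within-year proportions directly, the table can be normalized by row. The following is a minimal sketch that rebuilds the count table printed above by hand (the `counts` matrix is typed in from that output, not recomputed from `gss2`) and passes `margin = 1` to `prop.table()`.

```r
# Counts copied from the printed table(gss2$year, gss2$sex) output above.
counts <- matrix(c(432, 375,
                   354, 398),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(year = c("1972", "1973"),
                                 sex  = c("Male", "Female")))

# margin = 1 divides each cell by its row total, so every row sums to 1
# and the entries are proportions within each year.
row_props <- prop.table(counts, margin = 1)
round(row_props, 4)
```

The Female column of `row_props` should agree with the `p_hat_1972` and `p_hat_1973` values reported by `inference()` in Part 4.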
# stacked bar-plot of the female/male proportions for year 1972 and 1973.
# INTERPRETATION : position = "fill" scales each bar to 1, so the plot shows the within-year proportions of females and males. the female proportion for year-1973 is larger than for year-1972, but the plot cannot tell us whether this difference is statistically significant; we need further statistical inference testing to conclude that. THIS CASE IS A CLASSIC EXAMPLE OF WHY INFERENTIAL STATISTICAL ANALYSIS IS NEEDED ALONGSIDE EXPLORATORY ANALYSIS, AS THE CONCLUSIONS IN PART 4 SHOW.
ggplot(data = gss2, aes(x = year, fill = sex)) + geom_bar(position = "fill")


Part 4: Inference

# INFERENTIAL ANALYSIS METHOD-1 : HYPOTHESIS TESTING
# we are going to run a hypothesis test at the 5% significance level. if the p-value is less than 0.05, we reject the null hypothesis in favor of the alternative hypothesis; otherwise we do not have enough evidence to reject the null hypothesis (which is not the same as accepting it).
###  H0: p_1973 =  p_1972
###  HA: p_1973 != p_1972

inference(data = gss2, y = sex, x = year, type = "ht", order = c(1973,1972), statistic = "proportion", method = "theoretical", alternative = "twosided", null = 0, success = "Female")
## Response variable: categorical (2 levels, success: Female)
## Explanatory variable: categorical (2 levels) 
## n_1973 = 752, p_hat_1973 = 0.5293
## n_1972 = 807, p_hat_1972 = 0.4647
## H0: p_1973 =  p_1972
## HA: p_1973 != p_1972
## z = 2.548
## p_value = 0.0108

## INFERENCE : the p-value (0.0108) is less than alpha (significance level) of 0.05, so we have enough evidence to "REJECT H0" in favor of HA: the proportions of females for year 1973 and 1972 are not equal. the positive z-score indicates that the sample proportion of females is larger for 1973 than for 1972.
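The reported z statistic can be checked by hand from the sample sizes and counts shown earlier. This is a sketch of the standard two-proportion z-test with a pooled proportion; it assumes that this is what the theoretical method computes here, which the matching z value supports. The variable names are illustrative.

```r
# Sample sizes and female counts taken from the outputs above.
n_1973 <- 752; x_1973 <- 398
n_1972 <- 807; x_1972 <- 375

p_hat_1973 <- x_1973 / n_1973
p_hat_1972 <- x_1972 / n_1972

# Under H0 the two proportions are equal, so the standard error uses the
# pooled proportion of females across both years.
p_pool  <- (x_1973 + x_1972) / (n_1973 + n_1972)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_1973 + 1 / n_1972))

z       <- (p_hat_1973 - p_hat_1972) / se_pool
p_value <- 2 * pnorm(-abs(z))  # two-sided p-value

round(c(z = z, p_value = p_value), 4)
```

Rounded, these should agree with the `z` and `p_value` lines printed by `inference()`.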

# INFERENTIAL ANALYSIS METHOD-2 : CONFIDENCE INTERVAL
# we are going to compute a confidence interval for the difference in proportions (1973 minus 1972) at the 95% confidence level. if the interval includes "0", we cannot conclude that the proportion of females differs between 1973 and 1972. if the interval does not include "0", we can say that there is a difference in the proportion of females between the two years. if the whole interval is positive (above 0), we can also say that 1973 had a higher proportion of females than 1972.

inference(data = gss2, y = sex, x = year, type = "ci", order = c(1973,1972), statistic = "proportion", method = "theoretical", alternative = "twosided", null = 0, success = "Female")
## Response variable: categorical (2 levels, success: Female)
## Explanatory variable: categorical (2 levels) 
## n_1973 = 752, p_hat_1973 = 0.5293
## n_1972 = 807, p_hat_1972 = 0.4647
## 95% CI (1973 - 1972): (0.015 , 0.1141)

## INFERENCE : the 95% confidence interval for the difference in proportions (1973 minus 1972) does not include 0, so we have enough evidence to say that the proportions of females for year 1973 and 1972 are not equal. the interval lies entirely above 0, which indicates that the female proportion for 1973 was larger than for 1972.
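The interval can also be reproduced by hand. The sketch below uses the usual normal-approximation interval for a difference of two proportions with an unpooled standard error; this is assumed to be the formula behind the theoretical method, and the variable names are illustrative.

```r
# Sample sizes and female counts taken from the outputs above.
n_1973 <- 752; x_1973 <- 398
n_1972 <- 807; x_1972 <- 375

p_hat_1973 <- x_1973 / n_1973
p_hat_1972 <- x_1972 / n_1972

# For a confidence interval no null value is assumed, so each sample
# contributes its own variance term (unpooled standard error).
se <- sqrt(p_hat_1973 * (1 - p_hat_1973) / n_1973 +
           p_hat_1972 * (1 - p_hat_1972) / n_1972)

z_star <- qnorm(0.975)  # critical value for 95% confidence
ci <- (p_hat_1973 - p_hat_1972) + c(-1, 1) * z_star * se

round(ci, 4)
```

Because both limits are above 0, the interval leads to the same conclusion as the hypothesis test at the 5% significance level.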