Load necessary packages

library(dplyr)
library(statsr)
library(ggplot2)
library(janitor)
# Turn off scientific notation in printed output (restore with options(scipen = 0))
options(scipen=999) 

Load data

load("gss.Rdata")

Part 1: Data


Part 2: Research question

My research question after going through the data is:

Is the political view of a US citizen related to their opinion on government spending on improving education?

If opinions differ across political views, it may suggest that the government is not doing enough to ensure high-quality education for 'everyone' in the country, and that citizens with different political views judge its spending differently. Supporters of the party in government may defend its spending regardless of the facts, while others may give their honest opinion regardless of which party is in power.


Part 3: Exploratory data analysis

To do the inference I need two variables: one on the political views of US citizens and another on their opinion about government spending on education.

polviews has 7 levels: Extremely Conservative, Conservative, Slightly Conservative, Moderate, Slightly Liberal, Liberal and Extremely Liberal.

nateduc records what a US citizen thinks about the government's spending on improving the nation's education system.
It has 3 levels: too much, too little and about the right amount.

# Keep only the two variables of interest and drop rows with missing values
dt <- gss %>% 
  select(polviews, nateduc) %>% 
  na.omit()

Before performing inference, I can perform some exploratory data analysis (EDA) using summary statistics/tables and visual plots.

summary(dt)
##                   polviews            nateduc     
##  Extremely Liberal    :  774   Too Little :17657  
##  Liberal              : 3184   About Right: 7864  
##  Slightly Liberal     : 3664   Too Much   : 1916  
##  Moderate             :10620                      
##  Slightly Conservative: 4455                      
##  Conservative         : 3909                      
##  Extrmly Conservative :  831

dt %>%
  tabyl(polviews, nateduc) %>% 
  adorn_totals('row')
##               polviews Too Little About Right Too Much
##      Extremely Liberal        599         147       28
##                Liberal       2309         749      126
##       Slightly Liberal       2542         962      160
##               Moderate       6852        3184      584
##  Slightly Conservative       2708        1392      355
##           Conservative       2212        1211      486
##   Extrmly Conservative        435         219      177
##                  Total      17657        7864     1916

The table shows some difference in opinions among the citizens with different political views.
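
Raw counts are hard to compare because the political groups differ a lot in size, so row proportions make the differences easier to judge. The block below is only a sketch: the `counts` matrix is typed in by hand from the table above (it is not read from `gss.Rdata`), and the names `counts` and `row_props` are my own.

```r
# Observed counts copied by hand from the table above
# (rows: polviews, columns: nateduc).
counts <- matrix(
  c( 599,  147,  28,
    2309,  749, 126,
    2542,  962, 160,
    6852, 3184, 584,
    2708, 1392, 355,
    2212, 1211, 486,
     435,  219, 177),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    polviews = c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
                 "Slightly Conservative", "Conservative", "Extrmly Conservative"),
    nateduc  = c("Too Little", "About Right", "Too Much")
  )
)

# Share of each opinion within each political view (each row sums to 1).
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

With the full data frame, a similar table could probably be obtained by adding janitor's `adorn_percentages('row')` to the `tabyl()` pipeline instead of retyping the counts.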

It becomes easier to see this difference visually:

ggplot(dt, aes(x=polviews, fill=nateduc))+
  theme(panel.border=element_rect(colour='black', fill=NA)) +
  theme(text = element_text(size = 13)) +
  labs(x = 'Political Views', y='Proportion')+
  geom_bar(position='fill', color='black')+
  scale_fill_discrete(name="Opinion")+
  coord_flip()

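The graph shows that liberal US citizens are more likely than conservative US citizens to say the government spends too little on improving education: about 77% of Extremely Liberal respondents answered "Too Little", against about 52% of Extremely Conservative respondents. In the other direction, about 21% of Extremely Conservative respondents answered "Too Much", against under 4% of Extremely Liberal respondents.

Although the plot indicates that there may be differences, we need to perform actual inference to confirm that.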


Part 4: Inference

In this final stage, I will perform statistical inference. This is done through a series of well-defined steps:
    1. Define hypothesis
    2. Choose statistical method
    3. Check for conditions
    4. Perform the inference tests
    5. Interpret the results

Defining Hypotheses

The null hypothesis (H0) is that there is no association between the political views of US citizens and their opinions on government spending on improving education.

The alternative hypothesis (HA) is that the political views of US citizens and their opinions on government spending on improving education are associated.

Choosing Statistical Test Method

Since the dataset consists of two categorical variables (polviews and nateduc), the appropriate test is the chi-square test of independence.
This test is used to compare 2 categorical variables when at least one of them has more than 2 levels. This is the case here, as can be seen below:

str(dt)
## 'data.frame':    27437 obs. of  2 variables:
##  $ polviews: Factor w/ 7 levels "Extremely Liberal",..: 4 5 6 6 6 5 5 5 6 2 ...
##  $ nateduc : Factor w/ 3 levels "Too Little","About Right",..: 1 1 2 2 1 1 1 1 1 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:29624] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:29624] "1" "2" "3" "4" ...

The chi-square test does not produce a confidence interval, so none is included in this analysis.

Checking conditions

The key conditions for the chi-square test of independence are:

  1. Independence between observations: This is assumed to be true based on the sampling methodology used in the GSS, as it uses random sampling. Furthermore, the size of the sample is less than 10% of the population, and each respondent is counted in only one cell.
  2. Sample size: Each cell must have an expected count of at least 5. As can be seen below, even the smallest observed count is 28, and the smallest expected count (Extremely Liberal and Too Much: 774 x 1916 / 27437, about 54) is well above 5.

table(dt)
##                        nateduc
## polviews                Too Little About Right Too Much
##   Extremely Liberal            599         147       28
##   Liberal                     2309         749      126
##   Slightly Liberal            2542         962      160
##   Moderate                    6852        3184      584
##   Slightly Conservative       2708        1392      355
##   Conservative                2212        1211      486
##   Extrmly Conservative         435         219      177
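
Strictly speaking, the sample size condition is about expected counts under independence, not observed counts. The sketch below computes them as row total times column total divided by the grand total. It assumes the `counts` matrix typed in by hand from the table above; the names `counts` and `expected` are my own and are not part of the original analysis.

```r
# Observed counts copied by hand from the table(dt) output above.
counts <- matrix(
  c( 599,  147,  28,
    2309,  749, 126,
    2542,  962, 160,
    6852, 3184, 584,
    2708, 1392, 355,
    2212, 1211, 486,
     435,  219, 177),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    polviews = c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
                 "Slightly Conservative", "Conservative", "Extrmly Conservative"),
    nateduc  = c("Too Little", "About Right", "Too Much")
  )
)

# Expected count for each cell under independence:
# (row total * column total) / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 1)

# The condition holds if every expected count is at least 5.
min(expected)
all(expected >= 5)
```

The same matrix should also be available as `chisq.test(counts)$expected`, which avoids computing it by hand.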

Since the conditions are met, we can proceed to the next step.

Inference

Finally, inference calculation using the chi-square test:

chisq.test(dt$polviews, dt$nateduc)
## 
##  Pearson's Chi-squared test
## 
## data:  dt$polviews and dt$nateduc
## X-squared = 759.64, df = 12, p-value < 0.00000000000000022
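
As a cross-check, the statistic reported above can be reproduced from the contingency table alone. This is a minimal sketch, assuming the hand-typed `counts` matrix matches the table shown earlier; the variable names are my own.

```r
# Observed counts copied by hand from the contingency table
# (rows: polviews from Extremely Liberal to Extrmly Conservative;
#  columns: Too Little, About Right, Too Much).
counts <- matrix(
  c( 599,  147,  28,
    2309,  749, 126,
    2542,  962, 160,
    6852, 3184, 584,
    2708, 1392, 355,
    2212, 1211, 486,
     435,  219, 177),
  nrow = 7, byrow = TRUE
)

# Expected counts under independence.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected.
x_squared <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1) = 6 * 2 = 12.
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

x_squared
df
p_value
```

Working from the table rather than from `dt` also makes it easy to see which cells contribute most to the statistic, by inspecting `(counts - expected)^2 / expected` cell by cell.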

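As seen above, the chi-squared statistic is very large (759.64 on 12 degrees of freedom), resulting in a very small p-value. Since the p-value is far below any conventional significance level (for example 0.05), we reject the null hypothesis: the data provide convincing evidence that political views and opinions on government spending on improving education are associated. Because the GSS is an observational study, this result shows an association only, not a causal relationship.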
