Does the attitude toward the impact of science and technology on our lives differ by age group and by the party a respondent is likely to vote for in the national election?
Ho(Age): The average score on the Better Off or Worse Off item is equal among the 3 age groups
Ho(Party): The average score on the Better Off or Worse Off item is equal among the 3 Party groups
Ha(Age): The average scores on the Better Off or Worse Off item are not all equal among the 3 age groups
Ha(Party): The average scores on the Better Off or Worse Off item are not all equal among the 3 Party groups
• What are the cases, and how many are there? (The cases represent individual respondents and there are 2596 in this dataset. The original dataset has over 290 questions included. The current dataset is a subset of 12 questions from the original)
• Describe the method of data collection (survey data collected through a questionnaire).
• What type of study is this (observational)?
• Description of the dependent variable (what is being measured – Life is Worse Off or Better off due to Science and Technology)
• Description of the independent variables (Age, measured as the 3-group variable Age3g: “Up to 29”, “30-49”, and “50 and Over”; Party, recoded from the “partyvotefor” variable, whose original 7 groups were recoded into 3 groups for the “Party” variable)
• Response: What is the response variable (continuous/scale), and what type is it (numerical)? The item is measured on a scale of 1 (Worse Off) to 10 (Better Off): Life is ______ due to Science and Technology.
• Explanatory: What are the explanatory variables (Age and Party), and what type are they (both are categorical)?
• Relevant summary statistics
# There are critical assumptions that should be verified
# When both the DV and the IV are continuous, we examine a correlational hypothesis
# When the IV is a categorical variable, we use a comparative hypothesis
# If your categorical IV has two groups, you use a t-test
# If your categorical IV has three or more groups, you use ANOVA
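# Packages used below (assumed; the original library() calls are not shown)
library(sjlabelled)  # set_labels()
library(sjmisc)      # frq()
library(broom)       # tidy()
library(ggpubr)      # ggboxplot()
values <- read.csv("https://raw.githubusercontent.com/johnm1990/DATA606/master/us_values.csv", sep=",", header=TRUE)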
dim(values)
## [1] 2596 10
head(values)
## interviewID socialclass age3g opportunities morescience_lowfaith right_wrong
## 1 840071001 5 2 -2 -2 -2
## 2 840071002 3 2 -2 -2 -2
## 3 840071003 5 2 1 1 10
## 4 840071004 3 2 1 10 1
## 5 840071005 3 1 4 5 5
## 6 840071006 5 1 1 1 1
## science_not_important better_of_.worse_of partyvotefor sex
## 1 -2 -2 840002 2
## 2 -2 -2 840005 2
## 3 10 -1 -1 2
## 4 5 1 840005 2
## 5 5 1 840001 2
## 6 1 1 840005 1
str(values)
## 'data.frame': 2596 obs. of 10 variables:
## $ interviewID : int 840071001 840071002 840071003 840071004 840071005 840071006 840071007 840071008 840071009 840071010 ...
## $ socialclass : int 5 3 5 3 3 5 3 3 4 5 ...
## $ age3g : int 2 2 2 2 1 1 2 3 1 3 ...
## $ opportunities : int -2 -2 1 1 4 1 1 6 1 10 ...
## $ morescience_lowfaith : int -2 -2 1 10 5 1 1 4 10 10 ...
## $ right_wrong : int -2 -2 10 1 5 1 1 6 10 10 ...
## $ science_not_important: int -2 -2 10 5 5 1 10 1 5 10 ...
## $ better_of_.worse_of : int -2 -2 -1 1 1 1 1 1 1 1 ...
## $ partyvotefor : int 840002 840005 -1 840005 840001 840005 840006 840002 840004 840002 ...
## $ sex : int 2 2 2 2 2 1 1 2 2 1 ...
colnames(values)
## [1] "interviewID" "socialclass" "age3g"
## [4] "opportunities" "morescience_lowfaith" "right_wrong"
## [7] "science_not_important" "better_of_.worse_of" "partyvotefor"
## [10] "sex"
hist(values$better_of_.worse_of, main = "Better Off Worse Off")
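# Negative codes (presumably the No Answer / Don't Know codes) are recoded to 1, the lowest scale value
values$better_worse[values$better_of_.worse_of < 0] <- 1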
values$better_worse[values$better_of_.worse_of == 1] <- 1
values$better_worse[values$better_of_.worse_of == 2] <- 2
values$better_worse[values$better_of_.worse_of == 3] <- 3
values$better_worse[values$better_of_.worse_of == 4] <- 4
values$better_worse[values$better_of_.worse_of == 5] <- 5
values$better_worse[values$better_of_.worse_of == 6] <- 6
values$better_worse[values$better_of_.worse_of == 7] <- 7
values$better_worse[values$better_of_.worse_of == 8] <- 8
values$better_worse[values$better_of_.worse_of == 9] <- 9
values$better_worse[values$better_of_.worse_of == 10] <- 10
# Histogram of the recoded variable (the original call re-plotted the raw variable)
hist(values$better_worse, main = "Better Off Worse Off")
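The eleven assignment lines above can also be written as a single vectorized step. The following is a minimal sketch on a made-up vector (`raw` is hypothetical, not the survey column). It also shows an alternative that treats the negative codes as missing, which is not what the analysis in this report does.

```r
# Hypothetical mini-example: raw scores that include negative missing-value codes
raw <- c(-2, -1, 1, 5, 10, 3)

# Option 1: reproduce the recode used in this report (negative codes become 1)
recoded_as_one <- ifelse(raw < 0, 1, raw)

# Option 2 (an alternative, NOT what the report does):
# treat negative codes as missing so they drop out of the group means
recoded_as_na <- ifelse(raw < 0, NA, raw)

recoded_as_one
mean(recoded_as_na, na.rm = TRUE)
```

Which option is appropriate depends on how the missing-value codes should be interpreted. Option 1 counts non-responses as the most negative answer, which can pull group means downward.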
# 840001 - Republican   840002 - Democratic   840004 - Libertarian
# 840005 - Other   840006 - Green   "-1" & "-2" are No Answer and Don't Know
# Recode "-1" and "-2", along with 840004 / 840005 / 840006, into a single
# third group (Independent)
values$party[values$partyvotefor < 0] <- 3
values$party[values$partyvotefor == 840001] <- 1
values$party[values$partyvotefor == 840002] <- 2
values$party[values$partyvotefor == 840004] <- 3
values$party[values$partyvotefor == 840005] <- 3
values$party[values$partyvotefor == 840006] <- 3
values$party <- factor(values$party)
values$party <- set_labels(values$party, labels=c("Republican", "Democratic", "Independent"))
frq(values$party)
##
## x <categorical>
## # total N=2596 valid N=2596 mean=1.94 sd=0.75
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------------
## 1 | Republican | 819 | 31.55 | 31.55 | 31.55
## 2 | Democratic | 1126 | 43.37 | 43.37 | 74.92
## 3 | Independent | 651 | 25.08 | 25.08 | 100.00
## <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
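As a sketch of an alternative to the recode above, the three party groups can be built with `%in%` and turned into a base R factor with real level labels. The vector `partyvotefor` below is a made-up stand-in for the survey column. Using `factor(..., labels = ...)` instead of `set_labels()` is an assumption of this example, not the method used in the report.

```r
# Hypothetical party codes, including the negative "no answer" codes
partyvotefor <- c(840001, 840002, 840004, 840005, 840006, -1, -2)

# Start with NA so that any unexpected code stays visible as missing
party_num <- rep(NA_integer_, length(partyvotefor))
party_num[partyvotefor == 840001] <- 1L
party_num[partyvotefor == 840002] <- 2L
party_num[partyvotefor %in% c(840004, 840005, 840006) | partyvotefor < 0] <- 3L

# A base R factor whose levels carry the group names
party <- factor(party_num, levels = 1:3,
                labels = c("Republican", "Democratic", "Independent"))
table(party)
```

With named factor levels, downstream output such as TukeyHSD contrasts would show the group names rather than numeric codes like `2-1`.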
values$age3g <- factor(values$age3g)
values$age3g <- set_labels(values$age3g, labels=c("Upto 29", "30-49", "50 & Over"))
## Frequency tables of Age3g and Party:
frq(values$age3g)
##
## x <categorical>
## # total N=2596 valid N=2596 mean=2.10 sd=0.76
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ---------------------------------------------------
## 1 | Upto 29 | 632 | 24.35 | 24.35 | 24.35
## 2 | 30-49 | 1072 | 41.29 | 41.29 | 65.64
## 3 | 50 & Over | 892 | 34.36 | 34.36 | 100.00
## <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
frq(values$party)
##
## x <categorical>
## # total N=2596 valid N=2596 mean=1.94 sd=0.75
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------------
## 1 | Republican | 819 | 31.55 | 31.55 | 31.55
## 2 | Democratic | 1126 | 43.37 | 43.37 | 74.92
## 3 | Independent | 651 | 25.08 | 25.08 | 100.00
## <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
summary(values$better_worse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 8.000 7.437 10.000 10.000
The ANOVA model rejects Ho(Age) in favor of Ha(Age) and Ho(Party) in favor of Ha(Party): both F-tests below are significant at p < 0.001.
anovamod1 <- aov(better_worse ~ age3g + party, data = values)
summary(anovamod1)
## Df Sum Sq Mean Sq F value Pr(>F)
## age3g 2 187 93.43 16.98 4.70e-08 ***
## party 2 396 198.05 36.00 3.79e-16 ***
## Residuals 2591 14254 5.50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(anovamod1)
## # A tibble: 3 x 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age3g 2 187. 93.4 17.0 4.70e- 8
## 2 party 2 396. 198. 36.0 3.79e-16
## 3 Residuals 2591 14254. 5.50 NA NA
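The comments at the top of the code note that ANOVA has critical assumptions that should be verified, but no check is shown. The following is a minimal sketch of two common checks, run on simulated stand-in data (the data frame `sim` and its effect sizes are invented for illustration). Bartlett's test and the Shapiro-Wilk test are one possible choice of checks, not necessarily the ones the author intended.

```r
set.seed(606)

# Simulated stand-in data (NOT the survey data): one score and two 3-level factors
n <- 300
sim <- data.frame(
  age3g = factor(sample(1:3, n, replace = TRUE)),
  party = factor(sample(1:3, n, replace = TRUE))
)
sim$score <- 7 + 0.5 * (sim$age3g == "3") + rnorm(n, sd = 2)

fit <- aov(score ~ age3g + party, data = sim)

# Homogeneity of variance across the groups of each factor
bt_age   <- bartlett.test(score ~ age3g, data = sim)
bt_party <- bartlett.test(score ~ party, data = sim)
bt_age
bt_party

# Approximate normality of the residuals: Q-Q plot and Shapiro-Wilk test
qqnorm(residuals(fit)); qqline(residuals(fit))
sw <- shapiro.test(residuals(fit))
sw
```

For a bounded 1-10 item with many responses at 10, as in the summary above, the residuals are unlikely to be exactly normal, so the Q-Q plot is usually more informative than the test's p-value at this sample size.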
##TukeyHSD shows which pairwise group mean differences for Age and Party are significant vs. not significant
tidy(TukeyHSD(anovamod1)) # Tukey means comparison among groups for Age3g and Party
## # A tibble: 6 x 7
## term contrast null.value estimate conf.low conf.high adj.p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age3g 2-1 0 0.186 -0.0897 0.462 0.253
## 2 age3g 3-1 0 0.661 0.375 0.947 0.000000196
## 3 age3g 3-2 0 0.475 0.225 0.724 0.0000248
## 4 party 2-1 0 0.700 0.448 0.953 0
## 5 party 3-1 0 -0.173 -0.462 0.116 0.338
## 6 party 3-2 0 -0.873 -1.14 -0.603 0
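One way to read the Tukey table: a pairwise difference is significant at the 5% family-wise level exactly when its confidence interval excludes 0. The sketch below re-enters the interval bounds printed above by hand (the data frame `tukey` is built for illustration only) and flags the significant contrasts.

```r
# Interval bounds copied from the TukeyHSD output above
tukey <- data.frame(
  term      = c("age3g", "age3g", "age3g", "party", "party", "party"),
  contrast  = c("2-1", "3-1", "3-2", "2-1", "3-1", "3-2"),
  conf.low  = c(-0.0897, 0.375, 0.225, 0.448, -0.462, -1.14),
  conf.high = c(0.462, 0.947, 0.724, 0.953, 0.116, -0.603)
)

# Significant when the whole interval lies on one side of zero
tukey$significant <- tukey$conf.low > 0 | tukey$conf.high < 0
tukey
```

This flags every contrast except age3g 2-1 and party 3-1, which agrees with the adjusted p-values in the table.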
# ANOVA model as a regression model with dummy variables
## • For regression models, include the regression output and interpret the R-squared value.
lmfit <- summary.lm(anovamod1)
lmfit
##
## Call:
## aov(formula = better_worse ~ age3g + party, data = values)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2489 -1.6936 0.3703 1.9513 3.3064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8625 0.1217 56.396 < 2e-16 ***
## age3g2 0.1862 0.1179 1.580 0.114
## age3g3 0.6831 0.1236 5.525 3.62e-08 ***
## party2 0.7033 0.1086 6.475 1.13e-10 ***
## party3 -0.1689 0.1248 -1.353 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.345 on 2591 degrees of freedom
## Multiple R-squared: 0.03929, Adjusted R-squared: 0.03781
## F-statistic: 26.49 on 4 and 2591 DF, p-value: < 2.2e-16
The model R-squared is 3.9%, which means that the age3g and party variables together explain only 3.9% of the variation in the Better Off or Worse Off opinion item about science and technology. About 96% of the variation in this dependent variable (100% - 3.9%) remains unexplained.
The ANOVA model, the TukeyHSD comparisons, and the regression output show where the significant differences arise within each of the Age3g and Party variables:
1. Age3g: The 50 and Over group scores significantly higher than both the Up to 29 and 30-49 age groups, which do not differ significantly from each other.
2. Party: Those who indicated that they would vote for the Democratic party in the national election score significantly higher than both Republicans and Independents (these two groups do not differ significantly from each other).
Party values: 1 = Republican, 2 = Democratic, 3 = Independent.
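The R-squared and the overall F-statistic in the regression output can be recovered from the ANOVA table by hand. This short sketch uses the rounded sums of squares printed in the ANOVA summary, so the results agree with the regression output only up to rounding. The variable names are chosen for this example.

```r
# Rounded values copied from the ANOVA table above
ss_age   <- 187     # Sum Sq for age3g
ss_party <- 396     # Sum Sq for party
ss_resid <- 14254   # Sum Sq for Residuals
ms_resid <- 5.50    # Mean Sq for Residuals
df_model <- 2 + 2   # model degrees of freedom (age3g + party)

# R-squared = explained sum of squares / total sum of squares
r2 <- (ss_age + ss_party) / (ss_age + ss_party + ss_resid)

# Overall F = mean square of the whole model / residual mean square
f_overall <- ((ss_age + ss_party) / df_model) / ms_resid

round(r2, 4)
round(f_overall, 1)
```

These come out at roughly 0.039 and 26.5, in line with the Multiple R-squared of 0.03929 and the F-statistic of 26.49 reported by `summary.lm()`.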
# Box plot with multiple groups
# +++++++++++++++++++++
# Plot the science item ("better_worse") by groups ("age3g")
# Color box plot by a second group: "party"
ggboxplot(values, x = "age3g", y = "better_worse", color = "party",
palette = c("#00AFBB", "#E7B800", "#AA4371"))
• Conclusion
• Why is this analysis important?
• Limitations of the analysis?
The analysis was undertaken to illustrate the use of ANOVA when the predictors are categorical. Here, a two-way ANOVA (main effects only) is illustrated with two categorical predictors. The box plots show how the dependent variable varies by Age3g and Party, and the ANOVA, TukeyHSD, and regression results confirm these patterns.
In addition, the analysis showed that the ANOVA model and the regression model are equivalent, by extracting the linear model output from the ANOVA fit. The ANOVA model shows, through the F-test, how the model can be tested against the null hypothesis. The TukeyHSD analysis shows where differences among the groups within each categorical variable are significant, thus supporting the alternative hypotheses. The regression model demonstrates the use of dummy variables to capture the group mean comparisons.
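The equivalence between the ANOVA fit and the dummy-variable regression can be checked directly. The sketch below uses simulated data (the data frame `sim`, its factor names, and its effect size are invented for illustration) and compares the coefficients and fitted values from `aov()` and `lm()` for the same formula.

```r
set.seed(1)

# Simulated stand-in data (NOT the survey data) with two 3-level factors
n <- 200
sim <- data.frame(
  g1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  g2 = factor(sample(c("x", "y", "z"), n, replace = TRUE))
)
sim$y <- 5 + (sim$g1 == "c") + rnorm(n)

fit_aov <- aov(y ~ g1 + g2, data = sim)
fit_lm  <- lm(y ~ g1 + g2, data = sim)

# Same dummy-variable coefficients and same fitted values
same_coef   <- all.equal(coef(fit_aov), coef(fit_lm))
same_fitted <- all.equal(fitted(fit_aov), fitted(fit_lm))
same_coef
same_fitted

# anova() on the lm fit gives the same sequential F-tests as summary(fit_aov)
anova(fit_lm)
```

Each coefficient is the difference between a group's mean and that of the reference level (the first level of each factor), holding the other factor fixed. This is why `age3g3` and `party2` in the regression output line up with the largest Tukey contrasts against group 1.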
The primary limitations of the analysis arise from the fact that the data are observational rather than experimental. Hence the relationships and associations observed among the variables are descriptive and should not be interpreted as causal.
Data Source: (https://cps.isr.umich.edu/project/world-values-survey-wvs/) See below for the citation/link.
Inglehart, R., C. Haerpfer, A. Moreno, C. Welzel, K. Kizilova, J. Diez-Medrano, M. Lagos, P. Norris, E. Ponarin & B. Puranen et al. (eds.). 2014. World Values Survey: Round One - Country-Pooled Datafile Version: www.worldvaluessurvey.org/WVSDocumentationWV1.jsp. Madrid: JD Systems Institute.