Research Question

Does the attitude toward the impact of science and technology on our lives differ by age group and by the party a respondent is likely to vote for in the national election?

Hypothesis Declarations

Ho(Age): The average score on the Better Off or Worse Off item is equal across the 3 age groups.

Ho(Party): The average score on the Better Off or Worse Off item is equal across the 3 Party groups.

Ha(Age): The average score on the Better Off or Worse Off item differs for at least one of the 3 age groups.

Ha(Party): The average score on the Better Off or Worse Off item differs for at least one of the 3 Party groups.
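Written symbolically, with $\mu$ denoting a group's mean score on the item, the hypotheses can be sketched as follows. The subscript labels are notation introduced here for illustration and do not appear in the original write-up.

```latex
\begin{align*}
H_0^{\text{Age}}   &: \mu_{\le 29} = \mu_{30\text{-}49} = \mu_{50+} \\
H_a^{\text{Age}}   &: \text{at least one age-group mean differs} \\
H_0^{\text{Party}} &: \mu_{\text{Rep}} = \mu_{\text{Dem}} = \mu_{\text{Ind}} \\
H_a^{\text{Party}} &: \text{at least one party-group mean differs}
\end{align*}
```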

Cases analysis

• What are the cases, and how many are there? The cases are individual survey respondents, and there are 2596 in this dataset. The original dataset includes over 290 questions; the current dataset is a subset of 10 variables from it (see the dim() output below).

Description of Data collection

• Method of data collection: survey data collected through a questionnaire.

• Type of study: observational.

Variables in use

• Dependent variable: whether life is Worse Off or Better Off due to Science and Technology.

• Independent variables: Age, measured as a 3-group variable (Age3g: “Up to 29”, “30-49”, and “50 and Over”), and Party, recoded from the “partyvotefor” variable (the original variable had 7 groups, which were recoded into 3 groups for the “Party” variable).

• Response: the response variable is numerical (continuous/scale). The item “Life is ______ due to Science and Technology” is measured on a scale of 1 (Worse Off) to 10 (Better Off).

• Explanatory: the explanatory variables are Age and Party; both are categorical.

• Relevant summary statistics

# There are critical assumptions that should be verified
# When both the DV and the IV are continuous, we examine a correlational hypothesis
# When the IV is a categorical variable, we use a comparative hypothesis
# If the categorical IV has two groups, use a t-test
# If the categorical IV has three or more groups, use ANOVA

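# Packages used by the calls below (assumed; the chunk that loaded them is not shown)
library(sjlabelled)  # set_labels()
library(sjmisc)      # frq()
library(broom)       # tidy()
library(ggpubr)      # ggboxplot()

values <- read.csv("https://raw.githubusercontent.com/johnm1990/DATA606/master/us_values.csv", sep=",", header=TRUE)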

dim(values)

## [1] 2596   10

head(values)

##   interviewID socialclass age3g opportunities morescience_lowfaith right_wrong
## 1   840071001           5     2            -2                   -2          -2
## 2   840071002           3     2            -2                   -2          -2
## 3   840071003           5     2             1                    1          10
## 4   840071004           3     2             1                   10           1
## 5   840071005           3     1             4                    5           5
## 6   840071006           5     1             1                    1           1
##   science_not_important better_of_.worse_of partyvotefor sex
## 1                    -2                  -2       840002   2
## 2                    -2                  -2       840005   2
## 3                    10                  -1           -1   2
## 4                     5                   1       840005   2
## 5                     5                   1       840001   2
## 6                     1                   1       840005   1

str(values)

## 'data.frame':    2596 obs. of  10 variables:
##  $ interviewID          : int  840071001 840071002 840071003 840071004 840071005 840071006 840071007 840071008 840071009 840071010 ...
##  $ socialclass          : int  5 3 5 3 3 5 3 3 4 5 ...
##  $ age3g                : int  2 2 2 2 1 1 2 3 1 3 ...
##  $ opportunities        : int  -2 -2 1 1 4 1 1 6 1 10 ...
##  $ morescience_lowfaith : int  -2 -2 1 10 5 1 1 4 10 10 ...
##  $ right_wrong          : int  -2 -2 10 1 5 1 1 6 10 10 ...
##  $ science_not_important: int  -2 -2 10 5 5 1 10 1 5 10 ...
##  $ better_of_.worse_of  : int  -2 -2 -1 1 1 1 1 1 1 1 ...
##  $ partyvotefor         : int  840002 840005 -1 840005 840001 840005 840006 840002 840004 840002 ...
##  $ sex                  : int  2 2 2 2 2 1 1 2 2 1 ...

colnames(values)

##  [1] "interviewID"           "socialclass"           "age3g"                
##  [4] "opportunities"         "morescience_lowfaith"  "right_wrong"          
##  [7] "science_not_important" "better_of_.worse_of"   "partyvotefor"         
## [10] "sex"

hist(values$better_of_.worse_of, main = "Better Off Worse Off")

values$better_worse[values$better_of_.worse_of < 0] <- 1
values$better_worse[values$better_of_.worse_of == 1] <- 1
values$better_worse[values$better_of_.worse_of == 2] <- 2
values$better_worse[values$better_of_.worse_of == 3] <- 3
values$better_worse[values$better_of_.worse_of == 4] <- 4
values$better_worse[values$better_of_.worse_of == 5] <- 5
values$better_worse[values$better_of_.worse_of == 6] <- 6
values$better_worse[values$better_of_.worse_of == 7] <- 7
values$better_worse[values$better_of_.worse_of == 8] <- 8
values$better_worse[values$better_of_.worse_of == 9] <- 9
values$better_worse[values$better_of_.worse_of == 10] <- 10
# Histogram of the recoded variable
hist(values$better_worse, main = "Better Off Worse Off")

# 840001 - Republican   840002 - Democratic   840004 - Libertarian
# 840005 - Other   840006 - Green   "-1" & "-2" are No Answer and Don't Know
# Recode "-1" and "-2", along with 840004 / 840005 / 840006, into a single
# third group labelled Independent

values$party[values$partyvotefor < 0] <- 3
values$party[values$partyvotefor == 840001] <- 1
values$party[values$partyvotefor == 840002] <- 2
values$party[values$partyvotefor == 840004] <- 3
values$party[values$partyvotefor == 840005] <- 3
values$party[values$partyvotefor == 840006] <- 3

values$party <- factor(values$party)
values$party <- set_labels(values$party, labels=c("Republican", "Democratic", "Independent"))
frq(values$party)

## 
## x <categorical>
## # total N=2596  valid N=2596  mean=1.94  sd=0.75
## 
## Value |       Label |    N | Raw % | Valid % | Cum. %
## -----------------------------------------------------
##     1 |  Republican |  819 | 31.55 |   31.55 |  31.55
##     2 |  Democratic | 1126 | 43.37 |   43.37 |  74.92
##     3 | Independent |  651 | 25.08 |   25.08 | 100.00
##  <NA> |        <NA> |    0 |  0.00 |    <NA> |   <NA>
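The same collapse of party codes can be expressed as a named lookup, which keeps the code-to-group mapping in one place. The following is a minimal sketch on a small made-up vector (not the survey data); the names `partyvotefor_toy`, `lookup`, and `party_toy` are hypothetical.

```r
# Toy vector standing in for values$partyvotefor (made-up data for illustration)
partyvotefor_toy <- c(840001, 840002, -1, 840005, 840004, 840006, -2)

# Named lookup: original code -> collapsed group
# 1 = Republican, 2 = Democratic, 3 = Independent
lookup <- c("840001" = 1, "840002" = 2, "840004" = 3, "840005" = 3, "840006" = 3)

# Codes not present in the lookup (here the negative codes) come back as NA
party_toy <- unname(lookup[as.character(partyvotefor_toy)])

# Fold "No Answer" / "Don't Know" (negative codes) into the Independent group
party_toy[partyvotefor_toy < 0] <- 3

party_toy <- factor(party_toy, levels = 1:3,
                    labels = c("Republican", "Democratic", "Independent"))
table(party_toy)
```

Declaring the labels inside `factor()` is a base-R alternative to `set_labels()`; either approach should yield the same three groups.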

values$age3g <- factor(values$age3g)
values$age3g <- set_labels(values$age3g, labels=c("Upto 29", "30-49", "50 & Over"))

## Frequency tables of Age3g and Party:
frq(values$age3g)

## 
## x <categorical>
## # total N=2596  valid N=2596  mean=2.10  sd=0.76
## 
## Value |     Label |    N | Raw % | Valid % | Cum. %
## ---------------------------------------------------
##     1 |   Upto 29 |  632 | 24.35 |   24.35 |  24.35
##     2 |     30-49 | 1072 | 41.29 |   41.29 |  65.64
##     3 | 50 & Over |  892 | 34.36 |   34.36 | 100.00
##  <NA> |      <NA> |    0 |  0.00 |    <NA> |   <NA>

frq(values$party)

## 
## x <categorical>
## # total N=2596  valid N=2596  mean=1.94  sd=0.75
## 
## Value |       Label |    N | Raw % | Valid % | Cum. %
## -----------------------------------------------------
##     1 |  Republican |  819 | 31.55 |   31.55 |  31.55
##     2 |  Democratic | 1126 | 43.37 |   43.37 |  74.92
##     3 | Independent |  651 | 25.08 |   25.08 | 100.00
##  <NA> |        <NA> |    0 |  0.00 |    <NA> |   <NA>

summary(values$better_worse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   8.000   7.437  10.000  10.000

ANOVA

The ANOVA model rejects both Ho(Age) and Ho(Party) in favor of Ha(Age) and Ha(Party): the F-tests for age3g and party are both significant (p < 0.001).

anovamod1 <-aov(better_worse ~ age3g+party, data = values)
summary(anovamod1)

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## age3g          2    187   93.43   16.98 4.70e-08 ***
## party          2    396  198.05   36.00 3.79e-16 ***
## Residuals   2591  14254    5.50                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
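As a check on the table above, each F value is the factor's mean square divided by the residual mean square. The worked version below uses the rounded figures printed in the output, so the last digit can differ slightly from R's unrounded result.

```latex
\begin{align*}
F_{\text{age3g}} &= \frac{MS_{\text{age3g}}}{MS_{\text{Residuals}}} = \frac{93.43}{5.50} \approx 17.0 \\
F_{\text{party}} &= \frac{MS_{\text{party}}}{MS_{\text{Residuals}}} = \frac{198.05}{5.50} \approx 36.0
\end{align*}
```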

tidy(anovamod1)

## # A tibble: 3 x 6
##   term         df  sumsq meansq statistic   p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>     <dbl>
## 1 age3g         2   187.  93.4       17.0  4.70e- 8
## 2 party         2   396. 198.        36.0  3.79e-16
## 3 Residuals  2591 14254.   5.50      NA   NA

##TukeyHSD shows which pairwise group mean differences for Age and Party are significant vs. not significant

tidy(TukeyHSD(anovamod1)) # Tukey means comparison among groups for Age3g and Party

## # A tibble: 6 x 7
##   term  contrast null.value estimate conf.low conf.high adj.p.value
##   <chr> <chr>         <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
## 1 age3g 2-1               0    0.186  -0.0897     0.462 0.253      
## 2 age3g 3-1               0    0.661   0.375      0.947 0.000000196
## 3 age3g 3-2               0    0.475   0.225      0.724 0.0000248  
## 4 party 2-1               0    0.700   0.448      0.953 0          
## 5 party 3-1               0   -0.173  -0.462      0.116 0.338      
## 6 party 3-2               0   -0.873  -1.14      -0.603 0

# ANOVA model as a regression model with dummy variables

## • For regression models, include the regression output and interpret the R-squared value. 


lmfit <- summary.lm(anovamod1)
lmfit

## 
## Call:
## aov(formula = better_worse ~ age3g + party, data = values)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2489 -1.6936  0.3703  1.9513  3.3064 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.8625     0.1217  56.396  < 2e-16 ***
## age3g2        0.1862     0.1179   1.580    0.114    
## age3g3        0.6831     0.1236   5.525 3.62e-08 ***
## party2        0.7033     0.1086   6.475 1.13e-10 ***
## party3       -0.1689     0.1248  -1.353    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.345 on 2591 degrees of freedom
## Multiple R-squared:  0.03929,    Adjusted R-squared:  0.03781 
## F-statistic: 26.49 on 4 and 2591 DF,  p-value: < 2.2e-16

The model R-squared is 3.9%, which means that the age3g and party variables explain only 3.9% of the variation in the Better_Off_Worse_Off opinion item about Science and Technology. About 96% of the variation in this dependent variable (= 1 – 3.9%) remains unexplained. The ANOVA model, TukeyHSD, and the regression output show where the significant differences arise within each of the Age3g and Party variables:

1. Age3g: The 50 and Over group scores significantly higher than both the Upto 29 and 30-49 age groups.
2. Party: Those who indicated that they would vote for the Democratic party in the National election score significantly higher than both Republicans and Independents (these two groups are not significantly different from each other).

Party Value 1 – Republican; Party Value 2 – Democratic; Party Value 3 – Independent
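The R-squared value can be recovered from the sums of squares printed in the ANOVA table, which shows where the 3.9% comes from. The figures below are the rounded values from that table.

```latex
R^2 = \frac{SS_{\text{age3g}} + SS_{\text{party}}}{SS_{\text{age3g}} + SS_{\text{party}} + SS_{\text{Residuals}}}
    = \frac{187 + 396}{187 + 396 + 14254}
    = \frac{583}{14837} \approx 0.0393
```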

Data Visualizations

# Box plot with multiple groups
# +++++++++++++++++++++
# Plot the science attitude score ("better_worse") by groups ("age3g")
# Color box plot by a second group: "party"


ggboxplot(values, x = "age3g", y = "better_worse", color = "party",
          palette = c("#00AFBB", "#E7B800", "#AA4371"))

Conclusion

• Conclusion • Why is this analysis important? • Limitations of the analysis?

The analysis was undertaken to illustrate the use of the ANOVA technique when the predictors are categorical. Here a two-way ANOVA (main effects only) is illustrated with two categorical predictors. The box plots show how the dependent variable varies by the Age3g and Party variables, a pattern that the ANOVA, TukeyHSD, and regression results confirm.

In addition, the analysis showed that the ANOVA model and the regression model are equivalent, by extracting the linear model output from the ANOVA fit. The ANOVA model shows, through the F-test, how the model can be tested against the Null Hypothesis. The TukeyHSD analysis shows which differences among the groups within each categorical variable are significant, thus supporting the Alternate Hypothesis. The regression model demonstrates the use of dummy variables to capture the group mean comparisons.
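The equivalence between the ANOVA fit and the dummy-variable regression can be illustrated on any data with two categorical predictors. The sketch below uses R's built-in `warpbreaks` dataset as a stand-in (an assumption made so the example runs without the survey file); the object names are hypothetical.

```r
# Built-in dataset with two categorical predictors (wool, tension);
# used here only as a stand-in for the survey data
fit_aov <- aov(breaks ~ wool + tension, data = warpbreaks)
fit_lm  <- lm(breaks ~ wool + tension, data = warpbreaks)

# Both fits estimate the same dummy-variable coefficients
same_coefs <- all.equal(coef(fit_aov), coef(fit_lm))

# R-squared rebuilt from the ANOVA sums of squares matches lm's R-squared
ss <- summary(fit_aov)[[1]][["Sum Sq"]]     # factor rows first, Residuals last
r2_from_ss <- 1 - ss[length(ss)] / sum(ss)  # 1 - SS_resid / SS_total
same_r2 <- all.equal(r2_from_ss, summary(fit_lm)$r.squared)

print(same_coefs)
print(same_r2)
```

This mirrors what `summary.lm(anovamod1)` does in the analysis: the `aov` object is a linear model underneath, so the regression summary is a different view of the same fit.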

The primary limitation of the analysis is that the data are observational rather than experimental. Hence the relationships and associations observed among the variables are descriptive and should not be interpreted as causal.

References

Data Source: (https://cps.isr.umich.edu/project/world-values-survey-wvs/) See below for the citation/link.

Inglehart, R., C. Haerpfer, A. Moreno, C. Welzel, K. Kizilova, J. Diez-Medrano, M. Lagos, P. Norris, E. Ponarin & B. Puranen et al. (eds.). 2014. World Values Survey: Round One - Country-Pooled Datafile Version: www.worldvaluessurvey.org/WVSDocumentationWV1.jsp. Madrid: JD Systems Institute.

Final Project - DATA606

John Mazon

12/10/2020

Introduction for Final Project -

Research Question

Hypothesis Declarations

Cases analysis

Description of Data collection

Variables in use

ANOVA

Data Visualizations

Conclusion

References
