Titorchuk Oleksandr

Introduction:

Research question: Is there a relationship between a person’s preferred source of information about US political life (namely, the Internet) and that person’s political views (namely, the party the person is registered with)?

I believe that three main factors have changed the global media landscape in the 21st century:
1. the emergence of social networks (Facebook, Twitter, Instagram, Tumblr, Flickr etc.) and instant messaging services;
2. the arrival on the market of smartphones with photo and video cameras and internet access;
3. the growing coverage of internet access.

Having started simply as the quickest source of news for “older” media (recall Twitter), the internet soon became a medium in its own right. This shows, on the one hand, in the emergence of fully online news companies (like Buzzfeed or The Huffington Post) and of internet versions of traditional newspapers and TV channels (The New York Times, Washington Post, The Guardian; CNN, BBC, Fox, Aljazeera etc.) and, on the other hand, in the decline and bankruptcy of traditional media outlets [1]. This is why I find it extremely interesting to evaluate the influence that the internet has on the social life of the population, in this case on its political views. Such research is also of obvious practical interest to political parties, which can use it to adjust their campaigns towards the most effective means of agitation (for different segments of the population, divided by age, social status, income etc.).

Data:

My Data Analysis project examines the relationship between two categorical variables, each of which records the answer to a question that the American National Election Studies 2012 put to eligible US voters (the cases in my dataset):

  1. prmedia_atinews:
    Categorical (ordinal)
    How much attention do you pay to news about national politics on the Internet
    Levels: 1-A great deal 2-A lot 3-A moderate amount 4-A little 5-None at all

  2. prevote_regpty
    Categorical
    What political party are you registered with, if any?
    Levels: 1-Democratic party 2-Republican party 3-None or ‘independent’ 4-Other {SPECIFY}

The original dataset for my project was collected within the American National Election Studies 2012, which includes information on electoral participation, voting behavior, public opinion, media exposure, cognitive style, and values and predispositions (205 variables in total) of eligible US voters. Pre-election interviews were conducted with study respondents during the two months prior to the 2012 elections and were followed by post-election reinterviewing beginning November 7, 2012. For the first time in Time Series history, face-to-face interviewing (2,054 respondents) was supplemented with data collection on the Internet (3,860 respondents).
Data collection was conducted in the two modes independently, using separate samples. While face-to-face (FTF) respondents were administered a single pre-election interview and a single post-election interview, the internet sample answered the same questions over a total of 4 shorter online interviews, 2 pre-election and 2 post-election. For more information, refer to the User’s Guide and Codebook for the ANES 2012 Time Series Study [2].
According to the User’s Guide and Codebook for the ANES 2012 Time Series Study [2], ANES 2012 is «a dual-mode survey (face-to-face and Internet) with two independent samples. … [Internet] Panelists are recruited using two probability sampling methods: address-based sampling (ABS) and random-digit dialing (RDD). … The in-person (face-to-face) interviews were conducted using an address-based, stratified, multi-stage cluster sample in 125 census tracts. The sample includes a nationally-representative “main sample” and two “oversamples,” one of blacks and one of Hispanics».
This extract makes clear that ANES 2012 is an observational study: the researchers did not run any experiment with random assignment, they simply collected information from randomly selected US citizens. My research, which is based on its data, is therefore observational too. Because the respondents were randomly sampled, the results can be generalized to the whole population of interest, namely US citizens aged 18 or older. It is worth mentioning, however, that both the original study and my research are exposed to some bias:

1. Bias concerning original study
Participation in the study is not compulsory, so people with a proactive civic stance are more likely to agree to answer the questions. This proactive group may hold specific opinions on the topics we are interested in that differ from those of the population as a whole. Respondents may also refuse to answer some of the questions, which results in NA values; such cases cannot be used in my analysis and are excluded from the dataset. This may bias the data if giving a certain answer to one question is associated with giving no answer to the other, since the people who behave this way are excluded from my dataset.

2. Bias concerning my analysis
The problem I might face in my research is a possible confounder between the two variables I am interested in, namely the age of the respondents: Internet users are likely to be comparatively younger, so it may be age that determines their political views. Before proceeding to the data analysis it would therefore be useful to stratify the dataset by respondents’ age and randomly choose from each stratum a sample proportional to the age structure of US citizens. However, the provided dataset (to my surprise) does not seem to include such information, so this is impossible.
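The stratification idea described above could look roughly like the following sketch in base R. Everything in it is hypothetical: the data frame `respondents`, its `age_group` column and the population shares in `target_share` are made up for illustration and do not come from the ANES data.

```r
# Hypothetical respondents with an age group (NOT taken from the ANES extract)
set.seed(1)
respondents <- data.frame(
  id = 1:1000,
  age_group = sample(c("18-29", "30-44", "45-64", "65+"), 1000,
                     replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
)

# Assumed age structure of the population (illustrative values only)
target_share <- c("18-29" = 0.21, "30-44" = 0.25, "45-64" = 0.34, "65+" = 0.20)

# Number of cases to draw from each stratum for a sample of 200
n_total <- 200
n_per_stratum <- round(n_total * target_share)

# Draw a simple random sample of the required size within each stratum
stratified <- do.call(rbind, lapply(names(n_per_stratum), function(g) {
  stratum <- respondents[respondents$age_group == g, ]
  stratum[sample(nrow(stratum), n_per_stratum[[g]]), ]
}))

# The age structure of the stratified sample now follows target_share
table(stratified$age_group)
```

The resulting sample over-represents no age group relative to the assumed shares, so a comparison of the two variables within it would not be driven by the age composition of the original respondents.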

Although the results of my research can be used to draw conclusions about the existence of an association between the variables in question, they cannot serve as a basis for establishing causal links between them, since this is not an experimental study and no random assignment was used.

# load data into R (the name of the dataset is "anes")
load(url("http://bit.ly/dasi_anes_data"))
# keep only the needed columns; Case ID is kept so that respondents can be identified,
# although the working dataset will include only the last two columns
# (cbind() stores the two factors as their integer level codes)
data <- cbind(anes$caseid, anes$prmedia_atinews, anes$prevote_regpty)
# delete all entries that contain at least one NA
data.clear <- na.omit(data)
# give proper names to the dataset columns
colnames(data.clear) <- c("Case ID", "Internet", "Party")
# coerce the "data.clear" dataset from matrix to data.frame
data.clear <- as.data.frame(data.clear)
# replace the integer codes of "Internet" and "Party" with descriptive labels
data.clear$Internet <- factor(data.clear$Internet, levels = 1:5,
                              labels = c("1 A great deal", "2 A lot", "3 A moderate amount", "4 A little", "5 None at all"))
data.clear$Party <- factor(data.clear$Party, levels = 1:4,
                           labels = c("1 Democratic party", "2 Republican party", "3 None or independent", "4 Other"))
# create the final dataset used in the analysis
data.final <- data.clear[, c(2, 3)]

Exploratory data analysis:

The first stage of our research is exploratory data analysis, which allows us to take a quick look at the data we are working with. It includes summary statistics (as counts and as proportions) and visualizations.
There are 2 categorical variables of interest in my dataset. For comparing two categorical variables, the most appropriate forms of exploratory data analysis are contingency tables and barplots/mosaic plots.

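# create summary statistics for the dataset
table1 <- table(data.final$Party, data.final$Internet)
table2 <- addmargins(table1)
# column proportions: party shares within each level of Internet attention
# (each column sums to 1, so the "Sum" column of p.table2 is not meaningful)
p.table1 <- prop.table(table1, 2)
p.table2 <- round(addmargins(p.table1), 2)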
table2
##                        
##                         1 A great deal 2 A lot 3 A moderate amount
##   1 Democratic party               136     183                 358
##   2 Republican party               108     137                 202
##   3 None or independent             20      28                  41
##   4 Other                           36      67                 124
##   Sum                              300     415                 725
##                        
##                         4 A little 5 None at all  Sum
##   1 Democratic party           324            94 1095
##   2 Republican party           150            42  639
##   3 None or independent         44             9  142
##   4 Other                      115            40  382
##   Sum                          633           185 2258
p.table2
##                        
##                         1 A great deal 2 A lot 3 A moderate amount
##   1 Democratic party              0.45    0.44                0.49
##   2 Republican party              0.36    0.33                0.28
##   3 None or independent           0.07    0.07                0.06
##   4 Other                         0.12    0.16                0.17
##   Sum                             1.00    1.00                1.00
##                        
##                         4 A little 5 None at all  Sum
##   1 Democratic party          0.51          0.51 2.41
##   2 Republican party          0.24          0.23 1.43
##   3 None or independent       0.07          0.05 0.31
##   4 Other                     0.18          0.22 0.85
##   Sum                         1.00          1.00 5.00
# create barplot and mosaic plot
# explanatory variable (Internet) on the x-axis, response variable (Party) as the stacked segments
# barplot: ylim = c(0, 1.6) leaves room for the legend so that it does not overlap the bars
barplot(p.table1, xlab = "Attention to political news on the Internet", ylab = "Proportion", main = "Connection between internet usage and political views", col = c("blue", "red", "limegreen", "yellow"), ylim = c(0, 1.6), legend.text = TRUE)

mosaicplot(t(p.table1), main = "Connection between internet usage and political views", color = c("blue", "red", "limegreen", "yellow"))

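The largest group of respondents (1095/2258=48%) reported being registered with the Democratic Party. Second place, as expected, went to the Republican Party (639/2258=28%), while the answers “None or independent” and “Other” together accounted for 23%. The contingency table of proportions suggests an association between the variables: the share of registered Republicans falls steadily from 36% among respondents who pay a great deal of attention to political news on the Internet to 23% among those who pay none at all, while the share of registered Democrats rises from 45% to 51%.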

Inference:

We need to evaluate whether there is an association between two categorical variables, each of which has more than two levels. In this setting no confidence interval is available, so only a hypothesis test can be used: we will apply the Chi-Square Independence test to confirm or reject the suggested hypothesis. This test quantifies how far the observed counts deviate from the counts expected under independence; large deviations provide strong evidence in favor of the alternative hypothesis. Since the sample size condition is met (see below), we will use the theoretical computation method.

Conditions for Chi-Square Independence Test:

  1. Independence (sampled observations must be independent) – respondents were randomly sampled, the 2,258 cases used are far fewer than 10% of the US voting-age population, and each respondent contributes to exactly one cell of the table, so independence can reasonably be assumed. (satisfied)

  2. Sample size (each particular scenario, i.e. cell, must have at least 5 expected cases) – this means that for each cell the expected count (row total x column total / total) must be at least 5; to check this condition we do not actually need to make the computation for each cell, it is enough to do it for the intersection of the smallest row total and the smallest column total; in our case it is (None or independent x None at all / Total) = (142 x 185 / 2258) = 11.63 > 5. (satisfied)
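
The sample size check can also be reproduced for every cell at once. The following is a minimal sketch in base R; the counts are copied from the contingency table in the exploratory section, and the object names `observed` and `expected` are introduced here only for illustration.

```r
# Observed counts copied from the contingency table (rows: party, columns: attention)
observed <- matrix(
  c(136, 183, 358, 324, 94,
    108, 137, 202, 150, 42,
     20,  28,  41,  44,  9,
     36,  67, 124, 115, 40),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Party = c("Democratic", "Republican", "None/independent", "Other"),
    Internet = c("A great deal", "A lot", "A moderate amount", "A little", "None at all")
  )
)

# Expected count of a cell under independence: row total x column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# The smallest expected count sits in the None/independent x None at all cell
min(expected)

# TRUE if every cell has at least 5 expected cases
all(expected >= 5)
```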

H0 (nothing is going on): the Internet and Party variables are independent – respondents’ party affiliation does not vary with how much they rely on the Internet as a source of political news (for example, respondents who use the Internet as their primary source of information about US political life do not tend to prefer one political doctrine over the others)

HA (something is going on): the Internet and Party variables are dependent – respondents’ party affiliation varies with how much they rely on the Internet as a source of political news

There are various methods of calculating chi^2: 1) By hand (using general functionality of Excel and R); 2) Using built-in Excel formulas; 3) Using R. In our study we will use the third option (namely, the custom ‘inference’ function).
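
For comparison, the first option (“by hand”) can be sketched in a few lines of base R. This is only an illustration, not the method used in the study: the counts are copied from the contingency table in the exploratory section, and the names `observed`, `expected`, `chi_sq`, `df` and `p_value` are introduced here for the example. The built-in `chisq.test()` is called at the end as a cross-check.

```r
# Observed counts copied from the contingency table (rows: party, columns: attention)
observed <- matrix(
  c(136, 183, 358, 324, 94,
    108, 137, 202, 150, 42,
     20,  28,  41,  44,  9,
     36,  67, 124, 115, 40),
  nrow = 4, byrow = TRUE
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (number of rows - 1) x (number of columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)

# Cross-check with the built-in test
builtin <- chisq.test(observed)
builtin
```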

# load the 'inference' function, which is used for the statistical analysis
source("http://bit.ly/dasi_inference")
# run the statistical analysis using the custom 'inference' function
inference(y = data.final$Party, x = data.final$Internet, est = "proportion", type = "ht", method = "theoretical", alternative = "greater", sum_stats = FALSE, eda_plot = FALSE, inf_plot = FALSE)
## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
## Check conditions: expected counts
##                        x
## y                       1 A great deal 2 A lot 3 A moderate amount
##   1 Democratic party            145.48  201.25              351.58
##   2 Republican party             84.90  117.44              205.17
##   3 None or independent          18.87   26.10               45.59
##   4 Other                        50.75   70.21              122.65
##                        x
## y                       4 A little 5 None at all
##   1 Democratic party        306.97         89.71
##   2 Republican party        179.14         52.35
##   3 None or independent      39.81         11.63
##   4 Other                   107.09         31.30
## 
##  Pearson's Chi-squared test
## 
## data:  y_table
## X-squared = 29.08, df = 12, p-value = 0.003834

The computed p-value (0.38%) is less than 1%, so we reject H0 and conclude that the data provide convincing evidence of an association between the variables of interest.

Conclusion:

The results of our analysis suggest that there is an association between the source of information about US political life and the political views of respondents, but this association cannot be interpreted as causal, since our study was not conducted in an experimental framework. There are at least three possible explanations of our findings:

  1. particular qualities of the way information is presented on the Internet (for example, access to a large variety of resources and hence the possibility of studying an issue comprehensively; or, on the contrary, the poor quality of Internet media articles and the lack of in-depth journalistic investigations and high-quality analytics) or the way in which US Internet media cover certain events cause the respondents to side with one of the political parties;
  2. the members of particular political forces generally prefer some media sources over others (for example, Republicans, who are commonly perceived to be conservative, are more likely to prefer respected newspapers or magazines to online media);
  3. there is no direct causal relationship between the variables – the real reason for such results is some confounder that determines both the level of a person’s activity on the Internet and his political views (for example, the age of the respondents).

It should be mentioned that the statistical analysis we conducted says nothing about the nature of the discovered relationship between the variables; it only indicates that one exists. On the other hand, the exploratory data analysis allows us to draw some cautious conclusions about its direction: the more attention respondents pay to political news on the Internet, the higher the share registered with the Republican Party and the lower the share registered with the Democratic Party. This is a rather surprising result: I had considered the Internet to be a mouthpiece of the Democratic Party, but this turns out not to be the case.
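
One way to look at the direction of the association cell by cell is through Pearson residuals, (observed - expected) / sqrt(expected). This is an additional illustration under the same assumptions as the test, not part of the original analysis; the counts are copied from the contingency table in the exploratory section, and the names `observed`, `expected` and `pearson_resid` are introduced here only for the example.

```r
# Observed counts copied from the contingency table (rows: party, columns: attention)
observed <- matrix(
  c(136, 183, 358, 324, 94,
    108, 137, 202, 150, 42,
     20,  28,  41,  44,  9,
     36,  67, 124, 115, 40),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Party = c("Democratic", "Republican", "None/independent", "Other"),
    Internet = c("A great deal", "A lot", "A moderate amount", "A little", "None at all")
  )
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals: positive values mean more cases than expected under independence,
# negative values mean fewer
pearson_resid <- (observed - expected) / sqrt(expected)
round(pearson_resid, 2)
```

Registered Republicans are over-represented among respondents who pay a great deal of or a lot of attention to Internet news and under-represented among those who pay little or none, which agrees with the direction read off the table of proportions.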

As regards possible future research, it would be interesting to conduct the same analysis for other types of media (TV, newspapers and magazines, radio etc.). Also, including respondents’ age in the dataset would enable researchers to control for one of the most obvious confounders in a study like ours. Eventually, investigators with enough resources could even undertake an experimental study to evaluate whether respondents’ political views change if they concentrate their attention on a particular type of media.

Data Citation and References:

Original data
The American National Election Studies (ANES; www.electionstudies.org). The ANES 2012 Time Series Study [dataset]. Stanford University and the University of Michigan [producers]

Data used for the project
Extract of ANES 2012 modified for Data Analysis and Statistical Inference course (Duke University)
Data description: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fanes1.html
Data: http://bit.ly/dasi_anes_data

References:
[1] Wikipedia article “Decline of newspapers”:
http://en.wikipedia.org/wiki/Decline_of_newspapers
[2] User’s Guide and Codebook for the ANES 2012 Time Series Study: http://electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012_userguidecodebook.pdf
[3] U.S. Census Bureau, U.S. and World Population Clock (date 06/15/2012):
http://www.census.gov/popclock/

Appendix

# create a range for printing as appendix 
data.print<-head(data.clear,50)
data.print
##    Case ID            Internet                 Party
## 1       11             2 A lot    2 Republican party
## 2       13          4 A little    1 Democratic party
## 3       15 3 A moderate amount 3 None or independent
## 4       16             2 A lot    2 Republican party
## 5       17      1 A great deal    2 Republican party
## 6       30             2 A lot    1 Democratic party
## 7       31      1 A great deal 3 None or independent
## 8       36 3 A moderate amount    1 Democratic party
## 9       37          4 A little 3 None or independent
## 10      39          4 A little    1 Democratic party
## 11      41      1 A great deal    2 Republican party
## 12      42 3 A moderate amount    1 Democratic party
## 13      43 3 A moderate amount    1 Democratic party
## 14      44 3 A moderate amount    2 Republican party
## 15      45 3 A moderate amount    2 Republican party
## 16      46 3 A moderate amount    2 Republican party
## 17      47          4 A little    2 Republican party
## 18      48       5 None at all    2 Republican party
## 19      50      1 A great deal    2 Republican party
## 20      51 3 A moderate amount    1 Democratic party
## 21      53 3 A moderate amount    2 Republican party
## 22      55          4 A little 3 None or independent
## 23      59          4 A little    2 Republican party
## 24      61      1 A great deal    2 Republican party
## 25      63      1 A great deal    1 Democratic party
## 26      67          4 A little    1 Democratic party
## 27      72 3 A moderate amount    1 Democratic party
## 28      73             2 A lot 3 None or independent
## 29      79       5 None at all    2 Republican party
## 30      80          4 A little 3 None or independent
## 31      81 3 A moderate amount    1 Democratic party
## 32      83      1 A great deal    1 Democratic party
## 33      86             2 A lot    1 Democratic party
## 34      87       5 None at all    1 Democratic party
## 35      89          4 A little 3 None or independent
## 36      91      1 A great deal    2 Republican party
## 37      92      1 A great deal    2 Republican party
## 38      97 3 A moderate amount    2 Republican party
## 39      99       5 None at all 3 None or independent
## 40     100          4 A little    2 Republican party
## 41     102 3 A moderate amount    2 Republican party
## 42     104          4 A little    2 Republican party
## 43     105 3 A moderate amount    2 Republican party
## 44     106          4 A little    1 Democratic party
## 45     107 3 A moderate amount    1 Democratic party
## 46     108             2 A lot    1 Democratic party
## 47     109             2 A lot    1 Democratic party
## 48     110             2 A lot    1 Democratic party
## 49     111 3 A moderate amount    1 Democratic party
## 50     113          4 A little    1 Democratic party