Introduction to Statistics for Social Sciences_Peer Assignment
- Austin City Limits dataset

Known as the “Live Music Capital of the World,” Austin, Texas is also home to the longest-running music series in American television history, Austin City Limits. This dataset includes data on a sample of musicians who performed live on the PBS television series Austin City Limits over the last 10 years. Data on each artist include measures of commercial popularity, such as the number of social media followers on Twitter or Facebook, and their success in winning a Grammy Music Award.

Primary Research Questions

Are there an equal number of male and female performers on Austin City Limits?
Are male performers just as likely to have had a Top 10 hit as female performers?

Reflect on the Method

We will use a Chi Square Goodness of Fit test to check whether there were an equal number of male and female performers. Why?

We want to see if the distribution of a categorical variable matches a proposed distribution model.

Explanation: The chi-square goodness of fit test is used to compare the distribution of a single categorical variable to an expected distribution for that variable.

We will use a Chi Square Test of Independence to determine if male and female performers were equally likely to have had a Top 10 hit. Why?

We want to determine if there is an association between two categorical variables.

Explanation: The chi-square test of independence allows us to determine whether 2 categorical variables are independent of one another.

About the Chi-Square Test

The \(\chi^2\) (chi-squared) distribution allows for statistical tests of categorical data. Among these tests are:

Goodness of Fit - So called because we test whether or not the proposed or expected distribution is a good fit for the observed data.
Independence - We want to determine if there is an association between two categorical variables.

The chi-square distribution is sometimes used to characterize data sets and statistics that are always positive and typically right skewed. Recall that the normal distribution has two parameters, mean and standard deviation, that describe its exact characteristics. The chi-square distribution has just one parameter, called degrees of freedom (df), which influences the shape, center, and spread of the distribution.

Chi-square distribution with 6 degrees of freedom. The p-value is shaded.
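
The shape described above can be checked numerically in base R. The following is a minimal sketch, assuming df = 6 as in the figure caption; the cut-off of 10 is an arbitrary illustrative value, not one taken from the assignment.

```r
# Chi-square distribution with 6 degrees of freedom (value taken from the figure caption)
dof <- 6

# Mean, obtained by numerically integrating x * density; in theory the mean equals df
chi_mean <- integrate(function(x) x * dchisq(x, df = dof), lower = 0, upper = Inf)$value

# Mode (peak of the density); in theory the mode equals df - 2 when df >= 2
chi_mode <- optimize(dchisq, interval = c(0, 20), df = dof, maximum = TRUE)$maximum

# The mean sits to the right of the mode, which is what right skew looks like
c(mean = chi_mean, mode = chi_mode)

# Upper-tail area beyond an illustrative cut-off of 10 (assumed value);
# this is the kind of shaded region a p-value corresponds to
upper_tail <- pchisq(10, df = dof, lower.tail = FALSE)
upper_tail
```

Changing `dof` shows how the single parameter moves the center and stretches the spread of the curve.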

1. Goodness-of-Fit Test

A goodness of fit test is concerned with the distribution of one categorical variable, i.e. it estimates how closely an observed distribution matches an expected distribution. The null and alternative hypotheses reflect this focus:

\(H_o\) : The population distribution of the variable is the same as the proposed distribution

\(H_a\) : The distributions are different

The Greek letter “chi”, written as \(\chi\), is the symbol used to identify a chi-square statistic, which we will use here to evaluate how well a set of observed categorical data fits a hypothesized distribution. The Chi-Square statistic is actually pretty straightforward to calculate:

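\[\chi^2 =\sum_{i=1}^{c}\frac{(O_i - E_i)^2}{E_i}\]

where \(O_i\) and \(E_i\) are the observed and expected counts in category \(i\), and \(c\) is the number of categories (for a two-way table, the sum runs over every cell).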

The chi-square test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. Does the number of individuals or objects that fall in each category differ significantly from the number you would expect? Is this difference between the expected and observed due to sampling variation, or is it a real difference?
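
As a small worked illustration, the statistic can be computed by hand in R. This sketch uses the gender counts that appear in the Q1 output of this document (35 female, 81 male) and assumes the null hypothesis of equal proportions; the variable names are illustrative only.

```r
# Observed counts of female and male performers (from the gender table in this document)
observed <- c(F = 35, M = 81)

# Expected counts under the null hypothesis of equal proportions
expected <- sum(observed) * c(0.5, 0.5)

# Chi-square statistic: sum over categories of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq
```

The result should agree with the `X-squared` value that `chisq.test()` reports for the same counts.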

Features of the Goodness-of-Fit Test

As mentioned, the goodness-of-fit test is used to determine whether the observed counts across the categories of a single categorical variable follow a proposed pattern. The test requires that the data are obtained through a random sample. The number of degrees of freedom associated with a goodness-of-fit test is equal to the number of categories minus one. That is, df = c - 1, where c is the number of categories.
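
The degrees of freedom determine which chi-square curve the statistic is compared against. The sketch below assumes two categories (as in the gender question) and the conventional 5% significance level; the statistic 18.24138 is the value reported in the Q1 output of this document.

```r
n_categories <- 2            # two gender categories, as in the first research question
dof <- n_categories - 1      # degrees of freedom = number of categories - 1

# Critical value at the conventional 5% significance level (assumed level)
critical_value <- qchisq(0.95, df = dof)

# Upper-tail p-value for the statistic reported in this document
p_value <- pchisq(18.24138, df = dof, lower.tail = FALSE)

c(critical = critical_value, p = p_value)
```

A statistic larger than the critical value corresponds to a p-value below 0.05.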

2. Chi-square Test of Independence

The chi-square test of independence is used to assess if two factors are related. This test is often used in social science research to determine if factors are independent of each other. For example, we would use this test to determine relationships between voting patterns and race, income and gender, and behavior and education.

In general, when running the test of independence, we ask, “Is Variable X independent of Variable Y?” It is important to note that this test does not tell us how the variables are related, only whether or not they are independent of one another. For example, while the test of independence can help us determine if income and gender are independent, it cannot help us assess how one variable might affect the other.

A chi-square \(\chi^2\) test can be used to determine if observed data indicates that two variables are dependent in much the same way that the test can be used to determine goodness of fit.

Just as with a goodness of fit test, we will calculate expected values, calculate a chi-square statistic, and compare it to the appropriate critical value of the chi-square distribution to see if we should reject \(H_o\), which states that the variables are not related. Formally, the hypothesis statements for the chi-square test of independence are:

\(H_o\) : There is no association between the two categorical variables

\(H_a\) : There is an association (the two variables are not independent)

In fact, the only major difference in process between a goodness of fit test and a test of independence is how we calculate the expected values. For a test of independence, the expected count for each cell is (row total × column total) / grand total.

The degrees of freedom in a test of independence are calculated as: df = (rows - 1)(cols - 1)

Contingency tables can help us frame our hypotheses and solve problems. Often, we use contingency tables to list the variables and observational patterns that will help us to run a chi-square test.
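
To make the expected-count calculation concrete, here is a minimal sketch in base R. It rebuilds the gender by Top 10 contingency table from the counts shown in the Q2 output of this document; the object names `tab`, `expected`, `dof`, and `chi_sq` are illustrative.

```r
# Two-way table of gender by Top 10 hit (counts taken from this document)
tab <- matrix(c(15, 38, 18, 32), nrow = 2,
              dimnames = list(Gender = c("F", "M"), Top10 = c("0", "1")))

# Expected count per cell = (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Degrees of freedom = (rows - 1)(cols - 1)
dof <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Chi-square statistic summed over every cell of the table
chi_sq <- sum((tab - expected)^2 / expected)

expected
c(statistic = chi_sq, df = dof)
```

These hand-computed values should match what `chisq.test(tab, correct = FALSE)` returns in its `$expected` and `$statistic` components.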

Assumptions of the Chi-Square test

The assumptions of the chi-square test are the same whether we are using the goodness-of-fit or the test-of-independence.

The standard assumptions are:

Random sample.
Independent observations for the sample (one observation per subject).
All expected counts greater than one.
No expected counts less than five.

Notice that the last two assumptions are concerned with the expected counts, not the raw observed counts.
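
The expected-count check can be wrapped in a small helper. This is only a sketch: `expected_counts_ok` is a hypothetical function name, and the threshold of 5 follows the assumption listed above.

```r
# Hypothetical helper: TRUE when every expected count is at least `min_count`
expected_counts_ok <- function(test_result, min_count = 5) {
  all(test_result$expected >= min_count)
}

# Goodness of fit test on the gender counts from this document, under a 50/50 null
gof <- chisq.test(c(F = 35, M = 81), p = c(0.5, 0.5))

# Inspect the expected counts, then apply the check
gof$expected
expected_counts_ok(gof)
```

The helper reads the `expected` component of the test result, so it works for both the goodness-of-fit test and the test of independence.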


Q1: Are there an equal number of male and female performers on Austin City Limits?

We will use a Chi Square Goodness of Fit test to check whether there were an equal number of male and female performers. We want to see if the distribution of a categorical variable matches a proposed distribution model.

Hypothesis:

\(H_o\) : The numbers of male and female performers on Austin City Limits are equal.

\(H_a\) : The numbers of male and female performers on Austin City Limits are different.

Goodness of Fit Test:

Create a table to show the counts of each gender.
Create a vector of expected proportions.
Check the expected counts assumption.
Run a chi square test.
Interpret the chi square statistic and p-value.

library(SDSFoundations)
acl <- AustinCityLimits
str(acl)

## 'data.frame':    116 obs. of  14 variables:
##  $ Artist       : Factor w/ 116 levels "Aimee Mann","Alabama Shakes",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Year         : int  2008 2013 2009 2009 2007 2009 2010 2009 2003 2008 ...
##  $ Month        : Factor w/ 6 levels "December","February",..: 4 2 3 5 4 4 3 4 3 5 ...
##  $ Season       : Factor w/ 2 levels "fall","winter": 1 2 2 1 1 1 2 1 2 1 ...
##  $ Gender       : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 2 2 2 1 ...
##  $ Age          : int  52 24 75 39 33 62 37 35 43 67 ...
##  $ Age.Group    : Factor w/ 4 levels "Fifties or Older",..: 1 4 1 3 3 1 3 3 2 1 ...
##  $ Grammy       : Factor w/ 2 levels "N","Y": 2 1 1 1 2 2 1 1 2 1 ...
##  $ Genre        : Factor w/ 4 levels "Country","Jazz/Blues",..: 4 3 2 3 3 1 3 3 3 2 ...
##  $ BB.wk.top10  : int  0 1 NA 1 1 0 1 NA 1 0 ...
##  $ Twitter      : int  101870 73313 308634 56343 404439 3326 125758 8197 158647 690 ...
##  $ Twitter.100k : int  1 0 1 0 1 0 1 0 1 0 ...
##  $ Facebook     : int  113576 298278 10721 318313 1711685 27321 563505 18955 1381051 1715 ...
##  $ Facebook.100k: int  1 1 0 1 1 0 1 0 1 0 ...

#Create a table of counts for Gender, the Variables of Interest
gender_tab <-table(acl$Gender)
gender_tab

## 
##  F  M 
## 35 81

#converting the table into a dataframe so that we can access the values which will help us in plotting
genderdf<-as.data.frame(gender_tab)
genderdf

##   Var1 Freq
## 1    F   35
## 2    M   81

genderdf[,2]

## [1] 35 81

#Create vector of expected proportions corresponding to our null hypothesis that number of males and females are equal
ExpGender<- c(.50, .50)

#Check expected counts assumption
chisq.test(gender_tab, p=ExpGender)$expected

##  F  M 
## 58 58

# The expected counts were 58 for each gender

#observed counts 
chisqrobs<-chisq.test(gender_tab, p=ExpGender)$observed
chisqrobs

## 
##  F  M 
## 35 81

#Run chi square goodness of fit test
chisqgof<-chisq.test(gender_tab, p=ExpGender)
chisqgof

## 
##  Chi-squared test for given probabilities
## 
## data:  gender_tab
## X-squared = 18.241, df = 1, p-value = 1.946e-05

#### Plotting expected and observed values
plot(c(1:2),genderdf[,2], xlab="cell index", ylab="counts", xlim=c(0,3),ylim=c(0,100))
points(chisqgof$expected, pch=3, col="red")
legend("topright",c("observed", "expected"), col=c(1,"red"), pch=c(1,3),bty="o") #to omit the box around the legend, pass bty="n"


#lets visualise this chisquare distribution
library(NCStats) #to download package run the code:#source("http://www.rforge.net/NCStats/InstallNCStats.R")

plot(chisqgof)

#or 

library(visualize)
#Evaluates upper tail.
visualize.chisq(stat = 18.241, df = 1, section = "upper") #chisquare statistic=18.241,df=1-----info got from chisqr test output

#the section arg takes "lower", "upper" or "bounded"; for "bounded" you must supply both bounds as stat = c(lower_bound, upper_bound), e.g. visualize.chisq(stat = c(1,2), df = 6, section = "bounded")


#Chi Square (rounded to 2 decimal places, with df=1)=
chisq.test(gender_tab, p=ExpGender)$statistic

## X-squared 
##  18.24138

#was the p-value less than 0.05?
chisq.test(gender_tab, p=ExpGender)$p.value

## [1] 1.946047e-05

Conclusion: In our sample, there were 81 males and 35 females. A chi square goodness of fit test showed that this difference was statistically significant (chi square = 18.24, df = 1, p < .05). Significantly more male than female artists performed on the show.

Q2: Are male performers just as likely to have had a Top 10 hit as female performers?

Hypothesis:

\(H_o\) : There is no association between gender and the likelihood of having had a Top 10 hit.

\(H_a\) : There is an association between gender and the likelihood of having had a Top 10 hit.

Test of Independence:

Create a two-way table for gender and Top 10 hits.
Check the expected counts assumption.
Run a chi square test.
Interpret the chi square statistic and p-value.

library(SDSFoundations)
acl <- AustinCityLimits
acl<-acl[!is.na(acl$BB.wk.top10),]
str(acl)

## 'data.frame':    103 obs. of  14 variables:
##  $ Artist       : Factor w/ 116 levels "Aimee Mann","Alabama Shakes",..: 1 2 4 5 6 7 9 10 11 12 ...
##  $ Year         : int  2008 2013 2009 2007 2009 2010 2003 2008 2007 2012 ...
##  $ Month        : Factor w/ 6 levels "December","February",..: 4 2 5 4 4 3 3 5 5 3 ...
##  $ Season       : Factor w/ 2 levels "fall","winter": 1 2 1 1 1 2 2 1 1 2 ...
##  $ Gender       : Factor w/ 2 levels "F","M": 1 1 2 1 2 2 2 1 1 2 ...
##  $ Age          : int  52 24 39 33 62 37 43 67 47 49 ...
##  $ Age.Group    : Factor w/ 4 levels "Fifties or Older",..: 1 4 3 3 1 3 2 1 2 2 ...
##  $ Grammy       : Factor w/ 2 levels "N","Y": 2 1 1 2 2 1 2 1 1 1 ...
##  $ Genre        : Factor w/ 4 levels "Country","Jazz/Blues",..: 4 3 3 3 1 3 3 2 4 2 ...
##  $ BB.wk.top10  : int  0 1 1 1 0 1 1 0 1 0 ...
##  $ Twitter      : int  101870 73313 56343 404439 3326 125758 158647 690 450096 88689 ...
##  $ Twitter.100k : int  1 0 0 1 0 1 1 0 1 0 ...
##  $ Facebook     : int  113576 298278 318313 1711685 27321 563505 1381051 1715 2754505 24866 ...
##  $ Facebook.100k: int  1 1 1 1 0 1 1 0 1 0 ...

acl$BB.wk.top10<-as.factor(acl$BB.wk.top10)

#Create two-way table for gender and Top 10 hits.
gender_top10 <-table(acl$Gender, acl$BB.wk.top10)
gender_top10

##    
##      0  1
##   F 15 18
##   M 38 32

#converting the table into a dataframe so that we can access the values which will help us in plotting
genderdf2<-as.data.frame(gender_top10)
genderdf2

##   Var1 Var2 Freq
## 1    F    0   15
## 2    M    0   38
## 3    F    1   18
## 4    M    1   32

genderdf2[,3]

## [1] 15 38 18 32

#Check expected counts assumption.
chisqrexp<-chisq.test(gender_top10, correct=FALSE)$expected
chisqrexp

##    
##            0        1
##   F 16.98058 16.01942
##   M 36.01942 33.98058

#observed counts 
chisqrobs<-chisq.test(gender_top10, correct=FALSE)$observed
chisqrobs

##    
##      0  1
##   F 15 18
##   M 38 32

#### Plotting expected and observed values
plot(c(1:4),genderdf2[,3], xlab="cell index", ylab="counts", xlim=c(0,5),ylim=c(0,50))
#as.vector() flattens the expected table column by column, matching the row order of genderdf2
points(as.vector(chisqrexp), pch=3, col="red")
legend("topright",c("observed", "expected"), col=c(1,"red"), pch=c(1,3),bty="o",cex=0.7)

#Run test of chi square independence
chisqrInd<-chisq.test(gender_top10, correct=FALSE)
chisqrInd

## Pearson's Chi-squared test with gender_top10 
## X-squared = 0.7002, df = 1, p-value = 0.4027

#lets visualise this chisquare distribution
library(NCStats) #to download package run the code:#source("http://www.rforge.net/NCStats/InstallNCStats.R")
plot(chisqrInd)

#or 

library(visualize)
#Evaluates upper tail.
visualize.chisq(stat = 0.7002284, df = 1, section = "upper") #chisquare statistic=0.70023,df=1-----info got from chisqr test output

#the section arg takes "lower", "upper" or "bounded"; for "bounded" you must supply both bounds as stat = c(lower_bound, upper_bound), e.g. visualize.chisq(stat = c(1,2), df = 6, section = "bounded")


addmargins(gender_top10)

##      
##         0   1 Sum
##   F    15  18  33
##   M    38  32  70
##   Sum  53  50 103

round(addmargins(prop.table(gender_top10,1),2),2) #Approximately 55% of the female artists had a Top 10 hit, and 46% of the male artists had a Top 10 hit.

##    
##        0    1  Sum
##   F 0.45 0.55 1.00
##   M 0.54 0.46 1.00

#Let's visualise the contingency table using bar plots and a mosaic plot
barplot(t(gender_top10), beside=TRUE,col = c("steelblue","lightgreen"), ylim=c(0,110),ylab="Observed frequencies in sample",main="Frequency of Billboard week Top 10 by Gender",las = 1)
#Add the legend to the plot
legend("topright", legend =levels(acl$BB.wk.top10),pch=15,col=c("steelblue","lightgreen"), cex=0.7)

#mosaic plot
mosaicplot(gender_top10)

#stacked Barplot of relative frequencies

relfreq<-round(prop.table(gender_top10,1),2)

#Add extra space to right of plot area; change clipping to figure
par(mar=c(5,4,4,9),xpd=TRUE) 

#Plot the barplot
barplot(t(relfreq), beside=FALSE,col = c("steelblue","lightgreen"),ylab="Relative Frequency",main="Relative frequency of Billboard week Top 10 by Gender",las = 1,bty="L")

# Add legend to top right, outside plot region, using the inset argument
legend("topright", legend=levels(acl$BB.wk.top10),pch=19,col=c("steelblue","lightgreen"), cex=0.7,bty = "o",inset=c(-0.05,0))

# Restore default margins and clipping
par(mar=c(5, 4, 4, 2) + 0.1, xpd=FALSE)


#Chi Square statistic
chisq.test(gender_top10, correct=FALSE)$statistic

## X-squared 
## 0.7002284

# What was the p-value?
chisq.test(gender_top10, correct=FALSE)$p.value

## [1] 0.402707

#We fail to reject the null hypothesis that gender and having a Top 10 hit are independent (p-value = 0.403 > 0.05).

Conclusion: Approximately 55% of the female artists had a Top 10 hit, and 46% of the male artists had a Top 10 hit. This difference was not statistically significant. A chi square test of independence found no significant association between gender and having had a Top 10 hit (chi square = 0.700, df = 1, p = 0.403). The assumptions for each test were met.
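
The calls above pass `correct=FALSE`. For 2 x 2 tables, `chisq.test()` applies Yates' continuity correction by default, which gives a somewhat smaller statistic and a larger p-value. The sketch below rebuilds the table from the counts shown in this document, so it runs without the SDSFoundations package, and compares the two settings; the object names `uncorrected` and `corrected` are illustrative.

```r
# Gender by Top 10 hit table, rebuilt from the counts shown in this document
gender_top10 <- matrix(c(15, 38, 18, 32), nrow = 2,
                       dimnames = list(Gender = c("F", "M"), Top10 = c("0", "1")))

# Pearson chi-square test without continuity correction (as used in this analysis)
uncorrected <- chisq.test(gender_top10, correct = FALSE)

# Same test with the default Yates continuity correction for 2 x 2 tables
corrected <- chisq.test(gender_top10)

# Compare statistics and p-values side by side
c(uncorrected = unname(uncorrected$statistic), corrected = unname(corrected$statistic))
c(uncorrected = uncorrected$p.value, corrected = corrected$p.value)
```

Both versions lead to the same decision here: the p-value stays well above 0.05, so the null hypothesis of independence is not rejected.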

##############################################################################

Peer_Assignment_Chisqrtest_july2015

Nishant Upadhyay

9 July 2015