Introduction:

My research question is whether there is any relationship between a person's race and working status.

Why do I care about my research question:

I care about this question because a person's working status reflects many aspects of their life, such as education, social status, financial status, and other indirect characteristics. We need to understand this relationship to find out whether there is a problem and whether anything can be done to fix it.

Data:

General Social Survey (GSS): A sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The codebook below lists all variables, the values they take, and the survey questions associated with them. There are a total of 57,061 cases and 114 variables in this dataset. Note that this is a cumulative data file for surveys conducted between 1972 - 2012 and that not all respondents answered all questions in all years.

Load the Data

load(url("http://bit.ly/dasi_gss_data"))

Load required libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lattice)

After loading the data set, I kept only the required variables and filtered the data to observations from the year 2000 onward. Since my analysis involves two categorical variables, I summarized the data as counts for each combination of levels. The two variables of interest are race and wrkstat. Because the GSS is an observational study, we can look for an association between these variables, but we cannot establish any causal link between them.

Exploratory data analysis:

Here I am plotting a mosaic plot to visualize the association between these two variables. A mosaic plot shows the conditional proportions directly: the area of each tile is proportional to the count in that cell.

gsssubset<-subset(gss,year>=2000)
dim(gsssubset)
## [1] 18945   114
gssAnalysis<-select(gsssubset,year,race,wrkstat)

gssAnalysisSummary <- gssAnalysis %>%
  group_by(race, wrkstat) %>%
  summarise(
    SummaryCounts = n()
  )


table(gssAnalysis$race,gssAnalysis$wrkstat)
##        
##         Working Fulltime Working Parttime Temp Not Working
##   White             7270             1579              318
##   Black             1358              262               61
##   Other              948              208               29
##        
##         Unempl, Laid Off Retired School Keeping House Other
##   White              489    2554    416          1493   378
##   Black              178     315    114           332   121
##   Other               84      79    100           202    46
plot(table(gssAnalysis$wrkstat,gssAnalysis$race))

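Looking at the mosaic plot, we can see that White respondents make up by far the largest group in the sample. From the table, the probability of working full time is about 50% for both White and Black respondents and about 56% for respondents of other races. Similarly, the probability of being unemployed or laid off is higher for Black respondents (about 6.5%) than for White respondents (about 3.4%) or respondents of other races (about 5.0%).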
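
The conditional probabilities read off the mosaic plot can also be computed directly. The sketch below is only an illustration: it re-enters the counts from the table above by hand (so it runs without downloading the data), and the names counts and row_props are my own, not part of the original analysis.

```r
# Observed counts copied from the race x wrkstat table above
counts <- matrix(
  c(7270, 1579, 318, 489, 2554, 416, 1493, 378,
    1358,  262,  61, 178,  315, 114,  332, 121,
     948,  208,  29,  84,   79, 100,  202,  46),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    race = c("White", "Black", "Other"),
    wrkstat = c("Working Fulltime", "Working Parttime", "Temp Not Working",
                "Unempl, Laid Off", "Retired", "School",
                "Keeping House", "Other")
  )
)

# Row proportions: P(working status | race); each row sums to 1
row_props <- prop.table(counts, margin = 1)

# Show as percentages rounded to one decimal place
round(100 * row_props, 1)
```

With the full data loaded, the same proportions should come from prop.table(table(gssAnalysis$race, gssAnalysis$wrkstat), margin = 1).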

Inference:

Null hypothesis H0 (there is nothing going on): There is no relationship between race and working status of a person; the two variables are independent.

Alternative hypothesis HA: There is a relationship between race and working status of a person; the two variables are dependent.

Check Conditions:

1. Independence: Sampled observations must be independent. The GSS is a random sample survey, so this condition is met.
2. Sample size: If sampling without replacement, n < 10% of the population. The 18,934 cases in the table (out of 57,061 in the full data set) are certainly less than 10% of the American population, so this condition is also met.
3. Expected counts: Each cell must have an expected count of at least 5. The smallest expected count in this table is about 36.5, so this condition is met.
4. Choice of test: Since we are comparing two categorical variables with more than 2 levels, we use the chi-square test of independence.
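
The expected-count condition can be checked in R rather than by eye. This is a minimal sketch, assuming the counts are re-entered by hand from the table above; the object names counts and expected are illustrative. It uses the expected component returned by chisq.test().

```r
# Observed counts copied from the race x wrkstat table above
counts <- matrix(
  c(7270, 1579, 318, 489, 2554, 416, 1493, 378,
    1358,  262,  61, 178,  315, 114,  332, 121,
     948,  208,  29,  84,   79, 100,  202,  46),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("White", "Black", "Other"),
    c("FullTime", "PartTime", "TempNotWorking", "Unempl",
      "Retired", "School", "KeepingHouse", "Other")
  )
)

# chisq.test() returns the expected counts under independence
expected <- chisq.test(counts)$expected

# Smallest expected cell count, and whether every cell is at least 5
min(expected)
all(expected >= 5)
```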

Method:

Here we calculate the expected count for each cell of the table using the formula (row total × column total) / total observations, and then sum (observed − expected)² / expected over all cells to get the chi-square statistic. The degrees of freedom are (rows − 1) × (columns − 1) = (3 − 1) × (8 − 1) = 14. Applying summary() to the table gives the chi-square statistic and degrees of freedom. Passing these two values to pchisq() then gives the p-value needed to draw the conclusion.
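
The calculation described above can be written out step by step. This is a sketch for illustration only, not the code used for the reported results: the counts are typed in from the table above, and the names counts, expected, chi_sq, df and p_value are my own.

```r
# Observed counts copied from the race x wrkstat table above
# (rows: White, Black, Other; columns: the 8 working status levels)
counts <- matrix(
  c(7270, 1579, 318, 489, 2554, 416, 1493, 378,
    1358,  262,  61, 178,  315, 114,  332, 121,
     948,  208,  29,  84,   79, 100,  202,  46),
  nrow = 3, byrow = TRUE
)

n <- sum(counts)  # total observations

# Expected count per cell: row total * column total / total observations
expected <- outer(rowSums(counts), colSums(counts)) / n

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

chi_sq
df
p_value
```

The statistic and degrees of freedom should agree with the summary() output, up to rounding.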

summary(table(gssAnalysis$race,gssAnalysis$wrkstat))
## Number of cases in table: 18934 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 363.3, df = 14, p-value = 6.75e-69

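Compute the p-value from the chi-square statistic and degrees of freedom reported in the summary. Because the statistic is rounded to 363.3, the result differs slightly from the p-value in the summary.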

pchisq(363.3, 14, lower.tail = FALSE)
## [1] 6.652595e-69

Conclusion:

Since the p-value is extremely small (practically zero), we have convincing evidence to reject the null hypothesis in favor of the alternative hypothesis that there is an association between race and working status of a person.

Because this is an observational study, we cannot infer a causal relationship; we can only conclude that race and working status are associated.

References:

This analysis is part of the Data Analysis and Statistical Inference course offered by Duke University, and the data set used in this analysis is provided by the General Social Survey.

Appendix:

All tables are included as part of the HTML in binary form, and no attachments are referenced from other folders.