Tahir Hussain

Date: 2nd August, 2016

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

   setwd("C:/R Data Files") 
   load("C:/R Data Files/gss.RData")

Part 1: Data

1. Data Description

The study is based on the General Social Survey (GSS) for the years 1972-2012. The data
file was downloaded from the Coursera course page to a laptop and then loaded into
RStudio for analysis.

What is the GSS?

2. General Social Survey

The General Social Survey has monitored changes in American society since 1972. It
was conducted almost every year until 1994. Since 1994 it has been conducted in
even-numbered years. The vast majority of the data was collected by face-to-face
interviews, computer-assisted interviews (from 2002) and telephone. The respondents were
English- and Spanish-speaking individuals aged 18 years or older, selected from
metropolitan and rural areas. To support random sampling, the sample was stratified at
multiple levels by age, race, income and sex.

The target sample size is 1500 observations. The data come from interviews, not from
an experiment, so this study is classified as an observational study.
Such studies can establish association (correlation) only; causation cannot be
inferred. Because the respondents were US residents selected by random sampling,
the study's findings can be generalized to the adult US resident population.

3. Variables used:

   degree: Ordinal categorical variable recording the respondent's highest degree
           ("Did you complete a GED certificate or obtain a high school diploma?").
   
   coninc: Continuous numerical variable. Family income in constant US dollars
           (the figures are inflation adjusted).

Part 2: Research question

        This project focuses on the relationship between the highest degree earned by
        American residents and their family income in constant US dollars.
    
        Is family income related to the level of education in the US, according
        to this survey?

Part 3: Exploratory data analysis

3.1 Data Preparation

  Before carrying out the analysis, we first prepare a subset of 'gss' using the two
  chosen variables. The data frame 'gss' consists of 57061 rows and 114 columns. After
  filtering to the year 2012 and keeping only the required variables, the subset has
  1974 rows and 2 columns, so it occupies less memory in the workspace. The variables
  in this subset are then renamed and the observations are counted.
  
       dim(gss)
## [1] 57061   114
       # Keep the 2012 respondents; columns 12 and 27 are degree and coninc.
       obs_data <- gss[gss$year == 2012, c(12, 27)]
       colnames(obs_data)<- c("Highest.Degree","Family.Income")
       nrow(obs_data)
## [1] 1974
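
Selecting columns by position breaks silently if the column order ever changes. A minimal sketch of the same subsetting done by column name is given below. The data frame `toy` and its values are made up for illustration and are not GSS data; only the column names `year`, `degree` and `coninc` are taken from the report.

```r
# Toy stand-in for gss; the values are invented for illustration only.
toy <- data.frame(
  year   = c(2010, 2012, 2012, 2012),
  degree = factor(c("High School", "Bachelor", "High School", NA)),
  coninc = c(30000, 52000, NA, 41000)
)

# Keep one survey year and select the two variables by name.
toy_2012 <- toy[toy$year == 2012, c("degree", "coninc")]
colnames(toy_2012) <- c("Highest.Degree", "Family.Income")

dim(toy_2012)                   # 3 rows, 2 columns
sum(complete.cases(toy_2012))   # 1 row has both values present
```

With the real data the equivalent call would be `gss[gss$year == 2012, c("degree", "coninc")]`, assuming those are the names of columns 12 and 27.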

3.2 Summary Statistics

Highest.Degree

The frequency table and the table of proportions for Highest.Degree are given below. High school degrees make up about 50% of the observations. A bar plot is also shown for better visualization.

     table(obs_data$Highest.Degree)
## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            280            976            151            354            205
     prop.table(table(obs_data$Highest.Degree))
## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##      0.1424212      0.4964395      0.0768057      0.1800610      0.1042726
     par(mar=c(7,4,3,2))
     barplot(table(obs_data$Highest.Degree),
                las=2,main="Highest.Degree",col=rainbow(5))

Family.Income.Constant.USD

A summary and Histogram of Family Income are shown below.

     summary(obs_data$Family.Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   16280   34470   48380   63200  178700     216
     par(mar= c(7,4,3,2)) 
     hist(obs_data$Family.Income,main = "Family.Income.in USD",
              xlab = "USD", ylab="Frequency",col=rainbow(5))

A boxplot of Family.Income by Highest.Degree is shown below.

      par(mar= c(7,4,3,2))
      
      boxplot(obs_data$Family.Income ~ obs_data$Highest.Degree,
             las=2,main="Family.Income by Degree",
             col=rainbow(5))

3.4 Results from above

         Result: We can draw the following conclusions from the above data
         summaries and plots.

         1. The distribution of family income is unimodal and right skewed.
         2. The middle 50% of observations lie between about 16300 and
            63200 USD (the first and third quartiles).
         3. The maximum value is about 178700 USD.
         4. There are outliers in the upper tail of the distribution.
         5. There are 216 observations with missing family income.
         6. Rows with a missing value in either variable will have to be
            filtered out, leaving a still sizeable sample of 1752.
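
The outlier statement can be checked with simple arithmetic on the quartiles reported by summary(). The sketch below assumes the common 1.5 x IQR rule of thumb for flagging high values; the report does not say which rule was used, and the inputs are the rounded figures printed above.

```r
# Quartiles and maximum copied from the summary() output (rounded values).
q1         <- 16280
q3         <- 63200
max_income <- 178700

iqr         <- q3 - q1            # interquartile range: 46920
upper_fence <- q3 + 1.5 * iqr     # 133580

# TRUE: the maximum lies beyond the upper fence.
max_income > upper_fence
```

Under this rule, any family income above roughly 133580 USD would be flagged as a high outlier.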

3.5 Filtering out NAs from the subset (obs_data)

Count the observations and check the frequency table after filtering.

        obs_data <- obs_data[complete.cases(obs_data),]
        nrow(obs_data)
## [1] 1752
        table(obs_data$Highest.Degree)
## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##            230            881            132            324            185

Important note on the above: The plots above suggest a positive association between the
selected variables, but it is not very strong. Family income could also be associated
with other variables.

Part 4: Statistical Inference

1. Null hypothesis (H0): the mean family income is equal across all degree groups.

2. Alternative hypothesis (HA): at least one pair of group means differs.

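For this purpose we will run an ANOVA, for which three conditions have to be met.

1. Independence: the observations should be independent within and between groups.

2. Approximate normality: the response should be approximately normally distributed
within each group.

3. Constant variance: the variability should be about equal across groups.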
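
One way the last two conditions might be examined is sketched below. It runs on simulated right-skewed data, not on the GSS subset; the data frame `sim`, its group labels and the lognormal parameters are all assumptions made for illustration.

```r
set.seed(1)

# Simulated right-skewed incomes for three hypothetical groups (not GSS data).
sim <- data.frame(
  group  = factor(rep(c("A", "B", "C"), each = 200)),
  income = c(rlnorm(200, meanlog = 10.0, sdlog = 0.8),
             rlnorm(200, meanlog = 10.4, sdlog = 0.8),
             rlnorm(200, meanlog = 10.8, sdlog = 0.8))
)

# Constant variance: compare the standard deviations of the groups.
group_sd <- tapply(sim$income, sim$group, sd)
sd_ratio <- max(group_sd) / min(group_sd)

# Approximate normality: sample skewness of the model residuals
# (a value near 0 suggests symmetry; a positive value indicates right skew).
fit      <- lm(income ~ group, data = sim)
res      <- residuals(fit)
res_skew <- mean((res - mean(res))^3) / sd(res)^3

group_sd
sd_ratio
res_skew
```

A normal Q-Q plot, for example `qqnorm(res); qqline(res)`, gives the same information graphically. With the real data, `obs_data$Family.Income` and `obs_data$Highest.Degree` would take the place of the simulated columns.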

ANOVA

       anova(lm(Family.Income ~ Highest.Degree,data=obs_data))
## Analysis of Variance Table
## 
## Response: Family.Income
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Highest.Degree    4 8.2832e+11 2.0708e+11  120.52 < 2.2e-16 ***
## Residuals      1747 3.0017e+12 1.7182e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observations from above:

An F value of about 121 would be very unlikely if the null hypothesis were true: the
p-value is below 2.2e-16. We therefore reject the null hypothesis and conclude that
at least one pair of group means differs.
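
The entries of the ANOVA table are linked by simple arithmetic, which the following sketch reproduces from the printed sums of squares and degrees of freedom. The last line computes eta squared, the share of the total variation in income accounted for by degree; this quantity is not part of the original output and is added here only as an illustration of how strong the association is.

```r
# Values copied from the ANOVA table above.
ss_group <- 8.2832e11; df_group <- 4
ss_resid <- 3.0017e12; df_resid <- 1747

ms_group <- ss_group / df_group    # mean square between groups
ms_resid <- ss_resid / df_resid    # mean square within groups
f_value  <- ms_group / ms_resid    # about 120.5

# Upper-tail probability of the F distribution with (4, 1747) degrees of freedom.
p_value <- pf(f_value, df_group, df_resid, lower.tail = FALSE)

# Share of total variation accounted for by degree (eta squared), about 0.216.
eta_sq <- ss_group / (ss_group + ss_resid)
```

An eta squared of roughly 0.22 means that degree accounts for about a fifth of the variation in family income in this sample.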

We will perform pairwise t-tests with a Bonferroni correction to locate the groups
whose means differ.

Pairwise t test for mean income grouped by degree

   pairwise.t.test(obs_data$Family.Income,obs_data$Highest.Degree, p.adj="bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  obs_data$Family.Income and obs_data$Highest.Degree 
## 
##                Lt High School High School Junior College Bachelor
## High School    1.4e-06        -           -              -       
## Junior College 3.2e-07        0.2140      -              -       
## Bachelor       < 2e-16        < 2e-16     2.3e-10        -       
## Graduate       < 2e-16        < 2e-16     < 2e-16        0.0011  
## 
## P value adjustment method: bonferroni

For nine of the ten group pairs the adjusted p-value is lower than the 0.05 significance
level, so the null hypothesis of equal means is rejected for those pairs.

The null hypothesis is not rejected for the High School-Junior College pair (p = 0.214).
The observed difference in means for this pair is not statistically significant and
could be due to chance.
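
The Bonferroni adjustment used above can be illustrated with a short sketch. Five degree groups give ten pairwise comparisons, and each unadjusted p-value is multiplied by that number and capped at 1. The values in `raw_p` are hypothetical and are not the unadjusted p-values of this analysis.

```r
n_groups <- 5
n_pairs  <- choose(n_groups, 2)    # 10 pairwise comparisons

# Hypothetical unadjusted p-values, for illustration only.
raw_p <- c(0.0001, 0.0214, 0.2)

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1.
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_pairs)
adj_p                              # 0.001 0.214 1.000
```

The correction keeps the overall chance of a false rejection across all ten comparisons at or below the chosen significance level, at the cost of some power for each individual comparison.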

Conclusion

1. This analysis provides evidence of a positive association between the highest degree
   earned by US residents and their family income.

2. In many groups there is a wide range of incomes, with outliers.

3. However, these results should not be viewed as definitive.

4. The shortcomings could be investigated by repeating the study for another
   GSS survey year and comparing the results.
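
Repeating the analysis for another year could be wrapped in a small helper, sketched below. The function name `anova_for_year` is hypothetical, and the demonstration runs on simulated data rather than on gss; it assumes only that the input has columns named `year`, `degree` and `coninc`.

```r
# Hypothetical helper: one-way ANOVA of income by degree for one survey year.
# Assumes `data` has columns year, degree and coninc.
anova_for_year <- function(data, yr) {
  sub <- data[which(data$year == yr), c("degree", "coninc")]
  sub <- sub[complete.cases(sub), ]          # drop rows with missing values
  anova(lm(coninc ~ degree, data = sub))
}

# Simulated stand-in data (not GSS data) so that the sketch runs on its own.
set.seed(42)
sim <- data.frame(
  year   = rep(c(2010, 2012), each = 150),
  degree = factor(rep(c("High School", "Bachelor", "Graduate"), times = 100)),
  coninc = rlnorm(300, meanlog = 10.5, sdlog = 0.7)
)

res_2010 <- anova_for_year(sim, 2010)
res_2012 <- anova_for_year(sim, 2012)

# Compare the F statistics of the two years.
c(res_2010$`F value`[1], res_2012$`F value`[1])
```

With the real data loaded, a call such as `anova_for_year(gss, 2010)` would give a table directly comparable to the 2012 result.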