Assignment 4

Dataset

Source:https://catalog.data.gov/dataset/american-community-survey

This is an annual nationwide survey that collects information such as age, race, income, commute time to work, home value, veteran status, and other data. We take a subset of this dataset that has all payroll related information, commodity spending and value added services spending for various manufacturing industries within the US for the year 2013.It comprises of 5258 records in all.

Here is a summary of the dataset used:

dataf<- (read.csv(file.choose(), header=T))

summary(dataf)

##          GEO.id        GEO.id2      GEO.display.label    NAICS.id   
##  0100000US  : 108          : 108   California: 108    31-33  :  52  
##  0400000US06: 108   12     : 108   Florida   : 108    311    :  52  
##  0400000US12: 108   13     : 108   Georgia   : 108    3113   :  52  
##  0400000US13: 108   17     : 108   Illinois  : 108    3118   :  52  
##  0400000US17: 108   26     : 108   Michigan  : 108    3119   :  52  
##  0400000US26: 108   27     : 108   Minnesota : 108    312    :  52  
##  (Other)    :4610   (Other):4610   (Other)   :4610    (Other):4946  
##                                         NAICS.display.label YEAR.id    
##  Petroleum and coal products manufacturing        : 104     2013:5257  
##  Printing and related support activities          : 104     Year:   1  
##  Apparel manufacturing                            :  52                
##  Architectural and structural metals manufacturing:  52                
##  Bakeries and tortilla manufacturing              :  52                
##  Beverage and tobacco product manufacturing       :  52                
##  (Other)                                          :4842                
##       EMP           EMP_S          PAYANN        PAYANN_S   
##  a      : 291   0      :1421   D      : 866   0      :1368  
##  b      : 239   S      : 844   S      :   4   S      : 867  
##  c      : 122   0.5    : 162   10780  :   3   0.1    : 175  
##  e      :  79   0.1    : 160   31402  :   3   0.4    : 175  
##  f      :  53   0.3    : 151   54343  :   3   0.3    : 170  
##  g      :  41   0.4    : 145   9389   :   3   0.5    : 165  
##  (Other):4433   (Other):2375   (Other):4376   (Other):2338  
##     EMPAVPW       EMPAVPW_S        HOURS         HOURS_S    
##  D      : 957   0      :1338   D      : 873   0      :1411  
##  S      :  11   S      : 959   S      :   8   S      : 877  
##  134    :   8   0.6    : 136   25     :   7   0.8    : 128  
##  164    :   8   0.7    : 136   94     :   7   0.5    : 124  
##  23     :   8   0.3    : 132   20     :   6   0.6    : 116  
##  36     :   8   0.5    : 126   371    :   6   0.4    : 110  
##  (Other):4258   (Other):2431   (Other):4351   (Other):2492  
##     PAYANPW       PAYANPW_S       CSTMTOT       CSTMTOT_S   
##  D      : 973   0      :1269   D      :1284   S      :1285  
##  S      :   8   S      : 976   S      :   4   0      :1069  
##  1383   :   3   0.1    : 153   0      :   2   0.1    : 212  
##  19437  :   3   0.3    : 145   1058896:   2   0.5    : 137  
##  5117   :   3   0.2    : 138   108741 :   2   0.4    : 136  
##  100290 :   2   0.4    : 137   114838 :   2   0.3    : 130  
##  (Other):4266   (Other):2440   (Other):3962   (Other):2289  
##      RCPTOT        RCPTOT_S        VALADD        VALADD_S   
##  D      :1811   S      :1813   D      :1266   S      :1270  
##  S      :   3   0      : 703   S      :  11   0      :1080  
##  1016476:   2   0.1    : 200   100793 :   2   0.1    : 122  
##  102860 :   2   0.3    : 180   1035916:   2   0.9    : 116  
##  113954 :   2   0.2    : 148   10432  :   2   0.7    : 113  
##  118695 :   2   0.4    : 145   1058199:   2   0.8    : 111  
##  (Other):3436   (Other):2069   (Other):3973   (Other):2446

str(dataf)

## 'data.frame':    5258 obs. of  22 variables:
##  $ GEO.id             : Factor w/ 53 levels "0100000US","0400000US01",..: 53 1 1 1 1 1 1 1 1 1 ...
##  $ GEO.id2            : Factor w/ 53 levels "","1","10","11",..: 53 1 1 1 1 1 1 1 1 1 ...
##  $ GEO.display.label  : Factor w/ 53 levels "Alabama","Alaska",..: 11 46 46 46 46 46 46 46 46 46 ...
##  $ NAICS.id           : Factor w/ 109 levels "2012 NAICS code",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ NAICS.display.label: Factor w/ 107 levels "Aerospace product and parts manufacturing",..: 55 53 35 4 42 98 39 27 5 92 ...
##  $ YEAR.id            : Factor w/ 2 levels "2013","Year": 2 1 1 1 1 1 1 1 1 1 ...
##  $ EMP                : Factor w/ 3144 levels "10","100","1000",..: 3143 155 431 2001 2238 2614 667 346 2116 1548 ...
##  $ EMP_S              : Factor w/ 201 levels "0","0.1","0.2",..: 200 2 2 5 6 4 5 3 3 10 ...
##  $ PAYANN             : Factor w/ 4219 levels "1000763","1002217",..: 4217 3326 3163 1497 2008 2048 3487 3496 852 444 ...
##  $ PAYANN_S           : Factor w/ 183 levels "0","0.1","0.2",..: 182 2 2 4 5 4 3 3 3 9 ...
##  $ EMPAVPW            : Factor w/ 2883 levels "0","10","100",..: 2882 2546 99 1375 1685 2032 375 2817 1793 1207 ...
##  $ EMPAVPW_S          : Factor w/ 218 levels "0","0.1","0.2",..: 216 2 3 6 8 3 5 4 3 11 ...
##  $ HOURS              : Factor w/ 3382 levels "0","10","100",..: 3381 635 1131 2704 3090 11 1446 974 3154 2420 ...
##  $ HOURS_S            : Factor w/ 225 levels "0","0.1","0.2",..: 223 2 3 7 8 6 6 3 4 9 ...
##  $ PAYANPW            : Factor w/ 4090 levels "0","100007","10014",..: 4089 2180 2395 542 1256 1226 2759 2645 367 3823 ...
##  $ PAYANPW_S          : Factor w/ 211 levels "0","0.1","0.2",..: 209 2 2 6 4 5 4 3 3 11 ...
##  $ CSTMTOT            : Factor w/ 3852 levels "0","100026","100047",..: 3852 2109 2694 2530 3371 1014 2278 3509 704 3196 ...
##  $ CSTMTOT_S          : Factor w/ 202 levels "0","0.1","0.2",..: 200 2 3 4 3 3 4 2 6 10 ...
##  $ RCPTOT             : Factor w/ 3338 levels "0","100030","1000738",..: 3338 2543 2919 2598 64 1729 2827 185 997 175 ...
##  $ RCPTOT_S           : Factor w/ 177 levels "0","0.1","0.2",..: 175 2 2 3 3 2 4 2 4 10 ...
##  $ VALADD             : Factor w/ 3868 levels "0","100002861",..: 3868 1493 1666 841 1801 656 2023 1902 2894 2523 ...
##  $ VALADD_S           : Factor w/ 232 levels "0","0.1","0.2",..: 230 2 3 9 4 5 5 4 6 13 ...

Hypothesis

H0: Production worker’s average annual wages (High or low) is not affected by the ‘Production worker’s annual hours’ and the dollar amount of value added services spent by the industry.

H1: The variability/change in the value of Production worker’s average annual wages (High or low)-(DV) can be explained by the ‘Production worker’s annual hours’-(IV 1) and the dollar amount of value added services spent by the industry - (IV 2).

For this study we dichotomize a continuous variable and change it to a categorical variable based on the population average. So we calculate the average annual wage for the manufacturing sector to be $10854 (IN 1000’s from the given data only). Based on this value we dichotomize the Production worker’s average annual wages into two groups:

High: Meaning wages above the $10854 mark (This is assigned a value 1)

Low: Wages below the $10854 mark (This is assigned a value 0)

Subsetting the data: the datasets provided in the link above are huge so we focus only on food manufacaturing/processing and textile manufacturing industries. We will use G-Star Power to determine the required sample size before the regression analysis.

Power analysis

Using G-Star power we determine the sample size for logistic regression. The parameter values used for this calculation are:

Odds Ratio: 2.33, alpha=0.05, Power= 0.80, R-squared= 0.3, Pr(Y=1|X=1)H0=0.12

We get a sample size of 120. Therefore, using a sample of 120 for this analysis.

Note: The number of hours (in 1000’s)is divided by 100000 for this analyis.

### Creating the required data frame

newdata1 <- (read.csv(file.choose(), header=T))

attach(newdata1)

summary(newdata1)

##     CATEGORY          HOURS               VALADD         
##  Min.   :0.0000   Min.   :  0.00002   Min.   :    1.235  
##  1st Qu.:1.0000   1st Qu.:  0.45855   1st Qu.:   24.061  
##  Median :1.0000   Median :  1.32936   Median :  181.718  
##  Mean   :0.7583   Mean   :  3.61879   Mean   :  934.501  
##  3rd Qu.:1.0000   3rd Qu.:  2.46694   3rd Qu.:  556.427  
##  Max.   :1.0000   Max.   :157.36107   Max.   :23983.918  
##                                                          
##     CSTMTOT              RCPTOT         
##  Min.   :1.375e+04   Min.   :    32700  
##  1st Qu.:6.519e+06   1st Qu.:   513333  
##  Median :2.149e+07   Median :  1726283  
##  Mean   :8.724e+07   Mean   : 15447981  
##  3rd Qu.:5.225e+07   3rd Qu.:  6387562  
##  Max.   :3.457e+09   Max.   :538007282  
##                      NA's   :31

sapply(newdata1, sd)

##     CATEGORY        HOURS       VALADD      CSTMTOT       RCPTOT 
## 4.298883e-01 1.449644e+01 3.177924e+03 3.331163e+08           NA

cor(newdata1,use="complete")

##             CATEGORY       HOURS      VALADD    CSTMTOT      RCPTOT
## CATEGORY  1.00000000  0.12145445  0.15623530 0.05480743 -0.23900068
## HOURS     0.12145445  1.00000000  0.67272611 0.93622122 -0.04725586
## VALADD    0.15623530  0.67272611  1.00000000 0.64324691 -0.05144206
## CSTMTOT   0.05480743  0.93622122  0.64324691 1.00000000  0.11357777
## RCPTOT   -0.23900068 -0.04725586 -0.05144206 0.11357777  1.00000000

Summary (Formulating the model):

Since multicollinearity is a big issue in logistic regression therefore before proceeding with subsequent analysis, we perform a test of correlation between the IV’s. As is evident from the results above both the IVs selected are highly correlated. This could lead to multicollinearity,and one of the solutions to deal with this issue is to remove one of the variables from the model. Based on this observation we remove the variable ‘VALADD’ and instead select ‘RCPTOT’ which is the ‘Total value of shipments and receipts for services (in $1,000)’. This selection is based on the low level of correlation between ‘HOURS’ and ‘RCPTOT’. Therefore we re-define our Hypothesis here:

H0: The variability in production worker’s average annual wages (High or low)- DV is not affected by the variability in ‘Production worker’s annual hours’ (IV1) and/or the variability in ‘Total value(in $1000) of shipments and receipts for services’ spent within the industry (IV2).

H1: The variability/change in the value of Production worker’s average annual wages (High or low)-(DV) can be explained by the variability in’Production worker’s annual hours’-(IV 1) and/or the variability in ‘Total value of shipments and receipts for services’ spent within the industry - (IV 2).

Model development: Heirarchical Model: Explanatory model

We are interested in analyzing the factors that impact the annual wage of a production worker across the 2 (food and textile) manufacturing industries mentioned above (This annual wage includes part time workers as well). Since we are trying to explain the variability in the dependent variable as a result of the variability in the IVs,therefore this is an explanatory model.

Rationale behind the model: The first independent variable (HOURS) is the total number of hours spent by workers on average in each manufacturing industry type. So it is possible that the number of hours might impact the wage dichotomy in general i.e. the variability in the number of hours can explain the variability in the annual wages.

The second independent variable (RCPTOT) is the ‘Total value of shipments and receipts for services’ each year(annual values) by each industry. Although this does not seem to directly impact the wages, but more the total value of shipments and receipts an industry deals with on a daily basis, higher will be the number of skilled workers it will need. This in turn would lead to higher wages for workers within that industry. For illustration let us consider the case of a computer manufacturer. Since the value of its shipments would be pretty high owing the value of the final product that it deals with, it is highly likely that a large number of workers working in the company are highly skilled (engineers etc.) and their wages must be relatively high.

Taking these ideas into consideration we formulate a generalized linear model:

## Logistic Regression model
options(warn=-1)
library(aod)
library(ggplot2)

mylogit <- glm(CATEGORY ~ HOURS+RCPTOT, data = newdata1, family = binomial,control = list(maxit = 50))

summary(mylogit)

## 
## Call:
## glm(formula = CATEGORY ~ HOURS + RCPTOT, family = binomial, data = newdata1, 
##     control = list(maxit = 50))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3414   0.0000   0.1001   0.5786   1.4545  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.128e+00  5.608e-01  -2.011 0.044357 *  
## HOURS        2.612e+00  7.100e-01   3.679 0.000234 ***
## RCPTOT      -2.145e-09  4.844e-09  -0.443 0.657935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 97.254  on 88  degrees of freedom
## Residual deviance: 57.921  on 86  degrees of freedom
##   (31 observations deleted due to missingness)
## AIC: 63.921
## 
## Number of Fisher Scoring iterations: 9

Based on the above results we reject the null hypothesis because one of the IV-‘HOURS’ is found to be significant. Therefore, we can say that the variability in the Production worker’s annual wage can be explained by something other than randomization (variability could be due to the change in number of hours).

A Wald test is used to test the statistical significance of each coefficient (b) in the model.

# Test for the significance of the 1st IV - 'HOURS' 

wald.test(b=coef(mylogit),Sigma = vcov(mylogit), Terms = 2)

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 13.5, df = 1, P(> X2) = 0.00023

# Test for the significance of the 2nd IV - 'RCPTOT'
wald.test(b=coef(mylogit),Sigma = vcov(mylogit), Terms = 3)

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 0.2, df = 1, P(> X2) = 0.66

From the above results it is clear that only the IV- ‘HOURS’ is significant. This is in line with our initial result from the generalized linear model.

Confidence Interval

For logistic models, confidence intervals are based on the profiled log-likelihood function as shown below:

options(warn=-1)

confint(mylogit)

## Waiting for profiling to be done...

##                     2.5 %        97.5 %
## (Intercept) -2.310290e+00 -7.963204e-02
## HOURS        1.392008e+00  4.205585e+00
## RCPTOT      -1.669709e-08  5.591122e-09

# odds ratio and 95% conf interval
exp(cbind(OR = coef(mylogit), confint(mylogit)))

## Waiting for profiling to be done...

##                     OR      2.5 %     97.5 %
## (Intercept)  0.3237971 0.09923247  0.9234561
## HOURS       13.6292005 4.02292119 67.0598192
## RCPTOT       1.0000000 0.99999998  1.0000000

Model Interpretation

In the summary we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model.

The logistic regression coefficients give the change in the log odds(ln(OR)) of the outcome for a one unit increase in the predictor variable.

From the model summary we can interpret the following:

For every one unit change in ‘HOURS’, the log odds of high salary (versus low salary) increases by 2.612e+00.
For a one unit increase in ‘RCPTOT’, the log odds of getting high salary decreases by 2.145e-09.
Intercept: If the values of both independent variables were zero then the value of the production worker’s annual wage would be -1.128e+00.
Deviance residuals: In a perfect fitted model we should get a deviance of zero as log(1) is zero.A poorly fitting point has a large residual deviance as -2 times the log of a very small value is a large number. In our model this value is 57.921 which is much higher than zero, so we can say that this model is not a very good fit.
The degrees of freedom suggest that there are certain missing values in the sample selected since it is less than 120.

Plots

Based on our above findings we plot the standardized residuals. The results from the plots below would help in analyzing the ‘LINKS’ assumptions. The explanation for these plots and the inferences are discussed in the next section (Assumptions).

plot(fitted(mylogit),rstandard(mylogit),pch=21, cex=1,xlab ="Fitted", ylab ="Standard Resid",main ="Fitted vs St. Resid (Figure 1)")
abline(0,0,lwd=3,col="red")

There is clearly a downward trend and non-linearity in the plot above over the entire dynamic range.

plot(rstandard(mylogit), main = "Plot of standard residuals (Figure 2)")
abline(0,0,col="red",lwd = 3)

There are no conspicuous trends in the above plot which points to minimal or no correlation between the error terms.

hist(rstandard(mylogit),breaks=20, xlab="Standardized Residual", main="Standardized Residuals distribution (Figure 3)")

The histogram shows 2 peaks, kurtosis as well as some form of skew.

boxplot(rstandard(mylogit), xlab="CATEGORY ~ HOURS+RCPTOT", ylab="Standardized Residuals", main="St. Residuals boxplot (Figure 4)")

In the boxplot the median lies towards one extreme end. This shows bias in the error terms.

qqnorm(rstandard(mylogit), xlab="Normal quantiles", ylab="St.Residual quantiles", main="QQ plot: st.residuals vs. normal (Figure 5)")
qqline(rstandard(mylogit),col='blue')

Deviation from normality is quite clear in the above Q-Q plot.

Assumptions (LINE Analysis)

We need to test the following:

The mean of the response at each set of values of the predictor,is a Linear function of the predictors.
The errors are Independent.
The errors at each set of values of the predictor are Normally distributed.
The errors at each set of values of the predictor have Equal variances.

From the plots we infer the following:

Assumption 1:From Figure 1 we can infer that the linearity assumption seems to be violated because there is a trend in the plot. Non-linearity with a downward trend is quite evident over the entire dynamic range.

Assumption 2: Expected correlation between residuals for any 2 cases is zero as can be seen from the residual plot in Figure 2. There are some extreme values, however no clear patterns can be seen so we can say that there is lack of correlation between the error terms and so independence can be assumed.

Assumption 3: The normality assumption seems to be violated for a large number of values as can be seen from the Q-Q plot. There is kurtosis present as is evident in the histogram. There are 2 peaks in the entire range and a lot of outliers as well. The boxplots confirm a similar trend with the median lying towards one extreme end. (Refer to Figure 3, 4, 5)

## Breusch-Pagan test

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bp1<-bptest(CATEGORY ~ HOURS)
bp1

## 
##  studentized Breusch-Pagan test
## 
## data:  CATEGORY ~ HOURS
## BP = 1.4331, df = 1, p-value = 0.2313

bp2<-bptest(CATEGORY ~ RCPTOT)
bp2

## 
##  studentized Breusch-Pagan test
## 
## data:  CATEGORY ~ RCPTOT
## BP = 0.0025, df = 1, p-value = 0.9603

bp <- bptest(CATEGORY ~ HOURS+RCPTOT)
bp

## 
##  studentized Breusch-Pagan test
## 
## data:  CATEGORY ~ HOURS + RCPTOT
## BP = 1.3049, df = 2, p-value = 0.5208

First we test homoskedasticity for both the independent variables and then test it for the model as a whole. From all the p-values (in the 3 tests) we can infer that there is no heteroscedasticity since the BP test is based on the null hypothesis that there is homoskedasticity. Given the fact that p-value is quite large for all the cases tested above we fail to reject the null hypothesis.

Assumption 4: Heteroscedasticity seems to be absent if we take the results of Breusch Pagan’s test only so the assumption holds. Therefore we can say that the assumption of equal variances holds.

Assumptions for Logistic Regression

Given the fact that most of the assumptions valid for linear models do not apply for logistic regression, we test the assumptions for logistic regression here.

Assumption 1: IVs are linearly related to log odds ratio: This assumption seems to be violated to a certain extent from the plots of the residuals (Figure 1). There is some form of trend as well as non-linearity as well.

Assumption 2: No extraneous variables are included: This assumption holds because of the way the model is formulated and the underlying theory behind it.

Assumption 3: No multicollinearity: This assumption holds because there is no correlation between the IVs.

Assumption 4: Observations are independent: From the plots we can say that there is lack of correlation between the error terms so we can say that this assumption holds (Figure 2).This is further established from the result in Figure 6.

par(mfrow=c(1,1))
plot(rstandard(mylogit),cex=0.5,ylab="Standard Residuals",main="St.Residuals (Figure 6)")
abline(0,0,col="red",lwd = 2)
lines(smooth.spline(rstandard(mylogit)),col='blue',lwd=2)

Issues:

Collinearity

plot(newdata1)

cor(newdata1,use="complete")

##             CATEGORY       HOURS      VALADD    CSTMTOT      RCPTOT
## CATEGORY  1.00000000  0.12145445  0.15623530 0.05480743 -0.23900068
## HOURS     0.12145445  1.00000000  0.67272611 0.93622122 -0.04725586
## VALADD    0.15623530  0.67272611  1.00000000 0.64324691 -0.05144206
## CSTMTOT   0.05480743  0.93622122  0.64324691 1.00000000  0.11357777
## RCPTOT   -0.23900068 -0.04725586 -0.05144206 0.11357777  1.00000000

Any form of high correlation is not evident between the two IVs from the plots here so we can assume minimal or no collinearity. This is further confirmed from the correlation test.

Causality: Both the Independent variables i.e. the production workers annual hours and the total value of shipments and receipts are a proximal and probabilistic cause of the dependent variable. This is because there are numerous other factors that decide whether the annual wage is high or low in the manufacturing industry. As discussed before, it is possible that the number of hours might impact the wage dichotomy in general. Similarly,more the total value of shipments and receipts an industry deals with on a daily basis, higher will be the number of skilled workers it will need.
Sample size: We determined the sample size from Power analysis using G-Star Power and used that sample size for the entire analysis.
Measurement error: This is survey data so we can assume no measurement error. The data is from the Census Bureau’s survey and it is considered as the leading source of quality data about nation’s people and economy so we can safely assume that there is minimum measurement error.

Interaction Effect

Here we test the interaction effect (if any) in the generalized linear model. There is high probability that this interaction would be negligible, given the low correlation between the two independent variables.

model_interaction <- glm(CATEGORY ~ HOURS*RCPTOT, data=newdata1, family = binomial(logit))
summary(model_interaction)

## 
## Call:
## glm(formula = CATEGORY ~ HOURS * RCPTOT, family = binomial(logit), 
##     data = newdata1)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.85091   0.00000   0.05174   0.52438   1.24572  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -7.421e-01  5.922e-01  -1.253   0.2101  
## HOURS         1.577e+00  7.532e-01   2.093   0.0363 *
## RCPTOT       -1.662e-07  1.259e-07  -1.320   0.1869  
## HOURS:RCPTOT  4.860e-07  3.391e-07   1.433   0.1518  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 97.254  on 88  degrees of freedom
## Residual deviance: 50.828  on 85  degrees of freedom
##   (31 observations deleted due to missingness)
## AIC: 58.828
## 
## Number of Fisher Scoring iterations: 10

As can be seen from the above model, there is no interaction effect between the independent variables. Only the variable ‘HOURS’ is found to be significant which was also evident in our original model.

Conclusion

From the entire logistic regression analysis for the problem defined and stated as the null hypothesis we can conclude that the variability in the Production worker’s annual wage (in the food and textile manufacturing sector only)for the two levels (0 or 1) can be explained by something other than randomization (variability could be attributed to the variability in number of hours).

References

https://catalog.data.gov/dataset/american-community-survey

Wikipedia