Source:https://catalog.data.gov/dataset/american-community-survey
This is an annual nationwide survey that collects information such as age, race, income, commute time to work, home value, veteran status, and other data. We take a subset of this dataset that has all payroll related information, commodity spending and value added services spending for various manufacturing industries within the US for the year 2013.It comprises of 5258 records in all.
Here is a summary of the dataset used:
dataf<- (read.csv(file.choose(), header=T))
summary(dataf)
## GEO.id GEO.id2 GEO.display.label NAICS.id
## 0100000US : 108 : 108 California: 108 31-33 : 52
## 0400000US06: 108 12 : 108 Florida : 108 311 : 52
## 0400000US12: 108 13 : 108 Georgia : 108 3113 : 52
## 0400000US13: 108 17 : 108 Illinois : 108 3118 : 52
## 0400000US17: 108 26 : 108 Michigan : 108 3119 : 52
## 0400000US26: 108 27 : 108 Minnesota : 108 312 : 52
## (Other) :4610 (Other):4610 (Other) :4610 (Other):4946
## NAICS.display.label YEAR.id
## Petroleum and coal products manufacturing : 104 2013:5257
## Printing and related support activities : 104 Year: 1
## Apparel manufacturing : 52
## Architectural and structural metals manufacturing: 52
## Bakeries and tortilla manufacturing : 52
## Beverage and tobacco product manufacturing : 52
## (Other) :4842
## EMP EMP_S PAYANN PAYANN_S
## a : 291 0 :1421 D : 866 0 :1368
## b : 239 S : 844 S : 4 S : 867
## c : 122 0.5 : 162 10780 : 3 0.1 : 175
## e : 79 0.1 : 160 31402 : 3 0.4 : 175
## f : 53 0.3 : 151 54343 : 3 0.3 : 170
## g : 41 0.4 : 145 9389 : 3 0.5 : 165
## (Other):4433 (Other):2375 (Other):4376 (Other):2338
## EMPAVPW EMPAVPW_S HOURS HOURS_S
## D : 957 0 :1338 D : 873 0 :1411
## S : 11 S : 959 S : 8 S : 877
## 134 : 8 0.6 : 136 25 : 7 0.8 : 128
## 164 : 8 0.7 : 136 94 : 7 0.5 : 124
## 23 : 8 0.3 : 132 20 : 6 0.6 : 116
## 36 : 8 0.5 : 126 371 : 6 0.4 : 110
## (Other):4258 (Other):2431 (Other):4351 (Other):2492
## PAYANPW PAYANPW_S CSTMTOT CSTMTOT_S
## D : 973 0 :1269 D :1284 S :1285
## S : 8 S : 976 S : 4 0 :1069
## 1383 : 3 0.1 : 153 0 : 2 0.1 : 212
## 19437 : 3 0.3 : 145 1058896: 2 0.5 : 137
## 5117 : 3 0.2 : 138 108741 : 2 0.4 : 136
## 100290 : 2 0.4 : 137 114838 : 2 0.3 : 130
## (Other):4266 (Other):2440 (Other):3962 (Other):2289
## RCPTOT RCPTOT_S VALADD VALADD_S
## D :1811 S :1813 D :1266 S :1270
## S : 3 0 : 703 S : 11 0 :1080
## 1016476: 2 0.1 : 200 100793 : 2 0.1 : 122
## 102860 : 2 0.3 : 180 1035916: 2 0.9 : 116
## 113954 : 2 0.2 : 148 10432 : 2 0.7 : 113
## 118695 : 2 0.4 : 145 1058199: 2 0.8 : 111
## (Other):3436 (Other):2069 (Other):3973 (Other):2446
str(dataf)
## 'data.frame': 5258 obs. of 22 variables:
## $ GEO.id : Factor w/ 53 levels "0100000US","0400000US01",..: 53 1 1 1 1 1 1 1 1 1 ...
## $ GEO.id2 : Factor w/ 53 levels "","1","10","11",..: 53 1 1 1 1 1 1 1 1 1 ...
## $ GEO.display.label : Factor w/ 53 levels "Alabama","Alaska",..: 11 46 46 46 46 46 46 46 46 46 ...
## $ NAICS.id : Factor w/ 109 levels "2012 NAICS code",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ NAICS.display.label: Factor w/ 107 levels "Aerospace product and parts manufacturing",..: 55 53 35 4 42 98 39 27 5 92 ...
## $ YEAR.id : Factor w/ 2 levels "2013","Year": 2 1 1 1 1 1 1 1 1 1 ...
## $ EMP : Factor w/ 3144 levels "10","100","1000",..: 3143 155 431 2001 2238 2614 667 346 2116 1548 ...
## $ EMP_S : Factor w/ 201 levels "0","0.1","0.2",..: 200 2 2 5 6 4 5 3 3 10 ...
## $ PAYANN : Factor w/ 4219 levels "1000763","1002217",..: 4217 3326 3163 1497 2008 2048 3487 3496 852 444 ...
## $ PAYANN_S : Factor w/ 183 levels "0","0.1","0.2",..: 182 2 2 4 5 4 3 3 3 9 ...
## $ EMPAVPW : Factor w/ 2883 levels "0","10","100",..: 2882 2546 99 1375 1685 2032 375 2817 1793 1207 ...
## $ EMPAVPW_S : Factor w/ 218 levels "0","0.1","0.2",..: 216 2 3 6 8 3 5 4 3 11 ...
## $ HOURS : Factor w/ 3382 levels "0","10","100",..: 3381 635 1131 2704 3090 11 1446 974 3154 2420 ...
## $ HOURS_S : Factor w/ 225 levels "0","0.1","0.2",..: 223 2 3 7 8 6 6 3 4 9 ...
## $ PAYANPW : Factor w/ 4090 levels "0","100007","10014",..: 4089 2180 2395 542 1256 1226 2759 2645 367 3823 ...
## $ PAYANPW_S : Factor w/ 211 levels "0","0.1","0.2",..: 209 2 2 6 4 5 4 3 3 11 ...
## $ CSTMTOT : Factor w/ 3852 levels "0","100026","100047",..: 3852 2109 2694 2530 3371 1014 2278 3509 704 3196 ...
## $ CSTMTOT_S : Factor w/ 202 levels "0","0.1","0.2",..: 200 2 3 4 3 3 4 2 6 10 ...
## $ RCPTOT : Factor w/ 3338 levels "0","100030","1000738",..: 3338 2543 2919 2598 64 1729 2827 185 997 175 ...
## $ RCPTOT_S : Factor w/ 177 levels "0","0.1","0.2",..: 175 2 2 3 3 2 4 2 4 10 ...
## $ VALADD : Factor w/ 3868 levels "0","100002861",..: 3868 1493 1666 841 1801 656 2023 1902 2894 2523 ...
## $ VALADD_S : Factor w/ 232 levels "0","0.1","0.2",..: 230 2 3 9 4 5 5 4 6 13 ...
H0: Production worker’s average annual wages (High or low) is not affected by the ‘Production worker’s annual hours’ and the dollar amount of value added services spent by the industry.
H1: The variability/change in the value of Production worker’s average annual wages (High or low)-(DV) can be explained by the ‘Production worker’s annual hours’-(IV 1) and the dollar amount of value added services spent by the industry - (IV 2).
For this study we dichotomize a continuous variable and change it to a categorical variable based on the population average. So we calculate the average annual wage for the manufacturing sector to be $10854 (IN 1000’s from the given data only). Based on this value we dichotomize the Production worker’s average annual wages into two groups:
High: Meaning wages above the $10854 mark (This is assigned a value 1)
Low: Wages below the $10854 mark (This is assigned a value 0)
Subsetting the data: the datasets provided in the link above are huge so we focus only on food manufacaturing/processing and textile manufacturing industries. We will use G-Star Power to determine the required sample size before the regression analysis.
Using G-Star power we determine the sample size for logistic regression. The parameter values used for this calculation are:
Odds Ratio: 2.33, alpha=0.05, Power= 0.80, R-squared= 0.3, Pr(Y=1|X=1)H0=0.12
We get a sample size of 120. Therefore, using a sample of 120 for this analysis.
Note: The number of hours (in 1000’s)is divided by 100000 for this analyis.
### Creating the required data frame
newdata1 <- (read.csv(file.choose(), header=T))
attach(newdata1)
summary(newdata1)
## CATEGORY HOURS VALADD
## Min. :0.0000 Min. : 0.00002 Min. : 1.235
## 1st Qu.:1.0000 1st Qu.: 0.45855 1st Qu.: 24.061
## Median :1.0000 Median : 1.32936 Median : 181.718
## Mean :0.7583 Mean : 3.61879 Mean : 934.501
## 3rd Qu.:1.0000 3rd Qu.: 2.46694 3rd Qu.: 556.427
## Max. :1.0000 Max. :157.36107 Max. :23983.918
##
## CSTMTOT RCPTOT
## Min. :1.375e+04 Min. : 32700
## 1st Qu.:6.519e+06 1st Qu.: 513333
## Median :2.149e+07 Median : 1726283
## Mean :8.724e+07 Mean : 15447981
## 3rd Qu.:5.225e+07 3rd Qu.: 6387562
## Max. :3.457e+09 Max. :538007282
## NA's :31
sapply(newdata1, sd)
## CATEGORY HOURS VALADD CSTMTOT RCPTOT
## 4.298883e-01 1.449644e+01 3.177924e+03 3.331163e+08 NA
cor(newdata1,use="complete")
## CATEGORY HOURS VALADD CSTMTOT RCPTOT
## CATEGORY 1.00000000 0.12145445 0.15623530 0.05480743 -0.23900068
## HOURS 0.12145445 1.00000000 0.67272611 0.93622122 -0.04725586
## VALADD 0.15623530 0.67272611 1.00000000 0.64324691 -0.05144206
## CSTMTOT 0.05480743 0.93622122 0.64324691 1.00000000 0.11357777
## RCPTOT -0.23900068 -0.04725586 -0.05144206 0.11357777 1.00000000
Since multicollinearity is a big issue in logistic regression therefore before proceeding with subsequent analysis, we perform a test of correlation between the IV’s. As is evident from the results above both the IVs selected are highly correlated. This could lead to multicollinearity,and one of the solutions to deal with this issue is to remove one of the variables from the model. Based on this observation we remove the variable ‘VALADD’ and instead select ‘RCPTOT’ which is the ‘Total value of shipments and receipts for services (in $1,000)’. This selection is based on the low level of correlation between ‘HOURS’ and ‘RCPTOT’. Therefore we re-define our Hypothesis here:
H0: The variability in production worker’s average annual wages (High or low)- DV is not affected by the variability in ‘Production worker’s annual hours’ (IV1) and/or the variability in ‘Total value(in $1000) of shipments and receipts for services’ spent within the industry (IV2).
H1: The variability/change in the value of Production worker’s average annual wages (High or low)-(DV) can be explained by the variability in’Production worker’s annual hours’-(IV 1) and/or the variability in ‘Total value of shipments and receipts for services’ spent within the industry - (IV 2).
We are interested in analyzing the factors that impact the annual wage of a production worker across the 2 (food and textile) manufacturing industries mentioned above (This annual wage includes part time workers as well). Since we are trying to explain the variability in the dependent variable as a result of the variability in the IVs,therefore this is an explanatory model.
Rationale behind the model: The first independent variable (HOURS) is the total number of hours spent by workers on average in each manufacturing industry type. So it is possible that the number of hours might impact the wage dichotomy in general i.e. the variability in the number of hours can explain the variability in the annual wages.
The second independent variable (RCPTOT) is the ‘Total value of shipments and receipts for services’ each year(annual values) by each industry. Although this does not seem to directly impact the wages, but more the total value of shipments and receipts an industry deals with on a daily basis, higher will be the number of skilled workers it will need. This in turn would lead to higher wages for workers within that industry. For illustration let us consider the case of a computer manufacturer. Since the value of its shipments would be pretty high owing the value of the final product that it deals with, it is highly likely that a large number of workers working in the company are highly skilled (engineers etc.) and their wages must be relatively high.
Taking these ideas into consideration we formulate a generalized linear model:
## Logistic Regression model
options(warn=-1)
library(aod)
library(ggplot2)
mylogit <- glm(CATEGORY ~ HOURS+RCPTOT, data = newdata1, family = binomial,control = list(maxit = 50))
summary(mylogit)
##
## Call:
## glm(formula = CATEGORY ~ HOURS + RCPTOT, family = binomial, data = newdata1,
## control = list(maxit = 50))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3414 0.0000 0.1001 0.5786 1.4545
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.128e+00 5.608e-01 -2.011 0.044357 *
## HOURS 2.612e+00 7.100e-01 3.679 0.000234 ***
## RCPTOT -2.145e-09 4.844e-09 -0.443 0.657935
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 97.254 on 88 degrees of freedom
## Residual deviance: 57.921 on 86 degrees of freedom
## (31 observations deleted due to missingness)
## AIC: 63.921
##
## Number of Fisher Scoring iterations: 9
Based on the above results we reject the null hypothesis because one of the IV-‘HOURS’ is found to be significant. Therefore, we can say that the variability in the Production worker’s annual wage can be explained by something other than randomization (variability could be due to the change in number of hours).
A Wald test is used to test the statistical significance of each coefficient (b) in the model.
# Test for the significance of the 1st IV - 'HOURS'
wald.test(b=coef(mylogit),Sigma = vcov(mylogit), Terms = 2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 13.5, df = 1, P(> X2) = 0.00023
# Test for the significance of the 2nd IV - 'RCPTOT'
wald.test(b=coef(mylogit),Sigma = vcov(mylogit), Terms = 3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.2, df = 1, P(> X2) = 0.66
From the above results it is clear that only the IV- ‘HOURS’ is significant. This is in line with our initial result from the generalized linear model.
For logistic models, confidence intervals are based on the profiled log-likelihood function as shown below:
options(warn=-1)
confint(mylogit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.310290e+00 -7.963204e-02
## HOURS 1.392008e+00 4.205585e+00
## RCPTOT -1.669709e-08 5.591122e-09
# odds ratio and 95% conf interval
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.3237971 0.09923247 0.9234561
## HOURS 13.6292005 4.02292119 67.0598192
## RCPTOT 1.0000000 0.99999998 1.0000000
In the summary we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model.
The logistic regression coefficients give the change in the log odds(ln(OR)) of the outcome for a one unit increase in the predictor variable.
From the model summary we can interpret the following:
For every one unit change in ‘HOURS’, the log odds of high salary (versus low salary) increases by 2.612e+00.
For a one unit increase in ‘RCPTOT’, the log odds of getting high salary decreases by 2.145e-09.
Intercept: If the values of both independent variables were zero then the value of the production worker’s annual wage would be -1.128e+00.
Deviance residuals: In a perfect fitted model we should get a deviance of zero as log(1) is zero.A poorly fitting point has a large residual deviance as -2 times the log of a very small value is a large number. In our model this value is 57.921 which is much higher than zero, so we can say that this model is not a very good fit.
The degrees of freedom suggest that there are certain missing values in the sample selected since it is less than 120.
Based on our above findings we plot the standardized residuals. The results from the plots below would help in analyzing the ‘LINKS’ assumptions. The explanation for these plots and the inferences are discussed in the next section (Assumptions).
plot(fitted(mylogit),rstandard(mylogit),pch=21, cex=1,xlab ="Fitted", ylab ="Standard Resid",main ="Fitted vs St. Resid (Figure 1)")
abline(0,0,lwd=3,col="red")
There is clearly a downward trend and non-linearity in the plot above over the entire dynamic range.
plot(rstandard(mylogit), main = "Plot of standard residuals (Figure 2)")
abline(0,0,col="red",lwd = 3)
There are no conspicuous trends in the above plot which points to minimal or no correlation between the error terms.
hist(rstandard(mylogit),breaks=20, xlab="Standardized Residual", main="Standardized Residuals distribution (Figure 3)")
The histogram shows 2 peaks, kurtosis as well as some form of skew.
boxplot(rstandard(mylogit), xlab="CATEGORY ~ HOURS+RCPTOT", ylab="Standardized Residuals", main="St. Residuals boxplot (Figure 4)")
In the boxplot the median lies towards one extreme end. This shows bias in the error terms.
qqnorm(rstandard(mylogit), xlab="Normal quantiles", ylab="St.Residual quantiles", main="QQ plot: st.residuals vs. normal (Figure 5)")
qqline(rstandard(mylogit),col='blue')
Deviation from normality is quite clear in the above Q-Q plot.
We need to test the following:
The mean of the response at each set of values of the predictor,is a Linear function of the predictors.
The errors are Independent.
The errors at each set of values of the predictor are Normally distributed.
The errors at each set of values of the predictor have Equal variances.
From the plots we infer the following:
Assumption 1:From Figure 1 we can infer that the linearity assumption seems to be violated because there is a trend in the plot. Non-linearity with a downward trend is quite evident over the entire dynamic range.
Assumption 2: Expected correlation between residuals for any 2 cases is zero as can be seen from the residual plot in Figure 2. There are some extreme values, however no clear patterns can be seen so we can say that there is lack of correlation between the error terms and so independence can be assumed.
Assumption 3: The normality assumption seems to be violated for a large number of values as can be seen from the Q-Q plot. There is kurtosis present as is evident in the histogram. There are 2 peaks in the entire range and a lot of outliers as well. The boxplots confirm a similar trend with the median lying towards one extreme end. (Refer to Figure 3, 4, 5)
## Breusch-Pagan test
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bp1<-bptest(CATEGORY ~ HOURS)
bp1
##
## studentized Breusch-Pagan test
##
## data: CATEGORY ~ HOURS
## BP = 1.4331, df = 1, p-value = 0.2313
bp2<-bptest(CATEGORY ~ RCPTOT)
bp2
##
## studentized Breusch-Pagan test
##
## data: CATEGORY ~ RCPTOT
## BP = 0.0025, df = 1, p-value = 0.9603
bp <- bptest(CATEGORY ~ HOURS+RCPTOT)
bp
##
## studentized Breusch-Pagan test
##
## data: CATEGORY ~ HOURS + RCPTOT
## BP = 1.3049, df = 2, p-value = 0.5208
First we test homoskedasticity for both the independent variables and then test it for the model as a whole. From all the p-values (in the 3 tests) we can infer that there is no heteroscedasticity since the BP test is based on the null hypothesis that there is homoskedasticity. Given the fact that p-value is quite large for all the cases tested above we fail to reject the null hypothesis.
Assumption 4: Heteroscedasticity seems to be absent if we take the results of Breusch Pagan’s test only so the assumption holds. Therefore we can say that the assumption of equal variances holds.
Given the fact that most of the assumptions valid for linear models do not apply for logistic regression, we test the assumptions for logistic regression here.
Assumption 1: IVs are linearly related to log odds ratio: This assumption seems to be violated to a certain extent from the plots of the residuals (Figure 1). There is some form of trend as well as non-linearity as well.
Assumption 2: No extraneous variables are included: This assumption holds because of the way the model is formulated and the underlying theory behind it.
Assumption 3: No multicollinearity: This assumption holds because there is no correlation between the IVs.
Assumption 4: Observations are independent: From the plots we can say that there is lack of correlation between the error terms so we can say that this assumption holds (Figure 2).This is further established from the result in Figure 6.
par(mfrow=c(1,1))
plot(rstandard(mylogit),cex=0.5,ylab="Standard Residuals",main="St.Residuals (Figure 6)")
abline(0,0,col="red",lwd = 2)
lines(smooth.spline(rstandard(mylogit)),col='blue',lwd=2)
plot(newdata1)
cor(newdata1,use="complete")
## CATEGORY HOURS VALADD CSTMTOT RCPTOT
## CATEGORY 1.00000000 0.12145445 0.15623530 0.05480743 -0.23900068
## HOURS 0.12145445 1.00000000 0.67272611 0.93622122 -0.04725586
## VALADD 0.15623530 0.67272611 1.00000000 0.64324691 -0.05144206
## CSTMTOT 0.05480743 0.93622122 0.64324691 1.00000000 0.11357777
## RCPTOT -0.23900068 -0.04725586 -0.05144206 0.11357777 1.00000000
Any form of high correlation is not evident between the two IVs from the plots here so we can assume minimal or no collinearity. This is further confirmed from the correlation test.
Causality: Both the Independent variables i.e. the production workers annual hours and the total value of shipments and receipts are a proximal and probabilistic cause of the dependent variable. This is because there are numerous other factors that decide whether the annual wage is high or low in the manufacturing industry. As discussed before, it is possible that the number of hours might impact the wage dichotomy in general. Similarly,more the total value of shipments and receipts an industry deals with on a daily basis, higher will be the number of skilled workers it will need.
Sample size: We determined the sample size from Power analysis using G-Star Power and used that sample size for the entire analysis.
Measurement error: This is survey data so we can assume no measurement error. The data is from the Census Bureau’s survey and it is considered as the leading source of quality data about nation’s people and economy so we can safely assume that there is minimum measurement error.
Here we test the interaction effect (if any) in the generalized linear model. There is high probability that this interaction would be negligible, given the low correlation between the two independent variables.
model_interaction <- glm(CATEGORY ~ HOURS*RCPTOT, data=newdata1, family = binomial(logit))
summary(model_interaction)
##
## Call:
## glm(formula = CATEGORY ~ HOURS * RCPTOT, family = binomial(logit),
## data = newdata1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.85091 0.00000 0.05174 0.52438 1.24572
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.421e-01 5.922e-01 -1.253 0.2101
## HOURS 1.577e+00 7.532e-01 2.093 0.0363 *
## RCPTOT -1.662e-07 1.259e-07 -1.320 0.1869
## HOURS:RCPTOT 4.860e-07 3.391e-07 1.433 0.1518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 97.254 on 88 degrees of freedom
## Residual deviance: 50.828 on 85 degrees of freedom
## (31 observations deleted due to missingness)
## AIC: 58.828
##
## Number of Fisher Scoring iterations: 10
As can be seen from the above model, there is no interaction effect between the independent variables. Only the variable ‘HOURS’ is found to be significant which was also evident in our original model.
From the entire logistic regression analysis for the problem defined and stated as the null hypothesis we can conclude that the variability in the Production worker’s annual wage (in the food and textile manufacturing sector only)for the two levels (0 or 1) can be explained by something other than randomization (variability could be attributed to the variability in number of hours).