The data used in this recipe was found using the “100+ interesting data sets” webpage. It is part of the R dataset package called “Ecdat,” which various datasets for the purpose of econometrics.
This dataset in particular has data for wages and education level of 4,300 young males in the united states from 1980 to 1987.
# Read in dataset from Ecdat package
#install.packages("Ecdat")
# library(Ecdat)
data(Males, package = "Ecdat")
x <- Males
The original data being examined has 4,360 observations of 12 variables.
head(x)
## nr year school exper union ethn maried health wage
## 1 13 1980 14 1 no other no no 1.197540
## 2 13 1981 14 2 yes other no no 1.853060
## 3 13 1982 14 3 no other no no 1.344462
## 4 13 1983 14 4 no other no no 1.433213
## 5 13 1984 14 5 no other no no 1.568125
## 6 13 1985 14 6 no other no no 1.699891
## industry occupation
## 1 Business_and_Repair_Service Service_Workers
## 2 Personal_Service Service_Workers
## 3 Business_and_Repair_Service Service_Workers
## 4 Business_and_Repair_Service Service_Workers
## 5 Personal_Service Craftsmen, Foremen_and_kindred
## 6 Business_and_Repair_Service Managers, Officials_and_Proprietors
## residence
## 1 north_east
## 2 north_east
## 3 north_east
## 4 north_east
## 5 north_east
## 6 north_east
summary(x)
## nr year school exper
## Min. : 13 Min. :1980 Min. : 3.00 Min. : 0.000
## 1st Qu.: 2329 1st Qu.:1982 1st Qu.:11.00 1st Qu.: 4.000
## Median : 4569 Median :1984 Median :12.00 Median : 6.000
## Mean : 5262 Mean :1984 Mean :11.77 Mean : 6.515
## 3rd Qu.: 8406 3rd Qu.:1985 3rd Qu.:12.00 3rd Qu.: 9.000
## Max. :12548 Max. :1987 Max. :16.00 Max. :18.000
##
## union ethn maried health wage
## no :3296 other:3176 no :2446 no :4286 Min. :-3.579
## yes:1064 black: 504 yes:1914 yes: 74 1st Qu.: 1.351
## hisp : 680 Median : 1.671
## Mean : 1.649
## 3rd Qu.: 1.991
## Max. : 4.052
##
## industry
## Manufacturing :1231
## Trade :1169
## Professional_and_Related Service: 333
## Business_and_Repair_Service : 331
## Construction : 327
## Transportation : 286
## (Other) : 683
## occupation residence
## Craftsmen, Foremen_and_kindred :934 rural_area : 85
## Operatives_and_kindred :881 north_east : 733
## Service_Workers :509 nothern_central: 964
## Clerical_and_kindred :486 south :1333
## Professional, Technical_and_kindred:453 NA's :1245
## Laborers_and_farmers :401
## (Other) :696
# As seen below, there are 452 observations of 12 variables, including: 5 factors and 7 numeric variables.
str(x)
## 'data.frame': 4360 obs. of 12 variables:
## $ nr : int 13 13 13 13 13 13 13 13 17 17 ...
## $ year : int 1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
## $ school : int 14 14 14 14 14 14 14 14 13 13 ...
## $ exper : int 1 2 3 4 5 6 7 8 4 5 ...
## $ union : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ ethn : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ health : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ wage : num 1.2 1.85 1.34 1.43 1.57 ...
## $ industry : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
## $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
## $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...
x$school <- as.numeric(x$school) # assign as a numeric variable
x$school <- cut(x$school, c(2, 7, 11, 17)) #create intervals out of continuous variable
x$school <- as.factor(x$school) #reassign as a factor
x <- x[, c(3,6,9,11,12)] # eliminate all unused variables
The continuous variables in this dataset are years of schooling, years of experience, and the log of the hourly wage.
The response variable being analyzed in this dataset is the log of the hourly wage earned by the individual.
As seen below, the log of the hourly wage ranged from -3.579 to 4.052, with a mean of 1.649.
summary(x$wage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.579 1.351 1.671 1.649 1.991 4.052
The data is originally consisted of 12 variables, but was subseted down to 5 columns: 4 factors, and 1 continuous variable.
There was no randomization in this recipe because the data points are based on observations of wages among young American males.
Response surface methods will be used to analyze which, if any, of the four factors are responsible for variation in the response variable, the log of hourly wage.
The experiment was based on observations, and was therefore not randomized.
The experiment had no replicates or repeated measures.
There was no blocking in this recipe.
The boxplot below shows the log of hourly wages in response to various intervals of schooling. All three of the intervals show relatively similar medians, however, their ranges vary greatly. the 2-7 interval has the smallest range, whereas the 7-ll years interval has the largest range in wages.
par(mfrow=c(1,1))
attach(x)
plot(wage ~ school, xlab = "Interval of Years of School", ylab = "Hourly wages", main="Wages by Years of Schooling" )
The boxplot below shows how log of hourly wages is affected by ethnicity. Similarly to years of schooling, each of the three levels have similar medians, but vary in the ranges of wages. “Other” has the largest range, whereas “black” seems to have the smallest.
plot(wage ~ ethn, xlab = "Ethnicity", ylab = "Hourly wages", main="Wages by Ethnicity" )
The boxplot below shows how occupation may affect the log of hourly wages. As seen below, there seems to be some variation in both the medians and ranges of wages.
plot(wage ~ occupation, xlab = "Occupation", ylab = "Hourly wages", main="Wages by Occupation" )
The boxplot shows variaous residences/locations of the young males, and their associated wages. “Rural area” has the smallest range in wages. “South” seems to be relatively symmetric, with a large range. “Northern central” has a large range and seems to be negatively skewed. All of the medians of the groups, however, tend to be relatively similar.
plot(wage ~ residence, xlab = "Residence", ylab = "Hourly wages", main="Wages by Residence" )
The response surface method below analyzed the log of hourly wages as a response to 4 different factors, with >2 levels.
The first step was to convert the levels of the factors into integers, in order to use the “rsm” function.
# Assign all factors as binary
x$school <- as.character(x$school)
x$school[x$school == "(2,7]"] <- 1
x$school[x$school == "(7,11]"] <- 2
x$school[x$school == "(11,17]"] <- 3
x$ethn <- as.character(x$ethn)
x$ethn[x$ethn == "other"] <- 1
x$ethn[x$ethn == "black"] <- 2
x$ethn[x$ethn == "hisp"] <- 3
x$occupation <- as.character(x$occupation)
x$occupation[x$occupation == "Professional, Technical_and_kindred"] <- 1
x$occupation[x$occupation == "Managers, Officials_and_Proprietors"] <- 2
x$occupation[x$occupation == "Sales_Workers"] <- 3
x$occupation[x$occupation == "Clerical_and_kindred"] <- 4
x$occupation[x$occupation == "Craftsmen, Foremen_and_kindred"] <- 5
x$occupation[x$occupation == "Operatives_and_kindred"] <- 6
x$occupation[x$occupation == "Laborers_and_farmers"] <- 7
x$occupation[x$occupation == "Farm_Laborers_and_Foreman"] <- 8
x$occupation[x$occupation == "Service_Workers"] <- 9
x$residence <- as.character(x$residence)
x$residence[x$residence == "rural_area"] <- 1
x$residence[x$residence == "north_east"] <- 2
x$residence[x$residence == "nothern_central"] <- 3
x$residence[x$residence == "south"] <- 4
x <- na.omit(x)
x$school <- as.integer(x$school)
x$ethn <- as.integer(x$ethn)
x$occupation <- as.integer(x$occupation)
x$residence <- as.integer(x$residence)
Next, a second order response surface was completed using the 4 factors.
#install.packages("rsm")
library("rsm")
## Warning: package 'rsm' was built under R version 3.1.2
wage.rsm <- rsm(wage~SO(ethn, school, occupation, residence), data=x)
summary(wage.rsm)
##
## Call:
## rsm(formula = wage ~ SO(ethn, school, occupation, residence),
## data = x)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47914968 0.46575231 5.3229 1.094e-07 ***
## ethn -0.30036812 0.14965680 -2.0070 0.044831 *
## school -0.81263361 0.29217192 -2.7814 0.005446 **
## occupation -0.04089938 0.03495928 -1.1699 0.242125
## residence 0.15931298 0.11251746 1.4159 0.156907
## ethn:school 0.00268620 0.02836155 0.0947 0.924549
## ethn:occupation 0.00529164 0.00609105 0.8688 0.385048
## ethn:residence -0.02261332 0.01429618 -1.5818 0.113803
## school:occupation 0.00071236 0.00895847 0.0795 0.936626
## school:residence -0.04906397 0.02214647 -2.2154 0.026803 *
## occupation:residence 0.00283332 0.00461751 0.6136 0.539522
## ethn^2 0.08875190 0.02959825 2.9986 0.002734 **
## school^2 0.24393150 0.05646101 4.3204 1.607e-05 ***
## occupation^2 -0.00175769 0.00147352 -1.1929 0.233019
## residence^2 -0.00865855 0.01351586 -0.6406 0.521816
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.1076, Adjusted R-squared: 0.1036
## F-statistic: 26.7 on 14 and 3100 DF, p-value: < 2.2e-16
##
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value
## FO(ethn, school, occupation, residence) 4 85.02 21.2544 83.6452
## TWI(ethn, school, occupation, residence) 6 2.02 0.3366 1.3245
## PQ(ethn, school, occupation, residence) 4 7.93 1.9827 7.8026
## Residuals 3100 787.72 0.2541
## Lack of fit 143 94.70 0.6623 2.8258
## Pure error 2957 693.01 0.2344
## Pr(>F)
## FO(ethn, school, occupation, residence) < 2.2e-16
## TWI(ethn, school, occupation, residence) 0.2424
## PQ(ethn, school, occupation, residence) 2.97e-06
## Residuals
## Lack of fit < 2.2e-16
## Pure error
##
## Stationary point of response surface:
## ethn school occupation residence
## 1.9639364 1.7145060 -7.9458988 0.4774514
##
## Eigenanalysis:
## $values
## [1] 0.246329573 0.090006970 -0.001546365 -0.012523012
##
## $vectors
## [,1] [,2] [,3] [,4]
## ethn 0.01541906 0.99355424 -0.009046895 0.11193914
## school 0.99522013 -0.02579419 0.014650056 0.09304257
## occupation 0.00104265 0.02689393 0.986897301 -0.15908912
## residence -0.09642639 -0.10705765 0.160428324 0.97647497
The null hypothesis for the ANOVA is that variation in the log of hourly wages cannot be explained by anything other than randomization. The p-values above for main effect show that ethnicity and years of schooling are significant for an alpha of 0.05. This is because the p-value is less than alpha, so we must reject the null, and suggest that something other than randomization explains the variation in wage. The interaction between years of school and residence is also significant, with a p-value of 0.027.
The stationary points are 1.96 (ethnicity), 1.715 (school), -7.946 (occupation), and 0.477 (residence).
We will now use contour plots to characterize the stationary points seen in the above summary of the response surfaces.
par(mfrow=c(2,3))
contour(wage.rsm, ~ethn + school + occupation + residence, image=TRUE, at=summary(wage.rsm$canoncial$xs))
The plots above show all of the second order response surfaces of the data. We will now look at each separately.
par(mfrow=c(1,1))
contour(wage.rsm, ~ethn + school, image=TRUE, at=summary(wage.rsm$canoncial$xs))
This second order response surface is shown for ethnicity and school. This appears to be a minima, because the values are increasing from this centralized point.
par(mfrow=c(1,1))
contour(wage.rsm, ~ethn + occupation, image=TRUE, at=summary(wage.rsm$canoncial$xs))
This stationary point seems to be a falling ridge.
par(mfrow=c(1,1))
contour(wage.rsm, ~ethn + residence, image=TRUE, at=summary(wage.rsm$canoncial$xs))
This stationary point is slightly more ambiguous. It appears as though it migh be a ridge, or part of a saddle point.
par(mfrow=c(1,1))
contour(wage.rsm, ~school + occupation, image=TRUE, at=summary(wage.rsm$canoncial$xs))
This stationary point seems to be also a falling ridge.
par(mfrow=c(1,1))
contour(wage.rsm, ~school + residence, image=TRUE, at=summary(wage.rsm$canoncial$xs))
This stationary point is a ridge.
par(mfrow=c(1,1))
contour(wage.rsm, ~occupation + residence, image=TRUE, at=summary(wage.rsm$canoncial$xs))
This stationary point seems to be a maxima on the bottom left corner of the plot.
Assumptions for ANOVA tests include that the data is normally distributed.
shapiro.test(x$wage)
##
## Shapiro-Wilk normality test
##
## data: x$wage
## W = 0.9525, p-value < 2.2e-16
The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the log of hourly wages is not normally distributed.
par(mfrow = c(1,1))
qqnorm(residuals(wage.rsm))
qqline(residuals(wage.rsm))
The Q-Q Normality Plots of the residuals for the model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.
plot(fitted(wage.rsm),residuals(wage.rsm))
The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the model do not appear to be symmetric over the length of the dynamic range, but seem to be mostly symmetric over 0.
Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.
A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.
With the kruskal-wallis test, we could only analyze the main effect of one factor at a time. I chose to analyze ethnicity and years of schooling.
kruskal.test(wage ~ ethn, data=x)
##
## Kruskal-Wallis rank sum test
##
## data: wage by ethn
## Kruskal-Wallis chi-squared = 41.1272, df = 2, p-value = 1.173e-09
The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in wage cannot be explained by anything other than randomization. This suggests that ethnicity may explain variation in the log of hourly wages.
kruskal.test(wage ~ school, data=x)
##
## Kruskal-Wallis rank sum test
##
## data: wage by school
## Kruskal-Wallis chi-squared = 241.1581, df = 2, p-value < 2.2e-16
The p-value for the factor, years of schooling, of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in wages cannot be explained by something other than randomization, and suggest an alternative - that the randomization can be explained by something other than randomization.