Consider some data AT&T collected on different programming languages. The company wanted to know which variables affect work hours and, in particular, whether it was worth prioritizing certain programming languages based on the work hours needed to complete typical programs. Writing \(\mu_i\) for the mean work hours of programs written in language \(i\) (\(i = 1, \dots, 4\)), the hypotheses being tested are:
\[\begin{aligned} H_0 & : \mu_1 = \mu_2 = \mu_3 = \mu_4 \\ H_a & : \mu_i \ne \mu_j \text{ for at least one pair } i \ne j \end{aligned}\]
# Data source: http://ww2.amstat.org/publications/jse/datasets/aptness.txt
prog<-read.table(header=TRUE, text="Point_Count Work_Hours OS DMS Language
1059 15000 1 5 1
234 1850 1 5 1
1533 13033 1 5 1
339 11742 1 2 1
205 283 1 5 3
420 17992 1 1 1
2618 36420 1 1 1
749 24700 1 2 4
126 1640 1 1 3
185 1491 1 5 4
713 6761 1 1 4
376 2495 1 5 1
724 3633 1 5 3
306 5000 1 2 3
315 2550 1 4 1
1734 4489 1 5 4
144 9657 1 1 1
881 19283 1 1 1
911 8206 1 1 1
600 14380 1 1 4
655 28993 1 1 1
2924 38608 1 1 1
1238 20815 1 1 4
3472 72219 1 1 1
119 3499 1 1 1
442 12096 1 1 4
452 11702 1 1 4
321 2735 1 5 1
805 14399 1 1 1
313 15819 1 1 4
202 5189 1 1 4
957 22420 1 1 4
426 7591 1 1 1
1105 15550 1 1 1
868 27800 1 2 3
1022 3684 1 1 4
390 10850 1 5 1
105 3415 1 2 1
746 18853 1 2 1
1491 38878 1 1 4
193 996 1 1 4
1815 19059 1 2 1
171 3800 1 2 1
719 26822 1 2 1
596 11402 1 2 1
695 8848 1 2 3
367 5091 1 1 1
301 2032 1 1 4
220 7958 1 2 4
369 3962 1 1 1
137 2407 1 1 1
146 2281 1 1 4
221 1628 1 1 4
672 4887 1 1 4
422 8260 1 1 1
121 638 1 5 3
318 5528 1 1 3
892 35555 1 2 4
181 2600 1 5 1
163 15508 1 1 1
653 1940 0 5 2
900 6163 0 5 2
172 2248 0 3 2
139 3153 0 3 2
549 7731 0 3 4
1339 10288 0 3 2
632 9857 0 1 2
999 13849 0 3 4
1137 18000 0 1 2
654 21819 0 5 1
342 2100 0 3 2
109 1264 0 5 2
360 3550 0 5 2
3290 50335 0 5 2
496 4884 0 4 1
389 2760 0 4 2
534 2000 0 3 4
1230 1393 0 3 4
268 4500 0 3 2
1190 25360 0 5 1
105 1400 0 3 4
328 3127 0 5 2
177 1558 0 3 2
273 6215 0 3 2
124 470 0 5 2
111 1086 0 5 2
355 980 0 5 2
321 1330 0 5 2
206 597 0 5 2
102 543 0 5 2
130 566 0 5 2
164 1840 0 3 2
278 1360 0 3 2
1391 31581 0 1 2
499 3998 0 3 2
195 2193 0 4 2
243 3940 0 3 4
145 2301 0 4 2
280 3288 0 4 2
362 6271 0 3 2
694 1474 0 3 4
212 2333 0 5 2
1325 14323 0 3 4
227 6578 0 3 2")
prog<-as.data.frame(prog)
Define Language as a factor and draw a boxplot of work hours by language:
prog$Language<-factor(prog$Language, labels=c("COBOL","PLI","C","Other"))
head(prog)
library(ggplot2)
p <- ggplot(prog, aes(Language, Work_Hours))
p + geom_boxplot()
Work hours are strongly right-skewed, so take logs to bring the data closer to normality:
# Create a log-transformed variable to reduce the right skew (non-normality)
prog$log_Work_Hours<-log(prog$Work_Hours)
p <- ggplot(prog, aes(Language, log_Work_Hours))
p + geom_boxplot()
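One-way ANOVA of log work hours on language:
# Store the model as `fit` so it does not mask the built-in anova() function
fit <- aov(log_Work_Hours ~ Language, data = prog)
summary(fit)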
             Df Sum Sq Mean Sq F value   Pr(>F)
Language      3  27.26   9.087   7.673 0.000115 ***
Residuals   100 118.42   1.184
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
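To see where the F value and p-value in the table come from, the sketch below rebuilds them from the printed sums of squares and degrees of freedom. The variable names are illustrative, and the inputs are the rounded figures from the table, so the results should agree with the printed ones only up to rounding.

```r
# Values copied from the ANOVA table above (rounded as printed).
ss_language  <- 27.26   # between-language sum of squares
ss_residuals <- 118.42  # within-language (residual) sum of squares
df_language  <- 3       # 4 languages - 1
df_residuals <- 100     # 104 programs - 4 languages

# Mean squares are sums of squares divided by their degrees of freedom.
ms_language  <- ss_language / df_language
ms_residuals <- ss_residuals / df_residuals

# The F statistic is the ratio of the two mean squares.
f_value <- ms_language / ms_residuals

# Pr(>F) is the upper-tail probability of the F(3, 100) distribution.
p_value <- pf(f_value, df_language, df_residuals, lower.tail = FALSE)

print(c(F_value = f_value, p_value = p_value))
```

Because the p-value is far below 0.05, the null hypothesis of equal mean log work hours across the four languages is rejected at the 5% level.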
Bartlett’s test for homogeneity of variance (the null hypothesis is that all four languages share the same variance):
bartlett.test(log_Work_Hours~Language,data=prog)
Bartlett test of homogeneity of variances
data: log_Work_Hours by Language
Bartlett's K-squared = 2.6947, df = 3, p-value = 0.4411
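Bartlett's statistic is compared against a chi-squared distribution with (number of groups - 1) degrees of freedom. As a rough check, the sketch below recomputes the p-value from the printed K-squared; the variable names and the 0.05 threshold are assumptions made for illustration, not part of the original analysis.

```r
# Values copied from the bartlett.test() output above.
k_squared   <- 2.6947  # Bartlett's K-squared
df_bartlett <- 3       # 4 languages - 1

# Upper-tail chi-squared probability gives the test's p-value.
p_value <- pchisq(k_squared, df = df_bartlett, lower.tail = FALSE)

# Assumed significance level for the decision (hypothetical choice).
alpha <- 0.05
equal_variance_plausible <- p_value > alpha

print(p_value)
print(equal_variance_plausible)
```

A p-value this large gives no evidence against equal variances, so the constant-variance assumption behind the ANOVA looks reasonable on the log scale. Bartlett's test is sensitive to departures from normality, which is one more reason to run it on the log-transformed variable.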