Abstract: How do university training and subsequent practical experience affect expertise in data science? To answer this question, we developed methods to assess data science knowledge and the competence to formulate answers, construct code to solve problems, and create reports of outcomes. In the cross-sectional study, undergraduate students, trainees in a certified postgraduate data science curriculum, and data scientists with more than 10 years of experience were tested (100 in total: 20 each of novice, intermediate, and advanced university students, postgraduate trainees, and experienced data scientists). We discuss the results against the background of expertise research and the training of data scientists. Important factors for the continuing professional development of data scientists are proposed.

Dataset:

-   Participant type: novice students, intermediate students, advanced university students, postgraduate trainees, and experienced data scientists
-   Competence: an average score of data science knowledge and competence based on a knowledge test and short case studies.

APA write ups should include means, standard deviation/error (or a figure), t-values, p-values, effect size, and a brief description of what happened in plain English.

df <- read.csv("C:/Harrisburg/ANLY 500/Lab 10/10_data.csv")

Data screening:

Assume the data is accurate and that there is no missing data.

Outliers

a.  Examine the dataset for outliers using z-scores with a criterion of 3.00 as p < .001.
b.  Why do we have to use z-scores? 
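Because competence is the only continuous variable in the dataset (participant type is a categorical grouping variable with five levels), a multivariate measure such as Mahalanobis distance is not needed. We standardize competence and flag any score whose z-score is 3.00 or more in absolute value.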

c.  How many outliers did you have?
We don't have any outliers here.

d.  Exclude all outliers.

zscore <- scale(df$competence)
summary(abs(zscore) < 3)

##     V1         
##  Mode:logical  
##  TRUE:100

noout <- subset(df, abs(zscore) < 3)
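As a rough illustration of the z-score screen, the sketch below applies the same |z| < 3 rule to a small made-up vector. The values are hypothetical and are not taken from the lab data.

```r
# Hypothetical scores: fourteen typical values and one extreme value
scores <- c(10, 12, 11, 13, 9, 10, 12, 11, 10, 12, 11, 13, 9, 10, 60)

# scale() centers on the mean and divides by the sample standard deviation
z <- as.numeric(scale(scores))

# Count the values at or beyond the 3.00 criterion, then drop them
n_outliers <- sum(abs(z) >= 3)
screened <- scores[abs(z) < 3]

n_outliers
length(screened)
```

Only the value 60 is flagged here; in the lab data the same rule flags nothing, so `noout` keeps all 100 rows.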

Assumptions

Normality:

a.  Include a picture that you would use to assess multivariate normality. 
b.  Do you think you've met the assumption for normality? 
The assumption for normality is not met.

random = rchisq(nrow(noout), 2)
fake = lm(random ~ ., data = noout)
fitted = scale(fake$fitted.values)
standardized = rstudent(fake)
hist(standardized)

Linearity:

a.  Include a picture that you would use to assess linearity.
b.  Do you think you've met the assumption for linearity?
The assumption for linearity is met.

{qqnorm(standardized)
abline(0,1)}

Homogeneity/Homoscedasticity:

a.  Include a picture that you would use to assess homogeneity and homoscedasticity.
b.  Include the output from Levene's test.
See the Levene's Test for Homogeneity of Variance section of the ANOVA output below. Levene's test is significant, F(4, 95) = 7.14, p < .001.

c.  Do you think you've met the assumption for homogeneity? (Talk about both components here). 
The assumption of homogeneity is not met. Levene's test is significant, and in the plot the residuals are not spread evenly above and below the line.

d.  Do you think you've met the assumption for homoscedasticity?
The assumption of homoscedasticity is not met. The residuals are not spread evenly above and below the horizontal line.

{plot(fitted, standardized)
abline(0,0)
abline(v = 0)}
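Levene's test can be viewed as a one-way ANOVA on each score's absolute deviation from its group center. The sketch below uses base R only, with group medians as the center (an assumption, since implementations differ on mean versus median) and simulated data in place of the lab file, so its numbers are not the lab's results.

```r
set.seed(1)
# Simulated stand-in data: two groups with small spread, one with large spread
group <- factor(rep(c("a", "b", "c"), each = 20))
score <- c(rnorm(20, 50, 2), rnorm(20, 50, 2), rnorm(20, 50, 10))

# Absolute deviation of each score from its own group's median
abs_dev <- abs(score - ave(score, group, FUN = median))

# ANOVA on the deviations: a significant F suggests unequal variances
levene <- anova(lm(abs_dev ~ group))
levene_F <- levene[["F value"]][1]
levene_p <- levene[["Pr(>F)"]][1]
c(F = levene_F, p = levene_p)
```

With three groups of 20, the test has 2 and 57 degrees of freedom, in the same way that the lab's five groups of 20 give 4 and 95.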

Hypothesis Testing:

Run the ANOVA test.

a.  Include the output from the ANOVA test.
b.  Was the omnibus ANOVA test significant?
The omnibus ANOVA test is significant, F(4, 95) = 168.30, p < .001.

library(ez)

## Warning: package 'ez' was built under R version 3.5.3

noout$partno <- 1:nrow(noout) # add participant number
options(scipen = 999)
ezANOVA(data = noout,
        dv = competence,
        between = participant_type,
        wid = partno,
        type = 3,
        detailed = T)

## Warning: Converting "partno" to factor for ANOVA.

## Coefficient covariances computed by hccm()

## $ANOVA
##             Effect DFn DFd      SSn      SSd         F
## 1      (Intercept)   1  95 203761.6 3397.123 5698.1596
## 2 participant_type   4  95  24073.4 3397.123  168.3022
##                                                                                               p
## 1 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000131473
## 2 0.0000000000000000000000000000000000000000032454881097109948592631389896467908329213969409466
##   p<.05       ges
## 1     * 0.9836013
## 2     * 0.8763357
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd      SSn      SSd        F             p p<.05
## 1   4  95 365.6359 1216.221 7.140029 0.00004529845     *

# Welch's one-way ANOVA (does not assume equal variances), used because Levene's test is significant
oneway.test(competence ~ participant_type, data = noout)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  competence and participant_type
## F = 507.48, num df = 4.000, denom df = 44.375, p-value <
## 0.00000000000000022

Calculate the following effect sizes:

a.  $\eta^2$
b.  $\omega^2$

eta = 0.8763357 ##fill in the number here use for power below
eta

## [1] 0.8763357

library(MOTE)

## Warning: package 'MOTE' was built under R version 3.5.3

omega.F(dfm = 4, #this is dfn in the anova
        dfe = 95, #this is dfd in the anova
        Fvalue = 168.3022, #this is F
        n = 100, #look at the number of rows in your dataset
        a = .05) #leave this as .05

## $omega
## [1] 0.8699963
## 
## $omegalow
## [1] 0.8046411
## 
## $omegahigh
## [1] 0.9071609
## 
## $dfm
## [1] 4
## 
## $dfe
## [1] 95
## 
## $F
## [1] 168.3022
## 
## $p
## [1] 0.000000000000000000000000000000000000000003245485
## 
## $estimate
## [1] "$\\omega^2$ = 0.87, 95\\% CI [0.80, 0.91]"
## 
## $statistic
## [1] "$F$(4, 95) = 168.30, $p$ < .001"

Given the $\eta^2$ effect size, how many participants would you have needed to find a significant effect?

If you get an error: “Error in uniroot(function(n) eval(p.body) - power, c(2 + 0.0000000001, : f() values at end points not of opposite sign”:

- This message implies that the effect size is so large that the estimation of sample size has bottomed out. You should assume the required sample size is n = 2 *per group*. Mathematically, ANOVA has to have two people per group, although that's a bad idea for sample size planning due to the assumptions of parametric tests.
- Leave in your code, but comment it out so the document will knit.

# feta = sqrt(eta / (1-eta))
# library(pwr)
# pwr.anova.test(k = 5, n = NULL, f = feta,
#               sig.level = .05, power = .80)
k <- 5
k * 2

## [1] 10
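To see why the estimate bottoms out at two per group, the sketch below converts $\eta^2$ to Cohen's f and checks the power at n = 2 directly. It uses `power.anova.test` from base R's stats package instead of `pwr`, so it runs without extra packages; the rescaling by k / (k - 1) is an assumption about how the two parameterizations relate and is worth double-checking.

```r
eta <- 0.8763357
k <- 5

# Cohen's f from eta squared
f <- sqrt(eta / (1 - eta))

# power.anova.test() expects the variance of the group means computed with
# a k - 1 divisor, so f^2 (a k-divisor ratio) is rescaled by k / (k - 1)
between_var <- f^2 * k / (k - 1)

# Power with the minimum possible sample size of 2 per group
check <- power.anova.test(groups = k, n = 2,
                          between.var = between_var,
                          within.var = 1,
                          sig.level = .05)
check$power
```

The power at n = 2 per group is already above the .80 target, which is why the solver cannot find a smaller sample size that hits .80 exactly.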

Run a post hoc independent t-test with no correction and a Bonferroni correction. Remember, for a real analysis, you would only run one type of post hoc. This question should show you how each post hoc corrects for type 1 error by changing the p-values.

# post hoc tests: no p-value adjustment
# note: var.equal is only passed on to t.test() when pool.sd = FALSE;
# with the default pooled SD (pool.sd = !paired) it has no effect
post1 = pairwise.t.test(noout$competence,
                        noout$participant_type, 
                        p.adjust.method = "none", 
                        paired = F, 
                        var.equal = F)

# post hoc tests: Bonferroni p-value adjustment (same note on var.equal applies)
post2 = pairwise.t.test(noout$competence,
                        noout$participant_type, 
                        p.adjust.method = "bonferroni", 
                        paired = F, 
                        var.equal = F)
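The Bonferroni correction itself is simple arithmetic: each p-value is multiplied by the number of comparisons and capped at 1. The sketch below applies it to two uncorrected p-values reported in this document, assuming all ten pairwise comparisons among the five groups are counted.

```r
# Five groups give choose(5, 2) = 10 pairwise comparisons
m <- choose(5, 2)

# Two uncorrected p-values reported in this document
p_raw <- c(0.1776, 0.00000000000000000311689701274726)

# Bonferroni: multiply each p-value by m and cap the result at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni", n = m)
p_manual <- pmin(1, p_raw * m)

p_bonf
```

The first value is capped at 1, and the second is exactly ten times larger than its uncorrected value, which is the pattern visible between the two post hoc rows of the table.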

Include the effect sizes for only Advanced Students vs Post Graduate Trainees and Intermediate students versus Experienced Data Scientists. You are only doing a couple of these to save time.

M <- with(noout, tapply(competence, participant_type, mean))
stdev <- with(noout, tapply(competence, participant_type, sd))
N <- with(noout, tapply(competence, participant_type, length))

# Advanced Students vs Post Graduate Trainees is 1 and 5
effect1 = d.ind.t(m1 = M[1], m2 = M[5],
                 sd1 = stdev[1], sd2 = stdev[5],
                 n1 = N[1], n2 = N[5], a = .05)
effect1$d

##   advanced 
## -0.5753897

# Intermediate students versus Experienced Data Scientists is 3 and 2
effect2 = d.ind.t(m1 = M[3], m2 = M[2],
                 sd1 = stdev[3], sd2 = stdev[2],
                 n1 = N[3], n2 = N[2], a = .05)
effect2$d

## intermediate 
##   -0.6960641

Create a table of the post hoc and effect size values:

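library(knitr) # provides kable(), used below
tableprint = matrix(NA, nrow = 3, ncol = 3)

##row 1
##fill in where it says NA with the values for the right comparison
##column 2 = Advanced Students vs Post Graduate Trainees
##column 3 = Intermediate students versus Experienced Data Scientists. 
##p.value is a lower-triangular matrix: rows are levels 2-5, columns are levels 1-4
##[4, 1] = postgraduate (level 5) vs advanced (level 1)
##[2, 2] = intermediate (level 3) vs experienced (level 2)
tableprint[1, ] = c("No correction p", post1$p.value[4, 1], post1$p.value[2, 2])

##row 2
tableprint[2, ] = c("Bonferroni p", post2$p.value[4, 1], post2$p.value[2, 2])

##row 3
tableprint[3, ] = c("d value", effect1$d, effect2$d)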

#don't change this
kable(tableprint, 
      digits = 3,
      col.names = c("Type of Post Hoc", 
                    "Advanced Students vs Post Graduate Trainees", 
                    "Intermediate students versus Experienced Data Scientists"))

Type of Post Hoc	Advanced Students vs Post Graduate Trainees	Intermediate students versus Experienced Data Scientists
No correction p	0.00000000000000000311689701274726	0.0000000000000215917839037419
Bonferroni p	0.0000000000000000311689701274726	0.000000000000215917839037419
d value	-0.575389669763759	-0.696064088502951

Run a trend analysis.

a.  Is there a significant trend?  
Yes.

b.  Which type?
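The trend is generally linear: as participants become more advanced or experienced, the competence score is higher (experienced data scientists are somewhat of an exception to this trend). The linear trend is significant, t(95) = -4.92, p < .001, and the quadratic, cubic, and quartic terms in the output are significant as well.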

k <- 5
noout$part = noout$participant_type
contrasts(noout$part) = contr.poly(k)
output2 = aov(competence ~ part, data = noout)
summary.lm(output2)

## 
## Call:
## aov(formula = competence ~ part, data = noout)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.486  -3.769  -0.201   3.018  20.628 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   45.140      0.598  75.486 < 0.0000000000000002 ***
## part.L        -6.583      1.337  -4.923   0.0000035804688304 ***
## part.Q        26.725      1.337  19.987 < 0.0000000000000002 ***
## part.C        17.227      1.337  12.883 < 0.0000000000000002 ***
## part^4        12.220      1.337   9.139   0.0000000000000114 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.98 on 95 degrees of freedom
## Multiple R-squared:  0.8763, Adjusted R-squared:  0.8711 
## F-statistic: 168.3 on 4 and 95 DF,  p-value: < 0.00000000000000022
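One thing to keep in mind when reading polynomial contrasts is that `contr.poly` assigns its weights in the order of the factor levels, and by default levels sort alphabetically. The sketch below uses simulated scores (hypothetical, rising by exactly 10 points per training stage) to show how the same data give different trend coefficients depending on level order.

```r
# Simulated scores that rise by 10 points per training stage (hypothetical)
stages <- c("novice", "intermediate", "advanced", "postgraduate", "experienced")
stage <- rep(stages, each = 4)
score <- rep(c(10, 20, 30, 40, 50), each = 4) + rep(c(-1, 1, -1, 1), times = 5)

# Default factor(): levels sort alphabetically, scrambling the stage order
alpha <- factor(stage)
contrasts(alpha) <- contr.poly(5)

# Explicit levels: contrasts follow the intended progression
ordered_stage <- factor(stage, levels = stages)
contrasts(ordered_stage) <- contr.poly(5)

coef_alpha <- unname(coef(lm(score ~ alpha)))
coef_ordered <- unname(coef(lm(score ~ ordered_stage)))

# Columns: intercept, linear, quadratic, cubic, quartic
round(rbind(alphabetical = coef_alpha, ordered = coef_ordered), 2)
```

With the levels in training order, only the linear term is nonzero. With alphabetical levels, the linear term turns negative and a spurious quadratic term appears, so the signs and sizes of trend coefficients are only interpretable once the level order matches the intended progression.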

Make a bar chart of the results from this study:

a.  X axis labels and group labels
b.  Y axis label
c.  Y axis length: the scale runs 0-100. You can add coord_cartesian(ylim = c(0,100)) to your graph to control the y axis length. 
d.  Error bars
e.  Ordering of groups: Use the factor command to put groups into the appropriate order.

You use the factor command to reorder the levels by only using the levels command and putting them in the order you want. Remember, the levels have to be spelled correctly or it will delete them.
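The warning about spelling can be seen in a small example. The sketch below uses a short made-up vector of labels (hypothetical, not the lab data) to show that `factor()` keeps values whose labels match the requested levels and silently turns everything else into `NA`.

```r
# Hypothetical group labels as they might appear in the data
raw <- c("novice", "advanced", "intermediate", "novice")

# Correct spelling: values are kept and the levels take the requested order
good <- factor(raw, levels = c("novice", "intermediate", "advanced"))

# "advance" is misspelled, so every "advanced" value becomes NA
bad <- factor(raw, levels = c("novice", "intermediate", "advance"))

levels(good)
sum(is.na(bad))
```

Checking `sum(is.na(...))` or `table(..., useNA = "ifany")` right after reordering is a quick way to catch a misspelled level before plotting.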

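library(ggplot2)
# levels sort alphabetically: advanced, experienced, intermediate, novice, postgraduate
lev = sort(unique(as.character(noout$participant_type)))
bargraph = ggplot(noout, aes(participant_type, competence))
bargraph +
  stat_summary(fun = mean, # use fun.y in ggplot2 versions before 3.3.0
               geom = "bar",
               fill = "white",
               color = "black") +
  stat_summary(fun.data = mean_cl_normal, # requires the Hmisc package
               geom = "errorbar",
               position = "dodge",
               width = .2) +
  xlab("Participant Type") +
  ylab("Average Competence") +
  coord_cartesian(ylim = c(0,100)) +
  # limits puts the groups in training order; labels are listed in that same order
  scale_x_discrete(limits = lev[c(4, 3, 1, 5, 2)],
                   labels = c("novice", "intermediate", "advanced", "postgraduate", "experienced"))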

Write up a results section outlining the results from this study. Use two decimal places for statistics (except when relevant for p-values). Be sure to include the following:

a.  A reference to the figure you created (the bar chart): this reference allows you to not have to list every single mean and standard deviation.
As the bar chart shows, novice students have the lowest average competence score, where competence is an average of data science knowledge and competence based on a knowledge test and short case studies. In general, more university training goes with a higher competence score: the average score increases from intermediate students to advanced students to postgraduate trainees. 

b.  Very brief description of study and variables.
The study examines how university training and subsequent practical experience affect expertise in data science. The dataset has two variables. Participant type: novice students, intermediate students, advanced university students, postgraduate trainees, and experienced data scientists. Competence: an average score of data science knowledge and competence based on a knowledge test and short case studies.

c.  The omnibus test value and if it was significant.
The omnibus ANOVA test is significant, F(4, 95) = 168.30, p < .001.

d.  The two post hoc comparisons listed above describing what happened in the study and their relevant statistics. You would only list the post hoc correction values. 
The first post hoc comparison uses no correction; the second uses the Bonferroni p-value adjustment. The p-value for the mean comparison between postgraduate trainees and advanced students increases from 0.1776 to 1 after the correction. With a corrected p-value of 1, we fail to reject the null hypothesis: there is no evidence of a difference in mean competence score between postgraduate trainees and advanced students.

e.  Effect sizes for all statistics.
$\eta^2$ = 0.88
$\omega^2$ = 0.87, 95% CI [0.80, 0.91]
d for Advanced Students vs Post Graduate Trainees = -0.58
d for Intermediate students versus Experienced Data Scientists = -0.70

ANOVA

Jindong Zhao HU ID 267759

2019-08-31