Remember our in-class survey?
We've already looked at how to read in the data from Google and reformat it to make it easier to apply the operations we have learned in R.
You typically won't want to include this part in your report. I've consolidated all the commands here. If you look at the R fences in the source Rmd document, you'll see a statement echo=FALSE which signals to the system not to print out the commands. They get executed in the background.
An important component of modern statistics is computational. Accordingly, we invest a considerable amount of time and effort — both for students and faculty — in using computation intensively in our courses. This survey attempts to examine student attitudes toward computation. (Of course, mainly, it's a simple survey for collecting data to illustrate data processing and writing up a survey report.)
*You should also include one or more hypotheses here …*

We had several hypotheses in designing the survey:

* Students who think that skills in data analysis are important are more likely to be inclined to study more about computing, unless they think it's beyond their ability.
* Students with a higher current skill level are more likely to be inclined to study more about computing.

Don't be afraid to state hypotheses that you think are obvious. Even if they're obvious, you'll still want to try to demonstrate them from your data.
Describe how you distributed your survey
This survey was displayed in class in Math 155 and students were given time to complete it during the class.
Give a short description of the important variables individually. You don't need to be exhaustive; just orient the reader to the important bits.
The students answering the survey were primarily in the natural and social sciences:
barchart(tally(~Division, data = d, margins = FALSE, format = "count"), auto.key = TRUE)
Many students perceive themselves as lacking ability in computing (“Unable”):
barchart(tally(~StudyMore, data = d, margins = FALSE, format = "count"), auto.key = TRUE)
Make a simple graphic to display how students regard the importance of computing in data analysis.
# Your graphics statements go here
You may sometimes want to drop some ill-populated levels
Natural and social science students seem to be about the same regarding their inclination to study more.
dd = droplevels(subset(d, !Division %in% c("Art", "Hum")))
mosaicplot(Division ~ StudyMore, data = dd, las = 2, col = rainbow(5))
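To see why `droplevels()` is wrapped around `subset()`, here is a minimal sketch on a toy data frame. The rows are made up for illustration and are not the survey responses; only the column names and the level labels mentioned in this report are reused.

```r
# Toy data, made up for illustration; not the survey responses.
toy <- data.frame(
  Division  = factor(c("Sci", "SS", "Art", "Sci", "SS", "Hum")),
  StudyMore = factor(c("WillDo", "Unable", "WillDo", "Unable", "WillDo", "Unable"))
)

# subset() removes the rows but keeps the now-empty factor levels.
kept <- subset(toy, !Division %in% c("Art", "Hum"))
levels(kept$Division)      # still lists "Art" and "Hum"

# droplevels() discards levels that no longer occur in the data.
trimmed <- droplevels(kept)
levels(trimmed$Division)   # only the levels that still have rows
table(trimmed$Division)
```

Without the `droplevels()` step, the empty levels would still show up as zero-width categories in plots and tables.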
Make a simple graphic that's relevant to the two hypotheses given above
# Your graphics statements go here
Does the inclination to study more depend on the student's rating of the importance of data analysis? Here's a logistic regression model of whether a student will study more based on division and whether they think computation is important in data analysis:
mod = glm( StudyMore=="WillDo" ~ as.numeric(Data) + Division,
data=d, family="binomial")
The regression table:
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -16.9672 | 2399.5448 | -0.01 | 0.9944 |
| as.numeric(Data) | 0.4011 | 0.4408 | 0.91 | 0.3629 |
| DivisionSci | 14.8617 | 2399.5450 | 0.01 | 0.9951 |
| DivisionSS | 14.3239 | 2399.5451 | 0.01 | 0.9952 |
The coefficient on “Data” (the importance of computation in data analysis) is positive. This suggests that students with a higher estimation of importance are more likely to be inclined to study more computation. But the p-value (0.36) is too high for us to reject the null hypothesis.
Note that as.numeric(Data) was used so that the ordinal properties of the variable were considered. This treats the ordinal variable as quantitative.
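A small sketch of what `as.numeric()` does to an ordered factor may help. The level labels below are hypothetical; the survey's actual labels for `Data` may differ.

```r
# Hypothetical rating levels; the survey's actual labels for Data may differ.
rating <- factor(c("High", "Low", "Medium", "High"),
                 levels = c("Low", "Medium", "High"), ordered = TRUE)

# as.numeric() returns the integer position of each response in the level order.
codes <- as.numeric(rating)
codes

# Without explicit levels, R sorts them alphabetically, so the codes
# no longer follow the intended ranking.
alpha_codes <- as.numeric(factor(c("High", "Low", "Medium", "High")))
alpha_codes
```

Using the codes as a predictor assumes that each step between adjacent levels has the same effect, so it is worth checking that the factor's levels are in the intended order before fitting the model.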
You may want to modify your models. For instance, Division doesn't show up as significant in the above. You do not need to include every model you try in your write-up. But give a short summary, e.g. Division doesn't seem to be related to inclination to study more computing.
Build and interpret a model of whether students with a higher skill level are more likely to intend to study more computing.
If your p-values are too large to reject the null, it's helpful to give some guidance to future researchers. Select a sample size that will give you a p-value of 0.01 and report that. To do this, you'll need to vary the sample size until you find one that works reliably. You don't have to show the calculations you do, just give the result. (Your instructor can check it out by using that sample size!)
Here's an example of the calculation:
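# Resample the survey rows with replacement to mimic a larger sample
# (resample() comes from the mosaic package).
largerSample = resample(d,size=1000)
mod = glm( StudyMore=="WillDo" ~ as.numeric(Data) + Division,
data=largerSample, family="binomial")
summary(mod)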
##
## Call:
## glm(formula = StudyMore == "WillDo" ~ as.numeric(Data) + Division,
## family = "binomial", data = largerSample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.025 -0.876 -0.778 1.338 1.862
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.9558 352.9858 -0.05 0.96
## as.numeric(Data) 0.3897 0.0882 4.42 1e-05 ***
## DivisionSci 14.0277 352.9858 0.04 0.97
## DivisionSS 13.3572 352.9859 0.04 0.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1150.4 on 957 degrees of freedom
## Residual deviance: 1106.5 on 954 degrees of freedom
## (42 observations deleted due to missingness)
## AIC: 1115
##
## Number of Fisher Scoring iterations: 14
Remember, you don't have to include the table in your report, just the conclusion.
Repeat the above lines to find a small sample size that gives a p-value of about 0.01.
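One way to organize the repeat-and-check procedure is sketched below. It is only an illustration under stated assumptions: the data frame `sim` is simulated as a stand-in for `d`, the helper `p_for_size()` is a hypothetical name, the model is simplified to a single predictor, and base R's `sample()` is used in place of `resample()` so the sketch runs without extra packages.

```r
# Simulated stand-in for the survey data frame `d`; the values are made up.
set.seed(1)
n0 <- 60
sim <- data.frame(Data = sample(1:5, n0, replace = TRUE))
sim$WillDo <- rbinom(n0, 1, plogis(-1.5 + 0.4 * sim$Data))

# Hypothetical helper: resample rows with replacement to the requested size,
# refit the model, and return the p-value on the Data coefficient.
p_for_size <- function(data, size) {
  rows <- sample(nrow(data), size, replace = TRUE)
  fit <- glm(WillDo ~ Data, data = data[rows, , drop = FALSE], family = "binomial")
  coef(summary(fit))["Data", "Pr(>|z|)"]
}

# For each candidate size, run several trials and record the median p-value.
sizes <- c(100, 200, 400, 800)
median_p <- sapply(sizes, function(s) median(replicate(20, p_for_size(sim, s))))
data.frame(size = sizes, median_p = median_p)
```

Because each resample is random, a single trial can be misleading; summarizing several trials per size is one way to judge whether a size "works reliably".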
Summarize your conclusions briefly here. You don't need to present more statistical analysis; you've already done that.
Only students with computers in class were able to do the survey. Perhaps students without computers find computing less important, so the results may be biased toward students who have a stronger interest in computing.
Very few arts and humanities students are enrolled in the class.
State weaknesses in your methodology. This won't detract from your grade; indeed, quite the opposite.