You should submit the R file (R script or Rmd file) by the end of the midterm via this link.
Note 1: you can use either base R plotting functions or ggplot2 for creating graphs. If your graphs look polished (accurate titles, labels, colors, etc.), you can earn 1-2 bonus points for each problem: 1 for polished base R graphs, 2 for ggplot2 ones.
Note 2: you can use either base R functions or dplyr functions for data handling. If you use dplyr, you can get 1 extra point.
The data stored in survey01.csv contain the results of a survey conducted in the bachelor's programme in psychology. First-year students were asked to provide some personal information (except name) and to participate in a small experiment. Participants had to estimate the length of an interval and the size of an angle shown on the whiteboard. Then the absolute values of deviations from the correct answers (for length and angle) were recorded.
Variables:
height: student’s height (in centimeters);
math: score for the Math exam (Russian State Exam);
bio: score for the Biology exam (Russian State Exam);
sex: student’s sex (1 for females, 2 for males; 3 and 22 are strange values);
subject: favourite subject at school (1 for Math, 2 for Biology, 3 for Russian language, 4 for Foreign language, 5 for none of the suggested);
residence: student’s residence (1 - Moscow, 2 - Not Moscow, 3 - Other);
soft: software chosen for data analysis during labs (R or SPSS);
len_dev: the deviation from the correct answer for length (in centimeters);
ang_dev: the deviation from the correct answer for angle (in degrees).
Load the data from survey01.csv and save it to surv.
[...] surv.
Keep only those rows that correspond to students from Moscow and not from Moscow. Save changes to surv.
Keep only those rows that correspond to students who specified their sex correctly (only 1 and 2, not 3 or 22). Save changes to surv.
Exclude rows that correspond to students who claimed that the length of the interval and the size of the angle are equal to 0. Save changes to surv.
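The residence and sex filters above can be sketched in base R as follows. This is only an illustration on a made-up data frame: the values are invented, and the column names are assumed to match the variable list (the real file may differ).

```r
# Toy data frame standing in for surv (all values are invented)
surv <- data.frame(
  sex       = c(1, 2, 3, 22, 1, 2),
  residence = c(1, 2, 1, 3, 2, 1),
  len_dev   = c(0.5, 1.2, 0.3, 2.0, 0.4, 0.8),
  ang_dev   = c(5, 10, 3, 7, 6, 12)
)

# Keep students whose residence is Moscow (1) or Not Moscow (2)
surv <- surv[surv$residence %in% c(1, 2), ]

# Keep students with a valid sex code (1 or 2)
surv <- surv[surv$sex %in% c(1, 2), ]

nrow(surv)
```

With dplyr, the same row selection could be written with `filter()`, for example `filter(surv, residence %in% c(1, 2))`.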
Create a histogram of height. Describe this distribution in words: say whether it is symmetric or not and, if it is not symmetric, state whether it is right-skewed or left-skewed.
Judging by this histogram, can we say that there are outliers in this sample of students? Explain your answer.
Create a boxplot of length. Are there outliers in the data? If yes, are they ‘natural’ (really extreme values), or might they have occurred as a result of a mistake?
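As a rough sketch of the plotting steps above, the base R functions `hist()` and `boxplot()` can be used, and `boxplot.stats()` lists the points that a boxplot would draw as outliers. The vectors below are invented for illustration and are not taken from the survey data.

```r
# Invented heights (cm) and length deviations (cm), for illustration only
height  <- c(158, 160, 162, 165, 165, 167, 168, 170, 170,
             171, 172, 174, 175, 178, 180, 183, 190)
len_dev <- c(0.5, 0.8, 1.0, 1.2, 1.5, 0.7, 0.9, 1.1, 25)

# Histogram with a title and axis label; hist() also returns the bin counts
h <- hist(height, main = "Student height", xlab = "Height (cm)", col = "grey80")

# Boxplot of the deviations
boxplot(len_dev, main = "Deviation for length", ylab = "Deviation (cm)")

# Points lying more than 1.5 IQR beyond the hinges
out <- boxplot.stats(len_dev)$out
out
```

Whether a flagged point is a genuine extreme value or a data-entry mistake is a judgment about plausibility, which the code cannot make.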
Imagine that you are asked to check whether the deviation from the correct answer to the question about an interval length is different for students who chose R and for those who chose SPSS. You are going to perform formal hypothesis testing.
Formulate the null hypothesis you are going to test. Write it as a comment.
Choose a suitable test and perform it in R. Report the R code and the output.
Make a statistical conclusion: decide whether your null hypothesis is rejected at the 5% significance level. Write it as a comment.
Make a substantive conclusion: decide whether the deviation from the correct answer differs for R users and SPSS users. Write it as a comment.
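One possible way to carry out such a comparison is a two-sample t-test. The sketch below uses invented data and assumes the columns are named `soft` and `len_dev` as in the variable list. If the deviations are strongly skewed, a rank-based alternative such as `wilcox.test()` may be more suitable.

```r
# Invented data: six R users and six SPSS users
surv <- data.frame(
  soft    = rep(c("R", "SPSS"), each = 6),
  len_dev = c(0.5, 1.2, 0.8, 1.0, 0.7, 0.9,
              1.1, 1.6, 0.9, 1.4, 1.3, 1.0)
)

# H0: the mean deviation is the same for R users and SPSS users
# t.test() with a formula runs Welch's two-sample t-test by default
res <- t.test(len_dev ~ soft, data = surv)
res

# Statistical conclusion at the 5% significance level
reject_h0 <- res$p.value < 0.05
reject_h0
```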
Calculate a 90% confidence interval for the mean value of length.
Provide an interpretation of the confidence interval. Calculate its length and report it. Write it as a comment.
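A minimal sketch of the confidence interval calculation, on an invented vector: `t.test()` with `conf.level = 0.90` returns a t-based interval for the mean, and the same interval can be rebuilt by hand from the sample mean, the standard error and the 0.95 quantile of the t distribution.

```r
# Invented deviations (cm), for illustration only
len_dev <- c(0.5, 1.2, 0.8, 1.0, 0.7, 0.9, 1.6, 1.4)

# 90% confidence interval for the mean
ci <- t.test(len_dev, conf.level = 0.90)$conf.int
ci

# Length of the interval: upper bound minus lower bound
ci_width <- diff(ci)
ci_width

# The same interval by hand: mean +/- t quantile * standard error
n      <- length(len_dev)
half   <- qt(0.95, df = n - 1) * sd(len_dev) / sqrt(n)
manual <- mean(len_dev) + c(-1, 1) * half
manual
```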
Create a scatterplot that will show the association between length and angle. Comment on the direction of association between variables and its strength. Write it as a comment.
Which correlation coefficient seems to be suitable to measure the association between length and angle? Choose an appropriate coefficient, calculate it and test its significance. Report the R code you used.
Can you conclude that these two variables are associated? Explain your answer. Write it as a comment.
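The scatterplot and correlation steps above might look like the sketch below. The data are invented, and the column names `len_dev` and `ang_dev` are assumed from the variable list. `cor.test()` computes the coefficient and tests its significance in one call; the `method` argument selects Pearson (the default), Spearman or Kendall, and the choice depends on what the scatterplot shows.

```r
# Invented deviations for eight students
surv <- data.frame(
  len_dev = c(0.5, 1.2, 0.8, 1.0, 0.7, 0.9, 1.6, 1.4),
  ang_dev = c(4, 11, 7, 9, 8, 5, 15, 12)
)

# Scatterplot of the two deviations
plot(surv$len_dev, surv$ang_dev,
     main = "Deviation for length vs. deviation for angle",
     xlab = "Length deviation (cm)", ylab = "Angle deviation (degrees)")

# Pearson's coefficient: suited to a roughly linear association
pearson <- cor.test(surv$len_dev, surv$ang_dev, method = "pearson")

# Spearman's coefficient: rank-based, less sensitive to outliers
spearman <- cor.test(surv$len_dev, surv$ang_dev, method = "spearman")

pearson$estimate
spearman$estimate
spearman$p.value
```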
Imagine you have to check whether the software preferred (R or SPSS) depends on the favourite subject at school.
Choose an appropriate test to check this, perform it and make a statistical conclusion (reject/do not reject) and a substantive conclusion (depends/does not depend).
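For two categorical variables, one common option is the chi-squared test of independence applied to a contingency table. The sketch below uses invented data with only three subject codes for brevity. If some expected counts are small, `fisher.test()` is an alternative worth considering.

```r
# Invented data: 30 R users and 30 SPSS users, subject codes 1-3 only
surv <- data.frame(
  soft    = rep(c("R", "SPSS"), each = 30),
  subject = c(rep(c(1, 2, 3), times = c(12, 8, 10)),
              rep(c(1, 2, 3), times = c(6, 14, 10)))
)

# Contingency table: software by favourite subject
tab <- table(surv$soft, surv$subject)
tab

# H0: software choice does not depend on favourite subject
res <- chisq.test(tab)
res

# Expected counts under H0, useful for checking the test's assumptions
res$expected
```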