A Second R Session

Author

Dr Andrew Dalby

Student Data

I have collected simple survey data using a Google Form from my students for the last 10 years. I now have a dataset that contains more than 1500 submissions.

The data is available from: https://drive.google.com/file/d/1fFgZRNwpmDwci3zIS4h_UG0qxAmboRm7/view?usp=sharing

This data has been edited in Excel to remove questionable results such as heights that are either too large or too small, non-integer numbers of siblings (half is allowed), and excessive numbers of siblings. These have been replaced with the missing value NA, which is recognised by R but not by SPSS, which requires a numerical missing value code for numerical data.

The file was then saved as a comma-separated values (CSV) file, which is a plain text file where the columns of data are separated by commas. This is one of the most flexible and portable formats for data, as it can be read by almost any statistics package or spreadsheet.
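As a minimal sketch of how R treats NA in a CSV file, the example below reads a few made-up rows from a text string rather than from the survey file. The column names and values are illustrative assumptions and are not taken from the real data.

```r
# Hypothetical three-row CSV held in a string (not the real survey data)
csv.text <- "height,siblings
1.70,2
NA,0.5
1.62,NA"

# read.csv() treats the string NA as a missing value by default
toy <- read.csv(text = csv.text, header = TRUE)

# Both columns are still numeric even though they contain NA
str(toy)

# Count the missing values in each column
colSums(is.na(toy))
```

The same call with a file name in place of text= reads a file from disk.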

Importing the Data

The first thing that you want to do is import the data into R by reading the csv file and creating a data frame for the imported data called student.

student <- read.csv("Edited Simple Class Survey.csv", header=TRUE)

# The data is imported with the original column names from the survey.
# These need to be changed to something shorter and easier to use.
# First I create a vector of the new column names
# Vectors do not have to be numeric in R

names <- c("gender","height","colour","month","siblings", "job", "degree", "positivity")

# Then I apply this vector to the data frame that I imported.
colnames(student) <- names

# The columns of the data frame student can now be accessed individually in order to calculate summary statistics using the new column names.

mean(student$height)

[1] NA

The mean of the height is returned as NA because the height column contains missing values: some students entered incorrect data, which had to be replaced with NA. To calculate the summary statistics you need to tell R to ignore the missing values by setting na.rm=TRUE. For the core graphical outputs R will automatically ignore the missing values.

mean(student$height, na.rm=TRUE)

[1] 1.659132

sd(student$height, na.rm=TRUE)

[1] 0.1000675
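The effect of na.rm can be seen more easily on a small vector. This is only a sketch using three made-up heights, not values from the survey.

```r
# Made-up heights in metres, with one missing value
h <- c(1.60, NA, 1.80)

mean(h)                # NA, because one value is unknown
mean(h, na.rm = TRUE)  # 1.7, the mean of the two known values
sum(is.na(h))          # 1, the number of missing values
```

Applying sum(is.na(...)) to a column such as student$height would show how many entries were replaced with NA.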

hist(student$height, col="cornflowerblue", main="Histogram of Student Heights", xlab="Student Height (m)", breaks=30)

In this dataset gender, favourite colour and month of birth are all categorical variables and they can be treated as factors. As favourite colours were chosen from a drop-down menu on the survey there is a fixed number of possibilities. We refer to these as the levels of the factor. I could recode the words as numbers corresponding to these levels, and that is something that I would recommend doing in SPSS in order to make analysis easier. In R recoding is less essential because R can work with either text or numbers. You might want to recode for efficiency but that is up to you.
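As a rough illustration of factors and levels, the sketch below converts a short made-up vector of colours into a factor. The vector is an assumption for the example and is not part of the survey data.

```r
# Made-up favourite colours (illustrative only)
colour <- c("Blue", "Red", "Blue", "Green", "Red", "Blue")

# Convert the text to a factor; levels are sorted alphabetically by default
colour.f <- factor(colour)
levels(colour.f)      # "Blue" "Green" "Red"

# The underlying integer codes, one per level
as.integer(colour.f)  # 1 3 1 2 3 1

# Counts per level
table(colour.f)
```

The integer codes are the kind of numeric recoding described above, and R keeps the text labels attached to them.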

What I can do is look to see if there is a difference in the mean heights between the two levels of the factor gender that I have set (all other genders are coded as missing for this example).

male.h <- student$height[student$gender=="Male"]
female.h <- student$height[student$gender=="Female"]
mean(female.h, na.rm=TRUE)

[1] 1.628264

mean(male.h, na.rm=TRUE)

[1] 1.763839
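The two group means can also be calculated in a single step with tapply, which applies a function to a variable separately for each level of a factor. The sketch below uses a small made-up data frame with the same column names as student; the values are illustrative assumptions.

```r
# Made-up data with the same column names as student (values are illustrative)
toy <- data.frame(
  gender = c("Male", "Female", "Female", "Male", "Female", NA),
  height = c(1.80, 1.60, NA, 1.70, 1.64, 1.75)
)

# Mean height for each level of gender, ignoring missing heights.
# Rows where gender itself is NA do not belong to any group and are left out.
group.means <- tapply(toy$height, toy$gender, mean, na.rm = TRUE)
group.means
```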

Looking at the two means, there does seem to be a difference between the heights for the two levels of the gender factor. This should be checked using a statistical test. The test for a difference in means between two sets of data is a t-test, which can be carried out in R as follows.

t.test(male.h,female.h)


    Welch Two Sample t-test

data:  male.h and female.h
t = 25.189, df = 511.28, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1250016 0.1461500
sample estimates:
mean of x mean of y 
 1.763839  1.628264

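This t-test result shows that there is indeed a difference. The p-value is reported as less than 2.2e-16, which means that if there were really no difference in mean height between the two genders, the probability of seeing a difference at least this large just through chance in sampling would be below 0.00000000000000022.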
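The values in the printed output can also be extracted from the object that t.test returns, which is useful if you want to reuse them in later calculations. The following is a sketch on two small made-up groups of heights, not the survey data.

```r
# Made-up heights in metres for two small groups (illustrative only)
male.toy   <- c(1.75, 1.80, 1.70, 1.78, 1.82)
female.toy <- c(1.60, 1.65, 1.58, 1.63, 1.67)

# Store the result instead of just printing it
result <- t.test(male.toy, female.toy)

result$p.value    # the p-value as an ordinary number
result$conf.int   # 95% confidence interval for the difference in means
result$estimate   # the two sample means
```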

We could have performed the t-test on the original data frame without having to create the two different datasets for male and female heights. This uses the ~ symbol, which R uses in a formula to say that the variable on the left depends on the variable on the right. In this case one is a factor and the other a continuous variable, so we can run a t-test. If both were continuous variables then we would use the same symbol to carry out linear regression using the lm function.
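For the regression case, a minimal sketch of the same formula syntax with lm is given below. The variables x and y are hypothetical continuous measurements invented for the example and do not come from the survey.

```r
# Two made-up continuous variables (illustrative only)
toy <- data.frame(
  x = 1:5,
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# y ~ x reads as "y depends on x"
fit <- lm(y ~ x, data = toy)

# Intercept and slope of the fitted line
coef(fit)
```

The t-test version of the formula syntax, applied to the survey data, is: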

# t.test has no na.rm argument; missing values are dropped automatically
t.test(student$height~student$gender)


    Welch Two Sample t-test

data:  student$height by student$gender
t = -25.189, df = 511.28, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.1461500 -0.1250016
sample estimates:
mean in group Female   mean in group Male 
            1.628264             1.763839

Notice that in this case the sign of the difference has changed. With the formula, R orders the levels of gender alphabetically, so female is the reference group and male is the comparison group, whereas in the first test it was the other way around. This also changes the sign of the test statistic t. For a two-sided test like this one the p-value is the same either way, so it is the size/magnitude that matters most and not the sign/direction.
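If you want to control which group comes first, you can set the order of the factor levels yourself. The sketch below shows this on a small made-up data frame; the values and the column name gender2 are assumptions for the example.

```r
# Made-up data (illustrative only)
toy <- data.frame(
  gender = rep(c("Male", "Female"), each = 5),
  height = c(1.75, 1.80, 1.70, 1.78, 1.82, 1.60, 1.65, 1.58, 1.63, 1.67)
)

# Default order is alphabetical, so the difference is Female minus Male
by.default <- t.test(height ~ gender, data = toy)

# Put Male first, so the difference is Male minus Female
toy$gender2 <- factor(toy$gender, levels = c("Male", "Female"))
relevelled <- t.test(height ~ gender2, data = toy)

by.default$statistic   # negative
relevelled$statistic   # same size, positive
by.default$p.value     # identical to relevelled$p.value
```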

As well as summarising the continuous variable height I can summarise the favourite colour data by tabulating it. I can also tabulate the favourite colours depending on the two levels of gender.

table(student$colour)


 Black   Blue  Brown   Gold  Green Orange   Pink Purple    Red Silver  White 
   292    416     26     34    180     35    156    195    176     13     57 
Yellow 
    46

table(student$gender,student$colour)

        
         Black Blue Brown Gold Green Orange Pink Purple Red Silver White Yellow
  Female   216  298    20   27   128     21  143    171 119     11    49     44
  Male      73  118     5    6    52     12   10     23  56      2     8      1

contingency <- table(student$gender,student$colour)
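Because there are many more female than male respondents, the raw counts are hard to compare directly. One option, sketched below, is to convert each row to proportions with prop.table. To keep the example self-contained the counts are typed in from the table printed above rather than read from the survey file, and the object names counts and row.props are my own.

```r
# Counts copied from the gender-by-colour table printed above
colours <- c("Black", "Blue", "Brown", "Gold", "Green", "Orange",
             "Pink", "Purple", "Red", "Silver", "White", "Yellow")
counts <- rbind(
  Female = c(216, 298, 20, 27, 128, 21, 143, 171, 119, 11, 49, 44),
  Male   = c( 73, 118,  5,  6,  52, 12,  10,  23,  56,  2,  8,  1)
)
colnames(counts) <- colours

# Proportion of each gender choosing each colour (each row sums to 1)
row.props <- prop.table(counts, margin = 1)
round(row.props, 3)

# Row totals: the number of respondents of each gender in the table
rowSums(counts)
```

With the real data the same result should come from prop.table(contingency, margin = 1).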