1 Setup

1.1 Global options


### setting up working environment
### for details see: https://yihui.name/knitr/options/
knitr::opts_chunk$set(
        echo = T,
        message = T,
        error = T,
        warning = T,
        comment = '##',
        highlight = T,
        prompt = T,
        strip.white = T,
        tidy = T
        )

1.2 Installing and loading R packages


### accompanying R package: https://github.com/gitrman/itns
if (!require("itns"))
        {remotes::install_github("gitrman/itns", 
                         build = TRUE, build_opts = c("--no-resave-data", "--no-manual"))
        library("itns")}


### https://www.tidyverse.org/
if (!require("tidyverse"))
        {install.packages("tidyverse", repos = 'http://cran.wu.ac.at/')
        library(tidyverse)}

### above command installed and loaded the core tidyverse packages:
# ggplot2: data visualisation
# tibble:  a modern take on data frames
# tidyr:   data tidying
# readr:   data import (csv, tsv, fwf)
# purrr:   functional R programming
# dplyr:   data (frame) manipulation
# stringr: string manipulation
# forcats: working with categorial varialbes


### to calculate mode:
if (!require("modeest"))
        {install.packages("modeest", repos = 'http://cran.wu.ac.at/')
        library(modeest)}

# I am going to use the `janitor` package for calculating table totals
if (!require("janitor"))
        {install.packages("janitor", repos = 'http://cran.wu.ac.at/')
        library(janitor)}

1.3 Theme adaption for the graphic display with ggplot2


my_theme <- theme_light() +
        theme(plot.title = element_text(size = 10, face = "bold", hjust = 0.5))
        theme(plot.background = element_rect(color = NA, fill = NA)) +
        theme(plot.margin = margin(1, 0, 0, 0, unit = 'cm'))

2 End-of-Chapter Exercises

2.1 Calculating decriptive statistics

2.1.1 Description

The following table contains a set of ACT scores from a sample of college students. The ACT is a standardized college-readiness exam taken by many U.S. students; scores can range fro 1 to 36.

Student ACT \((X~i~ - M)\) \((X~i~ - M)^2\)
1 26
2 24
3 28
4 31
5 20
6 27
7 18
8 17
9 21
10 29
11 24
Total
  1. Location: Calculate the mean (M), median, and mode for this sample.
  2. Spread: Calculate the standard deviation (s), range and interquartile range for this sample. For s, fill in the two columns on the right in the table above, then use the formula to calculate s yourself.

2.1.2 Solution

ACT <- c(26, 24, 28, 31, 20, 27, 18, 17, 21, 29, 24)

With summary there is a covenient command in R to print out some of the most important descriptive statistics.

summary(ACT)

The summary command does not display mode, variance, standard deviation and IQR. With the exception of mode all these command are included in the base R installation. To calculate the mode we need modeest, a package for mode estimation.

In R there are for the same task often several solution possible. I will add some of them.

Parameter Command in R Value
Mean M round(mean(ACT),2) 24.09
Median (V1) median(ACT) 24
Median (V2) fivenum(ACT)[3] 24
quantile (V3) quantile(ACT)[[3]] 24
Mode mfv(ACT) 24
Variance round(var(ACT),2) 21.29
s round(sd(ACT),2) 4.61
Range (V1) range(ACT) 17, 31
Range (V2) fivenum(ACT)[c(1,5)] 17, 31
IQR IQR(ACT) 7
Q1,Q3 fivenum(ACT)[c(2,4)] 20.5, 27.5

To populate the table programmatically I will convert ACT to a data frame ACT_df = as.data.frame(ACT) resp. in a tibble as tibbles has some advantages over data frames. (The tibble package is part of the tidyverse.)

### Calling `as_tibble()` on a vector is discouraged, 
### because the behavior is likely to change in the future.
### I am using instead the `enframe` command
ACT_df <- enframe(ACT, name = "Student", value = "ACT")

## using the pipe operator (https://magrittr.tidyverse.org/)
## using mutate from the dplyr package which is also included in tidyverse
ACT_df <- ACT_df %>% mutate( 
    `Xi-M` = round(ACT_df$ACT - mean(ACT_df$ACT),2),
    `(Xi-M)^2` = round(`Xi-M`^2,2)
    )


ACT_df <- ACT_df %>%
    adorn_totals("row") # janitor package to add totals
ACT_df[12, 3] <- "   " # summing up Xi-M does not make sense


knitr::kable(ACT_df, 
             booktabs = TRUE, 
             caption = "ACT Scores for Exercise 1 of Chapter 3",
             align = 'rrrr')

This table has not the same information as in the book p.505. At first I tried to build it similar but this complicates everything as it is not suitable for using R.

2.2 Loading our first data set

2.2.1 Inspecting the itns-package

In the task description of the itns-book you will find the following passage (p.68):

From the book website, load the College_Survey_1 data set. This is a large data set containing data from a wide-ranging survey of college students. You can read the code book for the data file to get a better sense of the different variables measured.

But you do not need to load the file into R as there are all these data sets already included in the itns package, which we already have loaded into R. In this package the data sets are already cleaned and made compatibel with the R language. This is a great advantage and helps us to concentrate just on the exercises. As in real life you will not have your data already cleaned, I will in the last section of this chapter explain some of the necessary procedures you whould normally apply to every new data set you want to work with.

In the documentation the name of the College_Survey_1 data set in the itns pacakge is college_survey1. When you have already the package attached you can find this out with the command ls("package:itns") our better in RStudio with the console command help(package = itns). (Our you can also go to the “Packages” tab and search for “itns”) The former displays the objects of the package in your RMarkdown file, the latter opens up the documentation page in the RStudio help tab.

ls("package:itns") # list all objects of the itns package

Further on I will only describe the RStudio version. RStudio is the most productive R IDE (Integrated Development Environment) I know about. It is free and works with many platforms. You can download the lastest version at https://www.rstudio.com/products/rstudio/download/#download

To detect the name of the variables (= columns) and the data structure we use the RStudio View command:

View(college_survey1) # opens the data set in the RStudio data browser

2.2.2 Reading the codebook for college-survey1

Even the data set is already included in the itns package, you still will need to go to the companion website and download the data set as you will need the code book to understand what all the variables and codes mean. To save you time I have it included here.

But keep in mind that the variable names in the college-survey1 are slighty different than in College_Survey_1. The main difference is that the R variables are not capitalized. As R is case sensitive, we need always to look up the correct names for the R variables (columns).

Survey of College Attitudes

This is data from survey of college students. The survey was put together as a classroom project, and measures a number of different constructs. Note that not all participants elected to answer all items–blanks represent scales that were not completed by that participant.

  • ID -
  • Gender : Self-reported gender, Male or Female
  • Gender_Code: 1 for Female, 2 for Male
  • Age: Self-reported age in years
  • School_Year - Self reported, First-year, sophomore, junior, senior, or post-bac
  • School_Year_Code: 1 = first-year, 2 = sophomore,3 = junior, 4 = senior, 5 = post-bac
  • Transfer: No if student did not transfer to current school, Yes if student transferred to current school
  • Transfer_Code: 0 = not a transfer student, 1 = transfer student
  • Student_Athlete: self-reported, non-athlete, or athlete in or out of season
  • Student_Athlete_Code: 0 = not a student athlete, 1 = student athlete
  • Wealth_SR: Single item: “Considering all the others students you’ve met here at SCHOOL NAME how would you rank yourself in terms of wealth?” rated on scale from 1 (well below average) to 5 (well above average)
  • GPA: Self-reported GPA, 0-4point scale
  • ACT: Self-reported ACT score
  • Subjective_Well_Being: Average of 5-item satisfaction with life scale by Deiner, rated on scale from 1 (strongly disagree) to 7 (strongly agree). Sample item: “I am satisfied with my life”. See http://internal.psychology.illinois.edu/~ediener/SWLS.html
  • Positive_Affect: Average of 10-item scale of frequency of experiencing positive emotions over the past week, rated on a scale from 1 (verly slightly to not at all) to 5 (extremely). Sample items: Enthusiastic, Proud, Inspired…Part of the PANAS scale.
  • Negative_Affect : Average of 10-item scale of frequency of experiencing negative emotions over the past week. Rated on scale from 1( very slightly to not at all) to 5 (extrmely). Sample items: Nervous, irritable, hostile, etc>0. Part of the PANAS scale. See http://booksite.elsevier.com/9780123745170/Chapter%203/Chapter_3_Worksheet_3.1.pdf
  • Relationship_Confidence- Average of a 7-item scale of confidence in having romantic relationships, rated on a scale from 1 (not at all characteristic of me) to 5 (very characteristic of me). Sample item: “I am a good partner for a romantic relationship)
  • Exercise - Exercise score on the Godin Leisure-Time Exercise Quetionnaire. This asks participants to rate how often during a regular week they engage in strenuous, moderate, or light exercise. A total score is calculated as 9strenuous + 5moderate + 3*light. See http://dapa-toolkit.mrc.ac.uk/documents/en/God/Godin_Leisure-Time_Exercise_Q.pdf
  • Academic_Motivation_Intrinsic - Average score on 6-item measure of intrinsic acadmic motivation. Participants respond to the prompt “Why do you go to college” and respond to each item with a rating from 1 (does not correpond at all) to 7 (corresponds exactly). Sample item: “For the intense feelings I experience when I am communicating my own ideas to others”. See http://www.er.uqam.ca/nobel/r26710/LRCS/scales/emec_en.doc
  • Academic_Motivation_Extrinsic - Average score on 7-item measure of extinsic acadmic motivation. Participants respond to the prompt “Why do you go to college” and respond to each item with a rating from 1 (does not correpond at all) to 7 (corresponds exactly). Sample item: “Because with only a high-school degree I would not find a high-paying job later one”. See http://www.er.uqam.ca/nobel/r26710/LRCS/scales/emec_en.doc
  • Academic_Motivation_Amotivation - Average score on 2-item measure of extinsic acadmic motivation. Participants respond to the prompt “Why do you go to college” and respond to each item with a rating from 1 (does not correpond at all) to 7 (corresponds exactly). Sample item: “Honestly, I don’t know; I really feel that I am wasting my time in school.”. See http://www.er.uqam.ca/nobel/r26710/LRCS/scales/emec_en.doc
  • Inteligence_Value - Measure of degree to which participant values intelligence, rated on a scale from 1 (strongly agree) to 5 (strongly agree). Sample item: “Intelligence greatly contributes to success in life.”. See
  • Raven_Score - % correct out of 8 items on the Raven Progressive Matrix Scale, a scale of logical thinking/IQ. Adapted from https://www.raventest.net/.

2.3 Distribution of positive affect scores

Now we can start to visualize the different distribution as described in 2a-e (pp.68f.)

2.3.1 Visualize the distribution

Visualize the distribution of positive affect scores (Positive_Affect). Students rate how often they experience each of 10 positive emotions, using a scale from 1 (not at all) to 5 (extremely), then the score is the average of the 10 ratings.

The reference in the codebook to the PANAS scale is not valid anymore. But I found an explanation and a template with the questionaire on the toolsHero website.

Following the recommendation by the paper Getting Started with the New Statistics in R (p.7) I am going to use the ggplot2 package. See details to the command at the ggplot2 web page.

2.3.2 Histogram

ggplot(college_survey1, aes(positive_affect)) +
  geom_histogram()

After executing the above command ggplot2 gives us a notification and a warning message:

2.3.2.1 Addressing the notification

I chose a binwidth of .1, meaning that I will get 40 bins as the scale goes from 1-5.

2.3.2.2 Addressing the warning

ggplot2 warns us that there are data missing in 8 records. Look at the RStudio data browser and scroll down to check it. You will see that at the end of the file, starting with record number 231 some data are missing (“not availabe” = NA). In our next visualization we tell R to remove these missing values silently and to use the binwidth of .1.

ggplot(college_survey1, aes(positive_affect)) +
  geom_histogram(binwidth = .1, na.rm = TRUE) # always experiment with the binwidth

2.3.3 Dot plot

For more information about this command see: https://ggplot2.tidyverse.org/reference/geom_dotplot.html

ggplot(college_survey1, aes(positive_affect)) +
    geom_dotplot(binwidth = .1, na.rm = TRUE, method = "histodot", stackratio = 1.1) +
    coord_fixed(ratio = 2)

It is a notorious problem with dot plots produced by ggplot2 that the y-axis is normalized and displays meaningless values between 0 and 1. I found recently a workaround posted at StackOverflow.

# library(ggplot2) ## already loaded with tidyverse
library(dplyr)     ## already loaded with tidyverse
library(ggExtra)

yheight = 20 # taken from the histogramm
# basic dotplot (binwidth = the accuracy of the data)
dotchart = ggplot(college_survey1, aes(x = positive_affect), dpi = 600)
dotchart = dotchart + geom_dotplot(binwidth = 0.1, 
                                   method = "histodot", 
                                   dotsize = 1, 
                                   fill = "blue",
                                   na.rm = TRUE)

# use coor_fixed(ratio=binwidth*dotsize*max frequency) to setup the right y axis height.
dotchart = dotchart + 
    coord_fixed(ratio = 0.1 * yheight)

# add tick mark on y axis to reflect frequencies. Note yheight is max frequency.
dotchart = dotchart + 
    scale_y_continuous(limits=c(0, 1), expand = c(0, 0), breaks = seq(0, 1, 1/yheight), labels=seq(0, yheight))

# remove x y lables and remove vertical grid lines
dotchart = dotchart + 
    labs(x=NULL, y=NULL) +
    theme_bw() +
    removeGridX()
dotchart

2.3.4 Interpretation

Describe the distribution you see, noting the location, spread, and shape, and the number of major peaks.

summary(college_survey1$positive_affect)
s <- round(sd(college_survey1$positive_affect, na.rm = TRUE),2)
s

Positive affect is relatively normally distributed—there is a single, strong peak and the distribution is fairly symmetrical. The mean score of 3.43 is a little above the midpoint of the 1–5 scale, indicating that students on average feel moderate levels of positive affect. There is wide spread, with s = 0.71 and scores ranging across almost the full range.

2.4 Distribution of student age

The next exercises are similar. I am going to produce just the histograms.

Visualize and describe the distribution of student age. Note the skew and the high outliers. Note the mean and median and explain why they differ. Would it make sense to delete outliers? Explain.

ggplot(college_survey1, aes(age)) +
  geom_histogram(binwidth = 1, na.rm = TRUE) # note the chosen bandwith
summary(college_survey1$age)
s <- round(sd(college_survey1$age, na.rm = TRUE),2)
s

There is strong positive skew, with most students aged 18–22, but a long upper tail of ages ranging up to nearly 60. The mean (21.8 years) is most affected by the outliers, and the median (20 years) is probably a more useful indicator of location. The skew causes the mean to be larger than the median. The college has mostly younger students, but also some older returning and non-traditional students. Removing outliers would misrepresent the sample, by omitting those older students. They should be removed only if there is reason to examine the subgroup of younger students, perhaps aged up to about 32.

2.5 Distribution of exercise scores

Again the URL in the code book does not exist anymore but you can download the article dircetly from Gaston Godin’s web page.

Visualize the distribution of exercise scores, which are calculated as 9×strenuous + 5×moderate + 3×light, where students report the number of times per week they engage in strenuous, moderate, and light exercise. There is an extreme outlier. What would that extreme value mean? Would it make sense to delete it? Explain. If you do delete it, what happens to the standard deviation? Test your prediction.

ggplot(college_survey1, aes(exercise)) +
  geom_histogram( binwidth = 15, na.rm = TRUE )
summary(college_survey1$exercise)
s <- round(sd(college_survey1$exercise, na.rm = TRUE),2)
s

There is considerable positive skew. Most students report relatively little exercise, but the right tail pulls the mean (54.7) way above the median (22). The extreme outlier of 1,810 corresponds to engaging in strenuous exercise 201 times per week, which is hardly credible. Most likely, the student didn’t understand the question, made an error, or gave a non-serious answer, so there is good reason to delete this outlier, which would decrease the SD. If it is deleted, s decreases from 135 to 64.7. As always, if you delete an outlier report how, when, and why the descision was made to remove it.

We can easy find the number of the record (= row) of this outlier when we in the RStudio data browser the column exercise are sorting from high to low.

outlier <- college_survey1$exercise[138]
college_survey1$exercise[138] <- NA
ggplot(college_survey1, aes(exercise)) +
  geom_histogram(binwidth = 10, na.rm = TRUE ) # different bindwidth as before
summary(college_survey1$exercise)
s <- round(sd(college_survey1$exercise, na.rm = TRUE),2)
college_survey1$exercise[138] <- outlier # restore outlier to get original data frame
s

2.6 Distribution of Raven scores and GPA

Visualize the distribution of Raven scores (Raven_Score), which are the proportion correct on a short 8-item IQ test. Next, visualize the distribution of GPA. Compare the two distributions. Why might they differ in shape?

By now this should already a routine operation.

ggplot(college_survey1, aes(raven_score / 100)) +
  geom_histogram( binwidth = .05, na.rm = TRUE )
summary(college_survey1$raven_score / 100)
s <- round(sd(college_survey1$raven_score / 100, na.rm = TRUE),2)
s
ggplot(college_survey1, aes(gpa)) +
  geom_histogram( binwidth = .1, na.rm = TRUE )
summary(college_survey1$gpa)
s <- round(sd(college_survey1$gpa, na.rm = TRUE),2)
s

Raven IQ Scores are relatively normally distributed. The mean is .377, median is .375, and standard deviation is .199. In contrast, GPA is very strongly negatively skewed, with scores stacked up near 4.0, the top of the GPA scale. A long tail extends down to very low values of GPA. The mean is 3.34, median is 3.40, and SD is 0.51. Different variables can have different distributions, even for the same sample. However, we expect IQ and GPA to be related, so it’s perhaps surprising that their two distributions are so different. The GPA distribution, with most of the scores concentrated in the 3.5–4.0 range, suggests that the testing is not sufficiently challenging to distinguish among students in the upper parts of the distribution. You may have heard of grade inflation occurring in the last one or two decades. Search online for “college grade inflation”, or similar, and you should easily find statistics describing a dramatic increase in the proportion of A grades in recent years, and interesting discussion about likely causes and consequences.

2.7 Presentation of the gender distribution

In this data set (of the original Excel file, pb), Gender is coded as female = 1 and male = 2. ESCI calculates that the mean is M = 1.28 for Gender. Does this make sense? How might you better summarize gender in this sample?

No! It doesn’t make sense to calculate a mean for data on a nominal scale. It’s often convenient to represent nominal data with numerical codes, but it’s important to remember that these codes have no numerical meaning, and thus calculation of most descriptive statistics doesn’t make sense. Nominal data can be summarized with frequencies (175 females and 68 males) and relative frequencies (72.0% females and 28.0% males).

This important difference between numbers and nominal data is in the itns data set already considered and therefore changed. We have two variables Females and Males. These so-called factor variables require a different proceeding:

n <- table(college_survey1$gender)
res <- cbind(n,round(prop.table(n) * 100, 1))
colnames(res) <- c('Count','%')
res

3 Put some indidual scores in context

3.1 Female student on the Ravens measure

One female participant achieved a score of .875 on the Raven measure. What is her z score?

We can calculate the z score manually with the formula (x - mean(x)) / sd(x) or use the command scale()

M <- mean(college_survey1$raven_score, na.rm = TRUE) / 100
s <- sd(na.omit(college_survey1$raven_score)) / 100
(.875 - M) / s

With X = .875, M = .377, s = .199 we calculate z = 2.50. This student is tied for top score in the sample.

All these commands are using vectors, e.g. the take all values of the column raven_score. In inspecting the data frame we notice that .875 is the first value. So we can also use the command scale(). To prevent to print out the long list of values we write scale()[1]

scale(college_survey1$raven_score)[1]

3.2 The same student in GPA

The same female participant has a GPA of 3.9. What is her z score for GPA? On which measure (GPA or Raven) is she more unusual?

college_survey1$gpa[1]
mean(college_survey1$gpa, na.rm = TRUE)
sd(na.omit(college_survey1$gpa))[1]
scale(college_survey1$gpa)[1]

With X = 3.9, M = 3.34, s = 0.508, we calculate z = 1.10. This participant stands out more on the IQ measure (z = 2.50) than on GPA (z = 1.10).

3.3 Comparing the positive and negative affect of a partipant

One participant scored a 2 for Positive Affect and a 2 for Negative Affect. Even though these raw scores are the same, they are very different within the context of the whole study. Express both as z scores and percentiles. Which score is more unusual?

Sorting either the column positive_affect or negative_affect we see that the particpant of row 154 is the one who scored a 2 for Positive Affect and a 2 for Negative Affect. But we should not use the direct

3.3.1 Positve Affect

college_survey1$positive_affect[154]
mean(college_survey1$positive_affect, na.rm = TRUE)
sd(na.omit(college_survey1$positive_affect))
scale(college_survey1$positive_affect)[154]

3.3.2 Negative Affect

college_survey1$negative_affect[154]
mean(college_survey1$negative_affect, na.rm = TRUE)
sd(na.omit(college_survey1$negative_affect))
scale(college_survey1$negative_affect)[154]

3.3.3 Comparison: Positive with Negative Affect

For positive affect: X = 2.0, M = 3.43, s = 0.71 and we calculate z = −2.01. For negative affect: X = 2.0, M = 2.32, s = 0.75 and we calculate z = −0.43. The positive affect score is more unusual than the negative affect score, within this sample.

3.4 Determine outliers

A z score of more than 3 or less than −3 is sometimes considered an outlier. By this standard, would the participant who is 59 qualify as an outlier for age? Would the participant who reported an exercise score of 1,810 qualify as an outlier for exercise?

college_survey1$age[69]
mean(college_survey1$age, na.rm = TRUE)
sd(na.omit(college_survey1$age))
scale(college_survey1$age)[69]
college_survey1$exercise[138]
mean(college_survey1$exercise, na.rm = TRUE)
sd(na.omit(college_survey1$exercise))
scale(college_survey1$exercise)[138]

For age, X = 59, M = 21.8, s = 5.56 and we calculate z = 6.69, which is an extreme outlier! For exercise, X = 1,810, M = 54.8, s = 135.3 and we calculate z = 12.97, which is a very extreme outlier!

4 Religious Beliefs

Let’s look at some data about religious beliefs. The religious_belief data set on the book website has data from a large online survey in which participants were asked to report, on a scale from 0 to 100, their belief in the existence of God.

4.1 Estimate the distribution

First, sketch the distribution you think this variable will have. Consider the relative frequencies of people you expect to strongly believe in God (high end of the scale), to strongly not believe in God (low end of the scale), or to be unsure (middle of the scale).

summary(religious_belief$belief_in_gd)

The lowest 25% percent do not believe in god and the highest 25% do believe completely (100%) in god.

I predicted an approximately flat distribution, with similar frequencies of strong, moderate, and non-believers.

4.2 Graphic of the religious data

Now make a picture of these data and describe it in words.

ggplot(religious_belief, aes(belief_in_gd)) +
  geom_histogram( binwidth = 10, na.rm = TRUE )

The distribution is clearly bimodal: Most respondents answered 100 (strongly believe in God) or 0 (strongly disbelieve in God), and relatively few participants gave scores between the extremes. My prediction was quite wrong. This is a good illustration of multiple peaks (two in this case) indicating distinct groups of participants.

4.3 Evaluate the different location parameters

Does the mean do a good job representing this data picture? Does the median? The mode? Why do all these measures of location seem to fail?

With such a strongly bimodal distribution, none of the three measures of location does a great job of representing the distribution. Both the mean (48.2) [49.39] and median (50) represent an intermediate level of belief that is actually rare; reporting only one mode would be misleading. The best non-graphical strategy would be to describe the distribution in words and report the location of each mode: one at 0 level of belief, the other at 100.

5 Working with the oiginal Excel data set

5.1 Load data set from Excel file

In RStudio there are several options to import data files via the menu available. I generally prefer .csv files using the option From Text (readr). readr is a package included in tidyverse.