Topic 1: Introduction to Statistics and Presenting Data


These are the solutions for Computer Lab 2.


Preparation

No answer required.

1 Inspecting our Data

The survey data set (Venables and Ripley 1999) is included in the MASS R package.

1.1

library(MASS)
names(survey)
##  [1] "Sex"    "Wr.Hnd" "NW.Hnd" "W.Hnd"  "Fold"   "Pulse"  "Clap"   "Exer"  
##  [9] "Smoke"  "Height" "M.I"    "Age"

1.2

View(survey)

This should immediately open a tab titled 'survey'.
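
If a data viewer is not available (for example, in a session without a graphical interface), a similar first look can be had from the console. A minimal sketch, assuming the MASS package is installed:

```r
library(MASS)

# Number of rows (students) and columns (variables)
dim(survey)
## [1] 237  12

# First six rows of the data frame, printed in the console
head(survey)
```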

1.3

help(survey)

This should load the R Documentation on the survey data set.

1.4

The types of the variables in the survey data set are:

  • Sex: Categorical, Nominal
  • Wr.Hnd: Numerical, Continuous
  • NW.Hnd: Numerical, Continuous
  • W.Hnd: Categorical, Nominal
  • Fold: Categorical, Nominal
  • Pulse: Numerical, Discrete
  • Clap: Categorical, Nominal
  • Exer: Categorical, Ordinal
  • Smoke: Categorical, Ordinal
  • Height: Numerical, Continuous
  • M.I: Categorical, Nominal
  • Age: Numerical, Continuous
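
The classification above can be compared with how R stores each column. The sketch below, which assumes the MASS package is installed, lists the storage class of every variable. R only distinguishes factors from numbers here; whether a factor is nominal or ordinal, or a number discrete or continuous, remains a judgement about the data.

```r
library(MASS)

# Storage class of each column: "factor" for categorical variables,
# "numeric" or "integer" for numerical ones
sapply(survey, class)

# Compact overview of the whole data frame, including factor levels
str(survey)
```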

2 Frequency Tables

2.1

freq.w.hnd <- table(survey$W.Hnd)  # Store frequency table as freq.w.hnd
freq.w.hnd  # Display the frequency table in the console
## 
##  Left Right 
##    18   218

There are 18 left-handed students and 218 right-handed students.

2.2

rel.freq.w.hnd <- prop.table(freq.w.hnd)
rel.freq.w.hnd
## 
##       Left      Right 
## 0.07627119 0.92372881

Approximately 7.6% of students are left-handed, with the majority (approximately 92.4%) right-handed.

2.3

rel.freq.w.hnd <- round(rel.freq.w.hnd, 2)
rel.freq.w.hnd
## 
##  Left Right 
##  0.08  0.92

The results are now rounded to two decimal places.

2.4

rel.freq.w.hnd <- rel.freq.w.hnd * 100

Our results are now expressed as whole-number percentages: 8% left-handed and 92% right-handed.
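
Because 2.3 rounds the proportions before 2.4 multiplies them by 100, the decimal part of each percentage is lost. If one decimal place is wanted, one option is to multiply first and round afterwards. A minimal sketch, re-entering the counts from 2.1 by hand so that it runs on its own (the name freq.w.hnd.demo is illustrative):

```r
# Counts copied from the frequency table in 2.1
freq.w.hnd.demo <- c(Left = 18, Right = 218)

# Round first, then scale: gives 8 and 92
round(prop.table(freq.w.hnd.demo), 2) * 100

# Scale first, then round: gives 7.6 and 92.4
round(prop.table(freq.w.hnd.demo) * 100, 1)
```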

2.5

freq.sex <- table(survey$Sex)
freq.sex
## 
## Female   Male 
##    118    118

There are 118 female and 118 male students. Note that one Sex value is recorded as NA, which is why 118 + 118 = 236 falls one short of the 237 students in the data set.
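
The missing Sex value can be confirmed directly, since table leaves out NA values unless asked to show them. A minimal sketch, assuming the MASS package is installed:

```r
library(MASS)

# Include missing values as their own category in the table
table(survey$Sex, useNA = "ifany")

# Count the missing values directly
sum(is.na(survey$Sex))
## [1] 1

# Total number of students (rows) in the data set
nrow(survey)
## [1] 237
```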

2.6

rel.freq.sex <- round(prop.table(freq.sex) * 100, 2)
rel.freq.sex
## 
## Female   Male 
##     50     50

50% of the class is female and 50% is male.

3 Types of Variables in R

3.1

is.ordered(survey$Smoke)
## [1] FALSE

The output FALSE tells us that the categories of survey$Smoke have not been ordered. We can verify this using the levels function, which lists the categories in alphabetical order rather than from lowest to highest.

levels(survey$Smoke)
## [1] "Heavy" "Never" "Occas" "Regul"

3.2

survey$Smoke <- ordered(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"))

3.3

is.ordered(survey$Smoke)
## [1] TRUE
levels(survey$Smoke)
## [1] "Never" "Occas" "Regul" "Heavy"

Note that, in contrast to 3.1, survey$Smoke is now an ordered factor, as denoted by the output TRUE. We can also see that the levels, or categories, are now ordered from lowest to highest.
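
One practical consequence of ordering is that comparisons between categories become meaningful. The sketch below uses a small made-up vector (smoke.demo is illustrative and not taken from the survey) so that it runs on its own:

```r
# Small illustrative sample of smoking categories
smoke.demo <- c("Never", "Heavy", "Occas", "Never", "Regul")

# Ordered factor with levels listed from lowest to highest
smoke.ord <- ordered(smoke.demo,
                     levels = c("Never", "Occas", "Regul", "Heavy"))

# Which observations are at least "Regul"?
smoke.ord >= "Regul"
## [1] FALSE  TRUE FALSE FALSE  TRUE

# Highest category observed
max(smoke.ord)
```

The same comparison on an unordered factor would return NA with a warning, which is why the ordering step in 3.2 matters for ordinal variables.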

4 Relative and Cumulative Relative Frequency Tables

4.1

Our frequency table is created using the following code:

freq.smoke <- table(survey$Smoke)
freq.smoke
## 
## Never Occas Regul Heavy 
##   189    19    17    11

Our relative frequency table is created using the following code:

rel.freq.smoke <- round(prop.table(freq.smoke) * 100, 2)
rel.freq.smoke
## 
## Never Occas Regul Heavy 
## 80.08  8.05  7.20  4.66

4.2

Our cumulative frequency table is created using the following code:

cum.freq.smoke <- cumsum(freq.smoke)
cum.freq.smoke
## Never Occas Regul Heavy 
##   189   208   225   236

Our cumulative relative frequency table is created using the following code:

cum.rel.freq.smoke <- round(cumsum(prop.table(freq.smoke)) * 100, 2)
cum.rel.freq.smoke
##  Never  Occas  Regul  Heavy 
##  80.08  88.14  95.34 100.00

4.3 Tabulating Results

cbind("Freq" = freq.smoke, 
      "Cum Freq" = cum.freq.smoke, 
      "Rel Freq" = rel.freq.smoke, 
      "Cum Rel Freq" = cum.rel.freq.smoke)
##       Freq Cum Freq Rel Freq Cum Rel Freq
## Never  189      189    80.08        80.08
## Occas   19      208     8.05        88.14
## Regul   17      225     7.20        95.34
## Heavy   11      236     4.66       100.00
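
The same four columns are built again for Height (6.6) and Age (7.6), so the steps can be wrapped in a small helper. This is only a sketch: freq.summary is a hypothetical name, not a function from base R or MASS, and the example rebuilds the Smoke variable from the counts in 4.1 so that it runs on its own.

```r
# Hypothetical helper: frequency, cumulative frequency, relative
# frequency and cumulative relative frequency (both in percent)
freq.summary <- function(x, digits = 2) {
  freq <- table(x)            # NA values are dropped by default
  rel  <- prop.table(freq)    # proportions summing to 1
  cbind("Freq"         = freq,
        "Cum Freq"     = cumsum(freq),
        "Rel Freq"     = round(rel * 100, digits),
        "Cum Rel Freq" = round(cumsum(rel) * 100, digits))
}

# Rebuild the Smoke variable from the counts reported in 4.1
smoke.levels <- c("Never", "Occas", "Regul", "Heavy")
smoke.demo <- ordered(rep(smoke.levels, times = c(189, 19, 17, 11)),
                      levels = smoke.levels)

freq.summary(smoke.demo)
```

Applied to the output of cut, the same helper would produce the grouped tables for numerical variables.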

5 Visualizing our Data

5.1 Bar Charts

We can create a frequency distribution chart of the smoking levels of the students as follows:

smoke.names <- c("Never", "Occasional", "Regular", "Heavy")
barplot(height = freq.smoke, 
        ylim = c(0, 200), 
        col = c("chartreuse4", "yellow", "orange", "red"),
        names.arg = smoke.names,
        main = "Frequency Distribution Chart of Smoking Levels",
        axis.lty = 1, 
        xlab = "Smoking levels",
        ylab = "Frequency",
        legend.text = smoke.names)
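
barplot returns the x-coordinates of the bar midpoints, which makes it possible to print each count above its bar. A minimal sketch, re-entering the counts from 4.1 by hand so that it runs on its own (freq.demo and bar.mids are illustrative names, and the upper axis limit of 220 is chosen only to leave room for the tallest label):

```r
# Counts copied from the frequency table in 4.1
freq.demo <- c(Never = 189, Occasional = 19, Regular = 17, Heavy = 11)

# barplot() invisibly returns the midpoint of each bar on the x-axis
bar.mids <- barplot(height = freq.demo,
                    ylim = c(0, 220),
                    col = c("chartreuse4", "yellow", "orange", "red"),
                    main = "Frequency Distribution Chart of Smoking Levels",
                    xlab = "Smoking levels",
                    ylab = "Frequency")

# Write each frequency just above the top of its bar
text(x = bar.mids, y = freq.demo, labels = freq.demo, pos = 3)
```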

5.1.1

No answer required.

5.1.2

We can create a relative frequency distribution chart of the smoking levels of the students as follows:

barplot(height = rel.freq.smoke, 
        ylim = c(0, 100), 
        col = c("chartreuse4", "yellow", "orange", "red"),
        names.arg = smoke.names,
        main = "Relative Frequency Distribution Chart of Smoking Levels",
        axis.lty = 1, 
        xlab = "Smoking levels",
        ylab = "Percentage",
        legend.text = smoke.names)

5.1.3

No answer required.

5.2 Pie Charts

We can create a pie chart of the frequency of smoking levels of students as follows:

pie(x = freq.smoke, 
    labels = smoke.names,
    col = c("chartreuse4", "yellow", "orange", "red"),
    main = "Smoking Levels of Students")
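
Proportions are hard to judge from the slices of a pie chart alone, so it can help to put the percentages in the labels. A minimal sketch, re-entering the counts from 4.1 by hand so that it runs on its own (freq.demo, pct and pie.labels are illustrative names):

```r
# Counts copied from the frequency table in 4.1
freq.demo <- c(Never = 189, Occasional = 19, Regular = 17, Heavy = 11)

# Percentages rounded to one decimal place
pct <- round(prop.table(freq.demo) * 100, 1)

# Labels of the form "Never (80.1%)"
pie.labels <- paste0(names(freq.demo), " (", pct, "%)")

pie(x = freq.demo,
    labels = pie.labels,
    col = c("chartreuse4", "yellow", "orange", "red"),
    main = "Smoking Levels of Students")
```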

5.2.1

No answer required.

6 Assessing Numerical Data

6.1

freq.height <- table(survey$Height)
freq.height
## 
##    150    152  152.4  153.5 154.94    155    156  156.2    157 157.48    158 
##      1      1      1      1      2      2      1      1      3      3      1 
##    159    160 160.02  162.5 162.56    163    164    165  165.1  166.4  166.5 
##      2      5      3      1      4      3      4     14      4      1      1 
##    167 167.64    168  168.5  168.9    169  169.2    170 170.18    171  171.5 
##      7      5      8      1      1      2      1     14      4      5      1 
##    172 172.72    173    174    175 175.26    176  176.5    177  177.8    178 
##      7      6      4      1      5      5      2      2      3      2      2 
##  178.5    179  179.1    180 180.34    182  182.5 182.88    183    184    185 
##      1      3      2      8      9      1      1      4      3      2      6 
## 185.42    187 187.96    188    189    190  190.5  191.8 193.04    195    196 
##      2      3      3      1      2      3      3      1      1      1      1 
##    200 
##      1

6.2

# Note we use the na.rm = TRUE argument to ignore missing values
range(survey$Height, na.rm = TRUE) 
## [1] 150 200
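
To see why na.rm = TRUE is needed, the sketch below uses a small made-up vector (heights.demo is illustrative) containing one missing value:

```r
# Illustrative vector with one missing value
heights.demo <- c(150, NA, 200)

# Without na.rm, a single NA makes the whole result NA
range(heights.demo)
## [1] NA NA

# With na.rm = TRUE, missing values are dropped first
range(heights.demo, na.rm = TRUE)
## [1] 150 200
```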

6.3

intervals <- seq(from = 150, to = 205, by = 5) 
intervals
##  [1] 150 155 160 165 170 175 180 185 190 195 200 205
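
The upper limit is 205 rather than 200 because the intervals used in 6.4 are closed on the left and open on the right, so a height of exactly 200 needs an interval that starts at 200. One way to derive such limits from the range is sketched below; make.intervals is a hypothetical helper, not a base R function.

```r
# Hypothetical helper: break points of a given width that cover x,
# assuming left-closed, right-open intervals [a, b)
make.intervals <- function(x, width = 5) {
  lower <- floor(min(x, na.rm = TRUE) / width) * width
  # Add one full width so the maximum never falls on the last break
  upper <- (floor(max(x, na.rm = TRUE) / width) + 1) * width
  seq(from = lower, to = upper, by = width)
}

# Using the range found in 6.2
make.intervals(c(150, 200))
##  [1] 150 155 160 165 170 175 180 185 190 195 200 205
```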

6.4

height.intervals <- cut(x = survey$Height,
                        breaks = intervals, right = FALSE)
height.intervals

Note that for conciseness we omit the height.intervals output here.
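
The effect of right = FALSE is easiest to see on values that sit exactly on a break point. A minimal sketch with made-up values (boundary.demo and breaks.demo are illustrative names):

```r
# Values that coincide with break points
boundary.demo <- c(150, 155, 160)
breaks.demo <- c(150, 155, 160, 165)

# right = FALSE: intervals [a, b), so each value starts a new interval
cut(boundary.demo, breaks = breaks.demo, right = FALSE)

# right = TRUE (the default): intervals (a, b], so 150 lies outside
# every interval and becomes NA, while 155 falls in (150,155]
cut(boundary.demo, breaks = breaks.demo)
```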

6.5

freq.height <- table(height.intervals)
cbind(freq = freq.height)
##           freq
## [150,155)    6
## [155,160)   13
## [160,165)   20
## [165,170)   45
## [170,175)   42
## [175,180)   27
## [180,185)   28
## [185,190)   17
## [190,195)    8
## [195,200)    2
## [200,205)    1

6.6

# Relative Frequency Table
rel.freq.height <- round(prop.table(freq.height) * 100, 2)

# Cumulative frequency
cum.freq.height <- cumsum(freq.height)

# Cumulative relative frequency
cum.rel.freq.height <- round(cumsum(prop.table(freq.height)) * 100, 2)

# Use the cbind function to display all frequencies vertically
cbind("Freq" = freq.height, "Cum Freq" = cum.freq.height, 
      "Rel Freq" = rel.freq.height, "Cum Rel Freq" = cum.rel.freq.height)
##           Freq Cum Freq Rel Freq Cum Rel Freq
## [150,155)    6        6     2.87         2.87
## [155,160)   13       19     6.22         9.09
## [160,165)   20       39     9.57        18.66
## [165,170)   45       84    21.53        40.19
## [170,175)   42      126    20.10        60.29
## [175,180)   27      153    12.92        73.21
## [180,185)   28      181    13.40        86.60
## [185,190)   17      198     8.13        94.74
## [190,195)    8      206     3.83        98.56
## [195,200)    2      208     0.96        99.52
## [200,205)    1      209     0.48       100.00
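
The cumulative relative frequencies can be cross-checked against the raw data. For example, the 40.19 reported for [165,170) is the percentage of recorded heights below 170. A minimal sketch, assuming the MASS package is installed:

```r
library(MASS)

# Percentage of non-missing heights that are strictly below 170
round(mean(survey$Height < 170, na.rm = TRUE) * 100, 2)
## [1] 40.19
```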

7 Analysing a Variable

7.1

# 1. Find the range of the ages
range(survey$Age, na.rm = TRUE) 
## [1] 16.75 73.00
# 2. Define the intervals
intervals <- seq(from = 15, to = 75, by = 5) 
intervals
##  [1] 15 20 25 30 35 40 45 50 55 60 65 70 75
# 3. Break the ages down using these intervals
age.intervals <- cut(x = survey$Age, breaks = intervals, right = FALSE)
age.intervals

Note that for conciseness we omit the age.intervals output here.

7.2

# Frequency table
freq.age <- table(age.intervals)     
freq.age
## age.intervals
## [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,65) 
##     171      47       6       4       4       3       0       0       0       0 
## [65,70) [70,75) 
##       0       2

7.3

# Relative Frequency Table
rel.freq.age <- round(prop.table(freq.age) * 100, 2)
rel.freq.age
## age.intervals
## [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,65) 
##   72.15   19.83    2.53    1.69    1.69    1.27    0.00    0.00    0.00    0.00 
## [65,70) [70,75) 
##    0.00    0.84

7.4

# Cumulative frequency
cum.freq.age <- cumsum(freq.age)
cum.freq.age 
## [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,65) 
##     171     218     224     228     232     235     235     235     235     235 
## [65,70) [70,75) 
##     235     237

7.5

# Cumulative relative frequency
cum.rel.freq.age <- round(cumsum(prop.table(freq.age)) * 100, 2)
cum.rel.freq.age
## [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,65) 
##   72.15   91.98   94.51   96.20   97.89   99.16   99.16   99.16   99.16   99.16 
## [65,70) [70,75) 
##   99.16  100.00

7.6

# Use the cbind function to display all frequencies vertically
cbind("Freq" = freq.age, "Cum Freq" = cum.freq.age, 
      "Rel Freq" = rel.freq.age, "Cum Rel Freq" = cum.rel.freq.age)
##         Freq Cum Freq Rel Freq Cum Rel Freq
## [15,20)  171      171    72.15        72.15
## [20,25)   47      218    19.83        91.98
## [25,30)    6      224     2.53        94.51
## [30,35)    4      228     1.69        96.20
## [35,40)    4      232     1.69        97.89
## [40,45)    3      235     1.27        99.16
## [45,50)    0      235     0.00        99.16
## [50,55)    0      235     0.00        99.16
## [55,60)    0      235     0.00        99.16
## [60,65)    0      235     0.00        99.16
## [65,70)    0      235     0.00        99.16
## [70,75)    2      237     0.84       100.00

8 Creating Histograms

8.1

We can create a histogram of the ages of the students as follows:

hist(survey$Age)

The data appear to be highly skewed to the right, and we can also see an outlier.

8.2

# Reset intervals
range(survey$Age, na.rm = TRUE) 
## [1] 16.75 73.00
intervals <- seq(from = 15, to = 75, by = 5)
hist(survey$Age, breaks = intervals, right = FALSE)

8.3

hist(survey$Age, breaks = intervals, right = FALSE, labels = TRUE)

8.4

We could try, for example:

hist(survey$Age, breaks = 5, right = FALSE)

or

hist(survey$Age, breaks = 50, right = FALSE)

Notice that with too few breaks the shape of the distribution is hidden, while with too many breaks the histogram becomes noisy and less informative.
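
When breaks is a single number, R treats it as a suggestion and chooses "pretty" break points, so the number of bars drawn may differ from the number requested. The sketch below, which assumes the MASS package is installed, counts the bars actually used without drawing the plots (h5 and h50 are illustrative names):

```r
library(MASS)

# Compute the histograms without plotting them
h5  <- hist(survey$Age, breaks = 5,  right = FALSE, plot = FALSE)
h50 <- hist(survey$Age, breaks = 50, right = FALSE, plot = FALSE)

# Number of bars R actually used in each case
length(h5$counts)
length(h50$counts)

# Break points chosen when 5 breaks were requested
h5$breaks
```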

8.5

We could for example produce the following histogram:

hist(survey$Age, breaks = intervals, right = FALSE,
     xlab = "Age (years)",
     main = "Age of Students",
     col = "lightblue")

8.6

hist(survey$Age, plot = FALSE)
## $breaks
##  [1] 15 20 25 30 35 40 45 50 55 60 65 70 75
## 
## $counts
##  [1] 174  44   6   4   4   3   0   0   0   0   0   2
## 
## $density
##  [1] 0.146835443 0.037130802 0.005063291 0.003375527 0.003375527 0.002531646
##  [7] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.001687764
## 
## $mids
##  [1] 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5
## 
## $xname
## [1] "survey$Age"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
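
The counts above differ slightly from the frequency table in 7.2 (174 and 44 rather than 171 and 47 in the first two intervals). This is because hist uses right-closed intervals by default, so a student aged exactly 20 is counted in the first bar, whereas the table in 7.2 was built with right = FALSE. A minimal sketch, assuming the MASS package is installed, showing that the two agree once the same convention is used (h.left is an illustrative name):

```r
library(MASS)

intervals <- seq(from = 15, to = 75, by = 5)

# Frequency table with left-closed intervals, as in 7.2
freq.age <- table(cut(survey$Age, breaks = intervals, right = FALSE))

# Histogram counts with the default right-closed intervals
hist(survey$Age, breaks = intervals, plot = FALSE)$counts

# Histogram counts with left-closed intervals
h.left <- hist(survey$Age, breaks = intervals, right = FALSE, plot = FALSE)
h.left$counts

# TRUE when the histogram and the table agree interval by interval
all(h.left$counts == as.vector(freq.age))
```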


That’s everything for now! If there were any parts you were unsure about, take a look back over the relevant sections of the Topic 1 material.


References

Venables, W. N., & Ripley, B. D. 1999. Modern Applied Statistics with S-PLUS. 3rd ed. New York: Springer.


These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.