No answer required.
The survey data set (Venables 1999) is part of the MASS R package.
library(MASS)
names(survey)
## [1] "Sex" "Wr.Hnd" "NW.Hnd" "W.Hnd" "Fold" "Pulse" "Clap" "Exer"
## [9] "Smoke" "Height" "M.I" "Age"
View(survey)
This should immediately open a tab titled 'survey'.
help(survey)
This should load the R Documentation on the survey data set.
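Before classifying variables by hand, it can help to check how R stores each column. The following is a minimal sketch on a small made-up data frame (toy, with the hypothetical columns hand, span and pulse; none of these values come from the survey data). The storage class is only a hint: whether a factor is nominal or ordinal, or a numeric variable is discrete or continuous, still has to be judged from the documentation.

```r
# Hypothetical data frame for illustration only (not the survey data)
toy <- data.frame(
  hand  = factor(c("Left", "Right", "Right")),  # categorical
  span  = c(18.5, 19.0, 20.2),                  # numerical, measured
  pulse = c(72L, 80L, 65L)                      # numerical, counted
)

# Report the storage class of every column
col.classes <- sapply(toy, class)
col.classes
```

The same call could be applied to the survey data as sapply(survey, class) once MASS is loaded.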
The variable types for the variables in the survey data set example are:
Sex: Categorical, Nominal
Wr.Hnd: Numerical, Continuous
NW.Hnd: Numerical, Continuous
W.Hnd: Categorical, Nominal
Fold: Categorical, Nominal
Pulse: Numerical, Discrete
Clap: Categorical, Nominal
Exer: Categorical, Ordinal
Smoke: Categorical, Ordinal
Height: Numerical, Continuous
M.I: Categorical, Nominal
Age: Numerical, Continuous
freq.w.hnd <- table(survey$W.Hnd) # Store frequency table as freq.w.hnd
freq.w.hnd # Display the frequency table in the console
##
## Left Right
## 18 218
There are 18 left-handed students and 218 right-handed students.
rel.freq.w.hnd <- prop.table(freq.w.hnd)
rel.freq.w.hnd
##
## Left Right
## 0.07627119 0.92372881
Approximately 7.6% of students are left-handed, with the majority (approximately 92.4%) right-handed.
rel.freq.w.hnd <- round(rel.freq.w.hnd, 2)
rel.freq.w.hnd
##
## Left Right
## 0.08 0.92
We observe that the results are now rounded to two decimal places of accuracy.
rel.freq.w.hnd <- rel.freq.w.hnd * 100
Our results are now expressed as whole number percentages.
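The order of rounding and scaling matters here. As a small sketch, using the counts from the frequency table above typed in by hand (the names freq, prop, pct.whole and pct.one.dp are introduced only for this illustration), rounding the proportions to two decimal places before multiplying by 100 leaves whole-number percentages, whereas multiplying first and then rounding can keep a decimal place.

```r
# Counts copied from the frequency table above
freq <- c(Left = 18, Right = 218)
prop <- prop.table(freq)

# Round first, then scale: only whole-number percentages remain
pct.whole <- round(prop, 2) * 100

# Scale first, then round: one decimal place is kept
pct.one.dp <- round(prop * 100, 1)

pct.whole
pct.one.dp
```

Which version is preferable depends on how much precision the reported percentages need.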
freq.sex <- table(survey$Sex)
freq.sex
##
## Female Male
## 118 118
There are 118 female and 118 male students. Note that one Sex value is recorded as NA, which accounts for the difference between 118 + 118 = 236 and the 237 students in the data set.
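One way to see where a missing value has gone is to ask table() to count NA explicitly through its useNA argument. The sketch below uses a small made-up factor (sex.toy, a hypothetical name) rather than the survey data.

```r
# Hypothetical responses, one of which is missing
sex.toy <- factor(c("Female", "Male", NA, "Male", "Female"))

# By default, table() silently drops missing values
counts.default <- table(sex.toy)

# useNA = "ifany" adds an NA column whenever missing values are present
counts.with.na <- table(sex.toy, useNA = "ifany")

counts.default
counts.with.na
sum(is.na(sex.toy))  # number of missing responses
```

Applied to survey$Sex, the same argument should show the single NA alongside the two categories.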
rel.freq.sex <- round(prop.table(freq.sex) * 100, 2)
rel.freq.sex
##
## Female Male
## 50 50
50% of the class is female and 50% is male.
is.ordered(survey$Smoke)
## [1] FALSE
The output FALSE tells us that the categories of survey$Smoke have not been ordered. We can verify this using the levels function.
levels(survey$Smoke)
## [1] "Heavy" "Never" "Occas" "Regul"
survey$Smoke <- ordered(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"))
is.ordered(survey$Smoke)
## [1] TRUE
levels(survey$Smoke)
## [1] "Never" "Occas" "Regul" "Heavy"
Note that in contrast to 3.1, our data is now ordered, as denoted by the output TRUE. We can also see that the levels, or categories, are now ordered from lowest to highest.
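A practical consequence of ordering the levels is that comparisons between categories become meaningful. The following sketch uses a short made-up vector (smoke.toy, a hypothetical name) with the same level labels as above; it is an illustration only, not part of the original solution.

```r
# Hypothetical responses using the same labels as survey$Smoke
smoke.toy <- factor(c("Never", "Heavy", "Occas", "Never", "Regul"))
levels(smoke.toy)  # alphabetical by default

# Impose the natural order from lowest to highest
smoke.toy <- ordered(smoke.toy,
                     levels = c("Never", "Occas", "Regul", "Heavy"))

# Comparisons now respect that order
at.least.regular <- smoke.toy >= "Regul"
at.least.regular
sum(at.least.regular)  # how many smoke regularly or more
```

On an unordered factor the same comparison would return NA with a warning, which is one reason to set the order before tabulating cumulative frequencies.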
Our frequency table is created using the following code:
freq.smoke <- table(survey$Smoke)
freq.smoke
##
## Never Occas Regul Heavy
## 189 19 17 11
Our relative frequency table is created using the following code:
rel.freq.smoke <- round(prop.table(freq.smoke) * 100, 2)
rel.freq.smoke
##
## Never Occas Regul Heavy
## 80.08 8.05 7.20 4.66
Our cumulative frequency table is created using the following code:
cum.freq.smoke <- cumsum(freq.smoke)
cum.freq.smoke
## Never Occas Regul Heavy
## 189 208 225 236
Our cumulative relative frequency table is created using the following code:
cum.rel.freq.smoke <- round(cumsum(prop.table(freq.smoke)) * 100, 2)
cum.rel.freq.smoke
## Never Occas Regul Heavy
## 80.08 88.14 95.34 100.00
cbind("Freq" = freq.smoke,
"Cum Freq" = cum.freq.smoke,
"Rel Freq" = rel.freq.smoke,
"Cum Rel Freq" = cum.rel.freq.smoke)
## Freq Cum Freq Rel Freq Cum Rel Freq
## Never 189 189 80.08 80.08
## Occas 19 208 8.05 88.14
## Regul 17 225 7.20 95.34
## Heavy 11 236 4.66 100.00
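The cumulative relative frequencies above are computed from the unrounded proportions and rounded only at the end. The sketch below, using the smoking counts from the table typed in by hand (cum.late and cum.early are names introduced for this illustration), suggests why: accumulating already-rounded percentages lets the rounding errors add up.

```r
# Counts copied from the smoking frequency table above
freq <- c(Never = 189, Occas = 19, Regul = 17, Heavy = 11)
prop <- freq / sum(freq)

# Accumulate exact proportions, then round once at the end
cum.late <- round(cumsum(prop) * 100, 2)

# Round each percentage first, then accumulate the rounded values
cum.early <- cumsum(round(prop * 100, 2))

cum.late   # final entry is 100
cum.early  # final entry falls just short of 100
```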
We can create a frequency chart of the smoking levels of the students as follows:
smoke.names <- c("Never", "Occasional", "Regular", "Heavy")
barplot(height = freq.smoke,
ylim = c(0, 200),
col = c("chartreuse4", "yellow", "orange", "red"),
names = smoke.names,
main = "Frequency Distribution Chart of Smoking Levels",
axis.lty = 1,
xlab = "Smoking levels",
ylab = "Frequency",
legend.text = smoke.names)
No answer required.
We can create a relative frequency distribution chart of the smoking levels of the students as follows:
barplot(height = rel.freq.smoke,
ylim = c(0,100),
col = c("chartreuse4", "yellow", "orange", "red"),
names = smoke.names,
main = "Relative Frequency Distribution Chart of Smoking Levels",
axis.lty = 1, xlab = "Smoking levels", ylab = "Percentage",
legend.text = smoke.names)
No answer required.
We can create a pie chart of the frequency of smoking levels of students as follows:
pie(x = freq.smoke,
labels = smoke.names,
col = c("chartreuse4", "yellow", "orange", "red"),
main = "Smoking Levels of Students")
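Pie slices can be hard to compare by eye, so it may help to put the percentage into each label. The following sketch builds such labels from the smoking counts shown earlier, typed in by hand; pct and pie.labels are names introduced here for illustration.

```r
# Counts copied from the smoking frequency table
freq <- c(Never = 189, Occas = 19, Regul = 17, Heavy = 11)

# Percentage of students in each category, to one decimal place
pct <- round(100 * freq / sum(freq), 1)

# Combine category name and percentage into one label per slice
pie.labels <- paste0(names(freq), " (", pct, "%)")
pie.labels
```

These labels could then be passed to the labels argument of pie() in place of smoke.names.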
No answer required.
freq.height <- table(survey$Height)
freq.height
##
## 150 152 152.4 153.5 154.94 155 156 156.2 157 157.48 158
## 1 1 1 1 2 2 1 1 3 3 1
## 159 160 160.02 162.5 162.56 163 164 165 165.1 166.4 166.5
## 2 5 3 1 4 3 4 14 4 1 1
## 167 167.64 168 168.5 168.9 169 169.2 170 170.18 171 171.5
## 7 5 8 1 1 2 1 14 4 5 1
## 172 172.72 173 174 175 175.26 176 176.5 177 177.8 178
## 7 6 4 1 5 5 2 2 3 2 2
## 178.5 179 179.1 180 180.34 182 182.5 182.88 183 184 185
## 1 3 2 8 9 1 1 4 3 2 6
## 185.42 187 187.96 188 189 190 190.5 191.8 193.04 195 196
## 2 3 3 1 2 3 3 1 1 1 1
## 200
## 1
# Note we use the na.rm = TRUE argument to ignore missing values
range(survey$Height, na.rm = TRUE)
## [1] 150 200
intervals <- seq(from = 150, to = 205, by = 5)
intervals
## [1] 150 155 160 165 170 175 180 185 190 195 200 205
height.intervals <- cut(x = survey$Height,
breaks = intervals, right = FALSE)
height.intervals
Note that for conciseness we omit the height.intervals output here.
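The right argument of cut() decides which end of each interval is closed, and values that fall outside every interval become NA. The sketch below uses four made-up heights (x, a hypothetical vector) to show both effects; it also suggests why the break sequence above runs to 205 rather than stopping at the maximum of 200.

```r
# Hypothetical heights sitting on or near the interval boundaries
x <- c(150, 154.9, 155, 160)
breaks <- seq(from = 150, to = 160, by = 5)  # 150 155 160

# right = FALSE: intervals are [150,155) and [155,160)
left.closed <- cut(x, breaks = breaks, right = FALSE)

# right = TRUE (the default): intervals are (150,155] and (155,160]
right.closed <- cut(x, breaks = breaks, right = TRUE)

left.closed   # 160 lies outside [155,160), so it becomes NA
right.closed  # 150 lies outside (150,155], so it becomes NA
```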
freq.height <- table(height.intervals)
cbind(freq = freq.height)
## freq
## [150,155) 6
## [155,160) 13
## [160,165) 20
## [165,170) 45
## [170,175) 42
## [175,180) 27
## [180,185) 28
## [185,190) 17
## [190,195) 8
## [195,200) 2
## [200,205) 1
# Relative Frequency Table
rel.freq.height <- round(prop.table(freq.height) * 100, 2)
# Cumulative frequency
cum.freq.height <- cumsum(freq.height)
# Cumulative relative frequency
cum.rel.freq.height <- round(cumsum(prop.table(freq.height)) * 100, 2)
# Use the cbind function to display all frequencies vertically
cbind("Freq" = freq.height, "Cum Freq" = cum.freq.height,
"Rel Freq" = rel.freq.height, "Cum Rel Freq" = cum.rel.freq.height)
## Freq Cum Freq Rel Freq Cum Rel Freq
## [150,155) 6 6 2.87 2.87
## [155,160) 13 19 6.22 9.09
## [160,165) 20 39 9.57 18.66
## [165,170) 45 84 21.53 40.19
## [170,175) 42 126 20.10 60.29
## [175,180) 27 153 12.92 73.21
## [180,185) 28 181 13.40 86.60
## [185,190) 17 198 8.13 94.74
## [190,195) 8 206 3.83 98.56
## [195,200) 2 208 0.96 99.52
## [200,205) 1 209 0.48 100.00
# 1. Find the range of the ages
range(survey$Age, na.rm = TRUE)
## [1] 16.75 73.00
# 2. Define the intervals
intervals <- seq(from = 15, to = 75, by = 5)
intervals
## [1] 15 20 25 30 35 40 45 50 55 60 65 70 75
# 3. Break the ages down using these intervals
age.intervals <- cut(x = survey$Age, breaks = intervals, right = FALSE)
age.intervals
Note that for conciseness we omit the age.intervals output here.
# Frequency table
freq.age <- table(age.intervals)
freq.age
## age.intervals
## [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,65)
## 171 47 6 4 4 3 0 0 0 0
## [65,70) [70,75)
## 0 2
# Relative Frequency Table
rel.freq.age <- round(prop.table(freq.age) * 100, 2)
rel.freq.age
## age.intervals
## [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,65)
## 72.15 19.83 2.53 1.69 1.69 1.27 0.00 0.00 0.00 0.00
## [65,70) [70,75)
## 0.00 0.84
# Cumulative frequency
cum.freq.age <- cumsum(freq.age)
cum.freq.age
## [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,65)
## 171 218 224 228 232 235 235 235 235 235
## [65,70) [70,75)
## 235 237
# Cumulative relative frequency
cum.rel.freq.age <- round(cumsum(prop.table(freq.age)) * 100, 2)
cum.rel.freq.age
## [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,65)
## 72.15 91.98 94.51 96.20 97.89 99.16 99.16 99.16 99.16 99.16
## [65,70) [70,75)
## 99.16 100.00
# Use the cbind function to display all frequencies vertically
cbind("Freq" = freq.age, "Cum Freq" = cum.freq.age,
"Rel Freq" = rel.freq.age, "Cum Rel Freq" = cum.rel.freq.age)
## Freq Cum Freq Rel Freq Cum Rel Freq
## [15,20) 171 171 72.15 72.15
## [20,25) 47 218 19.83 91.98
## [25,30) 6 224 2.53 94.51
## [30,35) 4 228 1.69 96.20
## [35,40) 4 232 1.69 97.89
## [40,45) 3 235 1.27 99.16
## [45,50) 0 235 0.00 99.16
## [50,55) 0 235 0.00 99.16
## [55,60) 0 235 0.00 99.16
## [60,65) 0 235 0.00 99.16
## [65,70) 0 235 0.00 99.16
## [70,75) 2 237 0.84 100.00
We can create a histogram of the ages of the students as follows:
hist(survey$Age)
The data appear to be highly skewed to the right, and we can also see an outlier.
# Reset intervals
range(survey$Age, na.rm = TRUE)
## [1] 16.75 73.00
intervals <- seq(from = 15, to = 75, by = 5)
hist(survey$Age, breaks = intervals, right = FALSE)
hist(survey$Age, breaks = intervals, right = FALSE, labels = TRUE)
We could try, for example:
hist(survey$Age, breaks = 5, right = FALSE)
or
hist(survey$Age, breaks = 50, right = FALSE)
Notice that when we have too large a number of breaks, the histogram can become less informative.
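When breaks is given as a single number, the documentation for hist() describes it as a suggestion only, with the actual breakpoints set to "pretty" values. A quick way to see what R chose is to inspect the histogram object without drawing it. The ages below are made up for illustration and are not the survey data.

```r
# Hypothetical ages, loosely spanning the range reported above
ages <- c(16.75, 18, 19, 20, 21, 23, 35, 44, 73)

# plot = FALSE returns the histogram object without drawing it
h <- hist(ages, breaks = 5, plot = FALSE)

h$breaks          # the breakpoints R actually used
length(h$counts)  # the number of bins, which need not equal 5
```

Passing a vector of breakpoints, as in the earlier hist calls with breaks = intervals, gives full control over the bins.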
We could for example produce the following histogram:
hist(survey$Age, breaks = intervals, right = FALSE,
xlab = "Age (years)",
main = "Age of Students",
col = "lightblue")
hist(survey$Age, plot = FALSE)
## $breaks
## [1] 15 20 25 30 35 40 45 50 55 60 65 70 75
##
## $counts
## [1] 174 44 6 4 4 3 0 0 0 0 0 2
##
## $density
## [1] 0.146835443 0.037130802 0.005063291 0.003375527 0.003375527 0.002531646
## [7] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.001687764
##
## $mids
## [1] 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5
##
## $xname
## [1] "survey$Age"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
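The $counts above (174, 44, ...) differ slightly from the earlier frequency table (171, 47, ...). A likely explanation is that hist() defaults to right = TRUE, so a student aged exactly 20 is counted in the first bin, whereas the table was built with right = FALSE. The sketch below reproduces the effect on four made-up ages; it is an illustration under that assumption, not a check against the survey data.

```r
# Hypothetical ages, two of them exactly on the boundary at 20
ages <- c(17, 20, 20, 22)
breaks <- c(15, 20, 25)

# Default right = TRUE: bins are [15,20] and (20,25]
closed.right <- hist(ages, breaks = breaks, plot = FALSE)$counts

# right = FALSE: bins are [15,20) and [20,25]
closed.left <- hist(ages, breaks = breaks, right = FALSE,
                    plot = FALSE)$counts

closed.right  # the two 20-year-olds land in the first bin
closed.left   # the two 20-year-olds land in the second bin
```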
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.