This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Execute a code chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
A list of colours available in R can be found at
A histogram is a type of bar chart in which the categories are numerical intervals rather than nominal categories.
In this section we are going to use R to plot a histogram for a given data set.
In lectures we will see that there is some freedom in how a histogram is drawn: the number of bars, and consequently the bar widths, are not fixed but can be chosen.
However, we do have some guidelines to help us:
The number of bars should be between 5 and 20.
If the lowest and highest data points are not convenient numbers then we can do the following:
Round down the smallest data point to the next lowest convenient number.
Round up the largest data point to the next highest convenient number.
Find the length (or range) of this new interval by subtracting the lowest from the highest data values.
If possible, choose the number of bars such that this number divides evenly into the range, to give us our interval.
Starting with the smallest value, start marking off the horizontal axis in units of the interval, until the highest data point is reached.
Categorise the data into these intervals and construct a frequency (or relative frequency) table for this categorisation.
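The guidelines above can be sketched in R as follows. This is a minimal illustration under stated assumptions: the vector x holds made-up values, "convenient numbers" are taken to be whole numbers (via floor() and ceiling()), and a bar width of 2 is chosen by hand because it divides the range evenly.

```r
# Hypothetical data, made up purely for illustration
x <- c(12.3, 14.8, 15.1, 17.6, 18.2, 19.9, 21.4, 22.7, 23.5, 26.1, 27.8)

# Round the extremes to convenient (here: whole) numbers
lower <- floor(min(x))     # round the smallest value down
upper <- ceiling(max(x))   # round the largest value up
data_range <- upper - lower

# Choose a bar width that divides the range evenly (assumption: width 2)
width <- 2
n_bars <- data_range / width

# Mark off the horizontal axis in steps of the bar width
breaks <- seq(lower, upper, by = width)

# Sort the data into the intervals and tabulate.
# right = FALSE gives intervals of the form [a, b);
# include.lowest = TRUE closes the final interval so the top end point is kept.
classes <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
freq_table <- table(classes)
rel_freq_table <- freq_table / length(x)

print(freq_table)
print(round(rel_freq_table, 3))
```

Whatever width is chosen, it is worth checking that the resulting number of bars stays between 5 and 20.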
The trees in an orchard were measured and found to have the following heights
Construct a histogram to graph this data set.
Heights <- c(3.4, 3.5, 3.7, 4.2, 4.4, 4.7, 4.9, 5.1, 5.2, 5.3, 5.9, 6.0, 6.4, 7.3, 7.9, 8.1, 8.7)
hist(Heights, breaks = 6, axes = TRUE, col = "skyblue4", main = "Histogram of Tree Heights in an Orchard", ylab = "Frequency", xlab = "Tree heights (m)", freq = TRUE)
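# Save the same histogram to the file Histogram1.pdf in the working directory
pdf("Histogram1.pdf")
hist(Heights, breaks = 6, axes = TRUE, col = "skyblue4", main = "Histogram of Tree Heights in an Orchard", ylab = "Frequency", xlab = "Tree heights (m)", freq = TRUE)
# Close the PDF device so that the file is written
dev.off()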
null device
1
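Note that when breaks is a single number, hist() treats it only as a suggestion and chooses "pretty" break points itself, so the number of bars drawn may differ from the number requested. To control the intervals exactly, a vector of break points can be supplied instead. The sketch below assumes a bar width of 1 m on the interval from 3 to 9, in line with the rounding guidelines given earlier; these choices are illustrative, not the only valid ones.

```r
Heights <- c(3.4, 3.5, 3.7, 4.2, 4.4, 4.7, 4.9, 5.1, 5.2, 5.3, 5.9,
             6.0, 6.4, 7.3, 7.9, 8.1, 8.7)

# Explicit break points 3, 4, ..., 9 give six bars of width 1 m (assumed choice)
tree_breaks <- seq(3, 9, by = 1)

h <- hist(Heights, breaks = tree_breaks, col = "skyblue4",
          main = "Histogram of Tree Heights in an Orchard",
          xlab = "Tree heights (m)", ylab = "Frequency")

# The returned object stores the break points and the count in each interval
h$breaks
h$counts
```

Storing the result of hist() in a variable is also a convenient way to read off the frequency table behind the plot.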
A collection of data is skewed left if its histogram displays a tail to the left. This indicates that the data set contains outliers at its lower end.
Left<-c(1,8,8,8,9,9,9,10,10,11,11,11)
hist(Left,col="cornflowerblue",main="Left-Skewed Data", xlab="Left Data Values")
Conversely, the data is skewed right if the histogram displays a tail to the right; likewise, this indicates the presence of outliers at the upper end of the data set.
Right<-c(1,1,2,3,3,3,4,4,5,5,5,12)
hist(Right,breaks=5, col="cornflowerblue",main="Right-Skewed Data", xlab="Right Data Values")
If the data is skewed right then \[ \text{mean} > \text{median} \]
If the data is skewed left then \[ \text{mean} < \text{median} \]
If the data is central then \[ \text{mean} = \text{median} \] and the histogram is symmetric.
We will test the data sets Left and Right for skewness.
mean(Left)
[1] 8.75
median(Left)
[1] 9
mean(Right)
[1] 4
median(Right)
[1] 3.5
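For the two data sets above, the mean of Left is less than its median, which is consistent with left skew, while the mean of Right is greater than its median, which is consistent with right skew. The comparison can be wrapped in a small helper function, sketched below. The name skew_direction is hypothetical (it is not a built-in R function), and the mean-versus-median comparison should be read as a rule of thumb rather than a formal test.

```r
# Hypothetical helper: classify skewness by comparing the mean with the median
skew_direction <- function(x) {
  m <- mean(x)
  med <- median(x)
  if (isTRUE(all.equal(m, med))) {
    "central"          # mean and median agree (up to rounding error)
  } else if (m > med) {
    "skewed right"     # mean pulled upwards by large values
  } else {
    "skewed left"      # mean pulled downwards by small values
  }
}

Left <- c(1, 8, 8, 8, 9, 9, 9, 10, 10, 11, 11, 11)
Right <- c(1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 12)

skew_direction(Left)
skew_direction(Right)
```

all.equal() is used instead of == so that tiny floating-point differences are not mistaken for skewness.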
The histogram in Example 1 appears to display a tail to the right and so we expect the data to be skewed right.
To test this we can use the mean() and the median() functions and compare.
mean(Heights)
[1] 5.570588
median(Heights)
[1] 5.2
Since the mean is greater than the median, the data is skewed right, as we expected from the histogram above.
The marks scored by a class in a test are as follows (not in order):
Using this data, answer the following:
A. Use the sort() function to sort this data and the length() function to count the data points
B. Draw a histogram to represent this sorted data
C. From the histogram decide if the data is skewed left, skewed right or symmetric
D. Find the mean of this data
E. Find the median of this data set
F. Comparing the mean and the median, decide if the data is skewed left, skewed right or symmetric. Does this agree with the histogram?
Scores <- c(50, 63, 29, 54, 56, 0, 53, 34, 45, 85, 65, 62, 60, 3, 71, 66, 63, 68, 69, 67, 79, 75, 72, 81, 87, 91, 52)
sort(Scores)
[1] 0 3 29 34 45 50 52 53 54 56 60 62 63 63 65 66 67 68 69 71 72 75 79 81 85 87 91
length(Scores)
[1] 27
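# With freq = FALSE, hist() plots densities (total bar area is 1), so the axis is labelled "Density"; ylim is widened so that the tallest bar is not cut off
hist(Scores, breaks = 8, xlim = c(0, 100), ylim = c(0, 0.03), freq = FALSE, xlab = "Exam Scores", ylab = "Density", col = "darkslategray4", main = "Histogram of Exam Scores")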
From the histogram, the data appears to be skewed left.
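One point worth noting: with freq = FALSE, hist() plots densities, scaled so that the total area of the bars is 1, rather than relative frequencies. The two differ whenever the bar width is not 1. If relative frequencies are wanted on the vertical axis, one possible approach (a sketch, not the only way) is to compute the histogram without plotting it, rescale the counts, and then plot the result. The break points 0, 10, ..., 100 are an assumption chosen to match the xlim range used above.

```r
Scores <- c(50, 63, 29, 54, 56, 0, 53, 34, 45, 85, 65, 62, 60, 3, 71,
            66, 63, 68, 69, 67, 79, 75, 72, 81, 87, 91, 52)

# Compute the histogram without drawing it; breaks at 0, 10, ..., 100 (assumed)
h <- hist(Scores, breaks = seq(0, 100, by = 10), plot = FALSE)

# Replace the counts with relative frequencies (proportions of all scores)
h$counts <- h$counts / sum(h$counts)

# Plot the rescaled histogram; the bar heights now sum to 1
plot(h, freq = TRUE, col = "darkslategray4",
     xlab = "Exam Scores", ylab = "Relative Frequency",
     main = "Histogram of Exam Scores")
```

The shape of the histogram is the same either way; only the scale on the vertical axis changes.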
mean(Scores)
[1] 59.25926
median(Scores)
[1] 63
Since the mean is less than the median, the data is skewed left. This agrees with the shape of the histogram for the data set.
The journey times (in hours) for a train service between two cities were recorded on a given day, with the following data gathered:
1.2, 1.22, 1.23, 1.25, 1.3, 1.34, 1.36, 1.39, 1.40, 1.42, 1.45, 1.48, 1.51, 1.56
Given this data, answer the following:
Use the median() function to find the median of this data set
Use the mean() function to find the mean of this data set
Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?
Construct a histogram of this data set
Using this histogram, decide if the data is skewed-left, skewed-right or central
Does this agree with your answer from part 3?
An electrical components manufacturer measures the resistance of a sample of 50 resistors produced during one production run. The resistances were measured as follows
51.20, 50.30, 52.40, 50.24, 51.25, 49.89, 51.63, 52.84, 53.01, 49.99
50.34, 50.61, 51.93, 52.25, 49.87, 50.13, 51.32, 50.43, 51.34, 52.13
52.11, 56.84, 57.81, 51.33, 52.36, 50.67, 51.89, 52.13, 52.18, 50.31
50.84, 51.12, 50.95, 49.98, 51.03, 52.01, 50.96, 51.02, 52.44, 51.95
Given this data, answer the following:
Sort the data according to increasing measurements of resistance
Use the median() function to find the median of this data set
Use the mean() function to find the mean of this data set
Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?
Construct a histogram of this data set
Using this histogram, decide if the data is skewed-left, skewed-right or central
Does this agree with your answer from part 4?
The water quality at a water treatment facility is monitored hourly by measuring the impurities present in the water (in parts per million). The following data were collected during one 36-hour period:
18, 8, 25, 13, 7, 19
14, 5, 21, 23, 18, 5
13, 6, 15, 21, 17, 6
15, 7, 13, 5, 18, 11
12, 14, 15, 16, 12, 13
9, 17, 16, 19, 8, 10
Given this data, answer the following:
Sort the data in order of increasing values of impurities measured
Use the median() function to find the median of this data set
Use the mean() function to find the mean of this data set
Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?
Construct a histogram of this data set
Using this histogram, decide if the data is skewed-left, skewed-right or central
Does this agree with your answer from part 4?
A Pareto chart is a combination of a bar chart, with the bars arranged in decreasing order of frequency, and a cumulative frequency chart.
Pareto charts are a useful diagnostic tool when one wishes to limit defects, errors, delays or breakdowns in different circumstances.
Remark: To generate Pareto charts we first need to install the qcc package, for example by running install.packages("qcc") once in the console.
Once it is installed, we load the qcc library as follows:
library(qcc)
Quality Control Charts and Statistical Process Control, version 2.7
Type 'citation("qcc")' for citing this R package in publications.
The employees of a company were asked the reasons for late arrivals at work. The data collected were as follows:
Child Care 22
Emergency 8
Overslept 12
Public transport 15
Traffic 36
Weather 27
Using this data answer the following:
Create an appropriate vector to represent this data.
Use the function pareto.chart() to generate a Pareto chart for this data.
From the Pareto chart, determine which causes could reasonably be addressed to reduce late arrivals by at least 35%.
We first create an appropriate vector of numbers to represent the data, in the usual way:
Employees <- c(22,8,12,15,36,27)
Employees
[1] 22 8 12 15 36 27
Now we create an appropriate vector of names corresponding to these numbers, as follows:
Reasons <- c("Child Care", "Emergency", "Overslept", "Public Transport", "Traffic", "Weather")
Reasons
[1] "Child Care" "Emergency" "Overslept" "Public Transport" "Traffic"
[6] "Weather"
names(Employees)<-Reasons
That is, each entry of Employees has now been given a corresponding name, which we can confirm by printing the vector:
Employees
Child Care Emergency Overslept Public Transport Traffic Weather
22 8 12 15 36 27
We see that each number has been assigned its corresponding reason.
pareto.chart(Employees,col="skyblue2", cumperc=seq(0,100,by=10), main="Pareto chart giving reasons for late arrivals at work")
Pareto chart analysis for Employees
Frequency Cum.Freq. Percentage Cum.Percent.
Traffic 36.000000 36.000000 30.000000 30.000000
Weather 27.000000 63.000000 22.500000 52.500000
Child Care 22.000000 85.000000 18.333333 70.833333
Public Transport 15.000000 100.000000 12.500000 83.333333
Overslept 12.000000 112.000000 10.000000 93.333333
Emergency 8.000000 120.000000 6.666667 100.000000
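To work with relative frequencies instead, we divide each count by the total number of responses:
Total <- sum(Employees)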
Total
[1] 120
Employees <- Employees/Total
Employees
Child Care Emergency Overslept Public Transport Traffic Weather
0.18333333 0.06666667 0.10000000 0.12500000 0.30000000 0.22500000
pareto.chart(Employees,col="skyblue2", cumperc=seq(0,100,by=10), ylab="Relative Frequency", main="Pareto chart giving reasons for late arrivals at work")
Pareto chart analysis for Employees
Frequency Cum.Freq. Percentage Cum.Percent.
Traffic 0.30000000 0.30000000 30.00000000 30.00000000
Weather 0.22500000 0.52500000 22.50000000 52.50000000
Child Care 0.18333333 0.70833333 18.33333333 70.83333333
Public Transport 0.12500000 0.83333333 12.50000000 83.33333333
Overslept 0.10000000 0.93333333 10.00000000 93.33333333
Emergency 0.06666667 1.00000000 6.66666667 100.00000000
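The figures in the table printed by pareto.chart() can also be reproduced in base R, which may help when reading the chart. The sketch below sorts the counts in decreasing order and accumulates their percentages. The variable names are chosen for illustration, and the 35% target is taken from the question above. It only identifies the largest causes; whether a given cause can reasonably be addressed remains a matter of judgement.

```r
# Counts of late arrivals by reason, as given in the example
LateCounts <- c("Child Care" = 22, "Emergency" = 8, "Overslept" = 12,
                "Public Transport" = 15, "Traffic" = 36, "Weather" = 27)

# Sort from most to least frequent, then accumulate the percentages
Ordered <- sort(LateCounts, decreasing = TRUE)
Percent <- 100 * Ordered / sum(Ordered)
CumPercent <- cumsum(Percent)

round(cbind(Frequency = Ordered, Percentage = Percent,
            Cum.Percent = CumPercent), 2)

# Fewest leading causes whose combined share reaches the 35% target
n_needed <- which(CumPercent >= 35)[1]
names(Ordered)[seq_len(n_needed)]
```

Changing the value 35 to another target gives the corresponding set of leading causes for other questions of this kind.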
A mobile phone repair shop categorized the type of repair work it carried out over a year and found the following data:
Given this data, answer the following:
Find the total number of repairs in the year.
Construct a vector to represent the relative frequencies of each of these classes.
Use this vector of relative frequencies to construct a Pareto chart of this data.
From the Pareto chart, determine the causes of phone damage that owners should address to reduce repairs by at least 25%.
A newly built apartment complex was inspected to prepare a snag list for each apartment. After inspection of the apartment complex, the following issues were found to occur with the frequency indicated:
Using this data answer the following:
Identify the data type given.
Construct a vector to represent the frequencies.
Using this vector, evaluate the number of damages to be repaired in the apartment complex.
Using the vector of frequency values, construct a Pareto chart for this data.
From this data, determine which issues should be addressed to ensure at least 75% of damages are repaired.