This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Execute a code chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.

Add a new chunk by clicking the Insert Chunk button on the tool-bar or by pressing Ctrl+Alt+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

Useful References for R

A list of colours available in R can be found at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Section 1 - Histograms

However we do have some guidelines to help us:

Guidelines for Histograms

  1. The number of bars should be between 5 and 20.

  2. If the lowest and highest data points are not convenient numbers then we can do the following:

    • Round down the smallest data point to the next lowest convenient number.

    • Round up the largest data point to the next highest convenient number.

  3. Find the length (or range) of this new interval by subtracting the lowest from the highest data values.

  4. If possible, choose the number of bars such that this number divides evenly into the range, to give us our interval.

  5. Starting with the smallest value, start marking off the horizontal axis in units of the interval, until the highest data point is reached.

  6. Categorise the data into these intervals and construct a frequency (or relative frequency) table for this categorisation.

  • NOTE: If a data value falls on the boundary between two consecutive intervals, place it in the upper interval.
  7. The height of each bar on the histogram corresponds to the frequency or relative frequency of the interval.

Example 1

The trees in an orchard were measured and found to have the following heights

  • 3.4m, 3.5m, 3.7m, 4.2m, 4.4m, 4.7m, 4.9m, 5.1m, 5.2m, 5.3m, 5.9m, 6.0m, 6.4m, 7.3m, 7.9m, 8.1m, 8.7m.

Construct a histogram to graph this data set.

Solution

  1. We may round off the highest and lowest values to more convenient numbers such as
  • 3.4 \(\to\) 3.
  • 8.7 \(\to\) 9.
  2. The new data range is given by range = 9 - 3 = 6.
  3. We choose a number between 5 and 20 which divides evenly into this range, and so an obvious choice is 6 bars for this histogram.
  4. We create a vector-valued object to store the measured heights:
Heights <- c(3.4, 3.5, 3.7, 4.2, 4.4, 4.7, 4.9, 5.1, 5.2, 5.3, 5.9, 6.0, 6.4, 7.3, 7.9, 8.1, 8.7)
  5. We use the hist() function to plot a histogram of the vector Heights
hist(Heights, breaks=6, axes=TRUE, col="skyblue4", main="Histogram of Tree Heights in an Orchard", ylab="Frequency", xlab="Tree heights (m)", freq=TRUE)

  6. To print a PDF of this graph we use the pdf() and dev.off() commands
pdf("Histogram1.pdf")
hist(Heights, breaks=6, axes=TRUE, col="skyblue4", main="Histogram of Tree Heights in an Orchard", ylab="Frequency", xlab="Tree heights (m)", freq=TRUE)
dev.off()
null device 
          1 
  7. Unlike the barplot() command, the histogram plot automates a large amount of the work for us. Nevertheless, it is still possible to override most of these automatic choices by selecting various arguments for the hist() function.
  • Most of these arguments are explained in the R documentation available in the Help panel of the RStudio environment. Just type hist into the search bar and press Enter.

Skewed Data and Central Data

A collection of data is skewed left if its histogram displays a tail to the left. This means that the data set possesses outliers at its lower end.

Example: Skewed-left data

Left<-c(1,8,8,8,9,9,9,10,10,11,11,11)
hist(Left,col="cornflowerblue",main="Left-Skewed Data", xlab="Left Data Values")

  • Notice, most of the data is concentrated to the right between 6 and 12. However, a small portion of the data forms a left-tail between 0 and 2. This tail is what skews the data to the left.

Conversely, the data is skewed right if the histogram displays a tail to the right, and likewise this indicates the presence of outliers at the upper end of the data set.

Example: Right-skewed data

Right<-c(1,1,2,3,3,3,4,4,5,5,5,12)
hist(Right,breaks=5, col="cornflowerblue",main="Right-Skewed Data", xlab="Right Data Values")

  • Most of the data is concentrated to the left between 0 and 6. However, a small portion of the data forms a right-tail between 10 and 12. This tail is what skews the data to the right.

Test for skewness

  • If the data is skewed right then \[ mean > median \]

  • If the data is skewed left then \[ mean < median \]

  • If the data is central then \[ mean = median \] and the histogram is symmetric.

Example: Test for skewness

We will test the data sets Left and Right for skewness.

Left Data:

mean(Left)
[1] 8.75
median(Left)
[1] 9
  • We see that the mean of the Left data is less than the median of the Left data, confirming it is skewed-left.

Right Data:

mean(Right)
[1] 4
median(Right)
[1] 3.5
  • We see that the mean of the Right data is greater than the median of the Right data, confirming it is skewed-right.

Skewness of Heights:

The histogram in Example 1 appears to display a tail to the right and so we expect the data to be skewed right.

To test this we can use the mean() and the median() functions and compare.

mean(Heights)
[1] 5.570588
median(Heights)
[1] 5.2

Since the mean is bigger than the median, this means the data is skewed right as we expect from the histogram above.

Example 2

The marks scored by a class in a test are as follows (not in order):

  • 50, 63, 29, 54, 56, 0, 53, 34, 45, 85, 65, 62, 60, 3, 71, 66, 63, 68, 69, 67, 79, 75, 72, 81, 87, 91, 52

Using this data, answer the following:

A. Use the sort() function to sort this data and the length() function to count the data points

B. Draw a histogram to represent this sorted data

C. From the histogram decide if the data is skewed left, skewed right or symmetric

D. Find the mean of this data

E. Find the median of this data set

F. Comparing the mean and the median, decide if the data is skewed left, skewed right or symmetric. Does this agree with the histogram?

Solution

Scores <- c(50, 63, 29, 54, 56, 0, 53, 34, 45, 85, 65, 62, 60, 3, 71, 66, 63, 68, 69, 67, 79, 75, 72, 81, 87, 91, 52)
sort(Scores)
 [1]  0  3 29 34 45 50 52 53 54 56 60 62 63 63 65 66 67 68 69 71 72 75 79 81 85 87 91
length(Scores)
[1] 27
hist(Scores, breaks=8, xlim=c(0,100), ylim=c(0,0.03), freq=FALSE, xlab="Exam Scores", ylab="Density", col="darkslategray4", main="Histogram of Exam Scores")

  1. From the histogram, the data appears to be skewed-left.

mean(Scores)
[1] 59.25926
median(Scores)
[1] 63

Since the mean is less than the median the data is skewed left. This agrees with the shape of the histogram for the data set.

Exercise 1

The journey times (in hours) for a train service between two cities were recorded on a given day, with the following data gathered:

1.2, 1.22, 1.23, 1.25, 1.3, 1.34, 1.36, 1.39, 1.40, 1.42, 1.45, 1.48, 1.51, 1.56

Given this data, answer the following:

  1. Use the median() function to find the median of this data set

  2. Use the mean() function to find the mean of this data set

  3. Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?

  4. Construct a histogram of this data set

  5. Using this histogram, decide if the data is skewed-left, skewed-right or central

  6. Does this agree with your answer from part 3?

Exercise 2

An electrical components manufacturer measures the resistance of a sample of 50 resistors produced during one production run. The resistances were measured as follows

51.20, 50.30, 52.40, 50.24, 51.25, 49.89, 51.63, 52.84, 53.01, 49.99

50.34, 50.61, 51.93, 52.25, 49.87, 50.13, 51.32, 50.43, 51.34, 52.13

52.11, 56.84, 57.81, 51.33, 52.36, 50.67, 51.89, 52.13, 52.18, 50.31

50.84, 51.12, 50.95, 49.98, 51.03, 52.01, 50.96, 51.02, 52.44, 51.95

Given this data, answer the following:

  1. Sort the data according to increasing measurements of resistance

  2. Use the median() function to find the median of this data set

  3. Use the mean() function to find the mean of this data set

  4. Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?

  5. Construct a histogram of this data set

  6. Using this histogram, decide if the data is skewed-left, skewed-right or central

  7. Does this agree with your answer from part 4?

Exercise 3

The water quality at a water treatment facility is monitored hourly by measuring the impurities present in the water (in parts per million). The following data was collected during one 36-hour period:

18, 8, 25, 13, 7, 19

14, 5, 21, 23, 18, 5

13, 6, 15, 21, 17, 6

15, 7, 13, 5, 18, 11

12, 14, 15, 16, 12, 13

9, 17, 16, 19, 8, 10

Given this data, answer the following:

  1. Sort the data in order of increasing values of impurities measured

  2. Use the median() function to find the median of this data set

  3. Use the mean() function to find the mean of this data set

  4. Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?

  5. Construct a histogram of this data set

  6. Using this histogram, decide if the data is skewed-left, skewed-right or central

  7. Does this agree with your answer from part 4?

Pareto Charts

A Pareto chart is a combination of a bar chart and a cumulative frequency chart.

They are a useful diagnostic tool when one wishes to limit defects, errors, delays or breakdowns in different circumstances.

Remark: To generate Pareto charts we need to import the qcc library as follows

  1. In the Tools tab above, select the option Install Packages…
  2. In the Install from option, select Repository(CRAN, CRANextra)
  3. In the Packages input line type qcc
  4. In the Install to Library option select your own home directory (this should be the default anyhow)
  5. Make sure Install all dependencies is checked

Once installed we call the library qcc as follows

library(qcc)
  __ _  ___ ___ 
 / _  |/ __/ __|  Quality Control Charts and 
| (_| | (_| (__   Statistical Process Control
 \__  |\___\___|
    |_|           version 2.7
Type 'citation("qcc")' for citing this R package in publications.

Example 1

The employees of a company were asked the reasons for late arrivals at work. The data collected were as follows:

  • Child Care 22

  • Emergency 8

  • Overslept 12

  • Public transport 15

  • Traffic 36

  • Weather 27

Using this data answer the following:

  1. Create an appropriate vector to represent this data.

  2. Use the function pareto.chart() to generate a Pareto chart for this data.

  3. From the Pareto chart, determine which causes could reasonably be addressed to reduce late arrivals by at least 35%.

Solution

  1. We first create an appropriate vector of numbers to represent the data, in the usual way:

Employees <- c(22,8,12,15,36,27)
Employees
[1] 22  8 12 15 36 27

Now we create an appropriate vector of names, corresponding to these numbers, as follows:

Reasons <- c("Child Care", "Emergency", "Overslept", "Public Transport", "Traffic", "Weather")
Reasons
[1] "Child Care"       "Emergency"        "Overslept"        "Public Transport" "Traffic"         
[6] "Weather"         
  • Now we re-create the Employees vector by assigning it the names in the vector Reasons as follows:
names(Employees)<-Reasons

i.e. each entry has now been given a corresponding name.

  • The new Employees vector is
Employees
      Child Care        Emergency        Overslept Public Transport          Traffic          Weather 
              22                8               12               15               36               27 

and so we see each number has been assigned a corresponding reason.

  2. The Pareto chart for this data is given by
pareto.chart(Employees,col="skyblue2", cumperc=seq(0,100,by=10), main="Pareto chart giving reasons for late arrivals at work")
                  
Pareto chart analysis for Employees
                    Frequency  Cum.Freq. Percentage Cum.Percent.
  Traffic           36.000000  36.000000  30.000000    30.000000
  Weather           27.000000  63.000000  22.500000    52.500000
  Child Care        22.000000  85.000000  18.333333    70.833333
  Public Transport  15.000000 100.000000  12.500000    83.333333
  Overslept         12.000000 112.000000  10.000000    93.333333
  Emergency          8.000000 120.000000   6.666667   100.000000

  • We notice that the pareto.chart() function plots the frequency of each class, as opposed to the relative frequency of each class. This is easily modified by summing all frequencies in the Employees vector and dividing the vector by this number to give the relative frequency of each class.
Total <- sum(Employees)
Total
[1] 120
Employees <- Employees/Total
Employees
      Child Care        Emergency        Overslept Public Transport          Traffic          Weather 
      0.18333333       0.06666667       0.10000000       0.12500000       0.30000000       0.22500000 
pareto.chart(Employees,col="skyblue2", cumperc=seq(0,100,by=10), ylab="Relative Frequency", main="Pareto chart giving reasons for late arrivals at work")
                  
Pareto chart analysis for Employees
                      Frequency    Cum.Freq.   Percentage Cum.Percent.
  Traffic            0.30000000   0.30000000  30.00000000  30.00000000
  Weather            0.22500000   0.52500000  22.50000000  52.50000000
  Child Care         0.18333333   0.70833333  18.33333333  70.83333333
  Public Transport   0.12500000   0.83333333  12.50000000  83.33333333
  Overslept          0.10000000   0.93333333  10.00000000  93.33333333
  Emergency          0.06666667   1.00000000   6.66666667 100.00000000

Exercise 4

A mobile phone repair shop categorized the type of repair work it carried out over a year and found the following data:

  • Cracked Screen 125
  • Scratched Screen 228
  • Forgotten Pass code 91
  • Water Damage 188
  • Battery Replacement 111
  • Software Virus 82
  • Phone Bricking 188
  • Damaged Connector 41

Given this data, answer the following:

  1. Find the total number of repairs in the year.

  2. Construct a vector to represent the relative frequencies of each of these classes.

  3. Use this vector of relative frequencies to construct a Pareto chart of this data.

  4. From the Pareto chart determine the causes of phone damage owners should address to reduce repairs by at least 25%.

Exercise 5

A newly built apartment complex was inspected to prepare a snag list for each apartment. After inspection of the apartment complex, the following issues were found to occur with the frequency indicated:

List of Damages

  • Cooker Disconnected 4
  • Windows Not Closing 18
  • Bathroom Ventilator Broken 9
  • Damaged Plug Sockets 51
  • Broken Light 29
  • Damaged Paintwork 47
  • Doors Not Locking 14
  • Dampness 2
  • Scratched Flooring 21

Using this data answer the following:

  1. Identify the data type given.

  2. Construct a vector to represent the frequencies.

  3. Using this vector, evaluate the number of damages to be repaired in the apartment complex.

  4. Using the vector of frequency values, construct a Pareto chart for this data.

  5. From this data determine which issues should be addressed to ensure at least 75% of damages are repaired.

---
title: ' Data Visualisation 2019 - Assignment 1'
output:
  html_notebook: default
  pdf_document: default
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

Execute a code chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter*. 

Add a new chunk by clicking the *Insert Chunk* button on the tool-bar or by pressing *Ctrl+Alt+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* to preview the HTML file).

## Useful References for __R__
A list of colours available in **R** can be found at

* http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

# Section 1 - Histograms

 * A __Histogram__ is a type of bar-chart where the categories are number intervals as opposed to nominal categories.


 * In this section we are going to use __R__ to plot a histogram for a given data set. 
 
 * During lectures we will find that there is a certain amount of ambiguity in how we graph a histogram, in that the number of bars and consequently the bar-widths are not fixed, but can be chosen. 
 
However we do have some guidelines to help us:

### Guidelines for Histograms

1. The number of bars should be between __5__ and __20__.

2. If the lowest and highest data points are __not convenient numbers__ then we can do the following:
    
    * __Round down__ the smallest data point to the next lowest convenient number.
    
    * __Round up__ the largest data point to the next highest convenient number.
    
3. Find the __length__ (or __range__) of this new interval by subtracting the lowest from the highest data values.

4. If possible, choose the number of bars such that this number divides evenly into the __range__, to give us our __interval__.


5. Starting with the smallest value, start marking off the horizontal axis in units of the interval, until the highest data point is reached.

6. Categorise the data into these intervals and construct a frequency (or relative frequency) table for this categorisation.
  
 * __NOTE:__ If a data value falls on the boundary between two consecutive intervals, place it in the __upper interval__.  
  
7. The height of each bar on the histogram corresponds to the frequency or relative frequency of the interval.
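
The guidelines above can be sketched in R as follows. This is a minimal illustration rather than a prescribed method: the vector `x` and the other variable names are made up for the example, and using `cut()` with `right = FALSE` is one possible way to place boundary values in the upper interval, as the NOTE requires.

```{r}
# Illustrative data only (not taken from the examples in this notebook)
x <- c(2.3, 3.0, 4.1, 4.8, 5.0, 6.7, 7.2)

# Guideline 2: round the extremes to convenient numbers
lower <- floor(min(x))     # 2.3 -> 2
upper <- ceiling(max(x))   # 7.2 -> 8

# Guideline 3: length of the new interval
data_range <- upper - lower

# Guideline 4: choose a number of bars that divides the range evenly
n_bars <- 6
bar_width <- data_range / n_bars

# Guideline 5: mark off the horizontal axis in units of the interval
breaks <- seq(lower, upper, by = bar_width)

# Guideline 6: frequency table; right = FALSE puts a boundary value
# (such as 3.0 or 5.0) into the upper interval
freq_table <- table(cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE))
freq_table

# Relative frequencies, if preferred
freq_table / length(x)
```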

### Example 1

The trees in an orchard were measured and found to have the following heights

* 3.4m, 3.5m, 3.7m, 4.2m, 4.4m, 4.7m, 4.9m, 5.1m, 5.2m, 5.3m, 5.9m, 6.0m, 6.4m, 7.3m, 7.9m, 8.1m, 8.7m.

Construct a histogram to graph this data set.

### Solution 
1. We may __round off__ the highest and lowest values to more convenient numbers such as
  * __3.4 $\to$ 3__.
  * __8.7 $\to$ 9__.
2. The new data __range__ is given by __range__=9-3=6
3. We choose a number between __5__ and __20__ which __divides__ evenly into this __range__, and so an obvious choice is __6 bars__ for this histogram.
4. We create a __vector__ valued object to store the measured heights: 

```{r}
Heights <- c(3.4, 3.5, 3.7, 4.2, 4.4, 4.7, 4.9, 5.1, 5.2, 5.3, 5.9, 6.0, 6.4, 7.3, 7.9, 8.1, 8.7)
```

5. We use the __hist()__ function to plot a histogram of the vector __Heights__

```{r}
hist(Heights, breaks=6, axes=TRUE, col="skyblue4", main="Histogram of Tree Heights in an Orchard", ylab="Frequency", xlab="Tree heights (m)", freq=TRUE)
```

6. To print a PDF of this graph we use the __pdf()__ and __dev.off()__ commands
```{r}
pdf("Histogram1.pdf")
hist(Heights, breaks=6, axes=TRUE, col="skyblue4", main="Histogram of Tree Heights in an Orchard", ylab="Frequency", xlab="Tree heights (m)", freq=TRUE)
dev.off()
```
7. Unlike the __barplot()__ command, the histogram plot automates a large amount of the work for us. Nevertheless, it is still possible to override most of these automatic choices by selecting various arguments for the __hist()__ function. 

* Most of these arguments are explained in the __R__ documentation available in the __Help__ panel of the __RStudio__ environment. Just type hist into the search bar and press Enter.  
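
One automatic choice is worth knowing about: when `breaks` is a single number, as in `breaks=6` above, `hist()` treats it as a suggestion and picks "pretty" break points itself. Also, by default (`right = TRUE`) a value lying exactly on a boundary is counted in the lower interval, which is the opposite of the guideline given earlier. The sketch below assumes we want the six one-metre intervals from 3 to 9 chosen in the solution; it passes the break points explicitly and sets `right = FALSE`. The names `bar_breaks`, `h_upper` and `h_default` are just for illustration.

```{r}
Heights <- c(3.4, 3.5, 3.7, 4.2, 4.4, 4.7, 4.9, 5.1, 5.2, 5.3, 5.9, 6.0, 6.4, 7.3, 7.9, 8.1, 8.7)

# Explicit break points 3, 4, ..., 9 give exactly six bars
bar_breaks <- seq(3, 9, by = 1)

# right = FALSE: intervals are [a, b), so 6.0 is counted in the 6-7 bar
h_upper <- hist(Heights, breaks = bar_breaks, right = FALSE, col = "skyblue4",
                main = "Histogram of Tree Heights in an Orchard",
                xlab = "Tree heights (m)", ylab = "Frequency")
h_upper$counts

# Default right = TRUE: intervals are (a, b], so 6.0 is counted in the 5-6 bar
h_default <- hist(Heights, breaks = bar_breaks, plot = FALSE)
h_default$counts
```

Comparing the two count vectors shows that only the bars on either side of 6 change, since 6.0 is the only height that sits on a boundary.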

## Skewed Data and Central Data

A collection of data is __skewed left__ if its histogram displays a tail to the left. This means that the data set possesses __outliers__ at its lower end.

### Example: Skewed-left data

```{r}
Left<-c(1,8,8,8,9,9,9,10,10,11,11,11)
hist(Left,col="cornflowerblue",main="Left-Skewed Data", xlab="Left Data Values")
```
* Notice, most of the data is concentrated to the right between 6 and 12. However, a small portion of the data forms a __left-tail__ between 0 and 2. This tail is what skews the data to the left.



Conversely, the data is __skewed right__ if the histogram displays a tail to the right, and likewise this indicates the presence of __outliers__ at the upper end of the data set.

### Example: Right-skewed data
```{r}
Right<-c(1,1,2,3,3,3,4,4,5,5,5,12)
hist(Right,breaks=5, col="cornflowerblue",main="Right-Skewed Data", xlab="Right Data Values")
```
* Most of the data is concentrated to the left between 0 and 6. However, a small portion of the data forms a __right-tail__ between 10 and 12. This tail is what skews the data to the right.

### Test for skewness

* If the data is __skewed right__ then
$$
\text{mean} > \text{median}
$$

* If the data is __skewed left__ then
$$
\text{mean} < \text{median}
$$

* If the data is __central__ then
$$
\text{mean} = \text{median}
$$
  and the histogram is symmetric.
  
### Example: Test for skewness
We will test the data sets __Left__ and __Right__ for skewness.

#### Left Data:
```{r}
mean(Left)
```
```{r}
median(Left)
```

* We see that the mean of the __Left__ data is less than the median of the __Left__ data, confirming it is skewed-left.


#### Right Data:
```{r}
mean(Right)
```

```{r}
median(Right)
```

* We see that the mean of the __Right__ data is greater than the median of the __Right__ data, confirming it is skewed-right.


#### Skewness of Heights:

The histogram in __Example 1__ appears to display a tail to the right and so we expect the data to be __skewed right__. 

To test this we can use the __mean()__ and the __median()__ functions and compare.

```{r}
mean(Heights)
median(Heights)
```
Since the mean is bigger than the median, this means the data is __skewed right__ as we expect from the histogram above.
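
The comparison used in these examples can be wrapped in a small helper. The function `skew_direction()` below is hypothetical (it is not part of base R), and the mean-median comparison is a rule of thumb rather than a formal test, so treat this as a sketch.

```{r}
# Hypothetical helper: classify a data set by comparing its mean and median
skew_direction <- function(x) {
  m <- mean(x)
  med <- median(x)
  if (isTRUE(all.equal(m, med))) {
    "central"        # mean and median agree (up to rounding error)
  } else if (m > med) {
    "skewed right"   # mean pulled above the median by an upper tail
  } else {
    "skewed left"    # mean pulled below the median by a lower tail
  }
}

Left  <- c(1, 8, 8, 8, 9, 9, 9, 10, 10, 11, 11, 11)
Right <- c(1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 12)

skew_direction(Left)
skew_direction(Right)
skew_direction(c(1, 2, 3, 4, 5))
```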

## Example 2
The marks scored by a class in a test are as follows (not in order):

 * 50, 63, 29, 54, 56, 0, 53, 34, 45, 85, 65, 62, 60, 3, 71, 66, 63, 68, 69, 67, 79, 75, 72, 81, 87, 91, 52 

Using this data, answer the following:

A. Use the __sort()__ function to sort this data and the __length()__ function to count the data points

B. Draw a histogram to represent this sorted data

C. From the histogram decide if the data is __skewed left__, __skewed right__ or __symmetric__

D. Find the mean of this data

E. Find the median of this data set

F. Comparing the mean and the median, decide if the data is __skewed left__, __skewed right__ or __symmetric__. Does this agree with the histogram?

### Solution
A.
```{r}
Scores <- c(50, 63, 29, 54, 56, 0, 53, 34, 45, 85, 65, 62, 60, 3, 71, 66, 63, 68, 69, 67, 79, 75, 72, 81, 87, 91, 52)
sort(Scores)
```
```{r}
length(Scores)
```
B.
```{r}
hist(Scores, breaks=8, xlim=c(0,100), ylim=c(0,0.03), freq=FALSE, xlab="Exam Scores", ylab="Density", col="darkslategray4", main="Histogram of Exam Scores")
```

C. 
From the histogram, the data appears to be skewed-left.


D.
```{r}
mean(Scores)
```

E.
```{r}
median(Scores)
```

F.
Since the mean is less than the median, the data is skewed left. This agrees with the shape of the histogram for the data set.
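
A note on the vertical axis: with `freq=FALSE`, `hist()` plots densities, scaled so that the bar areas sum to 1, rather than relative frequencies. If bar heights equal to the proportion of marks in each interval are wanted, one possible approach (a sketch, not the only way) is to compute the histogram without plotting, rescale the counts, and then plot the result. The names `h` and `rel_freq` and the explicit 10-mark break points are chosen here for illustration.

```{r}
Scores <- c(50, 63, 29, 54, 56, 0, 53, 34, 45, 85, 65, 62, 60, 3, 71, 66, 63, 68,
            69, 67, 79, 75, 72, 81, 87, 91, 52)

# Compute the histogram with explicit 10-mark intervals, without drawing it
h <- hist(Scores, breaks = seq(0, 100, by = 10), plot = FALSE)

# Relative frequency = count in each interval / total number of marks
rel_freq <- h$counts / sum(h$counts)

# Store the rescaled values in the counts slot so that plot() draws them
h$counts <- rel_freq
plot(h, freq = TRUE, xlab = "Exam Scores", ylab = "Relative Frequency",
     col = "darkslategray4", main = "Histogram of Exam Scores")
```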

### Exercise 1

The journey times (in hours) for a train service between two cities were recorded on a given day, with the following data gathered:

1.2, 1.22, 1.23, 1.25, 1.3, 1.34, 1.36, 1.39, 1.40, 1.42, 1.45, 1.48, 1.51, 1.56

Given this data, answer the following:

1. Use the __median()__ function to find the median of this data set

2. Use the __mean()__ function to find the mean of this data set

3. Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?

4. Construct a histogram of this data set

5. Using this histogram, decide if the data is skewed-left, skewed-right or central

6. Does this agree with your answer from part 3?


### Exercise 2

An electrical components manufacturer measures the resistance of a sample of 50 resistors produced during one production run. The resistances were measured as follows

51.20, 50.30, 52.40, 50.24, 51.25, 49.89, 51.63, 52.84, 53.01, 49.99

50.34, 50.61, 51.93, 52.25, 49.87, 50.13, 51.32, 50.43, 51.34, 52.13

52.11, 56.84, 57.81, 51.33, 52.36, 50.67, 51.89, 52.13, 52.18, 50.31

50.84, 51.12, 50.95, 49.98, 51.03, 52.01, 50.96, 51.02, 52.44, 51.95

Given this data, answer the following:

1. Sort the data according to increasing measurements of resistance

2. Use the __median()__ function to find the median of this data set

3. Use the __mean()__ function to find the mean of this data set

4. Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?

5. Construct a histogram of this data set

6. Using this histogram, decide if the data is skewed-left, skewed-right or central

7. Does this agree with your answer from part 4?

### Exercise 3 
The water quality at a water treatment facility is monitored hourly by measuring the impurities present in the water (in parts per million). The following data was collected during one 36-hour period:



18, 8, 25, 13, 7, 19

14, 5, 21, 23, 18, 5

13, 6, 15, 21, 17, 6

15, 7, 13, 5, 18, 11

12, 14, 15, 16, 12, 13

9, 17, 16, 19, 8, 10

Given this data, answer the following:

1. Sort the data in order of increasing values of impurities measured

2. Use the __median()__ function to find the median of this data set

3. Use the __mean()__ function to find the mean of this data set

4. Comparing the mean and the median of the data set, is the data skewed-left, skewed-right or central?

5. Construct a histogram of this data set

6. Using this histogram, decide if the data is skewed-left, skewed-right or central

7. Does this agree with your answer from part 4?


# Pareto Charts

A Pareto chart is a combination of a __bar chart__ and a __cumulative frequency chart__.

They are a useful diagnostic tool when one wishes to limit defects, errors, delays or breakdowns in different circumstances.

__Remark:__ To generate Pareto charts we need to import the __qcc__ library as follows

  1. In the __Tools__ tab above, select the option __Install Packages...__
  2. In the __Install from__ option, select __Repository(CRAN, CRANextra)__
  3. In the __Packages__ input line type __qcc__
  4. In the __Install to Library__ option select your own home directory (this should be the default anyhow)
  5. Make sure __Install all dependencies__ is checked
  
Once installed we call the library __qcc__ as follows
```{r}
library(qcc)
```
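
As an alternative to the menu steps above, the package can also be installed from the console with `install.packages()`. The wrapper `ensure_package()` below is a hypothetical convenience function (not part of R or qcc) that installs a package only if it is not already available; it assumes a CRAN mirror is configured and the machine is online.

```{r}
# Hypothetical helper: install a package from CRAN only if it is missing
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # Report whether the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}
```

Calling `ensure_package("qcc")` before `library(qcc)` would then cover a fresh installation.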


### Example 1
 The employees of a company were asked the reasons for late arrivals at work. The data collected were as follows:
 
 * __Child Care__ 22
 
 * __Emergency__ 8
 
 * __Overslept__ 12
 
 * __Public transport__ 15
 
 * __Traffic__ 36
 
 * __Weather__ 27
 
 Using this data answer the following:
 
 1. Create an appropriate vector to represent this data.
 
 2. Use the function __pareto.chart()__ to generate a Pareto chart for this data.
 
 3. From the Pareto chart, determine which causes could reasonably be addressed to reduce late arrivals by at least 35%.
 
### Solution

1. We first create an appropriate vector of numbers to represent the data, in the usual way:
```{r}
Employees <- c(22,8,12,15,36,27)
Employees
```
Now we create an appropriate vector of names, corresponding to these numbers, as follows:
```{r}
Reasons <- c("Child Care", "Emergency", "Overslept", "Public Transport", "Traffic", "Weather")
Reasons
```

* Now we re-create the __Employees__ vector by assigning it the names in the vector __Reasons__ as follows:
```{r}
names(Employees)<-Reasons
```
i.e. each entry has now been given a corresponding name.

* The new __Employees__ vector is 

```{r}
Employees
```
and so we see each number has been assigned a corresponding reason.

2. The __Pareto Chart__ for this data is given by

```{r}
pareto.chart(Employees,col="skyblue2", cumperc=seq(0,100,by=10), main="Pareto chart giving reasons for late arrivals at work")
```

* We notice that the __pareto.chart()__ function plots the frequency of each class, as opposed to the relative frequency of each class. This is easily modified by summing all frequencies in the __Employees__ vector and dividing the vector by this number to give the __relative frequency__ of each class.

```{r}
Total <- sum(Employees)
Total
```

```{r}
Employees <- Employees/Total
Employees
```
```{r}
pareto.chart(Employees,col="skyblue2", cumperc=seq(0,100,by=10), ylab="Relative Frequency", main="Pareto chart giving reasons for late arrivals at work")
```
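
The cumulative percentages printed by `pareto.chart()` can also be reproduced in base R, which may help with part 3. The sketch below builds the named vector in one step and finds the fewest leading causes whose combined share reaches a target. The names `late_counts`, `ordered_counts`, `cum_pct` and `target` are illustrative, and whether a given cause can reasonably be addressed remains a judgement the code cannot make.

```{r}
# Named vector of counts, created in a single step
late_counts <- c("Child Care" = 22, "Emergency" = 8, "Overslept" = 12,
                 "Public Transport" = 15, "Traffic" = 36, "Weather" = 27)

# Order the causes from most to least frequent, as a Pareto chart does
ordered_counts <- sort(late_counts, decreasing = TRUE)

# Cumulative percentage of all late arrivals
cum_pct <- 100 * cumsum(ordered_counts) / sum(ordered_counts)
round(cum_pct, 2)

# Fewest leading causes that together account for at least the target share
target <- 35
n_needed <- which(cum_pct >= target)[1]
names(ordered_counts)[seq_len(n_needed)]
```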


## Exercise 4

A mobile phone repair shop categorized the type of repair work it carried out over a year and found the following data:
  
* __Cracked Screen__ 125
* __Scratched Screen__ 228
* __Forgotten Pass code__ 91
* __Water Damage__ 188
* __Battery Replacement__ 111
* __Software Virus__ 82
* __Phone Bricking__ 188
* __Damaged Connector__ 41
  
Given this data, answer the following:

1. Find the total number of repairs in the year.

2. Construct a vector to represent the relative frequencies of each of these classes.

3. Use this vector of relative frequencies to construct a Pareto chart of this data.

4. From the Pareto chart determine the causes of phone damage owners should address to reduce repairs by at least 25%.



## Exercise 5

A newly built apartment complex was inspected to prepare a snag list for each apartment. After inspection of the apartment complex, the following issues were found to occur with the frequency indicated:

### List of Damages
* __Cooker Disconnected__ 4
* __Windows Not Closing__ 18
* __Bathroom Ventilator Broken__ 9
* __Damaged Plug Sockets__ 51
* __Broken Light__ 29
* __Damaged Paintwork__ 47
* __Doors Not Locking__ 14
* __Dampness__ 2
* __Scratched Flooring__ 21

Using this data answer the following:

1. Identify the data type given.

2. Construct a vector to represent the frequencies.

3. Using this vector, evaluate the number of damages to be repaired in the apartment complex.

4. Using the vector of frequency values, construct a Pareto chart for this data.

5. From this data determine which issues should be addressed to ensure at least 75% of damages are repaired.
