Rumsey, Chapter 2: Organizing categorical data - charts and graphs. [Complete - Base graphics]

Exercise 1 - 5/2/’16

Suppose 375 individuals are asked what type of vehicle they own: an SUV, a truck, or a car. See the following frequency table.

e1 <- data.frame(category = integer(375)) 
e1$category <- rep(c("SUV","Truck","Car"), c(150,125,100))

# Frequency tables
xtabs(~ category, data = e1)

## category
##   Car   SUV Truck 
##   100   150   125

A. Make a relative frequency table of these results

round(prop.table(table(e1$category)), digits = 2)

## 
##   Car   SUV Truck 
##  0.27  0.40  0.33

B. Make a pie chart of these results

# table() sorts the categories alphabetically, so take the labels from the table itself.
tab1 <- table(e1$category)
pie(tab1, labels = paste0(names(tab1), " ", round(prop.table(tab1)*100, digits = 1), "%"), main = "Vehicles - n: 375", col = rainbow(3))
legend("topleft", names(tab1), cex = 0.6, fill = rainbow(3))

C. Interpret the results

The sample (375 vehicle owners) is composed of SUV, truck and car owners in proportions of 40, 33.3 and 26.7 percent respectively.
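Because `table()` sorts the categories alphabetically, hand-typed labels can easily end up on the wrong slice. The following is a minimal sketch of a helper that derives the labels from the table itself; the name `pct_labels` is hypothetical and not part of base R.

```r
# Hypothetical helper: build "name percent%" labels in the table's own order.
pct_labels <- function(tab, digits = 1) {
  paste0(names(tab), " ", round(prop.table(tab) * 100, digits), "%")
}

# Same vehicle counts as in the exercise.
vehicles <- table(rep(c("SUV", "Truck", "Car"), c(150, 125, 100)))
labs <- pct_labels(vehicles)
labs
```

The resulting vector can be passed as the `labels` argument of `pie()`, and `names(vehicles)` can be used for the legend, so that labels, slices and colours always share one order.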

Exercise 2 - 5/2/’16

For a month, a restaurant owner checks off when a customer patronizes his restaurant. He records the data for 1000 customers for one month.

customers <- data.frame(category = integer(1000))
customers$category <- rep(c("Breakfast", "Lunch", "Dinner", "Other"), c(10*6.7, 10*22.2, 10*44.4, 10*26.7))
xtabs(~ category, data = customers)

## category
## Breakfast    Dinner     Lunch     Other 
##        67       444       222       267

# table() sorts the categories alphabetically, so take the labels from the table itself.
tab2 <- table(customers$category)
pie(tab2, labels = paste0(names(tab2), " ", round(prop.table(tab2)*100, digits = 2), "%"), main = "Restaurant customers - n: 1000", col = rainbow(4))
legend("topright", names(tab2), cex = 0.5, fill = rainbow(4))

A. What does this information tell the restaurant owner?

According to the frequency table, his restaurant is most popular in the evening: 44.4% of the 1000 sampled customers come for dinner. Lunch follows with 22.2%, while only 6.7% of the customers come for breakfast. A substantial 26.7% of the customers are uncategorized.
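The ranking of the four categories can be read directly from the counts in the frequency table. A small sketch, with object names (`meals`, `shares`) chosen here only for illustration:

```r
# Counts as recorded in the frequency table of this exercise.
meals <- c(Breakfast = 67, Lunch = 222, Dinner = 444, Other = 267)

# Percentage of customers per category, largest first.
shares <- sort(meals / sum(meals) * 100, decreasing = TRUE)
shares
```

Sorting before plotting also makes a pie or bar chart easier to read, since the slices or bars then appear in order of size.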

B. What is the problem with the “other” category? How can this study be improved in the future?

The category “other” covers more customers than “lunch” and “breakfast”. Such an amount of information should be properly investigated. It could group categories of customers that do not fit the usual schedules, but that could be relevant enough to represent a potential economic gain if treated adequately.

Exercise 3 - 7/2/16

A manager decides to categorize her e-mails into five groups: 1) highest, 2) medium, 3) low priority, 4) personal and 5) SPAM.

# A plain character vector is enough here; table() counts its values.
e3 <- rep( c("high", "medium", "low", "personal", "SPAM"), c(60, 120, 20, 50, 150))

# Frequency table
table(e3)

## e3
##     high      low   medium personal     SPAM 
##       60       20      120       50      150

A. Make a relative frequency table.

paste(prop.table(table(e3))*100, "%", sep = "")

## [1] "15%"   "5%"    "30%"   "12.5%" "37.5%"
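The `paste()` output above drops the category names. The following sketch keeps counts and percentages side by side; the object name `freq3` is only illustrative.

```r
# Same e-mail data as in the exercise.
e3 <- rep(c("high", "medium", "low", "personal", "SPAM"), c(60, 120, 20, 50, 150))
counts <- table(e3)

# One row per category, with the count and the relative frequency in percent.
freq3 <- data.frame(
  category = names(counts),
  count    = as.vector(counts),
  percent  = as.vector(prop.table(counts)) * 100
)
freq3
```

Keeping the names next to the numbers avoids having to remember the order in which `table()` sorted the categories.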

B. Make a pie chart of this data.

# Does not work with Knitr.
# par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)

pie(table(e3), labels = paste(round(prop.table(table(e3))*100), "%", sep = ""), col = heat.colors(5), main = "Mail received - n: 400")

# With Knitr the legend is not visible, while there is no problem with the normal plot output.

#usr <- par("usr")*1.2; rect(usr[1], usr[3], usr[2], usr[4], lwd=2, col=NA)
# Use the table's own (sorted) category order so that the colours match the slices.
legend("topright", legend = names(table(e3)), fill = heat.colors(5), title = "Categories", cex = 0.6)

C. Interpret the result

Being able to identify SPAM immediately could contribute significantly to productivity, since it is the largest category: about 38% of the 400 e-mails (150 messages).

Exercise 4 - 10/2/16

Suppose a survey is conducted to see what types of pets people own. The survey of 100 adults finds that 40 of them own a dog, 60 a cat, 20 a fish, 10 a rodent. Can this data be organized in a pie chart?

Solution: It cannot. Since each adult may own more than one animal, an adult can be counted more than once (40 + 60 + 20 + 10 = 130). Hence the observed cases would exceed the sample size (100), and the slices would not represent shares of a whole. In this case, a bar graph is preferable.

# A plain character vector is enough here; table() counts its values.
e8 <- rep(c("dog", "cat", "fish", "rodent"), c(40, 60, 20, 10))
table(e8)

## e8
##    cat    dog   fish rodent 
##     60     40     20     10

barplot(table(e8), main = "Animals possessed by 100 adults", xlab = "Animal", ylab = "Count", col = gray.colors(4))
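The check behind the solution is simple arithmetic: the ownership counts add up to more than the number of respondents. A sketch, with variable names chosen here for illustration:

```r
n_adults <- 100
pets <- c(dog = 40, cat = 60, fish = 20, rodent = 10)

# More answers than respondents means the categories overlap.
sum(pets)             # 130
sum(pets) > n_adults  # TRUE

# Share of respondents owning each kind of pet; these need not sum to 100.
pet_pct <- pets / n_adults * 100
pet_pct
```

Since the percentages sum to 130 rather than 100, they cannot be drawn as slices of one pie, while a bar graph shows each of them on its own.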

Exercise 5 - 10/2/16

A road intersection is observed for four hours, recording whether the drivers stop completely, roll through, or run the stop sign.

# Only percentages are available, so store them directly:
# rep() would truncate fractional counts (0.6 becomes 0) and distort the chart.
e5 <- c("Complete stop" = 64.2, "Rolled through" = 35.2, "Ran the stop sign" = 0.6)
par("mar")

## [1] 5.1 4.1 4.1 2.1

par(mar=c(1,1,1,1))
pie(e5, main = "Drivers' behaviour", labels = paste0(names(e5), " ", e5, "%"))
par(oma=c(2,2,2,2), mar=c(4.1, 4.1, 4.1, 4.1), xpd=TRUE) ; box("inner", col="black", lwd=3)

A. Interpret the result as they appear on the chart

What an unusual place. 64.2% of the drivers stop completely at the sign. And even if 35.2% roll through the sign, only 0.6% run the stop sign! OMG! That is not Italy. At all.

B. What information is missing from the pie chart?

The sample size is missing.

C. Does the missing information affect the interpretation of the results?

Heavily. We cannot derive any information, since this plot could refer to almost any number of drivers.
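To see why the missing sample size matters, the percentages can be converted into the number of drivers they would imply. The sample sizes below are hypothetical, chosen only for illustration, since the real one is not reported.

```r
pct <- c("Complete stop" = 64.2, "Rolled through" = 35.2, "Ran the stop sign" = 0.6)

# Hypothetical sample sizes; the real one is unknown.
sizes <- c(50, 500, 5000)

# One column per sample size, one row per behaviour.
implied <- sapply(sizes, function(n) pct / 100 * n)
colnames(implied) <- paste0("n=", sizes)
implied
```

With 50 drivers, 0.6% corresponds to less than one driver, which is impossible, while with 5000 drivers it corresponds to 30. The same chart therefore supports very different readings depending on a number it does not show.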

D. Should you make a generalization of all drivers based on this data?

Not at all. For the reason stated in point B, we cannot make any kind of inference.

Exercise 6 - 27/2/16

A survey is conducted to determine whether 20 office employees of a certain company would prefer to work at home, if given the chance. Of the 10 women surveyed, 7 said yes and 3 said no. Of the 10 men, 8 said no and 2 said yes. Compare the results using two pie charts. Does gender seem to affect one’s preference to work at home?

e6 <- data.frame(gender = rep( c("man", "woman"), each=10))
# Men: 2 yes, 8 no; women: 7 yes, 3 no (as stated in the exercise).
e6$answer <- c(rep( c("yes", "no"), c(2, 8)), rep( c("yes", "no"), c(7, 3)))

par(mfrow = c(1, 2))

# Build the labels from the table names so that each slice keeps its own label.
men <- table(e6$answer[e6$gender == "man"])
suppressWarnings(pie(men, main = "Men", line = -4, labels = paste0(names(men), " ", round(prop.table(men)*100, 1), "%")))

women <- table(e6$answer[e6$gender == "woman"])
suppressWarnings(pie(women, main = "Women", line = -4, labels = paste0(names(women), " ", round(prop.table(women)*100, 1), "%")))

suppressWarnings(mtext("Willingness to work at home", side = 3, line = -2, outer = TRUE, cex=2,font=2))

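Gender appears to be related to the willingness to work at home: 70% of the women said yes, against only 20% of the men. With only 10 respondents per group, however, the result is merely suggestive.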
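The comparison between the two pie charts can also be expressed as a two-way table with row percentages. This is a sketch built from the counts stated in the exercise; the object names (`survey`, `row_pct`) are illustrative.

```r
# Counts from the exercise: men 2 yes / 8 no, women 7 yes / 3 no.
survey <- data.frame(
  gender = rep(c("man", "woman"), each = 10),
  answer = c(rep(c("yes", "no"), c(2, 8)), rep(c("yes", "no"), c(7, 3)))
)

tab <- table(survey$gender, survey$answer)
tab

# Row percentages: each gender sums to 100, like one pie per row.
row_pct <- prop.table(tab, margin = 1) * 100
row_pct
```

Each row of `row_pct` corresponds to one of the two pie charts, which makes the difference between the groups readable without a plot.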

Exercise 7 - 27/2/16

Give an example of categorical data that you cannot summarize correctly by using a pie chart.

Solution: Any data in which the percentages do not sum to 100, because an observation or an individual can belong to more than one group (for example, the pet ownership survey in Exercise 4).

Exercise 8 - 27/2/16

Surfing non-work-related websites compromises employee productivity.

e8 <- data.frame(employer = rep( c("no", "yes"), c(334, 666)))
e8$employee <- rep( c("no", "yes"), c(498, 502))

par(mfrow = c(1, 2))

# Build the labels from the table names: table() sorts "no" before "yes".
employers <- table(e8$employer)
suppressWarnings(pie(employers, main = "Employers", line = -2, labels = paste0(names(employers), " ", round(prop.table(employers)*100, 1), "%")))

employees <- table(e8$employee)
suppressWarnings(pie(employees, main = "Employees", line = -2, labels = paste0(names(employees), " ", round(prop.table(employees)*100, 1), "%")))

suppressWarnings(mtext("Productivity and non-work web", side = 3, line = -2, outer = TRUE, cex = 1.5, font = 2))

A. Interpret these results

Employers are more convinced than employees that non-work-related websites compromise productivity (66.6% against 50.2%).

B. What important information is missing from the pie charts?

The sample size. (I created the data artificially with 1000 observations.)

Exercise 9 - 27/2/16

Here is the frequency bar graph of the opinions of 500 people, in three categories (1: favor, 2: oppose, 3: neutral), on a smoking ban.

e8 <- data.frame(opinion = rep(c("Favor", "Oppose", "Neutral"), c(250, 125, 125)))
barplot(table(e8), main = "Opinion on smoking ban")

A. Make a relative frequency table of this data.

paste(prop.table(table(e8))*100, "%", sep = "")

## [1] "50%" "25%" "25%"

B. Use the relative frequency table to make a bar graph of this data.

barplot(prop.table(table(e8))*100, main = "Opinion on smoking ban (percentage)")

C. Interpret the result.

Half of the sample favors the smoking ban. The other half is split equally between those who oppose the ban and those who are neutral (25% each).

Exercise 10 - 27/2/16

A health club asks its customers to rate its service (very good, 1; good, 2; fair, 3; poor, 4). What percentage of the customers rated the service as good?

e10 <- data.frame(rate = rep(c("very good", "good", "fair", "poor"), c(10, 10, 8, 2)))
barplot(table(e10), main = "Club customer ratings")

paste(round(prop.table(table(e10)), 2)*100, "%", sep = "")

## [1] "27%" "33%" "7%"  "33%"
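The percentages above are printed in alphabetical order (fair, good, poor, very good) and without names. A sketch that keeps the rating scale in its natural order by using a factor with explicit levels; the object names are illustrative.

```r
ratings <- rep(c("very good", "good", "fair", "poor"), c(10, 10, 8, 2))

# Fix the level order so that tables and bars follow the rating scale, not the alphabet.
ratings <- factor(ratings, levels = c("very good", "good", "fair", "poor"))

pct10 <- round(prop.table(table(ratings)) * 100)
pct10
```

Read this way, about 33% of the customers rated the service as good. Passing `table(ratings)` to `barplot()` would draw the bars in the same order.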

Exercise 11 - 28/2/16

Do voters agree about the “Opinion X”? yes: 500; no: 550; dk: 250.

e11 <- data.frame(opinion = rep(c("yes", "no", "dk"), c(500, 550, 250)))
a1 <- barplot(table(e11), main = "Opinion on Issue X")
text(y= table(e11) + 1, x= a1, labels=as.character(table(e11)), xpd=TRUE)

A. Make the frequency table of these results (including the total number).

paste(round(prop.table(table(e11)), 2)*100, "%", sep = "")

## [1] "19%" "42%" "38%"

table(e11)

## e11
##  dk  no yes 
## 250 550 500
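The question also asks for the total number, which the table above does not show. One way to add it, sketched here with illustrative object names, is `addmargins()` from the stats package:

```r
# Same counts as in the exercise.
votes <- table(rep(c("yes", "no", "dk"), c(500, 550, 250)))

# Append the total as an extra "Sum" entry.
with_total <- addmargins(votes)
with_total
```

The total of 1300 voters is the denominator behind the percentages computed above.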

B. Evaluate the graph as to whether or not it fairly represents the results.

No, the original graph from the book clearly does not represent the correct proportions between the categories.

Exercise 12 - 28/2/16

What are the priorities of 270 seniors, including whether or not to buy a home?

e12 <- data.frame(answer = rep(c("yes", "no", "dk"), c(100, 90, 80)))
p12.1 <- barplot(table(e12), axes = FALSE, ylim = c(0, 300), main = "Is buying a house a priority?")
axis(2, c(0, 50, 100, 150, 200, 250, 300))

A. The original bargraph is misleading: explain why.

The scale of the bar graph is too wide (0 to 300) to make the differences between the category frequencies visible.

B. Make the new bar graph that more fairly presents the results.

p12.2 <- barplot(table(e12), ylim = c(0, 100), main = "Is buying a house a priority?")

Exercise 13 - 28/2/16

Are the customers males or females?

e13 <- data.frame(gender = rep(c("male", "female"), c(330, 670)))
barplot(prop.table(table(e13))*100, ylim = c(0, 100), main = "Gender of customers")

A. Interpret this result.

Roughly 67% of the customers are female.

B. Is the bar plot a fair and accurate representation of the data?

Yes, it is fine, since it represents the proportions between the two categories, even if a pie chart could be more adequate.

Exercise 14 - 28/2/16

A sample of 20 employees. Would they prefer to work from home?

e14 <- data.frame(gender = rep(c("man", "woman"), each = 10))
e14$answer <- ifelse(e14$gender == "man", rep(c("yes", "no"), c(2, 8)), rep(c("yes", "no"), c(7, 3)))

par(mfrow = c(1, 2))

barplot(table(e14$answer), main = "Would you work at home?", ylim = c(0, 12))

par(mar = c(5.1, 4.1, 4.1, 2))
par(xpd = TRUE)

barplot(table(e14$answer, e14$gender), main = "by sex", col = gray.colors(length(unique(e14$answer))), ylim = c(0, 12))

# The stacked bars follow the table's row order ("no", then "yes"), so the legend must too.
legend(1, 12, c("no", "yes"), cex = 0.4, fill = gray.colors(length(unique(e14$answer))))

mtext("Willingness to work at home", side = 3, line = -1.5, outer = TRUE, cex = 1.5, font = 2)

A. Interpret the result of each graph.

Graph 1: 11 of the 20 employees would not prefer to work at home. Graph 2: 8 men out of 10 would not prefer to work at home, against 3 women out of 10.

B. Discuss the added value in including gender in the second bar graph.

It gives us a better understanding of the result in graph 1. That is, the near-even split overall hides a strong difference between men and women, so a common work-at-home policy could be differently satisfactory for the two groups.
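As an alternative to stacked bars, the same two-way table can be drawn with the bars side by side, which makes the yes/no comparison within each gender more direct. This is only a sketch; the object names `home14` and `by_gender` are illustrative.

```r
# Counts from the exercise: men 2 yes / 8 no, women 7 yes / 3 no.
home14 <- data.frame(
  gender = rep(c("man", "woman"), each = 10),
  answer = c(rep(c("yes", "no"), c(2, 8)), rep(c("yes", "no"), c(7, 3)))
)
by_gender <- table(home14$answer, home14$gender)

# beside = TRUE draws one bar per answer within each gender;
# legend.text = TRUE takes the legend entries from the row names of the table.
mids <- barplot(by_gender, beside = TRUE, legend.text = TRUE,
                ylim = c(0, 12), main = "Would you work at home? (by sex)")
```

Letting `barplot()` build the legend from the table avoids any mismatch between the legend entries and the bar colours.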

Rumsey, Chapter 2: Organizing categorical data - charts and graphs. [Complete - Base graphics]

Maurizio Murino

5 February 2016