knitr::opts_chunk$set(cache=TRUE, warning=FALSE, message=FALSE, fig.width=14)
library(lattice)
library(knitr)
library(xtable)
I often perform barcharts since the data are generated from figures, hence with discrete instead of continues data.
You lose information from the data when yuo create a histogram. What information is lost?
Solution:
You do not know the actual value of the data anymore, but only the group where they fall into.
Make a histogram from this dataset of test scores: 72, 79, 81, 80, 63, 62, 89, 90, 50, 78, 87, 97, 55, 69, 97, 87, 88, 99, 76, 78, 5, 77, 88, 90, 81. Would a pie chart be appropriated for this data?
e2 <- c(72, 79, 81, 80, 63, 62, 89, 90, 50, 78, 87, 97, 55, 69, 97, 87, 88, 99, 76, 78, 65, 77, 88, 90, 81)
histogram(~e2,
type = "count",
xlab = "Test scores",
main = "Test scores",
breaks = seq(from = 50, to = 100, by = 5))
max(e2)
## [1] 99
A pie chart is not appropriate. Pie charts are more useful with categorical data.
Survey of 45 home-owners. Of them, 2 do not have a TV, 17 have 1, 22 have 2, 3 have 3, 1 has 4. Make a relative frequency histogram of this data and interpret the result.
e3 <- data.frame(tv = rep(c(0, 1, 2, 3, 4), c(2, 17, 22, 3, 1)))
sum(e3$tv)
## [1] 74
histogram(~e3,
xlab = "TVs",
main = "Distribution of 73 Tvs between 45 owners",
breaks = seq(from = 0, to = 4, by = 1))
Think a loaded die. Make it roll several times.
e4 <- data.frame(n = rep(c(1, 2, 3, 4, 5, 6), c(10, 15, 5, 15, 5, 20)))
histogram(~n, data = e4,
type = "count",
main = "260 die rolls",
xlab = "Die result",
breaks = seq(from = 0, to = 6, by = 1))
A. Make the relative frequency histogram of these results.
histogram(~n, data = e4,
main = "Relative requencies - 260 die rolls",
xlab = "Die result",
breaks = seq(from = 0, to = 6, by = 1))
B. You can make a relative frequency histogram from a frequency histogram; can you go to the other direction?
Well, it is possible, but only by arbitrarily determining a sample size, that is unknow. Hence, it is no really useful.
An ATM machine asks to choose an amount in 50$ increments from 100 from $500. Results from a recent sample of customer withdrawals are shown in the figures. Discuss shape, center and spread of the data.
e5 <- data.frame(withdraw = rep(c(100, 150, 200, 250, 300, 350, 400, 450, 500), c(14, 23, 23, 4, 2, 0, 2, 0, 2)))
barchart(e5,
main = "Withdrawals",
type = "count",
horizontal = FALSE,
xlab = "Withdrawals",
breaks = seq(from = 100, to = 500, by = 50))
Histogram of rods produced by a company. a Rod should be 100 inches long. Discuss the company accuracy by interpreting the shape, center and spread of the data.
e6 <- data.frame(lngth = rep(seq(from = 99.2, to = 100.6, by = 0.1), c(1, 0, 2, 1, 2, 2, 8, 7, 10, 10, 6, 7, 2, 1, 4)))
histogram(~lngth, data = e6,
breaks = seq(from = 99.2, to = 100.6, by = 0.1),
type = "count",
main = "Rods length",
xlab = "Rods")
kable(summary(e6), format = "markdown", align = "c", row.names = TRUE)
| lngth | |
|---|---|
| Min. : 99.20 | |
| 1st Qu.: 99.85 | |
| Median :100.00 | |
| Mean :100.03 | |
| 3rd Qu.:100.20 | |
| Max. :100.60 |
The distribution is slightly skewed to the left, even if it this quite close to the center. Even so, the 50% of the production is between 99.85 and 100.2, being really accurate.
In the histogram the amount of money spent on fruits by 317 households in a year. Describe shape, center and spread. What do they tell us about how much families spend on fruits?
e7 <- data.frame(expense = rep(seq(100, 800, 50), c(2, 0, 5, 9, 12, 27, 60, 65, 70, 50, 32, 18, 7, 2, 3)))
histogram(~expense, data = e7,
type = "count",
breaks = seq(50, 800, 50),
main = "Histogram of fruits",
xlab = "Fruits",
ylab = "Frequency")
kable(summary(e7), format = "markdown", align = "c", row.names = TRUE)
| expense | |
|---|---|
| Min. :100 | |
| 1st Qu.:400 | |
| Median :500 | |
| Mean :475 | |
| 3rd Qu.:550 | |
| Max. :800 |
The histogram is symmetric and bell shaped, with peak arounf 500. Range is from 100 to 800, quite large, with due probably to vary dietary habits, availability and prices of vegatables. Most of the house hold spend between 200 and 700 per year.
Consider the percentage return for a group of stocks in Portfolio A over a year period, as shown in figure.
histogram(~percen, data = e8,
breaks = seq(-50, 70, 10),
xlab = "Percent returnon stocks in group A",
ylab = "Percentage of stocks")
A. Some of the value on the X-axis are negative numbers. What does this mean?
The stock had losses in few moments of the year.
B. On any histogram in general, when (if ever) can the X-and/or Y-axis contain negative values?
While the X-axis can report if a negative value occurred, the Y-axis report how many times (absolute or percentage) a value appears in a distribution. Hence, it cannot be negative.
The incomes of new graduates from a numerous program are shown in the figure.
barchart(e9,
horizontal = FALSE,
main = "Income")
A. Discuss the implications for graduates of this program.
library(knitr)
kable(summary(e9))
| income | |
|---|---|
| Min. : 50000 | |
| 1st Qu.: 50000 | |
| Median : 60000 | |
| Mean : 65385 | |
| 3rd Qu.: 80000 | |
| Max. :100000 | |
| Even | if the graduates are going well, the distribution is quite wide and right skewed, meaning that the biggest part of the graduates remain in the left site of the distribution, with few graduates making a big amount of money. |
Caution: When talking about money, people is used to lie.
B. Estimate where the median salary is in this data set.
median(e9$income)
## [1] 60000
C. Do you see any issues that anyone interpreting this data should consider?
The intervals between the class of income are quite wide. It does not allow a fine observation of the income distribution across the graduates. Furthermore, we are taking in account only 13 graduates from a “large program”.
Consider two histograms with n = 200. Which one have the smallest spread?
plot_1 <- barchart(e10a,
horizontal = FALSE,
ylim = 0:40,
main = "First histogram",
scales = list(y = list(at = seq(0, 40, 10))))
plot_2 <- barchart(e10b,
ylim = 0:40,
horizontal = FALSE,
main = "Second histogram",
scales = list(y = list(at = seq(0, 40, 10))))
print(plot_1, position=c(0, .5, 1, 1), more=TRUE)
print(plot_2, position=c(0, 0, 1, .5))
Solution:
The first histogram has a smaller spread, since a greater amount of observation are closer to the mean in respect to histogram 2.
Suppose your friend believes this gambling partners palys with a loaded die. He shows you a graph of the outcomes of the games playes with this die (see plot 1). Based on that graph, do you agree with wthis person? Why or why not?
## [1] 297
plot_e11c <-
densityplot(e11$freq,
type = "frequency",
main = "Plot 2. Suspect vs fair die (red) probability density",
panel = function(x, ...){
tab <- table(x)
panel.barchart(names(tab), tab/length(x), horizontal = FALSE)
panel.densityplot(x, plot.points = FALSE)
panel.mathdensity(dmath = dnorm, col = "red",
args = list(mean = 3.5), sd =
1.71)},
ylim = c(0, 0.5))
print(plot_e11a, position=c(0, .5, 1, 1), more=TRUE)
print(plot_e11c, position=c(0, 0, 1, .5))
Solution:
We are considering 297 rolls with a mean of 3.47 and a standard deviation of 1.69. In the long course, the mean for dice rolls is 3.5, and the standard deviation 1.71. Hence, the results do not suggest a loaded die, since the frequency of the results shows a similar probabily for every possible result.
The first month’s telephone bills is shown in figure. Explain why the histohram is misleading and suggest a solution.
set.seed(1234)
x <- abs(rep(rnorm(13, seq(5, 125, 10), 5), c(24, 43, 25, 18, 14, 12, 10, 12, 14, 18, 16, 14, 10)))
hist_e12a <- histogram(x,
type = "count",
main = "Plot 1",
breaks = 13,
xlab = "Telephone bill (€)",
ylim = 0:140)
Solution:
The scale y is too large. It does not allow a clear comparison between the different surprise each customer found in its bill. Reducing the scale to from 150 to 50, as shown in plot 2, make the data easier to read.
hist_e12b <- histogram(x,
type = "count",
main = "Plot 2",
breaks = 13,
xlab = "Telephone bill (€)",
ylim = 0:50)
print(hist_e12a, position=c(0, .5, 1, 1), more=TRUE)
print(hist_e12b, position=c(0, 0, 1, .5))
Check out the sales of a particular car across the US over a 60 days period, as shown in the figure. NB. the problem is solved with random generated data.
set.seed(1234)
e13 <- data.frame(sales = rnorm(60, 650, 50), day = 1:60)
xyplot(sales ~ day, data = e13,
type = c("l", "p", "r"),
abline = list(mean(e13$sales), col = "red", lty = 2))
min(e13$sales); max(e13$sales); which.min(e13$sales); which.max(e13$sales)
## [1] 532.7151
## [1] 770.7918
## [1] 4
## [1] 20
A. Can you see a pattern to the sales of this car across this time period?
In order to answer this question, I am going to add a trendline. The line indicate a negative trend along the 60 days of sales.
B. What are the lowest and highest numbers af sales and when did they occur?
The minimum number of sales was 532,71, and the maximum 770,8. They occurred on the 4th and 20th days.
C. Can you estimate the average off all sales over this time period?
The average is 628.7989183, as indicated by the red line that I added to the grap.
Bob decides is a good time to get in shape, so he starts exercising every day and plans to increase his excercise as he goes along. Look at the two graphs.
set.seed(1234)
e14 <- data.frame(mins = rnorm(100, mean = seq(20, 30, length = 100), sd = 4), day = c(1:100))
plot_e14a <- xyplot(mins ~ day, data = e14,
type = c("l", "p"),
main = "Line graph 1 of exercise log",
ylab = "Exercise time",
xlab = "Day")
plot_e14b <- xyplot(mins ~ day, data = e14,
type = c("l", "p"),
main = "Line graph 2 of exercise log",
ylab = "Exercise time",
xlab = "Day",
ylim = 10:80)
print(plot_e14a, position=c(0, .5, 1, 1), more=TRUE)
print(plot_e14b, position=c(0, 0, 1, .5))
A. Compare the two graphs. Do they represent the same dataset?
Yes, they do. The scales of the two graphs are different.
B. If the graphs are from the same data, which one is the most adequate?
Graph 1 is most adequate. It permits to better graps the variance of the exercises. The second graph has a lot of space that does not communicate any useful information.
Here is the line graph with the revenues of a company over the years. Exaplain why the graph is misleading and how you could fix the problem.
e15 <- data.frame(rev = seq(10, 70, by = 10), year = as.character(c(1950, 1970, 1975, 1980, 1985, 1990, 2000)))
class(e15$year)
## [1] "factor"
xyplot(rev ~ year, data = e15,
type = c("p", "l"),
main = "Line graph of revenue")
Solution:
First, the graph probably does not consider economic inflation. Second, the time scale consider each interval as equally spaced. The X scale should be spaced properly, with data represented by the adequate time intervals.
Line graphs typically connect the dots that represent the data values over time. If the time increments between the dots are large, explain why the line graph can be somewhat misleading.
Solution:
If the time increments between the dots are large, and you connect the dots, you assume that the change between happened with a steady rate.