We begin by loading four packages: tidyverse, haven, forcats, and foreign.
library(tidyverse)
library(haven)
library(foreign)
library(forcats)
Let’s load some data.
affairs <- read_dta("http://fmwww.bc.edu/ec-p/data/wooldridge/affairs.dta")
affairs
## # A tibble: 601 x 19
## id male age yrsmarr kids relig educ occup ratemarr naffairs affair
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 1 37 10 0 3 18 7 4 0 0
## 2 5 0 27 4 0 4 14 6 4 0 0
## 3 6 1 27 1.5 0 3 18 4 4 3 1
## 4 11 0 32 15 1 1 12 1 4 0 0
## 5 12 0 27 4 1 3 17 1 5 3 1
## 6 16 1 57 15 1 5 18 6 5 0 0
## 7 23 1 22 0.75 0 2 17 6 3 0 0
## 8 29 0 32 1.5 0 2 17 5 5 0 0
## 9 43 1 37 15 1 5 18 6 2 7 1
## 10 44 0 22 0.75 0 2 12 1 3 0 0
## # ... with 591 more rows, and 8 more variables: vryhap <dbl>, hapavg <dbl>,
## # avgmarr <dbl>, unhap <dbl>, vryrel <dbl>, smerel <dbl>, slghtrel <dbl>,
## # notrel <dbl>
The binary/dummy variable affairs$kids is easier to understand when it is converted to a factor variable. Same for the categorical variable affairs$ratemarr:
haskids <- factor(x = affairs$kids,labels = c("no","yes"))
mlab <- c("very unhappy","somewhat unhappy","average", "somewhat happy", "very happy")
marriage <- factor(x = affairs$ratemarr, labels = mlab)
is.factor(marriage) # TRUE; checks whether a vector is a factor.
## [1] TRUE
Here are the frequencies for those who have kids and those who don’t, and the frequencies for the various ratings for marital happiness:
table(haskids) # Tallies the numbers of observations for each factor level
## haskids
## no yes
## 171 430
table(affairs$kids) # Compare with previous command to see why it may make sense to convert a categorical variable into a factor.
##
## 0 1
## 171 430
table(marriage)
## marriage
## very unhappy somewhat unhappy average somewhat happy
## 16 66 93 194
## very happy
## 232
class(table(marriage)) # "table"
## [1] "table"
prop.table(table(haskids))
## haskids
## no yes
## 0.2845258 0.7154742
prop.table(table(marriage))
## marriage
## very unhappy somewhat unhappy average somewhat happy
## 0.0266223 0.1098170 0.1547421 0.3227953
## very happy
## 0.3860233
Two-way tables are also called contingency tables. The following contingency table shows the frequencies for all haskids-marriage combinations:
(countstab <- table(marriage,haskids)) # frequencies
## haskids
## marriage no yes
## very unhappy 3 13
## somewhat unhappy 8 58
## average 24 69
## somewhat happy 40 154
## very happy 96 136
is.table(countstab) # TRUE; checks whether an object is a table
## [1] TRUE
prop.table(countstab) # relative frequencies
## haskids
## marriage no yes
## very unhappy 0.004991681 0.021630616
## somewhat unhappy 0.013311148 0.096505824
## average 0.039933444 0.114808652
## somewhat happy 0.066555740 0.256239601
## very happy 0.159733777 0.226289517
By the way, the contingency table countstab can be converted into a data frame using the command countstabdf <- as.data.frame(countstab).
Yet another possibility is to convert a contingency table into a tibble: countstabtib <- as_tibble(countstab).
We may also be interested in the conditional relative frequencies. For example, what proportion of those who are “very happy” in their marriages have children?
prop.table(x = countstab, margin = 1) # The proportions add to 1 for every value of marriage
## haskids
## marriage no yes
## very unhappy 0.1875000 0.8125000
## somewhat unhappy 0.1212121 0.8787879
## average 0.2580645 0.7419355
## somewhat happy 0.2061856 0.7938144
## very happy 0.4137931 0.5862069
prop.table(x = countstab, margin = 2) # The proportions add to 1 for every value of haskids
## haskids
## marriage no yes
## very unhappy 0.01754386 0.03023256
## somewhat unhappy 0.04678363 0.13488372
## average 0.14035088 0.16046512
## somewhat happy 0.23391813 0.35813953
## very happy 0.56140351 0.31627907
pie(x = table(marriage), col = gray(seq(.2, 1, .2)), main = "Pie 1")
pie(x = table(affairs$ratemarr), col = gray(seq(.2, 1, .2)), main = "Pie 2")
pie(x = table(affairs$ratemarr), labels = mlab, col = gray(seq(.2, 1, .2)), main = "Pie 3")
pie(x = c(16, 66, 93, 194, 232), labels = mlab, col = gray(seq(.2, 1, .2)), main = "Pie 4")
Compare the first and second pie charts above. Good luck if you are trying to make sense of the second pie chart, which charts an unfactored categorical variable. Note that the second pie chart pie(affairs$ratemarr, col = gray(seq(.2, 1, .2))) was a huge failure. You need to convert a variable to a table form before you put it in pie().
The third pie shows that a pie chart of unfactored data can be made meaningful with the addition of appropriate labels.
The fourth pie shows that a pie chart can be made out of a numerical vector representing frequencies.
barplot(table(marriage), horiz = TRUE, las = 1, main = "Happiness", sub = "Totals", xlim = c(0, 250))
barplot(table(marriage), horiz = FALSE, las = 1, main = "Happiness", sub = "Totals", ylim = c(0, 250))
Replace marriage with affairs$ratemarr and try again, for comparison. This would provide yet another demonstration of the need to use the factor versions of any categorical variable in graphing. Compare the last two commands.
By the way, some labels may disappear if R does not find space. So, when exporting, adjust the size till all labels become visible.
barplot(table(haskids, marriage), horiz = TRUE, las = 1, main = "Happiness by Kids", legend = TRUE, args.legend = c(x="bottomright"))
barplot(table(marriage, haskids), horiz = TRUE, las = 1, main = "Kids by Happiness", legend = TRUE, args.legend = c(x="bottomright"))
barplot(table(haskids, marriage), beside = TRUE, las = 2, legend = TRUE, args.legend = c(x="top"))
Compare the last two commands.
ceosal1<-read_dta("http://fmwww.bc.edu/ec-p/data/wooldridge/ceosal1.dta")
ROE <- ceosal1$roe # Extract ROE to vector
hist(ROE)
A few more:
hist(ceosal1$roe, main = "Histogram of ROE", xlab = "ROE") # With labels
hist(ROE, breaks=c(0,5,10,20,30,60) ) # Plots density on the vertical, probably because that's what's appropriate when the breaks are uneven.
hist(ROE, breaks=5)
# First, a stand-alone density plot
plot(density(ROE), main = "Density Plot")
# Second, a density plot superimposed on a histogram
hist(ROE, freq=FALSE, ylim=c(0,.07), main ="Histogram and Density Plot")
lines( density(ROE), lwd=3 )
curve(0.0001*x^2, add = TRUE) # Add a function of x, for no good reason!
plot(ecdf(ROE))
mean(ceosal1$salary) # Sample mean/average
## [1] 1281.12
median(ceosal1$salary) # Sample median
## [1] 1039
sd(ceosal1$salary) # Sample standard deviation
## [1] 1372.345
summary(ceosal1$salary) # Sample summary information
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 223 736 1039 1281 1407 14822
cor(ceosal1$salary, ceosal1$roe) # Sample correlation
## [1] 0.1148417
with(ceosal1, cor(salary, roe)) # Same as previous
## [1] 0.1148417