Questions

Complete the following exercises:

  a. Download the data, and read them (in Excel or R)
  b. How many observations do we have?

Compute for the variables alienation and income:

  c. The mean
  d. The median
  e. The variance
  f. The standard deviation
  g. The range

For both variables:

  h. Plot a histogram
  i. Make a frequency table (use meaningful brackets; compute frequency and cumulative frequency)

For the relationship between the variables:

  j. Compute the correlation coefficient. What is your interpretation of the result?
  k. Plot a scattergram

Read Data from GitHub

# a. Download the data, and read them (in Excel, or R)
alienation <- read.csv("https://raw.githubusercontent.com/statmind/exercises/main/alienation_csv.csv")
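
After reading a CSV file, it usually pays to check that the columns arrived with the expected names and types. A minimal sketch, using a small made-up file in a temporary location instead of the GitHub URL (the three rows and the name toy are invented for illustration and are not the exercise data):

```r
# Write a tiny, hypothetical CSV to a temporary file (not the real data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("alienation,income",
             "3,88000",
             "8,12000",
             "5,59500"), tmp)

# Read it the same way as the exercise file
toy <- read.csv(tmp)

str(toy)   # column names and types
head(toy)  # first rows
```

The same two calls, str() and head(), can be applied to the alienation data frame.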

Descriptives

# b. How many observations do we have?

nrow(alienation)    # Number of Rows
## [1] 100
summary(alienation) # Summary (min, max, median, mean, quartiles)
##    alienation        income      
##  Min.   : 1.00   Min.   :     0  
##  1st Qu.: 3.00   1st Qu.: 29500  
##  Median : 5.00   Median : 59500  
##  Mean   : 5.56   Mean   : 58070  
##  3rd Qu.: 8.00   3rd Qu.: 88000  
##  Max.   :10.00   Max.   :132000
colSums(!is.na(alienation)) # Number of valid records for all variables
## alienation     income 
##        100        100
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alienation %>% summarise(count = n()) 
##   count
## 1   100
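
The three counts above agree because the data contain no missing values. A small sketch of how they can differ, assuming a made-up data frame (toy, with one missing income) that is not part of the exercise:

```r
# Hypothetical toy data with one missing income value
toy <- data.frame(alienation = c(2, 7, 5, 9),
                  income     = c(90000, NA, 60000, 15000))

nrow(toy)                  # all rows, including incomplete ones
colSums(!is.na(toy))       # non-missing values per variable
sum(complete.cases(toy))   # rows without any missing value
```

Here nrow() reports 4 rows, while only 3 rows have a valid income.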

Descriptives

The mean and the median can also be read from the summary() output; the functions below compute each statistic directly.

# c. The mean
mean(alienation$alienation)
## [1] 5.56
mean(alienation$income)
## [1] 58070
# d. The median
median(alienation$alienation)
## [1] 5
median(alienation$income)
## [1] 59500
# e. The variance
var(alienation$alienation)
## [1] 8.572121
var(alienation$income)
## [1] 1258247576
# f. The standard deviation
sd(alienation$alienation)   
## [1] 2.927819
sd(alienation$income)        # from sd()
## [1] 35471.79
sqrt(var(alienation$income)) # as square root of var()
## [1] 35471.79
# g. The range
range(alienation$alienation)
## [1]  1 10
range(alienation$income)
## [1]      0 132000
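
Note that var() and sd() compute the sample variance and standard deviation, dividing by n - 1 rather than n. The following sketch checks this by hand on a few made-up scores (the vector x is hypothetical, not the exercise data):

```r
# Hypothetical scores, for illustration only
x <- c(2, 4, 4, 5, 7, 8)

n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)  # sample variance
manual_sd  <- sqrt(manual_var)                # sample standard deviation

all.equal(manual_var, var(x))  # TRUE
all.equal(manual_sd, sd(x))    # TRUE
```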

The range() function returns the minimum and the maximum of the variable, not their difference.

The range as a single number (maximum minus minimum) follows by subtraction.

x <- range(alienation$alienation)
cat("The range of alienation is",x[2] - x[1])
## The range of alienation is 9
cat("The range of alienation is",max(alienation$alienation) - min(alienation$alienation))
## The range of alienation is 9
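
A compact alternative is diff(range(x)), which returns the maximum minus the minimum in one step. A sketch on made-up values (the data frame toy is hypothetical), applied to both columns at once:

```r
# Hypothetical toy data
toy <- data.frame(alienation = c(1, 4, 10, 6),
                  income     = c(0, 45000, 132000, 70000))

# diff(range(x)) returns max(x) - min(x) as a single number
spread <- sapply(toy, function(x) diff(range(x)))
spread
```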

Histogram

Basic histograms can be produced via hist().

# h. Plot a histogram
hist(alienation$alienation)

hist(alienation$income)
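
By default, hist() chooses the class boundaries itself. The sketch below sets them explicitly through the breaks argument and inspects the counts that hist() returns; the income values are made up for illustration:

```r
# Hypothetical incomes, for illustration only
income <- c(0, 12000, 30000, 42000, 58000, 61000, 75000, 99000, 131000)

h <- hist(income,
          breaks = seq(0, 140000, by = 20000),  # explicit class boundaries
          main   = "Histogram of Income",
          xlab   = "Income")

h$counts  # number of observations per class
```

The classes are closed on the right by default, so a value of exactly 20,000 would fall in the first class.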

For publication-quality graphs (here: histograms), we can use ggplot2.

Fine-tuning the graphs is not easy because of the large number of options and settings.

The best strategy is:
1. Search online for examples, and compose your preferred format for graphs.
2. Then apply that format to all similar graphs!
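
One possible way to follow this strategy is to wrap the preferred settings in a small function and add it to every plot. The sketch below assumes a hypothetical style called theme_course() and made-up data; the particular settings are only an example:

```r
library(ggplot2)

# A hypothetical house style, defined once
theme_course <- function() {
  theme_bw() +
    theme(plot.title       = element_text(face = "bold"),
          panel.grid.minor = element_blank())
}

# Hypothetical toy data
toy <- data.frame(income     = c(10000, 35000, 60000, 90000),
                  alienation = c(9, 7, 5, 2))

# Apply the same style to any plot
p <- ggplot(toy, aes(x = income, y = alienation)) +
  geom_point() +
  ggtitle("Alienation by Income") +
  theme_course()
p
```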

Below, we use the scales package to get percentages on the y-axis.

We fill the bars in pink.

We set the x-axis breaks in steps of 20,000 monetary units.

# install.packages("scales") if needed
library(scales)
library(ggplot2)
ggplot(alienation, aes(x=income)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 binwidth=20000,
                 boundary = 0,
                 colour="black", fill="pink") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  ggtitle("Histogram of Income") +
  scale_y_continuous(labels=percent) +
  labs(y = 'Percent of Total')

We can also plot the cumulative distribution of income.

ggplot(alienation, aes(x=income)) +
  stat_ecdf(geom = "line")
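
The same cumulative distribution is available numerically through ecdf() from base R, which returns a function giving the share of observations at or below a value. A sketch on made-up incomes (not the exercise data):

```r
# Hypothetical incomes
income <- c(10000, 20000, 40000, 50000, 80000)

F_income <- ecdf(income)  # returns a step function

F_income(50000)  # share of observations <= 50000: 0.8
F_income(5000)   # below the minimum: 0
```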

# i. Make a frequency table (brackets of 25,000; labels in thousands)
# install.packages("epiDisplay") if needed
alienation$class <- cut(alienation$income, breaks = c(-Inf, 25000, 50000, 75000, 100000, 125000, Inf))
alienation$class2 <- factor(alienation$class, labels = c("0-25", "25-50", "50-75", "75-100", "100-125", "125+"))
library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: nnet
## 
## Attaching package: 'epiDisplay'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## The following object is masked from 'package:scales':
## 
##     alpha
tab1(alienation$class2, cum.percent = TRUE)

## alienation$class2 : 
##         Frequency Percent Cum. percent
## 0-25           21      21           21
## 25-50          19      19           40
## 50-75          22      22           62
## 75-100         27      27           89
## 100-125         9       9           98
## 125+            2       2          100
##   Total       100     100          100
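
A comparable table can be built with base R alone, which may help to see what the frequency, percent, and cumulative percent columns contain. This is a sketch on made-up incomes with fewer brackets, not a reproduction of tab1():

```r
# Hypothetical incomes
income <- c(5000, 20000, 30000, 45000, 60000, 70000, 72000, 90000)

# Brackets are closed on the right: 25000 itself would fall in "0-25"
classes <- cut(income,
               breaks = c(-Inf, 25000, 50000, 75000, Inf),
               labels = c("0-25", "25-50", "50-75", "75+"))

freq <- table(classes)
pct  <- 100 * prop.table(freq)

freq_table <- data.frame(class       = names(freq),
                         frequency   = as.vector(freq),
                         percent     = as.vector(pct),
                         cum_percent = as.vector(cumsum(pct)))
freq_table
```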

Two Variables

We can compute the correlation coefficient in several ways.

  1. cor() from base R computes the coefficient.
  2. rcorr() from Hmisc additionally gives the sample size and the significance.

# For the relationship between the variables:
# j. Compute the correlation coefficient. What is your interpretation of the result?
# k. Plot a scattergram.

cor(alienation[,1:2], use="all.obs", method="pearson")
##            alienation     income
## alienation  1.0000000 -0.9146323
## income     -0.9146323  1.0000000
# Correlations with significance levels
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(alienation[,1:2]), type="pearson") 
##            alienation income
## alienation       1.00  -0.91
## income          -0.91   1.00
## 
## n= 100 
## 
## 
## P
##            alienation income
## alienation             0    
## income      0
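
A coefficient of about -0.91 indicates a strong negative linear relationship: respondents with a higher income tend to report lower alienation. The P value printed as 0 is most likely a rounded display of a very small value rather than exactly zero. As a sketch, cor.test() from base R reports the unrounded p-value and a confidence interval; the data below are simulated and hypothetical, not the exercise data:

```r
# Simulated, hypothetical data with a negative relationship
set.seed(1)
income     <- runif(50, min = 0, max = 130000)
alienation <- 10 - income / 15000 + rnorm(50, sd = 1)

res <- cor.test(income, alienation, method = "pearson")

res$estimate  # the correlation coefficient r
res$p.value   # unrounded p-value
res$conf.int  # 95% confidence interval for r
```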

A basic scattergram gives insight into the relationship between two variables. Note that plot() is a generic function: given a data frame with two numeric columns, it draws a scattergram.

After the basic plot, we use ggplot2 to get publication-quality graphs.

plot(alienation[,1:2])

# shape=1 gives hollow circles
ggplot(alienation, aes(x=income, y=alienation)) +
  geom_point(shape=1) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# shape=20 gives small filled circles
# axis breaks are adjusted
# a linear regression line is added
ggplot(alienation, aes(x=income, y=alienation)) +
  geom_point(shape=20) +
  geom_smooth(method="lm") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  scale_y_continuous(breaks = seq(1, 10, by = 1))
## `geom_smooth()` using formula = 'y ~ x'
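
The straight line drawn by geom_smooth(method="lm") corresponds to an ordinary least squares fit, whose intercept and slope can be obtained with lm(). A sketch on made-up points that lie exactly on a line, so the result is easy to verify (the data frame toy is hypothetical):

```r
# Hypothetical toy data lying exactly on a line
toy <- data.frame(income     = c(0, 30000, 60000, 90000, 120000),
                  alienation = c(10, 8, 6, 4, 2))

fit <- lm(alienation ~ income, data = toy)
coef(fit)  # intercept 10; slope -2/30000 per monetary unit
```

The slope reads as the expected change in alienation for one extra monetary unit of income.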

A format that I like replaces the greyish background and white gridlines with a white background and thin red gridlines.

Since alienation is measured on a discrete ordinal scale (the scores are integer values from 1 to 10), we use geom_jitter() instead of geom_point(). Jitter adds small random values to the coordinates, so that dots with (nearly) equal values on both variables do not overlap.

ggplot(alienation, aes(x=income, y=alienation)) +
  geom_jitter(shape=20) +
  geom_smooth(method="lm",color="black") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  scale_y_continuous(breaks = seq(0, 10, by = 1)) +
  theme(panel.background = element_rect(fill = "white")) +
  theme(panel.grid.minor = element_line(color = "red",
                                        linewidth = 0.10,
                                        linetype = 1))
## `geom_smooth()` using formula = 'y ~ x'
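
By default, jitter moves the points in both directions and differs on every run. The sketch below, on made-up data, restricts the jitter to the vertical direction and fixes the random seed through position_jitter(); the values 0.2 and 123 are arbitrary choices for illustration:

```r
library(ggplot2)

# Hypothetical toy data with many tied points
toy <- data.frame(income     = rep(c(20000, 60000, 100000), each = 4),
                  alienation = rep(c(8, 5, 2), each = 4))

p <- ggplot(toy, aes(x = income, y = alienation)) +
  # jitter vertically only; the seed makes the random offsets reproducible
  geom_point(position = position_jitter(width = 0, height = 0.2, seed = 123),
             shape = 20)
p
```

Keeping width at 0 leaves the income values exactly where they are, which seems sensible for a continuous variable.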