Complete the following exercises:
Compute for the variables alienation and income:
For both variables:
For the relationship between the variables:
j. Compute the correlation coefficient. What is your interpretation of
the result?
k. Plot a scattergram
# a. Download the data, and read them (in Excel, or R)
alienation <- read.csv("https://raw.githubusercontent.com/statmind/exercises/main/alienation_csv.csv")
# b. How many observations do we have?
nrow(alienation) # Number of Rows
## [1] 100
summary(alienation) # Summary (min, max, median, mean, quartiles)
## alienation income
## Min. : 1.00 Min. : 0
## 1st Qu.: 3.00 1st Qu.: 29500
## Median : 5.00 Median : 59500
## Mean : 5.56 Mean : 58070
## 3rd Qu.: 8.00 3rd Qu.: 88000
## Max. :10.00 Max. :132000
colSums(!is.na(alienation)) # Number of valid records for all variables
## alienation income
## 100 100
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alienation %>% summarise(count = n())
## count
## 1 100
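The counting calls above all return 100 because neither column has missing values. As a minimal sketch of how the counts diverge once values are missing, the toy data frame below (hypothetical, not part of the exercise data) contains a few NA entries:

```r
# Hypothetical toy data with missing values (not the exercise data)
df <- data.frame(alienation = c(3, NA, 7, 5),
                 income     = c(20000, 45000, NA, NA))

n_rows    <- nrow(df)                 # rows, regardless of missing values
n_valid   <- colSums(!is.na(df))      # valid (non-missing) records per column
n_missing <- colSums(is.na(df))       # missing records per column
n_full    <- sum(complete.cases(df))  # rows without any missing value

n_valid
n_missing
n_full
```

Here nrow() still counts every row, while colSums(!is.na(df)) counts only the usable values per variable.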
The mean and the median already appear in the summary() output above; mean() and median() return them directly.
# c. The mean
mean(alienation$alienation)
## [1] 5.56
mean(alienation$income)
## [1] 58070
# d. The median
median(alienation$alienation)
## [1] 5
median(alienation$income)
## [1] 59500
# e. The variance
var(alienation$alienation)
## [1] 8.572121
var(alienation$income)
## [1] 1258247576
# f. The standard deviation
sd(alienation$alienation)
## [1] 2.927819
sd(alienation$income) # from sd()
## [1] 35471.79
sqrt(var(alienation$income)) # as square root of var()
## [1] 35471.79
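To make explicit what var() and sd() compute, here is a small sketch that rebuilds the sample variance by hand (sum of squared deviations divided by n - 1). The vector x is a hypothetical toy example, not the exercise data.

```r
# Hypothetical toy scores (not taken from the alienation data set)
x <- c(2, 4, 4, 5, 7, 8)

n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

# Both comparisons should return TRUE
all.equal(var_manual, var(x))
all.equal(sd_manual, sd(x))
```

Note that R divides by n - 1 (sample variance), not by n (population variance).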
# g. The range
range(alienation$alienation)
## [1] 1 10
range(alienation$income)
## [1] 0 132000
The range() function returns the minimum and maximum values of the variable.
The range in the statistical sense (maximum - minimum) is easily computed from these two values.
x <- range(alienation$alienation)
cat("The range of alienation is",x[2] - x[1])
## The range of alienation is 9
cat("The range of alienation is",max(alienation$alienation) - min(alienation$alienation))
## The range of alienation is 9
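A slightly shorter route to the same number is diff(range(x)), since diff() subtracts the first element from the second. The sketch below uses a hypothetical toy vector:

```r
# Hypothetical toy scores on a 1-10 scale (not the exercise data)
scores <- c(3, 1, 10, 7, 5)

# diff() of the two-element vector c(min, max) gives max - min
spread <- diff(range(scores))
spread

# With missing values, pass na.rm = TRUE to range()
scores_na <- c(scores, NA)
spread_na <- diff(range(scores_na, na.rm = TRUE))
```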
Basic histograms can be produced via hist().
# h. Plot a histogram
hist(alienation$alienation)
hist(alienation$income)
For publication-quality graphs (here: histograms), we can use ggplot2.
Fine-tuning the graphs is not easy because of the sheer number of options and settings.
The best strategy is:
1. Search online for examples, and compose your preferred format for graphs.
2. Then apply that format to all similar graphs!
Below, we use the scales package to get percentages on the y-axis.
We fill the bars in pink.
We set the x-axis breaks in steps of 20,000 monetary units.
# install.packages("scales") if needed
library(scales)
library(ggplot2)
ggplot(alienation, aes(x = income)) +
  # after_stat() replaces the dot-dot notation deprecated in ggplot2 3.4.0
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 binwidth = 20000,
                 boundary = 0,
                 colour = "black", fill = "pink") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  ggtitle("Histogram of Income") +
  scale_y_continuous(labels = percent) +
  labs(y = "Percent of Total")
Next, we plot the empirical cumulative distribution of income.
ggplot(alienation, aes(x = income)) +
  stat_ecdf(geom = "line")
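The cumulative distribution in the plot can also be evaluated numerically with the base R function ecdf(), which returns a step function. A minimal sketch, with a hypothetical toy income vector standing in for the exercise data:

```r
# Hypothetical toy incomes (not the exercise data)
income <- c(0, 20000, 40000, 60000, 80000)

# ecdf() returns a function Fn; Fn(v) is the proportion of values <= v
Fn <- ecdf(income)

share_up_to_40k <- Fn(40000)   # 3 of the 5 values are <= 40000
share_up_to_40k

# Several cut-off points can be evaluated at once
Fn(c(0, 50000, 100000))
```

Reading the plot at a given income on the x-axis corresponds to calling Fn() at that value.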
# install.packages("epiDisplay")
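# Bin income into classes of 25,000; cut() uses right-closed intervals by default
alienation$class <- cut(alienation$income, breaks = c(-Inf, 25000, 50000, 75000, 100000, 125000, Inf))
# Relabel the classes (labels are in thousands of monetary units)
alienation$class2 <- factor(alienation$class, labels = c("0-25", "25-50", "50-75", "75-100", "100-125", "125+"))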
library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: nnet
##
## Attaching package: 'epiDisplay'
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:scales':
##
## alpha
tab1(alienation$class2, cum.percent = TRUE)
## alienation$class2 :
## Frequency Percent Cum. percent
## 0-25 21 21 21
## 25-50 19 19 40
## 50-75 22 22 62
## 75-100 27 27 89
## 100-125 9 9 98
## 125+ 2 2 100
## Total 100 100 100
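A comparable frequency table can be built without epiDisplay, using only base R. The sketch below is one possible way to do it, not the method used above. The income vector is a hypothetical toy example, and the labels are passed to cut() directly instead of relabelling in a second step.

```r
# Hypothetical toy incomes (not the exercise data)
income <- c(10000, 30000, 30000, 60000, 90000, 130000)

# Bin into classes of 25,000 (right-closed intervals), labelled in thousands
classes <- cut(income,
               breaks = c(-Inf, 25000, 50000, 75000, 100000, 125000, Inf),
               labels = c("0-25", "25-50", "50-75", "75-100", "100-125", "125+"))

freq <- table(classes)          # absolute frequencies per class
pct  <- 100 * prop.table(freq)  # percentages per class

freq_table <- data.frame(
  class       = names(freq),
  frequency   = as.vector(freq),
  percent     = round(as.vector(pct), 1),
  cum_percent = round(cumsum(as.vector(pct)), 1)
)
freq_table
```

Empty classes (here 100-125) are kept with frequency 0 because cut() returns a factor with all levels.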
We can compute the correlation coefficient in several ways.
# For the relationship between the variables:
# j. Compute the correlation coefficient. What is your interpretation of the result?
# k. Plot a scattergram.
cor(alienation[,1:2], use="all.obs", method="pearson")
## alienation income
## alienation 1.0000000 -0.9146323
## income -0.9146323 1.0000000
# Correlations with significance levels
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(alienation[,1:2]), type="pearson")
## alienation income
## alienation 1.00 -0.91
## income -0.91 1.00
##
## n= 100
##
##
## P
## alienation income
## alienation 0
## income 0
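A coefficient of about -0.91 indicates a strong negative linear relationship: higher incomes go together with lower alienation scores. The P value printed as 0 above is presumably rounded; it is very small rather than exactly zero. As a sketch of what cor() computes, and of a base R alternative that also reports a p-value and confidence interval (cor.test() from the stats package), consider the hypothetical toy data below:

```r
# Hypothetical toy data with a negative relationship (not the exercise data)
income <- c(10, 20, 30, 40, 50, 60)
score  <- c(9, 8, 8, 5, 4, 2)

# Pearson's r by hand: sum of cross-products of deviations, divided by
# the square root of the product of the two sums of squared deviations
dx <- income - mean(income)
dy <- score - mean(score)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

all.equal(r_manual, cor(income, score))  # should be TRUE

# cor.test() adds a significance test and a confidence interval
test <- cor.test(income, score, method = "pearson")
test$estimate   # the correlation coefficient
test$p.value    # p-value for H0: true correlation is 0
test$conf.int   # 95% confidence interval by default
```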
A basic scattergram gives a first impression of the relationship between two variables. Note that plot() is a generic function: given a data frame with two numeric columns, it draws a scattergram of the second column against the first.
We then use ggplot2 to get publication-quality graphs.
plot(alienation[,1:2])
# shape=1 gives hollow circles
ggplot(alienation, aes(x = income, y = alienation)) +
  geom_point(shape = 1) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# shape=20 gives small filled circles
# axis breaks are adjusted
# a linear regression line is added
ggplot(alienation, aes(x = income, y = alienation)) +
  geom_point(shape = 20) +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  scale_y_continuous(breaks = seq(1, 10, by = 1))
## `geom_smooth()` using formula = 'y ~ x'
A format that I like replaces the default grey background and white gridlines with a white background and thin red minor gridlines.
Since alienation is measured on a discrete ordinal scale (the scores are integer values from 1 to 10), we use geom_jitter() instead of geom_point(). Jittering adds a small random amount to each point, so that observations with (nearly) equal values on both variables do not overlap.
ggplot(alienation, aes(x = income, y = alienation)) +
  geom_jitter(shape = 20) +
  geom_smooth(method = "lm", color = "black") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  scale_y_continuous(breaks = seq(0, 10, by = 1)) +
  theme(panel.background = element_rect(fill = "white")) +
  # linewidth replaces the size argument deprecated in ggplot2 3.4.0
  theme(panel.grid.minor = element_line(color = "red",
                                        linewidth = 0.10,
                                        linetype = 1))
## `geom_smooth()` using formula = 'y ~ x'
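To illustrate what jittering does to the data, the sketch below applies the base R function jitter() to a hypothetical toy vector of tied scores. It only demonstrates the idea; geom_jitter() handles this internally for the plot.

```r
# Hypothetical tied scores (not the exercise data)
scores <- c(5, 5, 5, 8, 8)

set.seed(1)  # make the random noise reproducible

# jitter() adds uniform noise between -amount and +amount to each value
jittered <- jitter(scores, amount = 0.2)
jittered

# Each jittered value stays within 0.2 of its original score
max_shift <- max(abs(jittered - scores))
max_shift <= 0.2
```

In ggplot2, the width and height arguments of geom_jitter() control the amount of noise in the x and y direction. Because the noise is random, the plot differs slightly on each run unless a seed is set.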