DS1

Questions

Complete the following exercises:

a. Download the data, and read them (in Excel or R)
b. How many observations do we have?

Compute for the variables alienation and income:

c. The mean
d. The median
e. The variance
f. The standard deviation
g. The range

For both variables:

h. Plot a histogram
i. Make a frequency table (use meaningful brackets; compute frequency and cumulative frequency)

For the relationship between the variables:
j. Compute the correlation coefficient. What is your interpretation of the result?
k. Plot a scattergram

Read Data from GitHub

# a. Download the data, and read them (in Excel, or R)
alienation <- read.csv("https://raw.githubusercontent.com/statmind/exercises/main/alienation_csv.csv")
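
After reading a CSV file it is worth checking that the columns arrived with the expected names and types. The sketch below uses a small made-up CSV passed through the text argument of read.csv(), so it runs without a network connection; the three rows are illustrative and are not taken from the exercise data.

```r
# Toy CSV with the same column names as the exercise data (values are made up)
csv_text <- "alienation,income
3,40000
7,12000
5,62000"

toy <- read.csv(text = csv_text)

str(toy)    # column names and types
head(toy)   # first rows
```

Applied to the real data, the same two calls would be str(alienation) and head(alienation).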

Descriptives

# b. How many observations do we have?

nrow(alienation)    # Number of Rows

## [1] 100

summary(alienation) # Summary (min, max, median, mean, quartiles)

##    alienation        income      
##  Min.   : 1.00   Min.   :     0  
##  1st Qu.: 3.00   1st Qu.: 29500  
##  Median : 5.00   Median : 59500  
##  Mean   : 5.56   Mean   : 58070  
##  3rd Qu.: 8.00   3rd Qu.: 88000  
##  Max.   :10.00   Max.   :132000

colSums(!is.na(alienation)) # Number of valid records for all variables

## alienation     income 
##        100        100
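
In the exercise data nrow() and the count of valid records agree, because there are no missing values. A minimal sketch with made-up data shows how the two can differ once NA values are present; the data frame toy and its values are assumptions for illustration only.

```r
# Made-up data with some missing values
toy <- data.frame(
  alienation = c(3, 7, NA, 5, 9),
  income     = c(40000, NA, NA, 62000, 15000)
)

n_rows     <- nrow(toy)                 # all rows, missing or not
n_valid    <- colSums(!is.na(toy))      # non-missing values per column
n_complete <- sum(complete.cases(toy))  # rows without any missing value

n_rows
n_valid
n_complete
```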

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

alienation %>% summarise(count = n())

##   count
## 1   100

Descriptives

The mean and the median are also part of the summary() output shown above.

# c. The mean
mean(alienation$alienation)

## [1] 5.56

mean(alienation$income)

## [1] 58070

# d. The median
median(alienation$alienation)

## [1] 5

median(alienation$income)

## [1] 59500

# e. The variance
var(alienation$alienation)

## [1] 8.572121

var(alienation$income)

## [1] 1258247576

# f. The standard deviation
sd(alienation$alienation)

## [1] 2.927819

sd(alienation$income)        # from sd()

## [1] 35471.79

sqrt(var(alienation$income)) # as square root of var()

## [1] 35471.79
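
var() and sd() compute the sample variance and sample standard deviation, that is, they divide by n - 1 and not by n. The sketch below checks this by hand on a small made-up vector; the values of x are illustrative only.

```r
# Made-up values, for illustration only
x <- c(2, 4, 4, 5, 7, 8)
n <- length(x)

squared_dev <- (x - mean(x))^2

sample_var     <- sum(squared_dev) / (n - 1)  # what var() returns
population_var <- sum(squared_dev) / n        # divides by n instead
sample_sd      <- sqrt(sample_var)            # what sd() returns

sample_var
population_var
sample_sd
```

With 100 observations, as in the exercise data, the difference between dividing by n and by n - 1 is small, but it matters for small samples.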

# g. The range
range(alienation$alienation)

## [1]  1 10

range(alienation$income)

## [1]      0 132000

The range() function outputs the minimum and maximum values of the variable.

The range as a single number (maximum minus minimum) is then easy to compute.

x <- range(alienation$alienation)
cat("The range of alienation is",x[2] - x[1])

## The range of alienation is 9

cat("The range of alienation is",max(alienation$alienation) - min(alienation$alienation))

## The range of alienation is 9
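
A compact alternative is diff(range(x)), which subtracts the minimum from the maximum in one step. The sketch below applies it to every column of a small made-up data frame; the values are illustrative only.

```r
# Made-up data, for illustration only
toy <- data.frame(
  alienation = c(1, 4, 10, 7),
  income     = c(0, 52000, 132000, 88000)
)

# diff(range(v)) returns max(v) - min(v)
ranges <- sapply(toy, function(v) diff(range(v)))
ranges
```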

Histogram

Basic histograms can be produced via hist().

# h. Plot a histogram
hist(alienation$alienation)

hist(alienation$income)
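
hist() picks its own bins by default. As a sketch, the breaks argument sets the bins explicitly, and plot = FALSE returns the bin counts without drawing anything, which is a quick way to see what a histogram is counting. The income values below are made up for illustration.

```r
# Made-up incomes, for illustration only
income <- c(5000, 12000, 30000, 45000, 48000, 60000, 71000, 90000)

# Bins of width 25,000; by default each bin includes its right endpoint
h <- hist(income, breaks = seq(0, 100000, by = 25000), plot = FALSE)

h$breaks  # bin boundaries
h$counts  # number of observations per bin
```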

For publication-quality graphs (here: histograms), we can use ggplot2.

Tuning the graphs is not easy because of the large number of options and settings.

A workable strategy is:
1. Search online, and put together your preferred format for graphs.
2. Then apply that format to all similar graphs!
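
One possible way to apply a format to all similar graphs is to store the settings once and add them to each plot. The sketch below is an assumption about how this could be organised, not a prescribed method: house_style is a hypothetical name, the income values are made up, and the linewidth argument requires ggplot2 3.4.0 or later.

```r
library(ggplot2)

# Made-up incomes, for illustration only
toy <- data.frame(income = c(5000, 12000, 30000, 45000,
                             48000, 60000, 71000, 90000))

# Hypothetical "house style": a list of ggplot2 components
# that can be added to any plot with +
house_style <- list(
  theme(panel.background = element_rect(fill = "white"),
        panel.grid.minor = element_line(colour = "red", linewidth = 0.1)),
  labs(y = "Count")
)

p <- ggplot(toy, aes(x = income)) +
  geom_histogram(binwidth = 25000, boundary = 0,
                 colour = "black", fill = "pink") +
  house_style

# print(p) draws the plot
```

Any other plot can reuse the same settings by ending with + house_style.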

Below, we use the scales package to get percentages on the y-axis.

We use a pink color to fill the bars.

We set the x-axis breaks in steps of 20,000 monetary units.

# install.packages("scales") if needed
library(scales)
library(ggplot2)
ggplot(alienation, aes(x=income)) +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.4.0)
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 binwidth=20000,
                 boundary = 0,
                 colour="black", fill="pink") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  ggtitle("Histogram of Income") +
  scale_y_continuous(labels=percent) +
  labs(y = 'Percent of Total')

We also plot the cumulative distribution of income.

ggplot(alienation, aes(x=income)) +
  stat_ecdf(geom = "line")
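
The curve drawn by stat_ecdf() is the empirical cumulative distribution function: for each income value, the share of observations at or below it. Base R offers ecdf() to evaluate that function at specific values. A minimal sketch with made-up incomes:

```r
# Made-up incomes, for illustration only
income <- c(10000, 20000, 30000, 40000, 50000)

# ecdf() returns a function; calling it gives the share of values <= its argument
F_income <- ecdf(income)

share_at_30k <- F_income(30000)  # 3 of 5 values are <= 30000
share_at_30k
```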

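# i. Make a frequency table
# install.packages("epiDisplay") if needed
# Cut income into brackets of 25,000; intervals are right-closed, e.g. (25000, 50000]
alienation$class <- cut(alienation$income, breaks = c(-Inf, 25000, 50000, 75000, 100000, 125000, Inf))
# Readable labels, in thousands of monetary units
alienation$class2 <- factor(alienation$class, labels = c("0-25", "25-50", "50-75", "75-100", "100-125", "125+"))
library(epiDisplay)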

## Loading required package: foreign

## Loading required package: survival

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: nnet

## 
## Attaching package: 'epiDisplay'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

## The following object is masked from 'package:scales':
## 
##     alpha

tab1(alienation$class2, cum.percent = TRUE)

## alienation$class2 : 
##         Frequency Percent Cum. percent
## 0-25           21      21           21
## 25-50          19      19           40
## 50-75          22      22           62
## 75-100         27      27           89
## 100-125         9       9           98
## 125+            2       2          100
##   Total       100     100          100
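
The same kind of table can be built without an extra package. The sketch below is a base-R alternative, assuming made-up incomes and four brackets; it combines cut(), table() and cumsum() to get frequencies and cumulative frequencies. Note that cut() uses right-closed intervals by default, so an income of exactly 25000 would fall in the "0-25" bracket.

```r
# Made-up incomes, for illustration only
income <- c(5000, 12000, 30000, 45000, 48000, 60000, 71000, 90000)

income_class <- cut(income,
                    breaks = c(-Inf, 25000, 50000, 75000, Inf),
                    labels = c("0-25", "25-50", "50-75", "75+"))

freq <- as.vector(table(income_class))

freq_table <- data.frame(
  class         = levels(income_class),
  frequency     = freq,
  percent       = 100 * freq / sum(freq),
  cum_frequency = cumsum(freq),
  cum_percent   = 100 * cumsum(freq) / sum(freq)
)

freq_table
```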

Two Variables

We can compute the correlation coefficient in several ways.

cor() from base R computes the coefficient.
rcorr() from Hmisc additionally reports the sample size and the significance.

# For the relationship between the variables:
# j. Compute the correlation coefficient. What is your interpretation of the result?
# k. Plot a scattergram.

cor(alienation[,1:2], use="all.obs", method="pearson")

##            alienation     income
## alienation  1.0000000 -0.9146323
## income     -0.9146323  1.0000000

# Correlations with significance levels
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(alienation[,1:2]), type="pearson")

##            alienation income
## alienation       1.00  -0.91
## income          -0.91   1.00
## 
## n= 100 
## 
## 
## P
##            alienation income
## alienation             0    
## income      0

A basic scattergram gives insight into the correlation between two variables. Note that plot() is a generic function: given a data frame with two numeric columns, it draws a scattergram.

We use ggplot2 to get publication-quality graphs.

plot(alienation[,1:2])

# shape=1 gives hollow circles
ggplot(alienation, aes(x=income, y=alienation)) +
  geom_point(shape=1) +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# shape=20 gives small filled circles
# scales are adjusted
# a linear regression line is added
ggplot(alienation, aes(x=income, y=alienation)) +
  geom_point(shape=20) +
  geom_smooth(method="lm") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  scale_y_continuous(breaks = seq(1, 10, by = 1))

## `geom_smooth()` using formula = 'y ~ x'

A format that I like replaces the greyish background and white gridlines with a white background and thin red minor gridlines.

Since alienation is measured on a discrete ordinal scale (the scores are integer values from 1 to 10), we use geom_jitter() instead of geom_point(). Jittering adds a small amount of random noise to each point, so that dots with (nearly) equal values on both variables do not overlap.

ggplot(alienation, aes(x=income, y=alienation)) +
  geom_jitter(shape=20) +
  geom_smooth(method="lm",color="black") +
  scale_x_continuous(breaks = seq(0, 160000, by = 20000)) +
  scale_y_continuous(breaks = seq(0, 10, by = 1)) +
  theme(panel.background = element_rect(fill = "white")) +
  # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  theme(panel.grid.minor = element_line(color = "red",
                                        linewidth = 0.10,
                                        linetype = 1))

## `geom_smooth()` using formula = 'y ~ x'
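
To see numerically what jittering does, the sketch below uses base R's jitter() on a few made-up integer scores. It is an illustration of the idea, not of the exact amounts geom_jitter() applies; the vector scores and the amount of 0.2 are assumptions.

```r
set.seed(1)  # make the random noise reproducible

# Made-up integer scores with ties, for illustration only
scores <- c(3, 3, 3, 7, 7)

# Add uniform noise between -0.2 and +0.2 to each value
jittered <- jitter(scores, amount = 0.2)

jittered         # tied values are now slightly apart
round(jittered)  # rounding recovers the original scores
```

In ggplot2, the width and height arguments of geom_jitter() control how much noise is added in each direction.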

Robert

2024-04-05
