1 Part 1

  • Use the occupational experience variable (“oexp”) of the income_example dataset and plot

    • a histogram,
    • a kernel density estimate,
    • a boxplot of “oexp”,
    • a set of boxplots showing the distribution of “oexp” by sex crossed with occupational status (“occ”).

The distribution of occupational experience is, overall, slightly skewed to the right. The modal value is 0 years of occupational experience. The distribution is fairly uniform from 1 to around 30 years of experience; after 30 years, there are fewer and fewer data points.

library(tidyverse)
data <- read.table('/Users/johnhope/Desktop/DS3003/Data/income_exmpl.dat')

par(mfrow = c(2, 2))

# histogram
hist(data$oexp, xlab = 'Occ. Experience (in years)', main = 'Histogram of Occ. Experience', breaks = 20)

# density plot
plot(density(data$oexp, adjust = 0.5), main = 'Density Estimate of Occ. Experience', xlab = 'Occ. Experience (in years)')

# boxplot
boxplot(data$oexp, horizontal = TRUE, main = 'Boxplot of Occ. Experience', xlab = 'Occ. Experience (in years)')

# set of boxplots
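boxplot(data$oexp ~ data$sex + data$occ, horizontal = TRUE, col = rep(c('blue', 'red'), 3),
        xlab = 'Occ. Experience (in years)', ylab = '',
        main = 'Occ. Experience by Sex and Occ. Status', yaxt = "n")
# hand-written group labels, placed just left of the plot region;
# par("usr")[1] is the left x limit, and the labels must match the order in which boxplot() draws the groups
text(y = 6:1, x = par("usr")[1] - 1, labels = c('m.med','f.med','m.low','f.low','m.high','f.high'), xpd = NA, adj = 1)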

The histogram shows the distribution described above, with an overall right skew. The density plot shows something very similar: the highest density is at 0, the density is fairly uniform until about 30 years, and after that it falls sharply. The boxplot shows less skew, but still some. The longer whisker (tail) on the right side of the box indicates that roughly the top 25% of values are spread over a wider range than the bottom 25%, so there are fewer points at each year in that range, which again tells us that observations become sparse after 30 years. The set of boxplots by sex and occupational status shows the differences between males and females as well as across status levels. Overall, the high-status group tends to have more years of experience, as indicated by its higher medians. For both medium and high status, males have more occupational experience than females; for low status, females have more occupational experience.
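The statements about whiskers and medians can also be checked numerically. The following is a minimal sketch using a small made-up data frame (`toy` is hypothetical and does not contain the income_example values); with the real data loaded, the same calls could be applied to `data` instead.

```r
# Toy data for illustration only (NOT the income_example values)
toy <- data.frame(
  oexp = c(0, 2, 5, 10, 12, 20, 25, 30, 35, 40, 3, 8),
  sex  = c("f", "m", "f", "m", "f", "m", "f", "m", "f", "m", "f", "m"),
  occ  = c("low", "low", "med", "med", "high", "high",
           "low", "low", "med", "med", "high", "high")
)

# Five-number summary that underlies a single boxplot
five_num <- fivenum(toy$oexp)
names(five_num) <- c("min", "lower_hinge", "median", "upper_hinge", "max")

# Spread of the lowest and highest quarters of the data;
# a larger upper spread corresponds to a longer right whisker
lower_spread <- five_num[["lower_hinge"]] - five_num[["min"]]
upper_spread <- five_num[["max"]] - five_num[["upper_hinge"]]

# Median experience for each sex-by-status cell
group_medians <- aggregate(oexp ~ sex + occ, data = toy, FUN = median)

print(five_num)
print(c(lower_spread = lower_spread, upper_spread = upper_spread))
print(group_medians)
```

Comparing `upper_spread` with `lower_spread` gives a rough numeric counterpart to the whisker lengths, and `group_medians` gives the values drawn as the centre lines of the grouped boxplots.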

2 Part 2

  • Download the SCS Data set from Collab (there you also find a separate file containing a brief description of variables). Then investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”) by plotting the data and the path of means.
  1. Produce a scatterplot between “mathpre” and “mars”. You might consider using jitter() or alpha() for avoiding overlying points.
library(foreign) # importing to read .sav file
data <- read.spss('/Users/johnhope/Desktop/DS3003/Data/SCS_QE.sav', to.data.frame = TRUE) # second argument used to convert to data frame

# scatterplot
plot(jitter(data$mathpre, factor = 3), jitter(data$mars, factor = 3), 
     xlab = 'Math Achievement Score', ylab = 'Math Anxiety Score', cex = .6, pch = 16, 
     col = rgb(red = 0, blue = 0, green = 0, alpha = 0.6))

  2. Draw a conditioning plot for female and male students (variable “male”). Include “| male” in your first argument to create a conditioning plot.
#conditional plot
coplot(mars ~ mathpre | male, data = data, xlab = 'Math Achievement Score', ylab = 'Math Anxiety Score')

  3. Describe in words the relation between math scores and math anxiety. Do you find evidence of Simpson’s Paradox?

The plots indicate an overall negative relationship between math scores and math anxiety. As math scores increase, math anxiety decreases, and vice versa. This can be seen in the downward trend between the two variables.
From the conditioning plot, there is no real evidence of Simpson’s Paradox. Both males and females share the same overall negative relationship between math scores and math anxiety.
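The task also mentions the path of means. One possible way to compute it is sketched below, taking the path of means to be the mean of y within successive equal-width bins of x. The function name `path_of_means` and the argument `n_bins` are hypothetical, and the data in the example are made up purely to show the mechanics.

```r
# Sketch: path of means = mean of y within successive bins of x.
# `path_of_means` and `n_bins` are hypothetical names, not part of the assignment.
path_of_means <- function(x, y, n_bins = 10) {
  keep <- complete.cases(x, y)          # drop pairs with missing values
  x <- x[keep]
  y <- y[keep]
  bins <- cut(x, breaks = n_bins, include.lowest = TRUE)  # equal-width bins
  out <- data.frame(
    x_mean = as.numeric(tapply(x, bins, mean)),
    y_mean = as.numeric(tapply(y, bins, mean))
  )
  out[!is.na(out$y_mean), ]             # drop empty bins
}

# Made-up data with an exact negative trend, for illustration only
toy_x <- 1:20
toy_y <- 50 - 2 * toy_x
pm <- path_of_means(toy_x, toy_y, n_bins = 4)
print(pm)

# With the SCS data loaded as above, the path could be overlaid on the scatterplot:
# pm <- path_of_means(data$mathpre, data$mars)
# lines(pm$x_mean, pm$y_mean, col = 'red', lwd = 2)
```

A steadily falling `y_mean` column would agree with the negative relationship described above.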

3 Part 3

The following dataset is a heart disease classification dataset. It was accessed for free online at kaggle.com, and can be found here. The data contain 14 variables, consisting of indicators of heart disease as well as age and sex. The goal is to use these indicators to predict whether or not a patient has heart disease. The population here is represented by a sample of 1,922 patients whose data were used and stored.

  • Re-do Part 2, i.e.,
    • produce a scatterplot between “A” and “B”. You might consider using jitter() or alpha() for avoiding overlying points.
    • draw a scatterplot conditioning on variable “C”. Include “| C” in your first argument to create a conditioning plot.
    • describe in words the relation between “A” and “B.” Do you find evidence of Simpson’s Paradox?
data <- read.csv('/Users/johnhope/Desktop/DS3003/Data/heart.csv') # reading in the data

data$sex <- ifelse(data$sex==1,"Male", "Female") # re-coding to make 1 = Male, and 0 = Female

# scatterplot
plot(jitter(data$age, factor = 3), jitter(data$chol, factor = 3),
     xlab = 'Age (in years)', ylab = 'Cholesterol (in mg/dl)', cex = .6, pch = 16,
     col = rgb(red = 0, blue = 0, green = 0, alpha = 0.6))

# conditional plot
coplot(chol ~ age | sex, data = data, xlab = 'Age', ylab = 'Cholesterol')

Overall, we can see a slight positive relationship between age and cholesterol levels. As age increases, we can see slight increases in the overall cholesterol levels of patients, though it is not a very strong relationship.
From the conditional plot, we can see at most slight evidence of Simpson’s Paradox. In the male group, there appears to be little to no relation between age and cholesterol: the band of points is roughly horizontal, indicating no association. In the female group, there is a much more apparent positive association between age and cholesterol. The partial associations therefore differ from each other and from the marginal association, which suggests that sex affects the relationship between age and cholesterol. Neither group shows a negative trend, however, so the direction of the association is not reversed, as a strict Simpson’s Paradox would require.
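The visual comparison of marginal and partial associations can be backed up by comparing least-squares slopes. The sketch below fits one line to all points and one line per group; `slopes_by_group` is a hypothetical helper, and the toy data are constructed on purpose to show a full reversal, so they say nothing about the heart data.

```r
# Sketch: slope of y on x for the whole sample and within each group.
# `slopes_by_group` is a hypothetical helper name.
slopes_by_group <- function(df, x, y, group) {
  fit_slope <- function(d) unname(coef(lm(d[[y]] ~ d[[x]]))[2])
  c(overall = fit_slope(df),
    sapply(split(df, df[[group]]), fit_slope))
}

# Made-up data built to show a reversal (Simpson's Paradox):
# within each group y falls with x, but group B sits higher on both axes
toy <- data.frame(
  x = 1:10,
  y = c(5, 4, 3, 2, 1, 10, 9, 8, 7, 6),
  g = rep(c("A", "B"), each = 5)
)
s <- slopes_by_group(toy, "x", "y", "g")
print(s)

# With the heart data loaded as above, the analogous call would be:
# slopes_by_group(data, "age", "chol", "sex")
```

A reversal would show up as group slopes whose sign is opposite to the overall slope. Group slopes that merely differ in size, with the same sign as the overall slope, point to the association varying by group instead.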