1 Instructions

  • This is an individual assignment. Submit your .Rmd and “knitted”.html files through Collab.

  • Upload your .html file to RPubs and include the link when you submit your files on Collab.

  • Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.

2 Part 1

  • Use the occupational experience variable (“oexp”) of the income_example dataset and plot

    • a histogram,
    • a kernel density estimate,
    • a boxplot of “oexp”,
    • a set of boxplots showing the distribution of “oexp” by sex crossed with occupational status (“occ”).
  • You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.

  • Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).

[“Play” with the hist() and density() functions; for instance, by choosing a different number of bins or different break points for the hist() function, or different bandwidths using the adjust argument in density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also, add useful axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. See ?par for descriptions of many more plotting parameters.]

[This is what the plots could look like, but you have to do it with your group ;-)]

#file.choose()
data <- read.table("/Users/michaelvaden/Downloads/income_exmpl.dat", header = TRUE, sep = "\t")

par(mfrow = c(2, 2))

# Histogram of occupational experience
hist(data$oexp, xlab = "Occupational Experience (years)", main = "Histogram of Occupational Experience", col = "darkblue", border = "orange")

# Kernel density estimate
plot(density(data$oexp), main = "Density Estimate of Occupational Experience", col = "orange", xlab = "Occupational Experience (years)")

# Single boxplot of all observations
boxplot(data$oexp, horizontal = TRUE, main = "Boxplot of Occupational Experience", col = "orange", border = "darkblue", xlab = "Occupational Experience (years)")

# One boxplot per combination of sex and occupational status
boxplot(data$oexp ~ data$sex + data$occ, horizontal = TRUE, main = "Occ. Experience by Sex and Occ. Status", col = rep(c("orange", "darkblue"), each = 3), border = rep(c("darkblue", "orange"), each = 3), xlab = "Occupational Experience (years)", ylab = "", las = 1)
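
The hint about bins and bandwidths can be explored along the following lines before settling on the final plots. This is only a minimal sketch: `oexp_demo` is a simulated, hypothetical stand-in for `data$oexp`, and the break points and `adjust` values are arbitrary choices for illustration.

```r
# Simulated stand-in for data$oexp (hypothetical values, not the assignment data)
set.seed(1)
oexp_demo <- pmin(rexp(500, rate = 1 / 18), 50)

par(mfrow = c(2, 2))

# Default binning versus explicit 5-year break points
hist(oexp_demo, main = "Default breaks",
     xlab = "Occupational Experience (years)")
hist(oexp_demo, breaks = seq(0, 50, by = 5), main = "5-year bins",
     xlab = "Occupational Experience (years)")

# adjust multiplies the default bandwidth:
# values below 1 give a wigglier curve, values above 1 a smoother one
d_narrow <- density(oexp_demo, adjust = 0.5)
d_wide   <- density(oexp_demo, adjust = 2)
plot(d_narrow, main = "Density, adjust = 0.5",
     xlab = "Occupational Experience (years)")
plot(d_wide, main = "Density, adjust = 2",
     xlab = "Occupational Experience (years)")

par(mfrow = c(1, 1))
```

Comparing the panels side by side makes it easier to judge whether a feature such as a second hump is stable or depends on one particular bandwidth.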

  • describe your plots.

The histogram of occupational experience is right-skewed, with the modal bin at 0-5 years. Frequencies are similar across the 5-35 year range, after which they decrease.

The density estimate of occupational experience appears slightly right-skewed, with two humps. The density peaks at around 0.023, at roughly 8 years of experience, and is similar across the range of 10 to 30 years.

The boxplot of occupational experience shows that the data are slightly right-skewed, with a range of approximately 50 years. The 25th percentile is approximately 9 years, the median approximately 19 years, and the 75th percentile about 30 years.

The boxplots of occupational experience by sex and occupational status show that the smallest range is for males with low occupational status. This group also has the lowest median and 75th percentile. All of the groups are slightly right-skewed, with the exception of males with high occupational status, whose boxplot is slightly left-skewed and has the highest values across the five-number summary.
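
Values read off a boxplot by eye, such as the quartiles above, can be checked numerically. The sketch below assumes simulated stand-in vectors (`oexp_demo`, `sex_demo`, `occ_demo` are hypothetical names and values, including the level labels), so its numbers will not match the assignment data.

```r
# Simulated stand-ins for data$oexp, data$sex and data$occ (hypothetical)
set.seed(2)
oexp_demo <- round(runif(300, min = 0, max = 50))
sex_demo  <- factor(sample(c("female", "male"), 300, replace = TRUE))
occ_demo  <- factor(sample(c("low", "medium", "high"), 300, replace = TRUE))

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(oexp_demo)

# The five values boxplot() draws: whisker ends, hinges and median
drawn <- boxplot.stats(oexp_demo)$stats

# Quartiles and range for comparison with the written description
quartiles <- quantile(oexp_demo, probs = c(0.25, 0.5, 0.75))
value_range <- diff(range(oexp_demo))

# Median per sex-by-status group, matching the grouped boxplots
group_medians <- tapply(oexp_demo, interaction(sex_demo, occ_demo), median)

print(five)
print(quartiles)
print(value_range)
print(group_medians)
```

Note that the hinges used by `boxplot()` can differ slightly from the quartiles returned by `quantile()`, because the two use different interpolation rules.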

3 Part 2

  • Download the SCS Data set from Collab (there you also find a separate file containing a brief description of variables). Then investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”) by plotting the data and the path of means.
  1. Produce a scatterplot between “mathpre” and “mars”. You might consider using jitter() or alpha() for avoiding overlapping points.
#file.choose()
library(foreign)

data <- read.spss("/Users/michaelvaden/Downloads/SCS_QE.sav", to.data.frame=TRUE)
## re-encoding from CP1252
## Warning in read.spss("/Users/michaelvaden/Downloads/SCS_QE.sav", to.data.frame =
## TRUE): Undeclared level(s) 0 added in variable: married
plot(jitter(data$mars, factor = 3), jitter(data$mathpre, factor = 3), xlab = "Math Anxiety Score", ylab = "Math Achievement Score")

  2. Draw a conditioning plot for female and male students (variable “male”). Include “| male” in your first argument to create a conditioning plot.
coplot(mathpre ~ mars | male, data = data, xlab = "Math Anxiety Score", ylab = "Math Achievement Score")

  3. Describe in words the relation between math scores and math anxiety. Do you find evidence of Simpson’s Paradox?

describe your plots.

There appears to be a slight negative linear relationship between math anxiety and math scores. Simpson’s Paradox is the phenomenon in which a trend that appears within several groups of data disappears or reverses when the groups are combined. If we examine the plots split up by sex, we do not find much evidence of Simpson’s Paradox. Although the sample of females is larger and covers a greater range, the slight negative linear relationship remains apparent in both plots. When we re-examine the combined plot, there is no notable difference in trend.
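
The Part 2 prompt also asks for the path of means, which the scatterplot above does not show. One possible way to add it is sketched below, assuming simulated stand-in data (`mars_demo` and `mathpre_demo` are hypothetical) and an arbitrary choice of eight equal-width bins; it is not necessarily the method intended in the course.

```r
# Simulated stand-ins for data$mars and data$mathpre (hypothetical values)
set.seed(3)
n <- 200
mars_demo    <- round(rnorm(n, mean = 55, sd = 10))
mathpre_demo <- round(8 - 0.05 * mars_demo + rnorm(n, sd = 2))

plot(jitter(mars_demo, factor = 3), jitter(mathpre_demo, factor = 3),
     xlab = "Math Anxiety Score", ylab = "Math Achievement Score")

# Bin the anxiety score, then average both variables within each bin
bins   <- cut(mars_demo, breaks = 8)
mean_x <- tapply(mars_demo, bins, mean)
mean_y <- tapply(mathpre_demo, bins, mean)

# Empty bins yield NA; drop them before connecting the means
ok <- !is.na(mean_y)
lines(mean_x[ok], mean_y[ok], type = "b", col = "red", lwd = 2, pch = 19)
```

With the real data, the same steps would use `data$mars` and `data$mathpre` in place of the simulated vectors. A downward-sloping path would support the description of a negative relationship.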

4 Part 3

  • Use a dataset that is available in data repositories (e.g., kaggle)

  • Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables, etc…)

    • describe your data.

    This dataset catalogues the average life expectancy at birth for each year from 1900 to 2013, overall and for each sex. All races are included, and the dataset also includes mortality rates. Populations in the study are based on a standard population and the census, with post-census estimates used for non-census years.

    dataset kaggle link

  • Re-do Part 2, i.e.,

    • produce a scatterplot between “A” and “B”. You might consider using jitter() or alpha() for avoiding overlapping points.
# file.choose()
lifedata <- read.csv("/Users/michaelvaden/Downloads/NCHS_-_Age-adjusted_death_rates_and_life-expectancy_at_birth___All_Races__Both_Sexes___United_States__1900-2013.csv")
# View(lifedata)

plot(jitter(lifedata$Year, factor = 2), jitter(lifedata$Average.Life.Expectancy, factor = 2), xlab = "Year", ylab = "Average Life Expectancy")

  • draw a scatterplot conditioning on variable “C”. Include “| C” in your first argument to create a conditioning plot.
coplot(Average.Life.Expectancy ~ Year | Sex, data = lifedata, xlab = "Year", ylab = "Average Life Expectancy")

  • describe in words the relation between “A” and “B.” Do you find evidence of Simpson’s Paradox?

There appears to be a strong positive relationship between Year and Average Life Expectancy. There is more variability before 1950 and a very strong positive relationship after 1950. If we examine the plots split up by sex, we do not find much evidence of Simpson’s Paradox. The plots have similar sample sizes, and both the male and female plots are almost identical in trend to the plot for both sexes. When we compare the combined plot with the two plots separated by sex, there is no notable difference in trend.

  • You will present results of Part 3 to your neighbor(s) in class on Jan. 7 (Mon).
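
For contrast, the sketch below shows what evidence of Simpson’s Paradox would look like, by comparing the pooled slope with the slope within each group. The data are simulated and purely hypothetical (`demo`, `x`, `y`, and `group` are illustration names), constructed so that the trend reverses when the groups are combined.

```r
# Simulated data (hypothetical): positive trend within each group,
# but the group with larger x values has a lower overall level of y
set.seed(4)
n <- 100
group <- factor(rep(c("A", "B"), each = n))
x <- c(rnorm(n, mean = 2), rnorm(n, mean = 6))
y <- ifelse(group == "A", 10, 2) + x + rnorm(2 * n, sd = 0.5)
demo <- data.frame(x = x, y = y, group = group)

# Slope from all observations pooled together
pooled_slope <- coef(lm(y ~ x, data = demo))[["x"]]

# Slope fitted separately within each group
group_slopes <- sapply(split(demo, demo$group),
                       function(d) coef(lm(y ~ x, data = d))[["x"]])

print(pooled_slope)   # negative in this construction
print(group_slopes)   # positive within both groups

# The conditioning plot shows the within-group trends
coplot(y ~ x | group, data = demo)
```

Applied to `lifedata`, the analogous comparison would fit `Average.Life.Expectancy ~ Year` once on all rows and once per level of `Sex`. Slopes with the same sign in every fit would be consistent with the conclusion that there is no evidence of the paradox.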