1 Instructions

  • This is an individual assignment. Submit your .Rmd and “knitted”.html files through Collab.

  • Upload your html file to RPubs and include the link when you submit your files on Collab.

  • Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.

2 Part 1

  • Use the occupational experience variable (“oexp”) of the income_example dataset and plot

    • a histogram,
    • a kernel density estimate,
    • a boxplot of “oexp”,
    • a set of boxplots showing the distribution of “oexp” by sex crossed with occupational status (“occ”).
  • You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.

  • Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).

[“Play” with the hist() and density() functions, for instance by choosing a different number of bins or different break points in hist(), or different bandwidths via the adjust argument of density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also add informative axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. See ?par for descriptions of many more plotting parameters.]
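The effect of the breaks and adjust arguments mentioned above can be sketched as follows. This is only an illustration under stated assumptions: oexp_sim is a simulated stand-in for the “oexp” variable, and the specific bin widths and adjust values are arbitrary choices, not recommendations.

```r
# Simulated stand-in for incex$oexp (hypothetical values, not the assignment data)
set.seed(1)
oexp_sim <- pmax(0, round(rnorm(500, mean = 20, sd = 12)))

# Explicit break points: 5-year bins that cover the whole range
breaks_5yr <- seq(0, max(oexp_sim) + 5, by = 5)

par(mfrow = c(2, 2))

# A single number for breaks is only a suggestion; R picks "pretty" cut points
h_coarse <- hist(oexp_sim, breaks = 5,
                 main = "breaks = 5", xlab = "Experience (in years)")
# A vector of break points is used exactly as given
h_fine <- hist(oexp_sim, breaks = breaks_5yr,
               main = "5-year bins", xlab = "Experience (in years)")

# adjust rescales the default bandwidth: < 1 is wigglier, > 1 is smoother
d_narrow <- density(oexp_sim, adjust = 0.5)
d_wide <- density(oexp_sim, adjust = 2)
plot(d_narrow, main = "adjust = 0.5", xlab = "Experience (in years)")
plot(d_wide, main = "adjust = 2", xlab = "Experience (in years)")
```

Comparing the four panels side by side makes it easier to pick the one version of each plot to keep in the report.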

[This is what the plots could look like, but you have to do it with your group ;-)]

# Read the tab-delimited income data
incex <- read.table("income_exmpl.dat", header = TRUE, sep = "\t")

# Arrange the four plots in a 2 x 2 grid
par(mfrow = c(2, 2))

hist(incex$oexp, 
     main = "Histogram of Occ. Experience",
     xlab = "Occ. Experience (in years)",
     col = "white",
     breaks = 25)

boxplot(incex$oexp,
        main = "Boxplot of Occ. Experience",
        xlab = "Occ. Experience (in years)",
        horizontal = TRUE,
        col = "white")

plot(density(incex$oexp, adjust = 0.5),
     main = "Density Estimate of Occ. Experience",
     xlab = "Occ. Experience (in years)")


# In sex + occ the first factor (sex) varies fastest, so the two colours
# alternate by sex within each occupational status.
# Box labels are taken from the factor levels rather than typed by hand,
# so they cannot get out of step with the data.
boxplot(oexp ~ sex + occ, data = incex,
        col = c("blue", "red"),
        horizontal = TRUE,
        main = "Occ. Experience by Sex and Occ. Status",
        xlab = "Occ. Experience (in years)",
        ylab = "",
        las = 1)

  • The histogram shows that many individuals have no occupational experience at all. Beyond zero, the frequencies are similar across bins and taper off after about 40 years.

  • The boxplot shows a median occupational experience just under 20 years and no outliers. Read off the graph, the IQR runs from roughly 9 to 29 years. The minimum is 0 years (no occupational experience) and the maximum is 48 years.

  • The density plot has its highest peak at zero years of experience. The density stays fairly steady from about 2 to 30 years and declines sharply around 40 years. This is to be expected, since working lives rarely extend much beyond 40 years.

  • The boxplots by sex and occupational status compare work experience across groups. Median occupational experience is higher for females than for males in the high and low occupational status groups.
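The quartiles, extremes, and outliers described above were read off the graph, but they can also be computed directly. The following is a minimal sketch, assuming a small made-up vector x in place of the “oexp” variable; fivenum() and boxplot.stats() are base R functions.

```r
# Hypothetical experience values in years (not the assignment data)
x <- c(0, 2, 7, 11, 18, 24, 31, 36, 45)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)

# boxplot.stats() returns the numbers that boxplot() draws;
# $stats holds the whisker ends, hinges and median,
# $out holds the points beyond the whiskers (empty if there are no outliers)
stats <- boxplot.stats(x)

print(five)
print(stats$out)
```

With the real data, the same calls would be applied to incex$oexp.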

3 Part 2

  • Download the SCS Data set from Collab (there you also find a separate file containing a brief description of variables). Then investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”) by plotting the data and the path of means.
  1. Produce a scatterplot between “mathpre” and “mars”. You might consider using jitter() or alpha() to avoid overlapping points.
library(foreign)
data <- read.spss("SCS_QE.sav", to.data.frame = TRUE)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav", to.data.frame = TRUE): Undeclared level(s) 0
## added in variable: married
mathpre <- data$mathpre
mars <- data$mars

plot(jitter(mathpre, factor = 5), 
     jitter(mars, factor = 5), 
     xlab = "Pre-test Score",
     ylab = "Anxiety",
     col=rgb(red = .5, green = 0, blue = .5, alpha = .8), 
     main= "Scatter of Pre-test Score Measured Against Anxiety")
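The task also asks for the path of means, which the scatterplot alone does not show. One possible way to overlay it is sketched below. The variables score and anxiety are simulated stand-ins for “mathpre” and “mars”, and the sketch assumes the x variable takes a limited number of distinct values, so that a mean per value is meaningful.

```r
# Simulated stand-ins for the SCS variables (hypothetical values)
set.seed(2)
score <- sample(1:10, size = 300, replace = TRUE)   # integer-valued pre-test score
anxiety <- 70 - 3 * score + rnorm(300, sd = 8)      # anxiety falls as score rises

# Path of means: mean anxiety at each distinct score value
path <- tapply(anxiety, score, mean)
path_x <- as.numeric(names(path))

# Jittered scatterplot with the path of means drawn on top
plot(jitter(score), anxiety,
     xlab = "Pre-test Score",
     ylab = "Anxiety",
     col = rgb(red = .5, green = 0, blue = .5, alpha = .5),
     main = "Scatterplot with Path of Means")
lines(path_x, path, lwd = 2, col = "red")
```

If the x variable were continuous, it would have to be binned first (for example with cut()) before taking means.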

  2. Draw a conditioning plot for female and male students (variable “male”). Include “| male” in your first argument to create a conditioning plot.
library(lattice)
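# Variables in the formula are looked up in `data`, so no data$ prefix is needed;
# factor() makes sure "male" is treated as a grouping variable
xyplot(mars ~ mathpre | factor(male), data = data, cex = .5,
       xlab = "Pre-test Score",
       ylab = "Anxiety")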

  3. Describe in words the relation between math scores and math anxiety. Do you find evidence of Simpson’s Paradox?

In general, math anxiety tends to decrease as the pre-test score increases. Females tend to show higher anxiety than males, and males tend to score a few points higher than females, which could partly explain the difference in anxiety. Splitting the data by sex shows that high anxiety levels occur predominantly among females. A difference in levels between the groups is not by itself Simpson’s Paradox, however: the paradox requires the relation within the groups to reverse (or vanish) relative to the pooled relation, so the evidence depends on whether the negative trend persists in both panels of the conditioning plot.
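To make the idea concrete, the sketch below builds a small simulated data set in which Simpson’s Paradox does occur: the correlation is negative within each group but positive when the groups are pooled. The groups A and B and all values are invented for illustration and say nothing about the SCS data.

```r
# Simulated example (hypothetical groups "A" and "B")
set.seed(3)
n <- 100
g <- rep(c("A", "B"), each = n)
x <- c(rnorm(n, mean = 10), rnorm(n, mean = 15))

# Within each group y falls as x rises, but group B sits higher on both x and y
y <- ifelse(g == "A", 30, 45) - x + rnorm(2 * n, sd = 0.5)
d <- data.frame(g, x, y)

# Pooled correlation versus the correlation within each group
r_pooled <- cor(d$x, d$y)
r_by_group <- sapply(split(d, d$g), function(s) cor(s$x, s$y))

print(r_pooled)
print(r_by_group)
```

The same comparison, applied to “mathpre” and “mars” split by “male”, would show whether the sign of the relation changes between the pooled data and the groups.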

4 Part 3

  • Use a dataset that is available in data repositories (e.g., kaggle)

  • Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables, etc…)

    • The data are IMDB ratings of original films hosted on the Netflix streaming service. The variables are Genre, Premiere (date), Runtime, IMDB.Score, and Language. The data cover films that premiered from 2017 to 2021. I will explore the relationship between Runtime and IMDB.Score.
    • Source: https://www.kaggle.com/luiscorter/netflix-original-films-imdb-scores
  • Re-do Part 2, i.e.,
    • produce a scatterplot between “A” and “B”. You might consider using jitter() or alpha() to avoid overlapping points.
    • draw a scatterplot conditioning on variable “C”. Include “| C” in your first argument to create a conditioning plot.
    • describe in words the relation between “A” and “B.” Do you find evidence of Simpson’s Paradox?
netflix <- read.csv("NetflixOriginals.csv")

runtime <- netflix$Runtime
score <- netflix$IMDB.Score

plot(jitter(runtime, factor = 5), 
     jitter(score, factor = 5), 
     xlab = "Runtime (mins)",
     ylab = "IMDB score",
     col=rgb(red = .5, green = 0, blue = .5, alpha = .8), 
     main = "Scatter of IMDB Score Against Runtime")

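# Condition on a labelled two-level factor instead of a raw TRUE/FALSE comparison
# so that the panel strips are readable
xyplot(Runtime ~ IMDB.Score | factor(Genre == "Documentary",
                                     levels = c(FALSE, TRUE),
                                     labels = c("Other genres", "Documentary")),
       data = netflix, cex = .5,
       xlab = "IMDB Score",
       ylab = "Runtime")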

  • The plots above show that most films have a runtime of about 100 minutes. The highest IMDB scores also appear around 100 minutes, which could indicate that this length suits viewers best, although it may simply reflect that most films cluster there.

  • Documentaries are, on average, rated much lower than other genres regardless of runtime.

  • You will present the results of Part 3 to your neighbor(s) in class on Jan. 7 (Mon).
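A genre comparison like the one above can be backed by group means in addition to the plots. The following is a rough sketch under stated assumptions: films is a tiny made-up data frame that only borrows the column names of the Kaggle file, and its values say nothing about the real ratings.

```r
# Hypothetical film data with the same column names as the Kaggle file
films <- data.frame(
  Genre = c("Documentary", "Drama", "Documentary", "Comedy", "Thriller", "Documentary"),
  Runtime = c(95, 110, 80, 100, 105, 90),
  IMDB.Score = c(6.9, 6.1, 7.2, 5.8, 6.0, 6.5)
)

# Two-level factor with readable labels (FALSE -> "Other", TRUE -> "Documentary")
films$Type <- factor(films$Genre == "Documentary",
                     levels = c(FALSE, TRUE),
                     labels = c("Other", "Documentary"))

# Mean IMDB score and number of films per group
means <- tapply(films$IMDB.Score, films$Type, mean)
counts <- table(films$Type)

print(means)
print(counts)
```

Reporting the group sizes next to the means helps judge how much weight a difference between genres can carry.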