1 Instructions

  • This is an individual assignment. Submit your .Rmd and “knitted” .html files through Collab.

  • Upload your HTML file to RPubs and include the link when you submit your files on Collab.

  • Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.

2 Part 1

  • Use the occupational experience variable (“oexp”) of the income_example dataset and plot

    • a histogram,
    • a kernel density estimate,
    • a boxplot of “oexp”,
    • a set of boxplots showing the distribution of “oexp” by sex crossed with occupational status (“occ”).
  • You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.

  • Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).

[“Play” with the hist() and density() functions; for instance, choose a different number of bins or different break points in hist(), or different bandwidths using the adjust argument in density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also, add useful axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. See the help page ?par for a description of many more plotting parameters.]
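
As a rough illustration of the hint above, the sketch below compares two bin choices for hist() and two bandwidth multipliers for density(). The vector oexp_sim is simulated and only stands in for the “oexp” variable; it is an assumption, not the assignment data.

```r
# Simulated stand-in for the oexp variable (an assumption, not the real data)
set.seed(1)
oexp_sim <- rexp(500, rate = 1 / 15)

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid; remember the previous settings

# Same data, two bin choices: a suggested bin count vs. explicit break points
h_coarse <- hist(oexp_sim, breaks = 10,
                 main = "breaks = 10", xlab = "Experience (years)")
h_fine <- hist(oexp_sim, breaks = seq(0, ceiling(max(oexp_sim)) + 2, by = 2),
               main = "2-year bins", xlab = "Experience (years)")

# Same data, two bandwidths: adjust multiplies the default bandwidth
d_rough  <- density(oexp_sim, adjust = 0.5)
d_smooth <- density(oexp_sim, adjust = 2)
plot(d_rough,  main = "adjust = 0.5", xlab = "Experience (years)")
plot(d_smooth, main = "adjust = 2",   xlab = "Experience (years)")

par(old_par)                      # restore the previous layout
```

Note that a single number passed to breaks is only a suggestion to hist(), whereas a vector of break points is used exactly as given.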

[This is what the plots could look like – but you have to do it with your group ;-)]

# add your code
income_data <- read.table('income_exmpl.dat', header = TRUE, sep = "\t")

par(mfrow = c(2, 2))
# Histogram
hist(income_data$oexp, breaks = 25, col = "white",
     main = "Histogram of Occ. Experience", xlab = "Occ. Experience (in years)")
# Density
plot(density(income_data$oexp, adjust = 0.5),
     main = "Density Estimate of Occ. Experience", xlab = "Occ. Experience (in years)")
# Boxplot
boxplot(income_data$oexp, horizontal = TRUE, col = "white",
        main = "Boxplot of Occ. Experience", xlab = "Occ. Experience (in years)")
# Group of boxplots (occ varies fastest within sex; the two colours are recycled across boxes)
boxplot(oexp ~ occ + sex, data = income_data, col = c('blue', 'red'),
        horizontal = TRUE, las = 1, main = "Occ. Experience by Sex and Occ. Status",
        xlab = "Occ. Experience (in years)", ylab = "")

  • describe your plots. The distribution of occupational experience is slightly right-skewed. This makes sense because it is more likely for someone to be new to a profession than to have worked in it for a long time; people retire or switch jobs. In the histogram, the 0-2 year bin has about twice as many observations as any other bin, and the density plot tells the same story. The boxplots show the quartiles well. Since the centre line of a boxplot marks the median rather than the mean, the median looks to be about 19 years of experience.
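
A visual reading like the one above can be cross-checked with a few summary numbers. The sketch below uses simulated right-skewed values (oexp_sim is an assumption, not the income data) and computes the quartiles, the mean and median, and a simple moment-based skewness.

```r
# Simulated right-skewed stand-in for oexp (an assumption, not the real data)
set.seed(2)
oexp_sim <- rgamma(400, shape = 2, scale = 10)

# Minimum, quartiles and maximum: the numbers a boxplot is built from
five_num <- quantile(oexp_sim, probs = c(0, 0.25, 0.5, 0.75, 1))

# In a right-skewed sample the mean is usually pulled above the median
centre <- c(mean = mean(oexp_sim), median = median(oexp_sim))

# Moment-based sample skewness: positive values indicate a longer right tail
skew <- mean((oexp_sim - mean(oexp_sim))^3) / sd(oexp_sim)^3

print(round(five_num, 1))
print(round(centre, 1))
print(round(skew, 2))
```

With the real data, replacing oexp_sim by income_data$oexp would give the corresponding figures.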

3 Part 2

  • Download the SCS Data set from Collab (there you also find a separate file containing a brief description of variables). Then investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”) by plotting the data and the path of means.
  1. Produce a scatterplot of “mathpre” against “mars”. You might consider using jitter() or alpha() to avoid overlapping points.
library(foreign)
scs <- read.spss("SCS_QE.sav", to.data.frame = TRUE)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav", to.data.frame = TRUE): Undeclared level(s) 0
## added in variable: married
plot(jitter(scs$mathpre), jitter(scs$mars), xlab = 'Math Achievement Score', ylab = 'Math Anxiety Score', cex = .4, pch = 16, main = "Scatterplot of Math Achievement Score vs. Math Anxiety Score")
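
The task also mentions transparency and the path of means. alpha() comes from ggplot2/scales, so a base-graphics alternative is adjustcolor(). The sketch below is one possible way to combine jitter, semi-transparent points, and a path of means. The vectors x and y are simulated and purely illustrative; they are not the SCS variables.

```r
# Simulated discrete scores (an assumption, not the SCS data)
set.seed(3)
n <- 300
x <- sample(1:10, n, replace = TRUE)        # few distinct values, so points overlap
y <- round(60 - 2 * x + rnorm(n, sd = 6))

# Base-graphics transparency: adjustcolor() adds an alpha channel to a colour
pt_col <- adjustcolor("black", alpha.f = 0.3)
plot(jitter(x), jitter(y), pch = 16, cex = 0.6, col = pt_col,
     xlab = "x (simulated score)", ylab = "y (simulated score)",
     main = "Jittered points with path of means")

# Path of means: the mean of y at each distinct x value, joined by a line
path <- tapply(y, x, mean)
lines(as.numeric(names(path)), path, type = "b", col = "red", lwd = 2)
```

For a continuous x variable, the means would first have to be computed within bins, for example via cut().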

  2. Draw a conditioning plot for female and male students (variable “male”). Include “| male” in the formula (the first argument) to create a conditioning plot.
# mars on the y-axis and mathpre on the x-axis, so the axis labels match the formula and the scatterplot
coplot(mars ~ mathpre | male, data = scs, xlab = 'Math Achievement Score', ylab = 'Math Anxiety Score')
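
In a coplot() formula y ~ x | g, the left-hand side goes on the vertical axis and the first right-hand term on the horizontal axis, which is easy to mix up when labelling. The sketch below shows this on simulated data (the data frame d and its columns are hypothetical) and adds a smoother to each panel via panel.smooth.

```r
# Simulated two-group data (an assumption, not the SCS data)
set.seed(4)
n <- 200
d <- data.frame(
  group = factor(sample(c("a", "b"), n, replace = TRUE)),
  x = runif(n, 0, 10)
)
# Different slopes per group, so the two panels should look different
d$y <- ifelse(d$group == "a", 5 - 0.8 * d$x, 1 + 0.1 * d$x) + rnorm(n)

# y ~ x | group: y on the vertical axis, x on the horizontal axis
coplot(y ~ x | group, data = d,
       panel = panel.smooth,              # points plus a lowess curve in each panel
       xlab = c("x (horizontal axis)", "Given: group"),
       ylab = "y (vertical axis)",
       pch = 16, cex = 0.5)
```

The second element of xlab labels the conditioning strip; if only one label is given, coplot() supplies a default for it.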

  3. Describe in words the relation between math scores and math anxiety. Do you find evidence of Simpson’s Paradox?

describe your plots. Looking at the base scatterplot, a high math anxiety score appears to go with a low math achievement score. However, broken down by gender, this trend seems to hold only for male students; female students’ scores look fairly evenly distributed. This is at most a partial example of Simpson’s Paradox: the grouped data tell a different story than the ungrouped data, but the direction of the association does not reverse in either group.
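
One way to make the Simpson’s Paradox question more concrete is to compare the pooled correlation with the correlations within each group; a full reversal shows up as a sign flip. The sketch below constructs simulated data in which such a reversal occurs by design. None of the numbers relate to the SCS data.

```r
# Simulated data with a built-in reversal (an assumption, not the SCS data)
set.seed(5)
n <- 150
g <- rep(c("low", "high"), each = n)
x <- c(rnorm(n, mean = 2), rnorm(n, mean = 6))
offset <- ifelse(g == "low", 8, -2)        # group-specific intercepts
y <- offset + x + rnorm(2 * n, sd = 0.8)   # within each group, y rises with x
d <- data.frame(g, x, y)

# Pooled correlation ignores the grouping variable
r_pooled <- cor(d$x, d$y)

# Within-group correlations, one per level of g
r_within <- sapply(split(d, d$g), function(s) cor(s$x, s$y))

# A sign flip between pooled and within-group correlations is the reversal
reversal <- all(sign(r_within) != sign(r_pooled))

print(round(r_pooled, 2))
print(round(r_within, 2))
print(reversal)
```

If the within-group correlations merely differ in strength but share the sign of the pooled one, the grouping changes the story without producing the paradox in its strict sense.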

4 Part 3

  • Use a dataset that is available in a data repository (e.g., Kaggle).

  • Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables, etc…)

    • describe your data.

The data set I chose is a car-buying data set downloaded from Kaggle. It contains variables such as brand, model, price, number of buyers, and fuel type. I chose to compare price with the number of sales and then condition on fuel type.

  • Re-do Part 2, i.e.,
    • produce a scatterplot of “A” against “B”. You might consider using jitter() or alpha() to avoid overlapping points.
    • draw a scatterplot conditioning on variable “C”. Include “| C” in the formula (the first argument) to create a conditioning plot.
    • describe in words the relation between “A” and “B.” Do you find evidence of Simpson’s Paradox?
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
cars <- read.csv("CarBuyers.csv")
cars <- cars %>% filter(Fuel != "automatic")
cars <- cars %>% mutate(Total = as.numeric(Total)) %>% na.omit()
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
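
The warning above means that some entries of Total could not be parsed as numbers and became NA, after which na.omit() dropped those rows. Before discarding them, it can help to see which raw values failed. The sketch below uses a hypothetical vector raw_total; the actual contents of CarBuyers.csv are not known here.

```r
# Hypothetical raw strings (an assumption, not taken from CarBuyers.csv)
raw_total <- c("1200", "850", "n/a", "3,400")

# as.numeric() returns NA (with a warning) for anything it cannot parse
parsed <- suppressWarnings(as.numeric(raw_total))

# List the offending entries before dropping them with na.omit()
bad <- raw_total[is.na(parsed)]
print(bad)

# Some failures are repairable, e.g. thousands separators
repaired <- suppressWarnings(as.numeric(gsub(",", "", raw_total)))
print(repaired)
```

Whether such entries should be repaired or dropped depends on the data; the point is only to check what the coercion removed.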
plot(jitter(cars$Price), jitter(cars$Total), xlab = 'Average Sale Price (Thousands)', ylab = 'Total Number of Sales', cex = .4, pch = 16, main = "Scatterplot of Sale Price vs. Total Sales")

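# Total on the y-axis and Price on the x-axis, so the axis labels match the formula and the scatterplot;
# factor() makes sure Fuel is treated as a categorical conditioning variable
coplot(Total ~ Price | factor(Fuel), data = cars, xlab = 'Average Sale Price', ylab = 'Total Number of Sales (Thousands)')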

There seems to be a fairly clear negative relationship between sale price and total number of sales: the higher the price, the fewer cars of that type are sold. Conditioning on fuel type doesn’t seem to change this, so there is no evidence of Simpson’s Paradox.

  • You will present the results of Part 3 to your neighbor(s) in class on Jan. 7 (Mon).