1 Instructions

  • This is an individual assignment. Submit your .Rmd and “knitted”.html files through Collab.

  • Upload your html file on RPubs and include the link when you submit your submission files on Collab.

  • Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.

2 Part 1

  • Use the occupational experience variable (“oexp”) of the income_example dataset and plot

    • a histogram,
    • a kernel density estimate,
    • a boxplot of “oexp”,
    • a set of boxplots showing the distribution of “oexp” by sex crossed with occupational status (“occ”).
  • You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.

  • Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).

[“Play” with the hist() and density() functions; for instance, by choosing a different number of bins or different break points for the hist() function, or different bandwidths using the adjust argument in density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also, add useful axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. See the help page ?par for descriptions of many more plotting parameters.]
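As a rough illustration of the hint above, the following sketch varies the break points in hist() and the adjust argument in density(). The vector oexp_sim is simulated stand-in data, not the assignment's “oexp” variable, and the 5-year bin width and the two adjust values are arbitrary choices.

```r
# Simulated stand-in for income$oexp (hypothetical values, not the assignment data)
set.seed(1)
oexp_sim <- round(runif(500, min = 0, max = 48))

par(mfrow = c(1, 2))

# Explicit break points: 5-year-wide bins from 0 to 50
h <- hist(
  oexp_sim,
  breaks = seq(0, 50, by = 5),
  xlab = 'Experience (years)',
  ylab = 'Frequency',
  main = 'Histogram, 5-year bins'
)

# adjust < 1 narrows the default bandwidth; adjust > 1 widens it
d_narrow <- density(oexp_sim, adjust = 0.5)
d_wide   <- density(oexp_sim, adjust = 2)

plot(
  d_narrow,
  xlab = 'Experience (years)',
  ylab = 'Density',
  main = 'Density, two bandwidths'
)
lines(d_wide, lty = 2)  # dashed line: the smoother, wider-bandwidth estimate
```

Comparing the two density curves side by side is one way to decide which bandwidth is most informative before settling on a single plot.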

[That’s what the plots could look like – but you have to do it with your group;-)]

income <- read.table(
    'income_exmpl.dat',
    header=TRUE,
    sep='\t'
)
cat('Dimensions:', dim(income), '\n\n')
## Dimensions: 1922 6
cat('Columns:', names(income), '\n\n')
## Columns: sex age edu occ oexp income
summary(income)
##  sex           age          edu        occ           oexp          income    
##  f: 853   Min.   :18.00   high:419   high:627   Min.   : 0.0   Min.   : 704  
##  m:1069   1st Qu.:31.00   low :811   low :595   1st Qu.: 8.0   1st Qu.:1117  
##           Median :42.00   med.:692   med.:700   Median :19.0   Median :1304  
##           Mean   :41.69                         Mean   :19.5   Mean   :1313  
##           3rd Qu.:52.00                         3rd Qu.:30.0   3rd Qu.:1506  
##           Max.   :65.00                         Max.   :48.0   Max.   :2115
par(mfrow = c(2, 2))

# Histogram
hist(
  income$oexp,
  xlab='Experience',
  ylab='Frequency',
  col='blue',
  main='Experience Histogram'
)
# Kernel Density Estimate
plot(
  density(income$oexp), 
  xlab='Experience',
  ylab='Density',
  col='blue',
  main='Experience Density Plot'
)
# Boxplot
boxplot(
  income$oexp,
  col='blue',
  main='Experience Boxplot'
)
# Boxplot -- crossed by sex and occ
boxplot(
  income$oexp ~ income$sex + income$occ,
  main='Experience Boxplot by Sex and Occupation',
  col = c('blue', 'red'),
  xlab='~ Sex + Occupation',
  ylab='Experience'
)

describe your plots.
The histogram shows that occupational experience is slightly right-skewed, with the majority of subjects having fewer years of experience. The density plot reflects this as expected, although it looks somewhat closer to a normal distribution than the histogram does (the skew is still noticeable). The first boxplot indicates that the minimum, median, and maximum experience are roughly 0, 19, and 48 years, respectively, consistent with the summary output above; the first quartile is roughly 8 years and the third quartile is roughly 30 years. The second boxplot, by sex and occupational status, shows that for medium and high occupational status, males generally have more experience; the opposite is true for low occupational status.
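Values read off a boxplot can be cross-checked numerically. The sketch below uses a small invented data frame (toy, with made-up values) in place of the income data; the column names sex, occ, and oexp mirror the real ones, so the same calls should carry over to income.

```r
# Hypothetical stand-in for the income data (values invented for illustration)
toy <- data.frame(
  sex  = c('f', 'f', 'f', 'f', 'm', 'm', 'm', 'm'),
  occ  = c('low', 'low', 'high', 'high', 'low', 'low', 'high', 'high'),
  oexp = c(10, 20, 5, 15, 8, 12, 20, 30)
)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(toy$oexp)
print(five)

# Median experience for every sex x occupational-status cell
med <- tapply(toy$oexp, list(toy$sex, toy$occ), median)
print(med)
```

The rows of med are the sexes and the columns are the occupational levels, so comparing the two rows column by column gives the same comparison as the grouped boxplot.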

3 Part 2

  • Download the SCS Data set from Collab (there you also find a separate file containing a brief description of variables). Then investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”) by plotting the data and the path of means.
library(foreign)
scs <- read.spss(
  "SCS_QE.sav",
  to.data.frame=TRUE
)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav", to.data.frame = TRUE): Undeclared level(s) 0
## added in variable: married

(i) Produce a scatterplot between “mathpre” and “mars”. You might consider using jitter() or alpha() to avoid overlapping points.

plot(
  jitter(scs$mathpre, factor=3), 
  jitter(scs$mars, factor=3),
  cex=.4, 
  pch=16,
  xlab='Math Pre Achievement', 
  ylab='Math Anxiety', 
  main='Math Pre Achievement x Math Anxiety',
  col='blue'
)
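The task also asks for the path of means, which the scatterplot above does not draw. A minimal sketch of one way to add it, assuming the x variable takes a limited number of distinct values: compute the mean of y at each distinct x with tapply() and overlay it with lines(). The vectors x and y are invented stand-ins for “mathpre” and “mars”.

```r
# Hypothetical stand-in data: x plays the role of mathpre, y the role of mars
x <- c(1, 1, 2, 2, 3, 3)
y <- c(6, 4, 5, 3, 2, 2)

# Path of means: the mean of y at each distinct value of x
path <- tapply(y, x, mean)

plot(
  jitter(x),
  jitter(y),
  pch = 16,
  col = 'blue',
  xlab = 'x',
  ylab = 'y',
  main = 'Scatterplot with path of means'
)

# names(path) holds the distinct x values as text, so convert them back to numbers
lines(as.numeric(names(path)), path, col = 'red', lwd = 2)
```

With the real data, the same two steps would use scs$mathpre and scs$mars, provided neither contains missing values (otherwise mean would need na.rm = TRUE).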

(ii) Draw a conditioning plot for female and male students (variable “male”). Include “| male” in your first argument to create a conditioning plot.

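# Formula is y ~ x | condition: mars on the y-axis, mathpre on the x-axis (matches the labels below)
coplot(
  mars ~ mathpre | male, 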
  data=scs,
  cex=.6,
  xlab='Math Pre Achievement', 
  ylab='Math Anxiety', 
  col='blue'
)

(iii) Describe in words the relation between math scores and math anxiety. Do you find evidence of Simpson’s Paradox?

There is a clear negative relationship between math scores and math anxiety. In other words, students scored higher when they were less anxious (and vice versa). There does appear to be evidence of Simpson’s Paradox, in that this relationship appears in the combined data but disappears once the data are separated by sex (no apparent relationship within either group).
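A visual impression like this can be backed up by comparing the pooled correlation with the correlation inside each group. The sketch below does so on invented toy data built so that the pooled trend is negative while neither group shows any trend; it is not a result from the SCS data. Note that Simpson’s Paradox in the strict sense usually refers to a reversal of direction, whereas a vanishing trend, as described above, is the weaker variant.

```r
# Invented toy data: two groups, no x-y relationship inside either group
toy <- data.frame(
  g = rep(c('A', 'B'), each = 3),
  x = c(1, 2, 3, 4, 5, 6),
  y = c(5, 6, 5, 2, 3, 2)
)

# Correlation in the pooled data
overall <- cor(toy$x, toy$y)

# Correlation within each group
within <- sapply(split(toy, toy$g), function(s) cor(s$x, s$y))

print(overall)
print(within)
```

For the SCS data, the analogous calls would use mathpre, mars, and male in place of x, y, and g.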

4 Part 3

  • Use a dataset that is available in data repositories (e.g., kaggle)
gas <- read.csv(
    'gas.csv',
    header=TRUE
)
cat('Dimensions:', dim(gas), '\n\n')
## Dimensions: 117927 11
cat('Columns:', names(gas), '\n\n')
## Columns: X mark model generation_name year mileage vol_engine fuel city province price
summary(gas)
##        X                     mark           model       
##  Min.   :     0   audi         :12031   astra  :  3331  
##  1st Qu.: 29482   opel         :11914   seria-3:  2944  
##  Median : 58963   bmw          :11070   a4     :  2912  
##  Mean   : 58963   volkswagen   :10848   golf   :  2592  
##  3rd Qu.: 88444   ford         : 9664   a6     :  2496  
##  Max.   :117926   mercedes-benz: 7136   seria-5:  2464  
##                   (Other)      :55264   (Other):101188  
##           generation_name       year         mileage          vol_engine  
##                   :30085   Min.   :1945   Min.   :      0   Min.   :   0  
##  gen-8p-2003-2012 : 1567   1st Qu.:2009   1st Qu.:  67000   1st Qu.:1461  
##  gen-j-2009-2015  : 1376   Median :2013   Median : 146269   Median :1796  
##  gen-a-2008-2017  : 1216   Mean   :2013   Mean   : 140977   Mean   :1812  
##  gen-iii-2013     : 1184   3rd Qu.:2018   3rd Qu.: 203000   3rd Qu.:1995  
##  gen-e90-2005-2012: 1152   Max.   :2022   Max.   :2800000   Max.   :7600  
##  (Other)          :81347                                                  
##        fuel             city                province         price        
##  CNG     :   47   Warszawa: 7972   Mazowieckie  :22219   Min.   :    500  
##  Diesel  :48476   Łódź    : 3341   Śląskie      :16706   1st Qu.:  21000  
##  Electric:  885   Kraków  : 2936   Wielkopolskie:14016   Median :  41900  
##  Gasoline:61597   Wrocław : 2764   Małopolskie  : 9756   Mean   :  70300  
##  Hybrid  : 2621   Poznań  : 2382   Dolnośląskie : 8838   3rd Qu.:  83600  
##  LPG     : 4301   Gdańsk  : 2271   Łódzkie      : 7884   Max.   :2399900  
##                   (Other) :96261   (Other)      :38508
sum(is.na(gas))
## [1] 0
  • Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables, etc…)

    • describe your data.

The dataset I chose is from Kaggle: Car Prices in Poland. Per the description, the context is as follows: the dataset was assembled in January 2022, and is from a well-known car sale site in Poland (which is public). Selenium and request were used for parsing. It includes the following variables:

Column           Description               Type
X                index                     int
mark             car brand                 string
model            car model                 string
generation_name  car generation name       string
year             manufacturing year        int
mileage          total mileage on the car  int
vol_engine       engine volume             int
fuel             car fuel type             string
city             manufacturing city        string
province         manufacturing province    string
price            car price                 int

Here are the first six rows of the dataset:

head(gas)
##   X mark model generation_name year mileage vol_engine   fuel            city
## 1 0 opel combo      gen-d-2011 2015  139568       1248 Diesel           Janki
## 2 1 opel combo      gen-d-2011 2018   31991       1499 Diesel        Katowice
## 3 2 opel combo      gen-d-2011 2015  278437       1598 Diesel           Brzeg
## 4 3 opel combo      gen-d-2011 2016   47600       1248 Diesel       Korfantów
## 5 4 opel combo      gen-d-2011 2014  103000       1400    CNG Tarnowskie Góry
## 6 5 opel combo      gen-d-2011 2017  121203       1598 Diesel        Warszawa
##      province price
## 1 Mazowieckie 35900
## 2     Śląskie 78501
## 3    Opolskie 27000
## 4    Opolskie 30800
## 5     Śląskie 35900
## 6 Mazowieckie 51900
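The column types listed in the table can be checked directly in R. The sketch below retypes three of the rows printed by head(gas) as inline CSV text so that it runs without the file; with the full file, the same sapply() calls would be applied to gas. Depending on the R version and the stringsAsFactors setting, the text columns come back as either character or factor.

```r
# Three of the rows printed by head(gas), retyped here as inline CSV text
csv_text <- "X,mark,model,generation_name,year,mileage,vol_engine,fuel,city,province,price
0,opel,combo,gen-d-2011,2015,139568,1248,Diesel,Janki,Mazowieckie,35900
2,opel,combo,gen-d-2011,2015,278437,1598,Diesel,Brzeg,Opolskie,27000
5,opel,combo,gen-d-2011,2017,121203,1598,Diesel,Warszawa,Mazowieckie,51900"

gas_head <- read.csv(text = csv_text, header = TRUE)

# Storage class of each column as R read it
col_classes <- sapply(gas_head, class)
print(col_classes)

# Non-numeric columns correspond to the ones described as "string" in the table
text_cols <- names(gas_head)[!sapply(gas_head, is.numeric)]
print(text_cols)
```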
  • Re-do Part 2, i.e.,
    • produce a scatterplot between “A” and “B”. You might consider using jitter() or alpha() to avoid overlapping points.
    • draw a scatterplot conditioning on variable “C”. Include “| C” in your first argument to create a conditioning plot.
    • describe in words the relation between “A” and “B.” Do you find evidence of Simpson’s Paradox?
plot(
  jitter(gas$year, factor=3), 
  jitter(gas$price, factor=3),
  cex=.4, 
  pch=16,
  xlab='Year', 
  ylab='Price', 
  main='Price x Year',
  col='blue'
)
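With well over 100,000 points, jitter alone may still leave the plot heavily overplotted. One base-R option, sketched here on simulated stand-in data rather than the Kaggle file, is a semi-transparent point color from adjustcolor() combined with a log-scaled price axis, since the summary shows prices are strongly right-skewed. The alpha level of 0.2 and the simulated price model are arbitrary assumptions.

```r
# Simulated stand-in for year and price (hypothetical values, not the Kaggle data)
set.seed(1)
year_sim  <- sample(2000:2022, 5000, replace = TRUE)
price_sim <- exp(rnorm(5000, mean = 9 + 0.08 * (year_sim - 2000), sd = 0.6))

# adjustcolor() (grDevices, loaded by default) returns a semi-transparent version of a color
blue_20 <- adjustcolor('blue', alpha.f = 0.2)

plot(
  jitter(year_sim, factor = 3),
  price_sim,
  log = 'y',  # log scale spreads out the right-skewed prices
  cex = .4,
  pch = 16,
  col = blue_20,
  xlab = 'Year',
  ylab = 'Price (log scale)',
  main = 'Price x Year'
)
```

Dense regions then appear darker, which makes the bulk of the data easier to see than with fully opaque points.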

coplot(
  price ~ year | fuel, 
  data=gas,
  cex=.6,
  xlab='Year', 
  ylab='Price', 
  col='blue'
)

There is a clear positive relationship between Year and Price; in other words, the newer the car, the higher the price (and vice versa). There does appear to be something of a Simpson’s Paradox, in that the trend in the combined data fades when broken down by fuel type. Specifically, there appears to be no trend for the LPG and CNG fuel types; it is important to note that this could be due to the smaller number of data points for these fuel types (CNG, for example, has only 47 listings).

  • You will present the results of Part 3 to your neighbor(s) in class on Jan. 7 (Mon).