This is an individual assignment. Submit your .Rmd and “knitted”.html files through Collab.
Upload your html file on RPubs and include the link when you submit your submission files on Collab.
Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.
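Use the occupational experience variable (“oexp”) of the income_example dataset and plot a histogram, a kernel density estimate, a boxplot of “oexp”, and a boxplot of “oexp” by “sex” and “occ”.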
You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.
Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).
[“Play” with the hist() and density() functions; for instance, by choosing a different number of bins or different break points for the hist() function, or different bandwidths using the adjust argument in density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also, add useful axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. See the help page ?par for a description of many more plotting parameters.]
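As a rough sketch of what “playing” with these arguments can look like, the block below compares two sets of break points for hist() and two values of adjust for density(). The vector oexp_sim is simulated and only stands in for income$oexp; the bin widths and adjust values are arbitrary choices for illustration, not recommended settings.

```r
# Simulated stand-in for income$oexp (assumption: the real file is not read here).
set.seed(42)
oexp_sim <- sample(0:48, size = 500, replace = TRUE)

# Two histograms with different bin widths, given as explicit break points.
h_wide   <- hist(oexp_sim, breaks = seq(0, 50, by = 10), plot = FALSE)
h_narrow <- hist(oexp_sim, breaks = seq(0, 50, by = 2), plot = FALSE)

# Two kernel density estimates; adjust multiplies the default bandwidth.
d_rough  <- density(oexp_sim, adjust = 0.5)
d_smooth <- density(oexp_sim, adjust = 2)

# Show all four variants side by side in a 2 x 2 plotting region.
par(mfrow = c(2, 2))
plot(h_wide, col = 'blue', xlab = 'Experience', main = 'Bins of 10 years')
plot(h_narrow, col = 'blue', xlab = 'Experience', main = 'Bins of 2 years')
plot(d_rough, col = 'blue', xlab = 'Experience', main = 'adjust = 0.5')
plot(d_smooth, col = 'blue', xlab = 'Experience', main = 'adjust = 2')
```

Narrow bins and a small adjust value show more local detail but also more noise; wide bins and a large adjust value smooth that detail away.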
[That’s what the plots could look like, but you have to do it with your group ;-)]
income <- read.table(
  'income_exmpl.dat',
  header=TRUE,
  sep='\t'
)
cat('Dimensions:', dim(income), '\n\n')
## Dimensions: 1922 6
cat('Columns:', names(income), '\n\n')
## Columns: sex age edu occ oexp income
summary(income)
## sex age edu occ oexp income
## f: 853 Min. :18.00 high:419 high:627 Min. : 0.0 Min. : 704
## m:1069 1st Qu.:31.00 low :811 low :595 1st Qu.: 8.0 1st Qu.:1117
## Median :42.00 med.:692 med.:700 Median :19.0 Median :1304
## Mean :41.69 Mean :19.5 Mean :1313
## 3rd Qu.:52.00 3rd Qu.:30.0 3rd Qu.:1506
## Max. :65.00 Max. :48.0 Max. :2115
par(mfrow = c(2, 2))
# Histogram
hist(
  income$oexp,
  xlab='Experience',
  ylab='Frequency',
  col='blue',
  main='Experience Histogram'
)
# Kernel Density Estimate
plot(
  density(income$oexp),
  xlab='Experience',
  ylab='Density',
  col='blue',
  main='Experience Density Plot'
)
# Boxplot
boxplot(
  income$oexp,
  col='blue',
  main='Experience Boxplot'
)
# Boxplot -- crossed by sex and occ
boxplot(
  income$oexp ~ income$sex + income$occ,
  main='Experience Boxplot by Sex and Occupation',
  # the two colors are recycled across the groups (f = blue, m = red)
  col=c('blue', 'red'),
  xlab='Sex and Occupation',
  ylab='Experience'
)
Describe your plots.
The histogram shows that occupational experience is slightly right-skewed, with more subjects toward the lower end of the experience range. The density plot reflects this as expected, although it looks slightly more Normal than the histogram (the skew is still somewhat noticeable). The first boxplot indicates that the minimum, median, and maximum experience are roughly 0, 19, and 48 years, respectively (values confirmed by the summary output above); furthermore, the first quartile is roughly 8 years and the third quartile is roughly 30 years. The second boxplot, by sex and occupational status, shows that males generally have more experience at medium and high occupational status; the opposite is true at low occupational status.
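The values read off the boxplots can also be checked numerically. The sketch below uses a small made-up data frame, income_sim, whose column names mirror the income data; the numbers are invented for illustration and are not taken from income_exmpl.dat.

```r
# Hypothetical miniature of the income data (assumption: all values are made up).
income_sim <- data.frame(
  sex  = rep(c('f', 'm'), each = 6),
  occ  = rep(c('low', 'med.', 'high'), times = 4),
  oexp = c(12, 10, 8, 14, 12, 10, 9, 15, 20, 11, 17, 22)
)

# Minimum, lower hinge, median, upper hinge, maximum: the values a boxplot draws.
five <- fivenum(income_sim$oexp)
five

# A mean above the median hints at right skew, a mean below it at left skew.
skew_hint <- mean(income_sim$oexp) - median(income_sim$oexp)
skew_hint

# Median experience for each sex-by-occupation cell, as in the grouped boxplot.
cell_medians <- aggregate(oexp ~ sex + occ, data = income_sim, FUN = median)
cell_medians
```

With the real data, replacing income_sim by income would give the same three summaries for the actual sample.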
library(foreign)
scs <- read.spss(
  "SCS_QE.sav",
  to.data.frame=TRUE
)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav", to.data.frame = TRUE): Undeclared level(s) 0
## added in variable: married
(i) Produce a scatterplot between “mathpre” and “mars”. You might consider using jitter() or alpha() for avoiding overlapping points.
plot(
  jitter(scs$mathpre, factor=3),
  jitter(scs$mars, factor=3),
  cex=.4,
  pch=16,
  xlab='Math Pre Achievement',
  ylab='Math Anxiety',
  main='Math Pre Achievement x Math Anxiety',
  col='blue'
)
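The alpha() function mentioned in the prompt comes from the scales package and is re-exported by ggplot2. One possible base R way to get the same effect is adjustcolor() from grDevices, sketched below on simulated scores. The vectors mathpre_sim and mars_sim are stand-ins for scs$mathpre and scs$mars, and the transparency of 0.3 is an arbitrary choice.

```r
# Simulated discrete scores (assumption: stand-ins for scs$mathpre and scs$mars).
set.seed(1)
mathpre_sim <- sample(1:10, size = 300, replace = TRUE)
mars_sim    <- sample(1:5, size = 300, replace = TRUE)

# adjustcolor() returns the color with an alpha (transparency) channel added.
blue_alpha <- adjustcolor('blue', alpha.f = 0.3)

# Light jitter separates tied points; transparency makes dense regions darker.
plot(
  jitter(mathpre_sim, factor = 1),
  jitter(mars_sim, factor = 1),
  pch = 16,
  col = blue_alpha,
  xlab = 'Math Pre Achievement (simulated)',
  ylab = 'Math Anxiety (simulated)',
  main = 'Jitter plus transparency'
)
```

Combining a small jitter with transparency keeps points close to their true values while still showing where observations pile up.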
(ii) Draw a conditioning plot for female and male students (variable “male”). Include “| male” in your first argument to create a conditioning plot.
coplot(
  # mars on the y-axis and mathpre on the x-axis, matching the axis labels
  mars ~ mathpre | male,
  data=scs,
  cex=.6,
  xlab='Math Pre Achievement',
  ylab='Math Anxiety',
  col='blue'
)
(iii) Describe in words the relation between math scores and math anxiety. Do you find evidence of Simpson’s Paradox?
It is evident that there is a negative relationship between math scores and math anxiety. In other words, students scored higher when they were less anxious (and vice versa). There does appear to be evidence of Simpson’s Paradox: the relationship appears in the combined data but disappears once the data are separated by sex (no apparent relationship within either group).
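A visual impression like this can be backed up by comparing the pooled correlation with the correlation inside each group. The block below is a minimal sketch on a constructed data frame, scs_sim, built so that the two groups differ in their means but have no association within either group. It illustrates the pattern described above and says nothing about the actual SCS data.

```r
# Constructed example (assumption: values are artificial, not from SCS_QE.sav).
# Within each group, mathpre and mars vary independently on a 3 x 3 grid.
grid <- expand.grid(dx = c(-1, 0, 1), dy = c(-1, 0, 1))
scs_sim <- rbind(
  data.frame(male = 0, mathpre = 3 + grid$dx, mars = 8 + grid$dy),
  data.frame(male = 1, mathpre = 8 + grid$dx, mars = 3 + grid$dy)
)

# Correlation in the combined data.
pooled_r <- cor(scs_sim$mathpre, scs_sim$mars)

# Correlation computed separately for each level of male.
group_r <- sapply(
  split(scs_sim, scs_sim$male),
  function(g) cor(g$mathpre, g$mars)
)

pooled_r
group_r
```

Here the pooled correlation is strongly negative only because the groups sit in different corners of the plot; within each group it is zero. Running the same two calls on scs would show whether the real data behave this way.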
gas <- read.csv(
  'gas.csv',
  header=TRUE
)
cat('Dimensions:', dim(gas), '\n\n')
## Dimensions: 117927 11
cat('Columns:', names(gas), '\n\n')
## Columns: X mark model generation_name year mileage vol_engine fuel city province price
summary(gas)
## X mark model
## Min. : 0 audi :12031 astra : 3331
## 1st Qu.: 29482 opel :11914 seria-3: 2944
## Median : 58963 bmw :11070 a4 : 2912
## Mean : 58963 volkswagen :10848 golf : 2592
## 3rd Qu.: 88444 ford : 9664 a6 : 2496
## Max. :117926 mercedes-benz: 7136 seria-5: 2464
## (Other) :55264 (Other):101188
## generation_name year mileage vol_engine
## :30085 Min. :1945 Min. : 0 Min. : 0
## gen-8p-2003-2012 : 1567 1st Qu.:2009 1st Qu.: 67000 1st Qu.:1461
## gen-j-2009-2015 : 1376 Median :2013 Median : 146269 Median :1796
## gen-a-2008-2017 : 1216 Mean :2013 Mean : 140977 Mean :1812
## gen-iii-2013 : 1184 3rd Qu.:2018 3rd Qu.: 203000 3rd Qu.:1995
## gen-e90-2005-2012: 1152 Max. :2022 Max. :2800000 Max. :7600
## (Other) :81347
## fuel city province price
## CNG : 47 Warszawa: 7972 Mazowieckie :22219 Min. : 500
## Diesel :48476 Łódź : 3341 Śląskie :16706 1st Qu.: 21000
## Electric: 885 Kraków : 2936 Wielkopolskie:14016 Median : 41900
## Gasoline:61597 Wrocław : 2764 Małopolskie : 9756 Mean : 70300
## Hybrid : 2621 Poznań : 2382 Dolnośląskie : 8838 3rd Qu.: 83600
## LPG : 4301 Gdańsk : 2271 Łódzkie : 7884 Max. :2399900
## (Other) :96261 (Other) :38508
sum(is.na(gas))
## [1] 0
Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables, etc…)
The dataset I chose is from Kaggle: Car Prices in Poland. Per the description, the dataset was assembled in January 2022 from a well-known, publicly accessible car sale site in Poland. Selenium and request were used for parsing. It includes the following variables:
| Column | Description | Type |
|---|---|---|
| X | index | int |
| mark | car brand | string |
| model | car model | string |
| generation_name | car generation name | string |
| year | manufacturing year | int |
| mileage | total mileage on the car | int |
| vol_engine | engine volume | int |
| fuel | car fuel type | string |
| city | manufacturing city | string |
| province | manufacturing province | string |
| price | car price | int |
Here are the first six rows of the dataset:
head(gas)
## X mark model generation_name year mileage vol_engine fuel city
## 1 0 opel combo gen-d-2011 2015 139568 1248 Diesel Janki
## 2 1 opel combo gen-d-2011 2018 31991 1499 Diesel Katowice
## 3 2 opel combo gen-d-2011 2015 278437 1598 Diesel Brzeg
## 4 3 opel combo gen-d-2011 2016 47600 1248 Diesel Korfantów
## 5 4 opel combo gen-d-2011 2014 103000 1400 CNG Tarnowskie Góry
## 6 5 opel combo gen-d-2011 2017 121203 1598 Diesel Warszawa
## province price
## 1 Mazowieckie 35900
## 2 Śląskie 78501
## 3 Opolskie 27000
## 4 Opolskie 30800
## 5 Śląskie 35900
## 6 Mazowieckie 51900
You might consider using jitter() or alpha() for avoiding overlapping points.
plot(
  jitter(gas$year, factor=3),
  jitter(gas$price, factor=3),
  cex=.4,
  pch=16,
  xlab='Year',
  ylab='Price',
  main='Price x Year',
  col='blue'
)
Include “| C” in your first argument to create a conditioning plot.
coplot(
  price ~ year | fuel,
  data=gas,
  cex=.6,
  xlab='Year',
  ylab='Price',
  col='blue'
)
It is evident that there is a positive relationship between Year and Price; in other words, the newer the car, the higher the price (and vice versa). There does appear to be something of a Simpson’s Paradox, in that the trend in the combined data fades when broken down by fuel type. Specifically, there appears to be no trend for the LPG and CNG fuel types; it is important to note that this could be due to the smaller number of data points for these fuel types.
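To judge whether a missing trend might simply reflect a small group, the group size can be reported next to the within-group correlation. The sketch below does this on a tiny constructed data frame, cars_sim. Its fuel, year, and price columns mirror the gas data, but every value is invented for illustration.

```r
# Constructed example (assumption: values are artificial, not from gas.csv).
cars_sim <- rbind(
  data.frame(fuel = 'Diesel', year = 2010:2019, price = 10000 + 5000 * (0:9)),
  data.frame(fuel = 'CNG', year = c(2012, 2014, 2016), price = c(30000, 20000, 30000))
)

# For each fuel type: number of cars and correlation between year and price.
by_fuel <- do.call(
  rbind,
  lapply(split(cars_sim, cars_sim$fuel), function(g) {
    data.frame(fuel = g$fuel[1], n = nrow(g), r = cor(g$year, g$price))
  })
)
by_fuel
```

A correlation based on only a handful of cars, as for CNG in this toy example, is a weak basis for concluding that there is no trend, so the n column should be read alongside r.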