This is an individual assignment. Submit your .Rmd and “knitted” .html files through Collab.
Upload your html file to RPubs and include the link when you submit your files on Collab.
Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.
Use the occupational experience variable (“oexp”) of the income_example dataset and plot
You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.
Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).
[“Play” with the hist() and density() functions; for instance, by choosing a different number of bins or different break points for the hist() function, or different bandwidths using the adjust argument in density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also, add useful axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. See the help page ?par for a description of many more plotting parameters.]
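As a rough sketch of what “playing” with these arguments can look like, the following compares two choices of breaks and two choices of adjust. It uses simulated values rather than the income_example data; the vector x and its distribution are assumptions made purely for illustration.

```r
# Simulated stand-in for an experience-like variable (NOT the real data)
set.seed(1)
x <- c(rep(0, 50), runif(450, min = 0, max = 48))

op <- par(mfrow = c(2, 2))   # 2 x 2 grid of subplots; remember old settings

# Few vs. many bins: a single number for 'breaks' is only a suggestion to hist()
h_coarse <- hist(x, breaks = 5,  main = "breaks = 5",  xlab = "x")
h_fine   <- hist(x, breaks = 25, main = "breaks = 25", xlab = "x")

# A smaller 'adjust' shrinks the bandwidth and gives a wigglier curve
d_rough  <- density(x, adjust = 0.5)
d_smooth <- density(x, adjust = 2)
plot(d_rough,  main = "adjust = 0.5", xlab = "x")
plot(d_smooth, main = "adjust = 2",   xlab = "x")

par(op)                      # restore the previous plotting parameters
```

Saving the return values of hist() and density() makes it possible to inspect the bin counts and the bandwidth actually used (h_fine$counts, d_rough$bw) before deciding which plot is most informative.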
[That’s what the plots could look like – but you have to do it with your group;-)]
# add your codes
library(foreign)
par(mfrow = c(2,2))
incex <- read.table("income_exmpl.dat", header = TRUE, sep = "\t")
hist(incex$oexp,
main = "Histogram of Occ. Experience",
xlab = "Occ. Experience (in years)",
col = "white",
breaks = 25)
boxplot(incex$oexp,
main = "Boxplot of Occ. Experience",
xlab = "Occ. Experience (in years)",
horizontal = TRUE,
col = "white")
plot(density(incex$oexp, adjust = 0.5), xlab = "Occ. Experience (in years)", main = "Density Estimate of Occ. Experience")
income <- incex$income
edu <- incex$edu
sex <- incex$sex
occ <- incex$occ
oexp <- incex$oexp
boxplot(oexp ~ occ + sex,
col = c("blue", "red"),
horizontal = TRUE,
main = "Occ. Experience by Sex and Occ. Status",
names = c("m.med", "f.med", "m.low", "f.low", "m.high", "f.high"),
xlab = "Occ. Experience (in years)",
ylab = "",
las = 1)
The histogram shows that many individuals have no occupational experience. Beyond zero, frequencies are similar across bins and taper off after about 40 years.
The boxplot shows that the median occupational experience is below 20 years. There are no outliers, so the whiskers mark the minimum (0 years, i.e., no occupational experience) and the maximum (48 years). The interquartile range runs from roughly 9 to 29 years (estimated from the graph).
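Rather than reading the quartiles off the graph, the numbers behind a boxplot can be computed directly. The sketch below does this on simulated values; oexp_sim is a made-up stand-in, not the real oexp variable.

```r
# Simulated stand-in for oexp (NOT the real data)
set.seed(2)
oexp_sim <- round(runif(200, min = 0, max = 48))

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(oexp_sim)
names(five) <- c("min", "lower", "median", "upper", "max")
five

# The values boxplot() actually draws, plus any points flagged as outliers
bs <- boxplot.stats(oexp_sim)
bs$stats          # whisker ends, hinges, and median
length(bs$out)    # number of outliers
```

When no points are flagged as outliers, the whisker ends in bs$stats coincide with the minimum and maximum of the data.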
The density plot shows that the largest share of individuals has zero years of work experience. Density is fairly steady from about 2 to 30 years and declines sharply around 40 years. This is to be expected given real-world career lengths.
The boxplots by sex and occupational status show how work experience differs by gender: the median occupational experience of females is greater than that of males in the high and low occupational-status groups.
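Group medians like these can be checked numerically, and it is also worth confirming the order in which boxplot() arranges the groups before supplying custom names. The following is a minimal sketch on simulated data; the data frame sim and its factor levels are assumptions, not the coding used in income_example.

```r
# Simulated stand-in data (NOT the real income_example dataset)
set.seed(3)
sim <- data.frame(
  occ  = factor(sample(c("low", "med", "high"), 120, replace = TRUE)),
  sex  = factor(sample(c("f", "m"), 120, replace = TRUE)),
  oexp = round(runif(120, min = 0, max = 48))
)

# Median experience for every occ x sex cell
med <- tapply(sim$oexp, list(sim$occ, sim$sex), median)
med

# Group order used for a two-factor formula: the first factor (occ)
# varies fastest within each level of the second factor (sex)
b <- boxplot(oexp ~ occ + sex, data = sim, plot = FALSE)
b$names
```

Comparing b$names with the labels passed through the names argument is a quick way to make sure each box carries the label of the group it actually shows.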
jitter() or alpha() for avoiding overlying points.
library(foreign)
data <- read.spss("SCS_QE.sav", to.data.frame = TRUE)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav", to.data.frame = TRUE): Undeclared level(s) 0
## added in variable: married
mathpre <- data$mathpre
mars <- data$mars
plot(jitter(mathpre, factor = 5),
jitter(mars, factor = 5),
xlab = "Pre-test Score",
ylab = "Anxiety",
col=rgb(red = .5, green = 0, blue = .5, alpha = .8),
main= "Scatter of Pre-test Score Measured Against Anxiety")
| male” in your first argument to create a conditioning plot.
library(lattice)
xyplot(mars ~ mathpre | male, data = data, cex = .5,
xlab = "Pre-test Score",
ylab = "Anxiety")
In general, math anxiety seems to decrease as the pre-test score increases. Females tend to show higher anxiety than males, and males tend to score a few points higher than females, which could partly explain the difference in anxiety. Splitting the data by sex shows that high anxiety levels are found predominantly among females. A difference in levels between the groups is not, by itself, evidence of Simpson’s paradox, which would require the relationship within each group to run opposite to the relationship in the pooled data.
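One way to check for Simpson’s paradox is to compare the sign of the pooled correlation with the sign of the correlation within each group. The sketch below illustrates the check on simulated data in which both groups share the same negative trend; the variables grp, pre, and anx are hypothetical and do not come from SCS_QE.sav.

```r
# Simulated stand-in data (NOT the SCS_QE data)
set.seed(4)
n   <- 100
grp <- factor(rep(c("female", "male"), each = n))
pre <- rnorm(2 * n, mean = ifelse(grp == "male", 12, 10), sd = 2)
anx <- 60 - 1.5 * pre + rnorm(2 * n, sd = 5)   # same negative slope in both groups
sim <- data.frame(grp, pre, anx)

# Correlation in the pooled data and within each group
pooled <- cor(sim$pre, sim$anx)
by_grp <- sapply(split(sim, sim$grp), function(d) cor(d$pre, d$anx))
pooled
by_grp

# A reversal needs every within-group sign to differ from the pooled sign
reversal <- all(sign(by_grp) != sign(pooled))
reversal
```

Applied to the real data, the same comparison (with mathpre, mars, and male in place of the simulated columns) would show whether the within-group trends actually reverse.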
Use a dataset that is available in data repositories (e.g., kaggle)
Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables, etc…)
jitter() or alpha() for avoiding overlying points.
| C” in your first argument to create a conditioning plot.
netflix <- read.csv("NetflixOriginals.csv")
runtime <- netflix$Runtime
score <- netflix$IMDB.Score
plot(jitter(runtime, factor = 5),
jitter(score, factor = 5),
xlab = "Runtime (mins)",
ylab = "IMDB score",
col=rgb(red = .5, green = 0, blue = .5, alpha = .8),
main= "Scatter of Runtime Against IMDB Score")
xyplot(Runtime ~ IMDB.Score | Genre == "Documentary", data = netflix, cex = .5,
xlab = "IMDB Score",
ylab = "Runtime")
The plots above show that most movies have a runtime of around 100 minutes, and the higher IMDB scores also appear near that runtime. This could indicate that a runtime of about 100 minutes suits viewers’ preferences.
On average, documentaries are rated much lower than other genres, regardless of runtime.
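A group comparison like this can be backed up with summary statistics. The sketch below shows one way to compute mean runtime and mean score for documentaries versus all other genres. The column names follow the code above, but the rows of films are placeholder values invented for illustration and say nothing about the real NetflixOriginals.csv data.

```r
# Placeholder rows (NOT taken from NetflixOriginals.csv)
films <- data.frame(
  Genre      = c("Documentary", "Drama", "Documentary",
                 "Comedy", "Drama", "Documentary"),
  Runtime    = c(90, 110, 85, 100, 120, 95),
  IMDB.Score = c(7.1, 6.0, 6.8, 5.5, 6.4, 7.4)
)

# Flag documentaries, then average both variables within each group
films$is_doc <- films$Genre == "Documentary"
summ <- aggregate(cbind(Runtime, IMDB.Score) ~ is_doc, data = films, FUN = mean)
summ
```

Replacing films with the netflix data frame gives the group means that the conditioning plot only shows visually.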
You will present results of Part 3 to your neighbor(s) in class of Jan. 7 (Mon).