This is an individual assignment. Submit your .Rmd and “knitted” .html files through Collab.
Upload your HTML file to RPubs and include the link when you submit your files on Collab.
Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.
Use the occupational experience variable (“oexp”) of the income_example dataset and plot a histogram, a density estimate, a boxplot, and a group of boxplots.
You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.
Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).
[“Play” with the hist() and density() functions, for instance by choosing a different number of bins or different break points for hist(), or different bandwidths via the adjust argument of density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also add useful axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. See ?par for descriptions of many more plotting parameters.]
[This is what the plots could look like, but you have to do it with your group ;-)]
# add your code
income_data <- read.table("income_exmpl.dat", header = TRUE, sep = "\t")
# 2 x 2 plotting region for the four plots
par(mfrow = c(2, 2))
# Histogram
hist(income_data$oexp, breaks = 25, col = "white", main = "Histogram of Occ. Experience", xlab = "Occ. Experience (in years)")
# Density
plot(density(income_data$oexp, adjust = 0.5), main = "Density Estimate of Occ. Experience", xlab = "Occ. Experience (in years)")
# Boxplot
boxplot(income_data$oexp, horizontal = TRUE, main = "Boxplot of Occ. Experience", col = "white", xlab = "Occ. Experience (in years)")
# Group of boxplots (the two colors are recycled across the boxes)
boxplot(oexp ~ occ + sex, data = income_data, col = c("blue", "red"), horizontal = TRUE, main = "Occ. Experience by Sex and Occ. Status", xlab = "Occ. Experience (in years)", ylab = "", las = 1)
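The hint about trying different bin counts and bandwidths can be explored with a small sketch like the one below. It uses simulated data as a stand-in for oexp (the gamma distribution and the name oexp_sim are assumptions for illustration, not the income data), and compares two values of breaks and two values of adjust side by side.

```r
# Simulated, right-skewed stand-in for income_data$oexp (assumption, not the real data)
set.seed(1)
oexp_sim <- rgamma(500, shape = 2, scale = 8)

op <- par(mfrow = c(2, 2))  # 2 x 2 grid; keep the old settings to restore later

# A single number for 'breaks' is only a suggestion; hist() picks "pretty" break points
h_coarse <- hist(oexp_sim, breaks = 10, main = "breaks = 10", xlab = "Experience (years)")
h_fine   <- hist(oexp_sim, breaks = 40, main = "breaks = 40", xlab = "Experience (years)")

# 'adjust' multiplies the default bandwidth: smaller values give a wigglier curve
d_rough  <- density(oexp_sim, adjust = 0.5)
d_smooth <- density(oexp_sim, adjust = 2)
plot(d_rough,  main = "adjust = 0.5", xlab = "Experience (years)")
plot(d_smooth, main = "adjust = 2",   xlab = "Experience (years)")

par(op)  # restore the previous plotting settings
```

The objects returned by hist() and density() can also be inspected directly, for example h_fine$breaks for the break points actually used and d_rough$bw for the bandwidth actually used.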
jitter() or alpha() for avoiding overlying points.
library(foreign)
scs <- read.spss("SCS_QE.sav", to.data.frame = TRUE)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav", to.data.frame = TRUE): Undeclared level(s) 0
## added in variable: married
plot(jitter(scs$mathpre), jitter(scs$mars), xlab = 'Math Achievement Score', ylab = 'Math Anxiety Score', cex = .4, pch = 16, main = "Scatterplot of Math Achievement Score vs. Math Anxiety Score")
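To see what jitter() and transparency each do against overplotting, here is a minimal sketch on made-up integer scores (x and y are hypothetical, not the SCS variables). In base graphics, one way to get semi-transparent points is adjustcolor() from grDevices; that is a substitute chosen here for the alpha() mentioned in the prompt, which comes from an add-on package.

```r
set.seed(2)
# Hypothetical integer-valued scores, so many points fall on exactly the same spot
x <- sample(1:10, 300, replace = TRUE)
y <- sample(1:10, 300, replace = TRUE)

op <- par(mfrow = c(1, 3))
# Raw data: at most 100 distinct positions are visible for 300 points
plot(x, y, pch = 16, main = "Raw (overplotted)")
# jitter() adds a small amount of random noise to separate coincident points
plot(jitter(x), jitter(y), pch = 16, cex = 0.6, main = "jitter()")
# Semi-transparent points: darker spots indicate more overlapping observations
point_col <- adjustcolor("black", alpha.f = 0.2)
plot(x, y, pch = 16, col = point_col, main = "Semi-transparent")
par(op)
```

Jittering changes the plotted positions slightly, so it is meant for display only; any summaries should still be computed on the original values.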
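| male” in your first argument to create a conditioning plot.
# y ~ x | given: math anxiety (mars) on the y-axis and math achievement (mathpre) on the x-axis, as in the scatterplot
coplot(mars ~ mathpre | male, data = scs, xlab = "Math Achievement Score", ylab = "Math Anxiety Score")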
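The formula passed to coplot() reads y ~ x | given, so the variable on the left of ~ ends up on the vertical axis. A minimal sketch with the built-in mtcars data (not part of this assignment; cars_demo and gearbox are hypothetical names introduced for illustration):

```r
# Built-in example data; 'gearbox' is a helper factor made from the 0/1 column 'am'
cars_demo <- transform(mtcars,
                       gearbox = factor(am, labels = c("automatic", "manual")))

# mpg goes on the y-axis, wt on the x-axis, with one panel per level of gearbox
coplot(mpg ~ wt | gearbox, data = cars_demo,
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Per-panel correlations give a numeric summary to go with the picture
r_by_group <- sapply(split(cars_demo, cars_demo$gearbox),
                     function(d) cor(d$wt, d$mpg))
r_by_group
```

Passing data = ... lets the formula use bare column names, which also keeps the default panel labels short.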
Describe your plots.
Looking at the base scatterplot, there appears to be a negative association: high math anxiety scores tend to go with low math achievement scores. Broken down by gender, however, this trend seems to hold only for males; females’ scores look fairly evenly distributed. This is loosely related to Simpson’s Paradox in that the grouped data tell a different story than the ungrouped data, although the trend does not reverse within the groups.
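For comparison, a textbook case of Simpson’s Paradox has the pooled trend reverse inside every group. The sketch below simulates such a case; all numbers and names (g, x, y) are made up for illustration and have nothing to do with the SCS data.

```r
set.seed(3)
n <- 200
g <- rep(c("A", "B"), each = n)

# Group B sits higher on x but lower on y than group A
x <- c(rnorm(n, mean = 0), rnorm(n, mean = 4))
# Within each group, y increases with x (slope 0.5); only the intercepts differ
y <- 0.5 * x + ifelse(g == "A", 4, -2) + rnorm(2 * n, sd = 0.5)

# Pooled correlation ignores the grouping
r_pooled <- cor(x, y)
# Within-group correlations condition on the grouping
r_within <- sapply(split(data.frame(x, y), g), function(d) cor(d$x, d$y))

r_pooled  # negative: driven by the difference between the group means
r_within  # positive in both groups

plot(x, y, col = ifelse(g == "A", "blue", "red"), pch = 16, cex = 0.5,
     main = "Simulated Simpson's Paradox")
```

Checking the sign of the association both overall and within each conditioning group is a quick way to tell a genuine reversal apart from a trend that is merely weaker in one group.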
Use a dataset that is available in data repositories (e.g., Kaggle)
Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables)
The data set I used is a car-buying data set downloaded from Kaggle. It contains variables such as brand, model, price, number of buyers, and fuel type. I chose to compare price with the number of purchases and then condition on fuel type.
jitter() or alpha() for avoiding overlying points.
| C” in your first argument to create a conditioning plot.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
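cars <- read.csv("CarBuyers.csv")
# drop rows whose Fuel entry is "automatic"
cars <- cars %>% filter(Fuel != "automatic")
# convert Total to numeric, then drop rows containing NA
cars <- cars %>% mutate(Total = as.numeric(Total)) %>% na.omit()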
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
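The warning above means that some entries could not be read as numbers and became NA; na.omit() then drops every row that contains an NA in any column. Before dropping rows, it can help to look at which raw values failed to convert. A small sketch on hypothetical values (total_raw is made up; the actual contents of the Total column are not shown in this document):

```r
# Hypothetical raw entries, similar in spirit to a text column holding counts
total_raw <- c("1200", "350", "n/a", "1,050")

# as.numeric() returns NA (with a warning) for anything that is not a plain number
total_num <- suppressWarnings(as.numeric(total_raw))

# Entries that were present in the raw data but failed to convert
is_bad <- is.na(total_num) & !is.na(total_raw)
total_raw[is_bad]

# Some failures are recoverable, e.g. by removing thousands separators first
total_fixed <- suppressWarnings(as.numeric(gsub(",", "", total_raw)))
total_fixed
```

If many values fail for a fixable reason such as separators or stray spaces, cleaning them first keeps more rows than dropping them.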
plot(jitter(cars$Price), jitter(cars$Total), xlab = "Average Sale Price (Thousands)", ylab = "Total Number of Sales", cex = 0.4, pch = 16, main = "Scatterplot of Sale Price vs. Total Sales")
# y ~ x | given: Total on the y-axis and Price on the x-axis, matching the scatterplot and the axis labels
coplot(Total ~ Price | Fuel, data = cars, xlab = "Average Sale Price (Thousands)", ylab = "Total Number of Sales")
There seems to be a fairly clear negative relationship between sale price and total number of sales: the higher the price, the fewer cars of that type are sold. Conditioning on fuel type doesn’t seem to change this, so there is no evidence of Simpson’s Paradox.