library(foreign)
## sex age edu occ oexp income
## 1 f 62 low low 35 953
## 2 m 32 high high 6 1224
## 3 m 56 med. high 36 1466
## 4 f 63 med. med. 38 1339
## 5 m 20 low low 3 1184
## 6 f 38 med. med. 12 1196
## 7 f 39 med. low 13 951
## 8 f 53 low low 30 1039
## 9 m 49 low med. 31 1438
## 10 f 54 low low 30 1000
## [1] 1922 6
## [1] 1922
## [1] 6
## [1] "sex" "age" "edu" "occ" "oexp" "income"
## 'data.frame': 1922 obs. of 6 variables:
## $ sex : chr "f" "m" "m" "f" ...
## $ age : int 62 32 56 63 20 38 39 53 49 54 ...
## $ edu : chr "low" "high" "med." "med." ...
## $ occ : chr "low" "high" "high" "med." ...
## $ oexp : int 35 6 36 38 3 12 13 30 31 30 ...
## $ income: int 953 1224 1466 1339 1184 1196 951 1039 1438 1000 ...
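The calls that produced the output above are not shown in this report. As a minimal sketch, output of this kind typically comes from base R inspection functions such as the ones below. The data frame `toy_df` is a hypothetical stand-in built from the first three rows printed above, since the code that loads `income_df` is not part of this document.

```r
# Hypothetical stand-in for income_df: first three rows of the printed output
toy_df <- data.frame(
  sex    = c("f", "m", "m"),
  age    = c(62L, 32L, 56L),
  edu    = c("low", "high", "med."),
  occ    = c("low", "high", "high"),
  oexp   = c(35L, 6L, 36L),
  income = c(953L, 1224L, 1466L)
)

head(toy_df, 10)  # first rows (at most 10)
dim(toy_df)       # number of rows and columns
nrow(toy_df)      # number of rows
ncol(toy_df)      # number of columns
names(toy_df)     # column names
str(toy_df)       # compact structure: column types and first values
```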
library(foreign)
edu_df <- read.spss('SCS_QE.sav', to.data.frame = TRUE)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav", to.data.frame = TRUE): Undeclared level(s) 0
## added in variable: married
head(edu_df, 10)
## vocabpre mathpre numbmath likemath likelit preflit pextra pagree pconsc
## 1 24 7 2 6 7 2 16 39 35
## 2 26 3 2 2 10 3 22 41 35
## 3 17 5 1 3 8 3 31 39 39
## 4 23 4 2 8 10 2 22 46 34
## 5 23 5 2 2 7 3 29 48 48
## 6 28 7 2 2 9 3 28 43 34
## 7 12 5 4 6 6 1 31 32 36
## 8 25 10 3 4 3 2 23 36 32
## 9 17 9 1 7 2 1 20 50 40
## 10 23 8 2 4 7 3 34 42 33
## pemot pintell mars beck rq vm cauc afram
## 1 29 42 51 6 quasiexperiment Mathematics Caucasian Other
## 2 29 30 76 5 quasiexperiment Mathematics Caucasian Other
## 3 29 37 71 4 quasiexperiment Mathematics Other Afro-American
## 4 33 32 33 3 quasiexperiment Vocabulary Caucasian Other
## 5 40 31 77 0 quasiexperiment Vocabulary Caucasian Other
## 6 36 41 44 1 quasiexperiment Vocabulary Caucasian Other
## 7 33 35 63 0 quasiexperiment Mathematics Other Afro-American
## 8 15 35 70 13 quasiexperiment Vocabulary Other Other
## 9 30 37 39 0 quasiexperiment Mathematics Caucasian Other
## 10 31 40 54 8 quasiexperiment Vocabulary Caucasian Other
## other age male married parents momdegr daddegr credit majormi
## 1 0 19 male 0 15000.00 14.00000 14.00000 30.00000 non-technical
## 2 0 28 female 0 113471.45 12.00000 6.00000 45.00000 non-technical
## 3 0 19 female 0 89485.82 14.09398 13.59029 30.00000 technical
## 4 0 21 female 0 36000.00 14.00000 12.00000 0.00000 non-technical
## 5 0 34 female 0 176952.91 13.58694 16.00000 0.00000 non-technical
## 6 0 20 male 0 60000.00 14.00000 18.00000 0.00000 non-technical
## 7 0 21 male 0 145000.00 16.00000 16.00000 116.00000 non-technical
## 8 1 19 female 0 40000.00 16.00000 12.00000 13.00000 non-technical
## 9 0 18 male 0 120000.00 12.00000 12.00000 36.84452 non-technical
## 10 0 18 female 0 24000.00 18.00000 16.00000 15.00000 non-technical
## actcomp hsgpaar collgpaa vocaball mathall
## 1 27.00000 4.00000 2.730 7 17
## 2 21.28776 3.50000 3.600 13 10
## 3 20.00000 2.95000 2.760 7 13
## 4 19.94736 2.27000 1.640 18 3
## 5 19.58231 2.60512 3.657 17 4
## 6 26.00000 2.57000 3.423 20 2
## 7 17.20831 2.80000 2.200 8 4
## 8 24.00000 1.90000 2.400 23 11
## 9 19.00000 3.45000 3.000 8 10
## 10 17.70300 3.45000 3.650 22 6
titanic_df <- read.csv('train.csv')
This is an individual assignment. Submit your .Rmd and “knitted”.html files through Collab.
Upload your html file on RPubs and include the link when you submit your files on Collab.
Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.
Use the occupational experience variable (“oexp”) of the income_example dataset and plot
You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.
Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).
[“Play” with the hist() and density() functions; for instance, by choosing a different number of bins or different break points for the hist() function, or different bandwidths using the adjust argument in density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also, add useful axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. Use the help page ?par for the description of many more plotting parameters.]
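The hint above can be tried out with a small sketch like the following, which compares two binning choices and two bandwidth adjustments in a 2 x 2 plotting region. The vector `x` is simulated and purely hypothetical; it only stands in for `oexp` so that the example runs on its own.

```r
set.seed(1)
# Simulated stand-in for oexp (hypothetical values, not the real data)
x <- round(runif(500, min = 0, max = 50))

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of subplots

# Same data, two different binning choices
hist(x, breaks = 5, main = "About 5 bins", xlab = "Years")
hist(x, breaks = seq(0, 50, by = 2), main = "Bins of width 2", xlab = "Years")

# Same data, two different bandwidth adjustments
plot(density(x, adjust = 0.5), main = "adjust = 0.5 (rougher)", xlab = "Years")
plot(density(x, adjust = 2), main = "adjust = 2 (smoother)", xlab = "Years")

par(old_par)  # restore the previous layout
```

A single number passed to `breaks` is only a suggestion to hist(), whereas a vector of break points is used exactly as given.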
# add your codes
oexp <- income_df$oexp
occ <- income_df$occ
sex <- income_df$sex
#par(mfrow=c(2,2), mar=c(3,5,4,4))
par(mgp=c(4,1,0),mar=c(4,5,4,1)) # space margins by side
hist(oexp, breaks = seq(0, 55, by = 5), xaxt = 'n', xlab = 'Occupational experience (years)', main = 'Histogram of occupational experience', col = 'grey90')
axis(1, at = seq(0,55, 5))
plot(density(oexp, bw = 1, kernel = "gaussian"), main = 'Occupational experience KDE',xlab = 'Occupational experience (years)')
#income ~ edu + sex, ,las = 2
boxplot(oexp, horizontal = TRUE, main = "Occupational experience boxplot", xaxt = 'n', xlab = "Occupational experience (years)")
axis(1, at = seq(0, 40, 10))
boxplot(oexp ~ sex + occ, horizontal = TRUE, main = "Occupational experience boxplot \n by occupational status and sex", ylab = "Sex and occupational status", xlab = "Occupational experience (years)", col = c("red", 'blue'), las = 2)
jitter() or alpha() for avoiding overlying points.
mathpre <- edu_df$mathpre
mars <- edu_df$mars
plot(jitter(mathpre, factor = 1), jitter(mars, factor = 1),
xlab = 'Math achievement', ylab = 'Math anxiety', main = 'Math achievement vs Math anxiety', pch = 16)
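The scatterplot above uses jitter() only. Base R has no alpha() function (it is provided by add-on packages such as scales), so one possible way to get transparency in base graphics is adjustcolor() from grDevices. The sketch below uses simulated, hypothetical scores `a` and `b` rather than the real `mathpre` and `mars` columns.

```r
set.seed(2)
# Simulated integer-valued scores (hypothetical stand-ins for mathpre / mars)
a <- sample(0:10, 300, replace = TRUE)
b <- sample(20:80, 300, replace = TRUE)

# Semi-transparent black: regions where points overlap appear darker
col_alpha <- adjustcolor("black", alpha.f = 0.3)

plot(jitter(a), b, pch = 16, col = col_alpha,
     xlab = "Score A", ylab = "Score B",
     main = "Jitter plus transparency")
```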
| male” in your first argument to create a conditioning plot.
# add your codes
coplot (jitter(mars, factor=2) ~ jitter(mathpre, factor=2) | male, data = edu_df,
xlab = c('Math achievement','Coplot of Math achievement vs Math anxiety by sex'), ylab = 'Math anxiety',
col = c("red", "blue")[as.numeric(edu_df$male)], pch = 16)
describe your plots.
- The plots show some relationship between math achievement and math anxiety. Students with low achievement scores show no noticeable pattern, but students with high achievement scores all have low anxiety. Whether this indicates causation is unknown, since low achievement could cause anxiety or vice versa. Female students appear to have higher anxiety in general than male students. Their achievement scores also look more spread out, while the male achievement scores cluster higher. However, this could be due to the larger female sample. There is no clear evidence of Simpson’s paradox: in the original plot, math anxiety is negatively correlated with math achievement, and that negative relationship roughly holds when the data are split by sex.
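A visual judgment about Simpson’s paradox can be backed up numerically by comparing the pooled correlation with the correlation inside each group. The sketch below is not run on `edu_df`. It uses simulated, hypothetical data that is deliberately constructed to exhibit the paradox, so the contrast with the plots described above is visible. With the real data, `x`, `y`, and `group` would be replaced by `mathpre`, `mars`, and `male`.

```r
set.seed(3)
# Simulated data (hypothetical): negative trend within each group,
# but group B has both higher x and higher y than group A
n <- 100
group <- rep(c("A", "B"), each = n)
x <- c(rnorm(n, mean = 5), rnorm(n, mean = 8))
y <- ifelse(group == "A", 40, 60) - 3 * x + rnorm(2 * n, sd = 2)
d <- data.frame(group, x, y)

# Pooled correlation, ignoring the grouping variable
overall <- cor(d$x, d$y)

# Correlation computed separately within each group
by_group <- sapply(split(d, d$group), function(g) cor(g$x, g$y))

overall
by_group

# Simpson's paradox: the pooled sign differs from the sign in every group
paradox <- all(sign(by_group) != sign(overall))
paradox
```

If the pooled and within-group correlations share the same sign, as the plots above suggest for math anxiety and math achievement, `paradox` would be FALSE.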
Use a dataset that is available in data repositories (e.g., kaggle)
Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables, etc…)
This dataset is the Titanic survival dataset from Kaggle. Its target variable is the survival of Titanic passengers, and its features include Pclass (ticket class), Sex, Age, SibSp (number of siblings/spouses), Parch (number of parents/children), and Fare (passenger fare). These variables are used to predict the survival of passengers.
Re-do Part 2, i.e.,
jitter() or alpha() for avoiding overlying points.
| C” in your first argument to create a conditioning plot.
t_df <- na.omit(titanic_df)
plot(t_df$Fare, jitter(t_df$Pclass, factor = 4),
xlab = 'Fare', ylab = 'Pclass', pch = 16) # single x label; Sex is not used in this plot
t_df<-na.omit(titanic_df)
fare <- titanic_df$Fare
pclass <- titanic_df$Pclass
sex <- titanic_df$Sex
coplot ( jitter(Pclass, factor=4) ~ Fare
| Sex, data = t_df,
xlab = c('Fare','Sex'), ylab = 'Pclass', pch = 16)
- In the plot of Pclass vs Fare, the class number generally goes down (toward first class) as Fare increases. However, the majority of the data lies at low fare levels, where fare does not appear to influence class. There is no evidence of Simpson’s paradox, as separating the data by sex makes no difference compared to the original graph.