library(foreign)
##    sex age  edu  occ oexp income
## 1    f  62  low  low   35    953
## 2    m  32 high high    6   1224
## 3    m  56 med. high   36   1466
## 4    f  63 med. med.   38   1339
## 5    m  20  low  low    3   1184
## 6    f  38 med. med.   12   1196
## 7    f  39 med.  low   13    951
## 8    f  53  low  low   30   1039
## 9    m  49  low med.   31   1438
## 10   f  54  low  low   30   1000
## [1] 1922    6
## [1] 1922
## [1] 6
## [1] "sex"    "age"    "edu"    "occ"    "oexp"   "income"
## 'data.frame':    1922 obs. of  6 variables:
##  $ sex   : chr  "f" "m" "m" "f" ...
##  $ age   : int  62 32 56 63 20 38 39 53 49 54 ...
##  $ edu   : chr  "low" "high" "med." "med." ...
##  $ occ   : chr  "low" "high" "high" "med." ...
##  $ oexp  : int  35 6 36 38 3 12 13 30 31 30 ...
##  $ income: int  953 1224 1466 1339 1184 1196 951 1039 1438 1000 ...
library(foreign)
edu_df <- read.spss('SCS_QE.sav', to.data.frame = TRUE)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav", to.data.frame = TRUE): Undeclared level(s) 0
## added in variable: married
head(edu_df, 10)
##    vocabpre mathpre numbmath likemath likelit preflit pextra pagree pconsc
## 1        24       7        2        6       7       2     16     39     35
## 2        26       3        2        2      10       3     22     41     35
## 3        17       5        1        3       8       3     31     39     39
## 4        23       4        2        8      10       2     22     46     34
## 5        23       5        2        2       7       3     29     48     48
## 6        28       7        2        2       9       3     28     43     34
## 7        12       5        4        6       6       1     31     32     36
## 8        25      10        3        4       3       2     23     36     32
## 9        17       9        1        7       2       1     20     50     40
## 10       23       8        2        4       7       3     34     42     33
##    pemot pintell mars beck              rq          vm      cauc         afram
## 1     29      42   51    6 quasiexperiment Mathematics Caucasian         Other
## 2     29      30   76    5 quasiexperiment Mathematics Caucasian         Other
## 3     29      37   71    4 quasiexperiment Mathematics     Other Afro-American
## 4     33      32   33    3 quasiexperiment  Vocabulary Caucasian         Other
## 5     40      31   77    0 quasiexperiment  Vocabulary Caucasian         Other
## 6     36      41   44    1 quasiexperiment  Vocabulary Caucasian         Other
## 7     33      35   63    0 quasiexperiment Mathematics     Other Afro-American
## 8     15      35   70   13 quasiexperiment  Vocabulary     Other         Other
## 9     30      37   39    0 quasiexperiment Mathematics Caucasian         Other
## 10    31      40   54    8 quasiexperiment  Vocabulary Caucasian         Other
##    other age   male married   parents  momdegr  daddegr    credit       majormi
## 1      0  19   male       0  15000.00 14.00000 14.00000  30.00000 non-technical
## 2      0  28 female       0 113471.45 12.00000  6.00000  45.00000 non-technical
## 3      0  19 female       0  89485.82 14.09398 13.59029  30.00000     technical
## 4      0  21 female       0  36000.00 14.00000 12.00000   0.00000 non-technical
## 5      0  34 female       0 176952.91 13.58694 16.00000   0.00000 non-technical
## 6      0  20   male       0  60000.00 14.00000 18.00000   0.00000 non-technical
## 7      0  21   male       0 145000.00 16.00000 16.00000 116.00000 non-technical
## 8      1  19 female       0  40000.00 16.00000 12.00000  13.00000 non-technical
## 9      0  18   male       0 120000.00 12.00000 12.00000  36.84452 non-technical
## 10     0  18 female       0  24000.00 18.00000 16.00000  15.00000 non-technical
##     actcomp hsgpaar collgpaa vocaball mathall
## 1  27.00000 4.00000    2.730        7      17
## 2  21.28776 3.50000    3.600       13      10
## 3  20.00000 2.95000    2.760        7      13
## 4  19.94736 2.27000    1.640       18       3
## 5  19.58231 2.60512    3.657       17       4
## 6  26.00000 2.57000    3.423       20       2
## 7  17.20831 2.80000    2.200        8       4
## 8  24.00000 1.90000    2.400       23      11
## 9  19.00000 3.45000    3.000        8      10
## 10 17.70300 3.45000    3.650       22       6
titanic_df <- read.csv('train.csv')

1 Instructions

  • This is an individual assignment. Submit your .Rmd and “knitted” .html files through Collab.

  • Upload your html file on RPubs and include the link when you submit your submission files on Collab.

  • Please don’t use ggplot2 for this assignment. We’ll use ggplot2 almost all the time after this assignment.

2 Part 1

  • Use the occupational experience variable (“oexp”) of the income_example dataset and plot

    • a histogram,
    • a kernel density estimate,
    • a boxplot of “oexp”,
    • a set of boxplots showing the distribution of “oexp” by sex crossed with occupational status (“occ”).
  • You can either produce four separate but small plots, or you can use par(mfrow = c(2, 2)) to create a plotting region consisting of four subplots.

  • Briefly describe the distributions of occupational experience in words (also include your plots and the R syntax).

[“Play” with the hist() and density() functions; for instance, by choosing a different number of bins or different break points for the hist() function, or different bandwidths using the adjust argument in density(). See also the corresponding help files and the examples given there. Only include the histogram and density estimate you find most informative. Also, add useful axis labels and a title using the following arguments inside the plotting functions: xlab, ylab, main. See the help page ?par for descriptions of many more plotting parameters.]

# add your codes
oexp <- income_df$oexp
occ <- income_df$occ
sex <- income_df$sex


#par(mfrow=c(2,2), mar=c(3,5,4,4))
par(mgp=c(4,1,0),mar=c(4,5,4,1)) # space margins by side

hist(oexp, breaks = seq(0, 55, by = 5), xaxt = 'n', xlab = 'Occupational experience (years)', main = 'Histogram of occupational experience', col = 'grey90')
axis(1, at = seq(0,55, 5))

plot(density(oexp, bw = 1, kernel = "gaussian"), main = 'Occupational experience KDE',xlab = 'Occupational experience (years)')

#income ~ edu + sex, ,las = 2

boxplot(oexp, horizontal = TRUE, main = "Occupational experience boxplot", xaxt = 'n', xlab = "Occupational experience (years)")
axis(1, at = seq(0,40, 10))

boxplot(oexp ~ sex + occ, horizontal = TRUE, main = "Occupational experience boxplot \n by occupational status and sex", ylab = "Sex and occupational status", xlab = "Occupational experience (years)", col = c("red", 'blue'), las = 2)
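
The instructions also allow a single 2 x 2 plotting region via par(mfrow = c(2, 2)). The sketch below shows one way this might look. It is only an illustration: the data frame `demo` is a stand-in built from the first 10 rows of income_df as printed by head() at the top of this document, and the margin values and titles are assumptions rather than the settings used for the plots above.

```r
# Stand-in data: the first 10 rows of income_df as printed by head() above.
# With the full data, use income_df directly instead of `demo`.
demo <- data.frame(
  sex  = c("f", "m", "m", "f", "m", "f", "f", "f", "m", "f"),
  occ  = c("low", "high", "high", "med.", "low", "med.", "low", "low", "med.", "low"),
  oexp = c(35, 6, 36, 38, 3, 12, 13, 30, 31, 30)
)

# Switch to a 2 x 2 grid and keep the previous settings so they can be restored
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

hist(demo$oexp, breaks = seq(0, 40, by = 5),
     xlab = "Occupational experience (years)", main = "Histogram")
plot(density(demo$oexp, adjust = 1),
     xlab = "Occupational experience (years)", main = "Kernel density estimate")
boxplot(demo$oexp, horizontal = TRUE,
        xlab = "Occupational experience (years)", main = "Boxplot")
boxplot(oexp ~ sex + occ, data = demo, horizontal = TRUE, las = 1,
        xlab = "Occupational experience (years)", ylab = "",
        main = "By sex and occupational status")

par(old_par)  # restore the previous plotting parameters
```

Saving the value returned by par() and restoring it afterwards keeps the 2 x 2 layout from leaking into later plots.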

  • describe your plots.
  • The histogram and density estimate show that the data are concentrated at the lower end of occupational experience, with the mode at 0-5 years. The median occupational experience is about 19 years, while the maximum lies between 45 and 50 years, indicating a long right tail. The grouped boxplots suggest a relationship between sex, occupational status, and occupational experience: men tend to have more experience than women in high- and medium-status occupations, whereas women tend to have more experience in low-status occupations. This may suggest that men hold higher-status jobs than women overall.
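
A verbal description like this can be backed up with numerical summaries. The following is a minimal sketch, assuming the same stand-in data frame `demo` (the first 10 rows of income_df as printed by head() at the top of this document); the numbers it produces describe those 10 rows only, not the full 1922 observations.

```r
# Stand-in data: the first 10 rows of income_df as printed by head() above.
demo <- data.frame(
  sex  = c("f", "m", "m", "f", "m", "f", "f", "f", "m", "f"),
  occ  = c("low", "high", "high", "med.", "low", "med.", "low", "low", "med.", "low"),
  oexp = c(35, 6, 36, 38, 3, 12, 13, 30, 31, 30)
)

# Minimum, quartiles, median, mean, and maximum of occupational experience
print(summary(demo$oexp))
oexp_median <- median(demo$oexp)

# Median experience in each sex-by-occupation cell (NA where a cell is empty)
cell_medians <- tapply(demo$oexp, list(sex = demo$sex, occ = demo$occ), median)
print(cell_medians)
```

With the full data, replacing `demo` by income_df gives the medians that the grouped boxplots display.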

3 Part 2

  • Download the SCS Data set from Collab (there you also find a separate file containing a brief description of variables). Then investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”) by plotting the data and the path of means.
  1. Produce a scatterplot between “mathpre” and “mars”. You might consider using jitter() or alpha() for avoiding overlying points.
mathpre <- edu_df$mathpre
mars <- edu_df$mars
plot(jitter(mathpre, factor = 1), jitter(mars, factor = 1),
     xlab = 'Math achievement', ylab = 'Math anxiety', main = 'Math achievement vs Math anxiety', pch = 16)
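
The task also asks for the path of means, which the scatterplot above does not show. One possible way to add it is sketched here. This is an assumption about what is intended: the mean of “mars” is computed at each distinct “mathpre” score with tapply() and drawn with lines(). The data frame `demo_edu` is a stand-in holding only the first 10 rows of edu_df as printed by head() at the top of this document.

```r
# Stand-in data: mathpre and mars from the first 10 rows of edu_df printed above.
# With the full data, use edu_df directly instead of `demo_edu`.
demo_edu <- data.frame(
  mathpre = c(7, 3, 5, 4, 5, 7, 5, 10, 9, 8),
  mars    = c(51, 76, 71, 33, 77, 44, 63, 70, 39, 54)
)

# Mean anxiety score at each distinct achievement score (sorted by score)
path <- tapply(demo_edu$mars, demo_edu$mathpre, mean)

plot(jitter(demo_edu$mathpre), jitter(demo_edu$mars), pch = 16, col = "grey50",
     xlab = "Math achievement", ylab = "Math anxiety",
     main = "Math achievement vs math anxiety with path of means")

# Overlay the path of means: one point per distinct score, joined by lines
lines(as.numeric(names(path)), path, type = "b", col = "red", lwd = 2)
```

With only 10 rows the path is very jagged; on the full data set each mean is based on more students and the trend should be easier to read.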

  1. Draw a conditioning plot for female and male students (variable “male”). Include “| male” in your first argument to create a conditioning plot.
# add your codes

coplot (jitter(mars, factor=2) ~ jitter(mathpre, factor=2) | male, data = edu_df, 
xlab = c('Math achievement','Coplot of Math achievement vs Math anxiety by sex'), ylab = 'Math anxiety',
col = c("red", "blue")[as.numeric(edu_df$male)], pch = 16)

  1. Describe in words the relation between math scores and math anxiety. Do you find evidence of Simpson’s Paradox?

describe your plots. - The plots show some relationship between math achievement and math anxiety. Students with low achievement scores show no noticeable pattern, but students with high achievement scores all have low anxiety. Whether this reflects causation is unknown, since low achievement could cause anxiety or vice versa. The female students seem to have much higher anxiety in general than the male students. The female students also appear to have more spread-out achievement scores, while the male achievement scores are clustered higher. However, this could be due to the larger female sample. There is no clear evidence of Simpson’s paradox: in the original plot, math anxiety is negatively correlated with math achievement, and this negative relationship roughly holds when the data are split by sex.
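
The visual check for Simpson’s paradox can be complemented by comparing the overall correlation with the correlation inside each group; a sign that flips between the two would point toward the paradox. The sketch below is illustrative only. It again uses a stand-in data frame `demo_edu` with the first 10 rows of edu_df as printed by head() at the top of this document, so its results say nothing about the full data set.

```r
# Stand-in data: mathpre, mars, and male from the first 10 rows of edu_df printed above.
demo_edu <- data.frame(
  mathpre = c(7, 3, 5, 4, 5, 7, 5, 10, 9, 8),
  mars    = c(51, 76, 71, 33, 77, 44, 63, 70, 39, 54),
  male    = c("male", "female", "female", "female", "female",
              "male", "male", "female", "male", "female")
)

# Correlation between achievement and anxiety, ignoring sex
overall <- cor(demo_edu$mathpre, demo_edu$mars)

# The same correlation computed separately for female and male students
by_sex <- sapply(split(demo_edu, demo_edu$male),
                 function(d) cor(d$mathpre, d$mars))

# TRUE for a group whose correlation has the opposite sign to the overall one
sign_flips <- sign(by_sex) != sign(overall)

print(overall)
print(by_sex)
print(sign_flips)
```

A flipped sign in a small sample is not proof of the paradox, so this check is best read together with the conditioning plot.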

4 Part 3

  • Use a dataset that is available in data repositories (e.g., kaggle)

  • Briefly describe the dataset you’re using (e.g., means to access data, context, sample, variables, etc…)

  • This dataset is the Titanic survival dataset from Kaggle. Its target variable is the survival of Titanic passengers, and its features include Pclass (ticket class), Sex, Age, SibSp (number of siblings/spouses), Parch (number of parents/children), and Fare (passenger fare). These variables are used to predict passenger survival.

  • Re-do Part 2, i.e.,

    • produce a scatterplot between “A” and “B”. You might consider using jitter() or alpha() for avoiding overlying points.
    • draw a scatterplot plot conditioning on variable “C”. Include “| C” in your first argument to create a conditioning plot.
    • describe in words the relation between “A” and “B.” Do you find evidence of Simpson’s Paradox?
t_df <- na.omit(titanic_df)  # drop rows with missing values
plot(t_df$Fare, jitter(t_df$Pclass, factor = 4),
     xlab = 'Fare', ylab = 'Pclass', pch = 16)

t_df <- na.omit(titanic_df)  # drop rows with missing values

# Pclass vs Fare, conditioned on Sex; the second xlab entry labels the conditioning variable
coplot(jitter(Pclass, factor = 4) ~ Fare | Sex, data = t_df,
       xlab = c('Fare', 'Sex'), ylab = 'Pclass', pch = 16)

- In the plot of Pclass against Fare, the class number generally goes down as Fare increases. However, most of the data lie at low fare levels, where fare appears to have little influence on class. There is no evidence of Simpson’s paradox: separating the data by sex makes no visible difference compared with the original plot.

  • You will present results of Part 3 to your neighbor(s) in class of Jan. 7 (Mon).