In the final assignment of the semester, you will work with a RIASEC dataset from http://personality-testing.info/_rawdata/. Set up an RMarkdown document and create a report in Word, PDF, or HTML format. Make sure to hand in both the RMarkdown file (.Rmd) and the report. Also make sure to comment on your results.

The following aspects will be graded:

- Dealing with the R ecosystem
- Exploring and understanding the dataset
- Purging the dataset
- Writing a custom function
- Reasoning of your answers

Suggestion: Start a fresh RStudio project for the assignment. The project should contain two subfolders, namely ‘R’ and ‘data’. The R folder is meant to contain your code drafts, and the data folder should contain the data. Put your RMarkdown file (.Rmd) in the project’s root folder so that you can reference your data with simple relative paths.
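
The suggested layout could be sketched as follows. This is only an illustration: the project root is a temporary directory, and the names `project_root` and `my_data.csv` are placeholders, not part of the assignment.

```r
# Hypothetical project root; a temporary directory is used so that the
# sketch does not touch any real files.
project_root <- file.path(tempdir(), "final_assignment")

# Create the two suggested subfolders.
dir.create(file.path(project_root, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# With the .Rmd in the project root, data files can be referenced by a
# short relative path (placeholder file name).
data_path <- file.path("data", "my_data.csv")

list.files(project_root)
```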

Task 1 - Get and Read the Data

Download the RIASEC (Holland Code) dataset from http://personality-testing.info/_rawdata/. Store the .csv file and the codebook file in your project’s data folder.

  1. Read the dataset into R and create an R object called riasec that contains the data. Hint: the separator is a tab stop (\t) again.

  2. Somehow the provider has interchanged the labels of age and gender. Check whether this is also the case in your copy! Argue why this is or is not the case, and if the labels are interchanged, swap the column labels back again.
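
As a rough illustration of both steps, the sketch below reads a tiny tab-separated file and swaps two column labels. The file content is made up for this example and is not the RIASEC data; `toy_file`, `toy`, and `idx` are hypothetical names.

```r
# Toy stand-in for a tab-separated raw file (NOT the real RIASEC data).
toy_file <- tempfile(fileext = ".csv")
writeLines(c("R1\tage\tgender",
             "3\t1\t24",
             "5\t2\t31",
             "2\t1\t19"), toy_file)

# read.delim() uses sep = "\t" and header = TRUE by default.
toy <- read.delim(toy_file)

# Inspect the value ranges to judge whether labels match contents:
# in this toy file the column labelled "age" only holds 1 and 2.
summary(toy[, c("age", "gender")])

# Swap the two labels by looking up their positions.
idx <- match(c("age", "gender"), names(toy))
names(toy)[idx] <- c("gender", "age")
str(toy)
```

Whether a swap is needed in your own copy has to be argued from the real data and the codebook.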

Task 2 - Cleaning up

Again, the data provider has chosen an unfortunate code for missing values. Fortunately, though, -1 is used in all columns and can thus be replaced easily.

  1. Copy the riasec object into a second object called c_riasec and work with this newly created object from this point on. Use R’s vector/matrix-oriented indexing to replace all -1 values with NA.

  2. The underlying study was implemented in two ways: Implementation 1 covers age and gender of the observations; Implementation 2 covers time elapsed and accuracy. The latter information, however, is not available in the former part and vice versa. This can be very cumbersome when trying to set up a dataset free of missing values.

Use the R function split to split the dataset by the variable implementation. Store the return value of the function in an object called l_riasec. What is the class of this object?
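
The two steps above might look like this on a small made-up data.frame. The column names and values are assumptions for illustration only, and `toy`, `c_toy`, and `l_toy` are hypothetical object names.

```r
# Made-up data with -1 as the missing value code (NOT the RIASEC data).
toy <- data.frame(
  R1 = c(3, -1, 5, 2),
  age = c(24, 31, -1, -1),
  elapsed = c(-1, -1, 120, 95),
  implementation = c(1, 1, 2, 2)
)

# Copy first so that the raw object stays untouched.
c_toy <- toy

# c_toy == -1 yields a logical matrix of the same shape as the data;
# indexing with it addresses every matching cell at once.
c_toy[c_toy == -1] <- NA

# split() returns one piece per level of the grouping variable.
l_toy <- split(c_toy, c_toy$implementation)
class(l_toy)
names(l_toy)
```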

  3. Name the elements of l_riasec according to their content: “imp_1” and “imp_2”.

  4. From both elements, remove the columns that are not available for the respective implementation.

Hint: Remember names(l_riasec$imp_1), names(l_riasec$imp_2) and the %in% syntax!
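
A minimal sketch of the naming and column removal, again on made-up data. Which columns are unavailable in each part is an assumption of this toy example (`drop_1`, `drop_2`); for the real data this has to be determined from the dataset and codebook.

```r
# Made-up two-element list standing in for the result of split().
l_toy <- list(
  data.frame(R1 = c(3, 4), age = c(24, 31), elapsed = c(NA, NA)),
  data.frame(R1 = c(5, 2), age = c(NA, NA), elapsed = c(120, 95))
)
names(l_toy) <- c("imp_1", "imp_2")

# Columns assumed (for this toy example only) to be unavailable per part.
drop_1 <- c("elapsed")
drop_2 <- c("age")

# Keep every column whose name is NOT in the drop vector.
l_toy$imp_1 <- l_toy$imp_1[, !(names(l_toy$imp_1) %in% drop_1), drop = FALSE]
l_toy$imp_2 <- l_toy$imp_2[, !(names(l_toy$imp_2) %in% drop_2), drop = FALSE]

lapply(l_toy, names)
```

Using `drop = FALSE` keeps the result a data.frame even if only one column remains.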

  5. Create two entirely NA-free subparts using the function na.omit on both parts of l_riasec. Replace l_riasec$imp_1 and l_riasec$imp_2 with the new NA-free data.frames. How many observations do you have in each of the two datasets?
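
On a made-up list of data.frames, removing incomplete rows and counting what is left could be sketched like this (`parts` is a hypothetical name; the values are invented):

```r
# Made-up list with some missing values (NOT the RIASEC data).
parts <- list(
  imp_1 = data.frame(R1 = c(3, NA, 4), age = c(24, 31, NA)),
  imp_2 = data.frame(R1 = c(5, 2, 1), elapsed = c(120, NA, 80))
)

# na.omit() drops every row that contains at least one NA.
parts$imp_1 <- na.omit(parts$imp_1)
parts$imp_2 <- na.omit(parts$imp_2)

# Number of complete observations per part.
sapply(parts, nrow)
```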

Task 3 - Write and Apply a Custom Function

  1. To compute the sum scores of the respective factors (all variables that start with capital letters), write a custom function with the name compute_sum_score that accepts two arguments (namely a pattern and data.frame) and returns a vector of row sums.

Hint: The pattern should be used with grep to identify the relevant columns. Use apply to sum up row-wise. Make sure to define your own function even though you might be able to circumvent writing a function!
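
The combination of grep and apply described in the hint can be sketched on toy data as follows. The function name `sum_matching_columns` is hypothetical and deliberately differs from the name required in the assignment; the data are invented.

```r
# Hypothetical helper: sums, row by row, all columns whose names match
# a regular expression.
sum_matching_columns <- function(pattern, data) {
  # grep() returns the positions of the matching column names.
  cols <- grep(pattern, names(data))
  # MARGIN = 1 applies sum() to every row of the selected columns.
  apply(data[, cols, drop = FALSE], 1, sum)
}

# Made-up data (NOT the RIASEC data).
toy <- data.frame(R1 = c(1, 2), R2 = c(3, 4), I1 = c(5, 6), age = c(20, 30))

sum_matching_columns("^R", toy)
```

The `drop = FALSE` argument keeps the selection two-dimensional even when only one column matches, so apply still works row-wise.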

  2. Use the following vector of suitable patterns to combine your custom-written function with lapply to produce a list of sum scores! Do this for both parts of the l_riasec object.
pttrn <- c("^R","^I","^A","^S","^E","^C")

Hint: Use two calls to end up with sum_list_1 and sum_list_2.
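
How a pattern vector and lapply fit together can be illustrated with a shortened pattern vector and invented data. All names in this sketch (`sum_matching_columns`, `toy_patterns`, `toy`, `toy_sums`) are hypothetical.

```r
# Hypothetical row-sum helper, repeated here so the sketch is self-contained.
sum_matching_columns <- function(pattern, data) {
  cols <- grep(pattern, names(data))
  apply(data[, cols, drop = FALSE], 1, sum)
}

# Shortened, made-up pattern vector and data (NOT the RIASEC data).
toy_patterns <- c("^R", "^I")
toy <- data.frame(R1 = c(1, 2), R2 = c(3, 4), I1 = c(5, 6), I2 = c(0, 1))

# lapply() passes each pattern as the first argument; the data.frame is
# handed over as an additional named argument.
toy_sums <- lapply(toy_patterns, sum_matching_columns, data = toy)
str(toy_sums)
```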

  3. Use the following expressions to give the two lists from the previous step proper names.
names(sum_list_1) <- gsub("\\^","sum_",pttrn)
names(sum_list_2) <- gsub("\\^","sum_",pttrn)
  4. Turn the named lists of sum scores (sum_list_1 and sum_list_2) into data.frames using as.data.frame. Use cbind to combine each data.frame of sum scores with its corresponding riasec subset.
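
The naming, conversion, and combination steps might look like this on invented values. The objects `toy_patterns`, `toy_sums`, `sum_df`, `toy`, and `toy_with_sums` are hypothetical.

```r
# Shortened, made-up pattern vector and a matching list of sum scores.
toy_patterns <- c("^R", "^I")
toy_sums <- list(c(4, 6), c(5, 7))

# gsub() replaces the literal "^" (escaped as "\\^") with "sum_".
names(toy_sums) <- gsub("\\^", "sum_", toy_patterns)

# A named list of equal-length vectors converts to one column per element.
sum_df <- as.data.frame(toy_sums)

# Made-up item data with the same number of rows (NOT the RIASEC data).
toy <- data.frame(R1 = c(1, 2), R2 = c(3, 4), I1 = c(5, 6), I2 = c(0, 1))

# cbind() appends the sum score columns to the item columns.
toy_with_sums <- cbind(toy, sum_df)
names(toy_with_sums)
```

Note that cbind matches rows by position only, so both data.frames need the same row order.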

Task 4 - Analysis / Inference

An important question, now that we have two clean subsets of the RIASEC dataset, is whether the two implementations had an undesired effect on the participants that influenced our data. In other words: are both datasets comparable?

  1. Create a boxplot with notches for the sum scores sum_R and sum_S for both subsets. Also run a t.test. Comment on your results.

  2. Draw a histogram of age. Comment on what you see. To earn a bonus point, do the same using ggplot2 with geom_bar() and fill = gender. Comment on your result again.

  3. Run a factor analysis with factanal, using as many factors as the dataset already suggests, by running the following code. Explain what the code does. Comment on the result of the factor analysis. Is it expected?

fa_cols <- grep("^[A-Z]", names(c_riasec), value = TRUE)
factanal(l_riasec$imp_1[, fa_cols], factors = length(pttrn))
factanal(l_riasec$imp_2[, fa_cols], factors = length(pttrn))
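
For the group comparison in the first step of this task, a notched boxplot and a t-test could be sketched on simulated scores like this. The data are random numbers generated for illustration only, and `scores`, `bp`, and `tt` are hypothetical names.

```r
set.seed(1)

# Simulated sum scores for two hypothetical groups (NOT the RIASEC data).
scores <- data.frame(
  sum_R = c(rnorm(100, mean = 20, sd = 5), rnorm(100, mean = 21, sd = 5)),
  group = rep(c("imp_1", "imp_2"), each = 100)
)

# notch = TRUE draws a notch around each median.
bp <- boxplot(sum_R ~ group, data = scores, notch = TRUE,
              main = "Simulated sum scores by group")

# t.test() with a formula compares the two group means; by default it
# does not assume equal variances (Welch test).
tt <- t.test(sum_R ~ group, data = scores)
tt$p.value
```

According to the boxplot documentation, notches that do not overlap are an indication that the two medians differ; the t-test addresses the means instead, so the two views can complement each other in your comments.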