Home Assignment

General Remarks

The following document presents a handful of applied tasks. Solve the tasks and create an R Markdown document to present the solutions. Comment on what you do! Also make sure to comment on the results. Hint: If you are not sure how to do something, create a standard .R file first to try out and develop your code and create the report once the problem is solved.

Don’t forget about:

ls() to see what’s in memory
rm() to remove an object
rm(list = ls()) to remove all objects
the meaning of indexes [row,col]
class(), str() to find out the data type you’re dealing with
?function_name to get immediate context help.
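The helpers listed above can be tried out in a short session. This is only a minimal sketch: the data frame `toy` and its values are made up for illustration and are not part of the assignment data.

```r
# Hypothetical toy data frame, used only to try out the helper functions
toy <- data.frame(id = 1:3, score = c(10, 20, 30))

ls()                              # names of the objects currently in memory
second_score <- toy[2, "score"]   # [row, col]: row 2, column "score"
toy[, 2]                          # empty row index: all rows of column 2
class(toy)                        # the class of the object
str(toy)                          # compact overview of columns and types

rm(toy)                           # remove a single object from memory
# ?rm                             # opens the help page for rm()
```

Running `rm(list = ls())` afterwards would clear every remaining object, including `second_score`.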

TASK 1 Getting started

Create a new RStudio project called homework that contains at least two subfolders, namely data and R.

Download the Big 5 Personality Traits dataset from http://personality-testing.info/_rawdata/

Download the data and copy the .csv file and the codebook text file into the homework project’s data folder.
Read the data into R and store it in an R object called big5. Hint: Use the correct separator when importing the .csv file. “\t” is a tab character.
save the big5 R object into an R binary called big5.RData
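The import and save steps could look roughly like the sketch below. To keep it runnable without the download, it writes a tiny tab-separated file to a temporary location first; the file contents and the temporary paths are assumptions, and in the project the paths would point into the data folder instead.

```r
# Stand-in for the downloaded file: a tiny tab-separated file (made-up values)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("age\tgender\tE1", "23\t1\t4", "31\t2\t2"), tmp_csv)

# sep = "\t" tells read.csv() that columns are separated by tabs, not commas
big5 <- read.csv(tmp_csv, sep = "\t")

# Save the object into an R binary file (temporary path used here)
tmp_rdata <- tempfile(fileext = ".RData")
save(big5, file = tmp_rdata)

# Check the round trip: remove the object, then restore it from the binary
rm(big5)
load(tmp_rdata)
dim(big5)
```

`load()` restores the object under the name it was saved with, so no assignment is needed.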

TASK 2 Exploring the data set

Get an idea of the variables contained in the dataset. Use the codebook textfile to make sense of the data. The dataset contains some demographics and information on the big 5 personality traits.

Display the first few lines
find out how many observations are in the dataset
How many columns does the data.frame have?
Use grep() and regular expressions to extract the column names of all items and store them in a vector called ‘items’.
Use the items vector you created in the previous step to create 5 new vectors, each of which contains only the item names of one trait:

Extraversion
Neuroticism
Agreeableness
Openness
Conscientiousness

HINT: use reasonable abbreviations such as ‘extra’.
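A minimal sketch of the exploration steps on a toy data frame follows. It assumes that item columns are named with one trait letter plus a number (E1, N1, A1, C1, O1, ...); check the codebook to confirm the actual naming before reusing the pattern.

```r
# Toy data frame with made-up values; the column naming is an assumption
toy <- data.frame(age = 25, gender = 1, country = "US",
                  E1 = 3, E2 = 4, N1 = 2, N2 = 5, A1 = 4, C1 = 3, O1 = 5)

head(toy)   # first few lines
nrow(toy)   # number of observations
ncol(toy)   # number of columns

# Pattern: exactly one of the letters E, N, A, C, O followed only by digits.
# value = TRUE returns the matching names instead of their positions.
items <- grep("^[ENACO][0-9]+$", names(toy), value = TRUE)

# One vector of item names per trait, selected by the leading letter
extra <- grep("^E", items, value = TRUE)
neuro <- grep("^N", items, value = TRUE)
```

Anchoring the pattern with `^` and `$` keeps columns such as age or country out of the result.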

TASK 3 A little custom function

Aggregate all 50 items to their respective big 5 trait. Write a function that accepts two arguments and computes a sum score given those two arguments. The first argument should be the data.frame that contains the data. The second argument is a vector of column names of this data.frame. The function should return a sum score per trait and observation.

Hint: this function is potentially a one-liner. If you need more than 5 or 6 lines, you are probably overcomplicating it.
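One possible shape for such a function is sketched below. The name `sum_score`, the argument names and the toy data are illustrative choices, not prescribed by the task.

```r
# Sum score per observation over the columns named in `cols`.
# drop = FALSE keeps a data.frame even if only one column is selected.
sum_score <- function(data, cols) {
  rowSums(data[, cols, drop = FALSE])
}

# Toy data with made-up values
toy <- data.frame(E1 = c(1, 5), E2 = c(2, 4), N1 = c(3, 3))
toy$extra_sum <- sum_score(toy, c("E1", "E2"))
toy
```

With the defaults of `rowSums()`, an observation with a missing item gets NA as its sum score; whether that is the desired behaviour is a design decision worth commenting on in the report.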

TASK 4 Basic Visualization, Getting an Idea of the dataset

Race, age, gender and country, as well as the 50 trait items plus the newly introduced sum scores will be the focus of the remaining tasks.

reduce the dataset to only those variables mentioned above (use indexing to do so and store the result in a second data.frame)
introduce a new variable that is TRUE when an observation is from the US and FALSE when it’s not. How many Americans and non-Americans are there in the dataset?
use boxplots to distinguish sum scores by 1) gender and by 2) the new US / non-US variable. What can you tell?
Pick 4 out of the 5 traits and create a histogram for each of the 4 sum scores. Arrange the 4 histograms on a 2x2 canvas. HINT: use par(mfrow = c(2,2)) BEFORE plotting. Make sure to use enough breaks.

What about the distribution? Is it similar to a well-known distribution? If so, plot a random draw from a similarly shaped distribution as well.
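The plotting steps can be sketched on simulated data as follows. All values are randomly generated, the column names (`extra`, `us`) are illustrative, and the normal distribution is used as the comparison only as an assumption about what the sum scores might resemble.

```r
set.seed(1)
# Simulated stand-in for the reduced data set (made-up values)
toy <- data.frame(
  country = sample(c("US", "GB", "IN"), 200, replace = TRUE),
  gender  = sample(1:2, 200, replace = TRUE),
  extra   = round(rnorm(200, mean = 30, sd = 6))
)

# TRUE for observations from the US, FALSE otherwise
toy$us <- toy$country == "US"
table(toy$us)

pdf(NULL)             # null device so the sketch runs without a display;
                      # drop this line (and dev.off()) in an interactive session
par(mfrow = c(2, 2))  # 2x2 canvas, set BEFORE plotting
boxplot(extra ~ gender, data = toy, main = "Extraversion by gender")
boxplot(extra ~ us, data = toy, main = "Extraversion by US / non-US")
hist(toy$extra, breaks = 30, main = "Observed sum score")
# Random draw with the same mean and standard deviation for comparison
hist(rnorm(200, mean(toy$extra), sd(toy$extra)), breaks = 30,
     main = "Normal draw")
dev.off()
```

In the real data the country column may contain missing values, in which case the comparison yields NA rather than FALSE for those rows.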

TASK 5 Purging the Dataset From Unreasonable Data

Datasets may contain measurement errors or bad values that cause problems for many statistical methods. Inspecting the variables of interest is therefore important, e.g. to spot unreasonable outliers. Set unreasonable values to NA.

Make a copy of your big5 object called big5_clean. Just use this copy for the tasks ahead. We do so to compare the full dataset with the cleaned dataset later on.

gender does not seem to be binary. Turn it into a binary variable by setting unreasonable values to NA. How many NAs do you have? Hint: Use is.na().
age seems to contain several unreasonable values, too. E.g.: some participants entered their year of birth instead of their age. Also, some software development rookie introduced a common but bad encoding for missing data. Set these values to NA!
use the summary() and quantile() functions to get a better idea of your variables and to identify NAs or values that should rather be set to NA.
Create a correlation matrix of the sum scores you computed before. Are the trait scores correlated? Comment on your results. Why are your results reasonable?
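A hedged sketch of the cleaning steps on toy data is shown below. The valid gender codes (1 and 2) and the age cut-off of 100 are assumptions made for illustration; the codebook and the output of summary() should guide the actual choices.

```r
# Toy data with made-up values, including some implausible entries
toy <- data.frame(
  gender = c(1, 2, 0, 3, 2),
  age    = c(25, 1990, 999, 40, 17),
  extra  = c(30, 25, 28, 35, 22),
  neuro  = c(20, 31, 27, 18, 33)
)

toy_clean <- toy  # work on a copy so the original stays available

# Assumption: only the codes 1 and 2 are kept, everything else becomes NA
toy_clean$gender[!toy_clean$gender %in% c(1, 2)] <- NA

# Assumption: ages above 100 are implausible (birth years, 999-style codes)
toy_clean$age[which(toy_clean$age > 100)] <- NA

sum(is.na(toy_clean$gender))  # number of missing gender values
summary(toy_clean$age)
quantile(toy_clean$age, probs = c(0.01, 0.5, 0.99), na.rm = TRUE)

# Correlation matrix of the sum scores, ignoring missing values pairwise
cor(toy_clean[, c("extra", "neuro")], use = "pairwise.complete.obs")
```

Using `which()` in the age assignment skips rows where age is already NA instead of passing NA into the index.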

TASK 6 Inference

Test for homogeneity of variances across gender for all of the sum scores. Make sure your gender variable has been cleaned up and turned into a factor. Use R’s standard bartlett.test as well as leveneTest from the car package.
Pick a sum score that turned out to have a homogeneous variance across groups and run an analysis of variance grouped by gender.
run an analysis of variance for one of the sum scores grouped by country.
run a t.test to compare extraversion sums across gender. Use the linear model function lm.
Pick one of the sum scores and regress it on country using the US as a reference level. Hint: Make sure country is a factor! Comment on your results / regression output.
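The inference steps might be sketched as follows on simulated data. All values are randomly generated and the factor labels are illustrative; the Levene test is only run if the car package happens to be installed.

```r
set.seed(42)
# Simulated stand-in for the cleaned data (made-up values)
toy <- data.frame(
  gender  = factor(sample(c("male", "female"), 300, replace = TRUE)),
  country = factor(sample(c("US", "GB", "IN"), 300, replace = TRUE)),
  extra   = rnorm(300, mean = 30, sd = 6)
)

# Homogeneity of variances across gender
bartlett.test(extra ~ gender, data = toy)
if (requireNamespace("car", quietly = TRUE)) {
  car::leveneTest(extra ~ gender, data = toy)
}

# Analysis of variance grouped by gender
summary(aov(extra ~ gender, data = toy))

# Two-sample t-test with pooled variance, and the same comparison via lm()
t_res     <- t.test(extra ~ gender, data = toy, var.equal = TRUE)
lm_gender <- lm(extra ~ gender, data = toy)

# Regression on country with the US as reference level
toy$country <- relevel(toy$country, ref = "US")
lm_country  <- lm(extra ~ country, data = toy)
summary(lm_country)
```

With `var.equal = TRUE`, the t statistic from `t.test()` matches the one for the gender coefficient in `lm()` up to its sign, which is one way to see that both describe the same comparison.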