I downloaded the RIASEC and Big5 datasets and created a new R project called “Midterm_RH”. This Markdown document focuses on the RIASEC dataset.

1. Import the data into R with the read.table function

riasec <- read.table("~/Documents/Psychologie/7. Semester/R/Blockseminar R/Midterm/RIASEC/data.csv", sep="\t", header = TRUE) # sep="\t" because the file is tab-separated; header = TRUE tells read.table that the first row contains the column names rather than data

head(riasec) #yes! it worked!
##   implementation R1 R2 R3 R4 R5 R6 R7 R8 I1 I2 I3 I4 I5 I6 I7 I8 A1 A2 A3
## 1              2  3  1  4  2  1  2  1  1  5  4  3  4  2  5  2  4  2  5  5
## 2              2  1  1  1  1  1  1  1  1  4  4  3  1  2  4  2  2  5  3  4
## 3              2  3  2  1  1  1  1  2  1  5  2  3  3  4  1  4  2  1  2  1
## 4              2  3  2  1  2  2  3  1  2  5  4  4  5  4  4  4  3  4  5  3
## 5              2 -1  2  3  2  3  2  1  3  5  2  4  4  4  3  4  3  1  1  2
## 6              2  3  1  3  4  3  4  3  3  3  4  3  3  2  3  3  4  2  3  4
##   A4 A5 A6 A7 A8 S1 S2 S3 S4 S5 S6 S7 S8 E1 E2 E3 E4 E5 E6 E7 E8 C1 C2 C3
## 1  5  5  4  2  5  4  4  3  4  4  4  3  3  2  2  3  1  4  1  1  4  1  1  1
## 2  3  3  5  1  3  1  3  3  1  3  2  2  3  1  1  1  1  1  1  1  1  1  1  1
## 3  1  2  1  3  1  4  5  4  4  4  2  2  4  3  2  2  3  4  2  4  2  4  3  2
## 4  4  3  5  4  3  3  4  3  4  4  5  3  2  1  1  4  2  3  4  2  2  2  4  3
## 5  1  2  2  4  3  3  2  3  4  2  3  3  2  3  4  2  1  4  3  4  2  3  4  4
## 6  2  3  3  4  4  2  2  3  2  2  2  3  2  3  2  3  2  4  2  4  4  3  4  3
##   C4 C5 C6 C7 C8 accuracy elapse country fromsearch age gender
## 1  1  2  1  1  2       90    222      PT          0  -1     -1
## 2  1  1  1  1  1      100    102      US          0  -1     -1
## 3  3  3  4  4  4       95    264      US          1  -1     -1
## 4  2  1  3  2  1       60    189      SG          0  -1     -1
## 5  2  4  3  3  3       90    197      US          0  -1     -1
## 6  3  3  3  3  3       80    247      US          1  -1     -1
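As a small side note (not required here), read.delim() already uses sep = "\t" and header = TRUE as defaults, so the same import could be written a bit more compactly:

riasec <- read.delim("~/Documents/Psychologie/7. Semester/R/Blockseminar R/Midterm/RIASEC/data.csv") # equivalent to the read.table call above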

2. Basic Inspections

What’s the size (rows, columns) of the riasec dataset?

nrow(riasec) #This dataset contains 8855 rows of data
## [1] 8855
ncol(riasec) #This dataset is made up of 55 columns, i.e. 55 variables
## [1] 55
names(riasec) #What are the names of the variables?
##  [1] "implementation" "R1"             "R2"             "R3"            
##  [5] "R4"             "R5"             "R6"             "R7"            
##  [9] "R8"             "I1"             "I2"             "I3"            
## [13] "I4"             "I5"             "I6"             "I7"            
## [17] "I8"             "A1"             "A2"             "A3"            
## [21] "A4"             "A5"             "A6"             "A7"            
## [25] "A8"             "S1"             "S2"             "S3"            
## [29] "S4"             "S5"             "S6"             "S7"            
## [33] "S8"             "E1"             "E2"             "E3"            
## [37] "E4"             "E5"             "E6"             "E7"            
## [41] "E8"             "C1"             "C2"             "C3"            
## [45] "C4"             "C5"             "C6"             "C7"            
## [49] "C8"             "accuracy"       "elapse"         "country"       
## [53] "fromsearch"     "age"            "gender"

We now know that this dataset contains 8855 rows of data and 55 columns, i.e. 55 different variables. The names() function shows us the names of all 55 variables.
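As a small optional shortcut, dim() returns both dimensions in a single call:

dim(riasec) # number of rows and columns at once, here 8855 and 55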

What do we know about the background of this dataset?

On the homepage and in the codebook, we find some additional information about the background and variables of this dataset:

  • Data was collected from an interactive version of the public domain RIASEC markers (a.k.a. Holland Codes or Holland Occupational Themes) from the Interest Item Pool, maintained by James Rounds of the University of Illinois at Urbana-Champaign.
  • The Holland Occupational Themes is a theory of personality which focuses on career and vocational choices. It groups people on the basis of their suitability for six different categories of occupation. The six categories are represented by the RIASEC acronym.
  • The RIASEC Markers Scales from the Interest Item Pool were developed by Liao, Armstrong and Rounds (2008) to be used in psychological research.

About the test

  • The test contains 48 items.
  • Each item has to be rated on a five-point scale: (1) dislike, (2) slightly dislike, (3) neither like nor dislike, (4) slightly enjoy, (5) enjoy.
  • The test takes about 5-10 minutes to complete.

Measures

Variables in this dataset

  1. 48 Likert-rated statements
  2. gender
  3. age
  4. country
  5. time elapsed
  6. self-rated accuracy

Summary Statistics

Summary statistics of all variables

summary(riasec)
##  implementation        R1              R2               R3        
##  Min.   :1.000   Min.   :-1.00   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.:2.000   1st Qu.: 1.00   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median :2.000   Median : 2.00   Median : 2.000   Median : 1.000  
##  Mean   :1.759   Mean   : 2.42   Mean   : 2.159   Mean   : 1.769  
##  3rd Qu.:2.000   3rd Qu.: 3.00   3rd Qu.: 3.000   3rd Qu.: 2.000  
##  Max.   :2.000   Max.   : 5.00   Max.   : 5.000   Max.   : 5.000  
##                                                                   
##        R4               R5               R6               R7        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 2.000   Median : 1.000   Median : 2.000   Median : 2.000  
##  Mean   : 2.256   Mean   : 1.664   Mean   : 2.321   Mean   : 1.898  
##  3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.: 3.000   3rd Qu.: 3.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        R8               I1               I2              I3        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.00   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.: 3.000   1st Qu.: 2.00   1st Qu.: 2.000  
##  Median : 2.000   Median : 4.000   Median : 4.00   Median : 3.000  
##  Mean   : 1.958   Mean   : 3.432   Mean   : 3.34   Mean   : 3.083  
##  3rd Qu.: 3.000   3rd Qu.: 4.000   3rd Qu.: 4.00   3rd Qu.: 4.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.00   Max.   : 5.000  
##                                                                    
##        I4               I5               I6               I7        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 1.000  
##  Median : 3.000   Median : 3.000   Median : 3.000   Median : 3.000  
##  Mean   : 2.927   Mean   : 2.874   Mean   : 2.995   Mean   : 2.715  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        I8               A1               A2               A3        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median : 3.000   Median : 2.000   Median : 3.000   Median : 3.000  
##  Mean   : 2.612   Mean   : 2.505   Mean   : 2.811   Mean   : 3.062  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        A4               A5               A6               A7        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 1.000  
##  Median : 3.000   Median : 4.000   Median : 4.000   Median : 3.000  
##  Mean   : 3.026   Mean   : 3.227   Mean   : 3.298   Mean   : 2.688  
##  3rd Qu.: 4.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        A8               S1               S2               S3        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 3.000   1st Qu.: 2.000  
##  Median : 3.000   Median : 4.000   Median : 4.000   Median : 3.000  
##  Mean   : 2.883   Mean   : 3.342   Mean   : 3.453   Mean   : 3.012  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        S4               S5               S6               S7        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000  
##  Median : 3.000   Median : 4.000   Median : 3.000   Median : 3.000  
##  Mean   : 2.952   Mean   : 3.302   Mean   : 2.839   Mean   : 3.121  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        S8               E1               E2               E3        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median : 3.000   Median : 2.000   Median : 2.000   Median : 3.000  
##  Mean   : 2.561   Mean   : 1.994   Mean   : 2.194   Mean   : 2.695  
##  3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.: 3.000   3rd Qu.: 4.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        E4               E5              E6               E7        
##  Min.   :-1.000   Min.   :-1.00   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.: 2.00   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 2.000   Median : 3.00   Median : 2.000   Median : 2.000  
##  Mean   : 2.203   Mean   : 2.84   Mean   : 2.454   Mean   : 2.285  
##  3rd Qu.: 3.000   3rd Qu.: 4.00   3rd Qu.: 3.000   3rd Qu.: 3.000  
##  Max.   : 5.000   Max.   : 5.00   Max.   : 5.000   Max.   : 5.000  
##                                                                    
##        E8               C1               C2               C3        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 3.000   Median : 2.000   Median : 2.000   Median : 2.000  
##  Mean   : 2.636   Mean   : 2.103   Mean   : 2.296   Mean   : 2.268  
##  3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.: 3.000   3rd Qu.: 3.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        C4               C5               C6               C7        
##  Min.   :-1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 2.000   Median : 2.000   Median : 3.000   Median : 2.000  
##  Mean   : 2.293   Mean   : 2.478   Mean   : 2.636   Mean   : 2.064  
##  3rd Qu.: 3.000   3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 3.000  
##  Max.   : 5.000   Max.   : 5.000   Max.   : 5.000   Max.   : 5.000  
##                                                                     
##        C8            accuracy              elapse            country    
##  Min.   :-1.000   Min.   :        -1   Min.   :    -1.0   US     :5389  
##  1st Qu.: 1.000   1st Qu.:        23   1st Qu.:    84.0   CA     : 610  
##  Median : 2.000   Median :        85   Median :   160.0   GB     : 480  
##  Mean   : 2.132   Mean   :    242582   Mean   :   371.1   AU     : 338  
##  3rd Qu.: 3.000   3rd Qu.:        95   3rd Qu.:   230.0   MY     : 274  
##  Max.   : 5.000   Max.   :2147483647   Max.   :509296.0   (Other):1763  
##                                                           NA's   :   1  
##    fromsearch          age              gender       
##  Min.   :0.0000   Min.   :-1.0000   Min.   : -1.000  
##  1st Qu.:0.0000   1st Qu.:-1.0000   1st Qu.: -1.000  
##  Median :0.0000   Median :-1.0000   Median : -1.000  
##  Mean   :0.4247   Mean   :-0.3964   Mean   :  6.892  
##  3rd Qu.:1.0000   3rd Qu.:-1.0000   3rd Qu.: -1.000  
##  Max.   :1.0000   Max.   : 3.0000   Max.   :100.000  
## 

When we look at the summary statistics, we notice that there are a few issues with the data.

  • All missing data have been coded as -1 (a quick per-column count of these codes is sketched right after this list).
  • accuracy: missing data coded as -1, and the maximum is 2147483647, which cannot be a real response. The reported values should lie between 0 and 100.
  • elapse: missing data coded as -1, and the maximum (509,296 seconds) is far longer than any plausible completion time.
  • country: 1 NA.
  • age: the values should lie roughly between 0 and 100, but the maximum is only 3 and (after recoding the -1 values) the median is 2, so the column clearly does not contain age in years; on top of that, most participants did not report an age at all. Maybe this variable should be excluded from the analysis.
  • gender: a big portion of the data is missing, because even the third quartile is -1, while the maximum is 100. It may be better to exclude gender from further analysis.
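To get a feel for how much data is affected, an optional per-column count of the -1 codes could look like this:

colSums(riasec == -1, na.rm = TRUE) # number of -1 codes (missing data) in each column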

Let’s get rid of messy data

So let’s get rolling and clean this dataset up.

  1. reassign all values with -1 (missing data) to NA
  2. gender is supposed to have the following values: 1 for “Male”, 2 for “Female”, 3 for “Other”, and -1 if missing (implementation 1). After recoding the -1 values, the summary shows 6743 NA’s and a maximum of 100.00, so there is a lot of messy data in this column. We’ll therefore remove this variable.
  3. age: delete this variable (too many missing values).
  4. accuracy: the reported values should lie between 0 and 100, but there are values > 100. Let’s set all values > 100 to NA.
  5. elapse: the description of the test says it takes about 5-10 minutes to complete, i.e. roughly 300-600 seconds per person. After recoding the missing values, the mean is 489.2 s, which sounds plausible, but the maximum is huge, so let’s set all values above 1200 s (20 minutes) to NA.
#missing data
riasec[riasec == -1] <- NA #set all missing values to NA
head(riasec) #test to see if it worked
##   implementation R1 R2 R3 R4 R5 R6 R7 R8 I1 I2 I3 I4 I5 I6 I7 I8 A1 A2 A3
## 1              2  3  1  4  2  1  2  1  1  5  4  3  4  2  5  2  4  2  5  5
## 2              2  1  1  1  1  1  1  1  1  4  4  3  1  2  4  2  2  5  3  4
## 3              2  3  2  1  1  1  1  2  1  5  2  3  3  4  1  4  2  1  2  1
## 4              2  3  2  1  2  2  3  1  2  5  4  4  5  4  4  4  3  4  5  3
## 5              2 NA  2  3  2  3  2  1  3  5  2  4  4  4  3  4  3  1  1  2
## 6              2  3  1  3  4  3  4  3  3  3  4  3  3  2  3  3  4  2  3  4
##   A4 A5 A6 A7 A8 S1 S2 S3 S4 S5 S6 S7 S8 E1 E2 E3 E4 E5 E6 E7 E8 C1 C2 C3
## 1  5  5  4  2  5  4  4  3  4  4  4  3  3  2  2  3  1  4  1  1  4  1  1  1
## 2  3  3  5  1  3  1  3  3  1  3  2  2  3  1  1  1  1  1  1  1  1  1  1  1
## 3  1  2  1  3  1  4  5  4  4  4  2  2  4  3  2  2  3  4  2  4  2  4  3  2
## 4  4  3  5  4  3  3  4  3  4  4  5  3  2  1  1  4  2  3  4  2  2  2  4  3
## 5  1  2  2  4  3  3  2  3  4  2  3  3  2  3  4  2  1  4  3  4  2  3  4  4
## 6  2  3  3  4  4  2  2  3  2  2  2  3  2  3  2  3  2  4  2  4  4  3  4  3
##   C4 C5 C6 C7 C8 accuracy elapse country fromsearch age gender
## 1  1  2  1  1  2       90    222      PT          0  NA     NA
## 2  1  1  1  1  1      100    102      US          0  NA     NA
## 3  3  3  4  4  4       95    264      US          1  NA     NA
## 4  2  1  3  2  1       60    189      SG          0  NA     NA
## 5  2  4  3  3  3       90    197      US          0  NA     NA
## 6  3  3  3  3  3       80    247      US          1  NA     NA
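head() only shows the first six rows; as an optional extra check, we can confirm that no -1 codes are left anywhere in the data frame:

sum(riasec == -1, na.rm = TRUE) # should be 0 after the recoding above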
#messy data

#gender
summary(riasec$gender) #it's so messy that we are going to remove this variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   22.00   29.00   32.09   40.00  100.00    6743
riasec$gender <- NULL

#age
summary(riasec$age) #again, this looks nothing like what we would expect. So we are going to delete this column as well. 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   1.515   2.000   3.000    6730
riasec$age <- NULL

#accuracy
summary(riasec$accuracy)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 1.000e+00 8.000e+01 9.000e+01 3.195e+05 9.800e+01 2.147e+09      2132
riasec$accuracy[riasec$accuracy > 100] <- NA #we set all values over 100 to NA. We can see that now, instead of 2132 NA's, we have 2137 NA's. And the minimum is 1.0, which is good, because everyone who answered with 0 didn't want their data to be used (see codebook)
summary(riasec$accuracy) #the maximum is 100.0 which means that there are no values > 100.0. They have all been reassigned to NA. 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    80.0    90.0    86.2    98.0   100.0    2137
#elapse
summary(riasec$elapse)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      3.0    145.0    187.0    489.2    262.0 509300.0     2132
plot(riasec$elapse) #we can see that there is one person who is really far off. 

riasec$elapse[riasec$elapse > 1200] <- NA #set all values > 1200 to NA
summary(riasec$elapse)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     3.0   144.0   185.0   225.8   254.0  1193.0    2288
plot(riasec$elapse) #this looks way better. 

head(riasec) #let's take a look at the head of our clean dataset. 
##   implementation R1 R2 R3 R4 R5 R6 R7 R8 I1 I2 I3 I4 I5 I6 I7 I8 A1 A2 A3
## 1              2  3  1  4  2  1  2  1  1  5  4  3  4  2  5  2  4  2  5  5
## 2              2  1  1  1  1  1  1  1  1  4  4  3  1  2  4  2  2  5  3  4
## 3              2  3  2  1  1  1  1  2  1  5  2  3  3  4  1  4  2  1  2  1
## 4              2  3  2  1  2  2  3  1  2  5  4  4  5  4  4  4  3  4  5  3
## 5              2 NA  2  3  2  3  2  1  3  5  2  4  4  4  3  4  3  1  1  2
## 6              2  3  1  3  4  3  4  3  3  3  4  3  3  2  3  3  4  2  3  4
##   A4 A5 A6 A7 A8 S1 S2 S3 S4 S5 S6 S7 S8 E1 E2 E3 E4 E5 E6 E7 E8 C1 C2 C3
## 1  5  5  4  2  5  4  4  3  4  4  4  3  3  2  2  3  1  4  1  1  4  1  1  1
## 2  3  3  5  1  3  1  3  3  1  3  2  2  3  1  1  1  1  1  1  1  1  1  1  1
## 3  1  2  1  3  1  4  5  4  4  4  2  2  4  3  2  2  3  4  2  4  2  4  3  2
## 4  4  3  5  4  3  3  4  3  4  4  5  3  2  1  1  4  2  3  4  2  2  2  4  3
## 5  1  2  2  4  3  3  2  3  4  2  3  3  2  3  4  2  1  4  3  4  2  3  4  4
## 6  2  3  3  4  4  2  2  3  2  2  2  3  2  3  2  3  2  4  2  4  4  3  4  3
##   C4 C5 C6 C7 C8 accuracy elapse country fromsearch
## 1  1  2  1  1  2       90    222      PT          0
## 2  1  1  1  1  1      100    102      US          0
## 3  3  3  4  4  4       95    264      US          1
## 4  2  1  3  2  1       60    189      SG          0
## 5  2  4  3  3  3       90    197      US          0
## 6  3  3  3  3  3       80    247      US          1

Demographic variables

Let’s take a look at the count of some demographic variables.

How many people participated from each country?

str(riasec$country) #this tells us that this variable has 127 levels, i.e. 127 countries from which people participated. 
##  Factor w/ 127 levels "A1","A2","AE",..: 102 122 122 109 122 122 122 44 122 122 ...
table(riasec$country) #this gives us an overview of how many people from each country participated
## 
##   A1   A2   AE   AF   AI   AL   AP   AR   AS   AT   AU   AZ   BA   BD   BE 
##    2    2   14    1    1    6    6   10    1   14  338    2    3    1   29 
##   BG   BH   BM   BN   BO   BR   BS   BY   CA   CH   CL   CN   CO   CR   CV 
##   10    2    1    3    1   28    1    3  610   35    7   26    8    6    1 
##   CW   CY   CZ   DE   DK   DO   EC   EE   EG   ES   EU   FI   FR   GB   GF 
##    1    5   20   79   14    1    1    5   18   39    9   25   45  480    1 
##   GH   GR   GT   GU   GY   HK   HN   HR   HU   ID   IE   IL   IM   IN   IQ 
##    4   21    2    2    1   66    1   13    7   39   65   15    1  115    2 
##   IR   IS   IT   JE   JM   JO   JP   KE   KG   KR   KW   LB   LK   LT   LU 
##   12    4   39    1   16    2   14    9    1   17    2   14    1    6    3 
##   LV   LY   MA   MD   ME   MK   MM   MN   MO   MT   MU   MW   MX   MY   MZ 
##    2    1    5    1    1    1    1    1    2    6    1    1   16  274    1 
##   NG   NL   NO   NZ   OM   PF   PH   PK   PL   PR   PS   PT   QA   RO   RS 
##    2   73   18   95   40    1   51   12   22    4    1   17    1   32   10 
##   RU   SA   SE   SG   SI   SK   SL   SV   SY   TH   TJ   TR   TT   TW   TZ 
##   10   11   37  124   36    7    1    1    1   22    1   22   11    3    2 
##   UA   US   UY   VE   VI   VN   ZA 
##    3 5389    3    7    2   11   47
summary(riasec$country) #this function shows the countries ordered by number of participants (most to least). The country with the most participants was the US, followed by Canada and then GB.
##      US      CA      GB      AU      MY      SG      IN      NZ      DE 
##    5389     610     480     338     274     124     115      95      79 
##      NL      HK      IE      PH      ZA      FR      OM      ES      ID 
##      73      66      65      51      47      45      40      39      39 
##      IT      SE      SI      CH      RO      BE      BR      CN      FI 
##      39      37      36      35      32      29      28      26      25 
##      PL      TH      TR      GR      CZ      EG      NO      KR      PT 
##      22      22      22      21      20      18      18      17      17 
##      JM      MX      IL      AE      AT      DK      JP      LB      HR 
##      16      16      15      14      14      14      14      14      13 
##      IR      PK      SA      TT      VN      AR      BG      RS      RU 
##      12      12      11      11      11      10      10      10      10 
##      EU      KE      CO      CL      HU      SK      VE      AL      AP 
##       9       9       8       7       7       7       7       6       6 
##      CR      LT      MT      CY      EE      MA      GH      IS      PR 
##       6       6       6       5       5       5       4       4       4 
##      BA      BN      BY      LU      TW      UA      UY      A1      A2 
##       3       3       3       3       3       3       3       2       2 
##      AZ      BH      GT      GU      IQ      JO      KW      LV      MO 
##       2       2       2       2       2       2       2       2       2 
##      NG      TZ      VI      AF      AI      AS      BD      BM (Other) 
##       2       2       2       1       1       1       1       1      29 
##    NA's 
##       1
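summary() lumps the rarer countries into an (Other) bucket; if the full ranking is needed, sorting the frequency table works as well (an optional sketch):

sort(table(riasec$country), decreasing = TRUE) # all 127 countries ordered by number of participants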

How many participants were referred to the test through a search engine (“fromsearch” = 1)?

table(riasec$fromsearch) #3761 participants were directed to the test through a search engine. 5094 were referred to the test through another website or source.
## 
##    0    1 
## 5094 3761
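Expressed as proportions (an optional extra), that corresponds to roughly 42% of participants coming from a search engine:

prop.table(table(riasec$fromsearch)) # share of participants per referral source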

Graphs

Create a histogram

Let’s create a histogram to visualize how accurate people thought they were:

hist(riasec$accuracy, breaks = 10, col = "Pink") #seems like almost everyone was pretty confident about their accuracy. 

#let's make it look a bit cooler
acc <- hist(riasec$accuracy, breaks = 10, plot = FALSE)
plot(acc, labels = TRUE, border = "Red", col = "Pink", main = "Histogram of participants' self-rated accuracy", xlab = "Accuracy")
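Note that breaks = 10 is only a suggestion to hist(); R picks its own “pretty” bin edges. To force bins of exactly width 10 over the 0-100 range, the breakpoints can be given explicitly (an optional sketch):

hist(riasec$accuracy, breaks = seq(0, 100, by = 10), col = "Pink", main = "Histogram of participants' self-rated accuracy", xlab = "Accuracy") # explicit bin edges at 0, 10, ..., 100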

Create a boxplot

Is there a difference in how people responded to item E8 depending on whether they were referred from a search engine or another source?

boxplot(E8 ~ fromsearch, data = riasec)

#let's make it look a bit cooler
boxplot(E8 ~ fromsearch, 
        data = riasec, 
        col = "turquoise", 
        notch = TRUE, 
        main = "Responses to item E8 by referral source",
        xlab = "Referral source (0 = other, 1 = search engine)",
        ylab = "Item E8"
        )
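To put a rough number on the visual comparison (optional), the group means of E8 can be computed directly:

tapply(riasec$E8, riasec$fromsearch, mean, na.rm = TRUE) # mean E8 rating for each referral group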

Correlations

Select columns that apparently belong together (e.g. E1-E8), compute their correlations, and also show the correlations visually.

E <- data.frame(riasec$E1,riasec$E2, riasec$E3, riasec$E4, riasec$E5, riasec$E6, riasec$E7, riasec$E8)
head(E)
##   riasec.E1 riasec.E2 riasec.E3 riasec.E4 riasec.E5 riasec.E6 riasec.E7
## 1         2         2         3         1         4         1         1
## 2         1         1         1         1         1         1         1
## 3         3         2         2         3         4         2         4
## 4         1         1         4         2         3         4         2
## 5         3         4         2         1         4         3         4
## 6         3         2         3         2         4         2         4
##   riasec.E8
## 1         4
## 2         1
## 3         2
## 4         2
## 5         2
## 6         4
cor(na.omit(E)) #calculate the correlations between the E-variables
##           riasec.E1 riasec.E2 riasec.E3 riasec.E4 riasec.E5 riasec.E6
## riasec.E1 1.0000000 0.3859963 0.4372540 0.2742256 0.4007532 0.3813569
## riasec.E2 0.3859963 1.0000000 0.3462392 0.3414766 0.2763812 0.5549665
## riasec.E3 0.4372540 0.3462392 1.0000000 0.3293392 0.5657515 0.4873247
## riasec.E4 0.2742256 0.3414766 0.3293392 1.0000000 0.2073681 0.5097738
## riasec.E5 0.4007532 0.2763812 0.5657515 0.2073681 1.0000000 0.4203779
## riasec.E6 0.3813569 0.5549665 0.4873247 0.5097738 0.4203779 1.0000000
## riasec.E7 0.4667292 0.3683945 0.4205886 0.2857571 0.3795733 0.3984515
## riasec.E8 0.2804928 0.3722055 0.3692438 0.3406575 0.2808970 0.4743587
##           riasec.E7 riasec.E8
## riasec.E1 0.4667292 0.2804928
## riasec.E2 0.3683945 0.3722055
## riasec.E3 0.4205886 0.3692438
## riasec.E4 0.2857571 0.3406575
## riasec.E5 0.3795733 0.2808970
## riasec.E6 0.3984515 0.4743587
## riasec.E7 1.0000000 0.2822687
## riasec.E8 0.2822687 1.0000000
pairs(na.omit(E)) #visualize the correlations

#the scatterplot matrix looks odd because each item only takes the values 1-5, so every panel is just a 5x5 grid of heavily overplotted points. The correlations above are fine; the pairs plot is simply not very informative for Likert-type data.

This was an overview of some (very) basic dataset inspections. The pairs plot of the correlations is hard to read for the reason noted above, and I’m sure there are easier ways to combine several variables (columns) into one data frame to look at the correlations; one such shortcut is sketched below.
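For reference, a shorter way to select the E-columns and compute the correlations, using only base R (this also keeps the original column names E1-E8; use = "pairwise.complete.obs" drops missing values pair by pair instead of removing whole rows first):

E <- riasec[, paste0("E", 1:8)]                  # select columns E1 to E8 by name
round(cor(E, use = "pairwise.complete.obs"), 2)  # correlation matrix, rounded for readability
pairs(E)                                         # same scatterplot matrix as before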