I downloaded the two datasets (RIASEC and Big5) and created a new R project called “Midterm_RH”. This R Markdown document will focus on the Big5 dataset.

1. Import the data into R with the read.table function

big5 <- read.table("~/Documents/Psychologie/7. Semester/R/Blockseminar R/Midterm/BIG5/data.csv", sep="\t", header = TRUE) # sep="\t" because the file is tab-separated; header = TRUE so the first line is read as the column names instead of as data

head(big5) #yes! it worked!
##   race age engnat gender hand source country E1 E2 E3 E4 E5 E6 E7 E8 E9
## 1    3  53      1      1    1      1      US  4  2  5  2  5  1  4  3  5
## 2   13  46      1      2    1      1      US  2  2  3  3  3  3  1  5  1
## 3    1  14      2      2    1      1      PK  5  1  1  4  5  1  1  5  5
## 4    3  19      2      2    1      1      RO  2  5  2  4  3  4  3  4  4
## 5   11  25      2      2    1      2      US  3  1  3  3  3  1  3  1  3
## 6   13  31      1      2    1      2      US  1  5  2  4  1  3  2  4  1
##   E10 N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 C1 C2
## 1   1  1  5  2  5  1  1  1  1  1   1  1  5  1  5  2  3  1  5  4   5  4  1
## 2   5  2  3  4  2  3  4  3  2  2   4  1  3  3  4  4  4  2  3  4   3  4  1
## 3   1  5  1  5  5  5  5  5  5  5   5  5  1  5  5  1  5  1  5  5   5  4  1
## 4   5  5  4  4  2  4  5  5  5  4   5  2  5  4  4  3  5  3  4  4   3  3  3
## 5   5  3  3  3  4  3  3  3  3  3   4  5  5  3  5  1  5  1  5  5   5  3  1
## 6   5  1  5  4  5  1  4  4  1  5   2  2  2  3  4  3  4  3  5  5   3  2  5
##   C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4 O5 O6 O7 O8 O9 O10
## 1  5  1  5  1  4  1  4   5  4  1  3  1  5  1  4  2  5   5
## 2  3  2  3  1  5  1  4   4  3  3  3  3  2  3  3  1  3   2
## 3  5  1  5  1  5  1  5   5  4  5  5  1  5  1  5  5  5   5
## 4  4  5  1  4  5  4  2   3  4  3  5  2  4  2  5  2  5   5
## 5  5  3  3  1  1  3  3   3  3  1  1  1  3  1  3  1  5   3
## 6  4  3  3  4  5  3  5   3  4  2  1  3  3  5  5  4  5   3
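As a side note, the same import could also be done with read.delim(), which uses sep = "\t" and header = TRUE by default for tab-separated files:

big5 <- read.delim("~/Documents/Psychologie/7. Semester/R/Blockseminar R/Midterm/BIG5/data.csv") # same result as the read.table() call above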

2. Basic Inspections

What’s the size (rows, cols) of the big5 dataset?

nrow(big5) #This dataset contains 19719 rows of data
## [1] 19719
ncol(big5) #This dataset has 57 columns, i.e. 57 variables
## [1] 57
names(big5) #What are the names of the variables?
##  [1] "race"    "age"     "engnat"  "gender"  "hand"    "source"  "country"
##  [8] "E1"      "E2"      "E3"      "E4"      "E5"      "E6"      "E7"     
## [15] "E8"      "E9"      "E10"     "N1"      "N2"      "N3"      "N4"     
## [22] "N5"      "N6"      "N7"      "N8"      "N9"      "N10"     "A1"     
## [29] "A2"      "A3"      "A4"      "A5"      "A6"      "A7"      "A8"     
## [36] "A9"      "A10"     "C1"      "C2"      "C3"      "C4"      "C5"     
## [43] "C6"      "C7"      "C8"      "C9"      "C10"     "O1"      "O2"     
## [50] "O3"      "O4"      "O5"      "O6"      "O7"      "O8"      "O9"     
## [57] "O10"

We now know that this dataset contains 19719 rows of data and 57 columns, i.e. 57 variables (7 demographic variables plus the 50 item responses). The names() function shows us the names of all 57 variables.
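As a small aside, dim() returns both numbers in one call and str() gives a compact overview of every column (output omitted here):

dim(big5) # rows and columns at once: 19719 57
str(big5) # type and first few values of each of the 57 variables (not shown)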

What do we know about the background of this dataset?

On the homepage and in the codebook, we find some additional information about the background and the variables of this dataset:

  • The big five personality traits are the best accepted and most commonly used model of personality in academic psychology.
  • This dataset was generated by an interactive version of the IPIP Big-Five Factor Markers.
  • This test uses the Big-Five Factor Markers from the International Personality Item Pool, developed by Goldberg (1992).

About the test

  • This test consists of fifty statements. Each statement must be rated on how much you agree with that statement on a five-point scale: (1) disagree, (2) slightly disagree, (3) neutral, (4) slightly agree, and (5) agree. It should take most people three to eight minutes to complete.
  • There are also several demographic variables: race, age, engnat, gender, hand, country and source.

Measures

Variables in this dataset

  1. race: Chosen from a drop down menu. 1=Mixed Race, 2=Arctic (Siberian, Eskimo), 3=Caucasian (European), 4=Caucasian (Indian), 5=Caucasian (Middle East), 6=Caucasian (North African, Other), 7=Indigenous Australian, 8=Native American, 9=North East Asian (Mongol, Tibetan, Korean Japanese, etc), 10=Pacific (Polynesian, Micronesian, etc), 11=South East Asian (Chinese, Thai, Malay, Filipino, etc), 12=West African, Bushmen, Ethiopian, 13=Other (0=missed)

  2. age: entered as text (individuals reporting age < 13 were not recorded)

  3. engnat: Response to “is English your native language?”. 1=yes, 2=no (0=missed)

  4. gender: Chosen from a drop down menu. 1=Male, 2=Female, 3=Other (0=missed)

  5. hand: “What hand do you use to write with?”. 1=Right, 2=Left, 3=Both (0=missed)

Calculated from technical information:

  1. source: How the participant came to the test. Based on HTTP Referer. 1=from another page on the test website, 2=from google, 3=from facebook, 4=from any url with “.edu” in its domain name (e.g. xxx.edu, xxx.edu.au), 6=other source, or HTTP Referer not provided.

  2. country: ISO country code.

  3. 50 Likert (5-point) rated items: E1-E10, N1-N10, A1-A10, C1-C10, O1-O10
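Purely as a sketch (not part of the required steps), the numeric codes above could be attached as readable factor labels, e.g. for gender; the column name gender_f is made up for this example:

big5$gender_f <- factor(big5$gender,
                        levels = c(1, 2, 3),
                        labels = c("Male", "Female", "Other")) # codes outside 1-3, such as 0 = missed, become NA
table(big5$gender_f, useNA = "ifany") # counts per label, including missing values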

Summary Statistics

Summary statistics of all variables

summary(big5)
##       race             age                engnat          gender     
##  Min.   : 0.000   Min.   :       13   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 3.000   1st Qu.:       18   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 3.000   Median :       22   Median :1.000   Median :2.000  
##  Mean   : 5.324   Mean   :    50767   Mean   :1.365   Mean   :1.617  
##  3rd Qu.: 8.000   3rd Qu.:       31   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :13.000   Max.   :999999999   Max.   :2.000   Max.   :3.000  
##                                                                      
##       hand          source         country           E1       
##  Min.   :0.00   Min.   :1.000   US     :8753   Min.   :0.000  
##  1st Qu.:1.00   1st Qu.:1.000   GB     :1531   1st Qu.:2.000  
##  Median :1.00   Median :1.000   IN     :1464   Median :3.000  
##  Mean   :1.13   Mean   :1.952   AU     : 974   Mean   :2.629  
##  3rd Qu.:1.00   3rd Qu.:2.000   CA     : 924   3rd Qu.:4.000  
##  Max.   :3.00   Max.   :5.000   (Other):6065   Max.   :5.000  
##                                 NA's   :   8                  
##        E2             E3              E4              E5       
##  Min.   :0.00   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.00   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.00   Median :4.000   Median :3.000   Median :4.000  
##  Mean   :2.76   Mean   :3.417   Mean   :3.152   Mean   :3.432  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                
##        E6              E7              E8              E9       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :2.453   Mean   :2.867   Mean   :3.376   Mean   :3.094  
##  3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##       E10              N1              N2              N3       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000  
##  Median :4.000   Median :3.000   Median :3.000   Median :4.000  
##  Mean   :3.585   Mean   :3.262   Mean   :3.235   Mean   :3.843  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##        N4              N5              N6             N7       
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.00   Median :3.000  
##  Mean   :2.756   Mean   :2.952   Mean   :2.98   Mean   :3.152  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##                                                                
##        N8              N9             N10              A1       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :2.000  
##  Mean   :2.803   Mean   :3.135   Mean   :2.834   Mean   :2.312  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##        A2              A3              A4             A5       
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:1.000   1st Qu.:4.00   1st Qu.:1.000  
##  Median :4.000   Median :2.000   Median :4.00   Median :2.000  
##  Mean   :3.927   Mean   :2.163   Mean   :4.03   Mean   :2.166  
##  3rd Qu.:5.000   3rd Qu.:3.000   3rd Qu.:5.00   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##                                                                
##        A6              A7              A8              A9       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :4.000   Median :2.000   Median :4.000   Median :4.000  
##  Mean   :3.896   Mean   :2.161   Mean   :3.766   Mean   :3.945  
##  3rd Qu.:5.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##       A10              C1              C2              C3       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:3.000  
##  Median :4.000   Median :3.000   Median :3.000   Median :4.000  
##  Mean   :3.682   Mean   :3.318   Mean   :2.979   Mean   :3.983  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##        C4              C5            C6              C7       
##  Min.   :0.000   Min.   :0.0   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.0   1st Qu.:2.000   1st Qu.:3.000  
##  Median :3.000   Median :3.0   Median :3.000   Median :4.000  
##  Mean   :2.654   Mean   :2.7   Mean   :2.923   Mean   :3.647  
##  3rd Qu.:4.000   3rd Qu.:4.0   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.0   Max.   :5.000   Max.   :5.000  
##                                                               
##        C8              C9             C10              O1       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :2.000   Median :3.000   Median :4.000   Median :4.000  
##  Mean   :2.481   Mean   :3.224   Mean   :3.637   Mean   :3.692  
##  3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##        O2             O3              O4              O5       
##  Min.   :0.00   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.00   1st Qu.:4.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :2.00   Median :4.000   Median :2.000   Median :4.000  
##  Mean   :2.15   Mean   :4.126   Mean   :2.079   Mean   :3.873  
##  3rd Qu.:3.00   3rd Qu.:5.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                
##        O6              O7              O8              O9       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:4.000   1st Qu.:2.000   1st Qu.:4.000  
##  Median :1.000   Median :4.000   Median :3.000   Median :4.000  
##  Mean   :1.795   Mean   :4.073   Mean   :3.208   Mean   :4.134  
##  3rd Qu.:2.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##       O10       
##  Min.   :0.000  
##  1st Qu.:3.000  
##  Median :4.000  
##  Mean   :4.005  
##  3rd Qu.:5.000  
##  Max.   :5.000  
## 

When we look at the summary statistics, we notice that there are a few issues with the data.

  • all missing data have been coded as 0
  • age: there are some implausible values in this column (birth years and huge numbers); values > 100 should be set to NA.

Let’s get rid of messy data

So let’s get rolling and clean this dataset up.

  1. reassign all values with 0 (missing data) to NA
#missing data
big5[big5 == 0] <- NA #set all missing values to NA
head(big5) #test to see if it worked
##   race age engnat gender hand source country E1 E2 E3 E4 E5 E6 E7 E8 E9
## 1    3  53      1      1    1      1      US  4  2  5  2  5  1  4  3  5
## 2   13  46      1      2    1      1      US  2  2  3  3  3  3  1  5  1
## 3    1  14      2      2    1      1      PK  5  1  1  4  5  1  1  5  5
## 4    3  19      2      2    1      1      RO  2  5  2  4  3  4  3  4  4
## 5   11  25      2      2    1      2      US  3  1  3  3  3  1  3  1  3
## 6   13  31      1      2    1      2      US  1  5  2  4  1  3  2  4  1
##   E10 N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 C1 C2
## 1   1  1  5  2  5  1  1  1  1  1   1  1  5  1  5  2  3  1  5  4   5  4  1
## 2   5  2  3  4  2  3  4  3  2  2   4  1  3  3  4  4  4  2  3  4   3  4  1
## 3   1  5  1  5  5  5  5  5  5  5   5  5  1  5  5  1  5  1  5  5   5  4  1
## 4   5  5  4  4  2  4  5  5  5  4   5  2  5  4  4  3  5  3  4  4   3  3  3
## 5   5  3  3  3  4  3  3  3  3  3   4  5  5  3  5  1  5  1  5  5   5  3  1
## 6   5  1  5  4  5  1  4  4  1  5   2  2  2  3  4  3  4  3  5  5   3  2  5
##   C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4 O5 O6 O7 O8 O9 O10
## 1  5  1  5  1  4  1  4   5  4  1  3  1  5  1  4  2  5   5
## 2  3  2  3  1  5  1  4   4  3  3  3  3  2  3  3  1  3   2
## 3  5  1  5  1  5  1  5   5  4  5  5  1  5  1  5  5  5   5
## 4  4  5  1  4  5  4  2   3  4  3  5  2  4  2  5  2  5   5
## 5  5  3  3  1  1  3  3   3  3  1  1  1  3  1  3  1  5   3
## 6  4  3  3  4  5  3  5   3  4  2  1  3  3  5  5  4  5   3
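#optional sanity check (a sketch): after the recode above, no zeros should be left anywhere
any(big5 == 0, na.rm = TRUE) #expected to return FALSE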
#messy data

#age
summary(big5$age) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.300e+01 1.800e+01 2.200e+01 5.077e+04 3.100e+01 1.000e+09
table(big5$age) #it seems like some people didn't just put their age in years, but the year of their birth. 
## 
##        13        14        15        16        17        18        19 
##       238       475       743      1148      1370      1523      1259 
##        20        21        22        23        24        25        26 
##      1231      1216       970       895       703       691       520 
##        27        28        29        30        31        32        33 
##       492       449       361       372       315       329       262 
##        34        35        36        37        38        39        40 
##       243       245       223       198       206       167       180 
##        41        42        43        44        45        46        47 
##       174       169       180       136       182       143       118 
##        48        49        50        51        52        53        54 
##       141       131       131       106       116        93        94 
##        55        56        57        58        59        60        61 
##        96        72        80        60        43        71        28 
##        62        63        64        65        66        67        68 
##        36        25        29        27        24        19        20 
##        69        70        71        72        73        74        75 
##        15        12        11         8         2         2         5 
##        76        77        78        79        80        92        97 
##         2         3         1         2         1         1         1 
##        99       100       118       188       191       208       211 
##         1         1         1         2         1         1         1 
##       223       266      1961      1964      1968      1974      1976 
##         1         1         1         1         1         1         2 
##      1977      1982      1984      1985      1986      1988      1989 
##         1         4         2         2         2         1         5 
##      1990      1991      1992      1993      1994      1995      1996 
##         3         3         9         5         8         5         7 
##      1997      1998      1999      2000    412434 999999999 
##         4         4         1         1         1         1
#The mean of the plausible ages (i.e. after the cleaning step below) is 26.26 yrs.

big5$age[big5$age > 100] <- NA
#when we run the table and summary function again, we see that there are now no more values above 100. 
#there are 83 NA's. 
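#optional check (a sketch): re-running summary() and counting the NAs confirms the cleaning
summary(big5$age)    #Max. should now be at most 100, with the NA count reported at the end
sum(is.na(big5$age)) #number of ages that were set to NA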

head(big5) #let's take a look at the head of our clean dataset. 
##   race age engnat gender hand source country E1 E2 E3 E4 E5 E6 E7 E8 E9
## 1    3  53      1      1    1      1      US  4  2  5  2  5  1  4  3  5
## 2   13  46      1      2    1      1      US  2  2  3  3  3  3  1  5  1
## 3    1  14      2      2    1      1      PK  5  1  1  4  5  1  1  5  5
## 4    3  19      2      2    1      1      RO  2  5  2  4  3  4  3  4  4
## 5   11  25      2      2    1      2      US  3  1  3  3  3  1  3  1  3
## 6   13  31      1      2    1      2      US  1  5  2  4  1  3  2  4  1
##   E10 N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 C1 C2
## 1   1  1  5  2  5  1  1  1  1  1   1  1  5  1  5  2  3  1  5  4   5  4  1
## 2   5  2  3  4  2  3  4  3  2  2   4  1  3  3  4  4  4  2  3  4   3  4  1
## 3   1  5  1  5  5  5  5  5  5  5   5  5  1  5  5  1  5  1  5  5   5  4  1
## 4   5  5  4  4  2  4  5  5  5  4   5  2  5  4  4  3  5  3  4  4   3  3  3
## 5   5  3  3  3  4  3  3  3  3  3   4  5  5  3  5  1  5  1  5  5   5  3  1
## 6   5  1  5  4  5  1  4  4  1  5   2  2  2  3  4  3  4  3  5  5   3  2  5
##   C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4 O5 O6 O7 O8 O9 O10
## 1  5  1  5  1  4  1  4   5  4  1  3  1  5  1  4  2  5   5
## 2  3  2  3  1  5  1  4   4  3  3  3  3  2  3  3  1  3   2
## 3  5  1  5  1  5  1  5   5  4  5  5  1  5  1  5  5  5   5
## 4  4  5  1  4  5  4  2   3  4  3  5  2  4  2  5  2  5   5
## 5  5  3  3  1  1  3  3   3  3  1  1  1  3  1  3  1  5   3
## 6  4  3  3  4  5  3  5   3  4  2  1  3  3  5  5  4  5   3

Demographic variables

Let’s take a look at the count of some demographic variables.

How many people participated from each country?

str(big5$country) #this tells us that country is a factor with 159 levels, i.e. 159 distinct country codes (including an empty code and the odd value "(nu")
##  Factor w/ 159 levels "","(nu","A1",..: 150 150 121 128 150 150 150 73 150 150 ...
table(big5$country)#this gives us an overview of how many people from each country participated
## 
##       (nu   A1   A2   AE   AG   AL   AO   AP   AR   AS   AT   AU   AZ   BA 
##    1  369    8    9  100    1   12    1   19   41    1   20  974    4   10 
##   BB   BD   BE   BF   BG   BH   BM   BN   BO   BR   BS   BT   BW   BZ   CA 
##    2   44   86    1   41    8    8    5    3  175    2    1    4   17  924 
##   CH   CL   CM   CN   CO   CR   CV   CY   CZ   DE   DK   DO   DZ   EC   EE 
##   40   18    2   40   18    9    1    8   28  191  122    5    4    6   13 
##   EG   ES   ET   EU   FI   FJ   FO   FR   GB   GD   GE   GG   GH   GP   GR 
##   49   82    1   24   90    2    1  129 1531    1    4    2   20    1   85 
##   GT   GU   GY   HK   HN   HR   HT   HU   ID   IE   IL   IM   IN   IQ   IR 
##    3    1    1   41    4   40    2   34  172  107   27    1 1464    2   17 
##   IS   IT   JE   JM   JO   JP   KE   KG   KH   KR   KW   KY   KZ   LA   LB 
##   13  277    3   28   14   37   43    1    3   30    6    1    1    2   41 
##   LK   LS   LT   LV   LY   MA   ME   MK   MM   MN   MP   MR   MT   MU   MV 
##   31    2   29   21    2    9    3    7    3    2    2    1   11    8    3 
##   MW   MX   MY   MZ   NG   NI   NL   NO   NP   NZ   OM   PA   PE   PG   PH 
##    2   82  247    2   35    2  133  147   10  157    6    4    8    2  649 
##   PK   PL   PR   PT   PW   PY   QA   RO   RS   RU   RW   SA   SD   SE   SG 
##  222   79   16   88    1    2   10  135   85   19    2   45    1  169  133 
##   SI   SK   SR   SV   SY   TC   TH   TN   TR   TT   TW   TZ   UA   UG   US 
##   34   22    1    6    2    1   42    7   70   23   26    2   12   11 8753 
##   UY   UZ   VC   VE   VI   VN   ZA   ZM   ZW 
##    2    1    2   17    2   30  179    2    3
summary(big5$country) #on a factor, summary() lists the most frequent levels first. The country with the most participants was the US, followed by GB.
##      US      GB      IN      AU      CA      PH     (nu      IT      MY 
##    8753    1531    1464     974     924     649     369     277     247 
##      PK      DE      ZA      BR      ID      SE      NZ      NO      RO 
##     222     191     179     175     172     169     157     147     135 
##      NL      SG      FR      DK      IE      AE      FI      PT      BE 
##     133     133     129     122     107     100      90      88      86 
##      GR      RS      ES      MX      PL      TR      EG      SA      BD 
##      85      85      82      82      79      70      49      45      44 
##      KE      TH      AR      BG      HK      LB      CH      CN      HR 
##      43      42      41      41      41      41      40      40      40 
##      JP      NG      HU      SI      LK      KR      VN      LT      CZ 
##      37      35      34      34      31      30      30      29      28 
##      JM      IL      TW      EU      TT      SK      LV      AT      GH 
##      28      27      26      24      23      22      21      20      20 
##      AP      RU      CL      CO      BZ      IR      VE      PR      JO 
##      19      19      18      18      17      17      17      16      14 
##      EE      IS      AL      UA      MT      UG      BA      NP      QA 
##      13      13      12      12      11      11      10      10      10 
##      A2      CR      MA      A1      BH      BM      CY      MU      PE 
##       9       9       9       8       8       8       8       8       8 
##      MK      TN      EC      KW      OM      SV      BN      DO (Other) 
##       7       7       6       6       6       6       5       5     119 
##    NA's 
##       8
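A slightly more direct way to rank the countries (a small sketch using only base R) is to sort the table itself:

head(sort(table(big5$country), decreasing = TRUE), 10) #the ten most frequent country codes, largest first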

How many participants were referred to the test through a search engine (“source” = 2, i.e. from Google)?

table(big5$source)
## 
##     1     2     3     4     5 
## 12099  3653   303   137  3527
#1=from another page on the test website --> 12099 participants
#2=from google --> 3653 participants
#3=from facebook --> 303
#4=from any url with ".edu" in its domain name (e.g. xxx.edu, xxx.edu.au) --> 137
#5=other source, or HTTP Referer not provided --> 3527 participants (the codebook labels this category 6 and skips 5; I assume the observed value 5 corresponds to it)
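If only the single number is needed (here, the 3653 participants who arrived via Google), it can also be computed directly (a sketch):

sum(big5$source == 2, na.rm = TRUE) #participants with source = 2, i.e. referred from google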

How many participants were Caucasian European?

table(big5$race) #10537 participants were Caucasian (European). This is also the largest racial group in the sample.
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
##  1434    14 10537  1518   515   397    24   201   188    65  1861   259 
##    13 
##  2553

How many subjects had English as their native language?

table(big5$engnat)
## 
##     1     2 
## 12379  7270
#12379 people had English as their native language. 

How many males/females participated?

table(big5$gender)
## 
##     1     2     3 
##  7608 11985   102
#7608 = male
#11985 = female
#102 = other

Was the majority of subjects left- or right-handed?

table(big5$hand)
## 
##     1     2     3 
## 17424  1724   471
#the majority (17424 participants) was right-handed.
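Expressed as proportions (a small sketch), the dominance of right-handers is even clearer:

round(prop.table(table(big5$hand)), 3) #shares of right-, left- and both-handed participants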

Graphs

Create a histogram

Let’s create a histogram to visualize how many subjects we had from each race:

hist(big5$race, breaks = 13, col = "Green") #as expected from the table above, the tallest bar is at race = 3, Caucasian (European), by far the largest group.

#let's make it look a bit cooler
ra <- hist(big5$race, breaks = 13, plot = FALSE)
plot(ra, labels = TRUE, border = "blue", col = "Green", main = "Histogram of participants' race", xlab = "Race")
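Since race is a categorical code rather than a continuous measurement, a barplot of the counts is arguably the more natural display; a quick sketch:

barplot(table(big5$race),
        col = "green",
        main = "Participants per race category",
        xlab = "Race code",
        ylab = "Count") #one bar per race code, heights taken directly from table(big5$race)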

Create a boxplot

Let’s create a boxplot to visualize how old people of each race were when they participated in this study.

boxplot(age ~ race, data = big5)

#let's make it look a bit cooler
boxplot(age ~ race, 
        data = big5, 
        col = "dark red", 
        notch = TRUE, 
        main = "Boxplot",
        xlab = "Race",
        ylab = "Age",
        add = TRUE 
        )
## Warning in bxp(structure(list(stats = structure(c(13, 17, 20, 26, 39, 14,
## : some notches went outside hinges ('box'): maybe set notch=FALSE
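A variant that avoids the warning (a sketch): draw the styled boxplot as a fresh figure instead of overlaying it with add = TRUE, and leave out the notches that triggered the message.

boxplot(age ~ race,
        data = big5,
        col = "darkred",
        main = "Age by race",
        xlab = "Race",
        ylab = "Age") #same plot as above, but drawn on its own and without notches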

Correlations

Select columns that apparently belong together (e.g. E1-E10) and compute the correlations; also show the correlations visually.

E <- data.frame(big5$E1, big5$E2, big5$E3, big5$E4, big5$E5, big5$E6, big5$E7, big5$E8, big5$E9, big5$E10)
head(E)
##   big5.E1 big5.E2 big5.E3 big5.E4 big5.E5 big5.E6 big5.E7 big5.E8 big5.E9
## 1       4       2       5       2       5       1       4       3       5
## 2       2       2       3       3       3       3       1       5       1
## 3       5       1       1       4       5       1       1       5       5
## 4       2       5       2       4       3       4       3       4       4
## 5       3       1       3       3       3       1       3       1       3
## 6       1       5       2       4       1       3       2       4       1
##   big5.E10
## 1        1
## 2        5
## 3        1
## 4        5
## 5        5
## 6        5
cor(na.omit(E)) #calculate the correlations between the E-variables
##             big5.E1    big5.E2    big5.E3    big5.E4    big5.E5    big5.E6
## big5.E1   1.0000000 -0.4213321  0.4741228 -0.4841974  0.4789539 -0.3469641
## big5.E2  -0.4213321  1.0000000 -0.4459679  0.5275965 -0.5399614  0.5707289
## big5.E3   0.4741228 -0.4459679  1.0000000 -0.4815417  0.5905066 -0.3938041
## big5.E4  -0.4841974  0.5275965 -0.4815417  1.0000000 -0.5105973  0.4747965
## big5.E5   0.4789539 -0.5399614  0.5905066 -0.5105973  1.0000000 -0.4810786
## big5.E6  -0.3469641  0.5707289 -0.3938041  0.4747965 -0.4810786  1.0000000
## big5.E7   0.5880106 -0.4802477  0.5797732 -0.5036264  0.6307028 -0.4057580
## big5.E8  -0.3669305  0.3733056 -0.3205900  0.4460278 -0.3451264  0.3201933
## big5.E9   0.4553453 -0.3650355  0.4232972 -0.4511764  0.4159735 -0.3304710
## big5.E10 -0.4147061  0.4634899 -0.4744953  0.5103011 -0.5429647  0.4118512
##             big5.E7    big5.E8    big5.E9   big5.E10
## big5.E1   0.5880106 -0.3669305  0.4553453 -0.4147061
## big5.E2  -0.4802477  0.3733056 -0.3650355  0.4634899
## big5.E3   0.5797732 -0.3205900  0.4232972 -0.4744953
## big5.E4  -0.5036264  0.4460278 -0.4511764  0.5103011
## big5.E5   0.6307028 -0.3451264  0.4159735 -0.5429647
## big5.E6  -0.4057580  0.3201933 -0.3304710  0.4118512
## big5.E7   1.0000000 -0.3451934  0.4331032 -0.5335745
## big5.E8  -0.3451934  1.0000000 -0.5147706  0.3807093
## big5.E9   0.4331032 -0.5147706  1.0000000 -0.3718664
## big5.E10 -0.5335745  0.3807093 -0.3718664  1.0000000
pairs(na.omit(E)) #visualize the correlations

#the mix of positive and negative correlations is not an error: the odd-numbered E items correlate positively with each other and negatively with the even-numbered ones, which suggests that every other item is negatively worded (reverse-keyed).

This was an overview of some (very) basic dataset inspections. I initially couldn’t make sense of the correlation pattern (see the note above), and there are certainly easier ways to combine several variables (columns) into one data frame; one possible approach is sketched below.
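One possible approach (a sketch, building on the sign pattern noted above): select the ten E columns by name with paste0(), and recode the items that appear to be reverse-keyed (assumed here to be the even-numbered ones, based on the correlation matrix) as 6 minus the raw score, so that all items point in the same direction.

E_items <- big5[, paste0("E", 1:10)]                  #select columns E1 to E10 by name
reversed <- paste0("E", seq(2, 10, by = 2))           #assumption: the even-numbered items are reverse-keyed
E_items[reversed] <- 6 - E_items[reversed]            #recode 1<->5, 2<->4; NAs stay NA
round(cor(E_items, use = "pairwise.complete.obs"), 2) #correlations based on all available pairs
pairs(E_items)                                        #scatterplot matrix of the recoded items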

  1. Create a separate RMarkdown document. Create a brief documentation of 20 important basic R functions and group these functions reasonably (make use of unevaluated chunks for code highlighting), e.g.:

checking and modifying the global environment / workspace:

ls()            # list all variables
rm(x)           # remove a variable x
rm(list = ls()) # remove all variables that can be found by ls()