I downloaded the RIASEC and Big5 datasets and created a new R project called “Midterm_RH”. This Markdown document will focus on the Big5 dataset.
big5 <- read.table("~/Documents/Psychologie/7. Semester/R/Blockseminar R/Midterm/BIG5/data.csv", sep="\t", header = TRUE) # header = TRUE treats the first row of the file as column names rather than data
head(big5) #yes! it worked!
## race age engnat gender hand source country E1 E2 E3 E4 E5 E6 E7 E8 E9
## 1 3 53 1 1 1 1 US 4 2 5 2 5 1 4 3 5
## 2 13 46 1 2 1 1 US 2 2 3 3 3 3 1 5 1
## 3 1 14 2 2 1 1 PK 5 1 1 4 5 1 1 5 5
## 4 3 19 2 2 1 1 RO 2 5 2 4 3 4 3 4 4
## 5 11 25 2 2 1 2 US 3 1 3 3 3 1 3 1 3
## 6 13 31 1 2 1 2 US 1 5 2 4 1 3 2 4 1
## E10 N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 C1 C2
## 1 1 1 5 2 5 1 1 1 1 1 1 1 5 1 5 2 3 1 5 4 5 4 1
## 2 5 2 3 4 2 3 4 3 2 2 4 1 3 3 4 4 4 2 3 4 3 4 1
## 3 1 5 1 5 5 5 5 5 5 5 5 5 1 5 5 1 5 1 5 5 5 4 1
## 4 5 5 4 4 2 4 5 5 5 4 5 2 5 4 4 3 5 3 4 4 3 3 3
## 5 5 3 3 3 4 3 3 3 3 3 4 5 5 3 5 1 5 1 5 5 5 3 1
## 6 5 1 5 4 5 1 4 4 1 5 2 2 2 3 4 3 4 3 5 5 3 2 5
## C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4 O5 O6 O7 O8 O9 O10
## 1 5 1 5 1 4 1 4 5 4 1 3 1 5 1 4 2 5 5
## 2 3 2 3 1 5 1 4 4 3 3 3 3 2 3 3 1 3 2
## 3 5 1 5 1 5 1 5 5 4 5 5 1 5 1 5 5 5 5
## 4 4 5 1 4 5 4 2 3 4 3 5 2 4 2 5 2 5 5
## 5 5 3 3 1 1 3 3 3 3 1 1 1 3 1 3 1 5 3
## 6 4 3 3 4 5 3 5 3 4 2 1 3 3 5 5 4 5 3
nrow(big5) #This dataset contains 19719 rows of data
## [1] 19719
ncol(big5) #This dataset has 57 columns, i.e. 57 variables
## [1] 57
names(big5) #What are the names of the variables?
## [1] "race" "age" "engnat" "gender" "hand" "source" "country"
## [8] "E1" "E2" "E3" "E4" "E5" "E6" "E7"
## [15] "E8" "E9" "E10" "N1" "N2" "N3" "N4"
## [22] "N5" "N6" "N7" "N8" "N9" "N10" "A1"
## [29] "A2" "A3" "A4" "A5" "A6" "A7" "A8"
## [36] "A9" "A10" "C1" "C2" "C3" "C4" "C5"
## [43] "C6" "C7" "C8" "C9" "C10" "O1" "O2"
## [50] "O3" "O4" "O5" "O6" "O7" "O8" "O9"
## [57] "O10"
We now know that this dataset contains 19719 rows of data and 57 columns, i.e. 57 variables: 7 demographic and technical variables plus 50 personality items. The names() function shows us the names of all 57 variables.
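Alternatively, str() gives a compact overview of every variable and its type in a single call; a quick sketch (list.len just limits the printout to the first 10 columns):
str(big5, list.len = 10) # structure of the data frame: dimensions, column types, first values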
On the homepage and in the codebook, we find some additional information about the background and variables of this dataset:
race: Chosen from a drop down menu. 1=Mixed Race, 2=Arctic (Siberian, Eskimo), 3=Caucasian (European), 4=Caucasian (Indian), 5=Caucasian (Middle East), 6=Caucasian (North African, Other), 7=Indigenous Australian, 8=Native American, 9=North East Asian (Mongol, Tibetan, Korean, Japanese, etc), 10=Pacific (Polynesian, Micronesian, etc), 11=South East Asian (Chinese, Thai, Malay, Filipino, etc), 12=West African, Bushmen, Ethiopian, 13=Other (0=missed)
age: entered as text (individuals reporting age < 13 were not recorded)
engnat: Response to “is English your native language?”. 1=yes, 2=no (0=missed)
gender: Chosen from a drop down menu. 1=Male, 2=Female, 3=Other (0=missed)
hand: “What hand do you use to write with?”. 1=Right, 2=Left, 3=Both (0=missed)
Calculated from technical information:
source: How the participant came to the test. Based on HTTP Referer. 1=from another page on the test website, 2=from google, 3=from facebook, 4=from any url with “.edu” in its domain name (e.g. xxx.edu, xxx.edu.au), 6=other source, or HTTP Referer not provided.
country: ISO country code.
50 items rated on a 5-point Likert scale
Summary statistics of all variables
summary(big5)
## race age engnat gender
## Min. : 0.000 Min. : 13 Min. :0.000 Min. :0.000
## 1st Qu.: 3.000 1st Qu.: 18 1st Qu.:1.000 1st Qu.:1.000
## Median : 3.000 Median : 22 Median :1.000 Median :2.000
## Mean : 5.324 Mean : 50767 Mean :1.365 Mean :1.617
## 3rd Qu.: 8.000 3rd Qu.: 31 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :13.000 Max. :999999999 Max. :2.000 Max. :3.000
##
## hand source country E1
## Min. :0.00 Min. :1.000 US :8753 Min. :0.000
## 1st Qu.:1.00 1st Qu.:1.000 GB :1531 1st Qu.:2.000
## Median :1.00 Median :1.000 IN :1464 Median :3.000
## Mean :1.13 Mean :1.952 AU : 974 Mean :2.629
## 3rd Qu.:1.00 3rd Qu.:2.000 CA : 924 3rd Qu.:4.000
## Max. :3.00 Max. :5.000 (Other):6065 Max. :5.000
## NA's : 8
## E2 E3 E4 E5
## Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.00 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.00 Median :4.000 Median :3.000 Median :4.000
## Mean :2.76 Mean :3.417 Mean :3.152 Mean :3.432
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000
##
## E6 E7 E8 E9
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.453 Mean :2.867 Mean :3.376 Mean :3.094
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## E10 N1 N2 N3
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :3.000 Median :3.000 Median :4.000
## Mean :3.585 Mean :3.262 Mean :3.235 Mean :3.843
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## N4 N5 N6 N7
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.00 Median :3.000
## Mean :2.756 Mean :2.952 Mean :2.98 Mean :3.152
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
##
## N8 N9 N10 A1
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :3.000 Median :2.000
## Mean :2.803 Mean :3.135 Mean :2.834 Mean :2.312
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## A2 A3 A4 A5
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:4.00 1st Qu.:1.000
## Median :4.000 Median :2.000 Median :4.00 Median :2.000
## Mean :3.927 Mean :2.163 Mean :4.03 Mean :2.166
## 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:5.00 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
##
## A6 A7 A8 A9
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :2.000 Median :4.000 Median :4.000
## Mean :3.896 Mean :2.161 Mean :3.766 Mean :3.945
## 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## A10 C1 C2 C3
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :3.000 Median :3.000 Median :4.000
## Mean :3.682 Mean :3.318 Mean :2.979 Mean :3.983
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## C4 C5 C6 C7
## Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.0 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :3.0 Median :3.000 Median :4.000
## Mean :2.654 Mean :2.7 Mean :2.923 Mean :3.647
## 3rd Qu.:4.000 3rd Qu.:4.0 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.0 Max. :5.000 Max. :5.000
##
## C8 C9 C10 O1
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000
## Median :2.000 Median :3.000 Median :4.000 Median :4.000
## Mean :2.481 Mean :3.224 Mean :3.637 Mean :3.692
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## O2 O3 O4 O5
## Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:1.000 1st Qu.:3.000
## Median :2.00 Median :4.000 Median :2.000 Median :4.000
## Mean :2.15 Mean :4.126 Mean :2.079 Mean :3.873
## 3rd Qu.:3.00 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000
##
## O6 O7 O8 O9
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:4.000 1st Qu.:2.000 1st Qu.:4.000
## Median :1.000 Median :4.000 Median :3.000 Median :4.000
## Mean :1.795 Mean :4.073 Mean :3.208 Mean :4.134
## 3rd Qu.:2.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## O10
## Min. :0.000
## 1st Qu.:3.000
## Median :4.000
## Mean :4.005
## 3rd Qu.:5.000
## Max. :5.000
##
When we look at the summary statistics, we notice a few issues with the data: every item has a minimum of 0 even though the Likert scale runs from 1 to 5 (0 marks a missing response), and the age variable contains implausible values (a mean of about 50767 and a maximum of 999999999).
So let’s get rolling and clean this dataset up.
#missing data
big5[big5 == 0] <- NA #recode all 0s (the code for missing responses) as NA
head(big5) #test to see if it worked
## race age engnat gender hand source country E1 E2 E3 E4 E5 E6 E7 E8 E9
## 1 3 53 1 1 1 1 US 4 2 5 2 5 1 4 3 5
## 2 13 46 1 2 1 1 US 2 2 3 3 3 3 1 5 1
## 3 1 14 2 2 1 1 PK 5 1 1 4 5 1 1 5 5
## 4 3 19 2 2 1 1 RO 2 5 2 4 3 4 3 4 4
## 5 11 25 2 2 1 2 US 3 1 3 3 3 1 3 1 3
## 6 13 31 1 2 1 2 US 1 5 2 4 1 3 2 4 1
## E10 N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 C1 C2
## 1 1 1 5 2 5 1 1 1 1 1 1 1 5 1 5 2 3 1 5 4 5 4 1
## 2 5 2 3 4 2 3 4 3 2 2 4 1 3 3 4 4 4 2 3 4 3 4 1
## 3 1 5 1 5 5 5 5 5 5 5 5 5 1 5 5 1 5 1 5 5 5 4 1
## 4 5 5 4 4 2 4 5 5 5 4 5 2 5 4 4 3 5 3 4 4 3 3 3
## 5 5 3 3 3 4 3 3 3 3 3 4 5 5 3 5 1 5 1 5 5 5 3 1
## 6 5 1 5 4 5 1 4 4 1 5 2 2 2 3 4 3 4 3 5 5 3 2 5
## C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4 O5 O6 O7 O8 O9 O10
## 1 5 1 5 1 4 1 4 5 4 1 3 1 5 1 4 2 5 5
## 2 3 2 3 1 5 1 4 4 3 3 3 3 2 3 3 1 3 2
## 3 5 1 5 1 5 1 5 5 4 5 5 1 5 1 5 5 5 5
## 4 4 5 1 4 5 4 2 3 4 3 5 2 4 2 5 2 5 5
## 5 5 3 3 1 1 3 3 3 3 1 1 1 3 1 3 1 5 3
## 6 4 3 3 4 5 3 5 3 4 2 1 3 3 5 5 4 5 3
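To double-check the recoding, we could also count the missing values directly; a small sketch:
colSums(is.na(big5)) # number of NAs per column
sum(!complete.cases(big5)) # number of rows with at least one NA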
#messy data
#age
summary(big5$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300e+01 1.800e+01 2.200e+01 5.077e+04 3.100e+01 1.000e+09
table(big5$age) #it seems like some people didn't enter their age in years, but their year of birth (and a few entered nonsense values)
##
## 13 14 15 16 17 18 19
## 238 475 743 1148 1370 1523 1259
## 20 21 22 23 24 25 26
## 1231 1216 970 895 703 691 520
## 27 28 29 30 31 32 33
## 492 449 361 372 315 329 262
## 34 35 36 37 38 39 40
## 243 245 223 198 206 167 180
## 41 42 43 44 45 46 47
## 174 169 180 136 182 143 118
## 48 49 50 51 52 53 54
## 141 131 131 106 116 93 94
## 55 56 57 58 59 60 61
## 96 72 80 60 43 71 28
## 62 63 64 65 66 67 68
## 36 25 29 27 24 19 20
## 69 70 71 72 73 74 75
## 15 12 11 8 2 2 5
## 76 77 78 79 80 92 97
## 2 3 1 2 1 1 1
## 99 100 118 188 191 208 211
## 1 1 1 2 1 1 1
## 223 266 1961 1964 1968 1974 1976
## 1 1 1 1 1 1 2
## 1977 1982 1984 1985 1986 1988 1989
## 1 4 2 2 2 1 5
## 1990 1991 1992 1993 1994 1995 1996
## 3 3 9 5 8 5 7
## 1997 1998 1999 2000 412434 999999999
## 4 4 1 1 1 1
big5$age[big5$age > 100] <- NA #treat implausible ages (> 100) as missing
#when we run the table and summary functions again, we see that there are no more values above 100,
#the mean age is now a plausible 26.26 yrs., and there are 83 NA's.
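As an alternative to discarding them, the entries that look like birth years could have been converted to ages before the step above; a sketch, assuming a hypothetical collection year that is not stated in the codebook:
collection_year <- 2012 # hypothetical; replace with the actual year of data collection
is_birth_year <- !is.na(big5$age) & big5$age >= 1900 & big5$age <= collection_year
big5$age[is_birth_year] <- collection_year - big5$age[is_birth_year] # convert birth year to age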
head(big5) #let's take a look at the head of our clean dataset.
## race age engnat gender hand source country E1 E2 E3 E4 E5 E6 E7 E8 E9
## 1 3 53 1 1 1 1 US 4 2 5 2 5 1 4 3 5
## 2 13 46 1 2 1 1 US 2 2 3 3 3 3 1 5 1
## 3 1 14 2 2 1 1 PK 5 1 1 4 5 1 1 5 5
## 4 3 19 2 2 1 1 RO 2 5 2 4 3 4 3 4 4
## 5 11 25 2 2 1 2 US 3 1 3 3 3 1 3 1 3
## 6 13 31 1 2 1 2 US 1 5 2 4 1 3 2 4 1
## E10 N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 C1 C2
## 1 1 1 5 2 5 1 1 1 1 1 1 1 5 1 5 2 3 1 5 4 5 4 1
## 2 5 2 3 4 2 3 4 3 2 2 4 1 3 3 4 4 4 2 3 4 3 4 1
## 3 1 5 1 5 5 5 5 5 5 5 5 5 1 5 5 1 5 1 5 5 5 4 1
## 4 5 5 4 4 2 4 5 5 5 4 5 2 5 4 4 3 5 3 4 4 3 3 3
## 5 5 3 3 3 4 3 3 3 3 3 4 5 5 3 5 1 5 1 5 5 5 3 1
## 6 5 1 5 4 5 1 4 4 1 5 2 2 2 3 4 3 4 3 5 5 3 2 5
## C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4 O5 O6 O7 O8 O9 O10
## 1 5 1 5 1 4 1 4 5 4 1 3 1 5 1 4 2 5 5
## 2 3 2 3 1 5 1 4 4 3 3 3 3 2 3 3 1 3 2
## 3 5 1 5 1 5 1 5 5 4 5 5 1 5 1 5 5 5 5
## 4 4 5 1 4 5 4 2 3 4 3 5 2 4 2 5 2 5 5
## 5 5 3 3 1 1 3 3 3 3 1 1 1 3 1 3 1 5 3
## 6 4 3 3 4 5 3 5 3 4 2 1 3 3 5 5 4 5 3
Let’s take a look at the count of some demographic variables.
How many people participated from each country?
str(big5$country) #this tells us that this variable is a factor with 159 levels, i.e. up to 159 countries from which people participated (levels such as "" and "(nu" are not valid ISO codes)
## Factor w/ 159 levels "","(nu","A1",..: 150 150 121 128 150 150 150 73 150 150 ...
table(big5$country)#this gives us an overview of how many people from each country participated
##
## (nu A1 A2 AE AG AL AO AP AR AS AT AU AZ BA
## 1 369 8 9 100 1 12 1 19 41 1 20 974 4 10
## BB BD BE BF BG BH BM BN BO BR BS BT BW BZ CA
## 2 44 86 1 41 8 8 5 3 175 2 1 4 17 924
## CH CL CM CN CO CR CV CY CZ DE DK DO DZ EC EE
## 40 18 2 40 18 9 1 8 28 191 122 5 4 6 13
## EG ES ET EU FI FJ FO FR GB GD GE GG GH GP GR
## 49 82 1 24 90 2 1 129 1531 1 4 2 20 1 85
## GT GU GY HK HN HR HT HU ID IE IL IM IN IQ IR
## 3 1 1 41 4 40 2 34 172 107 27 1 1464 2 17
## IS IT JE JM JO JP KE KG KH KR KW KY KZ LA LB
## 13 277 3 28 14 37 43 1 3 30 6 1 1 2 41
## LK LS LT LV LY MA ME MK MM MN MP MR MT MU MV
## 31 2 29 21 2 9 3 7 3 2 2 1 11 8 3
## MW MX MY MZ NG NI NL NO NP NZ OM PA PE PG PH
## 2 82 247 2 35 2 133 147 10 157 6 4 8 2 649
## PK PL PR PT PW PY QA RO RS RU RW SA SD SE SG
## 222 79 16 88 1 2 10 135 85 19 2 45 1 169 133
## SI SK SR SV SY TC TH TN TR TT TW TZ UA UG US
## 34 22 1 6 2 1 42 7 70 23 26 2 12 11 8753
## UY UZ VC VE VI VN ZA ZM ZW
## 2 1 2 17 2 30 179 2 3
summary(big5$country) #summary() on a factor orders the levels by frequency (most participants to least). The country with the most participants was the US; the second-largest group came from GB.
## US GB IN AU CA PH (nu IT MY
## 8753 1531 1464 974 924 649 369 277 247
## PK DE ZA BR ID SE NZ NO RO
## 222 191 179 175 172 169 157 147 135
## NL SG FR DK IE AE FI PT BE
## 133 133 129 122 107 100 90 88 86
## GR RS ES MX PL TR EG SA BD
## 85 85 82 82 79 70 49 45 44
## KE TH AR BG HK LB CH CN HR
## 43 42 41 41 41 41 40 40 40
## JP NG HU SI LK KR VN LT CZ
## 37 35 34 34 31 30 30 29 28
## JM IL TW EU TT SK LV AT GH
## 28 27 26 24 23 22 21 20 20
## AP RU CL CO BZ IR VE PR JO
## 19 19 18 18 17 17 17 16 14
## EE IS AL UA MT UG BA NP QA
## 13 13 12 12 11 11 10 10 10
## A2 CR MA A1 BH BM CY MU PE
## 9 9 9 8 8 8 8 8 8
## MK TN EC KW OM SV BN DO (Other)
## 7 7 6 6 6 6 5 5 119
## NA's
## 8
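summary() lumps the long tail of countries into “(Other)”; to see the complete ranking we could sort the frequency table ourselves, e.g.:
sort(table(big5$country), decreasing = TRUE) # full frequency table, most to least
head(sort(table(big5$country), decreasing = TRUE), 10) # or just the top 10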
How many participants were referred to the test through a search engine (“source” = 2)?
table(big5$source)
##
## 1 2 3 4 5
## 12099 3653 303 137 3527
#1=from another page on the test website --> 12099 participants
#2=from google --> 3653 participants
#3=from facebook --> 303
#4=from any url with ".edu" in its domain name (e.g. xxx.edu, xxx.edu.au) --> 137
#5=other source, or HTTP Referer not provided. The codebook labels this category 6 and skips 5, but the data only contain values 1-5, so I assume the codebook's 6 corresponds to the value 5.
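Expressed as percentages (a quick sketch), roughly 61% of participants came from another page on the test website:
round(prop.table(table(big5$source)) * 100, 1) # share of participants per source, in percent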
How many participants were Caucasian European?
table(big5$race) #10537 participants were Caucasian (European), i.e. race = 3. This is also the largest racial group in the sample.
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1434 14 10537 1518 515 397 24 201 188 65 1861 259
## 13
## 2553
How many subjects had English as their native language?
table(big5$engnat)
##
## 1 2
## 12379 7270
#12379 people had English as their native language.
How many males/females participated?
table(big5$gender)
##
## 1 2 3
## 7608 11985 102
#7608 = male
#11985 = female
#102 = other
Was the majority of subjects left- or right-handed?
table(big5$hand)
##
## 1 2 3
## 17424 1724 471
#the majority was right-handed.
Let’s create a histogram to visualize how many subjects we had from each race:
hist(big5$race, breaks = 13, col = "Green") #the tall bar at 3 reflects the large Caucasian (European) group
#let's make it look a bit cooler
ra <- hist(big5$race, breaks = 13, plot = FALSE)
plot(ra, labels = TRUE, border = "blue", col = "Green", main = "Histogram of participants' race", xlab = "Race")
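Since race is a categorical code rather than a truly numeric variable, a barplot of the counts is arguably the more appropriate display; a sketch:
barplot(table(big5$race),
        col = "Green",
        main = "Participants per race category",
        xlab = "Race code",
        ylab = "Count")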
Let’s create a boxplot to visualize how old people of each race were when they participated in this study.
boxplot(age ~ race, data = big5)
#let's make it look a bit cooler
boxplot(age ~ race,
        data = big5,
        col = "dark red",
        notch = TRUE, #notches that do not overlap suggest the medians differ
        main = "Boxplot of age by race",
        xlab = "Race",
        ylab = "Age"
)
## Warning in bxp(structure(list(stats = structure(c(13, 17, 20, 26, 39, 14,
## : some notches went outside hinges ('box'): maybe set notch=FALSE
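To back the boxplot with numbers, the median age per race code could be computed as well; a minimal sketch:
aggregate(age ~ race, data = big5, FUN = median) # median age for each race code (NA rows are dropped)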
Select columns that apparently belong together (e.g. E1-E10), compute the correlations, and also show the correlations visually.
E <- data.frame(big5$E1, big5$E2, big5$E3, big5$E4, big5$E5, big5$E6, big5$E7, big5$E8, big5$E9, big5$E10)
head(E)
## big5.E1 big5.E2 big5.E3 big5.E4 big5.E5 big5.E6 big5.E7 big5.E8 big5.E9
## 1 4 2 5 2 5 1 4 3 5
## 2 2 2 3 3 3 3 1 5 1
## 3 5 1 1 4 5 1 1 5 5
## 4 2 5 2 4 3 4 3 4 4
## 5 3 1 3 3 3 1 3 1 3
## 6 1 5 2 4 1 3 2 4 1
## big5.E10
## 1 1
## 2 5
## 3 1
## 4 5
## 5 5
## 6 5
cor(na.omit(E)) #calculate the correlations between the E-variables
## big5.E1 big5.E2 big5.E3 big5.E4 big5.E5 big5.E6
## big5.E1 1.0000000 -0.4213321 0.4741228 -0.4841974 0.4789539 -0.3469641
## big5.E2 -0.4213321 1.0000000 -0.4459679 0.5275965 -0.5399614 0.5707289
## big5.E3 0.4741228 -0.4459679 1.0000000 -0.4815417 0.5905066 -0.3938041
## big5.E4 -0.4841974 0.5275965 -0.4815417 1.0000000 -0.5105973 0.4747965
## big5.E5 0.4789539 -0.5399614 0.5905066 -0.5105973 1.0000000 -0.4810786
## big5.E6 -0.3469641 0.5707289 -0.3938041 0.4747965 -0.4810786 1.0000000
## big5.E7 0.5880106 -0.4802477 0.5797732 -0.5036264 0.6307028 -0.4057580
## big5.E8 -0.3669305 0.3733056 -0.3205900 0.4460278 -0.3451264 0.3201933
## big5.E9 0.4553453 -0.3650355 0.4232972 -0.4511764 0.4159735 -0.3304710
## big5.E10 -0.4147061 0.4634899 -0.4744953 0.5103011 -0.5429647 0.4118512
## big5.E7 big5.E8 big5.E9 big5.E10
## big5.E1 0.5880106 -0.3669305 0.4553453 -0.4147061
## big5.E2 -0.4802477 0.3733056 -0.3650355 0.4634899
## big5.E3 0.5797732 -0.3205900 0.4232972 -0.4744953
## big5.E4 -0.5036264 0.4460278 -0.4511764 0.5103011
## big5.E5 0.6307028 -0.3451264 0.4159735 -0.5429647
## big5.E6 -0.4057580 0.3201933 -0.3304710 0.4118512
## big5.E7 1.0000000 -0.3451934 0.4331032 -0.5335745
## big5.E8 -0.3451934 1.0000000 -0.5147706 0.3807093
## big5.E9 0.4331032 -0.5147706 1.0000000 -0.3718664
## big5.E10 -0.5335745 0.3807093 -0.3718664 1.0000000
pairs(na.omit(E)) #visualize the correlations
#this does not look very informative: with 5-point items there are only 25 possible point
#positions per panel, so every panel shows the same grid of heavily overplotted dots.
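One way to get a more readable picture is to plot the correlation matrix itself as a heatmap instead of the raw data points; a minimal sketch using base R's image():
E_cor <- cor(E, use = "pairwise.complete.obs") # correlations, handling NAs pairwise
image(1:10, 1:10, E_cor[, 10:1], # flip the columns so the diagonal runs top-left to bottom-right
      axes = FALSE, xlab = "", ylab = "", main = "Correlations among E1-E10")
axis(1, at = 1:10, labels = paste0("E", 1:10))
axis(2, at = 1:10, labels = paste0("E", 10:1))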
This was an overview of some (very) basic dataset inspections. There are certainly easier ways to combine several variables (columns) into one data frame to look at the correlations; one possibility is sketched below.
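An easier way to pull out the E1-E10 columns is to select them by name instead of typing each one; a minimal sketch:
E <- big5[, paste0("E", 1:10)] # same columns, selected by name
# or, more generally, by pattern: big5[, grep("^E[0-9]+$", names(big5))]
cor(E, use = "pairwise.complete.obs") # correlations without building na.omit(E) first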
Checking and modifying the global environment / workspace:
ls() # list all variables
rm(x) # remove a variable x
rm(list = ls()) # remove all variables that can be found by ls()