Reading the data

For more background on the data, see the European Social Survey module index: http://www.europeansocialsurvey.org/data/module-index.html

url <- "https://raw.githubusercontent.com/hanbedrijfskunde/onderzoek/master/data/soc-trust.csv"
myDF <- read.csv(url)
str(myDF)
'data.frame':   33286 obs. of  10 variables:
 $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ cntry   : Factor w/ 17 levels "AT","BE","CH",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ gndr    : int  2 1 2 1 2 2 2 2 2 2 ...
 $ agea    : int  34 52 68 54 20 65 52 44 22 41 ...
 $ nwspol  : int  120 120 30 30 30 60 15 45 10 60 ...
 $ netusoft: int  4 5 2 5 5 5 2 4 5 4 ...
 $ netustm : int  180 120 6666 120 180 120 6666 30 120 120 ...
 $ ppltrst : int  8 6 5 6 5 3 7 7 9 5 ...
 $ pplfair : int  8 6 6 5 5 5 7 7 10 3 ...
 $ pplhlp  : int  3 5 4 6 7 4 6 7 10 4 ...
summary(myDF)
       X             cntry            gndr            agea            nwspol          netusoft    
 Min.   :    1   DE     : 2852   Min.   :1.000   Min.   : 15.00   Min.   :   0.0   Min.   :1.000  
 1st Qu.: 8322   IE     : 2766   1st Qu.:1.000   1st Qu.: 33.00   1st Qu.:  30.0   1st Qu.:3.000  
 Median :16644   IL     : 2557   Median :2.000   Median : 49.00   Median :  60.0   Median :5.000  
 Mean   :16644   RU     : 2430   Mean   :1.523   Mean   : 52.25   Mean   : 139.3   Mean   :3.941  
 3rd Qu.:24965   CZ     : 2300   3rd Qu.:2.000   3rd Qu.: 64.00   3rd Qu.:  90.0   3rd Qu.:5.000  
 Max.   :33286   FR     : 2070   Max.   :9.000   Max.   :999.00   Max.   :9999.0   Max.   :9.000  
                 (Other):18311   NA's   :40      NA's   :40       NA's   :40       NA's   :40     
    netustm        ppltrst          pplfair           pplhlp      
 Min.   :   0   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 120   1st Qu.: 4.000   1st Qu.: 5.000   1st Qu.: 4.000  
 Median : 240   Median : 6.000   Median : 6.000   Median : 5.000  
 Mean   :2153   Mean   : 5.576   Mean   : 6.407   Mean   : 5.705  
 3rd Qu.:6666   3rd Qu.: 7.000   3rd Qu.: 8.000   3rd Qu.: 7.000  
 Max.   :9999   Max.   :99.000   Max.   :99.000   Max.   :99.000  
 NA's   :40     NA's   :40       NA's   :40       NA's   :40      
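
The str() output above shows cntry as a factor, which is what read.csv produced by default before R 4.0.0. From R 4.0.0 onward, strings stay character unless you ask for factors. A minimal sketch of how to get the factor column on any R version, using a small inline CSV (made up for illustration) in place of the real URL:

```r
# Toy CSV standing in for soc-trust.csv (values are made up)
csv_text <- "cntry,agea\nAT,34\nBE,52\nAT,999"

# stringsAsFactors = TRUE turns character columns into factors,
# so summary() lists counts per country as in the output above
toyDF <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(toyDF)
levels(toyDF$cntry)
```

With the real file the same argument would be passed as read.csv(url, stringsAsFactors = TRUE).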

Cleaning the data

The first summary shows maxima such as 999 for agea and 9999 for nwspol. These are special codes rather than real measurements, so they must be changed to NA (= not available), R's standard marker for missing values.

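# One special code each for gender and age
myDF$gndr[myDF$gndr == 9] <- NA
myDF$agea[myDF$agea == 999] <- NA
# Four-digit special codes in the two time variables
myDF$nwspol[myDF$nwspol %in% c(6666,7777,8888,9999)] <- NA
myDF$netustm[myDF$netustm %in% c(6666,7777,8888,9999)] <- NA
# One-digit special codes in netusoft (valid answers run from 1 to 5)
myDF$netusoft[myDF$netusoft %in% c(7,8,9)] <- NA
# Two-digit special codes in the 0-10 scales
myDF$ppltrst[myDF$ppltrst %in% c(77,88,99)] <- NA
myDF$pplfair[myDF$pplfair %in% c(77,88,99)] <- NA
myDF$pplhlp[myDF$pplhlp %in% c(77,88,99)] <- NA
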
summary(myDF)
       X             cntry            gndr            agea            nwspol           netusoft    
 Min.   :    1   DE     : 2852   Min.   :1.000   Min.   : 15.00   Min.   :   0.00   Min.   :1.000  
 1st Qu.: 8322   IE     : 2766   1st Qu.:1.000   1st Qu.: 33.00   1st Qu.:  30.00   1st Qu.:3.000  
 Median :16644   IL     : 2557   Median :2.000   Median : 49.00   Median :  60.00   Median :5.000  
 Mean   :16644   RU     : 2430   Mean   :1.521   Mean   : 48.79   Mean   :  75.96   Mean   :3.938  
 3rd Qu.:24965   CZ     : 2300   3rd Qu.:2.000   3rd Qu.: 64.00   3rd Qu.:  90.00   3rd Qu.:5.000  
 Max.   :33286   FR     : 2070   Max.   :2.000   Max.   :100.00   Max.   :1262.00   Max.   :5.000  
                 (Other):18311   NA's   :48      NA's   :161      NA's   :275       NA's   :68     
    netustm          ppltrst          pplfair           pplhlp      
 Min.   :   0.0   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:  60.0   1st Qu.: 4.000   1st Qu.: 5.000   1st Qu.: 4.000  
 Median : 120.0   Median : 6.000   Median : 6.000   Median : 5.000  
 Mean   : 195.9   Mean   : 5.384   Mean   : 5.944   Mean   : 5.426  
 3rd Qu.: 240.0   3rd Qu.: 7.000   3rd Qu.: 8.000   3rd Qu.: 7.000  
 Max.   :1440.0   Max.   :10.000   Max.   :10.000   Max.   :10.000  
 NA's   :10008    NA's   :117      NA's   :228      NA's   :151     
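
The replacements above repeat one pattern: find values that belong to a set of special codes and overwrite them with NA. A minimal sketch of that pattern as a reusable function, applied to several columns at once. The helper name na_codes and the toy data frame are hypothetical and not part of the original notebook:

```r
# Hypothetical helper: set every value of x that appears in codes to NA
na_codes <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

# Toy data with the same kind of special codes (values are made up)
toy <- data.frame(
  ppltrst = c(8, 77, 5, 99),
  pplfair = c(88, 6, 10, 3)
)

# Apply the same code list to all 0-10 scale columns
trust_cols <- c("ppltrst", "pplfair")
toy[trust_cols] <- lapply(toy[trust_cols], na_codes, codes = c(77, 88, 99))

# Number of missing values per column after cleaning
colSums(is.na(toy))
```

Because %in% never returns NA, the helper also leaves values that were already missing untouched.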

Visualising the data

We can look at the distributions of single variables and plot variables against each other.

hist(myDF$agea, col = "tomato", xlab = "Age")

plot(myDF$agea, myDF$netustm, col = "#0099e6")

boxplot(myDF$ppltrst~myDF$cntry, col = "violet")

library(dplyr)  # provides %>%, filter() and group_by()

grpByGndr <- myDF %>% filter(!is.na(gndr)) %>% group_by(gndr)
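
Once dplyr is loaded, a grouped data frame like this one can feed a per-group summary. A minimal sketch on made-up toy data, computing the number of respondents and the mean internet time per gender and showing the result as a bar chart. The labels "m" and "f" assume that code 1 stands for male and 2 for female:

```r
library(dplyr)

# Toy data (made up); NA marks a missing answer
toy <- data.frame(
  gndr    = c(1, 2, 2, NA, 1),
  netustm = c(120, 60, NA, 30, 180)
)

# Drop rows without gender, then summarise per gender
gndrSummy <- toy %>%
  filter(!is.na(gndr)) %>%
  group_by(gndr) %>%
  summarize(count = n(), netusage = mean(netustm, na.rm = TRUE))

# Assumption: 1 = male ("m"), 2 = female ("f")
gndr_labels <- c("m", "f")[gndrSummy$gndr]

barplot(gndrSummy$netusage, names.arg = gndr_labels, col = "#00cc88")
```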

For more control over the appearance of the plots, you can use the ggplot2 package.

library(ggplot2)  # provides ggplot(), aes() and geom_boxplot()

# Boxplot of ppltrst per country
p <- ggplot(myDF) +
  geom_boxplot(aes(x = cntry, y = ppltrst), fill = "violet")
p
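
As a sketch of the extra control ggplot2 offers, axis labels, a title and a theme can be layered onto the same kind of boxplot. The toy data frame and the label texts below are made up for illustration; reading ppltrst as a 0-10 trust score is an assumption based on the range shown in the summary.

```r
library(ggplot2)

# Toy data (made up) with two countries
toy <- data.frame(
  cntry   = rep(c("AT", "BE"), each = 4),
  ppltrst = c(8, 6, 5, 6, 7, 7, 9, 5)
)

# Each "+" adds one layer or setting to the plot
p2 <- ggplot(toy, aes(x = cntry, y = ppltrst)) +
  geom_boxplot(fill = "violet") +
  labs(x = "Country", y = "Trust in people (0-10)",
       title = "Trust by country (toy data)") +
  theme_minimal()
p2
```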

With plotly you can easily add interactivity.

library(plotly)  # provides ggplotly()

ggplotly(p, width = 800)
Removed 117 rows containing non-finite values (stat_boxplot).
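
The message above is ggplot2 reporting that it dropped rows with a missing ppltrst before drawing; the count of 117 matches the NA's reported for ppltrst in the summary after cleaning. A minimal sketch, on made-up toy data, of counting those rows yourself and suppressing the message with na.rm = TRUE:

```r
library(ggplot2)

# Toy data (made up) with two missing trust scores
toy <- data.frame(
  cntry   = c("AT", "AT", "BE", "BE", "BE"),
  ppltrst = c(8, NA, 5, 6, NA)
)

# Rows the boxplot statistic would drop (NA, NaN or infinite values)
n_dropped <- sum(!is.finite(toy$ppltrst))
n_dropped

# na.rm = TRUE removes those rows without printing the message
p3 <- ggplot(toy, aes(x = cntry, y = ppltrst)) +
  geom_boxplot(fill = "violet", na.rm = TRUE)
p3
```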