For more background on the data, read here.
url <- "https://raw.githubusercontent.com/hanbedrijfskunde/onderzoek/master/data/soc-trust.csv"
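myDF <- read.csv(url, stringsAsFactors = TRUE)  # keeps cntry a factor, as in the output below (since R 4.0.0 strings are no longer converted by default)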
str(myDF)
'data.frame':	33286 obs. of  10 variables:
 $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ cntry   : Factor w/ 17 levels "AT","BE","CH",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ gndr    : int  2 1 2 1 2 2 2 2 2 2 ...
 $ agea    : int  34 52 68 54 20 65 52 44 22 41 ...
 $ nwspol  : int  120 120 30 30 30 60 15 45 10 60 ...
 $ netusoft: int  4 5 2 5 5 5 2 4 5 4 ...
 $ netustm : int  180 120 6666 120 180 120 6666 30 120 120 ...
 $ ppltrst : int  8 6 5 6 5 3 7 7 9 5 ...
 $ pplfair : int  8 6 6 5 5 5 7 7 10 3 ...
 $ pplhlp  : int  3 5 4 6 7 4 6 7 10 4 ...
summary(myDF)
       X             cntry           gndr             agea            nwspol         netusoft
 Min.   :    1   DE     : 2852   Min.   :1.000   Min.   : 15.00   Min.   :   0.0   Min.   :1.000
 1st Qu.: 8322   IE     : 2766   1st Qu.:1.000   1st Qu.: 33.00   1st Qu.:  30.0   1st Qu.:3.000
 Median :16644   IL     : 2557   Median :2.000   Median : 49.00   Median :  60.0   Median :5.000
 Mean   :16644   RU     : 2430   Mean   :1.523   Mean   : 52.25   Mean   : 139.3   Mean   :3.941
 3rd Qu.:24965   CZ     : 2300   3rd Qu.:2.000   3rd Qu.: 64.00   3rd Qu.:  90.0   3rd Qu.:5.000
 Max.   :33286   FR     : 2070   Max.   :9.000   Max.   :999.00   Max.   :9999.0   Max.   :9.000
                 (Other):18311   NA's   :40      NA's   :40       NA's   :40       NA's   :40
   netustm         ppltrst          pplfair           pplhlp
 Min.   :   0   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.: 120   1st Qu.: 4.000   1st Qu.: 5.000   1st Qu.: 4.000
 Median : 240   Median : 6.000   Median : 6.000   Median : 5.000
 Mean   :2153   Mean   : 5.576   Mean   : 6.407   Mean   : 5.705
 3rd Qu.:6666   3rd Qu.: 7.000   3rd Qu.: 8.000   3rd Qu.: 7.000
 Max.   :9999   Max.   :99.000   Max.   :99.000   Max.   :99.000
 NA's   :40     NA's   :40       NA's   :40       NA's   :40
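Several codes have to be recoded to NA (not available), R's standard marker for missing values. They show up in the summary above as implausible maxima, such as 9 for gndr, 999 for agea, 9999 for nwspol and netustm, and 99 for the three trust items.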
myDF$gndr[myDF$gndr == 9] <- NA
myDF$agea[myDF$agea == 999] <- NA
myDF$nwspol[myDF$nwspol %in% c(6666,7777,8888,9999)] <- NA
myDF$netustm[myDF$netustm %in% c(6666,7777,8888,9999)] <- NA
myDF$netusoft[myDF$netusoft %in% c(7,8,9)] <- NA
myDF$ppltrst[myDF$ppltrst %in% c(77,88,99)] <- NA
myDF$pplfair[myDF$pplfair %in% c(77,88,99)] <- NA
myDF$pplhlp[myDF$pplhlp %in% c(77,88,99)] <- NA
summary(myDF)
       X             cntry           gndr             agea            nwspol          netusoft
 Min.   :    1   DE     : 2852   Min.   :1.000   Min.   : 15.00   Min.   :   0.00   Min.   :1.000
 1st Qu.: 8322   IE     : 2766   1st Qu.:1.000   1st Qu.: 33.00   1st Qu.:  30.00   1st Qu.:3.000
 Median :16644   IL     : 2557   Median :2.000   Median : 49.00   Median :  60.00   Median :5.000
 Mean   :16644   RU     : 2430   Mean   :1.521   Mean   : 48.79   Mean   :  75.96   Mean   :3.938
 3rd Qu.:24965   CZ     : 2300   3rd Qu.:2.000   3rd Qu.: 64.00   3rd Qu.:  90.00   3rd Qu.:5.000
 Max.   :33286   FR     : 2070   Max.   :2.000   Max.   :100.00   Max.   :1262.00   Max.   :5.000
                 (Other):18311   NA's   :48      NA's   :161      NA's   :275       NA's   :68
    netustm          ppltrst          pplfair           pplhlp
 Min.   :   0.0   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.:  60.0   1st Qu.: 4.000   1st Qu.: 5.000   1st Qu.: 4.000
 Median : 120.0   Median : 6.000   Median : 6.000   Median : 5.000
 Mean   : 195.9   Mean   : 5.384   Mean   : 5.944   Mean   : 5.426
 3rd Qu.: 240.0   3rd Qu.: 7.000   3rd Qu.: 8.000   3rd Qu.: 7.000
 Max.   :1440.0   Max.   :10.000   Max.   :10.000   Max.   :10.000
 NA's   :10008    NA's   :117      NA's   :228      NA's   :151
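The recoding lines above repeat one pattern: replace a fixed set of codes with NA. As a minimal sketch, that pattern could be wrapped in a small helper and applied to several columns at once. The helper name recode_missing and the toy data frame toyDF are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis):
# replace every occurrence of the given codes with NA.
recode_missing <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

# Toy data standing in for myDF; the values are made up.
toyDF <- data.frame(
  ppltrst = c(8, 77, 5, 99),
  pplfair = c(88, 6, 10, 3),
  pplhlp  = c(3, 5, 99, NA)
)

# The three trust items share the same missing-value codes,
# so they can be recoded in one step.
trustVars <- c("ppltrst", "pplfair", "pplhlp")
toyDF[trustVars] <- lapply(toyDF[trustVars], recode_missing,
                           codes = c(77, 88, 99))

# Number of NAs per column after recoding.
colSums(is.na(toyDF))
```

Because %in% returns FALSE for values that are already NA, the helper leaves existing missing values untouched.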
We can look at the distributions and build cross plots of one variable against another.
hist(myDF$agea, col = "tomato", xlab = "Leeftijd")
plot(myDF$agea, myDF$netustm, col = "#0099e6")
boxplot(ppltrst ~ cntry, data = myDF, col = "violet")
library(dplyr)  # provides %>%, filter() and group_by()
grpByGndr <- myDF %>% filter(!is.na(gndr)) %>% group_by(gndr)
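With dplyr loaded, a grouped data frame such as grpByGndr is usually passed on to summarise() to get one row per group. The sketch below shows that step on a small made-up data frame (toyDF), so it runs without downloading the survey data. The column names follow the data set above, but the values and the summary columns n and meanTrust are assumptions chosen for illustration.

```r
library(dplyr)

# Toy data standing in for myDF; the values are made up.
toyDF <- data.frame(
  gndr    = c(1, 2, 2, NA, 1, 2),
  ppltrst = c(6, 8, NA, 5, 4, 7)
)

trustByGndr <- toyDF %>%
  filter(!is.na(gndr)) %>%       # drop rows with unknown gndr
  group_by(gndr) %>%             # one group per gndr code
  summarise(
    n         = n(),                          # rows per group
    meanTrust = mean(ppltrst, na.rm = TRUE)   # mean of non-missing ppltrst
  )

trustByGndr
```

The na.rm = TRUE argument matters here: without it, a single missing ppltrst value would make the mean of the whole group NA.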
For more control over the appearance of the plots, you can use the ggplot2 library.
library(ggplot2)  # provides ggplot(), aes() and geom_boxplot()
p <- ggplot(myDF) +
  geom_boxplot(aes(x = cntry, y = ppltrst), fill = "violet")
p
With plotly you can easily add interactivity.
library(plotly)  # provides ggplotly()
ggplotly(p, width = 800)
Removed 117 rows containing non-finite values (stat_boxplot).
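The 117 rows in this warning match the number of NA values that summary() reported for ppltrst after recoding: the boxplot silently drops those rows. A minimal sketch of making that explicit is to count the missing values and filter them out before plotting. It uses base R and a made-up data frame (toyDF); plotDF is a hypothetical name.

```r
# Toy data standing in for myDF; the values are made up.
toyDF <- data.frame(
  cntry   = c("AT", "AT", "BE", "BE", "CH"),
  ppltrst = c(8, NA, 5, NA, 7)
)

# Rows a boxplot of ppltrst would have to drop.
nMissing <- sum(is.na(toyDF$ppltrst))
nMissing

# Filtering beforehand makes the exclusion explicit.
plotDF <- toyDF[!is.na(toyDF$ppltrst), ]
nrow(plotDF)
```

Applied to the real data, the same filter on myDF should leave no non-finite ppltrst values for the boxplot to remove.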