Hi! I have chosen as my topic Climate changes and we will explore data about Switzerland. What is behind this decision? Well, I’m fond of a fancy field, namely STS. And Climate changes is one of the disputable topic in the world and it’s one of the reason why STS researchers are into it. I don’t know nothing about Climate changes and decide that with this project I may find the way into it. And I have chosen Switzerland because the UN Environment Regional office for Europe is located in Geneva, but I’m far not sure that I will stick to Switzerland in future. So, let’s start with looking at our data.
swit_data <- read_sas("ess8ch.sas7bdat")
swit_data <- as_tibble(swit_data)
head(swit_data)
## # A tibble: 6 x 534
## name essround edition proddate idno cntry nwspol netusoft netustm
## <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 ESS8~ 8 2.1 01.12.2~ 1 CH 30 5 60
## 2 ESS8~ 8 2.1 01.12.2~ 3 CH 60 2 6666
## 3 ESS8~ 8 2.1 01.12.2~ 5 CH 45 5 300
## 4 ESS8~ 8 2.1 01.12.2~ 6 CH 60 4 210
## 5 ESS8~ 8 2.1 01.12.2~ 9 CH 90 5 60
## 6 ESS8~ 8 2.1 01.12.2~ 14 CH 60 5 270
## # ... with 525 more variables
I have chosen to use tibble from dplyr package because it has some benefits over data frame and one of them we can use head without worries that data could fill our console.
which(colnames(swit_data) == c("eneffap", "banhhap"))
## [1] 179 210
clim_swit <- swit_data[,c(179:210)]
head(clim_swit)
## # A tibble: 6 x 32
## eneffap rdcenr cflsenr elgcoal elgngas elghydr elgnuc elgsun elgwind
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 5 9 5 2 1 3 1 1
## 2 10 6 10 4 2 2 5 2 2
## 3 9 3 10 5 5 2 4 2 3
## 4 5 3 1 5 4 3 2 4 5
## 5 8 3 3 2 4 1 4 2 2
## 6 10 4 5 5 2 2 5 3 2
## # ... with 23 more variables
From inquiry I have taken the first and the last question on climate changes and made a new subset with only these questions.
We can see from the inquiry that all of the questions are about worries and attitudes on scale, so our variables are ordinal and we need to change them into factor. And there are two options:
clim_swit <- as.data.frame(sapply(clim_swit, function(x) as.factor(unlist(x))))
str(clim_swit)
## 'data.frame': 1525 obs. of 32 variables:
## $ eneffap : Factor w/ 12 levels "0","1","10","2",..: 12 3 12 7 10 3 3 10 3 6 ...
## $ rdcenr : Factor w/ 8 levels "1","2","3","4",..: 5 7 3 3 3 4 2 4 4 5 ...
## $ cflsenr : Factor w/ 12 levels "0","1","10","2",..: 12 3 3 2 5 7 4 10 9 10 ...
## $ elgcoal : Factor w/ 8 levels "1","2","3","4",..: 5 4 5 5 2 5 5 3 4 5 ...
## $ elgngas : Factor w/ 8 levels "1","2","3","4",..: 2 2 5 4 4 2 4 2 3 2 ...
## $ elghydr : Factor w/ 8 levels "1","2","3","4",..: 1 2 2 3 1 2 3 2 2 2 ...
## $ elgnuc : Factor w/ 8 levels "1","2","3","4",..: 3 5 4 2 4 5 2 4 5 3 ...
## $ elgsun : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 4 2 3 3 1 2 1 ...
## $ elgwind : Factor w/ 8 levels "1","2","3","4",..: 1 2 3 5 2 2 4 1 2 2 ...
## $ elgbio : Factor w/ 8 levels "1","2","3","4",..: 1 2 2 6 3 2 4 2 3 2 ...
## $ wrpwrct : Factor w/ 7 levels "1","2","3","4",..: 2 1 1 1 2 2 2 2 2 3 ...
## $ wrenexp : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 3 2 6 2 3 2 ...
## $ wrdpimp : Factor w/ 7 levels "1","2","3","4",..: 3 3 3 7 4 2 3 2 2 4 ...
## $ wrdpfos : Factor w/ 6 levels "1","2","3","4",..: 4 3 2 4 4 2 3 2 2 4 ...
## $ wrntdis : Factor w/ 6 levels "1","2","3","4",..: 3 1 1 4 1 2 2 2 2 3 ...
## $ wrinspw : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 4 2 2 2 2 3 ...
## $ wrtcfl : Factor w/ 7 levels "1","2","3","4",..: 3 1 1 3 5 2 3 2 3 3 ...
## $ wrtratc : Factor w/ 6 levels "1","2","3","4",..: 3 1 1 2 5 2 2 2 2 2 ...
## $ clmchng : Factor w/ 5 levels "1","2","3","4",..: 1 2 1 1 1 1 1 1 1 1 ...
## $ clmthgt1: Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ clmthgt2: Factor w/ 7 levels "1","2","3","4",..: 4 2 4 3 4 4 4 4 2 4 ...
## $ ccnthum : Factor w/ 8 levels "1","2","3","4",..: 4 4 4 3 3 3 4 3 3 3 ...
## $ ccrdprs : Factor w/ 13 levels "0","1","10","2",..: 13 3 11 7 6 3 5 11 11 7 ...
## $ wrclmch : Factor w/ 7 levels "1","2","3","4",..: 3 2 3 2 4 3 3 3 3 4 ...
## $ ccgdbd : Factor w/ 13 levels "0","1","10","2",..: 2 7 4 12 2 6 4 4 5 2 ...
## $ lkredcc : Factor w/ 14 levels "0","1","10","2",..: 8 3 10 13 12 10 2 12 3 6 ...
## $ lklmten : Factor w/ 14 levels "0","1","10","2",..: 7 4 5 6 5 7 4 8 7 4 ...
## $ gvsrdcc : Factor w/ 13 levels "0","1","10","2",..: 6 5 7 8 4 7 4 7 8 11 ...
## $ ownrdcc : Factor w/ 14 levels "0","1","10","2",..: 5 5 2 13 4 7 5 10 10 5 ...
## $ inctxff : Factor w/ 7 levels "1","2","3","4",..: 3 4 2 3 2 2 3 4 2 4 ...
## $ sbsrnen : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 2 4 2 2 2 2 1 ...
## $ banhhap : Factor w/ 6 levels "1","2","3","4",..: 2 2 2 2 2 2 1 2 4 2 ...
And with help of sjmisc package
clim_swit <- to_factor(clim_swit)
str(clim_swit)
## 'data.frame': 1525 obs. of 32 variables:
## $ eneffap : Factor w/ 12 levels "0","1","10","2",..: 12 3 12 7 10 3 3 10 3 6 ...
## $ rdcenr : Factor w/ 8 levels "1","2","3","4",..: 5 7 3 3 3 4 2 4 4 5 ...
## $ cflsenr : Factor w/ 12 levels "0","1","10","2",..: 12 3 3 2 5 7 4 10 9 10 ...
## $ elgcoal : Factor w/ 8 levels "1","2","3","4",..: 5 4 5 5 2 5 5 3 4 5 ...
## $ elgngas : Factor w/ 8 levels "1","2","3","4",..: 2 2 5 4 4 2 4 2 3 2 ...
## $ elghydr : Factor w/ 8 levels "1","2","3","4",..: 1 2 2 3 1 2 3 2 2 2 ...
## $ elgnuc : Factor w/ 8 levels "1","2","3","4",..: 3 5 4 2 4 5 2 4 5 3 ...
## $ elgsun : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 4 2 3 3 1 2 1 ...
## $ elgwind : Factor w/ 8 levels "1","2","3","4",..: 1 2 3 5 2 2 4 1 2 2 ...
## $ elgbio : Factor w/ 8 levels "1","2","3","4",..: 1 2 2 6 3 2 4 2 3 2 ...
## $ wrpwrct : Factor w/ 7 levels "1","2","3","4",..: 2 1 1 1 2 2 2 2 2 3 ...
## $ wrenexp : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 3 2 6 2 3 2 ...
## $ wrdpimp : Factor w/ 7 levels "1","2","3","4",..: 3 3 3 7 4 2 3 2 2 4 ...
## $ wrdpfos : Factor w/ 6 levels "1","2","3","4",..: 4 3 2 4 4 2 3 2 2 4 ...
## $ wrntdis : Factor w/ 6 levels "1","2","3","4",..: 3 1 1 4 1 2 2 2 2 3 ...
## $ wrinspw : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 4 2 2 2 2 3 ...
## $ wrtcfl : Factor w/ 7 levels "1","2","3","4",..: 3 1 1 3 5 2 3 2 3 3 ...
## $ wrtratc : Factor w/ 6 levels "1","2","3","4",..: 3 1 1 2 5 2 2 2 2 2 ...
## $ clmchng : Factor w/ 5 levels "1","2","3","4",..: 1 2 1 1 1 1 1 1 1 1 ...
## $ clmthgt1: Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ clmthgt2: Factor w/ 7 levels "1","2","3","4",..: 4 2 4 3 4 4 4 4 2 4 ...
## $ ccnthum : Factor w/ 8 levels "1","2","3","4",..: 4 4 4 3 3 3 4 3 3 3 ...
## $ ccrdprs : Factor w/ 13 levels "0","1","10","2",..: 13 3 11 7 6 3 5 11 11 7 ...
## $ wrclmch : Factor w/ 7 levels "1","2","3","4",..: 3 2 3 2 4 3 3 3 3 4 ...
## $ ccgdbd : Factor w/ 13 levels "0","1","10","2",..: 2 7 4 12 2 6 4 4 5 2 ...
## $ lkredcc : Factor w/ 14 levels "0","1","10","2",..: 8 3 10 13 12 10 2 12 3 6 ...
## $ lklmten : Factor w/ 14 levels "0","1","10","2",..: 7 4 5 6 5 7 4 8 7 4 ...
## $ gvsrdcc : Factor w/ 13 levels "0","1","10","2",..: 6 5 7 8 4 7 4 7 8 11 ...
## $ ownrdcc : Factor w/ 14 levels "0","1","10","2",..: 5 5 2 13 4 7 5 10 10 5 ...
## $ inctxff : Factor w/ 7 levels "1","2","3","4",..: 3 4 2 3 2 2 3 4 2 4 ...
## $ sbsrnen : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 2 4 2 2 2 2 1 ...
## $ banhhap : Factor w/ 6 levels "1","2","3","4",..: 2 2 2 2 2 2 1 2 4 2 ...
The difference is the first way is more complicated and we lose our attr(vector, 'label'). I will stick with the second way, but it’s better to save these labels.
swit_attr_lab <- lapply(swit_data, attr, 'label')
There is another problem that I’m not sure is about our discussion but nevertheless I will show it up.
sapply(clim_swit, function(x) length(unique(x)))
## eneffap rdcenr cflsenr elgcoal elgngas elghydr elgnuc elgsun
## 12 8 12 8 8 8 8 7
## elgwind elgbio wrpwrct wrenexp wrdpimp wrdpfos wrntdis wrinspw
## 8 8 7 6 7 6 6 6
## wrtcfl wrtratc clmchng clmthgt1 clmthgt2 ccnthum ccrdprs wrclmch
## 7 6 5 6 7 8 13 7
## ccgdbd lkredcc lklmten gvsrdcc ownrdcc inctxff sbsrnen banhhap
## 13 14 14 13 14 7 7 6
sapply(clim_swit, nlevels)
## eneffap rdcenr cflsenr elgcoal elgngas elghydr elgnuc elgsun
## 12 8 12 8 8 8 8 7
## elgwind elgbio wrpwrct wrenexp wrdpimp wrdpfos wrntdis wrinspw
## 8 8 7 6 7 6 6 6
## wrtcfl wrtratc clmchng clmthgt1 clmthgt2 ccnthum ccrdprs wrclmch
## 7 6 5 6 7 8 13 7
## ccgdbd lkredcc lklmten gvsrdcc ownrdcc inctxff sbsrnen banhhap
## 13 14 14 13 14 7 7 6
To make our variables appropriate we need change values that is given in the inquiry, I have done it with the help of plyr package.
clim_swit[, 4:10] <- sapply(clim_swit[, 4:10], function(x) {mapvalues(x, from = c("1", "2", "3", "4", "5", "55", "77", "88", "99"),
to = c("A very large amount", "A large amount", "A medium amount", "A small amount",
"None at all", "I have not heard of this energy source before", "Refusal",
"Don't know", "No answer"))})
## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 5, 99
## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 99
clim_swit <- to_factor(clim_swit)
head(clim_swit)
## eneffap rdcenr cflsenr elgcoal elgngas elghydr
## 1 9 5 9 None at all A large amount A very large amount
## 2 10 6 10 A small amount A large amount A large amount
## 3 9 3 10 None at all None at all A large amount
## 4 5 3 1 None at all A small amount A medium amount
## 5 8 3 3 A large amount A small amount A very large amount
## 6 10 4 5 None at all A large amount A large amount
## elgnuc elgsun elgwind
## 1 A medium amount A very large amount A very large amount
## 2 None at all A large amount A large amount
## 3 A small amount A large amount A medium amount
## 4 A large amount A small amount None at all
## 5 A small amount A large amount A large amount
## 6 None at all A medium amount A large amount
## elgbio wrpwrct wrenexp wrdpimp
## 1 A very large amount 2 2 3
## 2 A large amount 1 1 3
## 3 A large amount 1 1 3
## 4 I have not heard of this energy source before 1 2 8
## 5 A medium amount 2 3 4
## 6 A large amount 2 2 2
## wrdpfos wrntdis wrinspw wrtcfl wrtratc clmchng clmthgt1 clmthgt2 ccnthum
## 1 4 3 2 3 3 1 6 4 4
## 2 3 1 1 1 1 2 6 2 4
## 3 2 1 1 1 1 1 6 4 4
## 4 4 4 2 3 2 1 6 3 3
## 5 4 1 4 5 5 1 6 4 3
## 6 2 2 2 2 2 1 6 4 3
## ccrdprs wrclmch ccgdbd lkredcc lklmten gvsrdcc ownrdcc inctxff sbsrnen
## 1 9 3 1 6 5 4 3 3 1
## 2 10 2 5 10 2 3 3 4 2
## 3 8 3 2 7 3 5 1 2 2
## 4 5 2 88 88 4 6 88 3 2
## 5 4 4 1 8 3 2 2 2 4
## 6 10 3 4 7 5 5 5 2 2
## banhhap
## 1 2
## 2 2
## 3 2
## 4 2
## 5 2
## 6 2
Now we partly ready for doing plots.
dslabs packageggplot2clim_swit$elgsun <- factor(clim_swit$elgsun, levels = c("A very large amount", "A large amount",
"A medium amount", "A small amount",
"None at all",
"I have not heard of this energy source before",
"Refusal",
"Don't know", "No answer"))
ds_theme_set()
ggplot(clim_swit, aes(elgsun, fill = elgsun)) +
geom_bar() +
labs(title = "How much electricity in Switzerland should be generated from solar power?",x = "", y = "Number of answers", fill = "Answers:") +
scale_y_continuous(limits=c(0, 800)) +
theme(axis.text.x = element_blank(), axis.ticks = element_blank())
As we see here most of the respondents think that a very large amount and a large amount of electricity should be generated in their country(Switzerland).
Then we do almost the same preparation with another ordinal variables
impr_things <- swit_data[,c(which(colnames(swit_data) == "contplt"):which(colnames(swit_data) == "pstplonl"))]
impr_things_attr_lab <- lapply(impr_things, attr, 'label')
impr_things <- to_factor(impr_things)
impr_things <- sapply(impr_things, function(x) {mapvalues(x, from = c("1", "2", "7", "8", "9"),
to = c("Yes", "No", "Refusal", "Don't know", "No answer"))})
impr_things <- to_factor(as_tibble(impr_things))
Check that we have the same number of rows and combine data:
nrow(impr_things) == nrow(clim_swit)
## [1] TRUE
clim_impr_bind <- bind_cols(impr_things, clim_swit)
clim_impr_bind$bctprd <- factor(clim_impr_bind$bctprd, levels = c("Yes", "No", "Refusal", "Don't know", "No answer"))
ggplot(clim_impr_bind, aes(elgsun, fill = bctprd)) +
geom_bar(position = "stack") +
labs(title = "How much electricity in Switzerland should be generated from solar power?", y = "Number of answers", fill = "Boycotting certain products in last 12 months:", x = "") +
scale_y_continuous(limits=c(0, 800)) +
theme(axis.text.x = element_text(angle=90, hjust = 1))
From this plot we see that the proportion of people who was boycotting certain products in last 12 months are almost the same for all groups of people who have different attitudes about amount of generating solar energy in their country. So, there is less likely the connection between these variables.
I have found out 3 continuous variables lets try to work with some of them.
length(which(swit_data$netustm == 6666))
## [1] 336
wrk_hours <- swit_data %>%
select(wkhtot, lknemny, netustm) %>%
filter(wkhtot <= 168, netustm < 6000)
wrk_hours$lknemny <- to_factor(wrk_hours$lknemny)
wrk_hours$lknemny <- mapvalues(wrk_hours$lknemny, from = c("1", "2", "3", "4", "7", "8", "9"), to = c("Not at all likely", "Not very likely",
"Likely", "Very likely", "Refusal", "Don't know",
"No answer"))
kable(summary(wrk_hours))
| wkhtot | lknemny | netustm | |
|---|---|---|---|
| Min. : 0.00 | Not at all likely:503 | Min. : 4.0 | |
| 1st Qu.:30.00 | Not very likely :455 | 1st Qu.: 60.0 | |
| Median :42.00 | Likely :107 | Median : 120.0 | |
| Mean :37.44 | Very likely : 24 | Mean : 164.1 | |
| 3rd Qu.:46.00 | Don’t know : 16 | 3rd Qu.: 240.0 | |
| Max. :99.00 | NA | Max. :1200.0 |
sd(wrk_hours$wkhtot)
## [1] 16.22065
sd(wrk_hours$netustm)
## [1] 164.5726
We can see some elements of descriptive statistic, we have a “big” standard deviations especially for netustm, for wkhtot the mean is less than median so we might see left skewed distribution and for netustm It’s definitely right skewed.
ggplot(wrk_hours, aes(wkhtot)) +
geom_histogram(aes(y = ..density..), fill = "white", col = "black", bins = 30) +
geom_density(alpha = 0.3, fill = "#FF6666") +
labs(title = "Total hours normally worked per week in main job overtime included", x = "Hours") +
scale_x_continuous(breaks = c(seq(0, 100, 10)), limits = c(0, 100))
So, it actually seems skewed in both ways and it’s also leptokurtic.
ggplot(wrk_hours, aes(netustm)) +
geom_histogram(aes(y = ..density..), fill = "white", col = "black", bins = 30) +
geom_density(alpha = 0.3, fill = "#FF6666") +
labs(title = "Internet use, how much time on typical day, in minutes for regular users", x = "Minutes")
As i said before it’s highly right skewed.
Let’s see how our skewed variables interact in scatter plot.
ggplot(wrk_hours, aes(wkhtot, netustm)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
geom_hline(yintercept = 164.1, color = "red", linetype = "dotted") +
scale_y_continuous(breaks = c(seq(0, 900, 30)), limits = c(0, 1200)) +
labs(title = "Scatter Plot of the total hours worked per week with Time on the internet per day", y = "Minutes in the internet", x = "Working hours")
It’s hardly find any relation between these variables, I put lm and mean on this graph to show that if we were needed to predict net use time from working hours, our best prediction is not likely to be much different from mean time of using the internet per day.
On the last plot we would see connection between attitudes on the question “How likely not enough money for household necessities next 12 months” and time of using the internet per day for regular users.
ggplot(wrk_hours, aes(lknemny, netustm)) +
geom_boxplot() +
labs(x = "How likely not enough money for household necessities next 12 months", y = "Minutes in the internet per day")
As in previous results we hardly can say any difference between groups because all boxes intersects.
We have not tested any hypothesis, but our “playing” with data showed some information about variables and I think if we had any hypothesis these kind results usually called negative. So, that’s it.