Intro

Hi! I have chosen as my topic Climate changes and we will explore data about Switzerland. What is behind this decision? Well, I’m fond of a fancy field, namely STS. And Climate changes is one of the disputable topic in the world and it’s one of the reason why STS researchers are into it. I don’t know nothing about Climate changes and decide that with this project I may find the way into it. And I have chosen Switzerland because the UN Environment Regional office for Europe is located in Geneva, but I’m far not sure that I will stick to Switzerland in future. So, let’s start with looking at our data.

swit_data <- read_sas("ess8ch.sas7bdat")
swit_data <- as_tibble(swit_data)
head(swit_data)

## # A tibble: 6 x 534
##   name  essround edition proddate  idno cntry nwspol netusoft netustm
##   <chr>    <dbl> <chr>   <chr>    <dbl> <chr>  <dbl>    <dbl>   <dbl>
## 1 ESS8~        8 2.1     01.12.2~     1 CH        30        5      60
## 2 ESS8~        8 2.1     01.12.2~     3 CH        60        2    6666
## 3 ESS8~        8 2.1     01.12.2~     5 CH        45        5     300
## 4 ESS8~        8 2.1     01.12.2~     6 CH        60        4     210
## 5 ESS8~        8 2.1     01.12.2~     9 CH        90        5      60
## 6 ESS8~        8 2.1     01.12.2~    14 CH        60        5     270
## # ... with 525 more variables

I have chosen to use tibble from dplyr package because it has some benefits over data frame and one of them we can use head without worries that data could fill our console.

which(colnames(swit_data) == c("eneffap", "banhhap"))

## [1] 179 210

clim_swit <- swit_data[,c(179:210)]
head(clim_swit)

## # A tibble: 6 x 32
##   eneffap rdcenr cflsenr elgcoal elgngas elghydr elgnuc elgsun elgwind
##     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1       9      5       9       5       2       1      3      1       1
## 2      10      6      10       4       2       2      5      2       2
## 3       9      3      10       5       5       2      4      2       3
## 4       5      3       1       5       4       3      2      4       5
## 5       8      3       3       2       4       1      4      2       2
## 6      10      4       5       5       2       2      5      3       2
## # ... with 23 more variables

From inquiry I have taken the first and the last question on climate changes and made a new subset with only these questions.

We can see from the inquiry that all of the questions are about worries and attitudes on scale, so our variables are ordinal and we need to change them into factor. And there are two options:

clim_swit <- as.data.frame(sapply(clim_swit, function(x) as.factor(unlist(x))))
str(clim_swit)

## 'data.frame':    1525 obs. of  32 variables:
##  $ eneffap : Factor w/ 12 levels "0","1","10","2",..: 12 3 12 7 10 3 3 10 3 6 ...
##  $ rdcenr  : Factor w/ 8 levels "1","2","3","4",..: 5 7 3 3 3 4 2 4 4 5 ...
##  $ cflsenr : Factor w/ 12 levels "0","1","10","2",..: 12 3 3 2 5 7 4 10 9 10 ...
##  $ elgcoal : Factor w/ 8 levels "1","2","3","4",..: 5 4 5 5 2 5 5 3 4 5 ...
##  $ elgngas : Factor w/ 8 levels "1","2","3","4",..: 2 2 5 4 4 2 4 2 3 2 ...
##  $ elghydr : Factor w/ 8 levels "1","2","3","4",..: 1 2 2 3 1 2 3 2 2 2 ...
##  $ elgnuc  : Factor w/ 8 levels "1","2","3","4",..: 3 5 4 2 4 5 2 4 5 3 ...
##  $ elgsun  : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 4 2 3 3 1 2 1 ...
##  $ elgwind : Factor w/ 8 levels "1","2","3","4",..: 1 2 3 5 2 2 4 1 2 2 ...
##  $ elgbio  : Factor w/ 8 levels "1","2","3","4",..: 1 2 2 6 3 2 4 2 3 2 ...
##  $ wrpwrct : Factor w/ 7 levels "1","2","3","4",..: 2 1 1 1 2 2 2 2 2 3 ...
##  $ wrenexp : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 3 2 6 2 3 2 ...
##  $ wrdpimp : Factor w/ 7 levels "1","2","3","4",..: 3 3 3 7 4 2 3 2 2 4 ...
##  $ wrdpfos : Factor w/ 6 levels "1","2","3","4",..: 4 3 2 4 4 2 3 2 2 4 ...
##  $ wrntdis : Factor w/ 6 levels "1","2","3","4",..: 3 1 1 4 1 2 2 2 2 3 ...
##  $ wrinspw : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 4 2 2 2 2 3 ...
##  $ wrtcfl  : Factor w/ 7 levels "1","2","3","4",..: 3 1 1 3 5 2 3 2 3 3 ...
##  $ wrtratc : Factor w/ 6 levels "1","2","3","4",..: 3 1 1 2 5 2 2 2 2 2 ...
##  $ clmchng : Factor w/ 5 levels "1","2","3","4",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ clmthgt1: Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ clmthgt2: Factor w/ 7 levels "1","2","3","4",..: 4 2 4 3 4 4 4 4 2 4 ...
##  $ ccnthum : Factor w/ 8 levels "1","2","3","4",..: 4 4 4 3 3 3 4 3 3 3 ...
##  $ ccrdprs : Factor w/ 13 levels "0","1","10","2",..: 13 3 11 7 6 3 5 11 11 7 ...
##  $ wrclmch : Factor w/ 7 levels "1","2","3","4",..: 3 2 3 2 4 3 3 3 3 4 ...
##  $ ccgdbd  : Factor w/ 13 levels "0","1","10","2",..: 2 7 4 12 2 6 4 4 5 2 ...
##  $ lkredcc : Factor w/ 14 levels "0","1","10","2",..: 8 3 10 13 12 10 2 12 3 6 ...
##  $ lklmten : Factor w/ 14 levels "0","1","10","2",..: 7 4 5 6 5 7 4 8 7 4 ...
##  $ gvsrdcc : Factor w/ 13 levels "0","1","10","2",..: 6 5 7 8 4 7 4 7 8 11 ...
##  $ ownrdcc : Factor w/ 14 levels "0","1","10","2",..: 5 5 2 13 4 7 5 10 10 5 ...
##  $ inctxff : Factor w/ 7 levels "1","2","3","4",..: 3 4 2 3 2 2 3 4 2 4 ...
##  $ sbsrnen : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 2 4 2 2 2 2 1 ...
##  $ banhhap : Factor w/ 6 levels "1","2","3","4",..: 2 2 2 2 2 2 1 2 4 2 ...

And with help of sjmisc package

clim_swit <- to_factor(clim_swit)
str(clim_swit)

## 'data.frame':    1525 obs. of  32 variables:
##  $ eneffap : Factor w/ 12 levels "0","1","10","2",..: 12 3 12 7 10 3 3 10 3 6 ...
##  $ rdcenr  : Factor w/ 8 levels "1","2","3","4",..: 5 7 3 3 3 4 2 4 4 5 ...
##  $ cflsenr : Factor w/ 12 levels "0","1","10","2",..: 12 3 3 2 5 7 4 10 9 10 ...
##  $ elgcoal : Factor w/ 8 levels "1","2","3","4",..: 5 4 5 5 2 5 5 3 4 5 ...
##  $ elgngas : Factor w/ 8 levels "1","2","3","4",..: 2 2 5 4 4 2 4 2 3 2 ...
##  $ elghydr : Factor w/ 8 levels "1","2","3","4",..: 1 2 2 3 1 2 3 2 2 2 ...
##  $ elgnuc  : Factor w/ 8 levels "1","2","3","4",..: 3 5 4 2 4 5 2 4 5 3 ...
##  $ elgsun  : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 4 2 3 3 1 2 1 ...
##  $ elgwind : Factor w/ 8 levels "1","2","3","4",..: 1 2 3 5 2 2 4 1 2 2 ...
##  $ elgbio  : Factor w/ 8 levels "1","2","3","4",..: 1 2 2 6 3 2 4 2 3 2 ...
##  $ wrpwrct : Factor w/ 7 levels "1","2","3","4",..: 2 1 1 1 2 2 2 2 2 3 ...
##  $ wrenexp : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 3 2 6 2 3 2 ...
##  $ wrdpimp : Factor w/ 7 levels "1","2","3","4",..: 3 3 3 7 4 2 3 2 2 4 ...
##  $ wrdpfos : Factor w/ 6 levels "1","2","3","4",..: 4 3 2 4 4 2 3 2 2 4 ...
##  $ wrntdis : Factor w/ 6 levels "1","2","3","4",..: 3 1 1 4 1 2 2 2 2 3 ...
##  $ wrinspw : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 4 2 2 2 2 3 ...
##  $ wrtcfl  : Factor w/ 7 levels "1","2","3","4",..: 3 1 1 3 5 2 3 2 3 3 ...
##  $ wrtratc : Factor w/ 6 levels "1","2","3","4",..: 3 1 1 2 5 2 2 2 2 2 ...
##  $ clmchng : Factor w/ 5 levels "1","2","3","4",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ clmthgt1: Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ clmthgt2: Factor w/ 7 levels "1","2","3","4",..: 4 2 4 3 4 4 4 4 2 4 ...
##  $ ccnthum : Factor w/ 8 levels "1","2","3","4",..: 4 4 4 3 3 3 4 3 3 3 ...
##  $ ccrdprs : Factor w/ 13 levels "0","1","10","2",..: 13 3 11 7 6 3 5 11 11 7 ...
##  $ wrclmch : Factor w/ 7 levels "1","2","3","4",..: 3 2 3 2 4 3 3 3 3 4 ...
##  $ ccgdbd  : Factor w/ 13 levels "0","1","10","2",..: 2 7 4 12 2 6 4 4 5 2 ...
##  $ lkredcc : Factor w/ 14 levels "0","1","10","2",..: 8 3 10 13 12 10 2 12 3 6 ...
##  $ lklmten : Factor w/ 14 levels "0","1","10","2",..: 7 4 5 6 5 7 4 8 7 4 ...
##  $ gvsrdcc : Factor w/ 13 levels "0","1","10","2",..: 6 5 7 8 4 7 4 7 8 11 ...
##  $ ownrdcc : Factor w/ 14 levels "0","1","10","2",..: 5 5 2 13 4 7 5 10 10 5 ...
##  $ inctxff : Factor w/ 7 levels "1","2","3","4",..: 3 4 2 3 2 2 3 4 2 4 ...
##  $ sbsrnen : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 2 4 2 2 2 2 1 ...
##  $ banhhap : Factor w/ 6 levels "1","2","3","4",..: 2 2 2 2 2 2 1 2 4 2 ...

The difference is the first way is more complicated and we lose our attr(vector, 'label'). I will stick with the second way, but it’s better to save these labels.

swit_attr_lab <- lapply(swit_data, attr, 'label')

There is another problem that I’m not sure is about our discussion but nevertheless I will show it up.

sapply(clim_swit, function(x) length(unique(x)))

##  eneffap   rdcenr  cflsenr  elgcoal  elgngas  elghydr   elgnuc   elgsun 
##       12        8       12        8        8        8        8        7 
##  elgwind   elgbio  wrpwrct  wrenexp  wrdpimp  wrdpfos  wrntdis  wrinspw 
##        8        8        7        6        7        6        6        6 
##   wrtcfl  wrtratc  clmchng clmthgt1 clmthgt2  ccnthum  ccrdprs  wrclmch 
##        7        6        5        6        7        8       13        7 
##   ccgdbd  lkredcc  lklmten  gvsrdcc  ownrdcc  inctxff  sbsrnen  banhhap 
##       13       14       14       13       14        7        7        6

sapply(clim_swit, nlevels)

##  eneffap   rdcenr  cflsenr  elgcoal  elgngas  elghydr   elgnuc   elgsun 
##       12        8       12        8        8        8        8        7 
##  elgwind   elgbio  wrpwrct  wrenexp  wrdpimp  wrdpfos  wrntdis  wrinspw 
##        8        8        7        6        7        6        6        6 
##   wrtcfl  wrtratc  clmchng clmthgt1 clmthgt2  ccnthum  ccrdprs  wrclmch 
##        7        6        5        6        7        8       13        7 
##   ccgdbd  lkredcc  lklmten  gvsrdcc  ownrdcc  inctxff  sbsrnen  banhhap 
##       13       14       14       13       14        7        7        6

To make our variables appropriate we need change values that is given in the inquiry, I have done it with the help of plyr package.

clim_swit[, 4:10] <- sapply(clim_swit[, 4:10], function(x) {mapvalues(x, from = c("1", "2", "3", "4", "5", "55", "77", "88", "99"),
          to = c("A very large amount", "A large amount", "A medium amount", "A small amount",
                 "None at all", "I have not heard of this energy source before", "Refusal",
                 "Don't know", "No answer"))})

## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 99

## The following `from` values were not present in `x`: 5, 99

## The following `from` values were not present in `x`: 99
## The following `from` values were not present in `x`: 99

clim_swit <- to_factor(clim_swit)
head(clim_swit)

##   eneffap rdcenr cflsenr        elgcoal        elgngas             elghydr
## 1       9      5       9    None at all A large amount A very large amount
## 2      10      6      10 A small amount A large amount      A large amount
## 3       9      3      10    None at all    None at all      A large amount
## 4       5      3       1    None at all A small amount     A medium amount
## 5       8      3       3 A large amount A small amount A very large amount
## 6      10      4       5    None at all A large amount      A large amount
##            elgnuc              elgsun             elgwind
## 1 A medium amount A very large amount A very large amount
## 2     None at all      A large amount      A large amount
## 3  A small amount      A large amount     A medium amount
## 4  A large amount      A small amount         None at all
## 5  A small amount      A large amount      A large amount
## 6     None at all     A medium amount      A large amount
##                                          elgbio wrpwrct wrenexp wrdpimp
## 1                           A very large amount       2       2       3
## 2                                A large amount       1       1       3
## 3                                A large amount       1       1       3
## 4 I have not heard of this energy source before       1       2       8
## 5                               A medium amount       2       3       4
## 6                                A large amount       2       2       2
##   wrdpfos wrntdis wrinspw wrtcfl wrtratc clmchng clmthgt1 clmthgt2 ccnthum
## 1       4       3       2      3       3       1        6        4       4
## 2       3       1       1      1       1       2        6        2       4
## 3       2       1       1      1       1       1        6        4       4
## 4       4       4       2      3       2       1        6        3       3
## 5       4       1       4      5       5       1        6        4       3
## 6       2       2       2      2       2       1        6        4       3
##   ccrdprs wrclmch ccgdbd lkredcc lklmten gvsrdcc ownrdcc inctxff sbsrnen
## 1       9       3      1       6       5       4       3       3       1
## 2      10       2      5      10       2       3       3       4       2
## 3       8       3      2       7       3       5       1       2       2
## 4       5       2     88      88       4       6      88       3       2
## 5       4       4      1       8       3       2       2       2       4
## 6      10       3      4       7       5       5       5       2       2
##   banhhap
## 1       2
## 2       2
## 3       2
## 4       2
## 5       2
## 6       2

Now we partly ready for doing plots.

Plots

Barplot

order values
set theme with help of dslabs package
plotting with help of ggplot2

clim_swit$elgsun <- factor(clim_swit$elgsun, levels = c("A very large amount", "A large amount", 
                                                        "A medium amount", "A small amount",
                                                        "None at all",
                                                        "I have not heard of this energy source before", 
                                                        "Refusal",
                                                        "Don't know", "No answer"))
ds_theme_set()
ggplot(clim_swit, aes(elgsun, fill = elgsun)) +
  geom_bar() +
  labs(title = "How much electricity in Switzerland should be generated from solar power?",x = "", y = "Number of answers", fill = "Answers:") +
  scale_y_continuous(limits=c(0, 800)) +
  theme(axis.text.x = element_blank(), axis.ticks = element_blank())

As we see here most of the respondents think that a very large amount and a large amount of electricity should be generated in their country(Switzerland).

Then we do almost the same preparation with another ordinal variables

impr_things <- swit_data[,c(which(colnames(swit_data) == "contplt"):which(colnames(swit_data) == "pstplonl"))]
impr_things_attr_lab <- lapply(impr_things, attr, 'label')

impr_things <- to_factor(impr_things)

impr_things <- sapply(impr_things, function(x) {mapvalues(x, from = c("1", "2", "7", "8", "9"),
                                                                to = c("Yes", "No", "Refusal", "Don't know", "No answer"))})
impr_things <- to_factor(as_tibble(impr_things))

Check that we have the same number of rows and combine data:

nrow(impr_things) == nrow(clim_swit)

## [1] TRUE

clim_impr_bind <- bind_cols(impr_things, clim_swit)

Stacked barplot

order values for legend
plotting

clim_impr_bind$bctprd <- factor(clim_impr_bind$bctprd, levels = c("Yes", "No", "Refusal", "Don't know", "No answer"))
ggplot(clim_impr_bind, aes(elgsun, fill = bctprd)) +
  geom_bar(position = "stack") +
  labs(title = "How much electricity in Switzerland should be generated from solar power?", y = "Number of answers", fill = "Boycotting certain products in last 12 months:", x = "") +
  scale_y_continuous(limits=c(0, 800)) +
  theme(axis.text.x = element_text(angle=90, hjust = 1))

From this plot we see that the proportion of people who was boycotting certain products in last 12 months are almost the same for all groups of people who have different attitudes about amount of generating solar energy in their country. So, there is less likely the connection between these variables.

I have found out 3 continuous variables lets try to work with some of them.

length(which(swit_data$netustm == 6666))

## [1] 336

wrk_hours <- swit_data %>%
  select(wkhtot, lknemny, netustm) %>%
  filter(wkhtot <= 168, netustm < 6000)
wrk_hours$lknemny <- to_factor(wrk_hours$lknemny)
wrk_hours$lknemny <- mapvalues(wrk_hours$lknemny, from = c("1", "2", "3", "4", "7", "8", "9"), to = c("Not at all likely", "Not very likely",
                                                                                              "Likely", "Very likely", "Refusal", "Don't know",
                                                                                              "No answer"))
kable(summary(wrk_hours))

wkhtot	lknemny	netustm
Min. : 0.00	Not at all likely:503	Min. : 4.0
1st Qu.:30.00	Not very likely :455	1st Qu.: 60.0
Median :42.00	Likely :107	Median : 120.0
Mean :37.44	Very likely : 24	Mean : 164.1
3rd Qu.:46.00	Don’t know : 16	3rd Qu.: 240.0
Max. :99.00	NA	Max. :1200.0

sd(wrk_hours$wkhtot)

## [1] 16.22065

sd(wrk_hours$netustm)

## [1] 164.5726

We can see some elements of descriptive statistic, we have a “big” standard deviations especially for netustm, for wkhtot the mean is less than median so we might see left skewed distribution and for netustm It’s definitely right skewed.

Histogram1

ggplot(wrk_hours, aes(wkhtot)) +
  geom_histogram(aes(y = ..density..), fill = "white", col = "black", bins = 30) +
  geom_density(alpha = 0.3, fill = "#FF6666") +
  labs(title = "Total hours normally worked per week in main job overtime included", x = "Hours") +
  scale_x_continuous(breaks = c(seq(0, 100, 10)), limits = c(0, 100))

So, it actually seems skewed in both ways and it’s also leptokurtic.

Histogram2

ggplot(wrk_hours, aes(netustm)) +
  geom_histogram(aes(y = ..density..), fill = "white", col = "black", bins = 30) +
  geom_density(alpha = 0.3, fill = "#FF6666") +
  labs(title = "Internet use, how much time on typical day, in minutes for regular users", x = "Minutes")

As i said before it’s highly right skewed.

Let’s see how our skewed variables interact in scatter plot.

Scatter plot

ggplot(wrk_hours, aes(wkhtot, netustm)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 164.1, color = "red", linetype = "dotted") +
  scale_y_continuous(breaks = c(seq(0, 900, 30)), limits = c(0, 1200)) +
  labs(title = "Scatter Plot of the total hours worked per week with Time on the internet per day", y = "Minutes in the internet", x = "Working hours")

It’s hardly find any relation between these variables, I put lm and mean on this graph to show that if we were needed to predict net use time from working hours, our best prediction is not likely to be much different from mean time of using the internet per day.

On the last plot we would see connection between attitudes on the question “How likely not enough money for household necessities next 12 months” and time of using the internet per day for regular users.

Box plot

ggplot(wrk_hours, aes(lknemny, netustm)) +
  geom_boxplot() +
  labs(x = "How likely not enough money for household necessities next 12 months", y = "Minutes in the internet per day")

As in previous results we hardly can say any difference between groups because all boxes intersects.

Conclusion

We have not tested any hypothesis, but our “playing” with data showed some information about variables and I think if we had any hypothesis these kind results usually called negative. So, that’s it.

Project_1!

Edu Siberia