STAT 545A Homework 5

Matt Gingerich

The Canadian Internet Use Survey

For this report, I'm working with Canadian Internet Use Survey (CIUS) data from 2005 and 2009. Information about this dataset can be found here, on the Statistics Canada website, and people with access to the ABACUS shared collection of B.C. research libraries can use these direct links for the 2005 data and 2009 data.

This data is distributed as raw text (ASCII) files that consist entirely of integers representing factor levels, which need to be parsed using the SAS or SPSS syntax files that are provided with the data. I did have some success importing this data with the R package SAScii; however, for the purpose of importing the data with full factor labels, it proved easiest to first import the data with SPSS and export a tab-separated file (.dat) that can be easily read into R.

# Reading the 2005 and 2009 data into two separate data frames.
ciusDat2005 <- read.delim("CIUS_2005.dat", sep = "\t")
ciusDat2009 <- read.delim("CIUS_2009.dat", sep = "\t")

# Printing column names gives a quick sense of the scale of the dataset.
colnames(ciusDat2009)

##   [1] "�..PUMFID" "PROVINCE"  "REGION"    "G_URBRUR"  "GCAGEGR6" 
##   [6] "CSEX"      "G_CEDUC"   "G_CSTUD"   "G_CLFSST"  "GFAMTYPE" 
##  [11] "G_HHSIZE"  "G_HEDUC"   "G_HSTUD"   "EV_Q01"    "EV_Q02"   
##  [16] "PU_Q01"    "PU_Q02"    "PU_Q03"    "PU_Q06A"   "PU_Q06E"  
##  [21] "PU_Q06J"   "PU_Q06K"   "PU_G06"    "LU_Q01"    "LU_Q02"   
##  [26] "LU_G03"    "LU_Q04"    "LU_Q05"    "LU_Q06A"   "LU_Q06B"  
##  [31] "LU_G06"    "IU_Q01A"   "IU_Q01B"   "IU_Q01E"   "IU_G01"   
##  [36] "IU_Q01G"   "IU_Q02A"   "IU_Q02B"   "IU_Q02E"   "IU_G02"   
##  [41] "IU_Q03"    "IU_Q04"    "IU_Q05"    "IU_Q06"    "SU_Q01"   
##  [46] "SU_Q02"    "SU_Q03"    "SU_Q04"    "SU_Q05"    "SU_Q06"   
##  [51] "SU_Q07"    "SU_Q08"    "SU_Q09"    "SU_Q10"    "SU_Q11"   
##  [56] "SU_Q12"    "SU_Q13"    "SU_Q14"    "SU_Q15"    "SU_Q16"   
##  [61] "SU_Q17"    "SU_Q18"    "SU_Q19"    "SU_Q20"    "SU_Q21"   
##  [66] "SU_Q22"    "SU_Q23"    "SU_Q24"    "SU_Q25"    "GL_Q01A"  
##  [71] "GL_Q01B"   "GL_Q01C"   "GL_Q01D"   "GL_Q01E"   "GL_Q01F"  
##  [76] "GL_Q01G"   "GL_Q01H"   "GL_G01"    "EC_Q01"    "EC_Q02A"  
##  [81] "EC_Q02B"   "EC_Q02C"   "EC_Q02D"   "EC_Q02E"   "EC_Q02F"  
##  [86] "EC_Q02I"   "EC_Q02J"   "EC_Q02K"   "EC_Q02L"   "EC_Q02M"  
##  [91] "EC_Q02N"   "EC_Q02O"   "EC_Q02P"   "EC_Q02Q"   "EC_G02"   
##  [96] "EC_Q03"    "EC_Q04"    "EC_Q05"    "EC_Q06"    "EC_Q07A"  
## [101] "EC_Q07B"   "EC_G07"    "EC_Q08"    "EC_Q09A"   "EC_Q09B"  
## [106] "EC_Q09C"   "EC_Q09D"   "EC_Q09E"   "EC_Q09F"   "EC_Q09J"  
## [111] "EC_Q09K"   "EC_Q09L"   "EC_Q09M"   "EC_Q09N"   "EC_Q09O"  
## [116] "EC_Q09P"   "EC_Q09Q"   "EC_Q09R"   "EC_G09"    "EC_Q10"   
## [121] "NU_Q01"    "NU_Q02A"   "NU_Q02B"   "NU_Q02C"   "NU_Q02D"  
## [126] "NU_Q02E"   "NU_Q02F"   "NU_Q02I"   "NU_G02K"   "NU_G02"   
## [131] "PS_Q01"    "PS_Q02"    "PS_Q03"    "PS_Q04"    "PS_Q05"   
## [136] "G_HQUINT"  "WTPP"

Note: The name of the first column is noticeably strange. The column should be titled “PUMFID” but it's prefixed with stray characters. A likely cause (which I haven't confirmed) is that SPSS wrote a byte order mark at the start of its tab-delimited output that R was not expecting; fortunately, this doesn't impact our use of this data, although it's something that I'd like to clean up eventually.
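
If the stray prefix really is a UTF-8 byte order mark (an assumption, not something verified against the CIUS export), one way to clean it up is to declare the encoding when reading the file. The sketch below demonstrates this on a tiny stand-in file whose contents are invented for illustration.

```r
# Build a tiny stand-in file that starts with a UTF-8 byte order mark.
# The contents are invented for illustration; they are not CIUS records.
bomFile <- tempfile(fileext = ".dat")
con <- file(bomFile, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # the three BOM bytes
writeBin(charToRaw("PUMFID\tPROVINCE\n1\tQuebec\n2\tOntario\n"), con)
close(con)

# Declaring the encoding as "UTF-8-BOM" tells R to skip the marker.
bomDat <- read.delim(bomFile, fileEncoding = "UTF-8-BOM")
colnames(bomDat)

# Assumed equivalent for the real file (untested here):
# ciusDat2009 <- read.delim("CIUS_2009.dat", fileEncoding = "UTF-8-BOM")
```

If the first column name still comes out mangled with this option, the prefix has some other origin and the export settings in SPSS would be the next place to look.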

As the previous output shows, one of the significant challenges in working with this dataset is that there is a huge number of columns in each data frame. These columns correspond to questions asked in the survey and many of the values in the data frame are “Valid Skip” which indicates that a respondent didn't answer a question because of their response to a previous question (for instance, participants who say they have internet access are not asked why they don't have internet access).
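
To get a feel for how much of each column is taken up by skipped questions, one could compute the share of rows equal to the skip label. This is only a sketch: `skipShare` is a hypothetical helper, the toy data frame uses made-up column names, and the label text must match whatever spelling the imported data actually uses.

```r
# Share of rows in each column that equal the skip label.
# `skipLabel` must match the label text exactly as it appears in the data.
skipShare <- function(dat, skipLabel = "Valid Skip") {
    vapply(dat, function(col) mean(as.character(col) == skipLabel, na.rm = TRUE), 
        numeric(1))
}

# Toy data frame with hypothetical column names (not CIUS columns).
toySkipDat <- data.frame(
    Q_A = c("Yes", "No", "Yes", "No"),
    Q_B = c("Valid Skip", "Cost", "Valid Skip", "No need"))
skipShare(toySkipDat)
```

On the real data this would be called as `skipShare(ciusDat2009)`, and sorting the result would show which questions most respondents never saw.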

The column names are also not entirely descriptive. A few columns, such as “PROVINCE” and “CSEX”, are simple enough to understand (they describe the province and sex of a survey respondent), but interpreting the other columns requires consulting a codebook document which describes how each column was generated from the survey responses.

The following code snippet pulls the columns “PROVINCE”, “CSEX”, “GFAMTYPE”, “G_CEDUC”, and “PS_Q01” from the final rows of the dataset as a sanity check that the data imported correctly. GFAMTYPE describes the household type, G_CEDUC describes the highest level of education attained by the respondent, and PS_Q01 is the respondent's answer to the question:

In general, how concerned (are you/would you be) about privacy on the Internet? For example, people finding out what websites you have visited, others reading your e-mail?

library(xtable)
print(xtable(tail(ciusDat2009[c("PROVINCE", "CSEX", "GFAMTYPE", "G_CEDUC", "PS_Q01")])), 
    type = "html")

	PROVINCE	CSEX	GFAMTYPE	G_CEDUC	PS_Q01
23173	Quebec	Male	Single family household without unmarried children under 16	High school or less	Concerned
23174	Manitoba	Male	One person households	High school or less	Not at all concerned
23175	Manitoba	Female	Single family household without unmarried children under 16	High school or less	Very concerned
23176	Newfoundland and Labrador	Female	Multi family households	College or some post-secondary	Concerned
23177	Ontario	Female	One person households	High school or less	Very concerned
23178	British Columbia	Female	Single family household without unmarried children under 16	University certificate or degree	Concerned

Visualizing Attitudes on Privacy

To begin, I plotted the total counts of responses to the general privacy question in 2005. The figure, shown below, reveals that the majority of respondents are concerned about privacy on the internet.

# We've lost factor ordering due to the tab-delimited input format, so we
# need to reorder here to get a sensible layout.
ciusConcernLabels <- c("Not at all concerned", "Concerned", "Very concerned", 
    "Don't know", "Refusal", "Not stated")
ciusDat2005 <- within(ciusDat2005, PS_Q01 <- factor(PS_Q01, levels = ciusConcernLabels))

library(ggplot2)
p <- ggplot(ciusDat2005, aes(PS_Q01))
p <- p + geom_bar()
p + theme(text = element_text(size = 16))  # make the text less tiny

[Figure: bar chart of PS_Q01 (privacy concern) response counts, 2005]
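
The reordering step matters because R sorts factor levels alphabetically unless told otherwise. A minimal sketch on a few hypothetical responses (reusing three of the survey labels) shows the difference, and also what happens when a value is missing from the `levels` argument.

```r
# Hypothetical responses, reusing three of the labels from the survey.
toyResponses <- c("Very concerned", "Concerned", "Not at all concerned", "Concerned")

# Default: levels are sorted alphabetically.
levels(factor(toyResponses))

# Explicit levels: the order given here is the order used on the plot axis.
toyLevels <- c("Not at all concerned", "Concerned", "Very concerned")
toyFactor <- factor(toyResponses, levels = toyLevels)
table(toyFactor)

# Values that are missing from `levels` become NA without any warning,
# so a typo in a label would quietly turn those responses into NA.
toyNaCount <- sum(is.na(factor(toyResponses, levels = c("Concerned", "Very concerned"))))
toyNaCount
```

Checking `sum(is.na(ciusDat2005$PS_Q01))` after the reordering is a cheap way to confirm that every label in the data matched one of the listed levels.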

Next, I wanted to compare the responses from 2005 and 2009, but this is tricky with the data in separate data frames. It's possible to use layering to draw two bar graphs on top of each other, but I wasn't satisfied with the outcome of that, so I decided to combine the data frames together. This is challenging, because the two datasets don't have the same number of columns and I wanted the output data frame to have a new factor representing the year. In the end, I decided to manually glue parts of the data frames together with cbind/rbind and then fix my types by making a new data frame after the concatenation.

With the combined data, I make a “dodged” bar graph (as opposed to the default stacked bar graph) to highlight the differences between the two years.

ciusDat2009 <- within(ciusDat2009, PS_Q01 <- factor(PS_Q01, levels = ciusConcernLabels))

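# Doing some R gymnastics to cobble together these two datasets. cbind()
# reduces each factor to its integer codes, so the labels are restored below.
ciusDatJoint <- rbind(cbind(ciusDat2005$PS_Q01, 2005), cbind(ciusDat2009$PS_Q01, 
    2009))
# Passing levels explicitly keeps codes and labels aligned even if one of the
# levels never occurs in the data.
ciusDatJoint <- data.frame(response = factor(ciusDatJoint[, 1], levels = seq_along(ciusConcernLabels), 
    labels = ciusConcernLabels), year = factor(ciusDatJoint[, 2]))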

p <- ggplot(ciusDatJoint, aes(response, fill = year))
p + geom_bar(position = "dodge") + theme(text = element_text(size = 16))

[Figure: dodged bar chart of PS_Q01 response counts, filled by year (2005 and 2009)]
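
An alternative to the cbind/rbind approach, sketched here on toy stand-ins for the two years (the rows are invented, not CIUS records), is to build a small two-column data frame per year and stack those. Because the response column stays a factor throughout, there are no integer codes to convert back.

```r
# Toy stand-ins for the two survey years; only the PS_Q01 column is needed.
toyLevels <- c("Not at all concerned", "Concerned", "Very concerned")
toy2005 <- data.frame(PS_Q01 = factor(c("Concerned", "Very concerned"), 
    levels = toyLevels))
toy2009 <- data.frame(PS_Q01 = factor(c("Concerned", "Not at all concerned", 
    "Concerned"), levels = toyLevels))

# Stack one small data frame per year; rbind() keeps the factor levels.
toyJoint <- rbind(
    data.frame(response = toy2005$PS_Q01, year = "2005"), 
    data.frame(response = toy2009$PS_Q01, year = "2009"))
toyJoint$year <- factor(toyJoint$year)

# Cross-tabulate to check that both years and all levels survived.
table(toyJoint$year, toyJoint$response)
```

The resulting data frame has the same `response` and `year` columns as `ciusDatJoint`, so the plotting code would not need to change.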

The following plot shows the breakdown of responses by gender and province for the 2009 data.

p <- ggplot(subset(ciusDat2009, PS_Q01 != "Don't know" & PS_Q01 != "Refusal" & 
    PS_Q01 != "Not stated"), aes(PS_Q01, fill = CSEX))
p <- p + geom_bar() + facet_wrap(~PROVINCE)
p + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), text = element_text(size = 16))

[Figure: bar charts of PS_Q01 response counts for 2009, faceted by PROVINCE and filled by CSEX]

There's nothing too exciting in the preceding plot, but it might be interesting to see whether students tend to have different opinions on privacy than non-students. So let's split on the G_CSTUD factor (labelled “Yes” if the respondent is a student and “No” otherwise) and colour the bars based on the highest level of education that the respondent has reached.

p <- ggplot(subset(ciusDat2009, PS_Q01 != "Don't know" & PS_Q01 != "Refusal" & 
    PS_Q01 != "Not stated"), aes(PS_Q01, fill = G_CEDUC))
p <- p + geom_bar() + facet_wrap(~G_CSTUD)
p + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

[Figure: bar charts of PS_Q01 response counts for 2009, faceted by G_CSTUD and filled by G_CEDUC]

There are many more non-students than students in the study, but it is interesting to note that students are less likely than the general population to be “very concerned” about privacy on the internet (more students claim to be “not at all concerned” than “very concerned”).
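
Because the two groups differ so much in size, raw counts are hard to compare between the student and non-student panels. One option, sketched here on a small invented data frame (the rows are not CIUS records), is to convert the counts to within-group proportions with `prop.table`.

```r
# Invented rows that reuse the CIUS column names for readability.
toyStudDat <- data.frame(
    G_CSTUD = c("Yes", "Yes", "No", "No", "No", "No"), 
    PS_Q01 = c("Concerned", "Not at all concerned", "Concerned", 
        "Very concerned", "Very concerned", "Concerned"))

# Counts of each response within each student group.
toyCounts <- table(toyStudDat$G_CSTUD, toyStudDat$PS_Q01)

# margin = 1 divides each row by its own total, so every row sums to 1.
toyShares <- prop.table(toyCounts, margin = 1)
round(toyShares, 2)
```

In a plot, `geom_bar(position = "fill")` gives a similar view by scaling each bar to the same height.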

Plotting Quantitative Variables

The survey responses in the CIUS dataset are categorical factors, so for the purposes of showing off ggplot's quantitative plotting functions, I'm going to use the Gapminder dataset for a couple of plots.

gDat <- read.delim("gapminderDataFiveYear.txt")
summary(gDat)  # check data import

##         country          year           pop              continent  
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300  
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396  
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24  
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09                 
##  (Other)    :1632                                                   
##     lifeExp       gdpPercap     
##  Min.   :23.6   Min.   :   241  
##  1st Qu.:48.2   1st Qu.:  1202  
##  Median :60.7   Median :  3532  
##  Mean   :59.5   Mean   :  7215  
##  3rd Qu.:70.8   3rd Qu.:  9325  
##  Max.   :82.6   Max.   :113523  
##

The next plot throws many different pieces of information together: on the x-axis we have life expectancy, the y-axis is a log-10 scale for GDP per capita with logarithmically spaced axis labels, the size of points corresponds to the population of the country, and points are coloured by their continent.

library(scales)
# "Oceania" is a level of continent, not country, so filter on continent.
gDatWithoutOceania <- droplevels(subset(gDat, continent != "Oceania"))
p <- ggplot(gDatWithoutOceania, aes(x = lifeExp, y = gdpPercap, size = pop, 
    colour = continent))
p <- p + coord_trans(y = "log10")
(p <- p + geom_point())

[Figure: scatter plot of gdpPercap (log-10 y-axis) against lifeExp, with point size by pop and colour by continent]
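
Since “Oceania” is a continent rather than a country, it is worth confirming which column a filter like this is applied to and that the rows actually went away. A minimal sketch on a three-row toy data frame (not the Gapminder file itself) compares the two possibilities.

```r
# Three-row toy data frame with the same two columns as the Gapminder data.
toyGap <- data.frame(
    country = c("Canada", "Australia", "Japan"), 
    continent = factor(c("Americas", "Oceania", "Asia")))

# Filtering on continent removes the Oceania row and its factor level.
keptByContinent <- droplevels(subset(toyGap, continent != "Oceania"))
levels(keptByContinent$continent)

# Filtering on country removes nothing, because no country is named "Oceania".
keptByCountry <- droplevels(subset(toyGap, country != "Oceania"))
levels(keptByCountry$continent)
```

On the real data, `table(gDatWithoutOceania$continent)` gives the same kind of check in one line.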

The previous figure is rather messy and has a bit of an overplotting problem, so I'll end with a simpler figure that plots GDP per capita directly against life expectancy, split by continent, and uses hexagonal binning to overcome the overplotting.

# stat_binhex() needs the hexbin package to be installed.
p <- ggplot(gDatWithoutOceania, aes(x = lifeExp, y = gdpPercap))
(p <- p + stat_binhex() + facet_wrap(~continent))

[Figure: hexagonally binned plots of gdpPercap against lifeExp, faceted by continent]

Note: all the figures in this document have been automatically uploaded to imgur. See the markdown source for this document located on gist for details.