Toronto CMA Data Manipulation

Description of the Dataset

This document contains an analysis of a dataset from the Toronto CMA containing 4263 records on 241 different variables. These range from demographic variables, such as age, gender and ethnicity, to lifestyle variables, such as household income and type of dwelling, and variables describing the preferences of the subjects, such as when the internet was last used and the number of mall visits per week. Some of these are categorical variables, whereas others are quantitative. The following analysis is an attempt to gather some insights from the data provided.

Loading the required R packages

library(ggplot2)
library(reshape2)
library(plyr)

Reading the data into R

setwd("C:\\Users\\Alina\\OneDrive\\R")
mydata<-read.csv("torontoCMA.csv")
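A quick check after loading can confirm that the file matches the description above (4263 records on 241 variables). The sketch below is illustrative only: it uses a small made-up data frame written to a temporary file, since the real file is not reproduced here. With the real data, ‘dim(mydata)’ would be expected to return 4263 and 241.

```r
# Toy stand-in for torontoCMA.csv, written to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Age = c(1, 3, 6), Gender = c(1, 2, 2)),
          tmp, row.names = FALSE)

toy <- read.csv(tmp)

dims <- dim(toy)   # number of rows, number of columns
print(dims)
str(toy)           # column types at a glance
print(anyNA(toy))  # TRUE if any value is missing
```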

Graphs and Analysis

Age, Gender and Education

Data on the age groups, gender and education levels of the population were extracted from the dataset and used to plot the population distribution. ‘par(mfrow)’ was called with the aim of arranging the graphs side by side; this setting applies to base R graphics only, so the ggplot2 plots below are still drawn as separate figures. Fill colours were used to distinguish between genders. The factor codes were replaced by descriptive labels for the age groups and education levels, and each graph was given a descriptive title.

par(mfrow=c(1,2))

# Age and Gender Bar Graph
ggplot(mydata,aes(x=factor(Age), fill=factor(Gender)))+
  geom_bar(position="dodge")+
  scale_x_discrete("Age Groups", 
                   labels=c("1"="12-17","2"="18-24","3"="25-34",
                            "4"="35-49","5"="50-64","6"="65+"))+
  scale_fill_discrete(name="Gender",
                      breaks=c("1", "2"),
                      labels=c("Male", "Female"))+
  ggtitle("Age Groups by Gender")+
  ylab('No. of People')


# Education bar graph
# ggplot() is used in place of qplot(), which is deprecated in recent ggplot2 releases
ggplot(mydata,aes(x=factor(Education), fill=factor(Gender)))+
  geom_bar()+
  coord_flip()+
  scale_x_discrete("Education Levels", 
                   labels=c("1"="No Cert","2"="Secondary School Grad",
                            "3"="Certificate/Diploma","4"="University Cert",
                            "5"="Bachelors Degree","6"="Post Grad Degree"))+
  scale_fill_discrete(name="Gender",
                      breaks=c("1", "2"),
                      labels=c("Male", "Female"))+
  ggtitle("Education Levels by Gender")+ylab('No. of People')
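An alternative to relabelling inside each scale is to recode a column once as a labelled factor, so that every later table or plot picks up the labels. The following is a minimal sketch with made-up codes: the vector ‘age_code’ is hypothetical, and the code-to-label mapping is the one used in the age plot above.

```r
# Hypothetical coded values, using the age-group codes from the plot labels
age_code <- c(1, 3, 3, 6, 2)

age_group <- factor(age_code,
                    levels = 1:6,
                    labels = c("12-17", "18-24", "25-34",
                               "35-49", "50-64", "65+"))

# Counts per labelled group; groups with no observations are kept as zero
age_counts <- table(age_group)
print(age_counts)
```

Because all six levels are declared up front, groups that happen to be empty still appear in tables and on plot axes.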

Employment

The data on employment and personal income were used to plot the pie chart and bar graph below. First, the categorical employment status data (full-time, part-time and unemployed) were converted to a factor whose levels are ordered by decreasing frequency. The centres of the three portions of the bar were calculated and saved in the ‘at’ variable for labelling purposes. The percentage share of each employment type was calculated, rounded to two decimal places before scaling (i.e. to whole percentages), and saved in the ‘label’ variable. Finally, the ggplot2 package was used to create a stacked bar for employment status, which was converted into a pie chart through the use of polar coordinates. The labels and key were added to complete the pie chart, and excess elements such as axis text and axis ticks were removed to give a clean-looking figure.

#Employment Pie Chart
mydata$EmploySum <- reorder(mydata$EmploySum, X = mydata$EmploySum, 
                            FUN = function(x) -length(x))

at <- nrow(mydata) - as.numeric(cumsum(sort(table(mydata$EmploySum)))-
                                  0.5*sort(table(mydata$EmploySum)))

label=paste0(round(sort(table(mydata$EmploySum))/
                     sum(table(mydata$EmploySum)),2) * 100,"%")

p <- ggplot(mydata,aes(x="", fill = EmploySum)) +
  geom_bar(width = 1) +
  scale_fill_discrete(name="Employment",
                      breaks=c("1","2","3"),
                      labels=c("Full-time", "Part-time","Unemployed"))+
  coord_polar(theta="y") +
  annotate(geom = "text", y = at, x = 1, label = label,size=4)+
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid  = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())+
  ggtitle("Employment Type")
print(p)
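The label positions used above come down to simple arithmetic on the category counts. The sketch below works through it with made-up counts (‘counts’, ‘at_toy’ and ‘label_toy’ are hypothetical names, and the numbers are not taken from the survey). The order of the counts has to match the order in which the segments are stacked, which is worth checking against the rendered plot.

```r
# Hypothetical counts for three employment categories, sorted ascending
counts <- sort(c(full_time = 50, part_time = 30, unemployed = 20))
total  <- sum(counts)

# Midpoint of each segment, measured from the start of the stack
mid <- cumsum(counts) - 0.5 * counts

# The same midpoints measured from the other end, as in the 'at' variable
at_toy <- total - mid

# Whole-percentage labels for each segment
label_toy <- paste0(round(100 * counts / total), "%")

print(at_toy)
print(label_toy)
```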

Personal Income

To present the income distribution of the survey sample, a subset of the original dataset was formatted as a dataframe. It contained the different income levels as factors. The frequency of people for each income level was summed up from the original data. The dataframe was used to plot the bar chart for personal income groups and appropriate labels and title were given to the plot.

#personal income bar chart
df <- data.frame(matrix(ncol = 2, nrow = 12)) 
colnames(df)<-c('Income','Frequency')
df$Income<-c('0','under 25000', '25000-34999','35000-39999','40000-49999',
           '50000-59999','60000-74999','75000-99999','100000-124999',
           '125000-149999','150000-199999','200000+')
for (i in 1:12){df[i,2]<-sum(mydata[,77+i])}

df$Income<-factor(df$Income, levels=df$Income) #to prevent reordering of bars


ggplot(df,aes(x=Income, y=Frequency))+
  geom_bar(stat="identity")+
  scale_x_discrete("Personal Income Groups")+
  coord_flip()+
  ggtitle('Personal Income Distribution')+
  ylab('No. of People')
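The loop over column positions can also be written with ‘colSums’ and column names, which is less sensitive to changes in column order. The sketch below assumes, as the summation above suggests, that each income band is stored as a 0/1 indicator column; the data frame ‘toy’ and its column names are made up for illustration.

```r
# Hypothetical 0/1 indicator columns, one per income band
toy <- data.frame(id       = 1:5,
                  inc_none = c(1, 0, 0, 0, 0),
                  inc_low  = c(0, 1, 1, 0, 0),
                  inc_high = c(0, 0, 0, 1, 1))

band_cols <- c("inc_none", "inc_low", "inc_high")

# One row per band, with the column total as its frequency
freq <- data.frame(Income    = c("0", "under 25000", "25000+"),
                   Frequency = unname(colSums(toy[, band_cols])),
                   stringsAsFactors = FALSE)

# Keep the bands in the order given rather than alphabetical order
freq$Income <- factor(freq$Income, levels = freq$Income)
print(freq)
```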

Days since any magazine was last picked up

The two graphs below show the number of days that have passed since the people in the sample last read or looked at a magazine. The first graph shows the cumulative number of people against the number of days. The data for the graph were subset from the main dataset. The ‘sum’ function was used within a ‘for’ loop to calculate the cumulative frequency totals. The ‘stat="identity"’ setting of ‘geom_bar’ ensures that the bar heights are taken directly from the supplied values instead of being counted from a column of data.

#magazine last read cumulative
df <- data.frame(matrix(ncol = 2, nrow =8 )) 
colnames(df)<-c('Time','Cumulative_Frequency')
df$Time<-c('1 day', '7 days','30 days','60 days','90 days','365 days','365+ days','Not Stated')
for (i in 1:8){df[i,2]<-sum(mydata[,117+i]*mydata[,2])}

df$Time<-factor(df$Time, levels=df$Time) #to prevent reordering of bars

ggplot(df,aes(x=Time, y=Cumulative_Frequency))+
  geom_bar(stat="identity")+
  xlab("Time since last read a magazine")+ylab('Cumulative No. of People')+
  ggtitle('Cumulative Population Distribution \n by Magazine Last Read')+
  coord_flip()

The second graph depicts the number of people in each interval against the number of days. Because the source columns hold cumulative counts, the frequency for each interval was obtained by subtracting the previous cumulative total, and the percentage frequency was then calculated from these values. The position of the percentage labels on the bars (the midpoint of each bar) was determined using the ‘ddply’ function from the plyr package. The graph was labelled with the frequency percentages and given a suitable title.

#magazine last read
df <- data.frame(matrix(ncol = 4, nrow =6 )) 
colnames(df)<-c('Time','Cumulative_Frequency','Frequency','percentage')
df$Time<-c('1 day', '7 days','30 days','60 days','90 days','365 days')
for (i in 1:6){df[i,2]<-sum(mydata[,117+i]*mydata[,2])}
df[1,3]<-df[1,2]
for (i in 2:6){df[i,3]<-df[i,2]-df[i-1,2]}
for (i in 1:6){df[i,4]<-((df[i,3])/sum(df[,3]))*100}
df$percentage<-round(df$percentage, digits = 0)

df$Time<-factor(df$Time, levels=df$Time) #to prevent reordering of bars


df <- ddply(df, 'Time', 
            transform, pos = cumsum(Frequency) - (0.5 * Frequency))

p<-ggplot(df,aes(x=Time, y=Frequency))+
  geom_bar(stat="identity")+
  scale_x_discrete("Time since last read a magazine")+
  geom_text(data=df, aes(x=Time,y = pos, 
                         label = paste0(percentage,"%")),size=4)+
  ylab('No. of People')+
  ggtitle('Population Distribution by \n Magazine Last Read')
print(p)
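The step from cumulative totals to per-interval frequencies can be sketched with ‘diff’, which computes the successive differences in one call. The totals below are made up for illustration and are not taken from the survey.

```r
# Hypothetical cumulative totals: people who read a magazine within 1, 7, 30 days
cum <- c("1 day" = 40, "7 days" = 70, "30 days" = 100)

# Per-interval counts: the first value as is, then successive differences
freq <- c(cum[1], diff(cum))

# Share of each interval, as whole percentages
pct <- round(100 * freq / sum(freq))

print(freq)
print(pct)
```

Here 40 people fall in the first interval, 70 - 40 = 30 in the second and 100 - 70 = 30 in the third, giving shares of 40%, 30% and 30%.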

Magazine Readership by Age Group and Education

The following graphs look at the time elapsed since a magazine was last picked up with respect to age and education level. For the first bar graph, showing magazine readership against age, the total number of people from each age group falling in each readership category was calculated using the ‘aggregate’ command in R. The ‘melt’ command was then used to transform the data into a long format suitable for plotting a bar chart. The graph was ‘flipped’ from a vertical to a horizontal orientation to improve its appearance and labelled appropriately. Similar transformations using the ‘aggregate’ and ‘melt’ commands were applied to the education level data to prepare it for plotting as a bar chart. Fill colours were used to distinguish the different ages and education levels in the bar graphs.

par(mfrow=c(1,2)) # base graphics setting only; the ggplot2 plots below are unaffected

# Last read by Age Group bar graph
mag<-mydata[,c(100,118:125)]
magsum<-aggregate(mag,by=list(mag$Age),sum)
magsum<-magsum[,c(1,3:10)]
names(magsum)<-c("Age",'1 day', '7 days','30 days',
                 '60 days','90 days','365 days','365+ days','Not Stated')

magsum.long<-melt(magsum, id.vars="Age")


ggplot(magsum.long,aes(x=variable,y=value,fill=factor(Age)))+
  geom_bar(stat="identity")+
  scale_fill_discrete(name="Age Groups",
                      breaks=c(1,2,3,4,5,6),
                      labels=c('12-17','18-24','25-34','35-49','50-64','65+'))+
  coord_flip()+
  xlab("Days since last read a magazine")+ylab("No. of People")+
  ggtitle("Magazine Readership by Age")

# last read with education bar graph
mag<-mydata[,c(101,123)]
magsum<-aggregate(mag,by=list(mag$Education),sum)
magsum<-magsum[,c(1,3)]
names(magsum)<-c("Education",'365 days')

magsum.long<-melt(magsum, id.vars="Education")


ggplot(magsum.long,aes(x=variable,y=value,fill=factor(Education)))+
  geom_bar(stat="identity", position="dodge")+
  scale_fill_discrete(name="Education Level",
                      breaks=c(1,2,3,4,5,6),
                      labels=c("No Cert","Secondary School Grad",
                               "Certificate/Diploma","University Cert",
                               "Bachelors Degree","Post Grad Degree"))+
  xlab("Education Level")+ylab("No. of People")+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  ggtitle("Magazine Readership \n by Education")
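The aggregate-then-melt pattern used for both graphs can be sketched on a small made-up table. The data frame ‘toy’ and its column names are hypothetical, and the readership columns are assumed to be 0/1 indicators. Naming the grouping column in ‘by’ and aggregating only the indicator columns avoids having to drop a summed copy of the grouping column afterwards.

```r
library(reshape2)

# Hypothetical respondents: an age-group code and two 0/1 readership indicators
toy <- data.frame(Age  = c(1, 1, 2, 2, 2),
                  day1 = c(1, 0, 1, 1, 0),
                  day7 = c(1, 1, 1, 1, 0))

# Wide table: one row per age group, one column per readership category
wide <- aggregate(toy[, c("day1", "day7")],
                  by = list(Age = toy$Age), FUN = sum)

# Long table: one row per (age group, category) pair, ready for ggplot
long <- melt(wide, id.vars = "Age")
print(long)
```

The long table has the columns ‘Age’, ‘variable’ and ‘value’, which map directly onto the fill, x and y aesthetics used in the plots above.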

Newspaper Sections by Popularity

The graph below shows how frequently each of the listed newspaper sections is read by the people surveyed. To compare the popularity of the different newspaper sections, a faceted bar graph was plotted. The number of people in each category was summed from the raw data. The ‘facet_wrap’ function from the ggplot2 plotting system was used to produce a graph with four panels, each displaying the data for a different reading frequency (usually, sometimes, seldom, never). The graph was given an appropriate title and labels.

# facet graphs for newspaper sections
df<-data.frame(matrix(ncol=3,nrow=60))
colnames(df)<-c('Freq','News_Type','Total_People')
df$Freq<-rep(c('usually','sometimes','seldom','never'), each=15)
df$News_Type<-rep(c('local','national','world','sports','finance',
                      'entertainment','editorials','food','fashion','travel','automotive',
                      'comics','real estate','health','puzzles'),4)

for (i in 1:15){df[i,3]<-sum(mydata[,125+i])}
for (i in 1:15){df[i+15,3]<-sum(mydata[,140+i])}
for (i in 1:15){df[i+30,3]<-sum(mydata[,155+i])}
for (i in 1:15){df[i+45,3]<-sum(mydata[,170+i])}


p<-ggplot(df, aes(x=News_Type, y=Total_People,fill=factor(News_Type)))+
  geom_bar(stat = 'identity')+facet_wrap(~ Freq, ncol=2)+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  scale_fill_discrete(name="News Type")+
  xlab("News Section")+ylab("No. of People")+
  ggtitle("Newspaper Popularity by Sections")
print(p)
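The layout of the plotting data frame relies on ‘rep’ being called with ‘each’ for the reading frequency and with ‘times’ for the section names, so that every frequency is paired with every section. The sketch below shows the pattern on a shortened, illustrative list of sections (‘toy’, ‘freq_levels’ and ‘sections’ are hypothetical names). It also stores the frequency as a factor with explicit levels, an optional step that makes ‘facet_wrap’ draw the panels in that order rather than alphabetically.

```r
freq_levels <- c("usually", "sometimes", "seldom", "never")
sections    <- c("local", "national", "world")   # shortened list for illustration

# 'each' repeats every frequency once per section;
# 'times' repeats the whole section list once per frequency
toy <- data.frame(Freq      = rep(freq_levels, each = length(sections)),
                  News_Type = rep(sections, times = length(freq_levels)),
                  stringsAsFactors = FALSE)

# Explicit level order controls the order of the facet panels
toy$Freq <- factor(toy$Freq, levels = freq_levels)
print(toy)
```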

All of the graphs and the other analyses of the dataset were produced using R and knitr.