library(tidyverse)
library(ggplot2)
library(knitr)
library(tinytex)
Question 1.
Generate a dataset within R, which includes creating categorical
variables that are initially character strings.
Create the sample data in R.
data <- data.frame(
  ID = 1:30,
  Gender = sample(c("Male", "Female"), 30, replace = TRUE),
  AgeGroup = sample(c("18-25", "26-35", "36-45"), 30, replace = TRUE),
  EducationLevel = sample(c("High School", "Bachelor's", "Master's", "PhD"), 30, replace = TRUE),
  stringsAsFactors = FALSE
)
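Because sample() draws at random, the dataset differs every time the code is run. A minimal sketch of how a seed makes the draw repeatable is shown here; the seed value 42 and the names first_draw and second_draw are arbitrary choices for illustration, and the output shown elsewhere in this report came from an unseeded run, so it will not be reproduced by this seed.

```r
# The seed value 42 is an arbitrary, illustrative choice.
set.seed(42)
first_draw <- sample(c("Male", "Female"), 30, replace = TRUE)

# Resetting the same seed before sampling reproduces the same draw.
set.seed(42)
second_draw <- sample(c("Male", "Female"), 30, replace = TRUE)

# TRUE: both draws are element-by-element identical.
identical(first_draw, second_draw)
```

Calling set.seed() once at the top of the script, before the data frame is created, would be enough to make the whole report reproducible.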
Question 2.
Display the structure of the dataset.
str(data)
'data.frame': 30 obs. of 4 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : chr "Female" "Female" "Male" "Female" ...
$ AgeGroup : chr "26-35" "26-35" "26-35" "36-45" ...
$ EducationLevel: chr "PhD" "High School" "PhD" "PhD" ...
The structure of the dataset is above.
Question 3.
Convert the Gender, AgeGroup, and EducationLevel columns from character
strings to factors to better manage categorical data.
data$Gender<-factor(data$Gender)
data$AgeGroup<-factor(data$AgeGroup)
data$EducationLevel<-factor(data$EducationLevel)
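By default, factor() sorts the levels alphabetically, which places "Bachelor's" before "High School". If a natural ordering is wanted, the levels can be given explicitly. The following is a small sketch on an illustrative vector (the name education and its values are assumptions made for the example, not the report's data):

```r
# Small illustrative vector using the same category labels as EducationLevel.
education <- c("PhD", "High School", "Master's", "Bachelor's", "High School")

# Default behaviour: levels are sorted alphabetically.
default_factor <- factor(education)
levels(default_factor)

# Explicit levels, ordered from lowest to highest qualification.
ordered_factor <- factor(
  education,
  levels = c("High School", "Bachelor's", "Master's", "PhD"),
  ordered = TRUE
)
levels(ordered_factor)

# Ordered factors support comparisons against a level.
ordered_factor > "Bachelor's"
```

With explicit levels, tables and barplots list the categories in the stated order rather than alphabetically.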
Question 4.
Display the structure of the dataset to confirm the changes have been
made post conversion.
str(data)
'data.frame': 30 obs. of 4 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 2 1 1 ...
$ AgeGroup : Factor w/ 3 levels "18-25","26-35",..: 2 2 2 3 3 1 2 3 3 2 ...
$ EducationLevel: Factor w/ 4 levels "Bachelor's","High School",..: 4 2 4 4 2 2 3 3 3 3 ...
The three columns are now factors, so the conversion was successful.
Question 5.
Look at a summary of the factor columns.
summary(data)
ID Gender AgeGroup EducationLevel
Min. : 1.00 Female:16 18-25:10 Bachelor's : 5
1st Qu.: 8.25 Male :14 26-35:10 High School: 6
Median :15.50 36-45:10 Master's :11
Mean :15.50 PhD : 8
3rd Qu.:22.75
Max. :30.00
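summary(data) also summarizes the numeric ID column. One possible way to restrict the summary to the factor columns only is sketched below on a small stand-in data frame (demo and its values are hypothetical and only serve to keep the example self-contained):

```r
# Hypothetical stand-in for the report's data frame.
demo <- data.frame(
  ID = 1:4,
  Gender = factor(c("Male", "Female", "Female", "Male")),
  AgeGroup = factor(c("18-25", "26-35", "18-25", "36-45"))
)

# Logical vector marking which columns are factors.
is_factor_col <- sapply(demo, is.factor)

# Summarize only those columns; ID is left out.
factor_summary <- summary(demo[is_factor_col])
factor_summary
```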
Question 6.
To visualize the data, we can create barplots or histograms to better
understand the dataset we generated. First, a frequency table is created
for each of the factors, and the number of levels in each is checked.
table(data$Gender)
Female Male
16 14
table(data$AgeGroup)
18-25 26-35 36-45
10 10 10
table(data$EducationLevel)
Bachelor's High School Master's PhD
5 6 11 8
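Counts can also be expressed as proportions, which are easier to compare across factors with different numbers of levels. A brief sketch using prop.table() on an illustrative vector (gender here is a made-up four-element example, not the report's column):

```r
# Illustrative factor with three "Female" and one "Male" entry.
gender <- factor(c("Female", "Male", "Female", "Female"))

# Frequency table, then the share of each level.
counts <- table(gender)
shares <- prop.table(counts)

# Shares as percentages: Female 75, Male 25.
round(100 * shares, 1)
```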
nlevels(data$Gender)
[1] 2
nlevels(data$AgeGroup)
[1] 3
nlevels(data$EducationLevel)
[1] 4
A barplot of the Gender data will follow:
barplot(table(data$Gender))
A barplot of the Age Groups will follow:
barplot(table(data$AgeGroup))
A barplot of Education Levels will follow:
barplot(table(data$EducationLevel))
Each of these barplots shows the distribution of a single factor within the full dataset. A barplot can also be given a title, axis labels, and a fill color to make it easier to read, as with the Gender barplot here:
barplot(table(data$Gender), main="Gender Distribution", xlab="Gender", ylab="Count", col="chartreuse")
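The three single-factor barplots can also be drawn side by side in one figure with base graphics. This is a sketch that assumes a small stand-in data frame (demo and its values are hypothetical); par(mfrow = c(1, 3)) splits the plotting area into one row of three panels.

```r
# Hypothetical stand-in data with the same three factor columns.
demo <- data.frame(
  Gender = factor(c("Male", "Female", "Female", "Male")),
  AgeGroup = factor(c("18-25", "26-35", "18-25", "36-45")),
  EducationLevel = factor(c("PhD", "Master's", "Master's", "High School"))
)

# One row, three columns of plots; keep the old settings to restore later.
old_par <- par(mfrow = c(1, 3))

barplot(table(demo$Gender), main = "Gender", ylab = "Count")
barplot(table(demo$AgeGroup), main = "Age Group", ylab = "Count")
barplot(table(demo$EducationLevel), main = "Education Level", ylab = "Count")

# Restore the previous layout so later plots are not affected.
par(old_par)
```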
To compare Education Level by Gender, a ggplot can be created.
ggplot(data, aes(x = EducationLevel)) +
  geom_bar(fill = "purple") +
  labs(title = "Education Level Distribution by Gender",
       x = "Education Level",
       y = "Count") +
  facet_wrap(~ Gender) +
  theme_minimal()
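An alternative to faceting is to map Gender to the fill color and place the bars side by side, which puts the two genders next to each other within each education level. The sketch below uses a small hypothetical data frame (demo) and is one possible variation, not a replacement for the faceted plot.

```r
library(ggplot2)

# Hypothetical stand-in data for illustration.
demo <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  EducationLevel = c("PhD", "PhD", "Master's", "Master's", "Master's", "High School")
)

# position = "dodge" draws one bar per gender within each education level.
p <- ggplot(demo, aes(x = EducationLevel, fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(title = "Education Level by Gender (side-by-side bars)",
       x = "Education Level",
       y = "Count") +
  theme_minimal()
p
```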
To visualize all three factors in a single figure, a more complex barplot can be made, with one panel for each combination of Gender and Age Group.
ggplot(data, aes(x = EducationLevel)) +
  geom_bar(fill = "gold") +
  labs(title = "Education Level Distribution by Gender and Age Group",
       x = "Education Level",
       y = "Count") +
  facet_grid(Gender ~ AgeGroup) +
  theme_minimal()
Question 7.
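Upon examining the visualizations, no single age group is prevalent in
the dataset; all three categories are equally represented, with 10
observations each. This suggests that the data do not over-represent any
of the surveyed age groups. One surprising finding is that there are far
more males than females with Master's degrees across all age groups.
Because the data were generated randomly with sample(), this difference
is due to chance and will change each time the data are regenerated.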
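Observations read off a plot can be checked numerically with a cross-tabulation. The following sketch uses xtabs() and ftable() on a small hypothetical data frame (demo and its values are invented for the example, so the counts are not those of the report's dataset):

```r
# Hypothetical stand-in data for illustration.
demo <- data.frame(
  Gender = c("Male", "Male", "Female", "Male", "Female", "Female"),
  AgeGroup = c("18-25", "26-35", "18-25", "36-45", "26-35", "36-45"),
  EducationLevel = c("Master's", "Master's", "PhD", "Master's", "Master's", "High School")
)

# Two-way table: Gender in rows, EducationLevel in columns.
counts <- xtabs(~ Gender + EducationLevel, data = demo)
counts

# Three-way table flattened for reading, split by AgeGroup as well.
by_age <- ftable(xtabs(~ AgeGroup + Gender + EducationLevel, data = demo))
by_age
```

In this toy example the Master's column shows 3 males and 1 female; the same calls applied to the full dataset would give the exact counts behind the faceted barplots.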