We want to generate a dataset in R that contains categorical variables initially stored as character strings.
# Set a seed for reproducibility
set.seed(123)
# Create sample data in R
data <- data.frame(
ID = 1:30,
Gender = sample(c("Male", "Female"), 30, replace = TRUE),
AgeGroup = sample(c("18-25", "26-35", "36-45"), 30, replace = TRUE),
EducationLevel = sample(c("High School", "Bachelor's", "Master's", "PhD"), 30, replace = TRUE),
stringsAsFactors = FALSE # Ensure that character data is not automatically converted to factors
)
# Display the first few rows of the dataset
head(data)
We want to display the structure of this dataset.
# Display the structure of the dataset
str(data)
'data.frame': 30 obs. of 4 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : chr "Male" "Male" "Male" "Female" ...
$ AgeGroup : chr "18-25" "26-35" "36-45" "26-35" ...
$ EducationLevel: chr "High School" "Master's" "High School" "Bachelor's" ...
We want to convert the Gender, AgeGroup, and EducationLevel columns from character strings to factors to better manage categorical data.
# Convert character columns to factors
data$Gender <- as.factor(data$Gender)
data$AgeGroup <- as.factor(data$AgeGroup)
data$EducationLevel <- as.factor(data$EducationLevel)
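When many columns need converting, the same step can be applied to every character column at once. This is only a sketch: `df` is a small hypothetical data frame standing in for `data`, and `chr_cols` is a helper name introduced here for illustration.

```r
# Small hypothetical data frame standing in for `data`
df <- data.frame(
  ID = 1:4,
  Gender = c("Male", "Female", "Female", "Male"),
  AgeGroup = c("18-25", "26-35", "18-25", "36-45"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert each flagged column to a factor
chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], as.factor)
```

The integer ID column is left untouched because only columns flagged by is.character() are converted.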
We want to see the structure after conversion to verify the change.
str(data)
'data.frame': 30 obs. of 4 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 1 1 2 2 ...
$ AgeGroup : Factor w/ 3 levels "18-25","26-35",..: 1 2 3 2 1 3 3 1 3 2 ...
$ EducationLevel: Factor w/ 4 levels "Bachelor's","High School",..: 2 3 2 1 2 3 2 3 1 4 ...
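Note that as.factor() sorts levels alphabetically, which is why "Bachelor's" is listed before "High School" in the output above. If a natural ordering matters, for example on plot axes, the levels can be supplied explicitly. A minimal sketch, assuming the intended order is High School, Bachelor's, Master's, PhD (the original analysis does not state one); `edu` is a hypothetical stand-in for data$EducationLevel.

```r
# Hypothetical stand-in for data$EducationLevel
edu <- c("High School", "Master's", "High School", "Bachelor's", "PhD")

# Assumed ordering from lowest to highest; not stated in the original analysis
edu_levels <- c("High School", "Bachelor's", "Master's", "PhD")

# factor() with explicit levels keeps this order instead of sorting alphabetically
edu_factor <- factor(edu, levels = edu_levels, ordered = TRUE)

levels(edu_factor)
table(edu_factor)
```

With ordered = TRUE the factor also supports comparisons such as edu_factor >= "Bachelor's".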
We want to see a summary of these factor columns.
# Get a summary of the factor columns
summary(data$Gender)
Female Male
13 17
summary(data$AgeGroup)
18-25 26-35 36-45
13 7 10
summary(data$EducationLevel)
Bachelor's High School Master's PhD
9 8 7 6
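Counts are often easier to compare as shares of the total. The sketch below uses the AgeGroup counts printed above, entered by hand so that it runs on its own; with the data frame in memory, table(data$AgeGroup) should supply the same counts. The names `age_counts` and `age_share` are introduced here for illustration.

```r
# AgeGroup counts copied from the summary output above
age_counts <- c("18-25" = 13, "26-35" = 7, "36-45" = 10)

# Convert the counts to percentages of the 30 observations
age_share <- round(100 * prop.table(age_counts), 1)
age_share
```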
We want to visualize the data with bar plots.
library(ggplot2)
# Bar plot for Gender distribution
ggplot(data, aes(x = Gender)) +
geom_bar(fill = "skyblue") +
labs(title = "Gender Distribution", x = "Gender", y = "Count") +
theme_minimal()
# Bar plot for Age Group distribution
ggplot(data, aes(x = AgeGroup)) +
geom_bar(fill = "lightgreen") +
labs(title = "Age Group Distribution", x = "Age Group", y = "Count") +
theme_minimal()
# Bar plot for Education Level distribution
ggplot(data, aes(x = EducationLevel)) +
geom_bar(fill = "lightcoral") +
labs(title = "Education Level Distribution", x = "Education Level", y = "Count") +
theme_minimal()
The bar plots show that the sample skews toward the youngest age group: 13 of the 30 observations are 18-25, compared with 7 in 26-35 and 10 in 36-45. This could limit the analysis if you're looking for data on older age groups. A younger demographic could also affect outcomes such as health, purchasing trends, and educational goals. I don't see any other surprising trends in this data.