The data provided for the assignent are the prevalence rate of smoking and vaping. The vaping data shows the prevlance of daily vapers in various demographics, as well as gender, age range and whether they are disabled or not across different years from 2011 to 2022. These variables are the same in the smoking dataset with the only difference being there are less NAs in the smoking datset, this could be because vaping is relatively new compared to smoking and there are some years such as 2011 to 2014 where all the values in the vaping dataset are blank.
# choosing the interested columns
ratecols <- 2:14
# choosing the years
years <- 2011:2022
# reading in the csv file, skipping 9 lines of the data with only columns that we are interpreted in
smoking <- read.csv("current-smokers.csv", skip = 9) [ratecols]
colnames(smoking) <- c("group", as.character(years))
clean_numeric <- function(x) {
# Remove characters that non numeric
x <- gsub("[^0-9.]", "", x)
# Convert to numeric
as.numeric(x)
}
for (i in 2:13) {
smoking[, i] <- clean_numeric(smoking[, i])
}
for (i in 2:ncol(smoking)) {
smoking[, i] <- as.numeric(smoking[, i])
}
# Display of first few rows
head(smoking)
## group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
## 1 Total 18.4 17.9 17.6 16.8 16.6 16.0 15.1 14.7 13.7 10.9 9.8 8.3
## 2 Gender NA NA NA NA NA NA NA NA NA NA NA NA
## 3 Men 19.2 19.0 19.5 18.4 18.3 17.7 16.7 16.6 14.5 11.6 10.5 9.2
## 4 Women 17.6 16.8 15.9 15.2 14.8 14.4 13.7 12.8 12.9 10.2 9.0 7.5
## 5 Age group (years) NA NA NA NA NA NA NA NA NA NA NA NA
## 6 15-17 8.3 8.3 8.7 6.3 6.2 4.0 3.8 4.1 3.5 1.4 1.1 1.0
# Display of last few rows
tail(smoking)
## group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 32 Total Disabled NA NA NA NA NA NA NA 19.0 19.5 14.9 17.0
## 33 Total Non-disabled NA NA NA NA NA NA NA 14.3 13.2 10.5 9.1
## 34 Disabled women NA NA NA NA NA NA NA 19.0 21.3 13.4 15.3
## 35 Disabled men NA NA NA NA NA NA NA 19.1 17.1 16.9 19.0
## 36 Non-disabled women NA NA NA NA NA NA NA 12.2 12.1 9.8 8.4
## 37 Non-disabled men NA NA NA NA NA NA NA 16.4 14.3 11.2 9.8
## 2022
## 32 11.6
## 33 8.0
## 34 11.2
## 35 11.6
## 36 7.1
## 37 9.0
# Display of rows and columns numbers
dim(smoking)
## [1] 37 13
# Display the variable type of each column
sapply(smoking, mode)
## group 2011 2012 2013 2014 2015
## "character" "numeric" "numeric" "numeric" "numeric" "numeric"
## 2016 2017 2018 2019 2020 2021
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 2022
## "numeric"
The columns and years that we are interested in are selected (columns 2 to 14 and years 2011 and 2022), the CSV file is read in with its 9 lines skipped, the gsub function removes characters that are non numeric and the varibles are converted to numeric variables, the columns are renamed and there is a type conversion that converts all the columns except the first column to numeric variable are the lastly results are printed as requested
# columns in the data we are interested in (2nd to 14th columns)
ratecols <- 2:14
# column names for the years 2011 to 2022
yearNames <- 2011:2022
# Read the CSV file, skipping the first 9 lines
vaping <- read.csv("vaping-daily.csv", skip = 9)
# The columns of interest (2nd to 14th)
vaping <- vaping[, ratecols]
colnames(vaping) <- c("group", as.character(yearNames))
clean_numeric <- function(x) {
# Remove all non-numeric characters
x <- gsub("[^0-9.]", "", x)
# Convert to numeric
as.numeric(x)
}
for (i in 2:13) {
vaping[, i] <- clean_numeric(vaping[, i])
}
for (i in 2:ncol(vaping)) {
vaping[, i] <- as.numeric(vaping[, i])
}
# Display of first few rows
head(vaping)
## group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
## 1 Total NA NA NA NA 0.9 NA 2.6 3.3 3.5 6.2 8.4 9.7
## 2 Gender NA NA NA NA NA NA NA NA NA NA NA NA
## 3 Men NA NA NA NA 1.0 NA 3.2 4.2 4.3 7.5 8.5 10.3
## 4 Women NA NA NA NA 0.8 NA 2.0 2.3 2.7 5.0 8.3 9.1
## 5 Age group (years) NA NA NA NA NA NA NA NA NA NA NA NA
## 6 15-17 NA NA NA NA NA NA 0.6 1.7 2.3 5.8 8.3 15.4
# Display of last few rows
tail(vaping)
## group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 32 Total Disabled NA NA NA NA NA NA NA 4.3 4.9 6.7 11.9
## 33 Total Non-disabled NA NA NA NA NA NA NA 3.2 3.4 6.2 8.1
## 34 Disabled women NA NA NA NA NA NA NA 3.3 5.4 4.3 11.3
## 35 Disabled men NA NA NA NA NA NA NA 5.6 4.3 9.8 12.5
## 36 Non-disabled women NA NA NA NA NA NA NA 2.2 2.5 5.0 8.0
## 37 Non-disabled men NA NA NA NA NA NA NA 4.2 4.3 7.4 8.2
## 2022
## 32 14.2
## 33 9.3
## 34 15.0
## 35 12.4
## 36 8.5
## 37 10.1
# Display of rows and columns numbers
dim(vaping)
## [1] 37 13
# Display the variable type of each column
sapply(vaping, mode)
## group 2011 2012 2013 2014 2015
## "character" "numeric" "numeric" "numeric" "numeric" "numeric"
## 2016 2017 2018 2019 2020 2021
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 2022
## "numeric"
The code in this question follows very similar steps to the first question with the primary difference being the vaping file/data being selected and manipulated as well as some differences between the variables names to get the desired outcomes asked.
readFile <- function(file_name) {
# columns that contain the data we are interested in (2nd to 14th columns)
ratecols <- 2:14
# column names for the years 2011 to 2022
yearNames <- 2011:2022
# skipping the first 9 lines of metadata
df <- read.csv(file_name, skip = 9)
# only the columns of interest (2nd to 14th) and rename the columns
df <- df[, ratecols]
colnames(df) <- c("group", as.character(yearNames))
# helper function to clean numeric values
clean_numeric <- function(x) {
# Remove all non-numeric characters
x <- gsub("[^0-9.]", "", x)
# Convert to numeric
as.numeric(x)
}
# Applying the cleaning function to columns that are relevant
for (i in 2:13) {
df[, i] <- clean_numeric(df[, i])
}
# Convert all columns to numric except the first
for (i in 2:ncol(df)) {
df[, i] <- as.numeric(df[, i])
}
return(df)
}
smoking <- readFile("current-smokers.csv")
vaping <- readFile("vaping-daily.csv")
tried <- readFile("vaping-tried.csv")
# Verify results
all.equal(smoking, readFile("current-smokers.csv"))
## [1] TRUE
all.equal(vaping, readFile("vaping-daily.csv"))
## [1] TRUE
# Display results
head(tried)
## group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
## 1 Total NA NA NA NA 16.4 NA 18.8 21.8 24.3 25.2 27.6 24.7
## 2 Gender NA NA NA NA NA NA NA NA NA NA NA NA
## 3 Men NA NA NA NA 18.1 NA 21.0 24.8 27.5 27.5 29.1 26.3
## 4 Women NA NA NA NA 14.7 NA 16.6 18.8 21.2 23.0 26.1 23.0
## 5 Age group (years) NA NA NA NA NA NA NA NA NA NA NA NA
## 6 15-17 NA NA NA NA 19.4 NA 19.6 26.0 40.1 32.3 30.4 38.4
tail(tried)
## group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 32 Total Disabled NA NA NA NA NA NA NA 21.0 24.6 23.7 31.7
## 33 Total Non-disabled NA NA NA NA NA NA NA 21.8 24.3 25.4 27.2
## 34 Disabled women NA NA NA NA NA NA NA 21.3 25.7 22.0 30.6
## 35 Disabled men NA NA NA NA NA NA NA 20.6 23.3 26.0 33.0
## 36 Non-disabled women NA NA NA NA NA NA NA 18.5 20.8 23.1 25.6
## 37 Non-disabled men NA NA NA NA NA NA NA 25.1 27.9 27.6 28.9
## 2022
## 32 28.6
## 33 24.4
## 34 27.7
## 35 29.0
## 36 22.4
## 37 26.2
dim(tried)
## [1] 37 13
sapply(tried, mode)
## group 2011 2012 2013 2014 2015
## "character" "numeric" "numeric" "numeric" "numeric" "numeric"
## 2016 2017 2018 2019 2020 2021
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 2022
## "numeric"
readfile function takes the single argument file_name which the name of the CSV to be read, the similar to the last questions the columns in intrest including the years are selected and renamed and non numeric characters are converted to a numeric value and all variables except the first are converted to numeric. return(df) returns the cleaned data frame and the data files including smoking, vaping and tried all are read then verifies. The all.equal function checks if smoking and vaping is equal the result of calling ‘readFile(“current-smokers.csv”)’ and ‘readFile(“vaping-daily.csv”)’ respectively and lastly the results are displayed.
# readFile reads the file, processes it and cleans it
readFile <- function(file_name) {
# chooses years and columns
ratecols <- 2:14
yearNames <- 2011:2022
# Check if the file exists
if (!file.exists(file_name)) {
stop("The file does not exist.")
}
# skipping the first 9 lines of metadata
df <- read.csv(file_name, skip = 9, stringsAsFactors = FALSE)
# Check if the selected columns exist in the data frame
if (!all(ratecols %in% seq_along(df))) {
stop("The specified columns do not exist in the data frame.")
}
# the columns of interest and rename the columns
df <- df[, ratecols]
colnames(df) <- c("group", as.character(yearNames))
# helper function to clean numeric values
clean_numeric <- function(x) {
x <- gsub("[^0-9.]", "", x) # Remove all non-numeric characters
as.numeric(x) # Convert to numeric
}
# Apply the cleaning function to the relevant columns
for (i in 2:ncol(df)) {
df[, i] <- clean_numeric(df[, i])
}
return(df) # Return the cleaned data frame
}
check <- readFile("current-smokers.csv")
The readfile function assumes the CSV file has a specific structure with metadata in the first 9 lines and relevant data in columns 2 to 14 with the specified years. the checks for if file exists and if selected columns exist in the data frame are done. Then the first 9 lines of the meta data are skipped and the csv file is read in, the columns of interest are picked and renamed, the helper function is used to clean and convert string to numeric values, cleaning function is applied to reverent columns and again all columns are ensured to be numeric except the first one and lastly the return(df) function returns the cleaned data and process the data frame.
# the isHeader function
isHeader <- function(vec) {
# Checking if the first value is not missing and all other values are missing
!is.na(vec[1]) && all(is.na(vec[-1]))
}
# isHeader function is applied to each row of the smoking data frame
tableHead <- apply(smoking, 1, isHeader)
# The rows that are headers
print(tableHead)
## [1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## [25] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE
# Extract the names of the tables from the smoking data frame
tableNames <- smoking[tableHead, 1]
# Display the table names
print(tableNames)
## [1] "Gender" "Age group (years)" "Maori"
## [4] "Pacific" "Asian" "European/Other"
## [7] "Disability status"
The isHeader function was applies to create a function that determines if a given vector represents a header row, then the isHeader function is applied to each row of the smoking data frame and it identifies which rows in the smoking data frame are header rows, the rows that are headers are printed, then the names of the tables from the smoking data frame is extracted to get the real names of the tables from the identified header rows and lastly the table names are printed and diplayed.
# Test 1: Valid header (first value is not NA, rest are NA)
test1 <- isHeader(c("Header", NA, NA, NA))
print(paste("Test 1: Valid header",test1))
## [1] "Test 1: Valid header TRUE"
# Test 2: Non-header (not all other values are NA)
test2 <- isHeader(c("Header", 1, NA, NA))
print(paste("Test 2: Non-header",test2))
## [1] "Test 2: Non-header FALSE"
# Test 3: Non-header (first value is NA)
test3 <- isHeader(c(NA, NA, NA, NA))
print(paste("Test 3: Non-header", test3))
## [1] "Test 3: Non-header FALSE"
# Test 4: Edge case with a single-element vector
test4 <- tryCatch({
isHeader(c("Header"))
}, error = function(e) {
e$message
})
print(paste("Test 4: Edge case with a single-element vector", test4))
## [1] "Test 4: Edge case with a single-element vector TRUE"
# Test 5: Edge case with an empty vector
test5 <- tryCatch({
isHeader(c())
}, error = function(e) {
e$message
})
print(paste("Test 5: Edge case with an empty vector", test5))
## [1] "Test 5: Edge case with an empty vector NA"
# Test 6: Edge case with non-vector input
test6 <- tryCatch({
isHeader(matrix(1:4, nrow = 2))
}, error = function(e) {
e$message
})
print(paste("Test 6: Edge case with non-vector input", test6))
## [1] "Test 6: Edge case with non-vector input FALSE"
# Test 7: Valid header with different data types
test7 <- isHeader(c("Header", NA, NA, NA))
print(paste("Test 7: Valid header with different data types", test7))
## [1] "Test 7: Valid header with different data types TRUE"
# Test 8: Non-header with mixed types
test8 <- isHeader(c("Header", NA, "Text", NA))
print(paste("Test 8: Non-header with mixed types", test8))
## [1] "Test 8: Non-header with mixed types FALSE"
# Test 9: Valid header with numeric data
test9 <- isHeader(c(1, NA, NA, NA))
print(paste("Test 9: Valid header with numeric data",test9))
## [1] "Test 9: Valid header with numeric data TRUE"
# Test 10: Non-header with numeric data
test10 <- isHeader(c(1, 2, NA, NA))
print(paste("Test 10: Non-header with numeric data", test10))
## [1] "Test 10: Non-header with numeric data FALSE"
A few series of tests are done to check that your isHeader() function (i) produces the correct answer for different types of valid input and (ii) produces an error for different types of invalid input. first test is to check if the header is valid, second test tests if the function identifies a non header when not all other values are ‘NA’, third test is if the function identifies a non header properly, test four is checking if the function behavior with a single element vector, test five tests the functions behavior with an empty vector, test six checks if the test function behavior with a non vector input, test 7 tests if the function correctly identifying a valid header mixed with data types, test eight tests if the function correctly identifies a non header with mixed data types, test nine is to check if the function correctly identifies a valid header with numeric data and test ten is to test of the function correctly identifies a non header numeric data .
getTables <- function(df) {
tableHead <- apply(df, 1, isHeader)
tableNames <- as.character(df[tableHead, 1])
tableID <- rep(NA, nrow(df))
currentTableID <- 1
for (i in 1:nrow(df)) {
if (tableHead[i]) {
currentTableID <- currentTableID + 1
}
tableID[i] <- currentTableID
}
tableList <- split(df, tableID)
tableList <- lapply(tableList, function(x) {
if (isHeader(x[1, ]))
x <- x[-1, ]
x
})
names(tableList) <- tableNames
return(tableList)
}
# Loaded as asked by the question
smokingTables <- readRDS("smokingTables.rds")
vapingTables <- readRDS("vapingTables.rds")
# first table is extracted
overall_smoking <- smokingTables[[1]]
# Plot the data
plot(as.numeric(names(overall_smoking)[-1]), overall_smoking[-1],
type = "l", ylab = "Percent",
xaxt = "n", xlab = "")
# Add custom x-axis labels
every_second_year <- years[seq(2, length(years), by = 2)]
axis(1, at = every_second_year, labels = every_second_year)
Previously saved data is loaded in the first two lines of code as asked by the question, The first table is extracted from the ‘smokingtables’, the data is plotted and comments are added to y and x axis.
# Gender data is selected
gender_data <- smokingTables$Gender
# first column is the categories and the rest is years
categories <- gender_data[, 1]
years <- as.numeric(names(gender_data)[-1])
# Extract data for Men and Women
men_data <- as.numeric(gender_data[categories == "Men", -1])
women_data <- as.numeric(gender_data[categories == "Women", -1])
# Plot the data for Men
plot(years, men_data, type = "l", col = "black", ylab = "Percent", xlab = "Year",
xaxt = "n", ylim = range(c(men_data, women_data)))
# Add the data for Women
lines(years, women_data, col = "red")
# Add custom x-axis labels
every_second_year <- years[seq(2, length(years), by = 2)]
axis(1, at = every_second_year, labels = every_second_year)
# Add a legend in the top-right corner
legend("topright", legend = c("Men", "Women"), col = c("black", "red"), lty = 1)
The gender variables is selected out of the smokingtables data frame, the catergories and the years are extracted as well as the data for men and woman, then the data is plotted using the plot function and the y and x axisis are customized similar to the model answer and lastly a legend is added to show men and women on the top right corner.
plotTable <- function(data) {
# Extract categories and years
categories <- as.character(data[, 1])
years <- as.numeric(names(data)[-1])
# data is converted to numeirc
numeric_data <- as.data.frame(lapply(data[, -1], function(x) as.numeric(as.character(x))))
# Initialize the plot with the first category's data
plot(years, numeric_data[1, ], type = "l", col = 1, ylab = "Percent", xlab = "Year",
xaxt = "n", ylim = range(numeric_data, na.rm = TRUE))
# Add lines for each category
for (i in 2:nrow(numeric_data)) {
lines(years, numeric_data[i, ], col = i)
}
axis(1, at = years, labels = years)
# legends background
legend("topright", legend = categories, col = 1:nrow(numeric_data), lty = 1, bty = "o", bg = "transparent", cex = 0.8)
}
# Set up the plotting area, as asked by the question
opar <- par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
invisible(lapply(smokingTables, plotTable))
# Reset the plotting area to its original state
par(opar)
The fuctiondata defines the plot table and takes an argument data, the appropriate years and categories are selected, the data is converted to numeric, the plot is made and each categories has its lines and levels for x axis labels are done and lastly there is a legend on top right indicating the labels on top right and the plotting area is set up.
plotTable <- function(data, ylim = NULL, axes = TRUE, add = FALSE, lty = 1) {
# categories and years
categories <- as.character(data[, 1])
years <- as.numeric(names(data)[-1])
# Convert data to numeric
numeric_data <- as.data.frame(lapply(data[, -1], function(x) as.numeric(as.character(x))))
if (!add) {
# Initialize the plot with the first category's data
plot(years, numeric_data[1, ], type = "l", col = 1, ylab = "Percent", xlab = "Year",
xaxt = ifelse(axes, "s", "n"), ylim = if (is.null(ylim)) range(numeric_data, na.rm = TRUE) else ylim, lty = lty)
} else {
lines(years, numeric_data[1, ], col = 1, lty = lty)
}
# Add lines for each category
for (i in 2:nrow(numeric_data)) {
lines(years, numeric_data[i, ], col = i, lty = lty)
}
# Add custom x-axis labels if axes are true
if (axes) {
axis(1, at = years, labels = years)
}
# Add a legend with a transparent background
if (!add) {
legend("topright", legend = categories, col = 1:nrow(numeric_data), lty = 1, bty = "o", bg = "transparent", cex = 0.8)
}
}
# Initial plot settings
opar <- par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
invisible(lapply(smokingTables, plotTable))
percentRange <- range(smoking[-1], na.rm = TRUE)
opar <- par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
invisible(lapply(smokingTables, plotTable, ylim = percentRange))
opar <- par(mfrow = c(3, 3), mar = rep(0, 4), oma = c(3, 3, 1, 1))
for (i in 1:3) {
for (j in 1:3) {
index <- (i - 1) * 3 + j
if (index <= length(smokingTables)) {
plotTable(smokingTables[[index]], ylim = percentRange, axes = FALSE)
}
if (j == 1) {
axis(2)
}
if (i == 3 || (j == 3 && i == 2)) {
axis(1)
}
}
}
globalRange <- range(smoking[-1], vaping[-1], na.rm = TRUE)
opar <- par(mfrow = c(3, 3), mar = rep(0, 4), oma = c(3, 3, 1, 1))
for (i in 1:3) {
for (j in 1:3) {
index <- (i - 1) * 3 + j
if (index <= length(smokingTables)) {
plotTable(smokingTables[[index]], ylim = globalRange, axes = FALSE)
plotTable(vapingTables[[index]], ylim = globalRange, lty = 2, add = TRUE) # lty = 2 for dashed line
}
if (j == 1) {
axis(2)
}
if (i == 3 || (j == 3 && i == 2)) {
axis(1)
}
}
}
par(opar)
The plottable is set up and the initial plot settings are set,then the set up for consistent y axis limit is done,the custom plot layout with no margins and selective axes is set and the smoking and vaping data are plotted and lastly the par(opar) plot is rested to its original state.
In this assignment we were given two main data frames of vaping and smoking prevalence based on diffrent years, ages , genders and demographics and had to do a bit of data processing including cleaning CSV files with specific conditions and instructures and handling non numeric and characters and type coercion of some variables. There were ideas about reusable functions such as ‘readfile’ function that was revisited a few times to process data frames, data validation which was done to ensure all the functions are working effectively, various types of data visualizations done with plot function in base r and some other important functions including the ‘isHeader’, ‘readfile’ and ‘gettables’ to identify and processes tables within the data frame.