General Commentary (data discription):

The data provided for the assignent are the prevalence rate of smoking and vaping. The vaping data shows the prevlance of daily vapers in various demographics, as well as gender, age range and whether they are disabled or not across different years from 2011 to 2022. These variables are the same in the smoking dataset with the only difference being there are less NAs in the smoking datset, this could be because vaping is relatively new compared to smoking and there are some years such as 2011 to 2014 where all the values in the vaping dataset are blank.

Q1

# choosing the interested columns
ratecols <- 2:14
# choosing the years 
years <- 2011:2022
# reading in the csv file, skipping 9 lines of the data with only columns that we are interpreted in 
smoking <- read.csv("current-smokers.csv", skip = 9) [ratecols]
  colnames(smoking) <- c("group", as.character(years))

  clean_numeric <- function(x) {
# Remove characters that non numeric 
  x <- gsub("[^0-9.]", "", x)
# Convert to numeric
  as.numeric(x)
  }

    for (i in 2:13) {
      smoking[, i] <- clean_numeric(smoking[, i])
    }
    for (i in 2:ncol(smoking)) {
  smoking[, i] <- as.numeric(smoking[, i])
}
# Display of first few rows 
head(smoking)

##               group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
## 1             Total 18.4 17.9 17.6 16.8 16.6 16.0 15.1 14.7 13.7 10.9  9.8  8.3
## 2            Gender   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 3               Men 19.2 19.0 19.5 18.4 18.3 17.7 16.7 16.6 14.5 11.6 10.5  9.2
## 4             Women 17.6 16.8 15.9 15.2 14.8 14.4 13.7 12.8 12.9 10.2  9.0  7.5
## 5 Age group (years)   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 6             15-17  8.3  8.3  8.7  6.3  6.2  4.0  3.8  4.1  3.5  1.4  1.1  1.0

# Display of last few rows 
tail(smoking)

##                 group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 32     Total Disabled   NA   NA   NA   NA   NA   NA   NA 19.0 19.5 14.9 17.0
## 33 Total Non-disabled   NA   NA   NA   NA   NA   NA   NA 14.3 13.2 10.5  9.1
## 34     Disabled women   NA   NA   NA   NA   NA   NA   NA 19.0 21.3 13.4 15.3
## 35       Disabled men   NA   NA   NA   NA   NA   NA   NA 19.1 17.1 16.9 19.0
## 36 Non-disabled women   NA   NA   NA   NA   NA   NA   NA 12.2 12.1  9.8  8.4
## 37   Non-disabled men   NA   NA   NA   NA   NA   NA   NA 16.4 14.3 11.2  9.8
##    2022
## 32 11.6
## 33  8.0
## 34 11.2
## 35 11.6
## 36  7.1
## 37  9.0

# Display of rows and columns numbers  
dim(smoking)

## [1] 37 13

# Display the variable type of each column
sapply(smoking, mode)

##       group        2011        2012        2013        2014        2015 
## "character"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
##        2016        2017        2018        2019        2020        2021 
##   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
##        2022 
##   "numeric"

The columns and years that we are interested in are selected (columns 2 to 14 and years 2011 and 2022), the CSV file is read in with its 9 lines skipped, the gsub function removes characters that are non numeric and the varibles are converted to numeric variables, the columns are renamed and there is a type conversion that converts all the columns except the first column to numeric variable are the lastly results are printed as requested

Q2)

# columns in the data we are interested in (2nd to 14th columns)
ratecols <- 2:14

# column names for the years 2011 to 2022
yearNames <- 2011:2022

# Read the CSV file, skipping the first 9 lines
vaping <- read.csv("vaping-daily.csv", skip = 9) 

# The columns of interest (2nd to 14th)
vaping <- vaping[, ratecols]
colnames(vaping) <- c("group", as.character(yearNames))

  clean_numeric <- function(x) {
  # Remove all non-numeric characters
  x <- gsub("[^0-9.]", "", x)
  # Convert to numeric
  as.numeric(x)
  }

    for (i in 2:13) {
      vaping[, i] <- clean_numeric(vaping[, i])
    }
  
  for (i in 2:ncol(vaping)) {
    vaping[, i] <- as.numeric(vaping[, i])
}

# Display of first few rows 
head(vaping)

##               group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
## 1             Total   NA   NA   NA   NA  0.9   NA  2.6  3.3  3.5  6.2  8.4  9.7
## 2            Gender   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 3               Men   NA   NA   NA   NA  1.0   NA  3.2  4.2  4.3  7.5  8.5 10.3
## 4             Women   NA   NA   NA   NA  0.8   NA  2.0  2.3  2.7  5.0  8.3  9.1
## 5 Age group (years)   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 6             15-17   NA   NA   NA   NA   NA   NA  0.6  1.7  2.3  5.8  8.3 15.4

# Display of last few rows 
tail(vaping)

##                 group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 32     Total Disabled   NA   NA   NA   NA   NA   NA   NA  4.3  4.9  6.7 11.9
## 33 Total Non-disabled   NA   NA   NA   NA   NA   NA   NA  3.2  3.4  6.2  8.1
## 34     Disabled women   NA   NA   NA   NA   NA   NA   NA  3.3  5.4  4.3 11.3
## 35       Disabled men   NA   NA   NA   NA   NA   NA   NA  5.6  4.3  9.8 12.5
## 36 Non-disabled women   NA   NA   NA   NA   NA   NA   NA  2.2  2.5  5.0  8.0
## 37   Non-disabled men   NA   NA   NA   NA   NA   NA   NA  4.2  4.3  7.4  8.2
##    2022
## 32 14.2
## 33  9.3
## 34 15.0
## 35 12.4
## 36  8.5
## 37 10.1

# Display of rows and columns numbers  
dim(vaping)

## [1] 37 13

# Display the variable type of each column
sapply(vaping, mode)

##       group        2011        2012        2013        2014        2015 
## "character"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
##        2016        2017        2018        2019        2020        2021 
##   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
##        2022 
##   "numeric"

The code in this question follows very similar steps to the first question with the primary difference being the vaping file/data being selected and manipulated as well as some differences between the variables names to get the desired outcomes asked.

Q3)

readFile <- function(file_name) {
  # columns that contain the data we are interested in (2nd to 14th columns)
  ratecols <- 2:14

  # column names for the years 2011 to 2022
  yearNames <- 2011:2022

  # skipping the first 9 lines of metadata
  df <- read.csv(file_name, skip = 9) 

  # only the columns of interest (2nd to 14th) and rename the columns
  df <- df[, ratecols]
  colnames(df) <- c("group", as.character(yearNames))

  # helper function to clean numeric values
  clean_numeric <- function(x) {
    # Remove all non-numeric characters
    x <- gsub("[^0-9.]", "", x)
    # Convert to numeric
    as.numeric(x)
  }

  # Applying the cleaning function to columns that are relevant 
  for (i in 2:13) {
    df[, i] <- clean_numeric(df[, i])
  }

  # Convert all columns to numric except the first 
  for (i in 2:ncol(df)) {
    df[, i] <- as.numeric(df[, i])
  }

  return(df)
}

smoking <- readFile("current-smokers.csv")
vaping <- readFile("vaping-daily.csv")
tried <- readFile("vaping-tried.csv")

# Verify results
all.equal(smoking, readFile("current-smokers.csv"))

## [1] TRUE

all.equal(vaping, readFile("vaping-daily.csv"))

## [1] TRUE

# Display results
head(tried)

##               group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
## 1             Total   NA   NA   NA   NA 16.4   NA 18.8 21.8 24.3 25.2 27.6 24.7
## 2            Gender   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 3               Men   NA   NA   NA   NA 18.1   NA 21.0 24.8 27.5 27.5 29.1 26.3
## 4             Women   NA   NA   NA   NA 14.7   NA 16.6 18.8 21.2 23.0 26.1 23.0
## 5 Age group (years)   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 6             15-17   NA   NA   NA   NA 19.4   NA 19.6 26.0 40.1 32.3 30.4 38.4

tail(tried)

##                 group 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 32     Total Disabled   NA   NA   NA   NA   NA   NA   NA 21.0 24.6 23.7 31.7
## 33 Total Non-disabled   NA   NA   NA   NA   NA   NA   NA 21.8 24.3 25.4 27.2
## 34     Disabled women   NA   NA   NA   NA   NA   NA   NA 21.3 25.7 22.0 30.6
## 35       Disabled men   NA   NA   NA   NA   NA   NA   NA 20.6 23.3 26.0 33.0
## 36 Non-disabled women   NA   NA   NA   NA   NA   NA   NA 18.5 20.8 23.1 25.6
## 37   Non-disabled men   NA   NA   NA   NA   NA   NA   NA 25.1 27.9 27.6 28.9
##    2022
## 32 28.6
## 33 24.4
## 34 27.7
## 35 29.0
## 36 22.4
## 37 26.2

dim(tried)

## [1] 37 13

sapply(tried, mode)

##       group        2011        2012        2013        2014        2015 
## "character"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
##        2016        2017        2018        2019        2020        2021 
##   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
##        2022 
##   "numeric"

readfile function takes the single argument file_name which the name of the CSV to be read, the similar to the last questions the columns in intrest including the years are selected and renamed and non numeric characters are converted to a numeric value and all variables except the first are converted to numeric. return(df) returns the cleaned data frame and the data files including smoking, vaping and tried all are read then verifies. The all.equal function checks if smoking and vaping is equal the result of calling ‘readFile(“current-smokers.csv”)’ and ‘readFile(“vaping-daily.csv”)’ respectively and lastly the results are displayed.

Q4)

# readFile reads the file, processes it and cleans it 
readFile <- function(file_name) {
  # chooses years and columns 
  ratecols <- 2:14 
  yearNames <- 2011:2022 

  # Check if the file exists
  if (!file.exists(file_name)) {
    stop("The file does not exist.")
  }
  
  # skipping the first 9 lines of metadata
  df <- read.csv(file_name, skip = 9, stringsAsFactors = FALSE)

  # Check if the selected columns exist in the data frame
  if (!all(ratecols %in% seq_along(df))) {
    stop("The specified columns do not exist in the data frame.")
  }

  # the columns of interest and rename the columns
  df <- df[, ratecols]
  colnames(df) <- c("group", as.character(yearNames))

  # helper function to clean numeric values
  clean_numeric <- function(x) {
    x <- gsub("[^0-9.]", "", x) # Remove all non-numeric characters
    as.numeric(x) # Convert to numeric
  }

  # Apply the cleaning function to the relevant columns
  for (i in 2:ncol(df)) {
    df[, i] <- clean_numeric(df[, i])
  }

  return(df) # Return the cleaned data frame
}

check <- readFile("current-smokers.csv")

The readfile function assumes the CSV file has a specific structure with metadata in the first 9 lines and relevant data in columns 2 to 14 with the specified years. the checks for if file exists and if selected columns exist in the data frame are done. Then the first 9 lines of the meta data are skipped and the csv file is read in, the columns of interest are picked and renamed, the helper function is used to clean and convert string to numeric values, cleaning function is applied to reverent columns and again all columns are ensured to be numeric except the first one and lastly the return(df) function returns the cleaned data and process the data frame.

Q5)

# the isHeader function
isHeader <- function(vec) {
# Checking if the first value is not missing and all other values are missing
  !is.na(vec[1]) && all(is.na(vec[-1]))
}

# isHeader function is applied to each row of the smoking data frame
tableHead <- apply(smoking, 1, isHeader)

# The rows that are headers
print(tableHead)

##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
## [25] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE

# Extract the names of the tables from the smoking data frame
tableNames <- smoking[tableHead, 1]

# Display the table names
print(tableNames)

## [1] "Gender"            "Age group (years)" "Maori"            
## [4] "Pacific"           "Asian"             "European/Other"   
## [7] "Disability status"

The isHeader function was applies to create a function that determines if a given vector represents a header row, then the isHeader function is applied to each row of the smoking data frame and it identifies which rows in the smoking data frame are header rows, the rows that are headers are printed, then the names of the tables from the smoking data frame is extracted to get the real names of the tables from the identified header rows and lastly the table names are printed and diplayed.

Q6)

# Test 1: Valid header (first value is not NA, rest are NA)
test1 <- isHeader(c("Header", NA, NA, NA))
print(paste("Test 1: Valid header",test1))

## [1] "Test 1: Valid header TRUE"

# Test 2: Non-header (not all other values are NA)
test2 <- isHeader(c("Header", 1, NA, NA))
print(paste("Test 2: Non-header",test2))

## [1] "Test 2: Non-header FALSE"

# Test 3: Non-header (first value is NA)
test3 <- isHeader(c(NA, NA, NA, NA))
print(paste("Test 3: Non-header", test3))

## [1] "Test 3: Non-header FALSE"

# Test 4: Edge case with a single-element vector
test4 <- tryCatch({
  isHeader(c("Header"))
}, error = function(e) {
  e$message
})
print(paste("Test 4: Edge case with a single-element vector", test4))

## [1] "Test 4: Edge case with a single-element vector TRUE"

# Test 5: Edge case with an empty vector
test5 <- tryCatch({
  isHeader(c())
}, error = function(e) {
  e$message
})
print(paste("Test 5: Edge case with an empty vector", test5))

## [1] "Test 5: Edge case with an empty vector NA"

# Test 6: Edge case with non-vector input
test6 <- tryCatch({
  isHeader(matrix(1:4, nrow = 2))
}, error = function(e) {
  e$message
})
print(paste("Test 6: Edge case with non-vector input", test6))

## [1] "Test 6: Edge case with non-vector input FALSE"

# Test 7: Valid header with different data types
test7 <- isHeader(c("Header", NA, NA, NA))
print(paste("Test 7: Valid header with different data types",  test7))

## [1] "Test 7: Valid header with different data types TRUE"

# Test 8: Non-header with mixed types
test8 <- isHeader(c("Header", NA, "Text", NA))
print(paste("Test 8: Non-header with mixed types", test8))

## [1] "Test 8: Non-header with mixed types FALSE"

# Test 9: Valid header with numeric data
test9 <- isHeader(c(1, NA, NA, NA))
print(paste("Test 9: Valid header with numeric data",test9))

## [1] "Test 9: Valid header with numeric data TRUE"

# Test 10: Non-header with numeric data
test10 <- isHeader(c(1, 2, NA, NA))
print(paste("Test 10: Non-header with numeric data", test10))

## [1] "Test 10: Non-header with numeric data FALSE"

A few series of tests are done to check that your isHeader() function (i) produces the correct answer for different types of valid input and (ii) produces an error for different types of invalid input. first test is to check if the header is valid, second test tests if the function identifies a non header when not all other values are ‘NA’, third test is if the function identifies a non header properly, test four is checking if the function behavior with a single element vector, test five tests the functions behavior with an empty vector, test six checks if the test function behavior with a non vector input, test 7 tests if the function correctly identifying a valid header mixed with data types, test eight tests if the function correctly identifies a non header with mixed data types, test nine is to check if the function correctly identifies a valid header with numeric data and test ten is to test of the function correctly identifies a non header numeric data .

7)

getTables <- function(df) {
  tableHead <- apply(df, 1, isHeader)
  
  tableNames <- as.character(df[tableHead, 1])
  
  tableID <- rep(NA, nrow(df))
  
  currentTableID <- 1
  for (i in 1:nrow(df)) {
    if (tableHead[i]) {
      currentTableID <- currentTableID + 1
    }
    tableID[i] <- currentTableID
  }
  
  tableList <- split(df, tableID)
  
  tableList <- lapply(tableList, function(x) {
    if (isHeader(x[1, ]))
      x <- x[-1, ]
    x
  })
  
  names(tableList) <- tableNames
  
  return(tableList)
}

Q8)

# Loaded as asked by the question 
smokingTables <- readRDS("smokingTables.rds")
vapingTables <- readRDS("vapingTables.rds")

# first table is extracted 
overall_smoking <- smokingTables[[1]]

# Plot the data
plot(as.numeric(names(overall_smoking)[-1]), overall_smoking[-1],
     type = "l", ylab = "Percent",
     xaxt = "n", xlab = "")

# Add custom x-axis labels
every_second_year <- years[seq(2, length(years), by = 2)]
axis(1, at = every_second_year, labels = every_second_year)

Previously saved data is loaded in the first two lines of code as asked by the question, The first table is extracted from the ‘smokingtables’, the data is plotted and comments are added to y and x axis.

Q9)

# Gender data is selected 
gender_data <- smokingTables$Gender

# first column is the categories and the rest is years 
categories <- gender_data[, 1]
years <- as.numeric(names(gender_data)[-1])

# Extract data for Men and Women
men_data <- as.numeric(gender_data[categories == "Men", -1])
women_data <- as.numeric(gender_data[categories == "Women", -1])

# Plot the data for Men
plot(years, men_data, type = "l", col = "black", ylab = "Percent", xlab = "Year",
     xaxt = "n", ylim = range(c(men_data, women_data)))

# Add the data for Women
lines(years, women_data, col = "red")

# Add custom x-axis labels
every_second_year <- years[seq(2, length(years), by = 2)]
axis(1, at = every_second_year, labels = every_second_year)

# Add a legend in the top-right corner
legend("topright", legend = c("Men", "Women"), col = c("black", "red"), lty = 1)

The gender variables is selected out of the smokingtables data frame, the catergories and the years are extracted as well as the data for men and woman, then the data is plotted using the plot function and the y and x axisis are customized similar to the model answer and lastly a legend is added to show men and women on the top right corner.

Q10)

plotTable <- function(data) {
# Extract categories and years 
categories <- as.character(data[, 1])
years <- as.numeric(names(data)[-1])
  
# data is converted to numeirc 
numeric_data <- as.data.frame(lapply(data[, -1], function(x) as.numeric(as.character(x))))
  
# Initialize the plot with the first category's data
plot(years, numeric_data[1, ], type = "l", col = 1, ylab = "Percent", xlab = "Year",
xaxt = "n", ylim = range(numeric_data, na.rm = TRUE))
  
# Add lines for each category
for (i in 2:nrow(numeric_data)) {
lines(years, numeric_data[i, ], col = i)
}
axis(1, at = years, labels = years)
  
# legends background
legend("topright", legend = categories, col = 1:nrow(numeric_data), lty = 1, bty = "o", bg = "transparent", cex = 0.8)
}

# Set up the plotting area, as asked by the question 
opar <- par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
invisible(lapply(smokingTables, plotTable))

# Reset the plotting area to its original state
par(opar)

The fuctiondata defines the plot table and takes an argument data, the appropriate years and categories are selected, the data is converted to numeric, the plot is made and each categories has its lines and levels for x axis labels are done and lastly there is a legend on top right indicating the labels on top right and the plotting area is set up.

Q11)

plotTable <- function(data, ylim = NULL, axes = TRUE, add = FALSE, lty = 1) {
  # categories and years 
  categories <- as.character(data[, 1])
  years <- as.numeric(names(data)[-1])
  
  # Convert data to numeric 
  numeric_data <- as.data.frame(lapply(data[, -1], function(x) as.numeric(as.character(x))))
  
  if (!add) {
    # Initialize the plot with the first category's data
    plot(years, numeric_data[1, ], type = "l", col = 1, ylab = "Percent", xlab = "Year",
         xaxt = ifelse(axes, "s", "n"), ylim = if (is.null(ylim)) range(numeric_data, na.rm = TRUE) else ylim, lty = lty)
  } else {
    lines(years, numeric_data[1, ], col = 1, lty = lty)
  }
  
  # Add lines for each category
  for (i in 2:nrow(numeric_data)) {
    lines(years, numeric_data[i, ], col = i, lty = lty)
  }
  
  # Add custom x-axis labels if axes are true
  if (axes) {
    axis(1, at = years, labels = years)
  }
  
  # Add a legend with a transparent background
  if (!add) {
    legend("topright", legend = categories, col = 1:nrow(numeric_data), lty = 1, bty = "o", bg = "transparent", cex = 0.8)
  }
}

# Initial plot settings
opar <- par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
invisible(lapply(smokingTables, plotTable))

percentRange <- range(smoking[-1], na.rm = TRUE)
opar <- par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))

invisible(lapply(smokingTables, plotTable, ylim = percentRange))

opar <- par(mfrow = c(3, 3), mar = rep(0, 4), oma = c(3, 3, 1, 1))

for (i in 1:3) {
  for (j in 1:3) {
    index <- (i - 1) * 3 + j
    if (index <= length(smokingTables)) {
      plotTable(smokingTables[[index]], ylim = percentRange, axes = FALSE)
    }
    if (j == 1) {
      axis(2)
    }
    if (i == 3 || (j == 3 && i == 2)) {
      axis(1)
    }
  }
}

globalRange <- range(smoking[-1], vaping[-1], na.rm = TRUE)
opar <- par(mfrow = c(3, 3), mar = rep(0, 4), oma = c(3, 3, 1, 1))

for (i in 1:3) {
  for (j in 1:3) {
    index <- (i - 1) * 3 + j
    if (index <= length(smokingTables)) {
      plotTable(smokingTables[[index]], ylim = globalRange, axes = FALSE)
      plotTable(vapingTables[[index]], ylim = globalRange, lty = 2, add = TRUE) # lty = 2 for dashed line
    }
    if (j == 1) {
      axis(2)
    }
    if (i == 3 || (j == 3 && i == 2)) {
      axis(1)
    }
  }
}

par(opar)

The plottable is set up and the initial plot settings are set,then the set up for consistent y axis limit is done,the custom plot layout with no margins and selective axes is set and the smoking and vaping data are plotted and lastly the par(opar) plot is rested to its original state.

Summary:

In this assignment we were given two main data frames of vaping and smoking prevalence based on diffrent years, ages , genders and demographics and had to do a bit of data processing including cleaning CSV files with specific conditions and instructures and handling non numeric and characters and type coercion of some variables. There were ideas about reusable functions such as ‘readfile’ function that was revisited a few times to process data frames, data validation which was done to ensure all the functions are working effectively, various types of data visualizations done with plot function in base r and some other important functions including the ‘isHeader’, ‘readfile’ and ‘gettables’ to identify and processes tables within the data frame.

Assignmnet 2

Amir Ghobadi

2024-05-20

General Commentary (data discription):

Q1

Q2)

Q3)

Q4)

Q5)

Q6)

7)

Q8)

Q9)

Q10)

Q11)

Summary: