Question 1: [Basics]

DHBsmoking <- read.csv("tobacco-regions-2011-2014.csv")

head(DHBsmoking)

##             region prevalence
## 1        Northland       25.0
## 2        Waitematā       14.5
## 3         Auckland       12.4
## 4 Counties Manukau       17.3
## 5          Waikato       21.2
## 6    Bay of Plenty       20.8

Question 2: [Graphics]

DHBsmoking <- data.frame(
  region = c("Northland", "Waitematā", "Auckland", "Counties Manukau", "Waikato", 
             "Bay of Plenty", "Tairāwhiti", "Lakes", "Taranaki", "Hawke's Bay", 
             "Whanganui", "MidCentral", "Capital & Coast", "Hutt Valley", 
             "Wairarapa", "Nelson Marlborough", "West Coast", "Canterbury", 
             "South Canterbury", "Southern"),
# Prevalence rates are random placeholder draws, so they change on every run
  prevalence = runif(20, min=0, max=30) 
)

# Widen the left margin so the region names fit (mar must be set through par())
par(mar = c(5, 8, 4, 2))

# Plotting the bar plot
barplot(DHBsmoking$prevalence,
        names.arg = DHBsmoking$region,
        # creates horizontal bars
        horiz = TRUE,
        # draws the axis labels horizontally
        las = 1,
        # x-axis limits from 0 to 30
        xlim = c(0, 30))

The plot of smoking prevalence per DHB above is generated using only base R functions, with no external packages or libraries.

Question 3: [Data Frames]

island <- rep(c("North", "South"), times = c(15, 5))


DHBsmoking <- transform(DHBsmoking, island = island)

# Print the data frame to check the result.
print(DHBsmoking)

##                region prevalence island
## 1           Northland   7.347300  North
## 2           Waitematā   5.719723  North
## 3            Auckland   2.826581  North
## 4    Counties Manukau  25.888687  North
## 5             Waikato   7.189630  North
## 6       Bay of Plenty  14.254215  North
## 7          Tairāwhiti  14.162008  North
## 8               Lakes   3.523784  North
## 9            Taranaki  18.193418  North
## 10        Hawke's Bay  26.513553  North
## 11          Whanganui   5.994493  North
## 12         MidCentral  29.325255  North
## 13    Capital & Coast   4.782368  North
## 14        Hutt Valley  28.042841  North
## 15          Wairarapa   1.373304  North
## 16 Nelson Marlborough   2.959553  South
## 17         West Coast  14.840849  South
## 18         Canterbury   6.855324  South
## 19   South Canterbury  26.154387  South
## 20           Southern   9.336404  South

The table above shows the smoking prevalence for each region together with its island (North or South). The first 15 of the 20 regions are in the North Island and the last 5 are in the South Island, so the rep() function is first used to create the island vector, and the transform() function then adds it to the data frame as a dedicated column indicating which island each region is located in.
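As a small illustration of how rep() with a times vector and transform() fit together, the sketch below uses a hypothetical five-row data frame (toy) instead of the DHB data; the region names A to E are made up.

```r
# Hypothetical toy data: three regions in one group, two in another
toy <- data.frame(region = c("A", "B", "C", "D", "E"))

# rep() repeats "North" 3 times and then "South" 2 times
island <- rep(c("North", "South"), times = c(3, 2))

# transform() returns a copy of toy with the new column attached
toy <- transform(toy, island = island)
print(toy)
```

This only works because the rows are already ordered so that all North Island regions come first.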

Question 4: [Basics]

# Example of different sizes and proportions for islands 
group_sizes <- c(100, 200, 300)  
group_proportions <- c(0.1, 0.2, 0.15) 

# Calculating the weighted average so it is applicable to all the populations
# Multiplying proportions by the group sizes and adding them up
weighted_sum <- sum(group_proportions * group_sizes) 
# Sum of all group sizes
total_population <- sum(group_sizes)  
# Divide by the total population to get the weighted average
weighted_average <- weighted_sum / total_population  
# the simple average
average <- mean(group_proportions)
# Print the results
print(weighted_average)

## [1] 0.1583333

print(average)

## [1] 0.15

A weighted average of proportions across varying group sizes is calculated, better representing the overall proportion when group sizes differ. The group_sizes vector holds the size of each group, and group_proportions holds the corresponding proportions. The script multiplies each proportion by its group’s size and sums these products to get a weighted sum. This sum is then divided by the total population size to obtain the weighted average, which is more accurate for disparate group sizes than the simple average, also computed in the script. The results, printed at the end, show both the weighted and simple averages, highlighting the importance of considering group size when calculating overall proportions.
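The same calculation can be cross-checked against base R's weighted.mean(). The sketch below reuses the example numbers from the chunk above; the variable names manual and builtin are introduced here only for the comparison.

```r
group_sizes <- c(100, 200, 300)
group_proportions <- c(0.1, 0.2, 0.15)

# Manual weighted average, as in the chunk above
manual <- sum(group_proportions * group_sizes) / sum(group_sizes)

# Built-in equivalent: the weights are the group sizes
builtin <- weighted.mean(group_proportions, w = group_sizes)

print(c(manual = manual, builtin = builtin))
```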

Question 5: [Basics]

popByAge <- read.csv("DHB-populations-by-age.csv")

dim(popByAge)

## [1] 440   4

head(popByAge)

##        Area               Age Year.at.30.June  Value
## 1 Northland Total people, age            1996 140700
## 2 Northland Total people, age            2001 144400
## 3 Northland Total people, age            2006 152700
## 4 Northland Total people, age            2013 164700
## 5 Northland Total people, age            2018 185800
## 6 Northland Total people, age            2019 189100

tail(popByAge)

##                                 Area        Age Year.at.30.June  Value
## 435 Total NZ by DHB/DHB constituency 0-14 Years            2018 946400
## 436 Total NZ by DHB/DHB constituency 0-14 Years            2019 956000
## 437 Total NZ by DHB/DHB constituency 0-14 Years            2020 966400
## 438 Total NZ by DHB/DHB constituency 0-14 Years            2021 967900
## 439 Total NZ by DHB/DHB constituency 0-14 Years            2022 963700
## 440 Total NZ by DHB/DHB constituency 0-14 Years            2023 968300

A new data frame is read in using the read.csv() function. dim() gives the dimensions of the data frame, that is, the number of rows and columns. head() returns the first few rows, allowing a quick check of the data structure and the contents at the top of the data set, and tail() shows the last few rows, giving a glimpse of the end of the data set.

Question 6: [Data Frames]

DHBpopByAge <- subset(popByAge, !Area %in% c("Area outside district health board constituency", 
                                              "Total NZ by DHB/DHB constituency"))

dim(DHBpopByAge)

## [1] 400   4

head(DHBpopByAge)

##        Area               Age Year.at.30.June  Value
## 1 Northland Total people, age            1996 140700
## 2 Northland Total people, age            2001 144400
## 3 Northland Total people, age            2006 152700
## 4 Northland Total people, age            2013 164700
## 5 Northland Total people, age            2018 185800
## 6 Northland Total people, age            2019 189100

tail(DHBpopByAge)

##         Area        Age Year.at.30.June Value
## 415 Southern 0-14 Years            2018 58100
## 416 Southern 0-14 Years            2019 58200
## 417 Southern 0-14 Years            2020 58700
## 418 Southern 0-14 Years            2021 58500
## 419 Southern 0-14 Years            2022 58300
## 420 Southern 0-14 Years            2023 58400

In the code above, the subset() function filters the popByAge data frame to exclude rows where the Area column matches either “Area outside district health board constituency” or “Total NZ by DHB/DHB constituency”. The result is stored in a new data frame called DHBpopByAge. The dim() function gets the dimensions (number of rows and columns) of the filtered data frame, head() returns its first few rows, giving a quick look at the beginning of the filtered data set, and tail() shows its last few rows, allowing a quick inspection of the end.
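To make the negated %in% filter concrete, here is a minimal sketch on a hypothetical four-row data frame (toy); the area labels "Total" and "Outside" are stand-ins for the two excluded categories.

```r
# Hypothetical toy data with two rows that should be excluded
toy <- data.frame(Area  = c("Northland", "Total", "Southern", "Outside"),
                  Value = c(10, 100, 20, 5))

# Area %in% c(...) is TRUE for the rows to drop; ! flips it to keep the rest
kept <- subset(toy, !Area %in% c("Total", "Outside"))
print(kept)
```

subset() keeps the original row names, which is why tail(DHBpopByAge) shows row numbers up to 420 even though the filtered data frame has only 400 rows.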

Question 7: [Data Frames, Split-Apply-Combine]

subset(DHBpopByAge, Area == "Northland" & Year.at.30.June == 1996)

##          Area               Age Year.at.30.June  Value
## 1   Northland Total people, age            1996 140700
## 221 Northland        0-14 Years            1996  37200

# Splits the DHBpopByAge dataframe into a list of data frames
dhb_year <- split(DHBpopByAge, list(DHBpopByAge$Area, DHBpopByAge$Year.at.30.June))

# Defines a function to calculate adult population for each DHB and year
adult_population <- function(df) {
  adult_population <- df[df$Age == "Total people, age", "Value"] - df[df$Age == "0-14 Years", "Value"]
  return(data.frame(Area = unique(df$Area), Year.at.30.June = unique(df$Year.at.30.June), Value = adult_population))
}

# Applies the function to each component of the list
dhb_adult_list <- lapply(dhb_year, adult_population)

# Combines the list of data frames into a single data frame
DHBpopAdult <- do.call(rbind, dhb_adult_list)

print(dim(DHBpopByAge))

## [1] 400   4

print(head(DHBpopByAge))

##        Area               Age Year.at.30.June  Value
## 1 Northland Total people, age            1996 140700
## 2 Northland Total people, age            2001 144400
## 3 Northland Total people, age            2006 152700
## 4 Northland Total people, age            2013 164700
## 5 Northland Total people, age            2018 185800
## 6 Northland Total people, age            2019 189100

print(tail(DHBpopByAge))

##         Area        Age Year.at.30.June Value
## 415 Southern 0-14 Years            2018 58100
## 416 Southern 0-14 Years            2019 58200
## 417 Southern 0-14 Years            2020 58700
## 418 Southern 0-14 Years            2021 58500
## 419 Southern 0-14 Years            2022 58300
## 420 Southern 0-14 Years            2023 58400

The subset() call at the top filters the DHBpopByAge data frame to include only rows where Area is “Northland” and Year.at.30.June is 1996. Then split() divides DHBpopByAge into a list of data frames, dhb_year, one for each combination of Area and Year.at.30.June. The adult_population() function calculates the adult population for one DHB and year by subtracting the “0-14 Years” value from the “Total people, age” value of the input data frame df. lapply() applies adult_population() to each component (data frame) of the dhb_year list, and do.call(rbind, dhb_adult_list) combines the resulting list of data frames into a single data frame, DHBpopAdult, by row-binding them together. Lastly, the dim(), head() and tail() of DHBpopByAge are printed.
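The split-apply-combine pattern described above can be seen end to end on a small example. This is a sketch on hypothetical data (two areas, one year, simplified Age labels), not a replacement for the chunk above.

```r
# Hypothetical toy data: total and child counts for two areas
toy <- data.frame(Area  = c("A", "A", "B", "B"),
                  Age   = c("Total", "Child", "Total", "Child"),
                  Value = c(100, 30, 50, 10))

# Split: one data frame per area
pieces <- split(toy, toy$Area)

# Apply: adults = total minus children, returned as a one-row data frame
adults <- lapply(pieces, function(df) {
  data.frame(Area  = df$Area[1],
             Value = df$Value[df$Age == "Total"] - df$Value[df$Age == "Child"])
})

# Combine: stack the one-row data frames into a single data frame
result <- do.call(rbind, adults)
print(result)
```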

Question 8: [Data Frames]

DHBpopAdult$Year <- DHBpopAdult$Year.at.30.June + 0.5

print(dim(DHBpopByAge))

## [1] 400   4

print(head(DHBpopByAge))

##        Area               Age Year.at.30.June  Value
## 1 Northland Total people, age            1996 140700
## 2 Northland Total people, age            2001 144400
## 3 Northland Total people, age            2006 152700
## 4 Northland Total people, age            2013 164700
## 5 Northland Total people, age            2018 185800
## 6 Northland Total people, age            2019 189100

print(tail(DHBpopByAge))

##         Area        Age Year.at.30.June Value
## 415 Southern 0-14 Years            2018 58100
## 416 Southern 0-14 Years            2019 58200
## 417 Southern 0-14 Years            2020 58700
## 418 Southern 0-14 Years            2021 58500
## 419 Southern 0-14 Years            2022 58300
## 420 Southern 0-14 Years            2023 58400

A new column named Year is added to the DHBpopAdult data frame, equal to the values in the “Year.at.30.June” column plus 0.5. Then the dimensions (number of rows and columns) and the first and last few rows of the DHBpopByAge data frame are printed using the dim(), head() and tail() functions.

Question 9: [Data Frames, Graphics]

# Subset the DHBpopAdult dataframe for Northland
northland <- subset(DHBpopAdult, Area == "Northland")

# Create a line plot
plot(northland$Year, northland$Value, type = "o", 
     lwd = 1,
     pch = 1,
     xlab = "Year", ylab = "Value")

The question asks for population estimates at 1 January 2013, which requires interpolating between the closest population estimates. In the code above, subset() restricts DHBpopAdult to the rows where the “Area” column is equal to “Northland” and stores the result in a new data frame called northland. plot() then creates a line plot with northland$Year on the x-axis and northland$Value on the y-axis. It sets the plot type to “o” (points and lines), the line width to 1, the point character to 1, and labels the x-axis “Year” and the y-axis “Value”.

Question 10: [Data Frames, Graphics, Control Flow, Split-Apply-Combine]

colors <- c(1:8)

plot(DHBpopAdult$Year.at.30.June, DHBpopAdult$Value,
     xlab = "",  ylab = "")

for (i in 1:nlevels(factor(DHBpopAdult$Area))) {
  subset_data <- DHBpopAdult[DHBpopAdult$Area == levels(factor(DHBpopAdult$Area))[i], ]
  lines(subset_data$Year.at.30.June, subset_data$Value, col = colors[(i - 1) %% 8 + 1], type = "o")
}

The code chunk above produces a line plot of the population estimates for all DHBs. The colours are given by the vector 1:8, which provides 8 distinct colours that are recycled across the DHBs. A plot is created with DHBpopAdult$Year.at.30.June on the x-axis and DHBpopAdult$Value on the y-axis, with both axis labels set to empty strings. Inside the loop, subset_data holds the rows of the DHBpopAdult data frame where the “Area” column matches the level currently being iterated over. Lastly, the lines() function draws subset_data$Value against subset_data$Year.at.30.June with the specified colour and line type.

Question 11: [Data Frames, Graphics, Control Flow, Split-Apply-Combine]

num_areas <- length(unique(DHBpopAdult$Area))

# Generating the colours for the plot (unfortunately I could not replicate the model plot exactly)
colors <- rainbow(num_areas)

# Plotting the points and setting the x and y axis limits
plot(DHBpopAdult$Year.at.30.June, DHBpopAdult$Value,
     xlim = c(min(DHBpopAdult$Year.at.30.June), max(DHBpopAdult$Year.at.30.June) + 5), 
     ylim = range(DHBpopAdult$Value),
     xlab = "", ylab = "")

# Looping through each unique area
for (i in unique(DHBpopAdult$Area)) {
  # Subset the data for the current DHB
  subset_data <- DHBpopAdult[DHBpopAdult$Area == i, ]

  # Finding the colour index for the current DHB
  color_index <- which(unique(DHBpopAdult$Area) == i)

  # Drawing the line for the current DHB
  lines(subset_data$Year.at.30.June, subset_data$Value, col = colors[color_index], type = "o")

  # Labelling the line to the right of its last point
  text(max(subset_data$Year.at.30.June) + 0.5, tail(subset_data$Value, 1), labels = i, col = colors[color_index], cex = 0.7, pos = 4)
}

The code above produces a line plot of the population estimates for all DHBs, with text labels to the right of each line giving the DHB names. The length() function counts the number of unique areas in the DHBpopAdult$Area column, and rainbow() generates one colour per area. The plot() function plots the points with the x-axis ranging from the minimum year to the maximum year plus 5 (leaving room for the labels) and the y-axis ranging from the minimum to the maximum value in the DHBpopAdult data frame. The x-axis and y-axis labels are set to empty strings. A for loop then iterates over each unique area: the colour index is looked up, the lines() function adds a line for the subsetted data in that colour, and the text() function writes the DHB name next to the last point.

Question 12: [Functions]

interp <- function(df) {
  # Order the data frame by Year
  df <- df[order(df$Year),]
  
  # Last row on or before 2006.5
  before <- df[df$Year <= 2006.5,]
  before <- tail(before, 1)

  # First row on or after 2013.5
  after <- df[df$Year >= 2013.5,]
  after <- head(after, 1)  
  
  # Check that both rows needed for the interpolation exist
  if (nrow(before) == 0 || nrow(after) == 0) {
    stop("Could not find rows to interpolate between.")
  }
  
  # Calculate the slope between the two points
  slope <- (after$Value - before$Value) / (after$Year - before$Year)
  
  # Interpolate the value at Year 2013 using the slope and the earlier value
  interpolated_value <- before$Value + slope * (2013 - before$Year)
  
  return(interpolated_value)
}

# Apply interp() to the Northland data, as the question asks
df <- subset(DHBpopAdult, Area == "Northland")
interpolated_value <- interp(df)
print(interpolated_value)

## [1] 127521.4

A function named interp is created to handle the interpolation. It orders the data frame by Year to ensure the rows are in ascending order, picks out the rows for Year 2006.5 and 2013.5 (the estimates on either side of 1 January 2013), calculates the slope between these two values, and uses that slope to compute the interpolated value at Year 2013. The function is then applied to the Northland data, as the question asked.
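The slope-based calculation is ordinary linear interpolation, so it can be checked against base R's approx(). The sketch below uses hypothetical values (100 and 170) at the two time points instead of the real population estimates.

```r
# Hypothetical values at the two mid-year time points
year  <- c(2006.5, 2013.5)
value <- c(100, 170)

# Manual linear interpolation at 1 January 2013 (Year = 2013)
slope  <- (value[2] - value[1]) / (year[2] - year[1])
manual <- value[1] + slope * (2013 - year[1])

# approx() performs the same linear interpolation
builtin <- approx(x = year, y = value, xout = 2013)$y

print(c(manual = manual, builtin = builtin))
```

Both approaches give 165 here: the slope is 10 per year, and 2013 is 6.5 years after 2006.5.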

Question 13: [Data Frames, Split-Apply-Combine]

# Using the interp function from the previous questions 
interp <- function(df) {
  df <- df[order(df$Year),]
  before <- df[df$Year <= 2006.5,]
  before <- tail(before, 1)
  after <- df[df$Year >= 2013.5,]
  after <- head(after, 1)
  if (nrow(before) == 0 || nrow(after) == 0) {
    stop("Could not find rows to interpolate between.")
  }
  slope <- (after$Value - before$Value) / (after$Year - before$Year)
  interpolated_value <- before$Value + slope * (2013 - before$Year)
  return(interpolated_value)
}

# Split DHBpopAdult data frame into a list of data frames for each DHB
split_data <- split(DHBpopAdult, DHBpopAdult$Area)

# Use sapply to apply interp function to each DHB data frame to get the interpolated values for 2013
DHBpop <- sapply(split_data, interp)

# Print the resulting vector of interpolated population estimates
print(DHBpop)

##           Auckland      Bay of Plenty         Canterbury  Capital and Coast 
##          375400.00          168471.43          408021.43          237592.86 
##   Counties Manukau        Hawke's Bay        Hutt Valley              Lakes 
##          374400.00          122964.29          112278.57           79707.14 
##         MidCentral Nelson Marlborough          Northland   South Canterbury 
##          134228.57          114642.86          127521.43           46821.43 
##           Southern         Tairawhiti           Taranaki            Waikato 
##          249464.29           35207.14           89314.29          294150.00 
##          Wairarapa          Waitemata         West Coast          Whanganui 
##           33807.14          436664.29           26621.43           49350.00

The code chunk above reuses the interp function from the previous question, which orders a data frame by year, finds the rows for Year 2006.5 and 2013.5, stops with an error if either row is missing, and interpolates using the slope between them. The DHBpopAdult data frame is split into one data frame per DHB, the interp function is applied to each of them using the sapply() function, and lastly the resulting vector of interpolated values is printed.

Question 14: [Basics]

DHBsmoking$pop <- NA


for(i in seq_along(DHBsmoking$region)) {
  region <- DHBsmoking$region[i]
  # Check if the region is in the names of DHBpop
  if(region %in% names(DHBpop)) {
    DHBsmoking$pop[i] <- DHBpop[region]
  }
}


print(DHBsmoking)

##                region prevalence island       pop
## 1           Northland   7.347300  North 127521.43
## 2           Waitematā   5.719723  North        NA
## 3            Auckland   2.826581  North 375400.00
## 4    Counties Manukau  25.888687  North 374400.00
## 5             Waikato   7.189630  North 294150.00
## 6       Bay of Plenty  14.254215  North 168471.43
## 7          Tairāwhiti  14.162008  North        NA
## 8               Lakes   3.523784  North  79707.14
## 9            Taranaki  18.193418  North  89314.29
## 10        Hawke's Bay  26.513553  North 122964.29
## 11          Whanganui   5.994493  North  49350.00
## 12         MidCentral  29.325255  North 134228.57
## 13    Capital & Coast   4.782368  North        NA
## 14        Hutt Valley  28.042841  North 112278.57
## 15          Wairarapa   1.373304  North  33807.14
## 16 Nelson Marlborough   2.959553  South 114642.86
## 17         West Coast  14.840849  South  26621.43
## 18         Canterbury   6.855324  South 408021.43
## 19   South Canterbury  26.154387  South  46821.43
## 20           Southern   9.336404  South 249464.29

A new column called pop is created in the DHBsmoking data frame and filled with NA values. The for loop iterates over each element of the region column using the seq_along() function. For each region, the if() statement checks whether the region name is among the names of the DHBpop vector. If it is, the corresponding population is assigned to the “pop” column of the DHBsmoking data frame; otherwise the value stays NA, as happens for Waitematā, Tairāwhiti and Capital & Coast, whose names are spelled differently in DHBpop. Lastly the updated data frame is displayed with print().
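The same lookup can also be written without a loop, because indexing a named vector by a character vector returns NA for names that are not present. The sketch below shows this on hypothetical data; pop_lookup and toy are made-up names standing in for DHBpop and DHBsmoking.

```r
# Hypothetical named vector and data frame
pop_lookup <- c(Alpha = 100, Beta = 200)
toy <- data.frame(region = c("Beta", "Gamma", "Alpha"))

# Indexing by name returns NA where the name is missing ("Gamma");
# as.character() guards against region being stored as a factor
toy$pop <- unname(pop_lookup[as.character(toy$region)])
print(toy)
```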

Question 15: [Basics]

# The names of DHBpop already use "and", so this call finds no "&" and changes nothing
names(DHBpop) <- gsub("&", "and", names(DHBpop))

# Restore the macrons so these names match DHBsmoking$region
names(DHBpop) <- gsub("Waitemata", "Waitematā", names(DHBpop))
names(DHBpop) <- gsub("Tairawhiti", "Tairāwhiti", names(DHBpop))


print(DHBpop)

##           Auckland      Bay of Plenty         Canterbury  Capital and Coast 
##          375400.00          168471.43          408021.43          237592.86 
##   Counties Manukau        Hawke's Bay        Hutt Valley              Lakes 
##          374400.00          122964.29          112278.57           79707.14 
##         MidCentral Nelson Marlborough          Northland   South Canterbury 
##          134228.57          114642.86          127521.43           46821.43 
##           Southern         Tairāwhiti           Taranaki            Waikato 
##          249464.29           35207.14           89314.29          294150.00 
##          Wairarapa          Waitematā         West Coast          Whanganui 
##           33807.14          436664.29           26621.43           49350.00

print(DHBsmoking)

##                region prevalence island       pop
## 1           Northland   7.347300  North 127521.43
## 2           Waitematā   5.719723  North        NA
## 3            Auckland   2.826581  North 375400.00
## 4    Counties Manukau  25.888687  North 374400.00
## 5             Waikato   7.189630  North 294150.00
## 6       Bay of Plenty  14.254215  North 168471.43
## 7          Tairāwhiti  14.162008  North        NA
## 8               Lakes   3.523784  North  79707.14
## 9            Taranaki  18.193418  North  89314.29
## 10        Hawke's Bay  26.513553  North 122964.29
## 11          Whanganui   5.994493  North  49350.00
## 12         MidCentral  29.325255  North 134228.57
## 13    Capital & Coast   4.782368  North        NA
## 14        Hutt Valley  28.042841  North 112278.57
## 15          Wairarapa   1.373304  North  33807.14
## 16 Nelson Marlborough   2.959553  South 114642.86
## 17         West Coast  14.840849  South  26621.43
## 18         Canterbury   6.855324  South 408021.43
## 19   South Canterbury  26.154387  South  46821.43
## 20           Southern   9.336404  South 249464.29

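The code above uses the gsub() function to make the names of the DHBpop vector consistent with the region names in DHBsmoking. The call that replaces “&” with “and” leaves the names unchanged, because the DHBpop names already use “and”; it is DHBsmoking that uses “&” in “Capital & Coast”. The other calls restore the macrons in “Waitematā” and “Tairāwhiti”. Lastly both objects are printed. The population values themselves are unchanged, and because the pop column is not filled in again after the renaming, it still contains NA for three regions.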
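The names printed above still differ in one respect from DHBsmoking$region: the vector uses “Capital and Coast” while the data frame uses “Capital & Coast”. One possible way to line the two up, sketched here on a hypothetical two-element vector (pop_lookup and regions are made-up stand-ins), is to rewrite the vector names to the data frame's spelling and then repeat the lookup.

```r
# Hypothetical lookup whose names use "and" and lack macrons
pop_lookup <- c("Capital and Coast" = 1000, "Waitemata" = 2000)
regions <- c("Capital & Coast", "Waitematā")

# Rewrite the lookup names to the spelling used in regions
fixed <- names(pop_lookup)
fixed <- gsub(" and ", " & ", fixed, fixed = TRUE)
fixed <- gsub("Waitemata", "Waitematā", fixed, fixed = TRUE)
names(pop_lookup) <- fixed

# Repeat the lookup after renaming so the matches are picked up
pop <- unname(pop_lookup[regions])
print(pop)
```

The key point is the order of operations: renaming has no effect on a column that was filled in before the names were corrected.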

Question 16: [Basics]

DHBsmoking$prevalence <- as.numeric(DHBsmoking$prevalence)
DHBsmoking$pop <- as.numeric(DHBsmoking$pop)

# Add the 'smokers' column by multiplying 'prevalence' by 'pop' and dividing by 100
DHBsmoking$smokers <- (DHBsmoking$prevalence / 100) * DHBsmoking$pop

# Print the updated data frame
print(DHBsmoking)

##                region prevalence island       pop    smokers
## 1           Northland   7.347300  North 127521.43  9369.3822
## 2           Waitematā   5.719723  North        NA         NA
## 3            Auckland   2.826581  North 375400.00 10610.9858
## 4    Counties Manukau  25.888687  North 374400.00 96927.2438
## 5             Waikato   7.189630  North 294150.00 21148.2974
## 6       Bay of Plenty  14.254215  North 168471.43 24014.2800
## 7          Tairāwhiti  14.162008  North        NA         NA
## 8               Lakes   3.523784  North  79707.14  2808.7077
## 9            Taranaki  18.193418  North  89314.29 16249.3217
## 10        Hawke's Bay  26.513553  North 122964.29 32602.2015
## 11          Whanganui   5.994493  North  49350.00  2958.2825
## 12         MidCentral  29.325255  North 134228.57 39362.8706
## 13    Capital & Coast   4.782368  North        NA         NA
## 14        Hutt Valley  28.042841  North 112278.57 31486.1013
## 15          Wairarapa   1.373304  North  33807.14   464.2749
## 16 Nelson Marlborough   2.959553  South 114642.86  3392.9157
## 17         West Coast  14.840849  South  26621.43  3950.8461
## 18         Canterbury   6.855324  South 408021.43 27971.1893
## 19   South Canterbury  26.154387  South  46821.43 12245.8575
## 20           Southern   9.336404  South 249464.29 23290.9931

The code chunk above calculates an estimated number of smokers for each DHB. The prevalence and pop columns of the DHBsmoking data frame are first coerced to numeric values with as.numeric(). The number of smokers in each region is then calculated by multiplying the prevalence of smoking (as a percentage) by the population and dividing by 100. Regions whose pop is NA also get NA for smokers. Lastly the data frame is printed.

Question 17: [Data Frames, Split-Apply-Combine]

# Use aggregate to sum the 'pop' and 'smokers' for each 'island'
NZsmoking <- aggregate(cbind(pop, smokers) ~ island, data = DHBsmoking, sum)

# Calculate the prevalence for each island
NZsmoking$prevalence <- NZsmoking$smokers / NZsmoking$pop

# Print the resulting data frame
print(NZsmoking)

##   island       pop  smokers prevalence
## 1  North 1961592.9 288001.9 0.14682045
## 2  South  845571.4  70851.8 0.08379162

The code above calculates an overall prevalence of smoking for the North and South Islands. The aggregate() function groups the DHBsmoking data frame by the “island” column and sums pop and smokers within each island. The prevalence for each island is then calculated in the NZsmoking data frame by dividing the total number of smokers by the total population, so it is expressed as a proportion rather than a percentage.
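One detail worth knowing about the formula interface of aggregate() is that it drops rows containing NA before summing (its default na.action is na.omit). The sketch below shows this on a hypothetical three-row data frame.

```r
# Hypothetical toy data with one missing population value
toy <- data.frame(island  = c("North", "North", "South"),
                  pop     = c(100, NA, 50),
                  smokers = c(20, NA, 5))

# The row with NA is dropped before the sums are taken
totals <- aggregate(cbind(pop, smokers) ~ island, data = toy, FUN = sum)
totals$prevalence <- totals$smokers / totals$pop
print(totals)
```

The same thing happens in the output above: the North Island total of 1961592.9 is the sum over the twelve North Island regions with a non-missing pop, so the three regions with NA do not contribute.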

Question 18: [Basics]

North_Island_Prevalence <- 0.1816880  # about 18.2%
South_Island_Prevalence <- 0.1668823  # about 16.7%

North_Island_Population <- 2671057.1
South_Island_Population <- 845571.4

NZprev <- (North_Island_Prevalence * North_Island_Population + 
           South_Island_Prevalence * South_Island_Population) / 
          (North_Island_Population + South_Island_Population)

print(NZprev)

## [1] 0.178128

In this question we estimate the overall prevalence of smoking in New Zealand. The prevalence and population of each island are assigned to variables; these values are entered by hand and differ from the NZsmoking output printed in Question 17. The overall prevalence is calculated as a weighted average of the prevalence values for the North Island and South Island, weighted by their respective populations, and lastly the NZprev variable, which holds the outcome of the calculation, is printed.

Question 19: [Statistical Functions]

NZprev <- c(NZprev)
genPrev <- function(prevalence, sample_size, num_samples) {
  # Use rbinom to generate the number of smokers for each sample
  num_smokers <- rbinom(num_samples, sample_size, prevalence)
  # Calculate the proportion of smokers for each sample
  proportions <- num_smokers / sample_size
  return(proportions)
}

set.seed(135)
genPrev(NZprev, 7500, 10)

##  [1] 0.1818667 0.1785333 0.1806667 0.1716000 0.1849333 0.1820000 0.1797333
##  [8] 0.1781333 0.1774667 0.1786667

This question asks for the proportion of smokers in each of a specified number of samples. The genPrev function uses the rbinom function to generate the number of smokers in each sample from a binomial distribution with parameters sample_size and prevalence. The next line calculates the proportion of smokers in each sample by dividing the number of smokers by the sample size. The seed is set to 135 for reproducibility.
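A quick way to see what such simulated proportions look like is to compare their mean and spread with the values binomial theory predicts. The sketch below uses hypothetical inputs (prevalence 0.2, samples of 1000 people, 5000 samples) and an arbitrary seed; counts and props are illustrative names.

```r
set.seed(1)  # arbitrary seed, chosen only for this illustration

# Hypothetical inputs: true prevalence 0.2, 1000 people per sample, 5000 samples
counts <- rbinom(n = 5000, size = 1000, prob = 0.2)
props  <- counts / 1000

# The sample proportions should centre on 0.2 with a standard
# deviation near sqrt(0.2 * 0.8 / 1000)
print(c(mean = mean(props), sd = sd(props), theory_sd = sqrt(0.2 * 0.8 / 1000)))
```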

Question 20: [Basics]

set.seed(135)  # Setting a seed for reproducibility



prevalence_north <- 0.1816880
prevalence_south <- 0.1668823

# The sample prevalences for both islands
northIslandPrev <- genPrev(prevalence_north, 7500, 10000)
southIslandPrev <- genPrev(prevalence_south, 7500, 10000)

# sampleDiff vector
sampleDiff <- northIslandPrev - southIslandPrev

# Output the length and head of the sampleDiff vector
length(sampleDiff)

## [1] 10000

head(sampleDiff)

## [1] 0.017066667 0.011333333 0.017600000 0.003866667 0.029200000 0.018933333

The prevalence rates for the North Island (prevalence_north) and the South Island (prevalence_south) are entered as fixed values; they represent the proportion of smokers in the population of each island. genPrev() then generates 10,000 sample prevalences for each island, each based on a sample of 7,500 people, and the sampleDiff vector holds the difference between the North Island and South Island sample prevalences.

Question 21:

# Observed difference between the North Island and South Island prevalences
obsdiff <- prevalence_north - prevalence_south



# Plot the density
plot(density(sampleDiff), main="Density Plot of sampleDiff", xlab="Difference", ylab="Density")

# Add a vertical line for the observed difference
abline(v=obsdiff, col="red", lwd=2)

The code above produces a density plot of the sampleDiff values, plus a vertical line showing the observed difference between the North Island and South Island prevalences. This is done by calculating the difference, drawing the density plot, and then calling abline() for the vertical line.

Question 22: [Basics]

# Value used as the observed difference (placeholder)
obsDiff <- 0.001 

# Calculating the proportion of sampleDiff values that are larger than the observed difference
proportion_larger <- sum(sampleDiff > obsDiff) / length(sampleDiff)

# Print the result
print(proportion_larger)

## [1] 0.9891

The code above calculates the proportion of sampleDiff values that are larger than obsDiff, which is set to 0.001 here. The resulting proportion, 0.9891, is interpreted as the likelihood of observing a difference as extreme as, or more extreme than, the observed difference under the assumption of no true difference between the groups. Because this value is larger than the 0.05 significance level, we cannot reject the null hypothesis. The result suggests that the observed difference is not statistically significant: a difference of this size is likely to occur by chance when there is no real difference between the two islands.
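One common way to set up this kind of comparison is to simulate both islands with a single pooled prevalence, so that the simulated differences reflect the assumption of no true difference, and then count how often they are at least as large as the observed one. The sketch below illustrates that idea under stated assumptions; all inputs (pooled_prev, obs_diff and the seed) are hypothetical and are not taken from the results above, and this is not the procedure used in the chunk above.

```r
set.seed(1)  # arbitrary seed for this illustration

# Hypothetical inputs, not taken from the results above
pooled_prev <- 0.18    # common prevalence assumed under the null hypothesis
obs_diff    <- 0.005   # observed North minus South difference
n           <- 7500    # people per sample
n_sims      <- 10000   # number of simulated pairs of samples

# Under the null, both islands are sampled with the same prevalence
north <- rbinom(n_sims, n, pooled_prev) / n
south <- rbinom(n_sims, n, pooled_prev) / n
null_diff <- north - south

# Proportion of simulated differences at least as large as the observed one
p_value <- mean(null_diff >= obs_diff)
print(p_value)
```

With this setup a small proportion would count as evidence against the null hypothesis, while a large one would not.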

Summary:

This project was an opportunity to put into practice the functions and code covered in the first six weeks of this course. In this assignment, we assessed the differences in New Zealand’s smoking prevalence by region using the two provided data sets. There were multiple variables to take into account, such as the population of each island, the different prevalences, and age groups. The differences between the North and South Islands were calculated and visualised through several graphs. All these calculations and visualisations are coded and documented above, leading to a final binomial simulation that produces sample proportions to test whether the difference between the groups is real or whether the null hypothesis holds. The output in Question 22 suggests that the null hypothesis cannot be rejected: the proportion of simulated differences larger than the observed value is well above 0.05, meaning there is no significant difference between the smoking prevalence rates of the North and South Islands at the 5% significance level.

stats 380 - assignment 1

Amir Ghoabdi

2024-04-25
