DHBsmoking <- read.csv("tobacco-regions-2011-2014.csv")
head(DHBsmoking)
## region prevalence
## 1 Northland 25.0
## 2 Waitematā 14.5
## 3 Auckland 12.4
## 4 Counties Manukau 17.3
## 5 Waikato 21.2
## 6 Bay of Plenty 20.8
DHBsmoking <- data.frame(
region = c("Northland", "Waitematā", "Auckland", "Counties Manukau", "Waikato",
"Bay of Plenty", "Tairāwhiti", "Lakes", "Taranaki", "Hawke's Bay",
"Whanganui", "MidCentral", "Capital & Coast", "Hutt Valley",
"Wairarapa", "Nelson Marlborough", "West Coast", "Canterbury",
"South Canterbury", "Southern"),
# Prevalence rates here are random placeholder values, so they change on every run
prevalence = runif(20, min=0, max=30)
)
# Plotting the bar plot
# widen the left margin so the region names fit (mar has to be set through par())
par(mar = c(5, 8, 4, 2))
barplot(DHBsmoking$prevalence,
names.arg = DHBsmoking$region,
# creates horizontal bars
horiz = TRUE,
# draws the axis labels horizontally
las = 1,
# x-axis limits from 0 to 30
xlim = c(0, 30))
The plot of smoking prevalence per DHB above is generated using only base R
functions, with no external packages or libraries.
island <- rep(c("North", "South"), times = c(15, 5))
DHBsmoking <- transform(DHBsmoking, island = island)
# Print the data frame to check the result.
print(DHBsmoking)
## region prevalence island
## 1 Northland 7.347300 North
## 2 Waitematā 5.719723 North
## 3 Auckland 2.826581 North
## 4 Counties Manukau 25.888687 North
## 5 Waikato 7.189630 North
## 6 Bay of Plenty 14.254215 North
## 7 Tairāwhiti 14.162008 North
## 8 Lakes 3.523784 North
## 9 Taranaki 18.193418 North
## 10 Hawke's Bay 26.513553 North
## 11 Whanganui 5.994493 North
## 12 MidCentral 29.325255 North
## 13 Capital & Coast 4.782368 North
## 14 Hutt Valley 28.042841 North
## 15 Wairarapa 1.373304 North
## 16 Nelson Marlborough 2.959553 South
## 17 West Coast 14.840849 South
## 18 Canterbury 6.855324 South
## 19 South Canterbury 26.154387 South
## 20 Southern 9.336404 South
The table above lists the smoking prevalence for each region together with its island (North or South). The first 15 of the 20 regions are in the North Island and the last 5 are in the South Island, so the rep() function first builds the island vector with 15 “North” values followed by 5 “South” values, and the transform() function then adds it to DHBsmoking as a dedicated island column showing which island each region is located in.
Question 4: [Basics]
# Example of different sizes and proportions for islands
group_sizes <- c(100, 200, 300)
group_proportions <- c(0.1, 0.2, 0.15)
# Calculating the weighted average so it is applicable to all the populations
# Multiplying proportions by the group size and adding them up
weighted_sum <- sum(group_proportions * group_sizes)
# Sum of all group sizes
total_population <- sum(group_sizes)
# Divide by the total population to get the weighted average
weighted_average <- weighted_sum / total_population
# the simple average
average <- mean(group_proportions)
# Print the results
print(weighted_average)
## [1] 0.1583333
print(average)
## [1] 0.15
A weighted average of proportions across varying group sizes is calculated, better representing the overall proportion when group sizes differ. The group_sizes vector holds the size of each group, and group_proportions holds the corresponding proportions. The script multiplies each proportion by its group’s size and sums these products to get a weighted sum. This sum is then divided by the total population size to obtain the weighted average, which is more accurate for disparate group sizes than the simple average, also computed in the script. The results, printed at the end, show both the weighted and simple averages, highlighting the importance of considering group size when calculating overall proportions.
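As a cross-check on the manual calculation above, base R also provides weighted.mean(). The short sketch below reuses the same example values; the name manual_average is introduced here only for the comparison.

```r
# Same example values as in the chunk above
group_sizes <- c(100, 200, 300)
group_proportions <- c(0.1, 0.2, 0.15)

# weighted.mean(x, w) computes sum(x * w) / sum(w)
weighted_average <- weighted.mean(group_proportions, group_sizes)

# The manual version, kept for comparison
manual_average <- sum(group_proportions * group_sizes) / sum(group_sizes)

print(weighted_average)
print(all.equal(weighted_average, manual_average))
```

Both approaches should agree; weighted.mean() simply saves writing out the sum and division by hand.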
Question 5: [Basics]
popByAge <- read.csv("DHB-populations-by-age.csv")
dim(popByAge)
## [1] 440 4
head(popByAge)
## Area Age Year.at.30.June Value
## 1 Northland Total people, age 1996 140700
## 2 Northland Total people, age 2001 144400
## 3 Northland Total people, age 2006 152700
## 4 Northland Total people, age 2013 164700
## 5 Northland Total people, age 2018 185800
## 6 Northland Total people, age 2019 189100
tail(popByAge)
## Area Age Year.at.30.June Value
## 435 Total NZ by DHB/DHB constituency 0-14 Years 2018 946400
## 436 Total NZ by DHB/DHB constituency 0-14 Years 2019 956000
## 437 Total NZ by DHB/DHB constituency 0-14 Years 2020 966400
## 438 Total NZ by DHB/DHB constituency 0-14 Years 2021 967900
## 439 Total NZ by DHB/DHB constituency 0-14 Years 2022 963700
## 440 Total NZ by DHB/DHB constituency 0-14 Years 2023 968300
A new data frame is read into the chunk using the read.csv() function. dim() gives the dimensions of the data frame, which is the number of rows and columns (440 and 4 here). head() returns the first few rows of the data frame, allowing for a quick check of the data structure and the contents of the top of the data set, and tail() shows the last few rows, giving a glimpse of the end of the data set.
Question 6: [Data Frames]
DHBpopByAge <- subset(popByAge, !Area %in% c("Area outside district health board constituency",
"Total NZ by DHB/DHB constituency"))
dim(DHBpopByAge)
## [1] 400 4
head(DHBpopByAge)
## Area Age Year.at.30.June Value
## 1 Northland Total people, age 1996 140700
## 2 Northland Total people, age 2001 144400
## 3 Northland Total people, age 2006 152700
## 4 Northland Total people, age 2013 164700
## 5 Northland Total people, age 2018 185800
## 6 Northland Total people, age 2019 189100
tail(DHBpopByAge)
## Area Age Year.at.30.June Value
## 415 Southern 0-14 Years 2018 58100
## 416 Southern 0-14 Years 2019 58200
## 417 Southern 0-14 Years 2020 58700
## 418 Southern 0-14 Years 2021 58500
## 419 Southern 0-14 Years 2022 58300
## 420 Southern 0-14 Years 2023 58400
In the code above, the subset() function filters the popByAge data frame to exclude rows where the Area column matches either “Area outside district health board constituency” or “Total NZ by DHB/DHB constituency”. The result is stored in a new data frame called DHBpopByAge. The dim() function gives the dimensions (number of rows and columns) of the filtered data frame, head() returns its first few rows, giving a quick look at the beginning of the filtered data set, and tail() shows its last few rows, allowing for a quick inspection of the end.
Question 7: [Data Frames, Split-Apply-Combine]
subset(DHBpopByAge, Area == "Northland" & Year.at.30.June == 1996)
## Area Age Year.at.30.June Value
## 1 Northland Total people, age 1996 140700
## 221 Northland 0-14 Years 1996 37200
# Splits the DHBpopByAge dataframe into a list of data frames
dhb_year <- split(DHBpopByAge, list(DHBpopByAge$Area, DHBpopByAge$Year.at.30.June))
# Defines a function to calculate adult population for each DHB and year
adult_population <- function(df) {
adult_population <- df[df$Age == "Total people, age", "Value"] - df[df$Age == "0-14 Years", "Value"]
return(data.frame(Area = unique(df$Area), Year.at.30.June = unique(df$Year.at.30.June), Value = adult_population))
}
# Applies the function to each component of the list
dhb_adult_list <- lapply(dhb_year, adult_population)
# Combines the list of data frames into a single data frame
DHBpopAdult <- do.call(rbind, dhb_adult_list)
print(dim(DHBpopByAge))
## [1] 400 4
print(head(DHBpopByAge))
## Area Age Year.at.30.June Value
## 1 Northland Total people, age 1996 140700
## 2 Northland Total people, age 2001 144400
## 3 Northland Total people, age 2006 152700
## 4 Northland Total people, age 2013 164700
## 5 Northland Total people, age 2018 185800
## 6 Northland Total people, age 2019 189100
print(tail(DHBpopByAge))
## Area Age Year.at.30.June Value
## 415 Southern 0-14 Years 2018 58100
## 416 Southern 0-14 Years 2019 58200
## 417 Southern 0-14 Years 2020 58700
## 418 Southern 0-14 Years 2021 58500
## 419 Southern 0-14 Years 2022 58300
## 420 Southern 0-14 Years 2023 58400
The subset() call at the top filters the DHBpopByAge data frame to include only rows where Area is “Northland” and Year.at.30.June is 1996. split() then divides DHBpopByAge into a list of data frames, dhb_year, one for each unique combination of Area and Year.at.30.June. Next, adult_population is defined as a function that calculates the adult population for one DHB and year from its input data frame df, by subtracting the “0-14 Years” value from the “Total people, age” value. The lapply() function applies adult_population to each component (data frame) of the dhb_year list, and do.call(rbind, dhb_adult_list) combines the resulting list of data frames into a single data frame, DHBpopAdult, by row-binding them together. Lastly the values of dim(), head() and tail() are printed; as written, these calls are applied to DHBpopByAge rather than to DHBpopAdult.
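The split, apply and combine steps described above can be seen on a small scale in the sketch below. The data frame toy and its values are made up for illustration and only mimic the layout of DHBpopByAge.

```r
# Hypothetical toy data in the same layout as DHBpopByAge
toy <- data.frame(
  Area = rep(c("A", "B"), each = 4),
  Age = rep(c("Total people, age", "0-14 Years"), times = 4),
  Year.at.30.June = rep(c(2001, 2001, 2006, 2006), times = 2),
  Value = c(100, 20, 110, 25, 200, 50, 210, 55)
)

# Split: one small data frame per Area and year combination
pieces <- split(toy, list(toy$Area, toy$Year.at.30.June))

# Apply: adult population = total minus the 0-14 age group
adults <- lapply(pieces, function(df) {
  total <- df$Value[df$Age == "Total people, age"]
  children <- df$Value[df$Age == "0-14 Years"]
  data.frame(Area = df$Area[1],
             Year.at.30.June = df$Year.at.30.June[1],
             Value = total - children)
})

# Combine: stack the one-row results into a single data frame
toyAdult <- do.call(rbind, adults)
print(dim(toyAdult))
print(toyAdult)
```

Printing the dimensions of the combined result (here toyAdult) is a quick way to confirm that there is one row per area and year.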
Question 8: [Data Frames]
DHBpopAdult$Year <- DHBpopAdult$Year.at.30.June + 0.5
print(dim(DHBpopByAge))
## [1] 400 4
print(head(DHBpopByAge))
## Area Age Year.at.30.June Value
## 1 Northland Total people, age 1996 140700
## 2 Northland Total people, age 2001 144400
## 3 Northland Total people, age 2006 152700
## 4 Northland Total people, age 2013 164700
## 5 Northland Total people, age 2018 185800
## 6 Northland Total people, age 2019 189100
print(tail(DHBpopByAge))
## Area Age Year.at.30.June Value
## 415 Southern 0-14 Years 2018 58100
## 416 Southern 0-14 Years 2019 58200
## 417 Southern 0-14 Years 2020 58700
## 418 Southern 0-14 Years 2021 58500
## 419 Southern 0-14 Years 2022 58300
## 420 Southern 0-14 Years 2023 58400
A new column named Year is added to the DHBpopAdult data frame, equal to the values in the “Year.at.30.June” column plus 0.5. Then the dimensions (number of rows and columns) and the first few and last few rows of a data frame are printed using the dim(), head() and tail() functions; as written, these calls are again applied to DHBpopByAge.
Question 9: [Data Frames, Graphics]
# Subset the DHBpopAdult dataframe for Northland
northland <- subset(DHBpopAdult, Area == "Northland")
# Create a line plot
plot(northland$Year, northland$Value, type = "o",
lwd = 1,
pch = 1,
xlab = "Year", ylab = "Value")
The question asks for estimates of the populations at 1 January 2013, and to get them we will need to interpolate between the closest population estimates. In the code above, the subset() function keeps only the rows of DHBpopAdult where the “Area” column is equal to “Northland” and stores the result in a new data frame called northland. The plot() function then creates a line plot with northland$Year on the x-axis and northland$Value on the y-axis. It sets the plot type to “o” (points and lines), the line width to 1, the point character to 1, and labels the x-axis “Year” and the y-axis “Value”.
Question 10: [Data Frames, Graphics, Control Flow, Split-Apply-Combine]
colors <- c(1:8)
plot(DHBpopAdult$Year.at.30.June, DHBpopAdult$Value,
xlab = "", ylab = "")
for (i in 1:nlevels(factor(DHBpopAdult$Area))) {
subset_data <- DHBpopAdult[DHBpopAdult$Area == levels(factor(DHBpopAdult$Area))[i], ]
lines(subset_data$Year.at.30.June, subset_data$Value, col = colors[(i - 1) %% 8 + 1], type = "o")
}
The code chunk above produces a line plot of the population estimates for all DHBs. The colours come from the vector 1:8, which gives 8 distinct colours that are then recycled across the DHBs. The plot() function first draws DHBpopAdult$Year.at.30.June on the x-axis against DHBpopAdult$Value on the y-axis, with the x-axis and y-axis labels set to empty strings and no main title. Inside the loop, subset_data holds only the rows of DHBpopAdult where the “Area” column matches the level currently being iterated over. Lastly, the lines() function plots subset_data$Year.at.30.June against subset_data$Value with the specified colour and type “o” (points and lines).
Question 11: [Data Frames, Graphics, Control Flow, Split-Apply-Combine]
num_areas <- length(unique(DHBpopAdult$Area))
# generating the colours for the plot (unfortunately I could not replicate the model plot exactly)
colors <- rainbow(num_areas)
# Setting up the plot with the ranges for the x and y axes.
plot(DHBpopAdult$Year.at.30.June, DHBpopAdult$Value,
xlim = c(min(DHBpopAdult$Year.at.30.June), max(DHBpopAdult$Year.at.30.June) + 5),
ylim = range(DHBpopAdult$Value),
xlab = "", ylab = "")
# Looping through each unique area
for (i in unique(DHBpopAdult$Area)) {
# Subset the data for the current DHB
subset_data <- DHBpopAdult[DHBpopAdult$Area == i, ]
# Finding the index
color_index <- which(unique(DHBpopAdult$Area) == i)
# Drawing lines
lines(subset_data$Year.at.30.June, subset_data$Value, col = colors[color_index], type = "o")
text(max(subset_data$Year.at.30.June) + 0.5, tail(subset_data$Value, 1), labels = i, col = colors[color_index], cex = 0.7, pos = 4)
}
The code above produces a line plot of the population estimates for all DHBs, with text labels to the right of each line giving the DHB names. The length() function counts the unique areas in the DHBpopAdult$Area column, and rainbow() generates one colour per area. The plot() function then draws the points with the x-axis running from the minimum year to the maximum year plus 5, which leaves room for the labels, and the y-axis covering the range of values in the DHBpopAdult data frame. The x-axis and y-axis labels are set to empty strings. A for loop then iterates over each unique area: which() finds the colour index for the area, lines() adds a line for the subsetted data in that colour, and text() writes the DHB name just to the right of the last point.
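One possible variation, sketched below on made-up data, is to set up the axes with type = "n" so that nothing is drawn until the loop adds the lines. The data frame toy and the names areas and cols are illustrative only and are not part of the assignment data.

```r
# Hypothetical toy data: two areas observed in three years
toy <- data.frame(
  Area = rep(c("A", "B"), each = 3),
  Year = rep(c(2001, 2006, 2013), times = 2),
  Value = c(80, 85, 95, 150, 155, 170)
)
areas <- as.character(unique(toy$Area))
cols <- rainbow(length(areas))

# type = "n" sets up the axes without drawing any points;
# the extra room on the x-axis leaves space for the labels
plot(toy$Year, toy$Value, type = "n",
     xlim = c(min(toy$Year), max(toy$Year) + 5),
     xlab = "", ylab = "")

# seq_along() gives the colour index directly, so no which() lookup is needed
for (i in seq_along(areas)) {
  d <- toy[toy$Area == areas[i], ]
  lines(d$Year, d$Value, col = cols[i], type = "o")
  text(max(d$Year), tail(d$Value, 1), labels = areas[i],
       col = cols[i], cex = 0.7, pos = 4)
}
```

Starting from an empty plot avoids drawing every point twice, once by plot() and once by lines().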
Question 12: [Functions]
interp <- function(df) {
# order the data frame by Year
df <- df[order(df$Year),]
# last row on or before 2006.5
before <- df[df$Year <= 2006.5,]
before <- tail(before, 1)
# first row on or after 2013.5
after <- df[df$Year >= 2013.5,]
after <- head(after, 1)
# check that both rows needed for the interpolation exist
if (nrow(before) == 0 || nrow(after) == 0) {
stop("Could not find rows to interpolate between.")
}
# slope of the line between the two points
slope <- (after$Value - before$Value) / (after$Year - before$Year)
# interpolate the value at 2013 from the earlier point and the slope
interpolated_value <- before$Value + slope * (2013 - before$Year)
return(interpolated_value)
}
# apply the function to Northland, as the question asks
df <- subset(DHBpopAdult, Area == "Northland")
interpolated_value <- interp(df)
print(interpolated_value)
## [1] 127521.4
A function named interp is created to handle the interpolation of the values within the data. It orders the data frame by Year so the rows are in ascending order, picks the last row on or before 2006.5 and the first row on or after 2013.5, calculates the slope between these two points, and uses that slope to estimate the value at 2013, which corresponds to 1 January 2013. The function is then applied to the Northland rows of DHBpopAdult, as the question asks, giving an estimate of 127521.4.
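The same kind of linear interpolation can also be done with the base R function approx(). The sketch below uses made-up year and value vectors, not the DHB data, and compares approx() with the slope calculation done by hand.

```r
# Hypothetical mid-year estimates either side of 1 January 2013
year <- c(2001.5, 2006.5, 2013.5, 2018.5)
value <- c(90, 100, 170, 200)

# approx() interpolates linearly between the neighbouring points
est <- approx(x = year, y = value, xout = 2013)$y

# The same calculation by hand, using the slope between 2006.5 and 2013.5
slope <- (170 - 100) / (2013.5 - 2006.5)
by_hand <- 100 + slope * (2013 - 2006.5)

print(est)
print(all.equal(est, by_hand))
```

approx() always uses the two points that surround xout, so it agrees with interp only when 2006.5 and 2013.5 are neighbouring years in the data, as they are in the output shown earlier.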
Question 13: [Data Frames, Split-Apply-Combine]
# The interp function from the previous question, repeated here
interp <- function(df) {
df <- df[order(df$Year),]
before <- df[df$Year <= 2006.5,]
before <- tail(before, 1)
after <- df[df$Year >= 2013.5,]
after <- head(after, 1)
if (nrow(before) == 0 || nrow(after) == 0) {
stop("Could not find rows to interpolate between.")
}
slope <- (after$Value - before$Value) / (after$Year - before$Year)
interpolated_value <- before$Value + slope * (2013 - before$Year)
return(interpolated_value)
}
# Split DHBpopAdult data frame into a list of data frames for each DHB
split_data <- split(DHBpopAdult, DHBpopAdult$Area)
# Use sapply to apply interp function to each DHB data frame to get the interpolated values for 2013
DHBpop <- sapply(split_data, interp)
# Print the resulting vector of interpolated population estimates
print(DHBpop)
## Auckland Bay of Plenty Canterbury Capital and Coast
## 375400.00 168471.43 408021.43 237592.86
## Counties Manukau Hawke's Bay Hutt Valley Lakes
## 374400.00 122964.29 112278.57 79707.14
## MidCentral Nelson Marlborough Northland South Canterbury
## 134228.57 114642.86 127521.43 46821.43
## Southern Tairawhiti Taranaki Waikato
## 249464.29 35207.14 89314.29 294150.00
## Wairarapa Waitemata West Coast Whanganui
## 33807.14 436664.29 26621.43 49350.00
The code chunk above reuses the interp function from the previous question, which orders the data by year, finds the rows either side of 2013 (at 2006.5 and 2013.5), stops with an error if either row is missing, and interpolates using the slope. DHBpopAdult is split into a list with one data frame per DHB, the interp function is applied to each of them using the sapply() function, and lastly the resulting named vector of estimates is printed.
Question 14: [Basics]
DHBsmoking$pop <- NA
for(i in seq_along(DHBsmoking$region)) {
region <- DHBsmoking$region[i]
# Check if the region is in the names of DHBpop
if(region %in% names(DHBpop)) {
DHBsmoking$pop[i] <- DHBpop[region]
}
}
print(DHBsmoking)
## region prevalence island pop
## 1 Northland 7.347300 North 127521.43
## 2 Waitematā 5.719723 North NA
## 3 Auckland 2.826581 North 375400.00
## 4 Counties Manukau 25.888687 North 374400.00
## 5 Waikato 7.189630 North 294150.00
## 6 Bay of Plenty 14.254215 North 168471.43
## 7 Tairāwhiti 14.162008 North NA
## 8 Lakes 3.523784 North 79707.14
## 9 Taranaki 18.193418 North 89314.29
## 10 Hawke's Bay 26.513553 North 122964.29
## 11 Whanganui 5.994493 North 49350.00
## 12 MidCentral 29.325255 North 134228.57
## 13 Capital & Coast 4.782368 North NA
## 14 Hutt Valley 28.042841 North 112278.57
## 15 Wairarapa 1.373304 North 33807.14
## 16 Nelson Marlborough 2.959553 South 114642.86
## 17 West Coast 14.840849 South 26621.43
## 18 Canterbury 6.855324 South 408021.43
## 19 South Canterbury 26.154387 South 46821.43
## 20 Southern 9.336404 South 249464.29
A new column, pop, is first added to the DHBsmoking data frame and filled with NA values. The for loop then iterates over each element of the region column of DHBsmoking using the seq_along() function. For each region, the if() statement checks whether the region name appears among the names of the DHBpop vector. If it does, the corresponding population is assigned to the “pop” column of the DHBsmoking data frame. Lastly the updated data frame is printed. Three regions (Waitematā, Tairāwhiti and Capital & Coast) keep NA because their names are spelled differently in DHBpop.
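The loop above can also be written as a single vectorised lookup. The sketch below is one way to do it with match(), using three values copied from the output above; DHBpop_toy and smoking_toy are stand-in names used only here.

```r
# Values copied from the output above; the names differ for Waitematā
DHBpop_toy <- c(Northland = 127521.43, Waitemata = 436664.29, Auckland = 375400.00)
smoking_toy <- data.frame(region = c("Northland", "Waitematā", "Auckland"))

# match() returns the position of each region in names(DHBpop_toy), or NA
idx <- match(smoking_toy$region, names(DHBpop_toy))

# Indexing with NA gives NA, so unmatched regions stay missing
smoking_toy$pop <- unname(DHBpop_toy[idx])
print(smoking_toy)
```

As with the loop, a region whose name is spelled differently in the two objects ends up with NA.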
Question 15: [Basics]
names(DHBpop) <- gsub("&", "and", names(DHBpop))
names(DHBpop) <- gsub("Waitemata", "Waitematā", names(DHBpop))
names(DHBpop) <- gsub("Tairawhiti", "Tairāwhiti", names(DHBpop))
names(DHBpop) <- gsub("Manukau", "Manukau", names(DHBpop)) # Add other replacements as needed
print(DHBpop)
## Auckland Bay of Plenty Canterbury Capital and Coast
## 375400.00 168471.43 408021.43 237592.86
## Counties Manukau Hawke's Bay Hutt Valley Lakes
## 374400.00 122964.29 112278.57 79707.14
## MidCentral Nelson Marlborough Northland South Canterbury
## 134228.57 114642.86 127521.43 46821.43
## Southern Tairāwhiti Taranaki Waikato
## 249464.29 35207.14 89314.29 294150.00
## Wairarapa Waitematā West Coast Whanganui
## 33807.14 436664.29 26621.43 49350.00
print(DHBsmoking)
## region prevalence island pop
## 1 Northland 7.347300 North 127521.43
## 2 Waitematā 5.719723 North NA
## 3 Auckland 2.826581 North 375400.00
## 4 Counties Manukau 25.888687 North 374400.00
## 5 Waikato 7.189630 North 294150.00
## 6 Bay of Plenty 14.254215 North 168471.43
## 7 Tairāwhiti 14.162008 North NA
## 8 Lakes 3.523784 North 79707.14
## 9 Taranaki 18.193418 North 89314.29
## 10 Hawke's Bay 26.513553 North 122964.29
## 11 Whanganui 5.994493 North 49350.00
## 12 MidCentral 29.325255 North 134228.57
## 13 Capital & Coast 4.782368 North NA
## 14 Hutt Valley 28.042841 North 112278.57
## 15 Wairarapa 1.373304 North 33807.14
## 16 Nelson Marlborough 2.959553 South 114642.86
## 17 West Coast 14.840849 South 26621.43
## 18 Canterbury 6.855324 South 408021.43
## 19 South Canterbury 26.154387 South 46821.43
## 20 Southern 9.336404 South 249464.29
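The code above is supposed to make the names of the DHBpop vector consistent with the region names in DHBsmoking using the gsub() function. The first call replaces “&” with “and”, but the names in DHBpop already use “and” (for example “Capital and Coast”), so it changes nothing, and DHBsmoking still spells that region “Capital & Coast”. The next two calls add the macrons to “Waitematā” and “Tairāwhiti”, as the first printed output shows. Lastly both objects are printed: the population values remain the same, and because the pop column of DHBsmoking is not filled in again after the renaming, the three unmatched regions still show NA.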
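A minimal sketch of how the names could be brought into line and the lookup repeated is given below. It assumes the goal is to match the spelling in the region column, and it uses only the three affected values copied from the DHBpop output above; pop_toy and regions are illustrative names.

```r
# Values copied from the DHBpop output above (names as originally produced)
pop_toy <- c("Capital and Coast" = 237592.86, "Tairawhiti" = 35207.14,
             "Waitemata" = 436664.29)
regions <- c("Waitematā", "Tairāwhiti", "Capital & Coast")

# Rename towards the spelling used in the region column
names(pop_toy) <- gsub(" and ", " & ", names(pop_toy), fixed = TRUE)
names(pop_toy) <- gsub("Waitemata", "Waitematā", names(pop_toy), fixed = TRUE)
names(pop_toy) <- gsub("Tairawhiti", "Tairāwhiti", names(pop_toy), fixed = TRUE)

# Redo the lookup after renaming so the new names are actually used
pop <- unname(pop_toy[regions])
print(data.frame(region = regions, pop = pop))
```

The important detail is the order of operations: the renaming only helps if the population lookup is run again afterwards.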
Question 16: [Basics]
DHBsmoking$prevalence <- as.numeric(DHBsmoking$prevalence)
DHBsmoking$pop <- as.numeric(DHBsmoking$pop)
# Add the 'smokers' column by multiplying 'prevalence' by 'pop' and dividing by 100
DHBsmoking$smokers <- (DHBsmoking$prevalence / 100) * DHBsmoking$pop
# Print the updated data frame
print(DHBsmoking)
## region prevalence island pop smokers
## 1 Northland 7.347300 North 127521.43 9369.3822
## 2 Waitematā 5.719723 North NA NA
## 3 Auckland 2.826581 North 375400.00 10610.9858
## 4 Counties Manukau 25.888687 North 374400.00 96927.2438
## 5 Waikato 7.189630 North 294150.00 21148.2974
## 6 Bay of Plenty 14.254215 North 168471.43 24014.2800
## 7 Tairāwhiti 14.162008 North NA NA
## 8 Lakes 3.523784 North 79707.14 2808.7077
## 9 Taranaki 18.193418 North 89314.29 16249.3217
## 10 Hawke's Bay 26.513553 North 122964.29 32602.2015
## 11 Whanganui 5.994493 North 49350.00 2958.2825
## 12 MidCentral 29.325255 North 134228.57 39362.8706
## 13 Capital & Coast 4.782368 North NA NA
## 14 Hutt Valley 28.042841 North 112278.57 31486.1013
## 15 Wairarapa 1.373304 North 33807.14 464.2749
## 16 Nelson Marlborough 2.959553 South 114642.86 3392.9157
## 17 West Coast 14.840849 South 26621.43 3950.8461
## 18 Canterbury 6.855324 South 408021.43 27971.1893
## 19 South Canterbury 26.154387 South 46821.43 12245.8575
## 20 Southern 9.336404 South 249464.29 23290.9931
The code chunk above calculates an estimated number of smokers for each DHB. The as.numeric() function first coerces the prevalence and pop columns of the DHBsmoking data frame to numeric values. The number of smokers in each region is then calculated by multiplying the prevalence of smoking (as a percentage) by the population and dividing by 100, and lastly the data frame is printed. Regions with a missing pop also have a missing smokers value.
Question 17: [Data Frames, Split-Apply-Combine]
# Use aggregate to sum the 'pop' and 'smokers' for each 'island'
NZsmoking <- aggregate(cbind(pop, smokers) ~ island, data = DHBsmoking, sum)
# Calculate the prevalence for each island
NZsmoking$prevalence <- NZsmoking$smokers / NZsmoking$pop
# Print the resulting data frame
print(NZsmoking)
## island pop smokers prevalence
## 1 North 1961592.9 288001.9 0.14682045
## 2 South 845571.4 70851.8 0.08379162
The code above calculates an overall prevalence of smoking for the North and South Islands. The aggregate() function groups the DHBsmoking data frame by the “island” column and sums pop and smokers within each island. With the formula interface, rows containing NA are dropped by default, so the three North Island DHBs without a population are left out of the totals. The prevalence of smoking for each island is then calculated in the NZsmoking data frame by dividing the total number of smokers by the total population, which gives a proportion rather than a percentage.
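The effect of missing values on the formula interface of aggregate() can be checked on a small made-up example. The data frame toy below is hypothetical and only imitates the island, pop and smokers columns.

```r
# Hypothetical toy data with one missing population
toy <- data.frame(
  island = c("North", "North", "South"),
  pop = c(1000, NA, 500),
  smokers = c(150, NA, 50)
)

# Formula interface: the row containing NA is dropped before summing
by_formula <- aggregate(cbind(pop, smokers) ~ island, data = toy, sum)

# Prevalence as a proportion: total smokers / total population
by_formula$prevalence <- by_formula$smokers / by_formula$pop
print(by_formula)
```

The North total here is based on the one complete row only, which mirrors how the incomplete DHB rows are excluded from the island totals.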
Question 18: [Basics]
North_Island_Prevalence <- 0.1816880 # about 18.2%
South_Island_Prevalence <- 0.1668823 # about 16.7%
North_Island_Population <- 2671057.1
South_Island_Population <- 845571.4
NZprev <- (North_Island_Prevalence * North_Island_Population +
South_Island_Prevalence * South_Island_Population) /
(North_Island_Population + South_Island_Population)
print(NZprev)
## [1] 0.178128
In this question we are estimating the overall prevalence of smoking in New Zealand. The prevalence and population of each island are assigned to variables by hand. Note that these values differ from the NZsmoking output printed above: the North Island population used here appears to include the three DHB populations that are NA in that table. The overall prevalence of smoking in New Zealand is calculated by taking a weighted average of the prevalence values for the North Island and South Island, weighted by their respective populations, and lastly the NZprev variable, which is the outcome of the calculation, is printed.
Question 19: [Statistical Functions]
NZprev <- c(NZprev)
genPrev <- function(prevalence, sample_size, num_samples) {
# Use rbinom to generate the number of smokers for each sample
num_smokers <- rbinom(num_samples, sample_size, prevalence)
# Calculate the proportion of smokers for each sample
proportions <- num_smokers / sample_size
return(proportions)
}
set.seed(135)
genPrev(NZprev, 7500, 10)
## [1] 0.1818667 0.1785333 0.1806667 0.1716000 0.1849333 0.1820000 0.1797333
## [8] 0.1781333 0.1774667 0.1786667
This question asks for the proportion of smokers in each of a specified number of samples. The genPrev function uses the rbinom() function to generate the number of smokers for each sample, based on a binomial distribution with parameters sample_size and prevalence, and the following line calculates the proportion of smokers for each sample by dividing the number of smokers by the sample size. The seed is set to 135 for reproducibility before generating 10 samples of 7,500 people.
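One way to sanity-check a generator like this is to compare the spread of the simulated proportions with the usual standard error of a sample proportion, sqrt(p * (1 - p) / n). The sketch below repeats a compact version of genPrev so that it runs on its own; the names sims and se_theory are introduced here for illustration.

```r
# Compact version of the generator above, repeated so this sketch runs on its own
genPrev <- function(prevalence, sample_size, num_samples) {
  rbinom(num_samples, sample_size, prevalence) / sample_size
}

set.seed(135)
p <- 0.178128   # NZprev as printed above
n <- 7500
sims <- genPrev(p, n, 10000)

# Theoretical standard error of a sample proportion
se_theory <- sqrt(p * (1 - p) / n)

print(c(simulated = sd(sims), theoretical = se_theory))
```

The two numbers should be close, which suggests the simulated proportions vary about as much as binomial theory predicts.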
Question 20: [Basics]
set.seed(135) # Setting a seed for reproducibility
prevalence_north <- 0.1816880
prevalence_south <- 0.1668823
# The sample prevalences for both islands
northIslandPrev <- genPrev(prevalence_north, 7500, 10000)
southIslandPrev <- genPrev(prevalence_south, 7500, 10000)
# sampleDiff vector
sampleDiff <- northIslandPrev - southIslandPrev
# Output the length and head of the sampleDiff vector
length(sampleDiff)
## [1] 10000
head(sampleDiff)
## [1] 0.017066667 0.011333333 0.017600000 0.003866667 0.029200000 0.018933333
The prevalence rates for the North Island (prevalence_north) and the South Island (prevalence_south) are entered as the proportion of smokers in the population of each island. genPrev() then generates 10,000 simulated samples of 7,500 people for each island, and the sampleDiff vector holds the difference between the North Island and South Island sample prevalence for each pair of samples.
Question 21:
# sampleDiff is the vector of simulated differences from Question 20
obsdiff <- prevalence_north - prevalence_south # observed difference between the islands
# Plot the density
plot(density(sampleDiff), main="Density Plot of sampleDiff", xlab="Difference", ylab="Density")
# Add a vertical line for the observed difference
abline(v=obsdiff, col="red", lwd=2)
The code above produces a plot of the density of the sampleDiff values, plus a vertical line showing the difference between the North Island and South Island prevalence that we observed. This is done by calculating the difference, drawing the density plot with plot(density(sampleDiff)), and adding the vertical line with abline().
Question 22: [Basics]
# Replace with your actual observed difference
obsDiff <- 0.001
# Calculating the proportion of sampleDiff values that are larger than the observed difference
proportion_larger <- sum(sampleDiff > obsDiff) / length(sampleDiff)
# Print the result
print(proportion_larger)
## [1] 0.9891
The code above calculates the proportion of sampleDiff values that are larger than the observed difference. Here obsDiff is set to a placeholder value of 0.001, not to the difference of about 0.0148 between prevalence_north and prevalence_south used in Question 21. The comparison sampleDiff > obsDiff is counted with sum() and divided by the number of samples, and the resulting proportion, 0.9891, is printed. This proportion is read as the likelihood of observing a difference as extreme as or more extreme than the observed difference, under the assumption of no true difference between the groups. Since 0.9891 is far above the 0.05 level of significance, we cannot reject the null hypothesis: on this calculation the observed difference is not statistically significant and is likely to occur by chance if there is no real difference between the two islands. This reading depends on the placeholder value of obsDiff, and on sampleDiff having been generated with different prevalence rates for the two islands in Question 20.
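The proportion calculated above can also be written with mean(), because the mean of a logical vector is the share of TRUE values. The small sketch below uses made-up numbers; diffs and threshold are hypothetical stand-ins for sampleDiff and obsDiff.

```r
# Hypothetical simulated differences and threshold
diffs <- c(-0.002, 0.004, 0.010, 0.015, 0.020)
threshold <- 0.005

# The mean of a logical vector is the proportion of TRUE values
prop_a <- mean(diffs > threshold)

# Equivalent to counting the TRUE values and dividing by the length
prop_b <- sum(diffs > threshold) / length(diffs)

print(c(prop_a, prop_b))
```

Both forms give the same answer, so the choice between them is a matter of readability.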
This project was an opportunity to put into practice the different functions and code used in the first six weeks of this course. In this assignment, we assessed the differences in New Zealand’s smoking prevalence by region using the two provided data sets. There were multiple variables to take into account, such as the population of each island, the different prevalences, and ages. The differences between the North and South Islands were calculated and visualised through several graphs. All these calculations and visualisations are coded and documented above, leading to a final simulation based on the binomial distribution, which produces sample proportions to test whether the difference between the groups is real or whether the null hypothesis holds. The output in Question 22 suggests that the null hypothesis cannot be rejected: over 95% of the simulated differences between the North and South Island prevalence rates were larger than the observed difference used there, meaning there is no significant difference between the smoking prevalence rates of the North and South Islands at the 95% level.