We will begin by loading the tidyverse library and reading in the CSV file for the dataset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df_main <- read.csv("climate_change_dataset.csv")
We will set a random seed. If we do not, the subsamples will differ every time we draw them, leading to different results; setting the seed keeps our analysis reproducible on a subsequent run. We will gather 7 subsamples, each containing 30% of the rows, so that we can analyze what happens when we take a smaller amount of the data a larger number of times.
#set a random seed. Can be used as key to reproduce results
set.seed(945)
#set subsample percentage here. Keep it low at 30% but take several samples to see how that affects the results
sample_percentage <- 0.30
#we will collect 7 samples
sample_cnt <- 7
#construct an empty tibble to hold the combined subsamples
df_samples <- tibble()
#create a for loop that collects 7 samples from the main dataframe, stores each in a temporary dataframe, then appends it to the combined dataframe we created earlier.
for (int_i in 1:sample_cnt) {
  df_temp <- df_main |>
    sample_n(size = sample_percentage * nrow(df_main), replace = TRUE) |>
    mutate(sample_cnt = int_i)
  df_samples <- bind_rows(df_samples, df_temp)
}
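As a quick sanity check on the loop, we can count the rows in each subsample to confirm every pass drew 30% of the original rows; a minimal sketch using dplyr's count():
#count rows per subsample number
df_samples |>
  count(sample_cnt)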
#debug code
names(df_main)
## [1] "Year" "Country"
## [3] "Avg.Temperature...C." "CO2.Emissions..Tons.Capita."
## [5] "Sea.Level.Rise..mm." "Rainfall..mm."
## [7] "Population" "Renewable.Energy...."
## [9] "Extreme.Weather.Events" "Forest.Area...."
In this section we group the subsamples by subsample number and compute the mean of selected variables within each subsample. We then output this data to analyze.
Insight: As the table shows, the sample means differ on every single instance of data collection. This is a critical observation and a very good reason not to draw conclusions based only on summary statistics from a single sample of the raw dataset. It is important to note, however, that the mean values per instance are not too different from one another.
#group combined subsample data by subsample number
df_samples |>
  group_by(sample_cnt) |>
  #compute mean values for selected variables within each subsample
  summarise(
    mean_temp = mean(`Avg.Temperature...C.`),
    mean_co2 = mean(`CO2.Emissions..Tons.Capita.`),
    mean_sea = mean(`Sea.Level.Rise..mm.`),
    mean_rain = mean(`Rainfall..mm.`),
    #drop the grouping structure so the result is a plain ungrouped tibble
    .groups = "drop"
  )
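To put the subsample means in context, a minimal sketch that computes the same means over the full dataset (assuming the same column names as above) so the two can be compared:
#means over the entire dataset for comparison with the per-subsample means
df_main |>
  summarise(
    mean_temp = mean(`Avg.Temperature...C.`),
    mean_co2 = mean(`CO2.Emissions..Tons.Capita.`),
    mean_sea = mean(`Sea.Level.Rise..mm.`),
    mean_rain = mean(`Rainfall..mm.`)
  )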
Anomalies:
Similar to the previous section, we will use the group_by() function to find the maximum and minimum values of temperature and compare them to the extreme values of CO2.
Insights: We can see that the max values returned for the CO2 column are all vastly different than the ones returned by the max function when applied to temperature. A number like 0.6 from the min CO2 column might be anomalous for temperature but is completely ordinary in the CO2 column.
#group data by subsample
df_samples |>
  group_by(sample_cnt) |>
  #compute extreme values of the selected variables within each subsample
  summarise(
    max_temp = max(`Avg.Temperature...C.`),
    min_temp = min(`Avg.Temperature...C.`),
    max_co2 = max(`CO2.Emissions..Tons.Capita.`),
    min_co2 = min(`CO2.Emissions..Tons.Capita.`),
    #drop the grouping structure from the result
    .groups = "drop"
  )
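One way to make that scale difference concrete is to see where a value such as 0.6 falls in each column's distribution; a rough sketch using quantiles on the full dataset:
#5th and 95th percentiles of temperature and CO2, so a single value can be judged on the right scale
df_main |>
  summarise(
    temp_q05 = quantile(`Avg.Temperature...C.`, 0.05),
    temp_q95 = quantile(`Avg.Temperature...C.`, 0.95),
    co2_q05 = quantile(`CO2.Emissions..Tons.Capita.`, 0.05),
    co2_q95 = quantile(`CO2.Emissions..Tons.Capita.`, 0.95)
  )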
How does increasing the relative subsample size affect the results?
We started out with 30% of the data per collection. It is reasonable to expect that with a larger sample size the subsample means would be more consistent: each collection instance draws more rows, so each subsample mean is based on more data and the means cluster more and more tightly around the true mean of the main dataset. From this we can conclude that larger samples provide more stable statistics, and the conclusions we draw from them will be more accurate and representative of the overall dataset.
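To see this empirically, here is a small sketch (using a hypothetical helper, sd_of_means(), not part of the analysis above) that repeats the sampling at several percentages and measures how much the subsample mean temperatures spread out:
#hypothetical helper: for a given subsample percentage, draw 7 subsamples and return the standard deviation of their mean temperatures
sd_of_means <- function(pct, n_samples = 7) {
  means <- replicate(n_samples, {
    df_main |>
      sample_n(size = round(pct * nrow(df_main)), replace = TRUE) |>
      summarise(m = mean(`Avg.Temperature...C.`)) |>
      pull(m)
  })
  sd(means)
}
#the spread of the subsample means should shrink as the percentage grows
sapply(c(0.30, 0.60, 0.90), sd_of_means)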
How this affects how I will draw conclusions in the future:
I rarely draw conclusions from limited data, but we can clearly see that the more data we sample, the more representative it becomes of the true values in the actual dataset. Obviously we can never collect all possible data on a topic, which is why it is so important to sample correctly and to draw our conclusions from those samples carefully.