library(kableExtra)
library(magrittr)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(ggplot2)
library(forcats)
Essential R Packages:
| Student name | Student number | Percentage of contribution |
|---|---|---|
| HEMANTH RANGASWAMY | s4069811 | 50 |
| RAHUL RAMESH | s4029852 | 50 |
In this assignment, we carried out several preprocessing steps to clean, integrate, and transform the data for analysis. Here’s what we did:
Data Cleaning: We began by reading two CSV files, “city_temperature.csv” and “underwater_temperature.csv”, into separate data frames named RCB1 and RCB2. We then renamed the temperature columns in both datasets to “Temperature” for uniformity.
Data Integration: We combined the two data frames (RCB1 and RCB2) based on the “Temperature” column, creating a new dataset called “combined_dataset”. This step brought together temperature data from city and underwater sources into one dataset.
Data Transformation: Using the pivot_wider function from the tidyr package, we reshaped the “combined_dataset” by spreading the “Depth” column into separate columns based on unique values in the “Region” column. This made it easier to analyze depth-related data. We added a new column called “Equator” by subtracting the “Latitude” from the “Longitude”. This new feature might help in studying temperature patterns based on distance from the equator.
Handling Missing Values: We replaced certain values in the “Temperature” column with NA (missing values) using the mutate function from the dplyr package. The specific conditions or indices for these replacements were not detailed.
Outlier Detection and Capping: We created two custom functions,find_outliers and cap_outliers , to identify and limit outliers in numerical columns based on the interquartile range (IQR) method.
Data Transformation: We introduced a new column called “log_transformed_var” by applying a logarithmic transformation (log1p) to the “Equator” column. This transformation helps balance data distributions for certain analyses.
Throughout these preprocessing steps, we used various functions and packages from the tidyverse ecosystem, such as dplyr,tidyr, and ggplot2, to effectively manipulate, reshape, and visualize the data.
RCB1<- read.csv("C:/Users/Hemanth Gowda/Downloads/city_temperature.csv")
RCB2<- read.csv("C:/Users/Hemanth Gowda/Downloads/underwater_temperature.csv")
About Kaggle
Kaggle is a popular site for data scientists and machine learning enthusiasts. It offers useful resources like datasets, contests, and learning materials.
Key Features of Kaggle
Datasets: Kaggle hosts a large collection of datasets on various topics. Users can upload their datasets or download datasets shared by others.
Competitions: Kaggle runs machine learning competitions with cash prizes, where participants can solve challenging problems.
Notebooks: Kaggle provides a Jupyter-like environment for coding, known as Notebooks, which are useful for data analysis and sharing code.
Discussion Forums: Users can ask questions, share knowledge, and collaborate on projects in Kaggle’s discussion forums.
Courses: Kaggle offers free courses on data science and machine learning to help users learn and improve their skills.
Why We Preferred Kaggle?
For our assignment, Kaggle was extremely helpful. We obtained the city_temperature.csv and underwater_temperature.csv datasets from Kaggle. Here’s how it helped us:
Easy Access: The datasets were easy to find and download, saving us time.
High-Quality Data: Kaggle provides a wide variety of well-documented datasets, making it easier to find what we needed.
city_temperature.csv
This dataset contains temperature data for various cities across different regions, countries, and states. It has a total of 2,906,327 entries. The columns in this dataset are as follows:
Region: The general geographic area (e.g., Africa).
Country: The country where the city is located (e.g., Algeria).
State: The state or province within the country (mixed types, might contain NaN values).
City: The name of the city (e.g., Algiers).
Month: The month of the temperature reading.
Day: The day of the month.
Year: The year of the temperature reading.
AvgTemperature: The average temperature recorded for that day in the specified city.
This dataset is useful for analyzing temperature trends, studying climate patterns, or training machine learning models for weather prediction.
underwater_temperature.csv
This dataset contains temperature observations made at various underwater sites or locations. It has a total of 408,638 entries. The columns in this dataset are as follows:
This dataset is useful for studying underwater temperature patterns, analyzing the impact of depth on temperature, or modeling temperature changes in aquatic environments.
RCB1<- read.csv("C:/Users/Hemanth Gowda/Downloads/city_temperature.csv")
RCB2<- read.csv("C:/Users/Hemanth Gowda/Downloads/underwater_temperature.csv")
RCB2 <- RCB2 %>%
rename(
Temperature = Temp
)
RCB1 <- RCB1 %>%
rename(
Temperature = AvgTemperature
)
head(RCB1)
head(RCB2)
str(RCB1)
## 'data.frame': 2906327 obs. of 8 variables:
## $ Region : chr "Africa" "Africa" "Africa" "Africa" ...
## $ Country : chr "Algeria" "Algeria" "Algeria" "Algeria" ...
## $ State : chr "" "" "" "" ...
## $ City : chr "Algiers" "Algiers" "Algiers" "Algiers" ...
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ Temperature: num 64.2 49.4 48.8 46.4 47.9 48.7 48.9 49.1 49 51.9 ...
str(RCB2)
## 'data.frame': 408638 obs. of 8 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Site : chr "Ilha Deserta" "Ilha Deserta" "Ilha Deserta" "Ilha Deserta" ...
## $ Latitude : num 27.3 27.3 27.3 27.3 27.3 ...
## $ Longitude : num 48.3 48.3 48.3 48.3 48.3 ...
## $ Date : chr "20-02-2013" "20-02-2013" "20-02-2013" "20-02-2013" ...
## $ Time : chr "11:40:02" "12:00:03" "12:20:04" "12:40:05" ...
## $ Temperature: num 24.4 24.4 24.5 24.4 24.4 ...
## $ Depth : num 12 12 12 12 12 12 12 12 12 12 ...
# Merging the datasets RCB1 and RCB2 by Temperature
combined_dataset <- merge(RCB1, RCB2, by = "Temperature")
# Conversion of a variable along with the labeling of depth with the display of merged data
combined_dataset$Depth <- as.factor(combined_dataset$Depth)
combined_dataset$Depth <- as.character(combined_dataset$Depth)
combined_dataset$Depth <- as.factor(combined_dataset$Depth)
combined_dataset$Depth <- factor(combined_dataset$Depth, levels = c("5", "12", "22", "22.1"))
head(combined_dataset)
Data Understanding and Preparation of Data
We read two CSV files into R data frames. The first file is “city_temperature.csv” which we store in RCB1. The second file is “underwater_temperature.csv” which we store in RCB2.
We rename the “Temp” column in RCB2 to “Temperature” and the “AvgTemperature” column in RCB1 to “Temperature” so they match for merging.
We look at the first few rows of each data frame to get an idea of the data using the head function.
We check the structure of each data frame with the str function to see the data types and number of observations.
Understanding the Structure
When we look at the structure of a data frame, we see each column, holds different types of data.
Region: This column is of type ‘chr’ (character), meaning it contains character strings. These strings represent different geographical regions, like “Africa”.
Country: Similar to the Region column, this one is also ‘chr’ and contains character strings representing country names, such as “Algeria”.
State: This column, also of type ‘chr’, holds character strings representing states or provinces within a country.
City: The City column contains character strings representing city names, like “Algiers”.
Month: This column is of type ‘int’ (integer). It holds integer values that represent the month of the year, where “1” could indicate January.
Day: Another integer column, Day contains integer values representing the day of the month.
Year: The Year column is of type ‘int’ as well, holding integer values that represent the year, such as “1995”.
Merging and Manipulation of Data
Temperature: This column is ‘num’ (numeric), meaning it contains numeric values representing temperature measurements, like “64.2”.
We merge the two data frames on the “Temperature” column using the merge function. The merged data frame is called combine_dataset.
We then handle the “Depth” variable in the merged data. We convert “Depth” to a factor, then to a character, and back to a factor again to make sure it’s set up correctly.
We set the order of levels for the “Depth” factor to “5”, “12”, “22”, and “22.1”. This ensures that when we analyze this factor, it follows this specific order.
Finally, we display the first few rows of the combined_dataset, which now includes data from both original datasets, with the “Depth” variable ordered as we specified.
# This is the R chunk for the Tidy & Manipulate Data set
organized_data <- pivot_wider(
combined_dataset,
names_from = Region,
values_from = Depth
)
## Warning: Values from `Depth` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
## {data} |>
## dplyr::summarise(n = dplyr::n(), .by = c(Temperature, Country, State, City,
## Month, Day, Year, ID, Site, Latitude, Longitude, Date, Time, Region)) |>
## dplyr::filter(n > 1L)
organized_data <- organized_data %>%
mutate(Equator = Longitude-Latitude)
organized_data$Temperature[c( )] <- NA
head(organized_data)
We first use the pivot_wider function from the tidyr package to change the shape of the combined_dataset data frame. We’re transforming it so that unique values in the “Region” column become new column names, and values in the “Depth” column spread out into these new columns.
After reshaping the data, we calculate a new column called “Equator” by subtracting the “Latitude” from the “Longitude”. This new column helps us understand the distance from the equator, which can be important for studying temperature patterns.
Next, we modify the “Temperature” column using the mutate function from the dplyr package. We replace certain values in the “Temperature” column with NA (missing values), but the specific conditions or indices for replacement are not provided in this code snippet.
Finally, we use the head function to show the first few rows of the organized_data data frame. This data frame now has a new structure with separate columns for each depth and the newly added “Equator” column.
# Scan dataset
null_entries <- sapply(organized_data, function(x) sum(is.na(x)))
View(null_entries)
organized_data <- organized_data %>%
mutate(Longitude = ifelse(Longitude < -180 | Longitude > 180, NA, Longitude),
Latitude = ifelse(Latitude < -90 | Latitude > 90,NA,Latitude))
head(organized_data)
boxplot(organized_data$Equator, main = "Boxplot of Values", xlab="Equator", ylab = "Value", col="lightblue")
boxplot(organized_data$Longitude, main = "Boxplot of Values", xlab="Longitude", ylab = "Value", col="lightblue")
boxplot(organized_data$Latitude, main = "Boxplot of Values", xlab="Latitude", ylab = "Value", col="lightblue")
find_outliers <- function(x) {
R1 <- quantile(x, 0.25, na.rm = TRUE)
R3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- R3 - R1
lower_bound <- R1 - 1.5 * IQR
upper_bound <- R3 + 1.5 * IQR
return(x < lower_bound | x >upper_bound)
}
cap_outliers <- function(x) {
R1 <- quantile(x, 0.25, na.rm = TRUE)
R3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- R3 - R1
lower_bound <- R1 - 1.5 * IQR
upper_bound <- R3 + 1.5 * IQR
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
By using the sapply function to apply a custom function to each column of the organized_data data frame. This function counts the number of missing values (NA) in each column, and the results are stored in the null_entries object. We then use the View function to display the null_entries object in a separate window.
Next, we use the mutate function from the dplyr package to create new columns in the organized_data data frame. We check if the values in the “Longitude” column are outside the valid range of -180 to 180 degrees, and if so, we replace them with NA using the ifelse function. Similarly, we check if the values in the “Latitude” column are outside the valid range of -90 to 90 degrees, and replace any invalid values with NA.
After modifying the data, we use the head function to display the first few rows of the updated organized_data data frame.
Next, we create three boxplots using the boxplot function.
The first boxplot shows the distribution of values in the “Equator” column:
The second boxplot shows the distribution of values in the “Longitude” column:
The third boxplot shows the distribution of values in the “Latitude” column:
Each boxplot has a custom title, x-axis label, y-axis label, and color.
We then define two custom functions: find_outliers and cap_outliers. The find_outliers function identifies outliers in a numeric vector based on the interquartile range (IQR) method. The cap_outliers function replaces outlier values with the upper or lower bound values, effectively capping the outliers.
# Perform log transformation
organized_data$log_transformed_var <- log1p(organized_data$Equator)
# Create histogram with normal distribution curve
ggplot(organized_data, aes(x = log_transformed_var)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(organized_data$log_transformed_var, na.rm = TRUE), sd = sd(organized_data$log_transformed_var, na.rm = TRUE)), color = "green", linewidth = 1) +
labs(title = "Histogram with Normal Distribution Curve",
x = "Log-transformed Equator",
y = "Density") +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", hjust = 0.5, vjust = 1.5),
axis.title.x = element_text(face = "bold", margin = margin(t = 10)),
axis.title.y = element_text(face = "bold", margin = margin(r = 10))
)
First, we performed a log transformation on the Equator column of our organized_data data frame using the log1p function. This function computes the natural logarithm of one plus the input value and assigns the result to a new column named log_transformed_var.
Next, we used ggplot to create a plot, setting the data source to organized_data organized_data and mapping the log_transformed_var column to the x-axis.
We added a histogram layer with geom_histogram, setting the y-axis to display density (normalized count), using 30 bins for the histogram, and setting the bar fill color to blue with 50% opacity.
Then, we overlaid a normal distribution curve on the histogram using stat_function. We set the function to dnorm and provided the mean and standard deviation of the log_transformed_var column as arguments. The curve is colored green and has a line width of 1.
To add titles and labels, we used labs. We set the plot title to “Histogram with Normal Distribution Curve”, the x-axis label to “Log-transformed Equator”, and the y-axis label to “Density”.
We applied a minimal theme to the plot with theme_minimal, which removes most background elements and gridlines for a cleaner look.
Finally, we customized the appearance of the plot using theme. We set the plot title to bold font, centered it horizontally with hjust = 0.5, and added vertical spacing with vjust = 1.5. We also made the axis titles bold and added margins to create space between the labels and the graph.
RMIT Module 6: “Scan: Outliers.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_06.html. Accessed on May 22, 2024.
RMIT Module 5: “Scan: Missing Values.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_05.html. Accessed on May 22, 2024.
RMIT Module 7: “Transform: Data Transformation, Standardisation, and Reduction.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_07.html. Accessed on May 23, 2024.
Paraphrasing Tool: Used for documentation. Available at https://www.editpad.org/tool/paraphrasing-tool. Accessed on May 26, 2024.
Grammarly Paraphrasing Tool: Used for documentation. Available at https://app.grammarly.com/. Accessed on May 27, 2024.
W3Schools, ‘R programming tutorial,’ [Online]. Available: https://www.w3schools.com/r/. Accessed on May 20, 2024.