Setup

library(kableExtra)
library(magrittr)
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(ggplot2)
library(forcats)

Essential R Packages:

Group information
Student name Student number Percentage of contribution
HEMANTH RANGASWAMY s4069811 50
RAHUL RAMESH s4029852 50

Executive Summary

In this assignment, we carried out several preprocessing steps to clean, integrate, and transform the data for analysis. Here’s what we did:

Throughout these preprocessing steps, we used various functions and packages from the tidyverse ecosystem, such as dplyr,tidyr, and ggplot2, to effectively manipulate, reshape, and visualize the data.

Data

RCB1<- read.csv("C:/Users/Hemanth Gowda/Downloads/city_temperature.csv")
RCB2<- read.csv("C:/Users/Hemanth Gowda/Downloads/underwater_temperature.csv")

About Kaggle

Kaggle is a popular site for data scientists and machine learning enthusiasts. It offers useful resources like datasets, contests, and learning materials.

Key Features of Kaggle

Why We Preferred Kaggle?

For our assignment, Kaggle was extremely helpful. We obtained the city_temperature.csv and underwater_temperature.csv datasets from Kaggle. Here’s how it helped us:

city_temperature.csv

This dataset contains temperature data for various cities across different regions, countries, and states. It has a total of 2,906,327 entries. The columns in this dataset are as follows:

This dataset is useful for analyzing temperature trends, studying climate patterns, or training machine learning models for weather prediction.

underwater_temperature.csv

This dataset contains temperature observations made at various underwater sites or locations. It has a total of 408,638 entries. The columns in this dataset are as follows:

This dataset is useful for studying underwater temperature patterns, analyzing the impact of depth on temperature, or modeling temperature changes in aquatic environments.

Understand

RCB1<- read.csv("C:/Users/Hemanth Gowda/Downloads/city_temperature.csv")
RCB2<- read.csv("C:/Users/Hemanth Gowda/Downloads/underwater_temperature.csv")
RCB2 <- RCB2 %>%
  rename(
    Temperature = Temp
  )
RCB1 <- RCB1 %>%
  rename(
    Temperature = AvgTemperature
  )
head(RCB1)
head(RCB2)
str(RCB1)
## 'data.frame':    2906327 obs. of  8 variables:
##  $ Region     : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ Country    : chr  "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ State      : chr  "" "" "" "" ...
##  $ City       : chr  "Algiers" "Algiers" "Algiers" "Algiers" ...
##  $ Month      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year       : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ Temperature: num  64.2 49.4 48.8 46.4 47.9 48.7 48.9 49.1 49 51.9 ...
str(RCB2)
## 'data.frame':    408638 obs. of  8 variables:
##  $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Site       : chr  "Ilha Deserta" "Ilha Deserta" "Ilha Deserta" "Ilha Deserta" ...
##  $ Latitude   : num  27.3 27.3 27.3 27.3 27.3 ...
##  $ Longitude  : num  48.3 48.3 48.3 48.3 48.3 ...
##  $ Date       : chr  "20-02-2013" "20-02-2013" "20-02-2013" "20-02-2013" ...
##  $ Time       : chr  "11:40:02" "12:00:03" "12:20:04" "12:40:05" ...
##  $ Temperature: num  24.4 24.4 24.5 24.4 24.4 ...
##  $ Depth      : num  12 12 12 12 12 12 12 12 12 12 ...
# Merging the datasets RCB1 and RCB2 by Temperature
combined_dataset <- merge(RCB1, RCB2, by = "Temperature")

# Conversion of a variable along with the labeling of depth with the display of merged data
combined_dataset$Depth <- as.factor(combined_dataset$Depth)
combined_dataset$Depth <- as.character(combined_dataset$Depth)
combined_dataset$Depth <- as.factor(combined_dataset$Depth)
combined_dataset$Depth <- factor(combined_dataset$Depth, levels = c("5", "12", "22", "22.1"))
head(combined_dataset)

Data Understanding and Preparation of Data

We check the structure of each data frame with the str function to see the data types and number of observations.

Understanding the Structure

When we look at the structure of a data frame, we see each column, holds different types of data.

Merging and Manipulation of Data

Tidy & Manipulate Data set

# This is the R chunk for the Tidy & Manipulate Data set

organized_data <- pivot_wider(
  combined_dataset,
  names_from = Region,
  values_from = Depth
)
## Warning: Values from `Depth` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
##   {data} |>
##   dplyr::summarise(n = dplyr::n(), .by = c(Temperature, Country, State, City,
##   Month, Day, Year, ID, Site, Latitude, Longitude, Date, Time, Region)) |>
##   dplyr::filter(n > 1L)
organized_data <- organized_data %>%
  mutate(Equator = Longitude-Latitude)
organized_data$Temperature[c( )] <- NA
head(organized_data)

Scan Dataset

# Scan dataset

null_entries <- sapply(organized_data, function(x) sum(is.na(x)))
View(null_entries)

organized_data <- organized_data %>%
  mutate(Longitude = ifelse(Longitude < -180 | Longitude > 180, NA, Longitude),
         Latitude = ifelse(Latitude < -90 | Latitude > 90,NA,Latitude))
head(organized_data)
boxplot(organized_data$Equator, main = "Boxplot of Values", xlab="Equator", ylab = "Value", col="lightblue")

boxplot(organized_data$Longitude, main = "Boxplot of Values", xlab="Longitude", ylab = "Value", col="lightblue")

boxplot(organized_data$Latitude, main = "Boxplot of Values", xlab="Latitude", ylab = "Value", col="lightblue")

find_outliers <- function(x) {
  R1 <- quantile(x, 0.25, na.rm = TRUE)
  R3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- R3 - R1
  lower_bound <- R1 - 1.5 * IQR
  upper_bound <- R3 + 1.5 * IQR
  return(x < lower_bound | x >upper_bound)
}

cap_outliers <- function(x) {
  R1 <- quantile(x, 0.25, na.rm = TRUE)
  R3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- R3 - R1
  lower_bound <- R1 - 1.5 * IQR
  upper_bound <- R3 + 1.5 * IQR
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}

By using the sapply function to apply a custom function to each column of the organized_data data frame. This function counts the number of missing values (NA) in each column, and the results are stored in the null_entries object. We then use the View function to display the null_entries object in a separate window.

Next, we use the mutate function from the dplyr package to create new columns in the organized_data data frame. We check if the values in the “Longitude” column are outside the valid range of -180 to 180 degrees, and if so, we replace them with NA using the ifelse function. Similarly, we check if the values in the “Latitude” column are outside the valid range of -90 to 90 degrees, and replace any invalid values with NA.

After modifying the data, we use the head function to display the first few rows of the updated organized_data data frame.

Next, we create three boxplots using the boxplot function.

The first boxplot shows the distribution of values in the “Equator” column:

The second boxplot shows the distribution of values in the “Longitude” column:

The third boxplot shows the distribution of values in the “Latitude” column:

Each boxplot has a custom title, x-axis label, y-axis label, and color.

We then define two custom functions: find_outliers and cap_outliers. The find_outliers function identifies outliers in a numeric vector based on the interquartile range (IQR) method. The cap_outliers function replaces outlier values with the upper or lower bound values, effectively capping the outliers.

Transform

# Perform log transformation
organized_data$log_transformed_var <- log1p(organized_data$Equator)

# Create histogram with normal distribution curve
ggplot(organized_data, aes(x = log_transformed_var)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "blue", alpha = 0.5) +
  stat_function(fun = dnorm, args = list(mean = mean(organized_data$log_transformed_var, na.rm = TRUE), sd = sd(organized_data$log_transformed_var, na.rm = TRUE)), color = "green", linewidth = 1) +
  labs(title = "Histogram with Normal Distribution Curve",
       x = "Log-transformed Equator",
       y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, vjust = 1.5),
    axis.title.x = element_text(face = "bold", margin = margin(t = 10)),
    axis.title.y = element_text(face = "bold", margin = margin(r = 10))
  )

First, we performed a log transformation on the Equator column of our organized_data data frame using the log1p function. This function computes the natural logarithm of one plus the input value and assigns the result to a new column named log_transformed_var.

Next, we used ggplot to create a plot, setting the data source to organized_data organized_data and mapping the log_transformed_var column to the x-axis.

We added a histogram layer with geom_histogram, setting the y-axis to display density (normalized count), using 30 bins for the histogram, and setting the bar fill color to blue with 50% opacity.

Then, we overlaid a normal distribution curve on the histogram using stat_function. We set the function to dnorm and provided the mean and standard deviation of the log_transformed_var column as arguments. The curve is colored green and has a line width of 1.

To add titles and labels, we used labs. We set the plot title to “Histogram with Normal Distribution Curve”, the x-axis label to “Log-transformed Equator”, and the y-axis label to “Density”.

We applied a minimal theme to the plot with theme_minimal, which removes most background elements and gridlines for a cleaner look.

Finally, we customized the appearance of the plot using theme. We set the plot title to bold font, centered it horizontally with hjust = 0.5, and added vertical spacing with vjust = 1.5. We also made the axis titles bold and added margins to create space between the labels and the graph.

REFERENCES :

  1. RMIT Module 6: “Scan: Outliers.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_06.html. Accessed on May 22, 2024.

  2. RMIT Module 5: “Scan: Missing Values.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_05.html. Accessed on May 22, 2024.

  3. RMIT Module 7: “Transform: Data Transformation, Standardisation, and Reduction.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_07.html. Accessed on May 23, 2024.

  4. Paraphrasing Tool: Used for documentation. Available at https://www.editpad.org/tool/paraphrasing-tool. Accessed on May 26, 2024.

  5. Grammarly Paraphrasing Tool: Used for documentation. Available at https://app.grammarly.com/. Accessed on May 27, 2024.

  6. W3Schools, ‘R programming tutorial,’ [Online]. Available: https://www.w3schools.com/r/. Accessed on May 20, 2024.