Data Wrangling (Data Preprocessing)

Setup

library(kableExtra)
library(magrittr)
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:kableExtra':
## 
##     group_rows

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(ggplot2)
library(forcats)

Essential R Packages:

kableExtra: This package helps me create customizable tables that look good and are easy to understand. I can format, style, and add annotations to the tables.
magrittr: This package gives me the pipe operator (%>%), which allows me to link multiple functions together in a sequence.
readr: This package helps me import data from different file formats like CSV and TSV. It provides easy functions to quickly read and load data into R.
dplyr: This package offers tools for data manipulation. It lets me select specific columns, filter rows based on conditions, create new variables, summarize data, and group data by different factors.
tidyr: This package focuses on reshaping data. It allows me to convert data between wide and long formats, making it easier to work with tidy data.
ggplot2: This is a powerful package for creating various plots and visualizations. It offers a flexible system to build and customize plots like scatter plots, line charts, bar charts, histograms, and more.
forcats: This package is for working with categorical variables (factors). It provides tools to reorder, modify, and clean up the levels or categories within my factor variables.

Group information
Student name	Student number	Percentage of contribution
HEMANTH RANGASWAMY	s4069811	50
RAHUL RAMESH	s4029852	50

Executive Summary

In this assignment, we carried out several preprocessing steps to clean, integrate, and transform the data for analysis. Here’s what we did:

Data Cleaning: We began by reading two CSV files, “city_temperature.csv” and “underwater_temperature.csv”, into separate data frames named RCB1 and RCB2. We then renamed the temperature columns in both datasets to “Temperature” for uniformity.
Data Integration: We combined the two data frames (RCB1 and RCB2) based on the “Temperature” column, creating a new dataset called “combined_dataset”. This step brought together temperature data from city and underwater sources into one dataset.
Data Transformation: Using the pivot_wider function from the tidyr package, we reshaped the “combined_dataset” by spreading the “Depth” column into separate columns based on unique values in the “Region” column. This made it easier to analyze depth-related data. We added a new column called “Equator” by subtracting the “Latitude” from the “Longitude”. This new feature might help in studying temperature patterns based on distance from the equator.
Handling Missing Values: We replaced certain values in the “Temperature” column with NA (missing values) using the mutate function from the dplyr package. The specific conditions or indices for these replacements were not detailed.
Outlier Detection and Capping: We created two custom functions,find_outliers and cap_outliers , to identify and limit outliers in numerical columns based on the interquartile range (IQR) method.
Data Transformation: We introduced a new column called “log_transformed_var” by applying a logarithmic transformation (log1p) to the “Equator” column. This transformation helps balance data distributions for certain analyses.

Throughout these preprocessing steps, we used various functions and packages from the tidyverse ecosystem, such as dplyr,tidyr, and ggplot2, to effectively manipulate, reshape, and visualize the data.

Data

RCB1<- read.csv("C:/Users/Hemanth Gowda/Downloads/city_temperature.csv")
RCB2<- read.csv("C:/Users/Hemanth Gowda/Downloads/underwater_temperature.csv")

About Kaggle

Kaggle is a popular site for data scientists and machine learning enthusiasts. It offers useful resources like datasets, contests, and learning materials.

Key Features of Kaggle

Datasets: Kaggle hosts a large collection of datasets on various topics. Users can upload their datasets or download datasets shared by others.
Competitions: Kaggle runs machine learning competitions with cash prizes, where participants can solve challenging problems.
Notebooks: Kaggle provides a Jupyter-like environment for coding, known as Notebooks, which are useful for data analysis and sharing code.
Discussion Forums: Users can ask questions, share knowledge, and collaborate on projects in Kaggle’s discussion forums.
Courses: Kaggle offers free courses on data science and machine learning to help users learn and improve their skills.

Why We Preferred Kaggle?

For our assignment, Kaggle was extremely helpful. We obtained the city_temperature.csv and underwater_temperature.csv datasets from Kaggle. Here’s how it helped us:

Easy Access: The datasets were easy to find and download, saving us time.
High-Quality Data: Kaggle provides a wide variety of well-documented datasets, making it easier to find what we needed.

city_temperature.csv

This dataset contains temperature data for various cities across different regions, countries, and states. It has a total of 2,906,327 entries. The columns in this dataset are as follows:

Region: The general geographic area (e.g., Africa).
Country: The country where the city is located (e.g., Algeria).
State: The state or province within the country (mixed types, might contain NaN values).
City: The name of the city (e.g., Algiers).
Month: The month of the temperature reading.
Day: The day of the month.
Year: The year of the temperature reading.
AvgTemperature: The average temperature recorded for that day in the specified city.

This dataset is useful for analyzing temperature trends, studying climate patterns, or training machine learning models for weather prediction.

underwater_temperature.csv

This dataset contains temperature observations made at various underwater sites or locations. It has a total of 408,638 entries. The columns in this dataset are as follows:

ID: A unique identifier for each observation.
Site: The name of the location or site where the observation was made.
Latitude: The latitude coordinate of the site.
Longitude: The longitude coordinate of the site.
Date: The date when the observation was recorded.
Time: The time of day when the observation was made.
Temp: The temperature at the time of observation, measured in degrees Celsius.
Depth: The depth at which the observation was made, presumably in meters.

This dataset is useful for studying underwater temperature patterns, analyzing the impact of depth on temperature, or modeling temperature changes in aquatic environments.

Understand

RCB1<- read.csv("C:/Users/Hemanth Gowda/Downloads/city_temperature.csv")
RCB2<- read.csv("C:/Users/Hemanth Gowda/Downloads/underwater_temperature.csv")
RCB2 <- RCB2 %>%
  rename(
    Temperature = Temp
  )
RCB1 <- RCB1 %>%
  rename(
    Temperature = AvgTemperature
  )
head(RCB1)

head(RCB2)

str(RCB1)

## 'data.frame':    2906327 obs. of  8 variables:
##  $ Region     : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ Country    : chr  "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ State      : chr  "" "" "" "" ...
##  $ City       : chr  "Algiers" "Algiers" "Algiers" "Algiers" ...
##  $ Month      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year       : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ Temperature: num  64.2 49.4 48.8 46.4 47.9 48.7 48.9 49.1 49 51.9 ...

str(RCB2)

## 'data.frame':    408638 obs. of  8 variables:
##  $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Site       : chr  "Ilha Deserta" "Ilha Deserta" "Ilha Deserta" "Ilha Deserta" ...
##  $ Latitude   : num  27.3 27.3 27.3 27.3 27.3 ...
##  $ Longitude  : num  48.3 48.3 48.3 48.3 48.3 ...
##  $ Date       : chr  "20-02-2013" "20-02-2013" "20-02-2013" "20-02-2013" ...
##  $ Time       : chr  "11:40:02" "12:00:03" "12:20:04" "12:40:05" ...
##  $ Temperature: num  24.4 24.4 24.5 24.4 24.4 ...
##  $ Depth      : num  12 12 12 12 12 12 12 12 12 12 ...

# Merging the datasets RCB1 and RCB2 by Temperature
combined_dataset <- merge(RCB1, RCB2, by = "Temperature")

# Conversion of a variable along with the labeling of depth with the display of merged data
combined_dataset$Depth <- as.factor(combined_dataset$Depth)
combined_dataset$Depth <- as.character(combined_dataset$Depth)
combined_dataset$Depth <- as.factor(combined_dataset$Depth)
combined_dataset$Depth <- factor(combined_dataset$Depth, levels = c("5", "12", "22", "22.1"))
head(combined_dataset)

Data Understanding and Preparation of Data

We read two CSV files into R data frames. The first file is “city_temperature.csv” which we store in RCB1. The second file is “underwater_temperature.csv” which we store in RCB2.
We rename the “Temp” column in RCB2 to “Temperature” and the “AvgTemperature” column in RCB1 to “Temperature” so they match for merging.
We look at the first few rows of each data frame to get an idea of the data using the head function.

We check the structure of each data frame with the str function to see the data types and number of observations.

Understanding the Structure

When we look at the structure of a data frame, we see each column, holds different types of data.

Region: This column is of type ‘chr’ (character), meaning it contains character strings. These strings represent different geographical regions, like “Africa”.
Country: Similar to the Region column, this one is also ‘chr’ and contains character strings representing country names, such as “Algeria”.
State: This column, also of type ‘chr’, holds character strings representing states or provinces within a country.
City: The City column contains character strings representing city names, like “Algiers”.
Month: This column is of type ‘int’ (integer). It holds integer values that represent the month of the year, where “1” could indicate January.
Day: Another integer column, Day contains integer values representing the day of the month.
Year: The Year column is of type ‘int’ as well, holding integer values that represent the year, such as “1995”.

Merging and Manipulation of Data

Temperature: This column is ‘num’ (numeric), meaning it contains numeric values representing temperature measurements, like “64.2”.
We merge the two data frames on the “Temperature” column using the merge function. The merged data frame is called combine_dataset.
We then handle the “Depth” variable in the merged data. We convert “Depth” to a factor, then to a character, and back to a factor again to make sure it’s set up correctly.
We set the order of levels for the “Depth” factor to “5”, “12”, “22”, and “22.1”. This ensures that when we analyze this factor, it follows this specific order.
Finally, we display the first few rows of the combined_dataset, which now includes data from both original datasets, with the “Depth” variable ordered as we specified.

Tidy & Manipulate Data set

# This is the R chunk for the Tidy & Manipulate Data set

organized_data <- pivot_wider(
  combined_dataset,
  names_from = Region,
  values_from = Depth
)

## Warning: Values from `Depth` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
##   {data} |>
##   dplyr::summarise(n = dplyr::n(), .by = c(Temperature, Country, State, City,
##   Month, Day, Year, ID, Site, Latitude, Longitude, Date, Time, Region)) |>
##   dplyr::filter(n > 1L)

organized_data <- organized_data %>%
  mutate(Equator = Longitude-Latitude)
organized_data$Temperature[c( )] <- NA
head(organized_data)

We first use the pivot_wider function from the tidyr package to change the shape of the combined_dataset data frame. We’re transforming it so that unique values in the “Region” column become new column names, and values in the “Depth” column spread out into these new columns.
After reshaping the data, we calculate a new column called “Equator” by subtracting the “Latitude” from the “Longitude”. This new column helps us understand the distance from the equator, which can be important for studying temperature patterns.
Next, we modify the “Temperature” column using the mutate function from the dplyr package. We replace certain values in the “Temperature” column with NA (missing values), but the specific conditions or indices for replacement are not provided in this code snippet.
Finally, we use the head function to show the first few rows of the organized_data data frame. This data frame now has a new structure with separate columns for each depth and the newly added “Equator” column.

Scan Dataset

# Scan dataset

null_entries <- sapply(organized_data, function(x) sum(is.na(x)))
View(null_entries)

organized_data <- organized_data %>%
  mutate(Longitude = ifelse(Longitude < -180 | Longitude > 180, NA, Longitude),
         Latitude = ifelse(Latitude < -90 | Latitude > 90,NA,Latitude))
head(organized_data)

boxplot(organized_data$Equator, main = "Boxplot of Values", xlab="Equator", ylab = "Value", col="lightblue")

boxplot(organized_data$Longitude, main = "Boxplot of Values", xlab="Longitude", ylab = "Value", col="lightblue")

boxplot(organized_data$Latitude, main = "Boxplot of Values", xlab="Latitude", ylab = "Value", col="lightblue")

find_outliers <- function(x) {
  R1 <- quantile(x, 0.25, na.rm = TRUE)
  R3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- R3 - R1
  lower_bound <- R1 - 1.5 * IQR
  upper_bound <- R3 + 1.5 * IQR
  return(x < lower_bound | x >upper_bound)
}

cap_outliers <- function(x) {
  R1 <- quantile(x, 0.25, na.rm = TRUE)
  R3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- R3 - R1
  lower_bound <- R1 - 1.5 * IQR
  upper_bound <- R3 + 1.5 * IQR
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}

By using the sapply function to apply a custom function to each column of the organized_data data frame. This function counts the number of missing values (NA) in each column, and the results are stored in the null_entries object. We then use the View function to display the null_entries object in a separate window.

Next, we use the mutate function from the dplyr package to create new columns in the organized_data data frame. We check if the values in the “Longitude” column are outside the valid range of -180 to 180 degrees, and if so, we replace them with NA using the ifelse function. Similarly, we check if the values in the “Latitude” column are outside the valid range of -90 to 90 degrees, and replace any invalid values with NA.

After modifying the data, we use the head function to display the first few rows of the updated organized_data data frame.

Next, we create three boxplots using the boxplot function.

The first boxplot shows the distribution of values in the “Equator” column:

Value Axis: The vertical line with numbers, labeled “Value,” shows the range of numbers in the data, from 20.5 to 22.0.
Equator Axis: The horizontal line, labeled “Equator,” is like a placeholder name for what the data represents.
Box: The box in the middle of the graph shows where half of the data lies. It’s called the interquartile range (IQR).
Median: A line inside the box shows the middle value of the data.
Outliers: In this graph, there are no points outside the whisker, which would indicate unusual or extreme values.

The second boxplot shows the distribution of values in the “Longitude” column:

Value Axis: The vertical line with numbers shows the range of values in the data.
Box: The blue box in the middle represents where half of the data lies, with the bottom showing the 25th percentile and the top showing the 75th percentile.
Median: The line inside the box shows the middle value of the data.
Outliers: There are no unusual values outside the whisker.

The third boxplot shows the distribution of values in the “Latitude” column:

Value Axis: The vertical line with numbers, labeled ‘Value’, shows the range of values in the data, from 26.5 to 28.5.
Box: The blue box in the middle represents where half of the data lies, with the bottom showing the 25th percentile and the top showing the 75th percentile.
Median: The line inside the box shows the middle value of the data.
Outliers: There are no unusual values outside the whisker.

Each boxplot has a custom title, x-axis label, y-axis label, and color.

We then define two custom functions: find_outliers and cap_outliers. The find_outliers function identifies outliers in a numeric vector based on the interquartile range (IQR) method. The cap_outliers function replaces outlier values with the upper or lower bound values, effectively capping the outliers.

Transform

# Perform log transformation
organized_data$log_transformed_var <- log1p(organized_data$Equator)

# Create histogram with normal distribution curve
ggplot(organized_data, aes(x = log_transformed_var)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "blue", alpha = 0.5) +
  stat_function(fun = dnorm, args = list(mean = mean(organized_data$log_transformed_var, na.rm = TRUE), sd = sd(organized_data$log_transformed_var, na.rm = TRUE)), color = "green", linewidth = 1) +
  labs(title = "Histogram with Normal Distribution Curve",
       x = "Log-transformed Equator",
       y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, vjust = 1.5),
    axis.title.x = element_text(face = "bold", margin = margin(t = 10)),
    axis.title.y = element_text(face = "bold", margin = margin(r = 10))
  )

First, we performed a log transformation on the Equator column of our organized_data data frame using the log1p function. This function computes the natural logarithm of one plus the input value and assigns the result to a new column named log_transformed_var.

Next, we used ggplot to create a plot, setting the data source to organized_data organized_data and mapping the log_transformed_var column to the x-axis.

We added a histogram layer with geom_histogram, setting the y-axis to display density (normalized count), using 30 bins for the histogram, and setting the bar fill color to blue with 50% opacity.

Then, we overlaid a normal distribution curve on the histogram using stat_function. We set the function to dnorm and provided the mean and standard deviation of the log_transformed_var column as arguments. The curve is colored green and has a line width of 1.

To add titles and labels, we used labs. We set the plot title to “Histogram with Normal Distribution Curve”, the x-axis label to “Log-transformed Equator”, and the y-axis label to “Density”.

We applied a minimal theme to the plot with theme_minimal, which removes most background elements and gridlines for a cleaner look.

Finally, we customized the appearance of the plot using theme. We set the plot title to bold font, centered it horizontally with hjust = 0.5, and added vertical spacing with vjust = 1.5. We also made the axis titles bold and added margins to create space between the labels and the graph.

REFERENCES :

RMIT Module 6: “Scan: Outliers.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_06.html. Accessed on May 22, 2024.
RMIT Module 5: “Scan: Missing Values.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_05.html. Accessed on May 22, 2024.
RMIT Module 7: “Transform: Data Transformation, Standardisation, and Reduction.” Available at https://rare-phoenix-161610.appspot.com/secured/Module_07.html. Accessed on May 23, 2024.
Paraphrasing Tool: Used for documentation. Available at https://www.editpad.org/tool/paraphrasing-tool. Accessed on May 26, 2024.
Grammarly Paraphrasing Tool: Used for documentation. Available at https://app.grammarly.com/. Accessed on May 27, 2024.
W3Schools, ‘R programming tutorial,’ [Online]. Available: https://www.w3schools.com/r/. Accessed on May 20, 2024.