DAB502 Lab 3

QUESTION 1
EXPLANATION OF QUESTION 1
QUESTION 2
EXPLANATION OF QUESTION 2
QUESTION 3
EXPLANATION OF QUESTION 3

QUESTION 1

Given a dataset with two numerical columns x and y, write a function to create a scatterplot of x versus y using ggplot2. Your function should take an excel/csv file as input and display a scatterplot.

library(ggplot2)
Breast_Cancer_Data <- read.csv("C:/Users/hp/Downloads/Breast_Cancer_Data.csv")


create_scatterplot <- function(data) {
  # Create scatterplot using ggplot2 with some visual enhancements
  p <- ggplot(data, aes(x = radius_mean, y = texture_mean)) +
    # Blue points, size 3, solid circles (shape 16)
    geom_point(color = "#0072B2", size = 3, shape = 16) +
    # Linear regression line without the confidence band
    geom_smooth(method = "lm", se = FALSE, color = "#D55E00") +
    theme_minimal() +  # Minimal theme
    labs(x = "Radius Mean", y = "Texture Mean",
         title = "Scatterplot of Radius Mean vs Texture Mean") +
    theme(plot.title = element_text(size = 14, face = "bold"),  # Title appearance
          axis.text = element_text(size = 12),                  # Axis text size
          axis.title = element_text(size = 12, face = "bold"))  # Axis title appearance
  
  # Display the plot
  print(p)
}

# Usage example:
create_scatterplot(Breast_Cancer_Data)

## `geom_smooth()` using formula = 'y ~ x'

EXPLANATION OF QUESTION 1

Loading the ggplot2 Library: The ggplot2 library is loaded first. It lets R users build plots layer by layer, which is how the scatterplot is constructed here.

On my PC, the CSV file “Breast_Cancer_Data.csv” is read from the given location with read.csv. Its contents are stored in a variable named Breast_Cancer_Data. This variable is a dataframe, R's standard structure for tabular data.

Defining the Function: A function called create_scatterplot is defined. It takes a single argument, which is expected to be a dataframe.

Initializing the Plot: The function passes the supplied dataframe to ggplot to create a plot object. The radius_mean variable from the dataframe is mapped to the x-axis and the texture_mean variable to the y-axis.

Adding Points: geom_point overlays the data points to form the scatterplot. The points are solid circles (shape 16), blue (#0072B2), and drawn at size 3, which is larger than the ggplot2 default.

Adding a Smoothing Line: geom_smooth with method = "lm" fits a linear model and draws it as a straight trend line through the data. The line is orange (#D55E00), and se = FALSE hides the confidence band around it.
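To make the trend line more concrete, the sketch below fits the same kind of model directly with lm, which is the fit that method = "lm" draws. Since the breast cancer CSV is not available here, the built-in mtcars dataset is used as a stand-in (an assumption for illustration only).

```r
# Stand-in data: the built-in mtcars dataset replaces the lab's CSV file.
# geom_smooth(method = "lm") with the default formula y ~ x draws this line.
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted straight line
coefs <- coef(fit)
print(coefs)
```

The two coefficients are the intercept and slope of the straight line that would appear on a scatterplot of mpg against wt.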

Using a Theme: theme_minimal gives the plot a tidy, uncluttered look. labs sets the axis labels and the title, and theme makes the title and axis titles bold and sets the text sizes.

The Breast_Cancer_Data dataframe is passed as an argument to the create_scatterplot function. Using the data from the CSV file, this creates and displays the scatterplot with the styling described above.
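The question asks for a function that takes a file as input, so a possible variation is sketched below. The name scatterplot_from_csv and its parameters path, x_col and y_col are hypothetical, and the sketch only handles CSV files. The usage example writes a small made-up table to a temporary file rather than relying on the lab's dataset.

```r
library(ggplot2)

# Hypothetical helper: read a CSV file and plot two named numeric columns.
scatterplot_from_csv <- function(path, x_col, y_col) {
  data <- read.csv(path)
  # Fail early if either column name is not in the file
  stopifnot(x_col %in% names(data), y_col %in% names(data))

  # .data[[name]] lets aes() use column names stored in strings
  p <- ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point(color = "#0072B2", size = 3, shape = 16) +
    theme_minimal() +
    labs(x = x_col, y = y_col,
         title = paste("Scatterplot of", x_col, "vs", y_col))

  print(p)      # Display the plot
  invisible(p)  # Also return it so it can be reused or saved
}

# Usage example with made-up data written to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 8.1)),
          tmp, row.names = FALSE)
p <- scatterplot_from_csv(tmp, "x", "y")
```

Passing the column names as arguments keeps the function usable for any file with two numerical columns, not only the breast cancer data.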

QUESTION 2

Write a function to compute the sample variance for a given list of numbers. The formula for the sample variance is:

$$S^2 = \frac{\sum_i (x_i - \mu)^2}{n - 1},$$

where
• $x_i$ represents each value from the list.
• $\mu$ is the mean of the list.
• $n$ is the number of observations (length of the list).

compute_sample_variance <- function(numbers) {
  n <- length(numbers)  # Number of observations
  if (n <= 1) {
    stop("Sample variance cannot be computed for less than 2 observations.")
  }
  
  # Compute the mean
  mean_value <- mean(numbers)
  
  # Compute the sum of squared differences from the mean
  sum_squared_diff <- sum((numbers - mean_value)^2)
  
  # Compute sample variance
  sample_variance <- sum_squared_diff / (n - 1)
  
  return(sample_variance)
}

# Example usage:
numbers <- c(4, 7, 9, 11, 13)
variance <- compute_sample_variance(numbers)
print(variance)

## [1] 12.2

EXPLANATION OF QUESTION 2

Definition of Function: The function compute_sample_variance takes a single input, numbers, which is expected to be a vector of numerical values.

Determine the Number of Observations: The statement n <- length(numbers) counts the observations in the input vector numbers and stores the result in the variable n.

Verify the Valid Number of Observations: The function checks whether the vector contains fewer than two observations. If it does, stop raises the error “Sample variance cannot be computed for less than 2 observations.” and the function exits without returning a value.
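The behaviour of this check can be seen by calling the function with a single value. The sketch below repeats the function in condensed form so that it runs on its own, and uses tryCatch (not part of the original answer) to capture the error message instead of halting the script.

```r
# Condensed copy of the function from the answer
compute_sample_variance <- function(numbers) {
  n <- length(numbers)
  if (n <= 1) {
    stop("Sample variance cannot be computed for less than 2 observations.")
  }
  sum((numbers - mean(numbers))^2) / (n - 1)
}

# tryCatch() captures the error so the script keeps running
result <- tryCatch(
  compute_sample_variance(c(5)),
  error = function(e) conditionMessage(e)
)
print(result)
```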

Determine the Mean: The statement mean_value <- mean(numbers) computes the mean (average) of the input numbers and stores it in the mean_value variable.

Determine the Total Squared Deviations from the Mean: The statement sum_squared_diff <- sum((numbers - mean_value)^2) computes the sum of the squared differences between each number and the mean. It does this in three steps: subtracting the mean from every number, squaring each difference, and adding up the squared differences.
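The three steps can be traced one at a time on the example vector from the answer. The intermediate variable names deviations and squared are introduced here only for illustration.

```r
numbers <- c(4, 7, 9, 11, 13)

mean_value <- mean(numbers)          # 44 / 5 = 8.8
deviations <- numbers - mean_value   # -4.8 -1.8  0.2  2.2  4.2
squared <- deviations^2              # 23.04  3.24  0.04  4.84 17.64
sum_squared_diff <- sum(squared)     # 48.8

print(sum_squared_diff)
```

Dividing 48.8 by n - 1 = 4 gives the 12.2 reported in the output above.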

Calculate the Sample Variance: The statement sample_variance <- sum_squared_diff / (n - 1) divides the sum of squared differences by n - 1, where n is the number of observations. Dividing by n - 1 rather than n is Bessel’s correction, which removes the bias that arises when the population variance is estimated from a sample.
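As a quick sanity check, the result can be compared with R's built-in var function, which also divides by n - 1. The sketch below additionally computes the version that divides by n, to show the effect of Bessel's correction. The names manual, builtin and uncorrected are illustrative.

```r
numbers <- c(4, 7, 9, 11, 13)
n <- length(numbers)
sum_squared_diff <- sum((numbers - mean(numbers))^2)

manual <- sum_squared_diff / (n - 1)   # Sample variance (Bessel's correction)
builtin <- var(numbers)                # Base R also divides by n - 1
uncorrected <- sum_squared_diff / n    # Dividing by n gives a smaller value

print(all.equal(manual, builtin))
print(c(manual = manual, uncorrected = uncorrected))
```

For this vector the corrected value is 12.2 and the uncorrected one is 48.8 / 5 = 9.76.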

QUESTION 3

Given a DataFrame containing both numerical and categorical columns, write a code to identify and handle missing values. Specifically, replace missing numerical values with the mean of their respective columns and missing categorical values with the mode of their respective columns.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Function to replace missing values
replace_missing_values <- function(data) {
  # Replace missing numerical values with mean
  numerical_columns <- c("rating", "user_age")  # Add all numerical columns here
  for (col in numerical_columns) {
    data[[col]][is.na(data[[col]])] <- mean(data[[col]], na.rm = TRUE)
  }
  
  # Replace missing categorical values with mode
  categorical_columns <- c("Category", "user_sex", "user_country")  # Add all categorical columns here
  for (col in categorical_columns) {
    mode_val <- names(sort(table(data[[col]]), decreasing = TRUE)[1])
    data[[col]][is.na(data[[col]])] <- mode_val
  }
  
  return(data)
}

# Load the dataset
book_data <- read.csv("C:/Users/hp/Downloads/Book_Data.csv")

# Replace missing values
book_data_cleaned <- replace_missing_values(book_data)

# Check if missing values are replaced
summary(book_data_cleaned)  # Summary statistics of cleaned data

##     user_id        user_sex            user_age    user_country      
##  Min.   : 0.00   Length:28          Min.   :16.0   Length:28         
##  1st Qu.: 6.75   Class :character   1st Qu.:24.0   Class :character  
##  Median :13.50   Mode  :character   Median :32.0   Mode  :character  
##  Mean   :13.50                      Mean   :32.5                     
##  3rd Qu.:20.25                      3rd Qu.:40.0                     
##  Max.   :27.00                      Max.   :52.0                     
##      rating        comment             Author              date          
##  Min.   :2.500   Length:28          Length:28          Length:28         
##  1st Qu.:3.679   Class :character   Class :character   Class :character  
##  Median :3.952   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.905                                                           
##  3rd Qu.:4.500                                                           
##  Max.   :5.000                                                           
##    Category        
##  Length:28         
##  Class :character  
##  Mode  :character  
##                    
##                    
##

EXPLANATION OF QUESTION 3

The script defines the function replace_missing_values, which cleans a dataset by replacing missing numerical values with the column mean and missing categorical values with the column mode. The script loads the dplyr library, although the function itself uses only base R, then reads the CSV file “Book_Data.csv” into a dataframe and applies the function to it. Specifically, the function iterates over the listed numerical columns (rating, user_age), replacing NA values with the column’s mean, and over the listed categorical columns (Category, user_sex, user_country), replacing NA values with the most frequent value (the mode). The cleaned data is saved in book_data_cleaned, and summary is used to inspect it. Because summary does not report NA counts for character columns, an explicit check with is.na follows.
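The function above lists its columns by name. A more general variant, sketched here under the assumption that every numeric column should get its mean and every character or factor column its mode, could detect the column types itself. The names get_mode and impute_missing and the toy dataframe are hypothetical and are not part of the original answer.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
get_mode <- function(x) {
  counts <- table(x)                  # table() drops NA by default
  names(counts)[which.max(counts)]    # First value with the highest count
}

# Hypothetical generic version: chooses mean or mode by column type.
impute_missing <- function(data) {
  for (col in names(data)) {
    missing <- is.na(data[[col]])
    # Skip columns with nothing to fill or nothing to fill from
    if (!any(missing) || all(missing)) next
    if (is.numeric(data[[col]])) {
      data[[col]][missing] <- mean(data[[col]], na.rm = TRUE)
    } else if (is.character(data[[col]]) || is.factor(data[[col]])) {
      data[[col]][missing] <- get_mode(data[[col]])
    }
  }
  data
}

# Toy data for illustration only
toy <- data.frame(
  rating = c(4, NA, 5, 3),
  user_sex = c("F", "M", NA, "F"),
  stringsAsFactors = FALSE
)
cleaned <- impute_missing(toy)
print(cleaned)
```

In the toy data the missing rating becomes 4, the mean of 4, 5 and 3, and the missing user_sex becomes "F", the most frequent value.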

# Check for NA values in the dataset
any_na <- any(is.na(book_data_cleaned))

if (any_na) {
  print("There are missing values in the dataset.")
} else {
  print("There are no missing values in the dataset.")
}

## [1] "There are no missing values in the dataset."
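One caveat with this check: by default read.csv treats a blank field in a character column as an empty string, not as NA, so blank categorical entries would pass the check unnoticed. Whether Book_Data.csv contains such blanks is not known here. The sketch below uses a small made-up file to show the difference the na.strings argument makes, together with a per-column NA count.

```r
# Made-up CSV with one blank country and one blank rating
tmp <- tempfile(fileext = ".csv")
writeLines(c("user_country,rating", "Canada,4", ",5", "India,"), tmp)

default_read <- read.csv(tmp)                            # Blank text stays ""
strict_read <- read.csv(tmp, na.strings = c("", "NA"))   # Blank text becomes NA

# Missing values per column under each reading
print(colSums(is.na(default_read)))
print(colSums(is.na(strict_read)))
```

With the default settings only the blank rating is counted as missing, while with na.strings = c("", "NA") the blank country is counted as well.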