Already completed code

As we move on from week to week and task to task, the code that you have already completed will stay in the template but will not run; this is done by adding eval=FALSE to the corresponding code chunk. Note that the libraries need to be loaded in this program as well.
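
For reference, here is a minimal sketch of what such a chunk looks like in the .Rmd source; the chunk label "already-done" and the code inside it are illustrative only:

``` r
# In the .Rmd source, an already completed chunk keeps its code but adds
# eval=FALSE to the chunk header, so it is displayed without being re-run:
#
#   ```{r already-done, eval=FALSE}
#   library(ggplot2)   # shown in the knitted document, but not executed
#   ```
```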

Install and load necessary libraries:

``` r
#install.packages("ggplot2") # Install ggplot2 for plotting; if you have already installed the packages, comment this out by entering a # in front of this command
#install.packages("scales")  # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis

library(ggplot2) # Load ggplot2 library
library(scales)  # Load scales library
```



## Setting up your directory on your computer.

You need to set the working directory to the folder that contains your data file before reading it.




``` r
# Check the current working directory
getwd()
## [1] "C:/Users/cdaniels/Downloads/Assignment 4 Validate a Provided Program on Different Data to Demonstrate Understanding of Data Preprocessing attached files Sep 4, 2025 1232 PM"
# In the next line, change the directory to the place where you saved the
# data file; if you prefer, you can save your data.csv file in the directory
# that the getwd() command above indicated.
# For example, your next line should look something like this: setwd("C:/Users/tsapara/Documents")

# Set the working directory to where the data file is located
# This ensures the program can access the file correctly

setwd("C:/Users/cdaniels/Downloads/Assignment 4 Validate a Provided Program on Different Data to Demonstrate Understanding of Data Preprocessing attached files Sep 4, 2025 1232 PM")

### Choose an already existing directory on your computer.

Setting up your personalized data

``` r
# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors prevents automatic conversion of strings to factors
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 672
B <- 682
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility


# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset

write.csv(df, file = "my_data.csv", row.names = FALSE) # this command may take some time to run; once it is done, it will create the desired data file locally in your directory
```

Knit your file

As practice, you may now want to knit your file to HTML. To do this, click the Knit button on the top panel and wait for the file to render. Once it is done, the HTML will open for you to review.

It is recommended to practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML

In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html
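
If you prefer knitting from the R console rather than the Knit button, the same HTML can be produced with rmarkdown::render(); the file name below is just a placeholder for your own .Rmd file:

``` r
# Render the R Markdown source to HTML from the console
# ("assignment4.Rmd" is a placeholder - use your own file name)
library(rmarkdown)
render("assignment4.Rmd", output_format = "html_document")
```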

df <- read.csv("my_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

Cleaning up your data

Step 0. Now that you have read the file, you want to learn some basic information about your data.

The following commands will not be explained here; do your own research, review your csv file, and answer the questions related to this part of your code.

``` r
# Basic exploratory commands
nrow(df)       # Number of rows in the dataset
## [1] 500
length(df)     # Number of columns (or variables) in the dataset
## [1] 15
str(df)        # Structure of the dataset (data types and a preview)
## 'data.frame':    500 obs. of  15 variables:
##  $ ID           : Factor w/ 4 levels "","Female","Male",..: 3 4 2 3 2 3 4 3 3 2 ...
##  $ Gender       : int  29 NA 29 30 26 29 28 29 29 26 ...
##  $ Age          : int  175 165 NA 182 165 175 185 175 175 160 ...
##  $ Height       : int  70 56 65 80 62 70 85 70 70 55 ...
##  $ Weight       : Factor w/ 5 levels "","Bachelor's",..: 5 5 2 4 1 5 3 5 5 2 ...
##  $ Education    : Factor w/ 15 levels "","32000","35000",..: 12 11 10 13 8 12 1 12 12 8 ...
##  $ Income       : Factor w/ 3 levels "","Married","Single": 2 3 3 2 3 2 3 2 2 3 ...
##  $ MaritalStatus: Factor w/ 4 levels "","Employed",..: 2 1 2 2 2 2 4 2 2 2 ...
##  $ Employment   : Factor w/ 15 levels "","5.5","5.7",..: 12 12 9 1 4 12 3 12 12 5 ...
##  $ Score        : Factor w/ 5 levels "","5.7","A","B",..: 5 4 4 5 3 5 3 5 5 4 ...
##  $ Category     : Factor w/ 5 levels "","A","Art","Music",..: 5 4 4 4 3 5 5 5 5 3 ...
##  $ Color        : Factor w/ 5 levels "","Blue","Green",..: 4 2 2 4 3 4 3 4 4 2 ...
##  $ Hobby        : Factor w/ 6 levels "","Green","Photography",..: 6 6 5 6 4 6 3 6 6 5 ...
##  $ Happiness    : Factor w/ 10 levels "","6","6.5","7",..: 9 8 2 8 4 9 2 9 9 4 ...
##  $ Location     : Factor w/ 5 levels "","6","City",..: 5 5 3 3 3 5 4 5 5 4 ...
summary(df)    # Summary statistics for each column
##       ID          Gender           Age            Height              Weight   
##        :  2   Min.   :25.00   Min.   :155.0   Min.   :54.00              : 24  
##  Female:186   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:62.00   Bachelor's :155  
##  Male  :218   Median :29.00   Median :175.0   Median :70.00   High School:102  
##  Other : 94   Mean   :28.68   Mean   :173.3   Mean   :70.87   Master's   :125  
##               3rd Qu.:30.00   3rd Qu.:182.0   3rd Qu.:80.00   PhD        : 91  
##               Max.   :34.00   Max.   :190.0   Max.   :90.00   NA's       :  3  
##               NA's   :34      NA's   :15      NA's   :50                       
##    Education       Income       MaritalStatus   Employment  Score    
##  60000  : 88          : 16             : 15          : 76      :  2  
##         : 79   Married:224   Employed  :370   7.8    : 74   5.7:  1  
##  48000  : 71   Single :260   Single    :  1   6.2    : 70   A  :162  
##  45000  : 69                 Unemployed:114   6.1    : 55   B  :186  
##  65000  : 64                                  5.7    : 50   C  :149  
##  55000  : 26                                  (Other):170            
##  (Other):103                                  NA's   :  5            
##    Category      Color             Hobby       Happiness     Location  
##        :  8         :  3              :  5   7      :142         :  3  
##  A     :  1   Blue  :200   Green      :  1   8.5    : 78   6     :  1  
##  Art   :164   Green :131   Photography: 84   9      : 77   City  :206  
##  Music :145   Red   :165   Reading    :134   6      : 64   Rural :177  
##  Sports:182   Sports:  1   Swimming   : 93   8      : 61   Suburb:113  
##                            Traveling  :183   (Other): 70               
##                                              NA's   :  8
```

Your Turn

Please answer the following questions by typing your answer after each question.

Question 1

What type of variables does your file include?

Answer 1:

The file includes both categorical variables (read in as factors) and numeric variables (read in as integers).

Question 2

Specific data types?

Answer 2:

According to str(df), the columns are stored as either integers or factors.

Question 3

Are they read properly?

Answer 3:

yes

Question 4

Are there any issues ?

Answer 4:

no

Question 5

Does your file include both NAs and blanks?

Answer 5:

yes

Question 6

How many NAs do you have?

Answer 6:

356

Question 7

How many blanks?

Answer 7:

286
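
For reference, one way these two counts could be obtained (a sketch, not part of the provided program) is shown below; at this point blanks and NAs are still separate:

``` r
# Count missing values and blank strings separately, before any cleanup
sum(is.na(df))               # total number of NAs
sum(df == "", na.rm = TRUE)  # total number of blank ("") entries
```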

Cleanup Continued

Step 1: Handling both blanks and NAs is not simple, so first we want to eliminate some of them. Let's eliminate the blanks and change them to NAs.

``` r
#
# Step 1:  # Handling both blanks and NAs is not simple so first we want to eliminate
# some of those, let's eliminate the blanks and change them to NAs
#


# Replace blanks with NAs across the dataset
# This ensures that blank values are consistently treated as missing data
df[df == ""] <- NA

# Convert specific columns to factors
# This step ensures categorical variables are treated correctly after replacing blanks
factor_columns <- c("Gender", "Education", "Score", "MaritalStatus", "Category", 
                    "Employment", "Color", "Hobby", "Location")
df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))
```
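
As a quick sanity check (not part of the provided program), you can confirm that no blank strings remain after this replacement:

``` r
# All blanks were replaced with NA, so this should return 0
sum(df == "", na.rm = TRUE)
```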

Cleanup Continued

Step 2: Count NAs in the entire dataset

``` r
#
# Step 2: Count NAs in the entire dataset


# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values
## [1] 369
```
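
If you want to see where those missing values are concentrated, a per-column breakdown (a sketch, not part of the provided program) can help when answering the questions below:

``` r
# Number of NAs in each column
colSums(is.na(df))
```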

Your Turn

Please answer the following questions, by typing information after the question.

Question 8

Explain what the printed number is, what information it relays, and how you can use it in your analysis.

Answer 8:

The printed number is the total number of missing values (NAs) in the entire dataset after the blanks were converted to NAs. It tells us how much missing data there is, which helps decide whether to drop rows or columns or to impute values.

Cleanup Continued

Step 3: Count rows with NAs.

``` r
#
# Step 3: Count rows with NAs
#

# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
# label_percent() is a function factory: it returns a formatting function,
# so build the formatter first and then apply it to the proportion
label_percent_row_NA <- scales::label_percent(accuracy = 0.1)(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas
## [1] 271
label_percent_row_NA
## [1] "54.2%"
```

Your Turn

Question 9

How large is the proportion of the rows with NAs? (We can drop rows with NAs only if they make up at most 5% of the data.)

Answer 9:

About 54% (271 of the 500 rows contain at least one NA), which is far above the 5% threshold.

Question 10

Do you think it would be wise to drop that percentage of rows?

Answer 10:

No

Question 11

How would this affect your dataset?

Answer 11:

It would substantially reduce the sample size, since more than half of the rows would be removed.
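
For reference only, this is roughly how rows containing NAs would be dropped if their proportion were below the 5% threshold; it is a sketch and is not applied here:

``` r
# Keep only complete rows - only advisable when few rows are affected;
# here roughly 54% of rows contain at least one NA, so this would
# discard more than half of the sample
df_complete <- na.omit(df)
nrow(df_complete)
```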

Cleanup Continued

Step 4: Count columns with NAs

``` r
#
# Step 4: Count columns with NAs

# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
# Same pattern as above: build the percent formatter and apply it directly
label_percent_col_NA <- scales::label_percent(accuracy = 0.1)(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas
## [1] 15
label_percent_col_NA
## [1] "100.0%"
```

Your Turn

Question 12

How large is the proportion of the columns with NAs? We never want to drop entire columns, as this would mean losing variables and associations, but do you think it would be wise to drop that percentage?

Answer 12:

100% - all 15 columns contain at least one NA. No, it would not be wise to drop them.

Question 13

How would this affect your dataset?

Answer 13:

It would mean losing variables and associations - essentially removing entire columns and shrinking the dataset overall.
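
As an illustration (not part of the provided program), the share of NAs in each column could be inspected before deciding whether any column is worth keeping:

``` r
# Proportion of missing values per column, largest first
col_na_share <- colMeans(is.na(df))
sort(col_na_share, decreasing = TRUE)
```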

Imputation

Step 5: Replace NAs with appropriate values (mean for numeric and integer columns, mode for factor columns, "NA" for character columns).

In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that using the mean of the variable for numeric columns and the mode for categorical columns is correct - this is not always the case, of course, but the code would become much more complicated otherwise.

``` r
#
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics and you will discuss this code.
# for now, you can assume that by setting the mean of the variable for numeric
# and mode for categorical it is correct - this is not always the case of course
# but the code will become much more complicated in that case.


# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      col[is.na(col)] <- approx(seq_along(col), col, n = length(col))[["y"]][is.na(col)] # Interpolation
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with "NA"
  }
  return(col) # Return the modified column
})

df <- as.data.frame(df) # Convert the list back to a dataframe


#
# following the above method to impute, has now changed some of the statistics


# Check the updated dataset and ensure no remaining NAs
summary(df)
##       ID          Gender         Age            Height              Weight   
##        :  0   30     :132   Min.   :155.0   Min.   :54.00              :  0  
##  Female:186   28     : 76   1st Qu.:165.0   1st Qu.:65.00   Bachelor's :182  
##  Male  :220   29     : 73   Median :175.0   Median :70.87   High School:102  
##  Other : 94   26     : 71   Mean   :173.3   Mean   :70.87   Master's   :125  
##               27     : 70   3rd Qu.:182.0   3rd Qu.:80.00   PhD        : 91  
##               32     : 33   Max.   :190.0   Max.   :90.00                    
##               (Other): 45                                                    
##    Education       Income       MaritalStatus   Employment  Score    
##  60000  :167          :  0   Employed  :385   7.8    :155   5.7:  1  
##  48000  : 71   Married:224   Single    :  1   6.2    : 70   A  :162  
##  45000  : 69   Single :276   Unemployed:114   6.1    : 55   B  :188  
##  65000  : 64                                  5.7    : 50   C  :149  
##  55000  : 26                                  5.5    : 34            
##  42000  : 21                                  7.5    : 28            
##  (Other): 82                                  (Other):108            
##    Category      Color             Hobby       Happiness     Location  
##  A     :  1   Blue  :203   Green      :  1   7      :171   6     :  1  
##  Art   :164   Green :131   Photography: 84   8.5    : 78   City  :209  
##  Music :145   Red   :165   Reading    :134   9      : 77   Rural :177  
##  Sports:190   Sports:  1   Swimming   : 93   6      : 64   Suburb:113  
##                            Traveling  :188   8      : 61               
##                                              7.5    : 23               
##                                              (Other): 26
```
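
Before comparing the summaries, a quick check (a sketch, not part of the provided program) confirms whether any missing values survived the imputation:

``` r
# Should be 0 if every NA was replaced during imputation
sum(is.na(df))
```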

Your Turn

Essay Question

Run summary(df) and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any more NA's in your file? What information is printed by the summary? How can this be interpreted? What are your observations? Verify the effects of imputation and explain in detail. Compare the updated summary with the earlier statistics and note the changes. Explain everything that you observe.

Answer