Cleaning your Data

Already Completed Code

As we move from week to week and task to task, the code that you have already completed stays in the template but does not run. This is done by adding eval=FALSE to the corresponding code chunk. Note that the libraries still need to be loaded in this program as well.

# Install and load necessary libraries
#install.packages("ggplot2") # Install ggplot2 for plotting; if you have already installed the packages, comment this out by entering a # in front of this command
#install.packages("scales")  # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2)            # Load ggplot2 library

## Warning: package 'ggplot2' was built under R version 4.4.1

library(scales)             # Load scales library
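
If you would rather not comment the install lines in and out by hand, one option is to check which packages are missing first. The following is a minimal sketch and not part of the course template: the helper name `missing_packages` is hypothetical, and the install call is left commented out so that the chunk is safe to knit.

```r
# Packages used in this template
needed <- c("ggplot2", "scales", "moments")

# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
to_install  # character(0) means everything is already installed

# Uncomment the next line to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```

Once the packages are installed, the library() calls above load them as usual.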

Setting up your directory on your computer

You need to set the working directory here so that R can find your data file.

# Check the current working directory
getwd()

## [1] "C:/Users/sheyl/OneDrive/Desktop/DDS-8500/Week 4"

# In the next line, change the directory to the place where you saved the
# data file. If you prefer, you can save your data.csv file in the directory
# that getwd() reported above.
# For example, your next line should look something like this: setwd("C:/Users/tsapara/Documents")

# Set the working directory to where the data file is located
# This ensures the program can access the file correctly

setwd("C:/Users/sheyl/OneDrive/Desktop/DDS-8500/Week 4")

### Choose an already existing directory on your computer.
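
Because setwd() stops with an error when the folder does not exist, it can help to check the path first. This is a minimal sketch under an assumption: `tempdir()` stands in for your own folder, so replace `data_dir` with the directory that holds your data.csv.

```r
# Placeholder folder for illustration; replace with your own path
data_dir <- tempdir()

if (dir.exists(data_dir)) {
  old_dir <- setwd(data_dir)       # setwd() returns the previous directory
  print(getwd())                   # Confirm the change
  print(file.exists("data.csv"))   # TRUE only if the data file is in this folder
  setwd(old_dir)                   # Switch back (only needed for this demonstration)
} else {
  message("Directory not found: ", data_dir)
}
```

If file.exists("data.csv") prints FALSE for your own folder, read.csv("data.csv") will fail for the same reason.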

Setting up your personalized data

# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors = TRUE converts string columns to factors automatically
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 212
B <- 549
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility


# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset

write.csv(df, file = "my_data.csv", row.names = FALSE) # This command may take some time to run; once it is done, it creates the data file in your working directory
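
To see why the seed matters, the sampling step can be tried on a small made-up data frame. This is only a sketch: `toy` is hypothetical, and the seed 761 is simply 212 + 549 from the example above. Setting the same seed before each call to sample() gives the same rows, and replace = TRUE means that a row can be drawn more than once.

```r
# Small made-up data frame for illustration
toy <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

set.seed(761)                                      # Same seed ...
first_draw <- toy[sample(nrow(toy), 5, replace = TRUE), ]

set.seed(761)                                      # ... same rows
second_draw <- toy[sample(nrow(toy), 5, replace = TRUE), ]

identical(first_draw, second_draw)                 # TRUE

# With replace = TRUE the same row may appear more than once
anyDuplicated(first_draw$id) > 0
```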

Knit your file

As practice, you may now want to knit your file to HTML. To do this, click the Knit button on the top panel and wait for the file to render. The HTML file will open once rendering is done so that you can review it.

It is recommended to practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML

In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html

df <- read.csv("C:/Users/sheyl/OneDrive/Desktop/DDS-8500/Week 4/my_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

Cleaning up your data

Step 0. Now that you have read the file, you want to learn some information about your data

The following commands will not be explained here. Do your own research, review your csv file, and answer the questions related to this part of your code.

# Basic exploratory commands
nrow(df)       # Number of rows in the dataset

## [1] 500

length(df)     # Number of columns (or variables) in the dataset

## [1] 15

str(df)        # Structure of the dataset (data types and a preview)

## 'data.frame':    500 obs. of  15 variables:
##  $ ID           : Factor w/ 4 levels "","Female","Male",..: 3 2 4 2 4 3 2 3 3 3 ...
##  $ Gender       : int  30 27 30 26 30 28 27 30 30 NA ...
##  $ Age          : int  182 158 165 165 165 178 168 182 182 182 ...
##  $ Height       : int  80 NA NA 62 NA 78 65 80 80 80 ...
##  $ Weight       : Factor w/ 5 levels "","Bachelor's",..: 4 2 3 1 3 3 2 4 4 5 ...
##  $ Education    : Factor w/ 15 levels "","32000","35000",..: 13 6 1 8 2 4 7 13 13 14 ...
##  $ Income       : Factor w/ 3 levels "","Married","Single": 2 3 3 3 3 2 3 2 2 2 ...
##  $ MaritalStatus: Factor w/ 4 levels "","Employed",..: 2 4 4 2 4 2 2 2 2 2 ...
##  $ Employment   : Factor w/ 15 levels "","5.5","5.7",..: 1 NA 2 4 2 8 6 1 1 14 ...
##  $ Score        : Factor w/ 5 levels "","5.7","A","B",..: 5 3 3 3 3 3 4 5 5 5 ...
##  $ Category     : Factor w/ 5 levels "","A","Art","Music",..: 4 3 5 3 5 5 3 4 4 3 ...
##  $ Color        : Factor w/ 5 levels "","Blue","Green",..: 4 3 3 3 3 4 2 4 4 4 ...
##  $ Hobby        : Factor w/ 6 levels "","Green","Photography",..: 6 5 4 4 4 3 4 6 6 3 ...
##  $ Happiness    : Factor w/ 10 levels "","6","6.5","7",..: 8 3 1 4 NA 6 4 8 8 9 ...
##  $ Location     : Factor w/ 5 levels "","6","City",..: 3 3 4 3 4 4 3 3 3 5 ...

summary(df)    # Summary statistics for each column

##       ID          Gender           Age            Height              Weight   
##        :  1   Min.   :25.00   Min.   :155.0   Min.   :54.00              : 18  
##  Female:209   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:62.00   Bachelor's :191  
##  Male  :196   Median :28.00   Median :175.0   Median :70.00   High School: 96  
##  Other : 94   Mean   :28.43   Mean   :172.9   Mean   :70.24   Master's   :115  
##               3rd Qu.:30.00   3rd Qu.:182.0   3rd Qu.:80.00   PhD        : 77  
##               Max.   :34.00   Max.   :190.0   Max.   :90.00   NA's       :  3  
##               NA's   :27      NA's   :19      NA's   :42                       
##    Education       Income       MaritalStatus   Employment  Score    
##  45000  : 99          : 20             : 18   6.2    :100      :  7  
##  60000  : 82   Married:200   Employed  :366          : 74   5.7:  4  
##  48000  : 65   Single :280   Single    :  4   7.8    : 69   A  :152  
##         : 62                 Unemployed:112   6.1    : 52   B  :204  
##  65000  : 58                                  5.7    : 47   C  :133  
##  42000  : 30                                  (Other):152            
##  (Other):104                                  NA's   :  6            
##    Category      Color             Hobby       Happiness     Location  
##        :  4         :  2              :  5   7      :166         :  4  
##  A     :  4   Blue  :214   Green      :  4   8.5    : 74   6     :  4  
##  Art   :181   Green :130   Photography: 75   8      : 69   City  :226  
##  Music :148   Red   :150   Reading    :146   6      : 65   Rural :172  
##  Sports:163   Sports:  4   Swimming   : 91   9      : 60   Suburb: 94  
##                            Traveling  :179   (Other): 64               
##                                              NA's   :  2
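
Blanks and NAs are stored differently, so they have to be counted separately. A minimal sketch on a made-up data frame follows; `toy` is hypothetical, and the same two counting lines can be run on df for your own numbers.

```r
# Made-up data with both NAs and blank strings
toy <- data.frame(
  Age   = c(30, NA, 27, 26),
  Color = c("Red", "", "Blue", NA),
  Hobby = c("", "Reading", "", "Swimming"),
  stringsAsFactors = TRUE
)

na_per_col    <- colSums(is.na(toy))               # NAs in each column
blank_per_col <- colSums(toy == "", na.rm = TRUE)  # Blanks in each column

na_per_col
blank_per_col
```

Here is.na() ignores the blanks, and the comparison with "" ignores the NAs, which is why the two counts differ.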

Your Turn

Please answer the following questions by typing your answer after each question.

Question 1

What type of variables does your file include?

Answer 1:

Question 2

What specific data types do they have?

Answer 2:

Question 3

Are they read properly?

Answer 3:

Question 4

Are there any issues?

Answer 4:

Question 5

Does your file include both NAs and blanks?

Answer 5:

Question 6

How many NAs do you have?

Answer 6:

Question 7

How many blanks?

Answer 7:

Cleanup Continued

Step 1: Handling both blanks and NAs is not simple, so first we want to reduce them to one kind of missing value. Let’s eliminate the blanks by changing them to NAs.

# Check the structure of your data first
str(df)

## 'data.frame':    500 obs. of  15 variables:
##  $ ID           : Factor w/ 4 levels "","Female","Male",..: 3 2 4 2 4 3 2 3 3 3 ...
##  $ Gender       : int  30 27 30 26 30 28 27 30 30 NA ...
##  $ Age          : int  182 158 165 165 165 178 168 182 182 182 ...
##  $ Height       : int  80 NA NA 62 NA 78 65 80 80 80 ...
##  $ Weight       : Factor w/ 5 levels "","Bachelor's",..: 4 2 3 1 3 3 2 4 4 5 ...
##  $ Education    : Factor w/ 15 levels "","32000","35000",..: 13 6 1 8 2 4 7 13 13 14 ...
##  $ Income       : Factor w/ 3 levels "","Married","Single": 2 3 3 3 3 2 3 2 2 2 ...
##  $ MaritalStatus: Factor w/ 4 levels "","Employed",..: 2 4 4 2 4 2 2 2 2 2 ...
##  $ Employment   : Factor w/ 15 levels "","5.5","5.7",..: 1 NA 2 4 2 8 6 1 1 14 ...
##  $ Score        : Factor w/ 5 levels "","5.7","A","B",..: 5 3 3 3 3 3 4 5 5 5 ...
##  $ Category     : Factor w/ 5 levels "","A","Art","Music",..: 4 3 5 3 5 5 3 4 4 3 ...
##  $ Color        : Factor w/ 5 levels "","Blue","Green",..: 4 3 3 3 3 4 2 4 4 4 ...
##  $ Hobby        : Factor w/ 6 levels "","Green","Photography",..: 6 5 4 4 4 3 4 6 6 3 ...
##  $ Happiness    : Factor w/ 10 levels "","6","6.5","7",..: 8 3 1 4 NA 6 4 8 8 9 ...
##  $ Location     : Factor w/ 5 levels "","6","City",..: 3 3 4 3 4 4 3 3 3 5 ...

# Convert character columns to factors
char_cols <- names(df)[sapply(df, is.character)]  # Detect character columns

if (length(char_cols) > 0) {
  df[char_cols] <- lapply(df[char_cols], function(col) as.factor(col))
}

# Check column names of df
colnames(df)

##  [1] "ID"            "Gender"        "Age"           "Height"       
##  [5] "Weight"        "Education"     "Income"        "MaritalStatus"
##  [9] "Employment"    "Score"         "Category"      "Color"        
## [13] "Hobby"         "Happiness"     "Location"

#
# Step 1:  # Handling both blanks and NAs is not simple so first we want to eliminate
# some of those, let's eliminate the blanks and change them to NAs
#


# Replace blanks with NAs across the dataset
# This ensures that blank values are consistently treated as missing data
df[df == ""] <- NA

# Convert specific columns to factors
# This step ensures categorical variables are treated correctly after replacing blanks
factor_columns <- c("ID", "Weight", "Education", "Income", 
                    "MaritalStatus", "Employment", "Score", 
                    "Category", "Color", "Hobby", "Happiness", "Location")
# Ensure columns exist in df before applying
existing_factors <- intersect(factor_columns, names(df))

df[existing_factors] <- lapply(df[existing_factors], function(col) as.factor(as.character(col)))
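
The same idea can also be written column by column, which removes the now unused "" level in the same step. This is a minimal sketch on a made-up data frame; `toy` is hypothetical, and droplevels() is used here as an alternative to the as.factor(as.character(...)) conversion above.

```r
toy <- data.frame(
  Height = c(182, NA, 165),
  Color  = c("Red", "", "Blue"),
  stringsAsFactors = TRUE
)
levels(toy$Color)   # Includes the blank level ""

# For factor columns: turn blanks into NA, then drop the unused "" level
toy[] <- lapply(toy, function(col) {
  if (is.factor(col)) {
    col[col %in% ""] <- NA
    col <- droplevels(col)
  }
  col
})

levels(toy$Color)   # The blank level is gone
sum(is.na(toy))     # Blanks are now counted as NAs
```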

Cleanup Continued

Step 2: Count NAs in the entire dataset

#
# Step 2: Count NAs in the entire dataset


# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values

## [1] 331

Your Turn

Please answer the following questions by typing your answer after each question.

Question 8

Explain what the printed number is, what information it conveys, and how you can use it in your analysis.

Answer 8:

Cleanup Continued

Step 3: Count rows with NAs.

#
# Step 3: Count rows with NAs
#

# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas

## [1] 255

Percent_row_NA

## [1] "51%"
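
Before deciding whether to drop rows, it helps to see how much data would remain. A minimal sketch on a made-up data frame follows (`toy` is hypothetical): complete.cases() flags the rows without any NA, and na.omit() drops the rest.

```r
toy <- data.frame(
  Age    = c(30, NA, 27, 26, 28),
  Height = c(182, 158, NA, 165, 178),
  Color  = c("Red", "Blue", "Blue", NA, "Green")
)

complete <- complete.cases(toy)   # TRUE for rows with no NA
sum(!complete)                    # Rows with at least one NA
mean(!complete)                   # Share of rows with at least one NA

toy_dropped <- na.omit(toy)       # Keeps only the complete rows
nrow(toy_dropped)
```

In this toy example three of the five rows contain an NA, so dropping them would leave only two rows.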

Your Turn

Question 9

How large is the proportion of rows with NAs? Keep in mind that we can drop up to 5%.

Answer 9:

Question 10

Do you think it would be wise to drop the above percentage?

Answer 10:

Question 11

How would this affect your dataset?

Answer 11:

Cleanup Continued

Step 4: Count columns with NAs

#  
# Step 4: Count columns with NAs

# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas

## [1] 15

Percent_col_NA

## [1] "100%"
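
A single yes/no per column hides how much is missing in each one. The following minimal sketch on a made-up data frame (`toy` is hypothetical) reports the count and the share of NAs per column, which is more useful when deciding how to treat each variable.

```r
toy <- data.frame(
  Age    = c(30, NA, 27, 26),
  Height = c(182, 158, NA, NA),
  Weight = c(80, 62, 78, 65)
)

na_count <- colSums(is.na(toy))    # Number of NAs in each column
na_share <- colMeans(is.na(toy))   # Proportion of NAs in each column

data.frame(na_count, na_share)
```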

Your Turn

Question 12

How large is the proportion of columns with NAs? We never want to drop entire columns, as this would mean losing variables and associations, but do you think it would be wise to drop the above percentage?

Answer 12:

Question 13

How would this affect your dataset?

Answer 13:

Imputation

Step 5: Replace NAs with appropriate values (mean for numeric and integer, mode for factor, “NA” for character)

In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that using the mean of the variable for numeric columns and the mode for categorical columns is correct. This is not always the case, of course, but the code would become much more complicated otherwise.

# 
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics and you will discuss this code.
# for now, you can assume that by setting the mean of the variable for numeric
# and mode for categorical it is correct - this is not always the case of course
# but the code will become much more complicated in that case.


# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      col[is.na(col)] <- approx(seq_along(col), col, xout = which(is.na(col)), rule = 2)[["y"]] # Linear interpolation at the NA positions (needs at least two non-NA values)
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with "NA"
  }
  return(col) # Return the modified column
})

df <- as.data.frame(df) # Convert the list back to a dataframe


#
# Imputing with the above method has now changed some of the statistics


# Check the updated dataset and ensure no remaining NAs
summary(df)

##       ID          Gender           Age            Height              Weight   
##  Female:210   Min.   :25.00   Min.   :155.0   Min.   :54.00   Bachelor's :212  
##  Male  :196   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:65.00   High School: 96  
##  Other : 94   Median :28.00   Median :172.9   Median :70.00   Master's   :115  
##               Mean   :28.43   Mean   :172.9   Mean   :70.24   PhD        : 77  
##               3rd Qu.:30.00   3rd Qu.:182.0   3rd Qu.:80.00                    
##               Max.   :34.00   Max.   :190.0   Max.   :90.00                    
##                                                                                
##    Education       Income       MaritalStatus   Employment  Score    
##  45000  :161   Married:200   Employed  :384   6.2    :180   5.7:  4  
##  60000  : 82   Single :300   Single    :  4   7.8    : 69   A  :152  
##  48000  : 65                 Unemployed:112   6.1    : 52   B  :211  
##  65000  : 58                                  5.7    : 47   C  :133  
##  42000  : 30                                  7.5    : 29            
##  55000  : 23                                  5.5    : 28            
##  (Other): 81                                  (Other): 95            
##    Category      Color             Hobby       Happiness     Location  
##  A     :  4   Blue  :216   Green      :  4   7      :185   6     :  4  
##  Art   :185   Green :130   Photography: 75   8.5    : 74   City  :230  
##  Music :148   Red   :150   Reading    :146   8      : 69   Rural :172  
##  Sports:163   Sports:  4   Swimming   : 91   6      : 65   Suburb: 94  
##                            Traveling  :184   9      : 60               
##                                              7.5    : 20               
##                                              (Other): 27
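
The effect of this kind of imputation is easiest to see on a small made-up example. The sketch below uses the hypothetical vectors `x` and `f`, not the course data: filling NAs with the mean leaves the mean unchanged but shrinks the spread, and filling with the mode inflates the count of the most common level.

```r
# Numeric example: impute with the mean
x <- c(1, 2, NA, 4, 5, NA, 9)
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)

mean(x, na.rm = TRUE); mean(x_imputed)   # Same mean
sd(x, na.rm = TRUE);   sd(x_imputed)     # Smaller standard deviation after imputation

# Categorical example: impute with the mode
f <- factor(c("A", "B", NA, "A", "C", NA))
mode_val <- names(which.max(table(f)))   # Most common level
f_imputed <- f
f_imputed[is.na(f_imputed)] <- mode_val

table(f)           # Counts before imputation
table(f_imputed)   # The modal level absorbs the former NAs
```

The same pattern is worth looking for when you compare the two summaries of df: unchanged means, shifted medians or quartiles, and larger counts for the most frequent category.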

Your Turn

Essay Question

Run summary(df) and compare the result with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any NAs left in your file? What information does the summary print, and how can it be interpreted? What are your observations? Verify the effects of imputation and explain them in detail. Compare the updated summary with the earlier statistics and note the changes. Explain everything that you observe.

Answer