Already Completed Code

As we move from week to week and task to task, the code that you have already completed will stay on the template but will not run. This is accomplished by adding eval=FALSE to the corresponding code chunk. Note that the libraries still need to be loaded in this program as well.
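For reference, a minimal sketch of what such a chunk looks like in the .Rmd source (the chunk label week1_code is an illustrative placeholder, not part of the template):

```{r week1_code, eval=FALSE}
# code completed in an earlier week stays visible here, but is not executed
```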

# Install and load necessary libraries
#install.packages("ggplot2") # Install ggplot2 for plotting; if you have already installed the packages, comment this out by entering a # in front of this command
#install.packages("scales")  # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2)            # Load ggplot2 library
library(scales)             # Load scales library
library(moments)            # Load moments library (installed above for skewness and kurtosis)

Setting up your working directory on your computer.

This needs to be addressed before the data file can be read.

# Check the current working directory
getwd()

# In the next line, change the directory to the place where you saved the
# data file; if you prefer, you can save your data.csv file in the directory
# that the getwd() command above indicated.
# For example, your next line should look something like this: setwd("C:/Users/tsapara/Documents")

# Set the working directory to where the data file is located
# This ensures the program can access the file correctly

setwd("C:/Users/ITsapara/Downloads")

### Choose a directory that already exists on your computer.
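Before reading the data, one optional sanity check (a suggestion, not part of the assignment) is to confirm that the file is visible from the working directory you just set:

# Check that data.csv can be found in the current working directory
file.exists("data.csv") # should return TRUE before you try to read the file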

Setting up your personalized data

# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors prevents automatic conversion of strings to factors
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 34
B <- 99
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility


# Generate a random sample of 500 rows from the data set
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the data set

write.csv(df, file = "my_data.csv", row.names = FALSE) # this command may take some time to run; once it is done, it will create the desired data file locally in your directory

Knit your file

As practice, you may now want to knit your file to HTML. To do this, click the Knit button on the top panel and wait for the file to render. The HTML will open once it is done for you to review.

It is recommended that you practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML

In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html

df <- read.csv("my_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(df)
##     ID Gender Age Height Weight   Education Income MaritalStatus Employment
## 1   31 Female  29    155     54    Master's  55000       Married   Employed
## 2  957  Other  30    165     NA High School     NA        Single Unemployed
## 3  964   Male  31    180     NA  Bachelor's  50000       Married   Employed
## 4   34 Female  27    168     65  Bachelor's  45000        Single   Employed
## 5  968   Male  30    182     80    Master's  65000       Married   Employed
## 6 1040   Male  29    175     70         PhD  60000       Married   Employed
##   Score Rating Category Color     Hobby Happiness Location
## 1   7.3      B    Music  Blue  Swimming       8.2     City
## 2   5.5      A   Sports Green   Reading        NA    Rural
## 3   6.5      A      Art Green   Reading       8.0    Rural
## 4   6.2      B      Art  Blue   Reading       7.0     City
## 5    NA      C    Music   Red Traveling       8.5     City
## 6   7.8      C   Sports   Red Traveling       9.0   Suburb

Cleaning up your data

Step 0: Now that you have read the file, you want to learn some basic information about your data.

The following commands will not be explained here; do your research, review your CSV file, and answer the questions related to this part of your code.

# Basic exploratory commands
nrow(df)       # Number of rows in the data set
## [1] 500
length(df)     # Number of columns (or variables) in the data set
## [1] 16
str(df)        # Structure of the data set (data types and a preview)
## 'data.frame':    500 obs. of  16 variables:
##  $ ID           : int  31 957 964 34 968 1040 31 1051 32 1082 ...
##  $ Gender       : Factor w/ 3 levels "Female","Male",..: 1 3 2 1 2 2 1 1 2 3 ...
##  $ Age          : int  29 30 31 27 30 29 29 26 31 28 ...
##  $ Height       : int  155 165 180 168 182 175 155 160 180 185 ...
##  $ Weight       : int  54 NA NA 65 80 70 54 55 NA 85 ...
##  $ Education    : Factor w/ 5 levels "","Bachelor's",..: 4 3 2 2 4 5 4 2 2 3 ...
##  $ Income       : int  55000 NA 50000 45000 65000 60000 55000 48000 50000 NA ...
##  $ MaritalStatus: Factor w/ 3 levels "","Married","Single": 2 3 2 3 2 2 2 3 2 3 ...
##  $ Employment   : Factor w/ 3 levels "","Employed",..: 2 3 2 2 2 2 2 2 2 3 ...
##  $ Score        : num  7.3 5.5 6.5 6.2 NA 7.8 7.3 6.1 6.5 5.7 ...
##  $ Rating       : Factor w/ 4 levels "","A","B","C": 3 2 2 3 4 4 3 3 2 2 ...
##  $ Category     : Factor w/ 4 levels "","Art","Music",..: 3 4 2 2 3 4 3 2 2 4 ...
##  $ Color        : Factor w/ 4 levels "","Blue","Green",..: 2 3 3 2 4 4 2 2 3 3 ...
##  $ Hobby        : Factor w/ 5 levels "","Photography",..: 4 3 3 3 5 5 4 4 3 2 ...
##  $ Happiness    : num  8.2 NA 8 7 8.5 9 8.2 7 8 6 ...
##  $ Location     : Factor w/ 4 levels "","City","Rural",..: 2 3 3 2 2 4 2 3 3 3 ...
summary(df)    # Summary statistics for each column
##        ID            Gender         Age            Height          Weight     
##  Min.   :   1.0   Female:221   Min.   :25.00   Min.   :155.0   Min.   :54.00  
##  1st Qu.:  59.0   Male  :184   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:60.00  
##  Median :1001.0   Other : 95   Median :28.00   Median :175.0   Median :70.00  
##  Mean   : 766.7                Mean   :28.15   Mean   :172.3   Mean   :69.38  
##  3rd Qu.:1059.2                3rd Qu.:29.00   3rd Qu.:182.0   3rd Qu.:80.00  
##  Max.   :1117.0                Max.   :34.00   Max.   :190.0   Max.   :90.00  
##                                NA's   :27      NA's   :18      NA's   :28     
##        Education       Income      MaritalStatus      Employment 
##             : 20   Min.   :32000          : 16             : 21  
##  Bachelor's :189   1st Qu.:45000   Married:192   Employed  :364  
##  High School:104   Median :48000   Single :292   Unemployed:115  
##  Master's   : 96   Mean   :51551                                 
##  PhD        : 88   3rd Qu.:60000                                 
##  NA's       :  3   Max.   :70000                                 
##                    NA's   :77                                    
##      Score      Rating    Category     Color             Hobby    
##  Min.   :5.50    :  4         :  6        :  1              :  4  
##  1st Qu.:6.10   A:147   Art   :173   Blue :207   Photography: 90  
##  Median :6.20   B:215   Music :145   Green:135   Reading    :125  
##  Mean   :6.63   C:134   Sports:176   Red  :157   Swimming   :100  
##  3rd Qu.:7.50                                    Traveling  :181  
##  Max.   :8.90                                                     
##  NA's   :67                                                       
##    Happiness       Location  
##  Min.   :6.000         :  5  
##  1st Qu.:7.000   City  :209  
##  Median :7.000   Rural :191  
##  Mean   :7.502   Suburb: 95  
##  3rd Qu.:8.500               
##  Max.   :9.000               
##  NA's   :13

Your Turn

Please answer the following questions by typing your answer after each question.

Question 1

What type of variables does your file include?

Answer 1:

There are 16 variables in this data frame, comprising both categorical and quantitative variables. Categorical variables include Gender, Education, MaritalStatus, Employment, Rating, Category, Color, Hobby, and Location. Quantitative variables include Age, Height, Weight, Income, Score, and Happiness; ID is also stored numerically, although it serves as an identifier rather than a true quantitative measure.

Question 2

Specific data types?

Answer 2:

The data types in this data frame include numeric, integer, and factor (text). The numeric data types are stored as double (double-precision floating point) for higher accuracy, as seen in the initial inspection of the data using head(df) and str(df). The text data types have been converted to factors. Gender has 3 levels: “Male”, “Female”, and “Other”. Education has been converted to 5 levels: “Bachelor’s”, “High School”, “Master’s”, “PhD”, and a blank category. MaritalStatus has been converted to 3 levels: “Married”, “Single”, and a blank category. Employment has been converted to 3 levels: “Employed”, “Unemployed”, and a blank category. Rating has been converted to 4 levels: “A”, “B”, “C”, and a blank category. Category has been converted to 4 levels: “Art”, “Music”, “Sports”, and a blank category. Color has also been converted to 4 levels: “Blue”, “Green”, “Red”, and a blank category. Hobby has been converted to 5 levels: “Photography”, “Reading”, “Swimming”, “Traveling”, and a blank category. Finally, Location has been converted to 4 levels: “City”, “Rural”, “Suburb”, and a blank category.
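If you want to confirm these types directly rather than reading them off the str() output, one quick check (a suggestion, not required by the assignment) is to ask R for the class of each column:

# Report the class of every column in the data frame
sapply(df, class)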

Question 3

Are they read properly?

Answer 3:

The CSV file was read with the correct delimiter, and the header row was recognized so the column names were read properly. Upon initial inspection, the columns appear to have the appropriate data types, with numeric data for variables that would be expected to be numeric, such as “Height” and “Weight”. Factor variables also have sensible levels, such as “Employed” and “Unemployed” for “Employment”.

Question 4

Are there any issues?

Answer 4:

Upon inspection, there are issues with this data that will need to be managed. The first issue is missing values: there are NAs in 7 of the variables (Age, Height, Weight, Education, Income, Score, and Happiness), and there are blanks in 8 of the variables (Education, MaritalStatus, Employment, Rating, Category, Color, Hobby, and Location). A second issue is that variable names could be more specific. For example, the variable “Color” is unclear: with “Blue”, “Green”, and “Red” as categories, it could be hair color, eye color, or favorite color. “Weight” appears, based on its values, to be recorded in kilograms (not pounds), so a name such as Weight_kilo would make the unit explicit. “Score”, “Rating”, and “Happiness” appear to relate to the results of a test, questionnaire, or survey, but the variable names do not indicate this. Specific variable names matter when data frames are modified later, since ambiguous or overlapping names and definitions can lead to errors in analysis.

Question 5

Does your file include both NAs and blanks?

Answer 5:

Yes, this file includes both NAs and blanks.

Question 6

How many NAs do you have?

Answer 6:

total_na_count <- sum(is.na(df))
total_na_count
## [1] 233

“Age” has 27 NAs, “Height” has 18 NAs, “Weight” has 28 NAs, “Education” has 3 NAs, “Income” has 77 NAs, “Score” has 67 NAs, and “Happiness” has 13 NAs. In total, this file has 233 NAs.
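The per-variable counts above can be verified with a one-line sketch, using the same idiom as the total count but applied column by column:

# Count NAs separately for each column
colSums(is.na(df))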

Question 7

How many blanks?

Answer 7:

“Education” has 20 blanks, “MaritalStatus” has 16 blanks, “Employment” has 21 blanks, “Rating” has 4 blanks, “Category” has 6 blanks, “Color” has 1 blank, “Hobby” has 4 blanks and “Location” has 5 blanks. In total, this file has 77 blanks.
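The blank counts can be computed analogously. A minimal sketch, using the same df == "" comparison that Step 1 below uses to replace blanks (na.rm = TRUE skips the cells that are already NA):

# Count blank ("") entries separately for each column
colSums(df == "", na.rm = TRUE)

# Total number of blanks in the data set
sum(df == "", na.rm = TRUE)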

Cleanup Continued

Step 1: Handling both blanks and NAs is not simple, so first we want to eliminate some of those. Let’s eliminate the blanks and change them to NAs.

#
# Step 1: Handling both blanks and NAs is not simple, so first we want to
# eliminate some of those; let's eliminate the blanks and change them to NAs
#


# Replace blanks with NAs across the data set
# This ensures that blank values are consistently treated as missing data
df[df == ""] <- NA

# Convert specific columns to factors
# This step ensures categorical variables are treated correctly after replacing blanks
factor_columns <- c("Gender", "Education", "Rating", "MaritalStatus", "Category", 
                    "Employment", "Color", "Hobby", "Location")
df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))
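As an optional check (not required by the assignment), you can confirm that re-factoring dropped the empty-string level, since as.factor(as.character(col)) keeps only levels that actually occur:

# The "" level should no longer appear among the levels
levels(df$Education)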

Cleanup Continued

Step 2: Count NAs in the entire data set

#
# Step 2: Count NAs in the entire data set


# Count the total number of NAs in the data set
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values
## [1] 310

Your Turn

Please answer the following questions by typing your answer after each question.

Question 8

Explain what the printed number is, what information it relays, and how you can use it in your analysis.

Answer 8:

The printed number of 310 is the updated total number of NAs in this data frame. It now includes the previous 233 NAs plus the 77 additional NAs created when the blanks were converted to NAs. The benefit of converting blanks to NAs is consistency: all missing values are now NAs and can be imputed so that every variable has a complete set of values for analysis.

This number also provides information about the overall quality of the data in the file. When a file has a high percentage of missing values, the accuracy and reliability of the data may be affected. The overall percentage of missing values describes the completeness of the file, while the percentage of missing values within each variable describes the quality of that variable. Variables with a higher volume of missing values may be less reliable than variables with fewer, depending on the variance: variables whose values have high variance may be more affected, because some methods of imputation (such as using means) can introduce bias.
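As a quick illustration of the completeness point, the printed count can be turned into an overall share of missing cells; 310 NAs out of 500 rows x 16 columns comes to roughly 4% of all cells. A minimal sketch, reusing percent() from the already-loaded scales package:

# Share of missing cells across the whole data set
percent(total_nas / (nrow(df) * length(df)))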

Cleanup Continued

Step 3: Count rows with NAs.

#
# Step 3: Count rows with NAs
#

# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas
## [1] 255
Percent_row_NA
## [1] "51%"

Your Turn

Question 9

How large is the proportion of the rows with NAs? We can drop up to 5%.

Answer 9:

In this data frame, there is a very high percentage of rows with NAs: 51% of the rows include NAs. No, we cannot drop these rows, because the proportion far exceeds the 5% guideline; we would be eliminating over half of the data and introducing bias into the analysis.

Question 10

Do you think that would be wise to drop the above percent?

Answer 10:

I do not think it would be wise to drop this percentage of rows. Currently 255 rows, 51% of all rows, include NAs, which far exceeds the suggested 5% threshold. Eliminating this much of the data would reduce the validity of this data frame, and the analysis results would no longer be representative of the data in the file.

Question 11

How will this affect your data set?

Answer 11:

Eliminating 51% of the rows would eliminate 51% of the observations and increase the likelihood of bias in the analysis due to the reduced size of the data set. The full data set is more likely to provide accuracy by including sufficient representative values, reducing the impact of outlier values, and reducing the margin of error for stronger statistical analysis.
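For illustration only (we do not actually drop any rows here), the effect described above can be quantified with na.omit(), which removes every row containing at least one NA:

# How many complete rows would remain if all rows with NAs were dropped
nrow(na.omit(df)) # 500 - 255 = 245 rows, i.e. less than half the sample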

Cleanup Continued

Step 4: Count columns with NAs

#  
# Step 4: Count columns with NAs

# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas
## [1] 14
Percent_col_NA
## [1] "88%"

Your Turn

Question 12

How large is the proportion of the columns with NAs? We never want to drop entire columns, as this would mean that we lose variables and associations, but do you think it would be wise to drop the above percent?

Answer 12:

The proportion of columns in this data frame with NAs is currently 88%, meaning that almost all of the columns include NAs. Dropping columns with NAs would eliminate valuable variables that may later be required for analysis. One option that would reduce the number of columns eliminated is to filter out only the columns with a high percentage of NAs; for example, select_if() in dplyr could be used to drop columns that have more than a certain percentage of values missing (a base-R sketch of the same idea is shown below). None of the columns in this data set has more than 20% of its values missing, so this particular option would not show a benefit here: the two worst columns are “Income”, with 15.4% NAs, and “Score”, with 13.4% NAs, and neither exceeds such a threshold.
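As referenced above, a base-R sketch of that filtering idea (the 20% threshold is illustrative, not prescribed by the assignment):

# Share of missing values in each column
na_share <- colSums(is.na(df)) / nrow(df)

# Keep only the columns whose share of NAs is at most 20%
# (for this data set, no column exceeds 20%, so nothing would be dropped)
df_filtered <- df[, na_share <= 0.20, drop = FALSE]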

Question 13

How will this affect your data set?

Answer 13:

The data set would be much more limited in volume. Only the variables “ID” and “Gender” would remain, because all of the other variables contain NAs.

Imputation

Step 5: Replace NAs with appropriate values (mean for numeric and integer, mode for factor, “NA” for character).

In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that using the mean of the variable for numeric data and the mode for categorical data is correct - this is not always the case, of course, but the code would become much more complicated in that case.

# 
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics, and you will discuss this code.
# For now, you can assume that using the mean for numeric variables and the
# mode for categorical variables is correct - this is not always the case, of
# course, but the code would become much more complicated in that case.


# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      col[is.na(col)] <- approx(seq_along(col), col, n = length(col))[["y"]][is.na(col)] # Interpolation
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with "NA"
  }
  return(col) # Return the modified column
})

df <- as.data.frame(df) # Convert the list back to a dataframe
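Before reviewing the summary, a one-line check (a suggestion, not part of the template) confirms that the imputation left no missing values behind:

# Total NAs remaining after imputation; expected to be 0
sum(is.na(df))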


#
# Imputing with the above method has now changed some of the statistics


# Check the updated data set and ensure no remaining NAs
summary(df)
##        ID            Gender         Age            Height          Weight     
##  Min.   :   1.0   Female:221   Min.   :25.00   Min.   :155.0   Min.   :54.00  
##  1st Qu.:  59.0   Male  :184   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:62.00  
##  Median :1001.0   Other : 95   Median :28.00   Median :172.3   Median :69.38  
##  Mean   : 766.7                Mean   :28.15   Mean   :172.3   Mean   :69.38  
##  3rd Qu.:1059.2                3rd Qu.:29.00   3rd Qu.:182.0   3rd Qu.:80.00  
##  Max.   :1117.0                Max.   :34.00   Max.   :190.0   Max.   :90.00  
##        Education       Income      MaritalStatus      Employment 
##  Bachelor's :212   Min.   :32000   Married:192   Employed  :385  
##  High School:104   1st Qu.:45000   Single :308   Unemployed:115  
##  Master's   : 96   Median :51551                                 
##  PhD        : 88   Mean   :51551                                 
##                    3rd Qu.:60000                                 
##                    Max.   :70000                                 
##      Score      Rating    Category     Color             Hobby    
##  Min.   :5.50   A:147   Art   :173   Blue :208   Photography: 90  
##  1st Qu.:6.10   B:219   Music :145   Green:135   Reading    :125  
##  Median :6.20   C:134   Sports:182   Red  :157   Swimming   :100  
##  Mean   :6.63                                    Traveling  :185  
##  3rd Qu.:7.30                                                     
##  Max.   :8.90                                                     
##    Happiness       Location  
##  Min.   :6.000   City  :214  
##  1st Qu.:7.000   Rural :191  
##  Median :7.000   Suburb: 95  
##  Mean   :7.502               
##  3rd Qu.:8.500               
##  Max.   :9.000

Your Turn

Essay Question

Run summary(df) and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any more NA’s in your file? What is the information that is printed by the summary? How can this be interpreted? What are your observations? Verify the effects of imputation, and explain in detail. Compare the updated summary with the earlier statistics and note changes. Explain everything that you observe.

Answer

The summary(df) command provides a summary of the statistics of a data frame. For categorical data, the summary provides the number of values in each category. For example, the “Gender” variable includes three levels: “Female”, “Male”, and “Other”. The summary provides the total number of values in each category: 221 in “Female”, 184 in “Male”, and 95 in “Other”. This data set also includes numerical data, and additional statistical information is included for numeric variables: the minimum value, the maximum value, the first and third quartiles (for information on spread), the median, and the mean. From the summary, we can rapidly identify the distribution of values across variables and general trends.

A comparison of the data frame summaries before and after imputation shows the changes caused by imputing mean values for NAs. All NAs were imputed, and no NAs remain in the post-imputation data set. The “ID” and “Gender” variables remained the same because these variables did not include NAs or blanks initially. The “Happiness” variable summary remained the same although 13 NAs were imputed. The “Age” variable likewise revealed no differences following imputation, with the Min, Max, 1st and 3rd Quartiles, Mean, and Median all unchanged. The “Height” variable revealed a change in the Median, from 175 before imputation to 172.3 (equal to the Mean) after imputation. For “Weight”, both the 1st Quartile and the Median changed: the 1st Quartile moved from 60.00 to 62.00 and the Median from 70.00 to 69.38 (equal to the Mean). The “Education” variable changed after imputation, raising the number of observations with a Bachelor’s from 189 to 212. The “Income” variable’s Median changed from 48000 before imputation to 51551 (equal to the Mean) after imputation. “MaritalStatus” showed a change in the “Single” level, from 292 observations before imputation to 308 after. “Employment” increased the number of observations in the “Employed” level from 364 to 385, with “Unemployed” remaining at 115. For “Score”, the Min, 1st Quartile, Median, Mean, and Max remained the same, while the 3rd Quartile moved from 7.50 to 7.30. For “Rating”, the 4 missing values were added to the “B” level, changing “B” from 215 to 219. In the “Category” variable, the “Sports” level increased from 176 to 182, with “Art” and “Music” remaining the same. For “Color”, the “Blue” level changed from 207 to 208, with “Green” and “Red” remaining the same. For “Hobby”, the levels “Photography”, “Reading”, and “Swimming” remained the same, while “Traveling” increased from 181 to 185. Finally, “Location” showed a change in the “City” level, from 209 observations before imputation to 214 after.

In general, imputation led to a complete data set, with removal of blanks and NAs to prepare the data set for further analysis. Imputation provided a method for analysis of data with missing values without omitting variables, deleting observations, or reducing the size of the data set. This increases the likelihood that the data set is both reliable and valid.

A frequent occurrence with numeric data following imputation was an adjustment to the Median, with the post-imputation value equal to the Mean. This is not a surprising result: the imputation method used the Mean to fill missing values, increasing the number of values equal to the Mean and therefore pulling the Median toward the Mean.
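A small toy example (the numbers here are made up purely for illustration) shows why mean imputation pulls the median toward the mean:

# Toy vector with two missing values
x <- c(5, 6, 7, NA, NA, 9)

# Replace the NAs with the mean of the observed values (6.75)
x[is.na(x)] <- mean(x, na.rm = TRUE)

median(x) # now 6.75, identical to the mean, because the imputed values cluster there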

The imputation method used also replaced missing categorical values with the mode, which may have introduced unanticipated bias. For example, the “Education” variable’s “Bachelor’s” level rose from 189 to 212 observations. A data scientist may need to ask questions about the variable, such as “Were high school dropouts excluded from the data?”, to ensure that using the mode for this variable does not bias the data set. Similarly, “MaritalStatus” increased the “Single” level by 16 observations (292 before imputation to 308 after), and “Employment” increased the “Employed” level by 21 observations (364 before imputation to 385 after).

The changes in the categorical variables appear small compared to the overall 500 observations per variable. Overall, the data set following imputation appears cleaner and more complete than the data set prior to imputation.