As we move from week to week and task to task, the code that you have already completed will stay in the template but will not run. This is done by adding eval=FALSE to the corresponding code chunk. Note that the libraries still need to be loaded in this program as well.
#```{r install}
#install.packages("ggplot2") # Install ggplot2 for plotting; if you have already installed the packages, comment this out by entering a # in front of this command
#install.packages("scales") # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2) # Load ggplot2 library
library(scales) # Load scales library
## Setting up your directory on your computer.
This needs to be addressed here.
``` r
# Check the current working directory
getwd()
## [1] "C:/Users/cdaniels/Downloads/Assignment 4 Validate a Provided Program on Different Data to Demonstrate Understanding of Data Preprocessing attached files Sep 4, 2025 1232 PM"
# In the next line, change the directory to the place where you saved the
# data file. If you prefer, you can save your data.csv file in the directory
# that getwd() reported above.
# For example, your next line should look something like this: setwd("C:/Users/tsapara/Documents")
# Set the working directory to where the data file is located
# This ensures the program can access the file correctly
setwd("C:/Users/cdaniels/Downloads/Assignment 4 Validate a Provided Program on Different Data to Demonstrate Understanding of Data Preprocessing attached files Sep 4, 2025 1232 PM")
### Choose an already existing directory in your computer.
# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors = TRUE converts string columns to factors when the file is read
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 672
B <- 682
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility
# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset
write.csv(df, file = "my_data.csv", row.names = FALSE) # This command may take some time to run. Once it is done, it will have created the desired data file locally in your directory.
```
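To see why the seed matters, the following minimal sketch repeats a seeded draw twice on a small made-up data frame. The `toy` data and the sample size of 8 are assumptions for illustration only; the seed 1354 is simply A + B from the chunk above.

```r
# Toy data frame standing in for the course data (hypothetical values)
toy <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

# Seed taken from A + B (672 + 682) in the chunk above
seed_value <- 1354

set.seed(seed_value)
first_draw <- toy[sample(nrow(toy), 8, replace = TRUE), ]

set.seed(seed_value)
second_draw <- toy[sample(nrow(toy), 8, replace = TRUE), ]

# Same seed, same rows in the same order
identical(first_draw, second_draw)

# With replace = TRUE the same row can be drawn more than once
any(duplicated(first_draw$id))
```

Because the sampling is done with replacement, the 500-row sample written to my_data.csv can contain repeated rows from data.csv.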
As practice, you may now want to knit your file to HTML. To do this, click the Knit button on the top panel and wait for the file to render. The HTML will open once rendering is done so that you can review it.
It is recommended to practice with RMD and to download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML
In addition, you may want to alter some of the editor components and re-knit your file to gain a better understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html
df <- read.csv("my_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
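As a rough illustration of what `stringsAsFactors` changes, the sketch below reads the same small CSV text twice. The inline `csv_text` and its column names are made up for this example and are not part of the assignment data.

```r
# Hypothetical three-row CSV with one blank Color field
csv_text <- "Name,Color\nAna,Blue\nBo,\nCy,Red\n"

as_factor <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$Color)   # "factor"
levels(as_factor$Color)  # the blank field is kept as the level ""
class(as_char$Color)     # "character"
```

This is why blank cells show up as an empty factor level in `str(df)` and as an unlabeled first row in `summary(df)`, rather than as NA.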
Step 0. Now that you have read the file, you want to learn some information about your data.
The following commands will not be explained here. Do your own research, review your CSV file, and answer the questions related to this part of your code.
# Basic exploratory commands
nrow(df) # Number of rows in the dataset
## [1] 500
length(df) # Number of columns (or variables) in the dataset
## [1] 15
str(df) # Structure of the dataset (data types and a preview)
## 'data.frame': 500 obs. of 15 variables:
## $ ID : Factor w/ 4 levels "","Female","Male",..: 3 4 2 3 2 3 4 3 3 2 ...
## $ Gender : int 29 NA 29 30 26 29 28 29 29 26 ...
## $ Age : int 175 165 NA 182 165 175 185 175 175 160 ...
## $ Height : int 70 56 65 80 62 70 85 70 70 55 ...
## $ Weight : Factor w/ 5 levels "","Bachelor's",..: 5 5 2 4 1 5 3 5 5 2 ...
## $ Education : Factor w/ 15 levels "","32000","35000",..: 12 11 10 13 8 12 1 12 12 8 ...
## $ Income : Factor w/ 3 levels "","Married","Single": 2 3 3 2 3 2 3 2 2 3 ...
## $ MaritalStatus: Factor w/ 4 levels "","Employed",..: 2 1 2 2 2 2 4 2 2 2 ...
## $ Employment : Factor w/ 15 levels "","5.5","5.7",..: 12 12 9 1 4 12 3 12 12 5 ...
## $ Score : Factor w/ 5 levels "","5.7","A","B",..: 5 4 4 5 3 5 3 5 5 4 ...
## $ Category : Factor w/ 5 levels "","A","Art","Music",..: 5 4 4 4 3 5 5 5 5 3 ...
## $ Color : Factor w/ 5 levels "","Blue","Green",..: 4 2 2 4 3 4 3 4 4 2 ...
## $ Hobby : Factor w/ 6 levels "","Green","Photography",..: 6 6 5 6 4 6 3 6 6 5 ...
## $ Happiness : Factor w/ 10 levels "","6","6.5","7",..: 9 8 2 8 4 9 2 9 9 4 ...
## $ Location : Factor w/ 5 levels "","6","City",..: 5 5 3 3 3 5 4 5 5 4 ...
summary(df) # Summary statistics for each column
## ID Gender Age Height Weight
## : 2 Min. :25.00 Min. :155.0 Min. :54.00 : 24
## Female:186 1st Qu.:27.00 1st Qu.:165.0 1st Qu.:62.00 Bachelor's :155
## Male :218 Median :29.00 Median :175.0 Median :70.00 High School:102
## Other : 94 Mean :28.68 Mean :173.3 Mean :70.87 Master's :125
## 3rd Qu.:30.00 3rd Qu.:182.0 3rd Qu.:80.00 PhD : 91
## Max. :34.00 Max. :190.0 Max. :90.00 NA's : 3
## NA's :34 NA's :15 NA's :50
## Education Income MaritalStatus Employment Score
## 60000 : 88 : 16 : 15 : 76 : 2
## : 79 Married:224 Employed :370 7.8 : 74 5.7: 1
## 48000 : 71 Single :260 Single : 1 6.2 : 70 A :162
## 45000 : 69 Unemployed:114 6.1 : 55 B :186
## 65000 : 64 5.7 : 50 C :149
## 55000 : 26 (Other):170
## (Other):103 NA's : 5
## Category Color Hobby Happiness Location
## : 8 : 3 : 5 7 :142 : 3
## A : 1 Blue :200 Green : 1 8.5 : 78 6 : 1
## Art :164 Green :131 Photography: 84 9 : 77 City :206
## Music :145 Red :165 Reading :134 6 : 64 Rural :177
## Sports:182 Sports: 1 Swimming : 93 8 : 61 Suburb:113
## Traveling :183 (Other): 70
## NA's : 8
Please answer the following questions by typing your answer after each question.
Question 1
What type of variables does your file include?
Answer 1:
A and B based on my student ID A <- 672 B <- 682
Question 2
Specific data types?
Answer 2:
random sample of 500 sample_size <- 500
Question 3
Are they read properly?
Answer 3:
yes
Question 4
Are there any issues?
Answer 4:
no
Question 5
Does your file include both NAs and blanks?
Answer 5:
yes
Question 6
How many NAs do you have?
Answer 6:
455
Question 7
How many blanks?
Answer 7:
243
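One possible way to count NAs and blanks separately, before the blanks are converted, is sketched below on a made-up data frame. The `toy` columns and values are assumptions for illustration; the same two expressions could be applied to `df`.

```r
# Hypothetical data with both NAs and blank strings
toy <- data.frame(
  Age   = c(29, NA, 31, 26),
  Color = factor(c("Blue", "", "Red", NA)),
  Hobby = factor(c("", "Reading", "", "Swimming"))
)

# NAs per column
na_per_column <- colSums(is.na(toy))

# Blanks per column; na.rm = TRUE skips cells that are already NA
blank_per_column <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

na_per_column
blank_per_column
sum(na_per_column)     # total NAs: 2
sum(blank_per_column)  # total blanks: 3
```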
Step 1: Handling both blanks and NAs is not simple, so first we want to eliminate one of the two. Let’s eliminate the blanks by changing them to NAs.
#
# Step 1: Handling both blanks and NAs is not simple, so first we want to eliminate
# one of the two. Let's eliminate the blanks by changing them to NAs.
#
# Replace blanks with NAs across the dataset
# This ensures that blank values are consistently treated as missing data
df[df == ""] <- NA
# Convert specific columns to factors
# This step ensures categorical variables are treated correctly after replacing blanks
factor_columns <- c("Gender", "Education", "Score", "MaritalStatus", "Category",
"Employment", "Color", "Hobby", "Location")
df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))
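The two operations in this step can be sketched on a small made-up data frame, which also shows why the factor columns are rebuilt afterwards: assigning NA does not remove the empty level by itself. The `toy` data is an assumption for illustration.

```r
# Hypothetical data: one blank in Color, one NA in Age
toy <- data.frame(
  Color = factor(c("Blue", "", "Red", "Blue")),
  Age   = c(29L, NA, 31L, 26L)
)

levels(toy$Color)     # includes the empty level ""

toy[toy == ""] <- NA  # blanks become NA
levels(toy$Color)     # "" is still listed as a level, although now unused

# Rebuilding the factor drops the unused "" level
toy$Color <- as.factor(as.character(toy$Color))
levels(toy$Color)     # "Blue" "Red"

sum(is.na(toy))       # 2: the original NA plus the converted blank
```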
Step 2: Count NAs in the entire dataset
#
# Step 2: Count NAs in the entire dataset
# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values
## [1] 369
Please answer the following questions by typing your answer after each question.
Question 8
Explain what the printed number is, what information it conveys, and how you can use it in your analysis.
Answer 8:
The printed number (369) displays the total number of missing values
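A total count is easier to interpret relative to the number of cells. For the output above, 369 missing values out of 500 × 15 = 7,500 cells is about 4.9%. The sketch below shows the same calculation on a made-up data frame; the `toy` values are assumptions for illustration.

```r
# Hypothetical data with four missing cells
toy <- data.frame(
  Age    = c(29, NA, 31, 26, NA),
  Height = c(175, 165, NA, 182, 165),
  Color  = factor(c("Blue", NA, "Red", "Blue", "Red"))
)

total_nas   <- sum(is.na(toy))
total_cells <- nrow(toy) * ncol(toy)

total_nas                # 4 missing cells
total_nas / total_cells  # share of all cells that are missing
mean(is.na(toy))         # the same proportion in a single call
```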
Step 3: Count rows with NAs.
#
# Step 3: Count rows with NAs
#
# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
label_percent_row_NA <- scales::label_percent(accuracy = 0.1)(rows_with_nas / nrow(df)) # Percentage of rows with NAs; label_percent() returns a formatter function, which is then applied to the proportion
rows_with_nas
## [1] 271
label_percent_row_NA
## [1] "54.2%"
Question 9
How large is the proportion of rows with NAs, given that we can drop up to 5%?
Answer 9:
About 54% (271 of 500 rows)
Question 10
Do you think it would be wise to drop the above percentage?
Answer 10:
No
Question 11
How will this affect your dataset?
Answer 11:
It will alter the sample size
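The effect of dropping every row that contains an NA can be sketched with `complete.cases()` and `na.omit()` from base R. The `toy` data below is made up for illustration.

```r
# Hypothetical data: rows 2, 3 and 4 each contain one NA
toy <- data.frame(
  Age    = c(29, NA, 31, 26, 30, 28),
  Height = c(175, 165, NA, 182, 165, 170),
  Color  = factor(c("Blue", "Red", "Red", NA, "Blue", "Red"))
)

# Rows with at least one NA, as a count and as a proportion
rows_with_nas <- sum(!complete.cases(toy))
rows_with_nas / nrow(toy)

# Dropping those rows shrinks the sample
complete_only <- na.omit(toy)
nrow(toy)            # 6
nrow(complete_only)  # 3
```

In this toy case half of the observations disappear even though only 3 of 18 cells are missing, which is the same pattern as in the output above.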
Step 4: Count columns with NAs
#
# Step 4: Count columns with NAs
# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
label_percent_col_NA <- scales::label_percent(accuracy = 0.1)(cols_with_nas / length(df)) # Percentage of columns with NAs; label_percent() returns a formatter function, which is then applied to the proportion
cols_with_nas
## [1] 15
label_percent_col_NA
## [1] "100.0%"
Question 12
How large is the proportion of columns with NAs? We never want to drop entire columns, as this would mean losing variables and associations, but do you think it would be wise to drop the above percentage?
Answer 12:
100% (15 of 15 columns). No
Question 13
How will this affect your dataset?
Answer 13:
It will mean that we lose variables and associations, essentially altering entire columns and the overall size of the dataset
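Counting columns with at least one NA treats a column with a single missing value the same as one that is mostly empty. A per-column proportion is often more informative; a minimal sketch on made-up data follows. The `toy` values are assumptions for illustration.

```r
# Hypothetical data: Age has 2 NAs, Height has none, Color has 1
toy <- data.frame(
  Age    = c(29, NA, 31, 26, NA),
  Height = c(175, 165, 170, 182, 165),
  Color  = factor(c("Blue", NA, "Red", "Blue", "Red"))
)

# Proportion of missing values in each column
na_share <- colMeans(is.na(toy))
na_share

# Columns ordered from most to least affected
sort(na_share, decreasing = TRUE)

# Names of the columns that contain at least one NA
names(na_share)[na_share > 0]
```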
Step 5: Replace NAs with appropriate values (mean for numeric and integer, mode for factor, “NA” for character)
In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that replacing NAs with the mean for numeric variables and with the mode for categorical variables is correct. This is not always the case, of course, but the code would become much more complicated otherwise.
#
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics and you will discuss this code.
# for now, you can assume that by setting the mean of the variable for numeric
# and mode for categorical it is correct - this is not always the case of course
# but the code will become much more complicated in that case.
# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      # Linear interpolation at the original positions; rule = 2 carries the
      # nearest observed value to leading or trailing NAs
      col[is.na(col)] <- approx(seq_along(col), col, xout = seq_along(col), rule = 2)[["y"]][is.na(col)]
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with "NA"
  }
  return(col) # Return the modified column
})
df <- as.data.frame(df) # Convert the list back to a dataframe
#
# following the above method to impute, has now changed some of the statistics
# Check the updated dataset and ensure no remaining NAs
summary(df)
## ID Gender Age Height Weight
## : 0 30 :132 Min. :155.0 Min. :54.00 : 0
## Female:186 28 : 76 1st Qu.:165.0 1st Qu.:65.00 Bachelor's :182
## Male :220 29 : 73 Median :175.0 Median :70.87 High School:102
## Other : 94 26 : 71 Mean :173.3 Mean :70.87 Master's :125
## 27 : 70 3rd Qu.:182.0 3rd Qu.:80.00 PhD : 91
## 32 : 33 Max. :190.0 Max. :90.00
## (Other): 45
## Education Income MaritalStatus Employment Score
## 60000 :167 : 0 Employed :385 7.8 :155 5.7: 1
## 48000 : 71 Married:224 Single : 1 6.2 : 70 A :162
## 45000 : 69 Single :276 Unemployed:114 6.1 : 55 B :188
## 65000 : 64 5.7 : 50 C :149
## 55000 : 26 5.5 : 34
## 42000 : 21 7.5 : 28
## (Other): 82 (Other):108
## Category Color Hobby Happiness Location
## A : 1 Blue :203 Green : 1 7 :171 6 : 1
## Art :164 Green :131 Photography: 84 8.5 : 78 City :209
## Music :145 Red :165 Reading :134 9 : 77 Rural :177
## Sports:190 Sports: 1 Swimming : 93 6 : 64 Suburb:113
## Traveling :188 8 : 61
## 7.5 : 23
## (Other): 26
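The side effects of this kind of imputation can be sketched on two small made-up vectors. Filling with the mean leaves the mean unchanged but shrinks the spread, and filling with the mode inflates the most common category, which matches what the summary above shows (for example, the count for 60000 rises from 88 to 167). The `height` and `color` values below are assumptions for illustration.

```r
# Hypothetical numeric column with two NAs
height <- c(160, 170, NA, 180, NA, 190)

imputed <- height
imputed[is.na(imputed)] <- mean(height, na.rm = TRUE)

mean(height, na.rm = TRUE)  # 175
mean(imputed)               # 175, unchanged by mean imputation
sd(height, na.rm = TRUE)    # spread of the observed values
sd(imputed)                 # smaller, because the filled values sit at the mean

# Hypothetical factor column with two NAs
color <- factor(c("Blue", "Red", NA, "Blue", NA))
mode_val <- names(sort(table(color), decreasing = TRUE))[1]
color[is.na(color)] <- mode_val

table(color)                # every NA has been added to the count for "Blue"
```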
Essay Question
Run summary(df) and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any more NAs in your file? What information is printed by the summary? How can it be interpreted? What are your observations? Verify the effects of imputation and explain in detail. Compare the updated summary with the earlier statistics and note the changes. Explain everything that you observe.
Answer