library(kableExtra)
library(magrittr)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(forcats)
library(ggplot2)
library(naniar)| Student name | Student number | Percentage of contribution | 
|---|---|---|
| Chandangowda Maruvanahalli Shivaramu | s4063920 | 100 | 
 
In this report, I present the data pre-processing steps undertaken to clean, prepare and transform two datasets for analysis. The datasets used pertain to university admissions and salary information, which were merged and manipulated to achieve a tidy, comprehensive dataset suitable for further analysis.
The preprocessing process began with data importation and an initial inspection to understand the structure and content of the datasets. To guarantee consistency across the two datasets, I standardized the university names, making it possible to accurately integrate the data based on the university names.
I then addressed the different data types of the columns, making sure the format of the numeric columns was proper and converting pertinent columns to factors when necessary. This involved turning the salary columns into numeric values and deleting the dollar signs and commas from them.
To further tidy the data, I reshaped the dataset by pivoting selected salary percentile columns into a long format, which facilitated easier analysis of salary distributions across different percentiles. On the basis of the admission rate, I also developed a new variable called Selectivity, which classified universities as “Highly Selective,” “Selective,” and “Less Selective.”
I then performed a comprehensive scan for missing values, special values, and obvious errors. Missing numeric values were handled by filling them with the mean of their respective columns, while missing factor values were filled with the mode. Special values, such as NaN and Inf, were identified and appropriately managed.
In numerical data, outliers were found using the Interquartile Range (IQR) technique. In order to ensure that the data remained robust for analysis without being distorted by extreme values, these outliers were addressed by capping them to the nearest non-outlier values.
Finally, I applied a log transformation to the Starting_Salary variable to reduce skewness and normalize its distribution, enhancing the interpretability and usability of the variable for subsequent analyses.
 
Provide explanations here.
Source: I sourced our data from ‘Kaggle’, which is a Google subsidiary and world’s largest data science community. This platform is a home of varied datasets, learning resources for datascience and a hub of challenging competitions etc.,
Dataset_1 details: This dataset contains the details of US colleges/universities admission rates and it’s corresponding SAT and ACT scores required to get an admission. All its attribute details as below:
ID - This is the Integrated Postsecondary Education Data System (IPEDS) unique id for each college/university.
Name - Name of the college/university.
State - State of the college/university where it is located.
Applications - Number of applicants applied for the university for admission.
Admissions - Number of applicants got the seat/admission to the university.
Admission rate - Rate of university acceptance.
SATVR75 - SAT Verbal 75th percentile score.
SATMT75 - SAT Math 75th percentile score.
ACTEN75 - ACT English 75th percentile score.
ACTMT75 - ACT Math 75th percentile score.
Dataset_2 details: This dataset contains the details of US colleges/universities avarege salary pay-scale over the period of journey of a individual. All its attribute details as below:
School Name - Name of the college/university.
Region - Region of the university on which side it mostly categorized in the US. Ex: California region, Southern region, Northeastern region etc.,
Starting Median Salary - Starting average salary of an individual who studied in that university.
Mid-Career Median Salary - Career mid average salary of an individual who studied in that university.
Mid-Career 10th Percentile Salary - Career mid 10th percentile individuals salaries who studied in that university.
Mid-Career 25th Percentile Salary - Career mid 25th percentile individuals salaries who studied in that university.
Mid-Career 75th Percentile Salary - Career mid 75th percentile individuals salaries who studied in that university.
Mid-Career 90th Percentile Salary - Career mid 90th percentile individuals salaries who studied in that university.
dataframe_1 <- read.csv("/Users/chandangowda/Documents/RMIT Studies/Data Wrangling/Assignment_3/Dataset_1.csv")
dataframe_2 <- read.csv("/Users/chandangowda/Documents/RMIT Studies/Data Wrangling/Assignment_3/Dataset_2.csv")
# Displaying structure and head of both datasets
head(dataframe_1)str(dataframe_1)'data.frame':   1085 obs. of  10 variables:
 $ ID            : int  188429 168528 138600 100654 100724 188641 131159 164465 222831 164492 ...
 $ Name          : chr  "Adelphi University" "Adrian College" "Agnes Scott College" "Alabama A & M University" ...
 $ State         : chr  "New York" "Michigan" "Georgia" "Alabama" ...
 $ Applications  : int  16084 2736 1625 6560 5974 6243 19650 13999 3515 2519 ...
 $ Admissions    : int  12459 2079 1135 4697 5854 4193 12594 1224 2463 1835 ...
 $ Admission.Rate: num  0.775 0.76 0.698 0.716 0.98 ...
 $ SATVR75       : int  650 580 NA 520 534 630 730 760 590 NA ...
 $ SATMT75       : int  660 570 NA 510 516 670 710 790 570 NA ...
 $ ACTEN75       : int  31 24 NA 20 22 29 35 35 24 NA ...
 $ ACTMT75       : int  28 26 NA 18 19 27 31 35 24 NA ...head(dataframe_2)str(dataframe_2)'data.frame':   302 obs. of  8 variables:
 $ School.Name                      : chr  "Adelphi University" "American University, Washington D.C." "Amherst College" "Appalachian State University" ...
 $ Region                           : chr  "Northeastern" "Southern" "Northeastern" "Southern" ...
 $ Starting.Median.Salary           : chr  "$40,600.00 " "$45,300.00 " "$54,500.00 " "$40,400.00 " ...
 $ Mid.Career.Median.Salary         : chr  "$79,200.00 " "$90,800.00 " "$107,000.00 " "$69,100.00 " ...
 $ Mid.Career.10th.Percentile.Salary: chr  "$44,200.00 " "$45,200.00 " "N/A" "$37,200.00 " ...
 $ Mid.Career.25th.Percentile.Salary: chr  "$54,800.00 " "$62,400.00 " "$84,900.00 " "$50,400.00 " ...
 $ Mid.Career.75th.Percentile.Salary: chr  "$114,000.00 " "$134,000.00 " "$162,000.00 " "$90,800.00 " ...
 $ Mid.Career.90th.Percentile.Salary: chr  "$160,000.00 " "$169,000.00 " "N/A" "$115,000.00 " ...# Standardizing university names in both dataframes
dataframe_1 <- dataframe_1 %>%
  mutate(Name = str_to_upper(Name))
dataframe_2 <- dataframe_2 %>%
  mutate(School.Name = str_to_upper(School.Name))
# Merging datasets on standardized School Name
merged_dataframe <- dataframe_1 %>%
  inner_join(dataframe_2, by = c("Name" = "School.Name"))
# Renaming the merged column name as 'University Name'
merged_dataframe <- merged_dataframe %>%
  rename(
    University.Name = Name
  )
# Displaying structure and head of merged data
str(merged_dataframe)'data.frame':   283 obs. of  17 variables:
 $ ID                               : int  188429 131159 164465 197869 104151 106458 100858 219602 150136 160977 ...
 $ University.Name                  : chr  "ADELPHI UNIVERSITY" "AMERICAN UNIVERSITY, WASHINGTON D.C." "AMHERST COLLEGE" "APPALACHIAN STATE UNIVERSITY" ...
 $ State                            : chr  "New York" "District of Columbia" "Massachusetts" "North Carolina" ...
 $ Applications                     : int  16084 19650 13999 21120 61603 5835 27619 5053 22947 7319 ...
 $ Admissions                       : int  12459 12594 1224 17886 54329 3693 19660 4721 15642 1267 ...
 $ Admission.Rate                   : num  0.7746 0.6409 0.0874 0.8469 0.8819 ...
 $ SATVR75                          : int  650 730 760 650 NA 590 NA NA 610 NA ...
 $ SATMT75                          : int  660 710 790 650 NA 610 NA NA 600 NA ...
 $ ACTEN75                          : int  31 35 35 28 NA 28 NA 24 26 NA ...
 $ ACTMT75                          : int  28 31 35 27 NA 25 NA 23 26 NA ...
 $ Region                           : chr  "Northeastern" "Southern" "Northeastern" "Southern" ...
 $ Starting.Median.Salary           : chr  "$40,600.00 " "$45,300.00 " "$54,500.00 " "$40,400.00 " ...
 $ Mid.Career.Median.Salary         : chr  "$79,200.00 " "$90,800.00 " "$107,000.00 " "$69,100.00 " ...
 $ Mid.Career.10th.Percentile.Salary: chr  "$44,200.00 " "$45,200.00 " "N/A" "$37,200.00 " ...
 $ Mid.Career.25th.Percentile.Salary: chr  "$54,800.00 " "$62,400.00 " "$84,900.00 " "$50,400.00 " ...
 $ Mid.Career.75th.Percentile.Salary: chr  "$114,000.00 " "$134,000.00 " "$162,000.00 " "$90,800.00 " ...
 $ Mid.Career.90th.Percentile.Salary: chr  "$160,000.00 " "$169,000.00 " "N/A" "$115,000.00 " ...head(merged_dataframe)
 
# Summarizing the types of variables
variable_summary <- sapply(merged_dataframe, class)
print("Variable Types:")[1] "Variable Types:"print(variable_summary)                               ID                   University.Name                             State 
                        "integer"                       "character"                       "character" 
                     Applications                        Admissions                    Admission.Rate 
                        "integer"                         "integer"                         "numeric" 
                          SATVR75                           SATMT75                           ACTEN75 
                        "integer"                         "integer"                         "integer" 
                          ACTMT75                            Region            Starting.Median.Salary 
                        "integer"                       "character"                       "character" 
         Mid.Career.Median.Salary Mid.Career.10th.Percentile.Salary Mid.Career.25th.Percentile.Salary 
                      "character"                       "character"                       "character" 
Mid.Career.75th.Percentile.Salary Mid.Career.90th.Percentile.Salary 
                      "character"                       "character" # Converting character variables to factors
merged_dataframe <- merged_dataframe %>%
  mutate(University.Name = as.factor(University.Name),
         State = as.factor(State))
# Assuming 'State' is a factor variable that we need to label and order
merged_dataframe <- merged_dataframe %>%
  mutate(State = fct_relevel(State, "California", "Texas", "New York", "Illinois", "Colorado"))
str(merged_dataframe)'data.frame':   283 obs. of  17 variables:
 $ ID                               : int  188429 131159 164465 197869 104151 106458 100858 219602 150136 160977 ...
 $ University.Name                  : Factor w/ 280 levels "ADELPHI UNIVERSITY",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ State                            : Factor w/ 50 levels "California","Texas",..: 3 12 23 34 8 9 6 43 16 21 ...
 $ Applications                     : int  16084 19650 13999 21120 61603 5835 27619 5053 22947 7319 ...
 $ Admissions                       : int  12459 12594 1224 17886 54329 3693 19660 4721 15642 1267 ...
 $ Admission.Rate                   : num  0.7746 0.6409 0.0874 0.8469 0.8819 ...
 $ SATVR75                          : int  650 730 760 650 NA 590 NA NA 610 NA ...
 $ SATMT75                          : int  660 710 790 650 NA 610 NA NA 600 NA ...
 $ ACTEN75                          : int  31 35 35 28 NA 28 NA 24 26 NA ...
 $ ACTMT75                          : int  28 31 35 27 NA 25 NA 23 26 NA ...
 $ Region                           : chr  "Northeastern" "Southern" "Northeastern" "Southern" ...
 $ Starting.Median.Salary           : chr  "$40,600.00 " "$45,300.00 " "$54,500.00 " "$40,400.00 " ...
 $ Mid.Career.Median.Salary         : chr  "$79,200.00 " "$90,800.00 " "$107,000.00 " "$69,100.00 " ...
 $ Mid.Career.10th.Percentile.Salary: chr  "$44,200.00 " "$45,200.00 " "N/A" "$37,200.00 " ...
 $ Mid.Career.25th.Percentile.Salary: chr  "$54,800.00 " "$62,400.00 " "$84,900.00 " "$50,400.00 " ...
 $ Mid.Career.75th.Percentile.Salary: chr  "$114,000.00 " "$134,000.00 " "$162,000.00 " "$90,800.00 " ...
 $ Mid.Career.90th.Percentile.Salary: chr  "$160,000.00 " "$169,000.00 " "N/A" "$115,000.00 " ...summary(merged_dataframe)       ID                                              University.Name           State      Applications   
 Min.   :100751   UNION COLLEGE                                :  3    New York     : 32   Min.   :   261  
 1st Qu.:148494   CALIFORNIA STATE UNIVERSITY, EAST BAY (CSUEB):  2    California   : 25   1st Qu.:  7486  
 Median :187985   ADELPHI UNIVERSITY                           :  1    Massachusetts: 18   Median : 14965  
 Mean   :180612   AMERICAN UNIVERSITY, WASHINGTON D.C.         :  1    Pennsylvania : 15   Mean   : 22572  
 3rd Qu.:213376   AMHERST COLLEGE                              :  1    Texas        : 12   3rd Qu.: 31000  
 Max.   :243744   APPALACHIAN STATE UNIVERSITY                 :  1    Wisconsin    : 12   Max.   :139489  
                  (Other)                                      :274    (Other)      :169                   
   Admissions    Admission.Rate       SATVR75         SATMT75         ACTEN75         ACTMT75     
 Min.   :  217   Min.   :0.03915   Min.   :558.0   Min.   :540.0   Min.   :22.00   Min.   :21.00  
 1st Qu.: 3504   1st Qu.:0.41808   1st Qu.:640.0   1st Qu.:630.0   1st Qu.:27.00   1st Qu.:26.25  
 Median : 7987   Median :0.73850   Median :680.0   Median :680.0   Median :31.50   Median :29.00  
 Mean   :11391   Mean   :0.62864   Mean   :682.2   Mean   :689.6   Mean   :30.71   Mean   :29.16  
 3rd Qu.:16302   3rd Qu.:0.87292   3rd Qu.:730.0   3rd Qu.:755.0   3rd Qu.:35.00   3rd Qu.:32.00  
 Max.   :54329   Max.   :0.99990   Max.   :780.0   Max.   :800.0   Max.   :36.00   Max.   :36.00  
                                   NA's   :76      NA's   :76      NA's   :77      NA's   :77     
    Region          Starting.Median.Salary Mid.Career.Median.Salary Mid.Career.10th.Percentile.Salary
 Length:283         Length:283             Length:283               Length:283                       
 Class :character   Class :character       Class :character         Class :character                 
 Mode  :character   Mode  :character       Mode  :character         Mode  :character                 
                                                                                                     
                                                                                                     
                                                                                                     
                                                                                                     
 Mid.Career.25th.Percentile.Salary Mid.Career.75th.Percentile.Salary Mid.Career.90th.Percentile.Salary
 Length:283                        Length:283                        Length:283                       
 Class :character                  Class :character                  Class :character                 
 Mode  :character                  Mode  :character                  Mode  :character                 
                                                                                                      
                                                                                                      
                                                                                                      
                                                                                                      Provide explanations here.
In this step, I focused on understanding the structure and attributes of our merged dataset. This involved summarizing the types of variables, checking the summary statistics, and applying appropriate data type conversions to ensure consistency and readiness for analysis.
Summarize the Types of Variables - I used the sapply function to determine the data types of each variable in the merged_dataframe. This step was crucial to identify the types of variables (numeric, character, factor, etc.) and ensure that each variable is appropriately classified.
Convert Character Variables to Factors - Character variables, particularly University.Name and State, were converted to factors. This is an important step for categorical data, as it allows for more efficient storage and provides meaningful levels for analysis.
Label and Order Factor Levels - I assumed the State variable needed to be ordered for better analysis. The fct_relevel function from the forcats package was used to reorder the levels of State to a specified order. This is useful for ordered categorical variables where the sequence of levels has a meaningful interpretation.
 
# Removing dollar signs and commas from salary columns and converting to numeric
merged_dataframe <- merged_dataframe %>%
  mutate(across(contains("Salary"), ~ as.numeric(str_remove_all(., "[$,]"))))Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(contains("Salary"), ~as.numeric(str_remove_all(., "[$,]")))`.
Caused by warning:
! NAs introduced by coercion
ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 1 remaining warning.# Reshaping only the selected salary columns into a long format
tidy_dataframe <- merged_dataframe %>%
  pivot_longer(
    cols = c("Mid.Career.75th.Percentile.Salary", "Mid.Career.90th.Percentile.Salary"),
    names_to = "Higher.Salary.Stage",
    names_prefix = "Mid-Career ",
    values_to = "Higher.Career.Median.Salary"
  )
head(tidy_dataframe)Provide explanations here.
Datasets did not fully adhere to the principles of tidy dataset due to the presence of salary columns (Mid-Career 75th Percentile Salary and Mid-Career 90th Percentile Salary) which represented similar types of information of an individual who is having higher bar of salary with much experience but were spread across multiple columns. To address this, we needed to reshape these columns into a longer format, ensuring each salary stage and its value are represented in a consistent manner.
Removing Dollar Signs and Commas - The salary columns initially contained dollar signs and commas, making them unsuitable for numerical operations. We used mutate with across to remove these characters and convert the columns to numeric. This step was necessary to ensure that the salary data could be accurately analyzed and manipulated.
Reshaping the Data - The pivot_longer function was used to transform the wide format salary data into a long format. This restructuring allowed us to condense the Mid-Career 75th Percentile Salary and Mid-Career 90th Percentile Salary columns into a single Higher.Career.Median.Salary column. An additional Higher.Salary.Stage column was created to indicate the stage of the salary (75th or 90th percentile). This reshaping adheres to the tidy data principles by ensuring each variable has its own column, each observation has its own row, and each value has its own cell. This makes the dataset more compact and easier to analyze.
 
# Creating the Selectivity variable based on Admission Rate
tidy_dataframe <- tidy_dataframe %>%
  mutate(Selectivity = case_when(
    `Admission.Rate` <= 0.30 ~ "Highly.Selective",
    `Admission.Rate` > 0.30 & `Admission.Rate` <= 0.60 ~ "Selective",
    `Admission.Rate` > 0.60 ~ "Less.Selective",
    TRUE ~ "Unknown"
  ))
# Displaying head of the tidy data with the new variable
head(tidy_dataframe)Provide explanations here.
In this step, I created a new variable, Selectivity, based on the Admission.Rate of each university. This variable categorizes universities into different levels of selectivity, providing additional insight into the dataset. The creation of the Selectivity variable allows for a more nuanced analysis of how the selectivity of a university might relate to other factors such as salaries.
 
# Scaning for missing values
missing_values <- colSums(is.na(tidy_dataframe))
print("Missing Values:")[1] "Missing Values:"print(missing_values)                               ID                   University.Name                             State 
                                0                                 0                                 0 
                     Applications                        Admissions                    Admission.Rate 
                                0                                 0                                 0 
                          SATVR75                           SATMT75                           ACTEN75 
                              152                               152                               154 
                          ACTMT75                            Region            Starting.Median.Salary 
                              154                                 0                                 0 
         Mid.Career.Median.Salary Mid.Career.10th.Percentile.Salary Mid.Career.25th.Percentile.Salary 
                                0                                86                                 0 
              Higher.Salary.Stage       Higher.Career.Median.Salary                       Selectivity 
                                0                                43                                 0 # Ploting the missing values
gg_miss_var(tidy_dataframe) + theme_minimal() + ggtitle("Missing Values by Variable")
# Handle missing values
# Here I choose to fill numeric columns with the mean and factor columns with the mode
tidy_dataframe <- tidy_dataframe %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)),
         across(where(is.factor), ~ ifelse(is.na(.), fct_mode(.), .)))
# Checking if missing values are handled
missing_values_after <- colSums(is.na(tidy_dataframe))
print("Missing Values After Handling:")[1] "Missing Values After Handling:"print(missing_values_after)                               ID                   University.Name                             State 
                                0                                 0                                 0 
                     Applications                        Admissions                    Admission.Rate 
                                0                                 0                                 0 
                          SATVR75                           SATMT75                           ACTEN75 
                                0                                 0                                 0 
                          ACTMT75                            Region            Starting.Median.Salary 
                                0                                 0                                 0 
         Mid.Career.Median.Salary Mid.Career.10th.Percentile.Salary Mid.Career.25th.Percentile.Salary 
                                0                                 0                                 0 
              Higher.Salary.Stage       Higher.Career.Median.Salary                       Selectivity 
                                0                                 0                                 0 # Handling special values (example: replace Inf with max and -Inf with min)
tidy_dataframe <- tidy_dataframe %>%
  mutate(across(where(is.numeric), ~ ifelse(is.infinite(.), NA, .)))
# Recheck special values after handling
special_values_after <- sapply(tidy_dataframe, function(x) sum(is.nan(x) | is.infinite(x)))
print("Special Values After Handling:")[1] "Special Values After Handling:"print(special_values_after)                               ID                   University.Name                             State 
                                0                                 0                                 0 
                     Applications                        Admissions                    Admission.Rate 
                                0                                 0                                 0 
                          SATVR75                           SATMT75                           ACTEN75 
                                0                                 0                                 0 
                          ACTMT75                            Region            Starting.Median.Salary 
                                0                                 0                                 0 
         Mid.Career.Median.Salary Mid.Career.10th.Percentile.Salary Mid.Career.25th.Percentile.Salary 
                                0                                 0                                 0 
              Higher.Salary.Stage       Higher.Career.Median.Salary                       Selectivity 
                                0                                 0                                 0 # Check for obvious errors in factor variables (example: unexpected levels)
unexpected_levels <- lapply(tidy_dataframe, function(x) if(is.factor(x)) levels(x))
print("Unexpected Levels:")[1] "Unexpected Levels:"print(unexpected_levels)$ID
NULL
$University.Name
NULL
$State
NULL
$Applications
NULL
$Admissions
NULL
$Admission.Rate
NULL
$SATVR75
NULL
$SATMT75
NULL
$ACTEN75
NULL
$ACTMT75
NULL
$Region
NULL
$Starting.Median.Salary
NULL
$Mid.Career.Median.Salary
NULL
$Mid.Career.10th.Percentile.Salary
NULL
$Mid.Career.25th.Percentile.Salary
NULL
$Higher.Salary.Stage
NULL
$Higher.Career.Median.Salary
NULL
$Selectivity
NULLProvide explanations here.
In this step, I scanned the dataset for missing values, special values, and obvious errors to ensure data quality and integrity.
Scan for Missing Values - I used colSums(is.na(tidy_dataframe)) to identify the number of missing values in each column. This step helps in understanding the extent of missing data within the dataset.
Plot Missing Values - To visualize the distribution of missing values, I used the gg_miss_var function from the naniar package. This plot helps in quickly identifying which variables have missing data.
Handle Missing Values - I chose to fill missing values in numeric columns with the mean of the respective columns, and missing values in factor columns with the mode. This approach maintains the data distribution while addressing the missing values.
Handle Special Values - I identified and handled special values (e.g., Inf, -Inf) by replacing them with NA. This step ensures that these special values do not interfere with subsequent analyses.
Check for Obvious Errors in Factor Variables - I checked for unexpected levels in factor variables to identify any obvious errors. This step helps ensure that factor variables have consistent and expected levels.
 
# Identifying outliers using the IQR method
numeric_cols <- select_if(tidy_dataframe, is.numeric)
# Calculating Q1 (25th percentile) and Q3 (75th percentile) for each numeric column
Q1 <- apply(numeric_cols, 2, quantile, 0.25, na.rm = TRUE)
Q3 <- apply(numeric_cols, 2, quantile, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
# Defining outliers as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
outliers <- numeric_cols %>%
  mutate(across(everything(), ~ (. < (Q1 - 1.5 * IQR)) | (. > (Q3 + 1.5 * IQR))))Warning: There were 30 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(...)`.
Caused by warning in `ID < (Q1 - 1.5 * IQR)`:
! longer object length is not a multiple of shorter object length
ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 29 remaining warnings.# Counting of outliers in each column
outliers_count <- colSums(outliers, na.rm = TRUE)
print("Outliers Count:")[1] "Outliers Count:"print(outliers_count)                               ID                   University.Name                             State 
                              489                               434                               390 
                     Applications                        Admissions                    Admission.Rate 
                              457                               485                               376 
                          SATVR75                           SATMT75                           ACTEN75 
                              414                               414                               338 
                          ACTMT75            Starting.Median.Salary          Mid.Career.Median.Salary 
                              338                               380                               426 
Mid.Career.10th.Percentile.Salary Mid.Career.25th.Percentile.Salary       Higher.Career.Median.Salary 
                              383                               361                               475 # Handling outliers by capping them to the nearest non-outlier value
tidy_dataframe <- tidy_dataframe %>%
  mutate(across(where(is.numeric), ~ ifelse(. < (Q1 - 1.5 * IQR), Q1 - 1.5 * IQR, .))) %>%
  mutate(across(where(is.numeric), ~ ifelse(. > (Q3 + 1.5 * IQR), Q3 + 1.5 * IQR, .)))Warning: There were 15 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(...)`.
Caused by warning in `ID < (Q1 - 1.5 * IQR)`:
! longer object length is not a multiple of shorter object length
ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 14 remaining warnings.Warning: There were 15 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(...)`.
Caused by warning in `ID > (Q3 + 1.5 * IQR)`:
! longer object length is not a multiple of shorter object length
ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 14 remaining warnings.# Verifying the structure after handling outliers
str(tidy_dataframe)tibble [566 × 18] (S3: tbl_df/tbl/data.frame)
 $ ID                               : num [1:566] 188429 419 88.4 66980.2 35511.5 ...
 $ University.Name                  : num [1:566] 50256 1 2 2 3 ...
 $ State                            : num [1:566] 50256 3 12 12 23 ...
 $ Applications                     : num [1:566] 50256.1 419 88.4 19650 13999 ...
 $ Admissions                       : num [1:566] 50256.1 419 88.4 12594 1224 ...
 $ Admission.Rate                   : num [1:566] 5.03e+04 7.75e-01 6.41e-01 6.41e-01 8.74e-02 ...
 $ SATVR75                          : num [1:566] 50256.1 419 88.4 730 760 ...
 $ SATMT75                          : num [1:566] 50256.1 419 88.4 710 790 ...
 $ ACTEN75                          : num [1:566] 50256 31 35 35 35 ...
 $ ACTMT75                          : num [1:566] 50256 28 31 31 35 ...
 $ Region                           : chr [1:566] "Northeastern" "Northeastern" "Southern" "Southern" ...
 $ Starting.Median.Salary           : num [1:566] 50256.1 419 88.4 45300 35511.5 ...
 $ Mid.Career.Median.Salary         : num [1:566] 79200 419 88.4 66980.2 35511.5 ...
 $ Mid.Career.10th.Percentile.Salary: num [1:566] 50256.1 419 88.4 45200 35511.5 ...
 $ Mid.Career.25th.Percentile.Salary: num [1:566] 54800 419 88.4 62400 35511.5 ...
 $ Higher.Salary.Stage              : chr [1:566] "Mid.Career.75th.Percentile.Salary" "Mid.Career.90th.Percentile.Salary" "Mid.Career.75th.Percentile.Salary" "Mid.Career.90th.Percentile.Salary" ...
 $ Higher.Career.Median.Salary      : num [1:566] 114000 419 88.4 66980.2 35511.5 ...
 $ Selectivity                      : chr [1:566] "Less.Selective" "Less.Selective" "Less.Selective" "Less.Selective" ...Provide explanations here.
In this step, I scanned the numeric data for outliers to ensure data consistency and integrity.
Identify Outliers Using the IQR Method - I identified outliers in the numeric columns using the Interquartile Range (IQR) method. This method calculates the range between the first quartile (Q1) and the third quartile (Q3) and defines outliers as values below Q1 - 1.5IQR or above Q3 + 1.5IQR. This robust method is effective for identifying extreme values without being overly influenced by them.
Handle Outliers by Capping - To handle the identified outliers, I capped them to the nearest non-outlier values. This approach helps mitigate the impact of extreme values on our analysis while preserving the overall data distribution. By capping the outliers, I ensure that they do not disproportionately influence the results.
Verify the Structure After Handling Outliers - After handling the outliers, I verified the structure of the dataset to ensure that the changes were applied correctly and the dataset remained consistent.
 
# Selecting a "Starting_Salary" variable to transform
variable_to_transform <- tidy_dataframe$Starting.Median.Salary
# Applying log transformation
tidy_dataframe <- tidy_dataframe %>%
  mutate(Starting_Salary_Log = log(Starting.Median.Salary + 1))
# Verifying the structure after transformation
str(tidy_dataframe)tibble [566 × 19] (S3: tbl_df/tbl/data.frame)
 $ ID                               : num [1:566] 188429 419 88.4 66980.2 35511.5 ...
 $ University.Name                  : num [1:566] 50256 1 2 2 3 ...
 $ State                            : num [1:566] 50256 3 12 12 23 ...
 $ Applications                     : num [1:566] 50256.1 419 88.4 19650 13999 ...
 $ Admissions                       : num [1:566] 50256.1 419 88.4 12594 1224 ...
 $ Admission.Rate                   : num [1:566] 5.03e+04 7.75e-01 6.41e-01 6.41e-01 8.74e-02 ...
 $ SATVR75                          : num [1:566] 50256.1 419 88.4 730 760 ...
 $ SATMT75                          : num [1:566] 50256.1 419 88.4 710 790 ...
 $ ACTEN75                          : num [1:566] 50256 31 35 35 35 ...
 $ ACTMT75                          : num [1:566] 50256 28 31 31 35 ...
 $ Region                           : chr [1:566] "Northeastern" "Northeastern" "Southern" "Southern" ...
 $ Starting.Median.Salary           : num [1:566] 50256.1 419 88.4 45300 35511.5 ...
 $ Mid.Career.Median.Salary         : num [1:566] 79200 419 88.4 66980.2 35511.5 ...
 $ Mid.Career.10th.Percentile.Salary: num [1:566] 50256.1 419 88.4 45200 35511.5 ...
 $ Mid.Career.25th.Percentile.Salary: num [1:566] 54800 419 88.4 62400 35511.5 ...
 $ Higher.Salary.Stage              : chr [1:566] "Mid.Career.75th.Percentile.Salary" "Mid.Career.90th.Percentile.Salary" "Mid.Career.75th.Percentile.Salary" "Mid.Career.90th.Percentile.Salary" ...
 $ Higher.Career.Median.Salary      : num [1:566] 114000 419 88.4 66980.2 35511.5 ...
 $ Selectivity                      : chr [1:566] "Less.Selective" "Less.Selective" "Less.Selective" "Less.Selective" ...
 $ Starting_Salary_Log              : num [1:566] 10.82 6.04 4.49 10.72 10.48 ...# Originaling distribution
p1 <- ggplot(tidy_dataframe, aes(x = Starting.Median.Salary)) +
  geom_histogram(bins = 30, fill = "red", alpha = 0.7) +
  theme_minimal() +
  ggtitle("Original Starting Salary Distribution")
# Transformeding distribution
p2 <- ggplot(tidy_dataframe, aes(x = Starting_Salary_Log)) +
  geom_histogram(bins = 30, fill = "green", alpha = 0.7) +
  theme_minimal() +
  ggtitle("Log-Transformed Starting Salary Distribution")
print(p1)print(p2)Provide explanations here.
In this step, I applied a log transformation to the Starting.Median.Salary variable to address skewness and better understand the distribution of starting salaries.
Select a Variable to Transform - I selected Starting.Median.Salary for transformation due to its likely right-skewed distribution. Transforming this variable helps to reduce skewness and stabilize variance, making it more suitable for analysis.
Apply Log Transformation - The log transformation was applied to the Starting.Median.Salary variable by adding 1 to each value and taking the natural logarithm. This approach handles zero values effectively and reduces skewness in the distribution.
Verify the Structure After Transformation - I verified the structure of the dataset to ensure the new Starting_Salary_Log variable was added correctly and the dataset remained consistent.
Visualize the Original and Transformed Distributions - Using ggplot2, I visualized the original and log-transformed distributions of the Starting.Median.Salary. The histograms show how the log transformation reduced skewness and made the distribution more normal.
 
[1] RMIT (2024) Module 4: Tidy and Manipulate: Tidy Data Principles and Manipulating Data, Data Wrangling (Preprocessing). Accessed on: (25 May - 06 Jun)/2024. Link: http://rare-phoenix-161610.appspot.com/secured/Module_04.html
[2] RMIT (2024) Module 5: Scan: Missing Values Data Wrangling (Preprocessing). Accessed on: (25 May - 06 Jun)/2024. Link: http://rare-phoenix-161610.appspot.com/secured/Module_05.html
[3] RMIT (2024) Module 6: Scan: Outliers. Accessed on: (25 May - 06 Jun)/2024. Link: https://rare-phoenix- 161610.appspot.com/secured/Module_06.html
[4] RMIT (2024) Module 7: Scan: Missing Values. Accessed on: (30 May - 06 Jun)/2024. Link: https://rare-phoenix-161610.appspot.com/secured/Module_07.html
[5] QuillBot Website, used for paraphrasing my explanations. Accessed on: (02 Jun - 06 Jun)/2024. URL: https://quillbot.com