Project 3

Loading Data and Tidyverse

library(tidyverse)
setwd("/Users/mikea/Desktop/Datasets")
df <- read_csv("Salary_Data.csv")

Problem 1

str(df)

## spc_tbl_ [6,704 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                : num [1:6704] 32 28 45 36 52 29 42 31 26 38 ...
##  $ Gender             : chr [1:6704] "Male" "Female" "Male" "Female" ...
##  $ Education Level    : chr [1:6704] "Bachelor's" "Master's" "PhD" "Bachelor's" ...
##  $ Job Title          : chr [1:6704] "Software Engineer" "Data Analyst" "Senior Manager" "Sales Associate" ...
##  $ Years of Experience: num [1:6704] 5 3 15 7 20 2 12 4 1 10 ...
##  $ Salary             : num [1:6704] 90000 65000 150000 60000 200000 55000 120000 80000 45000 110000 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   `Education Level` = col_character(),
##   ..   `Job Title` = col_character(),
##   ..   `Years of Experience` = col_double(),
##   ..   Salary = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

This dataset includes six variables of two types: numeric and character.

The first variable, Age, is a numeric variable that records each employee's age.

The Gender variable is a character variable.

Education Level is also a character variable that records the level of education an employee has.

Job Title is a character variable that shows the job position someone holds.

Years of Experience is a numeric variable that shows the number of years someone has worked in that area.

Salary is a numeric variable that shows the amount of money earned.
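
The column types described above can also be checked programmatically. This is a minimal sketch, assuming a small stand-in data frame (`toy` is a hypothetical name, built from the first three values shown in the str() output) rather than the full dataset:

```r
# Hypothetical three-row stand-in for df; values copied from the str() output above.
toy <- data.frame(
  Age = c(32, 28, 45),
  Gender = c("Male", "Female", "Male"),
  Salary = c(90000, 65000, 150000),
  stringsAsFactors = FALSE
)

# Class of each column, returned as a named character vector.
col_types <- sapply(toy, class)
print(col_types)

# Count how many columns are numeric and how many are character.
type_counts <- table(col_types)
print(type_counts)
```

Running the same two calls on df should give a quick tally of numeric versus character columns without reading through the full str() listing.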

Problem 2

# Drop rows that contain any missing value; the pipe already supplies df.
df <- df %>% 
  na.omit()
summary(df)

##       Age           Gender          Education Level     Job Title        
##  Min.   :21.00   Length:6698        Length:6698        Length:6698       
##  1st Qu.:28.00   Class :character   Class :character   Class :character  
##  Median :32.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :33.62                                                           
##  3rd Qu.:38.00                                                           
##  Max.   :62.00                                                           
##  Years of Experience     Salary      
##  Min.   : 0.000      Min.   :   350  
##  1st Qu.: 3.000      1st Qu.: 70000  
##  Median : 7.000      Median :115000  
##  Mean   : 8.095      Mean   :115329  
##  3rd Qu.:12.000      3rd Qu.:160000  
##  Max.   :34.000      Max.   :250000

sd(df$Salary)

## [1] 52789.79

sd(df$`Years of Experience`)

## [1] 6.060291

sd(df$Age)

## [1] 7.615784

Cleaning

df <- df %>% 
  mutate(Gender = as.factor(Gender)) %>% 
  mutate(`Education Level`= factor(`Education Level`)) %>% 
  mutate(`Job Title` = factor(`Job Title`))

# Collapse spelling variants into one label per education level.
# as.character() keeps unmatched values as text instead of factor codes.
df <- df %>%
  mutate(`Education Level` = factor(case_when(
    grepl("bachelor", tolower(`Education Level`)) ~ "Bachelors's Degree",
    grepl("master", tolower(`Education Level`)) ~ "Master's Degree",
    grepl("phd", tolower(`Education Level`)) ~ "PhD",
    grepl("high school", tolower(`Education Level`)) ~ "High School",
    TRUE ~ as.character(`Education Level`)
  )))
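
One detail worth checking in a recode like the one above: when the column is already a factor, ifelse() can return the factor's internal integer codes for unmatched rows instead of their labels. This is a minimal sketch of that behavior, using a made-up vector (`raw`, `unsafe`, and `safe` are hypothetical names, and the values are not taken from the dataset):

```r
# Hypothetical raw labels with spelling variants and one value no pattern matches.
raw <- factor(c("Bachelor's", "Bachelor's Degree", "phD", "Other"))

# Rows whose label mentions "bachelor", ignoring case.
is_bachelor <- grepl("bachelor", tolower(raw))

# Unsafe: the fallback is a factor, so unmatched rows lose their labels.
unsafe <- ifelse(is_bachelor, "Bachelor", raw)

# Safe: convert the fallback to character before passing it to ifelse().
safe <- ifelse(is_bachelor, "Bachelor", as.character(raw))

print(unsafe)
print(safe)
```

In the dataset here, every education value appears to match one of the four patterns (the later tables show only four levels), so the fallback may never be reached, but converting it to character guards against new or misspelled categories.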

Problem 3

addmargins(table(df$Gender))

## 
## Female   Male  Other    Sum 
##   3013   3671     14   6698

prop.table(table(df$Gender))

## 
##      Female        Male       Other 
## 0.449835772 0.548074052 0.002090176

Problem 4

#table(df$Gender, df$`Job Title`)
addmargins(table(df$Gender, df$`Education Level`))

##         
##          Bachelors's Degree High School Master's Degree  PhD  Sum
##   Female               1198         251            1068  496 3013
##   Male                 1823         185             790  873 3671
##   Other                   0          12               2    0   14
##   Sum                  3021         448            1860 1369 6698

Problem 5

education_counts <- table(df$`Education Level`)
barplot(education_counts, main = "Education Level Distribution", xlab = "Education Level", ylab = "Count")

gender_counts <- table(df$Gender)
pie(gender_counts, labels = names(gender_counts), main = "Gender Distribution")

Problem 6

hist(df$Salary, main = "Salary", xlab = "Salary Range", ylab = "Counts")

boxplot(df$Salary, main = "Salary", ylab = "Salary Range")

hist(df$`Years of Experience`, main = "Years of Experience", xlab = "Years")

boxplot(df$`Years of Experience`, main = "Years of Experience", ylab = "Years")

Problem 7

This dataset comprises 6,698 observations after removing NAs and has six variables: age, gender, education level, job title, years of experience, and salary. There are 3,671 males, 3,013 females, and 14 people who identify as other. I then compared education levels between genders to see if there are any noticeable differences.

I discovered that more females than males have a master's degree (1,068 versus 790). However, males hold more PhDs than females (873 versus 496) and more bachelor's degrees (1,823 versus 1,198), while the high school counts are closer (251 versus 185). Overall, bachelor's is the most common education level, with 3,021 of the 6,698 observations. I also made charts to examine years of experience and salary ranges within this dataset. Years of experience range from 0 to 34. Salaries range from 350 to 250,000, and most people fall roughly between 50,000 and 200,000.
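
Because the gender groups differ in size, raw counts can be compared more fairly as within-gender shares. This is a sketch under that assumption, re-entering the counts from the Problem 4 table by hand (the names `counts` and `row_share` are illustrative and not part of the original analysis):

```r
# Counts copied from the Gender by Education Level table in Problem 4.
counts <- matrix(
  c(1198, 251, 1068, 496,
    1823, 185,  790, 873,
       0,  12,    2,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male", "Other"),
    Education = c("Bachelors's Degree", "High School", "Master's Degree", "PhD")
  )
)

# Share of each education level within each gender (each row sums to 1).
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 3))
```

With the full data frame loaded, the same shares could be obtained directly with prop.table(table(df$Gender, df$`Education Level`), margin = 1).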