About

This notebook is your first lab assignment. Please be sure to submit your work before the due date. You are expected to submit a knitted version of your worksheet, which can be an HTML, PDF, or Word file.

If you worked as a team, one team member’s submission is enough, as long as it lists all the names in the header section.

BalanceGig Company

BalanceGig is a company that matches independent workers for short-term engagements with businesses in the construction, automotive, and high-tech industries. The ‘gig’ employees work only for a short period of time, often on a particular project. A manager at BalanceGig extracts the employee data from their most recent work engagement, including the hourly wage (HourlyWage), the client’s industry (Industry), and the employee’s job (Job). The manager suspects that data about the gig employees are sometimes incomplete, perhaps due to the short engagement and transient nature of the employees. She would like to find the number of missing observations. In addition, she would like to know how many employees worked in the automotive industry and earned more than $30 per hour.

Read the Gig data file into a data frame and label it myDataGig. What is the number of records, and what are the variables in the dataset? (0.5 points)

# Read the Gig data file and display the number of records and the variables.
library(readxl)
Gig <- read_excel("C:/Users/jhyne/OneDrive/INFS 343/Gig.xlsx")
# Number of records (rows) and variables (columns).
dim(Gig)
# Variable names.
names(Gig)

Calculate the mean, median, maximum and minimum hourly wage. (0.5 points)

# Calculate the mean, median, maximum, and minimum hourly wages.
mean(Gig$HourlyWage, na.rm = TRUE)
## [1] 40.12287
median(Gig$HourlyWage, na.rm = TRUE)
## [1] 41.91
max(Gig$HourlyWage, na.rm = TRUE)
## [1] 51
min(Gig$HourlyWage, na.rm = TRUE)
## [1] 24.28

Find missing values in the Gig file (0.5 points)

R stores missing values as NA, and the is.na function identifies them: it returns TRUE for each missing value and FALSE for each value that is present.

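# List the positions of the missing values.
# is.na() on a whole data frame returns a logical matrix, so which() reports
# column-by-column cell positions (614 = 604 + 10 is row 10 of the second column), not row numbers.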
which(is.na(Gig))
##  [1]  614  638  663  685  708  781  828 1201 1232 1347 1569 1586 1649 1654 1687
## [16] 1708 1739 1773 1833 1870 1878 1901 1920 1987 2024 2065 2103 2159 2167 2199
## [31] 2200 2367 2389 2405
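
The positions above run past the 604 rows in the data because is.na applied to a whole data frame returns a logical matrix, and which numbers its cells column by column. A minimal sketch, on a small made-up data frame (`toy` and its values are hypothetical, not the Gig data), of how `arr.ind = TRUE` or `complete.cases` recovers row numbers instead:

```r
# Hypothetical toy data, for illustration only (not the Gig file).
toy <- data.frame(
  EmployeeID = 1:4,
  HourlyWage = c(32.5, NA, 41.0, 28.0),
  Industry   = c("Automotive", "Construction", NA, "Tech"),
  stringsAsFactors = FALSE
)

# Column-by-column cell positions: with 4 rows, position 6 is row 2 of column 2.
cell_positions <- which(is.na(toy))

# Row and column index of each missing cell.
missing_cells <- which(is.na(toy), arr.ind = TRUE)

# Row numbers of observations with at least one missing value.
incomplete_rows <- which(!complete.cases(toy))

print(cell_positions)
print(missing_cells)
print(incomplete_rows)
```

For the lab data, the corresponding calls would take `Gig` in place of `toy`.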

For a large data set, looking through every observation is inconvenient. Alternatively, we can apply the which function to the result of is.na to identify which observations contain missing values.

#Check which records have missing values for hourly wages, industry and job.
which(is.na(Gig$HourlyWage))
## [1]  10  34  59  81 104 177 224 597
which(is.na(Gig$Industry))
##  [1]  24 139 361 378 441 446 479 500 531 565
which(is.na(Gig$Job))
##  [1]  21  58  66  89 108 175 212 253 291 347 355 387 388 555 577 593
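
The manager wants the number of missing observations, which can be tallied per variable rather than listed row by row. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the Gig data), using colSums over the logical matrix that is.na returns:

```r
# Hypothetical toy data, for illustration only (not the Gig file).
toy <- data.frame(
  HourlyWage = c(32.5, NA, 41.0, NA),
  Industry   = c("Automotive", "Construction", NA, "Tech"),
  Job        = c("Engineer", "Analyst", "Other", "Sales Rep"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(toy))

# Total number of missing values in the data frame.
total_missing <- sum(is.na(toy))

print(missing_per_column)
print(total_missing)
```

Applied to the lab data, the equivalent call would be `colSums(is.na(Gig))`. The three which results above list 8, 10, and 16 rows, which together account for 34 missing values.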

Task 2: Implement omission strategy to remove observations with missing values (0.5 points)

We use the ‘na.omit’ function to remove observations with missing values and store the resulting data set in the omissionData data frame. How can you make sure that all observations with missing values were omitted? (Hint: check the number of observations.)

# Use this chunk to remove observations with missing values and to confirm that they were omitted.
dim(Gig)
## [1] 604   4
omissionData <- na.omit(Gig)
dim(omissionData)
## [1] 570   4
Gig[!complete.cases(Gig), ]
## # A tibble: 34 × 4
##    EmployeeID HourlyWage Industry     Job       
##         <dbl>      <dbl> <chr>        <chr>     
##  1         10       NA   Construction Accountant
##  2         21       49.6 Construction <NA>      
##  3         24       42.6 <NA>         Sales Rep 
##  4         34       NA   Construction Sales Rep 
##  5         58       44.9 Construction <NA>      
##  6         59       NA   Automotive   Accountant
##  7         66       26.1 Construction <NA>      
##  8         81       NA   Construction Sales Rep 
##  9         89       41.9 Construction <NA>      
## 10        104       NA   Construction Consultant
## # … with 24 more rows
na.omit(Gig)
## # A tibble: 570 × 4
##    EmployeeID HourlyWage Industry     Job       
##         <dbl>      <dbl> <chr>        <chr>     
##  1          1       32.8 Construction Analyst   
##  2          2       46   Automotive   Engineer  
##  3          3       43.1 Construction Sales Rep 
##  4          4       48.1 Automotive   Other     
##  5          5       43.6 Automotive   Accountant
##  6          6       47.0 Construction Engineer  
##  7          7       43.0 Construction Sales Rep 
##  8          8       41.0 Construction Programmer
##  9          9       38.9 Construction Consultant
## 10         11       28   Automotive   Engineer  
## # … with 560 more rows
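
One way to confirm the omission is to compare row counts before and after, and to check that no missing values remain. A minimal sketch on a made-up data frame (`toy`, `toy_omitted`, and the values are hypothetical, not the Gig data):

```r
# Hypothetical toy data, for illustration only (not the Gig file).
toy <- data.frame(
  HourlyWage = c(32.5, NA, 41.0, 28.0, 45.2),
  Industry   = c("Automotive", "Construction", NA, "Tech", "Automotive"),
  stringsAsFactors = FALSE
)

toy_omitted <- na.omit(toy)

# The number of rows dropped should equal the number of incomplete rows.
rows_dropped <- nrow(toy) - nrow(toy_omitted)
incomplete   <- sum(!complete.cases(toy))

# No missing values should remain after omission.
remaining_na <- sum(is.na(toy_omitted))

print(c(before = nrow(toy), after = nrow(toy_omitted)))
print(rows_dropped == incomplete)
print(remaining_na)
```

The same reasoning applies to the output above: 604 rows minus the 570 returned by na.omit leaves 34, which matches the 34 incomplete rows listed by complete.cases.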

Task 3: Implement mean imputation strategy (0.5 points)

We will calculate the average value using the ‘mean’ function. The option na.rm=TRUE ignores the missing values when calculating the average values.

Hourlywagemean = mean(Gig$HourlyWage, na.rm = TRUE)
Hourlywagemean
## [1] 40.12287
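
To see why na.rm = TRUE matters, here is a minimal sketch on a made-up vector (`wages` is hypothetical) comparing the two calls:

```r
# Hypothetical wages, for illustration only.
wages <- c(30, NA, 40, 50)

# Without na.rm, a single missing value makes the result NA.
mean_with_na <- mean(wages)

# With na.rm = TRUE, the mean is taken over 30, 40, and 50 only.
mean_without_na <- mean(wages, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```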

To impute the missing values in the HourlyWage variable, we again use the ‘is.na’ function to identify the missing values and replace them with the mean calculated in the previous step. Check that all missing values for HourlyWage were replaced by the mean value. (Hint: you can use the length and which functions to identify observations with HourlyWage equal to its mean value.)

# Use this chunk to impute missing values and to confirm that they were all replaced by the mean value.
# Use the stored mean rather than a retyped, rounded number.
Gig$HourlyWage[is.na(Gig$HourlyWage)] <- Hourlywagemean
dim(Gig)
## [1] 604   4
# Count the observations whose HourlyWage now equals the mean.
length(which(Gig$HourlyWage == Hourlywagemean))
## [1] 8
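
Counting values equal to the mean confirms the imputation only when no observed wage happens to equal the mean. A minimal sketch on made-up data (`toy`, `was_missing`, and `wage_mean` are hypothetical names) that records which rows were missing first and then checks those rows directly:

```r
# Hypothetical toy data, for illustration only (not the Gig file).
toy <- data.frame(HourlyWage = c(30, NA, 40, NA, 50))

# Remember which rows were missing before they are overwritten.
was_missing <- is.na(toy$HourlyWage)

# Impute with the mean of the observed values.
wage_mean <- mean(toy$HourlyWage, na.rm = TRUE)
toy$HourlyWage[was_missing] <- wage_mean

# Checks: nothing is missing now, and every imputed row holds the mean.
remaining_na <- sum(is.na(toy$HourlyWage))
all_imputed  <- all(toy$HourlyWage[was_missing] == wage_mean)

# Counting values equal to the mean also picks up the observed 40 in row 3.
equal_to_mean <- length(which(toy$HourlyWage == wage_mean))

print(remaining_na)
print(all_imputed)
print(equal_to_mean)
```

In this toy example two values were imputed, but three values equal the mean, so the saved `was_missing` mask is the more reliable check.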

To identify and count the number of employees with multiple selection criteria, we use the which and length functions.

Task 4: How many employees worked in the automotive industry and earned more than $30 per hour? (0.5 points)

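# Find the number of employees working in the automotive industry and earning more than $30 (hint: you can use the length and which functions).
# Employees earning more than $30 per hour, in any industry:
length(which(Gig$HourlyWage > 30))
## [1] 537
# Employees in the automotive industry who also earned more than $30 per hour:
length(which(Gig$Industry == "Automotive" & Gig$HourlyWage > 30))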
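
Counting on more than one criterion means combining the logical conditions with & before passing them to which. A minimal sketch on made-up data (`toy` and its values are hypothetical, not the Gig data) that also shows how a missing industry is handled:

```r
# Hypothetical toy data, for illustration only (not the Gig file).
toy <- data.frame(
  HourlyWage = c(32.5, 28.0, 41.0, 45.2, 36.0),
  Industry   = c("Automotive", "Automotive", "Construction", NA, "Automotive"),
  stringsAsFactors = FALSE
)

# Both conditions must hold; & combines them element by element.
meets_both <- toy$Industry == "Automotive" & toy$HourlyWage > 30

# which() keeps only TRUE positions, so the NA from the missing industry is not counted.
count_with_which <- length(which(meets_both))

# Equivalent count without which().
count_with_sum <- sum(meets_both, na.rm = TRUE)

print(count_with_which)
print(count_with_sum)
```

Rows 1 and 5 meet both conditions here, so both counts are 2.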
