Data Engineering and Mining

Summer 2022 Due Date: June 19, 2022

Assignment 1

Directions: Read and study pages 52 to 62 carefully. Run all code/algorithms from pages 52 to 62. Make a copy of this cover sheet, then turn in a copy of this study sheet with your completed assignment.

1. List at least two packages and their associated functions in R

Two packages used in R are tidyr, which cleans up data and puts it in a standard (tidy) format (e.g., gather() and spread()), and readr, which imports data files (e.g., read_csv() and read_delim()).
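As a minimal sketch of the two packages in action (the file name "scores.csv" and the data values here are made up, not from the book):

```r
library(readr)
library(tidyr)

# Importing a file would look like this (hypothetical file name):
# dat <- read_csv("scores.csv")

# Small made-up data frame used instead of a real file
dat <- data.frame(id = 1:2, math = c(90, 78), science = c(85, 92))

# tidyr::gather() reshapes the data so each row holds one measurement
tidy <- gather(dat, subject, score, -id)
tidy
```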

2. What is data pre-processing?

Data pre-processing transforms data by cleaning it, changing existing variables, and/or creating new variables so that further analysis tools can be applied.

3. What is data cleaning? Explain.

Data cleaning is the process of transforming raw data into a usable state, for example by fixing formatting problems, removing invalid characters, and handling unknown values.

4. What are the properties of tidy data?

Based on the book, the three properties of tidy data are: (i) each value belongs to a variable and an observation; (ii) each variable contains all values of a certain property measured across all observations; (iii) each observation contains all values of the variables measured for the respective case.

5. When data is not in such a tidy format, which package is used to clean up the data and make it more standard?

The package tidyr is used to clean up data and make it more standard.

6.(a) What are some of the functions of the package, tidyr?

Some functions of the package tidyr are gather(), spread(), separate(), and unite().

6.(b) What are some of the functions of the package, dplyr?

Some functions of the package dplyr are filter(), slice(), arrange(), rename(), mutate(), relocate(), and summarize().
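As a rough illustration of a few of these dplyr verbs on a made-up data frame (not the book's dataset):

```r
library(dplyr)

# Made-up data frame, purely for illustration
students <- data.frame(name = c("Ana", "Ben", "Cal"),
                       score = c(88, 95, 72),
                       year = c(1, 2, 1))

students %>%
  filter(year == 1) %>%            # keep first-year students
  mutate(passed = score >= 75) %>% # create a new variable
  arrange(desc(score))             # sort by score, highest first
```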

7. In the section, 3.3.1.3, “String Processing,” there is an algorithm that goes from page 58 to page 60. Carefully read and then explain what the algorithm/code is doing. Also, what are some of the packages used and what are some of the functions associated with those packages? (Run the algorithm to get a feel for what the developer is doing.)

Questions 8 to 23 explain what the code does and the packages used.

8. (a) What are the functions of the package, readr?

Some functions of the package readr are read_csv(), read_delim(), read_lines(), and parse_integer().

8.(b) What are some of the functions of the package, stringr?

Some of the functions of the package stringr are str_split_fixed(), str_trim(), str_replace_all(), and str_c().
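A few one-line illustrations with made-up strings (str_c(), str_trim(), and str_replace_all() are standard stringr functions; the URL is hypothetical):

```r
library(stringr)

str_c("http://example.org/data/", "mydata", ".names")  # join strings into one
str_trim("   hello   ")                                # remove surrounding white space -> "hello"
str_replace_all("bad*name?", "[*?]", "")               # drop the invalid characters -> "badname"
```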

9. Look at the following statement:

dat$Y <- parse_integer (dat$Y, na="?")

The function part is parse_integer(); the arguments are dat$Y (the vector to parse) and na="?" (the value to be interpreted as NA).

10. When is string processing required?

String processing is required when cleaning up data or extracting information from raw data.

11. Answer the following questions: What is the package, impute R, used for?

The package impute R provides tools to handle unknown values.

What is the package, lubridate, available for?

The package lubridate makes it easier to work with dates and times.

What are the functions of lubridate used for?

They extract dates (and times) out of strings and convert them into date/time objects; examples include ymd() and mdy().

What are the objects, POSIXct, used for?

POSIXct objects store date-time values (a date plus a time, represented internally as the number of seconds since 1970-01-01 UTC); lubridate functions such as ymd_hms() return POSIXct objects.

12. Can the functions of the lubridate package be applied to vectors? In what way?

Yes. The lubridate parsing functions are vectorized, so they can be applied directly to character vectors, converting them into date/time vectors that support arithmetic and comparison.
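A small sketch with made-up date strings; ymd(), days(), year(), and month() are standard lubridate functions:

```r
library(lubridate)

# Made-up vector of date strings, purely for illustration
when <- c("2022-06-19", "2021-12-01", "2020-03-15")

d <- ymd(when)   # parse the whole vector at once
d + days(30)     # date arithmetic on the vector

# Extra columns can be added to a data frame from the parsed dates
df <- data.frame(when = d)
df$year  <- year(d)
df$month <- month(d)
df
```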

13. TRUE or FALSE: Some functions of the lubridate package will allow us to add extra columns to data frames.

True

14. Time zones are a crucial aspect of dates and times. Not only can we store time and date information, we can store time zone information using functions with time zone arguments from which package?

We can use the lubridate package; many of its functions, such as ymd_hms(), accept a tz argument for specifying the time zone.
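For example (illustrative values; the tz argument of ymd_hms() is a standard lubridate feature):

```r
library(lubridate)

# The time zone is stored along with the date-time value
ymd_hms("2022-06-19 10:30:00", tz = "America/New_York")
```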

15. Run the code/algorithm on pages 58 to 59. Suppose we have a dataset which contains 400 columns:

  • Explain why (or why not) we cannot manually assign column names to the data frame.

Technically you could, but with 400 columns it would not be efficient or practical to type them all by hand.

  • Explain what can be done using the “.names” file.

The “.names” file contains the names of the columns; the code reads this file, extracts the names, and assigns them to their corresponding columns in the data frame.

16. What package is used for string processing?

We use stringr.

17. Again, look at the algorithm on pages 58 to 59. Try and explain what is going on in each step of the algorithm.

The algorithm builds the locations of the data file and the “.names” file with str_c(), reads the data with read_csv(), and reads the “.names” file line by line with read_lines(). It then takes the lines describing the attributes, splits them with str_split_fixed(), trims surrounding white space with str_trim(), removes invalid characters with str_replace_all(), and finally assigns the cleaned-up names to the columns of the data frame. (A sketch of the clean-up steps appears after question 20.)

18. Explain each of the following lines (code) [Pp. 58-59]: What is going on in each line:

namesF <- str_c(uci.repo, dataset, ".names")
dim(data)
text <- read_lines(url(namesF))

The first function, str_c(), joins multiple strings into a single string; here it builds the location of the “.names” file. The second function, dim(data), outputs the dimensions of the data frame, the first value being the number of rows and the second the number of columns. The third line uses read_lines() to read the “.names” file into a character vector, one element per line.

19. Explain what the function, str_split_fixed() can be used for.

str_split_fixed() splits each string into a fixed number of pieces according to a pattern; in this algorithm it is used to split the attribute lines of the “.names” file so that the column names can be extracted.

20. During preprocessing, we normally have some clean-up to do. During the preprocessing process

  • what is the function, str_trim() used for?

The function str_trim() is used to trim white space (extra leading and trailing spaces) from around strings.

  • what is the function, str_replace_all() used for?

The function str_replace_all() replaces every match of a pattern in a string; here it is used to delete invalid characters by replacing them with another string, or with the empty string (see the sketch below).
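Putting questions 17 to 20 together, a minimal sketch of the clean-up chain might look as follows; the attribute lines and the small data frame are made up for illustration, whereas the real algorithm reads them from the dataset's “.names” and data files:

```r
library(stringr)

# Made-up attribute lines, imitating the format of a ".names" file
lines <- c("  age_of_patient :  numeric. ",
           "  tumor-size :  nominal. ")

parts <- str_split_fixed(lines, ":", n = 2)         # split each line at the colon
nms   <- str_trim(parts[, 1])                       # remove surrounding white space
nms   <- str_replace_all(nms, "[^[:alnum:]_]", "")  # drop invalid characters

# Made-up data frame standing in for the 400-column dataset
data <- data.frame(X1 = c(63, 47), X2 = c("large", "small"))
colnames(data) <- nms                               # assign the extracted names
data
```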

21. What is the special value R has to denote “unknown value?”

The special value is NA.
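For example, using standard base-R behavior:

```r
x <- c(10, NA, 30)    # NA marks an unknown value
is.na(x)              # FALSE  TRUE FALSE
mean(x)               # NA, because one value is unknown
mean(x, na.rm = TRUE) # 20, ignoring the unknown value
```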

22. Study the algorithms on pages 60 and 61, under the section “3.3.1.4 Dealing with Unknown Values.” Interpret the following:

dat

The variable (data frame) in which the dataset is stored.

dat$Y

Accesses the data in column Y of dat.

na="?"

This tells the computer that whenever it sees a "?" it should interpret it as NA; in other words, "?" values are reassigned to NA.

class(dat$Y)

This tells the computer to check the data type of column Y.

(Remember: A factor is a nominal variable.)
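A small illustration of the same idea on a made-up vector (parse_integer() and its na parameter are from readr):

```r
library(readr)

# Made-up column with "?" marking unknown values
y <- c("12", "?", "30")

class(y)                         # "character"
y2 <- parse_integer(y, na = "?") # "?" becomes NA
y2                               # 12 NA 30
class(y2)                        # "integer"
```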

23. Which package carries the function, parse_integer(), which can be used to parse a vector of values into integers; and, also to specify values that should be interpreted as unknown values through the parameter, “na,” during preprocessing.

The package readr carries the function parse_integer().

24. What does the following statement mean?

“We must read unknown values properly into R”

It means that when raw data encodes unknown values with some special symbol (such as "?"), we have to tell R about it when importing or parsing the data (for example through the na parameter), so that those entries end up stored as NA rather than as ordinary strings.

25. List several ways of trying to overcome missing values.

Based on page 61 of the Luis Torgo book, the ways to overcome missing values are:

  • Remove any row containing an unknown value.
  • Fill in (impute) the unknowns using some common value (typically using statistics of centrality).
  • Fill in the unknowns using the most similar rows.
  • Use more sophisticated forms of filling in the unknowns.

A short sketch of the first two strategies follows.
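As a rough sketch of the first two strategies using base R only (the data frame is made up, and this is not the book's exact code):

```r
# Made-up data frame with an unknown value
dat <- data.frame(X = c(1, 2, 3, 4), Y = c(10, NA, 30, 40))

# Strategy 1: remove any row containing an unknown value
complete <- na.omit(dat)

# Strategy 2: fill in the unknowns with a statistic of centrality (the median)
filled <- dat
filled$Y[is.na(filled$Y)] <- median(filled$Y, na.rm = TRUE)

complete
filled
```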