Summer 2022 Due Date: June 19, 2022
Assignment 1
Directions: Carefully read and study pages 52 to 62. Run all code and algorithms from those pages. Make a copy of this cover sheet, then turn in a copy of this study sheet with your completed assignment.
The two packages used in R are tidyr, which cleans up data and standardizes it, and readr, which imports files.
Data pre-processing transforms data by cleaning it, changing variables, and/or creating new variables so that further analysis tools can be applied.
Data cleaning is a method of transforming data into a usable state.
Based on the book, the three properties of tidy data are: (i) each value belongs to a variable and an observation; (ii) each variable contains all values of a certain property measured across all observations; (iii) each observation contains all values of the variables measured for the respective case.
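The three properties can be illustrated with a minimal sketch. The table below and its column names (student, exam1, exam2) are made up for illustration and do not come from the book; gather() and spread() are the tidyr functions named in this sheet.

```r
library(tidyr)

# Hypothetical "wide" table: one row per student, one column per exam.
wide <- data.frame(
  student = c("Ana", "Ben"),
  exam1   = c(90, 75),
  exam2   = c(85, 80)
)

# gather() stacks the exam columns so that each row is one observation
# (a student-exam pair) and each column is one variable.
long <- gather(wide, key = "exam", value = "score", exam1, exam2)
print(long)

# spread() reverses the operation and returns the wide layout.
wide_again <- spread(long, key = exam, value = score)
print(wide_again)
```

In the long table every score sits in exactly one row (its observation) and one column (its variable), which is the layout the three properties describe.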
The package tidyr is used to clean up data and make it more standard.
Some functions of the package tidyr are gather(), spread(), separate(), and unite().
### 6.(b) What are some of the functions of the package, dplyr?
Some functions of the package dplyr are filter(), slice(), arrange(), rename(), mutate(), relocate(), and summarize().
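As a rough sketch of how several dplyr functions fit together, the example below uses a made-up data frame (scores) that is not from the book. It also uses group_by() and desc(), which are dplyr functions not listed above.

```r
library(dplyr)

# Hypothetical data for illustration only.
scores <- data.frame(
  student = c("Ana", "Ben", "Cal", "Dee"),
  group   = c("A", "A", "B", "B"),
  score   = c(90, 75, 60, 85)
)

result <- scores %>%
  filter(score >= 70) %>%        # keep rows that satisfy a condition
  mutate(pct = score / 100) %>%  # add a new column
  arrange(desc(score)) %>%       # sort rows, highest score first
  rename(name = student)         # rename a column
print(result)

# summarize() collapses each group to a single row.
by_group <- scores %>%
  group_by(group) %>%
  summarize(mean_score = mean(score))
print(by_group)
```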
Questions 8-23 explain what the code does and which packages are used.
Some functions of the package readr are read_csv and read_delim. (tidyverse is not a function; it is the collection of packages that readr belongs to.)
Some of the functions of the package, stringr, are str_split_fixed, str_trim, str_replace_all, and str_c.
dat$Y <- parse_integer(dat$Y, na = "?")
The function is parse_integer; the arguments are dat$Y and na = "?".
String processing is required when cleaning up data or extracting information from raw data.
The package imputeR provides tools to handle unknown values.
The package lubridate makes it easier to work with dates and times.
It extracts dates out of strings.
Some of its parsing functions are ymd() and myd().
lubridate converts character vectors into date objects that are convenient for arithmetic and computation.
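A short sketch of date parsing and arithmetic with lubridate follows. The date strings are made up for illustration; days() is another lubridate function, not mentioned above.

```r
library(lubridate)

# The function name gives the order of the components in the string.
d1 <- ymd("2022-06-19")   # year, month, day
d2 <- myd("06/2022/01")   # month, year, day

class(d1)                 # "Date"

# Once parsed, dates support ordinary arithmetic.
gap <- as.numeric(d1 - d2)   # number of days between the two dates
next_week <- d1 + days(7)    # shift a date by one week
print(gap)
print(next_week)
```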
True
We can use lubridate.
Technically you can, but it would not be efficient.
- Explain what can be done using the “.names” file.
The “.names” file can be read to extract the names of the columns, which are then assigned to their corresponding columns.
We use stringr.
namesF <- str_c(uci.repo, dataset, ".names")
dim(data)
text <- read_lines(url(namesF))
The first function, str_c, joins multiple strings into a single string. The second function, dim(data), outputs the dimensions of the matrix or table: the first value is the number of rows and the second value is the number of columns in the data. The third function, read_lines, reads the file into a character vector with one element per line.
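The behavior of str_c and dim can be checked without downloading anything. In this sketch the URL, the data set name, and the small table are placeholders chosen for illustration, not the values used in the book.

```r
library(stringr)

# Hypothetical values; the real repository URL and data set name are not shown here.
base_url <- "https://example.org/data/"
dataset  <- "mydata"

# str_c() concatenates its arguments into one string.
names_file <- str_c(base_url, dataset, ".names")
print(names_file)   # "https://example.org/data/mydata.names"

# dim() returns the number of rows, then the number of columns.
tbl <- data.frame(a = 1:3, b = c("x", "y", "z"))
print(dim(tbl))     # 3 2
```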
This function extracts the column names.
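One way column names might be pulled out of the lines of a “.names” file is sketched below. The line format ("name: description") and the sample values are assumptions for illustration; a real “.names” file may be laid out differently.

```r
library(stringr)

# Hypothetical lines in the style "name: description".
text <- c("  age: numeric", "  colour: red, green, blue", "  weight: numeric")

# Split each line into two pieces at the first colon; column 1 holds the name.
parts <- str_split_fixed(text, ":", 2)
col_names <- str_trim(parts[, 1])
print(col_names)

# Assign the names to the columns of a table that was read without a header.
data <- data.frame(V1 = c(30, 41), V2 = c("red", "blue"), V3 = c(70.5, 82.1))
colnames(data) <- col_names
print(data)
```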
The function str_trim() is used to trim white space or extra spaces around strings.
The function str_replace_all() replaces every match of a pattern with a string of your choice; replacing invalid characters with the empty string deletes them.
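Both functions can be tried on a few made-up strings. The sample values and the choice of which characters count as invalid are assumptions for illustration.

```r
library(stringr)

# Hypothetical raw names with stray whitespace and unwanted characters.
raw <- c("  age_gt_60 ", "\tair() ", " history_(noise)")

# str_trim() removes whitespace at both ends of each string.
trimmed <- str_trim(raw)
print(trimmed)

# Replacing with "" deletes every parenthesis.
clean <- str_replace_all(trimmed, "[\\(\\)]", "")
print(clean)

# Replacing with another string substitutes it for every match.
dotted <- str_replace_all(clean, "_", ".")
print(dotted)
```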
The special value is NA.
dat
The object in which the data set is stored.
dat$Y
Refers to column Y of dat.
na="?"
This tells the computer to interpret every ? it sees as a missing value; in other words, ? is reassigned to NA.
class(dat$Y)
This tells the computer to report the data type of column Y.
(Remember: A factor is a nominal variable.)
The package readr provides the function parse_integer().
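The pieces explained above can be put together in a small sketch. The contents of dat here are made up for illustration, and the column is stored as character because parse_integer() expects a character vector.

```r
library(readr)

# Hypothetical data frame: column Y holds text because of the "?" entry.
dat <- data.frame(
  X = c("green", "blue", "green", "red"),
  Y = c("56", "?", "100", "-10"),
  stringsAsFactors = FALSE
)
class(dat$Y)   # "character"

# Treat "?" as a missing value while converting the strings to integers.
dat$Y <- parse_integer(dat$Y, na = "?")
class(dat$Y)   # "integer"
print(dat$Y)   # 56 NA 100 -10
```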
“We must read unknown values properly into R”
Based on page 61 of the Luis Torgo book, the ways to overcome missing values are:
- Remove any row containing an unknown value
- Fill-in (impute) the unknowns using some common value (typically using statistics of centrality)
- Fill-in the unknowns using the most similar rows
- Using more sophisticated forms of filling-in the unknowns.
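The first two strategies can be sketched in base R as follows. The data frame is made up for illustration, and this is not the book's own code; the median is used for the numeric column and the most frequent value (the mode) for the nominal one.

```r
# Hypothetical data with unknown values.
df <- data.frame(
  x = c(1.0, NA, 3.0, 4.0, 100.0),
  y = c("a", "b", NA, "a", "a")
)

# Strategy 1: remove every row that contains at least one NA.
complete_rows <- na.omit(df)
print(complete_rows)

# Strategy 2: fill in the unknowns with a statistic of centrality.
filled <- df
filled$x[is.na(filled$x)] <- median(df$x, na.rm = TRUE)   # median of the known values
mode_y <- names(which.max(table(df$y)))                   # most frequent known value
filled$y[is.na(filled$y)] <- mode_y
print(filled)
```

The median is often preferred over the mean when a column has extreme values such as the 100 above, because the median is less affected by them.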