Cleaning & Manipulating Data

Tidyr and dplyr

About 80% of the process of manipulating data is spent cleaning it and preparing it to be analyzed. Once the data is clean, it is easy for a software program to run statistical tests on it and reveal information and insights. Two packages in R that are good to use when cleaning data are tidyr and dplyr. To load them you call library() with the name of the package you want, tidyr or dplyr.

tidyr provides the gather, spread, and separate functions, which reshape the data so that it is easy to use. For example, the gather function takes multiple columns and collapses them into key-value pairs. The spread function does the reverse: it takes key-value pairs and spreads them across multiple columns, which can make a convoluted, compressed data set easier to read and easier to turn into a visual representation. The separate function splits a single column into several. All three serve the same purpose of reshaping. Steps that work on the same data can be simplified with the pipe operator (%>%), which lets you chain multiple cleaning functions one after another, such as when you want to filter the data, summarize it, and order it.

Next, the publication talked about the various functions contained in the dplyr package, which focuses on transforming data frames and reducing their size. Its functions include select, filter, group_by, summarise, arrange, join, and mutate. For example, the select function lets you select and rename variables within the data set. You can keep all the variables you want or only specific columns. The next function that proved to be insightful for me was group_by. This function creates categorical groupings of like observations within the data set. There are no observable changes in the data itself, but you will notice the effect when you compute summary statistics, which are then calculated per group.
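To keep the three tidyr functions straight, here is a small sketch of each on made-up data. The data frames, column names, and numbers are all invented for illustration and do not come from the article. Newer versions of tidyr also offer pivot_longer() and pivot_wider() for the same jobs, but gather() and spread() are the names the reading uses.

```r
library(tidyr)

# Hypothetical wide data: one column per year (values are made up).
wide <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(15, 25),
  check.names = FALSE
)

# gather(): collapse the year columns into key-value pairs (wide -> long).
long <- gather(wide, key = "year", value = "cases", -country)

# spread(): the reverse, one column per key again (long -> wide).
wide_again <- spread(long, key = year, value = cases)

# separate(): split one column into two at the "/" separator.
rates <- data.frame(country = c("A", "B"), rate = c("10/100", "20/200"))
split_rates <- separate(rates, rate, into = c("cases", "population"),
                        sep = "/", convert = TRUE)
```

After gather(), each row of long holds one country-year pair, which is the shape most plotting and summary functions expect.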
Another function that gives instant utility is summarise, which performs basic analysis of the data set: you can choose to summarise the mean, median, range, and so on. Next, the arrange function lets us order the data in whatever way we would like to view it. For instance, if we order by the means in ascending order, it presents the rows from the lowest mean to the highest. Also, if we have two separate data sets that share variables, we can combine them with one of the join functions for further analysis. The last dplyr function is mutate, which changes an existing data set. This function confuses me a bit, but I understand that it is used to add a new variable to the data frame, computed from variables that are already there.
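The dplyr functions described above can be sketched together on a toy example. Everything here (the scores and teachers data frames, their columns, and the values) is hypothetical and only meant to show what each function does; left_join is one of several join functions in dplyr and was picked arbitrarily.

```r
library(dplyr)

# Hypothetical data (values are made up for illustration).
scores <- data.frame(
  student = c("ann", "bob", "cy", "dee"),
  class   = c("x", "x", "y", "y"),
  score   = c(80, 90, 60, 70)
)
teachers <- data.frame(class = c("x", "y"), teacher = c("Lee", "Kim"))

# select(): keep two columns and rename one of them on the way.
picked <- select(scores, name = student, score)

# mutate(): add a new column computed from an existing one.
with_pct <- mutate(scores, pct = score / 100)

# group_by() + summarise(): one mean per class;
# arrange(): order the result from lowest to highest mean;
# left_join(): bring in the matching teacher for each class.
class_means <- scores %>%
  group_by(class) %>%
  summarise(mean_score = mean(score)) %>%
  arrange(mean_score) %>%
  left_join(teachers, by = "class")
```

Seeing mutate next to summarise helped me: mutate keeps every row and adds a column, while summarise shrinks each group down to one row.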

Data Tidying

By Garrett Grolemund

This second article works through a running example of how the way data is formatted before it goes into the computer shapes a clear approach that generates the most understanding for our viewers. In the example, all of the data sets show the same values, but one of the three is the easiest to work with. He states three conditions that a data set must meet to yield the greatest workability: one, each variable should be placed in its own column; two, each observation should be placed in its own row; and three, each value should be placed in its own cell. When you satisfy these rules you have tidy data, according to the author. Tidy data suits R well because R works with data in the form of vectors, and in a tidy data frame each column is a vector holding one variable. The relationships in the data set must parallel the structure of the data frame. When we start to compare the data sets, we see how problems can arise from the formatting of the original set. For instance, in dataset 4 the variables are split across separate tables, which makes comparing the two sets that much more difficult. Making a key makes data sets or different variables easier to manage. Again, the functions listed in the previous article show how tidy data leads to the most concise outcomes.
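The point about vectors is easier to see with a tiny example. This is a sketch in base R with invented numbers, not the article's own tables: because each column of a tidy data frame is a vector, a calculation on two variables lines up row by row without any extra work.

```r
# Hypothetical tidy data: one variable per column, one observation per row.
tidy <- data.frame(
  country    = c("A", "A", "B", "B"),
  year       = c(2019, 2020, 2019, 2020),
  cases      = c(10, 15, 20, 25),
  population = c(1000, 1000, 2000, 2000)
)

# Each column is a vector, so the arithmetic is applied element by element:
# row 1 cases with row 1 population, row 2 with row 2, and so on.
tidy$rate <- tidy$cases / tidy$population * 10000
```

If cases and population lived in two separate tables, the rows would first have to be matched up by country and year before this one line could run.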

Introduction to dplyr for Faster Data Manipulation in R

The article opens with the question of why people use dplyr, and the author answers with two assertions: it is great for exploratory research and for transformations. Also, code written with this package is easy to write as well as to read. Again, this article touches on the various dplyr functions that were illuminated in the first reading. One of the main points of this publication that helped me was the article's clarity on chaining. Chaining lets you perform multiple operations in one statement by linking the steps together, which makes the code easier to read when multiple commands take place at one time. Mutating lets one take existing variables and combine them in a way that creates a new variable built from the existing data. The speed example helped depict this function.
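Here is a rough sketch of why chaining reads better, using a speed calculation in the spirit of the article's example. The flights data frame, its column names, and its values are my own assumptions, not the data set the article uses.

```r
library(dplyr)

# Hypothetical flight records; columns and values are made up.
flights <- data.frame(
  carrier  = c("AA", "AA", "UA"),
  distance = c(300, 600, 450),  # miles
  air_time = c(60, 90, 90)      # minutes
)

# Nested form: has to be read from the inside out.
nested <- arrange(
  mutate(select(flights, carrier, distance, air_time),
         speed = distance / air_time * 60),
  desc(speed)
)

# Chained form: the same steps, read from top to bottom.
chained <- flights %>%
  select(carrier, distance, air_time) %>%
  mutate(speed = distance / air_time * 60) %>%  # miles per hour
  arrange(desc(speed))
```

Both versions give the same result. The chained one simply lists the steps in the order they happen.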

Data Manipulation using dplyr

The last slide show took all the information from the first three resources and put it together in one all-encompassing presentation. It talked about how to load dplyr in the R console and how to then load data sets to use with it. Using tbl_df wraps a data set so that it prints compactly, which lets one home in on it. This slide show clearly demonstrates useful ways to drill into the data set and to highlight the aspects you want your audience to see.
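A minimal sketch of the loading step, assuming the built-in mtcars data set rather than whatever data the slides used. As far as I can tell, recent versions of dplyr have replaced tbl_df() with as_tibble(), so that is what the sketch calls.

```r
library(dplyr)

# mtcars ships with R; it stands in here for the slides' data set.
cars_tbl <- as_tibble(mtcars)

# Printing a tibble shows only the first rows and the columns that fit
# on screen, instead of dumping the whole data frame.
print(cars_tbl)

# The result still behaves like a data frame in dplyr calls.
heavy <- filter(cars_tbl, wt > 4)
```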