Introduction
This code-through briefly explores how to clean a data set by
creating an analytical data set from a larger one, how to turn your
new data set into a data frame, and how to export that data frame.
Content Overview
Specifically, we’ll explain and demonstrate how to clean a large data
set to focus on just one year of the data set (creating a smaller data
set from a large data set), turn your data set into a data frame, and
export your new data frame to your laptop.
Why You Should Care
This topic is valuable because if you continue to use R and work with large data sets, you will have to clean those data sets to fit your research or project. Once you have created your new data set, you will want to turn it into a data frame and download it so that it is easy to share with your peers and colleagues.
Learning Objectives
Specifically, you’ll learn how to create a data frame and export it.
How to turn DataSets into DataFrames and DataFrames into CSV Files
Here, we’ll show how to clean a data set and create a smaller data set with no missing values or errors.
Further Exposition
The data used for this code-through come from the National Longitudinal Surveys (NLS), which are designed to collect information on labor market activities at several points in time. The NLS surveys have been an important tool for researchers such as economists and sociologists. I chose this data set because it is large and is a good representation of the data you may encounter in the workforce or in a research setting.
Step One: Creating an Analytic Data Set
Step one shows how to clean a large data set by dropping the observations with missing values. To create an analytical data set, we must remove the missing values in every variable we plan to use. To do so, I created a refined data set from the original NLSY data set: I named my new data set, used filter() with an !is.na() test on each variable I wanted so that only rows with no missing values remained, and then used select() to keep just those variables. This narrowed the NLSY data set from 177,604 observations to 25,538 observations.
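To see the logic on something small, here is a minimal sketch using a made-up four-row data frame (the `toy` object and its values are hypothetical, not NLSY data). It swaps in base R's `complete.cases()`, which keeps only the rows with no `NA` in any column, so it should give the same rows as chaining an `!is.na()` test over every variable.

```r
# Hypothetical toy data, not taken from the NLSY
toy <- data.frame(
  id     = c(1, 2, 3, 4),
  year   = c(2006, 2006, NA, 2008),
  income = c(52000, NA, 41000, 38000)
)

# Keep only the rows where every column is non-missing
toy_complete <- toy[complete.cases(toy), ]

nrow(toy)           # 4 rows before
nrow(toy_complete)  # 2 rows after (ids 1 and 4)
```

Comparing `nrow()` before and after, as above, is an easy way to see how many observations the missing-value filter removed.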
#Load dplyr, which provides filter() and select() (skip this if it is already loaded)
library(dplyr)
#Use the filter function to keep only the rows you want. Use !is.na() on each variable to drop observations where that variable is missing.
NLSYrefined <- filter(NLSY, !is.na(id) & !is.na(year) & !is.na(afqt) & !is.na(age) & !is.na(female) & !is.na(educ) & !is.na(married) & !is.na(income) & !is.na(height))
#Use the select function to choose each variable in your new data set
NLSYrefined <- select(NLSYrefined, id, year, afqt, age, educ, female, married, income, height)
Step Two
Create your smaller data set from your refined data set by naming your new data set and filtering the refined data set for just the year 2006. Then I checked the number of observations with the nrow() function. The filter() function reduced the refined data set from 25,538 observations to just 6,894 observations.
#Create a new data set from the refined data set by filtering on a particular value, in this case the year 2006
NLSY2006 <- filter(NLSYrefined, year == 2006)
#Display the number of observations with the function nrow()
nrow(NLSY2006)
## [1] 6894
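A quick sanity check after a filter like this is to confirm that only the target year remains. The sketch below assumes a hypothetical `toy` data frame and uses base R's `subset()` in place of `dplyr::filter()`; the idea is the same either way.

```r
# Hypothetical toy data with several survey years per person
toy <- data.frame(
  id   = c(1, 1, 2, 2, 3),
  year = c(2004, 2006, 2006, 2008, 2006)
)

# Keep only the rows from 2006
toy2006 <- subset(toy, year == 2006)

nrow(toy2006)         # 3
unique(toy2006$year)  # 2006
table(toy$year)       # counts per year in the unfiltered data
```

If `unique()` returns anything other than the year you filtered on, the filter condition is worth a second look.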
#This picture shows the difference in the number of observations and variables in the three data sets
#(readPNG() comes from the png package and grid.raster() from the grid package):
img <- readPNG("Datasets.png")
grid.raster(img)
Step Three: Create Data Frame
This step is easy: pass whatever data set you want to turn into a data frame to the data.frame() function.
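If you are unsure whether an object is already a data frame, `is.data.frame()` will tell you. The sketch below uses a hypothetical matrix `m` as the starting point, since a matrix is one common object that is not a data frame. Objects returned by dplyr functions are often data frames already, so for those this step may change nothing; checking first costs little.

```r
# Hypothetical example: a numeric matrix with named columns
m <- matrix(c(1, 2, 3, 30, 41, 27), nrow = 3,
            dimnames = list(NULL, c("id", "age")))

is.data.frame(m)   # FALSE

# Convert the matrix to a data frame
df <- data.frame(m)

is.data.frame(df)  # TRUE
names(df)          # "id" "age"
str(df)            # compact summary of the columns and their types
```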
#Create your data frame, you can use this as a template:
img <- readPNG("DataFrame.png")
grid.raster(img)
#My code
NLSY2006 <- data.frame(NLSY2006)
Step Four: Creating Your CSV File and Downloading It
For the last step, use the write.csv() function to turn the data frame into a CSV file. If you want the row names included in the file, set row.names = TRUE; if not, set row.names = FALSE.
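To see what the `row.names` argument changes, here is a minimal sketch that writes a hypothetical `toy` data frame both ways and reads the files back. It writes to temporary files (via `tempfile()`) so that nothing in your working directory is touched; the object and file names are illustrative only.

```r
# Hypothetical toy data
toy <- data.frame(id = c(1, 2), income = c(52000, 38000))

# Temporary file paths, one for each setting
with_names    <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

write.csv(toy, with_names,    row.names = TRUE)
write.csv(toy, without_names, row.names = FALSE)

# Reading the files back shows the difference
names(read.csv(with_names))     # "X" "id" "income"
names(read.csv(without_names))  # "id" "income"
```

With `row.names = TRUE`, the row names are written as an extra unnamed first column, which `read.csv()` labels `X` when the file is read back. That extra column is usually why `row.names = FALSE` is the more convenient choice for files you plan to share.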
#You can use this as a template:
img <- readPNG("WriteCSV.png")
grid.raster(img)
#If you would like to show row names choose TRUE, if you do not want to show row names choose FALSE like the template.
#My Code:
write.csv(NLSY2006, "NLSY2006.csv", row.names = FALSE)
#Your CSV file should appear in your files as shown in this photo:
img <- readPNG("CSV.png")
grid.raster(img)
#Check the box:
img <- readPNG("CheckBox.png")
grid.raster(img)
#Press the settings button.
img <- readPNG("Settings.png")
grid.raster(img)
#Click download.
img <- readPNG("Download.png")
grid.raster(img)
#Your new CSV should save to your working directory. Good luck, I hope this helped.
Further Resources
Learn more about how to import CSV files into R, and much more, with the following:
Resource I Hyperlink Text
Resource II Hyperlink Text
Works Cited
This code-through references and cites the following sources:
Author Unknown (from datatofish.com) (2021). Source I. Hyperlink Text
Zach (2020). Source II. Hyperlink Text