The initial step in data analysis involves loading data into R.
We will learn the following:
1.4.1 Import CSV files
1.4.2 Import Excel files
1.4.3 Import Stata files
1.4.4 Import SPSS files
Before importing a dataset into an R data frame, it is advisable to look at its contents. Skimming through the file beforehand shows which columns are essential and which can be omitted during loading.
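One way to skim a file before a full import is sketched below. It is a minimal illustration, not a prescribed workflow: the small CSV written to a temporary file is a hypothetical stand-in for a real dataset, and the choice of `id` and `income` as the "essential" columns is an assumption made only for the example.

```r
library(readr)

# Hypothetical sample file written to a temporary location for illustration
path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,race,age,income",
    "1,White,32,528.87",
    "2,Other,46,422.02",
    "3,White,43,466.81"
  ),
  path
)

# Look at the first raw lines without parsing them
readLines(path, n = 2)

# Parse only the first rows to check column names and guessed types
preview <- read_csv(path, n_max = 2, show_col_types = FALSE)
names(preview)

# Import only the columns judged essential (an assumed choice here)
df_small <- read_csv(
  path,
  col_select = c(id, income),
  show_col_types = FALSE
)
```

Reading a few rows first keeps the preview cheap even when the full file is large, and `col_select` avoids loading columns that will not be used.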
In this section, we assume that the dataset has been entered accurately, so the basic import syntax is sufficient.
A CSV (comma-separated values) file is a plain text file that stores tabular data: each line is a row, and the values within a row are separated by commas by default. Because it contains only text, the format is easy to read and is supported by virtually every data processing application.
library(readr)
library(gt)
df <- read_csv(
  file = "~/STEMResearch/datasets/hh_income.csv",
  show_col_types = FALSE
)
gt(df)
| id | race | gender | married | education | age | income | expenditure | ses |
|---|---|---|---|---|---|---|---|---|
| 1 | White | Female | Yes | Masters | 32 | 528.87 | 417.77 | High |
| 2 | Other | Male | Yes | Bachelors | 46 | 422.02 | 501.42 | High |
| 3 | White | Female | No | Doctoral | 43 | 466.81 | 336.07 | Middle |
| 4 | Black | Female | No | Masters | 35 | 598.32 | 454.08 | Low |
| 5 | Black | Male | Yes | Bachelors | 36 | 510.16 | 340.33 | Middle |
| 6 | Other | Female | Yes | Doctoral | 44 | 517.48 | 372.84 | Middle |
| 7 | White | Male | Yes | Bachelors | 30 | 399.02 | 348.82 | Low |
| 8 | Black | Male | No | Bachelors | 42 | 546.79 | 393.82 | Low |
| 9 | Other | Female | Yes | Masters | 30 | 467.69 | 460.62 | Low |
| 10 | Black | Male | Yes | Doctoral | 28 | 551.13 | 353.95 | High |
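Because a CSV is plain text, its parsing can be illustrated with literal strings instead of a file. The sketch below uses made-up inline data (not the hh_income.csv file) and assumes readr 2.0 or later, where `I()` marks a string as literal data. It also shows `read_delim()` for a file that uses a separator other than a comma.

```r
library(readr)

# Comma-separated text: the default that read_csv() expects
csv_text <- "id,gender,income\n1,Female,528.87\n2,Male,422.02\n"
df_comma <- read_csv(I(csv_text), show_col_types = FALSE)

# The same values separated by semicolons need an explicit delimiter
semi_text <- "id;gender;income\n1;Female;528.87\n2;Male;422.02\n"
df_semi <- read_delim(I(semi_text), delim = ";", show_col_types = FALSE)

# Both imports should yield the same parsed values
identical(df_comma$income, df_semi$income)
```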
Before importing an Excel dataset into R, it is crucial to understand its structure and characteristics. Points to check beforehand include:
Worksheet name: Identify the name of the worksheet containing the dataset - income.
Column headings: Determine if the dataset includes column headings - Yes.
Selected columns: Specify which columns to import if not importing all - All.
Exclusion of rows: Decide if any rows above or below need exclusion - skip the first 4 rows.
Column count: Be aware of the total number of columns - 9.
Observation count: Understand the total number of observations - 10.
Dataset range: Note the range of the dataset, especially if there's non-data information in the worksheet - A5:I15.
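Some of the checks above can be done from R rather than by opening the workbook. The sketch below uses the example workbook that ships with readxl (`datasets.xlsx`, not the hh_income.xlsx file discussed here) to list worksheet names and to read an explicit cell range. For the dataset described above, passing `range = "A5:I15"` would presumably be an alternative to skipping rows, though that is an untested assumption about that file.

```r
library(readxl)

# Example workbook bundled with readxl, used here only for illustration
path <- readxl_example("datasets.xlsx")

# List worksheet names before choosing the `sheet` argument
sheets <- excel_sheets(path)
sheets

# Read an explicit cell range: one header row plus three data rows,
# across five columns
df_range <- read_excel(path, sheet = "iris", range = "A1:E4")
dim(df_range)
```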
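library(readxl)
df <- read_excel(
  path = "~/STEMResearch/datasets/hh_income.xlsx",
  sheet = "income",  # worksheet that holds the dataset
  skip = 4,  # ignore the 4 rows above the column headings
  na = c("-", "Dont know", "NA", "N/A")  # strings to treat as missing
)
gt(df)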
| id | race | gender | married | education | age | income | expenditure | ses |
|---|---|---|---|---|---|---|---|---|
| 1 | White | Female | Yes | Masters | 32 | 528.87 | 417.77 | High |
| 2 | Other | Male | Yes | Bachelors | NA | 422.02 | 501.42 | High |
| 3 | White | Female | No | Doctoral | 43 | 466.81 | 336.07 | Middle |
| 4 | Black | Female | No | Masters | 35 | 598.32 | NA | Low |
| 5 | Black | NA | Yes | Bachelors | 36 | 510.16 | 340.33 | Middle |
| 6 | Other | Female | Yes | Doctoral | 44 | 517.48 | 372.84 | Middle |
| 7 | White | Male | Yes | Bachelors | 30 | 399.02 | 348.82 | Low |
| 8 | Black | Male | No | Bachelors | 42 | 546.79 | 393.82 | Low |
| 9 | Other | Female | Yes | Masters | 30 | NA | NA | Low |
| 10 | Black | Male | Yes | Doctoral | 28 | 551.13 | 353.95 | High |
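The strings listed in `na` show up as `NA` in the imported table, so it may be worth counting them before any analysis. The sketch below is base R only and uses a hand-typed data frame (`df_check`, a hypothetical name) holding the four rows of the table above that contain missing values, restricted to a few columns.

```r
# Rows 2, 4, 5 and 9 of the imported table, typed in by hand
df_check <- data.frame(
  id = c(2, 4, 5, 9),
  gender = c("Male", "Female", NA, "Female"),
  age = c(NA, 35, 36, 30),
  income = c(422.02, 598.32, 510.16, NA),
  expenditure = c(501.42, NA, 340.33, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(df_check))
na_counts

# Rows that contain at least one missing value
incomplete <- df_check[!complete.cases(df_check), ]
incomplete
```

The same two calls can be applied to the full `df` after import.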
Importing Stata datasets into R is straightforward with the read_stata() function from the haven package.
library(haven)
df <- read_stata(
  file = "~/STEMResearch/datasets/hh_income.dta"
)
gt(df)
| id | race | gender | married | education | age | income | expenditure | ses |
|---|---|---|---|---|---|---|---|---|
| 1 | White | Female | Yes | Masters | 32 | 528.87 | 417.77 | High |
| 2 | Other | Male | Yes | Bachelors | 46 | 422.02 | 501.42 | High |
| 3 | White | Female | No | Doctoral | 43 | 466.81 | 336.07 | Middle |
| 4 | Black | Female | No | Masters | 35 | 598.32 | 454.08 | Low |
| 5 | Black | Male | Yes | Bachelors | 36 | 510.16 | 340.33 | Middle |
| 6 | Other | Female | Yes | Doctoral | 44 | 517.48 | 372.84 | Middle |
| 7 | White | Male | Yes | Bachelors | 30 | 399.02 | 348.82 | Low |
| 8 | Black | Male | No | Bachelors | 42 | 546.79 | 393.82 | Low |
| 9 | Other | Female | Yes | Masters | 30 | 467.69 | 460.62 | Low |
| 10 | Black | Male | Yes | Doctoral | 28 | 551.13 | 353.95 | High |
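Stata files often store categorical variables as numeric codes with value labels. Whether hh_income.dta does so is not stated here, so the sketch below builds a hypothetical miniature dataset, writes it with `write_dta()`, reads it back with `read_stata()`, and converts the labelled column to a factor with `as_factor()`.

```r
library(haven)

# Hypothetical miniature dataset with a value-labelled column
df_src <- data.frame(id = c(1, 2, 3))
df_src$gender <- labelled(c(1, 2, 1), labels = c(Female = 1, Male = 2))

# Round trip through a temporary .dta file
path <- tempfile(fileext = ".dta")
write_dta(df_src, path)
df_dta <- read_stata(path)

# The column comes back as labelled numeric codes
class(df_dta$gender)

# Convert the codes to an ordinary R factor using the value labels
df_dta$gender <- as_factor(df_dta$gender)
levels(df_dta$gender)
```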
The haven package also provides the read_spss() function for importing SPSS files (.sav format) into R.
df <- read_spss(
  file = "~/STEMResearch/datasets/hh_income.sav"
)
gt(df)
| id | race | gender | married | education | age | income | expenditure | ses |
|---|---|---|---|---|---|---|---|---|
| 1 | White | Female | Yes | Masters | 32 | 528.87 | 417.77 | High |
| 2 | Other | Male | Yes | Bachelors | 46 | 422.02 | 501.42 | High |
| 3 | White | Female | No | Doctoral | 43 | 466.81 | 336.07 | Middle |
| 4 | Black | Female | No | Masters | 35 | 598.32 | 454.08 | Low |
| 5 | Black | Male | Yes | Bachelors | 36 | 510.16 | 340.33 | Middle |
| 6 | Other | Female | Yes | Doctoral | 44 | 517.48 | 372.84 | Middle |
| 7 | White | Male | Yes | Bachelors | 30 | 399.02 | 348.82 | Low |
| 8 | Black | Male | No | Bachelors | 42 | 546.79 | 393.82 | Low |
| 9 | Other | Female | Yes | Masters | 30 | 467.69 | 460.62 | Low |
| 10 | Black | Male | Yes | Doctoral | 28 | 551.13 | 353.95 | High |
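When only some variables are needed, `read_spss()` can restrict the import with its `col_select` argument. The sketch below is a minimal illustration on a hypothetical three-row dataset written to a temporary .sav file with `write_sav()`. The values are copied from the first rows of the table above, but the file itself is not hh_income.sav.

```r
library(haven)

# Hypothetical miniature dataset written to a temporary .sav file
df_src <- data.frame(
  id = c(1, 2, 3),
  age = c(32, 46, 43),
  income = c(528.87, 422.02, 466.81)
)
path <- tempfile(fileext = ".sav")
write_sav(df_src, path)

# Import only the selected columns
df_sav <- read_spss(path, col_select = c(id, income))
names(df_sav)
```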