1.4 Import datasets

The initial step in data analysis involves loading data into R.

We will learn the following:

  • 1.4.1 Import CSV files

  • 1.4.2 Import Excel files

  • 1.4.3 Import Stata files

  • 1.4.4 Import SPSS files

Before importing a dataset into an R data frame, it helps to look at its contents. Skimming the file first shows which columns are essential and which can be omitted during loading.
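As a minimal sketch of this kind of preview, the example below writes a small, hypothetical CSV file to a temporary location, inspects its first lines as raw text, and then loads only two columns. The toy file and the object names (tmp, preview) are assumptions made for illustration; the col_select argument assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical toy file, written to a temporary location for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,gender,age,income",
    "1,Female,32,528.87",
    "2,Male,46,422.02",
    "3,Female,43,466.81"
  ),
  tmp
)

# Peek at the raw text: the header line plus the first two records
readLines(tmp, n = 3)

# Load only the columns needed for the analysis
preview <- read_csv(
  file = tmp,
  col_select = c(id, income),
  show_col_types = FALSE
)
preview
```

Looking at the raw lines first shows whether the file has a header row and which delimiter it uses, before any parsing decisions are made.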

In this section, we assume that the dataset has been entered correctly, so the basic import syntax is sufficient.

1.4.1 Import CSV files

A CSV (comma-separated values) file is a plain text file that stores tabular data, with values separated by commas by default. Because it contains only text, the format is easy to read and is supported by nearly all data processing applications.

library(readr)
library(gt)
df <- read_csv(
  file = "~/STEMResearch/datasets/hh_income.csv",
  show_col_types = FALSE
)
gt(df)
id race gender married education age income expenditure ses
1 White Female Yes Masters 32 528.87 417.77 High
2 Other Male Yes Bachelors 46 422.02 501.42 High
3 White Female No Doctoral 43 466.81 336.07 Middle
4 Black Female No Masters 35 598.32 454.08 Low
5 Black Male Yes Bachelors 36 510.16 340.33 Middle
6 Other Female Yes Doctoral 44 517.48 372.84 Middle
7 White Male Yes Bachelors 30 399.02 348.82 Low
8 Black Male No Bachelors 42 546.79 393.82 Low
9 Other Female Yes Masters 30 467.69 460.62 Low
10 Black Male Yes Doctoral 28 551.13 353.95 High
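When a file is not as clean as assumed here, two readr options are often useful: declaring column types explicitly and naming a different delimiter. The sketch below uses small inline strings instead of a file; the data values and object names (csv_text, toy, semi) are hypothetical, and wrapping text in I() to mark it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline data; I() tells readr this is literal text, not a path
csv_text <- I("id,gender,age,income\n1,Female,32,528.87\n2,Male,NA,422.02\n")

# Declare column types up front instead of relying on type guessing
toy <- read_csv(
  file = csv_text,
  col_types = cols(
    id = col_integer(),
    gender = col_character(),
    age = col_integer(),
    income = col_double()
  ),
  na = c("", "NA")
)

# For files that use another separator, read_delim() takes the delimiter
semi <- read_delim(
  file = I("id;income\n1;528.87\n"),
  delim = ";",
  show_col_types = FALSE
)
```

Declaring types makes the import fail loudly or warn when a column contains unexpected values, rather than silently changing the column's type.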

1.4.2 Import Excel files

Before importing an Excel dataset into R, it is important to understand its structure and characteristics. Points to check beforehand include:

  • Worksheet name: Identify the name of the worksheet containing the dataset - income.

  • Column headings: Determine if the dataset includes column headings - Yes.

  • Selected columns: Specify which columns to import if not importing all - All.

  • Exclusion of rows: Decide if any rows above or below need exclusion - skip 4.

  • Column count: Be aware of the total number of columns - 9.

  • Observation count: Understand the total number of observations - 10.

  • Dataset range: Note the range of the dataset, especially if there's non-data information in the worksheet - A5:I15.

library(readxl)
df <- read_excel(
    path = "~/STEMResearch/datasets/hh_income.xlsx",
    sheet = "income",
    skip = 4,
    na = c("-", "Dont know", "NA", "N/A")
)
gt(df)
id race gender married education age income expenditure ses
1 White Female Yes Masters 32 528.87 417.77 High
2 Other Male Yes Bachelors NA 422.02 501.42 High
3 White Female No Doctoral 43 466.81 336.07 Middle
4 Black Female No Masters 35 598.32 NA Low
5 Black NA Yes Bachelors 36 510.16 340.33 Middle
6 Other Female Yes Doctoral 44 517.48 372.84 Middle
7 White Male Yes Bachelors 30 399.02 348.82 Low
8 Black Male No Bachelors 42 546.79 393.82 Low
9 Other Female Yes Masters 30 NA NA Low
10 Black Male Yes Doctoral 28 551.13 353.95 High
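The worksheet name and dataset range from the checklist can also be checked and used from within R. The sketch below is an illustration only: it relies on the example workbook that ships with readxl (located with readxl_example()) as a stand-in for hh_income.xlsx, and the object names path and part are assumptions.

```r
library(readxl)

# Path to a workbook bundled with readxl; a stand-in for the real file
path <- readxl_example("datasets.xlsx")

# List the worksheet names before choosing one
excel_sheets(path)

# Read a rectangular cell range: the first row of the range is the header
part <- read_excel(
  path = path,
  sheet = "mtcars",
  range = "A1:C4"
)
dim(part)
```

For the workbook described above, passing range = "A5:I15" would be an alternative to skip = 4; when range is supplied, read_excel() ignores skip.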

1.4.3 Import Stata files

To import a Stata (.dta) file into R, use the read_stata() function from the haven package.

library(haven)
df <- read_stata(
    file = "~/STEMResearch/datasets/hh_income.dta"
)
gt(df)
id race gender married education age income expenditure ses
1 White Female Yes Masters 32 528.87 417.77 High
2 Other Male Yes Bachelors 46 422.02 501.42 High
3 White Female No Doctoral 43 466.81 336.07 Middle
4 Black Female No Masters 35 598.32 454.08 Low
5 Black Male Yes Bachelors 36 510.16 340.33 Middle
6 Other Female Yes Doctoral 44 517.48 372.84 Middle
7 White Male Yes Bachelors 30 399.02 348.82 Low
8 Black Male No Bachelors 42 546.79 393.82 Low
9 Other Female Yes Masters 30 467.69 460.62 Low
10 Black Male Yes Doctoral 28 551.13 353.95 High
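Stata files can carry value labels (for example, 1 = Male, 2 = Female), which haven imports as labelled vectors rather than factors. Whether hh_income.dta contains such labels is not stated here, so the sketch below builds a hypothetical toy file, reads it back, and converts the labelled column with as_factor(). The names toy, tmp, and df_toy are assumptions for illustration.

```r
library(haven)

# Hypothetical toy data with a value-labelled integer column
toy <- tibble::tibble(
  id = 1:3,
  gender = labelled(c(1L, 2L, 1L), labels = c(Male = 1L, Female = 2L))
)

# Round-trip through a temporary .dta file
tmp <- tempfile(fileext = ".dta")
write_dta(toy, tmp)
df_toy <- read_stata(tmp)

# The column arrives as a labelled vector, not a factor
class(df_toy$gender)

# Convert the value labels into factor levels for analysis
df_toy$gender <- as_factor(df_toy$gender)
levels(df_toy$gender)
```

Converting to factors is a common step before tabulating or modelling, since most R functions do not interpret haven's labelled class.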

1.4.4 Import SPSS files

The haven package also provides the read_spss() function for importing SPSS (.sav format) files into R.

df <- read_spss(
    file = "~/STEMResearch/datasets/hh_income.sav"
)
gt(df)
id race gender married education age income expenditure ses
1 White Female Yes Masters 32 528.87 417.77 High
2 Other Male Yes Bachelors 46 422.02 501.42 High
3 White Female No Doctoral 43 466.81 336.07 Middle
4 Black Female No Masters 35 598.32 454.08 Low
5 Black Male Yes Bachelors 36 510.16 340.33 Middle
6 Other Female Yes Doctoral 44 517.48 372.84 Middle
7 White Male Yes Bachelors 30 399.02 348.82 Low
8 Black Male No Bachelors 42 546.79 393.82 Low
9 Other Female Yes Masters 30 467.69 460.62 Low
10 Black Male Yes Doctoral 28 551.13 353.95 High
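SPSS files often store a descriptive variable label alongside each column, and haven keeps it as a "label" attribute after import. The sketch below shows this with a hypothetical toy file written to a temporary location; the data, the label text, and the names toy, tmp, and df_toy are assumptions and say nothing about the contents of hh_income.sav.

```r
library(haven)

# Hypothetical toy data with a variable label attached to one column
toy <- tibble::tibble(
  id = 1:3,
  income = c(528.87, 422.02, 466.81)
)
attr(toy$income, "label") <- "Household income (hypothetical label)"

# Round-trip through a temporary .sav file
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
df_toy <- read_spss(tmp)

# The variable label survives the import as an attribute
attr(df_toy$income, "label")
```

Checking these attributes after import is a quick way to recover the codebook information that the SPSS file carries.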