BADM 321-01 R Tutorial 1: Importing Data in R

RStudio Panes

Let’s take a look at each pane in RStudio.

Importing Data in R

There are several ways to import a dataset into R.

Import Dataset

You can simply click “Import Dataset” in the Environment pane. Then select the option that matches your file type:

Excel file: From Excel…
CSV file: From Text (readr)…

And so on.

Click “Browse” to select the data, then click “Import”.

Interactive Command

You can write a command that opens a file picker so you can choose the file. To do so, you need to know the file format:

Excel file:

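# Install readxl once per computer; skip this line on later runs
install.packages("readxl")
# Open a file picker and read the selected Excel file
data <- readxl::read_xlsx(file.choose())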

R packages are add-on collections of functions. “readxl” is the package that reads Excel files into R. You will see many more packages throughout this class. As the code demonstrates, you need to install a package before you use it for the first time; after that, you can call its functions with the package::function() form shown here.
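Because install.packages() downloads the package again on every run, it can help to install only when the package is missing. The following is a minimal sketch, not part of the course material: install_if_missing is a hypothetical helper name, and the demonstration call uses "utils", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "utils" is part of every R installation, so this call installs nothing
install_if_missing("utils")
```

For this tutorial you would call install_if_missing("readxl") once at the top of your script.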

CSV file:

data <- read.csv(file.choose())
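If you are unsure which reading function fits a file, the file extension usually tells you. The sketch below is only an illustration: reader_for is a hypothetical helper that returns the name of a suitable function as text, and it assumes the extension matches the real file format.

```r
# Hypothetical helper: suggest a reading function based on the file extension
reader_for <- function(path) {
  ext <- tolower(tools::file_ext(path))  # e.g. "xlsx" or "csv"
  switch(ext,
    xlsx = "readxl::read_xlsx",
    xls  = "readxl::read_xls",
    csv  = "read.csv",
    stop("Unsupported file type: ", ext)
  )
}

excel_reader <- reader_for("BIG5_DATA.xlsx")
csv_reader <- reader_for("BIG5_DATA.csv")
```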

Designate the Path

If you do not want to choose the file by hand every time you run your code, write the file path directly in the command. This is the most convenient way to import data that you use repeatedly.

Excel file:

# Import the first sheet (default)
data <- readxl::read_xlsx("~/Documents/BADM321/R/Data/BIG5_DATA.xlsx")

# Import the second sheet
data <- readxl::read_xlsx("~/Documents/BADM321/R/Data/BIG5_DATA.xlsx", sheet = 2)

CSV file:

data <- read.csv("~/Documents/BADM321/R/Data/BIG5_DATA.csv")

To find the full path of a file, run the command below and select the file; R returns its path as text:

file.choose()
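A path typed by hand is easy to get wrong, so it can be worth checking that the file exists before reading it. The sketch below is self-contained and uses made-up data: demo_path and demo are hypothetical names, and the small CSV file is written to a temporary folder only so the example can run anywhere.

```r
# Build a path piece by piece; file.path() inserts the separators
demo_path <- file.path(tempdir(), "demo_scores.csv")

# Write a tiny made-up dataset so there is a file to read
write.csv(
  data.frame(country = c("US", "IN"), age = c(22, 31)),
  demo_path,
  row.names = FALSE
)

# Stop with a clear message if the path is wrong, otherwise read the file
if (!file.exists(demo_path)) {
  stop("File not found: ", demo_path)
}
demo <- read.csv(demo_path)
```

With your own data, replace demo_path with the path to your file, such as the BIG5_DATA.csv path used above.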

Initial Data Investigation

After importing the data, you want to conduct an initial investigation of the dataset to gain insight into its structure and variables.

Data Structure

# Load the data
data <- readxl::read_xlsx("~/Documents/BADM321/R/Data/BIG5_DATA.xlsx")

# View column names
names(data)

##  [1] "race"              "age"               "engnat"           
##  [4] "gender"            "hand"              "source"           
##  [7] "country"           "Extraversion"      "Neuroticism"      
## [10] "Agreeableness"     "Conscientiousness" "Openness"

# Data structure
str(data)

## tibble [19,635 × 12] (S3: tbl_df/tbl/data.frame)
##  $ race             : num [1:19635] 1 10 4 3 6 6 0 3 3 3 ...
##  $ age              : num [1:19635] 99 97 92 80 79 79 78 77 77 77 ...
##  $ engnat           : num [1:19635] 1 2 1 1 1 1 1 1 1 1 ...
##  $ gender           : num [1:19635] 3 2 1 2 2 2 1 1 1 1 ...
##  $ hand             : num [1:19635] 1 2 3 1 1 1 1 1 1 1 ...
##  $ source           : num [1:19635] 1 1 5 1 1 1 1 1 1 2 ...
##  $ country          : chr [1:19635] "US" "US" "IN" "US" ...
##  $ Extraversion     : num [1:19635] 2.9 3.4 5 3.1 2.4 2.6 2.9 3.1 2.6 2.9 ...
##  $ Neuroticism      : num [1:19635] 3.3 3.4 5 2.7 3.5 3.3 2.3 2.2 3.1 3.4 ...
##  $ Agreeableness    : num [1:19635] 3.5 3.4 5 2.7 3.2 3.4 2.8 3.4 3.1 3 ...
##  $ Conscientiousness: num [1:19635] 3.4 3.4 5 3.4 2.9 3 2.9 3.5 3.1 3.5 ...
##  $ Openness         : num [1:19635] 3.7 3.4 5 3.4 3.5 3.6 3 3.3 3.4 3.8 ...

# Get a summary of the data
summary(data)

##       race             age            engnat          gender     
##  Min.   : 0.000   Min.   :13.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 3.000   1st Qu.:18.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 3.000   Median :22.00   Median :1.000   Median :2.000  
##  Mean   : 5.312   Mean   :26.26   Mean   :1.364   Mean   :1.616  
##  3rd Qu.: 8.000   3rd Qu.:31.00   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :13.000   Max.   :99.00   Max.   :2.000   Max.   :3.000  
##       hand          source        country           Extraversion  
##  Min.   :0.00   Min.   :1.000   Length:19635       Min.   :0.000  
##  1st Qu.:1.00   1st Qu.:1.000   Class :character   1st Qu.:2.900  
##  Median :1.00   Median :1.000   Mode  :character   Median :3.100  
##  Mean   :1.13   Mean   :1.951                      Mean   :3.077  
##  3rd Qu.:1.00   3rd Qu.:2.000                      3rd Qu.:3.300  
##  Max.   :3.00   Max.   :5.000                      Max.   :5.000  
##   Neuroticism    Agreeableness   Conscientiousness    Openness    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000     Min.   :0.000  
##  1st Qu.:2.600   1st Qu.:3.000   1st Qu.:2.900     1st Qu.:3.100  
##  Median :3.100   Median :3.200   Median :3.200     Median :3.300  
##  Mean   :3.095   Mean   :3.205   Mean   :3.155     Mean   :3.314  
##  3rd Qu.:3.600   3rd Qu.:3.400   3rd Qu.:3.400     3rd Qu.:3.600  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000     Max.   :5.000

# Check the number of rows and columns
nrow(data)

## [1] 19635

ncol(data)

## [1] 12

# Overall dimension
dim(data)

## [1] 19635    12

# See the first 3 rows and last 3 rows
head(data, 3)

## # A tibble: 3 × 12
##    race   age engnat gender  hand source country Extraversion Neuroticism
##   <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <chr>          <dbl>       <dbl>
## 1     1    99      1      3     1      1 US               2.9         3.3
## 2    10    97      2      2     2      1 US               3.4         3.4
## 3     4    92      1      1     3      5 IN               5           5  
## # ℹ 3 more variables: Agreeableness <dbl>, Conscientiousness <dbl>,
## #   Openness <dbl>

tail(data, 3)

## # A tibble: 3 × 12
##    race   age engnat gender  hand source country Extraversion Neuroticism
##   <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <chr>          <dbl>       <dbl>
## 1    13    13      2      1     2      1 IN               3           4.3
## 2    13    13      1      2     1      1 US               2.9         3.2
## 3     9    13      1      2     1      1 US               2.7         4.4
## # ℹ 3 more variables: Agreeableness <dbl>, Conscientiousness <dbl>,
## #   Openness <dbl>

# Check for missing values
sum(is.na(data))

## [1] 1
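The total above tells you that a value is missing but not where. A minimal sketch of how you might locate missing values, using a small made-up data frame (toy is a hypothetical name, not the course dataset):

```r
# Made-up data with two missing values
toy <- data.frame(
  age     = c(22, NA, 31),
  country = c("US", "IN", NA),
  score   = c(3.1, 2.9, 3.4)
)

# Total number of missing values in the whole data frame
total_missing <- sum(is.na(toy))

# Number of missing values in each column
missing_by_column <- colSums(is.na(toy))

# Rows that contain at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]
```

The same three commands work on data in place of toy.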

# Check the distribution of categorical variables (e.g., country)
table(data$country)

## 
##  (nu   A1   A2   AE   AG   AL   AO   AP   AR   AS   AT   AU   AZ   BA   BB   BD 
##  367    8    9   99    1   12    1   19   41    1   20  973    4   10    2   42 
##   BE   BF   BG   BH   BM   BN   BO   BR   BS   BT   BW   BZ   CA   CH   CL   CM 
##   86    1   41    8    8    5    3  175    2    1    4   15  922   40   18    2 
##   CN   CO   CR   CV   CY   CZ   DE   DK   DO   DZ   EC   EE   EG   ES   ET   EU 
##   39   18    9    1    8   28  191  122    5    4    6   13   47   82    1   24 
##   FI   FJ   FO   FR   GB   GD   GE   GG   GH   GP   GR   GT   GU   GY   HK   HN 
##   89    2    1  129 1529    1    4    2   20    1   85    3    1    1   41    4 
##   HR   HT   HU   ID   IE   IL   IM   IN   IQ   IR   IS   IT   JE   JM   JO   JP 
##   40    2   34  168  107   26    1 1456    2   16   13  276    3   28   13   37 
##   KE   KG   KH   KR   KW   KY   KZ   LA   LB   LK   LS   LT   LV   LY   MA   ME 
##   43    1    3   26    6    1    1    2   40   31    2   29   21    2    9    3 
##   MK   MM   MN   MP   MR   MT   MU   MV   MW   MX   MY   MZ   NA   NG   NI   NL 
##    7    3    2    2    1   11    8    3    2   82  241    2    8   35    2  132 
##   NO   NP   NZ   OM   PA   PE   PG   PH   PK   PL   PR   PT   PW   PY   QA   RO 
##  147   10  156    6    4    8    2  642  220   79   15   87    1    2   10  135 
##   RS   RU   RW   SA   SD   SE   SG   SI   SK   SR   SV   SY   TC   TH   TN   TR 
##   85   19    2   43    1  168  133   34   22    1    6    2    1   42    7   69 
##   TT   TW   TZ   UA   UG   US   UY   UZ   VC   VE   VI   VN   ZA   ZM   ZW 
##   22   25    2   12   11 8731    2    1    2   17    2   30  179    2    3
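A table with this many categories is hard to scan. One option is to sort the counts and look only at the largest ones. The sketch below uses a short made-up vector (country here is a hypothetical stand-in for data$country):

```r
# Made-up country codes
country <- c("US", "US", "IN", "GB", "US", "IN")

# Count each category
counts <- table(country)

# Sort from most to least frequent and keep the top two
sorted_counts <- sort(counts, decreasing = TRUE)
top_two <- head(sorted_counts, 2)

# Convert counts to proportions
shares <- prop.table(counts)
```

On the course data, sort(table(data$country), decreasing = TRUE) would list the countries from most to least frequent.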

I strongly encourage you to use “R Wizard” in ChatGPT to troubleshoot any R problems or errors. AI makes learning easy and fun, so explore it as much as you can!