Let’s take a look at each pane in RStudio.
There are multiple ways to import a dataset.
You can simply click “Import Dataset” in the Environment pane. Then select the option that matches your file type:
Excel file: From Excel…
CSV file: From Text (readr)…
And so on.
Click “Browse” to select the data, then click “Import”.
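You can also write a command that lets you choose the file. To do so, you need to know the format of the file:
# Install the package (only needed once)
install.packages("readxl")
# Choose an Excel file in the dialog window and import it
data <- readxl::read_xlsx(file.choose())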
R has many packages that provide additional functions. “readxl” is the package for reading and loading Excel files. You will see many more packages throughout this class. As the code demonstrates, the first time you use a package, you need to install it; you do not need to reinstall it in later sessions.
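Because installation is only needed once, it can help to install a package only when it is missing. The following is a minimal sketch of that idea; `ensure_package` is a hypothetical helper name introduced here for illustration, not a function from R or readxl.

```r
# Install a package only when it is missing, then report whether it is available.
# `ensure_package` is a hypothetical helper name used for illustration.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Runs only when the package is not installed; needs an internet connection.
    install.packages(pkg)
  }
  # TRUE when the package can be loaded, FALSE otherwise.
  requireNamespace(pkg, quietly = TRUE)
}

# "utils" ships with R, so this call never triggers an installation.
utils_available <- ensure_package("utils")
print(utils_available)
```

For this class you could call `ensure_package("readxl")` in the same way before importing an Excel file.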
# Choose a CSV file in the dialog window and import it
data <- read.csv(file.choose())
If you do not want to choose the file every time, write the file path directly in the code. This is the most convenient way to import the data.
# Import the first sheet (default)
data <- readxl::read_xlsx("~/Documents/BADM321/R/Data/BIG5_DATA.xlsx")
# Import the second sheet
data <- readxl::read_xlsx("~/Documents/BADM321/R/Data/BIG5_DATA.xlsx", sheet = 2)
# Import a CSV file
data <- read.csv("~/Documents/BADM321/R/Data/BIG5_DATA.csv")
If you are not sure of the file path, run the following command and select the file; R prints its full path:
file.choose()
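A mistyped path is a common reason an import fails, so it can be useful to check that the file exists before reading it. The sketch below writes a small CSV to a temporary folder so that it runs on any computer; the file name `example_scores.csv` and its contents are made up for illustration and are not part of the class data.

```r
# Build a path to a small CSV in a temporary folder (hypothetical file name).
csv_path <- file.path(tempdir(), "example_scores.csv")

# Write made-up data so that the example is self-contained.
write.csv(data.frame(id = 1:3, score = c(2.9, 3.4, 5.0)),
          csv_path, row.names = FALSE)

# Import the file only if the path is valid; otherwise stop with a clear message.
if (file.exists(csv_path)) {
  example_data <- read.csv(csv_path)
} else {
  stop("File not found: ", csv_path)
}

# Show the full path that R actually used.
print(normalizePath(csv_path))
print(example_data)
```

With your own data, replace `csv_path` with the path to your file, for example the one returned by `file.choose()`.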
After importing the data, you want to conduct an initial investigation of the dataset to gain insight into its structure and variables.
# Load the data
data <- readxl::read_xlsx("~/Documents/BADM321/R/Data/BIG5_DATA.xlsx")
# View column names
names(data)
## [1] "race" "age" "engnat"
## [4] "gender" "hand" "source"
## [7] "country" "Extraversion" "Neuroticism"
## [10] "Agreeableness" "Conscientiousness" "Openness"
# Data structure
str(data)
## tibble [19,635 × 12] (S3: tbl_df/tbl/data.frame)
## $ race : num [1:19635] 1 10 4 3 6 6 0 3 3 3 ...
## $ age : num [1:19635] 99 97 92 80 79 79 78 77 77 77 ...
## $ engnat : num [1:19635] 1 2 1 1 1 1 1 1 1 1 ...
## $ gender : num [1:19635] 3 2 1 2 2 2 1 1 1 1 ...
## $ hand : num [1:19635] 1 2 3 1 1 1 1 1 1 1 ...
## $ source : num [1:19635] 1 1 5 1 1 1 1 1 1 2 ...
## $ country : chr [1:19635] "US" "US" "IN" "US" ...
## $ Extraversion : num [1:19635] 2.9 3.4 5 3.1 2.4 2.6 2.9 3.1 2.6 2.9 ...
## $ Neuroticism : num [1:19635] 3.3 3.4 5 2.7 3.5 3.3 2.3 2.2 3.1 3.4 ...
## $ Agreeableness : num [1:19635] 3.5 3.4 5 2.7 3.2 3.4 2.8 3.4 3.1 3 ...
## $ Conscientiousness: num [1:19635] 3.4 3.4 5 3.4 2.9 3 2.9 3.5 3.1 3.5 ...
## $ Openness : num [1:19635] 3.7 3.4 5 3.4 3.5 3.6 3 3.3 3.4 3.8 ...
# Get a summary of the data
summary(data)
##       race             age            engnat          gender
##  Min.   : 0.000   Min.   :13.00   Min.   :0.000   Min.   :0.000
##  1st Qu.: 3.000   1st Qu.:18.00   1st Qu.:1.000   1st Qu.:1.000
##  Median : 3.000   Median :22.00   Median :1.000   Median :2.000
##  Mean   : 5.312   Mean   :26.26   Mean   :1.364   Mean   :1.616
##  3rd Qu.: 8.000   3rd Qu.:31.00   3rd Qu.:2.000   3rd Qu.:2.000
##  Max.   :13.000   Max.   :99.00   Max.   :2.000   Max.   :3.000
##       hand          source        country           Extraversion
##  Min.   :0.00   Min.   :1.000   Length:19635       Min.   :0.000
##  1st Qu.:1.00   1st Qu.:1.000   Class :character   1st Qu.:2.900
##  Median :1.00   Median :1.000   Mode  :character   Median :3.100
##  Mean   :1.13   Mean   :1.951                      Mean   :3.077
##  3rd Qu.:1.00   3rd Qu.:2.000                      3rd Qu.:3.300
##  Max.   :3.00   Max.   :5.000                      Max.   :5.000
##   Neuroticism    Agreeableness   Conscientiousness    Openness
##  Min.   :0.000   Min.   :0.000   Min.   :0.000     Min.   :0.000
##  1st Qu.:2.600   1st Qu.:3.000   1st Qu.:2.900     1st Qu.:3.100
##  Median :3.100   Median :3.200   Median :3.200     Median :3.300
##  Mean   :3.095   Mean   :3.205   Mean   :3.155     Mean   :3.314
##  3rd Qu.:3.600   3rd Qu.:3.400   3rd Qu.:3.400     3rd Qu.:3.600
##  Max.   :5.000   Max.   :5.000   Max.   :5.000     Max.   :5.000
# Check the number of rows and columns
nrow(data)
## [1] 19635
ncol(data)
## [1] 12
# Overall dimension
dim(data)
## [1] 19635 12
# See the first 3 rows and last 3 rows
head(data, 3)
## # A tibble: 3 × 12
## race age engnat gender hand source country Extraversion Neuroticism
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 99 1 3 1 1 US 2.9 3.3
## 2 10 97 2 2 2 1 US 3.4 3.4
## 3 4 92 1 1 3 5 IN 5 5
## # ℹ 3 more variables: Agreeableness <dbl>, Conscientiousness <dbl>,
## # Openness <dbl>
tail(data, 3)
## # A tibble: 3 × 12
## race age engnat gender hand source country Extraversion Neuroticism
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13 13 2 1 2 1 IN 3 4.3
## 2 13 13 1 2 1 1 US 2.9 3.2
## 3 9 13 1 2 1 1 US 2.7 4.4
## # ℹ 3 more variables: Agreeableness <dbl>, Conscientiousness <dbl>,
## # Openness <dbl>
# Check for missing values
sum(is.na(data))
## [1] 1
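The total above tells you that a value is missing, but not where. A minimal sketch of how you might locate it is shown below, using a small made-up data frame (`toy`) in place of the imported dataset.

```r
# Toy data frame standing in for the imported dataset (values are made up).
toy <- data.frame(
  age     = c(22, 31, NA, 18),
  country = c("US", "IN", "GB", "US")
)

# Total number of missing values in the whole data frame.
total_missing <- sum(is.na(toy))

# Missing values per column, to see which variable is affected.
missing_by_column <- colSums(is.na(toy))

# Row numbers that contain at least one missing value.
rows_with_missing <- which(!complete.cases(toy))

print(total_missing)
print(missing_by_column)
print(rows_with_missing)
```

Applied to the class data, `colSums(is.na(data))` would show which of the 12 columns contains the missing value.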
# Check the distribution of categorical variables (e.g., country)
table(data$country)
##
## (nu A1 A2 AE AG AL AO AP AR AS AT AU AZ BA BB BD
## 367 8 9 99 1 12 1 19 41 1 20 973 4 10 2 42
## BE BF BG BH BM BN BO BR BS BT BW BZ CA CH CL CM
## 86 1 41 8 8 5 3 175 2 1 4 15 922 40 18 2
## CN CO CR CV CY CZ DE DK DO DZ EC EE EG ES ET EU
## 39 18 9 1 8 28 191 122 5 4 6 13 47 82 1 24
## FI FJ FO FR GB GD GE GG GH GP GR GT GU GY HK HN
## 89 2 1 129 1529 1 4 2 20 1 85 3 1 1 41 4
## HR HT HU ID IE IL IM IN IQ IR IS IT JE JM JO JP
## 40 2 34 168 107 26 1 1456 2 16 13 276 3 28 13 37
## KE KG KH KR KW KY KZ LA LB LK LS LT LV LY MA ME
## 43 1 3 26 6 1 1 2 40 31 2 29 21 2 9 3
## MK MM MN MP MR MT MU MV MW MX MY MZ NA NG NI NL
## 7 3 2 2 1 11 8 3 2 82 241 2 8 35 2 132
## NO NP NZ OM PA PE PG PH PK PL PR PT PW PY QA RO
## 147 10 156 6 4 8 2 642 220 79 15 87 1 2 10 135
## RS RU RW SA SD SE SG SI SK SR SV SY TC TH TN TR
## 85 19 2 43 1 168 133 34 22 1 6 2 1 42 7 69
## TT TW TZ UA UG US UY UZ VC VE VI VN ZA ZM ZW
## 22 25 2 12 11 8731 2 1 2 17 2 30 179 2 3
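With this many categories, the raw table is hard to scan. One option, sketched below on a small made-up vector standing in for `data$country`, is to sort the counts, keep only the most frequent categories, and convert the counts to proportions.

```r
# Toy vector of country codes standing in for data$country (values are made up).
country <- c("US", "GB", "US", "IN", "US", "GB", NA)

# Counts sorted from most to least frequent.
counts <- sort(table(country), decreasing = TRUE)

# Keep only the three most common categories.
top3 <- head(counts, 3)

# Share of each category among the non-missing values.
shares <- prop.table(counts)

# table() drops missing values by default; useNA = "ifany" adds a count for them.
counts_with_na <- table(country, useNA = "ifany")

print(top3)
print(round(shares, 2))
print(counts_with_na)
```

On the class data, `head(sort(table(data$country), decreasing = TRUE), 10)` would list the ten most common country codes.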