Reading in the first data set
Reading, or importing, data files into R (here through RStudio) is a necessary first step before any cleaning or tidying can happen. Once imported data has been cleaned, it is much better suited for exploration.
Data formats are not homogeneous and come in many different flavors. Whether the data is in CSV, SPSS, XLSX, SAS, TXT, Stata, or HTML format, among many others, there is usually an R package to read it in.
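As a rough illustration of the point above, the sketch below pairs a few file extensions with reader functions that are commonly used for them. The lookup table `readers` and the file name `survey_2021.dta` are made up for this example, and the pairing is only a suggestion, not a complete or official list.

```r
# Hypothetical lookup table: file extension -> a commonly used reader function.
# The functions named here exist in their packages; the table itself is
# only an illustration.
readers <- c(
  csv      = "readr::read_csv",
  txt      = "utils::read.table",
  xlsx     = "readxl::read_excel",
  sav      = "haven::read_sav",    # SPSS
  sas7bdat = "haven::read_sas",    # SAS
  dta      = "haven::read_dta",    # Stata
  html     = "rvest::read_html"
)

# Pick a reader based on the extension of a (made-up) file name.
ext <- tolower(tools::file_ext("survey_2021.dta"))
reader_for_dta <- readers[[ext]]
reader_for_dta
```

Only base R is needed to run this, since the reader functions are stored as names rather than called.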
The first data set I will read in comes from the datasets package, which is included with R. It is the mtcars (Motor Trend Car Road Tests) data set, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
This R chunk loads the datasets package and prints summary statistics for each column of the mtcars data set.
library(datasets)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
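The same statistics can also be pulled out one column at a time. The following is a minimal sketch using only base R; the object names `mpg_summary`, `mpg_mean`, and `col_means` are chosen just for this example.

```r
library(datasets)

# Summary statistics for a single column instead of the whole data frame.
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# The "Mean" entry corresponds to calling mean() on the column directly.
mpg_mean <- mean(mtcars$mpg)

# Apply one statistic to every column at once.
col_means <- sapply(mtcars, mean)
round(col_means, 2)
```

This can be handy when only one or two numbers are needed later in an analysis, rather than the full printed table.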
This chunk uses the datatable() function from the DT package to render the mtcars data set as an interactive table, which makes the data easier to read.
library(DT)
datatable(mtcars)
This R chunk uses skim() from the skimr package, an alternative to summary(). skim() provides a broader overview of the mtcars data set, including missing-value counts and standard deviations, and adds a small inline histogram for each numeric column.
library(skimr)
## Warning: package 'skimr' was built under R version 4.1.2
skim(mtcars)
| Name | mtcars |
| Number of rows | 32 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
| cyl | 0 | 1 | 6.19 | 1.79 | 4.00 | 4.00 | 6.00 | 8.00 | 8.00 | ▆▁▃▁▇ |
| disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
| hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
| drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
| wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
| qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
| vs | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
| carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
This R chunk shows that skim() can also be limited to specific columns, here hp and wt.
skim(mtcars, hp, wt)
| Name | mtcars |
| Number of rows | 32 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
| wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
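For comparison, the quantile and standard deviation columns of the skim() output for hp and wt can be approximated with base R alone. This is only a sketch of one possible equivalent, not how skimr computes its results internally; the object names are chosen for this example.

```r
library(datasets)

# Quantiles (0%, 25%, 50%, 75%, 100%) for the two selected columns.
# These correspond to the p0 ... p100 columns in the skim() table.
hp_wt_quantiles <- sapply(mtcars[c("hp", "wt")], quantile)
hp_wt_quantiles

# Standard deviations, matching the sd column.
hp_wt_sd <- sapply(mtcars[c("hp", "wt")], sd)
round(hp_wt_sd, 2)
```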
This R chunk provides the column names of the mtcars dataset using the colnames() function.
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
This R chunk introduces the dim() function, which returns the dimensions of the data set. Here it shows that the mtcars data frame has 32 rows and 11 columns.
dim(mtcars)
## [1] 32 11
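A few related base R functions report the same information in other forms. The short sketch below assumes nothing beyond the datasets package; `n_rows` and `n_cols` are names picked for this example.

```r
library(datasets)

# nrow() and ncol() return the two parts of dim() separately.
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)

# str() combines the dimensions with each column's name and type.
str(mtcars)
```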
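This R chunk draws a generic visualization of the mtcars object using the plot() function. For a data frame with several numeric columns, plot() produces a scatterplot matrix of every pair of columns.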
plot(mtcars)
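With all 11 columns the plot can be hard to read, so one option is to plot only a few columns at a time. The sketch below picks mpg, hp, and wt; that choice of columns and the title text are arbitrary and only for illustration.

```r
library(datasets)

# A smaller selection of columns gives larger, more legible panels.
vars <- c("mpg", "hp", "wt")
plot(mtcars[, vars], main = "mtcars: mpg, hp, wt")
```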
Next, I wanted to try reading in data from an external file in CSV format.
This first R chunk reads in the eggs_tidy CSV data.
#library(readr)
#eggs_tidy <- read_csv("_data/eggs_tidy.csv")
library(readr)
eggs_tidy_eggs_tidy <- read_csv("C:/Users/Bud/Downloads/eggs_tidy - eggs_tidy.csv")
## Rows: 120 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): month
## dbl (5): year, large_half_dozen, large_dozen, extra_large_half_dozen, extra_...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(eggs_tidy_eggs_tidy)
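The message printed by read_csv() suggests specifying the column types. A minimal sketch of that idea follows, assuming readr 2.0 or later (needed for passing literal text with I()). To keep it runnable without the file, it reads a one-row inline sample that reproduces only the first row of the data; `csv_text` and `eggs_sample` are names made up for this example.

```r
library(readr)

# Inline stand-in for the CSV file: the header plus the first data row.
csv_text <- I(paste(
  "month,year,large_half_dozen,large_dozen,extra_large_half_dozen,extra_large_dozen",
  "January,2004,126,230,132,230",
  sep = "\n"
))

# Declaring the column types up front silences the column specification
# message: month is text, every other column is a double.
eggs_sample <- read_csv(
  csv_text,
  col_types = cols(month = col_character(), .default = col_double())
)
eggs_sample
```

The same `col_types` argument could be passed when reading the real file; alternatively, `show_col_types = FALSE` hides the message while still letting readr guess the types.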
This chunk summarizes the eggs_tidy data set.
summary(eggs_tidy_eggs_tidy)
## month year large_half_dozen large_dozen
## Length:120 Min. :2004 Min. :126.0 Min. :225.0
## Class :character 1st Qu.:2006 1st Qu.:129.4 1st Qu.:233.5
## Mode :character Median :2008 Median :174.5 Median :267.5
## Mean :2008 Mean :155.2 Mean :254.2
## 3rd Qu.:2011 3rd Qu.:174.5 3rd Qu.:268.0
## Max. :2013 Max. :178.0 Max. :277.5
## extra_large_half_dozen extra_large_dozen
## Min. :132.0 Min. :230.0
## 1st Qu.:135.8 1st Qu.:241.5
## Median :185.5 Median :285.5
## Mean :164.2 Mean :266.8
## 3rd Qu.:185.5 3rd Qu.:285.5
## Max. :188.1 Max. :290.0
This chunk summarizes the data set using the skim() function.
library(skimr)
skim(eggs_tidy_eggs_tidy)
| Name | eggs_tidy_eggs_tidy |
| Number of rows | 120 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| month | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2008.50 | 2.88 | 2004 | 2006.00 | 2008.5 | 2011.0 | 2013.00 | ▇▇▇▇▇ |
| large_half_dozen | 0 | 1 | 155.17 | 22.59 | 126 | 129.44 | 174.5 | 174.5 | 178.00 | ▆▁▁▁▇ |
| large_dozen | 0 | 1 | 254.20 | 18.55 | 225 | 233.50 | 267.5 | 268.0 | 277.50 | ▅▂▁▁▇ |
| extra_large_half_dozen | 0 | 1 | 164.22 | 24.68 | 132 | 135.78 | 185.5 | 185.5 | 188.13 | ▆▁▁▁▇ |
| extra_large_dozen | 0 | 1 | 266.80 | 22.80 | 230 | 241.50 | 285.5 | 285.5 | 290.00 | ▅▂▁▁▇ |
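This chunk uses as_tibble() from the tibble package. A tibble prints more compactly than a plain data frame, showing only the first ten rows along with each column's type. Since read_csv() already returns a tibble, the call here mainly serves to display it.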
library(tibble)
as_tibble(eggs_tidy_eggs_tidy)
## # A tibble: 120 x 6
## month year large_half_dozen large_dozen extra_large_half~ extra_large_doz~
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 January 2004 126 230 132 230
## 2 Februa~ 2004 128. 226. 134. 230
## 3 March 2004 131 225 137 230
## 4 April 2004 131 225 137 234.
## 5 May 2004 131 225 137 236
## 6 June 2004 134. 231. 137 241
## 7 July 2004 134. 234. 137 241
## 8 August 2004 134. 234. 137 241
## 9 Septem~ 2004 130. 234. 136. 241
## 10 October 2004 128. 234. 136. 241
## # ... with 110 more rows
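as_tibble() is more noticeable on an ordinary data frame such as mtcars from earlier. One thing to watch for is that tibbles do not keep row names, so the car names would be lost unless they are moved into a column. The sketch below does that with the rownames argument; the column name "car" is simply a name chosen for this example.

```r
library(datasets)
library(tibble)

# Convert the data frame and keep the row names as a regular column.
cars_tbl <- as_tibble(mtcars, rownames = "car")
cars_tbl
```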