We are moving along at a quick pace and learning to work with data. This week’s assignment focuses on mutating and analyzing data using ‘dplyr’ and creating fancy histograms.

Practice Assignment 5 involves working with three data sets within R - ‘mtcars,’ ‘esoph’ and ‘diamonds.’

Let’s begin…keep up with me as I weave some magic with all this data.

1. Identify data type of variables in mtcars

The first task is simple. I have to identify the data type of each variable in ‘mtcars.’ I ensure the dplyr package is loaded, using the ‘library()’ function. I’ve already installed it in R in the past, so I don’t need to install it again. If I try, R will send me a nasty message and I’ve already had enough of those during my work week from different sources. Don’t need any more, especially from a software package!

library(dplyr)
library(datasets)

I load the data set using the ‘data()’ function and print out the data frame. It is, however, a large data set, so for the next task involving ‘mtcars,’ I will use the ‘tibble’ function.

data(mtcars)
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

To identify the data type of each variable, I use the ‘str()’ function.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
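When only the storage type of each column is needed, applying ‘class()’ to every column gives a more compact answer than ‘str()’. This is a minimal sketch; the name `col_types` is my own and not part of the assignment.

```r
# mtcars ships with base R's datasets package
data(mtcars)

# class() of each column, returned as a named character vector
col_types <- sapply(mtcars, class)
col_types

# TRUE when every column is stored as numeric
all(col_types == "numeric")
```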

2. Classification of each variable

I’ve looked at the data frame and am now able to classify each variable. The classifications are listed below.

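The row names (the unnamed first column in the printout) are discrete. They list the names of the different cars and act as labels rather than as a variable in the data frame.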

Variable ‘mpg’ is continuous.

Variable ‘cyl’ is discrete.

Variable ‘disp’ is continuous.

Variable ‘hp’ is discrete.

Variable ‘drat’ is continuous.

Variable ‘wt’ is continuous.

Variable ‘qsec’ is continuous.

Variable ‘vs’ is discrete.

Variable ‘am’ is discrete.

Variable ‘gear’ is discrete.

Variable ‘carb’ is discrete.
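Since ‘str()’ reports every column as numeric, the discrete/continuous call has to come from the values themselves. One rough way to back it up, sketched below, is to count the distinct values in each column and to check which columns hold only whole numbers. The helper names `n_distinct_values` and `whole_number_cols` are my own. A whole-number column is only a hint: ‘hp’ is recorded in whole numbers, even though horsepower is arguably a continuous quantity.

```r
data(mtcars)

# number of distinct values in each column
n_distinct_values <- sapply(mtcars, function(x) length(unique(x)))
n_distinct_values

# TRUE for columns that contain only whole numbers
whole_number_cols <- sapply(mtcars, function(x) all(x == round(x)))
names(whole_number_cols)[whole_number_cols]
```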

3. Distribution of three variables in mtcars

The third task involves reporting the distribution of any three variables of my choice from ‘mtcars’ using the ‘summary()’ function. I’m first going to create a ‘tibble’ for ‘mtcars.’

mtcars_tbl <- as_tibble(mtcars)
mtcars_tbl

Now I’ll specify the 3 variables of my choice in a vector and print those columns of the tibble.

cols <- c('mpg', 'cyl', 'gear')
cols
## [1] "mpg"  "cyl"  "gear"
mtcars_tbl[, cols]
## # A tibble: 32 x 3
##      mpg   cyl  gear
##    <dbl> <dbl> <dbl>
##  1  21.0     6     4
##  2  21.0     6     4
##  3  22.8     4     4
##  4  21.4     6     3
##  5  18.7     8     3
##  6  18.1     6     3
##  7  14.3     8     3
##  8  24.4     4     4
##  9  22.8     4     4
## 10  19.2     6     4
## # ... with 22 more rows
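The task asks for the distribution through ‘summary()’, so here is a minimal sketch of that call on the three chosen columns. The name `dist_summary` is my own. For ‘cyl’ and ‘gear’, which take only a few values, a plain count table is arguably the more readable view of the distribution.

```r
data(mtcars)
cols <- c("mpg", "cyl", "gear")

# minimum, quartiles, median, mean and maximum for each chosen column
dist_summary <- summary(mtcars[, cols])
dist_summary

# counts per value for the two variables with few distinct values
table(mtcars$cyl)
table(mtcars$gear)
```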

4. Identify data type of three variables in esoph

The fourth task in this assignment involves bringing in a different data set - ‘esoph.’ I have to identify the data type of three of the variables, ‘agegp,’ ‘alcgp’ and ‘tobgp’ in this data set. I am not going to print the data set as it is large, but I’ll provide the code to import ‘esoph’, print the data frame as well as its tibble below.

data(esoph)
esoph
esoph_tbl <- as_tibble(esoph)

I then identify the specific variables I want to focus on and use the ‘str()’ function again to identify the data types.

esoph_tbl <- select(esoph_tbl, agegp, alcgp, tobgp)
esoph_tbl
## # A tibble: 88 x 3
##    agegp     alcgp    tobgp
##    <ord>     <ord>    <ord>
##  1 25-34 0-39g/day 0-9g/day
##  2 25-34 0-39g/day    10-19
##  3 25-34 0-39g/day    20-29
##  4 25-34 0-39g/day      30+
##  5 25-34     40-79 0-9g/day
##  6 25-34     40-79    10-19
##  7 25-34     40-79    20-29
##  8 25-34     40-79      30+
##  9 25-34    80-119 0-9g/day
## 10 25-34    80-119    10-19
## # ... with 78 more rows
str(esoph_tbl)
## Classes 'tbl_df', 'tbl' and 'data.frame':    88 obs. of  3 variables:
##  $ agegp: Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alcgp: Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ tobgp: Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
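The ‘Ord.factor’ label in the ‘str()’ output means each of the three variables is an ordered factor. A short sketch that confirms this directly and lists the levels in their stored order; `ordered_check` is a name I made up.

```r
data(esoph)

# is.ordered() returns TRUE for ordered factors
ordered_check <- sapply(esoph[, c("agegp", "alcgp", "tobgp")], is.ordered)
ordered_check

# levels in their stored order, lowest first
levels(esoph$agegp)
levels(esoph$alcgp)
levels(esoph$tobgp)
```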

5. Reporting the frequency and relative frequency distributions of three columns in esoph

The fun begins. Things are getting a little complicated. I have to report the frequency distribution and relative frequency distribution of some of the variables in esoph next. The good news is that I’ve already selected the three variables and assigned them to ‘esoph_tbl.’ I will first create the frequency distribution of the three variables using the ‘ftable()’ function.

esoph.freq <- ftable(esoph_tbl)

Next, I will find out the number of rows (observations) in the data. This will help me calculate the relative frequency distribution using the formula relative frequency = frequency/number of observations. Note that ‘nrow(esoph.freq)’ would return 24, the number of rows in the printed frequency table, rather than the 88 observations, so I count the rows of ‘esoph_tbl’ instead.

nrow(esoph_tbl)
## [1] 88
esoph.relfreq <- esoph.freq/nrow(esoph_tbl)
esoph.relfreq
##                 tobgp   0-9g/day      10-19      20-29        30+
## agegp alcgp                                                      
## 25-34 0-39g/day       0.01136364 0.01136364 0.01136364 0.01136364
##       40-79           0.01136364 0.01136364 0.01136364 0.01136364
##       80-119          0.01136364 0.01136364 0.00000000 0.01136364
##       120+            0.01136364 0.01136364 0.01136364 0.01136364
## 35-44 0-39g/day       0.01136364 0.01136364 0.01136364 0.01136364
##       40-79           0.01136364 0.01136364 0.01136364 0.01136364
##       80-119          0.01136364 0.01136364 0.01136364 0.01136364
##       120+            0.01136364 0.01136364 0.01136364 0.00000000
## 45-54 0-39g/day       0.01136364 0.01136364 0.01136364 0.01136364
##       40-79           0.01136364 0.01136364 0.01136364 0.01136364
##       80-119          0.01136364 0.01136364 0.01136364 0.01136364
##       120+            0.01136364 0.01136364 0.01136364 0.01136364
## 55-64 0-39g/day       0.01136364 0.01136364 0.01136364 0.01136364
##       40-79           0.01136364 0.01136364 0.01136364 0.01136364
##       80-119          0.01136364 0.01136364 0.01136364 0.01136364
##       120+            0.01136364 0.01136364 0.01136364 0.01136364
## 65-74 0-39g/day       0.01136364 0.01136364 0.01136364 0.01136364
##       40-79           0.01136364 0.01136364 0.01136364 0.00000000
##       80-119          0.01136364 0.01136364 0.01136364 0.01136364
##       120+            0.01136364 0.01136364 0.01136364 0.01136364
## 75+   0-39g/day       0.01136364 0.01136364 0.00000000 0.01136364
##       40-79           0.01136364 0.01136364 0.01136364 0.01136364
##       80-119          0.01136364 0.01136364 0.00000000 0.00000000
##       120+            0.01136364 0.01136364 0.00000000 0.00000000
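A relative frequency distribution should add up to 1, which makes for a quick sanity check. The sketch below uses base R’s ‘prop.table()’, which divides every cell by the table total. The object names `freq` and `relfreq` are my own, and I build the counts with ‘table()’ before flattening them with ‘ftable()’ for display.

```r
data(esoph)

# joint counts of the three grouping variables
freq <- table(esoph[, c("agegp", "alcgp", "tobgp")])

# each cell divided by the total count
relfreq <- prop.table(freq)

sum(freq)     # total number of observations
sum(relfreq)  # should be 1

# flat layout for printing
ftable(relfreq)
```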

6. Reporting the joint frequency of two sets of columns in esoph

The last task involving the data frame ‘esoph’ requires me to report the joint frequency of two different sets of columns - ‘agegp’ and ‘alcgp’; ‘alcgp’ and ‘tobgp.’ In order to do this, I use the ‘xtabs()’ function.

jf_agegp.alcgp <- xtabs(~agegp+alcgp, data=esoph_tbl)
jf_agegp.alcgp
##        alcgp
## agegp   0-39g/day 40-79 80-119 120+
##   25-34         4     4      3    4
##   35-44         4     4      4    3
##   45-54         4     4      4    4
##   55-64         4     4      4    4
##   65-74         4     3      4    4
##   75+           3     4      2    2
jf_alcgp.tobgp <- xtabs(~alcgp+tobgp, data=esoph_tbl)
jf_alcgp.tobgp
##            tobgp
## alcgp       0-9g/day 10-19 20-29 30+
##   0-39g/day        6     6     5   6
##   40-79            6     6     6   5
##   80-119           6     6     4   5
##   120+             6     6     5   4
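A joint frequency table is easier to read with its row and column totals attached. As a small sketch, base R’s ‘addmargins()’ appends those totals to an ‘xtabs()’ result, and ‘prop.table()’ turns the counts into joint relative frequencies. The names `jf` and `jf_totals` are my own.

```r
data(esoph)

# joint counts of age group by alcohol group
jf <- xtabs(~ agegp + alcgp, data = esoph)

# the same table with a "Sum" row and a "Sum" column added
jf_totals <- addmargins(jf)
jf_totals

# joint relative frequencies, rounded for display
round(prop.table(jf), 3)
```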

7. Loading dataset ‘diamonds’

The final task in this assignment comprises several sub-tasks. All of these sub-tasks involve the data set ‘diamonds.’ I will first load the package ‘ggplot2,’ which provides the ‘diamonds’ data set.

library(ggplot2)

The data set ‘diamonds’ is huge so I am not going to print it. I will, however, provide the code below. The command ‘?diamonds’ will display a description of the data set in the Help pane in RStudio.

data(diamonds)
diamonds
?diamonds

7.a Range of variables

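The first sub-task involving ‘diamonds’ is to display the range of the variables ‘price,’ ‘carat,’ ‘depth,’ and ‘table.’

I will first do it collectively for all four variables and then individually for each of the four variables. I will use the ‘range()’ function. Called on all four columns at once, ‘range()’ returns a single minimum and maximum across every value, so the collective result pairs the smallest ‘carat’ with the largest ‘price.’

To find the range of all four variables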

range_diamonds <- select(diamonds, price, carat, depth, table)
range_diamonds
## # A tibble: 53,940 x 4
##    price carat depth table
##    <int> <dbl> <dbl> <dbl>
##  1   326  0.23  61.5    55
##  2   326  0.21  59.8    61
##  3   327  0.23  56.9    65
##  4   334  0.29  62.4    58
##  5   335  0.31  63.3    58
##  6   336  0.24  62.8    57
##  7   336  0.24  62.3    57
##  8   337  0.26  61.9    55
##  9   337  0.22  65.1    61
## 10   338  0.23  59.4    61
## # ... with 53,930 more rows
range(range_diamonds)
## [1]     0.2 18823.0

To find the range of the variable ‘price’

range(diamonds$price)
## [1]   326 18823

To find the range of the variable ‘carat’

range(diamonds$carat)
## [1] 0.20 5.01

To find the range of the variable ‘depth’

range(diamonds$depth)
## [1] 43 79

To find the range of the variable ‘table’

range(diamonds$table)
## [1] 43 95
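The four separate calls can also be collapsed into one by applying ‘range()’ to each column in turn. A minimal sketch, where `vars` and `ranges` are names of my own choosing:

```r
library(ggplot2)  # provides the diamonds data set

vars <- c("price", "carat", "depth", "table")

# range() applied column by column; the result is a 2 x 4 matrix
ranges <- sapply(diamonds[, vars], range)
rownames(ranges) <- c("min", "max")
ranges
```

Each column of the result holds the minimum and maximum of one variable, so the four ranges sit side by side.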

7.b Report grouped frequency

This is, by far, the most complicated task in this entire assignment. It took me several tries to get this right. I have to report the grouped frequency of any two variables identified in 7.a above. I choose ‘carat’ and ‘depth.’ Both are continuous variables. I will have to sequence and cut the variables. I begin with the ‘carat’ variable. From the previous step I gathered that the ‘carat’ variable ranges from 0.20 to 5.01. So I am going to sequence it in increments of 0.45. This generates a table of 10 bins. The last break falls at 4.70, so the single 5.01-carat diamond lies outside the bins and the counts add up to 53,939 rather than 53,940.

breaks <- seq(from=0.20, to=5.01, by=0.45)
carat.cut <- cut(diamonds$carat, breaks, right=FALSE)
table(carat.cut)
## carat.cut
## [0.2,0.65) [0.65,1.1) [1.1,1.55)   [1.55,2)   [2,2.45) [2.45,2.9) 
##      24969      17201       7910       1706       1989        125 
## [2.9,3.35) [3.35,3.8) [3.8,4.25) [4.25,4.7) 
##         29          5          4          1
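‘cut()’ returns NA for any value outside the breaks, so a break sequence that stops short of the maximum silently drops rows. One way to guard against that, sketched here, is to compute how many bins of width 0.45 are needed to reach the maximum and build the breaks from that count. The names `width`, `lower`, `n_bins`, `breaks_full` and `carat_cut_full` are my own, and extending the upper limit is my assumption rather than part of the assignment.

```r
library(ggplot2)  # provides the diamonds data set

width <- 0.45  # bin width used in the text
lower <- 0.20  # smallest carat value

# number of bins needed so the last break lies at or above the maximum
n_bins <- ceiling((max(diamonds$carat) - lower) / width)
breaks_full <- lower + width * (0:n_bins)

# include.lowest = TRUE closes the last interval on the right when right = FALSE
carat_cut_full <- cut(diamonds$carat, breaks_full,
                      right = FALSE, include.lowest = TRUE)

sum(is.na(carat_cut_full))  # rows that fell outside the bins
table(carat_cut_full)
```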

From the previous step I gathered that the ‘depth’ variable ranges from 43 to 79. So I am going to sequence it from 40 to 80 in increments of four. This generates a table of 10 bins.

depth_break <- diamonds$depth
breaks_d <- seq(from=40, to=80, by=4)
depth.cut <- cut(depth_break, breaks_d, right=FALSE)
table(depth.cut)
## depth.cut
## [40,44) [44,48) [48,52) [52,56) [56,60) [60,64) [64,68) [68,72) [72,76) 
##       2       1       2      58    5051   46740    1992      88       3 
## [76,80) 
##       3

I am ready to create the grouped frequency of the variables ‘carat’ and ‘depth.’

ftable(carat.cut, depth.cut)
##            depth.cut [40,44) [44,48) [48,52) [52,56) [56,60) [60,64) [64,68) [68,72) [72,76) [76,80)
## carat.cut                                                                                           
## [0.2,0.65)                 0       0       1      23    1804   22737     396       6       0       2
## [0.65,1.1)                 2       1       0      26    1933   14110    1074      51       3       1
## [1.1,1.55)                 0       0       1       6     819    6761     308      15       0       0
## [1.55,2)                   0       0       0       0     189    1472      41       4       0       0
## [2,2.45)                   0       0       0       3     280    1555     140      11       0       0
## [2.45,2.9)                 0       0       0       0      18      86      20       1       0       0
## [2.9,3.35)                 0       0       0       0       8      13       8       0       0       0
## [3.35,3.8)                 0       0       0       0       0       3       2       0       0       0
## [3.8,4.25)                 0       0       0       0       0       3       1       0       0       0
## [4.25,4.7)                 0       0       0       0       0       0       1       0       0       0

7.c Histograms of the four variables - ‘price,’ ‘carat,’ ‘depth,’ and ‘table’

I will first plot a histogram for ‘price.’ The x-axis denotes the price while the y-axis shows the frequency.

hist(diamonds$price, border="green", col="blue")

Here’s the histogram for ‘carat.’

hist(diamonds$carat, border="red", col="yellow")

The histogram for ‘depth’ looks like this.

hist(diamonds$depth, border="black", col="brown")

And finally, here’s the histogram for ‘table.’

hist(diamonds$table, border="pink", col="purple")
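Since ‘ggplot2’ is already loaded, the same kind of plot can be drawn with ‘geom_histogram()’ as well. This is only a sketch of the ‘price’ histogram; the choice of 30 bins, the title and the axis labels are my own assumptions, and `p_price` is a name I made up.

```r
library(ggplot2)

# histogram of price: fill sets the bar colour, colour sets the bar outline
p_price <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 30, fill = "blue", colour = "green") +
  labs(title = "Histogram of diamond prices", x = "Price", y = "Frequency")

p_price
```

Swapping ‘price’ for ‘carat,’ ‘depth’ or ‘table’ inside ‘aes()’ gives the other three plots.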