We are moving along at a quick pace and learning to work with data. This week’s assignment focuses on mutating and analyzing data using ‘dplyr’ and creating fancy histograms.
Practice Assignment 5 involves working with three data sets within R - ‘mtcars,’ ‘esoph’ and ‘diamonds.’
Let’s begin…keep up with me as I weave some magic with all this data.
The first task is simple. I have to identify the data type of each variable in ‘mtcars.’ I ensure the dplyr package is loaded, using the ‘library()’ function. I’ve already installed it in R in the past, so I don’t need to install it again. If I try, R will send me a nasty message and I’ve already had enough of those during my work week from different sources. Don’t need any more, especially from a software package!
library(dplyr)
library(datasets)
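For anyone who is not sure whether a package is already installed, here is a minimal sketch of a check that avoids reinstalling. It is not part of the assignment; the names ‘pkgs’ and ‘installed’ are my own, and the code only reports what is missing instead of installing anything.

```r
# Packages used in this post (ggplot2 supplies the diamonds data set later on).
pkgs <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE when a package is installed, without attaching it.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything missing rather than installing it automatically.
if (any(!installed)) {
  message("Not installed yet: ", paste(pkgs[!installed], collapse = ", "))
}
```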
I load the data set using the ‘data()’ function and print out the data frame. Printing all 32 rows takes up a lot of space, so for the next task involving ‘mtcars,’ I will convert the data frame to a tibble, which prints only the first ten rows.
data(mtcars)
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
To identify the data type of each variable, I use the ‘str()’ function.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
I’ve looked at the data frame and am now able to classify each variable. The classifications are listed below.
The row names are discrete. They list the names of the different cars and are not stored as a column, which is why ‘str()’ reports only 11 variables.
Variable ‘mpg’ is continuous.
Variable ‘cyl’ is discrete.
Variable ‘disp’ is continuous.
Variable ‘hp’ is discrete.
Variable ‘drat’ is continuous.
Variable ‘wt’ is continuous.
Variable ‘qsec’ is continuous.
Variable ‘vs’ is discrete.
Variable ‘am’ is discrete.
Variable ‘gear’ is discrete.
Variable ‘carb’ is discrete.
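Since ‘str()’ reports every column as ‘num,’ the discrete versus continuous call has to come from the values themselves. The sketch below shows one rough way to back up the list above. The whole-number rule is only a heuristic of my own choosing, not a formal definition, and ‘whole_numbers’ is a name I made up.

```r
data(mtcars)

# Storage type of every column: all eleven are numeric.
sapply(mtcars, class)

# A column holding only whole numbers is a candidate for "discrete".
whole_numbers <- sapply(mtcars, function(x) all(x == round(x)))
whole_numbers

# Few distinct values is another hint, e.g. cyl only takes 4, 6 and 8.
sapply(mtcars, function(x) length(unique(x)))
```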
The third task involves reporting the distribution of any three variables of my choice from ‘mtcars’ using the ‘summary()’ function. I’m first going to create a tibble for ‘mtcars.’
mtcars_tbl <- as_tibble(mtcars)
mtcars_tbl
Now I’ll store the names of the 3 variables of my choice in a vector and use it to print those columns of the tibble.
cols <- c('mpg', 'cyl', 'gear')
cols
## [1] "mpg" "cyl" "gear"
mtcars_tbl[, cols]
## # A tibble: 32 x 3
## mpg cyl gear
## <dbl> <dbl> <dbl>
## 1 21.0 6 4
## 2 21.0 6 4
## 3 22.8 4 4
## 4 21.4 6 3
## 5 18.7 8 3
## 6 18.1 6 3
## 7 14.3 8 3
## 8 24.4 4 4
## 9 22.8 4 4
## 10 19.2 6 4
## # ... with 22 more rows
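The task asks for the distribution via ‘summary()’, so here is a short sketch of that call on the same three columns. It works on the original data frame so that it runs on its own; ‘dist_summary’ is a name I picked for illustration.

```r
data(mtcars)
cols <- c("mpg", "cyl", "gear")

# Minimum, quartiles, median, mean and maximum for each chosen column.
dist_summary <- summary(mtcars[, cols])
dist_summary

# cyl and gear take only a few values, so a count per value is often clearer.
table(mtcars$cyl)
table(mtcars$gear)
```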
The fourth task in this assignment involves bringing in a different data set - ‘esoph.’ I have to identify the data type of three of the variables, ‘agegp,’ ‘alcgp’ and ‘tobgp,’ in this data set. I am not going to show the printed data set as it is large, but I’ll provide the code below to import ‘esoph,’ print the data frame and convert it to a tibble.
data(esoph)
esoph
esoph_tbl <- as_tibble(esoph)
I then identify the specific variables I want to focus on and use the ‘str()’ function again to identify the data types.
esoph_tbl <- select(esoph_tbl, agegp, alcgp, tobgp)
esoph_tbl
## # A tibble: 88 x 3
## agegp alcgp tobgp
## <ord> <ord> <ord>
## 1 25-34 0-39g/day 0-9g/day
## 2 25-34 0-39g/day 10-19
## 3 25-34 0-39g/day 20-29
## 4 25-34 0-39g/day 30+
## 5 25-34 40-79 0-9g/day
## 6 25-34 40-79 10-19
## 7 25-34 40-79 20-29
## 8 25-34 40-79 30+
## 9 25-34 80-119 0-9g/day
## 10 25-34 80-119 10-19
## # ... with 78 more rows
str(esoph_tbl)
## Classes 'tbl_df', 'tbl' and 'data.frame': 88 obs. of 3 variables:
## $ agegp: Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alcgp: Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ tobgp: Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
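All three variables come back as ordered factors (‘Ord.factor’), which means they are categorical with a ranking. As a small sketch, the same thing can be confirmed directly; ‘vars’ and ‘ordered_check’ are names I made up for this example.

```r
data(esoph)
vars <- c("agegp", "alcgp", "tobgp")

# TRUE for each column that is an ordered factor.
ordered_check <- sapply(esoph[vars], is.ordered)
ordered_check

# The levels of each variable, listed from lowest to highest group.
lapply(esoph[vars], levels)
```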
The fun begins. Things are getting a little complicated. I have to report the frequency distribution and relative frequency distribution of some of the variables in esoph next. The good news is that I’ve already selected the three variables and assigned them to ‘esoph_tbl.’ I will first create the frequency distribution of the three variables using the ‘ftable()’ function.
esoph.freq <- ftable(esoph_tbl)
Next, I will find out the number of observations in the data set. This will help me calculate the relative frequency distribution using the formula relative frequency = frequency/number of observations. I take the count from ‘esoph_tbl’ and not from the frequency table, because ‘nrow(esoph.freq)’ returns the 24 row combinations of ‘agegp’ and ‘alcgp’ instead of the 88 observations, and dividing by 24 would give proportions that do not sum to 1.
nrow(esoph_tbl)
## [1] 88
esoph.relfreq <- esoph.freq/nrow(esoph_tbl)
esoph.relfreq
## tobgp 0-9g/day 10-19 20-29 30+
## agegp alcgp
## 25-34 0-39g/day 0.01136364 0.01136364 0.01136364 0.01136364
## 40-79 0.01136364 0.01136364 0.01136364 0.01136364
## 80-119 0.01136364 0.01136364 0.00000000 0.01136364
## 120+ 0.01136364 0.01136364 0.01136364 0.01136364
## 35-44 0-39g/day 0.01136364 0.01136364 0.01136364 0.01136364
## 40-79 0.01136364 0.01136364 0.01136364 0.01136364
## 80-119 0.01136364 0.01136364 0.01136364 0.01136364
## 120+ 0.01136364 0.01136364 0.01136364 0.00000000
## 45-54 0-39g/day 0.01136364 0.01136364 0.01136364 0.01136364
## 40-79 0.01136364 0.01136364 0.01136364 0.01136364
## 80-119 0.01136364 0.01136364 0.01136364 0.01136364
## 120+ 0.01136364 0.01136364 0.01136364 0.01136364
## 55-64 0-39g/day 0.01136364 0.01136364 0.01136364 0.01136364
## 40-79 0.01136364 0.01136364 0.01136364 0.01136364
## 80-119 0.01136364 0.01136364 0.01136364 0.01136364
## 120+ 0.01136364 0.01136364 0.01136364 0.01136364
## 65-74 0-39g/day 0.01136364 0.01136364 0.01136364 0.01136364
## 40-79 0.01136364 0.01136364 0.01136364 0.00000000
## 80-119 0.01136364 0.01136364 0.01136364 0.01136364
## 120+ 0.01136364 0.01136364 0.01136364 0.01136364
## 75+ 0-39g/day 0.01136364 0.01136364 0.00000000 0.01136364
## 40-79 0.01136364 0.01136364 0.01136364 0.01136364
## 80-119 0.01136364 0.01136364 0.00000000 0.00000000
## 120+ 0.01136364 0.01136364 0.00000000 0.00000000
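A quick sanity check for any relative frequency table is that its cells add up to 1. Here is a sketch of an alternative route using base R’s ‘prop.table()’, which divides each cell by the grand total. It starts from the raw ‘esoph’ data so it runs on its own; ‘freq’ and ‘rel_freq’ are names I chose for illustration.

```r
data(esoph)

# Three-way frequency table of the grouping variables.
freq <- table(esoph[c("agegp", "alcgp", "tobgp")])

# With no margin argument, prop.table() divides every cell by the grand total.
rel_freq <- prop.table(freq)

# Flatten for printing, then confirm the proportions add up to 1.
ftable(rel_freq)
sum(rel_freq)
```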
The last task involving the data frame ‘esoph’ requires me to report the joint frequency of two different sets of columns - ‘agegp’ and ‘alcgp’; ‘alcgp’ and ‘tobgp.’ In order to do this, I use the ‘xtabs()’ function.
jf_agegp.alcgp <- xtabs(~agegp+alcgp, data=esoph_tbl)
jf_agegp.alcgp
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 4 4 3 4
## 35-44 4 4 4 3
## 45-54 4 4 4 4
## 55-64 4 4 4 4
## 65-74 4 3 4 4
## 75+ 3 4 2 2
jf_alcgp.tobgp <- xtabs(~alcgp+tobgp, data=esoph_tbl)
jf_alcgp.tobgp
## tobgp
## alcgp 0-9g/day 10-19 20-29 30+
## 0-39g/day 6 6 5 6
## 40-79 6 6 6 5
## 80-119 6 6 4 5
## 120+ 6 6 5 4
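To check a joint frequency table, it helps to add the row and column totals. The sketch below uses base R’s ‘addmargins()’ on the first pair of variables. It is an extra check of my own and not something the assignment asks for, and ‘joint’ and ‘joint_with_totals’ are made-up names.

```r
data(esoph)

# Joint frequency of age group and alcohol group.
joint <- xtabs(~ agegp + alcgp, data = esoph)

# Append row and column totals; the bottom-right cell is the number of rows in esoph.
joint_with_totals <- addmargins(joint)
joint_with_totals
```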
The final task in this assignment comprises several sub-tasks. All of these sub-tasks involve the data set ‘diamonds.’ I will first load the package ‘ggplot2,’ which contains ‘diamonds,’ as I also have to create histograms.
library(ggplot2)
The data set ‘diamonds’ is huge so I am not going to print it. The command ‘?diamonds’ will display a description of the data set in the Help pane in RStudio. I will, however, provide the code below.
data(diamonds)
diamonds
?diamonds
To find the range of all four variables
The first sub-task involving ‘diamonds’ is to display the range of the variables ‘price,’ ‘carat,’ ‘depth,’ and ‘table.’
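I will first do it collectively for all four variables and then individually for each of the four variables. I will use the ‘range()’ function. Applied to all four columns at once, ‘range()’ returns a single minimum and maximum across every value, here the smallest ‘carat’ and the largest ‘price,’ so the individual ranges are the ones to report.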
range_diamonds <- select(diamonds, price, carat, depth, table)
range_diamonds
## # A tibble: 53,940 x 4
## price carat depth table
## <int> <dbl> <dbl> <dbl>
## 1 326 0.23 61.5 55
## 2 326 0.21 59.8 61
## 3 327 0.23 56.9 65
## 4 334 0.29 62.4 58
## 5 335 0.31 63.3 58
## 6 336 0.24 62.8 57
## 7 336 0.24 62.3 57
## 8 337 0.26 61.9 55
## 9 337 0.22 65.1 61
## 10 338 0.23 59.4 61
## # ... with 53,930 more rows
range(range_diamonds)
## [1] 0.2 18823.0
To find the range of the variable ‘price’
range(diamonds$price)
## [1] 326 18823
To find the range of the variable ‘carat’
range(diamonds$carat)
## [1] 0.20 5.01
To find the range of the variable ‘depth’
range(diamonds$depth)
## [1] 43 79
To find the range of the variable ‘table’
range(diamonds$table)
## [1] 43 95
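The four separate calls can also be rolled into one. This is a sketch using base R’s ‘sapply()’ to apply ‘range()’ to each column in turn; ‘vars’ and ‘ranges’ are names I made up, and it assumes ‘ggplot2’ is installed so that ‘diamonds’ is available.

```r
library(ggplot2)  # provides the diamonds data set

vars <- c("price", "carat", "depth", "table")

# Apply range() to each column separately; the result is a 2 x 4 matrix
# with the minimum in the first row and the maximum in the second.
ranges <- sapply(diamonds[vars], range)
rownames(ranges) <- c("min", "max")
ranges
```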
This is, by far, the most complicated task in this entire assignment. It took me several tries to get this right. I have to report the grouped frequency of any two variables identified in 7.a above. I choose ‘carat’ and ‘depth.’ Both are continuous variables. I will have to sequence and cut the variables. I begin with the ‘carat’ variable. From the previous step I gathered that the ‘carat’ variable ranges from 0.20 to 5.01, so I am going to sequence it in increments of 0.45. This generates a table of 10 intervals. The last break falls at 4.70, so the 5.01-carat diamond lies outside the intervals and is left out of the table, which is why the counts add up to 53,939 and not 53,940.
breaks <- seq(from=0.20, to=5.01, by=0.45)
carat.cut <- cut(diamonds$carat, breaks, right=FALSE)
table(carat.cut)
## carat.cut
## [0.2,0.65) [0.65,1.1) [1.1,1.55) [1.55,2) [2,2.45) [2.45,2.9)
## 24969 17201 7910 1706 1989 125
## [2.9,3.35) [3.35,3.8) [3.8,4.25) [4.25,4.7)
## 29 5 4 1
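Because ‘seq(from=0.20, to=5.01, by=0.45)’ stops at 4.70, any value of 4.70 or more is not assigned to an interval. One way to cover the top of the range is to add one more break, as in this sketch. The names ‘breaks_full’ and ‘carat_cut_full’ are my own, and the extra interval is my suggestion and not part of the assignment.

```r
library(ggplot2)  # provides the diamonds data set

# Twelve break points give eleven intervals; the last one runs from 4.7 to 5.15.
breaks_full <- seq(from = 0.20, by = 0.45, length.out = 12)
carat_cut_full <- cut(diamonds$carat, breaks_full, right = FALSE)

table(carat_cut_full)

# Zero NAs means every observation landed in an interval.
sum(is.na(carat_cut_full))
```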
From the previous step I gathered that the ‘depth’ variable ranges from 43 to 79. So I am going to sequence it in increments of four, starting at 40 and ending at 80 so that every value falls inside an interval. This generates a table of 10 intervals.
depth_break <- diamonds$depth
breaks_d <- seq(from=40, to=80, by=4)
depth.cut <- cut(depth_break, breaks_d, right=FALSE)
table(depth.cut)
## depth.cut
## [40,44) [44,48) [48,52) [52,56) [56,60) [60,64) [64,68) [68,72) [72,76)
## 2 1 2 58 5051 46740 1992 88 3
## [76,80)
## 3
I am ready to create the grouped frequency of the variables ‘carat’ and ‘depth.’
ftable(carat.cut, depth.cut)
## depth.cut [40,44) [44,48) [48,52) [52,56) [56,60) [60,64) [64,68) [68,72) [72,76) [76,80)
## carat.cut
## [0.2,0.65) 0 0 1 23 1804 22737 396 6 0 2
## [0.65,1.1) 2 1 0 26 1933 14110 1074 51 3 1
## [1.1,1.55) 0 0 1 6 819 6761 308 15 0 0
## [1.55,2) 0 0 0 0 189 1472 41 4 0 0
## [2,2.45) 0 0 0 3 280 1555 140 11 0 0
## [2.45,2.9) 0 0 0 0 18 86 20 1 0 0
## [2.9,3.35) 0 0 0 0 8 13 8 0 0 0
## [3.35,3.8) 0 0 0 0 0 3 2 0 0 0
## [3.8,4.25) 0 0 0 0 0 3 1 0 0 0
## [4.25,4.7) 0 0 0 0 0 0 1 0 0 0
I will first plot a histogram for ‘price.’ The x-axis denotes the price while the y-axis shows the frequency.
hist(diamonds$price, border="green", col="blue")
Here’s the histogram for ‘carat.’
hist(diamonds$carat, border="red", col="yellow")
The histogram for ‘depth’ looks like this.
hist(diamonds$depth, border="black", col="brown")
And finally, here’s the histogram for ‘table.’
hist(diamonds$table, border="pink", col="purple")
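Since ‘ggplot2’ is already loaded, the same kind of plot can be drawn with it too. This is only a sketch of the ‘price’ histogram. The bin width of 500 and the axis labels are my own choices for illustration and are not from the assignment.

```r
library(ggplot2)

# binwidth = 500 is an illustrative choice, not part of the assignment.
price_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, fill = "blue", colour = "green") +
  labs(x = "Price", y = "Frequency")

# Printing the object draws the plot.
price_hist
```

Swapping ‘price’ for ‘carat,’ ‘depth’ or ‘table’ (with a bin width suited to that variable’s range) gives the other three histograms.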