The focus today is to demonstrate: importing data; understanding data through processes for transformation, visualization, and modeling; and communication through processes of explanation and documentation of information from these data.
In this demonstration I use mtcars, a dataset available through the datasets package that accompanies the base R installation.
First, one way to refine this dataset is demonstrated (there are many ways, as with all of the processes I demonstrate with mtcars).
Then, some variables in mtcars are transformed in various ways using functions available in the R package called dplyr.
Next, visualizations of some variables in mtcars are shown using functions available in the R package ggplot2.
Last, several simple statistical hypotheses are modeled with some variables in the mtcars dataset.
The example of these processes in the data sciences is provided in the file that created this web page, R_Example_6March2017.Rmd, which is an RMarkdown file. RMarkdown is an authoring language that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format commonly applied in data science) with embedded “chunks” of R code that are run so their output can be included in a final document.
The RScript file, ‘R_Example_6May2017.R’, implements all of the R processes explained in this HTML page. I recommend that you execute the code at ‘R_Example_6May2017.R’ before you run this RMarkdown file because you must install various R packages to run the RScript which also are used in this RMarkdown file.
Any line that starts with a “#” is treated as a comment in R code and is not executed by R. Annotation of RScripts with comments aids in reproducibility of analyses. In most cases, the R code used to create the analyses shown on this HTML page is reproduced. Typically, however, RMarkdown reports hide the R code used to create an HTML page.
This demonstration is not meant to instruct you in the use of R or RStudio. It merely is an introductory tour of many R and RStudio capabilities.
To demonstrate processes in R, I import mtcars, a dataset with a small number of observations and variables. The data in mtcars were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Said another way, the dataset, mtcars, has 32 observations (cars) with measures for 11 variables (aspects of design and performance). The mtcars dataset is built into R, so we do not need to use the methods available in R for importing data from external sources.
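For data that live outside R, a round trip through a CSV file gives a feel for what importing looks like. This is only a sketch: the temporary file and the name mtcars_imported are my own choices for illustration, and row.names = 1 assumes that the first column of the file holds the car names.

```r
# Write mtcars to a temporary CSV file, then import it again.
csv_path <- tempfile(fileext = ".csv")   # hypothetical location for this example
write.csv(mtcars, csv_path)              # row names (car models) become the first column

# row.names = 1 tells read.csv to use the first column as row names.
mtcars_imported <- read.csv(csv_path, row.names = 1)

dim(mtcars_imported)                 # rows and columns of the imported data
head(rownames(mtcars_imported), 3)   # first three car names
```

The same read.csv call works for any comma-separated file on disk once csv_path points to it.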
Here is a listing of the 32-observation-by-11-variable mtcars dataset to which you can refer throughout this demonstration.
[The two graphics showing the mtcars variable layout were read from an external file*]
Entering the name of the dataset in the RStudio console pane (the lower left quadrant of the RStudio layout), or running it from the RStudio source pane (the upper left quadrant of the RStudio layout), prints the dataset shown below in the console:
mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Printing a dataset like mtcars is suitable for viewing in the console (the lower left pane of the RStudio layout). It contains few observations and variables. However, what about datasets that contain hundreds … thousands … hundreds of thousands of observations?
Another approach is to print in the console the first six lines of the dataset using the R function, head.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Or, to print the last six lines, use the R function, tail.
tail(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
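For larger datasets it helps to check the size first and then print only as many rows as needed. A small sketch using base R only; the names first_three and last_two are mine, chosen for illustration.

```r
# Size of the dataset without printing it.
dim(mtcars)    # number of rows, then number of columns
nrow(mtcars)
ncol(mtcars)

# head() and tail() accept n, the number of rows to show.
first_three <- head(mtcars, n = 3)
last_two <- tail(mtcars, n = 2)
first_three
last_two
```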
We load an R “package” (a special purpose library of code in R that accomplishes a bundle of computing tasks) called dplyr to transform the mtcars data.
library(dplyr)
We create a more orderly table of the mtcars dataset by executing what is called a tibble function, tbl_df, to transform mtcars into a new table of data, mtcars_df. (tibble….tbl_df….get it?)
Now, a print of mtcars_df is a more orderly set of columns (variables) and only the first 10 rows (observations…or cars):
mtcars_df <- tbl_df(mtcars)
mtcars_df
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# ... with 22 more rows
A tbl_df (i.e., “tibble”) object is a data frame that provides a nicer printing method, useful when working with large datasets.
Now that mtcars has been transformed, the result of the transformation, mtcars_df, is described in R as a tibble (class tbl_df). Both mtcars and mtcars_df are data frames; the tibble class mainly changes how the table prints.
Note that the size of the table of data also is printed (32 rows by 11 columns, or, in our case, 32 cars by 11 aspects of performance).
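In current releases of dplyr, tbl_df() is deprecated, and as_tibble() from the tibble package does the same job. Here is a sketch, assuming the tibble package is installed; the object name mtcars_tbl and the column name model are my own choices.

```r
library(tibble)

# as_tibble() converts a data frame to a tibble.
# Tibbles drop row names, so rownames = "model" keeps the car names
# by storing them in a new first column (the name "model" is arbitrary).
mtcars_tbl <- as_tibble(mtcars, rownames = "model")

mtcars_tbl          # prints the first 10 rows and the table size
class(mtcars_tbl)   # a tibble is still a data frame
```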
We can use the dplyr function, glimpse, to view variable names, the type of variables (in this case, “dbl,” meaning numeric data), and a sample of some of the data points in mtcars_df:
glimpse(mtcars_df)
Observations: 32
Variables: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
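Base R offers a similar overview without loading any package. A brief sketch using str and sapply:

```r
# str() is a base R counterpart of glimpse().
str(mtcars)

# Storage class of each variable.
sapply(mtcars, class)
```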
The dataset, mtcars, and the dataframe created from it, mtcars_df, do not contain any missing data. It is quite common in practical research settings to have missing data for some cases (in mtcars_df, cars) within variables (in mtcars_df, aspects of performance). We teach in our data science fundamentals course how to handle missing data in data modeling.
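A quick way to confirm that nothing is missing, and to see what happens when something is, can be sketched as follows. The copy mtcars_na and the blanked-out value are invented purely for illustration.

```r
# anyNA() returns TRUE if at least one value is missing.
anyNA(mtcars)

# Count missing values per variable.
colSums(is.na(mtcars))

# Illustration only: copy the data and blank out one value
# to see how the checks respond.
mtcars_na <- mtcars
mtcars_na$mpg[1] <- NA
anyNA(mtcars_na)
mean(mtcars_na$mpg)                 # NA, because one value is missing
mean(mtcars_na$mpg, na.rm = TRUE)   # mean of the remaining 31 values
```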
Rarely do researchers analyze the raw data that they collect. The R package, dplyr, makes reshaping data for analysis a relatively easy task.
It is common to filter a subset of rows in a data table according to criteria (dplyr’s filter function). Or, perhaps to select only some variables (dplyr’s select function).
Often, researchers wish to create new variables from existing variables (dplyr’s mutate function). And, statistical summaries of variables are desired (dplyr’s summarise function; note the spelling!).
The important and efficient use of dplyr is to use a small set of verbs to reshape data. Among the dplyr verbs you will learn to use to reshape data tables are:
2.c.1. filter
2.c.2. select
2.c.3. mutate
2.c.4. summarise
Provided are some examples (2.c.1. through 2.c.4.) of use of these verbs with the dataframe, mtcars_df. Compare the transformations made with the raw data shown in the information sheet about mtcars that you can observe above.
Here is a filter that results in a dataframe containing only the cars that have a straight, rather than a V-shaped, engine:
filter(mtcars_df, vs == 1)
# A tibble: 14 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
4 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
5 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
6 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
7 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
8 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
9 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
10 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
11 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
12 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
13 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
14 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Notice that 14 of the 32 cars meet this filtering criterion (see “# A tibble: 14 × 11” at the top of the output).
The filter verb has two parameters: the dataframe to be filtered (mtcars_df); the filtering criterion (vs == 1, where “==” means “the same as”).
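More than one criterion can be supplied. The sketch below works on mtcars directly so that it stands alone; the object names straight_manual and four_or_six are mine.

```r
library(dplyr)

# Conditions separated by commas are combined with "and".
straight_manual <- filter(mtcars, vs == 1, am == 1)
nrow(straight_manual)

# "|" means "or": cars with either 4 or 6 cylinders.
four_or_six <- filter(mtcars, cyl == 4 | cyl == 6)
nrow(four_or_six)
```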
Suppose I wish to select only mtcars_df variables that contain measures of weight (in pounds/1,000) and number of forward gears:
select(mtcars_df, wt, gear)
# A tibble: 32 × 2
wt gear
* <dbl> <dbl>
1 2.620 4
2 2.875 4
3 2.320 4
4 3.215 3
5 3.440 3
6 3.460 3
7 3.570 3
8 3.190 4
9 3.150 4
10 3.440 4
# ... with 22 more rows
Notice that the selection retains all 32 cars, but only two variables (see “# A tibble: 32 × 2” at the top of the output).
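Besides naming columns one at a time, select understands a few shortcuts. A sketch on mtcars, with d_columns as an illustrative name of my own:

```r
library(dplyr)

# A range of adjacent columns, from mpg through disp.
select(mtcars, mpg:disp)

# Everything except carb.
select(mtcars, -carb)

# Columns whose names start with "d".
d_columns <- select(mtcars, starts_with("d"))
names(d_columns)
```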
Now, suppose I wish to mutate mtcars_df by calculating a new variable from disp, the engine displacement in cubic inches: displ_l, the displacement in liters:
displace <- mutate(mtcars, displ_l = disp / 61.0237)
select(displace, disp, displ_l)
disp displ_l
1 160.0 2.621932
2 160.0 2.621932
3 108.0 1.769804
4 258.0 4.227866
5 360.0 5.899347
6 225.0 3.687092
7 360.0 5.899347
8 146.7 2.403984
9 140.8 2.307300
10 167.6 2.746474
11 167.6 2.746474
12 275.8 4.519556
13 275.8 4.519556
14 275.8 4.519556
15 472.0 7.734700
16 460.0 7.538055
17 440.0 7.210313
18 78.7 1.289663
19 75.7 1.240502
20 71.1 1.165121
21 120.1 1.968088
22 318.0 5.211090
23 304.0 4.981671
24 350.0 5.735477
25 400.0 6.554830
26 79.0 1.294579
27 120.3 1.971365
28 95.1 1.558411
29 351.0 5.751864
30 145.0 2.376126
31 301.0 4.932510
32 121.0 1.982836
I created a new variable, displ_l, by converting cubic inches to liters (there are 61.0237 cubic inches in a liter). I created a new dataframe, displace, from which I, then, selected only disp and displ_l, the displacement in cubic inches and liters, respectively.
Last, let’s calculate some summary statistics for a few variables in the new data I created, displace. Calculated are the means for disp and displ_l:
summarise(displace, count=n(), mean_cu_in = mean(disp), mean_ltr=mean(displ_l))
count mean_cu_in mean_ltr
1 32 230.7219 3.780857
I added the function n(), whose result I named count, to return the number of cars in displace. Each mean is the sum of all disp or displ_l values divided by the number of cars.
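The verbs can be chained with the pipe operator, %>%, which dplyr makes available, and group_by makes summarise return one row per group. This is a sketch rather than part of the original analysis; by_cyl and the summary column names are my own.

```r
library(dplyr)

# Read "%>%" as "then".
by_cyl <- mtcars %>%
  mutate(displ_l = disp / 61.0237) %>%   # cubic inches to liters
  group_by(cyl) %>%                      # one group per cylinder count
  summarise(count = n(),
            mean_mpg = mean(mpg),
            mean_ltr = mean(displ_l))
by_cyl
```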
As indicated in a white paper by Fusion Charts,
We visualize information to meet a very basic need – to tell a story. It’s one of the most primitive forms of communication known to man, having its origins in cave drawings dated as early as 30,000 B.C., even before written communication, which emerged in 3,000 B.C. Vision is the single most important faculty we use to communicate information.
An R package, ggplot2, provides a grammar for commands that create visualizations of data.
The concept behind ggplot2 divides a plot into three fundamental parts: Plot = data + Aesthetics + Geometry. The principal components of every plot can be defined as follows:
data is a dataframe containing variables to be visualized;
Aesthetics is used to indicate the variables to be visualized. This component also can control the color, the size, or the shape of points plotted, the height of bars drawn, etc.; and
Geometry defines the type of graphics (histogram, box plot, line plot, density plot, dot plot….)
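The three parts map onto code one-to-one. A minimal sketch, assuming ggplot2 is installed; the object name p and the choice of bins = 10 are mine.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                      # data
            mapping = aes(x = wt, y = mpg)) +   # aesthetics
  geom_point()                                  # geometry

p   # printing the object draws the plot

# Swapping only the geometry (and the aesthetics it needs)
# gives a different kind of plot from the same data.
ggplot(data = mtcars, mapping = aes(x = mpg)) +
  geom_histogram(bins = 10)
```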
I wish to plot the relationship between the weight of a car and its mileage. I, first, load the ggplot2 library (make sure you have ggplot2 installed). Then, I use the ggplot2 code to create a scatterplot of car miles per gallon by car weight.
In this R code, a plot is created using a dataframe (mtcars_df) + aesthetics (the x and y axes) + a geometry (geom_point, which plots points):
# Make sure you already have installed the ggplot2 package.
library(ggplot2)
ggplot(mtcars_df, aes(x=wt, y=mpg)) +
geom_point()
Plotted on the graph are the weight (wt) and miles per gallon (mpg) for every car. The joint occurrence of wt and mpg is what is called a Cartesian coordinate and is shown by a point for each car.
But, hmmm, the points are small, aren’t they? So, let’s change the size of the points by modifying geom_point() to geom_point(size=5).
ggplot(mtcars_df, aes(x=wt, y=mpg)) +
geom_point(size=5)
Let’s change the plot so that the shape of points becomes a diamond (shape=23), the size is larger (size=8), the color is red (color=“red”), and the diamonds are not filled (fill=NA):
ggplot(mtcars_df, aes(x=wt, y=mpg)) +
geom_point(size=8, shape=23, color="red", fill=NA)
How about adding a line through black, round points (minus some of the other previously used plot characteristics) that minimizes the distance between the line and the swarm of points?
ggplot(mtcars_df, aes(x=wt, y=mpg)) +
geom_point(size=4) +
geom_smooth(method=lm, se=FALSE)
Leaving out method=lm lets geom_smooth choose its default smoother, which for a sample this small is a loess curve rather than a straight line:
ggplot(mtcars_df, aes(x=wt, y=mpg)) +
geom_point(size=4) +
geom_smooth(se=FALSE)
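The straight line drawn with method=lm is an ordinary least-squares line: it minimizes the sum of squared vertical distances between the points and the line. As a sketch, the same line can be fit directly with lm; the name fit_wt and the 3,000-pound example car are my own.

```r
# Fit mpg as a linear function of wt.
fit_wt <- lm(mpg ~ wt, data = mtcars)

coef(fit_wt)   # intercept and slope of the line

# Predicted mileage for a hypothetical car weighing 3,000 pounds (wt = 3).
predict(fit_wt, newdata = data.frame(wt = 3))
```

A negative slope here says that heavier cars tend to get fewer miles per gallon.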
Change the color of the smooth line? Add color=“red” to the geom_smooth characteristic.
ggplot(mtcars_df, aes(x=wt, y=mpg)) +
geom_point(size=4) +
geom_smooth(se=FALSE, color="red")
Pretty. But, what about a title and labels for axes of the graph?
ggplot(mtcars_df, aes(x=wt, y=mpg)) +
geom_point(size=4) +
geom_smooth(se=TRUE, color="red") +
labs(title="Relationship Between Miles Per Gallon and Weight",
x="Weight", y="Miles Per Gallon")
Now, let’s plot wt and mpg by the number of cylinders (cyl), give points for different numbers of cylinders different shapes and colors, and add a legend to the plot. The second plot moves the legend to the top:
mtcars_df$cyl <- as.factor(mtcars_df$cyl)
# Previous command changes the cyl variable from numeric
# to a factor (a categorical variable).
ggplot(mtcars_df, aes(x=wt, y=mpg, shape=cyl, color=cyl)) +
geom_point(size=4) +
labs(title="Relationship Between Miles Per Gallon and Weight",
x="Weight", y="Miles Per Gallon")
ggplot(mtcars_df, aes(x=wt, y=mpg, shape=cyl, color=cyl)) +
geom_point(size=4) +
labs(title="Relationship Between Miles Per Gallon and Weight",
x="Weight", y="Miles Per Gallon") +
theme(legend.position="top")
To change the “look and feel” of the plot, I added theme_economist, a theme based on the plot design of The Economist magazine, after installing and loading the package, “ggthemes.”
# Make sure you have installed the ggthemes package.
library(ggthemes)
mtcars_df$cyl <- as.factor(mtcars_df$cyl)
ggplot(mtcars_df, aes(x=wt, y=mpg, shape=cyl, color=cyl)) +
geom_point(size=4) +
labs(title="Relationship Between Miles Per Gallon and Weight",
x="Weight", y="Miles Per Gallon") +
theme_economist() +
# theme() comes after theme_economist() so that its setting is not overridden.
theme(legend.position="top")
Some academic journals do not accept color graphics. This plot uses a grey scale color palette entirely:
ggplot(mtcars_df, aes(x=wt, y=mpg, shape=cyl, color=cyl)) +
geom_point(size=4) +
labs(title="Relationship Between Miles Per Gallon and Weight",
x="Weight", y="Miles Per Gallon") +
theme(legend.position="top",
axis.title=element_text(size=17),
axis.text=element_text(size=13),
plot.title=element_text(size=19)) +
scale_color_grey()
Suppose I am interested in testing the null hypothesis that fuel efficiency, miles/gallon, is equal for cars with automatic and manual transmissions. There are observable differences in our 32-car sample. But, mtcars is just a sample from a population of cars. Can I infer that mileage is the same or different in the population of cars based on this sample of data?
To test this hypothesis, I structure a t-test of the differences between mean mileage by transmission type:
# am is coded 0 = automatic, 1 = manual.
mpg.at <- mtcars_df[mtcars_df$am == 0,]$mpg
mpg.mt <- mtcars_df[mtcars_df$am == 1,]$mpg
t.test(mpg.mt, mpg.at)
Welch Two Sample t-test
data: mpg.mt and mpg.at
t = 3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.209684 11.280194
sample estimates:
mean of x mean of y
24.39231 17.14737
The mean mileage for cars with manual transmissions is approximately 24 miles/gallon, while for automatics it is approximately 17 miles/gallon, a 7 miles/gallon difference in the sample. According to my analysis, I am 95% confident that the actual mileage difference between automatic and manual transmissions is between 3 and 11 miles per gallon.
If I set the probability of Type I error (the error made by rejecting the null hypothesis when it is true) to 0.05, I can reject the null hypothesis of “no difference” because the probability is 0.0014 of observing a t-value at least as extreme as 3.77, with approximately 18 degrees of freedom, if the null hypothesis is true.
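The pieces of the test output can be pulled out of the result object instead of being read off the printout. A sketch using the formula interface of t.test on mtcars; the name res is mine.

```r
# Formula interface: mpg split by the two levels of am.
# Groups are ordered by am, so automatic (0) comes first and the
# sign of t is reversed relative to t.test(mpg.mt, mpg.at).
res <- t.test(mpg ~ am, data = mtcars)

res$statistic   # t-value
res$parameter   # degrees of freedom
res$p.value     # p-value
res$conf.int    # 95 percent confidence interval

# TRUE means: reject the null hypothesis at the 0.05 level.
res$p.value < 0.05
```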
Hey, now, that’s a mouthful. But, you will not only learn to talk that gibberish in a basic data science course, but you also will learn and practice to understand it!
Now, this whole thing in a graphic:
boxplot(mpg~am, data=mtcars_df, main ='Fig. 1. Fuel Efficiency',
ylab='Miles per gallon',names=c("Automatic","Manual"),notch=FALSE, col=(c("gold","skyblue")))
This boxplot contains the following information:
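The numbers that a boxplot draws can also be computed directly. A minimal sketch using fivenum, which returns the minimum, lower hinge, median, upper hinge, and maximum; the name mpg_summary is my own. The box and its center line correspond to the hinges and the median, while the whiskers can stop short of the minimum or maximum when there are outliers.

```r
# Five-number summary of mpg for each transmission type
# (am: 0 = automatic, 1 = manual).
mpg_summary <- tapply(mtcars$mpg, mtcars$am, fivenum)
mpg_summary

# The statistics boxplot() itself uses, for automatic cars:
# $stats holds the whisker ends, hinges, and median; $out holds any outliers.
boxplot.stats(mtcars$mpg[mtcars$am == 0])
```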
OK, now, I test this same hypothesis about the effect of transmission type on mileage in another way – through linear regression.
Linear regression fits a line of “best fit” through a plot of points formed by the coordinates of values of two variables. The equation of a regression line is y = a + bx, where y is a dependent variable (mpg in our case), x is an independent variable (type of transmission), a is an intercept term, and b is the slope of the “best fitting” regression line.
So, in R:
fit_1var <- lm(mtcars$mpg ~ mtcars$am)
summary(fit_1var)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.147368 1.124603 15.247492 1.133983e-15
mtcars$am 7.244939 1.764422 4.106127 2.850207e-04
par(mfrow = c(1,2))
plot(mtcars$am, mtcars$mpg, main = 'MPG & Transmission')
abline(fit_1var)
The slope of the regression line is approximately 7, which is the change in mpg for a change of one unit in am (the difference between automatic transmissions, coded “0”, and manual transmissions, coded “1”). This is the same 7 miles/gallon difference between group means that the t-test reported.
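The regression t-value (about 4.11) is larger than the Welch t-value (about 3.77) because lm assumes equal variances in the two groups. As a sketch, a t-test run with var.equal = TRUE reproduces the regression result; the names fit_am, slope, and tt_pooled are mine.

```r
fit_am <- lm(mpg ~ am, data = mtcars)
slope <- coef(summary(fit_am))["am", ]   # estimate, std. error, t, p for am

# Pooled-variance t-test; manual (am == 1) is listed first so the sign matches.
tt_pooled <- t.test(mtcars$mpg[mtcars$am == 1],
                    mtcars$mpg[mtcars$am == 0],
                    var.equal = TRUE)

slope["t value"]
tt_pooled$statistic
all.equal(unname(slope["Pr(>|t|)"]), tt_pooled$p.value)

# The slope equals the difference between the two group means.
diff(tapply(mtcars$mpg, mtcars$am, mean))
```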
This report is an example of explanation of findings of data analysis. Moreover, this report also is an example of documentation through use of RMarkdown.
You will learn and practice explanation and documentation of analyses using RMarkdown to create HTML reports, slide presentations, and PDF files. You will host your HTML reports at RPubs, a free RMarkdown document hosting service.
This document is hosted at http://rpubs.com/davidpassmore/273562.