In One Demonstration File, Documentation of Almost All of the Ways We Will Learn to Process Data in WF ED 540, Data Analysis in Workforce Education and Development at Penn State

Today….

In WF ED 540, you will learn to: refine data; analyze data through processes for transformation, visualization, and modeling; and communicate data that were refined and analyzed through processes of explanation and documentation. A demonstration is offered of almost all of the ways you will learn to process data in WF ED 540.

A Demonstration Offered

In this demonstration I use mtcars, a dataset available through the “datasets” package that accompanies the base R installation.

First, one way to refine this dataset is demonstrated (There are many ways, as with all of the processes I demonstrate with mtcars).
Then, some variables in mtcars are transformed in various ways using functions available in the R package called dplyr.
Next, visualizations of some variables in mtcars are shown using functions available in the R package ggvis.
Last, several simple statistical hypotheses are modeled with some variables in the mtcars dataset.

Explanation and documentation? The example of these processes in the data sciences is provided in this file, WFED540Overview_27Aug2015.Rmd, which is an RMarkdown file. R Markdown is an authoring language that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format commonly applied in data science) with embedded “chunks” of R code that are run so their output can be included in a final document. You receive instruction in WF ED 540 in the creation of RMarkdown files and for displaying their output as HTML files.

You, too, will be able to run the RMarkdown file, WFED540Overview_27Aug2015.Rmd, so that you can examine the scope of what we will learn and practice throughout WF ED 540. You receive in a Piazza note a URL with a link for you to download WFED540Overview_27Aug2015.Rmd as well as a link to a video that displays procedures for acquiring and running this RMarkdown file.

Refining the dataset mtcars

To demonstrate processes in R, I will examine mtcars, a dataset with a small number of observations and variables. The data in mtcars were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The dataset, mtcars, has 32 observations (cars) with measures for 11 variables (aspects of design and performance).
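The stated size can be checked directly. The following is a minimal sketch that uses only functions from the base R installation; it is offered as one convenient check, not as the only way to inspect a dataset.

```r
# mtcars ships with the datasets package, which base R attaches by default
dim(mtcars)    # number of rows (cars), then number of columns (variables)
nrow(mtcars)   # observations only
ncol(mtcars)   # variables only
names(mtcars)  # the 11 variable names, starting with mpg
```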

During the first class meeting you are provided with a printed copy of a description and listing of the 32-observation-by-11-variable mtcars dataset to which you can refer throughout this demonstration. A downloadable Adobe PDF file of this description and listing is available.

Entering the name of the dataset in the R console quadrant or in the source quadrant prints the dataset in the console:

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Printing a dataset like mtcars is suitable for viewing in the console. It contains few observations and variables. However, what about datasets that contain hundreds … thousands … hundreds of thousands of observations?

Another approach is to print in the console the first six lines of the dataset using the R function, head.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Or, to print the last six lines, use the R function, tail.

tail(mtcars)

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

But, these printing approaches are not entirely satisfactory when there are many variables in a dataset. Many datasets you will examine in WF ED 540 contain thousands of variables.

We can load an R “package” (a special purpose library of code in R that accomplishes a bundle of computing tasks) called dplyr to provide some order to the mtcars data. We create a more orderly table of the mtcars dataset by executing a function, tbl_df, to transform mtcars into a new table of data, mtcars_df.

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mtcars_df <- tbl_df(mtcars)

Now, a print of mtcars_df shows a more orderly set of columns (variables) and only the first 10 rows (observations…or cars):

mtcars_df

## Source: local data frame [32 x 11]
## 
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## ..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...

Note that the size of the table of data is also printed (32 rows by 11 columns…or, in our case, 32 cars by 11 aspects of performance).
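A note for readers running a newer dplyr than the one used for this report: tbl_df() has since been deprecated in favor of as_tibble(). The sketch below assumes a recent dplyr installation; the object name mtcars_tbl and the column name model are hypothetical choices made for illustration and are not part of the original file.

```r
library(dplyr)

# as_tibble() plays the role of tbl_df(); rownames = "model" keeps the car
# names, which would otherwise be dropped, as an ordinary first column
mtcars_tbl <- as_tibble(mtcars, rownames = "model")

dim(mtcars_tbl)  # 32 rows; 12 columns because of the added model column
mtcars_tbl       # prints only the first 10 rows, as mtcars_df does
```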

We can use the dplyr function glimpse to view variable names, the type of variables (in this case “dbl” meaning numeric data), and a sample of some of the data points in mtcars_df:

glimpse(mtcars_df)

## Observations: 32
## Variables:
## $ mpg  (dbl) 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  (dbl) 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp (dbl) 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   (dbl) 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat (dbl) 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   (dbl) 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec (dbl) 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   (dbl) 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   (dbl) 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear (dbl) 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb (dbl) 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

The dataset, mtcars, and the table created from it, mtcars_df, do not contain any missing data. It is quite common in practical research settings to have data missing for some cases (in mtcars_df, cars) within variables (in mtcars_df, aspects of performance). In WF ED 540 you will learn how to handle missing data in data modeling.
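One quick way to confirm that mtcars has no missing data is sketched here with base R only. The copy with a deleted value, mtcars_na, is a hypothetical object made up purely to show what a missing value does to a mean.

```r
# Count missing values (NA) in each variable; every count is 0 for mtcars
colSums(is.na(mtcars))
anyNA(mtcars)  # FALSE

# Hypothetical copy with one mpg value removed, for illustration only
mtcars_na <- mtcars
mtcars_na$mpg[1] <- NA

mean(mtcars_na$mpg)                # NA: a missing value propagates
mean(mtcars_na$mpg, na.rm = TRUE)  # mean of the 31 remaining cars
```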

Transforming mtcars_df Using dplyr Functions

Rarely do researchers analyze the raw data that they collect. The R package, dplyr, makes reshaping data for analysis a relatively easy task.

It is common to filter a subset of rows in a data table according to criteria (dplyr’s filter function). Or, perhaps to select only some variables (dplyr’s select function).

Often, researchers wish to create new variables from existing variables (dplyr’s mutate function). And, statistical summaries of variables are desired (dplyr’s summarise function; note the spelling!).

The efficient way to use dplyr is to reshape data with a small set of verbs. Among the dplyr verbs you will learn to use to reshape data tables are:

filter
select
mutate
summarise

Provided are some examples of use of these verbs with the dataset, mtcars_df. Compare the transformations made with the raw data shown in the information sheet you received in the first class meeting.

Here is a filter of only the cars that have a straight engine:

filter(mtcars_df, vs == 1)

## Source: local data frame [14 x 11]
## 
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 2  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 3  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 4  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 5  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 6  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 7  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 8  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 9  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 10 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 11 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 12 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 13 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 14 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Notice that 14 of the 32 cars meet this filtering criterion (see “Source: local data frame [14 x 11]”).

The filter verb has two parameters: the dataset to be filtered (mtcars_df); the filtering criterion (vs == 1, where “==” means “the same as”).
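As a sketch of how the criterion argument extends, filter accepts several criteria separated by commas and keeps only the rows that meet all of them. The example uses plain mtcars so that it runs on its own; the object names straight_manual and straight_base are hypothetical.

```r
library(dplyr)

# Cars with a straight engine (vs == 1) AND a manual transmission (am == 1)
straight_manual <- filter(mtcars, vs == 1, am == 1)
nrow(straight_manual)

# The single-criterion filter from the text, written in base R for comparison
straight_base <- mtcars[mtcars$vs == 1, ]
nrow(straight_base)  # 14, matching the filter output above
```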

Suppose I wish to select only mtcars_df variables that contain measures of weight (in pounds/1,000) and number of forward gears:

select(mtcars_df, wt, gear)

## Source: local data frame [32 x 2]
## 
##       wt gear
## 1  2.620    4
## 2  2.875    4
## 3  2.320    4
## 4  3.215    3
## 5  3.440    3
## 6  3.460    3
## 7  3.570    3
## 8  3.190    4
## 9  3.150    4
## 10 3.440    4
## ..   ...  ...

Notice that the selection retains all 32 cars, but only two variables (see “Source: local data frame [32 x 2]”)
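Variables do not have to be named one at a time. As a hedged sketch of two other common forms, select also accepts a range of adjacent columns and a minus sign to drop a column; the object names first_four and no_carb are hypothetical.

```r
library(dplyr)

# A run of adjacent columns, from mpg through hp
first_four <- select(mtcars, mpg:hp)
names(first_four)

# Everything except carb
no_carb <- select(mtcars, -carb)
ncol(no_carb)  # 10
```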

Now, suppose I wish to mutate the mtcars data by calculating a new variable from disp, the engine displacement in cubic inches: displ_l, the displacement in liters:

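# Add displ_l: displacement converted from cubic inches to liters
displace <- mutate(mtcars, displ_l = disp / 61.0237)
# Show the original and converted displacement side by side
select(displace, disp, displ_l)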

##     disp  displ_l
## 1  160.0 2.621932
## 2  160.0 2.621932
## 3  108.0 1.769804
## 4  258.0 4.227866
## 5  360.0 5.899347
## 6  225.0 3.687092
## 7  360.0 5.899347
## 8  146.7 2.403984
## 9  140.8 2.307300
## 10 167.6 2.746474
## 11 167.6 2.746474
## 12 275.8 4.519556
## 13 275.8 4.519556
## 14 275.8 4.519556
## 15 472.0 7.734700
## 16 460.0 7.538055
## 17 440.0 7.210313
## 18  78.7 1.289663
## 19  75.7 1.240502
## 20  71.1 1.165121
## 21 120.1 1.968088
## 22 318.0 5.211090
## 23 304.0 4.981671
## 24 350.0 5.735477
## 25 400.0 6.554830
## 26  79.0 1.294579
## 27 120.3 1.971365
## 28  95.1 1.558411
## 29 351.0 5.751864
## 30 145.0 2.376126
## 31 301.0 4.932510
## 32 121.0 1.982836

I created a new variable, displ_l, by converting cubic inches to liters (there are 61.0237 cubic inches in a liter). I created a new data table, displace, from which I, then, selected only disp and displ_l, the displacement in cubic inches and liters, respectively.
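The conversion is easy to check by hand for the first car, the Mazda RX4, whose displacement is 160 cubic inches. A small base R sketch, in which the name cu_in_per_liter is a hypothetical label for the conversion factor used in the text:

```r
cu_in_per_liter <- 61.0237  # conversion factor used in the text

# Mazda RX4: 160 cubic inches expressed in liters
160 / cu_in_per_liter  # about 2.62, the first displ_l value listed above

# The same conversion applied to the whole column at once
displ_l <- mtcars$disp / cu_in_per_liter
head(round(displ_l, 2))
```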

Last, let’s calculate some summary statistics for a few variables in the new data I created, displace. Calculated are the means for disp and displ_l:

summarise(displace, count=n(), mean_cu_in = mean(disp), mean_ltr=mean(displ_l))

##   count mean_cu_in mean_ltr
## 1    32   230.7219 3.780857

I added the function n(), whose result is stored in the column named count, to return the number of cars in displace. The means are the sum of all disp or displ_l values divided by the number of cars.
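That definition of the mean can be verified in a few lines of base R. This is only a sketch of the arithmetic; n_cars and manual_mean are hypothetical names.

```r
n_cars <- nrow(mtcars)

# Mean displacement computed from its definition: sum divided by count
manual_mean <- sum(mtcars$disp) / n_cars
manual_mean

# Same value as the built-in mean() that summarise called
all.equal(manual_mean, mean(mtcars$disp))
```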

Visualizing mtcars_df Variables Using ggvis

As indicated in a white paper by Fusion Charts,

We visualize information to meet a very basic need – to tell a story. It’s one of the most primitive forms of communication known to man, having its origins in cave drawings dated as early as 30,000 B.C., even before written communication, which emerged in 3,000 B.C. Vision is the single most important faculty we use to communicate information.

An R package, ggvis, provides a grammar for commands that create visualizations of data. First, I load the ggvis library. Next, I load a library called magrittr that allows us to string a series of ggvis commands together (more about this after the plot is displayed from ggvis). Last, I create a scatterplot of car miles per gallon by car weight.

library(ggvis)
library(magrittr)
mtcars_df %>% ggvis(~wt, ~mpg) %>% layer_points()

The magrittr library allows the use of what is called a “pipe” in programming using the “%>%” symbol. When you see “%>%”, read “then.”

So, we can read “mtcars_df %>% ggvis(~wt, ~mpg) %>% layer_points()” as:

Use mtcars_df…THEN
Create a plot of mpg by wt…THEN
Plot the intersection of Cartesian coordinates formed by each wt and mpg pair as a point.
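To make the “then” reading concrete without drawing a plot, here is a minimal sketch using magrittr with base R functions. A piped call x %>% f(y) is evaluated as f(x, y); the object names piped and nested are hypothetical.

```r
library(magrittr)

# Piped: use mtcars, THEN keep its first three rows
piped <- mtcars %>% head(3)

# Nested: the same call written without the pipe
nested <- head(mtcars, 3)

identical(piped, nested)  # TRUE

# Pipes chain: use mtcars, THEN keep rows where am == 1, THEN count the rows
mtcars %>% subset(am == 1) %>% nrow()
```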

Let’s change the plot so that the shape of points is now a diamond, the size is smaller, the color is red, and the diamonds are not filled:

mtcars %>% 
  ggvis(~wt, ~mpg) %>% 
  layer_points(size := 25, shape := "diamond", stroke := "red", fill := NA)

How about adding a smooth line through the points?

mtcars %>% 
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  layer_smooths()

Color the points in the plot by the number of cylinders?

mtcars %>% 
  ggvis(~wt, ~mpg) %>% 
  layer_points(fill = ~factor(cyl))

I am able to change the graph by merely changing simple ggvis properties for the layers of the graph. In WF ED 540, you will learn and practice making visualizations such as scatterplots, bar graphs, line graphs, histograms, regression lines, box plots, and other plot types. Once you get the hang of it, plotting becomes simple because there is a simple grammar for ggvis that specifies the dataset to be plotted, the coordinate system applied to the plot, and the style of points plotted.

Statistical Modeling with Some Variables in mtcars_df

Suppose I am interested in testing the null hypothesis that fuel efficiency, miles/gallon, is equal for cars with automatic and manual transmissions. There are observable differences in our 32-car sample. But, mtcars is just a sample from a population of cars. Can I infer that mileage is the same or different in the population of cars based on this sample of data?

To test this hypothesis, I structure a t-test of the differences between mean mileage by transmission type:

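# Mileage for automatic (am == 0) and manual (am == 1) cars
mpg.at <- mtcars_df$mpg[mtcars_df$am == 0]
mpg.mt <- mtcars_df$mpg[mtcars_df$am == 1]
# Welch two-sample t-test of the difference in mean mileage
t.test(mpg.at, mpg.mt)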

## 
##  Welch Two Sample t-test
## 
## data:  mpg.at and mpg.mt
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

The mean mileage for cars with manual transmissions is approximately 24 miles/gallon, while for automatics it is approximately 17, a difference of about 7 miles/gallon in the sample. According to my analysis, I am 95% confident that the actual mileage difference between automatic and manual transmissions is between 3 and 11 miles per gallon.

If I set the probability of Type 1 error (error made rejecting the null hypothesis when it should not be rejected) to 0.05, I can reject the null hypothesis of “no difference” because the probability is about 0.0014 of observing a t-value at least as extreme as 3.77 in absolute value, with about 18 degrees of freedom, if the null hypothesis is true.
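The numbers quoted in this interpretation can be pulled from the test result rather than read off the printout. This sketch uses the formula interface of t.test, which runs the same Welch test on mpg split by am; the names welch and alpha are hypothetical.

```r
# Welch two-sample t-test of mpg by transmission type (0 = automatic, 1 = manual)
welch <- t.test(mpg ~ am, data = mtcars)

welch$statistic  # t value
welch$parameter  # degrees of freedom
welch$p.value    # probability under the null hypothesis
welch$conf.int   # 95 percent confidence interval for the difference

# Decision rule at the chosen Type 1 error rate
alpha <- 0.05
welch$p.value < alpha  # TRUE, so the null hypothesis is rejected
```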

Hey, now, that’s a mouthful. But, you will not only learn to talk that gibberish in WF ED 540, but you also will learn and practice to understand it!

Now, this whole thing in a graphic:

boxplot(mpg~am, data=mtcars_df, main ='Fig. 1. Fuel Efficiency',
        ylab='Miles per gallon',names=c("Automatic","Manual"),notch=FALSE, col=(c("gold","skyblue")))

OK, now, I test this hypothesis in another way – through linear regression.

Linear regression fits a line of “best fit” through a plot of points formed by the intersection of coordinates formed by values of two variables. The equation of a regression line is y = a + bx, where y is a dependent variable (mpg in our case), x is an independent variable (type of transmission), a is an intercept term, and b is the slope of the “best fit” regression line.

So, in R:

fit_1var <- lm(mtcars$mpg ~ mtcars$am)
summary(fit_1var)$coefficients

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## mtcars$am    7.244939   1.764422  4.106127 2.850207e-04

par(mfrow = c(1,2))
plot(mtcars$am, mtcars$mpg, main = 'Simple Linear Regression: MPG vs Transmission Type')
abline(fit_1var)

The slope of the regression line is approximately 7, which is the change in mpg for a change of one unit in am (the difference between automatic transmissions, coded “0”, and manual transmissions, coded “1”). This matches the difference between the two sample means reported by the t-test.
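A short sketch that makes the link between the two analyses explicit. It assumes the data argument form of lm rather than the mtcars$ form used above, and it adds var.equal = TRUE to t.test. That pooled-variance test is the one whose p-value matches the regression exactly; the Welch test shown earlier does not assume equal variances and so gives a slightly different p-value. All object names here are hypothetical.

```r
fit <- lm(mpg ~ am, data = mtcars)

# Slope for am versus the difference in group means (manual minus automatic)
slope <- coef(fit)[["am"]]
mean_diff <- mean(mtcars$mpg[mtcars$am == 1]) - mean(mtcars$mpg[mtcars$am == 0])
all.equal(slope, mean_diff)  # TRUE

# p-value for the slope versus a pooled-variance t-test
p_slope <- summary(fit)$coefficients["am", "Pr(>|t|)"]
p_pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)$p.value
all.equal(p_slope, p_pooled)  # TRUE
```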

In WF ED 540, you learn and practice the fundamentals of statistical null hypothesis testing and estimation using t-tests, chi-square tests of independence, and simple linear regression.

Explanation and Documentation

This report is an example of explanation of findings of data analysis. Moreover, this report also is an example of documentation through use of RMarkdown.

You will learn and practice explanation and documentation of analyses using RMarkdown to create HTML reports, slide presentations, and PDF files. You will host your HTML reports at RPubs, a free RMarkdown document hosting service.

This document is hosted at http://rpubs.com/davidpassmore/105923.