This example taken from a Stata demo [http://www.stata.com/capabilities/session.html] [Stata Website 9/6/2012]. That was re-written in R [http://rwiki.sciviews.org/doku.php?id=guides:demos:stata_demo_with_r][R Wiki]. It has been modified for use with RStudio.
Here, we load the “foreign” library, which is used to read and write data sets in formats used by SAS, SPSS, Stata and others.
The package “ggplot2” is an alternate to default graphic commands in R. “hmisc” has some utilities that are handy. [Note: lines in the shaded boxes on this page are the actual R code.]
library(foreign)
library(ggplot2)
library(Hmisc)
Alternatively you could load the packages using the point-and-click tools in RStudio
Thousands of user written commands are available in R. For more information visit the [http://cran.r-project.org/web/packages/][R-Project website]
and check out the CRAN TASK VIEWS.
The dataset we will use for this session is about vintage 1978 automobiles sold in the United States.
The next line downloads the “auto” dataset from the Stata website, converts it from Stata to R and stores it in a temporary R dataframe named “auto”.
auto <- read.dta("http://www.stata-press.com/data/r9/auto.dta")
In the UR pane of the RStudio screen you will see the “auto” dataframe listed with 74 obs of 12 variables. Click on it and a data browser will appear in the UL pane.
The History tab in the UR pane keeps track of commands R has run, successfully and unsuccessfully. The commands can then easily be rerun. After opening the auto dataframe, you can see that the command to browse the data is: View(auto)
Note: If a command intrigues you, you can precede any command with a question mark on the command line in Console pane to find help e.g. ?View This will open the help tab in the LR quadrant. Or you can begin typing the command and press 'tab' to view syntax.
We can get more information about the structure of the “auto” dataframe including the type of variables and general information on the dataframe itself.
str(auto)
## 'data.frame': 74 obs. of 12 variables:
## $ make : chr "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
## $ price : int 4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
## $ mpg : int 22 17 22 20 15 18 26 20 16 19 ...
## $ rep78 : int 3 3 NA 3 4 3 NA 3 3 3 ...
## $ headroom : num 2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
## $ trunk : int 11 11 12 16 20 21 10 16 17 13 ...
## $ weight : int 2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
## $ length : int 186 173 168 196 222 218 170 200 207 200 ...
## $ turn : int 40 40 35 40 43 43 34 42 43 42 ...
## $ displacement: int 121 258 121 196 350 231 304 196 231 231 ...
## $ gear_ratio : num 3.58 2.53 3.08 2.93 2.41 ...
## $ foreign : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "datalabel")= chr "1978 Automobile Data"
## - attr(*, "time.stamp")= chr "13 Apr 2005 17:45"
## - attr(*, "formats")= chr "%-18s" "%8.0gc" "%8.0g" "%8.0g" ...
## - attr(*, "types")= int 18 252 252 252 254 252 252 252 252 252 ...
## - attr(*, "val.labels")= chr "" "" "" "" ...
## - attr(*, "var.labels")= chr "Make and Model" "Price" "Mileage (mpg)" "Repair Record 1978" ...
## - attr(*, "version")= int 8
## - attr(*, "label.table")=List of 1
## ..$ origin: Named num 0 1
## .. ..- attr(*, "names")= chr "Domestic" "Foreign"
To further examine a particular variable we can ask for specific attributes of a variable by name. R can hold multiple data sets in memory at the same time, so each reference to a variable needs to specify which dataset it belongs to. When running a command that references many columns, repeating the name of the data set can become quite tedious.
However, if you are using RStudio, you can type the first letter(s) of a column name and press the 'tab' key to select the column name from a list.
str(auto$foreign)
## Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
class(auto$foreign)
## [1] "factor"
length(auto$foreign)
## [1] 74
An alternative way of avoiding repetition is to use the attach() function. Attach tells R that variables names will refer to the specified dataframe (until onecalls detach()). The attraction of this approach is that it cuts down even more on typing in the name of the data set and makes many commands more similar to how they look in Stata. A drawback is that if one is making changes to the dataset, it is easy to get confused if it is the old or new data set that is attached. Most R afficianadoes will advise against the use of attach, but it is certainly up to you.
summary(auto)
## make price mpg rep78
## Length:74 Min. : 3291 Min. :12.0 Min. :1.00
## Class :character 1st Qu.: 4220 1st Qu.:18.0 1st Qu.:3.00
## Mode :character Median : 5006 Median :20.0 Median :3.00
## Mean : 6165 Mean :21.3 Mean :3.41
## 3rd Qu.: 6332 3rd Qu.:24.8 3rd Qu.:4.00
## Max. :15906 Max. :41.0 Max. :5.00
## NA's :5
## headroom trunk weight length turn
## Min. :1.50 Min. : 5.0 Min. :1760 Min. :142 Min. :31.0
## 1st Qu.:2.50 1st Qu.:10.2 1st Qu.:2250 1st Qu.:170 1st Qu.:36.0
## Median :3.00 Median :14.0 Median :3190 Median :192 Median :40.0
## Mean :2.99 Mean :13.8 Mean :3019 Mean :188 Mean :39.6
## 3rd Qu.:3.50 3rd Qu.:16.8 3rd Qu.:3600 3rd Qu.:204 3rd Qu.:43.0
## Max. :5.00 Max. :23.0 Max. :4840 Max. :233 Max. :51.0
##
## displacement gear_ratio foreign
## Min. : 79 Min. :2.19 Domestic:52
## 1st Qu.:119 1st Qu.:2.73 Foreign :22
## Median :196 Median :2.96
## Mean :197 Mean :3.02
## 3rd Qu.:245 3rd Qu.:3.35
## Max. :425 Max. :3.89
##
describe(auto)
## auto
##
## 12 Variables 74 Observations
## ---------------------------------------------------------------------------
## make
## n missing unique
## 74 0 74
##
## lowest : AMC Concord AMC Pacer AMC Spirit Audi 5000 Audi Fox
## highest: Volvo 260 VW Dasher VW Diesel VW Rabbit VW Scirocco
## ---------------------------------------------------------------------------
## price
## n missing unique Mean .05 .10 .25 .50 .75
## 74 0 74 6165 3780 3913 4220 5006 6332
## .90 .95
## 11081 13157
##
## lowest : 3291 3299 3667 3748 3798
## highest: 12990 13466 13594 14500 15906
## ---------------------------------------------------------------------------
## mpg
## n missing unique Mean .05 .10 .25 .50 .75
## 74 0 21 21.3 14.00 14.30 18.00 20.00 24.75
## .90 .95
## 28.70 32.05
##
## lowest : 12 14 15 16 17, highest: 30 31 34 35 41
## ---------------------------------------------------------------------------
## rep78
## n missing unique Mean
## 69 5 5 3.406
##
## 1 2 3 4 5
## Frequency 2 8 30 18 11
## % 3 12 43 26 16
## ---------------------------------------------------------------------------
## headroom
## n missing unique Mean
## 74 0 8 2.993
##
## 1.5 2 2.5 3 3.5 4 4.5 5
## Frequency 4 13 14 13 15 10 4 1
## % 5 18 19 18 20 14 5 1
## ---------------------------------------------------------------------------
## trunk
## n missing unique Mean .05 .10 .25 .50 .75
## 74 0 18 13.76 7.00 8.00 10.25 14.00 16.75
## .90 .95
## 20.00 20.35
##
## 5 6 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 23
## Frequency 1 1 3 5 4 5 8 3 4 4 5 12 8 1 6 2 1 1
## % 1 1 4 7 5 7 11 4 5 5 7 16 11 1 8 3 1 1
## ---------------------------------------------------------------------------
## weight
## n missing unique Mean .05 .10 .25 .50 .75
## 74 0 64 3019 1895 2026 2250 3190 3600
## .90 .95
## 4051 4186
##
## lowest : 1760 1800 1830 1930 1980, highest: 4130 4290 4330 4720 4840
## ---------------------------------------------------------------------------
## length
## n missing unique Mean .05 .10 .25 .50 .75
## 74 0 47 187.9 154.7 158.2 170.0 192.5 203.8
## .90 .95
## 218.0 221.0
##
## lowest : 142 147 149 154 155, highest: 220 221 222 230 233
## ---------------------------------------------------------------------------
## turn
## n missing unique Mean .05 .10 .25 .50 .75
## 74 0 18 39.65 33.65 34.00 36.00 40.00 43.00
## .90 .95
## 45.00 46.00
##
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 48 51
## Frequency 1 1 2 6 6 9 4 3 1 6 4 7 12 3 3 3 2 1
## % 1 1 3 8 8 12 5 4 1 8 5 9 16 4 4 4 3 1
## ---------------------------------------------------------------------------
## displacement
## n missing unique Mean .05 .10 .25 .50 .75
## 74 0 31 197.3 87.95 97.00 119.00 196.00 245.25
## .90 .95
## 340.40 350.00
##
## lowest : 79 85 86 89 90, highest: 304 318 350 400 425
## ---------------------------------------------------------------------------
## gear_ratio
## n missing unique Mean .05 .10 .25 .50 .75
## 74 0 36 3.015 2.365 2.442 2.730 2.955 3.352
## .90 .95
## 3.714 3.780
##
## lowest : 2.19 2.24 2.26 2.28 2.41, highest: 3.73 3.74 3.78 3.81 3.89
## ---------------------------------------------------------------------------
## foreign
## n missing unique
## 74 0 2
##
## Domestic (52, 70%), Foreign (22, 30%)
## ---------------------------------------------------------------------------
List the value of the columns “make” and “mpg” for the first ten lines of the data set. Data sets in R are referenced using matrix notation, either using row/column numbers, or using names.
auto[1:10, c("make", "mpg")]
## make mpg
## 1 AMC Concord 22
## 2 AMC Pacer 17
## 3 AMC Spirit 22
## 4 Buick Century 20
## 5 Buick Electra 15
## 6 Buick LeSabre 18
## 7 Buick Opel 26
## 8 Buick Regal 20
## 9 Buick Riviera 16
## 10 Buick Skylark 19
auto$foreign
## [1] Domestic Domestic Domestic Domestic Domestic Domestic Domestic
## [8] Domestic Domestic Domestic Domestic Domestic Domestic Domestic
## [15] Domestic Domestic Domestic Domestic Domestic Domestic Domestic
## [22] Domestic Domestic Domestic Domestic Domestic Domestic Domestic
## [29] Domestic Domestic Domestic Domestic Domestic Domestic Domestic
## [36] Domestic Domestic Domestic Domestic Domestic Domestic Domestic
## [43] Domestic Domestic Domestic Domestic Domestic Domestic Domestic
## [50] Domestic Domestic Domestic Foreign Foreign Foreign Foreign
## [57] Foreign Foreign Foreign Foreign Foreign Foreign Foreign
## [64] Foreign Foreign Foreign Foreign Foreign Foreign Foreign
## [71] Foreign Foreign Foreign Foreign
## Levels: Domestic Foreign
as.numeric(auto$foreign)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2
Look more closely at the factor variable 'foreign' by copying it into a new variable
x <- auto$foreign
In the UR pane click on x to examine its attributes more closely.
Missing values in R are represented by 'NA' (value is Not Available). Unlike other stat packages, missing values in R are not less than or greater than other values, they simple aren't.
There are lots of useful 'na' functions to deal with them though.
Here we can see which rows have a missing value for repair record (rep78).
auto[is.na(auto$rep78), ]
## make price mpg rep78 headroom trunk weight length turn
## 3 AMC Spirit 3799 22 NA 3.0 12 2640 168 35
## 7 Buick Opel 4453 26 NA 3.0 10 2230 170 34
## 45 Plym. Sapporo 6486 26 NA 1.5 8 2520 182 38
## 51 Pont. Phoenix 4424 19 NA 3.5 13 3420 203 43
## 64 Peugeot 604 12990 14 NA 3.5 14 3420 192 38
## displacement gear_ratio foreign
## 3 121 3.08 Domestic
## 7 304 2.87 Domestic
## 45 119 3.54 Domestic
## 51 231 3.08 Domestic
## 64 163 3.58 Foreign
By tabulating it and specifying useNA = “ifany”, we can see that it contains some missing values, indicated by NA (without useNA, missingvalues are silently omitted).
table(auto$rep78, useNA = "ifany")
##
## 1 2 3 4 5 <NA>
## 2 8 30 18 11 5
table(auto$rep78)
##
## 1 2 3 4 5
## 2 8 30 18 11
Here, the summary is genereated first for all cars, then for cars where MPG is below mean, and finally for cars where MPG is above mean. The results show that the more fuel-efficient cars cost less on average (they are likely to be smaller).
summary(auto$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3290 4220 5010 6170 6330 15900
summary(auto$price[auto$mpg < mean(auto$mpg)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3290 4730 5720 7090 9250 15900
summary(auto$price[auto$mpg >= mean(auto$mpg)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3300 3990 4480 4880 5250 9740
The with() command can be used to shorten the typing of the summary command above. The parameters to this function are the dataframe name followed by the R command. Any variables are assumed to be in this dataframe.
with(auto, summary(price[mpg >= mean(mpg)]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3300 3990 4480 4880 5250 9740
The by() command can be used to execute commands on subsets of data, similarly to the by option in Stata. The last parameter can be the name of any function that can handle subsets of the first parameter as arguments
by(auto[, c("price", "mpg")], auto$foreign, summary)
## auto$foreign: Domestic
## price mpg
## Min. : 3291 Min. :12.0
## 1st Qu.: 4186 1st Qu.:16.8
## Median : 4782 Median :19.0
## Mean : 6072 Mean :19.8
## 3rd Qu.: 6200 3rd Qu.:22.0
## Max. :15906 Max. :34.0
## --------------------------------------------------------
## auto$foreign: Foreign
## price mpg
## Min. : 3748 Min. :14.0
## 1st Qu.: 4522 1st Qu.:21.0
## Median : 5759 Median :24.5
## Mean : 6385 Mean :24.8
## 3rd Qu.: 7068 3rd Qu.:27.5
## Max. :12990 Max. :41.0
Other packages provide additional bells and whistles to standard R commands. For example, the epicalc package has a 'tab1' command that generates a table and histogram in one shot.
library(epicalc)
tab1(auto$rep78)
## auto$rep78 :
## Frequency %(NA+) %(NA-)
## 1 2 2.7 2.9
## 2 8 10.8 11.6
## 3 30 40.5 43.5
## 4 18 24.3 26.1
## 5 11 14.9 15.9
## <NA> 5 6.8 0.0
## Total 74 100.0 100.0
tableStack(headroom:length, by = foreign, dataFrame = auto)
## Domestic Foreign Test stat.
## Total 52 22
##
## Headroom (in.) Ranksum test
## median(IQR) 3.5 (2.4,4) 2.5 (2.5,3)
##
## Trunk space (cu. ft.) t-test (72 df) = 3.27
## mean(SD) 14.8 (4.3) 11.4 (3.2)
##
## Weight (lbs.) t-test (72 df) = 6.25
## mean(SD) 3317.1 (695.4) 2315.9 (433)
##
## Length (in.) t-test (72 df) = 5.89
## mean(SD) 196.1 (20) 168.5 (13.7)
##
## P value
## Total
##
## Headroom (in.) 0.011
## median(IQR)
##
## Trunk space (cu. ft.) 0.002
## mean(SD)
##
## Weight (lbs.) < 0.001
## mean(SD)
##
## Length (in.) < 0.001
## mean(SD)
##
detach("package:epicalc")
Below, we run a t-test to examine if the average gas mileage is higher or lower for foreign cars compared to domestic cars. To run a t test using a dummy variable (rather than comparing two columns), one must use a “formula”, which is a general way to represent variable relationships, and used throughout the statistical procedures in R.
with(auto, t.test(mpg ~ foreign, var.equal = TRUE))
##
## Two Sample t-test
##
## data: mpg by foreign
## t = -3.631, df = 72, p-value = 0.0005254
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.661 -2.230
## sample estimates:
## mean in group Domestic mean in group Foreign
## 19.83 24.77
##
Here we examine the correlation between mpg and weight. The two function calls pass the same data to the cor() function. However, in the first case the arguments are two vectors, while in the second case we pass a single argument of type data.frame. This has an effect on the output.
cor(auto$mpg, auto$weight)
## [1] -0.8072
cor(auto[, c("weight", "mpg", "length", "turn", "displacement")])
## weight mpg length turn displacement
## weight 1.0000 -0.8072 0.9460 0.8574 0.8949
## mpg -0.8072 1.0000 -0.7958 -0.7192 -0.7056
## length 0.9460 -0.7958 1.0000 0.8643 0.8351
## turn 0.8574 -0.7192 0.8643 1.0000 0.7768
## displacement 0.8949 -0.7056 0.8351 0.7768 1.0000
We can also use by() to generate separate correlations for foreign and domestic cars, and we conclude the correlation exploration by examining a correlation table of a number of variables
by(auto[, c("weight", "mpg")], auto$foreign, cor)
## auto$foreign: Domestic
## weight mpg
## weight 1.0000 -0.8759
## mpg -0.8759 1.0000
## --------------------------------------------------------
## auto$foreign: Foreign
## weight mpg
## weight 1.0000 -0.6829
## mpg -0.6829 1.0000
A plot of mileage by weight shows the expected, decreasing, relationship.
qplot(mpg, weight, data = auto)
We can look at this relationship for foreign vs domestic vehicles by adding 'facets'.
qplot(weight, mpg, data = auto, facets = . ~ foreign)
qplot(weight, mpg, data = auto, color = foreign, main = "Total")
We want to model the relationship between MPG and weight, and we will estimate a linear regression model using the lm() command.
Based on the graphs, we judge the relationship nonlinear and will model MPG as a quadratic in weight. Also, based on the graphs, we judge the relationship to be different for domestic and foreign cars. We will include an indicator (dummy) variable for foreign and evaluate afterwards whether this adequately describes the difference. Thus, we will fit the model:
\[ mpg = \beta_0 + \beta_1 weight + \beta_2 weight^2 + \beta_3 foreign + \hat{l} \]
Note that the lm() command returns an object containing the regression results, so we will need to assign the results to a variable and call summary() on them. (If we forget to assign the results to a variable, the default printing method will be used to output the estimated coefficients, but that format is typically not what we want.
mod1 <- lm(mpg ~ weight + I(weight^2) + foreign, auto)
summary(mod1)
##
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + foreign, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.636 -1.899 -0.302 0.810 13.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.65e+01 6.20e+00 9.12 1.6e-13 ***
## weight -1.66e-02 3.97e-03 -4.18 8.4e-05 ***
## I(weight^2) 1.59e-06 6.25e-07 2.55 0.013 *
## foreignForeign -2.20e+00 1.06e+00 -2.08 0.041 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.28 on 70 degrees of freedom
## Multiple R-squared: 0.691, Adjusted R-squared: 0.678
## F-statistic: 52.3 on 3 and 70 DF, p-value: <2e-16
##
In RStudio, you can examine the mod1 object. It holds lots of information that you can access directly. We can pull the predicted (fitted) values from the lm results in mod1 and assign them to a column in the data set.
auto$mpghat <- predict(mod1)
Now plot
qplot(weight, mpg, data = auto, facets = . ~ foreign, geom = c("point")) +
geom_line(data = auto, aes(weight, mpghat, color = "Mpghat")) + scale_colour_manual("Mileage -- Fitted values",
breaks = "Mpghat", values = "red") + opts(legend.position = "bottom")
Now, we show our results to an engineer. “No,” he says. “It should take twice as much energy to move 2,000 pounds 1 mile compared with moving 1,000 pounds, and therefore twice as much gasoline. Miles per gallon is not quadratic in weight; gallons per mile is a linear in weight.”
We go back to the computer to estimate the “engineerically correct” model. First we do a quick plot of gpm vs weight.
auto$gpm <- 1/auto$mpg
qplot(gpm, weight, data = auto, color = foreign, xlab = {
"Gallons per mile (gpm)"
}, geom = c("point", "smooth"))
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
Satisfied that the engineer is indeed correct, we rerun the regression.
mod2 <- lm(gpm ~ weight + foreign, data = auto)
summary(mod2)
##
## Call:
## lm(formula = gpm ~ weight + foreign, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.021376 -0.002258 0.000529 0.004324 0.010898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.35e-04 4.02e-03 -0.18 0.8555
## weight 1.62e-05 1.18e-06 13.74 <2e-16 ***
## foreignForeign 6.22e-03 2.00e-03 3.11 0.0027 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00632 on 71 degrees of freedom
## Multiple R-squared: 0.762, Adjusted R-squared: 0.756
## F-statistic: 114 on 2 and 71 DF, p-value: <2e-16
##
Lo and behold, after specifying the model correctly, we find that foreign cars in 1978 were actually less efficient. Foreign cars may have yielded better gas mileage than domestic cars in 1978, but this was only because they were so light. Long live American manufacturing!
Use the Project menu in RStudio to create separate projects for your work. RStudio will save everything in your workspace and command history so you can pick-up exactly where you left off.