Setup

library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

4.1. A student evaluation of a teacher is on a 1-5 Likert scale. Suppose the answers to the first 3 questions are given in this table.

Enter the data for questions 1 and 2 using c(), scan(), read.table(), or data.entry().

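# Answers to Q1, Q2, and Q3 (10 students each), in that order
data=c(3,3,3,4,3,4,3,4,3,4,5,2,5,5,2,2,5,5,4,2,1,3,1,1,1,3,1,1,1,1)
q=c(rep("Q1",10), rep("Q2", 10), rep("Q3", 10))
# Store the scores as a factor with levels 1-5 so that table() also reports zero counts;
# cbind() would coerce both columns to character
mydata=data.frame(data=factor(data, levels=1:5), q=q)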

1. Make a table of the results of question 1 and question 2 separately.

table(mydata$data[1:10])
## 
## 1 2 3 4 5 
## 0 0 6 4 0
table(mydata$data[11:20])
## 
## 1 2 3 4 5 
## 0 4 0 1 5
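
The zero counts in these tables appear only when the scores are stored as a factor whose levels cover the whole 1-5 scale. A minimal sketch of the difference, using a vector named q1 (a name introduced only for this illustration) that repeats the question 1 answers:

```r
# Question 1 answers, copied from the data above (q1 is used only in this sketch)
q1 <- c(3, 3, 3, 4, 3, 4, 3, 4, 3, 4)

# Plain numeric vector: table() lists only the values that actually occur
counts_numeric <- table(q1)

# Factor with explicit levels: table() also reports unused levels with count 0
counts_factor <- table(factor(q1, levels = 1:5))

print(counts_numeric)
print(counts_factor)
```

Keeping the unused levels makes the tables for different questions directly comparable, because every table has the same five columns.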

2. Make a contingency table of questions 1 and 2.

table(mydata$data[1:10], mydata$data[11:20])
##    
##     1 2 3 4 5
##   1 0 0 0 0 0
##   2 0 0 0 0 0
##   3 0 2 0 1 3
##   4 0 2 0 0 2
##   5 0 0 0 0 0

3. Make a stacked barplot of questions 2 and 3.

barplot(table(mydata$data[11:20], mydata$data[21:30]), main="Barplot Q2 vs. Q3", col=1:5)

4. Make a side-by-side barplot of all three questions.

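# One group of bars per score (1-5), with one bar per question inside each group
barplot(table(q, data), beside=TRUE, legend.text=TRUE, main="Side-by-side barplot of Q1, Q2, and Q3")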

4.9. The built-in data set mtcars contains information about cars from a 1974 Motor Trend issue. Load the data set (data(mtcars)) and try to answer the following:

data(mtcars)

1. What are the variable names? (Try names.)

names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

2. What is the maximum mpg?

max(mtcars$mpg)
## [1] 33.9

3. Which car has this?

mtcars[which.max(mtcars$mpg),]
##                 mpg cyl disp hp drat    wt qsec vs am gear carb
## Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1

4. What are the first five cars listed?

head(mtcars, n=5)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

5. What horsepower (hp) does the “Valiant” have?

mtcars["Valiant","hp"]
## [1] 105

6. What are all the values for the Mercedes 450slc (Merc 450SLC)?

mtcars["Merc 450SLC",]
##              mpg cyl  disp  hp drat   wt qsec vs am gear carb
## Merc 450SLC 15.2   8 275.8 180 3.07 3.78   18  0  0    3    3

7. Make a scatterplot of cylinders (cyl) vs. miles per gallon (mpg). Fit a regression line. Is this a good candidate for linear regression?

ggplot(data = mtcars) +
         geom_point(mapping = aes(x = cyl, y = mpg)) +
         geom_smooth(mapping = aes(x = cyl, y = mpg), method = "lm")

cor(mtcars$cyl, mtcars$mpg)^2
## [1] 0.72618

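Yes, it is a reasonable candidate for linear regression: mpg falls steadily as cyl increases, and cyl alone explains about 73% of the variance in mpg (R² = 0.726). One caveat is that cyl takes only three values (4, 6, and 8), so the scatterplot shows three vertical bands rather than a continuous cloud of points.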
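
One way to check the fit beyond the squared correlation is to fit the model with lm() and look at the residuals. This is a sketch of that check, not part of the original exercise; the names fit and r2 are introduced here for illustration.

```r
# Fit mpg as a linear function of cyl on the built-in mtcars data
fit <- lm(mpg ~ cyl, data = mtcars)

# R-squared of the fit; for simple regression this equals cor(cyl, mpg)^2
r2 <- summary(fit)$r.squared
print(r2)

# Residuals vs. fitted values: look for curvature or unequal spread
plot(fitted(fit), resid(fit),
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs. fitted (mpg ~ cyl)")
abline(h = 0, lty = 2)
```

If the residuals scatter evenly around the dashed line in all three cylinder groups, the straight-line model is probably adequate; a clear pattern would suggest treating cyl as a categorical predictor instead.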

4. 3.3.1 Exercise 2

Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

?ggplot2::mpg
## starting httpd help server ... done
ggplot2::mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    cla~
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
##  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     com~
##  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     com~
##  3 audi         a4      2    2008     4 manu~ f        20    31 p     com~
##  4 audi         a4      2    2008     4 auto~ f        21    30 p     com~
##  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     com~
##  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     com~
##  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     com~
##  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     com~
##  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     com~
## 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     com~
## # ... with 224 more rows

The categorical variables are: manufacturer, model, year, cyl, trans, drv, fl, and class. The continuous variables are: displ, cty, and hwy. You can see this information by running mpg and reading the type abbreviation printed under each column name. Generally, chr variables are categorical; dbl and int variables may be continuous, but you need additional context to decide, such as the variable definitions in ?mpg. For example, year and cyl are stored as int but take only a few distinct values.
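
The same information can be pulled out programmatically. The sketch below lists the storage class and the number of distinct values for each column; col_classes and n_distinct are names introduced here, and the idea that few distinct values suggests a categorical variable is a rule of thumb, not a definition.

```r
library(ggplot2)

# Storage class of every column in mpg
col_classes <- vapply(mpg, function(col) class(col)[1], character(1))
print(col_classes)

# Number of distinct values per column; a small number hints at a categorical variable
n_distinct <- vapply(mpg, function(col) length(unique(col)), integer(1))
print(n_distinct)
```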

5. 3.5.1 Exercise 3

What plots does the following code make? What does . do?

ggplot(data = ggplot2::mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = ggplot2::mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

They make scatterplots of hwy against displ, split into facets by either drv or cyl. The “.” is a placeholder that means “do not facet along this dimension.” With drv ~ ., the facets are rows stacked one on top of the other; with . ~ cyl, the facets are columns placed side by side.
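
For comparison, here is a sketch that uses both sides of the formula at once, so no “.” placeholder is needed. The object name p is introduced only for this example.

```r
library(ggplot2)

# Rows are split by drv and columns by cyl
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
print(p)
```

Replacing either side of the formula with “.” collapses that dimension and reproduces one of the two plots above.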

6. 3.6.1. Exercise 2

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

This will be a scatterplot with displ on the x axis and hwy on the y, and each point will have a color based on which drv it is. On top of the plot there will be a smooth line fitted to the data.

ggplot(data = ggplot2::mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Actual result: as predicted, except that there were three smoothed lines, one per drv value, each colored by drv.
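
The three lines appear because the color mapping is set in ggplot(), so every layer inherits it and geom_smooth() fits one curve per drv group. A sketch of how to get a single smooth line instead, by mapping color only in the point layer (p_single is a name introduced for this example):

```r
library(ggplot2)

# color is mapped only inside geom_point(), so geom_smooth() sees one group
p_single <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(se = FALSE)
print(p_single)
```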

7. 3.7.1. Exercise 1

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.ymin = min,
                  fun.ymax = max,
                  fun.y = median)

The default geom is pointrange.
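
In ggplot2 3.3.0 and later, the arguments fun.y, fun.ymin, and fun.ymax are deprecated in favor of fun, fun.min, and fun.max. Assuming a recent ggplot2 version, the same plot can be sketched as follows (p_range is a name introduced for this example):

```r
library(ggplot2)

# Same summary as above, written with the argument names used since ggplot2 3.3.0
p_range <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
print(p_range)
```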

8. 3.8.1. Exercise 1

What is the problem with this plot? How could you improve it?

ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

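The plot shows fewer points than there are observations because many points overlap: cty and hwy are rounded to whole numbers, so several cars share the same coordinates. You can improve the visualization by replacing geom_point() with geom_jitter(), which adds a small amount of random noise to each point.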

ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()
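
The amount of noise can be controlled explicitly. The sketch below uses position_jitter() with a fixed seed so the plot is reproducible, and shows geom_count() as an alternative that adds no noise. The width and height of 0.3 are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Small, reproducible jitter: points move at most 0.3 units in each direction
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 1))
print(p_jitter)

# Alternative without random noise: point size shows how many observations overlap
p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
print(p_count)
```

Jitter keeps every observation visible but moves points slightly off their true values; geom_count() keeps the exact positions and encodes the overlap in point size.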

9. 3.9.1 Exercise 4

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

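City and highway mpg are positively related to each other. coord_fixed() is important because both axes measure the same quantity (mpg): it forces one unit on the x axis to have the same length as one unit on the y axis, so the reference line is drawn at 45 degrees. With no arguments, geom_abline() draws the line with intercept 0 and slope 1, that is, hwy = cty. Points above the line are cars that get better mileage on the highway than in the city.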
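
To make the role of the reference line concrete, this sketch writes out the default intercept and slope and computes the share of cars that fall above the line. The names p_ref and share_above are introduced here for illustration.

```r
library(ggplot2)

# Same plot with the default reference line written out: hwy = 0 + 1 * cty
p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  coord_fixed()
print(p_ref)

# Share of cars whose highway mpg exceeds their city mpg (points above the line)
share_above <- mean(mpg$hwy > mpg$cty)
print(share_above)
```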

10. Load two data.frames by using the code below.

id=c(1,2,3,4,5)
age=c(31,42,51,55,70)
gender=c(0,0,1,1,1)
mydata1=data.frame(cbind(id,age))
colnames(mydata1)=c("id", "age")
mydata2=data.frame(cbind(id,gender))
# Name the columns of mydata2 (not mydata1), so that age keeps its label
colnames(mydata2)=c("id", "gender")

Now, use the merge command to generate a new data.frame that is linked based on ‘id.’ R supports inner / outer joins without a problem but is even friendlier when loading the sqlr package.

merge(mydata1, mydata2, by="id")
##   id age gender
## 1  1  31      0
## 2  2  42      0
## 3  3  51      1
## 4  4  55      1
## 5  5  70      1
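
Because every id appears in both data frames above, the inner and outer joins give the same rows. The difference only shows when the keys partly overlap. A minimal sketch with two hypothetical tables (left and right are made up for this illustration and are not part of the exercise data):

```r
# Two small hypothetical tables whose id values only partly overlap
left  <- data.frame(id = c(1, 2, 3), age = c(31, 42, 51))
right <- data.frame(id = c(2, 3, 4), gender = c(0, 1, 1))

# Inner join (default): only ids present in both tables (2 and 3)
inner <- merge(left, right, by = "id")

# Full outer join: ids present in either table (1 to 4), with NA where data is missing
full <- merge(left, right, by = "id", all = TRUE)

# Left outer join: every id from the left table (1 to 3)
left_join_result <- merge(left, right, by = "id", all.x = TRUE)

print(inner)
print(full)
print(left_join_result)
```

The all, all.x, and all.y arguments of merge() select the join type; all.y = TRUE would give the corresponding right outer join.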