Week 2 R Assignment

Setup

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

4.1. A student evaluation of a teacher uses a 1-5 Likert scale. Suppose the answers to the first 3 questions are given in this table.

Enter the data for questions 1 and 2 using c(), scan(), read.table(), or data.entry().

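# Responses to Q1, Q2 and Q3 (10 each), in that order
data=c(3,3,3,4,3,4,3,4,3,4,5,2,5,5,2,2,5,5,4,2,1,3,1,1,1,3,1,1,1,1)
q=c(rep("Q1",10), rep("Q2", 10), rep("Q3", 10))
# Store responses as a factor with all five scale levels so unused ratings are tabulated as 0
mydata=data.frame(data=factor(data, levels=1:5), q)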

1. Make a table of the results of question 1 and question 2 separately.

table(mydata$data[1:10])

## 
## 1 2 3 4 5 
## 0 0 6 4 0

table(mydata$data[11:20])

## 
## 1 2 3 4 5 
## 0 4 0 1 5
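
The two tables above can also be produced in one step by cross-tabulating question against response. A minimal sketch, assuming the same 30 responses; the names responses, question and counts are illustrative and not part of the assignment:

```r
# The same 30 responses, 10 per question, in Q1, Q2, Q3 order
responses <- c(3, 3, 3, 4, 3, 4, 3, 4, 3, 4,
               5, 2, 5, 5, 2, 2, 5, 5, 4, 2,
               1, 3, 1, 1, 1, 3, 1, 1, 1, 1)
question <- rep(c("Q1", "Q2", "Q3"), each = 10)

# Keep all five scale levels so unused ratings show up as zero counts
counts <- table(question, rating = factor(responses, levels = 1:5))

counts["Q1", ]  # counts for question 1
counts["Q2", ]  # counts for question 2
```

Indexing by question label avoids relying on the row positions 1:10 and 11:20.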

2. Make a contingency table of questions 1 and 2.

table(mydata$data[1:10], mydata$data[11:20])

##    
##     1 2 3 4 5
##   1 0 0 0 0 0
##   2 0 0 0 0 0
##   3 0 2 0 1 3
##   4 0 2 0 0 2
##   5 0 0 0 0 0

3. Make a stacked barplot of questions 2 and 3.

barplot(table(mydata$data[11:20], mydata$data[21:30]), main="Barplot Q2 vs. Q3", xlab="Q3 rating", col=1:5, legend.text=TRUE)

4. Make a side-by-side barplot of all three questions.

barplot(table(q, data), beside=TRUE, legend.text=TRUE, main="Responses by question", xlab="Rating")

4.2. In the library MASS is a dataset UScereal which contains information about popular breakfast cereals. Attach the data set as follows:

library('MASS')

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:plotly':
## 
##     select

data('UScereal')
attach(UScereal)
names(UScereal)

##  [1] "mfr"       "calories"  "protein"   "fat"       "sodium"   
##  [6] "fibre"     "carbo"     "sugars"    "shelf"     "potassium"
## [11] "vitamins"

Now, investigate the following relationships, and make comments on what you see. You can use tables, barplots, scatterplots, etc. to do your investigation.

1. The relationship between manufacturer and shelf.

table(mfr, shelf)

##    shelf
## mfr  1  2  3
##   G  6  7  9
##   K  4  7 10
##   N  2  0  1
##   P  2  1  6
##   Q  0  3  2
##   R  4  0  1

The manufacturers are distributed differently across the shelves. General Mills and Kellogg’s appear on all three shelves, with somewhat more brands on the top shelf than on the bottom. Nabisco is never on the middle shelf; it is on either the bottom or the top. Post tends to be on the top shelf. Quaker is never on the bottom shelf. Ralston Purina is never on the middle shelf and is most often on the bottom.
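
Because the manufacturers have different numbers of cereals, row proportions can make the comparison easier than raw counts. A small sketch in base R; shelf_counts and shelf_share are illustrative names:

```r
library(MASS)  # provides the UScereal data set

# Counts of cereals by manufacturer (rows) and shelf (columns)
shelf_counts <- table(UScereal$mfr, UScereal$shelf)

# Share of each manufacturer's cereals on each shelf; every row sums to 1
shelf_share <- prop.table(shelf_counts, margin = 1)
round(shelf_share, 2)
```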

2. The relationship between fat and vitamins.

ggplot(data = UScereal) +
         geom_point(mapping = aes(x = fat, y = vitamins), position = "jitter")

The jittered scatterplot shows no clear relationship between fat and vitamins.

3. The relationship between fat and shelf.

ggplot(data = UScereal) +
         geom_point(mapping = aes(x = fat, y = shelf), position = "jitter") +
         geom_smooth(mapping = aes(x = fat, y = shelf))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Fattier cereals tend to be placed on higher shelves.
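
Because shelf takes only the values 1, 2 and 3, a per-shelf summary of fat may be easier to read than a smoothed line. A sketch in base R, with illustrative names fat_median and fat_mean:

```r
library(MASS)  # provides the UScereal data set

# Median and mean fat content of the cereals on each shelf
fat_median <- tapply(UScereal$fat, UScereal$shelf, median)
fat_mean <- tapply(UScereal$fat, UScereal$shelf, mean)

# One column per shelf, one row per summary statistic
rbind(median = fat_median, mean = fat_mean)
```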

4. The relationship between carbohydrates and sugars.

ggplot(data = UScereal) +
         geom_point(mapping = aes(x = carbo, y = sugars)) +
         geom_smooth(mapping = aes(x = carbo, y = sugars))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Although there are some very low-carb cereals that have less sugar, overall there’s no strong relationship between carbs and sugar.
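
A correlation coefficient gives a rough numeric check on what the plot suggests. The sketch below computes both the Pearson and the Spearman versions; the variable names are illustrative:

```r
library(MASS)  # provides the UScereal data set

# Pearson correlation: strength of the linear association
r_pearson <- cor(UScereal$carbo, UScereal$sugars)

# Spearman rank correlation: based on ranks, so less sensitive to outliers
r_spearman <- cor(UScereal$carbo, UScereal$sugars, method = "spearman")

round(c(pearson = r_pearson, spearman = r_spearman), 2)
```

Values near 0 would be consistent with the reading above; values near -1 or 1 would not.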

5. The relationship between fibre and manufacturer.

ggplot(data = UScereal) +
         geom_point(mapping = aes(x = mfr, y = fibre))

Nabisco cereals tend to be higher in fibre, as do some Post and Kellogg’s brands. Among the other manufacturers, there is no clear relationship between fibre and manufacturer.

6. The relationship between sodium and sugars.

ggplot(data = UScereal) +
         geom_point(mapping = aes(x = sodium, y = sugars)) +
         geom_smooth(mapping = aes(x = sodium, y = sugars))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Overall sodium and sugars are positively related (despite there being a “pocket” of cereals that consciously try to be low-sugar in spite of the sodium).

Are there any other relationships you can predict and investigate?

Predictions: Sugary cereals tend to be on higher shelves. Higher-calorie cereals tend to be on higher shelves. Sugar and calories are positively related.

ggplot(data = UScereal) +
         geom_point(mapping = aes(x = sugars, y = shelf)) +
         geom_smooth(mapping = aes(x = sugars, y = shelf))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = UScereal) +
         geom_point(mapping = aes(x = calories, y = shelf)) +
         geom_smooth(mapping = aes(x = calories, y = shelf))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = UScereal) +
         geom_point(mapping = aes(x = sugars, y = calories)) +
         geom_smooth(mapping = aes(x = sugars, y = calories))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
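
The three predictions can also be checked numerically with a rank correlation matrix. This is only a sketch: Spearman correlation is my choice here because shelf is an ordered variable with three values, and rank_cor is an illustrative name:

```r
library(MASS)  # provides the UScereal data set

# Pairwise Spearman correlations among the three variables in the predictions
rank_cor <- cor(UScereal[, c("sugars", "calories", "shelf")], method = "spearman")
round(rank_cor, 2)
```

A positive entry for a pair of variables would support the corresponding prediction.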

4.9. The built-in data set mtcars contains information about cars from a 1974 Motor Trend issue. Load the data set (data(mtcars)) and try to answer the following:

data(mtcars)

1. What are the variable names? (Try names.)

names(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

2. What is the maximum mpg?

max(mtcars$mpg)

## [1] 33.9

3. Which car has this?

mtcars[which.max(mtcars$mpg),]

##                 mpg cyl disp hp drat    wt qsec vs am gear carb
## Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1

4. What are the first five cars listed?

head(mtcars, n=5)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

5. What horsepower (hp) does the “Valiant” have?

mtcars["Valiant","hp"]

## [1] 105

6. What are all the values for the Mercedes 450slc (Merc 450SLC)?

mtcars["Merc 450SLC",]

##              mpg cyl  disp  hp drat   wt qsec vs am gear carb
## Merc 450SLC 15.2   8 275.8 180 3.07 3.78   18  0  0    3    3

7. Make a scatterplot of cylinders (cyl) vs. miles per gallon (mpg). Fit a regression line. Is this a good candidate for linear regression?

ggplot(data = mtcars) +
         geom_point(mapping = aes(x = cyl, y = mpg)) +
         geom_smooth(mapping = aes(x = cyl, y = mpg), method = "lm")

cor(mtcars$cyl, mtcars$mpg)^2

## [1] 0.72618

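Yes, it is a reasonable candidate for linear regression: the squared correlation of about 0.73 means that a straight line in cyl accounts for roughly 73% of the variance in mpg. Note that cyl takes only three values (4, 6, and 8), so the fitted line is in effect summarizing three groups of cars.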
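
The line drawn by geom_smooth(method = "lm") can be fitted directly with lm(). A minimal sketch (fit and r_squared are illustrative names) that also confirms that, with a single predictor, R-squared equals the squared correlation computed above:

```r
data(mtcars)

# Simple linear regression of fuel economy on number of cylinders
fit <- lm(mpg ~ cyl, data = mtcars)
coef(fit)  # intercept and slope of the fitted line

# With one predictor, R-squared equals the squared correlation
r_squared <- summary(fit)$r.squared
all.equal(r_squared, cor(mtcars$cyl, mtcars$mpg)^2)
```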

4. 3.3.1 Exercise 2

Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

?ggplot2::mpg

## starting httpd help server ... done

ggplot2::mpg

## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    cla~
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
##  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     com~
##  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     com~
##  3 audi         a4      2    2008     4 manu~ f        20    31 p     com~
##  4 audi         a4      2    2008     4 auto~ f        21    30 p     com~
##  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     com~
##  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     com~
##  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     com~
##  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     com~
##  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     com~
## 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     com~
## # ... with 224 more rows

The categorical variables are: manufacturer, model, year, cyl, trans, drv, fl, and class. The continuous variables are: displ, cty, and hwy. You can see this information by running mpg and looking at the type abbreviation printed under each column name. Generally, chr variables are categorical; dbl and int variables may be continuous, but you need additional context, such as the variable definitions in ?mpg, to decide.
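
The column types can also be listed programmatically instead of read off the printout. A sketch using base R helpers on the mpg data; col_types and n_distinct are illustrative names:

```r
library(ggplot2)  # provides the mpg data set

# Storage type of each column, matching the <chr>, <dbl> and <int> tags in the printout
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# Number of distinct values per column; a small number hints at a categorical variable
n_distinct <- vapply(mpg, function(col) length(unique(col)), integer(1))
n_distinct
```

Integer columns such as year and cyl have only a handful of distinct values, which is one reason to treat them as categorical here.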

5. 3.5.1 Exercise 3

What plots does the following code make? What does . do?

ggplot(data = ggplot2::mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = ggplot2::mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

Both calls make scatterplots of hwy against displ, split into facets by drv or by cyl. The “.” is a placeholder meaning that no variable is used on that dimension: “drv ~ .” puts the facets in rows, stacked one on top of the other, while “. ~ cyl” puts them in columns, side by side.

6. 3.6.1. Exercise 2

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

This will be a scatterplot with displ on the x axis and hwy on the y, and each point will have a color based on which drv it is. On top of the plot there will be a smooth line fitted to the data.

ggplot(data = ggplot2::mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Actual result: as predicted, except that there were three smoothed lines, one per drv value and colored accordingly.

7. 3.7.1. Exercise 1

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.ymin = min,
                  fun.ymax = max,
                  fun.y = median)

The default geom is pointrange.
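
stat_summary() reduces the depth values within each cut to three numbers, which geom_pointrange() then draws as a point with a vertical line. The sketch below reproduces those numbers in base R (depth_summary is an illustrative name). Note that in ggplot2 3.3.0 and later the arguments are named fun.min, fun.max and fun, and the older names used above are deprecated:

```r
library(ggplot2)  # provides the diamonds data set

# For each cut: the minimum, median and maximum depth shown by the point-range
depth_summary <- t(sapply(
  split(diamonds$depth, diamonds$cut),
  function(d) c(ymin = min(d), y = median(d), ymax = max(d))
))
depth_summary
```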

8. 3.8.1. Exercise 1

What is the problem with this plot? How could you improve it?

ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

There are more data points than visible points on the graph because many observations share the same cty and hwy values and are drawn on top of each other. You can improve this visualization by changing the geom to geom_jitter(), which adds a small amount of random noise to each point.

ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()
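
The amount of overplotting can be estimated by comparing the number of rows with the number of distinct (cty, hwy) pairs. A short sketch; n_rows and n_positions are illustrative names:

```r
library(ggplot2)  # provides the mpg data set

pairs <- as.data.frame(mpg[, c("cty", "hwy")])

n_rows <- nrow(pairs)               # one per car
n_positions <- nrow(unique(pairs))  # distinct positions geom_point() can show

c(rows = n_rows, distinct_positions = n_positions)
```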

9. 3.9.1 Exercise 4

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

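City and highway mpg are positively related to each other. geom_abline() with no arguments draws a reference line with intercept 0 and slope 1, that is, the line hwy = cty; points above it are cars whose highway mpg exceeds their city mpg. coord_fixed() is important because both axes measure the same quantity (mpg): fixing the aspect ratio makes one unit the same length on both axes, so the reference line sits at 45 degrees and distances from it can be compared directly.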
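
How the cars fall relative to the hwy = cty reference line can be tallied directly. A sketch with illustrative variable names:

```r
library(ggplot2)  # provides the mpg data set

# Share of cars above, on, and below the reference line hwy = cty
share_above <- mean(mpg$hwy > mpg$cty)
share_on <- mean(mpg$hwy == mpg$cty)
share_below <- mean(mpg$hwy < mpg$cty)

c(above = share_above, on = share_on, below = share_below)
```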

10. Load two data.frames by using the code below.

id=c(1,2,3,4,5) 
age=c(31,42,51,55,70) 
gender=c(0,0,1,1,1) 
mydata1=data.frame(cbind(id,age)) 
colnames(mydata1)=c("id", "age") 
mydata2=data.frame(cbind(id,gender)) 
colnames(mydata2)=c("id", "gender")

Now, use the merge command to generate a new data.frame that is linked based on ‘id’. R supports inner / outer joins without a problem but is even friendlier when loading the sqlr package.

merge(mydata1, mydata2, by="id")

##   id age gender
## 1  1  31      0
## 2  2  42      0
## 3  3  51      1
## 4  4  55      1
## 5  5  70      1
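
The inner and outer joins mentioned in the question differ only when some ids are missing from one table. A minimal sketch using merge() with made-up data, in which ids 5 and 6 each appear in only one table; people and genders are hypothetical names, not the assignment's data frames:

```r
# Two small tables sharing an 'id' key; ids 5 and 6 each appear in only one of them
people <- data.frame(id = c(1, 2, 3, 4, 5), age = c(31, 42, 51, 55, 70))
genders <- data.frame(id = c(1, 2, 3, 4, 6), gender = c(0, 0, 1, 1, 1))

inner <- merge(people, genders, by = "id")                # ids present in both tables
left <- merge(people, genders, by = "id", all.x = TRUE)   # every id in people
outer <- merge(people, genders, by = "id", all = TRUE)    # ids present in either table

outer  # unmatched cells are filled with NA
```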