Setup
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Enter the data for questions 1 and 2 using c(), scan(), read.table(), or data.entry().
data=c(3,3,3,4,3,4,3,4,3,4,5,2,5,5,2,2,5,5,4,2,1,3,1,1,1,3,1,1,1,1)
q=c(rep("Q1",10), rep("Q2", 10), rep("Q3", 10))
mydata=data.frame(cbind(data,q))
table(mydata$data[1:10])
##
## 1 2 3 4 5
## 0 0 6 4 0
table(mydata$data[11:20])
##
## 1 2 3 4 5
## 0 4 0 1 5
table(mydata$data[1:10], mydata$data[11:20])
##
## 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 2 0 1 3
## 4 0 2 0 0 2
## 5 0 0 0 0 0
barplot(table(mydata$data[11:20], mydata$data[21:30]), main="Barplot Q2 vs. Q3", col=seq(1:10))
boxplot(data~as.factor(q), main="Boxplot")
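One thing worth noting about the setup above: cbind() on a numeric and a character vector returns a character matrix, so mydata$data is no longer numeric. A minimal sketch of an alternative, assuming the same responses; the column names score and question are illustrative and not part of the assignment.

```r
# Same responses as above: 10 answers each for Q1, Q2 and Q3.
data <- c(3,3,3,4,3,4,3,4,3,4,5,2,5,5,2,2,5,5,4,2,1,3,1,1,1,3,1,1,1,1)
q <- rep(c("Q1", "Q2", "Q3"), each = 10)

# cbind() builds a character matrix, so the scores lose their numeric type.
coerced <- data.frame(cbind(data, q))
is.numeric(coerced$data)

# Passing the vectors straight to data.frame() keeps each column's type.
# (score and question are illustrative column names.)
typed <- data.frame(score = data, question = q)
is.numeric(typed$score)

# Fixing the levels to 1:5 keeps the zero counts in the table.
q1_counts <- table(factor(typed$score[typed$question == "Q1"], levels = 1:5))
q1_counts
```

With numeric scores, summaries such as mean(typed$score) work directly, without converting back from text.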
library('MASS')
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
data('UScereal')
attach(UScereal)
names(UScereal)
## [1] "mfr" "calories" "protein" "fat" "sodium"
## [6] "fibre" "carbo" "sugars" "shelf" "potassium"
## [11] "vitamins"
Now, investigate the following relationships, and make comments on what you see. You can use tables, barplots, scatterplots, etc. to do your investigation.
table(mfr, shelf)
## shelf
## mfr 1 2 3
## G 6 7 9
## K 4 7 10
## N 2 0 1
## P 2 1 6
## Q 0 3 2
## R 4 0 1
The manufacturers are distributed differently across the shelves. General Mills and Kellogg's appear on all three shelves, with somewhat more brands on the top shelf than on the bottom. Nabisco is never on the middle shelf; it is on either the bottom or the top. Post tends to be on top. Quaker is never on the bottom shelf; it is on either the middle or the top. Ralston Purina is never on the middle shelf and is most often on the bottom.
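Because the manufacturers have very different numbers of brands, row proportions can make the shelf comparison easier to read than raw counts. A small sketch, assuming the MASS package is installed; the names counts and row_share are illustrative.

```r
library(MASS)  # provides the UScereal data set

# Cross-tabulate manufacturer by shelf, as in the table above.
counts <- table(UScereal$mfr, UScereal$shelf, dnn = c("mfr", "shelf"))

# Convert each row to proportions so every manufacturer sums to 1.
row_share <- prop.table(counts, margin = 1)
round(row_share, 2)
```

Small manufacturers such as Nabisco have only a handful of brands, so their proportions should be read with caution.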
ggplot(data = UScereal) +
geom_point(mapping = aes(x = fat, y = vitamins), position = "jitter")
There is no apparent relationship between fat and vitamins.
ggplot(data = UScereal) +
geom_point(mapping = aes(x = fat, y = shelf), position = "jitter") +
geom_smooth(mapping = aes(x = fat, y = shelf))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Fattier cereals tend to be placed on higher shelves.
ggplot(data = UScereal) +
geom_point(mapping = aes(x = carbo, y = sugars)) +
geom_smooth(mapping = aes(x = carbo, y = sugars))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Although there are some very low-carb cereals that have less sugar, overall there’s no strong relationship between carbs and sugar.
ggplot(data = UScereal) +
geom_point(mapping = aes(x = mfr, y = fibre))
Nabisco cereals tend to be higher in fibre, as do some Post and Kellogg's brands. Among the other manufacturers there is no clear relationship between fibre and manufacturer.
ggplot(data = UScereal) +
geom_point(mapping = aes(x = sodium, y = sugars)) +
geom_smooth(mapping = aes(x = sodium, y = sugars))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Overall, sodium and sugars are positively related (despite a “pocket” of cereals that are low in sugar relative to their sodium content).
Predictions: Sugary cereals tend to be on higher shelves. Higher-calorie cereals tend to be on higher shelves. Sugar and calories are positively related.
ggplot(data = UScereal) +
geom_point(mapping = aes(x = sugars, y = shelf)) +
geom_smooth(mapping = aes(x = sugars, y = shelf))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = UScereal) +
geom_point(mapping = aes(x = calories, y = shelf)) +
geom_smooth(mapping = aes(x = calories, y = shelf))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = UScereal) +
geom_point(mapping = aes(x = sugars, y = calories)) +
geom_smooth(mapping = aes(x = sugars, y = calories))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
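The three predictions can also be checked numerically alongside the plots. The sketch below is one possible check, not part of the original assignment: it uses per-shelf medians and correlation coefficients, and all object names are illustrative. It assumes MASS is installed and relies on shelf being coded 1 to 3 counting up from the floor.

```r
library(MASS)  # provides the UScereal data set

# Median sugars and calories on each shelf (1 = bottom, 3 = top).
by_shelf <- aggregate(cbind(sugars, calories) ~ shelf, data = UScereal, FUN = median)
by_shelf

# Shelf is an ordered code with only three values, so use a rank correlation.
shelf_sugar    <- cor(UScereal$shelf, UScereal$sugars,   method = "spearman")
shelf_calories <- cor(UScereal$shelf, UScereal$calories, method = "spearman")

# Sugars and calories are both continuous, so Pearson correlation is a reasonable first look.
sugar_calories <- cor(UScereal$sugars, UScereal$calories)

c(shelf_sugar = shelf_sugar, shelf_calories = shelf_calories, sugar_calories = sugar_calories)
```

A coefficient near zero would count against the corresponding prediction; a clearly positive one would support it.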
data(mtcars)
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
max(mtcars$mpg)
## [1] 33.9
mtcars[which.max(mtcars$mpg),]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
head(mtcars, n=5)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
mtcars["Valiant","hp"]
## [1] 105
mtcars["Merc 450SLC",]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
ggplot(data = mtcars) +
geom_point(mapping = aes(x = cyl, y = mpg)) +
geom_smooth(mapping = aes(x = cyl, y = mpg), method = "lm")
cor(mtcars$cyl, mtcars$mpg)^2
## [1] 0.72618
Yes. With an R-squared of about 0.73, cyl accounts for most of the variation in mpg, so this is a good candidate for linear regression.
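For a regression with a single predictor, the squared correlation computed above equals the R-squared of the fitted model. A short sketch that fits the model explicitly and compares the two; the residual summary by cylinder count is an extra check added here, not something the assignment asks for.

```r
# Fit mpg as a linear function of cyl (mtcars ships with base R).
fit <- lm(mpg ~ cyl, data = mtcars)

# R-squared from the model and from the squared correlation should agree.
r2_lm  <- summary(fit)$r.squared
r2_cor <- cor(mtcars$cyl, mtcars$mpg)^2
all.equal(r2_lm, r2_cor)

# Mean residual within each cylinder group; values far from zero
# would suggest the relationship is not quite a straight line.
tapply(resid(fit), mtcars$cyl, mean)
```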
Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
?ggplot2::mpg
## starting httpd help server ... done
ggplot2::mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl cla~
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p com~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p com~
## 3 audi a4 2 2008 4 manu~ f 20 31 p com~
## 4 audi a4 2 2008 4 auto~ f 21 30 p com~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p com~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p com~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p com~
## 8 audi a4 q~ 1.8 1999 4 manu~ 4 18 26 p com~
## 9 audi a4 q~ 1.8 1999 4 auto~ 4 16 25 p com~
## 10 audi a4 q~ 2 2008 4 manu~ 4 20 28 p com~
## # ... with 224 more rows
The categorical variables are: manufacturer, model, year, cyl, trans, drv, fl, and class. The continuous variables are: displ, cty, and hwy. You can see this information by running mpg and reading the type abbreviation printed under each column name (<chr>, <dbl>, <int>), then judging whether the variable is measurable. (Generally chr variables are not measurable; dbl and int variables could be measurable, but you need additional context, such as the variable definitions, to decide.)
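The storage types can also be pulled out programmatically rather than read off the printout. A minimal sketch, assuming ggplot2 is installed; col_types is an illustrative name.

```r
library(ggplot2)  # provides the mpg data set

# First class of every column: "character", "integer" or "numeric".
col_types <- vapply(ggplot2::mpg, function(col) class(col)[1], character(1))
col_types

# Group the column names by storage type.
split(names(col_types), col_types)
```

Storage type is only a starting point: year and cyl are stored as integers but are treated as categorical in the answer above.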
What plots does the following code make? What does . do?
ggplot(data = ggplot2::mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = ggplot2::mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
Both calls make scatterplots of hwy against displ, split into facets by either drv or cyl. The “.” is a placeholder meaning “no variable on this side of the formula”: drv ~ . puts the facets in rows, stacked one on top of the other, while . ~ cyl puts them in columns, side by side.
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
This will be a scatterplot with displ on the x axis and hwy on the y, and each point will have a color based on which drv it is. On top of the plot there will be a smooth line fitted to the data.
ggplot(data = ggplot2::mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Actual result: as predicted, except that there were three smoothed lines, one for each drv and colored to match.
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
ggplot(data = diamonds) +
geom_pointrange(mapping = aes(x = cut, y = depth),
stat = "summary",
fun.ymin = min,
fun.ymax = max,
fun.y = median)
The default geom is pointrange.
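The fun.y, fun.ymin and fun.ymax arguments used above were deprecated in ggplot2 3.3.0 in favor of fun, fun.min and fun.max. A sketch of the same plot with the newer names, assuming ggplot2 3.3.0 or later is installed; older versions need the original argument names.

```r
library(ggplot2)  # provides the diamonds data set

# Same summary as above: median depth per cut, with a min-to-max range.
p <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
p
```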
What is the problem with this plot? How could you improve it?
ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
There are more data points than visible points on the graph, because many observations share the same cty and hwy values and are drawn on top of each other. You can improve this visualization by changing the geom to jitter.
ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
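The amount of overplotting can be measured by comparing the number of rows with the number of distinct (cty, hwy) pairs. The sketch below does that and also builds a geom_count() version, which sizes each point by how many observations it represents. This is an alternative to jitter, not what the exercise asks for, and the object names are illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Compare total observations with distinct (cty, hwy) combinations.
cty_hwy    <- ggplot2::mpg[c("cty", "hwy")]
n_total    <- nrow(cty_hwy)
n_distinct <- nrow(unique(cty_hwy))
c(total = n_total, distinct = n_distinct)

# Alternative to jitter: keep exact positions and encode the overlap as point size.
p_count <- ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
p_count
```

Jitter trades exact positions for visibility, while geom_count() keeps positions exact but needs a size legend to read.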
What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
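City and highway mpg are positively and roughly linearly related. With no arguments, geom_abline() draws a reference line with intercept 0 and slope 1, i.e. the line hwy = cty; the points sit above it, so cars get better mileage on the highway than in the city. It’s important to fix the aspect ratio using “coord_fixed()” because both axes measure the same quantity (mpg): one unit then has the same length on each axis, so the reference line is drawn at 45 degrees and distances from it are easy to compare.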
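To make the role of the reference line explicit, the defaults of geom_abline() can be written out, and the share of cars above the line can be computed directly. A small sketch; share_above is an illustrative name.

```r
library(ggplot2)  # provides the mpg data set

# Proportion of cars whose highway mpg exceeds their city mpg,
# i.e. the share of points lying above the line hwy = cty.
share_above <- mean(ggplot2::mpg$hwy > ggplot2::mpg$cty)
share_above

# Same plot with the default intercept and slope spelled out.
p_ref <- ggplot(data = ggplot2::mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  coord_fixed()
p_ref
```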
id=c(1,2,3,4,5)
age=c(31,42,51,55,70)
gender=c(0,0,1,1,1)
mydata1=data.frame(cbind(id,age))
colnames(mydata1)=c("id", "age")
mydata2=data.frame(cbind(id,gender))
colnames(mydata2)=c("id", "gender")
Now, use the merge command to generate a new data.frame that is linked based on ‘id.’ R supports inner / outer joins without a problem but is even friendlier when loading the sqlr package.
merge(mydata1, mydata2, by="id")
## id age gender
## 1 1 31 0
## 2 2 42 0
## 3 3 51 1
## 4 4 55 1
## 5 5 70 1
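Since the prompt mentions inner and outer joins, here is a brief sketch of how merge() expresses them through its all, all.x and all.y arguments. The second table is made-up illustration data (id 6 does not appear in the exercise) so that the join types give different results.

```r
# Ages for ids 1-5, as in the exercise.
ages <- data.frame(id = c(1, 2, 3, 4, 5), age = c(31, 42, 51, 55, 70))

# Hypothetical gender table that only partly overlaps: ids 3-6.
genders <- data.frame(id = c(3, 4, 5, 6), gender = c(1, 1, 1, 0))

# Inner join (the default): only ids present in both tables.
inner <- merge(ages, genders, by = "id")

# Left outer join: every row of ages, NA where gender is missing.
left <- merge(ages, genders, by = "id", all.x = TRUE)

# Full outer join: every id from either table.
full <- merge(ages, genders, by = "id", all = TRUE)

nrow(inner); nrow(left); nrow(full)
```

In the exercise itself both tables contain exactly ids 1 to 5, so all three join types return the same five rows.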