Load the beer data file
load(file="beer.RData")
See the internal structure (str) of the beer data
str(beer)
## 'data.frame': 212 obs. of 6 variables:
## $ BEER : Factor w/ 212 levels "American Amber Lager",..: 1 2 3 5 7 6 9 10 16 17 ...
## $ Brewery : Factor w/ 68 levels "A.B. Pripps Bryggerier (Sweden)",..: 61 61 61 4 5 5 2 7 11 11 ...
## $ Calories : int 136 132 96 153 95 157 94 155 177 163 ...
## $ Carbohydrates: num 10.5 10.5 7.6 16 3.2 8.9 2.6 14.2 15.6 13.9 ...
## $ Alcohol : num 4.1 4.1 3.2 4.9 4.2 5.9 4.1 4.6 5.2 4.7 ...
## $ Type : Factor w/ 2 levels "Domestic","Imported": 1 1 1 1 1 1 1 1 1 1 ...
You can also list the variable names in the data
names(beer)
## [1] "BEER" "Brewery" "Calories" "Carbohydrates"
## [5] "Alcohol" "Type"
To see the five-number summary of a variable (summary() also reports the mean)
summary(beer$Calories)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.0 135.0 150.0 153.9 163.5 330.0
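As a small sketch of where those numbers come from, the same quantities can be computed one at a time with quantile(), mean() and IQR(). The vector below holds only the first ten Calories values printed by str(beer) above, so its results differ from the full-data summary; the object names are illustrative.

```r
# First ten Calories values as printed by str(beer); not the full data set
calories <- c(136, 132, 96, 153, 95, 157, 94, 155, 177, 163)

# Minimum, quartiles and maximum (the five-number part of summary())
q <- quantile(calories, probs = c(0, 0.25, 0.5, 0.75, 1))

# summary() reports these same quantiles plus the mean
s <- summary(calories)

# Interquartile range: third quartile minus first quartile
iqr <- IQR(calories)

print(q)
print(s)
print(iqr)
```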
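Make a boxplot, which draws the five-number summary. The range argument controls the whiskers: range=0 extends them to the minimum and maximum, while range=1.5 (the default) extends them only to the most extreme points within 1.5 times the interquartile range (IQR) of the box and plots anything beyond that as an outlier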
boxplot(beer$Calories, range=0)
boxplot(beer$Calories, range=1.5)
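To see which points a given range setting would flag without drawing anything, one option is boxplot.stats(), whose coef argument plays the role of range in boxplot(). This is a minimal sketch on made-up calorie values (not taken from beer.RData).

```r
# Toy calorie values, invented for illustration only
calories <- c(55, 95, 110, 135, 142, 150, 153, 160, 163, 170, 200, 330)

# Five-number summary using Tukey's hinges, as drawn by boxplot()
five <- fivenum(calories)
print(five)   # 55.0 122.5 151.5 166.5 330.0

# coef = 1.5: points more than 1.5 * (box length) beyond the box are outliers
flagged <- boxplot.stats(calories, coef = 1.5)$out
print(flagged)   # 55 330

# coef = 0: whiskers reach the extremes, so nothing is flagged
none_flagged <- boxplot.stats(calories, coef = 0)$out
print(length(none_flagged))   # 0
```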
Standard deviation
sd(beer$Calories)
## [1] 41.38973
Coefficient of variation (to compare SD across variables with different means)
sd(beer$Calories)/mean(beer$Calories)
## [1] 0.2688716
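Because the coefficient of variation divides the SD by the mean, it has no units and can be compared across variables measured on different scales. The sketch below wraps it in a helper (the name cv is our own, not a base R function) and applies it to the first ten Calories and Alcohol values printed by str(beer) above.

```r
# Hypothetical helper: coefficient of variation, ignoring missing values
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# First ten values of each variable as printed by str(beer); not the full data
calories <- c(136, 132, 96, 153, 95, 157, 94, 155, 177, 163)
alcohol  <- c(4.1, 4.1, 3.2, 4.9, 4.2, 5.9, 4.1, 4.6, 5.2, 4.7)

cv_calories <- cv(calories)
cv_alcohol  <- cv(alcohol)
print(c(calories = cv_calories, alcohol = cv_alcohol))

# Rescaling a variable (here kcal to kJ) changes its SD but not its CV
cv_kilojoules <- cv(calories * 4.184)
print(all.equal(cv_calories, cv_kilojoules))
```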
Use the following code to install the necessary R package. (Remove the repos = 'http://cran.us.r-project.org' argument when installing in your own R session.)
install.packages("tidyverse", repos = 'http://cran.us.r-project.org')
## Installing package into '/Users/luanpao/Library/R/3.3/library'
## (as 'lib' is unspecified)
##
## The downloaded binary packages are in
## /var/folders/wg/j02h67j14d7dthw4m0r0gjk80000gn/T//RtmpOuEHS6/downloaded_packages
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
To plot a smooth density estimate for a variable
ggplot(beer, aes(x=Calories)) + geom_density()
Density estimation and histogram on the same graph, with a transparent density plot
ggplot(beer, aes(x=Calories)) + geom_histogram(aes(y=..density..), binwidth=5, col="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666")
To make a nicer graph, experiment with different values of binwidth
ggplot(beer, aes(x=Calories)) + geom_histogram(aes(y=..density..), binwidth=10, colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666")
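The histogram and the density curve can share one graph only because the histogram is drawn on the density scale, where the bar areas sum to 1. A rough base R sketch of that idea, using the first ten Calories values printed by str(beer) and break points chosen purely for illustration:

```r
# First ten Calories values as printed by str(beer); not the full data set
calories <- c(136, 132, 96, 153, 95, 157, 94, 155, 177, 163)

# Histogram with bins of width 10, computed but not drawn
h <- hist(calories, breaks = seq(90, 180, by = 10), plot = FALSE)

# On the density scale, bar height * bar width sums to 1
hist_area <- sum(h$density * diff(h$breaks))
print(hist_area)

# Kernel density estimate; its area is also approximately 1
d <- density(calories)
density_area <- sum(d$y) * diff(d$x[1:2])   # simple Riemann-sum approximation
print(density_area)

# Overlay the two on the same scale
hist(calories, breaks = seq(90, 180, by = 10), freq = FALSE)
lines(d, col = "red")
```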
pnorm(a, mean=mu, sd=sigma): the share of the population below 'a' if the variable follows a normal distribution with mean 'mu' and standard deviation 'sigma'
load(file='ncaa.RData')
str(ncaa)
## 'data.frame': 78 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GPA : num 7.94 8.29 4.64 7.47 8.88 ...
## $ IQ : int 111 107 100 107 114 115 111 97 100 112 ...
## $ Gender : int 2 2 2 2 1 2 2 2 1 2 ...
## $ SelfConcept: int 67 43 52 66 58 51 71 51 49 51 ...
mean(ncaa$GPA)
## [1] 7.446538
sd(ncaa$GPA)
## [1] 2.099557
Now compute the share of the population with a GPA below 3, using the rounded mean and standard deviation from above
pnorm(3, mean=7.45, sd=2.1)
## [1] 0.01704322
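The same probability can be obtained by standardizing first, and the upper tail or an interval follows from the same function. A short sketch using the rounded mean and SD from above (the cutoffs 6 and 9 are arbitrary values picked for illustration):

```r
mu    <- 7.45   # rounded mean of GPA, from above
sigma <- 2.1    # rounded standard deviation of GPA, from above

# Share below 3, computed directly
below_3 <- pnorm(3, mean = mu, sd = sigma)

# Same share via the z-score and the standard normal distribution
z <- (3 - mu) / sigma
below_3_z <- pnorm(z)

# Share above 3: the upper tail
above_3 <- pnorm(3, mean = mu, sd = sigma, lower.tail = FALSE)

# Share between two cutoffs: difference of two lower-tail shares
between_6_and_9 <- pnorm(9, mean = mu, sd = sigma) - pnorm(6, mean = mu, sd = sigma)

print(c(below_3, below_3_z, above_3, between_6_and_9))
```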
qnorm(p, mean=mu, sd=sigma): the p-th quantile (the value below which a share p of the population falls) if the variable follows a normal distribution with mean mu and standard deviation sigma. With p=.9 this is the 90th percentile
qnorm(.9, mean=7.45, sd=2.1)
## [1] 10.14126
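qnorm() is the inverse of pnorm(): feeding the quantile back into pnorm() should return the original share. A minimal check, again with the rounded mean and SD from above:

```r
mu    <- 7.45   # rounded mean of GPA, from above
sigma <- 2.1    # rounded standard deviation of GPA, from above

# 90th percentile of the assumed normal distribution
q90 <- qnorm(0.9, mean = mu, sd = sigma)

# Round trip: the share below q90 is 0.9 again
share_below_q90 <- pnorm(q90, mean = mu, sd = sigma)

# Equivalent calculation from the standard normal quantile
q90_from_z <- mu + sigma * qnorm(0.9)

print(c(q90, share_below_q90, q90_from_z))
```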
Normal quantile plot
qqnorm(beer$Calories)
qqline(beer$Calories, col="red")
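A normal quantile plot pairs each sorted observation with the standard normal quantile expected at its rank; points close to a straight line suggest the data are roughly normal. The sketch below rebuilds those pairs by hand for the first ten Calories values printed by str(beer) and compares them with what qqnorm() computes.

```r
# First ten Calories values as printed by str(beer); not the full data set
calories <- c(136, 132, 96, 153, 95, 157, 94, 155, 177, 163)
n <- length(calories)

# Theoretical quantiles: standard normal quantiles at evenly spread probabilities
theoretical <- qnorm(ppoints(n))

# Sample quantiles: the observations in increasing order
observed <- sort(calories)

# qqnorm() returns the same pairs (in the original data order) when not plotting
qq <- qqnorm(calories, plot.it = FALSE)
print(all.equal(sort(qq$x), theoretical))
print(all.equal(sort(qq$y), observed))

# Drawing the hand-built pairs gives the same picture as qqnorm()
plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
```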
CPS microdata (restricted to people from Pennsylvania)
Load the data file
load(file="cps.RData")
Let’s see what the data look like
str(cps$hourwage)
## atomic [1:297222] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "label")= chr "Hourly wage"
## - attr(*, "format.stata")= chr "%4.2f"
To see the five-number summary of a subsample (Pennsylvania only) as a boxplot
boxplot(subset(cps, state=="Pennsylvania")$hourwage, range=1.5)
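Since hourwage contains many missing values, it helps to see how subsetting and the summary functions treat NA. This is a sketch on a tiny made-up data frame (cps_toy and its values are hypothetical; only the column names state and hourwage come from the code above).

```r
# Hypothetical stand-in for the CPS data, with some missing wages
cps_toy <- data.frame(
  state    = c("Pennsylvania", "Pennsylvania", "Ohio", "Pennsylvania", "Ohio"),
  hourwage = c(15.5, NA, 22.0, 30.25, NA)
)

# Keep only the Pennsylvania rows
pa <- subset(cps_toy, state == "Pennsylvania")

# summary() reports the number of missing values alongside the quantiles
print(summary(pa$hourwage))

# fivenum() drops missing values by default (na.rm = TRUE)
pa_five <- fivenum(pa$hourwage)
print(pa_five)

# Number of non-missing wages actually used
n_used <- sum(!is.na(pa$hourwage))
print(n_used)
```

boxplot() likewise ignores missing values, so the plot above is based only on the Pennsylvania respondents with a recorded hourly wage.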
Do your assignment with the vehicle CO2 emissions data
load(file="canfreg.RData")