##! [Alt text](\Users\Rosalie\Documents\University\Year 1\Term 2A\Statistics I\Lab Report w2 w3\Graph republicans-democrats Michigan.png)
library(tidyverse)
library(openintro)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
Based on this plot, the data nearly follows the same bell formed curve as a normal distribution curve. Hpwever it is not as symmetrical around the mean as a normal distributed curve would be.
To visualize only the delay of flights headed to Los Angeles, one can use filters.
library(tidyverse)
library(openintro)
lax_flights <- nycflights %>%
filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
lax_flights %>%
summarise(mean_dd = mean(dep_delay),
median_dd = median(dep_delay),
n = n())
You can also filter based on multiple criteria. Suppose you are interested in flights headed to San Francisco (SFO) in February:
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
Showing 65 to 32,735 of 68 entries, 16 total columns (filtered from 32,735 total entries) Thus, 68 flights meet these criteria.
summary(sfo_feb_flights$arr_delay)
The median is -11.00 and thus, the half of the flights arrived 11 minutes earlier.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sfo_feb_flights%>%
group_by(carrier)%>%
summarise(median_dd=median(dep_delay),iqr_dd=IQR(arr_delay),n_flights=n())
## # A tibble: 5 x 4
## carrier median_dd iqr_dd n_flights
## * <chr> <dbl> <dbl> <int>
## 1 AA 13 17.5 10
## 2 B6 -2 12.2 6
## 3 DL -3 22 19
## 4 UA -2 22 21
## 5 VX -3.5 21.2 12
The carriers DL and UA show the highest range in their IQR with each 22. This shows that these two carriers have the most variability in arrival delays.
Within this exercise we want to schedule our travel during a month that minimizes the potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay:
#Mean
nycflights %>%
group_by(month)%>%
summarise(mean_dd=mean(dep_delay))%>%
arrange(desc(mean_dd))
## # A tibble: 12 x 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
Another option would be, to choose the month with the lowest median departure delay:
#Median
nycflights %>%
group_by(month)%>%
summarise(median_dd=median(dep_delay))%>%
arrange(desc(median_dd))
## # A tibble: 12 x 2
## month median_dd
## <int> <dbl>
## 1 12 1
## 2 6 0
## 3 7 0
## 4 3 -1
## 5 5 -1
## 6 8 -1
## 7 1 -2
## 8 2 -2
## 9 4 -2
## 10 11 -2
## 11 9 -3
## 12 10 -3
With the median we can separate the higher half of data sample from the lower half. It is the “middle value”. The mean, on the other hand, is defined as the simple average of the given set of values.
Since the mean outlines the center of gravity within a set of data, it is more helpful to look at the mean, in order to schedule our travel.
For this exercise, we are going to visualize the distribution the of on time departure rate across all the three major NYC airports. We consider any flight delayed for 5 minutes or more to be delayed:
nycflights<-nycflights%>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights%>%
group_by(origin)%>%
summarise(ot_dep_rate = sum(dep_type=="on time") / n()) %>%
arrange(desc(ot_dep_rate))
## # A tibble: 3 x 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
Lets visualize this distribution:
library(ggplot2)
ggplot(data=nycflights,aes(x=origin,fill=dep_type))+geom_bar()
Looking at the results, one should choose LGA.
The following code shows the mutation of the data frame which includes a new variable avg_speed. When calculating the average speed the air_time needs to be converted from minutes to hours. That is why it is divided by 60. This value is then used in next line using the pipe operator (%>%). The avg_speed is the average speed traveled by the plane for each flight (in mph). It is calculated as distance divided by number of hours of traveled. In the last line those columns that are wished to be shown are selected using the select().
nycflights %>% mutate(airtime.hours = air_time/60) %>%
mutate(avg.speed = distance/airtime.hours) %>%
select(origin, dest, avg.speed, distance, airtime.hours)
Here the code shows how to make a scatterplot of avg_speed vs. distance, using the data frame from exercise 8. Only thing added to the exercise 8 is the last line and the airtime.hours is not selected. The ggplot makes the scatterplot and aes makes it possible to name the variables from the data frame. The y-axis (vertical) is the avg.speed and x-axis (horizontal) is the distance. The relationship between average speed and distance. The furter you go the higher the average speed is until a certain flight distance (around 1700). In shorter trips the take off and landing take bigger percentage of the total flight time. During take off and landing the speed is slower than in cruising altitude.
nycflights %>% mutate(airtime.hours = air_time/60) %>%
mutate(avg.speed = distance/airtime.hours) %>%
select(origin, dest, avg.speed, distance) %>%
ggplot(aes(y=avg.speed, x=distance)) +geom_point() + geom_smooth()
This plot is made from data frame that only contains flights from American Airlines (AA), Delta Airlines (DL), and United Airlines (UA), and the points are colored by carrier. Again aes makes it possible to name the variables from the data frame
b <- seq(0, 800, by = 100)
df.AA.US<-nycflights%>%
filter(carrier == "AA" | carrier == "DL" | carrier == "UA")
ggplot(data = df.AA.US, mapping=aes(x=dep_delay, y=arr_delay, colour = carrier))+
geom_point()
del_dep_arr <- nycflights %>%
filter(carrier=="AA" | carrier=="DL" | carrier =="UA") %>%
filter(arr_delay <= 0) %>%
arrange(desc(dep_delay))
print(del_dep_arr)
In that data frame you can see a flight on 11th of July 2013. Departure delay was 63 minutes and still it still arrived 11 minutes before arrival date. The cutoff point is for departure delays where you can still expect to get to your destination on time is Y-axis 0, Every flight that that is above x=0 but below Y=0 has had departure delay but has still arrived on time or even before arrival time.
The replication of the artist’s drawing is made by generating randomly sprinkled points to the square using the function rnorm(n, min, max) for both coordinates of every point. First we use a function that determines if the point is in the circle.Then data frame is created. Then the graph is made by using the ggplot, and assigning different colors to the dot depending whether they are inside or outside the “circle” (the line).
y.circle <- function(x){
sqrt(1-x^2)
}
n.points = 1000
points.x <- runif(n=n.points,min=0, max=1)
points.y <- runif(n=n.points,min=0, max=1)
df <- data.frame(points.x, points.y)
inside <-points.y < y.circle(points.x)
df$inside <- inside
ggplot(data=df) +
geom_point(aes(x=points.x, y=points.y, color=inside)) +
geom_function(fun=y.circle)
The following shows how to repeat this painting thousands of times. For each painting produced,the proportion of points that fall within the circle is calculated, relative to the total number of points that are distributed all over the painting.
mean(df$inside)
# This is the number of points in our painting
prop.in.one.painting <- function(){
n.points <- 10000
# This is the number of points in our painting
point.x <- runif(n.points,min=0, max=1)
point.y <- runif(n.points,min=0, max=1)
mean(point.y < y.circle(point.x))
}
Next code creates a histogram of the proportions, one proportion for each painting. The variable prop.in.paintings is not a data.frame. so it is directly put into the aes(x=prop.in.paintings) argument and the data argument can be dropped.
n.paintings <- 10000 # This is the number of paintings
prop.in.paintings <- replicate(n.paintings, prop.in.one.painting())
ggplot() +
geom_histogram(aes(x = prop.in.paintings), binwidth = 0.0003) +
scale_x_continuous(breaks=seq(0.60, 0.90, by=.005))
mean(prop.in.paintings)*4
First n.points and n.paintings both had value 10 000 but the following shows how the histogram changes if these values are changed to 80 000.
prop.in.one.painting <- function(){
n.points <- 80000
point.x <- runif(n.points,min=0, max=1)
point.y <- runif(n.points,min=0, max=1)
mean(point.y < y.circle(point.x))
n.paintings <- 80000
prop.in.paintings <- replicate(n.paintings, prop.in.one.painting())
}
ggplot() +
geom_histogram(aes(x = prop.in.paintings), binwidth = 0.0003) +
scale_x_continuous(breaks=seq(0.60, 0.90, by=.005))
mean(prop.in.paintings)*4
This diagram show the probability of the dots falling inside the circle. It is binomial distribution.
We are working at a company that manufactures lifts and we need to estimate the expected weight each lift needs to carry. We manufacture two versions of lift: a small model with just enough space for five people and one large model with enough space for ten people.
At every moment, the lift journey can carry a different weight. The total weight in the lift depends on the number of people in it and their weights. For this exercise we are going to predict what kind of weights the lift should be expected to carry. We will first calculate the descriptive statistics of the data-called bdims.
library("openintro")
library("tidyverse")
library("knitr")
all <- bdims %>%
summarise(sex="all",mean=mean(wgt),median=median(wgt),
"standart deviation"=sd(wgt),range=range(wgt),
IQR=IQR(wgt),
"central 50%"=quantile(wgt,prob=c(0.25,0.75))%>%
paste(collapse=" to "),
"central 95%"=quantile(wgt,prob=c(0.025,0.975))%>%
paste(collapse = " to "),
"central 99%"=quantile(wgt,prob=c(0.005,0.995))%>%
paste(collapse = " to "))
male_female <- bdims %>%
group_by(sex)%>%
summarise(mean=mean(wgt),median=median(wgt),
"standart deviation"=sd(wgt),range=range(wgt),
IQR=IQR(wgt),
"central 50%"=quantile(wgt,prob=c(0.25,0.75))%>%
paste(collapse=" to "),
"central 95%"=quantile(wgt,prob=c(0.025,0.975))%>%
paste(collapse = " to "),
"central 99%"=quantile(wgt,prob=c(0.005,0.995))%>%
paste(collapse = " to "))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
df.all<-data.frame(all)
df.male_female<-data.frame(male_female)
df.combined <- rbind(df.all,df.male_female)
kable(df.combined)
Moving on we will show four different diagrams visualizing the distribution of the weights in the data-set:
library("openintro")
library("tidyverse")
library("knitr")
bdims %>%
ggplot() +
geom_histogram(aes(x=wgt, fill=factor(sex)), binwidth=1, position="dodge")
This histogram shows the weight on the x-axis and the count on the y-axis. Both sex are marked with different colors, blue showing the distribution of weights for males and pink showing the distribution of weights for females. In general we can see that the women are on average lighter than man are.
bdims %>%
ggplot() +
geom_boxplot(aes(y=wgt, x=factor(sex)))
This boxplot visualizes the interquartile range of each the man and the female weights.
bdims %>%
ggplot() +
geom_violin(aes(y=wgt, x=factor(sex)))
The violin distribution again shows how the weight is distributed within males and females. The x-axis shows the gender, while the y-axis shows the weight in kg.
bdims %>%
ggplot() +
geom_density(aes(x=wgt, fill=factor(sex)), alpha=.5) +
scale_x_continuous(breaks=seq(40,120,by=5))
The density distribution visualizes how dense the weight of all females or males is distributed.
lift.weight <- function(lift.size){
sampled.row.numbers <- sample(1:nrow(bdims), lift.size)
sampled.rows <- bdims[sampled.row.numbers,]
mean(sampled.rows$wgt)
}
n.lifts <- 10000
mean.weight.5 <- replicate(n.lifts, lift.weight(5))
mean.weight.10 <- replicate(n.lifts, lift.weight(10))
df.5 <- data.frame(mean.weight=mean.weight.5, model="small")
df.10 <- data.frame(mean.weight=mean.weight.10, model="large")
df <- rbind(df.5, df.10)
kable(df)
ggplot(df)+
geom_boxplot(aes(y=mean.weight,x=factor(model)))+
scale_y_continuous(breaks = seq(50,90,by=10))
ggplot(df)+
geom_density(aes(x=mean.weight,fill=factor(model)),alpha=0.6)
lift.weight <- function(lift.size){
sampled.row.numbers <- sample(1:nrow(bdims), lift.size)
sampled.rows <- bdims[sampled.row.numbers,]
mean(sampled.rows$wgt)
}
n.lifts <- 10000
mean.weight.5 <- replicate(n.lifts, lift.weight(5))
mean.weight.10 <- replicate(n.lifts, lift.weight(10))
df.5 <- data.frame(mean.weight=mean.weight.5, model="small")
df.10 <- data.frame(mean.weight=mean.weight.10, model="large")
df <- rbind(df.5, df.10)
kable(df)
ggplot(df)+
geom_boxplot(aes(y=mean.weight,x=factor(model))) +
scale_y_continuous(breaks = seq(50,90,by=10))
ggplot(df)+
geom_density(aes(x=mean.weight,fill=factor(model)),alpha=0.6)
lift_models <- df %>% group_by(model) %>%
summarise(mean=mean(mean.weight),median=median(mean.weight),"standard deviation"=sd(mean.weight),IQR=IQR(mean.weight),range=range(mean.weight),"central 50%"=quantile(mean.weight,prob=c(0.25,0.75))%>%
paste(collapse=" to "),
"central 95%"=quantile(mean.weight,prob=c(0.025,0.975))%>%
paste(collapse = " to "),
"central 99%"=quantile(mean.weight,prob=c(0.005,0.995))%>%
paste(collapse = " to "))
## `summarise()` has grouped output by 'model'. You can override using the `.groups` argument.
df_lift_models<-data.frame(lift_models)
kable(lift_models)
If we have a look at both of the distributions, we can see that the median of both of the lifts are very similar. However, the mean weight of the large lift is bigger than the mean weight of the small lift. Both lifts have several outliers, both being smaller and larger than the first and third quantile. Both distributions are normally distributed, however, the peak of the large lift is set higher than the peak of the smaller lift. Additionally, the smaller lift has a greater variation in weights, which can be explained by the sample size of the lift.
Based on this plot, the data nearly follows the same bell formed curve as a normal distribution curve. However it is not as symmetrical around the mean as a normal distributed curve would be.
library(tidyverse)
library(ggplot2)
library(openintro)
female <- bdims %>%
filter(sex == 0)
male <- bdims %>%
filter(sex == 1)
female_mean <- mean(female$wgt)
female_sd <- sd(female$wgt)
sim_norm <- rnorm(n = nrow(female), mean = female_mean, sd = female_sd)
ggplot(data = data.frame(sim_norm), aes(sample = sim_norm)) +
stat_qq() +
stat_qq_line()
Not all points are located on the line. This plot contains much less information which can be generalised since it is a sample of the actual data set. The appearance of the graph is much different since the different quantiles are less noticeable and it the aesthetics are less like the standard normal distribution.
sim_norm <- rnorm(n = nrow(female), mean = female_mean, sd = female_sd)
ggplot(data = data.frame(sim_norm), aes(sample = sim_norm)) +
stat_qq() +
stat_qq_line()
qqnormsim(sample = wgt, data = female)
Eventhough the aesthetics of the graph are very different from the previous ones, it still looks like the female weight is distributed fairly normal. The mean and median are located around the theoretical 0. In all the graphs it is hard to see how much data is actually situated in one spot, which makes it difficult to determine what the percentage of data per step of standard deviation is. However, it does seem that within the first step of standard deviation around 34% of the data is situated on each side. One step further it becomes a bit less, so around 13,5% is quite probable. The third step should contain around 2,5% of the data on each side, looking at the graph this is very likely to be true. Therefore, one could conclude that the female weight is (close to) a normal distribution.
male_mean <- mean(male$wgt)
male_sd <- sd(male$wgt)
sim_norm <- rnorm(n = nrow(male), mean = male_mean, sd = male_sd)
ggplot(data = data.frame(sim_norm), aes(sample = sim_norm)) +
stat_qq() +
stat_qq_line()
qqnormsim(sample = wgt, data = male)
The same reasoning as with the distribution of the weight of female participants can be applied to that of the male of participants. According to the rules of a normal distribution, that of the male weight seems to be one.
sim_norm1 <- rnorm
ggplot(data = data.frame(sim_norm), aes(sample = sim_norm)) +
stat_qq() +
stat_qq_line()
Since the weight of both genders is assumed to be a normal distribution, the distribution of both sexes is as well. It is very hard to tell which one would be closer to a normal distribution since the dots are that close together. Therefore, I will not make a statement on this.