ggplot(exam_grades) + geom_bar(aes(x=sex, fill = sex)) +
labs(title = "Distribution by gender",
x = "Gender",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
count(exam_grades,sex)
## # A tibble: 2 × 2
## sex n
## <chr> <int>
## 1 Man 188
## 2 Woman 45
ggplot(exam_grades) + geom_bar(aes(x = semester,fill = semester)) +
labs(title = "Distribution by semester",
x = "Semester",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Answer: There are 188 male and 45 female students. and total 6 semesters.
exam_grades_average <- exam_grades %>%
mutate(average_score = (exam1 + exam2 + exam3)/3) %>%
print()
## # A tibble: 233 × 7
## semester sex exam1 exam2 exam3 course_grade average_score
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-1 Man 84.5 69.5 86.5 76.3 80.2
## 2 2000-1 Man 80 74 67 75.4 73.7
## 3 2000-1 Man 56 70 71.5 67.1 65.8
## 4 2000-1 Man 64 61 67.5 63.5 64.2
## 5 2000-1 Man 90.5 72.5 75 72.4 79.3
## 6 2000-1 Man 74 78.5 84.5 71.4 79
## 7 2000-1 Man 60.5 44 58 56.1 54.2
## 8 2000-1 Man 89 82 88 78.0 86.3
## 9 2000-1 Woman 87.5 86.5 95 82.9 89.7
## 10 2000-1 Man 91 98 88 89.1 92.3
## # … with 223 more rows
Answer: Average score is not the same as the course_grade.
my_data <- exam_grades %>%
arrange(desc(course_grade)) %>%
head(1) %>%
print()
## # A tibble: 1 × 6
## semester sex exam1 exam2 exam3 course_grade
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2001-1 Man 98 96.5 98.5 97.6
Answer: The highest course grade is 97.5667, this is a male student, the score is for the semester 2001-1.
my_data2 <- exam_grades %>%
pivot_longer(c("exam1", "exam2","exam3"), names_to = "exam_number", values_to = "score")%>%
print()
## # A tibble: 699 × 5
## semester sex course_grade exam_number score
## <chr> <chr> <dbl> <chr> <dbl>
## 1 2000-1 Man 76.3 exam1 84.5
## 2 2000-1 Man 76.3 exam2 69.5
## 3 2000-1 Man 76.3 exam3 86.5
## 4 2000-1 Man 75.4 exam1 80
## 5 2000-1 Man 75.4 exam2 74
## 6 2000-1 Man 75.4 exam3 67
## 7 2000-1 Man 67.1 exam1 56
## 8 2000-1 Man 67.1 exam2 70
## 9 2000-1 Man 67.1 exam3 71.5
## 10 2000-1 Man 63.5 exam1 64
## # … with 689 more rows
Answer: There are 699 rows now.
my_data2 %>%
filter(!is.na(score)) %>%
group_by(exam_number) %>%
summarise(mean_score = mean(score)) %>%
print()
## # A tibble: 3 × 2
## exam_number mean_score
## <chr> <dbl>
## 1 exam1 80.8
## 2 exam2 72.6
## 3 exam3 75.5
Answer: The mean score of all students for exam1 is 80.76619, exam2 is 72.60558, exam3 is 75.47959.
my_data3<- my_data2 %>%
filter(!is.na(course_grade)) %>%
group_by(semester) %>%
summarise(mean_course_grade = mean(course_grade)) %>%
print()
## # A tibble: 6 × 2
## semester mean_course_grade
## <chr> <dbl>
## 1 2000-1 72.3
## 2 2000-2 70.6
## 3 2001-1 76.7
## 4 2001-2 69.0
## 5 2002-1 72.3
## 6 2003-1 72.3
Answer: For the semester 2000-1, the mean corse grade is 72.31186; 2000-2 is 70.57298; 2001-1 is 76.72930; 2001-2 is 68.96486; 2002-1 is 72.27668; 2003-1 is 72.34206.
my_data4<- my_data2 %>%
filter(!is.na(course_grade)) %>%
group_by(sex) %>%
print()
## # A tibble: 699 × 5
## # Groups: sex [2]
## semester sex course_grade exam_number score
## <chr> <chr> <dbl> <chr> <dbl>
## 1 2000-1 Man 76.3 exam1 84.5
## 2 2000-1 Man 76.3 exam2 69.5
## 3 2000-1 Man 76.3 exam3 86.5
## 4 2000-1 Man 75.4 exam1 80
## 5 2000-1 Man 75.4 exam2 74
## 6 2000-1 Man 75.4 exam3 67
## 7 2000-1 Man 67.1 exam1 56
## 8 2000-1 Man 67.1 exam2 70
## 9 2000-1 Man 67.1 exam3 71.5
## 10 2000-1 Man 63.5 exam1 64
## # … with 689 more rows
ggplot(my_data4) + geom_boxplot(aes(x= course_grade, y = sex, color = sex)) +
labs(title = "Course grade by gender",
x = "Course grade",
y = "Gender") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Answer: The median score of female students is a little
higher than male students.
my_data5<- my_data2 %>%
filter(!is.na(course_grade)) %>%
group_by(semester, sex) %>%
summarise(mean_course_grade = mean(course_grade)) %>%
print()
## # A tibble: 12 × 3
## # Groups: semester [6]
## semester sex mean_course_grade
## <chr> <chr> <dbl>
## 1 2000-1 Man 71.6
## 2 2000-1 Woman 74.9
## 3 2000-2 Man 70.3
## 4 2000-2 Woman 71.6
## 5 2001-1 Man 77.7
## 6 2001-1 Woman 71.5
## 7 2001-2 Man 67.2
## 8 2001-2 Woman 76.6
## 9 2002-1 Man 71.5
## 10 2002-1 Woman 77.3
## 11 2003-1 Man 71.9
## 12 2003-1 Woman 73.6
ggplot(my_data5) + geom_point(aes(x = semester, y =mean_course_grade,color = sex)) +
labs(title = "Cours grade distribution by gender",
x = "Semester",
y = "Course grade") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Answer: In most semesters, female students’ course
grade are higher than male’s, except semester 2001-1, male’s course
grade is much higher than females’.
flights_dest <-flights %>%
count(dest) %>%
arrange(desc(n)) %>%
head(10) %>%
print()
## # A tibble: 10 × 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
Answer: They are ORD, ATL,LAX,BOS,MCO,CLT,SFO,FLL,MIA,DCA.
flights1 <- flights %>%
left_join(airlines, by = "carrier") %>%
filter(!is.na(dep_time)) %>%
count(name)%>%
arrange(desc(n)) %>%
print()
## # A tibble: 16 × 2
## name n
## <chr> <int>
## 1 United Air Lines Inc. 57979
## 2 JetBlue Airways 54169
## 3 ExpressJet Airlines Inc. 51356
## 4 Delta Air Lines Inc. 47761
## 5 American Airlines Inc. 32093
## 6 Envoy Air 25163
## 7 US Airways Inc. 19873
## 8 Endeavor Air Inc. 17416
## 9 Southwest Airlines Co. 12083
## 10 Virgin America 5131
## 11 AirTran Airways Corporation 3187
## 12 Alaska Airlines Inc. 712
## 13 Frontier Airlines Inc. 682
## 14 Mesa Airlines Inc. 545
## 15 Hawaiian Airlines Inc. 342
## 16 SkyWest Airlines Inc. 29
Answer: United Air Lines Inc. has the most flights departuring from airports of NYC.
flights2 <- flights %>%
left_join(weather) %>%
filter(!is.na(dep_delay), !is.na(temp)) %>%
group_by(origin,year,month,day) %>%
summarise(mean_temp = mean(temp), mean_delay = mean(dep_delay)) %>%
arrange(desc(mean_temp))%>%
select(mean_temp,mean_delay,everything()) %>%
print()
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
## `summarise()` has grouped output by 'origin', 'year', 'month'. You can override
## using the `.groups` argument.
## # A tibble: 1,092 × 6
## # Groups: origin, year, month [36]
## mean_temp mean_delay origin year month day
## <dbl> <dbl> <chr> <int> <int> <int>
## 1 94.0 15.3 LGA 2013 7 19
## 2 93.6 21.9 EWR 2013 7 19
## 3 93.5 14.1 EWR 2013 7 18
## 4 92.2 26.7 LGA 2013 7 18
## 5 90.2 4.76 EWR 2013 7 15
## 6 90.2 34.6 LGA 2013 7 7
## 7 90.0 16.7 JFK 2013 7 16
## 8 89.8 22.3 JFK 2013 7 18
## 9 89.6 39.5 EWR 2013 7 7
## 10 89.3 11.0 LGA 2013 7 17
## # … with 1,082 more rows
Answer: For LGA, the highest mean temp is on the July 19,2013, which is 93.97235F, the mean departure delay is 15.28 mins;for EWR, the highest mean temp is on the July, 19,2013, which is 93.56F, the mean departure delay is 21.93 mins; for JFK, the highest mean temp is on July 16,2013, which is 90.03 F, the mean departure delay is 16.72mins.
flights3 <- flights%>%
left_join(weather)%>%
filter(!is.na(dep_delay),!is.na(temp)) %>%
group_by(origin, year, month, day) %>%
summarise(mean_delay = mean(dep_delay), mean_temp= mean(temp))%>%
print()
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
## `summarise()` has grouped output by 'origin', 'year', 'month'. You can override
## using the `.groups` argument.
## # A tibble: 1,092 × 6
## # Groups: origin, year, month [36]
## origin year month day mean_delay mean_temp
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 EWR 2013 1 1 18.3 37.2
## 2 EWR 2013 1 2 25.3 29.7
## 3 EWR 2013 1 3 8.45 30.3
## 4 EWR 2013 1 4 12.1 36.0
## 5 EWR 2013 1 5 5.70 38.6
## 6 EWR 2013 1 6 12.4 42.5
## 7 EWR 2013 1 7 9.78 42.0
## 8 EWR 2013 1 8 3.81 41.0
## 9 EWR 2013 1 9 3.13 43.9
## 10 EWR 2013 1 10 5.58 45.2
## # … with 1,082 more rows
cor(flights3$mean_delay,flights3$mean_temp)
## [1] 0.1516537
ggplot(flights3, aes(x=mean_temp, y=mean_delay, color = origin, fill = origin)) + geom_point() + geom_smooth() +
labs(title = "Mean delay vs Mean temp by origin",
x = "Temperature (F)",
y = "Departure delay (mins)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Answer: It shows there is a very weak relationship
between temp and delay for all three airports.