Openintro Data set


Task a: How many male and female students are there in the data set? How many semesters are there in the data set?

ggplot(exam_grades) + geom_bar(aes(x=sex, fill = sex)) +
  labs(title = "Distribution by gender", 
       x = "Gender", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

count(exam_grades,sex)
## # A tibble: 2 × 2
##   sex       n
##   <chr> <int>
## 1 Man     188
## 2 Woman    45
ggplot(exam_grades) + geom_bar(aes(x = semester,fill = semester)) +
  labs(title = "Distribution by semester", 
       x = "Semester", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2))) 

Answer: There are 188 male and 45 female students. and total 6 semesters.


Task b: Create a new column named “average_score”, which is the average score of three exams for each student. Is this score the same as the score in the column course_grade?

exam_grades_average <- exam_grades %>%
  mutate(average_score = (exam1 + exam2 + exam3)/3) %>%
  print()
## # A tibble: 233 × 7
##    semester sex   exam1 exam2 exam3 course_grade average_score
##    <chr>    <chr> <dbl> <dbl> <dbl>        <dbl>         <dbl>
##  1 2000-1   Man    84.5  69.5  86.5         76.3          80.2
##  2 2000-1   Man    80    74    67           75.4          73.7
##  3 2000-1   Man    56    70    71.5         67.1          65.8
##  4 2000-1   Man    64    61    67.5         63.5          64.2
##  5 2000-1   Man    90.5  72.5  75           72.4          79.3
##  6 2000-1   Man    74    78.5  84.5         71.4          79  
##  7 2000-1   Man    60.5  44    58           56.1          54.2
##  8 2000-1   Man    89    82    88           78.0          86.3
##  9 2000-1   Woman  87.5  86.5  95           82.9          89.7
## 10 2000-1   Man    91    98    88           89.1          92.3
## # … with 223 more rows

Answer: Average score is not the same as the course_grade.


Task c: Find out the highest course grade among all records. Is that student female or male? In which semester did it happen?

my_data <- exam_grades %>%
  arrange(desc(course_grade)) %>%
  head(1) %>%
  print()
## # A tibble: 1 × 6
##   semester sex   exam1 exam2 exam3 course_grade
##   <chr>    <chr> <dbl> <dbl> <dbl>        <dbl>
## 1 2001-1   Man      98  96.5  98.5         97.6

Answer: The highest course grade is 97.5667, this is a male student, the score is for the semester 2001-1.


Task d: Now the data set is not tidy since columns exam1, exam2, exam3 are be values of a variable “exam_number”. Tidy the data and put exam numbers into exam_number column, and scores into a new column score. How many rows do you have now?

my_data2 <- exam_grades %>% 
  pivot_longer(c("exam1", "exam2","exam3"), names_to = "exam_number", values_to = "score")%>%
   print()
## # A tibble: 699 × 5
##    semester sex   course_grade exam_number score
##    <chr>    <chr>        <dbl> <chr>       <dbl>
##  1 2000-1   Man           76.3 exam1        84.5
##  2 2000-1   Man           76.3 exam2        69.5
##  3 2000-1   Man           76.3 exam3        86.5
##  4 2000-1   Man           75.4 exam1        80  
##  5 2000-1   Man           75.4 exam2        74  
##  6 2000-1   Man           75.4 exam3        67  
##  7 2000-1   Man           67.1 exam1        56  
##  8 2000-1   Man           67.1 exam2        70  
##  9 2000-1   Man           67.1 exam3        71.5
## 10 2000-1   Man           63.5 exam1        64  
## # … with 689 more rows

Answer: There are 699 rows now.


Task e: Find the mean score of all students in exam1, exam2 and exam3 respectively. Ignore NA values if there is any.

my_data2 %>%
  filter(!is.na(score)) %>%
  group_by(exam_number) %>%
  summarise(mean_score = mean(score)) %>%
  print()
## # A tibble: 3 × 2
##   exam_number mean_score
##   <chr>            <dbl>
## 1 exam1             80.8
## 2 exam2             72.6
## 3 exam3             75.5

Answer: The mean score of all students for exam1 is 80.76619, exam2 is 72.60558, exam3 is 75.47959.


Task f: Find the mean course grade of all students for each semester, respectively

my_data3<- my_data2 %>%
  filter(!is.na(course_grade)) %>%
  group_by(semester) %>%
  summarise(mean_course_grade = mean(course_grade)) %>%
  print()
## # A tibble: 6 × 2
##   semester mean_course_grade
##   <chr>                <dbl>
## 1 2000-1                72.3
## 2 2000-2                70.6
## 3 2001-1                76.7
## 4 2001-2                69.0
## 5 2002-1                72.3
## 6 2003-1                72.3

Answer: For the semester 2000-1, the mean corse grade is 72.31186; 2000-2 is 70.57298; 2001-1 is 76.72930; 2001-2 is 68.96486; 2002-1 is 72.27668; 2003-1 is 72.34206.


Task g: Create an appropriate graph to compare the course grades of male students and female students.

my_data4<- my_data2 %>%
  filter(!is.na(course_grade)) %>%
  group_by(sex) %>%
  print()
## # A tibble: 699 × 5
## # Groups:   sex [2]
##    semester sex   course_grade exam_number score
##    <chr>    <chr>        <dbl> <chr>       <dbl>
##  1 2000-1   Man           76.3 exam1        84.5
##  2 2000-1   Man           76.3 exam2        69.5
##  3 2000-1   Man           76.3 exam3        86.5
##  4 2000-1   Man           75.4 exam1        80  
##  5 2000-1   Man           75.4 exam2        74  
##  6 2000-1   Man           75.4 exam3        67  
##  7 2000-1   Man           67.1 exam1        56  
##  8 2000-1   Man           67.1 exam2        70  
##  9 2000-1   Man           67.1 exam3        71.5
## 10 2000-1   Man           63.5 exam1        64  
## # … with 689 more rows
ggplot(my_data4) + geom_boxplot(aes(x= course_grade, y = sex, color = sex)) +
  labs(title = "Course grade by gender", 
       x = "Course grade", 
       y = "Gender") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2))) 

Answer: The median score of female students is a little higher than male students.


Task h: Create an appropriate graph to compare the course grades of male students and female students in different semesters.

my_data5<- my_data2 %>%
  filter(!is.na(course_grade)) %>%
  group_by(semester, sex) %>%
  summarise(mean_course_grade = mean(course_grade)) %>%
  print()
## # A tibble: 12 × 3
## # Groups:   semester [6]
##    semester sex   mean_course_grade
##    <chr>    <chr>             <dbl>
##  1 2000-1   Man                71.6
##  2 2000-1   Woman              74.9
##  3 2000-2   Man                70.3
##  4 2000-2   Woman              71.6
##  5 2001-1   Man                77.7
##  6 2001-1   Woman              71.5
##  7 2001-2   Man                67.2
##  8 2001-2   Woman              76.6
##  9 2002-1   Man                71.5
## 10 2002-1   Woman              77.3
## 11 2003-1   Man                71.9
## 12 2003-1   Woman              73.6
ggplot(my_data5) + geom_point(aes(x = semester,  y =mean_course_grade,color = sex))  +
  labs(title = "Cours grade distribution by gender", 
       x = "Semester", 
       y = "Course grade") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Answer: In most semesters, female students’ course grade are higher than male’s, except semester 2001-1, male’s course grade is much higher than females’.


nycflights13 data set


Task a: What are the ten most common destinations for flights from NYC airports in 2013? Print a result that lists these in descending order of frequency and shows the number of flights heading to each airport.

flights_dest <-flights %>%
  count(dest) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  print()
## # A tibble: 10 × 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705

Answer: They are ORD, ATL,LAX,BOS,MCO,CLT,SFO,FLL,MIA,DCA.


Task b: Which airlines have the most flights departing from NYC airports in 2013? Print a result that lists these in descending order of frequency and shows the number of flights for each airline. Your result should have full names of each carrier such as “American Airlines Inc.”

flights1 <- flights %>% 
  left_join(airlines, by = "carrier")  %>%
  filter(!is.na(dep_time)) %>%
  count(name)%>%
  arrange(desc(n)) %>%
  print()
## # A tibble: 16 × 2
##    name                            n
##    <chr>                       <int>
##  1 United Air Lines Inc.       57979
##  2 JetBlue Airways             54169
##  3 ExpressJet Airlines Inc.    51356
##  4 Delta Air Lines Inc.        47761
##  5 American Airlines Inc.      32093
##  6 Envoy Air                   25163
##  7 US Airways Inc.             19873
##  8 Endeavor Air Inc.           17416
##  9 Southwest Airlines Co.      12083
## 10 Virgin America               5131
## 11 AirTran Airways Corporation  3187
## 12 Alaska Airlines Inc.          712
## 13 Frontier Airlines Inc.        682
## 14 Mesa Airlines Inc.            545
## 15 Hawaiian Airlines Inc.        342
## 16 SkyWest Airlines Inc.          29

Answer: United Air Lines Inc. has the most flights departuring from airports of NYC.


Task c: which day had the highest mean temperature in each of the origin airport? What was the average departure delay on that day for each origin airport?

flights2 <- flights %>%
  left_join(weather) %>%
  filter(!is.na(dep_delay), !is.na(temp)) %>%
  group_by(origin,year,month,day) %>%
  summarise(mean_temp = mean(temp), mean_delay = mean(dep_delay)) %>%
  arrange(desc(mean_temp))%>%
  select(mean_temp,mean_delay,everything()) %>%
  print()
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
## `summarise()` has grouped output by 'origin', 'year', 'month'. You can override
## using the `.groups` argument.
## # A tibble: 1,092 × 6
## # Groups:   origin, year, month [36]
##    mean_temp mean_delay origin  year month   day
##        <dbl>      <dbl> <chr>  <int> <int> <int>
##  1      94.0      15.3  LGA     2013     7    19
##  2      93.6      21.9  EWR     2013     7    19
##  3      93.5      14.1  EWR     2013     7    18
##  4      92.2      26.7  LGA     2013     7    18
##  5      90.2       4.76 EWR     2013     7    15
##  6      90.2      34.6  LGA     2013     7     7
##  7      90.0      16.7  JFK     2013     7    16
##  8      89.8      22.3  JFK     2013     7    18
##  9      89.6      39.5  EWR     2013     7     7
## 10      89.3      11.0  LGA     2013     7    17
## # … with 1,082 more rows

Answer: For LGA, the highest mean temp is on the July 19,2013, which is 93.97235F, the mean departure delay is 15.28 mins;for EWR, the highest mean temp is on the July, 19,2013, which is 93.56F, the mean departure delay is 21.93 mins; for JFK, the highest mean temp is on July 16,2013, which is 90.03 F, the mean departure delay is 16.72mins.


Task d: Create a visualization that effectively shows if there is a relationship between the average daily departure delay and the average daily temperature for all three New York city airports.

flights3 <- flights%>%
  left_join(weather)%>%
  filter(!is.na(dep_delay),!is.na(temp)) %>%
  group_by(origin, year, month, day) %>%
  summarise(mean_delay = mean(dep_delay), mean_temp= mean(temp))%>%
  print()
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
## `summarise()` has grouped output by 'origin', 'year', 'month'. You can override
## using the `.groups` argument.
## # A tibble: 1,092 × 6
## # Groups:   origin, year, month [36]
##    origin  year month   day mean_delay mean_temp
##    <chr>  <int> <int> <int>      <dbl>     <dbl>
##  1 EWR     2013     1     1      18.3       37.2
##  2 EWR     2013     1     2      25.3       29.7
##  3 EWR     2013     1     3       8.45      30.3
##  4 EWR     2013     1     4      12.1       36.0
##  5 EWR     2013     1     5       5.70      38.6
##  6 EWR     2013     1     6      12.4       42.5
##  7 EWR     2013     1     7       9.78      42.0
##  8 EWR     2013     1     8       3.81      41.0
##  9 EWR     2013     1     9       3.13      43.9
## 10 EWR     2013     1    10       5.58      45.2
## # … with 1,082 more rows
cor(flights3$mean_delay,flights3$mean_temp)
## [1] 0.1516537
ggplot(flights3, aes(x=mean_temp, y=mean_delay, color = origin, fill = origin)) + geom_point() + geom_smooth()  +
  labs(title = "Mean delay vs Mean temp by origin", 
       x = "Temperature (F)", 
       y = "Departure delay (mins)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Answer: It shows there is a very weak relationship between temp and delay for all three airports.