Homework 2

(a) Database gpa


Question 1: Do more study hours lead to a higher gpa for this data set?

This question is essentially asking whether the two numeric variables are related or independent. To see this in a graph, we can plot a smoothed trend line with geom_smooth, which shows the overall trend more clearly than a plain scatterplot.

ggplot(gpa) + geom_smooth(aes(x = studyweek, y = gpa))

It seems that the fitted line is close to horizontal, which suggests that the two variables are not strongly dependent on each other.

Linear fitting using the lm function

A quantitative way to answer the question is to perform a linear fit between the two variables using the lm function, which is straightforward to use:

result <- lm(gpa$gpa ~ gpa$studyweek)
print(result)
## 
## Call:
## lm(formula = gpa$gpa ~ gpa$studyweek)
## 
## Coefficients:
##   (Intercept)  gpa$studyweek  
##      3.578490       0.001127

This fits the data with the linear model

\[ y = ax + b \]

where \(y\) is gpa and \(x\) is studyweek. So the template of lm is lm(<y> ~ <x>), where <y> is the data for the dependent variable and <x> is the data for the independent variable.

The result returns the fitted values of \(a\) and \(b\), where \(b\) is the Intercept and \(a\) is the coefficient of \(x\) (the slope).
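As a side note, lm also accepts a data argument, so the columns can be referenced by name without the gpa$ prefix. A minimal sketch of the equivalent call (it gives the same coefficients as above):

lm(gpa ~ studyweek, data = gpa)  # equivalent fit using the data argument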

Our interest is in whether \(a\) is nonzero. In practice, however, the fitted \(a\) will never be exactly zero, so the correct thing to do here is to check the 95% confidence interval of \(a\) and see whether that interval contains zero. To do this, we use the confint function with the lm object as the input.

confint(result)
##                      2.5 %      97.5 %
## (Intercept)    3.408867546 3.748112927
## gpa$studyweek -0.006331395 0.008585977

As we see, the 95% confidence interval of \(a\) is \([-0.0063, 0.0086]\), which contains zero. Therefore, we can say that the data do not suggest that more study hours lead to a higher gpa.
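Equivalently, we could look at the p-value of the slope, which summary reports in the Pr(>|t|) column (output omitted here):

summary(result)  # the Pr(>|t|) column gives a p-value for each coefficient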

Similarly, we can now answer the second and the third questions.


Question 2: Do more going-out nights lead to a lower gpa for this data set?
ggplot(gpa) + geom_smooth(aes(x = out, y = gpa))

result <- lm(gpa$gpa ~ gpa$out)
print(result)
## 
## Call:
## lm(formula = gpa$gpa ~ gpa$out)
## 
## Coefficients:
## (Intercept)      gpa$out  
##     3.50425      0.04543
confint(result)
##                   2.5 %    97.5 %
## (Intercept)  3.29133359 3.7171683
## gpa$out     -0.04588533 0.1367508

Again, both the graph and the linear fit suggest that there is no significant positive or negative linear relationship between out and gpa.


Question 3: Is there a correlation between sleeping hours and the number of going-out nights?
ggplot(gpa) + geom_smooth(aes(x = sleepnight, y = out))

result <- lm(gpa$out ~ gpa$sleepnight)
print(result)
## 
## Call:
## lm(formula = gpa$out ~ gpa$sleepnight)
## 
## Coefficients:
##    (Intercept)  gpa$sleepnight  
##        -0.5147          0.3714
confint(result)
##                     2.5 %    97.5 %
## (Intercept)    -2.2808274 1.2515132
## gpa$sleepnight  0.1239877 0.6189011

Since the 95% confidence interval of the slope does not contain zero, we can say that there is evidence of a positive relationship between sleepnight and out.


Question 5: Do female students study more hours than male students in the data set?
Question 6: Do male students go out more than female students in the data set?
Question 7: Do female students have a better gpa than male students in the data set?

For these three questions, we are working on the relationship between a categorical variable and a continuous variable. We should create boxplots and do two-sample t-tests.

ggplot(gpa) + geom_boxplot(aes(x = gender, y = studyweek))

ggplot(gpa) + geom_boxplot(aes(x = gender, y = out))

ggplot(gpa) + geom_boxplot(aes(x = gender, y = gpa))

So we see that in each plot there is a difference in the median between the two groups. But could this just be due to sampling error? We need to perform quantitative tests.

data1 <- filter(gpa, gender == "male")
data2 <- filter(gpa, gender == "female")

t.test(data1$studyweek, data2$studyweek)
## 
##  Welch Two Sample t-test
## 
## data:  data1$studyweek and data2$studyweek
## t = -1.1076, df = 18.393, p-value = 0.2823
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.568441   3.882394
## sample estimates:
## mean of x mean of y 
##  15.75000  20.09302
t.test(data1$out, data2$out)
## 
##  Welch Two Sample t-test
## 
## data:  data1$out and data2$out
## t = 1.9056, df = 17.76, p-value = 0.07302
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06282028  1.27599858
## sample estimates:
## mean of x mean of y 
##  2.583333  1.976744
t.test(data1$gpa, data2$gpa)
## 
##  Welch Two Sample t-test
## 
## data:  data1$gpa and data2$gpa
## t = -0.39191, df = 14.535, p-value = 0.7008
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3307978  0.2282862
## sample estimates:
## mean of x mean of y 
##  3.560000  3.611256

We see that none of the p-values is below 0.05, so for none of the three variables can we conclude that the two groups have a statistically significant difference in means; the sample size here may simply be too small to detect such a difference.
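As a side note, t.test also has a formula interface that avoids splitting the data manually. A minimal sketch for the first comparison (output omitted; the other two variables are analogous):

t.test(studyweek ~ gender, data = gpa)  # same Welch two-sample t-test, written with a formula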

(b) Database loans_full_schema

Question 5: Is there a relationship between annual income and loan amount? Why?

We can now use the same analysis tools to answer this question. Let’s first create a smoothed trend line.

ggplot(loans_full_schema) + geom_smooth(aes(x = annual_income, y = loan_amount))

We get a rather unnatural curve here. What happened? Let’s check the scatterplot.

ggplot(loans_full_schema) + 
  geom_smooth(aes(x = annual_income, y = loan_amount)) +
  geom_point(aes(x = annual_income, y = loan_amount))

Now it’s clear that the part of the curve with \(x > 250{,}000\) is not very meaningful; it simply fits a few outliers. So we should filter those points out.

To make our plot look better, we may set the points more transparent and change the color of the line.

loans <- filter(loans_full_schema, annual_income < 250000)

ggplot(loans) + 
  geom_point(aes(x = annual_income, y = loan_amount), alpha = 0.2) +
  geom_smooth(aes(x = annual_income, y = loan_amount), color = "red") 

Inspecting the new plot, we see that there are some loans with zero annual income. These look suspicious, as common sense tells us that one should not lend to a person with no income at all.

Therefore it may be reasonable to remove those points and redo the fit.

loans <- filter(loans_full_schema, annual_income < 250000, annual_income > 0)

ggplot(loans) + 
  geom_point(aes(x = annual_income, y = loan_amount), alpha = 0.2) +
  geom_smooth(aes(x = annual_income, y = loan_amount), color = "red") 

Now we see a nicely increasing curve with a fairly small margin of error (the grey band). Therefore we suspect that there is a positive relationship between annual income and loan amount. Finally, let’s do the linear fit.

result <- lm(loans$loan_amount ~ loans$annual_income)
print(result)
## 
## Call:
## lm(formula = loans$loan_amount ~ loans$annual_income)
## 
## Coefficients:
##         (Intercept)  loans$annual_income  
##           8634.7929               0.1019
confint(result)
##                            2.5 %       97.5 %
## (Intercept)         8.249353e+03 9020.2323619
## loans$annual_income 9.735241e-02    0.1065137

So we see that the two variables roughly follow the formula

\[\mathrm{Loan\;Amount} \approx 8635 + 0.102 \times \mathrm{Annual\;Income}\].
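As a quick illustration of this formula, consider a hypothetical borrower with an annual income of $60,000 (not a value taken from the data set):

8634.79 + 0.1019 * 60000   # roughly 14,749 dollars of predicted loan amount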


Homework 3

(i) Basics in Data Transformation with the flights data set

(h) (Bonus, not required) Find flights on which weekday (from Monday to Sunday) had the longest departure delay on average.

To do this, we need to compute a new column weekday. First, checking the calendar, January 1st, 2013 was a Tuesday. Then we need a function that calculates the day number of each date, counted from Jan 1.

month_days = c(31,28,31,30,31,30,31,31,30,31,30,31) # The number of days in each month from January to December (2013 is not a leap year)

day_number <- function(month, day) {
  # Sum the lengths of the complete months before `month`, then add the day.
  # For month = 1, the index 0:(month-1) selects nothing, so the sum is 0.
  return(sum(month_days[0:(month-1)]) + day)
}
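
A quick sanity check of this helper: January 1 should be day 1, and February 1 should be day 32 since January has 31 days.

day_number(1, 1)   # 1
day_number(2, 1)   # 32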

my_data <- flights %>%
  mutate(day_number = map2_dbl(month, day, day_number),
    weekday = case_when(
    day_number %% 7 == 1 ~ "Tuesday",
    day_number %% 7 == 2 ~ "Wednesday",
    day_number %% 7 == 3 ~ "Thursday",
    day_number %% 7 == 4 ~ "Friday",
    day_number %% 7 == 5 ~ "Saturday",
    day_number %% 7 == 6 ~ "Sunday",
    .default = "Monday"
  )) %>%
  glimpse()
## Rows: 336,776
## Columns: 21
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ day_number     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ weekday        <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", …

In this code there are two things that we have not studied yet: defining our own function in R, and vectorising it using the map2_dbl function, which we will discuss later.
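As a side note, if the lubridate package is available, the weekday can be read off directly from the time_hour column without any manual counting. A minimal sketch, not used in the solution below:

flights %>%
  mutate(weekday = lubridate::wday(time_hour, label = TRUE, abbr = FALSE)) %>%  # day-of-week labels
  select(time_hour, weekday)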

After creating the label, things get easier:

my_data2 <- my_data %>%
  filter(!is.na(dep_delay)) %>%
  group_by(weekday) %>%
  summarise(mean_delay = mean(dep_delay)) %>%
  arrange(desc(mean_delay)) %>%
  print()
## # A tibble: 7 × 2
##   weekday   mean_delay
##   <chr>          <dbl>
## 1 Thursday       16.1 
## 2 Monday         14.8 
## 3 Friday         14.7 
## 4 Wednesday      11.8 
## 5 Sunday         11.6 
## 6 Tuesday        10.6 
## 7 Saturday        7.65


(ii) Analyzing the seattlepets data set


(d) How many names appear more than 100 times in the data set excluding “NA”?

Part (d) is relatively easy. We can count the occurrences of each pet name and then arrange them in descending order.

my_data <- seattlepets %>%
  filter(!is.na(animal_name)) %>%
  count(animal_name) %>%
  arrange(desc(n)) %>%
  print()
## # A tibble: 13,929 × 2
##    animal_name     n
##    <chr>       <int>
##  1 Lucy          439
##  2 Charlie       387
##  3 Luna          355
##  4 Bella         331
##  5 Max           270
##  6 Daisy         261
##  7 Molly         240
##  8 Jack          232
##  9 Lily          232
## 10 Stella        227
## # … with 13,919 more rows

To get the names that appear more than 100 times, we can simply filter the new summary table:

my_data %>%
  filter(n >= 100) %>%
  nrow()
## [1] 56

So there are 56 names that appear more than 100 times.


(e) For all names that appear more than 100 times in the data set, which has the highest “cat_to_dog” ratio? Which has the lowest? The “cat_to_dog” ratio can be computed this way - if a name appears 200 times, in which 150 are for cats and 50 are for dogs, the ratio is 150/50 = 3.

For part (e), we need to count for each name how many cats and dogs there are. To do this, we can group by name first, and create the summary that we need.

my_data <- seattlepets %>%
  filter(!is.na(animal_name)) %>%
  group_by(animal_name) %>% 
  summarise(n = n(), cat_n = sum(species == 'Cat'), dog_n = sum(species == 'Dog')) %>%
  arrange(desc(n)) %>%
  print()
## # A tibble: 13,929 × 4
##    animal_name     n cat_n dog_n
##    <chr>       <int> <int> <int>
##  1 Lucy          439   102   337
##  2 Charlie       387    81   306
##  3 Luna          355   111   244
##  4 Bella         331    82   249
##  5 Max           270    83   186
##  6 Daisy         261    40   221
##  7 Molly         240    54   186
##  8 Jack          232    65   167
##  9 Lily          232    86   146
## 10 Stella        227    42   185
## # … with 13,919 more rows

Now the job becomes easier. We can filter for n >= 100, add a new column containing the ratio, and then arrange the summary table by this ratio.

my_data %>%
  filter(n >= 100) %>%
  mutate(cat_to_dog = cat_n/dog_n) %>%
  arrange(desc(cat_to_dog)) %>%
  print()
## # A tibble: 56 × 5
##    animal_name     n cat_n dog_n cat_to_dog
##    <chr>       <int> <int> <int>      <dbl>
##  1 Shadow        132    53    79      0.671
##  2 Lily          232    86   146      0.589
##  3 Leo           150    54    96      0.562
##  4 Loki          115    40    75      0.533
##  5 Oliver        210    73   137      0.533
##  6 Gracie        142    48    94      0.511
##  7 Sam           101    34    67      0.507
##  8 Mia           105    34    71      0.479
##  9 Oscar         147    47   100      0.47 
## 10 Luna          355   111   244      0.455
## # … with 46 more rows

So among these names, “Shadow” has the highest cat_to_dog ratio!
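
The question also asks which of these names has the lowest cat_to_dog ratio. Arranging the same table in ascending order answers it; a sketch, with the output omitted here:

my_data %>%
  filter(n >= 100) %>%
  mutate(cat_to_dog = cat_n/dog_n) %>%
  arrange(cat_to_dog) %>%   # ascending order puts the lowest ratio first
  head(1)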