The gpa data set
This question is equivalent to asking whether the two numeric variables are independent. To see this clearly in a graph, we can plot a smoothed line, which is usually easier to read than a plain scatterplot.
ggplot(gpa) + geom_smooth(aes(x = studyweek, y = gpa))
The line is close to horizontal, which suggests that the two variables do not depend strongly on each other.
The lm function
A quantitative way to answer the question is to perform a linear fit between the two variables using the lm function. It is very straightforward to use:
result <- lm(gpa$gpa ~ gpa$studyweek)
print(result)
##
## Call:
## lm(formula = gpa$gpa ~ gpa$studyweek)
##
## Coefficients:
## (Intercept) gpa$studyweek
## 3.578490 0.001127
This fits the data with a linear model
\[ y = ax + b \]
where \(y\) is gpa and \(x\) is studyweek. So the template of lm is lm(<y> ~ <x>), where <y> is the data for the dependent variable and <x> is the data for the independent variable.
The result returns the fitted values of \(a\) and \(b\), where \(b\) is the Intercept and \(a\) is the coefficient of \(x\).
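As a side note, this is standard lm usage rather than anything specific to this exercise: the same fit can be written with the data argument, so the columns do not need the gpa$ prefix, and coef extracts the fitted coefficients.
result2 <- lm(gpa ~ studyweek, data = gpa) # column names are looked up inside the gpa data frame
coef(result2) # named vector with the same a and b as above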
Our interest is in whether \(a\) is nonzero. However, in practice \(a\) will never be exactly zero. So the correct thing to do here is to check the 95% confidence interval of \(a\) and see whether the interval contains zero. To do this, we use the confint function with the lm object as the input.
confint(result)
## 2.5 % 97.5 %
## (Intercept) 3.408867546 3.748112927
## gpa$studyweek -0.006331395 0.008585977
As we see, the 95% confidence interval of \(a\) is \([-0.0063, 0.0086]\), which contains zero. Therefore, the data do not suggest that more study hours lead to a higher gpa.
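Equivalently, and as a standard alternative not used above, summary of the lm object reports a p-value for each coefficient; a slope p-value above 0.05 tells the same story as a confidence interval that contains zero.
summary(result)$coefficients # the Pr(>|t|) column holds the p-value of each coefficient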
Similarly, we can now answer the second and third questions.
ggplot(gpa) + geom_smooth(aes(x = out, y = gpa))
result <- lm(gpa$gpa ~ gpa$out)
print(result)
##
## Call:
## lm(formula = gpa$gpa ~ gpa$out)
##
## Coefficients:
## (Intercept) gpa$out
## 3.50425 0.04543
confint(result)
## 2.5 % 97.5 %
## (Intercept) 3.29133359 3.7171683
## gpa$out -0.04588533 0.1367508
Again, both the graph and the linear fit suggest that there is no significant positive or negative linear relationship between out and gpa.
ggplot(gpa) + geom_smooth(aes(x = sleepnight, y = out))
result <- lm(gpa$out ~ gpa$sleepnight)
print(result)
##
## Call:
## lm(formula = gpa$out ~ gpa$sleepnight)
##
## Coefficients:
## (Intercept) gpa$sleepnight
## -0.5147 0.3714
confint(result)
## 2.5 % 97.5 %
## (Intercept) -2.2808274 1.2515132
## gpa$sleepnight 0.1239877 0.6189011
Since the 95% confidence interval does not contain zero, we can say that there is an evident positive relationship between out and sleepnight.
For these three questions, we are looking at the relationship between a categorical variable and a continuous variable. We should create boxplots and perform two-sample t-tests.
ggplot(gpa) + geom_boxplot(aes(x = gender, y = studyweek))
ggplot(gpa) + geom_boxplot(aes(x = gender, y = out))
ggplot(gpa) + geom_boxplot(aes(x = gender, y = gpa))
So we see that in each plot there is a difference in the median between the two groups. But is this just sampling error? We need to perform quantitative tests.
data1 <- filter(gpa, gender == "male")
data2 <- filter(gpa, gender == "female")
t.test(data1$studyweek, data2$studyweek)
##
## Welch Two Sample t-test
##
## data: data1$studyweek and data2$studyweek
## t = -1.1076, df = 18.393, p-value = 0.2823
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.568441 3.882394
## sample estimates:
## mean of x mean of y
## 15.75000 20.09302
t.test(data1$out, data2$out)
##
## Welch Two Sample t-test
##
## data: data1$out and data2$out
## t = 1.9056, df = 17.76, p-value = 0.07302
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06282028 1.27599858
## sample estimates:
## mean of x mean of y
## 2.583333 1.976744
t.test(data1$gpa, data2$gpa)
##
## Welch Two Sample t-test
##
## data: data1$gpa and data2$gpa
## t = -0.39191, df = 14.535, p-value = 0.7008
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3307978 0.2282862
## sample estimates:
## mean of x mean of y
## 3.560000 3.611256
We see that none of the p-values is below 0.05, so with this sample size we cannot conclude that the two groups differ significantly in the mean of any of the three variables.
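As an aside, the two filter calls can be avoided: t.test also accepts a formula with a two-level grouping variable, which is a standard alternative form.
t.test(studyweek ~ gender, data = gpa) # same Welch test; the sign of the difference depends on the group order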
The loans_full_schema data set
We can now use the same analysis tool to answer this question. Let’s first create a line graph.
ggplot(loans_full_schema) + geom_smooth(aes(x = annual_income, y = loan_amount))
We get a very unnatural curve here. What happened? Let’s check the scatterplot.
ggplot(loans_full_schema) +
geom_smooth(aes(x = annual_income, y = loan_amount)) +
geom_point(aes(x = annual_income, y = loan_amount))
Now it’s clear that the curve for \(x > 250{,}000\) is not very meaningful: it simply fits a few outliers. So we should filter those data out.
To make the plot look better, we can make the points more transparent and change the color of the line.
loans <- filter(loans_full_schema, annual_income < 250000)
ggplot(loans) +
geom_point(aes(x = annual_income, y = loan_amount), alpha = 0.2) +
geom_smooth(aes(x = annual_income, y = loan_amount), color = "red")
After inspecting the new plot, we see that there are some loans with zero annual income. These look suspicious, as common sense says that one should not lend to a person with no income at all. Therefore it is reasonable to remove those points and redo the fit.
loans <- filter(loans_full_schema, annual_income < 250000, annual_income > 0)
ggplot(loans) +
geom_point(aes(x = annual_income, y = loan_amount), alpha = 0.2) +
geom_smooth(aes(x = annual_income, y = loan_amount), color = "red")
Now we see a nicely increasing curve with a fairly small margin of error (the grey region). Therefore we suspect that there is a positive relationship between annual income and the loan amount. Finally, let’s do the linear fit.
result <- lm(loans$loan_amount ~ loans$annual_income)
print(result)
##
## Call:
## lm(formula = loans$loan_amount ~ loans$annual_income)
##
## Coefficients:
## (Intercept) loans$annual_income
## 8634.7929 0.1019
confint(result)
## 2.5 % 97.5 %
## (Intercept) 8.249353e+03 9020.2323619
## loans$annual_income 9.735241e-02 0.1065137
So we see that the two variables roughly follow the formula
\[\mathrm{Loan\;Amount} \approx 8635 + 0.102 \times \mathrm{Annual\;Income}.\]
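As a quick sanity check of this formula, we can plug in a few annual incomes (the values below are made-up illustrations, not rows from the data).
b <- coef(result)[1] # intercept, about 8635
a <- coef(result)[2] # slope, about 0.102
b + a * c(30000, 60000, 120000) # roughly 11700, 14700, 20900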
The flights data set
To do this, we need to create a new column weekday, which needs to be computed. First, checking the calendar, January 1st, 2013 was a Tuesday. Then we need a function to calculate the day number of each date, counted from Jan 1.
month_days <- c(31,28,31,30,31,30,31,31,30,31,30,31) # Days in each month from January to December (2013 is not a leap year)
day_number <- function(month, day) {
  # For month == 1, month_days[0:0] is empty (R drops index 0), so the sum is 0
  return(sum(month_days[0:(month-1)]) + day)
}
my_data <- flights %>%
mutate(day_number = map2_dbl(month, day, day_number),
weekday = case_when(
day_number %% 7 == 1 ~ "Tuesday",
day_number %% 7 == 2 ~ "Wednesday",
day_number %% 7 == 3 ~ "Thursday",
day_number %% 7 == 4 ~ "Friday",
day_number %% 7 == 5 ~ "Saturday",
day_number %% 7 == 6 ~ "Sunday",
.default = "Monday"
)) %>%
glimpse()
## Rows: 336,776
## Columns: 21
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ day_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ weekday <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", …
In this code there are two things that we have not studied yet: defining a function in R and vectorising it with the map2_dbl function. We will discuss both later.
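As an alternative, assuming the lubridate package is available (this is not the approach used above), the weekday can be read directly off the time_hour column with no manual day counting.
library(lubridate)
flights %>%
  mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) # wday() returns the day of the week as a labelled factor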
After creating the weekday label, things get easier:
my_data2 <- my_data %>%
filter(!is.na(dep_delay)) %>%
group_by(weekday) %>%
summarise(mean_delay = mean(dep_delay)) %>%
arrange(desc(mean_delay)) %>%
print()
## # A tibble: 7 × 2
## weekday mean_delay
## <chr> <dbl>
## 1 Thursday 16.1
## 2 Monday 14.8
## 3 Friday 14.7
## 4 Wednesday 11.8
## 5 Sunday 11.6
## 6 Tuesday 10.6
## 7 Saturday 7.65
So Thursday has the highest average departure delay, and Saturday the lowest.
The seattlepets data set
Part (d) is relatively easy. We can first group by pet names and then count and arrange.
my_data <- seattlepets %>%
filter(!is.na(animal_name)) %>%
count(animal_name) %>%
arrange(desc(n)) %>%
print()
## # A tibble: 13,929 × 2
## animal_name n
## <chr> <int>
## 1 Lucy 439
## 2 Charlie 387
## 3 Luna 355
## 4 Bella 331
## 5 Max 270
## 6 Daisy 261
## 7 Molly 240
## 8 Jack 232
## 9 Lily 232
## 10 Stella 227
## # … with 13,919 more rows
To get the names that appear at least 100 times, we can simply filter the new summary table:
my_data %>%
filter(n >= 100) %>%
nrow()
## [1] 56
So there are 56 names that appear at least 100 times.
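Equivalently, a one-line base-R check gives the same count:
sum(my_data$n >= 100) # number of names occurring at least 100 times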
For part (e), we need to count, for each name, how many cats and how many dogs there are. To do this, we can group by name first and then create the summary we need.
my_data <- seattlepets %>%
filter(!is.na(animal_name)) %>%
group_by(animal_name) %>%
summarise(n = n(), cat_n = sum(species == 'Cat'), dog_n = sum(species == 'Dog')) %>%
arrange(desc(n)) %>%
print()
## # A tibble: 13,929 × 4
## animal_name n cat_n dog_n
## <chr> <int> <int> <int>
## 1 Lucy 439 102 337
## 2 Charlie 387 81 306
## 3 Luna 355 111 244
## 4 Bella 331 82 249
## 5 Max 270 83 186
## 6 Daisy 261 40 221
## 7 Molly 240 54 186
## 8 Jack 232 65 167
## 9 Lily 232 86 146
## 10 Stella 227 42 185
## # … with 13,919 more rows
Now the job becomes easier. We can filter for n >= 100, then add a new column for the cat-to-dog ratio, and then arrange the summary table by that ratio.
my_data %>%
filter(n >= 100) %>%
mutate(cat_to_dog = cat_n/dog_n) %>%
arrange(desc(cat_to_dog)) %>%
print()
## # A tibble: 56 × 5
## animal_name n cat_n dog_n cat_to_dog
## <chr> <int> <int> <int> <dbl>
## 1 Shadow 132 53 79 0.671
## 2 Lily 232 86 146 0.589
## 3 Leo 150 54 96 0.562
## 4 Loki 115 40 75 0.533
## 5 Oliver 210 73 137 0.533
## 6 Gracie 142 48 94 0.511
## 7 Sam 101 34 67 0.507
## 8 Mia 105 34 71 0.479
## 9 Oscar 147 47 100 0.47
## 10 Luna 355 111 244 0.455
## # … with 46 more rows
So the name “Shadow” wins in this category!