Section 1 of Exam 1: Regression Model

2a.

Predict Cited_by_Patent_Count utilizing a regression model.

pat_lm <- lm(Cited_by_Patent_Count ~ Cites_Patent_Count, data = Training)
summary(pat_lm)
## 
## Call:
## lm(formula = Cited_by_Patent_Count ~ Cites_Patent_Count, data = Training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9469 -0.0523 -0.0523 -0.0523  3.7661 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.0522603  0.0130896   3.993 7.32e-05 ***
## Cites_Patent_Count 0.0026705  0.0005322   5.018 6.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3223 on 624 degrees of freedom
## Multiple R-squared:  0.03879,    Adjusted R-squared:  0.03725 
## F-statistic: 25.18 on 1 and 624 DF,  p-value: 6.811e-07

The approximate relationship between Cited_by_Patent_Count and Cites_Patent_Count is as follows:

\[\hat{Cited\_by\_Patent\_Count} \approx 0.05226 + 0.00267 \times Cites\_Patent\_Count\]

2b.

Confidence Interval

confint(pat_lm, level = 0.90)
##                            5 %        95 %
## (Intercept)        0.030697775 0.073822761
## Cites_Patent_Count 0.001793859 0.003547102

There is a 90% chance that Cites_Patent_Count will reside between .0018 and .0035.

2c.

There is a 90% chance that the intercept for the regression model will reside between .0307 and .0738.

2d.

Evaluation of Cook’s Distance

cooks <- cooks.distance(pat_lm)

plot(pat_lm, which = 4)

influential <- cooks[(cooks > (3 * mean(cooks, na.rm = TRUE)))]
influential
74 0.0611118
127 0.0923969
179 0.0738189
231 0.8849406
296 0.0972799
417 0.0711123
505 0.6775879
567 2.6352514
602 2.4638272

Evaluating cook’s distance can happen in various ways. The first of which is identifying points that are > than one. Any value for cook’s distance that is greater than 1 is generally considered to be an outlier and should be investigated further. Additionally, another way to identify influential points within in a data set is seeking out observations that have a distance that is 3 times greater than the mean of all values.

In doing so, there are two observations that have a value for cook’s distance that are greater than one. These occur at observations 567 and 602. Meaning that they are outliers, and should be investigated to see the impact that they have on the model. Additionally, the other observations listed in the table above could also be considered influential on the overall model, but not outliers persay.

2e.

Evaluating residual plot

plot(pat_lm, which = 2)

The following QQ-plot would be indicative that the residuals are not normally distributed.

Key Notes:

  • Appearance of a LARGE gap in values and incremental increase in tails
  • QQ-Plot is heavy tailed, meaning that there is a large increase or decrease at the beginning or end of range
  • QQ-line is nearly parallel to the x-axis

For these reasons, a conclusion can be made that the residuals are not normally distributed.

3.

Model Predictions

obs_for_pred = data.frame(Cites_Patent_Count = c(340,300))

predict(pat_lm,obs_for_pred)
##         1         2 
## 0.9602238 0.8534046

The predicted value of Cited_Patent_Count when Cites_patent_Count is equal to 340 is .960

The predicted value of Cited_Patent_Count when Cites_patent_Count is equal to 300 is .853

4.

Determine if these predicted values are extrapolation

leverage <- hatvalues(pat_lm)
max(leverage)
## [1] 0.2995991
x_new  = c(1, 340)
x_new2 = c(1,300)
X = model.matrix(pat_lm)

calc_lev1 <- t(x_new)%*%solve(t(X)%*%X)%*%x_new
calc_lev2 <- t(x_new2)%*%solve(t(X)%*%X)%*%x_new2
Prediction Values Calculated Leverage Max Leverage from Model
340 0.3086801 0.2995991
300 0.2398486 0.2995991

When Cites_patent_Count is equal to 340 the leverage is equal to .309, which is greater than the model max, meaning that this point is considered to be extrapolation.

When Cites_patent_Count is equal to 300 the leverage is equal to .240, which is less than the model max, meaning that this point is not considered to be extrapolation.

Section 2 of Exam 1: Joining Data

7.

The four types of mutating joins learned in class are:

  • Inner Join
  • Full Join
  • Left Join
  • Right Join

8a.

Perform an inner join on Training Data and Sequence Counts

Sequen %>% inner_join(Training, by = c("Patent_No." = "Patent_Number")) -> Inner_Joined_Data
Inner_Joined_Data
## # A tibble: 222 × 4
##    Patent_No.        Sequence_Count Cites_Patent_Count Cited_by_Patent_Count
##    <chr>                      <dbl>              <dbl>                 <dbl>
##  1 CA 189065 S                    0                  0                     0
##  2 NI 202000072 A                 0                  0                     0
##  3 KR 20210032013 A              80                  0                     0
##  4 PH 12020550461 A1              0                  0                     0
##  5 TW I722568 B                   0                  0                     0
##  6 CN 112533674 A                 0                  0                     2
##  7 CO 2021002954 A2               0                  0                     0
##  8 JP 2021040642 A                0                  1                     0
##  9 AU 2019/341683 A1              0                  0                     0
## 10 JP 2021042204 A                0                  4                     0
## # … with 212 more rows

8b.

Perform a full join

 Sequen %>% full_join(Training, by = c("Patent_No." = "Patent_Number")) -> Full_Joined_Data
Full_Joined_Data
## # A tibble: 626 × 4
##    Patent_No.        Sequence_Count Cites_Patent_Count Cited_by_Patent_Count
##    <chr>                      <dbl>              <dbl>                 <dbl>
##  1 CA 189065 S                    0                  0                     0
##  2 NI 202000072 A                 0                  0                     0
##  3 KR 20210032013 A              80                  0                     0
##  4 PH 12020550461 A1              0                  0                     0
##  5 TW I722568 B                   0                  0                     0
##  6 CN 112533674 A                 0                  0                     2
##  7 CO 2021002954 A2               0                  0                     0
##  8 JP 2021040642 A                0                  1                     0
##  9 AU 2019/341683 A1              0                  0                     0
## 10 JP 2021042204 A                0                  4                     0
## # … with 616 more rows

8c.

Perform a join so that only the patents in the sequence counts data set are included.

Sequen %>% left_join(Training, by = c("Patent_No." = "Patent_Number")) -> Left_Joined_Data
Left_Joined_Data
## # A tibble: 222 × 4
##    Patent_No.        Sequence_Count Cites_Patent_Count Cited_by_Patent_Count
##    <chr>                      <dbl>              <dbl>                 <dbl>
##  1 CA 189065 S                    0                  0                     0
##  2 NI 202000072 A                 0                  0                     0
##  3 KR 20210032013 A              80                  0                     0
##  4 PH 12020550461 A1              0                  0                     0
##  5 TW I722568 B                   0                  0                     0
##  6 CN 112533674 A                 0                  0                     2
##  7 CO 2021002954 A2               0                  0                     0
##  8 JP 2021040642 A                0                  1                     0
##  9 AU 2019/341683 A1              0                  0                     0
## 10 JP 2021042204 A                0                  4                     0
## # … with 212 more rows

Note: I did make my best attempt of formatting this table in a similar way as the smaller ones are. Unfortunately, for me I could not figure out how to add a scroll bar as they are very large, but also keep the header section colored. I tried various methods, if you know how to do this please let me know, or of a better method to create a pretty table.

9a.

Group the Training and Testing Data by Partition. Summarize each grouping by calculating the standard deviation of Cites_Patent_Count for each group.

summarization <- Train_Test %>%
  group_by(Partition) %>%
  summarise(`Standard Deviation` = round(sd(Cites_Patent_Count),3))
Partition Standard Deviation
0 24.228
1 16.685

9b.

Identify the top 3 patents with the highest number of citations in the Cited_by_Patent_Count column.

top_citations <- Train_Test %>%
  arrange(desc(Cited_by_Patent_Count)) 
Patent_Number Cites_Patent_Count Cited_by_Patent_Count Partition
WO 2021/250648 A1 3 11 1
WO 2021/084429 A1 68 4 0
US 11014911 B2 13 3 0

The table presents the top 3 Cited_by_Patent_Count

Filtered_data <- Train_Test %>%
  filter(Cites_Patent_Count > 0) %>%
  arrange(Cites_Patent_Count)
Filtered_data
## # A tibble: 137 × 4
##    Patent_Number      Cites_Patent_Count Cited_by_Patent_Count Partition
##    <chr>                           <dbl>                 <dbl>     <dbl>
##  1 US 2021/0206757 A1                  1                     0         0
##  2 AU 2018/372109 B2                   1                     0         0
##  3 AU 2018/275359 B2                   1                     0         0
##  4 TW I732431 B                        1                     0         0
##  5 EP 3630789 A4                       1                     0         0
##  6 TW I729530 B                        1                     0         0
##  7 AU 2016/222928 B2                   1                     0         0
##  8 TW I726942 B                        1                     0         0
##  9 US 2021/0128729 A1                  1                     0         0
## 10 EP 3668495 A4                       1                     0         0
## # … with 127 more rows

Sorted to ensure that Cites_Patent_Count really was filtered. Unsorted version below:

Train_Test %>%
  filter(Cites_Patent_Count > 0)
## # A tibble: 137 × 4
##    Patent_Number      Cites_Patent_Count Cited_by_Patent_Count Partition
##    <chr>                           <dbl>                 <dbl>     <dbl>
##  1 JP 2021100972 A                     3                     0         0
##  2 US 2021/0206757 A1                  1                     0         0
##  3 EP 3096786 B1                      10                     0         0
##  4 EP 3096783 B1                      16                     0         0
##  5 AU 2018/372109 B2                   1                     0         0
##  6 AU 2018/275359 B2                   1                     0         0
##  7 JP 2021098716 A                     2                     0         0
##  8 TW I732431 B                        1                     0         0
##  9 RU 2750454 C2                       2                     0         0
## 10 WO 2021/124210 A1                   6                     0         0
## # … with 127 more rows