Section 1 of Exam 1: Regression Model
2a.
Predict Cited_by_Patent_Count utilizing a regression model.
##
## Call:
## lm(formula = Cited_by_Patent_Count ~ Cites_Patent_Count, data = Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9469 -0.0523 -0.0523 -0.0523 3.7661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0522603 0.0130896 3.993 7.32e-05 ***
## Cites_Patent_Count 0.0026705 0.0005322 5.018 6.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3223 on 624 degrees of freedom
## Multiple R-squared: 0.03879, Adjusted R-squared: 0.03725
## F-statistic: 25.18 on 1 and 624 DF, p-value: 6.811e-07
The approximate relationship between Cited_by_Patent_Count and Cites_Patent_Count is as follows:
\[\hat{Cited\_by\_Patent\_Count} \approx 0.05226 + 0.00267 \times Cites\_Patent\_Count\]
2b.
Confidence Interval
## 5 % 95 %
## (Intercept) 0.030697775 0.073822761
## Cites_Patent_Count 0.001793859 0.003547102
There is a 90% chance that Cites_Patent_Count will reside between .0018 and .0035.
2c.
There is a 90% chance that the intercept for the regression model will reside between .0307 and .0738.
2d.
Evaluation of Cook’s Distance
| influential | |
|---|---|
| 74 | 0.0611118 |
| 127 | 0.0923969 |
| 179 | 0.0738189 |
| 231 | 0.8849406 |
| 296 | 0.0972799 |
| 417 | 0.0711123 |
| 505 | 0.6775879 |
| 567 | 2.6352514 |
| 602 | 2.4638272 |
Evaluating cook’s distance can happen in various ways. The first of which is identifying points that are > than one. Any value for cook’s distance that is greater than 1 is generally considered to be an outlier and should be investigated further. Additionally, another way to identify influential points within in a data set is seeking out observations that have a distance that is 3 times greater than the mean of all values.
In doing so, there are two observations that have a value for cook’s distance that are greater than one. These occur at observations 567 and 602. Meaning that they are outliers, and should be investigated to see the impact that they have on the model. Additionally, the other observations listed in the table above could also be considered influential on the overall model, but not outliers persay.
2e.
Evaluating residual plot
The following QQ-plot would be indicative that the residuals are not normally distributed.
Key Notes:
- Appearance of a LARGE gap in values and incremental increase in tails
- QQ-Plot is heavy tailed, meaning that there is a large increase or decrease at the beginning or end of range
- QQ-line is nearly parallel to the x-axis
For these reasons, a conclusion can be made that the residuals are not normally distributed.
3.
Model Predictions
## 1 2
## 0.9602238 0.8534046
The predicted value of Cited_Patent_Count when Cites_patent_Count is equal to 340 is .960
The predicted value of Cited_Patent_Count when Cites_patent_Count is equal to 300 is .853
4.
Determine if these predicted values are extrapolation
## [1] 0.2995991
x_new = c(1, 340)
x_new2 = c(1,300)
X = model.matrix(pat_lm)
calc_lev1 <- t(x_new)%*%solve(t(X)%*%X)%*%x_new
calc_lev2 <- t(x_new2)%*%solve(t(X)%*%X)%*%x_new2| Prediction Values | Calculated Leverage | Max Leverage from Model |
|---|---|---|
| 340 | 0.3086801 | 0.2995991 |
| 300 | 0.2398486 | 0.2995991 |
When Cites_patent_Count is equal to 340 the leverage is equal to .309, which is greater than the model max, meaning that this point is considered to be extrapolation.
When Cites_patent_Count is equal to 300 the leverage is equal to .240, which is less than the model max, meaning that this point is not considered to be extrapolation.
Section 2 of Exam 1: Joining Data
7.
The four types of mutating joins learned in class are:
- Inner Join
- Full Join
- Left Join
- Right Join
8a.
Perform an inner join on Training Data and Sequence Counts
Sequen %>% inner_join(Training, by = c("Patent_No." = "Patent_Number")) -> Inner_Joined_Data
Inner_Joined_Data## # A tibble: 222 × 4
## Patent_No. Sequence_Count Cites_Patent_Count Cited_by_Patent_Count
## <chr> <dbl> <dbl> <dbl>
## 1 CA 189065 S 0 0 0
## 2 NI 202000072 A 0 0 0
## 3 KR 20210032013 A 80 0 0
## 4 PH 12020550461 A1 0 0 0
## 5 TW I722568 B 0 0 0
## 6 CN 112533674 A 0 0 2
## 7 CO 2021002954 A2 0 0 0
## 8 JP 2021040642 A 0 1 0
## 9 AU 2019/341683 A1 0 0 0
## 10 JP 2021042204 A 0 4 0
## # … with 212 more rows
8b.
Perform a full join
Sequen %>% full_join(Training, by = c("Patent_No." = "Patent_Number")) -> Full_Joined_Data
Full_Joined_Data## # A tibble: 626 × 4
## Patent_No. Sequence_Count Cites_Patent_Count Cited_by_Patent_Count
## <chr> <dbl> <dbl> <dbl>
## 1 CA 189065 S 0 0 0
## 2 NI 202000072 A 0 0 0
## 3 KR 20210032013 A 80 0 0
## 4 PH 12020550461 A1 0 0 0
## 5 TW I722568 B 0 0 0
## 6 CN 112533674 A 0 0 2
## 7 CO 2021002954 A2 0 0 0
## 8 JP 2021040642 A 0 1 0
## 9 AU 2019/341683 A1 0 0 0
## 10 JP 2021042204 A 0 4 0
## # … with 616 more rows
8c.
Perform a join so that only the patents in the sequence counts data set are included.
Sequen %>% left_join(Training, by = c("Patent_No." = "Patent_Number")) -> Left_Joined_Data
Left_Joined_Data## # A tibble: 222 × 4
## Patent_No. Sequence_Count Cites_Patent_Count Cited_by_Patent_Count
## <chr> <dbl> <dbl> <dbl>
## 1 CA 189065 S 0 0 0
## 2 NI 202000072 A 0 0 0
## 3 KR 20210032013 A 80 0 0
## 4 PH 12020550461 A1 0 0 0
## 5 TW I722568 B 0 0 0
## 6 CN 112533674 A 0 0 2
## 7 CO 2021002954 A2 0 0 0
## 8 JP 2021040642 A 0 1 0
## 9 AU 2019/341683 A1 0 0 0
## 10 JP 2021042204 A 0 4 0
## # … with 212 more rows
Note: I did make my best attempt of formatting this table in a similar way as the smaller ones are. Unfortunately, for me I could not figure out how to add a scroll bar as they are very large, but also keep the header section colored. I tried various methods, if you know how to do this please let me know, or of a better method to create a pretty table.
9a.
Group the Training and Testing Data by Partition. Summarize each grouping by calculating the standard deviation of Cites_Patent_Count for each group.
summarization <- Train_Test %>%
group_by(Partition) %>%
summarise(`Standard Deviation` = round(sd(Cites_Patent_Count),3))| Partition | Standard Deviation |
|---|---|
| 0 | 24.228 |
| 1 | 16.685 |
9b.
Identify the top 3 patents with the highest number of citations in the Cited_by_Patent_Count column.
| Patent_Number | Cites_Patent_Count | Cited_by_Patent_Count | Partition |
|---|---|---|---|
| WO 2021/250648 A1 | 3 | 11 | 1 |
| WO 2021/084429 A1 | 68 | 4 | 0 |
| US 11014911 B2 | 13 | 3 | 0 |
The table presents the top 3 Cited_by_Patent_Count
Filtered_data <- Train_Test %>%
filter(Cites_Patent_Count > 0) %>%
arrange(Cites_Patent_Count)
Filtered_data## # A tibble: 137 × 4
## Patent_Number Cites_Patent_Count Cited_by_Patent_Count Partition
## <chr> <dbl> <dbl> <dbl>
## 1 US 2021/0206757 A1 1 0 0
## 2 AU 2018/372109 B2 1 0 0
## 3 AU 2018/275359 B2 1 0 0
## 4 TW I732431 B 1 0 0
## 5 EP 3630789 A4 1 0 0
## 6 TW I729530 B 1 0 0
## 7 AU 2016/222928 B2 1 0 0
## 8 TW I726942 B 1 0 0
## 9 US 2021/0128729 A1 1 0 0
## 10 EP 3668495 A4 1 0 0
## # … with 127 more rows
Sorted to ensure that Cites_Patent_Count really was filtered. Unsorted version below:
## # A tibble: 137 × 4
## Patent_Number Cites_Patent_Count Cited_by_Patent_Count Partition
## <chr> <dbl> <dbl> <dbl>
## 1 JP 2021100972 A 3 0 0
## 2 US 2021/0206757 A1 1 0 0
## 3 EP 3096786 B1 10 0 0
## 4 EP 3096783 B1 16 0 0
## 5 AU 2018/372109 B2 1 0 0
## 6 AU 2018/275359 B2 1 0 0
## 7 JP 2021098716 A 2 0 0
## 8 TW I732431 B 1 0 0
## 9 RU 2750454 C2 2 0 0
## 10 WO 2021/124210 A1 6 0 0
## # … with 127 more rows