Section 1 of Exam 1: Regression Model

Red font is changes

Black font is directions.

Green font is questions.

Blue font is answers.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
  1. Using the formatting learned in the first class, type the following header in R Markdown:

See above header

Read the csv file titled Training Data into R. All rows of this dataset should be used as your training set in the development of a regression model. This dataset contains information about some of Pfizer’s patent applications. Each patent application has a unique ID given to it, and this ID is listed in the Patent_Number column. The column titled Cites_Patent_Count lists the number of patent citations made within the corresponding patent application, and the column titled Cited_by_Patent_Count lists the number of patents which cite the document identified by the Patent_Number. The Cited_by_Patent_Count column can be seen as a measure of the influence and strength of the innovation, and as such, it can be useful to try to predict this column to help firms understand the potential influence of their innovations.

training <- read.csv("C:/Users/justt/Desktop/School/621/Exams/Exam 1/Training Data.csv")
str(training)
## 'data.frame':    626 obs. of  3 variables:
##  $ Patent_Number        : chr  "PL 3341367 T3" "HR P20210871 T1" "CR 20210284 A" "US 2021/0205309 A1" ...
##  $ Cites_Patent_Count   : int  0 0 0 0 3 0 0 1 0 0 ...
##  $ Cited_by_Patent_Count: int  0 0 0 0 0 0 0 0 0 0 ...
  1. Use regression to try to predict the Cited_by_Patent_Count, with Cites_Patent_Count used as a covariate. Based on your results, write an equation using the following format: Cited_by_Patent_Count ≈ B1(Cites_Patent_Count) + B2, where B1 and B2 are real numbers
training <- training[ ,-1]
colnames(training)
## [1] "Cites_Patent_Count"    "Cited_by_Patent_Count"
colnames(training) <- c("B1", "B2")
model1 <- lm(B2 ~ B1, data = training)
model1
## 
## Call:
## lm(formula = B2 ~ B1, data = training)
## 
## Coefficients:
## (Intercept)           B1  
##     0.05226      0.00267
summary(model1)
## 
## Call:
## lm(formula = B2 ~ B1, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9469 -0.0523 -0.0523 -0.0523  3.7661 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0522603  0.0130896   3.993 7.32e-05 ***
## B1          0.0026705  0.0005322   5.018 6.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3223 on 624 degrees of freedom
## Multiple R-squared:  0.03879,    Adjusted R-squared:  0.03725 
## F-statistic: 25.18 on 1 and 624 DF,  p-value: 6.811e-07
## let Cites_Patent_Count = 60
B1a <- data.frame(B1 = c(60))
predict(model1, B1a, type = "response")
##         1 
## 0.2124891

An equation for Cited_by_Patent_Count, using the regression model of model1 <- lm(B2 ~ B1, data = training), where B1 = 60 would be Cited_by_Patent_Count of 0.2124891 ≈ when Cites_Patent_Count is.

Fill-in the blanks for the following statements:

  1. Your regression equation provides an “estimate” of the actual values for the numbers B1 and B2. There is a 90% probability that the actual value of B1 lies between _________ and ________.
confint(model1, level = 0.90)
##                     5 %        95 %
## (Intercept) 0.030697775 0.073822761
## B1          0.001793859 0.003547102

There is a 90% probability that the actual value of B1 lies between 0.001793859 and 0.003547102.

  1. There is a 90% probability that the actual value of B2 lies between ________ and ________.
confint(model1, level = 0.90)
##                     5 %        95 %
## (Intercept) 0.030697775 0.073822761
## B1          0.001793859 0.003547102

There is a 90% probability that the actual value of B2 lies between 0.030697775 and 0.073822761.

  1. Based on Cook’s Distance, are there outliers that may possibly need to be removed and/or treated differently in the dataset? Create a plot using R to support your answer.
head(cooks.distance(model1))
##            1            2            3            4            5            6 
## 2.174933e-05 2.174933e-05 2.174933e-05 2.174933e-05 2.810625e-05 2.174933e-05
max(cooks.distance(model1))
## [1] 2.635251

Yes, there are outliers that may possibly need to be removed and/or treated differently in the dataset. The Max is 2.635251, which is greater than 1.0 value for Cook’s Distance.

plot(model1)

This shows that there are 2 points outside the 1.0 range for Cook’s Distance. Points 602 and 567 should be considered outliers and should be removed. This could be the result of bad data capture, or data anomolies.

  1. Check the normality of residuals by creating a QQ-plot. Based on the plot, are the residuals normally distributed?
qqnorm(model1$residuals, main = "model1")
qqline(model1$residuals)

Yes, the residuals are normally distributed.

Use the regression model that you created in #2 to predict the Cited_by_Patent_Count in each of the following scenarios:

  1. Cites_Patent_Count = 340
model1 <- lm(B2 ~ B1, data = training)
## let Cites_Patent_Count = 340
B1b <- data.frame(B1 = c(340))
predict(model1, B1b, type = "response")
##         1 
## 0.9602238

The predicted Cited_by_Patent_Count when Cites_Patent_Count = 340 is 0.9602238.

  1. Cites_Patent_Count = 300
model1 <- lm(B2 ~ B1, data = training)
## let Cites_Patent_Count = 300
B1c <- data.frame(B1 = c(300))
predict(model1, B1c, type = "response")
##         1 
## 0.8534046

The predicted Cited_by_Patent_Count when Cites_Patent_Count = 300 is 0.8534046.

model1 <- lm(B2 ~ B1, data = training)
head(hatvalues(model1))
##          1          2          3          4          5          6 
## 0.00164921 0.00164921 0.00164921 0.00164921 0.00160247 0.00164921
max(hatvalues(model1))
## [1] 0.2995991
  1. Was the prediction made in part a of #3 considered to be extrapolation?
x_new = c(1, 340)
X= model.matrix(model1)
t(x_new)%*%solve(t(X)%*%X)%*%x_new
##           [,1]
## [1,] 0.3086801

Yes, this is extrapolation because it’s value of 0.3086801 is greater than the max leverage of 0.2995991.

  1. Was the prediction made in part b of #3 considered to be extrapolation?
x_new = c(1, 300)
X= model.matrix(model1)
t(x_new)%*%solve(t(X)%*%X)%*%x_new
##           [,1]
## [1,] 0.2398486

No, this is not extrapolation because it’s value of 0.2398486 is less than the max leverage of 0.2995991.

Fill in the blanks based on the regression model that you created in #2, and the corresponding prediction that you made in part b of #3:

  1. When Cites_Patent_Count is 300 (as in part b of #3), there is a 95% chance that Cited_by_Patent_Count will be between _________ and __________.
B2_pred = data.frame(B1 = c(300))
B2_pred
##    B1
## 1 300
predict(model1, B2_pred, interval = "prediction", level = .95, type = "response")
##         fit       lwr      upr
## 1 0.8534046 0.1486074 1.558202

When Cites_Patent_Count is 300 (as in part b of #3), there is a 95% chance that Cited_by_Patent_Count will be between 0.1486074 and 1.558202.

  1. When Cites_Patent_Count is 300, there is a 95% chance that the mean response will fall between _________ and _________.
predict(model1, B2_pred, interval = "confidence", level = 0.95, type = "response")
##         fit       lwr      upr
## 1 0.8534046 0.5434141 1.163395

When Cites_Patent_Count is 300, there is a 95% chance that the mean response will fall between 0.5434141 and 1.163395.

Section 2 of Exam 1: Joining Data

  1. Type the following header in R Markdown, using formatting learned in the first class:

See above header

  1. Type the following in R Markdown, being sure to make the phrase “mutating joins” appear in bold in your HTML document:

The four types of mutating joins learned in class are:

  1. Open the Training Data and Sequence Counts csv files, and notice that each file has a column listing the patent application number. For full credit, perform the following joins in R without changing any column titles in either dataset:
train <- read.csv("C:/Users/justt/Desktop/School/621/Exams/Exam 1/Training Data.csv")
sc <- read.csv("C:/Users/justt/Desktop/School/621/Exams/Exam 1/Sequence Counts.csv")
str(train)
## 'data.frame':    626 obs. of  3 variables:
##  $ Patent_Number        : chr  "PL 3341367 T3" "HR P20210871 T1" "CR 20210284 A" "US 2021/0205309 A1" ...
##  $ Cites_Patent_Count   : int  0 0 0 0 3 0 0 1 0 0 ...
##  $ Cited_by_Patent_Count: int  0 0 0 0 0 0 0 0 0 0 ...
str(sc)
## 'data.frame':    222 obs. of  2 variables:
##  $ Patent_No.    : chr  "CA 189065 S" "NI 202000072 A" "KR 20210032013 A" "PH 12020550461 A1" ...
##  $ Sequence_Count: int  0 0 80 0 0 0 0 0 0 0 ...
  1. Perform an inner join
train %>% inner_join(sc, by = c("Patent_Number" = "Patent_No.")) -> joined_data
head(joined_data)
##       Patent_Number Cites_Patent_Count Cited_by_Patent_Count Sequence_Count
## 1       CA 189065 S                  0                     0              0
## 2    NI 202000072 A                  0                     0              0
## 3  KR 20210032013 A                  0                     0             80
## 4 PH 12020550461 A1                  0                     0              0
## 5      TW I722568 B                  0                     0              0
## 6    CN 112533674 A                  0                     2              0

The head of the inner joined data is in the table above.

  1. Perform a full join
train %>% full_join(sc, by = c("Patent_Number" = "Patent_No.")) -> joined_data1
head(joined_data1)
##        Patent_Number Cites_Patent_Count Cited_by_Patent_Count Sequence_Count
## 1      PL 3341367 T3                  0                     0             NA
## 2    HR P20210871 T1                  0                     0             NA
## 3      CR 20210284 A                  0                     0             NA
## 4 US 2021/0205309 A1                  0                     0             NA
## 5    JP 2021100972 A                  3                     0             NA
## 6  AU 2021/203768 A1                  0                     0             NA

The head of the full joined data is in the table above.

  1. Join the two datasets so that only the patents in the Sequence Counts dataset are included
train %>% right_join(sc, by = c("Patent_Number" = "Patent_No.")) -> joined_data2
head(joined_data2)
##       Patent_Number Cites_Patent_Count Cited_by_Patent_Count Sequence_Count
## 1       CA 189065 S                  0                     0              0
## 2    NI 202000072 A                  0                     0              0
## 3  KR 20210032013 A                  0                     0             80
## 4 PH 12020550461 A1                  0                     0              0
## 5      TW I722568 B                  0                     0              0
## 6    CN 112533674 A                  0                     2              0

The head of the right joined data is in the table above.

  1. Read the Training and Testing Data csv file into R. This file contains a binary column called Partition which is 0 when the patent was part of the training set used to develop the regression model in #2, and is equal to 1 otherwise. Use R to perform the following on this dataset:
tandt <- read.csv("C:/Users/justt/Desktop/School/621/Exams/Exam 1/Training and Testing Data.csv")
str(tandt)
## 'data.frame':    725 obs. of  4 variables:
##  $ Patent_Number        : chr  "PL 3341367 T3" "HR P20210871 T1" "CR 20210284 A" "US 2021/0205309 A1" ...
##  $ Cites_Patent_Count   : int  0 0 0 0 3 0 0 1 0 0 ...
##  $ Cited_by_Patent_Count: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partition            : int  0 0 0 0 0 0 0 0 0 0 ...

Group the dataset by Partition.

by_part <- group_by(tandt, Partition)
head(by_part)
## # A tibble: 6 × 4
## # Groups:   Partition [1]
##   Patent_Number      Cites_Patent_Count Cited_by_Patent_Count Partition
##   <chr>                           <int>                 <int>     <int>
## 1 PL 3341367 T3                       0                     0         0
## 2 HR P20210871 T1                     0                     0         0
## 3 CR 20210284 A                       0                     0         0
## 4 US 2021/0205309 A1                  0                     0         0
## 5 JP 2021100972 A                     3                     0         0
## 6 AU 2021/203768 A1                   0                     0         0

The head of the Grouped data set is in the table above.

Summarize each grouping by calculating the standard deviation of Cites_Patent_Count for each group.

Stan_dev <- summarise(by_part, cpc = sd(Cites_Patent_Count, na.rm = TRUE))
Stan_dev
## # A tibble: 2 × 2
##   Partition   cpc
##       <int> <dbl>
## 1         0  24.2
## 2         1  16.7

The Summarized data set is in the table above.

  1. Identify the top 3 patents with the highest number of citations in the Cited_by_Patent_Count column.
top3 <- arrange(tandt, desc(Cited_by_Patent_Count))
head(top3)
##        Patent_Number Cites_Patent_Count Cited_by_Patent_Count Partition
## 1  WO 2021/250648 A1                  3                    11         1
## 2  WO 2021/084429 A1                 68                     4         0
## 3     US 11014911 B2                 13                     3         0
## 4 US 2021/0196810 A1                  0                     2         0
## 5 US 2021/0121555 A1                  0                     2         0
## 6     CN 112533674 A                  0                     2         0

The top 3 patents with the highest # of citations are:

  1. Filter the dataset to display only those rows for which the Cites_Patent_Count is greater than zero.
filter0 <- filter(tandt, Cites_Patent_Count > 0)
head(filter0)
##        Patent_Number Cites_Patent_Count Cited_by_Patent_Count Partition
## 1    JP 2021100972 A                  3                     0         0
## 2 US 2021/0206757 A1                  1                     0         0
## 3      EP 3096786 B1                 10                     0         0
## 4      EP 3096783 B1                 16                     0         0
## 5  AU 2018/372109 B2                  1                     0         0
## 6  AU 2018/275359 B2                  1                     0         0

The head of the Filtered data (137 rows) is in the table above.