library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr) # redundant: dplyr is already attached as part of the tidyverse
See above header
Read the csv file titled Training Data into R. All rows of this dataset should be used as your training set in the development of a regression model. This dataset contains information about some of Pfizer’s patent applications. Each patent application has a unique ID given to it, and this ID is listed in the Patent_Number column. The column titled Cites_Patent_Count lists the number of patent citations made within the corresponding patent application, and the column titled Cited_by_Patent_Count lists the number of patents which cite the document identified by the Patent_Number. The Cited_by_Patent_Count column can be seen as a measure of the influence and strength of the innovation, and as such, it can be useful to try to predict this column to help firms understand the potential influence of their innovations.
training <- read.csv("C:/Users/justt/Desktop/School/621/Exams/Exam 1/Training Data.csv")
str(training)
## 'data.frame': 626 obs. of 3 variables:
## $ Patent_Number : chr "PL 3341367 T3" "HR P20210871 T1" "CR 20210284 A" "US 2021/0205309 A1" ...
## $ Cites_Patent_Count : int 0 0 0 0 3 0 0 1 0 0 ...
## $ Cited_by_Patent_Count: int 0 0 0 0 0 0 0 0 0 0 ...
training <- training[ ,-1]
colnames(training)
## [1] "Cites_Patent_Count" "Cited_by_Patent_Count"
colnames(training) <- c("B1", "B2")
model1 <- lm(B2 ~ B1, data = training)
model1
##
## Call:
## lm(formula = B2 ~ B1, data = training)
##
## Coefficients:
## (Intercept) B1
## 0.05226 0.00267
summary(model1)
##
## Call:
## lm(formula = B2 ~ B1, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9469 -0.0523 -0.0523 -0.0523 3.7661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0522603 0.0130896 3.993 7.32e-05 ***
## B1 0.0026705 0.0005322 5.018 6.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3223 on 624 degrees of freedom
## Multiple R-squared: 0.03879, Adjusted R-squared: 0.03725
## F-statistic: 25.18 on 1 and 624 DF, p-value: 6.811e-07
## let Cites_Patent_Count = 60
B1a <- data.frame(B1 = c(60))
predict(model1, B1a, type = "response")
## 1
## 0.2124891
Using the regression model model1 <- lm(B2 ~ B1, data = training), the fitted equation is Cited_by_Patent_Count ≈ 0.05226 + 0.00267 × Cites_Patent_Count, so the predicted Cited_by_Patent_Count is approximately 0.2124891 when Cites_Patent_Count is 60.
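As a quick sanity check (a sketch, not part of the original output), the same prediction can be computed directly from the fitted coefficients:
sum(coef(model1) * c(1, 60)) # intercept + slope * 60; ≈ 0.2125, matching predict() above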
Fill in the blanks for the following statements:
confint(model1, level = 0.90)
## 5 % 95 %
## (Intercept) 0.030697775 0.073822761
## B1 0.001793859 0.003547102
We are 90% confident that the true slope coefficient on B1 (Cites_Patent_Count) lies between 0.001793859 and 0.003547102.
We are 90% confident that the true intercept, the mean of B2 when B1 = 0, lies between 0.030697775 and 0.073822761.
head(cooks.distance(model1))
## 1 2 3 4 5 6
## 2.174933e-05 2.174933e-05 2.174933e-05 2.174933e-05 2.810625e-05 2.174933e-05
max(cooks.distance(model1))
## [1] 2.635251
Yes, there are outliers that may need to be removed and/or treated differently. The maximum Cook's distance is 2.635251, which exceeds the common cutoff value of 1.0.
plot(model1)
The diagnostic plots show two points beyond the 1.0 threshold for Cook's distance. Points 567 and 602 should be considered outliers and removed; this could be the result of bad data capture or data anomalies.
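A minimal sketch (not run in the original analysis) of how these two influential rows could be dropped and the model refit:
influential <- which(cooks.distance(model1) > 1) # per the plot above, rows 567 and 602
training_clean <- training[-influential, ]
model1_clean <- lm(B2 ~ B1, data = training_clean)
summary(model1_clean)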
qqnorm(model1$residuals, main = "model1")
qqline(model1$residuals)
Based on the Q-Q plot, the residuals appear to be approximately normally distributed.
Use the regression model that you created in #2 to predict the Cited_by_Patent_Count in each of the following scenarios:
model1 <- lm(B2 ~ B1, data = training)
## let Cites_Patent_Count = 340
B1b <- data.frame(B1 = c(340))
predict(model1, B1b, type = "response")
## 1
## 0.9602238
The predicted Cited_by_Patent_Count when Cites_Patent_Count = 340 is 0.9602238.
model1 <- lm(B2 ~ B1, data = training)
## let Cites_Patent_Count = 300
B1c <- data.frame(B1 = c(300))
predict(model1, B1c, type = "response")
## 1
## 0.8534046
The predicted Cited_by_Patent_Count when Cites_Patent_Count = 300 is 0.8534046.
model1 <- lm(B2 ~ B1, data = training)
head(hatvalues(model1))
## 1 2 3 4 5 6
## 0.00164921 0.00164921 0.00164921 0.00164921 0.00160247 0.00164921
max(hatvalues(model1))
## [1] 0.2995991
x_new <- c(1, 340)        # new observation with intercept term, Cites_Patent_Count = 340
X <- model.matrix(model1) # design matrix of the training data
t(x_new) %*% solve(t(X) %*% X) %*% x_new # leverage of the new point
## [,1]
## [1,] 0.3086801
Yes, predicting at Cites_Patent_Count = 340 is extrapolation because its leverage of 0.3086801 is greater than the maximum leverage in the training data, 0.2995991.
x_new <- c(1, 300)        # new observation with intercept term, Cites_Patent_Count = 300
X <- model.matrix(model1) # design matrix of the training data
t(x_new) %*% solve(t(X) %*% X) %*% x_new # leverage of the new point
## [,1]
## [1,] 0.2398486
No, predicting at Cites_Patent_Count = 300 is not extrapolation because its leverage of 0.2398486 is less than the maximum leverage in the training data, 0.2995991.
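As an illustration only (this hypothetical helper is not part of the original exam code), the leverage check above could be wrapped in a small function that compares a new point's leverage to the maximum hat value in the training data:
is_extrapolation <- function(model, new_x) {
  X <- model.matrix(model)                                      # training design matrix
  h_new <- t(c(1, new_x)) %*% solve(t(X) %*% X) %*% c(1, new_x) # leverage of the new point
  as.numeric(h_new) > max(hatvalues(model))
}
is_extrapolation(model1, 340) # TRUE  (0.3087 > 0.2996)
is_extrapolation(model1, 300) # FALSE (0.2398 < 0.2996)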
Fill in the blanks based on the regression model that you created in #2, and the corresponding prediction that you made in part b of #3:
B2_pred = data.frame(B1 = c(300))
B2_pred
## B1
## 1 300
predict(model1, B2_pred, interval = "prediction", level = .95, type = "response")
## fit lwr upr
## 1 0.8534046 0.1486074 1.558202
When Cites_Patent_Count is 300 (as in part b of #3), there is a 95% chance that the Cited_by_Patent_Count of a single new patent will fall between 0.1486074 and 1.558202 (the prediction interval).
predict(model1, B2_pred, interval = "confidence", level = 0.95, type = "response")
## fit lwr upr
## 1 0.8534046 0.5434141 1.163395
When Cites_Patent_Count is 300, we are 95% confident that the mean Cited_by_Patent_Count (the mean response) falls between 0.5434141 and 1.163395 (the confidence interval).
See above header
The four types of mutating joins learned in class are inner_join(), left_join(), right_join(), and full_join().
train <- read.csv("C:/Users/justt/Desktop/School/621/Exams/Exam 1/Training Data.csv")
sc <- read.csv("C:/Users/justt/Desktop/School/621/Exams/Exam 1/Sequence Counts.csv")
str(train)
## 'data.frame': 626 obs. of 3 variables:
## $ Patent_Number : chr "PL 3341367 T3" "HR P20210871 T1" "CR 20210284 A" "US 2021/0205309 A1" ...
## $ Cites_Patent_Count : int 0 0 0 0 3 0 0 1 0 0 ...
## $ Cited_by_Patent_Count: int 0 0 0 0 0 0 0 0 0 0 ...
str(sc)
## 'data.frame': 222 obs. of 2 variables:
## $ Patent_No. : chr "CA 189065 S" "NI 202000072 A" "KR 20210032013 A" "PH 12020550461 A1" ...
## $ Sequence_Count: int 0 0 80 0 0 0 0 0 0 0 ...
train %>% inner_join(sc, by = c("Patent_Number" = "Patent_No.")) -> joined_data
head(joined_data)
## Patent_Number Cites_Patent_Count Cited_by_Patent_Count Sequence_Count
## 1 CA 189065 S 0 0 0
## 2 NI 202000072 A 0 0 0
## 3 KR 20210032013 A 0 0 80
## 4 PH 12020550461 A1 0 0 0
## 5 TW I722568 B 0 0 0
## 6 CN 112533674 A 0 2 0
The head of the inner joined data is in the table above.
train %>% full_join(sc, by = c("Patent_Number" = "Patent_No.")) -> joined_data1
head(joined_data1)
## Patent_Number Cites_Patent_Count Cited_by_Patent_Count Sequence_Count
## 1 PL 3341367 T3 0 0 NA
## 2 HR P20210871 T1 0 0 NA
## 3 CR 20210284 A 0 0 NA
## 4 US 2021/0205309 A1 0 0 NA
## 5 JP 2021100972 A 3 0 NA
## 6 AU 2021/203768 A1 0 0 NA
The head of the full joined data is in the table above.
train %>% right_join(sc, by = c("Patent_Number" = "Patent_No.")) -> joined_data2
head(joined_data2)
## Patent_Number Cites_Patent_Count Cited_by_Patent_Count Sequence_Count
## 1 CA 189065 S 0 0 0
## 2 NI 202000072 A 0 0 0
## 3 KR 20210032013 A 0 0 80
## 4 PH 12020550461 A1 0 0 0
## 5 TW I722568 B 0 0 0
## 6 CN 112533674 A 0 2 0
The head of the right joined data is in the table above.
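For completeness, a sketch of the fourth mutating join (not required here): a left join keeps every row of train and attaches Sequence_Count only where a matching patent exists in sc, filling the rest with NA.
train %>% left_join(sc, by = c("Patent_Number" = "Patent_No.")) -> joined_data3
head(joined_data3)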
tandt <- read.csv("C:/Users/justt/Desktop/School/621/Exams/Exam 1/Training and Testing Data.csv")
str(tandt)
## 'data.frame': 725 obs. of 4 variables:
## $ Patent_Number : chr "PL 3341367 T3" "HR P20210871 T1" "CR 20210284 A" "US 2021/0205309 A1" ...
## $ Cites_Patent_Count : int 0 0 0 0 3 0 0 1 0 0 ...
## $ Cited_by_Patent_Count: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partition : int 0 0 0 0 0 0 0 0 0 0 ...
Group the dataset by Partition.
by_part <- group_by(tandt, Partition)
head(by_part)
## # A tibble: 6 × 4
## # Groups: Partition [1]
## Patent_Number Cites_Patent_Count Cited_by_Patent_Count Partition
## <chr> <int> <int> <int>
## 1 PL 3341367 T3 0 0 0
## 2 HR P20210871 T1 0 0 0
## 3 CR 20210284 A 0 0 0
## 4 US 2021/0205309 A1 0 0 0
## 5 JP 2021100972 A 3 0 0
## 6 AU 2021/203768 A1 0 0 0
The head of the grouped dataset is shown in the table above.
Summarize each grouping by calculating the standard deviation of Cites_Patent_Count for each group.
Stan_dev <- summarise(by_part, cpc = sd(Cites_Patent_Count, na.rm = TRUE))
Stan_dev
## # A tibble: 2 × 2
## Partition cpc
## <int> <dbl>
## 1 0 24.2
## 2 1 16.7
The summarized dataset is shown in the table above.
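The same result can be produced in a single pipeline (a sketch using the same columns as above):
tandt %>%
  group_by(Partition) %>%
  summarise(cpc = sd(Cites_Patent_Count, na.rm = TRUE))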
top3 <- arrange(tandt, desc(Cited_by_Patent_Count))
head(top3)
## Patent_Number Cites_Patent_Count Cited_by_Patent_Count Partition
## 1 WO 2021/250648 A1 3 11 1
## 2 WO 2021/084429 A1 68 4 0
## 3 US 11014911 B2 13 3 0
## 4 US 2021/0196810 A1 0 2 0
## 5 US 2021/0121555 A1 0 2 0
## 6 CN 112533674 A 0 2 0
The top 3 patents with the highest number of citations (Cited_by_Patent_Count) are WO 2021/250648 A1 (11), WO 2021/084429 A1 (4), and US 11014911 B2 (3).
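Equivalently (a sketch, not part of the original code), slice_max() pulls just those three rows directly; with_ties = FALSE guards against extra rows if several patents shared the third-highest count:
tandt %>% slice_max(Cited_by_Patent_Count, n = 3, with_ties = FALSE)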
filter0 <- filter(tandt, Cites_Patent_Count > 0)
head(filter0)
## Patent_Number Cites_Patent_Count Cited_by_Patent_Count Partition
## 1 JP 2021100972 A 3 0 0
## 2 US 2021/0206757 A1 1 0 0
## 3 EP 3096786 B1 10 0 0
## 4 EP 3096783 B1 16 0 0
## 5 AU 2018/372109 B2 1 0 0
## 6 AU 2018/275359 B2 1 0 0
The head of the filtered data (137 rows in total) is shown in the table above.
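To confirm the row count stated above (a quick check, output not shown in the original):
nrow(filter0) # should equal 137 per the statement above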