Data Science with R - Review So Far and Exercises

In this unit we practice and reinforce the material, from the basics up to the most recent topics covered in the previous units.

Any time is a “good time” for tips and tricks: shortcuts that will save you a lot of effort in RStudio.

Getting help on commands

To open the help pane, place the cursor on a command and press F1. The relevant help page appears in the bottom-right pane. Alternatively, type the following code in the console:

?mean # <- if you are sure of the exact name
?mutate_all # <- requires the package that defines it (dplyr) to be loaded
??mutate_all # <- for part of a name or an unloaded library; you will get a list of matching options

Running code quickly

To run lines of code, there is no need to select all of them. Simply place the cursor on the first line and press Ctrl+Enter. R runs that line, shows the result or any warnings, and the cursor moves automatically to the next line (where you can press Ctrl+Enter again).

Clearing the Console window

Click inside the console and press Ctrl+L.

Opening an object for immediate viewing

Hold Ctrl and click the name of the object. It opens in a viewer tab in RStudio. Try to avoid very large objects (they can take a long time to load, or freeze RStudio).

“A trip through history”

Pressing Ctrl+Up arrow in the console lets you browse recently run commands. If you start typing a command first, the history is filtered by the characters you typed.

Jumping between tabs (scripts and datasets)

Press Ctrl+Tab while the script pane is focused to move to the next tab, or Ctrl+Shift+Tab to move to the previous one.

Panes worth noticing

The Environment pane helps you keep track of the variables you have loaded.
The Import Dataset button helps you load files without writing code. Its big advantage is that you can preview the data and choose how each column is imported.
If you work with version tracking, a Git pane will appear. Useful for those who are familiar with it…
The Connections pane helps you connect to data in various formats; we will touch on it later.

Documenting code “properly”

As you have probably noticed, the # character is used for writing comments. You can put it at the start of a line or after a command at the end of a line.

But what happens when you enter the following line:

# ==== This is a section header within a script! ====

# ==== Also cool, but you get my drift ====

Now for reinforcement and review of the basics

Object types

Many of our examples work with existing datasets, or with datasets I downloaded from Kaggle. Some datasets come bundled with R or with packages: iris and mtcars are available as soon as R starts, while diamonds becomes available once ggplot2 is loaded. Other datasets can be loaded with reading functions such as read_csv or readxl::read_excel or, as I just showed, by clicking Import Dataset in the Environment pane.

Loaded a dataset? The first step is to take a peek!

To peek at a dataset, do one of the following:

# load the dataset however you want, and then use one of the following (or a combination):
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

#View(iris) # can also be accomplished by ctrl+left mouse click on name
names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

typeof(iris) # what is the type of iris?

## [1] "list"

typeof(iris$Species) # what is the type of column (variable) Species

## [1] "integer"

iris # usually this doesn't provide a very nice view...

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

dim(iris)

## [1] 150   5

# these are all base-R functions, but another recommended one is
library(dplyr) # <- load the dplyr library where glimpse is defined

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

glimpse(iris)

## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, s...
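A few more base-R functions are handy for a first look. The sketch below is only a suggestion (the choice of functions is mine, not part of the list above) and uses nothing beyond what ships with R. It also hints at why typeof() returned "list" and "integer" earlier: typeof() reports the storage type, while class() reports the higher-level class.

```r
# head() shows only the first rows, which is friendlier than printing all 150
head(iris, n = 3)

# str() gives a compact structure summary, similar in spirit to glimpse()
str(iris)

# class() reports the high-level class, typeof() the underlying storage type
class(iris)          # a data.frame is stored internally as a list of columns
class(iris$Species)  # a factor is stored internally as integer codes
```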

Visualization

We have already talked a lot about visualization. It is among the first things you will do to get to know a dataset, but we will not expand on it here. The ggplot2 package is recommended.

Back to the “basics+”

We talked a lot about reading files and working with datasets, but skipped a few “simpler” things. Knowing them will help you later in several ways: learning new models, and more advanced programming in R.

Work in R is mostly vectorized, meaning that many functions take a vector and return a single value, for example mean, max, and sd. There are also functions that return a vector, such as range or runif.

You can always access a specific column vector of a data frame with the $ sign or by indexing with square brackets.

iris$Sepal.Length # loads just the vector Sepal.Length

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

mean(iris$Sepal.Length) # provides the mean of Sepal.Length

## [1] 5.843333

range(iris$Sepal.Length) # the minimum and maximum (vector result with two numbers)

## [1] 4.3 7.9

mean(iris[,1]) # call Sepal.Length by location (it is the first column) and then compute mean

## [1] 5.843333
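The access patterns above can be compared side by side. This is a minimal sketch on iris; the variable names are made up for illustration.

```r
# Four equivalent ways to pull the same column out of a data frame
by_dollar  <- iris$Sepal.Length
by_name    <- iris[["Sepal.Length"]]
by_colname <- iris[, "Sepal.Length"]
by_index   <- iris[, 1]

identical(by_dollar, by_name)
identical(by_colname, by_index)

# Single brackets without a comma keep the data.frame structure (one column)
one_column_df <- iris["Sepal.Length"]
class(one_column_df)
```

Accessing by name is usually safer than by position, since a column's position can change when the dataset is modified.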

There are many easy ways to create vectors for common uses.

1:6 # all numbers between 1 and 6

## [1] 1 2 3 4 5 6

seq(from = 1, to = 6, by = 2) # now just the odds

## [1] 1 3 5

seq(1, 6, 2) # abbreviation, just keep the arguments in order

## [1] 1 3 5

runif(10, min = -1, max = 6) # 10 uniform numbers between -1 and 6

##  [1]  2.79197076  4.85129721 -0.08948618  3.18588529  2.30704703
##  [6]  4.64050263  4.42301571  5.70292303  1.48484364  1.99357138

rnorm(5, mean = 0, sd = 3) # 5 normally distributed points with mean 0 and std 3

## [1]  2.9149416  3.9077670 -2.8718532 -1.6565101 -0.6844927

c("My", "Name", "is", pi) # c() combines values into a vector; here pi is coerced to a string, so the result is a character vector (R does the type casting on its own)

## [1] "My"               "Name"             "is"              
## [4] "3.14159265358979"

cbind(c("My", "Name", "is", pi), c(1,2,3,4)) # combine columns into a matrix with cbind (rbind does this for rows)

##      [,1]               [,2]
## [1,] "My"               "1" 
## [2,] "Name"             "2" 
## [3,] "is"               "3" 
## [4,] "3.14159265358979" "4"

round(runif(30, min = 0.5, 6.5)) # roll the dice 30 times. Round takes a vector and returns a vector with the same size

##  [1] 4 6 2 1 4 6 4 3 3 6 6 1 5 2 4 5 1 3 6 1 2 1 1 3 2 3 1 1 4 3
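As an alternative to rounding uniform numbers, base R's sample() can draw the faces of the die directly. The sketch below assumes an arbitrary seed, used only to make the draws repeatable.

```r
set.seed(42)  # arbitrary seed, only so that repeated runs give the same draws

# sample() draws directly from the six faces; replace = TRUE allows repeats
dice_rolls <- sample(1:6, size = 30, replace = TRUE)
dice_rolls

# table() counts how many times each face came up
table(dice_rolls)
```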

Plain programming


# here is an integer
1L
typeof(1L)
# but we rarely use it... usually we will write
1.1
typeof(1.1)
typeof(c(1.1, 2.1, 3.6, pi))
# sometimes it is useful to get the length of a vector
length(1:10)
# but NROW and NCOL are much more predictable
NROW(1:10)
NCOL(1:10)
dim(1:10) # doesn't work on such vectors...

# You've already seen scientific notations
6.022140857e23 # anyone recognizes?

# We've already covered special "statistical and mathematical quirks" like
Inf
-Inf
NaN
NA

# No need to introduce
# * multiplication
# / division
# + -
# but

5 %% 3 # is the remainder

# TRUE FALSE and the likes
TRUE & FALSE
TRUE | FALSE
TRUE & TRUE
!TRUE
FALSE != TRUE
FALSE == FALSE

NA * 1
NA^0

c(T, T, F, T) | c(F, F, F, F) # TRUE, FALSE can be shortened to T, F

# factors are important - they represent categorical or ordinal variables
factor(c("Jan", "Feb", "Mar", "Apr"))
typeof(factor(c("Jan", "Feb", "Mar", "Apr"))) # why do they show up as integers you ask?
levels(factor(c("Jan", "Feb", "Mar", "Apr"))) # why hold strings, when you can hold numbers...

# type cast to any type by as.XXX
as.numeric("3.141593") # almost pi
as.numeric("foo! bar.") # be reasonable with your casting... (returns NA with a warning)
as.character(pi) # pi masked as a string
as.factor(c("Jan", "Feb", "Mar", "Apr")) # for a character vector this gives the same result as factor(); levels are sorted alphabetically
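Because levels are sorted alphabetically by default, months come out in the wrong order. A minimal sketch of how to control the order with the levels argument of factor() (the variable names are illustrative):

```r
months <- c("Jan", "Feb", "Mar", "Apr")

# Default: levels are sorted alphabetically
levels(factor(months))

# Explicit levels keep calendar order; ordered = TRUE makes comparisons meaningful
months_ordered <- factor(months, levels = months, ordered = TRUE)
levels(months_ordered)
months_ordered[2] < months_ordered[4]  # is Feb before Apr?

# The integer codes behind the labels
as.integer(months_ordered)
```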

We are almost done reviewing the basics. A few loops and conditionals.

for (i in 1:10){
   # do some action like
   cat(i)
}

counter <- 1
while (counter <= 10){
   # do some other action
   cat(counter*pi, "\n") # \n is a newline
   counter <- counter + 1
}

R_is_cool <- TRUE
if (R_is_cool) {
   cat("No doubt about it")
} else {
   cat("Bhaa")
}

# to define a function
plus_one <- function(original_number){
   result <- original_number + 1 # local variable, named differently from the function to avoid confusion
   return(result)
}
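To see the pieces working together, here is a short sketch that calls the function and combines it with a loop and a condition. The function is redefined so the block runs on its own; relying on the last evaluated expression instead of return() is a common R idiom, not something the text above requires.

```r
plus_one <- function(original_number) {
  original_number + 1  # the last evaluated expression is returned
}

plus_one(41)
plus_one(1:3)  # arithmetic is vectorized, so a vector works too

# Combining a loop, a condition, and the function
for (i in 1:3) {
  if (plus_one(i) %% 2 == 0) {
    cat(i, "plus one is even\n")
  } else {
    cat(i, "plus one is odd\n")
  }
}
```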

The data.frame and the tibble

The most important data structures for data analysis in R are the data.frame and the tibble. In fact, all the datasets we have worked with so far were of these types.

Both look like a matrix in which each row is an observation and each column is a variable. You can extract just part of the data in various ways, for example:

iris[iris$Species == "setosa",]
iris[1:10, 4:5]

But the most useful ways are actually with the syntax we have been using so far and will continue to use: tidyverse syntax.

library(tidyverse)
iris %>%
   filter(Species == "setosa")
   
iris %>%
   slice(1:10) %>%
   select(4:5)

You can also define a tibble or a data.frame manually:

# Data frames are the basic table structure for data analysis. They are part of "base-R"
dataframe_example <- data.frame(numbers = 1:3, letters = c("a", "B", "c"))
dataframe_example

##   numbers letters
## 1       1       a
## 2       2       B
## 3       3       c

# but notice how the tibble printing is much better

tibble_example <- tibble(numbers = 1:3, letters = c("a", "B", "c"))
tibble_example

## # A tibble: 3 x 2
##   numbers letters
##     <int> <chr>  
## 1       1 a      
## 2       2 B      
## 3       3 c

# or you can also do that manually in a more aesthetic manner (notice the r in tribble, stands for "row" definition form)
tibble_example2 <- tribble(
  ~numbers, ~letters,
  1, "a",
  2, "B",
  3, "c"
)
tibble_example2

## # A tibble: 3 x 2
##   numbers letters
##     <dbl> <chr>  
## 1       1 a      
## 2       2 B      
## 3       3 c

# The great benefit of tibble is the ability to work with slightly more complex data structures
tribble(
  ~numbers, ~few_letters,
  1, c("a","b"),
  2, c("C"),
  3, NULL
)

## # A tibble: 3 x 2
##   numbers few_letters
##     <dbl> <list>     
## 1       1 <chr [2]>  
## 2       2 <chr [1]>  
## 3       3 <NULL>

Lists

In R you can create objects composed of arbitrary sub-objects. Such objects are called lists. Lists offer a great deal of flexibility, but that flexibility also makes them harder to work with.

list_example <- list(user_name = c("foo", "bar"),
                     user_transactions = 
                       rbind(1:10, 2:11, 3:12))
list_example$hello <- "world"

list_example[[1]]

## [1] "foo" "bar"

list_example$user_name

## [1] "foo" "bar"

list_example

## $user_name
## [1] "foo" "bar"
## 
## $user_transactions
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    2    3    4    5    6    7    8    9    10
## [2,]    2    3    4    5    6    7    8    9   10    11
## [3,]    3    4    5    6    7    8    9   10   11    12
## 
## $hello
## [1] "world"

glimpse(list_example)

## List of 3
##  $ user_name        : chr [1:2] "foo" "bar"
##  $ user_transactions: int [1:3, 1:10] 1 2 3 2 3 4 3 4 5 4 ...
##  $ hello            : chr "world"
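One frequent source of confusion with lists is the difference between single and double brackets. The sketch below rebuilds a list like the one above so that it runs on its own.

```r
list_example <- list(user_name = c("foo", "bar"),
                     user_transactions = rbind(1:10, 2:11, 3:12))

# Single brackets return a (shorter) list
class(list_example["user_name"])

# Double brackets (or $) return the element itself
class(list_example[["user_name"]])

# Access can be chained: row 2, column 3 of the matrix stored in the list
list_example$user_transactions[2, 3]

# Setting an element to NULL removes it from the list
list_example$user_name <- NULL
names(list_example)
```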

Key functions in the tidyverse

The main functions we learned in previous lessons are used for working with data.frames or tibbles. (In the following code, replace the upper-case names with real objects.)

# create a new variable:
NEW_DATASET <- OLD_DATASET %>% 
   mutate(NEW_VAR = FUNCTION(OLD_VAR))

# filter a dataset
NEW_DATASET <- OLD_DATASET %>%
   filter(LOGICAL_CONDITION)
   
# sort a dataset
NEW_DATASET <- OLD_DATASET %>%
   arrange(VARIABLE) # or use arrange(desc(VARIABLE)) for a descending order
   
# rearrange a dataset: "gather" multiple columns into multiple rows
NEW_DATASET <- OLD_DATASET %>%
   gather(NEW_KEY_NAME, NEW_VALUE_NAME, -EXCLUDE_VECTOR1, -EXCLUDE_VECTOR2,...)

# rearrange a dataset: "spread" multiple rows into multiple columns. Opposite of gather
NEW_DATASET <- OLD_DATASET %>%
   spread(KEY_VARIABLE, VALUE_VARIABLE)
# KEY_VARIABLE will define the variable names in the new dataset and
# VALUE_VARIABLE will define the corresponding values

# grouped operations on a dataset
NEW_DATASET <- OLD_DATASET %>%
   group_by(SOME_VAR) %>%
   summarize(min(VECTOR),
             max(VECTOR),
             mean(VECTOR),
             ...) # useful for mean, min, sd.., and customized vector->scalar functions

# count occurrences of values in a vector
NEW_DATASET <- OLD_DATASET %>%
   count(SOME_VAR)
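The templates above can be filled in with a concrete dataset. The following is a sketch on iris, assuming the dplyr and tidyr packages are installed. The Petal.Area variable and the row_id helper column are invented for illustration. For reshaping, the sketch uses pivot_longer() and pivot_wider(), the newer tidyr counterparts of gather() and spread().

```r
library(dplyr)
library(tidyr)

# mutate + filter + arrange
iris_big_petals <- iris %>%
  mutate(Petal.Area = Petal.Length * Petal.Width) %>%  # rough rectangle area, illustration only
  filter(Petal.Area > 10) %>%
  arrange(desc(Petal.Area))

# grouped summary with named output columns
iris_summary <- iris %>%
  group_by(Species) %>%
  summarize(min_sepal = min(Sepal.Length),
            max_sepal = max(Sepal.Length),
            mean_sepal = mean(Sepal.Length))

# count rows per value
iris_counts <- iris %>% count(Species)

# wide -> long: one row per (flower, measurement)
iris_long <- iris %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(cols = -c(Species, row_id),
               names_to = "measure", values_to = "value")

# long -> wide: back to one row per flower
iris_wide <- iris_long %>%
  pivot_wider(names_from = measure, values_from = value)
```

Naming the summary columns (min_sepal and so on) makes the result easier to use in later steps than the automatically generated names.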

Regression practice

As a reminder, in the previous lesson we saw how to fit a regression model:

iris_lm_complete <- lm(data = iris, 
                       formula = Sepal.Width ~ Sepal.Length + Petal.Width + Petal.Length + Species)
summary(iris_lm_complete)

# and for stepwise we can do
iris_stepwise <- MASS::stepAIC(iris_lm_complete, direction = "forward", trace = TRUE)
summary(iris_stepwise)
# in this case the initial model is also the final stepwise model: with direction = "forward" and no scope argument, there is nothing to add to the full model
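For forward selection to actually add terms, it has to start from a smaller model and be told how far it may grow. A minimal sketch, assuming the MASS package is installed; the intercept-only starting model and the object names are my choice for illustration, not the lesson's code.

```r
# Start from the intercept-only model and let forward selection add terms
iris_lm_null <- lm(Sepal.Width ~ 1, data = iris)

iris_forward <- MASS::stepAIC(
  iris_lm_null,
  scope = list(lower = ~ 1,
               upper = ~ Sepal.Length + Petal.Width + Petal.Length + Species),
  direction = "forward",
  trace = FALSE  # set to TRUE to print each step
)

formula(iris_forward)            # which terms were selected
AIC(iris_lm_null, iris_forward)  # compare the two models
```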

In this exercise (whose beginning we saw in the previous lesson) you will fit a model to predict the price of diamonds.

The diamonds dataset becomes available when you load the ggplot2 package (or the tidyverse package, which includes it).

library(tidyverse)
glimpse(diamonds)

## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

# Use ?diamonds to see the documentation of the dataset
# or press F1 on diamonds

Add to diamonds a variable called is_train that equals TRUE for about 80% of the observations (chosen at random) and FALSE for the rest.
1. Use mutate together with runif to draw random numbers between 0 and 1 and define the is_train variable.

diamonds <- diamonds %>%
   mutate(XXX = runif(NROW(XXX)) <= 0.8)
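To avoid giving away the exercise, the same idea is sketched here on a different dataset (mtcars) in base R. The seed is arbitrary, and the exact-split variant at the end is an optional alternative rather than part of the exercise.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

cars <- mtcars
# each row independently gets TRUE with probability 0.8
cars$is_train <- runif(NROW(cars)) <= 0.8

table(cars$is_train)  # roughly, not exactly, 80% TRUE
mean(cars$is_train)   # the realized share of training rows

# for an exact 80/20 split one could sample row indices instead
train_rows <- sample(NROW(cars), size = floor(0.8 * NROW(cars)))
exact_is_train <- seq_len(NROW(cars)) %in% train_rows
sum(exact_is_train)
```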

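Now build an initial model that uses all the variables to explain the price, fitted on the training set only. A reminder of how to call the function:

diamonds_lm <- lm(formula = XXX ~ XXX + XXX + ...,
                  data = diamonds %>% filter(is_train)) # fit on the training rows only, so the test set stays unseen

Question to think about: is the model you just fitted really a linear regression? Hint: how is the depth variable defined? You can check this in the documentation of diamonds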

?diamonds

Compute the RSS (residual sum of squares) of the test set
1. To run the prediction on the test set, complete the following code. You will need the predict function. It takes the object (the linear model we fitted) and the new dataset on which to compute predictions. We will meet this function again when using similar prediction algorithms.

test_price_lm <- predict(object = XXX, 
                         newdata = XXX %>% filter(!XXX))

Now compute the error itself. Note the use of “classic” R syntax (rather than tidyverse syntax). This is one of the cases where the classic syntax is more concise and relatively clear.

sum(  (test_price_lm - diamonds$price[!diamonds$is_train])  ^2  )

Similarly, instead of the RSS you can express the error in price terms, that is, the mean error and the mean absolute error. Compute the values of these measures.

mean(XXX)
mean(abs(XXX))
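The whole fit, predict, and evaluate cycle can be sketched on mtcars without spoiling the exercise. Everything here is an assumption made for illustration: the dataset, the predictors, and the deterministic split (every fifth row held out), which replaces the random one so that the sketch always gives the same numbers.

```r
cars <- mtcars
cars$is_train <- seq_len(NROW(cars)) %% 5 != 0  # every 5th row goes to the test set

# fit on the training rows only (mpg explained by weight and horsepower; an arbitrary choice)
cars_lm <- lm(mpg ~ wt + hp, data = cars[cars$is_train, ])

# predict on the held-out rows
test_set <- cars[!cars$is_train, ]
test_pred <- predict(object = cars_lm, newdata = test_set)
test_error <- test_pred - test_set$mpg

sum(test_error^2)      # RSS on the test set
mean(test_error)       # mean error: positive means overestimation on average
mean(abs(test_error))  # mean absolute error, in the units of the response
```

The sign of the mean error tells you the direction of the bias, while the mean absolute error tells you the typical size of a miss.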

If a diamond dealer uses the model you developed to set prices, will he on average overestimate or underestimate?
The following code adds a column with the predicted price (for all observations) to the dataset. Use this column and a chart (boxplot - see the code below) to look at the distribution of the residuals, stored here in the column predicted_rss. What can you say about the shape of the distribution? Is it skewed? Does it resemble a normal distribution?

diamonds <- diamonds %>%
  mutate(predicted_price = predict(object = diamonds_lm,
                                   newdata = diamonds)) %>%
  mutate(predicted_rss = predicted_price -  price)

# Add the boxplot based on the new variable (split to train/test)

ggplot(diamonds, aes(y = predicted_rss, x = is_train)) + 
   geom_boxplot()

# To have a better look zoom in between -1000,1000

ggplot(diamonds, aes(y = predicted_rss, x = is_train)) + 
   geom_boxplot() + 
   coord_cartesian(ylim = c(-1000, 1000))

Theory says that the error $\epsilon$ in a linear regression model should be normally distributed with mean 0. Looking at the charts, which of the assumptions is violated (normality or mean 0)?
A common way to examine the normality assumption is a qqplot. The qqplot plots the quantiles of the observations against the quantiles of the normal distribution. Points that deviate from the diagonal reference line indicate that the distribution departs from the normal distribution. Build a qqplot to detect such deviations.

ggplot(diamonds, aes(sample = XXX)) + 
   geom_qq() + geom_qq_line()
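The same kind of chart can be produced in base R with qqnorm() and qqline(). The sketch below uses residuals from an arbitrary model on mtcars, chosen only for illustration.

```r
cars_lm <- lm(mpg ~ wt + hp, data = mtcars)
cars_resid <- residuals(cars_lm)

# plot.it = FALSE returns the theoretical (x) and sample (y) quantiles without drawing
qq_points <- qqnorm(cars_resid, plot.it = FALSE)
head(data.frame(theoretical = qq_points$x, sample = qq_points$y))

# the usual plot, with a reference line through the first and third quartiles
qqnorm(cars_resid)
qqline(cars_resid)
```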

Now we will try to improve the predictions by splitting the dataset according to the clarity variable and fitting a separate regression model in each part. Split the dataset into diamonds in categories I1-VS1 versus diamonds in categories VVS2-IF.

# create a variable which will split the dataset
diamonds <- diamonds %>%
               mutate(high_clarity = clarity %in% c(XXX, XXX, XXX))

# split the dataset to low clarity and high clarity
diamonds_lc <- diamonds %>%
                  filter(XXX)
diamonds_hc <- diamonds %>%
                  filter(XXX)

# generate the linear models.
# think about - should you put the "clarity" variable inside the model or not?
diamonds_lc_lm <- lm(formula = XXX ~ XXX + XXX + ...,
                     data = diamonds_lc %>% filter(is_train))
diamonds_hc_lm <- XXX

# compute the residuals of the models, over the test set
diamonds_lc_resid <- XXX
diamonds_hc_resid <- XXX

# compute the rss of the new "split" model
sum(c(diamonds_lc_resid, diamonds_hc_resid)^2)

# versus the earlier error we computed with the single model
sum(  (test_price_lm - diamonds$price[!diamonds$is_train])  ^2  )
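The comparison between one model and two group-specific models can be sketched on iris, again so as not to give away the exercise. The helper test_residuals() is a hypothetical name, and the grouping variable, the predictors, and the deterministic split are all assumptions made for illustration. Whether the split model has the lower RSS depends on the data.

```r
flowers <- iris
flowers$is_train <- seq_len(NROW(flowers)) %% 5 != 0  # deterministic 80/20 split
flowers$is_setosa <- flowers$Species == "setosa"

# helper (hypothetical name): fit on the training rows, return test-set residuals
test_residuals <- function(data) {
  fit <- lm(Sepal.Width ~ Sepal.Length + Petal.Length,
            data = data[data$is_train, ])
  test <- data[!data$is_train, ]
  predict(fit, newdata = test) - test$Sepal.Width
}

resid_single <- test_residuals(flowers)
resid_split  <- c(test_residuals(flowers[flowers$is_setosa, ]),
                  test_residuals(flowers[!flowers$is_setosa, ]))

sum(resid_single^2)  # RSS of one model for all rows
sum(resid_split^2)   # RSS of two models, one per group
```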

If you did the exercise correctly, you probably noticed some improvement in the predictions after splitting into two models. In a sense, this resembles a different kind of model called decision trees (classification and regression trees), which we will study in a more advanced chapter. They too partition the space into separate regions.