class: center, middle, inverse, title-slide

# Week 10
## Oct. 25
### Alea Wilbur-Mujtaba
### Updated: 2021-10-25

---

<style type="text/css">
mark{
  background-color:#e9c46a;
}
.highlight {
  background-color: yellow;
}
</style>

---
class: inverse, middle, center

# Coding in R

---

# Assignments

- Submit ONLY your code (not the R output; if I need it, I will run your code)
- Code should be a clean version (i.e., I should be able to just run your code and reproduce your output)
- If you make mistakes, go back and revise your code. Clean it up before you submit it.
- You can leave comments about errors, warnings, or anything that happened to you

```r
# I learned that if I forget a comma, I will get this error message
# telling me about an "unexpected numeric constant"
# cat_count <- c(5, 14 2)
# Error: unexpected numeric constant in "cat_count <- c(5, 14 2)"
```

---

# Comments

- You don't need to tell me what you did, but what the code does

**NO:**

```r
#numbers changed from original
exp_timespent
## [1] 40 10 15 10 25
reality_timespent
## [1] 30 30 20 10 10
```

**YES:**

```r
# Assign activities that I expected to do during
# quarantine to an object titled "exp_activities".
exp_activities <- c("Research", "Sew", "Crochet")

# Look at object to check my work
exp_activities
```

---

# Comments

You can (and should) leave comments in between pieces of code or at the ends of lines of code. This will be very helpful when we get to graphs and summarizing data.

```r
# Create a pie chart graph representing the activities done in reality.
reality <- ggplot(lockdown_reality,                      # Call the dataset that I want to use "lockdown_reality"
                  aes(x = "",
                      y = reality_timespent,             # y represents amount for each category
                      fill = reality_activities)) +      # fill represents the names of the categories
  geom_bar(stat = "identity", color = "white", size = 2) + # Creates stacked bar chart
  coord_polar("y", start = 0) +                          # puts bar chart on a circle axis (becomes pie chart)
  theme_void() +                                         # gets rid of extra grid stuff
  theme(legend.position = "bottom",                      # Create a legend and put it at the bottom
        legend.title = element_blank(),                  # Eliminates legend title
        legend.direction = "horizontal",                 # legend is horizontal list
        plot.title = element_text(hjust = 0.5, size = 22, face = "bold")) +
  scale_fill_brewer(palette = "BuPu", direction = -1) +  # flips order of Color theme
  ggtitle("Reality")

reality # view graph
```

---

# Why so much emphasis on code??

1. You need to make your code reproducible by others. This means explaining to them (and your future self) what you did at each stage and how the code works.

  - You will forget and you will be sad otherwise
  - Your colleagues and I will very much appreciate clear comments!

2. This is a way of learning: you note down things that you notice and think about them. It will also help you identify the parts of your code that you don't understand and ask questions!

---

# Nice and ugly code

Writing "nice" code is about:

* Always leave a space before and after an operator symbol

**Ugly code**

```r
2*3
object1<-2
```

**Nice code**

```r
2 * 3
object1 <- 2
```

---

# Nice and ugly code

* Commas follow a similar rule: leave a space after each comma (but not before)
* BUT we don't leave spaces before and after parentheses

**Ugly code**

```r
object1<-c(9,3,4,1)
object1 <- c ( 9,3,4,1 )
```

**Nice code**

```r
object1 <- c(9,3,4,1)
#Even more correct:
object1 <- c(9, 3, 4, 1)
```

Keep noticing this as we move forward and get used to writing nice code from the very beginning: habits are difficult to change!
---
class: inverse, middle, center

# Dollar sign syntax & Tidyverse syntax

---

# Background info

There are some coordinated efforts (e.g., tidyverse) but, in general, distributed development means that uniform conventions are often not followed concerning function names, arguments, and documentation.

This means that there are several ways to "code" in R and get to the same output.

The two main approaches discussed here: dollar sign syntax vs. tidyverse syntax

---

<div class="figure">
<img src="SyntaxComparisonCheatsheet.jpg" alt="https://twitter.com/AmeliaMN/status/1347325301273939969/photo/1" width="2729" />
<p class="caption">https://twitter.com/AmeliaMN/status/1347325301273939969/photo/1</p>
</div>

---

# Rectangular data

In most cases, we will use tibbles or dataframes (both store data as a table of rows and columns).

A **dataframe** is a dataset in R (like an Excel worksheet).

One main difference between dataframes and tibbles is how your output is displayed when looking at datasets.

---

# Dataframes and Dollar Sign Syntax

Since columns have names, we can call each column using the symbol **$**

```r
people$names
```

```
## [1] "Victor" "Vicky" "Victoria" "Vinny" "Val"
```

```r
people$yob
```

```
## [1] 1994 1987 1989 1985 1993
```

---
class: inverse, middle, center

## Important

### You always need to call both the dataset and the column name

### datasetName$ColumnName

---

# Dataframes vs Tibbles

'Dollar sign' syntax is so called because it uses **$** to connect a dataframe name with a column name.

Dataframes are a very common way to work with data in R. Some functions do not work with tibbles (the tidyverse data format), so you'll likely come back to dataframes at some point (e.g., in regression analysis classes).

Tidyverse is better for data wrangling and visualization.

---

# Tidyverse package

We will start working with the tidyverse package, which offers another coding approach (syntax) in R.
```r
# If you haven't installed it yet
install.packages("tidyverse")
```

```r
# If you have installed it already
library(tidyverse)
```

---

# Some data

We are going to use the dcldata package, which provides the data for our examples.

Source: https://rdrr.io/github/dcl-docs/dcldata/

Book with examples: [Data Wrangling, Sara Altman & Bill Behrman](https://dcl-wrangle.stanford.edu/index.html)

```r
install.packages("remotes")
remotes::install_github("dcl-docs/dcldata")
```

```r
library("dcldata")
```

---

# Tibbles

Now we will learn about **tibbles**, which are the dataframes of the tidyverse.

Let's load the same dataset in both dataframe and tibble format.

```r
congress.df = as.data.frame(congress) # dataframe
congress_tib = as_tibble(congress) # tibble
```

> Start noticing that tidyverse prefers an underscore in function names, whereas traditional R uses a dot

> Note that you can always switch from one to the other using these functions.

---

# When you open a dataset

These are some of the first things that you **always** want to do when you open a dataset: check its dimensions and column names, and have a look at the data.

```r
# Dimensions of your dataset
dim(congress_tib)
nrow(congress_tib)
ncol(congress_tib)

# Column names
colnames(congress_tib)

# Full picture of your dataset
str(congress_tib)
summary(congress_tib)

# View pieces of your data (default is 6 rows)
head(congress_tib)
head(congress_tib, 10)
tail(congress_tib, 1)

# View your data
View(congress_tib)
View(people)
```

---

# Tibbles vs dataframes

Let's start by printing the dataframe and the tibble and note a few differences in how they appear.

```r
people
congress_tib
```

--

An advantage of tibbles is that they don't print out the entire dataset, which spares you a flood of output when you type the name of a very large dataset by mistake.

Note, again, that you can view both tibbles and dataframes using the **View** function.
```r
View(people)
View(congress_tib)
```

---

# Tibbles vs dataframes

Note that, in tibbles, each column reports its type

* **int** stands for integers (4, 5, 6)
* **dbl** stands for doubles, or real numbers (4.4, 5.55, 6.66)
* **chr** stands for character vectors, or strings ("federica", "fusi")
* **dttm** stands for date-times (a date + a time) (2020-12-24 08:06:00)
* **lgl** stands for logical, vectors that contain only TRUE or FALSE.
* **fct** stands for factors, which R uses to represent categorical variables with fixed possible values ("car", "bus", "train")
* **date** stands for dates (01/12/2020)

---

# Tibbles vs dataframes

The functions that we learned for dataframes work for tibbles too.

You can call a column by its name and apply functions to it

```r
table(people$state)
table(congress_tib$state)
```

---

# Tibbles vs. dataframes

- Tibbles don't work with all commands so... don't forget about dataframes!!
- You can always go from a tibble to a dataframe

*From now on, we will mostly work with tibbles unless we have a reason not to*.

---

# Column names

We already saw that column names work in the same way they do for dataframes.

You can always rename a column using the **rename** function:

<font size="+2"><center><mark><font color = "#e76f51">rename(newname = oldname)</font></mark></center></font>

```r
congress_tib2 <-                   # New tibble with renamed columns
  rename(congress_tib,             # Original tibble
         first_lastname = name,    # new and old name
         geo_division = division)  # new and old name
```

---

# Column names

Two other helpful commands when you are renaming columns are **tolower** and **toupper**. They transform a string into all lower case or all upper case, respectively.

If you want to change all the column names, you can use them with the **colnames** command.
<font size="+2"><center><mark><font color = "#e76f51">colnames(YourTibble) <- toupper(colnames(YourTibble))</font></mark></center></font>

```r
colnames(congress_tib2)
colnames(congress_tib2) <- toupper(colnames(congress_tib2))
colnames(congress_tib2) # Double check if you achieved your task
```

This might be helpful if the columns mix upper and lower case and you cannot remember how you named your variables.

---

# Recap notes

Always make sure that your dataset has well-named columns. The same rules apply as for object names:

1. Column names should be **intuitive**
2. Keep them **short**
3. Try to be **consistent** (all lower or all upper case, the same separator, or no separator at all)

---

# Joining data

---

# Open our data

```r
immigration = read_csv("Immigration.csv")
education = read_csv("Education.csv")
income = read_csv("Income.csv")
relationship = read_csv("RelationshipStatus.csv")
```

---

# Joining datasets

A common action is to join (merge) datasets with each other.

<font size="+2"><center><mark><font color = "#e76f51"> SomeJoinFunction(tibbleX, tibbleY, by = c("ColumnName.x" = "ColumnName.y"))</font></mark></center></font>

You need to know how the merging needs to work:

1. Do you want to keep all rows from both datasets? Or only one? Or only the ones common across both datasets?
2. Do you want to keep all columns from both datasets? Or only some?
3. Is there a unique identifier that is common across both datasets? How should R recognize that two rows refer to the same observation?

You cannot perform a merge without thinking about these questions first. The answers depend *entirely* on the questions you are asking of your data.

---

# Join functions

**full_join()** returns <font color = "#e76f51"> all rows and all columns from both x and y </font>. Where there are no matching values, it returns NA for the missing side.
**inner_join()** returns <font color = "#e76f51">all rows from x where there are matching values in y, and all columns from x and y</font>. If there are multiple matches between x and y, all combinations of the matches are returned.

**left_join()** returns <font color = "#e76f51">all rows from x, and all columns from x and y</font>. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

**right_join()** returns <font color = "#e76f51">all rows from y, and all columns from x and y</font>. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

---

# Try on your own

1. You want to merge the immigration and education datasets. You want to keep all matching rows.
2. You want to merge the immigration and education datasets and keep as much information as possible from the largest dataset.
3. You want to merge the datasets education and income and keep as much information as possible from both datasets.

---

# Solutions

**Q1**

```r
data_1 = inner_join(immigration, education)
nrow(data_1)
```

**Q2**

```r
data_2 = left_join(immigration, education)
```

**Q3**

```r
data_3 = full_join(education, income)
```

---

# Just one try

Now try to merge the immigration and relationship datasets so that we keep all matching rows. What happens?

--

```r
data_4 = inner_join(immigration, relationship, by = c("id" = "id"))
```

--

By default, join functions match on all columns that have the same name in both datasets. If you don't want this default, you need to specify which columns should be considered:

- This could be fewer columns (e.g., in this case)
- This could be more columns, for instance columns whose names do not match.
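---

# Join functions: a toy example

To see how the four joins differ, here is a minimal sketch on two tiny made-up tibbles. The names `students`, `grades`, and `student_id` are hypothetical and are not part of the course data; only the dplyr join functions come from the slides above.

```r
library(dplyr)

# Hypothetical toy data (not the course datasets)
students <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cleo"))
grades   <- tibble(id = c(2, 3, 4), grade = c(90, 85, 70))

joined_inner <- inner_join(students, grades, by = "id") # only ids found in both: 2 and 3
joined_left  <- left_join(students, grades, by = "id")  # all students; grade is NA for id 1
joined_right <- right_join(students, grades, by = "id") # all grades; name is NA for id 4
joined_full  <- full_join(students, grades, by = "id")  # every id from either table: 1 to 4

# When the key columns have different names, map them explicitly
grades_renamed <- rename(grades, student_id = id)
joined_named <- inner_join(students, grades_renamed, by = c("id" = "student_id"))
```

Comparing `nrow()` across the four results is a quick way to check which rows each join kept.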
---

# Other join commands

Source: https://dplyr.tidyverse.org/reference/join.html

**semi_join()** returns all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, whereas a semi join will never duplicate rows of x.

**anti_join()** returns all rows from x where there are no matching values in y, keeping just columns from x

---
class: inverse, middle, center

# Unite and separate

**unite** is a function that allows you to combine the content of two or more columns.

**separate** allows you to separate the content of one column into two or more columns.

<font size="+2"><center><mark><font color = "#e76f51">unite(YourData, NewColumn, OldColumns, sep = " ")</font></mark></center></font>

<font size="+2"><center><mark><font color = "#e76f51">separate(YourData, OldColumn, into = c("NewColumn1", "NewColumn2"), sep = " ")</font></mark></center></font>

Note the argument **sep**. It's pretty common and it always indicates the symbol at which a string should be split or joined (common ones: |, _, a blank space, a comma...)

---

# Separate

Let's see a quick example

```r
table3
```

--

We need to separate the column *rate* into two columns, cases and population.

```r
table3_R = separate(table3, rate, into = c("cases", "population"))

# In this case, the default works but to be sure you can specify your 'separator'
table3_R = separate(table3, rate, into = c("cases", "population"), sep = "/")

# The default output is a character column; if you want a better match, you need to specify convert = TRUE
table3_R = separate(table3, rate, into = c("cases", "population"), sep = "/", convert = TRUE)
```

---

# Unite

Let's now do the operation in reverse...
```r
# See the default separator
table3_R2 = unite(table3_R, rate, c(cases, population))

# You can change it
table3_R2 = unite(table3_R, rate, c(cases, population), sep = "-")
```

---

# DPLYR

Dplyr (included in Tidyverse) is a package that allows you to manipulate and summarize your data in a more dynamic way.

For instance, you might want to know how many crimes occur in a set of neighborhoods in a certain time interval; you might want to know the average number of unemployment days by race and gender; the number of children in foster care by city and zip code, and so on. Dplyr makes it easier to answer these questions and manipulate your data.

To use dplyr, there are six verbs that you need to know:

* **filter()** to select cases based on their values
* **arrange()** to reorder the cases
* **select()** to select variables based on their names
* **mutate()** and transmute() to add new variables that are functions of existing variables
* **summarise()** to condense multiple values to a single value
* **group_by()** to group cases according to certain variables

---

# Data

We are going to use some data from the fivethirtyeight package, which takes data from 538's articles and provides them in a tidy format for R users.

cabinet_turnover is a dataset reporting how long the members of the Cabinet stayed in office, as described in this [article](https://fivethirtyeight.com/features/two-years-in-turnover-in-trumps-cabinet-is-still-historically-high/).

```r
#install.packages("fivethirtyeight")
library("fivethirtyeight")
data = cabinet_turnover
```

---

# summarise

**summarise()** condenses multiple values to a single value. In other words, it summarizes **across multiple rows**.

---

# summarise

Let's say we want to find the average length of time that cabinet members stay in office.
```r
data %>%
  summarize(average = mean(length, na.rm = TRUE))
```

Or the earliest that a cabinet member left their job

```r
data %>%
  summarize(min_days = min(days, na.rm = TRUE))
```

---

# group_by

Summarize is most useful when used with group_by, because together they produce grouped summaries.

Let's see one example where we ask the article's main question: which president has the highest turnover?

--

To answer, we need to know the average number of days that cabinet members were in office, by president.

First, we need to group the observations by president and then calculate the average within each group.

```r
data %>%
  group_by(president) %>%
  summarise(presid_length = mean(length, na.rm = TRUE))
```

*Which president had the highest turnover?*

---

# group_by

Which cabinet position is the most likely to see a resignation after a short period in the administration?

Which cabinet position is held, on average, for the shortest amount of time?

--

```r
data %>%
  group_by(position) %>%
  summarise(position_earliest = mean(days, na.rm = TRUE)) %>%
  arrange(position_earliest) #<<

data %>%
  group_by(position) %>%
  summarise(position_length = mean(length, na.rm = TRUE)) %>%
  arrange(position_length)

data %>%
  group_by(position) %>%
  summarise(position_length = mean(length, na.rm = TRUE)) %>%
  arrange(desc(position_length))
```

---

# summarize functions

You should be familiar with most of them:

- mean
- n
- median
- sd
- min
- max
- quantile(x, 0.25)
- sum

---

# count

A couple of things that might be helpful to you moving forward...

Let's say you want to count how many cabinet members have left each president...

```r
data %>%
  group_by(president) %>%
  summarise(count = n()) #<<
# The function n() counts the number of observations per group
```

```r
data %>%
  group_by(appointee) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
# this would allow you to see if someone had more than one appointment.
```

What if you want to know if they had more than one appointment with the same president or another one?

---

# Solutions

There are a few ways to do this and visualize your data

```r
# This is the basic solution.
# Problem: we don't easily see if someone worked for more than one president
data %>%
  group_by(appointee, president) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Things might be easier if we arrange by appointee name as well
# The print command allows you to print more than the standard 10 rows
print(
  data %>%
    group_by(appointee, president) %>%
    summarise(count = n()) %>%
    arrange(appointee, desc(count)),
  n = 300)
# this would allow you to see if someone had more than one appointment.
```

---

# Solutions

```r
# We could filter out the ones that had only one appointment
# We see all those who worked more than once for the same president
data %>%
  group_by(appointee, president) %>%
  summarise(count = n()) %>%
  filter(count > 1) %>%
  arrange(appointee, desc(count))
```

---

# Mutate

**mutate()** and transmute() add new variables that are functions of existing variables. In other words, they allow you to perform operations on your columns.

Let's say that you want to know how long each cabinet member stayed in the administration as a percentage of the full administration term (4 years)

```r
data2 = data %>%
  mutate(perc_length = (length / (365 * 4) * 100))
```

--

Note that you can always make a longer pipe to see the data the way you want

```r
data2 = data %>%
  mutate(perc_length = (length / (365 * 4) * 100)) %>%
  arrange(perc_length)
```

---

# transmute

Transmute works like *mutate*, but the result keeps only the new columns and drops the existing ones

```r
data2 = data %>%
  transmute(perc_length = (length / (365 * 4) * 100)) %>% #<<
  arrange(perc_length)
```

---

# mutate

As with the other functions, you can use 'mutate' within a pipe

How many cabinet members left before mid-administration?
```r
data %>%
  mutate(perc_length = (length / (365 * 4) * 100)) %>%
  filter(perc_length <= 50) %>%
  summarise(count = n())
```

---

# When writing code

1. Do not start to code right away.
2. Write down all the steps you want to do. Don't think about the exact code yet! Use your own words. Make a "game plan"
3. Think of which dplyr functions you will need for each step
4. Start drafting your code.
5. You can run each line of code, adding one line at each step, to make sure you are going in the right direction.
6. Run your final code
7. Check whether your results make sense!
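
---

# A game plan in practice

The steps above can be sketched on a small example. This is only an illustration under stated assumptions: the `visits` tibble and its columns `clinic` and `wait_minutes` are made up for this slide, and the plan is written as comments first, then matched to one dplyr verb per step.

```r
library(dplyr)

# Hypothetical toy data standing in for a real dataset
visits <- tibble(
  clinic = c("North", "North", "South", "South", "South"),
  wait_minutes = c(30, 50, 10, NA, 20)
)

# Game plan, in my own words:
# 1. Drop visits with a missing wait time
# 2. Group the visits by clinic
# 3. Compute the average wait and the number of visits per clinic
# 4. Sort from the longest average wait to the shortest
clinic_waits <- visits %>%
  filter(!is.na(wait_minutes)) %>%          # step 1
  group_by(clinic) %>%                      # step 2
  summarise(avg_wait = mean(wait_minutes),  # step 3
            n_visits = n()) %>%
  arrange(desc(avg_wait))                   # step 4

clinic_waits # check that the result makes sense
```

Running the pipe one line at a time (stopping before each `%>%`) is an easy way to confirm that every step does what the plan says.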