title: “PUAD 5140” subtitle: “Working with your data” author: “type here your name” date: “September 20, 2022” output: html_document: toc: yes toc_float: yes theme: cerulean highlight: haddock toc_depth: 3 pdf_document: toc: yes html_notebook: toc: yes toc_float: yes number_sections: yes highlight: haddock toc_depth: 3 —
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
It’s good practice to first get your packages load it to confirm they are all installed and working
install.packages("devtools")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
devtools::install_github("kosukeimai/qss-package")
## Skipping install of 'qss' from a github remote, the SHA1 (59933ec0) has not changed since last install.
## Use `force = TRUE` to force installation
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("tinytex")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
#tinytex::install_tinytex() ##for mac
install.packages("qss")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
## Warning: package 'qss' is not available for this version of R
##
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
#install.packages("glimpse")
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
#devtools::install_github("ddauber/r4np")
install.packages("ggthemes")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(qss)
library(gapminder)
#library(r4np)
library(ggthemes)
data_url<-("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
candy.raw<-read_csv(data_url)
## Rows: 85 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): competitorname
## dbl (12): chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewaf...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
candy.raw
## # A tibble: 85 × 13
## competito…¹ choco…² fruity caramel peanu…³ nougat crisp…⁴ hard bar pluri…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 Grand 1 0 1 0 0 1 0 1 0
## 2 3 Musketee… 1 0 0 0 1 0 0 1 0
## 3 One dime 0 0 0 0 0 0 0 0 0
## 4 One quarter 0 0 0 0 0 0 0 0 0
## 5 Air Heads 0 1 0 0 0 0 0 0 0
## 6 Almond Joy 1 0 0 1 0 0 0 1 0
## 7 Baby Ruth 1 0 1 1 1 0 0 1 0
## 8 Boston Bak… 0 0 0 1 0 0 0 0 1
## 9 Candy Corn 0 0 0 0 0 0 0 0 1
## 10 Caramel Ap… 0 1 1 0 0 0 0 0 0
## # … with 75 more rows, 3 more variables: sugarpercent <dbl>,
## # pricepercent <dbl>, winpercent <dbl>, and abbreviated variable names
## # ¹competitorname, ²chocolate, ³peanutyalmondy, ⁴crispedricewafer, ⁵pluribus
Let’s look at the number of rows and columns:
dim(candy.raw)
## [1] 85 13
And get a sneak-peak of your dataframe:
head(candy.raw)
## # A tibble: 6 × 13
## competitor…¹ choco…² fruity caramel peanu…³ nougat crisp…⁴ hard bar pluri…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 Grand 1 0 1 0 0 1 0 1 0
## 2 3 Musketeers 1 0 0 0 1 0 0 1 0
## 3 One dime 0 0 0 0 0 0 0 0 0
## 4 One quarter 0 0 0 0 0 0 0 0 0
## 5 Air Heads 0 1 0 0 0 0 0 0 0
## 6 Almond Joy 1 0 0 1 0 0 0 1 0
## # … with 3 more variables: sugarpercent <dbl>, pricepercent <dbl>,
## # winpercent <dbl>, and abbreviated variable names ¹competitorname,
## # ²chocolate, ³peanutyalmondy, ⁴crispedricewafer, ⁵pluribus
candy.data<- candy.raw
head(candy.data)
## # A tibble: 6 × 13
## competitor…¹ choco…² fruity caramel peanu…³ nougat crisp…⁴ hard bar pluri…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 Grand 1 0 1 0 0 1 0 1 0
## 2 3 Musketeers 1 0 0 0 1 0 0 1 0
## 3 One dime 0 0 0 0 0 0 0 0 0
## 4 One quarter 0 0 0 0 0 0 0 0 0
## 5 Air Heads 0 1 0 0 0 0 0 0 0
## 6 Almond Joy 1 0 0 1 0 0 0 1 0
## # … with 3 more variables: sugarpercent <dbl>, pricepercent <dbl>,
## # winpercent <dbl>, and abbreviated variable names ¹competitorname,
## # ²chocolate, ³peanutyalmondy, ⁴crispedricewafer, ⁵pluribus
candy.data %>% rename(candy_name =competitorname,
hard_candy=hard)
## # A tibble: 85 × 13
## candy_n…¹ choco…² fruity caramel peanu…³ nougat crisp…⁴ hard_…⁵ bar pluri…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 Grand 1 0 1 0 0 1 0 1 0
## 2 3 Musket… 1 0 0 0 1 0 0 1 0
## 3 One dime 0 0 0 0 0 0 0 0 0
## 4 One quar… 0 0 0 0 0 0 0 0 0
## 5 Air Heads 0 1 0 0 0 0 0 0 0
## 6 Almond J… 1 0 0 1 0 0 0 1 0
## 7 Baby Ruth 1 0 1 1 1 0 0 1 0
## 8 Boston B… 0 0 0 1 0 0 0 0 1
## 9 Candy Co… 0 0 0 0 0 0 0 0 1
## 10 Caramel … 0 1 1 0 0 0 0 0 0
## # … with 75 more rows, 3 more variables: sugarpercent <dbl>,
## # pricepercent <dbl>, winpercent <dbl>, and abbreviated variable names
## # ¹candy_name, ²chocolate, ³peanutyalmondy, ⁴crispedricewafer, ⁵hard_candy,
## # ⁶pluribus
candy.data <- candy.data %>% rename(candy_name =competitorname,
hard_candy=hard)
head(candy.data)
## # A tibble: 6 × 13
## candy_name choco…¹ fruity caramel peanu…² nougat crisp…³ hard_…⁴ bar pluri…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 Grand 1 0 1 0 0 1 0 1 0
## 2 3 Muskete… 1 0 0 0 1 0 0 1 0
## 3 One dime 0 0 0 0 0 0 0 0 0
## 4 One quart… 0 0 0 0 0 0 0 0 0
## 5 Air Heads 0 1 0 0 0 0 0 0 0
## 6 Almond Joy 1 0 0 1 0 0 0 1 0
## # … with 3 more variables: sugarpercent <dbl>, pricepercent <dbl>,
## # winpercent <dbl>, and abbreviated variable names ¹chocolate,
## # ²peanutyalmondy, ³crispedricewafer, ⁴hard_candy, ⁵pluribus
Exclude the variable pluribus from our dataframe
candy.data<- candy.data %>% select(-pluribus)
Select a subset of variables [candy_name, chocolate, sugarpercent,winpercent, caramel] in a new dataframe named candy.data2
candy.data2<-candy.data %>% select (candy_name, chocolate, sugarpercent,winpercent, caramel)
head(candy.data2)
## # A tibble: 6 × 5
## candy_name chocolate sugarpercent winpercent caramel
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 100 Grand 1 0.732 67.0 1
## 2 3 Musketeers 1 0.604 67.6 0
## 3 One dime 0 0.011 32.3 0
## 4 One quarter 0 0.011 46.1 0
## 5 Air Heads 0 0.906 52.3 0
## 6 Almond Joy 1 0.465 50.3 0
Filter a subset of rows that meet a condition (only candys that are chocolate )
candy.data2<-candy.data2 %>% filter(chocolate==1)
Filter a subset of rows that meet two expressions (candy is chocolate and caramel)
candy.data3<-candy.data2 %>% filter(chocolate==1 & caramel==1)
Create a new variable named total that adds the number of ingredients for each candy name in the dataframe candy.data2:
candy.data <- candy.data %>% mutate(total = chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard_candy + bar)
head(candy.data)
## # A tibble: 6 × 13
## candy_name choco…¹ fruity caramel peanu…² nougat crisp…³ hard_…⁴ bar sugar…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 Grand 1 0 1 0 0 1 0 1 0.732
## 2 3 Muskete… 1 0 0 0 1 0 0 1 0.604
## 3 One dime 0 0 0 0 0 0 0 0 0.011
## 4 One quart… 0 0 0 0 0 0 0 0 0.011
## 5 Air Heads 0 1 0 0 0 0 0 0 0.906
## 6 Almond Joy 1 0 0 1 0 0 0 1 0.465
## # … with 3 more variables: pricepercent <dbl>, winpercent <dbl>, total <dbl>,
## # and abbreviated variable names ¹chocolate, ²peanutyalmondy,
## # ³crispedricewafer, ⁴hard_candy, ⁵sugarpercent
Hmmmm… that took too long! Let’s find a shorter way to do this last step. Let’s use the handy rowwise(). Rowwise() makes it easy to summarise values across columns within one row.
candy.data <- candy.data %>% rowwise() %>% mutate(total3 = sum(c_across(chocolate:bar)))
candy.data
## # A tibble: 85 × 14
## # Rowwise:
## candy_n…¹ choco…² fruity caramel peanu…³ nougat crisp…⁴ hard_…⁵ bar sugar…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 Grand 1 0 1 0 0 1 0 1 0.732
## 2 3 Musket… 1 0 0 0 1 0 0 1 0.604
## 3 One dime 0 0 0 0 0 0 0 0 0.011
## 4 One quar… 0 0 0 0 0 0 0 0 0.011
## 5 Air Heads 0 1 0 0 0 0 0 0 0.906
## 6 Almond J… 1 0 0 1 0 0 0 1 0.465
## 7 Baby Ruth 1 0 1 1 1 0 0 1 0.604
## 8 Boston B… 0 0 0 1 0 0 0 0 0.313
## 9 Candy Co… 0 0 0 0 0 0 0 0 0.906
## 10 Caramel … 0 1 1 0 0 0 0 0 0.604
## # … with 75 more rows, 4 more variables: pricepercent <dbl>, winpercent <dbl>,
## # total <dbl>, total3 <dbl>, and abbreviated variable names ¹candy_name,
## # ²chocolate, ³peanutyalmondy, ⁴crispedricewafer, ⁵hard_candy, ⁶sugarpercent
Look at a summary of the data:
now let’s explore a summary of this new var. Note I am not assigning anything back to my candy.data dataframe
And if we want to have a summary broken down by groups (say those candy names that have chocolate and those that don’t):
candy.data %>% count (chocolate ,total) %>%
group_by(chocolate) %>%
# count creates a column called 'n'
mutate(percent = n / sum(n) * 100)
## # A tibble: 9 × 4
## # Groups: chocolate [2]
## chocolate total n percent
## <dbl> <dbl> <int> <dbl>
## 1 0 0 5 10.4
## 2 0 1 28 58.3
## 3 0 2 14 29.2
## 4 0 3 1 2.08
## 5 1 1 7 18.9
## 6 1 2 13 35.1
## 7 1 3 10 27.0
## 8 1 4 4 10.8
## 9 1 5 3 8.11
candy.data %>%
group_by(chocolate) %>%
summarize(nougatcount = sum(nougat), caremelcount =sum(caramel), barcount =sum(bar))
## # A tibble: 2 × 4
## chocolate nougatcount caremelcount barcount
## <dbl> <dbl> <dbl> <dbl>
## 1 0 1 4 1
## 2 1 6 10 20
Create a new variable that contains two categories: high, when the sugar content is greter or equal than 0.55 and low, when sugarpercent is lower than 0.55
candy.data2<-candy.data2 %>% mutate(choc_sugar = case_when(sugarpercent>=0.55 ~ 'high',
sugarpercent<0.55 ~ 'low'))
Let’s export the file we just created as a comma-separated file.
write.csv(candy.data2 )
## "","candy_name","chocolate","sugarpercent","winpercent","caramel","choc_sugar"
## "1","100 Grand",1,0.73199999,66.971725,1,"high"
## "2","3 Musketeers",1,0.60399997,67.602936,0,"high"
## "3","Almond Joy",1,0.465,50.347546,0,"low"
## "4","Baby Ruth",1,0.60399997,56.914547,1,"high"
## "5","Charleston Chew",1,0.60399997,38.975037,0,"high"
## "6","HersheyÕs Kisses",1,0.127,55.375454,0,"low"
## "7","HersheyÕs Krackel",1,0.43000001,62.284481,0,"low"
## "8","HersheyÕs Milk Chocolate",1,0.43000001,56.490501,0,"low"
## "9","HersheyÕs Special Dark",1,0.43000001,59.236122,0,"low"
## "10","Junior Mints",1,0.197,57.21925,0,"low"
## "11","Kit Kat",1,0.31299999,76.7686,0,"low"
## "12","Peanut butter M&MÕs",1,0.82499999,71.46505,0,"high"
## "13","M&MÕs",1,0.82499999,66.574585,0,"high"
## "14","Milk Duds",1,0.30199999,55.064072,1,"low"
## "15","Milky Way",1,0.60399997,73.099556,1,"high"
## "16","Milky Way Midnight",1,0.73199999,60.800701,1,"high"
## "17","Milky Way Simply Caramel",1,0.96499997,64.35334,1,"high"
## "18","Mounds",1,0.31299999,47.829754,0,"low"
## "19","Mr Good Bar",1,0.31299999,54.526451,0,"low"
## "20","Nestle Butterfinger",1,0.60399997,70.735641,0,"high"
## "21","Nestle Crunch",1,0.31299999,66.47068,0,"low"
## "22","Peanut M&Ms",1,0.59299999,69.483788,0,"high"
## "23","ReeseÕs Miniatures",1,0.034000002,81.866257,0,"low"
## "24","ReeseÕs Peanut Butter cup",1,0.72000003,84.18029,0,"high"
## "25","ReeseÕs pieces",1,0.40599999,73.43499,0,"low"
## "26","ReeseÕs stuffed with pieces",1,0.98799998,72.887901,0,"high"
## "27","Rolo",1,0.86000001,65.716286,1,"high"
## "28","Sixlets",1,0.22,34.722,0,"low"
## "29","Nestle Smarties",1,0.26699999,37.887188,0,"low"
## "30","Snickers",1,0.546,76.673782,1,"low"
## "31","Snickers Crisper",1,0.60399997,59.529251,1,"high"
## "32","Tootsie Pop",1,0.60399997,48.982651,0,"high"
## "33","Tootsie Roll Juniors",1,0.31299999,43.068897,0,"low"
## "34","Tootsie Roll Midgies",1,0.17399999,45.736748,0,"low"
## "35","Tootsie Roll Snack Bars",1,0.465,49.653503,0,"low"
## "36","Twix",1,0.546,81.642914,1,"low"
## "37","Whoppers",1,0.87199998,49.524113,0,"high"
First, remember you need to create a dataframe -object- to assign the data downloaded from GitHub. Name the object (your new dataframe) election.raw. And remember to first create an object named election.url and assign to it this link: https://raw.githubusercontent.com/kosukeimai/qss-package/master/data-raw/csv/elections.csv. [hint1: remember to create a chunk below that contains your code for this step] [hint2: Use the dplyr function to read csv files]
data_url<-("https://raw.githubusercontent.com/kosukeimai/qss-package/master/data-raw/csv/elections.csv")
election.raw<-read_csv(data_url)
## Rows: 43494 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): state, county
## dbl (4): year, rep, dem, other
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
First explore the number of columns and rows in a new chunk. [hint: use dim]
dim(election.raw)
## [1] 43494 6
Now take a peak at the first rows of your dataframe in a new chunk. [hint: use head()]
head(election.raw)
## # A tibble: 6 × 6
## year state county rep dem other
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1960 alabama autauga 1149 1324 89
## 2 1960 alabama baldwin 4812 5647 187
## 3 1960 alabama barbour 1166 2148 37
## 4 1960 alabama bibb 1052 1697 4
## 5 1960 alabama blount 2557 3404 19
## 6 1960 alabama bullock 412 757 6
Insert a new chunk and create a copy of your dataframe: election.clean
election.clean<- election.raw
head(election.clean)
## # A tibble: 6 × 6
## year state county rep dem other
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1960 alabama autauga 1149 1324 89
## 2 1960 alabama baldwin 4812 5647 187
## 3 1960 alabama barbour 1166 2148 37
## 4 1960 alabama bibb 1052 1697 4
## 5 1960 alabama blount 2557 3404 19
## 6 1960 alabama bullock 412 757 6
Create a new variable with the total of all votes for democrats, republicans and other [hint use mutate to create a new variable totalvotes add these three variables, remember to use the operator: +]
election.clean <- election.clean %>% mutate(totalvotes = rep + dem + other)
head(election.clean)
## # A tibble: 6 × 7
## year state county rep dem other totalvotes
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1960 alabama autauga 1149 1324 89 2562
## 2 1960 alabama baldwin 4812 5647 187 10646
## 3 1960 alabama barbour 1166 2148 37 3351
## 4 1960 alabama bibb 1052 1697 4 2753
## 5 1960 alabama blount 2557 3404 19 5980
## 6 1960 alabama bullock 412 757 6 1175
Exclude (filter out) the variable “other” from the election.clean dataframe [hint: use select and the operator:-]
election.clean<- election.clean %>% select(-other)
head(election.clean)
## # A tibble: 6 × 6
## year state county rep dem totalvotes
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1960 alabama autauga 1149 1324 2562
## 2 1960 alabama baldwin 4812 5647 10646
## 3 1960 alabama barbour 1166 2148 3351
## 4 1960 alabama bibb 1052 1697 2753
## 5 1960 alabama blount 2557 3404 5980
## 6 1960 alabama bullock 412 757 1175
Create a new variable demprop in the election.clean dataframe with the proportion of dem votes out of the total [hint: use dem/totalvotes]
election.clean <- election.clean %>% mutate(demprop = dem/totalvotes)
head(election.clean)
## # A tibble: 6 × 7
## year state county rep dem totalvotes demprop
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1960 alabama autauga 1149 1324 2562 0.517
## 2 1960 alabama baldwin 4812 5647 10646 0.530
## 3 1960 alabama barbour 1166 2148 3351 0.641
## 4 1960 alabama bibb 1052 1697 2753 0.616
## 5 1960 alabama blount 2557 3404 5980 0.569
## 6 1960 alabama bullock 412 757 1175 0.644
Now let’s turn that into a percent. Replace the proportions in demprop with the percent [hint: use mutate and the SAME variable name -demprop-, and use demprop*100 to derive the percents]
election.clean <- election.clean %>% mutate(demprop = demprop*100)
head(election.clean)
## # A tibble: 6 × 7
## year state county rep dem totalvotes demprop
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1960 alabama autauga 1149 1324 2562 51.7
## 2 1960 alabama baldwin 4812 5647 10646 53.0
## 3 1960 alabama barbour 1166 2148 3351 64.1
## 4 1960 alabama bibb 1052 1697 2753 61.6
## 5 1960 alabama blount 2557 3404 5980 56.9
## 6 1960 alabama bullock 412 757 1175 64.4
Now let’s create a new dataframe election.ca that contains only a subset of our election.clean dataframe. Use filter to keep only observations in California [hint: use filter(variable ==“value”)]
election.ca <- election.clean
Let’s summarize this data. Create a summary table for your election.ca dataframe that groups observations by county [hint: use group_by and the variable that contains the counties]. the summary table should include the number of total votes, total rep votes and total dem votes [hint: use summarize(sum1 =sum(var1), sum2 =sum(var2), sum3= sum(var3))]
election.ca %>%
group_by(county) %>%
summarize(totalvotes = sum(totalvotes), totalrepvotes =sum(rep), totaldemvotes =sum(dem))
## # A tibble: 1,835 × 4
## county totalvotes totalrepvotes totaldemvotes
## <chr> <dbl> <dbl> <dbl>
## 1 abbeville 101181 46864 48665
## 2 acadia 300951 153571 126933
## 3 accomack 156255 81121 65891
## 4 ada 1374333 802136 488826
## 5 adair 377151 221346 141565
## 6 adams 2998998 1550243 1296437
## 7 addison 187109 79985 95448
## 8 aiken 570328 358094 190050
## 9 aitkin 103645 45022 53018
## 10 alachua 914130 375357 491492
## # … with 1,825 more rows
###Bonus point:
Try creating a density plot for a subset of the states: using ggplot:
#QUIZ ENDS
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.