Introduction to Data Science HW 4

Copyright Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Merissah Gilbert

Attribution statement: (choose only one and delete the rest)

# 1. I did this homework by myself, with help from the book and the professor.

Reminders of things to practice from previous weeks:
Descriptive statistics: mean( ) max( ) min( )
Coerce to numeric: as.numeric( )

Part 1: Use the Starter Code

Below, I have provided a starter file to help you.

Each of these lines of code must be commented (the comment must explain what is going on, so that I know you understand the code and results).

library(jsonlite)
#this line of code tells R to load the jsonlite package (already installed) into the current R session so its functions can be used
dataset <- url("https://intro-datascience.s3.us-east-2.amazonaws.com/role.json")
#this line of code creates a connection to the JSON file at the url and assigns it to 'dataset' (the name we are using for that connection)
readlines <- jsonlite::fromJSON(dataset)
#this line of code uses jsonlite's fromJSON() to read and parse the JSON behind 'dataset' and assigns the result to the name 'readlines'
df <- readlines$objects$person
#this line of code assigns the name 'df' to the 'person' data frame, which is nested inside the 'objects' element of 'readlines'

Explore the df dataframe (e.g., using head() or whatever you think is best).

str(df)

## 'data.frame':    100 obs. of  17 variables:
##  $ bioguideid  : chr  "C000880" "G000386" "L000174" "M001153" ...
##  $ birthday    : chr  "1951-05-20" "1933-09-17" "1940-03-31" "1957-05-22" ...
##  $ cspanid     : int  26440 1167 1552 1004138 25277 5929 1859 1962 45465 92069 ...
##  $ firstname   : chr  "Michael" "Charles" "Patrick" "Lisa" ...
##  $ gender      : chr  "male" "male" "male" "female" ...
##  $ gender_label: chr  "Male" "Male" "Male" "Female" ...
##  $ lastname    : chr  "Crapo" "Grassley" "Leahy" "Murkowski" ...
##  $ link        : chr  "https://www.govtrack.us/congress/members/michael_crapo/300030" "https://www.govtrack.us/congress/members/charles_grassley/300048" "https://www.govtrack.us/congress/members/patrick_leahy/300065" "https://www.govtrack.us/congress/members/lisa_murkowski/300075" ...
##  $ middlename  : chr  "D." "E." "J." "A." ...
##  $ name        : chr  "Sen. Michael “Mike” Crapo [R-ID]" "Sen. Charles “Chuck” Grassley [R-IA]" "Sen. Patrick Leahy [D-VT]" "Sen. Lisa Murkowski [R-AK]" ...
##  $ namemod     : chr  "" "" "" "" ...
##  $ nickname    : chr  "Mike" "Chuck" "" "" ...
##  $ osid        : chr  "N00006267" "N00001758" "N00009918" "N00026050" ...
##  $ pvsid       : chr  "26830" "53293" "53353" "15841" ...
##  $ sortname    : chr  "Crapo, Michael “Mike” (Sen.) [R-ID]" "Grassley, Charles “Chuck” (Sen.) [R-IA]" "Leahy, Patrick (Sen.) [D-VT]" "Murkowski, Lisa (Sen.) [R-AK]" ...
##  $ twitterid   : chr  "MikeCrapo" "ChuckGrassley" "SenatorLeahy" "LisaMurkowski" ...
##  $ youtubeid   : chr  "senatorcrapo" "senchuckgrassley" "SenatorPatrickLeahy" "senatormurkowski" ...

head(df)

##   bioguideid   birthday cspanid firstname gender gender_label  lastname
## 1    C000880 1951-05-20   26440   Michael   male         Male     Crapo
## 2    G000386 1933-09-17    1167   Charles   male         Male  Grassley
## 3    L000174 1940-03-31    1552   Patrick   male         Male     Leahy
## 4    M001153 1957-05-22 1004138      Lisa female       Female Murkowski
## 5    M001111 1950-10-11   25277     Patty female       Female    Murray
## 6    S000148 1950-11-23    5929   Charles   male         Male   Schumer
##                                                               link middlename
## 1    https://www.govtrack.us/congress/members/michael_crapo/300030         D.
## 2 https://www.govtrack.us/congress/members/charles_grassley/300048         E.
## 3    https://www.govtrack.us/congress/members/patrick_leahy/300065         J.
## 4   https://www.govtrack.us/congress/members/lisa_murkowski/300075         A.
## 5     https://www.govtrack.us/congress/members/patty_murray/300076           
## 6  https://www.govtrack.us/congress/members/charles_schumer/300087         E.
##                                   name namemod nickname      osid pvsid
## 1     Sen. Michael “Mike” Crapo [R-ID]             Mike N00006267 26830
## 2 Sen. Charles “Chuck” Grassley [R-IA]            Chuck N00001758 53293
## 3            Sen. Patrick Leahy [D-VT]                  N00009918 53353
## 4           Sen. Lisa Murkowski [R-AK]                  N00026050 15841
## 5             Sen. Patty Murray [D-WA]                  N00007876 53358
## 6  Sen. Charles “Chuck” Schumer [D-NY]            Chuck N00001093 26976
##                                  sortname     twitterid           youtubeid
## 1     Crapo, Michael “Mike” (Sen.) [R-ID]     MikeCrapo        senatorcrapo
## 2 Grassley, Charles “Chuck” (Sen.) [R-IA] ChuckGrassley    senchuckgrassley
## 3            Leahy, Patrick (Sen.) [D-VT]  SenatorLeahy SenatorPatrickLeahy
## 4           Murkowski, Lisa (Sen.) [R-AK] LisaMurkowski    senatormurkowski
## 5             Murray, Patty (Sen.) [D-WA]   PattyMurray  SenatorPattyMurray
## 6  Schumer, Charles “Chuck” (Sen.) [D-NY]    SenSchumer      SenatorSchumer

Explain the dataset
o What is the dataset about?
o How many rows are there and what does a row represent?
o How many columns and what does each column represent?

#The dataset is about members of the US Senate. There are 100 rows, and each row represents one senator.
#There are 17 columns holding identifying information about each senator, such as first name, last name, birthday, gender, a link to the senator's GovTrack page, YouTube and Twitter names, etc.

C. What does running this line of code do? Explain in a comment:

vals <- substr(df$birthday,1,4)
#running this line of code takes the first 4 characters (the birth year) of each value in the birthday column of the data set 'df' and assigns the result to the name 'vals'
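As a quick check of what substr() returns here, the sketch below applies it to two birthday strings copied from the str(df) output above. The names birthdays and years are illustrative stand-ins, not part of the starter code.

```r
# Two birthday strings copied from the str(df) output (stand-in for df$birthday)
birthdays <- c("1951-05-20", "1933-09-17")

# substr(x, start, stop) keeps characters start..stop of each string
years <- substr(birthdays, 1, 4)

print(years)        # the four-digit years, one per input string
print(class(years)) # still character, so arithmetic needs as.numeric()
```

Because the result is still a character vector, it has to be converted before any subtraction, which is what the hint in the next step refers to.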

D. Create a new attribute ‘age’: how old the person is. Hint: You may need to convert it to numeric first.

vals <- as.numeric(vals)
age <- (2024-(vals))
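The hard-coded 2024 above will go out of date. A minimal sketch of one alternative, assuming a year-only approximation of age is acceptable, takes the current year from Sys.Date(). The names birth_years, current_year and age_now are hypothetical and chosen only for this illustration.

```r
# Hypothetical stand-in for the numeric birth years stored in vals
birth_years <- as.numeric(c("1951", "1933", "1940"))

# Current calendar year as a number, taken from the system date
current_year <- as.numeric(format(Sys.Date(), "%Y"))

# Year-only difference; it can overstate age by one year for anyone
# whose birthday has not yet occurred in the current year
age_now <- current_year - birth_years
print(age_now)
```

The trade-off is reproducibility: with a fixed year the results stay the same on every run, while the version above changes as time passes.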

E. Create a function that reads in the role json dataset, and adds the age attribute to the dataframe, and returns that dataframe

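newfun <- function(df){
  #take the birth year from the first 4 characters of birthday and convert it to numeric
  birthyear <- as.numeric(substr(df$birthday, 1, 4))
  #compute age inside the function (as of 2024) so it does not depend on a global 'age' variable
  df$age <- 2024 - birthyear
  return(df)
}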
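The prompt for E also asks for the function to read in the JSON itself. One possible sketch, under the assumption that the file has the same objects/person nesting as role.json, is shown below. The helper name read_roles_with_age and its ref_year parameter are hypothetical, and the small temporary file exists only so the example can be tried without network access.

```r
library(jsonlite)

# Hypothetical helper: read a role JSON file (URL or local path) and
# return the person data frame with an added age column.
# ref_year defaults to 2024 to match the calculation used in this homework.
read_roles_with_age <- function(source, ref_year = 2024) {
  parsed <- jsonlite::fromJSON(source)
  people <- parsed$objects$person
  people$age <- ref_year - as.numeric(substr(people$birthday, 1, 4))
  people
}

# Small local file with the same nesting as role.json (illustrative data
# copied from the first two rows shown above)
tmp <- tempfile(fileext = ".json")
writeLines('{"objects":[{"person":{"firstname":"Michael","birthday":"1951-05-20"}},{"person":{"firstname":"Charles","birthday":"1933-09-17"}}]}', tmp)

demo <- read_roles_with_age(tmp)
print(demo)
```

Passing the URL from the starter code in place of tmp should work the same way, since fromJSON() accepts a URL, a file path, or a JSON string.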

F. Use (call, invoke) the function, and store the results in df

df <- newfun(df)

Part 2: Investigate the resulting dataframe ‘df’

How many senators are women?

library(dbplyr)
sum(df$gender_label=="Female")

## [1] 24

How many senators have a YouTube account?

length(na.omit(df$youtubeid))

## [1] 73
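One caveat with this count: na.omit() removes only NA values. The str(df) output shows that some columns (nickname, for example) store missing values as empty strings, so if youtubeid ever did the same, those entries would still be counted. The sketch below uses a hypothetical vector ids to show the difference; whether youtubeid actually contains empty strings is not something the output above confirms.

```r
# Hypothetical id vector mixing both kinds of "missing": NA and ""
ids <- c("senatorcrapo", NA, "", "senatormurkowski")

# na.omit() drops only NA, so the empty string is still counted
n_not_na <- length(na.omit(ids))

# Count entries that are neither NA nor an empty string
n_present <- sum(!is.na(ids) & ids != "")

print(c(n_not_na, n_present))
```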

How many women senators have a YouTube account?

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:dbplyr':
## 
##     ident, sql

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

womenyout <- function(df){
  df %>%
    filter(!is.na(youtubeid), gender_label=="Female")
}
nrow(womenyout(df))

## [1] 16

Create a new dataframe called youtubeWomen that only includes women senators who have a YouTube account.

youtubeWomen <- womenyout(df)
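For comparison, the same row selection can be sketched in base R with a logical index, without dplyr. The data frame mini below is a hypothetical miniature of df containing only the two columns the filter uses.

```r
# Hypothetical miniature version of df with only the two columns used
mini <- data.frame(
  gender_label = c("Male", "Female", "Female", "Female"),
  youtubeid    = c("a", "b", NA, "c"),
  stringsAsFactors = FALSE
)

# Logical row index: TRUE where youtubeid is present and the senator is a woman
keep <- !is.na(mini$youtubeid) & mini$gender_label == "Female"

# Base-R counterpart of filter(): keep matching rows, all columns
mini_women <- mini[keep, ]
print(nrow(mini_women))
```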

Make a histogram of the age of senators in youtubeWomen, and then another for the senators in df. Add a comment describing the shape of the distributions.

hist(youtubeWomen$age)

hist(df$age)

#the shape of each histogram resembles a normal distribution, with some outliers or anomalies.
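A description of shape can be backed up with numbers. The sketch below is one way to do that, using a hypothetical vector ages rather than the real df$age: it reads the bin counts from hist() without drawing, then compares the mean with the median.

```r
# Hypothetical ages, used only to illustrate the check
ages <- c(45, 52, 58, 61, 63, 66, 68, 70, 73, 91)

# hist(plot = FALSE) returns the bin edges and counts without drawing
h <- hist(ages, breaks = seq(40, 100, by = 10), plot = FALSE)
print(h$counts)

# A mean well above the median suggests a right (positive) skew,
# a mean well below it suggests a left skew
print(c(mean = mean(ages), median = median(ages)))
```

Running the same two checks on youtubeWomen$age and df$age would show whether the roughly bell-shaped impression from the plots holds up.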