Introduction to Data Science HW 4

Copyright Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Patrick Smith

Attribution statement: (choose only one and delete the rest)

# 1. I did this homework by myself, with help from the book and the professor.

Reminders of things to practice from previous weeks:
Descriptive statistics: mean( ) max( ) min( )
Coerce to numeric: as.numeric( )

Part 1: Use the Starter Code

Below, I have provided a starter file to help you.

Each of these lines of code must be commented (the comment must explain what is going on, so that I know you understand the code and results).

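library(jsonlite)                             # load the jsonlite package for parsing JSON
dataset <- url("https://intro-datascience.s3.us-east-2.amazonaws.com/role.json")  # open a connection to the JSON file
readlines <- jsonlite::fromJSON(dataset)      # parse the JSON into a nested R list
df <- readlines$objects$person                # keep the person records as a dataframe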
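
To see why `readlines$objects$person` comes back as a dataframe, here is a minimal sketch on a tiny made-up JSON string. The nesting (an `objects` array whose records each hold a `person` object) is an assumption inferred from the starter code, not a copy of the real file; the two names and birthdays are taken from the head() output below.

```r
# Tiny made-up JSON with the nesting the starter code relies on:
# a top-level "objects" array whose records each contain a "person" object.
txt <- '{"objects": [
  {"person": {"firstname": "Michael", "birthday": "1951-05-20"}},
  {"person": {"firstname": "Charles", "birthday": "1933-09-17"}}
]}'

parsed <- jsonlite::fromJSON(txt)   # arrays of records are simplified to dataframes
people <- parsed$objects$person     # the nested "person" column is itself a dataframe

is.data.frame(people)               # TRUE
nrow(people)                        # one row per record in "objects"
```

Each record in the array becomes one row, which is why the real df has one row per senator.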

Explore the df dataframe (e.g., using head() or whatever you think is best).

  bioguideid   birthday cspanid firstname gender gender_label  lastname                                                             link middlename
1    C000880 1951-05-20   26440   Michael   male         Male     Crapo    https://www.govtrack.us/congress/members/michael_crapo/300030         D.
2    G000386 1933-09-17    1167   Charles   male         Male  Grassley https://www.govtrack.us/congress/members/charles_grassley/300048         E.
3    L000174 1940-03-31    1552   Patrick   male         Male     Leahy    https://www.govtrack.us/congress/members/patrick_leahy/300065         J.
4    M001153 1957-05-22 1004138      Lisa female       Female Murkowski   https://www.govtrack.us/congress/members/lisa_murkowski/300075         A.
5    M001111 1950-10-11   25277     Patty female       Female    Murray     https://www.govtrack.us/congress/members/patty_murray/300076           
6    S000148 1950-11-23    5929   Charles   male         Male   Schumer  https://www.govtrack.us/congress/members/charles_schumer/300087         E.
                                  name namemod nickname      osid pvsid                                sortname     twitterid           youtubeid
1     Sen. Michael “Mike” Crapo [R-ID]             Mike N00006267 26830     Crapo, Michael “Mike” (Sen.) [R-ID]     MikeCrapo        senatorcrapo
2 Sen. Charles “Chuck” Grassley [R-IA]            Chuck N00001758 53293 Grassley, Charles “Chuck” (Sen.) [R-IA] ChuckGrassley    senchuckgrassley
3            Sen. Patrick Leahy [D-VT]                  N00009918 53353            Leahy, Patrick (Sen.) [D-VT]  SenatorLeahy SenatorPatrickLeahy
4           Sen. Lisa Murkowski [R-AK]                  N00026050 15841           Murkowski, Lisa (Sen.) [R-AK] LisaMurkowski    senatormurkowski
5             Sen. Patty Murray [D-WA]                  N00007876 53358             Murray, Patty (Sen.) [D-WA]   PattyMurray  SenatorPattyMurray
6  Sen. Charles “Chuck” Schumer [D-NY]            Chuck N00001093 26976  Schumer, Charles “Chuck” (Sen.) [D-NY]    SenSchumer      SenatorSchumer

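Explain the dataset
o What is the dataset about?
- The dataset describes U.S. senators: every name carries a "Sen." prefix followed by party and state. summary(df) gives details on each column.
o How many rows are there and what does a row represent?
- nrow(df) returns 100 rows; each row represents one senator.
o How many columns and what does each column represent?
- ncol(df) returns 17 columns, and colnames(df) lists their names: identifiers (bioguideid, cspanid, osid, pvsid), birthday, gender, name fields, a link, and social media IDs (twitterid, youtubeid).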

  bioguideid          birthday            cspanid         firstname            gender          gender_label         lastname             link          
 Length:100         Length:100         Min.   :    260   Length:100         Length:100         Length:100         Length:100         Length:100        
 Class :character   Class :character   1st Qu.:  25277   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Median :  68489   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                       Mean   : 584001                                                                                              
                                       3rd Qu.:1004138                                                                                                 
                                       Max.   :9269028                                                                                                 
                                       NA's   :11                                                                                                      
  middlename            name             namemod            nickname             osid              pvsid             sortname          twitterid        
 Length:100         Length:100         Length:100         Length:100         Length:100         Length:100         Length:100         Length:100        
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character
                                                                                                                                                  
                                                                                                                                                  
                                                                                                                                                  
                                                                                                                                                  
  youtubeid        
 Length:100        
 Class :character  
 Mode  :character  
 
nrow(df)
[1] 100
ncol(df)
[1] 17
colnames(df)
 [1] "bioguideid"   "birthday"     "cspanid"      "firstname"    "gender"       "gender_label" "lastname"     "link"         "middlename"  
[10] "name"         "namemod"      "nickname"     "osid"         "pvsid"        "sortname"     "twitterid"    "youtubeid"

C. What does running this line of code do? Explain in a comment:

vals <- substr(df$birthday,1,4)

substr() extracts a substring. substr(df$birthday, 1, 4) keeps characters 1 through 4 of each birthday, which is the birth year: for Michael Crapo it returns "1951" instead of "1951-05-20". The result, vals, is a character vector.
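
A short sketch of the same call on two birthdays copied from the head() output above. The variable name `birthday` is only for illustration.

```r
# Two birthdays copied from the head(df) output.
birthday <- c("1951-05-20", "1933-09-17")

vals <- substr(birthday, 1, 4)   # characters 1 to 4: the year
vals                             # "1951" "1933"
class(vals)                      # "character", so it must be converted before doing math

substr(birthday, 6, 7)           # characters 6 to 7: the month, "05" "09"
```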

D. Create a new attribute 'age' (how old the person is). Hint: You may need to convert it to numeric first.

```{r}
old <- as.numeric(vals)  # convert the year strings to numbers
age <- 2021 - old        # age in years as of 2021
age
```
 [1] 70 88 81 64 71 71 87 72 71 71 66 67 66 60 62 60 57 49 53 56 75 50 58 64 50 66 49 53 70 63 57 63 74 88 71 59 69 69 78 67 80 70 48 74 55 61 65 61
[49] 66 69 50 74 45 72 77 60 70 51 63 64 69 67 42 74 57 69 69 77 66 87 79 65 72 78 74 67 58 68 64 49 75 63 44 59 52 57 51 61 67 49 61 63 62 67 67 69
[97] 62 48 34 52
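
Subtracting birth years gives an approximate age, because it ignores whether the birthday has already passed in the reference year. The sketch below shows one possible way to compute an exact age from the full date. The reference date `asOf` is a hypothetical value chosen for illustration; the homework itself only asks for the year difference.

```r
# Two birthdays copied from the head(df) output.
birthday <- as.Date(c("1951-05-20", "1950-11-23"))
asOf <- as.Date("2021-09-01")   # hypothetical reference date (an assumption)

# Year-only difference, as in the homework answer.
yearDiff <- as.numeric(format(asOf, "%Y")) - as.numeric(format(birthday, "%Y"))

# Subtract 1 when the birthday has not yet occurred in the reference year.
hadBirthday <- format(asOf, "%m%d") >= format(birthday, "%m%d")
exactAge <- yearDiff - ifelse(hadBirthday, 0, 1)

yearDiff   # 70 71
exactAge   # 70 70
```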


E. Create a function that reads in the role json dataset, and adds the age attribute to the dataframe, and returns that dataframe


```{r}
df <- data.frame(df, age)  # append age to df as a new column
head(df)                   # confirm the age column now appears last
```
  bioguideid   birthday cspanid firstname gender gender_label  lastname                                                             link middlename
1    C000880 1951-05-20   26440   Michael   male         Male     Crapo    https://www.govtrack.us/congress/members/michael_crapo/300030         D.
2    G000386 1933-09-17    1167   Charles   male         Male  Grassley https://www.govtrack.us/congress/members/charles_grassley/300048         E.
3    L000174 1940-03-31    1552   Patrick   male         Male     Leahy    https://www.govtrack.us/congress/members/patrick_leahy/300065         J.
4    M001153 1957-05-22 1004138      Lisa female       Female Murkowski   https://www.govtrack.us/congress/members/lisa_murkowski/300075         A.
5    M001111 1950-10-11   25277     Patty female       Female    Murray     https://www.govtrack.us/congress/members/patty_murray/300076           
6    S000148 1950-11-23    5929   Charles   male         Male   Schumer  https://www.govtrack.us/congress/members/charles_schumer/300087         E.
                                  name namemod nickname      osid pvsid                                sortname     twitterid           youtubeid age
1     Sen. Michael “Mike” Crapo [R-ID]             Mike N00006267 26830     Crapo, Michael “Mike” (Sen.) [R-ID]     MikeCrapo        senatorcrapo  70
2 Sen. Charles “Chuck” Grassley [R-IA]            Chuck N00001758 53293 Grassley, Charles “Chuck” (Sen.) [R-IA] ChuckGrassley    senchuckgrassley  88
3            Sen. Patrick Leahy [D-VT]                  N00009918 53353            Leahy, Patrick (Sen.) [D-VT]  SenatorLeahy SenatorPatrickLeahy  81
4           Sen. Lisa Murkowski [R-AK]                  N00026050 15841           Murkowski, Lisa (Sen.) [R-AK] LisaMurkowski    senatormurkowski  64
5             Sen. Patty Murray [D-WA]                  N00007876 53358             Murray, Patty (Sen.) [D-WA]   PattyMurray  SenatorPattyMurray  71
6  Sen. Charles “Chuck” Schumer [D-NY]            Chuck N00001093 26976  Schumer, Charles “Chuck” (Sen.) [D-NY]    SenSchumer      SenatorSchumer  71
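
The step above adds the column directly. A minimal sketch of what a function for part E might look like follows. The names `computeAge` and `loadRoleData` and the `currentYear` parameter are hypothetical, and passing the URL as a string to `fromJSON` is one option among several.

```r
# Hypothetical helper: add an age column computed from the birth year.
computeAge <- function(people, currentYear = 2021) {
  birthYear <- as.numeric(substr(people$birthday, 1, 4))  # "1951-05-20" -> 1951
  people$age <- currentYear - birthYear
  people
}

# Hypothetical wrapper: read the role JSON, add age, and return the dataframe.
loadRoleData <- function(jsonUrl, currentYear = 2021) {
  parsed <- jsonlite::fromJSON(jsonUrl)   # needs the jsonlite package and network access
  computeAge(parsed$objects$person, currentYear)
}

# Small offline check with two birthdays from the head(df) output.
sample <- data.frame(birthday = c("1951-05-20", "1933-09-17"),
                     stringsAsFactors = FALSE)
computeAge(sample)$age   # 70 88
```

With network access, a call such as `df <- loadRoleData("https://intro-datascience.s3.us-east-2.amazonaws.com/role.json")` would store the result in df.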

F. Use (call, invoke) the function, and store the results in df

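```{r}
addAge <- function(df, age) { df$age <- age; df }  # addAge is a name chosen here; returns df with an age column
df <- addAge(df, age)  # call the function and store the result in df
```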


## Part 2: Investigate the resulting dataframe 'df'

A.  How many senators are women? 


```{r}
sum(df$gender == 'female')         # TRUE counts as 1, so this counts the women
nrow(df[df$gender == 'female', ])  # same count, by subsetting the rows
```
[1] 24
[1] 24

How many senators have a YouTube account?

    sum(!is.na(df$youtubeid))  # missing IDs are NA, so count the non-missing ones
    [1] 73

73 senators have a YouTube account.
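
Counting missing values is easy to get wrong, because comparing NA with == returns NA rather than FALSE. A short sketch on a toy vector: the two IDs are copied from the head() output, and the NA entry is made up for illustration.

```r
# Toy vector: two IDs from the head(df) output plus one made-up missing value.
youtubeid <- c("senatorcrapo", NA, "SenatorSchumer")

youtubeid == "senatorcrapo"      # TRUE NA FALSE: a comparison with NA yields NA
is.na(youtubeid)                 # FALSE TRUE FALSE

hasAccount <- !is.na(youtubeid)  # TRUE where an ID is present
sum(hasAccount)                  # 2
```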

How many women senators have a YouTube account?

    [1] 16

Create a new dataframe called youtubeWomen that only includes women senators who have a YouTube account.

youtubeWomen <- df[!is.na(df$youtubeid) & df$gender == 'female', ]  # keep rows for women with a non-missing youtubeid
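
Both the count of women with a YouTube account and the youtubeWomen dataframe can come from a single logical mask. The sketch below uses a toy dataframe: the column names follow df, three rows reuse values from the head() output, and the fourth row ("Example", with a missing ID) is made up.

```r
# Toy dataframe; the last row is invented to show a missing youtubeid.
toy <- data.frame(
  lastname  = c("Murkowski", "Murray", "Crapo", "Example"),
  gender    = c("female", "female", "male", "female"),
  youtubeid = c("senatormurkowski", "SenatorPattyMurray", "senatorcrapo", NA),
  stringsAsFactors = FALSE
)

# TRUE only for women whose youtubeid is present.
keep <- toy$gender == "female" & !is.na(toy$youtubeid)

sum(keep)                    # how many rows match: 2
youtubeWomenToy <- toy[keep, ]   # the matching rows, with all columns kept
youtubeWomenToy$lastname     # "Murkowski" "Murray"
```

Wrapping the mask in data.frame() would instead produce a single column of TRUE/FALSE values, which is why row subsetting with `[keep, ]` is used here.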

Make a histogram of the age of senators in youtubeWomen, and then another for the senators in df. Add a comment describing the shape of the distributions.

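```{r}
hist(df$age[df$gender == 'female' & !is.na(df$youtubeid)])  # women senators with a YouTube account
hist(age)                                                   # all senators
```

Most senators are around the age of 70, and the distribution for all senators is roughly triangular (pyramid-shaped), peaking near that age. The age values printed in part D contain only one value under 40 (34), so the women senators cannot be mostly under 40. The YouTube share is also not higher among women: 16 of 24 women (about 67%) have an account, compared with 73 of 100 senators overall.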
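
To describe a histogram's shape with numbers rather than by eye, the bin counts can be inspected directly. This sketch reuses the 100 age values printed in part D; the 10-year bin edges are an arbitrary choice for illustration.

```r
# The 100 age values printed in part D.
age <- c(70, 88, 81, 64, 71, 71, 87, 72, 71, 71, 66, 67, 66, 60, 62, 60,
         57, 49, 53, 56, 75, 50, 58, 64, 50, 66, 49, 53, 70, 63, 57, 63,
         74, 88, 71, 59, 69, 69, 78, 67, 80, 70, 48, 74, 55, 61, 65, 61,
         66, 69, 50, 74, 45, 72, 77, 60, 70, 51, 63, 64, 69, 67, 42, 74,
         57, 69, 69, 77, 66, 87, 79, 65, 72, 78, 74, 67, 58, 68, 64, 49,
         75, 63, 44, 59, 52, 57, 51, 61, 67, 49, 61, 63, 62, 67, 67, 69,
         62, 48, 34, 52)

# plot = FALSE returns the bin counts without drawing anything.
h <- hist(age, breaks = seq(30, 90, by = 10), plot = FALSE)
h$breaks          # bin edges: 30 40 50 60 70 80 90
h$counts          # number of senators in each 10-year bin

sum(age < 40)     # how many senators are under 40
range(age)        # youngest and oldest
median(age)       # a robust measure of the center
```

Dropping `plot = FALSE` draws the same histogram, so the counts and the picture always agree.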