Assignment 3

Author

Kelly Ratigan

Published

July 6, 2025

Homework 3

Homework 3: R programming, Parts A, B, and C.

Part A

  1. Store the NHANESraw data in a data frame and give it a reasonable name.  Examine the variables in the data frame by performing a glimpse(). Copy and paste the results of that glimpse() into your assignment.
library(NHANES)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
df_nhanes <- NHANESraw

A glimpse of the data:

glimpse(df_nhanes)
Rows: 20,293
Columns: 78
$ ID               <int> 51624, 51625, 51626, 51627, 51628, 51629, 51630, 5163…
$ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10,…
$ Gender           <fct> male, male, male, male, female, male, female, female,…
$ Age              <int> 34, 4, 16, 10, 60, 26, 49, 1, 10, 80, 10, 80, 4, 35, …
$ AgeMonths        <int> 409, 49, 202, 131, 722, 313, 596, 12, 124, NA, 121, N…
$ Race1            <fct> White, Other, Black, Black, Black, Mexican, White, Wh…
$ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Education        <fct> High School, NA, NA, NA, High School, 9 - 11th Grade,…
$ MaritalStatus    <fct> Married, NA, NA, NA, Widowed, Married, LivePartner, N…
$ HHIncome         <fct> 25000-34999, 20000-24999, 45000-54999, 20000-24999, 1…
$ HHIncomeMid      <int> 30000, 22500, 50000, 22500, 12500, 30000, 40000, 4000…
$ Poverty          <dbl> 1.36, 1.07, 2.27, 0.81, 0.69, 1.01, 1.91, 1.36, 2.68,…
$ HomeRooms        <int> 6, 9, 5, 6, 6, 4, 5, 5, 7, 4, 5, 5, 7, NA, 6, 6, 5, 6…
$ HomeOwn          <fct> Own, Own, Own, Rent, Rent, Rent, Rent, Rent, Own, Own…
$ Work             <fct> NotWorking, NA, NotWorking, NA, NotWorking, Working, …
$ Weight           <dbl> 87.4, 17.0, 72.3, 39.8, 116.8, 97.6, 86.7, 9.4, 26.0,…
$ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, 75.7, NA, NA, NA, NA, NA,…
$ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Height           <dbl> 164.7, 105.4, 181.3, 147.8, 166.0, 173.0, 168.4, NA, …
$ BMI              <dbl> 32.22, 15.30, 22.00, 18.22, 42.39, 32.61, 30.57, NA, …
$ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BMI_WHO          <fct> 30.0_plus, 12.0_18.5, 18.5_to_24.9, 12.0_18.5, 30.0_p…
$ Pulse            <int> 70, NA, 68, 68, 72, 72, 86, NA, 70, 88, 84, 54, NA, N…
$ BPSysAve         <int> 113, NA, 109, 93, 150, 104, 112, NA, 108, 139, 94, 12…
$ BPDiaAve         <int> 85, NA, 59, 41, 68, 49, 75, NA, 53, 43, 45, 60, NA, N…
$ BPSys1           <int> 114, NA, 112, 92, 154, 102, 118, NA, 106, 142, 94, 12…
$ BPDia1           <int> 88, NA, 62, 36, 70, 50, 82, NA, 60, 62, 38, 62, NA, N…
$ BPSys2           <int> 114, NA, 114, 94, 150, 104, 108, NA, 106, 140, 92, 12…
$ BPDia2           <int> 88, NA, 60, 44, 68, 48, 74, NA, 50, 46, 40, 62, NA, N…
$ BPSys3           <int> 112, NA, 104, 92, 150, 104, 116, NA, 110, 138, 96, 11…
$ BPDia3           <int> 82, NA, 58, 38, 68, 50, 76, NA, 56, 40, 50, 58, NA, N…
$ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ DirectChol       <dbl> 1.29, NA, 1.55, 1.89, 1.16, 1.16, 1.16, NA, 1.58, 1.9…
$ TotChol          <dbl> 3.49, NA, 4.97, 4.16, 5.22, 4.14, 6.70, NA, 4.14, 4.7…
$ UrineVol1        <int> 352, NA, 281, 139, 30, 202, 77, NA, 39, 128, 109, 38,…
$ UrineFlow1       <dbl> NA, NA, 0.415, 1.078, 0.476, 0.563, 0.094, NA, 0.300,…
$ UrineVol2        <int> NA, NA, NA, NA, 246, NA, NA, NA, NA, NA, NA, NA, NA, …
$ UrineFlow2       <dbl> NA, NA, NA, NA, 2.51, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Diabetes         <fct> No, No, No, No, Yes, No, No, No, No, No, No, Yes, No,…
$ DiabetesAge      <int> NA, NA, NA, NA, 56, NA, NA, NA, NA, NA, NA, 70, NA, N…
$ HealthGen        <fct> Good, NA, Vgood, NA, Fair, Good, Good, NA, NA, Excell…
$ DaysPhysHlthBad  <int> 0, NA, 2, NA, 20, 2, 0, NA, NA, 0, NA, 0, NA, NA, NA,…
$ DaysMentHlthBad  <int> 15, NA, 0, NA, 25, 14, 10, NA, NA, 0, NA, 0, NA, NA, …
$ LittleInterest   <fct> Most, NA, NA, NA, Most, None, Several, NA, NA, None, …
$ Depressed        <fct> Several, NA, NA, NA, Most, Most, Several, NA, NA, Non…
$ nPregnancies     <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA, NA, NA, NA,…
$ nBabies          <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA, NA, NA, NA,…
$ Age1stBaby       <int> NA, NA, NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, N…
$ SleepHrsNight    <int> 4, NA, 8, NA, 4, 4, 8, NA, NA, 6, NA, 9, NA, 7, NA, N…
$ SleepTrouble     <fct> Yes, NA, No, NA, No, No, Yes, NA, NA, No, NA, No, NA,…
$ PhysActive       <fct> No, NA, Yes, NA, No, Yes, No, NA, NA, Yes, NA, No, NA…
$ PhysActiveDays   <int> NA, NA, 5, NA, NA, 2, NA, NA, NA, 4, NA, NA, NA, NA, …
$ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TVHrsDayChild    <int> NA, 4, NA, 1, NA, NA, NA, NA, 1, NA, 3, NA, 2, NA, 5,…
$ CompHrsDayChild  <int> NA, 1, NA, 1, NA, NA, NA, NA, 0, NA, 0, NA, 1, NA, 0,…
$ Alcohol12PlusYr  <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, Yes, NA, No, N…
$ AlcoholDay       <int> NA, NA, NA, NA, NA, 19, 2, NA, NA, 1, NA, NA, NA, NA,…
$ AlcoholYear      <int> 0, NA, NA, NA, 0, 48, 20, NA, NA, 52, NA, 0, NA, NA, …
$ SmokeNow         <fct> No, NA, NA, NA, Yes, No, Yes, NA, NA, No, NA, No, NA,…
$ Smoke100         <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, Yes, NA, Yes,…
$ SmokeAge         <int> 18, NA, NA, NA, 16, 15, 38, NA, NA, 16, NA, 21, NA, N…
$ Marijuana        <fct> Yes, NA, NA, NA, NA, Yes, Yes, NA, NA, NA, NA, NA, NA…
$ AgeFirstMarij    <int> 17, NA, NA, NA, NA, 10, 18, NA, NA, NA, NA, NA, NA, N…
$ RegularMarij     <fct> No, NA, NA, NA, NA, Yes, No, NA, NA, NA, NA, NA, NA, …
$ AgeRegMarij      <int> NA, NA, NA, NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, N…
$ HardDrugs        <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, NA, NA, NA, NA…
$ SexEver          <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, NA, NA, NA, N…
$ SexAge           <int> 16, NA, NA, NA, 15, 9, 12, NA, NA, NA, NA, NA, NA, NA…
$ SexNumPartnLife  <int> 8, NA, NA, NA, 4, 10, 10, NA, NA, NA, NA, NA, NA, NA,…
$ SexNumPartYear   <int> 1, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, NA, …
$ SameSex          <fct> No, NA, NA, NA, No, No, Yes, NA, NA, NA, NA, NA, NA, …
$ SexOrientation   <fct> Heterosexual, NA, NA, NA, NA, Heterosexual, Heterosex…
$ WTINT2YR         <dbl> 80100.544, 53901.104, 13953.078, 11664.899, 20090.339…
$ WTMEC2YR         <dbl> 81528.772, 56995.035, 14509.279, 12041.635, 21000.339…
$ SDMVPSU          <int> 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
$ SDMVSTRA         <int> 83, 79, 84, 86, 75, 88, 85, 86, 88, 77, 86, 79, 84, 7…
$ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, U…
  2. One of the variables in the data is HomeOwn, which indicates the type of living situation (Own, Rent, or Other) of the subject.  Use dplyr commands to determine the number of individuals who fall into each of the categories of HomeOwn.

    df_nhanes %>% count(HomeOwn)
    # A tibble: 4 × 2
      HomeOwn     n
      <fct>   <int>
    1 Own     10939
    2 Rent     8715
    3 Other     502
    4 <NA>      137
  3. Another variable is HomeRooms, which gives the number of rooms in the home of the subject.  Use dplyr commands to calculate the count and the mean number of rooms for each of the categories of HomeOwn.  Arrange the list in descending order by the mean number of rooms.

    grouped <- group_by(df_nhanes, HomeOwn)
    summary_table <- summarise(grouped, count = n(), mean_rooms = mean(HomeRooms, na.rm = TRUE))
    arranged_table <- arrange(summary_table, desc(mean_rooms))
    
    arranged_table
    # A tibble: 4 × 3
      HomeOwn count mean_rooms
      <fct>   <int>      <dbl>
    1 Own     10939       6.76
    2 Other     502       5.67
    3 <NA>      137       5.67
    4 Rent     8715       4.62
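
    The same table can also be produced as a single dplyr pipeline, which avoids the intermediate objects (a sketch using the same df_nhanes data frame):

    df_nhanes %>%
      group_by(HomeOwn) %>%
      summarise(count = n(), mean_rooms = mean(HomeRooms, na.rm = TRUE)) %>%
      arrange(desc(mean_rooms))
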
  4. Another variable in the data is Education, the education level of the subject.  Calculate the count and the mean number of rooms for each of the levels of education.  Arrange the list in descending order by mean number of rooms.  Does there appear to be a relationship between the education level and the number of rooms in the home?  Explain in one to two sentences.

    by_educ <- group_by(df_nhanes, Education)
    sum_educ <- summarise(by_educ, count = n(), mean_rooms = mean(HomeRooms, na.rm = TRUE))
    room_educ <- arrange(sum_educ, desc(mean_rooms))
    room_educ
    # A tibble: 6 × 3
      Education      count mean_rooms
      <fct>          <int>      <dbl>
    1 College Grad    2656       6.34
    2 Some College    3399       5.93
    3 <NA>            8535       5.88
    4 High School     2595       5.60
    5 9 - 11th Grade  1787       5.30
    6 8th Grade       1321       5.01

    4 Analysis: Yes, there appears to be a positive relationship between education level and the number of rooms in the home: the higher the education level, the higher the mean number of rooms.
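
    If we wanted the comparison to cover only subjects with a recorded education level, the NA category could be dropped first (a sketch; the conclusion above is unchanged):

    df_nhanes %>%
      filter(!is.na(Education)) %>%
      group_by(Education) %>%
      summarise(count = n(), mean_rooms = mean(HomeRooms, na.rm = TRUE)) %>%
      arrange(desc(mean_rooms))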

  5. The NHANESraw data includes individuals under age 20.  Filter your data to only include those age 20 and over.  Calculate the count and the mean number of rooms for each of the levels of education.  Arrange the list in descending order by mean number of rooms.  Review the description of the education variable in the NHANES help file (?NHANES) and explain in a couple of sentences why we see the substantial change in the output from this filter relative to the output in the previous part.

    df_over20 <- filter(df_nhanes, Age >= 20)
    educ_over20 <- group_by(df_over20, Education)
    
    # count and mean number of rooms for each education level
    sum_educ_over20 <- summarise(educ_over20, count = n(), mean_rooms = mean(HomeRooms, na.rm = TRUE))
    
    # arrange in descending order by mean rooms
    over20_table <- arrange(sum_educ_over20, desc(mean_rooms))
    over20_table
    # A tibble: 6 × 3
      Education      count mean_rooms
      <fct>          <int>      <dbl>
    1 <NA>              20       6.53
    2 College Grad    2656       6.34
    3 Some College    3399       5.93
    4 High School     2595       5.60
    5 9 - 11th Grade  1787       5.30
    6 8th Grade       1321       5.01

    5 Analysis: The NHANES help file notes that Education is reported only for participants aged 20 or older, so everyone under 20 in the raw data has a missing (NA) value for Education. Filtering to those age 20 and over removes those rows, which is why the NA count drops from 8,535 in the previous part to just 20 here.
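
    As a quick check of this explanation, we can cross-tabulate whether a subject is 20 or older against whether Education is missing (a sketch using the df_nhanes data frame from Part A):

    df_nhanes %>%
      mutate(age_20_plus = Age >= 20,
             educ_missing = is.na(Education)) %>%
      count(age_20_plus, educ_missing)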

Part B

library(mdsr)
Warning: package 'mdsr' was built under R version 4.4.1
  1. Rats are a notorious problem in NYC.  The violation_code for “Evidence of rats or live rats present” is 04K.  Load the Violations data into R and give it a convenient name.  Filter your data only for the “04K” code.  Group the violations by boro and give the counts for the number of violations in each boro. 

    data("Violations")
    df_viol <- Violations
    
    df_rats <- filter(df_viol, violation_code == "04K")
    
    # Group by borough
    boro_rats <- group_by(df_rats, boro)
    
    # Count how many rat violations in each boro
    count_rats <- summarise(boro_rats, count = n())
    count_rats
    # A tibble: 5 × 2
      boro          count
      <chr>         <int>
    1 BRONX           464
    2 BROOKLYN        819
    3 MANHATTAN      1393
    4 QUEENS          433
    5 STATEN ISLAND    20
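
    The filter, group, and count steps can also be collapsed into one pipeline (a sketch using the df_viol data frame loaded above):

    df_viol %>%
      filter(violation_code == "04K") %>%
      count(boro)
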
  2. We might wonder if rats are more of a problem in the summer months of June, July, and August.  Add a variable to your data frame that gives the month of the inspection.  This variable should be the month label (e.g., Jan, Feb, etc.).  Give counts for the number of 04K violations in each month.  Arrange your counts in descending order.  In one to two sentences, state whether your output indicates rats are more of a problem in the summer months.

    df_rats$month <- format(df_rats$inspection_date, "%b")
    
    rats_by_month <- group_by(df_rats, month)
    count_month <- summarise(rats_by_month, count = n())
    
    count_month_desc <- arrange(count_month, desc(count))
    
    count_month_desc
    # A tibble: 12 × 2
       month count
       <chr> <int>
     1 Mar     334
     2 Apr     323
     3 Dec     297
     4 May     295
     5 Feb     262
     6 Jun     260
     7 Jan     254
     8 Oct     248
     9 Nov     241
    10 Jul     210
    11 Aug     205
    12 Sep     200

    2 Analysis: According to the data, the summer months do not record the most rat violations: June falls mid-list, while July and August have two of the three lowest counts (only September is lower). So the output does not indicate that rats are more of a problem in the summer.
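
    To summarize this directly, the summer months could be flagged and their violations totaled against the rest of the year (a sketch using the df_rats data frame and month variable created above):

    df_rats %>%
      mutate(summer = month %in% c("Jun", "Jul", "Aug")) %>%
      count(summer)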

  3. Along with the Violations data frame, an additional data frame includes the Cuisines which are a translation of the cuisine_codes in the Violations data.  Load the Violations data and join it with Cuisines data to include the cuisine_descriptions with each violation.  Also join your data with the ViolationCodes data frame to include the violation description with each violation in your data set.  Store this data frame with a convenient name.   Create a glimpse of the resulting data frame.

    data("Cuisines")
    data("ViolationCodes")
    
    df_viol <- Violations
    viol_and_cuisine <- left_join(df_viol, Cuisines, by = "cuisine_code")
    joined_viol_data <- left_join(viol_and_cuisine, ViolationCodes, by = "violation_code")
    
    glimpse(joined_viol_data)
    Rows: 480,621
    Columns: 19
    $ camis                 <int> 30075445, 30075445, 30075445, 30075445, 30075445…
    $ dba                   <chr> "MORRIS PARK BAKE SHOP", "MORRIS PARK BAKE SHOP"…
    $ boro                  <chr> "BRONX", "BRONX", "BRONX", "BRONX", "BRONX", "BR…
    $ building              <int> 1007, 1007, 1007, 1007, 1007, 1007, 1007, 1007, …
    $ street                <chr> "MORRIS PARK AVE", "MORRIS PARK AVE", "MORRIS PA…
    $ zipcode               <int> 10462, 10462, 10462, 10462, 10462, 10462, 10462,…
    $ phone                 <dbl> 7188924968, 7188924968, 7188924968, 7188924968, …
    $ inspection_date       <dttm> 2015-02-09, 2014-03-03, 2013-10-10, 2013-09-11,…
    $ action                <chr> "Violations were cited in the following area(s).…
    $ violation_code        <chr> "06C", "10F", NA, "04L", "04N", "04C", "04L", "0…
    $ score                 <int> 6, 2, NA, 6, 6, 32, 32, 32, 32, 32, 32, NA, 10, …
    $ grade                 <chr> "A", "A", NA, "A", "A", NA, NA, NA, NA, NA, NA, …
    $ grade_date            <dttm> 2015-02-09, 2014-03-03, NA, 2013-09-11, 2013-09…
    $ record_date           <dttm> 2016-01-06, 2016-01-06, 2016-01-06, 2016-01-06,…
    $ inspection_type       <chr> "Cycle Inspection / Initial Inspection", "Cycle …
    $ cuisine_code          <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
    $ cuisine_description   <fct> "Bakery", "Bakery", "Bakery", "Bakery", "Bakery"…
    $ critical_flag         <chr> "Critical", "Not Critical", "Not Applicable", "C…
    $ violation_description <chr> "Food not protected from potential source of con…
  4. Create a new data frame that includes all of the variables from your previous data frame but limits it to only the borough of “STATEN ISLAND” and only cuisine_description “Hamburgers”.  Use this data frame to produce a list of violations arranged in order from most common to least common.   Filter your list so that only the top ten by count are given. 

    SI_burgers <- filter(joined_viol_data, boro == "STATEN ISLAND", cuisine_description == "Hamburgers")
    violations_burg <- group_by(SI_burgers, violation_description)
    
    # count each violation description, then sort from most to least common
    violation_counts <- summarise(violations_burg, count = n())
    violations_top <- arrange(violation_counts, desc(count))
    
    # keep only the top ten
    top10_violations <- head(violations_top, 10)
    top10_violations
    # A tibble: 10 × 2
       violation_description                                                   count
       <chr>                                                                   <int>
     1 "Non-food contact surface improperly constructed. Unacceptable materia…    92
     2 "Cold food item held above 41º F (smoked fish and reduced oxygen packa…    49
     3 "Facility not vermin proof. Harborage or conditions conducive to attra…    40
     4 "Food contact surface not properly washed, rinsed and sanitized after …    29
     5 "Filth flies or food/refuse/sewage-associated (FRSA) flies present in …    24
     6 "Plumbing not properly installed or maintained; anti-siphonage or back…    23
     7 "Evidence of mice or live mice present in facility's food and/or non-f…    22
     8 "Raw, cooked or prepared food is adulterated, contaminated, cross-cont…    10
     9 "Hot food item not held at or above 140º F."                                9
    10 "Food not protected from potential source of contamination during stor…     8
  5. The Violations data has one line for every violation.   We might be interested in identifying how many inspections were performed.  Dplyr has a function called distinct() that will eliminate duplicate rows from a data frame.  The variable camis is used to identify a unique restaurant (there can be many restaurants with the same name.)   Using your previous data frame select only the variables camis, dba, street, and inspection_date.  Pipe this data frame into the command distinct() to produce a data frame of only the inspections.  Save this data frame into a new object.  How many rows does this data frame have?

    inspect_data <- select(SI_burgers, camis, dba, street, inspection_date)
    distinct_data <- distinct(inspect_data)
    
    #count
    nrow(distinct_data)
    [1] 148

    5 Answer: There are 148 rows.
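
    The same count can be written as one pipeline (a sketch using the SI_burgers data frame from the previous part):

    SI_burgers %>%
      select(camis, dba, street, inspection_date) %>%
      distinct() %>%
      nrow()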

  6. Using your new data frame count the number of times each individual restaurant (use the camis variable) was inspected.  Sort your list in decreasing order by the number of inspections.

    # camis identifies each unique restaurant
    grouped_inspect <- group_by(distinct_data, camis)
    inspect_count <- summarise(grouped_inspect, count = n())
    
    sort_inspect <- arrange(inspect_count, desc(count))
    sort_inspect
    # A tibble: 25 × 2
          camis count
          <int> <int>
     1 40609676    10
     2 41029337     9
     3 41033173     9
     4 40609677     8
     5 40609680     8
     6 41155817     8
     7 41155821     8
     8 41707838     8
     9 40370356     7
    10 40513416     6
    # ℹ 15 more rows
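
    Equivalently, count() with sort = TRUE collapses the group_by/summarise/arrange steps (a sketch using distinct_data):

    distinct_data %>% count(camis, sort = TRUE)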

Part C

The police department in the city of Berkeley, California releases a list of all individuals arrested. The file Berkeley Arrest contains the arrest list for November through early December 2017. The data includes 16 variables about each individual who was arrested and the offense for which they were arrested. Two of the variables are the subject’s date of birth and the date of the arrest; both are recorded as character variables.

  1. Read the data into R.   Provide a glimpse of the data frame.

    Berkeley_Arrest <- read.csv("~/Desktop/Berkeley_Arrest.csv")
    glimpse(Berkeley_Arrest)
    Rows: 185
    Columns: 17
    $ X                   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
    $ arrest_number       <int> 21105, 21120, 21112, 21110, 21115, 21114, 21111, 2…
    $ arrest_type         <chr> "SUSP. OF FELONY", "COURT FILED (WARRANT)", "ON-VI…
    $ subject             <chr> "Teddy Oliver Webster", "Herbert Stephen Blue", "M…
    $ race                <chr> "Black", "White", "White", "Black", "Other", "Whit…
    $ sex                 <chr> "Male", "Male", "Male", "Male", "Male", "Female", …
    $ date_of_birth       <chr> "1968-12-18", "1957-08-30", "1993-02-16", "1958-04…
    $ age                 <int> NA, 57, 21, NA, 56, 23, 60, 27, 55, 29, 20, 76, 41…
    $ height              <chr> "5 Ft. 9 In.", "5 Ft. 8 In.", NA, "5 Ft. 10 In.", …
    $ weight              <int> 135, 140, NA, 170, 152, NA, 200, 185, 185, 140, 13…
    $ hair                <chr> "BRO", "BRO", NA, "BLK", "BRO", "BRO", "BLK", "RED…
    $ eyes                <chr> "BRO", "HAZ", NA, "BRO", "BRO", NA, "BRO", "HAZ", …
    $ statute             <chr> "1203.2 - F; 422; 10852;", "Warr - Out (F);", "484…
    $ statute_type        <chr> "PC; VC;", "PC;", "PC;", "PC;", "PC;", "PC;", "PC;…
    $ statute_description <chr> "Probation Violation : Felony; THREATEN CRIME W/IN…
    $ case_number         <chr> "2017-00067675", NA, "2017-00067861", NA, "2017-00…
    $ arrest_date         <chr> "2017-11-06", "2017-11-08", "2017-11-07", "2017-11…
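
    As an aside, readr::read_csv() would parse the two ISO-formatted date columns (date_of_birth and arrest_date) as dates on import, which would make the conversions below unnecessary (a sketch; it assumes the readr package is installed and the same file path):

    library(readr)
    Berkeley_Arrest <- read_csv("~/Desktop/Berkeley_Arrest.csv")
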
  2. Identify which day of the week most arrests occurred and provide a bar chart to support your answer.

    library(ggplot2)
    # convert arrest_date from character to Date
    Berkeley_Arrest$arrest_date <- as.Date(Berkeley_Arrest$arrest_date)
    Berkeley_Arrest$day_of_week <- weekdays(Berkeley_Arrest$arrest_date)
    
    # count arrests on each day of the week
    day_counts <- Berkeley_Arrest %>% count(day_of_week)
    
    # plot a bar chart of arrests by day
    ggplot(day_counts, aes(x = day_of_week, y = n)) +
      geom_col(fill = "blue") +
      labs(title = "Arrests by Day of the Week",
           x = "Day of Week",
           y = "Number of Arrests")

  3. Use the date of birth and the date of arrest to determine the age of the individuals at the time of their arrest.  Add this variable to your data frame and call it “real_age”.  Be sure to create this variable as an age as it would be given.  For example, if I am over 25 but under 26 I will tell people I am 25.  You can make use of the floor() function to round down to the nearest integer.  Provide a summary() of the variable you create.

    Berkeley_Arrest$date_of_birth <- as.Date(Berkeley_Arrest$date_of_birth, format = "%Y-%m-%d")
    Berkeley_Arrest$arrest_date <- as.Date(Berkeley_Arrest$arrest_date, format = "%Y-%m-%d")
    
    # age in whole years: days between the two dates divided by 365, rounded down
    Berkeley_Arrest$real_age <- floor(as.numeric(Berkeley_Arrest$arrest_date - Berkeley_Arrest$date_of_birth) / 365)
    summary(Berkeley_Arrest$real_age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      18.00   27.00   37.00   39.59   50.00   84.00 
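
    Dividing by 365 ignores leap years, so for older subjects near a birthday the computed age can come out one year too high. A more exact alternative uses interval arithmetic from lubridate (a sketch; it assumes the lubridate package is installed):

    library(lubridate)
    Berkeley_Arrest$real_age <- floor(
      time_length(interval(Berkeley_Arrest$date_of_birth, Berkeley_Arrest$arrest_date), "years")
    )
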
  4. The data set also includes a variable that is the age of the subjects. Some of those ages do not match the age you determined in the previous step. Provide a list of subjects in which the actual age you calculated differs from the age recorded by the officer by more than one year in either direction. Provide output that gives the names of the subject, their real_age and their age recorded by the officer.

    # keep subjects whose computed age differs from the recorded age by more than one year
    age_check <- filter(Berkeley_Arrest,
                        !is.na(real_age),
                        !is.na(age),
                        abs(real_age - age) > 1)
    
    # show only the name, computed age, and recorded age
    age_result <- select(age_check, subject, real_age, age)
    age_result
                             subject real_age age
    1           Herbert Stephen Blue       60  57
    2             Michael Joseph May       24  21
    3         Anthony Wilfred Kerman       29  27
    4             Damon Lamont Jones       47  43
    5                   Gerald Arcos       28  23
    6         Sara Sofija Antunovich       37  35
    7          Andrew Francis Supple       59  57
    8          Scotty Emmanuel Guess       62  60
    9  CHRISTOPHER RANDOLPH TORRENCE       44  41
    10           Edward Rae Mitchell       53  46
    11               Farad Ami Green       43  45
    12              Adan Mora Morfin       64  61
    13        Christopher Cole Tabor       34  32
    14        Daniel James Blackbear       33  31
    15        Fredrick Arzell Chisom       56  51
    16            Adam Kenneth Jones       45  40
    17                 PRICE WHEELER       46  42
    18           Louis Joseph Lawyer       34  31
    19             Nicholas M Shelby       26  20
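
    To put the largest discrepancies first, the same list could be sorted by the size of the difference (a sketch using the age_check data frame above):

    age_check %>%
      mutate(age_diff = real_age - age) %>%
      arrange(desc(abs(age_diff))) %>%
      select(subject, real_age, age, age_diff)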