R Project

Tools For Data Science

Author

GREDY GARRIDO B

Instructions

In this project you will be working with R base and Tidyverse.

Render the empty file to make sure everything is working.
Render the file again each time you answer a question.

⚠️ Students enrolled in CAP4755 should only solve: 1, 2, 3, 4, 6, 7, 8, and 9.

R Base (40%)

Compute the difference between the current year and the year you started coding, then divide it by the difference between the current year and the year you were born. Multiply the result by 100 to get the percentage of your life you have been programming.

#Code here
# x = current year, y = year you started coding, z = birth year
x <- 2024
y <- 1990
z <- 1949
diff1 <- x - y   # years spent coding
diff2 <- x - z   # years alive
P <- (diff1 / diff2) * 100
print(P)

[1] 45.33333
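
The year values above are hard-coded. As a minimal sketch, assuming you would rather take the current year from the system clock, the calculation could be wrapped in a small helper. The function name `pct_life_coding` and its argument names are hypothetical and not part of the assignment.

```r
# Hypothetical helper: percentage of life spent programming.
# current_year defaults to the year reported by the system clock.
pct_life_coding <- function(start_year, birth_year,
                            current_year = as.integer(format(Sys.Date(), "%Y"))) {
  years_coding <- current_year - start_year
  years_alive  <- current_year - birth_year
  (years_coding / years_alive) * 100
}

# Same inputs as the calculation above, with the current year fixed at 2024
pct_2024 <- pct_life_coding(start_year = 1990, birth_year = 1949, current_year = 2024)
print(pct_2024)
```

Passing `current_year` explicitly keeps the result reproducible; leaving it out makes the answer change as the calendar year changes.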

Compute the mean of the square root of a vector of 100 random numbers.

#Code here
# generate 100 random numbers
x<-sample(1000,100)
length(x)

[1] 100

head(x)

[1] 788 265  49 659 948 395

# square root of each element, then the mean of those roots
mean(sqrt(x))
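
The question asks for the mean of the square roots, which is a different quantity from the square root of the sum or the square root of the mean. A minimal sketch on a tiny, hand-checkable vector (the vector `v` is an illustrative assumption, not the sampled data above):

```r
# Small vector whose square roots are easy to verify by hand: 1, 2, 3
v <- c(1, 4, 9)

mean_of_roots <- mean(sqrt(v))   # (1 + 2 + 3) / 3
root_of_sum   <- sqrt(sum(v))    # sqrt(14)
root_of_mean  <- sqrt(mean(v))   # sqrt(14 / 3)

print(c(mean_of_roots, root_of_sum, root_of_mean))
```

The three values differ, so the order in which `sqrt()`, `sum()`, and `mean()` are applied matters.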

Make a vector from 1 to 100. Make a for-loop which runs through the whole vector. Multiply the elements which are smaller than 20 or larger than 80 by 10 and the other elements by 0.1.

#Code here
# Create a vector from 1 to 100
vec <- 1:100

# Initialize an empty vector to store results
result <- numeric(length(vec))

# Loop through each element of the vector
for (i in seq_along(vec)) {
  if (vec[i] < 20 || vec[i] > 80) {
    result[i] <- vec[i] * 10
  } else {
    result[i] <- vec[i] * 0.1
  }
}

# Print the resulting vector
print(result)

  [1]   10.0   20.0   30.0   40.0   50.0   60.0   70.0   80.0   90.0  100.0
 [11]  110.0  120.0  130.0  140.0  150.0  160.0  170.0  180.0  190.0    2.0
 [21]    2.1    2.2    2.3    2.4    2.5    2.6    2.7    2.8    2.9    3.0
 [31]    3.1    3.2    3.3    3.4    3.5    3.6    3.7    3.8    3.9    4.0
 [41]    4.1    4.2    4.3    4.4    4.5    4.6    4.7    4.8    4.9    5.0
 [51]    5.1    5.2    5.3    5.4    5.5    5.6    5.7    5.8    5.9    6.0
 [61]    6.1    6.2    6.3    6.4    6.5    6.6    6.7    6.8    6.9    7.0
 [71]    7.1    7.2    7.3    7.4    7.5    7.6    7.7    7.8    7.9    8.0
 [81]  810.0  820.0  830.0  840.0  850.0  860.0  870.0  880.0  890.0  900.0
 [91]  910.0  920.0  930.0  940.0  950.0  960.0  970.0  980.0  990.0 1000.0
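
The question requires a for-loop, but the same rule can also be applied without explicit iteration. As a sketch, base R's vectorized `ifelse()` handles the whole vector at once. The names `vec_alt` and `result_alt` are illustrative only.

```r
vec_alt <- 1:100

# Elements below 20 or above 80 are multiplied by 10, all others by 0.1.
# Note the single | (element-wise OR) rather than || (scalar OR).
result_alt <- ifelse(vec_alt < 20 | vec_alt > 80, vec_alt * 10, vec_alt * 0.1)

# Inspect the values on either side of the two thresholds
result_alt[c(19, 20, 80, 81)]
```

The boundary values 20 and 80 fall into the "other elements" group because the comparisons are strict.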

Write a function to combine questions 2 and 3, so that you can feed it an integer n of your choice as an argument. The function 1) generates n random numbers, 2) multiplies the elements which are smaller than 20 or larger than 80 by 10 and the other elements by 0.1, then 3) returns the mean of the square root of the vector.

#Code here
# Input an integer
n <- 5
# Define the function
process_numbers <- function(n) {
  # Generate n random numbers between 1 and 100
  vec <- sample(1:100, n, replace = TRUE)
  
  # Initialize an empty vector to store results
  result <- numeric(length(vec))
  
  # Loop through each element of the vector
  for (i in seq_along(vec)) {
    if (vec[i] < 20 || vec[i] > 80) {
      result[i] <- vec[i] * 10
    } else {
      result[i] <- vec[i] * 0.1
    }
  }
  
  # Compute the square root of the result vector
  sqrt_result <- sqrt(result)
  
  # Return the mean of the square root vector
  return(mean(sqrt_result))
}

# Example usage of the function
set.seed(123)  # Setting seed for reproducibility
result <- process_numbers(n)
print(result)

[1] 4.250058

R Tidyverse (60%)

Baby names distribution data

We will look into the distribution of baby names. Use Tidyverse to answer the questions. This data set has over 2 million rows covering 1880 to 2022. It was provided by the Social Security Administration. It has the following variables:

year: birth year
sex: Female or Male
name: baby name
counts: number of babies named “name” in that year with that sex

Data Wrangling

Read the data into R and call it bbnames:

# load packages
library(data.table)
library(tidyverse)

# Load the data - this takes a minute :)
bbnames = fread("https://pages.uwf.edu/acohen/teaching/datasets/babynames.csv", drop = "V1")
bbnames

              name    sex counts  year
            <char> <char>  <int> <int>
      1:      Mary      F   7065  1880
      2:      Anna      F   2604  1880
      3:      Emma      F   2003  1880
      4: Elizabeth      F   1939  1880
      5:    Minnie      F   1746  1880
     ---                              
2085154:    Zuberi      M      5  2022
2085155:      Zydn      M      5  2022
2085156:     Zylon      M      5  2022
2085157:    Zymeer      M      5  2022
2085158:   Zymeire      M      5  2022

# Code here

Find the number of babies (names) born in the same year. (hints: use group_by and summarise). Then, find which year had the highest number of babies:

#Code here
# n() counts rows, i.e. the number of name/sex records per year
bbnames %>%
  group_by(year) %>%
  summarize(n()) %>%
  ungroup()

# A tibble: 143 × 2
    year `n()`
   <int> <int>
 1  1880  2000
 2  1881  1934
 3  1882  2127
 4  1883  2084
 5  1884  2297
 6  1885  2294
 7  1886  2392
 8  1887  2373
 9  1888  2651
10  1889  2590
# ℹ 133 more rows
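
Because `n()` counts rows, the table above gives the number of name records per year. If the goal is the number of babies, one possible approach is to sum `counts` within each year and then pick the largest total with `slice_max()`. The sketch below uses a small made-up tibble (`toy_names`) so that it runs without downloading the data; all values in it are invented for illustration.

```r
library(dplyr)

# Made-up rows with the same columns as bbnames
toy_names <- tibble(
  name   = c("Mary", "Anna", "John", "Mary", "John", "Emma"),
  sex    = c("F", "F", "M", "F", "M", "F"),
  counts = c(10, 5, 8, 20, 15, 7),
  year   = c(1880, 1880, 1880, 1881, 1881, 1881)
)

# Total babies per year
babies_per_year <- toy_names %>%
  group_by(year) %>%
  summarise(total_babies = sum(counts), .groups = "drop")

# Year with the highest total
top_year <- babies_per_year %>%
  slice_max(total_babies, n = 1)

print(top_year)
```

On the full data, the same pipeline would start from `bbnames` instead of `toy_names`.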

Find the most popular, all time, name for each sex (hints: answer for females starts with M and males with J):

#Code here
# Load the tidyverse package
library(tidyverse)

# Work on the baby names data
df <- bbnames

# Determine the name with the highest frequency per sex
max_names_per_sex <- df %>%
  group_by(sex, name) %>%
  summarize(total_frequency = sum(counts), .groups = 'drop') %>%
  arrange(sex, desc(total_frequency)) %>%
  group_by(sex) %>%
  slice(1)

print(max_names_per_sex)

# A tibble: 2 × 3
# Groups:   sex [2]
  sex   name  total_frequency
  <chr> <chr>           <int>
1 F     Mary          4134713
2 M     James         5214844

Create a new data frame with a new variable age, then filter by age. Pick a threshold that keeps only people who may still be alive (you may use life expectancy):

#Code here
# Load the tidyverse package
library(tidyverse)

# Sample data frame with an added 'age' column
df <- tibble(
  name = c("Alice", "Bob", "Alice", "Charlie", "Alice", "Bob", "Diana", "Eve"),
  sex = c("F", "M", "F", "M", "F", "M", "F", "F"),
  frequency = c(5, 3, 4, 2, 1, 2, 6, 7),
  age = c(25, 45, 78, 85, 30, 50, 95, 65)
)

# Define the age threshold for life expectancy
age_threshold <- 80

# Filter the data to keep only those who may still be alive
filtered_df <- df %>%
  filter(age <= age_threshold)

print(filtered_df)

# A tibble: 6 × 4
  name  sex   frequency   age
  <chr> <chr>     <dbl> <dbl>
1 Alice F             5    25
2 Bob   M             3    45
3 Alice F             4    78
4 Alice F             1    30
5 Bob   M             2    50
6 Eve   F             7    65
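
The example above filters a hand-made table that already contains an age column. With data shaped like bbnames, age would first have to be derived from the birth year. A minimal sketch on a made-up tibble, assuming a reference year of 2024 and an 80-year threshold (both are assumptions you may change, and `toy_births` and its values are invented):

```r
library(dplyr)

# Made-up rows with the same columns as bbnames
toy_births <- tibble(
  name   = c("Mary", "John", "Emma", "Liam"),
  sex    = c("F", "M", "F", "M"),
  counts = c(100, 200, 300, 400),
  year   = c(1880, 1943, 1944, 2020)
)

reference_year <- 2024  # assumed "current" year
age_threshold  <- 80    # assumed life expectancy cut-off

# Derive age from the birth year, then keep people at or under the threshold
alive_df <- toy_births %>%
  mutate(age = reference_year - year) %>%
  filter(age <= age_threshold)

print(alive_df)
```

Replacing `toy_births` with `bbnames` would apply the same two steps to the full data set.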

Visualization

Use ggplot to plot the distribution of the name John, x-axis is year and y-axis is counts. Use geom_bar(), geom_line(), and facet_wrap() to separate females and males (use scales = "free" to free the scales).

#Code here
# Load the tidyverse package
library(tidyverse)

# Use the baby names data
df <- bbnames

# Choose a name to filter by
chosen_name <- "John"

# Filter the data for the chosen name
filtered_df <- df %>%
  filter(name == chosen_name)

# Plot the distribution using ggplot2
ggplot(filtered_df, aes(x = year, y = counts, group = sex, color = sex)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  geom_line(linewidth = 1) +
  facet_wrap(~ sex, scales = "free") +
  labs(title = paste("Distribution of", chosen_name, "over the Years"),
       x = "Year",
       y = "Counts") +
  theme_minimal()

Use ggplot to plot the distribution of the name of your choice, x-axis is year and y-axis is counts. Use geom_bar(), geom_line(), and facet_wrap() to separate females and males (use scales = "free" to free the scales).

#Code here
# Load the tidyverse package
library(tidyverse)

# Use the baby names data
df <- bbnames

# Choose a name to filter by
chosen_name <- "Peter"

# Filter the data for the chosen name
filtered_df <- df %>%
  filter(name == chosen_name)

# Plot the distribution using ggplot2
ggplot(filtered_df, aes(x = year, y = counts, group = sex, color = sex)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  geom_line(linewidth = 1) +
  facet_wrap(~ sex, scales = "free") +
  labs(title = paste("Distribution of", chosen_name, "over the Years"),
       x = "Year",
       y = "Counts") +
  theme_minimal()

Pilots certification data

Data on pilot certification was obtained from the Federal Aviation Administration (FAA) in June 2023. The data has over 450,000 pilot records and contains the following variables:

ID: pilot ID
STATE: US state where the pilot lives
CertLevel: the certification level (Airline, Commercial, Student, Sport, Private, and Recreational),
MedClass: the medical class,
MedExpMonth: the medical expire month, and
MedExpYear: the medical expire year.

Read the data into R and call it pilots:

# Code here

# load packages
library(data.table)
library(tidyverse)

# Load the data 
pilots = fread("https://pages.uwf.edu/acohen/teaching/datasets/pilotscertification.csv")
pilots

              ID  STATE MedClass MedExpMonth MedExpYear CertLevel
          <char> <char>    <int>       <int>      <int>    <char>
     1: A0000014     FL        3          10       2023   Airline
     2: A0000030     GA        3           8       2019   Private
     3: A0000087     NH       NA          NA         NA   Airline
     4: A0000113     CA        1          11       2023   Airline
     5: A0000221     AZ        1           8       2023   Airline
    ---                                                          
450693: C1819748     FL       NA          NA         NA   Student
450694: C1819777     IN       NA          NA         NA   Student
450695: C1819942     FL       NA          NA         NA   Student
450696: C1820011     OH        3           5       2025   Student
450697: C1820025     GA       NA          NA         NA   Student

Find how many pilots per certification level will have their medical certification expire in the current year and month:

# Code here
# Count pilots per certification level and expiry date, then keep June 2024
pilots %>%
  group_by(CertLevel, MedExpMonth, MedExpYear) %>%
  summarise(n()) %>%
  filter(MedExpMonth == 6, MedExpYear == 2024)

# A tibble: 6 × 4
# Groups:   CertLevel, MedExpMonth [6]
  CertLevel    MedExpMonth MedExpYear `n()`
  <chr>              <int>      <int> <int>
1 Airline                6       2024   139
2 Commercial             6       2024   402
3 Private                6       2024  2227
4 Recreational           6       2024     2
5 Sport                  6       2024     2
6 Student                6       2024   262
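
The month and year above are hard-coded. As a sketch, they could instead be derived from a date, and filtering before counting avoids summarising groups that are then discarded. The helper name `expiring_by_level`, its `as_of` argument, and the toy data are all hypothetical.

```r
library(dplyr)

# Made-up records with the same column names as pilots
toy_pilots <- tibble(
  ID          = c("P1", "P2", "P3", "P4", "P5"),
  CertLevel   = c("Airline", "Private", "Private", "Student", "Private"),
  MedExpMonth = c(6L, 6L, 6L, 7L, NA),
  MedExpYear  = c(2024L, 2024L, 2024L, 2024L, NA)
)

# Hypothetical helper: pilots per certification level whose medical
# certificate expires in the month and year of `as_of`
expiring_by_level <- function(data, as_of = Sys.Date()) {
  target_month <- as.integer(format(as_of, "%m"))
  target_year  <- as.integer(format(as_of, "%Y"))

  data %>%
    filter(MedExpMonth == target_month, MedExpYear == target_year) %>%
    count(CertLevel, name = "n_pilots")
}

# Fixed date so the example is reproducible
june_2024 <- expiring_by_level(toy_pilots, as_of = as.Date("2024-06-15"))
print(june_2024)
```

Rows with a missing expiry month or year are dropped by `filter()`, so pilots without a medical record are not counted.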