Tools For Data Science
In this project you will be working with base R and the Tidyverse.
⚠️ Students enrolled in CAP4755 should only solve: 1, 2, 3, 4, 6, 7, 8, and 9.
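# x = current year, y = starting year, z = year of birth
x <- 2024
y <- 1990
z <- 1949
# Years elapsed since the starting year and since birth
diff1 <- x - y
diff2 <- x - z
# Ratio of the two differences, expressed as a percentage
P <- (diff1 / diff2) * 100
print(P)
[1] 45.33333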
# Generate 100 random numbers between 1 and 1000 (sampled without replacement)
x <- sample(1000, 100)
length(x)
[1] 100
head(x)
[1] 788 265 49 659 948 395
# Sum the numbers
y <- sum(x)
# Square root of the sum
sqrt(y)
[1] 225.2399
# Mean of the numbers
mean(x)
[1] 507.33
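Because sample() draws new numbers on every run, the values printed above will differ each time the code is executed. A minimal sketch of making the draw reproducible, assuming a seed of 42 (an arbitrary value that is not part of the original solution):

```r
# Fix the random seed so the same 100 numbers are drawn on every run.
# The seed value 42 is an arbitrary assumption.
set.seed(42)
x1 <- sample(1000, 100)

# Resetting the same seed reproduces the same draw
set.seed(42)
x2 <- sample(1000, 100)

# TRUE when both draws are identical
identical(x1, x2)

# Square root of the sum, and the mean, of the reproducible sample
sqrt(sum(x1))
mean(x1)
```

Any integer works as a seed; what matters is that it is set immediately before the call to sample().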
Write a for-loop which runs through the whole vector. Multiply the elements which are smaller than 20 or larger than 80 by 10 and the other elements by 0.1.
# Create a vector from 1 to 100
vec <- 1:100
# Pre-allocate a numeric vector to store the results
result <- numeric(length(vec))
# Loop through each element of the vector
for (i in seq_along(vec)) {
  if (vec[i] < 20 || vec[i] > 80) {
    result[i] <- vec[i] * 10
  } else {
    result[i] <- vec[i] * 0.1
  }
}
# Print the resulting vector
print(result)
[1] 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0
[11] 110.0 120.0 130.0 140.0 150.0 160.0 170.0 180.0 190.0 2.0
[21] 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
[31] 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
[41] 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0
[51] 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0
[61] 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0
[71] 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0
[81] 810.0 820.0 830.0 840.0 850.0 860.0 870.0 880.0 890.0 900.0
[91] 910.0 920.0 930.0 940.0 950.0 960.0 970.0 980.0 990.0 1000.0
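The exercise asks for a for-loop, but the same rule can also be written without one. The sketch below repeats the loop for comparison and then applies ifelse() to the whole vector at once; the names loop_result and vec_result are illustrative only.

```r
vec <- 1:100

# Loop version, following the same rule as the solution above
loop_result <- numeric(length(vec))
for (i in seq_along(vec)) {
  if (vec[i] < 20 || vec[i] > 80) {
    loop_result[i] <- vec[i] * 10
  } else {
    loop_result[i] <- vec[i] * 0.1
  }
}

# Vectorized version: ifelse() evaluates the condition for every element.
# Note the element-wise operator | instead of the scalar operator ||.
vec_result <- ifelse(vec < 20 | vec > 80, vec * 10, vec * 0.1)

# TRUE when the two approaches agree
all.equal(loop_result, vec_result)
```

The vectorized form is usually shorter and faster in R, while the loop makes the element-by-element logic explicit.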
Write a function to combine questions 2 and 3, so that you can feed it any integer n you like (as argument). The function 1) generates n random numbers, 2) multiplies the elements which are smaller than 20 or larger than 80 by 10 and the other elements by 0.1, then 3) returns the mean of the square root of the vector.
# Input: an integer
n <- 5
# Define the function
process_numbers <- function(n) {
  # Generate n random integers between 1 and 100 (with replacement)
  vec <- sample(1:100, n, replace = TRUE)
  # Pre-allocate a numeric vector to store the results
  result <- numeric(length(vec))
  # Loop through each element; seq_along() is also safe when n is 0
  for (i in seq_along(vec)) {
    if (vec[i] < 20 || vec[i] > 80) {
      result[i] <- vec[i] * 10
    } else {
      result[i] <- vec[i] * 0.1
    }
  }
  # Compute the element-wise square root of the result vector
  sqrt_result <- sqrt(result)
  # Return the mean of the square roots
  return(mean(sqrt_result))
}
# Example usage of the function
set.seed(123) # Set the seed for reproducibility
result <- process_numbers(n)
print(result)
[1] 4.250058
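The loop inside the function can be replaced by a single vectorized expression. The following is a minimal sketch; process_numbers_vec is a hypothetical name chosen for illustration and is not part of the original solution.

```r
# Hypothetical vectorized variant of process_numbers(); the name is illustrative
process_numbers_vec <- function(n) {
  # Draw n integers between 1 and 100, with replacement
  vec <- sample(1:100, n, replace = TRUE)
  # Scale values below 20 or above 80 by 10, and all others by 0.1
  scaled <- ifelse(vec < 20 | vec > 80, vec * 10, vec * 0.1)
  # Mean of the element-wise square roots
  mean(sqrt(scaled))
}

# Same seed and n as in the example usage above
set.seed(123)
process_numbers_vec(5)
```

Since it makes the same sample() call, this variant should draw the same numbers as the loop version when both start from the same seed, and so return the same mean.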
We will look into the distribution of baby names. Use Tidyverse to answer the questions. This data set has over 2 million rows covering 1880 to 2022. It was provided by the Social Security Administration. It has the following variables:
Read the data into R and call it bbnames:
# load packages
library(data.table)
library(tidyverse)
# Load the data - take a minute to load :)
bbnames = fread("https://pages.uwf.edu/acohen/teaching/datasets/babynames.csv", drop = "V1")
bbnames
name sex counts year
<char> <char> <int> <int>
1: Mary F 7065 1880
2: Anna F 2604 1880
3: Emma F 2003 1880
4: Elizabeth F 1939 1880
5: Minnie F 1746 1880
---
2085154: Zuberi M 5 2022
2085155: Zydn M 5 2022
2085156: Zylon M 5 2022
2085157: Zymeer M 5 2022
2085158: Zymeire M 5 2022
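Count the babies per year (use group_by and summarise). Then, find which year had the highest number of babies:
# Note: n() counts the rows (name/sex records) per year;
# the number of babies per year would be sum(counts)
bbnames %>%
  group_by(year) %>%
  summarize(n()) %>%
  ungroup()
# A tibble: 143 × 2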
year `n()`
<int> <int>
1 1880 2000
2 1881 1934
3 1882 2127
4 1883 2084
5 1884 2297
6 1885 2294
7 1886 2392
8 1887 2373
9 1888 2651
10 1889 2590
# ℹ 133 more rows
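The table above counts rows per year with n(), that is, the number of name/sex records, whereas the question asks for babies. A hedged sketch of summing counts instead and picking the top year follows. It runs on a small made-up stand-in for bbnames (the toy values and the names toy, babies_per_year, and top_year are assumptions), so the numbers are illustrative only.

```r
library(dplyr)

# Toy stand-in for bbnames: same column names, made-up values
toy <- tibble(
  name   = c("Mary", "Anna", "John", "Mary", "John", "Emma"),
  sex    = c("F", "F", "M", "F", "M", "F"),
  counts = c(10L, 5L, 8L, 12L, 9L, 7L),
  year   = c(1880L, 1880L, 1880L, 1881L, 1881L, 1881L)
)

# Total babies per year: sum the counts rather than counting rows
babies_per_year <- toy %>%
  group_by(year) %>%
  summarise(total_babies = sum(counts), .groups = "drop")

# Year with the highest total
top_year <- babies_per_year %>%
  slice_max(total_babies, n = 1)

top_year
```

Replacing toy with bbnames should apply the same steps to the full data set.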
# Load the tidyverse package
library(tidyverse)
# Work on the full baby-names data
df <- bbnames
# Determine the name with the highest total frequency per sex
max_names_per_sex <- df %>%
  group_by(sex, name) %>%
  summarize(total_frequency = sum(counts), .groups = 'drop') %>%
  arrange(sex, desc(total_frequency)) %>%
  group_by(sex) %>%
  slice(1)
print(max_names_per_sex)
# A tibble: 2 × 3
# Groups: sex [2]
sex name total_frequency
<chr> <chr> <int>
1 F Mary 4134713
2 M James 5214844
Compute each person's age and filter by age. Pick a threshold that keeps only people who may still be alive (you may use life expectancy):
# Load the tidyverse package
library(tidyverse)
# Small illustrative data frame that already contains an 'age' column
# (a toy example, not the bbnames data)
df <- tibble(
  name = c("Alice", "Bob", "Alice", "Charlie", "Alice", "Bob", "Diana", "Eve"),
  sex = c("F", "M", "F", "M", "F", "M", "F", "F"),
  frequency = c(5, 3, 4, 2, 1, 2, 6, 7),
  age = c(25, 45, 78, 85, 30, 50, 95, 65)
)
# Age threshold, chosen to approximate life expectancy
age_threshold <- 80
# Keep only those who may still be alive
filtered_df <- df %>%
  filter(age <= age_threshold)
print(filtered_df)
# A tibble: 6 × 4
name sex frequency age
<chr> <chr> <dbl> <dbl>
1 Alice F 5 25
2 Bob M 3 45
3 Alice F 4 78
4 Alice F 1 30
5 Bob M 2 50
6 Eve F 7 65
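The example above filters a hand-made table that already has an age column. For data shaped like bbnames, age has to be derived from the birth year first. A minimal sketch, assuming a reference year of 2024 and a cut-off of 80 years (both are assumptions, as are the made-up rows in toy):

```r
library(dplyr)

# Toy stand-in for bbnames: same column names, made-up values
toy <- tibble(
  name   = c("Mary", "John", "Emma", "Liam"),
  sex    = c("F", "M", "F", "M"),
  counts = c(70L, 96L, 20L, 204L),
  year   = c(1880L, 1950L, 1990L, 2022L)
)

reference_year <- 2024  # assumed "current" year
age_threshold  <- 80    # assumed cut-off, roughly a life expectancy

# Derive age from the birth year, then keep rows at or below the threshold
alive <- toy %>%
  mutate(age = reference_year - year) %>%
  filter(age <= age_threshold)

alive
```

Here the 1880 row is dropped because its derived age exceeds the threshold, and the other three rows are kept.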
Plot the distribution of a chosen name, where the x-axis is year and the y-axis is counts. Use geom_bar(), geom_line(), and facet_wrap() to separate females and males (use scales="free" to free the scales).
# Load the tidyverse package
library(tidyverse)
# Work on the full baby-names data
df <- bbnames
# Choose a name to filter by
chosen_name <- "John"
# Filter the data for the chosen name
filtered_df <- df %>%
  filter(name == chosen_name)
# Plot the distribution using ggplot2
ggplot(filtered_df, aes(x = year, y = counts, group = sex, color = sex)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_line(linewidth = 1) +
  facet_wrap(~ sex, scales = "free") +
  labs(title = paste("Distribution of", chosen_name, "over the Years"),
       x = "Year",
       y = "Counts") +
  theme_minimal()
Plot the distribution of another name, where the x-axis is year and the y-axis is counts. Use geom_bar(), geom_line(), and facet_wrap() to separate females and males (use scales="free" to free the scales).
# Load the tidyverse package
library(tidyverse)
# Work on the full baby-names data
df <- bbnames
# Choose a name to filter by
chosen_name <- "Peter"
# Filter the data for the chosen name
filtered_df <- df %>%
  filter(name == chosen_name)
# Plot the distribution using ggplot2
ggplot(filtered_df, aes(x = year, y = counts, group = sex, color = sex)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_line(linewidth = 1) +
  facet_wrap(~ sex, scales = "free") +
  labs(title = paste("Distribution of", chosen_name, "over the Years"),
       x = "Year",
       y = "Counts") +
  theme_minimal()
Data on pilot certification was obtained from the Federal Aviation Administration (FAA) in June 2023. The data set has over 450,000 pilot records and contains the following:
Read the data into R and call it pilots:
# load packages
library(data.table)
library(tidyverse)
# Load the data
pilots = fread("https://pages.uwf.edu/acohen/teaching/datasets/pilotscertification.csv")
pilots
ID STATE MedClass MedExpMonth MedExpYear CertLevel
<char> <char> <int> <int> <int> <char>
1: A0000014 FL 3 10 2023 Airline
2: A0000030 GA 3 8 2019 Private
3: A0000087 NH NA NA NA Airline
4: A0000113 CA 1 11 2023 Airline
5: A0000221 AZ 1 8 2023 Airline
---
450693: C1819748 FL NA NA NA Student
450694: C1819777 IN NA NA NA Student
450695: C1819942 FL NA NA NA Student
450696: C1820011 OH 3 5 2025 Student
450697: C1820025 GA NA NA NA Student
# Count pilots per certification level whose medical expires in June 2024
pilots %>%
  group_by(CertLevel, MedExpMonth, MedExpYear) %>%
  summarise(n()) %>%
  filter(MedExpMonth == 6, MedExpYear == 2024)
# A tibble: 6 × 4
# Groups: CertLevel, MedExpMonth [6]
CertLevel MedExpMonth MedExpYear `n()`
<chr> <int> <int> <int>
1 Airline 6 2024 139
2 Commercial 6 2024 402
3 Private 6 2024 2227
4 Recreational 6 2024 2
5 Sport 6 2024 2
6 Student 6 2024 262
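The same kind of tally can be written more compactly by filtering first and then using dplyr's count(). The sketch below runs on a small made-up stand-in for pilots (the IDs and values are invented, and the names toy_pilots, expiring, and n_pilots are assumptions), so its counts are illustrative only.

```r
library(dplyr)

# Toy stand-in for pilots: a subset of the columns, made-up values
toy_pilots <- tibble(
  ID          = c("X1", "X2", "X3", "X4", "X5"),
  MedExpMonth = c(6L, 6L, 6L, 5L, NA),
  MedExpYear  = c(2024L, 2024L, 2024L, 2025L, NA),
  CertLevel   = c("Private", "Private", "Airline", "Student", "Student")
)

# Keep medicals expiring in June 2024, then count pilots per certification level.
# filter() drops rows where the condition evaluates to NA.
expiring <- toy_pilots %>%
  filter(MedExpMonth == 6, MedExpYear == 2024) %>%
  count(CertLevel, name = "n_pilots")

expiring
```

Filtering before counting avoids summarising groups that are discarded afterwards, and count() returns an ungrouped tibble with a named count column.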