In project 2, three of the “wide” datasets identified in the Week 6 Discussion items needed to be selected and cleaning. Each discussion item also had an analysis that needed to be addressed to complete the assignment. The three databases used were National Endowment for the Arts, Jhalak Das dataset, and Animated TV Shows Around the World (1948 - 2022).
Analysis Question: Scatter plot of artist income by type and overall U.S. labor force
#Loading in libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ✔ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:purrr':
##
## compact
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(ggplot2)
# Reading in csv file from GitHub.
# Any blank cell into NA
# Skip first 2 rows
# Dropping all NA
filepath <- 'https://raw.githubusercontent.com/melbow2424/Data-607--Project-2/main/Table1aArtistProfile.csv'
artist_profile <- read.csv(filepath, na.strings=c("","NA"), skip = 2) %>%
drop_na()
# Header row to uppercase
# Replacing the first cell with DEMOGRAPHIC
colnames(artist_profile) <- toupper(colnames(artist_profile)) %>%
replace(1, "DEMOGRAPHIC")
# Creating data frame for only rows of index 9 to 12
df2_transformed <- artist_profile[9:12,]
# Printing data frame
head(df2_transformed)
## DEMOGRAPHIC
## 9 Median earnings of workers employed full-year/full time4
## 10 Men
## 11 Women
## 12 Women's-to-men's earnings ratio
## ENTIRE.U.S..LABOR.FORCE ALL.ARTISTS ARCHITECTS1
## 9 $48,305 $58,005 $83,195
## 10 $53,670 $64,405 $87,675
## 11 $42,940 $51,575 $70,710
## 12 $0.80 $0.80 $0.81
## FINE.ARTISTS..ART.DIRECTORS..ANIMATORS DESIGNERS ACTORS
## 9 $50,505 $55,560 $41,060
## 10 $54,025 $63,275 $43,220
## 11 $42,425 $50,505 $37,570
## 12 $0.79 $0.80 $0.87
## PRODUCERS.AND.DIRECTORS DANCERS.AND.CHOREOGRAPHERS MUSICIANS ENTERTAINERS
## 9 $69,605 $36,365 $45,875 $41,260
## 10 $70,710 * $46,620 $48,620
## 11 $64,405 * $42,940 $31,640
## 12 $0.91 * $0.92 $0.65
## ANNOUNCERS WRITERS.AND.AUTHORS PHOTOGRAPHERS
## 9 $51,675 $60,115 $42,940
## 10 $52,730 $64,830 $47,455
## 11 $48,485 $56,950 $36,465
## 12 $0.92 $0.88 $0.77
# Pivoting columns to job for plot
df_pivot <- df2_transformed %>%
pivot_longer(cols=c('ENTIRE.U.S..LABOR.FORCE',
'ALL.ARTISTS'), names_to='jobs' ,values_to = "income")%>%
filter(DEMOGRAPHIC == "Median earnings of workers employed full-year/full time4" |
DEMOGRAPHIC == "Men" |
DEMOGRAPHIC == "Women")
# Plot jobs vs income of Median earnings, Men, and Women
ggplot(data = df_pivot, aes(x = jobs, y = income, color = DEMOGRAPHIC)) +
geom_point()
Analysis Question:
* The three terms be in the same column
called terms or semesters.
* “sex and age” should be two distinct
columns; ‘sex’ and ‘age’.
* Instead of taking “test number” as
variable, we should consider two columns with headers ‘test1’ and ‘test
2’ accordingly.
* Finally, we should split the entire table into two
for two types of observational units: STUDENTS and RESULTS. In the
students table, we can keep id, name, phone, sex and age columns. On the
results table, we can take the rest; id, terms, test1 and test2 columns.
In this way we can join the two tables at any instance with id as
primary key.
library(readxl)
library(tidyverse)
library(stringr)
# Reading in csv file from GitHub.
file <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/Project%202/jhalak_das_untidy.csv"
df <- read.csv(file)
head(df)
## id name phone sex.and.age test.number term.1 term.2 term.3
## 1 1 Mike 134 m_12 test 1 76 84 87
## 2 2 Linda 270 f_13 test 1 88 90 73
## 3 3 Sam 210 m_11 test 1 78 74 80
## 4 4 Esther 617 f_12 test 1 68 75 74
## 5 5 Mary 114 f_14 test 1 65 67 64
## 6 1 Mike 134 m_12 test 2 85 80 90
# split up sex and age
# pivot term columns to single column, term, column: grade
# pivot column into multiple columns and in row have grade variables
df_clean <- df %>%
separate(col=sex.and.age, into=c("sex","age"), sep="_", convert=TRUE)%>%
pivot_longer(cols=c("term.1","term.2", "term.3"), names_to="term",
names_pattern="([0-3])", values_to = "grade") %>%
pivot_wider(names_from = 'test.number', values_from = 'grade')
head(df_clean)
## # A tibble: 6 × 8
## id name phone sex age term `test 1` `test 2`
## <int> <chr> <int> <chr> <int> <chr> <int> <int>
## 1 1 Mike 134 m 12 1 76 85
## 2 1 Mike 134 m 12 2 84 80
## 3 1 Mike 134 m 12 3 87 90
## 4 2 Linda 270 f 13 1 88 87
## 5 2 Linda 270 f 13 2 90 82
## 6 2 Linda 270 f 13 3 73 94
# Data frame split into two for two
students <- df_clean[c('id', 'name', 'phone', 'sex', 'age')]
head(students)
## # A tibble: 6 × 5
## id name phone sex age
## <int> <chr> <int> <chr> <int>
## 1 1 Mike 134 m 12
## 2 1 Mike 134 m 12
## 3 1 Mike 134 m 12
## 4 2 Linda 270 f 13
## 5 2 Linda 270 f 13
## 6 2 Linda 270 f 13
results <- df_clean[c('id', 'term', 'test 1', 'test 2')]
head(results)
## # A tibble: 6 × 4
## id term `test 1` `test 2`
## <int> <chr> <int> <int>
## 1 1 1 76 85
## 2 1 2 84 80
## 3 1 3 87 90
## 4 2 1 88 87
## 5 2 2 90 82
## 6 2 3 73 94
Analysis Question: Which animated TV series produced the most episodes and from which county.
library(dplyr)
library(tidyverse)
library(plyr)
# Reading in csv files from GitHub. There were two different csv files.
# Any blank cell into NA
animated_tv_1948_1986 = read.csv('https://raw.githubusercontent.com/melbow2424/Data-607--Project-2/main/Animated%20TV%20Shows%20Around%20the%20World%20(1948%20-%201986).csv',
na.strings=c("","NA"))
animated_tv_1987_2022 = read.csv('https://raw.githubusercontent.com/melbow2424/Data-607--Project-2/main/Animated%20TV%20Shows%20Around%20the%20World%20(1987%20-%202022).csv',
na.strings=c("","NA"))
# Combining csv file to get all years from 1948 to 2022 in years
animated_tv <- rbind.fill(animated_tv_1948_1986,animated_tv_1987_2022) %>%
filter(Episodes != "TBA")
# Viewing dataframe
head(animated_tv)
## Title Episodes Country Premiere.Year Final.Year
## 1 3000 Leagues in Search of Mother 52 Japan 1976 1976
## 2 Ace o Nerae! 26 Japan 1973 1974
## 3 Acrobunch 24 Japan 1982 1982
## 4 Adventures of Pow Wow 52 US 1949 1950
## 5 Adventures of the Gummi Bears 65 US 1985 1991
## 6 Adventures of the Little Koala 52 Japan 1984 1985
## Seasons Original.Channel Technique
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
#Checking structure of data frame
str(animated_tv)
## 'data.frame': 3249 obs. of 8 variables:
## $ Title : chr "3000 Leagues in Search of Mother" "Ace o Nerae!" "Acrobunch" "Adventures of Pow Wow" ...
## $ Episodes : chr "52" "26" "24" "52" ...
## $ Country : chr "Japan" "Japan" "Japan" "US" ...
## $ Premiere.Year : int 1976 1973 1982 1949 1985 1984 1959 1981 1983 1970 ...
## $ Final.Year : chr "1976" "1974" "1982" "1950" ...
## $ Seasons : chr NA NA NA NA ...
## $ Original.Channel: chr NA NA NA NA ...
## $ Technique : chr NA NA NA NA ...
#When checking the structure, Episodes of tv shows are in char variable.
#To find the greatest number of episodes of tv shows, char needed to change to integer (or num).
#I chose integer
animated_tv <- transform(animated_tv, Episodes = as.integer(Episodes))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
# Checking structure of data frame
str(animated_tv)
## 'data.frame': 3249 obs. of 8 variables:
## $ Title : chr "3000 Leagues in Search of Mother" "Ace o Nerae!" "Acrobunch" "Adventures of Pow Wow" ...
## $ Episodes : int 52 26 24 52 65 52 39 26 42 52 ...
## $ Country : chr "Japan" "Japan" "Japan" "US" ...
## $ Premiere.Year : int 1976 1973 1982 1949 1985 1984 1959 1981 1983 1970 ...
## $ Final.Year : chr "1976" "1974" "1982" "1950" ...
## $ Seasons : chr NA NA NA NA ...
## $ Original.Channel: chr NA NA NA NA ...
## $ Technique : chr NA NA NA NA ...
#Creating data frame with just the max (greatest number) of episodes of tv shows.
most_episodes <- animated_tv[animated_tv$Episodes == max(animated_tv$Episodes, na.rm = TRUE) ,]
#Displaying that dataframe
head(most_episodes, 1)
## Title Episodes Country Premiere.Year Final.Year Seasons
## 514 Sazae-san 6342 Japan 1969 Present <NA>
## Original.Channel Technique
## 514 <NA> <NA>
The animated TV series with the most episodes is Sazae-san from Japan. It premiered in 1969 and is still running today.