In project 2, three of the “wide” datasets identified in the Week 6 Discussion items needed to be selected and cleaning. Each discussion item also had an analysis that needed to be addressed to complete the assignment. The three databases used were National Endowment for the Arts, Jhalak Das dataset, and Animated TV Shows Around the World (1948 - 2022).

Discussion Item 1: National Endowment for the Arts

Analysis Question: Scatter plot of artist income by type and overall U.S. labor force

#Loading in libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ stringr 1.4.1
## ✔ tidyr   1.2.0     ✔ forcats 0.5.2
## ✔ readr   2.1.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:purrr':
## 
##     compact
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(ggplot2)

# Reading in csv file from GitHub.
# Any blank cell into NA
# Skip first 2 rows
# Dropping all NA
filepath <- 'https://raw.githubusercontent.com/melbow2424/Data-607--Project-2/main/Table1aArtistProfile.csv'
artist_profile <- read.csv(filepath, na.strings=c("","NA"), skip = 2) %>%
  drop_na()

# Header row to uppercase
# Replacing the first cell with DEMOGRAPHIC
colnames(artist_profile) <- toupper(colnames(artist_profile)) %>%
  replace(1, "DEMOGRAPHIC")

# Creating data frame for only rows of index 9 to 12
df2_transformed <- artist_profile[9:12,]

# Printing data frame
head(df2_transformed)
##                                                 DEMOGRAPHIC
## 9  Median earnings of workers employed full-year/full time4
## 10                                                      Men
## 11                                                    Women
## 12                          Women's-to-men's earnings ratio
##    ENTIRE.U.S..LABOR.FORCE ALL.ARTISTS ARCHITECTS1
## 9                  $48,305     $58,005     $83,195
## 10                 $53,670     $64,405     $87,675
## 11                 $42,940     $51,575     $70,710
## 12                   $0.80       $0.80       $0.81
##    FINE.ARTISTS..ART.DIRECTORS..ANIMATORS DESIGNERS  ACTORS
## 9                                 $50,505   $55,560 $41,060
## 10                                $54,025   $63,275 $43,220
## 11                                $42,425   $50,505 $37,570
## 12                                  $0.79     $0.80   $0.87
##    PRODUCERS.AND.DIRECTORS DANCERS.AND.CHOREOGRAPHERS MUSICIANS ENTERTAINERS
## 9                  $69,605                    $36,365   $45,875      $41,260
## 10                 $70,710                          *   $46,620      $48,620
## 11                 $64,405                          *   $42,940      $31,640
## 12                   $0.91                          *     $0.92        $0.65
##    ANNOUNCERS WRITERS.AND.AUTHORS PHOTOGRAPHERS
## 9     $51,675             $60,115       $42,940
## 10    $52,730             $64,830       $47,455
## 11    $48,485             $56,950       $36,465
## 12      $0.92               $0.88         $0.77
# Pivoting columns to job for plot
df_pivot <- df2_transformed %>%
  pivot_longer(cols=c('ENTIRE.U.S..LABOR.FORCE', 
                      'ALL.ARTISTS'), names_to='jobs' ,values_to = "income")%>%
  filter(DEMOGRAPHIC == "Median earnings of workers employed full-year/full time4" |
           DEMOGRAPHIC == "Men" |
           DEMOGRAPHIC == "Women")

# Plot jobs vs income of Median earnings, Men, and Women
ggplot(data = df_pivot, aes(x = jobs, y = income, color = DEMOGRAPHIC)) +
  geom_point()



Discussion Item 2: Jhalak Das data set

Analysis Question:
* The three terms be in the same column called terms or semesters.
* “sex and age” should be two distinct columns; ‘sex’ and ‘age’.
* Instead of taking “test number” as variable, we should consider two columns with headers ‘test1’ and ‘test 2’ accordingly.
* Finally, we should split the entire table into two for two types of observational units: STUDENTS and RESULTS. In the students table, we can keep id, name, phone, sex and age columns. On the results table, we can take the rest; id, terms, test1 and test2 columns. In this way we can join the two tables at any instance with id as primary key.

library(readxl)
library(tidyverse)
library(stringr) 

# Reading in csv file from GitHub.
file <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/Project%202/jhalak_das_untidy.csv"
df <- read.csv(file)
head(df)
##   id   name phone sex.and.age test.number term.1 term.2 term.3
## 1  1   Mike   134        m_12      test 1     76     84     87
## 2  2  Linda   270        f_13      test 1     88     90     73
## 3  3    Sam   210        m_11      test 1     78     74     80
## 4  4 Esther   617        f_12      test 1     68     75     74
## 5  5   Mary   114        f_14      test 1     65     67     64
## 6  1   Mike   134        m_12      test 2     85     80     90
# split up sex and age
# pivot term columns to single column, term, column: grade
# pivot column into multiple columns and in row have grade variables
df_clean <- df %>%
  separate(col=sex.and.age, into=c("sex","age"), sep="_", convert=TRUE)%>%
  pivot_longer(cols=c("term.1","term.2", "term.3"), names_to="term",
               names_pattern="([0-3])", values_to = "grade") %>%
  pivot_wider(names_from = 'test.number', values_from = 'grade')

head(df_clean)
## # A tibble: 6 × 8
##      id name  phone sex     age term  `test 1` `test 2`
##   <int> <chr> <int> <chr> <int> <chr>    <int>    <int>
## 1     1 Mike    134 m        12 1           76       85
## 2     1 Mike    134 m        12 2           84       80
## 3     1 Mike    134 m        12 3           87       90
## 4     2 Linda   270 f        13 1           88       87
## 5     2 Linda   270 f        13 2           90       82
## 6     2 Linda   270 f        13 3           73       94
# Data frame split into two for two

students <- df_clean[c('id', 'name', 'phone', 'sex', 'age')]

head(students)
## # A tibble: 6 × 5
##      id name  phone sex     age
##   <int> <chr> <int> <chr> <int>
## 1     1 Mike    134 m        12
## 2     1 Mike    134 m        12
## 3     1 Mike    134 m        12
## 4     2 Linda   270 f        13
## 5     2 Linda   270 f        13
## 6     2 Linda   270 f        13
results <- df_clean[c('id', 'term', 'test 1', 'test 2')]

head(results)
## # A tibble: 6 × 4
##      id term  `test 1` `test 2`
##   <int> <chr>    <int>    <int>
## 1     1 1           76       85
## 2     1 2           84       80
## 3     1 3           87       90
## 4     2 1           88       87
## 5     2 2           90       82
## 6     2 3           73       94

Discussion Item 3: Animated TV Shows Around the World

Analysis Question: Which animated TV series produced the most episodes and from which county.

library(dplyr)
library(tidyverse)
library(plyr)

# Reading in csv files from GitHub. There were two different csv files.
# Any blank cell into NA

animated_tv_1948_1986 = read.csv('https://raw.githubusercontent.com/melbow2424/Data-607--Project-2/main/Animated%20TV%20Shows%20Around%20the%20World%20(1948%20-%201986).csv',
                                 na.strings=c("","NA"))

animated_tv_1987_2022 = read.csv('https://raw.githubusercontent.com/melbow2424/Data-607--Project-2/main/Animated%20TV%20Shows%20Around%20the%20World%20(1987%20-%202022).csv',
                                 na.strings=c("","NA"))


# Combining csv file to get all years from 1948 to 2022 in years 

animated_tv <- rbind.fill(animated_tv_1948_1986,animated_tv_1987_2022) %>%
  filter(Episodes != "TBA") 

# Viewing dataframe

head(animated_tv)
##                              Title Episodes Country Premiere.Year Final.Year
## 1 3000 Leagues in Search of Mother       52   Japan          1976       1976
## 2                     Ace o Nerae!       26   Japan          1973       1974
## 3                        Acrobunch       24   Japan          1982       1982
## 4            Adventures of Pow Wow       52      US          1949       1950
## 5    Adventures of the Gummi Bears       65      US          1985       1991
## 6   Adventures of the Little Koala       52   Japan          1984       1985
##   Seasons Original.Channel Technique
## 1    <NA>             <NA>      <NA>
## 2    <NA>             <NA>      <NA>
## 3    <NA>             <NA>      <NA>
## 4    <NA>             <NA>      <NA>
## 5    <NA>             <NA>      <NA>
## 6    <NA>             <NA>      <NA>
#Checking structure of data frame

str(animated_tv)
## 'data.frame':    3249 obs. of  8 variables:
##  $ Title           : chr  "3000 Leagues in Search of Mother" "Ace o Nerae!" "Acrobunch" "Adventures of Pow Wow" ...
##  $ Episodes        : chr  "52" "26" "24" "52" ...
##  $ Country         : chr  "Japan" "Japan" "Japan" "US" ...
##  $ Premiere.Year   : int  1976 1973 1982 1949 1985 1984 1959 1981 1983 1970 ...
##  $ Final.Year      : chr  "1976" "1974" "1982" "1950" ...
##  $ Seasons         : chr  NA NA NA NA ...
##  $ Original.Channel: chr  NA NA NA NA ...
##  $ Technique       : chr  NA NA NA NA ...
#When checking the structure, Episodes of tv shows are in char variable. 
#To find the greatest number of episodes of tv shows, char needed to change to integer (or num).
#I chose integer 

animated_tv <- transform(animated_tv, Episodes = as.integer(Episodes))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
# Checking structure of data frame
str(animated_tv)
## 'data.frame':    3249 obs. of  8 variables:
##  $ Title           : chr  "3000 Leagues in Search of Mother" "Ace o Nerae!" "Acrobunch" "Adventures of Pow Wow" ...
##  $ Episodes        : int  52 26 24 52 65 52 39 26 42 52 ...
##  $ Country         : chr  "Japan" "Japan" "Japan" "US" ...
##  $ Premiere.Year   : int  1976 1973 1982 1949 1985 1984 1959 1981 1983 1970 ...
##  $ Final.Year      : chr  "1976" "1974" "1982" "1950" ...
##  $ Seasons         : chr  NA NA NA NA ...
##  $ Original.Channel: chr  NA NA NA NA ...
##  $ Technique       : chr  NA NA NA NA ...
#Creating data frame with just the max (greatest number) of episodes of tv shows.
most_episodes <- animated_tv[animated_tv$Episodes == max(animated_tv$Episodes, na.rm = TRUE) ,]


#Displaying that dataframe
head(most_episodes, 1)
##         Title Episodes Country Premiere.Year Final.Year Seasons
## 514 Sazae-san     6342   Japan          1969    Present    <NA>
##     Original.Channel Technique
## 514             <NA>      <NA>

The animated TV series with the most episodes is Sazae-san from Japan. It premiered in 1969 and is still running today.