The assignment for this project entailed of retrieving three wide datasets posted previously by the class in a discussion board and conduct the analysis the poster inteded for their dataset.
For this project, I am using Alice Ding’s data on the test performance of students throughout 3 terms. I am also using a sample dataset retirieved from Coco Donovan where the dataset needed to be manually created due to the insufficient source of the article used. The table used in this analysis replicates the table posted in the discussion board. Finally, I am also using adata set obtained from Nick Climaco on the populaiton growth of Mexico, the Us, and Canda from the yeras 1994-2000.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RCurl)
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:tidyr':
##
## complete
#Alice Ding's DB
students_data <- getURL("https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv")
students_data <- read_csv(students_data)
## Rows: 10 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, sex and age, test number
## dbl (5): id, phone, term 1, term 2, term 3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Nick CLimaco
nafta_data <- getURL("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/nafta_countries.csv")
nafta_data <- read_csv(nafta_data)
## Rows: 3 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): id, 1994, 1995, 1996, 1997, 1998, 1999, 2000
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##Religion DataFramee This date frame created is referencing the sample data set classmate Coco Donovan posted in week 5 discussion board. The described sample was retrieved from a site that did not link to the complete dataset so therefore the data frame created is one that only reflect 5 cases of regious classifications
#creating the untidy data posted in the DB
religioninc_data <- data.frame(
"Religion" = c("Agnostic", "Atheist", "Budhdist", "Catholic", "refused"),
"<10k" = c(27, 12, 27, 418, 15),
"10-20k" = c(34, 27, 21, 617, 14),
"20-30k" = c(60, 37, 30, 732, 15),
"30-40k" = c(81, 52, 34, 670, 11),
"40-50k" = c(76, 35, 33, 638, 10),
"50-75k" = c(137, 70, 58, 1116, 35),
"75-100k" = c(122, 73, 62, 949, 21),
"100-150k" = c(109, 59, 39, 792, 17),
">150k" = c(84, 74, 53, 633, 18),
"refused" = c(96, 76, 54, 1489, 116)
)
write.csv(religioninc_data, "religion-income.csv")
religioninc_data
## Religion X.10k X10.20k X20.30k X30.40k X40.50k X50.75k X75.100k X100.150k
## 1 Agnostic 27 34 60 81 76 137 122 109
## 2 Atheist 12 27 37 52 35 70 73 59
## 3 Budhdist 27 21 30 34 33 58 62 39
## 4 Catholic 418 617 732 670 638 1116 949 792
## 5 refused 15 14 15 11 10 35 21 17
## X.150k refused
## 1 84 96
## 2 74 76
## 3 53 54
## 4 633 1489
## 5 18 116
The religion-income data set will then be tidied by converting the original created table into a long format, where variables such as Religion and Income brackets are formed, as well as the frequency of respondents falling into each category
religioninc_long <- pivot_longer(religioninc_data, cols = -Religion, names_to = "Income_Bracket", values_to = "Respondents Frequency") %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X.10k", "<10k", Income_Bracket)) %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X10.20k", "10-20k", Income_Bracket)) %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X20.30k", "20-30k", Income_Bracket)) %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X30.40k", "30-40k", Income_Bracket)) %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X40.50k", "40-50k", Income_Bracket)) %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X50.75k", "50-75k", Income_Bracket)) %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X75.100k", "75-100k", Income_Bracket)) %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X100.150k", "100-150k", Income_Bracket)) %>%
mutate(Income_Bracket = ifelse(Income_Bracket == "X.150k", ">150K", Income_Bracket))
religioninc_long
## # A tibble: 50 × 3
## Religion Income_Bracket `Respondents Frequency`
## <chr> <chr> <dbl>
## 1 Agnostic <10k 27
## 2 Agnostic 10-20k 34
## 3 Agnostic 20-30k 60
## 4 Agnostic 30-40k 81
## 5 Agnostic 40-50k 76
## 6 Agnostic 50-75k 137
## 7 Agnostic 75-100k 122
## 8 Agnostic 100-150k 109
## 9 Agnostic >150K 84
## 10 Agnostic refused 96
## # … with 40 more rows
For the purpose of this analysis, Coco suggested the way he would approach this analysis is by grouping the data by religion to see what the religious makeup of all respondents was by percentages. A visual analysis with parallel bar charts will be created to get an idea of which religion has the wealthiest followers (based solely on the results of this data).
religioninc_long %>%
group_by(Religion) %>%
summarise(religioninc_long, Respondents_Percentage = (`Respondents Frequency`/sum(`Respondents Frequency`))*100) %>%
ggplot(aes(Religion, Respondents_Percentage, fill=Income_Bracket)) +
geom_col(stat = "identity")
## `summarise()` has grouped output by 'Religion'. You can override using the
## `.groups` argument.
## Warning: Ignoring unknown parameters: stat
## Results It is shown that the most prevalent religion among the
respondents is Catholicism while the least prevalent is Buddhism,
excluding those who refused to answer. In the same chart, its is also
shown that the most common income bracket among catholics are 50-75k and
75-100k. It should be noted, however, that the brackets are
inconsistents in the data collection where for all of the bracket groups
under 50K range only 10k while after 50k has a 25k range, allowiing to
account for more respondents.
This next data set was obtained through the discussion board response of Alice Ding where the data entails of some demographic factors of the student and how they performed on each test across 3 terms. To tidy the data, the column “sex and age” will be seperated into two columns to represent each variable. The columns “term 1”, “term 2”, and “term 3” will be combined to one column, “Terms” where another column will be created to represents its values, “Scores
students_data
## # A tibble: 10 × 8
## id name phone `sex and age` `test number` `term 1` `term 2` `term 3`
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Mike 134 m_12 test 1 76 84 87
## 2 2 Linda 270 f_13 test 1 88 90 73
## 3 3 Sam 210 m_11 test 1 78 74 80
## 4 4 Esther 617 f_12 test 1 68 75 74
## 5 5 Mary 114 f_14 test 1 65 67 64
## 6 1 Mike 134 m_12 test 2 85 80 90
## 7 2 Linda 270 f_13 test 2 87 82 94
## 8 3 Sam 210 m_11 test 2 80 87 80
## 9 4 Esther 617 f_12 test 2 70 75 78
## 10 5 Mary 114 f_14 test 2 68 70 63
student_tidy <- students_data %>%
separate("sex and age", c("Sex", "Age"), sep = "_")%>%
mutate(Sex = ifelse(Sex == "f", "Female", Sex)) %>%
mutate(Sex = ifelse(Sex == "m", "Male", Sex)) %>%
pivot_longer(cols = c("term 1", "term 2", "term 3"), names_to = "Terms", values_to = "Scores")
student_tidy$Age <- as.numeric(student_tidy$Age)
student_tidy$Scores <- as.numeric(student_tidy$Scores)
student_tidy
## # A tibble: 30 × 8
## id name phone Sex Age `test number` Terms Scores
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 Mike 134 Male 12 test 1 term 1 76
## 2 1 Mike 134 Male 12 test 1 term 2 84
## 3 1 Mike 134 Male 12 test 1 term 3 87
## 4 2 Linda 270 Female 13 test 1 term 1 88
## 5 2 Linda 270 Female 13 test 1 term 2 90
## 6 2 Linda 270 Female 13 test 1 term 3 73
## 7 3 Sam 210 Male 11 test 1 term 1 78
## 8 3 Sam 210 Male 11 test 1 term 2 74
## 9 3 Sam 210 Male 11 test 1 term 3 80
## 10 4 Esther 617 Female 12 test 1 term 1 68
## # … with 20 more rows
In terms of analyzing Alice’s data, she suggested an approach to this analysis would be finding the average test score for each student or average test score overall. It could also be counted how many students are of each sex and find the average age as well.
#Avg test score for each student
avg_scores <- student_tidy %>%
group_by(name) %>%
summarise(avg_score = mean(Scores))
avg_scores
## # A tibble: 5 × 2
## name avg_score
## <chr> <dbl>
## 1 Esther 73.3
## 2 Linda 85.7
## 3 Mary 66.2
## 4 Mike 83.7
## 5 Sam 79.8
ggplot(avg_scores, aes(name, avg_score)) +
geom_bar(stat="identity", fill="dark green") +
labs(x = "Students", y = "Average Score", title = "Average Test Scores by Student")
#Avg test score for each sex
#fix later
avgscores_bysex <- student_tidy %>%
group_by(Sex) %>%
summarise(avg_score = mean(Scores), avg_age = mean(Age))
avgscores_bysex
## # A tibble: 2 × 3
## Sex avg_score avg_age
## <chr> <dbl> <dbl>
## 1 Female 75.1 13
## 2 Male 81.8 11.5
ggplot(avgscores_bysex, aes(Sex, avg_score, fill=Sex)) +
geom_bar(stat = "identity") +
labs(x = "Sex of Students", y = "Score", title = "Number of Students by Sex")
## Results In the charts above, it is show that from the 5 students test
perfomance recorded, overall Linda scored the best, scoring 85.67%, whie
Mary performed the worst, scoring 66.17%. While the person who scored
the best is a girl, it is shown that the boys overall performed better,
averaging at 81.75 and were younger (averaging to 11.5), while the girls
avereaged a score of 75.06 and on average, aged 13 years old.
This last data set was retrieved from classmate Nick Climaco discussion board response where it entails of the population growth from 1994-2000 of Mexico, the US, and Canada. To tidy this data, the table will be transformed into a long format t compress te years into a single columns and the population into another.
nafta_data
## # A tibble: 3 × 9
## `id` Country `1994` `1995` `1996` `1997` `1998` `1999` `2000`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Mexico 88314424 89969572 9.16e7 9.32e7 9.48e7 9.63e7 9.79e7
## 2 2 United States 263126000 266278000 2.69e8 2.73e8 2.76e8 2.79e8 2.82e8
## 3 3 Canada 29000663 29302311 2.96e7 2.99e7 3.02e7 3.04e7 3.07e7
nafta_long <- nafta_data %>%
pivot_longer(cols = -c("id", "Country"), names_to = "Year", values_to = "Population")
colnames(nafta_long)[1] <- "ID"
nafta_long
## # A tibble: 21 × 4
## ID Country Year Population
## <dbl> <chr> <chr> <dbl>
## 1 1 Mexico 1994 88314424
## 2 1 Mexico 1995 89969572
## 3 1 Mexico 1996 91586555
## 4 1 Mexico 1997 93183094
## 5 1 Mexico 1998 94767284
## 6 1 Mexico 1999 96334810
## 7 1 Mexico 2000 97873442
## 8 2 United States 1994 263126000
## 9 2 United States 1995 266278000
## 10 2 United States 1996 269394000
## # … with 11 more rows
In his discussion board response, Nick recommended that for future analysis, the population growth rate for each year could be calculated and compared to the other two countries in the dataset.
pop_growth <- nafta_long %>%
group_by(Country) %>%
arrange(Year) %>%
mutate(Population_Growth_Rate = ((Population - lag(Population))/lag(Population)) * 100) %>%
na.omit()
pop_growth
## # A tibble: 18 × 5
## # Groups: Country [3]
## ID Country Year Population Population_Growth_Rate
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 Mexico 1995 89969572 1.87
## 2 2 United States 1995 266278000 1.20
## 3 3 Canada 1995 29302311 1.04
## 4 1 Mexico 1996 91586555 1.80
## 5 2 United States 1996 269394000 1.17
## 6 3 Canada 1996 29610218 1.05
## 7 1 Mexico 1997 93183094 1.74
## 8 2 United States 1997 272657000 1.21
## 9 3 Canada 1997 29905948 0.999
## 10 1 Mexico 1998 94767284 1.70
## 11 2 United States 1998 275854000 1.17
## 12 3 Canada 1998 30155173 0.833
## 13 1 Mexico 1999 96334810 1.65
## 14 2 United States 1999 279040000 1.15
## 15 3 Canada 1999 30401286 0.816
## 16 1 Mexico 2000 97873442 1.60
## 17 2 United States 2000 282162411 1.12
## 18 3 Canada 2000 30685730 0.936
pop_growth %>%
ggplot(aes(x = Year, y = Population_Growth_Rate, color = Country, linetype = "-")) +
geom_point() +
labs(x = "Year", y = "Population Growth Rate", title = "Population Growth Rate from 1994-1995")
## Results Referring to the chart above, it is shown that in the years
of 1994-1995, Mexico’s population has been decreasing at a faster and
conistent rate while for the US, its poppulation has been consolidating
slightly downward. Canada on the other hand experienced a sharp decline
in their population from 1997-1998 but seemed to be reocovering back to
its normanl rates by the end of 2000.