Project-2.knit

Introduction

The assignment for this project entailed of retrieving three wide datasets posted previously by the class in a discussion board and conduct the analysis the poster inteded for their dataset.

For this project, I am using Alice Ding’s data on the test performance of students throughout 3 terms. I am also using a sample dataset retirieved from Coco Donovan where the dataset needed to be manually created due to the insufficient source of the article used. The table used in this analysis replicates the table posted in the discussion board. Finally, I am also using adata set obtained from Nick Climaco on the populaiton growth of Mexico, the Us, and Canda from the yeras 1994-2000.

Importing

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(RCurl)

## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete

#Alice Ding's DB
students_data <- getURL("https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv")
students_data <- read_csv(students_data)

## Rows: 10 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, sex and age, test number
## dbl (5): id, phone, term 1, term 2, term 3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Nick CLimaco
nafta_data <- getURL("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/nafta_countries.csv")
nafta_data <- read_csv(nafta_data)

## Rows: 3 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): ï»¿id, 1994, 1995, 1996, 1997, 1998, 1999, 2000
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

##Religion DataFramee This date frame created is referencing the sample data set classmate Coco Donovan posted in week 5 discussion board. The described sample was retrieved from a site that did not link to the complete dataset so therefore the data frame created is one that only reflect 5 cases of regious classifications

#creating the untidy data posted in the DB
religioninc_data <- data.frame(
  "Religion" = c("Agnostic", "Atheist", "Budhdist", "Catholic", "refused"),
  "<10k" = c(27, 12, 27, 418, 15),
  "10-20k" = c(34, 27, 21, 617, 14),
  "20-30k" = c(60, 37, 30, 732, 15),
  "30-40k" = c(81, 52, 34, 670, 11),
  "40-50k" = c(76, 35, 33, 638, 10),
    "50-75k" = c(137, 70, 58, 1116, 35),
  "75-100k" = c(122, 73, 62, 949, 21),
  "100-150k" = c(109, 59, 39, 792, 17),
  ">150k" = c(84, 74, 53, 633, 18),
  "refused" = c(96, 76, 54, 1489, 116)
)
write.csv(religioninc_data, "religion-income.csv")
religioninc_data

##   Religion X.10k X10.20k X20.30k X30.40k X40.50k X50.75k X75.100k X100.150k
## 1 Agnostic    27      34      60      81      76     137      122       109
## 2  Atheist    12      27      37      52      35      70       73        59
## 3 Budhdist    27      21      30      34      33      58       62        39
## 4 Catholic   418     617     732     670     638    1116      949       792
## 5  refused    15      14      15      11      10      35       21        17
##   X.150k refused
## 1     84      96
## 2     74      76
## 3     53      54
## 4    633    1489
## 5     18     116

The religion-income data set will then be tidied by converting the original created table into a long format, where variables such as Religion and Income brackets are formed, as well as the frequency of respondents falling into each category

religioninc_long <- pivot_longer(religioninc_data, cols = -Religion, names_to = "Income_Bracket", values_to = "Respondents Frequency") %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X.10k", "<10k", Income_Bracket)) %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X10.20k", "10-20k", Income_Bracket)) %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X20.30k", "20-30k", Income_Bracket)) %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X30.40k", "30-40k", Income_Bracket)) %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X40.50k", "40-50k", Income_Bracket)) %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X50.75k", "50-75k", Income_Bracket)) %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X75.100k", "75-100k", Income_Bracket)) %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X100.150k", "100-150k", Income_Bracket)) %>%
  mutate(Income_Bracket = ifelse(Income_Bracket == "X.150k", ">150K", Income_Bracket)) 
religioninc_long

## # A tibble: 50 × 3
##    Religion Income_Bracket `Respondents Frequency`
##    <chr>    <chr>                            <dbl>
##  1 Agnostic <10k                                27
##  2 Agnostic 10-20k                              34
##  3 Agnostic 20-30k                              60
##  4 Agnostic 30-40k                              81
##  5 Agnostic 40-50k                              76
##  6 Agnostic 50-75k                             137
##  7 Agnostic 75-100k                            122
##  8 Agnostic 100-150k                           109
##  9 Agnostic >150K                               84
## 10 Agnostic refused                             96
## # … with 40 more rows

Analysis of Religion-Income Datataset

For the purpose of this analysis, Coco suggested the way he would approach this analysis is by grouping the data by religion to see what the religious makeup of all respondents was by percentages. A visual analysis with parallel bar charts will be created to get an idea of which religion has the wealthiest followers (based solely on the results of this data).

religioninc_long %>%
  group_by(Religion) %>%
  summarise(religioninc_long, Respondents_Percentage = (`Respondents Frequency`/sum(`Respondents Frequency`))*100) %>%
  ggplot(aes(Religion, Respondents_Percentage, fill=Income_Bracket)) +
  geom_col(stat = "identity")

## `summarise()` has grouped output by 'Religion'. You can override using the
## `.groups` argument.

## Warning: Ignoring unknown parameters: stat

## Results It is shown that the most prevalent religion among the respondents is Catholicism while the least prevalent is Buddhism, excluding those who refused to answer. In the same chart, its is also shown that the most common income bracket among catholics are 50-75k and 75-100k. It should be noted, however, that the brackets are inconsistents in the data collection where for all of the bracket groups under 50K range only 10k while after 50k has a 25k range, allowiing to account for more respondents.

Student’s Test Performance

This next data set was obtained through the discussion board response of Alice Ding where the data entails of some demographic factors of the student and how they performed on each test across 3 terms. To tidy the data, the column “sex and age” will be seperated into two columns to represent each variable. The columns “term 1”, “term 2”, and “term 3” will be combined to one column, “Terms” where another column will be created to represents its values, “Scores

students_data

## # A tibble: 10 × 8
##       id name   phone `sex and age` `test number` `term 1` `term 2` `term 3`
##    <dbl> <chr>  <dbl> <chr>         <chr>            <dbl>    <dbl>    <dbl>
##  1     1 Mike     134 m_12          test 1              76       84       87
##  2     2 Linda    270 f_13          test 1              88       90       73
##  3     3 Sam      210 m_11          test 1              78       74       80
##  4     4 Esther   617 f_12          test 1              68       75       74
##  5     5 Mary     114 f_14          test 1              65       67       64
##  6     1 Mike     134 m_12          test 2              85       80       90
##  7     2 Linda    270 f_13          test 2              87       82       94
##  8     3 Sam      210 m_11          test 2              80       87       80
##  9     4 Esther   617 f_12          test 2              70       75       78
## 10     5 Mary     114 f_14          test 2              68       70       63

student_tidy <- students_data %>%
separate("sex and age", c("Sex", "Age"), sep = "_")%>%
  mutate(Sex = ifelse(Sex == "f", "Female", Sex)) %>%
  mutate(Sex = ifelse(Sex == "m", "Male", Sex)) %>%
  pivot_longer(cols = c("term 1", "term 2", "term 3"), names_to = "Terms", values_to = "Scores") 

student_tidy$Age <- as.numeric(student_tidy$Age)
student_tidy$Scores <- as.numeric(student_tidy$Scores)
student_tidy

## # A tibble: 30 × 8
##       id name   phone Sex      Age `test number` Terms  Scores
##    <dbl> <chr>  <dbl> <chr>  <dbl> <chr>         <chr>   <dbl>
##  1     1 Mike     134 Male      12 test 1        term 1     76
##  2     1 Mike     134 Male      12 test 1        term 2     84
##  3     1 Mike     134 Male      12 test 1        term 3     87
##  4     2 Linda    270 Female    13 test 1        term 1     88
##  5     2 Linda    270 Female    13 test 1        term 2     90
##  6     2 Linda    270 Female    13 test 1        term 3     73
##  7     3 Sam      210 Male      11 test 1        term 1     78
##  8     3 Sam      210 Male      11 test 1        term 2     74
##  9     3 Sam      210 Male      11 test 1        term 3     80
## 10     4 Esther   617 Female    12 test 1        term 1     68
## # … with 20 more rows

Analysis of Student Test Performances

In terms of analyzing Alice’s data, she suggested an approach to this analysis would be finding the average test score for each student or average test score overall. It could also be counted how many students are of each sex and find the average age as well.

#Avg test score for each student
avg_scores <- student_tidy %>%
  group_by(name) %>%
  summarise(avg_score = mean(Scores)) 
avg_scores

## # A tibble: 5 × 2
##   name   avg_score
##   <chr>      <dbl>
## 1 Esther      73.3
## 2 Linda       85.7
## 3 Mary        66.2
## 4 Mike        83.7
## 5 Sam         79.8

  ggplot(avg_scores, aes(name, avg_score)) +
  geom_bar(stat="identity", fill="dark green") +
  labs(x = "Students", y = "Average Score", title = "Average Test Scores by Student")

#Avg test score for each sex
  #fix later
avgscores_bysex <- student_tidy %>% 
  group_by(Sex) %>%
  summarise(avg_score = mean(Scores), avg_age = mean(Age))
avgscores_bysex

## # A tibble: 2 × 3
##   Sex    avg_score avg_age
##   <chr>      <dbl>   <dbl>
## 1 Female      75.1    13  
## 2 Male        81.8    11.5

ggplot(avgscores_bysex, aes(Sex, avg_score, fill=Sex)) +
  geom_bar(stat = "identity") +
  labs(x = "Sex of Students", y = "Score", title = "Number of Students by Sex")

## Results In the charts above, it is show that from the 5 students test perfomance recorded, overall Linda scored the best, scoring 85.67%, whie Mary performed the worst, scoring 66.17%. While the person who scored the best is a girl, it is shown that the boys overall performed better, averaging at 81.75 and were younger (averaging to 11.5), while the girls avereaged a score of 75.06 and on average, aged 13 years old.

Population Growth Over Time

This last data set was retrieved from classmate Nick Climaco discussion board response where it entails of the population growth from 1994-2000 of Mexico, the US, and Canada. To tidy this data, the table will be transformed into a long format t compress te years into a single columns and the population into another.

nafta_data

## # A tibble: 3 × 9
##   `ï»¿id` Country          `1994`    `1995`   `1996` `1997` `1998` `1999` `2000`
##     <dbl> <chr>             <dbl>     <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1       1 Mexico         88314424  89969572   9.16e7 9.32e7 9.48e7 9.63e7 9.79e7
## 2       2 United States 263126000 266278000   2.69e8 2.73e8 2.76e8 2.79e8 2.82e8
## 3       3 Canada         29000663  29302311   2.96e7 2.99e7 3.02e7 3.04e7 3.07e7

nafta_long <- nafta_data %>%
  pivot_longer(cols = -c("ï»¿id", "Country"), names_to = "Year", values_to = "Population")
colnames(nafta_long)[1] <- "ID"

nafta_long

## # A tibble: 21 × 4
##       ID Country       Year  Population
##    <dbl> <chr>         <chr>      <dbl>
##  1     1 Mexico        1994    88314424
##  2     1 Mexico        1995    89969572
##  3     1 Mexico        1996    91586555
##  4     1 Mexico        1997    93183094
##  5     1 Mexico        1998    94767284
##  6     1 Mexico        1999    96334810
##  7     1 Mexico        2000    97873442
##  8     2 United States 1994   263126000
##  9     2 United States 1995   266278000
## 10     2 United States 1996   269394000
## # … with 11 more rows

Analysis of Population Growth Over Time

In his discussion board response, Nick recommended that for future analysis, the population growth rate for each year could be calculated and compared to the other two countries in the dataset.

pop_growth <- nafta_long %>%
 group_by(Country) %>%
  arrange(Year) %>%
  mutate(Population_Growth_Rate = ((Population - lag(Population))/lag(Population)) * 100) %>%
  na.omit()
pop_growth

## # A tibble: 18 × 5
## # Groups:   Country [3]
##       ID Country       Year  Population Population_Growth_Rate
##    <dbl> <chr>         <chr>      <dbl>                  <dbl>
##  1     1 Mexico        1995    89969572                  1.87 
##  2     2 United States 1995   266278000                  1.20 
##  3     3 Canada        1995    29302311                  1.04 
##  4     1 Mexico        1996    91586555                  1.80 
##  5     2 United States 1996   269394000                  1.17 
##  6     3 Canada        1996    29610218                  1.05 
##  7     1 Mexico        1997    93183094                  1.74 
##  8     2 United States 1997   272657000                  1.21 
##  9     3 Canada        1997    29905948                  0.999
## 10     1 Mexico        1998    94767284                  1.70 
## 11     2 United States 1998   275854000                  1.17 
## 12     3 Canada        1998    30155173                  0.833
## 13     1 Mexico        1999    96334810                  1.65 
## 14     2 United States 1999   279040000                  1.15 
## 15     3 Canada        1999    30401286                  0.816
## 16     1 Mexico        2000    97873442                  1.60 
## 17     2 United States 2000   282162411                  1.12 
## 18     3 Canada        2000    30685730                  0.936

pop_growth %>%
ggplot(aes(x = Year, y = Population_Growth_Rate, color = Country, linetype = "-")) +
  geom_point() + 
  labs(x = "Year", y = "Population Growth Rate", title = "Population Growth Rate from 1994-1995")

## Results Referring to the chart above, it is shown that in the years of 1994-1995, Mexico’s population has been decreasing at a faster and conistent rate while for the US, its poppulation has been consolidating slightly downward. Canada on the other hand experienced a sharp decline in their population from 1997-1998 but seemed to be reocovering back to its normanl rates by the end of 2000.