Project 2 Instructions:

Choose any three of the “wide” data sets identified in the Week 6 Discussion items. (You may use your own data set; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

Table 1:

This Table has been taken from our classmate Coco Donovan’s discussion. The table possesses the data regarding people from different religions and their income bracket. The reason why I chose this table is to analyze how people from different Abrahamic religions i.e. Islam, Christianity and Judaism, are doing financially based on this dataset.

Creating and Transforming Table:

First of all we’ll load the data into our R studio. The link from where we can access the data was already available in the discussion courtesy of Coco Donovan.

religion_wide <- read.csv("https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/relinc.csv")
knitr::kable(head(religion_wide))
religion X.10k X10.20k X20.30k X30.40k X40.50k X50.75k X75.100k X100.150k X.150k refused
Agnostic 27 34 60 81 76 137 122 109 84 96
Atheist 12 27 37 52 35 70 73 59 74 76
Buddhist 27 21 30 34 33 58 62 39 53 54
Catholic 418 617 732 670 638 1116 949 792 633 1489
refused 15 14 15 11 10 35 21 17 18 116
Evangelical Prot 575 869 1064 982 881 1486 949 723 414 1529

So as we can see that the data is in clearly wide and untidy format. I will be using pivot_longer() from dplyr to transform the table into more analytic friendly form but before that lets fix the names of columns that contains income brackets using rename() function. We can arrange and group the final data in transformed table using arrange() and group_by() functions:

religion_long <- religion_wide|>
  rename('<10K' = X.10k, '10-20k'=X10.20k,'20-30k'=X20.30k  ,'30-40k'=X30.40k,'40-50k'=X40.50k,'50-75k'=X50.75k,'75-100k'=X75.100k,'100-150k'=X100.150k,'>150k'=X.150k)|>
  pivot_longer(cols = 2:11, names_to = "Income", values_to = "Frequency")|>
  group_by(religion)|>
  arrange(Frequency,.by_group = TRUE)
knitr::kable(head(religion_long))
religion Income Frequency
Agnostic <10K 27
Agnostic 10-20k 34
Agnostic 20-30k 60
Agnostic 40-50k 76
Agnostic 30-40k 81
Agnostic >150k 84

Analysis:

Now we can clearly see that the table has been transformed into more acceptable format we comes to analysis. Lets create a new column using mutate() function which will give us the percentage of each income bracket grouped by their religion with the help of group_by() function. Since our focus was to see how all the Abrahamic religions are doing lets focus on that and filter out the rest of the religions using filter() function.

Note: For this analysis the refused rows are also being filtered out

religion_summary <- religion_long|>
  group_by(religion)|>
  mutate(Percent = round(Frequency/sum(Frequency)*100,2))|>
  filter( religion %in% c( "Muslim" ,"Catholic", "Orthodox","Jewish","Other Christian", "Evangelical Prot"))|>
  filter(Income != "refused")
knitr::kable(head(religion_summary))
religion Income Frequency Percent
Catholic <10K 418 5.19
Catholic 10-20k 617 7.66
Catholic >150k 633 7.86
Catholic 40-50k 638 7.92
Catholic 30-40k 670 8.32
Catholic 20-30k 732 9.09

Now we got the summarized table and we can further go ahead and plot the data using ggplot and see how people from different religions are performing financially.

ggplot(data = religion_summary, aes(x=Income, y= Percent, fill=religion))+
  geom_bar(stat = "identity",position = "dodge")+theme_bw()

In the above bar graph we have plotted the percentage against income and we can see that the distributions look pretty normal. People from different religions have peaks in different income brackets but one thing to notice here and that is, that this graph could be misleading in way that we have percentage of people grouped by their religion because if look at the data set you’ll find very low frequency of people from the religion of Islam As shown in the table below

religion_comp <- religion_long|>
  group_by(religion)|>
  filter( religion %in% c( "Muslim" ,"Catholic"))|>
  spread(religion,Frequency)|>
  filter(Income != "refused")
knitr::kable(head(religion_comp))
Income Catholic Muslim
<10K 418 6
>150k 633 6
10-20k 617 7
100-150k 792 8
20-30k 732 9
30-40k 670 10

Now Islam being the second largest religion and definitely more followers than Catholic Christians (19% to 17% roughly) according to Wikipedia so the sample size should have been larger than one we have and if we plot both Islam and Catholicism against each other on a bar chart a very skewed distribution can be observed as shown in the graph below

religion_comp1 <- religion_long|>
  group_by(religion)|>
  filter( religion %in% c( "Muslim" ,"Catholic"))
ggplot(data = religion_comp1, aes(x = Income, y = Frequency, fill=religion)) +
  geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75)  +
  ylim(0,800) +
  geom_text(aes(label = Frequency), fontface = "bold", vjust = 1.5,
             position = position_dodge(.9), size = 4)+theme_bw()
## Warning: Removed 3 rows containing missing values (`geom_bar()`).
## Warning: Removed 3 rows containing missing values (`geom_text()`).

Conclusion:

If we look at the current data and decide how people are doing from different religions financially so according to the analysis it will be in accurate to use the data set above since it was acquired from christian dominated area and cannot be generalized for the rest of the world. But if you want to compare the religion percent wise then one can probably get a rough idea.

Table 2:

The table 2 in this project is taken from Waheeb Algabri’s discussion. The table contains information of three students i.e. Suzan, John and Alex, and their scores in three different subject. We will be trying to tidy up and transform the table first and then we will try to analyze the data set afterwards:

Creating and Transforming Table:

Since we don’t have any link provided to the actual table so we have to create the table in Rstudio using data.frame() function. Initially we created the columns and the stored it into columns names and after that those columns were joined into a data frame.

Name <- c("Suzan", "John", "Alex")
Age <-  c( 27, 31, 31)
Gender <- c("F", "M", "M")
Math1 <- c(90,95, 84)
Science_2 <- c(75,80,70)
English <- c(86,91,65)

mark_sheet <- data.frame(Name, Age, Gender, Math1, Science_2, English)
knitr::kable(head(mark_sheet))
Name Age Gender Math1 Science_2 English
Suzan 27 F 90 75 86
John 31 M 95 80 91
Alex 31 M 84 70 65

Now we can see that we got our desired table in wider format and we can transform it to longer as suggested by Waheeb in his discussion. This around I would try a different method to transform the table. Previously we tried pivot_longer() function and this time we will use gather() function from dplyr and try to transform the table and arrange it according but before doing that we will use the rename() function to fix the subject name and make it in accordance with the final tidy table as suggested in Waheeb’s discussion.

mark_sheet_long <- mark_sheet|>
  rename("Math"="Math1", "Science"="Science_2")|>
  gather("Subject","Score",4:6)|>
  group_by(Name)|>
  arrange(desc(Score), .by_group = TRUE)
knitr::kable(head(mark_sheet_long))
Name Age Gender Subject Score
Alex 31 M Math 84
Alex 31 M Science 70
Alex 31 M English 65
John 31 M Math 95
John 31 M English 91
John 31 M Science 80

Now the table has been tidied and transformed to the desired form and we can go ahead with analysis:

Analysis:

Since the there is not much of a data available to play with, we will try to assign grades to each student and then check out the overall percentage and grade for each student. Over here i will be using mutate() function with case_when() to create a new column that will posses the grade for each student in each subject

grade <- mark_sheet_long|>
  mutate(Subject_wise_grade = case_when(Score >= 90 & Score <= 92.9 ~ 'A-',
  Score >= 87 & Score <= 89.9 ~ 'B+', Score >=93 ~ 'A', Score >=83 & Score <= 86.9 ~'B', Score >=80 & Score <= 82.9 ~ 'B-', Score >= 77 & Score <= 79.9 ~ 'C+', Score >= 70 & Score <= 76.9 ~'C', Score <= 69.9 ~ 'F'))|>
  select(Name,Subject,Score, Subject_wise_grade)|>
  group_by(Name)|>
  arrange(Subject_wise_grade,.by_group = TRUE)
knitr::kable(grade)
Name Subject Score Subject_wise_grade
Alex Math 84 B
Alex Science 70 C
Alex English 65 F
John Math 95 A
John English 91 A-
John Science 80 B-
Suzan Math 90 A-
Suzan English 86 B
Suzan Science 75 C

Note: The grade has been assigned using the criteria of course Data 607

Now we can see the assigned grade for each subject for each student. Lets find out the overall grade and distribution of the score. We will using the same grading criteria with same mutate() function but this time we’ll aggregate the mean of score for each student using aggregate() function. The reason behind using aggregate function to expand upon different functions and how can we use them in our analysis

overall_score <- grade|>
  select(Name, Score)|>
  aggregate(.~Name,mean)|>
  mutate(Overall_grade = case_when(Score >= 90 & Score <= 92.9 ~ 'A-',
  Score >= 87 & Score <= 89.9 ~ 'B+', Score >=93 ~ 'A', Score >=83 & Score <= 86.9 ~'B', Score >=80 & Score <= 82.9 ~ 'B-', Score >= 77 & Score <= 79.9 ~ 'C+', Score >= 70 & Score <= 76.9 ~'C', Score <= 69.9 ~ 'F'))|>
  
  
  arrange(desc(Score))
knitr::kable(overall_score)
Name Score Overall_grade
John 88.66667 B+
Suzan 83.66667 B
Alex 73.00000 C

We can see the overall grade in the table above. No surprises as John bad better grades overall and he ended up on top. Conversely Alex ended at the bottom of the class. We can checkout the distribution of scores among the student using ggplot bar chart.

ggplot(data = overall_score, aes(x=Name, y=Score, fill = Name))+
  geom_bar(stat = "identity")+theme_bw()

Conclusion:

The data set that we just analyzed and transformed was pretty simple and straightforward. John the highest scorer in the class understandably got good grades and stood out while Alex ended at the bottom of the grade. Since there were only three students so we could not get an actual distribution but nevertheless something to visually represent how students stand in the class.

Table 3:

Table 3 has been taken from the discussion of our classmate Farhana Akhtar. The reason why i chose this table to see how each candidate is doing against other in CA and FL.

Creating and Transforming Table:

Since I wanted to explore more on the transformation of table and previously we have used pivot_longer() and gather() functions. Now this time around we will using a different technique to transform the table. Before that lets create the table and get on with the transformation. In order to create the table we’ll make columns and store it in their corresponding variable(column name) and then using data.frame() function we can create the data frame.

candidate <- c("Hillary Clinton", "Donald Trump","Gary Johnson","Jill Stein")
CA <- c(5931283,3184721,308392,166311)
FL <- c(4485745,4605515,206007,64019)

election <- data.frame(candidate,CA, FL)
election <- as.data.table(election)

knitr::kable((head(election)))
candidate CA FL
Hillary Clinton 5931283 4485745
Donald Trump 3184721 4605515
Gary Johnson 308392 206007
Jill Stein 166311 64019

You’ll notice that after creating the data frame I have changed it to a data table which will help me using the melt() function to transform the table. after transforming the data table we can also re arrange and group the data as needed:

election_long <- election|>
  melt(id.vars="candidate",
       measure.vars = c("CA", "FL"),
     variable.name = "state",
     value.name = "votes")|>
  group_by(candidate)|>
  arrange(desc(votes), .by_group = TRUE)
knitr::kable(head(election_long))
candidate state votes
Donald Trump FL 4605515
Donald Trump CA 3184721
Gary Johnson CA 308392
Gary Johnson FL 206007
Hillary Clinton CA 5931283
Hillary Clinton FL 4485745

We can see that the table has been transformed into a longer format and now we can carry on with our analysis:

Analysis:

This particular is relatively easy to analyze. We can find out the percent of votes that a candidate is getting in each state and then store the result in a separate column using mutate() function

election_summary <- election_long|>
  group_by(state)|>
  mutate(Percent_votes = round(votes/(sum(votes))*100,0))|>
  arrange(desc(Percent_votes),.by_group = TRUE)

We can also visualize the the percentage of votes each candidate got in CA and Fl

ggplot(election_summary, aes(x=candidate , y= Percent_votes, fill=candidate))+
  geom_bar(stat = "identity")+facet_wrap(~state)+theme_bw()+scale_x_discrete(guide = guide_axis(angle = 20))+theme(legend.position = "none")+labs(x="",y="Percent of Votes") 

We can clearly see that Hillary is leading in CA while Trump is leading in FL. We can also check who got the popular votes out of these two states.

election_popular <- election_long|>
  select(candidate,votes)|>
  aggregate(.~candidate,sum)|>
  mutate(percent_votes = round(votes/(sum(votes))*100,0))|>
  arrange(desc(votes),.by_group = TRUE)
knitr::kable(head(election_popular))
candidate votes percent_votes
Hillary Clinton 10417028 55
Donald Trump 7790236 41
Gary Johnson 514399 3
Jill Stein 230330 1

According to the table above Hillary has the lead in popular vote. We can also see that using ggplot

ggplot(election_popular, aes(x=candidate, y=votes, fill = candidate))+
  geom_bar(stat = "identity")+theme_bw()+theme(legend.position = "none")+labs(x="",y="Total Votes") 

Conclusion:

We can see that Hillary has a clear lead in the popular votes but if your looking at the data considering state as an entity then we can call it a draw between trump and Hillary since each won one state. Note: Each state has different number of representative based on their population.