Choose any three of the “wide” data sets identified in the Week 6 Discussion items. (You may use your own data set; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
This Table has been taken from our classmate Coco Donovan’s discussion. The table possesses the data regarding people from different religions and their income bracket. The reason why I chose this table is to analyze how people from different Abrahamic religions i.e. Islam, Christianity and Judaism, are doing financially based on this dataset.
First of all we’ll load the data into our R studio. The link from where we can access the data was already available in the discussion courtesy of Coco Donovan.
religion_wide <- read.csv("https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/relinc.csv")
knitr::kable(head(religion_wide))
| religion | X.10k | X10.20k | X20.30k | X30.40k | X40.50k | X50.75k | X75.100k | X100.150k | X.150k | refused |
|---|---|---|---|---|---|---|---|---|---|---|
| Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
| Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
| Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
| Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
| refused | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
| Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 | 949 | 723 | 414 | 1529 |
So as we can see that the data is in clearly wide and untidy format. I will be using pivot_longer() from dplyr to transform the table into more analytic friendly form but before that lets fix the names of columns that contains income brackets using rename() function. We can arrange and group the final data in transformed table using arrange() and group_by() functions:
religion_long <- religion_wide|>
rename('<10K' = X.10k, '10-20k'=X10.20k,'20-30k'=X20.30k ,'30-40k'=X30.40k,'40-50k'=X40.50k,'50-75k'=X50.75k,'75-100k'=X75.100k,'100-150k'=X100.150k,'>150k'=X.150k)|>
pivot_longer(cols = 2:11, names_to = "Income", values_to = "Frequency")|>
group_by(religion)|>
arrange(Frequency,.by_group = TRUE)
knitr::kable(head(religion_long))
| religion | Income | Frequency |
|---|---|---|
| Agnostic | <10K | 27 |
| Agnostic | 10-20k | 34 |
| Agnostic | 20-30k | 60 |
| Agnostic | 40-50k | 76 |
| Agnostic | 30-40k | 81 |
| Agnostic | >150k | 84 |
Now we can clearly see that the table has been transformed into more acceptable format we comes to analysis. Lets create a new column using mutate() function which will give us the percentage of each income bracket grouped by their religion with the help of group_by() function. Since our focus was to see how all the Abrahamic religions are doing lets focus on that and filter out the rest of the religions using filter() function.
Note: For this analysis the refused rows are also being filtered out
religion_summary <- religion_long|>
group_by(religion)|>
mutate(Percent = round(Frequency/sum(Frequency)*100,2))|>
filter( religion %in% c( "Muslim" ,"Catholic", "Orthodox","Jewish","Other Christian", "Evangelical Prot"))|>
filter(Income != "refused")
knitr::kable(head(religion_summary))
| religion | Income | Frequency | Percent |
|---|---|---|---|
| Catholic | <10K | 418 | 5.19 |
| Catholic | 10-20k | 617 | 7.66 |
| Catholic | >150k | 633 | 7.86 |
| Catholic | 40-50k | 638 | 7.92 |
| Catholic | 30-40k | 670 | 8.32 |
| Catholic | 20-30k | 732 | 9.09 |
Now we got the summarized table and we can further go ahead and plot the data using ggplot and see how people from different religions are performing financially.
ggplot(data = religion_summary, aes(x=Income, y= Percent, fill=religion))+
geom_bar(stat = "identity",position = "dodge")+theme_bw()
In the above bar graph we have plotted the percentage against income and we can see that the distributions look pretty normal. People from different religions have peaks in different income brackets but one thing to notice here and that is, that this graph could be misleading in way that we have percentage of people grouped by their religion because if look at the data set you’ll find very low frequency of people from the religion of Islam As shown in the table below
religion_comp <- religion_long|>
group_by(religion)|>
filter( religion %in% c( "Muslim" ,"Catholic"))|>
spread(religion,Frequency)|>
filter(Income != "refused")
knitr::kable(head(religion_comp))
| Income | Catholic | Muslim |
|---|---|---|
| <10K | 418 | 6 |
| >150k | 633 | 6 |
| 10-20k | 617 | 7 |
| 100-150k | 792 | 8 |
| 20-30k | 732 | 9 |
| 30-40k | 670 | 10 |
Now Islam being the second largest religion and definitely more followers than Catholic Christians (19% to 17% roughly) according to Wikipedia so the sample size should have been larger than one we have and if we plot both Islam and Catholicism against each other on a bar chart a very skewed distribution can be observed as shown in the graph below
religion_comp1 <- religion_long|>
group_by(religion)|>
filter( religion %in% c( "Muslim" ,"Catholic"))
ggplot(data = religion_comp1, aes(x = Income, y = Frequency, fill=religion)) +
geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75) +
ylim(0,800) +
geom_text(aes(label = Frequency), fontface = "bold", vjust = 1.5,
position = position_dodge(.9), size = 4)+theme_bw()
## Warning: Removed 3 rows containing missing values (`geom_bar()`).
## Warning: Removed 3 rows containing missing values (`geom_text()`).
If we look at the current data and decide how people are doing from different religions financially so according to the analysis it will be in accurate to use the data set above since it was acquired from christian dominated area and cannot be generalized for the rest of the world. But if you want to compare the religion percent wise then one can probably get a rough idea.
The table 2 in this project is taken from Waheeb Algabri’s discussion. The table contains information of three students i.e. Suzan, John and Alex, and their scores in three different subject. We will be trying to tidy up and transform the table first and then we will try to analyze the data set afterwards:
Since we don’t have any link provided to the actual table so we have to create the table in Rstudio using data.frame() function. Initially we created the columns and the stored it into columns names and after that those columns were joined into a data frame.
Name <- c("Suzan", "John", "Alex")
Age <- c( 27, 31, 31)
Gender <- c("F", "M", "M")
Math1 <- c(90,95, 84)
Science_2 <- c(75,80,70)
English <- c(86,91,65)
mark_sheet <- data.frame(Name, Age, Gender, Math1, Science_2, English)
knitr::kable(head(mark_sheet))
| Name | Age | Gender | Math1 | Science_2 | English |
|---|---|---|---|---|---|
| Suzan | 27 | F | 90 | 75 | 86 |
| John | 31 | M | 95 | 80 | 91 |
| Alex | 31 | M | 84 | 70 | 65 |
Now we can see that we got our desired table in wider format and we can transform it to longer as suggested by Waheeb in his discussion. This around I would try a different method to transform the table. Previously we tried pivot_longer() function and this time we will use gather() function from dplyr and try to transform the table and arrange it according but before doing that we will use the rename() function to fix the subject name and make it in accordance with the final tidy table as suggested in Waheeb’s discussion.
mark_sheet_long <- mark_sheet|>
rename("Math"="Math1", "Science"="Science_2")|>
gather("Subject","Score",4:6)|>
group_by(Name)|>
arrange(desc(Score), .by_group = TRUE)
knitr::kable(head(mark_sheet_long))
| Name | Age | Gender | Subject | Score |
|---|---|---|---|---|
| Alex | 31 | M | Math | 84 |
| Alex | 31 | M | Science | 70 |
| Alex | 31 | M | English | 65 |
| John | 31 | M | Math | 95 |
| John | 31 | M | English | 91 |
| John | 31 | M | Science | 80 |
Now the table has been tidied and transformed to the desired form and we can go ahead with analysis:
Since the there is not much of a data available to play with, we will try to assign grades to each student and then check out the overall percentage and grade for each student. Over here i will be using mutate() function with case_when() to create a new column that will posses the grade for each student in each subject
grade <- mark_sheet_long|>
mutate(Subject_wise_grade = case_when(Score >= 90 & Score <= 92.9 ~ 'A-',
Score >= 87 & Score <= 89.9 ~ 'B+', Score >=93 ~ 'A', Score >=83 & Score <= 86.9 ~'B', Score >=80 & Score <= 82.9 ~ 'B-', Score >= 77 & Score <= 79.9 ~ 'C+', Score >= 70 & Score <= 76.9 ~'C', Score <= 69.9 ~ 'F'))|>
select(Name,Subject,Score, Subject_wise_grade)|>
group_by(Name)|>
arrange(Subject_wise_grade,.by_group = TRUE)
knitr::kable(grade)
| Name | Subject | Score | Subject_wise_grade |
|---|---|---|---|
| Alex | Math | 84 | B |
| Alex | Science | 70 | C |
| Alex | English | 65 | F |
| John | Math | 95 | A |
| John | English | 91 | A- |
| John | Science | 80 | B- |
| Suzan | Math | 90 | A- |
| Suzan | English | 86 | B |
| Suzan | Science | 75 | C |
Note: The grade has been assigned using the criteria of course Data 607
Now we can see the assigned grade for each subject for each student. Lets find out the overall grade and distribution of the score. We will using the same grading criteria with same mutate() function but this time we’ll aggregate the mean of score for each student using aggregate() function. The reason behind using aggregate function to expand upon different functions and how can we use them in our analysis
overall_score <- grade|>
select(Name, Score)|>
aggregate(.~Name,mean)|>
mutate(Overall_grade = case_when(Score >= 90 & Score <= 92.9 ~ 'A-',
Score >= 87 & Score <= 89.9 ~ 'B+', Score >=93 ~ 'A', Score >=83 & Score <= 86.9 ~'B', Score >=80 & Score <= 82.9 ~ 'B-', Score >= 77 & Score <= 79.9 ~ 'C+', Score >= 70 & Score <= 76.9 ~'C', Score <= 69.9 ~ 'F'))|>
arrange(desc(Score))
knitr::kable(overall_score)
| Name | Score | Overall_grade |
|---|---|---|
| John | 88.66667 | B+ |
| Suzan | 83.66667 | B |
| Alex | 73.00000 | C |
We can see the overall grade in the table above. No surprises as John bad better grades overall and he ended up on top. Conversely Alex ended at the bottom of the class. We can checkout the distribution of scores among the student using ggplot bar chart.
ggplot(data = overall_score, aes(x=Name, y=Score, fill = Name))+
geom_bar(stat = "identity")+theme_bw()
The data set that we just analyzed and transformed was pretty simple and straightforward. John the highest scorer in the class understandably got good grades and stood out while Alex ended at the bottom of the grade. Since there were only three students so we could not get an actual distribution but nevertheless something to visually represent how students stand in the class.
Table 3 has been taken from the discussion of our classmate Farhana Akhtar. The reason why i chose this table to see how each candidate is doing against other in CA and FL.
Since I wanted to explore more on the transformation of table and previously we have used pivot_longer() and gather() functions. Now this time around we will using a different technique to transform the table. Before that lets create the table and get on with the transformation. In order to create the table we’ll make columns and store it in their corresponding variable(column name) and then using data.frame() function we can create the data frame.
candidate <- c("Hillary Clinton", "Donald Trump","Gary Johnson","Jill Stein")
CA <- c(5931283,3184721,308392,166311)
FL <- c(4485745,4605515,206007,64019)
election <- data.frame(candidate,CA, FL)
election <- as.data.table(election)
knitr::kable((head(election)))
| candidate | CA | FL |
|---|---|---|
| Hillary Clinton | 5931283 | 4485745 |
| Donald Trump | 3184721 | 4605515 |
| Gary Johnson | 308392 | 206007 |
| Jill Stein | 166311 | 64019 |
You’ll notice that after creating the data frame I have changed it to a data table which will help me using the melt() function to transform the table. after transforming the data table we can also re arrange and group the data as needed:
election_long <- election|>
melt(id.vars="candidate",
measure.vars = c("CA", "FL"),
variable.name = "state",
value.name = "votes")|>
group_by(candidate)|>
arrange(desc(votes), .by_group = TRUE)
knitr::kable(head(election_long))
| candidate | state | votes |
|---|---|---|
| Donald Trump | FL | 4605515 |
| Donald Trump | CA | 3184721 |
| Gary Johnson | CA | 308392 |
| Gary Johnson | FL | 206007 |
| Hillary Clinton | CA | 5931283 |
| Hillary Clinton | FL | 4485745 |
We can see that the table has been transformed into a longer format and now we can carry on with our analysis:
This particular is relatively easy to analyze. We can find out the percent of votes that a candidate is getting in each state and then store the result in a separate column using mutate() function
election_summary <- election_long|>
group_by(state)|>
mutate(Percent_votes = round(votes/(sum(votes))*100,0))|>
arrange(desc(Percent_votes),.by_group = TRUE)
We can also visualize the the percentage of votes each candidate got in CA and Fl
ggplot(election_summary, aes(x=candidate , y= Percent_votes, fill=candidate))+
geom_bar(stat = "identity")+facet_wrap(~state)+theme_bw()+scale_x_discrete(guide = guide_axis(angle = 20))+theme(legend.position = "none")+labs(x="",y="Percent of Votes")
We can clearly see that Hillary is leading in CA while Trump is leading in FL. We can also check who got the popular votes out of these two states.
election_popular <- election_long|>
select(candidate,votes)|>
aggregate(.~candidate,sum)|>
mutate(percent_votes = round(votes/(sum(votes))*100,0))|>
arrange(desc(votes),.by_group = TRUE)
knitr::kable(head(election_popular))
| candidate | votes | percent_votes |
|---|---|---|
| Hillary Clinton | 10417028 | 55 |
| Donald Trump | 7790236 | 41 |
| Gary Johnson | 514399 | 3 |
| Jill Stein | 230330 | 1 |
According to the table above Hillary has the lead in popular vote. We can also see that using ggplot
ggplot(election_popular, aes(x=candidate, y=votes, fill = candidate))+
geom_bar(stat = "identity")+theme_bw()+theme(legend.position = "none")+labs(x="",y="Total Votes")
We can see that Hillary has a clear lead in the popular votes but if your looking at the data considering state as an entity then we can call it a draw between trump and Hillary since each won one state. Note: Each state has different number of representative based on their population.