Our task is to read in a .csv file and tidy the dataset.
Load required libraries
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(tibble)
library(ggplot2)
library(scales)
library(kableExtra)
I copied this dataset from the website linked below and saved it as a CSV file. I then uploaded the file to GitHub so that the untidy data can be read into R. https://towardsdatascience.com/whats-tidy-data-how-to-organize-messy-datasets-in-python-with-melt-and-pivotable-functions-5d52daa996c9
Read the .csv file from a GitHub link
url <- "https://raw.githubusercontent.com/Vthomps000/DATA607_VT/master/religion.csv"
untidy <- read_csv(url)
untidy <- as_tibble(untidy)  # as.tibble() is deprecated in favor of as_tibble()
untidy
## # A tibble: 5 x 12
## X1 religion `<10k` `10-20k` `20-30k` `30-40k` `40-50k` `50-75k` `75-100k`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 Agnostic 27 34 60 81 76 137 122
## 2 1 Atheist 12 27 37 52 35 70 73
## 3 2 Buddhist 27 21 30 34 33 58 62
## 4 3 Catholic 418 617 732 670 638 1116 949
## 5 4 refused 15 14 15 11 10 35 21
## # ... with 3 more variables: `100-150k` <dbl>, `>150k` <dbl>, refused <dbl>
Began by renaming an unnecessary column
clean <- untidy %>% rename(" " = X1)
clean
## # A tibble: 5 x 12
## ` ` religion `<10k` `10-20k` `20-30k` `30-40k` `40-50k` `50-75k` `75-100k`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 Agnostic 27 34 60 81 76 137 122
## 2 1 Atheist 12 27 37 52 35 70 73
## 3 2 Buddhist 27 21 30 34 33 58 62
## 4 3 Catholic 418 617 732 670 638 1116 949
## 5 4 refused 15 14 15 11 10 35 21
## # ... with 3 more variables: `100-150k` <dbl>, `>150k` <dbl>, refused <dbl>
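Because `X1` is only a row index, an alternative to renaming it would be to drop it outright. The following is a minimal sketch of that idea, not the approach used in this report; `demo` and `demo_clean` are hypothetical names, and `demo` is a small hand-built stand-in for the first rows of `untidy`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the first two rows and three columns of `untidy`
demo <- tibble(
  X1 = c(0, 1),
  religion = c("Agnostic", "Atheist"),
  `<10k` = c(27, 12)
)

# Drop the index column instead of renaming it
demo_clean <- demo %>% select(-X1)
demo_clean
```

In this report the column disappears later anyway, because the reshaping step keeps only `religion`, `income`, and `Frequency`.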
Reshaped the data
cleaner <- clean %>%
  gather(`<10k`:`refused`, key = income, value = counts) %>%
  rename("Frequency" = counts) %>%
  select(religion, income, Frequency) %>%
  arrange(religion)
cleaner
## # A tibble: 50 x 3
## religion income Frequency
## <chr> <chr> <dbl>
## 1 Agnostic <10k 27
## 2 Agnostic 10-20k 34
## 3 Agnostic 20-30k 60
## 4 Agnostic 30-40k 81
## 5 Agnostic 40-50k 76
## 6 Agnostic 50-75k 137
## 7 Agnostic 75-100k 122
## 8 Agnostic 100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic refused 96
## # ... with 40 more rows
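In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`, which can also name the value column directly and so removes the need for a separate `rename()`. The sketch below shows what the same reshaping might look like; `wide` and `long` are hypothetical names, and `wide` is a hand-built subset of the table above rather than the real file.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in: two religions and two income brackets from the table above
wide <- tibble(
  religion = c("Agnostic", "Atheist"),
  `<10k` = c(27, 12),
  `10-20k` = c(34, 27)
)

# Turn every income column into (income, Frequency) pairs, one row per pair
long <- wide %>%
  pivot_longer(
    cols = -religion,
    names_to = "income",
    values_to = "Frequency"
  ) %>%
  arrange(religion)
long
```

Each row of `long` now holds one religion, one income bracket, and one count, which is the same shape as `cleaner`.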
I started by visualizing the whole data set. The plot showed that Catholics were by far the largest group of survey respondents, so I decided to focus on them.
target <- c("<10k", "10-20k", "20-30k", "30-40k", "40-50k", "50-75k", "75-100k", "100-150k", ">150k", "refused")
# Keep the rows whose income bracket appears in target (here, every bracket)
religion <- filter(cleaner, income %in% target)
g=ggplot(data=religion,aes(x=income,y=Frequency ,group = religion, color = religion))
g=g+geom_line(stat="identity",size=1.6)
g=g+ggtitle("Income by religion")
g=g+ylab("# of People")+xlab("Income")
g=g+theme_get()
g=g+theme(plot.title = element_text(hjust = 0.5),text=element_text(size=13))
g
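One caveat with the plot above: `income` is a character column, and ggplot2 places character values on a discrete axis in alphabetical order, so the brackets do not run from lowest to highest. A possible fix, sketched here under the assumption that the order in `target` is the intended one, is to convert `income` to a factor with explicit levels before plotting. The names `demo` and `demo_ordered` are hypothetical, and the four rows are copied from the Catholic counts in this data set.

```r
library(dplyr)
library(tibble)

# Intended bracket order, lowest to highest, with "refused" last
target <- c("<10k", "10-20k", "20-30k", "30-40k", "40-50k",
            "50-75k", "75-100k", "100-150k", ">150k", "refused")

# Hypothetical stand-in: four Catholic rows, deliberately out of order
demo <- tibble(
  religion = "Catholic",
  income = c("10-20k", "<10k", ">150k", "100-150k"),
  Frequency = c(617, 418, 633, 792)
)

# Store income as a factor whose levels follow target, not the alphabet
demo_ordered <- demo %>%
  mutate(income = factor(income, levels = target))

levels(demo_ordered$income)
```

Passing a data frame prepared this way to `ggplot()` should make the x-axis follow the level order instead of sorting the labels.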
religion2<- cleaner %>% select(religion, income, Frequency) %>%
filter(religion == "Catholic")
religion2
## # A tibble: 10 x 3
## religion income Frequency
## <chr> <chr> <dbl>
## 1 Catholic <10k 418
## 2 Catholic 10-20k 617
## 3 Catholic 20-30k 732
## 4 Catholic 30-40k 670
## 5 Catholic 40-50k 638
## 6 Catholic 50-75k 1116
## 7 Catholic 75-100k 949
## 8 Catholic 100-150k 792
## 9 Catholic >150k 633
## 10 Catholic refused 1489
summary(religion2)
## religion income Frequency
## Length:10 Length:10 Min. : 418.0
## Class :character Class :character 1st Qu.: 634.2
## Mode :character Mode :character Median : 701.0
## Mean : 805.4
## 3rd Qu.: 909.8
## Max. :1489.0
I was able to tidy the data set: each row now holds one religion, one income bracket, and one count.