Assignment Description

Our task is to read in a .csv file and tidy the dataset.

Libraries

Load required libraries

library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(tibble)
library(ggplot2)
library(scales)
library(kableExtra)

Data Import

I copied the table from the website linked below and saved it as a CSV file. I then uploaded the file to GitHub so that the untidy data can be read directly into R. https://towardsdatascience.com/whats-tidy-data-how-to-organize-messy-datasets-in-python-with-melt-and-pivotable-functions-5d52daa996c9

Read the .csv file from a GitHub link

url <- "https://raw.githubusercontent.com/Vthomps000/DATA607_VT/master/religion.csv"
untidy <- read_csv(url)
untidy <- as_tibble(untidy)  # as.tibble() is deprecated; read_csv() already returns a tibble
untidy
## # A tibble: 5 x 12
##      X1 religion `<10k` `10-20k` `20-30k` `30-40k` `40-50k` `50-75k` `75-100k`
##   <dbl> <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
## 1     0 Agnostic     27       34       60       81       76      137       122
## 2     1 Atheist      12       27       37       52       35       70        73
## 3     2 Buddhist     27       21       30       34       33       58        62
## 4     3 Catholic    418      617      732      670      638     1116       949
## 5     4 refused      15       14       15       11       10       35        21
## # ... with 3 more variables: `100-150k` <dbl>, `>150k` <dbl>, refused <dbl>

Data Evaluation

I began by renaming an unnecessary column: the unnamed row index that was read in as X1.

clean <- untidy %>% rename(" " = X1)
clean
## # A tibble: 5 x 12
##     ` ` religion `<10k` `10-20k` `20-30k` `30-40k` `40-50k` `50-75k` `75-100k`
##   <dbl> <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
## 1     0 Agnostic     27       34       60       81       76      137       122
## 2     1 Atheist      12       27       37       52       35       70        73
## 3     2 Buddhist     27       21       30       34       33       58        62
## 4     3 Catholic    418      617      732      670      638     1116       949
## 5     4 refused      15       14       15       11       10       35        21
## # ... with 3 more variables: `100-150k` <dbl>, `>150k` <dbl>, refused <dbl>
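
Since the index column is not used later, one alternative to renaming it is to drop it outright. The following is a minimal sketch rather than the method used above; `demo` and `demo_clean` are hypothetical names, and the small tibble is built by hand from the first two rows and three columns of the printed output.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `untidy`: first two rows and first three
# columns of the printed output above.
demo <- tibble(
  X1 = c(0, 1),
  religion = c("Agnostic", "Atheist"),
  `<10k` = c(27, 12)
)

# Drop the index column instead of renaming it to a blank name.
demo_clean <- demo %>% select(-X1)
demo_clean
```

With readr 2.0 or later the unnamed column is typically read in as `...1` rather than `X1`, so the name passed to `select()` would need to match whatever `read_csv()` reports.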

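I then reshaped the data from wide to long format, so that each row holds one religion, one income bracket, and one count.

cleaner <- clean %>%
  gather(`<10k`:`refused`, key = income, value = Frequency) %>%  # stack the ten bracket columns
  select(religion, income, Frequency) %>%                        # drop the blank index column
  arrange(religion)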

cleaner
## # A tibble: 50 x 3
##    religion income   Frequency
##    <chr>    <chr>        <dbl>
##  1 Agnostic <10k            27
##  2 Agnostic 10-20k          34
##  3 Agnostic 20-30k          60
##  4 Agnostic 30-40k          81
##  5 Agnostic 40-50k          76
##  6 Agnostic 50-75k         137
##  7 Agnostic 75-100k        122
##  8 Agnostic 100-150k       109
##  9 Agnostic >150k           84
## 10 Agnostic refused         96
## # ... with 40 more rows
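
The same wide-to-long reshape can also be written with `pivot_longer()`, which tidyr now recommends over `gather()`. This is a hedged sketch on a hand-built subset (two religions, three brackets, values copied from the printed output); `wide` and `long` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide table built from values visible in the printed output.
wide <- tibble(
  religion = c("Agnostic", "Catholic"),
  `<10k` = c(27, 418),
  `10-20k` = c(34, 617),
  `20-30k` = c(60, 732)
)

# Every column except `religion` holds the counts for one income bracket,
# so stack them into an income/Frequency pair of columns.
long <- wide %>%
  pivot_longer(cols = -religion, names_to = "income", values_to = "Frequency") %>%
  arrange(religion)
long
```

Selecting with `cols = -religion` avoids relying on the position of the first and last bracket columns, which the `<10k`:`refused` range depends on.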

Data Analysis & Visualization

I started by visualizing the whole data set. I found that Catholics account for most of the survey respondents in this file, so I decided to focus on them.

target <- c("<10k", "10-20k", "20-30k", "30-40k", "40-50k", "50-75k", "75-100k", "100-150k", ">150k", "refused")
religion <- filter(cleaner, income %in% target)   # keep rows whose income bracket is listed in target
g <- ggplot(data = religion, aes(x = income, y = Frequency, group = religion, color = religion))
g <- g + geom_line(stat = "identity", size = 1.6)
g <- g + ggtitle("Income by religion")
g <- g + ylab("# of People") + xlab("Income")
g <- g + theme_get()
g <- g + theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 13))
g
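
One thing to watch in a plot like this: `income` is a character column, and ggplot2 places character values on a discrete axis in sorted order, which does not match the natural order of the brackets. A possible fix, sketched here under the assumption that the bracket order listed above is the intended one, is to convert `income` to a factor with explicit levels. `bracket_order`, `demo`, and `p` are hypothetical names, and the counts are the Agnostic rows from the printed output.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Intended left-to-right order of the x axis (same values as `target`).
bracket_order <- c("<10k", "10-20k", "20-30k", "30-40k", "40-50k",
                   "50-75k", "75-100k", "100-150k", ">150k", "refused")

# Agnostic counts copied from the printed long-format output.
demo <- tibble(
  religion = "Agnostic",
  income = bracket_order,
  Frequency = c(27, 34, 60, 81, 76, 137, 122, 109, 84, 96)
) %>%
  # A factor keeps the level order given here instead of sorted order.
  mutate(income = factor(income, levels = bracket_order))

p <- ggplot(demo, aes(x = income, y = Frequency, group = religion, color = religion)) +
  geom_line() +
  labs(title = "Income by religion", x = "Income", y = "# of People")
p
```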

religion2 <- cleaner %>%
  select(religion, income, Frequency) %>%
  filter(religion == "Catholic")
religion2
## # A tibble: 10 x 3
##    religion income   Frequency
##    <chr>    <chr>        <dbl>
##  1 Catholic <10k           418
##  2 Catholic 10-20k         617
##  3 Catholic 20-30k         732
##  4 Catholic 30-40k         670
##  5 Catholic 40-50k         638
##  6 Catholic 50-75k        1116
##  7 Catholic 75-100k        949
##  8 Catholic 100-150k       792
##  9 Catholic >150k          633
## 10 Catholic refused       1489
summary(religion2)
##    religion            income            Frequency     
##  Length:10          Length:10          Min.   : 418.0  
##  Class :character   Class :character   1st Qu.: 634.2  
##  Mode  :character   Mode  :character   Median : 701.0  
##                                        Mean   : 805.4  
##                                        3rd Qu.: 909.8  
##                                        Max.   :1489.0
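
The observation that Catholics dominate the respondent counts can also be checked numerically by totaling `Frequency` per religion. The sketch below is illustrative only: it uses just the two groups whose ten counts are fully printed in this document (Agnostic and Catholic), so the `share` column is a share of those two groups, not of the whole survey. `brackets`, `long`, and `totals` are hypothetical names.

```r
library(dplyr)
library(tibble)

brackets <- c("<10k", "10-20k", "20-30k", "30-40k", "40-50k",
              "50-75k", "75-100k", "100-150k", ">150k", "refused")

# Counts copied from the printed output; only these two groups are
# shown in full, so the other religions are deliberately left out.
long <- tibble(
  religion = rep(c("Agnostic", "Catholic"), each = 10),
  income = rep(brackets, times = 2),
  Frequency = c(27, 34, 60, 81, 76, 137, 122, 109, 84, 96,
                418, 617, 732, 670, 638, 1116, 949, 792, 633, 1489)
)

# Total respondents per religion and each group's share of the subset.
totals <- long %>%
  group_by(religion) %>%
  summarise(total = sum(Frequency)) %>%
  mutate(share = total / sum(total)) %>%
  arrange(desc(total))
totals
```

Running the same pipeline on the full `cleaner` tibble would give the totals across all five groups.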

Conclusion

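I was able to tidy the data set: the ten income-bracket columns (including refused) were gathered into a single income column with a matching Frequency column, giving one row per religion and income bracket.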