In this project we have been asked to find three untidy data sets which consist of Two classmates data set and one of your choosing, I will then need to tidy these data sets and perform some analysis on them. These data sets should be available as CSV files, preferably in a wide data format.
For the first data I will be analyzing my own data set and my data set is about New York State voter registration, I will be analyzing New York Sate voter registration enrollment by county as of February 27, 2024 by tidying it and produce some graph to show any relationship. That data can be found from the [New York State Board of Elections Website] (https://elections.ny.gov/enrollment-county). During the tidiness of the data set I will be removing not needed rows/columns, I will start by removing the first few rows that had a title and information about the data. The process involve imported my excel data set as a csv file then upload it to [Github], which can be found from this URL(https://raw.githubusercontent.com/jnaval88/DATA607/main/Proejct%202/DATASET1/county_feb24-revised.csv). I load the data set to R Studio then na.omit to remove any missing data or N/A data from the data set.
voterEnrollmentByCounty_NA <- read_csv('https://raw.githubusercontent.com/jnaval88/DATA607/main/Proejct%202/DATASET1/county_feb24-revised.csv' )
Data set display with missing values or N/A values.
datatable(voterEnrollmentByCounty_NA)
Cleaning the data set by removing NA values, then display.
voterEnrollmentByCounty_2024 <- voterEnrollmentByCounty_NA %>% na.omit()
datatable(voterEnrollmentByCounty_2024)
For this step I will put the data set into a tidy format for a better generalized standard use, in order to do that I will be using pivot_longer and pivot_wider.
voterEnrollmentByCounty_2024 <-
voterEnrollmentByCounty_2024 %>%
pivot_longer('DEM':'BLANK',names_to = "Party", values_to = "Number of Voters") %>%
pivot_wider(-TOTAL,names_from = STATUS, values_from = 'Number of Voters')
## Warning: Specifying the `id_cols` argument by position was deprecated in tidyr 1.3.0.
## ℹ Please explicitly name `id_cols`, like `id_cols = -TOTAL`.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
datatable(voterEnrollmentByCounty_2024)
After I finished putting the data in a standardize tidy format now I will create two new columns by mutation. These new columns will show the proportion of each party voters present by county, then I want to reintroduce this column by utilizing the inner_join function in a newly created data table to compute this number using the organized data.
totalVotersByCounty <- voterEnrollmentByCounty_2024 %>%
group_by(COUNTY) %>%
summarise(Total_Voters = sum(Active+Inactive))
voterEnrollmentByCounty_2024 <- inner_join( voterEnrollmentByCounty_2024, totalVotersByCounty, by = "COUNTY")
This is will calculate the proportions of voters in two different ways, first I will calculate the active proportion of voters by county, secondly it will calculate the inactive proportion of voters by county. After I finish calculating both proportions of voters, I will calculate the total of proportions by the active with the inactive.
voterEnrollmentByCounty_2024 <- voterEnrollmentByCounty_2024 %>%
mutate( Active_Prop = round(Active/Total_Voters,2) , Inactive_Prop = round(Inactive/Total_Voters,2) , Total_Prop = round(Total/Total_Voters,2) )
This will create subset value
voterEnrollmentByCounty_2024Subset <-
voterEnrollmentByCounty_2024 %>%
select( REGION, COUNTY, Party, Total_Prop )
datatable(voterEnrollmentByCounty_2024Subset)
This bar graph below show total voters by different party and are color coded by party.
ggplot(voterEnrollmentByCounty_2024, aes(x = Party, color = Party, fill = Party)) + geom_bar() +
xlab("Party") + ylab("Total Voters") +
ggtitle("Number of Votes by County in New York State")
This is a point graph that show total voters by party.
ggplot(voterEnrollmentByCounty_2024, aes(Party, Total_Voters, color = Party)) + geom_point() +
xlab("Party") + ylab("Total Voters") +
ggtitle("Number of votes by county in New York State")
This boxplot show Total Proportions of voters by party.
ggplot(voterEnrollmentByCounty_2024Subset, aes( x = Party , y = Total_Prop, color = Party )) +
geom_boxplot(outlier.size=2,outlier.colour="red", notch = TRUE)
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
This is a correlation test to see if there is any connection between active voters and party.
cor.test(voterEnrollmentByCounty_2024$Total_Voters, voterEnrollmentByCounty_2024$Active)
##
## Pearson's product-moment correlation
##
## data: voterEnrollmentByCounty_2024$Total_Voters and voterEnrollmentByCounty_2024$Active
## t = 13.023, df = 370, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4867066 0.6266002
## sample estimates:
## cor
## 0.5606407
This is a correlation test to see if there is any connection between inactive voters and party.
cor.test(voterEnrollmentByCounty_2024$Total_Voters, voterEnrollmentByCounty_2024$Inactive)
##
## Pearson's product-moment correlation
##
## data: voterEnrollmentByCounty_2024$Total_Voters and voterEnrollmentByCounty_2024$Inactive
## t = 12.011, df = 370, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4523220 0.5990595
## sample estimates:
## cor
## 0.5296419
By looking a the graph along with the total proportion of active voters we can conclude that New York counties doesn’t seem to lean toward DEM party eventhough it always seem to look that way. Most of the other counties lean more toward other parties witht he exception of Kings and queens.
For my second data set, I will analyze Marriage Rates in the US, this was posted by my classmate Brandon Cunningham, he posted a two links of the data, one raw data from CDC and the other from ourworldindata.org. For the purpose of this project I will be using the raw data from CDC to perform my tidiness and analysis. In order for me to be able to analyzed my classmate data I need to download the data which I download from the CDC link provided by my classmate.I took my classmate data and upload it to my [Github] which can be view from this URL
Marriage_Rates_US <- read_csv('https://raw.githubusercontent.com/jnaval88/DATA607/main/Proejct%202/DATASET2/ACSST1Y2022.S1201-2024-02-20T191328.csv') %>% na.omit()
This display the current data
datatable(head(Marriage_Rates_US))
This part clean the data by making the columns name shorter and changing them. As per the previous data contains very long names with special characters.
# Columns renaming by changing the name to a shorter name
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Total!!Estimate"] <- "Total_Estimate"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Total!!Margin of Error"] <- "Total_Margin_Error"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Now married (except separated)!!Estimate"] <- "Now_married_Estimate"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Now married (except separated)!!Margin of Error"] <- "Now_married_Margin_Error"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Widowed!!Estimate"] <- "Windowed_Estimate"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Widowed!!Margin of Error"] <- "Windowed_Margin_Error"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Divorced!!Estimate"] <- "Divorced_Estimate"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Divorced!!Margin of Error"] <- "Divorced_Marriage_Error"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Separated!!Estimate"] <- "Seperated_Estimate"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Separated!!Margin of Error"] <- "Seperated_Margin_Error"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Never married!!Estimate"] <- "Never_married_Estimate"
names(Marriage_Rates_US)[names(Marriage_Rates_US) == "United States!!Never married!!Margin of Error"] <- "Never_married_Margin_Error"
This remove the percent symbol from the dataset and convert that columns to numeric.
# removing the % sign from the dataset
Marriage_Rates_US$Now_married_Estimate = as.numeric(gsub("\\%", "", Marriage_Rates_US$Now_married_Estimate))
## Warning: NAs introduced by coercion
Marriage_Rates_US$Windowed_Estimate = as.numeric(gsub("\\%", "", Marriage_Rates_US$Windowed_Estimate))
## Warning: NAs introduced by coercion
Marriage_Rates_US$Divorced_Estimate = as.numeric(gsub("\\%", "", Marriage_Rates_US$Divorced_Estimate,))
## Warning: NAs introduced by coercion
Marriage_Rates_US$Seperated_Estimate = as.numeric(gsub("\\%", "", Marriage_Rates_US$Seperated_Estimate))
## Warning: NAs introduced by coercion
Marriage_Rates_US$Never_married_Estimate = as.numeric(gsub("\\%", "", Marriage_Rates_US$Never_married_Estimate))
## Warning: NAs introduced by coercion
In the dataset all the columns name with Margin Error have my symbol, in order for me to be able to utilizing those columns I tidy them by removing them.
Marriage_Rates_US <- Marriage_Rates_US %>% mutate( Total_Margin_Error = str_replace(str_extract(Total_Margin_Error,"\\d+\\D\\d+"),",",""))
Marriage_Rates_US <- Marriage_Rates_US |> mutate( Total_Estimate = str_replace(str_extract(Total_Estimate,"\\d+\\D\\d+"),",",""))
Marriage_Rates_US <- Marriage_Rates_US %>% mutate( Now_married_Margin_Error = str_replace(str_extract(Now_married_Margin_Error,"\\d+\\D\\d+"),",",""))
Marriage_Rates_US <- Marriage_Rates_US %>% mutate( Windowed_Margin_Error = str_replace(str_extract(Windowed_Margin_Error,"\\d+\\D\\d+"),",",""))
Marriage_Rates_US <- Marriage_Rates_US %>% mutate( Divorced_Marriage_Error = str_replace(str_extract(Divorced_Marriage_Error,"\\d+\\D\\d+"),",",""))
Marriage_Rates_US <- Marriage_Rates_US %>% mutate( Never_married_Margin_Error = str_replace(str_extract(Never_married_Margin_Error,"\\d+\\D\\d+"),",",""))
Marriage_Rates_US <- Marriage_Rates_US %>% mutate( Seperated_Margin_Error = str_replace(str_extract(Seperated_Margin_Error,"\\d+\\D\\d+"),",",""))
This display the dataset with names changing, special characters,
datatable(head(Marriage_Rates_US))
This re-arrange the column then I performed pivot_longer to the part of the dataset I will be using.
# re-arranging the column
Marriage_Rates_US_new <- Marriage_Rates_US[, c(1, 2, 4, 6, 8, 10, 12, 3, 5, 7, 9, 11, 13)]
Marriage_Rates_US_new <-
Marriage_Rates_US_new %>%
pivot_longer('Now_married_Estimate':'Never_married_Estimate', names_to = "Married_Status", values_to = "Percentage_of_Estimate")
datatable(Marriage_Rates_US_new)
ggplot(Marriage_Rates_US_new, aes(y = Total_Estimate, x = Percentage_of_Estimate)) +
geom_boxplot()
## Warning: Removed 10 rows containing non-finite values (`stat_boxplot()`).
For my third I choose GDP by Country by Year will I will be analyzing, and this data was obtained from {worldbank] (https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?most_recent_year_desc=true). I upload this dataset to my Github account which can be found in this URL. (https://raw.githubusercontent.com/jnaval88/DATA607/main/Proejct%202/DATASET3/GDP-Percent-Growth-by-Country-by-Year.csv)
gdpByCountryByYear <- read_csv('https://raw.githubusercontent.com/jnaval88/DATA607/main/Proejct%202/DATASET3/GDP-Percent-Growth-by-Country-by-Year.csv')
Data display
datatable(head(gdpByCountryByYear))
I will now format the data neatly using the pivot_longer function. Furthermore, I will only take into account New York State schools for the purposes of this assignment, and I will utilize the filter function to only show these.
gdpByCountryByYear <-
gdpByCountryByYear %>%
pivot_longer('1960':'2019',names_to = "year", values_to = "GDP Growth Percent") %>%
select ( -starts_with('Indicator'))
datatable(head(gdpByCountryByYear, n= 100L) )
Having organized the data, I will now conduct some analysis on it. Using the group_by function to group the data by country and the summarise function to extract the relevant statistics, I will start by reviewing some summary statistics for the data. In order to move the nations with the greatest average growth rates to the top of the new data frame, I will also utilize the {arrange} function.
StatsGDPbyCounty <- gdpByCountryByYear %>%
group_by(`Country Name`) %>%
summarise(mean = round(mean(`GDP Growth Percent`,na.rm=T),2), sd = round(sd(`GDP Growth Percent`,na.rm=T),2) , sum = round(sum(`GDP Growth Percent`,na.rm=T),2)) %>%
arrange(desc(mean))
datatable(StatsGDPbyCounty)
This
# mean Graph
ggplot(StatsGDPbyCounty, aes(x =mean)) +
geom_histogram( na.rm = T) +
geom_vline(xintercept = mean(StatsGDPbyCounty$mean, na.rm = T) , col = "red", size = 1.5) +
ggtitle("StatsGDPbyCounty")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#sum Graph
ggplot(StatsGDPbyCounty, aes(x =sum)) +
geom_histogram( na.rm = T) +
geom_vline(xintercept = sum(StatsGDPbyCounty$sum, na.rm = T) , col = "red", size = 1.5) +
ggtitle("StatsGDPbyCounty")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#SD Graph
ggplot(StatsGDPbyCounty, aes(x =sd)) +
geom_histogram( na.rm = T) +
geom_vline(xintercept = sd(StatsGDPbyCounty$sd,na.rm = T) , col = "red", size = 1.5) +
ggtitle("StatsGDPbyCounty")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cor.test(StatsGDPbyCounty$sd, StatsGDPbyCounty$mean)
##
## Pearson's product-moment correlation
##
## data: StatsGDPbyCounty$sd and StatsGDPbyCounty$mean
## t = 3.6523, df = 256, p-value = 0.0003151
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1032256 0.3355505
## sample estimates:
## cor
## 0.2225452
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 12 rows containing missing values (`geom_point()`).
model1 <- lm(sd ~ mean, data = StatsGDPbyCounty)
summary(model1)
##
## Call:
## lm(formula = sd ~ mean, data = StatsGDPbyCounty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3648 -2.1954 -0.7326 0.8763 27.4118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2962 0.4973 6.628 2e-10 ***
## mean 0.4344 0.1189 3.652 0.000315 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.563 on 256 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.04953, Adjusted R-squared: 0.04581
## F-statistic: 13.34 on 1 and 256 DF, p-value: 0.0003151
model2 <- lm(log(sd) ~ log(mean), data = StatsGDPbyCounty)
## Warning in log(mean): NaNs produced
summary(model2)
##
## Call:
## lm(formula = log(sd) ~ log(mean), data = StatsGDPbyCounty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48773 -0.42175 0.00623 0.33034 2.01132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.20516 0.09113 13.224 <2e-16 ***
## log(mean) 0.16809 0.06761 2.486 0.0136 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5302 on 250 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.02413, Adjusted R-squared: 0.02023
## F-statistic: 6.181 on 1 and 250 DF, p-value: 0.01356
I ran multiple analysis to see which country have the highest gdp, and I also test to see if there is a correlation between the mean and the country gpd. After performing these test it seems that Equatorial Guinea rank number 1 as the country with the highest average increase in GDP from 1960-2019, follow by Bosnia and Herzegovina. Before running these analysis my assumption was never about either this two countries.
Enrollment by county. (2024, February 27). Retrieved February 28th, 2024, from https://www.elections.ny.gov/EnrollmentCounty.html
Marriage Rates in US: Data Source:https://data.census.gov/table/ACSST1Y2022.S1201?q=family&t=Families%20and%20Living%20Arrangements
GDP by Country by Year: https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?most_recent_year_desc=true