knitr:: include_graphics("player.png")

untidy data posted by mohammed ramadan in discussion 5

One example of an untidy dataset is a table of NBA player statistics that is taken from an HTML page. The table is in “wide” format, where each row represents a player, and each column represents a statistic.

An analysis that might be performed on this data is to compare the performance of different players based on their statistics. For example, one might be interested in comparing the points scored by LeBron James and Kevin Durant, or in comparing the assists of James Harden and Giannis Antetokounmpo. However, the wide format of the table makes it difficult to perform these comparisons, since the relevant statistics are spread out across different columns.

library(RMySQL)
con <- dbConnect(MySQL(),
                 host = "localhost",
                 username = "root",
                 password = "Alex9297248844",
                 dbname = "Project2")
con <- dbGetQuery(con, "SELECT * FROM player_stats ")
print (con)
##   id                player points rebounds assists steals blocks
## 1  1          LeBron James   27.4      8.5     8.3    1.3    0.6
## 2  2          Kevin Durant   29.0      7.3     5.3    0.7    1.2
## 3  3          James Harden   25.2      6.5    11.5    1.5    0.8
## 4  4 Giannis Antetokounmpo   28.4     11.2     5.9    1.3    1.4

Tidy and transform the data

# convert the data from wide format to long format

library(tidyr)

df <- con %>%
  pivot_longer(cols = c("points", "rebounds", "assists", "steals", "blocks"),
               names_to = "statistic",
               values_to = "value")
print(df)
## # A tibble: 20 × 4
##       id player                statistic value
##    <int> <chr>                 <chr>     <dbl>
##  1     1 LeBron James          points     27.4
##  2     1 LeBron James          rebounds    8.5
##  3     1 LeBron James          assists     8.3
##  4     1 LeBron James          steals      1.3
##  5     1 LeBron James          blocks      0.6
##  6     2 Kevin Durant          points     29  
##  7     2 Kevin Durant          rebounds    7.3
##  8     2 Kevin Durant          assists     5.3
##  9     2 Kevin Durant          steals      0.7
## 10     2 Kevin Durant          blocks      1.2
## 11     3 James Harden          points     25.2
## 12     3 James Harden          rebounds    6.5
## 13     3 James Harden          assists    11.5
## 14     3 James Harden          steals      1.5
## 15     3 James Harden          blocks      0.8
## 16     4 Giannis Antetokounmpo points     28.4
## 17     4 Giannis Antetokounmpo rebounds   11.2
## 18     4 Giannis Antetokounmpo assists     5.9
## 19     4 Giannis Antetokounmpo steals      1.3
## 20     4 Giannis Antetokounmpo blocks      1.4

using dplyr to filter the data and compare the points scored by LeBron James and Kevin Duran

library(dplyr)

df %>%
  filter(player %in% c("LeBron James", "Kevin Durant"), statistic == "points") %>%
  select(player, value)
## # A tibble: 2 × 2
##   player       value
##   <chr>        <dbl>
## 1 LeBron James  27.4
## 2 Kevin Durant  29

bar charts to compare the mean values of different statistics for each player

library(ggplot2)

ggplot(df, aes(x = player, y = value, fill = statistic)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Player", y = "Value", fill = "Statistic") +
  theme_minimal()

box plots to compare the distribution of different statistics for each player

ggplot(df, aes(x = player, y = value, fill = statistic)) +
  geom_boxplot() +
  labs(x = "Player", y = "Value", fill = "Statistic") +
  theme_minimal()

# Filter the data to include only LeBron James and Kevin Durant
data_subset <- df %>%
  filter(player %in% c("LeBron James", "Kevin Durant"))

# Perform a two-sample t-test to compare the mean points scored between the two players
t.test(value ~ player, data = data_subset, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  value by player
## t = -0.073002, df = 8, p-value = 0.9436
## alternative hypothesis: true difference in means between group Kevin Durant and group LeBron James is not equal to 0
## 95 percent confidence interval:
##  -16.94587  15.90587
## sample estimates:
## mean in group Kevin Durant mean in group LeBron James 
##                       8.70                       9.22

the p-value of 0.94 suggests that there is not a significant difference in the means of the two groups, and the confidence interval (-16.95, 15.91) includes zero, which also supports this conclusion.

conculution

I compared the performance of different players based on their statistics, specifically focusing on the points scored by LeBron James and Kevin Durant. I also used visualizations to compare the mean values and distribution of different statistics for each player,also performed a two-sample t-test to determine whether there was a significant difference in the mean points scored between LeBron James and Kevin Durant.


Untidy data posted by Coco Donovan in discussion 5

sources: untidy data

knitr:: include_graphics("coco.png")

As far as analysis goes you could group by religion and see what the religious makeup of all respondents was by percentages. This is also just an idea, but it could be helpful to conduct a visual analysis with parallel bar charts to get and idea of which religion has the wealthiest followers (based solely on the results of this data)

Data cleanup
library(RMySQL)
# Connect to the database using the environment variables
coco<- dbConnect(MySQL(),
                 host = "localhost",
                 username = "root",
                 password = "Alex9297248844",
                 dbname = "Project2")
coco <- dbGetQuery(coco, "SELECT * from religion_survey")
knitr::kable(coco)
id religion lt_10k _10_20k _20_30k _30_40k _40_50k _50_75k _75_100k _100_150k gt_150k refused
1 Agnostic 27 34 60 81 76 134 122 109 84 96
2 Atheist 12 27 37 52 35 70 73 59 74 76
3 Buddhist 27 21 30 34 33 58 62 39 53 54
4 Catholic 418 617 732 670 638 1116 949 792 633 1489
5 Refused 15 14 15 11 10 35 21 17 18 116
Tidy and transform the data using tidyr and dplyr
library(tidyverse)
library(dplyr)

drop id column

coco_new <- select(coco, -id)
knitr::kable(coco_new)
religion lt_10k _10_20k _20_30k _30_40k _40_50k _50_75k _75_100k _100_150k gt_150k refused
Agnostic 27 34 60 81 76 134 122 109 84 96
Atheist 12 27 37 52 35 70 73 59 74 76
Buddhist 27 21 30 34 33 58 62 39 53 54
Catholic 418 617 732 670 638 1116 949 792 633 1489
Refused 15 14 15 11 10 35 21 17 18 116
Convert the data from wide format to long format using
coco_tidy <- coco_new %>%
  pivot_longer(cols = -religion, names_to = "income_level", values_to = "count") %>%
  mutate(income_level = gsub("income_", "", income_level)) %>%
  arrange(religion, income_level)

head (coco_tidy)
## # A tibble: 6 × 3
##   religion income_level count
##   <chr>    <chr>        <int>
## 1 Agnostic _100_150k      109
## 2 Agnostic _10_20k         34
## 3 Agnostic _20_30k         60
## 4 Agnostic _30_40k         81
## 5 Agnostic _40_50k         76
## 6 Agnostic _50_75k        134
summary(coco_tidy)
##    religion         income_level           count        
##  Length:50          Length:50          Min.   :  10.00  
##  Class :character   Class :character   1st Qu.:  27.75  
##  Mode  :character   Mode  :character   Median :  58.50  
##                                        Mean   : 201.50  
##                                        3rd Qu.: 114.25  
##                                        Max.   :1489.00

Replace all occurrences of “refused” in the income_level column with NA, creating a tidy dataset where all values are of the same type and format.

coco_tidy <- coco_tidy %>%
  mutate(income_level = replace(income_level, income_level == "refused", NA))
head(coco_tidy)
## # A tibble: 6 × 3
##   religion income_level count
##   <chr>    <chr>        <int>
## 1 Agnostic _100_150k      109
## 2 Agnostic _10_20k         34
## 3 Agnostic _20_30k         60
## 4 Agnostic _30_40k         81
## 5 Agnostic _40_50k         76
## 6 Agnostic _50_75k        134
library(dplyr)
library(ggplot2)

# group data by religion and income level, calculate percentage of respondents
data_summary <- coco_tidy %>%
  group_by(religion, income_level) %>%
  summarize(count = sum(count)) %>%
  mutate(percent = count/sum(count) * 100)
head(data_summary)
## # A tibble: 6 × 4
## # Groups:   religion [1]
##   religion income_level count percent
##   <chr>    <chr>        <int>   <dbl>
## 1 Agnostic _100_150k      109   13.2 
## 2 Agnostic _10_20k         34    4.13
## 3 Agnostic _20_30k         60    7.29
## 4 Agnostic _30_40k         81    9.84
## 5 Agnostic _40_50k         76    9.23
## 6 Agnostic _50_75k        134   16.3
# create parallel bar chart
ggplot(data_summary, aes(x = income_level, y = percent, fill = religion)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Income Distribution by Religion",
       x = "Income Level", y = "Percentage of Respondents") +
  theme_bw()

summary(data_summary)
##    religion         income_level           count            percent      
##  Length:50          Length:50          Min.   :  10.00   Min.   : 2.330  
##  Class :character   Class :character   1st Qu.:  27.75   1st Qu.: 6.581  
##  Mode  :character   Mode  :character   Median :  58.50   Median : 8.704  
##                                        Mean   : 201.50   Mean   :10.000  
##                                        3rd Qu.: 114.25   3rd Qu.:13.078  
##                                        Max.   :1489.00   Max.   :42.647
# Create the stacked bar chart
ggplot(data_summary, aes(x = religion, y = percent, fill = income_level)) +
  geom_bar(stat = "identity", position = "stack") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Wealth Distribution by Religion", x = "Religion", y = "Percentage", fill = "Income Level")


Untidy data posted by Farhana Akther in discussion 5

knitr:: include_graphics("Farhana.png")

The table above reports vote counts for two US states, California and Florida. In this table, the column names CA and FL are values of the variable state. Therefore, we can say that this table is in an untidy format.

As per analysis, we can compare the vote counts in each state vs candidate. We can also “melt” the table and transform into a long format and tidy the table so that further analysis can be done; including plotting functions, hypothesis testing functions, and modeling functions such as linear regression.

Source

Data cleanup
library(RMySQL)
Farhana<- dbConnect(MySQL(),
                 host = "localhost",
                 username = "root",
                 password = "Alex9297248844",
                 dbname = "Project2")
Farhana <- dbGetQuery(Farhana, "SELECT * FROM election_results ")
knitr::kable(Farhana)
candidate CA FL
Donald Trump 3184721 4605515
Gary Johnson 308392 206007
Hillary Clinton 5931283 4485745
Jill Stein 166311 64019

Upload requied packages

library(tidyverse)
library(dplyr)
Convert the data from wide to long format
election_data <-Farhana %>%
  pivot_longer(cols = c(CA, FL), names_to = "state", values_to = "votes")
knitr::kable(election_data)
candidate state votes
Donald Trump CA 3184721
Donald Trump FL 4605515
Gary Johnson CA 308392
Gary Johnson FL 206007
Hillary Clinton CA 5931283
Hillary Clinton FL 4485745
Jill Stein CA 166311
Jill Stein FL 64019
Analysis

Calculate the total votes for each candidate

total_votes <- election_data %>%
  group_by(candidate) %>%
  summarize(total_votes = sum(votes))

Join the total votes to the long format data frame

long_election_data <- election_data %>%
  left_join(total_votes, by = "candidate")
knitr::kable(long_election_data)
candidate state votes total_votes
Donald Trump CA 3184721 7790236
Donald Trump FL 4605515 7790236
Gary Johnson CA 308392 514399
Gary Johnson FL 206007 514399
Hillary Clinton CA 5931283 10417028
Hillary Clinton FL 4485745 10417028
Jill Stein CA 166311 230330
Jill Stein FL 64019 230330

Bar plot of the vote counts for each candidate in each state

library(ggplot2)

ggplot(long_election_data, aes(x = state, y = votes, fill = candidate)) +
  geom_col(position = "dodge") +
  labs(title = "Vote Counts by State and Candidate",
       x = "State", y = "Votes",
       fill = "Candidate")

summary(long_election_data)
##   candidate            state               votes          total_votes      
##  Length:8           Length:8           Min.   :  64019   Min.   :  230330  
##  Class :character   Class :character   1st Qu.: 196083   1st Qu.:  443382  
##  Mode  :character   Mode  :character   Median :1746556   Median : 4152318  
##                                        Mean   :2368999   Mean   : 4737998  
##                                        3rd Qu.:4515688   3rd Qu.: 8446934  
##                                        Max.   :5931283   Max.   :10417028

I will use linear regression to model the relationship between the vote counts for each candidate in California and Florida:

# Fit a linear regression model to the data
lm_model <- lm(votes ~ state + candidate, data = long_election_data)

# View the model summary
summary(lm_model)
## 
## Call:
## lm(formula = votes ~ state + candidate, data = long_election_data)
## 
## Residuals:
##       1       2       3       4       5       6       7       8 
## -739075  739075   22515  -22515  694091 -694091   22468  -22468 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)   
## (Intercept)               3923796     654792   5.992  0.00931 **
## stateFL                    -57355     585664  -0.098  0.92816   
## candidateGary Johnson    -3637918     828254  -4.392  0.02187 * 
## candidateHillary Clinton  1313396     828254   1.586  0.21098   
## candidateJill Stein      -3779953     828254  -4.564  0.01973 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 828300 on 3 degrees of freedom
## Multiple R-squared:  0.9509, Adjusted R-squared:  0.8855 
## F-statistic: 14.53 on 4 and 3 DF,  p-value: 0.02639

Based on the linear regression analysis, we can see that the coefficients for the candidate variables are all significant with p-values less than 0.05, except for the stateFL variable which is not significant. This suggests that the candidate chosen had a significant effect on the number of votes received, while the state did not have a significant effect. The R-squared value of 0.9509 indicates that the model explains a high proportion of the variance in the data, and the F-statistic of 14.53 with a p-value of 0.02639 suggests that the model is statistically significant. Overall, the analysis suggests that the choice of candidate had a significant impact on the number of votes received, while the state did not have a significant effect.