Data 607 Project 2

DATA SET 1: This data set was created by a classmate and records the points scored by three basketball players in the 2019, 2020, and 2021 seasons.

# Create the dataset
data <- data.frame(
  Player = c("LeBron James", "Stephen Curry", "Kevin Durant"),
  Team = c("Lakers", "Warriors", "Nets"),
  Season_2019_Points = c(2000, 1800, 1900),
  Season_2020_Points = c(2100, 1900, 2000),
  Season_2021_Points = c(2200, 2000, 2100)
)
# Save the dataset as a CSV file
write.csv(data, "basketball_points.csv", row.names = FALSE)

# The CSV has been uploaded to GitHub and is imported here with base R's read.csv()

github_url <- "https://raw.githubusercontent.com/NooriSelina/Data-607/main/basketball_points.csv"
basketball_data <- read.csv(github_url)
head(basketball_data)

##          Player     Team Season_2019_Points Season_2020_Points
## 1  LeBron James   Lakers               2000               2100
## 2 Stephen Curry Warriors               1800               1900
## 3  Kevin Durant     Nets               1900               2000
##   Season_2021_Points
## 1               2200
## 2               2000
## 3               2100

Next, the data is tidied with the tidyr and dplyr packages by transforming it into “long” format, so that each row holds one player, one season, and the points scored in that season. This layout makes it straightforward to summarize and compare player performance in the analysis that follows.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Convert the basketball_data to long format
long_basketball_data <- basketball_data %>%
  pivot_longer(
    cols = starts_with("Season_"),
    names_to = "Season",
    values_to = "Points"
  ) %>%
  arrange(Player, Season)

# Print the long format data
print(long_basketball_data)

## # A tibble: 9 × 4
##   Player        Team     Season             Points
##   <chr>         <chr>    <chr>               <int>
## 1 Kevin Durant  Nets     Season_2019_Points   1900
## 2 Kevin Durant  Nets     Season_2020_Points   2000
## 3 Kevin Durant  Nets     Season_2021_Points   2100
## 4 LeBron James  Lakers   Season_2019_Points   2000
## 5 LeBron James  Lakers   Season_2020_Points   2100
## 6 LeBron James  Lakers   Season_2021_Points   2200
## 7 Stephen Curry Warriors Season_2019_Points   1800
## 8 Stephen Curry Warriors Season_2020_Points   1900
## 9 Stephen Curry Warriors Season_2021_Points   2000
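The Season column above still carries the full original column names (for example `Season_2019_Points`). As a minimal sketch, assuming the same column layout, the year could be pulled out during the pivot with the `names_pattern` and `names_transform` arguments of `pivot_longer()`. The objects `basketball_sample` and `long_with_year` are illustrative names, not part of the original analysis.

```r
library(tidyr)
library(dplyr)

# Stand-in for basketball_data, rebuilt from the values shown above so the
# example runs without the GitHub file
basketball_sample <- data.frame(
  Player = c("LeBron James", "Stephen Curry", "Kevin Durant"),
  Team = c("Lakers", "Warriors", "Nets"),
  Season_2019_Points = c(2000, 1800, 1900),
  Season_2020_Points = c(2100, 1900, 2000),
  Season_2021_Points = c(2200, 2000, 2100)
)

long_with_year <- basketball_sample %>%
  pivot_longer(
    cols = starts_with("Season_"),
    names_to = "Season",
    names_pattern = "Season_([0-9]+)_Points",     # keep only the digits
    names_transform = list(Season = as.integer),  # store the year as an integer
    values_to = "Points"
  ) %>%
  arrange(Player, Season)

print(long_with_year)
```

With an integer Season column, seasons sort numerically and plot axes show plain years.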

Analysis: To compare how the players performed across seasons, we calculate the mean points for each player in each season and then plot the results. Because each player has exactly one row per season, each mean equals that season's recorded point total.

# Load necessary packages
library(dplyr)
library(ggplot2)

# Calculate mean points for each player and season
mean_points <- long_basketball_data %>%
  group_by(Player, Season) %>%
  summarize(Mean_Points = mean(Points))

## `summarise()` has grouped output by 'Player'. You can override using the
## `.groups` argument.

# Print the mean_points data frame
print(mean_points)

## # A tibble: 9 × 3
## # Groups:   Player [3]
##   Player        Season             Mean_Points
##   <chr>         <chr>                    <dbl>
## 1 Kevin Durant  Season_2019_Points        1900
## 2 Kevin Durant  Season_2020_Points        2000
## 3 Kevin Durant  Season_2021_Points        2100
## 4 LeBron James  Season_2019_Points        2000
## 5 LeBron James  Season_2020_Points        2100
## 6 LeBron James  Season_2021_Points        2200
## 7 Stephen Curry Season_2019_Points        1800
## 8 Stephen Curry Season_2020_Points        1900
## 9 Stephen Curry Season_2021_Points        2000
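Since each player and season pair has a single observation, the table above repeats the raw totals. A single ranking of the players can be obtained by averaging across seasons instead. The following is a minimal sketch under that assumption; `long_sample` and `overall_mean` are illustrative names, and the values are copied from the long-format output above.

```r
library(dplyr)

# Long-format stand-in built from the values printed above
long_sample <- data.frame(
  Player = rep(c("Kevin Durant", "LeBron James", "Stephen Curry"), each = 3),
  Season = rep(c("Season_2019_Points", "Season_2020_Points", "Season_2021_Points"), times = 3),
  Points = c(1900, 2000, 2100, 2000, 2100, 2200, 1800, 1900, 2000)
)

# One mean per player across all three seasons, highest first
overall_mean <- long_sample %>%
  group_by(Player) %>%
  summarize(Overall_Mean_Points = mean(Points), .groups = "drop") %>%
  arrange(desc(Overall_Mean_Points))

print(overall_mean)
```

Setting `.groups = "drop"` returns an ungrouped tibble and avoids the grouping message that `summarize()` otherwise prints.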

# Create a bar chart to visualize mean points by player and season
ggplot(mean_points, aes(x = Season, y = Mean_Points, fill = Player)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Mean Points by Player and Season",
    x = "Season",
    y = "Mean Points",
    fill = "Player"
  ) +
  theme_minimal()

Conclusion: The data represents the points scored by three prominent basketball players (Kevin Durant, LeBron James, and Stephen Curry) over three consecutive seasons (2019, 2020, and 2021). Each player's total rose by 100 points from one season to the next, and LeBron James had the highest total in every season, finishing with 2,200 points in 2021.

DATA SET 2: The following data set contains the populations of countries and territories for selected years from 1970 to 2022. A classmate had already posted it as a CSV file. I uploaded the CSV to GitHub and imported it via the GitHub link below.

# URL of the raw CSV file on GitHub
github_url <- "https://raw.githubusercontent.com/NooriSelina/Data-607/main/world_population.csv"

# Read the CSV file from GitHub into a data frame
world_population_data <- read.csv(github_url)

# Print the first few rows of the imported data
head(world_population_data)

##   Rank CCA3 Country.Territory          Capital Continent X2022.Population
## 1   36  AFG       Afghanistan            Kabul      Asia         41128771
## 2  138  ALB           Albania           Tirana    Europe          2842321
## 3   34  DZA           Algeria          Algiers    Africa         44903225
## 4  213  ASM    American Samoa        Pago Pago   Oceania            44273
## 5  203  AND           Andorra Andorra la Vella    Europe            79824
## 6   42  AGO            Angola           Luanda    Africa         35588987
##   X2020.Population X2015.Population X2010.Population X2000.Population
## 1         38972230         33753499         28189672         19542982
## 2          2866849          2882481          2913399          3182021
## 3         43451666         39543154         35856344         30774621
## 4            46189            51368            54849            58230
## 5            77700            71746            71519            66097
## 6         33428485         28127721         23364185         16394062
##   X1990.Population X1980.Population X1970.Population Area..km..
## 1         10694796         12486631         10752971     652230
## 2          3295066          2941651          2324731      28748
## 3         25518074         18739378         13795915    2381741
## 4            47818            32886            27075        199
## 5            53569            35611            19860        468
## 6         11828638          8330047          6029700    1246700
##   Density..per.km.. Growth.Rate World.Population.Percentage
## 1           63.0587      1.0257                        0.52
## 2           98.8702      0.9957                        0.04
## 3           18.8531      1.0164                        0.56
## 4          222.4774      0.9831                        0.00
## 5          170.5641      1.0100                        0.00
## 6           28.5466      1.0315                        0.45

Next, this wide data set is converted to long format using dplyr and tidyr.

# Load necessary packages
library(tidyr)
library(dplyr)

# Rename columns that contain special characters or a leading "X"
# (Rank, CCA3, Capital, and Continent already have clean names)
world_population_data <- world_population_data %>%
  rename(
    Country_Territory = `Country.Territory`,
    Population_2022 = `X2022.Population`,
    Population_2020 = `X2020.Population`,
    Population_2015 = `X2015.Population`,
    Population_2010 = `X2010.Population`,
    Population_2000 = `X2000.Population`,
    Population_1990 = `X1990.Population`,
    Population_1980 = `X1980.Population`,
    Population_1970 = `X1970.Population`,
    Area_km2 = `Area..km..`,
    Density_per_km2 = `Density..per.km..`,
    Growth_Rate = `Growth.Rate`,
    World_Population_Percentage = `World.Population.Percentage`
  )
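The eight population columns follow one naming pattern, so they could also be renamed in a single step rather than one by one. The following is a minimal sketch using `rename_with()` from dplyr with a regular expression. `population_sample` and `renamed_sample` are illustrative names, and the sample holds only two rows and three columns copied from the `head()` output above.

```r
library(dplyr)

# Small stand-in with the column names that read.csv() produced
population_sample <- data.frame(
  Country.Territory = c("Afghanistan", "Albania"),
  X2022.Population = c(41128771L, 2842321L),
  X2000.Population = c(19542982L, 3182021L)
)

renamed_sample <- population_sample %>%
  # Turn "X2022.Population" into "Population_2022" for every matching column
  rename_with(
    ~ sub("^X([0-9]{4})\\.Population$", "Population_\\1", .x),
    .cols = matches("^X[0-9]{4}\\.Population$")
  ) %>%
  rename(Country_Territory = Country.Territory)

names(renamed_sample)
```

The pattern-based version keeps working if further year columns are added to the file, at the cost of being less explicit than a full list of names.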

# Convert the World Population dataset to long format
long_population_data <- world_population_data %>%
  pivot_longer(
    cols = starts_with("Population_"),
    names_to = "Year",
    values_to = "Population"
  ) %>%
  arrange(Country_Territory, Year)  # Arrange by Country/Territory and Year

# Print the first few rows of the long format data
head(long_population_data)

## # A tibble: 6 × 11
##    Rank CCA3  Country_Territory Capital Continent Area_km2 Density_per_km2
##   <int> <chr> <chr>             <chr>   <chr>        <int>           <dbl>
## 1    36 AFG   Afghanistan       Kabul   Asia        652230            63.1
## 2    36 AFG   Afghanistan       Kabul   Asia        652230            63.1
## 3    36 AFG   Afghanistan       Kabul   Asia        652230            63.1
## 4    36 AFG   Afghanistan       Kabul   Asia        652230            63.1
## 5    36 AFG   Afghanistan       Kabul   Asia        652230            63.1
## 6    36 AFG   Afghanistan       Kabul   Asia        652230            63.1
## # ℹ 4 more variables: Growth_Rate <dbl>, World_Population_Percentage <dbl>,
## #   Year <chr>, Population <int>

Analysis: For this analysis, we summed the populations of all countries and territories for each year and visualized the total world population from 2000 to 2022 using a bar chart. The chart shows how the global population has changed over these years.

# Load necessary packages
library(dplyr)
library(ggplot2)

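# Calculate the total world population by year, keeping 2000 onward to match
# the period analyzed; as.numeric() prevents integer overflow in the sum
total_population_by_year <- long_population_data %>%
  filter(Year >= "Population_2000") %>%
  group_by(Year) %>%
  summarize(Total_Population = sum(as.numeric(Population)))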
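The tibble above lists Population as `<int>`, and a world total is larger than the largest value an R integer can hold (`.Machine$integer.max`, about 2.1 billion). Summing an integer vector past that limit returns `NA` with a warning, so converting to numeric before summing is the safer choice. The following is a minimal sketch of the behaviour with two made-up values chosen only to exceed the limit.

```r
# Two illustrative integer values whose sum exceeds .Machine$integer.max
x <- c(2000000000L, 2000000000L)

# Integer sum overflows: the result is NA (the warning is suppressed here)
int_total <- suppressWarnings(sum(x))

# Converting to double first gives the correct total
dbl_total <- sum(as.numeric(x))

print(int_total)
print(dbl_total)
```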

# Create a bar chart to visualize total population by year
ggplot(total_population_by_year, aes(x = Year, y = Total_Population)) +
  geom_bar(stat = "identity", fill = "navy") +
  geom_text(aes(label = Total_Population), vjust = -0.5, hjust = 0.5, size = 3) +  # Add text annotations
  labs(
    title = "Total World Population Trends (2000-2022)",
    x = "Year",
    y = "Total Population"
  ) +
  theme_minimal()

Conclusion: The analysis of world population trends from 2000 to 2022 shows a consistent pattern: the total population across all countries and territories increased continuously over this period.

DATA SET 3: The following data set records whether diabetes occurred in a group of individuals, along with characteristics such as age, glucose level, blood pressure, skin thickness, and BMI. The data set was downloaded as a CSV file, uploaded to GitHub, and imported into R.

# URL of the raw CSV file on GitHub
github_url <- "https://raw.githubusercontent.com/NooriSelina/Data-607/main/diabetes%20(1).csv"

# Read the CSV file from GitHub into a data frame
diabetes_data <- read.csv(github_url)

# Print the first few rows of the imported data
head(diabetes_data)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Next, we convert the data from wide format to long format.

library(dplyr)
library(tidyr)

long_diabetes_data <- diabetes_data %>% # Change format to long form
  pivot_longer(
    cols = -Outcome,  # Pivot every column except Outcome, which is kept as is
    names_to = "Measurement",  # Name for the new variable containing column names
    values_to = "Value"  # Name for the new variable containing the values
  )

long_diabetes_data

## # A tibble: 6,144 × 3
##    Outcome Measurement                Value
##      <int> <chr>                      <dbl>
##  1       1 Pregnancies                6    
##  2       1 Glucose                  148    
##  3       1 BloodPressure             72    
##  4       1 SkinThickness             35    
##  5       1 Insulin                    0    
##  6       1 BMI                       33.6  
##  7       1 DiabetesPedigreeFunction   0.627
##  8       1 Age                       50    
##  9       0 Pregnancies                1    
## 10       0 Glucose                   85    
## # ℹ 6,134 more rows
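In the long table above, nothing records which eight measurements belong to the same individual, so the original rows cannot be rebuilt from it. One possible remedy, sketched here under the assumption that each row of the source file is one individual, is to add a row identifier before pivoting. `Patient_ID`, `diabetes_sample`, `long_with_id`, and `wide_again` are hypothetical names, and the sample uses three rows and a subset of columns from the `head()` output above.

```r
library(dplyr)
library(tidyr)

# First three rows shown by head(diabetes_data), with a subset of columns
diabetes_sample <- data.frame(
  Glucose = c(148, 85, 183),
  BMI = c(33.6, 26.6, 23.3),
  Age = c(50, 31, 32),
  Outcome = c(1, 0, 1)
)

long_with_id <- diabetes_sample %>%
  mutate(Patient_ID = row_number()) %>%  # hypothetical identifier, not in the source data
  pivot_longer(
    cols = -c(Patient_ID, Outcome),      # pivot only the measurement columns
    names_to = "Measurement",
    values_to = "Value"
  )

# With the identifier, the long data can be pivoted back to one row per individual
wide_again <- long_with_id %>%
  pivot_wider(names_from = Measurement, values_from = Value)

print(long_with_id)
print(wide_again)
```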

Analysis: We investigated the potential relationship between Glucose levels and Age in the data set using a scatter plot. The data points are widely dispersed across the plot without forming a recognizable trend, so no clear association between Glucose and Age is visible. This assessment is visual only; no correlation coefficient or significance test was computed.

# Load necessary packages
library(dplyr)
library(ggplot2)

# Create a scatter plot to visualize the relationship between Glucose and Age
ggplot(diabetes_data, aes(x = Age, y = Glucose)) +
  geom_point(color = "skyblue") +
  labs(
    title = "Scatter Plot of Glucose vs. Age",
    x = "Age",
    y = "Glucose"
  ) +
  theme_minimal()
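A scatter plot gives only a visual impression, so a correlation coefficient could be used to put a number on the relationship. The following is a minimal sketch using base R's `cor()` and `cor.test()`. It runs on just the six rows shown by `head(diabetes_data)` so that it is self-contained, which means its result says nothing about the full data set; in practice the same calls would be applied to `diabetes_data$Age` and `diabetes_data$Glucose`. The object names are illustrative.

```r
# First six rows of Age and Glucose, copied from the head() output above
diabetes_sample <- data.frame(
  Age = c(50, 31, 32, 21, 33, 30),
  Glucose = c(148, 85, 183, 89, 137, 116)
)

# Pearson correlation coefficient between Age and Glucose
glucose_age_r <- cor(diabetes_sample$Age, diabetes_sample$Glucose, use = "complete.obs")

# Test of the null hypothesis that the true correlation is zero
glucose_age_test <- cor.test(diabetes_sample$Age, diabetes_sample$Glucose)

print(glucose_age_r)
print(glucose_age_test$p.value)
```

A coefficient near zero would support the visual impression, while a moderate coefficient with a small p-value would suggest a relationship that the scatter plot alone does not make obvious.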

Conclusion: The scatter plot shows no strong visible relationship between Glucose levels and Age, suggesting that these two variables may not be closely related in this data set. Because this conclusion rests on visual inspection alone, further investigation with a formal correlation measure and additional variables may be necessary to uncover more complex interactions contributing to diabetes onset.

2023-10-10