DATA SET 1: This data set was created by a classmate and records the points scored by three basketball players in the 2019, 2020, and 2021 seasons.
# Create the dataset
data <- data.frame(
  Player = c("LeBron James", "Stephen Curry", "Kevin Durant"),
  Team = c("Lakers", "Warriors", "Nets"),
  Season_2019_Points = c(2000, 1800, 1900),
  Season_2020_Points = c(2100, 1900, 2000),
  Season_2021_Points = c(2200, 2000, 2100)
)
# Save the dataset as a CSV file
write.csv(data, "basketball_points.csv", row.names = FALSE)
# The CSV was uploaded to GitHub and is imported here with base R's read.csv()
github_url <- "https://raw.githubusercontent.com/NooriSelina/Data-607/main/basketball_points.csv"
basketball_data <- read.csv(github_url)
head(basketball_data)
## Player Team Season_2019_Points Season_2020_Points
## 1 LeBron James Lakers 2000 2100
## 2 Stephen Curry Warriors 1800 1900
## 3 Kevin Durant Nets 1900 2000
## Season_2021_Points
## 1 2200
## 2 2000
## 3 2100
Next, the data is tidied with the tidyr and dplyr packages by reshaping it from wide to long format, so that each row holds one player, one season, and the points scored in that season. This layout supports the per-player, per-season comparison in the analysis that follows.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Convert the basketball_data to long format
long_basketball_data <- basketball_data %>%
  pivot_longer(
    cols = starts_with("Season_"),
    names_to = "Season",
    values_to = "Points"
  ) %>%
  arrange(Player, Season)
# Print the long format data
print(long_basketball_data)
## # A tibble: 9 × 4
## Player Team Season Points
## <chr> <chr> <chr> <int>
## 1 Kevin Durant Nets Season_2019_Points 1900
## 2 Kevin Durant Nets Season_2020_Points 2000
## 3 Kevin Durant Nets Season_2021_Points 2100
## 4 LeBron James Lakers Season_2019_Points 2000
## 5 LeBron James Lakers Season_2020_Points 2100
## 6 LeBron James Lakers Season_2021_Points 2200
## 7 Stephen Curry Warriors Season_2019_Points 1800
## 8 Stephen Curry Warriors Season_2020_Points 1900
## 9 Stephen Curry Warriors Season_2021_Points 2000
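The Season column above still carries the full original column names, such as Season_2019_Points. As a minimal sketch, assuming tidyr 1.1.0 or later, pivot_longer() can capture only the year with names_pattern and convert it to an integer with names_transform. The data frame is rebuilt inline so the block runs on its own, and the name tidy_points is illustrative rather than part of the original analysis.

```r
library(tidyr)
library(dplyr)

# Same values as the classmate's data set, rebuilt inline so the sketch is self-contained
basketball_data <- data.frame(
  Player = c("LeBron James", "Stephen Curry", "Kevin Durant"),
  Team = c("Lakers", "Warriors", "Nets"),
  Season_2019_Points = c(2000, 1800, 1900),
  Season_2020_Points = c(2100, 1900, 2000),
  Season_2021_Points = c(2200, 2000, 2100)
)

tidy_points <- basketball_data %>%
  pivot_longer(
    cols = starts_with("Season_"),
    names_to = "Season",
    names_pattern = "Season_(\\d+)_Points",      # capture only the four-digit year
    names_transform = list(Season = as.integer), # store the year as an integer
    values_to = "Points"
  ) %>%
  arrange(Player, Season)

print(tidy_points)
```

A numeric Season column sorts and plots as a year rather than as a text label, which keeps axis labels short.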
Analysis: To compare how the players performed across the three seasons, we calculate the mean points for each player in each season. Because the data holds a single value per player and season, each mean equals that season's point total. A grouped bar chart then visualizes the results.
# Load necessary packages
library(dplyr)
library(ggplot2)
# Calculate mean points for each player and season
mean_points <- long_basketball_data %>%
  group_by(Player, Season) %>%
  summarize(Mean_Points = mean(Points))
## `summarise()` has grouped output by 'Player'. You can override using the
## `.groups` argument.
# Print the mean_points data frame
print(mean_points)
## # A tibble: 9 × 3
## # Groups: Player [3]
## Player Season Mean_Points
## <chr> <chr> <dbl>
## 1 Kevin Durant Season_2019_Points 1900
## 2 Kevin Durant Season_2020_Points 2000
## 3 Kevin Durant Season_2021_Points 2100
## 4 LeBron James Season_2019_Points 2000
## 5 LeBron James Season_2020_Points 2100
## 6 LeBron James Season_2021_Points 2200
## 7 Stephen Curry Season_2019_Points 1800
## 8 Stephen Curry Season_2020_Points 1900
## 9 Stephen Curry Season_2021_Points 2000
# Create a bar chart to visualize mean points by player and season
ggplot(mean_points, aes(x = Season, y = Mean_Points, fill = Player)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Mean Points by Player and Season",
    x = "Season",
    y = "Mean Points",
    fill = "Player"
  ) +
  theme_minimal()
Conclusion: The data represents the points scored by three prominent
basketball players (Kevin Durant, LeBron James, and Stephen Curry) over
three consecutive seasons (2019, 2020, and 2021). Each player's total
rose by 100 points per season, and LeBron James had the highest point
total in every season, finishing with 2,200 points in 2021.
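To rank the players across all three seasons rather than season by season, the points can be grouped by player alone. The sketch below is one way to do that, assuming the long-format values shown earlier; the data frame is typed in again so the block runs on its own, and player_summary is an illustrative name.

```r
library(dplyr)

# Long-format points, typed in from the table above so the sketch runs on its own
long_basketball_data <- data.frame(
  Player = rep(c("Kevin Durant", "LeBron James", "Stephen Curry"), each = 3),
  Season = rep(c(2019, 2020, 2021), times = 3),
  Points = c(1900, 2000, 2100, 2000, 2100, 2200, 1800, 1900, 2000)
)

player_summary <- long_basketball_data %>%
  group_by(Player) %>%
  summarize(
    Total_Points = sum(Points),   # points across all three seasons
    Mean_Points = mean(Points),   # average points per season
    .groups = "drop"
  ) %>%
  arrange(desc(Mean_Points))      # highest average first

print(player_summary)
```

Grouping by Player only collapses the three seasons into one row per player, so the means here differ from the per-season values computed earlier.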
DATA SET 2: The following data set contains country and territory populations for selected years from 1970 to 2022. The analysis below focuses on 2000 to 2022. A classmate had already posted the data as a CSV file. I uploaded the CSV to GitHub and imported it from the GitHub link below.
# URL of the raw CSV file on GitHub
github_url <- "https://raw.githubusercontent.com/NooriSelina/Data-607/main/world_population.csv"
# Read the CSV file from GitHub into a data frame
world_population_data <- read.csv(github_url)
# Print the first few rows of the imported data
head(world_population_data)
## Rank CCA3 Country.Territory Capital Continent X2022.Population
## 1 36 AFG Afghanistan Kabul Asia 41128771
## 2 138 ALB Albania Tirana Europe 2842321
## 3 34 DZA Algeria Algiers Africa 44903225
## 4 213 ASM American Samoa Pago Pago Oceania 44273
## 5 203 AND Andorra Andorra la Vella Europe 79824
## 6 42 AGO Angola Luanda Africa 35588987
## X2020.Population X2015.Population X2010.Population X2000.Population
## 1 38972230 33753499 28189672 19542982
## 2 2866849 2882481 2913399 3182021
## 3 43451666 39543154 35856344 30774621
## 4 46189 51368 54849 58230
## 5 77700 71746 71519 66097
## 6 33428485 28127721 23364185 16394062
## X1990.Population X1980.Population X1970.Population Area..km..
## 1 10694796 12486631 10752971 652230
## 2 3295066 2941651 2324731 28748
## 3 25518074 18739378 13795915 2381741
## 4 47818 32886 27075 199
## 5 53569 35611 19860 468
## 6 11828638 8330047 6029700 1246700
## Density..per.km.. Growth.Rate World.Population.Percentage
## 1 63.0587 1.0257 0.52
## 2 98.8702 0.9957 0.04
## 3 18.8531 1.0164 0.56
## 4 222.4774 0.9831 0.00
## 5 170.5641 1.0100 0.00
## 6 28.5466 1.0315 0.45
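Column names in the output above, such as X2022.Population and Country.Territory, are typical of what read.csv() produces when it converts headers into syntactically valid R names (its default, check.names = TRUE). The sketch below shows that conversion with make.names(). The raw header strings are assumptions about what the CSV file contains, not copied from it.

```r
# Column headers as they might appear in the CSV file (assumed, not copied from the file)
raw_headers <- c("Country/Territory", "2022 Population", "Growth Rate")

# make.names() replaces invalid characters with "." and prefixes "X"
# to names that start with a digit
syntactic_names <- make.names(raw_headers)

print(syntactic_names)
```

Passing check.names = FALSE to read.csv() would keep the original headers, at the cost of needing backticks around them in later code.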
Next, this wide data set is converted to long format using dplyr and tidyr.
# Load necessary packages
library(tidyr)
library(dplyr)
# Rename columns with special characters (columns not listed keep their names)
world_population_data <- world_population_data %>%
  rename(
    Country_Territory = `Country.Territory`,
    Population_2022 = `X2022.Population`,
    Population_2020 = `X2020.Population`,
    Population_2015 = `X2015.Population`,
    Population_2010 = `X2010.Population`,
    Population_2000 = `X2000.Population`,
    Population_1990 = `X1990.Population`,
    Population_1980 = `X1980.Population`,
    Population_1970 = `X1970.Population`,
    Area_km2 = `Area..km..`,
    Density_per_km2 = `Density..per.km..`,
    Growth_Rate = `Growth.Rate`,
    World_Population_Percentage = `World.Population.Percentage`
  )
# Convert the World Population dataset to long format
long_population_data <- world_population_data %>%
  pivot_longer(
    cols = starts_with("Population_"),
    names_to = "Year",
    values_to = "Population"
  ) %>%
  arrange(Country_Territory, Year) # Arrange by Country/Territory and Year
# Print the first few rows of the long format data
head(long_population_data)
## # A tibble: 6 × 11
## Rank CCA3 Country_Territory Capital Continent Area_km2 Density_per_km2
## <int> <chr> <chr> <chr> <chr> <int> <dbl>
## 1 36 AFG Afghanistan Kabul Asia 652230 63.1
## 2 36 AFG Afghanistan Kabul Asia 652230 63.1
## 3 36 AFG Afghanistan Kabul Asia 652230 63.1
## 4 36 AFG Afghanistan Kabul Asia 652230 63.1
## 5 36 AFG Afghanistan Kabul Asia 652230 63.1
## 6 36 AFG Afghanistan Kabul Asia 652230 63.1
## # ℹ 4 more variables: Growth_Rate <dbl>, World_Population_Percentage <dbl>,
## # Year <chr>, Population <int>
Analysis: For this analysis, we visualized total world population from 2000 to 2022 using a bar chart. The chart shows how the global population changed across the years sampled in the data.
# Load necessary packages
library(dplyr)
library(ggplot2)
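# Calculate the total world population by year, restricted to 2000-2022
total_population_by_year <- long_population_data %>%
  mutate(Year = sub("Population_", "", Year)) %>% # Keep only the year, e.g. "2022"
  filter(as.integer(Year) >= 2000) %>% # Match the 2000-2022 range in the chart title
  group_by(Year) %>%
  summarize(Total_Population = sum(as.numeric(Population))) # Sum as doubles: world totals exceed the 32-bit integer range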
# Create a bar chart to visualize total population by year
ggplot(total_population_by_year, aes(x = Year, y = Total_Population)) +
  geom_bar(stat = "identity", fill = "navy") +
  geom_text(aes(label = Total_Population), vjust = -0.5, hjust = 0.5, size = 3) + # Add text annotations
  labs(
    title = "Total World Population Trends (2000-2022)",
    x = "Year",
    y = "Total Population"
  ) +
  theme_minimal()
Conclusion: The analysis of world population trends from 2000 to 2022
shows a consistent pattern: the total population across all countries
rose continuously over this period.
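The size of the increase can be expressed as a percent change between the first and last year. The sketch below shows the calculation on three rows copied from the head() output earlier in this section, so the result describes only those three countries, not the world. The names sample_population, totals, and pct_change are illustrative.

```r
library(dplyr)
library(tidyr)

# Three rows copied from the head() output above; the real analysis would use every country
sample_population <- data.frame(
  Country_Territory = c("Afghanistan", "Albania", "Algeria"),
  Population_2000 = c(19542982L, 3182021L, 30774621L),
  Population_2022 = c(41128771L, 2842321L, 44903225L)
)

totals <- sample_population %>%
  pivot_longer(
    cols = starts_with("Population_"),
    names_to = "Year",
    names_prefix = "Population_", # drop the prefix so Year reads "2000" or "2022"
    values_to = "Population"
  ) %>%
  group_by(Year) %>%
  # Sum as doubles, which stay exact well beyond the 32-bit integer limit of about 2.1 billion
  summarize(Total_Population = sum(as.numeric(Population)), .groups = "drop")

# Percent change between the first and last year in the sample
pct_change <- 100 * (totals$Total_Population[totals$Year == "2022"] /
  totals$Total_Population[totals$Year == "2000"] - 1)

print(totals)
print(round(pct_change, 1))
```

Running the same steps on long_population_data would give the change for the full data set.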
DATA SET 3: The following data set records whether diabetes occurred in a group of individuals, along with characteristics such as age, skin thickness, glucose, and BMI. The data set was downloaded as a CSV file, uploaded to GitHub, and imported into R.
# URL of the raw CSV file on GitHub
github_url <- "https://raw.githubusercontent.com/NooriSelina/Data-607/main/diabetes%20(1).csv"
# Read the CSV file from GitHub into a data frame
diabetes_data <- read.csv(github_url)
# Print the first few rows of the imported data
head(diabetes_data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Next, we reshaped the data from wide to long format, keeping Outcome as its own column and stacking every other measurement into name and value pairs.
library(dplyr)
library(tidyr)
long_diabetes_data <- diabetes_data %>% # Change format to long form
  pivot_longer(
    cols = -Outcome, # Pivot every column except Outcome
    names_to = "Measurement", # Name for the new variable containing column names
    values_to = "Value" # Name for the new variable containing the values
  )
long_diabetes_data
## # A tibble: 6,144 × 3
## Outcome Measurement Value
## <int> <chr> <dbl>
## 1 1 Pregnancies 6
## 2 1 Glucose 148
## 3 1 BloodPressure 72
## 4 1 SkinThickness 35
## 5 1 Insulin 0
## 6 1 BMI 33.6
## 7 1 DiabetesPedigreeFunction 0.627
## 8 1 Age 50
## 9 0 Pregnancies 1
## 10 0 Glucose 85
## # ℹ 6,134 more rows
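In the long table above, nothing records which original row each measurement came from, so rows for different individuals with the same Outcome cannot be told apart. One possible remedy, sketched below, is to add a row identifier before pivoting. Patient_ID is a hypothetical column name that does not exist in the source data, and the sample uses only three rows and three measurement columns taken from the head() output.

```r
library(dplyr)
library(tidyr)

# First three rows from the head() output above, with a subset of the columns
diabetes_sample <- data.frame(
  Glucose = c(148, 85, 183),
  BMI = c(33.6, 26.6, 23.3),
  Age = c(50, 31, 32),
  Outcome = c(1, 0, 1)
)

long_with_id <- diabetes_sample %>%
  mutate(Patient_ID = row_number()) %>% # hypothetical identifier, one per original row
  pivot_longer(
    cols = -c(Patient_ID, Outcome),     # pivot everything except the id and the outcome
    names_to = "Measurement",
    values_to = "Value"
  )

# With the identifier in place, the long table can be widened back without ambiguity
wide_again <- long_with_id %>%
  pivot_wider(names_from = Measurement, values_from = Value)

print(long_with_id)
print(wide_again)
```

Keeping an identifier makes the reshape reversible, which matters if later steps need to compare two measurements for the same individual.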
Analysis: We examined the potential relationship between Glucose levels and Age in the data set using a scatter plot. The points are widely dispersed and do not form a recognizable trend, so the plot shows no clear visual association between Glucose and Age. No correlation coefficient was calculated, so this reading is visual only.
# Load necessary packages
library(dplyr)
library(ggplot2)
# Create a scatter plot to visualize the relationship between Glucose and Age
ggplot(diabetes_data, aes(x = Age, y = Glucose)) +
  geom_point(color = "skyblue") +
  labs(
    title = "Scatter Plot of Glucose vs. Age",
    x = "Age",
    y = "Glucose"
  ) +
  theme_minimal()
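A scatter plot gives only a visual impression, so a correlation coefficient is a natural numeric check. The sketch below shows the base R calls on the six Age and Glucose pairs visible in the head() output. Six rows are far too few to support a conclusion, so the numbers it prints are not a result for the full data set. On the full data, the same calls would take diabetes_data$Age and diabetes_data$Glucose.

```r
# Six Age and Glucose pairs copied from the head() output above.
# They only demonstrate the calls; they are not a substitute for the full data.
sample_rows <- data.frame(
  Age = c(50, 31, 32, 21, 33, 30),
  Glucose = c(148, 85, 183, 89, 137, 116)
)

# Pearson correlation coefficient, a value between -1 and 1
r <- cor(sample_rows$Age, sample_rows$Glucose)

# cor.test() reports the same coefficient together with a p-value
test_result <- cor.test(sample_rows$Age, sample_rows$Glucose)

print(r)
print(test_result$p.value)
```

A coefficient near zero would support the visual reading of the plot, while a value further from zero would suggest a linear relationship that the scatter obscures.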
Conclusion: The scatter plot did not reveal a substantial visual
relationship between Glucose levels and Age, suggesting that these two
variables may not be strongly related in this data set. Further
investigation, including a formal correlation test and additional
variables, may be necessary to uncover more complex interactions
contributing to diabetes onset.