In Part 1, you will use the data set that you found and that was approved by me as part of the Final Project Proposal. Recall that data set must contain at least two quantitative variables and at least one categorical variable. You must do the following:
1. Choose two quantitative variables and one
categorical variable from the data set. Provide a short (1 – 2
sentences) description of the data set so that I know what kind of data
you’re working with. You don’t need to list and describe every variable
in the data set, but you should mention a few, including the three you
will work with on this project. Also provide the URL where you
downloaded the data from. This is the only part where you do not need to
write R
code.
This dataset goes over the Contracts for the NFL Players from 2019. The URL is https://www.lock5stat.com/datapage3e.html.
The two quantitative variables and the one qualitative variable I chose are the following:
Total Money is a quantitative variable, and is all the money that the NFL Players are projected to make with their contract. These units are in millions of US dollars.
Yearly Salary is a quantitative variable, and represents the money that the NFL Players are making that year. These units are in millions of dollars.
Team is a categorical variable, and represents the team that each player plays on. There are 56-68 players on each team. (49ers, Bears, Bengals, Bills, Broncos, Browns, Buccaneers, Cardinals, Chargers, Chiefs, Colts, Cowboys, Dolphins, Eagles, Falcons, Giants, Jaguars, Jets, Lions, Packers, Panthers, Patriots, Raiders, Rams, Ravens, Redskins, Saints, Seahawks, Steelers, Texans, Titans, Vikings)
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
NFLContracts2019 <- read_csv("~/Documents/R code for Stats/NFLContracts2019 (1).csv")
## Rows: 1988 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Player, Position, Team
## dbl (2): TotalMoney, YearlySalary
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(NFLContracts2019)
2. Make four plots: (1) one of one of the quantitative variables (your choice as to which) only, (2) one of your categorical variable only, (3) one that allows for a comparison of the quantitative variable across the levels/categories of the categorical variable, and (4) one of your two quantitative variables against each other. Every plot must be publication quality. In other words, axis labels must be clear and capitalized (when appropriate); axis limits and breaks must be carefully chosen; additional aesthetics and corresponding legends must be present if appropriate; etc. Further, the size of the axis text and labels in your plots should be similar to the size of the text in the document.
ggplot(data = NFLContracts2019, mapping = aes(x = YearlySalary)) +
geom_histogram(bins = 7, color = "purple", fill = "pink") +
labs(title = "Yearly Salary for Players", x = "Yearly Salary (Millions USD)", y = "Number of Players") +
theme(plot.title = element_text(hjust = 0.5), element_text(size = 12))
ggplot(data = NFLContracts2019, mapping = aes(x = Team, fill = Team)) +
geom_bar(show.legend = FALSE)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
labs(title = "Amount of Players on Each Team", x = "Team", y = "Number of Players") +
theme(plot.title = element_text(hjust = 0.5), element_text(size = 12))
ggplot(data = NFLContracts2019, mapping = aes(x = Team, y = TotalMoney, color = Team)) +
geom_boxplot(show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
labs(title = "Total Money from Contracts \n Projected for all Teams", x = "Team", y = "Total Money (Millions USD)") +
theme(plot.title = element_text(hjust = 0.5), element_text(size = 12))
ggplot(data = NFLContracts2019, mapping = aes(x = YearlySalary, y = TotalMoney, color = Team)) +
geom_point() +
geom_smooth(color = 'purple') +
labs(title = "Comparitive Graph of Yearly Salary \n and Total Money per Team", x = "Yearly Salary (Millions USD)", y = "Total Money (Millions USD)") +
theme(plot.title = element_text(hjust = 0.5), element_text(size = 12))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
3. Here you will compute three sets of summary statistics. For the quantitative variable, you must write a function that computes the mean, median, standard deviation, interquartile range, minimum, maximum, \(5^{\text{th}}\) percentile, and \(95^{\text{th}}\) percentile in the form of a named vector, where it’s clear what each value measures. Then use the function to compute the set of summary statistics for your quantitative variable. Next, for the categorical variable, compute the percentage of values that fall into each of the categories. Finally, compute the same set of summary statistics for the quantitative variable for each category of the categorical variable.
Summary_Stats <- function(x) {
c(Mean = mean(x), Median = median(x), SD = sd(x), IQR = IQR(x), Min = min(x), Max = max(x), quantile(x, c(.05)), quantile(x, c(.95)))
}
Summary_Stats(NFLContracts2019$YearlySalary)
## Mean Median SD IQR Min Max 5% 95%
## 3.033404 0.906500 4.426823 2.605000 0.488000 35.000000 0.540000 12.500000
Summary_Stats(NFLContracts2019$TotalMoney)
## Mean Median SD IQR Min Max 5% 95%
## 10.55420 3.00750 19.10292 7.98025 0.49500 150.00000 0.64500 52.18265
Category_Team <- function(Team) {
prop.table(table(Team)) * 100
}
Category_Team(NFLContracts2019$Team)
## Team
## 49ers Bears Bengals Bills Broncos Browns Buccaneers
## 3.219316 2.867203 3.118712 2.967807 3.370221 3.068410 3.118712
## Cardinals Chargers Chiefs Colts Cowboys Dolphins Eagles
## 3.118712 3.269618 3.118712 3.118712 3.219316 3.370221 2.917505
## Falcons Giants Jaguars Jets Lions Packers Panthers
## 3.219316 3.118712 3.420523 3.319920 2.917505 3.018109 3.068410
## Patriots Raiders Rams Ravens Redskins Saints Seahawks
## 3.269618 3.118712 2.917505 3.118712 3.319920 3.018109 3.370221
## Steelers Texans Titans Vikings
## 2.967807 3.269618 2.867203 2.816901
Team_Stats_YearlySalary <- NFLContracts2019 %>%
group_by(Team) %>%
summarise(Mean = mean(YearlySalary),
Median = median(YearlySalary),
SD = sd(YearlySalary),
IQR = IQR(YearlySalary),
Min = min(YearlySalary),
Max = max(YearlySalary),
fifth = quantile(YearlySalary, c(.05)),
ninetyfifth = quantile(YearlySalary, c(.95)))
Team_Stats_YearlySalary
## # A tibble: 32 × 9
## Team Mean Median SD IQR Min Max fifth ninetyfifth
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 49ers 3.41 0.950 4.82 3.90 0.495 27.5 0.525 13.1
## 2 Bears 3.65 1.34 4.61 4.77 0.495 23.5 0.552 12.4
## 3 Bengals 3.09 0.976 4.11 3.56 0.525 16.3 0.541 13.4
## 4 Bills 2.68 1.6 2.76 3.02 0.525 11.1 0.566 9.1
## 5 Broncos 2.58 0.757 4.30 1.25 0.495 22.1 0.507 11.7
## 6 Browns 3.12 0.934 4.21 2.60 0.525 18 0.54 12.3
## 7 Buccaneers 2.96 0.895 3.86 3.26 0.495 16.5 0.54 10.8
## 8 Cardinals 2.63 0.948 3.65 1.30 0.57 16.5 0.572 10.9
## 9 Chargers 2.63 0.805 3.99 2.05 0.525 20.8 0.54 11.4
## 10 Chiefs 3.05 1.06 4.47 2.16 0.495 20.8 0.526 13.9
## # ℹ 22 more rows
Team_Stats_TotalMoney <- NFLContracts2019 %>%
group_by(Team) %>%
summarise(Mean = mean(TotalMoney),
Median = median(TotalMoney),
SD = sd(TotalMoney),
IQR = IQR(TotalMoney),
Min = min(TotalMoney),
Max = max(TotalMoney),
fifth = quantile(TotalMoney, c(.05)),
ninetyfifth = quantile(TotalMoney, c(.95)))
Team_Stats_TotalMoney
## # A tibble: 32 × 9
## Team Mean Median SD IQR Min Max fifth ninetyfifth
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 49ers 12.3 3.44 21.9 11.5 0.495 138. 0.669 45.4
## 2 Bears 13.4 3.07 22.9 15.4 0.495 141 0.645 48.6
## 3 Bengals 11.2 3.07 19.1 8.61 0.645 96 0.645 59.6
## 4 Bills 8.17 3.2 10.6 8.29 0.645 50 0.93 29.1
## 5 Broncos 8.56 2.5 18.0 5.21 0.495 114. 0.522 36.5
## 6 Browns 11.9 3.59 20.1 6.20 0.57 90 0.72 50
## 7 Buccaneers 10.5 3.23 17.2 7.41 0.495 82.5 0.649 49.8
## 8 Cardinals 8.55 2.70 16.0 5.32 0.57 82.5 0.649 38.8
## 9 Chargers 8.19 2.68 15.4 5.02 0.57 83.2 0.677 42.8
## 10 Chiefs 10.7 3.43 18.6 4.92 0.495 104 0.806 47.9
## # ℹ 22 more rows
In Part 2, you will use Airbnb data, located in Blackboard. Tomslee.net contains a description for Airbnb locations worldwide from a few years ago. Go to the following website: ‘https://tomslee.net/airbnb-data-collection-get-the-data’ for detailed information on the variables found in the data set. Further, each price represents the cost per night for the Airbnb (in US dollars and typically for the entire Airbnb, not per person). You must do the following:
1. Import the Airbnb data set into R
,
and print the structure of the object. What type of data structure is
the data set stored as? How many Airbnbs are there in the data set?
library(readr)
Airbnb <- read_csv("~/Documents/R code for Stats/Airbnb.csv")
## Rows: 428 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): room_type, city, neighborhood, location
## dbl (10): room_id, survey_id, host_id, reviews, overall_satisfaction, accom...
## lgl (4): country, borough, bathrooms, minstay
## dttm (1): last_modified
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Airbnb)
str(Airbnb)
## spc_tbl_ [428 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ room_id : num [1:428] 15706636 17010199 17609700 6336901 8832942 ...
## $ survey_id : num [1:428] 1432 1432 1432 1432 1432 ...
## $ host_id : num [1:428] 9.74e+06 6.97e+07 1.20e+08 2.19e+07 4.63e+07 ...
## $ room_type : chr [1:428] "Shared room" "Shared room" "Shared room" "Shared room" ...
## $ country : logi [1:428] NA NA NA NA NA NA ...
## $ city : chr [1:428] "Newhaven CT" "Newhaven CT" "Newhaven CT" "Newhaven CT" ...
## $ borough : logi [1:428] NA NA NA NA NA NA ...
## $ neighborhood : chr [1:428] "New Haven" "New Haven" "West Haven" "East Haven" ...
## $ reviews : num [1:428] 0 4 0 10 0 16 0 1 20 6 ...
## $ overall_satisfaction: num [1:428] 0 5 0 5 0 3 0 0 5 5 ...
## $ accommodates : num [1:428] 2 1 1 1 1 3 2 1 2 6 ...
## $ bedrooms : num [1:428] 1 1 1 1 1 1 1 1 1 2 ...
## $ bathrooms : logi [1:428] NA NA NA NA NA NA ...
## $ price : num [1:428] 120 80 75 42 43 55 50 20 31 112 ...
## $ minstay : logi [1:428] NA NA NA NA NA NA ...
## $ last_modified : POSIXct[1:428], format: "2017-07-11 01:23:18" "2017-07-11 01:23:16" ...
## $ latitude : num [1:428] 41.3 41.3 41.3 41.3 41.3 ...
## $ longitude : num [1:428] -72.9 -72.9 -72.9 -72.9 -72.9 ...
## $ location : chr [1:428] "0101000020E6100000213EB0E3BF3852C07632384A5EA74440" "0101000020E61000007BDAE1AFC93A52C05169C4CC3EA94440" "0101000020E6100000F6EB4E779E3C52C090DAC4C9FDA04440" "0101000020E6100000B94F8E02443852C0758F6CAE9AA34440" ...
## - attr(*, "spec")=
## .. cols(
## .. room_id = col_double(),
## .. survey_id = col_double(),
## .. host_id = col_double(),
## .. room_type = col_character(),
## .. country = col_logical(),
## .. city = col_character(),
## .. borough = col_logical(),
## .. neighborhood = col_character(),
## .. reviews = col_double(),
## .. overall_satisfaction = col_double(),
## .. accommodates = col_double(),
## .. bedrooms = col_double(),
## .. bathrooms = col_logical(),
## .. price = col_double(),
## .. minstay = col_logical(),
## .. last_modified = col_datetime(format = ""),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. location = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
nrow(Airbnb)
## [1] 428
2. Use the summary function you wrote in problem 3 of Part 1 to compute some summary statistics for the price per night variable. Print your results.
Summary_Stats <- function(x) {
c(Mean = mean(x), Median = median(x), SD = sd(x), IQR = IQR(x), Min = min(x), Max = max(x), quantile(x, c(.05)), quantile(x, c(.95)))
}
Summary_Stats(Airbnb$price)
## Mean Median SD IQR Min Max 5%
## 99.10514 68.50000 115.16511 53.00000 15.00000 1000.00000 30.00000
## 95%
## 250.00000
3. Find the mean, median, standard deviation, and interquartile range of the price per night based on the number of bedrooms in each Airbnb. Print your results.
Stats_grouped_by_room <- Airbnb %>%
group_by(bedrooms) %>%
summarise(Mean = mean(price),
Median = median(price),
SD = sd(price),
IQR = IQR(price))
Stats_grouped_by_room
## # A tibble: 6 × 5
## bedrooms Mean Median SD IQR
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 74.0 65 40.6 50
## 2 1 74.1 59 63.5 40
## 3 2 121. 100 69.5 75
## 4 3 230. 240 135. 150
## 5 4 434. 395 306. 300
## 6 5 313. 75 392. 465
4. Create a variable that calculates the price per
person per night at each Airbnb, assuming each Airbnb is filled to
capacity. Then print the following columns for the five most expensive
Airbnbs in terms of the price per person per night:
host_id
, room_type
, accommodates
,
and price_per_person
.
Airbnb$price_per_person <- Airbnb$price/Airbnb$accommodates
Top5Expense_Airbnb <- Airbnb %>%
select(host_id, room_type, accommodates, price_per_person) %>%
arrange(desc(price_per_person))
head(Top5Expense_Airbnb, 5)
## # A tibble: 5 × 4
## host_id room_type accommodates price_per_person
## <dbl> <chr> <dbl> <dbl>
## 1 16063785 Entire home/apt 2 428.
## 2 103180487 Entire home/apt 4 218.
## 3 44968385 Entire home/apt 7 143.
## 4 16351157 Private room 1 130
## 5 23986531 Entire home/apt 2 125
5. Calculate the mean, median, standard deviation,
and interquartile range of the price per night based on the neighborhood
in which the Airbnb is located, as well as the number of Airbnbs per
neighborhood, for the Airbnbs that accommodate at least three people.
Note that you can use n()
, without anything inside the
parentheses, inside of summarize()
to calculate counts.
Print your results.
Stats_by_Neighborhood <- Airbnb %>%
filter(accommodates >= 3) %>%
group_by(neighborhood) %>%
summarize(
count = n(),
mean_price = mean(price),
median_price = median(price),
sd_price = sd(price),
iqr_price = IQR(price)
)
Stats_by_Neighborhood
## # A tibble: 6 × 6
## neighborhood count mean_price median_price sd_price iqr_price
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 County subdivisions not defi… 1 258 258 NA 0
## 2 East Haven 4 276. 258. 206. 311.
## 3 Hamden 3 201 250 126. 118.
## 4 New Haven 150 153. 100. 157. 104.
## 5 Orange 1 240 240 NA 0
## 6 West Haven 5 73.4 58 48.2 54
6. Create a new variable that classifies Airbnb prices as follows: “low” if the price per person per night is under $30, “moderate” if the price per person per night is at least $30 and under $50, “high” if the price per person per night is at least $50 and under $75, and “very high” if the price per person per night is at least $75. Print the values of your new variable (not the entire row of data) for the first 50 Airbnbs in the data set.
library(readr)
Airbnb$Price_Classification <- NA
for (i in 1:nrow(Airbnb)) {
if (Airbnb$price_per_person[i] < 30) {
Airbnb$Price_Classification[i] <- "low"
} else if (Airbnb$price_per_person[i] >= 30 & Airbnb$price_per_person[i] <= 50) {
Airbnb$Price_Classification[i] <- "moderate"
} else if (Airbnb$price_per_person[i] > 50 & Airbnb$price_per_person[i] <= 75) {
Airbnb$Price_Classification[i] <- "high"
} else {
Airbnb$Price_Classification[i] <- "very high"
}
}
head(Airbnb$Price_Classification, 50)
## [1] "high" "very high" "high" "moderate" "moderate" "low"
## [7] "low" "low" "low" "low" "high" "low"
## [13] "moderate" "high" "moderate" "low" "moderate" "moderate"
## [19] "low" "low" "low" "low" "low" "low"
## [25] "low" "high" "very high" "very high" "very high" "very high"
## [31] "very high" "high" "high" "very high" "very high" "moderate"
## [37] "high" "very high" "high" "moderate" "moderate" "moderate"
## [43] "moderate" "very high" "moderate" "moderate" "moderate" "moderate"
## [49] "moderate" "high"
7. Make the variable you created in problem 6 an ordered factor and add it to the data set (if you didn’t already in problem 6). Print the first 20 values of the factor.
Airbnb$Price_Classification <- factor(
Airbnb$Price_Classification,
levels = c("low", "moderate", "high", "very high"),
ordered = TRUE
)
head(Airbnb$Price_Classification, 20)
## [1] high very high high moderate moderate low low
## [8] low low low high low moderate high
## [15] moderate low moderate moderate low low
## Levels: low < moderate < high < very high
8. Create a bar chart of the ordered factor from problem 7. Your \(y\)-axis values should be percents instead of counts, and each price category must be appropriately labeled. Additionally, the plot should look professional, so you must fully polish it. Do not use multiple colors (other than black, white, and gray), as it’s unnecessary here.
ggplot(Airbnb, mapping = aes(x = Price_Classification, y = after_stat(count/sum(count)*100))) +
geom_bar() +
labs(title = "Percentage of Airbnbs in each \n Price Classification",
x = "Price Classification",
y = "Percentage(%)") +
theme(plot.title = element_text(hjust = 0.5), element_text(size = 12))