Part 1

In Part 1, you will use the data set that you found and that was approved by me as part of the Final Project Proposal. Recall that data set must contain at least two quantitative variables and at least one categorical variable. You must do the following:

1. Choose two quantitative variables and one categorical variable from the data set. Provide a short (1 – 2 sentences) description of the data set so that I know what kind of data you’re working with. You don’t need to list and describe every variable in the data set, but you should mention a few, including the three you will work with on this project. Also provide the URL where you downloaded the data from. This is the only part where you do not need to write R code.

This dataset goes over the Contracts for the NFL Players from 2019. The URL is https://www.lock5stat.com/datapage3e.html.

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
NFLContracts2019 <- read_csv("~/Documents/R code for Stats/NFLContracts2019 (1).csv")
## Rows: 1988 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Player, Position, Team
## dbl (2): TotalMoney, YearlySalary
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(NFLContracts2019)

2. Make four plots: (1) one of one of the quantitative variables (your choice as to which) only, (2) one of your categorical variable only, (3) one that allows for a comparison of the quantitative variable across the levels/categories of the categorical variable, and (4) one of your two quantitative variables against each other. Every plot must be publication quality. In other words, axis labels must be clear and capitalized (when appropriate); axis limits and breaks must be carefully chosen; additional aesthetics and corresponding legends must be present if appropriate; etc. Further, the size of the axis text and labels in your plots should be similar to the size of the text in the document.

ggplot(data = NFLContracts2019, mapping = aes(x = YearlySalary)) +
  geom_histogram(bins = 7, color = "purple", fill = "pink") +
  labs(title = "Yearly Salary for Players", x = "Yearly Salary (Millions USD)", y = "Number of Players") +
  theme(plot.title = element_text(hjust = 0.5), element_text(size = 12))

ggplot(data = NFLContracts2019, mapping = aes(x = Team, fill = Team)) +
  geom_bar(show.legend = FALSE)+ 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(title = "Amount of Players on Each Team", x = "Team", y = "Number of Players") +
  theme(plot.title = element_text(hjust = 0.5), element_text(size = 12)) 

ggplot(data = NFLContracts2019, mapping = aes(x = Team, y = TotalMoney, color = Team)) +
  geom_boxplot(show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(title = "Total Money from Contracts \n Projected for all Teams", x = "Team", y = "Total Money (Millions USD)") +
  theme(plot.title = element_text(hjust = 0.5), element_text(size = 12)) 

ggplot(data = NFLContracts2019, mapping = aes(x = YearlySalary, y = TotalMoney, color = Team)) +
  geom_point() +
  geom_smooth(color = 'purple') + 
  labs(title = "Comparitive Graph of Yearly Salary \n and Total Money per Team", x = "Yearly Salary (Millions USD)", y = "Total Money (Millions USD)") +
  theme(plot.title = element_text(hjust = 0.5),  element_text(size = 12)) 
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

3. Here you will compute three sets of summary statistics. For the quantitative variable, you must write a function that computes the mean, median, standard deviation, interquartile range, minimum, maximum, \(5^{\text{th}}\) percentile, and \(95^{\text{th}}\) percentile in the form of a named vector, where it’s clear what each value measures. Then use the function to compute the set of summary statistics for your quantitative variable. Next, for the categorical variable, compute the percentage of values that fall into each of the categories. Finally, compute the same set of summary statistics for the quantitative variable for each category of the categorical variable.

Summary_Stats <- function(x) {
  c(Mean = mean(x), Median = median(x), SD = sd(x), IQR = IQR(x), Min = min(x), Max = max(x), quantile(x, c(.05)), quantile(x, c(.95)))
}
Summary_Stats(NFLContracts2019$YearlySalary)
##      Mean    Median        SD       IQR       Min       Max        5%       95% 
##  3.033404  0.906500  4.426823  2.605000  0.488000 35.000000  0.540000 12.500000
Summary_Stats(NFLContracts2019$TotalMoney)
##      Mean    Median        SD       IQR       Min       Max        5%       95% 
##  10.55420   3.00750  19.10292   7.98025   0.49500 150.00000   0.64500  52.18265
Category_Team <- function(Team) {
  prop.table(table(Team)) * 100
}
Category_Team(NFLContracts2019$Team)
## Team
##      49ers      Bears    Bengals      Bills    Broncos     Browns Buccaneers 
##   3.219316   2.867203   3.118712   2.967807   3.370221   3.068410   3.118712 
##  Cardinals   Chargers     Chiefs      Colts    Cowboys   Dolphins     Eagles 
##   3.118712   3.269618   3.118712   3.118712   3.219316   3.370221   2.917505 
##    Falcons     Giants    Jaguars       Jets      Lions    Packers   Panthers 
##   3.219316   3.118712   3.420523   3.319920   2.917505   3.018109   3.068410 
##   Patriots    Raiders       Rams     Ravens   Redskins     Saints   Seahawks 
##   3.269618   3.118712   2.917505   3.118712   3.319920   3.018109   3.370221 
##   Steelers     Texans     Titans    Vikings 
##   2.967807   3.269618   2.867203   2.816901
Team_Stats_YearlySalary <- NFLContracts2019 %>%
  group_by(Team) %>%
  summarise(Mean = mean(YearlySalary), 
            Median = median(YearlySalary), 
            SD = sd(YearlySalary), 
            IQR = IQR(YearlySalary), 
            Min = min(YearlySalary), 
            Max = max(YearlySalary), 
            fifth = quantile(YearlySalary, c(.05)), 
            ninetyfifth = quantile(YearlySalary, c(.95)))
Team_Stats_YearlySalary
## # A tibble: 32 × 9
##    Team        Mean Median    SD   IQR   Min   Max fifth ninetyfifth
##    <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
##  1 49ers       3.41  0.950  4.82  3.90 0.495  27.5 0.525        13.1
##  2 Bears       3.65  1.34   4.61  4.77 0.495  23.5 0.552        12.4
##  3 Bengals     3.09  0.976  4.11  3.56 0.525  16.3 0.541        13.4
##  4 Bills       2.68  1.6    2.76  3.02 0.525  11.1 0.566         9.1
##  5 Broncos     2.58  0.757  4.30  1.25 0.495  22.1 0.507        11.7
##  6 Browns      3.12  0.934  4.21  2.60 0.525  18   0.54         12.3
##  7 Buccaneers  2.96  0.895  3.86  3.26 0.495  16.5 0.54         10.8
##  8 Cardinals   2.63  0.948  3.65  1.30 0.57   16.5 0.572        10.9
##  9 Chargers    2.63  0.805  3.99  2.05 0.525  20.8 0.54         11.4
## 10 Chiefs      3.05  1.06   4.47  2.16 0.495  20.8 0.526        13.9
## # ℹ 22 more rows
Team_Stats_TotalMoney <- NFLContracts2019 %>%
  group_by(Team) %>%
  summarise(Mean = mean(TotalMoney), 
            Median = median(TotalMoney), 
            SD = sd(TotalMoney), 
            IQR = IQR(TotalMoney), 
            Min = min(TotalMoney), 
            Max = max(TotalMoney), 
            fifth = quantile(TotalMoney, c(.05)), 
            ninetyfifth = quantile(TotalMoney, c(.95)))
Team_Stats_TotalMoney
## # A tibble: 32 × 9
##    Team        Mean Median    SD   IQR   Min   Max fifth ninetyfifth
##    <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
##  1 49ers      12.3    3.44  21.9 11.5  0.495 138.  0.669        45.4
##  2 Bears      13.4    3.07  22.9 15.4  0.495 141   0.645        48.6
##  3 Bengals    11.2    3.07  19.1  8.61 0.645  96   0.645        59.6
##  4 Bills       8.17   3.2   10.6  8.29 0.645  50   0.93         29.1
##  5 Broncos     8.56   2.5   18.0  5.21 0.495 114.  0.522        36.5
##  6 Browns     11.9    3.59  20.1  6.20 0.57   90   0.72         50  
##  7 Buccaneers 10.5    3.23  17.2  7.41 0.495  82.5 0.649        49.8
##  8 Cardinals   8.55   2.70  16.0  5.32 0.57   82.5 0.649        38.8
##  9 Chargers    8.19   2.68  15.4  5.02 0.57   83.2 0.677        42.8
## 10 Chiefs     10.7    3.43  18.6  4.92 0.495 104   0.806        47.9
## # ℹ 22 more rows

Part 2

In Part 2, you will use Airbnb data, located in Blackboard. Tomslee.net contains a description for Airbnb locations worldwide from a few years ago. Go to the following website: ‘https://tomslee.net/airbnb-data-collection-get-the-data’ for detailed information on the variables found in the data set. Further, each price represents the cost per night for the Airbnb (in US dollars and typically for the entire Airbnb, not per person). You must do the following:

1. Import the Airbnb data set into R, and print the structure of the object. What type of data structure is the data set stored as? How many Airbnbs are there in the data set?

library(readr)
Airbnb <- read_csv("~/Documents/R code for Stats/Airbnb.csv")
## Rows: 428 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): room_type, city, neighborhood, location
## dbl  (10): room_id, survey_id, host_id, reviews, overall_satisfaction, accom...
## lgl   (4): country, borough, bathrooms, minstay
## dttm  (1): last_modified
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Airbnb)
str(Airbnb)
## spc_tbl_ [428 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ room_id             : num [1:428] 15706636 17010199 17609700 6336901 8832942 ...
##  $ survey_id           : num [1:428] 1432 1432 1432 1432 1432 ...
##  $ host_id             : num [1:428] 9.74e+06 6.97e+07 1.20e+08 2.19e+07 4.63e+07 ...
##  $ room_type           : chr [1:428] "Shared room" "Shared room" "Shared room" "Shared room" ...
##  $ country             : logi [1:428] NA NA NA NA NA NA ...
##  $ city                : chr [1:428] "Newhaven CT" "Newhaven CT" "Newhaven CT" "Newhaven CT" ...
##  $ borough             : logi [1:428] NA NA NA NA NA NA ...
##  $ neighborhood        : chr [1:428] "New Haven" "New Haven" "West Haven" "East Haven" ...
##  $ reviews             : num [1:428] 0 4 0 10 0 16 0 1 20 6 ...
##  $ overall_satisfaction: num [1:428] 0 5 0 5 0 3 0 0 5 5 ...
##  $ accommodates        : num [1:428] 2 1 1 1 1 3 2 1 2 6 ...
##  $ bedrooms            : num [1:428] 1 1 1 1 1 1 1 1 1 2 ...
##  $ bathrooms           : logi [1:428] NA NA NA NA NA NA ...
##  $ price               : num [1:428] 120 80 75 42 43 55 50 20 31 112 ...
##  $ minstay             : logi [1:428] NA NA NA NA NA NA ...
##  $ last_modified       : POSIXct[1:428], format: "2017-07-11 01:23:18" "2017-07-11 01:23:16" ...
##  $ latitude            : num [1:428] 41.3 41.3 41.3 41.3 41.3 ...
##  $ longitude           : num [1:428] -72.9 -72.9 -72.9 -72.9 -72.9 ...
##  $ location            : chr [1:428] "0101000020E6100000213EB0E3BF3852C07632384A5EA74440" "0101000020E61000007BDAE1AFC93A52C05169C4CC3EA94440" "0101000020E6100000F6EB4E779E3C52C090DAC4C9FDA04440" "0101000020E6100000B94F8E02443852C0758F6CAE9AA34440" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   room_id = col_double(),
##   ..   survey_id = col_double(),
##   ..   host_id = col_double(),
##   ..   room_type = col_character(),
##   ..   country = col_logical(),
##   ..   city = col_character(),
##   ..   borough = col_logical(),
##   ..   neighborhood = col_character(),
##   ..   reviews = col_double(),
##   ..   overall_satisfaction = col_double(),
##   ..   accommodates = col_double(),
##   ..   bedrooms = col_double(),
##   ..   bathrooms = col_logical(),
##   ..   price = col_double(),
##   ..   minstay = col_logical(),
##   ..   last_modified = col_datetime(format = ""),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   location = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
nrow(Airbnb)
## [1] 428

2. Use the summary function you wrote in problem 3 of Part 1 to compute some summary statistics for the price per night variable. Print your results.

Summary_Stats <- function(x) {
  c(Mean = mean(x), Median = median(x), SD = sd(x), IQR = IQR(x), Min = min(x), Max = max(x), quantile(x, c(.05)), quantile(x, c(.95)))
}
Summary_Stats(Airbnb$price)
##       Mean     Median         SD        IQR        Min        Max         5% 
##   99.10514   68.50000  115.16511   53.00000   15.00000 1000.00000   30.00000 
##        95% 
##  250.00000

3. Find the mean, median, standard deviation, and interquartile range of the price per night based on the number of bedrooms in each Airbnb. Print your results.

Stats_grouped_by_room <- Airbnb %>%
  group_by(bedrooms) %>%
  summarise(Mean = mean(price), 
            Median = median(price), 
            SD = sd(price), 
            IQR = IQR(price))
Stats_grouped_by_room
## # A tibble: 6 × 5
##   bedrooms  Mean Median    SD   IQR
##      <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1        0  74.0     65  40.6    50
## 2        1  74.1     59  63.5    40
## 3        2 121.     100  69.5    75
## 4        3 230.     240 135.    150
## 5        4 434.     395 306.    300
## 6        5 313.      75 392.    465

4. Create a variable that calculates the price per person per night at each Airbnb, assuming each Airbnb is filled to capacity. Then print the following columns for the five most expensive Airbnbs in terms of the price per person per night: host_id, room_type, accommodates, and price_per_person.

Airbnb$price_per_person <- Airbnb$price/Airbnb$accommodates
Top5Expense_Airbnb <- Airbnb %>%
  select(host_id, room_type, accommodates, price_per_person) %>%
  arrange(desc(price_per_person)) 
head(Top5Expense_Airbnb, 5)
## # A tibble: 5 × 4
##     host_id room_type       accommodates price_per_person
##       <dbl> <chr>                  <dbl>            <dbl>
## 1  16063785 Entire home/apt            2             428.
## 2 103180487 Entire home/apt            4             218.
## 3  44968385 Entire home/apt            7             143.
## 4  16351157 Private room               1             130 
## 5  23986531 Entire home/apt            2             125

5. Calculate the mean, median, standard deviation, and interquartile range of the price per night based on the neighborhood in which the Airbnb is located, as well as the number of Airbnbs per neighborhood, for the Airbnbs that accommodate at least three people. Note that you can use n(), without anything inside the parentheses, inside of summarize() to calculate counts. Print your results.

Stats_by_Neighborhood <- Airbnb %>%
  filter(accommodates >= 3) %>%
  group_by(neighborhood) %>%
  summarize(
    count = n(),
    mean_price = mean(price),
    median_price = median(price),
    sd_price = sd(price),
    iqr_price = IQR(price)
  )
Stats_by_Neighborhood
## # A tibble: 6 × 6
##   neighborhood                  count mean_price median_price sd_price iqr_price
##   <chr>                         <int>      <dbl>        <dbl>    <dbl>     <dbl>
## 1 County subdivisions not defi…     1      258           258      NA          0 
## 2 East Haven                        4      276.          258.    206.       311.
## 3 Hamden                            3      201           250     126.       118.
## 4 New Haven                       150      153.          100.    157.       104.
## 5 Orange                            1      240           240      NA          0 
## 6 West Haven                        5       73.4          58      48.2       54

6. Create a new variable that classifies Airbnb prices as follows: “low” if the price per person per night is under $30, “moderate” if the price per person per night is at least $30 and under $50, “high” if the price per person per night is at least $50 and under $75, and “very high” if the price per person per night is at least $75. Print the values of your new variable (not the entire row of data) for the first 50 Airbnbs in the data set.

library(readr)
Airbnb$Price_Classification <- NA
for (i in 1:nrow(Airbnb)) {
  if (Airbnb$price_per_person[i] < 30) {
    Airbnb$Price_Classification[i] <- "low"
  } else if (Airbnb$price_per_person[i] >= 30 & Airbnb$price_per_person[i] <= 50) {
    Airbnb$Price_Classification[i] <- "moderate"
  } else if (Airbnb$price_per_person[i] > 50 & Airbnb$price_per_person[i] <= 75) {
    Airbnb$Price_Classification[i] <- "high"
  } else {
    Airbnb$Price_Classification[i] <- "very high"
  }
}
head(Airbnb$Price_Classification, 50)
##  [1] "high"      "very high" "high"      "moderate"  "moderate"  "low"      
##  [7] "low"       "low"       "low"       "low"       "high"      "low"      
## [13] "moderate"  "high"      "moderate"  "low"       "moderate"  "moderate" 
## [19] "low"       "low"       "low"       "low"       "low"       "low"      
## [25] "low"       "high"      "very high" "very high" "very high" "very high"
## [31] "very high" "high"      "high"      "very high" "very high" "moderate" 
## [37] "high"      "very high" "high"      "moderate"  "moderate"  "moderate" 
## [43] "moderate"  "very high" "moderate"  "moderate"  "moderate"  "moderate" 
## [49] "moderate"  "high"

7. Make the variable you created in problem 6 an ordered factor and add it to the data set (if you didn’t already in problem 6). Print the first 20 values of the factor.

Airbnb$Price_Classification <- factor(
  Airbnb$Price_Classification,
  levels = c("low", "moderate", "high", "very high"),
  ordered = TRUE
)
head(Airbnb$Price_Classification, 20)
##  [1] high      very high high      moderate  moderate  low       low      
##  [8] low       low       low       high      low       moderate  high     
## [15] moderate  low       moderate  moderate  low       low      
## Levels: low < moderate < high < very high

8. Create a bar chart of the ordered factor from problem 7. Your \(y\)-axis values should be percents instead of counts, and each price category must be appropriately labeled. Additionally, the plot should look professional, so you must fully polish it. Do not use multiple colors (other than black, white, and gray), as it’s unnecessary here.

ggplot(Airbnb, mapping = aes(x = Price_Classification, y = after_stat(count/sum(count)*100))) +
  geom_bar() +
  labs(title = "Percentage of Airbnbs in each \n Price Classification",
       x = "Price Classification",
       y = "Percentage(%)") +
  theme(plot.title = element_text(hjust = 0.5), element_text(size = 12))