Week 4 Group Lab

BUS 32100

Due 11:59PM April 19 2023

Name: Florencia Palacios Collaborated with: Gabriel Neira

You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file on Canvas.

In some cases, a incomplete code is given to guide you in the right direction. However, you will need to fill in the blanks in order to run the code block. Make sure to fill all the blanks, or comment them out, before you knit your R markdown file. Otherwise, it will return knitting errors.

Rio Olympics data set revisited

Load data

rio = read.csv("https://raw.githubusercontent.com/BUSN32100/data_files/master/rio.csv")
rio %>% 
  count(nationality) %>% 
  arrange(desc(n))
##     nationality   n
## 1           USA 567
## 2           BRA 485
## 3           GER 441
## 4           AUS 431
## 5           FRA 410
## 6           CHN 404
## 7           GBR 374
## 8           JPN 346
## 9           CAN 321
## 10          ESP 313
## 11          ITA 312
## 12          RUS 286
## 13          NED 249
## 14          POL 242
## 15          ARG 223
## 16          KOR 213
## 17          NZL 208
## 18          UKR 205
## 19          SWE 164
## 20          COL 154
## 21          HUN 154
## 22          RSA 146
## 23          DEN 128
## 24          MEX 126
## 25          BLR 124
## 26          CUB 123
## 27          IND 123
## 28          EGY 122
## 29          BEL 108
## 30          CZE 104
## 31          SUI 104
## 32          KAZ 103
## 33          SRB 103
## 34          TUR 103
## 35          ROU  98
## 36          POR  95
## 37          GRE  93
## 38          CRO  88
## 39          VEN  88
## 40          IRL  80
## 41          KEN  80
## 42          NGR  78
## 43          AUT  71
## 44          UZB  70
## 45          ALG  68
## 46          LTU  67
## 47          IRI  64
## 48          SLO  63
## 49          NOR  62
## 50          TUN  61
## 51          JAM  57
## 52          AZE  56
## 53          TPE  56
## 54          FIJ  54
## 55          FIN  54
## 56          THA  54
## 57          SVK  51
## 58          BUL  50
## 59          MAR  49
## 60          ISR  47
## 61          EST  46
## 62          MGL  43
## 63          CHI  42
## 64          GEO  40
## 65          PUR  40
## 66          QAT  39
## 67          ECU  38
## 68          ETH  38
## 69          HKG  38
## 70          MNE  35
## 71          ZIM  35
## 72          BRN  34
## 73          ARM  32
## 74          LAT  32
## 75          MAS  32
## 76          TTO  32
## 77          PRK  31
## 78          BAH  30
## 79          HON  30
## 80          DOM  29
## 81          PER  29
## 82          INA  28
## 83          ANG  26
## 84          IRQ  26
## 85          SIN  25
## 86          CMR  24
## 87          MDA  23
## 88          VIE  23
## 89          SEN  22
## 90          GUA  21
## 91          UGA  21
## 92          KGZ  19
## 93          URU  17
## 94          CYP  16
## 95          GHA  16
## 96          PHI  13
## 97          UAE  13
## 98          BOL  12
## 99          BOT  12
## 100         CIV  12
## 101         ERI  12
## 102         BAR  11
## 103         BIH  11
## 104         CRC  11
## 105         KSA  11
## 106         MRI  11
## 107         PAR  11
## 108         CGO  10
## 109         HAI  10
## 110         LUX  10
## 111         NAM  10
## 112         PAN  10
## 113         ROT  10
## 114         SEY  10
## 115         ANT   9
## 116         BDI   9
## 117         COK   9
## 118         IOA   9
## 119         LIB   9
## 120         SRI   9
## 121         TKM   9
## 122         BER   8
## 123         ESA   8
## 124         ISL   8
## 125         JOR   8
## 126         KOS   8
## 127         LES   8
## 128         PNG   8
## 129         SAM   8
## 130         ARU   7
## 131         BAN   7
## 132         DJI   7
## 133         GRN   7
## 134         ISV   7
## 135         LBA   7
## 136         MLT   7
## 137         MYA   7
## 138         NEP   7
## 139         PAK   7
## 140         RWA   7
## 141         SKN   7
## 142         SYR   7
## 143         TAN   7
## 144         TGA   7
## 145         TJK   7
## 146         ZAM   7
## 147         ALB   6
## 148         BEN   6
## 149         CAF   6
## 150         CAM   6
## 151         GAB   6
## 152         GUY   6
## 153         LAO   6
## 154         MAD   6
## 155         MKD   6
## 156         MLI   6
## 157         MOZ   6
## 158         NIG   6
## 159         PLE   6
## 160         SUD   6
## 161         SUR   6
## 162         AND   5
## 163         BUR   5
## 164         CAY   5
## 165         CPV   5
## 166         FSM   5
## 167         GBS   5
## 168         GUI   5
## 169         GUM   5
## 170         LCA   5
## 171         MAW   5
## 172         MHL   5
## 173         NCA   5
## 174         PLW   5
## 175         SMR   5
## 176         TOG   5
## 177         ASA   4
## 178         COD   4
## 179         COM   4
## 180         GAM   4
## 181         IVB   4
## 182         MDV   4
## 183         OMA   4
## 184         SLE   4
## 185         VAN   4
## 186         VIN   4
## 187         AFG   3
## 188         BIZ   3
## 189         BRU   3
## 190         KIR   3
## 191         LIE   3
## 192         MON   3
## 193         SOL   3
## 194         SSD   3
## 195         STP   3
## 196         TLS   3
## 197         YEM   3
## 198         BHU   2
## 199         CHA   2
## 200         DMA   2
## 201         GEQ   2
## 202         LBR   2
## 203         MTN   2
## 204         NRU   2
## 205         SOM   2
## 206         SWZ   2
## 207         TUV   1

Hint We talk about using group_by and summarise together to create a summary table in lecture. If you only use summarise function (without using group_by), you can create a dataframe with one row to summarise the data.)

rio %>% 
  summarize(sum(gold),sum(bronze),sum(silver))
##   sum(gold) sum(bronze) sum(silver)
## 1       666         704         655
  rio=rio %>% 
  rowwise() %>% 
  mutate(total=sum(c(gold,bronze,silver))) %>% 
  ungroup()
rio %>% 
  mutate(age=2016-year_of_birth) 
## # A tibble: 11,538 × 14
##           id name  natio…¹ sex   date_…² height weight sport  gold silver bronze
##        <int> <chr> <chr>   <chr> <chr>    <dbl>  <int> <chr> <int>  <int>  <int>
##  1 736041664 A Je… ESP     male  1969-1…   1.72     64 athl…     0      0      0
##  2 532037425 A La… KOR     fema… 1986-0…   1.68     56 fenc…     0      0      0
##  3 435962603 Aaro… CAN     male  1992-0…   1.98     79 athl…     0      0      1
##  4 521041435 Aaro… MDA     male  1991-0…   1.83     80 taek…     0      0      0
##  5  33922579 Aaro… NZL     male  1990-1…   1.81     71 cycl…     0      0      0
##  6 173071782 Aaro… AUS     male  1990-0…   1.8      67 tria…     0      0      0
##  7 266237702 Aaro… USA     male  1993-0…   2.05     98 voll…     0      0      1
##  8 382571888 Aaro… AUS     male  1991-0…   1.93    100 aqua…     0      0      0
##  9  87689776 Aaur… ESP     fema… 1988-1…   1.8      62 athl…     0      0      0
## 10 997877719 Abab… ETH     fema… 1991-0…   1.65     54 athl…     0      0      0
## # … with 11,528 more rows, 3 more variables: year_of_birth <int>, total <int>,
## #   age <dbl>, and abbreviated variable names ¹​nationality, ²​date_of_birth
rio$age = 2016 - rio$year_of_birth
rio %>% 
  group_by(sport) %>% 
  summarise(min = min(age),max = max(age)) %>% 
  arrange(min)
## # A tibble: 28 × 3
##    sport               min   max
##    <chr>             <dbl> <dbl>
##  1 aquatics             14    41
##  2 table tennis         15    54
##  3 athletics            16    47
##  4 fencing              16    42
##  5 football             16    41
##  6 gymnastics           16    41
##  7 shooting             16    56
##  8 weightlifting        16    41
##  9 archery              17    44
## 10 modern pentathlon    17    37
## # … with 18 more rows

Hint n() counts the total number of rows in each group, slice(c(1,n())) will return the first and last row

rio %>% 
  group_by(sport) %>% 
  summarise(n_participants=n(),n_gold=sum(gold),gold_ratio=n_gold/n_participants) %>% 
  arrange(gold_ratio) %>% 
  
  slice(c(which.min(gold_ratio),which.max(gold_ratio)))
## # A tibble: 2 × 4
##   sport      n_participants n_gold gold_ratio
##   <chr>               <int>  <int>      <dbl>
## 1 golf                  120      2     0.0167
## 2 gymnastics            324     30     0.0926
 #The sport with the lowest gold ratio is golf, and the sport with the highest gymnastics

Bike rentals in DC

Data

bike = read_csv("https://raw.githubusercontent.com/BUSN32100/data_files/master/bikeshare-day.csv")

The data include daily bike rental counts (by members and casual users) of Capital Bikeshare in Washington, DC in 2011 and 2012 as well as weather information on these days.

Source: UCI Machine Learning Repository - Bike Sharing Dataset The original data sources are http://capitalbikeshare.com/system-data and http://www.freemeteo.com.

The codebook is below:

Variable name Description
instant record index
dteday date
season season (1:winter, 2:spring, 3:summer, 4:fall)
yr year (0: 2011, 1:2012)
mnth month (1 to 12)
holiday weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday day of the week
workingday if day is neither weekend nor holiday is 1, otherwise is 0.
weathersit 1: Clear, Few clouds, Partly cloudy, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp Normalized temperature in Celsius. The values are divided by 41 (max)
atemp Normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum Normalized humidity. The values are divided by 100 (max)
windspeed Normalized wind speed. The values are divided by 67 (max)
casual Count of casual users
registered Count of registered users
cnt Count of total rental bikes including both casual and registered

Data wrangling

bike$dteday=as.Date(bike$dteday)

bike %>% 
  group_by(dteday,weekday,holiday) %>% 
  summarise(total_rents=sum(cnt)) %>% 
  arrange(desc(total_rents))
## `summarise()` has grouped output by 'dteday', 'weekday'. You can override using
## the `.groups` argument.
## # A tibble: 731 × 4
## # Groups:   dteday, weekday [731]
##    dteday     weekday holiday total_rents
##    <date>       <dbl>   <dbl>       <dbl>
##  1 2012-09-15       6       0        8714
##  2 2012-09-29       6       0        8555
##  3 2012-09-22       6       0        8395
##  4 2012-03-23       5       0        8362
##  5 2012-05-19       6       0        8294
##  6 2012-09-09       0       0        8227
##  7 2012-07-25       3       0        8173
##  8 2012-09-21       5       0        8167
##  9 2012-10-05       5       0        8156
## 10 2012-06-02       6       0        8120
## # … with 721 more rows
bike %>% 
 filter(season==2) %>%
  group_by(dteday,weekday,holiday) %>% 
  summarise(total_rents=sum(cnt)) %>% 
  arrange(desc(total_rents)) %>% 
  head(3)
## `summarise()` has grouped output by 'dteday', 'weekday'. You can override using
## the `.groups` argument.
## # A tibble: 3 × 4
## # Groups:   dteday, weekday [3]
##   dteday     weekday holiday total_rents
##   <date>       <dbl>   <dbl>       <dbl>
## 1 2012-03-23       5       0        8362
## 2 2012-05-19       6       0        8294
## 3 2012-06-02       6       0        8120
bike %>% 
  filter(yr==0 & holiday==1) %>%
  group_by(dteday) %>% 
  summarise(dteday,total_rents=sum(cnt),weathersit,temp) %>% 
  arrange(desc(total_rents))
## # A tibble: 10 × 4
##    dteday     total_rents weathersit  temp
##    <date>           <dbl>      <dbl> <dbl>
##  1 2011-07-04        6043          2 0.727
##  2 2011-10-10        5117          1 0.571
##  3 2011-05-30        4098          1 0.733
##  4 2011-11-11        3368          1 0.324
##  5 2011-09-05        3351          2 0.673
##  6 2011-04-15        3126          1 0.447
##  7 2011-11-24        1495          1 0.373
##  8 2011-12-26        1317          1 0.322
##  9 2011-02-21        1107          2 0.303
## 10 2011-01-17        1000          2 0.176
bike %>% 
  mutate(season=case_when(season==1~"winter",season==2~"spring",season==3~"summer",season==4~"fall")) 
## # A tibble: 731 × 16
##    instant dteday     season    yr  mnth holiday weekday working…¹ weath…²  temp
##      <dbl> <date>     <chr>  <dbl> <dbl>   <dbl>   <dbl>     <dbl>   <dbl> <dbl>
##  1       1 2011-01-01 winter     0     1       0       6         0       2 0.344
##  2       2 2011-01-02 winter     0     1       0       0         0       2 0.363
##  3       3 2011-01-03 winter     0     1       0       1         1       1 0.196
##  4       4 2011-01-04 winter     0     1       0       2         1       1 0.2  
##  5       5 2011-01-05 winter     0     1       0       3         1       1 0.227
##  6       6 2011-01-06 winter     0     1       0       4         1       1 0.204
##  7       7 2011-01-07 winter     0     1       0       5         1       2 0.197
##  8       8 2011-01-08 winter     0     1       0       6         0       2 0.165
##  9       9 2011-01-09 winter     0     1       0       0         0       1 0.138
## 10      10 2011-01-10 winter     0     1       0       1         1       1 0.151
## # … with 721 more rows, 6 more variables: atemp <dbl>, hum <dbl>,
## #   windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>, and abbreviated
## #   variable names ¹​workingday, ²​weathersit
bike$temp_raw=bike$temp*41
bike$atemp_raw=bike$atemp*50
bike$hum_raw=bike$hum*100
bike$windspeed_raw=bike$windspeed*67

head(bike)
## # A tibble: 6 × 20
##   instant dteday     season    yr  mnth holiday weekday workingday weath…¹  temp
##     <dbl> <date>      <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>   <dbl> <dbl>
## 1       1 2011-01-01      1     0     1       0       6          0       2 0.344
## 2       2 2011-01-02      1     0     1       0       0          0       2 0.363
## 3       3 2011-01-03      1     0     1       0       1          1       1 0.196
## 4       4 2011-01-04      1     0     1       0       2          1       1 0.2  
## 5       5 2011-01-05      1     0     1       0       3          1       1 0.227
## 6       6 2011-01-06      1     0     1       0       4          1       1 0.204
## # … with 10 more variables: atemp <dbl>, hum <dbl>, windspeed <dbl>,
## #   casual <dbl>, registered <dbl>, cnt <dbl>, temp_raw <dbl>, atemp_raw <dbl>,
## #   hum_raw <dbl>, windspeed_raw <dbl>, and abbreviated variable name
## #   ¹​weathersit
all(bike$casual+bike$registered==bike$cnt)
## [1] TRUE

Modelling

#install.packages("caTools")
set.seed(123)
library(caTools)
samples=sample.split(bike$cnt, SplitRatio=0.7) 
train=subset(bike, samples==T)
test=subset(bike, samples==F)
model=lm(cnt~temp_raw,data=train)
summary(model)
## 
## Call:
## lm(formula = cnt ~ temp_raw, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4556.5 -1140.5   -41.7  1064.3  3670.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1322.697    188.102   7.032  6.6e-12 ***
## temp_raw     155.982      8.683  17.964  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1470 on 509 degrees of freedom
## Multiple R-squared:  0.388,  Adjusted R-squared:  0.3868 
## F-statistic: 322.7 on 1 and 509 DF,  p-value: < 2.2e-16
#cnt=1322+156*temp_raw
#Intercept 1322 and coefficient 156, both positives and significants. Rentals increase with higher temperature.
#R squared 0.388
model1=lm(cnt~atemp_raw, data=train)
summary(model1)
## 
## Call:
## lm(formula = cnt ~ atemp_raw, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4541.3 -1087.3   -94.8  1068.0  4326.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1073.832    200.556   5.354  1.3e-07 ***
## atemp_raw    144.204      8.002  18.022  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1468 on 509 degrees of freedom
## Multiple R-squared:  0.3895, Adjusted R-squared:  0.3883 
## F-statistic: 324.8 on 1 and 509 DF,  p-value: < 2.2e-16
#cnt=1074+144*atemp_raw
#Coefficient and intercept significants, daily bike rentals increase with the feeling temperature
#Rsquared =0.3895
model2=lm(cnt~season+atemp_raw+season:atemp_raw, data=train)
summary(model2)
## 
## Call:
## lm(formula = cnt ~ season + atemp_raw + season:atemp_raw, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4640.0 -1022.4  -161.2  1088.4  3712.8 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1265.226    445.872  -2.838  0.00473 ** 
## season            1136.960    183.283   6.203 1.15e-09 ***
## atemp_raw          237.221     23.344  10.162  < 2e-16 ***
## season:atemp_raw   -43.609      9.029  -4.830 1.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1405 on 507 degrees of freedom
## Multiple R-squared:  0.4433, Adjusted R-squared:   0.44 
## F-statistic: 134.6 on 3 and 507 DF,  p-value: < 2.2e-16
#cnt=-1265+1136*season+237*atem_raw-43*season*atemp_raw
#All coefficients are significant. The intercept is negative as well as the coefficient of the interaction between temperature and season (the colder seasons have a higher number so they become more negative as it gets colder) consistent with when is colder number of rentals decrease. 
model3=lm(cnt~season+yr+holiday+workingday+weathersit+temp_raw+atemp_raw+hum_raw+windspeed_raw, data=train)
summary(model3)
## 
## Call:
## lm(formula = cnt ~ season + yr + holiday + workingday + weathersit + 
##     temp_raw + atemp_raw + hum_raw + windspeed_raw, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4044.2  -441.0    61.4   518.3  2997.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2072.033    277.651   7.463 3.78e-13 ***
## season         329.254     37.765   8.719  < 2e-16 ***
## yr            1993.146     77.782  25.625  < 2e-16 ***
## holiday       -806.922    235.856  -3.421 0.000674 ***
## workingday      99.275     87.061   1.140 0.254711    
## weathersit    -513.606     96.085  -5.345 1.37e-07 ***
## temp_raw        83.923     36.127   2.323 0.020577 *  
## atemp_raw       46.385     33.582   1.381 0.167820    
## hum_raw        -14.977      3.808  -3.933 9.55e-05 ***
## windspeed_raw  -45.399      8.049  -5.641 2.84e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 865.4 on 501 degrees of freedom
## Multiple R-squared:  0.7912, Adjusted R-squared:  0.7874 
## F-statistic: 210.9 on 9 and 501 DF,  p-value: < 2.2e-16
#Rsquared is 0.7912

Your answer here:

A good day to ride a bike could be a cool but dry summer working day in 2012 with no wind, . #A good day would be to ride in the summer in 2012 on a working day, avoiding holidays, with a Clear, Few clouds, Partly cloudy, Partly cloudy day with a high temperature feeling and low humidity and windspeed.

pred=predict(model3,newdata=test) 
r_sqr_pred=1-((sum((pred-test$cnt)^2))/(sum((mean(test$cnt)-test$cnt)^2)))
r_sqr_pred  
## [1] 0.791317
#Rsquared on the testing data 0.791317

Exploratory data analysis (Optional)

(bike=bike %>%
  mutate(atemp_raw = atemp * 50)) %>%
  ggplot(mapping=aes(x=dteday, y=cnt, color=temp_raw)) +
  geom_point(alpha=0.7) +
  labs(
    title="Bike Rentals vs Temperature and Date",
    subtitle="Bike rentals increase with better weather",
    x="Date",
    y="Bike Rentals",
    color="Temperature"
  ) +
  theme_minimal()