Name: Florencia Palacios Collaborated with: Gabriel Neira
You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file on Canvas.
In some cases, a incomplete code is given to guide you in the right direction. However, you will need to fill in the blanks in order to run the code block. Make sure to fill all the blanks, or comment them out, before you knit your R markdown file. Otherwise, it will return knitting errors.
Load data
rio = read.csv("https://raw.githubusercontent.com/BUSN32100/data_files/master/rio.csv")count() check which country
brought the most athletes, and how many was this? What are the 5 most
represented countries? Do the same for the men and women
separately.rio %>%
count(nationality) %>%
arrange(desc(n))## nationality n
## 1 USA 567
## 2 BRA 485
## 3 GER 441
## 4 AUS 431
## 5 FRA 410
## 6 CHN 404
## 7 GBR 374
## 8 JPN 346
## 9 CAN 321
## 10 ESP 313
## 11 ITA 312
## 12 RUS 286
## 13 NED 249
## 14 POL 242
## 15 ARG 223
## 16 KOR 213
## 17 NZL 208
## 18 UKR 205
## 19 SWE 164
## 20 COL 154
## 21 HUN 154
## 22 RSA 146
## 23 DEN 128
## 24 MEX 126
## 25 BLR 124
## 26 CUB 123
## 27 IND 123
## 28 EGY 122
## 29 BEL 108
## 30 CZE 104
## 31 SUI 104
## 32 KAZ 103
## 33 SRB 103
## 34 TUR 103
## 35 ROU 98
## 36 POR 95
## 37 GRE 93
## 38 CRO 88
## 39 VEN 88
## 40 IRL 80
## 41 KEN 80
## 42 NGR 78
## 43 AUT 71
## 44 UZB 70
## 45 ALG 68
## 46 LTU 67
## 47 IRI 64
## 48 SLO 63
## 49 NOR 62
## 50 TUN 61
## 51 JAM 57
## 52 AZE 56
## 53 TPE 56
## 54 FIJ 54
## 55 FIN 54
## 56 THA 54
## 57 SVK 51
## 58 BUL 50
## 59 MAR 49
## 60 ISR 47
## 61 EST 46
## 62 MGL 43
## 63 CHI 42
## 64 GEO 40
## 65 PUR 40
## 66 QAT 39
## 67 ECU 38
## 68 ETH 38
## 69 HKG 38
## 70 MNE 35
## 71 ZIM 35
## 72 BRN 34
## 73 ARM 32
## 74 LAT 32
## 75 MAS 32
## 76 TTO 32
## 77 PRK 31
## 78 BAH 30
## 79 HON 30
## 80 DOM 29
## 81 PER 29
## 82 INA 28
## 83 ANG 26
## 84 IRQ 26
## 85 SIN 25
## 86 CMR 24
## 87 MDA 23
## 88 VIE 23
## 89 SEN 22
## 90 GUA 21
## 91 UGA 21
## 92 KGZ 19
## 93 URU 17
## 94 CYP 16
## 95 GHA 16
## 96 PHI 13
## 97 UAE 13
## 98 BOL 12
## 99 BOT 12
## 100 CIV 12
## 101 ERI 12
## 102 BAR 11
## 103 BIH 11
## 104 CRC 11
## 105 KSA 11
## 106 MRI 11
## 107 PAR 11
## 108 CGO 10
## 109 HAI 10
## 110 LUX 10
## 111 NAM 10
## 112 PAN 10
## 113 ROT 10
## 114 SEY 10
## 115 ANT 9
## 116 BDI 9
## 117 COK 9
## 118 IOA 9
## 119 LIB 9
## 120 SRI 9
## 121 TKM 9
## 122 BER 8
## 123 ESA 8
## 124 ISL 8
## 125 JOR 8
## 126 KOS 8
## 127 LES 8
## 128 PNG 8
## 129 SAM 8
## 130 ARU 7
## 131 BAN 7
## 132 DJI 7
## 133 GRN 7
## 134 ISV 7
## 135 LBA 7
## 136 MLT 7
## 137 MYA 7
## 138 NEP 7
## 139 PAK 7
## 140 RWA 7
## 141 SKN 7
## 142 SYR 7
## 143 TAN 7
## 144 TGA 7
## 145 TJK 7
## 146 ZAM 7
## 147 ALB 6
## 148 BEN 6
## 149 CAF 6
## 150 CAM 6
## 151 GAB 6
## 152 GUY 6
## 153 LAO 6
## 154 MAD 6
## 155 MKD 6
## 156 MLI 6
## 157 MOZ 6
## 158 NIG 6
## 159 PLE 6
## 160 SUD 6
## 161 SUR 6
## 162 AND 5
## 163 BUR 5
## 164 CAY 5
## 165 CPV 5
## 166 FSM 5
## 167 GBS 5
## 168 GUI 5
## 169 GUM 5
## 170 LCA 5
## 171 MAW 5
## 172 MHL 5
## 173 NCA 5
## 174 PLW 5
## 175 SMR 5
## 176 TOG 5
## 177 ASA 4
## 178 COD 4
## 179 COM 4
## 180 GAM 4
## 181 IVB 4
## 182 MDV 4
## 183 OMA 4
## 184 SLE 4
## 185 VAN 4
## 186 VIN 4
## 187 AFG 3
## 188 BIZ 3
## 189 BRU 3
## 190 KIR 3
## 191 LIE 3
## 192 MON 3
## 193 SOL 3
## 194 SSD 3
## 195 STP 3
## 196 TLS 3
## 197 YEM 3
## 198 BHU 2
## 199 CHA 2
## 200 DMA 2
## 201 GEQ 2
## 202 LBR 2
## 203 MTN 2
## 204 NRU 2
## 205 SOM 2
## 206 SWZ 2
## 207 TUV 1
summarise and sum
, check how many medals of each type—gold, silver, bronze—were awarded
at this Olympics?Hint We talk
about using group_by and summarise together to
create a summary table in lecture. If you only use
summarise function (without using group_by),
you can create a dataframe with one row to summarise the
data.)
rio %>%
summarize(sum(gold),sum(bronze),sum(silver))## sum(gold) sum(bronze) sum(silver)
## 1 666 704 655
mutate to create a column
called total which adds the number of gold, silver, and
bronze medals for each athlete, and add this column to the dataframe
rio. Who are the top 10 athletes with the most number of
medals and how many was this? rio=rio %>%
rowwise() %>%
mutate(total=sum(c(gold,bronze,silver))) %>%
ungroup()age, equal to 2016 - (the year of birth). Use
group_by, summarise to find the minimum and
maximum age of each sports, and order results by the ascending order of
minimum agerio %>%
mutate(age=2016-year_of_birth) ## # A tibble: 11,538 × 14
## id name natio…¹ sex date_…² height weight sport gold silver bronze
## <int> <chr> <chr> <chr> <chr> <dbl> <int> <chr> <int> <int> <int>
## 1 736041664 A Je… ESP male 1969-1… 1.72 64 athl… 0 0 0
## 2 532037425 A La… KOR fema… 1986-0… 1.68 56 fenc… 0 0 0
## 3 435962603 Aaro… CAN male 1992-0… 1.98 79 athl… 0 0 1
## 4 521041435 Aaro… MDA male 1991-0… 1.83 80 taek… 0 0 0
## 5 33922579 Aaro… NZL male 1990-1… 1.81 71 cycl… 0 0 0
## 6 173071782 Aaro… AUS male 1990-0… 1.8 67 tria… 0 0 0
## 7 266237702 Aaro… USA male 1993-0… 2.05 98 voll… 0 0 1
## 8 382571888 Aaro… AUS male 1991-0… 1.93 100 aqua… 0 0 0
## 9 87689776 Aaur… ESP fema… 1988-1… 1.8 62 athl… 0 0 0
## 10 997877719 Abab… ETH fema… 1991-0… 1.65 54 athl… 0 0 0
## # … with 11,528 more rows, 3 more variables: year_of_birth <int>, total <int>,
## # age <dbl>, and abbreviated variable names ¹nationality, ²date_of_birth
rio$age = 2016 - rio$year_of_birth
rio %>%
group_by(sport) %>%
summarise(min = min(age),max = max(age)) %>%
arrange(min)## # A tibble: 28 × 3
## sport min max
## <chr> <dbl> <dbl>
## 1 aquatics 14 41
## 2 table tennis 15 54
## 3 athletics 16 47
## 4 fencing 16 42
## 5 football 16 41
## 6 gymnastics 16 41
## 7 shooting 16 56
## 8 weightlifting 16 41
## 9 archery 17 44
## 10 modern pentathlon 17 37
## # … with 18 more rows
sports, with a column called n_participants
which contains the number of participants in each sport, a column called
n_gold which contains the number of gold medals in each
sports. Then calculate the ratio of the number of gold medals to
participants for each sport. Which sport has the highest ratio? Which
has the lowest?Hint
n() counts the total number of rows in each group,
slice(c(1,n())) will return the first and last
row
rio %>%
group_by(sport) %>%
summarise(n_participants=n(),n_gold=sum(gold),gold_ratio=n_gold/n_participants) %>%
arrange(gold_ratio) %>%
slice(c(which.min(gold_ratio),which.max(gold_ratio)))## # A tibble: 2 × 4
## sport n_participants n_gold gold_ratio
## <chr> <int> <int> <dbl>
## 1 golf 120 2 0.0167
## 2 gymnastics 324 30 0.0926
#The sport with the lowest gold ratio is golf, and the sport with the highest gymnasticsbike = read_csv("https://raw.githubusercontent.com/BUSN32100/data_files/master/bikeshare-day.csv")The data include daily bike rental counts (by members and casual users) of Capital Bikeshare in Washington, DC in 2011 and 2012 as well as weather information on these days.
Source: UCI Machine Learning Repository - Bike Sharing Dataset The original data sources are http://capitalbikeshare.com/system-data and http://www.freemeteo.com.
The codebook is below:
| Variable name | Description |
|---|---|
instant |
record index |
dteday |
date |
season |
season (1:winter, 2:spring, 3:summer, 4:fall) |
yr |
year (0: 2011, 1:2012) |
mnth |
month (1 to 12) |
holiday |
weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule) |
weekday |
day of the week |
workingday |
if day is neither weekend nor holiday is 1, otherwise is 0. |
weathersit |
1: Clear, Few clouds, Partly cloudy, Partly cloudy |
| 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist | |
| 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds | |
| 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog | |
temp |
Normalized temperature in Celsius. The values are divided by 41 (max) |
atemp |
Normalized feeling temperature in Celsius. The values are divided by 50 (max) |
hum |
Normalized humidity. The values are divided by 100 (max) |
windspeed |
Normalized wind speed. The values are divided by 67 (max) |
casual |
Count of casual users |
registered |
Count of registered users |
cnt |
Count of total rental bikes including both casual and registered |
dteday, weekday, and holiday.
What about three dates with the highest casual rentals?bike$dteday=as.Date(bike$dteday)
bike %>%
group_by(dteday,weekday,holiday) %>%
summarise(total_rents=sum(cnt)) %>%
arrange(desc(total_rents))## `summarise()` has grouped output by 'dteday', 'weekday'. You can override using
## the `.groups` argument.
## # A tibble: 731 × 4
## # Groups: dteday, weekday [731]
## dteday weekday holiday total_rents
## <date> <dbl> <dbl> <dbl>
## 1 2012-09-15 6 0 8714
## 2 2012-09-29 6 0 8555
## 3 2012-09-22 6 0 8395
## 4 2012-03-23 5 0 8362
## 5 2012-05-19 6 0 8294
## 6 2012-09-09 0 0 8227
## 7 2012-07-25 3 0 8173
## 8 2012-09-21 5 0 8167
## 9 2012-10-05 5 0 8156
## 10 2012-06-02 6 0 8120
## # … with 721 more rows
dteday, weekday, and
holiday.bike %>%
filter(season==2) %>%
group_by(dteday,weekday,holiday) %>%
summarise(total_rents=sum(cnt)) %>%
arrange(desc(total_rents)) %>%
head(3)## `summarise()` has grouped output by 'dteday', 'weekday'. You can override using
## the `.groups` argument.
## # A tibble: 3 × 4
## # Groups: dteday, weekday [3]
## dteday weekday holiday total_rents
## <date> <dbl> <dbl> <dbl>
## 1 2012-03-23 5 0 8362
## 2 2012-05-19 6 0 8294
## 3 2012-06-02 6 0 8120
bike %>%
filter(yr==0 & holiday==1) %>%
group_by(dteday) %>%
summarise(dteday,total_rents=sum(cnt),weathersit,temp) %>%
arrange(desc(total_rents))## # A tibble: 10 × 4
## dteday total_rents weathersit temp
## <date> <dbl> <dbl> <dbl>
## 1 2011-07-04 6043 2 0.727
## 2 2011-10-10 5117 1 0.571
## 3 2011-05-30 4098 1 0.733
## 4 2011-11-11 3368 1 0.324
## 5 2011-09-05 3351 2 0.673
## 6 2011-04-15 3126 1 0.447
## 7 2011-11-24 1495 1 0.373
## 8 2011-12-26 1317 1 0.322
## 9 2011-02-21 1107 2 0.303
## 10 2011-01-17 1000 2 0.176
season variable to be a
factor with meaningful level names as outlined in the codebook using
case_when, and save it back to the bike
variable (it is important to rewrite the season column in
the framework for later questions).bike %>%
mutate(season=case_when(season==1~"winter",season==2~"spring",season==3~"summer",season==4~"fall")) ## # A tibble: 731 × 16
## instant dteday season yr mnth holiday weekday working…¹ weath…² temp
## <dbl> <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2011-01-01 winter 0 1 0 6 0 2 0.344
## 2 2 2011-01-02 winter 0 1 0 0 0 2 0.363
## 3 3 2011-01-03 winter 0 1 0 1 1 1 0.196
## 4 4 2011-01-04 winter 0 1 0 2 1 1 0.2
## 5 5 2011-01-05 winter 0 1 0 3 1 1 0.227
## 6 6 2011-01-06 winter 0 1 0 4 1 1 0.204
## 7 7 2011-01-07 winter 0 1 0 5 1 2 0.197
## 8 8 2011-01-08 winter 0 1 0 6 0 2 0.165
## 9 9 2011-01-09 winter 0 1 0 0 0 1 0.138
## 10 10 2011-01-10 winter 0 1 0 1 1 1 0.151
## # … with 721 more rows, 6 more variables: atemp <dbl>, hum <dbl>,
## # windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>, and abbreviated
## # variable names ¹workingday, ²weathersit
temp_raw,
atemp_raw, hum_raw, windspeed_raw
and store the raw values of temperature, feeling temperature, humidity
and windspeed, respectively, as specified in the codebook. Update the
dataframe and store the new columns in the dataframe.bike$temp_raw=bike$temp*41
bike$atemp_raw=bike$atemp*50
bike$hum_raw=bike$hum*100
bike$windspeed_raw=bike$windspeed*67
head(bike)## # A tibble: 6 × 20
## instant dteday season yr mnth holiday weekday workingday weath…¹ temp
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2011-01-01 1 0 1 0 6 0 2 0.344
## 2 2 2011-01-02 1 0 1 0 0 0 2 0.363
## 3 3 2011-01-03 1 0 1 0 1 1 1 0.196
## 4 4 2011-01-04 1 0 1 0 2 1 1 0.2
## 5 5 2011-01-05 1 0 1 0 3 1 1 0.227
## 6 6 2011-01-06 1 0 1 0 4 1 1 0.204
## # … with 10 more variables: atemp <dbl>, hum <dbl>, windspeed <dbl>,
## # casual <dbl>, registered <dbl>, cnt <dbl>, temp_raw <dbl>, atemp_raw <dbl>,
## # hum_raw <dbl>, windspeed_raw <dbl>, and abbreviated variable name
## # ¹weathersit
casual and
registered adds up to cnt for each
record.all(bike$casual+bike$registered==bike$cnt)## [1] TRUE
set.seed function. Using the sample.split
function (in the caTools package) to randomly choose 70%
data as the train data (stored in variable train) and the
rest as the test data (stored in variable test).#install.packages("caTools")
set.seed(123)
library(caTools)
samples=sample.split(bike$cnt, SplitRatio=0.7)
train=subset(bike, samples==T)
test=subset(bike, samples==F)temp_raw) on the
train data. Write the linear model, interpret the slope and
the intercept in context of the data. What is the train R
square for this model?model=lm(cnt~temp_raw,data=train)
summary(model)##
## Call:
## lm(formula = cnt ~ temp_raw, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4556.5 -1140.5 -41.7 1064.3 3670.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1322.697 188.102 7.032 6.6e-12 ***
## temp_raw 155.982 8.683 17.964 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1470 on 509 degrees of freedom
## Multiple R-squared: 0.388, Adjusted R-squared: 0.3868
## F-statistic: 322.7 on 1 and 509 DF, p-value: < 2.2e-16
#cnt=1322+156*temp_raw
#Intercept 1322 and coefficient 156, both positives and significants. Rentals increase with higher temperature.
#R squared 0.388atemp_raw you created before) on the train
data. Write the linear model, interpret the slope and the intercept in
context of the data. What is the train R square for this
model?model1=lm(cnt~atemp_raw, data=train)
summary(model1)##
## Call:
## lm(formula = cnt ~ atemp_raw, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4541.3 -1087.3 -94.8 1068.0 4326.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1073.832 200.556 5.354 1.3e-07 ***
## atemp_raw 144.204 8.002 18.022 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1468 on 509 degrees of freedom
## Multiple R-squared: 0.3895, Adjusted R-squared: 0.3883
## F-statistic: 324.8 on 1 and 509 DF, p-value: < 2.2e-16
#cnt=1074+144*atemp_raw
#Coefficient and intercept significants, daily bike rentals increase with the feeling temperature
#Rsquared =0.3895train
data. Write the linear model, interpret the coefficients associated with
different seasons, and determine and interpret the \(R^2\).model2=lm(cnt~season+atemp_raw+season:atemp_raw, data=train)
summary(model2)##
## Call:
## lm(formula = cnt ~ season + atemp_raw + season:atemp_raw, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4640.0 -1022.4 -161.2 1088.4 3712.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1265.226 445.872 -2.838 0.00473 **
## season 1136.960 183.283 6.203 1.15e-09 ***
## atemp_raw 237.221 23.344 10.162 < 2e-16 ***
## season:atemp_raw -43.609 9.029 -4.830 1.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1405 on 507 degrees of freedom
## Multiple R-squared: 0.4433, Adjusted R-squared: 0.44
## F-statistic: 134.6 on 3 and 507 DF, p-value: < 2.2e-16
#cnt=-1265+1136*season+237*atem_raw-43*season*atemp_raw
#All coefficients are significant. The intercept is negative as well as the coefficient of the interaction between temperature and season (the colder seasons have a higher number so they become more negative as it gets colder) consistent with when is colder number of rentals decrease. train data from season, year, whether the day is
holiday or not, whether the day is a workingday or not, the weather
category, temperature, feeling temperature, humidity and windspeed. What
is the train R square for this model.model3=lm(cnt~season+yr+holiday+workingday+weathersit+temp_raw+atemp_raw+hum_raw+windspeed_raw, data=train)
summary(model3)##
## Call:
## lm(formula = cnt ~ season + yr + holiday + workingday + weathersit +
## temp_raw + atemp_raw + hum_raw + windspeed_raw, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4044.2 -441.0 61.4 518.3 2997.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2072.033 277.651 7.463 3.78e-13 ***
## season 329.254 37.765 8.719 < 2e-16 ***
## yr 1993.146 77.782 25.625 < 2e-16 ***
## holiday -806.922 235.856 -3.421 0.000674 ***
## workingday 99.275 87.061 1.140 0.254711
## weathersit -513.606 96.085 -5.345 1.37e-07 ***
## temp_raw 83.923 36.127 2.323 0.020577 *
## atemp_raw 46.385 33.582 1.381 0.167820
## hum_raw -14.977 3.808 -3.933 9.55e-05 ***
## windspeed_raw -45.399 8.049 -5.641 2.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 865.4 on 501 degrees of freedom
## Multiple R-squared: 0.7912, Adjusted R-squared: 0.7874
## F-statistic: 210.9 on 9 and 501 DF, p-value: < 2.2e-16
#Rsquared is 0.7912Your answer here:
A good day to ride a bike could be a cool but dry summer working day in 2012 with no wind, . #A good day would be to ride in the summer in 2012 on a working day, avoiding holidays, with a Clear, Few clouds, Partly cloudy, Partly cloudy day with a high temperature feeling and low humidity and windspeed.
pred=predict(model3,newdata=test)
r_sqr_pred=1-((sum((pred-test$cnt)^2))/(sum((mean(test$cnt)-test$cnt)^2)))
r_sqr_pred ## [1] 0.791317
#Rsquared on the testing data 0.791317(bike=bike %>%
mutate(atemp_raw = atemp * 50)) %>%
ggplot(mapping=aes(x=dteday, y=cnt, color=temp_raw)) +
geom_point(alpha=0.7) +
labs(
title="Bike Rentals vs Temperature and Date",
subtitle="Bike rentals increase with better weather",
x="Date",
y="Bike Rentals",
color="Temperature"
) +
theme_minimal()