Question 1

First, we run a regression with "creatclear" as the target variable and age as the predictor variable, and get the results below:

lmRate = lm(creatclear~age, data = creatinine) 
summary(lmRate) 

Call:
lm(formula = creatclear ~ age, data = creatinine)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.2249  -4.6175   0.2221   4.7212  15.8221 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 147.81292    1.37965  107.14   <2e-16 ***
age          -0.61982    0.03475  -17.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.911 on 155 degrees of freedom
Multiple R-squared:  0.6724,  Adjusted R-squared:  0.6703 
F-statistic: 318.2 on 1 and 155 DF,  p-value: < 2.2e-16

(A)

Clearance rate of a 55-year-old = intercept + 55 * "age coefficient"
= 147.81292 + 55 * (-0.61982)
= 113.7228 ml/minute
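
As a quick check of the arithmetic, the prediction can be reproduced from the coefficients printed in the regression summary. This is a minimal sketch; the variable names `intercept`, `slope_age`, and `pred_55` are illustrative and not part of the original code.

```r
# Coefficients copied from the summary(lmRate) output (rounded as printed)
intercept <- 147.81292
slope_age <- -0.61982

# Predicted clearance rate (ml/minute) for a 55-year-old
pred_55 <- intercept + slope_age * 55
print(round(pred_55, 4))
```

With the fitted model still in memory, `predict(lmRate, newdata = data.frame(age = 55))` should give the same value up to rounding of the printed coefficients.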

(B)

On average, the creatinine clearance rate decreases by 0.61982 ml/minute for each additional year of age.

(C)

We compare each person's rate to the expected rate for their age.

Expected rate at age 40 = 147.81292 + 40 * (-0.61982)
= 123.0201

The 40-year-old has 135 ml/minute, which is (135 - 123.0201)/123.0201 = 0.09738, or about 9.74% above the expected rate for a 40-year-old.

Expected rate at age 60 = 147.81292 + 60 * (-0.61982) = 110.6237

The 60-year-old has 112 ml/minute, which is (112 - 110.6237)/110.6237 = 0.01244, or about 1.24% above the expected rate for a 60-year-old.

Comparing 9.74% with 1.24%, we conclude the 40-year-old is healthier for their age.
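
The comparison in (C) can be sketched as a small helper that returns the expected rate for a given age, followed by the relative gap between the observed and expected rates. The helper name `expected_rate` and the `patients` data frame are illustrative assumptions; the coefficients are taken from the regression output.

```r
# Coefficients copied from the regression output (rounded as printed)
intercept <- 147.81292
slope_age <- -0.61982

# Hypothetical helper: expected clearance rate (ml/minute) for a given age
expected_rate <- function(age) intercept + slope_age * age

# The two people being compared
patients <- data.frame(age = c(40, 60), observed = c(135, 112))
patients$expected <- expected_rate(patients$age)

# Relative gap: how far above the age-specific expectation each person is
patients$relative_gap <- (patients$observed - patients$expected) / patients$expected

print(patients)
```

The raw residuals (observed minus expected) give the same ranking here: about 11.98 ml/minute for the 40-year-old versus about 1.38 ml/minute for the 60-year-old.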

Question 2

ggplot(s550)+ geom_point(aes(x=mileage, y=price))+
  facet_wrap(~year, nrow = 4)+
  coord_fixed(ratio=0.5) + xlim(0, 150000) + ylim(10000,150000)+
  ggtitle("Car Price Based on Mileage")+
  theme(plot.title = element_text(hjust = 0.5))

(Additional Practice)

If we only want to consider lower-mileage (typically newer) cars, we can limit the mileage axis to 50,000. Although this drops the higher-mileage cars from the plot, it gives a clearer view of how price varies with mileage within that range.

ggplot(s550)+ geom_point(aes(x=mileage, y=price))+
  facet_wrap(~year, nrow = 4)+
  coord_fixed(ratio=0.2) + xlim(0, 50000) + ylim(20000,150000)+
  ggtitle("Car Price Based on Mileage")+
  theme(plot.title = element_text(hjust = 0.5))
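
Because `xlim()` and `ylim()` drop points outside the limits rather than just zooming, it can help to count how many rows a given window would remove. The sketch below uses a small made-up data frame (`toy_cars`, not the real `s550` data) purely for illustration.

```r
# Made-up example rows; NOT the s550 data
toy_cars <- data.frame(
  mileage = c(5000, 20000, 48000, 75000, 120000),
  price   = c(110000, 90000, 60000, 40000, 25000)
)

# Rows that fall inside the tighter plotting window
inside <- toy_cars$mileage >= 0 & toy_cars$mileage <= 50000 &
  toy_cars$price >= 20000 & toy_cars$price <= 150000

n_kept    <- sum(inside)
n_dropped <- sum(!inside)
cat("kept:", n_kept, "dropped:", n_dropped, "\n")
```

If the goal is only to zoom, one option is to pass the limits to the coordinate system instead, e.g. `coord_fixed(ratio = 0.2, xlim = c(0, 50000), ylim = c(20000, 150000))`, which keeps all points in the data.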

Question 3

Filter the data: we only keep properties with a leasing rate over 20%, between 10 and 20 stories, age under 15 years, and size between 100,000 and 500,000 square feet.

# Store the filtered data so that the summaries below use it
greenbuildings = greenbuildings %>%
  filter(leasing_rate > 20 & stories > 10 & stories < 20 & age < 15 & size > 100000 & size < 500000)

Next, classify the buildings as green or non-green within each of the three classes, which gives 6 categories. I use mean rent for the comparison because, after filtering down to the properties we are interested in, the mean is less affected by extreme values.

# Class A
green_a = greenbuildings %>%
  filter(green_rating == "1" & class_a == "1") %>%
  summarize(green_a = mean(Rent))

non_green_a = greenbuildings %>%
  filter(green_rating == "0" & class_a == "1") %>%
  summarize(non_green_a = mean(Rent))

# Class B
green_b = greenbuildings %>%
  filter(green_rating == "1" & class_b == "1") %>%
  summarize(green_b = mean(Rent))

non_green_b = greenbuildings %>%
  filter(green_rating == "0" & class_b == "1") %>%
  summarize(non_green_b = mean(Rent))

# Class C
green_c = greenbuildings %>%
  filter(green_rating == "1" & class_a == "0" & class_b=="0") %>%
  summarize(green_c = mean(Rent))

non_green_c = greenbuildings %>%
  filter(green_rating == "0" & class_a == "0" & class_b=="0") %>%
  summarize(non_green_c = mean(Rent))

After classifying them, we can see the mean rent of different classes.

mean_rent<-c(green_a, non_green_a, green_b, non_green_b, green_c, non_green_c)
view(mean_rent)
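
The six means can also be computed in one grouped summary instead of six separate filters. The sketch below uses base R on a made-up data frame (`toy`, which only mimics the column layout of `greenbuildings`); the `class` label column is an assumption introduced for illustration.

```r
# Made-up rows that mimic the column layout; NOT the greenbuildings data
toy <- data.frame(
  green_rating = c(1, 0, 1, 0, 1, 0, 0),
  class_a      = c(1, 1, 0, 0, 0, 0, 1),
  class_b      = c(0, 0, 1, 1, 0, 0, 0),
  Rent         = c(30, 32, 26, 27, 28, 24, 34)
)

# Collapse the two indicator columns into a single class label
toy$class <- ifelse(toy$class_a == 1, "A", ifelse(toy$class_b == 1, "B", "C"))

# Mean rent for every green/non-green by class combination
mean_rent_by_group <- aggregate(Rent ~ green_rating + class, data = toy, FUN = mean)
print(mean_rent_by_group)
```

With dplyr loaded, the same idea would be `group_by(green_rating, class) %>% summarize(mean_rent = mean(Rent))`.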

Now, create a barplot to see the difference.

value_types<-c(30.98901, 32.59642, 26.08924, 26.40649, 28.17143, 23.91232)
names<-c("green_a", "non_green_a", "green_b", "non_green_b", "green_c", "non_green_c")
barplot(value_types, names.arg = names, ylab = "mean of rent")

Because I restrict building age to under 15 years, this comparison does not speak to longer-term decisions. For class A and B properties, the mean rent is lower for green buildings; only class C shows a higher mean rent for green buildings (28.17 vs 23.91). This counterintuitive result might be caused by other underlying factors. In conclusion, based on this comparison, the extra $5 million for the green building is not worth it, and she should not invest more to build green.
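
To make the class-by-class comparison explicit, the difference between green and non-green mean rent can be computed directly from the six values used in the barplot. This is a minimal sketch; the name `green_premium` is illustrative.

```r
# Mean rents copied from the barplot input above
mean_rent <- c(green_a = 30.98901, non_green_a = 32.59642,
               green_b = 26.08924, non_green_b = 26.40649,
               green_c = 28.17143, non_green_c = 23.91232)

# Green minus non-green mean rent within each class
green_premium <- c(
  A = mean_rent[["green_a"]] - mean_rent[["non_green_a"]],
  B = mean_rent[["green_b"]] - mean_rent[["non_green_b"]],
  C = mean_rent[["green_c"]] - mean_rent[["non_green_c"]]
)
print(round(green_premium, 5))
```

A negative value means the green buildings in that class have the lower mean rent.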

Question 4

Plot A

bikeshare %>%
  group_by(hr) %>%
  summarize(avgbike = mean(total))%>%
  ggplot(aes(x = hr, y = avgbike)) +
  geom_line()+ 
  ggtitle("Average Bike Rentals by Hours") +
  xlab("Hour of the Day") + ylab("Average Bike Rented")+
  theme(plot.title = element_text(hjust = 0.5))

From plot A we can see that average rentals peak during the morning and evening rush hours, which suggests that many riders use the bikes to commute to and from work.

Plot B

label.workday <- as_labeller(
     c(`0` = "Weekend or Holiday", `1` = "Working Day"))

bikeshare %>%
  group_by(hr,workingday) %>%
  summarize(avgbike = mean(total))%>%
  ggplot(aes(x = hr, y = avgbike)) +
  geom_line()+
  facet_wrap(~workingday, nrow=2, labeller = label.workday)+
  ggtitle("Average Bike Rentals by Hours\n Working Day or Not") +
  xlab("Hour of the Day") + ylab("Average Bike Rented")+
  theme(plot.title = element_text(hjust = 0.5))

In plot B, we split the data into "Weekend or Holiday" and "Working Day". This supports the assumption we made for plot A: the two peaks on working days line up with the rush hours of going to and leaving work. On weekends and holidays, by contrast, rentals are highest from about 10 AM to 4 PM, so we assume people are riding for leisure.
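
The peak hours can also be read from the summarized table rather than by eye. The sketch below uses made-up hourly averages (`toy_avg`, not the real `bikeshare` data); in the actual workflow the input would be the grouped summary computed before the `ggplot()` call.

```r
# Made-up hourly averages; NOT the bikeshare data
toy_avg <- data.frame(
  hr         = rep(c(7, 8, 12, 17, 18), times = 2),
  workingday = rep(c(1, 0), each = 5),
  avgbike    = c(200, 450, 180, 500, 420,   # working day
                 40, 90, 350, 300, 250)     # weekend or holiday
)

# For each day type, keep the row with the highest average rentals
peak_rows <- do.call(rbind, lapply(split(toy_avg, toy_avg$workingday), function(d) {
  d[which.max(d$avgbike), ]
}))
print(peak_rows)
```

In this toy table the working-day peak falls in the evening rush hour and the weekend peak falls at midday, which is the pattern described for plot B.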

Plot C

bikeshare %>% 
  filter(hr==8)%>%
  group_by(workingday, weathersit) %>% 
  summarize(avgbike = mean(total)) %>% 
  ggplot(aes(x = weathersit, y = avgbike)) +
  geom_col()+
  facet_wrap(~workingday, nrow=2, labeller = label.workday)+
  ggtitle("Average Bike Rentals at 8AM \n under Different Weather Situations") +
  xlab("Weather Situation") + ylab("Average Bike Rented")+
  theme(plot.title = element_text(hjust = 0.5))

First, we can see that people rent more bikes at 8 AM on working days. Second, the x-axis labels 1, 2, and 3 represent the weather situation; we can interpret 1 as "very nice", 2 as "good", and 3 as "bad". In "good" weather, rentals are slightly lower than in "very nice" weather, but the numbers are generally close whether or not it is a working day. Regardless of whether it is a working day, rentals decrease dramatically in "bad" weather.