R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

# =================== Step 2. Missing Data =============================================
# Read data from hard disk
# The data file you would like to load (here, day.csv) must be stored in
# the project folder previously set as the working directory
bike <- read.csv("day.csv") # update!


# Remove missing data
good <- complete.cases(bike) # update!
bike_clean <- bike[good, ] # update!
# =====================================
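As a minimal sketch of what complete.cases() does, the toy data frame below (hypothetical values, not taken from day.csv) contains two rows with a missing entry. Filtering with the logical vector keeps only the fully observed rows, which is the same pattern used for bike_clean.

```r
# Toy data frame with hypothetical values (not from day.csv)
toy <- data.frame(
  cnt  = c(985, NA, 1349, 1562),
  temp = c(0.34, 0.36, NA, 0.20)
)

# TRUE for rows without any NA, FALSE otherwise
good_toy <- complete.cases(toy)

# Keep only the complete rows
toy_clean <- toy[good_toy, ]

nrow(toy)        # number of rows before cleaning
nrow(toy_clean)  # number of rows after cleaning
```

Comparing nrow() before and after cleaning is a quick way to see how many observations were dropped.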
# ====================== Step 3. Descriptive Statistics (Project Task 1) ==============
# From this point forward, make sure to use bike_clean instead of bike
# This can be done for any variable. See below how to find some measures for
# variable cnt. cnt definition: count of total rental bikes including both casual and registered
# ======================================================================================
# =================== Step 3.1 Statistical measures ====================================
summary(bike_clean$cnt) # update!
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      22    3152    4548    4504    5956    8714
mean(bike_clean$cnt) # update!
## [1] 4504.349
var(bike_clean$cnt) # update!
## [1] 3752788
sd(bike_clean$cnt) # update!
## [1] 1937.211
max(bike_clean$cnt)-min(bike_clean$cnt) # update!
## [1] 8692
IQR(bike_clean$cnt) # update!
## [1] 2804
# =================== Step 3.2 Histogram and Boxplot ===================================
# Method 1: using base R
hist(bike_clean$cnt) # update!

hist(bike_clean$cnt, main = "cnt", border = "blue",            # update!
     col = "lightblue", xlab = "count of total rental bikes including both casual and registered",     # update!
     breaks = 100)                                         # update!

boxplot(bike_clean$cnt, ylab = "count of total rental bikes including both casual and registered",   # update!
        col = "blue")
# Method 2: using ggplot2 package
# You need to install the ggplot2 package only once
# To install this package: Go to Packages tab (bottom right window),
# hit install, type in the package name: ggplot2, hit install
library(ggplot2)

g <- ggplot(data = bike_clean)   # update!
g + geom_histogram(aes(x = cnt), color = "blue", fill = "lightblue", bins = 100) +  # update!
  xlab("count of total rental bikes including both casual and registered")                                           # update!

g + geom_boxplot(aes(x = "", y = cnt), color = "blue", fill = "#009FD4", width = 0.25)  # update!

# ======================================================================================
# =================== Step 3.3 Explore relationships ====================================
# You can see the relationships between any two variables in a scatter plot
# For example, below you can see the relationship between cnt and hum
plot(bike_clean$hum, bike_clean$cnt, col = "red",                      # update!
     xlab = "Normalized humidity", ylab = "count of total rental bikes including both casual and registered")   # update!

# One more: below you can see the relationship between cnt and atemp
plot(bike_clean$atemp, bike_clean$cnt, col = "red",                               # update!
     xlab = "Normalized feeling temperature in Celsius", ylab = "count of total rental bikes including both casual and registered")  # update!

# ============= Step 4. Confidence Interval (CI) (Project Task 2-1)======================
# Suppose we want to construct a two-sided CI for the "cnt" variable in the bike data
# at the 95% confidence level:
# the population standard deviation is unknown, so we check whether the number of observations, n,
# is greater than 40. In this case it is (nrow(bike_clean) is well above 40), so we can use the Z distribution.
# Below, we find the parameters we need for CI:

# x-bar:
mean(bike_clean$cnt)   # update!
## [1] 4504.349
# Sample standard deviation:
sd(bike_clean$cnt)   # update!
## [1] 1937.211
# z_alpha/2
# in this example, alpha is 0.05, so for a two-sided CI we use 1-alpha/2 to find
# corresponding z-value
qnorm(1-0.05/2)  # update if needed!
## [1] 1.959964
# Finally, we use the appropriate CI formula to calculate the boundaries.
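The boundary calculation can be sketched as follows, using the large-sample formula x-bar ± z_(alpha/2) * s / sqrt(n) and the summary statistics printed above. The sample size n = 731 is an assumption inferred from the regression output in this report (729 residual degrees of freedom plus 2 coefficients); in the project, nrow(bike_clean) should be used instead.

```r
# Summary statistics as printed above
x_bar <- 4504.349   # sample mean of cnt
s     <- 1937.211   # sample standard deviation of cnt

# Assumption: n = 731, inferred from the regression output
# (729 residual degrees of freedom + 2 coefficients);
# in the project, use nrow(bike_clean) instead
n <- 731

alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)   # critical z-value for a two-sided CI
margin <- z_crit * s / sqrt(n)   # margin of error

# Lower and upper boundaries of the two-sided 95% CI
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
ci
```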
# ====================== Step 5. Hypothesis Testing (Project Task 2-2) ===================
# Suppose there is a criterion that the "cnt" variable should be 4300, so we want to
# perform a hypothesis test to see whether the population mean of this variable is 4300 at the 95% CL
# We need to take the 7-step procedure for hypothesis testing.
# Parameters that are needed:

# x-bar:
mean(bike_clean$cnt)  # update!
## [1] 4504.349
# Sample standard deviation:
sd(bike_clean$cnt)   # update!
## [1] 1937.211
# in this example, the Mu-null (i.e., hypothesized value) is 4300 and n is entered as 732, so z-null is:
(mean(bike_clean$cnt)-4300)/(sd(bike_clean$cnt)/sqrt(732))   # update!
## [1] 2.853978
# using the p-value approach, we need to calculate the p-value:
2*pnorm(-2.873429)   # update! p-value = 2*P(Z<-|z_null|)
## [1] 0.004060423
# In this example the p-value is 0.004060423, which is less than alpha (0.05), so we reject the null
# hypothesis. This means we support the alternative hypothesis, which states that the
# population mean of cnt (count of total rental bikes) is not 4300 at the 95% confidence level,
# based on the sample in bike_clean
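One way to keep the test statistic and the p-value consistent is to store z-null in a variable and pass it to pnorm() directly, rather than retyping the number by hand. The sketch below does this with the summary statistics printed above. The sample size n = 731 is an assumption inferred from the regression output in this report, and the variable names are illustrative.

```r
# Summary statistics as printed above
x_bar <- 4504.349   # sample mean of cnt
s     <- 1937.211   # sample standard deviation of cnt
mu_0  <- 4300       # hypothesized population mean (Mu-null)
alpha <- 0.05

# Assumption: n = 731 (729 residual degrees of freedom + 2 coefficients);
# in the project, use nrow(bike_clean) instead
n <- 731

# Test statistic
z_null <- (x_bar - mu_0) / (s / sqrt(n))

# Two-sided p-value: 2 * P(Z < -|z_null|)
p_value <- 2 * pnorm(-abs(z_null))

# Decision: reject the null hypothesis when the p-value is below alpha
reject_null <- p_value < alpha

z_null
p_value
reject_null
```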
# ================= Step 6. Regression analysis ===========================================
# ============ Step 6.1 simple linear regression (Project Task 3-1)========================
# Suppose we want to develop a simple linear regression model to estimate
# "cnt" (count of total rental bikes) based on "temp" (normalized temperature)
model <- lm(cnt ~ temp, data = bike_clean)    # update!
# model output can be seen by running a summary command:
summary(model)                              # update if needed!
## 
## Call:
## lm(formula = cnt ~ temp, data = bike_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4615.3 -1134.9  -104.4  1044.3  3737.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1214.6      161.2   7.537 1.43e-13 ***
## temp          6640.7      305.2  21.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929 
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
plot(cnt ~ temp, data = bike_clean,         # update!
     xlab = "Normalized temperature in Celsius",     # update!
     ylab = "count of total rental bikes including both casual and registered",     # update!
     main = "Simple linear regression",
     pch = 19)
abline(model, col = "red", lwd = 5, lty = 1)   # update if needed!
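To illustrate how the fitted line is read, the sketch below plugs the rounded coefficient estimates from the summary(model) output above into cnt = b0 + b1 * temp. The helper predict_cnt is hypothetical and only for illustration.

```r
# Coefficient estimates copied from the summary(model) output above
b0 <- 1214.6   # intercept
b1 <- 6640.7   # slope for temp (normalized temperature)

# Hypothetical helper: predicted cnt for a normalized temperature value
predict_cnt <- function(temp) b0 + b1 * temp

predict_cnt(0.5)             # predicted count at temp = 0.5
predict_cnt(c(0.25, 0.75))   # works on vectors as well
```

With the fitted model object available, predict(model, newdata = data.frame(temp = 0.5)) gives the same prediction without the rounding of the printed coefficients.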

# ========================================================================================
# ============ Step 6.2 multiple linear regression (Project Task 3-2)=====================
# Suppose we want to develop a multiple linear regression model to estimate
# "cnt" (count of total rental bikes) based on "hum" (normalized humidity)
# and "atemp" (normalized feeling temperature)
model2 <- lm(cnt ~ hum + atemp, data = bike_clean)    # update!
# model output can be seen by running a summary command:
summary(model2)     # update if needed!
## 
## Call:
## lm(formula = cnt ~ hum + atemp, data = bike_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4833.9 -1071.8   -54.8  1050.2  4308.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2440.0      274.2   8.899  < 2e-16 ***
## hum          -2622.0      382.8  -6.850 1.57e-11 ***
## atemp         7822.6      334.6  23.382  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1459 on 728 degrees of freedom
## Multiple R-squared:  0.4347, Adjusted R-squared:  0.4331 
## F-statistic: 279.9 on 2 and 728 DF,  p-value: < 2.2e-16
# additional independent variables can be added with the + sign
model3 <- lm(cnt ~ hum + atemp + windspeed + holiday, data = bike_clean)   # update!
summary(model3)     # update if needed!
## 
## Call:
## lm(formula = cnt ~ hum + atemp + windspeed + holiday, data = bike_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4913.2 -1052.2   -89.6  1065.3  4362.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3805.5      342.7  11.104  < 2e-16 ***
## hum          -3175.8      382.8  -8.296 5.23e-16 ***
## atemp         7485.3      329.8  22.694  < 2e-16 ***
## windspeed    -4414.8      708.6  -6.230 7.90e-10 ***
## holiday       -585.1      314.6  -1.860   0.0633 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1420 on 726 degrees of freedom
## Multiple R-squared:  0.4657, Adjusted R-squared:  0.4628 
## F-statistic: 158.2 on 4 and 726 DF,  p-value: < 2.2e-16
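When comparing models with different numbers of predictors, adjusted R-squared is usually the more informative figure because it penalizes extra variables. As a rough check, the sketch below recomputes it from the Multiple R-squared values printed above with the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sample size n = 731 is an assumption inferred from the residual degrees of freedom, and adj_r2 is a hypothetical helper.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Assumption: n = 731 (e.g., 729 residual degrees of freedom + 2 coefficients)
n <- 731

# Multiple R-squared values copied from the summaries above
fits <- data.frame(
  model = c("model", "model2", "model3"),
  r2    = c(0.3937, 0.4347, 0.4657),
  p     = c(1, 2, 4)
)

fits$adj_r2 <- round(adj_r2(fits$r2, n, fits$p), 4)
fits
```

The recomputed values can be compared with the Adjusted R-squared lines in the three summaries.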