# Load packages
library(ggplot2)
library(dplyr)
# Import data
census <- read.csv("/resources/rstudio/BusinessStatistics/Data/census.csv")
str(census)
summary(census)
Suppose that you are writing a report on the economic importance of snowbirds in the Lakes Region Planning Commission’s region. The director asks you what share of all homes is seasonal (housing_percentOfseasonal) in a typical town in the region.

I would choose the median to represent a typical value.

## Q2 Explain your answer.
I would do this because the data have outliers that would skew the mean, while the median is barely affected by them.

## Q3 What is the highest percentage of seasonal homes in any town in the region?
The highest percentage of seasonal homes in any town in the region is about 28%.
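To illustrate the reasoning behind Q2, here is a minimal sketch with made-up numbers (not the census data). It shows how a single extreme town pulls the mean upward while the median stays near the middle of the other values.

```r
# Hypothetical shares of seasonal homes (percent) for six towns; NOT the census data
share <- c(4, 5, 6, 7, 8, 60)

mean_share <- mean(share)      # pulled upward by the single extreme town
median_share <- median(share)  # middle value, barely affected by the extreme town

c(mean = mean_share, median = median_share)
```

With an outlier like this, the median is the safer summary of a "typical" town.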
# Create histogram of the share of seasonal homes
ggplot(census, aes(x = housing_percentOfseasonal)) +
  geom_histogram()
# Create box plot of the share of seasonal homes
ggplot(census, aes(x = 1, y = housing_percentOfseasonal)) +
  geom_boxplot()
# Create density plot for same data
ggplot(census, aes(x = housing_percentOfseasonal)) +
  geom_density(alpha = .3)
# If data has extreme values
census %>%
  summarize(median = median(housing_percentOfseasonal, na.rm = TRUE),
            IQR = IQR(housing_percentOfseasonal, na.rm = TRUE))
# If data doesn't have extreme values
census %>%
  summarize(mean = mean(housing_percentOfseasonal, na.rm = TRUE),
            sd = sd(housing_percentOfseasonal, na.rm = TRUE))
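For Q3, the highest value can also be read directly instead of from the plots. The sketch below uses a small made-up data frame; the `town` column and all values are assumptions for illustration only, and the same calls could be pointed at `census` if it has a comparable layout.

```r
# Hypothetical toy data; the `town` column and all values are made up for illustration
towns <- data.frame(
  town = c("A", "B", "C", "D"),
  housing_percentOfseasonal = c(5.2, 27.9, NA, 12.4)
)

# Largest share, ignoring missing values
max_share <- max(towns$housing_percentOfseasonal, na.rm = TRUE)

# Row of the town with that share (which.max skips missing values)
top_town <- towns[which.max(towns$housing_percentOfseasonal), ]

max_share
top_town
```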
Suppose that the director suspects that the share of seasonal homes (housing_percentOfseasonal) may be associated with the educational level of residents (popBA_percent). Divide the towns into two groups: 1) educated towns (the share of the population with a Bachelor’s degree or higher is above the average) and 2) other towns (the share of the population with a Bachelor’s degree or higher is below the average).

## Q4 What is the share of seasonal homes in total in a typical educated town?
27%.

## Q5 What is the share of seasonal homes in total in a typical less educated town?
Just above 31%.

## Q6 What possible explanation may you have for the significant difference, if any?
I calculated these answers using the median. If I had used the mean, both percentages would have been 2% higher because of the outliers in the data set.
# Create a new variable: unemployment rate at/above or below the average
UR_ave <- mean(census$unemplRate, na.rm = TRUE)
census$UR_aboveAve <- ifelse(census$unemplRate >= UR_ave, "equal or above ave", "below ave")
# Create box plots of the share of seasonal homes by UR_aboveAve
ggplot(census, aes(x = UR_aboveAve, y = housing_percentOfseasonal)) +
  geom_boxplot()
# If data has extreme values
census %>%
  group_by(UR_aboveAve) %>%
  summarize(median = median(housing_percentOfseasonal, na.rm = TRUE),
            IQR = IQR(housing_percentOfseasonal, na.rm = TRUE))
# If data doesn't have extreme values
census %>%
  group_by(UR_aboveAve) %>%
  summarize(mean = mean(housing_percentOfseasonal, na.rm = TRUE),
            sd = sd(housing_percentOfseasonal, na.rm = TRUE))
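Note that the code above groups towns by unemployment rate, while Q4 to Q6 ask about education. A grouping by education might be sketched as follows. This is only an illustration on made-up values: it assumes the education share is stored in a column named `popBA_percent` (the name mentioned in the question), and `BA_ave` and `BA_aboveAve` are names introduced here, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data; column names follow the questions above, values are made up
towns <- data.frame(
  popBA_percent = c(10, 20, 30, 40, 50, 60),
  housing_percentOfseasonal = c(30, 40, 35, 10, 20, 25)
)

# Average share of residents with a Bachelor's degree or higher
BA_ave <- mean(towns$popBA_percent, na.rm = TRUE)

# Label each town relative to that average (BA_aboveAve is an assumed name)
towns$BA_aboveAve <- ifelse(towns$popBA_percent >= BA_ave, "educated", "other")

# Median and IQR of the seasonal share within each education group
by_education <- towns %>%
  group_by(BA_aboveAve) %>%
  summarize(median = median(housing_percentOfseasonal, na.rm = TRUE),
            IQR = IQR(housing_percentOfseasonal, na.rm = TRUE))

by_education
```

Replacing `towns` with `census` would apply the same steps to the real data, provided the column exists under that name.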