#For my final homework, I decided to reuse the population dataset from my second homework.
#I'm a real estate research analyst, and it is really interesting for us to investigate city
#"winners" and "losers": how some cities evolve into hubs while others decay over time.
#A big component of this is in-migration, with a focus on population changes. Population change
#is sticky; it may take decades for meaningful growth, evolution of the cityscape, and
#real-estate demand to materialize. Hence, this dataset immediately piqued my interest, as it
#compares the population of a random sample of large cities at two points a decade apart.
#A meaningful question an analyst might ask is how this random sample compares to a benchmark:
#the overall U.S. population growth rate between 1920 and 1930. From publicly available data,
#the CAGR of the U.S. population was ~1.5% over this period. I calculate the CAGR for each city
#in the sample and compare it to this U.S. benchmark rate to identify "Winners" and "Losers".
#It is a simplifying assumption that winners and losers can be identified solely on the basis
#of population growth rates; in the real world, several other considerations would have to be
#taken into account for such a determination.
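#As a quick sanity check on the ~1.5% benchmark, the sketch below recomputes the U.S. CAGR from
#rounded census totals. The two totals (roughly 106.0 and 123.2 million) are approximate figures
#assumed here, not values taken from the dataset, and us_1920, us_1930 and us_cagr are
#illustrative names.

```r
#Approximate U.S. population in millions (assumed, rounded census totals)
us_1920 <- 106.0
us_1930 <- 123.2

#CAGR over a 10-year span, in percent: (end / start)^(1 / years) - 1
us_cagr <- ((us_1930 / us_1920)^(1 / 10) - 1) * 100
print(round(us_cagr, 1))
```

#The same formula, with an exponent of 0.1 for the ten-year span, is what gets applied to each city.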
#install.packages("boot")
library(boot)
library(dplyr)
population <- data.frame(bigcity$u, bigcity$x)
View(population)
#2 -> Data wrangling that includes column renaming
colnames(population) <- c("TwentyPop", "ThirtyPop")
View(population)
#2 -> Data wrangling that includes adding a new column that calculates the population CAGR for each city over that decade
population <- mutate(population, CAGR = round(((ThirtyPop / TwentyPop)^(0.1) - 1) * 100, 1))
#2 -> Data wrangling that adds a new status column that identifies city “Winners” and “Losers” by comparing individual CAGR to the benchmark US rate
population$WinnersLosers <- ifelse(population$CAGR >= 1.5, "WINNER", "LOSER")
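#To make the threshold rule concrete, here is a minimal sketch on a made-up three-city data
#frame. The populations in toy are hypothetical values chosen for illustration and are not part
#of the dataset; the CAGR formula and the 1.5 cutoff mirror the steps above.

```r
#Hypothetical populations for three made-up cities
toy <- data.frame(TwentyPop = c(100, 200, 50), ThirtyPop = c(130, 205, 48))

#Ten-year CAGR in percent, rounded to one decimal
toy$CAGR <- round(((toy$ThirtyPop / toy$TwentyPop)^(0.1) - 1) * 100, 1)

#Cities at or above the benchmark are winners, the rest are losers
toy$WinnersLosers <- ifelse(toy$CAGR >= 1.5, "WINNER", "LOSER")

print(toy)
print(table(toy$WinnersLosers))
```

#A shrinking city (the third row) gets a negative CAGR and therefore falls into the "LOSER" group,
#as does a city that grows more slowly than the benchmark (the second row).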
#1 -> Data exploration: uses the summary function to gain an overview of the dataset (including the calculated CAGR column).
summary(population)
#Mean and median of TwentyPop
Mean20pop <- mean(population$TwentyPop)
Median20pop <- median(population$TwentyPop)
#Mean and median of ThirtyPop
Mean30pop <- mean(population$ThirtyPop)
Median30pop <- median(population$ThirtyPop)
#Display the mean and median for the above two columns
print(paste0("The mean population of the randomly selected 49 US cities in 1920 is: ", round(Mean20pop, 2), " and the median population for the same dataset is: ", round(Median20pop, 2)))
print(paste0("The mean population of the randomly selected 49 US cities in 1930 is: ", round(Mean30pop, 2), " and the median population for the same dataset is: ", round(Median30pop, 2)))
#Graphics - Print multiple plots at once
par(mfrow = c(2, 2))
#Histogram - 1920's Population set
hist(population$TwentyPop, main = "1920's Population Histogram", xlab = "Population")
#Histogram - 1930's Population set
hist(population$ThirtyPop, main = "1930's Population Histogram", xlab = "Population")
#scatterplot
plot(ThirtyPop ~ TwentyPop, data = population)
#boxplot
boxplot(population$CAGR)
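#Since the analysis compares each city to the U.S. benchmark, it may help to draw that benchmark
#on the CAGR boxplot. The sketch below rebuilds the CAGR directly from bigcity so it runs on its
#own; city_cagr and benchmark are illustrative names, and the 1.5 value is the assumed benchmark
#discussed above.

```r
#bigcity ships with the boot package: u is the 1920 population, x the 1930 population
data(bigcity, package = "boot")
city_cagr <- round(((bigcity$x / bigcity$u)^(0.1) - 1) * 100, 1)
benchmark <- 1.5

#Single-panel layout, boxplot of city CAGRs, dashed line at the benchmark
par(mfrow = c(1, 1))
boxplot(city_cagr, main = "City CAGR, 1920 to 1930", ylab = "CAGR (%)")
abline(h = benchmark, lty = 2)
```

#Cities plotted above the dashed line are the ones labelled "WINNER" under this rule.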
#The conclusion from the summary statistics and graphics above is that population is a sticky metric
#over time. The random sample of 49 cities can be considered an approximation of the U.S. average.
#Thus, on an absolute basis, the mean and median for the same sample of cities in the 1920s do not
#differ much from those of the 1930s. However, when we look at the CAGR (a city-wide calculation
#across decades, i.e. a cross-sectional and time-series comparison) on a percentage basis, there is
#wide dispersion: some cities recorded outsized out-migration (negative CAGR) and others outsized
#in-migration, indicating clearly identifiable "winners" and "losers". Compare this again with the
#overall U.S. population growth rate over this period, which didn't change much.
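#The dispersion described above can be put into numbers with a few base R calls. This is only a
#sketch: it recomputes the CAGR from bigcity so it is self-contained, cagr_all is an illustrative
#name, and 1.5 is the assumed benchmark used throughout.

```r
#bigcity ships with the boot package: u is the 1920 population, x the 1930 population
data(bigcity, package = "boot")
cagr_all <- round(((bigcity$x / bigcity$u)^(0.1) - 1) * 100, 1)

#Spread of city-level growth rates: minimum, maximum and standard deviation
print(range(cagr_all))
print(round(sd(cagr_all), 1))

#Share of cities (in percent) at or above the assumed 1.5% benchmark
print(round(mean(cagr_all >= 1.5) * 100, 1))
```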