Submission: Complete your solutions in the
provided R Markdown file, named
FinalOnline-<your-name>.Rmd (replace
<your-name>). Submit both the .Rmd and
the HTML/PDF by the deadline.
Permitted Resources: You may use R help, class
materials, your own previous code, and the following packages:
tidyverse, openintro,
nycflights13 (and dependencies). No Internet
search, AI tools, or outside communication (except with the instructor)
allowed.
Late Policy: No late submissions.
Code Requirements: Use only the allowed packages. For each question, include relevant code and concise written answers. Use graphs/tables as instructed.
Graph Formatting: Polish your graphs.
Point Value: Each question is worth 10% of the base grade.
Full Credit (100%): Complete any 10 out of 11 questions to reach a perfect score.
Extra Credit (120%): You may answer all questions to earn a maximum of 120% of the base grade.
Partial Credit: Points will be awarded based on the accuracy and completeness of your responses.
Use code (with output) to answer each question unless otherwise specified.
The following questions shall be answered by working with the
world_bank_pop and who data sets from the
openintro library.
The data set world_bank_pop is not clean. Clean the
data set so that after tidying you have six columns:
country, year, SP.URB.TOTL,
SP.URB.GROW, SP.POP.TOTL,
SP.POP.GROW. Give your code and show the first 10 rows of
the data set after being tidied. Then explain the meaning of each
column.
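As a reminder of the reshaping technique involved, here is a minimal sketch on made-up data. The tibble `toy`, its country codes, and its values are hypothetical; only the wide layout (one row per country and indicator, one column per year) is meant to resemble the real data set. It lengthens the year columns first and then spreads the indicators into columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per country/indicator, one column per year
toy <- tibble(
  country   = c("AAA", "AAA", "BBB", "BBB"),
  indicator = c("SP.POP.TOTL", "SP.URB.TOTL", "SP.POP.TOTL", "SP.URB.TOTL"),
  `2000`    = c(100, 40, 200, 150),
  `2001`    = c(110, 45, 210, 160)
)

tidy_toy <- toy %>%
  # Step 1: gather the year columns into a single `year` column
  pivot_longer(cols = c(`2000`, `2001`), names_to = "year", values_to = "value") %>%
  mutate(year = as.integer(year)) %>%
  # Step 2: spread each indicator into its own column
  pivot_wider(names_from = indicator, values_from = value)

tidy_toy
```

The result has one row per country and year, with each indicator in its own column.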
Replace the country column of the tidied data set in
step a) with full names of the country (for example, replace
USA with United States of America) by checking
the data frame who, which contains the full name of each
country corresponding to the three-letter country code. Give your code
and show the updated data set in a manner to illustrate that the task is
correctly fulfilled.
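One common way to swap codes for names is a lookup join, sketched below on made-up data. The tables `codes` and `lookup` and the column names `iso3` and `name` are assumptions for illustration. Because a lookup source can contain many rows per country, the sketch reduces it to one row per code before joining so that no rows are duplicated.

```r
library(dplyr)

# Hypothetical data keyed by a three-letter code
codes <- tibble(
  country = c("AAA", "BBB", "AAA"),
  year    = c(2000L, 2000L, 2001L)
)

# Hypothetical lookup source with repeated rows per country
lookup_raw <- tibble(
  iso3 = c("AAA", "AAA", "BBB"),
  name = c("Country A", "Country A", "Country B")
)

# Keep exactly one row per code before joining
lookup <- lookup_raw %>% distinct(iso3, name)

named <- codes %>%
  left_join(lookup, by = c("country" = "iso3")) %>%
  mutate(country = name) %>%   # overwrite the code with the full name
  select(-name)

named
```

Codes with no match in the lookup table would come back as `NA`, so counting missing values after the join is a quick way to check the result.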
With the data set obtained in step b), determine which countries underwent significant urbanization between 2000 and 2017. Show the code and the results (either graphs or tables) that support your answer.
For the following tasks, use the data sets planes and
flights from the nycflights13 package.
For the planes data set, only keep planes from
manufacturers that have more than 10 samples in the data set. Then
convert manufacturer column into a factor. Then combine
AIRBUS and AIRBUS INDUSTRIE as a single
category AIRBUS; combine MCDONNELL DOUGLAS,
MCDONNELL DOUGLAS AIRCRAFT CO and
MCDONNELL DOUGLAS CORPORATION into a single category
MCDONNELL. Save your data frame as a new one. Show your
code and the first 10 rows of the updated data frame.
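The filter-then-collapse pattern can be sketched on made-up data as follows. The tibble `toy`, the column `maker`, and the threshold of 1 are hypothetical and chosen only to keep the example small. It uses `forcats::fct_collapse()` to merge factor levels.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: two spellings of one maker plus a rare maker
toy <- tibble(maker = c(rep("ACME", 3), rep("ACME INC", 2), "RARE"))

collapsed <- toy %>%
  group_by(maker) %>%
  filter(n() > 1) %>%            # keep makers with more than 1 row
  ungroup() %>%
  mutate(
    maker = factor(maker),       # convert to a factor after filtering
    maker = fct_collapse(maker, ACME = c("ACME", "ACME INC"))
  )

levels(collapsed$maker)
```

Converting to a factor after filtering avoids carrying unused levels for the makers that were dropped.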
Join the flights data set with the
planes data set, then study how plane models correlate with
flight distance using proper data visualizations or summary tables. You
are required to summarize your findings concisely in your own
words.
For the following tasks, use the data set weather,
flights or planes from the
nycflights13 package.
Create a plot of the temperature change across the whole year of
2013 at the JFK airport. (Hint: You need to first create a
datetime variable for each hour.)
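For the hint above, one possible way to build an hourly datetime from separate date parts is `lubridate::make_datetime()`. The sketch below uses a made-up tibble `toy`; the column names `year`, `month`, `day`, and `hour` are assumed to be integer date parts.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data with separate integer date parts
toy <- tibble(
  year  = c(2013L, 2013L),
  month = c(1L, 1L),
  day   = c(1L, 1L),
  hour  = c(0L, 13L)
)

toy <- toy %>%
  mutate(datetime = make_datetime(year, month, day, hour))

toy
```

A column built this way can be mapped directly to the x axis of a line plot.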
Find out which day of the year has the largest temperature difference (defined as the difference between the highest and the lowest temperature) across the day (midnight to 11pm).
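The daily range defined above is a grouped summary: maximum minus minimum within each day. A minimal sketch on made-up readings, where the tibble `toy` and its values are hypothetical:

```r
library(dplyr)

# Hypothetical toy data: two readings on each of two days
toy <- tibble(
  month = c(1L, 1L, 1L, 1L),
  day   = c(1L, 1L, 2L, 2L),
  temp  = c(30, 45, 50, 52)
)

ranges <- toy %>%
  group_by(month, day) %>%
  summarise(
    # na.rm = TRUE skips missing readings instead of returning NA
    temp_range = max(temp, na.rm = TRUE) - min(temp, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(temp_range))

ranges
```

After sorting in descending order, the first row is the day with the largest range.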
Find a way to select all overnight flights (also
called “Red Eye Flights”, which depart late at night and arrive in the
early morning) from the flights data set. Here overnight
flights are defined as flights that departed between 10pm and 1am and
had an air time of over 4 hours. Create a categorical variable
overnight_flag with YES or NO as
the possible values. Show your code and the updated data frame.
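A two-valued flag of this kind can be built with `dplyr::if_else()`. The sketch below runs on made-up data and rests on assumptions to verify against the real data: departure time is taken to be an integer in HHMM form (so 2230 means 10:30pm), air time is taken to be in minutes, and rows with missing values are labeled NO.

```r
library(dplyr)

# Hypothetical toy data; dep_time assumed HHMM, air_time assumed minutes
toy <- tibble(
  dep_time = c(2230L, 30L, 900L, 2300L, NA),
  air_time = c(300, 250, 400, 120, NA)
)

flagged <- toy %>%
  mutate(
    overnight_flag = if_else(
      # the window wraps past midnight, so the two ends are joined with OR
      (dep_time >= 2200 | dep_time <= 100) & air_time > 240,
      "YES", "NO",
      missing = "NO"   # assumption: missing times count as NO
    )
  )

flagged
```

Because the window crosses midnight, a single range test such as `between()` would not capture it; the two halves are tested separately.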
Someone says that most overnight flights use relatively small
planes. Verify whether this is true with the data frame obtained in c)
and the planes data set.
Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.
From the gss_cat data set, find factors that are
significantly correlated with the reported income.
From the smoking data set of the
openintro package, find factors that are significantly
correlated with the smoking status and the number of cigarettes smoked
per day.