Exercise List 1
Variables, Scales of Measurement & Data Visualization
Setup
Load required packages. Note: We use relative paths, so make sure your working directory contains the data files or adjust the paths accordingly.
# Load required packages
library(readxl)
library(mosaic)
# Note: No setwd() needed - use relative paths or RStudio projects
# Data files should be in a 'data/' subfolder or the same directory
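As a minimal sketch (the 'data/' subfolder is a hypothetical layout; adjust to yours), a relative path can be built with file.path():
# Hypothetical: Major.xlsx stored in a 'data/' subfolder of the project
# MAJOR <- read_excel(file.path("data", "Major.xlsx"))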
Section 1.2: Variables and Scales of Measurement
Exercise 17: Major
A professor records the majors of her 30 students.
| Student | Major |
|---|---|
| 1 | Accounting |
| 2 | Management |
| … | … |
| 30 | Economics |
Questions:
- What is the measurement scale of the Major variable?
- Summarize the results in tabular form.
- What information can be extracted from the data?
Solution
a) Measurement Scale
The measurement scale of the Major variable is nominal - values differ only in name. There is no inherent order or ranking between different majors.
Show solution
# Always clean the script first to have a clear environment
rm(list = ls())
# Read the dataset
MAJOR <- read_excel("Major.xlsx")
# Attach the dataset to access variables directly
attach(MAJOR)
# Create the variables
# Write the desired name of the variable followed by "<-" and then write
# the name that the variable is assigned in the dataset that is attached
StudNo <- Student
Major <- Major
detach(MAJOR)
# head() displays the first few observations within the dataset
# View() displays a spreadsheet-style data viewer with rows and columns
head(MAJOR)
# A tibble: 6 × 2
Student Major
<dbl> <chr>
1 1 Accounting
2 2 Management
3 3 Economics
4 4 Finance
5 5 History
6 6 Accounting
b) Frequency Table
Show solution
# Data is unorganized, organizing this data by major...
Sorted <- MAJOR[order(MAJOR$Major), ]
# What are the possible values of 'Major' and how many levels does it have?
factor(Major)
 [1] Accounting Management Economics Finance History Accounting
[7] Economics History Undecided Management Statistics Management
[13] Undecided Psychology Accounting History Economics Undecided
[19] Finance Accounting Statistics Psychology Management Finance
[25] Psychology Management Finance History Statistics Economics
8 Levels: Accounting Economics Finance History Management ... Undecided
Show solution
# But how many students do we see per major?
# We can count using which() and length() functions
# 'which' identifies the criteria & 'length' counts the number
# == is an equality operator (checking if the criteria is met!)
length(which(MAJOR$Major == 'Accounting'))
[1] 4
Show solution
length(which(MAJOR$Major == 'Economics'))
[1] 4
Show solution
length(which(MAJOR$Major == 'Finance'))
[1] 4
Show solution
length(which(MAJOR$Major == 'History'))
[1] 4
Show solution
length(which(MAJOR$Major == 'Management'))
[1] 5
Show solution
length(which(MAJOR$Major == 'Psychology'))
[1] 3
Show solution
length(which(MAJOR$Major == 'Statistics'))
[1] 3
Show solution
length(which(MAJOR$Major == 'Undecided'))
[1] 3
Show solution
# Or the same by using the function "table"
# It counts the categorical variable in an output form
w <- data.frame(table(Major))
w
  Major Freq
1 Accounting 4
2 Economics 4
3 Finance 4
4 History 4
5 Management 5
6 Psychology 3
7 Statistics 3
8 Undecided 3
Show solution
# Change the name of the "Freq" column to "Frequency"
# Freq is in the second column of the data frame
names(w)[2] <- 'Frequency'
w
  Major Frequency
1 Accounting 4
2 Economics 4
3 Finance 4
4 History 4
5 Management 5
6 Psychology 3
7 Statistics 3
8 Undecided 3
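If relative frequencies are also of interest, a one-line sketch appends them to the same data frame:
# Share of students per major (relative frequency)
w$Rel_Frequency <- w$Frequency / sum(w$Frequency)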
c) Information Extracted
Management has the highest number of students, whereas Psychology, Statistics & Undecided have the lowest number of students.
Exercise 18: DOW
Data shows companies in the Dow Jones Industrial Average with Year joined, Industry, and Price.
Questions:
- What is the measurement scale of the Industry variable?
- What is the measurement scale of the Year variable? What are its strengths and weaknesses?
- What is the measurement scale of the Price variable? What are its strengths and weaknesses?
Solution
This exercise does not require R software. The answers are based on theoretical understanding of measurement scales.
a) Industry - Nominal scale (categories with no inherent order)
b) Year - Interval scale: differences between years are meaningful and equal-sized, but the calendar's zero point is arbitrary, so ratios of years are not meaningful (a strength over nominal and ordinal scales; a weakness relative to a ratio scale).
c) Price - Ratio scale: has a true zero point, equal intervals, and meaningful ratios (e.g., a $50 stock costs twice as much as a $25 one), making it the strongest scale of measurement.
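For intuition, here is a minimal sketch (with made-up values) of how the four scales are commonly represented in R:
# Nominal: unordered factor (hypothetical industries)
industry <- factor(c("Tech", "Retail", "Tech"))
# Ordinal: ordered factor
rating <- factor(c("Low", "High", "Medium"),
                 levels = c("Low", "Medium", "High"), ordered = TRUE)
# Interval: numeric where differences matter but ratios do not (calendar years)
year <- c(1999, 2004, 2009)
# Ratio: numeric with a true zero, so ratios are meaningful (prices)
price <- c(120.5, 98.3, 45.0)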
Section 1.3: Data Preparation
Exercise 26: Fitness
Survey of 418 individuals about exercise, marital status, and income.
| ID | Exercise | Married | Income |
|---|---|---|---|
| 1 | Always | Yes | 106299 |
| 2 | Sometimes | Yes | 86570 |
| … | … | … | … |
Questions:
- Sort by Income. Of the 10 highest income earners, how many are married and always exercise?
- How many married individuals who exercise sometimes earn > $110,000?
- Are there any missing values? In which variables?
- How many individuals are married? How many are unmarried?
- How many are married & always exercise? Unmarried & never exercise?
Solution
Show solution
# Clean script
rm(list = ls())
FITNESS <- read_excel("Fitness.xlsx")
# "attach" the dataset to define variables
attach(FITNESS)
# Define your variables out of the dataset FITNESS
ID <- ID
Exercise <- Exercise
Married <- Married
Income <- Income
detach(FITNESS)
a) Sort by Income
Show solution
# Sorting data by ordering it in a new data frame called SortedData1
# Default: data is sorted in ascending order
SortedData1 <- FITNESS[order(FITNESS$Income), ]
# To see top 10 earners, sort in descending order
SortedData1_desc <- FITNESS[order(FITNESS$Income, decreasing = TRUE), ]
head(SortedData1_desc, 10)
# A tibble: 10 × 4
ID Exercise Married Income
<dbl> <chr> <chr> <dbl>
1 39 Sometimes Yes 119890
2 63 Always Yes 119851
3 161 Always No 119615
4 51 Sometimes Yes 119354
5 138 Often Yes 119282
6 404 Sometimes No 119058
7 188 Often Yes 118771
8 196 Often Yes 118688
9 356 Sometimes Yes 118314
10 354 Sometimes Yes 118260
Answer: Of the ten highest income earners, only one (ID 63) is married and always exercises.
b) Married, Sometimes Exercise, Income > $110,000
Show solution
# Sort the data by Marriage and Exercise in DESCENDING order (not default!)
SortedData2 <- FITNESS[order(FITNESS$Married, FITNESS$Exercise, decreasing = TRUE), ]
# How many of those earn more than $110,000 per year?
SortedData3 <- FITNESS[order(FITNESS$Married, FITNESS$Exercise, FITNESS$Income, decreasing = TRUE), ]
# We can use the which and length functions
# Where 'which'=category of a variable & 'length'=the number of qualified observations
# Using operators to specify we want to find >$110,000 in income per year
length(which(FITNESS$Married == 'Yes' & FITNESS$Exercise == 'Sometimes' & FITNESS$Income > 110000))
[1] 9
Answer: 9 married individuals who exercise sometimes earn more than $110,000 per year.
c) Missing Values
Show solution
# Identifying which observations have missing values using is.na()
# Looking through TRUE/FALSE output is inconvenient, so we use which(is.na()) function
which(is.na(FITNESS$ID))
integer(0)
Show solution
which(is.na(FITNESS$Exercise))
[1] 14 62 86 143 175
Show solution
which(is.na(FITNESS$Married))
[1] 48 111
Show solution
which(is.na(FITNESS$Income))
[1] 21 55 181
Answer: Yes - Exercise (5 missing), Married (2 missing), and Income (3 missing) contain missing values; ID is complete.
d) Married vs Unmarried Count
Show solution
w <- table(FITNESS$Married)
w
No Yes
134 281
Answer: 281 individuals are married and 134 are unmarried.
e) Married & Always Exercise; Unmarried & Never Exercise
Show solution
# Married and always exercise
length(which(FITNESS$Married == 'Yes' & FITNESS$Exercise == 'Always'))
[1] 69
Show solution
# Unmarried and never exercise
length(which(FITNESS$Married == 'No' & FITNESS$Exercise == 'Never'))
[1] 74
Answer: 69 individuals are married and always exercise; 74 are unmarried and never exercise.
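A two-way table shows every Married x Exercise combination at once; a small sketch using the same data:
# Cross-tabulation of marital status against exercise habits
table(FITNESS$Married, FITNESS$Exercise)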
Exercise 30: Football Players
Quarterback statistics: Comp, Att, Pct, Yds, Avg, Yds/G, TD, Int
Questions:
- Are there missing values? Which variables/observations?
- Omit observations with missing values. How many removed?
- Remove outliers: TD < 5 or Int > 20. How many removed?
Solution
Show solution
rm(list = ls())
Football_Players <- read_excel("Football_Players.xlsx")
a) Missing Values
Show solution
# Count total missing values
sum(is.na(Football_Players))
[1] 3
Show solution
# Answer: 3 values are missing.
# Find which variables have missing values
which(is.na(Football_Players$Comp))
integer(0)
Show solution
which(is.na(Football_Players$Att))
[1] 28
Show solution
which(is.na(Football_Players$Pct))
integer(0)
Show solution
which(is.na(Football_Players$Yds))
[1] 25
Show solution
which(is.na(Football_Players$Avg))
integer(0)
Show solution
which(is.na(Football_Players$TD))
integer(0)
Show solution
which(is.na(Football_Players$Int))
[1] 29
Show solution
which(is.na(Football_Players$`Yds/G`))
integer(0)
# Answer: Att (observation 28), Yds (observation 25), and Int (observation 29)
# each contain one missing value.
b) Remove Missing Values
Show solution
# Method 1: using sum(is.na(x)) function will automatically give the amount of missing values
# Method 2: using na.omit(x) function will omit the NA values
Football_Players1 <- na.omit(Football_Players)
# Count observations removed
cat("Original observations:", nrow(Football_Players), "\n")Original observations: 43
Show solution
cat("After removing NA:", nrow(Football_Players1), "\n")After removing NA: 40
Show solution
cat("Observations removed:", nrow(Football_Players) - nrow(Football_Players1))Observations removed: 3
c) Remove Outliers
Show solution
# We want to remove outlier cases where the players have < 5 touchdowns
# or > 20 interceptions
# Method 1: using the subset(x) function
# | means "or"
newdata1 <- subset(Football_Players, TD < 5 | Int > 20)
newdata1
# A tibble: 5 × 10
Player Team Comp Att Pct Yds Avg `Yds/G` TD Int
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 9 Hummingbirds 55 112 49.1 544 4.9 136 1 3
2 10 Iguanas 123 224 54.9 1430 6.4 204. 4 6
3 18 Penguins 255 476 53.6 2894 6.1 193. 11 22
4 31 Sparrows 78 127 61.4 861 6.8 215. 4 5
5 34 Bulldogs 93 140 66.4 833 6 208. 4 5
Show solution
# Method 2: select specific columns only
newdata2 <- subset(Football_Players, TD < 5 | Int > 20, select = c(TD, Int))
newdata2
# A tibble: 5 × 2
TD Int
<dbl> <dbl>
1 1 3
2 4 6
3 11 22
4 4 5
5 4 5
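Note that subset() above only displays the outlier rows. To actually drop them, negate the condition; a minimal sketch, assuming removal should apply to the NA-cleaned data from part b:
# Keep only the rows that are NOT outliers
Football_Players2 <- subset(Football_Players1, !(TD < 5 | Int > 20))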
Show solution
# Answer: 5 observations can be removed from the data
# 4 observations had < 5 touchdowns
# 1 observation had > 20 interceptions
Exercise 31: Salaries
City of Seattle employee compensation data: Department, Job Title, Hourly Rate.
Questions:
- Split data by department
- Check for missing values in each subset
Solution
Show solution
rm(list = ls())
SALARIES <- read_excel("Salaries.xlsx")
# Count employees per department
table(SALARIES$Department)
Dept Of HR IT Dept
106 708
Public Utilities Sustainability & Environ Dept
1398 33
Show solution
# Create subsets by department
subset1 <- subset(SALARIES, Department == 'Public Utilities')
subset2 <- subset(SALARIES, Department == 'Sustainability & Environ Dept')
subset3 <- subset(SALARIES, Department == 'IT Dept')
subset4 <- subset(SALARIES, Department == 'Dept Of HR')
# Check for missing values in each subset
cat("Missing values in Public Utilities:", sum(is.na(subset1)), "\n")Missing values in Public Utilities: 0
Show solution
cat("Missing values in Sustainability:", sum(is.na(subset2)), "\n")Missing values in Sustainability: 0
Show solution
cat("Missing values in IT Dept:", sum(is.na(subset3)), "\n")Missing values in IT Dept: 2
Show solution
cat("Missing values in HR Dept:", sum(is.na(subset4)))Missing values in HR Dept: 0
Section 2.1: Visualizing Categorical Variables
Exercise 8: Millennials
Survey of 600 millennials about their faith: Very Religious, Somewhat Religious, Slightly Religious, Not Religious.
Questions:
- Construct frequency and relative frequency distributions. What is the most common response?
- Construct a bar chart
- Construct a pie chart. Are results consistent with earlier study (35% not religious)?
Solution
Show solution
# You need library(mosaic) for this exercise
rm(list = ls())
MILLENNIALS <- read_excel("Millennials.xlsx")
a) Frequency Distributions
Show solution
# Frequency distribution: using table(x$variable) function
# Distribution of values in a dataset using original units of observations
Frequency <- table(MILLENNIALS$Faith)
Frequency
Not Religious Slightly Religious Somewhat Religious Very Religious
210 129 183 78
Show solution
# Relative frequency: using table(x$variable)/length(x$variable) function
# Distribution of values where 100% (i.e., 1) entails all observations
RelativeFrequency <- table(MILLENNIALS$Faith) / length(MILLENNIALS$Faith)
RelativeFrequency
Not Religious Slightly Religious Somewhat Religious Very Religious
0.350 0.215 0.305 0.130
Show solution
# Answer: 'Not Religious' is the most common answer in the survey
b) Bar Chart
Show solution
# Horizontal bar chart
# Y-axis shows the variable values (e.g., not religious, religious, etc.)
barplot(Frequency,
main = "Bar Chart for Faith Survey",
horiz = TRUE,
col = "pink",
xlim = c(0, 250),
cex.names = 0.6)
abline(v = 0)
Show solution
# Vertical bar chart with custom colors
collors <- c("red", "purple", "pink", "olivedrab")
barplot(Frequency,
main = "Bar Chart for Faith Survey",
col = collors,
cex.names = 0.75)
abline(h = 0)
c) Pie Chart
Show solution
# R software decides the colours used in the pie chart by default
pie(Frequency, main = "Pie Chart for Faith Survey")
Show solution
# Let's decide the colors used in pie chart
# The order of colours = the order of dataset variables
collors <- c("red", "purple", "pink", "olivedrab")
pie(Frequency, main = "Pie Chart for Faith Survey", col = collors)
Show solution
# Pie chart for relative frequency
pie(RelativeFrequency, main = "Pie Chart for Relative Frequency", col = collors)
Answer: Yes, the results of the survey are consistent with the earlier study: approximately 0.350 (35%) of millennials identify themselves as “Not Religious”, matching the 35% reported earlier.
Section 2.2: Relationships Between Categorical Variables
Exercise 14: Bar
Survey at a local bar: customers identified their sex and drink choice (beer, wine, soft drink).
Questions:
- Construct contingency table. How many males? How many drank wine?
- P(beer | male)? P(beer | female)?
Solution
Show solution
rm(list = ls())
BAR <- read_excel("Bar.xlsx")
a) Contingency Table
Show solution
myTable <- table(BAR$Sex, BAR$Drink)
myTable
beer soft drink wine
female 38 10 20
male 142 20 40
Show solution
# addmargins() adds the Sum column and row for better visual data representation
addmargins(myTable)
beer soft drink wine Sum
female 38 10 20 68
male 142 20 40 202
Sum 180 30 60 270
Show solution
# If we want the same table with probabilities instead of observations
prop.table(myTable)
beer soft drink wine
female 0.14074074 0.03703704 0.07407407
male 0.52592593 0.07407407 0.14814815
Show solution
# Alternative arrangement
mynewTable <- table(BAR$Drink, BAR$Sex)
addmargins(mynewTable)
female male Sum
beer 38 142 180
soft drink 10 20 30
wine 20 40 60
Sum 68 202 270
Visualization:
Show solution
barplot(myTable,
legend = rownames(myTable),
col = c("lavender", "lightblue"),
beside = TRUE,
main = "Drink Preferences by Sex")Answer: There were 202 male customers. 60 customers drank wine.
b) Conditional Probabilities
- P(beer | male) = 142 (male beer drinkers) / 202 (total male customers) = 0.703 (70.3%)
- P(beer | female) = 38 (female beer drinkers) / 68 (total female customers) = 0.559 (55.9%)
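These conditional probabilities can also be computed in R; a minimal sketch using prop.table() on the contingency table built above (rows of myTable are Sex):
# margin = 1 scales each row to sum to 1, giving P(drink | sex)
round(prop.table(myTable, margin = 1), 3)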
Section 2.3: Visualizing Numerical Variables
Exercise 33: Prime
Amazon Prime customer expenditures data.
Questions:
- Frequency & relative frequency distributions (intervals: 400-700, 700-1000, etc.)
- Construct histogram
Solution
Show solution
rm(list = ls())
PRIME <- read_excel("Prime.xlsx")
a) Frequency Distributions
Show solution
# Check the range of data
cat("Max expenditure:", max(PRIME$Expenditures), "\n")Max expenditure: 2117
Show solution
cat("Min expenditure:", min(PRIME$Expenditures))Min expenditure: 467
Show solution
# Max = 2117 --> upper limit of the last interval has to be higher
# To create a frequency distribution using intervals with a certain width:
# use seq(lower limit of first, upper limit of last, by=width of interval) function
intervals <- seq(400, 2200, by = 300)
# Then, use cut() function
# right = TRUE makes each interval open on the left and closed on the right
# (e.g., 400 < x <= 700); note that cut() has no 'left' argument
intervals.cut <- cut(PRIME$Expenditures, intervals, right = TRUE)
expenditure.freq <- table(intervals.cut)
expenditure.freq
intervals.cut
(400,700] (700,1e+03] (1e+03,1.3e+03] (1.3e+03,1.6e+03]
10 28 66 50
(1.6e+03,1.9e+03] (1.9e+03,2.2e+03]
38 8
Show solution
# Relative frequency distribution table
relative.freq <- table(intervals.cut) / length(intervals.cut)
relative.freq
intervals.cut
(400,700] (700,1e+03] (1e+03,1.3e+03] (1.3e+03,1.6e+03]
0.05 0.14 0.33 0.25
(1.6e+03,1.9e+03] (1.9e+03,2.2e+03]
0.19 0.04
Show solution
# Combined table with cumulative frequencies
finaltable <- transform(expenditure.freq,
Rel_Freq = prop.table(relative.freq),
Cum_Freq = cumsum(relative.freq))
finaltable
                    intervals.cut Freq Rel_Freq.intervals.cut Rel_Freq.Freq
(400,700] (400,700] 10 (400,700] 0.05
(700,1e+03] (700,1e+03] 28 (700,1e+03] 0.14
(1e+03,1.3e+03] (1e+03,1.3e+03] 66 (1e+03,1.3e+03] 0.33
(1.3e+03,1.6e+03] (1.3e+03,1.6e+03] 50 (1.3e+03,1.6e+03] 0.25
(1.6e+03,1.9e+03] (1.6e+03,1.9e+03] 38 (1.6e+03,1.9e+03] 0.19
(1.9e+03,2.2e+03] (1.9e+03,2.2e+03] 8 (1.9e+03,2.2e+03] 0.04
Cum_Freq
(400,700] 0.05
(700,1e+03] 0.19
(1e+03,1.3e+03] 0.52
(1.3e+03,1.6e+03] 0.77
(1.6e+03,1.9e+03] 0.96
(1.9e+03,2.2e+03] 1.00
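The duplicated interval labels appear because relative.freq is still a table object when passed to transform(); a cleaner sketch builds the summary directly from plain vectors:
# Tidy frequency table: counts, relative and cumulative relative frequencies
cleantable <- data.frame(Interval = names(expenditure.freq),
                         Freq = as.vector(expenditure.freq),
                         Rel_Freq = as.vector(relative.freq),
                         Cum_Rel_Freq = cumsum(as.vector(relative.freq)))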
b) Histogram
Show solution
hist(PRIME$Expenditures,
breaks = intervals,
right = TRUE,
main = "Histogram for Prime Customer Expenditure",
xlab = "Expenditure in $",
col = "cyan")Exercise 35: Texts
Weekly text messages sent by 150 teenagers.
Questions:
- Frequency distributions (intervals: 500-600, 600-700, etc.)
- Construct polygon - symmetric or skewed?
- Construct ogive - proportion sending > 850 texts?
Solution
Show solution
rm(list = ls())
TEXTS <- read_excel("Texts.xlsx")
a) Frequency Distributions
Show solution
cat("Min texts:", min(TEXTS$Texts), "\n")Min texts: 504
Show solution
cat("Max texts:", max(TEXTS$Texts))Max texts: 980
Show solution
# max=980, make sure the upper limit of the last interval is higher than this value
# Creating a frequency distribution using intervals of a given width
intervals <- seq(500, 1000, by = 100)
# Use cut() function
# Left & right ensure intervals are open on the left and closed on the right
intervals.cut <- cut(TEXTS$Texts, intervals, left = FALSE, right = TRUE)
texts.freq <- table(intervals.cut)
texts.freq
intervals.cut
(500,600] (600,700] (700,800] (800,900] (900,1e+03]
27 33 30 33 27
Show solution
# Relative frequency distribution table
relative.freq <- table(intervals.cut) / length(intervals.cut)
relative.freq
intervals.cut
(500,600] (600,700] (700,800] (800,900] (900,1e+03]
0.18 0.22 0.20 0.22 0.18
Show solution
# Combined table with cumulative frequencies
finaltable <- transform(texts.freq,
Rel_Freq = prop.table(relative.freq),
Cum_Freq = cumsum(Freq))
finaltable
  intervals.cut Freq Rel_Freq.intervals.cut Rel_Freq.Freq Cum_Freq
1 (500,600] 27 (500,600] 0.18 27
2 (600,700] 33 (600,700] 0.22 60
3 (700,800] 30 (700,800] 0.20 90
4 (800,900] 33 (800,900] 0.22 123
5 (900,1e+03] 27 (900,1e+03] 0.18 150
b) Polygon
Show solution
# Create histogram and store the data
hist.score <- hist(TEXTS$Texts,
breaks = intervals,
right = TRUE,
main = 'Histogram with Polygon for Texts',
xlab = 'Texts',
col = "plum3",
border = "mediumpurple4")
# hist.score tells us all breaking points of our bars (i.e., mids)
# as well as the frequency of all of our bars (i.e., counts)
# Add polygon line
x.axis <- c(min(TEXTS$Texts), hist.score$mids, max(TEXTS$Texts))
y.axis <- c(0, hist.score$counts, 0)
lines(x.axis, y.axis, type = 'l', lwd = 2)
Answer: With class frequencies of 27, 33, 30, 33, and 27, the polygon is roughly symmetric.
c) Ogive (Cumulative Frequency Curve)
Show solution
# Ogive: cumulative frequency plotted against the upper class limits
# The histogram breaks (500, 600, ..., 1000) serve as the class limits
ucl <- hist.score$breaks
# Cumulative frequency starts at 0 at the lower limit of the first class
cf <- c(0, cumsum(hist.score$counts))
# Plot the Ogive
par(bg = "gray90")
plot(ucl, cf,
type = "b",
col = "blue",
pch = 20,
main = "Ogive for Text Messages",
xlab = "Upper Class Limit",
ylab = "Cumulative Frequency")Section 2.4: More Visualization Methods
Exercise 42: Car Price
Used sedan data: Price, Age, Mileage
Questions:
- Scatterplot: Price vs Age
- Scatterplot: Price vs Mileage
- Create Mileage_category (< 50,000 = Low, else High)
- Scatterplot with categories
Solution
Show solution
rm(list = ls())
Car_Price <- read_excel("Car_Price.xlsx")
a) Price vs Age
Show solution
plot(Car_Price$Price ~ Car_Price$Age,
col = "blue",
main = "Price vs Age",
xlab = "Age (years)",
ylab = "Price ($)",
pch = 16)
Answer: There appears to be a negative relationship between the price of a car and its age: the older the car, the cheaper it is.
b) Price vs Mileage
Show solution
plot(Car_Price$Price ~ Car_Price$Miles,
col = "orange",
main = "Price vs Mileage",
xlab = "Mileage",
ylab = "Price ($)",
pch = 16)
Answer: There appears to be a negative relationship between the price of a car and its mileage: cheaper cars tend to have higher mileage (which also relates to their being older).
c) Create Mileage Category
Show solution
# Convert to categorical variable using ifelse()
Car_Price$Miles_Cat <- ifelse(Car_Price$Miles < 50000, "Low_Mileage", "High_Mileage")
table(Car_Price$Miles_Cat)
High_Mileage Low_Mileage
7 13
Answer: 7 cars fall into the High Mileage category (mileage of 50,000 or more).
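An equivalent construction, sketched with cut(), which generalizes to more than two categories:
# right = FALSE puts exactly 50,000 in the High_Mileage interval [50000, Inf)
Car_Price$Miles_Cat2 <- cut(Car_Price$Miles, breaks = c(0, 50000, Inf),
                            labels = c("Low_Mileage", "High_Mileage"),
                            right = FALSE)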
d) Scatterplot with Categories
Show solution
plot(Car_Price$Price ~ Car_Price$Age,
pch = 16,
col = ifelse(Car_Price$Miles_Cat == "Low_Mileage", "blue", "red"),
main = "Price vs Age by Mileage Category",
xlab = "Age (years)",
ylab = "Price ($)")
legend("topright",
legend = c("Low Mileage", "High Mileage"),
pch = 16,
col = c("blue", "red"))Answer: The negative relationship between age and price is consistent for cars of both mileage categories.
Exercise 43: Internet Stocks
Compare Amazon (AMZN) and Alphabet (GOOG) stock performance 2016-2019.
Questions:
Construct a line chart showing both stocks over time. Which stock shows greater price appreciation?
Solution
Show solution
rm(list = ls())
Internet_Stocks <- read_excel("Internet_Stocks.xlsx")
# Check max values for y-axis scaling
cat("Max Amazon:", max(Internet_Stocks$AMZN), "\n")Max Amazon: 2012.71
Show solution
cat("Max Google:", max(Internet_Stocks$GOOG))Max Google: 1337.02
Show solution
colors <- c("maroon", "navyblue")
plot(Internet_Stocks$AMZN ~ Internet_Stocks$Date,
main = "Performance of Internet Stocks",
type = "l",
xlab = "Date",
ylab = "Stock Price ($)",
col = "maroon",
ylim = c(0, 2100),
lwd = 2)
lines(Internet_Stocks$GOOG ~ Internet_Stocks$Date,
col = "navyblue",
type = "l",
lwd = 2)
legend("topleft",
legend = c("Amazon", "Google"),
col = c("maroon", "navyblue"),
lty = 1,
lwd = 2)
Answer: Amazon shows greater price appreciation over the 2016-2019 period, peaking near $2,000 compared with Google's peak near $1,340.
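Because the two stocks start from different price levels, relative appreciation is easier to judge by indexing each series to its first observation; a sketch, assuming the rows of Internet_Stocks are ordered by Date:
# Index both series to 100 at the first observation
amzn_idx <- 100 * Internet_Stocks$AMZN / Internet_Stocks$AMZN[1]
goog_idx <- 100 * Internet_Stocks$GOOG / Internet_Stocks$GOOG[1]
plot(amzn_idx ~ Internet_Stocks$Date, type = "l", col = "maroon", lwd = 2,
     main = "Indexed Stock Performance (first observation = 100)",
     xlab = "Date", ylab = "Index")
lines(goog_idx ~ Internet_Stocks$Date, col = "navyblue", lwd = 2)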
Summary
Key Takeaways
- Measurement scales: Nominal, Ordinal, Interval, Ratio
- Data preparation: Sorting, handling missing values, removing outliers
- Categorical visualization: Bar charts, pie charts, contingency tables
- Numerical visualization: Histograms, polygons, ogives
- Relationships: Scatterplots, line charts
Data Files Required
Make sure you have these Excel files in your working directory:
- Major.xlsx
- Fitness.xlsx
- Football_Players.xlsx
- Salaries.xlsx
- Millennials.xlsx
- Bar.xlsx
- Prime.xlsx
- Texts.xlsx
- Car_Price.xlsx
- Internet_Stocks.xlsx