Variables, Scales of Measurement & Data Visualization

Author: BC8 Quantitative Methods
Published: January 20, 2026

Setup

Load required packages. Note: We use relative paths, so make sure your working directory contains the data files or adjust the paths accordingly.

# Load required packages
library(readxl)
library(mosaic)

# Note: No setwd() needed - use relative paths or RStudio projects
# Data files should be in a 'data/' subfolder or same directory

Section 1.2: Variables and Scales of Measurement

Exercise 17: Major

A professor records the majors of her 30 students.

Student Major
1 Accounting
2 Management
30 Economics

Questions:

  1. What is the measurement scale of the Major variable?
  2. Summarize the results in tabular form.
  3. What information can be extracted from the data?

Solution

a) Measurement Scale

The measurement scale of the Major variable is nominal - values differ only in name. There is no inherent order or ranking between different majors.

Show solution
# Clear the workspace so we start with an empty environment
rm(list = ls())

# Read the dataset
MAJOR <- read_excel("Major.xlsx")

# Attach the dataset to access variables directly
attach(MAJOR)

# Copy the columns into stand-alone variables:
# desired variable name <- column name in the attached dataset
StudNo <- Student
Major <- Major

detach(MAJOR)

# head() displays the first few observations within the dataset
# View() displays a spreadsheet-style data viewer with rows and columns
head(MAJOR)
# A tibble: 6 × 2
  Student Major     
    <dbl> <chr>     
1       1 Accounting
2       2 Management
3       3 Economics 
4       4 Finance   
5       5 History   
6       6 Accounting

b) Frequency Table

Show solution
# The rows are not ordered; sort them alphabetically by Major
Sorted <- MAJOR[order(MAJOR$Major), ]

# What are the possible values of 'Major' and how many levels does it have?
factor(Major)
 [1] Accounting Management Economics  Finance    History    Accounting
 [7] Economics  History    Undecided  Management Statistics Management
[13] Undecided  Psychology Accounting History    Economics  Undecided 
[19] Finance    Accounting Statistics Psychology Management Finance   
[25] Psychology Management Finance    History    Statistics Economics 
8 Levels: Accounting Economics Finance History Management ... Undecided
Show solution
# How many students are there per major?
# which() returns the positions of the rows that meet a condition, length() counts them
# == is the equality operator (TRUE where the condition is met)

length(which(MAJOR$Major == 'Accounting'))
[1] 4
Show solution
length(which(MAJOR$Major == 'Economics'))
[1] 4
Show solution
length(which(MAJOR$Major == 'Finance'))
[1] 4
Show solution
length(which(MAJOR$Major == 'History'))
[1] 4
Show solution
length(which(MAJOR$Major == 'Management'))
[1] 5
Show solution
length(which(MAJOR$Major == 'Psychology'))
[1] 3
Show solution
length(which(MAJOR$Major == 'Statistics'))
[1] 3
Show solution
length(which(MAJOR$Major == 'Undecided'))
[1] 3
Show solution
# The same counts can be obtained with table(),
# which tabulates the categorical variable in one step
w <- data.frame(table(Major))
w
       Major Freq
1 Accounting    4
2  Economics    4
3    Finance    4
4    History    4
5 Management    5
6 Psychology    3
7 Statistics    3
8  Undecided    3
Show solution
# Change the name of the "Freq" column to "Frequency"
# Freq is in the second column of the data frame
names(w)[2] <- 'Frequency'
w
       Major Frequency
1 Accounting         4
2  Economics         4
3    Finance         4
4    History         4
5 Management         5
6 Psychology         3
7 Statistics         3
8  Undecided         3
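If relative frequencies are wanted alongside the counts, a minimal sketch that extends the data frame w built above (the column name Rel_Frequency is just an illustrative choice):

# Share of the 30 students in each major
w$Rel_Frequency <- w$Frequency / sum(w$Frequency)
w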

c) Information Extracted

Management has the most students (5), whereas Psychology, Statistics and Undecided have the fewest (3 each).


Exercise 18: DOW

Data shows companies in the Dow Jones Industrial Average with Year joined, Industry, and Price.

Questions:

  1. What is the measurement scale of the Industry variable?
  2. What is the measurement scale of the Year variable? What are its strengths and weaknesses?
  3. What is the measurement scale of the Price variable? What are its strengths and weaknesses?

Solution

Note

This exercise does not require R software. The answers are based on theoretical understanding of measurement scales.

a) Industry - Nominal scale (categories with no inherent order)

b) Year - Interval scale: differences between years are meaningful and equally spaced, but there is no true zero point, so ratios of years are not meaningful.

c) Price - Ratio scale (has a true zero point, equal intervals, and meaningful ratios)
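For example, a $40 stock is twice as expensive as a $20 stock (ratios on Price are meaningful), whereas a company that joined the index in 1990 joined 10 years later than one that joined in 1980, but the ratio 1990/1980 has no interpretation. That is the practical difference between the ratio and interval scales in this exercise.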


Section 1.3: Data Preparation

Exercise 26: Fitness

Survey of 418 individuals about exercise, marital status, and income.

ID Exercise Married Income
1 Always Yes 106299
2 Sometimes Yes 86570

Questions:

  1. Sort by Income. Of the 10 highest income earners, how many are married and always exercise?
  2. How many married individuals who exercise sometimes earn > $110,000?
  3. Are there any missing values? In which variables?
  4. How many individuals are married? How many are unmarried?
  5. How many are married & always exercise? Unmarried & never exercise?

Solution

Show solution
# Clean script
rm(list = ls())

FITNESS <- read_excel("Fitness.xlsx")

# "attach" the dataset to define variables
attach(FITNESS)

# Copy the columns of FITNESS into stand-alone variables
ID <- ID
Exercise <- Exercise
Married <- Married
Income <- Income

detach(FITNESS)

a) Sort by Income

Show solution
# Sorting data by ordering it in a new data frame called SortedData1
# Default: data is sorted in ascending order
SortedData1 <- FITNESS[order(FITNESS$Income), ]

# To see top 10 earners, sort in descending order
SortedData1_desc <- FITNESS[order(FITNESS$Income, decreasing = TRUE), ]
head(SortedData1_desc, 10)
# A tibble: 10 × 4
      ID Exercise  Married Income
   <dbl> <chr>     <chr>    <dbl>
 1    39 Sometimes Yes     119890
 2    63 Always    Yes     119851
 3   161 Always    No      119615
 4    51 Sometimes Yes     119354
 5   138 Often     Yes     119282
 6   404 Sometimes No      119058
 7   188 Often     Yes     118771
 8   196 Often     Yes     118688
 9   356 Sometimes Yes     118314
10   354 Sometimes Yes     118260
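Part (a) can also be answered with a short count over the ten rows printed above; a minimal sketch (the name top10 is just illustrative):

top10 <- head(SortedData1_desc, 10)
sum(top10$Married == 'Yes' & top10$Exercise == 'Always', na.rm = TRUE)

From the printout, only one of the ten highest earners (ID 63) is both married and always exercises.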

b) Married, Sometimes Exercise, Income > $110,000

Show solution
# Sort the data by Married and Exercise in descending order (ascending is the default)
SortedData2 <- FITNESS[order(FITNESS$Married, FITNESS$Exercise, decreasing = TRUE), ]

# How many of those earn more than $110,000 per year?
SortedData3 <- FITNESS[order(FITNESS$Married, FITNESS$Exercise, FITNESS$Income, decreasing = TRUE), ]

# Count directly with which() and length():
# which() returns the row positions that satisfy all three conditions, length() counts them
# Income > 110000 restricts the count to incomes above $110,000 per year
length(which(FITNESS$Married == 'Yes' & FITNESS$Exercise == 'Sometimes' & FITNESS$Income > 110000))
[1] 9
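An equivalent, slightly shorter count sums the logical condition directly; na.rm = TRUE guards against the missing values identified in part (c):

sum(FITNESS$Married == 'Yes' & FITNESS$Exercise == 'Sometimes' & FITNESS$Income > 110000, na.rm = TRUE)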

c) Missing Values

Show solution
# is.na() flags missing values; scanning its TRUE/FALSE output is tedious,
# so we wrap it in which() to get the affected row numbers directly

which(is.na(FITNESS$ID))
integer(0)
Show solution
which(is.na(FITNESS$Exercise))
[1]  14  62  86 143 175
Show solution
which(is.na(FITNESS$Married))
[1]  48 111
Show solution
which(is.na(FITNESS$Income))
[1]  21  55 181
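Checking each column separately works, but all variables can also be screened at once; a minimal sketch whose per-column totals should match the counts above (output omitted):

# Number of missing values per variable
colSums(is.na(FITNESS))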

d) Married vs Unmarried Count

Show solution
w <- table(FITNESS$Married)
w

 No Yes 
134 281 
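Note that table() silently drops missing values, which is why the two counts above need not add up to the full sample size. To make the missing entries visible, one option is:

# Include NA as its own category in the count
table(FITNESS$Married, useNA = "ifany")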

e) Married & Always Exercise; Unmarried & Never Exercise

Show solution
# Married and always exercise
length(which(FITNESS$Married == 'Yes' & FITNESS$Exercise == 'Always'))
[1] 69
Show solution
# Unmarried and never exercise
length(which(FITNESS$Married == 'No' & FITNESS$Exercise == 'Never'))
[1] 74

Exercise 30: Football Players

Quarterback statistics: Comp, Att, Pct, Yds, Avg, Yds/G, TD, Int

Questions:

  1. Are there missing values? Which variables/observations?
  2. Omit observations with missing values. How many removed?
  3. Remove outliers: TD < 5 or Int > 20. How many removed?

Solution

Show solution
rm(list = ls())

Football_Players <- read_excel("Football_Players.xlsx")

a) Missing Values

Show solution
# Count total missing values
sum(is.na(Football_Players))
[1] 3
Show solution
# Answer: 3 values are missing.

# Find which variables have missing values
which(is.na(Football_Players$Comp))
integer(0)
Show solution
which(is.na(Football_Players$Att))
[1] 28
Show solution
which(is.na(Football_Players$Pct))
integer(0)
Show solution
which(is.na(Football_Players$Yds))
[1] 25
Show solution
which(is.na(Football_Players$Avg))
integer(0)
Show solution
which(is.na(Football_Players$TD))
integer(0)
Show solution
which(is.na(Football_Players$Int))
[1] 29
Show solution
which(is.na(Football_Players$'Yds/G'))
integer(0)

b) Remove Missing Values

Show solution
# sum(is.na(x)) gives the total number of missing values (used in part a)
# na.omit(x) drops every observation that contains at least one NA

Football_Players1 <- na.omit(Football_Players)

# Count observations removed
cat("Original observations:", nrow(Football_Players), "\n")
Original observations: 43 
Show solution
cat("After removing NA:", nrow(Football_Players1), "\n")
After removing NA: 40 
Show solution
cat("Observations removed:", nrow(Football_Players) - nrow(Football_Players1))
Observations removed: 3
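A hedged cross-check: complete.cases() flags rows with no missing values, so its negation lists exactly the rows that na.omit() removed (output omitted):

which(!complete.cases(Football_Players))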

c) Remove Outliers

Show solution
# We want to remove outlier cases where the players have < 5 touchdowns
# or > 20 interceptions

# Method 1: using the subset(x) function
# | means "or"
newdata1 <- subset(Football_Players, TD < 5 | Int > 20)
newdata1
# A tibble: 5 × 10
  Player Team          Comp   Att   Pct   Yds   Avg `Yds/G`    TD   Int
   <dbl> <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
1      9 Hummingbirds    55   112  49.1   544   4.9    136      1     3
2     10 Iguanas        123   224  54.9  1430   6.4    204.     4     6
3     18 Penguins       255   476  53.6  2894   6.1    193.    11    22
4     31 Sparrows        78   127  61.4   861   6.8    215.     4     5
5     34 Bulldogs        93   140  66.4   833   6      208.     4     5
Show solution
# The same subset, keeping only the TD and Int columns
newdata2 <- subset(Football_Players, TD < 5 | Int > 20, select = c(TD, Int))
newdata2
# A tibble: 5 × 2
     TD   Int
  <dbl> <dbl>
1     1     3
2     4     6
3    11    22
4     4     5
5     4     5
Show solution
# Answer: 5 observations can be removed from the data
# 4 observations had < 5 touchdowns
# 1 observation had > 20 interceptions
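The exercise only asks how many observations would be removed. If the cleaned dataset itself is needed, a minimal sketch keeps the complement of the outlier condition (the name Football_Players2 is illustrative; starting from Football_Players1 so the missing values from part b stay removed):

Football_Players2 <- subset(Football_Players1, !(TD < 5 | Int > 20))
nrow(Football_Players2)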

Exercise 31: Salaries

City of Seattle employee compensation data: Department, Job Title, Hourly Rate.

Questions:

  1. Split data by department
  2. Check for missing values in each subset

Solution

Show solution
rm(list = ls())

SALARIES <- read_excel("Salaries.xlsx")

# Count employees per department
table(SALARIES$Department)

                   Dept Of HR                       IT Dept 
                          106                           708 
             Public Utilities Sustainability & Environ Dept 
                         1398                            33 
Show solution
# Create subsets by department
subset1 <- subset(SALARIES, Department == 'Public Utilities')
subset2 <- subset(SALARIES, Department == 'Sustainability & Environ Dept')
subset3 <- subset(SALARIES, Department == 'IT Dept')
subset4 <- subset(SALARIES, Department == 'Dept Of HR')

# Check for missing values in each subset
cat("Missing values in Public Utilities:", sum(is.na(subset1)), "\n")
Missing values in Public Utilities: 0 
Show solution
cat("Missing values in Sustainability:", sum(is.na(subset2)), "\n")
Missing values in Sustainability: 0 
Show solution
cat("Missing values in IT Dept:", sum(is.na(subset3)), "\n")
Missing values in IT Dept: 2 
Show solution
cat("Missing values in HR Dept:", sum(is.na(subset4)))
Missing values in HR Dept: 0
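The four subsets above can also be produced and checked in one pass; a minimal sketch using split() and sapply() (the counts should agree with the four results above):

# Missing values per department, computed in one step
sapply(split(SALARIES, SALARIES$Department), function(d) sum(is.na(d)))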

Section 2.1: Visualizing Categorical Variables

Exercise 8: Millennials

Survey of 600 millennials about their faith: Very Religious, Somewhat Religious, Slightly Religious, Not Religious.

Questions:

  1. Construct frequency and relative frequency distributions. What is the most common response?
  2. Construct a bar chart
  3. Construct a pie chart. Are results consistent with earlier study (35% not religious)?

Solution

Show solution
# You need library(mosaic) for this exercise
rm(list = ls())

MILLENNIALS <- read_excel("Millennials.xlsx")

a) Frequency Distributions

Show solution
# Frequency distribution: table(x$variable)
# counts how many observations fall in each category
Frequency <- table(MILLENNIALS$Faith)
Frequency

     Not Religious Slightly Religious Somewhat Religious     Very Religious 
               210                129                183                 78 
Show solution
# Relative frequency: table(x$variable) / length(x$variable)
# divides each count by the total number of observations, so the shares sum to 1 (100%)
RelativeFrequency <- table(MILLENNIALS$Faith) / length(MILLENNIALS$Faith)
RelativeFrequency

     Not Religious Slightly Religious Somewhat Religious     Very Religious 
             0.350              0.215              0.305              0.130 
Show solution
# Answer: 'Not Religious' is the most common answer in the survey

b) Bar Chart

Show solution
# Horizontal bar chart
# Y-axis shows the variable values (e.g., not religious, religious, etc.)
barplot(Frequency,
        main = "Bar Chart for Faith Survey",
        horiz = TRUE,
        col = "pink",
        xlim = c(0, 250),
        cex.names = 0.6)
abline(v = 0)

Show solution
# Vertical bar chart with custom colors
collors <- c("red", "purple", "pink", "olivedrab")
barplot(Frequency,
        main = "Bar Chart for Faith Survey",
        col = collors,
        cex.names = 0.75)
abline(h = 0)
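table() orders the categories alphabetically. If the charts should instead run from least to most religious, the Faith variable can be turned into a factor with explicit levels first; a minimal sketch (the level names must match the spellings in the data exactly):

Faith_ordered <- factor(MILLENNIALS$Faith,
                        levels = c("Not Religious", "Slightly Religious",
                                   "Somewhat Religious", "Very Religious"))
Frequency_ordered <- table(Faith_ordered)
barplot(Frequency_ordered,
        main = "Bar Chart for Faith Survey (ordered)",
        col = collors,          # reuse the palette defined above
        cex.names = 0.75)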

c) Pie Chart

Show solution
# R software decides the colours used in the pie chart by default
pie(Frequency, main = "Pie Chart for Faith Survey")

Show solution
# Choosing the colours ourselves
# The colours are applied in the order of the table's categories
collors <- c("red", "purple", "pink", "olivedrab")
pie(Frequency, main = "Pie Chart for Faith Survey", col = collors)

Show solution
# Pie chart for relative frequency
pie(RelativeFrequency, main = "Pie Chart for Relative Frequency", col = collors)

Answer: Yes. Approximately 0.350 (35%) of the surveyed millennials identify as “Not Religious”, which is consistent with the 35% reported in the earlier study.


Section 2.2: Relationships Between Categorical Variables

Exercise 14: Bar

Survey at a local bar: customers identified their sex and drink choice (beer, wine, soft drink).

Questions:

  1. Construct contingency table. How many males? How many drank wine?
  2. P(beer | male)? P(beer | female)?

Solution

Show solution
rm(list = ls())

BAR <- read_excel("Bar.xlsx")

a) Contingency Table

Show solution
myTable <- table(BAR$Sex, BAR$Drink)
myTable
        
         beer soft drink wine
  female   38         10   20
  male    142         20   40
Show solution
# addmargins() adds row and column totals (the Sum row and column)
addmargins(myTable)
        
         beer soft drink wine Sum
  female   38         10   20  68
  male    142         20   40 202
  Sum     180         30   60 270
Show solution
# The same table expressed as proportions of the total number of customers
prop.table(myTable)
        
               beer soft drink       wine
  female 0.14074074 0.03703704 0.07407407
  male   0.52592593 0.07407407 0.14814815
Show solution
# Alternative arrangement
mynewTable <- table(BAR$Drink, BAR$Sex)
addmargins(mynewTable)
            
             female male Sum
  beer           38  142 180
  soft drink     10   20  30
  wine           20   40  60
  Sum            68  202 270

Visualization:

Show solution
barplot(myTable,
        legend = rownames(myTable),
        col = c("lavender", "lightblue"),
        beside = TRUE,
        main = "Drink Preferences by Sex")

Answer: There were 202 male customers. 60 customers drank wine.

b) Conditional Probabilities

  • P(beer | male) = 142 (male beer drinkers) / 202 (total male customers) = 0.703 (70.3%)
  • P(beer | female) = 38 (female beer drinkers) / 68 (total female customers) = 0.559 (55.9%)
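The same conditional probabilities can be read off in R by converting the contingency table to row proportions; margin = 1 conditions on Sex because Sex indexes the rows of myTable (a sketch, output omitted):

round(prop.table(myTable, margin = 1), 3)

The male row then shows 0.703 for beer and the female row 0.559, matching the hand calculation above.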

Section 2.3: Visualizing Numerical Variables

Exercise 33: Prime

Amazon Prime customer expenditures data.

Questions:

  1. Frequency & relative frequency distributions (intervals: 400-700, 700-1000, etc.)
  2. Construct histogram

Solution

Show solution
rm(list = ls())

PRIME <- read_excel("Prime.xlsx")

a) Frequency Distributions

Show solution
# Check the range of data
cat("Max expenditure:", max(PRIME$Expenditures), "\n")
Max expenditure: 2117 
Show solution
cat("Min expenditure:", min(PRIME$Expenditures))
Min expenditure: 467
Show solution
# Max = 2117 --> upper limit of the last interval has to be higher
# To create a frequency distribution using intervals with a certain width:
# use seq(lower limit of first, upper limit of last, by=width of interval) function

intervals <- seq(400, 2200, by = 300)

# Then, use cut() function
# left=FALSE, right=TRUE ensures intervals are open on the left
# and closed on the right (e.g., 400 < x <= 700)

intervals.cut <- cut(PRIME$Expenditures, intervals, left = FALSE, right = TRUE)
expenditure.freq <- table(intervals.cut)
expenditure.freq
intervals.cut
        (400,700]       (700,1e+03]   (1e+03,1.3e+03] (1.3e+03,1.6e+03] 
               10                28                66                50 
(1.6e+03,1.9e+03] (1.9e+03,2.2e+03] 
               38                 8 
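The 1e+03-style labels come from cut()'s default of three significant digits for the break points. If more readable interval labels are wanted, dig.lab can be raised; a sketch that leaves the stored intervals.cut untouched:

# Same frequency table, but with interval limits printed as e.g. (700,1000]
table(cut(PRIME$Expenditures, intervals, right = TRUE, dig.lab = 4))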
Show solution
# Relative frequency distribution table
relative.freq <- table(intervals.cut) / length(intervals.cut)
relative.freq
intervals.cut
        (400,700]       (700,1e+03]   (1e+03,1.3e+03] (1.3e+03,1.6e+03] 
             0.05              0.14              0.33              0.25 
(1.6e+03,1.9e+03] (1.9e+03,2.2e+03] 
             0.19              0.04 
Show solution
# Combined table with cumulative frequencies
finaltable <- transform(expenditure.freq,
                       Rel_Freq = prop.table(relative.freq),
                       Cum_Freq = cumsum(relative.freq))
finaltable
                      intervals.cut Freq Rel_Freq.intervals.cut Rel_Freq.Freq
(400,700]                 (400,700]   10              (400,700]          0.05
(700,1e+03]             (700,1e+03]   28            (700,1e+03]          0.14
(1e+03,1.3e+03]     (1e+03,1.3e+03]   66        (1e+03,1.3e+03]          0.33
(1.3e+03,1.6e+03] (1.3e+03,1.6e+03]   50      (1.3e+03,1.6e+03]          0.25
(1.6e+03,1.9e+03] (1.6e+03,1.9e+03]   38      (1.6e+03,1.9e+03]          0.19
(1.9e+03,2.2e+03] (1.9e+03,2.2e+03]    8      (1.9e+03,2.2e+03]          0.04
                  Cum_Freq
(400,700]             0.05
(700,1e+03]           0.19
(1e+03,1.3e+03]       0.52
(1.3e+03,1.6e+03]     0.77
(1.6e+03,1.9e+03]     0.96
(1.9e+03,2.2e+03]     1.00

b) Histogram

Show solution
hist(PRIME$Expenditures,
     breaks = intervals,
     right = TRUE,
     main = "Histogram for Prime Customer Expenditure",
     xlab = "Expenditure in $",
     col = "cyan")


Exercise 35: Texts

Weekly text messages sent by 150 teenagers.

Questions:

  1. Frequency distributions (intervals: 500-600, 600-700, etc.)
  2. Construct polygon - symmetric or skewed?
  3. Construct ogive - proportion sending > 850 texts?

Solution

Show solution
rm(list = ls())

TEXTS <- read_excel("Texts.xlsx")

a) Frequency Distributions

Show solution
cat("Min texts:", min(TEXTS$Texts), "\n")
Min texts: 504 
Show solution
cat("Max texts:", max(TEXTS$Texts))
Max texts: 980
Show solution
# max=980, make sure the upper limit of the last interval is higher than this value

# Creating a frequency distribution using intervals of a given width
intervals <- seq(500, 1000, by = 100)

# Use the cut() function to assign each observation to an interval
# right = TRUE makes intervals open on the left and closed on the right
intervals.cut <- cut(TEXTS$Texts, intervals, right = TRUE)
texts.freq <- table(intervals.cut)
texts.freq
intervals.cut
  (500,600]   (600,700]   (700,800]   (800,900] (900,1e+03] 
         27          33          30          33          27 
Show solution
# Relative frequency distribution table
relative.freq <- table(intervals.cut) / length(intervals.cut)
relative.freq
intervals.cut
  (500,600]   (600,700]   (700,800]   (800,900] (900,1e+03] 
       0.18        0.22        0.20        0.22        0.18 
Show solution
# Combined table with cumulative frequencies
finaltable <- transform(texts.freq,
                       Rel_Freq = prop.table(relative.freq),
                       Cum_Freq = cumsum(Freq))
finaltable
  intervals.cut Freq Rel_Freq.intervals.cut Rel_Freq.Freq Cum_Freq
1     (500,600]   27              (500,600]          0.18       27
2     (600,700]   33              (600,700]          0.22       60
3     (700,800]   30              (700,800]          0.20       90
4     (800,900]   33              (800,900]          0.22      123
5   (900,1e+03]   27            (900,1e+03]          0.18      150

b) Polygon

Show solution
# Create histogram and store the data
hist.score <- hist(TEXTS$Texts,
                   breaks = intervals,
                   right = TRUE,
                   main = 'Histogram with Polygon for Texts',
                   xlab = 'Texts',
                   col = "plum3",
                   border = "mediumpurple4")

# hist.score stores the break points (breaks), the bar midpoints (mids),
# and the frequency of each bar (counts)

# Add polygon line
x.axis <- c(min(TEXTS$Texts), hist.score$mids, max(TEXTS$Texts))
y.axis <- c(0, hist.score$counts, 0)
lines(x.axis, y.axis, type = 'l', lwd = 2)

c) Ogive (Cumulative Frequency Curve)

Show solution
# Inspect bins
bins <- table(cut(intervals,
                  breaks = seq(from = min(hist.score$breaks),
                              to = max(hist.score$breaks),
                              by = hist.score$breaks[2] - hist.score$breaks[1]),
                  right = TRUE))

# Create ogive data
ucl <- seq(from = min(hist.score$breaks),
          to = max(hist.score$breaks),
          by = hist.score$breaks[2] - hist.score$breaks[1])
ucl <- c(0, ucl[-1])

cf <- c(0, cumsum(hist.score$counts))

# Plot the Ogive
par(bg = "gray90")
plot(ucl, cf,
     type = "b",
     col = "blue",
     pch = 20,
     main = "Ogive for Text Messages",
     xlab = "Upper Class Limit",
     ylab = "Cumulative Frequency")
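Part (c) asks for the proportion of teenagers sending more than 850 texts. Reading this off the ogive amounts to linear interpolation between the 800 and 900 class limits; a minimal sketch using approx():

# Interpolated cumulative frequency at 850 texts
cf_850 <- approx(ucl, cf, xout = 850)$y
# Estimated proportion sending more than 850 texts
(length(TEXTS$Texts) - cf_850) / length(TEXTS$Texts)

With 90 teenagers at or below 800 and 33 in the (800, 900] class, the interpolated cumulative frequency at 850 is about 106.5, so roughly (150 - 106.5)/150 ≈ 0.29 of the teenagers send more than 850 texts, assuming observations are spread evenly within each class.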


Section 2.4: More Visualization Methods

Exercise 42: Car Price

Used sedan data: Price, Age, Mileage

Questions:

  1. Scatterplot: Price vs Age
  2. Scatterplot: Price vs Mileage
  3. Create Mileage_category (< 50,000 = Low, else High)
  4. Scatterplot with categories

Solution

Show solution
rm(list = ls())

Car_Price <- read_excel("Car_Price.xlsx")

a) Price vs Age

Show solution
plot(Car_Price$Price ~ Car_Price$Age,
     col = "blue",
     main = "Price vs Age",
     xlab = "Age (years)",
     ylab = "Price ($)",
     pch = 16)

Answer: There appears to be a negative relationship between the price of a car and its age: older cars tend to sell for less.

b) Price vs Mileage

Show solution
plot(Car_Price$Price ~ Car_Price$Miles,
     col = "orange",
     main = "Price vs Mileage",
     xlab = "Mileage",
     ylab = "Price ($)",
     pch = 16)

Answer: There appears to be a negative relationship between the price of a car and its mileage: cars with higher mileage tend to sell for less (high mileage also tends to go together with higher age).

c) Create Mileage Category

Show solution
# Convert to categorical variable using ifelse()
Car_Price$Miles_Cat <- ifelse(Car_Price$Miles < 50000, "Low_Mileage", "High_Mileage")
table(Car_Price$Miles_Cat)

High_Mileage  Low_Mileage 
           7           13 

Answer: 7 cars fall in the High Mileage category (mileage of 50,000 or more).

d) Scatterplot with Categories

Show solution
plot(Car_Price$Price ~ Car_Price$Age,
     pch = 16,
     col = ifelse(Car_Price$Miles_Cat == "Low_Mileage", "blue", "red"),
     main = "Price vs Age by Mileage Category",
     xlab = "Age (years)",
     ylab = "Price ($)")

legend("topright",
       legend = c("Low Mileage", "High Mileage"),
       pch = 16,
       col = c("blue", "red"))

Answer: The negative relationship between age and price is consistent for cars of both mileage categories.


Exercise 43: Internet Stocks

Compare Amazon (AMZN) and Alphabet (GOOG) stock performance 2016-2019.

Questions:

Construct a line chart showing both stocks over time. Which stock shows greater price appreciation?

Solution

Show solution
rm(list = ls())

Internet_Stocks <- read_excel("Internet_Stocks.xlsx")

# Check max values for y-axis scaling
cat("Max Amazon:", max(Internet_Stocks$AMZN), "\n")
Max Amazon: 2012.71 
Show solution
cat("Max Google:", max(Internet_Stocks$GOOG))
Max Google: 1337.02
Show solution
colors <- c("maroon", "navyblue")

plot(Internet_Stocks$AMZN ~ Internet_Stocks$Date,
     main = "Performance of Internet Stocks",
     type = "l",
     xlab = "Date",
     ylab = "Stock Price ($)",
     col = "maroon",
     ylim = c(0, 2100),
     lwd = 2)

lines(Internet_Stocks$GOOG ~ Internet_Stocks$Date,
      col = "navyblue",
      type = "l",
      lwd = 2)

legend("topleft",
       legend = c("Amazon", "Google"),
       col = c("maroon", "navyblue"),
       lty = 1,
       lwd = 2)

Answer: Amazon shows the greater price appreciation over the 2016-2019 period, peaking near $2,000 compared with roughly $1,300 for Google.
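Because the two stocks start from very different price levels, a hedged way to compare appreciation directly is to index both series to 100 at the first observation; a minimal sketch (it assumes the rows of Internet_Stocks are already in date order):

# Index both series so the first observation equals 100
amzn_index <- 100 * Internet_Stocks$AMZN / Internet_Stocks$AMZN[1]
goog_index <- 100 * Internet_Stocks$GOOG / Internet_Stocks$GOOG[1]

plot(amzn_index ~ Internet_Stocks$Date,
     type = "l", col = "maroon", lwd = 2,
     ylim = range(c(amzn_index, goog_index)),
     main = "Indexed Stock Performance (start = 100)",
     xlab = "Date", ylab = "Index")
lines(goog_index ~ Internet_Stocks$Date, col = "navyblue", lwd = 2)
legend("topleft", legend = c("Amazon", "Google"),
       col = c("maroon", "navyblue"), lty = 1, lwd = 2)

On this scale, the series with the higher ending index has appreciated more in percentage terms.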


Summary

Key Takeaways

  • Measurement scales: Nominal, Ordinal, Interval, Ratio
  • Data preparation: Sorting, handling missing values, removing outliers
  • Categorical visualization: Bar charts, pie charts, contingency tables
  • Numerical visualization: Histograms, polygons, ogives
  • Relationships: Scatterplots, line charts

Data Files Required

Make sure you have these Excel files in your working directory:

  • Major.xlsx
  • Fitness.xlsx
  • Football_Players.xlsx
  • Salaries.xlsx
  • Millennials.xlsx
  • Bar.xlsx
  • Prime.xlsx
  • Texts.xlsx
  • Car_Price.xlsx
  • Internet_Stocks.xlsx