SPS_Data607_Week2_2A - David Chen

SQL and R – Movie Ratings

Collect simple movie-rating data, store it in a SQL database, and analyze it in R.

Implementation Notes:

PostgreSQL is the recommended (and supported) relational database for this assignment. If you are unable to install and configure PostgreSQL in your environment, you may find SQLite easier to use.
Students experienced with relational databases may use any local or cloud-based database for their assignment. NEVER include passwords in your code.
For this assignment, the ability to reproduce your work in my environment is not required, but you need to provide all of the necessary code, include the code used to create and populate tables.

Task Description

Select six recent popular movies (or television episodes or books or songs or …) Ask at least five people to rate each movie they have seen on a 1–5 scale. Store the collected ratings in a SQL database of your choice. Load the data from SQL into R as a dataframe.

Approach

I need to demonstrate how to use an SQL connector in R and execute SQL code within R to read and write data in a database. It also shows how SQL can be used to perform queries. So I will have to know the libraries for connector and how to run SQL in R.

Installation

Created a new Ubuntu 25.10 CT in PVE node.

apt update && apt upgrade
apt install postgresql
sudo -u postgres psql

Adding login password

ALTER USER postgres PASSWORD ‘YourStrongPassword’;

nano /etc/postgresql/*/main/postgresql.conf
nano /etc/postgresql/*/main/pg_hba.conf
ufw allow 5432/tcp

update these 2 files and open firewall to allow pgadmin4 remote access to this CT.

also allowing password to login database.

Now ask ChatGPT to create this simple database based on the requirements.

CREATE TABLE movies (
    movie_id SERIAL PRIMARY KEY,
    title VARCHAR(100) NOT NULL
);
CREATE TABLE ratings (
    rating_id SERIAL PRIMARY KEY,
    movie_id INT REFERENCES movies(movie_id),
    rater_name VARCHAR(50) NOT NULL,
    rating INT CHECK (rating BETWEEN 1 AND 5)
);
INSERT INTO movies (title) VALUES
('Avatar: The Way of Water'),
('Oppenheimer'),
('Barbie'),
('Stranger Things S5'),
('The Marvels'),
('Killers of the Flower Moon');
INSERT INTO ratings (movie_id, rater_name, rating) VALUES
(1, 'Alice', 5),
(1, 'Bob', 4),
(1, 'Charlie', 3),
(1, 'David', 4),
(1, 'Eve', 5),

(2, 'Alice', 4),
(2, 'Bob', 5),
(2, 'Charlie', 4),
(2, 'David', 3),
(2, 'Eve', 4),

(3, 'Alice', 3),
(3, 'Bob', 4),
(3, 'Charlie', 5),
(3, 'David', 3),
(3, 'Eve', 4),

(4, 'Alice', 5),
(4, 'Bob', 5),
(4, 'Charlie', 4),
(4, 'David', 4),
(4, 'Eve', 5),

(5, 'Alice', 3),
(5, 'Bob', 4),
(5, 'Charlie', 3),
(5, 'David', 4),
(5, 'Eve', 3),

(6, 'Alice', 4),
(6, 'Bob', 4),
(6, 'Charlie', 5),
(6, 'David', 5),
(6, 'Eve', 4);

After this step , manually delete some values to null

Swtich to RStudio and connect to PostgreSQL

#install.packages("DBI")       # Generic database interface
#install.packages("RPostgres") # PostgreSQL driver

library(DBI)
library(RPostgres)

con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "chatgpt_c",    # your database name
  host = "192.168.100.61",          # or server IP
  port = 5432,                 # default PostgreSQL port
  user = "postgres",           # your DB username
  password = "ubuntu"    # your DB password
)

dbListTables(con)

[1] "movies"  "ratings"

query <- "SELECT m.title, r.rater_name, r.rating
          FROM movies m
          JOIN ratings r ON m.movie_id = r.movie_id
          ORDER BY m.movie_id, r.rater_name;"

data <- dbGetQuery(con, query)
print(data)

                        title rater_name rating
1    Avatar: The Way of Water      Alice      5
2    Avatar: The Way of Water        Bob      4
3    Avatar: The Way of Water    Charlie      3
4    Avatar: The Way of Water      David      4
5    Avatar: The Way of Water        Eve     NA
6                 Oppenheimer      Alice      4
7                 Oppenheimer        Bob      5
8                 Oppenheimer    Charlie      4
9                 Oppenheimer      David      3
10                Oppenheimer        Eve      4
11                     Barbie      Alice      3
12                     Barbie        Bob     NA
13                     Barbie    Charlie      5
14                     Barbie      David      3
15                     Barbie        Eve      4
16         Stranger Things S5      Alice      5
17         Stranger Things S5        Bob      5
18         Stranger Things S5    Charlie      4
19         Stranger Things S5      David     NA
20         Stranger Things S5        Eve      5
21                The Marvels      Alice      3
22                The Marvels        Bob      4
23                The Marvels    Charlie      3
24                The Marvels      David      4
25                The Marvels        Eve      3
26 Killers of the Flower Moon      Alice     NA
27 Killers of the Flower Moon        Bob      4
28 Killers of the Flower Moon    Charlie      5
29 Killers of the Flower Moon      David      5
30 Killers of the Flower Moon        Eve      4

Option1: Replacing NA with “0”

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
data <- data%>%
  mutate(rating = ifelse(is.na(rating), 0, rating))
print(data)

                        title rater_name rating
1    Avatar: The Way of Water      Alice      5
2    Avatar: The Way of Water        Bob      4
3    Avatar: The Way of Water    Charlie      3
4    Avatar: The Way of Water      David      4
5    Avatar: The Way of Water        Eve      0
6                 Oppenheimer      Alice      4
7                 Oppenheimer        Bob      5
8                 Oppenheimer    Charlie      4
9                 Oppenheimer      David      3
10                Oppenheimer        Eve      4
11                     Barbie      Alice      3
12                     Barbie        Bob      0
13                     Barbie    Charlie      5
14                     Barbie      David      3
15                     Barbie        Eve      4
16         Stranger Things S5      Alice      5
17         Stranger Things S5        Bob      5
18         Stranger Things S5    Charlie      4
19         Stranger Things S5      David      0
20         Stranger Things S5        Eve      5
21                The Marvels      Alice      3
22                The Marvels        Bob      4
23                The Marvels    Charlie      3
24                The Marvels      David      4
25                The Marvels        Eve      3
26 Killers of the Flower Moon      Alice      0
27 Killers of the Flower Moon        Bob      4
28 Killers of the Flower Moon    Charlie      5
29 Killers of the Flower Moon      David      5
30 Killers of the Flower Moon        Eve      4

data %>%
  group_by(rater_name) %>%
  summarise(ratings_count = n())

# A tibble: 5 × 2
  rater_name ratings_count
  <chr>              <int>
1 Alice                  6
2 Bob                    6
3 Charlie                6
4 David                  6
5 Eve                    6

data %>%
  group_by(title) %>%
  summarise(ratings_count = n())

# A tibble: 6 × 2
  title                      ratings_count
  <chr>                              <int>
1 Avatar: The Way of Water               5
2 Barbie                                 5
3 Killers of the Flower Moon             5
4 Oppenheimer                            5
5 Stranger Things S5                     5
6 The Marvels                            5

Average rating per movie

data %>%
  group_by(title) %>%
  summarise(avg_rating = round(mean(rating, na.rm = TRUE), 2))

# A tibble: 6 × 2
  title                      avg_rating
  <chr>                           <dbl>
1 Avatar: The Way of Water          3.2
2 Barbie                            3  
3 Killers of the Flower Moon        3.6
4 Oppenheimer                       4  
5 Stranger Things S5                3.8
6 The Marvels                       3.4

avg_ratings <- data %>%
  group_by(title) %>%
  summarise(avg_rating = round(mean(rating, na.rm = TRUE), 2))
  ggplot(avg_ratings, aes(x = reorder(title, -avg_rating), y = avg_rating)) +
  geom_col(fill = "steelblue") +          # create the bars
  geom_text(aes(label = avg_rating), vjust = -0.5) +  # show value on top
  labs(title = "Average Ratings per Movie",
       x = "Movie",
       y = "Average Rating") +
  ylim(0, 5) +                            # rating scale 1-5
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Average rating per user

data %>%
  group_by(rater_name) %>%
  summarise(avg_rating = round(mean(rating, na.rm = TRUE), 2))

# A tibble: 5 × 2
  rater_name avg_rating
  <chr>           <dbl>
1 Alice            3.33
2 Bob              3.67
3 Charlie          4   
4 David            3.17
5 Eve              3.33

avg_ratings <- data %>%
  group_by(rater_name) %>%
  summarise(avg_rating = round(mean(rating, na.rm = TRUE), 2))
  ggplot(avg_ratings, aes(x = reorder(rater_name, -avg_rating), y = avg_rating)) +
  geom_col(fill = "steelblue") +          # create the bars
  geom_text(aes(label = avg_rating), vjust = -0.5) +  # show value on top
  labs(title = "Average Ratings per User",
       x = "Name",
       y = "Average Rating") +
  ylim(0, 5) +                            # rating scale 1-5
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))