LA(final)

Author

Chinmayi(063) and Jaishree(088)

Learning Activity 1

Course Name: Exploratory Data Analysis

Course Code: 22ISE644

Academic Year: 2025 – 26, 6^th semester

Team-33

Team Name:The Outliers

Member1: Jaishree M(1NT23IS088) sec-B

Member2: Chinmayi K(1NT23IS063)sec-A

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

data1 <- read.csv("foodreview.csv")
library(tidyverse) # Load the essential library for data manipulation

Warning: package 'stringr' was built under R version 4.5.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.2     ✔ stringr   1.6.0
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.4     ✔ tidyr     1.3.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#  Create a data frame
df <- data.frame(data1)
# --- 1. SETUP & INITIAL INSPECTION ---
# 1. Load the library
library(tidyverse) 

# 2. View data structure
str(df)

'data.frame':   368459 obs. of  10 variables:
 $ Id                    : chr  "1" "2" "3" "4" ...
 $ ProductId             : chr  "B001E4KFG0" "B00813GRG4" "B000LQOCH0" "B000UA0QIQ" ...
 $ UserId                : chr  "A3SGXH7AUHU8GW" "A1D87F6ZCVE5NK" "ABXLMWJIXXAIN" "A395BORC6FGVXV" ...
 $ ProfileName           : chr  "delmartian" "dll pa" "Natalia Corres \"Natalia Corres\"" "Karl" ...
 $ HelpfulnessNumerator  : chr  "1" "0" "1" "3" ...
 $ HelpfulnessDenominator: chr  "1" "0" "1" "3" ...
 $ Score                 : chr  "5" "1" "4" "2" ...
 $ Time                  : chr  "1303862400" "1346976000" "1219017600" "1307923200" ...
 $ Summary               : chr  "Good Quality Dog Food" "Not as Advertised" "\"Delight\" says it all" "Cough Medicine" ...
 $ Text                  : chr  "I have bought several of the Vitality canned dog food products and have found them all to be of good quality. T"| __truncated__ "Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if t"| __truncated__ "This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - i"| __truncated__ "If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition t"| __truncated__ ...

# 3. View statistical summary
summary(df)

      Id             ProductId            UserId          ProfileName       
 Length:368459      Length:368459      Length:368459      Length:368459     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 HelpfulnessNumerator HelpfulnessDenominator    Score          
 Length:368459        Length:368459          Length:368459     
 Class :character     Class :character       Class :character  
 Mode  :character     Mode  :character       Mode  :character  
     Time             Summary              Text          
 Length:368459      Length:368459      Length:368459     
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character

# 4. View first 6 rows
head(df)

  Id  ProductId         UserId                     ProfileName
1  1 B001E4KFG0 A3SGXH7AUHU8GW                      delmartian
2  2 B00813GRG4 A1D87F6ZCVE5NK                          dll pa
3  3 B000LQOCH0  ABXLMWJIXXAIN Natalia Corres "Natalia Corres"
4  4 B000UA0QIQ A395BORC6FGVXV                            Karl
5  5 B006K2ZZ7K A1UQRSCLF8GW1T   Michael D. Bigham "M. Wassir"
6  6 B006K2ZZ7K  ADT0SRK1MGOEU                  Twoapennything
  HelpfulnessNumerator HelpfulnessDenominator Score       Time
1                    1                      1     5 1303862400
2                    0                      0     1 1346976000
3                    1                      1     4 1219017600
4                    3                      3     2 1307923200
5                    0                      0     5 1350777600
6                    0                      0     4 1342051200
                Summary
1 Good Quality Dog Food
2     Not as Advertised
3 "Delight" says it all
4        Cough Medicine
5           Great taffy
6            Nice Taffy
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Text
1                                                                                                                                                                                                                                                       I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
2                                                                                                                                                                                                                                                                                                                                Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
3 This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.
4                                                                                                                                                                                                                                                                                                   If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal.
5                                                                                                                                                                                                                                                                                                                                                                                  Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.
6                                                                                              I got a wild hair for taffy and ordered this five pound bag. The taffy was all very enjoyable with many flavors: watermelon, root beer, melon, peppermint, grape, etc. My only complaint is there was a bit too much red/black licorice-flavored pieces (just not my particular favorites). Between me, my kids, and my husband, this lasted only two weeks! I would recommend this brand of taffy -- it was a delightful treat.

# 5. View dimensions (Rows, Columns)
dim(df)

[1] 368459     10

# 6. View all column names
colnames(df)

 [1] "Id"                     "ProductId"              "UserId"                
 [4] "ProfileName"            "HelpfulnessNumerator"   "HelpfulnessDenominator"
 [7] "Score"                  "Time"                   "Summary"               
[10] "Text"

# --- 2. DATA CLEANING & TYPE CONVERSION ---

# Clean up the NAs created during conversion

# 7. Convert Score to numeric and view class

df$Score <- as.numeric(as.character(df$Score))

Warning: NAs introduced by coercion

df$Score[is.na(df$Score)] <- 0



# 8. Convert HelpfulnessNumerator and view head
df$HelpfulnessNumerator <- as.numeric(as.character(df$HelpfulnessNumerator))

Warning: NAs introduced by coercion

df$HelpfulnessNumerator[is.na(df$HelpfulnessNumerator)] <- 0


# 9. Convert HelpfulnessDenominator and view summary
df$HelpfulnessDenominator <- as.numeric(as.character(df$HelpfulnessDenominator))

Warning: NAs introduced by coercion

df$HelpfulnessDenominator[is.na(df$HelpfulnessDenominator)] <- 0



# 10. Ensure Time is numeric and view type
df$Time <- as.numeric(as.character(df$Time))

Warning: NAs introduced by coercion

df$Time[is.na(df$Time)] <- 0


# This should return 0 if all columns are successfully cleaned
sum(is.na(df$Score))

[1] 0

sum(is.na(df$HelpfulnessNumerator))

[1] 0

# Check the summary again - the "NA's" count should now be 0
summary(df$HelpfulnessDenominator)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 0.000e+00 0.000e+00 3.001e+00 1.000e+00 1.203e+05

# --- 3. ADDING AND DELETING COLUMNS ---
# 11. Add ReviewLength and view first few values
df <- df %>% mutate(ReviewLength = nchar(as.character(Text)))
head(df$ReviewLength)

[1] 263 190 509 219 140 416

# 12. Add TempCol and view columns to see it added
df <- df %>% mutate(TempCol = NA)
colnames(df)

 [1] "Id"                     "ProductId"              "UserId"                
 [4] "ProfileName"            "HelpfulnessNumerator"   "HelpfulnessDenominator"
 [7] "Score"                  "Time"                   "Summary"               
[10] "Text"                   "ReviewLength"           "TempCol"

# 13. Add IsAmazon and view table of values
df <- df %>% mutate(IsAmazon = TRUE)
table(df$IsAmazon)


  TRUE 
368459

# 14. Delete TempCol and verify it is gone
df <- df %>% select(-TempCol)
"TempCol" %in% colnames(df) # Should return FALSE

[1] FALSE

colnames(df)

 [1] "Id"                     "ProductId"              "UserId"                
 [4] "ProfileName"            "HelpfulnessNumerator"   "HelpfulnessDenominator"
 [7] "Score"                  "Time"                   "Summary"               
[10] "Text"                   "ReviewLength"           "IsAmazon"

# 15. Delete IsAmazon and ProfileName and view remaining columns
df <- df %>% select(-c(IsAmazon))
colnames(df)

 [1] "Id"                     "ProductId"              "UserId"                
 [4] "ProfileName"            "HelpfulnessNumerator"   "HelpfulnessDenominator"
 [7] "Score"                  "Time"                   "Summary"               
[10] "Text"                   "ReviewLength"

# --- 4. RENAMING AND REORDERING ---
# 16. Rename Score to Rating and view
df <- df %>% rename(Rating = Score)
names(df)

 [1] "Id"                     "ProductId"              "UserId"                
 [4] "ProfileName"            "HelpfulnessNumerator"   "HelpfulnessDenominator"
 [7] "Rating"                 "Time"                   "Summary"               
[10] "Text"                   "ReviewLength"

# 17. Rename ProductId to Product and view
df <- df %>% rename(Product = ProductId)
names(df)

 [1] "Id"                     "Product"                "UserId"                
 [4] "ProfileName"            "HelpfulnessNumerator"   "HelpfulnessDenominator"
 [7] "Rating"                 "Time"                   "Summary"               
[10] "Text"                   "ReviewLength"

# 18. Reorder columns (Front) and view top
df <- df %>% select(Id, Rating, Text, everything())
head(df, 2)

  Id Rating
1  1      5
2  2      1
                                                                                                                                                                                                                                                                     Text
1 I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
2                                                                          Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
     Product         UserId ProfileName HelpfulnessNumerator
1 B001E4KFG0 A3SGXH7AUHU8GW  delmartian                    1
2 B00813GRG4 A1D87F6ZCVE5NK      dll pa                    0
  HelpfulnessDenominator       Time               Summary ReviewLength
1                      1 1303862400 Good Quality Dog Food          263
2                      0 1346976000     Not as Advertised          190

# 19. Move last column to first and view headers
df <- df %>% select(last_col(), everything())
colnames(df)

 [1] "ReviewLength"           "Id"                     "Rating"                
 [4] "Text"                   "Product"                "UserId"                
 [7] "ProfileName"            "HelpfulnessNumerator"   "HelpfulnessDenominator"
[10] "Time"                   "Summary"

# --- 5. SUBSETTING (FILTERING & SELECTING) ---
# 20. Filter 5-star reviews and view row count
high_ratings <- df %>% filter(Rating == 5)
nrow(high_ratings)

[1] 175944

# 21. Filter helpful reviews and view summary
helpful_voted <- df %>% filter(HelpfulnessNumerator > 0)
summary(helpful_voted$HelpfulnessNumerator)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
1.000e+00 1.000e+00 2.000e+00 7.033e+00 3.000e+00 1.109e+05

# 22. Filter long reviews and view minimum length
long_reviews <- df %>% filter(ReviewLength > 500)
min(long_reviews$ReviewLength)

[1] 501

# 23. Select 2 different columns (Product, Helpfulness)
df_simple <- df %>% select(Product, HelpfulnessNumerator)

# View the new structure to see the column types and first few entries
#str(df_simple)

# 24. Slice first 25 rows and view dimensions
df_sample <- df %>% slice(1:25)
dim(df_sample)

[1] 25 11

# 25. Randomly select 3 rows and view them
df_random <- df %>% slice_sample(n = 3)
print(df_random)

  ReviewLength        Id Rating
1          359    119324      4
2            5  backache      0
3          337    111208      2
                                                                                                                                                                                                                                                                                                                                                                     Text
1 I would give the coffee 5 stars if there was even a hint of mocha flavor. Since this is the primary reason I bought the coffee, this is a disappointment. I will say it is a great tasting coffee - has a really "clean" dark roast taste...bold, without tasting bitter or burnt. I would recommend this coffee, just don't buy it if you are expecting a mocha taste.
2                                                                                                                                                                                                                                                                                                                                                                    coma
3                       When I purchased this there was not a 'warning' review that this would contain decaf coffee.  *sorry to all the decaf lovers out there*<br /><br />I guess if you like decaf and regular coffee you will like this sampler - I would have LOVED this sampler had there been NO decaf!  They really should state that this sampler contains decaf!
           Product             UserId                          ProfileName
1       B001E5DZX4     A3H0U6S46IR4LP                    Ace-Reviewer"Tom"
2  kidney problems  oedema (swelling)  necrosis (cell death). If swallowed
3       B0060KOF7I     A27AC2IN1HE41Y     Melinda F. Friddell "m friddell"
  HelpfulnessNumerator HelpfulnessDenominator       Time
1                    0                      1 1312156800
2                    0                      0          0
3                    0                      0 1340582400
                            Summary
1 Great coffee, but no Mocha flavor
2               respiratory failure
3            1/3 Decaf - NOT HAPPY!

# --- 6. FACTOR MANIPULATION ---
# 26. Convert Rating to factor and view
df$Rating <- as.factor(df$Rating)
is.factor(df$Rating)

[1] TRUE

# 27. View levels of the Rating factor
levels(df$Rating)

 [1] "0"      "1"      "2"      "3"      "4"      "5"      "6"      "7"     
 [9] "7.25"   "8"      "9"      "14"     "20"     "30"     "31"     "35"    
[17] "190"    "210"    "304"    "343.75" "365"    "800"    "1985"   "2006"  
[25] "2010"   "2011"   "2012"   "2015"   "30713"  "30813"  "42513"  "52313" 
[33] "60813"  "62813"  "110812"

# 28. Change order of levels and view new order
df$Rating <- factor(df$Rating, levels = c("5", "4", "3", "2", "1"))
levels(df$Rating)

[1] "5" "4" "3" "2" "1"

# 29. Reorder levels by review length and view
df$Rating <- reorder(df$Rating, df$ReviewLength, FUN = mean)
levels(df$Rating)

[1] "5" "1" "2" "4" "3"

# 30. Drop unused levels and view
df$Rating <- droplevels(df$Rating)
levels(df$Rating)

[1] "5" "1" "2" "4" "3"

# --- 7. RECODING & TEXT ---
# 31. Rename factor levels and view
levels(df$Rating) <- c("Top", "High", "Mid", "Low", "Bottom")
levels(df$Rating)

[1] "Top"    "High"   "Mid"    "Low"    "Bottom"

# 32. Recode to Sentiment and view distribution
df$Sentiment <- recode(df$Rating, "Top"="Positive", "High"="Positive", .default="Negative")
table(df$Sentiment)


Positive Negative 
  201426    73610

# 33. Recode Length to Category and view table
df$TextSize <- ifelse(df$ReviewLength > 200, "Long", "Short")
table(df$TextSize)


  Long  Short 
215692 152767

# 34. Uppercase Summary and view first 5
df$Summary <- toupper(df$Summary)
head(df$Summary, 5)

[1] "GOOD QUALITY DOG FOOD"   "NOT AS ADVERTISED"      
[3] "\"DELIGHT\" SAYS IT ALL" "COUGH MEDICINE"         
[5] "GREAT TAFFY"

# 35. Clean HTML tags and view a sample text
df$Text <- str_replace_all(df$Text, "<br />", " ")
head(df$Text, 1)

[1] "I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most."

# --- 8. TRANSFORMING ---
# 36. Square root of length and view
df <- df %>% mutate(SqrtLen = sqrt(ReviewLength))
head(df$SqrtLen)

[1] 16.21727 13.78405 22.56103 14.79865 11.83216 20.39608

# 37. Log transform and view
df <- df %>% mutate(LogHelpful = log(HelpfulnessNumerator + 1))
head(df$LogHelpful)

[1] 0.6931472 0.0000000 0.6931472 1.3862944 0.0000000 0.0000000

# 38. Centering ReviewLength and view mean (should be ~0)
df <- df %>% mutate(CenteredLen = ReviewLength - mean(ReviewLength, na.rm=T))
mean(df$CenteredLen, na.rm=T)

[1] 4.245801e-15

# 39. Standardize (Z-score) and view summary
df$StandardizedLen <- as.vector(scale(df$ReviewLength))
summary(df$StandardizedLen)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.8887 -0.5618 -0.2955  0.0000  0.2153 28.9658

# --- 9. GROUPING & SUMMARIZING ---
# 40. Group by Rating (View internal grouping)
df_grouped <- df %>% group_by(Rating)
groups(df_grouped)

[[1]]
Rating

# 41. Average length per group and view table
df_summary <- df_grouped %>% summarise(AvgLength = mean(ReviewLength))
print(df_summary)

# A tibble: 6 × 2
  Rating AvgLength
  <fct>      <dbl>
1 Top         401.
2 High        485.
3 Mid         487.
4 Low         499.
5 Bottom      518.
6 <NA>        164.

# 42. Count reviews per product and view top 5
df_counts <- df %>% group_by(Product) %>% summarise(ReviewCount = n())
head(df_counts)

# A tibble: 6 × 2
  Product                                                            ReviewCount
  <chr>                                                                    <int>
1 ""                                                                          97
2 "\n100050,B000LQORDE,A7MXWFCWVXJZE,H. Smith Beautacentric\",2,9,5…           1
3 "\n104958,B0067K02QY,A203FM2T6A396Q,Matthew Higgins,0,0,5,1346889…           1
4 "\n179714,B005DDCKHK,A2B79X7HROANPC,Shelly Mathie,0,0,5,131500800…           1
5 "\n183425,B004U43ZO0,A8ZEIMTD8NDOV,C. Kleist,1,2,2,1308528000,I h…           1
6 "\n183697,B000E1DSSQ,AZM2TYEC6IYZ,Kim Witzke,1,1,5,1305676800,Our…           1

# 43. SD per group and view table
df_stats <- df %>% group_by(Rating) %>% summarise(SD_Len = sd(ReviewLength))
print(df_stats)

# A tibble: 6 × 2
  Rating SD_Len
  <fct>   <dbl>
1 Top      406.
2 High     537.
3 Mid      454.
4 Low      485.
5 Bottom   475.
6 <NA>     208.

# 44. Standard Error table and view
df_se <- df %>% group_by(Rating) %>% 
         summarise(sd=sd(ReviewLength), n=n()) %>% 
         mutate(se=sd/sqrt(n))
print(df_se)

# A tibble: 6 × 4
  Rating    sd      n    se
  <fct>  <dbl>  <int> <dbl>
1 Top     406. 175944 0.968
2 High    537.  25482 3.36 
3 Mid     454.  14251 3.81 
4 Low     485.  38912 2.46 
5 Bottom  475.  20447 3.32 
6 <NA>    208.  93423 0.679

# --- 10. RESHAPING ---
# 45. Tiny wide subset and view
df_tiny <- df %>% select(Id, HelpfulnessNumerator, HelpfulnessDenominator) %>% head(5)
print(df_tiny)

  Id HelpfulnessNumerator HelpfulnessDenominator
1  1                    1                      1
2  2                    0                      0
3  3                    1                      1
4  4                    3                      3
5  5                    0                      0

# 46. Gather (Wide to Long) and view
df_long <- df_tiny %>% gather(key="Metric", value="Val", 2:3)
print(df_long)

   Id                 Metric Val
1   1   HelpfulnessNumerator   1
2   2   HelpfulnessNumerator   0
3   3   HelpfulnessNumerator   1
4   4   HelpfulnessNumerator   3
5   5   HelpfulnessNumerator   0
6   1 HelpfulnessDenominator   1
7   2 HelpfulnessDenominator   0
8   3 HelpfulnessDenominator   1
9   4 HelpfulnessDenominator   3
10  5 HelpfulnessDenominator   0

# 47. Spread (Long to Wide) and view
df_wide <- df_long %>% spread(key="Metric", value="Val")
print(df_wide)

  Id HelpfulnessDenominator HelpfulnessNumerator
1  1                      1                    1
2  2                      0                    0
3  3                      1                    1
4  4                      3                    3
5  5                      0                    0

# --- 11. TIME SERIES & FINAL ---
# 48. Convert Unix to Date and view
df$Date <- as.POSIXct(df$Time, origin="1970-01-01")
head(df$Date)

[1] "2011-04-27 05:30:00 IST" "2012-09-07 05:30:00 IST"
[3] "2008-08-18 05:30:00 IST" "2011-06-13 05:30:00 IST"
[5] "2012-10-21 05:30:00 IST" "2012-07-12 05:30:00 IST"

# 49. Extract Year and view counts per year
df$Year <- format(df$Date, "%Y")
table(df$Year)


 1970  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010 
93437     2     6     4    17    49   276   668  3244 11226 16612 26952 42248 
 2011  2012 
78421 95297

# 50. Remove all rows that contain at least one NA value and SAVE as df_final
df_final <- na.omit(df) 

# Check the dimensions of your finished dataset
dim(df_final)

[1] 275035     19

# 51. Create a simple table showing how many reviews happened each day
daily_counts <- df %>% 
  group_by(Date = as.Date(Date)) %>% 
  summarise(TotalReviews = n())

# View the first 10 days of activity
head(daily_counts, 10)

# A tibble: 10 × 2
   Date       TotalReviews
   <date>            <int>
 1 1970-01-01        93435
 2 1970-01-02            2
 3 1999-10-08            1
 4 1999-10-25            1
 5 2000-06-23            1
 6 2000-06-29            1
 7 2000-07-31            1
 8 2000-08-09            2
 9 2000-12-05            1
10 2001-02-22            1