Introduction

Project Recap

  • CSV Files

    • Content.csv
      • Content ID: Randomly generated number given to each post.
      • Content Type: Type of format the content is in.
      • Category: Genre the content fits in.
    • Reactions.csv
      • Content ID: Randomly generated number given to each post.
      • Type: Reaction type given.
      • Datetime: Year, Month, Day, Time, the reaction was placed.
    • ReactionTypes.csv
      • Type: Randomly generated number given to each post.
      • Sentiment: Positive or negative label.
      • Score: This is a “range” of number classified by Social Buzz that quantifies how “popular” each reaction is. 0…75: Disgust…Superlove.

  • Data Schematics



SQL Queries and Analysis

SQL queries are written in RMarkdown.

  • Libraries used
    • library(odbc)
    • library(DT)


Problem: How to join tables together?


Query Process:[Reactions.csv] is the main data set. [Content.csv] will left join [Reactions.csv] by [Content_ID]. [ReactionTypes.csv] will left join to [Reactions.csv] by [Type].



select 
  *
from Reactions
    left join Content on
    Content.Content_ID = Reactions.Content_ID
    left join ReactionTypes on
    ReactionTypes.Type = Reactions.Type


Conclusion: Tables are joined.




Problem: What are the top 5 categories that was has the most SUM(Score).


Query Process:It must be grouped by [Content.Category], then take the sum of the [ReactionTypes.Score]. The higher the score, the more popular it is.



select
    Top 5 Content.Category
    ,sum(Score) as Total_Score

from Reactions
    left join Content on
    Content.Content_ID = Reactions.Content_ID
    left join ReactionTypes on
    ReactionTypes.Type = Reactions.Type
group by 
    Content.Category
order by 
    Total_Score Desc


Conclusion: The animals, science, healthy eating, technology, and food seems to be the top 5 with the highest score.




Problem: How far does the data collected seems to span?. Whats the count of reactions(not scores) were present during each month/year?.


Query Process: Use sql datepart() function to separate [Reactions].[Datetime] into the month and year.



select 
    datepart(month,Datetime) as Month
    ,datepart(year,Datetime) as Year
    ,count(Reactions.Type) as Count_of_Reactions

from Reactions
    left join Content on
    Content.Content_ID = Reactions.Content_ID
    left join ReactionTypes on
    ReactionTypes.Type = Reactions.Type

group by 
    datepart(year,Datetime)
    ,datepart(MONTH,Datetime)
order by
    Count_of_Reactions desc


Conclusion: May, 2021 had the most amount of [Reactions].




Problem: For each month, from 2020 to 2021, what was the most dominant [Category] and whats count of reactions.


Query Process: First sub-query named temp will group the data by [Month],[Year], and, [Content.Category], then count() the number of reactions, named as [Count_of_Reactions].

Common Table Expression(CTE) named [partition] will RANK() the [Count_of_Reactions] high to low, 1,2,3…, but within separate Years and Months. From sub-query temp.

Select the columns I need, Month, Year, Category, Count_of_Reactions, from the CTE called partition, but the rank has to be 1.



with partition as (
select
Month
,Year
,Category
,Count_of_Reactions
,RANK() over (partition by Month, Year order by temp.Count_of_Reactions desc) Count_Rank

from
(
select 
    datepart(month,Datetime) as Month
    ,datepart(year,Datetime) as Year
    ,Content.Category
    ,count(ReactionTypes.Type) as Count_of_Reactions

from Reactions
    left join Content on
    Content.Content_ID = Reactions.Content_ID
    left join ReactionTypes on
    ReactionTypes.Type = Reactions.Type

group by 
    datepart(MONTH,Datetime)
    ,datepart(year,Datetime)
    ,Content.Category
) as temp
)

select
Month
,Year
,Category
,Count_of_Reactions
from partition
where Count_Rank = 1
order by Count_of_Reactions desc


Conclusion: Query results shows that during August/2020, science had the most reactions. On January/2021 it has the same amount, but it was for animals instead. On May/2021, the most reacted category was animals.




Problem: What method of upload was the most preferred?


Query Process: Use [Content] table, group by content type and do a count.



select
    Content_Type
    ,count(Content_ID) as Count
from 
    Content
group by 
    Content_Type
order by
  Count desc


Conclusion: It seems the most preferred way of upload was photos.