CSV Files
Data Schematics
SQL queries are written in RMarkdown.
Problem: How to join tables together?
Query Process:[Reactions.csv] is the main data set. [Content.csv] will left join [Reactions.csv] by [Content_ID]. [ReactionTypes.csv] will left join to [Reactions.csv] by [Type].
select
*
from Reactions
left join Content on
Content.Content_ID = Reactions.Content_ID
left join ReactionTypes on
ReactionTypes.Type = Reactions.Type
Conclusion: Tables are joined.
Problem: What are the top 5 categories that was has the
most SUM(Score).
Query Process:It must be grouped by [Content.Category], then take the sum of the [ReactionTypes.Score]. The higher the score, the more popular it is.
select
Top 5 Content.Category
,sum(Score) as Total_Score
from Reactions
left join Content on
Content.Content_ID = Reactions.Content_ID
left join ReactionTypes on
ReactionTypes.Type = Reactions.Type
group by
Content.Category
order by
Total_Score Desc
Conclusion: The animals, science, healthy eating,
technology, and food seems to be the top 5 with the highest
score.
Problem: How far does the data collected seems to
span?. Whats the count of reactions(not scores) were present during each
month/year?.
Query Process: Use sql datepart() function to separate [Reactions].[Datetime] into the month and year.
select
datepart(month,Datetime) as Month
,datepart(year,Datetime) as Year
,count(Reactions.Type) as Count_of_Reactions
from Reactions
left join Content on
Content.Content_ID = Reactions.Content_ID
left join ReactionTypes on
ReactionTypes.Type = Reactions.Type
group by
datepart(year,Datetime)
,datepart(MONTH,Datetime)
order by
Count_of_Reactions desc
Conclusion: May, 2021 had the most amount of
[Reactions].
Problem: For each month, from 2020 to 2021, what was
the most dominant [Category] and whats count of reactions.
Query Process: First sub-query named temp
will group the data by [Month],[Year], and, [Content.Category], then
count() the number of reactions, named as [Count_of_Reactions].
Common Table Expression(CTE) named [partition] will
RANK() the [Count_of_Reactions] high to low, 1,2,3…, but within separate
Years and Months. From sub-query temp.
Select the columns I need, Month, Year, Category,
Count_of_Reactions, from the CTE called partition, but the rank has to
be 1.
with partition as (
select
Month
,Year
,Category
,Count_of_Reactions
,RANK() over (partition by Month, Year order by temp.Count_of_Reactions desc) Count_Rank
from
(
select
datepart(month,Datetime) as Month
,datepart(year,Datetime) as Year
,Content.Category
,count(ReactionTypes.Type) as Count_of_Reactions
from Reactions
left join Content on
Content.Content_ID = Reactions.Content_ID
left join ReactionTypes on
ReactionTypes.Type = Reactions.Type
group by
datepart(MONTH,Datetime)
,datepart(year,Datetime)
,Content.Category
) as temp
)
select
Month
,Year
,Category
,Count_of_Reactions
from partition
where Count_Rank = 1
order by Count_of_Reactions desc
Conclusion: Query results shows that during
August/2020, science had the most reactions. On January/2021 it has the
same amount, but it was for animals instead. On May/2021, the most
reacted category was animals.
Problem: What method of upload was the most
preferred?
Query Process: Use [Content] table, group by content type and do a count.
select
Content_Type
,count(Content_ID) as Count
from
Content
group by
Content_Type
order by
Count desc
Conclusion: It seems the most preferred way of upload
was photos.