The source for this assignment was the Happiness 2020 CSV file, which comes from the Sustainable Development Solutions Network. The file has twenty variables and one hundred fifty-three observations, and I used four of those variables to create my visualizations. The first variable I chose was “Healthy life expectancy,” the average number of years that a person can expect to live in “full health.” It is an estimate that can be used as a measure of quality of life. This variable is quantitative. The second variable was “Ladder score,” also known as the happiness score. It is a metric used by the United Nations Sustainable Development Solutions Network to quantify the happiness of citizens in a country. This variable is also quantitative. The third variable was the “Regional indicator,” which divides the world into regions such as East Asia, Western Europe, and North America and ANZ. This variable is qualitative. The fourth variable, also qualitative, was the country name provided in the dataset.
To start, I loaded two packages: tidyverse, which includes ggplot2, the package I used to create my plots, and RColorBrewer, which I used to color them. I then downloaded the Happiness 2020 CSV file onto my laptop. In RStudio, I used the “Session” menu, then “Set Working Directory” and “Choose Directory,” to point RStudio at the folder containing the file so that it could read the file.
I had many ideas for how to present a visualization, but I decided to start with simple graphics and build from there. I created five plots, from plot zero to plot four. In plot zero, I compared the Ladder score and Healthy life expectancy of two countries, Brazil and the United States. In Plot1, instead of analyzing countries, I analyzed regional indicators. The regional indicators I chose were “Latin America and Caribbean” and “North America and ANZ,” using the same two variables. In Plot2, I kept both variables, but instead of analyzing regional indicators I manually filtered the largest countries in Latin America and gave each one a distinct color from the “Set1” palette, with a legend listing the countries. I liked the results of Plot2, but I decided to narrow the data further, so I created Plot3. Plot3 shows the three happiest and the three least happy of the largest countries in Latin America. To collect that information, I first had to filter just the largest countries in Latin America. To do this I created a new object named “largestcountriessouthamerica.” Its first line reads from the happiness data, and the pipe operator “%>%,” which can be read as “and then,” connects the first line to the second. In the second line, I used the filter function on the country names together with the “%in%” operator and “c,” which combines values into a vector, to list the largest countries in Latin America. This filtering brought the data down to nine observations. After filtering the largest countries in Latin America, I filtered a second time.
For this second filter I used the same functions as before, but I named the new object “happiest,” and it reads from my new data frame of the largest countries in Latin America. I again used the pipe, filter, the “%in%” operator, and “c” with the country names. For a third filter I created another object named “leasthappy,” for which I chose the three least happy of the largest countries in Latin America. Both the happiest and the least happy objects have three observations and twenty variables. After filtering the happiest and least happy countries, I created Plot3. In Plot3 I used Healthy life expectancy as my X, Ladder score as my Y, and the country name as the label, which prints each country's name on its dot in the graphic. I chose geom_point to create my visualization. For colors, I chose blue with size five to show the three happiest countries and red with size two to show the three least happy ones. My goal was to identify the happiest and least happy of the largest countries in Latin America and see their quality of life. I like Plot3 a lot, but I created Plot4 to keep everything from Plot3 and add a legend. Plot4 was not exactly what I was looking for, so I created Plot6, which has all the information in Plot3 plus a legend.
In this step I loaded the tidyverse package, which includes ggplot2, and the RColorBrewer package, which I used for color, and I read in the Happiness 2020 CSV file that supplied the data for my plots.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
setwd("C:/Users/aline/Downloads/Data 110 W 6pm Rachel Saidi/Happiness2020")
happiness <- read_csv("happiness2020.csv")
## Rows: 153 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country name, Regional indicator
## dbl (18): Ladder score, Standard error of ladder score, upperwhisker, lowerw...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In this step, I choose the data from the happiness dataset that I will use to create my plots. The dataset has 153 observations, but I do not need all of them, so I use “filter” to keep only the rows I need.
First, I filtered for the two countries I wanted in my first plot and named the result happy.
happy <- happiness %>%
filter(`Country name` %in% c("Brazil","United States"))
After filtering by country name, I plotted happy with the variables Healthy life expectancy and Ladder score to see which country was happier and had a longer healthy life expectancy. To differentiate the two countries, I used the Dark2 color palette.
plot <- happy %>%
ggplot(aes(x = `Healthy life expectancy`, y = `Ladder score`, color = `Country name`)) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
theme_bw()
plot
There is a distinct difference between Brazil and the United States. Brazil's Ladder score is below 6.4 and its Healthy life expectancy is around 66.5 years, whereas the United States has a Ladder score close to 7 and a Healthy life expectancy of around 70 years.
This time I filtered by Regional indicator to compare Latin America and Caribbean with North America and ANZ. I named the result happy2.
happy2 <- happiness %>%
filter(`Regional indicator` %in% c("Latin America and Caribbean", "North America and ANZ"))
This time, instead of analyzing countries as I did in plot 0, I used the happy2 data that I created to compare the two regional indicators, Latin America and Caribbean and North America and ANZ.
plot1 <- happy2 %>%
ggplot(aes(x = `Healthy life expectancy`, y = `Ladder score`, color = `Regional indicator`)) +
geom_point() +
geom_line() +
scale_color_brewer(palette = "Dark2") +
theme_bw() +
ggtitle("Latin America and Caribbean Versus North America and ANZ in
Happiness and Life Expectancy")
plot1
The results show that North America and ANZ has a higher Ladder score and Healthy life expectancy than Latin America and Caribbean.
I decided to be more specific and analyze the largest countries in Latin America to see whether there is a large difference in their Ladder score and Healthy life expectancy. I created a new object named largestcountriessouthamerica and manually filtered those countries using %in% with c(), which combines the country names into a vector.
largestcountriessouthamerica <- happiness %>%
filter(`Country name` %in% c("Brazil","Argentina","Peru", "Colombia","Bolivia","Venezuela","Chile","Paraguay","Ecuador"))
As a result, I got nine observations, and I used this new dataset, largestcountriessouthamerica, to create the following plots.
I created plot2 from largestcountriessouthamerica, using the variables Healthy life expectancy and Ladder score. Each country is shown in a color from the “Set1” palette, and the legend lists the country names with their corresponding colors.
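The filtering step can be tried on a small scale before running it on the full file. The sketch below is only an illustration: the data frame `toy`, its invented country names, and its scores are hypothetical and do not come from the Happiness 2020 file. It shows how `c()` builds a vector of names and how `%in%` keeps the rows whose name appears in that vector.

```r
library(dplyr)

# Hypothetical toy data; the names and scores are invented for illustration.
toy <- tibble::tibble(
  `Country name` = c("Aland", "Borduria", "Costaguana", "Dinotopia"),
  `Ladder score` = c(6.1, 5.2, 4.8, 7.0)
)

# c() combines the names into a character vector.
wanted <- c("Aland", "Costaguana")

# %in% is TRUE for each row whose country name appears in `wanted`,
# and filter() keeps only those rows.
toy_subset <- toy %>%
  filter(`Country name` %in% wanted)

# Number of rows (observations) that were kept.
nrow(toy_subset)
```

Checking `nrow()` after each filter is a quick way to confirm that every name in the vector matched a row, since a misspelled name is silently dropped.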
plot2 <- largestcountriessouthamerica %>%
ggplot(aes(x = `Healthy life expectancy`, y = `Ladder score`, color = `Country name`)) +
geom_line() +
geom_point(size = 5) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
ggtitle(" The Happiest and Life Expectancy Scores in
the Largest Countries in Latin America")
plot2
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
The result was a colorful graphic in which each country is represented by one color. This visualization shows that Brazil is the happiest of the largest countries in Latin America, with a healthy life expectancy of around 66.5 years. The least happy country was Venezuela, yet its healthy life expectancy is about the same as that of Brazil, the happiest country, which is interesting. I thought the happiest country would also have a higher healthy life expectancy. However, this is not the case.
After filtering the largest countries in Latin America, I filtered the three happiest of them by country name. To do that I created an object called “happiest”.
happiest <- largestcountriessouthamerica %>%
filter(`Country name` %in% c("Brazil", "Chile", "Colombia"))
From this filter I got three observations, one for each of the three countries.
In addition to filtering the happiest of the largest countries, I also filtered the three least happy ones, naming the object “leasthappy.”
leasthappy <- largestcountriessouthamerica %>%
filter(`Country name` %in% c("Venezuela", "Paraguay", "Bolivia"))
From this filter I also got three observations, which are the countries I was looking for.
I created plot3 using geom_point with the largestcountriessouthamerica data and the variables Healthy life expectancy and Ladder score. I used the “happiest” data in blue at size 5 to show the three happiest countries and the “leasthappy” data in red at size 2 to show the three least happy ones. I used theme_classic for a white background, the label aesthetic to name each point by country, and ggtitle for the title.
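I picked the happiest and least happy countries by typing their names. An alternative, which I did not use in this assignment, is to let dplyr choose them by score with `slice_max()` and `slice_min()`. The sketch below uses a hypothetical data frame `toy` with invented names and scores, so it only illustrates the idea.

```r
library(dplyr)

# Hypothetical toy data; names and scores are invented for illustration only.
toy <- tibble::tibble(
  `Country name` = c("A", "B", "C", "D", "E", "F", "G"),
  `Ladder score` = c(6.4, 5.1, 6.2, 4.9, 5.8, 6.0, 5.5)
)

# Keep the three rows with the highest Ladder score (highest first).
top3 <- toy %>%
  slice_max(`Ladder score`, n = 3)

# Keep the three rows with the lowest Ladder score (lowest first).
bottom3 <- toy %>%
  slice_min(`Ladder score`, n = 3)

top3
bottom3
```

With this approach the selection would update automatically if the list of countries changed, whereas the manual list has to be retyped.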
plot3 <- largestcountriessouthamerica %>%
ggplot(aes(x = `Healthy life expectancy`, y = `Ladder score`, label = `Country name`)) +
geom_point(data = happiest,
col = "#1b98e0",
size = 5) +
geom_point(data = leasthappy,
col = "red",
size = 2) +
geom_text() +
theme_classic() +
ggtitle("Top Three Happiest and The Three Least Happy
Largest Countries in Latin America")
plot3
The result was great. Even my sister was able to read the graphic well without much explanation. However, Plot3 does not have a legend.
I created a visualization where anyone can see the three happiest and the three least happy of the largest countries in Latin America. This information is helpful to people who are considering visiting Latin America or are looking for a warm and friendly place to live. I used ggplot to create the plot, with “Healthy life expectancy” as my quantitative “X” variable and “Ladder score” as my quantitative “Y” variable. I used geom_point for the graphic and labeled the dots with the country names, which are a qualitative variable. I also used blue at size five to show the three happiest countries and red at a smaller size two to show the three least happy ones. Plus, I used theme_classic for a white background and ggtitle for my title. In addition, I tried to create a legend by entering the country names and the colors manually.
plot4 <- largestcountriessouthamerica %>%
ggplot(aes(x = `Healthy life expectancy`, y = `Ladder score`, label = `Country name`, color = `Country name`)) +
geom_point(data = happiest,
col = "#1b98e0",
size = 5) +
geom_point(data = leasthappy,
col = "red",
size = 2) +
geom_text() +
theme_classic() +
#scale_color_discrete(name = "Largest Countries in Latin American") +
ggtitle("The Top Three Happiest and
Least Happy Largest Countries in Latin America") +
scale_color_manual(name = "Largest Countries in Latin America",
breaks = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Paraguay", "Peru", "Venezuela"),
values = c("black", "red", "blue", "blue", "blue", "black", "red", "black", "red"))
plot4
I was able to create a legend with a title and the country names. I tried to give each country in the legend its respective color: blue for the happiest countries (Brazil, Chile, and Colombia) and red for the least happy ones. However, the legend shows the letter “a” instead of dots, and I am not sure where it comes from.
I found this example online and decided to use it to try to create the legend for my plot.
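A likely explanation for the letter “a” is that ggplot2 draws the legend key for a geom_text layer as the letter “a”. In plot4 the two geom_point layers use fixed colors, so geom_text is the only layer that maps color and therefore the only one that appears in the legend. The sketch below uses a hypothetical data frame `toy` (not the happiness data) to show one way to get dots instead: let a point layer map color and turn the text layer's legend off with `show.legend = FALSE`.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only.
toy <- data.frame(
  x = c(1, 2, 3),
  y = c(3, 1, 2),
  name = c("A", "B", "C")
)

p <- ggplot(toy, aes(x = x, y = y, label = name, color = name)) +
  # This point layer inherits the color mapping, so the legend shows dots.
  geom_point(size = 3) +
  # The text layer still draws the labels but is left out of the legend,
  # which removes the "a" glyph from the legend keys.
  geom_text(vjust = -1, show.legend = FALSE) +
  theme_classic()

p
```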
data <- data.frame(Xdata = rnorm(10),
Ydata = rnorm(10),
LegendData = c("ld-01", "ld-02", "ld-03",
"ld-04", "ld-05", "ld-06",
"ld-07", "ld-08", "ld-09",
"ld-10"))
# Create a Scatter Plot and change
# the size of legend
ggplot(data, aes(Xdata, Ydata, color = LegendData)) +
geom_point()+
guides(color = guide_legend(override.aes = list(size = 10)))
I used everything from plot4 in the new plot6. The difference is that instead of entering the legend information manually with scale_color_manual, I added another geom_point layer. As in the example above, I used guides() with color = guide_legend() and a size override to create the legend.
plot6 <- largestcountriessouthamerica %>%
ggplot(aes(x = `Healthy life expectancy`, y = `Ladder score`, label = `Country name`, color = `Country name`)) +
geom_point(data = happiest,
col = "#1b98e0",
size = 5) +
geom_point(data = leasthappy,
col = "red",
size = 2) +
geom_text() +
theme_classic() +
geom_point()+
guides(color = guide_legend(override.aes = list(size = 5)))+
ggtitle("The Top Three Happiest and Three
Least Happy Largest Countries in Latin America")
plot6
The result is the same graphic as plot4, but with a legend in which each country corresponds to one color. We can still see the three happiest countries in blue and the three least happy ones in red, and the letter “a” no longer appears.
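Another way to get a legend that matches the blue and red groups, which I did not use here, would be to add a grouping column and map color to it. The sketch below is an assumption-laden illustration: the data frame `toy`, the column `Group`, and all values are hypothetical and invented, and only the blue hex code is taken from my plots.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; names and values are invented for illustration only.
toy <- tibble::tibble(
  `Country name` = c("A", "B", "C", "D", "E"),
  `Healthy life expectancy` = c(66, 69, 64, 67, 68),
  `Ladder score` = c(6.4, 6.2, 5.1, 5.7, 4.9)
)

happiest_names <- c("A", "B")
least_names <- c("C", "E")

# Label each row with the group it belongs to.
toy_grouped <- toy %>%
  mutate(Group = case_when(
    `Country name` %in% happiest_names ~ "Happiest",
    `Country name` %in% least_names ~ "Least happy",
    TRUE ~ "Other"
  ))

p <- ggplot(toy_grouped,
            aes(x = `Healthy life expectancy`, y = `Ladder score`)) +
  # Color is mapped to the group, so the legend has one key per group.
  geom_point(aes(color = Group), size = 4) +
  # Labels are drawn in the default color and do not add a legend entry.
  geom_text(aes(label = `Country name`), vjust = -1) +
  scale_color_manual(values = c("Happiest" = "#1b98e0",
                                "Least happy" = "red",
                                "Other" = "black")) +
  theme_classic()

p
```

The legend then has three entries instead of nine, which states the comparison between the happiest and least happy countries directly.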
Plot6 shows the three happiest of the largest countries in Latin America in blue: Brazil, Colombia, and Chile. They all have a Ladder score above 6. Brazil has the highest Ladder score, and its healthy life expectancy is around 66 years. Plot6 also shows the three least happy ones: Bolivia, Paraguay, and Venezuela. Venezuela is the least happy country. It surprised me to see that Venezuela has about the same healthy life expectancy as Brazil, considering that Brazil has the highest Ladder score and Venezuela has the lowest. I thought the happiest country would have the highest healthy life expectancy, but that does not always seem to be the case. I also did not expect Chile to have the highest healthy life expectancy.
I wonder which factors besides the Ladder score influence this result, and what Chile is doing that allows it to score well on both variables. I am happy to see that Brazil got the highest Ladder score, but I am concerned about its healthy life expectancy. Brazil is the country where I was born, and my parents still live there. I wonder which factors affect the Ladder score and healthy life expectancy. One variable I could include in my analysis is “Perception of corruption,” to see whether it affects “Healthy life expectancy.” I could also compare the “Standard error of ladder score” with the “Ladder score” to see how precise the Ladder score is.