#load packages
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages ------------------------------------------------------------------------------------------- tidyverse 1.3.2 --
v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.8     v dplyr   1.0.9
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.1.2     v forcats 0.5.1
-- Conflicts ---------------------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
d <- read_csv("isbell_2019_background_data.csv")
Rows: 198 Columns: 45
-- Column specification ------------------------------------------------------------------------------------------------------------
Delimiter: ","
chr (12): ID, Home_Country, Gender, Current_Student, Highest_Ed, LangDom_1, LangDom_2, LangDom_3, LangDom_4, LangDom_5, LangDom_...
dbl (33): Birthyear, Lang2_percent, Lang3_percent, Lang4_percent, Lang5_percent, LangOthr_percent, Korean_jth_lang, Age_Start_Ko...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
There are several ways to count participants by gender. First up: the table() function
table(d$Gender)
F M
174 24
Or use count(), which returns the counts as a data frame:
d %>% count(Gender)
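If you also want proportions alongside the counts, one option is to add a column after count(). This is a minimal sketch on a made-up toy data frame (`toy` and its values are assumptions for illustration, not the real data); the same two lines should work on `d`.

```r
library(dplyr)

# Toy stand-in for d; these values are invented for illustration
toy <- tibble(Gender = c("F", "F", "F", "M"))

# count() gives one row per category; mutate() adds each category's share
gender_props <- toy %>%
  count(Gender) %>%
  mutate(prop = n / sum(n))

gender_props
```

A nice side effect of count() over table() is that the result is a regular data frame, so it can be piped straight into further dplyr steps like this one.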
First, create the Age variable. Here are two equivalent ways of doing this (you only need one):
d$Age <- 2018 - d$Birthyear
d <- d %>% mutate(Age = 2018 - Birthyear)
Summary stats:
d %>% summarise(mean = mean(Age),
sd = sd(Age),
median = median(Age),
min = min(Age),
max = max(Age))
Time for a histogram. We’ll make it decent-looking with ggplot2:
d %>% ggplot(aes(x = Age))+
geom_histogram(binwidth = 1)+ #binwidth sets the width of each bin (bar) of the histogram; here, one year of age
theme_bw()
An easy way to list the distinct home countries is unique()
unique(d$Home_Country)
[1] "Germany" "China" "Taiwan" "Hong Kong" "Russia" "El Salvador" "Sri Lanka" "Mexico"
[9] "Turkmenistan" "Ecuador" "Vietnam" "Singapore" "Malaysia" "Indonesia" "Kazahkstan" "Japan"
[17] "France" "Philippines" "Turkey" "Iran" "Thailand" "Italy" "Belarus" "Bangladesh"
[25] "Ukraine" "Chile" "Columbia" "Brazil" "Mongolia" "Kyrgyzstan" "Uzbekistan" "Azerbaijan"
[33] "Peru" "USA" "Bermuda" "Spain"
36 unique home countries represented in the data!
To count how many participants come from each country, we’ll use count() again. You could also use table(), but that can get messy with so many countries.
d %>% count(Home_Country)
This prints one row per country, which is a lot to read, but you could find the top 5 manually. To make that easier, you could send the result to a viewing window by adding %>% view() to that last line of code.
Fancier solution:
d %>% count(Home_Country) %>% arrange(desc(n)) %>% slice_head(n = 5)
What that code did was arrange the output from count(Home_Country) in descending order of n and then grab just the top 5 rows.
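As a side note, count() can do the sorting itself via its sort argument, which saves the arrange() step. Here is a small sketch comparing the two forms on a made-up toy data frame (`toy` and the letter "countries" are assumptions for illustration only):

```r
library(dplyr)

# Toy stand-in for d; the country labels are invented for illustration
toy <- tibble(Home_Country = c("A", "B", "A", "C", "A", "B"))

# Long form: count, sort descending, keep the first rows
top_long <- toy %>%
  count(Home_Country) %>%
  arrange(desc(n)) %>%
  slice_head(n = 2)

# Shorter form: let count() sort by n in descending order
top_short <- toy %>%
  count(Home_Country, sort = TRUE) %>%
  slice_head(n = 2)

top_short
```

One thing to keep in mind: slice_head() always returns exactly the number of rows you ask for, so if several countries tie for 5th place, some of them get cut. If you would rather keep ties, slice_max(n, n = 5) is worth a look.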
For this, you can copy-paste the summary code from Task 2 and swap in the new variable names.
#KorSpk
d %>% summarise(mean = mean(KorSpk),
sd = sd(KorSpk),
median = median(KorSpk),
min = min(KorSpk),
max = max(KorSpk))
#KorLis
d %>% summarise(mean = mean(KorLis),
sd = sd(KorLis),
median = median(KorLis),
min = min(KorLis),
max = max(KorLis))
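Copy-pasting works fine for two variables. If you had many more, one possible alternative is to reshape the data to long format and summarise by group. The sketch below uses a made-up toy data frame (`toy` and its scores are assumptions, not the real ratings) and assumes tidyr is loaded, which it is if you loaded the tidyverse.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for d; the scores are invented for illustration
toy <- tibble(KorSpk = c(1, 2, 3, 4),
              KorLis = c(2, 4, 6, 8))

# Stack the two columns into one, then compute the same stats per variable
toy_summary <- toy %>%
  pivot_longer(c(KorSpk, KorLis), names_to = "variable", values_to = "score") %>%
  group_by(variable) %>%
  summarise(mean = mean(score),
            sd = sd(score),
            median = median(score),
            min = min(score),
            max = max(score))

toy_summary
```

The result has one row per variable, which also makes the two sets of stats easier to compare side by side.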
Scatterplot time:
d %>% ggplot(aes(x = KorLis, y = KorSpk))+
geom_point()+
theme_bw()
This works, but many of the points overlap, so you don’t see all 198 of them. One way to fix this is to introduce some “jitter”, a slight random adjustment of each point’s position for visualization purposes:
d %>% ggplot(aes(x = KorLis, y = KorSpk))+
geom_jitter()+
theme_bw()
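By default geom_jitter() picks the amount of jitter for you. If the points drift too far from their true values, you can set it yourself and add some transparency. This is only a sketch on a made-up toy data frame; the specific numbers (0.2 and 0.5) are arbitrary assumptions to tweak, not recommended settings.

```r
library(ggplot2)

# Toy stand-in for d with deliberately overlapping points (invented values)
toy <- data.frame(KorLis = c(1, 1, 2, 2, 3, 3),
                  KorSpk = c(1, 1, 2, 2, 3, 3))

p <- ggplot(toy, aes(x = KorLis, y = KorSpk)) +
  # width/height cap how far points may move; alpha makes overlaps show up darker
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.5) +
  theme_bw()

p
```

Because the jitter is random, the plot looks slightly different each time you run it. Another option for heavily overlapping points is geom_count(), which scales point size by the number of observations at each location instead of moving them.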
Here’s something interesting… “Motiv_Love” is self-reported motivation to learn Korean due to having a Korean-speaking significant other.
d %>% ggplot(aes(x = Motiv_Love, y = KorSpk))+
geom_jitter()+
theme_bw()
Doesn’t look like much of a relationship!
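If you want a number to go with the eyeball test, a correlation coefficient is one option. The sketch below uses a made-up toy data frame (`toy` and its values are assumptions, so the result says nothing about the real data). Spearman's rank correlation is used here on the assumption that the self-report ratings are ordinal; whether that is the right choice for this dataset is a judgment call.

```r
# Toy stand-in for d; values are invented for illustration
toy <- data.frame(Motiv_Love = c(1, 2, 3, 4, 5),
                  KorSpk     = c(3, 1, 4, 2, 3))

# Rank-based correlation; use = "complete.obs" drops rows with missing values
r <- cor(toy$Motiv_Love, toy$KorSpk,
         method = "spearman", use = "complete.obs")

r
```

To try it on the real data, swap in d$Motiv_Love and d$KorSpk. Values near 0 would line up with what the plot suggests.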