Exercise 1: Math score and Read score
Load data file and set random seed
## Loading required package: Ecfun
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
## distcod county district grspan enrltot teachers
## 1 75119 Alameda Sunol Glen Unified KK-08 195 10.90
## 2 61499 Butte Manzanita Elementary KK-08 240 11.15
## 3 61549 Butte Thermalito Union Elementary KK-08 1550 82.90
## 4 61457 Butte Golden Feather Union Elementary KK-08 243 14.00
## 5 61523 Butte Palermo Union Elementary KK-08 1335 71.50
## 6 62042 Fresno Burrel Union Elementary KK-08 137 6.40
## calwpct mealpct computer testscr compstu expnstu str avginc
## 1 0.5102 2.0408 67 690.80 0.3435898 6384.911 17.88991 22.690001
## 2 15.4167 47.9167 101 661.20 0.4208333 5099.381 21.52466 9.824000
## 3 55.0323 76.3226 169 643.60 0.1090323 5501.955 18.69723 8.978000
## 4 36.4754 77.0492 85 647.70 0.3497942 7101.831 17.35714 8.978000
## 5 33.1086 78.4270 171 640.85 0.1280899 5235.988 18.67133 9.080333
## 6 12.3188 86.9565 25 605.55 0.1824818 5580.147 21.40625 10.415000
## elpct readscr mathscr
## 1 0.000000 691.6 690.0
## 2 4.583333 660.5 661.9
## 3 30.000002 636.3 650.9
## 4 0.000000 651.9 643.5
## 5 13.857677 641.8 639.9
## 6 12.408759 605.7 605.4
## ─ Attaching packages ────────────────────────── tidyverse 1.3.0 ─
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ─ Conflicts ─────────────────────────── tidyverse_conflicts() ─
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## # A tibble: 45 x 17
## # Groups: county [45]
## distcod county district grspan enrltot teachers calwpct mealpct computer
## <int> <fct> <fct> <fct> <int> <dbl> <dbl> <dbl> <int>
## 1 75119 Alame… Sunol G… KK-08 195 10.9 0.510 2.04 67
## 2 61457 Butte Golden … KK-08 243 14 36.5 77.0 85
## 3 61572 Calav… Mark Tw… KK-08 777 36.8 13.0 39.8 148
## 4 61762 Contr… Oakley … KK-08 4153 200. 7.53 23.4 241
## 5 61911 El Do… Latrobe… KK-08 145 9.30 6.21 0 27
## 6 62331 Fresno Orange … KK-08 379 19 32.2 93.1 35
## 7 62596 Glenn Lake El… KK-08 129 5 9.30 50.4 10
## 8 62745 Humbo… Cutten … KK-06 515 27.4 16.1 26.2 58
## 9 63172 Imper… Magnoli… KK-08 108 5 3.67 32.1 25
## 10 63255 Inyo Bishop … KK-08 1510 73.3 10.3 37.1 141
## # … with 35 more rows, and 8 more variables: testscr <dbl>, compstu <dbl>,
## # expnstu <dbl>, str <dbl>, avginc <dbl>, elpct <dbl>, readscr <dbl>,
## # mathscr <dbl>
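The tibble above has one row for each of 45 counties, which suggests that one district was drawn at random per county after the seed was set. The code that produced it is not shown, so the following is only a sketch under that assumption. The seed value is made up, and the data frame `d` holds the six districts from the head() output with a few columns kept.

```r
library(dplyr)

# Six districts copied from the head() output above; only a few columns kept.
d <- data.frame(
  distcod = c(75119, 61499, 61549, 61457, 61523, 62042),
  county  = c("Alameda", "Butte", "Butte", "Butte", "Butte", "Fresno"),
  readscr = c(691.6, 660.5, 636.3, 651.9, 641.8, 605.7),
  mathscr = c(690.0, 661.9, 650.9, 643.5, 639.9, 605.4)
)

set.seed(2020)  # assumed seed; the original value is not shown

# Draw one district at random within each county.
one_per_county <- d %>%
  group_by(county) %>%
  sample_n(1)

print(one_per_county)
```

Which Butte district is drawn depends on the seed, so fixing the seed is what makes the sample reproducible.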
Scatter plot of readscr against mathscr
## function (x, data, ...)
## UseMethod("xyplot")
## <bytecode: 0x7fca97adba30>
## <environment: namespace:lattice>
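The output above is the printed definition of `xyplot`, which is what R shows when a function name is evaluated without arguments, so no plot appears here. A minimal sketch of a call that would draw the scatter plot, assuming the lattice package and using the six districts from the head() output as stand-in data:

```r
library(lattice)

# Stand-in data: reading and math scores of the six districts printed earlier.
d <- data.frame(
  readscr = c(691.6, 660.5, 636.3, 651.9, 641.8, 605.7),
  mathscr = c(690.0, 661.9, 650.9, 643.5, 639.9, 605.4)
)

# "g" adds a grid, "p" the points, "r" a least-squares line.
p <- xyplot(readscr ~ mathscr, data = d, type = c("g", "p", "r"),
            xlab = "Math score", ylab = "Reading score")
print(p)
```

The axis labels are illustrative choices, not taken from the original report.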
Exercise 2: 133 class-level 95%-confidence intervals for language test score
Load data file
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:Ecdat':
##
## SP500
## lang IQ class GS SES COMB
## 1 46 15.0 180 29 23 0
## 2 45 14.5 180 29 10 0
## 3 33 9.5 180 29 15 0
## 4 46 11.0 180 29 23 0
## 5 20 8.0 180 29 10 0
## 6 30 9.5 180 29 10 0
Compute the class-level 95% interval bounds for lang (mean ± 1.96 × SD) and store them as new columns
dta_n <- dta %>%
  mutate(classID = factor(class, levels = levels(class), labels = seq_along(levels(class)))) %>%
  group_by(classID) %>%
  summarize(language_mean = mean(lang),
            language_lb = language_mean - 1.96 * sd(lang),
            language_ub = language_mean + 1.96 * sd(lang)) %>%
  as.data.frame()
tail(dta_n, 3)
## classID language_mean language_lb language_ub
## 131 131 38.09091 26.953191 49.22863
## 132 132 29.30000 3.264031 55.33597
## 133 133 28.42857 14.762009 42.09513
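The bounds above are mean ± 1.96 × SD, which describes the spread of individual scores within a class. A confidence interval for the class mean is usually built from the standard error, SD/√n, instead. A hedged sketch of that variant follows. It uses only the six rows of class 180 printed earlier (the full class may contain more pupils) and keeps the normal multiplier 1.96. The column names `n_pupils` and `language_se` are introduced here for illustration.

```r
library(dplyr)

# First six rows of class 180, copied from the head() output.
d <- data.frame(
  lang  = c(46, 45, 33, 46, 20, 30),
  class = factor(rep("180", 6))
)

ci <- d %>%
  group_by(class) %>%
  summarize(n_pupils = n(),
            language_mean = mean(lang),
            language_se = sd(lang) / sqrt(n_pupils),  # standard error of the mean
            language_lb = language_mean - 1.96 * language_se,
            language_ub = language_mean + 1.96 * language_se) %>%
  as.data.frame()

print(ci)
```

For classes this small, replacing 1.96 with `qt(0.975, n_pupils - 1)` would give a wider, t-based interval.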
Exercise 3: Prestige, income, and education by type of occupation
Load data file
## Loading required package: carData
##
## Attaching package: 'carData'
## The following object is masked from 'package:Ecdat':
##
## Mroz
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
First, find the median prestige score for each of the three types of occupation.
Then use these medians to classify each occupation as High or Low prestige within its type. Summarize the relationship between income and education for each category formed by crossing the prestige level with the type of occupation.
Prestige %>%
  group_by(type) %>%
  mutate(pt_med = median(prestige),
         # occupations at or below their type's median are labelled "Low",
         # so a score equal to the median no longer becomes NA
         pt_type = case_when(prestige > pt_med ~ "High",
                             prestige <= pt_med ~ "Low")) %>%
  xyplot(income ~ education | type, groups = pt_type, data = ., type = c("g", "p", "r"))
## Warning: Factor `type` contains implicit NA, consider using
## `forcats::fct_explicit_na`
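The plot shows the income-education relationship graphically. A numeric summary for each type-by-prestige cell could complement it. The sketch below is an assumption about what such a summary might look like, not the author's code. It uses only the six occupations printed above (all of type prof), so the full Prestige data would produce more cells. The column name `r_income_education` is hypothetical.

```r
library(dplyr)

# Six occupations copied from the head() output; all are of type "prof".
d <- data.frame(
  education = c(13.11, 12.26, 12.77, 11.42, 14.62, 15.64),
  income    = c(12351, 25879, 9271, 8865, 8403, 11030),
  prestige  = c(68.8, 69.1, 63.4, 56.8, 73.5, 77.6),
  type      = "prof"
)

res <- d %>%
  group_by(type) %>%
  # split at the within-type median; ties would go to "Low"
  mutate(pt_type = ifelse(prestige > median(prestige), "High", "Low")) %>%
  group_by(type, pt_type) %>%
  summarize(n = n(),
            r_income_education = cor(income, education)) %>%
  as.data.frame()

print(res)
```

With three occupations per cell the correlations are not reliable; the point is only the grouping pattern.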
Exercise 4: Nobel Laureates
dta_1 <- read.table("/Users/haolunfu/Documents/資料管理/week5/nobel_countries.txt", header = TRUE)
head(dta_1)
## Country Year
## 1 France 2014
## 2 UK 1950
## 3 UK 2017
## 4 US 2016
## 5 Canada 2013
## 6 China 2012
dta_2 <- read.table("/Users/haolunfu/Documents/資料管理/week5/nobel_winners.txt", header = TRUE)
head(dta_2)
## Name Gender Year
## 1 Patrick Modiano Male 2014
## 2 Bertrand Russell Male 1950
## 3 Kazuo Ishiguro Male 2017
## 4 Bob Dylan Male 2016
## 5 Alice Munro Female 2013
## 6 Mo Yan Male 2012
## Joining, by = "Year"
## Country Year Name Gender
## 1 UK 2017 Kazuo Ishiguro Male
## 2 US 2016 Bob Dylan Male
## 3 Russia 2015 <NA> <NA>
## 4 France 2014 Patrick Modiano Male
## 5 Canada 2013 Alice Munro Female
## 6 China 2012 Mo Yan Male
## 7 Sweden 2011 <NA> <NA>
## 8 UK 1950 Bertrand Russell Male
## 9 <NA> 1938 Pearl Buck Female
First, merge the nobel_countries and nobel_winners datasets by Year.
Then sort the merged rows by Year in descending order.
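The code for these two steps is not shown. The merged output has NA on both sides (countries without a winner and a winner without a country), which is consistent with a full join, so the sketch below assumes `full_join()` followed by `arrange(desc(Year))`. The two data frames are reconstructed from the printed output; the names `countries`, `winners`, and `merged` are illustrative, and the position of the Russia and Sweden rows in the input is a guess.

```r
library(dplyr)

# Reconstructed from the head() and merged output shown above.
countries <- data.frame(
  Country = c("France", "UK", "UK", "US", "Canada", "China", "Russia", "Sweden"),
  Year    = c(2014, 1950, 2017, 2016, 2013, 2012, 2015, 2011)
)
winners <- data.frame(
  Name   = c("Patrick Modiano", "Bertrand Russell", "Kazuo Ishiguro",
             "Bob Dylan", "Alice Munro", "Mo Yan", "Pearl Buck"),
  Gender = c("Male", "Male", "Male", "Male", "Female", "Male", "Female"),
  Year   = c(2014, 1950, 2017, 2016, 2013, 2012, 1938)
)

# Keep every year from either table, then sort from newest to oldest.
merged <- full_join(countries, winners, by = "Year") %>%
  arrange(desc(Year))

print(merged)
```

An `inner_join()` would instead drop 2015, 2011, and 1938, since those years appear in only one of the two tables.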
Exercise 5
Load data file
fL <- "http://www.amstat.org/publications/jse/datasets/sat.dat.txt"
dta <- read.table(fL, row.names=1)
head(dta)
## V2 V3 V4 V5 V6 V7 V8
## Alabama 4.405 17.2 31.144 8 491 538 1029
## Alaska 8.963 17.6 47.951 47 445 489 934
## Arizona 4.778 19.3 32.175 27 448 496 944
## Arkansas 4.459 17.1 28.934 6 482 523 1005
## California 4.992 24.0 41.078 45 417 485 902
## Colorado 5.443 18.4 34.571 29 462 518 980
Rename the variables and add a Region column
names(dta) <- c("Spending", "PTR", "Salary", "PE", "Verbal", "Math", "SAT")
dta$Region <- state.division
head(dta)
## Spending PTR Salary PE Verbal Math SAT Region
## Alabama 4.405 17.2 31.144 8 491 538 1029 East South Central
## Alaska 8.963 17.6 47.951 47 445 489 934 Pacific
## Arizona 4.778 19.3 32.175 27 448 496 944 Mountain
## Arkansas 4.459 17.1 28.934 6 482 523 1005 West South Central
## California 4.992 24.0 41.078 45 417 485 902 Pacific
## Colorado 5.443 18.4 34.571 29 462 518 980 Mountain
Draw a scatter plot to examine how the relationship between Salary and SAT scores varies across Region.
The results show a negative relationship in the West North Central, Mountain, New England, East South Central, and West South Central regions.
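One way to check the sign of the relationship in each region numerically is to fit a least-squares line of SAT on Salary within every region and look at the slope. The sketch below assumes that this is what "negative" refers to. The helper `slope_by_region` is hypothetical, and the data frame holds only the six states printed above, which is far too few to reproduce the conclusion; it would need to be run on the full `dta`.

```r
# Six states copied from the head() output above.
d <- data.frame(
  Salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  SAT    = c(1029, 934, 944, 1005, 902, 980),
  Region = c("East South Central", "Pacific", "Mountain",
             "West South Central", "Pacific", "Mountain")
)

# Hypothetical helper: slope of SAT on Salary within each region.
slope_by_region <- function(data) {
  sapply(split(data, data$Region), function(g) {
    if (nrow(g) < 2) return(NA_real_)  # a slope needs at least two states
    coef(lm(SAT ~ Salary, data = g))[["Salary"]]
  })
}

slopes <- slope_by_region(d)
print(sign(slopes))  # -1 marks a negative slope, NA a region with one state
```

On the full data, `slope_by_region(dta)` would return one slope per region, and the regions with a negative value could be compared with the list above.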