MATH2349 Semester 2, 2019

Setup

Loading the necessary packages:

library(readr) # Useful for importing data
library(knitr) # Useful for creating nice tables
library(dplyr) # Useful for subsetting and filtering data

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
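The masking messages above mean that, once dplyr is attached, a bare call to filter() or lag() resolves to the dplyr version rather than the one in stats. A minimal sketch of calling each version explicitly with the `::` operator is given here. The toy data frame `demo` is an assumption made for illustration and is not the TA dataset.

```r
# Toy data frame (an assumption for illustration only)
demo <- data.frame(Class_Size = c(19L, 17L, 49L))

# dplyr::filter() keeps the rows that satisfy a condition
large_classes <- dplyr::filter(demo, Class_Size > 18)

# stats::filter() applies a linear (moving-average) filter to a series
smoothed <- stats::filter(demo$Class_Size, rep(1 / 2, 2))

print(large_classes)
print(class(smoothed))
```

Writing the package name in front of the function avoids any doubt about which version is being used.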

Data Description

The data were extracted from the UCI Machine Learning Repository. The data consist of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant (TA) assignments at the Statistics Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly equal-sized categories (“low”, “medium”, and “high”) to form the class variable.

Data Source link: https://archive.ics.uci.edu/ml/datasets/Teaching+Assistant+Evaluation

Data types in the dataset:

Integer variable: Class_Size
Nominal variables: English_Speaking, Instructor, Course, Semester
Ordinal variable: Class

Variable Description:

English_Speaking- Whether or not the TA is a native English speaker (1=English speaker, 2=non-English speaker)
Instructor- Course instructor (25 categories)
Course- 26 courses (categories)
Semester- Summer or regular semester (1=Summer, 2=Regular)
Class_Size- Size of the class
Class- Teaching performance score (1=Low, 2=Medium, 3=High)

Read/Import Data

# Import the data and set the type of each column
tae <- read_csv("tae.data", col_types = cols(
  English_Speaking = col_character(),
  Instructor = col_character(),
  Course = col_character(),
  Semester = col_character(),
  Class_Size = col_integer(),
  Class = col_factor(levels = c("1", "2", "3"), ordered = TRUE)
))

I used readr functions to import the data. For convenience, I downloaded the CSV file from the UCI repository and added column labels to the data file, which originally did not contain variable names.

Assigning data types to the variables:

Assigned the variables English_Speaking, Instructor, Course, Semester as characters
Assigned Class_Size as an integer
Assigned Class as an ordered factor.
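As an alternative to editing the downloaded file by hand, the column names could be supplied at import time. The sketch below assumes a header-less file. It writes three rows, copied from the output shown in this report, to a temporary file that stands in for the original download. The names `raw_file` and `tae_demo` are hypothetical.

```r
library(readr)

# Temporary header-less file standing in for the original download
raw_file <- tempfile(fileext = ".data")
writeLines(c("1,23,3,1,19,3", "2,15,3,1,17,3", "1,23,3,2,49,3"), raw_file)

# col_names supplies the variable names, col_types sets each column's type
tae_demo <- read_csv(
  raw_file,
  col_names = c("English_Speaking", "Instructor", "Course",
                "Semester", "Class_Size", "Class"),
  col_types = cols(
    English_Speaking = col_character(),
    Instructor = col_character(),
    Course = col_character(),
    Semester = col_character(),
    Class_Size = col_integer(),
    Class = col_factor(levels = c("1", "2", "3"), ordered = TRUE)
  )
)

str(tae_demo)
```

This keeps the raw file unchanged, so the import can be repeated from a fresh download.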

Inspect and Understand

str(tae) # Structure of the dataset

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 151 obs. of  6 variables:
##  $ English_Speaking: chr  "1" "2" "1" "1" ...
##  $ Instructor      : chr  "23" "15" "23" "5" ...
##  $ Course          : chr  "3" "3" "3" "2" ...
##  $ Semester        : chr  "1" "1" "2" "2" ...
##  $ Class_Size      : int  19 17 49 33 55 20 19 27 58 20 ...
##  $ Class           : Ord.factor w/ 3 levels "1"<"2"<"3": 3 3 3 3 3 3 3 3 3 3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   English_Speaking = col_character(),
##   ..   Instructor = col_character(),
##   ..   Course = col_character(),
##   ..   Semester = col_character(),
##   ..   Class_Size = col_integer(),
##   ..   Class = col_factor(levels = c("1", "2", "3"), ordered = TRUE, include_na = FALSE)
##   .. )

tae$Class <- tae$Class %>% factor(levels = c("1", "2", "3"), labels = c("Low", "Medium", "High")) # Assigning labels for the "Class" variable

The dataframe “tae” has 151 observations of 6 variables: 4 character variables, one integer variable and one ordinal variable (an ordered factor). Labels were assigned to the numeric levels of the “Class” variable.
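A short sketch of how the labelled, ordered factor could be checked is given here. The vector `codes` is a made-up stand-in for the “Class” column, not the real data.

```r
# Made-up class codes standing in for the "Class" column
codes <- c("3", "1", "2", "3")

perf <- factor(codes,
               levels = c("1", "2", "3"),
               labels = c("Low", "Medium", "High"),
               ordered = TRUE)

levels(perf)       # the labels, from lowest to highest
is.ordered(perf)   # whether the factor carries an ordering
perf >= "Medium"   # comparisons respect the level order
table(perf)        # number of observations per level
```

Because the factor is ordered, comparisons such as `>=` are defined on its levels, which would not be the case for a nominal factor.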

Subsetting I

# Subset the first 10 rows of the data and name the result "tae1"
tae1 <- tae[1:10, ]
show(tae1) # Display the subset

## # A tibble: 10 x 6
##    English_Speaking Instructor Course Semester Class_Size Class
##    <chr>            <chr>      <chr>  <chr>         <int> <ord>
##  1 1                23         3      1                19 High 
##  2 2                15         3      1                17 High 
##  3 1                23         3      2                49 High 
##  4 1                5          2      2                33 High 
##  5 2                7          11     2                55 High 
##  6 2                23         3      1                20 High 
##  7 2                9          5      2                19 High 
##  8 2                10         3      2                27 High 
##  9 1                22         3      1                58 High 
## 10 2                15         3      1                20 High

str(tae1) # Structure of the subset "tae1"

## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  6 variables:
##  $ English_Speaking: chr  "1" "2" "1" "1" ...
##  $ Instructor      : chr  "23" "15" "23" "5" ...
##  $ Course          : chr  "3" "3" "3" "2" ...
##  $ Semester        : chr  "1" "1" "2" "2" ...
##  $ Class_Size      : int  19 17 49 33 55 20 19 27 58 20
##  $ Class           : Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 3 3 3 3 3 3 3 3 3

The subset of the first 10 rows of the dataset was named “tae1”. “tae1” is a dataframe with 10 observations of 6 variables. It can be indexed by row and column like a matrix, but it is still a dataframe (tibble) and not a matrix, because its columns hold different data types. All the variables of “tae1” have the same data types as in its superset “tae”.
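If an actual matrix is needed, one possible approach is as.matrix(). The sketch below uses a small stand-in dataframe, `tae1_demo`, built from the first three rows of the output above. It is an illustration under that assumption and not part of the original analysis.

```r
# Stand-in for the first three rows of "tae1" (values copied from the output above)
tae1_demo <- data.frame(
  English_Speaking = c("1", "2", "1"),
  Instructor = c("23", "15", "23"),
  Course = c("3", "3", "3"),
  Semester = c("1", "1", "2"),
  Class_Size = c(19L, 17L, 49L),
  Class = factor(c("High", "High", "High"),
                 levels = c("Low", "Medium", "High"), ordered = TRUE),
  stringsAsFactors = FALSE
)

# A matrix can hold only one data type, so every column is coerced to character
tae1_matrix <- as.matrix(tae1_demo)

is.matrix(tae1_matrix)
typeof(tae1_matrix)
dim(tae1_matrix)
```

The coercion to a single type means the integer and ordered-factor information is lost in the matrix.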

Subsetting II

tae2 <- tae[, c(1, 6)] # Subset the data to the first and last columns
show(tae2) # Display the subset

## # A tibble: 151 x 2
##    English_Speaking Class
##    <chr>            <ord>
##  1 1                High 
##  2 2                High 
##  3 1                High 
##  4 1                High 
##  5 2                High 
##  6 2                High 
##  7 2                High 
##  8 2                High 
##  9 1                High 
## 10 2                High 
## # … with 141 more rows

str(tae2) # Structure of the subset

## Classes 'tbl_df', 'tbl' and 'data.frame':    151 obs. of  2 variables:
##  $ English_Speaking: chr  "1" "2" "1" "1" ...
##  $ Class           : Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 3 3 3 3 3 3 3 3 3 ...

save(tae2, file = "tae2.RData") # Save "tae2" object in .RData format in the working directory

The dataframe “tae” was subsetted to its first and last columns, and the result was named “tae2”. “tae2” is a dataframe with 151 observations of 2 variables: the first is a character variable and the second is an ordered factor. “tae2” was saved as an .RData file in the working directory.
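A saved .RData file can be checked by loading it back and comparing it with the original object. The following is a minimal sketch with a stand-in dataframe `tae2_demo` and a temporary file, both assumptions made so that the example runs on its own.

```r
# Stand-in for "tae2" (first three rows, values copied from the output above)
tae2_demo <- data.frame(
  English_Speaking = c("1", "2", "1"),
  Class = factor(c("High", "High", "High"),
                 levels = c("Low", "Medium", "High"), ordered = TRUE),
  stringsAsFactors = FALSE
)

# Save to a temporary file instead of the working directory
rdata_file <- tempfile(fileext = ".RData")
save(tae2_demo, file = rdata_file)

# Load into a separate environment so that nothing is overwritten
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)

loaded_names
identical(restored$tae2_demo, tae2_demo)
```

load() restores objects under the names they were saved with, so loading into a separate environment is a cautious way to inspect a file.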

Create a new Data Frame

tae3 <- tae[1:4, c("Class_Size", "Class")] # Subset the first 4 observations of the "Class_Size" (integer) and "Class" (ordinal) variables
show(tae3)

## # A tibble: 4 x 2
##   Class_Size Class
##        <int> <ord>
## 1         19 High 
## 2         17 High 
## 3         49 High 
## 4         33 High

str(tae3)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  2 variables:
##  $ Class_Size: int  19 17 49 33
##  $ Class     : Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 3 3 3

v1 <- c(1, 2, 3, 4) # Creating a new numeric vector "v1" with 4 observations
tae4 <- cbind(tae3, v1) # Adding "v1" to "tae3" as a column
show(tae4)

##   Class_Size Class v1
## 1         19  High  1
## 2         17  High  2
## 3         49  High  3
## 4         33  High  4

str(tae4)

## 'data.frame':    4 obs. of  3 variables:
##  $ Class_Size: int  19 17 49 33
##  $ Class     : Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 3 3 3
##  $ v1        : num  1 2 3 4

“tae3” is a subset of “tae” with 4 observations of an integer variable (“Class_Size”) and an ordinal variable (“Class”). A new numeric vector “v1” was created with 4 values and added to “tae3” as a column. The new dataframe was named “tae4”. Hence “tae4” contains 4 observations of 3 variables with the data types integer, ordinal and numeric.
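In the output above, cbind() returned a plain data.frame rather than a tibble. If keeping the tibble class matters, one option is dplyr's mutate(). The sketch below assumes a stand-in tibble `tae3_demo` built from the four rows shown above. It is an alternative for illustration, not the method used in this report.

```r
library(dplyr)

# Stand-in for "tae3" (values copied from the output above)
tae3_demo <- tibble(
  Class_Size = c(19L, 17L, 49L, 33L),
  Class = factor(rep("High", 4),
                 levels = c("Low", "Medium", "High"), ordered = TRUE)
)

v1 <- c(1, 2, 3, 4) # numeric (double) vector, as in the report

# mutate() adds the column and returns a tibble
tae4_demo <- mutate(tae3_demo, v1 = v1)

class(tae4_demo)
sapply(tae4_demo, function(col) class(col)[1])
```

Either approach gives the same three columns. The difference is only in the class of the resulting object.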