Project 2, Dataset 2

1.Create a .CSV file (or optionally, a MySQL database!) that includes one of the “wide” datasets identified in the Week 6 Discussion items. You’re encouraged to use a “wide” structure similar to how the information appears above, sothat you can practice tidying and transformations as described below.

I chose dataset that was posted by Jhalak Das and that includes the information about the students, their grades for tests for each term.

students <- rbind(c("id", "name", "phone", "sex and age", "test number","term 1", "term 2", "term 3"),
             c(1, "Mike", 134, "m_12", "test 1", 76, 84, 87),
             c(2, "Linda", 270, "f_13", "test 1", 88, 90, 73),
             c(3, "Sam", 210, "m_11", "test 1", 78, 74, 80),
             c(4, "Esther", 617, "f_12", "test 1", 68, 75, 74),
             c(5, "Marya", 114, "f_14", "test 1", 65, 67, 64),
             c(1, "Mike", 134, "m_12", "test 2", 85, 80, 90),
             c(2, "Linda", 270, "f_13", "test 2", 87, 82, 94),
             c(3, "Sam", 210, "m_11", "test 2", 80, 87, 80),
             c(4, "Esther", 617, "f_12", "test 2", 70, 75, 78),
             c(5, "Marya", 114, "f_14", "test 2", 68, 70, 63))

write.table(students, file = "students.csv", sep = ",", col.names=F, row.names=F)

2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

2.1 Load a csv file from Github

The file will be downloaded from the Github repository to csv file using read.csv function.

students <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/Project-2/students.csv', header=TRUE, sep=",", check.names=FALSE)

By checking the data downloaded, we have 8 column with 10 rows.

summary(students)

##        id        name               phone     sex and age       
##  Min.   :1   Length:10          Min.   :114   Length:10         
##  1st Qu.:2   Class :character   1st Qu.:134   Class :character  
##  Median :3   Mode  :character   Median :210   Mode  :character  
##  Mean   :3                      Mean   :269                     
##  3rd Qu.:4                      3rd Qu.:270                     
##  Max.   :5                      Max.   :617                     
##  test number            term 1          term 2          term 3     
##  Length:10          Min.   :65.00   Min.   :67.00   Min.   :63.00  
##  Class :character   1st Qu.:68.50   1st Qu.:74.25   1st Qu.:73.25  
##  Mode  :character   Median :77.00   Median :77.50   Median :79.00  
##                     Mean   :76.50   Mean   :78.40   Mean   :78.30  
##                     3rd Qu.:83.75   3rd Qu.:83.50   3rd Qu.:85.25  
##                     Max.   :88.00   Max.   :90.00   Max.   :94.00

head(students)

##   id   name phone sex and age test number term 1 term 2 term 3
## 1  1   Mike   134        m_12      test 1     76     84     87
## 2  2  Linda   270        f_13      test 1     88     90     73
## 3  3    Sam   210        m_11      test 1     78     74     80
## 4  4 Esther   617        f_12      test 1     68     75     74
## 5  5  Marya   114        f_14      test 1     65     67     64
## 6  1   Mike   134        m_12      test 2     85     80     90

2.2 Transform data

Tidy data should follow the rules: each variable is its own column, each observation is its own row, and each value is its own cell. At the current moment, we have “wide data” an can transform it to “long data” as it is better to work with.
We are going to work with data from csv file and use tidyverse package to transform it.
First, the three columns with terms turn into one column called terms. Multiple variables should be stored in one column. pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

students <- students %>% 
  pivot_longer(c("term 1", "term 2", "term 3"), names_to = "term", values_to = "grade")
head(students)

## # A tibble: 6 x 7
##      id name  phone `sex and age` `test number` term   grade
##   <int> <chr> <int> <chr>         <chr>         <chr>  <int>
## 1     1 Mike    134 m_12          test 1        term 1    76
## 2     1 Mike    134 m_12          test 1        term 2    84
## 3     1 Mike    134 m_12          test 1        term 3    87
## 4     2 Linda   270 f_13          test 1        term 1    88
## 5     2 Linda   270 f_13          test 1        term 2    90
## 6     2 Linda   270 f_13          test 1        term 3    73

Rename column “test number” to avoid spaces

students <- plyr::rename(students, c("test number" = "test_number"))

Then, instead of taking “test number” as variable, we can consider two columns with headers ‘test 1’ and ‘test 2’. The test number column is not a variable, it contains two variables that we will make as columns.

students <- students %>%
    pivot_wider(names_from=test_number, values_from = grade)
head(students)

## # A tibble: 6 x 7
##      id name  phone `sex and age` term   `test 1` `test 2`
##   <int> <chr> <int> <chr>         <chr>     <int>    <int>
## 1     1 Mike    134 m_12          term 1       76       85
## 2     1 Mike    134 m_12          term 2       84       80
## 3     1 Mike    134 m_12          term 3       87       90
## 4     2 Linda   270 f_13          term 1       88       87
## 5     2 Linda   270 f_13          term 2       90       82
## 6     2 Linda   270 f_13          term 3       73       94

In addition, “sex and age” should be two distinct columns; ‘sex’ and ‘age’.

students <- students %>% 
  separate("sex and age", into = c("sex", "age"))
head(students)

## # A tibble: 6 x 8
##      id name  phone sex   age   term   `test 1` `test 2`
##   <int> <chr> <int> <chr> <chr> <chr>     <int>    <int>
## 1     1 Mike    134 m     12    term 1       76       85
## 2     1 Mike    134 m     12    term 2       84       80
## 3     1 Mike    134 m     12    term 3       87       90
## 4     2 Linda   270 f     13    term 1       88       87
## 5     2 Linda   270 f     13    term 2       90       82
## 6     2 Linda   270 f     13    term 3       73       94

Also, we can substitute “m” and “f” for sex with “male” and “female” accordingly, and transform age column into numeric from character.

students["sex"][students["sex"] == "f"] <- "female"
students["sex"][students["sex"] == "m"] <- "male"
students$age = as.numeric(students$age)
head(students)

## # A tibble: 6 x 8
##      id name  phone sex      age term   `test 1` `test 2`
##   <int> <chr> <int> <chr>  <dbl> <chr>     <int>    <int>
## 1     1 Mike    134 male      12 term 1       76       85
## 2     1 Mike    134 male      12 term 2       84       80
## 3     1 Mike    134 male      12 term 3       87       90
## 4     2 Linda   270 female    13 term 1       88       87
## 5     2 Linda   270 female    13 term 2       90       82
## 6     2 Linda   270 female    13 term 3       73       94

summary(students)

##        id        name               phone         sex                 age      
##  Min.   :1   Length:15          Min.   :114   Length:15          Min.   :11.0  
##  1st Qu.:2   Class :character   1st Qu.:134   Class :character   1st Qu.:12.0  
##  Median :3   Mode  :character   Median :210   Mode  :character   Median :12.0  
##  Mean   :3                      Mean   :269                      Mean   :12.4  
##  3rd Qu.:4                      3rd Qu.:270                      3rd Qu.:13.0  
##  Max.   :5                      Max.   :617                      Max.   :14.0  
##      term               test 1         test 2     
##  Length:15          Min.   :64.0   Min.   :63.00  
##  Class :character   1st Qu.:70.5   1st Qu.:72.50  
##  Mode  :character   Median :75.0   Median :80.00  
##                     Mean   :76.2   Mean   :79.27  
##                     3rd Qu.:82.0   3rd Qu.:86.00  
##                     Max.   :90.0   Max.   :94.00

Finally, we will change values of column term from “term 1” to just “1” and make sure that we will get numeric values, not character.

students["term"][students["term"] == "term 1"] <- "1"
students["term"][students["term"] == "term 2"] <- "2"
students["term"][students["term"] == "term 3"] <- "3"
students$term = as.numeric(students$term)
head(students)

## # A tibble: 6 x 8
##      id name  phone sex      age  term `test 1` `test 2`
##   <int> <chr> <int> <chr>  <dbl> <dbl>    <int>    <int>
## 1     1 Mike    134 male      12     1       76       85
## 2     1 Mike    134 male      12     2       84       80
## 3     1 Mike    134 male      12     3       87       90
## 4     2 Linda   270 female    13     1       88       87
## 5     2 Linda   270 female    13     2       90       82
## 6     2 Linda   270 female    13     3       73       94

3. Perform the analysis requested in the discussion item.

Since there were no questions for analysis in the post, we can create our own.
We can find out which age performs best on the tests using mean grade for test 1. We see that at age of 14 there is only 1 student, Maya who performs not so good as other students.

students %>%
    group_by(age) %>%
    summarise(test1Mean = mean(`test 1`),test2Mean =  mean(`test 2`)) %>%
    ggplot(data = ., aes(x = age, y = test1Mean, 
    fill = age)) + 
  geom_bar(stat = "identity") +
  theme_light() +
  labs(title="Test grades, mean by age", x= "Age", y = "Mean test grades") +
  theme(plot.title = element_text(hjust = 0.5))

We can find out which sex performs better during the test. In general, there is no big difference between female and male mean test grades but male has a little gigher grade on average.

students %>%
    group_by(sex) %>%
    summarise(test1Mean = mean(`test 1`),test2Mean =  mean(`test 2`))

## # A tibble: 2 x 3
##   sex    test1Mean test2Mean
##   <chr>      <dbl>     <dbl>
## 1 female      73.8      76.3
## 2 male        79.8      83.7

ggplot(students, aes(x=sex, y=`test 1`, color=sex)) + 
  geom_boxplot() +
  theme_light() +
  labs(title="Test grades, mean by sex", x= "Sex", y = "Mean test grades") +
  theme(plot.title = element_text(hjust = 0.5))

4. Conclusion.