Research Questions

  1. Two common file types are .csv and .tsv. What is the difference between the two, and can you suggest a function for loading a .tsv file into R? [3 marks]

1). A .csv file is a comma-separated values file, whereas a .tsv file is a tab-separated values file. The only difference between the two is the delimiter that separates the fields on each line: a comma in a .csv file and a tab character in a .tsv file. A .tsv file is often the better choice when the values themselves contain commas (for example free text, or numbers written with a decimal comma), because the fields then do not need to be quoted. A .tsv file can be loaded with fread() from the data.table package, with read_tsv() from readr, or with base R's read.delim() or read.table() with sep = "\t". For example:

require(data.table)

data <- as.data.frame(fread("file.tsv"))

or

read.table(file = "filename.tsv", sep = "\t", header = TRUE)
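As a minimal sketch of why the delimiter matters, the example below writes a small tab-separated file to a temporary location and reads it back with base R's read.delim(). The file contents and the Note column are made up for illustration only; they are not part of the assignment data.

```r
# Write a toy .tsv file; fields are separated by tabs ("\t").
# The Note column contains commas, which would need quoting in a .csv.
tmp <- tempfile(fileext = ".tsv")
writeLines(c(
  "Gene\tNote\tdCt",
  "SORL1\tHEK293, WT\t11.4",
  "SORL1\tCD25Hi, KO\t11.0"
), tmp)

# read.delim() is read.table() with sep = "\t" and header = TRUE as defaults
tsv <- read.delim(tmp, stringsAsFactors = FALSE)
print(tsv)
```

Because the tab is the delimiter, the commas inside Note are kept as ordinary text and the file still has three columns.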

  2. Three types of message commonly seen in the R Console are 1) error messages, 2) warnings and 3) general messages. Briefly describe the importance of each type of message and what information they are conveying. [3 marks]

2). An error message tells you that the code could not be run: execution stops at that point and no result is returned. The message usually says which call failed and why, which guides you to a solution to the problem. A warning message tells you about a possible problem with the code, however the code still runs and generates an output. That output may not be what was intended, so the warning should be read and the result checked. General messages are informational only and do not indicate a problem. They give extra information about what a function did, such as the list of attached packages and conflicts that is printed when the tidyverse is loaded.
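The three message types can be illustrated with a small sketch. The function check_value() below is hypothetical and written only for this example; it uses base R's message(), warning() and stop().

```r
# Hypothetical helper that raises each type of condition
check_value <- function(x) {
  message("Checking value: ", x)          # general message: information only
  if (x < 0) stop("x must be non-negative")     # error: execution stops here
  if (x > 100) warning("x is unusually large")  # warning: execution continues
  sqrt(x)
}

# No problem: only the message is shown and a value is returned
res_ok <- check_value(9)

# Warning: a value is still returned (the warning is silenced here)
res_warn <- suppressWarnings(check_value(400))

# Error: no value is returned, so the error is caught and its text kept
res_err <- tryCatch(
  check_value(-1),
  error = function(e) paste("error:", conditionMessage(e))
)
```

Here res_ok and res_warn both hold a result, whereas res_err only holds the text of the error because sqrt() was never reached.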

  3. What is the difference between an integer and a floating point value? Floats are sometimes called “doubles”. What does this mean? [2 marks]

3). An integer is a whole number, with no fractional part. A floating-point value can have digits after the decimal point. It is stored as a significand and an exponent, which lets it represent very large and very small numbers, but only approximately. "Double" is short for double precision: a double is stored in 64 bits, twice the 32 bits of a single-precision float, which gives about 15 to 16 significant decimal digits instead of about 7. In R, ordinary numeric values are stored as doubles.
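The difference can be checked directly in R. This is a short sketch using only base R; the variable names are chosen for illustration.

```r
# The L suffix creates an integer; a plain number is a double
int_type <- typeof(5L)
dbl_type <- typeof(5)

# Dividing two integers gives a double, because the result may be fractional
div_type <- typeof(5L / 2L)

# Doubles have a 53-bit significand (double precision)
sig_bits <- .Machine$double.digits

# Floating-point values are approximate, so exact comparison can fail
exact_equal <- (0.1 + 0.2) == 0.3
near_equal  <- isTRUE(all.equal(0.1 + 0.2, 0.3))
```

Comparing doubles with all.equal() rather than == allows for the small rounding error in how they are stored.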

library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)
library(pander)

Q1. Created the data table without the comment rows and blank rows, and removed the X1 column.

a1724529.csv <- read_csv("a1724529.csv", comment = "@", col_types = "ccnnnn")
## Warning: Missing column names filled in: 'X1' [1]
a1724529.csv<-select(a1724529.csv, -X1)
view(a1724529.csv)
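The same cleaning step can be sketched on a toy file with base R's read.csv(), assuming (as in Q1) that comment lines start with "@" and that the first column has no name. The file contents below are made up for illustration, and the unnamed column is dropped by position rather than by name because the name it is given differs between readers and package versions.

```r
# Toy file with comment lines, a blank line and an unnamed first column
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "@ toy comment line",
  ",Gene,HEK293 WT,HEK293 KO",
  "",
  "1,SORL1,11.4,12.4",
  "@ another comment",
  "2,SORL1,8.45,10.6"
), tmp)

# comment.char drops everything after "@"; blank lines are skipped by default
raw <- read.csv(tmp, comment.char = "@", check.names = FALSE)

# Drop the unnamed first column by position
clean <- raw[, -1]
print(clean)
```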

Q2. Converting the data from wide form into long form, then splitting the combined column names into separate CellTypes and Treatments columns

a1724529Long <- a1724529.csv %>% pivot_longer(cols = -c("Gene"), names_to = "Treatment", values_to = "dCt") 
a1724529LongCols <- separate(a1724529Long, Treatment, into = c("CellTypes", "Treatments"), sep = " ")
print(a1724529LongCols)
## # A tibble: 28 x 4
##    Gene  CellTypes Treatments   dCt
##    <chr> <chr>     <chr>      <dbl>
##  1 SORL1 HEK293    WT         11.4 
##  2 SORL1 CD25Hi    WT         14.2 
##  3 SORL1 HEK293    KO         12.4 
##  4 SORL1 CD25Hi    KO         11.0 
##  5 SORL1 HEK293    WT          8.45
##  6 SORL1 CD25Hi    WT         13.9 
##  7 SORL1 HEK293    KO         10.6 
##  8 SORL1 CD25Hi    KO         11.0 
##  9 SORL1 HEK293    WT          9.38
## 10 SORL1 CD25Hi    WT         14.4 
## # … with 18 more rows
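The reshaping and the split can also be done in one step. The sketch below is an alternative to the two calls above, not the method used in this report: it passes two names to names_to and uses names_sep in tidyr's pivot_longer(). The toy data frame reuses a few values from the output above.

```r
library(tidyr)

# Toy wide table: one column per "CellType Treatment" combination
wide <- data.frame(
  Gene = c("SORL1", "SORL1"),
  `HEK293 WT` = c(11.4, 8.45),
  `CD25Hi KO` = c(11.0, 11.0),
  check.names = FALSE
)

# Split each column name at the space while pivoting to long form
long <- pivot_longer(
  wide,
  cols = -Gene,
  names_to = c("CellTypes", "Treatments"),
  names_sep = " ",
  values_to = "dCt"
)
print(long)
```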

Q3. Creating boxplots of dCt for the CD25Hi KO and WT groups as well as the HEK293 KO and WT groups, using ggplot() and geom_boxplot(). ggplot2 gave a warning about using `$` to refer to the x and y columns inside aes(). The plot is still drawn despite the warning, so the code was kept.

Bplot <- ggplot(a1724529Long, aes(x = a1724529Long$Treatment, y = a1724529Long$dCt)) 
Bplot + geom_boxplot(varwidth = T, fill = "orange") + labs(title = "Box plot", subtitle = "dCt grouped by treatment", x = "Treatment Type", y = "dCt")
## Warning: Use of `a1724529Long$Treatment` is discouraged. Use `Treatment`
## instead.
## Warning: Use of `a1724529Long$dCt` is discouraged. Use `dCt` instead.
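The warning itself suggests the fix: inside aes(), columns can be referred to by their bare names because ggplot() already knows which data frame to look in. A minimal sketch on a toy data frame (values borrowed from the Q2 output) would be:

```r
library(ggplot2)

# Toy data in the same shape as the long table used for the boxplot
toy <- data.frame(
  Treatment = rep(c("HEK293 WT", "HEK293 KO"), each = 2),
  dCt = c(11.4, 8.45, 12.4, 10.6)
)

# Bare column names inside aes(), so no `data$column` warning is raised
p <- ggplot(toy, aes(x = Treatment, y = dCt)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Box plot", x = "Treatment Type", y = "dCt")

# print(p) would draw the plot
```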

Q4. Linear regression on a1724529LongCols using the formula dCt ~ CellTypes + Treatments + CellTypes:Treatments. The pander() function was used to generate the summary table.

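# Called on its own, this only creates a layer object; it is not added to a plot and does not affect the model below
geom_smooth(method = "lm", se = FALSE)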
## geom_smooth: na.rm = FALSE, orientation = NA, se = FALSE
## stat_smooth: na.rm = FALSE, orientation = NA, se = FALSE, method = lm
## position_identity
a1724529LinearRegression <-lm(dCt ~ CellTypes + Treatments + CellTypes:Treatments, data = a1724529LongCols) 
summary(a1724529LinearRegression)
## 
## Call:
## lm(formula = dCt ~ CellTypes + Treatments + CellTypes:Treatments, 
##     data = a1724529LongCols)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7389 -0.5860 -0.1981  0.4598  1.7370 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   11.6408     0.3558  32.717  < 2e-16 ***
## CellTypesHEK293               -0.6175     0.5032  -1.227 0.231672    
## TreatmentsWT                   2.1892     0.5032   4.351 0.000216 ***
## CellTypesHEK293:TreatmentsWT  -3.0233     0.7116  -4.249 0.000281 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9414 on 24 degrees of freedom
## Multiple R-squared:  0.7055, Adjusted R-squared:  0.6687 
## F-statistic: 19.16 on 3 and 24 DF,  p-value: 1.465e-06
pander(a1724529LinearRegression)
Fitting linear model: dCt ~ CellTypes + Treatments + CellTypes:Treatments

                               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)                    11.64      0.3558       32.72     2.015e-21
CellTypesHEK293                -0.6175    0.5032       -1.227    0.2317
TreatmentsWT                   2.189      0.5032       4.351     0.0002165
CellTypesHEK293:TreatmentsWT   -3.023     0.7116       -4.249    0.0002807
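As a worked example of how the coefficients combine, the fitted mean dCt of each group can be rebuilt from the estimates in the summary() output above. This sketch assumes R's default treatment contrasts, so the intercept is the reference group (CD25Hi, KO) and the other terms are differences from it.

```r
# Estimates copied from the summary() output above
coefs <- c(
  "(Intercept)"                  = 11.6408,
  "CellTypesHEK293"              = -0.6175,
  "TreatmentsWT"                 =  2.1892,
  "CellTypesHEK293:TreatmentsWT" = -3.0233
)

# Fitted group means: add the terms that apply to each group
group_means <- c(
  "CD25Hi KO" = coefs[["(Intercept)"]],
  "HEK293 KO" = coefs[["(Intercept)"]] + coefs[["CellTypesHEK293"]],
  "CD25Hi WT" = coefs[["(Intercept)"]] + coefs[["TreatmentsWT"]],
  "HEK293 WT" = sum(coefs)  # all four terms apply to this group
)
print(round(group_means, 2))
```

The interaction term is the amount by which the WT effect in HEK293 cells differs from the WT effect in CD25Hi cells.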