Research Questions
1). A .csv file stores comma-separated values, whereas a .tsv file stores tab-separated values. Both hold the same kind of tabular data, so the choice depends on the contents of the data set. If the values themselves contain commas (for example free-text descriptions, or numbers written with decimal commas), a .tsv file avoids having to quote those fields, while .csv is the more widely supported default. A .tsv file can be loaded with fread() from the data.table package, or with read.table() from base R, using the following lines of code.
library(data.table)
data <- as.data.frame(fread("file.tsv"))
or
read.table(file = "filename.tsv", sep = "\t", header = TRUE)
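As a minimal sketch of why the delimiter matters, the example below writes the same small table as both .tsv and .csv and reads each back with base R. The data frame and its values are hypothetical and chosen only so that one field contains commas.

```r
# Hypothetical example data: the "note" field contains commas.
df <- data.frame(
  gene = c("GeneA", "GeneB"),
  note = c("high, variable", "stable, reference"),
  stringsAsFactors = FALSE
)

tsv_path <- tempfile(fileext = ".tsv")
csv_path <- tempfile(fileext = ".csv")

# Tab-separated: commas inside a field need no special handling.
write.table(df, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Comma-separated: write.csv() quotes character fields so embedded commas survive.
write.csv(df, csv_path, row.names = FALSE)

# read.delim() is read.table() with sep = "\t" and header = TRUE as defaults.
from_tsv <- read.delim(tsv_path, stringsAsFactors = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

print(from_tsv)
print(from_csv)
```

Both calls should recover the same two columns; the difference lies only in how the file protects commas inside a value.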
2). An error message means that the function could not complete: execution stops and no result is returned. The error message usually indicates where the problem occurred, which helps guide you to a solution. A warning message flags a possible problem with the code, but the code still runs and generates an output. That output should be checked, because it may not be what was intended. A general message is purely informational: it reports what the function did (for example, which packages were attached or how missing column names were filled in) and does not by itself indicate a problem.
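The three kinds of condition can be sketched with base R's stop(), warning() and message(). The function check_value() below is a hypothetical helper written only for this illustration.

```r
# Hypothetical helper that can signal an error, a warning and a message.
check_value <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")                  # error: execution halts here
  }
  if (x < 0) {
    warning("x is negative; result is NaN")    # warning: execution continues
  }
  message("Computing the square root of ", x)  # message: informational only
  suppressWarnings(sqrt(x))                    # silence sqrt()'s own NaN warning
}

# An error stops the call; tryCatch() lets us capture its text instead.
caught <- tryCatch(
  check_value("a"),
  error = function(e) paste("Caught error:", conditionMessage(e))
)

# A message does not affect the returned value.
ok <- suppressMessages(check_value(4))

# A warning still lets the function return a (possibly unwanted) value.
neg <- suppressWarnings(suppressMessages(check_value(-1)))

print(caught)
print(ok)
print(neg)
```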
3). An integer is a whole number with no fractional part, whereas a floating-point number can have digits after the decimal point. A double is a double-precision floating-point number: it is stored in 64 bits rather than the 32 bits of a single-precision float, which gives roughly 15 to 17 significant decimal digits instead of about 7. Results stored as doubles are therefore more precise than those stored as single-precision floats, although they are still approximations. In R, numeric values are doubles by default, and an integer is written with an L suffix (e.g. 5L).
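A short sketch of how R itself distinguishes these types, using only base R functions:

```r
# Storage type of a literal with and without the L suffix.
type_int <- typeof(1L)    # "integer"
type_dbl <- typeof(1)     # "double": plain numbers are doubles by default

# A whole-valued number is not automatically stored as an integer.
five_is_integer <- is.integer(5)

# Doubles are binary approximations, so exact comparison can fail
# where a tolerance-based comparison succeeds.
exact_equal  <- (0.1 + 0.2 == 0.3)
approx_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))

# Limits of each type on this machine.
largest_integer <- .Machine$integer.max
mantissa_bits   <- .Machine$double.digits

print(c(type_int, type_dbl))
print(c(five_is_integer, exact_equal, approx_equal))
print(c(largest_integer, mantissa_bits))
```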
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(pander)
Q1. Created the data table without the comment lines and blank rows, and removed the X1 column
a1724529.csv <- read_csv("a1724529.csv", comment = "@", col_types = "ccnnnn")
## Warning: Missing column names filled in: 'X1' [1]
a1724529.csv <- select(a1724529.csv, -X1)
view(a1724529.csv)
Q2. Converting the data from wide form into long form, then separating CellType and Treatment into their own columns
a1724529Long <- a1724529.csv %>% pivot_longer(cols = -c("Gene"), names_to = "Treatment", values_to = "dCt")
a1724529LongCols <- separate(a1724529Long, Treatment, into = c("CellTypes", "Treatments"), sep = " ")
print(a1724529LongCols)
## # A tibble: 28 x 4
## Gene CellTypes Treatments dCt
## <chr> <chr> <chr> <dbl>
## 1 SORL1 HEK293 WT 11.4
## 2 SORL1 CD25Hi WT 14.2
## 3 SORL1 HEK293 KO 12.4
## 4 SORL1 CD25Hi KO 11.0
## 5 SORL1 HEK293 WT 8.45
## 6 SORL1 CD25Hi WT 13.9
## 7 SORL1 HEK293 KO 10.6
## 8 SORL1 CD25Hi KO 11.0
## 9 SORL1 HEK293 WT 9.38
## 10 SORL1 CD25Hi WT 14.4
## # … with 18 more rows
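The two reshaping steps can be sketched on a toy table. The data frame below is hypothetical (made-up gene names and values) and only mirrors the layout of a wide table whose column names combine a cell type and a treatment separated by a space.

```r
library(tidyr)

# Hypothetical wide table: one column per "CellType Treatment" combination.
wide <- data.frame(
  Gene = c("GeneA", "GeneB"),
  `HEK293 WT` = c(1, 2),
  `HEK293 KO` = c(3, 4),
  check.names = FALSE  # keep the spaces in the column names
)

# Step 1: stack every column except Gene into name/value pairs.
long <- pivot_longer(wide, cols = -Gene, names_to = "Sample", values_to = "dCt")

# Step 2: split the combined name at the space into two columns.
long_cols <- separate(long, Sample, into = c("CellType", "Treatment"), sep = " ")

print(long_cols)
```

Each original row yields one long-form row per measurement column, so two genes and two columns give four rows.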
Q3. Creating boxplots of the CD25Hi KO and WT as well as the HEK293 KO and WT, using ggplot() and geom_boxplot(). The code raises warnings because the x and y variables are referenced as a1724529Long$Treatment and a1724529Long$dCt inside aes(); ggplot2 recommends the bare column names Treatment and dCt instead. The plot is still produced, so the code was kept as run.
Bplot <- ggplot(a1724529Long, aes(x = a1724529Long$Treatment, y = a1724529Long$dCt))
Bplot + geom_boxplot(varwidth = T, fill = "orange") + labs(title = "Box plot", subtitle = "dCt grouped by treatment", x = "Treatment Type", y = "dCt")
## Warning: Use of `a1724529Long$Treatment` is discouraged. Use `Treatment`
## instead.
## Warning: Use of `a1724529Long$dCt` is discouraged. Use `dCt` instead.
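The warnings above suggest referring to the columns by name inside aes(). A minimal sketch of that form follows; demo_long is a hypothetical stand-in with made-up values that only shares its column names with a1724529Long.

```r
library(ggplot2)

# Hypothetical stand-in data with the same column names as a1724529Long.
demo_long <- data.frame(
  Treatment = rep(c("HEK293 WT", "HEK293 KO"), each = 3),
  dCt = c(1, 2, 3, 4, 5, 6)
)

# Bare column names are looked up in the data passed to ggplot(),
# so no `$` is needed and no warning is raised.
Bplot <- ggplot(demo_long, aes(x = Treatment, y = dCt)) +
  geom_boxplot(varwidth = TRUE, fill = "orange") +
  labs(
    title = "Box plot",
    subtitle = "dCt grouped by treatment",
    x = "Treatment Type",
    y = "dCt"
  )

print(Bplot)
```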
Q4. Linear regression on a1724529LongCols using the formula dCt ~ CellTypes + Treatments + CellTypes:Treatments. The pander() function was used to generate the summary table.
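# Called on its own, geom_smooth() only creates and prints a layer; it draws nothing unless added to a ggplot object
geom_smooth(method = "lm", se = FALSE)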
## geom_smooth: na.rm = FALSE, orientation = NA, se = FALSE
## stat_smooth: na.rm = FALSE, orientation = NA, se = FALSE, method = lm
## position_identity
a1724529LinearRegression <- lm(dCt ~ CellTypes + Treatments + CellTypes:Treatments, data = a1724529LongCols)
summary(a1724529LinearRegression)
##
## Call:
## lm(formula = dCt ~ CellTypes + Treatments + CellTypes:Treatments,
## data = a1724529LongCols)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7389 -0.5860 -0.1981 0.4598 1.7370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6408 0.3558 32.717 < 2e-16 ***
## CellTypesHEK293 -0.6175 0.5032 -1.227 0.231672
## TreatmentsWT 2.1892 0.5032 4.351 0.000216 ***
## CellTypesHEK293:TreatmentsWT -3.0233 0.7116 -4.249 0.000281 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9414 on 24 degrees of freedom
## Multiple R-squared: 0.7055, Adjusted R-squared: 0.6687
## F-statistic: 19.16 on 3 and 24 DF, p-value: 1.465e-06
pander(a1724529LinearRegression)
| &nbsp; | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 11.64 | 0.3558 | 32.72 | 2.015e-21 |
| CellTypesHEK293 | -0.6175 | 0.5032 | -1.227 | 0.2317 |
| TreatmentsWT | 2.189 | 0.5032 | 4.351 | 0.0002165 |
| CellTypesHEK293:TreatmentsWT | -3.023 | 0.7116 | -4.249 | 0.0002807 |
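One way to read the coefficients is to combine them into the fitted dCt for each group. The sketch below copies the estimates from the summary output and assumes the reference levels are CD25Hi and KO, which is what the coefficient names CellTypesHEK293 and TreatmentsWT imply.

```r
# Coefficient estimates copied from the summary(lm) output above.
b_intercept   <- 11.6408   # CD25Hi, KO (reference group)
b_hek293      <- -0.6175   # change for HEK293 within KO
b_wt          <-  2.1892   # change for WT within CD25Hi
b_interaction <- -3.0233   # extra change when both HEK293 and WT apply

# With both main effects and their interaction in the model,
# each fitted value equals the mean dCt of that group.
group_fits <- c(
  "CD25Hi KO" = b_intercept,
  "HEK293 KO" = b_intercept + b_hek293,
  "CD25Hi WT" = b_intercept + b_wt,
  "HEK293 WT" = b_intercept + b_hek293 + b_wt + b_interaction
)

print(round(group_fits, 2))
```

The fitted values come out at about 11.64, 11.02, 13.83 and 10.19 respectively, so the WT effect is positive in CD25Hi cells but negative in HEK293 cells, which is what the significant interaction term reflects.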