There are several databases from which we can download gene expression data. In this series of lessons we will use the “Gene Expression Omnibus”, or GEO. Its address is given below.
https://www.ncbi.nlm.nih.gov/geo/
From GEO we are interested in the datasets, and in particular those that, in addition to the raw data (CEL files), also provide data that have already been pre-processed and are ready for analysis. We find these under the “Datasets” link on the GEO home page.
Next, we can search the datasets by keyword. We will use the keyword “Smoking” to find datasets related to the effects of smoking.
So, we will download the dataset “GDS3709 Cigarette smoke effect on the oral mucosa”. To begin with, we look for a dataset with a relatively simple experimental design, so that we can study the statistical parts of the analysis.
We can download the dataset using the “DataSet SOFT file” link.
Alternatively, copy the link and use one of the following two lines in R.
## get data
# 1st way: download.file() from base R
download.file(url = "ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS3nnn/GDS3709/soft/GDS3709.soft.gz", destfile = "./GDS3709.soft.gz")
# 2nd way: call the external wget command
system("wget ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS3nnn/GDS3709/soft/GDS3709.soft.gz")
The first way is pure R code, but it has the drawback that destfile must be supplied. The second way is essentially a system call to the wget command, so wget must be installed on the machine.
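The destfile drawback of the first way can be softened by deriving the file name from the URL itself. The sketch below is one possible way to do that; `fetch_geo` is a hypothetical helper name chosen for illustration (it is not part of R or of any GEO package), and it only calls `download.file` when the file is not already present.

```r
# Hypothetical helper: derive destfile from the URL and skip the download
# when the file already exists locally.
fetch_geo <- function(url, destdir = ".") {
  destfile <- file.path(destdir, basename(url))
  if (!file.exists(destfile)) {
    # mode = "wb" writes the gzip archive in binary mode
    download.file(url = url, destfile = destfile, mode = "wb")
  }
  destfile
}

url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS3nnn/GDS3709/soft/GDS3709.soft.gz"
basename(url)   # the file name that fetch_geo() would use

# Usage (needs network access, so it is left commented out):
# soft.gz <- fetch_geo(url)
```

Returning the path makes it easy to pass the result straight to the next step.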
Once the download finishes, the first step is to decompress the file from the command line.
gunzip GDS3709.soft.gz
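Where gunzip is not available, the same step can be done from within R using base connections. The following is a minimal sketch on a toy file written to a temporary directory (the file name `toy.soft.gz` and its three lines are stand-ins, not the real dataset); for the real file one would use GDS3709.soft.gz instead.

```r
# Toy stand-in for GDS3709.soft.gz, written to a temporary directory
gz.path <- file.path(tempdir(), "toy.soft.gz")
con <- gzfile(gz.path, "w")
writeLines(c("^DATABASE = Geo", "!dataset_table_begin", "!dataset_table_end"), con)
close(con)

# Read through a gzfile() connection, which decompresses on the fly
con <- gzfile(gz.path, "r")
soft.lines <- readLines(con)
close(con)

# Write the decompressed text next to the archive (toy.soft)
soft.path <- sub("\\.gz$", "", gz.path)
writeLines(soft.lines, soft.path)
head(soft.lines)
```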
If we then open it, we will see that the first lines are as follows:
^DATABASE = Geo
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/geo
!Database_email = geo@ncbi.nlm.nih.gov
!Database_ref = Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6
^DATASET = GDS3709
!dataset_title = Cigarette smoke effect on the oral mucosa
!dataset_description = Analysis of oral mucosae from 40 cigarette smokers and 40 age and gender matched never-smokers. Results provide insight into the carcinogenic effects of cigarette smoke.
!dataset_type = Expression profiling by array
!dataset_pubmed_id = 20179299
!dataset_platform = GPL570
!dataset_platform_organism = Homo sapiens
!dataset_platform_technology_type = in situ oligonucleotide
!dataset_feature_count = 54675
!dataset_sample_organism = Homo sapiens
!dataset_sample_type = RNA
!dataset_channel_count = 1
!dataset_sample_count = 79
!dataset_value_type = count
!dataset_reference_series = GSE17913
!dataset_order = none
!dataset_update_date = May 13 2010
^SUBSET = GDS3709_1
!subset_dataset_id = GDS3709
!subset_description = female
!subset_sample_id = GSM447400,GSM447401,GSM447402,GSM447403,GSM447405,GSM447411,GSM447413,GSM447415,GSM447416,GSM447418,GSM447422,GSM447424,GSM447425,GSM447427,GSM447428,GSM447429,GSM447430,GSM447431,GSM447432,GSM447434,GSM447435,GSM447440,GSM447442,GSM447444,GSM447448,GSM447449,GSM447450,GSM447451,GSM447452,GSM447458,GSM447461,GSM447462,GSM447463,GSM447464,GSM447467,GSM447468,GSM447469,GSM447472,GSM447473
!subset_type = gender
^SUBSET = GDS3709_2
!subset_dataset_id = GDS3709
!subset_description = male
!subset_sample_id = GSM447398,GSM447399,GSM447404,GSM447406,GSM447407,GSM447408,GSM447409,GSM447410,GSM447412,GSM447414,GSM447417,GSM447419,GSM447420,GSM447421,GSM447423,GSM447426,GSM447433,GSM447436,GSM447437,GSM447438,GSM447439,GSM447441,GSM447443,GSM447445,GSM447446,GSM447447,GSM447453,GSM447454,GSM447455,GSM447456,GSM447457,GSM447459,GSM447460,GSM447465,GSM447466,GSM447470,GSM447471,GSM447474,GSM447475,GSM447476
!subset_type = gender
^SUBSET = GDS3709_3
!subset_dataset_id = GDS3709
!subset_description = cigarette smoke
!subset_sample_id = GSM447401,GSM447404,GSM447406,GSM447407,GSM447409,GSM447411,GSM447412,GSM447413,GSM447415,GSM447416,GSM447425,GSM447426,GSM447430,GSM447433,GSM447435,GSM447439,GSM447440,GSM447441,GSM447443,GSM447444,GSM447445,GSM447446,GSM447448,GSM447449,GSM447450,GSM447452,GSM447453,GSM447455,GSM447456,GSM447458,GSM447459,GSM447461,GSM447464,GSM447466,GSM447468,GSM447470,GSM447472,GSM447474,GSM447475
!subset_type = agent
^SUBSET = GDS3709_4
!subset_dataset_id = GDS3709
!subset_description = control
!subset_sample_id = GSM447398,GSM447399,GSM447400,GSM447402,GSM447403,GSM447405,GSM447408,GSM447410,GSM447414,GSM447417,GSM447418,GSM447419,GSM447420,GSM447421,GSM447422,GSM447423,GSM447424,GSM447427,GSM447428,GSM447429,GSM447431,GSM447432,GSM447434,GSM447436,GSM447437,GSM447438,GSM447442,GSM447447,GSM447451,GSM447454,GSM447457,GSM447460,GSM447462,GSM447463,GSM447465,GSM447467,GSM447469,GSM447471,GSM447473,GSM447476
!subset_type = agent
^DATASET = GDS3709
#ID_REF = Platform reference identifier
#IDENTIFIER = identifier
#GSM447401 = Value for GSM447401: Smoker female study #107; src: Smoker female buccal mucosa
#GSM447411 = Value for GSM447411: Smoker female study #20; src: Smoker female buccal mucosa
#GSM447413 = Value for GSM447413: Smoker female study #205; src: Smoker female buccal mucosa
#GSM447415 = Value for GSM447415: Smoker female study #210; src: Smoker female buccal mucosa
#GSM447416 = Value for GSM447416: Smoker female study #216; src: Smoker female buccal mucosa
#GSM447425 = Value for GSM447425: Smoker female study #26; src: Smoker female buccal mucosa
#GSM447430 = Value for GSM447430: Smoker female study #2; src: Smoker female buccal mucosa
#GSM447435 = Value for GSM447435: Smoker female study #33; src: Smoker female buccal mucosa
#GSM447440 = Value for GSM447440: Smoker female study #41; src: Smoker female buccal mucosa
#GSM447444 = Value for GSM447444: Smoker female study #48; src: Smoker female buccal mucosa
#GSM447448 = Value for GSM447448: Smoker female study #51; src: Smoker female buccal mucosa
#GSM447449 = Value for GSM447449: Smoker female study #53; src: Smoker female buccal mucosa
#GSM447450 = Value for GSM447450: Smoker female study #55; src: Smoker female buccal mucosa
#GSM447452 = Value for GSM447452: Smoker female study #5; src: Smoker female buccal mucosa
#GSM447458 = Value for GSM447458: Smoker female study #68; src: Smoker female buccal mucosa
#GSM447461 = Value for GSM447461: Smoker female study #71; src: Smoker female buccal mucosa
#GSM447464 = Value for GSM447464: Smoker female study #77; src: Smoker female buccal mucosa
#GSM447468 = Value for GSM447468: Smoker female study #88; src: Smoker female buccal mucosa
#GSM447472 = Value for GSM447472: Smoker female study #93; src: Smoker female buccal mucosa
#GSM447400 = Value for GSM447400: Never smoker female study #105; src: Never smoker female buccal mucosa
#GSM447402 = Value for GSM447402: Never smoker female study #10; src: Never smoker female buccal mucosa
#GSM447403 = Value for GSM447403: Never smoker female study #110; src: Never smoker female buccal mucosa
#GSM447405 = Value for GSM447405: Never smoker female study #11; src: Never smoker female buccal mucosa
#GSM447418 = Value for GSM447418: Never smoker female study #248; src: Never smoker female buccal mucosa
#GSM447422 = Value for GSM447422: Never smoker female study #265; src: Never smoker female buccal mucosa
#GSM447424 = Value for GSM447424: Never smoker female study #269; src: Never smoker female buccal mucosa
#GSM447427 = Value for GSM447427: Never smoker female study #272; src: Never smoker female buccal mucosa
#GSM447428 = Value for GSM447428: Never smoker female study #274; src: Never smoker female buccal mucosa
#GSM447429 = Value for GSM447429: Never smoker female study #299; src: Never smoker female buccal mucosa
#GSM447431 = Value for GSM447431: Never smoker female study #3; src: Never smoker female buccal mucosa
#GSM447432 = Value for GSM447432: Never smoker female study #30; src: Never smoker female buccal mucosa
#GSM447434 = Value for GSM447434: Never smoker female study #32; src: Never smoker female buccal mucosa
#GSM447442 = Value for GSM447442: Never smoker female study #43; src: Never smoker female buccal mucosa
#GSM447451 = Value for GSM447451: Never smoker female study #57; src: Never smoker female buccal mucosa
#GSM447462 = Value for GSM447462: Never smoker female study #72; src: Never smoker female buccal mucosa
#GSM447463 = Value for GSM447463: Never smoker female study #74; src: Never smoker female buccal mucosa
#GSM447467 = Value for GSM447467: Never smoker female study #84; src: Never smoker female buccal mucosa
#GSM447469 = Value for GSM447469: Never smoker female study #8; src: Never smoker female buccal mucosa
#GSM447473 = Value for GSM447473: Never smoker female study #95; src: Never smoker female buccal mucosa
#GSM447404 = Value for GSM447404: Smoker male study #113; src: Smoker male buccal mucosa
#GSM447406 = Value for GSM447406: Smoker male study #12; src: Smoker male buccal mucosa
#GSM447407 = Value for GSM447407: Smoker male study #14; src: Smoker male buccal mucosa
#GSM447409 = Value for GSM447409: Smoker male study #19; src: Smoker male buccal mucosa
#GSM447412 = Value for GSM447412: Smoker male study #203; src: Smoker male buccal mucosa
#GSM447426 = Value for GSM447426: Smoker male study #271; src: Smoker male buccal mucosa
#GSM447433 = Value for GSM447433: Smoker male study #31; src: Smoker male buccal mucosa
#GSM447439 = Value for GSM447439: Smoker male study #40; src: Smoker male buccal mucosa
#GSM447441 = Value for GSM447441: Smoker male study #42; src: Smoker male buccal mucosa
#GSM447443 = Value for GSM447443: Smoker male study #44; src: Smoker male buccal mucosa
#GSM447445 = Value for GSM447445: Smoker male study #49; src: Smoker male buccal mucosa
#GSM447446 = Value for GSM447446: Smoker male study #4; src: Smoker male buccal mucosa
#GSM447453 = Value for GSM447453: Smoker male study #61; src: Smoker male buccal mucosa
#GSM447455 = Value for GSM447455: Smoker male study #64; src: Smoker male buccal mucosa
#GSM447456 = Value for GSM447456: Smoker male study #65; src: Smoker male buccal mucosa
#GSM447459 = Value for GSM447459: Smoker male study #6; src: Smoker male buccal mucosa
#GSM447466 = Value for GSM447466: Smoker male study #81; src: Smoker male buccal mucosa
#GSM447470 = Value for GSM447470: Smoker male study #9; src: Smoker male buccal mucosa
#GSM447474 = Value for GSM447474: Smoker male study #96; src: Smoker male buccal mucosa
#GSM447475 = Value for GSM447475: Smoker male study #98; src: Smoker male buccal mucosa
#GSM447398 = Value for GSM447398: Never smoker male study #100; src: Never smoker male buccal mucosa
#GSM447399 = Value for GSM447399: Never smoker male study #102; src: Never smoker male buccal mucosa
#GSM447408 = Value for GSM447408: Never smoker male study #17; src: Never smoker male buccal mucosa
#GSM447410 = Value for GSM447410: Never smoker male study #1; src: Never smoker male buccal mucosa
#GSM447414 = Value for GSM447414: Never smoker male study #21; src: Never smoker male buccal mucosa
#GSM447417 = Value for GSM447417: Never smoker male study #22; src: Never smoker male buccal mucosa
#GSM447419 = Value for GSM447419: Never smoker male study #251; src: Never smoker male buccal mucosa
#GSM447420 = Value for GSM447420: Never smoker male study #255; src: Never smoker male buccal mucosa
#GSM447421 = Value for GSM447421: Never smoker male study #260; src: Never smoker male buccal mucosa
#GSM447423 = Value for GSM447423: Never smoker male study #267; src: Never smoker male buccal mucosa
#GSM447436 = Value for GSM447436: Never smoker male study #35; src: Never smoker male buccal mucosa
#GSM447437 = Value for GSM447437: Never smoker male study #36; src: Never smoker male buccal mucosa
#GSM447438 = Value for GSM447438: Never smoker male study #38; src: Never smoker male buccal mucosa
#GSM447447 = Value for GSM447447: Never smoker male study #50; src: Never smoker male buccal mucosa
#GSM447454 = Value for GSM447454: Never smoker male study #63; src: Never smoker male buccal mucosa
#GSM447457 = Value for GSM447457: Never smoker male study #66; src: Never smoker male buccal mucosa
#GSM447460 = Value for GSM447460: Never smoker male study #70; src: Never smoker male buccal mucosa
#GSM447465 = Value for GSM447465: Never smoker male study #80; src: Never smoker male buccal mucosa
#GSM447471 = Value for GSM447471: Never smoker male study #91; src: Never smoker male buccal mucosa
#GSM447476 = Value for GSM447476: Never smoker male study #9; src: Never smoker male buccal mucosa
!dataset_table_begin
Also, at the very end (the last line), the file contains the following line.
!dataset_table_end
Apart from these lines, the remaining lines have the form of a table. That is:
ID_REF IDENTIFIER GSM447401 GSM447411 GSM447413 GSM447415 GSM447416 GSM447425 GSM447430 GSM447435 GSM447
1007_s_at MIR4640 1124 1196 982.8 1075 1114 1302 1279 1210 1199 1172 1245 1307 1263 1282 1191 1189 1202
1053_at RFC2 203.3 181.5 229.6 160.4 209.5 191.9 180.6 204.3 173.5 249.8 213.3 191.8 202.7 187 191.2 173.4 162.7 172.1
117_at HSPA6 53.4 55.6 49.04 39.1 51.94 69.8 52.44 51.25 47.07 109.6 48.1 43.84 49.29 59.64 39 50.07 49.22 43.33
We can extract the part of the file that we want with grep, from the command line:
grep -e '^[\^\#\!]' -v GDS3709.soft > GDS3709.soft.clean
Here we simply use grep with -v to skip the lines that start with the characters ^, # or !. The remaining lines are saved in GDS3709.soft.clean.
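The same filtering can also be done inside R, without an intermediate file. The sketch below assumes the SOFT layout shown above (metadata lines start with ^, # or !, and the table is tab-separated); the few lines in `soft.lines` are toy values made up for illustration, not real data.

```r
# A few lines in the layout of a SOFT file (toy values, not real data)
soft.lines <- c(
  "^DATABASE = Geo",
  "!dataset_title = toy example",
  "#ID_REF = Platform reference identifier",
  "!dataset_table_begin",
  "ID_REF\tIDENTIFIER\tS1\tS2",
  "probe_1\tGENE_A\t10.5\t12.1",
  "probe_2\tGENE_B\t3.2\t2.9",
  "!dataset_table_end"
)

# TRUE for metadata lines, i.e. those starting with ^, # or !
is.meta <- grepl("^[#!^]", soft.lines)
table.lines <- soft.lines[!is.meta]

# Parse the remaining lines as a table without going through a file
clean <- read.table(text = table.lines, header = TRUE, sep = "\t")
dim(clean)   # 2 rows, 4 columns
```

For the real file, `soft.lines` would come from `readLines("GDS3709.soft")`.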
On the GEO datasets page there is a “button” labeled “Data subsets”, which tells us what kind of data each column contains. Thus…
Sample    Factor: gender    Factor: agent    Title
GSM447401 female cigarette smoke Smoker female study #107
GSM447411 female cigarette smoke Smoker female study #20
GSM447413 female cigarette smoke Smoker female study #205
GSM447415 female cigarette smoke Smoker female study #210
GSM447416 female cigarette smoke Smoker female study #216
GSM447425 female cigarette smoke Smoker female study #26
GSM447430 female cigarette smoke Smoker female study #2
GSM447435 female cigarette smoke Smoker female study #33
GSM447440 female cigarette smoke Smoker female study #41
GSM447444 female cigarette smoke Smoker female study #48
GSM447448 female cigarette smoke Smoker female study #51
GSM447449 female cigarette smoke Smoker female study #53
GSM447450 female cigarette smoke Smoker female study #55
GSM447452 female cigarette smoke Smoker female study #5
GSM447458 female cigarette smoke Smoker female study #68
GSM447461 female cigarette smoke Smoker female study #71
GSM447464 female cigarette smoke Smoker female study #77
GSM447468 female cigarette smoke Smoker female study #88
GSM447472 female cigarette smoke Smoker female study #93
GSM447400 female control Never smoker female study #105
GSM447402 female control Never smoker female study #10
GSM447403 female control Never smoker female study #110
GSM447405 female control Never smoker female study #11
GSM447418 female control Never smoker female study #248
GSM447422 female control Never smoker female study #265
GSM447424 female control Never smoker female study #269
GSM447427 female control Never smoker female study #272
GSM447428 female control Never smoker female study #274
GSM447429 female control Never smoker female study #299
GSM447431 female control Never smoker female study #3
GSM447432 female control Never smoker female study #30
GSM447434 female control Never smoker female study #32
GSM447442 female control Never smoker female study #43
GSM447451 female control Never smoker female study #57
GSM447462 female control Never smoker female study #72
GSM447463 female control Never smoker female study #74
GSM447467 female control Never smoker female study #84
GSM447469 female control Never smoker female study #8
GSM447473 female control Never smoker female study #95
...
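The same factors can also be recovered from the “#GSM… = Value for …” header lines of the SOFT file itself. The following is a minimal sketch, assuming every description contains either “female” or “male” and either “Never smoker” or “Smoker”, as in the lines shown earlier; `hdr` and `annot` are names chosen here for illustration.

```r
# Four header lines copied from the SOFT file shown earlier
hdr <- c(
  "#GSM447401 = Value for GSM447401: Smoker female study #107; src: Smoker female buccal mucosa",
  "#GSM447400 = Value for GSM447400: Never smoker female study #105; src: Never smoker female buccal mucosa",
  "#GSM447404 = Value for GSM447404: Smoker male study #113; src: Smoker male buccal mucosa",
  "#GSM447398 = Value for GSM447398: Never smoker male study #100; src: Never smoker male buccal mucosa"
)

annot <- data.frame(
  # sample accession, e.g. GSM447401
  sample = sub("^#(GSM[0-9]+) = .*$", "\\1", hdr),
  # test for "female" first, since "male" is a substring of it
  gender = ifelse(grepl("female", hdr), "female", "male"),
  # never-smokers are the controls; everyone else is a smoker
  agent  = ifelse(grepl("Never smoker", hdr), "control", "cigarette smoke"),
  stringsAsFactors = FALSE
)
annot

# Cross-tabulate the two factors
table(annot$gender, annot$agent)
```

With the full set of header lines, the same code would give one row per sample, which can then be used to select columns of the expression table by name instead of by position.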
So we know that we have male and female samples, and smokers and non-smokers. Sex and smoking status are independent variables that affect the dependent variable, which is the expression level of each gene. We will therefore try to “explain” the expression value in terms of the person's sex and whether they smoke. We will first look at these two factors separately and then combine them.
First, however, let's look at some properties of the distribution of expression values in each individual.
raw.data = read.table("GDS3709.soft.clean", header = TRUE)
dim(raw.data)
## [1] 54675 81
data <- raw.data[,-c(1,2)]
dim(data)
## [1] 54675 79
## let's get only the females. The first 39 samples are females
female.data <- data[,1:39]
label <- rep(c("smoker", "control"), c(19, 20))
label
## [1] "smoker" "smoker" "smoker" "smoker" "smoker" "smoker" "smoker"
## [8] "smoker" "smoker" "smoker" "smoker" "smoker" "smoker" "smoker"
## [15] "smoker" "smoker" "smoker" "smoker" "smoker" "control" "control"
## [22] "control" "control" "control" "control" "control" "control" "control"
## [29] "control" "control" "control" "control" "control" "control" "control"
## [36] "control" "control" "control" "control"
## this is very skewed towards small values
boxplot((female.data))
## we need to obtain the log of this
boxplot(log(female.data))
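To see why the log helps, it is useful to compare the mean and the median before and after the transformation. The sketch below uses simulated log-normal values (an assumption made purely for illustration; these are not values from GDS3709) in place of one sample's intensities.

```r
set.seed(1)
# Simulated right-skewed intensities (log-normal); NOT values from GDS3709
x <- rlnorm(1000, meanlog = 5, sdlog = 1)

# On the raw scale the long right tail pulls the mean well above the median
c(mean = mean(x), median = median(x))

# On the log scale the two agree closely: the distribution is roughly symmetric
c(mean = mean(log(x)), median = median(log(x)))

# Raw and log-transformed values in two separate panels
op <- par(mfrow = c(1, 2))
boxplot(x, main = "raw")
boxplot(log(x), main = "log")
par(op)
```

The choice of natural log here simply mirrors the `log()` call above; the same comparison works with any other base.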