The goal of this lab is to practice importing data from various data sources (csv, txt, etc.), selecting a dataset to analyze, checking whether the data contain any missing values, generating statistical characteristics (central tendency and dispersion), and visualizing the data.
Here we will load data from different formats, such as csv, txt, xlsx, SPSS, WWW, JSON, SQL, and NoSQL.
library(readxl)
library(foreign)
library(rjson)
library(XML)
csv_dataset <- read.csv("csv_file.csv")
head(csv_dataset)
excel_dataset <- read_excel("excel_file.xlsx")
head(excel_dataset)
spss_dataset <- read.spss("distance.sav")
head(spss_dataset)
$Year
[1] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
[28] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
$Distance
[1] 2374.316 2523.864 2585.533 2694.602 2969.307 3123.325 3277.309 3376.553 3489.659 3556.706 3679.627 3868.240 4037.188 4230.839 4154.546
[16] 4200.207 4420.411 4502.123 4655.094 4630.473 4919.184 5012.086 5164.823 5226.791 5484.910 5583.174 5835.424 6270.900 6691.396 7219.932
[31] 7268.254 7257.270 7238.574 7210.845 7354.472 7478.202 7654.054 7792.262 7899.172 8029.656 7997.621 8103.716 8276.073 8309.723
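As the output above shows, read.spss() returns a named list by default. The following is a minimal sketch of turning such a list into a data frame so that head() shows rows. The small spss_list below is a hand-typed stand-in holding the first three values printed above; it is not read from the file.

```r
# Stand-in for the list returned by read.spss() (first three values only)
spss_list <- list(
  Year     = c(1960, 1961, 1962),
  Distance = c(2374.316, 2523.864, 2585.533)
)

# Convert the named list of equal-length vectors into a data frame
spss_df <- as.data.frame(spss_list)
head(spss_df)

# With the real file, the same result can be requested at read time:
# spss_df <- read.spss("distance.sav", to.data.frame = TRUE)
```

With a data frame, head() returns the first six rows rather than the first elements of the list.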
txt_dataset <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt", header = FALSE)
head(txt_dataset)
Json_dataset <- fromJSON(file= "animals.json" )
head(Json_dataset)
[[1]]
[[1]]$name
[1] "Meowsy"
[[1]]$species
[1] "cat"
[[1]]$foods
[[1]]$foods$likes
[1] "tuna" "catnip"
[[1]]$foods$dislikes
[1] "ham" "zucchini"
[[2]]
[[2]]$name
[1] "Barky"
[[2]]$species
[1] "dog"
[[2]]$foods
[[2]]$foods$likes
[1] "bones" "carrots"
[[2]]$foods$dislikes
[1] "tuna"
[[3]]
[[3]]$name
[1] "Purrpaws"
[[3]]$species
[1] "cat"
[[3]]$foods
[[3]]$foods$likes
[1] "mice"
[[3]]$foods$dislikes
[1] "cookies"
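The nested list above is awkward to analyze directly. Below is a rough sketch of flattening it into a data frame with base R. The animals list is typed in by hand to mirror the printed output, and the column name likes and the choice to collapse the food vectors into one string are assumptions made for illustration.

```r
# Hand-typed copy of the structure printed above
animals <- list(
  list(name = "Meowsy", species = "cat",
       foods = list(likes = c("tuna", "catnip"), dislikes = c("ham", "zucchini"))),
  list(name = "Barky", species = "dog",
       foods = list(likes = c("bones", "carrots"), dislikes = "tuna")),
  list(name = "Purrpaws", species = "cat",
       foods = list(likes = "mice", dislikes = "cookies"))
)

# Pull one value per animal; vapply() checks that each result is a single string
animals_df <- data.frame(
  name    = vapply(animals, function(a) a$name, character(1)),
  species = vapply(animals, function(a) a$species, character(1)),
  likes   = vapply(animals, function(a) paste(a$foods$likes, collapse = ", "),
                   character(1)),
  stringsAsFactors = FALSE
)
animals_df
```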
xml_dataset <- xmlTreeParse("test.xml")
head(xml_dataset)
$doc
$file
[1] "test.xml"
$version
[1] "1.0"
$children
$children$root
<root testAttr="testValue">
<result>
<child>data1</child>
<child>A1343358848.646</child>
<child>
<internal>
<data>one</data>
<data>two</data>
<unique>Z1343358848.646</unique>
</internal>
</child>
</result>
</root>
attr(,"class")
[1] "XMLDocumentContent"
$dtd
$external
NULL
$internal
NULL
attr(,"class")
[1] "DTDList"
# Note: this only stores the URL as a character string; no page content is downloaded
www_dataset <- "https://www.google.com/"
head(www_dataset)
[1] "https://www.google.com/"
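To actually pull content from a web address, the text has to be read through a connection. The sketch below shows the idea with readLines(). So that it runs offline, it reads a small temporary HTML file created on the spot; the file contents and the names tmp, page_lines, and title_line are made up for illustration.

```r
# Create a small local HTML file as a stand-in for a web page
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>",
             "<head><title>Example page</title></head>",
             "<body>Hello</body>",
             "</html>"), tmp)

# Read the page line by line into a character vector
page_lines <- readLines(tmp, warn = FALSE)
head(page_lines)

# Keep only the line that holds the page title
title_line <- grep("<title>", page_lines, value = TRUE)
title_line

# For a real page, pass a URL connection instead of a local path, e.g.
# page_lines <- readLines(url("https://www.google.com/"), warn = FALSE)
```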
CSV and Excel Files: These files were provided to us by our statistics professor last semester for data analytics and statistics practice.
SPSS File: This file was downloaded from https://people.bath.ac.uk/pssiw/stats2/page16/page16.html. It came with the following description of the data: “these data show how far, on average, each person in the UK drives each year. We can look at the relationship between time and how far people drive.”
TXT File: This file is from DataCamp's website and can be found at https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt
JSON File: This file was downloaded from https://raw.githubusercontent.com/LearnWebCode/json-example/master/animals-1.json
XML File: This file was downloaded from https://gist.githubusercontent.com/djangofan/3186223/raw/ee3840f937e745d5683b046a85cc4a0337e18428/test.xml
Here I will use the CSV file that I loaded above as csv_dataset. This dataset has two columns, Set.1 and Set.2, and 50 rows.
head(csv_dataset)
Seeing if any missing values exist in the data
print(sum(rowSums(is.na(csv_dataset))))
[1] 0
0 indicates that there are no missing values in the dataset
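A single total does not say where missing values sit when there are some. As a small sketch, the toy data frame below (made-up values, not the lab data) shows how to count missing values overall, per column, and per row with base R.

```r
# Toy data frame with a few deliberately missing values
toy <- data.frame(Set.1 = c(1, NA, 8, 10),
                  Set.2 = c(7, 9, NA, NA))

# Total number of missing cells
total_na <- sum(is.na(toy))

# Missing cells per column
na_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one missing value
incomplete_rows <- sum(!complete.cases(toy))

total_na
na_per_column
incomplete_rows
```

For a complete dataset such as csv_dataset, all three results would be zero.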
To measure the central tendency of the data, we will look at the mean, median, mode, and midrange. We can read the mean and median directly from the output of summary(), which reports the five-number summary plus the mean.
summary(csv_dataset)
Set.1 Set.2
Min. : 1.0 Min. : 1.00
1st Qu.: 6.0 1st Qu.: 7.00
Median : 8.0 Median : 9.00
Mean : 8.1 Mean : 8.68
3rd Qu.:10.0 3rd Qu.:11.00
Max. :17.0 Max. :17.00
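# Getting the mode. This code was copied from https://www.tutorialspoint.com/r/r_mean_median_mode.htm
getmode <- function(v) {
  # Distinct values, in order of first appearance
  uniqv <- unique(v)
  # Count how often each distinct value occurs and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}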
The mode for Set 1 is 8, and the mode for Set 2 is also 8.
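The call that produced these modes is not shown, so here is a small sketch of how getmode() behaves on toy vectors (made-up values, not the lab data). It also illustrates one property worth knowing: when several values tie for most frequent, the function returns the one that appears first.

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# 8 occurs three times, more than any other value
toy <- c(2, 8, 8, 3, 5, 8, 3)
getmode(toy)

# 1 and 2 both occur twice; which.max() picks the first maximum, so 1 is returned
tied <- c(1, 1, 2, 2)
getmode(tied)

# Applied to every column of a data frame, e.g. sapply(csv_dataset, getmode)
```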
Conclusion for Central Tendency
From the five-number summary and the getmode() function above, we can see the following:
| Measure | Set 1 | Set 2 |
|---|---|---|
| Mean | 8.1 | 8.68 |
| Median | 8.0 | 9.0 |
| Mode | 8.0 | 8.0 |
| Midrange | (1+17)/2 = 9.0 | (1+17)/2 = 9.0 |
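The midrange row above is computed by hand as (min + max) / 2. A minimal sketch of the same calculation in R follows; midrange() is a hypothetical helper name, and the toy vector only reuses the minimum and maximum shown in the summary.

```r
# Hypothetical helper: the midrange is the mean of the minimum and maximum
midrange <- function(x) mean(range(x, na.rm = TRUE))

# Toy vector whose minimum is 1 and maximum is 17
toy <- c(1, 6, 8, 10, 17)
midrange(toy)  # (1 + 17) / 2 = 9
```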
To measure data dispersion, we will look at the range, quartiles, variance, standard deviation, and IQR. We start again from the five-number summary:
summary(csv_dataset)
Set.1 Set.2
Min. : 1.0 Min. : 1.00
1st Qu.: 6.0 1st Qu.: 7.00
Median : 8.0 Median : 9.00
Mean : 8.1 Mean : 8.68
3rd Qu.:10.0 3rd Qu.:11.00
Max. :17.0 Max. :17.00
The variance for Set 1 is 11.1530612, and the variance for Set 2 is 12.3444898.
The standard deviation for Set 1 is 3.3396199, and the standard deviation for Set 2 is 3.5134726.
Conclusion for Dispersion
From the five-number summary and the variance and standard deviation values above, we can see the following:
| Measure | Set 1 | Set 2 |
|---|---|---|
| Range | 17 - 1 = 16 | 17 - 1 = 16 |
| Quartiles | Q1 = 6, Q2 = 8, Q3 = 10 | Q1 = 7, Q2 = 9, Q3 = 11 |
| Variance | 11.1530612244898 | 12.3444897959184 |
| Standard Dev. | 3.33961992216027 | 3.51347261209169 |
| IQR | 10 - 6 = 4 | 11 - 7 = 4 |
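The dispersion measures in the table above can all be obtained with base R functions. The sketch below runs them on a toy vector (made-up values, not the lab data). Note that var() and sd() use the sample formulas with n - 1 in the denominator, and quantile() uses R's default interpolation method.

```r
# Toy vector with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range as a single number: maximum minus minimum
x_range <- diff(range(x))

# Quartiles (Q1, Q2, Q3) with R's default quantile method
x_quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Sample variance and standard deviation (n - 1 denominator)
x_var <- var(x)
x_sd  <- sd(x)

# Interquartile range: Q3 - Q1
x_iqr <- IQR(x)

x_range
x_quartiles
x_var
x_sd
x_iqr
```

For the lab data, the same calls would be applied per column, for example var(csv_dataset$Set.1) or sapply(csv_dataset, sd).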
We will draw the following visualizations for the two sets:
1. Box plot
2. Histograms
3. Normal Q-Q plot
boxplot(csv_dataset, main = "Box Plot for the two data sets")
hist(csv_dataset$Set.1, main = "Histogram for the first data set")
hist(csv_dataset$Set.2, main = "Histogram for the second data set")
qqnorm(csv_dataset$Set.1, main = "Normal Q-Q Plot for Set 1")
qqline(csv_dataset$Set.1,col='red')
qqnorm(csv_dataset$Set.2, main = "Normal Q-Q Plot for Set 2")
qqline(csv_dataset$Set.2,col='blue')
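As a numeric companion to the box plot, boxplot.stats() returns the values that boxplot() draws as individual outlier points, which by default are those lying more than 1.5 times the hinge spread beyond the hinges. The sketch below uses a toy vector (made-up values, not the lab data).

```r
# Toy vector with one unusually low and one unusually high value
x <- c(1, 6, 7, 8, 8, 9, 10, 17)

stats <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats

# Points that boxplot() would draw beyond the whiskers
outliers <- stats$out
outliers
```

The same call on a column of the lab data, for example boxplot.stats(csv_dataset$Set.1)$out, would list any points shown beyond the whiskers in the box plot.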