Question 1: Tidy Data

What is tidy data?

Tidy data is the output of the dataset-tidying process, which is in turn a part of data cleaning. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning). A dataset is messy or tidy depending on how rows, columns, and tables are matched up with observations, variables, and types.

Structure is the form and shape of your data. In statistics, most datasets are rectangular data tables (data frames) made up of rows and columns.

Semantics is the meaning of the dataset. Datasets are a collection of values, either quantitative or qualitative. These values are organized in two ways: by variable and by observation.

The three rules of Tidy Data:

- Each variable is a column

- Each observation is a row

- Each type of observational unit is a table

Question 2: Long and Wide

What are wide and long data formats? Give two use cases of each data frame structure, along with an example script in R. Which packages can be used for creating long and wide data formats?

The Long Format

A table stored in ‘long’ format has a single column for each variable in the system. An example dataset is given below.
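The chunk that built this data frame was not echoed; the following is a minimal sketch that reproduces it (the object name longdata is taken from the validation step later in this answer, and the scores are copied from the output):

names  <- c("Avinash", "Shreyas", "Rajesh", "Sai", "Deepa")
scores <- c(93, 97, 94, 97, 86,    # 2017
            79, 83, 88, 93, 100,   # 2018
            76, 79, 93, 75, 85,    # 2019
            87, 85, 85, 79, 99,    # 2020
            86, 83, 99, 93, 91)    # 2021
longdata <- data.frame(ID = rep(1:5, times = 5),
                       Name = rep(names, times = 5),
                       Year = rep(2017:2021, each = 5),
                       Score = scores)
longdata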

##    ID    Name Year Score
## 1   1 Avinash 2017    93
## 2   2 Shreyas 2017    97
## 3   3  Rajesh 2017    94
## 4   4     Sai 2017    97
## 5   5   Deepa 2017    86
## 6   1 Avinash 2018    79
## 7   2 Shreyas 2018    83
## 8   3  Rajesh 2018    88
## 9   4     Sai 2018    93
## 10  5   Deepa 2018   100
## 11  1 Avinash 2019    76
## 12  2 Shreyas 2019    79
## 13  3  Rajesh 2019    93
## 14  4     Sai 2019    75
## 15  5   Deepa 2019    85
## 16  1 Avinash 2020    87
## 17  2 Shreyas 2020    85
## 18  3  Rajesh 2020    85
## 19  4     Sai 2020    79
## 20  5   Deepa 2020    99
## 21  1 Avinash 2021    86
## 22  2 Shreyas 2021    83
## 23  3  Rajesh 2021    99
## 24  4     Sai 2021    93
## 25  5   Deepa 2021    91

In the above case, each data point represents the score of an individual in a particular year, so our variables are ID, Name, Year, and Score.

The Wide Format

A table stored in ‘wide’ format spreads a variable across several columns. The wide format of the same sample dataset above is given below:

##   ID    Name 2017 2018 2019 2020 2021
## 1  1 Avinash   93   79   76   87   86
## 2  2 Shreyas   97   83   79   85   83
## 3  3  Rajesh   94   88   93   85   99
## 4  4     Sai   97   93   75   79   93
## 5  5   Deepa   86  100   85   99   91

Most R functions expect data in the long format, and it is often easier to process data in this form.

On the other hand, the wide format is easier for people to view and comprehend, especially during data entry and validation, where human comprehension is important for ensuring quality and accuracy.

Datasets tend to start out life in wide format and become long as they are used more for processing. Fortunately, converting back and forth is easy nowadays, especially with the tidyr package.

Let’s see this with an example:

##    ID    Name Year Score
## 1   1 Avinash 2017    93
## 2   2 Shreyas 2017    97
## 3   3  Rajesh 2017    94
## 4   4     Sai 2017    97
## 5   5   Deepa 2017    86
## 6   1 Avinash 2018    79
## 7   2 Shreyas 2018    83
## 8   3  Rajesh 2018    88
## 9   4     Sai 2018    93
## 10  5   Deepa 2018   100

This data, as we can see, is in the long format and is very easy to work with in R; we can visualise or even summarize the data easily in this form. But if a layperson wants to get some insights just by looking at the data, without visualization, it is difficult to comprehend. So we will now convert this long data to wide data using the spread() function from the tidyr package:
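The conversion chunk itself was hidden; a sketch, assuming the long data frame is named longdata and the result widedata:

library(tidyr)
widedata <- spread(longdata, key = Year, value = Score)
widedata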

##   ID    Name 2017 2018 2019 2020 2021
## 1  1 Avinash   93   79   76   87   86
## 2  2 Shreyas   97   83   79   85   83
## 3  3  Rajesh   94   88   93   85   99
## 4  4     Sai   97   93   75   79   93
## 5  5   Deepa   86  100   85   99   91

As we can see now, the data is very appealing and we can draw initial insights just by looking at it. Initial insights help give direction to further data analysis.

Now, we can also convert this same wide data back to the long format we saw earlier, using the gather() function of the tidyr package:
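A sketch of the reverse conversion (convert = TRUE turns the gathered Year column back into numbers; some type coercion or reordering may still be needed for an exact match with longdata):

longdata2 <- gather(widedata, key = "Year", value = "Score",
                    `2017`:`2021`, convert = TRUE)
longdata2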

##    ID    Name Year Score
## 1   1 Avinash 2017    93
## 2   2 Shreyas 2017    97
## 3   3  Rajesh 2017    94
## 4   4     Sai 2017    97
## 5   5   Deepa 2017    86
## 6   1 Avinash 2018    79
## 7   2 Shreyas 2018    83
## 8   3  Rajesh 2018    88
## 9   4     Sai 2018    93
## 10  5   Deepa 2018   100

longdata and longdata2 are identical datasets; converting to wide and then back to long caused no loss of information. We can validate that the two datasets are the same:
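Presumably the hidden chunk compared the two objects, for example:

identical(longdata, longdata2)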

## [1] TRUE

Some packages are designed to work well with the wide format, while most work best with long data. It is up to the analyst to choose a format depending on their goals.

Question 3: Barplot and Histogram

Import the ‘iris’ dataset available in R and visualise the following two types of charts using the dataset:

- Barplot

- Histogram

Explain where these two types of plots differ with their respective implementations.


Histogram

A histogram is a type of bar chart used to represent statistical information, using bars to show the frequency distribution of continuous data. It indicates the number of observations that lie within a range of values, known as a class or bin.

# Load the required packages (ggplot2 for the plots; gridExtra and grid
# provide grid.arrange() and textGrob() used below)
library(ggplot2)
library(gridExtra)
library(grid)

# Sepal length 
HistSl <- ggplot(data=iris, aes(x=Sepal.Length))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) + 
  xlab("Sepal Length (cm)") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=iris, aes(xintercept = mean(Sepal.Length)),
             linetype="dashed",color="grey")

# Sepal width
HistSw <- ggplot(data=iris, aes(x=Sepal.Width)) +
  geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) + 
  xlab("Sepal Width (cm)") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Width")+
  geom_vline(data=iris, aes(xintercept = mean(Sepal.Width)),
             linetype="dashed",color="grey")

# Petal length
HistPl <- ggplot(data=iris, aes(x=Petal.Length))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) + 
  xlab("Petal Length (cm)") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Petal Length")+
  geom_vline(data=iris, aes(xintercept = mean(Petal.Length)),
             linetype="dashed",color="grey")

# Petal width
HistPw <- ggplot(data=iris, aes(x=Petal.Width))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) + 
  xlab("Petal Width (cm)") +  
  ylab("Frequency") + 
  theme(legend.position="right" )+
  ggtitle("Histogram of Petal Width")+
  geom_vline(data=iris, aes(xintercept = mean(Petal.Width)),
             linetype="dashed",color="grey")

# Plot all visualizations
grid.arrange(HistSl + ggtitle(""),
             HistSw + ggtitle(""),
             HistPl + ggtitle(""),
             HistPw  + ggtitle(""),
             nrow = 2,
             top = textGrob("Iris Frequency Histogram", 
                            gp=gpar(fontsize=15))
)

Given above is a grid comparing the histograms of the continuous numerical parameters, namely Petal.Length, Petal.Width, Sepal.Length, and Sepal.Width.

The x-axes in the above graphs are continuous variables, while the y-axis plots the frequency (the count of observations per bin). The bars have been differentiated and color-coded according to the categorical variable Species so that we can gain insights into the data.

Barplot

A bar plot is a chart that graphically represents a comparison between categories of data. It displays grouped data using parallel rectangular bars of equal width but varying length. Each rectangular bar represents a specific category, and its length depends on the value it holds.

Given above is a grid comparing the bar plots of the means of the variables Petal.Length, Petal.Width, Sepal.Length, and Sepal.Width across the categorical parameter Species, which has the factor levels setosa, versicolor, and virginica.

The x-axis in the above graphs is discrete (the Species categories), while the y-axis is continuous and plots the mean length/width. This is the key difference between the two plots: a histogram bins a continuous variable to show its distribution, while a bar plot compares a summary value across discrete categories.
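The bar plot chunk was not echoed; the following is a minimal sketch of how one panel of such a grid might be produced (the use of aggregate() and the geom_col() styling are assumptions):

library(ggplot2)
# Mean sepal length per species
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
BarSl <- ggplot(means, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_col() +
  ylab("Mean Sepal Length (cm)") +
  ggtitle("Barplot of Mean Sepal Length")
BarSl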

Question 4: Outliers

In the same ‘iris’ dataset, you have been assigned the task of identifying outliers. You must be able to provide a comprehensive glimpse of the dataset using a suitable visualisation. What method would you choose and why?
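The summary below was presumably produced by:

summary(iris)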

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Boxplot

When trying to visualise outliers, box plots are the most intuitive choice and usually the first one to reach for.

Box plots show the five-number summary of a set of data: the minimum, first (lower) quartile, median, third (upper) quartile, and maximum.

Following are the boxplots for each of the variables across the three Species.
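The plotting chunk itself was not shown; a minimal sketch that would produce per-species boxplots for every measurement (the reshape via gather() and the facet layout are assumptions):

library(ggplot2)
library(tidyr)
# One row per (Species, measurement) pair, so each measure gets its own panel
iris_long <- gather(iris, key = "Measure", value = "Value", -Species)
ggplot(iris_long, aes(x = Species, y = Value, fill = Species)) +
  geom_boxplot() +
  facet_wrap(~ Measure, scales = "free_y")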

The dots represent outliers.

  • Outliers are defined as values more than 3 times the IQR above the third quartile or below the first quartile.

  • Suspected outliers are values more than 1.5 times the IQR (interquartile range) above the third quartile or below the first quartile.

We can easily apply filters on the basis of the above information and pinpoint the exact outlier data points.
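For example, a sketch of such a filter for Sepal.Width using the 1.5 × IQR rule (the choice of variable is illustrative):

q <- quantile(iris$Sepal.Width, c(0.25, 0.75))
iqr <- IQR(iris$Sepal.Width)
iris[iris$Sepal.Width < q[1] - 1.5 * iqr |
     iris$Sepal.Width > q[2] + 1.5 * iqr, ]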

k-means Clustering

Another way to detect outliers is clustering. By grouping data into clusters, points that fit poorly into any cluster can be treated as outliers. With k-means, the data are partitioned into k groups by assigning each observation to its closest cluster center. After that, we can calculate the distance (or dissimilarity) between each object and its cluster center, and pick those with the largest distances as outliers.

By using clustering to find outliers, we find outliers jointly across all the variables, whereas with the box plot we can only inspect outliers one variable at a time.
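The clustering chunk was hidden; the following sketch reproduces the standard approach (k = 3 matches the three centers printed below; note that kmeans() is randomly initialised, so a seed would be needed to reproduce the exact result):

iris2 <- iris[, 1:4]                       # numeric columns only
kmeans.result <- kmeans(iris2, centers = 3)
kmeans.result$centers                      # the cluster centers
# distance of every point from its own cluster center
centers <- kmeans.result$centers[kmeans.result$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))
outliers <- order(distances, decreasing = TRUE)[1:5]
iris2[outliers, ]                          # the five most distant points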

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.314583    2.895833     4.973958   1.7031250
## 2     5.175758    3.624242     1.472727   0.2727273
## 3     4.738095    2.904762     1.790476   0.3523810
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 119          7.7         2.6          6.9         2.3
## 118          7.7         3.8          6.7         2.2
## 132          7.9         3.8          6.4         2.0
## 123          7.7         2.8          6.7         2.0
## 106          7.6         3.0          6.6         2.1

Below we are plotting the data with Sepal.Width on the y-axis and Sepal.Length on the x-axis.
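A sketch of the plotting code, building on the objects from the clustering sketch above (the point characters are chosen to match the description that follows):

plot(iris2[, c("Sepal.Length", "Sepal.Width")],
     col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = "*", cex = 2)      # cluster centers
points(iris2[outliers, c("Sepal.Length", "Sepal.Width")],
       col = 4, pch = "+", cex = 2)        # outliers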

In the above figure, cluster centers are labeled with an asterisk “*” and outliers with a plus “+”.

Question 5: Depth for Diamonds

Import the ‘diamonds’ dataset that is provided by the R ggplot package. Your aim is to create a scatter plot of price vs depth. How will you proceed to find the depth value where most diamonds are found? Show in chart.

We have to create a scatter plot of price vs depth. We will use ggplot() and then geom_point() to get what we need. But since there is a very high number of data points, which would overlap, we will use alpha = 0.1 to make our visualisation a little clearer; this changes the opacity of the points to 10%.
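A sketch of that chunk (the diamonds dataset ships with ggplot2):

library(ggplot2)
ggplot(diamonds, aes(x = depth, y = price)) +
  geom_point(alpha = 0.1)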

Still, we cannot draw any conclusions about the depth value where most diamonds are found. Even though we have changed the alpha to 0.1, wherever ten or more points overlap we get a completely dark point, and we cannot distinguish between ten points overlapping and a hundred.

To deal with this issue, we will now use geom_bin2d() from the ggplot2 package to create a scatter plot color-coded according to density. One may see this as similar to a heat map.
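A sketch (the number of bins is an assumption):

ggplot(diamonds, aes(x = depth, y = price)) +
  geom_bin2d(bins = 50)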

We can see that the highest density is at around a depth value of 61. We can see this clearly with the help of a histogram, as given below:
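A sketch of the histogram chunk (the binwidth is an assumption; the peak count reported below depends on it):

ggplot(diamonds, aes(x = depth)) +
  geom_histogram(binwidth = 0.2)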

The histogram verifies our conclusion from the scatter plot: the depth value where most diamonds are found is around 61, where the count is approximately 2500.

Question 6: State Names and Vowels

In this case, we will use the ‘USArrests’ dataset provided by R:

- Abbreviate the names of states.

- Select the states that contain the letter ‘b’

- Count the frequency of the vowels in the names and plot their frequency distribution.
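The loading chunk was presumably:

data("USArrests")
head(USArrests)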

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

The USArrests dataset has been loaded successfully, and the first few rows are shown above using head().

We will now look at the row names of our dataset and store them in states_full:
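A sketch:

states_full <- rownames(USArrests)
states_full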

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

These are complete state names, which we need to abbreviate. Luckily for us, R comes with a list of the state names, stored in state.name, and their abbreviations, stored in state.abb. Our task is now only to map state.name to states_full, matching the positions where state.name equals each element of states_full (for instance via the which() function).

We then store the results in states_abbr
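A sketch of the mapping; match() finds, for each full name, its position in state.name (equivalent to the which() comparison described above):

states_abbr <- state.abb[match(states_full, state.name)]
states_abbr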

##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
## [16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
## [31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
## [46] "VA" "WA" "WV" "WI" "WY"

Now, for subsections 2 and 3 of this question, it is ambiguous whether the full state names or the abbreviated state names are to be used. To deal with this confusion, both cases have been considered.

To check whether a pattern occurs in a string, we use the grepl() function, which returns TRUE if the pattern is present. Hence we obtain a vector of logical values, which we can use to subset the data.

The pattern we are passing is [Bb], which means we are checking whether B or b is present in the string:
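Presumably (the vector name full is the one used below):

full <- grepl("[Bb]", states_full)
full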

##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE

We now use full to subset the data and select the states that contain the letter ‘b’
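Likely:

states_full[full]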

## [1] "Alabama"  "Nebraska"

“Alabama” and “Nebraska” have a b in them.

To cover the other case for abbreviated state names, we follow the same procedure.
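A sketch (the vector name abbr is an assumption):

abbr <- grepl("[Bb]", states_abbr)
states_abbr[abbr]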

## character(0)

We have no abbreviated state names with the letter b in them.

Now we have to create a frequency distribution for the frequency of vowels in the state names. We will start with the case for full state names.

We will use the tokenizers package to tokenize the state names into their constituent characters.
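A sketch, using “Massachusetts” (the 21st state) as in the output below; tokenize_characters() lowercases by default and returns a list:

library(tokenizers)
tokenize_characters(states_full[21])[[1]]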

##  [1] "m" "a" "s" "s" "a" "c" "h" "u" "s" "e" "t" "t" "s"

We now create an empty data frame which will keep a record of our total frequencies
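A sketch (the column names Var1/Freq match those produced by table() later, which makes the merge straightforward):

freq_full <- data.frame(Var1 = c("a", "e", "i", "o", "u"), Freq = 0)
freq_full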

##   Var1 Freq
## 1    a    0
## 2    e    0
## 3    i    0
## 4    o    0
## 5    u    0

We will first calculate the number of vowels in the first state name, and then run a for loop to calculate them for the other states.

We create a temporary dataset with the frequency of each letter, using the table() function:
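Presumably:

temp <- as.data.frame(table(tokenize_characters(states_full[1])[[1]]))
temp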

##   Var1 Freq
## 1    a    4
## 2    b    1
## 3    l    1
## 4    m    1

We now proceed to merge this with the vowels-only frequency data frame we created earlier:
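A sketch; all.x = TRUE keeps all five vowel rows:

freq_full <- merge(freq_full, temp, by = "Var1", all.x = TRUE)
freq_full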

##   Var1 Freq.x Freq.y
## 1    a      0      4
## 2    e      0     NA
## 3    i      0     NA
## 4    o      0     NA
## 5    u      0     NA

Only the letter a was common between the two datasets, so its count was carried over; the remaining missing values were filled with NA.

Now we want to add the Freq.x and Freq.y columns, but adding anything to NA yields NA. So we will replace all NA with 0:
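Likely:

freq_full[is.na(freq_full)] <- 0
freq_full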

##   Var1 Freq.x Freq.y
## 1    a      0      4
## 2    e      0      0
## 3    i      0      0
## 4    o      0      0
## 5    u      0      0

Now we add the two columns into a new column Freq
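Presumably:

freq_full$Freq <- freq_full$Freq.x + freq_full$Freq.y
freq_full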

##   Var1 Freq.x Freq.y Freq
## 1    a      0      4    4
## 2    e      0      0    0
## 3    i      0      0    0
## 4    o      0      0    0
## 5    u      0      0    0

Now we do not need Freq.x and Freq.y columns so we will remove them.
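Likely:

freq_full <- freq_full[, c("Var1", "Freq")]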

Now we have our final result.

##   Var1 Freq
## 1    a    4
## 2    e    0
## 3    i    0
## 4    o    0
## 5    u    0

Now we need to repeat this process for the other states, updating our final table freq_full each time. We run a for loop for that, as below:
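A sketch of the loop; it repeats the merge/replace/add/drop steps for states 2 through 50, then renames the columns to match the table below:

for (s in states_full[-1]) {
  temp <- as.data.frame(table(tokenize_characters(s)[[1]]))
  freq_full <- merge(freq_full, temp, by = "Var1", all.x = TRUE)
  freq_full[is.na(freq_full)] <- 0
  freq_full$Freq <- freq_full$Freq.x + freq_full$Freq.y
  freq_full <- freq_full[, c("Var1", "Freq")]
}
colnames(freq_full) <- c("Vowels", "Frequency")
freq_full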

##   Vowels Frequency
## 1      a        61
## 2      e        28
## 3      i        44
## 4      o        36
## 5      u         8

The above frequency table shows the frequencies of the vowels in the full state names.

For visualising the frequency distribution, we will now plot a pie chart. A pie chart is just a bar chart with polar coordinates.
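A sketch (a bar chart via geom_bar with stat = "identity", wrapped in coord_polar):

library(ggplot2)
ggplot(freq_full, aes(x = "", y = Frequency, fill = Vowels)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y")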

Now we repeat the whole procedure for the abbreviated state names.

##   Vowels Frequency
## 1      a        12
## 2      e         3
## 3      i         8
## 4      o         5
## 5      u         1

Question 7: Mystery Method

What is the value of fn(3)? Can you explain what is happening at each step?

mystery_method <- function(x) {
  function(z) Reduce(function(y, w) w(y), x, z)
}

fn <- mystery_method(c(function(x) x + 1, function(x) x * x))

First, we will run the given code, and find out the value of fn(3)

## [1] 16

We can see that the value of fn(3) is 16. We will need this value to validate our step-by-step breakdown of the function below.

fn is the function returned by mystery_method(). In mystery_method() we are passing c(function(x) x + 1, function(x) x * x) as x. Hence we can replace x by c(function(x) x + 1, function(x) x * x), create a new function fn2, run it, and validate our progress.
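A sketch of the substitution (fn2 is the name used above):

fn2 <- function(z) Reduce(function(y, w) w(y),
                          c(function(x) x + 1, function(x) x * x), z)
fn2(3)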

## [1] 16

We are still getting the same answer, 16, so we are on the right path. Many anonymous functions are being passed around, which clutters our analysis. So we will name these functions fnA, fnB, and fnC and create a new function fn3 that is easier to analyse.
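A sketch:

fnA <- function(y, w) w(y)    # apply function w to value y
fnB <- function(x) x + 1
fnC <- function(x) x * x
fn3 <- function(z) Reduce(fnA, c(fnB, fnC), z)
fn3(3)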

## [1] 16

Output is still 16, validating our progress.

Reduce() reduces a vector, x, to a single value by recursively calling a function, f, two arguments at a time. It combines the first two elements with f, then combines the result of that call with the third element, and so on.

Now, the Reduce() function is taking three arguments: fnA, c(fnB, fnC), and z. A three-argument Reduce call initializes its accumulator with the third argument, which is z. So first z is passed as an argument to fnB, giving fnB(z). The inner function, fnA, takes a value and a function and applies that function to the value. Hence fnB(z) and fnC are then passed to fnA.
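Unrolling the Reduce call by hand (fn4 is a name introduced here for illustration):

fn4 <- function(z) fnA(fnA(z, fnB), fnC)   # i.e. fnC(fnB(z))
fn4(3)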

## [1] 16

The result is still 16, validating our progress. Now let us take an even deeper look and solve our mystery_method().

The following is what is happening inside fnA:
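Presumably:

fnC(fnB(3))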

## [1] 16

The following is the output of fnB
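Likely:

fnB(3)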

## [1] 4

fnC takes the output value from fnB(3), i.e. 4:
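Presumably:

fnC(4)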

## [1] 16

And this is how we arrive at the value 16: fn(3) = fnC(fnB(3)) = (3 + 1) * (3 + 1) = 16.

Created using R Markdown by Shreyas Khadse