Problem Set 1

Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. We begin by studying some of the theoretical aspects of visualization. To do that we must appreciate the basic steps in the process of making a visualization.

The objective of this assignment is to introduce you to R markdown and to complete and explain basic plots before moving on to more complicated ways to graph data.

A couple of tips: remember that many graphics involve preprocessing, so you may have to compute summaries or other calculations to prepare the data. Include those in your work.
To ensure accuracy, pay close attention to axes and labels. You will be evaluated on the accuracy of your graphics.

The final product of your homework (this file) should include a short summary of each graphic.

To submit this homework you will create the document in RStudio, using the knitr package (button included in RStudio), and then publish the document to your RPubs account. Once it is uploaded, submit the link to that document on Moodle. Please make sure that this link is hyperlinked and that I can see the visualization and the code required to create it.

Questions

Find the mtcars data in R. This is the dataset that you will use to create your graphics.

Create a pie chart showing the proportion of cars from the mtcars data set that have different carb values.

#summary(mtcars)
#mtcars
# Calculate the frequency of each carb value using the table function
mtcarscarb <- table(mtcars$carb)
# Create percent label values
percentlabels <- round(100 * mtcarscarb / sum(mtcarscarb), 1)
# Create the label for each slice of the chart
pielabels <- paste(percentlabels, "%", sep = "")
# Draw the pie chart
pie(mtcarscarb, col = rainbow(length(mtcarscarb)), labels = pielabels, main = "Pie Chart of Proportion of Cars with Different Number of Carburetors", cex = 0.8)
# Legend for the pie chart, built from the table names so it always matches the slices
legend("topright", paste("Carb-", names(mtcarscarb), sep = ""), cex = 0.6, fill = rainbow(length(mtcarscarb)))

From the pie chart above, we can tell that the sample cars take six carb values: 1, 2, 3, 4, 6 and 8. 31.2 percent of the cars have a carb value of 2 and another 31.2 percent have a carb value of 4, so the majority of the cars have a carb value of 2 or 4. Next comes a carb value of 1, with 21.9 percent. Carb values 3, 6 and 8 together make up only about 15.6 percent of the whole.
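As a quick check on the percentages quoted above, the shares can be recomputed directly from the frequency table. This is only a minimal sketch: the names `carb_counts`, `carb_share` and `small_share` are introduced here for illustration and are not part of the original chunk.

```r
# Frequency of each carb value in mtcars (the dataset ships with base R)
carb_counts <- table(mtcars$carb)

# Share of cars per carb value, as a percentage rounded to one decimal
carb_share <- round(100 * prop.table(carb_counts), 1)
print(carb_share)

# Combined share of the three least common carb values (3, 6 and 8),
# computed from the raw counts to avoid adding up rounding errors
small_share <- 100 * sum(carb_counts[c("3", "6", "8")]) / sum(carb_counts)
print(small_share)
```

Working from the raw counts rather than the rounded labels keeps the combined share exact.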

Create a bar graph that shows the number of each gear type in mtcars.

# Count the cars for each number of gears
gear <- table(mtcars$gear)
barplot(gear, main = "Car Distribution by Number of Gears", xlab = "Number of Gears", ylab = "Number of Cars", col = c("darkblue", "red", "darkgreen"))
# Add unit tick marks to the y-axis
axis(2, at = seq(0, 16, 1))

From the plot above, we can see that there are three gear types: 3, 4 and 5. The majority of cars have 3 gears (15 cars) or 4 gears (12 cars). Only 5 cars have 5 gears.

cyl_gear<- table(mtcars$cyl,mtcars$gear)
cyl_gear

##    
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2

Next show a stacked bar graph of the number of each gear type and how they are further divided out by cyl.

# Stacked bars: one bar per gear count, split by number of cylinders
barplot(cyl_gear, main = "Car Distribution by Number of Gears and Cylinders", xlab = "Number of Gears", ylab = "Number of Cars", names.arg = c("3 Gears", "4 Gears", "5 Gears"), cex.names = 1, col = c("darkblue", "red", "darkgreen"), legend = rownames(cyl_gear), args.legend = list(title = "Number of Cylinders"))
# Add unit tick marks to the y-axis
axis(2, at = seq(0, 16, 1))

The stacked bar plot above extends the bar chart from the previous part. It again displays the number of cars on the y-axis, grouped by number of gears on the x-axis. However, it adds another layer by breaking down the cars in each gear group by number of cylinders (4, 6 and 8), using a different color for each cylinder count. For example, in the 3-gear bar there is 1 car shaded in dark blue, which indicates 4 cylinders. This means that out of the 15 cars that have 3 gears, only 1 has 4 cylinders, 2 have 6 cylinders and the remaining 12 have 8 cylinders.
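The breakdown described above can also be read off numerically. The sketch below is one possible way to do it: it rebuilds the cross-tabulation, adds totals, and converts each gear column to within-group shares. The name `gear_share` and the `cyl`/`gear` dimension labels are illustrative additions.

```r
# Cross-tabulate cylinders (rows) against gears (columns)
cyl_gear <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Counts with row and column totals appended
print(addmargins(cyl_gear))

# Share of each cylinder count within a gear group (each column sums to 1)
gear_share <- prop.table(cyl_gear, margin = 2)
print(round(gear_share, 2))
```

Column-wise shares make the bars comparable even though the gear groups differ in size.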

Draw a scatter plot showing the relationship between wt and mpg.

plot(mtcars$wt, mtcars$mpg, main="Distribution of Car Mileage vs Car Weight", xlab="Car Weight (1000 lbs)", ylab="Miles Per Gallon ", col = "blue")

abline(lm(mtcars$mpg~mtcars$wt), col="red") # regression line (y~x) 
lines(lowess(mtcars$wt,mtcars$mpg), col="blue") # lowess line (x,y)

The scatter plot above shows how miles per gallon, on the y-axis, changes with car weight, on the x-axis. A scatter plot reveals the directional pattern in the response variable (on the y-axis) as the independent variable (on the x-axis) changes. Here, the graph illustrates that, on average, the mileage (mpg) of a car decreases as its weight increases.

The linear regression line in red tells us that car weight and mpg are negatively correlated. The lowess line in blue indicates a similar result, but because it is fitted locally it follows the curvature in the relationship between the two variables more closely than the straight line does.
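To put numbers on the negative relationship, the correlation and the fitted line can be computed directly. This is a small sketch rather than part of the original analysis; `wt_mpg_cor` and `fit` are names chosen here for illustration.

```r
# Pearson correlation between weight (1000 lbs) and miles per gallon
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
print(wt_mpg_cor)

# Simple linear regression of mpg on wt; the wt coefficient is the
# estimated change in mpg for each additional 1000 lbs of weight
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))
```

A negative correlation and a negative `wt` coefficient both agree with the downward slope of the red line.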

Design a visualization of your choice using the data and write a brief summary about why you chose that visualization.

boxplot(mtcars$mpg ~ mtcars$cyl, main = "Box Plot of Mileage vs Number of Cylinders", xlab = "Number of Cylinders", ylab = "Miles per Gallon", 
        col = "lightgreen")

I have found that one of the most useful plots for visualization is the box plot, especially when representing factor data types with multiple levels. A box plot shows the median of the response variable for each level of the independent variable on the x-axis. It also shows the interquartile range of the data for each level, which gives a visual sense of the spread of the data points in that level. Additionally, the whiskers mark the smallest and largest values that are not outliers, so outliers stand out at a glance as separate points. We see this for the 8-cylinder group, where an outlying point with a very low mpg value is drawn below the lower whisker.

Here, I have used a box plot to display the miles per gallon values for cars with different numbers of cylinders. Based on the visualization, we can see that the median miles per gallon is higher for cars with fewer cylinders, i.e. mpg decreases as the number of cylinders increases. We can also infer that mpg is more variable for cars with 4 cylinders than for cars with 6 or 8 cylinders, because the interquartile range of the 4-cylinder group is larger than that of the 6- and 8-cylinder groups.
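The quantities that the box plot draws can be checked numerically. The following is a rough sketch using base R only; `mpg_median`, `mpg_iqr` and `out8` are illustrative names. Note that `IQR()` is based on `quantile()`, while `boxplot()` uses hinges, so the box edges may differ slightly from these values.

```r
# Median mpg per cylinder group (the thick line inside each box)
mpg_median <- tapply(mtcars$mpg, mtcars$cyl, median)
print(mpg_median)

# Interquartile range of mpg per cylinder group
mpg_iqr <- tapply(mtcars$mpg, mtcars$cyl, IQR)
print(mpg_iqr)

# Values that boxplot() would flag as outliers in the 8-cylinder group
out8 <- boxplot.stats(mtcars$mpg[mtcars$cyl == 8])$out
print(out8)
```

Comparing the medians and interquartile ranges across groups supports the reading of the plot given above.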
