Reading and Graphing Data Using Scatterplot

Name: Achmad Fahry Baihaki

NIM: 2206065110100

Institute: Maulana Malik Ibrahim Islamic State University of Malang

Department: Computer Science

library(mosaicCalc)

## Loading required package: mosaic

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2

## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.

## 
## Attaching package: 'mosaic'

## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally

## The following object is masked from 'package:Matrix':
## 
##     mean

## The following object is masked from 'package:ggplot2':
## 
##     stat

## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var

## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

## Loading required package: mosaicCore

## 
## Attaching package: 'mosaicCore'

## The following objects are masked from 'package:dplyr':
## 
##     count, tally

## 
## Attaching package: 'mosaicCalc'

## The following object is masked from 'package:stats':
## 
##     D

Intro to scatterplot

Sometimes the mathematical models we create will be motivated by data, as in statistical modelling. We’ll learn the process of setting the parameters of a mathematical function so that the function is a close representation of some data. This process is called curve fitting.

That means we’ll have to learn how to access data in computer files, how data are stored, and how to visualize the data. R and the mosaic package make all of this straightforward.

The data files we’ll be using are stored as spreadsheets on the internet. Usually, a spreadsheet will have multiple variables, each stored as one column. The rows are “cases”, sometimes called “data points”. To read the data into R, you need to know the name of the file and its location. Often, the location will be an address on the internet.

In this case, we’ll work with the “Income-Housing.csv” data, which is located at “http://www.mosaic-web.org/go/datasets/Income-Housing.csv”. The file contains information from a survey on housing conditions for people in different income brackets in the US.

Here’s how to read it into R:

Housing = read.csv("http://www.mosaic-web.org/go/datasets/Income-Housing.csv")
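The same read.csv() call also works on a file stored on your own computer. The sketch below is a self-contained stand-in that needs no internet connection: it writes a few values from the Income-Housing data to a temporary file and reads them back. The names housing_small, housing_local and tmp are illustrative only and are not part of the dataset.

```r
# Hypothetical stand-in for the online file: three columns and three
# rows with values taken from the Income-Housing data.
housing_small <- data.frame(
  Income = c(3914, 10817, 21097),
  IncomePercentile = c(5, 15, 30),
  TwoVehicles = c(17.3, 34.3, 56.4)
)

# Write the data to a temporary CSV file, then read it back the same
# way we would read a file from a URL.
tmp <- tempfile(fileext = ".csv")
write.csv(housing_small, tmp, row.names = FALSE)
housing_local <- read.csv(tmp)

dim(housing_local)    # number of rows (cases) and columns (variables)
names(housing_local)  # column names come from the header line of the file
```

Each column of the file becomes one variable of the data frame, and each remaining line of the file becomes one case.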

To look at the data, type the name of the object:

Housing

##   Income IncomePercentile CrimeProblem AbandonedBuildings IncompleteBathroom
## 1   3914                5         39.6               12.6                2.6
## 2  10817               15         32.4               10.0                3.3
## 3  21097               30         26.7                7.1                2.3
## 4  34548               50         23.9                4.1                2.1
## 5  51941               70         21.4                2.3                2.4
## 6  72079               90         19.9                1.2                2.0
##   NoCentralHeat ExposedWires AirConditioning TwoBathrooms MotorVehicle
## 1          32.3          5.5            52.3         13.9         57.3
## 2          34.7          5.0            55.4         16.9         82.1
## 3          28.1          2.4            61.7         24.8         91.7
## 4          21.4          2.1            69.8         39.6         97.0
## 5          14.9          1.4            73.9         51.2         98.0
## 6           9.6          1.0            76.7         73.2         99.0
##   TwoVehicles ClothesWasher ClothesDryer Dishwasher Telephone
## 1        17.3          57.8         37.5       16.5      68.7
## 2        34.3          61.4         38.0       16.0      79.7
## 3        56.4          78.6         62.0       25.8      90.8
## 4        75.3          84.4         75.2       41.6      96.5
## 5        86.6          92.8         88.9       58.2      98.3
## 6        92.9          97.1         95.6       79.7      99.5
##   DoctorVisitsUnder7 DoctorVisits7To18 NoDoctorVisitUnder7 NoDoctorVisit7To18
## 1                3.6               2.6                13.7               31.2
## 2                3.7               2.6                14.9               32.0
## 3                3.6               2.1                13.8               31.4
## 4                4.0               2.3                10.4               27.3
## 5                4.0               2.5                 7.7               23.9
## 6                4.7               3.1                 5.3               17.5

We can list the names of all the variables using the names(objectName) syntax:

names(Housing)

##  [1] "Income"              "IncomePercentile"    "CrimeProblem"       
##  [4] "AbandonedBuildings"  "IncompleteBathroom"  "NoCentralHeat"      
##  [7] "ExposedWires"        "AirConditioning"     "TwoBathrooms"       
## [10] "MotorVehicle"        "TwoVehicles"         "ClothesWasher"      
## [13] "ClothesDryer"        "Dishwasher"          "Telephone"          
## [16] "DoctorVisitsUnder7"  "DoctorVisits7To18"   "NoDoctorVisitUnder7"
## [19] "NoDoctorVisit7To18"

We can also access a single variable using objectName$variableName. For example:

Housing$AbandonedBuildings

## [1] 12.6 10.0  7.1  4.1  2.3  1.2
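Once a variable has been extracted, it behaves like any other numeric vector. The following is a minimal sketch using a small copy of two columns from the table printed earlier; the name housing_demo is illustrative, chosen so that it does not overwrite the full Housing object.

```r
# Two columns copied from the Housing data shown earlier.
housing_demo <- data.frame(
  Income = c(3914, 10817, 21097, 34548, 51941, 72079),
  AbandonedBuildings = c(12.6, 10.0, 7.1, 4.1, 2.3, 1.2)
)

# `[[ ]]` with the column name as a string returns the same vector as `$`.
abandoned <- housing_demo[["AbandonedBuildings"]]
identical(abandoned, housing_demo$AbandonedBuildings)

# Summarise the variable.
mean(housing_demo$AbandonedBuildings)

# Use one variable to select values of another: the incomes of the
# brackets where the AbandonedBuildings value is above 5.
high_abandoned <- housing_demo$Income[housing_demo$AbandonedBuildings > 5]
high_abandoned
```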

Making scatterplot

One of the most familiar graphical forms is the scatter-plot, a format in which each “case” or “data point” is plotted as a dot at the coordinate location given by two variables.

For instance, here’s a scatter plot of the fraction of households that have two vehicles (recorded here as a percentage) versus the median income in their bracket:

gf_point(TwoVehicles ~ Income, data = Housing )
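The formula TwoVehicles ~ Income reads as “TwoVehicles versus Income”: the variable on the left of ~ goes on the vertical axis and the variable on the right goes on the horizontal axis. As a rough comparison only, base R’s plot() accepts the same kind of formula, so the sketch below draws a similar scatter plot without the mosaic packages. The names housing_demo and plot_formula are illustrative.

```r
# Two columns copied from the Housing data shown earlier.
housing_demo <- data.frame(
  Income = c(3914, 10817, 21097, 34548, 51941, 72079),
  TwoVehicles = c(17.3, 34.3, 56.4, 75.3, 86.6, 92.9)
)

# A formula is an ordinary R object: left side is y, right side is x.
plot_formula <- TwoVehicles ~ Income
all.vars(plot_formula)  # variable names in order: y first, then x

# Base-graphics scatter plot using the same formula and data.
plot(plot_formula, data = housing_demo)
```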

Graphs are constructed in layers. To plot a mathematical function over the data, we use a plotting function such as slice_plot to make another layer, and connect the layers with %>% (the pipe symbol) so that they are displayed in the same plot. Note that %>% must come at the end of a line; it can never go at the start of a new line.

gf_point(TwoVehicles ~ Income, data = Housing) %>%
  slice_plot(20 - Income/2000 ~ Income, color = "blue")

If we prefer to set the limits of the axes ourselves, we can do so with gf_lims. For instance:

gf_point(
  TwoVehicles ~ Income, data = Housing) %>% 
  slice_plot(
    20-Income/2000 ~ Income, color = "blue") %>%
  gf_lims(
    x = range(0,40000), 
    y=range(0,100))

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 48 row(s) containing missing values (geom_path).
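The warnings above are expected: anything that falls outside the chosen axis limits is dropped from the plot. The sketch below recounts what falls outside. The count for the data points uses only the Income values shown earlier. The count for the curve rests on an assumption that is not stated in the output: that the curve is drawn at 101 evenly spaced incomes between the smallest and largest Income in the data.

```r
# Income values copied from the Housing data shown earlier.
income <- c(3914, 10817, 21097, 34548, 51941, 72079)

# Data points whose income lies beyond the upper x limit of 40000.
n_points_dropped <- sum(income > 40000)
n_points_dropped

# Assumption: the curve is evaluated at 101 evenly spaced incomes.
x_curve <- seq(min(income), max(income), length.out = 101)
y_curve <- 20 - x_curve / 2000

# Curve points outside the x limits (0 to 40000) or y limits (0 to 100).
outside <- x_curve < 0 | x_curve > 40000 | y_curve < 0 | y_curve > 100
n_curve_dropped <- sum(outside)
n_curve_dropped
```

Under that assumption the two counts come out as 2 and 48, the same numbers reported in the warnings.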

The mathematical function drawn here isn’t a very good match to the data, but the important thing for now is how to draw graphs, not how to choose a family of functions or find parameters.
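To see how poor the match is, we can evaluate the function 20 - Income/2000 at each income in the data and compare the result with the observed TwoVehicles values. This is only an illustrative check, not a curve-fitting method; the names line_fn, predicted and resid_vals are made up for this sketch.

```r
# Values copied from the Housing data shown earlier.
income <- c(3914, 10817, 21097, 34548, 51941, 72079)
two_vehicles <- c(17.3, 34.3, 56.4, 75.3, 86.6, 92.9)

# The function that was drawn over the scatter plot.
line_fn <- function(Income) 20 - Income / 2000

# Function values at the observed incomes, and the gaps between the
# data and the function (observed minus predicted).
predicted <- line_fn(income)
resid_vals <- two_vehicles - predicted

data.frame(income, two_vehicles, predicted, resid_vals)
```

The function decreases as income grows while the data increase, so the gaps widen at every step up the income brackets.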

Properly made scientific graphics should have informative axis names. We can set the axis names directly using gf_labs:

gf_point(
  TwoVehicles ~ Income, data=Housing) %>%
  # The plot title is set with `title`; `main` is not a recognised label name.
  gf_labs(x= "Income Bracket ($US per household)/year",
          y = "Fraction of Households",
          title = "Two Vehicles") %>%
  gf_lims(x = range(0,40000), y = range(0,100))

## Warning: Removed 2 rows containing missing values (geom_point).