datadive5pari

###1. A list of at least 3 columns (or values) in your data which are unclear until you read the documentation. Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?

#read data file
midwest <- read.csv(file="https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/midwest.csv", header=TRUE, sep=",")
colnames(midwest)

##  [1] "rownames"             "PID"                  "county"              
##  [4] "state"                "area"                 "poptotal"            
##  [7] "popdensity"           "popwhite"             "popblack"            
## [10] "popamerindian"        "popasian"             "popother"            
## [13] "percwhite"            "percblack"            "percamerindan"       
## [16] "percasian"            "percother"            "popadults"           
## [19] "perchsd"              "percollege"           "percprof"            
## [22] "poppovertyknown"      "percpovertyknown"     "percbelowpoverty"    
## [25] "percchildbelowpovert" "percadultpoverty"     "percelderlypoverty"  
## [28] "inmetro"              "category"

Most of the column names are self-explanatory, but the following three were unclear from the names alone:
“perchsd”, “percollege”, “percprof”

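The documentation clarifies what each column means: perchsd is the percent with a high school diploma, percollege is the percent college educated, and percprof is the percent with a professional degree. Without the documentation, percprof in particular could be misread, for example as the share of people in professional occupations rather than those holding a professional degree.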
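One quick way to confirm that the "perc" columns really are percentages on a 0 to 100 scale (and not proportions between 0 and 1) is a range check. The sketch below is only an illustration: the helper name `in_percent_range` is hypothetical, and the toy rows are invented so the example runs without downloading anything.

```r
# Hypothetical helper: TRUE for each column whose values all lie in [0, 100].
in_percent_range <- function(df, cols) {
  vapply(
    df[cols],
    function(x) all(x >= 0 & x <= 100, na.rm = TRUE),
    logical(1)
  )
}

# Toy rows invented for illustration; these are NOT real midwest values.
toy <- data.frame(
  perchsd    = c(75.1, 59.8),
  percollege = c(19.6, 11.2),
  percprof   = c(4.4, 2.9)
)

range_ok <- in_percent_range(toy, c("perchsd", "percollege", "percprof"))
print(range_ok)

# Values well above 1 suggest a 0-100 percent scale rather than a 0-1 proportion.
print(sapply(toy, max))
```

The same call should work on the real data, assuming `midwest` has been loaded as above: `in_percent_range(midwest, c("perchsd", "percollege", "percprof"))`.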

###2. At least one element of your data that is unclear even after reading the documentation. You may need to do some digging, but is there anything about the data that your documentation does not explain?

sample_category <- sample(midwest$category, 10)
sample_category

##  [1] "LHR" "AAR" "ALU" "AHR" "LAR" "ALU" "AAR" "AAR" "ALU" "AAR"

The “category” column is still unclear to me even after reading the documentation, which only says that it contains miscellaneous categories or labels for the data points. It does not explain what the three-letter codes in the sample above (for example “AAR” or “LHR”) stand for.
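A first step in digging into an undocumented code like this is to count how often each code occurs and to look at which letters appear in each position. The sketch below uses the ten sampled values printed above; it assumes every code has exactly three characters, which holds for this sample but has not been checked for the full column.

```r
# The ten codes from the sample output above.
codes <- c("LHR", "AAR", "ALU", "AHR", "LAR", "ALU", "AAR", "AAR", "ALU", "AAR")

# Frequency of each full code.
code_counts <- table(codes)
print(code_counts)

# Split each code into its characters; assumes all codes have length 3.
letters_by_pos <- do.call(rbind, strsplit(codes, ""))
colnames(letters_by_pos) <- paste0("pos", 1:3)

# Which letters occur in each position, and how often.
pos_counts <- lapply(as.data.frame(letters_by_pos), table)
print(pos_counts)
```

If each position only ever takes a small set of letters, that hints that the code may combine several separate attributes, although it would not by itself say what those attributes are.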

###3. Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear. You can use color or an annotation, but also make sure to explain your thoughts using Markdown. Do you notice any significant risks? If so, what could you do to reduce negative consequences?

###loading libraries

#library(tidyverse)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked _by_ '.GlobalEnv':
## 
##     midwest

To investigate, I plotted the ‘category’ column against other columns in the dataset to see which of them it might be associated with.

ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=category, size=popdensity)) + 
  geom_smooth(method="loess", se=TRUE)

## `geom_smooth()` using formula = 'y ~ x'

ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=TRUE)

## `geom_smooth()` using formula = 'y ~ x'

ggplot(midwest, aes(x=inmetro, y=poptotal)) + 
  geom_point(aes(col=category, size=popdensity)) + 
  geom_smooth(method="loess", se=TRUE)

## `geom_smooth()` using formula = 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -0.005

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 1.005

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 6.9047e-31

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 1.01

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## -0.005

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 1.005

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 6.9047e-31

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1.01

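The third graph produces warning messages that point to a problem with the model fit rather than with the ‘category’ column itself: most likely this is because inmetro takes only the values 0 and 1, so loess has too few distinct x values to fit a smooth curve. I am still figuring out how to relate the ‘category’ column to the other columns. It could be a poverty-line bracket, something else entirely, or just an arbitrary categorization.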
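A tabular check can complement the plots when testing a guess such as "category is a poverty bracket". The sketch below compares group means per category and cross-tabulates category against inmetro. The helper name `mean_by_category` is hypothetical, and the toy rows are invented so the example runs on its own; nothing here is a result from the real data.

```r
# Toy rows invented for illustration; these are NOT real midwest values.
toy <- data.frame(
  category         = c("AAR", "AAR", "LHR", "LHR", "ALU"),
  inmetro          = c(0, 0, 1, 1, 0),
  percbelowpoverty = c(12, 14, 5, 7, 10),
  percollege       = c(18, 20, 35, 33, 15)
)

# Hypothetical helper: mean of the chosen numeric columns within each group.
mean_by_category <- function(df, cols, group = "category") {
  aggregate(df[cols], by = df[group], FUN = mean, na.rm = TRUE)
}

means <- mean_by_category(toy, c("percbelowpoverty", "percollege"))
print(means)

# Cross-tabulate the codes against the metro indicator.
cross_tab <- table(toy$category, toy$inmetro)
print(cross_tab)

# Count distinct x values; a variable with only two is a poor fit for loess.
n_distinct_inmetro <- length(unique(toy$inmetro))
print(n_distinct_inmetro)
```

On the real data, clearly separated group means for a column such as percbelowpoverty would support the poverty-bracket guess, while overlapping means would argue against it. Assuming `midwest` is loaded as above, the call would be `mean_by_category(midwest, c("percbelowpoverty", "percollege"))`.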

parimala

2023-09-26