DATA 607 - Fall 2018

Jeff Littlejohn

RPubs link:

GitHub link:

Assignment - Loading Data into a Data Frame

Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python, etc.). Here, you are asked to use R; you may use base functions or packages as you like.

A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. Because this dataset is so well known in the data science community, it is a good choice for comparative benchmarking. For example, if someone were working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e., the “data dictionary”). You may need to look around a bit, but it’s there! You should take the data and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns.

You should also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
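As a minimal sketch of the abbreviation replacement described above, one option in base R is a named lookup vector. The vector `class_codes` is a hypothetical toy input, and the names `class_lookup` and `class_labels` are chosen here for illustration only. The e/p mapping comes from the data dictionary.

```r
# Hypothetical toy input: class codes in the style of the dataset's first column
class_codes <- c("p", "e", "e", "p", "e")

# Named lookup vector: names are the abbreviations, values are the descriptors
class_lookup <- c(e = "edible", p = "poisonous")

# Indexing by name swaps each code for its descriptor; unname() drops the
# leftover code names so the result is a plain character vector
class_labels <- unname(class_lookup[class_codes])

print(class_labels)
```

A code that is missing from the lookup vector comes back as `NA`, which makes unmapped abbreviations easy to spot.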

Steps:

1. Load the dataset

2. Give it column names based on the data dictionary, and replace abbreviations with better descriptors

3. Explore the data

4. Create a subset, including the column that indicates edible or poisonous along with 3-4 other columns

5. Place your solution into a single R Markdown (.Rmd) file and publish your solution out to rpubs.com. Post the .Rmd file in your GitHub repository, and provide the appropriate URLs to your GitHub repository and your rpubs.com file in your assignment link. You should also have the original data file accessible through your code; for example, stored in a GitHub repository and referenced in your code.
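Steps 1 through 4 might be sketched in base R roughly as follows. This is one possible approach under stated assumptions, not a reference solution. To stay self-contained it reads a few illustrative rows from an inline string (`sample_text`, an assumption) that mimics the header-less, comma-separated layout of the data file, truncated to the first six fields. With the real file, the `text =` argument would be replaced by the file's path or URL, and all 23 columns would need names. The helper `recode_column` and the column names are hypothetical choices.

```r
# Assumption: illustrative rows in the same header-less, comma-separated
# layout as the data file, truncated to the first six fields
sample_text <- "p,x,s,n,t,p
e,x,s,y,t,a
e,b,s,w,t,l
p,x,y,w,t,p
e,x,s,g,f,n"

# 1. Load the dataset; colClasses keeps every single-letter code as text
mushrooms <- read.csv(text = sample_text, header = FALSE,
                      colClasses = "character")

# 2. Give it column names based on the data dictionary
names(mushrooms) <- c("class", "cap_shape", "cap_surface",
                      "cap_color", "bruises", "odor")

# Hypothetical helper: swap codes for descriptors via a named lookup vector
recode_column <- function(x, lookup) unname(lookup[x])

# Replace abbreviations in the columns that will be kept
mushrooms$class <- recode_column(mushrooms$class,
                                 c(e = "edible", p = "poisonous"))
mushrooms$cap_shape <- recode_column(mushrooms$cap_shape,
                                     c(b = "bell", c = "conical", x = "convex",
                                       f = "flat", k = "knobbed", s = "sunken"))
mushrooms$bruises <- recode_column(mushrooms$bruises,
                                   c(t = "bruises", f = "no bruises"))
mushrooms$odor <- recode_column(mushrooms$odor,
                                c(a = "almond", l = "anise", c = "creosote",
                                  y = "fishy", f = "foul", m = "musty",
                                  n = "none", p = "pungent", s = "spicy"))

# 3. Explore the data
str(mushrooms)
print(table(mushrooms$class, mushrooms$odor))

# 4. Create a subset: the class column plus three other columns
mushroom_subset <- mushrooms[, c("class", "cap_shape", "bruises", "odor")]
print(head(mushroom_subset))
```

Reading everything as character avoids any automatic type guessing on the one-letter codes, and keeping each lookup next to its column makes it easy to check against the data dictionary.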