DATA 621 Blog 3 Most Valued Data Science Skills

RPUBS.COM, R SCRIPTS
AND RMARKDOWN FILES UPLOADED ON GITHUB REPO

Introduction

This blog post aims to identify the most valued data science skills. “Data scientist” is a broad term that can refer to a number of different careers. Generally, a data scientist analyzes data to learn about scientific processes, market trends, and risk management. Data scientists work in a variety of industries, ranging from tech to medicine to government agencies. The qualifications for a job in data science vary because the title is so broad. However, there are certain skills employers look for in almost every data scientist. For example, data scientists need strong statistical, analytical, and reporting skills, among others. As a team, we studied which skills matter most. On the basis of that study, we categorized data science skills into 3 main groups:

Soft Skills
- Communication
- Collaboration
- Critical Thinking
- Problem Solving
- Learning
- Data Intuition
Hard Skills
- Data Acquisition and Analysis
- Data Visualization
- Statistics and Probability
Tools
- R

The skills listed above are not exhaustive. For the sake of this blog post, I would like to put more emphasis on analytical skills. Perhaps the most important skill for a data scientist is the ability to analyze information. Data scientists have to look at, and make sense of, large amounts of data. They have to be able to see patterns and trends and have an idea of what those patterns mean. I believe that presenting the importance of this skill, and of the others, using a real-world data set is relevant. To this end, I have chosen the US Chronic Disease Indicators (CDI) data set.

Background

The Centers for Disease Control and Prevention (CDC) Division of Population Health provides a cross-cutting set of 124 indicators that were developed by consensus. These indicators allow states, territories, and large metropolitan areas to uniformly define, collect, and report chronic disease data that are important to public health practice. In addition to providing access to state-specific indicator data, the CDI web site serves as a gateway to additional information and data resources. Chronic diseases represent 7 of the top 10 causes of death in the United States. They are defined broadly as conditions that last 1 year or more and require ongoing medical attention, limit activities of daily living, or both. As a group, we would like to better understand the causes of chronic diseases. Going forward, this project focuses on the most valued skills used in data science in order to perform an exploratory analysis of the US Chronic Disease Indicators data set.

Objective

Discovering the most valued data science skills by utilizing the chosen data set.

Approach

The approach taken in this blog post is to revisit the skills learned in the CUNY MSDS program.

Motivation

My motivation for this topic comes from team work done for another data science course. I believe that strong individual motivation also adds value in a team environment.

Data information

The data set has 124 indicators that are divided into 18 groups.
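
One way to check figures like these is to count the distinct indicators within each group. The following is a minimal sketch on a made-up table, not the CDI file itself: the column names `Topic` and `Question` are assumptions about how the download labels groups and indicators, and every row is invented for illustration.

```r
# Toy stand-in for the CDI table. Topic and Question are assumed column
# names, and the rows below are made up purely for illustration.
cdi_toy <- data.frame(
  Topic    = c("Tobacco", "Tobacco", "Tobacco", "Diabetes", "Diabetes", "Asthma"),
  Question = c("Smokeless tobacco use among youth",
               "Cigarette smoking among youth",
               "Cigarette smoking among youth",
               "Diagnosed diabetes among adults",
               "Diagnosed diabetes among adults",
               "Asthma prevalence among adults"),
  stringsAsFactors = FALSE
)

# Keep one row per (Topic, Question) pair so that repeated measurements
# of the same indicator are not counted twice.
indicator_pairs <- unique(cdi_toy[, c("Topic", "Question")])

# Number of distinct indicators in each topic group.
indicators_per_topic <- table(indicator_pairs$Topic)

n_groups     <- length(indicators_per_topic)  # number of topic groups
n_indicators <- nrow(indicator_pairs)         # number of distinct indicators

print(indicators_per_topic)
```

On the real data, `n_indicators` and `n_groups` would be the numbers to compare against the 124 indicators and 18 groups quoted above.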

Data source

The data was downloaded from the Centers for Disease Control and Prevention website in CSV format.

How to load Dataset

I loaded the data with read.csv, setting na.strings = "" so that empty strings are recorded as missing values (NA).

disease <- read.csv("file:///C:/Users/Yohannes/Desktop/DATA 621 BLOG 3 Y/Data 621 Blog 3/USChronicDiseaseIndicators.7z", na.strings = "") #US_CDI.csv

## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : line 1 appears to contain embedded nulls

## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : line 2 appears to contain embedded nulls

## Warning in read.table(file = file, header = header, sep = sep,
## quote = quote, : incomplete final line found by readTableHeader
## on 'C:/Users/Yohannes/Desktop/DATA 621 BLOG 3 Y/Data 621 Blog 3/
## USChronicDiseaseIndicators.7z'
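
Warnings about embedded nulls like the ones above are what read.csv typically reports when it is pointed at a compressed archive instead of plain text. The path here ends in .7z, so the file was most likely read in its compressed form. Below is a minimal sketch of the same call against a plain CSV, assuming the archive has first been extracted with an external tool. The temporary file, its column names, and its values are all invented so that the example runs on its own.

```r
# Write a tiny CSV to a temporary file so the sketch is self-contained.
# The columns and values are invented; they are not the CDI data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "LocationAbbr,Topic,DataValue",
  "NY,Tobacco,14.2",
  "NJ,Tobacco,",
  "CT,,9.8"
), csv_path)

# na.strings = "" turns empty fields into NA, as in the original call.
disease_toy <- read.csv(csv_path, na.strings = "", stringsAsFactors = FALSE)

# Count missing values per column to see how incomplete the file is.
missing_per_column <- colSums(is.na(disease_toy))
print(missing_per_column)

# Remove the temporary file.
unlink(csv_path)
```

Counting the NA values per column right after loading gives a quick picture of how much of each column is missing.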

Visualization of data value type

I have decided to take a closer look at one of the top 5 causes of chronic diseases.
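
Before narrowing down to a single topic, the data value types named in the heading above can be tabulated and plotted. This is a rough sketch using base R on invented rows: the column name `DataValueType` is an assumption about the downloaded file, and the counts say nothing about the real data.

```r
# Toy rows; DataValueType is an assumed column name and the values
# are invented for illustration only.
cdi_toy <- data.frame(
  DataValueType = c("Crude Prevalence", "Crude Prevalence",
                    "Age-adjusted Prevalence", "Number",
                    "Crude Prevalence", "Number"),
  stringsAsFactors = FALSE
)

# Count the rows of each data value type, most frequent first.
type_counts <- sort(table(cdi_toy$DataValueType), decreasing = TRUE)
print(type_counts)

# Bar chart of the counts; las = 2 rotates the axis labels so that
# long type names stay readable.
barplot(type_counts, las = 2, cex.names = 0.7,
        main = "Rows per data value type (toy data)",
        ylab = "Number of rows")
```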

Tobacco

Tobacco use is one of the top chronic disease indicators linked to preventable deaths in the US. Smoking leads to disease and disability and harms nearly every organ of the body. Tobacco use is also related to other indicator groups, including cancer, heart disease, stroke, lung diseases, diabetes, and chronic obstructive pulmonary disease (COPD), which includes emphysema and chronic bronchitis.

Because the figures for smokeless tobacco use and for smoking among youth are so close, it is likely that young people who use smokeless tobacco also smoke. Both are harmful. Every day, someone younger than 18 tries smoking tobacco for the first time.

Amount of Tobacco Product Excise Tax?

Increasing the price of tobacco products discourages people, especially youths, from buying them. With such a tax structure in place, health outcomes should improve in the long term. In this data set, the majority of states show no excise tax on tobacco products.
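
A statement about how many states report an excise tax can be checked by filtering the tobacco rows and counting the states that have a value. This is a minimal sketch on invented rows: the column names (`LocationAbbr`, `Topic`, `Question`, `DataValue`), the question text, and the placeholder state codes are all assumptions. Note that a missing value may only mean that no amount was recorded, not that no tax exists.

```r
# Toy rows with placeholder state codes; all names and values here are
# assumptions made for illustration, not figures from the CDI data.
cdi_toy <- data.frame(
  LocationAbbr = c("AA", "BB", "CC", "DD", "EE"),
  Topic        = "Tobacco",
  Question     = "Amount of tobacco product excise tax",
  DataValue    = c(1.60, NA, NA, 0.45, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows for the excise tax indicator within the Tobacco topic.
tax_rows <- subset(cdi_toy,
                   Topic == "Tobacco" &
                     Question == "Amount of tobacco product excise tax")

# Count all states, then the states that actually have a recorded amount.
states_total      <- length(unique(tax_rows$LocationAbbr))
states_with_value <- length(unique(tax_rows$LocationAbbr[!is.na(tax_rows$DataValue)]))

# Share of states with no recorded excise tax amount.
share_without_value <- 1 - states_with_value / states_total
print(share_without_value)
```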

Conclusion

To conclude, team work is very important, especially in data science. Had this blog been done in a team, it would have been more comprehensive. Real-world data sets are often messy and have many missing values. Data does not convert correctly the first time, a file is incomplete or invalid, and the list goes on. All of these problems are solvable, but solving them can be time consuming. In the end, the work is done to get the results needed to present useful information and make better decisions.

Visualizing and communicating data is incredibly important, especially at young companies that are making data-driven decisions for the first time, or at companies where data scientists are viewed as people who help others make data-driven decisions. Communicating means describing your findings, or the way techniques work, to both technical and non-technical audiences. For visualization, it can be immensely helpful to be familiar with tools like matplotlib, ggplot, or d3.js. Tableau has become a popular data visualization and dashboarding tool as well. It is important to be familiar not just with the tools used to visualize data, but also with the principles behind visually encoding data and communicating information. Some additional reading material can be found at https://www.kdnuggets.com/2018/05/simplilearn-9-must-have-skills-data-scientist.html