RPUBS.COM, R SCRIPTS
AND RMARKDOWN FILES UPLOADED ON GITHUB REPO
This blog post aims to identify the most valued data science skills. “Data scientist” is a broad term that can refer to a number of different careers. Generally, a data scientist analyzes data to learn about scientific processes, market trends, and risk management. Data scientists work in a variety of industries, ranging from tech to medicine to government agencies. Because the title is so broad, the qualifications for a data science job vary. However, there are certain skills employers look for in almost every data scientist: for example, strong statistical, analytical, and reporting skills. As a team, we studied which skills matter most and, on the basis of that study, categorized data science skills into 3 main groups:
The skills above are not exhaustive. For this blog post I would like to put more emphasis on analytical skills. Perhaps the most important skill for a data scientist is the ability to analyze information. Data scientists have to look at, and make sense of, large amounts of data. They have to be able to see patterns and trends and have an idea of what those patterns mean. I believe that demonstrating the importance of this skill, and of others, on a real-world data set is worthwhile. To this end, I have chosen the US Chronic Disease Indicators (CDI).
The Centers for Disease Control and Prevention (CDC) Division of Population Health provides a cross-cutting set of 124 indicators that were developed by consensus and that allow states, territories, and large metropolitan areas to uniformly define, collect, and report chronic disease data that are important to public health practice. In addition to providing access to state-specific indicator data, the CDI web site serves as a gateway to additional information and data resources. Chronic diseases represent 7 of the top 10 causes of death in the United States. They are defined broadly as conditions that last 1 year or more and require ongoing medical attention, limit activities of daily living, or both. As a group, we would like to better understand the causes of chronic diseases. Going forward, this project focuses on the most valued data science skills in order to perform an exploratory analysis of the US Chronic Disease Indicators dataset.
Discovering the most valued data science skills by utilizing the chosen data set.
The approach taken in this blog post is to refresh the skills learned in the CUNY MSDS program.
My motivation for this topic comes from a team project in another data science course. I believe that strong individual motivation also adds value in a team environment.
The data was downloaded from the Centers for Disease Control and Prevention website in CSV format.
I loaded the data using read.csv with na.strings = "" so that empty strings are flagged as missing values (NA).
disease <- read.csv("file:///C:/Users/Yohannes/Desktop/DATA 621 BLOG 3 Y/Data 621 Blog 3/USChronicDiseaseIndicators.7z", na.strings = "") #US_CDI.csv
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : line 1 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : line 2 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep,
## quote = quote, : incomplete final line found by readTableHeader
## on 'C:/Users/Yohannes/Desktop/DATA 621 BLOG 3 Y/Data 621 Blog 3/
## USChronicDiseaseIndicators.7z'
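The "embedded nulls" warnings are what read.csv typically reports when it is handed a binary file, and the path here points at a .7z archive rather than at a CSV. Below is a minimal sketch of the intended load, assuming the archive has been extracted first (base R does not read .7z files). A small temporary file stands in for the extracted US_CDI.csv. The column names Topic, Question, and DataValue are assumptions about the file layout, and the rows are placeholders, not CDI figures.

```r
# Stand-in for the extracted CSV: a tiny temporary file with empty fields.
# In practice, extract the .7z archive first and point csv_path at the result.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Topic,Question,DataValue",
  "Tobacco,Example question A,",
  "Diabetes,,12.5",
  "Tobacco,Example question B,7.1"
), csv_path)

# na.strings = "" turns empty fields into NA instead of empty strings.
disease <- read.csv(csv_path, na.strings = "", stringsAsFactors = FALSE)

# Count missing values per column to confirm the empty fields were flagged.
missing_per_column <- colSums(is.na(disease))
print(missing_per_column)
```

Checking the per-column NA counts right after loading is a quick way to confirm that the file was parsed as text and that missing entries were picked up.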
I have decided to take a closer look at one of the top 5 causes of chronic diseases.
Tobacco use is one of the leading chronic disease indicators for preventable deaths in the US. Smoking leads to disease and disability and harms nearly every organ of the body. Among the other indicator groups, tobacco use is related to cancer, heart disease, stroke, lung diseases, diabetes, and chronic obstructive pulmonary disease (COPD), which includes emphysema and chronic bronchitis.
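Narrowing the analysis to tobacco amounts to filtering the data frame on its topic column. The sketch below assumes a Topic column containing the value "Tobacco" and a numeric DataValue column. Those names and the toy rows are illustrative placeholders, not CDI figures.

```r
# Toy stand-in for the loaded data frame; values are placeholders only.
disease <- data.frame(
  Topic = c("Tobacco", "Diabetes", "Tobacco", "Cancer"),
  Question = c("Indicator A", "Indicator B", "Indicator C", "Indicator D"),
  DataValue = c(10.0, NA, 4.0, 8.0),
  stringsAsFactors = FALSE
)

# Keep only the tobacco rows; which() also drops rows where Topic is NA.
tobacco <- disease[which(disease$Topic == "Tobacco"), ]

# Mean value per indicator question; the formula interface of aggregate()
# omits rows with a missing DataValue by default.
tobacco_summary <- aggregate(DataValue ~ Question, data = tobacco, FUN = mean)
print(tobacco_summary)
```

Wrapping the comparison in which() avoids the all-NA rows that plain logical indexing returns when the topic itself is missing.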
Because the numbers for the two indicators are so close, it is likely that youth who use smokeless tobacco also smoke. Both are harmful. Every day, someone younger than 18 smokes tobacco for the first time.
To conclude, teamwork is very important, especially in data science. Had this blog post been written by a team, it would have been more comprehensive. Real-world data sets are often messy and have many missing values. Data does not convert correctly the first time, files are incomplete or invalid, and the list goes on. All of these problems are solvable, but solving them can be time-consuming. In the end, the work is done to get the results needed to present useful information and make better decisions. Visualizing and communicating data is incredibly important, especially at young companies that are making data-driven decisions for the first time, or at companies where data scientists are viewed as people who help others make data-driven decisions. Communicating means describing your findings, or the way techniques work, to both technical and non-technical audiences. For visualization, it can be immensely helpful to be familiar with tools like matplotlib, ggplot, or d3.js. Tableau has become a popular data visualization and dashboarding tool as well. It is important to be familiar not just with the tools needed to visualize data, but also with the principles behind visually encoding data and communicating information. Some additional reading material can be found at https://www.kdnuggets.com/2018/05/simplilearn-9-must-have-skills-data-scientist.html