Introduction to Data Science

Rachit Shukla
Published in The Startup
5 min read, Nov 21, 2020

In this post I explain what data science is and the skills one needs to become a data scientist. I’ll describe the roles and responsibilities of a data scientist and various applications of data science. I’ll also discuss how data science and big data work together and why this field is gaining importance in the world. Let’s begin this journey…

Elements of Data Science. Photo source: shutterstock.

What is Data Science?

Data Science is an emerging field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

When we combine domain expertise and scientific methods with technology, we get Data Science, which enables one to find solutions to existing problems. Let’s look at each component of data science separately.

Domain Expertise, Scientific Methods & Technology:

Data scientists collect data, then explore, analyze & visualize it. They apply mathematical and statistical models to find patterns and solutions in the data. Data scientists should also have domain expertise and a passion for data, because knowing the domain is what lets them recognize which patterns actually matter.

Traditionally, domain experts like scientists and statisticians collected & analyzed data in a laboratory setup or in a controlled environment. The data was then subjected to relevant laws or mathematical & statistical models to derive information from it. For example, they calculated the mean, median, mode, standard deviation and so on of a dataset. This helped them test a hypothesis or form a new one.
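To make these summary statistics concrete, here is a minimal sketch using Python's built-in statistics module. The temperature readings are made up purely for illustration.

```python
import statistics

# Hypothetical sample: daily temperature readings in degrees Celsius.
readings = [21.0, 22.5, 22.5, 23.0, 24.0, 22.5, 25.0]

mean = statistics.mean(readings)      # arithmetic average
median = statistics.median(readings)  # middle value of the sorted data
mode = statistics.mode(readings)      # most frequent value
std_dev = statistics.stdev(readings)  # sample standard deviation

print(f"mean={mean:.2f} median={median} mode={mode} stdev={std_dev:.2f}")
```

Note that statistics.stdev computes the sample standard deviation; statistics.pstdev would be the choice if the readings were the whole population.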

Data Analysis can be Descriptive, Predictive or Prescriptive

Descriptive data analysis means studying a dataset to understand what it contains. Predictive data analysis means creating a model based on existing information to predict outcomes and behavior. Prescriptive data analysis means suggesting actions for a given situation using the collected information.
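The three kinds of analysis can be sketched on one tiny dataset. Everything here is an assumption made for illustration: the advertising and sales numbers are invented, the predictive step is a simple least-squares line, and the prescriptive step just inverts that line to suggest a spend level.

```python
import statistics

# Hypothetical data: monthly advertising spend and units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

# Descriptive: summarize what has already happened.
average_sales = statistics.mean(units_sold)

# Predictive: fit a straight line by ordinary least squares.
def fit_line(xs, ys):
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_line(ad_spend, units_sold)
predicted_sales = slope * 6.0 + intercept  # forecast for a spend of 6.0

# Prescriptive: suggest the spend needed to reach a sales target.
target_sales = 25.0
required_spend = (target_sales - intercept) / slope

print(average_sales, predicted_sales, required_spend)
```

Real prescriptive analysis usually weighs costs and constraints as well; inverting a fitted line is only the simplest possible stand-in.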

Technology of the current era

We now have access to tools and techniques that process data and extract the information we need. For instance, there are data processing tools for data wrangling. We have flexible programming languages like Python & R that are easy to use and well suited to data work. Because these languages and tools run on multiple platforms such as Windows, Mac & Linux, it is now easier to integrate systems and process big data. Application design and extensive software libraries help develop more robust, scalable & data-driven applications. Data scientists use these technologies to build models that learn patterns from data and run them in an automated fashion to predict outcomes. This is called Machine Learning, and it helps provide insights into the underlying data. Data scientists can also use technology to manipulate data, extract information from it and use it to build tools, applications & services.

But using technology and domain knowledge without mathematical and statistical knowledge often leads to spurious patterns and wrong interpretations. This can cause serious damage to a business.

What a Data Scientist does in daily life:

Data Acquisition, Data Wrangling, Data Visualization, Data Report, Data Products:

A data scientist starts the day with a question or business problem, then uses Data Acquisition to collect data from the real world. Next comes Data Wrangling, carried out with data tools and modern technologies; it includes data cleansing, data manipulation, data discovery and data pattern identification. The next step is to design mathematical & statistical models and train them with Machine Learning. Once a model is built, its results are presented using Data Visualization techniques. The next task is to prepare a Data Report. After the report is prepared, the data scientist finally creates Data Products and services.
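The workflow above can be sketched end to end with nothing but the standard library. This is only an illustration under stated assumptions: the CSV text, the store names and the function names are hypothetical, a per-store average stands in for a real model, and a text bar chart stands in for a real visualization.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data standing in for something collected from the real world.
RAW_CSV = """store,sales
north,120
south,
north,80
south,150
east,abc
"""

def acquire(text):
    """Data acquisition: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def wrangle(rows):
    """Data wrangling: keep only rows whose sales value is a whole number."""
    clean = []
    for row in rows:
        value = (row["sales"] or "").strip()
        if value.isdigit():
            clean.append({"store": row["store"], "sales": int(value)})
    return clean

def summarize(rows):
    """Stand-in for modeling: average sales per store."""
    by_store = defaultdict(list)
    for row in rows:
        by_store[row["store"]].append(row["sales"])
    return {store: sum(values) / len(values) for store, values in by_store.items()}

def report(summary):
    """Visualization and report: one text bar per store, one '#' per 10 units."""
    lines = []
    for store, avg in sorted(summary.items()):
        lines.append(f"{store:<6}{'#' * int(avg // 10)} {avg:.0f}")
    return "\n".join(lines)

summary = summarize(wrangle(acquire(RAW_CSV)))
print(report(summary))
```

Keeping each stage in its own function mirrors the steps in the paragraph, so any one stage can be swapped for a real tool without touching the others.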

Skills a Data Scientist should have:

Asking the right questions, Analytical thinking, Data interpreting & wrangling, Statistical & Mathematical thinking, Data visualization, Story telling:

A data scientist should be able to ask the right questions, which requires domain expertise. They also need the curiosity to learn and create new concepts, and the ability to communicate questions effectively to domain experts.

Data scientists should think analytically to uncover hidden patterns in the data.

They should be able to interpret and wrangle the data by removing redundant and irrelevant data collected from various sources.
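As a small illustration of removing redundant and irrelevant data, the sketch below drops a column that is assumed not to matter for the analysis and then removes duplicate rows. The records, the field names and the deduplicate helper are all hypothetical.

```python
# Hypothetical records merged from two sources; field names are illustrative.
records = [
    {"id": 1, "name": "Asha", "city": "Pune", "session_token": "x1"},
    {"id": 2, "name": "Ravi", "city": "Delhi", "session_token": "x2"},
    {"id": 1, "name": "Asha", "city": "Pune", "session_token": "x9"},  # same person again
]

# Assumption: session_token is irrelevant to the analysis.
RELEVANT_FIELDS = ("id", "name", "city")

def deduplicate(rows, fields):
    """Keep only the relevant fields, then drop rows already seen."""
    seen = set()
    cleaned = []
    for row in rows:
        trimmed = tuple(row[field] for field in fields)  # drop irrelevant columns
        if trimmed not in seen:                          # drop redundant rows
            seen.add(trimmed)
            cleaned.append(dict(zip(fields, trimmed)))
    return cleaned

cleaned = deduplicate(records, RELEVANT_FIELDS)
print(cleaned)
```

Dropping the irrelevant column first matters here: with session_token included, the two rows for id 1 would not look like duplicates.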

Statistical thinking and the ability to apply mathematical methods are important traits for a data scientist.

A data scientist must be able to visualize data through graphics and use storytelling to summarize and communicate analytical results to an audience. Here Python and its libraries play an important role, so building projects with real-world datasets will help develop this skill. Building data-driven applications for digital services and data products is another good way to practice it.

Sources of Big Data:

Now that big data is being generated every second through different media, the role of data science has become more important. So it is crucial to know what big data is and how we are connected to it.

Every time you sign in to Facebook, Twitter, Instagram or YouTube, you generate data about yourself, your preferences & even your lifestyle.

Every time you record your heartbeat through your phone’s sensors, post a tweet on Twitter, create a blog or website, switch on your phone’s GPS, upload or view an image, video or audio file, or even sign in to a website, you are generating data about yourself, your preferences and your lifestyle. Big data is the collection of all this and much more data that the world is constantly creating. In this age of the Internet of Things (IoT), big data is a reality and a need. Much of this large-scale data processing is done with the help of Hadoop clusters.

Big Data is usually described by 3 Vs: Volume, Velocity & Variety.

Volume refers to the enormous amount of data generated from various sources. Velocity refers to the tremendous speed at which data flows in from different devices, sensors and applications; dealing with it requires efficient and timely data processing. Variety signifies the different formats of data, namely structured, semi-structured and unstructured data.

Structured data refers to data held in an RDBMS (Relational Database Management System), which can be stored and retrieved easily through SQL.

Semi-structured data usually comes in the form of files such as XML and JSON documents, or as records in NoSQL databases.

Unstructured data covers text files, images, audio and video. In short, all multimedia content is unstructured data.
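Here is a rough sketch of how the three formats look in practice, using only Python's standard library. The table, the JSON document and the sentence are invented for illustration, and an in-memory SQLite database stands in for a full RDBMS.

```python
import json
import sqlite3
from collections import Counter

# Structured: rows in a relational table, retrieved with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ravi", "Delhi")],
)
pune_users = [name for (name,) in conn.execute(
    "SELECT name FROM users WHERE city = ?", ("Pune",)
)]
conn.close()

# Semi-structured: a JSON document with nested, self-describing fields.
doc = json.loads('{"user": "Asha", "tags": ["iot", "health"], "device": {"type": "watch"}}')
device_type = doc["device"]["type"]

# Unstructured: free text has no schema, so structure must be derived from it.
text = "Big data is big and data keeps growing"
word_counts = Counter(text.lower().split())

print(pune_users, device_type, word_counts.most_common(2))
```

The structured case is answered by a query against a fixed schema, the semi-structured case by walking keys that the document itself names, and the unstructured case only after some processing imposes structure on the text.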

Conclusion:

The importance of Data Science in the current era is best understood in relation to Big Data. Big Data is a massive collection of data, often stored on distributed systems such as Hadoop clusters. Data Science helps extract information from this data and build information-driven enterprises. Expertise in Data Science is therefore becoming a strong advantage in today’s increasingly competitive business environment.

Rachit Shukla
Electrical Engineer, Data Analyst, Data Science Enthusiast