Data by itself is just a source of information; unless you can understand it, you will not be able to use it effectively. The data analytics process combines several steps that help you extract information from data sets. It draws on domain knowledge, programming, mathematics, and statistical skills to support decision-making with data.
Let’s take a bank statement as an example. A bank statement usually lists deposits, withdrawals, and the balance after every transaction. Although the statement is highly descriptive and a good source of information, it does not reveal the pattern of savings and expenses.
But the moment we present the data as a line chart of expenses for each day, the overall spending pattern becomes visible. We can also plot the pattern over two, three, or four months and compare the months with each other, which helps reveal unusual spikes or dips and explain the expenses better.
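To make the bank statement example concrete, here is a minimal sketch of how the daily expense series behind such a line chart could be built. The transactions, the `daily_expenses` and `spikes` helpers, and the "twice the average day" threshold are all assumptions made up for illustration, not part of any real statement format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical statement rows: (date, description, amount).
# A negative amount is a withdrawal, a positive amount is a deposit.
transactions = [
    ("2024-01-02", "groceries", -40.0),
    ("2024-01-02", "salary", 1500.0),
    ("2024-01-03", "coffee", -5.0),
    ("2024-01-04", "rent", -600.0),
    ("2024-01-05", "groceries", -35.0),
    ("2024-01-05", "bus pass", -20.0),
]


def daily_expenses(rows):
    """Sum the withdrawals for each day, ignoring deposits."""
    totals = defaultdict(float)
    for date, _description, amount in rows:
        if amount < 0:
            totals[date] += -amount
    return dict(sorted(totals.items()))


def spikes(totals, factor=2.0):
    """Return the days whose spending exceeds `factor` times the average day."""
    average = mean(totals.values())
    return [day for day, spent in totals.items() if spent > factor * average]


expenses = daily_expenses(transactions)
unusual_days = spikes(expenses)

for day, spent in expenses.items():
    print(day, spent)
print("Unusual days:", unusual_days)
```

The keys and values of `expenses` are the x and y values of the line chart, so they could be handed to any plotting library to draw it.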
A bank statement is a highly structured data set. Most of the time, however, you will be working with unstructured data sets, which have to be cleaned before they can be used in any constructive way.
Data Analytics Process Steps
Data analytics is a step-by-step process for reaching a conclusion. Let’s look at these steps briefly:
Step 1: – Business Problem – Ask questions to identify the business problem. The analytics process begins with the questions or business problems of stakeholders. Here are a few examples of such questions:
- Who are my customers?
- Why are my sales going down?
- How do I manage my inventory?
- Why is my system not scaling up with increasing traffic volume?
Such business problems trigger the need to analyze data and find answers.
Step 2: – Data acquisition – Collect data sets related to the business problem or question from various real-world sources. A data scientist has to use database skills to fetch the relevant data from databases.
Expertise in file handling and the ability to deal with multiple file formats are important skills for downloading and analyzing data. Web scraping is a popular way to extract information from the web. Many social media and information companies, such as Twitter, Facebook, and LinkedIn, provide streaming APIs. Server logs can also be extracted from enterprise systems to analyze and optimize application performance.
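As a rough illustration of acquiring data from several kinds of sources, the sketch below reads records from a database, a CSV file, and a JSON payload and combines them. Everything in it is an assumption for demonstration purposes: the `sales` table, the column names, and the sample values are invented, an in-memory SQLite database stands in for a real one, and in-memory strings stand in for a file on disk and an API response.

```python
import csv
import io
import json
import sqlite3

# 1) Database: an in-memory SQLite table stands in for a real sales database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 60.0)],
)
db_rows = connection.execute(
    "SELECT region, amount FROM sales WHERE region = ? ORDER BY amount",
    ("north",),
).fetchall()
connection.close()

# 2) CSV: io.StringIO stands in for an opened file on disk.
csv_text = "region,amount\neast,45.5\nwest,30.0\n"
csv_rows = [
    (row["region"], float(row["amount"]))
    for row in csv.DictReader(io.StringIO(csv_text))
]

# 3) JSON: the format many web APIs return.
json_text = '[{"region": "central", "amount": 75.25}]'
json_rows = [(item["region"], item["amount"]) for item in json.loads(json_text)]

# Combine all sources into one list of (region, amount) records.
records = db_rows + csv_rows + json_rows
print(records)
```

Bringing every source into the same record shape early on keeps the later wrangling steps independent of where the data came from.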
Step 3: – Data wrangling – Prepare the data using data tools and modern techniques. Data wrangling is the most important step of the data analytics process and consists of data cleansing and data manipulation.
Data Cleansing – Usually data is neither in the expected format nor consistent. The Data Cleansing process gets rid of unwanted elements present in the data.
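Here is a small sketch of what data cleansing can look like in practice. The raw rows, the field names, and the cleaning rules (trim whitespace, normalize name casing, drop rows with unusable amounts or missing customers, remove duplicates) are assumptions chosen for illustration; real data sets need rules chosen for their own problems.

```python
# Hypothetical raw rows with typical problems.
raw_rows = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "BOB", "amount": "n/a"},        # amount is not a number
    {"customer": "alice", "amount": "120.50"},   # duplicate once normalized
    {"customer": "", "amount": "15"},            # customer is missing
    {"customer": "Carol", "amount": " 42 "},     # stray whitespace
]


def cleanse(rows):
    """Normalize names, parse amounts, and drop unusable or duplicate rows."""
    cleaned = []
    seen = set()
    for row in rows:
        name = row["customer"].strip().title()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # unwanted element: amount cannot be parsed
        if not name:
            continue  # unwanted element: no customer name
        key = (name, amount)
        if key in seen:
            continue  # unwanted element: duplicate record
        seen.add(key)
        cleaned.append({"customer": name, "amount": amount})
    return cleaned


clean_rows = cleanse(raw_rows)
print(clean_rows)
```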
Data Manipulation – Data manipulation techniques such as transform, merge, aggregate, group by, and reshape put the data into shape and make it available for exploratory data analysis.
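The manipulation techniques named above can be sketched in a few lines of plain Python. The `orders` and `customers` data and all field names are hypothetical; a data frame library would normally do this work, and the standard library is used here only to keep the example self-contained.

```python
from collections import defaultdict

# Hypothetical input tables.
orders = [
    {"customer_id": 1, "month": "Jan", "amount_cents": 1200},
    {"customer_id": 2, "month": "Jan", "amount_cents": 800},
    {"customer_id": 1, "month": "Feb", "amount_cents": 500},
    {"customer_id": 2, "month": "Feb", "amount_cents": 700},
    {"customer_id": 1, "month": "Feb", "amount_cents": 300},
]
customers = {1: "north", 2: "south"}  # customer_id -> region

# Transform (cents to currency units) and merge (attach the region).
merged = [
    {
        "region": customers[order["customer_id"]],
        "month": order["month"],
        "amount": order["amount_cents"] / 100,
    }
    for order in orders
]

# Group by (region, month) and aggregate with a sum.
totals = defaultdict(float)
for row in merged:
    totals[(row["region"], row["month"])] += row["amount"]

# Reshape from long format to wide format: one entry per region,
# with one column per month.
wide = defaultdict(dict)
for (region, month), total in totals.items():
    wide[region][month] = total
wide = dict(wide)

print(wide)
```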
Step 4: – Exploratory Data Analysis (EDA) – Use mathematical and graphical output to aid data analysis.
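As a minimal sketch of the two kinds of EDA output, the example below produces a numerical summary and a crude text histogram. The `daily_sales` values and the bin width of 5 are made up for illustration.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical daily sales figures; one day looks unusual.
daily_sales = [12, 15, 11, 14, 13, 40, 12, 16, 14, 13]

# Mathematical output: basic summary statistics.
summary = {
    "count": len(daily_sales),
    "min": min(daily_sales),
    "max": max(daily_sales),
    "mean": mean(daily_sales),
    "median": median(daily_sales),
    "stdev": stdev(daily_sales),
}


def histogram(values, bin_width=5):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Graphical output: a text histogram, one '#' per observation.
bins = histogram(daily_sales)
for name, value in summary.items():
    print(f"{name}: {value}")
for start, count in bins.items():
    print(f"{start:>3}-{start + 4:<3} {'#' * count}")
```

The gap between the mean and the median, together with the isolated bar at 40, points to an outlier worth investigating before any modeling.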
Step 5: – Data exploration – Data exploration consists of data discovery and data pattern identification: it discovers the data and identifies the patterns in it.
It uses all the available data, presented in either numerical or graphical form, which helps to identify the right patterns. The data and its underlying patterns are then fed into an appropriate machine learning model, which leads directly to the conclusion and prediction phase.
Step 6: – Conclusions or prediction – Draw conclusions and make predictions by training machine learning models. This step relies heavily on mathematical and statistical functions.
Step 7: – Communication or Data Visualization – Present the results of the analysis.
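To show what training a model and making a prediction can mean at the smallest scale, the sketch below fits a straight line by ordinary least squares. Simple linear regression is used here only as one of the simplest possible models; the `ad_spend` and `sales` numbers and the `fit_line` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: advertising spend versus sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

# "Training" estimates the two parameters from the data.
slope, intercept = fit_line(ad_spend, sales)

# "Prediction" applies the fitted line to an unseen input.
predicted_sales = slope * 6.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted_sales, 2))
```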
Data wrangling is the most challenging phase and takes up about 70% of a data scientist’s time.
Causes of challenges in the data wrangling phase
- Unexpected data format: – A new format or inconsistent data requires more preparation work.
- Erroneous data: – Data contains lots of errors and unwanted values that have to be cleansed.
- Manipulating and preparing data: – Voluminous data has to be manipulated with data wrangling techniques and tools and made ready for analysis.
- Understanding the structure of data: – Understand how the data is organized in the first place, for example linear or clustered. Plot it if possible.
- Determining the relationship of variables: – Observations, features, and responses are key, and the relationships between them must be determined, which is difficult to do.
- Conclusions and accurate predictions have to be drawn from the overall data analysis process.
- Model selection should be accurate, or it will lead to a lot of iterations and wasted time.
- Identifying the right patterns and applying the right algorithm is critical.
- The hypothesis building and hypothesis testing processes together lead to an appropriate model selection.
- Mathematical and statistical functions have to be carefully built for the model chosen.
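One common first check on the relationship between a feature and a response, mentioned in the list above, is the Pearson correlation coefficient. The sketch below computes it from scratch; the three data series are invented and deliberately exactly linear so the results are easy to verify, and correlation is only one of several possible measures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, -1 a perfect falling line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical feature and two hypothetical responses.
temperature = [20, 22, 25, 27, 30]
ice_cream_sales = [40, 44, 50, 54, 60]  # rises with temperature
heating_cost = [60, 56, 50, 46, 40]     # falls with temperature

rising = pearson(temperature, ice_cream_sales)
falling = pearson(temperature, heating_cost)
print(round(rising, 3), round(falling, 3))
```

A value near zero would suggest no linear relationship, although the variables could still be related in a non-linear way, which is one reason plotting the data remains useful.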