Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. The steps involved in EDA are:
- Data Collection – It is the process of gathering the data in a systematic way that enables testing the hypothesis and evaluating outcomes easily.
- Data Cleaning – It is the process of ensuring that your data is correct and usable by identifying errors and missing values, then correcting or deleting them.
- Data Preprocessing – It is a data mining technique that involves transforming raw data into an understandable format. It includes normalization and standardization, transformation, feature extraction and selection, etc. The product of data preprocessing is the final training data set.
- Data Visualization – It is the graphical representation of information and data. It uses statistical graphics, plots, information graphics and other tools to communicate information clearly and efficiently.
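As a rough illustration of the cleaning and preprocessing steps, here is a minimal Python sketch that uses only the standard library. The readings in `raw` are made-up values, deleting missing entries is only one of the two options mentioned above (correcting them is the other), and all variable names are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Hypothetical raw readings; None marks a missing value.
raw = [12.0, 15.0, None, 14.0, 13.0, None, 16.0]

# Data cleaning: in this sketch, missing values are simply deleted.
cleaned = [x for x in raw if x is not None]

# Data preprocessing (standardization): shift to zero mean and
# scale to unit standard deviation.
mu = mean(cleaned)
sigma = pstdev(cleaned)
standardized = [(x - mu) / sigma for x in cleaned]

# Data preprocessing (normalization): rescale the same values to [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(x - lo) / (hi - lo) for x in cleaned]

print(cleaned)
print([round(z, 2) for z in standardized])
print(normalized)
```

Whether to delete or fill in missing values, and whether to standardize or normalize, depends on the data and on the model that will consume it.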
Approach – The EDA approach studies the data first and lets it suggest admissible models that fit it.
Focus – The focus is on the data: its structure, its outliers, and the models it suggests.
Assumptions – EDA techniques make minimal or no assumptions. They present all of the underlying data without any loss of information.
Exploratory Data Analysis Techniques
- Quantitative – Mathematical and statistical functions produce numeric summaries of the input data.
- Graphical – Graphical techniques use statistical functions to produce plots of the data.
Exploratory Data Analysis – Quantitative Technique
Quantitative EDA on numerical data has two goals:
- Measurement of Central Tendency
  - Mean is the arithmetic average of the data points.
    - Suitable for symmetric distributions.
  - Median is the middle value of the sorted data.
    - Suitable for skewed distributions and for data containing outliers, since extreme values barely affect it.
  - Mode is the most frequent value in the data.
- Measurement of Spread
  - Variance is approximately the mean of the squared deviations from the mean.
  - Standard deviation is the square root of the variance.
  - The interquartile range is the distance between the 75th and 25th percentiles.
    - It spans essentially the middle 50% of the data.
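The measures above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up sample; the value 40 is a deliberate outlier that shows why the median is preferred for skewed data. Note that `statistics.variance` divides by n - 1 rather than n, which is why the variance is only approximately the mean of the squared deviations.

```python
import statistics

# Hypothetical sample; 40 is a deliberate outlier.
data = [2, 3, 3, 4, 5, 6, 7, 40]

# Central tendency
mean_value = statistics.mean(data)      # pulled upward by the outlier
median_value = statistics.median(data)  # barely affected by the outlier
mode_value = statistics.mode(data)      # most frequent value

# Spread
variance = statistics.variance(data)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)        # square root of the sample variance

# Quartiles: the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # width of the middle 50% of the data

print("mean:", mean_value, "median:", median_value, "mode:", mode_value)
print("variance:", variance, "std dev:", std_dev, "IQR:", iqr)
```

On this sample the mean (8.75) lies above every value except the outlier, while the median (4.5) stays in the middle of the bulk of the data.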
Exploratory Data Analysis – Graphical Technique
Histograms and Scatter Plots are two popular graphical techniques to depict data.
- A histogram graphically summarizes the distribution of a univariate dataset. It shows:
  - The center or location of the data.
  - The spread of the data.
  - The skewness of the data.
  - The presence of outliers.
  - The presence of multiple modes in the data.
- A scatter plot represents the relationship between two variables. It can answer these questions visually:
  - Are variables X and Y related?
  - Are variables X and Y linearly related?
  - Are variables X and Y non-linearly related?
  - Does the variation in Y change depending on X?
  - Are there outliers?
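Both techniques would normally be drawn with a plotting library. To keep the example dependency-free, the sketch below prints a text histogram and uses a hand-written Pearson correlation as a numeric companion to the scatter plot questions. All data values, the bin width, and the helper name `pearson` are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# --- Histogram: bin a univariate sample and print one bar per bin ---
# Hypothetical sample with two clusters and one far-away value.
values = [1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 6.1, 6.4, 6.8, 7.2, 7.5, 14.0]
bin_width = 2.0  # assumed bin width; a different width changes the picture

# Each value falls in bin floor(value / bin_width).
counts = Counter(int(v // bin_width) for v in values)

for b in range(min(counts), max(counts) + 1):
    left = b * bin_width
    print(f"[{left:5.1f}, {left + bin_width:5.1f}) {'#' * counts[b]}")


# --- Scatter plot companion: strength of the linear relationship ---
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    spread_a = sqrt(sum((ai - mean_a) ** 2 for ai in a))
    spread_b = sqrt(sum((bi - mean_b) ** 2 for bi in b))
    return cov / (spread_a * spread_b)


x = [1, 2, 3, 4, 5, 6]
y_linear = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]  # roughly y = 2x
y_curved = [(xi - 3.5) ** 2 for xi in x]    # U-shaped, clearly non-linear

print("linear:", round(pearson(x, y_linear), 3))
print("curved:", round(pearson(x, y_curved), 3))
```

In the printed histogram the two clusters appear as two separate groups of bars (two modes), and the value 14.0 sits alone in a distant bin (an outlier). The curved series has a correlation close to zero even though Y clearly depends on X, which is why a scatter plot is worth inspecting before trusting a single number.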