Introduction to Statistics for Data Analytics

Statistics

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. It’s widely used to understand the complex problems of the real world and simplify them to make well-informed decisions.

Tools available to analyze data:

  • Statistical principles
  • Functions
  • Algorithms

What you can do using statistical tools:

  • Analyze the primary data
  • Build a statistical model
  • Predict the future outcome

Statistical and Non-Statistical Analysis

Statistical analysis is scientific: it is based on numbers or statistical values and gives quantitative insight into the data.

Non-statistical analysis is based on generic, qualitative information and does not involve statistical or quantitative analysis.

Major categories of Statistics

There are two major categories of statistics:

  1. Descriptive statistics – Descriptive statistics helps organize the data and focuses on its main characteristics. It provides a concise summary of the data, which can be presented numerically or graphically.
    For example, you can collect data about the visitors to your website for a week and summarize it.
  2. Inferential statistics – Inferential statistics generalizes from a sample to the larger data set and applies probability theory to draw conclusions. In this approach, a random sample is taken from a population and used to describe and make inferences about that population.

Note: Inferential analysis is valuable when it is not convenient or possible to examine every member of a population.

Let’s understand both categories better through an example. Suppose we have to study the heights of people in an entire population. It can be done in two ways:

  • In the descriptive analysis method, we record the height of every person in the population and then summarize the data by the maximum height, minimum height and average height of the population. For a large population this is a tedious process.
  • In the inferential analysis method, we take only a random sample from the population, record the heights in the sample (for example, categorized as “Tall”, “Medium” and “Short”), and use the sample to draw conclusions about the whole population.
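The two approaches can be sketched in a few lines of Python. This is only an illustration: the population is simulated, and the size (10,000 people), the height distribution and the sample size (200) are all made-up assumptions.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in centimetres.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Descriptive analysis: measure everyone, then summarize.
pop_min = min(population)
pop_max = max(population)
pop_mean = statistics.mean(population)

# Inferential analysis: measure a random sample and use it
# to estimate the population mean.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"population: min={pop_min:.1f} max={pop_max:.1f} mean={pop_mean:.1f}")
print(f"sample estimate of the mean: {sample_mean:.1f}")
```

The sample mean will usually land close to the population mean, although it is an estimate and not an exact value.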

Consideration of statistical analysis in Data Analytics

Purpose – The purpose of the statistical analysis should be clear and well defined.

Document Questions – Prepare in advance the questions that the analysis needs to answer.

Define population of interest – Select the population based on the purpose of analysis.

Determine Sample – Determine the sample that you want to draw based on the purpose of the study.

Statistics and Parameters

Statistics – These are the quantitative values calculated from the sample.

Parameters – They are the characteristics of the population.

Suppose you have x0, x1, x2, …, xN, a sample from the population, and you want to know some vital information, such as the average or the most frequently occurring value. These are calculated using the formulas below.

Statistics Formula

Mean – It is the average, a typical value of the distribution, calculated by summing the values and dividing by the number of values.

Variance – It measures the sample variability: the sum of the squared deviations from the sample mean, divided by the number of values minus one.

Standard Deviation – It explains how spread out the data is from the mean, and is the square root of the variance. The greater the standard deviation, the greater the spread in the data.
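As a small worked example, the three quantities can be computed by hand in Python and checked against the standard library's statistics module. The sample values are made up, and the variance uses the common sample convention of dividing by the number of values minus one.

```python
import math
import statistics

# Made-up sample values, used only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

# Mean: sum of the values divided by the number of values.
mean = sum(sample) / n

# Sample variance: sum of squared deviations from the mean,
# divided by n - 1.
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, round(variance, 3), round(std_dev, 3))

# The standard library gives the same results.
assert math.isclose(variance, statistics.variance(sample))
assert math.isclose(std_dev, statistics.stdev(sample))
```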

Terms used in Statistical analysis to describe data

Typical terms used in data analysis are: 

Search – It is used to find unusual data. Unusual data refers to data that does not match the parameters set at the beginning.

Inspect – It refers to studying the dataset and determining how spread out it is.

Characterize – It refers to determining the central tendency of the data.

Conclusion – Based on the understanding gained from search, inspection, and characterization, we can draw preliminary or high-level conclusions about the data.

Statistical Analysis Process for Data Analytics

The statistical analysis process consists of four steps:

Step 1: Find the population of interest that suits the purpose of statistical analysis.

Step 2: Draw a random sample that represents the population.

Step 3: Compute sample statistics to describe the spread and shape of the dataset.

Step 4: Make inferences from the sample statistics and apply them back to the population.
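The four steps might look like this in Python. The population here is simulated (hypothetical daily page views), and the inference in step 4 is an approximate 95% confidence interval for the population mean using the normal-approximation multiplier 1.96. That is one common choice, not the only way to make an inference.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Step 1: population of interest (simulated, hypothetical page views).
population = [random.gauss(500, 80) for _ in range(20_000)]

# Step 2: draw a simple random sample from the population.
sample = random.sample(population, 400)

# Step 3: compute sample statistics.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Step 4: infer a plausible range for the population mean.
# 1.96 is the normal-approximation multiplier for a 95% interval.
standard_error = sample_sd / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"estimated mean: {sample_mean:.1f} (95% CI {ci_low:.1f} to {ci_high:.1f})")
```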

Data Distribution

A data distribution is the collection of data values arranged in sequence, together with how frequently each value occurs.

To understand any kind of problem, it is important to describe the data in terms of its spread and shape using graphical techniques.

Range of the data indicates the span of the quantitative values, from the minimum to the maximum.

Frequency indicates the number of occurrences of any particular data value in the given set.

Central tendency indicates where the data values accumulate: in the middle of the distribution or toward one end.
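A minimal sketch of these three ideas, using a made-up list of customer ratings. The mode and median stand in here as simple measures of central tendency.

```python
from collections import Counter
import statistics

# Made-up ratings on a 1-5 scale, used only for illustration.
ratings = [3, 5, 4, 3, 2, 3, 4, 5, 3, 1]

# Range of the data: the minimum and maximum values.
data_min, data_max = min(ratings), max(ratings)

# Frequency: how many times each value occurs.
frequency = Counter(ratings)

# Central tendency: the most common value and the middle value.
mode = statistics.mode(ratings)
median = statistics.median(ratings)

print(data_min, data_max)         # 1 5
print(sorted(frequency.items()))  # [(1, 1), (2, 1), (3, 4), (4, 2), (5, 2)]
print(mode, median)               # 3 3.0
```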

Dispersion

In statistics, dispersion, also called variability, scatter, or spread, denotes how stretched or squeezed a distribution is. Common measures are:

Range – It refers to the difference between the maximum and minimum values.

Interquartile range – It refers to the difference between the 75th and 25th percentiles.

Variance – It refers to the average squared distance of the data values from the mean value.

Standard deviation – It is the square root of the variance, measured in the same unit as the data. It also indicates how spread out the data is.
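The four measures can be computed with Python's standard library, as in the sketch below. The values are made up, and the quartiles use the "inclusive" method of statistics.quantiles. Other quartile conventions exist and can give slightly different results.

```python
import statistics

# Made-up, already sorted values, used only for illustration.
values = [4, 7, 8, 10, 12, 15, 18, 21, 30]

# Range: maximum minus minimum.
value_range = max(values) - min(values)

# Interquartile range: 75th percentile minus 25th percentile.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Sample variance and standard deviation.
variance = statistics.variance(values)
std_dev = statistics.stdev(values)

print(value_range)  # 26
print(q1, q3, iqr)  # 8.0 18.0 10.0
print(round(variance, 2), round(std_dev, 2))
```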

Histogram

A histogram is a graphical representation of the data distribution, first introduced by Karl Pearson.

  • To construct a histogram, the first step is to ‘bin’ the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval.
  • The bins are usually specified as consecutive, non-overlapping intervals of a variable.
  • The bins must be adjacent and are usually of equal size.
  • In the graphical representation, each bar represents a group of values, also called a bin.
  • The height of the bar represents the frequency of the values in the bin.
  • It helps assess the probability distribution of a variable by depicting the frequencies of observations occurring in a certain range of values.
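The binning step can be sketched by hand in Python. The values and the choice of three equal-width bins are assumptions made for illustration, and the bars are drawn as rows of "#" characters instead of a real chart.

```python
# Made-up values, used only for illustration.
values = [1, 2, 2, 3, 5, 6, 6, 7, 8, 9, 9, 10]
num_bins = 3

low, high = min(values), max(values)
width = (high - low) / num_bins  # equal-width bins

# Count how many values fall into each bin.
counts = [0] * num_bins
for v in values:
    # The maximum value is placed in the last bin.
    index = min(int((v - low) / width), num_bins - 1)
    counts[index] += 1

# Draw each bar; its length is the frequency of the bin.
for i, count in enumerate(counts):
    left = low + i * width
    right = left + width
    print(f"{left:4.1f} to {right:4.1f}: {'#' * count}")

print(counts)  # [4, 3, 5]
```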

Bell Curve – Normal Distribution

The normal distribution is the most commonly used distribution in statistics. It is characterized by its bell shape and its two parameters, mean and standard deviation.

Properties of the bell curve, or normal distribution:

  • It is symmetric around the mean: a line drawn at the center splits it into two mirror-image halves.
  • The mean, median and mode of a normal distribution are equal.
  • Normal distributions are denser in the center and less dense in the tails or sides.
  • The bell curve is also known as the Gaussian curve.

The bell curve is divided into three parts to understand data distribution better.

Peak – Generally, the peak is the area within one standard deviation of the mean.

Flanks – They are the areas beyond the peak, between one and two standard deviations from the mean.

Tails – They refer to the areas far from the center of the distribution and are considered to be beyond two standard deviations from the mean.
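For a normal distribution, the share of the data in each of the three parts can be worked out from the cumulative distribution function. The sketch below uses NormalDist from Python's standard library; the helper name share_within is introduced only for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of the distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

peak = share_within(1)                      # within 1 standard deviation
flanks = share_within(2) - share_within(1)  # between 1 and 2
tails = 1 - share_within(2)                 # beyond 2

print(f"peak:   {peak:.1%}")    # about 68.3%
print(f"flanks: {flanks:.1%}")  # about 27.2%
print(f"tails:  {tails:.1%}")   # about 4.6%
```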

Bell Curve – Left Skewed

Skewed data distribution indicates the tendency of the data distribution to be more spread out on one side.

In a left-skewed distribution, the mean is less than the median. The distribution is said to be negatively skewed. The left tail is longer and contains the extreme values.

Bell Curve – Right Skewed

In a right-skewed, or positively skewed, distribution, the mean is greater than the median. The right tail is longer and contains the extreme values.
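The relationship between the mean, the median and the direction of skew can be checked on a small made-up dataset. The skewness function below is a simple moment-based measure (the mean of the cubed standardized values). It is one of several definitions in use, chosen here for illustration.

```python
import statistics

def skewness(data):
    """Moment-based skewness: mean of the cubed standardized values."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(((x - mean) / sd) ** 3 for x in data) / len(data)

# Made-up data with one large value stretching the right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
# Mirroring the data stretches the left tail instead.
left_skewed = [-x for x in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(
        name,
        f"mean={statistics.fmean(data):.2f}",
        f"median={statistics.median(data)}",
        f"skewness={skewness(data):.2f}",
    )
```

For the right-skewed data the mean exceeds the median and the skewness is positive. For the mirrored data both relationships reverse.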

Kurtosis

Kurtosis describes the shape of a probability distribution. As with skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample of a population. Depending on the particular measure of kurtosis that is used, there are various interpretations of kurtosis.

Kurtosis measures how much of the data lies in the tails relative to the center, compared with a normal distribution.

Platykurtic is negative excess kurtosis: lighter tails than a normal distribution.

Mesokurtic represents a normal distribution curve, with zero excess kurtosis.

Leptokurtic is positive excess kurtosis: heavier tails than a normal distribution.
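A rough sketch of the three cases, assuming the moment-based measure of excess kurtosis (the mean of the fourth power of the standardized values, minus 3, so that a normal distribution scores 0). Both datasets are made up for illustration.

```python
import statistics

def excess_kurtosis(data):
    """Moment-based excess kurtosis: fourth standardized moment minus 3."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(((x - mean) / sd) ** 4 for x in data) / len(data) - 3

# Evenly spread values with no pronounced tails: platykurtic.
flat = list(range(1, 11))

# Most values at the center plus two extreme values: leptokurtic.
peaked = [0] * 18 + [-10, 10]

print(f"flat:   {excess_kurtosis(flat):.2f}")    # negative
print(f"peaked: {excess_kurtosis(peaked):.2f}")  # positive
```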
