Characteristics of Frequency Distribution — Descriptive Statistics

Jeeva Selvaraju
8 min read · Jun 8, 2021


Frequency Distribution: In statistics, a frequency distribution is a list, table, or graph that shows how often each possible outcome of a repeatable event occurs when the event is observed many times.

For example, take the list of numbers (1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9). The frequency of the number 9 is 5, because it occurs 5 times.
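As a small sketch of the example above, the frequency of each number can be counted with Python's standard `collections.Counter`. The variable names are only illustrative.

```python
from collections import Counter

# The list of numbers from the example in the text
numbers = [1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9]

# Counter maps each distinct value to the number of times it occurs
frequency = Counter(numbers)

# Print a simple frequency table, ordered by value
for value in sorted(frequency):
    print(value, frequency[value])

print("Frequency of 9:", frequency[9])  # 5
```

The printed table is the frequency distribution of the list: one row per outcome, with its count.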

Characteristics of Frequency Distribution

A frequency distribution is described by four characteristics:

Modality

Symmetry

Measure of Central Tendency

Measure of Dispersion or Variability

Modality

Modality — The modality of a distribution is determined by the number of peaks it contains.

Types of Modality: Unimodal, Bimodal, Multimodal.

Types of Modality

Unimodal: A unimodal distribution has one value that occurs most frequently (one peak).

Bimodal: A bimodal distribution has two values that occur most frequently (two peaks).

Multimodal: A multimodal distribution has several frequently occurring values (more than two peaks).
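One rough way to see modality in code is to count the peaks in a frequency table. The sketch below is only an illustration under simplifying assumptions: the data are integers, and a peak is a value whose count is strictly higher than the counts of both neighbouring values. The function name `count_peaks` is hypothetical, and real data would usually need binning or smoothing first.

```python
from collections import Counter

def count_peaks(values):
    """Count strict local maxima in the frequency table of integer data."""
    freq = Counter(values)
    # Counts for every integer between the smallest and largest value
    counts = [freq.get(v, 0) for v in range(min(values), max(values) + 1)]
    # Pad with zeros so the first and last values can also be peaks
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

unimodal = [1, 2, 2, 3, 3, 3, 4, 4, 5]
bimodal = [1, 2, 2, 2, 3, 4, 5, 5, 5, 6]
multimodal = [1, 1, 2, 3, 3, 4, 5, 5]

print(count_peaks(unimodal))    # 1
print(count_peaks(bimodal))     # 2
print(count_peaks(multimodal))  # 3
```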

Symmetry

Symmetry: Symmetry means that one half of the distribution is a mirror image of the other half.

Types of Symmetry: Symmetric, Asymmetric

Normal Curve(Symmetric) & Positive/Negative Skew(Asymmetric)

Symmetric: A symmetric distribution has no skew, and its two tails are exactly the same. The normal distribution is the classic example: a bell curve that is equal on both sides.

Normal Bell Curve (Symmetric)

Asymmetric: Asymmetry is the absence of, or a violation of, symmetry. An asymmetric distribution is not identical on both sides of a central line.

Types of Asymmetric: Positive Skewness, Negative Skewness

Positive Skewness: Positive skewness is when the tail on the right side of the distribution is longer or fatter than the tail on the left side. In this case the mean and median are typically greater than the mode.

Negative Skewness: Negative skewness is when the tail on the left side of the distribution is longer or fatter than the tail on the right side. In this case the mean and median are typically less than the mode.

Negative & Positive Skewness(Asymmetric)
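The direction of skew can also be checked numerically. The sketch below uses the moment coefficient of skewness (the third central moment divided by the second central moment raised to the power 1.5), which is one common definition and not necessarily the one a given library uses. The function name and sample data are illustrative, and the function assumes the values are not all identical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]       # long tail on the right
left_skewed = [-x for x in right_skewed]       # mirror image, long tail on the left

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive value
print(skewness(left_skewed))   # negative value
```

A result near zero suggests a symmetric shape, a positive result a longer right tail, and a negative result a longer left tail.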

Measure of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics.

In other words, central tendency computes the “center” around which the data is distributed.

The mean, median and mode are all valid measures of central tendency:

Mean: the average value

Median: the middle value

Mode: the value that occurs the most times

Mean

The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.

For example, suppose we have 10 random numbers (1, 5, 2, 8, 4, 55, 9, 7, 3, 6). First we add all 10 numbers:

Sum of all 10 numbers → 1 + 5 + 2 + 8 + 4 + 55 + 9 + 7 + 3 + 6 = 100

Mean = 100 / 10 = 10

So, if we have n values in a data set and they have values x1, x2, …, xn, the sample mean, usually denoted by x̄ (pronounced “x bar”), is:

Sample Mean: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

This formula is usually written in a slightly different manner using the Greek capital letter ∑, pronounced “sigma”, which means “sum of…”:

Sample Mean Formula: $\bar{x} = \frac{\sum x}{n}$

You may have noticed that the above formula refers to the sample mean. So, why have we called it a sample mean?

Please take a look at my previous post, Population and Sample, for a better understanding of samples and populations.

This is because, in statistics, samples and populations have very different meanings and these differences are very important, even if, in the case of the mean, they are calculated in the same way. To acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lower case letter “mu”, denoted as μ:

Population Mean Formula: $\mu = \frac{\sum x}{N}$, where N is the number of values in the population.
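The mean of the 10 example numbers can be computed by hand or with the standard library. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean

# The 10 example numbers from the text
values = [1, 5, 2, 8, 4, 55, 9, 7, 3, 6]

# Sum of all values divided by the number of values
manual_mean = sum(values) / len(values)

# The standard library gives the same result
library_mean = mean(values)

print(manual_mean)   # 10.0
print(library_mean)  # 10
```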

Disadvantages of Mean:

Let us reuse the example above.

We have 10 random numbers (1, 5, 2, 8, 4, 55, 9, 7, 3, 6).

Let us assume these 10 numbers are the salaries of 10 employees, in thousands:

(1k, 5k, 2k, 8k, 4k, 55k, 9k, 7k, 3k, 6k)

Outlier: Outliers are data points that are far from other data points. In other words, they’re unusual values in a dataset.

Here one employee has a very large salary of 55k. This value is far from the other data points and distorts the summary of the whole data set, so it is called an outlier.

Note: The mean is highly affected by outliers. Here the mean of 10k is pulled up by the single large salary, even though nine of the ten employees earn less than 10k. Therefore, in this situation, we would like to have a better measure of central tendency. As we will find out later, the median is a better measure of central tendency in this situation.
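To see how much the single 55k salary matters, the sketch below compares the mean and the median of the salary example with and without the outlier. Dropping the value 55 by hand is only for illustration and is not a general outlier-detection method.

```python
from statistics import mean, median

# Salaries in thousands, from the example in the text
salaries = [1, 5, 2, 8, 4, 55, 9, 7, 3, 6]

# The same data with the 55k outlier removed by hand
without_outlier = [s for s in salaries if s != 55]

print(mean(salaries), median(salaries))                # 10 5.5
print(mean(without_outlier), median(without_outlier))  # 5 5
```

Removing one value halves the mean (10 to 5) but barely moves the median (5.5 to 5), which is why the median is the more robust summary here.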

Median

The median is a simple measure of central tendency. To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values.

Simple way to remember: the middle value is called the median.

If the data count is odd:

1, 7, 6, 9, 8, 2, 3, 5, 4 → Arrange in ascending order

1, 2, 3, 4, 5, 6, 7, 8, 9 → Total count = 9 (odd number)

The middle value is the median → Median = 5

If the data count is even:

1, 7, 6, 9, 8, 2, 3, 5, 4, 10 → Arrange in ascending order

1, 2, 3, 4, 5, 6, 7, 8, 9, 10 → Total count = 10 (even number)

The middle 2 values are 5 and 6 → Average the 2 numbers to get the median

Median = (5 + 6) / 2 = 5.5
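The odd and even cases above can be written as one short function. This is a sketch of the rule described in the text; the name `find_median` is illustrative, and Python's built-in `statistics.median` follows the same rule.

```python
def find_median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [1, 7, 6, 9, 8, 2, 3, 5, 4]
even_data = [1, 7, 6, 9, 8, 2, 3, 5, 4, 10]

print(find_median(odd_data))   # 5
print(find_median(even_data))  # 5.5
```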

Mode

The mode is the number that appears most often in a set of numbers. The mode is also used for categorical data, where we wish to know which category is the most common.

Example: in {5, 4, 6, 5, 9, 5, 7, 3} the mode is 5 (it occurs most often).

A problem with the mode is that it does not provide a very good measure of central tendency when the most common value is far away from the rest of the data in the data set.

Note: In such a data set, using the mode to describe the central tendency would be misleading.
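A short sketch of the mode for both numeric and categorical data, using the standard `statistics` module. The colour list is made-up illustrative data, and `multimode` requires Python 3.8 or later.

```python
from statistics import mode, multimode

numbers = [5, 4, 6, 5, 9, 5, 7, 3]          # example from the text
colours = ["red", "blue", "red", "green"]   # illustrative categorical data

print(mode(numbers))  # 5
print(mode(colours))  # red

# When several values tie for most frequent, multimode returns all of them
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```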

Measure of Dispersion or Variability

Measures of dispersion describe the spread of the data, that is, how spread out a group of data is. Variability is also referred to as spread, scatter or dispersion. The common measures are the range, interquartile range (IQR), variance and standard deviation. The range, the difference between the highest and lowest values, is the simplest measure of variability.

Measures of variability or dispersion are descriptive statistics: they can only be used to describe the data in a given data set or study.

Range

Variance

Standard Deviation

Inter Quartile Range (IQR)

Range

The range is the difference between the highest and lowest values.

Range = Maximum Value - Minimum Value (Max - Min)

Example: In {2, 4, 6, 9, 3, 7, 10}, arranged in ascending order as {2, 3, 4, 6, 7, 9, 10},

the lowest value is 2 and the highest is 10.

Range = 10 - 2 = 8
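In code, the range needs no sorting at all, because `max` and `min` find the extremes directly. A minimal sketch with illustrative names:

```python
# Example data from the text
data = [2, 4, 6, 9, 3, 7, 10]

# Range = maximum value - minimum value
data_range = max(data) - min(data)

print(data_range)  # 8
```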

Variance

The variance measures the average degree to which each point differs from the mean, where the mean is the average of all data points.

Variance measures variability from the average or mean. For example, in finance the variance statistic can help determine the risk an investor assumes when purchasing a specific security. A large variance indicates that numbers in the set are far from the mean and from each other, while a small variance indicates the opposite.

Unlike range and quartiles, the variance combines all the values in a data set to produce a measure of spread. … It is calculated as the average squared deviation of each number from the mean of a data set. For example, for the numbers 1, 2, and 3 the mean is 2 and the population variance is (1 + 0 + 1) / 3 ≈ 0.667.

Variance Formula (population): $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

Variance Formula (sample): $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
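The variance of the numbers 1, 2 and 3 can be checked with a few lines of Python. The sketch computes the population variance by hand (dividing by the number of values) and compares it with `statistics.pvariance`; `statistics.variance` is the sample version, which divides by one less than the number of values.

```python
from statistics import pvariance, variance

values = [1, 2, 3]

# Population variance by hand: average squared deviation from the mean
mean = sum(values) / len(values)
manual_pvariance = sum((x - mean) ** 2 for x in values) / len(values)

print(round(manual_pvariance, 3))   # 0.667
print(round(pvariance(values), 3))  # 0.667 (population variance)
print(variance(values))             # 1 (sample variance, divides by n - 1)
```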

Standard Deviation

The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance. If the data points are further from the mean, there is a higher deviation within the data set; thus, the more spread out the data, the higher the standard deviation.

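Standard Deviation Formula (population): $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$, where μ is the population mean and N is the number of values.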
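Since the standard deviation is the square root of the variance, it can be sketched directly from that definition and compared with the standard library. The data are the same small illustrative set, 1, 2 and 3.

```python
import math
from statistics import pstdev, pvariance, stdev

values = [1, 2, 3]

# Population standard deviation = square root of the population variance
manual_pstdev = math.sqrt(pvariance(values))

print(round(manual_pstdev, 4))   # 0.8165
print(round(pstdev(values), 4))  # 0.8165 (population standard deviation)
print(stdev(values))             # 1.0 (sample standard deviation)
```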

Inter Quartile Range (IQR)

Before going to IQR, let us look at quartiles and percentiles.

Percentile: The Nth percentile is the value such that at least N% of the values are less than or equal to it and at least (100 - N)% are greater than or equal to it.

Put simply, if I am at the Nth percentile, N% of people are below me.

Percentile Formula: percentile rank of a value = (number of values below it / total number of values) × 100
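The "N% of people are below me" idea can be sketched as a percentile rank. Several conventions exist for percentiles; the one assumed here counts only the values strictly below the given value. The function name and the scores are illustrative.

```python
def percentile_rank(values, x):
    """Percentage of values that are strictly below x (one of several conventions)."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Illustrative exam scores
scores = [35, 40, 50, 55, 60, 65, 70, 80, 90, 95]

print(percentile_rank(scores, 80))  # 70.0
```

Here 7 of the 10 scores are below 80, so a score of 80 sits at the 70th percentile under this convention.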

Quartile — In statistics, a quartile is a type of quantile which divides the number of data points into four parts, or quarters, of more-or-less equal size. The data must be ordered from smallest to largest to compute quartiles; as such, quartiles are a form of order statistic.

Quartiles divide the data into four equal parts:

Q1 → 1st Quartile = 25th Percentile

Q2 → 2nd Quartile = 50th Percentile

Q3 → 3rd Quartile = 75th Percentile

Inter Quartile Range: The IQR describes the spread of the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first split the ordered data into a lower half and an upper half, then find the median (middle value) of each half. These values are quartile 1 (Q1) and quartile 3 (Q3).

IQR = Q3 - Q1
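The median-of-halves method described above can be sketched as follows. One assumption is made explicit here: when the count is odd, the overall median is left out of both halves. Other conventions exist, so library functions may return slightly different quartiles. The function name is illustrative, and the data must contain at least two values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # lower half of the ordered data
    upper = ordered[-half:]   # upper half; skips the middle value when the count is odd
    return median(lower), median(ordered), median(upper)

data = [1, 7, 6, 9, 8, 2, 3, 5, 4]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3)  # 2.5 5 7.5
print(iqr)         # 5.0
```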

Thanks for reading. Please take a look at my other stories to learn more about statistics for data science.
