December 19, 2024
Analysis of frequency distributions: In our daily life, we work with a lot of data. We can conveniently sort this data into frequency distributions. Frequency is how often an event occurs, and a distribution is an arrangement of data. Hence, we can define a frequency distribution as a representation of data sorted by how frequently each value occurs.
We can present frequency distributions in different ways, including tabular and graphical forms. They can be built for discrete data or for values over a continuous interval. For example, the number of trains that halt at Kharagpur station each day can be represented as a discrete frequency distribution. In contrast, the mass of jackfruits harvested this summer is represented as a continuous frequency distribution, as the mass of a jackfruit can take any value within an interval.
Frequency distribution is used in statistics to reveal the important features of a data set. We can use this information to make informed decisions about the data. The measures of central tendency give us a single value that can fairly represent a data set. The three measures of central tendency are the mean, the median, and the mode.
However, sometimes these measures of central tendency are not sufficient to represent the data.
Consider the distances, in kilometres, run by the athletes \(A\) and \(B\) over \(5\) days, given below.
\(A: 10, 0, 11, 29, 0\)
\(B: 9, 10, 11, 8, 12\)
Here, the mean and the median distances for both \(A\) and \(B\) are the same, \({\rm{10}}\,{\rm{km}}.\)
So, we cannot differentiate the performance of these athletes based on the central tendencies of their running data, which place them as equally good. But we can see that \(A\)'s distances vary significantly compared to those of \(B,\) who runs roughly the same distance every day.
In this case, we need a number to highlight the variation or dispersion in data. Such a number is called a measure of dispersion.
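The athletes' figures can be checked with a short Python sketch. It uses only the standard `statistics` module; the variable names are illustrative, and the range is used here simply because it is the easiest measure of spread to compute.

```python
import statistics

# Distances (km) run over 5 days, taken from the example above.
a = [10, 0, 11, 29, 0]
b = [9, 10, 11, 8, 12]

for name, runs in (("A", a), ("B", b)):
    mean = statistics.mean(runs)
    median = statistics.median(runs)
    spread = max(runs) - min(runs)  # range: largest value minus smallest value
    print(f"{name}: mean={mean}, median={median}, range={spread}")
```

Both athletes share the same mean and median, while the range separates them clearly.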
Range, mean deviation, quartile deviation, and standard deviation are the measures of dispersion we are familiar with. They show how the data is spread out about a central value such as the mean. Generally, the higher the value of a measure of dispersion, the more the data is scattered.
Sometimes, we may require a measure of dispersion independent of the units of the data we are working on. Such a measure of dispersion independent of units of the data is called the coefficient of variation, denoted as \(C.V.\)
Coefficient of variation can be calculated using the formula,
\(C.V. = \frac{\sigma }{{\bar x}} \times 100\)
Here,
\(\sigma =\) standard deviation of the data set
\(\bar x =\) mean of the data set
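The formula translates directly into a small function. This is a minimal sketch, assuming the population form of the standard deviation (dividing by \(n\)), which `statistics.pstdev` provides; the function name `coefficient_of_variation` is illustrative.

```python
import statistics

def coefficient_of_variation(data):
    """Return C.V. in percent: (population standard deviation / mean) * 100."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return statistics.pstdev(data) / mean * 100

# The two athletes from the earlier example.
print(round(coefficient_of_variation([10, 0, 11, 29, 0]), 2))
print(round(coefficient_of_variation([9, 10, 11, 8, 12]), 2))
```

The guard on a zero mean is a design choice of this sketch: the ratio has no meaning when the mean is zero.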
Coefficient of Variation vs Standard Deviation
The other measures of dispersion (range, mean deviation, quartile deviation, and standard deviation) are absolute measures of dispersion. They have the same unit of measurement as the underlying data.
Example: The mean distance run by an athlete is \(10\;{\rm{km}},\) and the standard deviation is \(2\;{\rm{km}}.\) It suggests that the distances typically lie within about one standard deviation of the mean, that is, between \(10\;{\rm{km}} – 2\;{\rm{km}} = 8\;{\rm{km}}\) and \(10\;{\rm{km}} + 2\;{\rm{km}} = 12\;{\rm{km}}.\)
The coefficient of variation is a unitless and relative measure of dispersion. It is a number that relates the standard deviation of a data set to its mean.
Example:
If the value of \(C.V.\) is \(20\%,\) it implies that the standard deviation for this data set is \(20\%\) of its mean.
Uses of Coefficient of Variation
In the example of the athletes seen above, both data sets have the same mean and the same unit, so we can find who is more consistent by comparing the standard deviations alone. Finding the coefficient of variation is not necessary.
There are two cases where we can use the coefficient of variation.
Data Sets With a Large Difference Between their Means
Suppose you measure the fuel required by a scooter and car in a year.
Fuel required | Scooter (Litres) | Car (Litres) |
Mean | \(200\) | \(1000\) |
Standard deviation | \(20\) | \(100\) |
Observe that the mean fuel requirements of the scooter and the car differ greatly, and so do their standard deviations.
In this case, the coefficient of variation gives a fairer comparison than the standard deviation.
\(C.V. = \frac{\sigma }{{\bar x}} \times 100\)
\(\therefore {(C.V.)_{{\rm{scooter}}}} = \frac{{20}}{{200}} \times 100 = 10\% \)
\(\therefore {(C.V.)_{{\rm{car}}}} = \frac{{100}}{{1000}} \times 100 = 10\% \)
So, statistically, the fuel consumption of the scooter and that of the car are equally consistent relative to their means.
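The two coefficients can be reproduced from the summary figures in the table. The sketch below assumes only a mean and a standard deviation are available; `cv_from_summary` and the `fuel` dictionary are hypothetical names chosen for illustration.

```python
def cv_from_summary(std_dev, mean):
    """C.V. in percent from a standard deviation and a mean in the same unit."""
    return std_dev * 100 / mean

# (mean, standard deviation) in litres, as given in the table above.
fuel = {"scooter": (200, 20), "car": (1000, 100)}

for vehicle, (mean, std_dev) in fuel.items():
    print(vehicle, cv_from_summary(std_dev, mean))
```

Although the car's standard deviation is five times the scooter's, both ratios come out the same.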
Data Sets Using Different Units of Measurements
Suppose you are measuring the heights and weights of the students in a class. Both these data sets have different units. To comment on the dispersion of these data sets, we can use the coefficient of variation.
Parameter | Height(cm) | Weight(kg) |
Mean | \(150\) | \(60\) |
Standard deviation | \(30\) | \(12\) |
\(C.V. = \frac{\sigma }{{\bar x}} \times 100\)
\(\therefore {(C.V.)_{{\rm{height}}}} = \frac{{30}}{{150}} \times 100 = 20\% \)
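\(\therefore {(C.V.)_{{\rm{weight}}}} = \frac{{12}}{{60}} \times 100 = 20\% \)
So, relative to their means, the heights and the weights are equally dispersed, even though they are measured in different units.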
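Because the unit cancels in the ratio \(\frac{\sigma }{{\bar x}},\) rescaling the data should leave the coefficient of variation unchanged. The sketch below checks this for the height figures by expressing them in metres instead of centimetres; `cv_percent` is an illustrative helper, and the comparison uses `math.isclose` to allow for floating-point rounding.

```python
import math

def cv_percent(std_dev, mean):
    """C.V. in percent from a standard deviation and a mean in the same unit."""
    return std_dev * 100 / mean

cv_cm = cv_percent(30, 150)             # heights in centimetres
cv_m = cv_percent(30 / 100, 150 / 100)  # the same heights in metres

print(cv_cm, cv_m, math.isclose(cv_cm, cv_m))
```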
Below are a few solved examples that can help in getting a better idea.
Q.1. For the below data set, calculate the coefficient of variation.
\(4, 8, 12, 16, 10, 17, 18, 22, 13, 20\)
Ans:
Mean for this data set, \(\bar x = \frac{{4 + 8 + 12 + 16 + 10 + 17 + 18 + 22 + 13 + 20}}{{10}}\)
\( \Rightarrow \bar x = \frac{{140}}{{10}}\)
\(\therefore \bar x = 14\)
\({x_i}\) | \(\left( {{x_i} – \bar x} \right)\) | \({\left( {{x_i} – \bar x} \right)^2}\) |
\(4\) | \(−10\) | \(100\) |
\(8\) | \(−6\) | \(36\) |
\(12\) | \(−2\) | \(4\) |
\(16\) | \(2\) | \(4\) |
\(10\) | \(−4\) | \(16\) |
\(17\) | \(3\) | \(9\) |
\(18\) | \(4\) | \(16\) |
\(22\) | \(8\) | \(64\) |
\(13\) | \(−1\) | \(1\) |
\(20\) | \(6\) | \(36\) |
\(\sum {{{\left( {{x_i} – \bar x} \right)}^2}} = 286\) |
Standard deviation, \(\sigma = \sqrt {\frac{{\sum {{{\left( {{x_i} – \bar x} \right)}^2}} }}{n}} \)
\( \Rightarrow \sigma = \sqrt {\frac{{286}}{{10}}} \)
\(\therefore \sigma = 5.35\)
Coefficient of variation, \(C.V. = \frac{\sigma }{{\bar x}} \times 100\)
\( \Rightarrow C.V. = \frac{{5.35}}{{14}} \times 100\)
\(\therefore C.V. = 38.21\% \)
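The same steps can be followed in code. This sketch mirrors the table column by column and, like the worked answer, divides by \(n\) when computing the standard deviation; the variable names are illustrative.

```python
import math

data = [4, 8, 12, 16, 10, 17, 18, 22, 13, 20]
n = len(data)

mean = sum(data) / n                             # 140 / 10
squared_devs = [(x - mean) ** 2 for x in data]   # third column of the table
sigma = math.sqrt(sum(squared_devs) / n)         # population form: divide by n
cv = sigma / mean * 100

print(mean, sum(squared_devs), round(sigma, 2), round(cv, 2))
```

Carrying full precision gives about \(38.20\%,\) marginally below \(38.21\%,\) because the worked answer rounds \(\sigma\) to \(5.35\) before dividing.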
Q.2. The mean of a data is \(33\) and its coefficient of variation is \(10\%.\) Find the standard deviation.
Ans:
\(C.V. = \frac{\sigma }{{\overline x }} \times 100\)
\(\therefore 10 = \frac{\sigma }{{33}} \times 100\)
\(\therefore \frac{{10 \times 33}}{{100}} = \sigma \)
\(\therefore \sigma = 3.3\)
Q.3. The number of apples and pears consumed by a family in a week are given below. Find which fruit has a more consistent consumption.
Number of apples | \(5\) | \(4\) | \(5\) | \(5\) | \(6\) | \(4\) | \(6\) |
Number of pears | \(2\) | \(7\) | \(1\) | \(3\) | \(2\) | \(6\) | \(7\) |
Ans:
a. Apples
\(n=7\)
Mean, \(\bar x = \frac{{5 + 4 + 5 + 5 + 6 + 4 + 6}}{7}\)
\( \Rightarrow \bar x = \frac{{35}}{7}\)
\(\therefore \bar x = 5\) apples
\({{x_i}}\) | \(\left( {{x_i} – \bar x} \right)\) | \({\left( {{x_i} – \bar x} \right)^2}\) |
\(5\) | \(0\) | \(0\) |
\(4\) | \(−1\) | \(1\) |
\(5\) | \(0\) | \(0\) |
\(5\) | \(0\) | \(0\) |
\(6\) | \(1\) | \(1\) |
\(4\) | \(−1\) | \(1\) |
\(6\) | \(1\) | \(1\) |
\(\sum {{{\left( {{x_i} – \bar x} \right)}^2}} = 4\) |
Standard deviation, \(\sigma = \sqrt {\frac{{\sum {{{\left( {{x_i} – \bar x} \right)}^2}} }}{n}} \)
\( \Rightarrow {\sigma _{{\rm{apples }}}} = \sqrt {\frac{{4}}{7}} \)
\(∴σ=0.76\) apples
Coefficient of variation, \(C.V. = \frac{\sigma }{{\bar x}} \times 100\)
\( \Rightarrow {(C.V.)_{{\rm{apples }}}} = \frac{{0.76}}{5} \times 100\)
\(\therefore {(C.V.)_{{\rm{apples }}}} = 15.2\% \)
b. Pears
\(n=7\)
Mean, \(\bar x = \frac{{2 + 7 + 1 + 3 + 2 + 6 + 7}}{7}\)
\( \Rightarrow \bar x = \frac{{28}}{7}\)
\(\therefore \bar x = 4\) pears
\({{x_i}}\) | \({\left( {{x_i} – \bar x} \right)}\) | \({{{\left( {{x_i} – \bar x} \right)}^2}}\) |
\(2\) | \(−2\) | \(4\) |
\(7\) | \(3\) | \(9\) |
\(1\) | \(−3\) | \(9\) |
\(3\) | \(−1\) | \(1\) |
\(2\) | \(−2\) | \(4\) |
\(6\) | \(2\) | \(4\) |
\(7\) | \(3\) | \(9\) |
\(\sum {{{\left( {{x_i} – \bar x} \right)}^2}} = 40\) |
Standard deviation, \(\sigma = \sqrt {\frac{{\sum {{{\left( {{x_i} – \bar x} \right)}^2}} }}{n}} \)
\( \Rightarrow {\sigma _{{\rm{pears}}}} = \sqrt {\frac{{40}}{7}} \)
\(∴σ=2.39\) pears
Coefficient of variation, \(C.V. = \frac{\sigma }{{\bar x}} \times 100\)
\( \Rightarrow {(C.V.)_{{\rm{pears}}}} = \frac{{2.39}}{4} \times 100\)
\(\therefore {(C.V.)_{{\rm{pears}}}} = 59.75\% \)
Now, \({(C.V.)_{{\rm{apples}}}} < {(C.V.)_{{\rm{pears}}}}\)
Hence, apples have a more consistent consumption than pears.
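A comparison like this one can be scripted so that the more consistent data set is picked automatically. The sketch below assumes the population standard deviation, as in the worked solution; `cv_percent`, `consumption`, and `most_consistent` are illustrative names.

```python
import statistics

def cv_percent(values):
    """Population C.V. in percent."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

# Weekly counts from the question.
consumption = {
    "apples": [5, 4, 5, 5, 6, 4, 6],
    "pears": [2, 7, 1, 3, 2, 6, 7],
}

results = {fruit: cv_percent(counts) for fruit, counts in consumption.items()}
most_consistent = min(results, key=results.get)  # smallest C.V. wins

for fruit, cv in results.items():
    print(f"{fruit}: {cv:.2f}%")
print("More consistent:", most_consistent)
```

Choosing the minimum reflects the rule used in the answer: the data set with the smaller coefficient of variation is the more consistent one.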
Q.4. The nearer a place is to the coast, the smaller the temperature variation. The temperature data of two cities recorded for \(5\) months is given below. Statistically, which city is nearer to the coast?
Temperature of city \(1\)(in \(^ \circ {\rm{C}}\)) | \(22\) | \(23\) | \(13\) | \(24\) | \(18\) |
Temperature of city \(2\)(in \(^ \circ {\rm{C}}\)) | \(15\) | \(18\) | \(11\) | \(17\) | \(14\) |
Ans:
a. City \(1\)
\(n=5\)
Mean, \( \bar x = \frac{{22 + 23 + 13 + 24 + 18}}{5}\)
\( \Rightarrow \bar x = \frac{{100}}{5}\)
\(\therefore \bar x = {20^ \circ }{\rm{C}}\)
\({{x_i}}\) | \({\left( {{x_i} – \bar x} \right)}\) | \({{{\left( {{x_i} – \bar x} \right)}^2}}\) |
\(22\) | \(2\) | \(4\) |
\(23\) | \(3\) | \(9\) |
\(13\) | \(−7\) | \(49\) |
\(24\) | \(4\) | \(16\) |
\(18\) | \(−2\) | \(4\) |
\(\sum {{{\left( {{x_i} – \bar x} \right)}^2}} = 82\) |
Standard deviation, \(\sigma = \sqrt {\frac{{\sum {{{\left( {{x_i} – \bar x} \right)}^2}} }}{n}} \)
\( \Rightarrow {\sigma _1} = \sqrt {\frac{{82}}{5}} \)
\(\therefore \sigma = {4.05^\circ }{\rm{C}}\)
Coefficient of variation, \(C.V. = \frac{\sigma }{{\bar x}} \times 100\)
\( \Rightarrow {(C.V.)_1} = \frac{{4.05}}{{20}} \times 100\)
\(\therefore {(C.V.)_1} = 20.25\% \)
b. City \(2\)
\(n=5\)
Mean, \(\bar x = \frac{{15 + 18 + 11 + 17 + 14}}{5}\)
\( \Rightarrow \bar x = \frac{{75}}{5}\)
\(\therefore \bar x = {15^\circ }{\rm{C}}\)
\({{x_i}}\) | \({\left( {{x_i} – \bar x} \right)}\) | \({{{\left( {{x_i} – \bar x} \right)}^2}}\) |
\(15\) | \(0\) | \(0\) |
\(18\) | \(3\) | \(9\) |
\(11\) | \(−4\) | \(16\) |
\(17\) | \(2\) | \(4\) |
\(14\) | \(−1\) | \(1\) |
\(\sum {{{\left( {{x_i} – \bar x} \right)}^2}} = 30\) |
Standard deviation, \(\sigma = \sqrt {\frac{{\sum {{{\left( {{x_i} – \bar x} \right)}^2}} }}{n}} \)
\( \Rightarrow {\sigma _2} = \sqrt {\frac{{30}}{5}} \)
\(\therefore \sigma = {2.45^\circ }{\rm{C}}\)
Coefficient of variation, \(C.V. = \frac{\sigma }{{\bar x}} \times 100\)
\( \Rightarrow {(C.V.)_2} = \frac{{2.45}}{{15}} \times 100\)
\(\therefore {(C.V.)_2} = 16.33\% \)
Now, \({(C.V.)_1} > {(C.V.)_2},\) i.e. the temperature of city \(2\) is more consistent than city \(1.\)
Hence, statistically, city \(2\) is closer to the coast than city \(1.\)
Q.5. A school has two houses in standard \({\rm{1}}{{\rm{0}}^{{\rm{th}}}}\) with \(42\) and \(60\) students, respectively. Average points scored per month by these houses are \(750\) and \(400.\) Their respective standard deviations are \(8\) and \(10.\) (i) Which house has a higher total score? (ii) Which house is a better performer in terms of consistency?
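Ans:
(i) House \(1\) has the higher mean score, \(750\) points against \(400.\) If these means are taken as points per student, the totals are \(42 \times 750 = 31500\) and \(60 \times 400 = 24000,\) so house \(1\) also has the higher total score.
(ii) To compare consistency, we find the coefficient of variation of each house.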
House | \(1\) | \(2\) |
Mean | \(750\) | \(400\) |
Standard Deviation | \(8\) | \(10\) |
For house \(1,{{\bar x}_1} = 750\) points, \({\sigma _1} = 8\) points
Coefficient of variation, \({(C.V.)_1} = \frac{{{\sigma _1}}}{{{{\bar x}_1}}} \times 100\)
\( \Rightarrow {(C.V.)_1} = \frac{8}{{750}} \times 100\)
\(\therefore {(C.V.)_1} = 1.07\% \)
For house \(2,{{\bar x}_2} = 400\) points, \({\sigma _2} = 10\) points
Coefficient of variation, \({(C.V.)_2} = \frac{{{\sigma _2}}}{{{{\bar x}_2}}} \times 100\)
\( \Rightarrow {(C.V.)_2} = \frac{{10}}{{400}} \times 100\)
\(\therefore {(C.V.)_2} = 2.5\% \)
Now, \({(C.V.)_1} < {(C.V.)_2},\) i.e. the score of house \(1\) is more consistent than house \(2.\)
Hence, house \(1\) is a better performer in terms of consistency.
Analysis of frequency distribution involves using data to make informed decisions. To make these decisions, we use representative numbers called measures of central tendency. To highlight the variability in a data set, we use measures of dispersion. To compare variability in two data sets with different units of measurements, we need a unitless measure of dispersion called the coefficient of variation. The coefficient of variation is a unitless number, and it is a relative measure of dispersion as it shows the variability of data in relation to its mean. Coefficient of variation can also be used to compare two data sets when there is a significant difference between their means or when they have the same mean.
Students may have many questions about the Analysis of Frequency Distributions. Here are a few commonly asked questions and answers.
Q.1. How do you analyse frequency distribution data?
Ans: We can analyse frequency distribution data by finding the measures of central tendency of the data set. The measures of central tendency are the mean, the median, and the mode. Also, we can find the variability of the data by calculating measures of dispersion.
Q.2. What are the \(3\) types of frequency distributions?
Ans: The \(3\) types of frequency distributions are
1. Ungrouped frequency distribution table
2. Grouped frequency distribution table
3. Cumulative frequency distribution table
Q.3. How do you describe a frequency distribution table?
Ans: A frequency distribution table is a conveniently sorted form of data. It can be used to arrive at representative numbers, which in turn help in making decisions about that data.
Q.4. How do you describe frequency analysis?
Ans: Frequency analysis means working with the data to find numbers that represent the data set and its variability, and then using those numbers to draw conclusions.
Q.5. What is the formula of frequency distribution?
Ans: A frequency distribution does not have a single formula. It is a representation of data in a table or on a graph. In a table, we write the frequency of each value or class against it. In a graph, the frequencies can be plotted using points, bars, or lines.