Understanding the shape of your data is an important step in data analysis. It shows where most of the values are concentrated and helps reveal extreme values (outliers). Skewness is one of the most important measures used to describe the shape of a data distribution. Before turning to skewness, we first need to clarify two concepts: the frequency distribution and the normal (Gaussian) distribution.
Frequency and Normal Distribution
A frequency distribution is simply the number of times each value (or range of values) occurs in a dataset. Data are said to be "normally distributed", or to have a "normal distribution", if their frequency distribution has the shape of a normal curve, a special type of symmetrical bell-shaped curve (Figure 1) in which values near the mean occur more frequently than values far from the mean.
Figure 1: Note that the mean and median are equal in symmetrical bell-shaped data. Source: Basic Statistics for the Behavioral Sciences, sixth edition. Gary W. Heiman.
In practice, real data rarely follow the exact shape of a normal curve. If the distribution is shaped almost like a normal curve, we usually say that the data are approximately normally distributed (Figure 2).
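As a minimal sketch of what a frequency distribution looks like in practice, the snippet below counts how often each value occurs and prints a crude text histogram. The `scores` list is hypothetical data invented for illustration; it does not come from this article.

```python
from collections import Counter

# Hypothetical sample of exam scores, invented purely for illustration.
scores = [70, 75, 75, 80, 80, 80, 85, 85, 90]

# Count how many times each distinct value occurs.
frequency = Counter(scores)

# Print a simple text histogram, one row per value in ascending order.
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value:>3} | {'*' * count} ({count})")
```

In this made-up sample the counts rise toward the middle value and fall off evenly on both sides, which is the symmetrical pattern described above.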
Figure 2: Normal distribution plot of a data sample.
Skewness measures the asymmetry (distortion) of a distribution. It tells you whether the data distribution is symmetrical on both sides (Figure 1) or deviates from the symmetrical normal distribution, as shown in Figure 3. A skewed distribution can be skewed either to the right (positively skewed) or to the left (negatively skewed).
Figure 3: (a) Positively skewed distribution. (b) Negatively skewed distribution. Source: Basic Statistics for the Behavioral Sciences, 6th ed. Gary W. Heiman.
Notice that the tail of the positively skewed distribution extends to the right; this means that the high values have low frequencies and the mean is typically higher than the median, because the few large values pull the mean toward the tail (Figure 3.a). The tail of the negatively skewed distribution, on the other hand, extends to the left; this means that the low values have low frequencies and the mean is typically lower than the median (Figure 3.b).
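A quick way to see how the mean and median relate under each kind of skew is to compute both on a small sample. The sketch below uses hypothetical values invented for illustration: one sample with a long right tail, and its mirror image with a long left tail.

```python
from statistics import mean, median

# Hypothetical values, invented for illustration only.
right_skewed = [2, 3, 3, 4, 4, 5, 20]  # a single large value forms the right tail

# Mirror the sample around 22 so the tail points toward low values instead.
left_skewed = [22 - x for x in right_skewed]

for name, sample in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean = {mean(sample):.2f}, median = {median(sample)}")
```

With these values the mean lands above the median for the right-skewed sample and below it for the left-skewed one. In both cases the mean is pulled toward the tail.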
To know whether your data are positively skewed, negatively skewed, or approximately symmetrical, you need to calculate the skewness value of the data. The skewness value can be positive, zero, or negative. Many tools, such as SAS, SPSS, Python, R, and Microsoft Excel, can calculate skewness and other descriptive statistics for you.
As a rule of thumb, a skewness value greater than 1 or less than -1 indicates a highly skewed distribution. A value between 0.5 and 1, or between -1 and -0.5, indicates a moderately skewed distribution. A value between -0.5 and 0.5 indicates that the distribution is approximately symmetrical.
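For readers who want to see what such a tool computes, here is a minimal sketch of one common estimator: the adjusted Fisher-Pearson sample skewness, which is the formula documented for Excel's SKEW function. The function name `sample_skewness` and the example values are assumptions made for illustration; other tools may default to a slightly different estimator.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness.

    Computes n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3),
    where s is the sample standard deviation.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - m) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical values, invented for illustration only.
print(sample_skewness([1, 2, 3, 4, 5]))         # symmetric sample: close to zero
print(sample_skewness([2, 3, 3, 4, 4, 5, 20]))  # long right tail: positive
```

Note that `scipy.stats.skew` uses the uncorrected (biased) estimator unless `bias=False` is passed, so its result can differ slightly from this one on small samples.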
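These thresholds translate directly into a small helper. The sketch below is only illustrative: the function name `describe_skewness` is hypothetical, and since the rule of thumb does not say which category a value of exactly 0.5 or 1 belongs to, the code assumes such values fall in the milder category.

```python
def describe_skewness(skew):
    """Label a skewness value using the rule-of-thumb thresholds 0.5 and 1."""
    magnitude = abs(skew)
    # Assumption: boundary values (exactly 0.5 or 1) fall in the milder category.
    if magnitude <= 0.5:
        return "approximately symmetrical"
    degree = "moderately" if magnitude <= 1 else "highly"
    direction = "positively" if skew > 0 else "negatively"
    return f"{degree} {direction} skewed"


print(describe_skewness(0.2))   # approximately symmetrical
print(describe_skewness(0.8))   # moderately positively skewed
print(describe_skewness(-1.4))  # highly negatively skewed
```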
Suppose we have a dataset containing the heights of 12 adults (males and females). The data are shown in Table 1. We need to know whether our sample data are normally distributed or skewed.
Table 1: Heights of adult males and females.
Next, we explore the data using basic descriptive statistics and skewness, and draw a conclusion from this descriptive summary. Here, we used Microsoft Excel to do so (Table 2).
Table 2: Basic descriptive statistics for adults’ height.
The results in Table 2 show that the mean and median do not differ markedly for either the male or the female data, and that the skewness of both the male and the female height data lies between -0.5 and 0.5, which means that both distributions are approximately symmetric (not skewed).
It is worth mentioning that the skewness value not only tells us about the shape of the data but also helps us decide which descriptive statistic (mean or median) best represents the center of the data. Generally speaking, if your data distribution is asymmetric (skewed) or contains outliers, the median is a more representative measure of central tendency than the mean. In our example, either the mean or the median can be used to measure the central tendency of the datasets.
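To illustrate why the median is preferred when outliers are present, the sketch below adds a single extreme value to a small sample and compares how far each statistic moves. The heights are hypothetical values in centimetres, invented for illustration; they are not the values from Table 1.

```python
from statistics import mean, median

# Hypothetical heights in centimetres, invented for illustration only.
heights = [160, 165, 168, 170, 172, 175]

# Add one implausible value, such as a data-entry error.
with_outlier = heights + [250]

mean_shift = mean(with_outlier) - mean(heights)
median_shift = median(with_outlier) - median(heights)

print(f"mean moved by {mean_shift:.2f} cm")
print(f"median moved by {median_shift:.2f} cm")
```

In this made-up sample the single outlier moves the mean by more than 10 cm, while the median moves by only 1 cm.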
Peck, R., Olsen, C., & Devore, J. L. (2015). Introduction to Statistics and Data Analysis. Cengage Learning.
Heiman, G. W. (2011). Basic Statistics for the Behavioral Sciences (6th ed.). Cengage Learning.
Samuels, M. L., Witmer, J. A., & Schaffner, A. (2012). Statistics for the Life Sciences (4th ed.). Pearson Education, Inc.
Weiss, N. A., & Weiss, C. A. (2012). Introductory Statistics (9th ed.). Pearson Education, Inc.