3 Characteristics of Distributions

This page goes over the four basic characteristics of distributions that you should be able to describe, calculate, and interpret.

Shape, Outliers, Center, and Spread (SOCS).

Important

Make sure that for anything that you use your own judgement for (that is not calculated or that uses any assumed estimates), that you properly describe each characteristic with “unsure” vocabulary like words like “seems to be” or words that end with “-ly”, Like “roughly”

3.1 Shape

You can describe shape by its skewness and modality.

Make sure to mention both in a free response question because you never know what exactly a rubric will be looking for!

3.1.1 Skewness

Skew

A distribution is right skewed if there is a distinct trailing tail on the larger valued part of the distribution (typically the right side).

A distribution is left skewed if there is a distinct trailing tail on the smaller valued part of the distribution (typically the left side).

If there is no skewness and you instead see symmetry, then you can say that the distribution is synmmetrical.

3.1.2 Modality

The modality of a distribution describes how many peaks there are to the data.

Modality

If there is one distinct “peak” in the distribution, then the distribution is unimodal.

If there is two distinct “peak”s in the distribution, then the distribution is bimodal

If there’s any more, then you can saw the distribution is multimodal.

3.2 Outliers

Outliers are data points that don’t exactly fit in with the rest of the data trend.

There are two main approaches, and you should always try the second approach when feasible.

Use your own judgement as to whether something is an outlier or not.
Use the 1.5 IQR Rule to quantitatively identify any outliers.

1.5 IQR Rule

Identify any outliers with this methodical approach by calculating two cut-offs:

\(Q_1 - 1.5 \cdot IQR\) If any value is smaller than this cutoff, then they are a low outlier
\(Q_3 + 1.5 \cdot IQR\) If any value is larger than this cutoff, then they are a high outlier

View the notes below for a refresher on what \(Q_1\), \(Q_3\), and \(IQR\) are.

3.3 Measures of Center

The center of a distribution of data can be described with either the mean or the median. When choosing which one to use to describe a set of data, you have to consider whether there is skewness or outliers in your distribution then choose appropriately between the mean and the median.

3.3.1 Mean / Average

Mean / Average

The average of a set of quantitative data, also known as the mean, can be found by adding up all the values in the set and dividing by the total amount of data points.

When we do this for a population, we are able to calulate the population mean, \(\mu_X\).

\[\mu_X x = \frac{1}{n}\sum x_i = \frac{\sum x_i}{n}\]

When we do this for a sample, we are able to calulate the sample mean, \(\bar x\).

\[\bar x = \frac{1}{n}\sum x_i = \frac{\sum x_i}{n}\]

The mean is not a robust measurement. In other words, it is non-resistant to outliers of skewness in data. This makes it so that the mean is not a reliable measure of center when there is skew or outliers.

3.3.2 Median

Median

The median of a set of quantatitative data can be found by finding the “middle value” in the set of data.

If trying to find the median, make sure to : 1. sort the data from lowest to highest and then 2. point out the middle value and a. if there is an odd number of values, you can point out a value from the data set itself. b. if there is an even number of values, you have to find the average of the middle two values from the data set and use that average as the median.

Tip

If you know how long the data set is, you can count up the to the

If it’s odd number of values: the \(\frac{n}{2} + 0.5\) th value
If it’s an even number of values: the \(\frac{n}{2}\) th and the \(\frac{n}{2} + 1\) th values

The mean is not a robust measurement. In other words, it is resistant to outliers of skewness in data. This makes it so that the median is reliable measure of center when there is skew or outliers. Think of how housing prices work–everytime housing prices are talked about, the median housing price is mentioned. Why? Because using the mean would misrepresent what the typical housing price is. The more expensive houses in the area will pull the mean into the the further region of housing prices.

3.4 Measures of Spread / Variation

The spread (or variation, used interchangably) of a ditribution of data describes how spread out the set of data is.

You essentially have 2 choices to use: 1. The range and standard deviation or 2. The IQR

The choice again revolves around the characteristics of the distribution–whether there is skewness or outliers.

3.4.1 Range

Range

Range is a single number that describes the distance from the maximum value to the minimum value. Calculate the range by subtracting the maximum value by the minimum value.

\[\text{range} = \text{max} - \text{min}\]

Make sure you know how to use range in a sentence properly

Range is a statistical measure. If a maximum of a data set is 10 and the minimum of a data set of 3. You can say “The range is 7.”

You cannot say “The range is 3 to 10”. This means that you do not understand that range is a single number calculated with the formula above.

However, you can opt into a more English orientated sentence by saying “The data ranges from 3 to 10.””

The range is a non-resistant measure. It does not particluarly describe data properly when there is skewness or outliers. This is because the minimum and maximum both change erratically in the presence of skewness or outliers.

3.4.2 Interquartile Range (IQR)

Interquartile Range

The IQR is a single number that describes the distance from the 3rd quartile and the 1st quartile. Calculate the IQR by subtracting the 3rd quartile and the 1st quartile.

\[\text{IQR} = Q_3 - Q_1\]

What is the 1st quartile (\(Q_1\)) and the 3rd quartile (\(Q_3\))?

If the median describes the value in the middle of the data set, then the 1st quartile describes the value at 1/4 of the data set and the 3rd quartile describes the value at the 3/4 of the data set.

Find the quartiles manually by:

Following the steps for finding the median.
After identifying the median, separate the data set into two halves.
1. If you had a odd number of values, then exclude the one number that you pointed out the be the median in the halves.
2. If you had an even number of values, then don’t exclude any numbers.
Follow the steps to find the median in both halves! You’ll have identified the 1st quartile in by looking at the half with smaller numbers and the 3rd quartile by looking at the half with larger numbers.

The IQR is a resistant measure. It describes data properly when there is skewness or outliers. This is because the 1st quartile and 3rd quartile are similar to the median in resistance.

3.4.3 Standard Deviation (SD)

Standard Deviation (SD)

The standard deviation of a set of data is a number that roughly describes the average distance of every data point to the mean of the data set.

When we do this for a population, we are able to calulate the population standard deviation, \(\sigma_X\).

\[\sigma_X = \sqrt{\frac{1}{n-1}\sum (x_i - \bar x)^2} = \sqrt{\frac{\sum(x_i -\bar x)^2}{n}}\]

When we do this for a sample, we are able to calulate the sample standard deviaiton, \(s_x\).

\[s_x = \sqrt{\frac{1}{n-1}\sum (x_i - \bar x)^2} = \sqrt{\frac{\sum(x_i -\bar x)^2}{n - 1}}\]

Notice that there is a slight difference in calculation between the two formulas. It’s not possible to explain this difference between the formulas without a better understanding of calculus and probability. But just know that you’ll never use the formula for population standard deviation, and that if you are asked to calculate sample standard deviation, that the second formula above is the only one that you’ll ever use.

Interpretation of Standard Deviation

The [variable in the context of the problem] typically varies by [value of SD] from the mean of [value of the mean]

The range is a non-resistant measure. It does not particluarly describe data properly when there is skewness or outliers. This is because the standard deviation essentially follows the formula for mean. And as we know, the mean is a non-resistant measure because it changes erratically in the presence of skewness or outliers.