STATISTICS

BIG DATA & ANALYTICS

Statistics is the field of collecting data and analyzing it in order to draw conclusions from it. For example:

I have big feet, size 46 EU (11 UK / 12 US), so finding shoes is never an easy task, especially in Asia, where the average shoe size is smaller than in the West. Most stores don't stock up on the bigger sizes because they have collected data from past purchases. They have seen which sizes sell the most and which sizes usually have to be cleared out at the end of the season, and they have optimized their stock based on that. Very few people have the same shoe size as me, so it doesn't make sense to stock up on that size and risk not selling it. Instead, they stock up on the more common sizes.

This concept can be applied to pretty much any industry. Grocery stores stock up on the items most popular with their customers, and restaurants buy the ingredients they need for the most popular dishes on their menu. Hotels can see which days and rooms are the most popular and adjust pricing based on that. City planners could even study traffic times and popular roads and optimize traffic lights to allow a smooth flow of traffic.

There are 2 main types of statistics: descriptive and inferential. Let's take a look at the key differences:

Descriptive statistics simply describe and summarize data. For example, asking 4 friends what their favorite fruit is.

2 of them said banana, 1 said apple, and the other 1 said strawberry.

Based on that, we can say 50% of the friends like bananas, 25% like apples, and 25% like strawberries.
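The tally above can be reproduced in a few lines. This is a minimal sketch using Python's `collections.Counter`; the list `favorite_fruits` is simply the four answers from the example written out by hand.

```python
from collections import Counter

# The four answers from the example above.
favorite_fruits = ["banana", "banana", "apple", "strawberry"]

# Count how many friends named each fruit.
counts = Counter(favorite_fruits)

# Turn each count into a percentage of all answers.
percentages = {fruit: 100 * n / len(favorite_fruits) for fruit, n in counts.items()}
print(percentages)
```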

Inferential statistics use the data you have to make an inference, or conclusion, about a larger population.

By looking at purchase data from a sample of a grocery store's customers, we could estimate what percentage of all customers buy fruit from the store.
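To make the idea of inferring from a sample concrete, here is a rough sketch. Everything in it is made up for illustration: the population of 10,000 customers, the 30% share of fruit buyers, and the sample size of 500 are assumptions, not real store data.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 customers, 30% of whom buy fruit (made-up figure).
population = [True] * 3000 + [False] * 7000

# In practice we only get to look at a random sample of customers.
sample = random.sample(population, 500)

# The share of fruit buyers in the sample is our estimate for the whole population.
sample_share = sum(sample) / len(sample)
population_share = sum(population) / len(population)
print(f"sample: {sample_share:.1%}, population: {population_share:.1%}")
```

The sample share will usually land close to the population share, but not exactly on it. That gap is why a conclusion drawn from a sample is an estimate and not a certainty.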

Data can be broken up into either numeric (quantitative) or categorical (qualitative).

Numeric data can be broken up further into continuous (measured) data or discrete (counted) data.

Examples of continuous data would be the speed of a car, or the time it takes to finish a marathon. You can see how this needs to be measured.

Examples of discrete data would be the number of rooms in a house, or the number of children a family has. You can see how this needs to be counted.

Categorical data can be broken up further into nominal (unordered) or ordinal (ordered) data.

Examples of nominal data would be hair color, gender, and nationality.

Examples of ordinal data would be a scale of how much you agree with a specific statement (Strongly Agree, Somewhat Agree, Don't Agree, Don't Agree At All).

Sometimes categorical data can be written as numbers:

Hair color: Brown (0), Black (1)

Agreement: Strongly Agree (1), Somewhat Agree (2), Don't Agree (3), Don't Agree At All (4)

But this does not make it numeric data. The numbers are only labels, so arithmetic such as averaging them is not meaningful.
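A small sketch of why encoded categories are still not numeric. The hair color codes follow the example above; the four answers in `hair_colors` are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hair colors encoded as in the example above: Brown = 0, Black = 1.
hair_color_codes = {"Brown": 0, "Black": 1}
hair_colors = ["Brown", "Black", "Black", "Brown"]  # hypothetical answers

encoded = [hair_color_codes[color] for color in hair_colors]

# The arithmetic runs without error, but 0.5 is neither Brown nor Black.
average_code = mean(encoded)

# Counting each category is the meaningful summary for nominal data.
counts = Counter(hair_colors)
print(average_code, counts)
```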

The three measures of center we will discuss are mean, median, and mode.

The mean is often referred to as the average of the data. To calculate the mean we would add up all the data points and divide the total by the number of data points.

```
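import numpy as np

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# by hand: add up all the values and divide by how many there are
total = sum(number_of_rainy_days_per_month)  # 2 + 5 + 6 + 7 + 3 + 5 + 5 = 33
mean = total / len(number_of_rainy_days_per_month)  # 33 / 7 ≈ 4.71

# with NumPy
np.mean(number_of_rainy_days_per_month)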
```

The easiest way to think of the median is as the value in the middle. So you would sort the values from lowest to highest and find the middle value.

```
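import numpy as np

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# by hand: sort the values from lowest to highest
number_of_rainy_days_per_month_low_to_high = sorted(number_of_rainy_days_per_month)  # [2, 3, 5, 5, 5, 6, 7]
# with an odd number of values, the median is the one in the middle
median = number_of_rainy_days_per_month_low_to_high[len(number_of_rainy_days_per_month_low_to_high) // 2]  # 5

# with NumPy
np.median(number_of_rainy_days_per_month)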
```
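When the number of values is even there is no single middle value, and the usual convention is to average the two middle values. The helper below is a hand-rolled sketch of that rule; `median_by_hand` is a name made up for this example, and it assumes a non-empty list.

```python
def median_by_hand(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # odd count: a single middle value exists
        return ordered[middle]
    # even count: average the two values around the middle
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_hand([2, 5, 6, 7, 3, 5, 5]))  # odd count: 5
print(median_by_hand([2, 5, 6, 7]))           # even count: (5 + 6) / 2 = 5.5
```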

The mode is the most frequent value in the data. To calculate this you would group all the same values and see which group has the most values.

```
from scipy import stats

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# by hand: count how often each value occurs
# 2 (1)
# 3 (1)
# 5 (3)
# 6 (1)
# 7 (1)
# 5 occurs most often, so mode = 5

# with SciPy
stats.mode(number_of_rainy_days_per_month)
```
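The grouping step can also be sketched with the standard library alone. `Counter` does the tallying, and `statistics.multimode` (Python 3.8+) covers the case where several values tie for most frequent. The `tied` list is hypothetical data added to show that case.

```python
from collections import Counter
from statistics import multimode

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# Group equal values and count the size of each group.
counts = Counter(number_of_rainy_days_per_month)

# The largest group gives the mode and how often it occurs.
value, frequency = counts.most_common(1)[0]
print(value, frequency)  # 5 occurs 3 times

# With a tie there is more than one mode; multimode returns all of them.
tied = [2, 3, 3, 5, 5]  # hypothetical data
print(multimode(tied))
```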

The spread is how spread apart or close together the data points are. The measures of spread we will discuss are variance, standard deviation, mean absolute deviation, and interquartile range.

The variance is the average squared distance from each data point to the data's mean. The higher the variance, the more spread out the data is.

```
import numpy as np

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# 1. calculate the mean
mean = sum(number_of_rainy_days_per_month) / len(number_of_rainy_days_per_month)  # 33 / 7 ≈ 4.71

# 2. subtract the mean from each data point
distances = [x - mean for x in number_of_rainy_days_per_month]
# ≈ [-2.71, 0.29, 1.29, 2.29, -1.71, 0.29, 0.29]

# 3. square each distance
squared_distances = [d ** 2 for d in distances]
# ≈ [7.37, 0.08, 1.65, 5.22, 2.94, 0.08, 0.08]

# 4. add the squared distances together
sum_of_squares = sum(squared_distances)  # ≈ 17.43

# 5. divide the sum by the number of data points - 1
variance = sum_of_squares / (len(number_of_rainy_days_per_month) - 1)  # 17.43 / 6 ≈ 2.90

# with NumPy
# (without ddof=1 the population variance is calculated instead of the sample variance)
np.var(number_of_rainy_days_per_month, ddof=1)
```
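The difference the `ddof=1` remark points at can be seen side by side. This sketch uses the standard library's `statistics` module as a cross-check: `variance` divides by the number of data points minus 1 (sample variance), while `pvariance` divides by the number of data points (population variance).

```python
from statistics import pvariance, variance

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# Sample variance: sum of squared distances divided by n - 1.
sample_variance = variance(number_of_rainy_days_per_month)

# Population variance: the same sum divided by n, so it is always a bit smaller.
population_variance = pvariance(number_of_rainy_days_per_month)

print(round(sample_variance, 4), round(population_variance, 4))
```

The sample version is the usual choice when the data is only a sample of a larger population, which is the situation the `ddof=1` argument is meant for.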

The standard deviation is the square root of the variance.

```
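import numpy as np

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# by hand: take the square root of the sample variance
# variance ≈ 2.9048
# standard deviation = √2.9048 ≈ 1.7043

# with NumPy: either take the square root of the variance ...
np.sqrt(np.var(number_of_rainy_days_per_month, ddof=1))
# ... or calculate the standard deviation directly
np.std(number_of_rainy_days_per_month, ddof=1)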
```

The mean absolute deviation is an alternative to the standard deviation. Neither is considered better, but the mean absolute deviation is used less commonly. The main difference is that the standard deviation squares distances, so longer distances are penalized more than shorter ones, while the mean absolute deviation penalizes each distance equally.
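The mean absolute deviation can be sketched directly from that description: take each distance to the mean, drop the sign, and average the results. The variable names below are chosen for this example.

```python
import numpy as np

number_of_rainy_days_per_month = np.array([2, 5, 6, 7, 3, 5, 5])

# Distance of each data point from the mean, ignoring the sign.
absolute_distances = np.abs(number_of_rainy_days_per_month - number_of_rainy_days_per_month.mean())

# Mean absolute deviation: the plain average of those distances.
mean_absolute_deviation = absolute_distances.mean()
print(mean_absolute_deviation)
```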

The interquartile range (IQR) is the distance between the 25th and 75th percentiles. This is the height of the box in a boxplot.

```
import numpy as np
from scipy.stats import iqr

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# quantiles (the 0.5 quantile is the median)
np.quantile(number_of_rainy_days_per_month, 0.5)

# quartiles (0, 0.25, 0.5, 0.75, 1)
np.quantile(number_of_rainy_days_per_month, [0, 0.25, 0.5, 0.75, 1])

# interquartile range: either subtract the 25th percentile from the 75th ...
np.quantile(number_of_rainy_days_per_month, 0.75) - np.quantile(number_of_rainy_days_per_month, 0.25)
# ... or use SciPy
iqr(number_of_rainy_days_per_month)
```

Outliers are data points that are very different from the majority of the data points. A common rule is to treat a data point as an outlier if:

```
# a data point is treated as an outlier if
#   data_point < Q1 - 1.5 * IQR
# or
#   data_point > Q3 + 1.5 * IQR
import numpy as np
from scipy.stats import iqr

number_of_rainy_days_per_month = [2, 5, 6, 7, 3, 5, 5]

# calculate the IQR (stored under a new name so the iqr function is not overwritten)
iqr_value = iqr(number_of_rainy_days_per_month)

# calculate the lower and upper thresholds
lower_threshold = np.quantile(number_of_rainy_days_per_month, 0.25) - 1.5 * iqr_value
upper_threshold = np.quantile(number_of_rainy_days_per_month, 0.75) + 1.5 * iqr_value

# then subset the data to remove data points below the lower threshold and above the upper threshold
```
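The final step, subsetting the data, could look like the sketch below. The rainy-day data has no values outside the thresholds, so a made-up extreme value of 20 is appended purely for illustration. The names `is_outlier`, `cleaned`, and `outliers` are chosen for this example.

```python
import numpy as np

# Rainy-day data plus one hypothetical extreme month (20) to have something to remove.
data = np.array([2, 5, 6, 7, 3, 5, 5, 20])

# Quartiles and interquartile range.
q1, q3 = np.quantile(data, [0.25, 0.75])
iqr_value = q3 - q1

# Thresholds from the 1.5 * IQR rule.
lower_threshold = q1 - 1.5 * iqr_value
upper_threshold = q3 + 1.5 * iqr_value

# Boolean mask marking every data point outside the thresholds.
is_outlier = (data < lower_threshold) | (data > upper_threshold)

cleaned = data[~is_outlier]   # data with the outliers removed
outliers = data[is_outlier]   # the removed data points
print(cleaned, outliers)
```

Keeping the mask in its own variable makes it easy to inspect which points were flagged before deciding whether dropping them is appropriate.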

© 2024 Potado. All rights reserved.