# `statistics` --- Mathematical statistics functions¶

Some datasets use `NaN` (not a number) values to represent missing data. Since NaNs have unusual comparison semantics, they cause surprising or undefined behaviors in the statistics functions that sort data or that count occurrences. The functions affected are `median()`, `median_low()`, `median_high()`, `median_grouped()`, `mode()`, `multimode()`, and `quantiles()`. The `NaN` values should be stripped before calling these functions:

```
>>> from statistics import median
>>> from math import isnan
>>> from itertools import filterfalse

>>> data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]
>>> sorted(data)  # This has surprising behavior
[20.7, nan, 14.4, 18.3, 19.2, nan]
>>> median(data)  # This result is unexpected
16.35

>>> sum(map(isnan, data))    # Number of missing values
2
>>> clean = list(filterfalse(isnan, data))  # Strip NaN values
>>> clean
[20.7, 19.2, 18.3, 14.4]
>>> sorted(clean)  # Sorting now works as expected
[14.4, 18.3, 19.2, 20.7]
>>> median(clean)       # This result is now well defined
18.75
```

## Averages and measures of central location¶

• `mean()`: Arithmetic mean ("average") of data.
• `fmean()`: Fast, floating-point arithmetic mean, with optional weighting.
• `geometric_mean()`: Geometric mean of data.
• `harmonic_mean()`: Harmonic mean of data.
• `median()`: Median (middle value) of data.
• `median_low()`: Low median of data.
• `median_high()`: High median of data.
• `median_grouped()`: Median (50th percentile) of grouped data.
• `mode()`: Single mode (most common value) of discrete or nominal data.
• `multimode()`: List of modes (most common values) of discrete or nominal data.
• `quantiles()`: Divide data into intervals with equal probability.

## Measures of spread¶

• `pstdev()`: Population standard deviation of data.
• `pvariance()`: Population variance of data.
• `stdev()`: Sample standard deviation of data.
• `variance()`: Sample variance of data.

## Statistics for relations between two inputs¶

• `covariance()`: Sample covariance for two variables.
• `correlation()`: Pearson's correlation coefficient for two variables.
• `linear_regression()`: Slope and intercept for simple linear regression.

## Function details¶

statistics.mean(data)

Return the sample arithmetic mean of data. If data is empty, a `StatisticsError` will be raised:

```
>>> mean([1, 2, 3, 4, 4])
2.8
>>> mean([-1.0, 2.5, 3.25, 5.75])
2.625

>>> from fractions import Fraction as F
>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
Fraction(13, 21)

>>> from decimal import Decimal as D
>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
Decimal('0.5625')
```

statistics.fmean(data, weights=None)

Convert data to floats and compute the arithmetic mean:

```
>>> fmean([3.5, 4.0, 5.25])
4.25
```

Optional weighting is supported. For example, a professor assigns a grade for a course by weighting quizzes at 20%, homework at 20%, a midterm exam at 30%, and a final exam at 30%:

```
>>> grades = [85, 92, 83, 91]
>>> weights = [0.20, 0.20, 0.30, 0.30]
>>> fmean(grades, weights)
87.6
```

If weights is supplied, it must be the same length as the data or a `ValueError` will be raised.

statistics.geometric_mean(data)

Convert data to floats and compute the geometric mean:

```
>>> round(geometric_mean([54, 24, 36]), 1)
36.0
```

statistics.harmonic_mean(data, weights=None)

```
>>> harmonic_mean([40, 60])
48.0
```

```
>>> harmonic_mean([40, 60], weights=[5, 30])
56.0
```

statistics.median(data)

```
>>> median([1, 3, 5])
3
```

```
>>> median([1, 3, 5, 7])
4.0
```

statistics.median_low(data)

```
>>> median_low([1, 3, 5])
3
>>> median_low([1, 3, 5, 7])
3
```

statistics.median_high(data)

```
>>> median_high([1, 3, 5])
3
>>> median_high([1, 3, 5, 7])
5
```

statistics.median_grouped(data, interval=1)

```
>>> median_grouped([52, 52, 53, 54])
52.5
```

```
>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
3.7
```

```
>>> median_grouped([1, 3, 3, 5, 7], interval=1)
3.25
>>> median_grouped([1, 3, 3, 5, 7], interval=2)
3.5
```
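
The grouped median can be reproduced by hand with the usual interpolation formula; a sketch, assuming the median class for this data is the interval from 3.5 to 4.5 around the value 4:

```python
import math
from statistics import median_grouped

data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
# Interpolation within the median class: L + interval * (n/2 - cf) / f,
# where L is the lower limit of the class holding the middle of the data,
# cf is the count of values below that class, and f is the count inside it.
L, interval = 3.5, 1          # the median class here holds the value 4
n, cf, f = len(data), 4, data.count(4)
manual = L + interval * (n / 2 - cf) / f
assert math.isclose(manual, median_grouped(data))   # both give 3.7
```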

CPython implementation detail: Under some circumstances, `median_grouped()` may coerce data points to floats. This behavior is likely to change in the future.

• "Statistics for the Behavioral Sciences", Frederick J Gravetter and Larry B Wallnau (8th Edition).

• The SSMEDIAN function in the Gnome Gnumeric spreadsheet, including this discussion.

statistics.mode(data)

`mode` assumes discrete data and returns a single value. This is the standard treatment of the mode as commonly taught in schools:

```
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3
```

```
>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'
```

statistics.multimode(data)

```
>>> multimode('aabbbbccddddeeffffgg')
['b', 'd', 'f']
>>> multimode('')
[]
```
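
The behavior can be sketched with `collections.Counter`: keep every value tied for the highest count, in first-seen order (an illustrative reimplementation, not the module's actual code):

```python
from collections import Counter
from statistics import multimode

def multimode_sketch(data):
    # Counter preserves insertion order, matching multimode's documented
    # ordering of "first encountered in the data".
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

assert multimode_sketch('aabbbbccddddeeffffgg') == multimode('aabbbbccddddeeffffgg')
assert multimode_sketch('') == multimode('') == []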

statistics.pstdev(data, mu=None)

```
>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
0.986893273527251
```

statistics.pvariance(data, mu=None)

```
>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
>>> pvariance(data)
1.25
```

```
>>> mu = mean(data)
>>> pvariance(data, mu)
1.25
```

```
>>> from decimal import Decimal as D
>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
Decimal('24.815')

>>> from fractions import Fraction as F
>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
Fraction(13, 72)
```
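
The result matches the defining formula, the mean of the squared deviations from the population mean; a quick check:

```python
from statistics import mean, pvariance

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
mu = mean(data)
# Population variance: average squared deviation over all n points.
manual = sum((x - mu) ** 2 for x in data) / len(data)
assert manual == pvariance(data) == 1.25
```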

statistics.stdev(data, xbar=None)

```
>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
1.0810874155219827
```

statistics.variance(data, xbar=None)

```
>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
>>> variance(data)
1.3720238095238095
```

```
>>> m = mean(data)
>>> variance(data, m)
1.3720238095238095
```

```
>>> from decimal import Decimal as D
>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
Decimal('31.01875')

>>> from fractions import Fraction as F
>>> variance([F(1, 6), F(1, 2), F(5, 3)])
Fraction(67, 108)
```
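
The sample variance differs from the population variance only in its denominator of `n - 1` (Bessel's correction); a quick check against the formula:

```python
import math
from statistics import mean, variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
xbar = mean(data)
# Sample variance divides the squared deviations by n - 1, not n.
manual = sum((x - xbar) ** 2 for x in data) / (len(data) - 1)
assert math.isclose(manual, variance(data))
```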

statistics.quantiles(data, *, n=4, method='exclusive')

Divide data into n continuous intervals with equal probability. Returns a list of `n - 1` cut points separating the intervals.

Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set n to 100 for percentiles, which gives the 99 cut points that separate data into 100 equal-sized groups. Raises `StatisticsError` if n is less than 1.

data can be any iterable containing sample data. For meaningful results, the number of data points in data should be larger than n. Raises `StatisticsError` if there are fewer than two data points.

The method for computing quantiles can be varied depending on whether the data includes or excludes the lowest and highest possible values from the population.

Setting the method to "inclusive" is used for describing population data or for samples that are known to include the most extreme values from the population. The minimum value in data is treated as the 0th percentile and the maximum value as the 100th percentile. The portion of the population falling below the i-th of m sorted data points is computed as `(i - 1) / (m - 1)`. Given 11 sample values, the method sorts them and assigns the following percentiles: 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

```
# Decile cut points for empirically sampled data
>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
...         103, 107, 101, 81, 109, 104]
>>> [round(q, 1) for q in quantiles(data, n=10)]
[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
```
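
As an illustrative sketch of the "inclusive" method, eleven evenly spaced samples put the nine decile cut points exactly on the interior data points:

```python
from statistics import quantiles

data = list(range(1, 12))     # eleven sorted samples: 1 .. 11
cuts = quantiles(data, n=10, method='inclusive')
# With (i - 1) / (m - 1) percentile assignment, the 10% .. 90% cut points
# land exactly on the 2nd through 10th values.
assert cuts == [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```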

statistics.covariance(x, y, /)

```
>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> covariance(x, y)
0.75
>>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
>>> covariance(x, z)
-7.5
>>> covariance(z, x)
-7.5
```

statistics.correlation(x, y, /)

```
>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
>>> correlation(x, x)
1.0
>>> correlation(x, y)
-1.0
```

statistics.linear_regression(x, y, /, *, proportional=False)

y = slope * x + intercept + noise

```
>>> year = [1971, 1975, 1979, 1982, 1983]
>>> films_total = [1, 2, 3, 4, 5]
>>> slope, intercept = linear_regression(year, films_total)
>>> round(slope * 2019 + intercept)
16
```

If proportional is true, the independent variable x and the dependent variable y are assumed to be directly proportional. The data is fit to a line passing through the origin. Since the intercept will always be 0.0, the underlying linear function simplifies to:

y = slope * x + noise

## Exceptions¶

exception statistics.StatisticsError

A subclass of `ValueError` for statistics-related exceptions.

## `NormalDist` objects¶

`NormalDist` is a tool for creating and manipulating normal distributions of a random variable. It is a class that treats the mean and standard deviation of data measurements as a single entity.

class statistics.NormalDist(mu=0.0, sigma=1.0)

If sigma is negative, raises `StatisticsError`.

mean

A read-only property for the arithmetic mean of a normal distribution.

median

A read-only property for the median of a normal distribution.

mode

A read-only property for the mode of a normal distribution.

stdev

A read-only property for the standard deviation of a normal distribution.

variance

A read-only property for the variance of a normal distribution. Equal to the square of the standard deviation.

classmethod from_samples(data)

data can be any iterable and should consist of values that can be converted to type `float`. If data does not contain at least two elements, raises `StatisticsError`, because it takes at least one point to estimate a central value and at least two points to estimate dispersion.

samples(n, *, seed=None)

Generates n random samples for a given mean and standard deviation. Returns a `list` of `float` values.

pdf(x)

The relative likelihood is computed as the probability of a sample occurring in a narrow range divided by the width of the range (hence the word "density"). Since the likelihood is relative to other points, its value can be greater than `1.0`.
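
A brief illustration: squeezing the distribution raises the peak density above 1.0 even though the total area under the curve is still 1:

```python
from statistics import NormalDist

narrow = NormalDist(mu=0.0, sigma=0.1)
# The peak density is 1 / (sigma * sqrt(2 * pi)), about 3.99 for sigma = 0.1.
assert narrow.pdf(0.0) > 1.0
```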

cdf(x)

Using a cumulative distribution function (cdf), compute the probability that a random variable X will be less than or equal to x. Mathematically, it is written `P(X <= x)`.

inv_cdf(p)

Compute the inverse cumulative distribution function, also known as the quantile function or the percent-point function. Mathematically, it is written `x : P(X <= x) = p`.
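
A brief round-trip check: feeding a cumulative probability from `cdf()` back into `inv_cdf()` recovers the original cut point:

```python
import math
from statistics import NormalDist

iq = NormalDist(100, 15)
# inv_cdf() undoes cdf(): x -> P(X <= x) -> x.
p = iq.cdf(115)
assert math.isclose(iq.inv_cdf(p), 115)
assert math.isclose(iq.inv_cdf(0.5), 100.0)   # the median sits at p = 0.5
```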

overlap(other)

Measures the agreement between two normal probability distributions. Returns a value between 0.0 and 1.0 giving the overlapping area for the two probability density functions.

quantiles(n=4)

Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set n to 100 for percentiles, which gives the 99 cut points that separate the normal distribution into 100 equal-sized groups.

zscore(x)

Compute the Standard Score describing x in terms of the number of standard deviations above or below the mean of the normal distribution: `(x - mean) / stdev`.
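
A quick check of the z-score formula:

```python
from statistics import NormalDist

iq = NormalDist(100, 15)
# An IQ of 130 sits exactly two standard deviations above the mean.
assert iq.zscore(130) == (130 - iq.mean) / iq.stdev == 2.0
```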

Instances of `NormalDist` support addition, subtraction, multiplication and division by a constant. These operations are used for translation and scaling. For example:

```
>>> temperature_february = NormalDist(5, 2.5)             # Celsius
>>> temperature_february * (9/5) + 32                     # Fahrenheit
NormalDist(mu=41.0, sigma=4.5)
```

```
>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
>>> drug_effects = NormalDist(0.4, 0.15)
>>> combined = birth_weights + drug_effects
>>> round(combined.mean, 1)
3.1
>>> round(combined.stdev, 1)
0.5
```

### `NormalDist` Examples and Recipes¶

`NormalDist` readily solves classic probability problems. For example, given historical data for SAT exams showing that scores are normally distributed with a mean of 1060 and a standard deviation of 195, determine the percentage of students with test scores between 1100 and 1200, after rounding to the nearest whole number:

```
>>> sat = NormalDist(1060, 195)
>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
>>> round(fraction * 100.0, 1)
18.4
```

```
>>> list(map(round, sat.quantiles()))
[928, 1060, 1192]
>>> list(map(round, sat.quantiles(n=10)))
[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
```

```
>>> def model(x, y, z):
...     return (3*x + 7*x*y - 5*y) / (11 * z)
...
>>> n = 100_000
>>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
>>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
>>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
>>> quantiles(map(model, X, Y, Z))
[1.4591308524824727, 1.8035946855390597, 2.175091447274739]
```

Normal distributions can be used to approximate Binomial distributions when the sample size is large and when the probability of a successful trial is near 50%.

```
>>> n = 750             # Sample size
>>> p = 0.65            # Preference for Python
>>> q = 1.0 - p         # Preference for Ruby
>>> k = 500             # Room capacity

>>> # Approximation using the cumulative normal distribution
>>> from math import sqrt
>>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
0.8402

>>> # Solution using the cumulative binomial distribution
>>> from math import comb, fsum
>>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
0.8402

>>> # Approximation using a simulation
>>> from random import seed, choices
>>> seed(8675309)
>>> def trial():
...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
>>> mean(trial() <= k for i in range(10_000))
0.8398
```

Wikipedia has a nice example of a Naive Bayesian Classifier. The challenge is to predict a person's gender from measurements of normally distributed features including height, weight, and foot size.

```
>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
```

```
>>> ht = 6.0        # height
>>> wt = 130        # weight
>>> fs = 8          # foot size
```

```
>>> prior_male = 0.5
>>> prior_female = 0.5
>>> posterior_male = (prior_male * height_male.pdf(ht) *
...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

>>> posterior_female = (prior_female * height_female.pdf(ht) *
...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))
```

```
>>> 'male' if posterior_male > posterior_female else 'female'
'female'
```