Descriptive Statistics

The descriptive_statistics module provides functions for calculating basic statistical measures that describe the central tendency, variability, and distribution of datasets.

real_simple_stats.descriptive_statistics.coefficient_of_variation(values: Sequence[float]) → float
real_simple_stats.descriptive_statistics.detect_fake_statistics(survey_sponsor: str, is_voluntary: bool, correlation_not_causation: bool) → list[str]

Detect potential issues with statistical claims or studies.

Parameters:
  • survey_sponsor – Organization sponsoring the survey/study

  • is_voluntary – Whether the survey uses voluntary response sampling

  • correlation_not_causation – Whether correlation is being presented as causation

Returns:

List of warning messages about potential statistical issues

Example

>>> detect_fake_statistics("Diet Pill Company", True, True)
['Potential bias: Self-funded study', 'Warning: Voluntary response samples are biased',
 'Warning: Correlation does not imply causation']
real_simple_stats.descriptive_statistics.draw_cumulative_frequency_table(values: Sequence[int]) → dict[int, int]

Generate a cumulative frequency table from a list of discrete values.

Parameters:

values – List of discrete integer values

Returns:

Dictionary mapping each unique value to its cumulative frequency

Example

>>> draw_cumulative_frequency_table([1, 2, 1, 3, 2, 1])
{1: 3, 2: 5, 3: 6}
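
A cumulative frequency is the running total of counts over the values in ascending order. The following is a minimal sketch of that idea, not the library's implementation; the name cumulative_frequency_sketch is hypothetical, and ascending ordering of the keys is assumed from the example above.

```python
from collections import Counter
from itertools import accumulate


def cumulative_frequency_sketch(values):
    """Map each unique value (ascending) to the count of items <= that value."""
    counts = Counter(values)          # plain frequency of each value
    ordered = sorted(counts)          # unique values in ascending order
    running = accumulate(counts[v] for v in ordered)  # running totals
    return dict(zip(ordered, running))


print(cumulative_frequency_sketch([1, 2, 1, 3, 2, 1]))  # {1: 3, 2: 5, 3: 6}
```

The last cumulative frequency always equals the number of observations, which is a quick sanity check on any such table.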
real_simple_stats.descriptive_statistics.draw_frequency_table(values: Sequence[str | int]) → dict[str | int, int]

Generate a frequency table from a list of categorical or discrete values.

Parameters:

values – List of categorical or discrete values to count

Returns:

Dictionary mapping each unique value to its frequency

Example

>>> draw_frequency_table(['A', 'B', 'A', 'C', 'B', 'A'])
{'A': 3, 'B': 2, 'C': 1}
real_simple_stats.descriptive_statistics.five_number_summary(values: Sequence[float]) → dict[str, float]

Return the five-number summary: min, Q1, median, Q3, max.

Parameters:

values – List of numerical values

Returns:

Dictionary with keys min, Q1, median, Q3, and max

Raises:

ValueError – If the input list is empty

Example

>>> five_number_summary([1, 2, 3, 4, 5])
{'min': 1, 'Q1': 1.5, 'median': 3, 'Q3': 4.5, 'max': 5}
>>> five_number_summary([5])
{'min': 5, 'Q1': 5, 'median': 5, 'Q3': 5, 'max': 5}
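
The example output above (Q1 = 1.5 and Q3 = 4.5 for five values) is consistent with computing the quartiles as the medians of the lower and upper halves, excluding the middle value when the count is odd. The sketch below reproduces those numbers under that assumption; five_number_sketch is a hypothetical name, and the library's actual quartile method may differ.

```python
from statistics import median


def five_number_sketch(values):
    """Five-number summary using the median-of-halves quartile rule."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # Lower and upper halves exclude the middle element when n is odd.
    # For a single value both halves are empty, so fall back to the data itself.
    lower = ordered[:half] or ordered
    upper = ordered[n - half:] or ordered
    return {
        "min": ordered[0],
        "Q1": median(lower),
        "median": median(ordered),
        "Q3": median(upper),
        "max": ordered[-1],
    }


print(five_number_sketch([1, 2, 3, 4, 5]))
# {'min': 1, 'Q1': 1.5, 'median': 3, 'Q3': 4.5, 'max': 5}
```

Other quartile conventions (for example, linear interpolation) give different Q1 and Q3 values for the same data, so results from other tools may not match.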
real_simple_stats.descriptive_statistics.interquartile_range(values: Sequence[float]) → float
real_simple_stats.descriptive_statistics.is_continuous(values: Sequence[float]) → bool

Determine if a variable is continuous (contains non-integer values).

Parameters:

values – List of numerical values to check

Returns:

True if any values are non-integers, False if all are integers

Example

>>> is_continuous([1.5, 2.0, 3.0])
True
>>> is_continuous([1.0, 2.0, 3.0])
False
real_simple_stats.descriptive_statistics.is_discrete(values: Sequence[float]) → bool

Determine if a variable is discrete (all values are integers).

Parameters:

values – List of numerical values to check

Returns:

True if all values are integers, False otherwise

Example

>>> is_discrete([1.0, 2.0, 3.0])
True
>>> is_discrete([1.5, 2.0, 3.0])
False
real_simple_stats.descriptive_statistics.mean(values: Sequence[float]) → float

Calculate the arithmetic mean (average) of a dataset.

Parameters:

values – List of numerical values

Returns:

The arithmetic mean

Raises:

ValueError – If the input list is empty

Example

>>> mean([1, 2, 3, 4, 5])
3.0
real_simple_stats.descriptive_statistics.median(values: Sequence[float]) → float

Calculate the median (middle value) of a dataset.

Parameters:

values – List of numerical values

Returns:

The median value

Raises:

ValueError – If the input list is empty

Example

>>> median([1, 2, 3, 4, 5])
3.0
>>> median([1, 2, 3, 4])
2.5
real_simple_stats.descriptive_statistics.sample_std_dev(values: Sequence[float]) → float

Calculate the sample standard deviation of a dataset.

Parameters:

values – List of numerical values

Returns:

The sample standard deviation (square root of sample variance)

Raises:

ValueError – If fewer than 2 values are provided

Example

>>> sample_std_dev([1, 2, 3, 4, 5])
1.5811388300841898
real_simple_stats.descriptive_statistics.sample_variance(values: Sequence[float]) → float

Calculate the sample variance of a dataset.

Uses the sample variance formula with (n-1) degrees of freedom (Bessel’s correction).

Parameters:

values – List of numerical values

Returns:

The sample variance

Raises:

ValueError – If fewer than 2 values are provided

Example

>>> sample_variance([1, 2, 3, 4, 5])
2.5

Functions Overview

Central Tendency

real_simple_stats.descriptive_statistics.mean(values: Sequence[float]) → float

real_simple_stats.descriptive_statistics.median(values: Sequence[float]) → float

Variability

real_simple_stats.descriptive_statistics.sample_variance(values: Sequence[float]) → float

real_simple_stats.descriptive_statistics.coefficient_of_variation(values: Sequence[float]) → float

Usage Examples

Basic Statistics

Calculate common descriptive statistics for a dataset:

from real_simple_stats import descriptive_statistics as desc

# Sample dataset
data = [12, 15, 18, 20, 22, 25, 28, 30, 32, 35]

# Central tendency
mean_val = desc.mean(data)
median_val = desc.median(data)

print(f"Mean: {mean_val}")
print(f"Median: {median_val}")

# Variability (sample statistics, which divide by n-1)
variance_val = desc.sample_variance(data)
std_dev = desc.sample_std_dev(data)
cv = desc.coefficient_of_variation(data)

print(f"Sample Variance: {variance_val:.2f}")
print(f"Sample Standard Deviation: {std_dev:.2f}")
print(f"Coefficient of Variation: {cv:.2f}%")

Population vs Sample Statistics

Understanding the difference between population and sample statistics:

import statistics

from real_simple_stats import descriptive_statistics as desc

# Same dataset, different calculations
sample_data = [85, 90, 78, 92, 88, 76, 95, 82, 89, 91]

# Population statistics (when you have the entire population; divides by N).
# This module documents only sample versions, so the population values here
# come from the standard library's statistics module.
pop_variance = statistics.pvariance(sample_data)
pop_std = statistics.pstdev(sample_data)

# Sample statistics (when you have a sample from a larger population; divides by n-1)
sample_variance = desc.sample_variance(sample_data)
sample_std = desc.sample_std_dev(sample_data)

print("Population Statistics:")
print(f"  Variance: {pop_variance:.2f}")
print(f"  Standard Deviation: {pop_std:.2f}")

print("Sample Statistics:")
print(f"  Variance: {sample_variance:.2f}")
print(f"  Standard Deviation: {sample_std:.2f}")

Error Handling

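The functions raise ValueError when the input has too few values for the requested statistic: an empty list for mean, median, and five_number_summary, and fewer than two values for sample_variance and sample_std_dev: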

import real_simple_stats.descriptive_statistics as desc

# Empty dataset
try:
    result = desc.mean([])
except ValueError as e:
    print(f"Error: {e}")

# Single value for sample statistics
try:
    result = desc.sample_variance([42])
except ValueError as e:
    print(f"Error: {e}")

# Five-number summary works with small datasets too
summary_single = desc.five_number_summary([5])
# For a single value, all stats equal that value

summary_two = desc.five_number_summary([1, 2])
# With two values, Q1=min and Q3=max

# Non-numeric data
try:
    result = desc.mean([1, 2, "three", 4])
except TypeError as e:
    print(f"Error: {e}")

Mathematical Background

Mean (Arithmetic Average)

The arithmetic mean is the sum of all values divided by the number of values:

\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\]

Where:

  • \(\bar{x}\) is the sample mean

  • \(n\) is the number of observations

  • \(x_i\) is the i-th observation
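
The formula translates almost directly into code. The following is a minimal sketch rather than the library's source; mean_sketch is a hypothetical name, and the empty-input check mirrors the ValueError documented for mean.

```python
def mean_sketch(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


print(mean_sketch([1, 2, 3, 4, 5]))  # 3.0
```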

Median

The median is the middle value when data is arranged in ascending order:

  • For odd n: median = middle value

  • For even n: median = average of two middle values
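
The two cases above can be sketched as follows. This is an illustrative version with a hypothetical name, median_sketch; it sorts a copy of the data and returns a float in both cases, as in the documented examples.

```python
def median_sketch(values):
    """Median of a non-empty sequence, handling odd and even lengths."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd n: the single middle value.
        return float(ordered[mid])
    # Even n: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_sketch([1, 2, 3, 4, 5]))  # 3.0
print(median_sketch([1, 2, 3, 4]))     # 2.5
```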

Variance

Population Variance:

\[\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2\]

Sample Variance:

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\]
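
The two formulas share the same sum of squared deviations and differ only in the divisor. A small sketch, using hypothetical helper names that are not part of the library, makes the difference concrete:

```python
def sum_squared_deviations(values):
    """Sum of squared differences between each value and the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def population_variance_sketch(values):
    """Divide by N: the data is the entire population."""
    if not values:
        raise ValueError("values must not be empty")
    return sum_squared_deviations(values) / len(values)


def sample_variance_sketch(values):
    """Divide by n-1 (Bessel's correction): the data is a sample."""
    if len(values) < 2:
        raise ValueError("at least 2 values are required")
    return sum_squared_deviations(values) / (len(values) - 1)


data = [1, 2, 3, 4, 5]
print(population_variance_sketch(data))  # 2.0
print(sample_variance_sketch(data))      # 2.5
```

For this dataset the squared deviations sum to 10, so the population variance is 10/5 and the sample variance is 10/4. The gap between the two shrinks as the number of observations grows.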

Standard Deviation

The standard deviation is the square root of the variance:

  • Population: \(\sigma = \sqrt{\sigma^2}\)

  • Sample: \(s = \sqrt{s^2}\)
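
As a cross-check that does not depend on this library, Python's standard statistics module provides both versions: pstdev and pvariance for a population, stdev and variance for a sample. The sketch below only verifies the square-root relationship on a small dataset.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]

# Population: sigma is the square root of the population variance.
sigma = statistics.pstdev(data)
assert math.isclose(sigma, math.sqrt(statistics.pvariance(data)))

# Sample: s is the square root of the sample variance.
s = statistics.stdev(data)
assert math.isclose(s, math.sqrt(statistics.variance(data)))

print(round(sigma, 4))  # 1.4142
print(round(s, 4))      # 1.5811
```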

Coefficient of Variation

The coefficient of variation expresses the standard deviation as a percentage of the mean:

\[CV = \frac{\sigma}{|\mu|} \times 100\%\]

This allows comparison of variability between datasets with different units or scales.
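
To illustrate the scale independence, the sketch below computes the CV for the same heights expressed in centimetres and in metres. It follows the formula as written above, using the population standard deviation from the standard library; whether coefficient_of_variation in this module uses the population or the sample standard deviation is not stated here, so treat cv_sketch (a hypothetical name) as an illustration only.

```python
import statistics


def cv_sketch(values):
    """Coefficient of variation in percent: population std dev over |mean|."""
    mu = statistics.fmean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / abs(mu) * 100


heights_cm = [160, 170, 180]
heights_m = [1.6, 1.7, 1.8]

# The same data in different units gives the same relative variability.
print(f"{cv_sketch(heights_cm):.2f}%")  # 4.80%
print(f"{cv_sketch(heights_m):.2f}%")   # 4.80%
```

Because the mean appears in the denominator, the CV is undefined for a mean of zero and becomes unstable when the mean is close to zero.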

See Also

  • probability_utils - For probability calculations

  • hypothesis_testing - For statistical testing

  • ../tutorials/basic_statistics - Tutorial on descriptive statistics