Descriptive Statistics

The descriptive_statistics module provides functions for calculating basic statistical measures that describe the central tendency, variability, and distribution of datasets.

real_simple_stats.descriptive_statistics.coefficient_of_variation(values: Sequence[float]) → float

real_simple_stats.descriptive_statistics.detect_fake_statistics(survey_sponsor: str, is_voluntary: bool, correlation_not_causation: bool) → list[str]

Detect potential issues with statistical claims or studies.

Parameters:

survey_sponsor – Organization sponsoring the survey/study
is_voluntary – Whether the survey uses voluntary response sampling
correlation_not_causation – Whether correlation is being presented as causation

Returns:

List of warning messages about potential statistical issues

Example

>>> detect_fake_statistics("Diet Pill Company", True, True)
['Potential bias: Self-funded study', 'Warning: Voluntary response samples are biased',
 'Warning: Correlation does not imply causation']

real_simple_stats.descriptive_statistics.draw_cumulative_frequency_table(values: Sequence[int]) → dict[int, int]

Generate a cumulative frequency table from a list of discrete values.

Parameters: values – List of discrete integer values
Returns: Dictionary mapping each unique value to its cumulative frequency

Example

>>> draw_cumulative_frequency_table([1, 2, 1, 3, 2, 1])
{1: 3, 2: 5, 3: 6}
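
As an illustration of what a cumulative frequency table contains, here is a minimal standard-library sketch. The name cumulative_frequency_sketch is hypothetical, and the code is not the library's implementation; it only reproduces the documented example.

```python
from collections import Counter
from itertools import accumulate


def cumulative_frequency_sketch(values):
    """Map each distinct value, in ascending order, to the number of items <= it.

    Hypothetical helper for illustration only; not part of real_simple_stats.
    """
    counts = Counter(values)            # plain frequency of each value
    ordered = sorted(counts)            # distinct values, ascending
    running = accumulate(counts[v] for v in ordered)  # running totals
    return dict(zip(ordered, running))


print(cumulative_frequency_sketch([1, 2, 1, 3, 2, 1]))  # {1: 3, 2: 5, 3: 6}
```

The last cumulative count always equals the number of input values, which is a quick sanity check on any such table.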

real_simple_stats.descriptive_statistics.draw_frequency_table(values: Sequence[str | int]) → dict[str | int, int]

Generate a frequency table from a list of categorical or discrete values.

Parameters: values – List of categorical or discrete values to count
Returns: Dictionary mapping each unique value to its frequency

Example

>>> draw_frequency_table(['A', 'B', 'A', 'C', 'B', 'A'])
{'A': 3, 'B': 2, 'C': 1}
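
The same kind of table can be built with collections.Counter from the standard library. The sketch below uses hypothetical helper names and is not the library's code; the relative-frequency helper is an extra shown only for illustration.

```python
from collections import Counter


def frequency_table_sketch(values):
    """Count occurrences of each value, keeping first-seen order.

    Hypothetical helper for illustration only; not part of real_simple_stats.
    """
    return dict(Counter(values))


def relative_frequency_sketch(values):
    """Convert counts to proportions of the total (hypothetical extra)."""
    total = len(values)
    return {key: count / total for key, count in Counter(values).items()}


grades = ['A', 'B', 'A', 'C', 'B', 'A']
print(frequency_table_sketch(grades))          # {'A': 3, 'B': 2, 'C': 1}
print(relative_frequency_sketch(grades)['A'])  # 0.5
```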

real_simple_stats.descriptive_statistics.five_number_summary(values: Sequence[float]) → dict[str, float]

Return the five-number summary: min, Q1, median, Q3, max.

Parameters: values – List of numerical values
Returns: Dictionary with keys min, Q1, median, Q3, max
Raises: ValueError – If the input list is empty

Example

>>> five_number_summary([1, 2, 3, 4, 5])
{'min': 1, 'Q1': 1.5, 'median': 3, 'Q3': 4.5, 'max': 5}
>>> five_number_summary([5])
{'min': 5, 'Q1': 5, 'median': 5, 'Q3': 5, 'max': 5}
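
Quartile conventions differ between tools. The two examples above are consistent with taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data, excluding the middle element when the length is odd. The sketch below assumes that convention; whether the library computes quartiles exactly this way is an assumption, and five_number_summary_sketch is a hypothetical name.

```python
from statistics import median


def five_number_summary_sketch(values):
    """Five-number summary using the median-of-halves quartile convention.

    Hypothetical helper; the quartile method is an assumption inferred from
    the documented examples, not taken from the library's source.
    """
    if not values:
        raise ValueError("values must not be empty")
    data = sorted(values)
    n = len(data)
    half = n // 2
    # Exclude the middle element for odd n; fall back to the whole
    # dataset when a half would be empty (single-value input).
    lower = data[:half] or data
    upper = data[half + n % 2:] or data
    return {
        "min": data[0],
        "Q1": median(lower),
        "median": median(data),
        "Q3": median(upper),
        "max": data[-1],
    }


print(five_number_summary_sketch([1, 2, 3, 4, 5]))
# {'min': 1, 'Q1': 1.5, 'median': 3, 'Q3': 4.5, 'max': 5}
```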

real_simple_stats.descriptive_statistics.interquartile_range(values: Sequence[float]) → float

real_simple_stats.descriptive_statistics.is_continuous(values: Sequence[float]) → bool

Determine if a variable is continuous (contains non-integer values).

Parameters: values – List of numerical values to check
Returns: True if any values are non-integers, False if all are integers

Example

>>> is_continuous([1.5, 2.0, 3.0])
True
>>> is_continuous([1.0, 2.0, 3.0])
False

real_simple_stats.descriptive_statistics.is_discrete(values: Sequence[float]) → bool

Determine if a variable is discrete (all values are integers).

Parameters: values – List of numerical values to check
Returns: True if all values are integers, False otherwise

Example

>>> is_discrete([1.0, 2.0, 3.0])
True
>>> is_discrete([1.5, 2.0, 3.0])
False
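
As documented, the two checks are complements: a dataset counts as continuous exactly when it is not discrete. A minimal sketch of that rule, using hypothetical helper names rather than the library's code:

```python
def is_discrete_sketch(values):
    """True when every value is integer-valued (2.0 counts, 2.5 does not).

    Hypothetical helper for illustration only.
    """
    return all(float(v).is_integer() for v in values)


def is_continuous_sketch(values):
    """True when at least one value has a fractional part."""
    return not is_discrete_sketch(values)


print(is_discrete_sketch([1.0, 2.0, 3.0]))    # True
print(is_continuous_sketch([1.5, 2.0, 3.0]))  # True
```

Note that this test looks only at the stored values: measurements of a continuous quantity that happen to be whole numbers would still be classified as discrete.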

real_simple_stats.descriptive_statistics.mean(values: Sequence[float]) → float

Calculate the arithmetic mean (average) of a dataset.

Parameters: values – List of numerical values
Returns: The arithmetic mean
Raises: ValueError – If the input list is empty

Example

>>> mean([1, 2, 3, 4, 5])
3.0

real_simple_stats.descriptive_statistics.median(values: Sequence[float]) → float

Calculate the median (middle value) of a dataset.

Parameters: values – List of numerical values
Returns: The median value
Raises: ValueError – If the input list is empty

Example

>>> median([1, 2, 3, 4, 5])
3.0
>>> median([1, 2, 3, 4])
2.5

real_simple_stats.descriptive_statistics.sample_std_dev(values: Sequence[float]) → float

Calculate the sample standard deviation of a dataset.

Parameters: values – List of numerical values
Returns: The sample standard deviation (square root of sample variance)
Raises: ValueError – If fewer than 2 values are provided

Example

>>> sample_std_dev([1, 2, 3, 4, 5])
1.5811388300841898

real_simple_stats.descriptive_statistics.sample_variance(values: Sequence[float]) → float

Calculate the sample variance of a dataset.

Uses the sample variance formula with (n-1) degrees of freedom (Bessel’s correction).

Parameters: values – List of numerical values
Returns: The sample variance
Raises: ValueError – If fewer than 2 values are provided

Example

>>> sample_variance([1, 2, 3, 4, 5])
2.5
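
The value 2.5 can be checked by hand. The following step-by-step sketch uses only plain Python, not the library, and also recovers the sample standard deviation shown for sample_std_dev:

```python
from math import sqrt

data = [1, 2, 3, 4, 5]
n = len(data)

x_bar = sum(data) / n                            # 15 / 5 = 3.0
squared_devs = [(x - x_bar) ** 2 for x in data]  # [4.0, 1.0, 0.0, 1.0, 4.0]
s2 = sum(squared_devs) / (n - 1)                 # 10.0 / 4 = 2.5 (Bessel's correction)
s = sqrt(s2)                                     # about 1.5811

print(s2)           # 2.5
print(round(s, 4))  # 1.5811
```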

Functions Overview

Central Tendency

mean(values) and median(values): see the full entries above.

Variability

sample_variance(values) and coefficient_of_variation(values): see the full entries above.

Usage Examples

Basic Statistics

Calculate common descriptive statistics for a dataset:

from statistics import mode  # standard library; no mode function is listed in the API above

from real_simple_stats import descriptive_statistics as desc

# Sample dataset
data = [12, 15, 18, 20, 22, 25, 28, 30, 32, 35]

# Central tendency
mean_val = desc.mean(data)
median_val = desc.median(data)
# Every value here is unique; on Python 3.8+ mode() then returns the first value
mode_val = mode(data)

print(f"Mean: {mean_val}")
print(f"Median: {median_val}")
print(f"Mode: {mode_val}")

# Variability (sample versions, as documented above)
variance_val = desc.sample_variance(data)
std_dev = desc.sample_std_dev(data)
cv = desc.coefficient_of_variation(data)

print(f"Variance: {variance_val:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Coefficient of Variation: {cv:.2f}%")

Population vs Sample Statistics

Population statistics divide by the number of values N, while sample statistics divide by n - 1. The module lists only the sample versions, so this example takes the population versions from Python's standard library:

from statistics import pstdev, pvariance  # standard-library population statistics

from real_simple_stats import descriptive_statistics as desc

# Same dataset, different calculations
sample_data = [85, 90, 78, 92, 88, 76, 95, 82, 89, 91]

# Population statistics (when you have the entire population)
pop_variance = pvariance(sample_data)
pop_std = pstdev(sample_data)

# Sample statistics (when you have a sample from a larger population)
sample_var = desc.sample_variance(sample_data)
sample_std = desc.sample_std_dev(sample_data)

print("Population Statistics:")
print(f"  Variance: {pop_variance:.2f}")
print(f"  Standard Deviation: {pop_std:.2f}")

print("Sample Statistics:")
print(f"  Variance: {sample_var:.2f}")
print(f"  Standard Deviation: {sample_std:.2f}")

Error Handling

The functions raise ValueError for empty input and, for sample statistics, for fewer than two values. Non-numeric data is expected to raise TypeError:

import real_simple_stats.descriptive_statistics as desc

# Empty dataset
try:
    result = desc.mean([])
except ValueError as e:
    print(f"Error: {e}")

# Single value for sample statistics
try:
    result = desc.sample_variance([42])
except ValueError as e:
    print(f"Error: {e}")

# Five-number summary works with small datasets too
summary_single = desc.five_number_summary([5])
# For a single value, all stats equal that value

summary_two = desc.five_number_summary([1, 2])
# With two values, Q1=min and Q3=max

# Non-numeric data
try:
    result = desc.mean([1, 2, "three", 4])
except TypeError as e:
    print(f"Error: {e}")

Mathematical Background

Mean (Arithmetic Average)

The arithmetic mean is the sum of all values divided by the number of values:

\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\]

Where:

- \(\bar{x}\) is the sample mean
- \(n\) is the number of observations
- \(x_i\) is the i-th observation

Median

The median is the middle value when data is arranged in ascending order:

For odd n: median = middle value
For even n: median = average of two middle values
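
The two cases above can be sketched as follows. median_sketch is a hypothetical name and this is not the library's implementation; it sorts a copy of the data and then picks or averages the middle positions.

```python
def median_sketch(values):
    """Median via sorting: middle value for odd n, mean of the two middle values for even n.

    Hypothetical helper for illustration only.
    """
    if not values:
        raise ValueError("values must not be empty")
    data = sorted(values)
    mid = len(data) // 2
    if len(data) % 2 == 1:
        return float(data[mid])             # odd n: the single middle value
    return (data[mid - 1] + data[mid]) / 2  # even n: average of the two middle values


print(median_sketch([1, 2, 3, 4, 5]))  # 3.0
print(median_sketch([1, 2, 3, 4]))     # 2.5
```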

Variance

Population Variance:

\[\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2\]

Sample Variance:

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\]
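
The two formulas share the same sum of squared deviations and differ only in the divisor. A short plain-Python sketch, using the scores from the usage example and cross-checked against the standard library's statistics module:

```python
from statistics import pvariance, variance

scores = [85, 90, 78, 92, 88, 76, 95, 82, 89, 91]
n = len(scores)
x_bar = sum(scores) / n  # 86.6

# Shared numerator: sum of squared deviations from the mean
sum_sq = sum((x - x_bar) ** 2 for x in scores)

population_var = sum_sq / n    # divide by N
sample_var = sum_sq / (n - 1)  # divide by n - 1 (Bessel's correction)

print(f"{population_var:.2f}")  # 34.84
print(f"{sample_var:.2f}")      # 38.71

# The standard library agrees, up to floating-point rounding
print(abs(population_var - pvariance(scores)) < 1e-9)  # True
print(abs(sample_var - variance(scores)) < 1e-9)       # True
```

The sample variance is always the larger of the two, by a factor of n / (n - 1), and the gap shrinks as n grows.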

Standard Deviation

The standard deviation is the square root of the variance:

Population: \(\sigma = \sqrt{\sigma^2}\)
Sample: \(s = \sqrt{s^2}\)

Coefficient of Variation

The coefficient of variation expresses the standard deviation as a percentage of the mean:

\[CV = \frac{\sigma}{|\mu|} \times 100\%\]

This allows comparison of variability between datasets with different units or scales.
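
A small sketch of the formula shows the scale-independence. coefficient_of_variation_sketch is a hypothetical name; it defaults to the population standard deviation to match the formula above, and which standard deviation the library's coefficient_of_variation uses is not stated here, so the sample flag is an assumption added for illustration.

```python
from statistics import mean, pstdev, stdev


def coefficient_of_variation_sketch(values, sample=False):
    """Standard deviation as a percentage of the absolute mean.

    Hypothetical helper; not the library's implementation.
    """
    mu = mean(values)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    spread = stdev(values) if sample else pstdev(values)
    return spread / abs(mu) * 100


# The same measurements in two units (for example, cm and mm) give the same CV
in_cm = [10, 20, 30]
in_mm = [100, 200, 300]
print(f"{coefficient_of_variation_sketch(in_cm):.2f}%")  # 40.82%
print(f"{coefficient_of_variation_sketch(in_mm):.2f}%")  # 40.82%
```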
