Mathematical Formulas - LaTeX Notation

Complete mathematical reference for all Real Simple Stats functions with LaTeX formulas.


Descriptive Statistics

Mean (Arithmetic Average)

Formula: \[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\]

Function: mean(data)

Example:

import real_simple_stats as rss
data = [1, 2, 3, 4, 5]
result = rss.mean(data)  # 3.0

Median

Formula: \[\text{Median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{x_{n/2} + x_{(n/2)+1}}{2} & \text{if } n \text{ is even} \end{cases}\]

Function: median(data)


Sample Variance

Formula: \[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\]

Function: sample_variance(data)

Note: Uses \(n-1\) (Bessel’s correction) for unbiased estimation.


Sample Standard Deviation

Formula: \[s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

Function: sample_std_dev(data)
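
The variance and standard deviation formulas can be checked with a short standalone sketch. It uses only the Python standard library; the helper names are illustrative and are not taken from the Real Simple Stats source.

```python
import math
import statistics

def variance_sketch(data):
    """Sample variance with Bessel's correction (divide by n - 1)."""
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def std_dev_sketch(data):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(variance_sketch(data))

data = [1, 2, 3, 4, 5]
print(variance_sketch(data))      # 2.5
print(statistics.variance(data))  # the standard library gives the same value
```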


Coefficient of Variation

Formula: \[CV = \frac{s}{\bar{x}} \times 100\%\]

Function: coefficient_of_variation(data)

Interpretation: Relative variability; useful for comparing datasets with different units.


Interquartile Range (IQR)

Formula: \[IQR = Q_3 - Q_1\]

where \(Q_1\) is the 25th percentile and \(Q_3\) is the 75th percentile.

Function: interquartile_range(data)


Probability Distributions

Normal Distribution

Probability Density Function (PDF)

Formula: \[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

Function: normal_pdf(x, mu, sigma)

Parameters:

  • \(\mu\) = mean

  • \(\sigma\) = standard deviation

  • \(x\) = value


Cumulative Distribution Function (CDF)

Formula: \[F(x) = P(X \leq x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(t-\mu)^2}{2\sigma^2}} dt\]

Function: normal_cdf(x, mu, sigma)


Standard Normal (Z-score)

Formula: \[Z = \frac{X - \mu}{\sigma}\]

Function: z_score(x, mu, sigma)

Properties:

  • \(Z \sim N(0, 1)\)

  • \(P(|Z| \leq 1.96) \approx 0.95\)
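
As a rough check of the z-score and CDF formulas, the normal CDF can be written in terms of the error function, \(\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)\). The sketch below is independent of the library; the function names are hypothetical.

```python
import math

def z_sketch(x, mu, sigma):
    """Standardize x: number of standard deviations away from the mean."""
    return (x - mu) / sigma

def normal_cdf_sketch(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), computed via the error function."""
    z = z_sketch(x, mu, sigma)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probability mass of a standard normal between -1.96 and 1.96
central_mass = normal_cdf_sketch(1.96) - normal_cdf_sketch(-1.96)
print(round(central_mass, 3))  # 0.95
```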


Binomial Distribution

Probability Mass Function (PMF)

Formula: \[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

where \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\)

Function: binomial_probability(n, k, p)

Parameters:

  • \(n\) = number of trials

  • \(k\) = number of successes

  • \(p\) = probability of success
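
The PMF translates almost directly into code with `math.comb`. This is a minimal standalone sketch (the name `binomial_pmf_sketch` is illustrative); it also confirms numerically that the probabilities sum to 1 and that the mean and variance equal \(np\) and \(np(1-p)\).

```python
import math

def binomial_pmf_sketch(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
probs = [binomial_pmf_sketch(n, k, p) for k in range(n + 1)]

total = sum(probs)                                            # close to 1
mean = sum(k * pk for k, pk in enumerate(probs))              # close to n * p = 3
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs)) # close to n * p * (1 - p) = 2.1
print(binomial_pmf_sketch(4, 2, 0.5))  # 0.375
```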


Mean and Variance

Formulas: \[E[X] = np\]

\[\text{Var}(X) = np(1-p)\]

Functions: binomial_mean(n, p), binomial_variance(n, p)


Poisson Distribution

Probability Mass Function

Formula: \[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]

Function: poisson_pmf(k, lam)

Parameters:

  • \(\lambda\) = rate parameter (mean)

  • \(k\) = number of events

Properties:

  • \(E[X] = \lambda\)

  • \(\text{Var}(X) = \lambda\)


Geometric Distribution

Probability Mass Function

Formula: \[P(X = k) = (1-p)^{k-1} p\]

Function: geometric_pmf(k, p)

Interpretation: Probability that first success occurs on trial \(k\).

Properties:

  • \(E[X] = \frac{1}{p}\)

  • \(\text{Var}(X) = \frac{1-p}{p^2}\)


Exponential Distribution

Probability Density Function

Formula: \[f(x) = \lambda e^{-\lambda x}, \quad x \geq 0\]

Function: exponential_pdf(x, lam)

Properties:

  • \(E[X] = \frac{1}{\lambda}\)

  • \(\text{Var}(X) = \frac{1}{\lambda^2}\)

  • Memoryless property: \(P(X > s + t | X > s) = P(X > t)\)


🧪 Hypothesis Testing

One-Sample t-Test

Test Statistic: \[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]

Function: one_sample_t_test(data, mu0)

Degrees of Freedom: \(df = n - 1\)

Hypotheses:

  • \(H_0: \mu = \mu_0\)

  • \(H_1: \mu \neq \mu_0\) (two-tailed)
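
A minimal sketch of the test statistic and its degrees of freedom, using only the standard library. The function name is hypothetical, and the p-value step is omitted because it needs the t distribution's CDF, which the standard library does not provide.

```python
import math
import statistics

def one_sample_t_sketch(data, mu0):
    """Return (t, df) for H0: mu = mu0, using the sample standard deviation."""
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)          # divides by n - 1
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1

t, df = one_sample_t_sketch([5.1, 4.9, 5.3, 5.0, 5.2], mu0=5.0)
print(round(t, 4), df)  # 1.4142 4
```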


Two-Sample t-Test (Independent)

Test Statistic (Equal Variances): \[t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\]

where pooled standard deviation: \[s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}\]

Function: two_sample_t_test(data1, data2)

Degrees of Freedom: \(df = n_1 + n_2 - 2\)


Paired t-Test

Test Statistic: \[t = \frac{\bar{d}}{s_d / \sqrt{n}}\]

where \(d_i = x_{1i} - x_{2i}\) are the paired differences.

Function: paired_t_test(data1, data2)


One-Sample Z-Test

Test Statistic: \[Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]

Function: one_sample_z_test(data, mu0, sigma)

Note: Requires known population standard deviation \(\sigma\).


Chi-Square Goodness-of-Fit Test

Test Statistic: \[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

Function: chi_square_statistic(observed, expected)

Parameters:

  • \(O_i\) = observed frequency

  • \(E_i\) = expected frequency

  • \(k\) = number of categories

Degrees of Freedom: \(df = k - 1 - p\) (where \(p\) = number of estimated parameters)
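
For illustration, the statistic can be computed by hand for a die rolled 120 times under a fair-die hypothesis. The data and the function name below are made up for the example.

```python
def chi_square_sketch(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 25, 15, 20]   # hypothetical counts for faces 1..6
expected = [20] * 6                   # fair die: 120 rolls / 6 faces
stat = chi_square_sketch(observed, expected)
df = len(observed) - 1                # no parameters estimated, so p = 0
print(round(stat, 4), df)  # 2.9 5
```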


One-Way ANOVA

Test Statistic: \[F = \frac{MS_{between}}{MS_{within}} = \frac{SS_{between}/(k-1)}{SS_{within}/(N-k)}\]

where: \[SS_{between} = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2\]

\[SS_{within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2\]

Function: one_way_anova(groups)

Degrees of Freedom: \(df_1 = k-1\), \(df_2 = N-k\)
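
The sums of squares can be assembled step by step as in the sketch below. It is a standalone illustration with a hypothetical function name and returns only the F statistic and degrees of freedom, not a p-value.

```python
import statistics

def anova_f_sketch(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [statistics.fmean(g) for g in groups]

    # Between-group: group sizes times squared distance of group mean from grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group: squared distance of each observation from its own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df1, df2 = k - 1, n_total - k
    return (ss_between / df1) / (ss_within / df2), df1, df2

f_stat, df1, df2 = anova_f_sketch([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 4), df1, df2)  # 13.0 2 6
```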


📉 Regression & Correlation

Pearson Correlation Coefficient

Formula: \[r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\]

Function: pearson_correlation(x, y)

Properties:

  • \(-1 \leq r \leq 1\)

  • \(r = 1\): perfect positive correlation

  • \(r = -1\): perfect negative correlation

  • \(r = 0\): no linear correlation


Simple Linear Regression

Model: \[y = \beta_0 + \beta_1 x + \epsilon\]

Least Squares Estimates: \[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\]

Function: linear_regression(x, y)


Coefficient of Determination (R²)

Formula: \[R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\]

Function: coefficient_of_determination(x, y)

Interpretation: Proportion of variance in \(y\) explained by \(x\).
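
The least squares estimates and \(R^2\) can be worked through on a small dataset. The sketch below is standalone; the function name and its return order are assumptions for this example and may differ from what the library returns.

```python
def simple_regression_sketch(x, y):
    """Return (slope, intercept, r_squared) from the closed-form estimates."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)

    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

slope, intercept, r2 = simple_regression_sketch([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(slope, 4), round(intercept, 4), round(r2, 4))  # 0.6 2.2 0.6
```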


Multiple Linear Regression

Model: \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\]

Matrix Form: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

Least Squares Solution: \[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

Function: multiple_regression(X, y)


🔄 Time Series Analysis

Simple Moving Average (SMA)

Formula: \[SMA_t = \frac{1}{k}\sum_{i=0}^{k-1} x_{t-i}\]

Function: moving_average(data, window_size, method='simple')


Exponential Moving Average (EMA)

Formula: \[EMA_t = \alpha x_t + (1-\alpha)EMA_{t-1}\]

where \(\alpha = \frac{2}{k+1}\) (smoothing factor)

Function: moving_average(data, window_size, method='exponential')
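
Both moving averages can be sketched in a few lines. Two choices here are assumptions rather than documented library behavior: the simple average returns values only for full windows, and the exponential average is seeded with the first observation.

```python
def sma_sketch(data, k):
    """Simple moving average over full windows of length k."""
    return [sum(data[t - k + 1 : t + 1]) / k for t in range(k - 1, len(data))]

def ema_sketch(data, k):
    """Exponential moving average with alpha = 2 / (k + 1)."""
    if not data:
        return []
    alpha = 2 / (k + 1)
    ema = [data[0]]  # assumption: EMA_0 is the first observation
    for x in data[1:]:
        ema.append(alpha * x + (1 - alpha) * ema[-1])
    return ema

data = [1, 2, 3, 4, 5]
print(sma_sketch(data, 3))  # [2.0, 3.0, 4.0]
print(ema_sketch(data, 3))  # [1, 1.5, 2.25, 3.125, 4.0625]
```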


Autocorrelation Function (ACF)

Formula: \[\rho_k = \frac{\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2}\]

Function: autocorrelation(data, max_lag)

Interpretation: Correlation between \(x_t\) and \(x_{t+k}\).


Linear Trend

Model: \[x_t = \beta_0 + \beta_1 t + \epsilon_t\]

Function: linear_trend(data)

Returns: slope (\(\beta_1\)), intercept (\(\beta_0\)), \(R^2\)


First-Order Differencing

Formula: \[\nabla x_t = x_t - x_{t-1}\]

Function: difference(data, lag=1, order=1)

Purpose: Remove trend, achieve stationarity.


Resampling Methods

Bootstrap Confidence Interval

Algorithm:

  1. Draw \(B\) bootstrap samples with replacement

  2. Calculate statistic \(\theta^*_b\) for each sample

  3. Find percentiles of bootstrap distribution

Percentile Method: \[CI = [\theta^*_{\alpha/2}, \theta^*_{1-\alpha/2}]\]

Function: bootstrap(data, statistic, n_iterations, confidence_level)
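
The three steps of the algorithm can be sketched with the standard library as follows. The function name, the `seed` parameter, and the simple index-based percentile lookup are assumptions of this sketch, not a description of the library's implementation.

```python
import random
import statistics

def bootstrap_ci_sketch(data, statistic, n_iterations=2000, confidence_level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 1 and 2: resample with replacement and recompute the statistic
    estimates = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_iterations))
    # Step 3: read off the alpha/2 and 1 - alpha/2 percentiles (crude index lookup)
    alpha = 1 - confidence_level
    lo_idx = int((alpha / 2) * n_iterations)
    hi_idx = min(n_iterations - 1, int((1 - alpha / 2) * n_iterations))
    return estimates[lo_idx], estimates[hi_idx]

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
low, high = bootstrap_ci_sketch(data, statistics.fmean)
print(low, high)
```

Fixing the seed makes the interval reproducible; a different seed gives slightly different endpoints.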


Permutation Test

Test Statistic: \[T_{obs} = f(X_1, X_2)\]

P-value: \[p = \frac{\#\{T_{perm} \geq T_{obs}\}}{B}\]

where \(B\) = number of permutations.

Function: permutation_test(data1, data2, statistic, n_permutations)
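
A minimal sketch of the one-sided p-value defined above, using a difference in means as the statistic. The function name, the `seed` parameter, and the example data are hypothetical.

```python
import random
import statistics

def permutation_test_sketch(data1, data2, statistic, n_permutations=5000, seed=0):
    """Return (T_obs, p) where p is the share of permuted statistics >= T_obs."""
    rng = random.Random(seed)
    t_obs = statistic(data1, data2)
    pooled = list(data1) + list(data2)
    n1 = len(data1)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of the pooled observations
        if statistic(pooled[:n1], pooled[n1:]) >= t_obs:
            count += 1
    return t_obs, count / n_permutations

def mean_diff(a, b):
    return statistics.fmean(a) - statistics.fmean(b)

t_obs, p = permutation_test_sketch([10, 11, 12, 13, 14], [1, 2, 3, 4, 5], mean_diff)
print(t_obs, p)
```

For a two-sided test, one common choice is to compare absolute values of the statistic instead.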


Jackknife Standard Error

Formula: \[SE_{jack} = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}(\theta_{(i)} - \bar{\theta})^2}\]

where \(\theta_{(i)}\) is the statistic computed without observation \(i\).

Function: jackknife(data, statistic)
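
The leave-one-out computation can be sketched as below (the function name is illustrative). For the sample mean, the jackknife standard error coincides with the familiar \(s/\sqrt{n}\), which makes a convenient sanity check.

```python
import math
import statistics

def jackknife_se_sketch(data, statistic):
    """Jackknife standard error from leave-one-out estimates."""
    data = list(data)
    n = len(data)
    # theta_(i): the statistic computed without observation i
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in loo))

data = [1, 2, 3, 4, 5]
se = jackknife_se_sketch(data, statistics.fmean)
print(round(se, 4))  # 0.7071
```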


Effect Sizes

Cohen’s d

Formula: \[d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}\]

where: \[s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}\]

Function: cohens_d(group1, group2, pooled=True)

Interpretation:

  • Small: \(|d| \approx 0.2\)

  • Medium: \(|d| \approx 0.5\)

  • Large: \(|d| \approx 0.8\)


Hedges’ g

Formula: \[g = d \times \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right)\]

Function: hedges_g(group1, group2)

Note: Bias-corrected version of Cohen’s d for small samples.
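
Cohen's d and the Hedges' g correction can be computed together, since g is just d scaled by the small-sample factor. This is a standalone sketch with hypothetical function names and made-up data.

```python
import math
import statistics

def cohens_d_sketch(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    s_pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.fmean(group1) - statistics.fmean(group2)) / s_pooled

def hedges_g_sketch(group1, group2):
    """Cohen's d multiplied by the correction 1 - 3 / (4(n1 + n2) - 9)."""
    n_total = len(group1) + len(group2)
    return cohens_d_sketch(group1, group2) * (1 - 3 / (4 * n_total - 9))

g1, g2 = [2, 4, 6, 8], [1, 3, 5, 7]
d = cohens_d_sketch(g1, g2)
g = hedges_g_sketch(g1, g2)
print(round(d, 4), round(g, 4))  # 0.3873 0.3368
```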


Eta-Squared (η²)

Formula: \[\eta^2 = \frac{SS_{between}}{SS_{total}}\]

Function: eta_squared(groups)

Interpretation: Proportion of total variance explained by group membership.


Partial Eta-Squared

Formula: \[\eta_p^2 = \frac{SS_{effect}}{SS_{effect} + SS_{error}}\]

Function: partial_eta_squared(groups)


Omega-Squared (ω²)

Formula: \[\omega^2 = \frac{SS_{between} - (k-1)MS_{within}}{SS_{total} + MS_{within}}\]

Function: omega_squared(groups)

Note: Less biased than \(\eta^2\), especially for small samples.


Cramér’s V

Formula: \[V = \sqrt{\frac{\chi^2}{n \times \min(r-1, c-1)}}\]

Function: cramers_v(contingency_table)

Range: \(0 \leq V \leq 1\)

Interpretation:

  • Small: \(V \approx 0.1\)

  • Medium: \(V \approx 0.3\)

  • Large: \(V \approx 0.5\)


Odds Ratio

Formula (2×2 table): \[OR = \frac{a \times d}{b \times c}\]

for table: \[\begin{bmatrix} a & b \\ c & d \end{bmatrix}\]

Function: odds_ratio(contingency_table)

Interpretation:

  • \(OR = 1\): no association

  • \(OR > 1\): positive association

  • \(OR < 1\): negative association


Relative Risk

Formula: \[RR = \frac{a/(a+b)}{c/(c+d)}\]

Function: relative_risk(contingency_table)
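
The odds ratio and relative risk can be compared on the same 2×2 table. The sketch assumes the table is given as a nested list `[[a, b], [c, d]]`, matching the layout in the Odds Ratio formula; the counts are made up.

```python
def odds_ratio_sketch(table):
    """OR = (a * d) / (b * c) for a table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def relative_risk_sketch(table):
    """RR = [a / (a + b)] / [c / (c + d)] for a table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: rows are exposed / unexposed, columns are event / no event
table = [[20, 80], [10, 90]]
print(odds_ratio_sketch(table))     # 2.25
print(relative_risk_sketch(table))  # 2.0
```

The two measures differ here because the event is not rare; they move closer together as the event probabilities shrink.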


Cohen’s h

Formula: \[h = 2(\arcsin\sqrt{p_1} - \arcsin\sqrt{p_2})\]

Function: cohens_h(p1, p2)

Use: Effect size for difference between two proportions.


🔬 Power Analysis

Power for t-Test

Formula: \[\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_1 \text{ true})\]

Non-centrality Parameter: \[\delta = \frac{\mu_1 - \mu_0}{\sigma} \sqrt{n}\]

Function: power_t_test(delta=None, n=None, power=None, sig_level=0.05)

Note: Provide any 3 parameters to solve for the 4th.


Sample Size for t-Test

Formula: \[n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2}{\delta^2}\]

where:

  • \(z_{1-\alpha/2}\) = critical value for significance level

  • \(z_{1-\beta}\) = critical value for power

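  • \(\delta\) = difference in means to detect, in the same units as \(\sigma\) (here a raw difference, not the non-centrality parameter)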

Function: power_t_test(delta, power, sig_level)
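
The sample size formula is a normal approximation, so it can be evaluated with `statistics.NormalDist` alone. The sketch below is standalone; its function name and the explicit `sigma` argument are assumptions and do not mirror the `power_t_test` signature. An exact t-based calculation typically gives a slightly larger \(n\).

```python
import math
from statistics import NormalDist

def sample_size_sketch(delta, sigma, power=0.8, sig_level=0.05):
    """Smallest integer n satisfying the normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - sig_level / 2)  # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)               # z_{1 - beta}
    n = (z_alpha + z_beta) ** 2 * sigma**2 / delta**2
    return math.ceil(n)

# Detect a difference of half a standard deviation with 80% power at alpha = 0.05
n_required = sample_size_sketch(delta=0.5, sigma=1.0)
print(n_required)  # 32
```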


Power for ANOVA

Effect Size (Cohen’s f): \[f = \sqrt{\frac{\eta^2}{1-\eta^2}}\]

Function: power_anova(effect_size, k, n=None, power=None, sig_level=0.05)


Minimum Detectable Effect

Formula: \[MDE = \frac{(z_{1-\alpha/2} + z_{1-\beta})\sigma}{\sqrt{n}}\]

Function: minimum_detectable_effect(n, power, sig_level, sigma)


Bayesian Statistics

Beta-Binomial Conjugate Update

Prior: \[p \sim \text{Beta}(\alpha, \beta)\]

Likelihood: \[X | p \sim \text{Binomial}(n, p)\]

Posterior: \[p | X \sim \text{Beta}(\alpha + k, \beta + n - k)\]

Function: beta_binomial_update(prior_alpha, prior_beta, successes, trials)
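
The conjugate update is pure bookkeeping on the two Beta parameters. A minimal standalone sketch follows; the function name and the tuple return value are assumptions for this example.

```python
def beta_update_sketch(prior_alpha, prior_beta, successes, trials):
    """Posterior Beta parameters after observing `successes` in `trials`."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return prior_alpha + successes, prior_beta + trials - successes

# Uniform prior Beta(1, 1), then 7 successes in 10 trials
post_alpha, post_beta = beta_update_sketch(1, 1, successes=7, trials=10)
posterior_mean = post_alpha / (post_alpha + post_beta)
print(post_alpha, post_beta, round(posterior_mean, 4))  # 8 4 0.6667
```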


Normal-Normal Conjugate Update

Prior: \[\mu \sim N(\mu_0, \sigma_0^2)\]

Likelihood: \[X_i | \mu \sim N(\mu, \sigma^2)\]

Posterior: \[\mu | X \sim N(\mu_n, \sigma_n^2)\]

where: \[\mu_n = \frac{\sigma^2\mu_0 + n\sigma_0^2\bar{x}}{\sigma^2 + n\sigma_0^2}\]

\[\sigma_n^2 = \frac{\sigma^2\sigma_0^2}{\sigma^2 + n\sigma_0^2}\]

Function: normal_normal_update(prior_mean, prior_variance, data, data_variance)
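
A worked instance of the posterior mean and variance formulas, assuming a known data variance \(\sigma^2\). The function name and return order are hypothetical.

```python
def normal_update_sketch(prior_mean, prior_variance, data, data_variance):
    """Return (mu_n, sigma_n^2) for a normal prior and known data variance."""
    n = len(data)
    x_bar = sum(data) / n
    denom = data_variance + n * prior_variance
    post_mean = (data_variance * prior_mean + n * prior_variance * x_bar) / denom
    post_variance = (data_variance * prior_variance) / denom
    return post_mean, post_variance

# Prior N(0, 1), three observations with known variance 1 and sample mean 2
mu_n, var_n = normal_update_sketch(0.0, 1.0, [1.0, 2.0, 3.0], 1.0)
print(mu_n, var_n)  # 1.5 0.25
```

The posterior mean sits between the prior mean and the sample mean, and moves toward the sample mean as \(n\) grows.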


Gamma-Poisson Conjugate Update

Prior: \[\lambda \sim \text{Gamma}(\alpha, \beta)\]

Likelihood: \[X_i | \lambda \sim \text{Poisson}(\lambda)\]

Posterior: \[\lambda | X \sim \text{Gamma}(\alpha + \sum x_i, \beta + n)\]

Function: gamma_poisson_update(prior_shape, prior_rate, data)


Credible Interval

Definition: \[P(\theta \in [L, U] | X) = 1 - \alpha\]

Function: credible_interval(distribution, params, credibility)

Note: Bayesian analog of confidence interval.


Highest Density Interval (HDI)

Definition: Shortest interval containing \((1-\alpha)\) of the posterior probability.

Function: highest_density_interval(samples, credibility)

Property: All points inside HDI have higher density than points outside.


Bayes Factor

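Formula: \[BF_{10} = \frac{P(D|H_1)}{P(D|H_0)}\]

Posterior Odds: \[\frac{P(H_1|D)}{P(H_0|D)} = BF_{10} \times \frac{P(H_1)}{P(H_0)}\]

Note: The Bayes factor itself is the ratio of likelihoods; multiplying it by the prior odds gives the posterior odds.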

Function: bayes_factor(likelihood_h1, likelihood_h0, prior_odds)

Interpretation (Jeffreys’ scale, thresholds rounded):

  • \(BF < 1\): Evidence for \(H_0\)

  • \(1 < BF < 3\): Barely worth mentioning

  • \(3 < BF < 10\): Substantial evidence for \(H_1\)

  • \(10 < BF < 30\): Strong evidence

  • \(30 < BF < 100\): Very strong evidence

  • \(BF > 100\): Decisive evidence


📐 Multivariate Analysis

Principal Component Analysis (PCA)

Objective: Find orthogonal directions of maximum variance.

Eigenvalue Decomposition: \[\mathbf{\Sigma} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T\]

where:

  • \(\mathbf{\Sigma}\) = covariance matrix

  • \(\mathbf{V}\) = eigenvectors (principal components)

  • \(\mathbf{\Lambda}\) = diagonal matrix of eigenvalues

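Transformed Data: \[\mathbf{Z} = \mathbf{X}\mathbf{V}\]

where \(\mathbf{X}\) is the data matrix with each column mean-centered, as is standard for covariance-based PCA.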

Function: pca(X, n_components)


Mahalanobis Distance

Formula: \[D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}\]

Function: mahalanobis_distance(X, point)

Interpretation: Distance accounting for correlations and scale differences.


Confidence Intervals

CI for Mean (Known σ)

Formula: \[\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\]

Function: confidence_interval_known_std(mean, std_dev, n, confidence)


CI for Mean (Unknown σ)

Formula: \[\bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}\]

Function: confidence_interval_unknown_std(sample_mean, sample_std, n, confidence)


Central Limit Theorem

Sampling Distribution of the Mean

Properties: \[E[\bar{X}] = \mu\]

\[\text{Var}(\bar{X}) = \frac{\sigma^2}{n}\]

\[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ (approximately, for large } n\text{)}\]

Functions:

  • sampling_distribution_mean(pop_mean)

  • sampling_distribution_variance(pop_std, sample_size)

  • clt_probability_greater_than(x, mean, std_dev, n)

  • clt_probability_less_than(x, mean, std_dev, n)

  • clt_probability_between(x1, x2, mean, std_dev, n)


📚 References

Notation:

  • \(\mu\) = population mean

  • \(\sigma\) = population standard deviation

  • \(\bar{x}\) = sample mean

  • \(s\) = sample standard deviation

  • \(n\) = sample size

  • \(\alpha\) = significance level

  • \(\beta\) = Type II error rate

  • \(1-\beta\) = statistical power

Common Critical Values:

  • \(z_{0.975} = 1.96\) (95% CI, two-tailed)

  • \(z_{0.995} = 2.576\) (99% CI, two-tailed)

  • \(t_{0.975, \infty} \approx 1.96\)


See also: