Frequently Asked Questions (FAQ)

Common questions about Real Simple Stats, answered.


📦 Installation & Setup

Q: How do I install Real Simple Stats?

A: Use pip:

pip install real-simple-stats

For the latest development version:

pip install git+https://github.com/kylejones200/real_simple_stats.git

Q: What are the system requirements?

A:

  • Python: 3.7 or higher

  • Dependencies: NumPy, SciPy (automatically installed)

  • Optional: matplotlib (for plotting), pandas (for data handling)


Q: Can I use this in Google Colab or Jupyter?

A: Yes! Install in the first cell:

!pip install real-simple-stats
import real_simple_stats as rss

Q: Do I need to install matplotlib separately?

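A: Usually not. The requirements above list matplotlib as optional: the statistical functions work without it, and only the plotting functions need it. If a plotting function raises an ImportError, install it with pip install matplotlib.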


General Usage

Q: How do I import the package?

A: Standard import:

import real_simple_stats as rss

# Use functions
mean = rss.mean([1, 2, 3, 4, 5])

Or import specific functions:

from real_simple_stats import mean, median, std_dev

mean([1, 2, 3])

Q: What data types does the package accept?

A: Most functions accept:

  • Python lists: [1, 2, 3, 4, 5]

  • NumPy arrays: np.array([1, 2, 3, 4, 5])

  • Tuples: (1, 2, 3, 4, 5)

For multivariate functions, use lists of lists or 2D NumPy arrays.


Q: Do functions modify my original data?

A: No! All functions return new values without modifying your input data.

data = [1, 2, 3, 4, 5]
result = rss.mean(data)
# data is unchanged

Q: What’s the difference between sample and population functions?

A:

  • Sample functions (e.g., sample_std_dev): Use \(n-1\) in denominator (Bessel’s correction)

  • Population functions (e.g., population_std_dev): Use \(n\) in denominator

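Rule of thumb: Use sample functions when your data are a sample drawn from a larger population, which is the usual case with real-world data. Use population functions only when you have measured every member of the population.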

# Sample standard deviation (n-1)
rss.sample_std_dev([1, 2, 3, 4, 5])

# Population standard deviation (n)
rss.population_std_dev([1, 2, 3, 4, 5])
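
To see what the two denominators do, here is a minimal standard-library sketch. The helper std_dev is hypothetical and written for illustration only; it is not the package's implementation.

```python
import math


def std_dev(values, sample=True):
    """Standard deviation with an n-1 (sample) or n (population) denominator."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    denominator = n - 1 if sample else n
    return math.sqrt(squared_deviations / denominator)


data = [1, 2, 3, 4, 5]
sample_sd = std_dev(data, sample=True)       # sqrt(10 / 4) ≈ 1.581
population_sd = std_dev(data, sample=False)  # sqrt(10 / 5) ≈ 1.414
print(sample_sd, population_sd)
```

The sample version is always the larger of the two, and the gap shrinks as n grows.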

Statistical Tests

Q: When should I use a t-test vs. z-test?

A:

  • t-test: Unknown population standard deviation (most common)

  • z-test: Known population standard deviation (rare in practice)

# Unknown σ (use t-test)
t_stat, p_value = rss.one_sample_t_test(data, mu0=100)

# Known σ (use z-test)
z_stat, p_value = rss.one_sample_z_test(data, mu0=100, sigma=15)
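
The two statistics differ only in which standard deviation goes into the standard error. The sketch below computes both by hand with the standard library; the data values are made up for illustration, and the code is not the package's implementation.

```python
import math
import statistics

data = [102, 98, 110, 105, 95, 108, 101, 99]  # illustrative values
mu0 = 100
n = len(data)
sample_mean = statistics.mean(data)  # 102.25

# z-test: the population standard deviation is assumed known
sigma = 15
z_stat = (sample_mean - mu0) / (sigma / math.sqrt(n))  # ≈ 0.424

# t-test: the standard deviation is estimated from the sample (n-1 denominator)
s = statistics.stdev(data)
t_stat = (sample_mean - mu0) / (s / math.sqrt(n))  # ≈ 1.243

print(f"z = {z_stat:.3f}, t = {t_stat:.3f}")
```

The z statistic is compared against the standard normal distribution, the t statistic against a t distribution with n-1 degrees of freedom.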

Q: How do I interpret p-values?

A:

  • p < 0.05: Statistically significant at the conventional α = 0.05 level (reject null hypothesis)

  • p ≥ 0.05: Not statistically significant at that level (fail to reject null hypothesis)

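Important: The p-value is NOT the probability that the null hypothesis is true! It is the probability of seeing data at least as extreme as yours, assuming the null hypothesis is true.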

t_stat, p_value = rss.two_sample_t_test(group1, group2)

if p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference")
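
One way to build intuition for what a p-value measures is to simulate many experiments in which the null hypothesis really is true. In the sketch below (standard library only; the helper z_test_p_value is hypothetical), roughly 5% of those experiments still come out "significant" at the 0.05 level.

```python
import math
import random
from statistics import NormalDist

random.seed(0)
std_normal = NormalDist()


def z_test_p_value(sample, mu0, sigma):
    """Two-tailed p-value for a one-sample z-test with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return 2 * (1 - std_normal.cdf(abs(z)))


n_experiments = 2000
false_positives = 0
for _ in range(n_experiments):
    # The null hypothesis is true here: the data really come from N(0, 1)
    sample = [random.gauss(0, 1) for _ in range(30)]
    if z_test_p_value(sample, mu0=0, sigma=1) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f"Share of 'significant' results under a true null: {false_positive_rate:.3f}")
```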

Q: What’s the difference between one-tailed and two-tailed tests?

A:

  • Two-tailed (default): Tests if means are different (either direction)

  • One-tailed: Tests if one mean is specifically greater or less

Most Real Simple Stats functions use two-tailed tests by default.
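
The same test statistic gives different p-values depending on the number of tails. A minimal sketch for a z statistic, using only the standard library (the value z = 1.8 is chosen for illustration):

```python
from statistics import NormalDist

z = 1.8
std_normal = NormalDist()

# Two-tailed: extreme values in either direction count as evidence
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))   # ≈ 0.072

# One-tailed ("greater"): only large positive values count
p_one_tailed_greater = 1 - std_normal.cdf(z)      # ≈ 0.036

print(f"two-tailed: {p_two_tailed:.4f}, one-tailed: {p_one_tailed_greater:.4f}")
```

Here the one-tailed test is significant at 0.05 and the two-tailed test is not, which is why the direction should be chosen before looking at the data.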


Q: Should I use paired or independent t-test?

A:

  • Paired t-test: Same subjects measured twice (before/after, matched pairs)

  • Independent t-test: Different subjects in each group

# Paired (same subjects)
before = [120, 130, 125, 135, 140]
after = [115, 126, 118, 131, 133]  # differences must vary; identical differences give zero variance
t_stat, p_value = rss.paired_t_test(before, after)

# Independent (different subjects)
group1 = [120, 130, 125, 135, 140]
group2 = [115, 125, 120, 130, 135]
t_stat, p_value = rss.two_sample_t_test(group1, group2)
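
Choosing the right test matters because the two statistics can differ a lot on the same numbers. The sketch below computes both by hand with the standard library, using made-up before/after values; the independent version assumes equal group sizes and a pooled variance, and neither formula is claimed to be the package's implementation.

```python
import math
import statistics

before = [120, 130, 125, 135, 140]
after = [115, 126, 118, 131, 133]  # illustrative values
n = len(before)

# Paired: work with the per-subject differences
diffs = [b - a for b, a in zip(before, after)]
t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Independent (equal n, pooled variance): compare the two group means
pooled_var = (statistics.variance(before) + statistics.variance(after)) / 2
mean_diff = statistics.mean(before) - statistics.mean(after)
t_independent = mean_diff / math.sqrt(pooled_var * 2 / n)

print(f"paired t = {t_paired:.2f}, independent t = {t_independent:.2f}")
# paired t ≈ 7.96, independent t ≈ 1.08
```

Pairing removes the between-subject variation, so a consistent within-subject change stands out much more clearly.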

Q: What sample size do I need?

A: Use power analysis:

# For t-test with medium effect size (d=0.5), 80% power
result = rss.power_t_test(delta=0.5, power=0.8, sig_level=0.05)
print(f"Need {result['n']} participants per group")
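
For a rough cross-check, the per-group sample size for a two-sided, two-sample comparison can be approximated with the normal distribution: n ≈ 2 * ((z_(1-α/2) + z_power) / d)². The helper below is a hypothetical standard-library sketch of that approximation, not the package's method; an exact t-based calculation typically gives a slightly larger n.

```python
import math
from statistics import NormalDist


def approx_n_per_group(d, power=0.8, sig_level=0.05):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - sig_level / 2)  # ≈ 1.96 for α = 0.05
    z_power = std_normal.inv_cdf(power)              # ≈ 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


n_medium = approx_n_per_group(d=0.5)  # 63
n_small = approx_n_per_group(d=0.2)
print(n_medium, n_small)
```

Halving the effect size roughly quadruples the required sample size, since d enters the formula squared.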

Regression & Correlation

Q: What’s the difference between correlation and regression?

A:

  • Correlation (pearson_correlation): Measures strength of linear relationship (-1 to 1)

  • Regression (linear_regression): Predicts one variable from another

# Correlation
r = rss.pearson_correlation(x, y)  # Just a number

# Regression
slope, intercept, r_value, p_value, std_err = rss.linear_regression(x, y)
# Can make predictions: y = slope*x + intercept
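
Both quantities come from the same sums of squares, which the following standard-library sketch makes explicit. The data are illustrative and the code is a hand calculation, not the package's implementation.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # 6
s_xx = sum((a - mean_x) ** 2 for a in x)                       # 10
s_yy = sum((b - mean_y) ** 2 for b in y)                       # 6

# Correlation: a unitless strength of linear association
r = s_xy / math.sqrt(s_xx * s_yy)        # ≈ 0.775

# Regression: a line you can use for prediction
slope = s_xy / s_xx                      # 0.6
intercept = mean_y - slope * mean_x      # 2.2
predicted_at_6 = slope * 6 + intercept   # ≈ 5.8

print(f"r = {r:.3f}, y = {slope:.1f}*x + {intercept:.1f}")
```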

Q: How do I interpret R²?

A: R² (coefficient of determination) is the proportion of the variance in y that the model explains. As a rough guide (conventions vary by field):

  • R² = 0.00: No predictive power

  • R² = 0.25: Weak relationship

  • R² = 0.50: Moderate relationship

  • R² = 0.75: Strong relationship

  • R² = 1.00: Perfect prediction

slope, intercept, r_value, p_value, std_err = rss.linear_regression(x, y)
r_squared = r_value ** 2
print(f"Model explains {r_squared*100:.1f}% of variance")
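
"Proportion of variance explained" can be checked directly from the residuals: R² = 1 - SS_residual / SS_total. The sketch below does this by hand for a simple linear fit, using the standard library and illustrative data.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
    (a - mean_x) ** 2 for a in x
)
intercept = mean_y - slope * mean_x

# Variation left over after the fit vs. total variation in y
predictions = [slope * a + intercept for a in x]
ss_residual = sum((b - p) ** 2 for b, p in zip(y, predictions))  # ≈ 2.4
ss_total = sum((b - mean_y) ** 2 for b in y)                     # 6
r_squared = 1 - ss_residual / ss_total                           # ≈ 0.6

print(f"Model explains {r_squared * 100:.1f}% of variance")
```

For simple linear regression this equals the squared Pearson correlation.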

Q: Can I do multiple regression?

A: Yes! Use multiple_regression:

X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]  # Multiple predictors (must not be perfectly collinear)
y = [2, 4, 5, 4, 5]

result = rss.multiple_regression(X, y)
print(f"R² = {result['r_squared']:.3f}")
print(f"Coefficients: {result['coefficients']}")

Probability & Distributions

Q: How do I calculate probabilities for normal distribution?

A:

# P(X ≤ x)
prob = rss.normal_cdf(x=100, mu=100, sigma=15)

# P(X > x) = 1 - P(X ≤ x)
prob = 1 - rss.normal_cdf(x=100, mu=100, sigma=15)

# P(a < X < b)
prob = rss.normal_cdf(b, mu, sigma) - rss.normal_cdf(a, mu, sigma)
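
If you want to sanity-check these patterns without the package, Python's built-in statistics.NormalDist gives the same kind of CDF. A small sketch with mu = 100 and sigma = 15 (the variable names are illustrative):

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=15)

p_at_most_100 = scores.cdf(100)                       # P(X ≤ 100) = 0.5
p_above_115 = 1 - scores.cdf(115)                     # P(X > 115) ≈ 0.159
p_between_85_115 = scores.cdf(115) - scores.cdf(85)   # P(85 < X < 115) ≈ 0.683

print(p_at_most_100, p_above_115, p_between_85_115)
```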

Q: What’s the difference between PDF and CDF?

A:

  • PDF (Probability Density Function): Height of the distribution curve at x. For a continuous distribution this is a density, not a probability

  • CDF (Cumulative Distribution Function): Area under curve up to x

For probabilities, use CDF:

# Probability that X ≤ 1.96 for standard normal
prob = rss.normal_cdf(1.96, 0, 1)  # ≈ 0.975
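
The link between the two is that the CDF is the accumulated area under the PDF. The sketch below checks this numerically with the standard library; integrate_pdf is a hypothetical helper using the trapezoid rule, with -8 standing in for minus infinity.

```python
from statistics import NormalDist

std_normal = NormalDist()

# A density value is a curve height, not a probability
density_at_zero = std_normal.pdf(0)  # ≈ 0.399


def integrate_pdf(upper, lower=-8.0, steps=10_000):
    """Trapezoid-rule area under the standard normal PDF from lower to upper."""
    width = (upper - lower) / steps
    total = 0.5 * (std_normal.pdf(lower) + std_normal.pdf(upper))
    for i in range(1, steps):
        total += std_normal.pdf(lower + i * width)
    return total * width


area = integrate_pdf(1.96)
cdf_value = std_normal.cdf(1.96)  # ≈ 0.975
print(f"area under PDF: {area:.5f}, CDF: {cdf_value:.5f}")
```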

Q: How do I find critical values?

A:

# For normal distribution (z-score)
z_critical = rss.normal_ppf(0.975, 0, 1)  # 1.96 for 95% CI

# For chi-square
chi_critical = rss.critical_chi_square_value(alpha=0.05, df=5)
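
A critical value is just the inverse CDF evaluated at the right tail probability. For a two-sided interval with confidence level c, that probability is 1 - (1 - c)/2. A standard-library sketch for z critical values (the helper z_critical is hypothetical):

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical z value for a given confidence level."""
    upper_tail_point = 1 - (1 - confidence) / 2
    return NormalDist().inv_cdf(upper_tail_point)


z_90 = z_critical(0.90)  # ≈ 1.645
z_95 = z_critical(0.95)  # ≈ 1.960
z_99 = z_critical(0.99)  # ≈ 2.576
print(z_90, z_95, z_99)
```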

🔄 Advanced Topics

Q: What’s the difference between bootstrap and permutation test?

A:

  • Bootstrap: Estimates uncertainty (confidence intervals)

  • Permutation test: Tests hypotheses (p-values)

import numpy as np

# Bootstrap for CI
result = rss.bootstrap(data, np.mean, n_iterations=1000)
print(f"95% CI: {result['confidence_interval']}")

# Permutation test for hypothesis
result = rss.permutation_test(group1, group2,
                               lambda d1, d2: np.mean(d1) - np.mean(d2))
print(f"p-value: {result['p_value']}")
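
The mechanics behind both methods are short enough to write out. The sketch below is a standard-library illustration with hypothetical helpers (bootstrap_ci, permutation_p_value) and made-up data; the bootstrap resamples with replacement to get a percentile interval, while the permutation test shuffles group labels to see how often a difference as large as the observed one arises by chance.

```python
import random
import statistics

rng = random.Random(42)


def bootstrap_ci(data, stat, n_iterations=1000, confidence=0.95):
    """Percentile bootstrap confidence interval for stat(data)."""
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_iterations)
    )
    lower_idx = round((1 - confidence) / 2 * n_iterations)
    upper_idx = round((1 + confidence) / 2 * n_iterations) - 1
    return estimates[lower_idx], estimates[upper_idx]


def permutation_p_value(group_a, group_b, n_iterations=1000):
    """Two-sided permutation p-value for a difference in means."""
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_iterations):
        rng.shuffle(pooled)
        shuffled_a = pooled[: len(group_a)]
        shuffled_b = pooled[len(group_a):]
        diff = abs(statistics.mean(shuffled_a) - statistics.mean(shuffled_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_iterations


group1 = [120, 130, 125, 135, 140]
group2 = [115, 126, 118, 131, 133]
ci_lower, ci_upper = bootstrap_ci(group1, statistics.mean)
p_value = permutation_p_value(group1, group2)
print(f"95% CI for mean of group1: ({ci_lower}, {ci_upper}); p-value: {p_value}")
```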

Q: When should I use Bayesian vs. frequentist methods?

A:

  • Frequentist (t-tests, p-values): Traditional, widely accepted

  • Bayesian: Incorporates prior knowledge, gives probability of hypotheses

Use Bayesian when:

  • You have prior information

  • You want probability statements about parameters

  • You need to update beliefs with new data

# Bayesian update
post_alpha, post_beta = rss.beta_binomial_update(
    prior_alpha=1, prior_beta=1,  # Uniform prior
    successes=7, trials=10
)

# Credible interval (Bayesian CI)
lower, upper = rss.credible_interval('beta',
                                      {'alpha': post_alpha, 'beta': post_beta})
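
The beta-binomial update itself is simple arithmetic: add the successes to alpha and the failures to beta. The sketch below shows that step with the standard library and then approximates a 95% credible interval by sampling from the posterior. This is a Monte Carlo illustration, not how the package computes its interval.

```python
import random

prior_alpha, prior_beta = 1, 1  # uniform prior
successes, trials = 7, 10

# Conjugate update: successes go to alpha, failures go to beta
post_alpha = prior_alpha + successes            # 8
post_beta = prior_beta + (trials - successes)   # 4
posterior_mean = post_alpha / (post_alpha + post_beta)  # ≈ 0.667

# Approximate equal-tailed 95% credible interval from posterior draws
rng = random.Random(0)
draws = sorted(rng.betavariate(post_alpha, post_beta) for _ in range(20_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws)) - 1]

print(f"Posterior: Beta({post_alpha}, {post_beta}), mean = {posterior_mean:.3f}")
print(f"Approximate 95% credible interval: ({lower:.2f}, {upper:.2f})")
```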

Q: What’s PCA and when should I use it?

A: PCA (Principal Component Analysis) reduces dimensions while preserving variance.

Use when:

  • You have many correlated variables

  • You want to visualize high-dimensional data

  • You need to reduce multicollinearity

result = rss.pca(X, n_components=2)
print(f"Explained variance: {result['explained_variance']}")

Effect Sizes

Q: Why do I need effect sizes?

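A: A p-value tells you whether an effect can be distinguished from chance; an effect size tells you how large the effect is. With a large enough sample, even a trivial difference becomes statistically significant.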

Example:

# Significant but small effect
t_stat, p_value = rss.two_sample_t_test(group1, group2)
d = rss.cohens_d(group1, group2)

print(f"p-value: {p_value:.4f}")  # p < 0.05 (significant)
print(f"Cohen's d: {d:.3f}")      # d = 0.15 (tiny effect)

Interpretation: Statistically significant but practically meaningless.


Q: Which effect size should I use?

A:

  • Cohen’s d: Comparing two means

  • Eta-squared: ANOVA (multiple groups)

  • Cramér’s V: Categorical data (chi-square)

  • R²: Regression

# Two groups
d = rss.cohens_d(group1, group2)

# Multiple groups (ANOVA)
eta_sq = rss.eta_squared([group1, group2, group3])

# Categorical
v = rss.cramers_v([[10, 20], [30, 40]])
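
For reference, two of these effect sizes are easy to compute by hand. The sketch below uses only the standard library; cramers_v and eta_squared here are hypothetical re-implementations for illustration, and this Cramér's V applies no continuity correction, which may differ from what the package does.

```python
import math


def cramers_v(table):
    """Cramér's V from a contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


def eta_squared(groups):
    """Eta-squared: between-group sum of squares over total sum of squares."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    return ss_between / ss_total


v = cramers_v([[10, 20], [30, 40]])                       # ≈ 0.089
eta_sq = eta_squared([[1, 2, 3], [2, 3, 4], [5, 6, 7]])   # 26 / 32 = 0.8125
print(f"Cramér's V = {v:.3f}, eta-squared = {eta_sq:.4f}")
```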

Q: How do I interpret Cohen’s d?

A:

  • Small: d ≈ 0.2

  • Medium: d ≈ 0.5

  • Large: d ≈ 0.8

d = rss.cohens_d(group1, group2)
interpretation = rss.interpret_effect_size(d, 'd')
print(f"Cohen's d = {d:.3f} ({interpretation})")
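
Cohen's d is the difference in means divided by the pooled standard deviation. The sketch below computes it by hand and attaches a label using the benchmarks above; the cut-offs in label_effect are an assumption for illustration (one common reading of Cohen's conventions) and may not match the labels returned by the package.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled (n-1 weighted) standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


def label_effect(d):
    """Assumed cut-offs: below 0.2 negligible, 0.2 small, 0.5 medium, 0.8 large."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


group1 = [120, 130, 125, 135, 140]
group2 = [115, 126, 118, 131, 133]
d = cohens_d(group1, group2)  # ≈ 0.684
label = label_effect(d)       # "medium" under the assumed cut-offs
print(f"Cohen's d = {d:.3f} ({label})")
```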

🔧 Technical Questions

Q: Are the functions vectorized?

A: Yes, most functions use NumPy internally for efficient computation.


Q: Can I use this with pandas DataFrames?

A: Yes! Convert columns to lists or arrays:

import pandas as pd
import real_simple_stats as rss

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Method 1: Convert to list
mean_A = rss.mean(df['A'].tolist())

# Method 2: Use values (NumPy array)
mean_A = rss.mean(df['A'].values)

# Regression
slope, intercept, *_ = rss.linear_regression(df['A'].values, df['B'].values)

Q: How accurate are the calculations?

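A: Real Simple Stats uses SciPy and NumPy for numerical computations, which are industry-standard libraries working in double-precision floating point. Results should agree with R, SPSS, and other statistical software up to rounding, provided the same options are used (for example, sample vs. population formulas or one- vs. two-tailed tests).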


Q: Can I use this for production/research?

A: Yes! The package is:

  • Well-tested (86% code coverage)

  • Based on established statistical methods

  • Uses reliable numerical libraries (SciPy, NumPy)

  • Documented with references

However, always validate results for critical applications.


Q: Is this package maintained?

A: Yes! Check the GitHub repository for:

  • Latest updates

  • Issue tracking

  • Contribution guidelines


Educational Questions

Q: Can I use this for teaching?

A: Absolutely! Real Simple Stats is designed for education:

  • Clear function names

  • Comprehensive docstrings

  • Step-by-step examples

  • Educational focus over performance


Q: Is there a textbook or course that uses this?

A: While not tied to a specific textbook, Real Simple Stats aligns with standard introductory statistics curricula. See INTERACTIVE_EXAMPLES.md for tutorials.


Q: How does this compare to R or SPSS?

A:

  • Simpler: Easier to learn than R

  • More accessible: Free and open-source (unlike SPSS)

  • Python-based: Integrates with data science ecosystem

  • Educational: Designed for learning, not just analysis

See MIGRATION_GUIDE.md for detailed comparisons.


🐛 Troubleshooting

Q: I get “ModuleNotFoundError: No module named ‘real_simple_stats’”

A: Install the package:

pip install real-simple-stats

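Make sure you’re using the correct name in each place: the pip package name uses hyphens (real-simple-stats), while the import name uses underscores (real_simple_stats). Also check that pip installed into the same Python environment you are running.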
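
A quick way to check whether the current interpreter can see the package is to ask importlib, as in this standard-library sketch (is_installed is a hypothetical helper). If it prints False, running pip through that same interpreter (python -m pip install real-simple-stats) usually resolves the mismatch.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the current interpreter can find the module."""
    return importlib.util.find_spec(module_name) is not None


print("Interpreter:", sys.executable)
print("real_simple_stats importable:", is_installed("real_simple_stats"))
```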


Q: Functions return unexpected results

A: Check:

  1. Data format: Are you passing lists/arrays?

  2. Sample vs. population: Using correct function?

  3. Parameter order: Check docstring with help(rss.function_name)

# Check documentation
help(rss.two_sample_t_test)

Q: I get “ValueError: Input arrays must have the same length”

A: For paired tests and correlation, ensure both arrays have the same length:

# Wrong
x = [1, 2, 3]
y = [4, 5]  # Different length!

# Correct
x = [1, 2, 3]
y = [4, 5, 6]  # Same length

Q: Plots don’t show up

A:

import matplotlib.pyplot as plt
import real_simple_stats as rss

rss.plot_normal_histogram(data)
plt.show()  # Add this!

Q: I get warnings about “divide by zero”

A: This can happen with:

  • Empty datasets

  • Zero variance (all values the same)

  • Zero expected frequencies (chi-square)

Check your data:

data = [5, 5, 5, 5, 5]
std = rss.sample_std_dev(data)  # Will be 0
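
One way to avoid these warnings is to check the data before dividing by a standard deviation. The helper below is a hypothetical standard-library sketch of that kind of guard, not part of the package.

```python
import statistics


def safe_z_scores(values):
    """Standardize values, raising a clear error instead of dividing by zero."""
    if len(values) < 2:
        raise ValueError("Need at least two values to estimate a standard deviation")
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("All values are identical (zero variance)")
    mean = statistics.mean(values)
    return [(v - mean) / sd for v in values]


z_scores = safe_z_scores([1, 2, 3])  # [-1.0, 0.0, 1.0]

try:
    safe_z_scores([5, 5, 5, 5, 5])
except ValueError as error:
    message = str(error)
    print(f"Skipped: {message}")
```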

Best Practices


Q: How should I report results?

A: Include:

  • Test statistic and degrees of freedom

  • P-value

  • Effect size

  • Confidence interval (when appropriate)

Example:

"A two-sample t-test revealed a significant difference between groups,
t(18) = 2.45, p = .025, d = 0.73, 95% CI [0.5, 3.2]."

Q: Should I correct for multiple comparisons?

A: Yes, if you’re running multiple tests on the same dataset. Common methods:

  • Bonferroni correction: Divide α by number of tests

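  • False Discovery Rate (FDR): Control the expected proportion of false positives among the significant results (e.g., the Benjamini-Hochberg procedure); less conservative than Bonferroni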

# 3 tests, use α = 0.05/3 = 0.0167
alpha_corrected = 0.05 / 3
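
To compare the two approaches on the same p-values, here is a minimal standard-library sketch. Both helpers are hypothetical illustrations; the FDR version follows the Benjamini-Hochberg step-up procedure, and the p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p <= rank/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected


p_values = [0.01, 0.04, 0.03]
bonferroni_result = bonferroni(p_values)   # [True, False, False]
fdr_result = benjamini_hochberg(p_values)  # [True, True, True]
print(bonferroni_result, fdr_result)
```

Bonferroni keeps only the smallest p-value here, while the FDR procedure keeps all three, which illustrates how much more conservative Bonferroni is.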

Additional Resources

Q: Where can I learn more?

A:


Q: How do I report bugs or request features?

A:

  1. Check existing issues

  2. Create a new issue with:

    • Description of problem/feature

    • Example code (if applicable)

    • Expected vs. actual behavior


Q: Can I contribute?

A: Yes! See CONTRIBUTING.md for guidelines.


📞 Still Have Questions?


Last Updated: 2025

Version: 0.3.0