# Frequently Asked Questions (FAQ) Common questions about Real Simple Stats, answered. --- ## 📦 Installation & Setup ### Q: How do I install Real Simple Stats? **A:** Use pip: ```bash pip install real-simple-stats ``` For the latest development version: ```bash pip install git+https://github.com/kylejones200/real_simple_stats.git ``` --- ### Q: What are the system requirements? **A:** - **Python**: 3.7 or higher - **Dependencies**: NumPy, SciPy (automatically installed) - **Optional**: matplotlib (for plotting), pandas (for data handling) --- ### Q: Can I use this in Google Colab or Jupyter? **A:** Yes! Install in the first cell: ```python !pip install real-simple-stats import real_simple_stats as rss ``` --- ### Q: Do I need to install matplotlib separately? **A:** No, matplotlib is included as a dependency. However, if you only want the statistical functions without plotting, you can skip it. --- ## General Usage ### Q: How do I import the package? **A:** Standard import: ```python import real_simple_stats as rss # Use functions mean = rss.mean([1, 2, 3, 4, 5]) ``` Or import specific functions: ```python from real_simple_stats import mean, median, std_dev mean([1, 2, 3]) ``` --- ### Q: What data types does the package accept? **A:** Most functions accept: - Python lists: `[1, 2, 3, 4, 5]` - NumPy arrays: `np.array([1, 2, 3, 4, 5])` - Tuples: `(1, 2, 3, 4, 5)` For multivariate functions, use lists of lists or 2D NumPy arrays. --- ### Q: Do functions modify my original data? **A:** No! All functions return new values without modifying your input data. ```python data = [1, 2, 3, 4, 5] result = rss.mean(data) # data is unchanged ``` --- ### Q: What's the difference between sample and population functions? **A:** - **Sample functions** (e.g., `sample_std_dev`): Use $n-1$ in denominator (Bessel's correction) - **Population functions** (e.g., `population_std_dev`): Use $n$ in denominator **Rule of thumb**: Use sample functions for real-world data (most common). ```python # Sample standard deviation (n-1) rss.sample_std_dev([1, 2, 3, 4, 5]) # Population standard deviation (n) rss.population_std_dev([1, 2, 3, 4, 5]) ``` --- ## Statistical Tests ### Q: When should I use a t-test vs. z-test? **A:** - **t-test**: Unknown population standard deviation (most common) - **z-test**: Known population standard deviation (rare in practice) ```python # Unknown σ (use t-test) t_stat, p_value = rss.one_sample_t_test(data, mu0=100) # Known σ (use z-test) z_stat, p_value = rss.one_sample_z_test(data, mu0=100, sigma=15) ``` --- ### Q: How do I interpret p-values? **A:** - **p < 0.05**: Statistically significant (reject null hypothesis) - **p ≥ 0.05**: Not statistically significant (fail to reject null hypothesis) **Important**: p-value is NOT the probability that the null hypothesis is true! ```python t_stat, p_value = rss.two_sample_t_test(group1, group2) if p_value < 0.05: print("Significant difference between groups") else: print("No significant difference") ``` --- ### Q: What's the difference between one-tailed and two-tailed tests? **A:** - **Two-tailed** (default): Tests if means are different (either direction) - **One-tailed**: Tests if one mean is specifically greater or less Most Real Simple Stats functions use two-tailed tests by default. --- ### Q: Should I use paired or independent t-test? **A:** - **Paired t-test**: Same subjects measured twice (before/after, matched pairs) - **Independent t-test**: Different subjects in each group ```python # Paired (same subjects) before = [120, 130, 125, 135, 140] after = [115, 125, 120, 130, 135] t_stat, p_value = rss.paired_t_test(before, after) # Independent (different subjects) group1 = [120, 130, 125, 135, 140] group2 = [115, 125, 120, 130, 135] t_stat, p_value = rss.two_sample_t_test(group1, group2) ``` --- ### Q: What sample size do I need? **A:** Use power analysis: ```python # For t-test with medium effect size (d=0.5), 80% power result = rss.power_t_test(delta=0.5, power=0.8, sig_level=0.05) print(f"Need {result['n']} participants per group") ``` --- ## Regression & Correlation ### Q: What's the difference between correlation and regression? **A:** - **Correlation** (`pearson_correlation`): Measures strength of linear relationship (-1 to 1) - **Regression** (`linear_regression`): Predicts one variable from another ```python # Correlation r = rss.pearson_correlation(x, y) # Just a number # Regression slope, intercept, r_value, p_value, std_err = rss.linear_regression(x, y) # Can make predictions: y = slope*x + intercept ``` --- ### Q: How do I interpret R²? **A:** R² (coefficient of determination) = proportion of variance explained - **R² = 0.00**: No predictive power - **R² = 0.25**: Weak relationship - **R² = 0.50**: Moderate relationship - **R² = 0.75**: Strong relationship - **R² = 1.00**: Perfect prediction ```python slope, intercept, r_value, p_value, std_err = rss.linear_regression(x, y) r_squared = r_value ** 2 print(f"Model explains {r_squared*100:.1f}% of variance") ``` --- ### Q: Can I do multiple regression? **A:** Yes! Use `multiple_regression`: ```python X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]] # Multiple predictors y = [2, 4, 5, 4, 5] result = rss.multiple_regression(X, y) print(f"R² = {result['r_squared']:.3f}") print(f"Coefficients: {result['coefficients']}") ``` --- ## Probability & Distributions ### Q: How do I calculate probabilities for normal distribution? **A:** ```python # P(X ≤ x) prob = rss.normal_cdf(x=100, mu=100, sigma=15) # P(X > x) = 1 - P(X ≤ x) prob = 1 - rss.normal_cdf(x=100, mu=100, sigma=15) # P(a < X < b) prob = rss.normal_cdf(b, mu, sigma) - rss.normal_cdf(a, mu, sigma) ``` --- ### Q: What's the difference between PDF and CDF? **A:** - **PDF** (Probability Density Function): Height of distribution curve - **CDF** (Cumulative Distribution Function): Area under curve up to x For probabilities, use **CDF**: ```python # Probability that X ≤ 1.96 for standard normal prob = rss.normal_cdf(1.96, 0, 1) # ≈ 0.975 ``` --- ### Q: How do I find critical values? **A:** ```python # For normal distribution (z-score) z_critical = rss.normal_ppf(0.975, 0, 1) # 1.96 for 95% CI # For chi-square chi_critical = rss.critical_chi_square_value(alpha=0.05, df=5) ``` --- ## 🔄 Advanced Topics ### Q: What's the difference between bootstrap and permutation test? **A:** - **Bootstrap**: Estimates uncertainty (confidence intervals) - **Permutation test**: Tests hypotheses (p-values) ```python # Bootstrap for CI result = rss.bootstrap(data, np.mean, n_iterations=1000) print(f"95% CI: {result['confidence_interval']}") # Permutation test for hypothesis result = rss.permutation_test(group1, group2, lambda d1, d2: np.mean(d1) - np.mean(d2)) print(f"p-value: {result['p_value']}") ``` --- ### Q: When should I use Bayesian vs. frequentist methods? **A:** - **Frequentist** (t-tests, p-values): Traditional, widely accepted - **Bayesian**: Incorporates prior knowledge, gives probability of hypotheses Use Bayesian when: - You have prior information - You want probability statements about parameters - You need to update beliefs with new data ```python # Bayesian update post_alpha, post_beta = rss.beta_binomial_update( prior_alpha=1, prior_beta=1, # Uniform prior successes=7, trials=10 ) # Credible interval (Bayesian CI) lower, upper = rss.credible_interval('beta', {'alpha': post_alpha, 'beta': post_beta}) ``` --- ### Q: What's PCA and when should I use it? **A:** PCA (Principal Component Analysis) reduces dimensions while preserving variance. **Use when:** - You have many correlated variables - You want to visualize high-dimensional data - You need to reduce multicollinearity ```python result = rss.pca(X, n_components=2) print(f"Explained variance: {result['explained_variance']}") ``` --- ## Effect Sizes ### Q: Why do I need effect sizes? **A:** P-values tell you if an effect exists; effect sizes tell you how large it is. **Example:** ```python # Significant but small effect t_stat, p_value = rss.two_sample_t_test(group1, group2) d = rss.cohens_d(group1, group2) print(f"p-value: {p_value:.4f}") # p < 0.05 (significant) print(f"Cohen's d: {d:.3f}") # d = 0.15 (tiny effect) ``` **Interpretation**: Statistically significant but practically meaningless. --- ### Q: Which effect size should I use? **A:** - **Cohen's d**: Comparing two means - **Eta-squared**: ANOVA (multiple groups) - **Cramér's V**: Categorical data (chi-square) - **R²**: Regression ```python # Two groups d = rss.cohens_d(group1, group2) # Multiple groups (ANOVA) eta_sq = rss.eta_squared([group1, group2, group3]) # Categorical v = rss.cramers_v([[10, 20], [30, 40]]) ``` --- ### Q: How do I interpret Cohen's d? **A:** - **Small**: d ≈ 0.2 - **Medium**: d ≈ 0.5 - **Large**: d ≈ 0.8 ```python d = rss.cohens_d(group1, group2) interpretation = rss.interpret_effect_size(d, 'd') print(f"Cohen's d = {d:.3f} ({interpretation})") ``` --- ## 🔧 Technical Questions ### Q: Are the functions vectorized? **A:** Yes, most functions use NumPy internally for efficient computation. --- ### Q: Can I use this with pandas DataFrames? **A:** Yes! Convert columns to lists or arrays: ```python import pandas as pd import real_simple_stats as rss df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}) # Method 1: Convert to list mean_A = rss.mean(df['A'].tolist()) # Method 2: Use values (NumPy array) mean_A = rss.mean(df['A'].values) # Regression slope, intercept, *_ = rss.linear_regression(df['A'].values, df['B'].values) ``` --- ### Q: How accurate are the calculations? **A:** Real Simple Stats uses SciPy and NumPy for numerical computations, which are industry-standard and highly accurate. Results match those from R, SPSS, and other statistical software. --- ### Q: Can I use this for production/research? **A:** Yes! The package is: - Well-tested (86% code coverage) - Based on established statistical methods - Uses reliable numerical libraries (SciPy, NumPy) - Documented with references However, always validate results for critical applications. --- ### Q: Is this package maintained? **A:** Yes! Check the [GitHub repository](https://github.com/kylejones200/real_simple_stats) for: - Latest updates - Issue tracking - Contribution guidelines --- ## Educational Questions ### Q: Can I use this for teaching? **A:** Absolutely! Real Simple Stats is designed for education: - Clear function names - Comprehensive docstrings - Step-by-step examples - Educational focus over performance --- ### Q: Is there a textbook or course that uses this? **A:** While not tied to a specific textbook, Real Simple Stats aligns with standard introductory statistics curricula. See [INTERACTIVE_EXAMPLES.md](INTERACTIVE_EXAMPLES.md) for tutorials. --- ### Q: How does this compare to R or SPSS? **A:** - **Simpler**: Easier to learn than R - **More accessible**: Free and open-source (unlike SPSS) - **Python-based**: Integrates with data science ecosystem - **Educational**: Designed for learning, not just analysis See [MIGRATION_GUIDE.md](MIGRATION_GUIDE.md) for detailed comparisons. --- ## 🐛 Troubleshooting ### Q: I get "ModuleNotFoundError: No module named 'real_simple_stats'" **A:** Install the package: ```bash pip install real-simple-stats ``` Make sure you're using the correct package name (with hyphens). --- ### Q: Functions return unexpected results **A:** Check: 1. **Data format**: Are you passing lists/arrays? 2. **Sample vs. population**: Using correct function? 3. **Parameter order**: Check docstring with `help(rss.function_name)` ```python # Check documentation help(rss.two_sample_t_test) ``` --- ### Q: I get "ValueError: Input arrays must have the same length" **A:** For paired tests and correlation, ensure both arrays have the same length: ```python # Wrong x = [1, 2, 3] y = [4, 5] # Different length! # Correct x = [1, 2, 3] y = [4, 5, 6] # Same length ``` --- ### Q: Plots don't show up **A:** ```python import matplotlib.pyplot as plt import real_simple_stats as rss rss.plot_normal_histogram(data) plt.show() # Add this! ``` --- ### Q: I get warnings about "divide by zero" **A:** This can happen with: - Empty datasets - Zero variance (all values the same) - Zero expected frequencies (chi-square) Check your data: ```python data = [5, 5, 5, 5, 5] std = rss.sample_std_dev(data) # Will be 0 ``` --- ## Best Practices ### Q: What's the recommended workflow? **A:** 1. **Explore data**: Use descriptive statistics 2. **Visualize**: Create plots 3. **Test hypotheses**: Run appropriate tests 4. **Calculate effect sizes**: Assess practical significance 5. **Report results**: Include all relevant statistics ```python import real_simple_stats as rss # 1. Descriptive statistics print(rss.five_number_summary(data)) # 2. Visualize rss.plot_box_plot(data) # 3. Test t_stat, p_value = rss.one_sample_t_test(data, mu0=100) # 4. Effect size d = rss.cohens_d(data, [100]*len(data)) # 5. Report print(f"t({len(data)-1}) = {t_stat:.2f}, p = {p_value:.3f}, d = {d:.2f}") ``` --- ### Q: How should I report results? **A:** Include: - Test statistic and degrees of freedom - P-value - Effect size - Confidence interval (when appropriate) **Example**: ``` "A two-sample t-test revealed a significant difference between groups, t(18) = 2.45, p = .025, d = 0.73, 95% CI [0.5, 3.2]." ``` --- ### Q: Should I correct for multiple comparisons? **A:** Yes, if you're running multiple tests on the same dataset. Common methods: - Bonferroni correction: Divide α by number of tests - False Discovery Rate (FDR) ```python # 3 tests, use α = 0.05/3 = 0.0167 alpha_corrected = 0.05 / 3 ``` --- ## Additional Resources ### Q: Where can I learn more? **A:** - **Documentation**: [ReadTheDocs](https://real-simple-stats.readthedocs.io/) - **Examples**: [Interactive Tutorials](INTERACTIVE_EXAMPLES.md) - **API Reference**: [Function Comparison](API_COMPARISON.md) - **Math Details**: [Mathematical Formulas](MATHEMATICAL_FORMULAS.md) --- ### Q: How do I report bugs or request features? **A:** 1. Check [existing issues](https://github.com/kylejones200/real_simple_stats/issues) 2. Create a new issue with: - Description of problem/feature - Example code (if applicable) - Expected vs. actual behavior --- ### Q: Can I contribute? **A:** Yes! See [CONTRIBUTING.md](../CONTRIBUTING.md) for guidelines. --- ## 📞 Still Have Questions? - **GitHub Issues**: [Ask a question](https://github.com/kylejones200/real_simple_stats/issues) - **Documentation**: [Full docs](https://real-simple-stats.readthedocs.io/) - **Examples**: [Interactive tutorials](INTERACTIVE_EXAMPLES.md) --- **Last Updated**: 2025 **Version**: 0.3.0