Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python 2nd Edition, ISBN-13: 978-1492072942

Original price was: $50.00. Current price is: $14.99.

Description

[PDF eBook eTextbook] – Available Instantly

  • Publisher: O’Reilly Media; 2nd edition (June 16, 2020)
  • Language: English
  • 360 pages
  • ISBN-10: 149207294X
  • ISBN-13: 978-1492072942

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. The second edition of this popular guide adds comprehensive examples in Python, provides practical guidance on applying statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

  • Why exploratory data analysis is a key preliminary step in data science
  • How random sampling can reduce bias and yield a higher-quality dataset, even with big data
  • How the principles of experimental design yield definitive answers to questions
  • How to use regression to estimate outcomes and detect anomalies
  • Key classification techniques for predicting which categories a record belongs to
  • Statistical machine learning methods that “learn” from data
  • Unsupervised learning methods for extracting meaning from unlabeled data

Table of Contents:

Preface

Conventions Used in This Book

Using Code Examples

O’Reilly Online Learning

How to Contact Us

Acknowledgments

1. Exploratory Data Analysis

Elements of Structured Data

Further Reading

Rectangular Data

Data Frames and Indexes

Nonrectangular Data Structures

Further Reading

Estimates of Location

Mean

Median and Robust Estimates

Example: Location Estimates of Population and Murder Rates

Further Reading

Estimates of Variability

Standard Deviation and Related Estimates

Estimates Based on Percentiles

Example: Variability Estimates of State Population

Further Reading

Exploring the Data Distribution

Percentiles and Boxplots

Frequency Tables and Histograms

Density Plots and Estimates

Further Reading

Exploring Binary and Categorical Data

Mode

Expected Value

Probability

Further Reading

Correlation

Scatterplots

Further Reading

Exploring Two or More Variables

Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data)

Two Categorical Variables

Categorical and Numeric Data

Visualizing Multiple Variables

Further Reading

Summary

2. Data and Sampling Distributions

Random Sampling and Sample Bias

Bias

Random Selection

Size Versus Quality: When Does Size Matter?

Sample Mean Versus Population Mean

Further Reading

Selection Bias

Regression to the Mean

Further Reading

Sampling Distribution of a Statistic

Central Limit Theorem

Standard Error

Further Reading

The Bootstrap

Resampling Versus Bootstrapping

Further Reading

Confidence Intervals

Further Reading

Normal Distribution

Standard Normal and QQ-Plots

Long-Tailed Distributions

Further Reading

Student’s t-Distribution

Further Reading

Binomial Distribution

Further Reading

Chi-Square Distribution

Further Reading

F-Distribution

Further Reading

Poisson and Related Distributions

Poisson Distributions

Exponential Distribution

Estimating the Failure Rate

Weibull Distribution

Further Reading

Summary

3. Statistical Experiments and Significance Testing

A/B Testing

Why Have a Control Group?

Why Just A/B? Why Not C, D,…?

Further Reading

Hypothesis Tests

The Null Hypothesis

Alternative Hypothesis

One-Way Versus Two-Way Hypothesis Tests

Further Reading

Resampling

Permutation Test

Example: Web Stickiness

Exhaustive and Bootstrap Permutation Tests

Permutation Tests: The Bottom Line for Data Science

Further Reading

Statistical Significance and p-Values

p-Value

Alpha

Type 1 and Type 2 Errors

Data Science and p-Values

Further Reading

t-Tests

Further Reading

Multiple Testing

Further Reading

Degrees of Freedom

Further Reading

ANOVA

F-Statistic

Two-Way ANOVA

Further Reading

Chi-Square Test

Chi-Square Test: A Resampling Approach

Chi-Square Test: Statistical Theory

Fisher’s Exact Test

Relevance for Data Science

Further Reading

Multi-Arm Bandit Algorithm

Further Reading

Power and Sample Size

Sample Size

Further Reading

Summary

4. Regression and Prediction

Simple Linear Regression

The Regression Equation

Fitted Values and Residuals

Least Squares

Prediction Versus Explanation (Profiling)

Further Reading

Multiple Linear Regression

Example: King County Housing Data

Assessing the Model

Cross-Validation

Model Selection and Stepwise Regression

Weighted Regression

Further Reading

Prediction Using Regression

The Dangers of Extrapolation

Confidence and Prediction Intervals

Factor Variables in Regression

Dummy Variables Representation

Factor Variables with Many Levels

Ordered Factor Variables

Interpreting the Regression Equation

Correlated Predictors

Multicollinearity

Confounding Variables

Interactions and Main Effects

Regression Diagnostics

Outliers

Influential Values

Heteroskedasticity, Non-Normality, and Correlated Errors

Partial Residual Plots and Nonlinearity

Polynomial and Spline Regression

Polynomial

Splines

Generalized Additive Models

Further Reading

Summary

5. Classification

Naive Bayes

Why Exact Bayesian Classification Is Impractical

The Naive Solution

Numeric Predictor Variables

Further Reading

Discriminant Analysis

Covariance Matrix

Fisher’s Linear Discriminant

A Simple Example

Further Reading

Logistic Regression

Logistic Response Function and Logit

Logistic Regression and the GLM

Generalized Linear Models

Predicted Values from Logistic Regression

Interpreting the Coefficients and Odds Ratios

Linear and Logistic Regression: Similarities and Differences

Assessing the Model

Further Reading

Evaluating Classification Models

Confusion Matrix

The Rare Class Problem

Precision, Recall, and Specificity

ROC Curve

AUC

Lift

Further Reading

Strategies for Imbalanced Data

Undersampling

Oversampling and Up/Down Weighting

Data Generation

Cost-Based Classification

Exploring the Predictions

Further Reading

Summary

6. Statistical Machine Learning

K-Nearest Neighbors

A Small Example: Predicting Loan Default

Distance Metrics

One Hot Encoder

Standardization (Normalization, z-Scores)

Choosing K

KNN as a Feature Engine

Tree Models

A Simple Example

The Recursive Partitioning Algorithm

Measuring Homogeneity or Impurity

Stopping the Tree from Growing

Predicting a Continuous Value

How Trees Are Used

Further Reading

Bagging and the Random Forest

Bagging

Random Forest

Variable Importance

Hyperparameters

Boosting

The Boosting Algorithm

XGBoost

Regularization: Avoiding Overfitting

Hyperparameters and Cross-Validation

Summary

7. Unsupervised Learning

Principal Components Analysis

A Simple Example

Computing the Principal Components

Interpreting Principal Components

Correspondence Analysis

Further Reading

K-Means Clustering

A Simple Example

K-Means Algorithm

Interpreting the Clusters

Selecting the Number of Clusters

Hierarchical Clustering

A Simple Example

The Dendrogram

The Agglomerative Algorithm

Measures of Dissimilarity

Model-Based Clustering

Multivariate Normal Distribution

Mixtures of Normals

Selecting the Number of Clusters

Further Reading

Scaling and Categorical Variables

Scaling the Variables

Dominant Variables

Categorical Data and Gower’s Distance

Problems with Clustering Mixed Data

Summary

Bibliography

Index

Peter Bruce is the founder and Chief Academic Officer of the Institute for Statistics Education at Statistics.com, which offers about 80 courses in statistics and analytics, roughly half of which are aimed at data scientists. He has authored or co-authored several books in statistics and analytics. He earned his bachelor’s degree at Princeton and master’s degrees at Harvard and the University of Maryland.

Andrew Bruce, Principal Research Scientist at Amazon, has over 30 years of experience in statistics and data science in academia, government, and business. The co-author of Applied Wavelet Analysis with S-PLUS, he earned his bachelor’s degree at Princeton and his PhD in statistics at the University of Washington.

Peter Gedeck, Senior Data Scientist at Collaborative Drug Discovery, specializes in the development of machine learning algorithms to predict biological and physicochemical properties of drug candidates. Co-author of Data Mining for Business Analytics, he earned PhDs in chemistry from the University of Erlangen-Nürnberg in Germany and in mathematics from Fernuniversität Hagen, Germany.
