
11.1 Python for Data Science

Python is a core language for data science because of its scientific libraries, notebook tooling, and production deployment options.

Data science workflow with Python

  1. Data ingestion
  2. Cleaning and transformation
  3. Exploration and visualization
  4. Modeling and evaluation
  5. Reproducible deployment

Useful foundation modules from the Python standard library

  • statistics: descriptive statistics and variance tools
  • math: numeric operations and constants
  • csv / json: common data exchange formats
  • sqlite3: lightweight database-backed analysis
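As a rough sketch of how these modules can combine, the example below parses JSON records, loads them into an in-memory SQLite database, aggregates with SQL, and checks the spread with statistics and math. The payload, the readings table, and its column names are hypothetical and exist only for illustration.

```python
import json
import math
import sqlite3
import statistics

# Hypothetical records, as they might arrive from a JSON file or API.
payload = (
    '[{"city": "Oslo", "temp_c": 4.5}, {"city": "Oslo", "temp_c": 6.5},'
    ' {"city": "Lima", "temp_c": 19.0}, {"city": "Lima", "temp_c": 21.0}]'
)
records = json.loads(payload)

# Load the records into an in-memory SQLite database for SQL-style analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings (city, temp_c) VALUES (:city, :temp_c)", records
)

# Aggregate in SQL; each result row is a (city, average) pair.
averages = dict(
    conn.execute(
        "SELECT city, AVG(temp_c) FROM readings GROUP BY city ORDER BY city"
    )
)
conn.close()

# Post-process in Python: sample standard deviation across all readings.
temps = [r["temp_c"] for r in records]
spread = statistics.stdev(temps)

print(averages)
print(round(spread, 2))
# stdev is the square root of the sample variance.
print(math.isclose(spread, math.sqrt(statistics.variance(temps))))
```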

Ecosystem packages (official project docs)

  • NumPy: n-dimensional array computing
  • pandas: tabular data processing
  • Matplotlib: plotting and visualizations
  • scikit-learn: machine learning algorithms and evaluation

Reproducibility practices

  • Isolate the environment with venv
  • Pin dependencies
  • Track data schema assumptions in code/tests
  • Separate training, validation, and test datasets
  • Log model parameters and metrics
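Several of these practices can be illustrated with the standard library alone. The sketch below checks a schema assumption, makes a seeded train/validation/test split, and records parameters and metrics as JSON. The helper names split_dataset and check_schema, the field names id and value, and the 60/20/20 split are assumptions chosen for the example, not a prescribed convention.

```python
import json
import random

def split_dataset(rows, seed=42, train_frac=0.6, val_frac=0.2):
    """Shuffle with a fixed seed, then cut into train/validation/test lists."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

def check_schema(row, expected_keys=frozenset({"id", "value"})):
    """Fail fast if a row does not carry the expected fields."""
    missing = expected_keys - set(row)
    if missing:
        raise ValueError(f"row is missing fields: {sorted(missing)}")

# Hypothetical dataset of ten rows.
rows = [{"id": i, "value": i * 0.5} for i in range(10)]
for row in rows:
    check_schema(row)

train, val, test = split_dataset(rows)

# Record the run as one JSON line; a real project might append it to a log file.
run_record = {
    "params": {"seed": 42, "train_frac": 0.6, "val_frac": 0.2},
    "metrics": {"n_train": len(train), "n_val": len(val), "n_test": len(test)},
}
print(json.dumps(run_record, sort_keys=True))
```

Because the shuffle uses a fixed seed, calling split_dataset again on the same rows returns the same three lists, which keeps evaluation results comparable between runs.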

Practical quick example (stdlib)

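import statistics

# Small sample of numeric observations
data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
print(statistics.mean(data))    # arithmetic mean, about 1.607
print(statistics.median(data))  # middle value of the sorted data: 1.25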

Official references