
11.1 Python for Data Science

Python is a core language for data science because it combines mature scientific libraries, interactive notebooks, and practical options for deploying models to production.

Data science workflow with Python

Data ingestion
Cleaning and transformation
Exploration and visualization
Modeling and evaluation
Reproducible deployment
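
The first four steps above can be sketched end to end with the standard library alone. This is a minimal illustration, not a recommended pipeline: the inline CSV, the column names `hours` and `score`, and the helper functions are all hypothetical, and the "model" is a hand-written least-squares line. Deployment is left out.

```python
import csv
import io
import statistics

# Hypothetical inline CSV standing in for a real data file.
RAW_CSV = """hours,score
1,52
2,55
3,
4,61
5,64
"""

def ingest(text):
    """Data ingestion: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning and transformation: drop incomplete rows, convert to floats."""
    return [
        (float(r["hours"]), float(r["score"]))
        for r in rows
        if r["hours"] and r["score"]
    ]

def fit_line(pairs):
    """Modeling: ordinary least squares with a single predictor."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

def mean_abs_error(pairs, slope, intercept):
    """Evaluation: average absolute gap between prediction and target."""
    return statistics.mean([abs(y - (slope * x + intercept)) for x, y in pairs])

pairs = clean(ingest(RAW_CSV))

# Exploration: a quick summary of the target column.
print("rows kept:", len(pairs))
print("median score:", statistics.median([y for _, y in pairs]))

# Fit on all but the last row, then evaluate on the held-out row.
train_pairs, held_out = pairs[:-1], pairs[-1:]
slope, intercept = fit_line(train_pairs)
error = mean_abs_error(held_out, slope, intercept)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("held-out MAE:", round(error, 3))
```

Keeping each stage in its own small function makes it easier to test the stages separately and to swap in pandas or scikit-learn later.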

Useful foundation modules from the Python standard library

statistics: descriptive statistics and variance tools
math: numeric operations and constants
csv / json: common data exchange formats
sqlite3: lightweight database-backed analysis
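
As a rough sketch of how these modules fit together, the example below loads JSON records, stores them in an in-memory SQLite database, and lets SQL compute a per-group average. The payload, the table name `readings`, and the column names are invented for illustration.

```python
import json
import math
import sqlite3

# Hypothetical JSON payload standing in for an exported dataset.
payload = """[
    {"city": "A", "temp_c": 21.5},
    {"city": "B", "temp_c": 18.0},
    {"city": "A", "temp_c": 23.5}
]"""
records = json.loads(payload)

# An in-memory database avoids touching the filesystem.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings (city, temp_c) VALUES (:city, :temp_c)",
    records,
)

# Let SQL do the grouping and averaging.
averages = dict(
    conn.execute(
        "SELECT city, AVG(temp_c) FROM readings GROUP BY city ORDER BY city"
    ).fetchall()
)
conn.close()

print(averages)
# math.isclose is a safer way to compare floats than ==.
print(math.isclose(averages["A"], 22.5))
```

For a file-backed analysis, the same code should work with a file path in place of `":memory:"`.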

Ecosystem packages (official project docs)

NumPy: n-dimensional array computing
pandas: tabular data processing
Matplotlib: plotting and visualizations
scikit-learn: machine learning algorithms and evaluation

Reproducibility practices

Isolate the environment with venv
Pin dependency versions
Track data schema assumptions in code/tests
Separate training, validation, and test datasets
Log model parameters and metrics
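
Several of these practices can be expressed directly in code. The sketch below assumes a hypothetical two-column schema and hypothetical helper names (`check_schema`, `split_dataset`); it validates rows against the assumed schema, makes a seeded train/validation/test split, and writes the run's parameters and metrics as one JSON line.

```python
import json
import random

def check_schema(row, expected):
    """Fail fast when a row does not match the assumed columns and types."""
    if set(row) != set(expected):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for name, kind in expected.items():
        if not isinstance(row[name], kind):
            raise TypeError(f"{name} should be {kind.__name__}")

def split_dataset(rows, seed=42, train=0.7, validation=0.15):
    """Shuffle a copy with a fixed seed, then cut into three disjoint parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

EXPECTED_SCHEMA = {"id": int, "value": float}  # hypothetical schema
rows = [{"id": i, "value": i * 0.5} for i in range(20)]
for row in rows:
    check_schema(row, EXPECTED_SCHEMA)

train_rows, val_rows, test_rows = split_dataset(rows)

# One JSON line per run keeps parameters and metrics together and diffable.
run_record = {
    "params": {"seed": 42, "train": 0.7, "validation": 0.15},
    "metrics": {
        "n_train": len(train_rows),
        "n_val": len(val_rows),
        "n_test": len(test_rows),
    },
}
print(json.dumps(run_record, sort_keys=True))
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the split independent of any other random calls in the program.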

Practical quick example (stdlib)

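import statistics

# Sample measurements to summarize
data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
print(statistics.mean(data))    # arithmetic mean: 1.6071428571428572
print(statistics.median(data))  # middle value of the sorted data: 1.25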
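
The same module also covers spread, which the mean and median do not capture. The following sketch reuses the data above; the variable names are only illustrative, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]

# Sample statistics divide by n - 1; population statistics divide by n.
sample_variance = statistics.variance(data)
sample_stdev = statistics.stdev(data)
population_variance = statistics.pvariance(data)

# Three cut points that split the data into four groups
# (uses the default "exclusive" method).
quartiles = statistics.quantiles(data, n=4)

print(sample_variance)
print(sample_stdev)
print(population_variance)
print(quartiles)
```

Whether the sample or the population form is appropriate depends on whether the data is a sample drawn from a larger population or the whole population itself.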

Official references

Python stdlib tour: https://docs.python.org/3/tutorial/stdlib.html
statistics: https://docs.python.org/3/library/statistics.html
csv: https://docs.python.org/3/library/csv.html
json: https://docs.python.org/3/library/json.html
NumPy docs: https://numpy.org/doc/
pandas docs: https://pandas.pydata.org/docs/
Matplotlib docs: https://matplotlib.org/stable/users/index.html
scikit-learn docs: https://scikit-learn.org/stable/user_guide.html
