11.1 Python for Data Science
Python is a core language for data science because of its scientific libraries, notebook tooling, and options for production deployment.
Data science workflow with Python
- Data ingestion
- Cleaning and transformation
- Exploration and visualization
- Modeling and evaluation
- Reproducible deployment
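The first four stages above can be sketched end to end with the standard library alone. This is a minimal illustration, not a recommended pipeline: the inline CSV data, the column names (hours, score), and the helper functions are hypothetical, and the "model" is a hand-written least-squares line evaluated on a single held-out point.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would come from a file or database.
RAW_CSV = """hours,score
1,52
2,55
,60
3,61
4,64
5,70
"""


def ingest(text):
    """Ingestion: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: drop rows with missing fields and convert values to float."""
    return [
        (float(row["hours"]), float(row["score"]))
        for row in rows
        if row["hours"] and row["score"]
    ]


def fit_line(points):
    """Modeling: least-squares fit of score = slope * hours + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def mean_absolute_error(points, slope, intercept):
    """Evaluation: average absolute gap between prediction and target."""
    return statistics.mean(abs(y - (slope * x + intercept)) for x, y in points)


points = clean(ingest(RAW_CSV))

# Exploration: a quick look at the cleaned target column.
scores = [y for _, y in points]
print("rows kept:", len(points))
print("mean score:", statistics.mean(scores))

# Hold out the last point so evaluation uses data the fit never saw.
train, holdout = points[:-1], points[-1:]
slope, intercept = fit_line(train)
mae = mean_absolute_error(holdout, slope, intercept)
print("slope:", slope, "intercept:", intercept, "holdout MAE:", mae)
```

Keeping each stage in its own function makes it easier to test the stages separately and to swap the hand-written fit for a library model later.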
Useful foundation modules from the Python standard library
- statistics: descriptive statistics and variance tools
- math: numeric operations and constants
- csv/json: common data exchange formats
- sqlite3: lightweight database-backed analysis
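As a rough sketch of how these modules can work together, the example below parses CSV text, aggregates it in an in-memory SQLite database, computes a sample variance, and serializes a summary as JSON. The data, the table name (readings), and the column names are made up for illustration.

```python
import csv
import io
import json
import math
import sqlite3
import statistics

# Hypothetical sample data standing in for a CSV file on disk.
CSV_TEXT = "city,temp_c\nOslo,4.0\nOslo,6.0\nCairo,30.0\nCairo,34.0\n"

# csv: parse rows into dictionaries keyed by the header line.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# sqlite3: load the rows into an in-memory database and aggregate with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(r["city"], float(r["temp_c"])) for r in rows],
)
averages = dict(
    conn.execute(
        "SELECT city, AVG(temp_c) FROM readings GROUP BY city ORDER BY city"
    )
)
conn.close()

# statistics and math: sample variance, and its square root as a cross-check
# against the sample standard deviation.
temps = [float(r["temp_c"]) for r in rows]
variance = statistics.variance(temps)
assert math.isclose(math.sqrt(variance), statistics.stdev(temps))

# json: serialize the summary for exchange with other tools.
summary = json.dumps({"averages": averages, "variance": variance}, sort_keys=True)
print(summary)
```

The in-memory database keeps the example self-contained; passing a file path to sqlite3.connect instead would persist the table between runs.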
Ecosystem packages (official project docs)
- NumPy: n-dimensional array computing
- pandas: tabular data processing
- Matplotlib: plotting and visualizations
- scikit-learn: machine learning algorithms and evaluation
Reproducibility practices
- Isolate environment with venv
- Pin dependencies
- Track data schema assumptions in code/tests
- Separate training, validation, and test datasets
- Log model parameters and metrics
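The last three practices in this list can be illustrated with a short standard-library sketch. Everything named here is an assumption made for the example: the schema (feature, label), the 60/20/20 split, the seed, and the majority-label baseline that stands in for a real model. Environment isolation and dependency pinning happen outside the script and are not shown.

```python
import json
import random

EXPECTED_COLUMNS = {"feature", "label"}  # hypothetical schema for this example


def check_schema(rows):
    """Fail fast if any row does not match the assumed columns."""
    for row in rows:
        if set(row) != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected columns: {sorted(row)}")


def split_dataset(rows, seed, train_frac=0.6, val_frac=0.2):
    """Shuffle with a fixed seed, then cut into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical dataset: ten rows with a binary label.
rows = [{"feature": i, "label": i % 2} for i in range(10)]
check_schema(rows)

params = {"seed": 42, "train_frac": 0.6, "val_frac": 0.2}
train, val, test = split_dataset(rows, **params)

# Stand-in "model": always predict the most common training label.
train_labels = [r["label"] for r in train]
majority = max(set(train_labels), key=train_labels.count)
accuracy = sum(r["label"] == majority for r in test) / len(test)

# Log parameters and metrics together so the run can be reproduced later.
log_line = json.dumps(
    {"params": params, "metrics": {"test_accuracy": accuracy}}, sort_keys=True
)
print(log_line)
```

Because the shuffle uses its own random.Random(seed) instance, the same seed gives the same split without touching the global random state.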
Practical quick example (stdlib)
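import statistics

# Sample observations
data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
print(statistics.mean(data))    # arithmetic mean, approximately 1.607
print(statistics.median(data))  # middle value of the sorted data: 1.25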
Official references
- Python stdlib tour: https://docs.python.org/3/tutorial/stdlib.html
- statistics: https://docs.python.org/3/library/statistics.html
- csv: https://docs.python.org/3/library/csv.html
- json: https://docs.python.org/3/library/json.html
- NumPy docs: https://numpy.org/doc/
- pandas docs: https://pandas.pydata.org/docs/
- Matplotlib docs: https://matplotlib.org/stable/users/index.html
- scikit-learn docs: https://scikit-learn.org/stable/user_guide.html