-pandas supports the integration with many file formats or data sources out of the box (csv, excel, sql, json, parquet,…). Importing data from each of these
-data sources is provided by function with the prefix ``read_*``. Similarly, the ``to_*`` methods are used to store data.
+pandas supports the integration with many file formats or data sources out of the box (csv, excel, sql, json, parquet,…). The ability to import data from each of these
+data sources is provided by functions with the prefix ``read_*``. Similarly, the ``to_*`` methods are used to store data.
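For instance, a minimal read/write round trip might look like the following sketch (``data.csv`` is a placeholder for any local CSV file):

.. code-block:: python

    import pandas as pd

    # Each supported format has a read_* function that returns a DataFrame...
    df = pd.read_csv("data.csv")

    # ...and a matching to_* method to write the data back out.
    df.to_json("data.json")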
.. image:: ../_static/schemas/02_io_readwrite.svg
:align: center
@@ -181,7 +180,7 @@ data sources is provided by function with the prefix ``read_*``. Similarly, the
-Selecting or filtering specific rows and/or columns? Filtering the data on a condition? Methods for slicing, selecting, and extracting the
+Selecting or filtering specific rows and/or columns? Filtering the data on a particular condition? Methods for slicing, selecting, and extracting the
data you need are available in pandas.
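As a small sketch of both operations on a toy table:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"name": ["Ann", "Bob", "Cleo"], "age": [22, 35, 58]})

    # Select a single column...
    ages = df["age"]

    # ...or filter rows with a boolean condition.
    above_30 = df[df["age"] > 30]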
.. image:: ../_static/schemas/03_subset_columns_rows.svg
@@ -228,7 +227,7 @@ data you need are available in pandas.
-pandas provides plotting your data out of the box, using the power of Matplotlib. You can pick the plot type (scatter, bar, boxplot,...)
+pandas provides plotting for your data right out of the box with the power of Matplotlib. Simply pick the plot type (scatter, bar, boxplot,...)
corresponding to your data.
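A minimal sketch, assuming Matplotlib is installed:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [3, 1, 4, 2]})

    # Each column becomes a plot element; pick the kind that fits the data.
    df.plot()                      # line plot of every column
    df.plot.scatter(x="x", y="y")  # scatter plot of one column against another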
.. image:: ../_static/schemas/04_plot_overview.svg
@@ -275,7 +274,7 @@ corresponding to your data.
-There is no need to loop over all rows of your data table to do calculations. Data manipulations on a column work elementwise.
+There's no need to loop over all rows of your data table to do calculations. Column data manipulations work elementwise in pandas.
Adding a column to a :class:`DataFrame` based on existing data in other columns is straightforward.
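For example, deriving a new column from existing ones is a single elementwise expression:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"price": [100.0, 250.0, 40.0], "quantity": [3, 2, 10]})

    # No explicit loop: the multiplication is applied row by row.
    df["total"] = df["price"] * df["quantity"]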
.. image:: ../_static/schemas/05_newcolumn_2.svg
@@ -322,7 +321,7 @@ Adding a column to a :class:`DataFrame` based on existing data in other columns
-Basic statistics (mean, median, min, max, counts...) are easily calculable. These or custom aggregations can be applied on the entire
+Basic statistics (mean, median, min, max, counts...) are easily calculated across data frames. These, or even custom aggregations, can be applied to the entire
data set, a sliding window of the data, or grouped by categories. The latter is also known as the split-apply-combine approach.
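A short sketch of both a plain aggregation and the split-apply-combine pattern:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"sex": ["male", "female", "female"], "age": [22, 35, 58]})

    # Statistic over a whole column...
    df["age"].mean()

    # ...or per group: split by "sex", apply the mean, combine the results.
    df.groupby("sex")["age"].mean()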
.. image:: ../_static/schemas/06_groupby.svg
@@ -369,8 +368,8 @@ data set, a sliding window of the data, or grouped by categories. The latter is
-Change the structure of your data table in multiple ways. You can :func:`~pandas.melt` your data table from wide to long/tidy form or :func:`~pandas.pivot`
-from long to wide format. With aggregations built-in, a pivot table is created with a single command.
+Change the structure of your data table in a variety of ways. You can use :func:`~pandas.melt` to reshape your data from a wide format to a long and tidy one. Use :func:`~pandas.pivot`
+to go from long to wide format. With aggregations built-in, a pivot table can be created with a single command.
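A compact sketch of both reshaping directions:

.. code-block:: python

    import pandas as pd

    wide = pd.DataFrame({"city": ["A", "B"], "2023": [10, 20], "2024": [12, 18]})

    # Wide to long/tidy form...
    tidy = wide.melt(id_vars="city", var_name="year", value_name="value")

    # ...and back from long to wide.
    tidy.pivot(index="city", columns="year", values="value")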
.. image:: ../_static/schemas/07_melt.svg
:align: center
@@ -416,7 +415,7 @@ from long to wide format. With aggregations built-in, a pivot table is created w
-Multiple tables can be concatenated both column wise and row wise as database-like join/merge operations are provided to combine multiple tables of data.
+Multiple tables can be concatenated column wise or row wise, and database-like join/merge operations are provided to combine multiple tables of data.
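A minimal sketch of both ways of combining tables:

.. code-block:: python

    import pandas as pd

    left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
    right = pd.DataFrame({"key": ["a", "b"], "y": [3, 4]})

    # Stack tables along an axis...
    pd.concat([left, right], axis=0)

    # ...or combine them on a shared key, like a SQL join.
    left.merge(right, on="key")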
.. image:: ../_static/schemas/08_concat_row.svg
:align: center
@@ -505,7 +504,7 @@ pandas has great support for time series and has an extensive set of tools for w
-Data sets do not only contain numerical data. pandas provides a wide range of functions to clean textual data and extract useful information from it.
+Data sets often contain more than just numerical data. pandas provides a wide range of functions to clean textual data and extract useful information from it.
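For instance, the elementwise string methods live under the ``.str`` accessor:

.. code-block:: python

    import pandas as pd

    names = pd.Series(["Braund, Mr. Owen Harris", "Bonnell, Miss Elizabeth"])

    names.str.lower()                 # normalize case
    names.str.split(",").str.get(0)   # extract the surname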
.. raw:: html
@@ -551,9 +550,9 @@ the pandas-equivalent operations compared to software you already know:
:class-card: comparison-card
:shadow: md
- The `R programming language `__ provides the
- ``data.frame`` data structure and multiple packages, such as
- `tidyverse `__ use and extend ``data.frame``
+ The `R programming language `__ provides a
+ ``data.frame`` data structure as well as packages like
+ `tidyverse `__ which use and extend ``data.frame``
for convenient data handling functionalities similar to pandas.
+++
@@ -572,8 +571,8 @@ the pandas-equivalent operations compared to software you already know:
:class-card: comparison-card
:shadow: md
- Already familiar to ``SELECT``, ``GROUP BY``, ``JOIN``, etc.?
- Most of these SQL manipulations do have equivalents in pandas.
+ Already familiar with ``SELECT``, ``GROUP BY``, ``JOIN``, etc.?
+ Many SQL manipulations have equivalents in pandas.
+++
@@ -613,7 +612,7 @@ the pandas-equivalent operations compared to software you already know:
Users of `Excel `__
or other spreadsheet programs will find that many of the concepts are
- transferrable to pandas.
+ transferable to pandas.
+++
@@ -631,10 +630,10 @@ the pandas-equivalent operations compared to software you already know:
:class-card: comparison-card
:shadow: md
- The `SAS `__ statistical software suite
- also provides the ``data set`` corresponding to the pandas ``DataFrame``.
- Also SAS vectorized operations, filtering, string processing operations,
- and more have similar functions in pandas.
+ `SAS `__, the statistical software suite,
+ uses the ``data set`` structure, which closely corresponds to the pandas ``DataFrame``.
+ SAS vectorized operations such as filtering and string processing
+ have similar equivalents in pandas.
+++
diff --git a/doc/source/getting_started/install.rst b/doc/source/getting_started/install.rst
index ae7c9d4ea9c62..0002ed869eb31 100644
--- a/doc/source/getting_started/install.rst
+++ b/doc/source/getting_started/install.rst
@@ -6,88 +6,75 @@
Installation
============
-The easiest way to install pandas is to install it
-as part of the `Anaconda `__ distribution, a
-cross platform distribution for data analysis and scientific computing.
-The `Conda `__ package manager is the
-recommended installation method for most users.
+The pandas development team officially distributes pandas for installation
+through the following methods:
-Instructions for installing :ref:`from source `,
-:ref:`PyPI `, or a
-:ref:`development version ` are also provided.
+* Available on `conda-forge `__ for installation with the conda package manager.
+* Available on `PyPI `__ for installation with pip.
+* Available on `GitHub `__ for installation from source.
+
+.. note::
+ pandas may be installable from other sources besides the ones listed above,
+ but those sources are **not** managed by the pandas development team.
.. _install.version:
Python version support
----------------------
-Officially Python 3.9, 3.10 and 3.11.
+See :ref:`Python support policy `.
Installing pandas
-----------------
-.. _install.anaconda:
+.. _install.conda:
-Installing with Anaconda
-~~~~~~~~~~~~~~~~~~~~~~~~
+Installing with Conda
+~~~~~~~~~~~~~~~~~~~~~
-For users that are new to Python, the easiest way to install Python, pandas, and the
-packages that make up the `PyData `__ stack
-(`SciPy `__, `NumPy `__,
-`Matplotlib `__, `and more `__)
-is with `Anaconda `__, a cross-platform
-(Linux, macOS, Windows) Python distribution for data analytics and
-scientific computing. Installation instructions for Anaconda
-`can be found here `__.
+For users working with the `Conda `__ package manager,
+pandas can be installed from the ``conda-forge`` channel.
-.. _install.miniconda:
+.. code-block:: shell
-Installing with Miniconda
-~~~~~~~~~~~~~~~~~~~~~~~~~
+ conda install -c conda-forge pandas
-For users experienced with Python, the recommended way to install pandas with
-`Miniconda `__.
-Miniconda allows you to create a minimal, self-contained Python installation compared to Anaconda and use the
-`Conda `__ package manager to install additional packages
-and create a virtual environment for your installation. Installation instructions for Miniconda
-`can be found here `__.
+To install the Conda package manager on your system, the
+`Miniforge distribution `__
+is recommended.
-The next step is to create a new conda environment. A conda environment is like a
-virtualenv that allows you to specify a specific version of Python and set of libraries.
-Run the following commands from a terminal window.
+Additionally, it is recommended to install and run pandas from a virtual environment.
.. code-block:: shell
conda create -c conda-forge -n name_of_my_env python pandas
-
-This will create a minimal environment with only Python and pandas installed.
-To put your self inside this environment run.
-
-.. code-block:: shell
-
+ # On Linux or macOS
source activate name_of_my_env
# On Windows
activate name_of_my_env
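Once the environment is activated, a quick sanity check (not an official installation step) is to import pandas and print its version:

.. code-block:: python

    import pandas as pd

    print(pd.__version__)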
-.. _install.pypi:
+.. tip::
+ For users that are new to Python, the easiest way to install Python, pandas, and the
+ packages that make up the `PyData `__ stack such as
+ `SciPy `__, `NumPy `__ and
+ `Matplotlib `__
+ is with `Anaconda `__, a cross-platform
+ (Linux, macOS, Windows) Python distribution for data analytics and
+ scientific computing.
-Installing from PyPI
-~~~~~~~~~~~~~~~~~~~~
+ However, pandas from Anaconda is **not** officially managed by the pandas development team.
-pandas can be installed via pip from
-`PyPI `__.
+.. _install.pip:
-.. code-block:: shell
-
- pip install pandas
+Installing with pip
+~~~~~~~~~~~~~~~~~~~
-.. note::
- You must have ``pip>=19.3`` to install from PyPI.
+For users working with the `pip `__ package manager,
+pandas can be installed from `PyPI `__.
-.. note::
+.. code-block:: shell
- It is recommended to install and run pandas from a virtual environment, for example,
- using the Python standard library's `venv `__
+ pip install pandas
pandas can also be installed with sets of optional dependencies to enable certain functionality. For example,
to install pandas with the optional dependencies to read Excel files.
@@ -98,25 +85,8 @@ to install pandas with the optional dependencies to read Excel files.
The full list of extras that can be installed can be found in the :ref:`dependency section.`
-Handling ImportErrors
-~~~~~~~~~~~~~~~~~~~~~
-
-If you encounter an ``ImportError``, it usually means that Python couldn't find pandas in the list of available
-libraries. Python internally has a list of directories it searches through, to find packages. You can
-obtain these directories with.
-
-.. code-block:: python
-
- import sys
- sys.path
-
-One way you could be encountering this error is if you have multiple Python installations on your system
-and you don't have pandas installed in the Python installation you're currently using.
-In Linux/Mac you can run ``which python`` on your terminal and it will tell you which Python installation you're
-using. If it's something like "/usr/bin/python", you're using the Python from the system, which is not recommended.
-
-It is highly recommended to use ``conda``, for quick installation and for package and dependency updates.
-You can find simple installation instructions for pandas :ref:`in this document `.
+Additionally, it is recommended to install and run pandas from a virtual environment, for example,
+using the Python standard library's `venv `__.
.. _install.source:
@@ -144,49 +114,24 @@ index from the PyPI registry of anaconda.org. You can install it by running.
pip install --pre --extra-index https://pypi.anaconda.org/scientific-python-nightly-wheels/simple pandas
-Note that you might be required to uninstall an existing version of pandas to install the development version.
+.. note::
+ You might be required to uninstall an existing version of pandas to install the development version.
-.. code-block:: shell
+ .. code-block:: shell
- pip uninstall pandas -y
+ pip uninstall pandas -y
Running the test suite
----------------------
-pandas is equipped with an exhaustive set of unit tests. The packages required to run the tests
-can be installed with ``pip install "pandas[test]"``. To run the tests from a
-Python terminal.
-
-.. code-block:: python
-
- >>> import pandas as pd
- >>> pd.test()
- running: pytest -m "not slow and not network and not db" /home/user/anaconda3/lib/python3.9/site-packages/pandas
-
- ============================= test session starts ==============================
- platform linux -- Python 3.9.7, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
- rootdir: /home/user
- plugins: dash-1.19.0, anyio-3.5.0, hypothesis-6.29.3
- collected 154975 items / 4 skipped / 154971 selected
- ........................................................................ [ 0%]
- ........................................................................ [ 99%]
- ....................................... [100%]
-
- ==================================== ERRORS ====================================
-
- =================================== FAILURES ===================================
-
- =============================== warnings summary ===============================
-
- =========================== short test summary info ============================
-
- = 1 failed, 146194 passed, 7402 skipped, 1367 xfailed, 5 xpassed, 197 warnings, 10 errors in 1090.16s (0:18:10) =
+If pandas has been installed :ref:`from source `, running ``pytest pandas`` will run all of the pandas unit tests.
+The unit tests can also be run from the pandas module itself with the :func:`test` function. The packages required to run the tests
+can be installed with ``pip install "pandas[test]"``.
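For example, from a Python session:

.. code-block:: python

    import pandas as pd

    # Runs the bundled unit tests with pytest's default marker selection.
    pd.test()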
.. note::
- This is just an example of what information is shown. Test failures are not necessarily indicative
- of a broken pandas installation.
+ Test failures are not necessarily indicative of a broken pandas installation.
.. _install.dependencies:
@@ -203,10 +148,9 @@ pandas requires the following dependencies.
================================================================ ==========================
Package Minimum supported version
================================================================ ==========================
-`NumPy `__ 1.22.4
+`NumPy `__ 1.26.0
`python-dateutil `__ 2.8.2
-`pytz `__ 2020.1
-`tzdata `__ 2022.1
+`tzdata `__ 2023.3
================================================================ ==========================
.. _install.optional_dependencies:
@@ -220,7 +164,7 @@ For example, :func:`pandas.read_hdf` requires the ``pytables`` package, while
optional dependency is not installed, pandas will raise an ``ImportError`` when
the method requiring that dependency is called.
-If using pip, optional pandas dependencies can be installed or managed in a file (e.g. requirements.txt or pyproject.toml)
+With pip, optional pandas dependencies can be installed or managed in a file (e.g. requirements.txt or pyproject.toml)
as optional extras (e.g. ``pandas[performance, aws]``). All optional dependencies can be installed with ``pandas[all]``,
and specific sets of dependencies are listed in the sections below.
@@ -239,62 +183,66 @@ Installable with ``pip install "pandas[performance]"``
===================================================== ================== ================== ===================================================================================================================================================================================
Dependency Minimum Version pip extra Notes
===================================================== ================== ================== ===================================================================================================================================================================================
-`numexpr `__ 2.8.0 performance Accelerates certain numerical operations by using multiple cores as well as smart chunking and caching to achieve large speedups
-`bottleneck `__ 1.3.4 performance Accelerates certain types of ``nan`` by using specialized cython routines to achieve large speedup.
-`numba `__ 0.55.2 performance Alternative execution engine for operations that accept ``engine="numba"`` using a JIT compiler that translates Python functions to optimized machine code using the LLVM compiler.
+`numexpr `__ 2.9.0 performance Accelerates certain numerical operations by using multiple cores as well as smart chunking and caching to achieve large speedups
+`bottleneck `__ 1.3.6 performance Accelerates certain types of ``nan`` evaluations by using specialized cython routines to achieve large speedups.
+`numba `__ 0.59.0 performance Alternative execution engine for operations that accept ``engine="numba"`` using a JIT compiler that translates Python functions to optimized machine code using the LLVM compiler.
===================================================== ================== ================== ===================================================================================================================================================================================
Visualization
^^^^^^^^^^^^^
-Installable with ``pip install "pandas[plot, output_formatting]"``.
+Installable with ``pip install "pandas[plot, output-formatting]"``.
-========================= ================== ================== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== ================== =============================================================
-matplotlib 3.6.1 plot Plotting library
-Jinja2 3.1.2 output_formatting Conditional formatting with DataFrame.style
-tabulate 0.8.10 output_formatting Printing in Markdown-friendly format (see `tabulate`_)
-========================= ================== ================== =============================================================
+========================================================== ================== ================== =======================================================
+Dependency Minimum Version pip extra Notes
+========================================================== ================== ================== =======================================================
+`matplotlib `__ 3.8.3 plot Plotting library
+`Jinja2 `__ 3.1.3 output-formatting Conditional formatting with DataFrame.style
+`tabulate `__ 0.9.0 output-formatting Printing in Markdown-friendly format (see `tabulate`_)
+========================================================== ================== ================== =======================================================
Computation
^^^^^^^^^^^
Installable with ``pip install "pandas[computation]"``.
-========================= ================== =============== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =============== =============================================================
-SciPy 1.8.1 computation Miscellaneous statistical functions
-xarray 2022.03.0 computation pandas-like API for N-dimensional data
-========================= ================== =============== =============================================================
+============================================== ================== =============== =======================================
+Dependency Minimum Version pip extra Notes
+============================================== ================== =============== =======================================
+`SciPy `__ 1.12.0 computation Miscellaneous statistical functions
+`xarray `__ 2024.1.1 computation pandas-like API for N-dimensional data
+============================================== ================== =============== =======================================
+
+.. _install.excel_dependencies:
Excel files
^^^^^^^^^^^
Installable with ``pip install "pandas[excel]"``.
-========================= ================== =============== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =============== =============================================================
-xlrd 2.0.1 excel Reading Excel
-xlsxwriter 3.0.3 excel Writing Excel
-openpyxl 3.0.10 excel Reading / writing for xlsx files
-pyxlsb 1.0.9 excel Reading for xlsb files
-========================= ================== =============== =============================================================
+================================================================== ================== =============== =============================================================
+Dependency Minimum Version pip extra Notes
+================================================================== ================== =============== =============================================================
+`xlrd `__ 2.0.1 excel Reading for xls files
+`xlsxwriter `__ 3.2.0 excel Writing for xlsx files
+`openpyxl `__ 3.1.2 excel Reading / writing for Excel 2010 xlsx/xlsm/xltx/xltm files
+`pyxlsb `__ 1.0.10 excel Reading for xlsb files
+`python-calamine `__ 0.1.7 excel Reading for xls/xlsx/xlsm/xlsb/xla/xlam/ods files
+`odfpy `__ 1.4.1 excel Reading / writing for OpenDocument 1.2 files
+================================================================== ================== =============== =============================================================
HTML
^^^^
Installable with ``pip install "pandas[html]"``.
-========================= ================== =============== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =============== =============================================================
-BeautifulSoup4 4.11.1 html HTML parser for read_html
-html5lib 1.1 html HTML parser for read_html
-lxml 4.8.0 html HTML parser for read_html
-========================= ================== =============== =============================================================
+=============================================================== ================== =============== ==========================
+Dependency Minimum Version pip extra Notes
+=============================================================== ================== =============== ==========================
+`BeautifulSoup4 `__ 4.12.3 html HTML parser for read_html
+`html5lib `__ 1.1 html HTML parser for read_html
+`lxml `__ 4.9.2 html HTML parser for read_html
+=============================================================== ================== =============== ==========================
One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.read_html` function:
@@ -325,43 +273,45 @@ XML
Installable with ``pip install "pandas[xml]"``.
-========================= ================== =============== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =============== =============================================================
-lxml 4.8.0 xml XML parser for read_xml and tree builder for to_xml
-========================= ================== =============== =============================================================
+======================================== ================== =============== ====================================================
+Dependency Minimum Version pip extra Notes
+======================================== ================== =============== ====================================================
+`lxml `__ 4.9.2 xml XML parser for read_xml and tree builder for to_xml
+======================================== ================== =============== ====================================================
SQL databases
^^^^^^^^^^^^^
-Installable with ``pip install "pandas[postgresql, mysql, sql-other]"``.
+Traditional drivers are installable with ``pip install "pandas[postgresql, mysql, sql-other]"``
-========================= ================== =============== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =============== =============================================================
-SQLAlchemy 1.4.36 postgresql, SQL support for databases other than sqlite
- mysql,
- sql-other
-psycopg2 2.9.3 postgresql PostgreSQL engine for sqlalchemy
-pymysql 1.0.2 mysql MySQL engine for sqlalchemy
-========================= ================== =============== =============================================================
+================================================================== ================== =============== ============================================
+Dependency Minimum Version pip extra Notes
+================================================================== ================== =============== ============================================
+`SQLAlchemy `__ 2.0.0 postgresql, SQL support for databases other than sqlite
+ mysql,
+ sql-other
+`psycopg2 `__ 2.9.9 postgresql PostgreSQL engine for sqlalchemy
+`pymysql `__ 1.1.0 mysql MySQL engine for sqlalchemy
+`adbc-driver-postgresql `__ 1.2.0 postgresql ADBC Driver for PostgreSQL
+`adbc-driver-sqlite `__ 1.2.0 sql-other ADBC Driver for SQLite
+================================================================== ================== =============== ============================================
Other data sources
^^^^^^^^^^^^^^^^^^
-Installable with ``pip install "pandas[hdf5, parquet, feather, spss, excel]"``
+Installable with ``pip install "pandas[hdf5, parquet, iceberg, feather, spss, excel]"``
-========================= ================== ================ =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== ================ =============================================================
-PyTables 3.7.0 hdf5 HDF5-based reading / writing
-blosc 1.21.0 hdf5 Compression for HDF5; only available on ``conda``
-zlib hdf5 Compression for HDF5
-fastparquet 0.8.1 - Parquet reading / writing (pyarrow is default)
-pyarrow 7.0.0 parquet, feather Parquet, ORC, and feather reading / writing
-pyreadstat 1.1.5 spss SPSS files (.sav) reading
-odfpy 1.4.1 excel Open document format (.odf, .ods, .odt) reading / writing
-========================= ================== ================ =============================================================
+====================================================== ================== ================ ==========================================================
+Dependency Minimum Version pip extra Notes
+====================================================== ================== ================ ==========================================================
+`PyTables `__ 3.8.0 hdf5 HDF5-based reading / writing
+`zlib `__ hdf5 Compression for HDF5
+`fastparquet `__ 2024.2.0 - Parquet reading / writing (pyarrow is default)
+`pyarrow `__ 12.0.1 parquet, feather Parquet, ORC, and feather reading / writing
+`PyIceberg `__ 0.7.1 iceberg Apache Iceberg reading / writing
+`pyreadstat `__ 1.2.6 spss SPSS files (.sav) reading
+`odfpy `__ 1.4.1 excel Open document format (.odf, .ods, .odt) reading / writing
+====================================================== ================== ================ ==========================================================
.. _install.warn_orc:
@@ -376,27 +326,26 @@ Access data in the cloud
Installable with ``pip install "pandas[fss, aws, gcp]"``
-========================= ================== =============== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =============== =============================================================
-fsspec 2022.05.0 fss, gcp, aws Handling files aside from simple local and HTTP (required
- dependency of s3fs, gcsfs).
-gcsfs 2022.05.0 gcp Google Cloud Storage access
-pandas-gbq 0.17.5 gcp Google Big Query access
-s3fs 2022.05.0 aws Amazon S3 access
-========================= ================== =============== =============================================================
+============================================ ================== =============== ==========================================================
+Dependency Minimum Version pip extra Notes
+============================================ ================== =============== ==========================================================
+`fsspec `__ 2023.12.2 fss, gcp, aws Handling files aside from simple local and HTTP (required
+ dependency of s3fs, gcsfs).
+`gcsfs `__ 2023.12.2 gcp Google Cloud Storage access
+`s3fs `__ 2023.12.2 aws Amazon S3 access
+============================================ ================== =============== ==========================================================
Clipboard
^^^^^^^^^
Installable with ``pip install "pandas[clipboard]"``.
-========================= ================== =============== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =============== =============================================================
-PyQt4/PyQt5 5.15.6 clipboard Clipboard I/O
-qtpy 2.2.0 clipboard Clipboard I/O
-========================= ================== =============== =============================================================
+======================================================================================== ================== =============== ==============
+Dependency Minimum Version pip extra Notes
+======================================================================================== ================== =============== ==============
+`PyQt4 `__/`PyQt5 `__ 5.15.9 clipboard Clipboard I/O
+`qtpy `__ 2.3.0 clipboard Clipboard I/O
+======================================================================================== ================== =============== ==============
.. note::
@@ -409,19 +358,19 @@ Compression
Installable with ``pip install "pandas[compression]"``
-========================= ================== =============== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =============== =============================================================
-Zstandard 0.17.0 compression Zstandard compression
-========================= ================== =============== =============================================================
+================================================= ================== =============== ======================
+Dependency Minimum Version pip extra Notes
+================================================= ================== =============== ======================
+`Zstandard `__ 0.19.0 compression Zstandard compression
+================================================= ================== =============== ======================
-Consortium Standard
-^^^^^^^^^^^^^^^^^^^
+Timezone
+^^^^^^^^
-Installable with ``pip install "pandas[consortium-standard]"``
+Installable with ``pip install "pandas[timezone]"``
-========================= ================== =================== =============================================================
-Dependency Minimum Version pip extra Notes
-========================= ================== =================== =============================================================
-dataframe-api-compat 0.1.7 consortium-standard Consortium Standard-compatible implementation based on pandas
-========================= ================== =================== =============================================================
+========================================== ================== =================== ==============================================
+Dependency Minimum Version pip extra Notes
+========================================== ================== =================== ==============================================
+`pytz `__ 2023.4 timezone Alternative timezone library to ``zoneinfo``.
+========================================== ================== =================== ==============================================
diff --git a/doc/source/getting_started/intro_tutorials/01_table_oriented.rst b/doc/source/getting_started/intro_tutorials/01_table_oriented.rst
index caaff3557ae40..efcdb22778ef4 100644
--- a/doc/source/getting_started/intro_tutorials/01_table_oriented.rst
+++ b/doc/source/getting_started/intro_tutorials/01_table_oriented.rst
@@ -46,7 +46,7 @@ I want to store passenger data of the Titanic. For a number of passengers, I kno
"Name": [
"Braund, Mr. Owen Harris",
"Allen, Mr. William Henry",
- "Bonnell, Miss. Elizabeth",
+ "Bonnell, Miss Elizabeth",
],
"Age": [22, 35, 58],
"Sex": ["male", "male", "female"],
@@ -192,8 +192,8 @@ Check more options on ``describe`` in the user guide section about :ref:`aggrega
.. note::
This is just a starting point. Similar to spreadsheet
software, pandas represents data as a table with columns and rows. Apart
- from the representation, also the data manipulations and calculations
- you would do in spreadsheet software are supported by pandas. Continue
+ from the representation, the data manipulations and calculations
+ you would do in spreadsheet software are also supported by pandas. Continue
reading the next tutorials to get started!
.. raw:: html
@@ -204,7 +204,7 @@ Check more options on ``describe`` in the user guide section about :ref:`aggrega
- Import the package, aka ``import pandas as pd``
- A table of data is stored as a pandas ``DataFrame``
- Each column in a ``DataFrame`` is a ``Series``
-- You can do things by applying a method to a ``DataFrame`` or ``Series``
+- You can do things by applying a method on a ``DataFrame`` or ``Series``
.. raw:: html
@@ -215,7 +215,7 @@ Check more options on ``describe`` in the user guide section about :ref:`aggrega
To user guide
-A more extended explanation to ``DataFrame`` and ``Series`` is provided in the :ref:`introduction to data structures
`.
+A more extended explanation of ``DataFrame`` and ``Series`` is provided in the :ref:`introduction to data structures ` page.
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/02_read_write.rst b/doc/source/getting_started/intro_tutorials/02_read_write.rst
index 832c2cc25712f..0549c17a1013c 100644
--- a/doc/source/getting_started/intro_tutorials/02_read_write.rst
+++ b/doc/source/getting_started/intro_tutorials/02_read_write.rst
@@ -97,11 +97,11 @@ in this ``DataFrame`` are integers (``int64``), floats (``float64``) and
strings (``object``).
.. note::
- When asking for the ``dtypes``, no brackets are used!
+ When asking for the ``dtypes``, no parentheses ``()`` are used!
``dtypes`` is an attribute of a ``DataFrame`` and ``Series``. Attributes
- of a ``DataFrame`` or ``Series`` do not need brackets. Attributes
+ of a ``DataFrame`` or ``Series`` do not need ``()``. Attributes
represent a characteristic of a ``DataFrame``/``Series``, whereas
- methods (which require brackets) *do* something with the
+ methods (which require parentheses ``()``) *do* something with the
``DataFrame``/``Series`` as introduced in the :ref:`first tutorial <10min_tut_01_tableoriented>`.
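For instance, with the ``titanic`` ``DataFrame`` loaded earlier in this tutorial:

.. code-block:: python

    # Attribute: no parentheses, describes a characteristic of the table.
    titanic.dtypes

    # Method: parentheses, performs an action and returns a result.
    titanic.head(8)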
.. raw:: html
@@ -111,6 +111,12 @@ strings (``object``).
My colleague requested the Titanic data as a spreadsheet.
+.. note::
+ If you want to use :func:`~pandas.to_excel` and :func:`~pandas.read_excel`,
+ you need to install an Excel reader as outlined in the
+ :ref:`Excel files ` section of the
+ installation documentation.
+
.. ipython:: python
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
@@ -166,11 +172,11 @@ The method :meth:`~DataFrame.info` provides technical information about a
- The table has 12 columns. Most columns have a value for each of the
rows (all 891 values are ``non-null``). Some columns do have missing
values and less than 891 ``non-null`` values.
-- The columns ``Name``, ``Sex``, ``Cabin`` and ``Embarked`` consists of
+- The columns ``Name``, ``Sex``, ``Cabin`` and ``Embarked`` consist of
textual data (strings, aka ``object``). The other columns are
- numerical data with some of them whole numbers (aka ``integer``) and
- others are real numbers (aka ``float``).
-- The kind of data (characters, integers,…) in the different columns
+ numerical data: some are whole numbers (``integer``) and
+ others are real numbers (``float``).
+- The kind of data (characters, integers, …) in the different columns
are summarized by listing the ``dtypes``.
- The approximate amount of RAM used to hold the DataFrame is provided
as well.
@@ -188,7 +194,7 @@ The method :meth:`~DataFrame.info` provides technical information about a
- Getting data in to pandas from many different file formats or data
sources is supported by ``read_*`` functions.
- Exporting data out of pandas is provided by different
- ``to_*``\ methods.
+ ``to_*`` methods.
- The ``head``/``tail``/``info`` methods and the ``dtypes`` attribute
are convenient for a first check.
diff --git a/doc/source/getting_started/intro_tutorials/03_subset_data.rst b/doc/source/getting_started/intro_tutorials/03_subset_data.rst
index 6d7ec01551572..ced976f680885 100644
--- a/doc/source/getting_started/intro_tutorials/03_subset_data.rst
+++ b/doc/source/getting_started/intro_tutorials/03_subset_data.rst
@@ -101,7 +101,7 @@ selection brackets ``[]``.
.. note::
The inner square brackets define a
:ref:`Python list ` with column names, whereas
- the outer brackets are used to select the data from a pandas
+ the outer square brackets are used to select the data from a pandas
``DataFrame`` as seen in the previous example.
The returned data type is a pandas DataFrame:
@@ -300,7 +300,7 @@ want to select.
-When using the column names, row labels or a condition expression, use
+When using column names, row labels or a condition expression, use
the ``loc`` operator in front of the selection brackets ``[]``. For both
the part before and after the comma, you can use a single label, a list
of labels, a slice of labels, a conditional expression or a colon. Using
@@ -335,14 +335,14 @@ the name ``anonymous`` to the first 3 elements of the fourth column:
.. ipython:: python
titanic.iloc[0:3, 3] = "anonymous"
- titanic.head()
+ titanic.iloc[:5, 3]
.. raw:: html
To user guide
-See the user guide section on :ref:`different choices for indexing
` to get more insight in the usage of ``loc`` and ``iloc``.
+See the user guide section on :ref:`different choices for indexing ` to get more insight into the usage of ``loc`` and ``iloc``.
.. raw:: html
@@ -354,13 +354,11 @@ See the user guide section on :ref:`different choices for indexing REMEMBER
- When selecting subsets of data, square brackets ``[]`` are used.
-- Inside these brackets, you can use a single column/row label, a list
+- Inside these square brackets, you can use a single column/row label, a list
of column/row labels, a slice of labels, a conditional expression or
a colon.
-- Select specific rows and/or columns using ``loc`` when using the row
- and column names.
-- Select specific rows and/or columns using ``iloc`` when using the
- positions in the table.
+- Use ``loc`` for label-based selection (using row/column names).
+- Use ``iloc`` for position-based selection (using table positions).
- You can assign new values to a selection based on ``loc``/``iloc``.
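As a compact illustration of the two selection styles above (using the tutorial's ``titanic`` table):

.. code-block:: python

    # loc: label-based, e.g. rows by condition and columns by name.
    titanic.loc[titanic["Age"] > 35, "Name"]

    # iloc: position-based, e.g. the first 3 rows of the fourth column.
    titanic.iloc[0:3, 3]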
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/04_plotting.rst b/doc/source/getting_started/intro_tutorials/04_plotting.rst
index e96eb7c51a12a..e9f83c602d086 100644
--- a/doc/source/getting_started/intro_tutorials/04_plotting.rst
+++ b/doc/source/getting_started/intro_tutorials/04_plotting.rst
@@ -32,8 +32,10 @@ How do I create plots in pandas?
air_quality.head()
.. note::
- The usage of the ``index_col`` and ``parse_dates`` parameters of the ``read_csv`` function to define the first (0th) column as
- index of the resulting ``DataFrame`` and convert the dates in the column to :class:`Timestamp` objects, respectively.
+ The ``index_col=0`` and ``parse_dates=True`` parameters passed to the ``read_csv`` function define
+ the first (0th) column as index of the resulting ``DataFrame`` and convert the dates in the column
+ to :class:`Timestamp` objects, respectively.
+
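Spelled out, such a call could look like the sketch below (the file path follows this tutorial's data layout; adjust it to your own setup):

.. code-block:: python

    import pandas as pd

    air_quality = pd.read_csv(
        "data/air_quality_no2.csv",  # tutorial data set; adjust to your path
        index_col=0,                 # use the first (0th) column as row index
        parse_dates=True,            # parse the index values into Timestamps
    )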
.. raw:: html
@@ -85,7 +87,7 @@ I want to plot only the columns of the data table with the data from Paris.
air_quality["station_paris"].plot()
plt.show()
-To plot a specific column, use the selection method of the
+To plot a specific column, use a selection method from the
:ref:`subset data tutorial <10min_tut_03_subset>` in combination with the :meth:`~DataFrame.plot`
method. Hence, the :meth:`~DataFrame.plot` method works on both ``Series`` and
``DataFrame``.
@@ -127,7 +129,7 @@ standard Python to get an overview of the available plot methods:
]
.. note::
- In many development environments as well as IPython and
+ In many development environments such as IPython and
Jupyter Notebook, use the TAB button to get an overview of the available
methods, for example ``air_quality.plot.`` + TAB.
@@ -238,7 +240,7 @@ This strategy is applied in the previous example:
- The ``.plot.*`` methods are applicable on both Series and DataFrames.
- By default, each of the columns is plotted as a different element
- (line, boxplot,…).
+ (line, boxplot, …).
- Any plot created by pandas is a Matplotlib object.
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/05_add_columns.rst b/doc/source/getting_started/intro_tutorials/05_add_columns.rst
index d59a70cc2818e..481c094870e12 100644
--- a/doc/source/getting_started/intro_tutorials/05_add_columns.rst
+++ b/doc/source/getting_started/intro_tutorials/05_add_columns.rst
@@ -51,7 +51,7 @@ hPa, the conversion factor is 1.882*)
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
air_quality.head()
-To create a new column, use the ``[]`` brackets with the new column name
+To create a new column, use the square brackets ``[]`` with the new column name
at the left side of the assignment.
.. raw:: html
@@ -89,8 +89,8 @@ values in each row*.
-Also other mathematical operators (``+``, ``-``, ``*``, ``/``,…) or
-logical operators (``<``, ``>``, ``==``,…) work element-wise. The latter was already
+Other mathematical operators (``+``, ``-``, ``*``, ``/``, …) and logical
+operators (``<``, ``>``, ``==``, …) also work element-wise. The latter was already
used in the :ref:`subset data tutorial <10min_tut_03_subset>` to filter
rows of a table using a conditional expression.
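Reusing the ``air_quality`` table from this tutorial, both kinds of operators broadcast over whole columns:

.. code-block:: python

    # Arithmetic between columns, evaluated element-wise...
    air_quality["ratio_paris_antwerp"] = (
        air_quality["station_paris"] / air_quality["station_antwerp"]
    )

    # ...and comparisons likewise yield an element-wise boolean Series.
    above_threshold = air_quality["station_london"] > 20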
diff --git a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
index fe3ae820e7085..1399ab66426f4 100644
--- a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
+++ b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
@@ -162,7 +162,7 @@ columns by passing ``numeric_only=True``:
It does not make much sense to get the average value of the ``Pclass``.
If we are only interested in the average age for each gender, the
-selection of columns (rectangular brackets ``[]`` as usual) is supported
+selection of columns (square brackets ``[]`` as usual) is supported
on the grouped data as well:
.. ipython:: python
@@ -235,7 +235,7 @@ category in a column.
-The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records
+The function is a shortcut; it is actually a groupby operation in combination with counting the number of records
within each group:
.. ipython:: python
diff --git a/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst b/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst
index 6a0b59b26350c..e4b34e3af57bf 100644
--- a/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst
+++ b/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst
@@ -266,7 +266,7 @@ For more information about :meth:`~DataFrame.pivot_table`, see the user guide se
::
- air_quality.groupby(["parameter", "location"]).mean()
+ air_quality.groupby(["parameter", "location"])[["value"]].mean()
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
index 9081f274cd941..024300bb8a9b0 100644
--- a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
+++ b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
@@ -137,7 +137,7 @@ Hence, the resulting table has 3178 = 1110 + 2068 rows.
Most operations like concatenation or summary statistics are by default
across rows (axis 0), but can be applied across columns as well.
-Sorting the table on the datetime information illustrates also the
+Sorting the table on the datetime information also illustrates the
combination of both tables, with the ``parameter`` column defining the
origin of the table (either ``no2`` from table ``air_quality_no2`` or
``pm25`` from table ``air_quality_pm25``):
@@ -271,7 +271,7 @@ Add the parameters' full description and name, provided by the parameters metada
Compared to the previous example, there is no common column name.
However, the ``parameter`` column in the ``air_quality`` table and the
-``id`` column in the ``air_quality_parameters_name`` both provide the
+``id`` column in the ``air_quality_parameters`` table both provide the
measured variable in a common format. The ``left_on`` and ``right_on``
arguments are used here (instead of just ``on``) to make the link
between the two tables.
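The resulting call follows the pattern sketched here (with the column names described above):

.. code-block:: python

    air_quality = air_quality.merge(
        air_quality_parameters,
        how="left",
        left_on="parameter",  # key column in the air_quality table
        right_on="id",        # matching key column in the parameters table
    )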
@@ -286,7 +286,7 @@ between the two tables.
To user guide
-pandas supports also inner, outer, and right joins.
+pandas also supports inner, outer, and right joins.
More information on join/merge of tables is provided in the user guide section on
:ref:`database style merging of tables
`. Or have a look at the
:ref:`comparison with SQL` page.
@@ -300,7 +300,7 @@ More information on join/merge of tables is provided in the user guide section o
REMEMBER
-- Multiple tables can be concatenated both column-wise and row-wise using
+- Multiple tables can be concatenated column-wise or row-wise using
the ``concat`` function.
- For database-like merging/joining of tables, use the ``merge``
function.
diff --git a/doc/source/getting_started/intro_tutorials/09_timeseries.rst b/doc/source/getting_started/intro_tutorials/09_timeseries.rst
index 470b3908802b2..6ba3c17fac3c3 100644
--- a/doc/source/getting_started/intro_tutorials/09_timeseries.rst
+++ b/doc/source/getting_started/intro_tutorials/09_timeseries.rst
@@ -77,9 +77,9 @@ I want to work with the dates in the column ``datetime`` as datetime objects ins
Initially, the values in ``datetime`` are character strings and do not
provide any datetime operations (e.g. extract the year, day of the
-week,…). By applying the ``to_datetime`` function, pandas interprets the
+week, …). By applying the ``to_datetime`` function, pandas interprets the
strings and convert these to datetime (i.e. ``datetime64[ns, UTC]``)
-objects. In pandas we call these datetime objects similar to
+objects. In pandas we call these datetime objects that are similar to
``datetime.datetime`` from the standard library as :class:`pandas.Timestamp`.
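As a self-contained illustration of what ``to_datetime`` produces:

.. code-block:: python

    import pandas as pd

    # A string becomes a Timestamp...
    ts = pd.to_datetime("2019-06-21 00:00:00+00:00")

    # ...which supports datetime operations such as extracting components.
    ts.year, ts.day_name()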
.. raw:: html
@@ -117,7 +117,7 @@ length of our time series:
air_quality["datetime"].max() - air_quality["datetime"].min()
The result is a :class:`pandas.Timedelta` object, similar to ``datetime.timedelta``
-from the standard Python library and defining a time duration.
+from the standard Python library, which defines a time duration.
.. raw:: html
@@ -257,7 +257,7 @@ the adapted time scale on plots. Let’s apply this on our data.
-
-Create a plot of the :math:`NO_2` values in the different stations from the 20th of May till the end of 21st of May
+Create a plot of the :math:`NO_2` values in the different stations from May 20th till the end of May 21st.
.. ipython:: python
:okwarning:
@@ -295,7 +295,7 @@ Aggregate the current hourly time series values to the monthly maximum value in
.. ipython:: python
- monthly_max = no_2.resample("M").max()
+ monthly_max = no_2.resample("MS").max()
monthly_max
A very powerful method on time series data with a datetime index, is the
@@ -310,7 +310,7 @@ converting secondly data into 5-minutely data).
The :meth:`~Series.resample` method is similar to a groupby operation:
- it provides a time-based grouping, by using a string (e.g. ``M``,
- ``5H``,…) that defines the target frequency
+ ``5H``, …) that defines the target frequency
- it requires an aggregation function such as ``mean``, ``max``,…
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/10_text_data.rst b/doc/source/getting_started/intro_tutorials/10_text_data.rst
index 5b1885791d8fb..8493a071863c4 100644
--- a/doc/source/getting_started/intro_tutorials/10_text_data.rst
+++ b/doc/source/getting_started/intro_tutorials/10_text_data.rst
@@ -134,8 +134,8 @@ only one countess on the Titanic, we get one row as a result.
.. note::
More powerful extractions on strings are supported, as the
:meth:`Series.str.contains` and :meth:`Series.str.extract` methods accept `regular
- expressions `__, but out of
- scope of this tutorial.
+ expressions `__, but are out of
+ the scope of this tutorial.
.. raw:: html
@@ -200,7 +200,7 @@ In the "Sex" column, replace values of "male" by "M" and values of "female" by "
Whereas :meth:`~Series.replace` is not a string method, it provides a convenient way
to use mappings or vocabularies to translate certain values. It requires
-a ``dictionary`` to define the mapping ``{from : to}``.
+a ``dictionary`` to define the mapping ``{from: to}``.
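Applied to the task above, the mapping reads (assuming the tutorial's ``titanic`` table):

.. code-block:: python

    titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})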
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/includes/titanic.rst b/doc/source/getting_started/intro_tutorials/includes/titanic.rst
index 6e03b848aab06..41159516200fa 100644
--- a/doc/source/getting_started/intro_tutorials/includes/titanic.rst
+++ b/doc/source/getting_started/intro_tutorials/includes/titanic.rst
@@ -11,7 +11,7 @@ This tutorial uses the Titanic data set, stored as CSV. The data
consists of the following data columns:
- PassengerId: Id of every passenger.
-- Survived: Indication whether passenger survived. ``0`` for yes and ``1`` for no.
+- Survived: Indication whether passenger survived. ``0`` for no and ``1`` for yes.
- Pclass: One out of the 3 ticket classes: Class ``1``, Class ``2`` and Class ``3``.
- Name: Name of passenger.
- Sex: Gender of passenger.
diff --git a/doc/source/getting_started/overview.rst b/doc/source/getting_started/overview.rst
index 05a7d63b7ff47..98a68080d33ef 100644
--- a/doc/source/getting_started/overview.rst
+++ b/doc/source/getting_started/overview.rst
@@ -6,11 +6,11 @@
Package overview
****************
-pandas is a `Python `__ package providing fast,
+pandas is a `Python `__ package that provides fast,
flexible, and expressive data structures designed to make working with
"relational" or "labeled" data both easy and intuitive. It aims to be the
-fundamental high-level building block for doing practical, **real-world** data
-analysis in Python. Additionally, it has the broader goal of becoming **the
+fundamental high-level building block for Python's practical, **real-world** data
+analysis. Additionally, it seeks to become **the
most powerful and flexible open source data analysis/manipulation tool
available in any language**. It is already well on its way toward this goal.
@@ -174,3 +174,4 @@ License
-------
.. literalinclude:: ../../../LICENSE
+ :language: none
diff --git a/doc/source/getting_started/tutorials.rst b/doc/source/getting_started/tutorials.rst
index 1220c915c3cbc..eae7771418485 100644
--- a/doc/source/getting_started/tutorials.rst
+++ b/doc/source/getting_started/tutorials.rst
@@ -112,10 +112,9 @@ Various tutorials
* `Wes McKinney's (pandas BDFL) blog `_
* `Statistical analysis made easy in Python with SciPy and pandas DataFrames, by Randal Olson `_
-* `Statistical Data Analysis in Python, tutorial videos, by Christopher Fonnesbeck from SciPy 2013 `_
+* `Statistical Data Analysis in Python, tutorial by Christopher Fonnesbeck from SciPy 2013 `_
* `Financial analysis in Python, by Thomas Wiecki `_
* `Intro to pandas data structures, by Greg Reda `_
-* `Pandas and Python: Top 10, by Manish Amde `_
* `Pandas DataFrames Tutorial, by Karlijn Willems `_
* `A concise tutorial with real life examples `_
* `430+ Searchable Pandas recipes by Isshin Inada `_
diff --git a/doc/source/reference/arrays.rst b/doc/source/reference/arrays.rst
index 41ddbd048e6c5..d37eebef5c0c0 100644
--- a/doc/source/reference/arrays.rst
+++ b/doc/source/reference/arrays.rst
@@ -61,7 +61,7 @@ is an :class:`ArrowDtype`.
support as NumPy including first-class nullability support for all data types, immutability and more.
The table below shows the equivalent pyarrow-backed (``pa``), pandas extension, and numpy (``np``) types that are recognized by pandas.
-Pyarrow-backed types below need to be passed into :class:`ArrowDtype` to be recognized by pandas e.g. ``pd.ArrowDtype(pa.bool_())``
+Pyarrow-backed types below need to be passed into :class:`ArrowDtype` to be recognized by pandas e.g. ``pd.ArrowDtype(pa.bool_())``.
=============================================== ========================== ===================
PyArrow type pandas extension type NumPy type
@@ -114,7 +114,7 @@ values.
ArrowDtype
-For more information, please see the :ref:`PyArrow user guide `
+For more information, please see the :ref:`PyArrow user guide `.
.. _api.arrays.datetime:
@@ -134,11 +134,6 @@ is the missing value for datetime data.
Timestamp
-.. autosummary::
- :toctree: api/
-
- NaT
-
Properties
~~~~~~~~~~
.. autosummary::
@@ -257,11 +252,6 @@ is the missing value for timedelta data.
Timedelta
-.. autosummary::
- :toctree: api/
-
- NaT
-
Properties
~~~~~~~~~~
.. autosummary::
@@ -465,7 +455,6 @@ pandas provides this through :class:`arrays.IntegerArray`.
UInt16Dtype
UInt32Dtype
UInt64Dtype
- NA
.. _api.arrays.float_na:
@@ -484,7 +473,6 @@ Nullable float
Float32Dtype
Float64Dtype
- NA
.. _api.arrays.categorical:
@@ -507,7 +495,7 @@ a :class:`CategoricalDtype`.
CategoricalDtype.categories
CategoricalDtype.ordered
-Categorical data can be stored in a :class:`pandas.Categorical`
+Categorical data can be stored in a :class:`pandas.Categorical`:
.. autosummary::
:toctree: api/
@@ -551,6 +539,21 @@ To create a Series of dtype ``category``, use ``cat = s.astype(dtype)`` or
If the :class:`Series` is of dtype :class:`CategoricalDtype`, ``Series.cat`` can be used to change the categorical
data. See :ref:`api.series.cat` for more.
+More methods are available on :class:`Categorical`:
+
+.. autosummary::
+ :toctree: api/
+
+ Categorical.as_ordered
+ Categorical.as_unordered
+ Categorical.set_categories
+ Categorical.rename_categories
+ Categorical.reorder_categories
+ Categorical.add_categories
+ Categorical.remove_categories
+ Categorical.remove_unused_categories
+ Categorical.map
+
.. _api.arrays.sparse:
Sparse
@@ -621,7 +624,6 @@ with a bool :class:`numpy.ndarray`.
:template: autosummary/class_without_autosummary.rst
BooleanDtype
- NA
.. Dtype attributes which are manually listed in their docstrings: including
@@ -662,6 +664,7 @@ Data type introspection
api.types.is_datetime64_dtype
api.types.is_datetime64_ns_dtype
api.types.is_datetime64tz_dtype
+ api.types.is_dtype_equal
api.types.is_extension_array_dtype
api.types.is_float_dtype
api.types.is_int64_dtype
@@ -698,7 +701,6 @@ Scalar introspection
api.types.is_float
api.types.is_hashable
api.types.is_integer
- api.types.is_interval
api.types.is_number
api.types.is_re
api.types.is_re_compilable
diff --git a/doc/source/reference/extensions.rst b/doc/source/reference/extensions.rst
index e177e2b1d87d5..e412793a328a3 100644
--- a/doc/source/reference/extensions.rst
+++ b/doc/source/reference/extensions.rst
@@ -34,6 +34,7 @@ objects.
api.extensions.ExtensionArray._accumulate
api.extensions.ExtensionArray._concat_same_type
+ api.extensions.ExtensionArray._explode
api.extensions.ExtensionArray._formatter
api.extensions.ExtensionArray._from_factorized
api.extensions.ExtensionArray._from_sequence
@@ -48,6 +49,7 @@ objects.
api.extensions.ExtensionArray.copy
api.extensions.ExtensionArray.view
api.extensions.ExtensionArray.dropna
+ api.extensions.ExtensionArray.duplicated
api.extensions.ExtensionArray.equals
api.extensions.ExtensionArray.factorize
api.extensions.ExtensionArray.fillna
diff --git a/doc/source/reference/frame.rst b/doc/source/reference/frame.rst
index fefb02dd916cd..e701d48a89db7 100644
--- a/doc/source/reference/frame.rst
+++ b/doc/source/reference/frame.rst
@@ -48,7 +48,7 @@ Conversion
DataFrame.convert_dtypes
DataFrame.infer_objects
DataFrame.copy
- DataFrame.bool
+ DataFrame.to_numpy
Indexing, iteration
~~~~~~~~~~~~~~~~~~~
@@ -74,6 +74,7 @@ Indexing, iteration
DataFrame.where
DataFrame.mask
DataFrame.query
+ DataFrame.isetitem
For more information on ``.at``, ``.iat``, ``.loc``, and
``.iloc``, see the :ref:`indexing documentation `.
@@ -117,7 +118,6 @@ Function application, GroupBy & window
DataFrame.apply
DataFrame.map
- DataFrame.applymap
DataFrame.pipe
DataFrame.agg
DataFrame.aggregate
@@ -185,11 +185,8 @@ Reindexing / selection / label manipulation
DataFrame.duplicated
DataFrame.equals
DataFrame.filter
- DataFrame.first
- DataFrame.head
DataFrame.idxmax
DataFrame.idxmin
- DataFrame.last
DataFrame.reindex
DataFrame.reindex_like
DataFrame.rename
@@ -198,7 +195,6 @@ Reindexing / selection / label manipulation
DataFrame.sample
DataFrame.set_axis
DataFrame.set_index
- DataFrame.tail
DataFrame.take
DataFrame.truncate
@@ -209,7 +205,6 @@ Missing data handling
.. autosummary::
:toctree: api/
- DataFrame.backfill
DataFrame.bfill
DataFrame.dropna
DataFrame.ffill
@@ -219,7 +214,6 @@ Missing data handling
DataFrame.isnull
DataFrame.notna
DataFrame.notnull
- DataFrame.pad
DataFrame.replace
Reshaping, sorting, transposing
@@ -238,7 +232,6 @@ Reshaping, sorting, transposing
DataFrame.swaplevel
DataFrame.stack
DataFrame.unstack
- DataFrame.swapaxes
DataFrame.melt
DataFrame.explode
DataFrame.squeeze
@@ -382,7 +375,6 @@ Serialization / IO / conversion
DataFrame.to_feather
DataFrame.to_latex
DataFrame.to_stata
- DataFrame.to_gbq
DataFrame.to_records
DataFrame.to_string
DataFrame.to_clipboard
diff --git a/doc/source/reference/general_functions.rst b/doc/source/reference/general_functions.rst
index 02b0bf5d13dde..e93514de5f762 100644
--- a/doc/source/reference/general_functions.rst
+++ b/doc/source/reference/general_functions.rst
@@ -73,6 +73,13 @@ Top-level evaluation
eval
+Datetime formats
+~~~~~~~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ tseries.api.guess_datetime_format
+
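+For example, a minimal sketch of what this helper infers (output shown in comments):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   pd.tseries.api.guess_datetime_format("2023-01-01")  # '%Y-%m-%d'
+   pd.tseries.api.guess_datetime_format("01/01/2023", dayfirst=True)  # '%d/%m/%Y'
+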
Hashing
~~~~~~~
.. autosummary::
diff --git a/doc/source/reference/groupby.rst b/doc/source/reference/groupby.rst
index 771163ae1b0bc..004651ac0074f 100644
--- a/doc/source/reference/groupby.rst
+++ b/doc/source/reference/groupby.rst
@@ -79,8 +79,9 @@ Function application
DataFrameGroupBy.cumsum
DataFrameGroupBy.describe
DataFrameGroupBy.diff
+ DataFrameGroupBy.ewm
+ DataFrameGroupBy.expanding
DataFrameGroupBy.ffill
- DataFrameGroupBy.fillna
DataFrameGroupBy.first
DataFrameGroupBy.head
DataFrameGroupBy.idxmax
@@ -105,6 +106,7 @@ Function application
DataFrameGroupBy.shift
DataFrameGroupBy.size
DataFrameGroupBy.skew
+ DataFrameGroupBy.kurt
DataFrameGroupBy.std
DataFrameGroupBy.sum
DataFrameGroupBy.var
@@ -130,8 +132,9 @@ Function application
SeriesGroupBy.cumsum
SeriesGroupBy.describe
SeriesGroupBy.diff
+ SeriesGroupBy.ewm
+ SeriesGroupBy.expanding
SeriesGroupBy.ffill
- SeriesGroupBy.fillna
SeriesGroupBy.first
SeriesGroupBy.head
SeriesGroupBy.last
@@ -161,6 +164,7 @@ Function application
SeriesGroupBy.shift
SeriesGroupBy.size
SeriesGroupBy.skew
+ SeriesGroupBy.kurt
SeriesGroupBy.std
SeriesGroupBy.sum
SeriesGroupBy.var
diff --git a/doc/source/reference/index.rst b/doc/source/reference/index.rst
index 6d3ce3d31f005..639bac4d40b70 100644
--- a/doc/source/reference/index.rst
+++ b/doc/source/reference/index.rst
@@ -24,13 +24,14 @@ The following subpackages are public.
`pandas-stubs `_ package
which has classes in addition to those that occur in pandas for type-hinting.
-In addition, public functions in ``pandas.io`` and ``pandas.tseries`` submodules
-are mentioned in the documentation.
+In addition, public functions in the ``pandas.io``, ``pandas.tseries``, and ``pandas.util`` submodules
+are explicitly mentioned in the documentation. Further APIs in these modules are not guaranteed
+to be stable.
.. warning::
- The ``pandas.core``, ``pandas.compat``, and ``pandas.util`` top-level modules are PRIVATE. Stable functionality in such modules is not guaranteed.
+ The ``pandas.core`` and ``pandas.compat`` top-level modules are PRIVATE. Stable functionality in such modules is not guaranteed.
.. If you update this toctree, also update the manual toctree in the
.. main index.rst.template
@@ -53,6 +54,7 @@ are mentioned in the documentation.
options
extensions
testing
+ missing_value
.. This is to prevent warnings in the doc build. We don't want to encourage
.. these methods.
@@ -60,7 +62,6 @@ are mentioned in the documentation.
..
.. toctree::
- api/pandas.Index.holds_integer
api/pandas.Index.nlevels
api/pandas.Index.sort
diff --git a/doc/source/reference/indexing.rst b/doc/source/reference/indexing.rst
index 25e5b3b46b4f3..79a49b2030c3f 100644
--- a/doc/source/reference/indexing.rst
+++ b/doc/source/reference/indexing.rst
@@ -41,6 +41,7 @@ Properties
Index.empty
Index.T
Index.memory_usage
+ Index.array
Modifying and computations
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -61,13 +62,6 @@ Modifying and computations
Index.identical
Index.insert
Index.is_
- Index.is_boolean
- Index.is_categorical
- Index.is_floating
- Index.is_integer
- Index.is_interval
- Index.is_numeric
- Index.is_object
Index.min
Index.max
Index.reindex
@@ -104,12 +98,14 @@ Conversion
:toctree: api/
Index.astype
+ Index.infer_objects
Index.item
Index.map
Index.ravel
Index.to_list
Index.to_series
Index.to_frame
+ Index.to_numpy
Index.view
Sorting
@@ -489,3 +485,5 @@ Methods
PeriodIndex.asfreq
PeriodIndex.strftime
PeriodIndex.to_timestamp
+ PeriodIndex.from_fields
+ PeriodIndex.from_ordinals
diff --git a/doc/source/reference/io.rst b/doc/source/reference/io.rst
index fbd0f6bd200b9..37d9e7f6b7dbd 100644
--- a/doc/source/reference/io.rst
+++ b/doc/source/reference/io.rst
@@ -156,6 +156,16 @@ Parquet
read_parquet
DataFrame.to_parquet
+Iceberg
+~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ read_iceberg
+ DataFrame.to_iceberg
+
+.. warning:: ``read_iceberg`` is experimental and may change without warning.
+
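+A minimal sketch, assuming a PyIceberg catalog named ``my_catalog`` is configured;
+the table identifier ``ns.my_table`` is hypothetical:
+
+.. code-block:: python
+
+   df.to_iceberg("ns.my_table", catalog_name="my_catalog")
+   pd.read_iceberg("ns.my_table", catalog_name="my_catalog")
+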
ORC
~~~
.. autosummary::
@@ -188,13 +198,6 @@ SQL
read_sql
DataFrame.to_sql
-Google BigQuery
-~~~~~~~~~~~~~~~
-.. autosummary::
- :toctree: api/
-
- read_gbq
-
STATA
~~~~~
.. autosummary::
diff --git a/doc/source/reference/missing_value.rst b/doc/source/reference/missing_value.rst
new file mode 100644
index 0000000000000..3bf22aef765d1
--- /dev/null
+++ b/doc/source/reference/missing_value.rst
@@ -0,0 +1,24 @@
+{{ header }}
+
+.. _api.missing_value:
+
+==============
+Missing values
+==============
+.. currentmodule:: pandas
+
+``NA`` is the way to represent missing values for nullable dtypes (see below):
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/class_without_autosummary.rst
+
+ NA
+
+``NaT`` is the missing value for timedelta and datetime data (see below):
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/class_without_autosummary.rst
+
+ NaT
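+
+For example, a small sketch showing both markers:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   pd.Series([1, None], dtype="Int64")  # the missing entry displays as <NA>
+   pd.Series(["2023-01-01", None], dtype="datetime64[ns]")  # displays as NaT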
diff --git a/doc/source/reference/offset_frequency.rst b/doc/source/reference/offset_frequency.rst
index ab89fe74e7337..5876e005574fd 100644
--- a/doc/source/reference/offset_frequency.rst
+++ b/doc/source/reference/offset_frequency.rst
@@ -26,8 +26,6 @@ Properties
DateOffset.normalize
DateOffset.rule_code
DateOffset.n
- DateOffset.is_month_start
- DateOffset.is_month_end
Methods
~~~~~~~
@@ -35,7 +33,6 @@ Methods
:toctree: api/
DateOffset.copy
- DateOffset.is_anchored
DateOffset.is_on_offset
DateOffset.is_month_start
DateOffset.is_month_end
@@ -82,7 +79,6 @@ Methods
:toctree: api/
BusinessDay.copy
- BusinessDay.is_anchored
BusinessDay.is_on_offset
BusinessDay.is_month_start
BusinessDay.is_month_end
@@ -122,7 +118,6 @@ Methods
:toctree: api/
BusinessHour.copy
- BusinessHour.is_anchored
BusinessHour.is_on_offset
BusinessHour.is_month_start
BusinessHour.is_month_end
@@ -169,7 +164,6 @@ Methods
:toctree: api/
CustomBusinessDay.copy
- CustomBusinessDay.is_anchored
CustomBusinessDay.is_on_offset
CustomBusinessDay.is_month_start
CustomBusinessDay.is_month_end
@@ -209,7 +203,6 @@ Methods
:toctree: api/
CustomBusinessHour.copy
- CustomBusinessHour.is_anchored
CustomBusinessHour.is_on_offset
CustomBusinessHour.is_month_start
CustomBusinessHour.is_month_end
@@ -244,7 +237,6 @@ Methods
:toctree: api/
MonthEnd.copy
- MonthEnd.is_anchored
MonthEnd.is_on_offset
MonthEnd.is_month_start
MonthEnd.is_month_end
@@ -279,7 +271,6 @@ Methods
:toctree: api/
MonthBegin.copy
- MonthBegin.is_anchored
MonthBegin.is_on_offset
MonthBegin.is_month_start
MonthBegin.is_month_end
@@ -323,7 +314,6 @@ Methods
:toctree: api/
BusinessMonthEnd.copy
- BusinessMonthEnd.is_anchored
BusinessMonthEnd.is_on_offset
BusinessMonthEnd.is_month_start
BusinessMonthEnd.is_month_end
@@ -367,7 +357,6 @@ Methods
:toctree: api/
BusinessMonthBegin.copy
- BusinessMonthBegin.is_anchored
BusinessMonthBegin.is_on_offset
BusinessMonthBegin.is_month_start
BusinessMonthBegin.is_month_end
@@ -415,7 +404,6 @@ Methods
:toctree: api/
CustomBusinessMonthEnd.copy
- CustomBusinessMonthEnd.is_anchored
CustomBusinessMonthEnd.is_on_offset
CustomBusinessMonthEnd.is_month_start
CustomBusinessMonthEnd.is_month_end
@@ -463,7 +451,6 @@ Methods
:toctree: api/
CustomBusinessMonthBegin.copy
- CustomBusinessMonthBegin.is_anchored
CustomBusinessMonthBegin.is_on_offset
CustomBusinessMonthBegin.is_month_start
CustomBusinessMonthBegin.is_month_end
@@ -499,7 +486,6 @@ Methods
:toctree: api/
SemiMonthEnd.copy
- SemiMonthEnd.is_anchored
SemiMonthEnd.is_on_offset
SemiMonthEnd.is_month_start
SemiMonthEnd.is_month_end
@@ -535,7 +521,6 @@ Methods
:toctree: api/
SemiMonthBegin.copy
- SemiMonthBegin.is_anchored
SemiMonthBegin.is_on_offset
SemiMonthBegin.is_month_start
SemiMonthBegin.is_month_end
@@ -571,7 +556,6 @@ Methods
:toctree: api/
Week.copy
- Week.is_anchored
Week.is_on_offset
Week.is_month_start
Week.is_month_end
@@ -607,7 +591,6 @@ Methods
:toctree: api/
WeekOfMonth.copy
- WeekOfMonth.is_anchored
WeekOfMonth.is_on_offset
WeekOfMonth.weekday
WeekOfMonth.is_month_start
@@ -645,7 +628,6 @@ Methods
:toctree: api/
LastWeekOfMonth.copy
- LastWeekOfMonth.is_anchored
LastWeekOfMonth.is_on_offset
LastWeekOfMonth.is_month_start
LastWeekOfMonth.is_month_end
@@ -681,7 +663,6 @@ Methods
:toctree: api/
BQuarterEnd.copy
- BQuarterEnd.is_anchored
BQuarterEnd.is_on_offset
BQuarterEnd.is_month_start
BQuarterEnd.is_month_end
@@ -717,7 +698,6 @@ Methods
:toctree: api/
BQuarterBegin.copy
- BQuarterBegin.is_anchored
BQuarterBegin.is_on_offset
BQuarterBegin.is_month_start
BQuarterBegin.is_month_end
@@ -753,7 +733,6 @@ Methods
:toctree: api/
QuarterEnd.copy
- QuarterEnd.is_anchored
QuarterEnd.is_on_offset
QuarterEnd.is_month_start
QuarterEnd.is_month_end
@@ -789,7 +768,6 @@ Methods
:toctree: api/
QuarterBegin.copy
- QuarterBegin.is_anchored
QuarterBegin.is_on_offset
QuarterBegin.is_month_start
QuarterBegin.is_month_end
@@ -798,6 +776,146 @@ Methods
QuarterBegin.is_year_start
QuarterBegin.is_year_end
+BHalfYearEnd
+------------
+.. autosummary::
+ :toctree: api/
+
+ BHalfYearEnd
+
+Properties
+~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ BHalfYearEnd.freqstr
+ BHalfYearEnd.kwds
+ BHalfYearEnd.name
+ BHalfYearEnd.nanos
+ BHalfYearEnd.normalize
+ BHalfYearEnd.rule_code
+ BHalfYearEnd.n
+ BHalfYearEnd.startingMonth
+
+Methods
+~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ BHalfYearEnd.copy
+ BHalfYearEnd.is_on_offset
+ BHalfYearEnd.is_month_start
+ BHalfYearEnd.is_month_end
+ BHalfYearEnd.is_quarter_start
+ BHalfYearEnd.is_quarter_end
+ BHalfYearEnd.is_year_start
+ BHalfYearEnd.is_year_end
+
+BHalfYearBegin
+--------------
+.. autosummary::
+ :toctree: api/
+
+ BHalfYearBegin
+
+Properties
+~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ BHalfYearBegin.freqstr
+ BHalfYearBegin.kwds
+ BHalfYearBegin.name
+ BHalfYearBegin.nanos
+ BHalfYearBegin.normalize
+ BHalfYearBegin.rule_code
+ BHalfYearBegin.n
+ BHalfYearBegin.startingMonth
+
+Methods
+~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ BHalfYearBegin.copy
+ BHalfYearBegin.is_on_offset
+ BHalfYearBegin.is_month_start
+ BHalfYearBegin.is_month_end
+ BHalfYearBegin.is_quarter_start
+ BHalfYearBegin.is_quarter_end
+ BHalfYearBegin.is_year_start
+ BHalfYearBegin.is_year_end
+
+HalfYearEnd
+-----------
+.. autosummary::
+ :toctree: api/
+
+ HalfYearEnd
+
+Properties
+~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ HalfYearEnd.freqstr
+ HalfYearEnd.kwds
+ HalfYearEnd.name
+ HalfYearEnd.nanos
+ HalfYearEnd.normalize
+ HalfYearEnd.rule_code
+ HalfYearEnd.n
+ HalfYearEnd.startingMonth
+
+Methods
+~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ HalfYearEnd.copy
+ HalfYearEnd.is_on_offset
+ HalfYearEnd.is_month_start
+ HalfYearEnd.is_month_end
+ HalfYearEnd.is_quarter_start
+ HalfYearEnd.is_quarter_end
+ HalfYearEnd.is_year_start
+ HalfYearEnd.is_year_end
+
+HalfYearBegin
+-------------
+.. autosummary::
+ :toctree: api/
+
+ HalfYearBegin
+
+Properties
+~~~~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ HalfYearBegin.freqstr
+ HalfYearBegin.kwds
+ HalfYearBegin.name
+ HalfYearBegin.nanos
+ HalfYearBegin.normalize
+ HalfYearBegin.rule_code
+ HalfYearBegin.n
+ HalfYearBegin.startingMonth
+
+Methods
+~~~~~~~
+.. autosummary::
+ :toctree: api/
+
+ HalfYearBegin.copy
+ HalfYearBegin.is_on_offset
+ HalfYearBegin.is_month_start
+ HalfYearBegin.is_month_end
+ HalfYearBegin.is_quarter_start
+ HalfYearBegin.is_quarter_end
+ HalfYearBegin.is_year_start
+ HalfYearBegin.is_year_end
+
BYearEnd
--------
.. autosummary::
@@ -825,7 +943,6 @@ Methods
:toctree: api/
BYearEnd.copy
- BYearEnd.is_anchored
BYearEnd.is_on_offset
BYearEnd.is_month_start
BYearEnd.is_month_end
@@ -861,7 +978,6 @@ Methods
:toctree: api/
BYearBegin.copy
- BYearBegin.is_anchored
BYearBegin.is_on_offset
BYearBegin.is_month_start
BYearBegin.is_month_end
@@ -897,7 +1013,6 @@ Methods
:toctree: api/
YearEnd.copy
- YearEnd.is_anchored
YearEnd.is_on_offset
YearEnd.is_month_start
YearEnd.is_month_end
@@ -933,7 +1048,6 @@ Methods
:toctree: api/
YearBegin.copy
- YearBegin.is_anchored
YearBegin.is_on_offset
YearBegin.is_month_start
YearBegin.is_month_end
@@ -973,7 +1087,6 @@ Methods
FY5253.copy
FY5253.get_rule_code_suffix
FY5253.get_year_end
- FY5253.is_anchored
FY5253.is_on_offset
FY5253.is_month_start
FY5253.is_month_end
@@ -1014,7 +1127,6 @@ Methods
FY5253Quarter.copy
FY5253Quarter.get_rule_code_suffix
FY5253Quarter.get_weeks
- FY5253Quarter.is_anchored
FY5253Quarter.is_on_offset
FY5253Quarter.year_has_extra_week
FY5253Quarter.is_month_start
@@ -1050,7 +1162,6 @@ Methods
:toctree: api/
Easter.copy
- Easter.is_anchored
Easter.is_on_offset
Easter.is_month_start
Easter.is_month_end
@@ -1071,7 +1182,6 @@ Properties
.. autosummary::
:toctree: api/
- Tick.delta
Tick.freqstr
Tick.kwds
Tick.name
@@ -1086,7 +1196,6 @@ Methods
:toctree: api/
Tick.copy
- Tick.is_anchored
Tick.is_on_offset
Tick.is_month_start
Tick.is_month_end
@@ -1107,7 +1216,6 @@ Properties
.. autosummary::
:toctree: api/
- Day.delta
Day.freqstr
Day.kwds
Day.name
@@ -1122,7 +1230,6 @@ Methods
:toctree: api/
Day.copy
- Day.is_anchored
Day.is_on_offset
Day.is_month_start
Day.is_month_end
@@ -1143,7 +1250,6 @@ Properties
.. autosummary::
:toctree: api/
- Hour.delta
Hour.freqstr
Hour.kwds
Hour.name
@@ -1158,7 +1264,6 @@ Methods
:toctree: api/
Hour.copy
- Hour.is_anchored
Hour.is_on_offset
Hour.is_month_start
Hour.is_month_end
@@ -1179,7 +1284,6 @@ Properties
.. autosummary::
:toctree: api/
- Minute.delta
Minute.freqstr
Minute.kwds
Minute.name
@@ -1194,7 +1298,6 @@ Methods
:toctree: api/
Minute.copy
- Minute.is_anchored
Minute.is_on_offset
Minute.is_month_start
Minute.is_month_end
@@ -1215,7 +1318,6 @@ Properties
.. autosummary::
:toctree: api/
- Second.delta
Second.freqstr
Second.kwds
Second.name
@@ -1230,7 +1332,6 @@ Methods
:toctree: api/
Second.copy
- Second.is_anchored
Second.is_on_offset
Second.is_month_start
Second.is_month_end
@@ -1251,7 +1352,6 @@ Properties
.. autosummary::
:toctree: api/
- Milli.delta
Milli.freqstr
Milli.kwds
Milli.name
@@ -1266,7 +1366,6 @@ Methods
:toctree: api/
Milli.copy
- Milli.is_anchored
Milli.is_on_offset
Milli.is_month_start
Milli.is_month_end
@@ -1287,7 +1386,6 @@ Properties
.. autosummary::
:toctree: api/
- Micro.delta
Micro.freqstr
Micro.kwds
Micro.name
@@ -1302,7 +1400,6 @@ Methods
:toctree: api/
Micro.copy
- Micro.is_anchored
Micro.is_on_offset
Micro.is_month_start
Micro.is_month_end
@@ -1323,7 +1420,6 @@ Properties
.. autosummary::
:toctree: api/
- Nano.delta
Nano.freqstr
Nano.kwds
Nano.name
@@ -1338,7 +1434,6 @@ Methods
:toctree: api/
Nano.copy
- Nano.is_anchored
Nano.is_on_offset
Nano.is_month_start
Nano.is_month_end
diff --git a/doc/source/reference/resampling.rst b/doc/source/reference/resampling.rst
index edbc8090fc849..2e0717081b129 100644
--- a/doc/source/reference/resampling.rst
+++ b/doc/source/reference/resampling.rst
@@ -38,7 +38,6 @@ Upsampling
Resampler.ffill
Resampler.bfill
Resampler.nearest
- Resampler.fillna
Resampler.asfreq
Resampler.interpolate
diff --git a/doc/source/reference/series.rst b/doc/source/reference/series.rst
index 58351bab07b22..6006acc8f5e16 100644
--- a/doc/source/reference/series.rst
+++ b/doc/source/reference/series.rst
@@ -25,6 +25,7 @@ Attributes
Series.array
Series.values
Series.dtype
+ Series.info
Series.shape
Series.nbytes
Series.ndim
@@ -47,7 +48,6 @@ Conversion
Series.convert_dtypes
Series.infer_objects
Series.copy
- Series.bool
Series.to_numpy
Series.to_period
Series.to_timestamp
@@ -177,17 +177,16 @@ Reindexing / selection / label manipulation
:toctree: api/
Series.align
+ Series.case_when
Series.drop
Series.droplevel
Series.drop_duplicates
Series.duplicated
Series.equals
- Series.first
Series.head
Series.idxmax
Series.idxmin
Series.isin
- Series.last
Series.reindex
Series.reindex_like
Series.rename
@@ -209,7 +208,6 @@ Missing data handling
.. autosummary::
:toctree: api/
- Series.backfill
Series.bfill
Series.dropna
Series.ffill
@@ -219,7 +217,6 @@ Missing data handling
Series.isnull
Series.notna
Series.notnull
- Series.pad
Series.replace
Reshaping, sorting
@@ -237,10 +234,8 @@ Reshaping, sorting
Series.unstack
Series.explode
Series.searchsorted
- Series.ravel
Series.repeat
Series.squeeze
- Series.view
Combining / comparing / joining / merging
-----------------------------------------
@@ -341,7 +336,6 @@ Datetime properties
Series.dt.tz
Series.dt.freq
Series.dt.unit
- Series.dt.normalize
Datetime methods
^^^^^^^^^^^^^^^^
@@ -525,6 +519,46 @@ Sparse-dtype specific methods and attributes are provided under the
Series.sparse.from_coo
Series.sparse.to_coo
+
+.. _api.series.list:
+
+List accessor
+~~~~~~~~~~~~~
+
+Arrow list-dtype specific methods and attributes are provided under the
+``Series.list`` accessor.
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/accessor_method.rst
+
+ Series.list.flatten
+ Series.list.len
+ Series.list.__getitem__
+
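+For example, a minimal sketch (requires PyArrow):
+
+.. code-block:: python
+
+   import pandas as pd
+   import pyarrow as pa
+
+   s = pd.Series([[1, 2], [3]], dtype=pd.ArrowDtype(pa.list_(pa.int64())))
+   s.list.len()  # length of each list
+   s.list[0]     # first element of each list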
+
+.. _api.series.struct:
+
+Struct accessor
+~~~~~~~~~~~~~~~
+
+Arrow struct-dtype specific methods and attributes are provided under the
+``Series.struct`` accessor.
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/accessor_attribute.rst
+
+ Series.struct.dtypes
+
+.. autosummary::
+ :toctree: api/
+ :template: autosummary/accessor_method.rst
+
+ Series.struct.field
+ Series.struct.explode
+
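+For example, a minimal sketch (requires PyArrow):
+
+.. code-block:: python
+
+   import pandas as pd
+   import pyarrow as pa
+
+   dtype = pd.ArrowDtype(pa.struct([("a", pa.int64()), ("b", pa.string())]))
+   s = pd.Series([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}], dtype=dtype)
+   s.struct.field("a")  # extract one field as a Series
+   s.struct.explode()   # expand all fields into a DataFrame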
+
.. _api.series.flags:
Flags
diff --git a/doc/source/reference/style.rst b/doc/source/reference/style.rst
index 2256876c93e01..742263c788c2f 100644
--- a/doc/source/reference/style.rst
+++ b/doc/source/reference/style.rst
@@ -27,6 +27,7 @@ Styler properties
Styler.template_html_style
Styler.template_html_table
Styler.template_latex
+ Styler.template_typst
Styler.template_string
Styler.loader
@@ -41,6 +42,7 @@ Style application
Styler.map_index
Styler.format
Styler.format_index
+ Styler.format_index_names
Styler.relabel_index
Styler.hide
Styler.concat
@@ -76,6 +78,7 @@ Style export and import
Styler.to_html
Styler.to_latex
+ Styler.to_typst
Styler.to_excel
Styler.to_string
Styler.export
diff --git a/doc/source/reference/testing.rst b/doc/source/reference/testing.rst
index a5d61703aceed..2c9c2dcae0f69 100644
--- a/doc/source/reference/testing.rst
+++ b/doc/source/reference/testing.rst
@@ -36,6 +36,7 @@ Exceptions and warnings
errors.DuplicateLabelError
errors.EmptyDataError
errors.IncompatibilityWarning
+ errors.IncompatibleFrequency
errors.IndexingError
errors.InvalidColumnName
errors.InvalidComparison
@@ -58,8 +59,6 @@ Exceptions and warnings
errors.PossiblePrecisionLoss
errors.PyperclipException
errors.PyperclipWindowsException
- errors.SettingWithCopyError
- errors.SettingWithCopyWarning
errors.SpecificationError
errors.UndefinedVariableError
errors.UnsortedIndexError
diff --git a/doc/source/reference/window.rst b/doc/source/reference/window.rst
index 14af2b8a120e0..2bd63f02faf69 100644
--- a/doc/source/reference/window.rst
+++ b/doc/source/reference/window.rst
@@ -30,15 +30,19 @@ Rolling window functions
Rolling.std
Rolling.min
Rolling.max
+ Rolling.first
+ Rolling.last
Rolling.corr
Rolling.cov
Rolling.skew
Rolling.kurt
Rolling.apply
+ Rolling.pipe
Rolling.aggregate
Rolling.quantile
Rolling.sem
Rolling.rank
+ Rolling.nunique
.. _api.functions_window:
@@ -71,15 +75,19 @@ Expanding window functions
Expanding.std
Expanding.min
Expanding.max
+ Expanding.first
+ Expanding.last
Expanding.corr
Expanding.cov
Expanding.skew
Expanding.kurt
Expanding.apply
+ Expanding.pipe
Expanding.aggregate
Expanding.quantile
Expanding.sem
Expanding.rank
+ Expanding.nunique
.. _api.functions_ewm:
diff --git a/doc/source/user_guide/10min.rst b/doc/source/user_guide/10min.rst
index 5def84b91705c..8beaa73090673 100644
--- a/doc/source/user_guide/10min.rst
+++ b/doc/source/user_guide/10min.rst
@@ -19,7 +19,7 @@ Customarily, we import as follows:
Basic data structures in pandas
-------------------------------
-Pandas provides two types of classes for handling data:
+pandas provides two types of classes for handling data:
1. :class:`Series`: a one-dimensional labeled array holding data of any type
such as integers, strings, Python objects etc.
@@ -91,8 +91,8 @@ will be completed:
df2.any df2.combine
df2.append df2.D
df2.apply df2.describe
- df2.applymap df2.diff
df2.B df2.duplicated
+ df2.diff
As you can see, the columns ``A``, ``B``, ``C``, and ``D`` are automatically
tab completed. ``E`` and ``F`` are there as well; the rest of the attributes have been
@@ -101,7 +101,7 @@ truncated for brevity.
Viewing data
------------
-See the :ref:`Essentially basics functionality section `.
+See the :ref:`Essential basic functionality section `.
Use :meth:`DataFrame.head` and :meth:`DataFrame.tail` to view the top and bottom rows of the frame
respectively:
@@ -177,13 +177,27 @@ See the indexing documentation :ref:`Indexing and Selecting Data ` and
Getitem (``[]``)
~~~~~~~~~~~~~~~~
-For a :class:`DataFrame`, passing a single label selects a columns and
-yields a :class:`Series` equivalent to ``df.A``:
+For a :class:`DataFrame`, passing a single label selects a column and
+yields a :class:`Series`:
.. ipython:: python
df["A"]
+If the label is a valid Python identifier (it contains only letters, numbers,
+and underscores, and does not start with a number), you can
+alternatively access the column as an attribute:
+
+.. ipython:: python
+
+ df.A
+
+Passing a list of column labels selects multiple columns, which can be useful
+for taking a subset of the columns or reordering them:
+
+.. ipython:: python
+
+ df[["B", "A"]]
+
For a :class:`DataFrame`, passing a slice ``:`` selects matching rows:
.. ipython:: python
@@ -451,7 +465,7 @@ Merge
Concat
~~~~~~
-pandas provides various facilities for easily combining together :class:`Series`` and
+pandas provides various facilities for easily combining together :class:`Series` and
:class:`DataFrame` objects with various kinds of set logic for the indexes
and relational algebra functionality in the case of join / merge-type
operations.
@@ -525,7 +539,7 @@ See the :ref:`Grouping section `.
df
Grouping by a column label, selecting column labels, and then applying the
-:meth:`~pandas.core.groupby.DataFrameGroupBy.sum` function to the resulting
+:meth:`.DataFrameGroupBy.sum` function to the resulting
groups:
.. ipython:: python
@@ -563,7 +577,7 @@ columns:
.. ipython:: python
- stacked = df2.stack(future_stack=True)
+ stacked = df2.stack()
stacked
With a "stacked" DataFrame or Series (having a :class:`MultiIndex` as the
diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst
index 682fa4c9b4fcc..f7ab466e92d93 100644
--- a/doc/source/user_guide/advanced.rst
+++ b/doc/source/user_guide/advanced.rst
@@ -11,13 +11,6 @@ and :ref:`other advanced indexing features `.
See the :ref:`Indexing and Selecting Data ` for general indexing documentation.
-.. warning::
-
- Whether a copy or a reference is returned for a setting operation may
- depend on the context. This is sometimes called ``chained assignment`` and
- should be avoided. See :ref:`Returning a View versus Copy
- `.
-
See the :ref:`cookbook` for some advanced strategies.
.. _advanced.hierarchical:
@@ -402,6 +395,7 @@ slicers on a single axis.
Furthermore, you can *set* the values using the following methods.
.. ipython:: python
+ :okwarning:
df2 = dfmi.copy()
df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10
@@ -976,7 +970,7 @@ of :ref:`frequency aliases ` with datetime-like inter
pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
- pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9H")
+ pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Additionally, the ``closed`` parameter can be used to specify which side(s) the intervals
are closed on. Intervals are closed on the right side by default.
diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst
index 2e299da5e5794..3fdd15462b51e 100644
--- a/doc/source/user_guide/basics.rst
+++ b/doc/source/user_guide/basics.rst
@@ -36,7 +36,7 @@ of elements to display is five, but you may pass a custom number.
Attributes and underlying data
------------------------------
-pandas objects have a number of attributes enabling you to access the metadata
+pandas objects have a number of attributes enabling you to access the metadata.
* **shape**: gives the axis dimensions of the object, consistent with ndarray
* Axis labels
@@ -59,7 +59,7 @@ NumPy's type system to add support for custom arrays
(see :ref:`basics.dtypes`).
To get the actual data inside a :class:`Index` or :class:`Series`, use
-the ``.array`` property
+the ``.array`` property.
.. ipython:: python
@@ -88,18 +88,18 @@ NumPy doesn't have a dtype to represent timezone-aware datetimes, so there
are two possibly useful representations:
1. An object-dtype :class:`numpy.ndarray` with :class:`Timestamp` objects, each
- with the correct ``tz``
+ with the correct ``tz``.
2. A ``datetime64[ns]`` -dtype :class:`numpy.ndarray`, where the values have
- been converted to UTC and the timezone discarded
+ been converted to UTC and the timezone discarded.
-Timezones may be preserved with ``dtype=object``
+Timezones may be preserved with ``dtype=object``:
.. ipython:: python
ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))
ser.to_numpy(dtype=object)
-Or thrown away with ``dtype='datetime64[ns]'``
+Or thrown away with ``dtype='datetime64[ns]'``:
.. ipython:: python
@@ -155,17 +155,6 @@ speedups. ``numexpr`` uses smart chunking, caching, and multiple cores. ``bottle
a set of specialized cython routines that are especially fast when dealing with arrays that have
``nans``.
-Here is a sample (using 100 column x 100,000 row ``DataFrames``):
-
-.. csv-table::
- :header: "Operation", "0.11.0 (ms)", "Prior Version (ms)", "Ratio to Prior"
- :widths: 25, 25, 25, 25
- :delim: ;
-
- ``df1 > df2``; 13.32; 125.35; 0.1063
- ``df1 * df2``; 21.71; 36.63; 0.5928
- ``df1 + df2``; 22.04; 36.50; 0.6039
-
You are highly encouraged to install both libraries. See the section
:ref:`Recommended Dependencies ` for more installation info.
@@ -269,7 +258,7 @@ using ``fillna`` if you wish).
.. ipython:: python
df2 = df.copy()
- df2["three"]["a"] = 1.0
+ df2.loc["a", "three"] = 1.0
df
df2
df + df2
@@ -299,8 +288,7 @@ Boolean reductions
~~~~~~~~~~~~~~~~~~
You can apply the reductions: :attr:`~DataFrame.empty`, :meth:`~DataFrame.any`,
-:meth:`~DataFrame.all`, and :meth:`~DataFrame.bool` to provide a
-way to summarize a boolean result.
+:meth:`~DataFrame.all`.
.. ipython:: python
@@ -408,20 +396,6 @@ raise a ValueError:
pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
-Note that this is different from the NumPy behavior where a comparison can
-be broadcast:
-
-.. ipython:: python
-
- np.array([1, 2, 3]) == np.array([2])
-
-or it can return False if broadcasting can not be done:
-
-.. ipython:: python
- :okwarning:
-
- np.array([1, 2, 3]) == np.array([1, 2])
-
Combining overlapping data sets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -491,15 +465,15 @@ For example:
.. ipython:: python
df
- df.mean(0)
- df.mean(1)
+ df.mean(axis=0)
+ df.mean(axis=1)
All such methods have a ``skipna`` option signaling whether to exclude missing
data (``True`` by default):
.. ipython:: python
- df.sum(0, skipna=False)
+ df.sum(axis=0, skipna=False)
df.sum(axis=1, skipna=True)
Combined with the broadcasting / arithmetic behavior, one can describe various
@@ -510,8 +484,8 @@ standard deviation of 1), very concisely:
ts_stand = (df - df.mean()) / df.std()
ts_stand.std()
- xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
- xs_stand.std(1)
+ xs_stand = df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)
+ xs_stand.std(axis=1)
Note that methods like :meth:`~DataFrame.cumsum` and :meth:`~DataFrame.cumprod`
preserve the location of ``NaN`` values. This is somewhat different from
@@ -616,7 +590,7 @@ arguments. The special value ``all`` can also be used:
.. ipython:: python
- frame.describe(include=["object"])
+ frame.describe(include=["str"])
frame.describe(include=["number"])
frame.describe(include="all")
@@ -1323,8 +1297,8 @@ filling method chosen from the following table:
:header: "Method", "Action"
:widths: 30, 50
- pad / ffill, Fill values forward
- bfill / backfill, Fill values backward
+ ffill, Fill values forward
+ bfill, Fill values backward
nearest, Fill from the nearest index value
We illustrate these fill methods on a simple Series:
@@ -1622,7 +1596,7 @@ For instance:
This method does not convert the row to a Series object; it merely
returns the values inside a namedtuple. Therefore,
:meth:`~DataFrame.itertuples` preserves the data type of the values
-and is generally faster as :meth:`~DataFrame.iterrows`.
+and is generally faster than :meth:`~DataFrame.iterrows`.
.. note::
@@ -2021,7 +1995,7 @@ documentation sections for more on each type.
| | | | | ``'Int64'``, ``'UInt8'``, ``'UInt16'``,|
| | | | | ``'UInt32'``, ``'UInt64'`` |
+-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
-| ``nullable float`` | :class:`Float64Dtype`, ...| (none) | :class:`arrays.FloatingArray` | ``'Float32'``, ``'Float64'`` |
+| :ref:`nullable float ` | :class:`Float64Dtype`, ...| (none) | :class:`arrays.FloatingArray` | ``'Float32'``, ``'Float64'`` |
+-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
| :ref:`Strings ` | :class:`StringDtype` | :class:`str` | :class:`arrays.StringArray` | ``'string'`` |
+-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
@@ -2090,12 +2064,12 @@ different numeric dtypes will **NOT** be combined. The following example will gi
.. ipython:: python
- df1 = pd.DataFrame(np.random.randn(8, 1), columns=["A"], dtype="float32")
+ df1 = pd.DataFrame(np.random.randn(8, 1), columns=["A"], dtype="float64")
df1
df1.dtypes
df2 = pd.DataFrame(
{
- "A": pd.Series(np.random.randn(8), dtype="float16"),
+ "A": pd.Series(np.random.randn(8), dtype="float32"),
"B": pd.Series(np.random.randn(8)),
"C": pd.Series(np.random.randint(0, 255, size=8), dtype="uint8"), # [0,255] (range of uint8)
}
@@ -2275,23 +2249,6 @@ non-conforming elements intermixed that you want to represent as missing:
m = ["apple", pd.Timedelta("1day")]
pd.to_timedelta(m, errors="coerce")
-The ``errors`` parameter has a third option of ``errors='ignore'``, which will simply return the passed in data if it
-encounters any errors with the conversion to a desired data type:
-
-.. ipython:: python
- :okwarning:
-
- import datetime
-
- m = ["apple", datetime.datetime(2016, 3, 2)]
- pd.to_datetime(m, errors="ignore")
-
- m = ["apple", 2, 3]
- pd.to_numeric(m, errors="ignore")
-
- m = ["apple", pd.Timedelta("1day")]
- pd.to_timedelta(m, errors="ignore")
-
In addition to object conversion, :meth:`~pandas.to_numeric` provides another argument ``downcast``, which gives the
option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:
diff --git a/doc/source/user_guide/boolean.rst b/doc/source/user_guide/boolean.rst
index 3c361d4de17e5..7de0430123fd2 100644
--- a/doc/source/user_guide/boolean.rst
+++ b/doc/source/user_guide/boolean.rst
@@ -37,6 +37,19 @@ If you would prefer to keep the ``NA`` values you can manually fill them with ``
s[mask.fillna(True)]
+If you create a column of ``NA`` values (for example, to fill them in later)
+with ``df['new_col'] = pd.NA``, the new column's ``dtype`` will be set to ``object``.
+Performance on this column will be worse than with an appropriate dtype.
+It's better to use
+``df['new_col'] = pd.Series(pd.NA, dtype="boolean")``
+(or another ``dtype`` that supports ``NA``).
+
+.. ipython:: python
+
+ df = pd.DataFrame()
+ df['objects'] = pd.NA
+ df.dtypes
+
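+By contrast, a sketch of the recommended pattern, which keeps a nullable dtype:
+
+.. code-block:: python
+
+   df = pd.DataFrame()
+   df['new_col'] = pd.Series(pd.NA, dtype="boolean")
+   df.dtypes  # new_col is boolean, not object
+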
.. _boolean.kleene:
Kleene logical operations
diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst
index 34d04745ccdb5..1e7d66dfeb142 100644
--- a/doc/source/user_guide/categorical.rst
+++ b/doc/source/user_guide/categorical.rst
@@ -245,7 +245,8 @@ Equality semantics
Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and order. When comparing two
-unordered categoricals, the order of the ``categories`` is not considered.
+unordered categoricals, the order of the ``categories`` is not considered. Note
+that categories with different dtypes are not considered equal.
.. ipython:: python
@@ -263,6 +264,16 @@ All instances of ``CategoricalDtype`` compare equal to the string ``'category'``
c1 == "category"
+Note that the dtype of the ``categories`` should be taken into account, especially
+when comparing two empty ``CategoricalDtype`` instances:
+
+.. ipython:: python
+
+ c2 = pd.Categorical(np.array([], dtype=object))
+ c3 = pd.Categorical(np.array([], dtype=float))
+
+ c2.dtype == c3.dtype
+
Description
-----------
@@ -647,7 +658,7 @@ Pivot tables:
raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
- pd.pivot_table(df, values="values", index=["A", "B"])
+ pd.pivot_table(df, values="values", index=["A", "B"], observed=False)
Data munging
------------
@@ -782,7 +793,7 @@ Assigning a ``Categorical`` to parts of a column of other types will use the val
:okwarning:
df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
- df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])
+ df.loc[1:2, "a"] = pd.Categorical([2, 2], categories=[2, 3])
df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])
df
df.dtypes
diff --git a/doc/source/user_guide/cookbook.rst b/doc/source/user_guide/cookbook.rst
index c0d2a14507383..91a0b4a4fe967 100644
--- a/doc/source/user_guide/cookbook.rst
+++ b/doc/source/user_guide/cookbook.rst
@@ -35,7 +35,7 @@ These are some neat pandas ``idioms``
)
df
-if-then...
+If-then...
**********
An if-then on one column
@@ -176,7 +176,7 @@ One could hard code:
Selection
---------
-Dataframes
+DataFrames
**********
The :ref:`indexing ` docs.
@@ -311,7 +311,7 @@ The :ref:`multindexing ` docs.
df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns])
df
# Now stack & Reset
- df = df.stack(0, future_stack=True).reset_index(1)
+ df = df.stack(0).reset_index(1)
df
# And fix the labels (Notice the label 'level_1' got added automatically)
df.columns = ["Sample", "All_X", "All_Y"]
@@ -688,7 +688,7 @@ The :ref:`Pivot ` docs.
aggfunc="sum",
margins=True,
)
- table.stack("City", future_stack=True)
+ table.stack("City")
`Frequency table like plyr in R
`__
@@ -771,7 +771,7 @@ To create year and month cross tabulation:
df = pd.DataFrame(
{"value": np.random.randn(36)},
- index=pd.date_range("2011-01-01", freq="M", periods=36),
+ index=pd.date_range("2011-01-01", freq="ME", periods=36),
)
pd.pivot_table(
@@ -794,12 +794,12 @@ Apply
index=["I", "II", "III"],
)
- def make_df(ser):
- new_vals = [pd.Series(value, name=name) for name, value in ser.items()]
- return pd.DataFrame(new_vals)
-
- df_orgz = pd.concat({ind: row.pipe(make_df) for ind, row in df.iterrows()})
+ def SeriesFromSubList(aList):
+ return pd.Series(aList)
+ df_orgz = pd.concat(
+ {ind: row.apply(SeriesFromSubList) for ind, row in df.iterrows()}
+ )
df_orgz
`Rolling apply with a DataFrame returning a Series
@@ -874,7 +874,7 @@ Timeseries
`__
`Aggregation and plotting time series
-`__
+`__
Turn a matrix with hours in columns and days in rows into a continuous row sequence in the form of a time series.
`How to rearrange a Python pandas DataFrame?
@@ -914,7 +914,7 @@ Using TimeGrouper and another grouping to create subgroups, then apply a custom
`__
`Resample intraday frame without adding new days
-`__
+`__
`Resample minute data
`__
@@ -1043,7 +1043,7 @@ CSV
The :ref:`CSV ` docs
-`read_csv in action `__
+`read_csv in action `__
`appending to a csv
`__
@@ -1489,7 +1489,7 @@ of the data values:
)
df
-Constant series
+Constant Series
---------------
To assess if a series has a constant value, we can check if ``series.nunique() <= 1``.
diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index 59bdb1926895f..90353d9f49f00 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -6,11 +6,13 @@
Copy-on-Write (CoW)
*******************
-Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
-optimizations that become possible through CoW are implemented and supported. A complete list
-can be found at :ref:`Copy-on-Write optimizations `.
+.. note::
+
+ Copy-on-Write is now the default with pandas 3.0.
-We expect that CoW will be enabled by default in version 3.0.
+Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
+optimizations that become possible through CoW are implemented and supported. All possible
+optimizations are supported starting from pandas 2.1.
CoW will lead to more predictable behavior since it is not possible to update more than
one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through
@@ -20,9 +22,26 @@ Previous behavior
-----------------
pandas indexing behavior is tricky to understand. Some operations return views while
-other return copies. Depending on the result of the operation, mutation one object
+others return copies. Depending on the result of the operation, mutating one object
might accidentally mutate another:
+.. code-block:: ipython
+
+ In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ In [2]: subset = df["foo"]
+ In [3]: subset.iloc[0] = 100
+ In [4]: df
+ Out[4]:
+ foo bar
+ 0 100 4
+ 1 2 5
+ 2 3 6
+
+
+Mutating ``subset``, e.g. updating its values, also updated ``df``. The exact behavior was
+hard to predict. Copy-on-Write solves the problem of accidentally modifying more than one
+object at once by explicitly disallowing it. ``df`` is unchanged:
+
.. ipython:: python
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
@@ -30,21 +49,111 @@ might accidentally mutate another:
subset.iloc[0] = 100
df
-Mutating ``subset``, e.g. updating its values, also updates ``df``. The exact behavior is
-hard to predict. Copy-on-Write solves accidentally modifying more than one object,
-it explicitly disallows this. With CoW enabled, ``df`` is unchanged:
+The following sections will explain what this means and how it impacts existing
+applications.
+
+.. _copy_on_write.migration_guide:
+
+Migrating to Copy-on-Write
+--------------------------
+
+Copy-on-Write is the default and only mode in pandas 3.0. This means that users
+need to migrate their code to be compliant with CoW rules.
+
+The default mode in pandas < 3.0 raises warnings for certain cases that will actively
+change behavior with CoW and would thus silently change the behavior the user intended.
+
+pandas 2.2 has a warning mode
+
+.. code-block:: python
+
+ pd.options.mode.copy_on_write = "warn"
+
+that will warn for every operation that will change behavior with CoW. We expect this mode
+to be very noisy, since many cases that we don't expect to affect users will
+also emit a warning. We recommend checking this mode and analyzing the warnings, but it is
+not necessary to address all of these warnings. The first two items of the following list
+are the only cases that need to be addressed to make existing code work with CoW.
+
+The following few items describe the user visible changes:
+
+**Chained assignment will never work**
+
+``loc`` should be used as an alternative. Check the
+:ref:`chained assignment section ` for more details.
+
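+For example, a short sketch of the ``loc`` pattern that replaces a chained assignment:
+
+.. code-block:: python
+
+   df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+   # instead of df["foo"][df["bar"] > 5] = 100
+   df.loc[df["bar"] > 5, "foo"] = 100
+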
+**Accessing the underlying array of a pandas object will return a read-only view**
+
+.. ipython:: python
+
+ ser = pd.Series([1, 2, 3])
+ ser.to_numpy()
+
+This example returns a NumPy array that is a view of the Series object. This view could
+be modified, and thereby also modify the pandas object, which is not compliant with CoW
+rules. The returned array is therefore set to non-writeable to protect against this behavior.
+Creating a copy of this array allows modification. You can also make the array
+writeable again if you don't care about the pandas object anymore.
+
+See the section about :ref:`read-only NumPy arrays `
+for more details.
+
+**Only one pandas object is updated at once**
+
+The following code snippet updated both ``df`` and ``subset`` without CoW:
+
+.. code-block:: ipython
+
+ In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ In [2]: subset = df["foo"]
+ In [3]: subset.iloc[0] = 100
+ In [4]: df
+ Out[4]:
+ foo bar
+ 0 100 4
+ 1 2 5
+ 2 3 6
+
+This is no longer possible with CoW, since the CoW rules explicitly forbid it.
+This includes updating a single column as a :class:`Series` and relying on the change
+propagating back to the parent :class:`DataFrame`.
+This statement can be rewritten into a single statement with ``loc`` or ``iloc`` if
+this behavior is necessary. :meth:`DataFrame.where` is another suitable alternative
+for this case.
+
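+For example, a sketch of such a rewrite as a single ``loc`` statement:
+
+.. code-block:: python
+
+   df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+   # instead of: subset = df["foo"]; subset.iloc[0] = 100
+   df.loc[df.index[0], "foo"] = 100
+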
+Updating a column selected from a :class:`DataFrame` with an inplace method will
+no longer work either:
.. ipython:: python
+ :okwarning:
+
+ df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ df["foo"].replace(1, 5, inplace=True)
+ df
+
+This is another form of chained assignment. It can generally be rewritten in two
+different ways:
- pd.options.mode.copy_on_write = True
+.. ipython:: python
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
- subset = df["foo"]
- subset.iloc[0] = 100
+ df.replace({"foo": {1: 5}}, inplace=True)
df
-The following sections will explain what this means and how it impacts existing
-applications.
+A different alternative would be to not use ``inplace``:
+
+.. ipython:: python
+
+ df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ df["foo"] = df["foo"].replace(1, 5)
+ df
+
+**Constructors now copy NumPy arrays by default**
+
+The Series and DataFrame constructors now copy a NumPy array by default when not
+otherwise specified. This was changed to avoid mutating a pandas object when the
+NumPy array is changed inplace outside of pandas. You can set ``copy=False`` to
+avoid this copy.
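+
+A minimal sketch of the new default:
+
+.. code-block:: python
+
+   import numpy as np
+
+   arr = np.array([1, 2, 3])
+   ser = pd.Series(arr)  # copies arr by default
+   arr[0] = 100          # does not mutate ser anymore
+   ser_shared = pd.Series(arr, copy=False)  # explicitly opt out of the copy
+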
Description
-----------
@@ -57,7 +166,7 @@ that shares data with another DataFrame or Series object inplace.
This avoids side-effects when modifying values and hence, most methods can avoid
actually copying the data and only trigger a copy when necessary.
-The following example will operate inplace with CoW:
+The following example will operate inplace:
.. ipython:: python
@@ -102,15 +211,17 @@ listed in :ref:`Copy-on-Write optimizations `.
Previously, when operating on views, the view and the parent object was modified:
-.. ipython:: python
-
- with pd.option_context("mode.copy_on_write", False):
- df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
- view = df[:]
- df.iloc[0, 0] = 100
+.. code-block:: ipython
- df
- view
+ In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ In [2]: subset = df["foo"]
+ In [3]: subset.iloc[0] = 100
+ In [4]: df
+ Out[4]:
+ foo bar
+ 0 100 4
+ 1 2 5
+ 2 3 6
CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well:
@@ -123,21 +234,27 @@ CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well:
df
view
+.. _copy_on_write_chained_assignment:
+
Chained Assignment
------------------
Chained assignment references a technique where an object is updated through
two subsequent indexing operations, e.g.
-.. ipython:: python
+.. code-block:: ipython
- with pd.option_context("mode.copy_on_write", False):
- df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
- df["foo"][df["bar"] > 5] = 100
- df
+ In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ In [2]: df["foo"][df["bar"] > 5] = 100
+ In [3]: df
+ Out[3]:
+ foo bar
+ 0 100 4
+ 1 2 5
+ 2 3 6
-The column ``foo`` is updated where the column ``bar`` is greater than 5.
-This violates the CoW principles though, because it would have to modify the
+The column ``foo`` was updated where the column ``bar`` is greater than 5.
+This violated the CoW principles though, because it would have to modify the
view ``df["foo"]`` and ``df`` in one step. Hence, chained assignment will
consistently never work and raise a ``ChainedAssignmentError`` warning
with CoW enabled:
@@ -154,83 +271,87 @@ With copy on write this can be done by using ``loc``.
df.loc[df["bar"] > 5, "foo"] = 100
-.. _copy_on_write.optimizations:
+.. _copy_on_write_read_only_na:
-Copy-on-Write optimizations
----------------------------
+Read-only NumPy arrays
+----------------------
-A new lazy copy mechanism that defers the copy until the object in question is modified
-and only if this object shares data with another object. This mechanism was added to
-following methods:
-
- - :meth:`DataFrame.reset_index` / :meth:`Series.reset_index`
- - :meth:`DataFrame.set_index`
- - :meth:`DataFrame.set_axis` / :meth:`Series.set_axis`
- - :meth:`DataFrame.set_flags` / :meth:`Series.set_flags`
- - :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis`
- - :meth:`DataFrame.reindex` / :meth:`Series.reindex`
- - :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like`
- - :meth:`DataFrame.assign`
- - :meth:`DataFrame.drop`
- - :meth:`DataFrame.dropna` / :meth:`Series.dropna`
- - :meth:`DataFrame.select_dtypes`
- - :meth:`DataFrame.align` / :meth:`Series.align`
- - :meth:`Series.to_frame`
- - :meth:`DataFrame.rename` / :meth:`Series.rename`
- - :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix`
- - :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix`
- - :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates`
- - :meth:`DataFrame.droplevel` / :meth:`Series.droplevel`
- - :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels`
- - :meth:`DataFrame.between_time` / :meth:`Series.between_time`
- - :meth:`DataFrame.filter` / :meth:`Series.filter`
- - :meth:`DataFrame.head` / :meth:`Series.head`
- - :meth:`DataFrame.tail` / :meth:`Series.tail`
- - :meth:`DataFrame.isetitem`
- - :meth:`DataFrame.pipe` / :meth:`Series.pipe`
- - :meth:`DataFrame.pop` / :meth:`Series.pop`
- - :meth:`DataFrame.replace` / :meth:`Series.replace`
- - :meth:`DataFrame.shift` / :meth:`Series.shift`
- - :meth:`DataFrame.sort_index` / :meth:`Series.sort_index`
- - :meth:`DataFrame.sort_values` / :meth:`Series.sort_values`
- - :meth:`DataFrame.squeeze` / :meth:`Series.squeeze`
- - :meth:`DataFrame.swapaxes`
- - :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel`
- - :meth:`DataFrame.take` / :meth:`Series.take`
- - :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp`
- - :meth:`DataFrame.to_period` / :meth:`Series.to_period`
- - :meth:`DataFrame.truncate`
- - :meth:`DataFrame.iterrows`
- - :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
- - :meth:`DataFrame.fillna` / :meth:`Series.fillna`
- - :meth:`DataFrame.interpolate` / :meth:`Series.interpolate`
- - :meth:`DataFrame.ffill` / :meth:`Series.ffill`
- - :meth:`DataFrame.bfill` / :meth:`Series.bfill`
- - :meth:`DataFrame.where` / :meth:`Series.where`
- - :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
- - :meth:`DataFrame.astype` / :meth:`Series.astype`
- - :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes`
- - :meth:`DataFrame.join`
- - :meth:`DataFrame.eval`
- - :func:`concat`
- - :func:`merge`
+Accessing the underlying NumPy array of a DataFrame will return a read-only array if the array
+shares data with the initial DataFrame.
-These methods return views when Copy-on-Write is enabled, which provides a significant
-performance improvement compared to the regular execution.
+The array is a copy if the initial DataFrame consists of more than one array:
+
+.. ipython:: python
+
+ df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
+ df.to_numpy()
+
+The array shares data with the DataFrame if the DataFrame consists of only one NumPy array:
+
+.. ipython:: python
+
+ df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+ df.to_numpy()
+
+This array is read-only, which means that it can't be modified inplace:
+
+.. ipython:: python
+ :okexcept:
-How to enable CoW
+ arr = df.to_numpy()
+ arr[0, 0] = 100
+
+The same holds true for a Series, since a Series always consists of a single array.
+
+There are two potential solutions to this:
+
+- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
+- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so
+ it should be used with caution.
+
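+A sketch of the first option, triggering the copy manually:
+
+.. code-block:: python
+
+   arr = df.to_numpy().copy()  # a writeable copy, detached from df
+   arr[0, 0] = 100             # does not affect df
+
+The second option, making the returned array writeable again:
+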
+.. ipython:: python
+
+ arr = df.to_numpy()
+ arr.flags.writeable = True
+ arr[0, 0] = 100
+ arr
+
+Patterns to avoid
-----------------
-Copy-on-Write can be enabled through the configuration option ``copy_on_write``. The option can
-be turned on __globally__ through either of the following:
+No defensive copy will be performed if two objects share the same data while
+you are modifying one object inplace.
.. ipython:: python
- pd.set_option("mode.copy_on_write", True)
+ df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
+ df2 = df.reset_index(drop=True)
+ df2.iloc[0, 0] = 100
- pd.options.mode.copy_on_write = True
+This creates two objects that share data and thus the setitem operation will trigger a
+copy. This is not necessary if the initial object ``df`` isn't needed anymore.
+Simply reassigning to the same variable will invalidate the reference that is
+held by the object.
.. ipython:: python
- :suppress:
- pd.options.mode.copy_on_write = False
+ df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
+ df = df.reset_index(drop=True)
+ df.iloc[0, 0] = 100
+
+No copy is necessary in this example.
+Creating multiple references keeps unnecessary references alive
+and thus will hurt performance with Copy-on-Write.
+
+.. _copy_on_write.optimizations:
+
+Copy-on-Write optimizations
+---------------------------
+
+A new lazy copy mechanism defers the copy until the object in question is modified,
+and only if this object shares data with another object. This mechanism was added to
+methods that don't require a copy of the underlying data. Popular examples are :meth:`DataFrame.drop` for ``axis=1``
+and :meth:`DataFrame.rename`.
+
+These methods return views when Copy-on-Write is enabled, which provides a significant
+performance improvement compared to the regular execution.
diff --git a/doc/source/user_guide/dsintro.rst b/doc/source/user_guide/dsintro.rst
index d1e981ee1bbdc..89981786d60b5 100644
--- a/doc/source/user_guide/dsintro.rst
+++ b/doc/source/user_guide/dsintro.rst
@@ -41,8 +41,8 @@ Here, ``data`` can be many different things:
* an ndarray
* a scalar value (like 5)
-The passed **index** is a list of axis labels. Thus, this separates into a few
-cases depending on what **data is**:
+The passed **index** is a list of axis labels. The constructor's behavior
+depends on **data**'s type:
**From ndarray**
@@ -87,8 +87,9 @@ index will be pulled out.
**From scalar value**
-If ``data`` is a scalar value, an index must be
-provided. The value will be repeated to match the length of **index**.
+If ``data`` is a scalar value, the value will be repeated to match
+the length of **index**. If the **index** is not provided, it defaults
+to ``RangeIndex(1)``.
.. ipython:: python
@@ -97,7 +98,7 @@ provided. The value will be repeated to match the length of **index**.
Series is ndarray-like
~~~~~~~~~~~~~~~~~~~~~~
-:class:`Series` acts very similarly to a ``ndarray`` and is a valid argument to most NumPy functions.
+:class:`Series` acts very similarly to a :class:`numpy.ndarray` and is a valid argument to most NumPy functions.
However, operations such as slicing will also slice the index.
.. ipython:: python
@@ -111,7 +112,7 @@ However, operations such as slicing will also slice the index.
.. note::
We will address array-based indexing like ``s.iloc[[4, 3, 1]]``
- in :ref:`section on indexing `.
+ in the :ref:`section on indexing `.
Like a NumPy array, a pandas :class:`Series` has a single :attr:`~Series.dtype`.
@@ -325,7 +326,7 @@ This case is handled identically to a dict of arrays.
.. ipython:: python
- data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "a10")])
+ data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]
pd.DataFrame(data)
diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst
index bc2f4420da784..9c37f317a805e 100644
--- a/doc/source/user_guide/enhancingperf.rst
+++ b/doc/source/user_guide/enhancingperf.rst
@@ -50,7 +50,7 @@ We have a :class:`DataFrame` to which we want to apply a function row-wise.
{
"a": np.random.randn(1000),
"b": np.random.randn(1000),
- "N": np.random.randint(100, 1000, (1000)),
+ "N": np.random.randint(100, 1000, (1000), dtype="int64"),
"x": "x",
}
)
@@ -83,7 +83,7 @@ using the `prun ipython magic function `__.
+`__.
Using ``parallel=True`` (e.g. ``@jit(parallel=True)``) may result in a ``SIGABRT`` if the threading layer leads to unsafe
behavior. You can first `specify a safe threading layer `__
diff --git a/doc/source/user_guide/gotchas.rst b/doc/source/user_guide/gotchas.rst
index c00a236ff4e9d..e85eead4e0f09 100644
--- a/doc/source/user_guide/gotchas.rst
+++ b/doc/source/user_guide/gotchas.rst
@@ -121,7 +121,7 @@ Below is how to check if any of the values are ``True``:
if pd.Series([False, True, False]).any():
print("I am any")
-Bitwise boolean
+Bitwise Boolean
~~~~~~~~~~~~~~~
Bitwise boolean operators like ``==`` and ``!=`` return a boolean :class:`Series`
@@ -315,19 +315,8 @@ Why not make NumPy like R?
Many people have suggested that NumPy should simply emulate the ``NA`` support
present in the more domain-specific statistical programming language `R
-`__. Part of the reason is the NumPy type hierarchy:
-
-.. csv-table::
- :header: "Typeclass","Dtypes"
- :widths: 30,70
- :delim: |
-
- ``numpy.floating`` | ``float16, float32, float64, float128``
- ``numpy.integer`` | ``int8, int16, int32, int64``
- ``numpy.unsignedinteger`` | ``uint8, uint16, uint32, uint64``
- ``numpy.object_`` | ``object_``
- ``numpy.bool_`` | ``bool_``
- ``numpy.character`` | ``bytes_, str_``
+`__. Part of the reason is the
+`NumPy type hierarchy `__.
The R language, by contrast, only has a handful of built-in data types:
``integer``, ``numeric`` (floating-point), ``character``, and
@@ -379,9 +368,9 @@ constructors using something similar to the following:
.. ipython:: python
x = np.array(list(range(10)), ">i4") # big endian
- newx = x.byteswap().newbyteorder() # force native byteorder
+ newx = x.byteswap().view(x.dtype.newbyteorder()) # force native byteorder
s = pd.Series(newx)
See `the NumPy documentation on byte order
-`__ for more
+`__ for more
details.
diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst
index c28123cec4491..4ec34db6ed959 100644
--- a/doc/source/user_guide/groupby.rst
+++ b/doc/source/user_guide/groupby.rst
@@ -13,10 +13,8 @@ steps:
* **Applying** a function to each group independently.
* **Combining** the results into a data structure.
-Out of these, the split step is the most straightforward. In fact, in many
-situations we may wish to split the data set into groups and do something with
-those groups. In the apply step, we might wish to do one of the
-following:
+Out of these, the split step is the most straightforward. In the apply step, we
+might wish to do one of the following:
* **Aggregation**: compute a summary statistic (or statistics) for each
group. Some examples:
@@ -53,9 +51,7 @@ of the above three categories.
function.
-Since the set of object instance methods on pandas data structures is generally
-rich and expressive, we often simply want to invoke, say, a DataFrame function
-on each group. The name GroupBy should be quite familiar to those who have used
+The name GroupBy should be quite familiar to those who have used
a SQL-based tool (or ``itertools``), in which you can write code like:
.. code-block:: sql
@@ -65,7 +61,7 @@ a SQL-based tool (or ``itertools``), in which you can write code like:
GROUP BY Column1, Column2
We aim to make operations like this natural and easy to express using
-pandas. We'll address each area of GroupBy functionality then provide some
+pandas. We'll address each area of GroupBy functionality, then provide some
non-trivial examples / use cases.
See the :ref:`cookbook` for some advanced strategies.
@@ -134,21 +130,13 @@ We could naturally group by either the ``A`` or ``B`` columns, or both:
.. ipython:: python
grouped = df.groupby("A")
+ grouped = df.groupby("B")
grouped = df.groupby(["A", "B"])
.. note::
``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``.
-If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all
-the columns except the one we specify:
-
-.. ipython:: python
-
- df2 = df.set_index(["A", "B"])
- grouped = df2.groupby(level=df2.index.names.difference(["B"]))
- grouped.sum()
-
The above GroupBy will split the DataFrame on its index (rows). To split by columns, first do
a transpose:
@@ -170,9 +158,11 @@ output of aggregation functions will only contain unique index values:
.. ipython:: python
-    lst = [1, 2, 3, 1, 2, 3]
-    s = pd.Series([1, 2, 3, 10, 20, 30], lst)
+    index = [1, 2, 3, 1, 2, 3]
+
+    s = pd.Series([1, 2, 3, 10, 20, 30], index=index)
+    s
grouped = s.groupby(level=0)
@@ -248,7 +238,7 @@ GroupBy object attributes
~~~~~~~~~~~~~~~~~~~~~~~~~
The ``groups`` attribute is a dictionary whose keys are the computed unique groups
-and corresponding values are the axis labels belonging to each group. In the
+and whose corresponding values are the index labels belonging to each group. In the
above example we have:
.. ipython:: python
@@ -256,8 +246,8 @@ above example we have:
df.groupby("A").groups
df.T.groupby(get_letter_type).groups
-Calling the standard Python ``len`` function on the GroupBy object just returns
-the length of the ``groups`` dict, so it is largely just a convenience:
+Calling the standard Python ``len`` function on the GroupBy object returns
+the number of groups, which is the same as the length of the ``groups`` dictionary:
.. ipython:: python
@@ -268,7 +258,7 @@ the length of the ``groups`` dict, so it is largely just a convenience:
.. _groupby.tabcompletion:
-``GroupBy`` will tab complete column names (and other attributes):
+``GroupBy`` will tab complete column names, GroupBy operations, and other attributes:
.. ipython:: python
@@ -290,7 +280,7 @@ the length of the ``groups`` dict, so it is largely just a convenience:
In [1]: gb. # noqa: E225, E999
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
- gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
+ gb.apply gb.cummax gb.cumsum gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
.. _groupby.multiindex:
@@ -420,6 +410,18 @@ This is mainly syntactic sugar for the alternative, which is much more verbose:
Additionally, this method avoids recomputing the internal grouping information
derived from the passed key.
+You can also include the grouping columns if you want to operate on them.
+
+.. ipython:: python
+
+ grouped[["A", "B"]].sum()
+
+.. note::
+
+ A ``groupby`` operation in pandas drops the ``name`` field of the columns Index
+ object after the operation. This ensures consistent behavior across the different
+ column selection methods within groupby operations.
+
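+A small sketch of the behavior this note describes (``df_c`` is a hypothetical frame
+whose columns Index carries a ``name``):
+
+.. ipython:: python
+
+    df_c = pd.DataFrame({"A": [1, 1, 2], "B": [3, 4, 5]})
+    df_c.columns.name = "cols"
+    df_c.groupby("A")[["B"]].sum().columns  # per the note, the ``name`` is dropped
+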
.. _groupby.iterating-label:
Iterating through groups
@@ -452,7 +454,7 @@ Selecting a group
-----------------
A single group can be selected using
-:meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:
+:meth:`.DataFrameGroupBy.get_group`:
.. ipython:: python
@@ -499,34 +501,33 @@ Built-in aggregation methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Many common aggregations are built-in to GroupBy objects as methods. Of the methods
-listed below, those with a ``*`` do *not* have a Cython-optimized implementation.
+listed below, those with a ``*`` do *not* have an efficient, GroupBy-specific implementation.
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
- :delim: ;
-
- :meth:`~.DataFrameGroupBy.any`;Compute whether any of the values in the groups are truthy
- :meth:`~.DataFrameGroupBy.all`;Compute whether all of the values in the groups are truthy
- :meth:`~.DataFrameGroupBy.count`;Compute the number of non-NA values in the groups
- :meth:`~.DataFrameGroupBy.cov` * ;Compute the covariance of the groups
- :meth:`~.DataFrameGroupBy.first`;Compute the first occurring value in each group
- :meth:`~.DataFrameGroupBy.idxmax` *;Compute the index of the maximum value in each group
- :meth:`~.DataFrameGroupBy.idxmin` *;Compute the index of the minimum value in each group
- :meth:`~.DataFrameGroupBy.last`;Compute the last occurring value in each group
- :meth:`~.DataFrameGroupBy.max`;Compute the maximum value in each group
- :meth:`~.DataFrameGroupBy.mean`;Compute the mean of each group
- :meth:`~.DataFrameGroupBy.median`;Compute the median of each group
- :meth:`~.DataFrameGroupBy.min`;Compute the minimum value in each group
- :meth:`~.DataFrameGroupBy.nunique`;Compute the number of unique values in each group
- :meth:`~.DataFrameGroupBy.prod`;Compute the product of the values in each group
- :meth:`~.DataFrameGroupBy.quantile`;Compute a given quantile of the values in each group
- :meth:`~.DataFrameGroupBy.sem`;Compute the standard error of the mean of the values in each group
- :meth:`~.DataFrameGroupBy.size`;Compute the number of values in each group
- :meth:`~.DataFrameGroupBy.skew` *;Compute the skew of the values in each group
- :meth:`~.DataFrameGroupBy.std`;Compute the standard deviation of the values in each group
- :meth:`~.DataFrameGroupBy.sum`;Compute the sum of the values in each group
- :meth:`~.DataFrameGroupBy.var`;Compute the variance of the values in each group
+
+ :meth:`~.DataFrameGroupBy.any`,Compute whether any of the values in the groups are truthy
+ :meth:`~.DataFrameGroupBy.all`,Compute whether all of the values in the groups are truthy
+ :meth:`~.DataFrameGroupBy.count`,Compute the number of non-NA values in the groups
+ :meth:`~.DataFrameGroupBy.cov` * ,Compute the covariance of the groups
+ :meth:`~.DataFrameGroupBy.first`,Compute the first occurring value in each group
+ :meth:`~.DataFrameGroupBy.idxmax`,Compute the index of the maximum value in each group
+ :meth:`~.DataFrameGroupBy.idxmin`,Compute the index of the minimum value in each group
+ :meth:`~.DataFrameGroupBy.last`,Compute the last occurring value in each group
+ :meth:`~.DataFrameGroupBy.max`,Compute the maximum value in each group
+ :meth:`~.DataFrameGroupBy.mean`,Compute the mean of each group
+ :meth:`~.DataFrameGroupBy.median`,Compute the median of each group
+ :meth:`~.DataFrameGroupBy.min`,Compute the minimum value in each group
+ :meth:`~.DataFrameGroupBy.nunique`,Compute the number of unique values in each group
+ :meth:`~.DataFrameGroupBy.prod`,Compute the product of the values in each group
+ :meth:`~.DataFrameGroupBy.quantile`,Compute a given quantile of the values in each group
+ :meth:`~.DataFrameGroupBy.sem`,Compute the standard error of the mean of the values in each group
+ :meth:`~.DataFrameGroupBy.size`,Compute the number of values in each group
+ :meth:`~.DataFrameGroupBy.skew` * ,Compute the skew of the values in each group
+ :meth:`~.DataFrameGroupBy.std`,Compute the standard deviation of the values in each group
+ :meth:`~.DataFrameGroupBy.sum`,Compute the sum of the values in each group
+ :meth:`~.DataFrameGroupBy.var`,Compute the variance of the values in each group
Some examples:
@@ -535,16 +536,16 @@ Some examples:
df.groupby("A")[["C", "D"]].max()
df.groupby(["A", "B"]).mean()
-Another simple aggregation example is to compute the size of each group.
+Another aggregation example is to compute the size of each group.
This is included in GroupBy as the ``size`` method. It returns a Series whose
-index are the group names and whose values are the sizes of each group.
+index consists of the group names and whose values are the sizes of each group.
.. ipython:: python
grouped = df.groupby(["A", "B"])
grouped.size()
-While the :meth:`~.DataFrameGroupBy.describe` method is not itself a reducer, it
+While the :meth:`.DataFrameGroupBy.describe` method is not itself a reducer, it
can be used to conveniently produce a collection of summary statistics about each of
the groups.
@@ -553,7 +554,7 @@ the groups.
grouped.describe()
Another aggregation example is to compute the number of unique values of each group.
-This is similar to the ``value_counts`` function, except that it only counts the
+This is similar to the :meth:`.DataFrameGroupBy.value_counts` method, except that it only counts the
number of unique values.
.. ipython:: python
@@ -566,11 +567,11 @@ number of unique values.
.. note::
Aggregation functions **will not** return the groups that you are aggregating over
- as named *columns*, when ``as_index=True``, the default. The grouped columns will
+ as named *columns* when ``as_index=True``, the default. The grouped columns will
be the **indices** of the returned object.
- Passing ``as_index=False`` **will** return the groups that you are aggregating over, if they are
- named **indices** or *columns*.
+ Passing ``as_index=False`` **will** return the groups that you are aggregating over as
+ named columns, regardless of whether they are named **indices** or *columns* in the inputs.
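+
+A minimal sketch of both behaviors, reusing the example ``df`` from above:
+
+.. ipython:: python
+
+    df.groupby("A")[["C", "D"]].sum()
+    df.groupby("A", as_index=False)[["C", "D"]].sum()
+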
.. _groupby.aggregate.agg:
@@ -596,7 +597,7 @@ Any reduction method that pandas implements can be passed as a string to
grouped.agg("sum")
The result of the aggregation will have the group names as the
-new index along the grouped axis. In the case of multiple keys, the result is a
+new index. In the case of multiple keys, the result is a
:ref:`MultiIndex ` by default. As mentioned above, this can be
changed by using the ``as_index`` option:
@@ -617,7 +618,7 @@ this will make an extra copy.
.. _groupby.aggregate.udf:
-Aggregation with User-Defined Functions
+Aggregation with user-defined functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Users can also provide their own User-Defined Functions (UDFs) for custom aggregations.
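+
+For example, a lambda that reduces each group to its range (a minimal sketch, reusing the
+example ``df`` from above):
+
+.. ipython:: python
+
+    df.groupby("A")["C"].agg(lambda x: x.max() - x.min())
+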
@@ -650,24 +651,26 @@ different dtypes, then a common dtype will be determined in the same way as ``Da
Applying multiple functions at once
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-With grouped ``Series`` you can also pass a list or dict of functions to do
-aggregation with, outputting a DataFrame:
+On a grouped ``Series``, you can pass a list or dict of functions to
+:meth:`SeriesGroupBy.agg`, outputting a DataFrame:
.. ipython:: python
grouped = df.groupby("A")
grouped["C"].agg(["sum", "mean", "std"])
-On a grouped ``DataFrame``, you can pass a list of functions to apply to each
-column, which produces an aggregated result with a hierarchical index:
+On a grouped ``DataFrame``, you can pass a list of functions to
+:meth:`DataFrameGroupBy.agg` to aggregate each
+column, which produces an aggregated result with a hierarchical column index:
.. ipython:: python
grouped[["C", "D"]].agg(["sum", "mean", "std"])
-The resulting aggregations are named after the functions themselves. If you
-need to rename, then you can add in a chained operation for a ``Series`` like this:
+The resulting aggregations are named after the functions themselves.
+
+For a ``Series``, if you need to rename, you can add in a chained operation like this:
.. ipython:: python
@@ -677,8 +680,19 @@ need to rename, then you can add in a chained operation for a ``Series`` like th
.rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
)
+Or, you can simply pass a list of tuples, each with the name of the new column and the aggregate function:
+
+.. ipython:: python
+
+ (
+ grouped["C"]
+ .agg([("foo", "sum"), ("bar", "mean"), ("baz", "std")])
+ )
+
For a grouped ``DataFrame``, you can rename in a similar manner,
+by chaining the ``rename`` operation:
+
.. ipython:: python
(
@@ -687,6 +701,16 @@ For a grouped ``DataFrame``, you can rename in a similar manner:
)
)
+Or, by passing a list of tuples:
+
+.. ipython:: python
+
+ (
+ grouped[["C", "D"]].agg(
+ [("foo", "sum"), ("bar", "mean"), ("baz", "std")]
+ )
+ )
+
.. note::
In general, the output column names should be unique, but pandas will allow
@@ -824,31 +848,28 @@ A common use of a transformation is to add the result back into the original Dat
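+For instance, broadcasting a per-group mean back onto the original rows (a minimal
+sketch, using a copy of the example ``df`` from above):
+
+.. ipython:: python
+
+    df_t = df.copy()
+    df_t["C_group_mean"] = df_t.groupby("A")["C"].transform("mean")
+    df_t
+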
Built-in transformation methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The following methods on GroupBy act as transformations. Of these methods, only
-``fillna`` does not have a Cython-optimized implementation.
+The following methods on GroupBy act as transformations.
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
- :delim: ;
-
- :meth:`~.DataFrameGroupBy.bfill`;Back fill NA values within each group
- :meth:`~.DataFrameGroupBy.cumcount`;Compute the cumulative count within each group
- :meth:`~.DataFrameGroupBy.cummax`;Compute the cumulative max within each group
- :meth:`~.DataFrameGroupBy.cummin`;Compute the cumulative min within each group
- :meth:`~.DataFrameGroupBy.cumprod`;Compute the cumulative product within each group
- :meth:`~.DataFrameGroupBy.cumsum`;Compute the cumulative sum within each group
- :meth:`~.DataFrameGroupBy.diff`;Compute the difference between adjacent values within each group
- :meth:`~.DataFrameGroupBy.ffill`;Forward fill NA values within each group
- :meth:`~.DataFrameGroupBy.fillna`;Fill NA values within each group
- :meth:`~.DataFrameGroupBy.pct_change`;Compute the percent change between adjacent values within each group
- :meth:`~.DataFrameGroupBy.rank`;Compute the rank of each value within each group
- :meth:`~.DataFrameGroupBy.shift`;Shift values up or down within each group
+
+ :meth:`~.DataFrameGroupBy.bfill`,Back fill NA values within each group
+ :meth:`~.DataFrameGroupBy.cumcount`,Compute the cumulative count within each group
+ :meth:`~.DataFrameGroupBy.cummax`,Compute the cumulative max within each group
+ :meth:`~.DataFrameGroupBy.cummin`,Compute the cumulative min within each group
+ :meth:`~.DataFrameGroupBy.cumprod`,Compute the cumulative product within each group
+ :meth:`~.DataFrameGroupBy.cumsum`,Compute the cumulative sum within each group
+ :meth:`~.DataFrameGroupBy.diff`,Compute the difference between adjacent values within each group
+ :meth:`~.DataFrameGroupBy.ffill`,Forward fill NA values within each group
+ :meth:`~.DataFrameGroupBy.pct_change`,Compute the percent change between adjacent values within each group
+ :meth:`~.DataFrameGroupBy.rank`,Compute the rank of each value within each group
+ :meth:`~.DataFrameGroupBy.shift`,Shift values up or down within each group
In addition, passing any built-in aggregation method as a string to
:meth:`~.DataFrameGroupBy.transform` (see the next section) will broadcast the result
-across the group, producing a transformed result. If the aggregation method is
-Cython-optimized, this will be performant as well.
+across the group, producing a transformed result. If the aggregation method has an efficient
+implementation, this will be performant as well.
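+
+A minimal, self-contained sketch of this broadcasting behavior (``df_b`` is hypothetical):
+
+.. ipython:: python
+
+    df_b = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 3.0, 5.0]})
+    df_b.groupby("g")["v"].transform("sum")
+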
.. _groupby.transformation.transform:
@@ -890,7 +911,7 @@ also accept User-Defined Functions (UDFs). The UDF must:
the built-in methods.
All of the examples in this section can be made more performant by calling
- built-in methods instead of using ``transform``.
+ built-in methods instead of using UDFs.
See :ref:`below for examples `.
.. versionchanged:: 2.0.0
@@ -921,7 +942,7 @@ Suppose we wish to standardize the data within each group:
We would expect the result to now have mean 0 and standard deviation 1 within
-each group, which we can easily check:
+each group (up to floating-point error), which we can easily check:
.. ipython:: python
@@ -995,18 +1016,18 @@ using a UDF is commented out and the faster alternative appears below.
.. ipython:: python
- # ts.groupby(lambda x: x.year).transform(
+ # result = ts.groupby(lambda x: x.year).transform(
# lambda x: (x - x.mean()) / x.std()
# )
grouped = ts.groupby(lambda x: x.year)
result = (ts - grouped.transform("mean")) / grouped.transform("std")
- # ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
+ # result = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
grouped = ts.groupby(lambda x: x.year)
result = grouped.transform("max") - grouped.transform("min")
# grouped = data_df.groupby(key)
- # grouped.transform(lambda x: x.fillna(x.mean()))
+ # result = grouped.transform(lambda x: x.fillna(x.mean()))
grouped = data_df.groupby(key)
result = data_df.fillna(grouped.transform("mean"))
@@ -1060,7 +1081,7 @@ missing values with the ``ffill()`` method.
Filtration
----------
-A filtration is a GroupBy operation the subsets the original grouping object. It
+A filtration is a GroupBy operation that subsets the original grouping object. It
may filter out entire groups, parts of groups, or both. Filtrations return
a filtered version of the calling object, including the grouping columns when provided.
In the following example, ``class`` is included in the result.
@@ -1085,17 +1106,16 @@ Filtrations will respect subsetting the columns of the GroupBy object.
Built-in filtrations
~~~~~~~~~~~~~~~~~~~~
-The following methods on GroupBy act as filtrations. All these methods have a
-Cython-optimized implementation.
+The following methods on GroupBy act as filtrations. All these methods have an
+efficient, GroupBy-specific implementation.
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
- :delim: ;
- :meth:`~.DataFrameGroupBy.head`;Select the top row(s) of each group
- :meth:`~.DataFrameGroupBy.nth`;Select the nth row(s) of each group
- :meth:`~.DataFrameGroupBy.tail`;Select the bottom row(s) of each group
+ :meth:`~.DataFrameGroupBy.head`,Select the top row(s) of each group
+ :meth:`~.DataFrameGroupBy.nth`,Select the nth row(s) of each group
+ :meth:`~.DataFrameGroupBy.tail`,Select the bottom row(s) of each group
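+
+For instance, selecting the first row of each group with ``head`` (a minimal sketch;
+``df_f`` is hypothetical):
+
+.. ipython:: python
+
+    df_f = pd.DataFrame({"g": ["a", "a", "b", "b"], "v": [1, 2, 3, 4]})
+    df_f.groupby("g").head(1)
+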
Users can also use transformations along with Boolean indexing to construct complex
filtrations within groups. For example, suppose we are given groups of products and
@@ -1207,6 +1227,19 @@ The dimension of the returned result can also change:
grouped.apply(f)
+``apply`` on a Series can operate on a returned value from the applied function that is
+itself a ``Series``, possibly upcasting the result to a DataFrame:
+
+.. ipython:: python
+
+ def f(x):
+ return pd.Series([x, x ** 2], index=["x", "x^2"])
+
+
+ s = pd.Series(np.random.rand(5))
+ s
+ s.apply(f)
+
Similar to :ref:`groupby.aggregate.agg`, the resulting dtype will reflect that of the
apply function. If the results from different groups have different dtypes, then
a common dtype will be determined in the same way as ``DataFrame`` construction.
@@ -1228,7 +1261,7 @@ with
df.groupby("A", group_keys=False).apply(lambda x: x)
-Numba Accelerated Routines
+Numba accelerated routines
--------------------------
.. versionadded:: 1.1
@@ -1250,8 +1283,8 @@ will be passed into ``values``, and the group index will be passed into ``index`
Other useful features
---------------------
-Exclusion of "nuisance" columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Exclusion of non-numeric columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Again consider the example DataFrame we've been looking at:
@@ -1261,8 +1294,8 @@ Again consider the example DataFrame we've been looking at:
Suppose we wish to compute the standard deviation grouped by the ``A``
column. There is a slight problem, namely that we don't care about the data in
-column ``B`` because it is not numeric. We refer to these non-numeric columns as
-"nuisance" columns. You can avoid nuisance columns by specifying ``numeric_only=True``:
+column ``B`` because it is not numeric. You can avoid non-numeric columns by
+specifying ``numeric_only=True``:
.. ipython:: python
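+    # a minimal sketch: with ``numeric_only=True``, non-numeric columns such as ``B`` are skipped
+    df.groupby("A").std(numeric_only=True)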
@@ -1289,17 +1322,8 @@ is only needed over one column (here ``colname``), it may be filtered
],
}
)
-
- # Decimal columns can be sum'd explicitly by themselves...
df_dec.groupby(["id"])[["dec_column"]].sum()
- # ...but cannot be combined with standard data types or they will be excluded
- df_dec.groupby(["id"])[["int_column", "dec_column"]].sum()
-
- # Use .agg function to aggregate over standard and "nuisance" data types
- # at the same time
- df_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"})
-
.. _groupby.observed:
Handling of (un)observed Categorical values
@@ -1331,35 +1355,55 @@ The returned dtype of the grouped will *always* include *all* of the categories
s = (
pd.Series([1, 1, 1])
- .groupby(pd.Categorical(["a", "a", "a"], categories=["a", "b"]), observed=False)
+ .groupby(pd.Categorical(["a", "a", "a"], categories=["a", "b"]), observed=True)
.count()
)
s.index.dtype
.. _groupby.missing:
-NA and NaT group handling
-~~~~~~~~~~~~~~~~~~~~~~~~~
+NA group handling
+~~~~~~~~~~~~~~~~~
+
+By ``NA``, we are referring to any ``NA`` values, including
+:class:`NA`, ``NaN``, ``NaT``, and ``None``. If there are any ``NA`` values in the
+grouping key, by default these will be excluded. In other words, any
+"``NA`` group" will be dropped. You can include NA groups by specifying ``dropna=False``.
+
+.. ipython:: python
+
+ df = pd.DataFrame({"key": [1.0, 1.0, np.nan, 2.0, np.nan], "A": [1, 2, 3, 4, 5]})
+ df
+
+ df.groupby("key", dropna=True).sum()
-If there are any NaN or NaT values in the grouping key, these will be
-automatically excluded. In other words, there will never be an "NA group" or
-"NaT group". This was not the case in older versions of pandas, but users were
-generally discarding the NA group anyway (and supporting it was an
-implementation headache).
+ df.groupby("key", dropna=False).sum()
Grouping with ordered factors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Categorical variables represented as instances of pandas's ``Categorical`` class
-can be used as group keys. If so, the order of the levels will be preserved:
+can be used as group keys. If so, the order of the levels will be preserved. When
+``observed=False`` and ``sort=False``, any unobserved categories will be at the
+end of the result in order.
.. ipython:: python
- data = pd.Series(np.random.randn(100))
+ days = pd.Categorical(
+ values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
+ categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
+ )
+ data = pd.DataFrame(
+ {
+ "day": days,
+ "workers": [3, 4, 1, 4, 2, 2],
+ }
+ )
+ data
- factor = pd.qcut(data, [0, 0.25, 0.5, 0.75, 1.0])
+ data.groupby("day", observed=False, sort=True).sum()
- data.groupby(factor, observed=False).mean()
+ data.groupby("day", observed=False, sort=False).sum()
.. _groupby.specify:
@@ -1397,19 +1441,20 @@ Groupby a specific column with the desired frequency. This is like resampling.
.. ipython:: python
- df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"])[["Quantity"]].sum()
+ df.groupby([pd.Grouper(freq="1ME", key="Date"), "Buyer"])[["Quantity"]].sum()
When ``freq`` is specified, the object returned by ``pd.Grouper`` will be an
-instance of ``pandas.api.typing.TimeGrouper``. You have an ambiguous specification
-in that you have a named index and a column that could be potential groupers.
+instance of ``pandas.api.typing.TimeGrouper``. When there is a column and index
+with the same name, you can use ``key`` to group by the column and ``level``
+to group by the index.
.. ipython:: python
df = df.set_index("Date")
df["Date"] = df.index + pd.offsets.MonthEnd(2)
- df.groupby([pd.Grouper(freq="6M", key="Date"), "Buyer"])[["Quantity"]].sum()
+ df.groupby([pd.Grouper(freq="6ME", key="Date"), "Buyer"])[["Quantity"]].sum()
- df.groupby([pd.Grouper(freq="6M", level="Date"), "Buyer"])[["Quantity"]].sum()
+ df.groupby([pd.Grouper(freq="6ME", level="Date"), "Buyer"])[["Quantity"]].sum()
Taking the first rows of each group
@@ -1512,7 +1557,7 @@ Enumerate groups
To see the ordering of the groups (as opposed to the order of rows
within a group given by ``cumcount``) you can use
-:meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.
+:meth:`.DataFrameGroupBy.ngroup`.
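+
+A minimal sketch (``df_n`` is hypothetical):
+
+.. ipython:: python
+
+    df_n = pd.DataFrame({"A": list("aaba")})
+    df_n.groupby("A").ngroup()
+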
@@ -1594,7 +1639,7 @@ code more readable. First we set the data:
)
df.head(2)
-Now, to find prices per store/product, we can simply do:
+We now find the prices per store/product.
.. ipython:: python
@@ -1624,24 +1669,12 @@ object as a parameter into the function you specify.
Examples
--------
-Regrouping by factor
-~~~~~~~~~~~~~~~~~~~~
-
-Regroup columns of a DataFrame according to their sum, and sum the aggregated ones.
-
-.. ipython:: python
-
- df = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0], "c": [1, 0, 0], "d": [2, 3, 4]})
- df
- dft = df.T
- dft.groupby(dft.sum()).sum()
-
.. _groupby.multicolumn_factorization:
Multi-column factorization
~~~~~~~~~~~~~~~~~~~~~~~~~~
-By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
+By using :meth:`.DataFrameGroupBy.ngroup`, we can extract
information about the groups in a way similar to :func:`factorize` (as described
further in the :ref:`reshaping API `) but which applies
naturally to multiple columns of mixed type and different
@@ -1663,14 +1696,14 @@ introduction ` and the
dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
-Groupby by indexer to 'resample' data
+GroupBy by indexer to 'resample' data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.
In order for resample to work on indices that are non-datetimelike, the following procedure can be utilized.
-In the following examples, **df.index // 5** returns a binary array which is used to determine what gets selected for the groupby operation.
+In the following examples, **df.index // 5** returns an integer array which is used to determine what gets selected for the groupby operation.
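+
+For instance, grouping every five consecutive rows into a single bin (a minimal sketch;
+``df_r`` is hypothetical):
+
+.. ipython:: python
+
+    df_r = pd.DataFrame({"v": range(10)})
+    df_r.groupby(df_r.index // 5).sum()
+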
.. note::
@@ -1713,4 +1746,4 @@ column index name will be used as the name of the inserted column:
result
- result.stack(future_stack=True)
+ result.stack()
diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
index f0d6a76f0de5b..85e91859b90d0 100644
--- a/doc/source/user_guide/index.rst
+++ b/doc/source/user_guide/index.rst
@@ -78,6 +78,7 @@ Guides
boolean
visualization
style
+ user_defined_functions
groupby
window
timeseries
@@ -86,5 +87,6 @@ Guides
enhancingperf
scale
sparse
+ migration-3-strings
gotchas
cookbook
diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index 52bc43f52b1d3..ebd1791c0f4ad 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -29,13 +29,6 @@ this area.
production code, we recommend that you take advantage of the optimized
pandas data access methods exposed in this chapter.
-.. warning::
-
- Whether a copy or a reference is returned for a setting operation, may
- depend on the context. This is sometimes called ``chained assignment`` and
- should be avoided. See :ref:`Returning a View versus Copy
- `.
-
See the :ref:`MultiIndex / Advanced Indexing ` for ``MultiIndex`` and more advanced indexing documentation.
See the :ref:`cookbook` for some advanced strategies.
@@ -62,6 +55,8 @@ of multi-axis indexing.
* A boolean array (any ``NA`` values will be treated as ``False``).
* A ``callable`` function with one argument (the calling Series or DataFrame) and
that returns valid output for indexing (one of the above).
+ * A tuple of row (and column) indices whose elements are one of the
+ above inputs (see the sketch after this list).
See more at :ref:`Selection by Label `.
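+
+A minimal sketch of the tuple form (``df_ix`` is hypothetical):
+
+.. ipython:: python
+
+    df_ix = pd.DataFrame(np.arange(9).reshape(3, 3),
+                         index=list("abc"), columns=list("xyz"))
+    df_ix.loc[(["a", "c"], ["x", "z"])]
+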
@@ -78,6 +73,8 @@ of multi-axis indexing.
* A boolean array (any ``NA`` values will be treated as ``False``).
* A ``callable`` function with one argument (the calling Series or DataFrame) and
that returns valid output for indexing (one of the above).
+ * A tuple of row (and column) indices whose elements are one of the
+ above inputs.
See more at :ref:`Selection by Position `,
:ref:`Advanced Indexing ` and :ref:`Advanced
@@ -85,19 +82,26 @@ of multi-axis indexing.
* ``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer. See more at :ref:`Selection By Callable `.
+ .. note::
+
+ Destructuring tuple keys into row (and column) indexes occurs
+ *before* callables are applied, so you cannot return a tuple from
+ a callable to index both rows and columns.
+
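+For example, selecting rows via a callable (a minimal sketch; ``df_cb`` is hypothetical):
+
+.. ipython:: python
+
+    df_cb = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
+    df_cb.loc[lambda d: d["a"] > 1]
+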
Getting values from an object with multi-axes selection uses the following
notation (using ``.loc`` as an example, but the following applies to ``.iloc`` as
well). Any of the axes accessors may be the null slice ``:``. Axes left out of
the specification are assumed to be ``:``, e.g. ``p.loc['a']`` is equivalent to
``p.loc['a', :]``.
-.. csv-table::
- :header: "Object Type", "Indexers"
- :widths: 30, 50
- :delim: ;
- Series; ``s.loc[indexer]``
- DataFrame; ``df.loc[row_indexer,column_indexer]``
+.. ipython:: python
+
+ ser = pd.Series(range(5), index=list("abcde"))
+ ser.loc[["a", "c", "e"]]
+
+ df = pd.DataFrame(np.arange(25).reshape(5, 5), index=list("abcde"), columns=list("abcde"))
+ df.loc[["a", "c", "e"], ["b", "d"]]
.. _indexing.basics:
@@ -113,10 +117,9 @@ indexing pandas objects with ``[]``:
.. csv-table::
:header: "Object Type", "Selection", "Return Value Type"
:widths: 30, 30, 60
- :delim: ;
- Series; ``series[label]``; scalar value
- DataFrame; ``frame[colname]``; ``Series`` corresponding to colname
+ Series, ``series[label]``, scalar value
+ DataFrame, ``frame[colname]``, ``Series`` corresponding to colname
Here we construct a simple time series data set to use for illustrating the
indexing functionality:
@@ -259,6 +262,10 @@ The most robust and consistent way of slicing ranges along arbitrary axes is
described in the :ref:`Selection by Position ` section
detailing the ``.iloc`` method. For now, we explain the semantics of slicing using the ``[]`` operator.
+ .. note::
+
+ When the :class:`Series` has float indices, slicing will select by position.
+
With Series, the syntax works exactly as with an ndarray, returning a slice of
the values and the corresponding labels:
@@ -289,12 +296,6 @@ largely as a convenience since it is such a common operation.
Selection by label
------------------
-.. warning::
-
- Whether a copy or a reference is returned for a setting operation, may depend on the context.
- This is sometimes called ``chained assignment`` and should be avoided.
- See :ref:`Returning a View versus Copy `.
-
.. warning::
``.loc`` is strict when you present slicers that are not compatible (or convertible) with the index type. For example
@@ -324,7 +325,7 @@ The ``.loc`` attribute is the primary access method. The following are valid inp
* A single label, e.g. ``5`` or ``'a'`` (Note that ``5`` is interpreted as a *label* of the index. This use is **not** an integer position along the index.).
* A list or array of labels ``['a', 'b', 'c']``.
-* A slice object with labels ``'a':'f'`` (Note that contrary to usual Python
+* A slice object with labels ``'a':'f'``. Note that contrary to usual Python
slices, **both** the start and the stop are included, when present in the
index! See :ref:`Slicing with labels `.
* A boolean array.
@@ -402,9 +403,9 @@ are returned:
s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])
s.loc[3:5]
-If at least one of the two is absent, but the index is sorted, and can be
-compared against start and stop labels, then slicing will still work as
-expected, by selecting labels which *rank* between the two:
+If the index is sorted, and can be compared against start and stop labels,
+then slicing will still work as expected, by selecting labels which *rank*
+between the two:
.. ipython:: python
@@ -435,12 +436,6 @@ For more information about duplicate labels, see
Selection by position
---------------------
-.. warning::
-
- Whether a copy or a reference is returned for a setting operation, may depend on the context.
- This is sometimes called ``chained assignment`` and should be avoided.
- See :ref:`Returning a View versus Copy `.
-
pandas provides a suite of methods in order to get **purely integer based indexing**. The semantics closely follow Python and NumPy slicing. These are ``0-based`` indexing. When slicing, the start bound is *included*, while the upper bound is *excluded*. Trying to use a non-integer, even a **valid** label will raise an ``IndexError``.
The ``.iloc`` attribute is the primary access method. The following are valid inputs:
@@ -450,6 +445,8 @@ The ``.iloc`` attribute is the primary access method. The following are valid in
* A slice object with ints ``1:7``.
* A boolean array.
* A ``callable``, see :ref:`Selection By Callable `.
+* A tuple of row (and column) indices whose elements are one of the
+ above types.
.. ipython:: python
@@ -553,6 +550,12 @@ Selection by callable
``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer.
The ``callable`` must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.
+.. note::
+
+ For ``.iloc`` indexing, returning a tuple from the callable is
+ not supported, since tuple destructuring for row and column indexes
+ occurs *before* applying callables.
+
.. ipython:: python
df1 = pd.DataFrame(np.random.randn(6, 4),
@@ -697,7 +700,7 @@ to have different probabilities, you can pass the ``sample`` function sampling w
s = pd.Series([0, 1, 2, 3, 4, 5])
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
- s.sample(n=3, weights=example_weights)
+ s.sample(n=2, weights=example_weights)
# Weights will be re-normalized automatically
example_weights2 = [0.5, 0, 0, 0, 0, 0]
@@ -711,7 +714,7 @@ as a string.
df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
'weight_column': [0.5, 0.4, 0.1, 0]})
- df2.sample(n=3, weights='weight_column')
+ df2.sample(n=2, weights='weight_column')
``sample`` also allows users to sample columns instead of rows using the ``axis`` argument.
@@ -855,9 +858,10 @@ and :ref:`Advanced Indexing ` you may select along more than one axis
.. warning::
- ``iloc`` supports two kinds of boolean indexing. If the indexer is a boolean ``Series``,
- an error will be raised. For instance, in the following example, ``df.iloc[s.values, 1]`` is ok.
- The boolean indexer is an array. But ``df.iloc[s, 1]`` would raise ``ValueError``.
+ While ``loc`` supports two kinds of boolean indexing, ``iloc`` only supports indexing with a
+ boolean array. If the indexer is a boolean ``Series``, an error will be raised. For instance,
+ in the following example, ``df.iloc[s.values, 1]`` is ok because the boolean indexer is an
+ array, but ``df.iloc[s, 1]`` would raise a ``ValueError``.
.. ipython:: python
@@ -949,7 +953,7 @@ To select a row where each column meets its own criterion:
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
- row_mask = df.isin(values).all(1)
+ row_mask = df.isin(values).all(axis=1)
df[row_mask]
@@ -1457,16 +1461,33 @@ Looking up values by index/column labels
Sometimes you want to extract a set of values given a sequence of row labels
and column labels; this can be achieved with ``Index.get_indexer`` and NumPy indexing.
-For instance:
-.. ipython:: python
+For heterogeneous column types, we subset columns to avoid unnecessary NumPy conversions:
- df = pd.DataFrame({'col': ["A", "A", "B", "B"],
- 'A': [80, 23, np.nan, 22],
- 'B': [80, 55, 76, 67]})
- df
- idx, cols = pd.factorize(df['col'])
- df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
+.. code-block:: python
+
+ def pd_lookup_het(df, row_labels, col_labels):
+ rows = df.index.get_indexer(row_labels)
+ cols = df.columns.get_indexer(col_labels)
+ sub = df.take(np.unique(cols), axis=1)
+ sub = sub.take(np.unique(rows), axis=0)
+ rows = sub.index.get_indexer(row_labels)
+ values = sub.melt()["value"]
+ cols = sub.columns.get_indexer(col_labels)
+ flat_index = rows + cols * len(sub)
+ result = values[flat_index]
+ return result
+
+For homogeneous column types, it is fastest to skip column subsetting and go directly to NumPy:
+
+.. code-block:: python
+
+ def pd_lookup_hom(df, row_labels, col_labels):
+ rows = df.index.get_indexer(row_labels)
+ df = df.loc[:, sorted(set(col_labels))]
+ cols = df.columns.get_indexer(col_labels)
+ result = df.to_numpy()[rows, cols]
+ return result
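+
+A usage sketch for either helper (the frame below is hypothetical):
+
+.. code-block:: python
+
+    df = pd.DataFrame({"col": ["A", "A", "B", "B"],
+                       "A": [80, 23, np.nan, 22],
+                       "B": [80, 55, 76, 67]})
+    pd_lookup_hom(df, df.index, df["col"])  # -> array([80., 23., 76., 67.])
+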
Formerly this could be achieved with the dedicated ``DataFrame.lookup`` method
which was deprecated in version 1.2.0 and removed in version 2.0.0.
@@ -1704,186 +1725,56 @@ You can assign a custom index to the ``index`` attribute:
df_idx.index = pd.Index([10, 20, 30, 40], name="a")
df_idx
-.. _indexing.view_versus_copy:
-
-Returning a view versus a copy
-------------------------------
-
-When setting values in a pandas object, care must be taken to avoid what is called
-``chained indexing``. Here is an example.
-
-.. ipython:: python
-
- dfmi = pd.DataFrame([list('abcd'),
- list('efgh'),
- list('ijkl'),
- list('mnop')],
- columns=pd.MultiIndex.from_product([['one', 'two'],
- ['first', 'second']]))
- dfmi
-
-Compare these two access methods:
-
-.. ipython:: python
-
- dfmi['one']['second']
-
-.. ipython:: python
-
- dfmi.loc[:, ('one', 'second')]
-
-These both yield the same results, so which should you use? It is instructive to understand the order
-of operations on these and why method 2 (``.loc``) is much preferred over method 1 (chained ``[]``).
-
-``dfmi['one']`` selects the first level of the columns and returns a DataFrame that is singly-indexed.
-Then another Python operation ``dfmi_with_one['second']`` selects the series indexed by ``'second'``.
-This is indicated by the variable ``dfmi_with_one`` because pandas sees these operations as separate events.
-e.g. separate calls to ``__getitem__``, so it has to treat them as linear operations, they happen one after another.
-
-Contrast this to ``df.loc[:,('one','second')]`` which passes a nested tuple of ``(slice(None),('one','second'))`` to a single call to
-``__getitem__``. This allows pandas to deal with this as a single entity. Furthermore this order of operations *can* be significantly
-faster, and allows one to index *both* axes if so desired.
-
Why does assignment fail when using chained indexing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The problem in the previous section is just a performance issue. What's up with
-the ``SettingWithCopy`` warning? We don't **usually** throw warnings around when
-you do something that might cost a few extra milliseconds!
-
-But it turns out that assigning to the product of chained indexing has
-inherently unpredictable results. To see this, think about how the Python
-interpreter executes this code:
-
-.. code-block:: python
-
- dfmi.loc[:, ('one', 'second')] = value
- # becomes
- dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
-
-But this code is handled differently:
-
-.. code-block:: python
-
- dfmi['one']['second'] = value
- # becomes
- dfmi.__getitem__('one').__setitem__('second', value)
-
-See that ``__getitem__`` in there? Outside of simple cases, it's very hard to
-predict whether it will return a view or a copy (it depends on the memory layout
-of the array, about which pandas makes no guarantees), and therefore whether
-the ``__setitem__`` will modify ``dfmi`` or a temporary object that gets thrown
-out immediately afterward. **That's** what ``SettingWithCopy`` is warning you
-about!
-
-.. note:: You may be wondering whether we should be concerned about the ``loc``
- property in the first example. But ``dfmi.loc`` is guaranteed to be ``dfmi``
- itself with modified indexing behavior, so ``dfmi.loc.__getitem__`` /
- ``dfmi.loc.__setitem__`` operate on ``dfmi`` directly. Of course,
- ``dfmi.loc.__getitem__(idx)`` may be a view or a copy of ``dfmi``.
+:ref:`Copy-on-Write