6 Grouping, joining, and sorting
So far, we have looked at how to create data frames, read data into them, clean the data, and then analyze that clean, imported data in a number of ways. But analysis often requires more than just the basics: We often need to break our input data apart, to zoom in on particularly interesting subsets, to combine data from different sources, to transform the data into a new format or value, and then to sort it according to a variety of criteria. This type of action is known in the Pandas world as "split-apply-combine," and is our focus in this chapter. If you have experience with SQL and relational databases, then you’ll find many similarities, in both principle and name, with functionality in Pandas.
For example: A company might want to find out its total sales in the last quarter. But it might also want to find out which countries have done particularly well (or poorly). Or perhaps the head of sales would like to see how much each individual salesperson has brought in, or how much each product has contributed to the company’s income.
These types of questions can be answered using a technique known as "grouping." Much like the GROUP BY clause in an SQL query, we can use grouping in Pandas to ask the same question for various subsets of our data.
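To make the sales questions above concrete, here is a minimal sketch of grouping. The data frame, its column names (country, salesperson, product, amount), and all of its values are invented for illustration; they don't come from any real data set. The idea is to ask one question (what is the total?) first of the whole data frame, and then of each subset via groupby:

```python
import pandas as pd

# Hypothetical sales records; every name and number is made up for illustration
sales = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Japan"],
    "salesperson": ["Ann", "Bob", "Ann", "Cid", "Bob"],
    "product": ["widget", "gadget", "widget", "widget", "gadget"],
    "amount": [100, 250, 75, 125, 300],
})

# The question asked of the entire data frame: total sales
total_sales = sales["amount"].sum()

# The same question asked once per country, roughly equivalent to
# SELECT country, SUM(amount) FROM sales GROUP BY country
sales_by_country = sales.groupby("country")["amount"].sum()

# Changing the grouping column changes the subsets, not the question
sales_by_person = sales.groupby("salesperson")["amount"].sum()
sales_by_product = sales.groupby("product")["amount"].sum()

print(total_sales)
print(sales_by_country)
```

Each grouped result is a series whose index contains the distinct values of the grouping column, so sales_by_country["US"] retrieves the total for that one country.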