Data selection is a crucial part of data manipulation and analysis. Pandas provides several methods to select data from a DataFrame, whether by columns, rows, or specific conditions.
Methods of Data Selection
Selecting Columns
You can select a single column from a DataFrame using either bracket notation or dot notation:
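A minimal sketch (the DataFrame and column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"var1": [10, 20, 30], "var2": [1, 2, 3]})

# Bracket notation works for any column name
col = df["var1"]

# Dot notation works only when the name is a valid Python identifier
# and does not clash with a DataFrame attribute or method
col_alt = df.var1
```

Bracket notation is the safer default; dot notation fails for names containing spaces or matching existing attributes (e.g. a column named "count").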
Selecting Rows by Index
To select rows by their index position, you can use slicing:
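For example, with a hypothetical ten-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"var1": range(10)})

# Slicing a DataFrame selects rows; the end position (5) is excluded
first_five = df[0:5]
```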
Selecting Rows by Date Range
If your DataFrame has a DateTime index, you can select rows within a specific date range:
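A sketch with an assumed daily DateTime index (dates chosen for illustration):

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"var1": range(10)}, index=dates)

# Slice by date strings with .loc; unlike positional slicing,
# both endpoints are included
subset = df.loc["2024-01-03":"2024-01-06"]
```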
Label-based Selection
Use .loc or .at to select rows by label:
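For example, with a hypothetical string-labelled index:

```python
import pandas as pd

df = pd.DataFrame({"var1": [10, 20, 30]}, index=["a", "b", "c"])

row = df.loc["b"]            # whole row by label (returns a Series)
value = df.at["b", "var1"]   # single scalar; faster than .loc for one cell
```

Use .at only when retrieving a single value; .loc handles rows, columns, and slices.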
Position-based Selection
Use .iloc or .iat to select rows by position:
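The positional counterparts, on the same hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"var1": [10, 20, 30]}, index=["a", "b", "c"])

row = df.iloc[1]        # second row by integer position
value = df.iat[1, 0]    # single scalar at (row 1, column 0)
```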
Conditional Selection
Select rows based on a condition:
Create a new DataFrame based on a condition:
The condition df["var1"] >= 999 creates a boolean Series that filters the rows of df.
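Putting both steps together (sample values are hypothetical; the condition matches the one above):

```python
import pandas as pd

df = pd.DataFrame({"var1": [500, 999, 1500]})

# Boolean mask: True where the condition holds
mask = df["var1"] >= 999

# Select matching rows
filtered = df[mask]

# Create a new, independent DataFrame from the condition;
# .copy() avoids SettingWithCopy warnings on later assignment
new_df = df[df["var1"] >= 999].copy()
```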
Considerations
When selecting data for machine learning models, several considerations can significantly impact the model's performance and the insights you can derive from it. Here are key factors to consider:
- Relevance: Ensure that the features (input variables) you select are relevant to the problem you are trying to solve. Irrelevant features can introduce noise and reduce model accuracy.
- Quality: Assess the quality of the data, including checking for missing values, outliers, and errors. Poor-quality data can lead to inaccurate models.
- Quantity: Consider the size of your dataset. More data can lead to better models, but it also requires more computational resources. Ensure you have enough data to train your model effectively.
- Balance: Check for class imbalance in classification problems. An imbalanced dataset can bias the model towards the majority class. Techniques like resampling, synthetic data generation, or using different evaluation metrics can help address this.
- Feature Distribution: Analyze the distribution of your features. Features with skewed distributions may need transformation (e.g., log transformation) to improve model performance.
- Correlation: Examine the correlation between features. Highly correlated features can lead to multicollinearity, which can affect model stability and interpretability. Consider removing or combining correlated features.
- Dimensionality: High-dimensional data can lead to overfitting. Techniques like feature selection, dimensionality reduction (e.g., PCA), or regularization can help manage this.
- Temporal Considerations: For time series data, ensure that the temporal order is maintained. Avoid data leakage by ensuring that future information is not used in training.
- Domain Knowledge: Leverage domain expertise to select features that are known to be important for the problem. This can guide feature engineering and selection.
- Data Leakage: Be cautious of data leakage, where information from the test set is inadvertently used in training. This can lead to overly optimistic performance estimates.
- Scalability: Consider the scalability of your data selection process. As datasets grow, ensure that your methods can handle larger volumes efficiently.
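Several of the checks above can be sketched in a few lines of pandas (the column names and sample values here are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "feature1": [1.0, 2.0, None, 4.0],
    "feature2": [10, 400, 30, 1600],
    "label":    [0, 0, 0, 1],
})

# Quality: count missing values per column
missing = df.isna().sum()

# Balance: inspect the class distribution of the target
balance = df["label"].value_counts()

# Correlation: pairwise correlation between numeric features
corr = df[["feature1", "feature2"]].corr()

# Feature Distribution: log-transform a skewed feature
skew_fix = np.log1p(df["feature2"])
```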