Using pandas we can do the following:
In DE_Tools see:
Combining Datasets
# Merge
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value': [3, 4]})
merged_df = pd.merge(df1, df2, on='key')
# Concat
concat_df = pd.concat([df1, df2])
# Join
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)
joined_df = df1.join(df2, lsuffix='_left', rsuffix='_right')
Aggregating Data Groupby
df = pd.DataFrame({'category': ['A', 'B', 'A'], 'value': [10, 20, 30]})
grouped_df = df.groupby('category').sum()
Split-Apply-Combine
Change the Layout of Data using a Pandas Stack
See Pandas Stack for when to use
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['first', 'second'])
stacked_df = df.stack()
unstacked_df = stacked_df.unstack()
Pivot Table: Summarize Data
df = pd.DataFrame({'A': ['foo', 'foo', 'bar'], 'B': ['one', 'two', 'one'], 'C': [1, 2, 3]})
pivot_table = df.pivot_table(values='C', index='A', columns='B', aggfunc='sum')
Handle Time-Based
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
time_series_df = pd.DataFrame(date_rng, columns=['date'])
time_series_df['data'] = range(len(time_series_df))
Sorting Order Data
df = pd.DataFrame({'A': [3, 1, 2]})
sorted_df = df.sort_values(by='A')
Rename Columns
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df = df.drop(columns=['B'])
df = df.rename(columns={'A': 'new_A'})
Identify Duplicate Entries
df = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 4, 4]})
duplicates = df.duplicated()