Skip to main content
  1. Blog posts/

Python - pandas Data Integration (concat, join, merge)

·5 mins
Table of Contents

This article explains methods for integrating multiple datasets into one using the pandas library.

There are several methods for integrating data, but this time we will cover concat, join, merge.

Before we explain, let’s create two example data frames.

>>> import pandas as pd 

>>> df1 = pd.DataFrame({
    'Class1' : [95, 92, 98, 100],
    'Class2' : [91, 93, 97, 99]
})

>>> df2 = pd.DataFrame({
    'Class1' : [87, 89],
    'Class2' : [85, 90]
})

Output of d1:

Class1 Class2
0 87 85
1 89 90

Output of d2:

Class1 Class2
0 95 91
1 92 93
2 98 97
3 100 99

1. concat #

The concat function of the pandas library is used to append data frames. This function can append multiple data frames in either row or column direction.

Let’s append df1 and df2 to the result.

>>> result = pd.concat([df1, df2])
>>> result
Class1 Class2
0 95 91.0
1 92 93.0
2 98 97.0
3 100 99.0
4 87 85.0
5 89 90.0
6 96 NaN
7 83 NaN

pd.concat([df1, df2]) appends df1 and df2 in the row direction. That is, the two data frames are connected vertically.

Next, let’s integrate d3, which only has the Class1 column, into the result.

>>> df3 = pd.DataFrame({
    'Class1' : [96, 83]
})

>>> pd.concat([result, df3], ignore_index=True)
Class1 Class2
0 95 91.0
1 92 93.0
2 98 97.0
3 100 99.0
4 87 85.0
5 89 90.0
6 96 NaN
7 83 NaN

Since the d3 data does not have the ‘Class2’ column, it outputs blank values.

2. join #

The join method of the pandas library is used to combine two data frames based on a specific column. It generally performs a role similar to SQL’s JOIN operation. Unlike concatjoin integrates horizontally.

>>> df4 = pd.DataFrame({
    'Class3' : [93, 91, 95, 98]
})

>>> df1.join(df4)
Class1 Class2 Class3
a 95 91 93
b 92 93 91
c 98 97 95
d 100 99 98

It is also possible to output by arbitrarily setting the index as follows.

>>> index_label = ['a','b','c','d']
>>> df1a = pd.DataFrame({'Class1': [95, 92, 98, 100],
                    'Class2': [91, 93, 97, 99]}, index= index_label)
>>> df4a = pd.DataFrame({'Class3': [93, 91, 95, 98]}, index=index_label)

>>> df1a.join(df4a)
Class1 Class2 Class3
a 95 91 93
b 92 93 91
c 98 97 95
d 100 99 98

3. merge #

The merge function of the pandas library is used to merge (integrate) two data frames based on a specific column. Using the merge function, you can combine data frames based on common columns between them.

>>> df_A_B = pd.DataFrame({'Sales Month': ['January', 'February', 'March', 'April'],
                       'Product A': [100, 150, 200, 130],
                       'Product B': [90, 110, 140, 170]})

>>> df_C_D = pd.DataFrame({'Sales Month': ['January', 'February', 'March', 'April'],
                       'Product C': [112, 141, 203, 134],
                       'Product D': [90, 110, 140, 170]})

df_A_B

Sales Month Product A Product B
0 January 100 90
1 February 150 110
2 March 200 140
3 April 130 170

df_C_D

Sales Month Product C Product D
0 January 112 90
1 February 141 110
2 March 203 140
3 April 134 170

Use merge to merge the two data frames based on the ‘Sales Month’ column. As a result, the two data frames are combined based on the ‘Sales Month’ column, and the data is organized around the common columns.

>>> df_A_B.merge(df_C_D)
Sales Month Product A Product B Product C Product D
0 January 100 90 112 90
1 February 150 110 141 110
2 March 200 140 203 140
3 April 130 170 134 170

Let’s implement four different ways to combine two data frames using the merge method.

>>> df_left = pd.DataFrame({'key':['A','B','C'], 'left': [1, 2, 3]})
>>> df_right = pd.DataFrame({'key':['A','B','D'], 'right': [4, 5, 6]})

1.

>>> df_left.merge(df_right, how='left', on = 'key')
key left right
0 A 1 4.0
1 B 2 5.0
2 C 3 NaN

Left join df_left and df_right based on the ‘key’ column. A left join keeps all rows from the left data frame (df_left) and adds rows from the right data frame (df_right) that have a matching key value. If a matching key value is not present in the right data frame, it is filled with NaN.

2.

>>> df_left.merge(df_right, how='right', on = 'key')
key left right
0 A 1.0 4
1 B 2.0 5
2 D NaN 6

Right join df_left and df_right based on the ‘key’ column. A right join keeps all rows from the right data frame (df_right) and adds rows from the left data frame (df_left) that have a matching key value. If a matching key value is not present in the left data frame, it is filled with NaN.

3.

>>> df_left.merge(df_right, how='outer', on = 'key')
key left right
0 A 1.0 4.0
1 B 2.0 5.0
2 D 3.0 NaN
3 D NaN 6.0

Outer join df_left and df_right based on the ‘key’ column. An outer join includes all rows from both data frames, filling with NaN where a match is only present in one of the data frames.

4.

>>> df_left.merge(df_right, how='inner', on = 'key')
key left right
0 A 1 4
1 B 2 5

Inner join df_left and df_right based on the ‘key’ column. An inner join includes only rows that are common to both data frames. That is, it combines rows from both data frames that have the same ‘key’ value.