pyspark.sql.DataFrame.intersectAll
DataFrame.intersectAll(other)
Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.

This is equivalent to INTERSECT ALL in SQL. As is standard in SQL, this function resolves columns by position (not by name).
New in version 2.4.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
    other : DataFrame
        Another DataFrame to intersect with.
Returns
    DataFrame
        Combined DataFrame containing the rows common to both, with duplicates preserved.
Examples
Example 1: Intersecting two DataFrames with the same schema
>>> df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
>>> df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
>>> result_df = df1.intersectAll(df2).sort("C1", "C2")
>>> result_df.show()
+---+---+
| C1| C2|
+---+---+
|  a|  1|
|  a|  1|
|  b|  3|
+---+---+
Example 2: Intersecting two DataFrames with the same schema but partially overlapping rows
>>> df1 = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"])
>>> df2 = spark.createDataFrame([(2, "B"), (3, "C")], ["id", "value"])
>>> result_df = df1.intersectAll(df2).sort("id", "value")
>>> result_df.show()
+---+-----+
| id|value|
+---+-----+
|  2|    B|
+---+-----+
Example 3: Intersecting all rows from two DataFrames with mismatched columns
>>> df1 = spark.createDataFrame([(1, 2), (1, 2), (3, 4)], ["A", "B"])
>>> df2 = spark.createDataFrame([(1, 2), (1, 2)], ["C", "D"])
>>> result_df = df1.intersectAll(df2).sort("A", "B")
>>> result_df.show()
+---+---+
|  A|  B|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
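The duplicate-preserving behavior shown above is that of a multiset (bag) intersection: each row appears min(count in left, count in right) times. As a minimal sketch of that semantics, independent of Spark, the same result can be modeled in plain Python with collections.Counter (the helper name intersect_all here is illustrative, not part of any API):

```python
from collections import Counter

def intersect_all(left, right):
    """Multiset intersection of row tuples: each row appears
    min(count_in_left, count_in_right) times, mirroring the
    SQL INTERSECT ALL semantics that intersectAll implements."""
    common = Counter(left) & Counter(right)  # & keeps the per-key minimum count
    result = []
    for row, n in common.items():
        result.extend([row] * n)
    return sorted(result)

# Same data as Example 1: ("a", 1) appears twice in both inputs,
# so it is kept twice; ("c", 4) appears only on the left and is dropped.
rows1 = [("a", 1), ("a", 1), ("b", 3), ("c", 4)]
rows2 = [("a", 1), ("a", 1), ("b", 3)]
print(intersect_all(rows1, rows2))  # [('a', 1), ('a', 1), ('b', 3)]
```

Contrast this with DataFrame.intersect, which deduplicates: it would return ("a", 1) only once for the same inputs.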