PySpark DataFrame filter() Syntax

Below is the syntax of the filter() function. condition is the expression you want to filter on.

filter(condition)

Before we start with examples, first let's create a DataFrame. Here, I am using a DataFrame with StructType and ArrayType columns, as I will also be covering examples with struct and array types as well.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

spark = SparkSession.builder.getOrCreate()

# Sample rows (reconstructed for illustration; the original excerpt omits them)
data = [
    (("James", "", "Smith"), ["Java", "Scala", "C++"], "OH", "M"),
    (("Anna", "Rose", ""), ["Spark", "Java", "C++"], "NY", "F"),
    (("Julia", "", "Williams"), ["CSharp", "VB"], "OH", "F"),
    (("Maria", "Anne", "Jones"), ["CSharp", "VB"], "NY", "M"),
    (("Jen", "Mary", "Brown"), ["CSharp", "VB"], "NY", "M"),
    (("Mike", "Mary", "Williams"), ["Python", "VB"], "OH", "M")
]

schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('languages', ArrayType(StringType()), True),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True)
])

df = spark.createDataFrame(data = data, schema = schema)
df.printSchema()
df.show(truncate=False)

This yields the below schema and DataFrame results.

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- state: string (nullable = true)
 |-- gender: string (nullable = true)

1. Filter with a Column Condition

Use a Column with a condition to filter rows from a DataFrame; this lets you express complex conditions by referring to column names as df.colname.

df.filter(df.state == "OH").show(truncate=False)

The same condition can also be written with the col() function. In order to use it, you first need to import it:

from pyspark.sql.functions import col

df.filter(col("state") == "OH").show(truncate=False)

2. Filter with a SQL Expression

If you are coming from a SQL background, you can use that knowledge in PySpark to filter DataFrame rows with SQL expressions (a short sketch of such expressions follows the complete example at the end of this article).

3. Filter with Multiple Conditions

In PySpark, to filter() rows of a DataFrame on multiple conditions, you can use either a Column with a condition or a SQL expression. Below is a simple example using an AND (&) condition; you can extend it with OR (|) and NOT (~) conditional expressions as needed.

df.filter( (df.state == "OH") & (df.gender == "M") ) \
    .show(truncate=False)

4. Filter Based on List Values

If you have a list of elements and you want to filter rows that are in the list, or not in the list, use the isin() function of the Column class. There is no isnotin() function, but you can achieve the same result with the not operator (~).

li = ["OH", "CA", "DE"]

# Filter rows whose state is in the list
df.filter(df.state.isin(li)).show()

# These show all records with NY (NY is not part of the list)
df.filter(~df.state.isin(li)).show()
df.filter(df.state.isin(li) == False).show()

5. Filter Based on Starts With, Ends With, Contains

You can also filter DataFrame rows by using the startswith(), endswith() and contains() methods of the Column class. For more examples on the Column class, refer to PySpark Column Functions.

df.filter(df.state.startswith("N")).show()

6. Filter with like and rlike

If you have a SQL background you must be familiar with like and rlike (regex like); PySpark provides similar methods in the Column class to filter values using wildcard characters.

# The last two rows and the column names are assumed for illustration;
# the original excerpt truncates the list and the schema
data2 = [(2, "Michael Rose"), (3, "Robert Williams"),
         (4, "Rames Rose"), (5, "Rames rose")]
df2 = spark.createDataFrame(data = data2, schema = ["id", "name"])

# like - SQL LIKE pattern
df2.filter(df2.name.like("%rose%")).show()

You can use rlike() to filter by checking values case insensitively.

# rlike - SQL RLIKE pattern (LIKE with Regex)
df2.filter(df2.name.rlike("(?i)^*rose$")).show()

7. Filter on an Array Column

When you want to filter rows based on a value present in an array collection column, use array_contains() from PySpark SQL functions, which checks whether an array contains a value and returns true if present, otherwise false.

from pyspark.sql.functions import array_contains

df.filter(array_contains(df.languages, "Java")) \
    .show(truncate=False)

8. Filter on Nested Struct Columns

If your DataFrame contains nested struct columns, you can use any of the above syntaxes to filter rows based on the nested column.

df.filter(df.name.lastname == "Williams") \
    .show(truncate=False)

9. Complete Example

Putting it together (here arrayStructureData and arrayStructureSchema refer to the same sample data and schema created at the top of this article):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
from pyspark.sql.functions import col, array_contains

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data = arrayStructureData, schema = arrayStructureSchema)

df.filter(df.state == "OH") \
    .show(truncate=False)

df.filter( (df.state == "OH") & (df.gender == "M") ) \
    .show(truncate=False)
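Section 2 above mentions SQL expressions but carries no code in this excerpt. Here is a minimal sketch of what such filters look like, assuming the df created earlier (the expressions and values are illustrative, not recovered from the original):

# SQL expression passed as a string instead of a Column condition
df.filter("gender == 'M'").show(truncate=False)

# SQL-style AND / OR operators work inside the expression string
df.filter("state == 'OH' AND gender == 'M'").show(truncate=False)

Note that where() is an alias of filter(), so the same conditions and expressions can be passed to df.where() as well.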
Note: PySpark Column Functions provides several options that can be used with filter().
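For instance, here is a minimal sketch of a few such Column methods combined with filter(), using the sample df from above (the chosen columns and values are only illustrative):

# Keep rows where state is set / is missing
df.filter(df.state.isNotNull()).show(truncate=False)
df.filter(df.state.isNull()).show(truncate=False)

# endswith() and contains() complement the startswith() example above
df.filter(df.state.endswith("Y")).show(truncate=False)
df.filter(df.state.contains("H")).show(truncate=False)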