Solve Using Regexp_Replace in PySpark - Interview Question

One of PySpark's most well-known and useful functions is regexp_replace. Using this function, we'll work through a recently asked interview question.

You are given a text file containing a dataset whose fields are separated by a hyphen (-). Your task is to import it with PySpark and convert it into tabular format.

Example: 1-A-12-2-B-23-3-C-34-4-D-15
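Each group of three values forms one record, so the goal is to turn this single string into a table like the one below (the column names ID, Name and Age come from step 4 of the solution):

+--+----+---+
|ID|Name|Age|
+--+----+---+
| 1|   A| 12|
| 2|   B| 23|
| 3|   C| 34|
| 4|   D| 15|
+--+----+---+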

Put this dataset in a file location or DBFS, as desired.

Since I'm using Databricks to solve this problem, I've written the data to a file with dbutils:

dbutils.fs.put('/FileStore/tables/interview_1.txt','1-A-12-2-B-23-3-C-34-4-D-15')
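If you're not on Databricks, you could create the same sample file with plain Python instead; this is just a sketch, and the path is illustrative:

# Write the sample dataset to a local file (path is an example)
with open('/tmp/interview_1.txt', 'w') as f:
    f.write('1-A-12-2-B-23-3-C-34-4-D-15')

Then point spark.read.text at whatever path you used.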

Steps to follow:

  1. Load the text file using the text reader in PySpark.
    df=spark.read.text('/FileStore/tables/interview_1.txt')
  2. Use the regexp_replace function to append a comma after every third hyphen (-), so that each record ends with the marker -, that we can break the data on. (A quick sanity check of this regex appears after these steps.)
    df1=df.withColumn("new_value",regexp_replace("value","(.*?\\-){3}","$0,")).drop("value")
  3. Use the explode and split functions to split the column on -, turning each record into its own row.
    df2=df1.withColumn("new_value_1",explode(split(df1.new_value,'-,'))).drop("new_value")
  4. Use the split function again to break each record into separate columns, say ID, Name and Age.
    df3=df2.withColumn("ID",(split(df2.new_value_1,'-')[0])).withColumn("Name",(split(df2.new_value_1,'-')[1])).withColumn("Age",(split(df2.new_value_1,'-')[2])).drop("new_value_1")
    df3.show()
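Spark's regexp_replace uses Java regex syntax, where $0 in the replacement string refers to the entire match. If you want to sanity-check the pattern outside Spark, here is a quick sketch using Python's re module (where \g<0> plays the role of $0):

import re

s = '1-A-12-2-B-23-3-C-34-4-D-15'
# Append a comma after every third hyphen, mirroring the regexp_replace call
print(re.sub(r'(.*?-){3}', r'\g<0>,', s))
# Prints: 1-A-12-,2-B-23-,3-C-34-,4-D-15
# Splitting that on '-,' yields the four records: 1-A-12, 2-B-23, 3-C-34, 4-D-15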

Below is the full code:

from pyspark.sql.functions import regexp_replace, explode, split
df = spark.read.text('/FileStore/tables/interview_1.txt')
# Append a comma after every third hyphen so each record ends with '-,'
df1 = df.withColumn("new_value", regexp_replace("value", "(.*?\\-){3}", "$0,")).drop("value")
# Split on the '-,' marker and explode each record into its own row
df2 = df1.withColumn("new_value_1", explode(split(df1.new_value, '-,'))).drop("new_value")
# Break each record on '-' into ID, Name and Age columns
df3 = (df2.withColumn("ID", split(df2.new_value_1, '-')[0])
          .withColumn("Name", split(df2.new_value_1, '-')[1])
          .withColumn("Age", split(df2.new_value_1, '-')[2])
          .drop("new_value_1"))
df3.show()
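One thing to note: spark.read.text produces string columns, so ID and Age come out as strings. If you want numeric types, an optional addition (my own, not part of the original solution) would be to cast them:

# Optionally cast ID and Age from string to integer
df3 = df3.withColumn("ID", df3.ID.cast("int")).withColumn("Age", df3.Age.cast("int"))
df3.printSchema()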

Would you like a video explanation of this problem? Here is the link.

Remember to subscribe to the channel.
