Solve Using Regexp_Replace in PySpark - Interview Question

One of PySpark's best-known and most useful functions is regexp_replace. Using this function, we'll try to answer a recent interview question.

You are given a text file containing a dataset in which every value is separated by a hyphen (-), with each group of three values forming one record. Use PySpark to load it and convert the data into tabular format.

Example: 1-A-12-2-B-23-3-C-34-4-D-15

Save this dataset to a file location of your choice, such as DBFS.

I'm using Databricks to solve this problem, so I've saved the data to a file there.


Steps to follow:

  1. Load the text file using the text function in PySpark: spark.read.text('/FileStore/tables/interview_1.txt')
  2. Use the regexp_replace function to replace every third occurrence of - (hyphen) with -, (hyphen followed by a comma) so that we get a unique delimiter on which to break the data into records.
  3. Use the split and explode functions to split the column on that -, delimiter so that each record lands on its own row, and assign the result to a new column if needed.
  4. Use the split function again, this time on the plain hyphen, to extract each value into its own column, say ID, Name and Age.
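
To see what the regular expression in steps 2 to 4 does before running it on Spark, here is a minimal plain-Python sketch using the standard re module on the example line. The pattern (.*?-){3} and the variable names are illustrative assumptions, not taken from the original solution. Python writes the whole-match reference as \g<0>, whereas Spark's regexp_replace uses Java regex syntax, where the equivalent is $0.

```python
import re

raw = "1-A-12-2-B-23-3-C-34-4-D-15"

# Step 2: match three lazy "anything up to a hyphen" groups in a row and
# append a comma to the whole match, so every third hyphen becomes "-,".
marked = re.sub(r"(.*?-){3}", r"\g<0>,", raw)

# Step 3: split on the "-," marker to get one string per record.
records = marked.split("-,")

# Step 4: split each record on "-" to get the individual fields.
rows = [tuple(record.split("-")) for record in records]

print(marked)
print(rows)
```

The last record (4-D-15) contains only two hyphens, so the pattern does not match it and no trailing marker is added.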


Below is the full code

from pyspark.sql.functions import *
df = spark.read.text('/FileStore/tables/interview_1.txt')


