Calculates the duration of each job posting by finding the difference between its expiration and posted dates. Converts the POSTED and EXPIRED columns from string to date format. Update DURATION if it is null with the number of days between EXPIRED and POSTED, otherwise, the existing value is kept.
Cleans up multiple text columns in the DataFrame by extracting and formatting the content originally enclosed in double quotes. Columns to clean contain those string values often wrapped in brackets, double quotes, or cluttered with newlines and extra spaces. For each of these columns, using regular expressions to remove square brackets, line breaks, and excess whitespace, formats comma-separated items with a proper space after each comma, and removes all double quotes, resulting in cleaner, more readable text entries across the specified columns.
Code
# 2. Remove square brackets, line breaks, spaces, and replace the formatting between commas with “,”, then remove the double quotesfrom pyspark.sql.functions import regexp_replace, colcolumns_to_clean = ["SOURCE_TYPES", "SOURCES", "URL", "EDUCATION_LEVELS_NAME", "SKILLS", "SKILLS_NAME", "SPECIALIZED_SKILLS", "SPECIALIZED_SKILLS_NAME", "CERTIFICATIONS", "CERTIFICATIONS_NAME", "COMMON_SKILLS", "COMMON_SKILLS_NAME", "SOFTWARE_SKILLS", "SOFTWARE_SKILLS_NAME", "CIP6", "CIP6_NAME", "CIP4", "CIP4_NAME", "CIP2", "CIP2_NAME", "LIGHTCAST_SECTORS", "LIGHTCAST_SECTORS_NAME"] for col_name in columns_to_clean: df = df.withColumn(col_name, regexp_replace(regexp_replace(regexp_replace(col(col_name), r'[\[\]\n\s]+', ''), r'","', '", '), r'"', ''))
3. Clean the education level column
Cleans the EDUCATION_LEVELS column by extracting and retaining only the numeric portion of each entry. Removing surrounding text or symbols, leaving just the numeric education level in the column. This makes the data more consistent and easier to work with for analysis or modeling purposes.
Cleans the LOCATION column, ensures that all location information appears on one line, and removes curly braces, resulting a cleaner, more uniform LOCATION column for reading and analyzing
Similarly as in updating duration, fills in the value with the number of days between MODELED_EXPIRED and POSTED, helps standardize and complete the duration data for modeled job postings
Standardizes the values in the REMOTE_TYPE_NAME column to ensure consistency in describing remote work types. Replaces values None and Not Remote with On-Site, changes Hybrid Remote to Hybrid, and keeps Remote as is. Simplify and unify the classification of job postings based on work location
Although filling missing values is a common data cleaning strategy, we chose not to do it in this case to preserve the integrity and accuracy of the original dataset. Imputing numerical fields like salary with the median could distort salary distributions and mask meaningful patterns or outliers. Similarly, replacing missing categorical fields with “Unknown” may introduce noise and reduce the reliability of downstream analysis, especially in modeling or clustering tasks. Additionally, dropping columns with over 50% missing data might lead to the loss of potentially valuable or unique information. By keeping the missing values intact, we allow for more transparent analysis and leave room for context-aware handling in specific use cases.
8. Reason of not dropping unnecessary columns
The dataset is overly complex, with more than 100 different variables and columns. Therefore, we have taken the approach of directly extracting a specific column or columns of the data to be analyzed to generate a dataframe and analyze it. This way we don’t need to remove unwanted columns.
9. Save the cleaned data
Code
# save data# 1. use coalesce(1) to merge all partitions into one filedf.coalesce(1).write.option("header", "true").csv("data/lightcast_cleaned_temp")# 2. Find and rename the generated filesimport osimport shutil# get pathgenerated_file_path ='data/lightcast_cleaned_temp'for filename in os.listdir(generated_file_path):if filename.startswith('part-'): # find file# rename and move shutil.move(os.path.join(generated_file_path, filename), 'data/lightcast_cleaned.csv')# delete useless foldershutil.rmtree(generated_file_path)