Data Cleaning & Exploration

Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends

Modified

May 2, 2025

Load the dataset

Code
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "vscode"
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, col, regexp_replace, transform, isnan

spark = SparkSession.builder.appName("LightcastData").getOrCreate()

# Load Data
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .option("multiLine", "true")
    .option("escape", "\"")
    .csv("./data/lightcast_job_postings.csv")
)

df.show(5)

1. Update Duration

Calculates the duration of each job posting as the difference between its expiration and posted dates. The POSTED and EXPIRED columns are converted from string to date format, and DURATION is filled with the number of days between EXPIRED and POSTED wherever it is null; otherwise the existing value is kept.

Code
# 1.DURATION = EXPIRED - POSTED

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

from pyspark.sql.functions import datediff, when, to_date, col

df = df.withColumn("POSTED", to_date("POSTED", "MM/dd/yyyy")) \
       .withColumn("EXPIRED", to_date("EXPIRED", "MM/dd/yyyy"))

df = df.withColumn(
    "DURATION",
    when(col("DURATION").isNull(), datediff("EXPIRED", "POSTED"))
    .otherwise(col("DURATION"))
)
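
As a quick sanity check (an illustrative snippet, not part of the original pipeline), the remaining null DURATION values can be counted; rows where EXPIRED itself is null will still come out null.

Code
# Illustrative check: count rows where DURATION is still null after the fill
from pyspark.sql.functions import col

remaining_nulls = df.filter(col("DURATION").isNull()).count()
print(f"Rows with null DURATION after the fill: {remaining_nulls}")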

2. Clean the columns

Cleans up multiple text columns in the DataFrame whose string values are often wrapped in square brackets and double quotes or cluttered with newlines and extra spaces. For each of these columns, regular expressions remove the square brackets, line breaks, and excess whitespace, insert a space after each comma separating items, and strip all double quotes, resulting in cleaner, more readable text entries across the specified columns.

Code
# 2. Remove square brackets, line breaks, and extra whitespace, insert a space after each comma, then remove the double quotes

from pyspark.sql.functions import regexp_replace, col

columns_to_clean = ["SOURCE_TYPES", "SOURCES", "URL", "EDUCATION_LEVELS_NAME", "SKILLS", 
                    "SKILLS_NAME", "SPECIALIZED_SKILLS", "SPECIALIZED_SKILLS_NAME", "CERTIFICATIONS", 
                    "CERTIFICATIONS_NAME", "COMMON_SKILLS", "COMMON_SKILLS_NAME", "SOFTWARE_SKILLS", 
                    "SOFTWARE_SKILLS_NAME", "CIP6", "CIP6_NAME", "CIP4", "CIP4_NAME", "CIP2", 
                    "CIP2_NAME", "LIGHTCAST_SECTORS", "LIGHTCAST_SECTORS_NAME"]  

for col_name in columns_to_clean:
    df = df.withColumn(col_name, 
                       regexp_replace(regexp_replace(regexp_replace(col(col_name), r'[\[\]\n\s]+', ''), r'","', '", '), r'"', ''))
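
To spot-check the result (an illustrative snippet, not part of the original notebook), a few cleaned skill entries can be printed.

Code
# Illustrative check: inspect a few cleaned skill-name entries
df.select("SKILLS_NAME", "COMMON_SKILLS_NAME").show(3, truncate=False)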

3. Clean the education level column

Cleans the EDUCATION_LEVELS column by extracting and retaining only the numeric portion of each entry, removing any surrounding text or symbols so that just the numeric education level remains. This makes the data more consistent and easier to work with for analysis or modeling.

Code
# 3.EDUCATION_LEVELS only keeps digits
from pyspark.sql.functions import regexp_extract

df = df.withColumn("EDUCATION_LEVELS", regexp_extract("EDUCATION_LEVELS", r'(\d+)', 1))

4. Clean the location column

Cleans the LOCATION column by collapsing line breaks so that all location information appears on one line and by removing curly braces, resulting in a cleaner, more uniform LOCATION column for reading and analysis.

Code
# 4. LOCATION only keeps data
from pyspark.sql.functions import col, regexp_replace

df = df.withColumn("LOCATION", 
                           regexp_replace(regexp_replace(col("LOCATION"), r"\s*\n\s*", " "), r"[{}]", ""))

5. Update Modeled Duration

As with DURATION in step 1, converts MODELED_EXPIRED to date format and fills MODELED_DURATION with the number of days between MODELED_EXPIRED and POSTED wherever it is null, which helps standardize and complete the duration data for modeled job postings.

Code
# 5.MODELED_DURATION = MODELED_EXPIRED - POSTED

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

from pyspark.sql.functions import datediff, when, to_date, col

df = df.withColumn("MODELED_EXPIRED", to_date("MODELED_EXPIRED", "MM/dd/yyyy"))

df = df.withColumn(
    "MODELED_DURATION",
    when(col("MODELED_DURATION").isNull(), datediff("MODELED_EXPIRED", "POSTED"))
    .otherwise(col("MODELED_DURATION"))
)

6. Standardize Remote Work Types

Standardizes the values in the REMOTE_TYPE_NAME column to ensure consistency in describing remote work types. Replaces [None], null, and Not Remote with On-Site, changes Hybrid Remote to Hybrid, and keeps Remote as is, simplifying and unifying the classification of job postings by work location.

Code
# 6. Standardize Remote Work Types
from pyspark.sql.functions import when, col

df = df.withColumn(
    "REMOTE_TYPE_NAME",
    when(col("REMOTE_TYPE_NAME").isin("[None]", "Not Remote") | col("REMOTE_TYPE_NAME").isNull(), "On-Site")
    .when(col("REMOTE_TYPE_NAME") == "Hybrid Remote", "Hybrid")
    .when(col("REMOTE_TYPE_NAME") == "Remote", "Remote")
    .otherwise(col("REMOTE_TYPE_NAME"))
)
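
A quick frequency count (an illustrative check, not part of the original pipeline) verifies that the column now collapses to the standardized categories.

Code
# Illustrative check: the distribution should reduce to On-Site, Hybrid, and Remote
df.groupBy("REMOTE_TYPE_NAME").count().orderBy("count", ascending=False).show()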

7. Reason for not filling NAs this time

Although filling missing values is a common data cleaning strategy, we chose not to do it in this case to preserve the integrity and accuracy of the original dataset. Imputing numerical fields like salary with the median could distort salary distributions and mask meaningful patterns or outliers. Similarly, replacing missing categorical fields with “Unknown” may introduce noise and reduce the reliability of downstream analysis, especially in modeling or clustering tasks. Additionally, dropping columns with over 50% missing data might lead to the loss of potentially valuable or unique information. By keeping the missing values intact, we allow for more transparent analysis and leave room for context-aware handling in specific use cases.
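
The extent of missingness can still be quantified before making this call. Below is a minimal sketch (not part of the original notebook) that counts nulls per column for a handful of fields; the specific column names chosen here, such as SALARY, are assumptions for illustration.

Code
# Sketch: count nulls per column for a few fields of interest (column names are assumed for illustration)
from pyspark.sql.functions import col, count, when

cols_to_check = ["SALARY", "EDUCATION_LEVELS", "REMOTE_TYPE_NAME", "DURATION"]
df.select([count(when(col(c).isNull(), c)).alias(c) for c in cols_to_check]).show()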

8. Reason for not dropping unnecessary columns

The dataset is complex, with more than 100 variables. Rather than dropping unwanted columns up front, we extract only the specific column or columns needed for each analysis into a smaller DataFrame, as sketched below. This way we do not need to remove unwanted columns from the full dataset.
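
A minimal sketch of this column-extraction approach (the exact columns chosen here are illustrative assumptions, not a fixed selection):

Code
# Sketch: pull only the columns needed for a given analysis into a smaller DataFrame
# (TITLE_NAME is an assumed column name used purely for illustration)
analysis_df = df.select("TITLE_NAME", "REMOTE_TYPE_NAME", "DURATION")
analysis_df.show(5)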

9. Save the cleaned data

Code
# Save the cleaned data
# 1. Use coalesce(1) to merge all partitions into a single output file
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("data/lightcast_cleaned_temp")

# 2. Find the generated part file, then move and rename it
import os
import shutil

generated_file_path = 'data/lightcast_cleaned_temp'

for filename in os.listdir(generated_file_path):
    if filename.startswith('part-'):  # Spark names its output files part-*
        shutil.move(os.path.join(generated_file_path, filename), 'data/lightcast_cleaned.csv')

# 3. Remove the now-empty temporary folder
shutil.rmtree(generated_file_path)