Connecting SQL Server to IntelliJ IDEA Using PySpark (With Full Explanation)


Working with data often means bridging traditional databases with modern processing engines. In this guide, we’ll connect Microsoft SQL Server to Apache Spark using PySpark inside IntelliJ IDEA, and more importantly, understand what’s happening behind the scenes.

Why This Setup Matters

  • SQL Server → stores structured data
  • Spark → processes large-scale data efficiently
  • PySpark → lets you use Python for distributed computing

This integration allows you to:

  • Pull data from SQL Server
  • Transform it using Spark
  • Optionally write it back

1. Why Do We Need a JDBC Driver?

Before jumping into setup, let’s understand the core concept.

What is JDBC? JDBC (Java Database Connectivity) is a standard API that allows Java-based applications (like Spark) to communicate with databases.

Why is the JDBC .jar file needed? Spark runs on the JVM (Java Virtual Machine), while SQL Server speaks its own wire protocol. The JDBC driver .jar acts as the translator between the two.

Without the JDBC driver:

  • Spark cannot understand how to talk to SQL Server
  • You’ll get errors like: No suitable driver found

Think of it like this:

  • Spark → speaks Java
  • SQL Server → speaks TDS (Tabular Data Stream protocol)
  • JDBC driver → acts as an interpreter between them

2. Download and Place the JDBC Driver

Download from Microsoft:

https://learn.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view=sql-server-ver17

Steps:

  1. Extract the ZIP file

  2. Navigate to: sqljdbc → enu → jars

  3. Choose the JAR that matches your Java version: jre8 for Java 8, jre11 for Java 11 and later

  4. Create folder: C:\spark\jars

  5. Copy mssql-jdbc-13.4.0.jre11.jar into C:\spark\jars

3. Why Do We Add the JAR in Spark Config?

Now for the important part that most people skip understanding 👇

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MSSQL Connection") \
    .config("spark.jars", "C:/spark/jars/mssql-jdbc-13.4.0.jre11.jar") \
    .getOrCreate()

What does this do?

  • spark.jars tells Spark:
    • “Load this external library before running anything”

Why is this required? Spark runs in a distributed way:

  • Driver process
  • Executor processes

If the JDBC driver is not included:

  • Executors won’t know how to connect to SQL Server
  • Job will fail at runtime

In simple terms: You are injecting the database driver into Spark’s runtime environment.
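To confirm the driver really made it into that runtime, you can ask Spark's JVM to load the driver class by name. This goes through PySpark's internal py4j gateway (spark._jvm is not public API), so treat it as a quick debugging sanity check only:

# Sanity check: if the JAR was picked up, this returns the class object;
# if not, it raises an error wrapping java.lang.ClassNotFoundException.
spark._jvm.java.lang.Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")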

4. SQL Server Setup (What’s Actually Happening)

Instance Type

  • Default instance: MSSQLSERVER
  • Connection: localhost:1433

5. What is TCP/IP and Why Enable It?

What is TCP/IP? TCP/IP (Transmission Control Protocol / Internet Protocol) is the fundamental communication protocol used by computers to talk over a network.

Why does SQL Server need it? By default:

  • SQL Server may accept only local (shared-memory) connections

But Spark connects using:

  • Network-based communication (even on localhost)

So TCP/IP must be enabled to allow:

  • External programs (like Spark) to connect

Steps to Enable TCP/IP

  1. Open: SQLServerManager15.msc (the number varies by version; 15 corresponds to SQL Server 2019)

  2. Navigate: SQL Server Network Configuration → Protocols for MSSQLSERVER

  3. Enable: TCP/IP

  4. Check the port:

    • Go to the IP Addresses tab
    • Find TCP Port (usually 1433)

  5. Restart the SQL Server service so the change takes effect
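Once TCP/IP is enabled and the service restarted, you can verify that something is listening on the port before involving Spark at all. A minimal check using Python's standard library (localhost and 1433 are the defaults assumed throughout this guide):

import socket

# Open a plain TCP connection to the SQL Server port.
# Success only proves the port is reachable, not that credentials work.
with socket.create_connection(("localhost", 1433), timeout=5):
    print("SQL Server is accepting TCP connections on port 1433")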

Security Note (Important)

Enabling TCP/IP does NOT automatically make your system unsafe, but there are risks if it is misconfigured:

  • If your port (1433) is exposed to the internet:
    • Unauthorized access attempts
    • Brute-force login attacks

Best Practices:

  • Keep SQL Server bound to: localhost

  • Do NOT expose port 1433 publicly unless needed

  • Use strong passwords

  • Enable firewall rules

  • Avoid using the sa account

For local development, this setup is generally safe.


6. Authentication Setup (Why Mixed Mode?)

What is Mixed Mode? SQL Server supports two authentication methods:

  • Windows Authentication
  • SQL Server Authentication

Mixed Mode enables both.

Why do we enable Mixed Mode? Spark's JDBC connector cannot easily use Windows (integrated) authentication, so we fall back to SQL Server authentication:

  • Username + password

Create Login

  • Username: spark_user
  • Assign:
    • Access to your target database
    • Role: db_owner (convenient for local development; grant narrower roles in production)

7. JDBC Connection Explained

jdbc_url = "jdbc:sqlserver://localhost:1433;databaseName=your_database"

connection_properties = {
    "user": "spark_user",
    "password": "your_password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

Breakdown:

  • jdbc:sqlserver:// → protocol
  • localhost:1433 → server + port
  • databaseName → target DB
  • driver → tells Spark which JDBC driver to use
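One gotcha worth knowing: recent versions of the Microsoft JDBC driver (10.2 and later, including the 13.4 JAR used here) default to encrypt=true, so connecting to a local server without a trusted TLS certificate can fail with a secure-connection error. For local development you can tell the driver to trust the server's certificate (never do this against a production server):

# Local development only: accept the server's self-signed certificate.
jdbc_url = (
    "jdbc:sqlserver://localhost:1433;"
    "databaseName=your_database;"
    "encrypt=true;trustServerCertificate=true"
)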

8. Reading Data from SQL Server

df = spark.read.jdbc(
    url=jdbc_url,
    table="your_table_name",
    properties=connection_properties
)

df.show()

What happens internally?

  1. Spark opens a JDBC connection to SQL Server
  2. A SELECT query is executed against the table
  3. The result set is pulled into a Spark DataFrame
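You are not limited to whole tables. The table argument also accepts a parenthesized subquery with an alias, which lets SQL Server do the filtering before any rows cross the wire (a sketch; the table and column names are placeholders):

# SQL Server executes the subquery; only matching rows reach Spark.
df_recent = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT * FROM your_table_name WHERE amount > 100) AS t",
    properties=connection_properties
)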

9. Transform Data Using Spark

from pyspark.sql.functions import col

df_filtered = df.filter(col("column_name") > 100)

Spark processes data:

  • In-memory
  • Distributed across nodes
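Once the data is in a DataFrame, the full Spark API is available. For example, a grouped aggregation (the column names here are placeholders):

from pyspark.sql.functions import avg, count

# Count rows and average a numeric column per group.
df_summary = (
    df.groupBy("category_column")
      .agg(count("*").alias("row_count"),
           avg("column_name").alias("avg_value"))
)
df_summary.show()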

10. Writing Data Back (Important Concept)

⚠️ Why is overwrite dangerous?

mode="overwrite"

This:

  • Drops the table
  • Recreates it
  • Deletes all existing data

Safer Options

mode="append"

Or:

  • Write to a new table
  • Use Spark only for processing
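Putting it together, a minimal append-mode write might look like this (a sketch; your_output_table is a placeholder, and Spark will create the table if it does not already exist):

df_filtered.write.jdbc(
    url=jdbc_url,
    table="your_output_table",
    mode="append",  # adds rows; never drops the existing table
    properties=connection_properties
)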

Important Concept 👉 Spark is NOT a transactional database

Unlike SQL Server:

  • No ACID guarantees
  • No row-level updates

Spark is:

  ✔ A data processing engine
  ❌ Not meant for safe, incremental database updates

Final Thoughts

This integration between:

  • Microsoft SQL Server
  • Apache Spark

is powerful—but only when you understand the why behind each step.

Key Learnings:

  • JDBC driver = communication bridge
  • spark.jars = runtime dependency injection
  • TCP/IP = network communication enabler
  • Authentication = controlled access
  • Spark ≠ database