Connecting SQL Server to IntelliJ IDEA Using PySpark (With Full Explanation)
Working with data often means bridging traditional databases with modern processing engines. In this guide, we’ll connect Microsoft SQL Server to Apache Spark using PySpark inside IntelliJ IDEA, and more importantly, understand what’s happening behind the scenes.
Why This Setup Matters
- SQL Server → stores structured data
- Spark → processes large-scale data efficiently
- PySpark → lets you use Python for distributed computing
This integration allows you to:
- Pull data from SQL Server
- Transform it using Spark
- Optionally write it back
1. Why Do We Need a JDBC Driver?
Before jumping into setup, let’s understand the core concept.
What is JDBC? JDBC (Java Database Connectivity) is a standard API that allows Java-based applications (like Spark) to communicate with databases.
Why is the JDBC .jar file needed? Spark runs on the JVM (Java Virtual Machine), while SQL Server speaks its own wire protocol. The JDBC driver acts as a translator between them.
Without the JDBC driver:
- Spark cannot understand how to talk to SQL Server
- You’ll get errors like:
No suitable driver found
Think of it like this:
- Spark → speaks Java
- SQL Server → speaks TDS (Tabular Data Stream protocol)
- JDBC driver → acts as an interpreter between them
2. Download and Place the JDBC Driver
Download the driver from Microsoft's official download page for the Microsoft JDBC Driver for SQL Server.
Steps:
1. Extract the ZIP file
2. Navigate to: sqljdbc → enu → jars
3. Choose the JAR matching your Java version:
   - jre8 → Java 8
   - jre11 → Java 11
4. Create the folder: C:\spark\jars
5. Copy into it: mssql-jdbc-13.4.0.jre11.jar
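Before wiring this into Spark, it helps to sanity-check that the JAR actually sits where you expect. A minimal sketch (the path is simply the one created above):

import os

jar_path = "C:/spark/jars/mssql-jdbc-13.4.0.jre11.jar"

# Fail fast with a clear message instead of a cryptic JVM error later
if not os.path.exists(jar_path):
    raise FileNotFoundError(f"JDBC driver not found at {jar_path}")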
3. Why Do We Add the JAR in Spark Config?
Now for the important part that most people skip understanding 👇

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MSSQL Connection") \
    .config("spark.jars", "C:/spark/jars/mssql-jdbc-13.4.0.jre11.jar") \
    .getOrCreate()
What does this do?
spark.jars tells Spark: "Load this external library before running anything."
Why is this required? Spark runs in a distributed way:
- Driver process
- Executor processes
If the JDBC driver is not included:
- Executors won’t know how to connect to SQL Server
- Job will fail at runtime
In simple terms: You are injecting the database driver into Spark’s runtime environment.
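Equivalently, you can let Spark fetch the driver for you: spark.jars.packages takes Maven coordinates and downloads the dependency at startup. A sketch, assuming the same driver version used above is published to Maven Central:

from pyspark.sql import SparkSession

# Spark resolves and downloads the driver from Maven Central at startup
spark = SparkSession.builder \
    .appName("MSSQL Connection") \
    .config("spark.jars.packages",
            "com.microsoft.sqlserver:mssql-jdbc:13.4.0.jre11") \
    .getOrCreate()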
4. SQL Server Setup (What’s Actually Happening)
Instance Type
- Default instance: MSSQLSERVER
- Connection: localhost:1433
5. What is TCP/IP and Why Enable It?
What is TCP/IP? TCP/IP (Transmission Control Protocol / Internet Protocol) is the fundamental communication protocol used by computers to talk over a network.
Why does SQL Server need it? By default:
- SQL Server may only accept local/internal connections
But Spark connects using:
- Network-based communication (even on localhost)
So TCP/IP must be enabled to allow:
- External programs (like Spark) to connect
Steps to Enable TCP/IP
1. Open: SQLServerManager15.msc (the number varies with your SQL Server version)
2. Navigate: SQL Server Network Configuration → Protocols for MSSQLSERVER
3. Enable: TCP/IP
4. Check the port: on the IP Addresses tab, find TCP Port (usually 1433)
5. Restart the SQL Server service so the change takes effect
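To confirm the listener is actually reachable before involving Spark, a quick check with Python's standard library (host and port as configured above):

import socket

# Attempt a plain TCP connection to the SQL Server port
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(3)
    result = s.connect_ex(("localhost", 1433))

print("Port 1433 is open" if result == 0 else f"Connection failed (error {result})")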
Security Note (Important)
Enabling TCP/IP does NOT automatically make your system unsafe. Risks (if misconfigured):
- If your port (1433) is exposed to the internet:
  - Unauthorized access attempts
  - Brute-force login attacks
Best Practices:
- Keep SQL Server bound to localhost
- Do NOT expose port 1433 publicly unless needed
- Use strong passwords
- Enable firewall rules
- Avoid using the sa account
For local development, this setup is generally safe.
6. Authentication Setup (Why Mixed Mode?)
What is Mixed Mode? SQL Server supports:
- Windows Authentication
- SQL Server Authentication
Why do we enable Mixed Mode? Spark's JDBC connection cannot easily use Windows (integrated) authentication, since that requires an extra native authentication DLL on the JVM's library path.
So we use:
- Username + password (SQL Server Authentication)
Create Login
- Username: spark_user
- Assign:
  - Database access
  - Role: db_owner (fine for local development; db_datareader/db_datawriter is safer if Spark only reads and appends)
7. JDBC Connection Explained
jdbc_url = "jdbc:sqlserver://localhost:1433;databaseName=your_database"

connection_properties = {
    "user": "spark_user",
    "password": "your_password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
Breakdown:
- jdbc:sqlserver:// → protocol
- localhost:1433 → server + port
- databaseName → target DB
- driver → tells Spark which JDBC driver to use
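One caveat worth knowing: recent versions of the Microsoft JDBC driver (10.2 and later, which includes the 13.x JAR used here) enable encryption by default, so a local SQL Server with a self-signed certificate can be rejected with a TLS error. For local development you can relax certificate validation in the URL:

# Local development only: trusts the server's self-signed certificate
jdbc_url = ("jdbc:sqlserver://localhost:1433;"
            "databaseName=your_database;"
            "encrypt=true;trustServerCertificate=true")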
8. Reading Data from SQL Server
df = spark.read.jdbc(
    url=jdbc_url,
    table="your_table_name",
    properties=connection_properties
)

df.show()
What happens internally?
- Spark opens a JDBC connection to SQL Server
- A SELECT query against the table is executed
- The result set is pulled into a Spark DataFrame
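By default that read happens over a single connection. For larger tables, spark.read.jdbc can split the work across parallel connections; a minimal sketch, assuming the table has a numeric id column whose rough bounds you know:

# Spark issues one query per partition, each covering a slice of the id range
df = spark.read.jdbc(
    url=jdbc_url,
    table="your_table_name",
    column="id",           # hypothetical numeric partition column
    lowerBound=1,
    upperBound=1_000_000,  # assumed range of id values
    numPartitions=8,
    properties=connection_properties,
)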
9. Transform Data Using Spark
from pyspark.sql.functions import col

df_filtered = df.filter(col("column_name") > 100)
Spark processes data:
- In-memory
- Distributed across nodes
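Filters are only the start: transformations chain lazily, and nothing runs until an action (like show() or a write) is called. A small illustrative sketch using hypothetical column names:

from pyspark.sql.functions import avg, col

# Nothing executes yet; Spark only builds a query plan
summary = (
    df.filter(col("amount") > 100)   # hypothetical column
      .groupBy("category")           # hypothetical column
      .agg(avg("amount").alias("avg_amount"))
)

summary.show()  # this action triggers distributed execution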
10. Writing Data Back (Important Concept)
⚠️ Why is overwrite dangerous?
mode="overwrite"
This:
- Drops table
- Recreates it
- Deletes all existing data
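If you really do need to replace a table's contents while keeping its schema, indexes, and permissions, Spark's JDBC writer supports a truncate option that empties the table instead of dropping it:

# Empties the rows but preserves the table definition in SQL Server
df.write \
    .option("truncate", "true") \
    .jdbc(url=jdbc_url, table="your_table_name",
          mode="overwrite", properties=connection_properties)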
Safer Options
mode="append"
Or:
- Write to a new table
- Use Spark only for processing
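Putting the safer options together, a minimal append write (the target table name here is hypothetical):

# Adds rows without touching existing data or the table definition
df_filtered.write.jdbc(
    url=jdbc_url,
    table="processed_results",  # hypothetical target table
    mode="append",
    properties=connection_properties,
)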
Important Concept 👉 Spark is NOT a transactional database
Unlike SQL Server:
- No ACID guarantees
- No row-level updates
Spark is:
✔ A data processing engine
❌ Not meant for safe incremental DB updates
Final Thoughts
This integration between Microsoft SQL Server and Apache Spark is powerful, but only when you understand the why behind each step.
Key Learnings:
- JDBC driver = communication bridge
- spark.jars = runtime dependency injection
- TCP/IP = network communication enabler
- Authentication = controlled access
- Spark ≠ database