Introduction

PySpark is the Python API for Apache Spark, an open-source distributed computing system that enables fast, scalable data processing. PySpark allows Python developers to leverage the powerful capabilities of Spark for big data analytics, machine learning, and data engineering tasks without needing to delve into the complexities of Java or Scala.

With PySpark, users can process large datasets across clusters, perform distributed data transformations, and run machine learning algorithms. It integrates seamlessly with popular data processing frameworks like Hadoop and supports multiple data formats, making it a versatile tool in data science and analytics.

This introduction provides an overview of PySpark's configuration and helps you set it up and use it on your local machine.

Installation

  1. Install Python from https://www.python.org/downloads/
  2. Install Java. Download the latest version from https://jdk.java.net. I'm using Java 23 for this post.
  3. Install PySpark (see the download notes and pip command below).

You also need to download Apache Spark itself. I'm using https://www.apache.org/dyn/closer.lua/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz for this post.
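The PySpark Python package itself can also be installed from PyPI with pip. Pinning it to the same version as the downloaded Spark distribution is a common approach (the exact pin below is my assumption based on the tarball above):

pip install pyspark==3.5.4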

Configuration in Python

  1. Java
import os
# Point JAVA_HOME at the JDK install and add its bin folder to PATH
os.environ["JAVA_HOME"] = r"D:\Soft\JAVA\jdk-23.0.1"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin;" + os.environ["PATH"]
  2. PySpark
import os
# Point SPARK_HOME at the extracted Spark distribution and add its bin folder to PATH
os.environ["SPARK_HOME"] = r"D:\Soft\pyspark\spark-3.5.4-bin-hadoop3"
os.environ["PATH"] = os.environ["SPARK_HOME"] + "/bin;" + os.environ["PATH"]

Once done, you can check PySpark from the command line:
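
For example (assuming Spark's bin folder is on your PATH as configured above):

pyspark --version

This prints the Spark version banner and exits.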

Try an example with a PySpark notebook.

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# Note: event logging requires the directory in spark.eventLog.dir
# (default /tmp/spark-events) to exist
spark = SparkSession.builder \
    .appName("Debugging Example") \
    .master("local[*]") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.sql.shuffle.partitions", "1") \
    .getOrCreate()

spark.sparkContext.setLogLevel("DEBUG")

# Enable Arrow-based columnar data transfers
# (spark.sql.execution.arrow.enabled is deprecated since Spark 3.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
# rename columns
df = df.toDF("a", "b", "c")
df

Use df.show(5) to print the first five rows.
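
For instance, a small usage sketch with the df created above:

# Print the first 5 rows as a formatted table
df.show(5)

# With Arrow enabled, converting back to pandas is fast as well
print(df.toPandas().head())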

Let's try an example with a machine learning dataset:

import requests

# URL for the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Download the dataset and save it locally
response = requests.get(url)
with open("iris.data", "wb") as file:
    file.write(response.content)

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Iris Data Analysis") \
    .master("local[*]")\
    .getOrCreate()

# Path to the locally downloaded Iris dataset
iris_data_path = "iris.data"

# Column names for the dataset
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# Load the data into a DataFrame
df = spark.read.csv(iris_data_path, header=False, inferSchema=True)

# Set column names
df = df.toDF(*columns)

# Show the first few rows of the DataFrame
df.show()

# Stop the SparkSession when done
spark.stop()
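
To go one step further toward actual machine learning, here is a minimal classification sketch with pyspark.ml. This is my own illustration under stated assumptions, not part of the original walkthrough; it re-reads the iris.data file downloaded above in a fresh session:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder \
    .appName("Iris Classification Sketch") \
    .master("local[*]") \
    .getOrCreate()

# Re-read the dataset and drop any null rows
# (the raw file may end with a blank line)
df = spark.read.csv("iris.data", header=False, inferSchema=True) \
    .toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "species") \
    .na.drop()

# Encode the string label and assemble the numeric columns into one feature vector
indexer = StringIndexer(inputCol="species", outputCol="label")
assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[indexer, assembler, lr])

# Train on 80% of the data, evaluate on the remaining 20%
train, test = df.randomSplit([0.8, 0.2], seed=42)
predictions = pipeline.fit(train).transform(test)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(predictions))

spark.stop()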

It works! Cheers!
