PySpark with Google Colab

PySpark on Google Colab — Automatic Setup

  1. Quick Installation March/2020
  2. Automatic Installation
    1. Java Installation
    2. Spark Installation
    3. PySpark Usage Example


Quick Installation March/2020

In general, to use PySpark in Colab as of March/2020, the following commands would be used in a Colab cell:

Install Java

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os  # operating system management library
os.system("wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz")
os.system("tar xf spark-2.4.5-bin-hadoop2.7.tgz")  # the tarball lands in the current directory, not in /

Install PySpark

!pip install -q pyspark
# Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
# Load PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test_spark").master("local[*]").getOrCreate()
spark

However, each new Spark release breaks these download links, since older 2.x.x releases are removed from the Apache mirror when a new one is published.

The better approach is to automate the setup so that it finds and downloads whichever version is greater than 2.3.4 (the previous release) and less than 3.0.0 (which was still in development at the time).

To do this, the code below detects the current version of Spark, downloads and extracts it, and then installs Spark on Google Colab.

Automatic Installation

Java Installation

Google Colaboratory runs in a Linux environment, so Linux shell commands can be used preceded by the ‘!’ character.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
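Because `-qq > /dev/null` discards the installer's output, a failed apt-get step goes unnoticed and breaks the rest of the setup with confusing errors. A minimal sanity-check sketch (`java_available` is a hypothetical helper, not part of the original notebook):

```python
import shutil

def java_available():
    """Return True if a `java` binary is on PATH after the apt-get step."""
    return shutil.which("java") is not None

if not java_available():
    print("Java is missing - rerun the apt-get cell before continuing")
```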

Spark Installation

Automatically get the latest version of Spark from the Apache downloads page:

from bs4 import BeautifulSoup
import requests
#Get the spark versions from the web page
url = 'https://downloads.apache.org/spark/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # specify the parser explicitly
# read the web page and get the available spark versions
link_files = []
for link in soup.find_all('a'):
  link_files.append(link.get('href'))
spark_link = [x for x in link_files if 'spark' in x]
print(spark_link)

['spark-2.3.4/', 'spark-2.4.5/', 'spark-3.0.0-preview2/']

The version to use is the one above spark-2.3.4 and below spark-3.0.0.
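Instead of hardcoding an index into `spark_link`, the "above 2.3.4 and below 3.0.0" rule can be applied programmatically. A sketch, assuming the directory names follow the `spark-X.Y.Z/` pattern (`pick_spark_version` is a hypothetical helper):

```python
def pick_spark_version(links, lower=(2, 3, 4), upper=(3, 0, 0)):
    """Pick the newest release strictly between `lower` and `upper`,
    skipping preview builds such as 'spark-3.0.0-preview2/'."""
    candidates = []
    for name in links:
        ver = name.rstrip('/').replace('spark-', '')
        if not ver.replace('.', '').isdigit():  # skip previews and non-release entries
            continue
        parts = tuple(int(p) for p in ver.split('.'))
        if lower < parts < upper:
            candidates.append((parts, name.rstrip('/')))
    return max(candidates)[1] if candidates else None

print(pick_spark_version(['spark-2.3.4/', 'spark-2.4.5/', 'spark-3.0.0-preview2/']))
# spark-2.4.5
```

Tuple comparison gives proper version ordering (so 2.10.0 would sort after 2.4.5, unlike a plain string sort).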

Get the version and remove the trailing ‘/’ character

ver_spark = spark_link[1][:-1] # index 1 is spark-2.4.5 in this listing; remove the trailing '/' character
print(ver_spark)
spark-2.4.5
import os # operating system management library
#automatically install the desired version of spark
link = "https://downloads.apache.org/spark/"  # same mirror the version list was scraped from
os.system(f"wget -q {link}{ver_spark}/{ver_spark}-bin-hadoop2.7.tgz")
os.system(f"tar xf {ver_spark}-bin-hadoop2.7.tgz")
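`os.system` returns the shell exit status, so the download and extract steps above can be made to fail loudly instead of silently. A hedged sketch (the helper names are hypothetical):

```python
import os

def spark_tgz_url(ver_spark, base="https://downloads.apache.org/spark/"):
    """Build the tarball URL for a version string like 'spark-2.4.5'."""
    return f"{base}{ver_spark}/{ver_spark}-bin-hadoop2.7.tgz"

def download_and_extract(ver_spark):
    url = spark_tgz_url(ver_spark)
    if os.system(f"wget -q {url}") != 0:  # 0 means the shell command succeeded
        raise RuntimeError(f"download failed: {url}")
    if os.system(f"tar xf {ver_spark}-bin-hadoop2.7.tgz") != 0:
        raise RuntimeError(f"extract failed: {ver_spark}")
```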

# install pyspark
!pip install -q pyspark
|████████████████████████████████| 217.8MB 63kB/s
|████████████████████████████████| 204kB 53.8MB/s
Building wheel for pyspark (setup.py) ... done

Set environment variables

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{ver_spark}-bin-hadoop2.7"
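If SPARK_HOME points at a directory that does not exist (for example, because the tar step failed), `getOrCreate()` fails later with a far less obvious error. A quick early check, as a sketch (`check_spark_home` is a hypothetical helper):

```python
import os

def check_spark_home():
    """Fail early if SPARK_HOME is unset or points at a missing directory."""
    home = os.environ.get("SPARK_HOME", "")
    if not os.path.isdir(home):
        raise FileNotFoundError(f"SPARK_HOME not found: {home!r}")
    return home
```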

Load PySpark into the system

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test_spark").master("local[*]").getOrCreate()
spark

SparkSession - in-memory

SparkContext
  Version: v2.4.5
  Master:  local[*]
  AppName: pyspark-shell

PySpark Usage Example

Read a test file

archivo = './sample_data/california_housing_train.csv'
df_spark = spark.read.csv(archivo, inferSchema=True, header=True)

# print file type
print(type(df_spark))
<class 'pyspark.sql.dataframe.DataFrame'>

Number of records in the dataframe?

df_spark.count()
17000

Dataframe schema

df_spark.printSchema()
root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)

Dataframe column names?

df_spark.columns
['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

View the first 20 records of the dataframe

df_spark.show()
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|  -114.57|   33.57|              20.0|     1454.0|         326.0|     624.0|     262.0|        1.925|           65500.0|
|  -114.58|   33.63|              29.0|     1387.0|         236.0|     671.0|     239.0|       3.3438|           74000.0|
|  -114.58|   33.61|              25.0|     2907.0|         680.0|    1841.0|     633.0|       2.6768|           82400.0|
|  -114.59|   34.83|              41.0|      812.0|         168.0|     375.0|     158.0|       1.7083|           48500.0|
|  -114.59|   33.61|              34.0|     4789.0|        1175.0|    3134.0|    1056.0|       2.1782|           58400.0|
|   -114.6|   34.83|              46.0|     1497.0|         309.0|     787.0|     271.0|       2.1908|           48100.0|
|   -114.6|   33.62|              16.0|     3741.0|         801.0|    2434.0|     824.0|       2.6797|           86500.0|
|   -114.6|    33.6|              21.0|     1988.0|         483.0|    1182.0|     437.0|        1.625|           62000.0|
|  -114.61|   34.84|              48.0|     1291.0|         248.0|     580.0|     211.0|       2.1571|           48600.0|
|  -114.61|   34.83|              31.0|     2478.0|         464.0|    1346.0|     479.0|        3.212|           70400.0|
|  -114.63|   32.76|              15.0|     1448.0|         378.0|     949.0|     300.0|       0.8585|           45000.0|
|  -114.65|   34.89|              17.0|     2556.0|         587.0|    1005.0|     401.0|       1.6991|           69100.0|
|  -114.65|    33.6|              28.0|     1678.0|         322.0|     666.0|     256.0|       2.9653|           94900.0|
|  -114.65|   32.79|              21.0|       44.0|          33.0|      64.0|      27.0|       0.8571|           25000.0|
|  -114.66|   32.74|              17.0|     1388.0|         386.0|     775.0|     320.0|       1.2049|           44000.0|
|  -114.67|   33.92|              17.0|       97.0|          24.0|      29.0|      15.0|       1.2656|           27500.0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
only showing top 20 rows

Statistical Description of the Dataframe

df_spark.describe().toPandas().transpose()
                        0                    1                   2        3         4
summary             count                 mean              stddev      min       max
longitude           17000  -119.56210823529375  2.0051664084260357  -124.35   -114.31
latitude            17000     35.6252247058827  2.1373397946570867    32.54     41.95
housing_median_age  17000    28.58935294117647  12.586936981660406      1.0      52.0
total_rooms         17000    2643.664411764706   2179.947071452777      2.0   37937.0
total_bedrooms      17000    539.4108235294118   421.4994515798648      1.0    6445.0
population          17000   1429.5739411764705   1147.852959159527      3.0   35682.0
households          17000    501.2219411764706   384.5208408559016      1.0    6082.0
median_income       17000   3.8835781000000021  1.9081565183791036   0.4999   15.0001
median_house_value  17000   207300.91235294117  115983.76438720895  14999.0  500001.0

Statistical description of a single column (‘median_house_value’)

df_spark.describe(['median_house_value']).show()
+-------+------------------+
|summary|median_house_value|
+-------+------------------+
|  count|             17000|
|   mean|207300.91235294117|
| stddev|115983.76438720895|
|    min|           14999.0|
|    max|          500001.0|
+-------+------------------+

This is how Spark can be installed automatically on Google Colab and used for free.

The free tier only provides limited CPU and memory; to increase processing capacity you need a paid plan.
