PySpark with Google Colab

PySpark on Google Colab — Automatic Setup

  1. Quick Installation March/2020
  2. Automatic Installation
    1. Java Installation
    2. Spark Installation
    3. PySpark Usage Example


Quick Installation March/2020

In general, to use PySpark in Colab as of March/2020, the following commands would be used in a Colab cell:

Install Java

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os  # operating system management library
os.system("wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz")
os.system("tar xf spark-2.4.5-bin-hadoop2.7.tgz")  # the tarball lands in the current directory, not in /

Install PySpark

!pip install -q pyspark
# Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
# Load PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test_spark").master("local[*]").getOrCreate()
spark

However, each new Spark release breaks these download links, since older 2.x.x releases are removed from the Apache mirror when a new one is published.

The better approach is to automate the setup so that it finds and downloads whichever version is greater than 2.3.4 (the previous release) and less than 3.0.0 (which was still in development at the time).

To do this, the code below detects the current version of Spark, downloads and extracts it, and then installs Spark on Google Colab.

Automatic Installation

Java Installation

Google Colaboratory runs in a Linux environment, so Linux shell commands can be used preceded by the ‘!’ character.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
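Because `-qq > /dev/null` discards the installer's output, a failed apt-get step goes unnoticed and breaks the rest of the setup with confusing errors. A minimal sanity-check sketch (`java_available` is a hypothetical helper, not part of the original notebook):

```python
import shutil

def java_available():
    """Return True if a `java` binary is on PATH after the apt-get step."""
    return shutil.which("java") is not None

if not java_available():
    print("Java is missing - rerun the apt-get cell before continuing")
```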

Spark Installation

Automatically get the latest version of Spark from the Apache downloads page:

from bs4 import BeautifulSoup
import requests
#Get the spark versions from the web page
url = 'https://downloads.apache.org/spark/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # specify the parser explicitly
# read the web page and get the available spark versions
link_files = []
for link in soup.find_all('a'):
  link_files.append(link.get('href'))
spark_link = [x for x in link_files if 'spark' in x]
print(spark_link)

['spark-2.3.4/', 'spark-2.4.5/', 'spark-3.0.0-preview2/']

The version to use is the one above spark-2.3.4 and below spark-3.0.0.
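Instead of hardcoding an index into `spark_link`, the "above 2.3.4 and below 3.0.0" rule can be applied programmatically. A sketch, assuming the directory names follow the `spark-X.Y.Z/` pattern (`pick_spark_version` is a hypothetical helper):

```python
def pick_spark_version(links, lower=(2, 3, 4), upper=(3, 0, 0)):
    """Pick the newest release strictly between `lower` and `upper`,
    skipping preview builds such as 'spark-3.0.0-preview2/'."""
    candidates = []
    for name in links:
        ver = name.rstrip('/').replace('spark-', '')
        if not ver.replace('.', '').isdigit():  # skip previews and non-release entries
            continue
        parts = tuple(int(p) for p in ver.split('.'))
        if lower < parts < upper:
            candidates.append((parts, name.rstrip('/')))
    return max(candidates)[1] if candidates else None

print(pick_spark_version(['spark-2.3.4/', 'spark-2.4.5/', 'spark-3.0.0-preview2/']))
# spark-2.4.5
```

Tuple comparison gives proper version ordering (so 2.10.0 would sort after 2.4.5, unlike a plain string sort).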

Get the version and remove the trailing ‘/’ character

ver_spark = spark_link[1][:-1] # index 1 is spark-2.4.5 in this listing; remove the trailing '/' character
print(ver_spark)
spark-2.4.5
import os # operating system management library
#automatically install the desired version of spark
link = "https://downloads.apache.org/spark/"  # same mirror the version list was scraped from
os.system(f"wget -q {link}{ver_spark}/{ver_spark}-bin-hadoop2.7.tgz")
os.system(f"tar xf {ver_spark}-bin-hadoop2.7.tgz")
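`os.system` returns the shell exit status, so the download and extract steps above can be made to fail loudly instead of silently. A hedged sketch (the helper names are hypothetical):

```python
import os

def spark_tgz_url(ver_spark, base="https://downloads.apache.org/spark/"):
    """Build the tarball URL for a version string like 'spark-2.4.5'."""
    return f"{base}{ver_spark}/{ver_spark}-bin-hadoop2.7.tgz"

def download_and_extract(ver_spark):
    url = spark_tgz_url(ver_spark)
    if os.system(f"wget -q {url}") != 0:  # 0 means the shell command succeeded
        raise RuntimeError(f"download failed: {url}")
    if os.system(f"tar xf {ver_spark}-bin-hadoop2.7.tgz") != 0:
        raise RuntimeError(f"extract failed: {ver_spark}")
```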

# install pyspark
!pip install -q pyspark
|████████████████████████████████| 217.8MB 63kB/s
|████████████████████████████████| 204kB 53.8MB/s
Building wheel for pyspark (setup.py) ... done

Set environment variables

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{ver_spark}-bin-hadoop2.7"
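If SPARK_HOME points at a directory that does not exist (for example, because the tar step failed), `getOrCreate()` fails later with a far less obvious error. A quick early check, as a sketch (`check_spark_home` is a hypothetical helper):

```python
import os

def check_spark_home():
    """Fail early if SPARK_HOME is unset or points at a missing directory."""
    home = os.environ.get("SPARK_HOME", "")
    if not os.path.isdir(home):
        raise FileNotFoundError(f"SPARK_HOME not found: {home!r}")
    return home
```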

Load PySpark into the system

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test_spark").master("local[*]").getOrCreate()
spark

SparkSession - in-memory

SparkContext
  Version: v2.4.5
  Master:  local[*]
  AppName: pyspark-shell

PySpark Usage Example

Read a test file

archivo = './sample_data/california_housing_train.csv'
df_spark = spark.read.csv(archivo, inferSchema=True, header=True)

# print file type
print(type(df_spark))
<class 'pyspark.sql.dataframe.DataFrame'>

Number of records in the dataframe?

df_spark.count()
17000

Dataframe schema

df_spark.printSchema()
root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)

Dataframe column names?

df_spark.columns
['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

View the first 20 records of the dataframe

df_spark.show()
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|  -114.57|   33.57|              20.0|     1454.0|         326.0|     624.0|     262.0|        1.925|           65500.0|
|  -114.58|   33.63|              29.0|     1387.0|         236.0|     671.0|     239.0|       3.3438|           74000.0|
|  -114.58|   33.61|              25.0|     2907.0|         680.0|    1841.0|     633.0|       2.6768|           82400.0|
|  -114.59|   34.83|              41.0|      812.0|         168.0|     375.0|     158.0|       1.7083|           48500.0|
|  -114.59|   33.61|              34.0|     4789.0|        1175.0|    3134.0|    1056.0|       2.1782|           58400.0|
|   -114.6|   34.83|              46.0|     1497.0|         309.0|     787.0|     271.0|       2.1908|           48100.0|
|   -114.6|   33.62|              16.0|     3741.0|         801.0|    2434.0|     824.0|       2.6797|           86500.0|
|   -114.6|    33.6|              21.0|     1988.0|         483.0|    1182.0|     437.0|        1.625|           62000.0|
|  -114.61|   34.84|              48.0|     1291.0|         248.0|     580.0|     211.0|       2.1571|           48600.0|
|  -114.61|   34.83|              31.0|     2478.0|         464.0|    1346.0|     479.0|        3.212|           70400.0|
|  -114.63|   32.76|              15.0|     1448.0|         378.0|     949.0|     300.0|       0.8585|           45000.0|
|  -114.65|   34.89|              17.0|     2556.0|         587.0|    1005.0|     401.0|       1.6991|           69100.0|
|  -114.65|    33.6|              28.0|     1678.0|         322.0|     666.0|     256.0|       2.9653|           94900.0|
|  -114.65|   32.79|              21.0|       44.0|          33.0|      64.0|      27.0|       0.8571|           25000.0|
|  -114.66|   32.74|              17.0|     1388.0|         386.0|     775.0|     320.0|       1.2049|           44000.0|
|  -114.67|   33.92|              17.0|       97.0|          24.0|      29.0|      15.0|       1.2656|           27500.0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
only showing top 20 rows

Statistical Description of the Dataframe

df_spark.describe().toPandas().transpose()
                        0                    1                   2        3         4
summary             count                 mean              stddev      min       max
longitude           17000  -119.56210823529375  2.0051664084260357  -124.35   -114.31
latitude            17000     35.6252247058827  2.1373397946570867    32.54     41.95
housing_median_age  17000    28.58935294117647  12.586936981660406      1.0      52.0
total_rooms         17000    2643.664411764706   2179.947071452777      2.0   37937.0
total_bedrooms      17000    539.4108235294118   421.4994515798648      1.0    6445.0
population          17000   1429.5739411764705   1147.852959159527      3.0   35682.0
households          17000    501.2219411764706   384.5208408559016      1.0    6082.0
median_income       17000   3.8835781000000021  1.9081565183791036   0.4999   15.0001
median_house_value  17000   207300.91235294117  115983.76438720895  14999.0  500001.0

Statistical description of a single column (‘median_house_value’)

df_spark.describe(['median_house_value']).show()
+-------+------------------+
|summary|median_house_value|
+-------+------------------+
|  count|             17000|
|   mean|207300.91235294117|
| stddev|115983.76438720895|
|    min|           14999.0|
|    max|          500001.0|
+-------+------------------+

This is how Spark can be installed automatically on Google Colab and used for free.

The free tier only provides limited CPU and memory; to increase processing capacity you need a paid plan.
