## Windows
Here we'll show you how to install Spark 3.3.2 for Windows.
We tested it on Windows 10 and 11 Home editions, but it should work
on other versions as well.
In this tutorial, we'll use [MINGW](https://www.mingw-w64.org/)/[Git Bash](https://gitforwindows.org/) for the command line.
If you use WSL, follow the instructions in [linux.md](linux.md).
### Installing Java
Spark needs Java 11. Download it from here: [https://www.oracle.com/de/java/technologies/javase/jdk11-archive-downloads.html](https://www.oracle.com/de/java/technologies/javase/jdk11-archive-downloads.html). Select “Windows x64 Compressed Archive” (you may have to create an Oracle account for that).
Unpack it to a folder with no spaces in the path. We use `C:/tools`, so the full path to the JDK is `/c/tools/jdk-11.0.13`.
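For example, a minimal sketch of the unpacking step, assuming the archive was saved to `~/Downloads` (the file name below is an example and may differ; if `unzip` is not available in your Git Bash, extract the archive with Windows Explorer instead):
```bash
mkdir -p /c/tools
cd /c/tools
# the archive name is an example; adjust it to the file you actually downloaded
unzip ~/Downloads/jdk-11.0.13_windows-x64_bin.zip
```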
Now let's configure it and add it to `PATH`:
```bash
export JAVA_HOME="/c/tools/jdk-11.0.13"
export PATH="${JAVA_HOME}/bin:${PATH}"
```
Check that Java works correctly:
```bash
java --version
```
Output:
```
java 11.0.13 2021-10-19 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.13+10-LTS-370)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.13+10-LTS-370, mixed mode)
```
### Hadoop
Next, we need the Hadoop binaries.
We'll use Hadoop 3.2.0, which we'll get from [here](https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.0).
Create a `bin` folder for it (`/c/tools/hadoop-3.2.0/bin`), `cd` there, and download the files:
```bash
HADOOP_VERSION="3.2.0"
PREFIX="https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-${HADOOP_VERSION}/bin"

# download each binary into the current folder (/c/tools/hadoop-3.2.0/bin)
FILES="hadoop.dll hadoop.exp hadoop.lib hadoop.pdb libwinutils.lib winutils.exe winutils.pdb"
for FILE in ${FILES}; do
  wget "${PREFIX}/${FILE}"
done
```
Add it to `PATH`:
```bash
export HADOOP_HOME="/c/tools/hadoop-3.2.0"
export PATH="${HADOOP_HOME}/bin:${PATH}"
```
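As an optional sanity check, confirm that the shell now finds the binaries; running `winutils.exe` without arguments should print its usage text:
```bash
which winutils.exe
winutils.exe
```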
### Spark
Now download Spark. We'll use version 3.3.2:
```bash
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
```
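Optionally, verify the integrity of the download. Apache usually publishes a `.sha512` file next to each artifact (the URL below follows that convention; if it doesn't resolve, the checksum is also linked from the Spark downloads page):
```bash
# compute the digest locally and compare it by eye with the published one
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz.sha512
sha512sum spark-3.3.2-bin-hadoop3.tgz
cat spark-3.3.2-bin-hadoop3.tgz.sha512
```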
Unpack it to a location with no spaces in the path, e.g. `/c/tools/`:
```bash
tar xzvf spark-3.3.2-bin-hadoop3.tgz
```
Let's also add it to `PATH`:
```bash
export SPARK_HOME="/c/tools/spark-3.3.2-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"
```
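The `export` lines above only affect the current terminal session. To keep the settings across sessions, you can append them to `~/.bashrc` (a sketch, assuming the paths used in this tutorial; Git Bash picks the file up on start):
```bash
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME="/c/tools/jdk-11.0.13"
export HADOOP_HOME="/c/tools/hadoop-3.2.0"
export SPARK_HOME="/c/tools/spark-3.3.2-bin-hadoop3"
export PATH="${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${PATH}"
EOF
```
Open a new terminal and run `which winutils.exe` to confirm the settings are picked up.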
### Testing it
Go to the Spark directory:
```bash
cd spark-3.3.2-bin-hadoop3
```
And run `spark-shell`:
```bash
./bin/spark-shell.cmd
```
At this point you may get a message from Windows Firewall asking whether to allow access; allow it.
You may also see some warnings like these:
```
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/tools/spark-3.3.2-bin-hadoop3/jars/spark-unsafe_2.12-3.3.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
```
You can safely ignore them; once the shell starts, you'll land at the `scala>` prompt.
Now let's run a small test job:
```scala
val data = 1 to 10000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect()
```
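If everything is set up correctly, the last command should return the numbers from 1 to 9 (the `res` index may differ):
```
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
```
You can exit the shell with `:quit` or Ctrl+D.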
### PySpark
It's the same for all platforms. Go to [pyspark.md](pyspark.md).