Accessing Oracle using PySpark .

To run oracle commands on oracle server using pyspark .

For EMR

First install software

sudo su

pip install cx_Oracle==6.0b1

Function 1 : To run select command in oracle and print result , we could store this in RDD or DF and use it further as well.

## defining a function

def run_select_oracle(sql_file) :

## Opening file passed into function

file_open=open(sql_file)

## Opening file to read

read_file=file_open.read()

## Defining connections to connect to oracle

connection=cx_Oracle.connect('schema_name/user@ip:port/database')

## Defining cursor so that data can be stored in there for temp

cursor = connection.cursor()

## actually running a command

res_cursor=cursor.execute(read_file)

## Creating empty list to store result

results = []

## Loop to get values row by row

for row in res_cursor.fetchall():

## Appending values in list defined above

results.append(row)

## displaying values

print(results)

## closing connection so that no orphan connections are there

connection.close()

## calling function with file name

run_select_oracle("/home/user/path_to_sql/file_name.sql")

Function 2 : Insert command using similar function , just make sure to add commit post insert else they won't be visible to any one .

def run_insert_oracle(sql_file):

file_open=open(sql_file)

read_file=file_open.read()

connection=cx_Oracle.connect('schema_name/user@ip:port/database')

cursor = connection.cursor()

res_cursor=cursor.execute(read_file)

connection.commit()

connection.close()

run_insert_oracle("/home/user/path_to_sql/file_name.sql")

Comments