How to run Python programs on Spark

Published: 2023-01-28 21:21:31

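㈠ How to write Spark programs in Python

1. When the RDD is a PairRDD
from pyspark import SparkContext

def add1(line):
    return line[0] + line[1]

def add2(x1, x2):
    return x1 + x2

sc = SparkContext(appName="gridAnalyse")
rdd = sc.parallelize([1, 2, 3])
# Python 2 only: one parameter, unpacked by pattern matching into x1 and x2.
# Tuple parameters were removed in Python 3 (PEP 3113), where this is a SyntaxError.
# list1 = rdd.map(lambda line: (line, 1)).map(lambda (x1, x2): x1 + x2).collect()
# Wrong: this lambda declares two parameters, but map passes a single element.
# list1 = rdd.map(lambda line: (line, 1)).map(lambda x1, x2: x1 + x2).collect()
# One parameter holding a tuple or list; read the values by position.
list2 = rdd.map(lambda line: (line, 1)).map(lambda line: line[0] + line[1]).collect()
# Pass a function; the tuple or list is bound to its single parameter.
list3 = rdd.map(lambda line: (line, 1)).map(add1).collect()
# Wrong: map supplies one input, but add2 declares two parameters.
# list4 = rdd.map(lambda line: (line, 1)).map(add2).collect()
When the RDD is a PairRDD, map accepts either a lambda expression or a named function.
a. Writing a lambda expression:
In Python 2 you can unpack the values by matching on (x1, x2, x3). In any version you can take the element as line and read the values from it by position.
b. Passing a function:
The number of parameters of your function depends on the specific Spark transformation or action. In this example there is exactly one parameter, and the values are read by position from that parameter (a collection).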
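The single-parameter rule can be checked without a cluster. The following is a minimal sketch that uses Python's built-in map as a stand-in for RDD.map (an assumption made so that the snippet runs without Spark); the helper name add_pair is hypothetical and not part of the original answer.

```python
# Stand-in for rdd.map(lambda line: (line, 1)): a list of (value, 1) pairs.
pairs = [(x, 1) for x in [1, 2, 3]]

def add_pair(pair):
    # One parameter: the whole tuple. Unpack it inside the body,
    # which replaces the Python 2 form "lambda (x1, x2): ...".
    x1, x2 = pair
    return x1 + x2

def add2(x1, x2):
    return x1 + x2

by_index = list(map(lambda pair: pair[0] + pair[1], pairs))
by_function = list(map(add_pair, pairs))

# A two-parameter function fails when it is handed a single tuple.
try:
    list(map(add2, pairs))
    two_param_failed = False
except TypeError:
    two_param_failed = True

print(by_index, by_function, two_param_failed)
```

Unpacking inside the function body keeps the readability of the Python 2 tuple-parameter form while remaining valid Python 3.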

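㈡ An introduction to Spark: what Spark is and how to use it

1. What algorithm is Spark's distributed computation based on? (It is a simple one.)

2. How does Spark differ from MapReduce?

3. Why is Spark more flexible than Hadoop?

4. What are Spark's limitations?

5. When is Spark a good fit?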

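Spark compared with Hadoop

Spark keeps intermediate data in memory, which makes iterative computation more efficient.

Spark is better suited to ML (machine learning) and DM (data mining) workloads that iterate many times, because Spark provides the RDD abstraction.

Spark is more general than Hadoop

Spark offers many kinds of dataset operations, unlike Hadoop, which offers only Map and Reduce. Operations such as map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy are called transformations. Spark also provides actions such as count, collect, reduce, lookup, and save.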
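To make the meaning of a few of these operations concrete, here is a minimal sketch in plain Python that mimics flatMap, map, reduceByKey, and count on an in-memory list. It is not the Spark API: the helper reduce_by_key and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict
from functools import reduce

lines = ["spark runs python", "python runs spark jobs"]

# flatMap: one input line yields many output words.
words = [word for line in lines for word in line.split()]

# map: each word becomes a (key, value) pair.
pairs = [(word, 1) for word in words]

def reduce_by_key(func, kv_pairs):
    """Combine the values that share a key using a binary function."""
    grouped = defaultdict(list)
    for key, value in kv_pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# reduceByKey: a transformation, it produces a new keyed dataset.
counts = reduce_by_key(lambda a, b: a + b, pairs)

# count: an action, it returns a plain number rather than a dataset.
total_words = len(words)

print(counts, total_words)
```

In Spark the transformations are evaluated lazily and only an action triggers the computation; this eager sketch shows only what each operation computes.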

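This variety of dataset operations is convenient for users building higher-level applications. Communication between processing nodes is no longer limited to the single data-shuffle pattern that Hadoop uses. Users can name and materialize intermediate results and control how they are stored and partitioned. In that sense the programming model is more flexible than Hadoop's.

However, because of how RDDs work, Spark is not suited to applications that make asynchronous, fine-grained updates to state, such as the storage layer of a web service or an incremental web crawler and indexer. In short, it does not fit application models built on incremental modification.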

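Fault tolerance

Distributed dataset computation achieves fault tolerance through checkpointing, which comes in two forms: checkpointing the data and logging the updates. The user can choose which of the two to use.

Usability

Spark improves usability by providing rich Scala, Java, and Python APIs and an interactive shell.

Spark combined with Hadoop

Spark can read and write HDFS data directly and also supports Spark on YARN. Spark can run in the same cluster as MapReduce and share storage and compute resources with it. The data warehouse Shark borrows its implementation from Hive and is almost fully compatible with Hive.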

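Where Spark fits

Spark is an in-memory iterative computing framework, suited to applications that operate on a particular dataset many times. The more often the data is reused and the larger the volume of data read, the greater the benefit. When the data is small but the computation is intensive, the benefit is relatively small. (In a big-data architecture this is an important factor in deciding whether to use Spark.)

Apart from the incremental-update applications described above, Spark is broadly applicable and fairly general-purpose.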

Run modes

Local mode

Standalone mode

Mesos mode

YARN mode

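The Spark ecosystem

Shark (Hive on Spark): Shark essentially provides the same HiveQL command interface as Hive on top of the Spark framework. To stay as compatible with Hive as possible, Shark uses Hive's API for query parsing and logical plan generation; only the final physical plan execution stage uses Spark instead of Hadoop MapReduce. Through configuration parameters, Shark can automatically cache specific RDDs in memory so that data is reused, which speeds up retrieval of those datasets. Shark also uses UDFs (user-defined functions) to implement specific data analysis and learning algorithms, so that SQL queries and analytical computation can be combined and RDD reuse is maximized.

Spark Streaming: a framework built on Spark for processing stream data. The basic idea is to cut the stream into small time slices (a few seconds each) and process each slice in a batch-like way. Spark Streaming is built on Spark for two reasons. First, Spark's low-latency execution engine (100 ms and up) is usable for real-time computation. Second, compared with record-based frameworks such as Storm, RDD datasets make efficient fault tolerance easier. The micro-batch approach also keeps it compatible with both batch and real-time processing logic and algorithms, which helps applications that need to analyze historical and real-time data together.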
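The time-slicing idea can be sketched in a few lines of plain Python. This is only an illustration of grouping events into fixed-length slices, not the Spark Streaming API; the function name micro_batches, the 2-second slice length, and the sample events are all assumptions.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-length time slices."""
    batches = defaultdict(list)
    for timestamp, value in events:
        slice_index = int(timestamp // batch_seconds)
        batches[slice_index].append(value)
    # Return the non-empty slices in time order; each one would then be
    # processed like a small batch job.
    return [batches[index] for index in sorted(batches)]

events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]
batches = micro_batches(events, batch_seconds=2)
sizes = [len(batch) for batch in batches]

print(batches, sizes)
```

Because each slice is an ordinary small dataset, the same processing function can be applied to a slice of live data and to a historical batch.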

Bagel: Pregel on Spark. It lets you do graph computation with Spark, and it is a very useful small project. Bagel ships with an example that implements Google's PageRank algorithm.
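For readers unfamiliar with PageRank, here is a minimal single-machine sketch of the iteration. It is a plain Python loop, not Bagel's API or its bundled example; the three-page graph, the damping factor 0.85, and the iteration count are assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, iterations=20, damping=0.85):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    ranks = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Each page splits its current rank evenly among its link targets.
        contributions = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                contributions[target] += ranks[page] / len(targets)
        # Mix the received rank with a uniform "random jump" term.
        ranks = {
            page: (1 - damping) / len(pages) + damping * contributions[page]
            for page in pages
        }
    return ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

In a Pregel-style system the inner loops become messages sent along edges in each superstep, which is the part a framework like Bagel distributes.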

End.

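㈢ Machine learning in practice: how to combine Spark with Python

You could work through Lin Dagui's book, which teaches from start to finish how to train and deploy common algorithms with Python + Spark + Hadoop.

《Python+Spark2.0+Hadoop机器学习与大数据实战_林大贵》 (Python + Spark 2.0 + Hadoop: Machine Learning and Big Data in Practice, by Lin Dagui)

Link: https://pan..com/s/1VGUOyr3WnOb_uf3NA_ZdLA

Extraction code: ewzf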

㈣ How to start IPython Notebook with Spark 2.0.2

IPython Configuration
This installation workflow loosely follows the one contributed by Fernando Perez. It should be performed on the machine where the IPython Notebook will be executed, typically one of the Hadoop nodes.
First create an IPython profile for use with PySpark.

ipython profile create pyspark

This should have created the profile directory ~/.ipython/profile_pyspark/. Edit the file ~/.ipython/profile_pyspark/ipython_notebook_config.py to have:

c = get_config()

c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8880 # or whatever you want; be aware of conflicts with CDH

If you want a password prompt as well, first generate a password for the notebook app:

python -c 'from IPython.lib import passwd; print(passwd())' > ~/.ipython/profile_pyspark/nbpasswd.txt

and set the following in the same .../ipython_notebook_config.py file you just edited:

import os
# open() does not expand "~", so expand it explicitly.
PWDFILE = os.path.expanduser('~/.ipython/profile_pyspark/nbpasswd.txt')
c.NotebookApp.password = open(PWDFILE).read().strip()

Finally, create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following contents:

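import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
# The py4j version must match the zip shipped in $SPARK_HOME/python/lib.
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.1-src.zip'))
# execfile() exists only in Python 2; exec(open(...).read()) also works in Python 3.
shell_script = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(open(shell_script).read())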
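The py4j file name differs between Spark releases, so a hard-coded version in the startup file is easy to get wrong. One possible alternative, sketched below, is to locate the zip with a glob pattern. The function names pyspark_paths and add_pyspark_to_path are hypothetical, and the sketch assumes the usual $SPARK_HOME/python/lib layout.

```python
import glob
import os
import sys

def pyspark_paths(spark_home):
    """Return the sys.path entries PySpark needs under spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j version ships with this Spark build.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    if not py4j_zips:
        raise ValueError("no py4j-*-src.zip found under " + python_dir)
    return [python_dir, py4j_zips[-1]]

def add_pyspark_to_path(spark_home):
    """Prepend the PySpark entries to sys.path, skipping duplicates."""
    for entry in reversed(pyspark_paths(spark_home)):
        if entry not in sys.path:
            sys.path.insert(0, entry)
```

In the startup file, the two hard-coded sys.path.insert calls could then be replaced by a single call such as add_pyspark_to_path(spark_home).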

Starting IPython Notebook with PySpark
IPython Notebook should be run on a machine from which PySpark would be run, typically one of the Hadoop nodes.
First, make sure the following environment variables are set:

# for the CDH-installed Spark
export SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark'

# this is where you specify all the options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client --num-executors 24 --executor-memory 10g --executor-cores 5'
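If the notebook server is launched from a Python wrapper rather than a shell, the same option string can be assembled programmatically. The following is a minimal sketch; the helper name build_submit_args is hypothetical, and the option values simply repeat the ones shown above.

```python
import os
import shlex

def build_submit_args(options):
    """Turn {'master': 'yarn', ...} into a '--master yarn ...' string."""
    parts = []
    for name, value in options.items():
        parts.append("--" + name)
        # Quote each value so that spaces or shell characters stay intact.
        parts.append(shlex.quote(str(value)))
    return " ".join(parts)

options = {
    "master": "yarn",
    "deploy-mode": "client",
    "num-executors": 24,
    "executor-memory": "10g",
    "executor-cores": 5,
}
submit_args = build_submit_args(options)

# Must be set before PySpark starts its JVM gateway.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```

Newer Spark releases generally expect this value to end with pyspark-shell when the gateway is launched from Python, so it is worth checking the behaviour of your Spark version before relying on the string as written.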

Note that you must set whatever other environment variables you want to get Spark running the way you desire. For example, the settings above are consistent with running the CDH-installed Spark in YARN-client mode. If you wanted to run your own custom Spark, you could build it, put the JAR on HDFS, and set the SPARK_JAR environment variable, along with any other necessary parameters.
Finally, decide from what directory to run the IPython Notebook. This directory will contain the .ipynb files that represent the different notebooks that can be served. See the IPython docs for more information. From this directory, execute:

ipython notebook --profile=pyspark

Note that if you just want to serve the notebooks without initializing Spark, you can start IPython Notebook using a profile that does not execute the shell.py script in the startup file.
Example Session
At this point, the IPython Notebook server should be running. Point your browser to the notebook server's host and port (8880 in the configuration above), which should open up the main access point to the available notebooks.

This will show the list of possible .ipynb files to serve. If it is empty (because this is the first time you're running it) you can create a new notebook, which will also create a new .ipynb file. The example session in the original post used PySpark to analyze the GDELT event data set.

The full .ipynb file can be obtained as a GitHub gist.

㈤ How to set up and use a Python development environment for Spark

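1) Type welcome = "Hello!" and press Enter.

Then type print(welcome), or just welcome, and press Enter to see the output Hello!

2)

welcome = "hello"
you = "world!"
print(welcome + you)

Output: helloworld!

The examples above use strings. Variables can also hold other types: numbers, strings, lists, dictionaries, and files. These work much as they do in other languages. Lists come first: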

3)

my_list = []  # this creates an empty list; the next line assigns it a value
my_list = [1, 2]
print(my_list)
my_list.append(3)
print(my_list)

4) Dictionaries:

contact = {}
contact["name"] = "shiyuezhong"
contact["phone"] = 12332111

5) Combining lists and dictionaries:

contact_list = []
contact1 = {}
contact1['name'] = 'shiyuezhong'
contact1['phone'] = 12332111
contact_list.append(contact1)
contact2 = {}
contact2['name'] = 'buding'
contact2['phone'] = 88888888
contact_list.append(contact2)

㈥ Spark's four run modes

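Overview
Local mode
Spark runs on a single machine; generally used for development and testing.

Standalone mode
Build a Spark cluster made up of a Master and Slaves; Spark runs inside that cluster.

Spark on YARN mode
The Spark client connects directly to YARN. No separate Spark cluster is needed.

Spark on Mesos mode
The Spark client connects directly to Mesos. No separate Spark cluster is needed.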

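Launching: spark-shell (Scala)
spark-shell selects the mode through its command-line parameters. Two parameters are involved:

For Spark on YARN and Spark on Mesos, the --deploy-mode parameter additionally controls where the driver program is started.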

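Entering local mode:

Entering Standalone mode:

Note: in testing, using a hostname instead of an IP address in MASTER_URL failed to connect, even though hosts contained the matching entry. That is, the following command does not connect:

./spark-shell --master spark://ctrl:7077 # connection fails
Spark on YARN mode:

Note: the YARN connection details are specified in the Hadoop client configuration files. The HADOOP_CONF_DIR environment variable in spark-env.sh points to the Hadoop configuration directory.

Spark on Mesos mode: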

Launching: pyspark (Python)
The parameters and usage are the same as for the Scala spark-shell.
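As a rough illustration of how the --master value differs between the four modes, here is a minimal sketch that builds a pyspark command line. The helper names master_url and pyspark_command are hypothetical. Port 7077 is the standalone port used elsewhere on this page; the Mesos default of 5050 is an assumption, and older Spark releases used yarn-client or yarn-cluster instead of the plain yarn value.

```python
def master_url(mode, host=None, port=None, cores=None):
    """Return the --master value for one of the four run modes."""
    if mode == "local":
        # local[N] runs N worker threads; local[*] uses all available cores.
        return "local[{}]".format(cores if cores is not None else "*")
    if mode == "standalone":
        return "spark://{}:{}".format(host, port or 7077)
    if mode == "mesos":
        return "mesos://{}:{}".format(host, port or 5050)
    if mode == "yarn":
        # The cluster address comes from the Hadoop configuration, not the URL.
        return "yarn"
    raise ValueError("unknown mode: " + mode)

def pyspark_command(mode, **kwargs):
    """Build the argument list for launching pyspark in the given mode."""
    return ["pyspark", "--master", master_url(mode, **kwargs)]

print(pyspark_command("local", cores=2))
print(pyspark_command("standalone", host="192.168.180.216"))
```

The returned list can be passed to subprocess.run, or joined with spaces to produce the command to type in a terminal.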

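㈦ How to run a Python script that uses Spark

1. Submitting, running, and deploying Spark scripts

1.1 spark-shell (interactive mode)
Running spark-shell requires pointing it at the standalone Spark cluster it should request resources from, through the MASTER parameter. You can also set the executor and driver memory sizes.

sudo spark-shell --executor-memory 5g --driver-memory 1g --master spark://192.168.180.216:7077

Once spark-shell has started, you can type Scala commands in the interactive window. spark-shell has already created the sc object by default, so you can read data with, for example:

val user_rdd1 = sc.textFile(inputpath, 10)

1.2 spark-shell (script mode)
The method above requires typing the Scala program into the interactive window one statement at a time. If you save the Scala program in a file named test.scala, you can run all the code in that file at once with the following command:

sudo spark-shell --executor-memory 5g --driver-memory 1g --master spark://192.168.180.216:7077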

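㈧ How to use Spark in IPython or Python

How to use Spark in IPython
Environment:

spark 1.6.0
scala 2.10.5
Spark is installed in /usr/local/spark, and the SPARK_HOME environment variable is already set in .bashrc.
Method 1
/usr/local/spark/bin/pyspark opens plain python by default, not ipython. Adding one line to the pyspark script makes it open ipython instead.
cp pyspark ipyspark
vi ipyspark

# add at the very top

IPYTHON=1

# launch

ipyspark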

Method 2:
Create an IPython profile for Spark.

# create an IPython profile for Spark

ipython profile create spark

# create the startup file

cd ~/.config/ipython/profile_spark/startup
vi 00-pyspark-setup.py

Add the following to 00-pyspark-setup.py:
import os
import sys

# Configure the environment (fallback path; adjust it to your installation)

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/srv/spark'

# Create a variable for our root path

SPARK_HOME = os.environ['SPARK_HOME']

# Add the PySpark/py4j to the Python Path

sys.path.insert(0, os.path.join(SPARK_HOME, "python", "pyspark"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))

Start IPython
ipython --profile spark

Test program
Enter the following in ipython. If the program prints a number when it finishes, the setup is correct.
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')

def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# Create an RDD of numbers from 0 to 1,000,000

nums = sc.parallelize(range(1000000))

# Compute the number of primes in the RDD

print("Result:", nums.filter(isprime).count())
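If the Spark job misbehaves, it can help to confirm that the predicate itself is correct before suspecting the cluster setup. The following sketch repeats the isprime function from the test program and checks it on small ranges in plain Python, with no Spark involved; the variable names are illustrative.

```python
def isprime(n):
    """Check whether integer n is prime (same logic as the test program)."""
    n = abs(int(n))
    if n < 2:
        return False
    if n == 2:
        return True
    if not n & 1:
        return False
    for x in range(3, int(n ** 0.5) + 1, 2):
        if n % x == 0:
            return False
    return True

# Known reference values: 25 primes below 100 and 168 primes below 1000.
primes_below_100 = sum(1 for n in range(100) if isprime(n))
primes_below_1000 = sum(1 for n in range(1000) if isprime(n))

print(primes_below_100, primes_below_1000)
```

For the full range of one million numbers the count of primes is 78498, so that is the value the Spark job should report.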

Method 3
Put the program above in a file named test.py and run python test.py. It fails, because the pyspark path has not been added to the PYTHONPATH environment variable.
Add the following to ~/.bashrc or /etc/profile:

# python can call pyspark directly

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/pyspark:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Run the following commands:

# apply the configuration

source ~/.bashrc

# run the test program

python test.py

The program now runs.

㈨ How to run a Python script that uses Spark

~/spark$ bin/spark-submit first.py
-----------first.py-------------------------------
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)
lines = sc.textFile("first.py")
pythonLines = lines.filter(lambda line: "Python" in line)
print("hello python")
print(pythonLines.first())
print(pythonLines.first())
print("hello spark!")
---------------------------------------------------
hello python
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines = lines.filter(lambda line: "Python" in line)
hello spark!

From the bin directory under the Spark installation directory, run spark-submit ***.py.
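The output of first.py can look surprising: the script reads its own source file, so the first line that contains the string "Python" (with a capital P) is the filter line itself. The plain Python sketch below reproduces that logic without Spark; the list script_lines is an abbreviated stand-in for the file contents and is an assumption for illustration.

```python
# Abbreviated stand-in for the lines of first.py.
script_lines = [
    'from pyspark import SparkConf, SparkContext',
    'lines = sc.textFile("first.py")',
    'pythonLines = lines.filter(lambda line: "Python" in line)',
    'print("hello python")',
]

# Equivalent of lines.filter(lambda line: "Python" in line).
matching = [line for line in script_lines if "Python" in line]

# Equivalent of pythonLines.first(): the first matching line.
first_match = matching[0]

print(first_match)
```

The match is case-sensitive, which is why the import line (lowercase "pyspark") and the "hello python" line are skipped.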

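㈩ Help: an error when running a Python + Spark program

tmprdd1 = csdnRDD.map(lambda x: (x.split("\t")[2]))
x.split("\t") produces a list. Some records are malformed, so the list does not always have three elements, and indexing [2] then raises an exception that aborts the job.
You can run csdnRDD.map(lambda x: x.split("\t")).filter(lambda x: len(x) < 3) to see which records are malformed, and then decide how to filter them out.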
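The same check can be tried on a handful of lines in plain Python before running it on the cluster. This is a minimal sketch: the sample rows are invented for illustration, and list comprehensions stand in for the RDD map and filter calls.

```python
# Invented sample records; two of them have fewer than three tab-separated fields.
rows = ["u1\tmail1\tpw1", "broken line", "u2\tmail2\tpw2", "u3\tmail3"]

split_rows = [row.split("\t") for row in rows]

# Records that would make x.split("\t")[2] raise IndexError.
bad_rows = [fields for fields in split_rows if len(fields) < 3]

# Keep only well-formed records before taking the third field.
third_fields = [fields[2] for fields in split_rows if len(fields) >= 3]

print(bad_rows)
print(third_fields)
```

On the RDD, the equivalent chain would be csdnRDD.map(lambda x: x.split("\t")).filter(lambda x: len(x) >= 3).map(lambda x: x[2]).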
