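Stephen's SEM Blog

Scraping rankings and URLs from Baidu results with Python, quick version: the result of 8 hours in the library

December 12, 2019 | Tags:

Title: Scraping rankings and URLs from Baidu results with Python, quick version: the result of 8 hours in the library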

——————————————————————————————————————————-

Time: 2012/11/18 20:10:59

——————————————————————————————————————————-

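Content:

Preface

Implementation

Python's regular expressions did most of the work.

The detour I took: when matching with .*, add a question mark after it (.*?) to prevent greedy matching.

Greedy matching

Take the string 43432seoseo: with the pattern (.*)seo, the group captures 43432seo by default,

because a greedy quantifier matches as much as it possibly can. Adding the question mark makes it capture as little as possible, so (.*?)seo captures 43432.

A double quote inside the pattern is written with the escape character, a backslash \ followed by the double quote, so that it does not end the Python string. To match any of several alternatives, write

(a|b), which matches either of the two cases.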
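The three regex points above (greedy versus non-greedy, escaped double quotes, and alternation) can be checked with a short sketch. It is written for Python 3, and the sample strings other than 43432seoseo are made up for illustration.

```python
import re

text = "43432seoseo"

# Greedy: .* grabs as much as it can, so the group swallows the first "seo".
greedy = re.match(r"(.*)seo", text).group(1)
# Non-greedy: .*? stops at the first place where "seo" can follow.
lazy = re.match(r"(.*?)seo", text).group(1)
print(greedy)  # 43432seo
print(lazy)    # 43432

# A double quote inside a double-quoted pattern is written as \".
# The sample tag is hypothetical.
rank = re.search(r"id=\"(\d{1,3})\"", '<table id="7">').group(1)
print(rank)  # 7

# Alternation: (a|b) matches either case.
print(re.findall(r"(a|b)", "cab"))  # ['a', 'b']
```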

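The basic approach:

First, fetch the page with the familiar urllib family of libraries (the script uses urllib2).

Then write the regular expression with re.

Then use findall to collect every match into a list.

Then walk the list with a for loop,

inserting each row into the database as you go.

The regular expression is the time-consuming part.

After that it is just basic if/else checks.

With MySQL, remember to call commit so the inserts are actually committed.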
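The steps above can be sketched end to end without a network connection or a MySQL server. This is only an illustration under stated assumptions: the HTML snippet is a made-up stand-in for a results page (real Baidu markup differs and changes over time), the keyword is hypothetical, and sqlite3 from the standard library is swapped in for MySQLdb. The sketch is in Python 3.

```python
import re
import sqlite3

# Hypothetical stand-in for a fetched results page, not real Baidu markup.
html = (
    '<table id="1"><span class="g"> example.com/page 2012-11-18 </span></table>\n'
    '<table id="2" mu="http://baike.baidu.com/view/1.htm"></table>\n'
)

# Same shape as the pattern in the post: position, then either a mu="..."
# attribute or the text of <span class="g">.
pattern = re.compile(
    r'id="(\d{1,3})".*?(mu="http://([^"]*?)"|<span class="g">\s*([^<]*)</span>)'
)

# sqlite3 stands in for MySQLdb here; it uses ? placeholders where MySQLdb uses %s.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serp (url TEXT, kw TEXT, pos INTEGER)")

keyword = "demo"  # hypothetical keyword
for pos, _whole, mu_url, span_text in pattern.findall(html):
    # Groups that did not take part in the match come back as empty strings.
    url = mu_url or span_text.strip()
    conn.execute(
        "INSERT INTO serp (url, kw, pos) VALUES (?, ?, ?)",
        (url, keyword, int(pos)),
    )
conn.commit()  # without this, the inserts are not made permanent

rows = conn.execute("SELECT url, pos FROM serp ORDER BY pos").fetchall()
print(rows)
```

Passing the values as parameters instead of formatting them into the SQL string avoids broken statements when a scraped value contains a quote character.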

Things still to improve

Support for Chinese characters

Separating out the snapshot date

Fetching the real landing page

Scheduled execution

import re, time, urllib2, MySQLdb

t = time.time()  # every timing below is measured from this point
key = "%E8%B7%B3%E8%88%9E%E6%9C%BA"  # URL-encoded search keyword
html = urllib2.urlopen("http://www.baidu.com/s?wd=%s&rn=100" % key).read()
t0 = time.time() - t
# group 1: result position; group 2: whichever alternative matched;
# group 3: address from a mu="http://..." attribute; group 4: text of <span class="g">
p3 = re.compile(r"id=\"(\d{1,3})\".*?(mu=\"http://([^\"]*?)\"|<span class=\"g\">\s*([^<]*)</span>)")
t1 = time.time() - t

conn = MySQLdb.connect(user='root', passwd='111111', db='schools')
cursor = conn.cursor()

m = p3.findall(html)
t2 = time.time() - t
# parameterized insert, so quotes inside a scraped value cannot break the statement
sql = "insert into serp(url, kw, pos) values (%s, %s, %s)"
for i in m:
    k = int(i[0])
    if i[1].startswith("mu=\"http://www.baidu.com"):  # a mu address starting with www is Baidu Zhidao
        print "zhidao.baidu.com"
        cursor.execute(sql, ("zhidao.baidu.com", key, k))
    elif i[1].startswith("mu=\"http"):  # any other mu address is a Baidu product
        print i[2]
        cursor.execute(sql, (i[2], key, k))
    else:
        print i[3]  # an ordinary result
        cursor.execute(sql, (i[3], key, k))

t3 = time.time() - t
print "serp open spend %s\ncompile pattern spend %s\nfind all results spend %s\ntotal spend %s" % (t0, t1, t2, t3)
conn.commit()
cursor.close()
conn.close()
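Chinese character support is on the list of things to improve, and the script hard-codes an already encoded keyword. One possible approach, sketched here in Python 3 (where the quoting helper lives in urllib.parse), is to percent-encode the keyword at run time. The keyword 跳舞机 is what the hard-coded value decodes to; the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

keyword = "跳舞机"

# quote() encodes the text as UTF-8 and percent-escapes each byte.
encoded = quote(keyword)
print(encoded)           # %E8%B7%B3%E8%88%9E%E6%9C%BA
print(unquote(encoded))  # 跳舞机

# The encoded keyword drops into the query string used by the script.
search_url = "http://www.baidu.com/s?wd=%s&rn=100" % encoded
print(search_url)
```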


Ultra-minimal Python code to look up a Baidu keyword ranking, in 9 lines

December 11, 2019 | Tags:

———————————————————————————————————————–

Content:

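# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re, urllib2, time

key = "杀价"        # search keyword
url = "elong.com"  # domain whose ranking we want
key = urllib2.quote(key)
t = time.time()
html = urllib2.urlopen("http://www.baidu.com/s?word=%s" % key).read()
soup = BeautifulSoup(html)
# the <span> whose text contains the domain
cache = soup.find("span", text=re.compile("%s" % url))
print cache.find_previous("table").get("id")  # id of the nearest preceding table, i.e. the position
print cache.find_previous("a").get("href")    # link of that result
print urllib2.urlopen(cache.find_previous("a").get("href")).geturl()  # URL after redirects
print cache.get_text().split(" ")[3]  # fourth space-separated field of the span text
print time.time() - t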

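The trick: use find to locate the span tag whose text contains the URL, then use find_previous to find the nearest preceding table, and get to return its id value.

To find the URL after a redirect in Python, use urllib2.urlopen(url).geturl(): urlopen opens the URL and geturl returns the URL that was finally reached.

Variable substitution uses %s inside the string, followed immediately outside the quotes by % and the variable name.

To read the text, use get_text(); the result can then be split with split.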
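The tips above can be tried offline on a small document. The sketch below is written for Python 3 with the bs4 package installed. The HTML is a made-up miniature of a results page (real Baidu markup differs), the URLs are placeholders, and `string=` is the newer spelling of the `text=` argument used in the post.

```python
import re
from urllib.parse import quote

from bs4 import BeautifulSoup

# Hypothetical miniature of a results page, for illustration only.
html = """
<table id="1"><tr><td>
  <a href="http://example.org/redirect?1">First result</a>
  <span class="g">www.example.org/ 2012-11-17</span>
</td></tr></table>
<table id="2"><tr><td>
  <a href="http://example.org/redirect?2">Second result</a>
  <span class="g">www.elong.com/ 2012-11-18</span>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
target = "elong.com"

# find: the first <span> whose text contains the target domain.
cache = soup.find("span", string=re.compile(re.escape(target)))

# find_previous walks backwards through the document from that span.
rank = cache.find_previous("table").get("id")
link = cache.find_previous("a").get("href")

# get_text() returns the text, which split() then breaks on spaces.
date = cache.get_text().split(" ")[1]
print(rank, link, date)

# %s substitution: the value follows the string after a % sign.
search_url = "http://www.baidu.com/s?word=%s" % quote("杀价")
print(search_url)
```

If the domain does not appear on the page, find returns None, so a real script would check `cache` before calling methods on it.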


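Scraping company names, job openings, and URLs from 51job with Python, quick version

December 6, 2019 | Tags:

First install bs4 from the command line. Install the pip tool:

sudo easy_install pip

Then use pip to install bs4:

sudo pip install bs4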

# -*- coding: utf8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 workaround so Chinese text prints without errors

from bs4 import BeautifulSoup
import re, time, urllib2

html = urllib2.urlopen("http://www.51job.com/shanghai", timeout=5).read()
soup = BeautifulSoup(html)
div = soup.find("div", id="dataidea_1")
# every link with a title attribute inside that block
for links in div.find_all("a", title=True):
    print links.get("title")
    print links.get("href")
    # open the linked page and parse it
    html1 = urllib2.urlopen(links.get("href"), timeout=5).read()
    soup1 = BeautifulSoup(html1)
    div1 = soup1.find("div", class_="redline")
    if div1 is not None:
        for link1 in div1.find_all("a", href=True):
            print link1.get_text()
    p = soup1.find("p", "txt_font1")
    if p is not None:
        # print the paragraph only if "tp" occurs past index 1 (as in "http")
        if p.get_text().find("tp") > 1:
            print p.get_text()
    print "\n"
    print "\n"
