Last Update: 20.12.2006. By kerim in python
Now, for those who are interested in HOW I did it: it’s actually pretty simple.
First of all, pick a site that you want to download content from. I looked around, but since, as said earlier, I didn’t find many “normal” ones, I chose youporn.com.
1) Plenty of content
2) Rather simple way of getting at it.
Here is how to do it:
1) We start at the base URL www.youporn.com and simply read the HTML:
    import urllib

    def fetchpage(self, url):
        pagefile = urllib.urlopen(url)  # enter proxy here if wished
        return pagefile.read()
2) When we have that HTML, we can determine how many pages exist on youporn that we might check:
    import re

    def findLastPage(self, html):
        linkPattern = "page\=[0-9]*"
        p = re.compile(linkPattern)
        m = p.findall(html)
        m.pop()  # last item = 'next' link
        lastPage = m.pop()
        lastPage = lastPage[lastPage.rfind('=') + 1:]  # get rid of the 'page='
        return lastPage
As you can see, it is the second-to-last link that reads something with “page=” (a number).
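To see why this works, here is a minimal, self-contained demonstration of the same regex logic on a hypothetical paging snippet (the HTML and the page number are made up for illustration; the real page layout may differ):

```python
import re

# Hypothetical paging HTML -- invented for this demonstration.
html = ('<a href="?page=1">1</a> <a href="?page=2">2</a> '
        '<a href="?page=347">347</a> <a href="?page=2">next</a>')

m = re.findall(r"page\=[0-9]*", html)
m.pop()                                        # drop the trailing 'next' link
lastPage = m.pop()                             # second-to-last match: 'page=347'
lastPage = lastPage[lastPage.rfind('=') + 1:]  # strip the 'page='
print(lastPage)                                # -> 347
```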
3) Ok, so we have the last page; the first is naturally … 1 :-)
Now all we have to do is fetch all the pages and try to determine all the file links:
    for page in range(min, max + 1):  # include the last page
        print "fetching: " + baseURL + '/?page=' + str(page)
        html = self.fetchpage(baseURL + '/?page=' + str(page))
        self.searchlinks(baseURL, html)
    print "pages to check: " + str(len(self.allPages))
But what does “searchlinks” really do?
Well, basically, on youporn all files are located at pages that follow the scheme www.youporn.com/watch/<aNumberHere>
So we go through all those pages, again using regular expressions:
    def searchlinks(self, baseURL, html):
        linkPattern = "/watch/[0-9]*"
        p = re.compile(linkPattern)
        m = p.findall(html)
        for page in set(m):
            page = baseURL + page
            self.allPages.append(page)
4) Ok … but we see that this list is actually only another list of pages. Yes, youporn doesn’t directly store the files at …/watch/<number>, but rather serves another page with one (FLV file) or two (FLV and WMV file) links to the REAL files.
So we need another loop that goes through all the “/watch/<number>” pages:
    for page in set(self.allPages):
        print "looking for files in: " + page
        html = self.fetchpage(page)
        self.searchfiles(html)
    print "Files to download: " + str(self.allFiles)
5) And for all those pages we call searchfiles, which uses a regular expression to find the files (with a preference for the file type):
    def searchfiles(self, html):
        flvPattern = "http://download.youporn.com/download/[0-9]*/flv/[\w_]*\.flv"
        wmvPattern = "http://download.youporn.com/download/[0-9]*/pc/[\w_]*\.wmv"
        p = re.compile(flvPattern)
        p2 = re.compile(wmvPattern)
        m = None
        if self.preference == "wmv":
            m = p2.search(html)
        if m is None or self.preference != "wmv":
            m = p.search(html)  # fall back to (or prefer) the FLV link
        if m is None:
            return
        url = m.group()
        filename = url[url.rfind("/") + 1:]
        self.allFiles[url] = filename
URL and filename are stored in a dictionary.
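In other words, after searchfiles has run, self.allFiles maps each download URL to its bare filename. A tiny sketch of that mapping step, with a made-up URL (the ID and clip name are invented):

```python
# Hypothetical URL following the FLV pattern above -- invented for illustration.
allFiles = {}
url = "http://download.youporn.com/download/12345/flv/some_clip.flv"
filename = url[url.rfind("/") + 1:]  # everything after the last '/'
allFiles[url] = filename
print(allFiles[url])                 # -> some_clip.flv
```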
6) All that is left to do now is download the files:
    print "Using %d threads to download" % self.numThreads
    mtfetcher.download(self.allFiles, self.numThreads)
And “mtfetcher”? What is that? That’s a multithreaded downloader. I modified a version from Marcio Boufleur called ppd.py! It can be found at www-usr-inf.ufsm.br/~boufleur/ppd
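For a rough idea of what such a multithreaded fetcher does, here is a minimal sketch using the standard threading and queue modules (modern Python). This is not the real mtfetcher/ppd.py API: the fetch function is injected as a parameter here, purely so the thread-pool logic can be shown independently of the network.

```python
import threading
import queue

def download(files, numThreads, fetch):
    """Drain a {url: filename} dict with a pool of worker threads.
    `fetch(url, filename)` performs one transfer."""
    q = queue.Queue()
    for item in files.items():
        q.put(item)

    def worker():
        while True:
            try:
                url, filename = q.get_nowait()
            except queue.Empty:
                return  # queue drained, thread exits
            fetch(url, filename)

    threads = [threading.Thread(target=worker) for _ in range(numThreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every download has finished
```

In the real script, fetch would wrap an actual HTTP download (e.g. via urllib); passing it in keeps the pool reusable and testable offline.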
Other sites can be downloaded by extending the youpornloader: override the start method and the other methods that fetch pages, search for files or links, etc.
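A hypothetical skeleton of such an extension (the base class below is a minimal stand-in carrying the method names used in this post, not the full loader, and the /video/<id> scheme of the "other site" is invented):

```python
import re

class YoupornLoader(object):
    # Minimal stand-in for the loader described above.
    def __init__(self):
        self.allPages = []
        self.allFiles = {}

    def searchlinks(self, baseURL, html):
        # video pages live at /watch/<number>
        for page in set(re.findall(r"/watch/[0-9]*", html)):
            self.allPages.append(baseURL + page)

class OtherSiteLoader(YoupornLoader):
    # Override only what differs; assume a hypothetical /video/<id> scheme.
    def searchlinks(self, baseURL, html):
        for page in set(re.findall(r"/video/[0-9]*", html)):
            self.allPages.append(baseURL + page)
```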
PornoPython is more or less only a “loader” for all the other classes.