Well… I will “lose attention” by having results hosted externally, but I think the value is still there to do this.
After I finish the geo/language filters, search result caching and pagination code, I will post a script for creators to host results on their sites.
I am thinking about two options: iframed results (easy implementation/configuration) and API JSON results for those who wish to parse/display them in their own way.
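For the second option, the consuming side could be as simple as a small fetch-and-parse helper. This is just a sketch of what a creator's site might do; the endpoint URL and response fields here are placeholders, not the real API:

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint; the real API URL isn't published yet.
ENDPOINT = "https://example.com/api/search"

def build_search_url(query, endpoint=ENDPOINT):
    """Build the query URL for the hypothetical JSON results API."""
    return endpoint + "?" + urllib.parse.urlencode({"q": query})

def fetch_results(query):
    """Fetch search results as JSON for custom rendering."""
    with urllib.request.urlopen(build_search_url(query), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

# A site owner could then render each result however they like, e.g.:
# for r in fetch_results("privacy browsers"):
#     print(r["title"], r["url"])
```

The iframe option would skip all of this: the host page just embeds a results URL and the styling/pagination is handled on the serving side.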
@Mattches, another question… When pulling the API data, there are many sites with no ‘channel_id’ set. A user mentioned that “theguardian.com” was not listed, and that is why it got filtered out.
I updated my scripts and now have 22,000 additional sites to check.
I am also capturing HTTP responses such as 404 Not Found, 502 Bad Gateway, etc., and will make them available via reporting tools in the near future.
In addition, I will be checking HTTPS cert validity and making that available as well.
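The cert-validity check can be done with nothing but the standard library. A minimal sketch, assuming Python (the actual scripts' language isn't shown here); `days_remaining` and `cert_days_remaining` are illustrative names, not from the real code:

```python
import socket
import ssl
from datetime import datetime, timezone

def days_remaining(not_after, now=None):
    """Days until a cert's notAfter timestamp (OpenSSL text format),
    negative if already expired."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def cert_days_remaining(hostname, port=443, timeout=10):
    """Connect over TLS and report days until the cert expires.
    Raises ssl.SSLCertVerificationError for invalid certs, since
    create_default_context() verifies the chain and hostname."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return days_remaining(cert["notAfter"])
```

One nice property of this approach: an expired or mis-issued cert fails the handshake itself, so "invalid" and "expiring soon" can be logged as separate signals.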
Yeah, getting there. Today I am rewriting part of one of my core scripts to greatly reduce indexing time.
1 script to process api file 1x/day
1 script to fetch html of homepages 12x/minute
1 script to parse data into db 12x/minute
1 script to fetch Alexa Ranks 3x/minute (rate limits)
This will index 34,560 sites every 2 days or so.
Things I parse:
Title, meta tags, links, urls, imgs, plaintext
Logging HTTP response codes (404 Not Found, 200 OK, etc.)
Validating HTTPS certs
…
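The parsing step above can be sketched with just the standard library. This assumes Python and `html.parser`; the real scripts may use a different language or parser entirely, and `PageParser` is a hypothetical name:

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collect title, meta tags, link/img URLs, and plain text
    from a fetched homepage."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.metas = []   # (name, content) pairs from <meta> tags
        self.links = []   # href values from <a> tags
        self.imgs = []    # src values from <img> tags
        self.text = []    # stripped plain-text fragments
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            self.metas.append((attrs.get("name"), attrs.get("content")))
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.imgs.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())

# Feed it a fetched homepage; the collected fields would then go to the db.
parser = PageParser()
parser.feed('<html><head><title>Example</title>'
            '<meta name="description" content="demo"></head>'
            '<body><a href="/a">link</a><img src="/i.png"></body></html>')
```

After `feed()`, `parser.title`, `parser.metas`, `parser.links`, `parser.imgs`, and `parser.text` hold the parsed fields, ready for the db-insert script.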
Any other metrics you want to see added?
I like the list of sites you have going there, that’s an awesome idea. Can see how that’s going to help find good sites!
Hey, have you been making progress on YouTube searches at all, or is that something further down the road?