Building a dockerized application that scrapes Indeed.com for in-demand software development skills.
The tech industry moves fast. It is unrealistic to learn the ins and outs of every programming language and technology. As a result, software engineers must decide which skills to focus on.
To decide, I have often turned to articles compiled by awesome people who took the time to do the research.
Every year, there is a new crop of these articles. They summarize and analyze trends in programming languages, containers vs. virtual machines, databases, web development vs. desktop…the list goes on and on.
I thought it would be worthwhile to automate this process to some extent. To that end, I developed an application that scrapes Indeed.com for data about demand for technology skills. It is dockerized. If you wish to run it locally, visit the GitHub link at the end of this page for instructions.
First we specify a “task”. A task consists of:
A) Search term (e.g. “software engineer”)
B) A list of skills to look for in the job posts
C) An optional list of aliases for each of those skills; an alias is another name for a skill. For example, you might specify the skill “SQL” and the aliases “mysql” and “postgresql”. If the scraper sees the name of the skill or any of its aliases in a job post, it will flag the job post as containing the skill. For example, a job post with “mysql” in it will be flagged as containing “SQL”.
D) An optional list of US cities to restrict the search to.
Create a task
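To make the shape of a task concrete, here is a minimal sketch of how one could be represented. The `Task` class, its field names, and the `terms_for` helper are hypothetical names chosen for illustration; they are not taken from the application's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical container for one scraping task (names are illustrative)."""
    search_term: str
    skills: List[str]
    # Maps a skill name to its other names, e.g. "SQL" -> ["mysql", "postgresql"].
    aliases: Dict[str, List[str]] = field(default_factory=dict)
    # An empty list means "no city restriction".
    cities: List[str] = field(default_factory=list)

    def terms_for(self, skill: str) -> List[str]:
        """Return the skill name plus its aliases, lowercased for matching."""
        return [t.lower() for t in [skill, *self.aliases.get(skill, [])]]


task = Task(
    search_term="software engineer",
    skills=["SQL", "Python"],
    aliases={"SQL": ["mysql", "postgresql"]},
)
print(task.terms_for("SQL"))  # ['sql', 'mysql', 'postgresql']
```

Keeping aliases in a mapping keyed by skill means a job post that mentions “mysql” can be traced back to the single skill “SQL”.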
Upon submitting the task, the application begins a continuous process of:
A) Querying Indeed.com with the specified search term. This returns a list of job posts. If provided a list of cities to focus on, the application will limit the search to those cities; otherwise, it will scrape randomly across the country.
B) Counting how often the specified skills (as indicated by their names or by their aliases) appear in the scraped job posts.
C) Counting how often pairs of skills appear in the job posts.
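Steps B and C can be sketched roughly as follows. This is an assumption about how such counting might work, not the application's actual code: the function names, the regex-based matching, and the sample posts are all made up for the example.

```python
import re
from collections import Counter
from itertools import combinations


def skills_in_post(text, skill_terms):
    """Return the set of skills whose name or alias appears in the post text."""
    lowered = text.lower()
    found = set()
    for skill, terms in skill_terms.items():
        for term in terms:
            # The lookarounds stop a short term such as "r" from matching
            # inside "docker", while still allowing terms like "c++".
            pattern = r"(?<![a-z0-9])" + re.escape(term.lower()) + r"(?![a-z0-9])"
            if re.search(pattern, lowered):
                found.add(skill)
                break
    return found


def count_skills(posts, skill_terms):
    """Count posts per skill and posts per unordered pair of skills."""
    skill_counts, pair_counts = Counter(), Counter()
    for text in posts:
        found = skills_in_post(text, skill_terms)
        skill_counts.update(found)
        # Sorting gives each pair a single canonical key.
        pair_counts.update(combinations(sorted(found), 2))
    return skill_counts, pair_counts


# Made-up posts, for illustration only.
posts = [
    "We need MySQL and Python experience.",
    "Python developer wanted",
    "PostgreSQL DBA",
]
skill_terms = {"SQL": ["sql", "mysql", "postgresql"], "Python": ["python"]}
skill_counts, pair_counts = count_skills(posts, skill_terms)
print(skill_counts["SQL"], skill_counts["Python"], pair_counts[("Python", "SQL")])
```

Each post contributes at most one count per skill, no matter how many of that skill's aliases it mentions.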
The scraper will usually collect ~500 posts in 10 minutes or so (I slowed it down a bit so as to not spam Indeed.com). The results are organized into charts that indicate:
A) Total demand
# of jobs listing each skill out of ~1000 scraped job listings
B) How demand has changed on a weekly basis. Note that for this to be of significant value, the scraper needs to be run periodically. In the example shown here, all data was scraped on a single day. As a result, there are more job postings in recent weeks than in weeks past. This creates the illusion that there has been a fantastic rise in demand over the past month. If the scraper is run continuously or every few weeks, it will identify how demand has changed over time.
Demand over time
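The weekly view amounts to grouping job posts by the week in which they were posted. A minimal sketch, assuming each scraped post carries a posting date and the set of skills found in it (the function names, input shape, and dates below are invented for the example):

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def weekly_demand(posts):
    """Map each week's Monday to a Counter of skills listed that week.

    `posts` is an iterable of (posted_on, skills) pairs; this shape is an
    assumption made for the example.
    """
    weeks = defaultdict(Counter)
    for posted_on, skills in posts:
        weeks[week_start(posted_on)].update(set(skills))
    return dict(weeks)


# Made-up posts, for illustration only.
posts = [
    (date(2018, 3, 5), {"Python", "Docker"}),  # a Monday
    (date(2018, 3, 7), {"Python"}),            # same week
    (date(2018, 3, 12), {"Docker"}),           # the following Monday
]
demand = weekly_demand(posts)
print(demand[date(2018, 3, 5)]["Python"])  # 2
```

Because the buckets are keyed by posting date rather than scrape date, a single scraping session fills recent weeks far more densely than older ones, which is the bias described above.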
C) How often skills are requested together. Each percentage answers the question “of all job posts that request the row-skill, what fraction also request the column-skill?” For example, 55% of job posts that list Python also list Java.
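The row/column percentages are conditional fractions: the number of posts listing both skills divided by the number of posts listing the row-skill. Here is a small sketch under that reading. The function name is hypothetical, and the counts are invented purely so that the Python row reproduces the 55% figure; they are not the scraped data.

```python
def cooccurrence_fractions(skill_counts, pair_counts):
    """Return {row_skill: {col_skill: fraction of row posts that also list col}}."""
    table = {}
    for row, row_total in skill_counts.items():
        table[row] = {}
        for col in skill_counts:
            if col == row or row_total == 0:
                continue
            # Pairs are stored under a sorted key, so look them up the same way.
            together = pair_counts.get(tuple(sorted((row, col))), 0)
            table[row][col] = together / row_total
    return table


# Invented counts, for illustration only.
skill_counts = {"Python": 200, "Java": 400}
pair_counts = {("Java", "Python"): 110}
table = cooccurrence_fractions(skill_counts, pair_counts)
print(table["Python"]["Java"])  # 0.55
print(table["Java"]["Python"])  # 0.275
```

The table is not symmetric: the same 110 shared posts are a larger share of the smaller row total, so reading a row and reading a column answer different questions.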
D) Geographically, where demand for specific skills is concentrated
Out of a random sample of ~1000 job posts collected
This application provides useful and interesting insights. For example, the above demo suggests:
B) Python and R are often described as the languages of machine learning and data science. That reputation is deserved, since many of the key data science tools come from these languages (Pandas, SciPy, NumPy, deep learning frameworks, etc.). Even so, Java and Machine Learning are listed together as frequently as Python and Machine Learning are.
D) It appears that about 40% of job posts listing Spark also list Scala. Also, about 40% of job posts listing Scala also list Spark. I had thought that Spark and Scala would appear together more often (https://databricks.com/session/just-enough-scala-for-spark).
This demand estimation process is not perfect. For example, the application does not segment skill demand based on job characteristics; e.g. it does not identify which skills are associated with more senior and/or higher-paying jobs. It also doesn’t identify more complex relationships between skills; it only identifies how often skills are listed together. Nor does it identify whether a job post lists a skill as required or as a “nice-to-have”.
I thought about putting this in the cloud for people to run their own tasks. However, I do not think it would be responsible or fair to Indeed.com, especially considering that the app uses Tor for requests. So instead I will post other Medium articles with results from this tool. For example, I wanted to compare the popularity of container technologies like Docker with that of virtual machine technologies like Vagrant and VirtualBox. I found very little stated demand for VM technologies in general and a huge increase in demand for Docker.
If there are other specific queries anyone would like to see, please feel free to shoot me a message or comment, and I’ll run them for you. If you want to run the application yourself, here is a link to the repository. It is dockerized and all required images are on Docker Hub. As a result, installation only requires downloading a docker-compose.yml file and executing “docker-compose up”. See the repository for more detail.