Job-attracting factors for developers in small and mid-size versus large enterprises, and salary estimation with supervised machine learning

Motivation

Imagine you have two job offers with the same compensation, benefits, and location. Which factors would you consider when selecting your preferred job? This is the exact question that developers were asked in the 2020 Stack Overflow Developer Survey. Let’s find out what they had to say.

But first, a recap: the 2020 Stack Overflow Developer Survey contains responses from 64,461 developers drawn from different organisations across the world. All in all, there are 61 variables, both numeric and categorical. The source of the dataset can be found here.

As the starting point for the analysis, I used the job-attracting factors. There are 11 job factors that respondents were asked to select from. These are: “Remote work options”, “Office environment or company culture”, “Financial performance or funding status of the company or organization”, “Opportunities for professional development”, “Diversity of the company or organization”, “How widely used or impactful my work output would be”, “Industry that I’d be working in”, “Specific department or team I’d be working on”, “Flex time or a flexible schedule”, “Languages, frameworks, and other technologies I’d be working with”, and “Family friendliness”.

I grouped the dataset into small and mid-size enterprises (1 to <500 employees) and large enterprises (500 or more employees). 28,634 developers (44.42%) work in small and mid-size enterprises, 15,700 developers (24.36%) work in large enterprises, and 20,127 respondents (31.22%) did not indicate which type of enterprise they worked in. For the analysis, though, I used all three datasets.
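As a quick check of the grouping and the shares quoted above, here is a minimal sketch. The function name `enterprise_type` and its head-count argument are hypothetical; only the 500-employee threshold and the three respondent counts come from the text.

```python
from typing import Optional


def enterprise_type(min_employees: Optional[int]) -> str:
    """Map a head count to the grouping used in this article (hypothetical helper)."""
    if min_employees is None:
        return "not indicated"
    return "small and mid-size" if min_employees < 500 else "large"


# Respondent counts as reported in the text.
counts = {"small and mid-size": 28_634, "large": 15_700, "not indicated": 20_127}
total = sum(counts.values())

# Share of each group in percent, rounded to two decimals.
shares = {group: round(100 * n / total, 2) for group, n in counts.items()}

print(total)
print(shares)
```

The three counts add up to the 64,461 respondents of the survey, and the rounded shares match the percentages given above.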

To begin, I analysed the three datasets separately. From this analysis, the most important job factor is “Languages, frameworks, and other technologies I’d be working with”. I used this result to construct research question 2, where I explore the tech environment by looking at the five tools that respondents most desired to work with in 2021, on the reasoning that these preferences might be early indicators of future trends.

I selected six categories of tech tools to analyse. These are:

  1. Programming, Scripting, and Markup Languages
  2. Database Environments
  3. Web Frameworks
  4. Platforms
  5. New Collaboration Tools
  6. Other Frameworks, Libraries, and Tools

My approach is first to calculate percentages for each organisational type. Then I calculate the percentage differences for the various job factors between the two organisational types. Displaying the differences in a table, I observe that the factor “Financial performance or funding status of the company or organization” sits in the centre of the eleven factors. The graph below shows what I mean.

Percentage differences between job factors for small and mid-size as well as large enterprises

Although this factor leans slightly towards large enterprises, I use it as orientation for constructing research questions 3 and 4, in which I estimate developers’ salary using supervised machine learning.

Due to limited space, some of the graphical representations are not presented. Nevertheless, I used some information from the omitted graphs in the analysis. For more details, visit my GitHub site.

Let us now look at the first question we are seeking an answer to.

  1. What are the most important job factors for developers when deciding between job offers with the same compensation, benefits, and location? Which 3 factors do developers in both enterprise types consider to be most important?

The graph shown earlier displays the percentage differences between job factors for developers in the two enterprise types. Developers in small and mid-size enterprises have “Remote work options”, “Languages, frameworks, and other technologies I’d be working with” and “Flex time or a flexible schedule” as their most important factors (factor values above the zero line). Respondents in large organisations, though, consider “Specific department or team I’d be working on”, “How widely used or impactful my work output would be” and “Opportunities for professional development” most important (factor values below the zero line).
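The percentage-difference calculation behind this comparison can be sketched as follows. This is an illustration under assumptions, not the original analysis code: I assume each respondent's answer is a semicolon-separated string of selected factors, that percentages are taken over respondents, and that the difference is computed as small and mid-size minus large, so that positive values sit above the zero line. The function names and the toy answers are hypothetical.

```python
from collections import Counter


def factor_percentages(responses):
    """Percentage of respondents who selected each factor."""
    counts = Counter()
    for answer in responses:
        # A respondent may select several factors; count each one once.
        counts.update(set(answer.split(";")))
    n = len(responses)
    return {factor: 100 * c / n for factor, c in counts.items()}


def percentage_differences(sme_responses, large_responses):
    """Difference in percentage points: small and mid-size minus large."""
    sme = factor_percentages(sme_responses)
    large = factor_percentages(large_responses)
    factors = set(sme) | set(large)
    return {f: sme.get(f, 0.0) - large.get(f, 0.0) for f in factors}


# Toy answers, made up for illustration only.
sme_answers = [
    "Remote work options;Flex time or a flexible schedule",
    "Remote work options",
    "Family friendliness",
    "Remote work options;Family friendliness",
]
large_answers = [
    "Opportunities for professional development;Remote work options",
    "Opportunities for professional development",
]

diffs = percentage_differences(sme_answers, large_answers)
for factor, diff in sorted(diffs.items(), key=lambda item: item[1], reverse=True):
    print(f"{diff:+7.2f}  {factor}")
```

With this sign convention, a factor chosen more often in small and mid-size enterprises gets a positive value and a factor chosen more often in large enterprises gets a negative one.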

Note that this result has not been tested for significance. But imagine the above outcome is correct: which three factors would you consider most important?

We now move to question 2.

2. Which are the 5 most desired tech tools for developers in 2021? Can we observe any difference in the preference pattern for these tools between the two enterprise forms?

  1. Programming, Scripting, and Markup Languages

To start with this question, consider the graph below. It shows the percentage of developers using each tool in the category programming, scripting, and markup languages in 2020. In this graph, Java (77.49%), C (76.98%), JavaScript (64%), HTML/CSS (55.70%) and SQL (55.10%) are the languages predominantly used by developers.

Percentages of developers in large enterprises for the year 2020

Now let us determine the five most preferred programming, scripting, and markup languages for 2021. They are C (11.16%), Java (10.59%), Python (9.41%), JavaScript (8.38%) and HTML/CSS (6.39%).

Percentage preference (14,047 respondents, large enterprises)

Although Java was predominantly used in 2020, C (11.16%) is more preferred for 2021. In small and mid-size enterprises, Python was favoured over HTML/CSS. The least preferred are Assembly, Objective-C, Julia, Perl and VBA. Surprisingly, VBA seems to be the least preferred of all. This trend is confirmed in all three datasets.

2. Database Environment

For 2021, the most preferred databases are PostgreSQL (14.50%), MongoDB (13.14%), MySQL (12.90%), Redis (9.82%) and SQLite (8.95%). Although MySQL was the most used in 2020, PostgreSQL and MongoDB appear to be more preferred by respondents for 2021.

Percentage preference: Databases (Small and Mid-Size Enterprises (22,966 respondents))

The least preferred are DynamoDB, Cassandra, Oracle and IBM DB2.

3. New Collaboration Tools

For 2021, the most preferred tools are GitHub (23.73%), Slack (15.48%), GitLab (11.57%), Google Suite (Docs, Meet, etc) (11.17%) and Jira (10.91%). The graph below shows that GitLab (11.57%) is preferred over Google Suite (11.17%) and Jira (10.91%).

Preference percentage: New Collaboration Tools (Small and Mid-Size Enterprises (22,966 respondents))

The 2020 graph also indicated that GitHub was predominantly used. For 2021, with 23.73%, it remains the favourite among developers. A similar trend is observed in large enterprises. The least preferred are Trello, Stack Overflow for Teams and Facebook Workplace.

4. Platforms

Linux (14.67%), Docker (13.97%), AWS (10.62%), Kubernetes (9.25%), and Windows (9.12%) are the most desired for 2021.

Percentage of preference: Platforms (whole dataset (50,604 respondents))

The least desired are Arduino, Heroku, Slack Apps and Integrations, WordPress and IBM Cloud or Watson.

5. Web Framework

React.js (16.47%), Vue.js (10.78%), Angular (10.75%), ASP.NET (8.34%) and ASP.NET Core (7.40%) are the most desired for 2021. Even though jQuery was predominantly used in 2020, it is less preferred by developers for 2021.

Percentage of preference: Web framework (whole dataset (50,604 respondents))

jQuery did not make the top five preferred web frameworks. Similar observations were made on the datasets for the two organisational types. Vue.js also seems to be preferred: it is number 2 among the five most preferred web frameworks. The least desired are Symfony, Gatsby and Drupal.

6. Other Frameworks, Libraries, and Tools

The most desired tools for 2021 are Node.js (16.28%), .NET (10.77%), TensorFlow (9.65%), .NET Core (9.16%) and React Native (8.04%). Node.js is still the most preferred, followed by .NET. TensorFlow is preferred over .NET Core and React Native.

Percentage of preference: Other Frameworks, Libraries, and Tools (whole dataset (42,372 respondents))

Pandas did not make it into the top five. The least desired are Ansible, Cordova, Puppet and Chef.

You now have a picture of the tools developers would prefer working with in 2021. Admittedly, we did not test for statistical significance to support these observations. Nevertheless, can we still conclude that they are early indications of future trends in tech tools and the tech environment in these organisational setups? What do you think?

Let us now turn to research questions 3 and 4.

3. Using supervised machine learning, which model fits best when estimating salary for developers?

There are many features in the dataset that might relate to developers’ job factors. Using supervised ML, I train a model to estimate developers’ salary.

Before training the model, I checked for missing values and imputed the mean, the mode or 0, depending on the variable. The distribution of salary, shown below, is strongly right-skewed.

Distribution of salary
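As a minimal sketch of the imputation step described above: missing entries are replaced by the mean, the mode, or 0. The helper `impute`, the strategy names, and the toy values are hypothetical, and missing values are represented by `None` here; which strategy was applied to which survey variable is not stated in this article.

```python
from statistics import mean, mode


def impute(values, strategy):
    """Fill None entries using the mean, the mode, or zero of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "mode":
        fill = mode(observed)
    elif strategy == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Toy columns, made up for illustration only.
hours_per_week = impute([40, None, 50], "mean")
hobbyist = impute(["Yes", "No", None, "Yes"], "mode")
years_coding = impute([3, None, 10], "zero")

print(hours_per_week, hobbyist, years_coding)
```

Mean imputation suits numeric columns, mode imputation suits categorical ones, and zero is a reasonable fill when a missing value plausibly means "none".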

I created the X (features) and y (response) dataframes, split them into train and test sets, fitted a linear regression model on the training set, made predictions on the test set, and scored the model.
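The split, fit, predict, and score workflow described above can be sketched with the standard library alone. This is a simplified illustration with a single feature and made-up, exactly linear data; the analysis itself works with many features, and in practice these steps are typically done with scikit-learn's `train_test_split`, `LinearRegression`, and `r2_score`. All function names below are my own.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows reproducibly and split them into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def fit_line(pairs):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Share of the variability in `actual` explained by `predicted`."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy data: (years of experience, salary), made up and exactly linear.
data = [(years, 30_000 + 2_500 * years) for years in range(20)]

train, test = split_train_test(data)
slope, intercept = fit_line(train)
predicted = [intercept + slope * x for x, _ in test]
actual = [y for _, y in test]
score = r_squared(actual, predicted)

print(len(train), len(test), round(slope, 2), round(score, 4))
```

The R-squared computed on the held-out test rows is the "variability explained" figure reported throughout this section. On this noise-free toy data it is essentially 1; on the survey data it is far lower.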

Heatmap showing the correlation of the numeric variables

Using the five numeric variables, just 1.4% of the variability in salary is explained, and the dataset was reduced to 29,960 rows (~46.47% of the original dataset size). 54 categorical variables were present. Dummy encoding them resulted in 29,892 rows and 5,643 columns (1 response and 5,641 explanatory variables) that were used to estimate salary.
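To illustrate why dummy encoding inflates the number of columns so much, here is a small sketch. The helper `dummy_encode` and the toy rows are hypothetical; in pandas, this step is usually done with `pandas.get_dummies`, though the exact options used for this analysis are not stated here.

```python
def dummy_encode(rows, column):
    """Replace one categorical column with a 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Toy rows, made up for illustration only.
rows = [
    {"Salary": 50_000, "Country": "Ghana"},
    {"Salary": 60_000, "Country": "Germany"},
    {"Salary": 55_000, "Country": "Ghana"},
]

encoded = dummy_encode(rows, "Country")
for row in encoded:
    print(row)
```

Every categorical variable with k distinct values turns into k indicator columns, so a few dozen categorical variables with many distinct values each can easily grow into thousands of explanatory variables.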

Training a linear regression model on this data broke down. A graph of the test performance is below.

Salary estimation with the linear model

However, RandomForestRegressor did work well. Approximately 56% of the variability in salary was explained using 29,892 rows and 5,641 explanatory variables. The optimal number of features was around 1,000, as can be seen in the graph below. Hence RandomForestRegressor outperformed the linear regression model.

R-squared by number of features for train and test data
Actual vs predicted salary based on the RandomForest Regressor model

This notwithstanding, the graph of actual vs predicted values has some strange lines. I think they might reflect outliers in salary rather than overfitting of the model. There are several methods for dealing with outliers in machine learning. For this analysis, though, I used a simple boxplot to construct the upper whisker. 3.6% of the data, amounting to 2,302 rows, are removed, and the new dataset is then used to answer research question 4.
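A boxplot-based cut-off of this kind can be sketched as follows. I assume the conventional definition of the upper whisker limit, Q3 + 1.5 × IQR, and linear interpolation between data points for the quartiles; the exact settings used for this analysis are not stated here. The helper name and the toy salaries are hypothetical.

```python
from statistics import quantiles


def upper_whisker(values, k=1.5):
    """Upper boxplot limit: third quartile plus k times the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Toy salaries, made up for illustration only; the last one is an obvious outlier.
salaries = [30_000, 40_000, 50_000, 60_000, 70_000,
            80_000, 90_000, 100_000, 1_000_000]

limit = upper_whisker(salaries)
kept = [s for s in salaries if s <= limit]

print(limit, len(salaries) - len(kept))
```

Only values above the limit are dropped, which suits a strongly right-skewed variable like salary, where the extreme values lie in the upper tail.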

4. Do the models perform better in predicting salary after accounting for outliers in salary?

The same analysis process is repeated as in research question 3 above. 1.7% of the variability in salary is explained by the 4 numeric variables (28,030 rows with 4 explanatory variables).

After accounting for categorical variables and dummy encoding, 27,967 rows and 5,400 independent variables resulted. Using the linear regression model, 58% of the variability in salary is explained on the test set with 1,500 features, resulting in a positive correlation between predicted and actual values.

R-squared by number of features based on the linear regression model
Actual vs predicted salary based on the linear regression model

Using the RandomForest model, 68% of the variability in salary is explained with the number of features at 1,000. Here again there is a positive relationship between the predicted and actual values.

R-squared by number of features using the RandomForest Regressor model

And here is the corresponding actual vs predicted graph.

Actual vs predicted salary values using the RandomForest Regressor model

So the answer to the research question 4 is a big YES!

Conclusion

Assuming the observed results are statistically significant:

1. We can conclude that, based on percentage differences, developers in small and mid-size enterprises and developers in large enterprises consider different job-attracting factors when choosing between two job offers with the same compensation, benefits, and location. For developers in small and mid-size enterprises, “Remote work options”, “Languages, frameworks, and other technologies I’d be working with” and “Flex time or a flexible schedule” play an important role, while developers in large enterprises consider “Specific department or team I’d be working on”, “How widely used or impactful my work output would be” and “Opportunities for professional development”.

2. The pattern of preference for desired tech tools for 2021 seems similar in both organisational types.

The five most desired tools for 2021

a. Programming, Scripting, and Markup Languages

C (11.16%), Java (10.59%), Python (9.41%), JavaScript (8.38%) and HTML/CSS (6.39%).

b. Database Environment

PostgreSQL (14.50%), MongoDB (13.14%), MySQL (12.90%), Redis (9.82%) and SQLite (8.95%).

c. New Collaboration Tools

GitHub (23.73%), Slack (15.48%), GitLab (11.57%), Google Suite (Docs, Meet, etc) (11.17%) and Jira (10.91%).

d. Platform

Linux (14.67%), Docker (13.97%), AWS (10.62%), Kubernetes (9.25%), and Windows (9.12%).

e. Web framework

React.js (16.47%), Vue.js (10.78%), Angular (10.75%), ASP.NET (8.34%) and ASP.NET Core (7.40%).

f. Other Frameworks, Libraries, and Tools

Node.js (16.28%), .NET (10.77%), TensorFlow (9.65%), .NET Core (9.16%) and React Native (8.04%).

3. The RandomForest Regressor outperformed the linear regression model in estimating developers’ salary when salary contains outliers.

Approximately 56% of the variability in salary was explained by using 1,000 features on the train dataset.

4. Removing outliers from salary enhances the performance of both models.

The linear regression model explained 58% of the variability in salary with 1,500 features on the test set, while the RandomForest Regressor explained 68% of the variability with 1,000 features.

If you are interested in the code for creating the visualizations, please visit my GitHub repository here.

Many thanks for reading.

Paul Dzitse

Econometrician and Data Analyst/Scientist