Key focus: This article discusses how to generate a simulated dataset for regression problems using the scikit-learn make_regression function (Python 3).
Problem statement
Suppose a survey is conducted among the employees of a company, collecting each employee's salary and years of experience. The aim of this data collection is to build a regression model that can predict salary from experience, especially for experience values the model has not seen.
If you are a developer, you often have no access to such survey data. In that case, you can simulate the data and use it to build the regression model.
Generating the dataset
To construct a simulated dataset for this scenario, the sklearn.datasets.make_regression function available in the scikit-learn library can be used. The function generates samples for a random regression problem.
The make_regression function generates samples for the inputs (features) and the output (target) by applying a random linear regression model. The generated values then have to be scaled to a range appropriate for the given problem.
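To make the "random linear regression model" concrete, here is a minimal sketch that switches the noise off and checks that the target is then an exact linear function of the feature. The names X, w and y_linear are only illustrative, and noise=0.0 and bias=0.0 are chosen here purely for the check; they are not the settings used for the salary dataset.

```python
import numpy as np
from sklearn import datasets

# With noise=0.0 and bias=0.0 only the linear part of the model remains
X, y, coef = datasets.make_regression(n_samples=100, n_features=1, n_informative=1,
                                      noise=0.0, bias=0.0, coef=True, random_state=0)

# With a single feature the coefficient may come back as a 0-d array,
# so normalise it to 1-D before taking the matrix product
w = np.atleast_1d(coef)
y_linear = X @ w

print(X.shape, y.shape)          # feature matrix is 2-D, target is 1-D
print(np.allclose(y, y_linear))  # checks that the noiseless target equals X @ coef
```

With a non-zero noise value, Gaussian noise is added on top of this linear relationship, which is what produces the scatter in the generated samples.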
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt #for plotting
x, y, coef = datasets.make_regression(n_samples=100, # number of samples
                                      n_features=1, # number of features
                                      n_informative=1, # number of useful features
                                      noise=10, # standard deviation of the Gaussian noise
                                      coef=True, # also return the true coefficients used to generate the data
                                      random_state=0) # fixed seed: same data points on each run
# Scale feature x (years of experience) to range 0..20
x = np.interp(x, (x.min(), x.max()), (0, 20))
# Scale target y (salary) to range 20000..150000
y = np.interp(y, (y.min(), y.max()), (20000, 150000))
plt.ion() #interactive plot on
plt.plot(x, y, '.', label='training data')
plt.xlabel('Years of experience'); plt.ylabel('Salary $')
plt.title('Experience Vs. Salary')
plt.legend() # show the 'training data' label
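The np.interp calls above act as a min-max rescaling. As a rough check, the sketch below compares them with an explicit formula and confirms that the correlation between feature and target is unchanged by the rescaling. The helper min_max_scale is hypothetical and written only for this comparison.

```python
import numpy as np
from sklearn import datasets

x, y = datasets.make_regression(n_samples=100, n_features=1, n_informative=1,
                                noise=10, random_state=0)

def min_max_scale(values, low, high):
    """Hypothetical helper: linearly map values from [min, max] to [low, high]."""
    v_min, v_max = values.min(), values.max()
    return low + (values - v_min) * (high - low) / (v_max - v_min)

# Same scaling as in the article, done with np.interp
x_scaled = np.interp(x, (x.min(), x.max()), (0, 20))
y_scaled = np.interp(y, (y.min(), y.max()), (20000, 150000))

# np.interp between the two end points should match the explicit formula
same_x = np.allclose(x_scaled, min_max_scale(x, 0, 20))
same_y = np.allclose(y_scaled, min_max_scale(y, 20000, 150000))

# An increasing linear rescaling should leave the correlation unchanged
r_before = np.corrcoef(x.ravel(), y)[0, 1]
r_after = np.corrcoef(x_scaled.ravel(), y_scaled)[0, 1]
print(same_x, same_y, np.isclose(r_before, r_after))
```

Because both axes are only shifted and stretched, the scaled data keeps the linear trend of the generated samples; only the units change.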
If you want the data in a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(data={'experience':x.flatten(),'salary':y})
df.head(10)
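Since the stated goal is to predict salaries for experience values the model has not seen, one possible next step is to hold out part of the DataFrame for testing. The sketch below is an assumption about how that could be done, not part of the original workflow: it rebuilds the same DataFrame and uses train_test_split from sklearn.model_selection with an arbitrarily chosen 80/20 split.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Rebuild the dataset exactly as above so this block runs on its own
x, y = datasets.make_regression(n_samples=100, n_features=1, n_informative=1,
                                noise=10, random_state=0)
x = np.interp(x, (x.min(), x.max()), (0, 20))
y = np.interp(y, (y.min(), y.max()), (20000, 150000))
df = pd.DataFrame(data={'experience': x.flatten(), 'salary': y})

# Hold out 20% of the rows; the split ratio is an illustrative choice
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
print(len(train_df), len(test_df))
```

The held-out rows play the role of the "values not seen by the model" when the regression model is evaluated later.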
We have successfully generated a simulated dataset for regression problems in Python 3. Let’s move on to build and train a linear regression model using the generated dataset and use it for predictions.