How can we predict the profit of a startup?

Today, we will continue our tutorial on linear regression. As you may recall, simple linear regression is used when we have a single independent variable and a single dependent variable. In multiple linear regression, however, several independent variables can affect the dependent variable.

The multiple linear regression formula: y = b0 + b1*x1 + b2*x2 + … + bn*xn, where y is the dependent variable, x1…xn are the independent variables, b0 is the intercept, and b1…bn are the coefficients.

For example, in simple linear regression, we observed that an employee’s salary depends on the number of years of experience. However, it can also depend on their level of education, skills, and other factors. As you can see, we have several variables influencing the prediction of an employee’s salary, which is why we need to use multiple linear regression.

Objective

We want our model to predict the profit of a startup based on these independent variables in order to help investors determine which companies to invest in, with the goal of maximizing profit.

Dataset

The dataset (can be found here) that we use in this model contains data about 50 startups. It has 5 columns: ‘R&D Spend,’ ‘Administration,’ ‘Marketing Spend,’ ‘State,’ and ‘Profit.’ The first 3 columns indicate how much each startup spends on R&D, administration, and marketing, respectively. The ‘State’ column indicates which state the startup is based in, and the last column represents the profit made by the startup.

Step 1: Import Libraries

We will use four libraries:
NumPy: import numpy as np to work with arrays
Pandas: import pandas as pd to work with the dataset
Matplotlib: import matplotlib.pyplot as plt to visualize our plots
Scikit-learn: from sklearn.linear_model import LinearRegression to implement machine-learning functions; in this example, we use linear regression
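
Putting these imports together:

import numpy as np                                   # numerical operations on arrays
import pandas as pd                                  # loading and handling the dataset
import matplotlib.pyplot as plt                      # plotting
from sklearn.linear_model import LinearRegression    # the regression model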

Step 2: Load the Dataset

We will be using a pandas DataFrame. Here, X contains all the independent variables, which are “R&D Spend”, “Administration”, “Marketing Spend”, and “State”, and y is the dependent variable, which is the “Profit”.
So for X, we specify: X = dataset.iloc[:, :-1].values
and for y, we specify: y = dataset.iloc[:, 4].values
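
Assuming the CSV file is named 50_Startups.csv (the actual filename may differ), the loading step looks like this:

dataset = pd.read_csv('50_Startups.csv')   # 50 rows, 5 columns
X = dataset.iloc[:, :-1].values            # independent variables: all columns except 'Profit'
y = dataset.iloc[:, 4].values              # dependent variable: 'Profit'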

Step 3: Convert text variable to numbers

We can see that in our dataset, we have a categorical variable ‘State’ that needs to be encoded. The ‘State’ variable is at index 3. We will use the OneHotEncoder and ColumnTransformer classes to convert text to numbers, as shown below.
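
A sketch of this encoding step, assuming the standard scikit-learn API (note that ColumnTransformer places the encoded columns first):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the 'State' column (index 3) and keep all other columns unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))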

After running the above code snippet, we can observe that three dummy variables have been added, since we had three different states. However, it’s essential to remove one of the dummy variables to avoid the dummy variable trap. You can read more about the dummy variable trap and why it’s necessary to remove one of the dummy variables.
X = X[:, 1:]   # drop the first dummy column to avoid the dummy variable trap

Step 4: Split dataset – training set and test set

Next, we have to split the dataset into a training set and a test set. We will use the training set to train the model and then check the model's performance on the test set. For this, we will use the train_test_split method from sklearn's model_selection module.
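
A sketch of the split; the 80/20 ratio and the fixed random_state are assumptions:

from sklearn.model_selection import train_test_split

# Reserve 20% of the startups for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)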

Step 5: Fit our model to the training set

This is a straightforward step. We will be using the LinearRegression class from the sklearn.linear_model library.
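
Creating and fitting the regressor:

regressor = LinearRegression()     # multiple linear regression model
regressor.fit(X_train, y_train)    # learn the coefficients from the training set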

Step 6: Predict the test set

We will now use the regressor trained in the previous step to predict the results on the test set and compare the predicted values with the actual values.
y_pred = regressor.predict(X_test)   # predicted profits for the test set
Let us compare the predicted and actual values to see how well our model did. As you can see below, our model performed quite well.
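
A simple way to put the actual and predicted profits side by side:

comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)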

[Figure: comparison of predicted and actual profits on the test set]

Step 7: Backward Elimination

In the model that we just built, we used all the independent variables. However, some independent variables may be more significant than others and have a greater impact on the profit, while others may not be significant at all. This means that if we remove the less significant variables from the model, we may achieve better predictions. The first step is to add a column of 1’s to our X dataset as the first column; this column corresponds to the intercept b0, which statsmodels' OLS does not add automatically.
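
A sketch of this step (the dataset has 50 rows, so the column of 1’s has 50 entries):

# Prepend a column of 1's so that OLS can estimate the intercept b0
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)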

Now we will start the backward elimination process. Since we will be creating a new optimal matrix of features, we will call it X_opt. It will contain only the independent features that are significant in predicting profit. Next, we create a new regressor from the OLS (Ordinary Least Squares) class of the statsmodels library. It takes two arguments:
endog: the dependent variable.
exog: the matrix containing all independent variables.
Now we need to fit the OLS model; then we will look at the summary to see which independent variables have a p-value higher than the significance level SL (0.05), as shown below.
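
A sketch of the first round, assuming X now has six columns (intercept, two state dummies, and the three spend features):

import statsmodels.api as sm

X_opt = X[:, [0, 1, 2, 3, 4, 5]].astype(float)      # start with all features
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()   # fit ordinary least squares
print(regressor_OLS.summary())                      # check each variable's p-value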

Let’s examine the output: x1 and x2 are the two dummy variables we added for ‘State’; x3 is ‘R&D Spend’; x4 is ‘Administration’; x5 is ‘Marketing Spend’. We look for the highest p-value greater than the significance level of 0.05, which in this case is 0.99 (99%) for x2. So we have to remove x2 (the second dummy variable for ‘State’), which is at index 2.

Now we repeat the process, each time removing the independent variable with the highest p-value.
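
For example, the round after removing x2 drops index 2 (each later round drops one more index in the same way):

X_opt = X[:, [0, 1, 3, 4, 5]].astype(float)   # x2 (index 2) removed
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_OLS.summary())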

Finally, we are left with only one independent variable: ‘R&D Spend’.


The entire backward elimination process can be automated, as shown in the sketch below; I’ve explained it step by step here for better understanding. We can now rebuild our model using only one independent variable, R&D Spend, and make predictions. The results may be better than the first time.
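
A sketch of an automated version, assuming a significance level of 0.05 (backward_elimination is a hypothetical helper, one possible implementation):

def backward_elimination(X, y, sl=0.05):
    # Repeatedly drop the feature with the highest p-value until all are below sl
    cols = list(range(X.shape[1]))
    while True:
        model = sm.OLS(endog=y, exog=X[:, cols].astype(float)).fit()
        p = np.asarray(model.pvalues)   # p-value of each remaining column
        worst = p.argmax()              # index of the least significant feature
        if p[worst] <= sl:
            return cols, model          # all remaining features are significant
        del cols[worst]                 # eliminate it and refit

cols, final_model = backward_elimination(X, y)
print(final_model.summary())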
