How can we predict the profit of a startup?

Today, we will continue our tutorial on linear regression. As you may recall, simple linear regression is used when we have a single independent variable and a single dependent variable. In multiple linear regression, however, several independent variables can affect the dependent variable.

The multiple linear regression formula: y = b0 + b1*x1 + b2*x2 + … + bn*xn, where y is the dependent variable, x1…xn are the independent variables, b0 is the intercept, and b1…bn are the coefficients.

For example, in simple linear regression, we observed that an employee’s salary depends on the number of years of experience. However, it can also depend on their level of education, skills, and other factors. As you can see, we have several variables influencing the prediction of an employee’s salary, which is why we need to use multiple linear regression.

Objective

We want our model to predict the profit of a startup based on these independent variables in order to help investors determine which companies to invest in, with the goal of maximizing profit.

Dataset

The dataset (can be found here) that we use in this model contains data about 50 startups. It has 5 columns: ‘R&D Spend,’ ‘Administration,’ ‘Marketing Spend,’ ‘State,’ and ‘Profit.’ The first 3 columns indicate how much each startup spends on R&D, administration, and marketing, respectively. The ‘State’ column indicates which state the startup is based in, and the last column represents the profit made by the startup.

Step 1: Import Libraries

We will use four libraries:
NumPy: import numpy as np to work with arrays
Pandas: import pandas as pd to work with the dataset
Matplotlib: import matplotlib.pyplot as plt to visualize our plots
Scikit-learn: from sklearn.linear_model import LinearRegression to implement machine-learning functions; in this example, we use linear regression
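
Putting these imports together:

import numpy as np                                   # numerical operations on arrays
import pandas as pd                                  # loading and handling the dataset
import matplotlib.pyplot as plt                      # plotting
from sklearn.linear_model import LinearRegression    # the regression model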

Step 2: Load the Dataset

We will be using a pandas DataFrame. Here, X contains all the independent variables, which are “R&D Spend”, “Administration”, “Marketing Spend”, and “State”, and y is the dependent variable, which is the “Profit”.
So for X, we specify: X = dataset.iloc[:, :-1].values
and for y, we specify: y = dataset.iloc[:, 4].values
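
Assuming the CSV file is named 50_Startups.csv (the actual filename may differ), the loading step looks like this:

dataset = pd.read_csv('50_Startups.csv')   # 50 rows, 5 columns
X = dataset.iloc[:, :-1].values            # independent variables: all columns except 'Profit'
y = dataset.iloc[:, 4].values              # dependent variable: 'Profit'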

Step 3: Convert text variable to numbers

We can see that in our dataset, we have a categorical variable ‘State’ that needs to be encoded. The ‘State’ variable is at index 3. We will use the OneHotEncoder and ColumnTransformer classes to convert text to numbers, as shown below.
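
A sketch of this encoding step, assuming the standard scikit-learn API (note that ColumnTransformer places the encoded columns first):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the 'State' column (index 3) and keep all other columns unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))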

After running the above code snippet, we can observe that three dummy variables have been added, since we had three different states. However, it’s essential to remove one of the dummy variables to avoid the dummy variable trap. You can read more about the dummy variable trap and why it’s necessary to remove one of the dummy variables.
X = X[:, 1:]   # drop the first dummy column to avoid the dummy variable trap

Step 4: Split dataset – training set and test set

Next, we have to split the dataset into a training set and a test set. We will use the training set to train the model and then check the model's performance on the test set. For this, we will use the train_test_split method from sklearn's model_selection module.
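
A sketch of the split; the 80/20 ratio and the fixed random_state are assumptions:

from sklearn.model_selection import train_test_split

# Reserve 20% of the startups for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)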

Step 5: Fit our model to the training set

This is a straightforward step. We will be using the LinearRegression class from the sklearn.linear_model library.
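
Creating and fitting the regressor:

regressor = LinearRegression()     # multiple linear regression model
regressor.fit(X_train, y_train)    # learn the coefficients from the training set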

Step 6: Predict the test set

We will now use the regressor trained in the previous step to predict the results on the test set and compare the predicted values with the actual values.
y_pred = regressor.predict(X_test)   # predicted profits for the test set
Let us compare the predicted and actual values to see how well our model did. As you can see below, our model performed quite well.
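
A simple way to put the actual and predicted profits side by side:

comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)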

[Figure: comparison of predicted and actual profits on the test set]

Step 7: Backward Elimination

In the model that we just built, we used all the independent variables. However, some independent variables may be more significant than others and have a greater impact on the profit, while others may not be significant at all. This means that if we remove the less significant variables from the model, we may achieve better predictions. The first step is to add a column of 1’s to our X dataset as the first column; this column corresponds to the intercept b0, which statsmodels' OLS does not add automatically.
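
A sketch of this step (the dataset has 50 rows, so the column of 1’s has 50 entries):

# Prepend a column of 1's so that OLS can estimate the intercept b0
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)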

Now we will start the backward elimination process. Since we will be creating a new optimal matrix of features, we will call it X_opt. It will contain only the independent features that are significant in predicting profit. Next, we create a new regressor from the OLS (Ordinary Least Squares) class of the statsmodels library. It takes two arguments:
endog: the dependent variable.
exog: the matrix containing all independent variables.
Now we need to fit the OLS model; then we will look at the summary to see which independent variables have a p-value higher than the significance level SL (0.05), as shown below.
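
A sketch of the first round, assuming X now has six columns (intercept, two state dummies, and the three spend features):

import statsmodels.api as sm

X_opt = X[:, [0, 1, 2, 3, 4, 5]].astype(float)      # start with all features
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()   # fit ordinary least squares
print(regressor_OLS.summary())                      # check each variable's p-value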

Let’s examine the output: x1 and x2 are the two dummy variables we added for ‘State’; x3 is ‘R&D Spend’; x4 is ‘Administration’; x5 is ‘Marketing Spend’. We look for the highest p-value greater than the significance level of 0.05, which in this case is 0.99 (99%) for x2. So we have to remove x2 (the second dummy variable for ‘State’), which is at index 2.

Now we repeat the process, each time removing the independent variable with the highest p-value.
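
For example, the round after removing x2 drops index 2 (each later round drops one more index in the same way):

X_opt = X[:, [0, 1, 3, 4, 5]].astype(float)   # x2 (index 2) removed
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_OLS.summary())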

Finally, we are left with only one independent variable: ‘R&D Spend’.


The entire backward elimination process can be automated, as shown in the sketch below; I’ve explained it step by step here for better understanding. We can now rebuild our model using only one independent variable, R&D Spend, and make predictions. The results may be better than the first time.
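
A sketch of an automated version, assuming a significance level of 0.05 (backward_elimination is a hypothetical helper, one possible implementation):

def backward_elimination(X, y, sl=0.05):
    # Repeatedly drop the feature with the highest p-value until all are below sl
    cols = list(range(X.shape[1]))
    while True:
        model = sm.OLS(endog=y, exog=X[:, cols].astype(float)).fit()
        p = np.asarray(model.pvalues)   # p-value of each remaining column
        worst = p.argmax()              # index of the least significant feature
        if p[worst] <= sl:
            return cols, model          # all remaining features are significant
        del cols[worst]                 # eliminate it and refit

cols, final_model = backward_elimination(X, y)
print(final_model.summary())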
