Restaurant Food Price Prediction

Prathamesh Mistry
11 min read · Oct 19, 2020

These months in the pandemic have been a bummer, as all of our favorite restaurants were closed. My cravings, my thoughts about food, and a pending project submission had me writing this article.

In this article, I will walk you through an end-to-end pipeline of a machine learning project. The link to the project can be found at the end of this article.

“Problem Statement”

We have a dataset consisting of restaurants and their details from the major cities in India. We have to clean the data, make it interpretable, and draw insights from it. We also have to build a regression model for predicting the average cost of dining at a restaurant.

“Importing the Libraries”

Importing the libraries we will require along the way:

# importing project dependencies
# compute dependency
import numpy as np
# dataframe dependency
import pandas as pd
# visualization dependencies
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# stats dependency
from scipy import stats
# feature_engineering dependency
import feature_engine
# for scraping some data
import webbrowser
# regular expressions
import re
import warnings
warnings.filterwarnings('ignore')
# import training dependencies
from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
# import the evaluation metrics dependencies
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score,mean_squared_error

Importing the datasets from .xlsx files into pandas data frames. We have 2 files: one containing the training data and the other containing the testing data. The testing data does not include the target variable.

The preferred practice to approach such projects is:

  • Import the train and test files into 2 different data frames.
  • Append the test data frame to the train data frame. This leaves the target variable of the records from the test set as NaN.
  • Preprocess this merged data frame.
  • After the preprocessing is done, re-split the train and test sets as they were before, remembering to drop the all-NaN target variable from the test set.
  • Shuffle and split the train set into a training set and a validation set.
  • Train your models, evaluate them on the validation set, and pick the best model using cross-validation.
  • Predict with this model on the test set and save your results to a .csv file.

Cross-validation on the validation set ensures that the model selection isn't biased towards the test data.

# importing the dataset.
train_df = pd.read_excel('Data_Train.xlsx')
test_df = pd.read_excel('Data_Test.xlsx')
# get the shape
train_df.shape,test_df.shape # ((12690, 9), (4231, 8))
# merging the datasets
df = train_df.append(test_df) # (17011, 9)

Two of the variables are of numerical datatypes, and the remaining six are of object datatypes.

In the merged data frame (df), we can see that there are some missing values. Note that all the missing values in the target variable are from the test set, so exclude COST when counting missing values.

So, there are missing values in the features CITY (0.8%), LOCALITY (0.7%), RATINGS (0.02%), and VOTES (9.5%). All the features with missing values have less than 1 percent of missing data, except VOTES, which has just under 10 percent.
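
As a quick sanity check, a snippet along these lines (assuming the merged data frame is named df, as above) prints the datatypes and the percentage of missing values per column:

# datatypes of each column
print(df.dtypes)
# percentage of missing values per column, sorted from most to least
print(df.isnull().mean().mul(100).sort_values(ascending=False).round(2))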

Life is good till now!

“Data Cleaning and Feature Engineering”

This problem has to be solved using NLP-style text cleaning techniques. Let's traverse each of the features in the dataset one at a time.

TITLE:

The TITLE feature defines what type of food place it is: a casual dining restaurant, a cafe, a bakery, or something else.

# Get all the unique title values.
df.TITLE.unique()
array(['CASUAL DINING', 'CASUAL DINING,BAR', 'QUICK BITES', 'DESSERT PARLOR', 'CAFÉ', 'MICROBREWERY', 'QUICK BITES,BEVERAGE SHOP', 'CASUAL DINING,IRANI CAFE', 'BAKERY,QUICK BITES', 'None', 'BAR,CASUAL DINING', 'BAR', 'PUB', 'BEVERAGE SHOP', 'FINE DINING', 'CAFÉ,QUICK BITES', 'BEVERAGE SHOP,DESSERT PARLOR', 'SWEET SHOP,QUICK BITES', 'DESSERT PARLOR,SWEET SHOP', 'BAKERY', 'BAKERY,DESSERT PARLOR', 'BAR,LOUNGE', 'FOOD COURT', 'LOUNGE', 'DESSERT PARLOR,BEVERAGE SHOP', 'LOUNGE,CASUAL DINING', 'FOOD TRUCK', 'QUICK BITES,FOOD COURT', 'SWEET SHOP', 'BEVERAGE SHOP,FOOD COURT', 'PUB,CASUAL DINING', 'MESS', 'MICROBREWERY,CASUAL DINING', 'CASUAL DINING,SWEET SHOP', 'KIOSK', 'QUICK BITES,KIOSK', 'CLUB', 'FINE DINING,BAR', 'DESSERT PARLOR,QUICK BITES', 'FOOD COURT,QUICK BITES', 'LOUNGE,CAFÉ', 'BAKERY,CONFECTIONERY', 'CASUAL DINING,CAFÉ', 'DHABA', 'CAFÉ,DESSERT PARLOR', 'QUICK BITES,DESSERT PARLOR', 'PUB,MICROBREWERY', 'LOUNGE,BAR', 'DESSERT PARLOR,CAFÉ', 'CAFÉ,BAR', 'SWEET SHOP,CONFECTIONERY', 'CASUAL DINING,PUB', 'MICROBREWERY,BAR', 'DESSERT PARLOR,BAKERY', 'QUICK BITES,SWEET SHOP', 'BEVERAGE SHOP,QUICK BITES', 'CASUAL DINING,LOUNGE', 'CASUAL DINING,CLUB', 'QUICK BITES,CAFÉ', 'BAR,CAFÉ', 'CAFÉ,CASUAL DINING', 'QUICK BITES,CASUAL DINING', 'CASUAL DINING,MICROBREWERY', 'CASUAL DINING,BAKERY', 'CAFÉ,BAKERY', 'MEAT SHOP', 'QUICK BITES,BAKERY', 'BAR,FINE DINING', 'SWEET SHOP,CASUAL DINING', 'MEAT SHOP,QUICK BITES', 'PUB,LOUNGE', 'BAKERY,CAFÉ', 'COCKTAIL BAR', 'FINE DINING,LOUNGE', 'CONFECTIONERY', 'QUICK BITES,BAR', 'BAKERY,FOOD COURT', 'PUB,BAR', 'DESSERT PARLOR,FOOD COURT', 'QUICK BITES,FOOD TRUCK', 'BAKERY,BEVERAGE SHOP', 'CLUB,BAR', 'BAKERY,SWEET SHOP', 'SWEET SHOP,BAKERY', 'CASUAL DINING,FOOD COURT', 'PAAN SHOP', 'BEVERAGE SHOP,CAFÉ', 'FOOD COURT,DESSERT PARLOR', 'CLUB,MICROBREWERY', 'CAFÉ,BEVERAGE SHOP', 'DESSERT PARLOR,PAAN SHOP', 'MICROBREWERY,LOUNGE', 'LOUNGE,CLUB', 'SWEET SHOP,DESSERT PARLOR', 'BAR,PUB', 'CONFECTIONERY,QUICK BITES', 'DESSERT PARLOR,KIOSK', 'LOUNGE,PUB', 'SWEET SHOP,BEVERAGE SHOP', 'FINE DINING,CAFÉ', 'BEVERAGE SHOP,CASUAL DINING', 'KIOSK,QUICK BITES', 'CASUAL DINING,DESSERT PARLOR', 'LOUNGE,FINE DINING', 'PUB,CAFÉ', 'CAFÉ,LOUNGE', 'BAR,CLUB', 'COCKTAIL BAR,CASUAL DINING', 'MICROBREWERY,PUB', 'CAFÉ,FINE DINING', 'KIOSK,FOOD COURT', 'LOUNGE,MICROBREWERY', 'BAR,MICROBREWERY', 'FOOD TRUCK,DESSERT PARLOR', 'IRANI CAFE,BAKERY', 'BAKERY,CASUAL DINING', 'PUB,FINE DINING', 'CAFÉ,FOOD COURT', 'FINE DINING,MICROBREWERY', 'CLUB,FINE DINING', 'QUICK BITES,MEAT SHOP', 'CLUB,CASUAL DINING', 'BHOJANALYA'], dtype=object)

There is a lot of repeated information in this column. Titles like CASUAL DINING, FINE DINING, and the other 'DINING' variants all describe essentially the same kind of place. Similarly, many titles are counted as unique just because of different naming conventions. For instance, we will replace both CAFÉ and IRANI CAFE with just CAFE. This grouping is known as binning.

# work on a copy of the merged data frame
df_cleaned = df.copy()

# Binning CASUAL DINING
dining_title = df_cleaned[df_cleaned.TITLE.str.contains('DINING')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=dining_title,value='CASUAL DINING',inplace=True)

# Binning CAFE
cafe_title = df_cleaned[df_cleaned.TITLE.str.contains('CAF')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=cafe_title,value='CAFE',inplace=True)

# Binning QUICK BITES
quick_bites_title = df_cleaned[df_cleaned.TITLE.str.contains('BITES')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=quick_bites_title,value='QUICK BITES',inplace=True)

# Binning DESSERT PARLOUR
dessert_title = df_cleaned[df_cleaned.TITLE.str.contains('DESSERT')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=dessert_title,value='DESSERT PARLOUR',inplace=True)

# Binning BEVERAGE SHOP
bev_title = df_cleaned[df_cleaned.TITLE.str.contains('BEVERAGE')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=bev_title,value='BEVERAGE SHOP',inplace=True)

We will replace all such values with a common value. Titles like DESSERT PARLOR, SWEET SHOP, and BAKERY are close to one another, so we club similar titles together. We will do such binning for every group of titles that mean almost the same thing. Note that I have binned these values based on simple logic as well as on the type of cuisine served and the average cost of dining.

# Binning BAR and PUB
bar_title = df_cleaned[df_cleaned.TITLE.str.contains('BAR|PUB')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=bar_title,value='BAR_PUB',inplace=True)

# Binning BAKERY, CONFECTIONERY and SWEET SHOP
bakery_title = df_cleaned[df_cleaned.TITLE.str.contains('BAKERY|CONFECTIONERY|SWEET SHOP')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=bakery_title,value='BAKERY_SWEETS',inplace=True)

# Binning MICROBREWERY, LOUNGE and CLUB
lounge_title = df_cleaned[df_cleaned.TITLE.str.contains('MICROBREWERY|LOUNGE|CLUB')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=lounge_title,value='LOUNGE_CLUB_MICROBREWERY',inplace=True)

# Binning FOOD COURT, FOOD TRUCK and KIOSK
food_court_title = df_cleaned[df_cleaned.TITLE.str.contains('FOOD COURT|FOOD TRUCK|KIOSK')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=food_court_title,value='FOOD_COURT_TRUCK',inplace=True)

# Binning DHABA, MESS and BHOJANALYA
dhaba_title = df_cleaned[df_cleaned.TITLE.str.contains('MESS|DHABA|BHOJANALYA')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=dhaba_title,value='DHABA_MESS',inplace=True)

# Binning PAAN SHOP and MEAT SHOP (grouped with food courts/trucks)
paan_title = df_cleaned[df_cleaned.TITLE.str.contains('PAAN|MEAT')].TITLE.value_counts().index
df_cleaned.TITLE.replace(to_replace=paan_title,value='FOOD_COURT_TRUCK',inplace=True)

There are some entries in TITLE with the value 'None'. These are actually missing values, so let's replace this 'None' value with NaN (np.nan). (We will address the missing values later.)

df_cleaned.TITLE.replace('None',np.nan,inplace=True)

We have reduced the titles to these categories.

df_cleaned.TITLE.unique()

array(['CASUAL DINING', 'QUICK BITES', 'DESSERT PARLOUR', 'CAFE',
'LOUNGE_CLUB_MICROBREWERY', nan, 'BAR_PUB', 'BEVERAGE SHOP',
'BAKERY_SWEETS', 'FOOD_COURT_TRUCK', 'DHABA_MESS'], dtype=object)

Most of the restaurants are titled either QUICK BITES or CASUAL DINING.

CUISINES

After the TITLE feature, let's turn to the next feature, CUISINES. The CUISINES feature shows the kind of food/dishes served at a particular restaurant. By eyeballing this feature we can spot some prominent cuisines.

A restaurant can serve multiple cuisines. What we will do is create 8 new binary features with the name of those 8 prominent cuisines. These 8 prominent cuisines are n_indian_cuisine, s_indian_cuisine, chinese_cuisine, bakery_desserts_cuisine, fast_food_cuisine, regional_cuisine, beverage_cuisine, non_indian_cuisine.

# North Indian cuisine
df_cleaned['n_indian_cuisine'] = df_cleaned.CUISINES.mask(df_cleaned.CUISINES.str.contains('North Indian'),1)
df_cleaned['n_indian_cuisine'] = df_cleaned['n_indian_cuisine'].mask(~df_cleaned.CUISINES.str.contains('North Indian'),0)

# South Indian cuisine
df_cleaned['s_indian_cuisine'] = df_cleaned.CUISINES.mask(df_cleaned.CUISINES.str.contains('South Indian'),1)
df_cleaned['s_indian_cuisine'] = df_cleaned['s_indian_cuisine'].mask(~df_cleaned.CUISINES.str.contains('South Indian'),0)

# Chinese cuisine
df_cleaned['chinese_cuisine'] = df_cleaned.CUISINES.mask(df_cleaned.CUISINES.str.contains('Chinese'),1)
df_cleaned['chinese_cuisine'] = df_cleaned['chinese_cuisine'].mask(~df_cleaned.CUISINES.str.contains('Chinese'),0)

# Bakery / desserts cuisine
df_cleaned['bakery_desserts_cuisine'] = df_cleaned.CUISINES.mask(df_cleaned.CUISINES.str.contains('Desserts|Ice Cream|Bakery'),1)
df_cleaned['bakery_desserts_cuisine'] = df_cleaned['bakery_desserts_cuisine'].mask(~df_cleaned.CUISINES.str.contains('Desserts|Ice Cream|Bakery'),0)

# Fast food cuisine
df_cleaned['fast_food_cuisine'] = df_cleaned.CUISINES.mask(df_cleaned.CUISINES.str.contains('Pizza|Burger|Fast Food|Street Food|Finger Food|Rolls|Sandwiches|BBQ|Warps|Momo'),1)
df_cleaned['fast_food_cuisine'] = df_cleaned['fast_food_cuisine'].mask(~df_cleaned.CUISINES.str.contains('Pizza|Burger|Fast Food|Street Food|Finger Food|Rolls|Sandwiches|BBQ|Warps|Momo'),0)

# Regional cuisine (indian_states is a '|'-joined string of regional cuisine names built earlier)
df_cleaned['regional_cuisine'] = df_cleaned.CUISINES.mask(df_cleaned.CUISINES.str.contains('Maharashtrian|di$|ni$|{}'.format(indian_states)),1)
df_cleaned['regional_cuisine'] = df_cleaned['regional_cuisine'].mask(~df_cleaned.CUISINES.str.contains('Maharashtrian|di$|ni$|{}'.format(indian_states)),0)

# Beverage cuisine
df_cleaned['beverage_cuisine'] = df_cleaned.CUISINES.mask(df_cleaned.CUISINES.str.contains('Beverage|Cafe|Juices'),1)
df_cleaned['beverage_cuisine'] = df_cleaned['beverage_cuisine'].mask(~df_cleaned.CUISINES.str.contains('Beverage|Cafe|Juices'),0)

# Non-Indian cuisine
df_cleaned['non_indian_cuisine'] = df_cleaned.CUISINES.mask(df_cleaned.CUISINES.str.contains('American|Italian|Mexican|Continental|French|Arabian|Seafood|Asian|Japanese|Vietnamese'),1)
df_cleaned['non_indian_cuisine'] = df_cleaned['non_indian_cuisine'].mask(~df_cleaned.CUISINES.str.contains('American|Italian|Mexican|Continental|French|Arabian|Seafood|Asian|Japanese|Vietnamese'),0)

TIME

From the TIME feature, we will find the days a restaurant remains closed. We will do that by creating a sparse matrix over the days of the week using a TfidfVectorizer. For a day when the restaurant is closed the value will be zero, otherwise one.
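
Here is a minimal sketch of one way this could be done with scikit-learn's TfidfVectorizer. The day tokens ('mon', 'tue', ...) and the fillna handling are assumptions on my part, and the real TIME strings often use ranges like 'Mon-Sun', so the original notebook likely does some extra parsing on top of this:

from sklearn.feature_extraction.text import TfidfVectorizer

# fixed vocabulary of day-of-week tokens; binary counts with no idf/normalization give a 0/1 matrix
days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
vec = TfidfVectorizer(vocabulary=days, binary=True, use_idf=False, norm=None)
day_matrix = vec.fit_transform(df_cleaned.TIME.fillna(''))

# 1 where the day appears in the TIME string (open), 0 where it does not (closed)
day_df = pd.DataFrame(day_matrix.toarray(), columns=days, index=df_cleaned.index)
df_cleaned = pd.concat([df_cleaned, day_df], axis=1)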

CITY

For the CITY Variable, we will perform similar steps as we performed for the TITLE variable.

# get the top 10 most occurring cities.
top10_cities = ['Chennai', 'Bangalore', 'Hyderabad', 'Mumbai', 'New Delhi', 'Kochi',
                'Gurgaon', 'Noida', 'Ghaziabad', 'Thane']
for city in top10_cities:
    print(city)
    # all CITY values that mention this city (different localities or spellings)
    temp_city = df_cleaned[df_cleaned.CITY.str.contains(city,na=False)].CITY.value_counts().index
    print(temp_city)
    # mapper
    df_cleaned.CITY.replace(to_replace=temp_city,value=city,inplace=True)

other_cities = df_cleaned[~df_cleaned.CITY.str.contains('Chennai|Bangalore|Hyderabad|Mumbai|New Delhi|Kochi|Gurgaon|Noida|Ghaziabad|Thane',na=False)].CITY.value_counts().index
# mapper
df_cleaned.CITY.replace(to_replace=other_cities,value='Others',inplace=True)

Initially, we had 450 unique city values; we lowered that down to 12. What we did was bin all the frequent city spellings into the top 10 cities and put the rest of the unique cities into an 'Others' value.

LOCALITY

We will simply drop this column, as almost all the values in this feature are unique and won't generalize for predicting the cost of food across the nation. The LOCALITY feature would come in handy if we were doing the same task of predicting the average dining price at the state level.
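
Dropping it is a one-liner (a sketch; in the original notebook it may be dropped together with other columns that are no longer needed, such as CUISINES and TIME, once their engineered features exist):

# LOCALITY is almost unique per row, so it carries no generalizable signal
df_cleaned.drop('LOCALITY', axis=1, inplace=True)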

VOTES

This is the total number of votes given by customers to a particular restaurant.

Votes being a number, it should be numerical, yet it has the object datatype. This is because the string ' votes' appears along with the number of votes. We have to clean this, i.e. remove the string and cast the VOTES column to a numerical datatype.

# keep the leading number and drop the trailing ' votes' string
df_cleaned.VOTES = df_cleaned.VOTES.dropna().apply(lambda x : float(x.split(' ')[0]))

This is a snapshot after performing the trimming and type casting from the above line of code.

RATING

The ratings are given on the basis of the quality of food, the restaurant's standard, and the kind of cuisines served. In our dataset, the restaurants with high ratings are the microbreweries, the lounges, and the clubs. These are also the restaurants with high dining prices. Thus, we can infer that the higher the average dining price, the higher the rating tends to be.
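
A quick way to eyeball this claim (a sketch I'm adding, not from the original notebook; it assumes COST is already numeric) is to compare the average dining cost per title:

# mean dining cost per restaurant title; test rows have NaN COST and are ignored by mean()
df_cleaned.groupby('TITLE')['COST'].mean().sort_values(ascending=False)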

The rating is also a numerical value between 0 and 5, but in our case the feature has the object datatype. The reason for this is not only the missing values but also the value 'New' in the rating feature, which is given to restaurants that haven't been rated yet. There is also a value '-' in the rating feature, which again represents missing values. This kind of data is known as dirty data.

Let’s replace this Dirty Data with NaNs and change the data type of Rating to float.

# '-' and 'New' (unrated restaurants) carry no usable rating, so treat them as missing
df_cleaned.RATING.mask(df_cleaned.RATING.str.upper().isin(['-','NEW']),np.nan,inplace=True)
df_cleaned.RATING = df_cleaned.RATING.astype('float')

We are now done cleaning the data; let's move forward with missing value imputation.

“Missing Value Imputation”

Initially, we had some missing values in the dataset, but as we dug deeper we encountered some more missing data. It's time now to address it. We will use a library called feature-engine for this procedure.

Remember not to impute the missing values in COST (they are from the test set).

# importing the imputation dependencies
from feature_engine.missing_data_imputers import _find_categorical_variables,_find_numerical_variables
from feature_engine.missing_data_imputers import CategoricalVariableImputer
# getting the numerical and the categorical features
cat_var = _find_categorical_variables(df_cleaned)
num_var = _find_numerical_variables(df_cleaned)

We have to perform categorical imputation of the TITLE and CITY features and numerical imputation of the VOTES feature.

NUMERICAL IMPUTATION

We will apply 2 methods for imputation of missing numerical data and select the best method.
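
A sketch of how the two could be compared, using feature-engine's MeanMedianImputer and RandomSampleImputer (class names as in the feature-engine version used here; the comparison plot is my own illustration): impute VOTES with each and see which preserves the original distribution better.

from feature_engine.missing_data_imputers import MeanMedianImputer, RandomSampleImputer

# candidate 1: fill missing VOTES with the median
median_imp = MeanMedianImputer(imputation_method='median', variables=['VOTES'])
# candidate 2: fill missing VOTES with random draws from the observed values
random_imp = RandomSampleImputer(variables=['VOTES'], random_state=42)

# compare the imputed distributions against the original (NaNs dropped)
fig, ax = plt.subplots()
df_cleaned.VOTES.dropna().plot(kind='kde', ax=ax, label='original')
median_imp.fit_transform(df_cleaned).VOTES.plot(kind='kde', ax=ax, label='median imputed')
random_imp.fit_transform(df_cleaned).VOTES.plot(kind='kde', ax=ax, label='random sample imputed')
ax.legend()
plt.show()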

The Random Sample imputation works better than the mean/median imputer in our case, so we finalize the Random Sample Imputer and move forward to the categorical imputation.

from feature_engine.missing_data_imputers import RandomSampleImputer
df_cleaned = RandomSampleImputer(variables=['VOTES']).fit_transform(df_cleaned)

CATEGORICAL IMPUTATION

We will impute the missing values in the categorical features with the most frequent category of the column.

cat_missing = ['TITLE','CITY']
enc = CategoricalVariableImputer(fill_value='frequent',variables=cat_missing)
df_cleaned = enc.fit_transform(df_cleaned)

“One Hot Encoding”

We will again use the feature-engine library for OneHot Encoding the categorical variables.

from feature_engine.categorical_encoders import OneHotCategoricalEncoder

onehot = OneHotCategoricalEncoder(drop_last=True,variables=cat_var)
df_cleaned = onehot.fit_transform(df_cleaned)

“Building up the Machine Learning Regression Model”

Firstly, split the train and the test data back as they were before we appended the two data frames. Also, drop the COST column from the test dataset.

df_train = df_cleaned[~df_cleaned.COST.isna()].copy()
df_test = df_cleaned[df_cleaned.COST.isna()].copy()
df_test.drop('COST',1,inplace=True)

Now split the features and the target in the train set:

X = df_train.drop('COST',1)
y = df_train['COST']

Now, we have to validate our model to see its performance before using it to predict prices in the test dataset. So, we will split the train set into a training set and a validation set in a 70:30 ratio.

from sklearn.model_selection import train_test_split

X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.3,random_state=42)
X_train.shape,y_train.shape,X_val.shape,y_val.shape
# ((8882, 39), (8882,), (3807, 39), (3807,))

Now we will train various regression models and evaluate all of them on the validation set, along with their cross-validation scores.

# let's create the models dataframe
models = [LinearRegression(),Ridge(alpha=0.001),Lasso(alpha=0.0003),ElasticNet(alpha=0.0001),SVR(),
          DecisionTreeRegressor(),RandomForestRegressor(),GradientBoostingRegressor(),
          AdaBoostRegressor(base_estimator=LinearRegression())]
model_names = 'Linear Ridge Lasso ElasticNet SVR Descision_Tree Random_Forest Gradient_Boost Ada_Boost'.split()

models_df = pd.DataFrame(columns=['Model','MSE','R2','meanCV'])
for model,model_name in zip(models,model_names):
    print(model)
    # fit on the training split and evaluate on the validation split
    model.fit(X_train,y_train)
    y_preds = model.predict(X_val)
    mse = mean_squared_error(y_val,y_preds)
    r2 = r2_score(y_val,y_preds)
    # mean cross-validation R2 over the full training data
    avgCV = np.mean(cross_val_score(model,X,y,scoring='r2'))
    models_df = models_df.append({'Model':model_name,'MSE':mse,'R2':r2,'meanCV':avgCV},ignore_index=True)

After training and evaluating the various regression models, we can see that the Gradient_Boost model has a higher cross-validation score than the rest of the algorithms.

So we select the Gradient Boosting Regressor as the best model.

Now, we will finally predict on the test set using this Gradient Boosting model.

best_model = GradientBoostingRegressor()
best_model.fit(X_train,y_train)
best_model.score(X_val,y_val)
test_prediction = best_model.predict(df_test)
test_prediction
array([1403.87326964, 290.08346387, 928.06214878, ..., 456.98586479,
300.59525699, 334.7633621 ])

Saving the final results in a .csv file

pd.DataFrame(test_prediction).to_csv('Train_results.csv')

Finally, saving the model for production using Joblib

import joblib
joblib.dump(best_model,'Restaurant_Food_Price.pkl')
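
Later, in a serving script, the model can be loaded back and used for inference (a minimal sketch; the input must be preprocessed exactly like df_test):

# load the persisted model and predict on already-preprocessed data
loaded_model = joblib.load('Restaurant_Food_Price.pkl')
loaded_model.predict(df_test)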

Fellas, here we reach the end of the complete Restaurant Price Prediction Project.

The link to the complete project :
