I wanted to explore a set of data from Meetup.com. Meetup is a service for organizing online groups that host in-person and virtual events for people with similar interests. Users typically use the site to make friends, share a hobby, or network professionally. I am an introvert, but the pandemic left me longing for social interaction, so I joined a group in Seattle as its organizer. I organized a ton of events this past summer and wanted to gather information about them to see what conclusions I can draw and what predictions I can make about future events.

The proposed data set will come directly from Meetup using their console interface (https://secure.meetup.com/meetup_api/console/?path=/:urlname/events). The Meetup API used to be free, but it no longer is: you need to pay for the Pro edition to use the full API. However, Meetup does offer a free interface that allows GET requests only, with response payloads capped at 500 objects. It provides simple RESTful HTTP and streaming interfaces for exploring and interacting with the Meetup platform. This assignment will entail cleaning and wrangling the extraneous data and loading it into a data warehouse for exploration and analysis.
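As a sketch of how the raw pull might look (the endpoint template and helper names here are my assumptions, not Meetup's documented Pro API):

```python
import json
import urllib.request

# Hypothetical endpoint template -- substitute whatever GET endpoint the
# free console actually exposes for your group.
EVENTS_URL_TEMPLATE = "https://api.meetup.com/{urlname}/events"

def build_events_url(urlname: str) -> str:
    """Build the GET URL for a group's events."""
    return EVENTS_URL_TEMPLATE.format(urlname=urlname)

def fetch_events(urlname: str) -> list:
    """Fetch the events payload (capped at 500 objects on the free tier)."""
    with urllib.request.urlopen(build_events_url(urlname)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```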

Some of the questions I have in mind include:

  1. What kind of events draw the most reservations? Board games? Karaoke? Hiking?
  2. What day of the week is the most popular?
  3. What time of the day do most members RSVP?
  4. How many events have we had in total in the past years? Are we collectively becoming more popular as a group?
  5. Are members more inclined to join events that are relatively more popular? What are some of the most popular events and what do they have in common?
  6. The data records the time at which members RSVP for an event. Is there a time of day that is more favorable for announcing an event?
  7. What are the mean and standard deviation of the number of events each organizer hosts? Are there any patterns?
  8. Seattle has many districts and events are scattered everywhere in different townships. Which areas or venues are more popular than others?
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
%matplotlib inline
In [3]:
from tabulate import tabulate
import matplotlib.pyplot as plt
import sqlite3
In [4]:
con = sqlite3.connect('meetup.db')
cursor = con.cursor()

Let's peek at the data, limited to just the first 30 rows.

An event can be viewed at the URL https://www.meetup.com/{group_name}/events/{event_id}/

For example, https://www.meetup.com/1-5GenAsians/events/281289732/
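In code, that URL pattern is just string interpolation (the helper name is mine):

```python
def event_url(group_name: str, event_id: str) -> str:
    """Build the public Meetup page URL for an event."""
    return f"https://www.meetup.com/{group_name}/events/{event_id}/"
```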

In [5]:
data = cursor.execute("select * from event limit 30;")
print(tabulate(data.fetchall(), headers=["Event ID", "Status",'Event Date', 'RSVP Count', "Link"], tablefmt='psql'))
+------------+----------+--------------+--------------+-------------------------------------------------------+
|   Event ID | Status   | Event Date   |   RSVP Count | Link                                                  |
|------------+----------+--------------+--------------+-------------------------------------------------------|
|  109370632 | past     | 2013-03-22   |            5 | https://www.meetup.com/1-5GenAsians/events/109370632/ |
|  109868492 | past     | 2013-03-23   |            6 | https://www.meetup.com/1-5GenAsians/events/109868492/ |
|  111080172 | past     | 2013-03-30   |           11 | https://www.meetup.com/1-5GenAsians/events/111080172/ |
|  111088522 | past     | 2013-03-30   |            3 | https://www.meetup.com/1-5GenAsians/events/111088522/ |
|  110749042 | past     | 2013-03-31   |            8 | https://www.meetup.com/1-5GenAsians/events/110749042/ |
|  111871212 | past     | 2013-04-05   |           10 | https://www.meetup.com/1-5GenAsians/events/111871212/ |
|  112504332 | past     | 2013-04-06   |            8 | https://www.meetup.com/1-5GenAsians/events/112504332/ |
|  113192492 | past     | 2013-04-12   |           10 | https://www.meetup.com/1-5GenAsians/events/113192492/ |
|  113176612 | past     | 2013-04-13   |            4 | https://www.meetup.com/1-5GenAsians/events/113176612/ |
|  114838502 | past     | 2013-04-19   |            3 | https://www.meetup.com/1-5GenAsians/events/114838502/ |
|  116882312 | past     | 2013-05-03   |            8 | https://www.meetup.com/1-5GenAsians/events/116882312/ |
|  113174462 | past     | 2013-05-04   |            6 | https://www.meetup.com/1-5GenAsians/events/113174462/ |
|  118200732 | past     | 2013-05-11   |           10 | https://www.meetup.com/1-5GenAsians/events/118200732/ |
|  119185682 | past     | 2013-05-18   |            9 | https://www.meetup.com/1-5GenAsians/events/119185682/ |
|  120390832 | past     | 2013-05-24   |            8 | https://www.meetup.com/1-5GenAsians/events/120390832/ |
|  121853942 | past     | 2013-06-02   |            9 | https://www.meetup.com/1-5GenAsians/events/121853942/ |
|  123022632 | past     | 2013-06-07   |            8 | https://www.meetup.com/1-5GenAsians/events/123022632/ |
|  123844992 | past     | 2013-06-15   |           25 | https://www.meetup.com/1-5GenAsians/events/123844992/ |
|  126498932 | past     | 2013-06-28   |           11 | https://www.meetup.com/1-5GenAsians/events/126498932/ |
|  126318262 | past     | 2013-06-29   |            1 | https://www.meetup.com/1-5GenAsians/events/126318262/ |
|  127652212 | past     | 2013-07-03   |            4 | https://www.meetup.com/1-5GenAsians/events/127652212/ |
|  123571462 | past     | 2013-07-06   |            3 | https://www.meetup.com/1-5GenAsians/events/123571462/ |
|  129354962 | past     | 2013-07-13   |            3 | https://www.meetup.com/1-5GenAsians/events/129354962/ |
|  129003072 | past     | 2013-07-13   |            7 | https://www.meetup.com/1-5GenAsians/events/129003072/ |
|  130608792 | past     | 2013-07-27   |           20 | https://www.meetup.com/1-5GenAsians/events/130608792/ |
|  133462162 | past     | 2013-08-10   |           12 | https://www.meetup.com/1-5GenAsians/events/133462162/ |
|  134488172 | past     | 2013-08-17   |            9 | https://www.meetup.com/1-5GenAsians/events/134488172/ |
|  135768512 | past     | 2013-08-24   |           21 | https://www.meetup.com/1-5GenAsians/events/135768512/ |
|  136619502 | past     | 2013-08-31   |           11 | https://www.meetup.com/1-5GenAsians/events/136619502/ |
|  135478542 | past     | 2013-09-07   |            9 | https://www.meetup.com/1-5GenAsians/events/135478542/ |
+------------+----------+--------------+--------------+-------------------------------------------------------+

Let's see some of the most popular events. I define these as the events with the most RSVPs.

In [6]:
data = cursor.execute("select event_date, rsvp_count, link from event order by rsvp_count desc limit 30;")
print(tabulate(data.fetchall(), headers=['Event Date', 'RSVP Count', 'Link'], tablefmt='psql'))
+--------------+--------------+-------------------------------------------------------+
| Event Date   |   RSVP Count | Link                                                  |
|--------------+--------------+-------------------------------------------------------|
| 2015-03-01   |          109 | https://www.meetup.com/1-5GenAsians/events/219920453/ |
| 2014-11-02   |          105 | https://www.meetup.com/1-5GenAsians/events/212049012/ |
| 2016-10-09   |          102 | https://www.meetup.com/1-5GenAsians/events/233570064/ |
| 2016-04-30   |           85 | https://www.meetup.com/1-5GenAsians/events/230305968/ |
| 2019-12-21   |           83 | https://www.meetup.com/1-5GenAsians/events/266398172/ |
| 2018-03-24   |           82 | https://www.meetup.com/1-5GenAsians/events/248285057/ |
| 2018-06-10   |           80 | https://www.meetup.com/1-5GenAsians/events/250424473/ |
| 2021-08-29   |           69 | https://www.meetup.com/1-5GenAsians/events/279628714/ |
| 2014-08-17   |           57 | https://www.meetup.com/1-5GenAsians/events/194901992/ |
| 2019-01-26   |           54 | https://www.meetup.com/1-5GenAsians/events/257948178/ |
| 2015-03-14   |           47 | https://www.meetup.com/1-5GenAsians/events/220103990/ |
| 2021-10-17   |           46 | https://www.meetup.com/1-5GenAsians/events/281053389/ |
| 2014-06-21   |           44 | https://www.meetup.com/1-5GenAsians/events/182543572/ |
| 2017-10-14   |           43 | https://www.meetup.com/1-5GenAsians/events/243442670/ |
| 2014-03-29   |           40 | https://www.meetup.com/1-5GenAsians/events/164683122/ |
| 2017-06-10   |           40 | https://www.meetup.com/1-5GenAsians/events/240287833/ |
| 2014-07-13   |           33 | https://www.meetup.com/1-5GenAsians/events/190799922/ |
| 2018-02-18   |           32 | https://www.meetup.com/1-5GenAsians/events/247445878/ |
| 2019-02-02   |           32 | https://www.meetup.com/1-5GenAsians/events/257999006/ |
| 2021-10-10   |           32 | https://www.meetup.com/1-5GenAsians/events/280631869/ |
| 2015-09-05   |           30 | https://www.meetup.com/1-5GenAsians/events/224725880/ |
| 2015-11-21   |           30 | https://www.meetup.com/1-5GenAsians/events/226542423/ |
| 2018-10-20   |           30 | https://www.meetup.com/1-5GenAsians/events/255446031/ |
| 2019-08-03   |           30 | https://www.meetup.com/1-5GenAsians/events/262715304/ |
| 2019-11-23   |           30 | https://www.meetup.com/1-5GenAsians/events/266412251/ |
| 2021-09-25   |           28 | https://www.meetup.com/1-5GenAsians/events/280609901/ |
| 2015-06-06   |           27 | https://www.meetup.com/1-5GenAsians/events/222782816/ |
| 2021-10-03   |           27 | https://www.meetup.com/1-5GenAsians/events/280505220/ |
| 2018-06-24   |           26 | https://www.meetup.com/1-5GenAsians/events/251467094/ |
| 2013-06-15   |           25 | https://www.meetup.com/1-5GenAsians/events/123844992/ |
+--------------+--------------+-------------------------------------------------------+
In [7]:
data = cursor.execute("select count(*) from event;")
print(f"Total number of events is {data.fetchone()[0]}")
Total number of events is 814

Let's see the number of events we have for each month.

In [8]:
num_event_per_month = list(cursor.execute("""select strftime('%Y-%m', event_date), count(*) from event 
    group by strftime('%Y-%m', event_date);"""))
print(tabulate(num_event_per_month, headers=['Month', 'Number of Events'], tablefmt='psql'))
In [10]:
num_event_per_month = num_event_per_month[-24:]  # keep the most recent 24 months
x = range(len(num_event_per_month))
width = 1/1.5

figure = plt.figure(figsize=(20, 6))

axes = figure.add_subplot(1, 1, 1)
axes.bar(x, [r[1] for r in num_event_per_month], width, align="center")
axes.set_xticks(x)
axes.set_xticklabels([r[0] for r in num_event_per_month])
axes.set_title("Number of Events Per Month")
axes.set_xlabel("Month")
axes.set_ylabel("# of Events")
axes.xaxis.grid(False)

plt.setp(axes.xaxis.get_majorticklabels(), rotation=70)
plt.show()
plt.close();

Next, I'm interested in how many events we have per year and how many events we hold per weekday.

In [18]:
num_event_per_year = list(cursor.execute("""select strftime('%Y', event_date), count(*) from event 
    group by strftime('%Y', event_date);"""))
print(tabulate(num_event_per_year, headers=['Year', 'Number of Events'], tablefmt='psql'))

figure = plt.figure(figsize=(15, 6))

axes = figure.add_subplot(1, 1, 1)
axes.plot([r[0] for r in num_event_per_year], [r[1] for r in num_event_per_year])
axes.plot([r[0] for r in num_event_per_year], [r[1] for r in num_event_per_year], 'ro')
plt.axis([-1, 10, -5, 180])
plt.show()
+--------+--------------------+
|   Year |   Number of Events |
|--------+--------------------|
|   2013 |                 55 |
|   2014 |                 60 |
|   2015 |                 83 |
|   2016 |                 98 |
|   2017 |                 39 |
|   2018 |                165 |
|   2019 |                139 |
|   2020 |                 42 |
|   2021 |                132 |
|   2022 |                  1 |
+--------+--------------------+
In [28]:
num_event_per_weekday = list(cursor.execute("""select strftime('%w', event_date), count(*) from event 
    group by strftime('%w', event_date);"""))
print(tabulate(num_event_per_weekday, headers=['Day of the Week', 'Number of Events'], tablefmt='psql'))
+-------------------+--------------------+
|   Day of the Week |   Number of Events |
|-------------------+--------------------|
|                 0 |                278 |
|                 1 |                 19 |
|                 2 |                 23 |
|                 3 |                 43 |
|                 4 |                 37 |
|                 5 |                102 |
|                 6 |                312 |
+-------------------+--------------------+

No surprise there: weekends are the most popular days to host events.
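One gotcha when reading that table: SQLite's strftime('%w', ...) numbers weekdays 0 (Sunday) through 6 (Saturday), so the two big rows are Sunday (278) and Saturday (312). A small helper to label the query results (the function name is mine):

```python
# SQLite's strftime('%w', ...) yields '0' (Sunday) through '6' (Saturday).
WEEKDAY_NAMES = {
    "0": "Sunday", "1": "Monday", "2": "Tuesday", "3": "Wednesday",
    "4": "Thursday", "5": "Friday", "6": "Saturday",
}

def label_weekday_counts(rows):
    """Swap the numeric day-of-week codes for readable names."""
    return [(WEEKDAY_NAMES[day], count) for day, count in rows]
```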

In [1]:
%matplotlib inline
        
In [2]:
import warnings
warnings.filterwarnings('ignore')

from tabulate import tabulate
import matplotlib.pyplot as plt
import sqlite3
import pandas as pd
import numpy as np
import seaborn as sns
import random as py_random
import scipy.stats as stats

sns.set(style="whitegrid")

Extract, Transform, Load (ETL)

At this point, we should have:

  1. meetup.sql will create the data schema.
  2. meetup.py will fill the database tables with the correct data from the meetup JSON.
  3. meetup.db will have the data in it.
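A minimal sketch of what the load step in meetup.py might look like. The JSON field names (`id`, `local_date`, `yes_rsvp_count`, ...) are assumptions about the dump's layout, and the column list mirrors the `event` table queried below:

```python
import json
import sqlite3

def load_events(json_path: str, db_path: str) -> None:
    """Insert events from the Meetup JSON dump into the sqlite database.

    Assumes the schema (meetup.sql) has already been applied, and that
    each JSON object carries the fields named below.
    """
    with open(json_path) as fh:
        events = json.load(fh)
    con = sqlite3.connect(db_path)
    with con:  # commit on success, roll back on error
        con.executemany(
            "insert into event (event_id, status, event_date, rsvp_count, link) "
            "values (?, ?, ?, ?, ?)",
            [
                (e["id"], e["status"], e["local_date"], e["yes_rsvp_count"], e["link"])
                for e in events
            ],
        )
    con.close()
```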
In [3]:
conn = sqlite3.connect('meetup.db')
cursor = conn.cursor()

Before we start with the exploratory data analysis, let us examine the dataset to get a feel for the sort of data we are handling.

As a summary from the "Ask" part of the assignment, here are some questions to bear in mind while doing the analysis.

  1. What kind of events draw the most reservations? Board games? Karaoke? Hiking?
  2. What day of the week is the most popular?
  3. What time of the day do most members RSVP?
  4. How many events have we had in total in the past years? Are we collectively becoming more popular as a group?
  5. Are members more inclined to join events that are relatively more popular? What are some of the most popular events and what do they have in common?
  6. The data records the time at which members RSVP for an event. Is there a time of day that is more favorable for announcing an event?

We can start by peeking at the data, limited to just the first 30 rows.

An event can be viewed at the URL https://www.meetup.com/{group_name}/events/{event_id}/

For example, if the event ID is 281289732 for the 1-5GenAsians group: https://www.meetup.com/1-5GenAsians/events/281289732/

In [4]:
data = cursor.execute("select event_id, event_date, rsvp_count, venue_lat, venue_lon, venue_city from event limit 30;")
print(tabulate(data.fetchall(), headers=["Event ID", "Event Date", 'RSVP Count', "Lat", "Lon", "City"], tablefmt='psql'))
+------------+--------------+--------------+---------+----------+---------------+
|   Event ID | Event Date   |   RSVP Count |     Lat |      Lon | City          |
|------------+--------------+--------------+---------+----------+---------------|
|  109370632 | 2013-03-22   |            5 | 47.6816 | -122.126 | Redmond       |
|  109868492 | 2013-03-23   |            6 | 47.5287 | -121.825 | Snoqualmie    |
|  111080172 | 2013-03-30   |           11 | 47.612  | -122.199 | Bellevue      |
|  111088522 | 2013-03-30   |            3 | 48.0043 | -122.214 | Everett 98201 |
|  110749042 | 2013-03-31   |            8 | 47.6034 | -122.201 | Bellevue      |
|  111871212 | 2013-04-05   |           10 | 47.6098 | -122.204 | Bellevue      |
|  112504332 | 2013-04-06   |            8 | 47.667  | -122.1   | Redmond       |
|  113192492 | 2013-04-12   |           10 | 47.62   | -122.202 | Bellevue      |
|  113176612 | 2013-04-13   |            4 | 47.6573 | -122.318 | Seattle       |
|  114838502 | 2013-04-19   |            3 |         |          |               |
|  116882312 | 2013-05-03   |            8 | 47.5772 | -122.169 | Bellevue      |
|  113174462 | 2013-05-04   |            6 | 47.8593 | -121.971 | Monroe        |
|  118200732 | 2013-05-11   |           10 | 47.612  | -122.199 | Bellevue      |
|  119185682 | 2013-05-18   |            9 | 47.5984 | -122.325 | Seattle       |
|  120390832 | 2013-05-24   |            8 | 47.7127 | -122.187 | Kirkland      |
|  121853942 | 2013-06-02   |            9 | 47.5836 | -122.131 | Bellevue      |
|  123022632 | 2013-06-07   |            8 | 47.613  | -122.2   | Bellevue      |
|  123844992 | 2013-06-15   |           25 | 47.6906 | -122.402 | Seattle       |
|  126498932 | 2013-06-28   |           11 | 47.6233 | -122.161 | Bellevue      |
|  126318262 | 2013-06-29   |            1 |         |          |               |
|  127652212 | 2013-07-03   |            4 | 47.7553 | -122.163 | Woodinville   |
|  123571462 | 2013-07-06   |            3 | 47.6303 | -122.193 | Bellevue      |
|  129354962 | 2013-07-13   |            3 | 47.599  | -122.325 | Seattle       |
|  129003072 | 2013-07-13   |            7 | 47.6747 | -122.208 | Kirkland      |
|  130608792 | 2013-07-27   |           20 | 47.5839 | -122.129 | Bellevue      |
|  133462162 | 2013-08-10   |           12 | 47.7051 | -122.213 | Kirkland      |
|  134488172 | 2013-08-17   |            9 | 47.6167 | -122.201 | Bellevue      |
|  135768512 | 2013-08-24   |           21 | 47.5984 | -122.323 | Seattle       |
|  136619502 | 2013-08-31   |           11 | 47.5836 | -122.131 | Bellevue      |
|  135478542 | 2013-09-07   |            9 | 47.6659 | -122.114 | Redmond       |
+------------+--------------+--------------+---------+----------+---------------+

Let's see some of the most popular events. I define these as the events with the most RSVPs.

In [5]:
data = cursor.execute("select event_date, rsvp_count, event_id from event order by rsvp_count desc limit 15;")
print(tabulate(data.fetchall(), headers=['Event Date', 'RSVP Count', 'Event ID'], tablefmt='psql'))
+--------------+--------------+------------+
| Event Date   |   RSVP Count |   Event ID |
|--------------+--------------+------------|
| 2015-03-01   |          109 |  219920453 |
| 2014-11-02   |          105 |  212049012 |
| 2016-10-09   |          102 |  233570064 |
| 2016-04-30   |           85 |  230305968 |
| 2019-12-21   |           83 |  266398172 |
| 2018-03-24   |           82 |  248285057 |
| 2018-06-10   |           80 |  250424473 |
| 2021-08-29   |           69 |  279628714 |
| 2014-08-17   |           57 |  194901992 |
| 2019-01-26   |           54 |  257948178 |
| 2015-03-14   |           47 |  220103990 |
| 2021-10-17   |           46 |  281053389 |
| 2014-06-21   |           44 |  182543572 |
| 2017-10-14   |           43 |  243442670 |
| 2014-03-29   |           40 |  164683122 |
+--------------+--------------+------------+
In [6]:
data = cursor.execute("select count(*) from event;")
print(f"Total number of events is {data.fetchone()[0]}")
Total number of events is 814

Let's see the number of events we have for each month since this group was founded in 2013.

In [7]:
num_event_per_month = list(cursor.execute("""select strftime('%Y-%m', event_date), count(*) from event 
    group by strftime('%Y-%m', event_date);"""))
print(tabulate(num_event_per_month, headers=['Month', 'Number of Events'], tablefmt='psql'))

Looks like there was a big spike after July 2021. This was when coronavirus quarantine rules and regulations were relaxed in Seattle and people were able to enjoy more festive events and activities. My guess is that people were tired of being cooped up at home for over a year and a half, so they were eager to go out and socialize.

In [8]:
num_event_per_month = num_event_per_month[-24:]  # keep the most recent 24 months
x = range(len(num_event_per_month))
width = 1/1.5

figure = plt.figure(figsize=(20, 6))

axes = figure.add_subplot(1, 1, 1)
axes.bar(x, [r[1] for r in num_event_per_month], width, align="center")
axes.set_xticks(x)
axes.set_xticklabels([r[0] for r in num_event_per_month])
axes.set_title("Number of Events Per Month")
axes.set_xlabel("Month")
axes.set_ylabel("# of Events")
axes.xaxis.grid(False)

plt.setp(axes.xaxis.get_majorticklabels(), rotation=70)
plt.show()
plt.close();

Next, I'm interested in how many events we have per year and how many events we hold per weekday.

In [9]:
num_event_per_year = list(cursor.execute("""select strftime('%Y', event_date), count(*) from event 
    group by strftime('%Y', event_date);"""))
print(tabulate(num_event_per_year, headers=['Year', 'Number of Events'], tablefmt='psql'))

figure = plt.figure(figsize=(15, 6))

axes = figure.add_subplot(1, 1, 1)
axes.plot([r[0] for r in num_event_per_year], [r[1] for r in num_event_per_year])
axes.plot([r[0] for r in num_event_per_year], [r[1] for r in num_event_per_year], 'ro')
plt.axis([-1, 10, -5, 180])
plt.show()
+--------+--------------------+
|   Year |   Number of Events |
|--------+--------------------|
|   2013 |                 55 |
|   2014 |                 60 |
|   2015 |                 83 |
|   2016 |                 98 |
|   2017 |                 39 |
|   2018 |                165 |
|   2019 |                139 |
|   2020 |                 42 |
|   2021 |                132 |
|   2022 |                  1 |
+--------+--------------------+

2020 saw a steep dip, almost certainly due to the pandemic. Inspection of the data preliminarily suggests that most events during that period were online, with face-to-face events returning in 2021.

In [10]:
num_event_per_weekday = list(cursor.execute("""select strftime('%w', event_date), count(*) from event 
    group by strftime('%w', event_date);"""))
print(tabulate(num_event_per_weekday, headers=['Day of the Week', 'Number of Events'], tablefmt='psql'))
+-------------------+--------------------+
|   Day of the Week |   Number of Events |
|-------------------+--------------------|
|                 0 |                278 |
|                 1 |                 19 |
|                 2 |                 23 |
|                 3 |                 43 |
|                 4 |                 37 |
|                 5 |                102 |
|                 6 |                312 |
+-------------------+--------------------+

No surprise there: weekends are the most popular days to host events.

In [11]:
events = pd.read_sql('SELECT * FROM event', conn)
events.head()
Out[11]:
   event_id   status  event_date  event_time  rsvp_count  waitlist_count  duration  venue_lat  venue_lon    venue_city     link
0  109370632  past    2013-03-22  19:00       5           0               0.0       47.681561  -122.125504  Redmond        https://www.meetup.com/1-5GenAsians/events/109...
1  109868492  past    2013-03-23  12:30       6           0               0.0       47.528713  -121.825394  Snoqualmie     https://www.meetup.com/1-5GenAsians/events/109...
2  111080172  past    2013-03-30  10:00       11          0               0.0       47.612015  -122.198830  Bellevue       https://www.meetup.com/1-5GenAsians/events/111...
3  111088522  past    2013-03-30  12:00       3           0               0.0       48.004272  -122.214149  Everett 98201  https://www.meetup.com/1-5GenAsians/events/111...
4  110749042  past    2013-03-31  17:00       8           0               0.0       47.603401  -122.201317  Bellevue       https://www.meetup.com/1-5GenAsians/events/110...
In [12]:
events.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 814 entries, 0 to 813
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   event_id        814 non-null    object 
 1   status          814 non-null    object 
 2   event_date      814 non-null    object 
 3   event_time      814 non-null    object 
 4   rsvp_count      814 non-null    int64  
 5   waitlist_count  814 non-null    int64  
 6   duration        814 non-null    float64
 7   venue_lat       762 non-null    float64
 8   venue_lon       762 non-null    float64
 9   venue_city      762 non-null    object 
 10  link            814 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 70.1+ KB

We need to make a few modifications. Event ID should really be an int64, but a string is fine too; there won't be much to analyze on this property, so we can leave it alone. Status tells us whether the event was in the past or upcoming; we don't need this information. The venue city and event time should be categorical. Duration is expressed in milliseconds, so we will convert it to hours. Finally, the links are useful for navigating to the Meetup page but not so much for analysis, so let's remove them as well.

In [13]:
events['event_id'] = events['event_id'].astype('str')
events['venue_city'] = events['venue_city'].astype('category')
events['event_time'] = events['event_time'].astype('category')
events = events.drop(columns=['status', 'link'])
events['duration'] = events['duration'] / 3600000  # milliseconds to hours
events.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 814 entries, 0 to 813
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   event_id        814 non-null    object  
 1   event_date      814 non-null    object  
 2   event_time      814 non-null    category
 3   rsvp_count      814 non-null    int64   
 4   waitlist_count  814 non-null    int64   
 5   duration        814 non-null    float64 
 6   venue_lat       762 non-null    float64 
 7   venue_lon       762 non-null    float64 
 8   venue_city      762 non-null    category
dtypes: category(2), float64(3), int64(2), object(2)
memory usage: 50.3+ KB

Exploratory Data Analysis (EDA)

Single Variable Exploration

Let's begin with single-variable exploration and then analyze the pairwise relationships!

Event Time

In [14]:
print(pd.DataFrame(events.event_time.value_counts().sort_index()))
ax = events.event_time.value_counts().sort_values().tail(10).plot.bar(figsize=(20, 10), rot=0)
ax.set_ylabel("Count")
ax.set_xlabel("Event Time")
ax.grid(axis='x')
       event_time
00:00           3
03:00           1
05:00           1
06:30           3
07:00           4
...           ...
20:45           1
21:00           5
21:15           1
22:00           1
23:59           1

[74 rows x 1 columns]
In [15]:
events.event_time.describe()
Out[15]:
count       814
unique       74
top       18:00
freq         71
Name: event_time, dtype: object

6 PM seems to be the most common time to host events. Event time may be a good candidate predictor for the linear regression later on.

RSVP

In [16]:
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(events.rsvp_count)
axes.set_title("Distribution of RSVP")
axes.set_xlabel("RSVP")
axes.set_ylabel("Counts")
axes.grid(axis='x')
plt.show()
plt.close()
In [17]:
events.rsvp_count.describe()
Out[17]:
count    814.000000
mean      10.968059
std       10.296775
min        1.000000
25%        6.000000
50%        9.000000
75%       12.000000
max      109.000000
Name: rsvp_count, dtype: float64

Looks like most events draw between 1 and 30 reservations. There appear to be some big events that host over 100 members.

Waitlist Count

In [18]:
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(events.waitlist_count)
axes.set_title("Distribution of Waitlist")
axes.set_xlabel("Waitlist")
axes.set_ylabel("Counts")
axes.grid(axis='x')
plt.show()
plt.close()
In [19]:
events.waitlist_count.describe()
Out[19]:
count    814.000000
mean       0.108108
std        0.853678
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max       13.000000
Name: waitlist_count, dtype: float64

Almost everyone who wants to attend gets in, as there doesn't seem to be much of a waitlist for any event. This is not a good variable for our linear model later.

Venue City

In [20]:
print(pd.DataFrame(events.venue_city.value_counts().sort_index()))
        
        # Sort venue cities by the count and then sort before plotting top 10 most frequented cities
        ax = events.venue_city.value_counts().sort_values().tail(10).plot.bar(figsize=(20,10),rot = 0)
        ax.set_ylabel("Count")
        ax.set_xlabel("City")
        ax.grid(axis='x')
        
                     venue_city
Auburn                                1
Beijing                               1
Bellevue                            228
Bellevue                              6
Berlin                                1
Bothell                               3
Burien                                1
Edmonds                               5
Everett                               3
Everett 98201                         1
Federal Way                           6
Gold Bar                              3
Index                                 1
Issaquah                             28
Kenmore                               1
Kent                                  1
Kirkland                             47
Kirkland                              1
Lake Forest Park                      1
Lynnwood                              6
Marysville                            1
Mercer Island                         4
Monroe                                2
Mount Vernon                          1
Mountlake Ter                         1
Mukilteo                              3
North Bend                            1
North Bend                            1
Picnic Point-North Lynnwood           1
Redmond                              52
Redmond,                              1
Renton                                1
Sammamish                             1
Seattle                             314
Seattle                               4
Shoreline                             1
Skykomish                             1
Snohomish                             3
Snohomish,  98296                     1
Snoqualmie                            1
Snoqualmie Pass                       2
Tacoma                                2
Tukwila                               7
Tukwila                               1
Woodinville                           7
renton                                1
seattle                               2

Bellevue and Seattle are the happening cities to host events! It will be interesting to apply the Random Forest technique to this variable. Note the duplicate city labels above (e.g. two Kirkland rows and two Seattle rows) caused by stray whitespace, trailing commas, and inconsistent capitalization; these should be normalized before modeling.
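A minimal cleanup sketch for those messy city labels (my own aside, not part of the original pipeline), using a small hypothetical series rather than the real venue_city column:

```python
import pandas as pd

# Hypothetical messy values mirroring the duplicates in the table above
cities = pd.Series(["Seattle", "seattle", "Kirkland ", "Kirkland", "Redmond,"])

# Strip surrounding whitespace and trailing commas, then normalize case
cleaned = cities.str.strip().str.rstrip(",").str.title()
print(cleaned.value_counts())
```

Applying the same chain to `events["venue_city"]` before `value_counts()` would collapse the duplicate rows into single categories.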

Duration

In [21]:
figure = plt.figure(figsize=(10, 6))
        axes = figure.add_subplot(1, 1, 1)
        axes.hist(events.duration)
        axes.set_title("Distribution of Event Duration")
        axes.set_xlabel("Duration (hours)")
        axes.set_ylabel("Counts")
        axes.grid(axis='x')
        plt.show()
        plt.close()
        
In [22]:
events.duration.describe()
        
Out[22]:
count    814.000000
        mean       2.474918
        std        2.740083
        min        0.000000
        25%        1.000000
        50%        2.000000
        75%        3.000000
        max       38.500000
        Name: duration, dtype: float64

Pairwise Relationships

Because my questions revolve around the popularity of the events, the variable of interest to compare with is going to be the count of the reservations.

RSVP versus Cities

Let's revisit the most common cities where events are held.

In [23]:
print(pd.DataFrame(events.venue_city.value_counts().sort_values().tail(10)))
        
             venue_city
        Lynnwood              6
        Bellevue              6
        Federal Way           6
        Woodinville           7
        Tukwila               7
        Issaquah             28
        Kirkland             47
        Redmond              52
        Bellevue            228
        Seattle             314
        
In [24]:
grouped_by_city = events.groupby("venue_city")
        
        figure = plt.figure(figsize=(10, 6))
        
        axes = figure.add_subplot(1, 1, 1)
        labels = ["Seattle", "Issaquah", "Kirkland", "Redmond", "Bellevue"]
        grouped_data = [grouped_by_city["rsvp_count"].get_group(k) for k in labels]
        patch = axes.boxplot(grouped_data, labels=labels)
        axes.set_xlabel("City")
        axes.set_ylabel("RSVP")
        axes.set_title("RSVP versus City")
        
        plt.show()
        plt.close()
        

Wow, Bellevue shows a noticeably wider RSVP range than the other four top cities.

RSVP versus Event Duration

In [25]:
sns.jointplot(data=events, x = "duration", y = "rsvp_count", kind="hex", height=10)
        
Out[25]:
<seaborn.axisgrid.JointGrid at 0x23391854520>

There appears to be no correlation between the duration of an event and the number of reservations. This suggests that members don't decide to click the Attend button based on how long an event is!

RSVP versus Event Time

In [26]:
figure = plt.figure(figsize=(18, 6))
        
        axes = figure.add_subplot(1, 1, 1)
        axes.scatter(events.event_time, events.rsvp_count, marker="o")
        
        axes.set_ylabel("RSVP")
        axes.set_xlabel("Time")
        axes.set_title("Scatter Plot of Time vs. RSVP")
        plt.xticks(rotation = 90) 
        plt.show()
        plt.close()
        

We might need to use more sophisticated modeling techniques.

In [1]:
%matplotlib inline
        

This will make all the matplotlib images appear in the notebook.

In [2]:
import numpy as np
        import numpy.random as np_random
        import time
        import seaborn as sns
        import matplotlib.pyplot as plt
        import pandas as pd
        import scipy.stats as stats
        import sqlite3
        import sklearn.linear_model as linear
        import random
        import patsy
        
        from pprint import pprint
        import warnings
        warnings.filterwarnings('ignore')
        sns.set(style="whitegrid")
        

Let's connect to our data warehouse and read the information into a pandas DataFrame.

In [3]:
conn = sqlite3.connect('../2 - get/meetup.db')
        cursor = conn.cursor()
        
In [4]:
events = pd.read_sql('SELECT * FROM event', conn)
        events['event_id'] = events['event_id'].astype('str')
        events['venue_city'] = events['venue_city'].astype('category')
        events['event_time'] = events['event_time'].astype('category')
        events = events.drop(columns=['status', 'link'])
        events['duration'] = events['duration'] / 3600000  # milliseconds to hours
        events.info()
        
<class 'pandas.core.frame.DataFrame'>
        RangeIndex: 814 entries, 0 to 813
        Data columns (total 9 columns):
         #   Column          Non-Null Count  Dtype   
        ---  ------          --------------  -----   
         0   event_id        814 non-null    object  
         1   event_date      814 non-null    object  
         2   event_time      814 non-null    category
         3   rsvp_count      814 non-null    int64   
         4   waitlist_count  814 non-null    int64   
         5   duration        814 non-null    float64 
         6   venue_lat       762 non-null    float64 
         7   venue_lon       762 non-null    float64 
         8   venue_city      762 non-null    category
        dtypes: category(2), float64(3), int64(2), object(2)
        memory usage: 50.3+ KB
        

The instructions say not to use linear regression and Naive Bayes and to stick to the source materials from Lab 6. The following work is based on the Lab 6 solution provided in class.

RSVP is an abbreviation of the French phrase 'Répondez s'il vous plaît'. It translates to 'Respond, if you please' or, more simply, 'Please respond'. I want to predict the RSVP count of the next event, with a penalty for errors. Let's use the Mean Squared Error as our loss function and the mean as our prediction.
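As a quick illustration of why the mean pairs with MSE (my own aside, not from the lab): among all constant predictions, the mean minimizes the Mean Squared Error. A small sketch with hypothetical RSVP counts:

```python
import numpy as np

# Hypothetical RSVP counts for illustration
rsvps = np.array([3, 6, 9, 12, 25])

def mse(prediction, data):
    """Mean squared error of a single constant prediction."""
    return np.mean((data - prediction) ** 2)

mean_loss = mse(rsvps.mean(), rsvps)            # loss at the mean
other_losses = [mse(p, rsvps) for p in (5, 9, 15)]  # losses at other constants

# The mean's loss is never beaten by any other constant
print(mean_loss, min(other_losses))
```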

In [5]:
rsvp_predict = events["rsvp_count"].mean()
        rsvp_predict
        
Out[5]:
10.968058968058967

I am going to assume the usual 95% error bounds:

In [6]:
rsvp_std = events["rsvp_count"].std()
        rsvp_error = rsvp_std * 1.96
        print(rsvp_error)
        
20.18167931988092
        
In [7]:
rsvp_predict, rsvp_predict - rsvp_error, rsvp_predict + rsvp_error
        
Out[7]:
(10.968058968058967, -9.213620351821953, 31.14973828793989)
In [8]:
from scipy.stats import norm
        
        figure = plt.figure(figsize=(10,6))
        
        axes = figure.add_subplot(1, 1, 1)
        n, bins, patches = axes.hist(events['rsvp_count'], color="DimGray", density=True, bins=11, alpha=0.75)
        axes.set_xlabel("Count of RSVP")
        axes.set_ylabel("Density")
        axes.set_title("Density Histogram of Event RSVP with Normal Plot")
        
        # midpoints of the histogram bins
        xs = [(b2 + b1)/2 for b1, b2 in zip(bins, bins[1:])]
        
        rsvp_mean = events["rsvp_count"].mean()
        rsvp_std = events["rsvp_count"].std()
        ys = [norm.pdf(k, loc=rsvp_mean, scale=rsvp_std) for k in xs]
        axes.plot(xs, ys, color="darkred")
        
        plt.show()
        plt.close()
        
In [9]:
figure = plt.figure(figsize=(20, 8))
        
        rsvp_mn = events['rsvp_count'].min()
        rsvp_mx = events['rsvp_count'].max()
        
        axes = figure.add_subplot(1, 2, 1)
        
        values, base = np.histogram(events['rsvp_count'], bins=11, density=True)
        cumulative = np.cumsum(values)
        axes.plot(base[:-1], cumulative, color="steelblue")
        axes.set_xlim((rsvp_mn, rsvp_mx))
        
        sampled_data = [rsvp_mean + r * rsvp_std for r in np.random.standard_normal(10000)]
        values2, base = np.histogram(sampled_data, bins=base, density=True)
        cumulative2 = np.cumsum(values2)
        axes.plot(base[:-1], cumulative2, color="firebrick")
        axes.set_xlim((rsvp_mn, rsvp_mx))
        axes.set_xlabel("Empirical v. Theoretical: Normal Distribution")
        
        axes = figure.add_subplot(1, 2, 2)
        
        differences = cumulative2 - cumulative
        axes.plot(base[:-1], differences, color='firebrick')
        axes.set_xlim((rsvp_mn, rsvp_mx))
        axes.hlines(0, rsvp_mn, rsvp_mx, linestyles="dotted")
        axes.set_xlabel("Empirical v. Theoretical: Normal Distribution, Difference")
        
        plt.show()
        plt.close()
        

Note that the empirical and theoretical curves for the normal distribution are very close. The difference chart on the right shows how far apart the two are. Ideally we would see a flat horizontal line at y=0; here we can see our model is pretty good except around 20 RSVPs.

In [10]:
events.rsvp_count.describe()
        
Out[10]:
count    814.000000
        mean      10.968059
        std       10.296775
        min        1.000000
        25%        6.000000
        50%        9.000000
        75%       12.000000
        max      109.000000
        Name: rsvp_count, dtype: float64

This is a very large range, going from a single attendee to 109.

In [11]:
def freedman_diaconis(data):
            quartiles = stats.mstats.mquantiles(data, [0.25, 0.5, 0.75])
            iqr = quartiles[2] - quartiles[0]
            n = len(data)
            h = 2.0 * (iqr / n**(1.0/3.0))
            return int(h)
        
In [12]:
h = freedman_diaconis(events.rsvp_count)
        print("Freedman-Diaconis: ", h)
        mn = int(events.rsvp_count.min())
        mx = int(events.rsvp_count.max())
        bins = list(range(mn, mx, h))
        
        figure = plt.figure(figsize=(10, 6))
        
        axes = figure.add_subplot(1, 1, 1)
        axes.hist(events.rsvp_count, bins=bins, color="darkslategray")
        axes.set_title("Event RSVP distribution - Freedman-Diaconis")
        axes.set_xlabel("RSVP")
        
        plt.show()
        plt.close()
        
Freedman-Diaconis:  1
        

Let's generate a bootstrap posterior distribution of $\hat{\theta}$. We can write a simple function to do the bootstrap sampling for us. It takes the data, a metric function, and the number of bootstrap samples as arguments. We then call it with the Meetup data, the metric function np.mean, and 1000 bootstrap samples. This returns data we can use as our posterior distribution of the mean.

In [13]:
def bootstrap_sample(data, f, n=100):
            result = []
            for _ in range(n):
                sample = np.random.choice(data, len(data), replace=True)
                result.append(f(sample))
            return np.array(result)
        
In [14]:
posterior = bootstrap_sample(events.rsvp_count, np.mean, 1000)
        figure = plt.figure(figsize=(10, 6))  # width, height
        
        axes = figure.add_subplot(1, 1, 1)
        
        axes.hist(posterior, density=True)
        axes.set_ylabel("Density")
        axes.set_xlabel(r"$\hat{\theta}$")
        axes.set_title(r"Posterior Distribution of $\hat{\theta}$")
        
        plt.show()
        plt.close()
        

Furthermore, we can find the 90% Credible Interval (Bayesian Confidence Interval) for $\hat{\theta}$

In [15]:
stats.mstats.mquantiles(posterior, [0.05, 0.95])
        
Out[15]:
array([10.37374693, 11.58130221])

There is a 90% probability that the value of $\theta$ is between 10.4 and 11.6 based on the data.

Let's review our pairwise EDA with the target variable, RSVP:

  • venue city - clear differences across cities, with Bellevue standing out.
  • duration - no relationship at all.
  • event time - no obvious pattern.

Let's start with a linear model by looking at the correlation coefficients between each variable and the target. Since venue_city and event_time are categorical, we first need to create dummy variables for them.

In [16]:
events = pd.concat([events, pd.get_dummies(events["venue_city"], prefix="city")], axis=1)
        events = pd.concat([events, pd.get_dummies(events["event_time"], prefix="time")], axis=1)
        events.describe()
        
Out[16]:
rsvp_count waitlist_count duration venue_lat venue_lon city_Auburn city_Beijing city_Bellevue city_Bellevue city_Berlin ... time_19:15 time_19:30 time_19:45 time_20:00 time_20:30 time_20:45 time_21:00 time_21:15 time_22:00 time_23:59
count 814.000000 814.000000 814.000000 762.000000 762.000000 814.000000 814.000000 814.000000 814.000000 814.000000 ... 814.000000 814.000000 814.000000 814.000000 814.000000 814.000000 814.000000 814.000000 814.000000 814.000000
mean 10.968059 0.108108 2.474918 47.431800 -121.433413 0.001229 0.001229 0.280098 0.007371 0.001229 ... 0.003686 0.029484 0.001229 0.013514 0.003686 0.001229 0.006143 0.001229 0.001229 0.001229
std 10.296775 0.853678 2.740083 2.998231 11.537341 0.035050 0.035050 0.449323 0.085590 0.035050 ... 0.060634 0.169263 0.035050 0.115530 0.060634 0.035050 0.078181 0.035050 0.035050 0.035050
min 1.000000 0.000000 0.000000 0.000000 -122.519081 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.000000 0.000000 1.000000 47.598366 -122.325502 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 9.000000 0.000000 2.000000 47.618492 -122.206394 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 12.000000 0.000000 3.000000 47.659306 -122.152388 0.000000 0.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 109.000000 13.000000 38.500000 48.415180 116.407394 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 126 columns

In [17]:
def correlations(data, y, xs):
            rs = []
            rhos = []
            for x in xs:
                r = stats.pearsonr(data[y], data[x])[0]
                rs.append(r)
                rho = stats.spearmanr(data[y], data[x])[0]
                rhos.append(rho)
            return pd.DataFrame({"feature": xs, "r": rs, "rho": rhos})
        

In the Explore step of the assignment we selected the top four most frequented cities as well as the most common times.

In [18]:
correlations(events, "rsvp_count", ["city_Seattle","city_Issaquah","city_Redmond","city_Bellevue",
                                            "time_16:00","time_18:00","time_10:00","time_19:00", "duration"])
        
Out[18]:
feature r rho
0 city_Seattle -0.029668 0.005457
1 city_Issaquah -0.062954 -0.084100
2 city_Redmond 0.025708 0.087630
3 city_Bellevue 0.155868 0.128478
4 time_16:00 0.079179 0.050022
5 time_18:00 0.182046 0.087944
6 time_10:00 -0.007607 0.019013
7 time_19:00 -0.050407 -0.028687
8 duration 0.030478 0.106092

At a glance, there doesn't seem to be any strong or linear relationship. Let's attempt a linear regression model anyway.

In [19]:
import models
        model = "rsvp_count ~ duration + city_Seattle"
        result1 = models.bootstrap_linear_regression(model, data=events)
        models.describe_bootstrap_lr(result1)
        
Out[19]:

Model: rsvp_count ~ duration + city_Seattle

95% BCI
Coefficients Mean Lo Hi
$\beta_{0}$ 10.93 10.06 12.18
duration $\beta_{1}$ 0.11 -0.03 0.37
city_Seattle $\beta_{2}$ -0.60 -1.91 0.69
Metrics Mean Lo Hi
$\sigma$ 10.30 8.11 13.08
$R^2$ 0.00 0.00 0.01

:( I am running into a Python problem where the model cannot process the event time variable.
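My guess at the cause (an assumption, not verified in the original analysis): patsy's formula parser treats `:` as the interaction operator, so dummy columns like `time_18:00` cannot appear in a formula as-is. Renaming them to formula-safe names would be one workaround:

```python
import pandas as pd

# Hypothetical frame with a colon in a dummy-column name, like get_dummies produces here
df = pd.DataFrame({"rsvp_count": [5, 8, 12], "time_18:00": [1, 0, 1]})

# Replace the characters that the formula parser cannot handle
df = df.rename(columns=lambda c: c.replace(":", "_"))
print(list(df.columns))
```

Alternatively, quoting the name inside the formula with patsy's `Q()` (e.g. `Q('time_18:00')`) might also work.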

In [20]:
def plot_residuals(result, variables):
            figure = plt.figure(figsize=(20,6))
        
            plots = len( variables)
            rows = (plots // 3) + 1
        
            residuals = np.array([r[0] for r in result["residuals"]])
            limits = max(np.abs(residuals.min()), residuals.max())
            
            n = result["n"]
            for i, variable in enumerate( variables):
                axes = figure.add_subplot(rows, 3, i + 1)
        
                keyed_values = sorted(zip(events[variable].values, residuals), key=lambda x: x[ 0])
                ordered_residuals = [x[ 1] for x in keyed_values]
        
                axes.plot(list(range(0, n)), ordered_residuals, '.', color="dimgray", alpha=0.75)
                axes.axhline(y=0.0, xmin=0, xmax=n, c="firebrick", alpha=0.5)
                axes.set_ylim((-limits, limits))
                axes.set_ylabel("residuals")
                axes.set_xlabel(variable)
        
            plt.show()
            plt.close()
            
            return residuals
        
In [21]:
residuals1 = plot_residuals(result1, ["duration", "city_Seattle", "time_18:00"])
        

We saw earlier that RSVP is not normally distributed and that this might cause some problems, but we can correct for it by log-transforming the data.

In [22]:
events["log_rsvp"] = events["rsvp_count"].apply(np.log)
        figure = plt.figure(figsize=(20, 6))
        axes = figure.add_subplot(1, 1, 1)
        axes.hist(events.log_rsvp)
        axes.set_title("Log(RSVP) Histogram")
        plt.show()
        plt.close()
        

This looks slightly better than before. It's not exactly normal but is fairly symmetric.
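One way to quantify "fairly symmetric" (my own check, not part of the assignment) is to compare skewness before and after the log transform; for a right-skewed variable the skew should shrink toward zero. A sketch with hypothetical counts:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed counts standing in for rsvp_count
counts = np.array([1, 2, 3, 5, 6, 8, 9, 12, 20, 109], dtype=float)

raw_skew = stats.skew(counts)          # skew of the raw counts
log_skew = stats.skew(np.log(counts))  # skew after the log transform
print(raw_skew, log_skew)
```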

In [23]:
model = "log_rsvp ~ duration + city_Seattle"
        result2 = models.bootstrap_linear_regression(model, data=events)
        models.describe_bootstrap_lr(result2)
        
Out[23]:

Model: log_rsvp ~ duration + city_Seattle

95% BCI
Coefficients Mean Lo Hi
$\beta_{0}$ 2.13 2.03 2.20
duration $\beta_{1}$ 0.01 -0.00 0.04
city_Seattle $\beta_{2}$ -0.00 -0.12 0.10
Metrics Mean Lo Hi
$\sigma$ 0.70 0.66 0.75
$R^2$ 0.00 0.00 0.02

In [24]:
residuals_final= plot_residuals(result2, ["duration", "city_Seattle"])
        
In [25]:
figure = plt.figure(figsize=(20,6))
        axes = figure.add_subplot(1,1,1)
        axes.hist(residuals_final, bins=50)
        axes.set_title("Residuals Histogram")
        plt.show()
        plt.close()  
        

To compare this linear model with the null model, we will perform three rounds of 10-fold cross validation, estimating $R^2$ and $\sigma$ each round. Let's write a function that chunks xs into n chunks and another that actually does the cross validation.

In [26]:
def chunk(xs, n):
            k, m = divmod(len(xs), n)
            return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]
        
In [27]:
def cross_validation(algorithm, formula, data, evaluate, fold_count=10, repetitions=3):
            indices = list(range(len(data)))
            metrics = []
            for _ in range(repetitions):
                random.shuffle(indices)
                folds = chunk(indices, fold_count)
                for fold in folds:
                    test_data = data.iloc[fold]
                    # keep only the rows that are not in the test fold
                    train_indices = [idx for idx in indices if idx not in fold]
                    train_data = data.iloc[train_indices]
                    result = algorithm(formula, data=train_data)
                    model = result["model"]
                    y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
                    # y = np.ravel(y)  # might be needed for logistic regression
                    results = models.summarize(formula, X, y, model)
                    metric = evaluate(results)
                    metrics.append(metric)
            return metrics
        
In [28]:
formula = "rsvp_count ~ duration + city_Seattle"
        result = cross_validation(models.linear_regression, formula, events, lambda r: (r["sigma"], r["r_squared"]))
        

Unfortunately, my formula is bad.

In [29]:
print(r"95% CI for sigma:", stats.mstats.mquantiles([r[0] for r in result], [0.025, 0.975]))
        print(r"95% CI for R^2:", stats.mstats.mquantiles([r[1] for r in result], [0.025, 0.975]))
        
95% CI for sigma: [ 5.07680505 16.32984892]
        95% CI for R^2: [-0.14224067  0.01117915]
        

I wanted to evaluate this model against data it was not trained on; the 10-fold cross validation simulates applying the model to unseen data.

The cross validation results simulate the range of outcomes we should expect, and this is not looking good for us. :(

The following functions are all taken from course lab solutions.

In [30]:
from collections import defaultdict
        
        def data_collection():
            result = dict()
            result[ "train"] = defaultdict( list)
            result[ "test"] = defaultdict( list)
            return result
        
        def learning_curves(algorithm, formula, data, evaluate, fold_count=10, repetitions=3, increment=1):
            indices = list(range(len( data)))
            results = data_collection()
            for _ in range(repetitions):
                random.shuffle(indices)
                folds = chunk(indices, fold_count)
                for fold in folds:
                    test_data = data.iloc[ fold]
                    train_indices = [idx for idx in indices if idx not in fold]
                    train_data = data.iloc[train_indices]
                    for i in list(range(increment, 100, increment)) + [100]: # ensures 100% is always picked.
                        # the indices are already shuffled so we only need to take ever increasing chunks
                        train_chunk_size = int( np.ceil((i/100)*len( train_indices)))
                        train_data_chunk = data.iloc[train_indices[0:train_chunk_size]]
                        # we calculate the model
                        result = algorithm(formula, data=train_data_chunk)
                        model = result["model"]
                        # we calculate the results for the training data subset
                        y, X = patsy.dmatrices( formula, train_data_chunk, return_type="matrix")
                        result = models.summarize(formula, X, y, model)
                        metric = evaluate(result)
                        results["train"][i].append( metric)
                        
                        # we calculate the results for the test data.
                        y, X = patsy.dmatrices( formula, test_data, return_type="matrix")
                        result = models.summarize(formula, X, y, model)
                        metric = evaluate(result)
                        results["test"][i].append( metric)
                    #
                #
            # process results
            # Rely on the CLT...
            statistics = {}
            for k, v in results["train"].items():
                statistics[ k] = (np.mean(v), np.std(v))
            results["train"] = statistics
            statistics = {}
            for k, v in results["test"].items():
                statistics[ k] = (np.mean(v), np.std(v))
            results["test"] = statistics
            return results
        
        def results_to_curves( curve, results):
            all_statistics = results[ curve]
            keys = list( all_statistics.keys())
            keys.sort()
            mean = []
            upper = []
            lower = []
            for k in keys:
                m, s = all_statistics[ k]
                mean.append( m)
                upper.append( m + 2 * s)
                lower.append( m - 2 * s)
            return keys, lower, mean, upper
        
        def plot_learning_curves( results, metric, zoom=False):
            figure = plt.figure(figsize=(10,6))
        
            axes = figure.add_subplot(1, 1, 1)
        
            xs, train_lower, train_mean, train_upper = results_to_curves( "train", results)
            _, test_lower, test_mean, test_upper = results_to_curves( "test", results)
        
            axes.plot( xs, train_mean, color="steelblue")
            axes.fill_between( xs, train_upper, train_lower, color="steelblue", alpha=0.25, label="train")
            axes.plot( xs, test_mean, color="firebrick")
            axes.fill_between( xs, test_upper, test_lower, color="firebrick", alpha=0.25, label="test")
            axes.legend()
            axes.set_xlabel( "training set (%)")
            axes.set_ylabel( metric)
            axes.set_title("Learning Curves")
        
            if zoom:
                y_lower = int( 0.9 * np.amin([train_lower[-1], test_lower[-1]]))
                y_upper = int( 1.1 * np.amax([train_upper[-1], test_upper[-1]]))
                axes.set_ylim((y_lower, y_upper))
        
            plt.show()
            plt.close()
        
In [31]:
formula = "rsvp_count ~ city_Seattle"
        result = learning_curves(models.linear_regression, formula, events, lambda r: r["sigma"])
        
In [32]:
plot_learning_curves(result, r"$\sigma$")
        

Despite the bad model, the training and test curves (red and blue lines) converged. We can attempt to validate these findings another way. Let's split the data into training and test sets and fit a decision tree to predict RSVP counts.

In [33]:
from sklearn.model_selection import train_test_split
        from sklearn import tree
        
        y = events['rsvp_count']
        X = events[['city_Seattle', 'city_Redmond', 'city_Bellevue']]
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)
        clf = tree.DecisionTreeRegressor(random_state=0)
        clf = clf.fit(X_train, y_train)
        clf.score(X_test, y_test)
        
Out[33]:
0.014441419978661107

Let's use validation curves to estimate the best tree depth.

In [34]:
from sklearn.tree import DecisionTreeRegressor
        from sklearn.model_selection import validation_curve
        
        # Depth of one to ten
        param_range = range(1, 11)
        
        train_scores, test_scores = validation_curve(
            tree.DecisionTreeRegressor(),
            X,
            y,
            param_name="max_depth",
            param_range=param_range,
        )
        
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        
        plt.figure(figsize=(10, 6))
        plt.title("Validation Curve with sklearn.tree.DecisionTreeRegressor")
        plt.xlabel("Depth of Tree")
        plt.ylabel("$R^2$")
        plt.plot(param_range, train_scores_mean, label="Training score", color="darkorange")
        plt.fill_between(param_range, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.2, color="darkorange")
        plt.plot(param_range, test_scores_mean, label="Cross-validation score", color="navy")
        plt.fill_between(param_range, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.2, color="navy")
        plt.legend()
        
Out[34]:
<matplotlib.legend.Legend at 0x166ba08f400>

There doesn't appear to be an obvious optimal depth. Let's see how this model does anyway.

In [35]:
from sklearn.metrics import r2_score
        
        # 4 is assumed to be the optimal depth based on the validation curves above
        regressor = DecisionTreeRegressor(max_depth = 4)
        regressor.fit(X_train, y_train)
        
        y_pred = regressor.predict(X_test)
        r2_score(y_test, y_pred)  
        
Out[35]:
0.014441419978661107

This model is crap, but it's still better than...

In [36]:
model = "rsvp_count ~ duration + city_Seattle"
        result1 = models.bootstrap_linear_regression(model, data=events)
        models.describe_bootstrap_lr(result1)
        
Out[36]:

Model: rsvp_count ~ duration + city_Seattle

95% BCI
Coefficients Mean Lo Hi
$\beta_{0}$ 10.93 9.96 11.96
duration $\beta_{1}$ 0.11 -0.02 0.32
city_Seattle $\beta_{2}$ -0.60 -1.71 0.55
Metrics Mean Lo Hi
$\sigma$ 10.30 7.62 12.37
$R^2$ 0.00 0.00 0.01

Let's look at the feature importances. The first thing we want to do is get the numerical feature importances and then make a list of tuples pairing each variable with its importance.

In [37]:
feature_list = list(X.columns)
        importances = list(regressor.feature_importances_)
        feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
        
        # Sort and print the feature importances
        feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
        [print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
        
Variable: city_Bellevue        Importance: 0.8
        Variable: city_Seattle         Importance: 0.11
        Variable: city_Redmond         Importance: 0.09
        

This is interesting. I thought Seattle would be more important as there are more fun activities to host there.

In [38]:
x_values = list(range(len(importances)))
        plt.bar(x_values, importances, orientation = 'vertical')
        plt.xticks(x_values, feature_list, rotation='vertical')
        plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
        

This could be an indication that members are more likely to RSVP when an event takes place in Bellevue.