Google Data Analytics Professional Certificate Capstone Project Deep Dive
Cyclistic Case Study
Introduction
Hola! Google Data Analytics Professional Certificate, I have learned a lot of analytical skills with the help of many different tools. For Case Study 1 in Capstone Project, I am working as a junior data analyst for the Cyclistic marketing team. Cyclistic is a bike-share company in Chicago.
To answer key questions I followed the six steps of the data analysis process taught in the course Google Data Analytics Professional Certificate which are: Ask, Prepare, Process, Analyze, Share, and Act.
About The Company
In 2016, Cyclistic started a bike-share service that was a big hit. Since then, the program has grown to include 5,824 bikes that are tracked and locked into a network of 692 stations across Chicago. The bikes can be released at one station and returned to any other station in the system at any time.
Scenario
“Converting Casual Riders into Annual Members”
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program.
1. Ask
Identifying the business task:
Strategy to maximize the number of yearly membership programe by converting casual and daily basis riders into annual riders
Consider key stakeholders:
Lily Monero & the Executive team
Stakeholder perspective:
Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.She thinks that instead of making a marketing campaign for all new customers, it would be better to try to turn casual riders into members.
Three Questions To Analyze:
- How do annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to influence casual riders to become members?
Monero has assigned the first question for the junior data analyst to analyze.
2. Prepare
Data Source:
You will use Cyclistic’s trip data from the past to look for patterns and analyze them. Here, you can get Cyclistic trip data from the last 12 months. The data has been made available by Motivate International Inc. under this license.
Data Organization & Description:
- Creating the main folder with the name of YYYY-MM-DD_Cyclist_Exercise
- Download 4 zip files whose names are:
Divvy_Trips_2019_Q2.csv
Divvy_Trips_2019_Q3.csv
Divvy_Trips_2019_Q4.csv
Divvy_Trips_2020_Q1.csv
- Extract the Zip files then put all the csv files into the main folder.
- File Content: Each Excel file contains 12 to 13 columns containing information related to rider id, ridership type, ride time and location and location, etc.
Data Security:
- Riders’ personal identifiable information is hidden through tokenization.
- Original files are backed up in a separate folder.
Data Limitations:
As the riders’ personal information is kept secret, Cyclistic won’t be able to link pass purchases to credit card numbers to find out if casual riders live in the Cyclistic service area or if they bought more than one single pass.
3. Process
Tool I have selected for data verification and cleaning:
- Using R
Why R?
If I use Excel:
You know, each csv file has 426,888 rows and 13 columns. That is a big number. We know that Excel can handle 10 million rows, but I try to process data and I have a lot of lack issues. That’s why I don’t use Excel for this reason. If you’re working with large amounts of data, I recommend avoiding Excel and Google Sheets.
If I use SQL:
We know that SQL is for managing large data sets… For the process of analyzing the data, SQL is best, but you don’t make any visualization using SQL. That’s why you have to choose Tableau for Viszuliation from the data and take insight through this visualization.
If I use R:
Are two reasons why I’m using R, I do data compiling, cleaning, merging, and visualization using R.
R is a language for writing programs about statistics. It does a great job of gaining insight (especially statistical insight) from data that has already been taken out of where it was stored (file, database, etc.) R can be used to run models or make graphs.
SQL and R are best used together for the whole process of Data Analytics.
Data Verification & Cleaning
In R for Wrangling the data (cleaning data), we first install this package
Install required packages
tidyverse for data import and wrangling
lubridate helps wrangle date attributes
Run this code in R
install.packages(tidyverse)
install.packages(lubridate)
library(tidyverse)
library(lubridate)
STEP 1: COLLECT DATA
Here earlier we created a main folder YYYY-MM-DD_Cyclist_Exercise on your PC, now we import data from your system using R, here we follow these steps.
Upload YYYY-MM-DD_Cyclist_Exercise datasets (csv files) here
First, we check our working directory through getwd() function
getwd()
after running this code we see the console panel here I attached the image below.
So here know our working directory now we know we are create a main folder in this address let me explain in the image below
Here my file exists in my document folder “C:/Users/wajiz.pk/OneDrive/Documents”
Now we are setting this working directory through setwd() function because we import csv files from this folder location that’s why we are setting this working directory.
setwd(“C:/Users/wajiz.pk/OneDrive/Documents/2022–09–15_Cyclist_Exercise”)
Now we are ready to upload files from this folder so now we are using the read_csv function
q2_2019 <- read_csv(“Divvy_Trips_2019_Q2.csv”)
q3_2019 <- read_csv(“Divvy_Trips_2019_Q3.csv”)
q4_2019 <- read_csv(“Divvy_Trips_2019_Q4.csv”)
q1_2020 <- read_csv(“Divvy_Trips_2020_Q1.csv”)
The operator <- is to assign to a variable in the same environment. Why we use this operator because we want a small and understandable name for working on this project.
Now we can upload files run by one, then we see the successful upload files in Environment section, see the image below for better understanding.
After running this code we see the file uploaded successfully, in the Console panel this csv file has 1108163 rows and 12 columns in it further we see column specifications and etc.
STEP 2: WRANGLE DATA AND COMBINE IT INTO A SINGLE FILE
Now we compare column names of each of the files because we see each column's name and its specification before combining all files into one file.
colnames(q3_2019)
colnames(q4_2019)
colnames(q2_2019)
colnames(q1_2020)
As we run this colnames() function we see all file's columns names respectively.
Rename columns to make them consistent with q1_2020 (as this will be the supposed going-forward table design)
rename() function
After remaining we inspect the dataframe and look for incongruence, using the structure str() function.
str(q1_2020)
We are changing the columns name with respect to the q1_2020 file. Now we see the structure of this file, we see the columns specification, and we compare it with all 3 files. In the image below, we see the columns specification of all chr, num, and posixct.
Highlighted in Yellow are all columns and his environment.
str(q4_2019)
Here is the image highlighted in blue. These two column names are the same as in q1_2020, but the specification of the column is different in q1_2020. These two columns are in chr form but here in the num function. This identifies the issue. Now see all the 2 files, and we can check what we can do next.
str(q3_2019)
This is the same issue we saw in q4_2019, and you can easily compare it to the q1_2020 structure.
str(q2_2019)
Same issues like q3_2019 and q4_2019
Above we see the structure of all 4 csv files. Now convert ride_id and rideable_type to characters so that they can stack correctly.
mutate() adds new variables and preserves existing ones
Using mutate() function
Run all to change the rider_id and riderable_type column specification into character (chr).
Now our data arrangement is done we doing all great now we are ready to combine all data in one big dataframe in all_trips. we can use the bind_rows() function to bind together two data frames by their rows.
all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)
Now we bind together in all_trips
Now check the column name of all_trips
colnames(all_trips)
We are removing this column in the marked below image because this data was dropped beginning in 2020.
Pipe %>% function is an extremely useful tool that allows you to express a sequence of multiple operations. They can greatly simplify your code and make your operations more intuitive.
STEP 3: CLEAN UP DATA AND ADD DATA TO PREPARE FOR ANALYSIS
Inspect the new table that has been created we can check columns, rows, dimensions, structure, summary and etc.
List of column names
colnames(all_trips)
How many rows are in the data frame?
nrow(all_trips)
3879822
Dimensions of the data frame?
dim(all_trips)
3879822 9
See the first 6 rows of data frame. Also tail(all_trips)
head(all_trips)
See a list of columns and data types (numeric, character, etc)
str(all_trips)
Statistical summary of data. Mainly for numerics
summary(all_trips)
There are a few problems we will need to fix:
1) In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.
Begin by seeing how many observations fall under each usertype
table(all_trips$member_casual)
Reassign to the desired values (we will go with the current 2020 labels)
Check to make sure the proper number of observations were reassigned
table(all_trips$member_casual)
2) The data can only be aggregated at the ride level, which is too granular. We will want to add some additional columns of data — such as day, month, and year — that provide additional opportunities to aggregate the data.
Now Put the date, month, day, and year of each ride in a column.
Using the as.date() function in this activity, the $ operator can be used to select a variable or column, give a variable or column a new value, or do both.
With the format() function, you can use the following symbols to print dates.
The default format is yyyy-mm-dd
all_trips$date <- as.Date(all_trips$started_at)
We are successfully adding a date column only date from the started_at column
all_trips$month <- format(as.Date(all_trips$date), “%m”)
We are successfully adding a month column in the month(00–12) format
all_trips$day <- format(as.Date(all_trips$date), “%d”)
We are successfully adding the day column in (day as number 0–31) format
all_trips$year <- format(as.Date(all_trips$date), “%Y”)
We are successfully adding a year column in (4-digit year) format
all_trips$day_of_week <- format(as.Date(all_trips$date), “%A”)
We are successfully added the day_of_week column in (unabbreviated weekday) format
3) We will want to add a calculated field for the length of a ride since the 2020Q1 data did not have the “tripduration” column. We will add “ride_length” to the entire dataframe for consistency.
Now add a “ride_length” calculation to all_trips (in seconds)
Here we learn a new function called difftime This function returns the difference between two objects of data/time in the specified unit. It is specially used for POSIXct objects. As we know are started_at and ended_at columns in POSIXCT object that’s why we using difftime() function.
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
We have successfully added the ride_lenght column for the difference in rider length in seconds.
Inspect the data structure of columns
str(all_trips)
We are adding six columns, check the yellow highlighted mention below the image.
In the above image, we can easily see the rider_lenght column specification is ‘difftime’ num.
So Convert “ride_length” from Factor to numeric so we can run calculations on the data. We are checking the column type
is.factor(all_trips$ride_length)
FALSE
rider_length column is not a factor.
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
The code is executed now we check the rider_lenght column type.
is.numeric(all_trips$ride_length)
TRUE
4) There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides
Remove “bad” data
The dataframe includes several rider_lenght in negative and start_station_name with HQ QR name we all want to delete this type of rows now we using this mention below the code.
We will create a new version of the dataframe (v2) since data is being removed
all_trips_v2 <- all_trips[!(all_trips$start_station_name == “HQ QR” | all_trips$ride_length<0),]
We successfully delete rows and put this data in all_trips_v2
4,5 Analyze & Share
STEP 4: CONDUCT DESCRIPTIVE ANALYSIS
Install packages required is ggplot2
install.packages(ggplot2)
library(ggplot2)
Here we perform descriptive analysis on ride_length (all figures in seconds)
straight average (total ride length / rides)
mean(all_trips_v2$ride_length)
1479.139
midpoint number in the ascending array of ride lengths
median(all_trips_v2$ride_length)
712
longest ride
max(all_trips_v2$ride_length)
9387024
shortest ride
min(all_trips_v2$ride_length)
1
Here we summarize all of these things easily from the summary() function You can condense the four lines above to one line using summary() on the specific attribute
summary(all_trips_v2$ride_length)
Now we compare members and casual users here we use the aggregate() function to get a summary of the data by group. The numbers include mean, minimum, maximum, sum, and so on.
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
Compare the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
We Notice that the days of the week are out of order. Let’s fix that
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c(“Sunday”, “Monday”, “Tuesday”, “Wednesday”, “Thursday”, “Friday”, “Saturday”))
Code Executed again running, Average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
Analyze ridership data by type and weekday
Now here we create the weekday field using the wday() function, group by usertype and weekday, and calculate the average duration, then sort.
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
Let’s visualize the number of rides by rider usertype
Here we pipe the ggplot function with this upper code now we get the graph in which weekday on the x-axis and number_of_rides on the y-axis
Creating a Visualization for Average Duration
6. Act
Summary of the insights gained from ggplot Visualization
Based on the above findings,
The first graph no.of rides as per usertype means causal vs member. Here we see that in the chart, members are greater than causal users, but during the weekend, casual users are more active than members. Members are usually the same all day as per the chart.
The second graph average duration of rides shows that causal users’ rides through the week are very high as compared to members.
These two graphs show that the number of riders taking members on weekdays is lower than the average rider’s average duration, implying that they ride more than members over the course of a week.
Suggestions
As the weekend is often the peak period for casual riders, the marketing campaign can offer weekend-only memberships at a lower price to get casual riders to become members.
Changes could be made to the membership subscription, like charging less as the length of the ride increases. This gives the members more reason to ride their bikes for longer distances. With these changes, it could also encourage casual riders to join so they can get discounts on ride lengths.
Reward: Depending on how long someone has been a subscriber, they may get some kind of reward. This factor also encourages the causal member to become a member and the members join this cyclist subscription for a longer time for this reward system.
If you want to learn with me follow me in: Newsletter, Website, and Twitter.