Finding the perfect name is a topic of discussion for nearly every couple naming a child. Some names have always been popular, while others pass through fads and trends. I want to look into the metrics and statistics of different names and analyze the difference between male and female name trends.
I want to study the change in proportion of names each year that are in the top 100 and 500 most popular, how these change over time, and what it could mean, along with other metrics that go along with naming styles.
One thing I am very interested in is fad names. I often remark that some names are no longer given to new babies in the current generation. Take the name Shirley: it is now often considered a grandmother’s name, one that few people in my parents’ generation, or my own, have. But the names Mary and Elizabeth seem to be common among girls of many different generations, and John and Michael seem to have transcended generations completely. This suggests that Shirley, and names like it, were fad names in my grandparents’ generation. I want to view the distribution of fad versus timeless names among girls and boys in the US across time and see whether there is a difference between them, since it seems to me that female names are more “generational” than male names.
I would also like to see how the diversity of name choices has changed over time.
The data I am using comes from the Social Security Administration’s Baby Names Data. Each row records a name’s frequency for one year and sex, with the attributes year, sex, name, and frequency. The data begins in 1910; every time a family registers a child through Social Security, that child is counted in the data for that year.
Note that many people born before 1937 never applied for a Social Security card, so their names are not included. This could bias the figures from before 1937, which is worth keeping in mind whenever we track name metrics over time.
Immediately I have to make a choice between two datasets. There is the compiled dataset mentioned above, which includes only year, sex, name, and frequency. Or there is a dataset I could scrape directly from the Social Security site, which includes the frequencies per year per state. The drawback of the state data is that names with a frequency below 5 are omitted to protect the privacy of those individuals.
I ultimately chose to forgo the state information and use the dataset with all names included, in the hope that it more accurately reflects US naming proportions.
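To make the drawback of a frequency threshold concrete, here is a minimal base-R sketch. The counts are made up for illustration (they are not SSA figures), and the cutoff of 5 is simply the rule described above.

```r
# Toy counts for one state and year (made-up numbers, not SSA data).
counts <- c(Mary = 120, Helen = 45, Zelda = 4, Odessa = 3, Ione = 2)

# Assumed publication rule: keep only names given at least 5 times.
published <- counts[counts >= 5]

total_all <- sum(counts)           # births actually recorded
total_published <- sum(published)  # births visible after the cutoff

# Share of births that disappears because of the threshold.
share_hidden <- 1 - total_published / total_all
print(round(share_hidden, 3))
```

The rarer a name, the more likely it is to vanish entirely, so any diversity measure built on thresholded data would understate the number of unique names.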
For all the plots, the blue lines represent the male data, and the red lines represent the female data.
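# Load the packages used throughout: dplyr for the pipelines, ggplot2 for the plots
library(dplyr)
library(ggplot2)
names_df <- read.csv("usa_names.csv", colClasses=c("numeric", "numeric", "character", "character", "numeric") )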
names_df <- names_df %>%
select(year, gender, name, number) %>%
arrange(year, name)
head(names_df)
## year gender name number
## 1 1910 M Aaron 111
## 2 1910 F Abbie 28
## 3 1910 M Abe 31
## 4 1910 M Abner 12
## 5 1910 M Abraham 138
## 6 1910 M Abram 10
Questions to Contemplate and Analyze-
names_top_100_male <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number)) %>%
group_by(year) %>%
slice(1:100)
names_top_500_male <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number)) %>%
group_by(year) %>%
slice(1:500)
names_df_total_births_per_year_male <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number)) %>%
group_by(year) %>%
summarise(total_births = sum(number))
names_top_100_female <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number)) %>%
group_by(year) %>%
slice(1:100)
names_top_500_female <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number)) %>%
group_by(year) %>%
slice(1:500)
names_df_total_births_per_year_female <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number)) %>%
group_by(year) %>%
summarise(total_births = sum(number))
## Above we have the top 100 (and top 500) male and female names per year. Dividing each name's count by the total births
# for that gender and year gives the proportion of that year's babies who received each of those names.
prop_male <- names_top_100_male %>% inner_join(names_df_total_births_per_year_male, by="year") %>%
mutate(prop = number / total_births)
prop_female <- names_top_100_female %>% inner_join(names_df_total_births_per_year_female, by="year") %>%
mutate(prop = number / total_births)
prop_male_500 <- names_top_500_male %>% inner_join(names_df_total_births_per_year_male, by="year") %>%
mutate(prop = number / total_births)
prop_female_500 <- names_top_500_female %>% inner_join(names_df_total_births_per_year_female, by="year") %>%
mutate(prop = number / total_births)
#These proportion entities are extremely valuable, we now have the total proportion that each name in the top 100 names represents of the total name pool per year.
head(prop_male)
## # A tibble: 6 x 6
## # Groups: year [1]
## year gender name number total_births prop
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1910 M John 11450 164228 0.0697
## 2 1910 M James 9192 164228 0.0560
## 3 1910 M William 8844 164228 0.0539
## 4 1910 M Robert 5609 164228 0.0342
## 5 1910 M George 5441 164228 0.0331
## 6 1910 M Joseph 5226 164228 0.0318
head(prop_female)
## # A tibble: 6 x 6
## # Groups: year [1]
## year gender name number total_births prop
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1910 F Mary 22848 352087 0.0649
## 2 1910 F Helen 10479 352087 0.0298
## 3 1910 F Margaret 8222 352087 0.0234
## 4 1910 F Dorothy 7314 352087 0.0208
## 5 1910 F Ruth 7209 352087 0.0205
## 6 1910 F Anna 6433 352087 0.0183
prop_year_100_male <- prop_male %>%
group_by(year) %>%
summarise(prop_100 = sum(prop))
prop_year_100_female <- prop_female %>%
group_by(year) %>%
summarise(prop_100 = sum(prop))
ggplot() +
geom_line(data=prop_year_100_male, aes(x=year, y=prop_100), color="Blue") +
geom_line(data=prop_year_100_female, aes(x=year, y=prop_100), color="Red") +
labs(title="Percentage of Babies With Names in the Top 100 Most Popular Names Per Year",
x = "Year",
y = "Percentage %")
1a. Top 100 Names: We can now answer our first question. In the graph above we can see that before the 1950s (mostly the pre-WWII era), a noticeably higher proportion of babies of each gender were given one of the 100 most popular names. There is a significant dip in 1950 for both male and female baby names, and another strong dip in 1986-1987 for both. It seems that 1910 and 1945 were the peaks of conformity in naming children.
Fewer babies being given one of the 100 most popular names points to potentially higher diversity in names, which will be interesting to explore.
We also notice that throughout the time frame (1910-2016), the share of male babies with names in the top 100 is fairly consistently around 10% higher than the share of female babies. I would like to look further into the difference between male and female name homogeneity, as well as why the proportion of parents giving their children the most popular names is declining.
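The quantity plotted above, the share of a year's births covered by the N most popular names, can be sketched on a toy table. This is a base-R illustration with made-up counts; `top_n_share` is a hypothetical helper, not part of the analysis code.

```r
# Toy data: one year, five names (made-up counts).
toy <- data.frame(
  year   = 1910,
  name   = c("John", "James", "William", "Abner", "Abram"),
  number = c(500, 300, 150, 30, 20)
)

# Share of births covered by the n most popular names in a one-year table.
top_n_share <- function(df, n) {
  ordered <- df[order(-df$number), ]
  sum(head(ordered$number, n)) / sum(df$number)
}

share_top3 <- top_n_share(toy, 3)
print(share_top3)  # (500 + 300 + 150) / 1000 = 0.95
```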
prop_year_500_male <- prop_male_500 %>%
group_by(year) %>%
summarise(prop_500 = sum(prop))
prop_year_500_female <- prop_female_500 %>%
group_by(year) %>%
summarise(prop_500 = sum(prop))
ggplot() +
geom_line(data=prop_year_500_male, mapping = aes(x=year, y=prop_500), color="Blue") +
geom_line(data=prop_year_500_female, mapping = aes(x=year, y=prop_500), color="Red") +
labs(title="Percentage of Babies With Names in the Top 500 Most Popular Names Per Year",
x = "Year",
y = "Percentage %")
1b. Top 500 Names: The top 500 name proportions are more interesting to me. We can see that in 1910 nearly 98-99% of all male babies, and likewise of all female babies, shared the same 500 names. There was a small downtrend until 1925, then a small upward trend until about 1950, after which the proportion has decreased significantly over time along what looks like an exponential (or at least non-linear) downward curve. One takeaway could be that for both male and female babies, names are becoming more diverse more rapidly as time passes.
It could also indicate that a larger number of names are becoming popular. Imagine there were 500 popular names in the past and babies were given only these names: then 100% of the population would have names in the top 500 for that year. But if there were 1000 popular names with a roughly uniform distribution among them, then only around 50% of the population would have a top-500 name. So another possible explanation is that, over time, the pool of names has grown and more of these names are being chosen.
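The thought experiment above can be written out in a few lines. This is an idealized sketch that assumes every name in the pool is chosen equally often, which real name data certainly does not satisfy; `uniform_top_share` is a hypothetical helper.

```r
# Share of babies whose name is in the top `top_n`, assuming each of
# `pool_size` names is chosen equally often (an idealized assumption).
uniform_top_share <- function(pool_size, top_n = 500) {
  min(top_n, pool_size) / pool_size
}

share_500_names  <- uniform_top_share(500)   # pool of exactly 500 names
share_1000_names <- uniform_top_share(1000)  # pool twice as large
print(c(share_500_names, share_1000_names))
```

Under this assumption the top-500 share falls purely because the pool grows, even if no individual name becomes less attractive.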
We also notice a widening gap between the male and female proportions in the most-popular-name category: at first glance, parents are giving girls popular names less and less often. This could mean there is a wider name pool for female babies, but in general it tells us that the proportion of parents giving their children the same few names is smaller for girls than for boys (the most popular boys’ names are more popular relative to the most popular girls’ names).
We can see that in 2016 (the most recent data year), about 76% of female babies and about 85% of male babies still received a name in the top 500.
In other words, in 1910 only around 1-2% of parents strayed outside the 500 most popular names when naming their child, but in 2016 about 19% did. Writing p_F and p_M for the female and male top-500 proportions, and n_F and n_M for the number of girls and boys born, the derivation is:
$$1 - \frac{p_F\, n_F + p_M\, n_M}{n_F + n_M} = 1 - \frac{0.76 \times 1435587 + 0.849 \times 1641986}{3077573} \approx 1 - 0.8075 = 0.1925$$
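The same weighted average can be checked directly in R. The proportions and birth counts are the ones quoted in the text; the variable names are mine, and the total is taken as the sum of the two birth counts.

```r
# Top-500 proportions and 2016 birth counts as quoted in the text.
female_share_500 <- 0.76
male_share_500   <- 0.849
females_born     <- 1435587
males_born       <- 1641986

# Total births, assuming it is simply the sum of the two counts.
total_born <- females_born + males_born

# Birth-weighted share of all babies with a top-500 name.
share_inside <- (female_share_500 * females_born +
                 male_share_500 * males_born) / total_born

# Share of babies named outside the top 500.
share_outside <- 1 - share_inside
print(round(c(share_inside, share_outside), 4))
```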
The next thing we can analyze is the diversity of names: are these changes in the proportion of popular names correlated with a higher overall number of names in the population? Let’s explore the total number of unique names given in each year.
#We already know how many males and females were named each year, but now we must find the number of unique names used
unique_names_per_year_male <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number)) %>%
group_by(year) %>% summarise(unique_names = n())
unique_names_per_year_female <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number)) %>%
group_by(year) %>% summarise(unique_names = n())
head(unique_names_per_year_male)
## # A tibble: 6 x 2
## year unique_names
## <dbl> <int>
## 1 1910 692
## 2 1911 754
## 3 1912 1114
## 4 1913 1256
## 5 1914 1494
## 6 1915 1740
head(unique_names_per_year_female)
## # A tibble: 6 x 2
## year unique_names
## <dbl> <int>
## 1 1910 1083
## 2 1911 1066
## 3 1912 1261
## 4 1913 1350
## 5 1914 1549
## 6 1915 1816
ggplot() +
geom_line(data=unique_names_per_year_male, aes(x=year, y=unique_names), color='blue') +
geom_line(data=unique_names_per_year_female, aes(x=year, y=unique_names), color='red') +
ylim(500,6500) +
labs(title="Number of Unique Names Per Year",
x = "Year",
y = "# of Unique Names")
This data could indicate that the decrease in the proportion of children with the 100 and 500 most popular names is due to a rise in the diversity of names and the growth of the potential name pool. This could be attributed to a more diverse population, a cultural shift toward wanting unique names, and other factors. We cannot draw any firm conclusions about the causes of these trends from this data alone.
To dig into this, I first want to explore the top 10 male and female names in any given year. I want to see visually whether the same names remain: are some names consistently popular, and do they change ranking quickly or remain more stable?
names_top_10_male <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number)) %>%
group_by(year) %>%
slice(1:10)
names_top_10_male %>%
ggplot(aes(x=year, y=number, color=name)) +
geom_line() +
theme(legend.position="none") +
ylim(0,100000) +
labs(title="Popularity of Male names over time",
x = "Year",
y = "# of Babies With Name")
names_top_10_female <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number)) %>%
group_by(year) %>%
slice(1:10)
names_top_10_female %>%
ggplot(aes(x=year, y=number, color=name)) +
geom_line() +
theme(legend.position="none") +
ylim(0,100000) +
labs(title="Popularity of Female names over time",
x = "Year",
y = "# of Babies With Name")
The female data has a few extreme peaks in name popularity and a few stand-out names at any one time. The female data seems to be more trend-oriented, while the male data seems to have more generation-crossing names. But we are only looking at the top 10 names, not the entire range of rankings.
We also see that the popularity of the top names has decreased drastically since 1990: the most popular names have a significantly smaller share of the market of names than they did before. We continue to see the pattern mentioned earlier, where 1945-1955 was a peak of homogeneity in naming children (in the form of a huge trend name for females and wide use of the more seemingly timeless names for males).
As a quick aside, I wanted to see what that huge trend name from 1945-1955 in the female chart was, and why almost a hundred thousand girls in 1947 were given it. With some manipulation of the data we see that the peak in 1947 is the name Linda. Let’s look at the name Linda on its own.
year_df <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number)) %>%
filter(year == 1947)
head(year_df)
## year gender name number
## 1 1947 F Linda 99685
## 2 1947 F Mary 71686
## 3 1947 F Patricia 51276
## 4 1947 F Barbara 48791
## 5 1947 F Sandra 34776
## 6 1947 F Carol 33538
Linda_df <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number)) %>%
filter(name == 'Linda')
Linda_df %>%
ggplot(aes(x=year, y=number, color=name)) +
geom_line() +
theme(legend.position="none") +
labs(title="Popularity of 'Linda' over time",
x = "Year",
y = "# of Babies With Name")
Now of course, I was curious about the rise and fall of “Linda”. With a quick Google search I learned that “Linda” is actually considered the trendiest name of all time. While researching why Linda was so popular, I found another data scientist who had looked at a similar dataset. David Taylor, a biotechnologist and blogger at Prooffreader.com, analyzed a names database and found Linda to be the trendiest name of all time by his metrics, where he “[took] into account both the swiftness with which a name enters and then exits the naming pool, as well as the intensity of its popularity. The names on his list are therefore ones that both had a sharp rise and fall and had a major impact.” His calculation measured trend using peak height and peak width.
Now I am interested in the ‘trendiness’ metric and how I can use the measurements of how quickly a name becomes popular and how quickly it becomes unpopular to create a trend-scale of names.
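To get a feel for what a height-and-width measure might look like, here is a purely hypothetical base-R sketch. It is not David Taylor's formula and not the metric built later in this analysis; the yearly shares are invented, and `peak_score` is an illustrative helper.

```r
# Made-up yearly shares of births for two hypothetical names.
fad_name      <- c(0.001, 0.004, 0.030, 0.050, 0.020, 0.003, 0.001)
timeless_name <- c(0.020, 0.022, 0.025, 0.026, 0.025, 0.023, 0.021)

# Hypothetical score: peak height divided by peak width, where width is
# the number of years at or above half of the peak value.
peak_score <- function(share) {
  height <- max(share)
  width  <- sum(share >= height / 2)
  height / width
}

fad_score      <- peak_score(fad_name)
timeless_score <- peak_score(timeless_name)
print(c(fad = fad_score, timeless = timeless_score))
```

A tall, narrow peak scores high and a broad plateau scores low, which is the intuition behind calling a name like Linda trendy.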
It turns out this name spiked in 1947 because of a hit song from 1946 called… “Linda”, written by Jack Lawrence. It is easy to imagine that the name dropped in popularity as quickly as it rose, following the fading popularity of the song, but this is a data science project, not a speculative assignment.
Now let’s look for a male trending name. I am curious about the purple peak of 1956, as it seems to have a quick rise to #1 and then a slower decline over time, while remaining one of the most popular names even today. I am guessing it does not stem from pop culture since it has stood the test of time; it is perhaps biblical or a common British name.
year_df <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number)) %>%
filter(year == 1956)
head(year_df)
## year gender name number
## 1 1956 M Michael 90620
## 2 1956 M James 84842
## 3 1956 M Robert 83905
## 4 1956 M David 81601
## 5 1956 M John 80759
## 6 1956 M William 58935
The name in question was Michael. Below we see its trend in popularity from 1910 to 2016.
Michael_df <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number)) %>%
filter(name == 'Michael')
Michael_df %>%
ggplot(aes(x=year, y=number, color=name)) +
geom_line() +
theme(legend.position="none") +
labs(title="Popularity of 'Michael' over time",
x = "Year",
y = "# of Babies With Name")
Looking at the top 10 data for both males and females, there seem to be fewer intense spikes and trends in the male data. I want to test whether there is a significant difference in the trends of baby names between males and females, but so far we are only looking at 10 names per gender and have not clearly defined a trend metric to study.
This leads us to our final exploratory question: do fad names occur more in males or females?
I aim to test this by looking at the relative change in the popularity of names across years. I will create what I call the trend meter. If a name’s popularity changes quickly, it is moving in or out of fashion, and it will score higher on the trend meter.
I first want to standardize all data.
zs_df_male <- names_top_500_male %>%
group_by(year) %>%
mutate(mean = mean(number)) %>%
mutate(sd = sd(number)) %>%
mutate(z = (number - mean) / sd) %>%
ungroup()
zs_df_female <- names_top_500_female %>%
group_by(year) %>%
mutate(mean = mean(number)) %>%
mutate(sd = sd(number)) %>%
mutate(z = (number - mean) / sd) %>%
ungroup()
While standardizing our data, we have to think about what it means, so let’s look at the top 500 most popular names in 1910. We compute our trend metrics on the top 500 because names below the top 500 aren’t making significant trends; a name reaching the top 500 is a reasonable point at which to start tracking it.
Let’s look at the most popular male name in 1910, John. The data is so skewed that the mean number of males per name is 326, while the standard deviation is 986 (3x as large as the mean). We also see that the name closest to this mean is Morris (ranked #91 in popularity). This means that the top 90/500 = 18% of our 500 most popular names are above the mean and 82% are below. The data is immensely skewed: the z-score for the #1 name is 11.27, a huge outlier.
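The effect of skew on z-scores can be seen on a small example. This is a base-R sketch with made-up counts (only the general shape, one dominant name and a long tail, mirrors the 1910 data).

```r
# Made-up counts with one dominant name and a long tail.
counts <- c(John = 11450, James = 900, William = 800, Abner = 12,
            Abram = 10, Alva = 8, Amos = 6, Asa = 5)

# Standardize: subtract the mean and divide by the standard deviation.
z <- (counts - mean(counts)) / sd(counts)

# With data this skewed, very few names sit above the mean.
above_mean <- names(counts)[counts > mean(counts)]

print(round(z, 2))
print(above_mean)
```

Because the mean is dragged upward by the single dominant name, almost every other name ends up with a negative z-score, which is the same pattern described for 1910.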
We will graph the distribution of male and female names in 1910 and 2016.
names_top_500_male_1910 <- names_top_500_male %>%
filter(year == 1910) %>%
mutate(toHighlight = ifelse( number >= 326.148, "yes", "no" ) )
names_top_500_female_1910 <- names_top_500_female %>%
filter(year == 1910) %>%
mutate(toHighlight = ifelse( number >= 691.130, "yes", "no" ) )
names_top_500_male_2016 <- names_top_500_male %>%
filter(year == 2016) %>%
mutate(toHighlight = ifelse( number >= 2788.098, "yes", "no" ) )
names_top_500_female_2016 <- names_top_500_female %>%
filter(year == 2016) %>%
mutate(toHighlight = ifelse( number >= 2186.564, "yes", "no" ) )
names_top_500_male_1910 %>%
ggplot(mapping = aes(x=reorder(name, -number), y=number, fill=toHighlight) ) +
geom_bar(stat = "identity") +
scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" ) +
labs(title="Distribution of popularity of top 500 Male Names in 1910",
x = "Name (ordered by popularity)",
y = "# of Babies With Name")
names_top_500_female_1910 %>%
ggplot(mapping = aes(x=reorder(name, -number), y=number, fill=toHighlight) ) +
geom_bar(stat = "identity") +
scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" ) +
labs(title="Distribution of popularity of top 500 Female Names in 1910",
x = "Name (ordered by popularity)",
y = "# of Babies With Name")
names_top_500_male_2016 %>%
ggplot(mapping = aes(x=reorder(name, -number), y=number, fill=toHighlight) ) +
geom_bar(stat = "identity") +
scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" ) +
labs(title="Distribution of popularity of top 500 Male Names in 2016",
x = "Name (ordered by popularity)",
y = "# of Babies With Name")
names_top_500_female_2016 %>%
ggplot(mapping = aes(x=reorder(name, -number), y=number, fill=toHighlight) ) +
geom_bar(stat = "identity") +
scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" ) +
labs(title="Distribution of popularity of top 500 Female Names in 2016",
x = "Name (ordered by popularity)",
y = "# of Babies With Name")
Above, the names whose counts are at or above the mean number of babies per name are highlighted in red. Using everything we have learned so far, we will now track ‘trendiness’.
We must formally define trendiness, and how to measure it.
Trendiness could be defined in multiple ways, for example through the change in a name’s z-score or in its proportion of births from year to year.
The problem with these is the following. We want to see how quickly a name becomes popular and unpopular again in order to measure a ‘fad’ name. But since we learned that popular names in general take up a shrinking proportion of all names, a popular name like “Michael” would look as if it were becoming less popular when in reality it keeps the same ranking while taking up a smaller proportion of the name pool. For this reason, tracking the change in z-score or the change in proportion doesn’t give accurate results: it would flag consistently popular names as fad names, since their proportion changes quickly.
Imagine we have three years of data and 5 names. If the first name was always rank 1 or 2, then its mean rank would be around 1.5 with a small standard deviation: this name is POPULAR, but it doesn’t follow a trend or fad.
We can look at another name that goes from rank 5, to rank 3, to rank 1. Its ranks would have a mean of 3 but a higher standard deviation, meaning the name went through a trend.
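The two hypothetical names above can be compared numerically. In this base-R sketch the rank sequences are illustrative: `steady_name` is one way a name could stay at rank 1 or 2, and `rising_name` is the 5, 3, 1 example from the text.

```r
# Ranks over three years for two hypothetical names.
steady_name <- c(1, 2, 1)  # always rank 1 or 2
rising_name <- c(5, 3, 1)  # climbs from rank 5 to rank 1

summary_table <- data.frame(
  name      = c("steady", "rising"),
  mean_rank = c(mean(steady_name), mean(rising_name)),
  sd_rank   = c(sd(steady_name), sd(rising_name))
)
print(summary_table)
```

The steady name has the better mean rank but the smaller spread, while the rising name has a standard deviation of 2, which is what marks it as going through a trend.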
We use this ‘change in ranking per year’ metric because summarizing each name by its rankings across all years would not be enough: a name with several small, quick trends and a name with one large change can occupy the same ranks for the same number of years, and so have the same average and standard deviation of rank, even though they do not represent the same level of trend.
This metric should outline the trend factors of names more accurately, and it is how we will define trend: through a higher standard deviation, or volatility, of year-to-year rank changes.
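A small example shows why the year-to-year changes carry information that the ranks alone do not. This is a base-R sketch with invented rank sequences: both names spend three years at rank 1 and three at rank 5, so their rank mean and standard deviation are identical.

```r
# Two hypothetical names with the same set of ranks in a different order.
choppy    <- c(1, 5, 1, 5, 1, 5)  # flips back and forth every year
one_shift <- c(1, 1, 1, 5, 5, 5)  # changes once and then stays put

# Summaries of the ranks themselves cannot tell the two apart.
rank_sd <- c(choppy = sd(choppy), one_shift = sd(one_shift))

# The standard deviation of the year-to-year changes can.
change_sd <- c(choppy = sd(diff(choppy)), one_shift = sd(diff(one_shift)))

print(rank_sd)
print(change_sd)
```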
I will create a new data set where the entity is the name and each year is an attribute, so that we can look at the rank of each name, per gender, among all of the names used in each year. If a name is not present in more than 10 years, we will exclude it from the dataset.
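As an illustration of what a per-year ranking looks like, here is a minimal base-R sketch on a toy table. It assumes ranks are meant to be computed separately within each year; `dense_rank_desc` is a hypothetical helper that mimics a descending dense rank, and the counts are made up.

```r
# Toy long-format table: three names observed in two years (made-up counts).
toy <- data.frame(
  year   = c(1910, 1910, 1910, 1911, 1911, 1911),
  name   = c("John", "James", "Abner", "John", "James", "Abner"),
  number = c(500, 300, 20, 450, 470, 20)
)

# Descending dense rank: the largest count gets rank 1, ties share a rank,
# and no rank numbers are skipped.
dense_rank_desc <- function(x) match(x, sort(unique(x), decreasing = TRUE))

# ave() applies the ranking separately within each year.
toy$rank <- ave(toy$number, toy$year, FUN = dense_rank_desc)
print(toy)
```

Ranking within each year means that rank 1 always refers to that year's most popular name, regardless of how many babies were born that year.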
I have one last point before we continue. We know that sqrt(2) - sqrt(1) is greater than sqrt(10000) - sqrt(9999). For this reason we will not track the change in rank but the change in the square root of rank, so that moves near the top of the rankings count for MORE in our trend index.
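The size of this effect is easy to check. In this base-R sketch, `sqrt_rank_change` is a hypothetical helper that computes the change in the square root of rank for a single move.

```r
# Change in sqrt(rank) when a name moves from one rank to another.
sqrt_rank_change <- function(from_rank, to_rank) sqrt(to_rank) - sqrt(from_rank)

top_move  <- sqrt_rank_change(1, 2)         # one place, at the top
tail_move <- sqrt_rank_change(9999, 10000)  # one place, deep in the tail

print(c(top_move = top_move, tail_move = tail_move))
print(top_move / tail_move)  # how much more the top move counts
```

Both moves are a single place in rank, but the move at the top of the rankings contributes roughly 0.41 while the move in the tail contributes about 0.005.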
We also may find different results in trendiness compared to data scientist David Taylor since our trend metric is likely different. I am taking volatility as the first measure, and popularity as a second degree measure, where he likely took popularity as a higher weight when calculating his metric.
#get all male data
all_males <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number))
#get all female data
all_females <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number))
#get how many years each name was issued
names_male_year_count <- names_df %>%
filter(gender == "M") %>%
arrange(year, desc(number)) %>%
group_by(name) %>%
summarize(years_with_name = n()) %>%
ungroup()
#get how many years each name was issued
names_female_year_count <- names_df %>%
filter(gender == "F") %>%
arrange(year, desc(number)) %>%
group_by(name) %>%
summarize(years_with_name = n()) %>%
ungroup()
names__male <-
names_male_year_count %>% inner_join(all_males, by='name') %>%
filter(years_with_name > 10)
names__female <-
names_female_year_count %>% inner_join(all_females, by='name') %>%
filter(years_with_name > 10)
rank_male_spread <- names__male %>%
mutate(rank = dense_rank(desc(number))) %>%
select(name, year, rank) %>%
tidyr::spread(year, rank)
rank_female_spread <- names__female %>%
mutate(rank = dense_rank(desc(number))) %>%
select(name, year, rank) %>%
tidyr::spread(year, rank)
#encode NA's
rank_male_spread[is.na(rank_male_spread)] <- 10000
rank_female_spread[is.na(rank_female_spread)] <- 10000
Our current data frame shows the rank of each name in each year for male and female babies. We encoded every NA as rank 10000 so that we can subtract these values. This lets us track the change in rank when a name goes unused in a year or begins to be used. For consecutive unused years the change is sqrt(10000) - sqrt(10000) = 0, so these years only decrease our standard deviation and do not add a change in rank, which is good for our trend metric.
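Here is how the placeholder behaves for a single name, as a base-R sketch. The rank sequence is invented: the name is absent for two years, then appears at rank 400 and climbs to rank 100.

```r
# Ranks for one hypothetical name; NA means the name was not used that year.
ranks <- c(NA, NA, 400, 100)

# Encode missing years with the placeholder rank of 10000.
ranks[is.na(ranks)] <- 10000

# Year-to-year change in the square root of rank.
sqrt_changes <- diff(sqrt(ranks))
print(sqrt_changes)  # 0, then 20 - 100 = -80, then 10 - 20 = -10
```

The two absent years produce a change of exactly 0, while the year the name first appears produces a large jump.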
I will use matrices to subtract each rank from the year before it.
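The shifting trick can be seen on a tiny matrix first. This base-R sketch uses made-up ranks (perfect squares, so the square roots are whole numbers) with one row per name and one column per year.

```r
# Rows are names, columns are years (made-up ranks).
rank_matrix <- matrix(
  c(1, 4, 9,
    9, 4, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("A", "B"), c("1910", "1911", "1912"))
)

# Drop the first column to get "this year", the last to get "the year before".
later   <- rank_matrix[, -1, drop = FALSE]
earlier <- rank_matrix[, -ncol(rank_matrix), drop = FALSE]

# Element-wise difference of square roots: one column per pair of years.
changes <- sqrt(later) - sqrt(earlier)
print(changes)
```

Name A falls by one square-root unit each year and name B rises by one, and the result has one column fewer than the original matrix.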
matrix_1 <- rank_male_spread %>%
select(-name) %>%
as.matrix() %>%
.[,-1]
matrix_2 <- rank_male_spread %>%
select(-name) %>%
as.matrix() %>%
.[,-ncol(.)]
# This is where we subtract the square roots of consecutive ranks, which makes a change near the top of the rankings worth more than a change of the same size further down.
diff_rank_male <- (sqrt(matrix_1) - sqrt(matrix_2)) %>%
magrittr::set_colnames(NULL) %>%
as_data_frame() %>%
mutate(name = rank_male_spread$name)
diff_rank_male$change <- rowSums( diff_rank_male[,1:106] )
diff_rank_male$avg_change <- rowMeans( diff_rank_male[,1:106] )
sd_rank_males <-suppressWarnings( transform(diff_rank_male, Trend_Metric=apply(diff_rank_male,1, sd, na.rm = TRUE)) )
trendm <- sd_rank_males %>%
select(name, avg_change, Trend_Metric) %>%
arrange(desc(Trend_Metric))
##female matrices
matrix_1f <- rank_female_spread %>%
select(-name) %>%
as.matrix() %>%
.[,-1]
matrix_2f <- rank_female_spread %>%
select(-name) %>%
as.matrix() %>%
.[,-ncol(.)]
diff_rank_female <- (sqrt(matrix_1f) - sqrt(matrix_2f)) %>%
magrittr::set_colnames(NULL) %>%
as_data_frame() %>%
mutate(name = rank_female_spread$name)
diff_rank_female$change <- rowSums( diff_rank_female[,1:106] )
diff_rank_female$avg_change <- rowMeans( diff_rank_female[,1:106] )
sd_rank_females <-suppressWarnings( transform(diff_rank_female, Trend_Metric=apply(diff_rank_female,1, sd, na.rm = TRUE)) )
trendsm <- sd_rank_males %>%
select(name, avg_change, Trend_Metric) %>%
arrange(desc(Trend_Metric))
trendsf <- sd_rank_females %>%
select(name, avg_change, Trend_Metric) %>%
arrange(desc(Trend_Metric))
head(trendsm)
## name avg_change Trend_Metric
## 1 Noah -0.5749504 6.228043
## 2 Liam -0.5663917 5.910047
## 3 Ethan -0.4998997 5.759774
## 4 Mason -0.5230817 5.481326
## 5 Jacob -0.5012225 5.326577
## 6 Michael -0.4667974 5.208330
head(trendsf)
## name avg_change Trend_Metric
## 1 Olivia -0.6460929 6.749955
## 2 Ava -0.6040010 6.544934
## 3 Mia -0.5844095 6.248015
## 4 Sophia -0.5896020 6.233287
## 5 Isabella -0.5667308 6.095240
## 6 Mary 0.5746583 6.033457
We can now look at the ‘trendiest’, most volatile male and female names. The highest-ranking male names are Noah, Liam, and Ethan. The highest-ranking female names are Olivia, Ava, and Mia.
Noah <- names__male %>%
mutate(rank = dense_rank(desc(number))) %>%
select(name, year, rank) %>%
filter(name == 'Noah')
Noah %>%
ggplot(aes(x=year, y=rank, color=name)) +
geom_line() +
ylim(0,10500) +
theme(legend.position="none") +
labs(title="Ranking of 'Noah' over time",
x = "Year",
y = "Ranking")
Olivia <- names__female %>%
mutate(rank = dense_rank(desc(number))) %>%
select(name, year, rank) %>%
filter(name == 'Olivia')
Linda <- names__female %>%
mutate(rank = dense_rank(desc(number))) %>%
select(name, year, rank) %>%
filter(name == 'Linda')
ggplot() +
geom_line(data=Olivia, aes(x=year, y=rank), color="blue") +
geom_line(data=Linda, aes(x=year, y=rank), color='red') +
ylim(0,10500) +
labs(title="Ranking of 'Olivia' and 'Linda' over time",
x = "Year",
y = "Ranking")
Olivia is the blue line, and Linda is the red line.
Above we have graphed the ‘trendiest’ names according to our calculated metric. Remember that ‘trendiest’ here means most volatile, not most popular, although popularity is taken into account when two names have similar changes in rank, because of the sqrt() applied in each change-in-rank calculation. We are looking at name FADS.
We must also remember that rank #1 is the best position but sits at the bottom of the graph, nearest the x-axis.
I also graphed Linda against Olivia to view the difference. I think the reason Olivia is ranked so much higher than Linda (even though Linda was previously considered the trendiest name) is that Olivia is more volatile on average: Linda had much slower rates of change from 1947 to around 1963, diminishing its trendiness value, whereas Olivia had rapid rank shifts. Linda was ranked 46 out of 8,294 in our metrics.
Now that we have the standard deviations as our trend metric, we move on to hypothesis testing for our question: do fad names occur more in males or females?
Null Hypothesis: There is no significant difference in the ‘trend metric’ between male and female baby names.
Alternative Hypothesis: There is a significant difference in the ‘trend metric’ between male and female baby names.
I want to run a t-test on the two data sets. First, I will plot the distributions.
# Density plot
ggplot(sd_rank_males) + geom_density(aes(x = Trend_Metric), bw= .3)
ggplot(sd_rank_females) + geom_density(aes(x = Trend_Metric), bw = .3)
## Warning: Removed 1 rows containing non-finite values (stat_density).
We cannot see much in the density plots, though we can see that the mode of the female trend metric is higher than the mode of the male trend metric. We will have to run our t-test on the means to see whether they are significantly different.
avector <- as.vector(sd_rank_males['Trend_Metric'])
class(avector)
## [1] "data.frame"
avector2 <- as.vector(sd_rank_females['Trend_Metric'])
class(avector2)
## [1] "data.frame"
t.test(avector, avector2)
##
## Welch Two Sample t-test
##
## data: avector and avector2
## t = -52.338, df = 10413, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5227918 -0.4850455
## sample estimates:
## mean of x mean of y
## 0.2229251 0.7268437
We see from our two-sample t-test that the sample estimate of the average trend score is .223 for males and .727 for females. The p-value is essentially 0 (below 2.2e-16), indicating that we can reject our null hypothesis.
Recall that the trend metric is the standard deviation of the year-to-year change in the square root of rank, which measures how quickly rankings change, weighting changes near the top of the rankings more than changes of the same size further down.
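That definition can be condensed into one small function. This is a simplified base-R sketch of the idea, not the exact pipeline used to produce the tables above; `trend_metric`, its `missing_rank` argument, and the two rank sequences are illustrative.

```r
# Simplified trend metric: standard deviation of the year-to-year change
# in sqrt(rank), with unused years encoded as a placeholder rank.
trend_metric <- function(ranks, missing_rank = 10000) {
  ranks[is.na(ranks)] <- missing_rank
  sd(diff(sqrt(ranks)))
}

stable_name <- c(4, 4, 9, 9, 4)              # hovers near the top
fad_name    <- c(NA, 400, 25, 4, 100, 900)   # appears, peaks, and fades

stable_score <- trend_metric(stable_name)
fad_score    <- trend_metric(fad_name)
print(c(stable = stable_score, fad = fad_score))
```

The name that shoots up and falls away scores far higher than the one that stays near the top, which is the behaviour the metric is meant to capture.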
We can say there is a significant difference in the ‘trend metric’ between male and female baby names. We can also say with 95% confidence that the average trend metric for female baby names is between .485 and .523 units higher than the average trend metric for male baby names.
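For readers who want to see the mechanics of the test on something small, here is a base-R sketch of a Welch two-sample t-test. The two score vectors are invented; only the `t.test` call mirrors the analysis above.

```r
# Made-up trend scores for a handful of male and female names.
male_scores   <- c(0.10, 0.20, 0.30, 0.20, 0.25, 0.15)
female_scores <- c(0.60, 0.80, 0.70, 0.75, 0.65, 0.70)

# t.test() runs Welch's test by default (var.equal = FALSE).
result <- t.test(male_scores, female_scores)

print(result$estimate)  # the two sample means
print(result$conf.int)  # 95% interval for mean(male) - mean(female)
print(result$p.value)
```

Because the interval is for the male mean minus the female mean, an interval that lies entirely below zero corresponds to a higher female average.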
In layman’s terms, the average female name is ‘trendier’, or ‘goes through quicker ranking changes’, than the average male name, and the difference is statistically significant.
We have drawn some interesting conclusions from the data that invite further exploration. We have found that, for some reason, female names in the United States tend to change in popularity more rapidly than male names, there are more unique female names, and the proportion of parents giving girls the most popular names is lower than for boys.
We have also learned that over time parents have been choosing from a wider pool of names to name their children, and the most popular names are taking up less of this total market share.
This could be a testament to the ever-growing melting pot of America, which consistently brings together new cultures. Another interesting project (maybe my next one) would be to analyze changes in the heritage and cultures of families naming children, to see whether the change in names stems from the proportion of children being born in the US with respect to their cultural backgrounds. That would allow me to isolate demographics and study their effect on name changes.
It wouldn’t make sense to predict which names will be “popular” or “trendy” next, since many seem to come from current culture, such as US presidents, artists, and popular figures. These are external factors that would be difficult to predict before they happen. But perhaps a third project would be studying the most ‘popular’ people in US culture for any given year and seeing whether their names were popular baby names the next year.
Regardless of the reasons for these changes, we can conclude that US name trend analysis is far from over. We have learned a significant amount through this analysis, and it will be interesting to see how name fads change in the future.
Below are some external links to help the reader:
Dataset used: https://www.kaggle.com/salil007/a-very-extensive-exploratory-analysis-usa-names/data
Article found about “Linda”: https://www.bustle.com/p/linda-is-the-trendiest-baby-name-in-us-history-making-for-a-classic-yet-unexpected-pick-30410
Data Project by David Taylor: http://www.prooffreader.com/2014/07/trendiest-baby-names-in-social-security.html