This is an analysis of the relationship between the median earnings of a college’s graduates and other characteristics of the college. The data was obtained from the College Scorecard dataset produced by the U.S. Department of Education and can be found here. The U.S. Department of Education obtains the data through several sources including “federal reporting from institutions, data on federal financial aid, and tax information.”
In this analysis, I wish to use the R programming language to visually and quantitatively determine the effect several numerical and categorical variables have on the post-graduation median earnings of college students. The reason for inquiry lies in my interest in determining trends regarding the quality of universities in relationship to a variety of economic and academic factors. In determining the effect of certain variables on eventual economic outcome, I seek to better understand the tradeoffs of different college characteristics.
Library imports:
library(dplyr)
library(ggplot2)
library(scales)
library(readr)
library(knitr)
Data import:
df <- read_csv("Most-Recent-Cohorts-All-Data-Elements.csv")
There are 7593 colleges in this dataset with 1777 variables. For this analysis, we are only interested in several defining characteristics (e.g. institution name), categorical factors (e.g. operational status), and quantitative variables (e.g. admission rate).
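The following code selects only the variables relevant to our analysis and retains only colleges that:
- are currently operating (`CURROPER == 1`)
- are located in the 50 states or the District of Columbia (`ST_FIPS <= 56`)
- are public or private nonprofit institutions (`CONTROL != 3` excludes private for-profit colleges)
- predominantly grant bachelor’s degrees (`PREDDEG == 3`)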
college <- df %>%
  filter(CURROPER == 1,
         ST_FIPS <= 56,
         CONTROL != 3,
         PREDDEG == 3) %>%
  select(name = INSTNM, funding = CONTROL, admit = ADM_RATE, med_earnings = MD_EARN_WNE_P10,
         med_fam_inc = MD_FAMINC, NPT4_PUB, NPT4_PRIV)
In order to create visualizations of the data, we need to convert the variables `admit`, `med_earnings`, `med_fam_inc`, `NPT4_PUB`, and `NPT4_PRIV` from the `character` type to the `double` type. The following code also merges the `NPT4_PUB` column and `NPT4_PRIV` column into a single `price` column.
college$admit <- as.double(college$admit)
college$med_earnings <- as.double(college$med_earnings)
college$med_fam_inc <- as.double(college$med_fam_inc)
college$NPT4_PUB <- as.double(college$NPT4_PUB)
college$NPT4_PRIV <- as.double(college$NPT4_PRIV)
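college <- college %>%
  rowwise() %>%
  mutate(price = sum(NPT4_PUB, NPT4_PRIV, na.rm = TRUE)) %>%
  ungroup() %>%  # drop the rowwise grouping so later operations are vectorized
  select(-c(NPT4_PUB, NPT4_PRIV))
# A sum of 0 means both net-price columns were missing, so record it as NA
college["price"][college["price"] == 0] <- NA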
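The same conversion and merge can also be written without `rowwise()`. The following is a minimal sketch on a made-up three-row tibble (the `toy` data and its values are hypothetical, not taken from the Scorecard file), assuming each college reports at most one of the two net-price columns. It uses `across()` to convert several columns at once and `coalesce()` to take whichever net price is present.

```r
library(dplyr)

# Hypothetical toy data: numeric columns read in as character, with "NULL" for missing
toy <- tibble(
  name      = c("Public U", "Private C", "No Price C"),
  admit     = c("0.65", "NULL", "0.40"),
  NPT4_PUB  = c("13435", "NULL", "NULL"),
  NPT4_PRIV = c("NULL", "8862", "NULL")
)

toy_clean <- toy %>%
  # Non-numeric strings such as "NULL" become NA (the coercion warnings are suppressed)
  mutate(across(c(admit, NPT4_PUB, NPT4_PRIV),
                ~ suppressWarnings(as.double(.x)))) %>%
  # coalesce() returns the first non-missing value in each row
  mutate(price = coalesce(NPT4_PUB, NPT4_PRIV)) %>%
  select(-c(NPT4_PUB, NPT4_PRIV))

print(toy_clean)
```

Unlike the sum-then-replace approach, `coalesce()` yields `NA` directly when both columns are missing and would keep a genuine net price of 0, so the two versions are only equivalent under the assumption stated above.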
Here is a small view of what the data now looks like:
kable(head(college))
name | funding | admit | med_earnings | med_fam_inc | price |
---|---|---|---|---|---|
Alabama A & M University | 1 | 0.6538 | 29900 | 21429.0 | 13435 |
University of Alabama at Birmingham | 1 | 0.6043 | 40200 | 33731.0 | 16023 |
Amridge University | 2 | NA | 40100 | 14631.0 | 8862 |
University of Alabama in Huntsville | 1 | 0.8120 | 45600 | 39100.5 | 18661 |
Alabama State University | 1 | 0.4639 | 26700 | 21704.0 | 7400 |
The University of Alabama | 1 | 0.5359 | 42700 | 64600.5 | 20575 |
The variables selected are:
names(college)
## [1] "name" "funding" "admit" "med_earnings"
## [5] "med_fam_inc" "price"
- `funding` refers to the source of funding for the institution, with `1` coding for a public university and `2` coding for a private university
- `admit` is the admission rate of the institution on a scale of 0 to 1
- `med_earnings` represents the median earnings, in 2015 USD, of the institution’s students who are employed 10 years after enrollment
- `med_fam_inc` is the median family income of the institution’s current students in 2015 USD
- `price` indicates the average net price of attendance in USD, accounting for the full costs of attendance and awarded financial aid

The following code stores certain purely cosmetic alterations of the visualizations as the variables `xdollar`, `ydollar`, and `titling` to allow for cleaner-looking code.
xdollar <- c(scale_x_continuous(labels = dollar,
                                breaks = seq(0, 130000, 25000),
                                limits = c(0, NA)))
ydollar <- c(scale_y_continuous(labels = dollar,
                                breaks = seq(0, 130000, 25000),
                                limits = c(0, NA)))
titling <- theme(plot.title = element_text(hjust = 0.5,
                                           face = "bold"),
                 axis.title.x = element_text(face = "bold"),
                 axis.title.y = element_text(face = "bold"))
To analyze the relationship between median earnings and other factors, we would first like to get a preliminary understanding of the distribution of the median earnings of colleges’ graduates. We will create a boxplot of `med_earnings` below:
ggplot(data = college) +
  geom_boxplot(mapping = aes(x = "", y = med_earnings)) +
  labs(title = "Median Earnings of \na College's Graduates",
       x = NULL,
       y = "Median Earnings in USD") +
  ydollar +
  titling
Because a few extreme outliers stretch the scale of the boxplot, it gives a poor picture of where most of the median earnings fall. Let’s take a look at a histogram instead:
ggplot(data = college) +
  geom_histogram(mapping = aes(x = med_earnings)) +
  labs(title = "Median Earnings of a College's Graduates",
       x = "Median Earnings in USD",
       y = "Frequency of Colleges") +
  xdollar +
  titling
A histogram of `med_earnings` gives us a better visualization of the distribution of the median earnings of colleges’ graduates. However, to better understand the effect of different characteristics on economic outcomes, we would like to separate and compare the distributions of `med_earnings` between public and private institutions.
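The outliers that a boxplot draws as individual points can also be listed numerically. The sketch below applies the conventional 1.5 × IQR rule, which `geom_boxplot()` uses by default to place its whiskers, to a short vector of made-up earnings values; the `earnings` vector is hypothetical and only illustrates the calculation.

```r
# Hypothetical earnings values in USD; the last two are deliberately extreme
earnings <- c(26700, 29900, 38000, 40100, 40200, 42700, 45600, 51000, 95000, 120000)

# First and third quartiles, and the interquartile range between them
q <- quantile(earnings, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Conventional boxplot fences: 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are the points a boxplot would flag as outliers
outliers <- earnings[earnings < lower_fence | earnings > upper_fence]
print(outliers)
```

On the real data, the same steps could be applied to `college$med_earnings` after dropping missing values.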
A violin plot combines the compactness of a boxplot with the distributional detail of a histogram. Additionally, I have overlaid a jittered plot of the data points to further aid with visualizing the distribution. Here is a violin plot of `med_earnings` separated by the two values of `funding`:
ggplot(data = college,
       mapping = aes(x = factor(funding),
                     y = med_earnings)) +
  geom_violin() +
  geom_jitter(alpha = 0.15) +
  scale_x_discrete(labels = c("Public", "Private")) +
  labs(title = "Median Earnings of a College's Graduates \nby Source of Funding",
       x = "Funding Source",
       y = "Median Earnings in USD") +
  ydollar +
  titling
Constructing a violin plot does well in illustrating the visible difference in the distributions of `med_earnings` between different kinds of universities. However, we should also verify the difference in distribution computationally. We will conduct a Kolmogorov-Smirnov test to determine if the difference in the distribution of the median earnings of colleges’ graduates between public and private institutions is statistically significant. For this analysis, we will consider any p-value less than `0.05` to be statistically significant.
public <- (college %>% filter(funding == 1))$med_earnings
private <- (college %>% filter(funding == 2))$med_earnings
ks.test(public, private, alternative = "two.sided")
##
## Two-sample Kolmogorov-Smirnov test
##
## data: public and private
## D = 0.085491, p-value = 0.009165
## alternative hypothesis: two-sided
With a p-value of `0.009165`, well below our threshold of `0.05`, we reject the null hypothesis that public and private colleges share the same distribution of the median earnings of their graduates in favor of the alternative hypothesis that the two distributions of `med_earnings` differ.
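For readers unfamiliar with the test, the statistic `D` is the largest vertical gap between the two groups’ empirical cumulative distribution functions. The following sketch computes it by hand and compares it with `ks.test()`. The samples `x` and `y` are simulated stand-ins, not the Scorecard data, so the numbers it prints are not results of this analysis.

```r
set.seed(1)
# Simulated stand-ins for the two groups (not the Scorecard data)
x <- rnorm(200, mean = 40000, sd = 8000)
y <- rnorm(150, mean = 43000, sd = 12000)

# Evaluate both empirical CDFs at every observed value and take the largest gap
pooled <- sort(c(x, y))
d_manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

# Built-in two-sample test, with `alternative` passed by name
res <- ks.test(x, y, alternative = "two.sided")

print(c(manual = d_manual, ks_test = unname(res$statistic)))
```

Passing `alternative` by name matters because unnamed extra arguments to `ks.test()` are absorbed by `...` rather than matched to `alternative`.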
In order to further examine the effect various factors have on the economic outcome of a college’s graduates, it would serve us well to fit a linear model to the data. Constructing a least squares regression line using `med_earnings` and another variable will allow us to quantify the two variables’ relationship and its strength, which might allow us to establish which variables are better predictors of good economic outcomes than others.
We will utilize scatterplots to visibly observe the relationships between variables. The code below stores certain functions as variables to allow for neater code.
- `point_theme` affects nothing more than some simple cosmetics of the created plots
- `scatter` stores the bulk of the code that declares the variables of interest to be plotted and the type of plot to be constructed, as well as some cosmetic alterations

point_theme <- c(scale_x_continuous(labels = percent),
                 ydollar,
                 scale_color_manual(labels = c("Public", "Private"),
                                    values = c("#F8766D", "#00BFC4")))
scatter <- ggplot(data = college,
                  mapping = aes(x = admit,
                                y = med_earnings)) +
  geom_point(mapping = aes(color = factor(funding)),
             size = 2) +
  point_theme +
  titling
We would first like to observe the effect that a college’s admission rate has on the eventual median earnings of its graduates; we will construct a scatterplot to visualize this relationship:
scatter +
  labs(title = "Median Earnings of Graduates against Admission Rate of Colleges",
       x = "Admission Rate",
       y = "Median Earnings in USD",
       color = "Source of Funding")
By constructing a scatterplot of `med_earnings` against `admit`, we can observe the negative relationship between the median earnings of colleges’ graduates and the colleges’ admission rates. A lower admission rate is correlated with higher median earnings. To further verify the negative relationship between `med_earnings` and `admit`, we will fit a linear model to the scatterplot:
scatter +
  geom_smooth(method = "lm",
              color = "black") +
  labs(title = "Median Earnings of Graduates against Admission Rate of Colleges",
       x = "Admission Rate",
       y = "Median Earnings in USD",
       color = "Source of Funding")
fit1 <- lm(data = college,
           med_earnings ~ admit)
summary(fit1)
##
## Call:
## lm(formula = med_earnings ~ admit, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27624 -6376 -755 5106 79688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53340.8 952.8 55.98 <2e-16 ***
## admit -15831.6 1396.5 -11.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10390 on 1455 degrees of freedom
## (339 observations deleted due to missingness)
## Multiple R-squared: 0.08116, Adjusted R-squared: 0.08053
## F-statistic: 128.5 on 1 and 1455 DF, p-value: < 2.2e-16
The above is a summary of the least squares regression line fit to the `med_earnings` and `admit` scatterplot. Most of the information given is not pertinent to the scope of our analysis. One should note the first values given for the `(Intercept)` and `admit` coefficients and the `Multiple R-squared` value.
The coefficients give us the equation for the least squares regression line: \(med\_earnings = -15831.6 * admit + 53340.8\) where `admit` is on a scale of 0 to 1. The product `-15831.6 * admit` means that for every 1 percentage point increase in the admission rate, the median earnings for a student 10 years after enrollment is expected to decrease by $158.32. The y-intercept of `53340.8` implies that, for an institution with a 0% admission rate, the expected median earnings would be $53,340.80, although this is an extrapolation beyond any realistic admission rate.
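The arithmetic behind this interpretation can be checked directly. The sketch below copies the two coefficients from the `summary(fit1)` output; the variable names and the three admission rates are chosen only for illustration.

```r
# Coefficients copied from the summary(fit1) output above
intercept <- 53340.8
slope     <- -15831.6

# admit is on a 0-1 scale, so one percentage point is 0.01
change_per_point <- slope * 0.01

# Predicted median earnings at a few admission rates (chosen for illustration)
admit_rates <- c(0.25, 0.50, 0.75)
predicted   <- intercept + slope * admit_rates

print(change_per_point)
print(data.frame(admit = admit_rates, predicted = predicted))
```

With the fitted model object available, `predict(fit1, newdata = data.frame(admit = admit_rates))` should give the same predictions up to the rounding of the printed coefficients.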
The `Multiple R-squared` value of `0.08116` means that 8.12% of the variation in median earnings can be explained by the linear relationship between median earnings and admission rate. This is a rather low value for \(R^2\). It is possible that a linear model is not the best fit for the data. Let’s plot `med_earnings` against `admit` but separate public and private colleges:
scatter +
  labs(title = "Median Earnings of Graduates against Admission Rate of Colleges \nSeparated by Source of Funding",
       x = "Admission Rate",
       y = "Median Earnings in USD",
       color = "Source of Funding") +
  facet_wrap(~ funding)
By separating the two different types of institutions, we can see that private universities with low admission rates follow a linear pattern less closely than institutions with high admission rates. This could explain the low \(R^2\) value.
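One way to check this visual impression numerically is to fit the linear model separately for each funding type and compare the resulting \(R^2\) values. The sketch below does this on simulated data in which one group is linear and the other curves at low admission rates. The data frame `toy`, its values, and the name `r2_by_group` are hypothetical and are not results of this analysis.

```r
set.seed(42)
n <- 100
toy <- data.frame(
  funding = rep(c(1, 2), each = n),
  admit   = runif(2 * n, min = 0.1, max = 1)
)
# Group 1 follows a straight line; group 2 curves upward at low admission rates
toy$med_earnings <- ifelse(
  toy$funding == 1,
  50000 - 10000 * toy$admit,
  35000 + 4000 / toy$admit
) + rnorm(2 * n, sd = 3000)

# Fit med_earnings ~ admit within each funding group and extract R-squared
r2_by_group <- sapply(split(toy, toy$funding), function(d) {
  summary(lm(med_earnings ~ admit, data = d))$r.squared
})

print(round(r2_by_group, 3))
```

Applied to `college`, the same `split()` and `sapply()` pattern would give one \(R^2\) per value of `funding`.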
Let’s explore other factors to see which variables might better predict median earnings.
We would like to observe the relationship between the median family income of current students at these institutions and the eventual median earnings of their graduates. To do so, we will plot `med_earnings` against `med_fam_inc` and fit a linear model to the result:
ggplot(data = college,
       mapping = aes(x = med_fam_inc,
                     y = med_earnings)) +
  geom_point(size = 2,
             color = "skyblue2") +
  geom_smooth(method = "lm",
              color = "black") +
  labs(title = "Median Earnings of Graduates against Median Family Income of Current Students",
       x = "Median Family Income in USD",
       y = "Median Earnings in USD") +
  xdollar +
  ydollar +
  titling
fit2 <- lm(data = college,
           med_earnings ~ med_fam_inc)
summary(fit2)
##
## Call:
## lm(formula = med_earnings ~ med_fam_inc, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32127 -6140 -2113 3923 88726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.134e+04 6.767e+02 46.31 <2e-16 ***
## med_fam_inc 2.344e-01 1.257e-02 18.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10780 on 1620 degrees of freedom
## (174 observations deleted due to missingness)
## Multiple R-squared: 0.1768, Adjusted R-squared: 0.1763
## F-statistic: 348 on 1 and 1620 DF, p-value: < 2.2e-16
Again, note the coefficients for `(Intercept)` and `med_fam_inc` and the value of `Multiple R-squared`. The coefficients give us the equation of the least squares regression line for this plot: \(med\_earnings = 0.2344 * med\_fam\_inc + 31340\). The product `0.2344 * med_fam_inc` indicates that for every increase of $1000 in median family income, the median earnings is expected to increase by $234.40. The y-intercept of `31340` indicates that, for a university with students with a median family income of $0, the expected median earnings would be $31,340.
The `Multiple R-squared` value of `0.1768` means that 17.68% of the variation in median earnings can be explained by the linear relationship between median earnings and median family income. This is a higher \(R^2\) value than the \(R^2\) computed for the `med_earnings` vs. `admit` plot, indicating that the median family income of a college’s students is a better predictor for the median earnings of that institution’s graduates than the admission rate of that institution.
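To make the "variation explained" reading of \(R^2\) concrete, the sketch below computes it by hand as one minus the ratio of residual to total sum of squares and compares it with the value reported by `summary()`. The data are simulated placeholders for `med_fam_inc` and `med_earnings`, so the printed value is not a result of this analysis.

```r
set.seed(7)
# Simulated predictor and response (placeholders for med_fam_inc and med_earnings)
x <- runif(50, min = 10000, max = 100000)
y <- 30000 + 0.25 * x + rnorm(50, sd = 10000)

fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # variation left unexplained by the line
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean
r2_manual <- 1 - ss_res / ss_tot

print(c(manual = r2_manual, from_summary = summary(fit)$r.squared))
```

For a single-predictor model like this one, the same number is also the squared correlation, `cor(x, y)^2`.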
Next, let’s examine the relationship between median earnings and the net price of attendance.
We would like to determine if the net price of attending an institution is a better predictor of the median earnings of the graduates of that institution than median family income or admission rate. To visualize the relationship, we will plot `med_earnings` against `price` and apply a linear model to the data:
ggplot(data = college,
       mapping = aes(x = price,
                     y = med_earnings)) +
  geom_point(size = 2,
             color = "thistle3") +
  geom_smooth(method = "lm",
              color = "black") +
  labs(title = "Median Earnings of Graduates against Net Price of Attendance",
       x = "Net Price in USD",
       y = "Median Earnings in USD") +
  xdollar +
  ydollar +
  titling
fit3 <- lm(data = college,
           med_earnings ~ price)
summary(fit3)
##
## Call:
## lm(formula = med_earnings ~ price, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30182 -5703 -947 4629 74205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.203e+04 7.599e+02 42.15 <2e-16 ***
## price 5.269e-01 3.634e-02 14.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10330 on 1550 degrees of freedom
## (244 observations deleted due to missingness)
## Multiple R-squared: 0.1194, Adjusted R-squared: 0.1189
## F-statistic: 210.2 on 1 and 1550 DF, p-value: < 2.2e-16
Once more, note the coefficients for `(Intercept)` and `price` and the value of `Multiple R-squared`. The equation for the least squares regression line is: \(med\_earnings = 0.5269 * price + 32030\). The product `0.5269 * price` suggests that for every increase of $1000 in net price, the median earnings is expected to increase by $526.90. The y-intercept of `32030` indicates that, for an institution with a net price of $0, the expected median earnings would be $32,030.
The `Multiple R-squared` value of `0.1194` means that 11.94% of the variation in median earnings can be explained by the linear relationship between median earnings and net price of attendance. Although this \(R^2\) value is higher than the one computed for the `med_earnings` vs. `admit` plot, it is lower than the \(R^2\) value computed for `med_earnings` vs. `med_fam_inc`. This suggests that the net price of an institution is a better predictor of the median earnings of that institution’s graduates than the admission rate of that college. However, net price is not as good a predictor as the median family income of an institution’s students.
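One caveat when ranking predictors this way: each `summary()` above reports a different number of observations deleted due to missingness (339, 174, and 244), so the three \(R^2\) values come from different subsets of colleges. A possible like-for-like check, sketched below, is to keep only the rows where all four variables are present before fitting. The data are simulated and the helper name `r2_for` is hypothetical, so the printed values are not results of this analysis.

```r
set.seed(3)
n <- 300
toy <- data.frame(
  admit       = runif(n, 0.1, 1),
  med_fam_inc = runif(n, 10000, 100000),
  price       = runif(n, 5000, 40000)
)
toy$med_earnings <- 30000 - 8000 * toy$admit + 0.2 * toy$med_fam_inc +
  0.3 * toy$price + rnorm(n, sd = 8000)

# Knock out some values so each predictor is missing for different rows
toy$admit[sample(n, 60)] <- NA
toy$price[sample(n, 40)] <- NA

# Keep only rows with every variable present, so all models use the same colleges
complete <- toy[complete.cases(toy), ]

# Fit a one-predictor model and return its R-squared
r2_for <- function(predictor, data) {
  fit <- lm(reformulate(predictor, response = "med_earnings"), data = data)
  summary(fit)$r.squared
}

predictors <- c("admit", "med_fam_inc", "price")
r2 <- sapply(predictors, r2_for, data = complete)

print(nrow(complete))
print(round(sort(r2, decreasing = TRUE), 3))
```

On the real data, `complete` would be built from `college` in the same way, restricted to the four columns used in the models.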
I began this analysis introducing the goals I wanted to achieve through the use of the R programming language. Through the construction of several visualizations, we have examined the effect of several numerical and categorical variables on the economic outcome of college graduates: the median earnings 10 years after enrollment.
Since the \(R^2\) value of the relationship between the median earnings of a college’s graduates and the median family income of the college’s students is the highest among the three variables examined (i.e. `admit`, `med_fam_inc`, and `price`), median family income is the best of the three predictors of the potential economic outcome of an institution’s students.
I look forward to conducting a further analysis of the effect of more variables on the ultimate economic outcome of students in post-secondary education.